Posted by bhendrickson

A few days after new content shows up in Google, it will sometimes flicker out of the SERPs for a few hours.  Apparently, this is common knowledge to some SEOs.  This is not common knowledge to programmers like me, and I nearly made a tin foil hat in preparation for the googlicopters when I learned the project (Linkscape) that I’d worked on for months had disappeared from all SERPs on the Friday evening of the week it launched.  Fortunately, Rand responded to the internal email thread and, just as he predicted, 12 hours later, Linkscape was back in Google’s SERPs like nothing had happened.

This raised the question, "why would the engines drop new content out of the search results for a few hours after it has been in the results for a few days?"  I don’t know, but let me make an educated guess: sometimes there is a brief gap between when pages fall out of a smaller (but quicker to build) index and when a larger (but slower to build) index finishes being rebuilt with those pages in it.

Not having worked at Google, I have no solid evidence they have multiple indices, but let me make the case that they probably do.  Linkscape currently takes over a month to move something from being crawled to appearing in the results.  There is some low-hanging fruit to reduce this to more like a couple of weeks, but for the foreseeable future we aren’t going to have turnaround time on the order of hours like the engines do, because our index is large enough that it just takes a lot of computers and a lot of hours to compute it.  So… how do the engines get around this issue?  They could make their indices support random inserts, but this would make them more complex and less efficient.  The other option is to have two indices.  That way they can have one index that is small and quick to update, and another that is large and slow to update.  The small index would try to hold the difference between what has been crawled and what is in the big index.  At query time, they would then need to check both.  Of course they could have more sizes of indices than just two, but that doesn’t affect the basic point that Google presumably has more than one.
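To make the two-index idea concrete, here is a minimal Python sketch.  It is only a toy model of the guess above, not a description of how Google (or Linkscape) actually works; the class name, method names and example URLs are all made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class TwoTierIndex:
    """Toy model: a big, slowly rebuilt index plus a small, fresh one.

    Hypothetical structure for illustration only. Each index maps a
    search term to the set of URLs containing it.
    """
    large: dict = field(default_factory=dict)
    small: dict = field(default_factory=dict)

    def add_fresh(self, url, terms):
        # Newly crawled pages go only into the small index, which is
        # cheap to update.
        for term in terms:
            self.small.setdefault(term, set()).add(url)

    def search(self, term):
        # At query time both indices are checked and the results merged.
        return self.large.get(term, set()) | self.small.get(term, set())


index = TwoTierIndex(large={"seo": {"old.example/page"}})
index.add_fresh("new.example/linkscape", ["seo", "links"])
results = index.search("seo")
```

Here `results` contains both the old page from the large index and the freshly crawled page from the small one, which is the whole point of checking both at query time.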

Google could remove a page from the small index only after it is in the big index, but then it would be in both indices for a while until the small index was rebuilt.  This overlap means the small index is larger than necessary, so it can’t rebuild as quickly as possible, and so won’t be as fresh as possible.  So perhaps they try to time it perfectly so there isn’t any overlap and there isn’t any gap.  The problem is that as they crawl faster, grow their indices, add complexity to their indexing or let the intern check in his summer project, it is easy for a small gap to form.  So maybe it is just hard to ensure that there is never any gap unless one is willing to waste resources by letting the indices overlap.
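The overlap-versus-gap trade-off comes down to comparing two timestamps for a given page: when the small index stops carrying it and when the large index starts carrying it.  A small sketch, assuming times are just numbers of hours (the function name and the numbers are hypothetical):

```python
def classify_handoff(small_drops_at, large_adds_at):
    """Describe the handoff of one page between the two indices.

    Both arguments are times in hours on the same clock.
    Returns a (kind, hours) tuple where kind is "gap", "overlap"
    or "perfect".
    """
    diff = large_adds_at - small_drops_at
    if diff > 0:
        # The page left the small index before the large one had it.
        return ("gap", diff)
    if diff < 0:
        # The page sat in both indices for a while (wasted space).
        return ("overlap", -diff)
    return ("perfect", 0)


late_large_index = classify_handoff(small_drops_at=0, large_adds_at=12)
cautious_schedule = classify_handoff(small_drops_at=6, large_adds_at=0)
```

A cautious schedule always lands in the "overlap" case; an aggressive one aims for "perfect" and occasionally slips into "gap".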

Chas (the developer who sits next to me) manages some indices with a large+small model that, for the record, never has gaps.  And he contributes the fact that his large index starts rebuilding at midnight on Friday because load is lighter on the weekend.  However, his computers are set to GMT, which means it starts Friday 5PM PST.  Well, it was a bit after 5PM on a Friday when Jane first noticed Linkscape had dropped from Google’s SERPs (I received her email at 5:28PM).  Google has fewer CPU constraints than Chas, but they do have bandwidth constraints, and bandwidth is what’s needed to push new indices out to lots of computers.

So the theory is that Google had two indices that were supposed to go live in the first seconds of the weekend GMT.  First was the new large index that added our page.  Second was the new small index that dropped our page.  Only the small index was on time.
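Played out hour by hour, the theory looks like the sketch below.  The numbers are assumptions chosen to match the story (both indices scheduled for hour 0, the large one arriving 12 hours late), and the function name is made up.

```python
def page_visible(hour, small_drop_hour, large_add_hour):
    """True if either index serves the page at the given hour."""
    in_small = hour < small_drop_hour   # still in the old small index
    in_large = hour >= large_add_hour   # already in the new large index
    return in_small or in_large


# Assumed schedule: the small index drops the page on time at hour 0,
# but the large index that adds it only goes live at hour 12.
missing_hours = [h for h in range(-2, 15) if not page_visible(h, 0, 12)]
```

Under these assumptions the page is missing from hour 0 through hour 11 and reappears at hour 12, which is roughly the 12-hour flicker we saw.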

Or, at least, that is the best theory I can come up with.  What do you guys think?

p.s. from Rand – This post is Ben Hendrickson’s first on SEOmoz. He’s been with us nearly a year, working on Linkscape, and before that with Microsoft & his own technology startup project. I’m thrilled to have him contributing to the blog. Hopefully, he’ll get a photo up sometime soon 🙂
