Posted by Nick Gerner
After watching Nate Buggia speak about Live’s Webmaster Tools a few weeks ago, I was struck by his statistic about the number of domains on the web. He suggested that there are 78 million domains. There’s certainly room for disagreement about this number (don’t forget Google has one trillion web pages 😉), but I bet he’s in the right ballpark. If that’s right, could we manually review all of them?
Sure, 78 million domains is big. But not that big. A few months ago, while investigating spam, Danny reviewed 500 fairly randomly chosen domains in a matter of hours. And I think he did a great job of it, too. That’s a good foundation, but could we scale that up and review millions of domains?
I see a few challenges here. The biggest is probably just getting hold of Live’s list of 78 million domains. Next, you’re going to need a lot of manual reviewers. But if you’re Live (or some other search engine), you’ve already got that list and a large contract labor force. Too bad for the rest of us.
I suppose if you’re clever you might be able to do this through Alexa’s Web Information Service and Amazon Mechanical Turk. Taking a look at the Mechanical Turk pricing, it looks like you could pay one cent for every domain reviewed (or maybe for each block of a few dozen domains). So we’re probably talking about tens or hundreds of thousands of dollars. But that’s pocket change for Google. And Google has plenty of remote offices with lots of search quality engineers. In fact, they say, "Google makes use of evaluators in many countries and languages. These evaluators are carefully trained and are asked to evaluate the quality of search results in several different ways."
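To put rough numbers on that Mechanical Turk estimate, here is a quick back-of-the-envelope sketch. The function name and the block size of 36 domains (my stand-in for "a few dozen") are made up for illustration, and it counts only the one-cent reward per task, ignoring any fees Amazon adds on top.

```python
def review_cost_cents(domains: int, cents_per_task: int, domains_per_task: int = 1) -> int:
    """Total reward cost in cents, rounding up to whole tasks.

    Hypothetical helper: only the per-task reward is counted, no platform fees.
    """
    tasks = (domains + domains_per_task - 1) // domains_per_task  # ceiling division
    return tasks * cents_per_task


DOMAINS = 78_000_000  # the figure quoted in the talk

# One cent per domain versus one cent per block of 36 domains (assumed block size).
per_domain = review_cost_cents(DOMAINS, cents_per_task=1)
per_block = review_cost_cents(DOMAINS, cents_per_task=1, domains_per_task=36)

print(f"one cent per domain:      ${per_domain / 100:,.2f}")
print(f"one cent per block of 36: ${per_block / 100:,.2f}")
```

Under those assumptions, the per-domain price comes to about $780,000 and the per-block price to a bit over $21,000, which is where the "tens or hundreds of thousands of dollars" range comes from.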
So let’s say a single person can review 1000 domains in a single day. And let’s say you’ve got 1000 reviewers working on this problem. That’s a million domains a day, which tells me that 78 days later you’ve got all the relevant domains on the internet reviewed. Those 1000 reviewers are less than 10% of Google’s workforce and less than 2% of Microsoft’s workforce. Of course you could do it with fewer if you pre-filtered some of those domains, or took longer than three months to do it. If Google, Yahoo!, and Live haven’t already done this… well, I can’t imagine that they haven’t done at least part of this by now.
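The arithmetic behind that 78-day figure is simple enough to play with. This is just a sketch; the function name is mine, and the per-reviewer rate and headcount are the same guesses as in the paragraph above.

```python
def days_to_review(domains: int, reviewers: int, domains_per_reviewer_per_day: int) -> int:
    """Whole days needed to review every domain once, rounding up."""
    daily_capacity = reviewers * domains_per_reviewer_per_day
    return (domains + daily_capacity - 1) // daily_capacity  # ceiling division


DOMAINS = 78_000_000

# 1000 reviewers at 1000 domains each per day: the scenario from the post.
print(days_to_review(DOMAINS, reviewers=1000, domains_per_reviewer_per_day=1000))

# Half the reviewers: the same job simply takes twice as long.
print(days_to_review(DOMAINS, reviewers=500, domains_per_reviewer_per_day=1000))
```

The first call gives 78 days and the second gives 156, so halving the reviewer pool still gets the whole list done in well under half a year.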