Posted by Nick Gerner

I’ve been reading some of the papers from AIRWeb ’08, the recent web spam conference, and I came across one called "A Few Bad Votes Too Many? Towards Robust Ranking in Social Media."  What was really interesting to me is that the authors quantify the effects of vote spam in "thumbing" systems (like the one we have here at SEOmoz), and that they rank the usefulness of different features in ranking questions and answers.  The results suggest that an average of just six spammer-controlled votes can halve search relevance:

[Figure: Question answering performance with spammers]

The "baselines" here (the bottom two lines) are similar to what you get when you sort answers by votes.  So think about that when you get a highly thumbed answer, "Buy Mortgage Loans Cheap," as the number one result for the query "how to bake apple pie."  The "GBrank" lines take into account a few more features, and the best performing system (represented by the red line above) tries to take into account the fact that thumbs can be spammed.
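To make the vote-sorted baseline concrete, here is a minimal Python sketch of why a handful of spam votes is enough to flip the top result.  The answers and vote counts are made up for illustration (they are not data from the paper), and `Answer` and `rank_by_votes` are hypothetical names, not the paper's baseline implementation.

```python
from dataclasses import dataclass


@dataclass
class Answer:
    text: str
    thumbs_up: int = 0
    thumbs_down: int = 0


def rank_by_votes(answers):
    """Sort answers by net votes (thumbs up minus thumbs down), best first."""
    return sorted(answers, key=lambda a: a.thumbs_up - a.thumbs_down, reverse=True)


# Hypothetical answers to the query "how to bake apple pie".
answers = [
    Answer("Peel the apples, make the crust, bake until golden.", thumbs_up=5, thumbs_down=1),
    Answer("Use tart apples and pre-cook the filling.", thumbs_up=3),
    Answer("Buy Mortgage Loans Cheap", thumbs_up=0, thumbs_down=1),
]

top_before = rank_by_votes(answers)[0].text

# Six spammer-controlled accounts each give the spam answer a thumbs up.
answers[2].thumbs_up += 6

top_after = rank_by_votes(answers)[0].text

print(top_before)
print(top_after)
```

In this toy setup the on-topic answer leads with a net score of 4 until the six spurious thumbs lift the spam answer to 5.  A ranker that only looks at votes has no way to notice that the new top answer has nothing to do with the query.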

No surprise here: spammers can alter search results.  But how bad are thumbs compared to other features in the presence of spam?  The researchers conveniently listed the features they used and their relative importance.  I’ve reproduced it below:

Without spammers, here are the top 10 features useful for ranking questions and answers in response to user queries.  I’ve bolded community-based features.

  • similarity between query and question
  • number of resolved questions for answerer
  • length ratio between query and answer
  • number of thumbs down votes
  • number of stars for answerer
  • number of thumbs up votes
  • similarity between query and question/answer
  • number of answer terms
  • number of questions asked by answerer
  • answer’s lifetime

Thumbs up and down are right up there with the traditional IR features like similarity between query and document.

With thumb spam, the top 10 features are a bit different:

  • similarity between query and question
  • number of resolved questions for answerer
  • length ratio between query and answer
  • number of stars for answerer
  • similarity between query and question/answer
  • number of answer terms
  • number of questions asked by answerer
  • answer’s lifetime
  • number of question terms
  • length ratio between query and question

Suspiciously absent are thumbs up or down.  The seeming cornerstone of community engagement doesn’t even beat the number of words in the question!
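If you want to check the difference between the two lists yourself, here is a small Python sketch that compares them.  The feature names are copied from the lists above; the variable names (`without_spam`, `with_spam`, `dropped`, `added`) are my own and are not from the paper.

```python
# Top 10 features without spammers, in the order listed above.
without_spam = [
    "similarity between query and question",
    "number of resolved questions for answerer",
    "length ratio between query and answer",
    "number of thumbs down votes",
    "number of stars for answerer",
    "number of thumbs up votes",
    "similarity between query and question/answer",
    "number of answer terms",
    "number of questions asked by answerer",
    "answer's lifetime",
]

# Top 10 features with thumb spam, in the order listed above.
with_spam = [
    "similarity between query and question",
    "number of resolved questions for answerer",
    "length ratio between query and answer",
    "number of stars for answerer",
    "similarity between query and question/answer",
    "number of answer terms",
    "number of questions asked by answerer",
    "answer's lifetime",
    "number of question terms",
    "length ratio between query and question",
]

# Features that fall out of the top 10 once spam is introduced.
dropped = [f for f in without_spam if f not in with_spam]

# Features that enter the top 10 to take their place.
added = [f for f in with_spam if f not in without_spam]

print("dropped:", dropped)
print("added:", added)
```

The only two features that fall out are the thumbs down and thumbs up counts, and the two that replace them are both simple length measures of the question.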

This sounds like pretty bad news for sites relying on community interaction.  Recall that, by at least one measure, even an average of about six spammer-controlled accounts (just six spurious thumbs up) can halve the performance of search at question answering sites.

There is, however, a silver lining to be pulled from this paper.  Notice the number two feature in both lists: "number of resolved questions for answerer."  Also present in both lists are "number of stars for answerer" and "number of questions asked by answerer."  While this paper didn’t consider attacks on these features, it is comforting to know that they remain valid (and very useful).  One might also argue that some of these community features are going to be harder to attack, and easier for moderators to monitor.

It should come as no surprise that deeper forms of social engagement are more useful to the community site, in this case for search ranking.  Also, if you’re trying to improve your visibility/authority in that community, and get your content (in this case, answers to questions) in the hands of more readers, you’re much better off spending your time and energy on these deeper forms of engagement.

I guess what I’m trying to say is, rather than just thumbing this post, add a comment.  Or better yet, write a YOUmoz post in response (here’s an idea: What did this paper miss?  What have I pulled from this paper which doesn’t generalize to the outside world?). 

And if you’re a spammer and think you can get away with thumbing your Yahoo! Answers to the top of their SERPs, know that they’ve got their eye on you! …but state-of-the-art algorithmic detection still has a long way to go to really catch you 🙁

P.S. I’m headed to Costa Rica for the next 10 days without internet, phone, etc.  So if you post a comment and I don’t respond, that would be why.
