IR folks long suspected PageRank to be a red herring, but this was not confirmed until the last few years. The reference I like to use comes from MSR and was published at WWW06:
M. Richardson, A. Prakash, and E. Brill, “Beyond PageRank: machine learning for static ranking,” in WWW ’06: Proceedings of the 15th international conference on World Wide Web, (New York, NY, USA), pp. 707–715, ACM Press, 2006.
The authors demonstrate that structure-independent features, combined with a page’s popularity, significantly outperform PageRank. Informal conversations with engine architects and SEO folks confirm this.
It’s helpful to interpret these results in the context of a random walk on the web graph. PageRank is the stationary distribution of a random walker on the web graph. In situations where you have no knowledge about page visitation, this is a reasonable surrogate. However, in the presence of real user data (gathered through a toolbar or OS), the random walk model seems less attractive than models that incorporate visitation data.
That said, it also seems likely that the actual effectiveness of search engines has more to do with using massive amounts of click data to train classic IR features and query triage schemes.
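To make the random-walk interpretation above concrete, here is a minimal sketch of PageRank computed by power iteration on a toy graph. The graph, the function name `pagerank`, and the parameter `follow_prob` are made up for illustration, and the value 0.85 is just the commonly cited default, not something taken from the paper above.

```python
def pagerank(links, follow_prob=0.85, iterations=100):
    """Approximate the stationary distribution of the random walk.

    links maps each page to the list of pages it links to.
    follow_prob is the chance the walker follows a link; otherwise it
    jumps to a page chosen uniformly at random (an assumed default).
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives an equal share of the random-jump mass.
        new_rank = {p: (1.0 - follow_prob) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's mass evenly over its out-links.
                share = follow_prob * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its mass uniformly instead.
                for p in pages:
                    new_rank[p] += follow_prob * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical four-page web; "d" has no in-links.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 4))
```

Note that the scores depend only on link structure. A model built on visitation data would replace these computed probabilities with observed visit frequencies, which is the contrast drawn above.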
Peter Turney says:
Interesting post. I used Google Scholar to find all citations of “Predicting fame and fortune: Pagerank or indegree”. Google found 16 citations:
http://scholar.google.com/scholar?hl=en&lr=&cites=5736996577557537352
I skimmed some of the citations, and two seemed particularly relevant: (1) HITS on the Web: How Does It Compare? (2) Beyond PageRank: Machine Learning for Static Ranking. I was about to post this comment when I saw that two previous comments gave exactly the same two references. Now I’m posting this comment anyway, to say that Google PageRank may be bogus, but Google Scholar seems to work just fine. 🙂
Just to offer an anecdotal (and unconfirmed) piece of information: it is claimed that the original PageRank was not exactly the one described in the WWW97 paper.
In the plain vanilla implementation, the underlying model of PageRank corresponds to a “random surfer” who follows hyperlinks with probability 0.85 and otherwise gets bored and jumps to a random page. I have heard that in the actual implementation, the random surfer jumps only to pages in the “edu” domain. (This idea is similar to the TrustRank algorithm.)
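As a rough illustration of the random surfer described above, the sketch below simulates one long surf and compares jumping to any page with jumping only to “edu” pages. It assumes 0.85 is the probability of following a link, and it is not a claim about the actual implementation. The toy graph, page names, and function `simulate_surfer` are all hypothetical.

```python
import random


def simulate_surfer(links, teleport_pages, follow_prob=0.85,
                    steps=50_000, seed=0):
    """Estimate visit frequencies by simulating one long random surf."""
    rng = random.Random(seed)
    teleport_pages = sorted(teleport_pages)
    visits = {}
    page = rng.choice(teleport_pages)
    for _ in range(steps):
        visits[page] = visits.get(page, 0) + 1
        targets = links.get(page, [])
        if targets and rng.random() < follow_prob:
            # Follow one of the hyperlinks on the current page.
            page = rng.choice(targets)
        else:
            # Bored (or stuck on a page with no links): jump.
            page = rng.choice(teleport_pages)
    return {p: count / steps for p, count in visits.items()}


# Hypothetical toy web; nothing links to spam.com.
toy_web = {
    "uni.edu": ["lab.edu", "shop.com"],
    "lab.edu": ["uni.edu"],
    "shop.com": ["uni.edu"],
    "spam.com": ["shop.com"],
}
all_pages = list(toy_web)
edu_pages = [p for p in all_pages if p.endswith(".edu")]

uniform = simulate_surfer(toy_web, all_pages)    # jump anywhere
edu_only = simulate_surfer(toy_web, edu_pages)   # jump only to edu pages
```

With uniform jumps, `spam.com` still collects visits from random jumps even though nothing links to it. With jumps restricted to the edu pages it is never reached at all, which is the intuition behind trust-based variants such as TrustRank.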
Of course, since 1996 many things have changed and today there are so many other factors that are taken into consideration during ranking that it is almost certain that PageRank is mainly a marketing tool.
I agree that PageRank has become mainly a marketing tool. However, there is a flaw in Upstill’s work. He doesn’t compare in-degree with PageRank but with the score given in Google’s Toolbar, called “PageRank”. Nobody knows what this score is exactly. In particular, nothing proves that it is the real “pure” PageRank as described in the original PageRank paper. I suspect that it is (a downgraded version of) the score that Google uses for ranking, which is a mixture of many factors, in which PageRank plays some (unknown) role.
Interesting observation, Jean, but the paper by Najork et al. (HITS on the Web: How does it Compare?) supports the claim that PageRank is not even as accurate as in-degree.
True. My comment was not in defence of PageRank. The simple fact that Google needs to supplement it with several dozen other criteria shows that it is not ideal 😉 In a way, Upstill said something right with a disputable methodology.
Hi again,
Sorry for the lack of details about me. My name is Sérgio Nunes and I’m a PhD student in the field of WebIR.
Also, sorry for the lack of a proper reference for my statement. This is a recent experimental work by Marc Najork that delves into this issue:
“HITS on the Web: How does it Compare?”
http://research.microsoft.com/research/pubs/view.aspx?0rc=p&type=Publication&id=1734