Please note that PageRank is mostly a marketing tool. It has been shown that PR is roughly equal to in-link count. Also, content-based features (like BM25) are largely superior to link-based features.
Thanks for the interesting semi-anonymous comment, Sérgio, but it would be even more interesting if you provided references.
On average, PageRank might be equal to the in-link count, but proving that it is equal to the in-link count (in some sense) with high probability seems much harder to do. You need to make assumptions about the topology of the Web. I’d be very interested in knowing what these assumptions are.
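To illustrate why the topology matters, here is a minimal sketch in Python. It is not taken from any reference in this thread: the toy graph, the function names, and the damping factor of 0.85 are all assumptions made for illustration. It computes PageRank by plain power iteration and compares it with raw in-link counts.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Power-iteration PageRank on a dict {page: [pages it links to]}."""
    pages = sorted(set(links) | {q for targets in links.values() for q in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is spread uniformly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for p in pages:
            targets = links.get(p, [])
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


def in_link_counts(links):
    """Count how many links point at each page."""
    pages = set(links) | {q for targets in links.values() for q in targets}
    counts = {p: 0 for p in pages}
    for targets in links.values():
        for q in targets:
            counts[q] += 1
    return counts


# Hypothetical toy graph: "a" gets three in-links from pages nobody links to,
# while "b" gets a single in-link, but that link comes from "a".
toy_web = {
    "x": ["a"], "y": ["a"], "z": ["a"],
    "a": ["b"],
    "b": [],
}

ranks = pagerank(toy_web)
counts = in_link_counts(toy_web)
for page in sorted(ranks):
    print(page, counts[page], round(ranks[page], 3))
```

On this particular graph "b" ends up with a higher PageRank than "a" despite having fewer in-links, so any claim that the two measures agree has to rest on assumptions about what typical Web graphs look like.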
Daniel,
Link analysis uses a lot of matrix (2D) algebra. Including time as an additional dimension would require using tensor (3D) decomposition tools. This is what Jimeng Sun and Christos Faloutsos were advocating, e.g. in their paper and tutorial at the last SDM conference…
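As a rough picture of what "time as an additional dimension" means for the data, here is a small Python sketch. It only shows the 3D layout and two simple aggregations; it is not a tensor decomposition and not the method from the paper mentioned above. The snapshot data and function names are hypothetical.

```python
# snapshots[t][i][j] == 1 if page i links to page j at time step t (made-up data).
snapshots = [
    [[0, 1, 0],
     [0, 0, 0],
     [0, 0, 0]],
    [[0, 1, 0],
     [0, 0, 0],
     [0, 1, 0]],
    [[0, 0, 1],
     [0, 0, 1],
     [0, 0, 0]],
]


def in_links_over_time(tensor):
    """Collapse the source-page axis: result[t][j] = in-links of page j at time t."""
    return [
        [sum(row[j] for row in matrix) for j in range(len(matrix[0]))]
        for matrix in tensor
    ]


def static_in_links(tensor):
    """Collapse the time axis as well, as a single time-agnostic matrix would."""
    per_time = in_links_over_time(tensor)
    return [sum(step[j] for step in per_time) for j in range(len(per_time[0]))]


print(in_links_over_time(snapshots))
print(static_in_links(snapshots))
```

Once time is summed away, the page that collected its links early still looks strongest, even though all recent links go elsewhere. Keeping the third axis is what lets an analysis see that shift.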
This post by Jean Véronis identifies a *correlation* but his attempt at finding *causality* is highly speculative. Few people really know what is going on behind Google/Yahoo ranking, and those who do apparently don’t share it readily. A more natural explanation would be that the increased coverage and “authority” of Wikipedia make it rank higher quite naturally under a number of reasonable ranking criteria. (Note that the comparison is with a Dec. 2005 study; 2 years is a long time in WP time.)
Also, to answer your title question: the Wikipedia search engine is pretty miserable, so a regular search engine might be your best bet for searching WP efficiently.
What do you call a time-varying Markov process?
Oh, this I know (for the first time in any blog comment): a nonhomogeneous Markov process. It appears that the canonical reference is: Blackwell D. (1945). Finite nonhomogeneous Markov chains. Ann. Math. 46: 594-599.
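For readers who have not met the term, here is a minimal Python sketch of a nonhomogeneous chain, in which the transition matrix is allowed to change from step to step. The two matrices and the function names are invented for illustration and are not taken from the Blackwell paper.

```python
def step(distribution, transition):
    """One step of the chain: new[j] = sum_i distribution[i] * transition[i][j]."""
    n = len(distribution)
    return [
        sum(distribution[i] * transition[i][j] for i in range(n))
        for j in range(n)
    ]


def evolve(initial, transitions):
    """Apply a sequence of (possibly different) transition matrices in order."""
    dist = list(initial)
    for matrix in transitions:
        dist = step(dist, matrix)
    return dist


# Hypothetical two-state chain whose transition probabilities change over time.
P_early = [[0.9, 0.1],
           [0.5, 0.5]]
P_late = [[0.2, 0.8],
          [0.1, 0.9]]

start = [1.0, 0.0]
print(evolve(start, [P_early, P_late]))
print(evolve(start, [P_late, P_early]))
```

The two calls give different distributions because the order in which the matrices are applied matters. A homogeneous chain uses the same matrix at every step, so the question of order never comes up.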
If Web topology cannot cope anymore, this means we need to introduce time as a factor.
Just like… Google Blogsearch!
I really think the options offered in blog search will be available on the main Google page one day. How soon, however, I am not able to say.
Google is already experimenting with temporal filters: just add “view:timeline” to the end of your queries.