14th August 2009, 1 min read

Picking a web page at random on the Web

To do statistics over the Web, we need samples. Thus, I want to know how to pick a Web page at random, without making much effort. If you are Google or Microsoft, it is easy. But what about the rest of us? And what if I want to pick users at random on Facebook?

In effect, I want to sample a virtually infinite data set: I consider the Web to be infinite because I cannot enumerate all elements in it. Given finite resources against an infinite data set, can I sample fairly?

Thankfully, the objects in this set are linked and indexed. Hence, I came up with two sampling strategies:

Using the hyperlinks: Start with any Web page. Recursively follow hyperlinks until you have many Web pages. Pick a page at random in this set. (Mathematically, I am sampling nodes in an infinite graph.)
Using the index: Sampling items Pick a set of common words, enter them into a search engine. Pick a page at random in the result set.

Are there any other strategies? What can I say (if anything) about the biases of my sampling?

Further reading: Benoit, D. and Slauenwhite, D. and Schofield, N. and Trudel, A., World’s First Class C Web Census: The First Step in a Complete Census of the Web, Journal of Networks 2 (2), 2007.

Update: An anonymous reader points me to Ziv Bar-Yossef and Maxim Gurevich, Random sampling from a search engine’s index, JACM 55 (5), 2008.

Update 2: Chris Brew points me to Henzinger et al., On Near-Uniform URL Sampling, WWW 2000.

Update 3: Regarding Facebook, I found Gjoka et al., Unbiased Sampling of Facebook, 2009.