Daniel Lemire's blog

Publicly available large data sets for database research

29 thoughts on “Publicly available large data sets for database research”

  1. @Venkat

    Yes. I once had a page on this blog where I maintained a list of data sets within some kind of taxonomy. I gave up because it became unmanageable.
    Similarly, I try to maintain a data warehousing bibliography. Ultimately, you realize that everything is miscellaneous.

    That’s not quite accurate, of course, but I think it is quite a challenge to categorize data sets because they can be as diverse as human knowledge itself.

  2. Venkat says:
  3. Venkat says:

    Sorry, wrong link (though that one is also relevant).

    Right link:

    http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public?__snids__=36983349

  4. @Venkat

    Thanks. The Quora link is already in my post. 😉

  5. Venkat says:

    Ah, missed it on first pass, since you highlighted the pointer rather than the reference.

    A useful exercise would be to create a table that technically characterizes what the data is like. A database of datasets.

  6. Hi Daniel,
    The US Bureau of Transportation statistics has a number of large, free and well-structured data sets. In particular I recommend the On-Time Performance data set (~140M rows, ~90 columns) and the Ticket Pricing (Market) data set (~320M rows, ~40 columns). They are here:

    On-Time Performance:
    http://www.transtats.bts.gov/tables.asp?Table_ID=236&SYS_Table_Name=T_ONTIME

    Ticket Pricing:
    http://www.transtats.bts.gov/Tables.asp?DB_ID=125&DB_Name=Airline%20Origin%20and%20Destination%20Survey%20%28DB1B%29&DB_Short_Name=Origin%20and%20Destination%20Survey

    The tables are split into separate ZIP files, which you can easily pull with ‘wget’ or ‘curl’ in a script; a minimal download sketch follows this comment. They’re comma-separated and well structured. The best thing about these data sets is that they are actual airline data.

    -Robert

    p.s. I found this site through your comments on some Quora post.
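
    For instance, a minimal download script might look like the following sketch (the base URL and file-naming pattern are assumptions for illustration; check the actual links on the BTS download pages):

        # Sketch: download and extract monthly BTS zip files.
        # BASE and the file names below are assumptions, not confirmed URLs.
        import io
        import urllib.request
        import zipfile

        BASE = "http://www.transtats.bts.gov/PREZIP"  # hypothetical base URL

        for year in (2010, 2011):
            for month in range(1, 13):
                name = f"On_Time_On_Time_Performance_{year}_{month}.zip"  # assumed naming
                with urllib.request.urlopen(f"{BASE}/{name}") as resp:
                    payload = resp.read()
                # Each archive holds one comma-separated file; extract it.
                with zipfile.ZipFile(io.BytesIO(payload)) as zf:
                    zf.extractall("ontime")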

  7. Hi Daniel,
    That hasn’t been my finding from the data. Exploring it with Tableau has exposed all kinds of interesting characteristics, especially in the On-Time Performance data.
    -Robert

  8. @Robert

    Thanks! I’ll check it out.

  9. @Robert

    The data is large and nice, but it also appears to be essentially transactional. It seems to me that if I roll it up on key attributes, not much of the data is left; the sketch after this comment illustrates the kind of roll-up I mean.

    Anyhow, I agree with you that it is very nice, just not what I’m looking for right now.
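
    To make the roll-up point concrete, a minimal sketch with pandas (the file and column names are assumptions; the real files have around 90 columns):

        # Sketch: roll up the on-time data on a few key attributes and
        # compare row counts before and after. Column names are assumed.
        import pandas as pd

        df = pd.read_csv("On_Time_On_Time_Performance_2010_1.csv")  # assumed name
        rollup = df.groupby(["Carrier", "Origin", "Dest"]).agg(
            flights=("FlightDate", "size"),
            avg_arr_delay=("ArrDelay", "mean"),
        )
        # If millions of rows collapse into a few thousand groups, the
        # data is essentially transactional, as noted above.
        print(len(df), "rows roll up to", len(rollup), "groups")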

  10. @Robert

    The On-Time link you gave leads to an error page, but I have now found the on-time data set and I am having fun with it.

    http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time

    It is possible that I just don’t grok the ticket pricing data set.

    (Note that I don’t use something as nice as Tableau to explore it right now. I’m hacking.)

  11. @Robert

    I still find some oddities in the data set. For example, for some months I get an empty set… as if there were no delays! That seems improbable, especially since some of the tuples I do get in other months have delays of zero, or no delay indication at all (a null value). A sketch for spotting such gaps follows this comment.

    I don’t doubt you are able to exploit this data set, but I find it a bit difficult.

    Still, it is really nice to see the government making such data freely available.
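
    One quick way to spot those gaps, sketched with pandas (the file and column names are assumptions):

        # Sketch: count rows and null arrival delays per month. Months
        # that are missing entirely simply will not appear in the output.
        import pandas as pd

        df = pd.read_csv("ontime_2010.csv")  # hypothetical combined extract
        by_month = df.groupby("Month").agg(
            rows=("ArrDelay", "size"),
            null_delays=("ArrDelay", lambda s: s.isna().sum()),
        )
        print(by_month)  # months with mostly nulls stand out immediately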

  12. Jason Eisner says:

    For evaluating implementation performance, it would be very helpful to have not just the dataset but also a workload (a stream of queries and updates against that dataset). Any options?

  13. @Jason

    I think that oltpbenchmark.com might help if you are looking for workloads.

  14. There are several largish Semantic Web datasets, i.e., billions of (subject, predicate, object) RDF triples. These are not your typical data-warehouse data either, but you could at least make a large table with subject, predicate, and object columns (a loading sketch follows this comment).

    DBpedia has a structured form of the Wikipedia infoboxes; it is a lot like Freebase:
    http://wiki.dbpedia.org/Downloads37

    There has also been a series of Semantic Web “Billion Triple” challenges, which made large crawls of Semantic Web data available (the 2010 crawl was about a billion triples; the 2011 one about 3.5 billion, I think):

    http://km.aifb.kit.edu/projects/btc-2010/
    http://km.aifb.kit.edu/projects/btc-2011/
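
    As a sketch of the “one big three-column table” idea, assuming the dump is in N-Triples format:

        # Sketch: load N-Triples lines into a three-column SQLite table.
        # The naive split below is for illustration; real dumps (literals
        # can contain spaces) deserve a proper RDF parser.
        import sqlite3

        con = sqlite3.connect("triples.db")
        con.execute("CREATE TABLE IF NOT EXISTS spo (s TEXT, p TEXT, o TEXT)")

        with open("btc-sample.nt", encoding="utf-8") as f:  # hypothetical file
            for line in f:
                line = line.strip()
                if not line or line.startswith("#"):
                    continue
                s, p, o = line.rstrip(" .").split(" ", 2)  # naive split
                con.execute("INSERT INTO spo VALUES (?, ?, ?)", (s, p, o))
        con.commit()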

  15. Michele Filannino says:

    The data set from Google is actually the “Google Books Ngram” data set, which is different from the Google NGram data set (available for $150 from the Linguistic Data Consortium).

    Bye,
    michele.

  16. @Michele

    Quite right. I’ve updated my blog post. Thanks.

  17. Hi Daniel,

    The Freebase dataset is also directly available from their website; check http://wiki.freebase.com/wiki/Data_dumps

    I think it is also provided by Infochimps (http://www.infochimps.com/), which, by the way, might be worth browsing for other datasets.

    The Freebase folks also provide WEX (Freebase Wikipedia Extraction), a processed dump of the English Wikipedia. You can find more info at http://wiki.freebase.com/wiki/WEX

    Have fun 🙂

    da

  18. Johan Dahlin says:

    OpenStreetMap has 2.7 billion GPS data points, 1.4 billion nodes, and 131 million ways. The data set is a 250 GB XML file (21 GB compressed); a streaming sketch follows this comment.

    http://www.openstreetmap.org/stats/data_stats.html
    http://wiki.openstreetmap.org/wiki/Planet.osm
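
    A file that size has to be streamed rather than loaded whole; a minimal sketch with the Python standard library:

        # Sketch: stream an .osm XML file and count element types without
        # loading it into memory.
        import xml.etree.ElementTree as ET
        from collections import Counter

        counts = Counter()
        for _event, elem in ET.iterparse("planet.osm", events=("end",)):
            if elem.tag in ("node", "way", "relation"):
                counts[elem.tag] += 1
            elem.clear()  # release each subtree as soon as it is seen
        print(counts)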

  19. The Open Library dataset is pretty awesome: http://openlibrary.org/developers/dumps

  20. boya says:
  21. (Derrick H. Karimi could not pass my Turing test, so he asked me to post the following comment:)

    I found your page here very useful, and I would like to contribute. I took off from your US census idea. I downloaded all index.html files from http://www2.census.gov/, which came out to 10,626 files and 1.3 GB. Then I searched through them for the biggest CSV files, and found a 2.1 GB .csv file here: http://www2.census.gov/acs2010_5yr/pums/csv_pus.zip

    I did not spend much time determining what the data actually means, but I think the answers may be coded according to this:

    http://www.census.gov/acs/www/Downloads/data_documentation/pums/CodeLists/ACSPUMS2006_2010CodeLists.pdf

    One would have to spend more time to figure out exactly what the data is. But having almost no contextual information about the data set made it a good data-exploration exercise for visualizing trends (a crawl sketch follows this comment).
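
    A sketch of that kind of crawl over the downloaded pages, using only the standard library (the local mirror path and link pattern are assumptions):

        # Sketch: scan mirrored index.html files for links to .csv or
        # .zip files, roughly as described above.
        import pathlib
        import re

        link_re = re.compile(r'href="([^"]+\.(?:csv|zip))"', re.IGNORECASE)

        candidates = set()
        for page in pathlib.Path("census_mirror").rglob("index.html"):
            text = page.read_text(errors="ignore")
            candidates.update(link_re.findall(text))
        print(len(candidates), "candidate files found")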

  22. Cristian says:

    Daniel,

    take a look here:
    http://www.ncdc.noaa.gov/most-popular-data

    Regards,
    Cristian

  23. Anonymous says:

    Daniel,

    I found the DIMES project, http://www.netdimes.org, and they provide access to data sets:
    http://www.netdimes.org/new/?q=node/65
    =======================================
    DIMES is a distributed scientific research project, aimed to study the structure and topology of the Internet, with the help of a volunteer community (similar in spirit to projects such as SETI@Home).
    =======================================

    Regards,
    Cristian.

  24. Dharmaraj says:

    Daniel,
    I’m looking for a data set for my project, more than 1 GB in size, in CSV format.
    Can you tell me where I can get that?

    1. Kinmokusu says:

      Hello Dharmaraj,
      I have the same need. Did you find a data set?

    2. Bob says:
  25. Bob says:
  26. Kenn says:

    I need a data set with a population size of 200.

  27. I was looking all afternoon for this. Thank you for putting it up.