27th March 2012, 22 min read

Publicly available large data sets for database research

29 thoughts on “Publicly available large data sets for database research”

Daniel Lemire says:

March 27, 2012 at 12:14 pm

@Venkat

Yes. I once had a page on this blog where I maintained a list of data sets within some kind of taxonomy. I gave up because it became unmanageable.
Similarly, I try to maintain a data warehousing bibliography. Ultimately, you realize that everything is miscellaneous.

That’s not quite accurate, of course, but I think it is quite a challenge to categorize data sets because they can be as diverse as human knowledge itself.
Venkat says:

March 27, 2012 at 10:43 am

Seen this compilation?

http://www.nytimes.com/2012/03/25/business/factuals-gil-elbaz-wants-to-gather-the-data-universe.html?hp
Venkat says:

March 27, 2012 at 10:44 am

Sorry, wrong link (though that one is also relevant).

Right link:

http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public?__snids__=36983349
Daniel Lemire says:

March 27, 2012 at 10:54 am

@Venkat

Thanks. The quora link is already in my post. 😉
Venkat says:

March 27, 2012 at 11:34 am

Ah, missed it on first pass, since you highlighted the pointer rather than the reference.

A useful exercise would be to create a table that technically characterizes what the data is like. A database of datasets.
Robert Morton says:

March 27, 2012 at 4:54 pm

Hi Daniel,
The US Bureau of Transportation Statistics has a number of large, free and well-structured data sets. In particular I recommend the On-Time Performance data set (~140M rows, ~90 columns) and the Ticket Pricing (Market) data set (~320M rows, ~40 columns). They are here:

On-Time Performance:
http://www.transtats.bts.gov/tables.asp?Table_ID=236&SYS_Table_Name=T_ONTIME

Ticket Pricing:
http://www.transtats.bts.gov/Tables.asp?DB_ID=125&DB_Name=Airline%20Origin%20and%20Destination%20Survey%20%28DB1B%29&DB_Short_Name=Origin%20and%20Destination%20Survey

These tables are split into separate Zip files which you can easily pull via ‘wget’ or ‘curl’ in a script. They’re comma-separated and well structured. The data is actual airline data, which is the best thing about these data sets.

-Robert

p.s. I found this site through your comments on some Quora post.
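
As a rough illustration of the scripted approach described in this comment, here is a minimal Python sketch that reads comma-separated rows straight out of a Zip archive without unpacking it to disk. It assumes the archive has already been fetched (for example with wget or curl); the helper names, the file name `sample.csv`, and the column names `Carrier` and `DepDelay` are hypothetical stand-ins, not the actual layout of the BTS files.

```python
import csv
import io
import zipfile


def iter_csv_rows(zip_file):
    """Yield each CSV row (as a dict) from every .csv member of a Zip archive."""
    with zipfile.ZipFile(zip_file) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue
            with archive.open(name) as raw:
                # Wrap the byte stream so rows are decoded lazily, one at a time.
                text = io.TextIOWrapper(raw, encoding="utf-8", newline="")
                yield from csv.DictReader(text)


def make_sample_zip():
    """Build a tiny in-memory archive standing in for one downloaded file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        # Hypothetical column names, for illustration only.
        archive.writestr("sample.csv", "Carrier,DepDelay\nAA,5\nUA,\n")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for row in iter_csv_rows(make_sample_zip()):
        print(row)
```

`iter_csv_rows` accepts either a path or a file-like object, so the same function should work on a file saved by a download script.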
Robert Morton says:

March 27, 2012 at 8:00 pm

Hi Daniel,
That hasn’t been my finding from the data. Exploring it with Tableau has exposed all kinds of interesting characteristics, especially in the On-Time Performance data.
-Robert
Daniel Lemire says:

March 27, 2012 at 5:36 pm

@Robert

Thanks! I’ll check it out.
Daniel Lemire says:

March 27, 2012 at 7:58 pm

@Robert

The data is large and nice, but it also appears to be essentially transactional. It seems to me that if I roll it up on key attributes, not much is left of the data.

Anyhow, I agree with you that it is very nice, just not what I’m looking for right now.
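
One possible reading of "rolling up on key attributes" is grouping the rows by a few columns and counting how many rows fall into each group. The Python sketch below shows that idea on a handful of made-up rows; the attribute names (`carrier`, `origin`, `delay`) and the values are assumptions for illustration, not taken from the actual data set.

```python
from collections import Counter


def roll_up(rows, keys):
    """Count how many rows share each combination of the given key attributes."""
    return Counter(tuple(row[k] for k in keys) for row in rows)


# Made-up sample rows; the real tables have many more columns.
rows = [
    {"carrier": "AA", "origin": "JFK", "delay": 5},
    {"carrier": "AA", "origin": "JFK", "delay": 0},
    {"carrier": "UA", "origin": "SFO", "delay": 12},
    {"carrier": "AA", "origin": "JFK", "delay": 7},
]

groups = roll_up(rows, ["carrier", "origin"])

if __name__ == "__main__":
    print(len(rows), "rows collapse into", len(groups), "groups")
```

Comparing the number of groups with the number of input rows gives a quick sense of how much the data shrinks under a given roll-up.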
Daniel Lemire says:

March 27, 2012 at 8:39 pm

@Robert

The On-Time link you gave leads me to an error page, but I have now found the on-time data set and I am having fun with it.

http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time

It is possible that I just don’t grok the ticket pricing data set.

(Note that I don’t use something as nice as Tableau to explore it right now. I’m hacking.)
Daniel Lemire says:

March 27, 2012 at 9:06 pm

@Robert

I still find some oddities in the data set. For example, for some months, I get an empty set… as if there were no delays! That seems improbable, especially since some of the tuples that I do get for other months have delays of zero, or just no delay indication (null value).

I don’t doubt you are able to exploit this data set, but I find it a bit difficult.

Still, it is really nice to see the government making such data freely available.
Jason Eisner says:

March 28, 2012 at 3:44 am

For evaluating implementation performance, it would be very helpful to have not just the dataset but also a workload (stream of queries and updates to that dataset). Any options?
Daniel Lemire says:

March 28, 2012 at 7:20 am

@Jason

I think that oltpbenchmark.com might help if you are looking for workloads.
Gunnar Grimnes says:

March 28, 2012 at 3:56 am

There are several largish Semantic Web datasets, i.e. billions of subject, predicate, object RDF triples. These are not your typical data warehouse data either, but you could at least make a large table with subject, predicate and object columns…

DBpedia has a structured form of the Wikipedia infoboxes; this is a lot like Freebase:
http://wiki.dbpedia.org/Downloads37

There has also been a series of Semantic Web “Billion Triple” challenges, which made large crawls of semantic web data available (2010 was about a billion triples, 2011 about 3.5 billion, I think).

http://km.aifb.kit.edu/projects/btc-2010/
http://km.aifb.kit.edu/projects/btc-2011/
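
The "large table with subject, predicate and object columns" idea can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table name `triples`, the helper `load_triples`, and the `ex:` sample values are all hypothetical, and a real dump would need an RDF parser and bulk loading.

```python
import sqlite3


def load_triples(triples):
    """Create an in-memory table with one row per (subject, predicate, object)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)"
    )
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)
    conn.commit()
    return conn


# Placeholder triples, not taken from any real dataset.
sample = [
    ("ex:a", "ex:knows", "ex:b"),
    ("ex:a", "ex:name", "A"),
    ("ex:b", "ex:knows", "ex:c"),
]

conn = load_triples(sample)
knows_count = conn.execute(
    "SELECT COUNT(*) FROM triples WHERE predicate = ?", ("ex:knows",)
).fetchone()[0]

if __name__ == "__main__":
    print("ex:knows triples:", knows_count)
```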
Michele Filannino says:

March 28, 2012 at 9:49 am

The data set from Google is actually “Google Books Ngram”, which is different from the Google NGram data set (available for $150 from the Linguistic Data Consortium).

Bye,
michele.
Daniel Lemire says:

March 28, 2012 at 10:04 am

@Michele

Quite right. I’ve updated my blog post. Thanks.
Davide Eynard says:

March 30, 2012 at 3:29 am

Hi Daniel,

The Freebase dataset is also directly available from their website; check http://wiki.freebase.com/wiki/Data_dumps

I think that it is also provided by Infochimps (http://www.infochimps.com/), which btw might be worth browsing for other datasets.

Freebase guys also provide WEX (Freebase Wikipedia Extraction), which is a processed dump of the English wikipedia. You can find more info at http://wiki.freebase.com/wiki/WEX

Have fun 🙂

da
Johan Dahlin says:

March 31, 2012 at 11:57 am

OpenStreetMap has 2.7 billion GPS data points, 1.4 billion nodes, 131 million ways. The dataset is a 250G XML file (21G compressed).

http://www.openstreetmap.org/stats/data_stats.html
http://wiki.openstreetmap.org/wiki/Planet.osm
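
An XML file of that size cannot be loaded whole into memory, so one common approach is to stream it. The Python sketch below counts `node` and `way` elements with the standard library's incremental parser; the helper name `count_elements` and the tiny inline sample are assumptions for illustration, and real planet files would first need to be decompressed or read through a decompressing stream.

```python
import io
import xml.etree.ElementTree as ET


def count_elements(source, tags=("node", "way")):
    """Stream an XML document and count the top-level elements named in tags."""
    counts = dict.fromkeys(tags, 0)
    context = ET.iterparse(source, events=("start", "end"))
    _event, root = next(context)  # the first event is the start of the root
    for event, elem in context:
        if event == "end" and elem.tag in counts:
            counts[elem.tag] += 1
            # Discard children already processed so memory use stays bounded.
            root.clear()
    return counts


# A tiny stand-in for an OSM extract.
sample = (
    b'<osm><node id="1"/><node id="2"/>'
    b'<way id="3"><nd ref="1"/><nd ref="2"/></way></osm>'
)

if __name__ == "__main__":
    print(count_elements(io.BytesIO(sample)))
```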
Onkar Hoysala says:

March 31, 2012 at 11:28 pm

The Open Library dataset is pretty awesome: http://openlibrary.org/developers/dumps
boya says:

April 7, 2012 at 7:26 am

http://campus.lostfocus.org/dataset/netflix.7z/
Daniel Lemire says:

September 4, 2012 at 9:07 am

(Derrick H. Karimi could not pass my Turing test so he asked me to post the following comment:)

I found your page here very useful, and I would like to contribute. I took off on your US census idea. I downloaded all index.html files from http://www2.census.gov/, which came out to 10,626 files and 1.3Gb. Then I searched through them for the biggest CSV files, and found a 2.1 GB .csv file here: http://www2.census.gov/acs2010_5yr/pums/csv_pus.zip

I did not spend much time determining what the data actually means, but I think the answers may be coded according to this:

http://www.census.gov/acs/www/Downloads/data_documentation/pums/CodeLists/ACSPUMS2006_2010CodeLists.pdf

One would have to spend some more time to actually figure out what the data is. But it was a good data exploration case for me to have almost no contextual information about the dataset and visualize the trends.
Cristian says:

November 28, 2012 at 6:08 pm

Daniel,

take a look here:
http://www.ncdc.noaa.gov/most-popular-data

Regards,
Cristian
Anonymous says:

January 15, 2013 at 4:05 pm

Daniel,

I found the DIMES project, http://www.netdimes.org, and they do provide access to data sets.
http://www.netdimes.org/new/?q=node/65
=======================================
DIMES is a distributed scientific research project, aimed to study the structure and topology of the Internet, with the help of a volunteer community (similar in spirit to projects such as SETI@Home).
=======================================

Regards,
Cristian.
Dharmaraj says:

December 6, 2015 at 6:58 am

Daniel,
I’m looking for a dataset for my project, more than 1GB in size and in CSV format.
Can you tell me where I can get one?
1. Kinmokusu says:
  
  May 20, 2016 at 10:57 pm
  
  Hello Dharmaraj,
  I have the same need. Did you find a dataset?
2. Bob says:
  
  February 11, 2017 at 11:23 am
  
  SSDM files CSV format
  
  2010-03-09:
  https://www.dropbox.com/sh/urxs2ifssb9oq78/AACSHOilKwsV8xwGVVpX1-nEa?dl=0
  
  2010-11-17:
  https://www.dropbox.com/sh/pneyuzakntq8fxa/AABqJCKJ6N-qDo9X4AFDcQxda?dl=0
  
  2011-11-13:
  https://www.dropbox.com/sh/hb95kjo3qlnn682/AAAS9UT1ckKukLkIbXI2CcNla?dl=0
  
  2013-05-31:
  https://www.dropbox.com/sh/naiq7dqgha8svn0/AACH2RFiu4ZY6oA884NiErnZa?dl=0
Bob says:

February 11, 2017 at 11:01 am

I found these. Over 8GB per set.
SSDM files CSV format

2010-03-09:
https://www.dropbox.com/sh/urxs2ifssb9oq78/AACSHOilKwsV8xwGVVpX1-nEa?dl=0

2010-11-17:
https://www.dropbox.com/sh/pneyuzakntq8fxa/AABqJCKJ6N-qDo9X4AFDcQxda?dl=0

2011-11-13:
https://www.dropbox.com/sh/hb95kjo3qlnn682/AAAS9UT1ckKukLkIbXI2CcNla?dl=0

2013-05-31:
https://www.dropbox.com/sh/naiq7dqgha8svn0/AACH2RFiu4ZY6oA884NiErnZa?dl=0
Kenn says:

June 4, 2017 at 1:37 am

I need a data set with a population size of 200.
?how many seconds in a year are there? says:

February 16, 2022 at 1:58 pm

I was looking all afternoon for this. Thank you for putting it up.