@Venkat
Yes. I once had a page on this blog where I maintained a list of data sets within some kind of taxonomy. I gave up because it became unmanageable.
Similarly, I try to maintain a data warehousing bibliography. Ultimately, you realize that everything is miscellaneous.
That’s not quite accurate, of course, but I think it is quite a challenge to categorize data sets because they can be as diverse as human knowledge itself.
Seen this compilation?
http://www.nytimes.com/2012/03/25/business/factuals-gil-elbaz-wants-to-gather-the-data-universe.html?hp
Sorry, wrong link (though that one is also relevant).
Right link:
http://www.quora.com/Data/Where-can-I-get-large-datasets-open-to-the-public?__snids__=36983349
@Venkat
Thanks. The quora link is already in my post. 😉
Ah, missed it on first pass, since you highlighted the pointer rather than the reference.
A useful exercise would be to create a table that technically characterizes what the data is like. A database of datasets.
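For illustration, a minimal sketch of what such a database of datasets might look like, using Python's built-in sqlite3. The columns are only guesses at useful technical characteristics, and the sample rows are taken from comments in this thread:

```python
import sqlite3

# A toy catalog: one row per dataset, characterized by its technical shape.
conn = sqlite3.connect("datasets.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS datasets (
        name        TEXT PRIMARY KEY,
        url         TEXT,
        format      TEXT,     -- e.g. CSV, XML, RDF
        n_rows      INTEGER,  -- approximate row or record count
        n_columns   INTEGER,  -- approximate column count, if tabular
        size_bytes  INTEGER   -- compressed size, if known
    )
""")
conn.executemany(
    "INSERT OR REPLACE INTO datasets VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("BTS On-Time Performance", "http://www.transtats.bts.gov/", "CSV",
         140_000_000, 90, None),
        ("OpenStreetMap Planet", "http://wiki.openstreetmap.org/wiki/Planet.osm",
         "XML", None, None, 21 * 1024**3),
    ],
)
conn.commit()
```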
Hi Daniel,
The US Bureau of Transportation statistics has a number of large, free and well-structured data sets. In particular I recommend the On-Time Performance data set (~140M rows, ~90 columns) and the Ticket Pricing (Market) data set (~320M rows, ~40 columns). They are here:
On-Time Performance:
http://www.transtats.bts.gov/tables.asp?Table_ID=236&SYS_Table_Name=T_ONTIME
Ticket Pricing:
http://www.transtats.bts.gov/Tables.asp?DB_ID=125&DB_Name=Airline%20Origin%20and%20Destination%20Survey%20%28DB1B%29&DB_Short_Name=Origin%20and%20Destination%20Survey
These tables are split into separate Zip files, which you can easily pull via ‘wget’ or ‘curl’ in a script. They’re comma-separated and well structured. The data is actual airline data, which is the best thing about these data sets.
-Robert
p.s. I found this site through your comments on some Quora post.
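To illustrate the scripted download Robert mentions, here is a rough Python sketch. The URL pattern is a made-up placeholder, not the real BTS naming scheme, which has to be read off the download pages above:

```python
import io
import urllib.request
import zipfile

# Hypothetical URL pattern -- the actual BTS file names differ and must be
# looked up on the download pages linked above.
URL = "http://www.transtats.bts.gov/download/On_Time_Performance_{year}_{month}.zip"

for month in range(1, 13):
    url = URL.format(year=2011, month=month)
    try:
        with urllib.request.urlopen(url) as resp:
            payload = resp.read()
    except Exception as exc:  # missing month, network error, ...
        print(f"skipping {url}: {exc}")
        continue
    # Each Zip contains a comma-separated file; extract it next to the script.
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        zf.extractall(f"ontime_2011_{month:02d}")
```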
Hi Daniel,
That hasn’t been my finding from the data. Exploring it with Tableau has exposed all kinds of interesting characteristics, especially in the On-Time Performance data.
-Robert
@Robert
Thanks! I’ll check it out.
@Robert
The data is large and nice, but it also appears to be essentially transactional. It seems to me that if I roll it up on the key attributes, not much of the data is left.
Anyhow, I agree with you that it is very nice, just not what I’m looking for right now.
@Robert
The On-Time link you gave leads me to an error page. I have now found the On-Time data set, and I am having fun with it:
http://www.transtats.bts.gov/Tables.asp?DB_ID=120&DB_Name=Airline%20On-Time%20Performance%20Data&DB_Short_Name=On-Time
It is possible that I just don’t grok the ticket pricing data set.
(Note that I don’t use something as nice as Tableau to explore it right now. I’m hacking.)
@Robert
I still find some oddities in the data set. For example, for some months I get an empty set… as if there were no delays! That seems improbable, especially since some of the tuples I do get in other months have delays of zero, or no delay indication at all (a null value).
I don’t doubt you are able to exploit this data set, but I find it a bit difficult.
Still, it is really nice to see the government making such data freely available.
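As an aside, a small pandas sketch of the kind of sanity check Daniel is describing, separating months with genuinely zero delays from months where the delay column is simply null. The column names FlightDate and ArrDelay are assumptions and should be checked against the actual CSV header:

```python
import pandas as pd

# Column names are assumptions; verify against the actual CSV header.
df = pd.read_csv("ontime_2011_01.csv", usecols=["FlightDate", "ArrDelay"])
df["month"] = pd.to_datetime(df["FlightDate"]).dt.to_period("M")

# Per month: total rows, rows with a null delay, rows with a zero delay.
grouped = df.groupby("month")["ArrDelay"]
summary = pd.DataFrame({
    "rows": grouped.size(),
    "null_delays": grouped.apply(lambda s: int(s.isna().sum())),
    "zero_delays": grouped.apply(lambda s: int((s == 0).sum())),
})
print(summary)
```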
For evaluating implementation performance, it would be very helpful to have not just the dataset but also a workload (stream of queries and updates to that dataset). Any options?
@Jason
I think that oltpbenchmark.com might help if you are looking for workloads.
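For what it is worth, here is a toy sketch of what a workload in this sense looks like: an interleaved stream of queries and updates over the dataset. The flights table, the 90/10 mix, and the value ranges are all made up for illustration:

```python
import random

# A toy workload generator: emits an interleaved stream of SQL statements
# against a hypothetical `flights` table, 90% point queries, 10% updates.
def workload(n_ops, seed=42):
    rng = random.Random(seed)
    for _ in range(n_ops):
        flight_id = rng.randrange(1, 140_000_000)
        if rng.random() < 0.9:
            yield f"SELECT * FROM flights WHERE id = {flight_id};"
        else:
            delay = rng.randrange(0, 180)
            yield f"UPDATE flights SET arr_delay = {delay} WHERE id = {flight_id};"

for stmt in workload(5):
    print(stmt)
```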
There are several largish Semantic Web datasets, i.e., billions of (subject, predicate, object) RDF triples. These are not your typical data-warehouse data either, but you could at least load them into one large table with subject, predicate, and object columns (a sketch follows this comment).
DBpedia has a structured form of the Wikipedia infoboxes; this is a lot like Freebase:
http://wiki.dbpedia.org/Downloads37
There has also been a series of Semantic Web “Billion Triple” challenges, which made large crawls of Semantic Web data available (the 2010 crawl was about a billion triples; 2011 was about 3.5 billion, I think):
http://km.aifb.kit.edu/projects/btc-2010/
http://km.aifb.kit.edu/projects/btc-2011/
The data set from Google is actually the “Google Books Ngram” data set, which is different from the Google N-gram data set (available for $150 from the Linguistic Data Consortium).
Bye,
michele.
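Here is a minimal sketch of the single subject/predicate/object table Michele mentions, loading N-Triples lines into SQLite. The parsing is deliberately naive (it assumes no spaces inside literals), so treat it as a starting point, not a real N-Triples parser:

```python
import sqlite3

conn = sqlite3.connect("triples.db")
conn.execute(
    "CREATE TABLE IF NOT EXISTS triples (subject TEXT, predicate TEXT, object TEXT)"
)

def parse_line(line):
    # Naive split: a real N-Triples parser must handle quoted literals
    # containing spaces and escapes; this is only a sketch.
    s, p, rest = line.split(" ", 2)
    return s, p, rest.rstrip().rstrip(".").rstrip()

with open("data.nt", encoding="utf-8") as f:
    rows = (parse_line(ln) for ln in f if ln.strip() and not ln.startswith("#"))
    conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", rows)
conn.commit()
```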
@Michele
Quite right. I’ve updated my blog post. Thanks.
Hi Daniel,
The Freebase dataset is also directly available from their website; check http://wiki.freebase.com/wiki/Data_dumps
I think it is also provided by Infochimps (http://www.infochimps.com/), which, by the way, might be worth browsing for other datasets.
The Freebase guys also provide WEX (Freebase Wikipedia Extraction), which is a processed dump of the English Wikipedia. You can find more info at http://wiki.freebase.com/wiki/WEX
Have fun 🙂
da
OpenStreetMap has 2.7 billion GPS data points, 1.4 billion nodes, and 131 million ways. The dataset is a 250 GB XML file (21 GB compressed).
http://www.openstreetmap.org/stats/data_stats.html
http://wiki.openstreetmap.org/wiki/Planet.osm
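A 250 GB XML file cannot be parsed in one gulp; here is a minimal Python sketch of streaming over planet.osm with iterparse, counting nodes and ways while keeping memory bounded:

```python
import xml.etree.ElementTree as ET

counts = {"node": 0, "way": 0}

# Stream the planet file element by element; clearing each element after
# use keeps memory bounded even for a file hundreds of gigabytes large.
for event, elem in ET.iterparse("planet.osm", events=("end",)):
    if elem.tag in counts:
        counts[elem.tag] += 1
    elem.clear()

print(counts)
```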
The Open Library dataset is pretty awesome: http://openlibrary.org/developers/dumps
http://campus.lostfocus.org/dataset/netflix.7z/
(Derrick H. Karimi could not pass my Turing test so he asked me to post the following comment:)
I found your page here very useful, and I would like to contribute. I took off from your US census idea. I downloaded all the index.html files from http://www2.census.gov/, which came out to 10,626 files and 1.3 GB. Then I searched through them for the biggest CSV files, and found a 2.1 GB .csv file here: http://www2.census.gov/acs2010_5yr/pums/csv_pus.zip
I did not spend much time determining what the data actually means, but I think the answers may be coded according to this:
http://www.census.gov/acs/www/Downloads/data_documentation/pums/CodeLists/ACSPUMS2006_2010CodeLists.pdf
One would have to spend more time to figure out exactly what the data means. But it was a good data-exploration exercise for me: starting with almost no contextual information about the dataset and visualizing the trends.
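One low-effort way to start on a 2.1 GB CSV you know nothing about is chunked reading. Here is a minimal pandas sketch that finds the most consistently populated columns, which are natural candidates for plotting trends (the file name assumes the Zip above has been extracted):

```python
import pandas as pd

# First pass over a large, undocumented CSV: read it in chunks and
# accumulate per-column non-null counts without holding it all in memory.
non_null = None
for chunk in pd.read_csv("csv_pus.csv", chunksize=100_000, low_memory=False):
    counts = chunk.notna().sum()
    non_null = counts if non_null is None else non_null.add(counts, fill_value=0)

# Columns that are almost always populated are good starting points.
print(non_null.sort_values(ascending=False).head(20))
```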
Daniel,
take a look here:
http://www.ncdc.noaa.gov/most-popular-data
Regards,
Cristian
Daniel,
I found the DIMES project, http://www.netdimes.org, and they do provide access to data sets:
http://www.netdimes.org/new/?q=node/65
=======================================
DIMES is a distributed scientific research project, aimed to study the structure and topology of the Internet, with the help of a volunteer community (similar in spirit to projects such as SETI@Home).
=======================================
Regards,
Cristian.
Daniel,
I’m looking for a dataset for my project, more than 1 GB in size and in CSV format.
Can you tell me where I can get one?
Hello Dharmaraj,
I have the same need. Did you find a dataset?
I found these. Over 8 GB per set.
SSDM files CSV format
2010-03-09:
https://www.dropbox.com/sh/urxs2ifssb9oq78/AACSHOilKwsV8xwGVVpX1-nEa?dl=0
2010-11-17:
https://www.dropbox.com/sh/pneyuzakntq8fxa/AABqJCKJ6N-qDo9X4AFDcQxda?dl=0
2011-11-13:
https://www.dropbox.com/sh/hb95kjo3qlnn682/AAAS9UT1ckKukLkIbXI2CcNla?dl=0
2013-05-31:
https://www.dropbox.com/sh/naiq7dqgha8svn0/AACH2RFiu4ZY6oA884NiErnZa?dl=0
I need a data set with a population size of 200.
I was looking all afternoon for this. Thank you for putting it up.