Interesting. This data-centric, “use the world as its own model” approach is exactly the one advocated by subsumption-architecture robotics and other “embodied intelligence” AI work.
In addition to a way to reference data sets, I’d like to see every analysis begin by asserting a checksum of the data. So you’re working with this data set, but are you starting with *exactly* this data, or are you doing a little off-the-record “cleaning” before you begin?
Sometimes data need to be cleaned, maybe a great deal. But it should be done on the record.
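The checksum assertion described above is easy to sketch. Here is a minimal example (function and variable names are my own, not from any standard tool): hash the file in chunks so large data sets never have to fit in memory, then compare against the published digest before the analysis starts.

```python
import hashlib
import tempfile

def dataset_checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Hash a dataset file in chunks so large files never load into memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Demo: write a tiny "dataset" and assert we start from exactly these bytes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"id,value\n1,3.14\n2,2.71\n")
    path = f.name

# In practice EXPECTED would be recorded once and published with the analysis.
EXPECTED = dataset_checksum(path)
assert dataset_checksum(path) == EXPECTED, "data changed since publication"
```

Any off-the-record “cleaning” changes the bytes, changes the digest, and makes the assertion fail, which is the point: the cleaning step has to happen on the record, after the checksum is verified.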
@John

Of course, the problem is that, right now, sharing the data sets (cleaned or not) is a bit difficult. I use large data sets, and sometimes I have to do nontrivial processing on them. How do I share my results? I can post files on my own web site, but that’s hardly satisfying.
But just imagine if you could drill down to the data sets people have used, and do an analysis of your own? I think research would really be improved.
With regard to needing better tools… ever check out freebase.com?
marcel says:
Thanks Daniel. I have run into the “Where can I find the data these authors have used?” problem several times. Testing recommender system performance is a good example where the results heavily depend on the data set. Even if you use the same data to test algorithms, you can end up with different outcomes, just because you used a strange 5-fold procedure or have other artifacts. I like the idea of a “unique identifier for datasets” very much.
Kevembuangga says:
I like the idea of a “unique identifier for datasets” very much.
Technically trivial (a SHA checksum over the whole content as the ID) but not yet perceived as a solution.
I don’t think the DOI will do any good; it is geared toward property enforcement, not usability.
@kevembuangga The idea of DOIs for datasets is more geared to making it possible to reference datasets as “publications” so that (a) they count in peer review for promotions and (b) they can be referenced in journal articles. The DOI idea enforces the application of metadata (author, title, abstract, etc.) so that, among other things, datasets can be searched and discovered. See DataCite: http://www.datacite.org/
On sharing datasets:
http://planet-research20.org/r2ose2010/index.php?option=com_myblog&show=our-corpus-is-your-corpus.html
Thanks Daniel. With respect to DOIs for datasets, this is happening already in Germany (http://www.tib-hannover.de/en/the-tib/doi-registration-agency/) and it’s going to be happening in Canada very soon.