Daniel Lemire's blog


Toward data-driven science

9 thoughts on “Toward data-driven science”

  1. Interesting. This data-centric, “use the world as its own model” approach is exactly the one advocated by subsumption architecture robotics and other “embodied intelligence” AI work.

  2. John says:

    In addition to a way to reference data sets, I’d like to see every analysis begin by asserting a checksum of the data. So you’re working with this data set, but are you starting with *exactly* this data, or are you doing a little off-record “cleaning” before you begin?

    Sometimes data need to be cleaned, maybe a great deal. But it should be done on the record.

  3. @John

    Of course, the problem is that, right now, sharing the data sets (cleaned or not) is a bit difficult. I use large data sets, and sometimes I have to do nontrivial processing on them. How do I share my results? I can post files on my own web site, but that’s hardly satisfying.

    But just imagine if you could drill down to the data sets people have used, and do an analysis of your own? I think research would really be improved.

  4. Thanks Daniel. With respect to DOIs for datasets, this is happening already in Germany (http://www.tib-hannover.de/en/the-tib/doi-registration-agency/) and it’s going to be happening in Canada very soon.

  5. Erik says:

    With regard to needing better tools…ever check out freebase.com ?

  6. marcel says:

    Thanks Daniel. I have run into the “Where can I find the data these authors have used?” problem several times. Testing recommender system performance is a good example where the results depend heavily on the data set. Even if you use the same data to test algorithms, you sometimes end up with different outcomes, just because you used a strange 5-fold procedure or introduced other artifacts. I like the idea of a “unique identifier for datasets” very much.

  7. Kevembuangga says:

    I like the idea of a “unique identifier for datasets” very much.

    Technically trivial (a SHA checksum over the whole content as the ID) but not yet perceived as a solution.
    I don’t think the DOI will do any good; it is geared toward property enforcement, not usability.

  8. @kevembuangga The idea of DOIs for datasets is more geared to making it possible to reference datasets as “publications” so that (a) they count in peer-review for promotions and (b) they can be referenced in journal articles. The DOI idea enforces the application of meta-data (author / title / abstract etc.) so that, among other things, datasets can be searched / discovered. See DataCite : http://www.datacite.org/
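
To make two of the suggestions above concrete (John's "begin by asserting a checksum" and Kevembuangga's "SHA checksum over the whole content as the ID"), here is a minimal Python sketch. It assumes the data set is a single file and picks SHA-256 as the SHA variant; the names `dataset_id` and `assert_dataset` are hypothetical, made up for illustration, and nothing here reflects how DOIs or DataCite actually work.

```python
import hashlib


def dataset_id(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of the file's full content.

    Assumption: the digest itself serves as the data set identifier.
    The file is read in chunks so large data sets need not fit in memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_dataset(path, expected_id):
    """Refuse to start an analysis unless the file matches the recorded ID."""
    actual = dataset_id(path)
    if actual != expected_id:
        raise ValueError(f"{path}: expected {expected_id}, got {actual}")
```

An analysis script would call `assert_dataset` with the published ID as its first step. Any cleaning would then produce a new file with a new ID, which puts the cleaning on the record.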