Daniel Lemire's blog

, 1 min read

Toward data-driven science

Science and business, so far, have been mostly model driven. That is, you collect a few data points, just enough to fit your model. Then you proceed from your model. However, things have changed:

old new
Manually take samples of the water in a nearby lake (4 times a year) Setup a wireless sensor in the lake (5000 samples a day)
Model an algorithm and test it once on expensive mainframe computer Build dozens of prototypes and test them on cheap laptops
Have an accountant prepare a business intelligence report, once a year See how the business is doing through your dynamic data warehouse

Hence, improving access to data is fast becoming a critical issue. In a thought-provoking post, Andre Vellino sketches the future of data Information Retrieval. Some key points:

  • Back in the early nineties, we had many electronic documents, but a comparatively poor infrastructure to share them. Then came the web and the search engines such as Google. Currently, we have many good data sets, but sharing and indexing them is painful. Clearly, we need to produce a better infrastructure for sharing data!
  • Research papers should reference data sets, by a unique identifier (such as a Digital Object Identifier), so that we can ask “What research relied on this data set?” or “Where can I find the data these authors have used?”

This is one instance where funding agencies should step in and encourage this work. It is not enough to encourage researchers to share their data. We need better tools too!