Daniel Lemire's blog


Slashdot: Why Is Data Mining Still A Frontier?

Slashdot asks “Why Is Data Mining Still A Frontier?” The article itself is not very exciting, but the comments are great. Here are some I like:

I would suggest that, in practice, the real difficulty is that the problems that need to really be solved for data mining to be as effective as some people seem to wish it was are, when you actually get down to it, issues of pure mathematics. Research in pure mathematics (and pure CS which is awfully similar really) is just hard. Pretending that this is a new and growing field is actually somewhat of a lie.

Available datasets are not themselves in anything like normal relational form, and so have potential internal inconsistencies. And that gets in the way before you even have the chance to try to form intelligent inferences based on relations between data sets, which of course are terribly inconsistent.

The ultimate problem, is that for most datasets, there are an infinite (at least), set of relations that can be induced from the data. This doesn’t even address the issue, that the choice of available data is a human task. However, going back to assuming we have all the data possible, you still need to have a specific performance task in mind.

To sum it up:

  • Data Mining requires hard and fancy Mathematics.
  • Data cleaning and integration are hard.
  • There are infinitely many ways to mine data and it is not obvious a priori what is useful.

I think Data Mining is a beautiful research topic. However, as the comments indicate, it is very hard and requires wide-ranging expertise.