Daniel Lemire's blog

The missing research tool…

18 thoughts on “The missing research tool…”

  1. Hal Daume has something of the sort for NLP/ML papers. You feed it your BibTeX file, and it shows you which papers cite similar work:

    http://www.cs.utah.edu/~hal/WhatToSee/

  2. Erik Duval says:

    Seems like we have similar concerns – I blogged about this more than 18 months ago: http://erikduval.wordpress.com/2007/10/04/i-need-help/.

    Can’t believe that progress in this area is so slow, though there are some new initiatives (http://www.mendeley.com/).

    We’re working to get something off the ground too. Would love to get more pointers to related work!

  3. @Dupuis

    “Scopus and/or Web of Science do a few of the things you’re looking for, not perfectly, but it’s a start.”

    I would argue it is a *bad* start.

    Scopus lists 11 of my papers, but one of them is not mine, so it really knows about 10 of my papers. It thinks that my 2000 paper “Wavelet time entropy” is my most cited work (with 21 citations).

    Should I trust Scopus? Is this an accurate picture? No. Not even close.

    Google Scholar tells me otherwise. My RACOFI paper from 2003 is cited 39 times according to Google Scholar, yet it does not even exist on Scopus! My Slope One paper from 2005 is cited 36 times according to Google Scholar and it does not exist on Scopus! My Tag-cloud drawing paper from 2007 is cited 20 times according to Google and… it does not even exist on Scopus.

    Even if you don’t trust the numbers Google Scholar gives you, these 3 papers I just listed do exist. They have been repeatedly cited, and there is even a Wikipedia page about one of them. Yet, as far as Scopus is concerned, I have hardly been cited for my work after 2000… except for the Scale and translation invariant collaborative filtering systems paper…

    Coverage matters a lot more to a researcher than precision. Missing 3 of my most important contributions is a big deal to me. I don’t care that it reports only 10 of my papers… I care that it misses my most important work!!!

    A tool that does not know about my important work can’t possibly help me monitor upcoming papers efficiently.

  4. @Anonymous

    Citeseer was good before Google Scholar came along. Now it is irrelevant: its coverage is ridiculously sparse.

  5. Daniel Haran says:

    That would be a trivial mashup, if the data were available in machine-readable format. What’s the challenge here? Is the raw data available? Is it parsing the papers to recognize citations?

  6. @Haran

    The data is most certainly not available in a structured format, nor is it available from a single place. Even if you can parse the papers to recognize the citations, you still have to link each citation to the paper it refers to. That is not easy: there are many ways to cite the same paper, and several papers have almost the same titles and almost the same authors. (A minimal matching sketch follows at the end of this comment.)

    There are places to get started. For example, in Computer Science, DBLP makes a rather large list of papers available as XML. The papers in the arXiv database can also, I presume, be indexed somehow.

    Recognizing similarities between what the researcher is doing and a given paper is also not trivial. It is probably similar to spam filtering. There may even be people who will try to cheat the system to get their papers recommended more often!

    So, it is a difficult challenge, for many reasons. But it seems that as years go by, no progress is being made. I have seen zero progress in the last two years on this problem. None. Nada.

    And the challenge is not just getting access to the data: even open access archives (such as arXiv) are hard to monitor!
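
    To make the linking step concrete, here is a minimal sketch in Python of matching a free-form citation string against a list of paper records (say, loaded from the DBLP XML dump). The record layout, the sample data, and the 0.9 threshold are illustrative assumptions, not a finished method:

    ```python
    import re

    def normalize_tokens(text):
        """Lowercase, drop punctuation, and split into a set of words."""
        return set(re.sub(r"[^a-z0-9 ]", "", text.lower()).split())

    def link_citation(citation, papers, threshold=0.9):
        """Return the paper record whose title best matches the free-form
        citation string, or None if nothing is close enough.
        `papers` is a list of dicts with 'title' and 'authors' keys."""
        cited = normalize_tokens(citation)
        best, best_score = None, 0.0
        for paper in papers:
            title = normalize_tokens(paper["title"])
            # Fraction of the title's words that appear in the citation.
            score = len(title & cited) / max(len(title), 1)
            # Titles alone are ambiguous (near-identical titles exist),
            # so also require at least one author surname to match.
            surnames = {a.split()[-1].lower() for a in paper["authors"]}
            if score > best_score and surnames & cited:
                best, best_score = paper, score
        return best if best_score >= threshold else None

    # Illustrative record and citation string:
    papers = [{"title": "Slope One Predictors for Online Rating-Based "
                        "Collaborative Filtering",
               "authors": ["Daniel Lemire", "Anna Maclachlan"]}]
    print(link_citation("D. Lemire and A. Maclachlan, Slope One predictors "
                        "for online rating-based collaborative filtering, "
                        "SDM 2005.", papers))
    ```

    In practice you would also check years and venues and handle abbreviated titles; the point is only that linking citations to papers is more than string equality.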

  7. That’s true, but the problem with publisher tools is that they (usually) apply only to the content owned by the publisher. What’s needed are publisher-neutral tools that don’t care who owns the intellectual property. (Which is related, in spirit anyway, to Daniel’s desideratum “The tool promotes open access content when possible”.)

  8. John Dupuis says:

    Scopus and/or Web of Science do a few of the things you’re looking for, not perfectly, but it’s a start.

  9. Mat Todd says:

    I use Web of Science for this. I have saved searches for relevant terms, and one for any papers that cite key papers. Weekly emails summarise it all.

  10. neal says:

    There is a tool (early version available) that promises to do these things; I blogged about it here: http://mobblog.cs.ucl.ac.uk/2009/02/23/research-is-the-new-music/

  11. Suresh says:

    try this:

    http://www.cs.utah.edu/~hal/WhatToSee/

    and it’s from an NLP researcher, to boot!

  12. Anonymous says:

    Is that not what http://citeseerx.ist.psu.edu/ already does?

  13. This is a symptom of a general problem. As long as academic writings are hidden in a maze of pay-for-access ghettos, access to information (and the tools used) will be poor.

    The same base problem makes academic writings less useful (and thus less meaningful) to the entire community.

    Solve the underlying problem.

  14. Krishnan says:

    Hi Daniel

    We have built such a tool at HP Labs India. We plan to expose it as a service in the future.

    Krishnan

  15. Ali Shams says:

    Daniel,
    Look at http://www.scientificcommons.org. I think this is a very good start.

    I think that such a tool should be social networking software rather than a natural language processing system.

  16. santhosh says:

    Hi Daniel,
    Look at http://silverfish.iiitb.ac.in; it’s a web-based semantics extraction and aggregation engine for academic documents.

    You can find related authors, related papers, and papers citing a given paper.

  17. Francois Rivest says:

    ISI Web of Science, although incomplete, allows you to trace papers citing a specific paper.

    Also, I don’t know about computer science, but in the health sciences in general, many publishers allow you to set citation alerts on papers of interest. That is, each time a paper important to your literature (including one of yours) is cited, they e-mail you the reference.

    I find these tools very valuable for staying informed of what is going on in the specific domain I am working in.

  18. Steven says:

    After reading this post, I wrote a little shell script that polls Google Scholar for new citations to my papers. I used wget with the Google Scholar URL and the full paper title, egrep -o "Cited by [0-9]+", and then stored the counts in a file. If a count changes, the script e-mails me. (A rough sketch along those lines appears at the end of this comment.)

    Of course, this misses whatever citations Google Scholar doesn’t pick up.

    The link by Suresh, WhatToSee, looks very useful.
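
    For readers who would rather avoid shell, here is a rough Python sketch of the same poller. The paper titles and the state-file name are placeholders, and Google Scholar may rate-limit or block automated queries, so treat it as a starting point only:

    ```python
    """Poll Google Scholar for each paper title, extract the first
    "Cited by N" count, and report any change since the last run."""
    import json
    import re
    import urllib.parse
    import urllib.request

    PAPERS = ["Slope One Predictors for Online Rating-Based Collaborative Filtering"]
    STATE_FILE = "citation_counts.json"

    def cited_by(title):
        """Fetch the Scholar results page for the quoted title and return
        the first "Cited by N" count found, or None."""
        url = ("https://scholar.google.com/scholar?q="
               + urllib.parse.quote('"' + title + '"'))
        req = urllib.request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
        html = urllib.request.urlopen(req).read().decode("utf-8", "replace")
        match = re.search(r"Cited by (\d+)", html)
        return int(match.group(1)) if match else None

    def main():
        try:
            with open(STATE_FILE) as f:
                old = json.load(f)
        except FileNotFoundError:
            old = {}
        new = {title: cited_by(title) for title in PAPERS}
        for title, count in new.items():
            if count is not None and count != old.get(title):
                # The original script e-mails at this point; printing keeps
                # the sketch self-contained (pipe the output to `mail`).
                print(f"{title}: {old.get(title)} -> {count} citations")
        with open(STATE_FILE, "w") as f:
            json.dump(new, f)

    if __name__ == "__main__":
        main()
    ```

    Run it from cron for a periodic digest; like the original script, it will miss any citations that Google Scholar itself misses.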