Daniel Lemire's blog


From counting citations to measuring usage (help needed!)

17 thoughts on “From counting citations to measuring usage (help needed!)”

  1. I find your approach interesting, but I’m afraid the dataset you collect will be too sparse to be useful 🙁

    On the other hand, in the form you say “By an essential reference, we mean a reference that was highly influential or inspirational for the core ideas in your paper”. Unfortunately, I respectfully disagree that this is the only way of finding great papers and/or better metrics of influence on other researchers.

    First of all, a paper’s influence on another paper is not always that strong, especially as time goes by and some influential papers come to be considered basic literature that must be cited just to provide context for the current work, not because they are “inspirational” for it.

    Secondly, some papers have no evident inspiration; they simply appear out of thin air. You mentioned John Nash, and his PhD dissertation is a great example of this: 2 references, 1 of which is a self-citation 🙂

    On my first reading of this post I thought “why don’t they use PageRank or a similar algorithm to compute that”. Now I’ve found the comment about eigenfactor.org. That approach is quite interesting *but* I simply don’t like the idea of computing the score for journals. Why not compute it for individual papers?

    In fact, I think that would be a rather sensible approach: a relevant/interesting/influential paper would be a paper cited by many relevant/interesting/influential papers.

    In other words, once such a score has been computed for all of the papers, you can find the most influential papers cited by any given paper and the most influential papers citing it.

    *And* with that information you could train a ML method to “distinguish” between the context surrounding cites to the papers which have been really influential in a given work and the context surrounding the supportive cites.

    Needless to say, there are tons of details missing here and it wouldn’t be as easy as I assume but I’d give it a try (provided the citation graph data was available).
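
    The per-paper score described above (a paper is influential if it is cited by influential papers) can be sketched as a plain PageRank-style power iteration over the citation graph. This is only a minimal illustration, not the Eigenfactor method or anything used in the study: the function name `paper_rank`, the damping factor of 0.85, the fixed iteration count, and the toy graph are all assumptions made for the example.

```python
def paper_rank(citations, damping=0.85, iterations=50):
    """PageRank-style score per paper (illustrative sketch).

    `citations` maps each paper id to the list of paper ids it cites.
    The name, damping value, and iteration count are assumptions.
    """
    # Collect every paper that appears as a citer or as a cited work.
    papers = set(citations)
    for cited in citations.values():
        papers.update(cited)
    if not papers:
        return {}

    n = len(papers)
    score = {p: 1.0 / n for p in papers}

    for _ in range(iterations):
        # Every paper starts each round with the same baseline share.
        new_score = {p: (1.0 - damping) / n for p in papers}
        for p in papers:
            cited = citations.get(p, [])
            if cited:
                # A paper passes its score on to the papers it cites.
                share = damping * score[p] / len(cited)
                for q in cited:
                    new_score[q] += share
            else:
                # A paper with no references spreads its score evenly.
                share = damping * score[p] / n
                for q in papers:
                    new_score[q] += share
        score = new_score
    return score


# Toy graph: A and B both cite C; C cites nothing.
toy_graph = {"A": ["C"], "B": ["C"], "C": []}
ranking = sorted(paper_rank(toy_graph).items(), key=lambda kv: -kv[1])
print(ranking)
```

    With such scores in hand, sorting a paper's reference list by score gives its most influential cited papers; a real citation graph would of course need far more care (self-citations, missing records, age effects).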

  2. Itman says:

    Applying an ML algorithm is all about building good features. In IR, for instance, it took decades. Considering how controversial this issue is, I would predict that it will take at least 10 years. Besides, to come up with good features, you will probably have to do some nontrivial NLP.

  3. @Itman

    Sure, but there is already some research on this going back to 2000 or even slightly before. Some people even published a Weka-based open source tool for this problem (circa 2010). So it is not as if we have no clue what to do. However, it has remained a relatively obscure topic. I hope we change that.

    But ok, it could take more than two years before it becomes mainstream. I can dream though, can’t I?

  4. Hi Daniel,

    Regarding 1 and 4. Maybe I’m too stubborn but a citation graph would be a great asset in addition to the dataset you are planning to collect.

    Regarding 2. My fault, you did not mention a better metric. It’s me who’s thinking of the need for better metrics 🙂

    And yes, I hope to find time soon to complete the form 🙂

    Best, Dani

  5. * Are you familiar with http://www.eigenfactor.org/ ?

    It seems to be quite related to what you’re talking about.

    * I don’t know how much sense it makes, or how biased it might be, but I think the people from Mendeley (www.mendeley.com) are in a good position to evaluate paper popularity in a more precise way. They already show some rankings based on how many people have a paper in their library, but I guess you can go beyond that and see how much time people actually spend reading the paper.

    If interested, you may want to contact this guy from Mendeley:


  6. @Alejandro

    Yes, I have been a member of Mendeley ever since it started. There are many initiatives to measure the overall impact of a research paper, including counting the number of downloads, the number of times the paper is mentioned, and so on. But my specific interest in “meaningful” citations goes beyond measuring the “importance” of a research paper.

  7. @Daniel

    1) Really, I should stress that all I am doing is taking the initiative, with the help of others, of building a data set that I think researchers (including myself, perhaps) would find useful in their own research. I do this because I am genuinely interested in the tools that might be constructed based on this research.

    2) I am not proposing some “better metric”. I am merely proposing that we make more mainstream the identification of meaningful references.

    By analogy, in the context of the web, that’s like arguing that we should differentiate meaningful links from shallow links by analyzing the text of the web pages. That, in itself, does not tell you how to rank web pages.

    3) I am not solely interested in recognizing influential work.

    4) I am not sure what you mean by “too sparse”? I guess you are thinking in a graph theoretical sense? The result of this project will not be a graph.

    I hope you will contribute to the data set. I expect you might be one of the researchers who might benefit from this data set.

  8. @Daniel

    We feel that the authors should provide this information, because it is difficult for an independent expert to make the assessment reliably.

  9. I’ve just completed the form.

    I must confess that I see things from a different perspective now. Certainly, there are a huge number of papers which are supportive for one’s work and a few which are really relevant (or meaningful in your wording).

    I’m now really curious to know whether ML methods will be able to tell apart one kind of citation from another.

    Good luck in this endeavor.

    Best, Dani

  10. Charlie says:

    A subject close to my heart, although I am not much for publishing my work…
    Years ago, I found that the best filter for finding relevant information on a subject new to me was what I have often referred to as the “inverse frequency” filter: an author who publishes too frequently is generally contributing nothing new to the body of knowledge on a given subject, just rehashing historical results. On the other hand, an author who publishes at most once or twice a year, or, even better, once every couple of years, is more likely to provide new and useful information.

    In the past, frequency of citation has been useful, but it seems the system is being gamed these days, and the fact that a certain author is frequently cited may be more an indication of how many “agreements” he has for mutual citations (or how strongly a particular publisher is promoting his work). Finding good technical literature is getting as hard as finding pertinent information on the public Internet!

    When one is familiar with a topic, one can quickly identify the key contributors to the knowledge base. However, when one is investigating a new field of interest or endeavor, sorting the wheat from the chaff is a daunting task. I don’t have time to read 10,000 papers (or even scan 10,000 bibliographies of published papers) to find the information crucial to my understanding of a subject. Therefore, any viable filter that can identify the true innovators would be of significant value to me personally (a service I would even willingly pay for!).

    Citation frequency is probably a better measure of the worth of a reference document than the author’s frequency of publication, but, as you point out, there is a crying need for some weighting measure to identify crucial, meaningful work…

  11. Suresh says:

    One interesting heuristic that a colleague of mine had proposed was to overweight citations within the meat of the paper, as opposed to those in the related work. It would be interesting to see if the data you collect has anything to say about this.
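
    The heuristic above can be sketched in a few lines, assuming each in-text citation has already been tagged with the section it appears in. Everything here is hypothetical and chosen only for illustration: the function name `weighted_citation_scores`, the set of section titles treated as related work, and the weights of 1.0 and 2.0.

```python
from collections import defaultdict

# Hypothetical choices: which section titles count as "related work"
# and how much more a citation in the body of the paper should weigh.
RELATED_WORK_SECTIONS = {"related work", "background"}
RELATED_WORK_WEIGHT = 1.0
BODY_WEIGHT = 2.0


def weighted_citation_scores(mentions):
    """Sum a weight per in-text citation, favoring citations in the body.

    `mentions` is an iterable of (reference_id, section_title) pairs,
    one pair per place where the reference is cited in the paper.
    """
    scores = defaultdict(float)
    for reference_id, section_title in mentions:
        if section_title.strip().lower() in RELATED_WORK_SECTIONS:
            scores[reference_id] += RELATED_WORK_WEIGHT
        else:
            scores[reference_id] += BODY_WEIGHT
    return dict(scores)


# Toy example: "a" is cited once in related work, "b" twice in the body.
mentions = [("a", "Related Work"), ("b", "Methods"), ("b", "Results")]
print(weighted_citation_scores(mentions))
```

    Author-labeled essential references, such as those collected through the form, could then be used to check whether the highest-scoring references are indeed the ones authors call essential.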

  12. @Suresh

    I agree. There are many simple heuristics which could be surprisingly accurate.

  13. Carl says:

    Are you familiar with the citation ontology (CiTO)? http://speroni.web.cs.unibo.it/cgi-bin/lode/req.py?req=http:/purl.org/spar/cito#introduction

    This provides an ontology for turning citations into linked data: the reason for the citation (supports, refutes, uses methods from, etc.) is encoded in markup around the citation, so it doesn’t have to be guessed from the context by machine learning. Long-term this is surely a better solution, though until publishers adopt this standard, machine-learning algorithms such as yours may be the best we can do. Perhaps papers marked up with CiTO could be used as a training set?

  14. Peter Turney says:

    @Carl Very interesting. Thanks!

  15. Anton says:

    Let’s say, for example, that a Russian paper cites a Chinese paper. It will be hard for an automatic system to analyse that.

    For example, Google Scholar is very weak at analysing Russian papers. It lists only about 20% of them and correctly shows only about 10% of citations.

  16. Dan says:

    Do you plan to make the feature extraction code and the processed data (i.e., the data with extracted features) available?
    The linked-to “Dataset” is just the questionnaire with links to the papers.


    1. I cannot release the software as I did not write it, but you can contact the first author of our study about it: http://www.xiaodanzhu.com/about.html He might be able to help.