Daniel Lemire's blog


Not all citations are equal: identifying key citations automatically

20 thoughts on “Not all citations are equal: identifying key citations automatically”

  1. Greg Linden says:

    Nice idea, like the solution too. Hey, just FYI, the link in your post to Andre Vellino’s blog post appears to be incorrect. I think it’s supposed to go here:


  2. Marie says:

    Very nice idea!
    To take it one step further: A lot of the Related Work section in a paper (where most of the citations actually are) tends to be categorized in some way, i.e., past work dealing with the same problem, work that dealt with a similar problem, or work that dealt with related but orthogonal problems. While these citations might not be very influential, I still think they are useful for pointing users towards the relevant literature. What I’m envisioning is a Google Scholar system that does not only categorize citations into “priority” and “rest”, but also assigns more fine-grained descriptive labels such as “influential work”, “similar work”, “different but related work”, “general surveys”, etc.

  3. @Greg

    Thanks. I fixed the link.

  4. @Marie

    More generally, the classification of references into various categories has a long history (see our paper for references)… but there have been very few attempts to automate the process.

    This is really a case where we need machine learning to come in and help us.

  5. Hm. Shallow or deep. Influential or non-influential. Authors’ citation decisions are more complicated than that… Might I draw your attention to Marilyn Domas White and Peiling Wang’s “A Qualitative Study of Citing Behavior: Contributions, Criteria, and Metalevel Documentation” published in Library Quarterly (1997;67:122-154)? There’s a nice table in there of the various motivations of authors for including a citation or not, and some illustrative quotation. It doesn’t boil down as simple as you’ve made it for the purposes of teaching your machine!

    The temptation to look for machine solutions to ranking results in searches yielding tens of thousands of results is obvious, but I remain skeptical about the direction you’re taking, and the variables you describe in the pre-print are certainly game-able. If this does eventually result in a whizzy new search filter, I’d certainly love to give it a try, and I’d certainly appreciate being able to switch it OFF again, if you know where I’m coming from.

  6. @Douglas

    “Shallow or deep. Influential or non-influential. Authors’ citation decisions are more complicated than that…”

    Yes, I am aware of this. Allow me to quote a paragraph from our paper:

    The idea that the mere counting of citations is dubious is not new (Chubin & Moitra, 1975): The field of citation context analysis (a phrase coined by Small, 1982) has a long history dating back to the early days of citation indexing. There is a wide variety of reasons for a researcher to cite a source and many ways of categorizing them. For instance, Garfield (1965) identified fifteen such reasons, including giving credit for related work, correcting a work, and criticizing previous work.

    “It doesn’t boil down as simple as you’ve made it for the purposes of teaching your machine!”

    We did not make it simpler so that it would be easier for the computer; we wanted to make it easier for the authors!

    Previous attempts at automatic classifications have used much richer categorizations. See Garzone and Mercer (2000) who used 35 categories as well as Teufel et al. (2006) who used several categories organized in a two-level tree structure.

    “the variables you describe in the pre-print are certainly game-able”

    I agree. We allude to this in our paper:

    Moreover, identifying the genuinely significant citations might be viewed as an adversarial problem.

    I would argue however that it would be much harder to game our features than to game citation counts, to say nothing of publication counts.

    But let me draw attention to an interesting related issue: even if you don’t care to identify influential citations, and are happy to count citations, you could still benefit from the identification of influential citations to catch cheaters!

    Again, quoting from the preprint:

    In a survey, Wilhite and Fong (2012) found that 20% of all authors were coerced into citing some references by an editor, after their manuscript had undergone normal peer review. In fact, the majority of authors reported that they were willing to add superfluous citations if it is an implied requirement by an editor. If we could determine that many non-influential references in some journals are citing some specific journals, this could indicate unethical behavior.

  7. Thanks for your generous response. I did download your pre-print, but I perhaps have not yet given its 40 pages the full justice they deserve. I did get as far as your acknowledgement of the complexity of authors’ citation behavior, and I thank you for the references to studies with even more complex classifications than White and Wang’s.
    As we both know, the influence of impact factor has led to gaming of citation, but I’m dubious about mathematical remedies: insiders are in an arms race for search engine attention, the poor old public gets ever dumber, more homogeneous search results.
    It’s called the scientific *literature* for good reason, and a reader’s opinion of an author will surely be shaped by their assessment of whether they give good citation or not, particularly as we move into an open access world where it is easy to get the full text (i.e. call the author’s citation bluff). I suspect that this social pressure not to draw attention to orthogonal or off-topic material may prove an effective remedy…

  8. Xiaodan Zhu is a very talented researcher; I personally follow his work and I met some of his colleagues in Chicago last year (i2b2): very clever people!

    I haven’t read the paper yet; I will print it and write up my impressions. From my limited experience, I would consider the following rules of thumb:
    – How many times has a paper already been cited (before the publication of the current paper)? The higher the count, the higher the probability of it being a “must” citation (maybe because it represents a particular ML technique, a particular definition, or a particular standard).
    – If I clustered the sub-field communities, then I could extract the previously mentioned number, but restricted to a particular sub-field. Highly cited papers in your own community (especially the very old ones) are very likely not crucial. Whereas, maybe, if you cite a highly cited paper from a very different community, it could be that the paper gives you a new perspective.
    – If a citation appears only in the background section, it’s very likely not crucial. I would look at things like (#citation_to_this_paper/#citation_in_this_section).
    – In general, I suppose that citing a non-popular paper strongly suggests that the authors searched for it, discovered it, wanted it in the paper, and consequently found it relevant and crucial for their research.
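
As a rough illustration of the section-based rule of thumb above, here is a minimal Python sketch. It is not taken from the paper: the function name `citation_features`, the input layout (section title mapped to a list of reference keys, one entry per in-text mention), and the list of "background" section titles are all assumptions made for the example.

```python
from collections import Counter


def citation_features(sections, background_names=("related work", "background")):
    """Compute simple per-reference features for one citing paper.

    `sections` is a hypothetical layout: section title -> list of reference
    keys, with one entry per in-text mention in that section.
    """
    background = {name.lower() for name in background_names}
    features = {}
    for title, mentions in sections.items():
        counts = Counter(mentions)
        total = len(mentions)  # all citation mentions in this section
        for ref, n in counts.items():
            f = features.setdefault(
                ref, {"mentions": 0, "sections": set(), "max_section_share": 0.0}
            )
            f["mentions"] += n
            f["sections"].add(title)
            # (#citation_to_this_paper / #citation_in_this_section)
            f["max_section_share"] = max(f["max_section_share"], n / total)
    for f in features.values():
        # True when the reference never appears outside background-like sections
        f["background_only"] = all(s.lower() in background for s in f["sections"])
    return features


paper = {"Related Work": ["a", "b", "c", "a"], "Method": ["a"]}
feats = citation_features(paper)
print(feats["a"]["mentions"], feats["a"]["background_only"])
print(feats["b"]["mentions"], feats["b"]["background_only"])
```

Features like these would only be inputs to a classifier; whether a "background only" citation is actually non-influential would still have to be learned from annotated data.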

    I’ll let you know my impressions about the paper. Is it possible to contribute to the dataset? I already have this information for my papers.

    Thanks for this post!

    Our dataset contains 100 **papers** annotated papers.

  9. @Michele

    Yes, your comment just makes clear that there is a lot more work required… It is my hope that this problem will receive more attention in the future.

    I have no plan at this time to extend the dataset but if you are interested, get in touch with me, I’ll try to help as much as I can. I suppose that there would be great value in producing an extended version.

    Thanks for pointing out the typo in my post.

  10. Excellent blog post, I am going to have to read the article. We in the scientific database world suffer from the same issue: shallow citations (“there is a database called X, but I am going to tell you why I don’t care and am creating a new one”) and deep citations (“I used the data from database X and got a bunch of results, see figure 2”). The problem is the same as you describe, but the solution will need something else to deal with the fact that not all biologists publish papers any longer to showcase their work. Some of us create databases or write machine learning algorithms for a living and we would also like to be cited. Any ideas how your solution can help?

    ps. I am going to have to include your data resource in our registry so that I can run my algorithms on it.

  11. @Anita

    Right, the same kind of work could be done not only with formal bibliographic references, but any kind of reference made in a paper (e.g., references to software or databases).

  12. Jo Vermeulen says:

    Just wondering if there’s a tool available to calculate your HIP-index (e.g., based on a Google Scholar search)?

  13. @Jo Vermeulen

    The short answer is no. (It shouldn’t be surprising given that we just posted our paper days ago.)

    The long answer is that the core purpose of our paper is to encourage people to build such tools so we can finally do more than citation counting.

  14. Thanks Daniel,
    There is a very vocal group called the Resource Identification Initiative on Force11 that is trying to figure out how to do just that. Your point is a good one though and should enter the discussion.

  15. Min-Yen Kan says:

    Thanks for your blog post. Our group at NUS is also very much interested in these topics. Along with Simone Teufel, who has been working on the theory of Argumentative Zoning for many years, we have made a text classification tool that tries to describe the argumentative purpose of each sentence in an input article. You may find it an interesting project to read about, along with our ParsCit project.



  16. @Min-Yen

    Thanks. Though I am unsure whether we cite you, I am aware of your work on scholarly recommender systems.

  17. Jo Vermeulen says:

    @Daniel: Of course, I fully understand. Just thought it would be interesting to get an idea of the h-index vs HIP index of authors in my field (Human-Computer Interaction) 🙂

  18. @Jo Vermeulen

    From our work, you can expect that some researchers would benefit from the hip-index because, though they get fewer citations in the current system, they are cited more abundantly within papers.
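
To make the effect described above concrete, here is a small Python sketch comparing an ordinary h-index with an h-index computed over weighted citation counts. The weighting used here (each citing paper counts once per in-text mention) is an assumption chosen for illustration and should not be read as the paper's exact definition of the hip-index; the author data is made up.

```python
def h_index(scores):
    """Largest h such that at least h papers have a score of at least h."""
    ranked = sorted(scores, reverse=True)
    h = 0
    for rank, score in enumerate(ranked, start=1):
        if score >= rank:
            h = rank
        else:
            break
    return h


def plain_and_weighted_index(papers):
    """`papers` is a hypothetical layout: one list per paper, holding the number
    of in-text mentions made by each citing paper."""
    plain = h_index([len(citing) for citing in papers])      # one per citing paper
    weighted = h_index([sum(citing) for citing in papers])   # one per in-text mention
    return plain, weighted


# Author A: many citing papers, each mentioning the work once.
author_a = [[1, 1, 1, 1], [1, 1, 1], [1, 1]]
# Author B: fewer citing papers, but each mentions the work several times.
author_b = [[3, 2], [4, 1], [3]]

print(plain_and_weighted_index(author_a))
print(plain_and_weighted_index(author_b))
```

In this toy data, author B receives fewer citations overall yet ends up with the higher weighted index, which is the kind of shift the comment describes.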

  19. Jo Vermeulen says:

    @Daniel: I would expect this to be reflected in the number of citations of that paper too, over time (if other researchers pick it up).

    However, I can imagine the cip number for a specific paper might show this effect much sooner, and could then be used to indicate future ‘important’ or ‘influential’ papers.

  20. @Jo Vermeulen

    “I would expect this to be reflected in the number of citations of that paper too, over time”

    Yes, a paper receiving lots of significant citations will eventually be cited quite a bit, but the reverse is not true. Highly cited papers do not necessarily contain new and influential ideas. Take, for example, review articles.

    Lots of articles become highly cited because people need a reference regarding some topic, so they pick whatever reference other people have picked, often without reading it. Thus, some bad researchers get cited a lot.