27th August 2014, 18 min read

Though unrefereed, arXiv has a better h-index than most journals…

17 thoughts on “Though unrefereed, arXiv has a better h-index than most journals…”

D. Eppstein says:

August 27, 2014 at 1:50 pm

Having many low-citation papers is almost completely irrelevant for the purposes of computing h-indexes. What counts is volume of high-citation papers. So the fact that there’s a lot of junk in arXiv should not cause it to be a surprise that its h-index is high: there’s also a lot of well-cited stuff.

(I am intentionally writing low-citation and high-citation rather than low-quality and high-quality, because while citations and quality are correlated they are not the same thing.)
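
The point about low-citation papers can be checked with a minimal sketch. The citation counts below are made up for illustration and are not arXiv data; `h_index` is a hypothetical helper implementing the usual definition (the largest h such that h papers have at least h citations each).

```python
def h_index(citations):
    """Largest h such that at least h papers have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# Hypothetical venue with five well-cited papers.
well_cited = [50, 40, 30, 20, 10]
# The same papers plus 1000 papers cited at most once.
with_junk = well_cited + [0] * 500 + [1] * 500

print(h_index(well_cited))  # 5
print(h_index(with_junk))   # 5
```

Under these assumed numbers, a thousand extra low-citation papers leave the h-index unchanged.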
Anon says:

August 27, 2014 at 5:19 pm

Note the definition: h5-median for a publication is the median number of citations for the articles **that make up its h5-index**.

So among the 42 top-cited papers, the median is 57 citations (e.g., MADLIB paper). This doesn’t say anything about the junk papers.

The way citations are counted is weird too — most of the highly cited papers are VLDB/SIGMOD papers, and most of the citations are likely to the conference versions, not arXiv versions.

– A.
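
As a rough illustration of that definition, the sketch below computes the median over only the papers that make up the h-index and compares it with the median over all papers. The citation counts are invented, and `h_core` is a hypothetical helper name, not something Google Scholar exposes.

```python
from statistics import median

def h_core(citations):
    """Return the top-h papers, where h is the h-index of the list."""
    ranked = sorted(citations, reverse=True)
    # In a descending list, `count >= rank` holds for exactly the first h ranks.
    h = sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)
    return ranked[:h]

# Hypothetical venue: three well-cited papers and four rarely cited ones.
citations = [80, 60, 12, 3, 1, 0, 0]

core = h_core(citations)
print(core)               # [80, 60, 12]
print(median(core))       # 60, the "h-median" of this toy venue
print(median(citations))  # 3, the median over all papers
```

In this toy case the median over the h-core (60) says nothing about the typical paper (3), which is the distinction the definition turns on.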
Daniel Lemire says:

August 27, 2014 at 6:30 pm

@Anon

I misinterpreted their numbers.

Thanks.
Different Anon says:

August 27, 2014 at 7:42 pm

Now that you see you misinterpreted the numbers, the explanation of the phenomenon is immediate. If the collection of papers X is a subset of the collection of papers Y, then Y will have a higher h-index according to Google.

Y will also have a higher median citation among papers making up the h-index.

The vast majority of papers are on the arxiv today, accounting for its very high stats.
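
A small sketch of the subset argument, on made-up citation counts (X and Y below are hypothetical collections, and `h_core` is an illustrative helper). Adding papers can never lower the h-index, so Y's is at least as high as X's. The median over the papers making up the h-index, however, need not rise, as this example suggests.

```python
from statistics import median

def h_core(citations):
    """Return the top-h papers, where h is the h-index of the list."""
    ranked = sorted(citations, reverse=True)
    h = sum(1 for rank, count in enumerate(ranked, start=1) if count >= rank)
    return ranked[:h]

x = [100, 100]        # hypothetical collection X
y = x + [4, 4, 4, 4]  # hypothetical superset Y

core_x, core_y = h_core(x), h_core(y)
print(len(core_x), median(core_x))  # 2 100.0
print(len(core_y), median(core_y))  # 4 52.0
```

Here the h-index grows from 2 to 4 while the median over the h-core drops from 100 to 52, so the second claim holds only for some citation distributions.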
Daniel Lemire says:

August 27, 2014 at 11:34 pm
@Another

“The vast majority of papers are on the arxiv today, accounting for its very high stats.”

My post was about the database section. There are about 30 new papers a month on arXiv cs.DB:
- January 2014: 30
- February 2014: 38
- March 2014: 33
- April 2014: 23
- May 2014: 27
- June 2014: 31
One venue alone (VLDB) had 236 papers in 2013.
Peter Turney says:

August 28, 2014 at 11:32 am

In support of what Daniel is saying, arXiv Computation and Language (cs.CL) is ranked more highly in the field of computational linguistics than the most prestigious journal (in my subjective opinion), Computational Linguistics:

http://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng_computationallinguistics

As with cs.DB above, the comparison here is between a specific journal (Computational Linguistics) and a specific subset of arXiv (cs.CL).
Leonid Boytsov says:

August 28, 2014 at 5:08 pm

How many of the top-cited papers appear elsewhere, say, in VLDB? Note that to have a good h5 index, it is not necessary to have a lot of papers.

For example, in “Computer Vision & Pattern Recognition” arXiv is ranked 12th, with an h5-index equal to 38.

So, what? Can’t you have 38 publications that have manuscripts both on arXiv and elsewhere?

And, indeed, check the top paper in this category:
“Point-Set Registration: Coherent Point Drift” (Myronenko, X Song)

It also appears in Pattern Analysis and Machine Intelligence, IEEE.

So, the arguments are not convincing.
Daniel Lemire says:

August 29, 2014 at 8:10 am

@Leonid

If you refer to my blog post, can you please point out precisely the arguments you find lacking? I simply cannot respond to “the arguments are not convincing”. I need to know what you disagree with.
Leonid Boytsov says:

August 29, 2014 at 8:18 am

Good ranking CAN be explained by the fact that arXiv includes everything. More precisely, some of the best other papers. Because arXiv overlaps with good journals and conferences, you can’t rank it alone. Furthermore, famous authors may be willing to publish on arXiv, because their papers will be read anyway, even if they don’t publish anywhere else. For others this strategy likely won’t work. Yet the h5-index won’t tell you this.
Daniel Lemire says:

August 29, 2014 at 8:31 am

@Leonid

Some things I am *not* saying:

1. “Papers posted on arXiv only appear on arXiv.” (Daniel: The opposite is true. It is trivial to check.)

2. “Posting your paper on arXiv alone is a good strategy to become highly cited.” (Daniel: Though I do not know for sure, I suspect it is a terrible strategy. In fact, submitting to arXiv alone is something I recommend against doing, except for technical reports that are otherwise unpublishable.)

3. “Posting your papers on arXiv will increase the number of citations you receive.” (Daniel: My default assumption is that this statement is generally false but can be true in specific cases. That is, sometimes arXiv can help if it makes your paper more accessible. However, posting the paper on your web site can also help, probably just as much.)

What I am saying:

*) On the whole, the quality of papers on arXiv is comparable to that of good venues, at least if you measure quality by citation. Subjectively, the quality of papers on arXiv is amazingly high. (Daniel: arXiv has an h-index comparable to the major conferences and it is only 2–3 times larger than the big venues. So the top tier on arXiv should be as good as what you get by browsing a leading venue. Given that arXiv is unrefereed, I find this result amazing. Being only 3 times worse than a leading conference, given that anyone can post papers… is a great score. Note that we did not discuss a comparison between arXiv and second-tier venues. I think arXiv would put these second-tier venues to shame.)

Now, if you were to argue that all the best papers from all the best venues appear on arXiv, you would only make my point stronger. It would not be a counterpoint to what I am saying!!!

As it stands, if you are an engineer and you do not have access to research papers through your employer, subscribing to arXiv, given that it is free, seems like a great choice. You are going to get slightly more junk than if you subscribe to ACM SIGMOD, say, but not a whole lot more.

What I did not point out in my blog post but that I should point out is that I do not understand this phenomenon. I do not understand why arXiv is so good. I certainly expected it to be far worse.

The numbers puzzle me. But facts are facts.
Daniel Lemire says:

August 29, 2014 at 8:41 am

@Leonid

“Good ranking CAN be explained by the fact that arXiv includes everything. More precisely, some of the best other papers.”

Yes. It includes many of the best papers in a given field.

Read this last sentence again: it includes many of the best papers in a given field.
Leonid Boytsov says:

August 29, 2014 at 8:48 am

I read some other sentences as well. I don’t know what your intentions were, but your words are highly misleading.

Starting from the post topic:
“Though unrefereed, arXiv has a better h-index than most journals…”

Despite what you say in the last comments, it sends an absolutely wrong signal. It sounds like, ohh look, there is an unrefereed venue and it’s so good. But it doesn’t matter, because this venue doesn’t exist alone.

Next you essentially say that the high h5 index can’t be explained by people publishing papers elsewhere, because only a small fraction of papers appears on arXiv. This may be true, but it is not provable, because the h5 index judges only a few top papers and these papers are published elsewhere.

If this was all written to support a simple statement that publishing an open-source version of your paper doesn’t diminish its citation index, it’s a lot of misleading words. Because the statement is clearly obvious and has no specific relation to arxiv.
Daniel Lemire says:

August 29, 2014 at 9:33 am

@Leonid

1) How is my statement misleading:

“Though unrefereed, arXiv has a better h-index than most journals…”

It is what Google is telling us. Please follow the link I have offered in my blog post:

http://scholar.google.com/citations?view_op=top_venues&hl=en&vq=eng_databasesinformationsystems

2) “Next you essentially say that the high h5 index can’t be explained by people publishing papers elsewhere, because only a small fraction of papers appears on arXiv.”

I disagree with the words you are putting in my mouth. This is not what I wrote. Here is what I wrote:

“One could argue that the good ranking can be explained by the fact that arXiv includes everything. However, it is far from true.”

3) “If this was all written to support a simple statement that publishing an open-source version of your paper doesn’t diminish its citation index”

No, this is not my message at all. My message is that if you consider arXiv as a venue, then it is a good quality venue. This means that as a reader, you can reasonably use arXiv as an information source. I conclude my blog post by encouraging people to subscribe to the list of arXiv new papers.

We can disagree as to whether people (especially engineers) should use arXiv as an information source regarding new papers. Maybe people should not subscribe to the Twitter feed I recommended. Maybe you have arguments against this… certainly, reasonable arguments can be found.

But I do not think you can disagree about the fact that arXiv, as a venue, has a high h-index. Well, if you do disagree about this, you need to take up your disagreement with Google. I am only reporting on what Google is telling us.
Leonid Boytsov says:

August 29, 2014 at 11:05 am

>But I do not think you can disagree about the fact that arXiv, as a venue, has a high h-index. Well, if you do disagree about this, you need to take up your disagreement with Google. I am only reporting on what Google is telling us.

Ohhh, I absolutely can. And here is my public disagreement

http://searchivarius.org/blog/does-arxiv-really-have-high-citation-index

Should I put it on arXiv to look fancier? 🙂 Regarding Google: Google is great, but that doesn’t automatically mean that people at Google are always right. For example, Google Flu Trends was recently criticized:

http://www.forbes.com/sites/stevensalzberg/2014/03/23/why-google-flu-is-a-failure/
Stevan Harnad says:

September 3, 2014 at 5:12 am

ARXIV INCLUDES BOTH UNREFEREED AND REFEREED VERSIONS OF PAPERS: DISTINGUISH CITATION FROM EARLY ACCESS AND DOWNLOAD LOCUS

Peer-reviewed publication is not the same thing as access-provision:

Subscription Journals provide peer review as well as access (to subscribers).

Repositories provide access (to peer-reviewed journal articles and sometimes to earlier unrefereed drafts).

Hence repositories do not have citation counts or h-indexes:

Users access whatever version they can access, but they cite the journal article.

The only exception is unrefereed drafts — but even there, it is the author’s draft that is being cited, and not the repository:

Unrefereed drafts used to be cited as “name, title, unpublished (or ‘in prep’)” and refereed, accepted drafts used to be cited as “name, title, journal (in press).”

Adding an OA access-point to the journal citation is becoming an increasingly common (and desirable) practice, but it does not change the fact that what is being cited is the work, and the canonical version of the work is the refereed, published version.

Hence repositories do not have citation counts; they just have download access counts.

Some interesting statistics can, however, be done on the citation of unrefereed vs refereed versions.
Peter Turney says:

September 3, 2014 at 7:59 am

@Stevan

This distinction between access-point and journal-citation is fine in principle, but in fact many authors write their references as if arXiv were a journal-citation, not an access-point. It seems that, in the minds of many authors and in the computers of Google, the distinction between access-point and journal-citation is being blurred.
Stevan Harnad says:

September 3, 2014 at 10:35 am

@PeterTurney

Scholarly practices are evolving — in the online era one might even say they are “catching up” with the still mostly untapped potential of the online medium.

Yes, some authors are citing sloppily, but I assure you they are not doing so in their CVs! A posting to Arxiv is not a refereed publication unless it has been accepted for publication by a refereed journal. And that is the reference authors will cite (as long as peer review continues to be the criterion for peer-reviewed publication).

In Arxiv, the longstanding users such as HEP physicists have caught up: They cite the Arxiv preprint till the journal reference is available, and from then on they cite the journal reference (though they will still add the Arxiv URL or DOI for access).

Once Open Access becomes universal, all authors, in all disciplines, will catch up…