This is a long post about Journal Impact Factors. Thanks to Stephen Curry for encouraging me to post this.
- the JIF is based on highly skewed data
- it is difficult to reproduce the JIFs from Thomson-Reuters
- JIF is a very poor indicator of the number of citations a random paper in the journal received
- reporting a JIF to 3 d.p. is ridiculous; it would be better to round to the nearest 5 or 10.
I really liked this recent tweet from Stat Fact
It’s a great illustration of why reporting means for skewed distributions is a bad idea. And this brings us quickly to Thomson-Reuters’ Journal Impact Factor (JIF).
I can actually remember the first time I realised that the JIF was a spurious metric. This was in 2003, after reading a letter to Nature from David Colquhoun who plotted out the distribution of citations to a sample of papers in Nature. Up until that point, I hadn’t appreciated how skewed these data are. We put it up on the lab wall.
Now, the JIF for a given year is calculated as follows:
A JIF for 2013 is worked out by counting the total number of 2013 cites to articles in that journal that were published in 2011 and 2012. This number is divided by the number of “citable items” in that journal in 2011 and 2012.
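For anyone who prefers to see the arithmetic, here is a minimal Python sketch of that calculation. The counts are made up for illustration (they are not real Thomson-Reuters figures), and the function name is my own.

```python
def journal_impact_factor(cites_to_window, citable_items_in_window):
    """Mean citations per citable item over the two-year window."""
    if citable_items_in_window <= 0:
        raise ValueError("need at least one citable item")
    return cites_to_window / citable_items_in_window


# Hypothetical journal: 2013 cites to items published in 2011 and 2012.
cites_2013_to = {2011: 4200, 2012: 3900}
# Hypothetical number of "citable items" published in each of those years.
citable_items = {2011: 380, 2012: 370}

jif_2013 = journal_impact_factor(
    sum(cites_2013_to.values()), sum(citable_items.values())
)
print(f"JIF 2013 = {jif_2013:.3f}")
```

Note that the numerator and denominator are just totals, so nothing about the shape of the underlying distribution survives into the final number.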
There are numerous problems with this calculation that I don’t have time to go into here. If we just set these aside for the moment, the JIF is still used widely today and not for the purpose for which it was originally intended. Eugene Garfield created the metric to provide librarians with a simple way to prioritise subscriptions to journals that carried the most-cited scientific papers. The JIF is used (wrongly) in some institutions in the criteria for hiring, promotion and firing. This is because of the common misconception that the JIF is a proxy for the quality of a paper in that journal. Use of metrics in this manner is opposed by the SF-DORA and I would encourage anyone that hasn’t already done so to pledge their support for this excellent initiative.
Why not report the median rather than the mean?
With the citation distribution in mind, why do Thomson-Reuters calculate the mean rather than the median for the JIF? It makes no sense at all. If you didn’t quite understand why from the @statfact tweet above, then look at this:
The Acta Crystallographica Section A effect. The plot shows that this journal had a JIF of 2.051 in 2008 which jumped to 49.926 in 2009 due to a single highly-cited paper. Did every other paper in this journal suddenly get amazingly awesome and highly-cited for this period? Of course not. The median is insensitive to outliers like this.
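The same effect is easy to reproduce with toy numbers. This is a minimal sketch using invented citation counts (not the Acta Crystallographica data): ten ordinary papers, plus one paper with thousands of citations.

```python
from statistics import mean, median

# Invented citation counts for ten ordinary papers.
typical_papers = [0, 0, 1, 1, 1, 2, 2, 3, 4, 6]
# The same journal after one blockbuster paper is added.
with_outlier = typical_papers + [5000]

for label, counts in (("without outlier", typical_papers),
                      ("with outlier", with_outlier)):
    print(f"{label}: mean = {mean(counts):.2f}, median = {median(counts)}")
```

The mean goes from 2 to over 450, while the median barely moves (1.5 to 2).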
The answer to why Thomson-Reuters don’t do this is probably for ease of computation. The JIF (mean) requires only three numbers for each journal, whereas calculating the median would require citation information for each paper under consideration for each journal. But it’s not that difficult (see below). There’s also a mismatch in the items that bring in citations to the numerator and those that count as “citable items” in the denominator. This opacity is one of the major criticisms of the Impact Factor, and it would make calculating the median a problem for them.
Let’s crunch some citation numbers
I had a closer look at citation data for a small number of journals in my field. DC’s citation distribution plot was great (in fact, superior to JIF data) but it didn’t capture the distribution that underlies the JIF. I crunched the IF2012 numbers (released in June 2013) sometime in December 2013. This is shown below. My intention was to redo this analysis more fully in June 2014 when the IF2013 was released, but I was busy, had lost interest and the company said that they would be more open with the data (although I’ve not seen any evidence for this). I wrote about partial impact factors instead, which took over my blog. Anyway, the analysis shown here is likely to be similar for any year and the points made below are likely to hold.
I mainly looked at Nature, Nature Cell Biology, Journal of Cell Biology, EMBO Journal and J Cell Science, using citations in 2012 articles to papers published in 2010 and 2011, i.e. the same criteria as for IF2012.
The first thing that happens when you attempt this analysis is that you realise how unreproducible the Thomson-Reuters JIFs are. This has been commented on in the past (e.g. here), yet I had the same data as the company uses to calculate JIFs and it was difficult to see how they had arrived at their numbers. After some wrangling I managed to get a set of papers for each journal that gave close to the same JIF.
From this we can look at the citation distribution within the dataset for each journal. Below is a gallery of these distributions. You can see that the data are highly skewed. For example, JCB has kurtosis of 13.5 and a skewness of 3. For all of these journals ~2/3 of papers had fewer than the mean number of citations. With this kind of skew, it makes more sense to report the median (as described above). Note that Cell is included here but was not used in the main analysis.
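If you want to compute these summary statistics yourself, here is a minimal sketch in plain Python. It is not the code used for the plots: the citation counts are simulated from a lognormal distribution as a stand-in for real data, the parameters are arbitrary, and I have assumed moment-based skewness and excess kurtosis (other definitions give slightly different values).

```python
import random
from statistics import mean, median


def central_moment(xs, k):
    """k-th central moment of the sample (population form, divides by n)."""
    m = mean(xs)
    return sum((x - m) ** k for x in xs) / len(xs)


def skewness(xs):
    """Moment-based skewness: m3 / m2^(3/2)."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5


def excess_kurtosis(xs):
    """Moment-based excess kurtosis: m4 / m2^2 - 3 (0 for a normal)."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3


def fraction_below_mean(xs):
    """Share of papers with fewer citations than the mean."""
    m = mean(xs)
    return sum(1 for x in xs if x < m) / len(xs)


# Simulated stand-in for one journal's per-paper citation counts.
rng = random.Random(2012)
citations = [int(rng.lognormvariate(2.0, 1.0)) for _ in range(500)]

print(f"mean = {mean(citations):.2f}, median = {median(citations)}")
print(f"skewness = {skewness(citations):.2f}")
print(f"excess kurtosis = {excess_kurtosis(citations):.2f}")
print(f"fraction below mean = {fraction_below_mean(citations):.2f}")
```

For a right-skewed sample like this, well over half of the papers sit below the mean, which is the same pattern as in the journals above.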
So how do these distributions look when compared? I plotted each journal compared to JCB. They are normalised to account for the differing number of papers in each dataset. As you can see they are largely overlapping.
If the distributions overlap so much, how certain can we be that a paper in a journal with a high JIF will have more citations than a paper in a journal with a lower JIF? In other words, how good is the JIF (mean or median) at predicting how many citations a paper published in a certain journal is likely to have?
To look at this, I ran a Monte Carlo analysis comparing a random paper from one journal with a random one from JCB and looked at the difference in number of citations. Papers in EMBO J are indistinguishable from JCB. Papers in JCS have very slightly fewer citations than JCB. Most NCB papers have a similar number of cites to papers in JCB, but there is a tail of papers with higher cites, a similar but more amplified picture for Nature.
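The idea can be sketched in a few lines of Python. This is an illustration of the approach rather than the original analysis: both "journals" are simulated lognormal samples with arbitrary parameters, and the function names are my own.

```python
import random


def simulate_citations(rng, n_papers, mu, sigma):
    """Simulated per-paper citation counts (a stand-in for real data)."""
    return [int(rng.lognormvariate(mu, sigma)) for _ in range(n_papers)]


def monte_carlo_differences(journal_x, journal_ref, n_draws, rng):
    """Pick one random paper from each journal, n_draws times, and
    return the differences in citations (journal X minus reference)."""
    return [rng.choice(journal_x) - rng.choice(journal_ref)
            for _ in range(n_draws)]


def summarise(diffs):
    """Fractions of draws where X is higher than, tied with, or lower than the reference."""
    n = len(diffs)
    return {
        "x_higher": sum(d > 0 for d in diffs) / n,
        "tie": sum(d == 0 for d in diffs) / n,
        "x_lower": sum(d < 0 for d in diffs) / n,
    }


rng = random.Random(42)
journal_ref = simulate_citations(rng, 500, 2.0, 1.0)  # plays the role of JCB
journal_x = simulate_citations(rng, 500, 2.2, 1.0)    # slightly "higher" journal

diffs = monte_carlo_differences(journal_x, journal_ref, 10_000, rng)
summary = summarise(diffs)
print(summary)
```

A histogram of `diffs` centred near zero means that knowing which journal a paper is in tells you very little about its citations.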
Thomson-Reuters quotes the JIF to 3 d.p. and most journals use this to promote their impact factor (see below). The precision of 3 d.p. is ridiculous when two journals with IFs of 10.822 and 9.822 are indistinguishable when it comes to the number of citations to randomly sampled papers in that journal.
So how big do differences in JIF have to be in order to be able to tell a “Journal X paper” from a “Journal Y paper” (in terms of citations)?
To look at this I ran some comparisons between the journals in order to get some idea of “significant differences”. I made virtual issues of each journal with differing numbers of papers (5, 10, 20, 30), compared the citations in each via a Wilcoxon rank test, and then plotted out the frequency of p-values for 100 of these tests. Please leave a comment if you have a better idea to look at this. I liked this method over the head-to-head comparison for two papers as it allows these papers the benefit of the (potential) reflected glory of other papers in the journal. In other words, it is closer to what the JIF is about.
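Here is a rough Python sketch of the virtual-issue procedure, for anyone who wants to try it. It makes several assumptions that are not from the original analysis: the journals are simulated lognormal samples, papers are drawn without replacement, and the rank-sum p-value is a hand-rolled two-sided normal approximation with a tie correction and no continuity correction. A statistics package would normally do this job.

```python
import random
from statistics import NormalDist


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def rank_sum_p_value(a, b):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation)."""
    n1, n2 = len(a), len(b)
    n = n1 + n2
    combined = list(a) + list(b)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    tie_counts = {}
    for v in combined:
        tie_counts[v] = tie_counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in tie_counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:  # every value identical: no evidence of a difference
        return 1.0
    z = (u1 - n1 * n2 / 2) / variance ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


def virtual_issue_p_values(journal_a, journal_b, issue_size, n_tests, rng):
    """Compare n_tests pairs of random virtual issues of a given size."""
    return [
        rank_sum_p_value(rng.sample(journal_a, issue_size),
                         rng.sample(journal_b, issue_size))
        for _ in range(n_tests)
    ]


rng = random.Random(7)
journal_a = [int(rng.lognormvariate(2.6, 1.0)) for _ in range(500)]
journal_b = [int(rng.lognormvariate(2.0, 1.0)) for _ in range(500)]

for size in (5, 10, 20, 30):
    p_values = virtual_issue_p_values(journal_a, journal_b, size, 100, rng)
    significant = sum(p < 0.05 for p in p_values) / len(p_values)
    print(f"N = {size:2d}: {significant:.0%} of tests with p < 0.05")
```

The share of tests with p < 0.05 is a crude summary of the p-value frequency plot described above.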
OK, so this shows that sufficient sample size is required to detect differences, no surprise there. But at N=20 and N=30 the result seems pretty clear. A virtual issue of Nature trumps a virtual issue of JCB, and JCB beats JCS. But again, there is no difference between JCB and EMBO J. Finally, only ~30% of the time would a virtual issue of NCB trump JCB for citations! NCB and JCB had a difference in JIF of almost 10 (20.761 vs 10.822). So not only is quoting the JIF to 3 d.p. ridiculous, it looks like rounding the JIF to the nearest 5 (or 10) might be better!
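To make the rounding suggestion concrete, here is a tiny Python sketch applied to the three JIF values quoted in this post. The helper name is my own.

```python
def round_to_nearest(value, step):
    """Round value to the nearest multiple of step.

    Python's round() sends exact halves to the nearest even integer,
    which does not affect any of the values used here.
    """
    return step * round(value / step)


# JIF values quoted in the post, as reported to 3 d.p.
for jif in (20.761, 10.822, 9.822):
    print(f"{jif:.3f} -> {round_to_nearest(jif, 5)}")
```

Rounded to the nearest 5, the two journals that were indistinguishable in the random-paper comparison (10.822 and 9.822) end up with the same value.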
This analysis supports the idea that there are different tiers of journal (in Cell Biology at least). But the JIF is the bluntest of tools to separate these journals. A more rigorous analysis is needed to demonstrate this more clearly but it is not feasible to do this while having a dataset which agrees with that of Thomson-Reuters (without purchasing the data from the company).
If you are still not convinced about the shortcomings of the JIF, here is a final example. The IF2013 for Nature increased from 38.597 to 42.351. Let’s have a look at the citation distributions that underlie this rise of 3.8! As you can see below they are virtually identical. Remember that there’s a big promotion that the journal uses to pull in new subscribers. Seems a bit hollow somehow, doesn’t it? Disclaimer: I think this promotion is a bit tacky, but it’s actually a really good deal… the News stuff at the front and the Jobs section at the back alone are worth ~£40.
Show us the data!
Recently, Stephen Curry has called for Journals to report the citation distribution data rather than parroting their Impact Factor (to 3 d.p.). I agree with this. The question is though – what to report?
- The IF window is far too narrow (2 years + 1 year of citations) so a broader window would be more useful.
- A comparison dataset from another journal is needed in order to calibrate ourselves.
- Citations are problematic – not least because they are laggy. A journal could change dramatically and any citation metric would not catch up for ~2 years.
- Related to this some topics are hot and others not. I guess we’re most interested in how a paper in Journal X compares to others of its kind.
- Any information reported needs to be freely available for re-analysis and not in the hands of a company. Google Scholar is a potential solution but it needs to be more open with its data. They already have a journal ranking which provides a valuable and interesting alternative view to the JIF.
One solution would be to show per article citation profiles comparing these for similar papers. How do papers on a certain topic in Journal X compare to not only those in Journal Y but to the whole field? In my opinion, this metric would be most useful when assessing scholarly output.
Thanks for reading to the end (or at least scrolling all the way down). The take home points are:
- the JIF is based on highly skewed data.
- the median rather than the mean is better for summarising such distributions.
- JIF is a very poor indicator of the number of citations a random paper in the journal received!
- reporting a JIF to 3 d.p. is ridiculous; it would be better to round to the nearest 5 or 10.
- an open resource for comparing citation data per journal would be highly valuable.
The post title is taken from “Wrong Number” by The Cure. I’m not sure which album it’s from, I only own a Greatest Hits compilation.
40 thoughts on “Wrong Number: A closer look at Impact Factors”
As you asked for it – another way to run the tests/comparisons would be to do a bootstrap comparison of the two fitted distributions. To do this, you could use poweRlaw’s compare_distributions() function (see http://cran.r-project.org/web/packages/poweRlaw/) to compare discrete PL distributions (where you probably should fit lognormals, I assume). That has the advantage that you compare actual citation counts, not ranks, which might be slightly obscuring.
Thank you for the comment. Yes, lognormal describes these distributions well (I removed them from the plots for simplicity). I can definitely see the benefit of the comparison you suggest, thanks.
Thanks for this interesting post, which I’ll be bookmarking and passing around.
I am amused to see that Acta Crystallographica Section A presents an even more extreme outlier influence example than Mol Biol Evol (short discussion here: http://people.unil.ch/marcrobinsonrechavi/2015/03/molecular-biology-and-evolution-impact-factor-the-mega-effect-updated/).
Thanks Marc. I hadn’t seen that post. Funnily enough, somebody DM’d me last week to say that MBE was worth checking for outliers influencing their Impact Factor. Now I know why!
Interesting that you should find that rounding to the nearest 5-10 or thereabouts is probably the correct ballpark from which to start comparing journal-level citations. IF points of about 5-10 is also what journals have been shown to be able to negotiate with Thomson Reuters, e.g. CurrBiol 7-12, PLoS Medicine 2-11 or FASEB Journal 0.24-18. The very low r2 of IF with citations (between 0.1-0.2) also means that you have to increase IF by a lot to see a significant increase in citations.
And then you see that whatever slight differences you find between citations in different journals is overcompensated by the opposite trend of increasing unreliability with journal rank: IF is a far better predictor of unreliable research (retracted or not) than of citations:
In other words: the whole citation analysis is moot, because 1) the IF is negotiated, 2) the IF predicts unreliability much better than citations. While one apparently needs sophisticated statistical analyses to tease out any subtle citation-based differences between journals (and that only in very broad IF classes), very simple data show a significant increase of several measures of unreliability with IF.
tl;dr: IF can be useful, but only to identify the worst science: don’t trust unreplicated C/N/S papers.
Thanks for the comment Björn. It makes perfect sense that the science reported in high-IF journals, which select for the most novel findings, is less trustworthy. One cannot possibly have all the right answers immediately after a discovery. Firming things up, refining hypotheses and replication all take time. This seems intuitive to me. I think the problem is that papers are treated as “set in stone” and unassailable after publication. We need to be more realistic about the way science works and stop idolising work in any journal (but especially those at the top of the tree).
I have recommended another citation-based journal metric: the Journal Authority Factor (JAF) = the average h-index of the editors who decide what gets published in the journal (see GENETICS 199:637-8; PMID: 25740911). I suggest that is a better metric on which to judge the quality of journals.
Thanks for the comment Mark. I read your article when it came out and found it thought-provoking and a bit mischievous. Personally, I’ve had great experiences with Professional Editors who have minuscule h-indices and not-so-great experiences with Academic Editors who have huge Hirsch numbers (and vice versa). I guess a weakness of the JAF would be that it reflects who you could convince to sign up for editorial duties, i.e. it could be gamed. I also wonder if very successful scientists (with large h-indices) have time to do a good job of editing, in-between travelling the world to give talks, writing grants and papers… I like the idea of new alternative metrics for journals though.
But thanks to it being a poor metric, whether a researcher reports impact factors to 3 decimal places on their CV is a highly useful ‘tell’ for that researcher’s innumeracy: https://alexholcombe.wordpress.com/2015/04/23/a-tell-for-researcher-innumeracy/
One approach is simply to report standard errors alongside the impact factors:
Impact factors are actually a good proxy for journal acceptance rates which we don’t have a good easily accessible database for. So, I actually think they are quite useful.
Fantastic post! Thanks for taking the time to demonstrate, with data, the innumeracy of using the JIF.
I have been told that the mean is favored over the median because the median will result in lots of ties and thus fail to produce a continuous rank order for all journals. Needless to say, such ties would be a much more accurate reflection of reality, albeit still a rather imperfect one. Unfortunately, the wishful thinking of a neatly ordered, continuous ranking remains rather influential.
Good point! This is also the sentiment behind this post by Stefano Bertuzzi at ASCB http://www.ascb.org/a-false-sense-of-precision-what-happens-to-journal-impact-factor-jif-rankings-when-you-drop-a-decimal-place/
By the way, seems like there is an error in the plots for Nature. You must have included citations to editorial material in the calculations to get so many with zero citations. A check for 2014 Nature articles, (type = articles only) gets 828 articles, and the least cited of these have 5, 5, 5, 2 citations. Fully 808 of 828 (97.5%) have 10 or more citations already.
Hi, thanks for the comment. You are right, but it’s a bit more complicated than you are describing to reproduce the distribution, as discussed here https://quantixed.wordpress.com/2016/01/05/the-great-curve-ii-citation-distributions-and-reverse-engineering-the-jif/
Unless things have changed since these posts were written, distinguishing articles and reviews from the other stuff is difficult.