There have been calls for journals to publish the distribution of citations to the papers they publish (1, 2, 3). The idea is to turn the focus away from just one number – the Journal Impact Factor (JIF) – and to look at all the data. Some journals have responded by publishing the data that underlie the JIF (EMBO J, PeerJ, Royal Soc, Nature Chem). It would be great if more journals did this. Recently, Stuart Cantrill from Nature Chemistry went one step further and compared the distribution of citations at his journal with those at other chemistry journals. I really liked this post, and it made me think that I should go ahead, harvest the data for cell biology journals, and post it.
This post is in two parts. First, I’ll show the data for 22 journals. They’re broadly cell biology, but there’s something for everyone with Cell, Nature and Science all included. Second, I’ll describe how I “reverse engineered” the JIF to get to these numbers. The second part is a bit technical but it describes how difficult it is to reproduce the JIF and highlights some major inconsistencies for some journals. Hopefully it will also be of interest to anyone wanting to do a similar analysis.
Citation distributions for 22 cell biology journals
The JIF for 2014 (published in the summer of 2015) is worked out by counting the total number of 2014 citations to articles published in that journal in 2012 and 2013. This number is divided by the number of "citable items" the journal published in 2012 and 2013. There are other ways to look at citation data, and different windows to analyse, but this method is used here because it underlies the impact factor. I plotted histograms of the citation distributions at these journals from 0-50 citations; the inset shows the frequency of papers with 50-1000 citations.
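As a sanity check, the arithmetic is simple enough to sketch in a couple of lines. This is a toy Python illustration using the J Cell Sci figures from the summary table below, not the pipeline actually used for the analysis:

```python
# Sketch of the 2014 JIF arithmetic for J Cell Sci: citations received
# in 2014 by items published in 2012-2013, divided by the number of
# "citable items" published in those two years.
cites_2014 = 5894      # 2014 citations to 2012-2013 items
citable_items = 1085   # "citable items" published in 2012-2013

jif = cites_2014 / citable_items
print(round(jif, 3))  # 5.432
```

As the rest of this post shows, the numerator is easy to reproduce; it is the denominator that causes all the trouble.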
As you can see, the distributions are highly skewed, so reporting the mean is very misleading. Typically ~70% of papers pick up fewer than the mean number of citations. Reporting the median is safer and is shown below; it shows how similar most of the journals in this field are in terms of citations to the average paper. Another metric, which I like, is the H-index for journals. Google Scholar uses this as a journal metric (using citation data from a 5-year window). For a journal, this is the largest number, h, such that h papers have received at least h citations each. A plot of h-indices for these journals is shown below.
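For anyone wanting to compute these metrics themselves, here is a minimal Python sketch with made-up citation counts (the real analysis was done in IgorPro; `h_index` is a helper defined here, not a library function):

```python
# Journal-level metrics from a list of per-paper citation counts.
# The toy data below are skewed, like real citation distributions.
from statistics import mean, median

def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    return sum(1 for i, c in enumerate(ranked, start=1) if c >= i)

cites = [0, 0, 1, 2, 3, 4, 5, 8, 10, 120]
print(mean(cites))     # 15.3 -- dragged up by one highly cited paper
print(median(cites))   # 3.5
print(h_index(cites))  # 4
```

Note how a single highly cited paper inflates the mean well above what a typical paper in the list achieves, while the median and h-index barely move.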
Here’s a summary table of all of this information together with the “official JIF” data, which is discussed below.
| Journal | Median | H | Citations | Items | Mean | JIF Cites | JIF Items | JIF |
|---|---|---|---|---|---|---|---|---|
| Cell Stem Cell | 14 | 37 | 5192 | 302 | 17.2 | 5233 | 235 | 22.268 |
| Cell Mol Life Sci | 4 | 19 | 3364 | 596 | 5.6 | 3427 | 590 | 5.808 |
| J Cell Biol | 6 | 25 | 5586 | 720 | 7.8 | 5438 | 553 | 9.834 |
| J Cell Sci | 3 | 23 | 5995 | 1157 | 5.2 | 5894 | 1085 | 5.432 |
| Mol Biol Cell | 3 | 16 | 3415 | 796 | 4.3 | 3354 | 751 | 4.466 |
| Nat Cell Biol | 13 | 35 | 5381 | 340 | 15.8 | 5333 | 271 | 19.679 |
| Nat Rev Mol Cell Biol | 8.5 | 43 | 5037 | 218 | 23.1 | 4877 | 129 | 37.806 |
Reverse engineering the JIF
The analysis shown above was straightforward. However, getting the data to match Thomson-Reuters’ calculations for the JIF was far from easy.
I downloaded the citation data from Web of Science for the 22 journals. I limited the search to "articles" and "reviews" published in 2012 and 2013, and took the citations these papers received in 2014, with the aim of plotting out the distributions. As a first step, I calculated the mean citation count for each journal (a.k.a. the impact factor) to see how it compared with the official Journal Impact Factor (JIF). As you can see below, some were close and others were off by some margin.
| Journal | Calculated mean | Official JIF |
|---|---|---|
| Cell Stem Cell | 13.4 | 22.268 |
| Cell Mol Life Sci | 5.6 | 5.808 |
| J Cell Biol | 7.6 | 9.834 |
| J Cell Sci | 5.2 | 5.432 |
| Mol Biol Cell | 4.1 | 4.466 |
| Nat Cell Biol | 15.1 | 19.679 |
| Nat Rev Mol Cell Biol | 15.3 | 37.806 |
For most journals there was a large difference between this number and the official JIF (see below, left). This was not a huge surprise: I'd found previously that the JIF was very hard to reproduce (see also here). To try to understand the difference, I compared the total citations in my dataset with those from the official JIF. As you can see from the plot (right), my numbers are pretty much in agreement with those used for the JIF calculation. This meant that the difference must come from the denominator – the number of citable items.
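Because the citation totals agree, the denominator that Thomson-Reuters actually used can be backed out from the published numbers. A quick Python sketch using three rows of the summary table above:

```python
# If the numerators match, the JIF denominator can be recovered as
# round(JIF cites / JIF). Figures are from the summary table above.
official = {  # journal: (JIF cites, JIF)
    "Cell Stem Cell":        (5233, 22.268),
    "Mol Biol Cell":         (3354, 4.466),
    "Nat Rev Mol Cell Biol": (4877, 37.806),
}

for journal, (cites, jif) in official.items():
    print(journal, round(cites / jif))
# Recovers 235, 751 and 129 citable items respectively -- matching the
# "JIF Items" column, not the larger item counts found in WoS.
```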
What the plots show is that, for most journals in my dataset, there are fewer papers considered as citable items by Thomson-Reuters. This is strange. I had filtered the data to leave only journal articles and reviews (which are citable items), so non-citable items should have been removed.
Now, it's no secret that the papers cited in the numerator of the impact factor calculation are not necessarily the same as the papers counted in the denominator (see here, here and here). This inconsistency actually makes plotting a true distribution impossible. However, I thought that using the same dataset, filtering in the same way, and arriving at the correct total citation number meant that I had the correct list of citable items. So what could explain this difference?
I looked first at how big the difference in the number of citable items is. Journals like Nature and Science are missing >1000 items(!), others are missing fewer, and some, such as Traffic, EMBO J and Development, have the correct number. Remember that journals carry different numbers of papers. As a proportion of total papers, the biggest fraction of missing papers was actually from Autophagy and Cell Research, which were missing ~50% of the papers classified in WoS as "articles" or "reviews"!
My best guess at this stage was that items were incorrectly tagged in Web of Science. Journals like Nature, Science and Current Biology carry a lot of obituaries, letters and other material that can fairly be removed from the citable items count. But these should be classified as such in Web of Science and therefore filtered out by my original search. Also, these types of item don't explain the big disparity in journals like Autophagy, which carry only papers and reviews with a tiny bit of front matter.
I figured a good way forward would be to verify the numbers against another database – PubMed. Details of how I did this are at the foot of this post. This brought me much closer to the JIF "citable items" number for most journals. However, Autophagy, Current Biology and Science were still missing large numbers of papers. As a proportion of journal size, Autophagy, Cell Research and Current Biology were missing the most. Meanwhile, Nature Cell Biology and Nature Reviews Molecular Cell Biology now had more citable items in the JIF calculation than are found in PubMed!
This collection of data was used for the citation distributions shown above, but it highlights some major discrepancies at least for some journals.
How does Thomson Reuters decide what is a citable item?
Some of the reasons for deciding what counts as a citable item are outlined in this paper. All six of the reasons that are revealed seem reasonable, but they suggest that Thomson Reuters does not simply use the classification of papers in the Web of Science database. Without wanting to pick on Autophagy – it's simply the first journal alphabetically – I looked at which number was right: the PubMed count of 539 citable items published in 2012 and 2013, or the JIF count of 247. For the JIF number to be correct, this journal must publish only ~10 papers per issue, which doesn't seem right, at least from a quick glance at the first few issues of 2012.
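The per-issue arithmetic behind that claim is quick to check, assuming a monthly schedule (~24 issues across 2012-2013):

```python
# Back-of-envelope check on the Autophagy denominator, assuming the
# journal published roughly 24 issues over 2012-2013.
jif_items, pubmed_items, issues = 247, 539, 24

print(round(jif_items / issues, 1))     # papers/issue implied by JIF count
print(round(pubmed_items / issues, 1))  # papers/issue implied by PubMed count
```

The JIF denominator implies roughly ten papers per issue, while the PubMed count implies more than twice that.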
Why Thomson-Reuters removes some of these papers as non-citable items is a mystery… you can see from the histogram above that for Autophagy only 90 or so papers are uncited in 2014, so clearly the removed items are capable of picking up citations. If anyone has any ideas why the items were removed, please leave a comment.
Trying to understand what data go into the Journal Impact Factor calculation (for some, but not all, journals) is very difficult. This makes JIFs very hard to reproduce. As a general rule in science, we don't trust things that can't be reproduced, so why has the JIF persisted? I think most people realise by now that using this single number to draw conclusions about the excellence (or not) of a paper because it was published in a certain journal is madness. Looking at the citation distributions, it's clear that the majority of papers could be reshuffled between any of these journals and nobody would notice (see here for further analysis). We would all do better to read the paper and not worry about where it was published.
The post title is taken from “The Great Curve” by Talking Heads from their classic LP Remain in Light.
In PubMed, a research paper will have the publication type "journal article"; however, other items can also have this publication type. These items carry additional types, which can therefore be used for filtering. I retrieved all PubMed records from the journals published in 2012 and 2013 with publication type = "journal article". This worked for 21 journals; eLife is online-only, so the ppdat field code had to be changed to pdat.
("Autophagy"[ta] OR "Cancer Cell"[ta] OR "Cell"[ta] OR "Cell Mol Life Sci"[ta] OR "Cell Rep"[ta] OR "Cell Res"[ta] OR "Cell Stem Cell"[ta] OR "Curr Biol"[ta] OR "Dev Cell"[ta] OR "Development"[ta] OR "Elife"[ta] OR "Embo J"[ta] OR "J Cell Biol"[ta] OR "J Cell Sci"[ta] OR "Mol Biol Cell"[ta] OR "Mol Cell"[ta] OR "Nat Cell Biol"[ta] OR "Nat Rev Mol Cell Biol"[ta] OR "Nature"[ta] OR "Oncogene"[ta] OR "Science"[ta] OR "Traffic"[ta]) AND (("2012/01/01"[PPDat] : "2013/12/31"[PPDat])) AND journal article[pt:noexp]
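The same search can also be scripted against the NCBI E-utilities esearch endpoint rather than run through the PubMed web interface. A sketch, trimmed to three journals for brevity (this was not the retrieval route used for the post):

```python
# Build an E-utilities esearch URL for a trimmed version of the query
# above. The endpoint and parameters are the standard NCBI ones.
from urllib.parse import urlencode

BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
term = ('("Autophagy"[ta] OR "Cell"[ta] OR "Elife"[ta]) '
        'AND ("2012/01/01"[PPDat] : "2013/12/31"[PPDat]) '
        'AND journal article[pt:noexp]')

url = BASE + "?" + urlencode({"db": "pubmed", "term": term, "retmax": 100000})
print(url)
```

Fetching this URL (and following up with efetch in XML mode) returns the same records one would download manually.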
I saved this as an XML file and then pulled the values from the "publication type" key using Nokogiri/ruby (script). This gave me a list of the publication type combinations for each record. As a first step, I simply counted the number of journal articles for each journal and then subtracted anything that was also tagged as "biography", "comment", "portraits" etc. This could be done in IgorPro by making a wave indicating whether an item should be excluded (0 or 1), using the DOI as a lookup. This wave could then be used to exclude papers from the distribution.
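The same count-and-subtract step could be done directly in Python with the standard library instead of Nokogiri/Ruby plus IgorPro. A sketch against a minimal mock of the PubMed XML structure (the exclusion list here is illustrative, not the exact one used):

```python
# Count records whose publication types do not include any of the
# excluded categories. The XML snippet mimics the PubMed export layout.
import xml.etree.ElementTree as ET

EXCLUDE = {"Comment", "Biography", "Portraits", "Editorial"}

def count_citable(xml_text):
    root = ET.fromstring(xml_text)
    n = 0
    for article in root.iter("PubmedArticle"):
        types = {pt.text for pt in article.iter("PublicationType")}
        if not (types & EXCLUDE):
            n += 1
    return n

sample = """<PubmedArticleSet>
  <PubmedArticle><MedlineCitation><Article><PublicationTypeList>
    <PublicationType>Journal Article</PublicationType>
  </PublicationTypeList></Article></MedlineCitation></PubmedArticle>
  <PubmedArticle><MedlineCitation><Article><PublicationTypeList>
    <PublicationType>Journal Article</PublicationType>
    <PublicationType>Comment</PublicationType>
  </PublicationTypeList></Article></MedlineCitation></PubmedArticle>
</PubmedArticleSet>"""

print(count_citable(sample))  # 1 -- the second record is a comment
```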
For calculation of the number of missing papers as a proportion of size of journal, I used the number of items from WoS for the WoS calculation, and the JIF number for the PubMed comparison.
Related to this, this IgorPro procedure will read in csv files from WoS/WoK. As mentioned in the main text, data were downloaded 500 records at a time as csv from WoS, using journal titles as a search term and limiting to "article" or "review" published in 2012 and 2013. Note that limiting the search by year at the outset limits the citation data you get back: you need to search first to get citations from all years and then refine afterwards. The files can be stitched together with the cat command.
cat *.txt > merge.txt
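One caveat with plain cat is that every WoS export carries its own header row, so the merged file repeats it. A small Python sketch that keeps only the first header, assuming each export is a tab-delimited text file beginning with one header line (file names here are hypothetical):

```python
# Merge WoS export files, writing the header row only once. Assumes
# each input file starts with a single header line.
import glob

def merge_exports(pattern, out_path):
    header_written = False
    with open(out_path, "w") as out:
        for path in sorted(glob.glob(pattern)):
            with open(path) as f:
                header = f.readline()
                if not header_written:
                    out.write(header)
                    header_written = True
                out.write(f.read())

merge_exports("savedrecs*.txt", "merge.txt")
```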
Edit 8/1/16 @ 07:41 Jon Lane told me via Twitter that Autophagy publishes short commentaries on papers in other journals, called "Autophagic puncta" (you need to be a cell biologist to get this gag). He suggests these could be removed by Thomson Reuters for their calculation. This might explain the discrepancy for this journal. However, these items 1) cite other papers (so they contribute to JIF calculations), 2) get cited themselves (Jon says his own piece has been cited 18 times), so they are not non-citable items, and 3) are tagged as though they were papers or reviews in WoS and PubMed.
28 thoughts on “The Great Curve II: Citation distributions and reverse engineering the JIF”
Reblogged this on For Better Science and commented:
Steve Royle tries to calculate the official Thomson-Reuters Journal Impact Factors and gets much smaller numbers. He also found that the citation counts of the majority of papers are totally unrelated to where they were published.
Reblogged this on Green Tea and Velociraptors and commented:
Some great detective work here on the data behind journal impact factors
“Why Thomson-Reuters removes some of these papers as non-citable items is a mystery… […] If anyone has any ideas why the items were removed, please leave a comment.”
Publishers can negotiate with TR (or ISI before that) on what gets counted:
http://www.plosmedicine.org/article/info:doi/10.1371%2Fjournal.pmed.0030291 (you cite this)
http://blogarchive.brembs.net/comment-n817.html (on Current Biology, which still stands out in your analysis today, after the IF was re-negotiated after acquisition by Elsevier in 2003)
If Elsevier, PLoS and FASEB do it for their journals, couldn’t Taylor and Francis do it for Autophagy?
Thanks for the comment and for the links.
I just had an idea that would be even smarter than negotiating: using tags in the article that interfere with TR's aggregator so it would miss some articles. Should be technically feasible, if done right.
Interesting analysis. I’m intrigued by the fact that, despite the inconsistencies/errors/irregularities in the databases that you’ve uncovered, your calculated IFs are still highly correlated with the JIFs. That suggests that IF as a relative (if not absolute) measure is still useful in assessing the “quality” of a journal (with all of the usual caveats of course).
One thing I noticed in your histograms is that the first bar doesn’t actually line up with the zero on the x axis and I wondered if you’d just plotted papers with at least 1 citation (though your text indicates not).
“The Great Curve” is a great song!
“That suggests that IF as a relative (if not absolute) measure is still useful in assessing the “quality” of a journal (with all of the usual caveats of course).”
What kind of ‘quality’ do you mean?
– Utility as in “As my gazilion colleagues, I’ve also used this method (reference)”?
– Reliability as in “We have published in Nature, where only the world’s best science gets published, so trust us”.
– Visibility as in “As our competitors have published in Nature (reference), we work in an important field”
– Importance as in “We just cured cancer and Alzheimer using cold fusion (believe it or not)”
– Beauty as in “wow, this paper elegantly solved this problem in a logically consistent way and made sure all control experiments were presented in an easy to follow fashion”
Some of these qualities are indeed captured by JIF, others the inverse, again others not at all.
Or do you have a sixth/seventh way of understanding quality?
Whichever definition of "quality" one wishes – which is why I used quotes. Underlying most of these, though, is a rough correlation between the "quality" of a publication and the number of times it is cited, which scales up to the IF of the journal. As I said, with all of the usual caveats about why some papers are cited and others not, of course.
Well, but some of these five correlate, others don't, and again others correlate in the inverse direction. So if I 'wish' utility/quality, JIF may assess it. If I 'wish' reliability/quality, JIF assesses the opposite: higher JIF predicts worse reliability. If I take beauty/quality, it likely doesn't assess it at all.
So whether your statement above – that the JIF "is still useful in assessing the 'quality' of a journal" – is correct or false depends highly on what you 'wish'. For instance, if I 'wish' beauty, your statement is likely incorrect.
If you ‘wish’ “quality” to mean “number of citations”, why use the fraught word quality at all?
Clearly I was using “quality” as a short hand. But your comment that “higher JIF predicts worse reliability” is highly debatable. As far as I’m aware this has only been assessed for the bio-med sciences. There’s no indication that this is true in other fields (unless I’ve missed some studies?) I’m familiar with your FHN 2013 review of the topic and largely agree that IF is a poor measure of “quality” for a journal, but it remains the best we have at the moment, and one which is broadly understood to be flawed.
Ah, yes, I was referring to biomed/experimental sciences without emphasizing that, I apologize! This is a cell-biology blog post after all! 🙂 Hence, for the journals discussed here, it holds solid, even if you take the studies that appeared after our review into account 🙂
Some of the studies we cover are physical/chemical in nature and so the scope goes a little beyond biomed to experimental in general, but you are right, the data we cover is dominated by biomed.
I'm not aware that methodological quality can be assessed quantitatively in other, non-experimental fields?
I asked Tim Gowers to provide references on mathematics, or at least provide an idea of how one could quantify how well the work was done in his field, but he had neither a reference, nor an idea.
Likewise, there seem to be no such studies in paleontology, or so the few people I know there have told me.
Economists have been skeptical of their journal rank for probably longer than we biomeds, but I’m not aware of any quantitative data.
Computer scientists are somewhat critical of their “conference rank”, but this criticism appears to be less dominant or widespread than in the other fields.
The other non-experimental fields I haven’t asked or read about.
Thus, in the current absence of any way to objectively assess reliability outside of experimental work and in the presence of widespread suspicion that similar mechanisms seem to apply across the board, I’d tentatively conclude (until contradictory evidence can be obtained) that journal rank is equally predictive of lower reliability in other fields than those we were able to cover. I’m of course well aware that I may be making a generalization error here, but maybe making such over-generalizing claims will motivate people to do such studies in their fields 🙂
“I’m not aware methodological quality can be assessed quantitatively in other, non-experimental fields?”
Depends what you mean by “non-experimental fields”; my own field of ecology and evolutionary biology can be very experimental, in the strict sense of manipulating systems and comparing effects against appropriate controls. In contrast lots of bio-medical research is (strictly) non-experimental in that it’s purely observational, e.g. gene frequencies in human populations. But in any case it would certainly be possible to assess statistical power, effect sizes, etc. for studies in a field such as ecology.
I should say with “biomed” I’m referring to biology/medicine which would include your field to the extent you mentioned. For instance, a newer study on “animal experimentation” (but in PubMed, alas) found similar results:
Macleod MR, et al. (2015) Risk of Bias in Reports of In Vivo Research: A Focus for Improvement. doi:10.1371/journal.pbio.1002273
Clearly, as long as people use statistics, that methodology can be quantified, absolutely.
Thanks for the comment. They correlate in the sense that Cell, Nature and Science have a lot of highly cited papers and so they’re at the top, but as you go down the scale there’s virtually no difference (see previous post).
Well spotted about the x-axis. That first bin is 0-1. I used bin centring (/C flag for any Igor nerds out there) because I was going to do some curve fitting… which I didn’t end up doing.
Reblogged this on In the Dark and commented:
Here's a lengthy study of Journal Impact Factors. It's mainly about cell biology journals, but I think this applies across all scientific disciplines. The JIF is so flawed as to be meaningless, but this discussion suggests that the situation is even worse than that, with some advertised JIFs being wrong…
Your estimates of the IF seem quite close to the estimated cites per document (2y) reported by SJR, i.e., their estimates of the IF. Below is the link to the data for Science magazine: http://www.scimagojr.com/journalsearch.php?q=23571&tip=sid&clean=0
On a related note, I am quite impressed by the difference between the IF estimate and the corresponding pagerank estimates, based on the same data and produced by the same team. The data for science magazine show that clearly, and you can see a more comprehensive comparison for all journals that I made:
How did I miss that post on your blog? Very nice and incredibly useful. Did you make that with plotly?
Yes, I made it with plotly. I do all data processing with a perl script, and the script generates html that uses plotly to display the data interactively. I found this combination to be quite simple and effective.
Reblogged at https://nextmovesoftware.com/blog/2016/01/07/how-the-auc-of-a-roc-curve-is-like-the-journal-impact-factor/
Let me know if there’s any problem using the image as I’ve done.
Interesting detective work here, which I’ve come across through another blog.
All single-metric indicators of research quality are of course merely that – indicators – and all can be criticized. Indeed, using citations as a proxy for quality can also be heavily criticized (bandwagons/faddishness/favouring established outlets etc.). Accepting that citations are a valid indicator of quality, it still baffles me why the JIF has become the dominant metric – who agreed its definition? Its high annual sensitivity alone should ring alarm bells about its suitability. As with sjroyle, I think the journal H-index has a lot to commend it, but I see value in computing it over much longer periods.
At one level the study is highlighting the basic statistical property that the mean is very sensitive to outliers, while the median by definition is not. However, I would argue that the mean in this case is better because it will reward journals that somehow end up publishing some highly cited papers, i.e. they may be doing something right in the context of a system we all acknowledge has high inherent variance.
However, the really serious issue is the 'denominator' problem highlighted. I was surprised to learn that journals have any role in this – should this not be established objectively (I hesitate to say by a committee) using transparent methods? True, the rank correlation between the actual and estimated values doesn't seem too bad (from eyeballing), but if I had published a paper in, say, Autophagy, which is ranked 18th on JIF but should be ranked 10th, I would be pretty miffed. Similarly, if I had published in Cell Rep I might feel chuffed that it is officially ranked 10th even though its true rank should be 18th.
I won't be publishing in these journals as my discipline is business and management, but the issues are similar. There are so many journals, and we as authors, reviewers, editors, and our institutions and funders are all looking for good quality metrics.
If the JIF is going to persist, it should be both transparent and reproducible, so hopefully Thomson-Reuters will respond and agree a clear process for JIF calculations.
More generally, though, we should be looking for better journal quality metrics (like the H-index).
Thanks for the detailed comment. I definitely agree that longer citation windows are much more useful. The problem is that librarians (the intended market for JIFs) need current information on journal performance. Having said this, a two-year window with one year of citation data is too short for cell biology, where citations to papers peak in year 2 or 3.
I take your point about the mean, but the problem is that the mean doesn’t reflect well how most of the papers in the journal are doing in terms of cites.