The Great Curve: Citation distributions

This post follows on from a previous post on citation distributions and the wrongness of Impact Factor.

Stephen Curry had previously made the call that journals should “show us the data” that underlie the much-maligned Journal Impact Factor (JIF). However, this call made me wonder what “showing us the data” would look like and how journals might do it.

What citation distribution should we look at? The JIF looks at citations in a year to articles published in the preceding 2 years. This captures a period in a paper’s life, but it misses “slow burner” papers and also underestimates the impact of papers that just keep generating citations long after publication. I wrote a quick bit of code that would look at a decade’s worth of papers at one journal to see what happened to them as yearly cohorts over that decade. I picked EMBO J to look at since they have actually published their own citation distribution, and also they appear willing to engage with more transparency around scientific publication. Note that, when they published their distribution, it considered citations to papers via a JIF-style window over 5 years.

I pulled 4082 papers with a publication date of 2004-2014 from Web of Science (the search was limited to Articles) along with data on citations that occurred per year. I generated histograms to look at distribution of citations for each year. Papers published in 2004 are in the top row, papers from 2014 are in the bottom row. The first histogram shows citations in the same year as publication, in the next column, the following year and so-on. Number of papers is on y and on x the number of citations. Sorry for the lack of labelling! My excuse is that my code made a plot with “subwindows”, which I’m not too familiar with.


What is interesting is that the distribution changes over time:

  • In the year of publication, most papers are not cited at all, which is expected since there is a lag to publication of papers which can cite the work and also some papers do not come out until later in the year, meaning the likelihood of a citing paper coming out decreases as the year progresses.
  • The following year most papers are picking up citations: the distribution moves rightwards.
  • Over the next few years the distribution relaxes back leftwards as the citations die away.
  • The distributions are always skewed. Few papers get loads of citations, most get very few.

Although I truncated the x-axis at 40 citations, there are a handful of papers that are picking up >40 cites per year up to 10 years after publication – clearly these are very useful papers!

To summarise these distributions I generated the median (and the mean – I know, I know) number of citations for each publication year-citation year combination and made plots.


The mean is shown on the left and median on the right. The layout is the same as in the multi-histogram plot above.

Follow along a row and you can again see how the cohort of papers attracts citations, peaks and then dies away. You can also see that some years were better than others in terms of citations, 2004 and 2005 were good years, 2007 was not so good. It is very difficult, if not impossible, to judge how 2013 and 2014 papers will fare into the future.

What was the point of all this? Well, I think showing the citation data that underlie the JIF is a good start. However, citation data are more nuanced than the JIF allows for. So being able to choose how we look at the citations is important to understand how a journal performs. Having some kind of widget that allows one to select the year(s) of papers to look at and the year(s) that the citations came from would be perfect, but this is beyond me. Otherwise, journals would probably elect to show us a distribution for a golden year (like 2004 in this case), or pick a window for comparison that looked highly favourable.

Finally, I think journals are unlikely to provide this kind of analysis. They should, if only because it is a chance for a journal to show how it publishes many papers that are really useful to the community. Anyway, maybe they don’t have to… What this quick analysis shows is that it can be (fairly) easily harvested and displayed. We could crowdsource this analysis using standardised code.

Below is the code that I used – it’s a bit rough and would need some work before it could be used generally. It also uses a 2D filtering method that was posted on IgorExchange by John Weeks.

The post title is taken from “The Great Curve” by Talking Heads from their classic LP Remain in Light.

Creep Diets: Fewer papers published at JCB

JCBdietA couple of years ago, a colleague sent me this picture* to say “who put J Cell Biol on a diet?”. I joked that maybe they publish too many autophagy papers and didn’t think much more of it.

Recently, Ron Vale put up this very interesting piece on bioRxiv discussing what it takes to publish a paper in the field of cell biology these days. In the main, he questions whether this is now out of reach of many trainees in our labs. It raises some great points and I recommend reading it.

One (of many) interesting stats in the article is that J Cell Biol now publishes fewer papers than it used to. Which made me think back to the photo and wonder why there has been a decline. Elsewhere, Vale notes that a cell biology paper now contains >2 the amount of data than papers of yesteryear. I’ve also written before about the creeping increase in the number of authors per paper at J Cell Biol and (more so) at Cell. Publication in Science is something of an arms race and his point is really that the amount of data, the time taken, the effort/people involved has got to an untenable level.

The data in the preprint is a bit limited as he only looks at two snapshots in time – because he looks at two cohorts of students at UCSF. So I thought I’d look at the decrease in JCB papers over time – did it really fall off? by how much? when did it start?.


Getting the data is straightforward. In fact, PubMed will give you a csv of frequency of papers for a given search term (it even shows you a snapshot in the main search window). I wanted a bit more control, so I exported the records for JCB and NCB. I filtered out interviews and commentary as best as I could and plotted out the records as two histograms using a bin width of 6 months. It’s pretty clear that J Cell Biol is indeed publishing fewer papers now than it used to. It looks like the trend started around 2002, possibly accelerating in the last 5 years (the photo agrees with this). The six month output at JCB in 2015 is similar to what it was in 1975!

In the comments section of the preprint, there is a bit of discussion of why this may be. Overall, there are more and more papers being published every year. There’s no reason to think that the number of cell biology papers has remained static or fallen. So if J Cell Biol have not taken a decision to limit the number of papers, why is there a decline? One commenter suggests  Nature Cell Biology has “taken” some of these papers. So I plotted those numbers out too. The number of papers at NCB is capped and has been constant since the launch of the journal. It does look like NCB could be responsible, but it’s a complex question. Personally, I think it’s unlikely. When NCB was launched this marked a period of expansion in the number of scientific journals and it’s likely that the increase in number of venues that a paper can go to (rather than the creation of NCB per se) has affected publication at JCB. One simple cause could be financial, i.e. the page number being limited by RUP. If this is true, why not move the journal online? There’s so many datasets and movies in papers these days that it barely makes sense to print JCB any more.

I love reading papers in JCB. They are sufficiently detailed so that you know what’s going on. They’re definitely on Cell Biology, not some tangential area of molecular biology. The Editors are active cell biologists and it has had a long history of publishing some truly landmark discoveries in our field. For these reasons, I’m sad that there are fewer JCB papers these days. If it’s an editorial decision to try to make the journal more exclusive, this is even more regrettable. I wonder if the Editors feel that they just don’t get enough high quality papers. If this is the case, then maybe the expectations for what a paper “should be” need to be brought back in line with reality. Which is one of the points that Ron Vale is making in his article.

* I cropped the picture to remove some identifying things on the bookshelf.

Update @ 07:07 17/7/15: Rebecca Alvinia from JCB had left a comment on Ron Vale’s piece on bioRxiv to say that JCB are not purposely limiting the number of papers. Fillip Port then asked why JCB does not take preprints. Rebecca has now replied saying that following a change of policy, J Cell Biol and the other RUP journals will take preprinted papers. This is great news!

Creep Diets is the title track from the second album by the oddly named Fudge Tunnel, released on Earache Records in 1993

Wrong Number: A closer look at Impact Factors

This is a long post about Journal Impact Factors. Thanks to Stephen Curry for encouraging me to post this.


  • the JIF is based on highly skewed data
  • it is difficult to reproduce the JIFs from Thomson-Reuters
  • JIF is a very poor indicator of the number of citations a random paper in the journal received
  • reporting a JIF to 3 d.p. is ridiculous, it would be better to round to the nearest 5 or 10.

I really liked this recent tweet from Stat Fact

It’s a great illustration of why reporting means for skewed distributions is a bad idea. And this brings us quickly to Thomson-Reuters’ Journal Impact Factor (JIF).

I can actually remember the first time I realised that the JIF was a spurious metric. This was in 2003, after reading a letter to Nature from David Colquhoun who plotted out the distribution of citations to a sample of papers in Nature. Up until that point, I hadn’t appreciated how skewed these data are. We put it up on the lab wall.


Now, the JIF for a given year is calculated as follows:

A JIF for 2013 is worked out by counting the total number of 2013 cites to articles in that journal that were published in 2011 and 2012. This number is divided by the number of “citable items” in that journal in 2011 and 2012.

There are numerous problems with this calculation that I don’t have time to go into here. If we just set these aside for the moment, the JIF is still used widely today and not for the purpose it was originally intended. Eugene Garfield, created the metric to provide librarians with a simple way to prioritise subscriptions to Journals that carried the most-cited scientific papers. The JIF is used (wrongly) in some institutions in the criteria for hiring, promotion and firing. This is because of the common misconception that the JIF is a proxy for the quality of a paper in that journal. Use of metrics in this manner is opposed by the SF-DORA and I would encourage anyone that hasn’t already done so, to pledge their support for this excellent initiative.

Why not report the median rather than the mean?

With the citation distribution in mind, why do Thomson-Reuters calculate the mean rather than the median for the JIF? It makes no sense at all. If you didn’t quite understand why from the @statfact tweet above, then look at this:

ActaJIFThe Acta Crystallographica Section A effect. The plot shows that this journal had a JIF of 2.051 in 2008 which jumped to 49.926 in 2009 due to a single highly-cited paper. Did every other paper in this journal suddenly get amazingly awesome and highly-cited for this period? Of course not. The median is insensitive to outliers like this.

The answer to why Thomson-Reuters don’t do this is probably for ease of computation. The JIF (mean) requires only three numbers for each journal, whereas calculating the median would require citation information for each paper under consideration for each journal. But it’s not that difficult (see below). There’s also a mismatch in the items that bring in citations to the numerator and those that count as “citeable items” in the denominator. This opacity is one of the major criticisms of the Impact Factor and this presents a problem for them to calculate the median.

Let’s crunch some citation numbers

I had a closer look at citation data for a small number of journals in my field. DC’s citation distribution plot was great (in fact, superior to JIF data) but it didn’t capture the distribution that underlies the JIF. I crunched the IF2012 numbers (released in June 2013) sometime in December 2013. This is shown below. My intention was to redo this analysis more fully in June 2014 when the IF2013 was released, but I was busy, had lost interest and the company said that they would be more open with the data (although I’ve not seen any evidence for this). I wrote about partial impact factors instead, which took over my blog. Anyway, the analysis shown here is likely to be similar for any year and the points made below are likely to hold.

I mainly looked at Nature, Nature Cell Biology, Journal of Cell Biology, EMBO Journal and J Cell Science. Using citations in 2012 articles to papers published in 2010 and 2011, i.e. the same criteria as for IF2012.

The first thing that happens when you attempt this analysis is that you realise how unreproducible the Thomson-Reuters JIFs are. This has been commented on in the past (e.g. here), yet I had the same data as the company uses to calculate JIFs and it was difficult to see how they had arrived at their numbers. After some wrangling I managed to get a set of papers for each journal that gave close to the same JIF.


From this we can look at the citation distribution within the dataset for each journal. Below is a gallery of these distributions. You can see that the data are highly skewed. For example, JCB has kurtosis of 13.5 and a skewness of 3. For all of these journals ~2/3 of papers had fewer than the mean number of citations. With this kind of skew, it makes more sense to report the median (as described above). Note that Cell is included here but was not used in the main analysis.

So how do these distributions look when compared? I plotted each journal compared to JCB. They are normalised to account for the differing number of papers in each dataset. As you can see they are largely overlapping.


If the distributions overlap so much, how certain can we be that a paper in a journal with a high JIF will have more citations than a paper in a journal with a lower JIF? In other words, how good is the JIF (mean or median) at predicting how many citations a paper published in a certain journal is likely to have?

To look at this, I ran a Monte Carlo analysis comparing a random paper from one journal with a random one from JCB and looked at the difference in number of citations. Papers in EMBO J are indistinguishable from JCB. Papers in JCS have very slightly fewer citations than JCB. Most NCB papers have a similar number of cites to papers in JCB, but there is a tail of papers with higher cites, a similar but more amplified picture for Nature.


Thomson-Reuters quotes the JIF to 3 d.p. and most journals use this to promote their impact factor (see below). The precision of 3 d.p. is ridiculous when two journals with IFs of 10.822 and 9.822 are indistinguishable when it comes to the number of citations to randomly sampled papers in that journal.

So how big do differences in JIF have to be in order to be able to tell a “Journal X paper” from a “Journal Y paper” (in terms of citations)?

To look at this I ran some comparisons between the journals in order to get some idea of “significant differences”. I made virtual issues of each journal with differing numbers of papers (5,10,20,30) and compared the citations in each via Wilcoxon rank text and then plotted out the frequency of p-values for 100 of these tests. Please leave a comment if you have a better idea to look at this. I liked this method over the head-to-head comparison for two papers as it allows these papers the benefit of the (potential) reflected glory of other papers in the journal. In other words, it is closer to what the JIF is about.

OK, so this shows that sufficient sample size is required to detect differences, no surprise there. But at N=20 and N=30 the result seems pretty clear. A virtual issue of Nature trumps a virtual issue of JCB, and JCB beats JCS. But again, there is no difference between JCB and EMBO J. Finally, only ~30% of the time would a virtual issue of NCB trump JCB for citations! NCB and JCB had a difference in JIF of  almost 10 (20.761 vs 10.822). So not only is quoting the JIF to 3 d.p. ridiculous, it looks like rounding the JIF to the nearest 5 (or 10) might be better!

This analysis supports the idea that there are different tiers of journal (in Cell Biology at least). But the JIF is the bluntest of tools to separate these journals. A more rigorous analysis is needed to demonstrate this more clearly but it is not feasible to do this while having a dataset which agrees with that of Thomson-Reuters (without purchasing the data from the company).

If you are still not convinced about how shortcomings of the JIF, here is a final example. The IF2013 for Nature increased from 38.597 to 42.351. Let’s have a look at the citation distributions that underlie this rise of 3.8! As you can see below they are virtually identical. Remember that there’s a big promotion that the journal uses to pull in new subscribers, seems a bit hollow somehow doesn’t it? Disclaimer: I think this promotion is a bit tacky, but it’s actually a really good deal… the News stuff at the front and the Jobs section at the back alone are worth ~£40.

Show us the data!

More skewed distributions: The distribution of JIFs in the Cell Biology Category for IF2012 is itself skewed. Median JIF is 3.2 and Mean JIF is 4.8.

Recently, Stephen Curry has called for Journals to report the citation distribution data rather than parroting their Impact Factor (to 3 d.p.). I agree with this. The question is though – what to report?

  • The IF window is far too narrow (2 years + 1 year of citations) so a broader window would be more useful.
  • A comparison dataset from another journal is needed in order to calibrate ourselves.
  • Citations are problematic – not least because they are laggy. A journal could change dramatically and any citation metric would not catch up for ~2 years.
  • Related to this some topics are hot and others not. I guess we’re most interested in how a paper in Journal X compares to others of its kind.
  • Any information reported needs to be freely available for re-analysis and not in the hands of a company. Google Scholar is a potential solution but it needs to be more open with its data. They already have a journal ranking which provides a valuable and interesting alternative view to the JIF.

One solution would be to show per article citation profiles comparing these for similar papers. How do papers on a certain topic in Journal X compare to not only those in Journal Y but to the whole field? In my opinion, this metric would be most useful when assessing scholarly output.


Thanks for reading to the end (or at least scrolling all the way down). The take home points are:

  • the JIF is based on highly skewed data.
  • the median rather than the mean is better for summarising such distributions.
  • JIF is a very poor indicator of the number of citations a random paper in the journal received!
  • reporting a JIF to 3 d.p. is ridiculous, it would be better to round to the nearest 5 or 10.
  • an open resource for comparing citation data per journal would be highly valuable.

The post title is taken from “Wrong Number” by The Cure. I’m not sure which album it’s from, I only own a Greatest Hits compilation.

Waiting to Happen: Publication lag times in Cell Biology Journals

My interest in publication lag times continues. Previous posts have looked at how long it takes my lab to publish our work, how often trainees publish and I also looked at very long lag times at Oncogene. I recently read a blog post on automated calculation of publication lag times for Bioinformatics journals. I thought it would be great to do this for Cell Biology journals too. Hopefully people will find it useful and can use this list when thinking about where to send their paper.

What is publication lag time?

If you are reading this, you probably know how science publication works. Feel free to skip. Otherwise, it goes something like this. After writing up your work for publication, you submit it to a journal. Assuming that this journal will eventually publish the paper (there is usually a period of submitting, getting rejected, resubmitting to a different journal etc.), they receive the paper on a certain date. They send it out to review, they collate the reviews and send back a decision, you (almost always) revise your paper further and then send it back. This can happen several times. At some point it gets accepted on a certain date. The journal then prepares the paper for publication in a scheduled issue on a specific date (they can also immediately post papers online without formatting). All of these steps add significant delays. It typically takes 9 months to publish a paper in the biomedical sciences. In 2015 this sounds very silly, when world-wide dissemination of information is as simple as a few clicks on a trackpad. The bigger problem is that we rely on papers as a currency to get jobs or funding and so these delays can be more than just a frustration, they can affect your ability to actually do more science.

The good news is that it is very straightforward to parse the received, accepted and published dates from PubMed. So we can easily calculate the publication lags for cell biology journals. If you don’t work in cell biology, just follow the instructions below to make your own list.

The bad news is that the deposition of the date information in PubMed depends on the journal. The extra bad news is that three of the major cell biology journals do not deposit their data: J Cell Biol, Mol Biol Cell and J Cell Sci. My original plan was to compare these three journals with Traffic, Nat Cell Biol and Dev Cell. Instead, I extended the list to include other journals which take non-cell biology papers (and deposit their data).


A summary of the last ten years

Three sets of box plots here show the publication lags for eight journals that take cell biology papers. The journals are Cell, Cell Stem Cell, Current Biology, Developmental Cell, EMBO Journal, Nature Cell Biology, Nature Methods and Traffic (see note at the end about eLife). They are shown in alphabetical order. The box plots show the median and the IQR, whiskers show the 10th and 90th percentiles. The three plots show the time from Received-to-Published (Rec-Pub), and then a breakdown of this time into Received-to-Accepted (Rec-Acc) and Accepted-to-Published (Rec-Pub). The colours are just to make it easier to tell the journals apart and don’t have any significance.

You can see from these plots that the journals differ widely in the time it takes to publish a paper there. Current Biology is very fast, whereas Cell Stem Cell is relatively slow. The time it takes the journals to move them from acceptance to publication is pretty constant. Apart from Traffic where it takes an average of ~3 months to get something in to print. Remember that the paper is often online for this period so this is not necessarily a bad thing. I was not surprised that Current Biology was the fastest. At this journal, a presubmission inquiry is required and the referees are often lined up in advance. The staff are keen to publish rapidly, hence the name, Current Biology. I was amazed at Nature Cell Biology having such a short time from Received-to-Acceptance. The delay in Review-to-Acceptance comes from multiple rounds of revision and from doing extra experimental work. Anecdotally, it seems that the review at Nature Cell Biol should be just as lengthy as at Dev Cell or EMBO J. I wonder if the received date is accurate… it is possible to massage this date by first rejecting the paper, but allowing a resubmission. Then using the resubmission date as the received date [Edit: see below]. One way to legitimately limit this delay is to only allow a certain time for revisions and only allow one round of corrections. This is what happens at J Cell Biol, unfortunately we don’t have this data to see how effective this is.


How has the lag time changed over the last ten years?

Have the slow journals always been slow? When did they become slow?  Again three plots are shown (side-by-side) depicting the Rec-Pub and then the Rec-Acc and Acc-Pub time. Now the intensity of red or blue shows the data for each year (2014 is the most intense colour). Again you can see that the dataset is not complete with missing date information for Traffic for many years, for example.

Interestingly, the publication lag has been pretty constant for some journals but not others. Cell Stem Cell and Dev Cell (but not the mothership – Cell) have seen increases as have Nature Cell Biology and Nature Methods. On the whole Acc-Pub times are stable, except for Nature Methods which is the only journal in the list to see an increase over the time period. This just leaves us with the task of drawing up a ranked list of the fastest to the slowest journal. Then we can see which of these journals is likely to delay dissemination of our work the most.

The Median times (in days) for 2013 are below. The journals are ranked in order of fastest to slowest for Received-to-Publication. I had to use 2013 because EMBO J is missing data for 2014.

Journal Rec-Pub Rec-Acc Acc-Pub
Curr Biol 159 99.5 56
Nat Methods 192 125 68
Cell 195 169 35
EMBO J 203 142 61
Nature Cell Biol 237 180 59
Traffic 244 161 86
Dev Cell 247 204 43
Cell Stem Cell 284 205 66

You’ll see that only Cell Stem Cell is over the threshold where it would be faster to conceive and give birth to a human being than to publish a paper there (on average). If the additional time wasted in submitting your manuscript to other journals is factored in, it is likely that most papers are at least on a par with the median gestation time.

If you are wondering why eLife is missing… as a new journal it didn’t have ten years worth of data to analyse. It did have a reasonably complete set for 2013 (but Rec-Acc only). The median time was 89 days, beating Current Biology by 10.5 days.


Please check out Neil Saunders’ post on how to do this. I did a PubMed search for (journal1[ta] OR journal2[ta] OR ...) AND journal article[pt] to make sure I didn’t get any reviews or letters etc. I limited the search from 2003 onwards to make sure I had 10 years of data for the journals that deposited it. I downloaded the file as xml and I used Ruby/Nokogiri to parse the file to csv. Installing Nokogiri is reasonably straightforward, but the documentation is pretty impenetrable. The ruby script I used was from Neil’s post (step 3) with a few lines added:


require 'nokogiri'

f =
doc = Nokogiri::XML(f)

doc.xpath("//PubmedArticle").each do |a|
r = ["", "", "", "", "", "", "", "", "", "", ""]
r[0] = a.xpath("MedlineCitation/Article/Journal/ISOAbbreviation").text
r[1] = a.xpath("MedlineCitation/PMID").text
r[2] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Year").text
r[3] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Month").text
r[4] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Day").text
r[5] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Year").text
r[6] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Month").text
r[7] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Day").text
r[8] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/Pubdate/Year").text
r[9] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/Pubdate/Month").text
r[10] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/Pubdate/Day").text
puts r.join(",")

and then executed as described. The csv could then be imported into IgorPro and processed. Neil’s post describes a workflow for R, or you could use Excel or whatever at this point. As he notes, quite a few records are missing the date information and some of it is wrong, i.e. published before it was accepted. These need to be cleaned up. The other problem is that the month is sometimes an integer and sometimes a three-letter code. He uses lubridate in R to get around this, a loop-replace in Igor is easy to construct and even Excel can handle this with an IF statement, e.g. IF(LEN(G2)=3,MONTH(1&LEFT(G2,3)),G2) if the month is in G2. Good luck!

Edit 9/3/15 @ 17:17 several people (including Deborah Sweet and Bernd Pulverer from Cell Press/Cell Stem Cell and EMBO, respectively) have confirmed via Twitter that some journals use the date of resubmission as the submitted date. Cell Stem Cell and EMBO journals use the real dates. There is no way to tell whether a journal does this or not (from the deposited data). Stuart Cantrill from Nature Chemistry pointed out that his journal do declare that they sometimes reset the clock. I’m not sure about other journals. My own feeling is that – for full transparency – journals should 1) record the actual dates of submission, acceptance and publication, 2) deposit them in PubMed and add them to the paper. As pointed out by Jim Woodgett, scientists want the actual dates on their paper, partly because they are the real dates, but also to claim priority in certain cases. There is a conflict here, because journals might appear inefficient if they have long publication lag times. I think this should be an incentive for Editors to simplify revisions by giving clear guidance and limiting successive revision cycles. (This Edit was corrected 10/3/15 @ 11:04).

The post title is taken from “Waiting to Happen” by Super Furry Animals from the “Something 4 The Weekend” single.

Half Right

I was talking to a speaker visiting our department recently. While discussing his postdoc work from years ago, he told me about the identification of the sperm factor that causes calcium oscillations in the egg at fertilisation. It was an interesting tale because the group who eventually identified the factor – now widely accepted as PLCzeta – had earlier misidentified the factor, naming it oscillin.

The oscillin paper was published in Nature in 1996 and the subsequent (correct) paper was published in Development in 2002. I wondered what the citation profiles of these papers looks like now.


As you can see there was intense interest in the first paper that quickly petered out, presumably when people found out that oscillin was a contaminant and not the real factor. The second paper on the other hand has attracted a large number of citations and continues to do so 12 years later – a sign of a classic paper. However, the initial spike in citations was not as high as the Nature paper.

The impact factor of Nature is much higher than that of Development. I’ve often wondered if this is due to a sociological phenomenon: people like to cite Cell/Nature/Science papers rather than those at other journals and this bumps up the impact factor. Before you comment, yes I know there are other reasons, but the IFs do not change much over time and I wonder whether journal hierarchy explains the hardiness of IFs over time. Anyway, these papers struck me as a good test of the idea… Here we have essentially the same discovery, reported by the same authors. The only difference here is the journal (and that one paper is six years after the other). Normally it is not possible to test if the journal influences citations because a paper cannot erased and republished somewhere else. The plot suggests that Nature papers inherently attract much more cites than those in Development, presumably because of the exposure of publishing there. From the graph, it’s not difficult to see that even if a paper turns out not to be right, it can still boost the IF of the journal during the window of assessment. Another reason not to trust journal impact factors.

I can’t think of any way to look at this more systematically to see if this phenomenon holds true. I just thought it was interesting, so I’ll leave it here.

The post title is taken from Half Right by Elliott Smith from the posthumous album New Moon. Bootlegs have the title as Not Half Right, which would also be appropriate.

What The World Is Waiting For

The transition for scientific journals from print to online has been slow and painful. And it is not yet complete. This week I got an RSS alert to a “new” paper in Oncogene. When I downloaded it, something was familiar… very familiar… I’d read it almost a year ago! Sure enough, the AOP (ahead of print or advance online publication) date for this paper was September 2013 and here it was in the August 2014 issue being “published”.

I wondered why a journal would do this. It is possible that delaying actual publication would artificially boost the Impact Factor of a journal because there is a delay before citations roll in and citations also peak after two years. So if a journal delays actual publication, then the Impact Factor assessment window captures a “hotter” period when papers are more likely to generate more citations*. Richard Sever (@cshperspectives) jumped in to point out a less nefarious explanation – the journal obviously has a backlog of papers but is not allowed to just print more papers to catch up, due to page budgets.

There followed a long discussion about this… which you’re welcome to read. I was away giving a talk and missed all the fun, but if I may summarise on behalf of everybody: isn’t it silly that we still have pages – actual pages, made of paper – and this is restricting publication.

I wondered how Oncogene got to this position. I retrieved the data for AOP and actual publication for the last five years of papers at Oncogene excluding reviews, from Pubmed. Using oncogene[ta] NOT review[pt] as a search term. The field DP has the date published (the “issue date” that the paper appears in print) and PHST has several interesting dates including [aheadofprint]. These could be parsed and imported into IgorPro as 1D waves. The lag time from AOP to print could then be calculated. I got 2916 papers from the search and was able to get data for 2441 papers.

OncogeneLagTimeYou can see for this journal that the lag time has been stable at around 300 days (~10 months) for issues published since 2013. So a paper AOP in Feb 2012 had to wait over 10 months to make it into print. This followed a linear period of lag time growth from mid-2010.

I have no links to Oncogene and don’t particularly want to single them out. I’m sure similar lags are happening at other print journals. Actually, my only interaction with Oncogene was that they sent this paper of ours out to review in 2011 (it got two not-negative-but-admittedly-not-glowing reviews) and then they rejected it because they didn’t like the cell line we used. I always thought this was a bizarre decision: why couldn’t they just decide that before sending it to review and wasting our time? Now, I wonder whether they were not keen to add to their increasing backlog of papers at their journal? Whatever the reason, it has put me off submitting other papers there.

I know that there are good arguments for continuing print versions of journals, but from a scientist’s perspective the first publication is publication. Any subsequent versions are simply redundant and confusing.

*Edit: Alexis Verger (@Alexis_Verger) pointed me to a paper which describes that, for neuroscience journals, the lag time has increased over time. Moreover, the authors suggest that this is for the purpose of maximising Journal Impact Factor.

The post title comes from the double A-side Fools Gold/What The World Is Waiting For by The Stone Roses.

Six Plus One

Last week, ALM (article-level metric) data for PLoS journals were uploaded to Figshare with the invitation to do something cool with it.

Well, it would be rude not to. Actually, I’m one of the few scientists on the planet that hasn’t published a paper with Public Library of Science (PLoS), so I have no personal agenda here. However, I love what PLoS is doing and what it has achieved to disrupt the scientific publishing system. Anyway, what follows is not in any way comprehensive, but I was interested to look at a few specific things:

  1. Is there a relationship between Twitter mentions and views of papers?
  2. What is the fraction of views that are PDF vs HTML?
  3. Can citations be predicted by more immediate article level metrics?

The tl;dr version is 1. Yes. 2. ~20%. 3. Can’t say but looks unlikely.

1. Twitter mentions versus paper views

All PLoS journals are covered. The field containing paper views is (I think) “Counter” this combines views of HTML and PDF (see #2). A plot of Counter against Publication Date for all PLoS papers (upper plot) shows that the number of papers published has increased dramatically since the introduction of PLoS ONE in 2007. There is a large variance in number of views, which you’d expect and also, the views tail off for the most recent papers, since they have had less time to accumulate views. Below is the same plot where the size and colour of the markers reflects their Twitter score (see key). There’s a sharp line that must correspond to the date when Twitter data was logged as an ALM. There’s a scattering of mentions after this date to older literature, but one 2005 paper stands out – Ioannidis’s paper Why Most Published Research Findings Are False. It has a huge number of views and a large twitter score, especially considering that it was a seven year old paper when they started recording the data. A pattern emerges in the post-logging period. Papers with more views are mentioned more on Twitter. The larger darker markers are higher on the y-axis. Mentioning a paper on Twitter is sure to generate views of the paper, at some (unknown) conversion rate. However, as this is a single snapshot, we don’t know if Twitter mentions drive more downloads of papers, or whether more “interesting”/highly downloaded work is talked about more on Twitter.


2. Fraction of PDF vs HTML views

I asked a few people what they thought the download ratio is for papers. Most thought 60-75% as PDF versus 40-25% HTML. I thought it would be lower, but I was surprised to see that it is, at most, 20% for PDF. The plot below shows the fraction of PDF downloads (counter_pdf/(counter_pdf+counter_html)). For all PLoS journals, and then broken down for PLoS Biol, PLoS ONE.

PDF-FractionThis was a surprise to me. I have colleagues who don’t like depositing post-print or pre-print papers because they say that they prefer their work to be seen typeset in PDF format. However, this shows that, at least for PLoS journals, the reader is choosing to not see a typeset PDF at all, but a HTML version.

Maybe the PLoS PDFs are terribly formatted and 80% people don’t like them. There is an interesting comparison that can be done here, because all papers are deposited at Pubmed Central (PMC) and so the same plot can be generated for the PDF fraction there. The PDF format is different to PLoS and so we can test the idea that people prefer HTML over PDF at PLoS because they don’t like the PLoS format.


The fraction of PDF downloads is higher, but only around 30%. So either the PMC format is just as bad, or this is the way that readers like to consume the scientific literature. A colleague mentioned that HTML views are preferable to PDF if you want to actually want to do something with the data, e.g. for meta-analysis. This could have an effect. HTML views could be skim reading, whereas PDF is for people who want to read in detail… I wonder whether these fractions are similar at other publishers, particularly closed access publishers?

3. Citation prediction?

ALMs are immediate whereas citations are slow. If we assume for a moment that citations are a definitive means to determine the impact of a paper (which they may not be), then can ALMs predict citations? This would make them very useful in the evaluation of scientists and their endeavours. Unfortunately, this dataset is not sufficient to answer this properly, but with multiple timepoints, the question could be investigated. I looked at number of paper downloads and also the Mendeley score to see how these two things may foretell citations. What follows is a strategy to do this is an unbiased way with few confounders.

scopus v citesThe dataset has a Scopus column, but for some reason these data are incomplete. It is possible to download data (but not on this scale AFAIK) for citations from Web of Science and then use the DOI to cross-reference to the other dataset. This plot shows the Scopus data as a function of “Total Citations” from Web of Science, for 500 papers. I went with the Web of Science data as this appears more robust.

The question is whether there is a relationship between downloads of a paper (Counter, either PDF or HTML) and citations. Or between Mendeley score and citations. I figured that downloading, Mendeley and citation, show three progressive levels of “commitment” to a paper and so they may correlate differently with citations. Now, to look at this for all PLoS journals for all time would be silly because we know that citations are field-specific, journal-specific, time-sensitive etc. So I took the following dataset from Web of Science: the top 500 most-cited papers in PLoS ONE for the period of 2007-2010 limited to “cell biology”. By cross-referencing I could check the corresponding values for Counter and for Mendeley.


I was surprised that the correlation was very weak in both cases. I thought that the correlation would be stronger with Mendeley, however signal-to-noise is a problem here with few users of the service compared with counting downloads. Below each plot is a ranked view of the papers, with the Counter or Mendeley data presented as a rolling average. It’s a very weak correlation at best. Remember that this is post-hoc. Papers that have been cited more would be expected to generate more views and higher Mendeley scores, but this is not necessarily so. Predicting future citations based on Counter or Mendeley, will be tough. To really know if this is possible, this approach needs to be used with multiple ALM timepoints to see if there is a predictive value for ALMs, but based on this single timepoint, it doesn’t seem as though prediction will be possible.

Again, looking at this for a closed access journal would be very interesting. The most-downloaded paper in this set, had far more views (143,952) than other papers cited a similar number of times (78). The paper was this one which I guess is of interest to bodybuilders! Presumably, it was heavily downloaded by people who probably are not in a position to cite the paper. Although these downloads didn’t result in extra citations, this paper has undeniable impact outside of academia. Because PLoS is open access, the bodybuilders were able to access the paper, rather than being met by a paywall. Think of the patients who are trying to find out more about their condition and can’t read any of the papers… The final point here is that ALMs have their own merit, irrespective of citations, which are the default metric for judging the impact of our work.

Methods: To crunch the numbers for yourself, head over to Figshare and download the csv. A Web of Science subscription is needed for the citation data. All the plots were generated in IgorPro, but no programming is required for these comparisons and everything I’ve done here can be easily done in Excel or another package.

Edit: Matt Hodgkinson (@mattjhodgkinson) Snr Ed at PLoS ONE told me via Twitter that all ALM data (periodically updated) are freely available here. This means that some of the analyses I wrote about are possible.

The post title comes from Six Plus One a track on Dad Man Cat by Corduroy. Plus is as close to PLoS as I could find in my iTunes library.

You Know My Name (Look Up The Number)

What is your h-index on Twitter?

This thought crossed my mind yesterday when I saw a tweet that was tagged #academicinsults

It occurred to me that a Twitter account is a kind of micro-publishing platform. So what would “publication metrics” look like for Twitter? Twitter makes analytics available, so they can easily be crunched. The main metrics are impressions and engagements per tweet. As I understand it, impressions are the number of times your tweet is served up to people in their feed (boosted by retweets). Engagements are when somebody clicks on the tweet (either a link or to see the thread or whatever). In publication terms, impressions would equate to people downloading your paper and engagements mean that they did something with it, like cite it. This means that a “h-index” for engagements can be calculated with these data.

For those that don’t know, the h-index for a scientist means that he/she has h papers that have been cited h or more times. The Twitter version would be a tweeter that has h tweets that were engaged with h or more times. My data is shown here:

TwitterAnalyticsMy twitter h-index is currently 36. I have 36 tweets that have been engaged with 36 or more times.

So, this is a lot higher than my actual h-index, but obviously there are differences. Papers accrue citations as time goes by, but the information flow on Twitter is so fast that tweets don’t accumulate engagement over time. In that sense, the Twitter h-index is less sensitive to the time a user has been active on Twitter, versus the real h-index which is strongly affected by age of the scientist. Other differences include the fact that I have “published” thousands of tweets and only tens of papers. Also, whether or not more people read my tweets compared to my papers… This is not something I want to think too much about, but it would affect how many engagements it is possible to achieve.

The other thing I looked at was whether replying to somebody actually means more engagement. This would skew the Twitter h-index. I filtered tweets that started with an @ and found that this restricts who sees the tweet, but doesn’t necessarily mean more engagement. Replies make up a very small fraction of the h tweets.

I’ll leave it to somebody else to calculate the Impact Factor of Twitter. I suspect it is very low, given the sheer volume of tweets.

Please note this post is just for fun. Normal service will (probably) resume in the next post.

Edit: As pointed out in the comments this post is short on “Materials and Methods”. If you want to calculate your ownTwitter h-index, go here. When logged in to Twitter, the analytics page should present your data (it may take some time to populate this page after you first view it). A csv can be downloaded from the button on the top-right of the page. I imported this into IgorPro (as always) to generate the plots. The engagements data need to be sorted in descending order and then the h-index can be found by comparing the numbers with their ranked position.

The post title is from the quirky B-side to the Let It Be single by The Beatles.

Strange Things – update

My post on the strange data underlying the new impact factor for eLife was read by many people. Thanks for the interest and for the comments and discussion that followed. I thought I should follow up on some of the issues raised in the post.

To recap:

  1. eLife received a 2013 Impact Factor despite only publishing 27 papers in the last three months of the census window. Other journals, such as Biology Open did not.
  2. There were spurious miscites to papers before eLife published any papers. I wondered whether this resulted in an early impact factor.
  3. The Web of Knowledge database has citations from articles in the past referring to future articles!

1. Why did eLife get an early Impact Factor? It turns out that there is something called a partial Impact Factor.  This is where an early Impact Factor is awarded to some journals in special cases. This is described here in a post at Scholarly Kitchen. Cell Reports also got an early Impact Factor and Nature Methods got one a few years ago (thanks to Daniel Evanko for tweeting about Nature Methods’ partial Impact Factor). The explanation is that if a journal is publishing papers that are attracting large numbers of citations it gets fast-tracked for an Impact Factor.

2. In a comment, Rafael Santos pointed out that the miscites were “from a 2013 eLife paper to an inexistent 2010 eLife paper, and another miscite from a 2013 PLoS Computational Biology paper to an inexistent 2011 eLife paper”. The post at Scholarly Kitchen confirms that citations are not double-checked or cleaned up at all by Thomson-Reuters. It occurred to me that journals looking to game their Impact Factor could alter the year for citations to papers in their own journal in order to inflate their Impact Factor. But no serious journal would do that – or would they?

3. This is still unexplained. If anybody has any ideas (other than time travel) please leave a comment.

Strange Things

I noticed something strange about the 2013 Impact Factor data for eLife.

Before I get onto the problem. I feel I need to point out that I dislike Impact Factors and think that their influence on science is corrosive. I am a DORA signatory and I try to uphold those principles. I admit that, in the past, I used to check the new Impact Factors when they were released, but no longer. This year, when the 2013 Impact Factors came out I didn’t bother to log on to take a look. A chance Twitter conversation with Manuel Théry (@ManuelTHERY) and Christophe Leterrier (@christlet) was my first encounter with the new numbers.

Huh? eLife has an Impact Factor?

For those that don’t know, the 2013 Impact Factor is worked out by counting the total number of 2013 cites to articles in a given journal that were published in 2011 and 2012. This number is divided by the number of “citable items” in that journal in 2011 and 2012.

Now, eLife launched in October 2012. So it seems unfair that it gets an Impact Factor since it only published papers for 12.5% of the window under scrutiny. Is this normal?

I looked up the 2013 Impact Factor for Biology Open, a Company of Biologists journal that launched in January 2012* and… it doesn’t have one! So why does eLife get an Impact Factor but Biology Open doesn’t?**

elife-JIFLooking at the numbers for eLife revealed that there were 230 citations in 2013 to eLife papers in 2011 and 2012. One of which was a mis-citation to an article in 2011. This article does not exist (the next column shows that there were no articles in 2011). My guess is that Thomson Reuters view this as the journal existing for 2011 and 2012, and therefore deserving of an Impact Factor. Presumably there are no mis-cites in the Biology Open record and it will only get an Impact Factor next year. Doesn’t this call into question the veracity of the database? I have found other errors in records previously (see here). I also find it difficult to believe that no-one checked this particular record given the profile of eLife.

elfie-citesPerhaps unsurprisingly, I couldn’t track down the rogue citation. I did look at the cites to eLife articles from all years in Web of Science, the Thomson Reuters database (which again showed that eLife only started publishing in Oct 2012). As described before there are spurious citations in the database. Josh Kaplan’s eLife paper on UNC13/Tomosyn managed to rack up 5 citations in 2004, some 9 years before it was published (in 2013)! This was along with nine other papers that somehow managed to be cited in 2004 before they were published. It’s concerning enough that these data are used for hiring, firing and funding decisions, but if the data are incomplete or incorrect this is even worse.

Summary: I’m sure the Impact Factor of eLife will rise as soon as it has a full window for measurement. This would actually be 2016 when the 2015 Impact Factors are released. The journal has made it clear in past editorials (and here) that it is not interested in an Impact Factor and won’t promote one if it is awarded. So, this issue makes no difference to the journal. I guess the moral of the story is: don’t take the Impact Factor at face value. But then we all knew that already. Didn’t we?

* For clarity, I should declare that we have published papers in eLife and Biology Open this year.

** The only other reason I can think of is that eLife was listed on PubMed right away, while Biology Open had to wait. This caused some controversy at the time. I can’t see why a PubMed listing should affect Impact Factor. Anyhow, I noticed that Biology Open got listed in PubMed by October 2012, so in the end it is comparable to eLife.

Edit: There is an update to this post here.

Edit 2: This post is the most popular on Quantixed. A screenshot of visitors’ search engine queries (Nov 2014)…


The post title is taken from “Strange Things” from Big Black’s Atomizer LP released in 1986.