Tips from the blog IX: running route

The University of Warwick is a popular conference destination, hosting thousands of visitors per year. Next time you visit and stay on campus, why not bring your running shoes and try out these routes?

Track 1

[Map of Track 1]

This is just over 10K and takes you from main campus out towards Cryfield Pavilion. A path leads to the Greenway (a former railway line), which is a nice flat gravel track. The route goes up to Burton Green and back to campus via Westwood Heath Road. To exit the Greenway at Burton Green you need to take the “offramp” at the bridge, otherwise you will end up heading towards Berkswell. If you want to run totally off-road*, just turn back at this point (probably ~12K in total). The path out to the Greenway and the Greenway itself are unlit, so be careful early in the morning or late at night.

GPX of a trace put together on gpsies.

Track 2

[Map of Track 2]

This is a variation on Track 1. Instead of heading up the Greenway to Burton Green, take a left and head towards Kenilworth Common. With a bit of navigation you can run alongside a brook and pop out in Abbey Fields to see the ruins of Kenilworth Abbey. This is out-and-back, 12K. Obviously you can turn back sooner if you prefer. It’s all off-road apart from a few hundred metres on quiet residential streets as you navigate from the Common to Abbey Fields. GPX from Uni to around the lake at Abbey Fields.

Track 3

[Map of Track 3]

This is a variation on Track 1 where you exit the Greenway and take a loop around Crackley Wood. The Wood is nice and has wild deer and other interesting wildlife. This route is totally off-road and is shorter at ~8K. GPX from Uni to around the Wood.

 

Other Routes

There is a footpath next to a bike lane along the A429, which is popular with runners heading to do a lap or two of Memorial Park in Coventry. This is OK, but it means that you run alongside cars a lot.

If you don’t have time for these routes, the official Warwick page has three very short running routes of around 3 to 5 km (1, 2 and 3). I think that these routes are the ones that are on the signpost near the Sports Centre.

* Here, off-road means on paths but not alongside a road on a pavement. It doesn’t mean across fields.

This post is part of a series of tips.

The Great Curve II: Citation distributions and reverse engineering the JIF

There have been calls for journals to publish the distribution of citations to the papers they publish (1 2 3). The idea is to turn the focus away from just one number – the Journal Impact Factor (JIF) – and to look at all the data. Some journals have responded by publishing the data that underlie the JIF (EMBO J, Peer J, Royal Soc, Nature Chem). It would be great if more journals did this. Recently, Stuart Cantrill from Nature Chemistry went one step further and compared the distribution of cites at his journal with other chemistry journals. I really liked this post and it made me think that I should just go ahead and harvest the data for cell biology journals and post it.

This post is in two parts. First, I’ll show the data for 22 journals. They’re broadly cell biology, but there’s something for everyone with Cell, Nature and Science all included. Second, I’ll describe how I “reverse engineered” the JIF to get to these numbers. The second part is a bit technical but it describes how difficult it is to reproduce the JIF and highlights some major inconsistencies for some journals. Hopefully it will also be of interest to anyone wanting to do a similar analysis.

Citation distributions for 22 cell biology journals

The JIF for 2014 (published in the summer of 2015) is worked out by counting the total number of 2014 cites to articles in a journal that were published in 2012 and 2013. This number is divided by the number of “citable items” published in that journal in 2012 and 2013. There are other ways to look at citation data, and different windows to analyse, but this method is used here because it underlies the impact factor. I plotted histograms of the citation distributions at these journals from 0-50 citations; the insets show the frequency of papers with 50-1000 cites.

[Histograms of 2014 citations to 2012-2013 papers for each of the 22 journals (0-50 citations; insets: 50-1000 citations)]
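
In code, the calculation boils down to a simple mean over the two-year window. A minimal sketch in Ruby, with invented citation counts purely for illustration:

cites = [0, 0, 1, 2, 3, 3, 5, 7, 9, 14, 22, 40]   # hypothetical 2014 cites to each 2012/2013 item
total_2014_cites = cites.sum                       # numerator of the impact factor
citable_items    = cites.length                    # denominator: the 2012-2013 "citable items"
jif_style_mean   = total_2014_cites.to_f / citable_items
puts format("%.3f", jif_style_mean)                # mean citations per item, i.e. a 2014-style impact factor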

As you can see, the distributions are highly skewed, so reporting the mean is very misleading: typically ~70% of papers pick up fewer than the mean number of citations. Reporting the median is safer and is shown below. It shows how similar most of the journals in this field are in terms of citations to the average paper. Another metric, which I like, is the H-index for journals. Google Scholar uses this as a journal metric (using citation data from a 5-year window). For a journal, this is the largest number, h, such that h of its papers received at least h citations. A plot of h-indices for these journals is shown below.

[Plot of median citations and h-index for each journal]
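
Both the median and the h-index are easy to compute from each journal’s list of per-paper citation counts. A rough Ruby sketch, again with made-up numbers (the real lists come from the Web of Science download described below):

cites = [0, 1, 1, 2, 3, 5, 8, 13, 40, 120]   # hypothetical per-paper citation counts

sorted = cites.sort
mid = sorted.length / 2
median = sorted.length.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0

# h-index: the largest h such that h papers have at least h citations each
h = cites.sort.reverse.each_with_index.take_while { |c, rank| c >= rank + 1 }.count

puts "median = #{median}, h = #{h}"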

Here’s a summary table of all of this information together with the “official JIF” data, which is discussed below.

Journal Median H WoS-Cites WoS-Items WoS-Mean JIF-Cites JIF-Items JIF
Autophagy 3 18 2996 539 5.6 2903 247 11.753
Cancer Cell 14 37 5241 274 19.1 5222 222 23.523
Cell 19 72 28147 1012 27.8 27309 847 32.242
Cell Rep 6 26 6141 743 8.3 5993 717 8.358
Cell Res 3 19 1854 287 6.5 2222 179 12.413
Cell Stem Cell 14 37 5192 302 17.2 5233 235 22.268
Cell Mol Life Sci 4 19 3364 596 5.6 3427 590 5.808
Curr Biol 4 24 6751 1106 6.1 7293 762 9.571
Development 5 25 6069 930 6.5 5861 907 6.462
Dev Cell 7 23 3986 438 9.1 3922 404 9.708
eLife 5 20 2271 306 7.4 2378 255 9.322
EMBO J 8 27 5828 557 10.5 5822 558 10.434
J Cell Biol 6 25 5586 720 7.8 5438 553 9.834
J Cell Sci 3 23 5995 1157 5.2 5894 1085 5.432
Mol Biol Cell 3 16 3415 796 4.3 3354 751 4.466
Mol Cell 11 37 8669 629 13.8 8481 605 14.018
Nature 12 105 69885 2758 25.3 71677 1729 41.296
Nat Cell Biol 13 35 5381 340 15.8 5333 271 19.679
Nat Rev Mol Cell Biol 8.5 43 5037 218 23.1 4877 129 37.806
Oncogene 5 26 6973 1038 6.7 8654 1023 8.459
Science 14 83 54603 2430 22.5 56231 1673 33.611
Traffic 3 11 1020 252 4.0 1018 234 4.350

 

Reverse engineering the JIF

The analysis shown above was straightforward. However, getting the data to match Thomson-Reuters’ calculations for the JIF was far from easy.

I downloaded the citation data from Web of Science for the 22 journals. I limited the search to “articles” and “reviews” published in 2012 and 2013, and took the citation data from papers published in 2014 with the aim of plotting out the distributions. As a first step I calculated the mean citation count for each journal (i.e. an impact factor) to see how it compared with the official Journal Impact Factor (JIF). As you can see below, some were close and others were off by some margin.

Journal Calculated IF JIF
Autophagy 5.4 11.753
Cancer Cell 14.8 23.523
Cell 23.9 32.242
Cell Rep 8.2 8.358
Cell Res 5.7 12.413
Cell Stem Cell 13.4 22.268
Cell Mol Life Sci 5.6 5.808
Curr Biol 5.0 9.571
Development 6.5 6.462
Dev Cell 7.5 9.708
eLife 6.0 9.322
EMBO J 10.5 10.434
J Cell Biol 7.6 9.834
J Cell Sci 5.2 5.432
Mol Biol Cell 4.1 4.466
Mol Cell 11.8 14.018
Nature 25.1 41.296
Nat Cell Biol 15.1 19.679
Nat Rev Mol Cell Biol 15.3 37.806
Oncogene 6.7 8.459
Science 18.6 33.611
Traffic 4.0 4.35

For most journals there was a large difference between this number and the official JIF (see below, left). This was not a huge surprise: I’d found previously that the JIF was very hard to reproduce (see also here). To try to understand the difference, I looked at the total citations in my dataset vs those used for the official JIF. As you can see from the plot (right), my numbers are pretty much in agreement with those used for the JIF calculation. This meant that the difference comes from the denominator – the number of citable items.

[Plots: calculated IF vs official JIF (left); total citations in my dataset vs citations used for the JIF (right)]

What the plots show is that, for most journals in my dataset, there are fewer papers considered as citable items by Thomson-Reuters. This is strange. I had filtered the data to leave only journal articles and reviews (which are citable items), so non-citable items should have been removed.

Now, it’s no secret that the papers cited in the sum on the top of the impact factor calculation are not necessarily the same as the papers counted on the bottom (see here, here and here). This inconsistency actually makes plotting a distribution impossible. However, I thought that because I was using the same dataset, had filtered it in the same way, and had arrived at the correct total citation number, I also had the correct list of citable items. So, what could explain this difference?

[Plot of missing citable items per journal, as counts and as a proportion of journal size]

I looked first at how big the difference in the number of citable items is. Journals like Nature and Science are missing >1000 items(!), others are missing fewer, and some such as Traffic, EMBO J and Development have the correct number. Remember that journals carry different numbers of papers, so as a proportion of total papers the biggest fraction of missing papers was actually from Autophagy and Cell Research, which were missing ~50% of the papers classified in WoS as “articles” or “reviews”!

My best guess at this stage was that items were incorrectly tagged in Web of Science. Journals like Nature, Science and Current Biology carry a lot of obituaries, letters and other material that can fairly be removed from the citable items count. But these should be classified as such in Web of Science and therefore filtered out by my original search. Also, these types of item don’t explain the big disparity in journals like Autophagy, which carry only papers and reviews, with a tiny bit of front matter.

[Plot comparing PubMed article counts with JIF citable items for each journal]

I figured a good way forward would be to verify the numbers with another database – PubMed. Details of how I did this are at the foot of this post. This brought me much closer to the JIF “citable items” number for most journals. However, Autophagy, Current Biology and Science are still missing large numbers of papers. As a proportion of the size of the journal, Autophagy, Cell Research and Current Biology are missing the most. Meanwhile, Nature Cell Biology and Nature Reviews Molecular Cell Biology now have more citable items in the JIF calculation than are found in PubMed!

This collection of data was used for the citation distributions shown above, but it highlights some major discrepancies at least for some journals.

How does Thomson Reuters decide what is a citable item?

Some of the reasons for deciding what is a citable item are outlined in this paper. Of the six reasons that are revealed, all seem reasonable, but they suggest that Thomson Reuters does not simply use the classification of papers in the Web of Science database. Without wanting to pick on Autophagy – it’s simply the first journal alphabetically – I looked at which was right: the PubMed number of 539 or the JIF number of 247 citable items published in 2012 and 2013. For the JIF number to be correct, this journal must have published only ~10 papers per issue (247 items spread across two years of monthly issues), which doesn’t seem right, at least from a quick glance at the first few issues of 2012.

Why Thomson Reuters removes some of these papers as non-citable items is a mystery… you can see from the histogram above that for Autophagy only 90 or so papers were uncited in 2014, so clearly the removed items are capable of picking up citations. If anyone has any ideas why the items were removed, please leave a comment.

Summary

Trying to understand what data go into the Journal Impact Factor calculation (for some, but not all, journals) is very difficult. This makes JIFs very hard to reproduce. As a general rule in science, we don’t trust things that can’t be reproduced, so why has the JIF persisted? I think most people realise by now that using this single number to draw conclusions about the excellence (or not) of a paper because it was published in a certain journal is madness. Looking at the citation distributions, it’s clear that the majority of papers could be reshuffled between any of these journals and nobody would notice (see here for further analysis). We would all do better to read the paper and not worry about where it was published.

The post title is taken from “The Great Curve” by Talking Heads from their classic LP Remain in Light.

In PubMed, a research paper will have the publication type “journal article”; however, other items can also have this publication type. Those items carry additional types, which can therefore be used to filter them out. I retrieved all PubMed records from the journals published in 2012 and 2013 with publication type = “journal article”. This worked for 21 of the journals; eLife is online-only, so the ppdat field code had to be changed to pdat.


("Autophagy"[ta] OR "Cancer Cell"[ta] OR "Cell"[ta] OR "Cell Mol Life Sci"[ta] OR "Cell Rep"[ta] OR "Cell Res"[ta] OR "Cell Stem Cell"[ta] OR "Curr Biol"[ta] OR "Dev Cell"[ta] OR "Development"[ta] OR "Elife"[ta] OR "Embo J"[ta] OR "J Cell Biol"[ta] OR "J Cell Sci"[ta] OR "Mol Biol Cell"[ta] OR "Mol Cell"[ta] OR "Nat Cell Biol"[ta] OR "Nat Rev Mol Cell Biol"[ta] OR "Nature"[ta] OR "Oncogene"[ta] OR "Science"[ta] OR "Traffic"[ta]) AND (("2012/01/01"[PPDat] : "2013/12/31"[PPDat])) AND journal article[pt:noexp]

I saved this as an XML file and then pulled the values from the “publication type” key using Nokogiri/ruby (script). I then had a list of all the publication type combinations for each record. As a first step I simply counted the number of journal articles for each journal and then subtracted anything that was also tagged as “biography”, “comment”, “portraits”, etc. This could be done in IgorPro by making a wave indicating whether an item should be excluded (0 or 1) using the DOI as a lookup. This wave could then be used to exclude papers from the distribution.
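
The actual script is linked above; a simplified Ruby/Nokogiri sketch of the same idea looks something like this (the element names follow the standard PubMed XML export, and the exclusion list is illustrative, not exhaustive):

require 'nokogiri'

# publication types that mark an item as front matter rather than a paper (illustrative list)
EXCLUDE = ["Biography", "Comment", "Portraits", "Editorial"]

doc = Nokogiri::XML(File.read("pubmed_result.xml"))

counts = Hash.new(0)
doc.xpath("//PubmedArticle").each do |article|
  journal = article.at_xpath(".//MedlineTA")&.text || "unknown"
  types   = article.xpath(".//PublicationType").map(&:text)
  next unless types.include?("Journal Article")   # keep research papers and reviews
  next if (types & EXCLUDE).any?                  # drop items that are really front matter
  counts[journal] += 1
end

counts.sort.each { |journal, n| puts "#{journal}\t#{n}" }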

For calculation of the number of missing papers as a proportion of size of journal, I used the number of items from WoS for the WoS calculation, and the JIF number for the PubMed comparison.

Related to this, this IgorPro procedure will read in csv files from WoS/WoK. As mentioned in the main text, data were downloaded 500 records at a time as csv from WoS, using journal titles as the search term and limiting to “article” or “review” and to 2012 and 2013. Note that limiting the search by year at the outset limits the citation data you get back: you need to search first to get citations from all years and then refine afterwards. The files can be stitched together with the cat command.


cat *.txt > merge.txt

Edit 8/1/16 @ 07:41 Jon Lane told me via Twitter that Autophagy publishes short commentaries on papers in other journals called “Autophagic puncta” (you need to be a cell biologist to get this gag). He suggests these could be removed by Thomson Reuters for their calculation. This might explain the discrepancy for this journal. However, these items 1) cite other papers (so they contribute to JIF calculations), 2) get cited themselves (Jon says his own piece has been cited 18 times), so they are not non-citable items, and 3) are tagged as though they are papers or reviews in WoS and PubMed.

What Difference Does It Make?

A few days ago, Retraction Watch published the top ten most-cited retracted papers. The post included a bar chart to visualise these citations, but it didn’t quite capture what effect (if any) a retraction has on citations. I thought I’d quickly plot this out for the number one article on the list.

[Plot of citations over time to the most-cited retracted paper]

The plot is pretty depressing. The retraction has had no effect on citations. Note that the retraction notice has racked up 125 citations, which could mean that at least some of the ~1000 citations to the original article that came after the retraction acknowledge the fact that the article has been pulled.

The post title is taken from “What Difference Does it Make?” by The Smiths from ‘The Smiths’ and ‘Hatful of Hollow’.