The Great Curve: Citation distributions

This post follows on from a previous post on citation distributions and the wrongness of Impact Factor.

Stephen Curry had previously made the call that journals should “show us the data” that underlie the much-maligned Journal Impact Factor (JIF). However, this call made me wonder what “showing us the data” would look like and how journals might do it.

What citation distribution should we look at? The JIF looks at citations in a year to articles published in the preceding 2 years. This captures a period in a paper’s life, but it misses “slow burner” papers and also underestimates the impact of papers that just keep generating citations long after publication. I wrote a quick bit of code that would look at a decade’s worth of papers at one journal to see what happened to them as yearly cohorts over that decade. I picked EMBO J to look at since they have actually published their own citation distribution, and also they appear willing to engage with more transparency around scientific publication. Note that, when they published their distribution, it considered citations to papers via a JIF-style window over 5 years.
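For concreteness, the JIF-style calculation can be sketched in a few lines of Python. The real calculation uses all citable items at a journal; the data layout and the numbers below are made up purely for illustration.

```python
# A minimal sketch of the JIF-style window, using hypothetical numbers.
# cites[(pub_year, cite_year)] = total citations in cite_year to papers
# published in pub_year; papers[pub_year] = number of citable items.

def impact_factor(cites, papers, year):
    """JIF for `year`: citations in `year` to items from the two
    preceding years, divided by the number of those items."""
    numerator = cites[(year - 1, year)] + cites[(year - 2, year)]
    denominator = papers[year - 1] + papers[year - 2]
    return numerator / denominator

# Hypothetical numbers for illustration only.
cites = {(2012, 2014): 3000, (2013, 2014): 2500}
papers = {2012: 300, 2013: 320}
print(impact_factor(cites, papers, 2014))  # (3000 + 2500) / (300 + 320)
```

The point to notice is how narrow the window is: any citation falling outside the two preceding publication years simply does not count.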

I pulled 4082 papers with a publication date of 2004-2014 from Web of Science (the search was limited to Articles), along with data on citations received per year. I generated histograms to look at the distribution of citations for each year. Papers published in 2004 are in the top row and papers from 2014 are in the bottom row. The first histogram in each row shows citations in the same year as publication, the next column shows the following year, and so on. Number of papers is on the y-axis and number of citations is on the x-axis. Sorry for the lack of labelling! My excuse is that my code made a plot with “subwindows”, which I’m not too familiar with.
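In Python, the same grid could be built with something like the sketch below. My original code was in Igor Pro; here the per-paper citation counts are randomly generated stand-ins for the Web of Science export, and the “subwindow” layout corresponds to a matplotlib subplot grid.

```python
# A rough sketch of building the histogram grid.
# cites[pub_year][offset] holds per-paper citation counts `offset` years
# after publication; these are random stand-ins, not real data.
import numpy as np

rng = np.random.default_rng(0)
years = range(2004, 2015)

# Fake data: 100 papers per cohort, citations drawn from a Poisson.
cites = {y: [rng.poisson(3, size=100) for _ in range(2015 - y)] for y in years}

bins = np.arange(0, 41)  # truncate the x-axis at 40 citations
grid = {}
for y in years:
    for offset, counts in enumerate(cites[y]):
        grid[(y, offset)] = np.histogram(counts, bins=bins)[0]

# Each grid entry is one "subwindow": with matplotlib you would do
# fig, axes = plt.subplots(len(years), 11) and axes[row][col].bar(...)
print(grid[(2004, 0)].sum())  # papers from 2004 with <=40 cites that year
```

With real data, the only change is to replace the random arrays with the citation counts exported from Web of Science.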


What is interesting is that the distribution changes over time:

  • In the year of publication, most papers are not cited at all. This is expected: there is a lag before papers that cite the work can themselves be published, and papers that come out later in the year have even less chance of attracting a citation before the year ends.
  • The following year, most papers are picking up citations: the distribution moves rightwards.
  • Over the next few years, the distribution relaxes back leftwards as the citations die away.
  • The distributions are always skewed: a few papers get loads of citations, while most get very few.

Although I truncated the x-axis at 40 citations, there are a handful of papers that are picking up >40 cites per year up to 10 years after publication – clearly these are very useful papers!

To summarise these distributions I generated the median (and the mean – I know, I know) number of citations for each publication year-citation year combination and made plots.
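As a sketch, assuming the per-paper citation counts are keyed by (publication year, citation year) pairs (the numbers here are hypothetical, not the EMBO J data), the summary boils down to:

```python
# Summarise each cohort with its mean and median citation count.
# The data layout and numbers are assumptions for illustration.
from statistics import mean, median

cites = {
    (2004, 2004): [0, 0, 1, 2],
    (2004, 2005): [1, 3, 4, 20],
    (2004, 2006): [0, 2, 3, 9],
}

summary = {
    key: {"mean": mean(vals), "median": median(vals)}
    for key, vals in cites.items()
}
print(summary[(2004, 2005)])  # skew pulls the mean well above the median
```

The toy numbers also show why the mean deserves the “I know, I know”: one highly cited paper drags it far above the median.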


The mean is shown on the left and median on the right. The layout is the same as in the multi-histogram plot above.

Follow along a row and you can again see how each cohort of papers attracts citations, peaks and then dies away. You can also see that some years were better than others in terms of citations: 2004 and 2005 were good years, while 2007 was not so good. It is very difficult, if not impossible, to judge how the 2013 and 2014 papers will fare in the future.

What was the point of all this? Well, I think showing the citation data that underlie the JIF is a good start. However, citation data are more nuanced than the JIF allows for, so being able to choose how we look at the citations is important for understanding how a journal performs. Having some kind of widget that allows one to select the year(s) of papers to look at and the year(s) that the citations came from would be perfect, but this is beyond me. Otherwise, journals would probably elect to show us a distribution for a golden year (like 2004 in this case), or pick a window for comparison that looked highly favourable.
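The back-end of such a widget could be as simple as a function that pools citations for the selected years. This is a minimal sketch, assuming per-paper citation counts keyed by (publication year, citation year) pairs, with made-up numbers.

```python
# Pool per-paper citation counts for user-selected year combinations.
# The dict layout and the numbers are assumptions for illustration.
def pooled_citations(cites, pub_years, cite_years):
    """Concatenate per-paper citation counts for the selected
    publication-year / citation-year combinations."""
    pooled = []
    for (py, cy), counts in cites.items():
        if py in pub_years and cy in cite_years:
            pooled.extend(counts)
    return pooled

cites = {
    (2004, 2005): [1, 3, 20],
    (2005, 2006): [0, 2, 5],
    (2013, 2014): [0, 1],
}
# A JIF-flavoured selection: papers from 2004, citations from 2005.
print(pooled_citations(cites, {2004}, {2005}))  # [1, 3, 20]
```

A plotting front end (a slider for the years, a histogram of the pooled counts) would then make the distribution the user’s choice, not the journal’s.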

Finally, I think journals are unlikely to provide this kind of analysis. They should, if only because it is a chance for a journal to show how it publishes many papers that are really useful to the community. Anyway, maybe they don’t have to… What this quick analysis shows is that it can be (fairly) easily harvested and displayed. We could crowdsource this analysis using standardised code.

Below is the code that I used – it’s a bit rough and would need some work before it could be used generally. It also uses a 2D filtering method that was posted on IgorExchange by John Weeks.

The post title is taken from “The Great Curve” by Talking Heads from their classic LP Remain in Light.

Tips from the blog VIII: Time Machine on QNAP NAS

This is just a quick tip as it took me a little while to sort this out. In the lab we have two QNAP TS-869 Pro NAS devices. Each was set up with a single RAID6 storage pool, and I ran them as a primary and replica synced via rsync. We recently bought a bigger server, and so the plan was to repurpose one of the NAS boxes as a Time Machine target for all the computers in the lab.

We have around 10 computers that use Time Machine to back up the small documents that reside on each computer (primary data are always on the server). So far, I’d relied on external hard drives to do this. However: 1) they fail, 2) they can get unplugged accidentally and fail to back up, and 3) they take up space.

As always, the solution is simple. I’ll outline it below, along with the drawbacks, and then describe the other things I tried, to save you wasting time.

  1. Wipe the NAS. I went for RAID5, rather than RAID6. I figured this is safe enough. The NAS emails me if there is a problem with any of the disks and they can be replaced.
  2. Enable Time Machine. In the QNAP interface in Backup Server>Time Machine, click Enable Time Machine support. Set a limit if you like or 0 for no limit. Add a Username and Password.
  3. Pick the NAS as the Time Machine disk. On each Mac, wait for any running backup to complete and turn Time Machine off. In Time Machine Preferences, pick a new disk; it will see your QNAP NAS Time Machine share. Click on it, enter the username/password and click OK. Don’t select the option to use both disks (offered in Yosemite onwards).
  4. That’s it. Wait for backup to complete, check it. Unplug external HD and repurpose.

You don’t need each user to have a share or an account on the NAS. You don’t need to mount the volume on the Mac at startup. The steps above are all you need to do.

The major drawback is that all users share the same Time Machine space. In theory, one person could swamp the backup with their files, and this will limit how far back all users can go in Time Machine. The NAS is pretty big, so I think this will be OK. A rule for putting data/big files in a directory on a user’s Mac and then excluding this directory from the backup seems the obvious workaround.

What not to try

There is a guide on the web (on the QNAP website!) which shows you how to add Time Machine support for a Mac. It requires some command line action and is pretty complicated. Don’t do it. My guess is that this was the original hack and this has now been superseded by the “official” support on the QNAP interface. I’m not sure why they haven’t taken this guide down. There is another page on the site, outlining the steps above. Just use that.

I read on the QNAP forums about the drawback of using the official Time Machine backup (i.e. all users having to share). My brainwave was to create a user for each Mac, enable a home folder, limit the account with a quota, and then set the Mac to mount the volume at startup. This could then be the Time Machine disk and would allow me to limit the size of the backup on a per-user basis. It didn’t work! Time Machine needs a specially formatted volume, so it cannot use a home folder in this way. Maybe there is a way to get this to work – it would be a better solution – but I couldn’t manage it.

So there you go. If you have one of these devices, this is what worked for us.