I read this recent paper about very highly cited papers and science funding in the UK. The paper itself was not very good, but the dataset which underlies the paper is something to behold, as I’ll explain below.
The idea behind the paper was to examine very highly cited papers in biomedicine with a connection to the UK. Have those authors been successful in getting funding from the MRC, the Wellcome Trust or the NIHR? The authors find that some of the authors of these very highly cited papers are not funded by these sources. Note that these funders are some, but not all, of the science funding bodies in the UK. The authors also looked at panel members of those three funders, and report that these individuals are funded at high rates and that the overlap between panel membership and very highly cited authorship is very low. I don’t want to critique the paper extensively, but the conclusions drawn are rather blinkered, for a few reasons:

1. MRC, NIHR and Wellcome support science in ways other than direct funding of individuals (e.g. PhD programmes, infrastructure).
2. The contribution of other funders, e.g. BBSRC, was ignored.
3. Panels tend to be selected from the pool of awardees, rather than the other way around.

I understand that the authors’ motivation is to stimulate debate around whether science funding is effective, and this is welcome, but the paper strays too far into clickbait territory for my tastes.
The most interesting thing about the analysis (and arguably its main shortcoming) was the dataset. The authors took the papers in Scopus which have been cited >1000 times. This is ~450 papers as of last week. As I found out when I recreated their dataset, this is a freakish set of papers. Of course weird things can be found when looking at outliers.

The authors describe a one-line search term they used to retrieve papers from Scopus. These papers span 2006 to the present day (the query requires PUBYEAR > 2005) and were then filtered for UK origin.
```
LANGUAGE(english) AND PUBYEAR > 2005
  AND (LIMIT-TO(SRCTYPE, "j")) AND (LIMIT-TO(DOCTYPE, "ar"))
  AND (LIMIT-TO(SUBJAREA, "MEDI") OR LIMIT-TO(SUBJAREA, "BIOC")
    OR LIMIT-TO(SUBJAREA, "PHAR") OR LIMIT-TO(SUBJAREA, "IMMU")
    OR LIMIT-TO(SUBJAREA, "NEUR") OR LIMIT-TO(SUBJAREA, "NURS")
    OR LIMIT-TO(SUBJAREA, "HEAL") OR LIMIT-TO(SUBJAREA, "DENT"))
```
I’m not sure how accurate the dataset is in terms of finding papers of UK origin, but the point here is to look at the dataset and not to critique the paper.
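If you wanted to recreate the set programmatically rather than through the web interface, a minimal sketch using the pybliometrics wrapper for the Scopus Search API might look like the following. This assumes you have an institutional API key configured, and note that the LIMIT-TO() refinements are a web-interface feature, so I have translated them into plain field restrictions, which I believe is the API equivalent:

```python
# Minimal sketch: recreate the search via the Scopus Search API.
# Assumes pybliometrics is installed and configured with an API key
# (recent versions need `import pybliometrics; pybliometrics.init()` first).
from pybliometrics.scopus import ScopusSearch

# LIMIT-TO() is a web-interface refinement; SRCTYPE/DOCTYPE/SUBJAREA
# field codes are (I assume) the equivalent in an API query.
query = (
    "LANGUAGE(english) AND PUBYEAR > 2005 "
    "AND SRCTYPE(j) AND DOCTYPE(ar) "
    "AND (SUBJAREA(MEDI) OR SUBJAREA(BIOC) OR SUBJAREA(PHAR) "
    "OR SUBJAREA(IMMU) OR SUBJAREA(NEUR) OR SUBJAREA(NURS) "
    "OR SUBJAREA(HEAL) OR SUBJAREA(DENT))"
)

s = ScopusSearch(query)
papers = s.results or []  # list of namedtuples, one per record

# The >1000-citation "freaks"; citedby_count comes back as a string
highly_cited = [p for p in papers if int(p.citedby_count or 0) > 1000]
print(f"{len(papers)} records, {len(highly_cited)} with >1000 citations")
```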
I downloaded the first 20,000 records (a limit imposed by Scopus). I think posting the dataset here would break the terms of use, but if your institution has a subscription, it can be recreated. The top paper has 16,549 citations! The 20,000th paper has accrued 122 citations, and 450 papers have >1000 citations, as of last week.
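For anyone following along, here is a sketch of the sanity checks on the export. The column names (“Cited by”, “Year”, “Title”) are what I’d expect from a Scopus CSV export, and the filename is hypothetical, so adjust to your own file:

```python
# Sketch: check the headline numbers from a Scopus CSV export.
# Column names ("Cited by", "Year", "Title") are assumptions based on
# a typical Scopus export; the filename is hypothetical.
import pandas as pd

df = pd.read_csv("scopus_export.csv")
df = df.sort_values("Cited by", ascending=False).reset_index(drop=True)

print("Top paper:", df.loc[0, "Title"], "-", int(df.loc[0, "Cited by"]), "cites")
print("Papers with >1000 citations:", int((df["Cited by"] > 1000).sum()))
```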
Now, some papers are older than others, so to get a better picture of the hottest among these freaky papers, I calculated the average citation rate by dividing total cites by the number of years since publication. The two colour-coded plots show the years since publication. Some young papers are being cited at an even higher rate than the pack; these will move up the ranking faster than their neighbours over the next few months.
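The rate calculation and the colour-coded plot are straightforward; here is a minimal sketch continuing from the hypothetical dataframe above, using the current year as a proxy for “now”:

```python
# Sketch: average citation rate = total cites / years since publication,
# continuing from the dataframe above.
import datetime as dt
import matplotlib.pyplot as plt

this_year = dt.date.today().year
df["years_out"] = (this_year - df["Year"]).clip(lower=1)  # avoid divide-by-zero
df["cites_per_year"] = df["Cited by"] / df["years_out"]

# Top 20 ranked by rate rather than by total citations
print(df.nlargest(20, "cites_per_year")[["Title", "Year", "Cited by", "cites_per_year"]])

# Colour-coded by years since publication, as in the plots described above
sc = plt.scatter(df["Cited by"], df["cites_per_year"], c=df["years_out"],
                 cmap="viridis", s=10)
plt.colorbar(sc, label="Years since publication")
plt.xlabel("Total citations")
plt.ylabel("Citations per year")
plt.show()
```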
Just looking at the “Top 20” is amazing. These papers are being cited at rates of approximately 1000 times per year. The paper ranked 6 is a young paper being cited at a very high rate, and it will likely move up the ranking. So what are these freakish papers?
In the table below, I’ve pasted the top 20 of the highly cited paper dataset. They are a mix of clinical consortia papers and bioinformatics tools for sequence and structural analysis. The tools make sense: they are widely used in a huge number of papers and get heavily cited as a result. In fact, these citation counts are probably underestimates, since citations to software often get missed out of papers. The clinical papers are also useful to large fields; they have many authors, and there is a network effect to their citation which can drive up the cites to these items (this is noted in the paper I described above). Even though the data are expected, I was amazed by the magnitude of the citation counts and the rates at which these works are acquiring citations. The topics of the papers beyond the top 20 are pretty similar.
There’s no conclusion for this post. There is a tiny subset of papers out there with freakishly high citation rates. We should simply marvel at them…
| Rank | Title | Year | Journal | Total cites |
| --- | --- | --- | --- | --- |
| 1 | Clustal W and Clustal X version 2.0 | 2007 | Bioinformatics | 16549 |
| 2 | The Sequence Alignment/Map format and SAMtools | 2009 | Bioinformatics | 13586 |
| 3 | Fast and accurate short read alignment with Burrows-Wheeler transform | 2009 | Bioinformatics | 12653 |
| 4 | PLINK: A tool set for whole-genome association and population-based linkage analyses | 2007 | American Journal of Human Genetics | 12241 |
| 5 | Estimates of worldwide burden of cancer in 2008: GLOBOCAN 2008 | 2010 | International Journal of Cancer | 11047 |
| 6 | Cancer incidence and mortality worldwide: Sources, methods and major patterns in GLOBOCAN 2012 | 2015 | International Journal of Cancer | 10352 |
| 7 | PHENIX: A comprehensive Python-based system for macromolecular structure solution | 2010 | Acta Crystallographica Section D: Biological Crystallography | 10093 |
| 8 | Phaser crystallographic software | 2007 | Journal of Applied Crystallography | 9617 |
| 9 | New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1) | 2009 | European Journal of Cancer | 9359 |
| 10 | Features and development of Coot | 2010 | Acta Crystallographica Section D: Biological Crystallography | 9241 |
| 11 | Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities | 2009 | Applied and Environmental Microbiology | 8127 |
| 12 | BEAST: Bayesian evolutionary analysis by sampling trees | 2007 | BMC Evolutionary Biology | 8019 |
| 13 | Improved survival with ipilimumab in patients with metastatic melanoma | 2010 | New England Journal of Medicine | 7293 |
| 14 | OLEX2: A complete structure solution, refinement and analysis program | 2009 | Journal of Applied Crystallography | 7173 |
| 15 | Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: A systematic analysis for the Global Burden of Disease Study 2010 | 2012 | The Lancet | 6296 |
| 16 | New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0 | 2010 | Systematic Biology | 6290 |
| 17 | The MIQE guidelines: Minimum information for publication of quantitative real-time PCR experiments | 2009 | Clinical Chemistry | 6086 |
| 18 | The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials | 2011 | BMJ (Online) | 6065 |
| 19 | Velvet: Algorithms for de novo short read assembly using de Bruijn graphs | 2008 | Genome Research | 5550 |
| 20 | A comparative risk assessment of burden of disease and injury attributable to 67 risk factors and risk factor clusters in 21 regions, 1990-2010: A systematic analysis for the Global Burden of Disease Study 2010 | 2012 | The Lancet | 5499 |
—
The post title comes from “One With The Freaks” by The Notwist.