Five Get Over Excited: Academic papers that cite quantixed posts

Anyone who maintains a website is happy that people out there are interested enough to visit. Web traffic is one thing, but I take the greatest pleasure in seeing quantixed posts cited in academic papers.

I love the fact that some posts on here have been cited in the literature more than some of my actual papers.

It’s difficult to track citations to web resources. This is partly my fault: I think it is possible to register posts so that they have a DOI, but I have not done this, so tracking is a difficult task. Websites are part of what is known as the grey literature: items that are not part of traditional academic publishing.

The most common route for me to discover that a post has been cited is when I actually read the paper. Four examples spring to mind: here, here, here and here. In each case, I read the paper and was surprised to find quantixed cited in the bibliography.

Vanity and curiosity made me wonder if there were other citations I didn’t know about. A cited reference search in Web of Science pulled up two more: here and here.

A bit of Googling revealed yet more citations, e.g. two quantixed posts are cited in this book. And another citation here.

OK so quantixed is not going to win any “highly cited” prizes or develop a huge H-index (if something like that existed for websites). But I’m pleased that 1) there are this many citations given that there’s a bias against citing web resources, and 2) the content here has been useful to others, particularly for academic work.

All of these citations are to posts looking at impact factors, metrics and publication lag times. In terms of readership, these posts get sustained traffic, but currently the most popular posts on quantixed are the “how to” guides, with LaTeX to Word and Back seeing the most traffic. Somewhere in between citation and web traffic are cases where quantixed posts get written about elsewhere, e.g. in a feature in Nature by Kendall Powell.

The post title comes from “Five Get Over Excited” by The Housemartins, a band with a great eye for song titles. It can be found on the album “The People Who Grinned Themselves to Death”.

One With The Freaks – very highly cited papers in biology

I read this recent paper about very highly cited papers and science funding in the UK. The paper itself was not very good, but the dataset which underlies the paper is something to behold, as I’ll explain below.

The idea behind the paper was to examine very highly cited papers in biomedicine with a connection to the UK. Have those authors been successful in getting funding from the MRC, Wellcome Trust or NIHR? They find that some of the authors of these very highly cited papers are not funded by these sources. Note that these funders are some, but not all, of the science funding bodies in the UK. The authors also looked at panel members of those three funders, and report that these individuals are funded at high rates and that the overlap between panel membership and very highly cited authorship is very low. I don’t want to critique the paper extensively, but the conclusions drawn are rather blinkered. A few reasons: 1) MRC, NIHR and Wellcome support science in ways other than direct funding of individuals (e.g. PhD programmes, infrastructure etc.); 2) the contribution of other funders, e.g. BBSRC, was ignored; 3) panels tend to be selected from the pool of awardees, rather than the other way around. I understand that the motivation of the authors is to stimulate debate around whether science funding is effective, and this is welcome, but the paper strays too far into clickbait territory for my tastes.

The most interesting thing about the analysis (and arguably its main shortcoming) was the dataset. The authors took the papers in Scopus which have been cited >1000 times. This is ~450 papers as of last week. As I found out when I recreated their dataset, this is a freakish set of papers. Of course weird things can be found when looking at outliers.

Dataset of 20,000 papers from Scopus (see details below)

The authors describe a one-line search term they used to retrieve papers from Scopus. These papers span 2005 to the present day and were then filtered for UK origin.

LANGUAGE ( english ) AND PUBYEAR > 2005 AND ( LIMIT-TO ( SRCTYPE , "j " ) ) AND ( LIMIT-TO (DOCTYPE , "ar " ) ) AND ( LIMIT-TO ( SUBJAREA , "MEDI" ) OR LIMIT-TO ( SUBJAREA , "BIOC" ) OR LIMIT-TO (SUBJAREA , "PHAR" ) OR LIMIT-TO ( SUBJAREA , "IMMU" ) OR LIMIT-TO ( SUBJAREA , "NEUR" ) OR LIMIT-TO ( SUBJAREA , "NURS" ) OR LIMIT-TO ( SUBJAREA , "HEAL" ) OR LIMIT-TO ( SUBJAREA , "DENT" ) )

I’m not sure how accurate the dataset is in terms of finding papers of UK origin, but the point here is to look at the dataset and not to critique the paper.

I downloaded the first 20,000 records (a limitation of Scopus). I think it would break the terms of use to post the dataset here, but if your institution has a subscription, it can be recreated. The top paper has 16,549 citations! The 20,000th paper has accrued 122 citations, and the papers with >1000 citations number 450 as of last week.

Now, some papers are older than others, so I calculated the average citation rate by dividing total cites by the number of years since publication, to get a better picture of the hottest among these freaky papers. The two colour-coded plots show the years since publication. It is possible to see some young papers which are being cited at an even higher rate than the pack. These will move up the ranking faster than their neighbours over the next few months.
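For anyone who wants to do something similar, a minimal sketch in R is below. This is not the exact code I used: the file name is a placeholder, and the column names are the ones that read.csv() typically produces from a Scopus CSV export (Title, Year, Cited.by), so check them against your own download.

# read the Scopus export and rank papers by average citations per year
df <- read.csv("scopus_export.csv", stringsAsFactors = FALSE)   # placeholder file name
current_year <- 2017                                            # illustrative year of analysis
df$yearsSince <- pmax(current_year - df$Year, 1)                # avoid dividing by zero for brand-new papers
df$citesPerYear <- df$Cited.by / df$yearsSince                   # average citation rate
df <- df[order(-df$citesPerYear), ]                              # hottest papers first
head(df[, c("Title", "Year", "Cited.by", "citesPerYear")], 20)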

Just looking at the “Top 20” is amazing. These papers are being cited at rates of approximately 1000 times per year. The paper ranked 6 is a young paper which is cited at a very high rate and will likely move up the ranking. So what are these freakish papers?

In the table below, I’ve pasted the top 20 of the highly cited paper dataset. They are a mix of clinical consortia papers and bioinformatics tools for sequence and structural analysis. The tools make sense: they are widely used in a huge number of papers and get heavily cited as a result. In fact, these citation numbers are probably an underestimate, since citations to software often get missed out of papers. The clinical papers are also useful to large fields. They have many authors and there is a network effect to their citation which can drive up the cites to these items (this is noted in the paper I described above). Even though the data are expected, I was amazed by the magnitude of the citations and the rates at which these works are acquiring them. The topics of the papers beyond the top 20 are pretty similar.

There’s no conclusion for this post. There is a tiny subset of papers out there with freakishly high citation rates. We should simply marvel at them…

Rank | Title | Year | Journal | Total cites
1 | Clustal W and Clustal X version 2.0 | 2007 | Bioinformatics | 16549
2 | The Sequence Alignment/Map format and SAMtools | 2009 | Bioinformatics | 13586
3 | Fast and accurate short read alignment with Burrows-Wheeler transform | 2009 | Bioinformatics | 12653
4 | PLINK: A tool set for whole-genome association and population-based linkage analyses | 2007 | American Journal of Human Genetics | 12241
5 | Estimates of worldwide burden of cancer in 2008: GLOBOCAN 2008 | 2010 | International Journal of Cancer | 11047
6 | Cancer incidence and mortality worldwide: Sources, methods and major patterns in GLOBOCAN 2012 | 2015 | International Journal of Cancer | 10352
7 | PHENIX: A comprehensive Python-based system for macromolecular structure solution | 2010 | Acta Crystallographica Section D: Biological Crystallography | 10093
8 | Phaser crystallographic software | 2007 | Journal of Applied Crystallography | 9617
9 | New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1) | 2009 | European Journal of Cancer | 9359
10 | Features and development of Coot | 2010 | Acta Crystallographica Section D: Biological Crystallography | 9241
11 | Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities | 2009 | Applied and Environmental Microbiology | 8127
12 | BEAST: Bayesian evolutionary analysis by sampling trees | 2007 | BMC Evolutionary Biology | 8019
13 | Improved survival with ipilimumab in patients with metastatic melanoma | 2010 | New England Journal of Medicine | 7293
14 | OLEX2: A complete structure solution, refinement and analysis program | 2009 | Journal of Applied Crystallography | 7173
15 | Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: A systematic analysis for the Global Burden of Disease Study 2010 | 2012 | The Lancet | 6296
16 | New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0 | 2010 | Systematic Biology | 6290
17 | The MIQE guidelines: Minimum information for publication of quantitative real-time PCR experiments | 2009 | Clinical Chemistry | 6086
18 | The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials | 2011 | BMJ (Online) | 6065
19 | Velvet: Algorithms for de novo short read assembly using de Bruijn graphs | 2008 | Genome Research | 5550
20 | A comparative risk assessment of burden of disease and injury attributable to 67 risk factors and risk factor clusters in 21 regions, 1990-2010: A systematic analysis for the Global Burden of Disease Study 2010 | 2012 | The Lancet | 5499

The post title comes from “One With The Freaks” by The Notwist.

For What It’s Worth: Influence of our papers on our papers

This post is about a citation analysis that didn’t quite work out.

I liked this blackboard project by Manuel Théry looking at the influence of each paper authored by David Pellman’s lab on the future directions of the Pellman lab.

It reminded me that some papers have impact on the field, while others are most influential to the group that produced them. I wondered which of the papers on which I’m an author have been most influential to my other papers, and whether this correlates with a measure of their impact on the field.

There’s no code in this post. I retrieved the relevant records from Scopus and used the difference between the “with” and “without” self-citation counts to pull together the numbers.

Influence: I used the number of citations to a paper from any of our other papers as the number of self-citations. This was divided by the total number of future papers. This means that if I have 50 papers, and the 23rd paper to be published has collected 27 self-citations, it has a score of 1 (neither the 23rd paper nor any of the preceding 22 papers can cite the 23rd paper, but the 27 that follow could). This is our metric for influence.

Impact: As a measure of general impact I took the total number of citations for each paper and divided this by the number of years since publication to get average cites per year for each paper.
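To make this concrete, here is a rough sketch in R of how the two metrics go together. It is not my actual workflow: the file name and the column names (totalCites and citesNoSelf, i.e. the “with” and “without” self-citation numbers from Scopus) are purely illustrative.

# one row per paper, in publication order
papers <- read.csv("my_papers.csv")                          # placeholder file name

n <- nrow(papers)
papers$selfCites <- papers$totalCites - papers$citesNoSelf   # citations from our own later papers
papers$futurePapers <- n - seq_len(n)                        # papers we published afterwards

papers$influence <- papers$selfCites / papers$futurePapers            # influence: self-cites per future paper
papers$impact <- papers$totalCites / pmax(2017 - papers$Year, 1)      # impact: average cites per year (2017 illustrative)

# note: the most recent paper has no future papers, so its influence is undefined
plot(papers$impact, papers$influence,
     xlab = "Impact (cites per year)", ylab = "Influence (self-cites per future paper)")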

Plot of influence against impact

Reviews and methods papers are shown in blue, while research papers are in red. I was surprised that some papers have been cited by as many as half of the papers that followed them.

Generally, the articles that were most influential to us were also the papers with the biggest impact, although the correlation is not very strong. There is an obvious outlier paper that gets 30 cites per year (over a 12-year period, I should say) but this paper has not influenced our work as much as other papers have. This is partly because the paper is a citation magnet and partly because we’ve stopped working on this topic in the last few years.

Obviously, the most recent papers were the least informative. There are no future papers to test if they were influential and there are few citations so far to understand their impact.

It’s difficult to say what the correlation between impact and influence on our own work really means, if anything. Does it mean that we have tended to pursue projects because of their impact (I would hope not)? Perhaps these papers are generally useful to the field and to us.

In summary, I don’t think this analysis was successful. I had wanted to construct some citation networks – similar to the Pellman tree idea above – to look at influence in more detail, but I lost confidence in the method. Many of our self-citations are for methodological reasons and so I’m not sure if we’re measuring influence or utility here. Either way, the dataset is not big enough (yet) to do more meaningful number crunching. Having said this, the approach I’ve described here will work for any scholar and could be done at scale.

There are several song titles in the database called ‘For What It’s Worth’. This one is Chapterhouse on Rownderbout.


Rollercoaster III: yet more on Google Scholar

In a previous post I made a little R script to crunch Google Scholar data for a given scientist. The graphics were done in base R and looked a bit ropey. I thought I’d give the code a spring clean – it’s available here. The script is called ggScholar.R (rather than gScholar.R). Feel free to run it and raise an issue or leave a comment if you have some ideas.

I’m still learning how to get things looking how I want them using ggplot2, but this is an improvement on the base R version.

As described earlier I have many Rollercoaster songs in my library. This time it’s the song and album by slowcore/dream pop outfit Red House Painters.

If I Can’t Change Your Mind

I have written previously about Journal Impact Factors (here and here). The response to these articles has been great and earlier this year I was asked to write something about JIFs and citation distributions for one of my favourite journals. I agreed and set to work.

Things started off so well. A title came straight to mind. In the style of quantixed, I thought The Number of The Beast would be amusing. I asked for opinions on Twitter and got an even better one from Scott Silverman (@sksilverman): Too Many Significant Figures, Not Enough Significance. Next, I found an absolute gem of a quote to kick off the piece. It was from the eminently quotable Sydney Brenner.

Before we develop a pseudoscience of citation analysis, we should remind ourselves that what matters absolutely is the scientific content of a paper and that nothing will substitute for either knowing it or reading it.

That quote was from a Loose Ends piece that Uncle Syd penned for Current Biology in 1995. Wow, 1995… that is quite a few years ago, I thought to myself. Never mind. I pressed on.

There’s a lot of literature on JIFs and research assessment; in fact, there are whole fields of scholarly activity (bibliometrics) devoted to this kind of analysis. I thought I’d better look back at what has been written previously. The “go to” paper for criticism of JIFs is Per Seglen’s analysis in the BMJ, published in 1997. I re-read this and I can recommend it if you haven’t already seen it. However, I started to feel uneasy. There was not much that I could add that hadn’t already been said, and what’s more, it had been said 20 years ago.

Around about this time I was asked to review some fellowship applications for another EU country. The applicants had to list their publications, along with the JIF. I found this annoying. It was as if SF-DORA never happened.

There have been so many articles, blog posts and more written on JIFs. Why has nothing changed? It was then that I realised that it doesn’t matter how many things are written – however coherently argued – people like JIFs and they like to use them for research assessment. I was wasting my time writing something else. Sorry if this sounds pessimistic. I’m sure new trainees can be reached by new articles on this topic, but acceptance of JIF as a research assessment tool runs deep. It is like religious thought. No amount of atheist writing, no matter how forceful, cogent, whatever, will change people’s minds. That way of thinking is too deeply ingrained.

As the song says, “If I can’t change your mind, then no-one will”.

So I declared defeat and told the journal that I felt I had already said all that I could say on my blog and that I was unable to write something for them. Apologies to all like-minded individuals for not continuing to fight the good fight.

But allow me one parting shot. I had a discussion on Twitter with a few people, one of whom said they disliked the “JIF witch hunt”. This caused me to think about why the JIF has hung around for so long and why it continues to have support. It can’t be that so many people are statistically illiterate or that they are unscientific in choosing to ignore the evidence. What I think is going on is a misunderstanding. Criticism of a journal metric as being unsuitable to judge individual papers is perceived as an attack on journals with a high JIF. Now, for good or bad, science is elitist and we are all striving to do the best science we can. For many scientists, striving for the best means aiming to publish in journals which happen to have a high JIF. So an attack on JIFs as a research assessment tool feels like an attack on what scientists are trying to do every day.

Because of this intense focus on high-JIF journals, what people don’t appreciate is that the reality is much different. The distribution of JIFs across journals is as skewed as the citation distributions that underlie the metric itself. What this means is that focussing on the minuscule fraction of papers appearing in high-JIF journals misses the point. Most papers are in journals with low JIFs. As I’ve written previously, papers in journals with a JIF of 4 get similar citations to those in a journal with a JIF of 6. So the JIF tells us nothing about citations to the majority of papers, and it certainly can’t predict the impact of these papers, which make up the majority of our scientific output.

So what about those fellowship applicants? All of them had papers in journals with low JIFs (<8); in that respect, the applicants’ papers were indistinguishable. What advice would I give to people applying to such a scheme? Well, I wouldn’t advise withholding the information asked for. To be fair to the funding body, they also asked for the number of citations for each paper, but for papers that are only a few months old this number is nearly always zero. My advice would be to try to make sure that your paper is freely available for anyone to read. Many of the applicants’ papers were outside my expertise, so the title and abstract didn’t tell me much about the significance of the paper. So I looked at some of these papers to judge the quality of the data in them… if I had access. Applicants who had published in closed access journals were at a disadvantage here, because if I couldn’t download the paper it was difficult to assess what they had been doing.

I was thinking that this post would be a meta-meta-blogpost. Writing about an article which was written about something I wrote on my blog. I suppose it still is, except the article was never finished. I might post again about JIFs, but for now I doubt I will have anything new to say that hasn’t already been said.

The post title is taken from “If I Can’t Change Your Mind” by Sugar from their LP Copper Blue. Bob Mould was once asked about song-writing and he said that the perfect song was like a maths puzzle (I can’t find a link to support this, so this is from memory). If you are familiar with this song, songwriting and/or mathematics, then you will understand what he means.

Edit @ 08:22 16-05-20 I found an interview with Bob Mould where he says song-writing is like city-planning. Maybe he just compares song-writing to lots of different things in interviews. Nonetheless I like the maths analogy.

Throes of Rejection: No link between rejection rates and impact?

I was interested in the analysis by Frontiers on the lack of a correlation between the rejection rate of a journal and its “impact” (as measured by the JIF). There’s a nice follow-up here at Science Open. The Times Higher Education Supplement also reported on this, with the line that “mass rejection of research papers by selective journals in a bid to achieve a high impact factor is an enormous waste of academics’ time”.

First off, the JIF is a flawed metric in a number of ways but even at face value, what does this analysis really tell us?

Plot of impact factor versus rejection rate

This plot is taken from the post by Jon Tennant at Science Open.

As others have pointed out:

  1. The rejection rate is dominated by desk rejects, which although very annoying, don’t take that much time.
  2. Without knowing the journal name it is difficult to know what to make of the plot.

The data are available from Figshare and – thanks to Thomson Reuters’ habit of reporting the JIF to 3 d.p. – we can easily pull the journal titles from a list using the JIF as a key. The list is here. Note that there may be errors due to this quick-and-dirty method.
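Something like this R snippet shows the idea. It is not the code I used, and the file and column names are placeholders.

# match journal titles to the Figshare data using the JIF as a lookup key
frontiers <- read.csv("frontiers_rejection_rates.csv")   # Figshare dataset (placeholder name)
jcr <- read.csv("journal_list_2014.csv")                 # journal titles with 2014 JIFs (placeholder name)

# JIFs reported to 3 d.p. are nearly unique, so the JIF itself can act as a key
frontiers$key <- sprintf("%.3f", frontiers$IF_2014)
jcr$key <- sprintf("%.3f", jcr$JIF)

merged <- merge(frontiers, jcr[, c("key", "Title")], by = "key", all.x = TRUE)
# journals that happen to share an identical JIF will match ambiguously - hence the possible errors
merged[order(-as.numeric(merged$key)), c("Title", "rejection_rate", "key")]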

The list takes on a different meaning when you can see the Journal titles alongside the numbers for rejection rate and JIF.


Looking for familiar journals – whichever field you are in – you will be disappointed. There’s an awful lot of noise in there. By this, I mean journals that are outside of your field.

This is the problem with this analysis as I see it. It is difficult to compare Nature Neuroscience with Mineralium Deposita…

My plan with this dataset was to replot rejection rate versus JIF2014 for a few different journal categories, but I don’t think there’s enough data to do this and make a convincing case one way or the other. So, I think the jury is still out on this question.

It would be interesting to do this analysis on a bigger dataset. Journals releasing their numbers on rejection rates would be a step towards doing this.

One final note:

The Orthopedic Clinics of North America is a tough journal. It accepts only 2 papers in every 100, for an impact factor of 1!

 

The post title is from “Throes of Rejection” by Pantera from their Far Beyond Driven LP. I rejected the title “Satan Has Rejected my Soul” by Morrissey for obvious reasons.

The Great Curve II: Citation distributions and reverse engineering the JIF

There have been calls for journals to publish the distribution of citations to the papers they publish (1 2 3). The idea is to turn the focus away from just one number – the Journal Impact Factor (JIF) – and to look at all the data. Some journals have responded by publishing the data that underlie the JIF (EMBO J, PeerJ, Royal Soc, Nature Chem). It would be great if more journals did this. Recently, Stuart Cantrill from Nature Chemistry actually went one step further and compared the distribution of cites at his journal with other chemistry journals. I really liked this post and it made me think that I should just go ahead and harvest the data for cell biology journals and post it.

This post is in two parts. First, I’ll show the data for 22 journals. They’re broadly cell biology, but there’s something for everyone with Cell, Nature and Science all included. Second, I’ll describe how I “reverse engineered” the JIF to get to these numbers. The second part is a bit technical but it describes how difficult it is to reproduce the JIF and highlights some major inconsistencies for some journals. Hopefully it will also be of interest to anyone wanting to do a similar analysis.

Citation distributions for 22 cell biology journals

The JIF for 2014 (published in the summer of 2015) is worked out by counting the total number of 2014 cites to articles in a journal that were published in 2012 and 2013. This number is divided by the number of “citable items” in that journal in 2012 and 2013. There are other ways to look at citation data, and different windows to analyse, but this method is used here because it underlies the impact factor. I plotted histograms to show the citation distributions at these journals from 0-50 citations; the insets show the frequency of papers with 50-1000 cites.
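For anyone wanting to reproduce this kind of calculation, a bare-bones sketch in R is below. This is not my actual code: it assumes a data frame for one journal with one row per article or review published in 2012-2013 and a column holding the number of 2014 citations to each item (file and column names are illustrative).

wos <- read.csv("journal_2012_2013.csv")   # placeholder file name

numerator <- sum(wos$cites2014)            # total 2014 cites to 2012-2013 items
denominator <- nrow(wos)                   # number of citable items in 2012-2013
numerator / denominator                    # mean cites per item, i.e. the JIF-style calculation

# citation distribution from 0-50 cites, as in the histograms below
hist(wos$cites2014[wos$cites2014 <= 50], breaks = seq(-0.5, 50.5, 1),
     xlab = "Citations received in 2014", ylab = "Number of papers", main = "")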

Histograms of citation distributions for the 22 journals

As you can see, the distributions are highly skewed, so reporting the mean is very misleading. Typically ~70% of papers pick up fewer than the mean number of citations. Reporting the median is safer and is shown below. It shows how similar most of the journals in this field are in terms of citations to the average paper. Another metric, which I like, is the h-index for journals. Google Scholar uses this as a journal metric (using citation data from a 5-year window). For a journal, this is the largest number h such that h papers in the journal have received at least h citations. A plot of h-indices for these journals is shown below.
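The h-index and median are easy to get from the same citation vector. A quick sketch in R, reusing the illustrative cites2014 column from the snippet above:

h_index <- function(cites) {
  cites <- sort(cites, decreasing = TRUE)
  sum(cites >= seq_along(cites))   # the largest h such that h papers have at least h citations
}

h_index(wos$cites2014)   # journal h-index for the 2014 citation window
median(wos$cites2014)    # median citations per item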

Plot of median citations and h-index for each journal

Here’s a summary table of all of this information together with the “official JIF” data, which is discussed below. Citations, Items and Mean come from my Web of Science dataset; JIF citations, JIF items and JIF are the official Thomson Reuters numbers.

Journal | Median | H | Citations | Items | Mean | JIF citations | JIF items | JIF
Autophagy | 3 | 18 | 2996 | 539 | 5.6 | 2903 | 247 | 11.753
Cancer Cell | 14 | 37 | 5241 | 274 | 19.1 | 5222 | 222 | 23.523
Cell | 19 | 72 | 28147 | 1012 | 27.8 | 27309 | 847 | 32.242
Cell Rep | 6 | 26 | 6141 | 743 | 8.3 | 5993 | 717 | 8.358
Cell Res | 3 | 19 | 1854 | 287 | 6.5 | 2222 | 179 | 12.413
Cell Stem Cell | 14 | 37 | 5192 | 302 | 17.2 | 5233 | 235 | 22.268
Cell Mol Life Sci | 4 | 19 | 3364 | 596 | 5.6 | 3427 | 590 | 5.808
Curr Biol | 4 | 24 | 6751 | 1106 | 6.1 | 7293 | 762 | 9.571
Development | 5 | 25 | 6069 | 930 | 6.5 | 5861 | 907 | 6.462
Dev Cell | 7 | 23 | 3986 | 438 | 9.1 | 3922 | 404 | 9.708
eLife | 5 | 20 | 2271 | 306 | 7.4 | 2378 | 255 | 9.322
EMBO J | 8 | 27 | 5828 | 557 | 10.5 | 5822 | 558 | 10.434
J Cell Biol | 6 | 25 | 5586 | 720 | 7.8 | 5438 | 553 | 9.834
J Cell Sci | 3 | 23 | 5995 | 1157 | 5.2 | 5894 | 1085 | 5.432
Mol Biol Cell | 3 | 16 | 3415 | 796 | 4.3 | 3354 | 751 | 4.466
Mol Cell | 11 | 37 | 8669 | 629 | 13.8 | 8481 | 605 | 14.018
Nature | 12 | 105 | 69885 | 2758 | 25.3 | 71677 | 1729 | 41.296
Nat Cell Biol | 13 | 35 | 5381 | 340 | 15.8 | 5333 | 271 | 19.679
Nat Rev Mol Cell Biol | 8.5 | 43 | 5037 | 218 | 23.1 | 4877 | 129 | 37.806
Oncogene | 5 | 26 | 6973 | 1038 | 6.7 | 8654 | 1023 | 8.459
Science | 14 | 83 | 54603 | 2430 | 22.5 | 56231 | 1673 | 33.611
Traffic | 3 | 11 | 1020 | 252 | 4.0 | 1018 | 234 | 4.350

 

Reverse engineering the JIF

The analysis shown above was straightforward. However, getting the data to match Thomson-Reuters’ calculations for the JIF was far from easy.

I downloaded the citation data from Web of Science for the 22 journals. I limited the search to “articles” and “reviews” published in 2012 and 2013, and took the citations these papers received in 2014, with the aim of plotting out the distributions. As a first step I calculated the mean citation count for each journal (a.k.a. the impact factor) to see how it compared with the official Journal Impact Factor (JIF). As you can see below, some were correct and others were off by some margin.

Journal Calculated IF JIF
Autophagy 5.4 11.753
Cancer Cell 14.8 23.523
Cell 23.9 32.242
Cell Rep 8.2 8.358
Cell Res 5.7 12.413
Cell Stem Cell 13.4 22.268
Cell Mol Life Sci 5.6 5.808
Curr Biol 5.0 9.571
Development 6.5 6.462
Dev Cell 7.5 9.708
eLife 6.0 9.322
EMBO J 10.5 10.434
J Cell Biol 7.6 9.834
J Cell Sci 5.2 5.432
Mol Biol Cell 4.1 4.466
Mol Cell 11.8 14.018
Nature 25.1 41.296
Nat Cell Biol 15.1 19.679
Nat Rev Mol Cell Biol 15.3 37.806
Oncogene 6.7 8.459
Science 18.6 33.611
Traffic 4.0 4.35

For most journals there was a large difference between this number and the official JIF (see below, left). This was not a huge surprise: I’d found previously that the JIF is very hard to reproduce (see also here). To try to understand the difference, I looked at the total citations in my dataset versus those used for the official JIF. As you can see from the plot (right), my numbers are pretty much in agreement with those used for the JIF calculation. This meant that the difference comes from the denominator – the number of citable items.

Plots of calculated impact factor versus official JIF (left) and total citations in my dataset versus those used for the JIF (right)

What the plots show is that, for most journals in my dataset, there are fewer papers considered as citable items by Thomson-Reuters. This is strange. I had filtered the data to leave only journal articles and reviews (which are citable items), so non-citable items should have been removed.


Now, it’s no secret that the papers cited in the sum on the top of the impact factor calculation are not necessarily the same as the papers counted on the bottom (see here, here and here). This inconsistency actually makes plotting a distribution impossible. However, I thought that using the same dataset, filtering and getting to the correct total citation number meant that I had the correct list of citable items. So, what could explain this difference?

I looked first at how big the difference in the number of citable items is. Journals like Nature and Science are missing >1000 items(!), others are missing fewer, and some such as Traffic, EMBO J and Development have the correct number. Remember that journals carry different numbers of papers, so as a proportion of total papers the biggest fraction of missing papers was actually from Autophagy and Cell Research, which were missing ~50% of papers classified in WoS as “articles” or “reviews”!

My best guess at this stage was that items were incorrectly tagged in Web of Science. Journals like Nature, Science and Current Biology carry a lot of obituaries, letters and other material that can fairly be removed from the citable items count. But these should be classified as such in Web of Science and therefore filtered out in my original search. Also, these types of item don’t explain the big disparity in journals like Autophagy, which only carry papers and reviews with a tiny bit of front matter.

I figured a good way forward would be to verify the numbers with another database – PubMed. Details of how I did this are at the foot of this post. This brought me much closer to the JIF “citable items” number for most journals. However, Autophagy, Current Biology and Science are still missing large numbers of papers. As a proportion of the size of the journal, Autophagy, Cell Research and Current Biology are missing the most, while Nature Cell Biology and Nature Reviews Molecular Cell Biology now have more citable items in the JIF calculation than are found in PubMed!

This collection of data was used for the citation distributions shown above, but it highlights some major discrepancies at least for some journals.

How does Thomson Reuters decide what is a citable item?

Some of the reasons for deciding what is a citable item are outlined in this paper. Of the six reasons that are revealed, all seem reasonable, but they suggest that they do not simply look at the classification of papers in the Web of Science database. Without wanting to pick on Autophagy – it’s simply the first one alphabetically – I looked at which was right: the PubMed number of 539 or the JIF number of 247 citable items published in 2012 and 2013. For the JIF number to be correct this journal must only publish ~10 papers per issue, which doesn’t seem to be right at least from a quick glance at the first few issues in 2012.

Why Thomson-Reuters removes some of these papers as non-citable items is a mystery… you can see from the histogram above that for Autophagy only 90 or so papers are uncited in 2014, so clearly the removed items are capable of picking up citations. If anyone has any ideas why the items were removed, please leave a comment.

Summary

Trying to understand what data go into the Journal Impact Factor calculation (for some, but not all, journals) is very difficult. This makes JIFs very hard to reproduce. As a general rule in science, we don’t trust things that can’t be reproduced, so why has the JIF persisted? I think most people realise by now that using this single number to draw conclusions about the excellence (or not) of a paper, because it was published in a certain journal, is madness. Looking at the citation distributions, it’s clear that the majority of papers could be reshuffled between any of these journals and nobody would notice (see here for further analysis). We would all do better to read the paper and not worry about where it was published.

The post title is taken from “The Great Curve” by Talking Heads from their classic LP Remain in Light.

In PubMed, a research paper will have the publication type “journal article”; however, other items can also have this publication type. These items have additional types which can therefore be used to filter them out. I retrieved all PubMed records from the journals published in 2012 and 2013 with publication type = “journal article”. This worked for 21 journals; eLife is online-only, so the ppdat field code had to be changed to pdat.


("Autophagy"[ta] OR "Cancer Cell"[ta] OR "Cell"[ta] OR "Cell Mol Life Sci"[ta] OR "Cell Rep"[ta] OR "Cell Res"[ta] OR "Cell Stem Cell"[ta] OR "Curr Biol"[ta] OR "Dev Cell"[ta] OR "Development"[ta] OR "Elife"[ta] OR "Embo J"[ta] OR "J Cell Biol"[ta] OR "J Cell Sci"[ta] OR "Mol Biol Cell"[ta] OR "Mol Cell"[ta] OR "Nat Cell Biol"[ta] OR "Nat Rev Mol Cell Biol"[ta] OR "Nature"[ta] OR "Oncogene"[ta] OR "Science"[ta] OR "Traffic"[ta]) AND (("2012/01/01"[PPDat] : "2013/12/31"[PPDat])) AND journal article[pt:noexp]

I saved this as an XML file and then pulled the values from the “publication type” key using Nokogiri/ruby (script). I then had a list of all the publication type combinations for each record. As a first step I simply counted the number of journal articles for each journal and then subtracted anything that was tagged as “biography”, “comment”, “portraits” etc. This could be done in IgorPro by making a wave indicating whether an item should be excluded (0 or 1), using the DOI as a lookup. This wave could then be used to exclude papers from the distribution.
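For anyone who prefers R to Ruby, the same filtering step can be sketched with the xml2 package. This is not the script linked above; the file name and the exclusion list are illustrative.

library(xml2)

doc <- read_xml("pubmed.xml")                             # the saved PubMed XML (placeholder name)
articles <- xml_find_all(doc, "//PubmedArticle")

exclude_types <- c("Biography", "Comment", "Portraits")   # illustrative exclusion list
flags <- sapply(seq_along(articles), function(i) {
  types <- xml_text(xml_find_all(articles[[i]], ".//PublicationTypeList/PublicationType"))
  any(types %in% exclude_types)                           # TRUE if the record should be excluded
})

sum(!flags)                                               # number of records kept as citable items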

For calculation of the number of missing papers as a proportion of size of journal, I used the number of items from WoS for the WoS calculation, and the JIF number for the PubMed comparison.

Related to this, this IgorPro procedure will read in csv files from WoS/WoK. As mentioned in the main text, data were downloaded 500 records at a time as csv from WoS, using journal titles as a search term and limiting to “article” or “review” and limiting to 2012 and 2013. Note that limiting the search at the outset by year, limits the citation data you get back. You need to search first to get citations from all years and then refine afterwards. The files can be stitched together with the cat command.


cat *.txt > merge.txt

Edit 8/1/16 @ 07:41 Jon Lane told me via Twitter that Autophagy publishes short commentaries of papers in other journals called “Autophagic puncta” (you need to be a cell biologist to get this gag). He suggests these could be removed by Thomson Reuters for their calculation. This might explain the discrepancy for this journal. However, these items 1) cite other papers (so they contribute to JIF calculations), 2) they get cited (Jon says his own piece has been cited 18 times) so they are not non-citable items, 3) they’re tagged as though they are a paper or a review in WoS and PubMed.

What Difference Does It Make?

A few days ago, Retraction Watch published the top ten most-cited retracted papers. I saw this post with a bar chart to visualise these citations. It didn’t quite capture what effect (if any) a retraction has on citations. I thought I’d quickly plot this out for the number one article on the list.

Plot of citations over time to the most-cited retracted paper

The plot is pretty depressing. The retraction has had no effect on citations. Note that the retraction notice has racked up 125 citations, which could mean that at least some of the ~1000 citations to the original article that came after the retraction acknowledge the fact that the article has been pulled.

The post title is taken from “What Difference Does it Make?” by The Smiths from ‘The Smiths’ and ‘Hatful of Hollow’

White label: the growth of bioRxiv

bioRxiv, the preprint server for biology, recently turned 2 years old. This seems a good point to take a look at how bioRxiv has developed over this time and to discuss any concerns sceptical people may have about using the service.

Firstly, thanks to Richard Sever (@cshperspectives) for posting the data below. The first plot shows the number of new preprints deposited and the number that were revised, per month, since bioRxiv opened in Nov 2013. There are now about 200 preprints being deposited per month and this number will continue to increase. The cumulative article count (of new preprints) shows that, as of the end of last month, there are >2500 preprints deposited at bioRxiv.

Number of preprints deposited per subject category

What is take-up like across biology? To look at this, the number of articles in different subject categories can be totted up. Evolutionary Biology, Bioinformatics and Genomics/Genetics are the front-running disciplines. Obviously, counting articles should be corrected for the size of these fields, but it’s clear that some large disciplines have not adopted preprinting in the same way. Cell biology, my own field, has some catching up to do. It’s likely that this reflects cultures within different fields. For example, genomics has a rich history of data deposition, sharing and openness. Other fields, less so…

So what are we waiting for?

I’d recommend that people wondering about preprinting go and read Stephen Curry’s post “just do it”. Any people who remain sceptical should keep reading…

Do I really want to deposit my best work on bioRxiv?

I’ve picked six preprints that were deposited in 2015. This selection demonstrates how important work is appearing first at bioRxiv and is being downloaded thousands of times before the papers appear in the pages of scientific journals.

  1. Accelerating scientific publishing in biology. A preprint about preprinting from Ron Vale, subsequently published in PNAS.
  2. Analysis of protein-coding genetic variation in 60,706 humans. A preprint summarising a huge effort from the Exome Aggregation Consortium (ExAC). 12,366 views, 4,534 downloads.
  3. TP53 copy number expansion correlates with the evolution of increased body size and an enhanced DNA damage response in elephants. This preprint was all over the news, e.g. Science.
  4. Sampling the conformational space of the catalytic subunit of human γ-secretase. CryoEM is the hottest technique in biology right now. Sjors Scheres’ group have been at the forefront of this revolution. This paper is now out in eLife.
  5. The genome of the tardigrade Hypsibius dujardini. The recent controversy over horizontal gene transfer in tardigrades played out rapidly thanks to preprinting.
  6. CRISPR with independent transgenes is a safe and robust alternative to autonomous gene drives in basic research. This preprint concerning biosafety of CRISPR/Cas technology could be accessed immediately thanks to preprinting.

But many journals consider preprints to be previous publications!

Wrong. It is true that some journals have yet to change their policy, but the majority – including Nature, Cell and Science – are happy to consider manuscripts that have been preprinted. There are many examples of biology preprints that went on to be published in Nature (ancient genomes) and Science (hotspots in birds). If you are worried about whether the journal you want to submit your work to will allow preprinting, check this page first or the SHERPA/RoMEO resource. The journal “information to authors” page should have a statement about this, but you can always ask the Editor.

I’m going to get scooped

Preprints establish priority. It isn’t possible to be scooped if you deposit a preprint that is time-stamped, showing that you were first. The alternative is to send it to a journal, where no record will exist that you submitted it if the paper is rejected – or sometimes even if they end up publishing it (see discussion here). Personally, I feel that the fear of scooping in science is overblown. In fields that are so hot that papers are coming out really fast, the fear of scooping is high, but everyone sees the work if it’s on bioRxiv or elsewhere – who was first is clear to all. Think of it this way: depositing a preprint at bioRxiv is just the same as giving a talk at a meeting. Preprints mean that there is a verifiable record available to everyone.

Preprints look ugly, I don’t want people to see my paper like that.

The depositor can format their preprint however they like! Check out Christophe Leterrier’s beautifully formatted preprint, or this one from Dennis Eckmeier. Both authors made their templates available so you can follow their example (1 and 2).

Yes but does -insert name of famous scientist- deposit preprints?

Lots of high profile scientists have already used bioRxiv. David Bartel, Ewan Birney, George Church, Ray Deshaies, Jennifer Doudna, Steve Henikoff, Rudy Jaenisch, Sophien Kamoun, Eric Karsenti, Maria Leptin, Rong Li, Andrew Murray, Pam Silver, Bruce Stillman, Leslie Vosshall and many more. Some sceptical people may find this argument compelling.

I know how publishing works now and I don’t want to disrupt the status quo

It’s paradoxical how science is all about pushing the frontiers, yet when it comes to publishing, scientists are incredibly conservative. Physics and Mathematics have been using preprinting as part of the standard route to publication for decades, so adoption by biology is nothing unusual – we will simply be catching up. One vision for the future of scientific publishing is that we will deposit preprints and then journals will search out the best work from the server to highlight in their pages. The journals that will do this are called “overlay journals”. Sounds crazy? It’s already happening in Mathematics. Terry Tao, a Fields medal-winning mathematician, recently deposited a solution to the Erdős discrepancy problem on arXiv (he actually put it on his blog first). This was then “published” in Discrete Analysis, an overlay journal. Read about this here.

Disclaimer: other preprint services are available, e.g. F1000 Research and PeerJ Preprints, and of course arXiv itself has a quantitative biology section. My lab have deposited work at bioRxiv (1, 2 and 3) and I am an affiliate for the service, which means I check preprints before they go online.

Edit 14/12/15 07:13 put the scientists in alphabetical order. Added a part about scooping.

The post title comes from the term “white label” which is used for promotional vinyl copies of records ahead of their official release.

Where You Come From: blog visitor stats

It’s been a while since I did some navel-gazing about who reads this blog and where they come from. This week, quantixed is close to 25K views and there was a burst of people viewing an old post, which made me look again at the visitor statistics.

Where do the readers of quantixed come from?
Map of quantixed visitors by country

Well, geographically they come from all around the world. The number of visitors from each country is probably related to the population of scientists there and the geographical spread of science people on Twitter (see below). The USA is in the lead, followed by the UK, Germany, Canada, France, Spain, Australia, etc.

Where do they click from? This is pretty interesting. Most people come here from Twitter (45%), around 20% come via a Google search (mainly looking for eLife’s Impact Factor) and another ~20% come from the blog Scholarly Kitchen (!). Around 3% come from Facebook, which is pretty neat since I don’t have a profile and presumably people are linking to quantixed on there. 1% come from people clicking links that have been emailed to them – I value these hits a lot too. I guess these links are sent to people who don’t do any social media, but somebody thought the recipient should read something on quantixed. I get a few hits from blogs and sites where we’ve linked to each other. The remainder are a long list of single clicks from a wide variety of pages.

What do they read?

The traffic is telling me that quantixed doesn’t have “readers”. I think most people are one-time visitors, or at least occasional visitors. I do know which posts are popular:

  1. Strange Things
  2. Wrong Number
  3. Advice for New PIs
  4. Publication lag times I and II
  5. Violin plots
  6. Principal Component Analysis

Just as with my papers, I’ve found it difficult to predict what will be interesting to lots of people. Posts that took a long time to prepare, and were the most fun to think about, have received hardly any views. The PCA post is the most surprising, because I thought no-one would be interested in that!

I thoroughly enjoy writing quantixed and I really value the feedback that I get from people I talk to offline about it. I’m constantly amazed at who has read something on here. The question they always ask is “how do you find the time?”. And I always answer, “I don’t”. What I mean is that I don’t really have the free time to write this blog. Between the lab, home life, sleep and cycling, there is no time for this at all. The analyses you see on here take three hours or less. If anything looks tougher than that, I drop it. If draft posts aren’t interesting enough to get finished, they get canned. Writing the blog is a nice change from writing papers, grants and admin, so I don’t feel it detracts from work. One aim was to improve my programming through fun analyses, and I’ve definitely learnt a lot from that. The early posts on coding are pretty cringe-worthy. I also wanted to improve my writing, which is still a bit dry and scientific…

My favourite type of remark is when people tell me about something that they’ve read on here, not realising that I actually write this blog! Anyway, whoever you are, wherever you come from; I hope you enjoy quantixed. If you read something here and like it, please leave a comment, tweet a link or email it to a friend. The encouragement is appreciated.

The post title is taken from “Where You Come From” by Pantera. This was a difficult one to pick, but this song had the most apt title, at least.