Communication Breakdown

There is an entertaining rumour going around about the journal Nature Communications. When I heard it for the fourth or fifth time, I decided to find out whether there was any truth in it.

The rumour goes something like this: the impact factor of Nature Communications is driven by physical sciences papers.

Sometimes it is put another way: cell biology papers drag down the impact factor of Nature Communications, or they don’t deserve the journal’s high JIF tag because they are cited at lower rates. Could this be true?

TL;DR: it is true, but the effect is not as big as the rumour suggests.

Nature Communications is the megajournal that sits below the subject-specific Nature journals. Operating as an open access, pay-to-publish journal, it is a way for Springer Nature to capture revenue from papers that were good, but did not make the editorial selection for the subject-specific Nature journals. This is a long-winded way of saying that a wide variety of papers is covered by this journal, which publishes around 5,000 papers per year. This complicates any citation analysis, because we need a way to differentiate papers from different fields. I describe one method to do this below.

Quick look at the data

I had a quick look at the top 20 papers from 2016-2017 with the most citations in 2018. There certainly were a lot of non-biological papers in there. Since highly cited papers disproportionately influence the Journal Impact Factor, this suggested the rumours might be true.

Citations (2018) | Title
238 | 11.4% Efficiency non-fullerene polymer solar cells with trialkylsilyl substituted 2D-conjugated polymer as donor
226 | Circular RNA profiling reveals an abundant circHIPK3 that regulates cell growth by sponging multiple miRNAs
208 | Recommendations for myeloid-derived suppressor cell nomenclature and characterization standards
203 | High-efficiency and air-stable P3HT-based polymer solar cells with a new non-fullerene acceptor
201 | One-Year stable perovskite solar cells by 2D/3D interface engineering
201 | Massively parallel digital transcriptional profiling of single cells
177 | Array of nanosheets render ultrafast and high-capacity Na-ion storage by tunable pseudocapacitance
166 | Multidimensional materials and device architectures for future hybrid energy storage
163 | Coupled molybdenum carbide and reduced graphene oxide electrocatalysts for efficient hydrogen evolution
149 | Ti₃C₂ MXene co-catalyst on metal sulfide photo-absorbers for enhanced visible-light photocatalytic hydrogen production
149 | Balancing surface adsorption and diffusion of lithium-polysulfides on nonconductive oxides for lithium-sulfur battery design
146 | Adaptive resistance to therapeutic PD-1 blockade is associated with upregulation of alternative immune checkpoints
140 | Conductive porous vanadium nitride/graphene composite as chemical anchor of polysulfides for lithium-sulfur batteries
136 | Fluorination-enabled optimal morphology leads to over 11% efficiency for inverted small-molecule organic solar cells
134 | The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes
132 | Photothermal therapy with immune-adjuvant nanoparticles together with checkpoint blockade for effective cancer immunotherapy
131 | Enhanced electronic properties in mesoporous TiO₂ via lithium doping for high-efficiency perovskite solar cells
125 | Electron-phonon coupling in hybrid lead halide perovskites
123 | A sulfur host based on titanium monoxide@carbon hollow spheres for advanced lithium-sulfur batteries
121 | Biodegradable black phosphorus-based nanospheres for in vivo photothermal cancer therapy

Let’s dive into the data

We will use R for this analysis. If you want to follow along, the script and data can be downloaded below. With a few edits, the script will also work for a similar analysis of other journals.

First of all, I retrieved three datasets.

  • Citation data for the journal. We’ll look at the 2018 Journal Impact Factor, so we need citations in 2018 to papers published in the journal in 2016 and 2017. This can be retrieved from Scopus as a CSV file.
  • PubMed XML file for the journal, covering the articles that we want to analyse. Search term = "Nat Commun"[Journal] AND ("2016/01/01"[PDAT] : "2017/12/31"[PDAT])
  • PubMed XML file to get cell biology MeSH terms. Search term = "J Cell Sci"[Journal] AND ("2016/01/01"[PDAT] : "2017/12/31"[PDAT])

Using MeSH terms to segregate the dataset

Analysing the citation data is straightforward, but how can we classify the content of the dataset? I realised that I could use Medical Subject Headings (MeSH) from PubMed to classify the data. If I retrieved the same set of papers from PubMed and then checked which papers had MeSH terms matching those of a “biological” dataset, the citation data could be segregated. I used a set of J Cell Sci papers to do this. Note that these MeSH terms are not restricted to cell biology: they cover all kinds of biochemistry and other aspects of biology. The papers that do not match these MeSH terms are from ecology, chemistry and the physical sciences (many of these don’t have MeSH terms at all). We start by getting our biological MeSH terms.

require(XML)
require(tidyverse)
require(readr)
## extract a data frame from PubMed XML file
## This is modified from christopherBelter's pubmedXML R code
extract_xml <- function(theFile) {
  newData <- xmlParse(theFile)
  records <- getNodeSet(newData, "//PubmedArticle")
  pmid <- xpathSApply(newData,"//MedlineCitation/PMID", xmlValue)
  doi <- lapply(records, xpathSApply, ".//ELocationID[@EIdType = \"doi\"]", xmlValue)
  doi[sapply(doi, is.list)] <- NA
  doi <- unlist(doi)
  authLast <- lapply(records, xpathSApply, ".//Author/LastName", xmlValue)
  authLast[sapply(authLast, is.list)] <- NA
  authInit <- lapply(records, xpathSApply, ".//Author/Initials", xmlValue)
  authInit[sapply(authInit, is.list)] <- NA
  authors <- mapply(paste, authLast, authInit, collapse = "|")
  year <- lapply(records, xpathSApply, ".//PubDate/Year", xmlValue) 
  year[sapply(year, is.list)] <- NA
  year <- unlist(year)
  articletitle <- lapply(records, xpathSApply, ".//ArticleTitle", xmlValue) 
  articletitle[sapply(articletitle, is.list)] <- NA
  articletitle <- unlist(articletitle)
  journal <- lapply(records, xpathSApply, ".//ISOAbbreviation", xmlValue) 
  journal[sapply(journal, is.list)] <- NA
  journal <- unlist(journal)
  volume <- lapply(records, xpathSApply, ".//JournalIssue/Volume", xmlValue)
  volume[sapply(volume, is.list)] <- NA
  volume <- unlist(volume)
  issue <- lapply(records, xpathSApply, ".//JournalIssue/Issue", xmlValue)
  issue[sapply(issue, is.list)] <- NA
  issue <- unlist(issue)
  pages <- lapply(records, xpathSApply, ".//MedlinePgn", xmlValue)
  pages[sapply(pages, is.list)] <- NA
  pages <- unlist(pages)
  abstract <- lapply(records, xpathSApply, ".//Abstract/AbstractText", xmlValue)
  abstract[sapply(abstract, is.list)] <- NA
  abstract <- sapply(abstract, paste, collapse = "|")
  ptype <- lapply(records, xpathSApply, ".//PublicationType", xmlValue)
  ptype[sapply(ptype, is.list)] <- NA
  ptype <- sapply(ptype, paste, collapse = "|")
  mesh <- lapply(records, xpathSApply, ".//MeshHeading/DescriptorName", xmlValue)
  mesh[sapply(mesh, is.list)] <- NA
  mesh <- sapply(mesh, paste, collapse = "|")
  theDF <- data.frame(pmid, doi, authors, year, articletitle, journal, volume, issue, pages, abstract, ptype, mesh, stringsAsFactors = FALSE)
  return(theDF)
}
# function to separate multiple entries in one column to many columns using | separator 
# from https://stackoverflow.com/questions/4350440/split-data-frame-string-column-into-multiple-columns
split_into_multiple <- function(column, pattern = ", ", into_prefix){
  cols <- str_split_fixed(column, pattern, n = Inf)
  # Sub out the ""'s returned by filling the matrix to the right, with NAs which are useful
  cols[which(cols == "")] <- NA
  cols <- as_tibble(cols)
  # name the 'cols' tibble as 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m' 
  # where m = # columns of 'cols'
  m <- dim(cols)[2]
  names(cols) <- paste(into_prefix, 1:m, sep = "_")
  return(cols)
}

## First load the JCS data to get the MeSH terms of interest
jcsFilename <- "./jcs.xml"
jcsData <- extract_xml(jcsFilename)
# put MeSH into a df
meshData <- as.data.frame(jcsData$mesh, stringsAsFactors = FALSE)
colnames(meshData) <- "mesh"
# separate each MeSH into its own column of a df
splitMeshData <- meshData %>% 
  bind_cols(split_into_multiple(.$mesh, "[|]", "mesh")) %>%
  select(starts_with("mesh_"))
splitMeshData <- splitMeshData %>% 
  gather(na.rm = TRUE) %>%
  filter(value != "NA")
# collate key value df of unique MeSH
uniqueMesh <- unique(splitMeshData)
# this gives us a data frame of cell biology MeSH terms

Now we need to load in the Nature Communications XML data from PubMed and also get the citation data into R.

## Now use a similar procedure to load the NC data for comparison
ncFilename <- "./nc.xml"
ncData <- extract_xml(ncFilename)
ncMeshData <- as.data.frame(ncData$mesh, stringsAsFactors = FALSE)
colnames(ncMeshData) <- "mesh"
splitNCMeshData <- ncMeshData %>% 
  bind_cols(split_into_multiple(.$mesh, "[|]", "mesh")) %>%
  select(starts_with("mesh_"))
# make a new column to hold any matches of rows with MeSH terms which are in the uniqueMeSH df 
ncData$isCB <- apply(splitNCMeshData, 1, function(r) any(r %in% uniqueMesh$value))
rm(splitMeshData,splitNCMeshData,uniqueMesh)

## Next we load the citation data file retrieved from Scopus
scopusFilename <- "./Scopus_Citation_Tracker.csv"
# the structure of the file requires a little bit of wrangling, ignore warnings
upperHeader <- read_csv(scopusFilename, skip = 5)
citationData <- read_csv(scopusFilename, skip = 6)
upperList <- colnames(upperHeader)
lowerList <- colnames(citationData)
colnames(citationData) <- c(lowerList[1:7],upperList[8:length(upperList)])
rm(upperHeader,upperList,lowerList)

Next we need to perform a join to match up the PubMed data with the citation data.

## we now have two data frames, one with the citation data and one with the papers
# make both frames have a Title column
colnames(citationData)[which(names(citationData) == "Document Title")] <- "Title"
colnames(ncData)[which(names(ncData) == "articletitle")] <- "Title"
# ncData paper titles have a terminating period, so remove it
ncData$Title <- gsub("\\.$","",ncData$Title, perl = TRUE)
# add citation data to ncData data frame
allDF <- inner_join(citationData, ncData, by = "Title")

Now we’ll make some plots.

# Plot histogram with indication of mean and median
p1 <- ggplot(data=allDF, aes(allDF$'2018')) +
  geom_histogram(binwidth = 1) +
  labs(x = "2018 Citations", y = "Frequency") +
  geom_vline(aes(xintercept = mean(allDF$'2018',na.rm = TRUE)), col='orange', linetype="dashed", size=1) +
  geom_vline(aes(xintercept = median(allDF$'2018',na.rm = TRUE)), col='blue', linetype="dashed", size=1)
p1

# Group outlier papers for clarity
p2 <- allDF %>% 
  mutate(x_new = ifelse(allDF$'2018' > 80, 80, allDF$'2018')) %>% 
  ggplot(aes(x_new)) +
  geom_histogram(binwidth = 1, col = "black", fill = "gray") +
  labs(x = "2018 Citations", y = "Frequency") +
  geom_vline(aes(xintercept = mean(allDF$'2018',na.rm = TRUE)), col='orange', linetype="dashed", size=1) +
  geom_vline(aes(xintercept = median(allDF$'2018',na.rm = TRUE)), col='blue', linetype="dashed", size=1)
p2

# Plot the data for both sets of papers separately
p3 <- ggplot(data=allDF, aes(allDF$'2018')) +
  geom_histogram(binwidth = 1) +
  labs(title="",x = "Citations", y = "Count") +
  facet_grid(ifelse(allDF$isCB, "Cell Biol", "Removed") ~ .) +
  theme(legend.position = "none")
p3

The citation data look typical: highly skewed, with a few very highly cited papers and the majority (two-thirds) receiving fewer than the mean number of citations. The “cell biology” dataset and the non-cell biology dataset look pretty similar.
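As a sanity check on that shape, here is a quick simulation (not the real data, just a skewed count distribution with a similar mean) showing how a few big papers pull the mean above what most papers receive:

```r
# Simulated, heavily skewed citation counts (not the real journal data).
# A negative binomial with mean ~11.5 mimics the skew: a few highly cited
# papers pull the mean above the median, so most papers sit below the mean.
set.seed(42)
fakeCites <- rnbinom(7000, size = 1, mu = 11.5)
mean(fakeCites)                    # close to 11.5
median(fakeCites)                  # well below the mean
mean(fakeCites < mean(fakeCites))  # fraction below the mean, roughly two-thirds
```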

Now it is time to answer our main question. Do cell biology papers drag down the impact factor of the journal?

## make two new data frames, one for the cell bio papers and one for non-cell bio
cbDF <- subset(allDF,allDF$isCB == TRUE)
nocbDF <- subset(allDF,allDF$isCB == FALSE)
# print a summary of the 2018 citations to these papers for each df
summary(allDF$'2018')
summary(cbDF$'2018')
summary(nocbDF$'2018')
> summary(allDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    8.00   11.48   14.00  238.00 
> summary(cbDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    7.00   10.17   13.00  226.00 
> summary(nocbDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    9.00   13.61   16.00  238.00 

The “JIF” for the whole journal is 11.48, whereas for the non-cell biology content it is 13.61. The cell biology dataset has a “JIF” of 10.17. So the rumour is basically true, but the effect is quite mild: the rumour has it that the cell biology impact factor is much lower.

The reason “JIF” is in quotes is that it is notoriously difficult to calculate this metric. All citations are summed for the numerator, but the denominator comprises “citable items”. To get something closer to the actual JIF, we probably should remove non-citable items. These are Errata, Letters, Editorials and Retraction notices.

## We need to remove some article types from the dataset
itemsToRemove <- c("Published Erratum","Letter","Editorial","Retraction of Publication")
allArticleData <- as.data.frame(allDF$ptype, stringsAsFactors = FALSE)
colnames(allArticleData) <- "ptype"
splitallArticleData <- allArticleData %>% 
  bind_cols(split_into_multiple(.$ptype, "[|]", "ptype")) %>%
  select(starts_with("ptype_"))
# make a new column to hold any matches of rows that are non-citable items
allDF$isNCI <- apply(splitallArticleData, 1, function(r) any(r %in% itemsToRemove))
# new data frame with only citable items
allCitableDF <- subset(allDF,allDF$isNCI == FALSE)

# Plot the data after removing "non-citable items", for both sets of papers separately
p4 <- ggplot(data=allCitableDF, aes(allCitableDF$'2018')) +
  geom_histogram(binwidth = 1) +
  labs(title="",x = "Citations", y = "Count") +
  facet_grid(ifelse(allCitableDF$isCB, "Cell Biol", "Removed") ~ .) +
  theme(legend.position = "none")
p4

After removal the citation distributions look a bit more realistic (notice that the earlier versions had many items with zero citations).

Citation distributions with non-citable items removed

Now we can redo the last part.

# subset new dataframes
cbCitableDF <- subset(allCitableDF,allCitableDF$isCB == TRUE)
nocbCitableDF <- subset(allCitableDF,allCitableDF$isCB == FALSE)
# print a summary of the 2018 citations to these papers for each df
summary(allCitableDF$'2018')
summary(cbCitableDF$'2018')
summary(nocbCitableDF$'2018')
> summary(allCitableDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    8.00   11.63   14.00  238.00 
> summary(cbCitableDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    8.00   10.19   13.00  226.00 
> summary(nocbCitableDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    5.00    9.00   14.06   17.00  238.00 

Now the figures have changed. The “JIF” for the whole journal is 11.63, whereas for the non-cell biology content it would be 14.06. The cell biology dataset has a “JIF” of 10.19. To more closely approximate the JIF, we need to do:

# approximate "impact factor" for the journal
sum(allDF$'2018') / nrow(allCitableDF)
# approximate "impact factor" for the journal's cell biology content
sum(cbDF$'2018') / nrow(cbCitableDF)
# approximate "impact factor" for the journal's non-cell biology content
sum(nocbDF$'2018') / nrow(nocbCitableDF)
> # approximate "impact factor" for the journal
> sum(allDF$'2018') / nrow(allCitableDF)
[1] 11.64056
> # approximate "impact factor" for the journal's cell biology content
> sum(cbDF$'2018') / nrow(cbCitableDF)
[1] 10.19216
> # approximate "impact factor" for the journal's non-cell biology content
> sum(nocbDF$'2018') / nrow(nocbCitableDF)
[1] 14.08123

This made only a minor change, probably because the dataset is so large (7,239 papers over two years with non-citable items removed). If we were to repeat this on another journal with more front matter and fewer papers, the distinction might make a bigger difference.

Note also that my analysis uses Scopus data whereas Web of Science numbers are used for JIF calculations (thanks to Anna Sharman for prompting me to add this).

Conclusion

So the rumour is true, but the effect is not as big as people say. There’s a ~17% reduction in potential impact factor from including these papers rather than excluding them. However, these papers comprise ~63% of the corpus and bring in an estimated revenue to the publisher of $12,000,000 per annum. No journal would forgo this income in order to bump the JIF from 11.6 to 14.1.
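For the curious, the revenue estimate is simple arithmetic. The APC of roughly $5,200 is my assumption (check the journal’s current fee); the other numbers come from the analysis above:

```r
# Back-of-envelope revenue estimate; the APC of ~$5,200 is an assumption
papersTwoYears <- 7239   # citable items, 2016-2017
cbFraction <- 0.63       # fraction matching biological MeSH terms
apc <- 5200              # assumed article processing charge, USD
revenuePerYear <- papersTwoYears * cbFraction / 2 * apc
revenuePerYear           # ~ $11.9M per annum
```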

It is definitely not true that these papers are under-performing. Their citation rates are similar to those in the best journals in the field. Note that citation rates do not necessarily reflect the usefulness of a paper; for one thing, they are largely an indicator of the size of a research field. Anyhow, next time you hear this rumour from someone, you can set them straight.

And I nearly managed to write an entire post without mentioning that JIF is a terrible metric, especially for judging individual papers in a journal, but then you knew that didn’t you?

The post title comes from “Communication Breakdown” by the mighty Led Zeppelin, from their debut album. I was really tempted to go with “Dragging Me Down” by Inspiral Carpets, but Communication Breakdown was too good to pass up.

Rollercoaster IV: ups and downs of Google Scholar citations

Time for an update to a previous post. For the past few years, I have been using an automated process to track citations to my lab’s work on Google Scholar (details of how to set this up are at the end of this post).

Because of the way Google Scholar tracks citations, citations get added (hooray!) but can also be removed (booo!). Using a daily scrape of the data, it is possible to watch this happening. The plots below show the total citations to my papers, and then a version where we only consider the net daily change.

Four years of tracking citations on Google Scholar

The general pattern is for papers to accrue citations and some do so faster than others. You can also see that the number of citations occasionally drops down. Remember that we are looking at net change here. So a decrease of one citation is masked by the addition of one citation and vice versa. Even so, you can see net daily increases and even decreases.
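In R terms, the net daily change is just the first difference of the cumulative totals (toy numbers here; the real series comes from the daily scrape):

```r
# Net daily change from cumulative citation totals (toy numbers)
totals <- c(100, 101, 101, 100, 103)  # total citations recorded each day
netChange <- diff(totals)             # 1, 0, -1, 3
# a removal and an addition on the same day cancel out, hence "net"
```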

It’s difficult to see what is happening down at the bottom of the graph so let’s separate them out. The two plots below show the net change in citations, either on the same scale (left) or scaled to the min/max for that paper (right).

Citation tracking for individual papers

The papers are shown here ranked from the ones that accrued the most citations down to the ones that gained no citations while they were tracked. Five “new” papers began to be tracked very recently. This is because I changed the way that the data are scraped (more on this below).

The version on the right reveals a few interesting things. Firstly, there seem to be “bump days” where all of the papers get a jolt in one direction or another. This could be something internal to Google, or the addition of several items which all happen to cite a bunch of my papers. The latter explanation is unlikely, given the frequency of changes seen in the whole dataset. Secondly, some papers are highly volatile, with daily toggling of citation numbers. I have no idea why this might be. The two plots below demonstrate these two points. The arrow shows a “bump day”. The plot on the right shows two review papers that have volatile citation numbers.

I’m going to keep the automated tracking going. I am a big fan of Google Scholar, as I have written previously, but quoting some of the numbers makes me uneasy, knowing how unstable they are.

Note that you can use R to get aggregate Google Scholar data as I have written about previously.

How did I do it?

The analysis would not be possible without automation. I use a daemon to run a shell script every day. This script calls a python routine which outputs the data to a file. I wrote something in Igor to load each day’s data, crunch the numbers and make the graphs. The details of this part are in the previous post.

I realised that I wasn’t getting all of my papers using the previous shell script. Well, this is a bit of a hack, but I changed the calls that I make to scholar.py so that I request data from several years.

#!/bin/bash
cd /directory/for/data/
python scholar.py -c 500 --author "Sam Smith" --after=1999 --csv > g1999.csv
sleep $[ ( $RANDOM % 15 )  + 295 ]
# and so on
python scholar.py -c 500 --author "Sam Smith" --after=2019 --csv > g2019.csv
OF=all_$(date +%Y%m%d).csv
cat g*.csv > $OF
rm g*.csv

I found that I got different results for each year I made the query. My first change was to just request all years using a loop to generate the calls. This resulted in an IP ban for 24 hours! Through a bit of trial-and-error I found that reducing the queries to ten and waiting a polite amount of time between queries avoided the ban.
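The ten-query loop ended up looking something like this (the exact years shown are hypothetical; the real script uses whichever years cover most of the papers). The python calls are echoed rather than executed, so the loop can be dry-run without hitting Google:

```shell
#!/bin/bash
# Dry-run sketch of the ten-query loop; swap 'echo' for the real call
# and uncomment the sleep before using it in anger.
years="1999 2004 2008 2011 2013 2015 2016 2017 2018 2019"
n=0
for y in $years; do
  echo "python scholar.py -c 500 --author \"Sam Smith\" --after=$y --csv > g$y.csv"
  n=$((n + 1))
  # sleep $(( (RANDOM % 15) + 295 ))   # roughly 5 minutes between queries
done
```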

The hacky part was to figure out which year requests I needed to make to make sure I got most of my papers. There is probably a better way to do this!

I still don’t get every single paper and I retrieve data for a number of papers on which I am not an author – I have no idea why! I exclude the erroneous papers using the Igor program that reads all the data and plots out everything. The updated version of this code is here.

As described earlier I have many Rollercoaster songs in my library. This time it’s the song by Sleater-Kinney from their “The Woods” album.

Not What You Want: our new paper on a side effect of GFP nanobodies

We have a new preprint out – it is a cautionary tale about using GFP nanobodies in cells. This short post gives a bit of background to the work. Please read the paper if you are interested in using GFP nanobodies in cells; you can find it here.

Paper in a nutshell: Caution is needed when using GFP nanobodies because they can inhibit their target protein in cells.

People who did the work: Cansu Küey did most of the work for the paper. She discovered the inhibition side effect of the dongles. Gabrielle Larocque contributed a figure where she compared dongle-knocksideways with regular knocksideways. The project was initiated by Nick Clarke who made our first set of dongles and tested which fluorescent proteins the nanobody binds in cells. Lab people and their profiles can be found here.

Background: Many other labs have shown that nanobodies can be functionalised so that you can stick new protein domains onto GFP-tagged proteins to do new things. This is useful because it means you can “retrofit” an existing GFP knock-in cell line or organism to do new things like knocksideways without making new lines. In fact there was a recent preprint which described a suite of functionalised nanobodies that can confer all kinds of functions to GFP.

Like many other labs we were working on this method. We thought functionalised GFP nanobodies resembled “dongles” – those adaptors that Apple makes so much money from – that convert one port to another.

Dongles, dongles, dongles… (photo by Rex Hammock, licensed for reuse https://www.flickr.com/photos/rexblog/5575298582)

A while back we made several different dongles. We were most interested in a GFP nanobody with an additional FKBP domain, which would allow us to do knocksideways (or FerriTagging) in GFP knock-in cells. For those who don’t know, knocksideways is not a knockout or a knockdown, but a way of sending a protein somewhere else in the cell to inactivate it. The most common place to send a protein is the mitochondria.

Knocksideways works by joining FKBP and FRB (on the mitochondria) using rapamycin. Normally FKBP is fused to the protein of interest (top). If we just have a GFP tag, we can’t do knocksideways (middle). If we add a dongle (bottom) we can attach FKBP domains to allow knocksideways to happen.

We found that dongle-knocksideways works really well and we were very excited about this method. Here we are removing GFP-clathrin from the mitotic spindle in seconds using dongle knocksideways.

GFP-clathrin, shown here in blue is sent to the mitochondria (yellow) using rapamycin. This effect is only possible because of the dongle which adds FKBP to GFP via a GFP nanobody.

Since there are no specific inhibitors of endocytosis, we thought dongle knocksideways would be cool to try in cells with dynamin-2 tagged with GFP at both alleles. There is a line from David Drubin’s lab which is widely used. This would mean we could put the dongle plasmids on Addgene and everyone could inhibit endocytosis on-demand!

Initial results were encouraging. We could put dynamin onto mitochondria alright.

Dynamin-2-GFP undergoing dongle-knocksideways. The Mitotrap is shown in red and dynamin is green.

But we hit a problem. It turns out that dongle binding to dynamin inhibits endocytosis. So we have unintended inhibition of the target protein. This is a big problem because the power of knocksideways comes from being able to observe normal function and then rapidly switch it off. So if there is inhibition before knocksideways, the method is useless.

Now, this problem could be specific to dynamin or it might be a general problem with all targets of dongles. Either way, we’ve moved away from this method and written this manuscript to alert others to the side effects of dongles. We discuss possible ways forward for the method and also point out some applications of the nanobody technology that are unaffected by our observations.

The post title comes from “Not What You Want” by Sleater-Kinney from their wonderful Dig Me Out record.

Five Get Over Excited: Academic papers that cite quantixed posts

Anyone who maintains a website is happy that people out there are interested enough to visit. Web traffic is one thing, but I take the greatest pleasure in seeing quantixed posts cited in academic papers.

I love the fact that some posts on here have been cited in the literature more than some of my actual papers.

It’s difficult to track citations to web resources. This is partly my fault: I think it is possible to register posts so that they have a DOI, but I have not done this, so tracking is difficult. Websites are part of what is known as the grey literature: items that are not part of traditional academic publishing.

The most common route for me to discover that a post has been cited is when I actually read the paper. There are four examples that spring to mind: here, here, here and here. With these papers, I read the paper and was surprised to find quantixed cited in the bibliography.

Vanity and curiosity made me wonder if there were other citations I didn’t know about. A cited reference search in Web of Science pulled up two more: here and here.

A bit of Googling revealed yet more citations, e.g. two quantixed posts are cited in this book. And another citation here.

OK so quantixed is not going to win any “highly cited” prizes or develop a huge H-index (if something like that existed for websites). But I’m pleased that 1) there are this many citations given that there’s a bias against citing web resources, and 2) the content here has been useful to others, particularly for academic work.

All of these citations are to posts looking at impact factors, metrics and publication lag times. In terms of readership, these posts get sustained traffic, but currently the most popular posts on quantixed are the “how to” guides, with LaTeX to Word and Back seeing the most traffic. Somewhere in between citation and web traffic are cases where quantixed posts get written about elsewhere, e.g. in a feature in Nature by Kendall Powell.

The post title comes from “Five Get Over Excited” by The Housemartins. A band with a great eye for song titles, it can be found on the album “The People Who Grinned Themselves to Death”.

One With The Freaks – very highly cited papers in biology

I read this recent paper about very highly cited papers and science funding in the UK. The paper itself was not very good, but the dataset which underlies the paper is something to behold, as I’ll explain below.

The idea behind the paper was to examine very highly cited papers in biomedicine with a connection to the UK. Have those authors been successful in getting funding from the MRC, Wellcome Trust or NIHR? The authors find that some of the authors of these very highly cited papers are not funded by these sources. Note that these funders are some, but not all, of the science funding bodies in the UK. The authors also looked at panel members of those three funders, and report that these individuals are funded at high rates and that the overlap between panel membership and very highly cited authorship is very low. I don’t want to critique the paper extensively, but the conclusions drawn are rather blinkered, for a few reasons:

  • MRC, NIHR and Wellcome support science in ways other than direct funding of individuals (e.g. PhD programmes, infrastructure etc.).
  • The contribution of other funders, e.g. BBSRC, was ignored.
  • Panels tend to be selected from the pool of awardees, rather than the other way around.

I understand that the motivation of the authors is to stimulate debate around whether science funding is effective, and this is welcome, but the paper strays too far into clickbait territory for my tastes.

The most interesting thing about the analysis (and arguably its main shortcoming) was the dataset. The authors took the papers in Scopus which have been cited >1000 times. This is ~450 papers as of last week. As I found out when I recreated their dataset, this is a freakish set of papers. Of course weird things can be found when looking at outliers.

Dataset of 20,000 papers from Scopus (see details below)

The authors describe a one-line search term they used to retrieve papers from Scopus. These papers span 2005 to the present day and were then filtered for UK origin.

LANGUAGE ( english ) AND PUBYEAR > 2005 AND ( LIMIT-TO ( SRCTYPE , "j " ) ) AND ( LIMIT-TO (DOCTYPE , "ar " ) ) AND ( LIMIT-TO ( SUBJAREA , "MEDI" ) OR LIMIT-TO ( SUBJAREA , "BIOC" ) OR LIMIT-TO (SUBJAREA , "PHAR" ) OR LIMIT-TO ( SUBJAREA , "IMMU" ) OR LIMIT-TO ( SUBJAREA , "NEUR" ) OR LIMIT-TO ( SUBJAREA , "NURS" ) OR LIMIT-TO ( SUBJAREA , "HEAL" ) OR LIMIT-TO ( SUBJAREA , "DENT" ) )

I’m not sure how accurate the dataset is in terms of finding papers of UK origin, but the point here is to look at the dataset and not to critique the paper.

I downloaded the first 20,000 (a limitation of Scopus). I think posting the dataset here would break the terms of use, but if your institution has a subscription, it can be recreated. The top paper has 16,549 citations! The 20,000th paper has accrued 122 citations, and the papers with >1000 citations account for 450 papers as of last week.

Now, some papers are older than others, so I calculated the average citation rate by dividing total cites by the number of years since publication, to get a better picture of the hottest among these freaky papers. The two colour-coded plots show the years since publication. It is possible to see some young papers which are being cited at an even higher rate than the pack. These will move up the ranking faster than their neighbours over the next few months.
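The per-year calculation is straightforward. Here is a sketch using a few toy rows in place of the Scopus download (the column names are mine and the real export differs; the first two rows use figures from the table below, the third is hypothetical), with 2019 as the year of analysis:

```r
library(tibble)
library(dplyr)
# Toy stand-in for the Scopus download; annualise cites by the age of the paper
papers <- tibble(
  title = c("Clustal W and Clustal X version 2.0",
            "GLOBOCAN 2012",
            "A young, fast-moving paper"),
  year  = c(2007, 2015, 2018),
  cites = c(16549, 10352, 1500)
)
ranked <- papers %>%
  mutate(citesPerYear = cites / (2019 - year)) %>%
  arrange(desc(citesPerYear))
# younger papers can out-rank older ones on this measure
```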

Just looking at the “Top 20” is amazing. These papers are being cited at rates of approximately 1000 times per year. The paper ranked 6th is a young paper which is cited at a very high rate and will likely move up the ranking. So what are these freakish papers?

In the table below, I’ve pasted the top 20 of the highly cited paper dataset. They are a mix of clinical consortia papers and bioinformatics tools for sequence and structural analysis. The tools make sense: they are used in a huge number of papers and are heavily cited as a result. In fact, these citation counts are probably underestimates, since citations to software often get missed out of papers. The clinical papers are also useful to large fields. They have many authors, and there is a network effect to their citation which can drive up the cites to these items (this is noted in the paper I described above). Even though the data are as expected, I was amazed by the magnitude of the citation counts and the rates at which these works are acquiring citations. The topics of the papers beyond the top 20 are pretty similar.

There’s no conclusion for this post. There is a tiny subset of papers out there with freakishly high citation rates. We should simply marvel at them…

| Rank | Title | Year | Journal | Total cites |
| --- | --- | --- | --- | --- |
| 1 | Clustal W and Clustal X version 2.0 | 2007 | Bioinformatics | 16549 |
| 2 | The Sequence Alignment/Map format and SAMtools | 2009 | Bioinformatics | 13586 |
| 3 | Fast and accurate short read alignment with Burrows-Wheeler transform | 2009 | Bioinformatics | 12653 |
| 4 | PLINK: A tool set for whole-genome association and population-based linkage analyses | 2007 | American Journal of Human Genetics | 12241 |
| 5 | Estimates of worldwide burden of cancer in 2008: GLOBOCAN 2008 | 2010 | International Journal of Cancer | 11047 |
| 6 | Cancer incidence and mortality worldwide: Sources, methods and major patterns in GLOBOCAN 2012 | 2015 | International Journal of Cancer | 10352 |
| 7 | PHENIX: A comprehensive Python-based system for macromolecular structure solution | 2010 | Acta Crystallographica Section D: Biological Crystallography | 10093 |
| 8 | Phaser crystallographic software | 2007 | Journal of Applied Crystallography | 9617 |
| 9 | New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1) | 2009 | European Journal of Cancer | 9359 |
| 10 | Features and development of Coot | 2010 | Acta Crystallographica Section D: Biological Crystallography | 9241 |
| 11 | Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities | 2009 | Applied and Environmental Microbiology | 8127 |
| 12 | BEAST: Bayesian evolutionary analysis by sampling trees | 2007 | BMC Evolutionary Biology | 8019 |
| 13 | Improved survival with ipilimumab in patients with metastatic melanoma | 2010 | New England Journal of Medicine | 7293 |
| 14 | OLEX2: A complete structure solution, refinement and analysis program | 2009 | Journal of Applied Crystallography | 7173 |
| 15 | Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: A systematic analysis for the Global Burden of Disease Study 2010 | 2012 | The Lancet | 6296 |
| 16 | New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0 | 2010 | Systematic Biology | 6290 |
| 17 | The MIQE guidelines: Minimum information for publication of quantitative real-time PCR experiments | 2009 | Clinical Chemistry | 6086 |
| 18 | The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials | 2011 | BMJ (Online) | 6065 |
| 19 | Velvet: Algorithms for de novo short read assembly using de Bruijn graphs | 2008 | Genome Research | 5550 |
| 20 | A comparative risk assessment of burden of disease and injury attributable to 67 risk factors and risk factor clusters in 21 regions, 1990-2010: A systematic analysis for the Global Burden of Disease Study 2010 | 2012 | The Lancet | 5499 |

The post title comes from “One With The Freaks” by The Notwist.

All That Noise: The vesicle packing problem

This week Erick Martins Ratamero and I put up a preprint on vesicle packing. This post is a bit of backstory but please take a look at the paper, it’s very short and simple.

The paper started when I wanted to know how many receptors could fit in a clathrin-coated vesicle. Sounds like a simple problem – but it’s actually more complicated.

Of course, this problem is not as simple as calculating the surface area of the vesicle, the cross-sectional area of the receptor and dividing one by the other. The images above show the problem. The receptors would be the dimples on the golf ball… they can’t overlap… how many can you fit on the ball?
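For reference, the naive area-ratio estimate ruled out above takes only a few lines. Here is a minimal Python sketch (the radii and the hexagonal packing efficiency are illustrative assumptions, not numbers from our paper):

```python
import math

def naive_packing_estimate(vesicle_radius_nm, cargo_radius_nm, efficiency=0.9069):
    """Back-of-the-envelope upper bound: vesicle surface area divided by cargo
    cross-sectional area, scaled by the 2D hexagonal packing efficiency (~0.9069).
    It treats the curved surface as flat, which is exactly why the real problem
    is harder than this."""
    sphere_area = 4.0 * math.pi * vesicle_radius_nm ** 2
    cargo_footprint = math.pi * cargo_radius_nm ** 2
    return efficiency * sphere_area / cargo_footprint

# illustrative only: a 25 nm radius vesicle and a 1.9 nm radius cargo footprint
print(round(naive_packing_estimate(25.0, 1.9)))
```

The estimate scales with the square of the radius ratio, so it badly misjudges small vesicles or large cargo, where curvature matters most.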

It turns out that a PhD student working in Groningen in 1930 posed a similar problem (known as the Tammes Problem) in his thesis. His concern was the even pattern of pores on a pollen grain, but the root of the problem is the Thomson Problem. This is the minimisation of energy that occurs when charged particles are on a spherical surface. The particles must distribute themselves as far away as possible from all other particles.

There are very few analytical solutions to the Tammes Problem (presently only N = 3–14 and 24 are solved). Anyhow, our vesicle packing problem is the inverse: for a vesicle of a certain size and cargo of a certain size, how many cargo molecules can we fit?
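To give a flavour of how a stochastic solver works, here is a toy, standard-library-only Python sketch of Thomson-style relaxation: random points on a sphere repel each other until they spread out evenly. This is my simplification for illustration, not the code we used:

```python
import math
import random

def random_sphere_point(rng):
    """Uniform random point on the unit sphere."""
    z = rng.uniform(-1.0, 1.0)
    phi = rng.uniform(0.0, 2.0 * math.pi)
    r = math.sqrt(1.0 - z * z)
    return [r * math.cos(phi), r * math.sin(phi), z]

def repel(pts, steps=2000, step_size=0.01):
    """Push each point away from all others (repulsion decaying with distance),
    re-projecting onto the sphere after every step."""
    for _ in range(steps):
        forces = [[0.0, 0.0, 0.0] for _ in pts]
        for i in range(len(pts)):
            for j in range(len(pts)):
                if i == j:
                    continue
                d = [pts[i][k] - pts[j][k] for k in range(3)]
                dist2 = sum(x * x for x in d) + 1e-9
                for k in range(3):
                    forces[i][k] += d[k] / dist2
        for i in range(len(pts)):
            p = [pts[i][k] + step_size * forces[i][k] for k in range(3)]
            norm = math.sqrt(sum(x * x for x in p))
            pts[i] = [x / norm for x in p]
    return pts

def min_angle(pts):
    """Smallest angular separation (radians) between any pair of points."""
    best = math.pi
    for i in range(len(pts)):
        for j in range(i + 1, len(pts)):
            dot = max(-1.0, min(1.0, sum(pts[i][k] * pts[j][k] for k in range(3))))
            best = min(best, math.acos(dot))
    return best

rng = random.Random(42)
pts = repel([random_sphere_point(rng) for _ in range(4)])
# for N = 4 the exact optimum is a tetrahedron: acos(-1/3) ≈ 109.47°
print(round(math.degrees(min_angle(pts)), 1))
```

For the packing question you run this for increasing N until the minimum separation drops below the angular size of the cargo.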

Fortunately, stochastic Tammes solvers like this one are available, and we could adapt one. It turns out that the number of receptors that can be packed is enormous: almost 800 G protein-coupled receptors could fit on the surface of a typical clathrin-coated vesicle. Note that this doesn’t take into account steric hindrance and assumes that the vesicle carries nothing else. Full details are in the paper.

Why does this matter? Many labs are developing ways to count molecules in cellular structures by light or electron microscopy. We wanted to have a way to check that our results were physically possible. For example, if we measure 1000 GPCRs in a clathrin-coated vesicle, we know something has gone wrong.

What else? This paper ticked a few things off my publishing bucket list: a solely theoretical paper, a coffee-break idea paper, and one on a “fun” subject. Erick has form with theoretical/fun papers, having previously published on modelling peloton dynamics in procycling.

We figured the paper was more substantial than a blog post yet too minimal to send to a journal. So unless a journal wants to publish it (and gets in touch with us), this will be my first preprint where bioRxiv is the final destination.

We got a sense that people might be interested in an answer to the vesicle packing problem because whenever we asked people for an estimate, we got hugely different answers! The paper has been well-received so far. We’ve had quite a few comments on Twitter and we’re glad that we wrote up the work.

The post title comes from the “All That Noise” LP by The Darkside. I picked this not because of the title, but because of the cover.

All That Noise cover shows a packing problem on a sphere

A Certain Ratio: Gender ratio in our papers

I saw today on Twitter that a few labs were examining the gender balance of their papers and posting the ratios of male:female authors. It started with this tweet.

This analysis is simple to perform, but interpreting it can be hard. For example, is the research group gender balanced to start with? How many of the authors are collaborators? Nonetheless, I have the data for all of my papers, so I thought I’d take a quick look too.

To note: the papers are organised chronologically from top to bottom. They include my papers from before I was a PI (first eight papers). Only research papers are listed, no reviews or methods papers and only those from my lab (i.e. collaborative papers where I am not corresponding author are excluded).

Female authors are blue-green, males are orange. I am a dark orange colour. The blocks are organised according to author list. Joint first authors are boxed.

To the right is a graphic to show the gender ratio. The size of the circle indicates the number of authors on the paper. This is because a paper with M:F ratio of 1 is excusable with 2 or 3 authors, but not with 8. Most of our papers have only a few authors so it’s not a great metric.

On the whole the balance is pretty good. Men and women are equally likely to be first author and both are well represented in the author list. On the other hand, the lab has always had a healthy gender balance, so I would’ve been surprised to find otherwise.

Edit: I replaced the graphic above after a few errors were pointed out to me. Specifically, some authors who were added to three papers during revisions were missing from the list I used. It was also suggested that removing myself from the analysis would be a good idea! This was a good suggestion and the corresponding graphic is below.

The post title is taken from the band A Certain Ratio, a Factory Records staple from the 1980s.

For What It’s Worth: Influence of our papers on our papers

This post is about a citation analysis that didn’t quite work out.

I liked this blackboard project by Manuel Théry looking at the influence of each paper authored by David Pellman’s lab on the future directions of the Pellman lab.

It reminds me that papers can have impact in the field while others might be influential to the group itself. I wondered which of the papers on which I’m an author have been most influential to my other papers and whether this correlates with a measure of their impact on the field.

There’s no code in this post. I retrieved the relevant records from Scopus and used the difference between the “with” and “without” self-citation counts to pull together the numbers.

Influence: I used the number of citations to a paper from any of our other papers as its self-citation count. This was divided by the number of our papers published subsequently. So if I have 50 papers, and the 23rd paper published has collected 27 self-citations, it scores 1: neither the 23rd paper itself nor any of the preceding 22 papers can cite it, but the 27 that follow could. This is our metric for influence.

Impact: As a measure of general impact I took the total number of citations for each paper and divided this by the number of years since publication to get average cites per year for each paper.
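Both metrics are easy to compute once the Scopus numbers are in hand. A minimal Python sketch (the field names and the worked example are mine):

```python
def influence_and_impact(papers, current_year=2019):
    """papers: chronologically ordered dicts with 'year', 'total_cites' and
    'self_cites' (citations from our own later papers).
    Returns one (influence, impact) pair per paper, as defined above."""
    n = len(papers)
    results = []
    for i, p in enumerate(papers):
        future = n - (i + 1)  # only papers published after this one can self-cite it
        influence = p["self_cites"] / future if future else float("nan")
        impact = p["total_cites"] / max(current_year - p["year"], 1)
        results.append((influence, impact))
    return results

# the worked example from the text: 50 papers, the 23rd has 27 self-citations
papers = [{"year": 1995 + i, "total_cites": 0, "self_cites": 0} for i in range(50)]
papers[22]["self_cites"] = 27  # 27 papers follow the 23rd one
print(influence_and_impact(papers)[22][0])  # → 1.0
```

Since Scopus reports citation counts with and without self-citations, `self_cites` is just the difference between the two.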

Plot of influence against impact

Reviews and methods papers are shown in blue, while research papers are in red. I was surprised that some papers have been cited by as many as half of the papers that followed them.

Generally, the articles that were most influential to us were also the papers with the biggest impact, although the correlation is not very strong. There is an obvious outlier paper that gets 30 cites per year (over a 12-year period, I should say), but it has not influenced our work as much as other papers have. This is partly because the paper is a citation magnet and partly because we’ve stopped working on this topic in the last few years.

Obviously, the most recent papers were the least informative. There are no future papers to test if they were influential and there are few citations so far to understand their impact.

It’s difficult to say what the correlation between impact and influence on our own work really means, if anything. Does it mean that we have tended to pursue projects because of their impact (I would hope not)? Perhaps these papers are generally useful to the field and to us.

In summary, I don’t think this analysis was successful. I had wanted to construct some citation networks – similar to the Pellman tree idea above – to look at influence in more detail, but I lost confidence in the method. Many of our self-citations are for methodological reasons and so I’m not sure if we’re measuring influence or utility here. Either way, the dataset is not big enough (yet) to do more meaningful number crunching. Having said this, the approach I’ve described here will work for any scholar and could be done at scale.

There are several song titles in the database called ‘For What It’s Worth’. This one is Chapterhouse on Rownderbout.


Ferrous: new paper on FerriTagging proteins in cells

We have a new paper out. It’s not exactly news, because the paper has been up on bioRxiv since December 2016 and hasn’t changed too much. All of the work was done by Nick Clarke when he was a PhD student in the lab. This post is to explain our new paper to a general audience.

The paper in a nutshell

We have invented a new way to tag proteins in living cells so that you can see them by light microscopy and by electron microscopy.

Why would you want to do that?

Proteins do almost all of the jobs in cells that scientists want to study. We can learn a lot about how proteins work by simply watching them down the microscope. We want to know their precise location. Light microscopy means that the cells are alive and we can watch the proteins move around. It’s a great method but it has low resolution, so seeing a protein’s precise location is not possible. We can overcome this limitation by using electron microscopy. This gives us higher resolution, but the proteins are stuck in one location. When we correlate images from one microscope to the other, we can watch proteins move and then look at them with high resolution. All we need is a way to see the proteins so that they can be seen in both types of microscope. We do this with tagging.

Tagging proteins so that we can see them by light microscopy is easy. A widely used method is to use a fluorescent protein such as GFP. We can’t see GFP in the electron microscope (EM) so we need another method. Again, there are several tags available but they all have drawbacks. They are not precise enough, or they don’t work on single proteins. So we came up with a new one and fused it with a fluorescent protein.

What is your EM tag?

We call it FerriTag. It is based on Ferritin which is a large protein shell that cells use to store iron. Because iron scatters electrons, this protein shell can be seen by EM as a particle. There was a problem though. If Ferritin is fused to a protein, we end up with a mush. So, we changed Ferritin so that it could be attached to the protein of interest by using a drug. This meant that we could put the FerriTag onto the protein we want to image in a few seconds. In the picture on the right you can see how this works to FerriTag clathrin, a component of vesicles in cells.

We can watch the tagging process happening in cells before looking by EM. The movie on the right shows green spots (clathrin-coated pits in a living cell) turning orange/yellow when we do FerriTagging. The cool thing about FerriTag is that it is genetically encoded. That means that we get the cell to make the tag itself and we don’t have to put it in from outside which would damage the cell.

What can you use FerriTag for?

Well, it can be used to tag many proteins in cells. We wanted to precisely localise a protein called HIP1R which links clathrin-coated pits to the cytoskeleton. We FerriTagged HIP1R and carried out what we call “contextual nanoscale mapping”. This is just a fancy way of saying that we could find the FerriTagged HIP1R and map where it is relative to the clathrin-coated pit. This allowed us to see that HIP1R is found at the pit and surrounding membrane. We could even see small changes in the shape of HIP1R in the different locations.

We’re using FerriTag for lots of projects. Our motivation to make FerriTag was so that we could look at proteins that are important for cell division and this is what we are doing now.

Is the work freely available?

Yes! The paper is available here under CC-BY licence. All of the code we wrote to analyse the data and run computer simulations is available here. All of the plasmids needed to do FerriTagging are available from Addgene (a non-profit company, there is a small fee) so that anyone can use them in the lab to FerriTag their favourite protein.

How long did it take to do this project?

Nick worked for four years on this project. Our first attempt at using ribosomes to tag proteins failed, but Nick then managed to get Ferritin working as a tag. This paper has broken our lab record for longest publication delay from first submission to final publication. The diagram below tells the whole saga.


The publication process was frustratingly slow. It took a few months to write the paper and then we submitted to the first journal after Christmas 2016. We got a rapid desk rejection and sent the paper to another journal and it went out for review. We had two positive referees and one negative one, but we felt we could address the comments and checked with the journal who said that they would consider a revised paper as an appeal. We did some work and resubmitted the paper. Almost six months after first submission the paper was rejected, but with the offer of a rapid (ha!) publication at Nature Communications using the peer review file from the other journal.

Hindsight is a wonderful thing, but I now regret agreeing to transfer the paper to Nature Communications. It was far from rapid. They drafted in a new reviewer who came with a list of new questions, as well as being slow to respond. Sure, a huge chunk of the delay was caused by us doing revision experiments (the revisions took longer than they should have because Nick defended his PhD, was working on other projects and also became a parent). However, the journal was really slow. The Editor assigned to our paper left the journal, which didn’t help, and the reviewer they drafted in was slow to respond each time (6 and 7 weeks, respectively). Particularly at the end: after the paper was ‘accepted in principle’, it took them three weeks to actually accept the paper (seemingly a week to figure out what a bib file is and another to ask us something about chi-squared tests), a further three weeks to send us the proofs, and then another three weeks until publication. You can see from the graphic that we sent back the paper in the third week of February and only incurred a 9-day delay ourselves, yet the paper was not published until July.

Did the paper improve as a result of this process? Yes and no. We actually added some things in the first revision cycle (for Journal #2) that got removed in subsequent peer review cycles! And the message in the final paper is exactly the same as the version on bioRxiv, posted 18 months previously. So in that sense, no it didn’t. It wasn’t all a total waste of time though, the extra reviewer convinced us to add some new analysis which made the paper more convincing in the end. Was this worth an 18-month delay? You can download our paper and the preprint and judge for yourself.

Were we unlucky with this slow experience? Maybe, but I know other authors who’ve had similar (and worse) experiences at this journal. As described in a previous post, the publication lag times are getting longer at Nature Communications. This suggests that our lengthy wait is not unique.

There’s lots to like about this journal:

  • It is open access.
  • It has the Nature branding (which, like it or not, impresses many people).
  • The peer review file is available.
  • The papers look great (in print and online).

But there are downsides too.

  • The APC for each paper is £3300 ($5200). Obviously open access must cost something, but there are cheaper OA journals available (albeit without the Nature branding).
  • Ironically, paying a premium for this reputation is complicated since the journal covers a wide range of science and its kudos varies depending on subfield.
  • It’s also slow, and especially so when you consider that papers have often transferred here from somewhere else.
  • It’s essentially a mega journal, so your paper doesn’t get the same exposure as it would in a community-focused journal.
  • There’s the whole ReadCube/SpringerNature thing…

Overall, this was a negative publication experience. Transferring a paper along with its peer review file to another journal has worked out well for us recently, and has been rapid, but not this time. Please leave a comment, particularly if you’ve had a positive experience, to redress the balance.

The post title comes from “Ferrous” by Circle from their album Meronia.

Rollercoaster III: yet more on Google Scholar

In a previous post I made a little R script to crunch Google Scholar data for a given scientist. The graphics were done in base R and looked a bit ropey. I thought I’d give the code a spring clean – it’s available here. The script is called ggScholar.R (rather than gScholar.R). Feel free to run it and raise an issue or leave a comment if you have some ideas.

I’m still learning how to get things looking how I want them using ggplot2, but this is an improvement on the base R version.

As described earlier, I have many Rollercoaster songs in my library. This time it’s the song and album by slowcore/dream pop outfit Red House Painters.