Communication Breakdown

There is an entertaining rumour going around about the journal Nature Communications. When I heard it for the fourth or fifth time, I decided to check out whether there is any truth in it.

The rumour goes something like this: the impact factor of Nature Communications is driven by physical sciences papers.

Sometimes it is put another way: cell biology papers drag down the impact factor of Nature Communications, or they don’t deserve the journal’s high JIF tag because they are cited at lower rates. Could this be true?

TL;DR it is true, but the effect is not as big as the rumour suggests.

Nature Communications is the megajournal that sits below the subject-specific Nature journals. Operating as an open access, pay-to-publish journal, it is a way for Springer Nature to capture revenue from papers that were good but did not make the editorial selection for the subject-specific Nature journals. This is a long-winded way of saying that a wide variety of papers is covered by this journal, which publishes around 5,000 papers per year. This complicates any citation analysis because we need a way to differentiate papers from different fields. I describe one method to do this below.

Quick look at the data

I had a quick look at the top 20 papers from 2016-2017 with the most citations in 2018. There certainly were a lot of non-biological papers in there. Since highly cited papers disproportionately influence the Journal Impact Factor, this suggested the rumours might be true.

Citations (2018) | Title
238 | 11.4% Efficiency non-fullerene polymer solar cells with trialkylsilyl substituted 2D-conjugated polymer as donor
226 | Circular RNA profiling reveals an abundant circHIPK3 that regulates cell growth by sponging multiple miRNAs
208 | Recommendations for myeloid-derived suppressor cell nomenclature and characterization standards
203 | High-efficiency and air-stable P3HT-based polymer solar cells with a new non-fullerene acceptor
201 | One-Year stable perovskite solar cells by 2D/3D interface engineering
201 | Massively parallel digital transcriptional profiling of single cells
177 | Array of nanosheets render ultrafast and high-capacity Na-ion storage by tunable pseudocapacitance
166 | Multidimensional materials and device architectures for future hybrid energy storage
163 | Coupled molybdenum carbide and reduced graphene oxide electrocatalysts for efficient hydrogen evolution
149 | Ti3C2 MXene co-catalyst on metal sulfide photo-absorbers for enhanced visible-light photocatalytic hydrogen production
149 | Balancing surface adsorption and diffusion of lithium-polysulfides on nonconductive oxides for lithium-sulfur battery design
146 | Adaptive resistance to therapeutic PD-1 blockade is associated with upregulation of alternative immune checkpoints
140 | Conductive porous vanadium nitride/graphene composite as chemical anchor of polysulfides for lithium-sulfur batteries
136 | Fluorination-enabled optimal morphology leads to over 11% efficiency for inverted small-molecule organic solar cells
134 | The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes
132 | Photothermal therapy with immune-adjuvant nanoparticles together with checkpoint blockade for effective cancer immunotherapy
131 | Enhanced electronic properties in mesoporous TiO2 via lithium doping for high-efficiency perovskite solar cells
125 | Electron-phonon coupling in hybrid lead halide perovskites
123 | A sulfur host based on titanium monoxide@carbon hollow spheres for advanced lithium-sulfur batteries
121 | Biodegradable black phosphorus-based nanospheres for in vivo photothermal cancer therapy

Let’s dive into the data

We will use R for this analysis. If you want to work along, the script and data can be downloaded below. With a few edits, the script will also work for similar analysis of other journals.

First of all I retrieved three datasets.

  • Citation data for the journal. We’ll look at the 2018 Journal Impact Factor, so we need citations in 2018 to papers published in the journal in 2016 and 2017. This can be retrieved from Scopus as a csv.
  • PubMed XML file for the journal to cover the articles that we want to analyse (a programmatic way to fetch this is sketched after this list). Search term = "Nat Commun"[Journal] AND ("2016/01/01"[PDAT] : "2017/12/31"[PDAT])
  • PubMed XML file to get cell biology MeSH terms. Search term = "J Cell Sci"[Journal] AND ("2016/01/01"[PDAT] : "2017/12/31"[PDAT])
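If you prefer to fetch the PubMed XML programmatically rather than via the web interface, a sketch is below. This is an assumption on my part: it uses the rentrez package, which is not part of the original workflow, so treat it as one possible way to retrieve the same file.

# sketch: fetch the PubMed XML with rentrez (assumption; not the original workflow)
library(rentrez)
query <- '"Nat Commun"[Journal] AND ("2016/01/01"[PDAT] : "2017/12/31"[PDAT])'
res <- entrez_search(db = "pubmed", term = query, use_history = TRUE)
# very large result sets may need to be fetched in chunks via retstart/retmax
xml_text <- entrez_fetch(db = "pubmed", web_history = res$web_history,
                         rettype = "xml")
writeLines(xml_text, "nc.xml")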

Using MeSH terms to segregate the dataset

Analysing the citation data is straightforward, but how can we classify the content of the dataset? I realised that I could use Medical Subject Headings (MeSH) from PubMed to classify the data. If I retrieved the same set of papers from PubMed and then checked which papers had MeSH terms matching those of a “biological” dataset, the citation data could be segregated. I used a set of J Cell Sci papers to do this. Note that these MeSH terms are not restricted to cell biology; they cover all kinds of biochemistry and other aspects of biology. The papers that do not match these MeSH terms are ecology, chemistry and physical sciences papers (many of which don’t have MeSH terms). We start by getting our biological MeSH terms.

require(XML)
require(tidyverse)
require(readr)
## extract a data frame from PubMed XML file
## This is modified from christopherBelter's pubmedXML R code
extract_xml <- function(theFile) {
  newData <- xmlParse(theFile)
  records <- getNodeSet(newData, "//PubmedArticle")
  pmid <- xpathSApply(newData,"//MedlineCitation/PMID", xmlValue)
  doi <- lapply(records, xpathSApply, ".//ELocationID[@EIdType = \"doi\"]", xmlValue)
  doi[sapply(doi, is.list)] <- NA
  doi <- unlist(doi)
  authLast <- lapply(records, xpathSApply, ".//Author/LastName", xmlValue)
  authLast[sapply(authLast, is.list)] <- NA
  authInit <- lapply(records, xpathSApply, ".//Author/Initials", xmlValue)
  authInit[sapply(authInit, is.list)] <- NA
  authors <- mapply(paste, authLast, authInit, collapse = "|")
  year <- lapply(records, xpathSApply, ".//PubDate/Year", xmlValue) 
  year[sapply(year, is.list)] <- NA
  year <- unlist(year)
  articletitle <- lapply(records, xpathSApply, ".//ArticleTitle", xmlValue) 
  articletitle[sapply(articletitle, is.list)] <- NA
  articletitle <- unlist(articletitle)
  journal <- lapply(records, xpathSApply, ".//ISOAbbreviation", xmlValue) 
  journal[sapply(journal, is.list)] <- NA
  journal <- unlist(journal)
  volume <- lapply(records, xpathSApply, ".//JournalIssue/Volume", xmlValue)
  volume[sapply(volume, is.list)] <- NA
  volume <- unlist(volume)
  issue <- lapply(records, xpathSApply, ".//JournalIssue/Issue", xmlValue)
  issue[sapply(issue, is.list)] <- NA
  issue <- unlist(issue)
  pages <- lapply(records, xpathSApply, ".//MedlinePgn", xmlValue)
  pages[sapply(pages, is.list)] <- NA
  pages <- unlist(pages)
  abstract <- lapply(records, xpathSApply, ".//Abstract/AbstractText", xmlValue)
  abstract[sapply(abstract, is.list)] <- NA
  abstract <- sapply(abstract, paste, collapse = "|")
  ptype <- lapply(records, xpathSApply, ".//PublicationType", xmlValue)
  ptype[sapply(ptype, is.list)] <- NA
  ptype <- sapply(ptype, paste, collapse = "|")
  mesh <- lapply(records, xpathSApply, ".//MeshHeading/DescriptorName", xmlValue)
  mesh[sapply(mesh, is.list)] <- NA
  mesh <- sapply(mesh, paste, collapse = "|")
  theDF <- data.frame(pmid, doi, authors, year, articletitle, journal, volume, issue, pages, abstract, ptype, mesh, stringsAsFactors = FALSE)
  return(theDF)
}
# function to separate multiple entries in one column to many columns using | separator 
# from https://stackoverflow.com/questions/4350440/split-data-frame-string-column-into-multiple-columns
split_into_multiple <- function(column, pattern = ", ", into_prefix){
  cols <- str_split_fixed(column, pattern, n = Inf)
  # Sub out the ""'s returned by filling the matrix to the right, with NAs which are useful
  cols[which(cols == "")] <- NA
  cols <- as_tibble(cols)
  # name the 'cols' tibble as 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m' 
  # where m = # columns of 'cols'
  m <- dim(cols)[2]
  names(cols) <- paste(into_prefix, 1:m, sep = "_")
  return(cols)
}

## First load the JCS data to get the MeSH terms of interest
jcsFilename <- "./jcs.xml"
jcsData <- extract_xml(jcsFilename)
# put MeSH into a df
meshData <- as.data.frame(jcsData$mesh, stringsAsFactors = FALSE)
colnames(meshData) <- "mesh"
# separate each MeSH into its own column of a df
splitMeshData <- meshData %>% 
  bind_cols(split_into_multiple(.$mesh, "[|]", "mesh")) %>%
  select(starts_with("mesh_"))
splitMeshData <- splitMeshData %>% 
  gather(na.rm = TRUE) %>%
  filter(value != "NA")
# collate key value df of unique MeSH
uniqueMesh <- unique(splitMeshData)
# this gives us a data frame of cell biology MeSH terms

Now we need to load in the Nature Communications XML data from PubMed and also get the citation data into R.

## Now use a similar procedure to load the NC data for comparison
ncFilename <- "./nc.xml"
ncData <- extract_xml(ncFilename)
ncMeshData <- as.data.frame(ncData$mesh, stringsAsFactors = FALSE)
colnames(ncMeshData) <- "mesh"
splitNCMeshData <- ncMeshData %>% 
  bind_cols(split_into_multiple(.$mesh, "[|]", "mesh")) %>%
  select(starts_with("mesh_"))
# make a new column to hold any matches of rows with MeSH terms which are in the uniqueMeSH df 
ncData$isCB <- apply(splitNCMeshData, 1, function(r) any(r %in% uniqueMesh$value))
rm(splitMeshData,splitNCMeshData,uniqueMesh)

## Next we load the citation data file retrieved from Scopus
scopusFilename <- "./Scopus_Citation_Tracker.csv"
# the structure of the file requires a little bit of wrangling, ignore warnings
upperHeader <- read_csv(scopusFilename, 
                                    skip = 5)
citationData <- read_csv(scopusFilename, 
                        skip = 6)
upperList <- colnames(upperHeader)
lowerList <- colnames(citationData)
colnames(citationData) <- c(lowerList[1:7],upperList[8:length(upperList)])
rm(upperHeader,upperList,lowerList)

Next we need to perform a join to match up the PubMed data with the citation data.

## we now have two data frames, one with the citation data and one with the papers
# make both frames have a Title column
colnames(citationData)[which(names(citationData) == "Document Title")] <- "Title"
colnames(ncData)[which(names(ncData) == "articletitle")] <- "Title"
# ncData paper titles have a terminating period, so remove it
ncData$Title <- gsub("\\.$","",ncData$Title, perl = TRUE)
# add citation data to ncData data frame
allDF <- inner_join(citationData, ncData, by = "Title")

Now we’ll make some plots.

# Plot histogram with indication of mean and median
p1 <- ggplot(data=allDF, aes(allDF$'2018')) +
  geom_histogram(binwidth = 1) +
  labs(x = "2018 Citations", y = "Frequency") +
  geom_vline(aes(xintercept = mean(allDF$'2018',na.rm = TRUE)), col='orange', linetype="dashed", size=1) +
  geom_vline(aes(xintercept = median(allDF$'2018',na.rm = TRUE)), col='blue', linetype="dashed", size=1)
p1

# Group outlier papers for clarity
p2 <- allDF %>% 
  mutate(x_new = ifelse(allDF$'2018' > 80, 80, allDF$'2018')) %>% 
  ggplot(aes(x_new)) +
  geom_histogram(binwidth = 1, col = "black", fill = "gray") +
  labs(x = "2018 Citations", y = "Frequency") +
  geom_vline(aes(xintercept = mean(allDF$'2018',na.rm = TRUE)), col='orange', linetype="dashed", size=1) +
  geom_vline(aes(xintercept = median(allDF$'2018',na.rm = TRUE)), col='blue', linetype="dashed", size=1)
p2

# Plot the data for both sets of papers separately
p3 <- ggplot(data=allDF, aes(allDF$'2018')) +
  geom_histogram(binwidth = 1) +
  labs(title="",x = "Citations", y = "Count") +
  facet_grid(ifelse(allDF$isCB, "Cell Biol", "Removed") ~ .) +
  theme(legend.position = "none")
p3

The citation data look typical: highly skewed, with a few very highly cited papers and the majority (two-thirds) receiving fewer citations than the mean. The “cell biology” dataset and the non-cell biology dataset look pretty similar.

Now it is time to answer our main question. Do cell biology papers drag down the impact factor of the journal?

## make two new data frames, one for the cell bio papers and one for non-cell bio
cbDF <- subset(allDF,allDF$isCB == TRUE)
nocbDF <- subset(allDF,allDF$isCB == FALSE)
# print a summary of the 2018 citations to these papers for each df
summary(allDF$'2018')
summary(cbDF$'2018')
summary(nocbDF$'2018')
> summary(allDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    8.00   11.48   14.00  238.00 
> summary(cbDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    7.00   10.17   13.00  226.00 
> summary(nocbDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    9.00   13.61   16.00  238.00 

The “JIF” for the whole journal is 11.48, whereas for the non-cell biology content it is 13.61. The cell biology dataset has a “JIF” of 10.17. So the rumour is basically true, but the effect is quite mild: the rumour implies that the cell biology impact factor is far lower than it actually is.

The reason “JIF” is in quotes is that it is notoriously difficult to calculate this metric. All citations are summed for the numerator, but the denominator comprises “citable items”. To get something closer to the actual JIF, we probably should remove non-citable items. These are Errata, Letters, Editorials and Retraction notices.

## We need to remove some article types from the dataset
itemsToRemove <- c("Published Erratum","Letter","Editorial","Retraction of Publication")
allArticleData <- as.data.frame(allDF$ptype, stringsAsFactors = FALSE)
colnames(allArticleData) <- "ptype"
splitallArticleData <- allArticleData %>% 
  bind_cols(split_into_multiple(.$ptype, "[|]", "ptype")) %>%
  select(starts_with("ptype_"))
# make a new column to hold any matches of rows that are non-citable items
allDF$isNCI <- apply(splitallArticleData, 1, function(r) any(r %in% itemsToRemove))
# new data frame with only citable items
allCitableDF <- subset(allDF,allDF$isNCI == FALSE)

# Plot the data after removing "non-citable items for both sets of papers separately
p4 <- ggplot(data=allCitableDF, aes(allCitableDF$'2018')) +
  geom_histogram(binwidth = 1) +
  labs(title="",x = "Citations", y = "Count") +
  facet_grid(ifelse(allCitableDF$isCB, "Cell Biol", "Removed") ~ .) +
  theme(legend.position = "none")
p4

After removal the citation distributions look a bit more realistic (notice that the earlier versions had many items with zero citations).

Citation distributions with non-citable items removed

Now we can redo the last part.

# subset new dataframes
cbCitableDF <- subset(allCitableDF,allCitableDF$isCB == TRUE)
nocbCitableDF <- subset(allCitableDF,allCitableDF$isCB == FALSE)
# print a summary of the 2018 citations to these papers for each df
summary(allCitableDF$'2018')
summary(cbCitableDF$'2018')
summary(nocbCitableDF$'2018')
> summary(allCitableDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    8.00   11.63   14.00  238.00 
> summary(cbCitableDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    8.00   10.19   13.00  226.00 
> summary(nocbCitableDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    5.00    9.00   14.06   17.00  238.00 

Now the figures have changed. The “JIF” for the whole journal is 11.63, whereas for the non-cell biology content it would be 14.06. The cell biology dataset has a “JIF” of 10.19. To more closely approximate the JIF, we need to do:

# approximate "impact factor" for the journal
sum(allDF$'2018') / nrow(allCitableDF)
# approximate "impact factor" for the journal's cell biology content
sum(cbDF$'2018') / nrow(cbCitableDF)
# approximate "impact factor" for the journal's non-cell biology content
sum(nocbDF$'2018') / nrow(nocbCitableDF)
> # approximate "impact factor" for the journal
> sum(allDF$'2018') / nrow(allCitableDF)
[1] 11.64056
> # approximate "impact factor" for the journal's cell biology content
> sum(cbDF$'2018') / nrow(cbCitableDF)
[1] 10.19216
> # approximate "impact factor" for the journal's non-cell biology content
> sum(nocbDF$'2018') / nrow(nocbCitableDF)
[1] 14.08123

This made only a minor change, probably because the dataset is so huge (7239 papers for two years with non-citable items removed). If we were to repeat this on another journal with more front content and fewer papers, this distinction might make a bigger change.

Note also that my analysis uses Scopus data whereas Web of Science numbers are used for JIF calculations (thanks to Anna Sharman for prompting me to add this).

Conclusion

So the rumour is true but the effect is not as big as people say. There’s a ~17% reduction in potential impact factor by including these papers rather than excluding them. However, these papers comprise ~63% of the corpus and they bring in an estimated revenue to the publisher of $12,000,000 per annum. No journal would forego this income in order to bump the JIF from 11.6 to 14.1.
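For transparency, here is the back-of-envelope arithmetic behind those figures. The APC value is my assumption (the post does not state which figure was used), so treat the revenue estimate as a rough sketch.

# back-of-envelope calculations for the conclusion (the APC is an assumed value)
jif_all  <- 11.64   # approximate "JIF", all citable items
jif_noCB <- 14.08   # approximate "JIF", non-cell biology content only
(jif_noCB - jif_all) / jif_noCB                    # ~0.17, i.e. a ~17% reduction
fracCB <- nrow(cbCitableDF) / nrow(allCitableDF)   # ~0.63 of the corpus
apc <- 5200                                        # assumed APC in USD
fracCB * nrow(allCitableDF) / 2 * apc              # ~$12M per year (papers span two years)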

It is definitely not true that these papers are under-performing. Their citation rates are similar to those in the best journals in the field. Note that citation rates do not necessarily reflect the usefulness of the paper; for one thing, they are largely an indicator of the volume of a research field. Anyhow, next time you hear this rumour from someone, you can set them straight.

And I nearly managed to write an entire post without mentioning that JIF is a terrible metric, especially for judging individual papers in a journal, but then you knew that didn’t you?

The post title comes from “Communication Breakdown” by the mighty Led Zeppelin from their debut album. I was really tempted to go with “Dragging Me Down” by Inspiral Carpets, but Communication Breakdown was too good to pass up.

Turn A Square: generative aRt

A while back I visited Artistes & Robots in Paris. Part of the exhibition was on the origins of computer-based art. Nowadays this is referred to as generative art, where computers generate artwork according to rules specified by the programmer. I wanted to emulate some of the early generative artwork I saw there, using R.

Some examples of early generative art

Georg Nees was a pioneer of computer-based artwork and he has a series of pieces using squares. An example I found on the web was Schotter (1968).


Another example using rotation of squares is Boxes by William J. Kolomyjec.


I set out to generate similar images where the parameters can be tweaked to adjust the final result. The code is available on GitHub. Here is a typical image. Read on for an explanation of the code.

Generative art made in R

Generating a grid of squares

I started by generating a grid of squares. We can use the segments() function in base R to do the drawing.

xWave <- seq.int(1:10)
yWave <- seq.int(1:10)

for (i in seq_along(xWave)) {
  xCentre <- xWave[i]
  for (j in seq_along(yWave)) {
    yCentre <- yWave[j]
    lt <- c(xCentre - 0.4,yCentre - 0.4)
    rt <- c(xCentre + 0.4,yCentre - 0.4)
    rb <- c(xCentre + 0.4,yCentre + 0.4)
    lb <- c(xCentre - 0.4,yCentre + 0.4)
    new_shape_start <- rbind(lt,rt,rb,lb)
    new_shape_end <- rbind(rt,rb,lb,lt)
    new_shape <- cbind(new_shape_start,new_shape_end)
    if(i == 1 && j == 1) {
      multiple_segments <- new_shape
    } else {
      multiple_segments <- rbind(multiple_segments,new_shape)
    }
  }
}
par(pty="s")
plot(0, 0,
     xlim=c(min(xWave)-1,max(xWave)+1),
     ylim=c(min(yWave)-1,max(yWave)+1),
     col = "white", xlab = "", ylab = "", axes=F) 
segments(x0 = multiple_segments[,1],
         y0 = multiple_segments[,2],
         x1 = multiple_segments[,3],
         y1 = multiple_segments[,4])
A simple grid

We’re using base R graphics to make this artwork – nothing fancy. Probably the code can be simplified!

Add some complexity

The next step was to add some complexity and flexibility. I wanted to be able to control three things:

  1. the size of the grid
  2. the grout (distance between squares)
  3. the amount of hysteresis (distorting the squares, rather than rotating them).

Here is the code. See below for some explanation and examples.

# this function will make grid art
# arguments define the number of squares in each dimension (xSize, ySize)
# grout defines the gap between squares (none = 0, max = 1)
# hFactor defines the amount of hysteresis (none = 0, max = 1, moderate = 10)
make_grid_art <- function(xSize, ySize, grout, hFactor) {
  xWave <- seq.int(1:xSize)
  yWave <- seq.int(1:ySize)
  axMin <- min(min(xWave) - 1,min(yWave) - 1)
  axMax <- max(max(xWave) + 1,max(yWave) + 1)
  nSquares <- length(xWave) * length(yWave)
  x <- 0
  halfGrout <- (1 - grout) / 2
  for (i in seq_along(yWave)) {
    yCentre <- yWave[i]
    for (j in seq_along(xWave)) {
      if(hFactor < 1) {
        hyst <- rnorm(8, halfGrout, 0)
      }
      else {
        hyst <- rnorm(8, halfGrout, sin(x / (nSquares - 1) * pi) / hFactor)
      }
      xCentre <- xWave[j]
      lt <- c(xCentre - hyst[1],yCentre - hyst[2])
      rt <- c(xCentre + hyst[3],yCentre - hyst[4])
      rb <- c(xCentre + hyst[5],yCentre + hyst[6])
      lb <- c(xCentre - hyst[7],yCentre + hyst[8])
      new_shape_start <- rbind(lt,rt,rb,lb)
      new_shape_end <- rbind(rt,rb,lb,lt)
      new_shape <- cbind(new_shape_start,new_shape_end)
      if(i == 1 && j == 1) {
        multiple_segments <- new_shape
      } else {
        multiple_segments <- rbind(multiple_segments,new_shape)
      }
      x <- x + 1
    }
  }
  par(mar = c(0,0,0,0))
  plot(0, 0,
       xlim=c(axMin,axMax),
       ylim=c(axMax,axMin),
       col = "white", xlab = "", ylab = "", axes=F, asp = 1) 
  segments(x0 = multiple_segments[,1],
           y0 = multiple_segments[,2],
           x1 = multiple_segments[,3],
           y1 = multiple_segments[,4])
}

The amount of hysteresis is an important element. It determines the degree of distortion for each square. The distortion is done by introducing noise into the coordinates for each square that we map onto the grid. The algorithm used for the distortion is based on a half-cycle of a sine wave. It starts and finishes on zero distortion of the coordinates. In between it rises and falls symmetrically introducing more and then less distortion as we traverse the grid.
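To see the distortion envelope on its own, here is a minimal sketch (my own illustration, not part of the original script) that plots the standard deviation of the coordinate noise across a 20 x 20 grid with hFactor = 10, using the same expression that make_grid_art uses for hyst.

# visualise the noise envelope used in make_grid_art (sketch, not original code)
nSquares <- 20 * 20
hFactor <- 10
x <- 0:(nSquares - 1)
noise_sd <- sin(x / (nSquares - 1) * pi) / hFactor
plot(x, noise_sd, type = "l",
     xlab = "Square index (drawing order)", ylab = "SD of coordinate noise")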

Changing the line of code that assigns a value to hyst will therefore change the artwork. Let’s look at some examples.

# square grid with minimal hysteresis
make_grid_art(10,10,0.2,50)
# square grid (more squares) more hysteresis
make_grid_art(20,20,0.2,10)
# rectangular grid same hysteresis
make_grid_art(25,15,0.2,10)
# same grid with no hysteresis
make_grid_art(25,15,0.2,0)
# square grid moderate hysteresis and no grout
make_grid_art(20,20,0,10)

Hopefully this will give you some ideas for simple schemes to generate artwork in R. I plan to add to the GitHub repo when I get the urge. There is already some other generative artwork code in there for Igor Pro. Two examples are shown below (I don’t think I have written about this before on quantixed).

The post title comes from “Turn A Square” by The Shins from their album Chutes Too Narrow.

Garmonbozia: Using R to look at Garmin CSV data

Garmin Connect has a number of plots built in, but to take a deeper dive into all your fitness data, you need to export a CSV and fire up R. This post is a quick guide to some possibilities for running data. 

There are a few things that I wanted to look at. For example, how does my speed change through the year? How does that compare to previous years? If I see some trends, is that the same for short runs and long runs? I wanted to look at the cumulative distance I’d run each year… There are lots of things that would be good to analyse.

Garmin Connect has a simple way to export data as a CSV. There are other ways to get your data, but the web interface is pretty straightforward. To export a CSV of your data, head to the Garmin Connect website, log in and select Activities, All Activities. On this page, filter the activities for whatever you want to export. I clicked Running (you can filter some more if you want) and then scrolled down, letting the data load onto the page, until I went back as far as I wanted. Click Export CSV in the top right corner and you will download whatever is displayed on the page.

The code to generate these plots, together with some fake data to play with can be found here.

Now, in R, load in the CSV file:

require(ggplot2)
require(dplyr)
require(hms)
file_name <- file.choose()
df1 <- read.csv(file_name, header = TRUE, stringsAsFactors = FALSE)

We have a data frame, but we need to rejig the Dates and a few other columns before we can start making plots.

# format Date column to POSIXct
df1$Date <- as.POSIXct(strptime(df1$Date, format = "%Y-%m-%d %H:%M:%S"))
# format Avg.Pace to POSIXct
df1$Avg.Pace <- as.POSIXct(strptime(df1$Avg.Pace, format = "%M:%S"))
# make groups of different distances using ifelse
df1$Type <- ifelse(df1$Distance < 5, "< 5 km", ifelse(df1$Distance < 8, "5-8 km", ifelse(df1$Distance < 15, "8-15 km", ">15 km")))
# make factors for these so that they're in the right order when we make the plot
df1$Type_f = factor(df1$Type, levels=c("< 5 km","5-8 km","8-15 km", ">15 km"))

Now we can make the first plot. The code for the first one is below, with all the code for the other plots shown below that.

# plot out average pace over time
p1 <- ggplot( data = df1, aes(x = Date,y = Avg.Pace, color = Distance)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth(color = "orange") +
  labs(x = "Date", y = "Average Pace (min/km)")

The remainder of the code for the other plots is shown below. The code is commented. For some of the plots, a bit of extra work on the data frame is required.

# plot out same data grouped by distance
p2 <- ggplot( data = df1, aes(x = Date,y = Avg.Pace, group = Type_f, color = Type_f)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth() +
  labs(x = "Date", y = "Average Pace (min/km)", colour = NULL) +
  facet_grid(~Type_f)
# now look at stride length. first remove zeros
df1[df1 == 0] <- NA
# now find earliest valid date
date_v <- df1$Date
# change dates to NA where there is no avg stride data
date_v <- as.Date.POSIXct(ifelse(df1$Avg.Stride.Length > 0, df1$Date, NA))
# find min and max for x-axis
earliest_date <- min(date_v, na.rm = TRUE)
latest_date <- max(date_v, na.rm = TRUE)
# make the plot
p3 <- ggplot(data = df1, aes(x = Date,y = Avg.Stride.Length, group = Type_f, color = Type_f)) +
  geom_point() + 
  ylim(0, NA) + xlim(as.POSIXct(earliest_date), as.POSIXct(latest_date)) +
  geom_smooth() +
  labs(x = "Date", y = "Average stride length (m)", colour = NULL) +
  facet_grid(~Type_f)
df1$Avg.HR <- as.numeric(as.character(df1$Avg.HR))
p4 <- ggplot(data = df1, aes(x = Date,y = Avg.HR, group = Type_f, color = Type_f)) +
  geom_point() +
  ylim(0, NA) + xlim(as.POSIXct(earliest_date), as.POSIXct(latest_date)) +
  geom_smooth() +
  labs(x = "Date", y = "Average heart rate (bpm)", colour = NULL) +
  facet_grid(~Type_f)
# plot out average pace per distance coloured by year
p5 <- ggplot( data = df1, aes(x = Distance,y = Avg.Pace, color = Date)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth(color = "orange") +
  labs(x = "Distance (km)", y = "Average Pace (min/km)")
# make a date factor for year to group the plots
df1$Year <- format(as.Date(df1$Date, format="%d/%m/%Y"),"%Y")
p6 <- ggplot( data = df1, aes(x = Distance,y = Avg.Pace, group = Year, color = Year)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth() +
  labs(x = "Distance", y = "Average Pace (min/km)") +
  facet_grid(~Year)
# Cumulative sum over years
df1 <- df1[order(as.Date(df1$Date)),]
df1 <- df1 %>% group_by(Year) %>% mutate(cumsum = cumsum(Distance))
p7 <- ggplot( data = df1, aes(x = Date,y = cumsum, group = Year, color = Year)) + 
  geom_line() +
  labs(x = "Date", y = "Cumulative distance (km)")
# Plot these cumulative sums overlaid
# Find New Year's Day for each and then work out how many days have elapsed since
df1$nyd <- paste(df1$Year,"-01-01",sep = "")
df1$Days <- as.Date(df1$Date, format="%Y-%m-%d") - as.Date(as.character(df1$nyd), format="%Y-%m-%d")
# Make the plot
p8 <- ggplot( data = df1, aes(x = Days,y = cumsum, group = Year, color = Year)) + 
  geom_line() +
  scale_x_continuous() +
  labs(x = "Days", y = "Cumulative distance (km)")

Finally, we can save all of the plots using ggsave.

# save all plots
ggsave("allPace.png", plot = p1, width = 8, height = 4, dpi = "print")
ggsave("paceByDist.png", plot = p2, width = 8, height = 4, dpi = "print")
ggsave("strideByDist.png", plot = p3, width = 8, height = 4, dpi = "print")
ggsave("HRByDist.png", plot = p4, width = 8, height = 4, dpi = "print")
ggsave("allPaceByDist.png", plot = p5, width = 8, height = 4, dpi = "print")
ggsave("paceByDistByYear.png", plot = p6, width = 8, height = 4, dpi = "print")
ggsave("cumulativeDistByYear.png", plot = p7, width = 8, height = 4, dpi = "print")
ggsave("cumulativeDistOverlay.png", plot = p8, width = 8, height = 4, dpi = "print")

I think the code might fail if you don’t record all of the fields that I do. For example if heart rate data is missing or stride length is not recorded, I’m not sure what the code will do. The aim here is to give an idea of what sorts of plots can be generated just using the summary data in the CSV provided by Garmin. Feel free to make suggestions in the comments below.
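One defensive option is to check that a column exists and contains data before building the corresponding plot. This is a minimal sketch of my own, not part of the original script.

# sketch: only build a plot if the column exists and has non-missing data
has_data <- function(df, column) {
  column %in% names(df) && any(!is.na(df[[column]]))
}
if (has_data(df1, "Avg.HR")) {
  # build p4 (the heart rate plot) as above
}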

The post title comes from “Garmonbozia” by Superdrag from the Regretfully Yours album. Apparently Garmonbozia is something eaten by the demons in the Black Lodge in the TV series Twin Peaks.

All Around The World: Maps and Flags in R

Our lab is international. People born all over the world have come to work in my group. I’m proud of this fact, especially in the current political climate. I’ve previously used the GoogleMaps API to display a heat map on our lab webpage. It shows where in the world people in the lab come from. This was OK, but I wanted an R-based solution to make this graphic, so that updates are easier to automate.

This post is an explainer and “how to” guide. Code and some example data are here.

The idea is to create graphics to describe the origins of a group of people. For my use-case it is my research group, but it could be any group of people. All you need is a list of countries that the people come from.

In the example for this post, I took the Top 100 Footballers voted for by Guardian readers in 2016. In my lab dataset, I store the countries of origin in ISO2 format. This means Spain is ES, Germany is DE and so on. I converted the Guardian dataset to ISO2 format using a lookup and then I was ready to put it into the R script.
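If your own list has full country names rather than ISO2 codes, one way to do the lookup is sketched below. This is an assumption on my part; the countrycode package is not mentioned in the post.

# sketch: convert full country names to two-letter ISO codes (hypothetical names)
library(countrycode)
full_names <- c("Spain", "Germany", "Argentina")
iso2 <- countrycode(full_names, origin = "country.name", destination = "iso2c")
iso2
# "ES" "DE" "AR"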

if (!require("rworldmap")) {
  install.packages("rworldmap")
  library(rworldmap)
}
# ggplot2, ggFlags, dplyr are needed for the bar charts
library(ggplot2)
library(dplyr)
if (!require("ggflags")) {
  devtools::install_github("rensa/ggflags")
  library(ggflags)
}

# csv file with each person as a row and containing a column with the header Origin and
# countries in 2-letter ISO format (change joinCode for other formats)
file_name <- file.choose()
df1 <- read.csv(file_name, header = TRUE, stringsAsFactors = FALSE)
countries_lab <- as.data.frame(table(df1$Origin))
colnames(countries_lab) <- c("country", "value")
matched <- joinCountryData2Map(countries_lab, joinCode="ISO2", nameJoinColumn="country")

This part of the script sets up the libraries that are needed. We’ll use the rworldmap package first. This post was very useful for guidance. We load in the csv file which contains the countries of origin for the people we want to map out. It doesn’t need anything more than one column with the ISO2 codes; if it has more columns, that’s fine. As long as the header for the countries column is called “Origin”, all will be OK.

This column is extracted and a new dataframe is made from it which has each country as a row and the number of instances of that country as a second column. These are labelled “country” and “value” for convenience. Now rworldmap does its thing with the joinCountryData2Map line. Next we make the map!

# make png of the map
png(file = "labWorldMap.png",
    width = 1024, height = 768)
par(mai=c(0,0,0.2,0))
mapCountryData(matched,
               nameColumnToPlot="value",
               mapTitle= "",
               catMethod = "logFixedWidth",
               colourPalette = "heat",
               oceanCol="lightblue",
               missingCountryCol="white",
               addLegend = FALSE,
               lwd = 1)
dev.off()
Where in the world… heat map showing country of origin for the people in the dataset

This makes a nice map. I’ve hidden the legend which shows what the colours mean. The map can be customised in lots of ways. I liked the way this map looked and my other aim was to make a chart to show the relative numbers of people in each country. Speaking of which…

# make bar chart of lab members
countries_lab %>%
  mutate(code = tolower(country)) %>%
  ggplot(aes(x = reorder(country, value), y = value)) +
  geom_bar(stat = "identity") +
  geom_flag(y = -1, aes(country = code), size = 4) +
  scale_y_continuous(expand = c(0.1, 1)) +
  xlab("Country") +
  ylab("Members") +
  theme_bw() +
  theme(legend.title = element_blank()) +
  coord_flip()
ggsave("plot.png", plot = last_plot())

Using the data frame we made previously, we can pipe it to ggplot2 via the wonders of dplyr. I am using geom_flag here from the ggflags library. Note that this is a fork of ggflags which gives circular flags which look great on the graph. The geom_flag needs a lowercase entry for each ISO2 country code so the first step is to mutate the country column to make a new lowercase column called code.

Bar chart of the same dataset using flag emojis for the tick labels

That’s it! With a csv file and a few lines of R code you can generate some nice looking graphics.

The dataset shows that the country that produced the biggest fraction of the world’s best footballers (as voted for by Guardian readers) was Spain. There are no surprises in this dataset, with the most prominent European and South American countries giving a strong showing.

The post title is taken from “All Around The World” by The Dead Milkmen. Many songs in my library with this title, but this paranoid extraterrestrial tune gets the pick this time.

Til I Die: Seeking new music

I’ve been following the tweets from an account called Albums You Must Hear @Albums2Hear. Each tweet is an album recommended by the account owner. I’m a sucker for lists of Albums That I Must Hear Before I Die since I’m always interested in new (or not so new) music recommendations.

I wanted to assemble a list of the albums that I don’t have from this account and I was able to do so using R.

Using rtweet, it was possible to pull a list of all the albums and reorganise them so that I had a csv containing the albums with the artist and year. I could then use this to compare with a list of albums from my iTunes library. A snippet of the retrieved records is shown here (full list is here).

The code for retrieval is here. The output is a csv which can be used to compare with a list of your own records.


library(rtweet)
library(httpuv)
library(stringr)
# get the account's timeline and strip the hashtag from each tweet
all_tweets <- get_timeline("Albums2Hear", n = 1500)
albums <- all_tweets$text
albums <- gsub("#albumsyoumusthear ","",albums)
# split each tweet into Artist, Album and Year/URL
tempdf <- as.data.frame(str_split_fixed(albums, " - ", 3))
colnames(tempdf) <- c("Artist","Album","YearURL")
tempdf2 <- as.data.frame(str_split_fixed(tempdf$YearURL, " ", 2))
colnames(tempdf2) <- c("Year","URL")
# assemble the final Artist/Album/Year data frame and save it
df <- data.frame(tempdf$Artist,tempdf$Album,tempdf2$Year)
colnames(df) <- c("Artist","Album","Year")
write.csv(df,file = "albums2hear.csv")
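To find the albums you don't own, you could then join this against an export of your own library. The file name and columns below are assumptions for illustration; they are not from the post.

# sketch: compare against your own library (assumed file "my_albums.csv"
# with Artist and Album columns) and keep the albums you don't have
library(dplyr)
mine <- read.csv("my_albums.csv", stringsAsFactors = FALSE)
missing_albums <- anti_join(df, mine, by = c("Artist", "Album"))
write.csv(missing_albums, file = "albums_to_hear.csv")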

Thanks to whoever runs the account – they ask for support here.

The post title comes from ‘Til I Die by The Beach Boys from Sunflower/Surf’s Up.

Multiplex: Small multiple artwork from GPX tracks

I’d seen the small multiple artwork of running and cycling routes from Marcus Volz’s R package Strava all over the web. Ads for “posters of your GPS tracks” pop up on Reddit and I’d noticed a few #Rstats people put up their posters on Twitter. I’ve had the package bookmarked for a while and this week I finally got round to generating a small multiple poster of some of my cycling routes.

I was pleased with the result and wanted to post it here. But also, running the code was not straightforward, as I’ll explain below. If you want to generate your own plot, read on.

The idea behind the poster is really nice. You get a kind of generative art-style poster. It looks nice and you can identify individual routes which is fun to do.

The instructions on the GitHub page are absolutely correct and the code should run out-of-the-box. The idea is that you download your Strava data and then make your plots. Unfortunately, it seems that a change in Strava’s data export policy (possibly related to GDPR changes) has broken the package. I found that there are two problems. First, Strava’s “download your data” link gives you a mix of formats (in my case GPX and FIT files), but the package only works with GPX. Second, if there is any elevation data missing from a track, the data frame that is needed to make the poster is not built properly.

Going GPX only: In my case, I don’t keep all my data in Strava and instead use a local repository managed with RubiTrack. This software allows me to filter for the tracks I want and export them in GPX format. The only problem is that it generates one huge file with all the tracks enclosed. This gets read by the package as a single track. To fix this, I split the file using awk.

awk '/<trk>/{close(file);n++;}{file="track"n".gpx";print >> file;}' untitled.gpx

I could then discard track1.gpx which just had the xml header and then use the directory of gpx files.

The elevation problem: this affected only some of the tracks, so in the end the R code needed to be modified. The elevation data is not needed to make the posters, so the file process_data.R needs editing: line 28 can be commented out and then line 32 should read:

result <- data.frame(lat = lat, lon = lon, time = time, type = type) %>%

This issue is raised on GitHub and has been closed, but the code doesn’t work with elevation blanks. If you run into this problem, this is the way I found to fix it. The other plots in the package which do use elevation will not run, but at least the poster can be made.

I exported the poster as PDF and then made some changes in Illustrator to give the result above.

The post title comes from Multiplex from Oliver’s Standing Stone LP from 1974.

Pledging My Time III

I’ve previously crunched times for local Half and Full Marathons here on quantixed. Last weekend was the Kenilworth Half Marathon (2018) over a new course. I thought I’d have a look at the distributions of times and paces of the runners. The times are available here. If the Time and Category for finishers are saved as a csv, the script below works to generate the following plots.

Aggregated stats for the race are here. The beeswarm plot nicely shows the distribution of runners times and paces per category. There’s a bimodality to some of the age groups which is interesting. You can see from the average times that people get slower as they get older, as expected.

There was a roughly 2:1 split of M:F runners with a similar proportion in all categories. The ratio is similar for DNSers. The winning times were Andrew Savery of Leamington C A & C in MV35 with 01:12:51 and Polly Keen of Nuneaton Harriers in F sen with 01:23:46.

Congrats to everyone who ran and thanks to the organisers and all the supporters out on the course.


require(ggplot2)
require(ggbeeswarm)
file_name <- file.choose()
df1 <- read.csv(file_name, header = TRUE, stringsAsFactors = FALSE)
# aggregate M and F to a new category called Gender
df1$Gender <- ifelse(startsWith(df1$Category,"F"),"F","M")
# format Date column to POSIXct
df1$Time <- as.POSIXct(strptime(df1$Time, format = "%H:%M:%S"))
orig_var <- as.POSIXct("00:00:00", format = "%H:%M:%S")
p1 <- ggplot( data = df1, aes(x = Category,y = Time, color = Category)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  scale_y_datetime(date_labels = "%H:%M:%S", limits = c(orig_var,NA))
p1
# instead of finishing time, let's look at pace (min/km)
df1$Pace <- as.numeric(difftime(df1$Time, orig_var) / 21.1) * 3600
df1$Pace <- as.POSIXct(df1$Pace, origin = orig_var, format = "%H:%M:%S")
p2 <- ggplot( data = df1, aes(x = Category,y = Pace, color = Category)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  scale_y_datetime(date_labels = "%M:%S", limits = c(orig_var,NA))
p2
# calculate speeds rather than pace
df1$Speed <- 21.1 / as.numeric(difftime(df1$Time, orig_var))
p3 <- ggplot( data = df1, aes(x = Category, y = Speed, color = Category)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  ylim(0,NA) + ylab("Speed (km/h)")
p3
# now make the same plots but by Gender rather than Category
p4 <- ggplot( data = df1, aes(x = Gender,y = Time, color = Gender)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  scale_y_datetime(date_labels = "%H:%M:%S", limits = c(orig_var,NA))
p4
p5 <- ggplot( data = df1, aes(x = Gender,y = Pace, color = Gender)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  scale_y_datetime(date_labels = "%M:%S", limits = c(orig_var,NA))
p5
p6 <- ggplot( data = df1, aes(x = Gender, y = Speed, color = Gender)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  ylim(0,NA) + ylab("Speed (km/h)")
p6
ggsave("raceTimeByCat.png", plot = p1)
ggsave("racePaceByCat.png", plot = p2)
ggsave("raceSpeedByCat.png", plot = p3)
ggsave("raceTimeByGen.png", plot = p4)
ggsave("racePaceByGen.png", plot = p5)
ggsave("raceSpeedByGen.png", plot = p6)


Edit 2018-09-12T18:52:43Z I wasn’t happy with the plots and added a few more lines to look at gender as well as category. And show speed as well as pace and finishing time.

The post title is taken from “Pledging My Time”, a track from Blonde on Blonde by Bob Dylan.

Rollercoaster III: yet more on Google Scholar

In a previous post I made a little R script to crunch Google Scholar data for a given scientist. The graphics were done in base R and looked a bit ropey. I thought I’d give the code a spring clean – it’s available here. The script is called ggScholar.R (rather than gScholar.R). Feel free to run it and raise an issue or leave a comment if you have some ideas.

I’m still learning how to get things looking how I want them using ggplot2, but this is an improvement on the base R version.

As described earlier I have many Rollercoaster songs in my library. This time it’s the song and album by slowcore/dream pop outfit Red House Painters.

Ten Years vs The Spread: Calculating publication lag times in R

There have been several posts on this site about publication lag times. You can read them here. Lag times are the delays in the dissemination of scientific data introduced by the process of publishing the paper in a journal. Nowadays, your paper can be online in a few hours using a preprint server. However, this work is not peer reviewed. Journals organise formal peer review and provide some sort of certification of the work. They typeset the work, and all of this adds delays to the dissemination of work in a journal.

To look at publication delays, you can use PubMed data, which is incomplete but can give insight into how long these delays can be. Previous posts have involved the use of a ruby script to make a csv file from PubMed XML output and then use this in Igor to calculate the publication lag times. There is another method detailed in this excellent post by Daniel Himmelstein.

I recently posted a figure for Nature Communications lag times on Twitter and was asked to generate others. I figured that I should write an R script and people can make their own!

The PubMedLagR code is available here with instructions for use.

A query for Nature Communications data at PubMed, such as:

nat commun[ta] AND 2000 : 2018[pdat] AND journal article[pt]

retrieves all papers for this journal. The date range is wider than needed; this journal has only been in operation since 2010, so the range is just for illustration. Filtering for journal articles and attempting to get rid of reviews and front matter is wise, but doesn’t always work. Again, this journal doesn’t carry much of this material, so this filter is also for illustration. Getting your query right is very important.

Save the results in XML format and then run the R script as directed. This should give a csv of the data and a png of the lag times.
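For a flavour of where the lag times come from, here is a rough sketch of how received-to-accepted intervals can be pulled from the History section of PubMed XML. This is not the PubMedLagR code itself, just an illustration of the idea.

# sketch (not the PubMedLagR code): received-to-accepted lags from PubMed XML
library(XML)
doc <- xmlParse("nc.xml")
records <- getNodeSet(doc, "//PubmedArticle")
get_history_date <- function(rec, status) {
  path <- paste0(".//History/PubMedPubDate[@PubStatus='", status, "']")
  y <- xpathSApply(rec, paste0(path, "/Year"), xmlValue)
  m <- xpathSApply(rec, paste0(path, "/Month"), xmlValue)
  d <- xpathSApply(rec, paste0(path, "/Day"), xmlValue)
  if (length(y) == 0) return(as.Date(NA))
  as.Date(paste(y[1], m[1], d[1], sep = "-"))
}
received <- do.call(c, lapply(records, get_history_date, "received"))
accepted <- do.call(c, lapply(records, get_history_date, "accepted"))
recacc <- as.numeric(accepted - received)  # lag in days (RecAcc)
summary(recacc)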

This is data from Nature Communications. Colleagues had two separate papers accepted at this journal and experienced long delays. I was interested to see if papers were generally taking longer to publish here. Of course we do not know why. Delays are partly the fault of the authors, the reviewers and the journal and it is not possible to say why publication lag times are increasing for this journal year-on-year. The journal has grown in terms of number of papers published, has this introduced inefficiencies? Are reviewers being slow to review? Are they being more demanding? Are Editors not marshalling the referee reports and providing clear guidance to authors? Allowing too much time and too many rounds of revision? Are authors being too slow to do further experimental work? The answer will be yes to some of these questions for some of the papers.

This is not to focus on Nature Communications; it’s one of a few journals that many colleagues complain is too slow to publish their work. With this code you can have a look at the journal you are interested in submitting to and consider whether there is a more rapid venue for your work.

Update:

I changed the code slightly and prettified the plots just a little. Below are some plots for Nature Cell Biology, Nature Neuroscience. I also did a search for clathrin or CRISPR papers over the same time period. These keyword searches are fairly flat, whereas the journal-specific increase in publication lag time can be seen.

The lag times at Nature Neuroscience look artificially low and then seem to have jumped up in 2016 to be something similar to Nature Cell Biology or Nature Communications.

Edit

I neglected to point out that the code truncates the y-axis in the bottom right plot to 1000 days or the maximum lag time, whichever is smaller. This is because it gets difficult to see the data points if there is an outlier, which might be due to an error in PubMed data.
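In code terms the truncation amounts to something like the snippet below (demo values only; the variable names are not taken from the actual script).

# sketch of the y-axis truncation rule (demo values, not the actual script)
recacc <- c(120, 340, 95, 1561, 410)           # example lag times in days
y_max <- min(1000, max(recacc, na.rm = TRUE))  # cap the axis at 1000 days
y_max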

A reader commented on Twitter that some poor paper had a lag time of almost 1000 days. Well, due to the y-axis truncation we don’t see that 9 papers in Nature Communications since 2010 have lag times (RecAcc) of > 1000 days. The record holder has a lag time of 1561 days! I checked that this was not a PubMed error by looking at the dates on the paper.

Notes

Date information is not available in PubMed for every paper unfortunately. This is especially true of older papers.

The date information is supplied to PubMed from the journal. These dates are not necessarily accurate: 1) you can see occasional errors in the data, 2) journals sometimes “reset the clock” on papers and treat resubmissions as new submissions.

The post title is taken from “10 Years vs The Spread” by Wing-Tipped Sloat from the LP Chewyfoot. Obviously the song has nothing to do with smoothed kernel density estimates of journal publication lag times, but the title was incredibly apt.

Rollercoaster II: more on Google Scholar citations

I’ve previously written about Google Scholar. Its usefulness and its instability. I just read a post by Jon Tennant on how to harvest Google Scholar data in R and I thought I would use his code as the basis to generate some nice plots based on Google Scholar data.

A script for R is below and can be found here. Graphics are base R but do the job.

First of all I took it for a spin on my own data. The outputs are shown here:

These were the most interesting plots that sprang to mind. First is a ranked citation plot which also shows y=x to find the Hirsch number. Second is a plot of total citations per year to all papers over time. Google Scholar shows the last few years of this plot in the profile page. Third, older papers accrue more citations, but how does this look for all papers? Finally, a prediction of what my H-index will do over time (no prizes for guessing that it will go up!). As Jon noted, the calculation comes from this paper.

While that’s interesting, we need to get the data of a scholar with a huge number of papers and citations. Here is George Church.

At the time of writing he has 763 papers with over 90,000 citations in total and a H-index of 147. Interestingly ~10% of his total citations come from a monster paper in PNAS with Wally Gilbert in the mid 80s on genome sequencing.

Feel free to grab/fork this code and have a play yourself. If you have other ideas for plots or calculations, add a comment here or an issue at GitHub.

if(!require(scholar)){
     install.packages("scholar")
}
library(scholar)
# Add Google Scholar ID of interest here
ID <- ""
# If you didn't add one to the script prompt user to add one
if(ID == ""){
     ID <- readline(prompt="Enter Scholar ID: ")
}
# Get the citation history
citeByYear<-get_citation_history(ID)
# Get profile information
profile <- get_profile(ID)
# Get publications and save as a csv
pubs <- get_publications(ID)
write.csv(pubs, file = "citations.csv")
# Predict h-index
hIndex <- predict_h_index(ID)
# Now make some plots
# Plot of total citations by year
png(file = "citationsByYear.png")
plot(citeByYear$year,citeByYear$cites,
     type="h", xlab="Year", ylab = "Total Cites")
dev.off()
# Plot of ranked paper by citation with h
png(file = "citationsAndH.png")
plot(pubs$cites, type="l",
     xlab="Paper rank", ylab = "Citations per paper")
abline(0,1)
text(nrow(pubs),max(pubs$cites, na.rm = TRUE),
     profile$h_index)
dev.off()
# Plot of cites to paper by year
png(file = "citesByYear.png")
plot(pubs$year, pubs$cites,
     xlab="Year", ylab = "Citations per paper")
dev.off()
# Plot of h-index prediction
thisYear <- as.integer(format(Sys.Date(), "%Y"))
png(file = "hPred.png")
     plot(hIndex$years_ahead+thisYear,hIndex$h_index,
     ylim = c(0, max(hIndex$h_index, na.rm = TRUE)),
     type = "h",
     xlab="Year", ylab = "H-index prediction") 
dev.off()

Note that my previous code used a python script to grab Google Scholar data. While that script worked well, the scholar package for R seems a lot more reliable.

I have a surprising number of tracks in my library with Rollercoaster in the title. This time I will go with the Jesus & Mary Chain track from Honey’s Dead.