Garmonbozia: Using R to look at Garmin CSV data

Garmin Connect has a number of plots built in, but to take a deeper dive into all your fitness data, you need to export a CSV and fire up R. This post is a quick guide to some possibilities for running data. 

There are a few things that I wanted to look at. For example, how does my speed change through the year? How does that compare to previous years? If I see some trends, are they the same for short runs and long runs? I wanted to look at the cumulative distance I’d run each year… There are a lot of things that would be good to analyse.

Garmin Connect has a simple way to export data as a CSV. There are other ways to get your data, but the web interface is pretty straightforward. To export a CSV of your data, head to the Garmin Connect website, log in and select Activities, All Activities. On this page, filter the activities for whatever you want to export. I clicked Running (you can filter some more if you want), and then scrolled down, letting the data load onto the page, until I had gone back as far as I wanted. In the top right corner, click Export CSV and you will download whatever is displayed on the page.

The code to generate these plots, together with some fake data to play with can be found here.

Now in R, load in the CSV file:

# load the packages used below
library(ggplot2)
library(dplyr)
library(hms)
# pick the exported CSV in a file dialog and read it in
file_name <- file.choose()
df1 <- read.csv(file_name, header = TRUE, stringsAsFactors = FALSE)

We have a data frame, but we need to rejig the Dates and a few other columns before we can start making plots.

# format Date column to POSIXct
df1$Date <- as.POSIXct(strptime(df1$Date, format = "%Y-%m-%d %H:%M:%S"))
# format Avg.Pace to POSIXct
df1$Avg.Pace <- as.POSIXct(strptime(df1$Avg.Pace, format = "%M:%S"))
# make groups of different distances using ifelse
df1$Type <- ifelse(df1$Distance < 5, "< 5 km", ifelse(df1$Distance < 8, "5-8 km", ifelse(df1$Distance < 15, "8-15 km", ">15 km")))
# make factors for these so that they're in the right order when we make the plot
df1$Type_f <- factor(df1$Type, levels = c("< 5 km", "5-8 km", "8-15 km", ">15 km"))
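
The nested ifelse calls above can also be written with cut(), which bins a numeric vector and returns a factor with its levels already in order. This is only a sketch of an alternative, run on a small made-up vector of distances rather than real Garmin data; the name bin_distance is hypothetical.

```r
# Hypothetical helper: bin distances (km) into the same four groups as above
bin_distance <- function(distance_km) {
  cut(distance_km,
      breaks = c(-Inf, 5, 8, 15, Inf),
      labels = c("< 5 km", "5-8 km", "8-15 km", ">15 km"),
      right = FALSE)  # intervals are [lower, upper), matching the "<" tests
}

# made-up distances for illustration
example_km <- c(3.2, 5, 7.9, 8, 14.99, 15, 21.1)
example_type <- bin_distance(example_km)
print(data.frame(example_km, example_type))
```

Because cut() returns a factor, the separate factor() step is not needed if you go this way.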

Now we can make the first plot. The code for it is below, followed by the code for the other plots.

# plot out average pace over time
p1 <- ggplot( data = df1, aes(x = Date,y = Avg.Pace, color = Distance)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth(color = "orange") +
  labs(x = "Date", y = "Average Pace (min/km)")

The remainder of the code for the other plots is shown below. The code is commented. For some of the plots, a bit of extra work on the data frame is required.

# plot out same data grouped by distance
p2 <- ggplot( data = df1, aes(x = Date,y = Avg.Pace, group = Type_f, color = Type_f)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth() +
  labs(x = "Date", y = "Average Pace (min/km)", colour = NULL) +
  facet_grid(~Type_f)
# now look at stride length. first replace zeros with NA
df1[df1 == 0] <- NA
# find the dates of runs that have stride length data
date_v <- as.Date(df1$Date[!is.na(df1$Avg.Stride.Length) & df1$Avg.Stride.Length > 0])
# find min and max for x-axis
earliest_date <- min(date_v, na.rm = TRUE)
latest_date <- max(date_v, na.rm = TRUE)
# make the plot
p3 <- ggplot(data = df1, aes(x = Date,y = Avg.Stride.Length, group = Type_f, color = Type_f)) +
  geom_point() + 
  ylim(0, NA) + xlim(as.POSIXct(earliest_date), as.POSIXct(latest_date)) +
  geom_smooth() +
  labs(x = "Date", y = "Average stride length (m)", colour = NULL) +
  facet_grid(~Type_f)
# now look at average heart rate. convert to numeric first (non-numeric entries become NA)
df1$Avg.HR <- as.numeric(as.character(df1$Avg.HR))
p4 <- ggplot(data = df1, aes(x = Date,y = Avg.HR, group = Type_f, color = Type_f)) +
  geom_point() +
  ylim(0, NA) + xlim(as.POSIXct(earliest_date), as.POSIXct(latest_date)) +
  geom_smooth() +
  labs(x = "Date", y = "Average heart rate (bpm)", colour = NULL) +
  facet_grid(~Type_f)
# plot out average pace per distance coloured by date
p5 <- ggplot( data = df1, aes(x = Distance,y = Avg.Pace, color = Date)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth(color = "orange") +
  labs(x = "Distance (km)", y = "Average Pace (min/km)")
# make a year column to group the plots (Date is already POSIXct, so no format string is needed)
df1$Year <- format(df1$Date, "%Y")
p6 <- ggplot( data = df1, aes(x = Distance,y = Avg.Pace, group = Year, color = Year)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth() +
  labs(x = "Distance", y = "Average Pace (min/km)") +
  facet_grid(~Year)
# Cumulative sum over years
df1 <- df1[order(as.Date(df1$Date)),]
df1 <- df1 %>% group_by(Year) %>% mutate(cumsum = cumsum(Distance))
p7 <- ggplot( data = df1, aes(x = Date,y = cumsum, group = Year, color = Year)) + 
  geom_line() +
  labs(x = "Date", y = "Cumulative distance (km)")
# Plot these cumulative sums overlaid
# Find New Year's Day for each and then work out how many days have elapsed since
df1$nyd <- paste(df1$Year,"-01-01",sep = "")
df1$Days <- as.Date(df1$Date, format="%Y-%m-%d") - as.Date(as.character(df1$nyd), format="%Y-%m-%d")
# Make the plot
p8 <- ggplot( data = df1, aes(x = Days,y = cumsum, group = Year, color = Year)) + 
  geom_line() +
  scale_x_continuous() +
  labs(x = "Days", y = "Cumulative distance (km)")
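
To see what the last two steps do without a Garmin export to hand, here is a minimal sketch on a made-up data frame. It uses base R's ave() in place of the dplyr pipeline, purely so that the example has no package dependencies; the data frame toy and its values are invented for illustration.

```r
# made-up runs spanning two years (dates and distances are invented)
toy <- data.frame(
  Date = as.Date(c("2017-12-30", "2017-01-02", "2018-01-01", "2018-03-01", "2017-06-15")),
  Distance = c(10, 5, 8, 12, 21)
)
# sort by date so the running total accumulates in chronological order
toy <- toy[order(toy$Date), ]
toy$Year <- format(toy$Date, "%Y")
# running total of distance within each year
toy$cumsum <- ave(toy$Distance, toy$Year, FUN = cumsum)
# days elapsed since New Year's Day of the same year
toy$Days <- as.numeric(toy$Date - as.Date(paste0(toy$Year, "-01-01")))
print(toy)
```

The running total restarts at the first run of each year, and Days is zero for a run on 1 January, which is what lets the years be overlaid on a common x-axis.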

Finally, we can save all of the plots using ggsave.

# save all plots
ggsave("allPace.png", plot = p1, width = 8, height = 4, dpi = "print")
ggsave("paceByDist.png", plot = p2, width = 8, height = 4, dpi = "print")
ggsave("strideByDist.png", plot = p3, width = 8, height = 4, dpi = "print")
ggsave("HRByDist.png", plot = p4, width = 8, height = 4, dpi = "print")
ggsave("allPaceByDist.png", plot = p5, width = 8, height = 4, dpi = "print")
ggsave("paceByDistByYear.png", plot = p6, width = 8, height = 4, dpi = "print")
ggsave("cumulativeDistByYear.png", plot = p7, width = 8, height = 4, dpi = "print")
ggsave("cumulativeDistOverlay.png", plot = p8, width = 8, height = 4, dpi = "print")

I think the code might fail if you don’t record all of the fields that I do. For example if heart rate data is missing or stride length is not recorded, I’m not sure what the code will do. The aim here is to give an idea of what sorts of plots can be generated just using the summary data in the CSV provided by Garmin. Feel free to make suggestions in the comments below.
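
One way to find out before anything fails is to check which of the columns used above are actually present in the export. The sketch below is an assumption about how you might do that, not something from the original script; check_columns is a hypothetical helper, and the column names are simply the ones the code above refers to.

```r
# columns referred to by the plotting code above
needed_cols <- c("Date", "Distance", "Avg.Pace", "Avg.HR", "Avg.Stride.Length")

# Hypothetical helper: return the names in `needed` that are absent from `df`
check_columns <- function(df, needed = needed_cols) {
  setdiff(needed, colnames(df))
}

# made-up data frame with no heart rate or stride length columns
example_df <- data.frame(Date = "2019-01-01 09:00:00", Distance = 5, Avg.Pace = "5:30")
missing_cols <- check_columns(example_df)
if (length(missing_cols) > 0) {
  message("Missing columns: ", paste(missing_cols, collapse = ", "))
}
```

With a real export you could call check_columns(df1) straight after read.csv and skip the stride or heart rate plots if those columns are reported missing.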

The post title comes from “Garmonbozia” by Superdrag from the Regretfully Yours album. Apparently Garmonbozia is something eaten by the demons in the Black Lodge in the TV series Twin Peaks.

Not What You Want: our new paper on a side effect of GFP nanobodies

We have a new preprint out – it is a cautionary tale about using GFP nanobodies in cells. This short post gives a bit of background to the work. If you are interested in using GFP nanobodies in cells, please read the paper; you can find it here.

Paper in a nutshell: Caution is needed when using GFP nanobodies because they can inhibit their target protein in cells.

People who did the work: Cansu Küey did most of the work for the paper. She discovered the inhibition side effect of the dongles. Gabrielle Larocque contributed a figure where she compared dongle-knocksideways with regular knocksideways. The project was initiated by Nick Clarke who made our first set of dongles and tested which fluorescent proteins the nanobody binds in cells. Lab people and their profiles can be found here.

Background: Many other labs have shown that nanobodies can be functionalised so that you can stick new protein domains onto GFP-tagged proteins to do new things. This is useful because it means you can “retrofit” an existing GFP knock-in cell line or organism to do new things like knocksideways without making new lines. In fact there was a recent preprint which described a suite of functionalised nanobodies that can confer all kinds of functions to GFP.

Like many other labs we were working on this method. We thought functionalised GFP nanobodies resembled “dongles” – those adaptors that Apple makes so much money from – that convert one port to another.

Dongles, dongles, dongles… (photo by Rex Hammock, licensed for reuse https://www.flickr.com/photos/rexblog/5575298582)

A while back we made several different dongles. We were most interested in a GFP nanobody with an additional FKBP domain that would allow us to do knocksideways (or FerriTagging) in GFP knock-in cells. For those that don’t know, knocksideways is not a knockout or a knockdown, but a way of putting a protein somewhere else in the cell to inactivate it. The most common place to send a protein is to the mitochondria.

Knocksideways works by joining FKBP and FRB (on the mitochondria) using rapamycin. Normally FKBP is fused to the protein of interest (top). If we just have a GFP tag, we can’t do knocksideways (middle). If we add a dongle (bottom) we can attach FKBP domains to allow knocksideways to happen.

We found that dongle-knocksideways works really well and we were very excited about this method. Here we are removing GFP-clathrin from the mitotic spindle in seconds using dongle knocksideways.

GFP-clathrin, shown here in blue is sent to the mitochondria (yellow) using rapamycin. This effect is only possible because of the dongle which adds FKBP to GFP via a GFP nanobody.

Since there are no specific inhibitors of endocytosis, we thought dongle knocksideways would be cool to try in cells with dynamin-2 tagged with GFP at both alleles. There is a line from David Drubin’s lab which is widely used. This would mean we could put the dongle plasmids on Addgene and everyone could inhibit endocytosis on-demand!

Initial results were encouraging. We could put dynamin onto mitochondria alright.

Dynamin-2-GFP undergoing dongle-knocksideways. The Mitotrap is shown in red and dynamin is green.

But we hit a problem. It turns out that dongle binding to dynamin inhibits endocytosis. So we have unintended inhibition of the target protein. This is a big problem because the power of knocksideways comes from being able to observe normal function and then rapidly switch it off. So if there is inhibition before knocksideways, the method is useless.

Now, this problem could be specific to dynamin or it might be a general problem with all targets of dongles. Either way, we’ve moved away from this method and written this manuscript to alert others to the side effects of dongles. We discuss possible ways forward for this method and also point out some applications of the nanobody technology that are unaffected by our observations.

The post title comes from “Not What You Want” by Sleater-Kinney from their wonderful Dig Me Out record.

Five Get Over Excited: Academic papers that cite quantixed posts

Anyone that maintains a website is happy that people out there are interested enough to visit. Web traffic is one thing, but I take greatest pleasure in seeing quantixed posts being cited in academic papers.

I love the fact that some posts on here have been cited in the literature more than some of my actual papers.

It’s difficult to track citations to web resources. This is partly my fault: I think it is possible to register posts so that they have a DOI, but I have not done this and so tracking is a difficult task. Websites are part of what is known as the grey literature: items that are not part of traditional academic publishing.

The most common route for me to discover that a post has been cited is when I actually read the paper. There are four examples that spring to mind: here, here, here and here. With these papers, I read the paper and was surprised to find quantixed cited in the bibliography.

Vanity and curiosity made me wonder if there were other citations I didn’t know about. A cited reference search in Web of Science pulled up two more: here and here.

A bit of Googling revealed yet more citations, e.g. two quantixed posts are cited in this book. And another citation here.

OK so quantixed is not going to win any “highly cited” prizes or develop a huge H-index (if something like that existed for websites). But I’m pleased that 1) there are this many citations given that there’s a bias against citing web resources, and 2) the content here has been useful to others, particularly for academic work.

All of these citations are to posts looking at impact factors, metrics and publication lag times. In terms of readership, these posts get sustained traffic, but currently the most popular posts on quantixed are the “how to” guides, with LaTeX to Word and Back seeing the most traffic. Somewhere in between citation and web traffic are cases when quantixed posts get written about elsewhere, e.g. in a feature in Nature by Kendall Powell.

The post title comes from “Five Get Over Excited” by The Housemartins, a band with a great eye for song titles. It can be found on the album “The People Who Grinned Themselves to Death”.

All Around The World: Maps and Flags in R

Our lab is international. People born all over the world have come to work in my group. I’m proud of this fact, especially in the current political climate. I’ve previously used the GoogleMaps API to display a heat map on our lab webpage. It shows where in the world people in the lab come from. This was OK, but I wanted an R-based solution for making this graphic so that updates are easier to automate.

This post is an explainer and “how to” guide. Code and some example data are here.

The idea is to create graphics to describe the origins of a group of people. For my use-case it is my research group, but it could be any group of people. All you need is a list of countries that the people come from.

In the example for this post, I took the Top 100 Footballers voted for by Guardian readers in 2016. In my lab dataset, I store the countries of origin in ISO2 format. This means Spain is ES, Germany is DE and so on. I converted the Guardian dataset to ISO2 format using a lookup and then I was ready to put it into the R script.

if (!require("rworldmap")) {
  install.packages("rworldmap")
  library(rworldmap)
}
# ggplot2, ggflags, dplyr are needed for the bar charts
library(ggplot2)
library(dplyr)
if (!require("ggflags")) {
  devtools::install_github("rensa/ggflags")
  library(ggflags)
}

# csv file with each person as a row and containing a column with the header Origin and
# countries in 2-letter ISO format (change joinCode for other formats)
file_name <- file.choose()
df1 <- read.csv(file_name, header = TRUE, stringsAsFactors = FALSE)
countries_lab <- as.data.frame(table(df1$Origin))
colnames(countries_lab) <- c("country", "value")
matched <- joinCountryData2Map(countries_lab, joinCode="ISO2", nameJoinColumn="country")

This part of the script sets up the libraries that are needed. We’ll use the rworldmap package first. This post was very useful for guidance. We load in the csv file which contains the countries of origin for the people we want to map out. It needs only one column with the ISO2 codes, and extra columns are fine as long as the header for the countries column is “Origin”.

This column is extracted and a new dataframe is made from it which has each country as a row and the number of instances of that country as a second column. These are labelled “country” and “value” for convenience. Now rworldmap does its thing with the joinCountryData2Map line. Next we make the map!
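
As a minimal sketch of the counting step on its own, the snippet below builds the same two-column data frame from a made-up vector of ISO2 codes (the codes are invented, not taken from the Guardian dataset). It needs only base R, so it can be run without rworldmap installed.

```r
# made-up countries of origin, one entry per person
origin <- c("ES", "DE", "ES", "BR", "ES", "DE")
# tabulate: one row per country, with the number of people from each
countries_example <- as.data.frame(table(origin), stringsAsFactors = FALSE)
colnames(countries_example) <- c("country", "value")
print(countries_example)
```

table() sorts the countries alphabetically, so the row order here does not reflect the counts.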

# make png of the map
png(file = "labWorldMap.png",
    width = 1024, height = 768)
par(mai=c(0,0,0.2,0))
mapCountryData(matched,
               nameColumnToPlot="value",
               mapTitle= "",
               catMethod = "logFixedWidth",
               colourPalette = "heat",
               oceanCol="lightblue",
               missingCountryCol="white",
               addLegend = FALSE,
               lwd = 1)
dev.off()

Where in the world… heat map showing country of origin for the people in the dataset

This makes a nice map. I’ve hidden the legend which shows what the colours mean. The map can be customised in lots of ways. I liked the way this map looked and my other aim was to make a chart to show the relative numbers of people in each country. Speaking of which…

# make bar chart of lab members
countries_lab %>%
  mutate(code = tolower(country)) %>%
  ggplot(aes(x = reorder(country, value), y = value)) +
  geom_bar(stat = "identity") +
  geom_flag(y = -1, aes(country = code), size = 4) +
  scale_y_continuous(expand = c(0.1, 1)) +
  xlab("Country") +
  ylab("Members") +
  theme_bw() +
  theme(legend.title = element_blank()) +
  coord_flip()
ggsave("plot.png", plot = last_plot())

Using the data frame we made previously, we can pipe it to ggplot2 via the wonders of dplyr. I am using geom_flag here from the ggflags library. Note that this is a fork of ggflags which gives circular flags which look great on the graph. The geom_flag needs a lowercase entry for each ISO2 country code so the first step is to mutate the country column to make a new lowercase column called code.

Bar chart of the same dataset using flag emojis for the tick labels

That’s it! With a csv file and a few lines of R code you can generate some nice looking graphics.

The dataset shows that the country that produced the biggest fraction of the world’s best footballers (as voted for by Guardian readers) was Spain. There are no surprises in this dataset, with the most prominent European and South American countries giving a strong showing.

The post title is taken from “All Around The World” by The Dead Milkmen. Many songs in my library with this title, but this paranoid extraterrestrial tune gets the pick this time.

One With The Freaks – very highly cited papers in biology

I read this recent paper about very highly cited papers and science funding in the UK. The paper itself was not very good, but the dataset which underlies the paper is something to behold, as I’ll explain below.

The idea behind the paper was to examine very highly cited papers in biomedicine with a connection to the UK. Have those authors been successful in getting funding from MRC, Wellcome Trust or NIHR? They find that some of the authors of these very highly cited papers are not funded by these sources. Note that these funders are some, but not all, of the science funding bodies in the UK. The authors also looked at panel members of those three funders, and report that these individuals are funded at high rates and that the overlap between panel membership and very highly cited authorship is very low. I don’t want to critique the paper extensively, but the conclusions drawn are rather blinkered. A few reasons: 1, MRC, NIHR and Wellcome support science in other ways than direct funding of individuals (e.g. PhD programmes, infrastructure etc.). 2, The contribution of other funders e.g. BBSRC was ignored. 3, Panels tend to be selected from the pool of awardees, rather than the other way around. I understand that the motivation of the authors is to stimulate debate around whether science funding is effective, and this is welcome, but the paper strays too far into clickbait territory for my tastes.

The most interesting thing about the analysis (and arguably its main shortcoming) was the dataset. The authors took the papers in Scopus which have been cited >1000 times. This is ~450 papers as of last week. As I found out when I recreated their dataset, this is a freakish set of papers. Of course weird things can be found when looking at outliers.

Dataset of 20,000 papers from Scopus (see details below)

The authors describe a one-line search term they used to retrieve papers from Scopus. These papers span 2005 to the present day and were then filtered for UK origin.

LANGUAGE ( english ) AND PUBYEAR > 2005 AND ( LIMIT-TO ( SRCTYPE , "j" ) ) AND ( LIMIT-TO ( DOCTYPE , "ar" ) ) AND ( LIMIT-TO ( SUBJAREA , "MEDI" ) OR LIMIT-TO ( SUBJAREA , "BIOC" ) OR LIMIT-TO ( SUBJAREA , "PHAR" ) OR LIMIT-TO ( SUBJAREA , "IMMU" ) OR LIMIT-TO ( SUBJAREA , "NEUR" ) OR LIMIT-TO ( SUBJAREA , "NURS" ) OR LIMIT-TO ( SUBJAREA , "HEAL" ) OR LIMIT-TO ( SUBJAREA , "DENT" ) )

I’m not sure how accurate the dataset is in terms of finding papers of UK origin, but the point here is to look at the dataset and not to critique the paper.

I downloaded the first 20,000 (a limitation of Scopus). I think it will break the terms to put the dataset on here but if your institution has a subscription, it can be recreated. The top paper has 16,549 citations! The 20,000th paper has accrued 122 citations, and the papers with >1000 citations account for 450 papers as of last week.

Now, some papers are older than others, so I calculated the average citation rate by dividing total cites by the number of years since publication, to get a better picture of the hottest among these freaky papers. The two colour-coded plots show the years since publication. It is possible to see some young papers which are being cited at an even higher rate than the pack. These will move up the ranking faster than their neighbours over the next few months.
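
As a rough sketch of that calculation, here it is in R for three rows of the top 20 table. The reference year is an assumption (set it to whenever the citation counts were retrieved), and cites_per_year is a hypothetical helper name.

```r
# three rows from the top 20 table (year and total cites as listed;
# the third title is abbreviated)
papers <- data.frame(
  title = c("Clustal W and Clustal X version 2.0",
            "The Sequence Alignment/Map format and SAMtools",
            "GLOBOCAN 2012"),
  year = c(2007, 2009, 2015),
  cites = c(16549, 13586, 10352)
)

# Hypothetical helper: average citations per year since publication.
# ref_year is an assumption; use the year the counts were retrieved.
cites_per_year <- function(cites, year, ref_year) {
  cites / (ref_year - year)
}

papers$rate <- cites_per_year(papers$cites, papers$year, ref_year = 2019)
# rank by rate rather than by total cites
print(papers[order(-papers$rate), ])
```

With that assumed reference year, the youngest of the three papers has the highest rate even though it has the fewest total citations.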

Just looking at the “Top 20” is amazing. These papers are being cited at rates of approximately 1000 times per year. The paper ranked 6 is a young paper which is cited at a very high rate and will likely move up the ranking. So what are these freakish papers?

In the table below (apologies for the strange formatting), I’ve pasted the top 20 of the highly cited paper dataset. They are a mix of clinical consortia papers and bioinformatics tools for sequence and structural analysis. The tools make sense. They are widely used in a huge number of papers and get heavily cited as a result. In fact, these citation numbers are probably an underestimate, since citations to software can often get missed out of papers. The clinical papers are also useful to large fields. They have many authors and there is a network effect to their citation which can drive up the cites to these items (this is noted in the paper I described above). Even though the data are expected, I was amazed by the magnitude of citations and the rates that these works are acquiring citations. The topic of papers is pretty similar beyond the top 20.

There’s no conclusion for this post. There are a tiny subset of papers out there with freakishly high citation rates. We should simply marvel at them…

Rank | Title | Year | Journal | Total cites
1 | Clustal W and Clustal X version 2.0 | 2007 | Bioinformatics | 16549
2 | The Sequence Alignment/Map format and SAMtools | 2009 | Bioinformatics | 13586
3 | Fast and accurate short read alignment with Burrows-Wheeler transform | 2009 | Bioinformatics | 12653
4 | PLINK: A tool set for whole-genome association and population-based linkage analyses | 2007 | American Journal of Human Genetics | 12241
5 | Estimates of worldwide burden of cancer in 2008: GLOBOCAN 2008 | 2010 | International Journal of Cancer | 11047
6 | Cancer incidence and mortality worldwide: Sources, methods and major patterns in GLOBOCAN 2012 | 2015 | International Journal of Cancer | 10352
7 | PHENIX: A comprehensive Python-based system for macromolecular structure solution | 2010 | Acta Crystallographica Section D: Biological Crystallography | 10093
8 | Phaser crystallographic software | 2007 | Journal of Applied Crystallography | 9617
9 | New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1) | 2009 | European Journal of Cancer | 9359
10 | Features and development of Coot | 2010 | Acta Crystallographica Section D: Biological Crystallography | 9241
11 | Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities | 2009 | Applied and Environmental Microbiology | 8127
12 | BEAST: Bayesian evolutionary analysis by sampling trees | 2007 | BMC Evolutionary Biology | 8019
13 | Improved survival with ipilimumab in patients with metastatic melanoma | 2010 | New England Journal of Medicine | 7293
14 | OLEX2: A complete structure solution, refinement and analysis program | 2009 | Journal of Applied Crystallography | 7173
15 | Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: A systematic analysis for the Global Burden of Disease Study 2010 | 2012 | The Lancet | 6296
16 | New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0 | 2010 | Systematic Biology | 6290
17 | The MIQE guidelines: Minimum information for publication of quantitative real-time PCR experiments | 2009 | Clinical Chemistry | 6086
18 | The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials | 2011 | BMJ (Online) | 6065
19 | Velvet: Algorithms for de novo short read assembly using de Bruijn graphs | 2008 | Genome Research | 5550
20 | A comparative risk assessment of burden of disease and injury attributable to 67 risk factors and risk factor clusters in 21 regions, 1990-2010: A systematic analysis for the Global Burden of Disease Study 2010 | 2012 | The Lancet | 5499

The post title comes from “One With The Freaks” by The Notwist.

All That Noise: The vesicle packing problem

This week Erick Martins Ratamero and I put up a preprint on vesicle packing. This post is a bit of backstory, but please take a look at the paper: it’s very short and simple.

The paper started when I wanted to know how many receptors could fit in a clathrin-coated vesicle. Sounds like a simple problem – but it’s actually more complicated.

Of course, this problem is not as simple as calculating the surface area of the vesicle, the cross-sectional area of the receptor and dividing one by the other. The images above show the problem. The receptors would be the dimples on the golf ball… they can’t overlap… how many can you fit on the ball?

It turns out that a PhD student working in Groningen in 1930 posed a similar problem (known as the Tammes Problem) in his thesis. His concern was the even pattern of pores on a pollen grain, but the root of the problem is the Thomson Problem. This is the minimisation of energy that occurs when charged particles are on a spherical surface. The particles must distribute themselves as far away as possible from all other particles.

There are very few analytical solutions to the Tammes Problem (presently 3-14 and 24 are solved). Anyhow, our vesicle packing problem is the other way around. We want to know, for a vesicle of a certain size, and cargo of a certain size, how many can we fit in.
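
For comparison, the naive calculation mentioned earlier (surface area of the vesicle divided by the footprint of one cargo molecule) is easy to write down, and it gives an upper bound that real packing can't reach. The sketch below treats each cargo molecule as a flat disc of radius r on a sphere of radius R, so the bound is 4πR²/πr² = 4(R/r)². The second function scales this by π/(2√3) ≈ 0.907, the densest packing of equal circles in the plane, which is still only an approximation because it ignores curvature. The function names and the radii are hypothetical and are not the values used in the paper.

```r
# Naive upper bound: sphere surface area / disc area = 4 * (R / r)^2
naive_max_count <- function(vesicle_radius, cargo_radius) {
  floor(4 * (vesicle_radius / cargo_radius)^2)
}

# Rough correction: discs cannot cover a surface without gaps.
# pi / (2 * sqrt(3)) is the hexagonal packing density in the plane;
# applying it to a sphere is an approximation that ignores curvature.
hex_adjusted_count <- function(vesicle_radius, cargo_radius) {
  floor(4 * (vesicle_radius / cargo_radius)^2 * pi / (2 * sqrt(3)))
}

# hypothetical radii in arbitrary (but matching) units
naive_max_count(10, 1)
hex_adjusted_count(10, 1)
```

Neither number is a substitute for a proper Tammes-style solution; they are only quick sanity checks on the order of magnitude.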

Fortunately, stochastic Tammes solvers are available, like this one, that we could adapt. It turns out that the number of receptors that could be packed is enormous: for a typical clathrin-coated vesicle almost 800 G Protein-Coupled Receptors could fit on the surface. Note that this doesn’t take into account steric hindrance and assumes that the vesicle carries nothing else. Full details are in the paper.

Why does this matter? Many labs are developing ways to count molecules in cellular structures by light or electron microscopy. We wanted to have a way to check that our results were physically possible. For example, if we measure 1000 GPCRs in a clathrin-coated vesicle, we know something has gone wrong.

What else? This paper ticked a few things on my publishing bucket list: a paper that is solely theoretical, a coffee-break idea paper and one that is on a “fun” subject. Erick has previous form with theoretical/fun papers, previously publishing on modelling peloton dynamics in procycling.

We figured the paper was more substantial than a blog post yet too minimal to send to a journal. So unless a journal wants to publish it (and gets in touch with us), this will be my first preprint where bioRxiv is the final destination.

We got a sense that people might be interested in an answer to the vesicle packing problem because whenever we asked people for an estimate, we got hugely different answers! The paper has been well-received so far. We’ve had quite a few comments on Twitter and we’re glad that we wrote up the work.

The post title comes from the “All That Noise” LP by The Darkside. I picked this not because of the title, but because of the cover.

All That Noise cover shows a packing problem on a sphere

A Certain Ratio: Gender ratio in our papers

I saw today on Twitter that a few labs were examining the gender balance of their papers and posting the ratios of male:female authors. It started with this tweet.

This analysis is simple to perform, but interpreting it can be hard. For example, is the research group gender balanced to start with? How many of the authors are collaborators? Nonetheless, I have the data for all of my papers, so I thought I’d take a quick look too.

To note: the papers are organised chronologically from top to bottom. They include my papers from before I was a PI (first eight papers). Only research papers are listed, no reviews or methods papers and only those from my lab (i.e. collaborative papers where I am not corresponding author are excluded).

Female authors are blue-green, males are orange. I am a dark orange colour. The blocks are organised according to author list. Joint first authors are boxed.

To the right is a graphic to show the gender ratio. The size of the circle indicates the number of authors on the paper. This is because a paper with M:F ratio of 1 is excusable with 2 or 3 authors, but not with 8. Most of our papers have only a few authors so it’s not a great metric.

On the whole the balance is pretty good. Men and women are equally likely to be first author and they are well-represented in the author list. On the other hand, the lab has always had a healthy gender balance and so I would’ve been surprised to find otherwise.

Edit: I replaced the graphic above after a few errors were pointed out to me. Specifically, some authors were added to three papers during revisions and were not in the list I used. Also, it was suggested to me that removing myself from the analysis would be a good idea! This was a good suggestion and the corresponding graphic is below.

The post title is taken from the band A Certain Ratio, a Factory Records staple from the 1980s.

Installing open source PyMol on a Mac

This is a quick “how to” post. There is a licensed version of PyMol (MacPyMol) available, but the open source version can be installed on a Mac free of charge. The official page has a guide, which is not terribly detailed, and I found this excellent guide which is unfortunately out-of-date.

Here is an updated guide to installing PyMol using Homebrew on macOS Mojave 10.14.3.

Step 1 is to install Homebrew:

/usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

The next step is to install the PyMol dependencies using Homebrew:

brew install Caskroom/cask/xquartz
brew install tcl-tk
brew install python

Now install PyMol.

brew install brewsci/bio/pymol

You can now run PyMol by typing

pymol

The MPIBR-Bioinformatics page has the following guide to make a little executable app to launch PyMol straight from the Desktop.

  • Open Automator, which is in Applications in macOS.
  • Create new document, select “Application”.
  • Select “Actions” and “Library” in the left pane. Select “Utilities” and “Run Shell Script”. Drag this into the main pane.
  • Choose “/bin/bash” as a shell.
  • Paste the following: /usr/local/bin/pymol -M. If this doesn’t work, check the path to pymol using which pymol in the terminal, and use this instead.
  • Save the application (“File > Save”) to the Desktop and name it “PyMol”.

Now you can double-click this app to run PyMol.

Plotlines: the story behind the graph

On a whim I posted a plot on Twitter. It shows a marathon training schedule. This post explains the story behind the graph.

I downloaded a few different 17-week marathon training schedules. Most were in imperial measurement and/or were written for time at a certain pace, e.g. 30 min Easy Run etc. I wanted to convert one of these schedules into a proper plan where I input my pace and get an idea of the distance I need to run (in metric). This means I can pick routes to run each day without having to think too much about it.

Calculating the running paces was simple using Jack Daniels’ VDot calculator and verifying the predicted paces with my running database. I constructed a spreadsheet from the plan, and then did the calculations to get the distances out. Once this was done I wondered what the rationale behind the schedule is, and the best way to see that is to plot it out.
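
The conversion itself is just distance = time / pace. Here is a minimal sketch in R; the paces are invented for illustration rather than taken from a VDot calculation, and the helper names are hypothetical.

```r
# made-up training paces in minutes per km
pace_min_per_km <- c(Easy = 6, Threshold = 4.5)

# Hypothetical helper: distance covered (km) when running for
# duration_min minutes at the named pace
session_distance_km <- function(duration_min, pace_name, paces = pace_min_per_km) {
  unname(duration_min / paces[pace_name])
}

# e.g. a "30 min Easy Run" from a time-based schedule
session_distance_km(30, "Easy")

# schedules written in miles can be converted with 1 mile = 1.609344 km
miles_to_km <- function(miles) miles * 1.609344
miles_to_km(10)
```

In a spreadsheet the same thing is one division per session, which is all the original plan needed.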

No colour-coding

From this plot the way that the long run on each Sunday is extended or tapered is reasonably clear. However, I was wondering about how intense each of the runs will be. Running 5K at threshold is more intense than a 10K easy run. To look at this I just took the average pace for the session. This doesn’t quite tell us about intensity, because a 10 min easy run + 2 intervals + 10 min easy run is not as intense as doing 4 intervals, yet the average pace would be similar. But it would be close enough. I used a colourscale called VioletOrangeYellow and the result was quite intuitive.
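As a rough illustration of what “average pace for the session” means here, the sketch below divides total time by total distance for a session made of several segments. The session and its paces are invented for the example.

```r
# One row per segment of a hypothetical session:
# duration in minutes and pace in minutes per km
session <- data.frame(
  segment = c("easy", "interval", "interval", "easy"),
  minutes = c(10, 5, 5, 10),
  pace    = c(6, 4, 4, 6),
  stringsAsFactors = FALSE
)

# Average pace of a session = total time / total distance
avg_pace <- function(minutes, pace) {
  km <- minutes / pace          # distance covered in each segment
  sum(minutes) / sum(km)
}

session_pace <- avg_pace(session$minutes, session$pace)
```

For this made-up session the result is about 5.14 min/km. The single number says nothing about how the fast and slow segments are arranged, which is why it is only a proxy for intensity.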

A menu of pain – marathon training schedule

The shorter runs are organised in blocks of intensity while the long runs are about building endurance. From what I understand, the blocks are to do with adapting to the stresses of increasing running load/intensity.

Feedback on the plot was good: runners liked it, non-runners thought it was psychotic.

Small Talk: How big is your lab?

I really dislike being asked “how big is your lab?”. The question usually arises at scientific meetings when you are chatting to someone during a break. Small talk can lead to some banal questions being asked, and that’s fine, but when this question is asked seriously, the person asking really just wants to compare themselves to you in some way. This is one reason why I dislike being asked “how big is your lab?”.

The other reason I don’t like the question is that it can be difficult to answer. I don’t mean that I have so many people in my group that I can’t possibly count them. No, I mean that it can be difficult to give an accurate answer. There’s perhaps a student in the group who is currently writing up, or possibly they’ve handed in their thesis and they are awaiting a viva. Do they count towards the tally? They are in your lab but they’re not in your lab. Perhaps you jointly supervise someone, or maybe there is someone who is away working in another lab somewhere. I’m guilty of overthinking this or at least fretting about giving an incorrect answer. Whatever the circumstance, I think that the size of most research groups is not very stable over time, so I dislike the question because it’s difficult to answer accurately.

I looked at group size recently because the lab had surpassed the milestone of having 50 all-time members and I wanted to see how the group size had varied over time.

Timeline of people in the lab

The first timeline shows the arrival and departure of lab members over time. The role of each person is colour coded as indicated. Note that some people start in one role and get upgraded: PG to PhD, PhD to post-doc (PDRA). So what is the group size over time?

Group size over time

It turns out that we peaked this year with a team size of 12. The smallest size (besides the period where I started out, when I was on my own!) was at the end of 2012 when I prepared to move the lab to a different university. What has the make-up of the lab been during this time?

Constitution of the lab over time

In this last plot the fraction of the team that are PhD, post-doc etc. is shown over time. This plot is interesting because I can see that it was two years before a PhD student joined the group and also how the lab has become post-doc-heavy in the last 18 months.

So what is the answer to “how big is your lab?” Well, take your pick. Right now it is 11, with someone having joined just this week. Over the last year it has averaged just over 10. Over the last five years it has been 8 to 9. It’s still not an easy question to answer, even if you can see all the data.

Methods: I have been trying to use R for this type of post, so that sharing the code is more useful, but I drew a blank with this one. I found several tools to plot the first timeline (timevis and vistime). To do the integration and breakdown plots, I struggled… I knew exactly how to make those plots in Igor, so that’s what I did. All that was required was a list of the people, their role, their start-end dates, and a few lines of code. I keep a record of this as previously mentioned.
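For anyone wanting to attempt the counting step in R, a minimal base-R sketch might look like the following. It is not the Igor code used for the plots above. The people, roles and dates are invented, and it assumes one row per person per role, so someone who is upgraded gets a second row with new dates.

```r
# Hypothetical lab members (invented data); NA end date = still in the lab
people <- data.frame(
  name  = c("A", "B", "C", "D"),
  role  = c("PhD", "PDRA", "PG", "PDRA"),
  start = as.Date(c("2010-01-01", "2010-06-01", "2011-03-01", "2012-01-01")),
  end   = as.Date(c("2013-12-31", "2011-12-31", NA, NA)),
  stringsAsFactors = FALSE
)

# Sample the group on the first day of each month
months <- seq(as.Date("2010-01-01"), as.Date("2012-12-01"), by = "month")

# Treat open-ended memberships as running to the last sampled month
people$end[is.na(people$end)] <- max(months)

# Logical matrix: rows = months, columns = people, TRUE if present
present <- sapply(seq_len(nrow(people)), function(i) {
  people$start[i] <= months & months <= people$end[i]
})

# Group size over time
size <- rowSums(present)

# Head count per role, then as a fraction of the group
roles   <- unique(people$role)
by_role <- sapply(roles, function(r) {
  rowSums(present[, people$role == r, drop = FALSE])
})
fraction <- by_role / size
```

From here, plot(months, size, type = "s") would give a step plot of group size, and the columns of fraction hold the breakdown by role for each month. The division assumes the group is never empty in the sampled period.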

The post title comes from “Small Talk” a track by American Culture on a Split 7″ with Boyracer on Emotional Response Records.