Rollercoaster IV: ups and downs of Google Scholar citations

Time for an update to a previous post. For the past few years, I have been using an automated process to track citations to my lab’s work on Google Scholar (details of how to set this up are at the end of this post).

Due to the nature of how Google Scholar tracks citations, it means that citations get added (hooray!) but might be removed (booo!). Using a daily scrape of the data it is possible to watch this happening. The plots below show the total citations to my papers and then a version where we only consider the net daily change.

Four years of tracking citations on Google Scholar

The general pattern is for papers to accrue citations and some do so faster than others. You can also see that the number of citations occasionally drops down. Remember that we are looking at net change here. So a decrease of one citation is masked by the addition of one citation and vice versa. Even so, you can see net daily increases and even decreases.

It’s difficult to see what is happening down at the bottom of the graph so let’s separate them out. The two plots below show the net change in citations, either on the same scale (left) or scaled to the min/max for that paper (right).

Citation tracking for individual papers

The papers are shown here ranked from the ones that accrued the most citations down to the ones that gained no citations while they were tracked. Five “new” papers began to be tracked very recently. This is because I changed the way that the data are scraped (more on this below).

The version on the right reveals a few interesting things. Firstly that there seems to be “bump days” where all of the papers get a jolt in one direction or another. This could be something internal to Google or the addition or several items which all happen to cite a bunch of my papers. The latter explanation is unlikely, given the frequency of changes seen in the whole dataset. Secondly, some papers are highly volatile with daily toggling of citation numbers. I have no idea why this may be. Two plots below demonstrate these two points. The arrow shows a “bump day”. The plot on the right shows two review papers that have volatile citation numbers.

I’m going to keep the automated tracking going. I am a big fan of Google Scholar, as I have written previously, but quoting some of the numbers makes me uneasy, knowing how unstable they are.

Note that you can use R to get aggregate Google Scholar data as I have written about previously.

How did I do it?

The analysis would not be possible without automation. I use a daemon to run a shell script everyday. This script calls a python routine which outputs the data to a file. I wrote something in Igor to load each day’s data, and crunch the numbers, and make the graphs. The details of this part are in the previous post.

I realised that I wasn’t getting all of my papers using the previous shell script. Well, this is a bit of a hack, but I changed the calls that I make to scholar.py so that I request data from several years.

#!/bin/bash
cd /directory/for/data/
python scholar.py -c 500 --author "Sam Smith" --after=1999 --csv > g1999.csv
sleep $[ ( $RANDOM % 15 )  + 295 ]
# and so on
python scholar.py -c 500 --author "Sam Smith" --after=2019 --csv > g2019.csv
OF=all_$(date +%Y%m%d).csv
cat g*.csv > $OF
rm g*.csv

I found that I got different results for each year I made the query. My first change was to just request all years using a loop to generate the calls. This resulted in an IP ban for 24 hours! Through a bit of trial-and-error I found that reducing the queries to ten and waiting a polite amount of time between queries avoided the ban.

The hacky part was to figure out which year requests I needed to make to make sure I got most of my papers. There is probably a better way to do this!

I still don’t get every single paper and I retrieve data for a number of papers on which I am not an author – I have no idea why! I exclude the erroneous papers using the Igor program that reads all the data and plots out everything. The updated version of this code is here.

As described earlier I have many Rollercoaster songs in my library. This time it’s the song by Sleater-Kinney from their “The Woods” album.

Turn A Square: generative aRt

A while back I visited Artistes & Robots in Paris. Part of the exhibition was on the origins of computer-based art. Nowadays this is referred to as generative art, where computers generate artwork according to rules specified by the programmer. I wanted to emulate some of the early generative artwork I saw there, using R.

Some examples of early generative art

Georg Nees was a pioneer of computer-based artwork and he has a series of pieces using squares. An example I found on the web was Schotter (1968).

CIS:E.217-2008

Another example using rotation of squares is Boxes by William J. Kolomyjec.

Boxes

I set out to generate similar images where the parameters can be tweaked to adjust the final result. The code is available on GitHub. Here is a typical image. Read on for an explanation of the code.

Generative art made in R

Generating a grid of squares

I started by generating a grid of squares. We can use the segment command in R to do the drawing.

xWave <- seq.int(1:10)
yWave <- seq.int(1:10)

for (i in seq_along(xWave)) {
  xCentre <- xWave[i]
  for (j in seq_along(yWave)) {
    yCentre <- yWave[j]
    lt <- c(xCentre - 0.4,yCentre - 0.4)
    rt <- c(xCentre + 0.4,yCentre - 0.4)
    rb <- c(xCentre + 0.4,yCentre + 0.4)
    lb <- c(xCentre - 0.4,yCentre + 0.4)
    new_shape_start <- rbind(lt,rt,rb,lb)
    new_shape_end <- rbind(rt,rb,lb,lt)
    new_shape <- cbind(new_shape_start,new_shape_end)
    if(i == 1 &amp;&amp; j == 1) {
      multiple_segments <- new_shape
    } else {
      multiple_segments <- rbind(multiple_segments,new_shape)
    }
  }
}
par(pty="s")
plot(0, 0,
     xlim=c(min(xWave)-1,max(xWave)+1),
     ylim=c(min(yWave)-1,max(yWave)+1),
     col = "white", xlab = "", ylab = "", axes=F) 
segments(x0 = multiple_segments[,1],
         y0 = multiple_segments[,2],
         x1 = multiple_segments[,3],
         y1 = multiple_segments[,4])
A simple grid

We’re using base R graphics to make this artwork – nothing fancy. Probably the code can be simplified!

Add some complexity

The next step was to add some complexity and flexibility. I wanted to be able to control three things:

  1. the size of the grid
  2. the grout (distance between squares)
  3. the amount of hysteresis (distorting the squares, rather than rotating them).

Here is the code. See below for some explanation and examples.

# this function will make grid art
# arguments define the number of squares in each dimension (xSize, ySize)
# grout defines the gap between squares (none = 0, max = 1)
# hFactor defines the amount of hysteresis (none = 0, max = 1, moderate = 10)
make_grid_art <- function(xSize, ySize, grout, hFactor) {
  xWave <- seq.int(1:xSize)
  yWave <- seq.int(1:ySize)
  axMin <- min(min(xWave) - 1,min(yWave) - 1)
  axMax <- max(max(xWave) + 1,max(yWave) + 1)
  nSquares <- length(xWave) * length(yWave)
  x <- 0
  halfGrout <- (1 - grout) / 2
  for (i in seq_along(yWave)) {
    yCentre <- yWave[i]
    for (j in seq_along(xWave)) {
      if(hFactor < 1) {
        hyst <- rnorm(8, halfGrout, 0)
      }
      else {
        hyst <- rnorm(8, halfGrout, sin(x / (nSquares - 1) * pi) / hFactor)
      }
      xCentre <- xWave[j]
      lt <- c(xCentre - hyst[1],yCentre - hyst[2])
      rt <- c(xCentre + hyst[3],yCentre - hyst[4])
      rb <- c(xCentre + hyst[5],yCentre + hyst[6])
      lb <- c(xCentre - hyst[7],yCentre + hyst[8])
      new_shape_start <- rbind(lt,rt,rb,lb)
      new_shape_end <- rbind(rt,rb,lb,lt)
      new_shape <- cbind(new_shape_start,new_shape_end)
      if(i == 1 &amp;&amp; j == 1) {
        multiple_segments <- new_shape
      } else {
        multiple_segments <- rbind(multiple_segments,new_shape)
      }
      x <- x + 1
    }
  }
  par(mar = c(0,0,0,0))
  plot(0, 0,
       xlim=c(axMin,axMax),
       ylim=c(axMax,axMin),
       col = "white", xlab = "", ylab = "", axes=F, asp = 1) 
  segments(x0 = multiple_segments[,1],
           y0 = multiple_segments[,2],
           x1 = multiple_segments[,3],
           y1 = multiple_segments[,4])
}

The amount of hysteresis is an important element. It determines the degree of distortion for each square. The distortion is done by introducing noise into the coordinates for each square that we map onto the grid. The algorithm used for the distortion is based on a half-cycle of a sine wave. It starts and finishes on zero distortion of the coordinates. In between it rises and falls symmetrically introducing more and then less distortion as we traverse the grid.

Changing the line of code that assigns a value to hyst will therefore change the artwork. Let’s look at some examples.

# square grid with minimal hysteresis
make_grid_art(10,10,0.2,50)
# square grid (more squares) more hysteresis
make_grid_art(20,20,0.2,10)
# rectangular grid same hysteresis
make_grid_art(25,15,0.2,10)
# same grid with no hysteresis
make_grid_art(25,15,0.2,0)
# square grid moderate hysteresis and no grout
make_grid_art(20,20,0,10)

Hopefully this will give you some ideas for simple schemes to generate artwork in R. I plan to add to the GitHub repo when get the urge. There is already some other generative artwork code in there for Igor Pro. Two examples are shown below (I don’t think I have written about this before on quantixed).

The post title comes from “Turn A Square” by The Shins from their album Chutes Too Narrow.

Timestretched: audio stretching on the command line

I was recently reminded of the wonders of paulstretch by a 8-fold slowed down version of Pyramid Song by Radiohead.

Slowed down version of Pyramid Song

Paulstretch is an audio manipulation widget that can stretch or compress the time of an audio recording. Note that it doesn’t “slow down” or “speed up” a recording, it resamples the audio and recasts it over a different time scale while maintaining the pitch. There’s lots of examples on the web of how paulstretch can stretch a song, but fewer examples of the other way around. I wondered what time compression would sound like for a slow song.

There’s a plugin for Audacity, which allows stretching but does not allow compressing. There is a python version available to run paulstretch from the command line, and another user has added the ability to process lossless audio (in FLAC format). My fork is here (with a very minor change) for permanence. These scripts all allow time compression as well as time stretching.

I compressed Ebony Tears by Cathedral. A doom metal tune from 1991 which is at ~56 bpm. Two-fold compression (-s 0.5 with 0.25 sampling) recasts the song as a 112 bpm heavy metal tune.

Ebony Tears twice as fast

For a more well known example of a slow song, I went for “Last Night I Dreamt That Somebody Loved Me” by The Smiths. Compression works OK with this song. Without sounding like a total philistine, the intro with the animal noises and Johnny Marr plonking away on the piano becomes mercifully shortened. Then the main song (0:58) turns into a more jolly reel at 0.5 compression.

Last night… twice as fast

…and a somehow more urgent moody tune (at 1:28) with 0.75. Disclaimer: I love the original version of this song, at the correct pace. I am not saying this is an improvement in any way.

Last night… 1.5 times as fast

Compressing songs was fun, but somehow not as fascinating as stretching. The trick is to pick songs with minimal percussion and vocals for the stretched version to sound like something other than prolonged noise. Here is a four-fold stretch of Into The Groove by Madonna. The vocals are OK but those gated 80s drums sound awful smudged over a longer time window.

Intooo the grooooove

And here are two-fold and four-fold versions of Joe Satriani’s instrumental The Forgotten (Part One). The original is an agitated guitar workout. It is transformed into an ambient soundscape with stretching.

The Forgotten x2
The Forgotten x4

The command line versions of paulstretch are easy to use and fun to experiment with. Feel free to comment with suggestions for good contenders for stretching or compressing.

The post title comes from “Timestretched” by The Divine Comedy from their Regeneration LP.

All Around The World: Maps and Flags in R

Our lab is international. People born all over the world have come to work in my group. I’m proud of this fact, especially in the current political climate. I’ve previously used the GoogleMaps API to display a heat map on our lab webpage. It shows where in the world people in the lab come from. This was OK, but I wanted to get an R based solution to make this graphic to make it easier to automate updates.

This post is an explainer and “how to” guide. Code and some example data are here.

The idea is to create graphics to describe the origins of a group of people. For my use-case it is my research group, but it could be any group of people. All you need is a list of countries that the people come from.

In the example for this post, I took the Top 100 Footballers voted for by Guardian readers in 2016. In my lab dataset, I store the countries of origin in ISO2 format. This means Spain is ES, Germany is DE and so on. I converted the Guardian dataset to ISO2 format using a lookup and then I was ready to put it into the R script.

if (!require("rworldmap")) {
  install.packages("rworldmap")
  library(rworldmap)
}
# ggplot2, ggFlags, dplyr are needed for the bar charts
library(ggplot2)
library(dplyr)
if (!require("ggflags")) {
  devtools::install_github("rensa/ggflags")
  library(ggflags)
}

# csv file with each person as a row and containing a column with the header Origin and
# countries in 2-letter ISO format (change joinCode for other formats)
file_name <- file.choose()
df1 <- read.csv(file_name, header = TRUE, stringsAsFactors = FALSE)
countries_lab <- as.data.frame(table(df1$Origin))
colnames(countries_lab) <- c("country", "value")
matched <- joinCountryData2Map(countries_lab, joinCode="ISO2", nameJoinColumn="country")

This part of the script sets up the libraries that are needed. We’ll use the rworldmap package first. This post was very useful for guidance. We load in the csv file which contains the countries of origin for the people we want to map out. It doesn’t need anything more than one column with the ISO2 codes. If it does it’s OK. As long as the header for the countries column is called “Origin”, all will be OK.

This column is extracted and a new dataframe is made from it which has each country as a row and the number of instances of that country as a second column. These are labelled “country” and “value” for convenience. Now rworldmap does its thing with the joinCountryData2Map line. Next we make the map!

# make png of the map
png(file = "labWorldMap.png",
    width = 1024, height = 768)
par(mai=c(0,0,0.2,0))
mapCountryData(matched,
               nameColumnToPlot="value",
               mapTitle= "",
               catMethod = "logFixedWidth",
               colourPalette = "heat",
               oceanCol="lightblue",
               missingCountryCol="white",
               addLegend = FALSE,
               lwd = 1)
dev.off()
Where in the world…. heat map showing country of origin for the people in the dataset

This makes a nice map. I’ve hidden the legend which shows what the colours mean. The map can be customised in lots of ways. I liked the way this map looked and my other aim was to make a chart to show the relative numbers of people in each country. Speaking of which…

# make bar chart of lab members
countries_lab %>%
  mutate(code = tolower(country)) %>%
  ggplot(aes(x = reorder(country, value), y = value)) +
  geom_bar(stat = "identity") +
  geom_flag(y = -1, aes(country = code), size = 4) +
  scale_y_continuous(expand = c(0.1, 1)) +
  xlab("Country") +
  ylab("Members") +
  theme_bw() +
  theme(legend.title = element_blank()) +
  coord_flip()
ggsave("plot.png", plot = last_plot())

Using the data frame we made previously, we can pipe it to ggplot2 via the wonders of dplyr. I am using geom_flag here from the ggflags library. Note that this is a fork of ggflags which gives circular flags which look great on the graph. The geom_flag needs a lowercase entry for each ISO2 country code so the first step is to mutate the country column to make a new lowercase column called code.

Bar chart of the same dataset using flag emojis for the tick labels

That’s it! With a csv file and a few lines of R code you can generate some nice looking graphics.

The dataset shows that the country that produced the biggest fraction of the world’s best footballers (as voted for by Guardian readers) was Spain. There are no surprises in this dataset. The most prominent European and South American countries giving a strong showing.

The post title is taken from “All Around The World” by The Dead Milkmen. Many songs in my library with this title, but this paranoid extraterrestrial tune gets the pick this time.

All That Noise: The vesicle packing problem

This week Erick Martins Ratamero and I put up a preprint on vesicle packing. This post is a bit of backstory but please take a look at the paper, it’s very short and simple.

The paper started when I wanted to know how many receptors could fit in a clathrin-coated vesicle. Sounds like a simple problem – but it’s actually more complicated.

Of course, this problem is not as simple as calculating the surface area of the vesicle, the cross-sectional area of the receptor and dividing one by the other. The images above show the problem. The receptors would be the dimples on the golf ball… they can’t overlap… how many can you fit on the ball?

It turns out that a PhD student working in Groningen in 1930 posed a similar problem (known as the Tammes Problem) in his thesis. His concern was the even pattern of pores on a pollen grain, but the root of the problem is the Thomson Problem. This is the minimisation of energy that occurs when charged particles are on a spherical surface. The particles must distribute themselves as far away as possible from all other particles.

There are very few analytical solutions to the Tammes Problem (presently 3-14 and 24 are solved). Anyhow, our vesicle packing problem is the other way around. We want to know, for a vesicle of a certain size, and cargo of a certain size, how many can we fit in.

Fortunately stochastic Tammes solvers are available like this one, that we could adapt. It turns out that the numbers of receptors that could be packed is enormous: for a typical clathrin-coated vesicle almost 800 G Protein-Coupled Receptors could fit on the surface. Note, that this doesn’t take into account steric hinderance and assumes that the vesicle carries nothing else. Full details are in the paper.

Why does this matter? Many labs are developing ways to count molecules in cellular structures by light or electron microscopy. We wanted to have a way to check that our results were physically possible. For example, if we measure 1000 GPCRs in a clathrin-coated vesicle, we know something has gone wrong.

What else? This paper ticked a few things on my publishing bucket list: a paper that is solely theoretical, a coffee-break idea paper and one that is on a “fun” subject. Erick has previous form with theoretical/fun papers, previously publishing on modelling peloton dynamics in procycling.

We figured the paper was more substantial than a blog post yet too minimal to send to a journal. So unless a journal wants to publish it (and gets in touch with us), this will be my first preprint where bioRxiv is the final destination.

We got a sense that people might be interested in an answer to the vesicle packing problem because whenever we asked people for an estimate, we got hugely different answers! The paper has been well-received so far. We’ve had quite a few comments on Twitter and we’re glad that we wrote up the work.

The post title comes from the “All That Noise” LP by The Darkside. I picked this not because of the title, but because of the cover.

All That Noise cover shows a packing problem on a sphere

Plotlines: the story behind the graph

On a whim a posted a plot on Twitter. It shows a marathon training schedule. This post explains the story behind the graph.

I downloaded a few different 17-week marathon training schedules. Most were in imperial measurement and/or were written for time at a certain pace, e.g. 30 min Easy Run etc. I wanted to convert one of these schedules into a proper plan where I input my pace and get an idea of the distance I need to run (in metric). This means I can pick routes to run each day without having to think too much about it.

Calculating the running paces was simple using Jack Daniels’ VDot calculator and verifying the predicted paces with my running database. I constructed a spreadsheet from the plan, and then did the calculations to get the distances out. Once this was done I wondered what the rationale behind the schedule is, and the best way to see that is to plot it out.

No colour-coding

From this plot the way that the long run on each Sunday is extended or tapered is reasonably clear. However, I was wondering about how intense each of the runs will be. Running 5K at threshold is more intense than a 10K easy run. To look at this I just took the average pace for the session. This doesn’t quite tell us about intensity, because a 10 min easy run + 2 intervals + 10 min easy run is not as intense as doing 4 intervals, yet the average pace would be similar. But it would be close enough. I used a colourscale called VioletOrangeYellow and the result was quite intuitive.

Marathon training schedule
A menu of pain – marathon training schedule

The shorter runs are organised in blocks of intensity while the long runs are about building endurance. From what I understand, the blocks are to do with adapting to the stresses of increasing running load/intensity.

Feedback on the plot was good: runners liked it, non-runners thought it was psychotic.

Experiment Zero: Using a Raspberry Pi Zero camera

This is the first post at quantixed about Raspberry Pi computing.

Pi Zero is a minimalist Raspberry Pi that can be coupled to a camera. With this little rig, you can make time-lapse footage amongst other things. I’ve set up a couple of these now. One was to make a time-lapse movie of some plants growing through a plastic maze. The results were pretty good and I thought I’d upload the video and a brief how-to guide.

After a delay, you can see four beans sprouting and then one eventually makes it to the top of the maze. This footage was shot over 27 days. The Pi took pictures every 5 min, but I sampled at 10 min in order to make the movie (after discarding the pictures after the sun went down). Everything was automated.

The camera shoots at 3280 × 2464. I downsampled the images to make the video. The camera didn’t focus well on the maze which was a bit too close. Other units are shooting scenery and the autofocus on the unit is great.

How I did it

Pi Zero

Pi Zero with camera module (without IR filter) and a case are available for around £40. I bought mine from the Pi Hut. Power supplies and SD cards are readily available. I put together the PiCam with a fresh Raspbian full image on a 16GB SD card. Another option is to use a smaller card and get the Pi to save the images to a server.

I used PiBaker to format the SD Card, load on Raspbian and add a startup script that would connect the Pi Zero to WiFi and enable VNC. That meant I could plug it in and start using it headless. Well in theory! It turns out that VNC via Mac does not work with the UNIX style password which is the default on the Pi. I needed to connect to a monitor to rectify this by changing to VNC password in the VNC GUI. After this I could log in and use the Pi Zero remotely.

A few more minor steps were needed for full functionality:

  1. I enabled ssh and camera port in Raspberry Pi Configuration, disabled bluetooth and set the correct timezone (this can probably be done in PiBaker but I forgot).
  2. Since I have several Raspberry Pis on the LAN. I needed to give this one its own identity to prevent network conflicts.
  3. I needed to set up SMB sharing on the new Pi.

Instructions for how to do these things are just one google search away.

Now the Pi was ready to start taking images. I built a little stand for it out of Lego and set up the plant maze.

Taking pictures with the Pi

I wrote a shell script to take pictures using raspistill.

I made a directory called camera in home/pi

mkdir camera

Then made a camera.sh file home/pi that looked like this:

#!/bin/bash
DATE=$(date +"%Y-%m-%d_%H%M")
raspistill -o /home/pi/camera/$DATE.jpg

Then I made it executable

chmod +x camera.sh

Using CRON, I execute the shell script on a schedule. I wanted to take pictures every 5 minutes. You can consult cronguru for your needs.

*/5 * * * * /home/pi/camera.sh 2>&amp;1
sudo modprobe bcm2835_wdt
sudo nano /etc/modules

Adding the line “bcm2835_wdt” and saving the file

Next I installed the watchdog daemon

sudo apt-get install watchdog chkconfig
chkconfig watchdog on
sudo /etc/init.d/watchdog start
sudo nano /etc/watchdog.conf

I uncommented two lines from this file:

  • watchdog-device = /dev/watchdog
  • the line that had max-load-1 in it

Save the watchdog.conf file.

There are guides online that describe how to set up the Pi so that it sends you an email or SMS when there’s a crash/reboot. I figured I didn’t need this – as long as it reboots OK.

What now?

Well, you wait for it to take photos! You can log in via VNC and check that the images are being acquired, or go in via ssh and watch the camera directory fill up. The size of the images is 3280 × 2464 and they are around 4.5 MB each, so the disk can quickly fill.

After a while you’ll want to assemble a movie. I wrote a shell script on my Mac in order to to pull down the images, take a copy of the ones I want and then make a movie file and upload it to Dropbox so I could look at it on the go.

#!/usr/bin/env bash
# move to the location of the images
cd /local/disk/folder2/
# pull down all images to a local folder - only new images are copied
rsync -trv /Volumes/HOMEPI/camera/ /local/disk/folder/
# overnight images are dark and less than 1.5 MB
# copy the ones we want to keep
rsync -trv --min-size=1000K /local/disk/folder/ /local/disk/folder2/
# or you could filter on size like this - delete <2MB
find . -name "*.jpg" -size -2000k -delete
# scale the images down to 480 px wide and make movie
ffmpeg -framerate 30 -pattern_type glob -i '*.jpg' -c:v libx264 -pix_fmt yuv420p -vf scale=480:-2 out.mp4
# move to dropbox
mv out.mp4 /My/Dropbox/Folder/out.mp4

This script means that I had to manually delete the pictures from the Pi once they’d been copied but that was OK. My plan is to write a script to do this for the longer running projects so that it is automated.

While it is possible to make the movies on the Pi itself, I did it on the Mac as that computer is beefier and is not busy taking pictures every 5 min! ffmpeg is a great tool for this and the documentation is impressive. For example if you have set up the camera in the wrong orientation you can do transposition in ffmpeg. If you don’t have ffmpeg, it is a simple install on the command line.

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)" /dev/null 2> /dev/null
brew install ffmpeg

Hopefully this guide is useful to you. The Pi Zero Camera can be used for streaming video as well as taking a series of still images. I’m planning to test this out soon.

The post title “Experiment Zero” comes from the title of the album by Man or Astro-Man?

Pledging My Time IV

 

The Green Leek 10.5 km run is a mixed terrain race now in its third year. Today’s was a wet and muddy edition. The chip times were posted this afternoon and using my previous code, I took a look at the results.

I was a bit disappointed with my time, which was about 24 s slower than last year. Considering that I’m running faster this year than last, I wondered whether the conditions affected my time. To look at this I quickly retrieved times for people who’d run it all three editions and looked to see if this edition was generally slower than previous editions.

Excuse the formatting of the plot. It looked pretty flat but then we’re probably only considering very small differences over 10.5 km. So I looked at the difference in time from the 2016 edition. Again the formatting is bad (23:55 is 5 minutes faster than 2016, 00:05 is 5 minutes slower).

Three people recorded much slower times this year, but the majority are within the difference from 2016 to 2017. Obviously this is just a few people that could be easily picked out using a script, more runners might reveal more of a pattern. Anyway, here’s hoping for better weather next year!

Well done to Andy Crabtree and Rachel Miller who were fastest male and female, respectively. Thanks to the organisers and volunteers.

The post title is taken from “Pledging My Time” a track from Blonde on Blonde by Bob Dylan

Multiplex: Small multiple artwork from GPX tracks

I’d seen the small multiple artwork of running and cycling routes from Marcus Volz’s R package Strava all over the web. Ads for “posters of your GPS tracks” pop up on Reddit and I’d notice a few #Rstats people put up their posters on Twitter. I’ve had the package bookmarked for a while and this week I finally got round to generating a small multiple poster of some of my cycling routes.

I was pleased with the result and wanted to post it here. But also, running the code was not straightforward as I’ll explain below. If you want to generate your own plot read on.

The idea behind the poster is really nice. You get a kind of generative art-style poster. It looks nice and you can identify individual routes which is fun to do.

The instructions on the GitHub page are absolutely correct and the code should run out-of-the-box. The idea is that you download your Strava data and then make your plots. Unfortunately, it seems that a change in Strava’s data export policy (possibly related to GDPR changes) has broken the package. I found that there are two problems. First, Strava’s “download your data” link gives you a mix of formats (in my case GPX and FIT files), the package only works with GPX. Second, if there is any elevation data missing from a track, the data frame that is needed to make the poster is not built properly.

Going GPX only: In my case, I don’t keep all my data in Strava and instead use a local repository managed with RubiTrack. This software allows me to filter for the tracks I want and export them in GPX format. The only problem is that it generates one huge file with all the tracks enclosed. This gets read by the package as a single track. To fix this, I split the file using awk.

awk '/<trk>/{close(file);n++;}{file="track"n".gpx";print >> file;}' untitled.gpx

I could then discard track1.gpx which just had the xml header and then use the directory of gpx files.

The elevation problem: this affected only some of the tracks, so in the end the R code needed to be modified. The elevation data is not needed to make the posters so the file process_data.R needs editing, line 28 can be commented out and then line 32 should read:

result <- data.frame(lat = lat, lon = lon, time = time, type = type) %<%

This issue is raised on GitHub and has been closed, but the code doesn’t work with elevation blanks. If you run into this problem, this is the way I found to fix it. The other plots in the package which do use elevation will not run, but at least the poster can be made.

I exported the poster as PDF and then made some changes in Illustrator to give the result above.

The post title comes from Multiplex from Oliver’s Standing Stone LP from 1974.

Pledging My Time III

I’ve previously crunched times for local Half and Full Marathons here on quantixed. Last weekend was the Kenilworth Half Marathon (2018) over a new course. I thought I’d have a look at the distributions of times and paces of the runners. The times are available here. If the Time and Category for finishers are saved as a csv, the script below works to generate the following plots.

Aggregated stats for the race are here. The beeswarm plot nicely shows the distribution of runners times and paces per category. There’s a bimodality to some of the age groups which is interesting. You can see from the average times that people get slower as they get older, as expected.

There was a roughly 2:1 split of M:F runners with a similar proportion in all categories. The ratio is similar for DNSers. The winning times were Andrew Savery of Leamington C A & C in MV35 with 01:12:51 and Polly Keen of Nuneaton Harriers in F sen with 01:23:46.

Congrats to everyone who ran and thanks to the organisers and all the supporters out on the course.


require(ggplot2)
require(ggbeeswarm)
file_name <- file.choose()
df1 <- read.csv(file_name, header = TRUE, stringsAsFactors = FALSE)
# aggregate M and F to a new category called Gender
df1$Gender <- ifelse(startsWith(df1$Category,"F"),"F","M")
# format Date column to POSIXct
df1$Time <- as.POSIXct(strptime(df1$Time, format = "%H:%M:%S"))
orig_var <- as.POSIXct("00:00:00", format = "%H:%M:%S")
p1 <- ggplot( data = df1, aes(x = Category,y = Time, color = Category)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  scale_y_datetime(date_labels = "%H:%M:%S", limits = c(orig_var,NA))
p1
# instead of finishing time, let's look at pace (min/km)
df1$Pace <- as.numeric(difftime(df1$Time, orig_var) / 21.1) * 3600
df1$Pace <- as.POSIXct(df1$Pace, origin = orig_var, format = "%H:%M:%S")
p2 <- ggplot( data = df1, aes(x = Category,y = Pace, color = Category)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  scale_y_datetime(date_labels = "%M:%S", limits = c(orig_var,NA))
p2
# calculate speeds rather than pace
df1$Speed <- 21.1 / as.numeric(difftime(df1$Time, orig_var))
p3 <- ggplot( data = df1, aes(x = Category, y = Speed, color = Category)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  ylim(0,NA) + ylab("Speed (km/h)")
p3
# now make the same plots but by Gender rather than Category
p4 <- ggplot( data = df1, aes(x = Gender,y = Time, color = Gender)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  scale_y_datetime(date_labels = "%H:%M:%S", limits = c(orig_var,NA))
p4
p5 <- ggplot( data = df1, aes(x = Gender,y = Pace, color = Gender)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  scale_y_datetime(date_labels = "%M:%S", limits = c(orig_var,NA))
p5
p6 <- ggplot( data = df1, aes(x = Gender, y = Speed, color = Gender)) + 
  geom_quasirandom(alpha = 0.5, stroke = 0) +
  stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
  ylim(0,NA) + ylab("Speed (km/h)")
p6
ggsave("raceTimeByCat.png", plot = p1)
ggsave("racePaceByCat.png", plot = p2)
ggsave("raceSpeedByCat.png", plot = p3)
ggsave("raceTimeByGen.png", plot = p4)
ggsave("racePaceByGen.png", plot = p5)
ggsave("raceSpeedByGen.png", plot = p6)


Edit 2018-09-12T18:52:43Z I wasn’t happy with the plots and added a few more lines to look at gender as well as category. And show speed as well as pace and finishing time.

The post title is taken from “Pledging My Time” a track from Blonde on Blonde by Bob Dylan