## All Around The World: Maps and Flags in R

Our lab is international. People born all over the world have come to work in my group. I’m proud of this fact, especially in the current political climate. I’ve previously used the GoogleMaps API to display a heat map on our lab webpage. It shows where in the world people in the lab come from. This was OK, but I wanted to get an R based solution to make this graphic to make it easier to automate updates.

This post is an explainer and “how to” guide. Code and some example data are here.

The idea is to create graphics to describe the origins of a group of people. For my use-case it is my research group, but it could be any group of people. All you need is a list of countries that the people come from.

In the example for this post, I took the Top 100 Footballers voted for by Guardian readers in 2016. In my lab dataset, I store the countries of origin in ISO2 format. This means Spain is ES, Germany is DE and so on. I converted the Guardian dataset to ISO2 format using a lookup and then I was ready to put it into the R script.

if (!require("rworldmap")) {
install.packages("rworldmap")
library(rworldmap)
}
# ggplot2, ggFlags, dplyr are needed for the bar charts
library(ggplot2)
library(dplyr)
if (!require("ggflags")) {
devtools::install_github("rensa/ggflags")
library(ggflags)
}

# csv file with each person as a row and containing a column with the header Origin and
# countries in 2-letter ISO format (change joinCode for other formats)
file_name <- file.choose()
df1 <- read.csv(file_name, header = TRUE, stringsAsFactors = FALSE)
countries_lab <- as.data.frame(table(df1$Origin)) colnames(countries_lab) <- c("country", "value") matched <- joinCountryData2Map(countries_lab, joinCode="ISO2", nameJoinColumn="country")  This part of the script sets up the libraries that are needed. We’ll use the rworldmap package first. This post was very useful for guidance. We load in the csv file which contains the countries of origin for the people we want to map out. It doesn’t need anything more than one column with the ISO2 codes. If it does it’s OK. As long as the header for the countries column is called “Origin”, all will be OK. This column is extracted and a new dataframe is made from it which has each country as a row and the number of instances of that country as a second column. These are labelled “country” and “value” for convenience. Now rworldmap does its thing with the joinCountryData2Map line. Next we make the map! # make png of the map png(file = "labWorldMap.png", width = 1024, height = 768) par(mai=c(0,0,0.2,0)) mapCountryData(matched, nameColumnToPlot="value", mapTitle= "", catMethod = "logFixedWidth", colourPalette = "heat", oceanCol="lightblue", missingCountryCol="white", addLegend = FALSE, lwd = 1) dev.off()  This makes a nice map. I’ve hidden the legend which shows what the colours mean. The map can be customised in lots of ways. I liked the way this map looked and my other aim was to make a chart to show the relative numbers of people in each country. Speaking of which… # make bar chart of lab members countries_lab %>% mutate(code = tolower(country)) %>% ggplot(aes(x = reorder(country, value), y = value)) + geom_bar(stat = "identity") + geom_flag(y = -1, aes(country = code), size = 4) + scale_y_continuous(expand = c(0.1, 1)) + xlab("Country") + ylab("Members") + theme_bw() + theme(legend.title = element_blank()) + coord_flip() ggsave("plot.png", plot = last_plot())  Using the data frame we made previously, we can pipe it to ggplot2 via the wonders of dplyr. I am using geom_flag here from the ggflags library. Note that this is a fork of ggflags which gives circular flags which look great on the graph. The geom_flag needs a lowercase entry for each ISO2 country code so the first step is to mutate the country column to make a new lowercase column called code. That’s it! With a csv file and a few lines of R code you can generate some nice looking graphics. The dataset shows that the country that produced the biggest fraction of the world’s best footballers (as voted for by Guardian readers) was Spain. There are no surprises in this dataset. The most prominent European and South American countries giving a strong showing. The post title is taken from “All Around The World” by The Dead Milkmen. Many songs in my library with this title, but this paranoid extraterrestrial tune gets the pick this time. ## Installing open source PyMol on a Mac This is a quick “how to” post. There is a licensed version of PyMol (MacPyMol) available, but the open source version can be installed on a Mac free of charge. The official page has a guide, which is not terribly detailed, and I found this excellent guide which is unfortunately out-of-date. Here is an updated guide to installing PyMol using Homebrew on macOS Mojave 10.14.3 Step 1 is to install Homebrew /usr/bin/ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

Next step is to install the PyMol dependencies using Homebrew

brew install Caskroom/cask/xquartzbrew install tcl-tkbrew install python

Now install PyMol.

brew install brewsci/bio/pymol

You can now run PyMol by typing

pymol

The MPIBR-Bioinformatics page has the following guide to make a little executable app to launch PyMol straight from the Desktop.

• Open Automator, which is in Applications in macOS.
• Create new document, select “Application”.
• Select “Actions” and “Library” in the left pane. Select “Utilities” and “Run Shell Script”. Drag this into the main pane.
• Choose “/bin/bash” as a shell.
• Paste the following: /usr/local/bin/pymol -M. If this doesn’t work, check the path to pymol using which pymol in the terminal, and use this instead.
• Save the application (“File > Save”) to the Desktop and name it “PyMol”.

Now you can double-click this app to run PyMol.

## Tips from the blog XI: docx to pdf

A long time ago I posted a little Automator routine to convert Word doc/docx files to PDF. Not long after that, this routine ceased to work due to changes in Microsoft Word (I think). It’s still very useful to convert a whole folder of docx files to PDF in order to avoid Word and just use Preview on the Mac. For committee work or for marking students’ work, I often have a whole folder of docx files and would prefer it if they were in PDF format. I found this very nifty trick on the web and thought I’d post a link here to make up for the fact that my old post no longer works.

The full post is here. What is so nice about this Automator solution is that it uses a bash script to do the conversion. This means you don’t need Microsoft Word for it to work! From what I can see it uses the xml in the docx file (and presumably won’t work on older *.doc files) for the conversion. The post describes how to run it as a Service in macOS. Note, that it destroys the docx files, so it should only be used on a copy. It could be run from the command line rather than right-clink, the engine is this little script.

Thanks to Jacob Salmela for posting it.

This post is part of a series of short tips

## Multiplex: Small multiple artwork from GPX tracks

I’d seen the small multiple artwork of running and cycling routes from Marcus Volz’s R package Strava all over the web. Ads for “posters of your GPS tracks” pop up on Reddit and I’d notice a few #Rstats people put up their posters on Twitter. I’ve had the package bookmarked for a while and this week I finally got round to generating a small multiple poster of some of my cycling routes.

I was pleased with the result and wanted to post it here. But also, running the code was not straightforward as I’ll explain below. If you want to generate your own plot read on.

The idea behind the poster is really nice. You get a kind of generative art-style poster. It looks nice and you can identify individual routes which is fun to do.

The instructions on the GitHub page are absolutely correct and the code should run out-of-the-box. The idea is that you download your Strava data and then make your plots. Unfortunately, it seems that a change in Strava’s data export policy (possibly related to GDPR changes) has broken the package. I found that there are two problems. First, Strava’s “download your data” link gives you a mix of formats (in my case GPX and FIT files), the package only works with GPX. Second, if there is any elevation data missing from a track, the data frame that is needed to make the poster is not built properly.

Going GPX only: In my case, I don’t keep all my data in Strava and instead use a local repository managed with RubiTrack. This software allows me to filter for the tracks I want and export them in GPX format. The only problem is that it generates one huge file with all the tracks enclosed. This gets read by the package as a single track. To fix this, I split the file using awk.

awk '/<trk>/{close(file);n++;}{file="track"n".gpx";print >> file;}' untitled.gpx


I could then discard track1.gpx which just had the xml header and then use the directory of gpx files.

The elevation problem: this affected only some of the tracks, so in the end the R code needed to be modified. The elevation data is not needed to make the posters so the file process_data.R needs editing, line 28 can be commented out and then line 32 should read:

result <- data.frame(lat = lat, lon = lon, time = time, type = type) %<%

This issue is raised on GitHub and has been closed, but the code doesn’t work with elevation blanks. If you run into this problem, this is the way I found to fix it. The other plots in the package which do use elevation will not run, but at least the poster can be made.

I exported the poster as PDF and then made some changes in Illustrator to give the result above.

The post title comes from Multiplex from Oliver’s Standing Stone LP from 1974.

## Adventures in Code VI: debugging and silly mistakes

This deserved a bit of further explanation, due to the stupidity involved.

“Debugging is like being the detective in a crime movie where you are also the murderer.” – Filipe Fortes

My code was giving an unexpected result and I was having a hard time figuring out the problem. The unexpected result was that a resampled set of 2D coordinates were not being rotated randomly. I was fortunate to be able to see this otherwise I would have never found this bug and probably would’ve propagated the error to other code projects.

I narrowed down the cause but ended up having to write some short code to check that it really did cause the error.

I was making a rotation matrix and then using it to rotate a 2D coordinate set by matrix multiplication. The angle was randomised in the loop. What could go wrong? I looked at that this:

theta = pi*enoise(1)
rotMat = {{cos(theta),-sin(theta)},{sin(theta),cos(theta)}}

and thought “two lines – pah – it can be done in one!”. Since the rotation matrix is four numbers [-1,1], I thought “I’ll just pick those numbers at random, I just want a random angle don’t I?”

rotMat = enoise(1)

Why doesn’t an alarm go off when this happens? A flashing sign saying “are you sure about that?”…

My checks showed that a single point at 1,0 after matrix multiplication with this method gives.

When it should give

And it’s so obvious when you’ve seen why. The four numbers in the rotation matrix are, of course, not independent.

I won’t make that mistake again and I’m going to try to think twice when trying to save a line of code like in the future!

Part of a series on computers and coding.

## Frankly, Mr. Shankly

I read about Antonio Sánchez Chinchón’s clever approach to use the Travelling Salesperson algorithm to generate some math-art in R. The follow up was even nicer in my opinion, Pencil Scribbles. The subject was Boris Karloff as the monster in Frankenstein. I was interested in running the code (available here and here), so I thought I’d run it on a famous scientist.

By happy chance one of the most famous scientists of the 20th Century, Rosalind Franklin, shares a nominative prefix with the original subject. There is also a famous portrait of her that I thought would work well.

I first needed needed to clear up the background because it was too dark.

Now to run the TSP code.

The pencil scribbles version is nicer I think.

The R scripts basically ran out-of-the-box. I was using a new computer that didn’t have X11quartz on it nor the packages required, but once that they were installed I just needed to edit the line to use a local file in my working directory. The code just ran. The outputs FrankyTSP and Franky_scribbles didn’t even need to be renamed, given my subject’s name.

Thanks to Antonio for making the code available and so easy to use.

The post title comes from “Frankly, Mr. Shankly” by The Smiths which appears on The Queen is Dead. If the choice of post title needs an explanation, it wasn’t a good choice…

## Paintball’s Coming Home: generating Damien Hirst spot paintings

A few days ago, I read an article about Damien Hirst’s new spot paintings. I’d forgotten how regular the spots were in the original spot paintings from the 1990s (examples are on his page here). It made me think that these paintings could be randomly generated and so I wrote a quick piece of code to do this (HirstGenerator).

I used Hirst’s painting ‘Abalone Acetone Powder’ (1991), which is shown on this page as photographed by Alex Hartley. A wrote some code to sample the colours of this image and then a script to replicate it. The original is shown below  © Damien Hirst and Science Ltd. Click them for full size.

and then this is the replica:

Now that I had a palette of the colours used in the original. It was simple to write a generator to make spot paintings where the spots are randomly assigned.

The generator can make canvasses at whatever size is required.

The code can be repurposed to make spot paintings with different palettes from his other spot paintings or from something else. So there you have it. Generative Hirst Spot Paintings.

For nerds only

My original idea was to generate a palette of unique colours from the original painting. Because of the way I sampled them, each spot is represented once in the palette. This means the same colour as used by the artist is represented as several very similar but nonidentical colours in the palette. My original plan was to find the euclidean distances between all spots in RGB colour space and to establish a distance cutoff to decide what is a unique colour.

That part was easy to write but what value to give for the cutoff was tricky. After some reading, it seems that other colour spaces are better suited for this task, e.g. converting RGB to a CIE colour space. For two reasons, I didn’t pursue this. First, quantixed coding is time-limited. Second. assuming that there is something to the composition of these spot paintings (and they are not a con trick) the frequency of spots must have artistic merit and so they should be left in the palette for sampling in the generated pictures. The representation of the palette in RGB colour space had an interesting pattern (shown in the GIF above).

The post title comes from “Paintball’s Coming Home” by Half Man Half Biscuit from Voyage To The Bottom Of The Road. Spot paintings are kind of paintballs, but mostly because I love the title of this song.

## Esoteric Circle

Many projects in the lab involve quantifying circular objects. Microtubules, vesicles and so on are approximately circular in cross section. This quick post is about how to find the diameter of these objects using a computer.

So how do you measure the diameter of an object that is approximately circular? Well, if it was circular you would measure the distance from one edge to the other, crossing the centre of the object. It doesn’t matter along which axis you do this. However, since these objects are only approximately circular, it matters along which axis you measure. There are a couple of approaches that can be used to solve this problem.

Principal component analysis

The object is a collection of points* and we can find the eigenvectors and eigenvalues of these points using principal component analysis. This was discussed previously here. The 1st eigenvector points along the direction of greatest variance and the 2nd eigenvector is normal to the first. The order of eigenvectors is determined by their eigenvalues. We use these to rotate the coordinate set and offset to the origin.

Now the major axis of the object is aligned to the x-axis at y=0 and the minor axis is aligned with the y-axis at x=0 (compare the plot on the right with the one on the left, where the profiles are in their original orientation – offset to zero). We can then find the absolute values of the axis crossing points and when added together these represent the major axis and minor axis of the object. In Igor, this is done using a oneliner to retrieve a rotated set of coords as the wave M_R.

PCA/ALL/SEVC/SRMT/SCMT xCoord,yCoord

To find the crossing points, I use Igor’s interpolation-based level crossing functions. For example, storing the aggregated diameter in a variable called len.

FindLevel/Q/EDGE=1/P m1c0, 0
len = abs(m1c1(V_LevelX))
FindLevel/Q/EDGE=2/P m1c0, 0
len += abs(m1c1(V_LevelX))

This is just to find one axis (where m1c0 and m1c1 are the 1st and 2nd columns of a 2-column wave m1) and so you can see it is a bit cumbersome.

Anyway, I was quite happy with this solution. It is unbiased and also tells us how approximately circular the object is (because the major and minor axes tell us the aspect ratio or ellipticity of the object). I used it in Figure 2 of this paper to show the sizes of the coated vesicles. However, in another project we wanted to state what the diameter of a vesicle was. Not two numbers, just one. How do we do that? We could take the average of the major and minor axes, but maybe there’s an easier way.

Polar coordinates

The distance from the centre to every point on the edge of the object can be found easily by converting the xy coordinates to polar coordinates. To do this, we first find the centre of the object. This is the centroid $$(\bar{x},\bar{y})$$ represented by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}$$ and $$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_{i}$$

for n points and subtract this centroid from all points to arrange the object around the origin. Now, since the xy coords are represented in polar system by

$$x_{i} = r_{i}\cos(\phi)$$ and $$y_{i} = r_{i}\sin(\phi)$$

we can find r, the radial distance, using

$$r_{i} = \sqrt{x_{i}^{2} + y_{i}^{2}}$$

With those values we can then find the average radial distance and report that.

There’s something slightly circular (pardon the pun) about this method because all we are doing is minimising the distance to a central point initially and then measuring the average distance to this minimised point in the latter step. It is much faster than the PCA approach and would be insensitive to changes in point density around the object. The two methods would probably diverge for noisy images. Again in Igor this is simple:

Make/O/N=(dimsize(m1,0)-1)/FREE rW

rW[] = sqrt(m1[p][0]^2 + m1[p][1]^2)

len = 2 * mean(rW)
Here again, m1 is the 2-column wave of coords and the diameter of the object is stored in len.

How does this compare with the method above? The answer is kind of obvious, but it is equidistant between the major and minor axes. Major axis is shown in red and minor axis shown in blue compared with the mean radial distance method (plotted on the y-axis). In places there is nearly a 10 nm difference which is considerable for objects which are between 20 and 35 nm in diameter. How close is it to the average of the major and minor axis? Those points are in black and they are very close but not exactly on y=x.

So for simple, approximately circular objects with low noise, the ridiculously simple polar method gives us a single estimate of the diameter of the object and this is much faster than the more complex methods above. For more complicated shapes and noisy images, my feeling is that the PCA approach would be more robust. The two methods actually tell us two subtly different things about the shapes.

Why don’t you just measure them by hand?

In case there is anyone out there wondering why a computer is used for this rather than a human wielding the line tool in ImageJ… there are two good reasons.

1. There are just too many! Each image has tens of profiles and we have hundreds of images from several experiments.
2. How would you measure the profile manually? This approach shows two unbiased methods that don’t rely on a human to draw any line across the object.

* = I am assuming that the point set is already created.

The post title is taken from “Esoteric Circle” by Jan Garbarek from the LP of the same name released in 1969. The title fits well since this post is definitely esoteric. But maybe someone out there is interested!

## The Sound of Clouds: wordcloud of tweets using R

Another post using R and looking at Twitter data.

As I was typing out a tweet, I had the feeling that my vocabulary is a bit limited. Papers I tweet about are either “great”, “awesome” or “interesting”. I wondered what my most frequently tweeted words are.

Like the last post you can (probably) do what I’ll describe online somewhere, but why would you want to do that when you can DIY in R?

First, I requested my tweets from Twitter. I wasn’t sure of the limits of rtweet for retrieving tweets and the request only takes a few minutes. This gives you a download of everything including a csv of all your tweets. The text of those tweets is in a column called ‘text’.


## for text mining
library(tm)
## for building a corpus
library(SnowballC)
## for making wordclouds
library(wordcloud)
tweets <- read.csv('tweets.csv', stringsAsFactors = FALSE)
## make a corpus of the text of the tweets
tCorpus <- Corpus(VectorSource(tweets\$text))
## remove all the punctation from tweets
tCorpus <- tm_map(tCorpus, removePunctuation)
## good idea to remove stopwords: high frequency words such as I, me and so on
tCorpus <- tm_map(tCorpus, removeWords, stopwords('english'))
## next step is to stem the words. Means that talking and talked become talk
tCorpus <- tm_map(tCorpus, stemDocument)
## now display your wordcloud
wordcloud(tCorpus, max.words = 100, random.order = FALSE)



For my @clathrin account this gave:

So my most tweeted word is paper, followed by cell and lab. I’m quite happy about that. I noticed that great is also high frequency, which I had a feeling would also be the case. It looks like @christlet, @davidsbristol, @jwoodgett and @cshperspectives are among my frequent twitterings, this is probably a function of the length of time we’ve been using twitter. The cloud was generated using 10.9K tweets over seven years, it might be interesting to look at any changes over this time…

The cloud is a bit rough and ready. Further filtering would be a good idea, but this quick exercise just took a few minutes.

The post title comes from “The Sound of Clouds” by The Posies from their Solid States LP.

## The Second Arrangement

To validate our analyses, I’ve been using randomisation to show that the results we see would not arise due to chance. For example, the location of pixels in an image can be randomised and the analysis rerun to see if – for example – there is still colocalisation. A recent task meant randomising live cell movies in the time dimension, where two channels were being correlated with one another. In exploring how to do this automatically, I learned a few new things about permutations.

Here is the problem: If we have two channels (fluorophores), we can test for colocalisation or cross-correlation and get a result. Now, how likely is it that this was due to chance? So we want to re-arrange the frames of one channel relative to the other such that frame i of channel 1 is never paired with frame i of channel 2. This is because we want all pairs to be different to the original pairing. It was straightforward to program this, but I became interested in the maths behind it.

The maths: Rearranging n objects is known as permutation, but the problem described above is known as Derangement. The number of permutations of n frames is n!, but we need to exclude cases where the ith member stays in the ith position. It turns out that to do this, you need to use the principle of inclusion and exclusion. If you are interested, the solution boils down to

$$n!\sum_{k=0}^{n}\frac{(-1)^k}{k!}$$

Which basically means: for n frames, there are n! number of permutations, but you need to subtract and add diminishing numbers of different permutations to get to the result. Full description is given in the wikipedia link. Details of inclusion and exclusion are here.

I had got as far as figuring out that the ratio of permutations to derangements converges to e. However,  you can tell that I am not a mathematician as I used brute force calculation to get there rather than write out the solution. Anyway, what this means in a computing sense, is that if you do one permutation, you might get a unique combination, with two you’re very likely to get it, and by three you’ll certainly have it.

Back to the problem at hand. It occurred to me that not only do we not want frame i of channel 1 paired with frame i of channel 2 but actually it would be preferable to exclude frames i ± 2, let’s say. Because if two vesicles are in the same location at frame i they may also be colocalised at frame i-1 for example. This is more complex to write down because for frames 1 and 2 and frames n and n-1, there are fewer possibilities for exclusion than for all other frames. For all other frames there are n-5 legal positions. This obviously sets a lower limit for the number of frames capable of being permuted.

The answer to this problem is solved by rook polynomials. You can think of the original positions of frames as columns on a n x n chess board. The rows are the frames that need rearranging, excluded positions are coloured in. Now the permutations can be thought of as Rooks in a chess game (they can move horizontally or vertically but not diagonally). We need to work out how many arrangements of Rooks are possible such that there is one rook per row and such that no Rook can take another.

If we have an 7 frame movie, we have a 7 x 7 board looking like this (left). The “illegal” squares are coloured in. Frame 1 must go in position D,E,F or G, but then frame 2 can only go in E, F or G. If a rook is at E1, then we cannot have a rook at E2. And so on.

To calculate the derangements:

$$1 + 29 x + 310 x^2 + 1544 x^3 + 3732 x^4 + 4136 x^5 + 1756 x^6 + 172 x^7$$

This is a polynomial expansion of this expression:

$$R_{m,n}(x) = n!x^nL_n^{m-n}(-x^{-1})$$

where $$L_n^\alpha(x)$$ is an associated Laguerre polynomial. The solution in this case is 8 possibilities. From 7! = 5040 permutations. Of course our movies have many more frames and so the randomisation is not so limited. In this example, frame 4 can only either go in position A or G.

Why is this important? The way that the randomisation is done is: the frames get randomised and then checked to see if any “illegal” positions have been detected. If so, do it again. When no illegal positions are detected, shuffle the movie accordingly. In the first case, the computation time per frame is constant, whereas in the second case it could take much longer (because there will be more rejections). In the case of 7 frames, with the restriction of no frames at i ±2, then the failure rate is 5032/5040 = 99.8%. Depending on how the code is written, this can cause some (potentially lengthy) wait time. Luckily, the failure rate comes down with more frames.

What about it practice? The numbers involved in directly calculating the permutations and exclusions quickly becomes too big using non-optimised code on a simple desktop setup (a 12 x 12 board exceeds 20 GB). The numbers and rates don’t mean much, what I wanted to know was whether this slows down my code in a real test. To look at this I ran 100 repetitions of permutations of movies with 10-1000 frames. Whereas with the simple derangement problem permutations needed to be run once or twice, with greater restrictions, this means eight or nine times before a “correct” solution is found. The code can be written in a way that means that this calculation is done on a placeholder wave rather than the real data and then applied to the data afterwards. This reduces computation time. For movies of around 300 frames, the total run time of my code (which does quite a few things besides this) is around 3 minutes, and I can live with that.

So, applying this more stringent exclusion will work for long movies and the wait times are not too bad. I learned something about combinatorics along the way. Thanks for reading!

Further notes

The first derangement issue I mentioned is also referred to as the hat-check problem. Which refers to people (numbered 1,2,3 … n) with corresponding hats (labelled 1,2,3 … n). How many ways can they be given the hats at random such that they do not get their own hat?

Adding i+1 as an illegal position is known as problème des ménages. This is a problem of how to seat married couples so that they sit in a man-woman arrangement without being seated next to their partner. Perhaps i ±2 should be known as the vesicle problem?

The post title comes from “The Second Arrangement” by Steely Dan. An unreleased track recorded for the Gaucho sessions.