## Rollercoaster II: more on Google Scholar citations

I’ve previously written about Google Scholar: its usefulness and its instability. I just read a post by Jon Tennant on how to harvest Google Scholar data in R, and I thought I would use his code as the basis to generate some nice plots.

A script for R is below and can be found here. Graphics are base R but do the job.

First of all I took it for a spin on my own data. The outputs are shown here:

These were the most interesting plots that sprang to mind. First is a ranked citation plot, which also shows y=x to find the Hirsch number. Second is total citations per year to all papers over time; Google Scholar shows the last few years of this plot on the profile page. Third, older papers accrue more citations, but how does this look across all papers? Finally, a prediction of what my H-index will do over time (no prizes for guessing that it will go up!). As Jon noted, the calculation comes from this paper.
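As a taster of how little code this needs, the h-index falls out of the ranked citation data in two lines. This is a minimal sketch, assuming pubs has been retrieved with get_publications() as in the script below:

## h is the largest rank at which citations >= rank
cites <- sort(pubs$cites, decreasing = TRUE)
h <- sum(cites >= seq_along(cites))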

While that’s interesting, we need the data of a scholar with a huge number of papers and citations. Here is George Church.

At the time of writing he has 763 papers with over 90,000 citations in total and an H-index of 147. Interestingly, ~10% of his total citations come from a monster paper in PNAS with Wally Gilbert in the mid-80s on genome sequencing.
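That share is easy to verify from the same pubs data frame (again a one-line sketch, not part of the script below):

## fraction of all citations contributed by the most-cited paper
max(pubs$cites, na.rm = TRUE) / sum(pubs$cites, na.rm = TRUE)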

Feel free to grab/fork this code and have a play yourself. If you have other ideas for plots or calculations, add a comment here or an issue at GitHub.

if(!require(scholar)){
  install.packages("scholar")
}
library(scholar)
# Add Google Scholar ID of interest here
ID <- ""
# If you didn't add one to the script, prompt user to add one
if(ID == ""){
  ID <- readline(prompt = "Enter Scholar ID: ")
}
# Get the citation history
citeByYear <- get_citation_history(ID)
# Get profile information
profile <- get_profile(ID)
# Get publications and save as a csv
pubs <- get_publications(ID)
write.csv(pubs, file = "citations.csv")
# Predict h-index
hIndex <- predict_h_index(ID)
# Now make some plots
# Plot of total citations by year
png(file = "citationsByYear.png")
plot(citeByYear$year, citeByYear$cites,
     type = "h", xlab = "Year", ylab = "Total Cites")
dev.off()
# Plot of ranked paper by citation with h
png(file = "citationsAndH.png")
plot(pubs$cites, type = "l", xlab = "Paper rank", ylab = "Citations per paper")
abline(0, 1)
text(nrow(pubs), max(pubs$cites, na.rm = TRUE), profile$h_index)
dev.off()
# Plot of cites to paper by year
png(file = "citesByYear.png")
plot(pubs$year, pubs$cites, xlab = "Year", ylab = "Citations per paper")
dev.off()
# Plot of h-index prediction
thisYear <- as.integer(format(Sys.Date(), "%Y"))
png(file = "hPred.png")
plot(hIndex$years_ahead + thisYear, hIndex$h_index,
     ylim = c(0, max(hIndex$h_index, na.rm = TRUE)),
     type = "h", xlab = "Year", ylab = "H-index prediction")
dev.off()

Note that my previous code used a python script to grab Google Scholar data. While that script worked well, the scholar package for R seems a lot more reliable.

I have a surprising number of tracks in my library with Rollercoaster in the title. This time I will go with the Jesus & Mary Chain track from Honey’s Dead.

## I’m not following you II: Twitter data and R

My activity on twitter revolves around four accounts.

I try to segregate what happens on each account, although there’s inevitably some overlap. But what about overlap in followers?

What lucky people are following all four? How many only see the individual accounts?

It’s quite easy to look at this in R.

So there are 36 lucky people (or bots!) following all four accounts. I was interested in the followers of the quantixed account, since it seems to attract people from a slightly different sphere. About one-third of quantixed followers only follow quantixed, about one-third also follow clathrin, and more or less all of the remainder are “all in”, following three accounts or all four. CMCB followers are split about the same. The lab account is a bit different, with close to one-half of its followers also following clathrin.

Extra nerd points:

This is a Venn diagram and not an Euler plot. A Venn diagram just shows the intersections schematically and does not attempt to encode information in the area of each part. Euler plots for more than three groups are hard to generate and hard to make sense of. Looking at proportions across lots of groups is a dataviz problem. A solution here would be to generate a further four Venn diagrams and, on each, display the proportions for one category as fractions or percentages (a quick version of this calculation is sketched below).
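As a sketch of that calculation, here is the fraction of quantixed followers who follow only quantixed, reusing the *_names data frames created by the script in the next section (the snippet itself is mine, not part of the original code):

## followers of any of the other three accounts
others <- union(clathrin_followers_names$user_id,
                union(cmcb_followers_names$user_id,
                      roylelab_followers_names$user_id))
## quantixed-only followers, as a fraction of all quantixed followers
b_only <- setdiff(quantixed_followers_names$user_id, others)
length(b_only) / nrow(quantixed_followers_names)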

How to do it:

Last time, I described how to set up rtweet and make a Twitter app for use in R. You can use this to pull down lists of followers and extract their data. Using the intersect function you can work out the numbers of followers at each intersection. For four accounts, there will be 1 group of four, 4 groups of three and 6 groups of two. The VennDiagram package just needs the total numbers for all four groups and then details of the intersections, i.e. you don’t need to work out the groups minus their intersections; it does this for you.

library(rtweet)
library(httpuv)
library(VennDiagram)
## whatever name you assigned to your created app
appname <- "whatever_name"
## api key (example below is not a real key)
key <- "blah614h"
## api secret (example below is not a real key)
secret <- "blah614h"
## create token named "twitter_token"
twitter_token <- create_token(
  app = appname,
  consumer_key = key,
  consumer_secret = secret)
clathrin_followers <- get_followers("clathrin", n = "all")
clathrin_followers_names <- lookup_users(clathrin_followers)
quantixed_followers <- get_followers("quantixed", n = "all")
quantixed_followers_names <- lookup_users(quantixed_followers)
cmcb_followers <- get_followers("Warwick_CMCB", n = "all")
cmcb_followers_names <- lookup_users(cmcb_followers)
roylelab_followers <- get_followers("roylelab", n = "all")
roylelab_followers_names <- lookup_users(roylelab_followers)
# a = clathrin
# b = quantixed
# c = cmcb
# d = roylelab
## now work out intersections
anb <- intersect(clathrin_followers_names$user_id,quantixed_followers_names$user_id)
anc <- intersect(clathrin_followers_names$user_id,cmcb_followers_names$user_id)
and <- intersect(clathrin_followers_names$user_id,roylelab_followers_names$user_id)
bnc <- intersect(quantixed_followers_names$user_id,cmcb_followers_names$user_id)
bnd <- intersect(quantixed_followers_names$user_id,roylelab_followers_names$user_id)
cnd <- intersect(cmcb_followers_names$user_id,roylelab_followers_names$user_id)
anbnc <- intersect(anb,cmcb_followers_names$user_id)
anbnd <- intersect(anb,roylelab_followers_names$user_id)
ancnd <- intersect(anc,roylelab_followers_names$user_id)
bncnd <- intersect(bnc,roylelab_followers_names$user_id)
anbncnd <- intersect(anbnc,roylelab_followers_names$user_id)
## four-set Venn diagram
venn.plot <- draw.quad.venn(
  area1 = nrow(clathrin_followers_names),
  area2 = nrow(quantixed_followers_names),
  area3 = nrow(cmcb_followers_names),
  area4 = nrow(roylelab_followers_names),
  n12 = length(anb),
  n13 = length(anc),
  n14 = length(and),
  n23 = length(bnc),
  n24 = length(bnd),
  n34 = length(cnd),
  n123 = length(anbnc),
  n124 = length(anbnd),
  n134 = length(ancnd),
  n234 = length(bncnd),
  n1234 = length(anbncnd),
  category = c("Clathrin", "quantixed", "CMCB", "RoyleLab"),
  fill = c("dodgerblue1", "red", "goldenrod1", "green"),
  lty = "dashed",
  cex = 2,
  cat.cex = 1.5,
  cat.col = c("dodgerblue1", "red", "goldenrod1", "green"),
  fontfamily = "Helvetica",
  cat.fontfamily = "Helvetica")
# write to file
png(filename = "Quad_Venn_diagram.png")
grid.draw(venn.plot)
dev.off()

I’ll probably return to rtweet in future and will recycle the title if I do. Like last time, the post title is from “I’m Not Following You”, the final track from the 1997 LP of the same name by Edwyn Collins.

## Frankly, Mr. Shankly

I read about Antonio Sánchez Chinchón’s clever approach of using the Travelling Salesperson algorithm to generate math-art in R. The follow-up, Pencil Scribbles, was even nicer in my opinion. The subject was Boris Karloff as the monster in Frankenstein. I was interested in running the code (available here and here), so I thought I’d run it on a famous scientist. By happy chance, one of the most famous scientists of the 20th Century, Rosalind Franklin, shares a nominative prefix with the original subject. There is also a famous portrait of her that I thought would work well.

I first needed to clear up the background because it was too dark.

Now to run the TSP code. The pencil scribbles version is nicer, I think.

The R scripts basically ran out of the box. I was using a new computer that didn’t have XQuartz on it, nor the packages required, but once they were installed I just needed to edit the line to use a local file in my working directory. The code just ran. The outputs, FrankyTSP and Franky_scribbles, didn’t even need to be renamed, given my subject’s name.

Thanks to Antonio for making the code available and so easy to use.

The post title comes from “Frankly, Mr. Shankly” by The Smiths, which appears on The Queen is Dead. If the choice of post title needs an explanation, it wasn’t a good choice…

## Paintball’s Coming Home: generating Damien Hirst spot paintings

A few days ago, I read an article about Damien Hirst’s new spot paintings. I’d forgotten how regular the spots were in the original spot paintings from the 1990s (examples are on his page here). It made me think that these paintings could be randomly generated, so I wrote a quick piece of code to do this (HirstGenerator).

I used Hirst’s painting ‘Abalone Acetone Powder’ (1991), which is shown on this page as photographed by Alex Hartley. I wrote some code to sample the colours of this image and then a script to replicate it. The original is shown below © Damien Hirst and Science Ltd. Click them for full size.

and then this is the replica:

Now that I had a palette of the colours used in the original, it was simple to write a generator to make spot paintings where the spots are randomly assigned. The generator can make canvasses at whatever size is required. The code can be repurposed to make spot paintings with different palettes from his other spot paintings or from something else.

So there you have it. Generative Hirst Spot Paintings.
For nerds only

My original idea was to generate a palette of unique colours from the original painting. Because of the way I sampled them, each spot is represented once in the palette. This means the same colour used by the artist appears as several very similar but non-identical colours in the palette. My original plan was to find the Euclidean distances between all spots in RGB colour space and to establish a distance cutoff to decide what counts as a unique colour.

That part was easy to write, but what value to give for the cutoff was tricky. After some reading, it seems that other colour spaces are better suited to this task, e.g. converting RGB to a CIE colour space. For two reasons, I didn’t pursue this. First, quantixed coding is time-limited. Second, assuming that there is something to the composition of these spot paintings (and they are not a con trick), the frequency of spots must have artistic merit and so they should be left in the palette for sampling in the generated pictures. The representation of the palette in RGB colour space had an interesting pattern (shown in the GIF above).

The post title comes from “Paintball’s Coming Home” by Half Man Half Biscuit from Voyage To The Bottom Of The Road. Spot paintings are kind of like paintballs, but mostly I just love the title of this song.

## Esoteric Circle

Many projects in the lab involve quantifying circular objects. Microtubules, vesicles and so on are approximately circular in cross section. This quick post is about how to find the diameter of these objects using a computer.

So how do you measure the diameter of an object that is approximately circular? Well, if it were circular you would measure the distance from one edge to the other, crossing the centre of the object, and it wouldn’t matter along which axis you did this. However, since these objects are only approximately circular, it matters along which axis you measure. There are a couple of approaches that can be used to solve this problem.

Principal component analysis

The object is a collection of points* and we can find the eigenvectors and eigenvalues of these points using principal component analysis. This was discussed previously here. The 1st eigenvector points along the direction of greatest variance and the 2nd eigenvector is normal to the first; the order of eigenvectors is determined by their eigenvalues. We use these to rotate the coordinate set and offset it to the origin. Now the major axis of the object is aligned to the x-axis at y=0 and the minor axis is aligned with the y-axis at x=0 (compare the plot on the right with the one on the left, where the profiles are in their original orientation, offset to zero). We can then find the absolute values of the axis crossing points, and added together these represent the major axis and minor axis of the object.

In Igor, retrieving a rotated set of coords as the wave M_R is a one-liner:

PCA/ALL/SEVC/SRMT/SCMT xCoord,yCoord

To find the crossing points, I use Igor’s interpolation-based level crossing functions. For example, storing the aggregated diameter in a variable called len:

FindLevel/Q/EDGE=1/P m1c0, 0
len = abs(m1c1(V_LevelX))
FindLevel/Q/EDGE=2/P m1c0, 0
len += abs(m1c1(V_LevelX))

This finds just one axis (where m1c0 and m1c1 are the 1st and 2nd columns of a 2-column wave m1), so you can see it is a bit cumbersome. Anyway, I was quite happy with this solution.
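For R users, the same idea can be sketched with prcomp. This is a rough approximation rather than my actual code: it takes the full extent along each principal axis instead of the level crossings, and m1 stands in for a 2-column matrix of edge coordinates:

pts <- scale(m1, scale = FALSE)  # centre the coordinates on the origin
rot <- prcomp(pts)$x             # coordinates rotated onto the eigenvectors
major <- diff(range(rot[, 1]))   # extent along the 1st eigenvector
minor <- diff(range(rot[, 2]))   # extent along the 2nd eigenvector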
The PCA approach is unbiased and also tells us how approximately circular the object is (because the major and minor axes give the aspect ratio or ellipticity of the object). I used it in Figure 2 of this paper to show the sizes of the coated vesicles. However, in another project we wanted to state what the diameter of a vesicle was. Not two numbers, just one. How do we do that? We could take the average of the major and minor axes, but maybe there’s an easier way.

Polar coordinates

The distance from the centre to every point on the edge of the object can be found easily by converting the xy coordinates to polar coordinates. To do this, we first find the centre of the object. This is the centroid $$(\bar{x},\bar{y})$$ given by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i} \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_{i}$$

for n points. We subtract this centroid from all points to centre the object on the origin. Now, since the xy coords are represented in the polar system by

$$x_{i} = r_{i}\cos(\phi_{i}) \qquad y_{i} = r_{i}\sin(\phi_{i})$$

we can find the radial distance r using

$$r_{i} = \sqrt{x_{i}^{2} + y_{i}^{2}}$$

With those values we can then find the average radial distance and report that.

There’s something slightly circular (pardon the pun) about this method, because all we are doing is minimising the distance to a central point initially and then measuring the average distance to this minimised point in the latter step. It is much faster than the PCA approach and would be insensitive to changes in point density around the object. The two methods would probably diverge for noisy images.

Again, in Igor this is simple:

Make/O/N=(dimsize(m1,0)-1)/FREE rW
rW[] = sqrt(m1[p][0]^2 + m1[p][1]^2)
len = 2 * mean(rW)

Here again, m1 is the 2-column wave of coords and the diameter of the object is stored in len.

How does this compare with the method above? The answer is kind of obvious: it lies between the major and minor axes. The major axis is shown in red and the minor axis in blue, compared with the mean radial distance method (plotted on the y-axis). In places there is nearly a 10 nm difference, which is considerable for objects that are between 20 and 35 nm in diameter. How close is it to the average of the major and minor axes? Those points are in black and they are very close, but not exactly on y=x.

So for simple, approximately circular objects with low noise, the ridiculously simple polar method gives us a single estimate of the diameter of the object and is much faster than the more complex method above. For more complicated shapes and noisy images, my feeling is that the PCA approach would be more robust. The two methods actually tell us two subtly different things about the shapes.

Why don’t you just measure them by hand?

In case there is anyone out there wondering why a computer is used for this rather than a human wielding the line tool in ImageJ… there are two good reasons.

1. There are just too many! Each image has tens of profiles and we have hundreds of images from several experiments.
2. How would you measure the profile manually? This approach gives two unbiased methods that don’t rely on a human to draw a line across the object.

* = I am assuming that the point set is already created.

The post title is taken from “Esoteric Circle” by Jan Garbarek from the LP of the same name, released in 1969. The title fits well since this post is definitely esoteric. But maybe someone out there is interested!
## The Sound of Clouds: wordcloud of tweets using R

Another post using R and looking at Twitter data.

As I was typing out a tweet, I had the feeling that my vocabulary is a bit limited. Papers I tweet about are either “great”, “awesome” or “interesting”. I wondered what my most frequently tweeted words are.

Like the last post, you can (probably) do what I’ll describe online somewhere, but why would you want to do that when you can DIY in R?

First, I requested my tweets from Twitter. I wasn’t sure of the limits of rtweet for retrieving tweets, and the request only takes a few minutes. This gives you a download of everything, including a csv of all your tweets. The text of those tweets is in a column called ‘text’.

## for text mining
library(tm)
## for stemming words
library(SnowballC)
## for making wordclouds
library(wordcloud)
## read in your tweets
tweets <- read.csv('tweets.csv', stringsAsFactors = FALSE)
## make a corpus of the text of the tweets
tCorpus <- Corpus(VectorSource(tweets$text))
## remove all the punctuation from tweets
tCorpus <- tm_map(tCorpus, removePunctuation)
## good idea to remove stopwords: high frequency words such as I, me and so on
tCorpus <- tm_map(tCorpus, removeWords, stopwords('english'))
## next step is to stem the words, so that talking and talked become talk
tCorpus <- tm_map(tCorpus, stemDocument)
## now display your wordcloud
wordcloud(tCorpus, max.words = 100, random.order = FALSE)

For my @clathrin account this gave:

So my most tweeted word is paper, followed by cell and lab. I’m quite happy about that. I noticed that great is also high frequency, which I had a feeling would be the case. It looks like @christlet, @davidsbristol, @jwoodgett and @cshperspectives are among my frequent twitterings; this is probably a function of the length of time we’ve been using twitter. The cloud was generated from 10.9K tweets over seven years. It might be interesting to look at any changes over this time…

The cloud is a bit rough and ready. Further filtering would be a good idea, but this quick exercise just took a few minutes.
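If I were to take it further, a possible extra cleaning pass might look like this. This is an untested sketch; stripURLs is a little helper I made up for the purpose, not part of tm:

## lowercase everything and strip numbers
tCorpus <- tm_map(tCorpus, content_transformer(tolower))
tCorpus <- tm_map(tCorpus, removeNumbers)
## remove URLs with a custom helper
stripURLs <- function(x) gsub("http\\S+", "", x)
tCorpus <- tm_map(tCorpus, content_transformer(stripURLs))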

The post title comes from “The Sound of Clouds” by The Posies from their Solid States LP.

## I’m not following you: Twitter data and R

I wondered how many of the people that I follow on Twitter do not follow me back. A quick way to look at this is with R. OK, a really quick way is to give a 3rd party application access rights to your account to do this for you, but a) that isn’t safe, b) you can’t look at anyone else’s data, and c) this is quantixed – doing nerdy stuff like this is what I do. Now, the great thing about R is the availability of well-written packages to do useful stuff. I quickly found two packages twitteR and rtweet that are designed to harvest Twitter data. I went with rtweet and there were some great guides to setting up OAuth and getting going.

The code below sets up my environment and pulls down lists of my followers and my “friends”. I’m looking at my main account and not the quantixed twitter account.

library(rtweet)
library(httpuv)
## setup your appname, api key and api secret
appname <- "whatever_name"
key <- "blah614h"
secret <- "blah614h"
## create token named "twitter_token"
twitter_token <- create_token(
  app = appname,
  consumer_key = key,
  consumer_secret = secret)

clathrin_followers <- get_followers("clathrin", n = "all")
clathrin_followers_names <- lookup_users(clathrin_followers)
clathrin_friends <- get_friends("clathrin")
clathrin_friends_names <- lookup_users(clathrin_friends)

The terminology is that people that follow me are called Followers and people that I follow are called Friends. These are the terms used by Twitter’s API. I have almost 3000 followers and around 1200 friends.

This was a bit strange… I had fewer followers with data than actual followers. Same for friends: missing a few hundred in total. I extracted a list of the Twitter IDs that had no data and tried a few other ways to look them up. All failed. I assume that these are users who have deleted their account (and the Twitter ID stays reserved) or maybe they are suspended for some reason. Very strange.

## noticed something weird
## look at the twitter ids of followers and friends with no data
missing_followers <- setdiff(clathrin_followers$user_id,clathrin_followers_names$user_id)
missing_friends <- setdiff(clathrin_friends$user_id,clathrin_friends_names$user_id)

## find how many real followers/friends are in each set
aub <- union(clathrin_followers_names$user_id,clathrin_friends_names$user_id)
anb <- intersect(clathrin_followers_names$user_id,clathrin_friends_names$user_id)

## make an Euler plot to look at overlap
library(eulerr)
fit <- euler(c(
  "Followers" = nrow(clathrin_followers_names) - length(anb),
  "Friends" = nrow(clathrin_friends_names) - length(anb),
  "Followers&Friends" = length(anb)))
plot(fit)

In the code above, I arranged into sets the “real Twitter users” who follow me or whom I follow. There was an overlap of 882 users, leaving 288 Friends who don’t follow me back – boo hoo!

I next wanted to see who these people are, which is pretty straightforward.

## who are the people I follow who don't follow me back
bonly <- setdiff(clathrin_friends_names$user_id,anb)
no_follow_back <- lookup_users(bonly)

Looking at no_follow_back was interesting. There are a bunch of announcement accounts and people with huge follower counts that I wasn’t surprised do not follow me back. There are a few people on the list with whom I have interacted yet who don’t follow me, which is a bit odd. I guess they could have unfollowed me at some point in the past, but my guess is they were never following me in the first place. It used to be the case that you could only see tweets from people you followed, but the boundaries have blurred a lot in recent years. An intermediary only has to retweet something you have written for someone else to see it, and you can then interact without actually following each other. In fact, my own Twitter experience is mainly through lists rather than my actual timeline, and to look at tweets in a list you don’t need to follow anyone on there.

All of this led me to thinking: maybe other people (who follow me) are wondering why I don’t follow them back… I should look at what I am missing out on.

## who are the people who follow me but I don't follow back
aonly <- setdiff(clathrin_followers_names$user_id,anb)
no_friend_back <- lookup_users(aonly)
## save csvs with all user data for unreciprocated follows
write.csv(no_follow_back, file = "nfb.csv")
write.csv(no_friend_back, file = "nfb2.csv")

With this last bit of code, I was able to save a file for each subset of unreciprocated follows/friends. Again there were some interesting people on this list. I must’ve missed them following me and didn’t follow back.

I used these lists to prune my friends and to follow some interesting new people. The csv files contain the Twitter bio of all the accounts so it’s quick to go through and check who is who and who is worth following. Obviously you can search all of this content for keywords and things you are interested in.
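For example, a keyword search over the bios might look like this (a sketch: “microscopy” is just a placeholder keyword, and the description and screen_name columns come from the lookup_users output):

## accounts whose bio mentions the keyword
hits <- no_friend_back[grepl("microscopy", no_friend_back$description,
                             ignore.case = TRUE), ]
hits$screen_name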

So there you have it. This is my first “all R” post on quantixed – hope you liked it!

The post title is from “I’m Not Following You” the final track from the 1997 LP of the same name from Edwyn Collins.

## Rollercoaster: ups and downs of Google Scholar citations

In the UK there is an advertising disclaimer that “the value of your investments may go down as well as up.” Since papers are our main commodity in science and citations are something of a return, surely the “value” of a published paper only ever increases over time. Doesn’t it?

I think this is true when citations to a paper are tracked at a conventional database (Web of Science for example). Citations are added and very rarely taken away. With Google Scholar it is a different story. Now, I am a huge Google Scholar fan so this post is not a criticism of the service at all. One of the nice things about GS is that it counts citations from the “grey literature”, i.e. theses, patents etc. But not so grey as to include blogs and news articles (most of the time). So you get a broader view of the influence of a paper beyond the confines of a conventional database. With this broader view comes volatility, as I’ll show below.

I don’t obsessively check my own page every day – honestly I don’t(!) – but I did happen to check my own page twice within a short space of time and I noticed that my H-index went up by 1 and then decreased by 1. I’m pretty sure I didn’t imagine this and so I began to wonder how stable the citation data in Google Scholar actually is and whether I could track cites automatically.

What goes up (must come down)

Manually checking GS every day is beyond me, and what are computers for anyway? I set up a little routine to grab my data each day and look at the stability of citations (details of how to do this are below if you’re interested).

You can click on the plot to see it in its full glory.

Each line is a plot of citations to a paper over many weeks. The grey line is no citations gained or lost, relative to the start. As the paper accrues citations the line becomes more red and if it loses citations below the starting point it turns blue. They are ranked by the integral of change in citation over time.
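For anyone wanting to replicate the ranking, the idea is simple. I did this part in Igor, but an R sketch would be (cites here is a hypothetical papers x days matrix of citation counts):

## change in cites relative to the first day, one row per paper
delta <- cites - cites[, 1]
## rank papers by the integral (sum) of change over time
ranking <- order(rowSums(delta), decreasing = TRUE)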

The data are retrieved daily so if a paper gains citations and loses an equal number in less than 24 hours, this is not detected.

You can see from the plot that the number of citations to a paper can go down as well as up. For one paper, citations dropped significantly from one day to the next, undoing two months’ worth of increases. This paper is my highest cited work and it dropped 10 cites, from 443 to 433.

I’m guessing that running this routine on someone working in a field with a higher citation rate would show more volatility.

The increases in citations have an obvious cause, but what about the decreases? My guess is that they are duplicate citations which are removed when they are added to a “cluster” (Google’s way of dealing with multiple URLs for the same paper). Another likely cause is a cited item subsequently being judged not to be a paper, e.g. a blog post, and getting removed.

Please please tell me now

The alert emails from Google Scholar have always puzzled me. I have alerts set up to tell me when my work is cited. I love getting them – who doesn’t want to see who has cited their work? Annoyingly they arrive infrequently and only ever contain one or two new papers. I looked at the frequency of changes in citation number and checked when I received emails from Google Scholar.

Over the same period as the plot above, you can see that citations to my profile happen pretty frequently. Again, if my work was cited at a higher rate, I guess this would be even more frequent. But in this period I only received six or so alert emails. I don’t think GS waits until a citation is stable for a while before emailing, because they tend to come immediately after an update. The alert emails remain a mystery to me. It would be great if they came a bit more often and it would be even better if they told you which paper(s) they cite!

Summary

Google Scholar is a wonderful service that finds an extra 20% or so of the impact of your work compared to other databases. With this extra information comes volatility and the numbers you see on there probably shouldn’t be treated as absolute.

Methods

To do this I used Christian Kreibich’s python script to retrieve information from Google Scholar. I wrote a little shell script to run scholar.py and set up a daemon to do this every day at the same time. I couldn’t find a way to search by my UserID, so the search for my name brings up some unrelated papers that need to be filtered. There are restrictions on what you can retrieve, so my script retrieves papers within three different time frames to avoid hitting the limit for paper information retrieval.

The daemon is a plist in ~/Library/LaunchAgents/

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple Computer//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
<key>Label</key>
<string>com.quantixed.gscrape</string>
<key>KeepAlive</key>
<false/>
<key>Program</key>
<string>/path/to/the/shell/script/gscrape.sh</string>
<key>StartCalendarInterval</key>
<dict>
<key>Hour</key>
<integer>14</integer>
<key>Minute</key>
<integer>30</integer>
</dict>
</dict>
</plist>

And the shell script is something like

#!/bin/bash
cd /path/to/the/shell/script/
/usr/bin/pythonw '/path/to/your/scholar.py-master/scholar.py' -c 500 --author "Joe Bloggs" --after=1999 --before=2007 --csv > a.csv
/usr/bin/pythonw '/path/to/your/scholar.py-master/scholar.py' -c 500 --author "Joe Bloggs" --after=2008 --before=2012 --csv > b.csv
/usr/bin/pythonw '/path/to/your/scholar.py-master/scholar.py' -c 500 --author "Joe Bloggs" --after=2013 --csv > c.csv
OF=all_$(date +%Y%m%d).csv
cat a.csv b.csv c.csv > $OF

To crunch the data I wrote something in Igor which reads in the CSVs and plots out my data. This meant first getting a list of clusterIDs corresponding to my papers, in order to filter out other people’s work.

I have a surprising number of tracks in my library with Rollercoaster in the title. I will go with indie wannabe act Northern Uproar for the title of this post.

“What goes up (must come down)” is from Graham & Brown’s Super Fresco wallpaper ad from 1984.

“Please please tell me now” is a lyric from Duran Duran’s “Is There Something I Should Know?”.

## The Second Arrangement

To validate our analyses, I’ve been using randomisation to show that the results we see would not arise due to chance. For example, the location of pixels in an image can be randomised and the analysis rerun to see if – for example – there is still colocalisation. A recent task meant randomising live cell movies in the time dimension, where two channels were being correlated with one another. In exploring how to do this automatically, I learned a few new things about permutations.

Here is the problem: If we have two channels (fluorophores), we can test for colocalisation or cross-correlation and get a result. Now, how likely is it that this was due to chance? So we want to re-arrange the frames of one channel relative to the other such that frame i of channel 1 is never paired with frame i of channel 2. This is because we want all pairs to be different to the original pairing. It was straightforward to program this, but I became interested in the maths behind it.

The maths: rearranging n objects is known as permutation, but the problem described above is known as derangement. The number of permutations of n frames is n!, but we need to exclude cases where the ith member stays in the ith position. It turns out that to do this you need to use the principle of inclusion and exclusion. If you are interested, the solution boils down to

$$n!\sum_{k=0}^{n}\frac{(-1)^k}{k!}$$

This basically means: for n frames there are n! permutations, but you need to subtract and add diminishing numbers of permutations to get to the result. A full description is given in the wikipedia link. Details of inclusion and exclusion are here.

I had got as far as figuring out that the ratio of permutations to derangements converges to e. However, you can tell that I am not a mathematician, as I used brute-force calculation to get there rather than writing out the solution. What this means in a computing sense is that a random permutation has roughly a 1-in-e chance (about 37%) of being a derangement: shuffle once and you might get one, with two attempts you’re very likely to, and after a few you’ll almost certainly have one.
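The brute-force check is quick in R. A minimal sketch, using the inclusion and exclusion formula above to count derangements:

## number of derangements of n items
derangements <- function(n) round(factorial(n) * sum((-1)^(0:n) / factorial(0:n)))
## ratio of permutations to derangements settles at exp(1) = 2.718...
sapply(5:10, function(n) factorial(n) / derangements(n))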

Back to the problem at hand. It occurred to me that not only do we not want frame i of channel 1 paired with frame i of channel 2, but it would actually be preferable to exclude frames i ± 2, say, because if two vesicles are in the same location at frame i they may also be colocalised at frame i-1, for example. This is more complex to write down because for frames 1 and 2, and frames n and n-1, there are fewer excluded positions than for all other frames. For all other frames there are n-5 legal positions. This obviously sets a lower limit on the number of frames capable of being permuted.

The answer to this problem is solved by rook polynomials. You can think of the original positions of frames as columns on a n x n chess board. The rows are the frames that need rearranging, excluded positions are coloured in. Now the permutations can be thought of as Rooks in a chess game (they can move horizontally or vertically but not diagonally). We need to work out how many arrangements of Rooks are possible such that there is one rook per row and such that no Rook can take another.

If we have a 7-frame movie, we have a 7 x 7 board looking like this (left). The “illegal” squares are coloured in. Frame 1 must go in position D, E, F or G, but then frame 2 can only go in E, F or G. If a rook is at E1, then we cannot have a rook at E2. And so on.

To calculate the number of legal arrangements, we first need the rook polynomial of the illegal squares:

$$1 + 29 x + 310 x^2 + 1544 x^3 + 3732 x^4 + 4136 x^5 + 1756 x^6 + 172 x^7$$

Rook polynomials can be written in terms of associated Laguerre polynomials; for a complete m x n board:

$$R_{m,n}(x) = n!x^nL_n^{m-n}(-x^{-1})$$

where $$L_n^\alpha(x)$$ is an associated Laguerre polynomial. Feeding the rook numbers above into the principle of inclusion and exclusion, $$\sum_{k=0}^{n}(-1)^{k}r_{k}(n-k)!$$, gives the answer: 8 possibilities from 7! = 5040 permutations. In this example, frame 4 can only go in position A or G. Of course our movies have many more frames, so the randomisation is not so limited.

Why is this important? The way the randomisation is done is: the frames get shuffled and then checked to see if any “illegal” positions have been hit; if so, shuffle again, and when no illegal positions are detected, rearrange the movie accordingly. For the simple derangement case, the computation time per frame is roughly constant, whereas with the stricter exclusion it can take much longer (because there will be more rejections). In the case of 7 frames, with the restriction of no frames at i ± 2, the failure rate is 5032/5040 = 99.8%. Depending on how the code is written, this can cause some (potentially lengthy) wait time. Luckily, the failure rate comes down with more frames.
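My actual code is in Igor, but the rejection approach can be sketched in a few lines of R (shuffle_frames is a made-up name for illustration):

## shuffle frame indices until none lands within +/- 'exclusion' of its
## original position; only sensible when nFrames is large enough for a
## legal arrangement to exist
shuffle_frames <- function(nFrames, exclusion = 2) {
  repeat {
    s <- sample(nFrames)
    if (all(abs(s - seq_len(nFrames)) > exclusion)) return(s)
  }
}
## e.g. shuffle_frames(300) gives a legal reordering for a 300-frame movie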

What about in practice? The numbers involved in directly calculating the permutations and exclusions quickly become too big using non-optimised code on a simple desktop setup (a 12 x 12 board exceeds 20 GB). The raw numbers and rates don’t mean much; what I wanted to know was whether this slows down my code in a real test. To look at this I ran 100 repetitions of permutations of movies with 10-1000 frames. Whereas the simple derangement problem needs only one or two shuffles, with the greater restrictions it takes eight or nine before a “correct” solution is found. The code can be written so that this calculation is done on a placeholder wave rather than the real data, and the result applied to the data afterwards. This reduces computation time. For movies of around 300 frames, the total run time of my code (which does quite a few things besides this) is around 3 minutes, and I can live with that.

So, applying this more stringent exclusion will work for long movies and the wait times are not too bad. I learned something about combinatorics along the way. Thanks for reading!

Further notes

The first derangement issue I mentioned is also referred to as the hat-check problem. This refers to people (numbered 1,2,3 … n) with corresponding hats (labelled 1,2,3 … n): how many ways can the hats be given out at random such that nobody gets their own hat?

Adding i+1 as an illegal position is known as problème des ménages. This is a problem of how to seat married couples so that they sit in a man-woman arrangement without being seated next to their partner. Perhaps i ±2 should be known as the vesicle problem?

The post title comes from “The Second Arrangement” by Steely Dan. An unreleased track recorded for the Gaucho sessions.

## Adventures in Code V: making a map of Igor functions

I’ve generated a lot of code for IgorPro. Keeping track of it all has got easier since I started using GitHub, but even so, I have found myself writing something only to discover that I had previously written the same thing. I was thinking that it would be good to make a list of all functions that I’ve written, to help locate long-lost functions.

This question was brought up on the Igor mailing list a while back and there are several solutions, especially if you want to look at dependencies. However, this two-liner works to generate a file called funcfile.txt, which contains a list of functions and the ipf file that they appear in.

grep "^[ \t]*Function" *.ipf | grep -oE '[ \t]+[A-Za-z_0-9]+\(' | tr -d " " | tr -d "(" > output
for i in `cat output`; do grep -ie "$i" *.ipf | grep -w "Function" >> funcfile.txt ; done

Thanks to Thomas Braun on the mailing list for the idea. I have converted it to work on grep (BSD grep) 2.5.1-FreeBSD which runs on macOS. Use the terminal, cd to the directory containing your ipf files and run it. Enjoy!

EDIT: I did a bit more work on this idea and it has now expanded to its own repo. Briefly, funcfile.txt is converted to tsv and then parsed – using Igor – to json. This can be displayed using some d3.js magic.

Part of a series with code snippets and tips.