## Round and Round

I thought I’d share a procedure for rotating a 2D set of coordinates about the origin. Why would you want to do this? Well, we’ve been looking at cell migration in 2D – tracking nuclear position over time. Cells migrate at random and I previously blogged about ways to visualise these tracks more clearly. Part of this earlier procedure was to set the start of each track at (0,0). This gives a random hairball of tracks moving away from the origin. Wouldn’t it be a good idea to orient all the tracks so that each endpoint lies on the same axis? This would simplify the view and allow one to assess how ‘directional’ the cell tracks are. To rotate a set of coordinates, you need a rotation matrix, which converts the x,y coordinates to their new positions x’,y’. This rotation is counter-clockwise.

$$x' = x \cos \theta - y \sin \theta$$

$$y' = x \sin \theta + y \cos \theta$$

However, we need to find theta first. To do this we need the angle between two lines, which is given by this formula.

$$\cos \theta = \frac {\mathbf a \cdot \mathbf b}{\left \Vert {\mathbf a} \right \Vert \cdot \left \Vert {\mathbf b} \right \Vert}$$

The maths is kept to a minimum here. If you are interested, look at the code at the bottom.

The two lines (a and b) are formed by the x-axis (origin to some point on the x-axis, i.e. y=0) and by a line running from the origin to the last coordinate in the series. This calculation is done for each track, and the resulting theta is used to rotate that whole track (x,y changed to x’,y’ for each point).
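Here is a minimal sketch of the whole operation in Python with NumPy (the actual procedure is in IgorPro; the track here is made up for illustration). Using atan2 to find theta gives a signed angle directly, avoiding the 0 to π range limitation of acos.

```python
import numpy as np

def rotate_track(track):
    """Rotate a track (N x 2 array of x,y points, starting at the origin)
    so that its final point lies on the positive x-axis (y = 0)."""
    x_end, y_end = track[-1]
    # atan2 gives the signed angle between the endpoint vector and the
    # x-axis, avoiding the 0 to pi range limitation of acos
    theta = np.arctan2(y_end, x_end)
    c, s = np.cos(-theta), np.sin(-theta)
    # standard counter-clockwise rotation matrix, applied with angle -theta
    R = np.array([[c, -s], [s, c]])
    return track @ R.T

# a made-up three-point track starting at (0,0)
track = np.array([[0.0, 0.0], [1.0, 2.0], [3.0, 3.0]])
rotated = rotate_track(track)
```

Applying this to every track in a set puts all the endpoints on the same axis while preserving each path's shape and length.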

Here is an example of just a few tracks from an experiment. Typically we have hundreds of tracks for each experimental group and the code will blast through them all very quickly (<1 s).

After rotation, the tracks are aligned so that the last point of each lies on the x-axis at y=0. This allows us to see how ‘directional’ the tracks are. With the end points aligned, we can ask: as each cell migrated there, how convoluted was its path?

The code to do this is up on Igor Exchange code snippets. A picture of the code is below (markup for code in WordPress is not very clear). See the code snippet if you want to use it.

The weakness of this method is that acos (arccos) only gives results from 0 to Pi (0 to 180°). There is a correction in the procedure, but everything needs editing if you want to rotate the co-ordinates to some other plane. Feedback welcome.

Edit Jim Prouty and A.G. have suggested two modifications to the code. The first is to use complex waves rather than 2D real waves. Then use two native Igor functions r2polar or p2rect. The second suggestion is to use Matrix operations! As is often the case with Igor there are several ways of doing things. The method described here is long-winded compared to a MatrixOp and if the waves were huge these solutions would be much, much faster. As it is, our migration movies typically have 60 points and as mentioned rotator() blasts through them very quickly. More complex coordinate sets would need something more sophisticated.

The post title is taken from “Round & Round” by New Order from their Technique LP.

## Sure To Fall

What does the life cycle of a scientific paper look like?

It stands to reason that after a paper is published, people download and read the paper and then if it generates sufficient interest, it will begin to be cited. At some point these citations will peak and the interest will die away as the work gets superseded or the field moves on. So each paper has a useful lifespan. When does the average paper start to accumulate citations, when do they peak and when do they die away?

Citation behaviours are known to be very field-specific. So to narrow things down, I focussed on cell biology and in one area “clathrin-mediated endocytosis” in particular. It’s an area that I’ve published in – of course this stuff is driven by self-interest. I downloaded data for 1000 papers from Web of Science that had accumulated the most citations. Reviews were excluded, as I assume their citation patterns are different from primary literature. The idea was just to take a large sample of papers on a topic. The data are pretty good, but there are some errors (see below).

Number-crunching (feel free to skip this bit): I imported the data into IgorPro, making a 1D wave for each record (paper). I deleted the last point, corresponding to cites in 2014 (the year is not complete). I aligned all records so that the year of publication was 0. Next, the citations were normalised to the maximum number achieved in the peak year. This allows us to look at the lifecycle in a sensible way. Next I took out records for papers less than 6 years old, as I reasoned these would not have completed their lifecycle and could contaminate the analysis (it turned out to make little difference). The lifecycles were plotted and averaged. I also wrote a quick function to pull out the peak year for citations post hoc.
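The alignment and normalisation step can be sketched in Python (the actual analysis was in IgorPro; the record below is made up for illustration):

```python
import numpy as np

def lifecycle(cites_by_year, pub_year):
    """Align a citation record to publication year 0 and normalise the
    counts to the number of citations in the peak year."""
    years = sorted(cites_by_year)
    counts = np.array([cites_by_year[y] for y in years], dtype=float)
    aligned = [y - pub_year for y in years]
    peak_year = aligned[int(counts.argmax())]
    return aligned, counts / counts.max(), peak_year

# a made-up record: citations received per calendar year
record = {2005: 2, 2006: 10, 2007: 25, 2008: 18, 2009: 9}
yrs, norm, peak_yr = lifecycle(record, pub_year=2005)
```

With every record on a common scale (year 0 = publication, peak year = 1.0), the lifecycles can be averaged sensibly.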

So what did it show?

Citations to a paper go up and go down, as expected (top left). When cumulative citations are plotted, most of the articles have an initial burst and then level off. The exceptions are ~8 articles that continue to rise linearly (top right). On average a paper generates its peak citations three years after publication (box plot). The fall after this peak period is pretty linear and it’s apparently all over somewhere >15 years after publication (bottom left). To look at the decline in more detail I aligned the papers so that year 0 was the year of peak citations. The average now loses almost 40% of those peak citations in the following year and then declines steadily (bottom right).

Edit: The dreaded Impact Factor calculation takes the citations to articles published in the preceding 2 years and divides by the number of citable items in that period. This means that each paper only contributes to the Impact Factor in years 1 and 2. This is before the average paper reaches its peak citation period. Thanks to David Stephens (@david_s_bristol) for pointing this out. The alternative 5 year Impact Factor gets around this limitation.

Perhaps lifecycle is the wrong term: papers in this dataset don’t actually ‘die’, i.e. go to 0 citations. There is always a chance that a paper will pick up the odd citation. Papers published 15 years ago are still clocking 20% of their peak citations. Looking at papers cited at lower rates would be informative here.

Two other weaknesses affect precision: 1) a year is a long time and 2) publication is subject to long lag times. The analysis would be improved by categorising the records based on the month-year when the paper was published and the month-year when each citation comes in. Papers published in January of one year probably have a different peak than those published in December of the same year, but this is lost when looking at year alone. Secondly, due to publication lag, it is impossible to know when the peak period of influence for a paper truly is.
Problems in the dataset: some reviews remained despite being supposedly excluded, i.e. they are not properly tagged in the database. Also, some records have citations from years before the article was published! The numbers of citations involved are small enough not to worry about for this analysis, but it makes you wonder how accurate the whole dataset is. I’ve written before about how complete citation data may or may not be. These sorts of things are a concern for all of us who are judged by them for hiring and promotion decisions.

The post title is taken from ‘Sure To Fall’ by The Beatles, recorded during The Decca Sessions.

## Tips from the Blog I

What is the best music to listen to while writing a manuscript or grant proposal? OK, I know that some people prefer silence and certainly most people hate radio chatter while trying to concentrate. However, if you like listening to music, setting an iPod on shuffle is no good since a track by Napalm Death can jump from the speakers and affect your concentration. Here is a strategy for a randomised music stream of the right mood and with no repetition, using iTunes.

For this you need:
A reasonably large and varied iTunes library that is properly tagged*.

1. Set up the first smart playlist to select all songs in your library that you like to listen to while writing. I do this by selecting genres that I find conducive to writing.
Conditions are:
- Match any of the following rules
- Genre contains jazz
- add as many genres as you like, e.g. shoegaze, space rock, dream pop etc.
- Don’t limit, and do check Live updating
I call this list Writing.

2. Set up a second smart playlist that makes a randomised, novel list from the first playlist.
Conditions are:
- Match all of the following rules
- Playlist is Writing (or whatever you called the first playlist)
- Last played is not in the last 14 days (this means once a track is played it disappears, i.e. the list constantly refreshes)
- Limit to 50 items selected by random
- Check Live updating
I call this list Writing List.

That’s it! Now play from Writing List while you write. The same strategy works for other moods, e.g. for making figures I like to listen to different music and so I have another pair for that.

After a while, the tracks that you’ve skipped (for whatever reason) clog up the playlist. Just select all and delete from the smart playlist, this refreshes the list and you can go again with a fresh set.

* If your library has only a few tracks, or has plenty of tracks but they are all of a similar genre, this tip is not for you.

## All This And More

I was looking at the latest issue of Cell and marvelling at how many authors there are on each paper. It’s no secret that the raison d’être of Cell is to publish the “last word” on a topic (although whether it fulfils that objective is debatable). Definitive work needs to be comprehensive. So it follows that this means lots of techniques and ergo lots of authors. This means it is even more impressive when a dual author paper turns up in the table of contents for Cell. Anyway, I got to thinking: has it always been the case that Cell papers have lots of authors and if not, when did that change?

I downloaded the data for all articles published by Cell (and for comparison, J Cell Biol) from Scopus. The records required a bit of cleaning. For example, SnapShot papers needed to be removed and also the odd obituary etc. had been misclassified as an article. These could be quickly removed. I then went back through and filtered out ‘articles’ that were less than three pages as I think it is not possible for a paper to be two pages or fewer in length. The data could be loaded into IgorPro and boxplots generated per year to show how author number varied over time. Reviews that are misclassified as Articles will still be in the dataset, but I figured these would be minimal.
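The cleaning steps can be sketched in Python (a hypothetical sketch only: the field names and records here are made up, and the real filtering was done on the Scopus export in IgorPro):

```python
def clean_records(records):
    """Keep only proper articles: drop misclassified items and anything
    shorter than three pages (too short to be a real paper)."""
    kept = []
    for rec in records:
        if rec.get("type") != "Article":
            continue  # obituaries etc. misclassified in the export
        if rec.get("pages", 0) < 3:
            continue  # SnapShots and other one- or two-page items
        kept.append(rec)
    return kept

# made-up records with hypothetical field names
records = [
    {"title": "A", "type": "Article", "pages": 12, "authors": 8},
    {"title": "SnapShot: X", "type": "Article", "pages": 2, "authors": 2},
    {"title": "Obituary", "type": "Editorial", "pages": 1, "authors": 1},
]
clean = clean_records(records)
```

The author counts of the surviving records can then be binned by year for the boxplots.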

First off: yes, there are more authors on average for a Cell paper versus a J Cell Biol paper. What is interesting is that both journals had similar numbers of authors when Cell was born (1974) and they crept up together until the early 2000s, when the number of Cell authors kept increasing, or J Cell Biol flattened off, whichever way you look at it.

I think the overall trend to more authors is because understanding biology has increasingly required multiple approaches and the bar for evidence seems to be getting higher over time. The initial creep to more authors (1974-2000) might be due to a cultural change where people (technicians/students/women) began to get proper credit for their contributions. However, this doesn’t explain the divergence between J Cell Biol and Cell in recent years. One possibility is Cell takes more non-cell biology papers and that these papers necessarily have more authors. For example, the polar bear genome was published in Cell (29 authors), and this sort of paper would not appear in J Cell Biol. Another possibility is that J Cell Biol has a shorter and stricter revision procedure, which means that multiple rounds of revision, collecting new techniques and new authors is more limited than it is at Cell. Any other ideas?

I also quickly checked whether more authors means more citations, but found no evidence for such a relationship. For papers published in the years 2000-2004, the median citation number for papers with 1-10 authors was pretty constant for J Cell Biol. For Cell, these data were noisier. Three-author papers tended to be cited a bit more than those with two authors, but then four-author papers were lower again.

The number of authors on papers from our lab ranges from 2 to 9, with a median of 3.5. This would put an average paper from our lab in the bottom quartile for JCB and in the lower 10% for Cell in 2013. Ironically, our 9-author paper (an outlier) was published in J Cell Biol. Maybe we need to get more authors on our papers before we can start troubling Cell with our manuscripts…

The Post title is taken from ‘All This and More’ by The Wedding Present from their LP George Best.

## Blast Off!

This post is about metrics and specifically the H-index. It will probably be the first of several on this topic.

I was re-reading a blog post by Alex Bateman on his affection for the H-index as a tool for evaluating up-and-coming scientists. He describes Jorge Hirsch’s H-index, its limitations and its utility quite nicely, so I won’t reiterate this (although I’ll probably do so in another post). What is under-appreciated is that Hirsch also introduced the m quotient, which is the H-index divided by years since the first publication. It’s the m quotient that I’ll concentrate on here. The TL;DR is: I think that the H-index does have some uses, but evaluating early career scientists is not one of them.

Anyone of an anti-metrics disposition should look away now.

Alex proposes that scientists can be judged (and hired) by using m as follows:

• <1.0 = average scientist
• 1.0-2.0 = above average
• 2.0-3.0 = excellent
• >3.0 = stellar

He says “So post-docs with an m-value of greater than three are future science superstars and highly likely to have a stratospheric rise. If you can find one, hire them immediately!”.

From what I have seen, the H-index (and therefore m) is too noisy for early stage career scientists to be of any use for evaluation. Let’s leave that aside for the moment. What he is saying is you should definitely hire a post-doc who has published ≥3 papers with ≥3 citations each in their first year, ≥6 with ≥6 citations each in their second year, ≥9 papers with ≥9 in their third year…

Do these people even exist? A candidate with a 3-year PhD and a 3-year postdoc (6 years in total) would need ≥18 papers with ≥18 citations each! In my field (molecular cell biology), it is unusual for somebody to publish that many papers, let alone accrue citations at that rate*.
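To make the arithmetic concrete, here is a tiny Python sketch of the m quotient and the publication rate the "stellar" threshold implies:

```python
def m_quotient(h_index, years_since_first_paper):
    """Hirsch's m quotient: H-index divided by years since first paper."""
    return h_index / years_since_first_paper

# to sustain Alex's 'stellar' threshold of m = 3, a candidate needs an
# H-index of 3n after n years, i.e. at least 3n papers with >= 3n
# citations each
required_h = {n: 3 * n for n in range(1, 7)}
```

So by year 6 the candidate needs an H-index of 18, which is the ≥18 papers with ≥18 citations figure above.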

This got me thinking: using Alex’s criteria, how many stellar scientists would we miss out on, and would we be more likely to hire the next Jan Hendrik Schön? To check this out I needed to write a quick program to calculate H-index by year (I’ll describe this in a future post). Off the top of my head I thought of a few scientists that I know of, who are successful by many other measures, and plotted their H-index by year. The dotted line shows a constant m of 1, “average” by Alex’s criteria. I’ve taken a guess at when they became a PI. I have anonymised the scholars; the information is public and anyone can calculate this, but it’s not fair to identify people without asking (hopefully they can’t recognise themselves – if they read this!).

This is a small sample taken from people in my field. You can see that it is rare for scientists to have a big m at an early stage in their careers. With the exception of Scholar C, who was just awesome from the get-go, panels appointing any of these scholars would have had trouble divining the future success of these people on the basis of H-index and m alone. Scholar D and Scholar E really saw their careers take-off by making big discoveries, and these happened at different stages of their careers. Both of these scholars were “below average” when they were appointed as PI. The panel would certainly not have used metrics in their evaluation (the databases were not in wide use back then), probably just letters of recommendation and reading the work. Clearly, they could identify the potential in these scientists… or maybe they just got lucky. Who knows?!

There may be other fields where publication at higher rates can lead to a large m but I would still question the contribution of the scientist to the papers that led to the H-index. Are they first or last author? One problem with the H-index is that the 20th scientist in a list of 40 authors gets the same credit as the first author. Filtering what counts in the list of articles seems sensible, but this would make the values even more noisy for early stage scientists.

*In the comments section, somebody points out that if you publish a paper very early then this affects your m value. This is something I sympathise with. My first paper was in 1999 when I was an undergrad. This dents my m value as it was a full three years until my next paper.

The post title is taken from ‘Blast Off!’ by Rivers Cuomo from ‘Songs from the Black Hole’ the unreleased follow-up to Pinkerton.

## Very Best Years

What was the best year in music?

OK, I have to be upfront and say that I thought the answer to this would be 1991. Why? Just a hunch. Nevermind, Loveless, Spiderland, Laughing Stock… it was a pretty good year. I thought it would be fun to find out if there really was a golden year in music. It turns out that it wasn’t 1991.

There are many ways to look at this question, but I figured that a good place to start was to find what year had the highest density of great LPs. But how do we define a great LP? Music critics are notorious for getting it wrong and so I’m a big fan of rateyourmusic.com (RYM) which democratises the grading process for music by crowdsourcing opinion. It allows people to rate LPs in their collection and these ratings are aggregated via a slightly opaque system and the albums are ranked into charts. I scraped the data for the Top 1000 LPs of All-Time*. Crunching the numbers was straightforward. So what did it show?

Looking at the Top 1000, 1971 and 1972 are two years with the highest representation. Looking at the Top 500 LPs, 1971 is the year with most records. Looking at the Top 100, the late 60s features highly.

To look at this in detail, I plotted the rank versus year. This showed that there was a gap in the early 80s where not many Top 1000 LPs were released. This could be seen in the other plots, but it’s clearer on the bubble plot. Also the cluster of high-ranking LPs released in the 1960s is obvious.

The plot is colour-coded to show the rank, while the size of the bubbles indicates the rating. Note that rating doesn’t correlate with rank (RYM also factors in number of ratings and user loyalty, to determine this). To take the ranking into account, I calculated the “integrated score” for all albums released in a given year. The score is 1001-rank, and the summation of all of these scores for albums released in a given year gives the integrated score.
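The integrated score calculation is simple enough to sketch in Python (the chart entries below are made up for illustration):

```python
from collections import defaultdict

def integrated_scores(chart):
    """Sum (1001 - rank) over all Top 1000 albums released in each year."""
    scores = defaultdict(int)
    for rank, year in chart:
        scores[year] += 1001 - rank
    return dict(scores)

# made-up chart entries: (rank, release year)
chart = [(1, 1997), (2, 1971), (3, 1971), (1000, 1982)]
by_year = integrated_scores(chart)
```

A #1 album contributes 1000 points to its year and the #1000 album contributes just 1, so years dense with highly ranked LPs score highest.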

This is shown on a background of scores for each decade. Again, 1970s rule and 1971 is the peak. The shape of this profile will not surprise music fans. The first bump in the late 50s coincides with rock n roll, influential jazz records and the birth of the LP as a serious format. The 60s sees a rapid increase in density of great albums per year, hitting a peak in 1971. The decline that follows is halted by a spike in 1977: punk. There’s a relative dearth of highly rated LPs in the early 80s and things really tail off in the early 2000s. The lack of highly rated LPs in these later years is probably best explained by few ratings, due to young age of these LPs. Also diversification of music styles, tastes and the way that music is consumed is likely to play a role. The highest ranked LP on the list is Radiohead’s OK Computer (1997) which was released in a non-peak year. Note that 1991 does not stand out particularly. In fact, in the 1990s, 1994 stands out as the best year for music.

Finally, RYM has a nice classification system for music so I calculated the integrated score for these genres and sub-genres (cowpunk, anyone?). Rock (my definition) is by far the highest scoring and Singer-Songwriter is the highest scoring genre/sub-genre.

So there you have it. 1971 was the best year in music according to this analysis. Now… where’s my copy of Tago Mago.

* I did this mid-April. I doubt it’s changed much. This was an exercise to learn how to scrape and I also don’t think I broke the terms of service of RYM. If I did, I’ll take this post down.

The title of this post comes from ‘Very Best Years’ by The Grays from their LP ‘Ro Sham Bo’. It was released in 1994…

## I’m Gonna Crawl

Fans of data visualisation will know the work of Edward Tufte well. His book “The Visual Display of Quantitative Information” is a classic which covers the history and the principles of conveying data in a concise way that is easy to interpret. He is also credited with two different dataviz techniques: sparklines and image quilts. It was these two innovations that came to mind when I was discussing some cell migration results generated in our lab.

Sparklines are small displays of 1D information versus time to highlight the profile (think: stocks and shares).

Image quilts are arrays of images that together quickly provide you with an overview (think: Google Images results).

Analysing cell migration generates ‘tracks’ of many cells as they move around a 2D surface. Tracks are pairs of XY co-ordinates at different time points. We want to understand how these tracks change if we do something to the cells, e.g. knock down a particular protein. There are many ways to analyse this, such as looking at the speed of migration, directionality, and so on. When we were looking at lots of tracks, all jumbled up, I thought of sparklines and image quilts, and figured the easiest way to compare a control and test group would be to generate something similar.

We start out with many tracks within a field:

It’s difficult to see what is happening here, so it needs to be simplified.

I wrote a couple of procedures in IgorPro that calculated the cumulative distance that each cell had migrated at a given time point (say, the end of the movie). These cumulative distances were then ranked and then the corresponding cells were arrayed in the x-dimension according to how far they migrated. This was a little bit tricky to do, but that’s another story.
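A Python sketch of the ranking step (the actual procedures are in IgorPro; the tracks here are made-up arrays of XY co-ordinates):

```python
import numpy as np

def rank_tracks(tracks):
    """Order tracks by cumulative distance migrated (shortest first)."""
    def path_length(track):
        steps = np.diff(track, axis=0)  # displacement between frames
        return np.hypot(steps[:, 0], steps[:, 1]).sum()
    return sorted(tracks, key=path_length)

# two made-up tracks: path lengths 1 and 6
short = np.array([[0.0, 0.0], [1.0, 0.0]])
longer = np.array([[0.0, 0.0], [3.0, 4.0], [3.0, 5.0]])
ordered = rank_tracks([longer, short])
```

The ordered tracks are then spaced out along the x-dimension for the "sparkline" view.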

This plot shows the tracks with the shortest/slowest to the left and the furthest/fastest to the right. This can then be compared to a test set and differences become apparent. However, we need to look at many tracks and expanding these “sparklines” further is not practical – we want to provide an overview.

Accordingly, I wrote another procedure to array them in an XY array with a given spacing between the start points. This should give an “image quilt” feel.

I added gridlines to indicate the start position. The result is that a nice overview is seen and differences between groups can be easily seen at first glance (or not seen if there is no effect!).

This method works well to compare control and test groups that have a similar number of cells. If N is different (say, by more than 10%), we need to take a random sample of tracks and array those to get a feel for what’s happening. Obviously the tracks could be arrayed according to whatever parameter is required, e.g. highest speed, most directional etc. etc.

One thought is to do a further iteration where the tracks are oriented so that the start and end points are at the same point in X, or oriented so that the tracks have the same starting trajectory. As it is, the mix of trajectories spoils the ease of interpretation.

Obviously, this can be applied to tracks of anything: growing and shrinking microtubules, endosome/lysosome movement etc. etc.

Any suggestions for improvements are welcome, but I think this is a quick and easy way to just eyeball the data to see if there are any differences before calculating any other parameters. I thought I’d put the idea out there – maybe together with the code if there is any interest.

The post title is from I’m Gonna Crawl – Led Zeppelin from their In Through The Out Door LP

## All Together Now

In the lab we use IgorPro from Wavemetrics for analysis. Here is a useful procedure to plot all XY pairs in an experiment. I was plotting out some cell tracking data with a colleague and I knew that I had this useful function buried in an experiment somewhere. I eventually found it and thought I’d post it here. I’ll add it to the code section of the website soon. Looking at it, it doesn’t look like it was written by me. A search of IgorExchange didn’t reveal its author, so maybe it was me. Apologies if it wasn’t.

The point is: you have a bunch of XY pairs and you just want to plot all of them in one window to look at them. If they are 2D waves or a small number of 1D waves, this is straightforward. If you have hundreds, you need a function!

An example would be fluorescence recordings versus time (where each time wave is unique to the fluorescence trace) or XY co-ordinates of a particle in space.

To use this procedure, you need an experiment with a logical naming system for 1D waves, something like X_ctrl1, X_ctrl2, X_ctrl3 etc. and Y_ctrl1, Y_ctrl2, Y_ctrl3 etc. Paste the following into the Procedure Window (command+m).


```igor
Function PlotAllWaves(theYList, theXList)
	String theYList
	String theXList

	Display
	Variable i = 0
	String aWaveName = ""
	String bWaveName = ""

	do
		aWaveName = StringFromList(i, theYList)
		bWaveName = StringFromList(i, theXList)
		WAVE/Z aWave = $aWaveName
		WAVE/Z bWave = $bWaveName
		if (!WaveExists(aWave) || !WaveExists(bWave))
			break
		endif
		AppendToGraph aWave vs bWave
		i += 1
	while(1)
End
```


After compiling you can call the function by typing in the Command Window:


```igor
PlotAllWaves(WaveList("y_*", ";", ""), WaveList("x_*", ";", ""))
```


You’ll need to change this for whatever convention you are using for your wave naming system. You will know how to do this if you have got this far!

This function is very useful for just eyeballing the data after you have imported it. The databrowser shows only one wave at a time, but it is preferable to look at all the waves to find errors, spot outliers or trends etc.

Edit 28/4/15: the logical naming system and the order in which the waves were added to the experiment are crucial for this to work. We’re now using two different versions of this code that either a) check that the waves are compatible or b) concatenate the waves into a 2D wave before plotting. This reduces errors in plotting.

The post title is taken from All Together Now – The Beatles from the Yellow Submarine soundtrack.

## Counting backwards

I thought I would add a blog to our lab website. The plan is to update maybe once a week with content that is too long for Twitter but doesn’t fit in the categories on the lab website. I’m thinking extra analysis, paper commentaries, outreach activities etc. Let’s see how it goes.

First up: how do you count the number of words or characters in a text file?

Microsoft Word has a nice feature for doing this, but poor old TextEdit does not. Fortunately, AppleScript can come to the rescue! I found a script on the web to count the number of words in a TextEdit file and modified it slightly to give the number of characters as well.

Why would you want to do this? When editing fields on a web form (particularly grant application forms) it’s not practical to do this in the browser and these fields often have strict limits on words and characters.

Here is the code:


```applescript
tell application "TextEdit"
	set wc to count words of document 1
	set cc to count characters of document 1
	if wc is equal to 1 then
		set txt to " word, "
	else
		set txt to " words, "
	end if
	if cc is equal to 1 then
		set txtc to " character."
	else
		set txtc to " characters."
	end if
	set theMessage to "This text comprises " & (wc as string) & txt & (cc as string) & txtc
	display dialog theMessage with title "WordStats" buttons {"OK"} default button "OK"
end tell
```


If you are new to this: open AppleScript Editor. New file. Paste in the code above. Click Compile. It should look something like this:

Now save it to your Scripts folder in ~/Library. Call it something sensible, e.g. TextEditCounter. Then, in AppleScript Editor, click Preferences and check the box ‘Show script menu in menu bar’. This puts the AppleScript icon in your menu bar; click it and you should see your script waiting for you to use it.

This blog title is taken from Counting Backwards – Throwing Muses from their LP The Real Ramona.