## Division Day: using PCA in cell biology

In this post I’ll describe a computational method for splitting two sides of a cell biological structure. It’s a simple method that relies on principal component analysis, otherwise known as PCA. Like all things mathematical there are some great resources on the web, if you want to understand this operation in more detail (for example, this great post by Lior Pachter). PCA can applied to many biological problems, you’ve probably seen it used to find patterns in large data sets, e.g. from proteomic studies. It can also be useful for analysing microscopy data. Since our analysis using this method is unlikely to make it into print any time soon, I thought I’d put it up on Quantixed.

During mitosis, a cell forms a mitotic spindle to share copied chromosomes equally to the two new cells. Our lab is working on how this process works and how it goes wrong in cancer. The chromosomes attach to the spindle via kinetochores and during prometaphase they are moved to the middle of the cell. Here, the chromosomes are organised into a disc-like structure called the metaphase plate. The disc is thin in the direction of the spindle axis, but much larger in width and height. To examine the spatial distribution of kinetochores on the plate we wanted a way to approximately separate kinetochores on one side if the plate from the other.

Kinetochores can be easily detected in 3D confocal images of mitotic cells by particle analysis. Kinetochores are easily stained and appear as bright spots that a computer can pick out (we use Imaris for this). The cartesian coordinates of each detected kinetochore were saved as csv and fed into IgorPro. A procedure could then be run which works in three steps. The code is shown at the bottom, it is wrapped in further code that deals with multiple datasets from many cells/experiments etc. The three steps are:

1. PCA
2. Point-to-plane
3. Analysis on each subset

I’ll describe each step and how it works.

1. Principal component analysis

This is used to find the 3rd eigenvector, which can be used to define a plane passing through the centre of the plate. This plane is used for division.

Now, because the metaphase plate is a disc it has three dimensions, the third of which – “thickness” – is the smallest. PCA will find the principal component, i.e. the direction in which there is most variance. Orthogonal to that is the second biggest variance and orthogonal to that direction is the smallest. These directions are called eigenvectors and their magnitude is the eigenvalue. As there are three dimensions to the data we can get all three eigenvectors out and the 3rd eigenvector corresponds to thickness of the metaphase plate. Metaphase plates in cells grown on coverslips are orientated similarly, but the cells themselves are at random orientations. PCA takes no notice of this and can simply reveal the direction of the smallest dimension of a 3D structure. The movie shows this in action for a simulated data set. The black spots are arranged in a disk shape about the origin. They are rotated about x by 45° (the blue spots). We then run PCA and show the eigenvectors as unit vectors (red lines). The 3rd eigenvector is normal to the plane of division, i.e. the 1st and 2nd eigenvectors lie on the plane of division.

Also, the centroid needs to be defined. This is simply the cartesian coordinates for the average of each dimension. It is sometimes referred to as the mean vector. In the example this was the origin, in reality this will depend on the position and the overall height of the cell.

A much longer method to get the eigenvectors is to define the variance-covariance matrix (sometimes called the dispersion matrix) for each dimension, for all kinetochores and then do an eigenvector decomposition on the matrix. PCA is one command, whereas the matrix calculation would be an extra loop followed by an additional command.

2. Point-to-plane

The distance of each kinetochore to the plane that we defined is calculated. If it is a positive value then the kinetochore lies on the same side as the normal vector (defined above). If it is negative then it is on the other side. The maths behind how to do this are in section 10.3.1 of Geometric Tools for Computer Graphics by Schneider & Eberly (starting on p. 374). Google it, there is a PDF version on the web. I’ll save you some time, you just need one equation that defines a plane,

$$ax+by+cz+d=0$$

Where the unit normal vector is [a b c] and a point on the plane is [x y z]. We’ll use the coordinates of the centroid as a point on the plane to find d. Now that we know this, we can use a similar equation to find the distance of any point to the plane,

$$ax_{i}+by_{i}+cz_{i}+d$$

Results for each kinetochore are used to sort each side of the plane into separate waves for further calculation. In the movie below, the red dots and blue dots show the positions of the kinetochores on either side of the division plane. It’s a bit of an optical illusion, but the cube is turning in a right hand fashion.

3. Analysis on each subset

Now that the data have been sorted, separate calculations can be carried out on each. In the example, we were interested in how the kinetochores were organised spatially and so we looked at the distance to nearest neighbour. This is done by finding the Euclidean distance from each kinetochore to every other kinetochore and putting the lowest value for each kinetochore into a new wave. However, this calculation can be anything you want. If there are further waves that specify other properties of the kinetochores, e.g. brightness, then these can be similarly processed here.

Other notes

The code in its present form (not very streamlined) was fast and could be run on every cell from a number of experiments, reading out positional data for 10,000 kinetochores in ~2 s. For QC it is possible to display the two separated coordinated sets to check that the division worked fine (see above). The power of this method is that it doesn’t rely on imaging spindle poles or anything else to work out the orientation of the metaphase plate. It works well for metaphase cells, but cells with any misaligned chromosomes ruin the calculation. It is possible to remove these and still fit the plane, but for our analysis we focused on cells at metaphase with a defined plate.

What else can it be used for?

Other structures in the cell can be segregated in a similar way. For example, the Golgi apparatus has a trans and a cis side, which could be similarly divided (although using the 2nd eigenvector as normal to the plane, rather than the 3rd).

Acknowledgements: I’d like to thank A.G. at WaveMetrics Inc. for encouraging me to try PCA rather than my dispersion matrix approach.

If you want to use it, the code is available here (it seems I can only upload PDF at wordpress.com). I used pygments for annotation.

The post title comes from “Division Day” a great single by Elliott Smith.

## Tips from the blog III – violin plots

Having recently got my head around violin plots, I thought I would explain what they are and why you might want to use them.

There are several options when it comes to plotting summary data. I list them here in order of granularity, before describing violin plots and how to plot them in some detail.

Bar chart

This is the mainstay of most papers in my field. Typically, a bar representing the mean value that’s been measured is shown with an error bar which shows either the standard error of the mean, the standard deviation, or more rarely a confidence interval. The two data series plotted in all cases is the waiting time for Old Faithful eruptions (waiting), a classic dataset from R. I have added some noise to waiting (waiting_mod) for comparison. I think it’s fair to say that most people feel that the bar chart has probably had its day and that we should be presenting our data in more informative ways*.

Pros: compact, easy to tell differences between groups

Cons: hides the underlying distribution, obscures the n number

Box plot

The box plot – like many things in statistics – was introduced by Tukey. It’s sometimes known as a Tukey plot, or a box-and-whiskers plot. The idea was to give an impression of the underlying distribution without showing a histogram (see below). Histograms are great, but when you need to compare many distributions they do not overlay well and take up a lot of space to show them side-by-side. In the simplest form, the “box” is the interquartile range (IQR, 25th and 75th percentiles) with a line to show the median. The whiskers show the 10th and 90th percentiles. There are many variations on this theme: outliers can be shown or not, the whiskers may show the limits of the dataset (or something else), the boxes can be notched or their width may represent the sample size…

Pros: compact, easy to tell differences between groups, shows normality/skewness

Cons: hides multimodal data, sometimes obscures the n number, many variations

Histogram

A histogram is a method of showing the distribution of a dataset and was introduced by Pearson. The number of observations within a bin are counted and plotted. The ‘bars’ sit next to each other, because the variable being measured is continuous. The variable being measured is on the x-axis, rather than the category (as in the other plots).

Often the area of all the bars is normalised to 1 in order to assess the distributions without being confused by differences in sample size. As you can see here, “waiting” is bimodal. This was hidden in the bar chart and in the bot plot.

Related to histograms are other display types such as stemplots or stem-and-leaf plots.

Pros: shows multimodal data, shows normality/skewness clearly

Cons: not compact, difficult to overlay, bin size and position can be misleading

Scatter dot plot

It’s often said that if there are less than 10 data points, then best practice is to simply show the points. Typically the plot is shown together with a bar to show the mean (or median) and maybe with error bars showing s.e.m., s.d., IQR. There are a couple of methods of plotting the points, because they need to be scattered in x value in order to be visualised. Adding random noise is one approach, but this looks a bit messy (top). A symmetrical scatter can be introduced by binning (middle) and a further iteration is to bin the y values rather than showing their true location (bottom). There’s a further iteration which constrains the category width and overlays multiple points, but again the density becomes difficult to see.

These plots still look quite fussy, the binned version is the clearest but then we are losing the exact locations of the points, which seems counterintuitive. Another alternative to scattering the dots is to show a rug plot (see below) where there is no scatter.

Pros: shows all the observations

Cons: can be difficult to assess the distribution

Violin plot

This type of plot was introduced in the software package NCSS in 1997 and described in this paper: Hintze & Nelson (1998) The American Statistician 52(2):181-4 [PDF]. As the title says, violin plots are a synergism between box plot and density trace. A thin box plot is shown together with a symmetrical kernel density estimate (KDE, see explanation below). The point is to be able to quickly assess the distribution. You can see that the bimodality of waiting in the plot, but there’s no complication of lots of points just a smooth curve to see the data.

Pros: shows multimodal data, shows normality/skewness unambiguously

Cons: hides n, not familiar to many readers.

* Why is the bar chart so bad and why should I show my data another way?

The best demonstration of why the bar chart is bad is Anscombe’s Quartet (the figure to the right is taken from the Wikipedia page). These four datasets are completely different, yet they all have the same summary statistics. The point is, you would never know unless you plotted the data. A bar chart would look identical for all four datasets.

Making Violin Plots in IgorPro

I wanted to make Violin Plots in IgorPro, since we use Igor for absolutely everything in the lab. I wrote some code to do this and I might make some improvements to it in the future – if I find the time! This was an interesting exercise, because it meant forcing myself to understand how smoothing is done. What follows below is an aide memoire, but you may find it useful.

What is a kernel density estimate?

A KDE is a non-parametric method to estimate a probability density function of a variable. A histogram can be thought of as a simplistic non-parametric density estimate. Here, a rectangle is used to represent each observation and it gets bigger the more observations are made.

What’s wrong with using a histogram as a KDE?

The following examples are taken from here (which in turn are taken from the book by Bowman and Azzalini described below). A histogram is simplistic. We lose the location of each datapoint because of binning. Histograms are not smooth and the estimate is very sensitive to the size of the bins and also the starting location of the first bin. The histograms to the right show the same data points (in the rug plot).

Using the same bin size, they result in very different distributions depending on where the first bin starts. My first instinct to generate a KDE was to simply smooth a histogram, but this is actually quite inaccurate as it comes from a lossy source. Instead we need to generate a real KDE.

How do I make a KDE?

To do this we place a kernel (a Gaussian is commonly used) at each data point. The rationale behind this is that each observation can be thought of as being representative of a greater number of observations. It sounds a bit bizarre to assume normality to estimate a density non-parametrically, but it works. We can sum all of the kernels to give a smoothed distribution: the KDE. Easy? Well, yes as long as you know how wide to make the kernels. To do this we need to find the bandwidth, h (also called the smoothing parameter).

It turns out that this is not completely straightforward. The answer is summarised in this book: Bowman & Azzalini (1997) Applied Smoothing Techniques for Data Analysis. In the original paper on violin plots, they actually do not have a good solution for selecting h for drawing the violins, and they suggest trying several different values for h. They recommend starting at ~15% of the data range as a good starting point. Obviously if you are writing some code, the process of selecting h needs to be automatic.

Optimising h is necessary because if h is too large, the estimate with be oversmoothed and features will be lost. If is too small, then it will be undersmoothed and bumpy. The examples to the right (again, taken from Bowman & Azzalini, via this page) show examples of undersmoothed, oversmoothed and optimal smoothing.

An optimal solution to find h is

$$h = \left(\frac{4}{3n}\right)^{\frac{1}{5}}\sigma$$

This is termed Silverman’s rule-of-thumb. If smoothing is needed in more than one dimension, the multidimensional version is

$$h = \left\{\frac{4}{\left(p+2\right)n}\right\}^{\frac{1}{\left(p+4\right)}}\sigma$$

You might need multidimensional smoothing to contextualise more than one parameter being measured. The waiting data used above describes the time to wait until the next eruption from Old Faithful. The duration of the eruption is measured, and also the wait to the next eruption can be extracted, giving three parameters. These can give a 3D density estimate as shown here in the example.

The Bowman & Azzalini recommend that, if the distribution is long-tailed, using the median absolute deviation estimator is robust for $$\sigma$$.

$$\tilde\sigma=median\left\{|y_i-\tilde\mu|\right\}/0.6745$$

where $$\tilde\mu$$ is the median of the sample. All of this is something you don’t need to worry about if you use R to plot violins, the implementation in there is rock solid having been written in S plus and then ported to R years ago. You can even pick how the h selection is done from sm.density, or even modify the optimal h directly using hmult.

To get this working in IgorPro, I used some code for 1D KDE that was already on IgorExchange. It needed a bit of modification because it used FastGaussTransform to sum the kernels as a shortcut. It’s a very fast method, but initially gave an estimate that seemed to be undersmoothed. I spent a while altering the formula for h, hence the detail above. To cut a long story short, FastGaussTransform uses Taylor expansion of the Gauss transform and it just needed more terms to do this accurately. This is set with the /TET flag. Note also, that in Igor the width of a Gauss is sigma*2^1/2.

OK, so how do I make a Violin for plotting?

I used the draw tools to do this and placed the violins behind an existing box plot. This is necessary to be able to colour the violins (apparently transparency is coming to Igor in IP7). The other half of the violin needs to be calculated and then joined by the DrawPoly command. If the violins are trimmed, i.e. cut at the limits of the dataset, then this required an extra point to be added. Without trimming, this step is not required. The only other issue is how wide the violins are plotted. In R, the violins are all normalised so that information about n is lost. In the current implementation, box width is 0.1 and the violins are normalised to the area under the curve*(0.1/2). So, again information on n is lost.

Future improvements

Ideas for developments of the Violin Plot method in IgorPro

• incorporate it into the ipf for making boxplots so that it is integrated as an option to ‘calculate percentiles’
• find a better solution for setting the width of the violin
• add other bandwidth options, as in R
• add more options for colouring the violins

What do you think? Did I miss something? Let me know in the comments.

References

Bowman, A.W. & Azzalini, A. (1997) Applied Smoothing Techniques for Data Analysis : The Kernel Approach with S-Plus Illustrations: The Kernel Approach with S-Plus Illustrations. Oxford University Press.

Hintze, J.L. & Nelson, R.D. (1998) Violin plots: A Box Plot-Density Trace Synergism. The American Statistician, 52:181-4.

## My Favorite Things

I realised recently that I’ve maintained a consistent iTunes library for ~10 years. For most of that time I’ve been listening exclusively to iTunes, rather than to music in other formats. So the library is a useful source of information about my tastes in music. It should be possible to look at who are my favourite artists, what bands need more investigation, or just to generate some interesting statistics based on my favourite music.

Play count is the central statistic here as it tells me how often I’ve listened to a certain track. It’s the equivalent of a +1/upvote/fave/like or maybe even a citation. Play count increases by one if you listen to a track all the way to the end. So if a track starts and you don’t want to hear it and you skip on to the next song, there’s no +1. There’s a caveat here in that the time a track has been in the library, influences the play count to a certain extent – but that’s for another post*. The second indicator for liking a track or artist is the fact that it’s in the library. This may sound obvious, but what I mean is that artists with lots of tracks in the library are more likely to be favourite artists compared to a band with just one or two tracks in there. A caveat here is that some artists do not have long careers for a variety of reasons, which can limit the number of tracks actually available to load into the library. Check the methods at the foot of the post if you want to do the same.

What’s the most popular year? Firstly, I looked at the most popular year in the library. This question was the focus of an earlier post that found that 1971 was the best year in music. The play distribution per year can be plotted together with a summary of how many tracks and how many plays in total from each year are in the library. There’s a bias towards 90s music, which probably reflects my age, but could also be caused by my habit of collecting CD singles which peaked as a format in this decade. The average number of plays is actually pretty constant for all years (median of ~4), the mean is perhaps slightly higher for late-2000s music.

Favourite styles of music: I also looked at Genre. Which styles of music are my favourite? I plotted the total number of tracks versus the total number of plays for each Genre in the library. Size of the marker reflects the median number of plays per track for that genre. Most Genres obey a rule where total plays is a function of total tracks, but there are exceptions. Crossover, Hip-hop/Rap and Power-pop are highlighted as those with an above average number of plays. I’m not lacking in Power-pop with a few thousand tracks, but I should probably get my hands on more Crossover or Hip-Hop/Rap.

Using citation statistics to find my favourite artists: Next, I looked at who my favourite artists are. It could be argued that I should know who my favourite artists are! But tastes can change over a 10 year period and I was interested in an unbiased view of my favourite artists rather than who I think they are. A plot of Total Tracks vs Mean plays per track is reasonably informative. The artists with the highest plays per track are those with only one track in the library, e.g. Harvey Danger with Flagpole Sitta. So this statistic is pretty unreliable. Equally, I’ve got lots of tracks by Manic Street Preachers but evidently I don’t play them that often. I realised that the problem of identifying favourite artists based on these two pieces of information (plays and number of tracks) is pretty similar to assessing scientists using citation metrics (citations and number of papers). Hirsch proposed the h-index to meld these two bits of information into a single metric, the h-index. It’s easily computed and I already had an Igor procedure to calculate it en masse, so I ran it on the library information.

Before doing this, I consolidated multiple versions of the same track into one. I knew that I had several versions of the same track, especially as I have multiple versions of some albums (e.g. Pet Sounds = 3 copies = mono + stereo + a capella), the top offending track was “Baby’s Coming Back” by Jellyfish, 11 copies! Anyway, these were consolidated before running the h-index calculation.

The top artist was Elliott Smith with an h-index of 32. This means he has 32 tracks that have been listened to at least 32 times each. I was amazed that Muse had the second highest h-index (I don’t consider myself a huge fan of their music) until I remembered a period where their albums were on an iPod Nano used during exercise. Amusingly (and narcissistically) my own music – the artist names are redacted – scored quite highly with two out of three bands in the top 100, which are shown here. These artists with high h-indeces are the most consistently played in the library and probably constitute my favourite artists, but is the ranking correct?

The procedure also calculates the g-index for every artist. The g-index is similar to the h-index but takes into account very highly played tracks (very highly cited papers) over the h threshold. For example, The Smiths h=26. This could be 26 tracks that have been listened to exactly 26 times or they could have been listened to 90 times each. The h-index cannot reveal this, but the g-index gets to this by assessing average plays for the ranked tracks. The Smiths g=35. To find the artists that are most-played-of-the-consistently-most-played, I subtracted h from g and plotted the Top 50. This ranked list I think most closely represents my favourite artists, according to my listening habits over the last ten years.

Track length: Finally, I looked at the track length. I have a range of track lengths in the library, from “You Suffer” by Napalm Death (iTunes has this at 4 s, but Wikipedia says it is 1.36 s), through to epic tracks like “Blue Room” by The Orb. Most tracks are in the 3-4 min range. Plays per track indicates that this track length is optimal with most of the highly played tracks being within this window. The super-long tracks are rarely listened to, probably because of their length. Short tracks also have higher than average plays, probably because they are less likely to be skipped, due to their length.

These were the first things that sprang to mind for iTunes analysis. As I said at the top, there’s lots of information in the library to dig through, but I think this is enough for one post. And not a pie-chart in sight!

Methods: the library is in xml format and can be read/parsed this way. More easily, you can just select the whole library and copy-paste it into TextEdit and then load this into a data analysis package. In this case, IgorPro (as always). Make sure that the interesting fields are shown in the full library view (Music>Songs). To do everything in this post you need artist, track, album, genre, length, year and play count. At the time of writing, I had 21326 tracks in the library. For the “H-index” analysis, I consolidated multiple versions of the same track, giving 18684 tracks. This is possible by concatenating artist and the first ten characters of the track title (separated by a unique character) and adding the play counts for these concatenated versions. The artist could then be deconvolved (using the unique character) and used for the H-calculation. It’s not very elegant, but seemed to work well. The H-index and G-index calculations were automated (previously sort-of-described here), as was most of the plot generation. The inspiration for the colour coding is from the 2013 Feltron Report.

* there’s an interesting post here about modelling the ideal playlist. I worked through the ideas in that post but found that it doesn’t scale well to large libraries, especially if they’ve been going for a long time, i.e. mine.

The post title is taken from John Coltrane’s cover version of My Favorite Things from the album of the same name. Excuse the US English spelling.

## Belly Button Window

A bit of navel gazing for this post. Since moving the blog to wordpress.com in the summer, it recently accrued 5000 views. Time to analyse what people are reading…

The most popular post on the blog (by a long way) is “Strange Things“, a post about the eLife impact factor (2824 views). The next most popular is a post about a Twitter H-index, with 498 views. The Strange Things post has accounted for ~50% of views since it went live (bottom plot) and this fraction seems to be creeping up. More new content is needed to change this situation.

I enjoy putting blog posts together and love the discussion that follows from my posts. It’s also been nice when people have told me that they read my blog and enjoy my posts. One thing I didn’t expect was the way that people can take away very different messages from the same post. I don’t know why I found this surprising, since this often happens with our scientific papers! Actually, in the same way as our papers, the most popular posts are not the ones that I would say are the best.

Wet Wet Wet: I have thought about deleting the Strange Things post, since it isn’t really what I want this blog to be about. An analogy here is the Scottish pop-soul outfit Wet Wet Wet who released a dreadful cover of The Troggs’ “Love is All Around” in 1994. In the end, the band deleted the single in the hope of redemption, or so they said. Given that the song had been at number one for 15 weeks, the damage was already done. I think the same applies here, so the post will stay.

Directing Traffic: Most people coming to the blog are clicking on links on Twitter. A smaller number come via other blogs which feature links to my posts. A very small number come to the blog via a Google search. Google has changed the way it formats the clicks and so most of the time it is not possible to know what people were searching for. For those that I can see, the only search term is… yes, you’ve guessed it: “elife impact factor”.

Methods: WordPress stats are available for blog owners via URL formatting. All you need is your API key and (obviously) your blog address.

Instructions are found at http://stats.wordpress.com/csv.php

A basic URL format would be: http://stats.wordpress.com/csv.php?api_key=yourapikey&blog_uri=yourblogaddress replacing yourapikey with your API key (this can be retrieved at https://apikey.wordpress.com) and yourblogaddress with your blog address e.g. quantixed.wordpress.com

Various options are available from the first page to get the stats in which you are  interested. For example, the following can be appended to the second URL to get a breakdown of views by post title for the past year:

&table=postviews&days=365&limit=-1

The format can be csv, json or xml depending on how your preference for what you want to do next with the information.

The title is from “Belly Button Window” by Jimi Hendrix, a posthumous release on the Cry of Love LP.

## What The World Is Waiting For

The transition for scientific journals from print to online has been slow and painful. And it is not yet complete. This week I got an RSS alert to a “new” paper in Oncogene. When I downloaded it, something was familiar… very familiar… I’d read it almost a year ago! Sure enough, the AOP (ahead of print or advance online publication) date for this paper was September 2013 and here it was in the August 2014 issue being “published”.

I wondered why a journal would do this. It is possible that delaying actual publication would artificially boost the Impact Factor of a journal because there is a delay before citations roll in and citations also peak after two years. So if a journal delays actual publication, then the Impact Factor assessment window captures a “hotter” period when papers are more likely to generate more citations*. Richard Sever (@cshperspectives) jumped in to point out a less nefarious explanation – the journal obviously has a backlog of papers but is not allowed to just print more papers to catch up, due to page budgets.

There followed a long discussion about this… which you’re welcome to read. I was away giving a talk and missed all the fun, but if I may summarise on behalf of everybody: isn’t it silly that we still have pages – actual pages, made of paper – and this is restricting publication.

I wondered how Oncogene got to this position. I retrieved the data for AOP and actual publication for the last five years of papers at Oncogene excluding reviews, from Pubmed. Using oncogene[ta] NOT review[pt] as a search term. The field DP has the date published (the “issue date” that the paper appears in print) and PHST has several interesting dates including [aheadofprint]. These could be parsed and imported into IgorPro as 1D waves. The lag time from AOP to print could then be calculated. I got 2916 papers from the search and was able to get data for 2441 papers.

You can see for this journal that the lag time has been stable at around 300 days (~10 months) for issues published since 2013. So a paper AOP in Feb 2012 had to wait over 10 months to make it into print. This followed a linear period of lag time growth from mid-2010.

I have no links to Oncogene and don’t particularly want to single them out. I’m sure similar lags are happening at other print journals. Actually, my only interaction with Oncogene was that they sent this paper of ours out to review in 2011 (it got two not-negative-but-admittedly-not-glowing reviews) and then they rejected it because they didn’t like the cell line we used. I always thought this was a bizarre decision: why couldn’t they just decide that before sending it to review and wasting our time? Now, I wonder whether they were not keen to add to their increasing backlog of papers at their journal? Whatever the reason, it has put me off submitting other papers there.

I know that there are good arguments for continuing print versions of journals, but from a scientist’s perspective the first publication is publication. Any subsequent versions are simply redundant and confusing.

*Edit: Alexis Verger (@Alexis_Verger) pointed me to a paper which describes that, for neuroscience journals, the lag time has increased over time. Moreover, the authors suggest that this is for the purpose of maximising Journal Impact Factor.

The post title comes from the double A-side Fools Gold/What The World Is Waiting For by The Stone Roses.

## Tips from the Blog II

An IgorPro tip this week. The default font for plots is Geneva. Most of our figures are assembled using Helvetica for labelling. The default font can be changed in Igor Graph Preferences, but Preferences need to be switched on in order to be implemented. Anyway, I always seem to end up with a mix of Geneva plots and Helevetica plots. This can be annoying as the fonts are pretty similar yet the spacing is different and this can affect the plot size. Here is a quick procedure Helvetica4All() to rectify this for all graph windows.

## Six Plus One

Last week, ALM (article-level metric) data for PLoS journals were uploaded to Figshare with the invitation to do something cool with it.

Well, it would be rude not to. Actually, I’m one of the few scientists on the planet that hasn’t published a paper with Public Library of Science (PLoS), so I have no personal agenda here. However, I love what PLoS is doing and what it has achieved to disrupt the scientific publishing system. Anyway, what follows is not in any way comprehensive, but I was interested to look at a few specific things:

1. Is there a relationship between Twitter mentions and views of papers?
2. What is the fraction of views that are PDF vs HTML?
3. Can citations be predicted by more immediate article level metrics?

The tl;dr version is 1. Yes. 2. ~20%. 3. Can’t say but looks unlikely.

1. Twitter mentions versus paper views

2. Fraction of PDF vs HTML views

I asked a few people what they thought the download ratio is for papers. Most thought 60-75% as PDF versus 40-25% HTML. I thought it would be lower, but I was surprised to see that it is, at most, 20% for PDF. The plot below shows the fraction of PDF downloads (counter_pdf/(counter_pdf+counter_html)). For all PLoS journals, and then broken down for PLoS Biol, PLoS ONE.

This was a surprise to me. I have colleagues who don’t like depositing post-print or pre-print papers because they say that they prefer their work to be seen typeset in PDF format. However, this shows that, at least for PLoS journals, the reader is choosing to not see a typeset PDF at all, but a HTML version.

Maybe the PLoS PDFs are terribly formatted and 80% people don’t like them. There is an interesting comparison that can be done here, because all papers are deposited at Pubmed Central (PMC) and so the same plot can be generated for the PDF fraction there. The PDF format is different to PLoS and so we can test the idea that people prefer HTML over PDF at PLoS because they don’t like the PLoS format.

The fraction of PDF downloads is higher, but only around 30%. So either the PMC format is just as bad, or this is the way that readers like to consume the scientific literature. A colleague mentioned that HTML views are preferable to PDF if you want to actually want to do something with the data, e.g. for meta-analysis. This could have an effect. HTML views could be skim reading, whereas PDF is for people who want to read in detail… I wonder whether these fractions are similar at other publishers, particularly closed access publishers?

3. Citation prediction?

ALMs are immediate whereas citations are slow. If we assume for a moment that citations are a definitive means to determine the impact of a paper (which they may not be), then can ALMs predict citations? This would make them very useful in the evaluation of scientists and their endeavours. Unfortunately, this dataset is not sufficient to answer this properly, but with multiple timepoints, the question could be investigated. I looked at number of paper downloads and also the Mendeley score to see how these two things may foretell citations. What follows is a strategy to do this is an unbiased way with few confounders.

The dataset has a Scopus column, but for some reason these data are incomplete. It is possible to download data (but not on this scale AFAIK) for citations from Web of Science and then use the DOI to cross-reference to the other dataset. This plot shows the Scopus data as a function of “Total Citations” from Web of Science, for 500 papers. I went with the Web of Science data as this appears more robust.

The question is whether there is a relationship between downloads of a paper (Counter, either PDF or HTML) and citations. Or between Mendeley score and citations. I figured that downloading, Mendeley and citation, show three progressive levels of “commitment” to a paper and so they may correlate differently with citations. Now, to look at this for all PLoS journals for all time would be silly because we know that citations are field-specific, journal-specific, time-sensitive etc. So I took the following dataset from Web of Science: the top 500 most-cited papers in PLoS ONE for the period of 2007-2010 limited to “cell biology”. By cross-referencing I could check the corresponding values for Counter and for Mendeley.

I was surprised that the correlation was very weak in both cases. I thought that the correlation would be stronger with Mendeley, however signal-to-noise is a problem here with few users of the service compared with counting downloads. Below each plot is a ranked view of the papers, with the Counter or Mendeley data presented as a rolling average. It’s a very weak correlation at best. Remember that this is post-hoc. Papers that have been cited more would be expected to generate more views and higher Mendeley scores, but this is not necessarily so. Predicting future citations based on Counter or Mendeley, will be tough. To really know if this is possible, this approach needs to be used with multiple ALM timepoints to see if there is a predictive value for ALMs, but based on this single timepoint, it doesn’t seem as though prediction will be possible.

Methods: To crunch the numbers for yourself, head over to Figshare and download the csv. A Web of Science subscription is needed for the citation data. All the plots were generated in IgorPro, but no programming is required for these comparisons and everything I’ve done here can be easily done in Excel or another package.

Edit: Matt Hodgkinson (@mattjhodgkinson) Snr Ed at PLoS ONE told me via Twitter that all ALM data (periodically updated) are freely available here. This means that some of the analyses I wrote about are possible.

The post title comes from Six Plus One a track on Dad Man Cat by Corduroy. Plus is as close to PLoS as I could find in my iTunes library.

## You Know My Name (Look Up The Number)

This thought crossed my mind yesterday when I saw a tweet that was tagged #academicinsults

It occurred to me that a Twitter account is a kind of micro-publishing platform. So what would “publication metrics” look like for Twitter? Twitter makes analytics available, so they can easily be crunched. The main metrics are impressions and engagements per tweet. As I understand it, impressions are the number of times your tweet is served up to people in their feed (boosted by retweets). Engagements are when somebody clicks on the tweet (either a link or to see the thread or whatever). In publication terms, impressions would equate to people downloading your paper and engagements mean that they did something with it, like cite it. This means that a “h-index” for engagements can be calculated with these data.

For those that don’t know, the h-index for a scientist means that he/she has h papers that have been cited h or more times. The Twitter version would be a tweeter that has h tweets that were engaged with h or more times. My data is shown here:

My twitter h-index is currently 36. I have 36 tweets that have been engaged with 36 or more times.

So, this is a lot higher than my actual h-index, but obviously there are differences. Papers accrue citations as time goes by, but the information flow on Twitter is so fast that tweets don’t accumulate engagement over time. In that sense, the Twitter h-index is less sensitive to the time a user has been active on Twitter, versus the real h-index which is strongly affected by age of the scientist. Other differences include the fact that I have “published” thousands of tweets and only tens of papers. Also, whether or not more people read my tweets compared to my papers… This is not something I want to think too much about, but it would affect how many engagements it is possible to achieve.

The other thing I looked at was whether replying to somebody actually means more engagement. This would skew the Twitter h-index. I filtered tweets that started with an @ and found that this restricts who sees the tweet, but doesn’t necessarily mean more engagement. Replies make up a very small fraction of the h tweets.

I’ll leave it to somebody else to calculate the Impact Factor of Twitter. I suspect it is very low, given the sheer volume of tweets.

Please note this post is just for fun. Normal service will (probably) resume in the next post.

Edit: As pointed out in the comments this post is short on “Materials and Methods”. If you want to calculate your ownTwitter h-index, go here. When logged in to Twitter, the analytics page should present your data (it may take some time to populate this page after you first view it). A csv can be downloaded from the button on the top-right of the page. I imported this into IgorPro (as always) to generate the plots. The engagements data need to be sorted in descending order and then the h-index can be found by comparing the numbers with their ranked position.

The post title is from the quirky B-side to the Let It Be single by The Beatles.

## Round and Round

I thought I’d share a procedure for rotating a 2D set of coordinates about the origin. Why would you want do this? Well, we’ve been looking at cell migration in 2D – tracking nuclear position over time. Cells migrate at random and I previously blogged about ways to visualise these tracks more clearly. Part of this earlier procedure was to set the start of each track at (0,0). This gives a random hairball of tracks moving away from the origin. Wouldn’t it be a good idea to orient all the tracks so that the endpoint lies on the same axis? This would simplify the view and allow one to assess how ‘directional’ the cell tracks are. To rotate a set of coordinates, you need to use a rotation matrix. This allows you to convert the x,y coordinates to their new position x’,y’. This rotation is counter-clockwise.

$$x’ = x \cos \theta – y \sin \theta\,$$

$$y’ = x \sin \theta + y \cos \theta\,$$

However, we need to find theta first. To do this we need to find the angle between two lines, using this formula.

$$\cos \theta = \frac {\mathbf a \cdot \mathbf b}{\left \Vert {\mathbf a} \right \Vert \cdot \left \Vert {\mathbf b} \right \Vert}$$

The maths is kept to a minimum here. If you are interested, look at the code at the bottom.

The two lines (a and b) are formed by the x-axis (origin to some point on the x-axis, i.e. y=0) and by a line running from the origin to the last coordinate in the series. This calculation can be done for each track with theta for each track being used to rotate the that whole track (x,y changed to x’,y’ for each point).

Here is an example of just a few tracks from an experiment. Typically we have hundreds of tracks for each experimental group and the code will blast through them all very quickly (<1 s).

After rotation, the tracks are now aligned so that the last point is on the x-axis at y=0. This allows us to see how ‘directional’ the tracks are. The end points are now aligned, when they migrated there, how convoluted was their path.

The code to do this is up on Igor Exchange code snippets. A picture of the code is below (markup for code in WordPress is not very clear). See the code snippet if you want to use it.

The weakness of this method is that acos (arccos) only gives results from 0 to Pi (0 to 180°). There is a correction in the procedure, but everything needs editing if you want to rotate the co-ordinates to some other plane. Feedback welcome.

Edit Jim Prouty and A.G. have suggested two modifications to the code. The first is to use complex waves rather than 2D real waves. Then use two native Igor functions r2polar or p2rect. The second suggestion is to use Matrix operations! As is often the case with Igor there are several ways of doing things. The method described here is long-winded compared to a MatrixOp and if the waves were huge these solutions would be much, much faster. As it is, our migration movies typically have 60 points and as mentioned rotator() blasts through them very quickly. More complex coordinate sets would need something more sophisticated.

The post title is taken from “Round & Round” by New Order from their Technique LP.

## Sure To Fall

What does the life cycle of a scientific paper look like?

It stands to reason that after a paper is published, people download and read the paper and then if it generates sufficient interest, it will begin to be cited. At some point these citations will peak and the interest will die away as the work gets superseded or the field moves on. So each paper has a useful lifespan. When does the average paper start to accumulate citations, when do they peak and when do they die away?

Citation behaviours are known to be very field-specific. So to narrow things down, I focussed on cell biology and in one area “clathrin-mediated endocytosis” in particular. It’s an area that I’ve published in – of course this stuff is driven by self-interest. I downloaded data for 1000 papers from Web of Science that had accumulated the most citations. Reviews were excluded, as I assume their citation patterns are different from primary literature. The idea was just to take a large sample of papers on a topic. The data are pretty good, but there are some errors (see below).

Number-crunching (feel free to skip this bit): I imported the data into IgorPro making a 1D wave for each record (paper). I deleted the last point corresponding to cites in 2014 (the year is not complete). I aligned all records so that year of publication was 0. Next, the citations were normalised to the maximum number achieved in the peak year. This allows us to look at the lifecycle in a sensible way. Next I took out records to papers less than 6 years old as I reasoned these would have not have completed their lifecycle and could contaminate the analysis (it turned out to make little difference). The lifecycles were plotted and averaged. I also wrote a quick function to pull out the peak year for citations post hoc.

So what did it show?

Citations to a paper go up and go down, as expected (top left). When cumulative citations are plotted most of the articles have an initial burst and then level off. The exception are ~8 articles that continue to rise linearly (top right). On average a paper generates its peak citations three years after publication (box plot). The fall after this peak period is pretty linear and it’s apparently all over somewhere >15 years after publication (bottom left). To look at the decline in more detail I aligned the papers so that year 0 was the year of peak citations. The average now loses almost 40% of those peak citations in the following year and then declines steadily (bottom right).

Edit: The dreaded Impact Factor calculation takes the citations to articles published in the preceding 2 years and divides by the number of citable items in that period. This means that each paper only contributes to the Impact Factor in years 1 and 2. This is before the average paper reaches its peak citation period. Thanks to David Stephens (@david_s_bristol) for pointing this out. The alternative 5 year Impact Factor gets around this limitation.

Perhaps lifecycle is the wrong term: papers in this dataset don’t actually ‘die’, i.e. go to 0 citations. There is always a chance that a paper will pick up the odd citation. Papers published 15 years ago are still clocking 20% of their peak citations. Looking at papers cited at lower rates would be informative here.

Two other weaknesses that affect precision is that 1) a year is a long time and 2) publication is subject to long lag times. The analysis would be improved by categorising the records based on the month-year when the paper was published and the month-year when each citation comes in. Papers published in January in one year probably have a different peak than those published in December of the same year, but this is lost when looking at year alone. Secondly, due to publication lag, it is impossible to know when the peak period of influence for a paper truly is.
Problems in the dataset. Some reviews remained despite being supposedly excluded, i.e. they are not properly tagged in the database. Also, some records have citations from years before the article was published! The numbers of citations are small enough to not worry for this analysis, but it makes you wonder about how accurate the whole dataset is. I’ve written before about how complete citation data may or may not be. These sorts of things are a concern for all of us who are judged by these things for hiring and promotion decisions.

The post title is taken from ‘Sure To Fall’ by The Beatles, recorded during The Decca Sessions.