The Digital Cell: Statistical tests

Statistical hypothesis testing, commonly referred to as “statistics”, is a topic of consternation among cell biologists.

This is a short practical guide I put together for my lab. Hopefully it will be useful to others. Note that statistical hypothesis testing is a huge topic and one post cannot hope to cover everything that you need to know.

What statistical test should I do?

To figure out what statistical test you need to do, look at the table below. But before that, you need to ask yourself a few things.

  • What are you comparing?
  • What is n?
  • What will the test tell you? What is your hypothesis?
  • What will the p value (or other summary statistic) mean?

If you are not sure about any of these things, whichever test you do is unlikely to tell you much.

The most important question is: what type of data do you have? This will help you pick the right test.

  • Measurement – most data you analyse in cell biology will be in this category. Examples are: number of spots per cell, mean GFP intensity per cell, diameter of nucleus, speed of cell migration…
    • Normally-distributed – this means it follows a “bell-shaped curve” otherwise called “Gaussian distribution”.
    • Not normally-distributed – data that doesn’t fit a normal distribution: skewed data, or better described by other types of curve.
  • Binomial – this is data where there are two possible outcomes. A good example here in cell biology would be a mitotic index measurement (the proportion of cells in mitosis). A cell is either in mitosis or it is not.
  • Other – maybe you have ranked or scored data. This is not very common in cell biology. A typical example here would be a scoring chart for a behavioural effect with agreed criteria (0 = normal, 5 = epileptic seizures). For a cell biology experiment, you might have a scoring system for a phenotype, e.g. fragmented Golgi (0 = not fragmented, 5 = totally dispersed). These arbitrary systems are not a good idea, especially if the person scoring is not blinded to the experimental conditions. Try to come up with an unbiased measurement procedure instead.

 

| What do you want to do? | Measurement (Normal) | Measurement (not Normal) | Binomial |
| --- | --- | --- | --- |
| Describe one group | Mean, SD | Median, IQR | Proportion |
| Compare one group to a value | One-sample t-test | Wilcoxon test | Chi-square |
| Compare two unpaired groups | Unpaired t-test | Wilcoxon-Mann-Whitney two-sample rank test | Fisher’s exact test or Chi-square |
| Compare two paired groups | Paired t-test | Wilcoxon signed rank test | McNemar’s test |
| Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test |
| Compare three or more matched groups | Repeated-measures ANOVA | Friedman test | Cochran’s Q test |
| Quantify association between two variables | Pearson correlation | Spearman correlation | |
| Predict value from another measured variable | Simple linear regression | Nonparametric regression | Simple logistic regression |
| Predict value from several measured or binomial variables | Multiple linear (or nonlinear) regression | | Multiple logistic regression |

Modified from Table 37.1 (p. 298) in Intuitive Biostatistics by Harvey Motulsky, 1995 OUP.

What do “paired/unpaired” and “matched/unmatched” mean?

Most of the data you will get in cell biology is unpaired or unmatched. Individual cells are measured and you have, say, 20 cells in the control group and 18 different cells in the test group. These are unpaired (or unmatched in the case of more than one test group) because the cells are different in each group. If you had the same cell in two (or more) groups, the data would be paired (or matched). An example of a paired dataset would be where you have 10 cells that you treat with a drug. You take a measurement from each of them before treatment and a measurement after. So you have paired measurements: one for cell A before treatment, one after; one for cell B before and after, and so on.

How to do some of these tests in IgorPRO

The examples below assume that you have values in waves called data0, data1, data2, …; substitute your actual wave names.

Is it normally distributed?

The simplest way is to plot the data and see. You can plot out your data using Analysis>Histogram… or Analysis>Packages>Percentiles and BoxPlot… Another possibility is to look at the skewness or kurtosis of the dataset (you can do this with WaveStats, see below).

However, if you only have a small number of measurements, or you want to be sure, you can do a test. There are several tests you can do (Kolmogorov-Smirnov, Jarque-Bera, Shapiro-Wilk). The easiest to do and most intuitive (in Igor) is Shapiro-Wilk.


StatsShapiroWilkTest data0

If p < 0.05, you reject the hypothesis that the data are normally distributed. Statistical tests on normally distributed data are called parametric, while those on non-normally distributed data are non-parametric.

Describe one group

To get the mean and SD (and lots of other statistics from your data):


Wavestats data0

To get the median and IQR:


StatsQuantiles/ALL data0

The mean and SD are also stored as variables (V_avg, V_sdev). StatsQuantiles calculates V_median, V_Q25, V_Q75, V_IQR, etc. Note that you can just get the median by typing Print StatsMedian(data0) or – in Igor7 – Print median(data0). There is often more than one way to do something in Igor.
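If you want to reuse these numbers rather than read them off the history window, here is a minimal sketch using the variables named above:

WaveStats/Q data0                // /Q suppresses the printout; the results still go into V_avg, V_sdev, etc.
Print "mean =", V_avg, "SD =", V_sdev
StatsQuantiles/ALL data0
Print "median =", V_median, "IQR =", V_IQR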

Compare one group to a value

It is unlikely that you will need to do this. In cell biology, most of the time we do not have hypothetical values for comparison, we have experimental values from appropriate controls. If you need to do this:


StatsTTest/CI/T=1 data0

Compare two unpaired groups

Use this for normally distributed data where you have test versus control, with no other groups. For paired data, use the additional flag /PAIR.


StatsTTest/CI/T=1 data0,data1

For the non-parametric equivalent: if n is large, the computation takes a long time; use the additional flag /APRX=2. If the data are paired, use the additional flag /WSRT.


StatsWilcoxonRankTest/T=1/TAIL=4 data0,data1
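If your two groups are paired (the same cells measured before and after treatment, as described above), the /PAIR and /WSRT flags mentioned in the text slot into the same commands:

StatsTTest/CI/T=1/PAIR data0,data1
StatsWilcoxonRankTest/T=1/TAIL=4/WSRT data0,data1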

For binomial data, your waves will have two points: point 0 corresponds to one outcome and point 1 to the other. Note that you can compare to expected values here; for example, a genetic cross experiment can be compared to expected Mendelian frequencies. To do Fisher’s exact test, you need a 2D wave representing a contingency table. McNemar’s test for paired binomial data is not available in Igor.

StatsChiTest/S/T=1 data0,data1
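As a sketch of the expected-values comparison mentioned above, here is a hypothetical cross scored against a 9:3:3:1 Mendelian expectation (the counts are made up for illustration):

Make/O observed = {95, 27, 30, 8}      // hypothetical counts for the four phenotype classes
Make/O expected = {90, 30, 30, 10}     // 160 offspring split 9:3:3:1
StatsChiTest/S/T=1 observed,expected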

If you have more than two groups, do not do multiple versions of these tests, use the correct method from the table.

Compare three or more unmatched groups

For normally-distributed data, you need to do a one-way ANOVA followed by a post-hoc test. The ANOVA tells you whether there are any differences among the groups; the post-hoc test tells you which groups differ. There are several post-hoc tests available, e.g. Dunnett’s is useful where you have one control group and a bunch of test conditions. We tend to use Tukey’s post-hoc comparison (the /NK flag also does the Newman-Keuls test).


StatsAnova1Test/T=1/Q/W/BF data0,data1,data2,data3
StatsTukeyTest/T=1/Q/NK data0,data1,data2,data3

The non-parametric equivalent is the Kruskal-Wallis test followed by a multiple comparison test; the Dunn-Holland-Wolfe method is used.


StatsKWTest/T=1/Q data0,data1,data2,data3
StatsNPMCTest/T=1/DHW/Q data0,data1,data2,data3

Compare three or more matched groups

It’s unlikely that this kind of data will be obtained in a typical cell biology experiment.

StatsANOVA2RMTest/T=1 data0,data1,data2,data3

There are also operations for StatsFriedmanTest and StatsCochranTest.

Correlation

This is a straightforward command for two waves or one 2D wave; the waves (or columns) must be of the same length.


StatsCorrelation data0

At this point, you probably want to plot out the data and use Igor’s fitting functions. The best way to get started is with the example experiment, or just display your data and Analysis>Curve Fitting…
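For example, if data0 holds the x values and data1 the corresponding y values, a quick straight-line fit from the command line might look like this (a minimal sketch; the Curve Fitting dialog generates something similar):

Display data1 vs data0            // scatter plot of the two waves
CurveFit line, data1/X=data0/D    // fit a straight line and append the fit to the graph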

Hazard and survival data

In the lab we have, in the past, done survival/hazard analysis. This is a bit more complex; we used SPSS and would do so again, as Igor does not provide these functions.

Notes for use

The good news is that all of this is a lot more intuitive in Igor 7! There is a new menu item called Statistics, where most of these functions have a dialog with more information. In Igor 6.3 you are stuck with the command line. Igor 7 will be out soon (July 2016).

  • Note that there are further options to most of these commands, if you need to see them
    • check the manual or Igor Help
    • or type ShowHelpTopic “StatsMedian” in the Command Window (put whatever command you want help with between the quotes).
  • Extra options are specified by “flags”; these are things like “/Q” that come after the command. For example, /Q means “quiet”, i.e. don’t print the output into the history window.
  • You should always either print the results to the history or put them into a table so that we can check them. Note that the table gets overwritten if you do the same test with different data, so printing in this case is a good idea.
  • The defaults in Igor are set up OK for our needs. For example, Igor does two-tailed comparison, alpha = 0.05, Welch’s correction, etc.
  • Most operations can handle waves of different length (or have flags set to handle this case).
  • If you are used to doing statistical tests in Excel, you might be wondering about tails and equal variances. The flags are set in the examples to do two-tailed analysis and unequal variances are handled by Welch’s correction.
  • There’s a school of thought that says using non-parametric tests is the cautious option. However, these tests are less powerful, so it is best to use parametric tests (t-test, ANOVA) when you can.

Part of a series on the future of cell biology in quantitative terms.

The Digital Cell: Getting started with IgorPRO

This post follows on from “Getting Started”.

In the lab we use IgorPRO for pretty much everything. We have many analysis routines that run in Igor, we have scripts for processing microscope metadata etc, and we use it for generating all figures for our papers. Even so, people in the lab engage with it to varying extents. The main battle is that the use of Excel is pretty ubiquitous.

I am currently working on getting more people in the lab started with using Igor. I’ve found that everyone is keen to learn. The approach so far has been workshops to go through the basics. This post accompanies the first workshop, which is coupled to the first few pages of the Manual. If you’re interested in using Igor read on… otherwise you can skip to the part where I explain why I don’t want people in the lab to use Excel.

IgorPro is very powerful and the learning curve is steep, but the investment is worth it.

These are some of the things that Igor can do: publication-quality graphics, high-speed data display, the ability to handle large data sets, curve fitting, Fourier transforms, smoothing, statistics and other data analysis, waveform arithmetic, matrix math, image display and processing, a combined graphical and command-line user interface, automation and data processing via a built-in programming environment, and extensibility through modules written in the C and C++ languages. You can even play games in it!

The basics

The first thing to learn is about the objects in the Igor environment and how they work. There are four basic objects that all Igor users will encounter straight away.

  • Waves
  • Graphs
  • Tables
  • Layouts

All data is stored as waveforms (or waves for short). Waves can be displayed in graphs or tables. Graphs and tables can be placed in a Layout. This is basically how you make a figure.
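As a tiny, hypothetical example of that workflow (the wave name is made up):

Make/N=100 myWave = gnoise(1)    // a wave of 100 random values
Display myWave                   // show the wave in a graph
Edit myWave                      // show the same wave in a table
NewLayout                        // an empty layout, ready for the graph and table to be appended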

The next things to check out are the command window (which displays the history), the data browser and the procedure window.

Essential IgorPro

  • Tables are not spreadsheets! This is the most important thing to understand. Tables are just a way of displaying a wave. They may look like a spreadsheet, but they are not.
  • Igor is case insensitive.
  • Spaces. Igor can handle spaces in the names of objects, but IMO they are best avoided.
  • Igor is 0-based, not 1-based.
  • Logical naming and logical thought – beginners struggle with this and it’s difficult to get this right when you are working on a project, but consistent naming of objects makes life easier.
  • Programming versus not programming – you can get a long way without programming but at some point it will be necessary and it will save you a lot of time.

Pretty soon, you will go beyond the four basic objects and encounter other things. These include: Numeric and string variables, Data folders, Notebooks, Control panels, 3D plots – a.k.a. gizmo, Procedures.

Getting started guide

Why don’t we use Excel?

  • Excel can’t make high quality graphics for publication.
    • We do that in Igor.
    • So any effort in Excel is a waste of time.
  • Excel is error-prone.
    • Too easy for mistakes to be introduced.
    • Not auditable. Tough/impossible to find mistakes.
    • Igor has a history window that allows us to see what has happened.
  • Most people don’t know how to use it properly.
  • Not good for biological data – Transcription factor Oct4 gets converted to a date.
  • Limited to 1048576 rows and 16384 columns.
  • Related: useful link describing some spreadsheet crimes of data entry.

But we do use Excel a lot

  • Excel is useful for quick calculations and for preparing simple charts to show at lab meeting.
  • Same way that Powerpoint is OK to do rough figures for lab meeting.
  • But neither are publication-quality.
  • We do use Excel for Tracking Tables, Databases(!) etc.

The transition is tough, but worth it

Writing formulae in Excel is straightforward, and the first thing you will find is that to achieve the same thing in Igor is more complicated. For example, working out the mean for each row in an array (A1:Y20) in Excel would mean typing =AVERAGE(A1:Y1) in cell Z1 and copying this cell down to Z20. Done. In Igor there are several ways to do this, which itself can be unnerving. One way is to use the Waves Average panel. You need to know how this works to get it to do what you want.
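For comparison, one way to get those row means from the Igor command line (a minimal sketch, assuming the block of values lives in a single 2D wave with the hypothetical name data2D, and offered as an alternative to the Waves Average panel):

MatrixOp/O rowMeans = sumRows(data2D)    // sum across each row of the 2D wave
rowMeans /= DimSize(data2D, 1)           // divide by the number of columns to get the mean per row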

But before you turn back, thinking I’ll just do this in Excel and then import it… imagine you now want to subtract a baseline value from the data, scale it and then average. Imagine that your data are sampled at different intervals. How would you do that? Dealing with those simple cases in Excel is difficult-to-impossible. In Igor, it’s straightforward.

Resources for learning more Igor:

  • Igor Help – fantastic resource containing the manual and more. Access via Help or by typing ShowHelpTopic “thing I want to search for”.
  • Igor Manual – This PDF is available online or in Applications/Igor Pro/Manual. This used to be distributed as a hard copy… it is now ~3000 pages.
  • Guided Tour of IgorPro – this is a great way to start and will form the basis of the workshops.
  • Demos – Igor comes packed with Demos for most things from simple to advanced applications.
  • IgorExchange – Lots of code snippets and a forum to ask for advice or search for past answers.
  • Igor Tips – I’ve honestly never used these; you can turn on tips in Igor, which reveal help on mouseover.
  • Igor mailing list – topics discussed here are pretty advanced.
  • Introduction to IgorPRO from Payam Minoofar is good. A faster start to learning to program than reading the manual.
  • Hands-on experience!

Part of a series on the future of cell biology in quantitative terms.

Everything In Its Right Place

Something that has driven me nuts for a while is the bug in FIJI/ImageJ when making montages of image stacks. This post is about a solution to this problem.

What’s a montage?

You have a stack of images and you want to array them in m rows by n columns. This is useful for showing a gallery of each frame in a movie or to separate the channels in a multichannel image.

What’s the bug/feature in ImageJ?

If you select Image>Stacks>Make Montage… you can specify how you want to lay out your montage. You can specify a “border” for this. Let’s say we have a stack of 12 images that are 300 x 300 pixels. Let’s arrange them into 3 rows and 4 columns with 0 border.

So far so good. We have an image that is 1200 x 900. But it looks a bit rubbish; we need some grouting (white pixel space between the images). We don’t need a border, but let’s ignore that for the moment. So the only way to do this in ImageJ is to specify a border of 8 pixels.

Looks a lot better. OK, there’s a border around the outside, which is no use, but it looks good. But wait a minute! Check out the size of the image (1204 x 904). This is only 4 pixels bigger in x and y, yet we added all that grouting. What’s going on?

The montage is not pixel perfect.


So the first image is not 300 x 300 any more. It is 288 x 288. Hmmm, maybe we can live with losing some data… but what’s this?


The next image in the row is not even square! It’s 292 x 288. How much this annoys you will depend on how much you like things being correct… The way I see it, this is science, if we don’t look after the details, who will? If I start with 300 x 300 images, it’s not too much to ask to end up with 300 x 300 images, is it? I needed to fix this.

Solutions

I searched for a while for a solution. It had clearly bothered other people in the past, but I guess people just found their own workaround.

ImageJ solution for multichannel array

So for a multichannel image, where the grayscale images are arrayed next to the merge, I wrote something in ImageJ to handle this. These macros are available here. There is a macro for doing the separation and arraying. Then there is a macro to combine these into a bigger figure.

Igor solution

For the exact case described above, where large stacks need to be tiled out into an m x n array, I have to admit I struggled to write something for ImageJ and instead wrote something for IgorPRO. Specifying 3 rows, 4 columns and a grout of 8 pixels gives the correct TIFF of 1224 x 916, with each frame showing in full and square. The code is available here; it works for 8-bit greyscale and RGB images.
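The gist of the Igor approach is to compute the output size exactly from the frame size, grid and grout, and then copy each frame into place. Here is a minimal sketch for an 8-bit greyscale stack (the function and wave names are hypothetical; this is not the code in the repo):

Function PerfectMontage(stack, nRows, nCols, grout)
	Wave stack                     // 3D wave, e.g. 300 x 300 x 12, 8-bit greyscale
	Variable nRows, nCols, grout
	Variable w = DimSize(stack, 0), h = DimSize(stack, 1)
	Variable outW = nCols * w + (nCols - 1) * grout    // e.g. 4 * 300 + 3 * 8 = 1224
	Variable outH = nRows * h + (nRows - 1) * grout    // e.g. 3 * 300 + 2 * 8 = 916
	Make/O/B/U/N=(outW, outH) montage = 255            // white background provides the grouting
	Variable i, r, c
	for(i = 0; i < DimSize(stack, 2); i += 1)
		r = floor(i / nCols)
		c = mod(i, nCols)
		// copy frame i into its tile, untouched and pixel-perfect
		montage[c * (w + grout), c * (w + grout) + w - 1][r * (h + grout), r * (h + grout) + h - 1] = stack[p - c * (w + grout)][q - r * (h + grout)][i]
	endfor
End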


I might update the code at some point to make sure it can handle all data types and to allow labelling and adding of a scale bar etc.

The post title is taken from “Everything In Its Right Place” by Radiohead from album Kid A.

Adventures in code II

I needed to generate a uniform random distribution of points inside a circle and, later, a sphere. This is part of a bigger project, but the code to do this is kind of interesting. There were no solutions available for IgorPro, but stackexchange had plenty of examples in python and mathematica. There are many ways to do this. The most popular seems to be to generate a uniform random set of points in a square or cube and then discard those that are greater than the radius away from the origin. I didn’t like this idea, because I needed to extend it to spheroids eventually, and as I saw it the computation time saved was minimal.

Here is the version for points in a circle (radius = 1, centred on the origin).

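The post shows the original code as an image; here is a minimal sketch of the same idea (an approximation, not the published code), using the enoise trick described below:

Function MakeCirclePoints(nn)
	Variable nn                                   // number of points
	Make/O/N=(nn) rw = sqrt(0.5 + enoise(0.5))    // sqrt of uniform(0,1): uniform density by area
	Make/O/N=(nn) tw = pi + enoise(pi)            // uniform angle on (0, 2*pi)
	Make/O/N=(nn) xw = rw[p] * cos(tw[p])
	Make/O/N=(nn) yw = rw[p] * sin(tw[p])
	KillWaves/Z rw, tw
End

Calling MakeCirclePoints(1000) gives the kind of point set shown below.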

This gives a nice set of points, 1000 shown here.

[pointsCircle: 1000 random points in the unit circle]

And here is the version inside a sphere. This code has variable radius for the sphere.

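Again, a sketch of the idea rather than the published code, this time with a variable radius:

Function MakeSpherePoints(nn, radius)
	Variable nn, radius
	Make/O/N=(nn) rw = radius * (0.5 + enoise(0.5))^(1/3)    // cube root of uniform(0,1): uniform density by volume
	Make/O/N=(nn) cw = enoise(1)                             // cos(polar angle), uniform on (-1, 1)
	Make/O/N=(nn) tw = pi + enoise(pi)                       // azimuth, uniform on (0, 2*pi)
	Make/O/N=(nn) xw = rw[p] * sqrt(1 - cw[p]^2) * cos(tw[p])
	Make/O/N=(nn) yw = rw[p] * sqrt(1 - cw[p]^2) * sin(tw[p])
	Make/O/N=(nn) zw = rw[p] * cw[p]
	KillWaves/Z rw, cw, tw
End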

The three waves (xw,yw,zw) can be concatenated and displayed in a Gizmo. The code just plots out the three views.

[pointsSphere: three orthogonal views of the random points in a sphere]

My code uses var + enoise(var) to get a random variable from 0 to var. This is because enoise goes from -var to +var. There is an interesting discussion about whether this is a truly flat PDF here.

This is part of a bigger project where I’ve had invaluable help from Tom Honnor from Statistics.

This post is part of a series on esoterica in computer programming.

Weak Superhero: how to win and lose at Marvel Top Trumps

Top Trumps is a card game for children. The mind can wander when playing such games with kids… typically, I start thinking: what is the best strategy for this game? But also, as the game drags on: what is the quickest way to lose?

Since Top Trumps is based on numerical values with simple outcomes, it seemed straightforward to analyse the cards and to simulate different scenarios to look at these questions.

Many Top Trumps variants exist, but the pack I’ll focus on is Marvel Universe “Who’s Your Hero?” made by Winning Moves (cat. no. 3399736). Note though that the approach can probably be adapted to handle any other Top Trumps set.

There are 30 cards featuring Marvel characters. Each card has six categories:

  1. Strength
  2. Skill
  3. Size
  4. Wisecracks
  5. Mystique
  6. Top Trumps Rating.

What is the best card and which one is the worst?

In order to determine this I pulled in all the data and compared each value to every other card’s value, and repeated this per category (code is here, the data are here). The scaling is different between categories, but that’s OK, because the game only uses within-field comparisons. This technique allowed me to add up how many cards have a lower value for a certain field for a given card, i.e. how many cards that card would beat. These victories could then be summed across all six fields to determine the “winningest card”.
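A minimal sketch of that counting loop, assuming the card data sit in a 30 x 6 numeric wave called scores (rows = cards, columns = categories; the wave and function names are hypothetical):

Function CountVictories()
	Wave scores
	Variable nCards = DimSize(scores, 0), nCats = DimSize(scores, 1)
	Make/O/N=(nCards) victories = 0
	Variable i, j, k
	for(i = 0; i < nCards; i += 1)
		for(k = 0; k < nCats; k += 1)
			for(j = 0; j < nCards; j += 1)
				victories[i] += (scores[i][k] > scores[j][k])    // count every card beaten in this category
			endfor
		endfor
	endfor
End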

The cumulative victories can be used to rank the cards and a category plot illustrates how “winningness” is distributed throughout the deck.

As an aside: looking at the way the scores for each category are distributed is interesting too. Understanding these distributions and the way that each is scaled gives a better feel for whether a score of, say, 2 in Wisecracks is any good (it is).

The best card in the deck is Iron Man. What is interesting is that Spider-Man has the designation Top Trump (see card), but he’s actually second in terms of wins over all other cards. Head-to-head, Spider-Man beats Iron Man in Skill and Mystique. They draw on Top Trumps Rating. But Iron Man beats Spider-Man on the three remaining fields. So if Iron Man comes up in your hand, you are most likely to defeat your opponent.

At the other end of the “winningest card” plot, the worst card is Wasp, followed by Ant-Man and Bucky Barnes. There needs to be a terrible card in every Top Trumps deck, and Wasp is it. She has pitiful scores in most fields and can collectively win only 9 out of (6 * 29) = 174 contests. If this card comes up, you are pretty much screwed.

What about draws? It’s true that a draw doesn’t mean losing and the active player gets another turn, so a draw does have some value. To make sure I wasn’t overlooking this with my system of counting victories, I recalculated the values using a Football League points system (3 points for a win, 1 point for a draw and 0 for a loss). The result is the same, with only some minor changes in the ranking.

I went with the first evaluation system in order to simulate the games.

I wrote a first version of the code that would print out what was happening so I could check that the simulation ran OK. Once that was done, it was possible to call the function that runs the game, do this many times (1 x 10^6) and record who won (player 1 or player 2) and how many rounds each game lasted.

A typical printout of a game (first 9 rounds) is shown here. So now I could test out different strategies: What is the best way to win and what is the best way to lose?

Strategy 1: pick your best category and play

If you knew which category was the most likely to win, could you pick that one and just win every game? Well, not quite. If both players take this strategy, then the player that goes first has a slight advantage and wins 57.8% of the time. The games can go on and on; the longest is over 500 rounds. I timed a few rounds and it worked out around 15 s per round. So the longest game would take just over 2 hours.

Strategy 2: pick one category and stick with it

This one requires very little brainpower and suits the disengaged adult: just keep picking the same category. In this scenario, Player 1 just picks strength every time while Player 2 picks their best category. This is a great way to lose. Just 0.02% of games are won using this strategy.

Strategy 3: pick categories at random

The next scenario was to just pick random categories. I set up Player 1 to do this and play against Player 2 picking their best category. This means 0.2% of wins for Player 1. The games are over fairly quickly, with the longest of 1 x 10^6 games stretching to 200 rounds.

If both players take this strategy, it results in much longer games (almost 2000 rounds for the longest). The player-goes-first advantage disappears and the wins are split 49.9 to 50.1%.

Strategy 4: pick your worst category

How does all of this compare with selecting the worst category? To look at this I made Player 2 take this strategy, while Player 1 picked the best category. The result was definitive: it is simply not possible for Player 2 to win. Player 1 wins 100% of all 1 x 10^6 games. The games are over in less than 60 rounds, with most being wrapped up in less than 35 rounds. Of course, this would require almost as much knowledge of the deck as the winning strategy, but if you are determined to lose then it is the best strategy.

 

The hand you’re dealt

Head-to-head, the best strategy is to pick your best category (no surprise there), but whether you win or lose depends on the cards you are dealt. I looked at which player was dealt the worst card, Wasp, and at the outcome. Player 1 won 58% of games; in 54% of those wins, Player 2 had started with Wasp. Being dealt this card is a disadvantage, but it is not the kiss of death. This analysis could be extended to look at the outcome if the n worst cards end up in your hand. I’d predict that this would influence the outcome further than just having Wasp.

So there you have it: every last drop of fun squeezed out of a children’s game by computational analysis. At quantixed, we aim to please.

The post title is taken from “Weak Superhero” by Rocket From The Crypt off their debut LP “Paint As A Fragrance” on Headhunter Records

Parallel Lines: Spatial statistics of microtubules in 3D

Our recent paper on “the mesh” in kinetochore fibres (K-fibres) of the mitotic spindle was our first adventure in 3D electron microscopy. This post is about some of the new data analysis challenges that were thrown up by this study. I promised a more technical post about this paper and here it is, better late than never.

In the paper we describe how over-expression of TACC3 causes the microtubules (MTs) in K-fibres to become “more wonky”. This was one of those observations that we could see by eye in the tomograms, but we needed a way to quantify it. And this meant coming up with a new spatial statistic.

After a few false starts*, we generated a method that I’ll describe here in the hope that the extra detail will be useful for other people interested in similar problems in cell biology.

The difficulty in this analysis comes from the fact that the fibres are randomly oriented, because of the way that the experiment is done. We section orthogonally to the spindle axis, but the fibre is rarely pointing exactly orthogonal to the tomogram. So the challenge is to reorient all the fibres to be able to pool numbers from across different fibres to derive any measurements. The IgorPro code to do this was made available with the paper. I have recently updated this code for a more streamlined workflow (available here).

We had two 3D point sets, one representing the position of each microtubule in the fibre at the bottom of our tomogram and the other the position at the top. After creating individual MT waves from these point sets to work with, these waves could be plotted in 3D to have a look at them.

This is done in IgorPro by using a Gizmo. Shown here is a set of MTs from one K-fibre, rotated to show how the waves look in 3D; note that the scaling in z is exaggerated compared with x and y.

We need to normalise the fibres by getting them to all point in the same direction. We found that trying to pull out the average trajectory for the fibre didn’t work so well if there were lots of wonky MTs. So we came up with the following method:

  • Calculate the total cartesian distance of all MT waves in an xy view, i.e. the sum of all projections of vectors on an xy plane.
  • Rotate the fibre.
  • Recalculate the total distance.
  • Repeat.
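Here is a sketch of the quantity being scanned in that loop (not the published code), assuming each MT has been reduced to a single direction vector stored across three waves (vecX, vecY and vecZ are hypothetical names):

Function TotalXYDistance(vecX, vecY, vecZ, theta, phi)
	Wave vecX, vecY, vecZ        // one direction vector (top minus bottom) per MT
	Variable theta, phi          // trial rotation angles in radians
	Variable i, n = numpnts(vecX)
	Variable x1, y1, z1, x2, y2, total
	for(i = 0; i < n; i += 1)
		// rotate about the z axis by phi
		x1 = vecX[i] * cos(phi) - vecY[i] * sin(phi)
		y1 = vecX[i] * sin(phi) + vecY[i] * cos(phi)
		z1 = vecZ[i]
		// then rotate about the y axis by theta
		x2 = x1 * cos(theta) + z1 * sin(theta)
		y2 = y1
		// add the length of the xy projection; the sum is smallest when the fibre points along z
		total += sqrt(x2^2 + y2^2)
	endfor
	return total
End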

So we start off with this set of waves (Original). We rotate through 3D space and plot the total distance at each rotation to find the minimum, i.e. when most MTs are pointing straight at the viewer. This plot (Finding Minimum) is coloured so that hot colours are the smallest distances; it shows this calculation for a range of rotations in phi and theta. Once this minimum is found, the MT waves can be rotated by this value and the set is then normalised.

Now we have all of the fibres that we imaged oriented in the same way, pointing to the zenith. This means we can look at angles relative to the z axis and derive statistics.

The next challenge was to make a measure of “wonkiness”. In other words, test how parallel the MTs are.

Violin plots of theta don’t really get across the wonkiness of the TACC3 overexpressed K-fibres (see figure above). To visualise this more clearly, each MT was turned into a vector starting at the origin and the point where the vector intersected with an xy plane set at an arbitrary distance in z (100 nm) was calculated. The scatter of these intersections demonstrates nicely how parallel the MTs are. If all MTs were perfectly parallel, they would all intersect at 0,0. In the control this is more-or-less true, with a bit of noise. In contrast, the TACC3-overexpressed group have much more scatter. What was nice is that the radial scatter was homogeneous, which showed that there was no bias in the acquisition of tomograms. The final touch was to generate a bivariate histogram which shows the scatter around 0,0 but it is normalised for the total number of points. Note that none of this possible without the first normalisation step.

Parallelism

The only thing that we didn’t have was a good term to describe what we were studying. “Wonkiness” didn’t sound very scientific and “parallelness” was also a bit strange. Parallelism is a word used in the humanities to describe analogies in art, film etc. However, it seemed the best term to describe the study of how parallel the MTs in a fibre are.

With a little help from my friends

The development of this method was borne out of discussions with Tom Honnor and Julia Brettschneider in the Statistics department in Warwick. The idea for the intersecting scatter plot came from Anne Straube in the office next door to me. They are all acknowledged in our paper for their input. A.G. at WaveMetrics helped me speed up my code by using MatrixOP and Euler’s rotation. His other suggestion of using PCA to do this would undoubtedly be faster, but I haven’t implemented this – yet. The bivariate histograms were made using JointHistogram() found here. JointHistogram now ships with Igor 7.

* as we describe in the paper

Several other strategies were explored to analyze deviations in trajectory versus the fiber axis. These were: examining the variance in trajectory angles, pairwise comparison of all MTs in the bundle, comparison to a reference MT that represented the fiber axis, using spherical rotation and rotating by an average value. These produced similar results, however, the one described here was the most robust and represents our best method for this kind of spatial statistical analysis.

The post title is taken from the Blondie LP “Parallel Lines”.

The Great Curve: Citation distributions

This post follows on from a previous post on citation distributions and the wrongness of Impact Factor.

Stephen Curry had previously made the call that journals should “show us the data” that underlie the much-maligned Journal Impact Factor (JIF). However, this call made me wonder what “showing us the data” would look like and how journals might do it.

What citation distribution should we look at? The JIF looks at citations in a year to articles published in the preceding 2 years. This captures a period in a paper’s life, but it misses “slow burner” papers and also underestimates the impact of papers that just keep generating citations long after publication. I wrote a quick bit of code that would look at a decade’s worth of papers at one journal to see what happened to them as yearly cohorts over that decade. I picked EMBO J to look at since they have actually published their own citation distribution, and also they appear willing to engage with more transparency around scientific publication. Note that, when they published their distribution, it considered citations to papers via a JIF-style window over 5 years.

I pulled 4082 papers with a publication date of 2004-2014 from Web of Science (the search was limited to Articles) along with data on citations that occurred per year. I generated histograms to look at the distribution of citations for each year. Papers published in 2004 are in the top row, papers from 2014 are in the bottom row. The first histogram shows citations in the same year as publication; the next column shows the following year, and so on. The number of papers is on the y-axis and the number of citations on the x-axis. Sorry for the lack of labelling! My excuse is that my code made a plot with “subwindows”, which I’m not too familiar with.

[allPlot: grid of citation histograms, publication years (2004-2014) in rows and years since publication in columns]

What is interesting is that the distribution changes over time:

  • In the year of publication, most papers are not cited at all, which is expected since there is a lag to publication of papers which can cite the work and also some papers do not come out until later in the year, meaning the likelihood of a citing paper coming out decreases as the year progresses.
  • The following year most papers are picking up citations: the distribution moves rightwards.
  • Over the next few years the distribution relaxes back leftwards as the citations die away.
  • The distributions are always skewed. Few papers get loads of citations, most get very few.

Although I truncated the x-axis at 40 citations, there are a handful of papers that are picking up >40 cites per year up to 10 years after publication – clearly these are very useful papers!

To summarise these distributions I generated the median (and the mean – I know, I know) number of citations for each publication year-citation year combination and made plots.

[citedist: mean (left) and median (right) citations for each publication year-citation year combination]

The mean is shown on the left and median on the right. The layout is the same as in the multi-histogram plot above.

Follow along a row and you can again see how the cohort of papers attracts citations, peaks and then dies away. You can also see that some years were better than others in terms of citations, 2004 and 2005 were good years, 2007 was not so good. It is very difficult, if not impossible, to judge how 2013 and 2014 papers will fare into the future.

What was the point of all this? Well, I think showing the citation data that underlie the JIF is a good start. However, citation data are more nuanced than the JIF allows for. So being able to choose how we look at the citations is important to understand how a journal performs. Having some kind of widget that allows one to select the year(s) of papers to look at and the year(s) that the citations came from would be perfect, but this is beyond me. Otherwise, journals would probably elect to show us a distribution for a golden year (like 2004 in this case), or pick a window for comparison that looked highly favourable.

Finally, I think journals are unlikely to provide this kind of analysis. They should, if only because it is a chance for a journal to show how it publishes many papers that are really useful to the community. Anyway, maybe they don’t have to… What this quick analysis shows is that the data can be (fairly) easily harvested and displayed. We could crowdsource this analysis using standardised code.

Below is the code that I used – it’s a bit rough and would need some work before it could be used generally. It also uses a 2D filtering method that was posted on IgorExchange by John Weeks.
[cdcode: the Igor code used for this analysis]

The post title is taken from “The Great Curve” by Talking Heads from their classic LP Remain in Light.

At a Crawl: Analysis of Cell Migration in IgorPro

In the lab we have been doing quite a bit of analysis of cell migration in 2D. Typically RPE1 cells migrating on fibronectin-coated glass. There are quite a few tools out there to track cell movements and to analyse their migration. Naturally, none of these did quite what we wanted and none fitted nicely into our analysis workflow. This meant writing something from scratch in IgorPro. You can access the code from my GitHub pages.

We’ve previously published a paper doing these kinds of analysis, but this code is all-new, faster and more efficient.

The CellMigration repo contains an ipf that has three functions which will import tracks into Igor and analyse them. We use manual tracking in ImageJ/FIJI to obtain the co-ordinates for analysis, this is the only non-Igor part. I figured it was not worth porting this part too. Instructions are given in the repo and are hopefully self-explanatory. Here’s a screenshot of a typical experiment.


Acknowledgements: Colour palette is taken from the SRON stylesheet. The excellent igorutils (by yamad) gave me the idea for a structure and jtigor helped me with referencing StrConstants.

The post title is taken from the track “At a Crawl” by The Melvins from their Ozma LP

Your Favorite Thing: Algorithmically Perfect Playlist

I’ve previously written about analysing my iTunes library and about generating Smart Playlists in iTunes. This post takes things a bit further by generating a “perfect playlist” outside of iTunes… it is exclusively for nerds.

How can you put together a perfect playlist? What are your favourite songs? How can you tell what they are? Well, we could look at how many times you’ve played each song in your iTunes library (assuming this is mainly how you consume your music)… but this can be misleading. Songs that have been in there since the start (in my case, a decade ago) have had longer to accumulate plays than songs that were added last year. This problem was covered nicely in a post by Mr Science Show.

He suggests that your all-time greatest playlist can be modelled using

\(\frac{dp}{dt}=\frac{A}{Bt+N_0} + Ce^{-Dt}\)

where \(N_0\) is the number of tracks in the library at time zero, \(t_0\), and A and B are constants describing the linear growth of the collection over time. The second term is an additional correction for the fact that songs added more recently are likely to have garnered more plays and, as they age, they relax back into the general soup of the library. I used something similar to make my perfect playlist.

Calculating something like this is well beyond the scope of iTunes and so we need to do something more heavy duty. The steps below show how this can be achieved. Of course, I used IgorPro to do almost all of this. I tried to read in the iTunes Music Library.xml directly in Igor using the udStFiLrXML package, but couldn’t get it to work. So there’s a bit of ruby followed by an all-Igor workflow. You can scroll to the bottom to find out a) whether this was worth it and b) for other stuff I discovered along the way.

All the code to do this is available here. I’ll try to put quantixed code on github from now on.

Once the data is in Igor, the strategy is to calculate the expected number of plays a track should have received if iTunes was simply set to random. We can then compare this number to the actual number of plays. The ratio of these numbers helps us to discriminate our favourite tracks. To work out the expected plays, we calculate the number of tracks in the library over time and the inverse of this gives us the probability that a given track, at that moment in the lifetime of the library, will be played. We know the total number of plays and the lifetime of the library, so if we assume that play rate is constant over time (fair assumption), this means we can calculate the expected number of plays for each track. As noted above, there is a slight snag with this methodology, because tracks added in the last few months will have a very low number of expected plays, yet are likely to have been played quite a lot. To compensate for this I used the modelling method suggested by Mr Science Show, but only for recent songs. Hopefully that all makes sense, so now for a step-by-step guide.

Step 1: Extract data from iTunes xml file to tsv

After trying and failing to write my own script to parse the xml file, I stumbled across this on the web.

#!/usr/bin/ruby

require 'rubygems'
require 'nokogiri'

list = []
doc = Nokogiri::XML(File.open(ARGV[0], 'r'))

doc.xpath('/plist/dict/dict/dict').each do |node|
  hash = {}
  last_key = nil

  node.children.each do |child|
    next if child.blank?

    if child.name == 'key'
      last_key = child.text
    else
      hash[last_key] = child.text
    end
  end

  list << hash
end

p list

This script was saved as parsenoko.rb and could be executed from the command line

find . -name "*.xml" -exec ruby parsenoko.rb {} > playlist.csv \;

after cd-ing to the appropriate directory containing the script and a copy of the xml file.

Step 2: A little bit of cleaning

The file starts with [ and ends with ]. Each dictionary item (dict) has been printed enclosed by {}. It’s easiest to remove these before importing to IgorPro. For my library the maximum number of keys is 38. I added a header line (ColumnA<tab>ColumnB<tab>…<tab>ColumnAL) to make sure all keys were imported correctly.

Step 3: Import into IgorPro

Import the tsv. This is preferable to csv because many tracks have commas in the track title, album title or artist name. Everything comes in as text and we will sort everything out in the next step.

LoadWave /N=Column/O/K=2/J/V={"\t"," $",0,0}

Step 4: Get Igor to sort the key values into waves corresponding to each key

This is a major type of cleaning. What we’ll do is read the key and its value. The two are separated by => and so this is used to parse and resort the values. This will convert the numeric values to numeric waves.

This is done by executing

iTunes()

Step 5: Convert timestamps to date values

iTunes stores dates in a UTC timestamp with this format 2014-10-02T20:24:10Z. It does this for Date Added, Date Modified, Last Played etc. To do anything meaningful with these, we need to convert them to date values. IgorPro uses the time in seconds from Midnight on 1st Jan 1904 as a date system. This requires double precision FP64 waves. We can parse the string containing this time stamp and convert it using

DateRead()
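As an illustration of what such a conversion involves (this is a sketch, not the actual DateRead() function):

Function ConvertTimeStamp(str)
	String str                               // e.g. "2014-10-02T20:24:10Z"
	Variable year, month, day, hh, mm, ss
	sscanf str, "%d-%d-%dT%d:%d:%dZ", year, month, day, hh, mm, ss
	return date2secs(year, month, day) + (hh * 3600) + (mm * 60) + ss
End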

Step 6: Discover your favourite tracks!

We do all of this by running

Predictor()

The way this works is described above. Note that you can run whatever algorithm you like at this point to generate a list of tracks.
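A minimal sketch of the core of that calculation (not the actual Predictor() code; the wave names dateAdded and playCount, the one-day time step, and the omission of the recent-songs correction are all simplifications):

Function ExpectedPlays()
	Wave dateAdded, playCount                  // one point per track; dateAdded as Igor date-time values
	Variable n = numpnts(dateAdded)
	Variable t0 = WaveMin(dateAdded)           // the birth of the library
	Variable playRate = sum(playCount) / (datetime - t0)    // plays per second, assumed constant
	Make/O/N=(n) expected = 0
	Make/O/FREE/N=(n) present
	Variable t, dt = 86400                     // step through the library's life one day at a time
	for(t = t0; t < datetime; t += dt)
		present = (dateAdded[p] <= t)          // which tracks existed at time t
		expected += present[p] * playRate * dt / sum(present)
	endfor
	Make/O/N=(n) ratio = playCount[p] / expected[p]    // a high ratio suggests a favourite track
End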

Step 7: Make a playlist to feed back to iTunes

The format for playlists is the M3U file. This has a simple layout which can easily be printed to a Notebook in Igor and then saved as a file for importing back into iTunes.

To do this we run

WritePlaylist(listlen)

Where the Variable listlen is the length of the playlist. In this example, listlen=50 would give the Top 50 favourite tracks.
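For reference, the layout being written is just plain text. A hypothetical entry built in an Igor notebook might look like this (the track details and path are made up; this is not the actual WritePlaylist() code):

NewNotebook/F=0/N=PlaylistNB                                   // a plain-text notebook
Notebook PlaylistNB, text="#EXTM3U\r"
Notebook PlaylistNB, text="#EXTINF:215,Some Artist - Some Track\r"
Notebook PlaylistNB, text="/Users/me/Music/Some Track.m4a\r"
// then save it with SaveNotebook, giving the file a .m3u extension, and import into iTunes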

So what did I find out?

My top 50 songs determined by this method were quite different to the Smart Playlist in iTunes of the Most Played tracks. The tracks at the top of the Most Played list in iTunes have disappeared in the new list and these are the ones that have been in the library for a long time and I suppose I don’t listen to that much any more. The new algorithmically designed playlist has a bunch of fresher tracks that were added in the last few years and I have listened to quite a lot. Looking through I can see music that I should explore in more detail. In short, it’s a superior playlist and one that will always change and should not go stale.

Other useful stuff

There are quite a few parsing tools on the web that vary in their utility and usefulness. Some that deserve a mention are:

  • The xml file should be readable as a plist by cocoa which is native to OSX
  • Visualisation of what proportion of an iTunes library is by a given artist – bdunagan’s blog
  • itunes-parser on github by phiggins
  • Really nice XSLT to move the xml file to html – moveable-type
  • Comprehensive but difficult to follow method in ruby.

The post title comes from “Your Favorite Thing” by Sugar from their LP “File Under: Easy Listening”

Green is the Colour: mNeonGreen spectra

I was searching for the excitation and emission spectra for mNeonGreen. I was able to find an image, but no values for the spectra. Here is an approximation of the spectra (xlsx format, still haven’t figured out csv for wordpress).

I got these values using IgorThief.ipf, a very handy tool that allows the extraction of XY coordinates from a bitmapped plot (below).

mNeonGreen is available from Allele Biotechnologies.

Here is a great site for comparing fluorescent protein properties.

Edit @ 06:54 4/7/14: A web-based data thief was suggested by @dozenoaks

The post title is taken from “Green is the Colour” from the Pink Floyd LP “More”