## Esoteric Circle

Many projects in the lab involve quantifying circular objects. Microtubules, vesicles and so on are approximately circular in cross section. This quick post is about how to find the diameter of these objects using a computer.

So how do you measure the diameter of an object that is approximately circular? Well, if it was circular you would measure the distance from one edge to the other, crossing the centre of the object. It doesn’t matter along which axis you do this. However, since these objects are only approximately circular, it matters along which axis you measure. There are a couple of approaches that can be used to solve this problem.

Principal component analysis

The object is a collection of points* and we can find the eigenvectors and eigenvalues of these points using principal component analysis. This was discussed previously here. The 1st eigenvector points along the direction of greatest variance and the 2nd eigenvector is normal to the first. The order of eigenvectors is determined by their eigenvalues. We use these to rotate the coordinate set and offset to the origin.

Now the major axis of the object is aligned to the x-axis at y=0 and the minor axis is aligned with the y-axis at x=0 (compare the plot on the right with the one on the left, where the profiles are in their original orientation – offset to zero). We can then find the absolute values of the axis crossing points and when added together these represent the major axis and minor axis of the object. In Igor, this is done using a one-liner to retrieve a rotated set of coords as the wave M_R.

PCA/ALL/SEVC/SRMT/SCMT xCoord,yCoord
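For readers without Igor, the same idea can be sketched in plain Python. This is not the Igor PCA operation, just a minimal illustration that assumes the outline is a list of (x, y) tuples; the function name `pca_rotate` is made up for this example. For 2D data the direction of the first eigenvector has a closed form, so no linear algebra library is needed.

```python
import math

def pca_rotate(points):
    """Centre 2D points on their centroid and rotate them so that the
    direction of greatest variance lies along the x-axis."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    centred = [(x - cx, y - cy) for x, y in points]

    # Elements of the 2x2 covariance matrix
    sxx = sum(x * x for x, _ in centred) / n
    syy = sum(y * y for _, y in centred) / n
    sxy = sum(x * y for x, y in centred) / n

    # Angle of the first eigenvector (closed form for a symmetric 2x2 matrix)
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)

    # Rotate by -theta so that the first eigenvector maps onto the x-axis
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c + y * s, -x * s + y * c) for x, y in centred]
```

After this step the widest extent of the outline runs along x and the narrowest along y, which is the arrangement described above.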

To find the crossing points, I use Igor’s interpolation-based level crossing functions. For example, storing the aggregated diameter in a variable called len.

FindLevel/Q/EDGE=1/P m1c0, 0
len = abs(m1c1(V_LevelX))
FindLevel/Q/EDGE=2/P m1c0, 0
len += abs(m1c1(V_LevelX))

This is just to find one axis (where m1c0 and m1c1 are the 1st and 2nd columns of a 2-column wave m1) and so you can see it is a bit cumbersome.
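The level-crossing step can also be sketched in Python. This is an illustration rather than a translation of FindLevel: it assumes the outline is a closed, ordered loop of (x, y) points that has already been centred and rotated, and that it crosses each axis exactly twice. The name `axis_length` is hypothetical.

```python
def axis_length(outline, along="x"):
    """Length of a centred, rotated outline along one axis.

    Finds where the outline crosses the axis (where the other coordinate
    changes sign), linearly interpolates the crossing position and sums
    the absolute values of the crossings.
    """
    total = 0.0
    n = len(outline)
    for i in range(n):
        x0, y0 = outline[i]
        x1, y1 = outline[(i + 1) % n]  # wrap around to close the loop
        # u is the coordinate being measured, v is the one that crosses zero
        if along == "x":
            u0, v0, u1, v1 = x0, y0, x1, y1
        else:
            u0, v0, u1, v1 = y0, x0, y1, x1
        if v0 == 0.0:
            total += abs(u0)  # point lies exactly on the axis
        elif v0 * v1 < 0:
            f = v0 / (v0 - v1)  # fraction of the segment before the crossing
            total += abs(u0 + f * (u1 - u0))
    return total
```

Calling it once with `along="x"` and once with `along="y"` gives the major and minor axes, and their ratio gives the aspect ratio.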

Anyway, I was quite happy with this solution. It is unbiased and also tells us how approximately circular the object is (because the major and minor axes tell us the aspect ratio or ellipticity of the object). I used it in Figure 2 of this paper to show the sizes of the coated vesicles. However, in another project we wanted to state what the diameter of a vesicle was. Not two numbers, just one. How do we do that? We could take the average of the major and minor axes, but maybe there’s an easier way.

Polar coordinates

The distance from the centre to every point on the edge of the object can be found easily by converting the xy coordinates to polar coordinates. To do this, we first find the centre of the object. This is the centroid $$(\bar{x},\bar{y})$$ represented by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}$$ and $$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_{i}$$

for n points, and subtract this centroid from all points to arrange the object around the origin. Now, since the xy coords are represented in the polar system by

$$x_{i} = r_{i}\cos(\phi_{i})$$ and $$y_{i} = r_{i}\sin(\phi_{i})$$

we can find r, the radial distance, using

$$r_{i} = \sqrt{x_{i}^{2} + y_{i}^{2}}$$

With those values we can then find the average radial distance and report that.

There’s something slightly circular (pardon the pun) about this method because all we are doing is minimising the distance to a central point initially and then measuring the average distance to this minimised point in the latter step. It is much faster than the PCA approach and would be insensitive to changes in point density around the object. The two methods would probably diverge for noisy images. Again in Igor this is simple:

Make/O/N=(dimsize(m1,0)-1)/FREE rW

rW[] = sqrt(m1[p][0]^2 + m1[p][1]^2)

len = 2 * mean(rW)

Here again, m1 is the 2-column wave of coords and the diameter of the object is stored in len.
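The same calculation as a Python sketch, for comparison. It assumes the outline is a list of (x, y) tuples and, unlike the Igor snippet, it does the centroid subtraction itself; `mean_radial_diameter` is a name invented for this example.

```python
import math

def mean_radial_diameter(points):
    """Diameter estimated as twice the mean distance from the centroid."""
    n = len(points)
    # Centroid of the point set
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    # Radial distance of every point from the centroid
    radii = [math.hypot(x - cx, y - cy) for x, y in points]
    return 2 * sum(radii) / n
```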

How does this compare with the method above? The answer is kind of obvious, but it is equidistant between the major and minor axes. Major axis is shown in red and minor axis shown in blue compared with the mean radial distance method (plotted on the y-axis). In places there is nearly a 10 nm difference which is considerable for objects which are between 20 and 35 nm in diameter. How close is it to the average of the major and minor axis? Those points are in black and they are very close but not exactly on y=x.

So for simple, approximately circular objects with low noise, the ridiculously simple polar method gives us a single estimate of the diameter of the object and this is much faster than the more complex methods above. For more complicated shapes and noisy images, my feeling is that the PCA approach would be more robust. The two methods actually tell us two subtly different things about the shapes.

Why don’t you just measure them by hand?

In case there is anyone out there wondering why a computer is used for this rather than a human wielding the line tool in ImageJ… there are two good reasons.

1. There are just too many! Each image has tens of profiles and we have hundreds of images from several experiments.
2. How would you measure the profile manually? This approach shows two unbiased methods that don’t rely on a human to draw any line across the object.

* = I am assuming that the point set is already created.

The post title is taken from “Esoteric Circle” by Jan Garbarek from the LP of the same name released in 1969. The title fits well since this post is definitely esoteric. But maybe someone out there is interested!

## Parallel lines: new paper on modelling mitotic microtubules in 3D

We have a new paper out! You can access it here.

The people

This paper really was a team effort. Faye Nixon and Tom Honnor are joint-first authors. Faye did most of the experimental work in the final months of her PhD and Tom came up with the idea for the mathematical modelling and helped to rewrite our analysis method in R. Other people helped in lots of ways. George did extra segmentation, rendering and movie making. Nick helped during the revisions of the paper. Ali helped to image samples… the list is quite long.

The paper in a nutshell

We used a 3D imaging technique called SBF-SEM to see microtubules in dividing cells, then used computers to describe their organisation.

What’s SBF-SEM?

Serial block face scanning electron microscopy. This method allows us to take an image of a cell and then remove a tiny slice, take another image and so on. We then have a pile of images which covers the entire cell. Next we need to put them back together and make some sense of them.

How do you do that?

We use a computer to track where all the microtubules are in the cell. In dividing cells – in mitosis – the microtubules are in the form of a mitotic spindle. This is a machine that the cell builds to share the chromosomes to the two new cells. It’s very important that this process goes right. If it fails, mistakes can lead to diseases such as cancer. Before we started, it wasn’t known whether SBF-SEM had the power to see microtubules, but we show in this paper that it is possible.

We can see lots of other cool things inside the cell too like chromosomes, kinetochores, mitochondria, membranes. We made many interesting observations in the paper, although the focus was on the microtubules.

So you can see all the microtubules, what’s interesting about that?

The interesting thing is that our resolution is really good, and is at a large scale. This means we can determine the direction of all the microtubules in the spindle and use this for understanding how well the microtubules are organised. Previous work had suggested that proteins whose expression is altered in cancer cause changes in the organisation of spindle microtubules. Our computational methods allowed us to test these ideas for the first time.

Resolution at a large scale, what does that mean?

The spindle is made of thousands of microtubules. With a normal light microscope, we can see the spindle but we can’t tell individual microtubules apart. There are improvements in light microscopy (called super-resolution) but even with those improvements, right in the body of the spindle it is still not possible to resolve individual microtubules. SBF-SEM can do this. It doesn’t have the best resolution available though. A method called Electron Tomography has much higher resolution. However, to image microtubules at this large scale (meaning for one whole spindle), it would take months or years of effort! SBF-SEM takes a few hours. Our resolution is better than light microscopy, worse than electron tomography, but because we can see the whole spindle and image more samples, it has huge benefits.

What mathematical modelling did you do?

Cells are beautiful things but they are far from perfect. The microtubules in a mitotic spindle follow a pattern, but don’t do so exactly. So what we did was to create a “virtual spindle” where each microtubule had been made perfect. It was a bit like “photoshopping” the cell. Instead of straightening the noses of actresses, we corrected the path of every microtubule. How much photoshopping was needed told us how imperfect the microtubule’s direction was. This measure – which was a readout of microtubule “wonkiness” – could be done on thousands of microtubules and tell us whether cancer-associated proteins really cause the microtubules to lose organisation.

The publication process

The paper is published in Journal of Cell Science and it was a great experience. Last November, we put up a preprint on this work and left it up for a few weeks. We got some great feedback and modified the paper a bit before submitting it to a journal. One reviewer gave us a long list of useful comments that we needed to address. However, the other two reviewers didn’t think our paper was a big enough breakthrough for that journal. Our paper was rejected*. This can happen sometimes and it is frustrating as an author because it is difficult for anybody to judge which papers will go on to make an impact and which ones won’t. One of the two reviewers thought that because the resolution of SBF-SEM is lower than electron tomography, our paper was not good enough. The other one thought that because SBF-SEM will not surpass light microscopy as an imaging method (really!**) and because EM cannot be done live (the cells have to be fixed), it was not enough of a breakthrough. As I explained above, the power is that SBF-SEM is between these two methods. Somehow, the referees weren’t convinced. We did some more work, revised the paper, and sent it to J Cell Sci.

J Cell Sci is a great journal which is published by Company of Biologists, a not-for-profit organisation that puts a lot of money back into cell biology in the UK. They are preprint friendly, they allow the submission of papers in any format, and most importantly, they have a fast-track*** option. This allowed me to send on the reviews we had, along with our response to them. They sent the paper back to the reviewer who had a list of useful comments and they were happy with the changes we made. It was accepted just 18 days after we sent it in and it was online 8 days later. I’m really pleased with the whole publishing experience with J Cell Sci.

* I’m writing about this because we all have papers rejected. There’s no shame in that at all. Moreover, it’s obvious from the dates on the preprint and on the JCS paper that our manuscript was rejected from another journal first.

** Anyone who knows something about microscopy will find this amusing and/or ridiculous.

*** Fast-track is offered by lots of journals nowadays. It allows authors to send in a paper that has been reviewed elsewhere with the peer review file. How the paper has been revised in light of those comments is assessed by the Editor and one peer reviewer.

Parallel lines is of course the title of the seminal Blondie LP. I have used this title before for a blog post, but it matches the topic so well.

## Adventures in Code V: making a map of Igor functions

I’ve generated a lot of code for IgorPro. Keeping track of it all has got easier since I started using GitHub – even so – I have found myself writing something only to discover that I had previously written the same thing. I was thinking that it would be good to make a list of all functions that I’ve written to locate long lost functions.

This question was brought up on the Igor mailing list a while back and there are several solutions – especially if you want to look at dependencies. However, this two-liner works to generate a file called funcfile.txt which contains a list of functions and the ipf file that they appear in.

grep "^[ \t]*Function" *.ipf | grep -oE '[ \t]+[A-Za-z_0-9]+\(' | tr -d " " | tr -d "(" > output
for i in $(cat output); do grep -ie "$i" *.ipf | grep -w "Function" >> funcfile.txt ; done


Thanks to Thomas Braun on the mailing list for the idea. I have converted it to work on grep (BSD grep) 2.5.1-FreeBSD which runs on macOS. Use the terminal, cd to the directory containing your ipf files and run it. Enjoy!
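If grep is not to hand, roughly the same listing can be sketched in Python. This is an illustration, not the method from the mailing list: the regular expression is my own guess at matching lines that begin with `Function` (optionally with flags such as `/S`), and the names `FUNC_RE`, `list_functions` and `scan_directory` are made up for this example.

```python
import re
from pathlib import Path

# A line that starts with optional whitespace, then "Function", optional
# flags such as /S or /WAVE, then the function name followed by "(".
FUNC_RE = re.compile(r"^[ \t]*Function(?:/\w+)*[ \t]+(\w+)[ \t]*\(", re.IGNORECASE)

def list_functions(text_by_file):
    """Return sorted (function name, file name) pairs.

    text_by_file maps a file name to the full text of that file.
    """
    found = []
    for fname, text in text_by_file.items():
        for line in text.splitlines():
            m = FUNC_RE.match(line)
            if m:
                found.append((m.group(1), fname))
    return sorted(found)

def scan_directory(folder="."):
    """Read every ipf file in a folder and list the functions it declares."""
    texts = {p.name: p.read_text(errors="replace") for p in Path(folder).glob("*.ipf")}
    return list_functions(texts)
```

Like the grep version, this only sees declarations where `Function` is the first word on the line, so anything declared with a leading keyword would need the pattern extending.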

EDIT: I did a bit more work on this idea and it has now expanded to its own repo. Briefly, funcfile.txt is converted to tsv and then parsed – using Igor – to json. This can be displayed using some d3.js magic.

Part of a series with code snippets and tips.

## Realm of Chaos

Caution: this post is for nerds only.

I watched this numberphile video last night and was fascinated by the point pattern that was created in it. I thought I would quickly program my own version to recreate it and then look at patterns made by more points.

I didn’t realise until afterwards that there is actually a web version of the program used in the video here. It is a bit limited though so my code was still worthwhile.

A fractal triangular pattern can be created by:

1. Setting three points
2. Picking a randomly placed seed point
3. Rolling a die and going halfway towards the result
4. Repeat last step

If the first three points are randomly placed the pattern is skewed, so I added the ability to generate an equilateral triangle. Here is the result.

and here are the results of a triangle through to a decagon.

All of these are generated with one million points using alpha=0.25. The triangle, pentagon and hexagon make nice patterns but the square and polygons with more than six points make pretty uninteresting patterns.

Watching the creation of the point pattern from a triangular set is quite fun. This is 30000 points with a frame every 10 points.

Here is the code.

Some other notes: this version runs in IgorPro. In my version, the seed is set at the centre of the image rather than a random location. I used random allocation of points rather than a six-sided die.
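For anyone without Igor, the steps listed earlier can be sketched in a few lines of Python. This is not the IgorPro code, just a minimal illustration: it assumes a regular polygon centred on the origin, a seed at the centre and a random choice of vertex in place of the die, as in the notes above. The function names are invented for this example.

```python
import math
import random

def regular_polygon(n_vertices, radius=1.0):
    """Vertices of a regular polygon centred on the origin."""
    return [
        (radius * math.sin(2 * math.pi * k / n_vertices),
         radius * math.cos(2 * math.pi * k / n_vertices))
        for k in range(n_vertices)
    ]

def chaos_game(vertices, n_points, seed_point=(0.0, 0.0), rng=None):
    """Repeatedly jump halfway towards a randomly chosen vertex."""
    if rng is None:
        rng = random.Random()
    x, y = seed_point
    points = []
    for _ in range(n_points):
        vx, vy = rng.choice(vertices)       # the "die roll"
        x, y = (x + vx) / 2, (y + vy) / 2   # go halfway towards that vertex
        points.append((x, y))
    return points
```

For example, `chaos_game(regular_polygon(3), 30000)` gives the points for a triangle, ready to be scatter-plotted with a low opacity.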

The post title is taken from the title track from Bolt Thrower’s “Realm of Chaos”.

## Elevation: accuracy of a Garmin Edge 800 GPS device

I use a Garmin 800 GPS device to log my cycling activity, including my commutes. Since I have now built up nearly 4 years of cycling the same route, I had a good dataset to look at how accurate the device is.

I wrote some code to import all of the rides tagged with commute in rubiTrack 4 Pro (technical details are below). These tracks needed categorising so that they could be compared. Then I plotted them out as a gizmo in Igor Pro and compared them to a reference data set which I obtained via GPS Visualiser.

The reference dataset is shown in black and gives the “true” elevation at those particular latitude and longitude coordinates. Plotted on to that are the commute tracks coloured red-white-blue according to longitude. You can see that there is a range of elevations recorded by the device; apart from a few outliers, they are mostly accurate but offset. This is strange because I have the elevation of the start and end points saved in the device, and I thought it adjusted the altitude it was measuring to these saved elevations when recording the track. Obviously not.

To look at the error in the device I plotted out the difference in the measured altitude at a given location versus the true elevation. For each route (to and from work) a histogram of elevation differences is shown to the right. The average difference is 8 m for the commute in and 4 m for the commute back. This is quite a lot considering that all of this is only ~100 m above sea level. The standard deviation is 43 m for the commute in and 26 m for the way back.
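The summary itself is simple to compute once every trackpoint has a reference elevation. Here is a minimal Python sketch, assuming two equal-length lists of elevations in metres for the same locations. The function name is invented, and whether the figures above used the sample or population standard deviation is not stated, so the sample version here is an assumption.

```python
from statistics import mean, stdev

def elevation_error(measured, reference):
    """Mean and sample standard deviation of measured minus reference elevation."""
    diffs = [m - r for m, r in zip(measured, reference)]
    return mean(diffs), stdev(diffs)
```

Plotting a histogram of the same differences gives the distributions described above.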

This post at VeloViewer comparing GPS data on Strava from pro-cyclists riding the St15 of 2015 Giro d’Italia sprang to mind. Some GPS devices performed OK, whereas others (including Garmin) did less well. The idea in that post is that rain affects the recording of some units. This could be true and although I live in a rainy country, I doubt it can account for the inaccuracies recorded here. Bear in mind that that stage was over some big changes in altitude and my recordings, very little. On the other hand, there are very few tracks in that post whereas there is lots of data here.

It’s interesting that the data is worse going in to work than coming back. I do set off quite early in the morning and it is colder etc first thing which might mean the unit doesn’t behave as well for the commute to work. Both to and from work tracks vary most in lat/lon recordings at the start of the track which suggests that the unit is slow to get an exact location – something every Garmin user can attest to. Although I always wait until it has a fix before setting off. The final two plots show what the beginning of the return from work looks like for location accuracy (travelling east to west) compared to a midway section of the same commute (right). This might mean that the inaccuracy at the start determines how inaccurate the track is. As I mentioned, the elevation is set for start and end points. Perhaps if the lat/lon is too far from the endpoint it fails to collect the correct elevation.

Conclusion

I’m disappointed with the accuracy of the device. However, I have no idea whether other GPS units (including phones) would outperform the Garmin Edge 800 or even if later Garmin models are better. This is a good but limited dataset. A similar analysis would be possible on a huge dataset (e.g. all strava data) which would reveal the best and worst GPS devices and/or the best conditions for recording the most accurate data.

Technical details

I described how to get GPX tracks from rubiTrack 4 Pro into Igor and how to crunch them in a previous post. I modified the code to get elevation data out from the cycling tracks and generally made the code slightly more robust. This left me with 1,200 tracks. My commutes are varied. I frequently go from A to C via B and from C to A via D which is a loop (this is what is shown here). But I also go A to C via D, C to A via B and then I also often extend the commute to include 30 km of Warwickshire countryside. The tracks could be categorized by testing whether they began at A or C (this rejected some partial routes) and then testing whether they passed through B or D. These could then be plotted and checked visually for any routes which went off course; there were none. The key here is to pick the right B and D points.

To calculate the differences in elevation, the simplest thing was to get GPS Visualiser to tell me what the elevation should be for all the points I had. I was surprised that the API could do half a million points without complaining. This was sufficient to do the rest. Note that the comparisons needed to be done as lat/lon versus elevation, because differences in speed, time or trackpoint number lead to inherent differences in lat/lon (and elevation). Note also that, due to the small scale, I didn’t bother converting lat/lon into flat earth kilometres.
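The categorisation step can be sketched in Python as follows. This is an illustration of the logic described above, not the Igor code: it assumes each track is a list of (lat, lon) tuples, that the waypoints A to D are supplied in a dictionary, and that a tolerance in degrees is good enough at this scale. The names `near` and `categorise` and the default tolerance are hypothetical.

```python
import math

def near(p, q, tol):
    """True if two (lat, lon) points are within tol degrees of each other."""
    return math.hypot(p[0] - q[0], p[1] - q[1]) <= tol

def categorise(track, waypoints, tol=0.001):
    """Label a track by where it starts (A or C) and which waypoint it passes (B or D).

    Returns a string such as "A to C via B", or None if the track does not
    start at A or C, or passes through neither B nor D.
    """
    start = track[0]
    if near(start, waypoints["A"], tol):
        route = "A to C"
    elif near(start, waypoints["C"], tol):
        route = "C to A"
    else:
        return None  # partial route, rejected
    for name in ("B", "D"):
        if any(near(p, waypoints[name], tol) for p in track):
            return f"{route} via {name}"
    return None
```

The tolerance matters for the same reason that picking the right B and D points does: too tight and good tracks are rejected, too loose and the two routes are confused.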

The post title comes from “Elevation” by Television, which can be found on the classic “Marquee Moon” LP.

## Colours Running Out: Analysis of 2016 running

Towards the end of 2015, I started distance running. I thought it’d be fun to look at the frequency of my runs over the course of 2016.

Most of my runs were recorded with a GPS watch. I log my cycling data using Rubitrack, so I just added my running data to this. This software is great but to do any serious number crunching, other software is needed. Yes, I know that if I used Strava I could do lots of things with my data… but I don’t. I also know that there are tools for R to do this, but I wrote something in Igor instead. The GitHub repo is here. There’s a technical description below, as well as some random thoughts on running (and cycling).

The animation shows the tracks I recorded as 2016 rolled by. The routes won’t mean much to you, but I can recognise most of them. You can see how I built up the distance to run a marathon and then how the runs became less frequent through late summer to October. I logged 975 km with probably another 50 km or so not logged.

Technical description

To pull the data out of rubiTrack 4 Pro is actually quite difficult since there is no automated export. An applescript did the job of going through all the run activities and exporting them as gpx. There is an API provided by Garmin to take the data straight from the FIT files recorded by the watch, but everything is saved and tagged in rubiTrack, so gpx is a good starting point. GPX is an xml format which can be read into Igor using XMLutils XOP written by andyfaff. Previously, I’ve used nokogiri for reading XML, but this XOP keeps everything within Igor. This worked OK, but I had some trouble with namespaces which I didn’t resolve properly and what is in the code is a slight hack. I wrote some code which imported all the files and then processed the time frame I wanted to look at. It basically looks at a.m. and p.m. for each day in the timeframe. Igor deals with date/time nicely and so this was quite easy. Two lookups per day were needed because I often went for two runs per day (run commuting). I set the lat/lon at the start of each track as 0,0. I used the new alpha tools in IP7 to fade the tracks so that they decay away over time. They disappear with 1/8 reduction in opacity over a four day period. Igor writes out to mov which worked really nicely, but wordpress can’t host movies, so I added a line to write out TIFFs of each frame of the animation and assembled a nice gif using FIJI.
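Two of those steps are easy to show in miniature. The Python sketch below is my reading of the description rather than the code in the repo: it assumes two frames per day (a.m. and p.m.), so that a 1/8 drop in opacity per frame makes a track vanish after four days, and it assumes each track is a list of (lat, lon) tuples. Both function names are made up.

```python
def track_opacity(frames_since_run, n_steps=8):
    """Opacity between 0 and 1 for a track drawn some frames after its run.

    The track is fully opaque on its own frame and loses 1/n_steps of its
    opacity on every later frame until it disappears.
    """
    if frames_since_run < 0:
        return 0.0  # the run has not happened yet
    return max(0.0, 1.0 - frames_since_run / n_steps)

def recentre(track):
    """Shift a track so that its first point sits at (0, 0)."""
    lat0, lon0 = track[0]
    return [(lat - lat0, lon - lon0) for lat, lon in track]
```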

Getting started with running

Getting into running was almost accidental. I am a committed cyclist and had always been of the opinion: since running doesn’t improve aerobic cycling performance (only cycling does that), any activity other than cycling is a waste of time. However, I realised that finding time for cycling was getting more difficult and also my goal is to keep fit and not to actually be a pro-cyclist, so running had to be worth a try. Roughly speaking, running is about three times more time efficient compared to cycling. One hour of running approximates to three hours of cycling. I thought, I would just try it. Over the winter. No more than that. Of course, I soon got the running bug and ran through most of 2016, taking part in a few running events (marathon, half marathons, 10K). Four quick notes on my experience.

1. The key thing to keeping running is staying healthy and uninjured. That means building up distance and frequency of running very slowly. In fact, the limitation to running is the body’s ability to actually do the distance. In cycling this is different: as long as you fuel adequately and you’re reasonably fit, you could cycle all day if you wanted. This is not true of running, and so, building up to doing longer distances is essential and the ramp up shouldn’t be rushed. Injuries will cost you lost weeks on a training schedule.
2. There’s lots of things “people don’t tell you” about running. Blisters and things everyone knows about, but losing a toenail during a 20 km run? Encountering runner’s GI problems? There’s lots of surprises as you start out. Joining a club or reading running forums probably helps (I didn’t bother!). In case you are wondering, the respective answers are getting decent shoes fitted and well, there is no cure.
3. Going from cycling to running meant going from very little upper body mass to gaining extra muscle. This means gaining weight. This is something of a shock to a cyclist and seems counterintuitive, since more activity should really equate to weight loss. I maintained cycling through the year, but was not expecting a gain of ~3 kilos.
4. As with any sport, having something to aim for is essential. Training for training’s sake can become pointless, so line up something to shoot for. Sign up for an event or at least have an achievement (distance, average speed) in your mind that you want to achieve.

So there you have it. I’ll probably continue to mix running with cycling in 2017. I’ll probably extend the repo to do more with cycling data if I have the time.

The post title is taken from “Colours Running Out” by TOY from their eponymous LP.

## Tips from the blog X: multi-line commenting in Igor

This is part-tip, part-adventures in code. I found out recently that it is possible to comment out multiple lines of code in Igor and thought I’d put this tip up here.

Multi-line commenting in programming is useful for two reasons:

1. writing comments (instructions, guidance) that last more than one line
2. the ability to temporarily remove a block of code while testing

In each computer language there is the ability to comment out at least one line of code.

In Igor this is “//”, which comments out the whole line, but no more.

This is the same as in ImageJ macro language.

Now, commenting out whole sections in FIJI/ImageJ is easy: insert “/*” where you want the comment to start, and then “*/” where it ends, multiple lines later.

I didn’t think this syntax was available in Igor, and it isn’t really. I was manually adding “//” for each line I wanted to remove, which was annoying. It turns out that you can use Edit > Commentize to add “//” to the start of all selected lines. The keyboard shortcut in IP7 is Cmd-/. You can reverse the process with Edit > Decommentize or Cmd-\.

There is actually another way. Igor can conditionally compile code. This is useful if for example you write for Igor 7 and Igor 6. You can get compilation of IP7 commands only if the user is running IP7 for example. This same logic can be used to comment out code as follows.

Put “#if 0” on the line before the block and “#endif” on the line after it. The condition “#if 0” is never satisfied, so the enclosed code does not compile. The equivalent statement for IP7-specific compilation is “#if igorversion()>=7”.

So there you have it, two ways to comment out code in Igor. These tips were from IgorExchange.

This post is part of a series of tips.

## Bateman Writes: Eye of the Tiger

I don’t often write about music at quantixed but I recently caught Survivor’s “Eye of The Tiger” on the radio and thought it deserved a quick post.

Surely everyone knows this song: a kind of catchall motivational tune. It is loved by people in gyms with beach-unready bodies and by presidential hopefuls without permission to use it.

Written specifically for Rocky III after Sylvester Stallone was refused permission by Queen to use “Another One Bites The Dust”, it has that 1980s middle-of-the-road hard-rock-but-not-heavy-metal feel to it. The kind of track that must be filed under “guilty pleasure”. Possibly you love this song. Maybe you get ready to meet your opponents whilst listening to it? If this is you, please don’t read on.

I find it difficult listening to this track because of the timing of the intro. Not sure what I mean?

Here is a waveform of one channel for the intro. Two of the opening phrases are shown underlined. A phrase in this case is: dun, dun-dun-dun, dun-dun-dun, dun-dun-durrrr. Can you see the problem with the second of those two phrases?

Still don’t see it? In the second phrase the second of the dun-dun-duns comes in late.

I’ve overlaid the waveform again to compare phrase 1 with phrase 2.

The difference is one eighth note (a quaver) and it drives me nuts. I think it’s intentional because, well, the whole band play the same thing. I don’t think it’s a tape splice error, because the track sounds live and surely someone must have noticed. Finally, they play these phrases again in the outro and at that point the timing is correct. No, it’s intentional. Why?

> I started doing that now-famous dead string guitar riff and started slashing those chords to the punches we saw on the screen, and the whole song took shape in the next three days.

So my best guess is that the notes were written to match the on-screen action!

The video on YouTube is only at 220 million views (at the time of writing). Give it a listen, if my description of dun-dun-dun’s was not illustrative enough for you.

Notes:

• The waveform is taken from the Eye of The Tiger album version of the song. I read that the version in the movie is actually the demo version.
• I loaded it into Igor using SoundLoadWave. I made an average of the stereo channels using MatrixOp and then downsampled the wave from 44.1 kHz so it was easier to move around.
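The processing in that last note can be sketched in Python. This is not what MatrixOp does internally, and the note does not say how the downsampling was done, so block averaging by a whole-number factor is an assumption. Both function names are invented, and the channels are assumed to be equal-length lists of samples.

```python
def mono_mix(left, right):
    """Average the two stereo channels sample by sample."""
    return [(l + r) / 2 for l, r in zip(left, right)]

def downsample(samples, factor):
    """Reduce the sample rate by averaging consecutive blocks of `factor` samples.

    Any leftover samples at the end that do not fill a block are dropped.
    """
    n_blocks = len(samples) // factor
    return [
        sum(samples[i * factor:(i + 1) * factor]) / factor
        for i in range(n_blocks)
    ]
```

For a waveform that is only being looked at, not listened to, this kind of crude averaging is usually enough to make the trace lighter to handle.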

A very occasional series on music. The name Bateman Writes refers to the obsessive writings of the character Patrick Bateman in Bret Easton Ellis’s novel American Psycho. This serial killer had a penchant for middle of the road rock act Huey Lewis & The News.

## The International Language of Screaming

A couple of recent projects have meant that I had to get to grips more seriously with R and with MATLAB. Regular readers will know that I am a die-hard IgorPro user. Trying to tackle a new IDE is a frustrating experience, as anyone who has tried to speak a foreign language will know. The speed with which you can do stuff (or get your point across) is very slow. Not only that, but… if you could just revert to your mother tongue it would be so much easier…

What I needed was something like a Babel Fish. As I’m sure you’ll know, this fish is the creation of Douglas Adams. It allows instant translation of any language. The only downside is that you have to insert the fish into your ear.

The closest thing to the Babel Fish in computing is the cheat sheet. These sheets are typically a huge list of basic commands that you’ll need as you get going. I found a nice page which had cheat sheets which allowed easy interchange between R, MATLAB and python. There was no Igor version. Luckily, a user on IgorExchange had taken the R and MATLAB page and added some Igor commands. This was good, but it was a bit rough and incomplete. I took this version, formatted it for GitHub flavored markdown, and made some edits.

The repo is here. I hope it’s useful for others. I learned a lot putting it together. If you are an experienced user of R, MATLAB or IGOR (or better still can speak one or more of these languages), please fork and make edits or suggest changes via GitHub issues, or by leaving a comment on this page if you are not into GitHub. Thanks!

R-MATLAB-IGOR-CheatSheet

Here is a little snapshot to whet your appetite. Bon appetit!

The post title is taken from “The International Language of Screaming” by Super Furry Animals from their Radiator LP. Released as a single, the flip-side had a version called NoK which featured the backing track to the single. Gruff sings the Welsh alphabet with no letter K.

## The Digital Cell: Statistical tests

Statistical hypothesis testing, commonly referred to as “statistics”, is a topic of consternation among cell biologists.

This is a short practical guide I put together for my lab. Hopefully it will be useful to others. Note that statistical hypothesis testing is a huge topic and one post cannot hope to cover everything that you need to know.

What statistical test should I do?

To figure out what statistical test you need to do, look at the table below. But before that, you need to ask yourself a few things.

• What are you comparing?
• What is n?
• What will the test tell you? What is your hypothesis?
• What will the p value (or other summary statistic) mean?

If you are not sure about any of these things, whichever test you do is unlikely to tell you much.

The most important question is: what type of data do you have? This will help you pick the right test.

• Measurement – most data you analyse in cell biology will be in this category. Examples are: number of spots per cell, mean GFP intensity per cell, diameter of nucleus, speed of cell migration…
  • Normally-distributed – this means it follows a “bell-shaped curve” otherwise called “Gaussian distribution”.
  • Not normally-distributed – data that doesn’t fit a normal distribution: skewed data, or better described by other types of curve.
• Binomial – this is data where there are two possible outcomes. A good example here in cell biology would be a mitotic index measurement (the proportion of cells in mitosis). A cell is either in mitosis or it is not.
• Other – maybe you have ranked or scored data. This is not very common in cell biology. A typical example here would be a scoring chart for a behavioural effect with agreed criteria (0 = normal, 5 = epileptic seizures). For a cell biology experiment, you might have a scoring system for a phenotype, e.g. fragmented Golgi (0 = is not fragmented, 5 = is totally dispersed). These arbitrary systems are not a good idea, especially if the person scoring is unblinded to the experimental procedure. Try to come up with an unbiased measurement procedure.

| What do you want to do? | Measurement (Normal) | Measurement (not Normal) | Binomial |
| --- | --- | --- | --- |
| Describe one group | Mean, SD | Median, IQR | Proportion |
| Compare one group to a value | One-sample t-test | Wilcoxon test | Chi-square |
| Compare two unpaired groups | Unpaired t-test | Wilcoxon-Mann-Whitney two-sample rank test | Fisher’s exact test or Chi-square |
| Compare two paired groups | Paired t-test | Wilcoxon signed rank test | McNemar’s test |
| Compare three or more unmatched groups | One-way ANOVA | Kruskal-Wallis test | Chi-square test |
| Compare three or more matched groups | Repeated-measures ANOVA | Friedman test | Cochran’s Q test |
| Quantify association between two variables | Pearson correlation | Spearman correlation | |
| Predict value from another measured variable | Simple linear regression | Nonparametric regression | Simple logistic regression |
| Predict value from several measured or binomial variables | Multiple linear (or nonlinear) regression | | Multiple logistic regression |

Modified from Table 37.1 (p. 298) in Intuitive Biostatistics by Harvey Motulsky, 1995 OUP.

What do “paired/unpaired” and “matched/unmatched” mean?

Most of the data you will get in cell biology is unpaired or unmatched. Individual cells are measured and you have, say, 20 cells in the control group and 18 different cells in the test group. These are unpaired (or unmatched in the case of more than one test group) because the cells are different in each group. If you had the same cell in two (or more) groups, the data would be paired (or matched). An example of a paired dataset would be where you have 10 cells that you treat with a drug. You take a measurement from each of them before treatment and a measurement after. So you have paired measurements: one for cell A before treatment, one after; one for cell B before and after, and so on.
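
To show what pairing changes in practice, here is a minimal sketch in Python (not Igor). It assumes made-up before/after values for five cells; the function name `paired_t` is hypothetical. The test works on the per-cell differences rather than on the two groups separately.

```python
import math
import statistics

def paired_t(before, after):
    """Return (t, degrees of freedom) for a paired t-test on two equal-length lists."""
    if len(before) != len(after):
        raise ValueError("paired data need one 'before' and one 'after' value per cell")
    # Pairing means we analyse the change within each cell
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1

# Hypothetical measurements for 5 cells, before and after drug treatment
before = [10.0, 12.0, 9.0, 11.0, 13.0]
after = [12.0, 15.0, 10.0, 13.0, 17.0]
t_stat, dof = paired_t(before, after)
print(f"t = {t_stat:.3f}, df = {dof}")
```

If the same numbers came from different cells in each group, this calculation would not be appropriate and an unpaired test would be needed.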

How to do some of these tests in IgorPRO

The examples below assume that you have values in waves called data0, data1, data2, and so on. Substitute your actual wave names.

Is it normally distributed?

The simplest way is to plot the data and see. You can plot out your data using Analysis>Histogram… or Analysis>Packages>Percentiles and BoxPlot… Another possibility is to look at skewness or kurtosis of the dataset (you can do this with WaveStats, see below).

However, if you only have a small number of measurements, or you want to be sure, you can do a test. There are several tests you can do (Kolmogorov-Smirnov, Jarque-Bera, Shapiro-Wilk). The easiest to do and most intuitive (in Igor) is Shapiro-Wilk.


StatsShapiroWilkTest data0



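If p < 0.05, the test rejects the hypothesis that the data come from a normal distribution, so treat them as not normally distributed. Note that p ≥ 0.05 does not prove normality, particularly when n is small. Statistical tests that assume normally distributed data are called parametric, while those that do not make this assumption are non-parametric.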
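
As a rough companion to looking at skewness and kurtosis, this Python sketch (not Igor) computes both from the central moments. It uses the simple population-moment definitions, which is an assumption: Igor's WaveStats or other packages may normalise slightly differently. The function name is hypothetical.

```python
import statistics

def skew_and_excess_kurtosis(values):
    """Return (skewness, excess kurtosis) using population central moments."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a normal distribution
    return skew, excess_kurtosis

# Hypothetical data: one symmetric set and one with a long right tail
symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 1, 1, 2, 10]
print(skew_and_excess_kurtosis(symmetric))
print(skew_and_excess_kurtosis(right_skewed))
```

Values near zero for both are what you would expect from normally distributed data; a clearly positive skewness points to a long right tail.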

Describe one group

To get the mean and SD (and lots of other statistics from your data):


WaveStats data0



To get the median and IQR:


StatsQuantiles/ALL data0



The mean and SD are also stored as variables (V_avg, V_sdev). StatsQuantiles calculates V_median, V_Q25, V_Q75, V_IQR, etc. Note that you can just get the median by typing Print StatsMedian(data0) or – in Igor 7 – Print median(data0). There is often more than one way to do something in Igor.
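
For comparison with other software, the same summary can be sketched in Python (not Igor) with the standard library. The `describe` helper and its dictionary keys are hypothetical. Quartile definitions vary between packages, so the "inclusive" method chosen here is an assumption and may not match Igor's values exactly.

```python
import statistics

def describe(values):
    """Return mean, sample SD, median and quartile-based spread for one group."""
    q25, _, q75 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
        "median": statistics.median(values),
        "q25": q25,
        "q75": q75,
        "iqr": q75 - q25,
    }

# Hypothetical measurements from one group
summary = describe([1, 2, 3, 4, 5, 6, 7, 8, 9])
for name, value in summary.items():
    print(name, value)
```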

Compare one group to a value

It is unlikely that you will need to do this. In cell biology, most of the time we do not have hypothetical values for comparison, we have experimental values from appropriate controls. If you need to do this:


StatsTTest/CI/T=1 data0



Compare two unpaired groups

Use this for normally distributed data where you have test versus control, with no other groups. For paired data, use the additional flag /PAIR.


StatsTTest/CI/T=1 data0,data1
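
To see what an unpaired t-test with Welch's correction for unequal variances computes, here is a minimal Python sketch (not Igor). The function name and data are hypothetical. It returns the t statistic and the Welch-Satterthwaite degrees of freedom only; turning these into a p value needs the t distribution, which is left to your statistics software.

```python
import math
import statistics

def welch_t(a, b):
    """Return (t, degrees of freedom) for an unpaired t-test with Welch's correction."""
    # Squared standard error of each group mean
    va = statistics.variance(a) / len(a)
    vb = statistics.variance(b) / len(b)
    t = (statistics.mean(a) - statistics.mean(b)) / math.sqrt(va + vb)
    # Welch-Satterthwaite approximation; usually not a whole number
    dof = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, dof

# Hypothetical control and test measurements from different cells
control = [1, 2, 3, 4, 5]
test = [2, 4, 6, 8, 10]
t_stat, dof = welch_t(control, test)
print(f"t = {t_stat:.3f}, df = {dof:.2f}")
```

The two groups do not need to be the same size, which suits the typical case of, say, 20 control cells and 18 test cells.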



For the non-parametric equivalent, use the command below. If n is large, the computation takes a long time; in that case use the additional flag /APRX=2. If the data are paired, use the additional flag /WSRT.


StatsWilcoxonRankTest/T=1/TAIL=4 data0,data1
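
The rank-based idea behind the Wilcoxon-Mann-Whitney test can be sketched in Python (not Igor) as follows. The helper names and the data are hypothetical, and only the U statistics are computed, not the p value.

```python
def average_ranks(values):
    """Rank values from 1 upwards, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover a run of tied values
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(a, b):
    """Return (U for group a, U for group b) from the pooled ranks."""
    ranks = average_ranks(list(a) + list(b))
    rank_sum_a = sum(ranks[:len(a)])
    u_a = rank_sum_a - len(a) * (len(a) + 1) / 2
    u_b = len(a) * len(b) - u_a
    return u_a, u_b

# Hypothetical unpaired measurements
print(mann_whitney_u([3, 5, 8], [4, 6, 9, 10]))
```

Because only the ordering of the values is used, a single extreme measurement has far less influence than it would on a t-test.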



For binomial data, your waves will have 2 points, where point 0 corresponds to one outcome and point 1 to the other. Note that you can compare to expected values here; for example, a genetic cross experiment can be compared to expected Mendelian frequencies. To do Fisher’s exact test, you need a 2D wave representing a contingency table. McNemar’s test for paired binomial data is not available in Igor.

StatsChiTest/S/T=1 data0,data1


If you have more than two groups, do not do multiple versions of these tests, use the correct method from the table.
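
As an illustration of comparing two-outcome counts with expected values, this Python sketch (not Igor) computes a chi-square statistic against a 3:1 Mendelian ratio. The counts and the function name are made up. With two outcomes there is one degree of freedom, for which the p value can be written exactly using the complementary error function.

```python
import math

def chi_square_two_outcomes(observed, expected):
    """Return (chi-square, p value) for observed vs expected counts of two outcomes."""
    if len(observed) != 2 or len(expected) != 2:
        raise ValueError("this sketch handles exactly two outcomes")
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper tail of the chi-square distribution with 1 degree of freedom
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Hypothetical genetic cross: 400 offspring, 3:1 ratio expected
observed = [290, 110]
expected = [300, 100]
chi2, p = chi_square_two_outcomes(observed, expected)
print(f"chi-square = {chi2:.3f}, p = {p:.3f}")
```

The expected values must be counts that add up to the same total as the observed counts, not proportions.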

Compare three or more unmatched groups

For normally-distributed data, you need to do a 1-way ANOVA followed by a post-hoc test. The ANOVA will tell you whether there are any differences among the groups, and so whether it is worth investigating further. The post-hoc test then tells you which groups are different. There are several tests available, e.g. Dunnett’s is useful where you have one control value and a bunch of test conditions. We tend to use Tukey’s post-hoc comparison (the /NK flag also runs the Newman-Keuls test).


StatsAnova1Test/T=1/Q/W/BF data0,data1,data2,data3
StatsTukeyTest/T=1/Q/NK data0,data1,data2,data3
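
To make clear what the ANOVA is testing, here is a minimal Python sketch (not Igor) of the F statistic: the variation between group means divided by the variation within groups. The function name and the three groups are hypothetical, and the post-hoc comparison is not included.

```python
import statistics

def one_way_anova_f(*groups):
    """Return (F, df between groups, df within groups) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (statistics.fmean(g) - grand_mean) ** 2 for g in groups)
    # Variation of individual values around their own group mean
    ss_within = sum(sum((x - statistics.fmean(g)) ** 2 for x in g) for g in groups)
    df_between = k - 1
    df_within = n_total - k
    f = (ss_between / df_between) / (ss_within / df_within)
    return f, df_between, df_within

# Hypothetical measurements for a control and two test conditions
f_stat, df_b, df_w = one_way_anova_f([1, 2, 3], [2, 3, 4], [5, 6, 7])
print(f"F({df_b}, {df_w}) = {f_stat:.2f}")
```

A large F says that at least one group differs from the others, but not which one. That is the job of the post-hoc test.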



The non-parametric equivalent is the Kruskal-Wallis test followed by a multiple comparison test. Here the Dunn-Holland-Wolfe method is used (the /DHW flag).


StatsKWTest/T=1/Q data0,data1,data2,data3
StatsNPMCTest/T=1/DHW/Q data0,data1,data2,data3



Compare three or more matched groups

It’s unlikely that this kind of data will be obtained in a typical cell biology experiment.

StatsANOVA2RMTest/T=1 data0,data1,data2,data3


There are also operations for StatsFriedmanTest and StatsCochranTest.

Correlation

This is a straightforward command for two waves or one 2D wave. Waves (or columns) must be of the same length.


StatsCorrelation data0
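
For reference, the Pearson correlation coefficient can be sketched in Python (not Igor) as below. The function name and data are hypothetical. The Spearman correlation from the table is the same calculation applied to the ranks of the values instead of the values themselves.

```python
import math

def pearson_r(x, y):
    """Return Pearson's correlation coefficient for two equal-length lists."""
    if len(x) != len(y):
        raise ValueError("both variables need the same number of points")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical paired observations of two variables
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(f"r = {pearson_r(x, y):.3f}")
```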



At this point, you probably want to plot out the data and use Igor’s fitting functions. The best way to get started is with the example experiment, or just display your data and Analysis>Curve Fitting…

Hazard and survival data

In the lab we have, in the past, done survival/hazard analysis. This is a bit more complex and we used SPSS and would do so again as Igor does not provide these functions.

Notes for use

The good news is that all of this is a lot more intuitive in Igor 7! There is a new Menu item called Statistics, where most of these functions have a dialog with more information. In Igor 6.3 you are stuck with the command line. Igor 7 will be out soon (July 2016).

• Note that there are further options to most of these commands. If you need to see them:
  • check the manual or Igor Help
  • or type ShowHelpTopic “StatsMedian” in the Command Window (put whatever command you want help with between the quotes).
• Extra options are specified by “flags”. These are things like “/Q” that come after the command. For example, /Q means “quiet” i.e. don’t print the output into the history window.
• You should always either print the results to the history or put them into a table so that we can check them. Note that the table gets overwritten if you do the same test with different data, so printing in this case is a good idea.
• The defaults in Igor are set up OK for our needs. For example, Igor does two-tailed comparison, alpha = 0.05, Welch’s correction, etc.
• Most operations can handle waves of different length (or have flags set to handle this case).
• If you are used to doing statistical tests in Excel, you might be wondering about tails and equal variances. The flags are set in the examples to do two-tailed analysis and unequal variances are handled by Welch’s correction.
• There’s a school of thought that says that, to be cautious, it is best to always use non-parametric tests. However, these tests are not as powerful, so it is best to use parametric tests (t test, ANOVA) when you can.

Part of a series on the future of cell biology in quantitative terms.