## Pentagrammarspin: why twelve pentagons?

This post has been in my drafts folder for a while. With the World Cup here, it’s time to post it!

It’s a rule that a 3D assembly of hexagons must have at least twelve pentagons in order to be a closed polyhedral shape. This post takes a look at why this is true.

First, some examples from nature. The stinkhorn fungus Clathrus ruber, has a largely hexagonal layout, with pentagons inserted. The core of HIV has to contain twelve pentagons (shown in red, in this image from the Briggs group) amongst many hexagonal units. My personal favourite, the clathrin cage, can assemble into many buckminsterfullerene-like shapes, but all must contain at least twelve pentagons with a variable number of hexagons.

The case of clathrin is particularly interesting because clathrin triskelia can assemble into a flat hexagonal lattice on membranes. If clathrin is going to coat a vesicle, that means 12 pentagons need to be introduced. So there needs to be quite a bit of rearrangement in order to do this.

You can see the same rule in everyday objects. The best example is a football, or soccer ball, if you are reading in the USA.

The classic design of football has precisely twelve pentagons and twenty hexagonal panels. The roadsign for football stadia here in the UK shows a weirdly distorted hexagonal array that has no pentagons. 22,543 people signed a petition to pressurise the authorities to change it, but the Government responded that it was too costly to correct this geometrical error.

So why do all of these assemblies have 12 pentagons?

In the classic text “On Growth and Form” by D’Arcy Wentworth Thompson, polyhedral forms in nature are explored in some detail. In the wonderfully titled On Concretions, Specules etc. section, the author notes polyhedral forms in natural objects.

One example is Dorataspis, shown left. The layout is identical to the D6 hexagonal barrel assembly of a clathrin cage shown above. There is a belt of six hexagons, one at the top, one at the bottom (eight total) and twelve pentagons between the hexagons. In the book, there is an explanation of the maths behind why there must be twelve pentagons in such assemblies, but it’s obfuscated in bizarre footnotes in latin. I’ll attempt to explain it below.

To shed some light on this we need the help of Euler’s formulae. The surface of a polyhedron in 3D is composed of faces, edges and vertices. If we think back to the football the faces are the pentagons and hexgonal panels, the edges are the stitching where two panels meet and the vertices are where three edges come together. We can denote faces, edges and vertices as f, e and v, respectively. These are 2D, 1D and zero-dimensional objects, respectively. Euler’s formula which is true for all polyhedra is:

$$f – e + v = 2$$

If you think about a cube, it has six faces. It has 12 edges and 8 vertices. So, 6 – 12 + 8 = 2. We can also check out a the football above. This has 32 faces (twelve pentagons, twenty hexagons), 90 edges and sixty vertices. 32 – 90 + 60 = 2. Feel free to check it with other polyhedra!

Euler found a second formula which is true for polyhedra where three edges come together at a vertex.

$$\sum (6-n)f_{n} = 12$$

in this formula, $$f_{n}$$ means number of n-gons.

So let’s say we have dodecahedron, which is a polyhedron made of 12 pentagons. So $$n$$ = 5 and $$f_{n}$$ = 12, and you can see that $$(6-5)12 = 12$$.

Let’s take a more complicated object, like the football. Now we have:

$$((6-6)20) + ((6-5)12) = 12$$

You can now see why the twelve pentagons are needed. Because 6-6 = 0, we can add as many hexagons as we like, this will add nothing to the left hand side. As long as the twelve pentagons are there, we will have a polyhedron. Without them we don’t. This is the answer to why there must be twelve pentagons in a closed polyhedral assembly.

So how did Euler get to the second equation? You might have spotted this yourself for the f, e, v values for the football. Did you notice that the ratio of edges to vertices is 3:2? This is because each edge has two vertices at either end (it is a 1D object) and remember we are dealing with polyhedra with three edges at each vertex. so $$v = \frac{2}{3}e$$. Also, each edge is at the boundary of two polygons. So $$e = \frac{1}{2}\sum n f_{n}$$. You can check that with the values for the cube or football above. We know that $$f = \sum f_{n}$$, this just means that the number of faces is the sum of all the faces of all n-gons. This means that:

$$f – e + v = 2$$

Can be turned into

$$f – (1/3)e = \sum n f_{n} – \frac{1}{6}\sum n f_{n} = 2$$

Let’s multiply by 6 to get, oh yes

$$\sum (6-n)f_{n} = 12$$

There are some topics for further exploration here:

• You can add 0, 2 or 10000 hexagons to 12 pentagons to make a polyhedron, but can you add just one?
• What happens when you add a few heptagons into the array?

Image credits (free-to-use/wiki or):

Clathrus ruber – tineye search didn’t find source.

HIV cores – Briggs Group

Exploded football – Quora

The post title comes from “Pentagrammarspin” by Steve Hillage from the 2006 remaster of his LP Fish Rising

## Esoteric Circle

Many projects in the lab involve quantifying circular objects. Microtubules, vesicles and so on are approximately circular in cross section. This quick post is about how to find the diameter of these objects using a computer.

So how do you measure the diameter of an object that is approximately circular? Well, if it was circular you would measure the distance from one edge to the other, crossing the centre of the object. It doesn’t matter along which axis you do this. However, since these objects are only approximately circular, it matters along which axis you measure. There are a couple of approaches that can be used to solve this problem.

Principal component analysis

The object is a collection of points* and we can find the eigenvectors and eigenvalues of these points using principal component analysis. This was discussed previously here. The 1st eigenvector points along the direction of greatest variance and the 2nd eigenvector is normal to the first. The order of eigenvectors is determined by their eigenvalues. We use these to rotate the coordinate set and offset to the origin.

Now the major axis of the object is aligned to the x-axis at y=0 and the minor axis is aligned with the y-axis at x=0 (compare the plot on the right with the one on the left, where the profiles are in their original orientation – offset to zero). We can then find the absolute values of the axis crossing points and when added together these represent the major axis and minor axis of the object. In Igor, this is done using a oneliner to retrieve a rotated set of coords as the wave M_R.

PCA/ALL/SEVC/SRMT/SCMT xCoord,yCoord

To find the crossing points, I use Igor’s interpolation-based level crossing functions. For example, storing the aggregated diameter in a variable called len.

FindLevel/Q/EDGE=1/P m1c0, 0
len = abs(m1c1(V_LevelX))
FindLevel/Q/EDGE=2/P m1c0, 0
len += abs(m1c1(V_LevelX))

This is just to find one axis (where m1c0 and m1c1 are the 1st and 2nd columns of a 2-column wave m1) and so you can see it is a bit cumbersome.

Anyway, I was quite happy with this solution. It is unbiased and also tells us how approximately circular the object is (because the major and minor axes tell us the aspect ratio or ellipticity of the object). I used it in Figure 2 of this paper to show the sizes of the coated vesicles. However, in another project we wanted to state what the diameter of a vesicle was. Not two numbers, just one. How do we do that? We could take the average of the major and minor axes, but maybe there’s an easier way.

Polar coordinates

The distance from the centre to every point on the edge of the object can be found easily by converting the xy coordinates to polar coordinates. To do this, we first find the centre of the object. This is the centroid $$(\bar{x},\bar{y})$$ represented by

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_{i}$$ and $$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_{i}$$

for n points and subtract this centroid from all points to arrange the object around the origin. Now, since the xy coords are represented in polar system by

$$x_{i} = r_{i}\cos(\phi)$$ and $$y_{i} = r_{i}\sin(\phi)$$

we can find r, the radial distance, using

$$r_{i} = \sqrt{x_{i}^{2} + y_{i}^{2}}$$

With those values we can then find the average radial distance and report that.

There’s something slightly circular (pardon the pun) about this method because all we are doing is minimising the distance to a central point initially and then measuring the average distance to this minimised point in the latter step. It is much faster than the PCA approach and would be insensitive to changes in point density around the object. The two methods would probably diverge for noisy images. Again in Igor this is simple:

Make/O/N=(dimsize(m1,0)-1)/FREE rW

rW[] = sqrt(m1[p][0]^2 + m1[p][1]^2)

len = 2 * mean(rW)
Here again, m1 is the 2-column wave of coords and the diameter of the object is stored in len.

How does this compare with the method above? The answer is kind of obvious, but it is equidistant between the major and minor axes. Major axis is shown in red and minor axis shown in blue compared with the mean radial distance method (plotted on the y-axis). In places there is nearly a 10 nm difference which is considerable for objects which are between 20 and 35 nm in diameter. How close is it to the average of the major and minor axis? Those points are in black and they are very close but not exactly on y=x.

So for simple, approximately circular objects with low noise, the ridiculously simple polar method gives us a single estimate of the diameter of the object and this is much faster than the more complex methods above. For more complicated shapes and noisy images, my feeling is that the PCA approach would be more robust. The two methods actually tell us two subtly different things about the shapes.

Why don’t you just measure them by hand?

In case there is anyone out there wondering why a computer is used for this rather than a human wielding the line tool in ImageJ… there are two good reasons.

1. There are just too many! Each image has tens of profiles and we have hundreds of images from several experiments.
2. How would you measure the profile manually? This approach shows two unbiased methods that don’t rely on a human to draw any line across the object.

* = I am assuming that the point set is already created.

The post title is taken from “Esoteric Circle” by Jan Garbarek from the LP of the same name released in 1969. The title fits well since this post is definitely esoteric. But maybe someone out there is interested!

## The Second Arrangement

To validate our analyses, I’ve been using randomisation to show that the results we see would not arise due to chance. For example, the location of pixels in an image can be randomised and the analysis rerun to see if – for example – there is still colocalisation. A recent task meant randomising live cell movies in the time dimension, where two channels were being correlated with one another. In exploring how to do this automatically, I learned a few new things about permutations.

Here is the problem: If we have two channels (fluorophores), we can test for colocalisation or cross-correlation and get a result. Now, how likely is it that this was due to chance? So we want to re-arrange the frames of one channel relative to the other such that frame i of channel 1 is never paired with frame i of channel 2. This is because we want all pairs to be different to the original pairing. It was straightforward to program this, but I became interested in the maths behind it.

The maths: Rearranging n objects is known as permutation, but the problem described above is known as Derangement. The number of permutations of n frames is n!, but we need to exclude cases where the ith member stays in the ith position. It turns out that to do this, you need to use the principle of inclusion and exclusion. If you are interested, the solution boils down to

$$n!\sum_{k=0}^{n}\frac{(-1)^k}{k!}$$

Which basically means: for n frames, there are n! number of permutations, but you need to subtract and add diminishing numbers of different permutations to get to the result. Full description is given in the wikipedia link. Details of inclusion and exclusion are here.

I had got as far as figuring out that the ratio of permutations to derangements converges to e. However,  you can tell that I am not a mathematician as I used brute force calculation to get there rather than write out the solution. Anyway, what this means in a computing sense, is that if you do one permutation, you might get a unique combination, with two you’re very likely to get it, and by three you’ll certainly have it.

Back to the problem at hand. It occurred to me that not only do we not want frame i of channel 1 paired with frame i of channel 2 but actually it would be preferable to exclude frames i ± 2, let’s say. Because if two vesicles are in the same location at frame i they may also be colocalised at frame i-1 for example. This is more complex to write down because for frames 1 and 2 and frames n and n-1, there are fewer possibilities for exclusion than for all other frames. For all other frames there are n-5 legal positions. This obviously sets a lower limit for the number of frames capable of being permuted.

The answer to this problem is solved by rook polynomials. You can think of the original positions of frames as columns on a n x n chess board. The rows are the frames that need rearranging, excluded positions are coloured in. Now the permutations can be thought of as Rooks in a chess game (they can move horizontally or vertically but not diagonally). We need to work out how many arrangements of Rooks are possible such that there is one rook per row and such that no Rook can take another.

If we have an 7 frame movie, we have a 7 x 7 board looking like this (left). The “illegal” squares are coloured in. Frame 1 must go in position D,E,F or G, but then frame 2 can only go in E, F or G. If a rook is at E1, then we cannot have a rook at E2. And so on.

To calculate the derangements:

$$1 + 29 x + 310 x^2 + 1544 x^3 + 3732 x^4 + 4136 x^5 + 1756 x^6 + 172 x^7$$

This is a polynomial expansion of this expression:

$$R_{m,n}(x) = n!x^nL_n^{m-n}(-x^{-1})$$

where $$L_n^\alpha(x)$$ is an associated Laguerre polynomial. The solution in this case is 8 possibilities. From 7! = 5040 permutations. Of course our movies have many more frames and so the randomisation is not so limited. In this example, frame 4 can only either go in position A or G.

Why is this important? The way that the randomisation is done is: the frames get randomised and then checked to see if any “illegal” positions have been detected. If so, do it again. When no illegal positions are detected, shuffle the movie accordingly. In the first case, the computation time per frame is constant, whereas in the second case it could take much longer (because there will be more rejections). In the case of 7 frames, with the restriction of no frames at i ±2, then the failure rate is 5032/5040 = 99.8%. Depending on how the code is written, this can cause some (potentially lengthy) wait time. Luckily, the failure rate comes down with more frames.

What about it practice? The numbers involved in directly calculating the permutations and exclusions quickly becomes too big using non-optimised code on a simple desktop setup (a 12 x 12 board exceeds 20 GB). The numbers and rates don’t mean much, what I wanted to know was whether this slows down my code in a real test. To look at this I ran 100 repetitions of permutations of movies with 10-1000 frames. Whereas with the simple derangement problem permutations needed to be run once or twice, with greater restrictions, this means eight or nine times before a “correct” solution is found. The code can be written in a way that means that this calculation is done on a placeholder wave rather than the real data and then applied to the data afterwards. This reduces computation time. For movies of around 300 frames, the total run time of my code (which does quite a few things besides this) is around 3 minutes, and I can live with that.

So, applying this more stringent exclusion will work for long movies and the wait times are not too bad. I learned something about combinatorics along the way. Thanks for reading!

Further notes

The first derangement issue I mentioned is also referred to as the hat-check problem. Which refers to people (numbered 1,2,3 … n) with corresponding hats (labelled 1,2,3 … n). How many ways can they be given the hats at random such that they do not get their own hat?

Adding i+1 as an illegal position is known as problème des ménages. This is a problem of how to seat married couples so that they sit in a man-woman arrangement without being seated next to their partner. Perhaps i ±2 should be known as the vesicle problem?

The post title comes from “The Second Arrangement” by Steely Dan. An unreleased track recorded for the Gaucho sessions.

## Realm of Chaos

Caution: this post is for nerds only.

I watched this numberphile video last night and was fascinated by the point pattern that was created in it. I thought I would quickly program my own version to recreate it and then look at patterns made by more points.

I didn’t realise until afterwards that there is actually a web version of the program used in the video here. It is a bit limited though so my code was still worthwhile.

A fractal triangular pattern can be created by:

1. Setting three points
2. Picking a randomly placed seed point
3. Rolling a die and going halfway towards the result
4. Repeat last step

If the first three points are randomly placed the pattern is skewed, so I added the ability to generate an equilateral triangle. Here is the result.

and here are the results of a triangle through to a decagon.

All of these are generated with one million points using alpha=0.25. The triangle, pentagon and hexagon make nice patterns but the square and polygons with more than six points make pretty uninteresting patterns.

Watching the creation of the point pattern from a triangular set is quite fun. This is 30000 points with a frame every 10 points.

Here is the code.

Some other notes: this version runs in IgorPro. In my version, the seed is set at the centre of the image rather than a random location. I used the random allocation of points rather than a six-sided dice.

The post title is taken from the title track from Bolt Thrower’s “Realm of Chaos”.

## The Digital Cell: Statistical tests

Statistical hypothesis testing, commonly referred to as “statistics”, is a topic of consternation among cell biologists.

This is a short practical guide I put together for my lab. Hopefully it will be useful to others. Note that statistical hypothesis testing is a huge topic and one post cannot hope to cover everything that you need to know.

What statistical test should I do?

To figure out what statistical test you need to do, look at the table below. But before that, you need to ask yourself a few things.

• What are you comparing?
• What is n?
• What will the test tell you? What is your hypothesis?
• What will the p value (or other summary statistic) mean?

If you are not sure about any of these things, whichever test you do is unlikely to tell you much.

The most important question is: what type of data do you have? This will help you pick the right test.

• Measurement – most data you analyse in cell biology will be in this category. Examples are: number of spots per cell, mean GFP intensity per cell, diameter of nucleus, speed of cell migration…
• Normally-distributed – this means it follows a “bell-shaped curve” otherwise called “Gaussian distribution”.
• Not normally-distributed – data that doesn’t fit a normal distribution: skewed data, or better described by other types of curve.
• Binomial – this is data where there are two possible outcomes. A good example here in cell biology would be a mitotic index measurement (the proportion of cells in mitosis). A cell is either in mitosis or it is not.
• Other – maybe you have ranked or scored data. This is not very common in cell biology. A typical example here would be a scoring chart for a behavioural effect with agreed criteria (0 = normal, 5 = epileptic seizures). For a cell biology experiment, you might have a scoring system for a phenotype, e.g. fragmented Golgi (0 = is not fragmented, 5 = is totally dispersed). These arbitrary systems are a not a good idea. Especially, if the person scoring is unblinded to the experimental procedure. Try to come up with an unbiased measurement procedure.

 What do you want to do? Measurement (Normal) Measurement (not Normal) Binomial Describe one group Mean, SD Median, IQR Proportion Compare one group to a value One-sample t-test Wilcoxon test Chi-square Compare two unpaired groups Unpaired t-test Wilcoxon-Mann-Whitney two-sample rank test Fisher’s exact test or Chi-square Compare two paired groups Paired t-test Wilcoxon signed rank test McNemar’s test Compare three or more unmatched groups One-way ANOVA Kruskal-Wallis test Chi-square test Compare three or more matched groups Repeated-measures ANOVA Friedman test Cochran’s Q test Quantify association between two variables Pearson correlation Spearman correlation Predict value from another measured variable Simple linear regression Nonparametric regression Simple logistic regression Predict value from several measured or binomial variables Multiple linear (or nonlinear) regression Multiple logistic regression

Modified from Table 37.1 (p. 298) in Intuitive Biostatistics by Harvey Motulsky, 1995 OUP.

What do “paired/unpaired” and “matched/unmatched” mean?

Most of the data you will get in cell biology is unpaired or unmatched. Individual cells are measured and you have say, 20 cells in the control group and 18 different cells in the test group. These are unpaired (or unmatched in the case of more than one test group) because the cells are different in each group. If you had the same cell in two (or more) groups, the data would be paired (or matched). An example of a paired dataset would be where you have 10 cells that you treat with a drug. You take a measurement from each of them before treatment and a measurement after. So you have paired measurements: one for cell A before treatment, one after; one for cell B before and after, and so on.

How to do some of these tests in IgorPRO

The examples below assume that you have values in waves called data0, data1, data2,… substitute the wavenames for your actual wave names.

Is it normally distributed?

The simplest way is to plot them and see. You can plot out your data using Analysis>Histogram… or Analysis>Packages>Percentiles and BoxPlot… Another possibility is to look at skewness or kurtosis of the dataset (you can do this with WaveStats, see below)

However, if you only have a small number of measurements, or you want to be sure, you can do a test. There are several tests you can do (Kolmogorov-Smirnoff, Jarque-Bera, Shapiro-Wilk). The easiest to do and most intuitive (in Igor) is Shapiro-Wilk.


StatsShapiroWilkTest data0



If p < 0.05 then the data are not normally distributed. Statistical tests on normally distributed data are called parametric, while those on non-normally distributed data are non-parametric.

Describe one group

To get the mean and SD (and lots of other statistics from your data):


Wavestats data0



To get the median and IQR:


StatsQuantiles/ALL data0



The mean and sd are also stored as variables (V_avg, V_sdev). StatsQuantiles calculates V_median, V_Q25, V_Q75, V_IQR, etc. Note that you can just get the median by typing Print StatsMedian(data0) or – in Igor7 – Print median(data0). There is often more than one way to do something in Igor.

Compare one group to a value

It is unlikely that you will need to do this. In cell biology, most of the time we do not have hypothetical values for comparison, we have experimental values from appropriate controls. If you need to do this:


StatsTTest/CI/T=1 data0



Compare two unpaired groups

Use this for normally distributed data where you have test versus control, with no other groups. For paired data, use the additional flag /PAIR.


StatsTTest/CI/T=1 data0,data1



For the non-parametric equivalent, if n is large computation takes a long time. Use additional flag /APRX=2. If the data are paired, use the additional flag /WSRT.


StatsWilcoxonRankTest/T=1/TAIL=4 data0,data1



For binomial data, your waves will have 2 points. Where point 0 corresponds to one outcome and point 1, the other. Note that you can compare to expected values here, for example a genetic cross experiment can be compared to expected Mendelian frequencies. To do Fisher’s exact test, you need a 2D wave representing a contingency table. McNemar’s test for paired binomial data is not available in Igor

StatsChiTest/S/T=1 data0,data1


If you have more than two groups, do not do multiple versions of these tests, use the correct method from the table.

Compare three or more unmatched groups

For normally-distributed data, you need to do a 1-way ANOVA followed by a post-hoc test. The ANOVA will tell you if there are any differences among the groups and if it is possible to investigate further with a post-hoc test. You can discern which groups are different using a post-hoc test. There are several tests available, e.g. Dunnet’s is useful where you have one control value and a bunch of test conditions. We tend to use Tukey’s post-hoc comparison (the /NK flag also does Newman-Keuls test).


StatsAnova1Test/T=1/Q/W/BF data0,data1,data2,data3
StatsTukeyTest/T=1/Q/NK data0,data1,data2,data3



The non-parametric equivalent is Kruskal-Wallis followed by a multiple comparison test. Dunn-Holland-Wolfe method is used.


StatsKSTest/T=1/Q data0,data1,data2,data3
StatsNPMCTest/T=1/DHW/Q data0,data1,data2,data3



Compare three or more matched groups

It’s unlikely that this kind of data will be obtained in a typical cell biology experiment.

StatsANOVA2RMTest/T=1 data0,data1,data2,data3


There are also operations for StatsFriedmanTest and StatsCochranTest.

Correlation

Straightforward command for two waves or one 2D wave. Waves (or columns) must be of the same length


StatsCorrelation data0



At this point, you probably want to plot out the data and use Igor’s fitting functions. The best way to get started is with the example experiment, or just display your data and Analysis>Curve Fitting…

Hazard and survival data

In the lab we have, in the past, done survival/hazard analysis. This is a bit more complex and we used SPSS and would do so again as Igor does not provide these functions.

Notes for use

The good news is that all of this is a lot more intuitive in Igor 7! There is a new Menu item called Statistics, where most of these functions have a dialog with more information. In Igor 6.3 you are stuck with the command line. Igor 7 will be out soon (July 2016).

• Note that there are further options to most of these commands, if you need to see them
• check the manual or Igor Help
• or type ShowHelpTopic “StatsMedian” in the Command Window (put whatever command you want help with between the quotes).
• Extra options are specified by “flags”, these are things like “/Q” that come after the command. For example, /Q means “quiet” i.e. don’t print the output into the history window.
• You should always either print the results to the history or put them into a table so that we can check them. Note that the table gets over written if you do the same test with different data, so printing in this case is a good idea.
• The defaults in Igor are setup OK for our needs. For example, Igor does two-tailed comparison, alpha = 0.05, Welch’s correction, etc.
• Most operations can handle waves of different length (or have flags set to handle this case).
• If you are used to doing statistical tests in Excel, you might be wondering about tails and equal variances. The flags are set in the examples to do two-tailed analysis and unequal variances are handled by Welch’s correction.
• There’s a school of thought that says that using non-parametric tests is best to be cautious. These tests are not as powerful and so it is best to use parametric tests (t test, ANOVA) when you can.

Part of a series on the future of cell biology in quantitative terms.

## Division Day: using PCA in cell biology

In this post I’ll describe a computational method for splitting two sides of a cell biological structure. It’s a simple method that relies on principal component analysis, otherwise known as PCA. Like all things mathematical there are some great resources on the web, if you want to understand this operation in more detail (for example, this great post by Lior Pachter). PCA can applied to many biological problems, you’ve probably seen it used to find patterns in large data sets, e.g. from proteomic studies. It can also be useful for analysing microscopy data. Since our analysis using this method is unlikely to make it into print any time soon, I thought I’d put it up on Quantixed.

During mitosis, a cell forms a mitotic spindle to share copied chromosomes equally to the two new cells. Our lab is working on how this process works and how it goes wrong in cancer. The chromosomes attach to the spindle via kinetochores and during prometaphase they are moved to the middle of the cell. Here, the chromosomes are organised into a disc-like structure called the metaphase plate. The disc is thin in the direction of the spindle axis, but much larger in width and height. To examine the spatial distribution of kinetochores on the plate we wanted a way to approximately separate kinetochores on one side if the plate from the other.

Kinetochores can be easily detected in 3D confocal images of mitotic cells by particle analysis. Kinetochores are easily stained and appear as bright spots that a computer can pick out (we use Imaris for this). The cartesian coordinates of each detected kinetochore were saved as csv and fed into IgorPro. A procedure could then be run which works in three steps. The code is shown at the bottom, it is wrapped in further code that deals with multiple datasets from many cells/experiments etc. The three steps are:

1. PCA
2. Point-to-plane
3. Analysis on each subset

I’ll describe each step and how it works.

1. Principal component analysis

This is used to find the 3rd eigenvector, which can be used to define a plane passing through the centre of the plate. This plane is used for division.

Now, because the metaphase plate is a disc it has three dimensions, the third of which – “thickness” – is the smallest. PCA will find the principal component, i.e. the direction in which there is most variance. Orthogonal to that is the second biggest variance and orthogonal to that direction is the smallest. These directions are called eigenvectors and their magnitude is the eigenvalue. As there are three dimensions to the data we can get all three eigenvectors out and the 3rd eigenvector corresponds to thickness of the metaphase plate. Metaphase plates in cells grown on coverslips are orientated similarly, but the cells themselves are at random orientations. PCA takes no notice of this and can simply reveal the direction of the smallest dimension of a 3D structure. The movie shows this in action for a simulated data set. The black spots are arranged in a disk shape about the origin. They are rotated about x by 45° (the blue spots). We then run PCA and show the eigenvectors as unit vectors (red lines). The 3rd eigenvector is normal to the plane of division, i.e. the 1st and 2nd eigenvectors lie on the plane of division.

Also, the centroid needs to be defined. This is simply the cartesian coordinates for the average of each dimension. It is sometimes referred to as the mean vector. In the example this was the origin, in reality this will depend on the position and the overall height of the cell.

A much longer method to get the eigenvectors is to define the variance-covariance matrix (sometimes called the dispersion matrix) for each dimension, for all kinetochores and then do an eigenvector decomposition on the matrix. PCA is one command, whereas the matrix calculation would be an extra loop followed by an additional command.

2. Point-to-plane

The distance of each kinetochore to the plane that we defined is calculated. If it is a positive value then the kinetochore lies on the same side as the normal vector (defined above). If it is negative then it is on the other side. The maths behind how to do this are in section 10.3.1 of Geometric Tools for Computer Graphics by Schneider & Eberly (starting on p. 374). Google it, there is a PDF version on the web. I’ll save you some time, you just need one equation that defines a plane,

$$ax+by+cz+d=0$$

Where the unit normal vector is [a b c] and a point on the plane is [x y z]. We’ll use the coordinates of the centroid as a point on the plane to find d. Now that we know this, we can use a similar equation to find the distance of any point to the plane,

$$ax_{i}+by_{i}+cz_{i}+d$$

Results for each kinetochore are used to sort each side of the plane into separate waves for further calculation. In the movie below, the red dots and blue dots show the positions of the kinetochores on either side of the division plane. It’s a bit of an optical illusion, but the cube is turning in a right hand fashion.

3. Analysis on each subset

Now that the data have been sorted, separate calculations can be carried out on each. In the example, we were interested in how the kinetochores were organised spatially and so we looked at the distance to nearest neighbour. This is done by finding the Euclidean distance from each kinetochore to every other kinetochore and putting the lowest value for each kinetochore into a new wave. However, this calculation can be anything you want. If there are further waves that specify other properties of the kinetochores, e.g. brightness, then these can be similarly processed here.

Other notes

The code in its present form (not very streamlined) was fast and could be run on every cell from a number of experiments, reading out positional data for 10,000 kinetochores in ~2 s. For QC it is possible to display the two separated coordinated sets to check that the division worked fine (see above). The power of this method is that it doesn’t rely on imaging spindle poles or anything else to work out the orientation of the metaphase plate. It works well for metaphase cells, but cells with any misaligned chromosomes ruin the calculation. It is possible to remove these and still fit the plane, but for our analysis we focused on cells at metaphase with a defined plate.

What else can it be used for?

Other structures in the cell can be segregated in a similar way. For example, the Golgi apparatus has a trans and a cis side, which could be similarly divided (although using the 2nd eigenvector as normal to the plane, rather than the 3rd).

Acknowledgements: I’d like to thank A.G. at WaveMetrics Inc. for encouraging me to try PCA rather than my dispersion matrix approach.

If you want to use it, the code is available here (it seems I can only upload PDF at wordpress.com). I used pygments for annotation.

The post title comes from “Division Day” a great single by Elliott Smith.