## Waiting to happen II: Publication lag times

Following on from the last post about publication lag times at cell biology journals, I went ahead and crunched the numbers for all journals in PubMed for one year (2013). Before we dive into the numbers, a couple of points about this kind of information.

1. Some journals “reset the clock” on the received date with manuscripts that are resubmitted. This makes comparisons difficult.
2. The length of publication lag is not necessarily a reflection of the way the journal operates. As this comment points out, manuscripts are out of the journal’s hands (with the reviewers) for a substantial fraction of the time.
3. The dataset is incomplete because the deposition of this information is not mandatory. About 1/3 of papers have the date information deposited (see below).
4. Publication lag times go hand-in-hand with peer review. Moving to preprints and post-publication review would eradicate these delays.

Thanks for all the feedback on my last post, particularly those that highlighted the points above.

To see how all this was done, check out the Methods bit below, where you can download the full summary. I ended up with a list of publication lag times for 428500 papers published in 2013 (see left). To make a bit more sense of this, I split them by journal and then found the publication lag time stats for each. This had to be done per journal since PLoS ONE alone makes up 45560 of the records.

To try and visualise what these publication lag times look like for all journals, I made a histogram of the Median lag times for all journals using a 10 d bin width. It takes on average ~100 d to go from Received to Accepted and a further ~120 d to go from Accepted to Published. The whole process on average takes 239 days.

To get a feel for the variability in these numbers I plotted out the ranked Median times for each journal and overlaid Q25 and Q75 (dots). The IQR for some of the slower journals was >150 d. So the papers that they publish can have very different fates.

Is the publication lag time longer at higher tier journals? To look at this, I used the Rec-Acc time and the 2013 Journal Impact Factor which, although widely derided and flawed, does correlate loosely with journal prestige. I have fewer journals in this dataset, because the lookup of JIFs didn’t find every journal in my starting set, either because the journal doesn’t have one or there were minor differences in the PubMed name and the Thomson-Reuters name. The median of the median Rec-Acc times for each bin is shown. So on average, journals with a JIF <1 will take 1 month longer to accept your paper than journals with a JIF ranging from 1-10. After this it rises again, to ~2 months longer at journals with a JIF over 10. Why? Perhaps at the lower end, the trouble is finding reviewers; whereas at the higher end, multiple rounds of review might become a problem.

The executive summary is below. These are the times (in days) for delays at all journals in PubMed for 2013.

| Interval | Median (d) | Q25 (d) | Q75 (d) |
|---|---|---|---|
| Accepted-to-Published | 122 | 84 | 186 |

For comparison:

1. Median time from ovulation to birth of a human being is 268 days.
2. Mark Beaumont cycled around the world (29,446 km) in 194 days.
3. Ellen MacArthur circumnavigated the globe single-handed in 72 days.

On the whole it seems that publishing in Cell Biology is quite slow compared to the whole of PubMed. Why this is the case is a tricky question. Is it because cell biologists submit papers too early and they need more revision? Are they more dogged in sending back rejected manuscripts? Is it because as a community we review too harshly and/or ask too much of the authors? Do Editors allow too many rounds of revision or not give clear guidance to expedite the time from Received-to-Accepted? It’s probably a combination of all of these factors and we’re all to blame.

Finally, this amusing tweet to show the transparency of EMBO J publication timelines raises the question: would these authors have been better off just sending the paper somewhere else?

Methods: I searched PubMed using journal article[pt] AND ("2013/01/01"[PDAT] : "2013/12/31"[PDAT]). This gave a huge xml file (~16 GB) which nokogiri balked at. So I divided the query up into subranges of those dates (1.4 GB) and ran the script on all xml files. This gave 1425643 records. I removed records that did not have a received date or those with greater than 12 in the month field (leaving 428513 records). 13 of these records did not have a journal name. This gave 428500 records from 3301 journals. I also filtered out negative values (papers accepted before they were received) and a couple of outliers (e.g. 6000 days!). With a bit of code it was quite straightforward to extract simple statistics for each of the journals. You can download the data here to look up the information for a journal of your choice (wordpress only allows xls, not txt/csv). The fields show the journal name and the number of valid articles. Then for Acc-Pub, Rec-Acc and Rec-Pub, the number, Median, lower quartile and upper quartile times in days are given. I set a limit of 5 or more articles for calculation of the stats. Blank entries are where there was no valid data. Note that there are some differences with the table in my last post. This is because for that analysis I used a bigger date range and then filtered the year based on the published field. Here my search started out by specifying PDAT, which is slightly different.

The data are OK, but the publication date needs to be taken with a pinch of salt. For many records it was missing a month or day, so the date used for some records is approximate. In retrospect, using the Entrez date or one of the other required fields would probably have been better. I liked the idea of the publication date as this is when the paper finally appears in print, which still represents a significant delay at some journals. The Received-to-Accepted dates are valid though.
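For anyone who wants to repeat the per-journal statistics, here is a minimal Ruby sketch of the filtering and summary steps described in the Methods. It is not the code I used. The record layout (a hash with `:journal`, `:received` and `:accepted` as `[year, month, day]` integer arrays), the `MAX_LAG` cutoff and the linear-interpolation quantile are all assumptions made for illustration; only the limit of 5 articles and the removal of negative lags come from the text above.

```ruby
require 'date'

MIN_ARTICLES = 5     # from the Methods: stats only for 5 or more articles
MAX_LAG      = 2000  # assumption: the post only says outliers like 6000 d were dropped

# Days between two [year, month, day] arrays, or nil if either date is
# missing or invalid (e.g. a month field greater than 12).
def lag_days(from, to)
  return nil if from.nil? || to.nil?
  (Date.new(*to) - Date.new(*from)).to_i
rescue ArgumentError, TypeError
  nil
end

# Quantile of a sorted array by linear interpolation (one of several conventions).
def quantile(sorted, q)
  pos = q * (sorted.length - 1)
  lo  = pos.floor
  hi  = pos.ceil
  sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo)
end

# Group records by journal and return { journal => { n:, median:, q25:, q75: } }
# for the Received-to-Accepted lag.
def summarise(records)
  records.group_by { |r| r[:journal] }.each_with_object({}) do |(journal, recs), out|
    lags = recs.map { |r| lag_days(r[:received], r[:accepted]) }
               .compact
               .select { |d| d >= 0 && d <= MAX_LAG }  # drop negative lags and outliers
               .sort
    next if lags.length < MIN_ARTICLES
    out[journal] = { n: lags.length,
                     median: quantile(lags, 0.5),
                     q25: quantile(lags, 0.25),
                     q75: quantile(lags, 0.75) }
  end
end
```

The same function works for Acc-Pub or Rec-Pub by swapping the two date fields passed to `lag_days`.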

## Waiting to Happen: Publication lag times in Cell Biology Journals

My interest in publication lag times continues. Previous posts have looked at how long it takes my lab to publish our work, how often trainees publish and I also looked at very long lag times at Oncogene. I recently read a blog post on automated calculation of publication lag times for Bioinformatics journals. I thought it would be great to do this for Cell Biology journals too. Hopefully people will find it useful and can use this list when thinking about where to send their paper.

What is publication lag time?

If you are reading this, you probably know how science publication works. Feel free to skip. Otherwise, it goes something like this. After writing up your work for publication, you submit it to a journal. Assuming that this journal will eventually publish the paper (there is usually a period of submitting, getting rejected, resubmitting to a different journal etc.), they receive the paper on a certain date. They send it out to review, they collate the reviews and send back a decision, you (almost always) revise your paper further and then send it back. This can happen several times. At some point it gets accepted on a certain date. The journal then prepares the paper for publication in a scheduled issue on a specific date (they can also immediately post papers online without formatting). All of these steps add significant delays. It typically takes 9 months to publish a paper in the biomedical sciences. In 2015 this sounds very silly, when world-wide dissemination of information is as simple as a few clicks on a trackpad. The bigger problem is that we rely on papers as a currency to get jobs or funding and so these delays can be more than just a frustration, they can affect your ability to actually do more science.

The good news is that it is very straightforward to parse the received, accepted and published dates from PubMed. So we can easily calculate the publication lags for cell biology journals. If you don’t work in cell biology, just follow the instructions below to make your own list.

The bad news is that the deposition of the date information in PubMed depends on the journal. The extra bad news is that three of the major cell biology journals do not deposit their data: J Cell Biol, Mol Biol Cell and J Cell Sci. My original plan was to compare these three journals with Traffic, Nat Cell Biol and Dev Cell. Instead, I extended the list to include other journals which take non-cell biology papers (and deposit their data).

A summary of the last ten years

Three sets of box plots here show the publication lags for eight journals that take cell biology papers. The journals are Cell, Cell Stem Cell, Current Biology, Developmental Cell, EMBO Journal, Nature Cell Biology, Nature Methods and Traffic (see note at the end about eLife). They are shown in alphabetical order. The box plots show the median and the IQR, whiskers show the 10th and 90th percentiles. The three plots show the time from Received-to-Published (Rec-Pub), and then a breakdown of this time into Received-to-Accepted (Rec-Acc) and Accepted-to-Published (Acc-Pub). The colours are just to make it easier to tell the journals apart and don’t have any significance.

You can see from these plots that the journals differ widely in the time it takes to publish a paper there. Current Biology is very fast, whereas Cell Stem Cell is relatively slow. The time it takes the journals to move papers from acceptance to publication is pretty constant, apart from Traffic where it takes an average of ~3 months to get something into print. Remember that the paper is often online for this period so this is not necessarily a bad thing. I was not surprised that Current Biology was the fastest. At this journal, a presubmission inquiry is required and the referees are often lined up in advance. The staff are keen to publish rapidly, hence the name, Current Biology. I was amazed at Nature Cell Biology having such a short time from Received-to-Accepted. The delay in Received-to-Accepted comes from multiple rounds of revision and from doing extra experimental work. Anecdotally, it seems that the review at Nature Cell Biol should be just as lengthy as at Dev Cell or EMBO J. I wonder if the received date is accurate… it is possible to massage this date by first rejecting the paper but allowing a resubmission, and then using the resubmission date as the received date [Edit: see below]. One way to legitimately limit this delay is to only allow a certain time for revisions and only allow one round of corrections. This is what happens at J Cell Biol; unfortunately we don’t have the data to see how effective this is.

How has the lag time changed over the last ten years?

Have the slow journals always been slow? When did they become slow?  Again three plots are shown (side-by-side) depicting the Rec-Pub and then the Rec-Acc and Acc-Pub time. Now the intensity of red or blue shows the data for each year (2014 is the most intense colour). Again you can see that the dataset is not complete with missing date information for Traffic for many years, for example.

Interestingly, the publication lag has been pretty constant for some journals but not others. Cell Stem Cell and Dev Cell (but not the mothership – Cell) have seen increases as have Nature Cell Biology and Nature Methods. On the whole Acc-Pub times are stable, except for Nature Methods which is the only journal in the list to see an increase over the time period. This just leaves us with the task of drawing up a ranked list of the fastest to the slowest journal. Then we can see which of these journals is likely to delay dissemination of our work the most.

The Median times (in days) for 2013 are below. The journals are ranked in order of fastest to slowest for Received-to-Publication. I had to use 2013 because EMBO J is missing data for 2014.

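| Journal | Rec-Pub (d) | Rec-Acc (d) | Acc-Pub (d) |
|---|---|---|---|
| Curr Biol | 159 | 99.5 | 56 |
| Nat Methods | 192 | 125 | 68 |
| Cell | 195 | 169 | 35 |
| EMBO J | 203 | 142 | 61 |
| Nature Cell Biol | 237 | 180 | 59 |
| Traffic | 244 | 161 | 86 |
| Dev Cell | 247 | 204 | 43 |
| Cell Stem Cell | 284 | 205 | 66 |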

You’ll see that only Cell Stem Cell is over the threshold where it would be faster to conceive and give birth to a human being than to publish a paper there (on average). If the additional time wasted in submitting your manuscript to other journals is factored in, it is likely that most papers are at least on a par with the median gestation time.

If you are wondering why eLife is missing… as a new journal it didn’t have ten years worth of data to analyse. It did have a reasonably complete set for 2013 (but Rec-Acc only). The median time was 89 days, beating Current Biology by 10.5 days.

Methods

Please check out Neil Saunders’ post on how to do this. I did a PubMed search for (journal1[ta] OR journal2[ta] OR ...) AND journal article[pt] to make sure I didn’t get any reviews or letters etc. I limited the search from 2003 onwards to make sure I had 10 years of data for the journals that deposited it. I downloaded the file as xml and I used Ruby/Nokogiri to parse the file to csv. Installing Nokogiri is reasonably straightforward, but the documentation is pretty impenetrable. The ruby script I used was from Neil’s post (step 3) with a few lines added:


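```ruby
#!/usr/bin/ruby
# Parse a PubMed XML export and print one csv row per article:
# journal, PMID, then year/month/day for the received, accepted and published dates.

require 'nokogiri'

f = File.open(ARGV.first)
doc = Nokogiri::XML(f)
f.close

doc.xpath("//PubmedArticle").each do |a|
  r = ["", "", "", "", "", "", "", "", "", "", ""]
  r[0] = a.xpath("MedlineCitation/Article/Journal/ISOAbbreviation").text
  r[1] = a.xpath("MedlineCitation/PMID").text
  # received date (columns 2-4)
  r[2] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Year").text
  r[3] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Month").text
  r[4] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Day").text
  # accepted date (columns 5-7)
  r[5] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Year").text
  r[6] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Month").text
  r[7] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Day").text
  # published date (columns 8-10); the element is PubDate and XPath is case-sensitive
  r[8] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year").text
  r[9] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/PubDate/Month").text
  r[10] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/PubDate/Day").text
  puts r.join(",")
end
```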



and then executed as described. The csv could then be imported into IgorPro and processed. Neil’s post describes a workflow for R, or you could use Excel or whatever at this point. As he notes, quite a few records are missing the date information and some of it is wrong, i.e. published before it was accepted. These need to be cleaned up. The other problem is that the month is sometimes an integer and sometimes a three-letter code. He uses lubridate in R to get around this; a loop-replace in Igor is easy to construct and even Excel can handle this with an IF statement, e.g. IF(LEN(G2)=3,MONTH(1&LEFT(G2,3)),G2) if the month is in G2. Good luck!
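If you would rather stay in Ruby for the month clean-up, something like the sketch below should do the same job as the Excel formula. The function name `month_to_i` is made up for this example, and treating anything unrecognised as `nil` (so the record can be discarded later) is my assumption about how you would want to handle bad values.

```ruby
require 'date'

# Convert a PubMed month field ("3", "03", "Mar") to an integer 1..12.
# Returns nil for blank, unrecognised or out-of-range values.
def month_to_i(field)
  s = field.to_s.strip
  return nil if s.empty?
  m = if s =~ /\A\d+\z/
        s.to_i                                             # already a number
      else
        Date::ABBR_MONTHNAMES.index(s[0, 3].capitalize)    # "Mar" -> 3
      end
  m && m.between?(1, 12) ? m : nil
end
```

Running each month column of the csv through this before calculating date differences removes the mixed integer/three-letter problem in one pass.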

Edit 9/3/15 @ 17:17 several people (including Deborah Sweet and Bernd Pulverer from Cell Press/Cell Stem Cell and EMBO, respectively) have confirmed via Twitter that some journals use the date of resubmission as the submitted date. Cell Stem Cell and EMBO journals use the real dates. There is no way to tell whether a journal does this or not (from the deposited data). Stuart Cantrill from Nature Chemistry pointed out that his journal does declare that they sometimes reset the clock. I’m not sure about other journals. My own feeling is that – for full transparency – journals should 1) record the actual dates of submission, acceptance and publication, 2) deposit them in PubMed and add them to the paper. As pointed out by Jim Woodgett, scientists want the actual dates on their paper, partly because they are the real dates, but also to claim priority in certain cases. There is a conflict here, because journals might appear inefficient if they have long publication lag times. I think this should be an incentive for Editors to simplify revisions by giving clear guidance and limiting successive revision cycles. (This Edit was corrected 10/3/15 @ 11:04).

The post title is taken from “Waiting to Happen” by Super Furry Animals from the “Something 4 The Weekend” single.

## If and When: publishing and productivity in the lab

I thought I’d share this piece of analysis looking at productivity of people in the lab. Here, productivity means publishing papers. This is unfortunate since some people in my lab have made some great contributions to other people’s projects or have generally got something going, but these haven’t necessarily transferred into print. Also, the projects people have been involved in have varied in toughness. I’ve had students on an 8-week rotation who just collected some data which went straight into a paper and I’ve had postdocs toil for two years trying to purify a protein complex… I wasn’t looking to single out who was the most productive person (I knew who that was already), but I was interested to look at other things, e.g. how long is it on average from someone joining the lab to them publishing their first paper?

The information below would be really useful if it was available for all labs. When trainees are looking for jobs, it would be worth knowing the productivity of a given lab. This can be very hard to discern, since it is difficult to see how many people have worked in the lab and for how long. Often all you have to go on is the PubMed record of the PI. Two papers published per year in the field of cell biology is a fantastic output, but not if you have a lab of thirty people. How likely are you – as a future lab member – to publish your own 1st author paper? This would be very handy to know before applying to a lab.

I extracted the date of online publication for all of our papers as well as author position and other details. I had a record of start and end dates for everyone in the lab. Although as I write this, I realise that I’ve left one person off by mistake. All of this could be dumped into IgorPro and I wrote some code to array the data in a plot vs time. People are anonymous – they’ll know who they are, if they’re reading. Also we have one paper which is close to being accepted so I included this although it is not in press yet.

The first plot shows when people joined the lab and how long they stayed. Each person has been colour-coded according to their position. The lines represent their time spent in the lab. Some post-graduates (PG) came as a masters student for a rotation and then came back for a PhD and hence have a broken line. Publications are shown by markers according to when a paper featuring that person was published online. There’s a key to indicate a paper versus review/methods paper and if the person was 1st author or not. We have published two papers that I would call collaborative, i.e. a minor component from our lab. Not shown here are the publications that are authored by me but don’t feature anyone else working in the lab.

This plot starts when I got my first independent position. As you can see it was ~1 year until I was able to recruit my first tech. It was almost another 2 years before we published our first paper. Our second one took almost another 2 years! What is not factored in here is the time spent waiting to get something published – see here. The early part of getting a lab going is tough, however you can see that once we were up-and-running the papers came out more quickly.

In the second plot, I offset the traces to show duration in the lab and relative time to publication from the start date in the lab. I also grouped people according to their position and ranked them by duration in the lab. This plot is clearer for comparing publication rates and lag to getting the first paper etc. This plot shows quite nicely that lots of people from the lab publish “posthumously”. This is thanks to the publication lag but also to things not getting finished or results that needed further experiments to make sense etc. Luckily the people in my lab have been well organised, which has made it possible to publish things after they’ve left.

I was surprised to see that five people published within ~1.5 years of joining the lab. However, in each case the papers came about because of some groundwork by other people.

I think the number of people and the number of papers are both too low to begin to predict how long someone will take to get their first paper out, but these plots give a sense of how tough it is and how much effort and work is required to make it into print.

Methods: To recreate this for your own lab, you just need a list of lab members with start and end dates. The rest can be extracted from PubMed. Dataviz fans may be interested that the colour scheme is taken from Paul Tol’s guide.
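As a rough idea of the calculation behind the second plot, here is a small Ruby sketch of the time from joining the lab to the first paper. My plots were made in IgorPro, so this is only an illustration: the function name and the choice to express the lag in years (dividing by 365.25) are assumptions.

```ruby
require 'date'

# Years from a person's start date to their earliest publication date.
# Returns nil if they have no publications yet.
def years_to_first_paper(start_date, publication_dates)
  first = publication_dates.min
  return nil if first.nil?
  ((first - start_date) / 365.25).to_f.round(2)
end

# Example with invented dates: joined January 2010, two papers since.
lag = years_to_first_paper(Date.new(2010, 1, 1),
                           [Date.new(2012, 1, 1), Date.new(2011, 7, 1)])
puts lag
```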

The post title comes from “If and When” by The dB’s from Ride The Wild Tomtom

## Science songs

I thought I’d compile a list of songs related to biomedical science. These were all found in my iTunes library. I’ve missed off multiple entries for the same kind of thing, as indicated.

Neuroscience

• Grand Mal – Elliott Smith from XO Sessions
• She’s Lost Control – Joy Division from Unknown Pleasures (Epilepsy)
• Aneurysm – Nirvana from Hormoaning EP
• Serotonin – Mansun from Six
• Serotonin Smile – Ooberman from Shorley Wall EP
• Brain Damage – Pink Floyd from Dark Side of The Moon
• Paranoid Schizophrenic – The Bats from How Pop Can You Get?
• Headacher – Bear Quartet from Penny Century
• Headache – Frank Black from Teenager of the Year
• Manic Depression – Jimi Hendrix Experience and lots of other songs about depression
• Paranoid – Black Sabbath from Paranoid (thanks to Joaquin for the suggestion!)

Medical

• Cancer (interlude) – Mansun from Six
• Hepatic Tissue Fermentation – Carcass or pretty much any song in this genre of Death Metal
• Whiplash – Metallica from Kill ‘Em All
• Another Invented Disease – Manic Street Preachers from Generation Terrorists
• Broken Nose – Family from Bandstand
• Ana’s Song – Silverchair from Neon Ballroom (Anorexia Nervosa)
• 4st 7lb – Manic Street Preachers from The Holy Bible (Anorexia Nervosa)
• November Spawned A Monster – Morrissey from Bona Drag (disability)
• Castles Made of Sand – Jimi Hendrix Experience from Axis: Bold As Love (disability)
• Cardiac Arrest – Madness from 7
• Blue Veins – The Raconteurs from Broken Boy Soldiers
• Vein Melter – Herbie Hancock from Headhunters
• Scoliosis – Pond from Rock Collection (curvature of the spine)
• Taste the Blood – Mazzy Star… lots of songs with blood in the title.

Pharmaceutical

• Biotech is Godzilla – Sepultura from Chaos A.D.
• Luminol – Ryan Adams from Rock N Roll
• Feel Good Hit Of The Summer – Queens of The Stone Age from Rated R (prescription drugs of abuse)
• Stars That Play with Laughing Sam’s Dice – Jimi Hendrix Experience (and hundreds of other songs about recreational drugs)
• Tramazi Parti – Black Grape from It’s Great When You’re Straight…
• Z is for Zofirax – Wingtip Sloat from If Only For The Hatchery
• Goldfish and Paracetamol – Catatonia from International Velvet
• L Dopa – Big Black from Songs About Fucking

Genetics and molecular biology

• Genetic Reconstruction – Death from Spiritual Healing
• Genetic – Sonic Youth from 100%
• Hair and DNA – Hot Snakes from Audit in Progress
• DNA – Circle from Meronia
• Biological – Air from Talkie Walkie
• Gene by Gene – Blur from Think Tank
• My Selfish Gene – Catatonia from International Velvet
• Sheer Heart Attack – Queen (“it was the DNA that made me this way”)
• Mutantes – Os Mutantes
• The Missing Link – Napalm Death from Mentally Murdered E.P.
• Son of Mr. Green Genes – Frank Zappa from Hot Rats

Cell Biology

• Sweet Oddysee Of A Cancer Cell T’ Th’ Center Of Yer Heart – Mercury Rev from Yerself Is Steam
• Dead Embryonic Cells – Sepultura from Arise
• Cells – They Might Be Giants from Here Comes Science (songs for kids about science)
• White Blood Cells LP by The White Stripes
• Anything by The Membranes
• Soma – Smashing Pumpkins from Siamese Dream
• Golgi Apparatus – Phish from Junta
• Cell-scape LP by Melt Banana

Album covers with science images

Godflesh – Selfless. Scanning EM image of some cells growing on a microchip?

Circle – Meronia. Photograph of an ampoule?

## Division Day: using PCA in cell biology

In this post I’ll describe a computational method for splitting two sides of a cell biological structure. It’s a simple method that relies on principal component analysis, otherwise known as PCA. Like all things mathematical there are some great resources on the web, if you want to understand this operation in more detail (for example, this great post by Lior Pachter). PCA can be applied to many biological problems; you’ve probably seen it used to find patterns in large data sets, e.g. from proteomic studies. It can also be useful for analysing microscopy data. Since our analysis using this method is unlikely to make it into print any time soon, I thought I’d put it up on Quantixed.

During mitosis, a cell forms a mitotic spindle to share copied chromosomes equally to the two new cells. Our lab is working on how this process works and how it goes wrong in cancer. The chromosomes attach to the spindle via kinetochores and during prometaphase they are moved to the middle of the cell. Here, the chromosomes are organised into a disc-like structure called the metaphase plate. The disc is thin in the direction of the spindle axis, but much larger in width and height. To examine the spatial distribution of kinetochores on the plate we wanted a way to approximately separate kinetochores on one side of the plate from the other.

Kinetochores can be easily detected in 3D confocal images of mitotic cells by particle analysis. Kinetochores are easily stained and appear as bright spots that a computer can pick out (we use Imaris for this). The cartesian coordinates of each detected kinetochore were saved as csv and fed into IgorPro. A procedure could then be run which works in three steps. The code is shown at the bottom, it is wrapped in further code that deals with multiple datasets from many cells/experiments etc. The three steps are:

1. PCA
2. Point-to-plane
3. Analysis on each subset

I’ll describe each step and how it works.

1. Principal component analysis

This is used to find the 3rd eigenvector, which can be used to define a plane passing through the centre of the plate. This plane is used for division.

Now, because the metaphase plate is a disc it has three dimensions, the third of which – “thickness” – is the smallest. PCA will find the principal component, i.e. the direction in which there is most variance. Orthogonal to that is the direction with the second biggest variance, and orthogonal to both of those is the direction with the smallest. These directions are called eigenvectors and the variance along each one is its eigenvalue. As there are three dimensions to the data we can get all three eigenvectors out and the 3rd eigenvector corresponds to thickness of the metaphase plate. Metaphase plates in cells grown on coverslips are orientated similarly, but the cells themselves are at random orientations. PCA takes no notice of this and can simply reveal the direction of the smallest dimension of a 3D structure. The movie shows this in action for a simulated data set. The black spots are arranged in a disc shape about the origin. They are rotated about x by 45° (the blue spots). We then run PCA and show the eigenvectors as unit vectors (red lines). The 3rd eigenvector is normal to the plane of division, i.e. the 1st and 2nd eigenvectors lie on the plane of division.

Also, the centroid needs to be defined. This is simply the cartesian coordinates for the average of each dimension. It is sometimes referred to as the mean vector. In the example this was the origin, in reality this will depend on the position and the overall height of the cell.

A much longer method to get the eigenvectors is to define the variance-covariance matrix (sometimes called the dispersion matrix) for each dimension, for all kinetochores and then do an eigenvector decomposition on the matrix. PCA is one command, whereas the matrix calculation would be an extra loop followed by an additional command.
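To make the “longer method” concrete, here is a sketch in Ruby rather than Igor. It builds the variance-covariance matrix by hand and uses the eigen decomposition from Ruby’s `matrix` library, sorting the eigenvectors myself so that the 3rd one is the direction of least variance. This is not the Igor procedure we used (that was the single PCA command); the function names are invented for the example and points are assumed to be `[x, y, z]` arrays.

```ruby
require 'matrix'

# Centroid (mean vector) of an array of [x, y, z] points.
def centroid(points)
  n = points.length.to_f
  (0..2).map { |k| points.sum { |pt| pt[k] } / n }
end

# Variance-covariance (dispersion) matrix of the points as a 3x3 Matrix.
def covariance(points)
  c = centroid(points)
  n = points.length.to_f
  Matrix.build(3, 3) do |i, j|
    a, b = [i, j].minmax  # identical sum for (i, j) and (j, i), so exactly symmetric
    points.sum { |pt| (pt[a] - c[a]) * (pt[b] - c[b]) } / (n - 1)
  end
end

# Unit eigenvectors ordered from largest to smallest eigenvalue.
# For a metaphase plate, the last one is the normal to the division plane.
def principal_axes(points)
  eig = covariance(points).eigensystem
  eig.eigenvalues.zip(eig.eigenvectors)
     .sort_by { |value, _| -value }
     .map { |_, vector| vector.normalize }
end
```

Note that the sign of an eigenvector is arbitrary, so which side of the plate ends up “positive” can differ from cell to cell.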

2. Point-to-plane

The distance of each kinetochore to the plane that we defined is calculated. If it is a positive value then the kinetochore lies on the same side as the normal vector (defined above). If it is negative then it is on the other side. The maths behind how to do this are in section 10.3.1 of Geometric Tools for Computer Graphics by Schneider & Eberly (starting on p. 374). Google it, there is a PDF version on the web. I’ll save you some time, you just need one equation that defines a plane,

$$ax+by+cz+d=0$$

Where the unit normal vector is [a b c] and a point on the plane is [x y z]. We’ll use the coordinates of the centroid as a point on the plane to find d. Now that we know this, we can use a similar equation to find the distance of any point to the plane,

$$ax_{i}+by_{i}+cz_{i}+d$$

Results for each kinetochore are used to sort each side of the plane into separate waves for further calculation. In the movie below, the red dots and blue dots show the positions of the kinetochores on either side of the division plane. It’s a bit of an optical illusion, but the cube is turning in a right hand fashion.
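The two equations above translate into very little code. The Ruby sketch below (again, not the Igor procedure, and with made-up function names) finds d from the centroid and then splits the kinetochores by the sign of the distance. Putting points that lie exactly on the plane with the positive side is an arbitrary choice on my part.

```ruby
# Signed distance from a point to the plane through `centroid`
# whose unit normal is `normal` (all arguments are [x, y, z] arrays).
def signed_distance(point, normal, centroid)
  # d follows from a*x + b*y + c*z + d = 0 with the centroid on the plane
  d = -1.0 * (0..2).sum { |k| normal[k] * centroid[k] }
  (0..2).sum { |k| normal[k] * point[k] } + d
end

# Split points into [same side as the normal, other side].
def split_by_plane(points, normal, centroid)
  points.partition { |pt| signed_distance(pt, normal, centroid) >= 0 }
end
```

The normal must be a unit vector for the result to be a true distance; if it is not, the sign is still correct but the magnitude is scaled.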

3. Analysis on each subset

Now that the data have been sorted, separate calculations can be carried out on each. In the example, we were interested in how the kinetochores were organised spatially and so we looked at the distance to nearest neighbour. This is done by finding the Euclidean distance from each kinetochore to every other kinetochore and putting the lowest value for each kinetochore into a new wave. However, this calculation can be anything you want. If there are further waves that specify other properties of the kinetochores, e.g. brightness, then these can be similarly processed here.
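For the nearest-neighbour example, a brute-force Ruby sketch looks like this. It compares every point with every other point, which is fine for the few dozen kinetochores on one side of a plate; the function name is invented and it is only meant to illustrate the calculation, not reproduce our Igor code.

```ruby
# For each [x, y, z] point, the Euclidean distance to its nearest neighbour.
# Returns nil for a point that has no neighbours.
def nearest_neighbour_distances(points)
  points.each_with_index.map do |pt, i|
    points.each_with_index
          .reject { |_, j| j == i }  # skip the point itself
          .map { |other, _| Math.sqrt(pt.zip(other).sum { |a, b| (a - b)**2 }) }
          .min
  end
end
```

You would call this once for each of the two subsets produced by the division step.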

Other notes

The code in its present form (not very streamlined) was fast and could be run on every cell from a number of experiments, reading out positional data for 10,000 kinetochores in ~2 s. For QC it is possible to display the two separated coordinate sets to check that the division worked fine (see above). The power of this method is that it doesn’t rely on imaging spindle poles or anything else to work out the orientation of the metaphase plate. It works well for metaphase cells, but cells with any misaligned chromosomes ruin the calculation. It is possible to remove these and still fit the plane, but for our analysis we focused on cells at metaphase with a defined plate.

What else can it be used for?

Other structures in the cell can be segregated in a similar way. For example, the Golgi apparatus has a trans and a cis side, which could be similarly divided (although using the 2nd eigenvector as normal to the plane, rather than the 3rd).

Acknowledgements: I’d like to thank A.G. at WaveMetrics Inc. for encouraging me to try PCA rather than my dispersion matrix approach.

If you want to use it, the code is available here (it seems I can only upload PDF at wordpress.com). I used pygments for annotation.

The post title comes from “Division Day” a great single by Elliott Smith.

## Sticky End

We have a new paper out! You can access it here.

The work was mainly done by Cristina Gutiérrez Caballero, a post-doc in the lab. We had some help from Selena Burgess and Richard Bayliss at the University of Leicester, with whom we have an ongoing collaboration.

The paper in a nutshell

We found that TACC3 binds the plus-ends of microtubules via an interaction with ch-TOG. So TACC3 is a +TIP.

What is a +TIP?

This is a term used to describe proteins that bind to the plus-ends of microtubules. Microtubules are a major component of the cell’s cytoskeleton. They are polymers of alpha/beta-tubulin that grow and shrink, a feature known as dynamic instability. A microtubule has polarity, the fast growing end is known as the plus-end, and the slower growing end is referred to as the minus-end. There are many proteins that bind to the plus-end and these are termed +TIPs.

OK, so what are TACC3 and ch-TOG?

They are two proteins found on the mitotic spindle. TACC3 is an acronym for transforming acidic coiled-coil protein 3, and ch-TOG stands for colonic and hepatic tumour overexpressed gene. As you can tell from the names, they were discovered due to their altered expression in certain human cancers. TACC3 is a well-known substrate for Aurora A kinase, which is an enzyme that is often amplified in cancer. The ch-TOG protein is thought to be a microtubule polymerase, i.e. an enzyme that helps microtubules grow. In the paper, we describe how TACC3 and ch-TOG stick together at the microtubule end. TACC3 and ch-TOG are at the very end of the microtubule: they move ahead of other +TIPs like “end-binding proteins”, e.g. EB3.

What is the function of TACC3 as a +TIP?

We think that TACC3 is piggybacking on ch-TOG while it is acting as a polymerase, but any biological function or consequence of this piggybacking was difficult to detect. We couldn’t see any clear effect on microtubule dynamics when we removed or overexpressed TACC3. We did find that loss of TACC3 affects how cells migrate, but this is not likely to be due to a change in microtubule dynamics.

I thought TACC3 and ch-TOG were centrosomal proteins…

In the paper we look again at this and find that there are different pools of TACC3, ch-TOG and clathrin (alone and in combination) and describe how they reside in different places in the cell. Although ch-TOG is clearly at centrosomes, we don’t find TACC3 there. It is, however, on microtubules that cluster near the centrosomes at the spindle pole. TACC3 is described as a centrosomal protein in lots of other papers, but this is quite misleading.

What else?

We were on the cover – whatever that means in the digital age! We imaged a cell expressing tagged EB3 (another +TIP) and coloured consecutive frames different colours. The result looked pretty striking. Biology Open picked it as their cover, which we were really pleased about. Our paper is AOP at the moment and so hopefully they won’t change their mind by the time it appears in the next issue.

Preprinting

This is the second paper that we have deposited as a preprint at bioRxiv (not counting a third paper that we preprinted after it was accepted). I was keen to preprint this particular paper because we became aware that two other groups had similar results following a meeting last summer. Strangely, a week or so after preprinting and submitting to a journal, a paper from a completely different group appeared with a very similar finding! We’d been “scooped”. They had found that the Xenopus homologue of TACC3 was a +TIP in retinal neuronal cultures. The other group had clearly beaten us to it, having submitted their paper some time before our preprint. The reviewers of our paper complained that our data was no longer novel and our paper was rejected. This was annoying because there were lots of novel findings in our paper that weren’t in theirs (and vice versa). The reviewers did make some other constructive suggestions that we incorporated into the manuscript. We updated our preprint and then submitted to Biology Open. One advantage of the preprinting process is that the changes we made can be seen by all. Biology Open were great and took a decision based on our comments from the other journal and the changes we had made in response to them. Their decision to provisionally accept the paper was made in four days. Like our last experience publishing in Biology Open, it was very positive.

References

Gutiérrez-Caballero, C., Burgess, S.G., Bayliss, R. & Royle, S.J. (2015) TACC3-ch-TOG track the growing tips of microtubules independently of clathrin and Aurora-A phosphorylation. Biol. Open doi:10.1242/bio.201410843.

Nwagbara, B. U., Faris, A. E., Bearce, E. A., Erdogan, B., Ebbert, P. T., Evans, M. F., Rutherford, E. L., Enzenbacher, T. B. and Lowery, L. A. (2014) TACC3 is a microtubule plus end-tracking protein that promotes axon elongation and also regulates microtubule plus end dynamics in multiple embryonic cell types. Mol. Biol. Cell 25, 3350-3362.

The post title is taken from the last track on The Orb’s U.F.Orb album.

## Tips from the blog III – violin plots

Having recently got my head around violin plots, I thought I would explain what they are and why you might want to use them.

There are several options when it comes to plotting summary data. I list them here in order of granularity, before describing violin plots and how to plot them in some detail.

Bar chart

This is the mainstay of most papers in my field. Typically, a bar representing the mean value that’s been measured is shown with an error bar which shows either the standard error of the mean, the standard deviation, or more rarely a confidence interval. The two data series plotted in all cases are the waiting times for Old Faithful eruptions (waiting), a classic dataset from R, and a version with some added noise (waiting_mod) for comparison. I think it’s fair to say that most people feel that the bar chart has probably had its day and that we should be presenting our data in more informative ways*.

Pros: compact, easy to tell differences between groups

Cons: hides the underlying distribution, obscures the n number

Box plot

The box plot – like many things in statistics – was introduced by Tukey. It’s sometimes known as a Tukey plot, or a box-and-whiskers plot. The idea was to give an impression of the underlying distribution without showing a histogram (see below). Histograms are great, but when you need to compare many distributions they do not overlay well and take up a lot of space to show them side-by-side. In the simplest form, the “box” is the interquartile range (IQR, 25th and 75th percentiles) with a line to show the median. The whiskers show the 10th and 90th percentiles. There are many variations on this theme: outliers can be shown or not, the whiskers may show the limits of the dataset (or something else), the boxes can be notched or their width may represent the sample size…

Pros: compact, easy to tell differences between groups, shows normality/skewness

Cons: hides multimodal data, sometimes obscures the n number, many variations

Histogram

A histogram is a method of showing the distribution of a dataset and was introduced by Pearson. The number of observations within a bin are counted and plotted. The ‘bars’ sit next to each other, because the variable being measured is continuous. The variable being measured is on the x-axis, rather than the category (as in the other plots).

Often the area of all the bars is normalised to 1 in order to assess the distributions without being confused by differences in sample size. As you can see here, “waiting” is bimodal. This was hidden in the bar chart and in the box plot.

Related to histograms are other display types such as stemplots or stem-and-leaf plots.

Pros: shows multimodal data, shows normality/skewness clearly

Cons: not compact, difficult to overlay, bin size and position can be misleading

Scatter dot plot

It’s often said that if there are fewer than 10 data points, then best practice is to simply show the points. Typically the plot is shown together with a bar to show the mean (or median) and maybe with error bars showing s.e.m., s.d. or IQR. There are a couple of methods of plotting the points, because they need to be scattered in x value in order to be visualised. Adding random noise is one approach, but this looks a bit messy (top). A symmetrical scatter can be introduced by binning (middle) and a further iteration is to bin the y values rather than showing their true location (bottom). There’s a further iteration which constrains the category width and overlays multiple points, but again the density becomes difficult to see.

These plots still look quite fussy. The binned version is the clearest, but then we are losing the exact locations of the points, which seems counterintuitive. Another alternative to scattering the dots is to show a rug plot (see below) where there is no scatter.

Pros: shows all the observations

Cons: can be difficult to assess the distribution

Violin plot

This type of plot was introduced in the software package NCSS in 1997 and described in this paper: Hintze & Nelson (1998) The American Statistician 52(2):181-4 [PDF]. As the title says, violin plots are a synergism between box plot and density trace. A thin box plot is shown together with a symmetrical kernel density estimate (KDE, see explanation below). The point is to be able to quickly assess the distribution. You can see the bimodality of waiting in the plot, but without the complication of lots of points: there is just a smooth curve showing the data.

Pros: shows multimodal data, shows normality/skewness unambiguously

Cons: hides n, not familiar to many readers.

* Why is the bar chart so bad and why should I show my data another way?

The best demonstration of why the bar chart is bad is Anscombe’s Quartet (the figure to the right is taken from the Wikipedia page). These four datasets are completely different, yet they all have the same summary statistics. The point is, you would never know unless you plotted the data. A bar chart would look identical for all four datasets.
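To make the point concrete, here is a small Python sketch that computes the numbers a bar chart would show (mean and standard deviation of y) for each of the four datasets. The values are the standard published Anscombe (1973) data as I know them; the variable names are just for this example.

```python
import numpy as np

# Anscombe's quartet: datasets I-III share the same x values
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Mean and sample standard deviation of y: what a bar + error bar would show
summary = {name: (np.mean(y), np.std(y, ddof=1)) for name, (x, y) in quartet.items()}

for name, (mean_y, sd_y) in summary.items():
    print(f"{name}: mean = {mean_y:.2f}, s.d. = {sd_y:.2f}")
```

The four rows print the same mean and s.d. to two decimal places, even though a scatter plot of each dataset looks completely different.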

Making Violin Plots in IgorPro

I wanted to make Violin Plots in IgorPro, since we use Igor for absolutely everything in the lab. I wrote some code to do this and I might make some improvements to it in the future – if I find the time! This was an interesting exercise, because it meant forcing myself to understand how smoothing is done. What follows below is an aide memoire, but you may find it useful.

What is a kernel density estimate?

A KDE is a non-parametric method to estimate a probability density function of a variable. A histogram can be thought of as a simplistic non-parametric density estimate. Here, a rectangle is used to represent each observation and it gets bigger the more observations are made.

What’s wrong with using a histogram as a KDE?

The following examples are taken from here (which in turn are taken from the book by Bowman and Azzalini described below). A histogram is simplistic. We lose the location of each datapoint because of binning. Histograms are not smooth and the estimate is very sensitive to the size of the bins and also the starting location of the first bin. The histograms to the right show the same data points (in the rug plot).

Using the same bin size, they result in very different distributions depending on where the first bin starts. My first instinct to generate a KDE was to simply smooth a histogram, but this is actually quite inaccurate as it comes from a lossy source. Instead we need to generate a real KDE.
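The sensitivity to where the first bin starts is easy to reproduce. This is a toy Python sketch with made-up data values (not the data from the figure): the same six points are binned with the same bin width, but with the first bin starting at 0 or at 0.5.

```python
import numpy as np

# Hypothetical observations, chosen to sit close to integer bin edges
data = np.array([1.9, 2.1, 2.9, 3.1, 3.9, 4.1])
bin_width = 1.0
n_bins = 5

def binned_counts(values, origin):
    """Histogram counts for fixed-width bins starting at `origin`."""
    edges = origin + bin_width * np.arange(n_bins + 1)
    counts, _ = np.histogram(values, bins=edges)
    return counts

counts_from_0 = binned_counts(data, origin=0.0)
counts_from_half = binned_counts(data, origin=0.5)

print(counts_from_0)     # peaked in the middle
print(counts_from_half)  # flat across the middle three bins
```

Shifting the origin by half a bin turns a peaked histogram into a flat one, with no change to the data or the bin size.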

How do I make a KDE?

To do this we place a kernel (a Gaussian is commonly used) at each data point. The rationale behind this is that each observation can be thought of as being representative of a greater number of observations. It sounds a bit bizarre to assume normality to estimate a density non-parametrically, but it works. We can sum all of the kernels to give a smoothed distribution: the KDE. Easy? Well, yes as long as you know how wide to make the kernels. To do this we need to find the bandwidth, h (also called the smoothing parameter).
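The "one Gaussian per data point, then sum" recipe can be sketched directly in Python. This is the slow, explicit version for illustration only (it is not the FastGaussTransform route used in Igor), and the function name, example data and the value of h are assumptions.

```python
import numpy as np

def gaussian_kde_1d(data, grid, h):
    """Evaluate a Gaussian KDE with bandwidth h at each point in grid."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One column per observation: a Gaussian of s.d. h centred on that point
    z = (grid[:, None] - data[None, :]) / h
    kernels = np.exp(-0.5 * z**2) / (h * np.sqrt(2.0 * np.pi))
    # Averaging the kernels (sum / n) makes the estimate integrate to 1
    return kernels.mean(axis=1)

data = [0.0, 1.0, 5.0]                 # hypothetical observations
grid = np.linspace(-5.0, 10.0, 3001)   # where to evaluate the estimate
density = gaussian_kde_1d(data, grid, h=0.5)

# Crude check that the area under the estimate is ~1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Each observation contributes the same amount of area (1/n), so the only free choice is how wide each kernel is, which is the bandwidth question.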

It turns out that this is not completely straightforward. The answer is summarised in this book: Bowman & Azzalini (1997) Applied Smoothing Techniques for Data Analysis. In the original paper on violin plots, they actually do not have a good solution for selecting h for drawing the violins, and they suggest trying several different values for h. They recommend ~15% of the data range as a good starting point. Obviously if you are writing some code, the process of selecting h needs to be automatic.

Optimising h is necessary because if h is too large, the estimate will be oversmoothed and features will be lost. If h is too small, then it will be undersmoothed and bumpy. The examples to the right (again, taken from Bowman & Azzalini, via this page) show undersmoothed, oversmoothed and optimal smoothing.

An optimal solution to find h (for data that are roughly normally distributed) is

$$h = \left(\frac{4}{3n}\right)^{\frac{1}{5}}\sigma$$

This is termed Silverman’s rule-of-thumb. If smoothing is needed in more than one dimension, the multidimensional version is

$$h = \left\{\frac{4}{\left(p+2\right)n}\right\}^{\frac{1}{\left(p+4\right)}}\sigma$$
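Both formulas are short enough to write as one function. This is a minimal Python sketch, assuming that σ is estimated as the sample standard deviation and, for p > 1, that it is computed separately for each dimension (my assumption for this example; the function name is also made up).

```python
import numpy as np

def rule_of_thumb_bandwidth(data, p=1):
    """Bandwidth h = {4 / ((p + 2) n)}^(1 / (p + 4)) * sigma.

    data: array of shape (n,) for p = 1, or (n, p) for p dimensions.
    Returns one h per dimension (a single number when p = 1).
    """
    data = np.asarray(data, dtype=float)
    n = data.shape[0]
    sigma = data.std(axis=0, ddof=1)   # sample s.d., per column if 2D
    return (4.0 / ((p + 2) * n)) ** (1.0 / (p + 4)) * sigma

# With p = 1 this reduces to (4 / 3n)^(1/5) * sigma
example = np.arange(100.0)             # hypothetical data
h = rule_of_thumb_bandwidth(example)
print(h)
```

Since (4/3)^(1/5) is about 1.06, the one-dimensional rule is often quoted as h ≈ 1.06 σ n^(-1/5).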

You might need multidimensional smoothing to contextualise more than one parameter being measured. The waiting data used above describes the time to wait until the next eruption from Old Faithful. The duration of the eruption is measured, and also the wait to the next eruption can be extracted, giving three parameters. These can give a 3D density estimate as shown here in the example.

Bowman & Azzalini recommend that, if the distribution is long-tailed, the median absolute deviation estimator be used as a robust estimate of $$\sigma$$.

$$\tilde\sigma=median\left\{|y_i-\tilde\mu|\right\}/0.6745$$

where $$\tilde\mu$$ is the median of the sample. You don’t need to worry about any of this if you use R to plot violins: the implementation there is rock solid, having been written in S-Plus and then ported to R years ago. You can even pick how the h selection is done from sm.density, or even modify the optimal h directly using hmult.
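For anyone not using R, the median absolute deviation estimate of σ is only a couple of lines. A minimal Python sketch (the function name and the example values are assumptions):

```python
import numpy as np

def robust_sigma(y):
    """Median absolute deviation about the median, scaled by 1/0.6745."""
    y = np.asarray(y, dtype=float)
    med = np.median(y)
    return np.median(np.abs(y - med)) / 0.6745

# Hypothetical long-tailed sample: one extreme value among small ones
sample = [1.0, 2.0, 3.0, 4.0, 100.0]
print(robust_sigma(sample))            # barely affected by the outlier
print(np.std(sample, ddof=1))          # dominated by the outlier
```

The 0.6745 factor rescales the median absolute deviation so that it agrees with the standard deviation when the data are normally distributed.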

To get this working in IgorPro, I used some code for 1D KDE that was already on IgorExchange. It needed a bit of modification because it used FastGaussTransform to sum the kernels as a shortcut. It’s a very fast method, but initially gave an estimate that seemed to be undersmoothed. I spent a while altering the formula for h, hence the detail above. To cut a long story short, FastGaussTransform uses Taylor expansion of the Gauss transform and it just needed more terms to do this accurately. This is set with the /TET flag. Note also that in Igor the width of a Gauss is sigma*2^(1/2).

OK, so how do I make a Violin for plotting?

I used the draw tools to do this and placed the violins behind an existing box plot. This is necessary to be able to colour the violins (apparently transparency is coming to Igor in IP7). The other half of the violin needs to be calculated and then joined by the DrawPoly command. If the violins are trimmed, i.e. cut at the limits of the dataset, then this requires an extra point to be added. Without trimming, this step is not required. The only other issue is how wide the violins are plotted. In R, the violins are all normalised so that information about n is lost. In the current implementation, box width is 0.1 and the violins are normalised to the area under the curve*(0.1/2). So, again information on n is lost.
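The mirroring step is language-independent, so here is a minimal Python sketch of building the closed outline from a density trace. It is not the Igor implementation: the function name is made up, and for simplicity it scales each violin so that its widest point is `half_width` from the centre line, which is a different normalisation from the area-based one described above.

```python
import numpy as np

def violin_outline(y, density, x_centre, half_width=0.05):
    """Return polygon vertices (xs, ys) for a symmetrical violin.

    y: positions along the value axis; density: KDE evaluated at y.
    The right-hand edge runs up the value axis and the mirrored
    left-hand edge runs back down, so the vertices trace a closed loop.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(density, dtype=float)
    w = half_width * w / w.max()        # widest point = half_width
    xs = np.concatenate([x_centre + w, x_centre - w[::-1]])
    ys = np.concatenate([y, y[::-1]])
    return xs, ys

# Hypothetical density trace for a violin centred on category x = 1
y = np.linspace(0.0, 1.0, 5)
density = np.array([0.2, 1.0, 0.5, 2.0, 0.4])
xs, ys = violin_outline(y, density, x_centre=1.0)
print(np.column_stack([xs, ys]))
```

If the density is non-zero at the ends (a trimmed violin), drawing these vertices as a closed polygon gives flat caps at the top and bottom.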

Future improvements

Ideas for developments of the Violin Plot method in IgorPro

• incorporate it into the ipf for making boxplots so that it is integrated as an option to ‘calculate percentiles’
• find a better solution for setting the width of the violin
• add other bandwidth options, as in R
• add more options for colouring the violins

What do you think? Did I miss something? Let me know in the comments.

References

Bowman, A.W. & Azzalini, A. (1997) Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford University Press.

Hintze, J.L. & Nelson, R.D. (1998) Violin plots: A Box Plot-Density Trace Synergism. The American Statistician, 52:181-4.

## Joining A Fanclub

When I started this blog, my plan was to write about interesting papers or at least blog about the ones from my lab. This post is a bit of both.

I was recently asked to write a “Journal Club” piece for Nature Reviews Molecular Cell Biology, which is now available online. It’s paywalled unfortunately. It’s also very short, due to the format. For these reasons, I thought I’d expand a bit on the papers I highlighted.

I picked two papers from Dick McIntosh’s group, published in J Cell Biol in the early 1990s as my subject. The two papers are McDonald et al. 1992 and Mastronarde et al. 1993.

Almost everything we know about the microanatomy of mitotic spindles comes from classical electron microscopy (EM) studies. How many microtubules are there in a kinetochore fibre? How do they contact the kinetochore? These questions have been addressed by EM. McIntosh’s group in Boulder, Colorado have published so many classic papers in this area, but there are many more coming from Conly Rieder, Alexey Khodjakov, Bruce McEwen and many others. Even with the advances in light microscopy which have improved spatial resolution (resulting in a Nobel Prize last year), EM is the only way to see individual microtubules within a complex subcellular structure like the mitotic spindle. The title of the piece, Super-duper resolution imaging of mitotic microtubules, is a bit of a dig at the fact that EM still exceeds the resolution available from super-resolution light microscopy. It’s not the first time that this gag has been used, but I thought it suited the piece quite well.

There are several reasons to highlight these papers over other electron microscopy studies of mitotic spindles.

It was the first time that 3D models of microtubules in mitotic spindles were built from electron micrographs of serial sections. This allowed spatial statistical methods to be applied to understand microtubule spacing and clustering. The software that was developed by David Mastronarde to do this was later packaged into IMOD. This is a great software suite that is actively maintained, free to download and is essential for doing electron microscopy. Taking on the same analysis today would be a lot faster, but still somewhat limited by cutting sections and imaging to get the resolution required to trace individual microtubules.

The paper actually showed that some of the microtubules in kinetochore fibres travel all the way from the pole to the kinetochore, and that interpolar microtubules invade the bundle occasionally. This was an open question at the time and was really only definitively answered thanks to the ability to digitise and trace individual microtubules using computational methods.

The final thing I like about these papers is that it’s possible to reproduce the analysis. The methods sections are wonderfully detailed and of course the software is available to do similar work. This is in contrast to most papers nowadays, where it is difficult to understand how the work has been done in the first place, let alone to try and reproduce it in your own lab.

David Mastronarde and Dick McIntosh kindly commented on the piece that I wrote and also Faye Nixon in my lab made some helpful suggestions. There’s no acknowledgement section, so I’ll thank them all here.

References

McDonald, K. L., O’Toole, E. T., Mastronarde, D. N. & McIntosh, J. R. (1992) Kinetochore microtubules in PTK cells. J. Cell Biol. 118, 369-383.

Mastronarde, D. N., McDonald, K. L., Ding, R. & McIntosh, J. R. (1993) Interpolar spindle microtubules in PTK cells. J. Cell Biol. 123, 1475-1489.

Royle, S.J. (2015) Super-duper resolution imaging of mitotic microtubules. Nat. Rev. Mol. Cell. Biol. doi:10.1038/nrm3937 Published online 05 January 2015

The post title is taken from “Joining a Fanclub” by Jellyfish from their classic second and final LP “Spilt Milk”.

## A Day In The Life II

I have been doing paper of the day (#potd) again in 2014. See my previous post about this.

My “rules” for paper of the day are:

1. Read one paper each working day.
2. If I am away, or reviewing a paper for a journal or colleague, then I get a pass.
3. Read it sufficiently to be able to explain it to somebody else, i.e. don’t just scan the abstract and look at the figures. Really read it and understand it. Scan and skim as many other papers as you normally would!
4. Only papers reporting primary research count. No reviews/opinion pieces etc.
5. If it was really good or worth telling people about – tweet about it.
6. Make a simple database in Excel – this helps you keep track, make notes about the paper (to see if you meet #3) and allows you to find the paper easily in the future (this last point turned out to be very useful).

This year has been difficult, especially sticking to #3. My stats for 2014 are:

• 73% success rate. Down from 85% in 2013
• Stats errors in 36% of papers I read!
• 86% of papers were from 2014

Following last year, I wasn’t so surprised by the journals that the papers appeared in:

1. eLife
2. J Cell Biol
3. Mol Biol Cell
4. Dev Cell
5. Nature Methods
6. J Cell Sci
7. J Neurosci
8. Nature Cell Biol
9. Traffic
10. Curr Biol
11. Nature
12. Nature Comm
13. Science

According to my database I only read one paper in Cell this year. I certainly have lots of them in “Saved for later” in Feedly (which is a black hole from which papers rarely emerge to be read). It’s possible that the reason Cell, Nature and Science are low on the list is that I might quickly glance at papers in those journals but not actually read them for #potd. Last year eLife was at number 9 and this year it is at number 1. This journal is definitely publishing a lot of exciting cell biology and also the lens format is very nice for reading.

I think I’ll try to continue this in 2015. The main thing it has made me realise is how few papers I read (I mean really read). I wonder if students and postdocs are actually the main consumers of the literature. If this is correct, do PIs rely on “subsistence reading”, i.e. when they write their own papers and check the immediate literature? Is their deep reading done only when peer reviewing other people’s work? Or do PIs rely on a constant infusion of the latest science at seminars and at meetings?

## Insane In The Brain

Back of the envelope calculations for this post.

An old press release for a paper on endocytosis by Tom Kirchhausen contained this fascinating factoid:

The equivalent of the entire brain, or a football field of membrane, is turned over every hour

If this is true it is absolutely staggering. Let’s check it out.

A synaptic vesicle is ~40 nm in diameter. So the surface area of 1 vesicle is

$$4 \pi r^2$$

which is 5026 nm², or 5.026 × 10⁻¹⁵ m².

Now, an American football field is 5350 m² (including both endzones), which is the equivalent of 1.065 × 10¹⁸ synaptic vesicles.

It is estimated that the human cortex has 60 trillion synapses. This means that each synapse would need to internalise 17742 vesicles to retrieve the area of membrane equivalent to one football field.

The factoid says this takes one hour. This membrane load equates to each synapse turning over 296 vesicles in one minute, which is 4.93 vesicles per second.

Tonic activity of neurons differs throughout the brain and actually 5 Hz doesn’t sound too high (feel free to correct me on this). We’ve only considered cortical neurons, so the factoid seems pretty plausible!

For an actual football field, i.e. Association Football, the calculation is slightly more complicated. This is because there is no set size for football pitches. In England, the largest is apparently Manchester City (7598 m²) while the smallest actually belongs to the greatest football team in the world, Crewe Alexandra (5518 m²).

A brain would hoover up Man City’s ground in an hour if each synapse turned over 7 vesicles per second, while Gresty Road would only take 5 vesicles per second.
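If you want to check the envelope, the whole calculation fits in a few lines of Python. The inputs are the numbers used above (40 nm vesicle, 60 trillion synapses, one hour); the function and variable names are just for this sketch.

```python
import math

VESICLE_DIAMETER_M = 40e-9   # synaptic vesicle diameter, ~40 nm
N_SYNAPSES = 60e12           # estimated synapses in the human cortex

def vesicles_per_synapse_per_second(field_area_m2, hours=1.0):
    """Vesicle turnover rate per synapse needed to retrieve a given area."""
    radius = VESICLE_DIAMETER_M / 2
    vesicle_area = 4 * math.pi * radius**2        # surface area of a sphere
    n_vesicles = field_area_m2 / vesicle_area     # vesicles per field
    return n_vesicles / N_SYNAPSES / (hours * 3600)

fields_m2 = {
    "American football field": 5350,
    "Manchester City": 7598,
    "Crewe Alexandra": 5518,
}
rates = {name: vesicles_per_synapse_per_second(area)
         for name, area in fields_m2.items()}

for name, rate in rates.items():
    print(f"{name}: {rate:.2f} vesicles per synapse per second")
```

Because nothing is rounded along the way, the last digit can differ slightly from the step-by-step figures in the text.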

What is less clear from the factoid is whether a football field really equates to an “entire brain”. Bionumbers has no information on this. I think this part of the factoid may come from a different bit of data which is that clathrin-mediated endocytosis in non-neuronal cells can internalise the equivalent of the entire surface area of the cell in about an hour. I wonder whether this has been translated to neurons for the purposes of the quote. Either way, it is an amazing factoid that the brain can turn over this huge amount of membrane in such a short space of time.

So there you have it: quanta quantified on quantixed.

The post title is from “Insane In The Brain” by Cypress Hill from the album Black Sunday.