Methods papers for MD997

I am now running a new module for master's students, MD997. The aim is to introduce the class to a range of advanced research methods and to get them to think about how to formulate their own research question(s).

The module is built around a paper, allocated to each student in the first session. I had to come up with a list of methods-type papers, which I am posting below. There are 16 students and I picked 23 papers. I aimed to cover their interests, which are biological but with some chemistry, physics and programming thrown in. The papers are a bit imaging-biased, but I tried to get some ‘omics and neuro in there. I included some preprints to make sure the latest methods were covered.

The students picked their top 3 papers and we managed to assign them without too much trouble. Every paper got at least one vote. Some papers were in high demand. Fitzpatrick et al. on cryoEM of Alzheimer’s samples and the organoid paper from Lancaster et al. had by far the most votes.

Each student presents their allocated paper to the class and also uses it to formulate their own research proposal. Which one would you pick?

  1. Booth, D.G. et al. (2016) 3D-CLEM Reveals that a Major Portion of Mitotic Chromosomes Is Not Chromatin Mol Cell 64, 790-802. http://dx.doi.org/10.1016/j.molcel.2016.10.009
  2. Chai, H. et al. (2017) Neural Circuit-Specialized Astrocytes: Transcriptomic, Proteomic, Morphological, and Functional Evidence Neuron 95, 531-549.e9. http://dx.doi.org/10.1016/j.neuron.2017.06.029
  3. Chang, J.B. et al. (2017) Iterative expansion microscopy Nat Methods 14, 593-599. http://dx.doi.org/10.1038/nmeth.4261
  4. Chen, B.C. et al. (2014) Lattice light-sheet microscopy: imaging molecules to embryos at high spatiotemporal resolution Science 346, 1257998. http://dx.doi.org/10.1126/science.1257998
  5. Chung, K. & Deisseroth, K. (2013) CLARITY for mapping the nervous system Nat Methods 10, 508-13. http://dx.doi.org/10.1038/nmeth.2481
  6. Eichler, K. et al. (2017) The Complete Connectome Of A Learning And Memory Center In An Insect Brain bioRxiv. http://dx.doi.org/10.1101/141762
  7. Fitzpatrick, A.W.P. et al. (2017) Cryo-EM structures of tau filaments from Alzheimer’s disease Nature 547, 185-190. http://dx.doi.org/10.1038/nature23002
  8. Habib, N. et al. (2017) Massively parallel single-nucleus RNA-seq with DroNc-seq Nat Methods 14, 955-958. http://dx.doi.org/10.1038/nmeth.4407
  9. Hardman, G. et al. (2017) Extensive non-canonical phosphorylation in human cells revealed using strong-anion exchange-mediated phosphoproteomics bioRxiv. http://dx.doi.org/10.1101/202820
  10. Herzik, M.A., Jr. et al. (2017) Achieving better-than-3-Å resolution by single-particle cryo-EM at 200 keV Nat Methods. http://dx.doi.org/10.1038/nmeth.4461
  11. Jacquemet, G. et al. (2017) FiloQuant reveals increased filopodia density during breast cancer progression J Cell Biol 216, 3387-3403. http://dx.doi.org/10.1083/jcb.201704045
  12. Jungmann, R. et al. (2014) Multiplexed 3D cellular super-resolution imaging with DNA-PAINT and Exchange-PAINT Nat Methods 11, 313-8. http://dx.doi.org/10.1038/nmeth.2835
  13. Kim, D.I. et al. (2016) An improved smaller biotin ligase for BioID proximity labeling Mol Biol Cell 27, 1188-96. http://dx.doi.org/10.1091/mbc.E15-12-0844
  14. Lancaster, M.A. et al. (2013) Cerebral organoids model human brain development and microcephaly Nature 501, 373-9. http://dx.doi.org/10.1038/nature12517
  15. Madisen, L. et al. (2012) A toolbox of Cre-dependent optogenetic transgenic mice for light-induced activation and silencing Nat Neurosci 15, 793-802. http://dx.doi.org/10.1038/nn.3078
  16. Penn, A.C. et al. (2017) Hippocampal LTP and contextual learning require surface diffusion of AMPA receptors Nature 549, 384-388. http://dx.doi.org/10.1038/nature23658
  17. Qin, P. et al. (2017) Live cell imaging of low- and non-repetitive chromosome loci using CRISPR-Cas9 Nat Commun 8, 14725. http://dx.doi.org/10.1038/ncomms14725
  18. Quick, J. et al. (2016) Real-time, portable genome sequencing for Ebola surveillance Nature 530, 228-232. http://dx.doi.org/10.1038/nature16996
  19. Ries, J. et al. (2012) A simple, versatile method for GFP-based super-resolution microscopy via nanobodies Nat Methods 9, 582-4. http://dx.doi.org/10.1038/nmeth.1991
  20. Rogerson, D.T. et al. (2015) Efficient genetic encoding of phosphoserine and its nonhydrolyzable analog Nat Chem Biol 11, 496-503. http://dx.doi.org/10.1038/nchembio.1823
  21. Russell, M.R. et al. (2017) 3D correlative light and electron microscopy of cultured cells using serial blockface scanning electron microscopy J Cell Sci 130, 278-291. http://dx.doi.org/10.1242/jcs.188433
  22. Strickland, D. et al. (2012) TULIPs: tunable, light-controlled interacting protein tags for cell biology Nat Methods 9, 379-84. http://dx.doi.org/10.1038/nmeth.1904
  23. Yang, J. et al. (2015) The I-TASSER Suite: protein structure and function prediction Nat Methods 12, 7-8. http://dx.doi.org/10.1038/nmeth.3213

If you are going to do a similar exercise, Twitter is invaluable for paper suggestions. None of the students complained that they couldn’t find three papers that matched their interests. I set up a slide carousel in PowerPoint with the front page of each paper together with some keywords to tell the class quickly what each paper was about. I gave them some discussion time and then collated their choices on the board. Assigning the papers was quite straightforward; I tried to honour first choices as far as possible. Having an excess of papers prevented too much horse trading over the papers that multiple people had picked.

Hopefully you find this list useful. I was inspired by Raphaël posting his own list here.

The Digital Cell: Workflow

The future of cell biology, even for small labs, is quantitative and computational. What does this mean and what should it look like?

My group is not there yet, but in this post I’ll describe where we are heading. The graphic below shows my current view of the ideal workflow for my lab.

[Figure: the ideal workflow for the lab]

The graphic is pretty self-explanatory, but to walk you through:

  • A lab member sets up a microscopy experiment. We have standardised procedures/protocols in a lab manual and systems are in place so that reagents are catalogued to minimise error.
  • Data go straight from the microscope to the server (and are backed up). Images and metadata are held in a database, and object identifiers are used for referencing in electronic lab notebooks (and for auditing). A minimal sketch of this cataloguing step is shown after the list.
  • Analysis of the data happens with varying degrees of human intervention. The outputs of all analyses are processed automatically. Code for doing these steps is under version control using git (GitHub).
  • Post-analysis, the processed outputs contain markers for QC and error checking. We can also trace back to the original data and check the analysis. Development of code happens here too, speeding up slow procedures via “software engineering”.
  • Figures are generated using scripts which are linked to the original data with an auditable record of any modification to the image.
  • Project management, particularly of paper writing, is via Trello. Writing papers is done using collaborative tools. Everything is synchronised to enable working from any location.
  • This is just an overview and some details are missing, e.g. backup of analyses is done locally and via the server.
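To make the cataloguing step more concrete, here is a minimal Ruby sketch of a script that could register images as they arrive on the server, giving each file an identifier and a checksum for auditing. This is an illustration only, not our actual system; the directory paths and the manifest format are made up.

#!/usr/bin/ruby
# Minimal sketch: catalogue image files arriving on the server.
# Each file gets an identifier (for the electronic lab notebook) and a
# checksum (for auditing). Paths and manifest format are hypothetical.

require 'digest'
require 'csv'
require 'securerandom'
require 'time'

DATA_DIR = "/server/microscopy/incoming"     # hypothetical location
MANIFEST = "/server/microscopy/manifest.csv" # hypothetical manifest

CSV.open(MANIFEST, "a") do |csv|
  Dir.glob(File.join(DATA_DIR, "**", "*.tif")).each do |path|
    id  = SecureRandom.uuid                      # object identifier
    sha = Digest::SHA256.file(path).hexdigest    # checksum of the image file
    csv << [id, path, sha, Time.now.iso8601]
  end
end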

Just to reiterate: my team is not at this point yet, but we are reasonably close. We have not yet implemented three of these things properly in my group, but in our latest project (via collaboration) the workflow has worked as described above.

The output is a manuscript! In the future I can see that publication of a paper as a condensed report will give way to making the data, scripts and analysis available, together with a written summary. This workflow is designed to allow this to happen easily, but this is the topic for another post.

Part of a series on the future of cell biology in quantitative terms.

Voice Your Opinion: Editors shopping for preprints is the future

Today I saw a tweet from Manuel Théry (an Associate Editor at Mol Biol Cell), which said that he had heard that the Editor-in-Chief of MBoC, David Drubin, shops for interesting preprints on bioRxiv and encourages the authors to submit to MBoC. This is not a surprise to me. I’ve read that authors of preprints on bioRxiv have been approached by journal Editors previously (here and here; there are many more). I’m pleased that David is forward-thinking and that MBoC is doing this actively.

I think this is the future.

Why? If we ignore for a moment the “far future” which may involve the destruction of most journals, leaving a preprint server and a handful of subject-specific websites which hunt down and feature content from the server and co-ordinate discussions and overviews of current trends… I actually think this is a good idea for the “immediate future” of science and science publishing. Two reasons spring to mind.

  1. Journals would be crazy to miss out: The manuscripts that I am seeing on bioRxiv are not stuff that’s been dumped there with no chance of “real publication”. This stuff is high profile. I mean that in the following sense: the work in my field that has been posted is generally interesting, it is from labs that do great science, and it is as good as work in any journal (obviously). For some reason I have found myself following what is being deposited here more closely than at any real journal. Journals would be crazy to miss out on this stuff.
  2. Levelling the playing field: For better or worse, papers are judged on where they are published. The thing that bothers me most about this is that manuscripts are only submitted to one or a few journals before “finding their home”. This process is highly noisy, and it means that, if we accept that there is a journal hierarchy, your paper may or may not be deserving of the kudos it receives in its resting place. If all journals actively scour the preprint server(s), the authors can then pick the “highest bidder”. This would make things fairer in the sense that all journals in the hierarchy would have a chance to consider the paper, and its resting place may actually reflect its true quality.

I don’t often post opinions here, but I thought this would take more than 140 characters to explain. If you agree or disagree, feel free to leave a comment!

Edit @ 11:46 16-05-26 Pedro Beltrao pointed out that this idea is not new; see this post of his from 2007.

Edit 16-05-26 Misattributed the track to Extreme Noise Terror (corrected). Also added some links thanks to Alexis Verger.

The post title comes from “Voice Your Opinion” by Unseen Terror. The version I have is from a Peel sessions compilation “Hardcore Holocaust”.

A Day In The Life III

This year #paperOTD (or paper of the day for any readers not on Twitter) did not go well for me. I’ve been busy with lots of things and I’m now reviewing more grants than last year because I am doing more committee work. This means I am finding less time to read one paper per day. Nonetheless I will round up the stats for this year. I only managed to read a paper on 59.2% of available days…

The top ten journals that published the papers that I read:

  • 1 Nat Commun
  • 2 J Cell Biol
  • 3 Nature
  • 4= Cell
  • 4= eLife
  • 4= Traffic
  • 7 Science
  • 8= Dev Cell
  • 8= Mol Biol Cell
  • 8= Nat Cell Biol

Nature Communications has published some really nice cell biology this year and I’m not surprised it’s number one. Also, I read more papers in Cell this year compared to last. The papers I read are mainly recent. Around 83% of the papers were published in 2015. Again, a significant fraction (42%) of the papers have statistical errors. Funnily enough there were no preprints in my top ten. I realised that I tend to read these when approving them as an affiliate (thoroughly enough for #paperOTD if they interest me) but I don’t mark them in the database.

I think my favourite paper was this one on methods to move organelles around cells using light, see also this paper for a related method. I think I’ll try again next year to read one paper per day. I’m a bit worried that if I don’t attempt this, I simply won’t read any papers in detail.


I also resolved to read one book per month in 2015. I managed this in 2014, but fell short in 2015, just like with #paperOTD. The best book from a limited selection was Matthew Cobb’s Life’s Greatest Secret, a tale of the early days of molecular biology as it happened. I was a bit sceptical that Matthew could bring anything new to this area of scientific history. Having read The Eighth Day of Creation, and then some pale imitations, I thought that this had pretty much been covered completely. This book, however, takes a fresh perspective and it’s worth reading. Matthew has a nice writing style, animating the dusty old main characters with insightful detail as he goes along. Check it out.

This blog is going well, with readership growing all the time. I have written about this progress previously (here and here). The most popular posts are those on publishing: preprints, impact factors and publication lag times, rather than my science, but that’s OK. There is more to come on lag times in the New Year, stay tuned.

I am a fan of year-end lists, as you may be able to tell. My album of the year is Battles – La Di Da Di, which came out on Warp in September. An honourable mention goes to Air Formation – Were We Ever Here EP, which I bought on iTunes since the 250 copies had long gone by the time I discovered it on AC30.

Since I don’t watch TV or go to the cinema, I don’t have a pick of the year for those. When it comes to pro-cycling, of course I have an opinion. My favourite stage race was the Critérium du Dauphiné Libéré, which was won by Chris Froome in a close contest with Tejay van Garderen. The best one-day race was a tough pick between E3 Harelbeke, won by Geraint Thomas, and Omloop Het Nieuwsblad, won by Ian Stannard. Although E3 was a hard man’s race in tough conditions, I have to go for Stannard outfoxing three(!) Etixx Quick Step riders to take the win in Nieuwsblad. I’m a bit annoyed that those three picks all involve Team Sky and British riders… I won’t bore everyone with my own cycling (and running) exploits in 2015. Just to say that I’ve been more active this year than in any year since 2009.

I shouldn’t need to tell you where the post title comes from. If you haven’t heard Sgt. Pepper’s Lonely Hearts Club Band by The Beatles, you need to rectify this urgently. The greatest album recorded on 4-track equipment, no question. 🙂

 

The Great Curve: Citation distributions

This post follows on from a previous post on citation distributions and the wrongness of Impact Factor.

Stephen Curry had previously made the call that journals should “show us the data” that underlie the much-maligned Journal Impact Factor (JIF). This call made me wonder what “showing us the data” would look like and how journals might do it.

What citation distribution should we look at? The JIF looks at citations in a year to articles published in the preceding 2 years. This captures a period in a paper’s life, but it misses “slow burner” papers and also underestimates the impact of papers that just keep generating citations long after publication. I wrote a quick bit of code that would look at a decade’s worth of papers at one journal to see what happened to them as yearly cohorts over that decade. I picked EMBO J to look at since they have actually published their own citation distribution, and also they appear willing to engage with more transparency around scientific publication. Note that, when they published their distribution, it considered citations to papers via a JIF-style window over 5 years.
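For reference, the standard JIF calculation can be written out explicitly; for 2014, say:

\[
\mathrm{JIF}_{2014} = \frac{\text{citations received in 2014 by items published in 2012 and 2013}}{\text{number of citable items published in 2012 and 2013}}
\]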

I pulled 4082 papers with a publication date of 2004-2014 from Web of Science (the search was limited to Articles), along with data on the citations that occurred in each year. I generated histograms to look at the distribution of citations for each year. Papers published in 2004 are in the top row and papers from 2014 are in the bottom row. The first column of histograms shows citations in the same year as publication, the next column the following year, and so on. The number of papers is on the y-axis and the number of citations on the x-axis. Sorry for the lack of labelling! My excuse is that my code made a plot with “subwindows”, which I’m not too familiar with.

[Figure: histograms of citations for each publication year (rows) and citation year (columns)]

What is interesting is that the distribution changes over time:

  • In the year of publication, most papers are not cited at all. This is expected: there is a lag before papers that cite the work can themselves be published, and papers that come out later in the year have less chance of picking up a citation before the year ends.
  • The following year most papers are picking up citations: the distribution moves rightwards.
  • Over the next few years the distribution relaxes back leftwards as the citations die away.
  • The distributions are always skewed. Few papers get loads of citations, most get very few.

Although I truncated the x-axis at 40 citations, there are a handful of papers that are picking up >40 cites per year up to 10 years after publication – clearly these are very useful papers!

To summarise these distributions I generated the median (and the mean – I know, I know) number of citations for each publication year-citation year combination and made plots.

[Figure: mean (left) and median (right) citations per year, same layout as the histograms above]

The mean is shown on the left and median on the right. The layout is the same as in the multi-histogram plot above.

Follow along a row and you can again see how a cohort of papers attracts citations, peaks and then dies away. You can also see that some years were better than others in terms of citations: 2004 and 2005 were good years, 2007 was not so good. It is very difficult, if not impossible, to judge how 2013 and 2014 papers will fare into the future.

What was the point of all this? Well, I think showing the citation data that underlie the JIF is a good start. However, citation data are more nuanced than the JIF allows for. So being able to choose how we look at the citations is important to understand how a journal performs. Having some kind of widget that allows one to select the year(s) of papers to look at and the year(s) that the citations came from would be perfect, but this is beyond me. Otherwise, journals would probably elect to show us a distribution for a golden year (like 2004 in this case), or pick a window for comparison that looked highly favourable.

Finally, I think journals are unlikely to provide this kind of analysis. They should, if only because it is a chance for a journal to show how it publishes many papers that are really useful to the community. Anyway, maybe they don’t have to… What this quick analysis shows is that it can be (fairly) easily harvested and displayed. We could crowdsource this analysis using standardised code.

Below is the code that I used – it’s a bit rough and would need some work before it could be used generally. It also uses a 2D filtering method that was posted on IgorExchange by John Weeks.
[Image: the code used for this analysis]
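Since the code above is only shown as an image, here is a rough Ruby sketch of the summarising step for anyone who wants to reproduce it. This is not the code used for the figures; it assumes a hypothetical citations.csv with one row per paper per year and columns pub_year, cite_year and citations.

#!/usr/bin/ruby
# Sketch of the summary step: median and mean citations for each
# publication-year / citation-year combination.
# Assumes a hypothetical citations.csv (columns: pub_year, cite_year, citations).

require 'csv'

def median(values)
  s = values.sort
  m = s.length / 2
  s.length.odd? ? s[m] : (s[m - 1] + s[m]) / 2.0
end

cohorts = Hash.new { |h, k| h[k] = [] }

CSV.foreach("citations.csv", headers: true) do |row|
  cohorts[[row["pub_year"], row["cite_year"]]] << row["citations"].to_i
end

cohorts.sort.each do |(pub_year, cite_year), counts|
  mean = counts.sum / counts.length.to_f
  puts [pub_year, cite_year, median(counts), mean.round(2)].join(",")
end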

The post title is taken from “The Great Curve” by Talking Heads from their classic LP Remain in Light.

Middle of the road: pitching your paper

I saw this great tweet (fairly) recently:

https://twitter.com/jkpfeiff/status/589148184284254208/

I thought this was such a great explanation of when to submit your paper.

It reminded me of a diagram that I sketched out when talking to a student in my lab about a paper we were writing. I was trying to explain why we don’t exaggerate our findings and, conversely, why we don’t undersell our results either. I replotted it below:

[Figure: the paper-pitching diagram]

Getting out to review is a major hurdle to publishing a paper, so convincing the Editor that you have found out something amazing is the first task. This is counterbalanced by peer review, which scrutinises the claims made in a paper for their experimental support. So, exaggerated claims might get you over the first hurdle, but they will give you problems during peer review (and afterwards, if the paper makes it to print). Conversely, underselling or not interpreting all your data fully is a different problem. It’s unlikely to impress the Editor, as it can make your paper seem “too specialised”, although if it made it into the hands of your peers they would probably like it! Obviously, at either end of the spectrum, no-one likes a dull/boring/incremental paper, and everyone can smell a rat if the claims are completely overblown, e.g. the genome sequence of Sasquatch.

So this is why we try to interpret our results fully but are careful not to exaggerate our claims. It might not get us out to review every time, but at least we can sleep at night.

I don’t know if this is a fair representation. Certainly depending on the journal the scale of the y-axis needs to change!

The post title is taken from “Middle of the Road” by Teenage Fanclub, a B-side from their single “I Don’t Want Control of You”.

Zero Tolerance

We were asked to write a Preview piece for Developmental Cell. Two interesting papers dealing with the insertion of amphipathic helices into membranes to influence membrane curvature during endocytosis were scheduled for publication, and the journal wanted some “front matter” to promote them.

Our Preview is paywalled – sorry about that – but I can briefly tell you why these two papers are worth a read.

The first paper – a collaboration between EMBL scientists led by Marko Kaksonen – deals with the yeast proteins Ent1 and Sla2. Ent1 has an ENTH domain and Sla2 has an ANTH domain. ENTH stands for Epsin N-terminal homology, whereas ANTH means AP180 N-terminal homology. These two domains are known to bind membrane and, in the case of ENTH, to tubulate and vesiculate giant unilamellar vesicles (GUVs). Ent1 does this via an amphipathic helix, “Helix 0”, that inserts into the outer leaflet to bend the membrane. The new paper shows that Ent1 and Sla2 can bind together (regulated by PIP2) and that ANTH regulates ENTH so that it doesn’t make lots of vesicles; instead, the two team up to make regular membrane tubules. The tubules are decorated with a regular “coat” of these adaptor proteins. This coat could prepattern the clathrin lattice. Also, because Sla2 links to actin, actin can presumably pull on this lattice to help drive the formation of a new vesicle. The regular spacing might distribute the forces evenly over large expanses of membrane.

The second paper – from David Owen’s lab at CIMR in Cambridge – shows that CALM (a protein with an ANTH domain) actually has a secret Helix 0! They show that this forms on contact with lipid. CALM influences the size of clathrin-coated pits and vesicles, by influencing curvature. They propose a model where cargo size needs to be matched to vesicle size, simply due to the energetics of pit formation. The idea is that cells do this by regulating the ratio of AP2 to CALM.

You can read our preview and the papers by Skruzny et al and Miller et al in the latest issue of Dev Cell.

The post title and the title of our Preview are taken from “Zero Tolerance” by Death from their Symbolic LP. I didn’t want to be outdone by these Swedish scientists who have been using Bob Dylan song titles and lyrics in their papers for years.

To Open Closed Doors: How open is your reference list?

Our most recent manuscript was almost ready for submission. We were planning to send it to an open access journal. It was then that I had the thought: how many papers in the reference list are freely available?

It somehow didn’t make much sense to point readers towards papers that they might not be able to access. So, I wondered if there was a quick way to determine how many of the papers in my reference list were open access. I asked on Twitter and got a number of suggestions:

  1. Search crossref to find out if the journal is in DOAJ (@epentz)
  2. How Open Is It? from Cottage Labs will check a list of DOIs (up to 20) for openness (@emanuil_tolev)
  3. Open access DOI Resolver will perform a similar task (@neurocraig)

I actually used a fourth method (from @biochemistries and @invisiblecomma), which was to use HubMed, although in the end a similar solution can be reached by searching PubMed itself. Whereas the other strategies will work for a range of academic texts, everything in my reference list was from PubMed, so this solution worked well for me. I pulled out the list of Accessions (PMIDs) for my reference list, because some papers were old and I did not have their DOIs. The quickest way to do this was to make a new EndNote style that only contained the Accession field and get it to generate a new bibliography from my manuscript. I appended “[uid] OR” after each one and searched with that term.
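If you want to script this step rather than paste the search term into PubMed, here is a rough sketch (my own illustration, not the HubMed route above) that uses NCBI E-utilities to ask which of a list of PMIDs fall into PubMed’s “free full text” subset. The PMIDs below are placeholders, and real use should respect the E-utilities rate limits.

#!/usr/bin/ruby
# Sketch: check which PMIDs in a reference list are in PubMed's
# "free full text" subset, via the E-utilities esearch endpoint.
# The PMIDs are placeholders for illustration.

require 'net/http'
require 'json'
require 'uri'

pmids = ["23000000", "24000000", "25000000"]   # hypothetical reference list

term = "(" + pmids.map { |p| "#{p}[uid]" }.join(" OR ") + ") AND free full text[sb]"

uri = URI("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi")
uri.query = URI.encode_www_form(db: "pubmed", term: term, retmode: "json", retmax: pmids.length)

free = JSON.parse(Net::HTTP.get(uri))["esearchresult"]["idlist"]

pmids.each do |p|
  puts "#{p}\t#{free.include?(p) ? 'free to read' : 'not free'}"
end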

What happened?

My paper had 44 references. Of these, 35 were freely available to read. I was actually surprised by how many were available. So, 9 papers were not free to read. As advised, I checked each one to really make sure that the HubMed result was accurate, and it was.

Please note that I’d written the paper without giving this a thought, citing papers as I normally do: the best demonstration of something, the first paper to show something, using primary papers as far as possible.

Seven of the nine I couldn’t compromise on. They’re classic papers from the 80s and 90s that are still paywalled but are unique in what they describe. However, two papers were reviews in closed access journals. Now these I could do something about! Especially as I prefer to cite the primary literature anyway. Plus, most reviews are pretty unoriginal in what they cover, and a fairly recent open access alternative can easily be found. I’ll probably run this check for future manuscripts and see what it throws up.

Summary

It’s often said that papers are our currency in science. The valuation of this currency comes from citations. Funnily enough, we the authors are in a position to actually do something about this. I don’t think any of us should compromise the science in our manuscripts. However, I think we could all probably pay a bit more attention to the citations that we dish out when writing a paper, whether that is simply making sure that what we cite is widely accessible, or making sure that credit goes to the right people.

The post title is taken from “To Open Closed Doors” by D.R.I. from the Dirty Rotten LP.

Waiting to happen II: Publication lag times

Following on from the last post about publication lag times at cell biology journals, I went ahead and crunched the numbers for all journals in PubMed for one year (2013). Before we dive into the numbers, a couple of points about this kind of information.

  1. Some journals “reset the clock” on the received date with manuscripts that are resubmitted. This makes comparisons difficult.
  2. The length of publication lag is not necessarily a reflection of the way the journal operates. As this comment points out, manuscripts are out of the journal’s hands (with the reviewers) for a substantial fraction of the time.
  3. The dataset is incomplete because the deposition of this information is not mandatory. About 1/3 of papers have the date information deposited (see below).
  4. Publication lag times go hand-in-hand with peer review. Moving to preprints and post-publication review would eradicate these delays.

Thanks for all the feedback on my last post, particularly those that highlighted the points above.

[Figure: raw publication lag times for all papers]

To see how all this was done, check out the Methods bit below, where you can download the full summary. I ended up with a list of publication lag times for 428500 papers published in 2013 (see above). To make a bit more sense of this, I split them by journal and then found the publication lag time stats for each. This had to be done per journal since PLoS ONE alone makes up 45560 of the records.

[Figure: histogram of Median lag times for all journals]

To try and visualise what these publication lag times look like for all journals, I made a histogram of the Median lag times for all journals using a 10 d bin width. It takes on average ~100 d to go from Received to Accepted and a further ~120 d to go from Accepted to Published. The whole process on average takes 239 days.

To get a feel for the variability in these numbers I plotted out the ranked Median times for each journal and overlaid Q25 and Q75 (dots). The IQR for some of the slower journals was >150 d. So the papers that they publish can have very different fates.

[Figure: median Rec-Acc time binned by 2013 Journal Impact Factor]

Is the publication lag time longer at higher-tier journals? To look at this, I used the Rec-Acc time and the 2013 Journal Impact Factor (JIF) which, although widely derided and flawed, does correlate loosely with journal prestige. I have fewer journals in this dataset, because the lookup of JIFs didn’t find every journal in my starting set, either because the journal doesn’t have one or because there were minor differences between the PubMed name and the Thomson-Reuters name. The median of the median Rec-Acc times for each bin is shown. So, on average, journals with a JIF <1 will take 1 month longer to accept your paper than journals with a JIF ranging from 1-10. After this it rises again, to ~2 months longer at journals with a JIF over 10. Why? Perhaps at the lower end the trouble is finding reviewers, whereas at the higher end multiple rounds of review might become a problem.

The executive summary is below. These are the times (in days) for delays at all journals in PubMed for 2013.

Interval                 Median   Q25   Q75
Received-to-Accepted         97    69   136
Accepted-to-Published       122    84   186
Received-to-Published       239   178   319

For comparison:

  1. Median time from ovulation to birth of a human being is 268 days.
  2. Mark Beaumont cycled around the world (29,446 km) in 194 days.
  3. Ellen MacArthur circumnavigated the globe single-handed in 72 days.

On the whole it seems that publishing in Cell Biology is quite slow compared to the whole of PubMed. Why this is the case is a tricky question. Is it because cell biologists submit papers too early and they need more revision? Are they more dogged in sending back rejected manuscripts? Is it because as a community we review too harshly and/or ask too much of the authors? Do Editors allow too many rounds of revision or not give clear guidance to expedite the time from Received-to-Accepted? It’s probably a combination of all of these factors and we’re all to blame.

Finally, this amusing tweet showing the transparency of EMBO J publication timelines raises the question: would these authors have been better off just sending the paper somewhere else?

Methods: I searched PubMed using journal article[pt] AND ("2013/01/01"[PDAT] : "2013/12/31"[PDAT]). This gave a huge xml file (~16 GB) which nokogiri balked at, so I divided the query up into subranges of those dates (1.4 GB) and ran the script on all the xml files. This gave 1425643 records. I removed records that did not have a received date or that had greater than 12 in the month field (leaving 428513 records). 13 of these records did not have a journal name. This gave 428500 records from 3301 journals. Again, I filtered out negative values (papers accepted before they were received) and a couple of outliers (e.g. 6000 days!). With a bit of code it was quite straightforward to extract simple statistics for each of the journals. You can download the data here to look up the information for a journal of your choice (WordPress only allows xls, not txt/csv). The fields show the journal name and the number of valid articles. Then for Acc-Pub, Rec-Acc and Rec-Pub, the number, Median, lower quartile and upper quartile times in days are given. I set a limit of 5 or more articles for calculation of the stats. Blank entries are where there was no valid data. Note that there are some differences with the table in my last post. This is because for that analysis I used a bigger date range and then filtered the year based on the published field. Here my search started out by specifying PDAT, which is slightly different.

The data are OK, but the publication date needs to be taken with a pinch of salt. For many records it was missing a month or day, so the date used for some records is approximate. In retrospect, using the Entrez date or one of the other required fields would probably have been better. I liked the idea of the publication date as this is when the paper finally appears in print, which still represents a significant delay at some journals. The Received-to-Accepted dates are valid though.

Waiting to Happen: Publication lag times in Cell Biology Journals

My interest in publication lag times continues. Previous posts have looked at how long it takes my lab to publish our work, how often trainees publish, and at the very long lag times at Oncogene. I recently read a blog post on automated calculation of publication lag times for Bioinformatics journals. I thought it would be great to do this for Cell Biology journals too. Hopefully people will find it useful and can use this list when thinking about where to send their paper.

What is publication lag time?

If you are reading this, you probably know how science publication works. Feel free to skip. Otherwise, it goes something like this. After writing up your work for publication, you submit it to a journal. Assuming that this journal will eventually publish the paper (there is usually a period of submitting, getting rejected, resubmitting to a different journal etc.), they receive the paper on a certain date. They send it out to review, they collate the reviews and send back a decision, you (almost always) revise your paper further and then send it back. This can happen several times. At some point it gets accepted on a certain date. The journal then prepares the paper for publication in a scheduled issue on a specific date (they can also immediately post papers online without formatting). All of these steps add significant delays. It typically takes 9 months to publish a paper in the biomedical sciences. In 2015 this sounds very silly, when world-wide dissemination of information is as simple as a few clicks on a trackpad. The bigger problem is that we rely on papers as a currency to get jobs or funding and so these delays can be more than just a frustration, they can affect your ability to actually do more science.

The good news is that it is very straightforward to parse the received, accepted and published dates from PubMed. So we can easily calculate the publication lags for cell biology journals. If you don’t work in cell biology, just follow the instructions below to make your own list.

The bad news is that the deposition of the date information in PubMed depends on the journal. The extra bad news is that three of the major cell biology journals do not deposit their data: J Cell Biol, Mol Biol Cell and J Cell Sci. My original plan was to compare these three journals with Traffic, Nat Cell Biol and Dev Cell. Instead, I extended the list to include other journals which take non-cell biology papers (and deposit their data).

[Figure: box plots of publication lag times (Rec-Pub, Rec-Acc, Acc-Pub) for eight journals]

A summary of the last ten years

Three sets of box plots here show the publication lags for eight journals that take cell biology papers. The journals are Cell, Cell Stem Cell, Current Biology, Developmental Cell, EMBO Journal, Nature Cell Biology, Nature Methods and Traffic (see note at the end about eLife). They are shown in alphabetical order. The box plots show the median and the IQR; whiskers show the 10th and 90th percentiles. The three plots show the time from Received-to-Published (Rec-Pub), and then a breakdown of this time into Received-to-Accepted (Rec-Acc) and Accepted-to-Published (Acc-Pub). The colours are just to make it easier to tell the journals apart and don’t have any significance.

You can see from these plots that the journals differ widely in the time it takes to publish a paper there. Current Biology is very fast, whereas Cell Stem Cell is relatively slow. The time it takes the journals to move a paper from acceptance to publication is pretty constant, apart from Traffic, where it takes an average of ~3 months to get something into print. Remember that the paper is often online for this period, so this is not necessarily a bad thing. I was not surprised that Current Biology was the fastest. At this journal, a presubmission inquiry is required and the referees are often lined up in advance. The staff are keen to publish rapidly, hence the name, Current Biology. I was amazed at Nature Cell Biology having such a short time from Received-to-Accepted. The delay in Received-to-Accepted comes from multiple rounds of revision and from doing extra experimental work. Anecdotally, it seems that the review at Nature Cell Biol should be just as lengthy as at Dev Cell or EMBO J. I wonder if the received date is accurate… it is possible to massage this date by first rejecting the paper but allowing a resubmission, and then using the resubmission date as the received date [Edit: see below]. One way to legitimately limit this delay is to only allow a certain time for revisions and only allow one round of corrections. This is what happens at J Cell Biol; unfortunately we don’t have this data to see how effective this is.

[Figure: publication lag times per year for each journal]

How has the lag time changed over the last ten years?

Have the slow journals always been slow? When did they become slow? Again, three plots are shown (side-by-side) depicting the Rec-Pub time and then the Rec-Acc and Acc-Pub times. Now the intensity of red or blue shows the data for each year (2014 is the most intense colour). Again, you can see that the dataset is not complete, with missing date information for Traffic for many years, for example.

Interestingly, the publication lag has been pretty constant for some journals but not others. Cell Stem Cell and Dev Cell (but not the mothership – Cell) have seen increases, as have Nature Cell Biology and Nature Methods. On the whole, Acc-Pub times are stable, except for Nature Methods, which is the only journal in the list to see an increase over the time period. This just leaves us with the task of drawing up a ranked list from the fastest to the slowest journal. Then we can see which of these journals is likely to delay dissemination of our work the most.

The Median times (in days) for 2013 are below. The journals are ranked in order of fastest to slowest for Received-to-Publication. I had to use 2013 because EMBO J is missing data for 2014.

Journal            Rec-Pub   Rec-Acc   Acc-Pub
Curr Biol              159      99.5        56
Nat Methods            192       125        68
Cell                   195       169        35
EMBO J                 203       142        61
Nature Cell Biol       237       180        59
Traffic                244       161        86
Dev Cell               247       204        43
Cell Stem Cell         284       205        66

You’ll see that only Cell Stem Cell is over the threshold where it would be faster to conceive and give birth to a human being than to publish a paper there (on average). If the additional time wasted in submitting your manuscript to other journals is factored in, it is likely that most papers are at least on a par with the median gestation time.

If you are wondering why eLife is missing… as a new journal it didn’t have ten years’ worth of data to analyse. It did have a reasonably complete set for 2013 (but Rec-Acc only). The median time was 89 days, beating Current Biology by 10.5 days.

Methods

Please check out Neil Saunders’ post on how to do this. I did a PubMed search for (journal1[ta] OR journal2[ta] OR ...) AND journal article[pt] to make sure I didn’t get any reviews or letters etc. I limited the search from 2003 onwards to make sure I had 10 years of data for the journals that deposited it. I downloaded the file as xml and I used Ruby/Nokogiri to parse the file to csv. Installing Nokogiri is reasonably straightforward, but the documentation is pretty impenetrable. The ruby script I used was from Neil’s post (step 3) with a few lines added:


#!/usr/bin/ruby
# Parse PubMed XML and write one csv line per article:
# journal, PMID, received date (Y,M,D), accepted date (Y,M,D), published date (Y,M,D)

require 'nokogiri'

f = File.open(ARGV.first)
doc = Nokogiri::XML(f)
f.close

doc.xpath("//PubmedArticle").each do |a|
  r = Array.new(11, "")
  r[0] = a.xpath("MedlineCitation/Article/Journal/ISOAbbreviation").text
  r[1] = a.xpath("MedlineCitation/PMID").text
  r[2] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Year").text
  r[3] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Month").text
  r[4] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='received']/Day").text
  r[5] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Year").text
  r[6] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Month").text
  r[7] = a.xpath("PubmedData/History/PubMedPubDate[@PubStatus='accepted']/Day").text
  r[8] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/PubDate/Year").text
  r[9] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/PubDate/Month").text
  r[10] = a.xpath("MedlineCitation/Article/Journal/JournalIssue/PubDate/Day").text
  puts r.join(",")
end

and then executed as described. The csv could then be imported into IgorPro and processed. Neil’s post describes a workflow for R, or you could use Excel or whatever at this point. As he notes, quite a few records are missing the date information and some of it is wrong, i.e. published before it was accepted. These need to be cleaned up. The other problem is that the month is sometimes an integer and sometimes a three-letter code. He uses lubridate in R to get around this; a loop-replace in Igor is easy to construct, and even Excel can handle this with an IF statement, e.g. IF(LEN(G2)=3,MONTH(1&LEFT(G2,3)),G2) if the month is in G2. Good luck!
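In Ruby, the same month clean-up and the lag calculation itself only take a few lines. This is a quick sketch of the idea (not the code used for the analysis above), with made-up example dates:

#!/usr/bin/ruby
# Sketch: normalise the month field (integer or three-letter code) and
# compute lag times in days. The example dates are made up.

require 'date'

def month_to_i(m)
  m.to_s =~ /\A\d+\z/ ? m.to_i : Date::ABBR_MONTHNAMES.index(m.to_s.capitalize)
end

received  = Date.new(2013, month_to_i("Mar"), 4)
accepted  = Date.new(2013, month_to_i("6"), 10)
published = Date.new(2013, month_to_i("Aug"), 1)

puts "Rec-Acc: #{(accepted - received).to_i} days"
puts "Acc-Pub: #{(published - accepted).to_i} days"
puts "Rec-Pub: #{(published - received).to_i} days"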

Edit 9/3/15 @ 17:17 several people (including Deborah Sweet and Bernd Pulverer from Cell Press/Cell Stem Cell and EMBO, respectively) have confirmed via Twitter that some journals use the date of resubmission as the submitted date. Cell Stem Cell and EMBO journals use the real dates. There is no way to tell whether a journal does this or not (from the deposited data). Stuart Cantrill from Nature Chemistry pointed out that his journal do declare that they sometimes reset the clock. I’m not sure about other journals. My own feeling is that – for full transparency – journals should 1) record the actual dates of submission, acceptance and publication, 2) deposit them in PubMed and add them to the paper. As pointed out by Jim Woodgett, scientists want the actual dates on their paper, partly because they are the real dates, but also to claim priority in certain cases. There is a conflict here, because journals might appear inefficient if they have long publication lag times. I think this should be an incentive for Editors to simplify revisions by giving clear guidance and limiting successive revision cycles. (This Edit was corrected 10/3/15 @ 11:04).

The post title is taken from “Waiting to Happen” by Super Furry Animals from the “Something 4 The Weekend” single.