## A Day In The Life II

I have been doing paper of the day (#potd) again in 2014. See my previous post about this.

My “rules” for paper of the day are:

1. Read one paper each working day.
2. If I am away, or reviewing a paper for a journal or colleague, then I get a pass.
3. Read it sufficiently to be able to explain it to somebody else, i.e. don’t just scan the abstract and look at the figures. Really read it and understand it. Scan and skim as many other papers as you normally would!
4. Only papers reporting primary research count. No reviews/opinion pieces etc.
5. If it was really good or worth telling people about – tweet about it.
6. Make a simple database in Excel – this helps you keep track, make notes about the paper (to see if you meet #3) and allows you to find the paper easily in the future (this last point turned out to be very useful).

This year has been difficult, especially sticking to #3. My stats for 2014 are:

• 73% success rate. Down from 85% in 2013
• Stats errors in 36% of papers I read!
• 86% of papers were from 2014

Following last year, I wasn’t so surprised by the journals that the papers appeared in:

1. eLife
2. J Cell Biol
3. Mol Biol Cell
4. Dev Cell
5. Nature Methods
6. J Cell Sci
7. J Neurosci
8. Nature Cell Biol
9. Traffic
10. Curr Biol
11. Nature
12. Nature Comm
13. Science

According to my database I only read one paper in Cell this year. I certainly have lots of them in “Saved for later” in Feedly (which is a black hole from which papers rarely emerge to be read). It’s possible that the reason Cell, Nature and Science are low on the list is that I might quickly glance at papers in those journals but not actually read them for #potd. Last year eLife was at number 9 and this year it is at number 1. This journal is definitely publishing a lot of exciting cell biology and also the lens format is very nice for reading.

I think I’ll try to continue this in 2015. The main thing it has made me realise is how few papers I read (I mean really read). I wonder if students and postdocs are actually the main consumers of the literature. If this is correct, do PIs rely on “subsistence reading”, i.e. when they write their own papers and check the immediate literature? Is their deep reading done only during peer reviewing other people’s work? Or do PIs rely on a constant infusion of the latest science at seminars and at meetings?

## Insane In The Brain

Back of the envelope calculations for this post.

An old press release for a paper on endocytosis by Tom Kirchhausen contained this fascinating factoid:

The equivalent of the entire brain, or a football field of membrane, is turned over every hour

If this is true it is absolutely staggering. Let’s check it out.

A synaptic vesicle is ~40 nm in diameter. So the surface area of 1 vesicle is

$$4 \pi r^2$$

which is 5026 nm2, or 5.026 x 10-15 m2.

Now, an American football field is 5350 m2 (including both endzones), this is the equivalent of 1.065 x 1018 synaptic vesicles.

It is estimated that the human cortex has 60 trillion synapses. This means that each synapse would need to internalise 17742 vesicles to retrieve the area of membrane equivalent to one football field.

The factoid says this takes one hour. This membrane load equates to each synapse turning over 296 vesicles in one minute, which is 4.93 vesicles per second.

Tonic activity of neurons differs throughout the brain and actually 5 Hz doesn’t sound too high (feel free to correct me on this). We’ve only considered cortical neurons, so the factoid seems pretty plausible!

For an actual football field, i.e. Association Football. The calculation is slightly more complicated. This is because there is no set size for football pitches. In England, the largest is apparently Manchester City (7598 m2) while the smallest actually belongs to the greatest football team in the world, Crewe Alexandra (5518 m2).

A brain would hoover up Man City’s ground in an hour if each synapse turned over 7 vesicles per second, while Gresty Road would only take 5 vesicles per second.

What is less clear from the factoid is whether a football field really equates to an “entire brain”. Bionumbers has no information on this. I think this part of the factoid may come from a different bit of data which is that clathrin-mediated endocytosis in non-neuronal cells can internalise the equivalent of the entire surface area of the cell in about an hour. I wonder whether this has been translated to neurons for the purposes of the quote. Either way, it is an amazing factoid that the brain can turnover this huge amount of membrane in such a short space of time.

So there you have it: quanta quantified on quantixed.

The post title is from “Insane In The Brain” by Cypress Hill from the album Black Sunday.

## Half Right

I was talking to a speaker visiting our department recently. While discussing his postdoc work from years ago, he told me about the identification of the sperm factor that causes calcium oscillations in the egg at fertilisation. It was an interesting tale because the group who eventually identified the factor – now widely accepted as PLCzeta – had earlier misidentified the factor, naming it oscillin.

The oscillin paper was published in Nature in 1996 and the subsequent (correct) paper was published in Development in 2002. I wondered what the citation profiles of these papers looks like now.

As you can see there was intense interest in the first paper that quickly petered out, presumably when people found out that oscillin was a contaminant and not the real factor. The second paper on the other hand has attracted a large number of citations and continues to do so 12 years later – a sign of a classic paper. However, the initial spike in citations was not as high as the Nature paper.

The impact factor of Nature is much higher than that of Development. I’ve often wondered if this is due to a sociological phenomenon: people like to cite Cell/Nature/Science papers rather than those at other journals and this bumps up the impact factor. Before you comment, yes I know there are other reasons, but the IFs do not change much over time and I wonder whether journal hierarchy explains the hardiness of IFs over time. Anyway, these papers struck me as a good test of the idea… Here we have essentially the same discovery, reported by the same authors. The only difference here is the journal (and that one paper is six years after the other). Normally it is not possible to test if the journal influences citations because a paper cannot erased and republished somewhere else. The plot suggests that Nature papers inherently attract much more cites than those in Development, presumably because of the exposure of publishing there. From the graph, it’s not difficult to see that even if a paper turns out not to be right, it can still boost the IF of the journal during the window of assessment. Another reason not to trust journal impact factors.

I can’t think of any way to look at this more systematically to see if this phenomenon holds true. I just thought it was interesting, so I’ll leave it here.

The post title is taken from Half Right by Elliott Smith from the posthumous album New Moon. Bootlegs have the title as Not Half Right, which would also be appropriate.

## My Favorite Things

I realised recently that I’ve maintained a consistent iTunes library for ~10 years. For most of that time I’ve been listening exclusively to iTunes, rather than to music in other formats. So the library is a useful source of information about my tastes in music. It should be possible to look at who are my favourite artists, what bands need more investigation, or just to generate some interesting statistics based on my favourite music.

Play count is the central statistic here as it tells me how often I’ve listened to a certain track. It’s the equivalent of a +1/upvote/fave/like or maybe even a citation. Play count increases by one if you listen to a track all the way to the end. So if a track starts and you don’t want to hear it and you skip on to the next song, there’s no +1. There’s a caveat here in that the time a track has been in the library, influences the play count to a certain extent – but that’s for another post*. The second indicator for liking a track or artist is the fact that it’s in the library. This may sound obvious, but what I mean is that artists with lots of tracks in the library are more likely to be favourite artists compared to a band with just one or two tracks in there. A caveat here is that some artists do not have long careers for a variety of reasons, which can limit the number of tracks actually available to load into the library. Check the methods at the foot of the post if you want to do the same.

What’s the most popular year? Firstly, I looked at the most popular year in the library. This question was the focus of an earlier post that found that 1971 was the best year in music. The play distribution per year can be plotted together with a summary of how many tracks and how many plays in total from each year are in the library. There’s a bias towards 90s music, which probably reflects my age, but could also be caused by my habit of collecting CD singles which peaked as a format in this decade. The average number of plays is actually pretty constant for all years (median of ~4), the mean is perhaps slightly higher for late-2000s music.

Favourite styles of music: I also looked at Genre. Which styles of music are my favourite? I plotted the total number of tracks versus the total number of plays for each Genre in the library. Size of the marker reflects the median number of plays per track for that genre. Most Genres obey a rule where total plays is a function of total tracks, but there are exceptions. Crossover, Hip-hop/Rap and Power-pop are highlighted as those with an above average number of plays. I’m not lacking in Power-pop with a few thousand tracks, but I should probably get my hands on more Crossover or Hip-Hop/Rap.

Using citation statistics to find my favourite artists: Next, I looked at who my favourite artists are. It could be argued that I should know who my favourite artists are! But tastes can change over a 10 year period and I was interested in an unbiased view of my favourite artists rather than who I think they are. A plot of Total Tracks vs Mean plays per track is reasonably informative. The artists with the highest plays per track are those with only one track in the library, e.g. Harvey Danger with Flagpole Sitta. So this statistic is pretty unreliable. Equally, I’ve got lots of tracks by Manic Street Preachers but evidently I don’t play them that often. I realised that the problem of identifying favourite artists based on these two pieces of information (plays and number of tracks) is pretty similar to assessing scientists using citation metrics (citations and number of papers). Hirsch proposed the h-index to meld these two bits of information into a single metric, the h-index. It’s easily computed and I already had an Igor procedure to calculate it en masse, so I ran it on the library information.

Before doing this, I consolidated multiple versions of the same track into one. I knew that I had several versions of the same track, especially as I have multiple versions of some albums (e.g. Pet Sounds = 3 copies = mono + stereo + a capella), the top offending track was “Baby’s Coming Back” by Jellyfish, 11 copies! Anyway, these were consolidated before running the h-index calculation.

The top artist was Elliott Smith with an h-index of 32. This means he has 32 tracks that have been listened to at least 32 times each. I was amazed that Muse had the second highest h-index (I don’t consider myself a huge fan of their music) until I remembered a period where their albums were on an iPod Nano used during exercise. Amusingly (and narcissistically) my own music – the artist names are redacted – scored quite highly with two out of three bands in the top 100, which are shown here. These artists with high h-indeces are the most consistently played in the library and probably constitute my favourite artists, but is the ranking correct?

The procedure also calculates the g-index for every artist. The g-index is similar to the h-index but takes into account very highly played tracks (very highly cited papers) over the h threshold. For example, The Smiths h=26. This could be 26 tracks that have been listened to exactly 26 times or they could have been listened to 90 times each. The h-index cannot reveal this, but the g-index gets to this by assessing average plays for the ranked tracks. The Smiths g=35. To find the artists that are most-played-of-the-consistently-most-played, I subtracted h from g and plotted the Top 50. This ranked list I think most closely represents my favourite artists, according to my listening habits over the last ten years.

Track length: Finally, I looked at the track length. I have a range of track lengths in the library, from “You Suffer” by Napalm Death (iTunes has this at 4 s, but Wikipedia says it is 1.36 s), through to epic tracks like “Blue Room” by The Orb. Most tracks are in the 3-4 min range. Plays per track indicates that this track length is optimal with most of the highly played tracks being within this window. The super-long tracks are rarely listened to, probably because of their length. Short tracks also have higher than average plays, probably because they are less likely to be skipped, due to their length.

These were the first things that sprang to mind for iTunes analysis. As I said at the top, there’s lots of information in the library to dig through, but I think this is enough for one post. And not a pie-chart in sight!

Methods: the library is in xml format and can be read/parsed this way. More easily, you can just select the whole library and copy-paste it into TextEdit and then load this into a data analysis package. In this case, IgorPro (as always). Make sure that the interesting fields are shown in the full library view (Music>Songs). To do everything in this post you need artist, track, album, genre, length, year and play count. At the time of writing, I had 21326 tracks in the library. For the “H-index” analysis, I consolidated multiple versions of the same track, giving 18684 tracks. This is possible by concatenating artist and the first ten characters of the track title (separated by a unique character) and adding the play counts for these concatenated versions. The artist could then be deconvolved (using the unique character) and used for the H-calculation. It’s not very elegant, but seemed to work well. The H-index and G-index calculations were automated (previously sort-of-described here), as was most of the plot generation. The inspiration for the colour coding is from the 2013 Feltron Report.

* there’s an interesting post here about modelling the ideal playlist. I worked through the ideas in that post but found that it doesn’t scale well to large libraries, especially if they’ve been going for a long time, i.e. mine.

The post title is taken from John Coltrane’s cover version of My Favorite Things from the album of the same name. Excuse the US English spelling.