## My Favorite Things

I realised recently that I’ve maintained a consistent iTunes library for ~10 years. For most of that time I’ve been listening exclusively to iTunes, rather than to music in other formats. So the library is a useful source of information about my tastes in music. It should be possible to work out who my favourite artists are, which bands need more investigation, or just to generate some interesting statistics based on my favourite music.

Play count is the central statistic here as it tells me how often I’ve listened to a certain track. It’s the equivalent of a +1/upvote/fave/like or maybe even a citation. Play count increases by one if you listen to a track all the way to the end. So if a track starts and you don’t want to hear it and you skip on to the next song, there’s no +1. There’s a caveat here in that the time a track has been in the library influences the play count to a certain extent – but that’s for another post*. The second indicator for liking a track or artist is the fact that it’s in the library at all. This may sound obvious, but what I mean is that artists with lots of tracks in the library are more likely to be favourite artists than a band with just one or two tracks in there. A caveat here is that some artists do not have long careers, for a variety of reasons, which can limit the number of tracks actually available to load into the library. Check the methods at the foot of the post if you want to do the same.

What’s the most popular year? Firstly, I looked at the most popular year in the library. This question was the focus of an earlier post that found that 1971 was the best year in music. The play distribution per year can be plotted together with a summary of how many tracks and how many plays in total from each year are in the library. There’s a bias towards 90s music, which probably reflects my age, but could also be caused by my habit of collecting CD singles, which peaked as a format in this decade. The average number of plays is actually pretty constant for all years (median of ~4); the mean is perhaps slightly higher for late-2000s music.

Favourite styles of music: I also looked at Genre. Which styles of music are my favourite? I plotted the total number of tracks versus the total number of plays for each Genre in the library. Size of the marker reflects the median number of plays per track for that genre. Most Genres obey a rule where total plays is a function of total tracks, but there are exceptions. Crossover, Hip-hop/Rap and Power-pop are highlighted as those with an above average number of plays. I’m not lacking in Power-pop with a few thousand tracks, but I should probably get my hands on more Crossover or Hip-Hop/Rap.

Using citation statistics to find my favourite artists: Next, I looked at who my favourite artists are. It could be argued that I should know who my favourite artists are! But tastes can change over a 10 year period and I was interested in an unbiased view of my favourite artists rather than who I think they are. A plot of Total Tracks vs Mean plays per track is reasonably informative. The artists with the highest plays per track are those with only one track in the library, e.g. Harvey Danger with Flagpole Sitta. So this statistic is pretty unreliable. Equally, I’ve got lots of tracks by Manic Street Preachers but evidently I don’t play them that often. I realised that the problem of identifying favourite artists based on these two pieces of information (plays and number of tracks) is pretty similar to assessing scientists using citation metrics (citations and number of papers). Hirsch proposed the h-index to meld these two bits of information into a single metric. It’s easily computed and I already had an Igor procedure to calculate it en masse, so I ran it on the library information.
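The calculation itself is simple: rank an artist’s tracks by play count and find the largest rank h for which the track at that rank has at least h plays. My version was an Igor procedure, but a minimal sketch of the same idea in Python looks something like this:

```python
def h_index(play_counts):
    """h-index applied to plays: the artist has h tracks with at least h plays each."""
    ranked = sorted(play_counts, reverse=True)
    return sum(1 for rank, plays in enumerate(ranked, start=1) if plays >= rank)

# example: five tracks with these play counts give an h-index of 4
print(h_index([40, 12, 7, 4, 2]))
```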

Before doing this, I consolidated multiple versions of the same track into one. I knew that I had several versions of the same track, especially as I have multiple versions of some albums (e.g. Pet Sounds = 3 copies = mono + stereo + a capella). The top offending track was “Baby’s Coming Back” by Jellyfish, with 11 copies! Anyway, these were consolidated before running the h-index calculation.

The top artist was Elliott Smith with an h-index of 32. This means he has 32 tracks that have been listened to at least 32 times each. I was amazed that Muse had the second highest h-index (I don’t consider myself a huge fan of their music) until I remembered a period where their albums were on an iPod Nano used during exercise. Amusingly (and narcissistically) my own music – the artist names are redacted – scored quite highly with two out of three bands in the top 100, which are shown here. These artists with high h-indices are the most consistently played in the library and probably constitute my favourite artists, but is the ranking correct?

The procedure also calculates the g-index for every artist. The g-index is similar to the h-index but takes into account very highly played tracks (very highly cited papers) over the h threshold. For example, The Smiths have h = 26. This could mean 26 tracks that have been listened to exactly 26 times each, or they could have been listened to 90 times each. The h-index cannot reveal this, but the g-index gets to it by assessing average plays for the ranked tracks. The Smiths have g = 35. To find the artists that are most-played-of-the-consistently-most-played, I subtracted h from g and plotted the Top 50. This ranked list, I think, most closely represents my favourite artists, according to my listening habits over the last ten years.
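For the record, the g-index is the largest rank g for which the top g tracks have at least g² plays between them (equivalently, the top g tracks average at least g plays each). A rough Python sketch of that definition, to go with the h-index one above:

```python
def g_index(play_counts):
    """g-index applied to plays: largest g such that the top g tracks
    have at least g*g plays in total."""
    ranked = sorted(play_counts, reverse=True)
    total, g = 0, 0
    for rank, plays in enumerate(ranked, start=1):
        total += plays
        if total >= rank * rank:
            g = rank
    return g
```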

Track length: Finally, I looked at the track length. I have a range of track lengths in the library, from “You Suffer” by Napalm Death (iTunes has this at 4 s, but Wikipedia says it is 1.36 s), through to epic tracks like “Blue Room” by The Orb. Most tracks are in the 3-4 min range. Plays per track indicates that this track length is optimal, with most of the highly played tracks being within this window. The super-long tracks are rarely listened to, probably because of their length. Short tracks also have higher-than-average plays, probably because they are less likely to be skipped due to their length.

These were the first things that sprang to mind for iTunes analysis. As I said at the top, there’s lots of information in the library to dig through, but I think this is enough for one post. And not a pie-chart in sight!

Methods: the library is in xml format and can be read/parsed this way. More easily, you can just select the whole library and copy-paste it into TextEdit and then load this into a data analysis package. In this case, IgorPro (as always). Make sure that the interesting fields are shown in the full library view (Music>Songs). To do everything in this post you need artist, track, album, genre, length, year and play count. At the time of writing, I had 21326 tracks in the library. For the “H-index” analysis, I consolidated multiple versions of the same track, giving 18684 tracks. This is possible by concatenating artist and the first ten characters of the track title (separated by a unique character) and adding the play counts for these concatenated versions. The artist could then be deconvolved (using the unique character) and used for the H-calculation. It’s not very elegant, but seemed to work well. The H-index and G-index calculations were automated (previously sort-of-described here), as was most of the plot generation. The inspiration for the colour coding is from the 2013 Feltron Report.
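For anyone who would rather not use Igor, here is a rough sketch of the consolidation step in Python/pandas. The column names (‘Artist’, ‘Name’, ‘Plays’) are assumptions about how the copy-pasted library loads and may need adjusting for your own export:

```python
import pandas as pd

# the library copy-pasted from iTunes into a text file and loaded as a table;
# the column names are assumptions and may differ in your export
lib = pd.read_csv("library.txt", sep="\t")

# key = artist + first ten characters of the track title, separated by a
# character that should not appear in either field
lib["key"] = lib["Artist"] + "|" + lib["Name"].str[:10]

# sum the play counts across duplicate versions, then recover the artist
consolidated = lib.groupby("key", as_index=False)["Plays"].sum()
consolidated["Artist"] = consolidated["key"].str.split("|").str[0]

# per-artist h-index on the consolidated play counts
def h_index(play_counts):
    ranked = sorted(play_counts, reverse=True)
    return sum(1 for rank, plays in enumerate(ranked, start=1) if plays >= rank)

h_per_artist = consolidated.groupby("Artist")["Plays"].apply(lambda p: h_index(list(p)))
```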

* there’s an interesting post here about modelling the ideal playlist. I worked through the ideas in that post but found that it doesn’t scale well to large libraries, especially if they’ve been going for a long time, i.e. mine.

The post title is taken from John Coltrane’s cover version of My Favorite Things from the album of the same name. Excuse the US English spelling.

## Six Plus One

Last week, ALM (article-level metric) data for PLoS journals were uploaded to Figshare with the invitation to do something cool with it.

Well, it would be rude not to. Actually, I’m one of the few scientists on the planet that hasn’t published a paper with Public Library of Science (PLoS), so I have no personal agenda here. However, I love what PLoS is doing and what it has achieved to disrupt the scientific publishing system. Anyway, what follows is not in any way comprehensive, but I was interested to look at a few specific things:

1. Is there a relationship between Twitter mentions and views of papers?
2. What is the fraction of views that are PDF vs HTML?
3. Can citations be predicted by more immediate article level metrics?

The tl;dr version is 1. Yes. 2. ~20%. 3. Can’t say but looks unlikely.

1. Twitter mentions versus paper views

2. Fraction of PDF vs HTML views

I asked a few people what they thought the download ratio is for papers. Most thought 60-75% PDF versus 25-40% HTML. I thought it would be lower, but I was still surprised to see that it is, at most, 20% for PDF. The plot below shows the fraction of PDF downloads (counter_pdf/(counter_pdf+counter_html)) for all PLoS journals, and then broken down for PLoS Biol and PLoS ONE.
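For anyone playing along with the Figshare csv, the calculation is essentially one line. The counter_pdf and counter_html column names are as given above; the journal column name is my assumption:

```python
import pandas as pd

alm = pd.read_csv("alm.csv")  # the PLoS ALM dataset downloaded from Figshare
alm["pdf_fraction"] = alm["counter_pdf"] / (alm["counter_pdf"] + alm["counter_html"])

print(alm["pdf_fraction"].median())                      # all PLoS journals together
print(alm.groupby("journal")["pdf_fraction"].median())   # per journal ('journal' column name assumed)
```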

This was a surprise to me. I have colleagues who don’t like depositing post-print or pre-print papers because they say that they prefer their work to be seen typeset in PDF format. However, this shows that, at least for PLoS journals, the reader is choosing not to see a typeset PDF at all, but an HTML version.

Maybe the PLoS PDFs are terribly formatted and 80% of people don’t like them. There is an interesting comparison that can be done here, because all papers are deposited at Pubmed Central (PMC) and so the same plot can be generated for the PDF fraction there. The PDF format is different from PLoS’s and so we can test the idea that people prefer HTML over PDF at PLoS because they don’t like the PLoS format.

The fraction of PDF downloads is higher, but only around 30%. So either the PMC format is just as bad, or this is the way that readers like to consume the scientific literature. A colleague mentioned that HTML views are preferable to PDF if you actually want to do something with the data, e.g. for meta-analysis. This could have an effect. HTML views could be skim reading, whereas PDF is for people who want to read in detail… I wonder whether these fractions are similar at other publishers, particularly closed access publishers?

3. Citation prediction?

ALMs are immediate whereas citations are slow. If we assume for a moment that citations are a definitive means to determine the impact of a paper (which they may not be), then can ALMs predict citations? This would make them very useful in the evaluation of scientists and their endeavours. Unfortunately, this dataset is not sufficient to answer this properly, but with multiple timepoints, the question could be investigated. I looked at the number of paper downloads and also the Mendeley score to see how these two things may foretell citations. What follows is a strategy to do this in an unbiased way with few confounders.

The dataset has a Scopus column, but for some reason these data are incomplete. It is possible to download data (but not on this scale AFAIK) for citations from Web of Science and then use the DOI to cross-reference to the other dataset. This plot shows the Scopus data as a function of “Total Citations” from Web of Science, for 500 papers. I went with the Web of Science data as this appears more robust.

The question is whether there is a relationship between downloads of a paper (Counter, either PDF or HTML) and citations. Or between Mendeley score and citations. I figured that downloading, saving to Mendeley and citing show three progressively deeper levels of “commitment” to a paper, so the first two may correlate differently with citations. Now, to look at this for all PLoS journals for all time would be silly because we know that citations are field-specific, journal-specific, time-sensitive etc. So I took the following dataset from Web of Science: the top 500 most-cited papers in PLoS ONE for the period of 2007-2010 limited to “cell biology”. By cross-referencing I could check the corresponding values for Counter and for Mendeley.
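The cross-referencing itself is just a merge on DOI. A rough sketch in pandas, where the column names (‘DI’ and ‘TC’ for the Web of Science export; ‘doi’, ‘counter’ and ‘mendeley’ for the ALM file) are my assumptions about the two exports:

```python
import pandas as pd

wos = pd.read_csv("wos.csv")   # top 500 most-cited PLoS ONE cell biology papers, 2007-2010
alm = pd.read_csv("alm.csv")   # the Figshare ALM dataset

# match the two datasets on DOI (case can differ between exports)
wos["doi"] = wos["DI"].str.lower()
alm["doi"] = alm["doi"].str.lower()
merged = wos.merge(alm, on="doi")

# how do downloads and Mendeley saves relate to citations?
print(merged[["TC", "counter", "mendeley"]].corr())
```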

I was surprised that the correlation was very weak in both cases. I thought that the correlation would be stronger with Mendeley; however, signal-to-noise is a problem here with few users of the service compared with counting downloads. Below each plot is a ranked view of the papers, with the Counter or Mendeley data presented as a rolling average. It’s a very weak correlation at best. Remember that this is post-hoc. Papers that have been cited more would be expected to generate more views and higher Mendeley scores, but this is not necessarily so. Predicting future citations based on Counter or Mendeley will be tough. To really know if this is possible, this approach needs to be used with multiple ALM timepoints to see if there is a predictive value for ALMs, but based on this single timepoint, it doesn’t seem as though prediction will be possible.

Methods: To crunch the numbers for yourself, head over to Figshare and download the csv. A Web of Science subscription is needed for the citation data. All the plots were generated in IgorPro, but no programming is required for these comparisons and everything I’ve done here can be easily done in Excel or another package.

Edit: Matt Hodgkinson (@mattjhodgkinson) Snr Ed at PLoS ONE told me via Twitter that all ALM data (periodically updated) are freely available here. This means that some of the analyses I wrote about are possible.

The post title comes from Six Plus One a track on Dad Man Cat by Corduroy. Plus is as close to PLoS as I could find in my iTunes library.

## You Know My Name (Look Up The Number)

This thought crossed my mind yesterday when I saw a tweet that was tagged #academicinsults

It occurred to me that a Twitter account is a kind of micro-publishing platform. So what would “publication metrics” look like for Twitter? Twitter makes analytics available, so they can easily be crunched. The main metrics are impressions and engagements per tweet. As I understand it, impressions are the number of times your tweet is served up to people in their feed (boosted by retweets). Engagements are when somebody clicks on the tweet (either a link or to see the thread or whatever). In publication terms, impressions would equate to people downloading your paper and engagements mean that they did something with it, like cite it. This means that a “h-index” for engagements can be calculated with these data.

For those that don’t know, the h-index for a scientist means that he/she has h papers that have been cited h or more times. The Twitter version would be a tweeter that has h tweets that were engaged with h or more times. My data is shown here:

My twitter h-index is currently 36. I have 36 tweets that have been engaged with 36 or more times.

So, this is a lot higher than my actual h-index, but obviously there are differences. Papers accrue citations as time goes by, but the information flow on Twitter is so fast that tweets don’t accumulate engagement over time. In that sense, the Twitter h-index is less sensitive to the time a user has been active on Twitter, versus the real h-index which is strongly affected by age of the scientist. Other differences include the fact that I have “published” thousands of tweets and only tens of papers. Also, whether or not more people read my tweets compared to my papers… This is not something I want to think too much about, but it would affect how many engagements it is possible to achieve.

The other thing I looked at was whether replying to somebody actually means more engagement. This would skew the Twitter h-index. I filtered tweets that started with an @ and found that this restricts who sees the tweet, but doesn’t necessarily mean more engagement. Replies make up a very small fraction of the h tweets.

I’ll leave it to somebody else to calculate the Impact Factor of Twitter. I suspect it is very low, given the sheer volume of tweets.

Please note this post is just for fun. Normal service will (probably) resume in the next post.

Edit: As pointed out in the comments this post is short on “Materials and Methods”. If you want to calculate your own Twitter h-index, go here. When logged in to Twitter, the analytics page should present your data (it may take some time to populate this page after you first view it). A csv can be downloaded from the button on the top-right of the page. I imported this into IgorPro (as always) to generate the plots. The engagements data need to be sorted in descending order and then the h-index can be found by comparing the numbers with their ranked position.
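If you don’t have Igor to hand, the same comparison is a couple of lines in Python. The ‘engagements’ column name matched the analytics export at the time of writing, but check your own csv:

```python
import pandas as pd

tweets = pd.read_csv("tweets.csv")  # the csv downloaded from Twitter analytics

# rank tweets by engagements and find the largest rank h where engagements >= rank
ranked = tweets["engagements"].sort_values(ascending=False).reset_index(drop=True)
h = sum(1 for rank, eng in enumerate(ranked, start=1) if eng >= rank)
print(h)
```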

The post title is from the quirky B-side to the Let It Be single by The Beatles.

## Round and Round

I thought I’d share a procedure for rotating a 2D set of coordinates about the origin. Why would you want to do this? Well, we’ve been looking at cell migration in 2D – tracking nuclear position over time. Cells migrate at random and I previously blogged about ways to visualise these tracks more clearly. Part of this earlier procedure was to set the start of each track at (0,0). This gives a random hairball of tracks moving away from the origin. Wouldn’t it be a good idea to orient all the tracks so that the endpoint lies on the same axis? This would simplify the view and allow one to assess how ‘directional’ the cell tracks are. To rotate a set of coordinates, you need to use a rotation matrix. This allows you to convert the x,y coordinates to their new position x’,y’. This rotation is counter-clockwise.

$$x' = x \cos \theta - y \sin \theta$$

$$y' = x \sin \theta + y \cos \theta$$

However, we need to find theta first. To do this we need to find the angle between two lines, using this formula.

$$\cos \theta = \frac {\mathbf a \cdot \mathbf b}{\left \Vert {\mathbf a} \right \Vert \cdot \left \Vert {\mathbf b} \right \Vert}$$

The maths is kept to a minimum here. If you are interested, look at the code at the bottom.

The two lines (a and b) are formed by the x-axis (origin to some point on the x-axis, i.e. y=0) and by a line running from the origin to the last coordinate in the series. This calculation can be done for each track, with the theta for each track being used to rotate that whole track (x,y changed to x’,y’ for each point).
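The actual procedure is in Igor (linked below), but the idea fits in a few lines of NumPy. This is just a sketch of the approach described here, including the acos sign correction mentioned further down:

```python
import numpy as np

def rotate_track(xy):
    """Rotate a track (N x 2 array starting at the origin) so that its
    final point lands on the positive x-axis (y = 0)."""
    end = xy[-1]  # assumes the track actually moved away from the origin
    # angle between the x-axis and the line from the origin to the last point
    theta = np.arccos(end[0] / np.linalg.norm(end))
    # arccos only returns 0..pi, so use the sign of y to rotate the right way
    if end[1] > 0:
        theta = -theta
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    return xy @ rot.T
```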

Here is an example of just a few tracks from an experiment. Typically we have hundreds of tracks for each experimental group and the code will blast through them all very quickly (<1 s).

After rotation, the tracks are now aligned so that the last point is on the x-axis at y=0. This allows us to see how ‘directional’ the tracks are. With the end points aligned, it is much easier to see how convoluted each track’s path was in getting there.

The code to do this is up on Igor Exchange code snippets. A picture of the code is below (markup for code in WordPress is not very clear). See the code snippet if you want to use it.

The weakness of this method is that acos (arccos) only gives results from 0 to Pi (0 to 180°). There is a correction in the procedure, but everything needs editing if you want to rotate the co-ordinates to some other plane. Feedback welcome.

Edit: Jim Prouty and A.G. have suggested two modifications to the code. The first is to use complex waves rather than 2D real waves, and then use the native Igor functions r2polar or p2rect. The second suggestion is to use matrix operations! As is often the case with Igor there are several ways of doing things. The method described here is long-winded compared to a MatrixOp and if the waves were huge these solutions would be much, much faster. As it is, our migration movies typically have 60 points and as mentioned rotator() blasts through them very quickly. More complex coordinate sets would need something more sophisticated.

The post title is taken from “Round & Round” by New Order from their Technique LP.

## Sure To Fall

What does the life cycle of a scientific paper look like?

It stands to reason that after a paper is published, people download and read the paper and then if it generates sufficient interest, it will begin to be cited. At some point these citations will peak and the interest will die away as the work gets superseded or the field moves on. So each paper has a useful lifespan. When does the average paper start to accumulate citations, when do they peak and when do they die away?

Citation behaviours are known to be very field-specific. So to narrow things down, I focussed on cell biology and in one area “clathrin-mediated endocytosis” in particular. It’s an area that I’ve published in – of course this stuff is driven by self-interest. I downloaded data for 1000 papers from Web of Science that had accumulated the most citations. Reviews were excluded, as I assume their citation patterns are different from primary literature. The idea was just to take a large sample of papers on a topic. The data are pretty good, but there are some errors (see below).

Number-crunching (feel free to skip this bit): I imported the data into IgorPro making a 1D wave for each record (paper). I deleted the last point corresponding to cites in 2014 (the year is not complete). I aligned all records so that year of publication was 0. Next, the citations were normalised to the maximum number achieved in the peak year. This allows us to look at the lifecycle in a sensible way. Next I took out records for papers less than 6 years old as I reasoned these would not have completed their lifecycle and could contaminate the analysis (it turned out to make little difference). The lifecycles were plotted and averaged. I also wrote a quick function to pull out the peak year for citations post hoc.
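The original number-crunching was in IgorPro; for anyone who wants the gist of it, a rough Python equivalent of the alignment and normalisation steps would look something like this (the input format is an assumption: one array of yearly citation counts per paper, starting at the year of publication):

```python
import numpy as np

def lifecycle(citations_by_year, min_age=6):
    """Normalise each paper's yearly citations to its peak year and average
    across papers. Element 0 of each array is the year of publication."""
    curves = []
    for cites in citations_by_year:
        cites = np.asarray(cites, dtype=float)
        if len(cites) < min_age or cites.max() == 0:
            continue  # skip papers too young to have completed their lifecycle
        curves.append(cites / cites.max())  # normalise to the peak year
    # pad with NaN to a common length so the curves can be averaged
    n = max(len(c) for c in curves)
    padded = np.full((len(curves), n), np.nan)
    for i, c in enumerate(curves):
        padded[i, :len(c)] = c
    peak_years = [int(np.argmax(c)) for c in curves]  # peak year per paper, post hoc
    return np.nanmean(padded, axis=0), peak_years
```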

So what did it show?

Citations to a paper go up and go down, as expected (top left). When cumulative citations are plotted most of the articles have an initial burst and then level off. The exceptions are ~8 articles that continue to rise linearly (top right). On average a paper generates its peak citations three years after publication (box plot). The fall after this peak period is pretty linear and it’s apparently all over somewhere >15 years after publication (bottom left). To look at the decline in more detail I aligned the papers so that year 0 was the year of peak citations. The average now loses almost 40% of those peak citations in the following year and then declines steadily (bottom right).

Edit: The dreaded Impact Factor calculation takes the citations to articles published in the preceding 2 years and divides by the number of citable items in that period. This means that each paper only contributes to the Impact Factor in years 1 and 2. This is before the average paper reaches its peak citation period. Thanks to David Stephens (@david_s_bristol) for pointing this out. The alternative 5 year Impact Factor gets around this limitation.
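For reference, the standard two-year calculation, using 2014 as the example year, is:

$$\mathrm{IF}_{2014} = \frac{\text{citations received in 2014 by items published in 2012 and 2013}}{\text{citable items published in 2012 and 2013}}$$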

Perhaps lifecycle is the wrong term: papers in this dataset don’t actually ‘die’, i.e. go to 0 citations. There is always a chance that a paper will pick up the odd citation. Papers published 15 years ago are still clocking 20% of their peak citations. Looking at papers cited at lower rates would be informative here.

Two other weaknesses that affect precision are that 1) a year is a long time and 2) publication is subject to long lag times. The analysis would be improved by categorising the records based on the month-year when the paper was published and the month-year when each citation comes in. Papers published in January in one year probably have a different peak than those published in December of the same year, but this is lost when looking at year alone. Secondly, due to publication lag, it is impossible to know when the peak period of influence for a paper truly is.

Problems in the dataset: some reviews remained despite supposedly being excluded, i.e. they are not properly tagged in the database. Also, some records have citations from years before the article was published! The numbers of citations involved are small enough not to worry about for this analysis, but it makes you wonder how accurate the whole dataset is. I’ve written before about how complete citation data may or may not be. These sorts of things are a concern for all of us who are judged by these things for hiring and promotion decisions.

The post title is taken from ‘Sure To Fall’ by The Beatles, recorded during The Decca Sessions.

## All This And More

I was looking at the latest issue of Cell and marvelling at how many authors there are on each paper. It’s no secret that the raison d’être of Cell is to publish the “last word” on a topic (although whether it fulfils that objective is debatable). Definitive work needs to be comprehensive. So it follows that this means lots of techniques and ergo lots of authors. This means it is even more impressive when a dual author paper turns up in the table of contents for Cell. Anyway, I got to thinking: has it always been the case that Cell papers have lots of authors and if not, when did that change?

I downloaded the data for all articles published by Cell (and for comparison, J Cell Biol) from Scopus. The records required a bit of cleaning. For example, SnapShot papers needed to be removed and also the odd obituary etc. had been misclassified as an article. These could be quickly removed. I then went back through and filtered out ‘articles’ that were less than three pages as I think it is not possible for a paper to be two pages or fewer in length. The data could be loaded into IgorPro and boxplots generated per year to show how author number varied over time. Reviews that are misclassified as Articles will still be in the dataset, but I figured these would be minimal.
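The original analysis was in IgorPro; here is a quick sketch of the same filtering and per-year summary in pandas, where the column names (‘Authors’, ‘Year’, ‘Page count’) are assumptions about the Scopus export:

```python
import pandas as pd

records = pd.read_csv("cell_scopus.csv")  # Scopus export for one journal

# count authors (assuming Scopus lists them in one comma-separated field) and
# drop anything shorter than three pages (SnapShots, obituaries, etc.)
records["n_authors"] = records["Authors"].str.split(",").str.len()
articles = records[records["Page count"] >= 3]

# per-year distribution of author numbers, which feeds the boxplots
print(articles.groupby("Year")["n_authors"].describe())
```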

First off: yes, there are more authors on average for a Cell paper versus a J Cell Biol paper. What is interesting is that both journals had similar numbers of authors when Cell was born (1974) and they crept up together until the early 2000s, when the number of Cell authors kept increasing, or J Cell Biol flattened off, whichever way you look at it.

I think the overall trend to more authors is because understanding biology has increasingly required multiple approaches and the bar for evidence seems to be getting higher over time. The initial creep to more authors (1974-2000) might be due to a cultural change where people (technicians/students/women) began to get proper credit for their contributions. However, this doesn’t explain the divergence between J Cell Biol and Cell in recent years. One possibility is Cell takes more non-cell biology papers and that these papers necessarily have more authors. For example, the polar bear genome was published in Cell (29 authors), and this sort of paper would not appear in J Cell Biol. Another possibility is that J Cell Biol has a shorter and stricter revision procedure, which means that multiple rounds of revision, collecting new techniques and new authors is more limited than it is at Cell. Any other ideas?

I also quickly checked whether more authors means more citations, but found no evidence for such a relationship. For papers published in the years 2000-2004, the median citation number for papers with 1-10 authors was pretty constant for J Cell Biol. For Cell, these data were more noisy. Three-author papers tended to be cited a bit more than those with two authors, but then four-author papers were cited less again.

The number of authors on papers from our lab ranges from 2 to 9 with a median of 3.5. This would put an average paper from our lab in the bottom quartile for JCB and in the lower 10% for Cell in 2013. Ironically, our 9 author paper (an outlier) was published in J Cell Biol. Maybe we need to get more authors on our papers before we can start troubling Cell with our manuscripts…

The Post title is taken from ‘All This and More’ by The Wedding Present from their LP George Best.

## Blast Off!

This post is about metrics and specifically the H-index. It will probably be the first of several on this topic.

I was re-reading a blog post by Alex Bateman on his affection for the H-index as a tool for evaluating up-and-coming scientists. He describes Jorge Hirsch’s H-index, its limitations and its utility quite nicely, so I won’t reiterate this (although I’ll probably do so in another post). What is under-appreciated is that Hirsch also introduced the m quotient, which is the H-index divided by years since the first publication. It’s the m quotient that I’ll concentrate on here. The TL;DR is: I think that the H-index does have some uses, but evaluating early career scientists is not one of them.

Anyone of an anti-metrics disposition should look away now.

Alex proposes that scientists can be judged (and hired) using m as follows:

• <1.0 = average scientist
• 1.0-2.0 = above average
• 2.0-3.0 = excellent
• >3.0 = stellar

He says “So post-docs with an m-value of greater than three are future science superstars and highly likely to have a stratospheric rise. If you can find one, hire them immediately!”.

From what I have seen, the H-index (and therefore m) is too noisy for early stage career scientists to be of any use for evaluation. Let’s leave that aside for the moment. What he is saying is you should definitely hire a post-doc who has published ≥3 papers with ≥3 citations each in their first year, ≥6 with ≥6 citations each in their second year, ≥9 papers with ≥9 in their third year…

Do these people even exist? For a candidate with a 3 year PhD and a 3 year postdoc, 6 years would mean ≥18 papers with ≥18 citations each! In my field (molecular cell biology), it is unusual for somebody to publish that many papers, let alone accrue citations at that rate*.

This got me thinking: using Alex’s criteria, how many stellar scientists would we miss out on, and would we be more likely to hire the next Jan Hendrik Schön? To check this out I needed to write a quick program to calculate H-index by year (I’ll describe this in a future post). Off the top of my head I thought of a few scientists that I know of, who are successful by many other measures, and plotted their H-index by year. The dotted line shows a constant m of 1, “average” by Alex’s criteria. I’ve taken a guess at when they became a PI. I have anonymised the scholars; the information is public and anyone can calculate this, but it’s not fair to identify people without asking (hopefully they can’t recognise themselves – if they read this!).
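The quick program will get its own post, but the idea is straightforward: for each year, work out each paper’s cumulative citations up to that year, take the h-index of those totals, and divide by years since the first paper to get m. A minimal sketch, with the input format an assumption (one dict per paper mapping year to citations gained in that year):

```python
def h_index(citations):
    ranked = sorted(citations, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

def h_by_year(papers, first_pub_year):
    """papers: list of dicts, one per paper, mapping year -> citations gained
    in that year. Returns {year: (h-index, m quotient)} for each year."""
    years = sorted({y for paper in papers for y in paper})
    out = {}
    for y in years:
        cumulative = [sum(c for yy, c in paper.items() if yy <= y) for paper in papers]
        h = h_index(cumulative)
        out[y] = (h, h / max(y - first_pub_year, 1))
    return out
```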

This is a small sample taken from people in my field. You can see that it is rare for scientists to have a big m at an early stage in their careers. With the exception of Scholar C, who was just awesome from the get-go, panels appointing any of these scholars would have had trouble divining the future success of these people on the basis of H-index and m alone. Scholar D and Scholar E really saw their careers take-off by making big discoveries, and these happened at different stages of their careers. Both of these scholars were “below average” when they were appointed as PI. The panel would certainly not have used metrics in their evaluation (the databases were not in wide use back then), probably just letters of recommendation and reading the work. Clearly, they could identify the potential in these scientists… or maybe they just got lucky. Who knows?!

There may be other fields where publication at higher rates can lead to a large m but I would still question the contribution of the scientist to the papers that led to the H-index. Are they first or last author? One problem with the H-index is that the 20th scientist in a list of 40 authors gets the same credit as the first author. Filtering what counts in the list of articles seems sensible, but this would make the values even more noisy for early stage scientists.

*In the comments section, somebody points out that if you publish a paper very early then this affects your m value. This is something I sympathise with. My first paper was in 1999 when I was an undergrad. This dents my m value as it was a full three years until my next paper.

The post title is taken from ‘Blast Off!’ by Rivers Cuomo from ‘Songs from the Black Hole’ the unreleased follow-up to Pinkerton.

## Very Best Years

What was the best year in music?

OK, I have to be upfront and say that I thought the answer to this would be 1991. Why? Just a hunch. Nevermind, Loveless, Spiderland, Laughing Stock… it was a pretty good year. I thought it would be fun to find out if there really was a golden year in music. It turns out that it wasn’t 1991.

There are many ways to look at this question, but I figured that a good place to start was to find what year had the highest density of great LPs. But how do we define a great LP? Music critics are notorious for getting it wrong and so I’m a big fan of rateyourmusic.com (RYM) which democratises the grading process for music by crowdsourcing opinion. It allows people to rate LPs in their collection and these ratings are aggregated via a slightly opaque system and the albums are ranked into charts. I scraped the data for the Top 1000 LPs of All-Time*. Crunching the numbers was straightforward. So what did it show?

Looking at the Top 1000, 1971 and 1972 are the two years with the highest representation. Looking at the Top 500 LPs, 1971 is the year with the most records. Looking at the Top 100, the late 60s feature heavily.

To look at this in detail, I plotted the rank versus year. This showed that there was a gap in the early 80s where not many Top 1000 LPs were released. This could be seen in the other plots, but it’s clearer on the bubble plot. Also the cluster of high ranking LPs released in the 1960s is obvious.

The plot is colour-coded to show the rank, while the size of the bubbles indicates the rating. Note that rating doesn’t correlate with rank (RYM also factors in the number of ratings and user loyalty to determine this). To take the ranking into account, I calculated the “integrated score” for all albums released in a given year. The score is 1001 minus the rank, and the summation of all of these scores for albums released in a given year gives the integrated score.
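In other words, the number 1 LP scores 1000, the number 1000 LP scores 1, and each year’s scores are added up. As a quick pandas sketch (the ‘rank’ and ‘year’ column names are assumptions about the scraped data):

```python
import pandas as pd

charts = pd.read_csv("rym_top1000.csv")          # the scraped Top 1000 chart
charts["score"] = 1001 - charts["rank"]          # rank 1 scores 1000, rank 1000 scores 1
integrated = charts.groupby("year")["score"].sum()
print(integrated.idxmax())                       # the peak year (1971 in this dataset)
```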

This is shown on a background of scores for each decade. Again, the 1970s rule and 1971 is the peak. The shape of this profile will not surprise music fans. The first bump in the late 50s coincides with rock n roll, influential jazz records and the birth of the LP as a serious format. The 60s see a rapid increase in the density of great albums per year, hitting a peak in 1971. The decline that follows is halted by a spike in 1977: punk. There’s a relative dearth of highly rated LPs in the early 80s and things really tail off in the early 2000s. The lack of highly rated LPs in these later years is probably best explained by few ratings, due to the young age of these LPs. Also, diversification of music styles, tastes and the way that music is consumed is likely to play a role. The highest ranked LP on the list is Radiohead’s OK Computer (1997), which was released in a non-peak year. Note that 1991 does not stand out particularly. In fact, in the 1990s, 1994 stands out as the best year for music.

Finally, RYM has a nice classification system for music so I calculated the integrated score for these genres and sub-genres (cowpunk, anyone?). Rock (my definition) is by far the highest scoring and Singer-Songwriter is the highest scoring genre/sub-genre.

So there you have it. 1971 was the best year in music according to this analysis. Now… where’s my copy of Tago Mago.

* I did this mid-April. I doubt it’s changed much. This was an exercise to learn how to scrape and I also don’t think I broke the terms of service of RYM. If I did, I’ll take this post down.

The title of this post comes from ‘Very Best Years’ by The Grays from their LP ‘Ro Sham Bo’. It was released in 1994…

## Give, Give, Give Me More, More, More

A recent opinion piece published in eLife bemoaned the way that citations are used to judge academics because we are not even certain of the veracity of this information. The main complaint was that Google Scholar – a service that aggregates citations to articles using a computer program – may be less-than-reliable.

There are three main sources of citation statistics: Scopus, Web of Knowledge/Science and Google Scholar, although other sources are out there. These are commonly used and I checked out how comparable these databases are for articles from our lab.

The ratio of citations is approximately 1:1:1.2 for Scopus:WoK:GS. So Google Scholar is a bit like a footballer, it gives 120%.

I first did this comparison in 2012 and again in 2013. The ratio has remained constant, although these are the same articles and it is a very limited dataset. In the eLife opinion piece, Eve Marder noted an extra ~30% citations for GS (although I calculated it as ~40%, 894/636=1.41). Colleagues I have talked to have also noticed this. It’s clear that there is some inflation with GS, although the degree of inflation may vary by field. So where do these extra citations come from?

1. Future citations: GS is faster than Scopus and WoK. Articles appear there a few days after they are published, whereas it takes several weeks or months for the same articles to appear in Scopus and WoK.
2. Other papers: some journals are not in Scopus and WoK. These might be new journals that aren’t yet indexed by the others, but GS doesn’t discriminate and includes all papers it finds. One of our own papers (an invited review at a nascent OA journal) is not covered by Scopus and WoK*. GS also picks up preprints, whereas the others do not.
3. Other stuff: GS picks up patents and PhD theses. While these are not traditional papers, published in traditional journals, they are clearly useful and should be aggregated.
4. Garbage: GS does pick up some stuff that is not a real publication. One example is a product insert for an antibody, which has a reference section. Another is duplicate publications. It is quite good at spotting these and folding them into a single publication, but some slip through.

OK, Number 4 is worrying, but the other citations that GS detects versus Scopus and WoK are surely a good thing. I agree with the sentiment expressed in the eLife paper that we should be careful about what these numbers mean, but I don’t think we should just disregard citation statistics as suggested.

In summary, I am a fan of Google Scholar. My page is here.

* = I looked into this a bit more and the paper is actually in WoK, it has no Title and it has 7 citations (versus 12 in GS). Although it doesn’t come up in a search for Fiona or for me.

However, I know from GS that this paper was also cited in a paper by the Cancer Genome Atlas Network in Nature. WoK listed this paper as having 0 references and 0 citations(!). Does any of this matter? Well, yes. WoK is a Thomson Reuters product and is used as the basis for their dreaded Impact Factor – which (like it or not) is still widely used for decision making. Also many Universities use WoK information in their hiring and promotions processes.

The post title comes from ‘Give, Give, Give Me More, More, More’ by The Wonder Stuff from the LP ‘Eight Legged Groove Machine’. Finding a post title was difficult this time. I passed on: Pigs (Three Different Ones) and Juxtapozed with U. My iTunes library is lacking songs about citations…

## Some Things Last A Long Time

How long does it take to publish a paper?

The answer is – in our experience, at least – about 9 months.

That’s right, it takes about the same amount of time to have a baby as it does to publish a scientific paper. Discussing how we can make the publication process quicker is for another day. Right now, let’s get into the numbers.

The graphic shows the time taken from submission-to-publication for papers on which I am an author. I’m missing data for two papers (one from 1999 and one from 2002) and the Biol Open paper is published online but not yet “in print”, but mostly the information is complete. If you want to calculate this for your own papers, my advice would be to keep a spreadsheet of submission and decision dates as you go along… and archive your emails.
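If you do keep such a spreadsheet, turning it into numbers is trivial. A minimal sketch in pandas, where the file and column names are purely illustrative (not the actual spreadsheet behind the figure):

```python
import pandas as pd

papers = pd.read_csv("papers.csv", parse_dates=["first_submitted", "online", "in_print"])

# days from first submission to appearing online, and then from online to print
papers["submission_to_online"] = (papers["online"] - papers["first_submitted"]).dt.days
papers["online_to_print"] = (papers["in_print"] - papers["online"]).dt.days

print(papers["submission_to_online"].median())  # ~250 days for the papers described here
print(papers["online_to_print"].median())       # ~15 days
```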

In the last analysis, a few people pointed out ways that the graphic could be improved, and I’ve now implemented these changes.

The graphic shows that the journey to publication is in four eras:

1. Pre-time (before 0 on the x-axis): this is the time from first submission to the first journal. A dark time which involves rejection.
2. Submission at the final journal (starting at time 0). Again, the orange periods are when the manuscript is with the journal and the green, when it is with us. Needless to say, this green time is mainly spent doing experimental work (compare green periods for reviews and for papers).
3. Acceptance! This is where the orange bar stops. The manuscript is then readied for publication (blank area).
4. Published online. A purple period that ends with final publication in print.

Note that: i) the delays are more-or-less negated by preprinting, provided deposition is before the first submission (grey line, for the Biol Open paper), ii) these delay diagrams do not take into account the original drafting/rewriting cycle before the first submission – nor the time taken to do the work!

So… how long does it take to publish a paper?

In the top right graph: the time from first submission to being published online is 250 days on average (median). This is shown by the blue bar. If we throw in the average time it takes to go from online to print (15 days) this gives 265 days. The average time for human gestation is 266 days. So it takes about the same amount of time to have a baby as it does to publish a paper! By contrast, reviews take only 121 days, equivalent to four lunar cycles (118 days).

My 2005 paper at Nature holds the record for the most protracted publication: 399 days from submission to publication. The fastest publication is the most recent; our Biol Open paper was online 49 days after submission (it was also online 1 day before submission, as a preprint).

In the bottom right graph: I added together the total time each paper was either with the journal, or with us, and plotted the average. The time from acceptance-to-publication online is shown stacked onto the “time with journal” column. You can see from this graphic that the lion’s share of the delay comes from revisions that we must do in order for a paper to be published. Multiple revisions and submissions also push these numbers up compared to the totals for reviews.

How representative are these numbers?

This is a small dataset at many different journals and so it is difficult to conclude much. With this analysis, I was hoping to identify ‘slow journals’ that we should avoid and also to think about our publication strategy (as much as a crap shoot can have a strategy). The whole process is stochastic and I don’t see any reason to change the way that we navigate the system. Having said this, I can’t see us doing any more methods/book chapters, as they are just so slow.

Just over half of our papers have some “pre-time”, i.e. they got rejected from at least one other journal before finding a home. A colleague of mine likes to say:

“if your paper is accepted at the first journal you send it to, you sent it to the wrong place”

One thing for sure is that publication takes a long time. And I don’t think our experience is uncommon. The pace of scientific publishing has been described as glacial by Leslie Vosshall and I don’t disagree with this. I think the 9 months figure is probably representative for most areas of biology. I know that other scientists in my field, who have more tenacity for rejections and for slugging it out at high impact journals, have much longer times from 1st submission to acceptance. In my opinion, wasting even more time chasing publication is crazy, counter-productive and demotivating for the people in the lab.

The irony in all this is that, even though we are working at the absolute bleeding edge of science with all of this technology at our disposal, our methods for reporting science are badly out of date. And with that I’ll push the “publish” button and this will be online…

The title of this post comes from ‘Some Things Last A Long Time’ by Daniel Johnston from his LP ‘1990’.