Half Right

I was talking to a speaker visiting our department recently. While discussing his postdoc work from years ago, he told me about the identification of the sperm factor that causes calcium oscillations in the egg at fertilisation. It was an interesting tale because the group who eventually identified the factor – now widely accepted as PLCzeta – had earlier misidentified the factor, naming it oscillin.

The oscillin paper was published in Nature in 1996 and the subsequent (correct) paper was published in Development in 2002. I wondered what the citation profiles of these papers look like now.

[Figure: citation profiles of the 1996 Nature (oscillin) paper and the 2002 Development (PLCzeta) paper]

As you can see, there was intense interest in the first paper that quickly petered out, presumably when people found out that oscillin was a contaminant and not the real factor. The second paper, on the other hand, has attracted a large number of citations and continues to do so 12 years later – a sign of a classic paper. However, the initial spike in citations was not as high as for the Nature paper.

The impact factor of Nature is much higher than that of Development. I’ve often wondered if this is due to a sociological phenomenon: people like to cite Cell/Nature/Science papers rather than those in other journals, and this bumps up the impact factor. Before you comment, yes I know there are other reasons, but IFs do not change much over time and I wonder whether journal hierarchy explains their hardiness. Anyway, these papers struck me as a good test of the idea… Here we have essentially the same discovery, reported by the same authors. The only difference is the journal (and that one paper came six years after the other). Normally it is not possible to test whether the journal influences citations because a paper cannot be erased and republished somewhere else. The plot suggests that Nature papers inherently attract many more cites than those in Development, presumably because of the exposure of publishing there. From the graph, it’s not difficult to see that even if a paper turns out not to be right, it can still boost the IF of the journal during the window of assessment. Another reason not to trust journal impact factors.

I can’t think of any way to look at this more systematically to see if this phenomenon holds true. I just thought it was interesting, so I’ll leave it here.


The post title is taken from Half Right by Elliott Smith from the posthumous album New Moon. Bootlegs have the title as Not Half Right, which would also be appropriate.

My Favorite Things

I realised recently that I’ve maintained a consistent iTunes library for ~10 years. For most of that time I’ve been listening exclusively to iTunes, rather than to music in other formats. So the library is a useful source of information about my tastes in music. It should be possible to look at who my favourite artists are, which bands need more investigation, or just to generate some interesting statistics based on my favourite music.

Play count is the central statistic here as it tells me how often I’ve listened to a certain track. It’s the equivalent of a +1/upvote/fave/like or maybe even a citation. Play count increases by one if you listen to a track all the way to the end. So if a track starts and you don’t want to hear it and you skip on to the next song, there’s no +1. There’s a caveat here in that the time a track has been in the library influences the play count to a certain extent – but that’s for another post*. The second indicator for liking a track or artist is the fact that it’s in the library. This may sound obvious, but what I mean is that artists with lots of tracks in the library are more likely to be favourite artists compared to a band with just one or two tracks in there. A caveat here is that some artists do not have long careers, for a variety of reasons, which can limit the number of tracks actually available to load into the library. Check the methods at the foot of the post if you want to do the same.

What’s the most popular year? Firstly, I looked at the most popular year in the library. This question was the focus of an earlier post that found that 1971 was the best year in music. The play distribution per year can be plotted together with a summary of how many tracks and how many plays in total from each year are in the library. There’s a bias towards 90s music, which probably reflects my age, but could also be caused by my habit of collecting CD singles, which peaked as a format in this decade. The average number of plays is actually pretty constant for all years (median of ~4); the mean is perhaps slightly higher for late-2000s music.

Favourite styles of music: I also looked at Genre. Which styles of music are my favourite? I plotted the total number of tracks versus the total number of plays for each Genre in the library. The size of the marker reflects the median number of plays per track for that genre. Most Genres obey a rule where total plays is a function of total tracks, but there are exceptions. Crossover, Hip-Hop/Rap and Power-pop are highlighted as those with an above average number of plays. I’m not lacking in Power-pop, with a few thousand tracks, but I should probably get my hands on more Crossover or Hip-Hop/Rap.

[Figure: play distribution per year, and total plays versus total tracks per genre]

Using citation statistics to find my favourite artists: Next, I looked at who my favourite artists are. It could be argued that I should know who my favourite artists are! But tastes can change over a 10 year period and I was interested in an unbiased view of my favourite artists rather than who I think they are. A plot of Total Tracks vs Mean plays per track is reasonably informative. The artists with the highest plays per track are those with only one track in the library, e.g. Harvey Danger with Flagpole Sitta. So this statistic is pretty unreliable. Equally, I’ve got lots of tracks by Manic Street Preachers but evidently I don’t play them that often. I realised that the problem of identifying favourite artists based on these two pieces of information (plays and number of tracks) is pretty similar to assessing scientists using citation metrics (citations and number of papers). Hirsch proposed the h-index to meld these two bits of information into a single metric. It’s easily computed and I already had an Igor procedure to calculate it en masse, so I ran it on the library information.
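For illustration, here is a minimal Python sketch of that calculation (the real analysis was done in IgorPro); the play counts are made up for the example.

```python
def h_index(plays):
    """Largest h such that h tracks have at least h plays each."""
    ranked = sorted(plays, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h

# hypothetical play counts for one artist's tracks
plays = [120, 45, 40, 33, 33, 31, 12, 7, 3, 1]
print(h_index(plays))  # -> 7
```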

Before doing this, I consolidated multiple versions of the same track into one. I knew that I had several versions of the same track, especially as I have multiple versions of some albums (e.g. Pet Sounds = 3 copies = mono + stereo + a cappella). The top offending track was “Baby’s Coming Back” by Jellyfish, with 11 copies! Anyway, these were consolidated before running the h-index calculation.

The top artist was Elliott Smith with an h-index of 32. This means he has 32 tracks that have been listened to at least 32 times each. I was amazed that Muse had the second highest h-index (I don’t consider myself a huge fan of their music) until I remembered a period where their albums were on an iPod Nano used during exercise. Amusingly (and narcissistically) my own music – the artist names are redacted – scored quite highly, with two out of three bands in the top 100, which are shown here. These artists with high h-indices are the most consistently played in the library and probably constitute my favourite artists, but is the ranking correct?

The procedure also calculates the g-index for every artist. The g-index is similar to the h-index but gives weight to very highly played tracks (very highly cited papers) above the h threshold. For example, The Smiths have h=26. This could be 26 tracks that have been listened to exactly 26 times each, or they could have been listened to 90 times each. The h-index cannot distinguish these cases, but the g-index can: it is the largest number g such that the top g ranked tracks have, between them, at least g² plays. The Smiths have g=35. To find the artists that are most-played-of-the-consistently-most-played, I subtracted h from g and plotted the Top 50. This ranked list, I think, most closely represents my favourite artists, according to my listening habits over the last ten years.
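A companion sketch for the g-index, again in Python rather than the IgorPro used for the real analysis, using the same made-up play counts as above:

```python
def g_index(plays):
    """Largest g such that the top g tracks have at least g*g plays in total
    (bounded here by the number of tracks)."""
    ranked = sorted(plays, reverse=True)
    total, g = 0, 0
    for rank, count in enumerate(ranked, start=1):
        total += count
        if total >= rank * rank:
            g = rank
    return g

plays = [120, 45, 40, 33, 33, 31, 12, 7, 3, 1]
print(g_index(plays))  # -> 10 (cumulative plays 325 >= 100)
```

The Top 50 plot is then just the ranking of g_index(plays) - h_index(plays) for each artist.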

[Figure: Top 50 artists ranked by g-index minus h-index]

Track length: Finally, I looked at the track length. I have a range of track lengths in the library, from “You Suffer” by Napalm Death (iTunes has this at 4 s, but Wikipedia says it is 1.36 s), through to epic tracks like “Blue Room” by The Orb. Most tracks are in the 3-4 min range. Plays per track indicates that this track length is optimal, with most of the highly played tracks falling within this window. The super-long tracks are rarely listened to, probably because of their length. Short tracks also have higher than average plays, probably because they are less likely to be skipped due to their length.

These were the first things that sprang to mind for iTunes analysis. As I said at the top, there’s lots of information in the library to dig through, but I think this is enough for one post. And not a pie-chart in sight!

Methods: the library is in xml format and can be read/parsed this way. More easily, you can just select the whole library and copy-paste it into TextEdit and then load this into a data analysis package. In this case, IgorPro (as always). Make sure that the interesting fields are shown in the full library view (Music>Songs). To do everything in this post you need artist, track, album, genre, length, year and play count. At the time of writing, I had 21326 tracks in the library. For the h-index analysis, I consolidated multiple versions of the same track, giving 18684 tracks. This is possible by concatenating artist and the first ten characters of the track title (separated by a unique character) and adding the play counts for these concatenated versions. The artist could then be deconvolved (using the unique character) and used for the h-index calculation. It’s not very elegant, but it seemed to work well. The h-index and g-index calculations were automated (previously sort-of-described here), as was most of the plot generation. The inspiration for the colour coding is from the 2013 Feltron Report.
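For anyone who prefers a scripting route to the copy-paste-into-Igor approach, here is a rough Python sketch of the consolidation step; the tab-separated file and the column names Artist, Name and Plays are assumptions about how the pasted library view is saved.

```python
import csv
from collections import defaultdict

SEP = "|"  # the "unique character" separating artist from truncated title

plays = defaultdict(int)
with open("library.txt", newline="", encoding="utf-8") as f:   # pasted library view, tab-separated (assumed)
    for row in csv.DictReader(f, delimiter="\t"):
        key = row["Artist"] + SEP + row["Name"][:10]            # artist + first ten characters of title
        plays[key] += int(row["Plays"] or 0)                    # add play counts of duplicate versions

# deconvolve the artist from the key for the per-artist h-index calculation
per_artist = defaultdict(list)
for key, count in plays.items():
    per_artist[key.split(SEP)[0]].append(count)

print(len(plays), "consolidated tracks across", len(per_artist), "artists")
```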

* there’s an interesting post here about modelling the ideal playlist. I worked through the ideas in that post but found that it doesn’t scale well to large libraries, especially if they’ve been going for a long time, i.e. mine.

The post title is taken from John Coltrane’s cover version of My Favorite Things from the album of the same name. Excuse the US English spelling.

Belly Button Window

A bit of navel gazing for this post. Since moving the blog to wordpress.com in the summer, it recently accrued 5000 views. Time to analyse what people are reading…

[Figure: blog views per post and fraction of total views]

The most popular post on the blog (by a long way) is “Strange Things”, a post about the eLife impact factor (2824 views). The next most popular is a post about a Twitter H-index, with 498 views. The Strange Things post has accounted for ~50% of views since it went live (bottom plot) and this fraction seems to be creeping up. More new content is needed to change this situation.

I enjoy putting blog posts together and love the discussion that follows from my posts. It’s also been nice when people have told me that they read my blog and enjoy my posts. One thing I didn’t expect was the way that people can take away very different messages from the same post. I don’t know why I found this surprising, since this often happens with our scientific papers! Actually, in the same way as our papers, the most popular posts are not the ones that I would say are the best.

Wet Wet Wet: I have thought about deleting the Strange Things post, since it isn’t really what I want this blog to be about. An analogy here is the Scottish pop-soul outfit Wet Wet Wet who released a dreadful cover of The Troggs’ “Love is All Around” in 1994. In the end, the band deleted the single in the hope of redemption, or so they said. Given that the song had been at number one for 15 weeks, the damage was already done. I think the same applies here, so the post will stay.

Directing Traffic: Most people coming to the blog are clicking on links on Twitter. A smaller number come via other blogs which feature links to my posts. A very small number come to the blog via a Google search. Google has changed the way it reports search referrals, so most of the time it is not possible to know what people were searching for. For those that I can see, the only search term is… yes, you’ve guessed it: “elife impact factor”.

Methods: WordPress stats are available for blog owners via URL formatting. All you need is your API key and (obviously) your blog address.

Instructions are found at http://stats.wordpress.com/csv.php

A basic URL format would be: http://stats.wordpress.com/csv.php?api_key=yourapikey&blog_uri=yourblogaddress – replacing yourapikey with your API key (this can be retrieved at https://apikey.wordpress.com) and yourblogaddress with your blog address, e.g. quantixed.wordpress.com

Various options are available from that page to get the stats in which you are interested. For example, the following can be appended to the URL above to get a breakdown of views by post title for the past year:

&table=postviews&days=365&limit=-1

The format can be csv, json or xml, depending on what you want to do next with the information.
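As a sketch, the same request could be scripted; this assumes the stats.wordpress.com endpoint still accepts the parameters listed above, and the key and blog address are placeholders.

```python
import requests

params = {
    "api_key": "yourapikey",        # from https://apikey.wordpress.com
    "blog_uri": "yourblogaddress",  # e.g. quantixed.wordpress.com
    "table": "postviews",           # views broken down by post title
    "days": 365,
    "limit": -1,
    "format": "csv",
}
response = requests.get("http://stats.wordpress.com/csv.php", params=params)
print(response.text[:500])          # first few lines of the returned csv
```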

The title is from “Belly Button Window” by Jimi Hendrix, a posthumous release on the Cry of Love LP.

What The World Is Waiting For

The transition for scientific journals from print to online has been slow and painful. And it is not yet complete. This week I got an RSS alert to a “new” paper in Oncogene. When I downloaded it, something was familiar… very familiar… I’d read it almost a year ago! Sure enough, the AOP (ahead of print or advance online publication) date for this paper was September 2013 and here it was in the August 2014 issue being “published”.

I wondered why a journal would do this. It is possible that delaying actual publication would artificially boost the Impact Factor of a journal because there is a delay before citations roll in and citations also peak after two years. So if a journal delays actual publication, then the Impact Factor assessment window captures a “hotter” period when papers are more likely to generate more citations*. Richard Sever (@cshperspectives) jumped in to point out a less nefarious explanation – the journal obviously has a backlog of papers but is not allowed to just print more papers to catch up, due to page budgets.

There followed a long discussion about this… which you’re welcome to read. I was away giving a talk and missed all the fun, but if I may summarise on behalf of everybody: isn’t it silly that we still have pages – actual pages, made of paper – and this is restricting publication.

I wondered how Oncogene got to this position. I retrieved the data for AOP and actual publication for the last five years of papers at Oncogene, excluding reviews, from PubMed, using oncogene[ta] NOT review[pt] as the search term. The field DP has the date published (the “issue date” when the paper appears in print) and PHST has several interesting dates including [aheadofprint]. These could be parsed and imported into IgorPro as 1D waves. The lag time from AOP to print could then be calculated. I got 2916 papers from the search and was able to get data for 2441 papers.
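I did this in IgorPro, but as an illustration, here is a Python sketch that parses a MEDLINE-format export of that PubMed search (the filename is hypothetical) and computes the AOP-to-print lag from the PHST [aheadofprint] and DP fields.

```python
import re
from datetime import datetime

lags = []
with open("oncogene_medline.txt", encoding="utf-8") as f:   # hypothetical MEDLINE-format export
    records = f.read().split("\n\n")                        # records are separated by blank lines

for rec in records:
    dp = re.search(r"^DP\s*-\s*(\d{4}) (\w{3})", rec, re.M)                           # issue date, e.g. "2014 Aug"
    aop = re.search(r"^PHST-\s*(\d{4})/(\d{2})/(\d{2}).*\[aheadofprint\]", rec, re.M)  # ahead-of-print date
    if not (dp and aop):
        continue
    issue = datetime.strptime(dp.group(1) + " " + dp.group(2), "%Y %b")
    ahead = datetime(int(aop.group(1)), int(aop.group(2)), int(aop.group(3)))
    lags.append((issue - ahead).days)

if lags:
    lags.sort()
    print(len(lags), "papers with both dates; median lag", lags[len(lags) // 2], "days")
```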

[Figure: Oncogene lag time from AOP to print publication]

You can see for this journal that the lag time has been stable at around 300 days (~10 months) for issues published since 2013. So a paper AOP in Feb 2012 had to wait over 10 months to make it into print. This followed a linear period of lag time growth from mid-2010.

I have no links to Oncogene and don’t particularly want to single them out. I’m sure similar lags are happening at other print journals. Actually, my only interaction with Oncogene was that they sent this paper of ours out to review in 2011 (it got two not-negative-but-admittedly-not-glowing reviews) and then they rejected it because they didn’t like the cell line we used. I always thought this was a bizarre decision: why couldn’t they just decide that before sending it to review and wasting our time? Now I wonder whether they were simply not keen to add to the increasing backlog of papers at their journal. Whatever the reason, it has put me off submitting other papers there.

I know that there are good arguments for continuing print versions of journals, but from a scientist’s perspective the first publication is publication. Any subsequent versions are simply redundant and confusing.

*Edit: Alexis Verger (@Alexis_Verger) pointed me to a paper which describes that, for neuroscience journals, the lag time has increased over time. Moreover, the authors suggest that this is for the purpose of maximising Journal Impact Factor.

The post title comes from the double A-side Fools Gold/What The World Is Waiting For by The Stone Roses.

Tips from the Blog II

An IgorPro tip this week. The default font for plots is Geneva. Most of our figures are assembled using Helvetica for labelling. The default font can be changed in Igor Graph Preferences, but Preferences need to be switched on in order to be implemented. Anyway, I always seem to end up with a mix of Geneva plots and Helvetica plots. This can be annoying as the fonts are pretty similar, yet the spacing is different and this can affect the plot size. Here is a quick procedure Helvetica4All() to rectify this for all graph windows.

[Code: Helvetica4All() IgorPro procedure]

Six Plus One

Last week, ALM (article-level metric) data for PLoS journals were uploaded to Figshare with the invitation to do something cool with them.

Well, it would be rude not to. Actually, I’m one of the few scientists on the planet who hasn’t published a paper with the Public Library of Science (PLoS), so I have no personal agenda here. However, I love what PLoS is doing and what it has achieved to disrupt the scientific publishing system. Anyway, what follows is not in any way comprehensive, but I was interested to look at a few specific things:

  1. Is there a relationship between Twitter mentions and views of papers?
  2. What is the fraction of views that are PDF vs HTML?
  3. Can citations be predicted by more immediate article-level metrics?

The tl;dr version is 1. Yes. 2. ~20%. 3. Can’t say but looks unlikely.

1. Twitter mentions versus paper views

All PLoS journals are covered. The field containing paper views is (I think) “Counter”; this combines views of HTML and PDF (see #2). A plot of Counter against Publication Date for all PLoS papers (upper plot) shows that the number of papers published has increased dramatically since the introduction of PLoS ONE in 2007. There is a large variance in the number of views, which you’d expect, and the views also tail off for the most recent papers, since they have had less time to accumulate views. Below is the same plot where the size and colour of the markers reflect their Twitter score (see key). There’s a sharp line that must correspond to the date when Twitter data began to be logged as an ALM. There’s a scattering of mentions after this date to older literature, but one 2005 paper stands out – Ioannidis’s paper Why Most Published Research Findings Are False. It has a huge number of views and a large Twitter score, especially considering that it was a seven-year-old paper when they started recording the data. A pattern emerges in the post-logging period. Papers with more views are mentioned more on Twitter. The larger, darker markers are higher on the y-axis. Mentioning a paper on Twitter is sure to generate views of the paper, at some (unknown) conversion rate. However, as this is a single snapshot, we don’t know if Twitter mentions drive more downloads of papers, or whether more “interesting”/highly downloaded work is talked about more on Twitter.

[Figure: Counter views versus publication date, with marker size and colour showing Twitter score]

2. Fraction of PDF vs HTML views

I asked a few people what they thought the download ratio is for papers. Most thought 60-75% PDF versus 25-40% HTML. I thought it would be lower, but I was surprised to see that it is, at most, 20% for PDF. The plot below shows the fraction of PDF downloads (counter_pdf/(counter_pdf+counter_html)) for all PLoS journals, and then broken down for PLoS Biol and PLoS ONE.
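This is essentially a one-liner on the Figshare csv. A hedged Python sketch: the filename and the presence of a journal column are assumptions, while counter_pdf and counter_html are the columns used in the formula above.

```python
import pandas as pd

alm = pd.read_csv("alm_report.csv")   # hypothetical filename for the Figshare ALM csv
frac = alm["counter_pdf"] / (alm["counter_pdf"] + alm["counter_html"])
print("Median PDF fraction, all PLoS:", round(frac.median(), 3))

if "journal" in alm.columns:          # per-journal breakdown, if such a column exists
    print(frac.groupby(alm["journal"]).median())
```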

[Figure: fraction of PDF downloads for PLoS journals]

This was a surprise to me. I have colleagues who don’t like depositing post-print or pre-print papers because they say that they prefer their work to be seen typeset in PDF format. However, this shows that, at least for PLoS journals, the reader is choosing not to see a typeset PDF at all, but an HTML version.

Maybe the PLoS PDFs are terribly formatted and 80% of people don’t like them. There is an interesting comparison that can be done here, because all papers are deposited at PubMed Central (PMC) and so the same plot can be generated for the PDF fraction there. The PDF format is different to PLoS and so we can test the idea that people prefer HTML over PDF at PLoS because they don’t like the PLoS format.

[Figure: fraction of PDF downloads at PubMed Central]

The fraction of PDF downloads is higher, but only around 30%. So either the PMC format is just as bad, or this is the way that readers like to consume the scientific literature. A colleague mentioned that HTML views are preferable to PDF if you actually want to do something with the data, e.g. for meta-analysis. This could have an effect. HTML views could be skim reading, whereas PDF is for people who want to read in detail… I wonder whether these fractions are similar at other publishers, particularly closed access publishers?

3. Citation prediction?

ALMs are immediate whereas citations are slow. If we assume for a moment that citations are a definitive means to determine the impact of a paper (which they may not be), then can ALMs predict citations? This would make them very useful in the evaluation of scientists and their endeavours. Unfortunately, this dataset is not sufficient to answer this properly, but with multiple timepoints, the question could be investigated. I looked at the number of paper downloads and also the Mendeley score to see how these two things may foretell citations. What follows is a strategy to do this in an unbiased way with few confounders.

[Figure: Scopus citations versus Web of Science Total Citations for 500 papers]

The dataset has a Scopus column, but for some reason these data are incomplete. It is possible to download data (but not on this scale AFAIK) for citations from Web of Science and then use the DOI to cross-reference to the other dataset. This plot shows the Scopus data as a function of “Total Citations” from Web of Science, for 500 papers. I went with the Web of Science data as this appears more robust.

The question is whether there is a relationship between downloads of a paper (Counter, either PDF or HTML) and citations, or between Mendeley score and citations. I figured that downloading, adding to Mendeley, and citing represent three progressive levels of “commitment” to a paper and so they may correlate differently with citations. Now, to look at this for all PLoS journals for all time would be silly because we know that citations are field-specific, journal-specific, time-sensitive etc. So I took the following dataset from Web of Science: the top 500 most-cited papers in PLoS ONE for the period of 2007-2010, limited to “cell biology”. By cross-referencing I could check the corresponding values for Counter and for Mendeley.
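A sketch of the cross-referencing in Python (the original was done in IgorPro); the filenames and the column names DOI, Total Citations, counter and mendeley are assumptions about the two exports.

```python
import pandas as pd

alm = pd.read_csv("alm_report.csv")             # Figshare ALM data (hypothetical filename)
wos = pd.read_csv("wos_top500_cellbiol.csv")    # Web of Science export (hypothetical filename)

# cross-reference the two datasets on DOI
merged = wos.merge(alm, left_on="DOI", right_on="doi", how="inner")

# rank correlations between citations and the two "commitment" measures
print(merged["Total Citations"].corr(merged["counter"], method="spearman"))
print(merged["Total Citations"].corr(merged["mendeley"], method="spearman"))
```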

[Figure: Counter and Mendeley scores versus Web of Science citations]

I was surprised that the correlation was very weak in both cases. I thought that the correlation would be stronger with Mendeley; however, signal-to-noise is a problem here, with few users of the service compared with counting downloads. Below each plot is a ranked view of the papers, with the Counter or Mendeley data presented as a rolling average. It’s a very weak correlation at best. Remember that this is post-hoc. Papers that have been cited more would be expected to generate more views and higher Mendeley scores, but this is not necessarily so. Predicting future citations based on Counter or Mendeley will be tough. To really know if this is possible, this approach needs to be used with multiple ALM timepoints to see if there is a predictive value for ALMs, but based on this single timepoint, it doesn’t seem as though prediction will be possible.

Again, looking at this for a closed access journal would be very interesting. The most-downloaded paper in this set had far more views (143,952) than other papers cited a similar number of times (78). The paper was this one, which I guess is of interest to bodybuilders! Presumably, it was heavily downloaded by people who are probably not in a position to cite the paper. Although these downloads didn’t result in extra citations, this paper has undeniable impact outside of academia. Because PLoS is open access, the bodybuilders were able to access the paper, rather than being met by a paywall. Think of the patients who are trying to find out more about their condition and can’t read any of the papers… The final point here is that ALMs have their own merit, irrespective of citations, which are the default metric for judging the impact of our work.

Methods: To crunch the numbers for yourself, head over to Figshare and download the csv. A Web of Science subscription is needed for the citation data. All the plots were generated in IgorPro, but no programming is required for these comparisons and everything I’ve done here can be easily done in Excel or another package.

Edit: Matt Hodgkinson (@mattjhodgkinson), Senior Editor at PLoS ONE, told me via Twitter that all ALM data (periodically updated) are freely available here. This means that some of the analyses I wrote about are possible.

The post title comes from Six Plus One a track on Dad Man Cat by Corduroy. Plus is as close to PLoS as I could find in my iTunes library.

You Know My Name (Look Up The Number)

What is your h-index on Twitter?

This thought crossed my mind yesterday when I saw a tweet that was tagged #academicinsults

It occurred to me that a Twitter account is a kind of micro-publishing platform. So what would “publication metrics” look like for Twitter? Twitter makes analytics available, so they can easily be crunched. The main metrics are impressions and engagements per tweet. As I understand it, impressions are the number of times your tweet is served up to people in their feed (boosted by retweets). Engagements are when somebody clicks on the tweet (either a link or to see the thread or whatever). In publication terms, impressions would equate to people downloading your paper and engagements would mean that they did something with it, like citing it. This means that an “h-index” for engagements can be calculated with these data.

For those that don’t know, the h-index for a scientist means that he/she has h papers that have been cited h or more times. The Twitter version would be a tweeter that has h tweets that were engaged with h or more times. My data is shown here:

[Figure: Twitter analytics, ranked engagements per tweet]

My Twitter h-index is currently 36. I have 36 tweets that have been engaged with 36 or more times.

So, this is a lot higher than my actual h-index, but obviously there are differences. Papers accrue citations as time goes by, but the information flow on Twitter is so fast that tweets don’t accumulate engagement over time. In that sense, the Twitter h-index is less sensitive to the time a user has been active on Twitter, whereas the real h-index is strongly affected by the age of the scientist. Other differences include the fact that I have “published” thousands of tweets and only tens of papers. Also, whether more people read my tweets than my papers… This is not something I want to think too much about, but it would affect how many engagements it is possible to achieve.

The other thing I looked at was whether replying to somebody actually means more engagement. This would skew the Twitter h-index. I filtered tweets that started with an @ and found that this restricts who sees the tweet, but doesn’t necessarily mean more engagement. Replies make up a very small fraction of the h tweets.

I’ll leave it to somebody else to calculate the Impact Factor of Twitter. I suspect it is very low, given the sheer volume of tweets.

Please note this post is just for fun. Normal service will (probably) resume in the next post.

Edit: As pointed out in the comments, this post is short on “Materials and Methods”. If you want to calculate your own Twitter h-index, go here. When logged in to Twitter, the analytics page should present your data (it may take some time to populate this page after you first view it). A csv can be downloaded from the button on the top-right of the page. I imported this into IgorPro (as always) to generate the plots. The engagements data need to be sorted in descending order and then the h-index can be found by comparing the numbers with their ranked position.
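If you would rather skip IgorPro, here is a minimal Python sketch of the same calculation; the filename and the engagements column name are assumptions about the analytics export.

```python
import csv

with open("tweet_activity.csv", newline="", encoding="utf-8") as f:   # hypothetical filename
    engagements = sorted((int(float(row["engagements"])) for row in csv.DictReader(f)),
                         reverse=True)

# the Twitter h-index: largest rank where the ranked engagement count is still >= its rank
h = 0
for rank, count in enumerate(engagements, start=1):
    if count >= rank:
        h = rank
    else:
        break
print("Twitter h-index:", h)
```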

The post title is from the quirky B-side to the Let It Be single by The Beatles.

Pay You Back In Time

A colleague once told me that she only reviews three papers per year and then refuses any further requests for reviewing. Her reasoning was as follows:

  • I publish one paper a year (on average)
  • This paper incurs three peer reviews
  • Therefore, I owe “the system” three reviews.

It’s difficult to fault this logic. However, I think that as a senior scientist with a wealth of experience, the system would benefit greatly from more of her input. Actually, I don’t think she sticks rigorously to this, and I know that she is an Academic Editor at a journal, so in fact she contributes much more to the system than she was letting on.

I thought of this recently when – in the space of one week – I got three peer review requests, which I accepted. I began to wonder about my own debit and credit in the peer review system. I only have reliable data from 2010.

Reviews incurred as an author are in gold (re-reviews are in pale gold), reviews completed as a peer are in purple (re-reviews are in pale purple). They are plotted cumulatively and the difference – or the balance – is shown by the markers. So, I have been in a constant state of owing the system reviews and I’m in no position to be turning down review requests.

In my defence, I was for two years Section Editor at BMC Cell Biology, which means that I contributed more to the system than the plot shows. Another thing is reviews incurred/completed as a grant applicant/referee. I haven’t factored those in, but I think this would take the balance down further. I also comment on colleagues’ papers and grant applications.

Thinking back, I’ve only ever turned down a handful of peer review requests, either because the work was too far outside my area of expertise or because I had a conflict of interest. I’ve never cited a balance of zero as a reason for not reviewing, and this analysis shows that I’m not in this category.

In case any Editors are reading this… I’m happy to review work in my area, but please remember I currently have three papers to review!

The post title comes from a demo recording by The Posies that can be found on the At Least, At Last compilation on Not Lame Recordings.

Strange Things – update

My post on the strange data underlying the new impact factor for eLife was read by many people. Thanks for the interest and for the comments and discussion that followed. I thought I should follow up on some of the issues raised in the post.

To recap:

  1. eLife received a 2013 Impact Factor despite only publishing 27 papers in the last three months of the census window. Other journals, such as Biology Open, did not.
  2. There were spurious miscites to papers before eLife published any papers. I wondered whether this resulted in an early impact factor.
  3. The Web of Knowledge database has citations from articles in the past referring to future articles!

1. Why did eLife get an early Impact Factor? It turns out that there is something called a partial Impact Factor. This is where an early Impact Factor is awarded to some journals in special cases. This is described here in a post at Scholarly Kitchen. Cell Reports also got an early Impact Factor and Nature Methods got one a few years ago (thanks to Daniel Evanko for tweeting about Nature Methods’ partial Impact Factor). The explanation is that if a journal is publishing papers that are attracting large numbers of citations, it gets fast-tracked for an Impact Factor.

2. In a comment, Rafael Santos pointed out that the miscites were “from a 2013 eLife paper to an inexistent 2010 eLife paper, and another miscite from a 2013 PLoS Computational Biology paper to an inexistent 2011 eLife paper”. The post at Scholarly Kitchen confirms that citations are not double-checked or cleaned up at all by Thomson-Reuters. It occurred to me that journals looking to game their Impact Factor could alter the year for citations to papers in their own journal in order to inflate their Impact Factor. But no serious journal would do that – or would they?

3. This is still unexplained. If anybody has any ideas (other than time travel) please leave a comment.

Strange Things

I noticed something strange about the 2013 Impact Factor data for eLife.

Before I get onto the problem, I feel I need to point out that I dislike Impact Factors and think that their influence on science is corrosive. I am a DORA signatory and I try to uphold those principles. I admit that, in the past, I used to check the new Impact Factors when they were released, but no longer. This year, when the 2013 Impact Factors came out, I didn’t bother to log on to take a look. A chance Twitter conversation with Manuel Théry (@ManuelTHERY) and Christophe Leterrier (@christlet) was my first encounter with the new numbers.

Huh? eLife has an Impact Factor?

For those that don’t know, the 2013 Impact Factor is worked out by counting the total number of 2013 cites to articles in a given journal that were published in 2011 and 2012. This number is divided by the number of “citable items” in that journal in 2011 and 2012.
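As a toy illustration of this arithmetic, using the 230 citations mentioned below and, as a stand-in denominator, the 27 papers mentioned in the update above (the official count of “citable items” may well differ):

```python
cites_in_2013_to_2011_and_2012_papers = 230   # figure quoted below for eLife
citable_items_2011_and_2012 = 27              # stand-in: the 27 papers published Oct-Dec 2012
print(cites_in_2013_to_2011_and_2012_papers / citable_items_2011_and_2012)  # ~8.5
```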

Now, eLife launched in October 2012. So it seems unfair that it gets an Impact Factor since it only published papers for 12.5% of the window under scrutiny. Is this normal?

I looked up the 2013 Impact Factor for Biology Open, a Company of Biologists journal that launched in January 2012* and… it doesn’t have one! So why does eLife get an Impact Factor but Biology Open doesn’t?**

[Figure: eLife 2013 Impact Factor record]

Looking at the numbers for eLife revealed that there were 230 citations in 2013 to eLife papers in 2011 and 2012, one of which was a mis-citation to an article in 2011. This article does not exist (the next column shows that there were no articles in 2011). My guess is that Thomson Reuters view this as the journal existing for 2011 and 2012, and therefore deserving of an Impact Factor. Presumably there are no mis-cites in the Biology Open record and it will only get an Impact Factor next year. Doesn’t this call into question the veracity of the database? I have found other errors in records previously (see here). I also find it difficult to believe that no-one checked this particular record, given the profile of eLife.

[Figure: citations to eLife articles by year in Web of Science]

Perhaps unsurprisingly, I couldn’t track down the rogue citation. I did look at the cites to eLife articles from all years in Web of Science, the Thomson Reuters database (which again showed that eLife only started publishing in Oct 2012). As described before, there are spurious citations in the database. Josh Kaplan’s eLife paper on UNC13/Tomosyn managed to rack up 5 citations in 2004, some 9 years before it was published (in 2013)! This was along with nine other papers that somehow managed to be cited in 2004 before they were published. It’s concerning enough that these data are used for hiring, firing and funding decisions, but if the data are incomplete or incorrect this is even worse.

Summary: I’m sure the Impact Factor of eLife will rise as soon as it has a full window for measurement. This would actually be 2016 when the 2015 Impact Factors are released. The journal has made it clear in past editorials (and here) that it is not interested in an Impact Factor and won’t promote one if it is awarded. So, this issue makes no difference to the journal. I guess the moral of the story is: don’t take the Impact Factor at face value. But then we all knew that already. Didn’t we?

* For clarity, I should declare that we have published papers in eLife and Biology Open this year.

** The only other reason I can think of is that eLife was listed on PubMed right away, while Biology Open had to wait. This caused some controversy at the time. I can’t see why a PubMed listing should affect Impact Factor. Anyhow, I noticed that Biology Open got listed in PubMed by October 2012, so in the end it is comparable to eLife.

Edit: There is an update to this post here.

Edit 2: This post is the most popular on Quantixed. A screenshot of visitors’ search engine queries (Nov 2014)…

[Image: screenshot of search engine queries leading to this post]

The post title is taken from “Strange Things” from Big Black’s Atomizer LP released in 1986.