Rollercoaster IV: ups and downs of Google Scholar citations

Time for an update to a previous post. For the past few years, I have been using an automated process to track citations to my lab’s work on Google Scholar (details of how to set this up are at the end of this post).

Because of the way Google Scholar tracks citations, citations get added (hooray!) but can also be removed (booo!). Using a daily scrape of the data, it is possible to watch this happening. The plots below show the total citations to my papers, and then a version where we only consider the net daily change.

Four years of tracking citations on Google Scholar

The general pattern is for papers to accrue citations, and some do so faster than others. You can also see that the number of citations occasionally drops. Remember that we are looking at net change here, so the loss of one citation can be masked by the gain of another, and vice versa. Even so, you can see net daily increases and even decreases.
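To be concrete, the net change is just the first difference of each day’s scraped total, which is why simultaneous gains and losses cancel out. A minimal sketch in python, with made-up numbers:

```python
# Daily citation totals for one paper, one scrape per day (made-up numbers)
daily_totals = [120, 121, 121, 124, 123, 126]

# Net daily change is the first difference of the series.
# A day where one citation appears and another disappears shows as zero.
net_change = [b - a for a, b in zip(daily_totals, daily_totals[1:])]
print(net_change)  # [1, 0, 3, -1, 3]
```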

It’s difficult to see what is happening down at the bottom of the graph so let’s separate them out. The two plots below show the net change in citations, either on the same scale (left) or scaled to the min/max for that paper (right).

Citation tracking for individual papers

The papers are shown here ranked from the ones that accrued the most citations down to the ones that gained no citations while they were tracked. Five “new” papers began to be tracked very recently. This is because I changed the way that the data are scraped (more on this below).

The version on the right reveals a few interesting things. Firstly, there seem to be “bump days” when all of the papers get a jolt in one direction or another. This could be something internal to Google, or the addition of several items which all happen to cite a bunch of my papers. The latter explanation is unlikely, given the frequency of changes seen in the whole dataset. Secondly, some papers are highly volatile, with daily toggling of citation numbers. I have no idea why this might be. The two plots below demonstrate these two points: the arrow shows a “bump day”, and the plot on the right shows two review papers with volatile citation numbers.

I’m going to keep the automated tracking going. I am a big fan of Google Scholar, as I have written previously, but quoting some of the numbers makes me uneasy, knowing how unstable they are.

Note that you can use R to get aggregate Google Scholar data as I have written about previously.

How did I do it?

The analysis would not be possible without automation. I use a daemon to run a shell script every day. This script calls a python routine which outputs the data to a file. I wrote something in Igor to load each day’s data, crunch the numbers and make the graphs. The details of this part are in the previous post.

I realised that I wasn’t getting all of my papers with the previous shell script. This is a bit of a hack, but I changed the calls that I make so that I request data from several years.

cd /directory/for/data/
python -c 500 --author "Sam Smith" --after=1999 --csv > g1999.csv
sleep $(( (RANDOM % 15) + 295 ))
# and so on
python -c 500 --author "Sam Smith" --after=2019 --csv > g2019.csv
OF=all_$(date +%Y%m%d).csv
cat g*.csv > $OF
rm g*.csv

I found that I got different results for each year I made the query. My first change was to just request all years, using a loop to generate the calls. This resulted in an IP ban for 24 hours! Through a bit of trial and error, I found that reducing the number of queries to ten and waiting a polite amount of time between them avoided the ban.
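The back-off idea can be sketched in python: cap the run at ten queries and wait roughly five minutes, with a little jitter, between them. The numbers mirror the sleep line in the shell script above but are otherwise arbitrary:

```python
import random

# One polite delay (in seconds) per query: ~5 minutes plus up to 15 s jitter
def polite_delays(n_queries, base=295, jitter=15):
    return [base + random.randrange(jitter) for _ in range(n_queries)]

delays = polite_delays(10)
# In a real run you would call time.sleep(d) after each query;
# ten queries means the whole scrape takes about 50 minutes.
assert all(295 <= d < 310 for d in delays)
```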

The hacky part was to figure out which year requests I needed to make to make sure I got most of my papers. There is probably a better way to do this!

I still don’t get every single paper and I retrieve data for a number of papers on which I am not an author – I have no idea why! I exclude the erroneous papers using the Igor program that reads all the data and plots out everything. The updated version of this code is here.

As described earlier I have many Rollercoaster songs in my library. This time it’s the song by Sleater-Kinney from their “The Woods” album.

One With The Freaks – very highly cited papers in biology

I read this recent paper about very highly cited papers and science funding in the UK. The paper itself was not very good, but the dataset that underlies it is something to behold, as I’ll explain below.

The idea behind the paper was to examine very highly cited papers in biomedicine with a connection to the UK. Have those authors been successful in getting funding from the MRC, Wellcome Trust or NIHR? The authors find that some of the authors of these very highly cited papers are not funded by these sources. Note that these funders are some, but not all, of the science funding bodies in the UK. The authors also looked at panel members of those three funders, and report that these individuals are funded at high rates and that the overlap between panel membership and very highly cited authorship is very low. I don’t want to critique the paper extensively, but the conclusions drawn are rather blinkered, for a few reasons: 1, MRC, NIHR and Wellcome support science in ways other than direct funding of individuals (e.g. PhD programmes, infrastructure etc.); 2, the contribution of other funders, e.g. BBSRC, was ignored; 3, panels tend to be selected from the pool of awardees, rather than the other way around. I understand that the motivation of the authors is to stimulate debate about whether science funding is effective, and this is welcome, but the paper strays too far into clickbait territory for my tastes.

The most interesting thing about the analysis (and arguably its main shortcoming) was the dataset. The authors took the papers in Scopus which have been cited >1000 times. This is ~450 papers as of last week. As I found out when I recreated their dataset, this is a freakish set of papers. Of course weird things can be found when looking at outliers.

Dataset of 20,000 papers from Scopus (see details below)

The authors describe a one-line search term they used to retrieve papers from Scopus. These papers span 2005 to the present day and were then filtered for UK origin.

I’m not sure how accurate the dataset is in terms of finding papers of UK origin, but the point here is to look at the dataset and not to critique the paper.

I downloaded the first 20,000 records (a limitation of Scopus). I think it would break the terms of use to post the dataset here, but if your institution has a subscription, it can be recreated. The top paper has 16,549 citations! The 20,000th paper has accrued 122 citations, and ~450 papers have >1000 citations, as of last week.

Now, some papers are older than others, so I calculated the average citation rate by dividing total cites by the number of years since publication, to get a better picture of the hottest among these freaky papers. The two colour-coded plots show the years since publication. It is possible to see some young papers which are being cited at an even higher rate than the pack. These will move up the ranking faster than their neighbours over the next few months.
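The rate calculation itself is trivial. A sketch using two papers from the top of the dataset, taking 2019 as the current year (an assumption, so the example is reproducible):

```python
# Average citation rate = total cites / years since publication
papers = [
    ("Clustal W and Clustal X version 2.0", 2007, 16549),
    ("The Sequence Alignment/Map format and SAMtools", 2009, 13586),
]
now = 2019  # assumed "current" year for the example

for title, year, cites in papers:
    rate = cites / (now - year)
    print(f"{title}: {rate:.0f} cites/year")
```

Both come out well above 1000 cites per year, which is the freakish territory described below.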

Just looking at the “Top 20” is amazing. These papers are being cited at rates of approximately 1000 times per year. The paper ranked 6 is a young paper which is cited at a very high rate and will likely move up the ranking. So what are these freakish papers?

In the table below, I’ve pasted the top 20 of the highly cited paper dataset. They are a mix of clinical consortia papers and bioinformatics tools for sequence and structural analysis. The tools make sense: they are widely used in a huge number of papers and get heavily cited as a result. In fact, these citation numbers are probably an underestimate, since citations to software often get missed out of papers. The clinical papers are also useful to large fields. They have many authors, and there is a network effect to their citation which can drive up the cites to these items (this is noted in the paper I described above). Even though the data are expected, I was amazed by the magnitude of the citations and the rates at which these works are acquiring them. The topics of the papers beyond the top 20 are pretty similar.

There’s no conclusion for this post. There are a tiny subset of papers out there with freakishly high citation rates. We should simply marvel at them…

Rank | Title | Year | Journal | Total cites
1 | Clustal W and Clustal X version 2.0 | 2007 | Bioinformatics | 16549
2 | The Sequence Alignment/Map format and SAMtools | 2009 | Bioinformatics | 13586
3 | Fast and accurate short read alignment with Burrows-Wheeler transform | 2009 | Bioinformatics | 12653
4 | PLINK: A tool set for whole-genome association and population-based linkage analyses | 2007 | American Journal of Human Genetics | 12241
5 | Estimates of worldwide burden of cancer in 2008: GLOBOCAN 2008 | 2010 | International Journal of Cancer | 11047
6 | Cancer incidence and mortality worldwide: Sources, methods and major patterns in GLOBOCAN 2012 | 2015 | International Journal of Cancer | 10352
7 | PHENIX: A comprehensive Python-based system for macromolecular structure solution | 2010 | Acta Crystallographica Section D: Biological Crystallography | 10093
8 | Phaser crystallographic software | 2007 | Journal of Applied Crystallography | 9617
9 | New response evaluation criteria in solid tumours: Revised RECIST guideline (version 1.1) | 2009 | European Journal of Cancer | 9359
10 | Features and development of Coot | 2010 | Acta Crystallographica Section D: Biological Crystallography | 9241
11 | Introducing mothur: Open-source, platform-independent, community-supported software for describing and comparing microbial communities | 2009 | Applied and Environmental Microbiology | 8127
12 | BEAST: Bayesian evolutionary analysis by sampling trees | 2007 | BMC Evolutionary Biology | 8019
13 | Improved survival with ipilimumab in patients with metastatic melanoma | 2010 | New England Journal of Medicine | 7293
14 | OLEX2: A complete structure solution, refinement and analysis program | 2009 | Journal of Applied Crystallography | 7173
15 | Global and regional mortality from 235 causes of death for 20 age groups in 1990 and 2010: A systematic analysis for the Global Burden of Disease Study 2010 | 2012 | The Lancet | 6296
16 | New algorithms and methods to estimate maximum-likelihood phylogenies: Assessing the performance of PhyML 3.0 | 2010 | Systematic Biology | 6290
17 | The MIQE guidelines: Minimum information for publication of quantitative real-time PCR experiments | 2009 | Clinical Chemistry | 6086
18 | The Cochrane Collaboration’s tool for assessing risk of bias in randomised trials | 2011 | BMJ (Online) | 6065
19 | Velvet: Algorithms for de novo short read assembly using de Bruijn graphs | 2008 | Genome Research | 5550
20 | A comparative risk assessment of burden of disease and injury attributable to 67 risk factors and risk factor clusters in 21 regions, 1990-2010: A systematic analysis for the Global Burden of Disease Study 2010 | 2012 | The Lancet | 5499

The post title comes from “One With The Freaks” by The Notwist.

For What It’s Worth: Influence of our papers on our papers

This post is about a citation analysis that didn’t quite work out.

I liked this blackboard project by Manuel Théry looking at the influence of each paper authored by David Pellman’s lab on the future directions of the Pellman lab.

It reminds me that papers can have impact in the field while others might be influential to the group itself. I wondered which of the papers on which I’m an author have been most influential to my other papers and whether this correlates with a measure of their impact on the field.

There’s no code in this post. I retrieved the relevant records from Scopus and used the difference in “with” and “without” self-citation to pull together the numbers.

Influence: I used the number of citations to a paper from any of our own papers as its self-citation count. This was divided by the number of future papers. So if I have 50 papers, and the 23rd paper to be published has collected 27 self-citations, it scores 1 (neither the 23rd paper nor any of the preceding 22 papers can cite the 23rd paper, but the 27 that follow could). This is our metric for influence.

Impact: As a measure of general impact I took the total number of citations for each paper and divided this by the number of years since publication to get average cites per year for each paper.
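Both metrics are simple to compute once the self-citation counts have been extracted. A sketch with invented numbers (the paper list, counts and the fixed year are made up for illustration):

```python
# Papers in publication order; self_cites = citations from our later papers
papers = [
    {"year": 2008, "total_cites": 150, "self_cites": 2},
    {"year": 2012, "total_cites": 60,  "self_cites": 2},
    {"year": 2015, "total_cites": 90,  "self_cites": 1},
    {"year": 2018, "total_cites": 20,  "self_cites": 0},
]
this_year = 2020  # fixed so the example is reproducible
n = len(papers)

for i, p in enumerate(papers):
    future_papers = n - (i + 1)  # only later papers can cite this one
    # influence: self-citations per future paper (undefined for the newest paper)
    p["influence"] = p["self_cites"] / future_papers if future_papers else None
    # impact: average citations per year since publication
    p["impact"] = p["total_cites"] / (this_year - p["year"])
```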

Plot of influence against impact

Reviews and methods papers are shown in blue, while research papers are in red. I was surprised that some papers have been cited by as many as half of the papers that followed.

Generally, the articles that were most influential to us were also the papers with the biggest impact, although the correlation is not very strong. There is an obvious outlier paper that gets 30 cites per year (over a 12-year period, I should say), but this paper has not influenced our work as much as others have. This is partly because the paper is a citation magnet and partly because we’ve stopped working on this topic in the last few years.

Obviously, the most recent papers were the least informative: there are no future papers to test whether they were influential, and there are few citations so far to gauge their impact.

It’s difficult to say what the correlation between impact and influence on our own work really means, if anything. Does it mean that we have tended to pursue projects because of their impact (I would hope not)? Perhaps these papers are generally useful to the field and to us.

In summary, I don’t think this analysis was successful. I had wanted to construct some citation networks – similar to the Pellman tree idea above – to look at influence in more detail, but I lost confidence in the method. Many of our self-citations are for methodological reasons and so I’m not sure if we’re measuring influence or utility here. Either way, the dataset is not big enough (yet) to do more meaningful number crunching. Having said this, the approach I’ve described here will work for any scholar and could be done at scale.

There are several song titles in the database called ‘For What It’s Worth’. This one is Chapterhouse on Rownderbout.

Rollercoaster III: yet more on Google Scholar

In a previous post I made a little R script to crunch Google Scholar data for a given scientist. The graphics were done in base R and looked a bit ropey. I thought I’d give the code a spring clean – it’s available here. The script is called ggScholar.R (rather than gScholar.R). Feel free to run it and raise an issue or leave a comment if you have some ideas.

I’m still learning how to get things looking how I want them using ggplot2, but this is an improvement on the base R version.

As described earlier I have many Rollercoaster songs in my library. This time it’s the song and album by slowcore/dream pop outfit Red House Painters.

Rollercoaster II: more on Google Scholar citations

I’ve previously written about Google Scholar: its usefulness and its instability. I just read a post by Jon Tennant on how to harvest Google Scholar data in R, and I thought I would use his code as the basis to generate some nice plots from Google Scholar data.

A script for R is below and can be found here. Graphics are base R but do the job.

First of all I took it for a spin on my own data. The outputs are shown here:

These were the most interesting plots that sprang to mind. First is a ranked citation plot, which also shows y = x to find the Hirsch number. Second is the total citations per year to all papers over time; Google Scholar shows the last few years of this plot on the profile page. Third, older papers accrue more citations, but how does this look across all papers? Finally, a prediction of what my H-index will do over time (no prizes for guessing that it will go up!). As Jon noted, the calculation comes from this paper.
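The Hirsch number trick mentioned above works like this: rank papers by citations, descending, and h is where the ranked-citation curve crosses y = x. A quick sketch:

```python
# h-index: the largest h such that the h-th ranked paper has >= h citations
def h_index(cites):
    ranked = sorted(cites, reverse=True)
    return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

print(h_index([10, 8, 5, 4, 3]))  # 4
```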

While that’s interesting, we need to get the data of a scholar with a huge number of papers and citations. Here is George Church.

At the time of writing he has 763 papers with over 90,000 citations in total and an H-index of 147. Interestingly, ~10% of his total citations come from a monster paper in PNAS with Wally Gilbert in the mid-80s on genome sequencing.

Feel free to grab/fork this code and have a play yourself. If you have other ideas for plots or calculations, add a comment here or an issue at GitHub.

# Load the scholar package to retrieve Google Scholar data
library(scholar)
# Add Google Scholar ID of interest here
ID <- ""
# If you didn't add one to the script, prompt user to add one
if(ID == ""){
     ID <- readline(prompt = "Enter Scholar ID: ")
}
# Get profile information
profile <- get_profile(ID)
# Get the citation history
cites <- get_citation_history(ID)
# Get publications and save as a csv
pubs <- get_publications(ID)
write.csv(pubs, file = "citations.csv")
# Predict h-index
hIndex <- predict_h_index(ID)
# Now make some plots
# Plot of total citations by year
png(file = "citationsByYear.png")
plot(cites$year, cites$cites,
     type = "h", xlab = "Year", ylab = "Total Cites")
dev.off()
# Plot of ranked paper by citation with h
png(file = "citationsAndH.png")
plot(pubs$cites, type = "l",
     xlab = "Paper rank", ylab = "Citations per paper")
abline(0, 1)  # y = x crosses the curve at the h-index
text(nrow(pubs), max(pubs$cites, na.rm = TRUE),
     paste("h =", profile$h_index), pos = 2)
dev.off()
# Plot of cites to paper by year
png(file = "citesByYear.png")
plot(pubs$year, pubs$cites,
     xlab = "Year", ylab = "Citations per paper")
dev.off()
# Plot of h-index prediction
thisYear <- as.integer(format(Sys.Date(), "%Y"))
png(file = "hPred.png")
plot(hIndex$years_ahead + thisYear, hIndex$h_index,
     ylim = c(0, max(hIndex$h_index, na.rm = TRUE)),
     type = "h",
     xlab = "Year", ylab = "H-index prediction")
dev.off()

Note that my previous code used a python script to grab Google Scholar data. While that script worked well, the scholar package for R seems a lot more reliable.

I have a surprising number of tracks in my library with Rollercoaster in the title. This time I will go with the Jesus & Mary Chain track from Honey’s Dead.