Rollercoaster IV: ups and downs of Google Scholar citations

Time for an update to a previous post. For the past few years, I have been using an automated process to track citations to my lab’s work on Google Scholar (details of how to set this up are at the end of this post).

Due to the nature of how Google Scholar tracks citations, it means that citations get added (hooray!) but might be removed (booo!). Using a daily scrape of the data it is possible to watch this happening. The plots below show the total citations to my papers and then a version where we only consider the net daily change.

Four years of tracking citations on Google Scholar

The general pattern is for papers to accrue citations and some do so faster than others. You can also see that the number of citations occasionally drops down. Remember that we are looking at net change here. So a decrease of one citation is masked by the addition of one citation and vice versa. Even so, you can see net daily increases and even decreases.

It’s difficult to see what is happening down at the bottom of the graph so let’s separate them out. The two plots below show the net change in citations, either on the same scale (left) or scaled to the min/max for that paper (right).

Citation tracking for individual papers

The papers are shown here ranked from the ones that accrued the most citations down to the ones that gained no citations while they were tracked. Five “new” papers began to be tracked very recently. This is because I changed the way that the data are scraped (more on this below).

The version on the right reveals a few interesting things. Firstly that there seems to be “bump days” where all of the papers get a jolt in one direction or another. This could be something internal to Google or the addition or several items which all happen to cite a bunch of my papers. The latter explanation is unlikely, given the frequency of changes seen in the whole dataset. Secondly, some papers are highly volatile with daily toggling of citation numbers. I have no idea why this may be. Two plots below demonstrate these two points. The arrow shows a “bump day”. The plot on the right shows two review papers that have volatile citation numbers.

I’m going to keep the automated tracking going. I am a big fan of Google Scholar, as I have written previously, but quoting some of the numbers makes me uneasy, knowing how unstable they are.

Note that you can use R to get aggregate Google Scholar data as I have written about previously.

How did I do it?

The analysis would not be possible without automation. I use a daemon to run a shell script everyday. This script calls a python routine which outputs the data to a file. I wrote something in Igor to load each day’s data, and crunch the numbers, and make the graphs. The details of this part are in the previous post.

I realised that I wasn’t getting all of my papers using the previous shell script. Well, this is a bit of a hack, but I changed the calls that I make to scholar.py so that I request data from several years.

#!/bin/bash
cd /directory/for/data/
python scholar.py -c 500 --author "Sam Smith" --after=1999 --csv > g1999.csv
sleep $[ ( $RANDOM % 15 )  + 295 ]
# and so on
python scholar.py -c 500 --author "Sam Smith" --after=2019 --csv > g2019.csv
OF=all_$(date +%Y%m%d).csv
cat g*.csv > $OF
rm g*.csv

I found that I got different results for each year I made the query. My first change was to just request all years using a loop to generate the calls. This resulted in an IP ban for 24 hours! Through a bit of trial-and-error I found that reducing the queries to ten and waiting a polite amount of time between queries avoided the ban.

The hacky part was to figure out which year requests I needed to make to make sure I got most of my papers. There is probably a better way to do this!

I still don’t get every single paper and I retrieve data for a number of papers on which I am not an author – I have no idea why! I exclude the erroneous papers using the Igor program that reads all the data and plots out everything. The updated version of this code is here.

As described earlier I have many Rollercoaster songs in my library. This time it’s the song by Sleater-Kinney from their “The Woods” album.

Leave a Comment

Your email address will not be published. Required fields are marked *

This site uses Akismet to reduce spam. Learn how your comment data is processed.