Tips From The Blog XII: Improving your Twitter experience

This is a quick set of tips to improve your Twitter experience. YMMV with these tips, and Twitter may change things so that they no longer work, but this advice is correct as of today.

I see a lot of people on Twitter complaining about two things:

  • Twitter changing the tweets shown from “latest first” to some algorithmic order that irritates everybody
  • Promoted tweets, ads, suggestions of who to follow etc etc

These things plague Twitter users looking at “Home”, whether they use the web interface or the official app.

Tip 1 – use a client

If you use Tweetdeck, the promoted stuff disappears and the tweets are always shown in “latest first”. It is available as a desktop app.

Alternative apps for mobile devices were more or less strangled by Twitter some time ago. However, apps like Twitterrific still allow you to look at a raw Twitter timeline. I stopped using Twitterrific after Twitter disabled push notifications to encourage everyone back to their own app. So…

Tip 2 – use lists

If you use the official Twitter app, you will have the same problems as described above. The solution here is to set up a list and add people you’d like to follow. When you view the list, you get the content “latest first” and it has no ads or sponsored stuff.

Tip 3 – the hack

It is possible to use mute words to prevent unwanted content making its way into your feed. Below is a list that was recently shared on Twitter:

Here is the list for easy copy/pasting


This is a hack, since these hidden words are likely to change, but if you are desperate and don’t want to use a client or lists… go for it. Go to Settings, add each line to your muted words, and select “from everyone” and “forever”.

Don’t forget to also tap the stars icon on your home screen and tell the app that you want to see latest tweets first. This setting reverts after a few days; if that is too annoying, use one of the other tips above.

This post is part of a series of tips.

Child’s Play: pi-hole set up for a safer internet

I have been running a pi-hole to block ads on my home network for a while. It’s great! Not only are ads blocked, but it speeds up internet browsing because… the ads do not load. I wondered if it would be possible to use a pi-hole to make a child-safe internet experience to protect the little people in the house.

Sure, there are ways to do this in most routers but they are not ideal. I have an Orbi mesh from Netgear and this has two parental control options: “Live Parental Controls”, which is seemingly being deprecated, and “Circle” by Disney, which Netgear is now pushing. If the words “by Disney” alone were not enough to trouble anybody: 1) it works by doing an ARP poisoning attack on the router, 2) Disney (or whoever) would be logging all requests from the network, and 3) the free version is limited and you have to pay for full protection. So, can a pi-hole be used to make a (free) child-safe internet experience? Yes! The trick is how to do that while maintaining a full-bodied internet for everyone else (and maintaining ad-blocking for everyone).

Existing set-up

I have the Orbi router doing DHCP assignment (static IPs for some stuff and a range for dynamic assignment). DNS points to the ad-blocking pi-hole which is wired to the router. Yes, I know I can have the pi-hole doing DHCP and I have run it this way with a different router but this configuration is how I have it right now. The router doesn’t allow DNS settings to be assigned to each device. I’ll describe how I made the second pi-hole and then how I integrated it into this set up.

Making a blockhole

I bought a RPi Zero W, with pibow case, power supply and 8 GB SD card. My ad-blocker pi-hole runs on a RPi 3B+ and has a bigger card, but there was no need for something that would not handle much traffic.

I installed Raspbian Stretch Lite (I wasn’t sure if pi-hole is supported under Buster). Legacy downloads of Stretch are available from the Raspberry Pi website. The RPi zero has miniHDMI out to connect to a monitor for the setup. I customised it a bit, enabled ssh and VNC so that I could control it headlessly. Next I gave it the name blockhole to distinguish it from the other RPis on the network. I assigned a fixed IP via the router and then ran the pi-hole installation as described on the pi-hole site. I could see the dashboard and log in OK, so all was good.

At this point, I simply had a second ad-blocking pi-hole on my LAN with no device(s) on the network using it. Firstly, let’s turn it into a parental control device. I wanted three things:

1. Force safe search on Google, Bing, Duckduckgo and YouTube

There is a great thread on how to do this on the pi-hole discourse site. The relevant link is here. Jayke Peters made a really simple bash script to modify the appropriate files on a pi-hole to do this (other people in the thread worked out how to force safe search). In case that link disappears:

# the script's filename was lost from the original post; "safesearch.sh" is a placeholder
mv ./safesearch.sh /usr/local/bin/
chmod a+x /usr/local/bin/safesearch.sh
safesearch.sh --enable

This can all be done via ssh to the blockhole. The last line needs to be run as sudo. You can check that it has generated the appropriate file by:

ls /etc/dnsmasq.d/

You should see a file called 05-restrict.conf in there if everything went OK.
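For the curious, the entries in that file pin each search engine’s domain to its safe-search server. The real contents are generated by the script, so treat the excerpt below as illustrative only; the IPs shown are the documented safe-search addresses for Google and Bing:

```conf
# illustrative excerpt of 05-restrict.conf – the real file is generated by the script
address=/www.google.com/216.239.38.120   # forcesafesearch.google.com
address=/www.bing.com/204.79.197.220     # strict.bing.com
```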

Enforcing safe search is such a great idea. Kids can type a rude word into a search engine and get all kinds of inappropriate content. This change forces the search to be done via the “safe search” settings, and it really works: the same rude word search with enforced safe search brings back harmless results on Google, for example.

2. Block inappropriate sites outright

The script adds wildcard blocks to common terms found in adult site URLs. This means that these sites are just blocked outright. This is a good method. The alternative is to add blocklists to the pi-hole. There are some available on GitHub. Even those that have 1 million URLs will not block the sites that will spring up tomorrow or next week. So just blocking based on common terms like xxx should work better.

3. Add some more blocks

YouTube is forced into safe search, but what if you just want to block it outright? Or any other site? You can blacklist any site using the pi-hole admin page: log in and select blacklist. The wildcard function will deal with URL variants. The script mentioned above also blacklists other search engines, e.g. Ecosia, that have no safe search capability. I added a bunch of other sites that I wasn’t happy about, e.g. Facebook, to round off the blockhole.

Upstream DNS

On the ad-block pi-hole I use Cloudflare as the upstream DNS. It is possible to use a service which has family filtering instead. OpenDNS has an option for doing this (which may be pay-for-service – I’m not sure). Neustar or other services will give filtering of inappropriate content. Note that they will be logging requests, but only from the kids devices, so it’s different to the Disney scenario mentioned above.
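As a sketch, one free option is OpenDNS FamilyShield, which pre-filters adult content. On a pi-hole the upstream servers can be set in the admin page (Settings > DNS) or directly in setupVars.conf; the excerpt below assumes the FamilyShield resolver addresses, which are published by OpenDNS:

```conf
# /etc/pihole/setupVars.conf (excerpt) – point the blockhole at OpenDNS FamilyShield
PIHOLE_DNS_1=208.67.222.123
PIHOLE_DNS_2=208.67.220.123
```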

Integration into the network

The next step was to get it working with the existing network. As described above, we want to maintain a full-bodied but ad-blocked experience for everyone else.

The simplest method was to alter the DNS settings on the devices that the kids use. The DNS address is the blockhole and so they get child-safe internet. Depending on the device, the setting is quite obscure and can be locked in the case of a kid’s account on a Mac. If they figure out how to change the DNS, it is possible to hand out the blockhole address from the router and manually assign the ad-blocker pi-hole DNS to devices used by adults. It’s not perfect but it will do.

The blockhole in action. This is the dashboard showing queries etc.

Finally, what about time-limiting the internet? Well, the router has options to pause the internet per device and it is possible to run the blockhole on a scheduler to only allow internet at certain times. This is not as sophisticated as the Circle system where there is an option to have x minutes of internet per day and the possibility to reward more minutes for good behaviour etc.

The point of this post was to share how to set up this system and integrate it with an existing pi-hole. None of the work is mine, it was all done with a bit of searching, but I thought it was worth posting my solution in case it helps other parents or carers out there.

If you use pi-hole to block ads or to make a blockhole, consider donating to this useful project.

The post title comes from the track “Child’s Play” from African Head Charge’s Akwaaba LP.

Coming Soon: The Digital Cell

Long-time readers might remember the short-lived series on quantixed called The Digital Cell. There is a reason why I stopped these posts, which I can now reveal… The Digital Cell will soon be a book!

Published by Cold Spring Harbor Laboratory Press, The Digital Cell is a handbook to help cell and developmental biologists get to grips with programming, image analysis, statistics and much more. As I wrote in the very first Digital Cell post here on quantixed:

[computer-based methods] have now permeated mainstream […] cell biology to such an extent that any groups that want to do cell biology in the future have to adapt in order to survive.

The book is aimed at helping researchers adapt. At the start of the book I write:

The aim of this book is to equip cell biologists for this change: to become digital cell biologists. Maybe you are a new student starting your first cell biology project. This book is designed to help you. Perhaps you are working in cell biology already but you didn’t have much previous exposure to computer science, maths and statistics. This book will get you started. Maybe you are a seasoned cell biologist. You read the latest papers and wonder how you could apply those quantitative approaches in your lab. You may even have digital cell biologists in your group already and want to know how they think and how you can best support them.

The Digital Cell

The first proofs are due next week and we are hoping to publish the book in the fall.

Please consider this a teaser. I plan to write some more posts on the process of writing the book and to preview some more content.

The post title comes from “Coming Soon” by Queen from “The Game” LP.

Communication Breakdown

There is an entertaining rumour going around about the journal Nature Communications. When I heard it for the fourth or fifth time, I decided to check out whether there is any truth in it.

The rumour goes something like this: the impact factor of Nature Communications is driven by physical sciences papers.

Sometimes it is put another way: cell biology papers drag down the impact factor of Nature Communications, or that they don’t deserve the high JIF tag of the journal because they are cited at lower rates. Could this be true?

TL;DR it is true but the effect is not as big as the rumour suggests. Jump to conclusion.

Nature Communications is the megajournal that sits below the subject-specific Nature journals. Operating as an open access, pay-to-publish journal, it is a way for Springer Nature to capture revenue from papers that were good but did not make the editorial selection for the subject-specific Nature journals. This is a long-winded way of saying that a wide variety of papers is covered by this journal, which publishes around 5,000 papers per year. This complicates any citation analysis because we need a way to differentiate papers from different fields. I describe one method to do this below.

Quick look at the data

I had a quick look at the top 20 papers from 2016–2017 with the most citations in 2018. There certainly were a lot of non-biological papers in there. Since highly cited papers disproportionately influence the Journal Impact Factor, this suggested the rumours might be true.

Citations (2018) – Title
238 – 11.4% Efficiency non-fullerene polymer solar cells with trialkylsilyl substituted 2D-conjugated polymer as donor
226 – Circular RNA profiling reveals an abundant circHIPK3 that regulates cell growth by sponging multiple miRNAs
208 – Recommendations for myeloid-derived suppressor cell nomenclature and characterization standards
203 – High-efficiency and air-stable P3HT-based polymer solar cells with a new non-fullerene acceptor
201 – One-Year stable perovskite solar cells by 2D/3D interface engineering
201 – Massively parallel digital transcriptional profiling of single cells
177 – Array of nanosheets render ultrafast and high-capacity Na-ion storage by tunable pseudocapacitance
166 – Multidimensional materials and device architectures for future hybrid energy storage
163 – Coupled molybdenum carbide and reduced graphene oxide electrocatalysts for efficient hydrogen evolution
149 – Ti3C2 MXene co-catalyst on metal sulfide photo-absorbers for enhanced visible-light photocatalytic hydrogen production
149 – Balancing surface adsorption and diffusion of lithium-polysulfides on nonconductive oxides for lithium-sulfur battery design
146 – Adaptive resistance to therapeutic PD-1 blockade is associated with upregulation of alternative immune checkpoints
140 – Conductive porous vanadium nitride/graphene composite as chemical anchor of polysulfides for lithium-sulfur batteries
136 – Fluorination-enabled optimal morphology leads to over 11% efficiency for inverted small-molecule organic solar cells
134 – The somatic mutation profiles of 2,433 breast cancers refines their genomic and transcriptomic landscapes
132 – Photothermal therapy with immune-adjuvant nanoparticles together with checkpoint blockade for effective cancer immunotherapy
131 – Enhanced electronic properties in mesoporous TiO2 via lithium doping for high-efficiency perovskite solar cells
125 – Electron-phonon coupling in hybrid lead halide perovskites
123 – A sulfur host based on titanium monoxide@carbon hollow spheres for advanced lithium-sulfur batteries
121 – Biodegradable black phosphorus-based nanospheres for in vivo photothermal cancer therapy

Let’s dive in to the data

We will use R for this analysis. If you want to work along, the script and data can be downloaded below. With a few edits, the script will also work for similar analysis of other journals.

First of all I retrieved three datasets.

  • Citation data for the journal. We’ll look at 2018 Journal Impact Factor, so we need citations in 2018 to papers in the journal published in 2016 and 2017. This can be retrieved from Scopus as a csv.
  • PubMed XML file for the journal to cover the articles that we want to analyse. Search term = "Nat Commun"[Journal] AND ("2016/01/01"[PDAT] : "2017/12/31"[PDAT])
  • PubMed XML file to get cell biology MeSH terms. Search term = "J Cell Sci"[Journal] AND ("2016/01/01"[PDAT] : "2017/12/31"[PDAT])

Using MeSH terms to segregate the dataset

Analysing the citation data is straightforward, but how can we classify the content of the dataset? I realised that I could use Medical Subject Headings (MeSH) from PubMed to classify the data. If I retrieved the same set of papers from PubMed and then checked which papers had MeSH terms matching those of a “biological” dataset, the citation data could be segregated. I used a set of J Cell Sci papers to do this. Note that these MeSH terms are not restricted to cell biology; they cover all kinds of biochemistry and other aspects of biology. The papers that do not match these MeSH terms are ecology, chemistry and physical sciences (many of these don’t have MeSH terms at all). We start by getting our biological MeSH terms.

## load the packages we need
library(XML)
library(tidyverse)

## extract a data frame from PubMed XML file
## This is modified from christopherBelter's pubmedXML R code
extract_xml <- function(theFile) {
  newData <- xmlParse(theFile)
  records <- getNodeSet(newData, "//PubmedArticle")
  pmid <- xpathSApply(newData,"//MedlineCitation/PMID", xmlValue)
  doi <- lapply(records, xpathSApply, ".//ELocationID[@EIdType = \"doi\"]", xmlValue)
  doi[sapply(doi, is.list)] <- NA
  doi <- unlist(doi)
  authLast <- lapply(records, xpathSApply, ".//Author/LastName", xmlValue)
  authLast[sapply(authLast, is.list)] <- NA
  authInit <- lapply(records, xpathSApply, ".//Author/Initials", xmlValue)
  authInit[sapply(authInit, is.list)] <- NA
  authors <- mapply(paste, authLast, authInit, collapse = "|")
  year <- lapply(records, xpathSApply, ".//PubDate/Year", xmlValue)
  year[sapply(year, is.list)] <- NA
  year <- unlist(year)
  articletitle <- lapply(records, xpathSApply, ".//ArticleTitle", xmlValue)
  articletitle[sapply(articletitle, is.list)] <- NA
  articletitle <- unlist(articletitle)
  journal <- lapply(records, xpathSApply, ".//ISOAbbreviation", xmlValue)
  journal[sapply(journal, is.list)] <- NA
  journal <- unlist(journal)
  volume <- lapply(records, xpathSApply, ".//JournalIssue/Volume", xmlValue)
  volume[sapply(volume, is.list)] <- NA
  volume <- unlist(volume)
  issue <- lapply(records, xpathSApply, ".//JournalIssue/Issue", xmlValue)
  issue[sapply(issue, is.list)] <- NA
  issue <- unlist(issue)
  pages <- lapply(records, xpathSApply, ".//MedlinePgn", xmlValue)
  pages[sapply(pages, is.list)] <- NA
  pages <- unlist(pages)
  abstract <- lapply(records, xpathSApply, ".//Abstract/AbstractText", xmlValue)
  abstract[sapply(abstract, is.list)] <- NA
  abstract <- sapply(abstract, paste, collapse = "|")
  ptype <- lapply(records, xpathSApply, ".//PublicationType", xmlValue)
  ptype[sapply(ptype, is.list)] <- NA
  ptype <- sapply(ptype, paste, collapse = "|")
  mesh <- lapply(records, xpathSApply, ".//MeshHeading/DescriptorName", xmlValue)
  mesh[sapply(mesh, is.list)] <- NA
  mesh <- sapply(mesh, paste, collapse = "|")
  theDF <- data.frame(pmid, doi, authors, year, articletitle, journal, volume, issue, pages, abstract, ptype, mesh, stringsAsFactors = FALSE)
  return(theDF)
}

# function to separate multiple entries in one column to many columns using | separator
split_into_multiple <- function(column, pattern = ", ", into_prefix){
  cols <- str_split_fixed(column, pattern, n = Inf)
  # Sub out the ""'s returned by filling the matrix to the right, with NAs which are useful
  cols[which(cols == "")] <- NA
  cols <- as_tibble(cols)
  # name the 'cols' tibble as 'into_prefix_1', 'into_prefix_2', ..., 'into_prefix_m'
  # where m = # columns of 'cols'
  m <- dim(cols)[2]
  names(cols) <- paste(into_prefix, 1:m, sep = "_")
  return(cols)
}

## First load the JCS data to get the MeSH terms of interest
jcsFilename <- "./jcs.xml"
jcsData <- extract_xml(jcsFilename)
# put MeSH into a df
meshData <- as.data.frame(jcsData$mesh, stringsAsFactors = FALSE)
colnames(meshData) <- "mesh"
# separate each MeSH into its own column of a df
splitMeshData <- meshData %>%
  bind_cols(split_into_multiple(.$mesh, "[|]", "mesh")) %>%
  select(starts_with("mesh_"))
# gather into a long key-value df, discarding empties
splitMeshData <- splitMeshData %>%
  gather(na.rm = TRUE) %>%
  filter(value != "NA")
# collate key value df of unique MeSH
uniqueMesh <- unique(splitMeshData)
# this gives us a data frame of cell biology MeSH terms

Now we need to load in the Nature Communications XML data from PubMed and also get the citation data into R.

## Now use a similar procedure to load the NC data for comparison
ncFilename <- "./nc.xml"
ncData <- extract_xml(ncFilename)
ncMeshData <- as.data.frame(ncData$mesh, stringsAsFactors = FALSE)
colnames(ncMeshData) <- "mesh"
splitNCMeshData <- ncMeshData %>%
  bind_cols(split_into_multiple(.$mesh, "[|]", "mesh")) %>%
  select(starts_with("mesh_"))
# make a new column to hold any matches of rows with MeSH terms which are in the uniqueMesh df
ncData$isCB <- apply(splitNCMeshData, 1, function(r) any(r %in% uniqueMesh$value))

## Next we load the citation data file retrieved from Scopus
scopusFilename <- "./Scopus_Citation_Tracker.csv"
# the structure of the file requires a little bit of wrangling, ignore warnings
upperHeader <- read_csv(scopusFilename, 
                                    skip = 5)
citationData <- read_csv(scopusFilename, 
                        skip = 6)
upperList <- colnames(upperHeader)
lowerList <- colnames(citationData)
colnames(citationData) <- c(lowerList[1:7],upperList[8:length(upperList)])

Next we need to perform a join to match up the PubMed data with the citation data.

## we now have two data frames, one with the citation data and one with the papers
# make both frames have a Title column
colnames(citationData)[which(names(citationData) == "Document Title")] <- "Title"
colnames(ncData)[which(names(ncData) == "articletitle")] <- "Title"
# ncData paper titles have a terminating period, so remove it
ncData$Title <- gsub("\\.$","",ncData$Title, perl = TRUE)
# add citation data to ncData data frame
allDF <- inner_join(citationData, ncData, by = "Title")

Now we’ll make some plots.

# Plot histogram with indication of mean and median
p1 <- ggplot(data=allDF, aes(allDF$'2018')) +
  geom_histogram(binwidth = 1) +
  labs(x = "2018 Citations", y = "Frequency") +
  geom_vline(aes(xintercept = mean(allDF$'2018',na.rm = TRUE)), col='orange', linetype="dashed", size=1) +
  geom_vline(aes(xintercept = median(allDF$'2018',na.rm = TRUE)), col='blue', linetype="dashed", size=1)

# Group outlier papers for clarity
p2 <- allDF %>% 
  mutate(x_new = ifelse(allDF$'2018' > 80, 80, allDF$'2018')) %>% 
  ggplot(aes(x_new)) +
  geom_histogram(binwidth = 1, col = "black", fill = "gray") +
  labs(x = "2018 Citations", y = "Frequency") +
  geom_vline(aes(xintercept = mean(allDF$'2018',na.rm = TRUE)), col='orange', linetype="dashed", size=1) +
  geom_vline(aes(xintercept = median(allDF$'2018',na.rm = TRUE)), col='blue', linetype="dashed", size=1)

# Plot the data for both sets of papers separately
p3 <- ggplot(data=allDF, aes(allDF$'2018')) +
  geom_histogram(binwidth = 1) +
  labs(title="",x = "Citations", y = "Count") +
  facet_grid(ifelse(allDF$isCB, "Cell Biol", "Removed") ~ .) +
  theme(legend.position = "none")

The citation data look typical: highly skewed, with few very highly cited papers and the majority (two-thirds) receiving less than the mean number of citations. The “cell biology” dataset and the non-cell biology dataset look pretty similar.
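That “two-thirds below the mean” pattern is exactly what a heavy-tailed distribution produces. As a sketch (simulated data, not fitted to the real dataset), drawing citation counts from a negative binomial distribution shows the same behaviour:

```r
# illustrative only: the negative binomial parameters are guesses, not fits
set.seed(42)
fakeCites <- rnbinom(7000, mu = 11.5, size = 0.8)
# the heavy right tail pulls the mean above the median...
mean(fakeCites)
median(fakeCites)
# ...so most papers sit below the mean
mean(fakeCites < mean(fakeCites))
```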

Now it is time to answer our main question. Do cell biology papers drag down the impact factor of the journal?

## make two new data frames, one for the cell bio papers and one for non-cell bio
cbDF <- subset(allDF,allDF$isCB == TRUE)
nocbDF <- subset(allDF,allDF$isCB == FALSE)
# print a summary of the 2018 citations to these papers for each df
> summary(allDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    8.00   11.48   14.00  238.00 
> summary(cbDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    7.00   10.17   13.00  226.00 
> summary(nocbDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    9.00   13.61   16.00  238.00 

The “JIF” for the whole journal is 11.48, whereas for the non-cell biology content it is 13.61. The cell biology dataset has a “JIF” of 10.17. So the rumour is true, but the effect is quite mild: the rumour implies that the cell biology impact factor is much lower than it actually is.

The reason “JIF” is in quotes is that it is notoriously difficult to calculate this metric. All citations are summed for the numerator, but the denominator comprises “citable items”. To get something closer to the actual JIF, we probably should remove non-citable items. These are Errata, Letters, Editorials and Retraction notices.

## We need to remove some article types from the dataset
itemsToRemove <- c("Published Erratum","Letter","Editorial","Retraction of Publication")
allArticleData <- as.data.frame(allDF$ptype, stringsAsFactors = FALSE)
colnames(allArticleData) <- "ptype"
splitallArticleData <- allArticleData %>%
  bind_cols(split_into_multiple(.$ptype, "[|]", "ptype")) %>%
  select(starts_with("ptype_"))
# make a new column to hold any matches of rows that are non-citable items
allDF$isNCI <- apply(splitallArticleData, 1, function(r) any(r %in% itemsToRemove))
# new data frame with only citable items
allCitableDF <- subset(allDF,allDF$isNCI == FALSE)

# Plot the data after removing "non-citable items for both sets of papers separately
p4 <- ggplot(data=allCitableDF, aes(allCitableDF$'2018')) +
  geom_histogram(binwidth = 1) +
  labs(title="",x = "Citations", y = "Count") +
  facet_grid(ifelse(allCitableDF$isCB, "Cell Biol", "Removed") ~ .) +
  theme(legend.position = "none")

After removal the citation distributions look a bit more realistic (notice that the earlier versions had many items with zero citations).

Citation distributions with non-citable items removed

Now we can redo the last part.

# subset new dataframes
cbCitableDF <- subset(allCitableDF,allCitableDF$isCB == TRUE)
nocbCitableDF <- subset(allCitableDF,allCitableDF$isCB == FALSE)
# print a summary of the 2018 citations to these papers for each df
> summary(allCitableDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    8.00   11.63   14.00  238.00 
> summary(cbCitableDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    4.00    8.00   10.19   13.00  226.00 
> summary(nocbCitableDF$'2018')
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   0.00    5.00    9.00   14.06   17.00  238.00 

Now the figures have changed. The “JIF” for the whole journal is 11.63, whereas for the non-cell biology content it would be 14.06. The cell biology dataset has a “JIF” of 10.19. To more closely approximate the JIF, we need to do:

> # approximate "impact factor" for the journal
> sum(allDF$'2018') / nrow(allCitableDF)
[1] 11.64056
> # approximate "impact factor" for the journal's cell biology content
> sum(cbDF$'2018') / nrow(cbCitableDF)
[1] 10.19216
> # approximate "impact factor" for the journal's non-cell biology content
> sum(nocbDF$'2018') / nrow(nocbCitableDF)
[1] 14.08123

This made only a minor change, probably because the dataset is so huge (7239 papers for two years with non-citable items removed). If we were to repeat this on another journal with more front content and fewer papers, this distinction might make a bigger change.

Note also that my analysis uses Scopus data whereas Web of Science numbers are used for JIF calculations (thanks to Anna Sharman for prompting me to add this).


So the rumour is true but the effect is not as big as people say. There’s a ~17% reduction in potential impact factor by including these papers rather than excluding them. However, these papers comprise ~63% of the corpus and they bring in an estimated revenue to the publisher of $12,000,000 per annum. No journal would forego this income in order to bump the JIF from 11.6 to 14.1.
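For the record, the ~17% figure is simple arithmetic on the two approximate “JIF” values computed above:

```r
jifAll <- 11.64     # whole journal, citable items as denominator
jifNonCB <- 14.08   # non-cell biology content only
# relative reduction caused by including the biological papers
1 - jifAll / jifNonCB   # ~0.17, i.e. about a 17% reduction
```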

It is definitely not true that these papers are under-performing: their citation rates are similar to those in the best journals in the field. Note that citation rates do not necessarily reflect the usefulness of a paper; for one thing, they are largely an indicator of the volume of a research field. Anyhow, next time you hear this rumour from someone, you can set them straight.

And I nearly managed to write an entire post without mentioning that JIF is a terrible metric, especially for judging individual papers in a journal, but then you knew that didn’t you?

The post title comes from “Communication Breakdown” by the mighty Led Zeppelin from their debut album. I was really tempted to go with “Dragging Me Down” by Inspiral Carpets, but Communication Breakdown was too good to pass up.

Rollercoaster IV: ups and downs of Google Scholar citations

Time for an update to a previous post. For the past few years, I have been using an automated process to track citations to my lab’s work on Google Scholar (details of how to set this up are at the end of this post).

Due to the nature of how Google Scholar tracks citations, it means that citations get added (hooray!) but might be removed (booo!). Using a daily scrape of the data it is possible to watch this happening. The plots below show the total citations to my papers and then a version where we only consider the net daily change.

Four years of tracking citations on Google Scholar

The general pattern is for papers to accrue citations and some do so faster than others. You can also see that the number of citations occasionally drops down. Remember that we are looking at net change here. So a decrease of one citation is masked by the addition of one citation and vice versa. Even so, you can see net daily increases and even decreases.
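To make “net change” concrete, here is a minimal sketch with made-up numbers. With only one total per paper per day, the day-on-day difference is all we can see:

```r
# hypothetical daily citation totals for one paper
totals <- c(1200, 1203, 1202, 1206)
# net daily change is the day-on-day difference
netChange <- diff(totals)
netChange  # 3 -1 4: the -1 could be one lost citation, or two lost plus one gained
```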

It’s difficult to see what is happening down at the bottom of the graph so let’s separate them out. The two plots below show the net change in citations, either on the same scale (left) or scaled to the min/max for that paper (right).

Citation tracking for individual papers

The papers are shown here ranked from the ones that accrued the most citations down to the ones that gained no citations while they were tracked. Five “new” papers began to be tracked very recently. This is because I changed the way that the data are scraped (more on this below).

The version on the right reveals a few interesting things. Firstly, there seem to be “bump days” where all of the papers get a jolt in one direction or another. This could be something internal to Google, or the addition of several items which all happen to cite a bunch of my papers. The latter explanation is unlikely, given the frequency of changes seen in the whole dataset. Secondly, some papers are highly volatile, with daily toggling of citation numbers. I have no idea why this may be. The two plots below demonstrate these points: the arrow shows a “bump day”, and the plot on the right shows two review papers that have volatile citation numbers.

I’m going to keep the automated tracking going. I am a big fan of Google Scholar, as I have written previously, but quoting some of the numbers makes me uneasy, knowing how unstable they are.

Note that you can use R to get aggregate Google Scholar data as I have written about previously.

How did I do it?

The analysis would not be possible without automation. I use a daemon to run a shell script every day. This script calls a python routine which outputs the data to a file. I wrote something in Igor to load each day’s data, crunch the numbers, and make the graphs. The details of this part are in the previous post.
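As a sketch of the scheduling part (assuming plain cron rather than whichever daemon you prefer, and with a hypothetical script path), the daily run can be set up with a single crontab entry:

```conf
# hypothetical crontab entry: run the scraping script at 07:30 every day
30 7 * * * /path/to/scrape_scholar.sh
```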

I realised that I wasn’t getting all of my papers using the previous shell script. Well, this is a bit of a hack, but I changed the calls that I make to scholar.py so that I request data from several years.

cd /directory/for/data/
python scholar.py -c 500 --author "Sam Smith" --after=1999 --csv > g1999.csv
sleep $[ ( $RANDOM % 15 )  + 295 ]
# and so on
python scholar.py -c 500 --author "Sam Smith" --after=2019 --csv > g2019.csv
OF=all_$(date +%Y%m%d).csv
cat g*.csv > $OF
rm g*.csv

I found that I got different results for each year I made the query. My first change was to just request all years using a loop to generate the calls. This resulted in an IP ban for 24 hours! Through a bit of trial-and-error I found that reducing the queries to ten and waiting a polite amount of time between queries avoided the ban.

The hacky part was to figure out which year requests I needed to make to make sure I got most of my papers. There is probably a better way to do this!

I still don’t get every single paper and I retrieve data for a number of papers on which I am not an author – I have no idea why! I exclude the erroneous papers using the Igor program that reads all the data and plots out everything. The updated version of this code is here.

As described earlier I have many Rollercoaster songs in my library. This time it’s the song by Sleater-Kinney from their “The Woods” album.

Turn A Square: generative aRt

A while back I visited Artistes & Robots in Paris. Part of the exhibition was on the origins of computer-based art. Nowadays this is referred to as generative art, where computers generate artwork according to rules specified by the programmer. I wanted to emulate some of the early generative artwork I saw there, using R.

Some examples of early generative art

Georg Nees was a pioneer of computer-based artwork and he has a series of pieces using squares. An example I found on the web was Schotter (1968).


Another example using rotation of squares is Boxes by William J. Kolomyjec.


I set out to generate similar images where the parameters can be tweaked to adjust the final result. The code is available on GitHub. Here is a typical image. Read on for an explanation of the code.

Generative art made in R

Generating a grid of squares

I started by generating a grid of squares. We can use the segments command in R to do the drawing.

# centres of the squares; the original values were lost, so a 10 x 10 grid is assumed here
xWave <- seq(1, 10, 1)
yWave <- seq(1, 10, 1)

for (i in seq_along(xWave)) {
  xCentre <- xWave[i]
  for (j in seq_along(yWave)) {
    yCentre <- yWave[j]
    # four corners of a square, 0.8 units wide, centred on (xCentre, yCentre)
    lt <- c(xCentre - 0.4,yCentre - 0.4)
    rt <- c(xCentre + 0.4,yCentre - 0.4)
    rb <- c(xCentre + 0.4,yCentre + 0.4)
    lb <- c(xCentre - 0.4,yCentre + 0.4)
    # each row is one side of the square: start (x,y) then end (x,y)
    new_shape_start <- rbind(lt,rt,rb,lb)
    new_shape_end <- rbind(rt,rb,lb,lt)
    new_shape <- cbind(new_shape_start,new_shape_end)
    if(i == 1 && j == 1) {
      multiple_segments <- new_shape
    } else {
      multiple_segments <- rbind(multiple_segments,new_shape)
    }
  }
}
plot(0, 0, xlim = range(xWave) + c(-1, 1), ylim = range(yWave) + c(-1, 1),
     col = "white", xlab = "", ylab = "", axes=F) 
segments(x0 = multiple_segments[,1],
         y0 = multiple_segments[,2],
         x1 = multiple_segments[,3],
         y1 = multiple_segments[,4])
A simple grid

We’re using base R graphics to make this artwork – nothing fancy. Probably the code can be simplified!

Add some complexity

The next step was to add some complexity and flexibility. I wanted to be able to control three things:

  1. the size of the grid
  2. the grout (distance between squares)
  3. the amount of hysteresis (distorting the squares, rather than rotating them).

Here is the code. See below for some explanation and examples.

# this function will make grid art
# arguments define the number of squares in each dimension (xSize, ySize)
# grout defines the gap between squares (none = 0, max = 1)
# hFactor defines the amount of hysteresis (none = 0, max = 1, moderate = 10)
make_grid_art <- function(xSize, ySize, grout, hFactor) {
  # centres of the squares
  xWave <- seq_len(xSize)
  yWave <- seq_len(ySize)
  axMin <- min(min(xWave) - 1,min(yWave) - 1)
  axMax <- max(max(xWave) + 1,max(yWave) + 1)
  nSquares <- length(xWave) * length(yWave)
  x <- 0
  halfGrout <- (1 - grout) / 2
  for (i in seq_along(yWave)) {
    yCentre <- yWave[i]
    for (j in seq_along(xWave)) {
      if(hFactor < 1) {
        # no hysteresis: corners sit exactly halfGrout from the centre
        hyst <- rnorm(8, halfGrout, 0)
      } else {
        # noise follows a half-cycle of a sine wave as we traverse the grid
        hyst <- rnorm(8, halfGrout, sin(x / (nSquares - 1) * pi) / hFactor)
      }
      xCentre <- xWave[j]
      lt <- c(xCentre - hyst[1],yCentre - hyst[2])
      rt <- c(xCentre + hyst[3],yCentre - hyst[4])
      rb <- c(xCentre + hyst[5],yCentre + hyst[6])
      lb <- c(xCentre - hyst[7],yCentre + hyst[8])
      new_shape_start <- rbind(lt,rt,rb,lb)
      new_shape_end <- rbind(rt,rb,lb,lt)
      new_shape <- cbind(new_shape_start,new_shape_end)
      if(i == 1 && j == 1) {
        multiple_segments <- new_shape
      } else {
        multiple_segments <- rbind(multiple_segments,new_shape)
      }
      x <- x + 1
    }
  }
  par(mar = c(0,0,0,0))
  plot(0, 0, xlim = c(axMin, axMax), ylim = c(axMin, axMax),
       col = "white", xlab = "", ylab = "", axes=F, asp = 1) 
  segments(x0 = multiple_segments[,1],
           y0 = multiple_segments[,2],
           x1 = multiple_segments[,3],
           y1 = multiple_segments[,4])
}

The amount of hysteresis is an important element. It determines the degree of distortion for each square. The distortion is done by introducing noise into the coordinates of each square that we map onto the grid. The algorithm used for the distortion is based on a half-cycle of a sine wave: it starts and finishes at zero distortion, and in between it rises and falls symmetrically, introducing more and then less distortion as we traverse the grid.
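To see the shape of that noise profile outside of R, here is a short Python sketch of the standard deviation used for the x-th square (the function name is mine; the formula matches the hyst line in the R function):

```python
import math

def distortion_sd(x, n_squares, h_factor):
    """SD of the corner noise for square x (0-indexed) of n_squares,
    as in: rnorm(8, halfGrout, sin(x / (nSquares - 1) * pi) / hFactor)"""
    return math.sin(x / (n_squares - 1) * math.pi) / h_factor

profile = [distortion_sd(x, 100, 10) for x in range(100)]
# distortion starts at zero, peaks mid-grid, and returns to (almost) zero
```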

Changing the line of code that assigns a value to hyst will therefore change the artwork. Let’s look at some examples.

# the parameter values below are illustrative; tweak them to taste
# square grid with minimal hysteresis
make_grid_art(10, 10, 0.2, 100)
# square grid (more squares) more hysteresis
make_grid_art(20, 20, 0.2, 5)
# rectangular grid same hysteresis
make_grid_art(20, 10, 0.2, 5)
# same grid with no hysteresis
make_grid_art(20, 10, 0.2, 0)
# square grid moderate hysteresis and no grout
make_grid_art(10, 10, 0, 10)

Hopefully this will give you some ideas for simple schemes to generate artwork in R. I plan to add to the GitHub repo when I get the urge. There is already some other generative artwork code in there for Igor Pro. Two examples are shown below (I don’t think I have written about this before on quantixed).

The post title comes from “Turn A Square” by The Shins from their album Chutes Too Narrow.

Timestretched: audio stretching on the command line

I was recently reminded of the wonders of paulstretch by an 8-fold slowed-down version of Pyramid Song by Radiohead.

Slowed down version of Pyramid Song

Paulstretch is an audio manipulation widget that can stretch or compress the time of an audio recording. Note that it doesn’t “slow down” or “speed up” a recording: it resamples the audio and recasts it over a different time scale while maintaining the pitch. There are lots of examples on the web of paulstretch stretching a song, but fewer examples of the other way around. I wondered what time compression would sound like for a slow song.

There’s a plugin for Audacity, which allows stretching but does not allow compressing. There is a python version available to run paulstretch from the command line, and another user has added the ability to process lossless audio (in FLAC format). My fork is here (with a very minor change) for permanence. These scripts all allow time compression as well as time stretching.

I compressed Ebony Tears by Cathedral, a doom metal tune from 1991 which runs at ~56 bpm. Two-fold compression (-s 0.5 with 0.25 sampling) recasts the song as a 112 bpm heavy metal tune.

Ebony Tears twice as fast
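The arithmetic of the stretch factor is simple: the tempo scales by the reciprocal of s while the pitch stays put. A quick Python check (the helper function is mine, not part of paulstretch):

```python
def stretched_bpm(bpm_in, stretch):
    """Tempo after paulstretch with the given stretch factor
    (< 1 compresses, > 1 stretches); pitch is unaffected."""
    return bpm_in / stretch

stretched_bpm(56, 0.5)  # → 112.0, so the ~56 bpm tune becomes ~112 bpm
```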

For a more well-known example of a slow song, I went for “Last Night I Dreamt That Somebody Loved Me” by The Smiths. Compression works OK with this song. At the risk of sounding like a total philistine: the intro, with the animal noises and Johnny Marr plonking away on the piano, becomes mercifully shortened. Then the main song (0:58) turns into a more jolly reel at 0.5 compression.

Last night… twice as fast

…and a somehow more urgent moody tune (at 1:28) with 0.75. Disclaimer: I love the original version of this song, at the correct pace. I am not saying this is an improvement in any way.

Last night… 1.5 times as fast

Compressing songs was fun, but somehow not as fascinating as stretching. The trick is to pick songs with minimal percussion and vocals so that the stretched version sounds like something other than prolonged noise. Here is a four-fold stretch of Into The Groove by Madonna. The vocals are OK, but those gated 80s drums sound awful when smudged over a longer time window.

Intooo the grooooove

And here are two-fold and four-fold versions of Joe Satriani’s instrumental The Forgotten (Part One). The original is an agitated guitar workout. It is transformed into an ambient soundscape with stretching.

The Forgotten x2
The Forgotten x4

The command line versions of paulstretch are easy to use and fun to experiment with. Feel free to comment with suggestions for good contenders for stretching or compressing.

The post title comes from “Timestretched” by The Divine Comedy from their Regeneration LP.

Garmonbozia: Using R to look at Garmin CSV data

Garmin Connect has a number of plots built in, but to take a deeper dive into all your fitness data, you need to export a CSV and fire up R. This post is a quick guide to some possibilities for running data. 

There are a few things that I wanted to look at. For example, how does my speed change through the year? How does that compare to previous years? If I see some trends, is that the same for short runs and long runs? I wanted to look at the cumulative distance I’d run each year… There are lots of things that would be good to analyse.

Garmin Connect has a simple way to export data as a CSV. There are other ways to get your data, but the web interface is pretty straightforward. To export a CSV of your data, head to the Garmin Connect website, log in and select Activities, All Activities. On this page, filter the activities for whatever you want to export. I clicked Running (you can filter some more if you want), and then scrolled down, letting the data load onto the page, until I went back as far as I wanted. In the top right corner, click Export CSV and you will download whatever is displayed on the page.

The code to generate these plots, together with some fake data to play with can be found here.

Now in R, load in the CSV file

# ggplot2 and dplyr are needed for the plots below
library(ggplot2)
library(dplyr)

file_name <- file.choose()
df1 <- read.csv(file_name, header = TRUE, stringsAsFactors = FALSE)

We have a data frame, but we need to rejig the Dates and a few other columns before we can start making plots.

# format Date column to POSIXct
df1$Date <- as.POSIXct(strptime(df1$Date, format = "%Y-%m-%d %H:%M:%S"))
# format Avg.Pace to POSIXct
df1$Avg.Pace <- as.POSIXct(strptime(df1$Avg.Pace, format = "%M:%S"))
# make groups of different distances using ifelse
df1$Type <- ifelse(df1$Distance < 5, "< 5 km", ifelse(df1$Distance < 8, "5-8 km", ifelse(df1$Distance < 15, "8-15 km", ">15 km")))
# make factors for these so that they're in the right order when we make the plot
df1$Type_f = factor(df1$Type, levels=c("< 5 km","5-8 km","8-15 km", ">15 km"))

Now we can make the first plot. The code for the first one is below, with all the code for the other plots shown below that.

# plot out average pace over time
p1 <- ggplot( data = df1, aes(x = Date,y = Avg.Pace, color = Distance)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth(color = "orange") +
  labs(x = "Date", y = "Average Pace (min/km)")

The remainder of the code for the other plots is shown below. The code is commented. For some of the plots, a bit of extra work on the data frame is required.

# plot out same data grouped by distance
p2 <- ggplot( data = df1, aes(x = Date,y = Avg.Pace, group = Type_f, color = Type_f)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth() +
  labs(x = "Date", y = "Average Pace (min/km)", colour = NULL)
# now look at stride length. first remove zeros
df1[df1 == 0] <- NA
# now find earliest valid date
date_v <- df1$Date
# change dates to NA where there is no avg stride data
date_v <- as.Date.POSIXct(ifelse(df1$Avg.Stride.Length > 0, df1$Date, NA))
# find min and max for x-axis
earliest_date <- min(date_v, na.rm = TRUE)
latest_date <- max(date_v, na.rm = TRUE)
# make the plot
p3 <- ggplot(data = df1, aes(x = Date,y = Avg.Stride.Length, group = Type_f, color = Type_f)) +
  geom_point() + 
  ylim(0, NA) + xlim(as.POSIXct(earliest_date), as.POSIXct(latest_date)) +
  geom_smooth() +
  labs(x = "Date", y = "Average stride length (m)", colour = NULL)
df1$Avg.HR <- as.numeric(as.character(df1$Avg.HR))
p4 <- ggplot(data = df1, aes(x = Date,y = Avg.HR, group = Type_f, color = Type_f)) +
  geom_point() +
  ylim(0, NA) + xlim(as.POSIXct(earliest_date), as.POSIXct(latest_date)) +
  geom_smooth() +
  labs(x = "Date", y = "Average heart rate (bpm)", colour = NULL)
# plot out average pace per distance coloured by year
p5 <- ggplot( data = df1, aes(x = Distance,y = Avg.Pace, color = Date)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth(color = "orange") +
  labs(x = "Distance (km)", y = "Average Pace (min/km)")
# make a date factor for year to group the plots
df1$Year <- format(as.Date(df1$Date, format="%d/%m/%Y"),"%Y")
p6 <- ggplot( data = df1, aes(x = Distance,y = Avg.Pace, group = Year, color = Year)) + 
  geom_point() +
  scale_y_datetime(date_labels = "%M:%S") +
  geom_smooth() +
  labs(x = "Distance", y = "Average Pace (min/km)")
# Cumulative sum over years
df1 <- df1[order(as.Date(df1$Date)),]
df1 <- df1 %>% group_by(Year) %>% mutate(cumsum = cumsum(Distance))
p7 <- ggplot( data = df1, aes(x = Date,y = cumsum, group = Year, color = Year)) + 
  geom_line() +
  labs(x = "Date", y = "Cumulative distance (km)")
# Plot these cumulative sums overlaid
# Find New Year's Day for each and then work out how many days have elapsed since
df1$nyd <- paste(df1$Year,"-01-01",sep = "")
df1$Days <- as.Date(df1$Date, format="%Y-%m-%d") - as.Date(as.character(df1$nyd), format="%Y-%m-%d")
# Make the plot
p8 <- ggplot( data = df1, aes(x = Days,y = cumsum, group = Year, color = Year)) + 
  geom_line() +
  scale_x_continuous() +
  labs(x = "Days", y = "Cumulative distance (km)")
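The New Year’s Day trick above amounts to computing a day-of-year offset so that years can be overlaid on a common x-axis. The same calculation in Python, as a sketch:

```python
from datetime import date

def days_since_new_year(d):
    """Days elapsed since 1 January of d's year, mirroring the
    nyd/Days columns in the R code above."""
    return (d - date(d.year, 1, 1)).days

days_since_new_year(date(2020, 2, 1))  # → 31
```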

Finally, we can save all of the plots using ggsave.

# save all plots
ggsave("allPace.png", plot = p1, width = 8, height = 4, dpi = "print")
ggsave("paceByDist.png", plot = p2, width = 8, height = 4, dpi = "print")
ggsave("strideByDist.png", plot = p3, width = 8, height = 4, dpi = "print")
ggsave("HRByDist.png", plot = p4, width = 8, height = 4, dpi = "print")
ggsave("allPaceByDist.png", plot = p5, width = 8, height = 4, dpi = "print")
ggsave("paceByDistByYear.png", plot = p6, width = 8, height = 4, dpi = "print")
ggsave("cumulativeDistByYear.png", plot = p7, width = 8, height = 4, dpi = "print")
ggsave("cumulativeDistOverlay.png", plot = p8, width = 8, height = 4, dpi = "print")

I think the code might fail if you don’t record all of the fields that I do. For example if heart rate data is missing or stride length is not recorded, I’m not sure what the code will do. The aim here is to give an idea of what sorts of plots can be generated just using the summary data in the CSV provided by Garmin. Feel free to make suggestions in the comments below.

The post title comes from “Garmonbozia” by Superdrag from the Regretfully Yours album. Apparently Garmonbozia is something eaten by the demons in the Black Lodge in the TV series Twin Peaks.

Not What You Want: our new paper on a side effect of GFP nanobodies

We have a new preprint out – it is a cautionary tale about using GFP nanobodies in cells. This short post gives a bit of background to the work. Please read the paper if you are interested in using GFP nanobodies in cells, you can find it here.

Paper in a nutshell: Caution is needed when using GFP nanobodies because they can inhibit their target protein in cells.

People who did the work: Cansu Küey did most of the work for the paper. She discovered the inhibition side effect of the dongles. Gabrielle Larocque contributed a figure where she compared dongle-knocksideways with regular knocksideways. The project was initiated by Nick Clarke who made our first set of dongles and tested which fluorescent proteins the nanobody binds in cells. Lab people and their profiles can be found here.

Background: Many other labs have shown that nanobodies can be functionalised so that you can stick new protein domains onto GFP-tagged proteins to do new things. This is useful because it means you can “retrofit” an existing GFP knock-in cell line or organism to do new things like knocksideways without making new lines. In fact there was a recent preprint which described a suite of functionalised nanobodies that can confer all kinds of functions to GFP.

Like many other labs we were working on this method. We thought functionalised GFP nanobodies resembled “dongles” – those adaptors that Apple makes so much money from – that convert one port to another.

Dongles, dongles, dongles… (photo by Rex Hammock, licensed for reuse)

A while back we made several different dongles. We were most interested in a GFP nanobody with an additional FKBP domain that would allow us to do knocksideways (or FerriTagging) in GFP knock-in cells. For those that don’t know, knocksideways is not a knockout or a knockdown, but a way of putting a protein somewhere else in the cell to inactivate it. The most common place to send a protein is to the mitochondria.

Knocksideways works by joining FKBP and FRB (on the mitochondria) using rapamycin. Normally FKBP is fused to the protein of interest (top). If we just have a GFP tag, we can’t do knocksideways (middle). If we add a dongle (bottom) we can attach FKBP domains to allow knocksideways to happen.

We found that dongle-knocksideways works really well and we were very excited about this method. Here we are removing GFP-clathrin from the mitotic spindle in seconds using dongle knocksideways.

GFP-clathrin, shown here in blue is sent to the mitochondria (yellow) using rapamycin. This effect is only possible because of the dongle which adds FKBP to GFP via a GFP nanobody.

Since there are no specific inhibitors of endocytosis, we thought dongle knocksideways would be cool to try in cells with dynamin-2 tagged with GFP at both alleles. There is a line from David Drubin’s lab which is widely used. This would mean we could put the dongle plasmids on Addgene and everyone could inhibit endocytosis on-demand!

Initial results were encouraging. We could put dynamin onto mitochondria alright.

Dynamin-2-GFP undergoing dongle-knocksideways. The Mitotrap is shown in red and dynamin is green.

But we hit a problem. It turns out that dongle binding to dynamin inhibits endocytosis. So we have unintended inhibition of the target protein. This is a big problem because the power of knocksideways comes from being able to observe normal function and then rapidly switch it off. So if there is inhibition before knocksideways, the method is useless.

Now, this problem could be specific to dynamin or it might be a general problem with all targets of dongles. Either way, we have switched away from this method and have written this manuscript to alert others to the side effects of dongles. We discuss possible ways forward for this method and also point out some applications of the nanobody technology that are unaffected by our observations.

The post title comes from “Not What You Want” by Sleater-Kinney from their wonderful Dig Me Out record.

Five Get Over Excited: Academic papers that cite quantixed posts

Anyone who maintains a website is happy that people out there are interested enough to visit. Web traffic is one thing, but I take the greatest pleasure in seeing quantixed posts cited in academic papers.

I love the fact that some posts on here have been cited in the literature more than some of my actual papers.

It’s difficult to track citations to web resources. This is partly my fault: I think it is possible to register posts so that they have a DOI, but I have not done this, so tracking is a difficult task. Websites are part of what is known as the grey literature: items that are not part of traditional academic publishing.

The most common route for me to discover that a post has been cited is when I actually read the paper. There are four examples that spring to mind: here, here, here and here. With these papers, I read the paper and was surprised to find quantixed cited in the bibliography.

Vanity and curiosity made me wonder if there were other citations I didn’t know about. A cited reference search in Web of Science pulled up two more: here and here.

A bit of Googling revealed yet more citations, e.g. two quantixed posts are cited in this book. And another citation here.

OK so quantixed is not going to win any “highly cited” prizes or develop a huge H-index (if something like that existed for websites). But I’m pleased that 1) there are this many citations given that there’s a bias against citing web resources, and 2) the content here has been useful to others, particularly for academic work.

All of these citations are to posts looking at impact factors, metrics and publication lag times. In terms of readership, these posts get sustained traffic, but currently the most popular posts on quantixed are the “how to” guides, with LaTeX to Word and Back seeing the most traffic. Somewhere in between citation and web traffic are cases when quantixed posts get written about elsewhere, e.g. in a feature in Nature by Kendall Powell.

The post title comes from “Five Get Over Excited” by The Housemartins. A band with a great eye for song titles, it can be found on the album “The People Who Grinned Themselves to Death”.