Pentagrammarspin: why twelve pentagons?

This post has been in my drafts folder for a while. With the World Cup here, it’s time to post it!

There is a rule that a 3D assembly of hexagons must contain at least twelve pentagons in order to form a closed polyhedral shape. This post takes a look at why this is true.

First, some examples from nature. The stinkhorn fungus Clathrus ruber has a largely hexagonal layout with pentagons inserted. The core of HIV has to contain twelve pentagons (shown in red, in this image from the Briggs group) amongst many hexagonal units. My personal favourite, the clathrin cage, can assemble into many buckminsterfullerene-like shapes, but all must contain at least twelve pentagons with a variable number of hexagons.

The case of clathrin is particularly interesting because clathrin triskelia can assemble into a flat hexagonal lattice on membranes. If clathrin is going to coat a vesicle, twelve pentagons need to be introduced, which means quite a bit of rearrangement.

You can see the same rule in everyday objects. The best example is a football, or soccer ball, if you are reading in the USA.

The classic design of football has precisely twelve pentagonal and twenty hexagonal panels. The road sign for football stadia here in the UK shows a weirdly distorted hexagonal array that has no pentagons. 22,543 people signed a petition to pressurise the authorities to change it, but the Government responded that it was too costly to correct this geometrical error.

So why do all of these assemblies have 12 pentagons?

In the classic text “On Growth and Form” by D’Arcy Wentworth Thompson, polyhedral forms in nature are explored in some detail. In the wonderfully titled section On Concretions, Spicules, etc., the author catalogues polyhedral forms in natural objects.

One example is Dorataspis, shown left. The layout is identical to the D6 hexagonal barrel assembly of a clathrin cage shown above. There is a belt of six hexagons, plus one at the top and one at the bottom (eight in total), with twelve pentagons between the hexagons. The book contains an explanation of the maths behind why there must be twelve pentagons in such assemblies, but it’s obfuscated in bizarre footnotes in Latin. I’ll attempt to explain it below.

To shed some light on this we need the help of Euler’s formulae. The surface of a polyhedron in 3D is composed of faces, edges and vertices. If we think back to the football, the faces are the pentagonal and hexagonal panels, the edges are the stitching where two panels meet, and the vertices are where three edges come together. We can denote faces, edges and vertices as f, e and v, respectively. These are 2D, 1D and zero-dimensional objects, respectively. Euler’s formula, which holds for any convex polyhedron (and anything topologically equivalent to a sphere), is:

\(f - e + v = 2\)

If you think about a cube, it has six faces, twelve edges and eight vertices. So, 6 - 12 + 8 = 2. We can also check the football above. This has 32 faces (twelve pentagons and twenty hexagons), 90 edges and 60 vertices: 32 - 90 + 60 = 2. Feel free to check it with other polyhedra!
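
If you would rather not do the arithmetic by hand, the same check takes a couple of lines of R (nothing assumed beyond the face, edge and vertex counts above):

# Euler's formula: faces - edges + vertices should equal 2
euler <- function(f, e, v) f - e + v
euler(6, 12, 8)    # cube: 2
euler(32, 90, 60)  # football: 2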

Euler found a second formula, which is true for polyhedra where three edges come together at every vertex.

\(\sum (6-n)f_{n} = 12\)

In this formula, \(f_{n}\) means the number of faces that are n-gons.

So let’s say we have a dodecahedron, which is a polyhedron made of twelve pentagons. Here \(n = 5\) and \(f_{5} = 12\), and you can see that \((6-5)\times 12 = 12\).

Let’s take a more complicated object, like the football. Now we have:

\((6-6)\times 20 + (6-5)\times 12 = 12\)

You can now see why the twelve pentagons are needed. Because 6 - 6 = 0, we can add as many hexagons as we like and they will add nothing to the left-hand side. As long as the twelve pentagons are there, we will have a closed polyhedron; without them we don’t. This is the answer to why there must be twelve pentagons in a closed polyhedral assembly.
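
To see this numerically, here is a one-function R check of the formula above (nothing assumed beyond twelve pentagons and a variable hexagon count):

# sum of (6 - n) * f_n for twelve pentagons plus a varying number of hexagons
sixMinusN <- function(nHex) (6 - 5) * 12 + (6 - 6) * nHex
sapply(c(0, 2, 20, 10000), sixMinusN)  # always 12, however many hexagons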

So how did Euler get to the second equation? You might have spotted this yourself from the f, e, v values for the football. Did you notice that the ratio of edges to vertices is 3:2? This is because each edge has a vertex at either end (it is a 1D object), and remember that we are dealing with polyhedra with three edges at each vertex, so \(v = \frac{2}{3}e\). Also, each edge sits at the boundary between two polygons, so \( e = \frac{1}{2}\sum n f_{n}\). You can check that with the values for the cube or football above. Finally, \(f = \sum f_{n}\), which just means that the number of faces is the sum of the numbers of each type of n-gon. This means that:

\(f - e + v = 2\)

Can be turned into

\(f - \frac{1}{3}e = \sum f_{n} - \frac{1}{6}\sum n f_{n} = 2\)

Let’s multiply by 6 to get, oh yes

\(\sum (6-n)f_{n} = 12\)

There are some topics for further exploration here:

  • You can add 0, 2 or 10000 hexagons to 12 pentagons to make a polyhedron, but can you add just one?
  • What happens when you add a few heptagons into the array?

Image credits (free-to-use/wiki or as credited below):

Clathrus ruber – tineye search didn’t find source.

HIV cores – Briggs Group

Exploded football – Quora

The post title comes from “Pentagrammarspin” by Steve Hillage from the 2006 remaster of his LP Fish Rising

Rollercoaster III: yet more on Google Scholar

In a previous post I made a little R script to crunch Google Scholar data for a given scientist. The graphics were done in base R and looked a bit ropey. I thought I’d give the code a spring clean – it’s available here. The script is called ggScholar.R (rather than gScholar.R). Feel free to run it and raise an issue or leave a comment if you have some ideas.

I’m still learning how to get things looking how I want them using ggplot2, but this is an improvement on the base R version.

As described earlier I have many Rollercoaster songs in my library. This time it’s the song and album by slowcore/dream pop outfit Red House Painters.

Ten Years vs The Spread: Calculating publication lag times in R

There have been several posts on this site about publication lag times. You can read them here. Lag times are the delays in the dissemination of scientific data introduced by the process of publishing the paper in a journal. Nowadays, your paper can be online in a few hours using a preprint server. However, this work is not peer reviewed. Journals organise a formal peer review and provide some sort of certification of the work. They typeset the work, and all of this delays the dissemination of work in a journal.

To look at publication delays, you can use PubMed data, which is incomplete but can give insight into how long these delays can be. Previous posts used a Ruby script to make a csv file from PubMed XML output, which was then used in Igor to calculate the publication lag times. There is another method detailed in this excellent post by Daniel Himmelstein.

I recently posted a figure for Nature Communications lag times on Twitter and was asked to generate others. I figured that I should write an R script and people can make their own!

The PubMedLagR code is available here with instructions for use.

A query for Nature Communications data at PubMed, such as:

nat commun[ta] AND 2010 : 2018[pdat] AND journal article[pt]

retrieves all papers for this journal. The range from 2010 to 2018 is for illustration; this journal has only been in operation for these years. Filtering for journal articles, in an attempt to get rid of reviews and front matter, is wise but doesn’t always work. Again, this journal doesn’t carry much of this material, so the filter is also for illustration. Getting your query right is very important.

Save the results in XML format and then run the R script as directed. This should give a csv of the data and a png of the lag times.
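
The linked script does the work, but to give a flavour of what is involved, here is a minimal sketch (not the PubMedLagR code itself) using the xml2 package. It assumes the search results were saved as pubmed_result.xml and that records carry “received” and “accepted” dates in their History section:

library(xml2)
# minimal sketch: extract received and accepted dates from PubMed XML
# and calculate the lag time in days (the file name is an assumption)
doc <- read_xml("pubmed_result.xml")
articles <- xml_find_all(doc, "//PubmedArticle")

getDate <- function(article, status) {
  xpath <- paste0(".//PubMedPubDate[@PubStatus='", status, "']")
  y <- xml_text(xml_find_first(article, paste0(xpath, "/Year")))
  m <- xml_text(xml_find_first(article, paste0(xpath, "/Month")))
  d <- xml_text(xml_find_first(article, paste0(xpath, "/Day")))
  as.Date(paste(y, m, d, sep = "-"), format = "%Y-%m-%d")
}

received <- as.Date(sapply(articles, getDate, status = "received"), origin = "1970-01-01")
accepted <- as.Date(sapply(articles, getDate, status = "accepted"), origin = "1970-01-01")
recAcc <- as.numeric(accepted - received)  # received-to-accepted lag in days
summary(recAcc)
hist(recAcc, breaks = 50, xlab = "Lag time (days)", main = "")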

This is data from Nature Communications. Colleagues had two separate papers accepted at this journal and experienced long delays. I was interested to see if papers were generally taking longer to publish here. Of course we do not know why. Delays are partly the fault of the authors, the reviewers and the journal and it is not possible to say why publication lag times are increasing for this journal year-on-year. The journal has grown in terms of number of papers published, has this introduced inefficiencies? Are reviewers being slow to review? Are they being more demanding? Are Editors not marshalling the referee reports and providing clear guidance to authors? Allowing too much time and too many rounds of revision? Are authors being too slow to do further experimental work? The answer will be yes to some of these questions for some of the papers.

This is not to focus on Nature Communications, it’s one of a few journals that many colleagues complain is too slow to publish their work. With this code you can have a look at the journal you are interested in submitting to and consider whether there is a more rapid venue for your work.

Update:

I changed the code slightly and prettified the plots just a little. Below are some plots for Nature Cell Biology, Nature Neuroscience. I also did a search for clathrin or CRISPR papers over the same time period. These keyword searches are fairly flat, whereas the journal-specific increase in publication lag time can be seen.

The lag times at Nature Neuroscience look artificially low and then seem to have jumped up in 2016 to be something similar to Nature Cell Biology or Nature Communications.

Edit

I neglected to point out that the code truncates the y-axis in the bottom right plot to 1000 days or the maximum lag time, whichever is smaller. This is because it gets difficult to see the data points if there is an outlier, which might be due to an error in PubMed data.

A reader commented on Twitter that some poor paper had a lag time of almost 1000 days. Well, due to the y-axis truncation we don’t see that 9 papers in Nature Communications since 2010 have lag times (RecAcc) of > 1000 days. The record holder has a lag time of 1561 days! I checked that this was not a PubMed error by looking at the dates on the paper.
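
To find these outliers yourself, you can interrogate the csv that the script writes out. This assumes the output has a RecAcc column holding the received-to-accepted lag in days, and the file name here is just a placeholder:

# count and inspect papers with a received-to-accepted lag over 1000 days
lags <- read.csv("natcommun_lagtimes.csv")
sum(lags$RecAcc > 1000, na.rm = TRUE)  # how many papers exceed 1000 days
lags[which(lags$RecAcc > 1000), ]      # look at the offending records
max(lags$RecAcc, na.rm = TRUE)         # the record holder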

Notes

Date information is not available in PubMed for every paper unfortunately. This is especially true of older papers.

The date information is supplied to PubMed from the journal. These dates are not necessarily accurate: 1) you can see occasional errors in the data, 2) journals sometimes “reset the clock” on papers and treat resubmissions as new submissions.

The post title is taken from “10 Years vs The Spread” by Wing-Tipped Sloat from the LP Chewyfoot. Obviously the song has nothing to do with smoothed kernel density estimates of journal publication lag times, but the title was incredibly apt.

Cloud Eleven: A cloud-based code sharing solution for IgorPro

This post is something of a “how to” guide. The problem is how can you share code with a small team and keep it up-to-date?

For ImageJ, the solution is simple. You can make an ImageJ update site and then push any updated code to the user when they start up ImageJ. For IgorPro, there is no equivalent. Typically I send ipf files to someone and they run the code, but I have to resend them whenever there’s an update. This can cause confusion over which is the latest version.

I’ve tried a bunch of things such as versioning the code (this at least tells you if the person is running an out-of-date version). I also put the code up on GitHub and tell people to pull down the latest version, but again this doesn’t work well. The topic of how to share code comes up perennially on the Igor mailing list and on IgorExchange, so it’s clearly something that people struggle with. I’ve found the solutions offered to be a bit daunting.

My solution is detailed here with some code to make it run.

The details

It is possible to use aliases (shortcuts on Windows) in the user’s WaveMetrics folders to point to external files. Items in the Igor Procedures directory get loaded when you start Igor. Items in User Procedures can be loaded optionally. These aliases sit in the user’s WaveMetrics folder (the exact location depends on the Igor version), which means that the program itself can be updated without overwriting these files.

So, if aliases are created here that point to a cloud-based repo of Igor code, the repo can be used to:

  1. Optionally load code as the user needs it. Because the code sits in the cloud it can be updated and instantly used by everyone.
  2. Force Igor to load a bit of code to make a little menu item so that the user can pick the code they want to load.

To get this working, I made use of Ryotako’s excellent menu loader which was written for loading the “hidden” WaveMetrics procedures. I created a version like that one which makes a menu of our shared code in alphabetical order. I then made a version that allows the code to be grouped by purpose. People in my group told me they prefer this version. Code just needs to be organised in folders for it to work.

The IgorDistro.ipf just needs to be placed in a folder called IP somewhere in a share that users have access to (unless people are contributing to code development, you can make the share read-only). In the same directory that IP sits in, place a folder called UP containing all of your code organised into folders. Aliases then need to be created that point the Igor Procedures folder to IP and the User Procedures folder to UP. That’s it! I’ve tested it on IP7 and IP8, on Windows and Mac, using a shared Dropbox as the cloud repo.

Obviously, the correct IgorPro licences need to be in place to share the code with multiple users.

The post title comes from the pseudonymous LP by Cloud Eleven. A great debut album. The track “Wish I” is worth the price of the record alone (assuming people still buy records).

Rollercoaster II: more on Google Scholar citations

I’ve previously written about Google Scholar: its usefulness and its instability. I just read a post by Jon Tennant on how to harvest Google Scholar data in R and I thought I would use his code as the basis for generating some nice plots from Google Scholar data.

A script for R is below and can be found here. Graphics are base R but do the job.

First of all I took it for a spin on my own data. The outputs are shown here:

These were the most interesting plots that sprang to mind. First is a ranked citation plot, which also shows the line y = x to find the Hirsch number. Second is the total citations per year to all papers over time; Google Scholar shows the last few years of this plot on the profile page. Third, older papers accrue more citations, but how does this look across all papers? Finally, a prediction of what my H-index will do over time (no prizes for guessing that it will go up!). As Jon noted, the calculation comes from this paper.

While that’s interesting, we need to get the data of a scholar with a huge number of papers and citations. Here is George Church.

At the time of writing he has 763 papers with over 90,000 citations in total and an H-index of 147. Interestingly, ~10% of his total citations come from a monster paper in PNAS with Wally Gilbert in the mid-80s on genome sequencing.

Feel free to grab/fork this code and have a play yourself. If you have other ideas for plots or calculations, add a comment here or an issue at GitHub.

if(!require(scholar)){
     install.packages("scholar")
}
library(scholar)
# Add Google Scholar ID of interest here
ID <- ""
# If you didn't add one to the script prompt user to add one
if(ID == ""){
     ID <- readline(prompt="Enter Scholar ID: ")
}
# Get the citation history
citeByYear<-get_citation_history(ID)
# Get profile information
profile <- get_profile(ID)
# Get publications and save as a csv
pubs <- get_publications(ID)
write.csv(pubs, file = "citations.csv")
# Predict h-index
hIndex <- predict_h_index(ID)
# Now make some plots
# Plot of total citations by year
png(file = "citationsByYear.png")
plot(citeByYear$year,citeByYear$cites,
     type="h", xlab="Year", ylab = "Total Cites")
dev.off()
# Plot of ranked paper by citation with h
png(file = "citationsAndH.png")
plot(pubs$cites, type="l",
     xlab="Paper rank", ylab = "Citations per paper")
abline(0,1)
text(nrow(pubs),max(pubs$cites, na.rm = TRUE),
     profile$h_index)
dev.off()
# Plot of cites to paper by year
png(file = "citesByYear.png")
plot(pubs$year, pubs$cites,
     xlab="Year", ylab = "Citations per paper")
dev.off()
# Plot of h-index prediction
thisYear <- as.integer(format(Sys.Date(), "%Y"))
png(file = "hPred.png")
     plot(hIndex$years_ahead+thisYear,hIndex$h_index,
     ylim = c(0, max(hIndex$h_index, na.rm = TRUE)),
     type = "h",
     xlab="Year", ylab = "H-index prediction") 
dev.off()

Note that my previous code used a python script to grab Google Scholar data. While that script worked well, the scholar package for R seems a lot more reliable.

I have a surprising number of tracks in my library with Rollercoaster in the title. This time I will go with the Jesus & Mary Chain track from Honey’s Dead.

Turn That Heartbeat Over Again: comparing wrist and chest-strap HRM

As a geek, the added bonus of exercise is the fun that you can have with the data you’ve generated. A recent conversation on Twitter about the accuracy of wrist-based HRMs got me thinking… how does a wrist-based HRM compare with a traditional chest-strap HRM? Conventional wisdom says that the chest-strap is more accurate, but my own experience of chest-strap HRMs is that they are a bit unreliable. Time to put it to the test.

I have a Garmin Fēnix 5 which records wrist-based HR and I have a Garmin chest-strap which uses ANT+ to transmit. I could pick up the ANT+ signal with a Garmin Edge 800 so that I could record both datasets simultaneously, on the same bike ride. Both the Fēnix and Edge can record GPS and time (obviously), allowing accurate registration of the data. I also set both devices to receive cadence data via ANT+ from the same cadence/speed sensor so that I could control for (or at least look at) variability in recordings. I rode for ~1 hr (~32 km) to capture enough data. Obviously this is just one trial, but the data give a feel for the accuracy of the two HRMs. Biking, I figured, was a fair activity since upper body and wrist movement is minimal, meaning that the contacts for both HRMs are more likely to stay in place than if I was running.

I’ll get to heart rate last. First, you can see that the GPS recording is virtually identical between the two units – so that’s a good sign. Second, elevation is pretty similar too. There’s a bit of divergence at the beginning and end of the ride; since those parts are over the same stretch of road, neither device looks totally accurate. I’d be inclined to trust the Fēnix here since it has a newer altimeter in it. Third, cadence looks accurate. Remember this data is coming off the same sensor, so any divergence is in how it’s being logged. Finally, heart rate. You can see that at the beginning of the ride, both wrist-based and chest-strap HRMs struggle. Obviously I have no reference here to know which trace is correct, but the chest-strap recording looks like it has an erroneous low reading for several minutes and is otherwise normal. The wrist-based HRM looks like it is reading 120ish bpm values from the start and then snaps into reading the right thing after a while. The chest-strap makes best contact when some perspiration has built up, which might explain its early readings. The comparisons are shown below in grey. The correlation is OK but not great. Compared to cadence, the data diverge a lot more, which rules out simple logging differences as a cause.

I found a different story in Smart recording mode on the Fēnix. This is a lower frequency recording mode, which is recommended to preserve battery life for long activities.

So what can we see here? Well, the data from the Fēnix are more patchy but even so, the data are pretty similar except for heart rate. The Fēnix performs badly here. Again, you can see the drop out of the chest strap HRM for a few minutes at the start, but otherwise it seems pretty reliable. The comparison graph for heart rate shows how poorly the wrist-based HRM measures heart rate, in this mode.

OK, this is just two rides, for one person – not exactly conclusive but it gives some idea about the data that are captured with each HRM.

Conclusions

Wrist-based HRM is pretty good (at the higher sampling rate only), especially considering how the technology works, and chest-strap HRMs can be uncomfortable to wear, so wrist-based HRM may be all you need. If you are training to heart rate zones, or want the best data, chest-strap HRM is generally more reliable than wrist-based HRM. Neither is very good for very short activities (<15 min).

For nerds only

Comparisons like these are quite easy to do in desktop packages like Rubitrack (which I love) or Ascent or others. However, they tend to mask things like missing data points, they smooth out the data, and getting the plots the way you want is not straightforward. So, I took the original FIT files from each unit and used these for processing. There’s a great package for R called cycleRtools. This worked great except for the smart recording data which, it turns out, was sampled irregularly, and the package requires monotonic sampling. For that activity I generated a gpx file and parsed the data in R using XML. I found this snippet on the web (modified slightly).

library(XML)
library(plyr)
# parse the gpx file and get the root node
filename <- "myfile.gpx"
gpx.raw <- xmlTreeParse(filename, useInternalNodes = TRUE)
rootNode <- xmlRoot(gpx.raw)
# convert the track to a list and pull out all of the track segments
gpx.rawlist <- xmlToList(rootNode)$trk
gpx.list <- unlist(gpx.rawlist[names(gpx.rawlist) == "trkseg"], recursive = FALSE)
# flatten each track point into a row and bind the rows into a data frame
gpx <- do.call(rbind.fill, lapply(gpx.list, function(x) as.data.frame(t(unlist(x)), stringsAsFactors=F)))
# name the columns: elevation, time, temperature, heart rate, cadence, latitude, longitude
names(gpx) <- c("ele", "time", "temp", "hr", "cad", "lat", "lon")

Otherwise:

library(cycleRtools)
# read the FIT file (pick it interactively) and format the columns
edge <- as.data.frame(read_fit(file = file.choose(), format = TRUE, CP = NULL, sRPE = NULL))
# save out as csv for plotting elsewhere
write.csv(edge, file = "edge.csv", row.names = F)

The resulting data frames could be saved out as csv and read into Igor to make the plots. I wrote a quick function in Igor to resample all datasets at 1 Hz. The plots and layout were generated by hand.
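
The equivalent resampling is simple in R too. This is just a sketch of the idea using approx() (the function I actually used was written in Igor), assuming a data frame with a numeric time column in seconds and numeric data columns:

# resample a recording to 1 Hz by linear interpolation
resample1Hz <- function(df, timeCol = "time") {
  t <- df[[timeCol]]
  newT <- seq(ceiling(min(t, na.rm = TRUE)), floor(max(t, na.rm = TRUE)), by = 1)
  out <- data.frame(time = newT)
  # interpolate every other (numeric) column onto the 1 s grid
  for (col in setdiff(names(df), timeCol)) {
    out[[col]] <- approx(x = t, y = df[[col]], xout = newT)$y
  }
  out
}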

The post title comes from “Turn That Heartbeat Over Again” by Steely Dan from Can’t Buy A Thrill

I’m not following you II: Twitter data and R

My activity on Twitter revolves around four accounts.

I try to segregate what happens on each account, although there’s inevitably some overlap. But what about overlap in followers?

What lucky people are following all four? How many only see the individual accounts?

It’s quite easy to look at this in R.

So there are 36 lucky people (or bots!) following all four accounts. I was interested in the followers of the quantixed account since it seemed to me that it attracts people from a slightly different sphere. It looks like about one-third of quantixed followers only follow quantixed, about one-third also follow clathrin, and more or less the remainder are “all in”, following three accounts or all four. CMCB followers are split about the same. The lab account is a bit different, with close to one-half of the followers also following clathrin.

Extra nerd points:

This is a Venn diagram and not an Euler plot. A Venn diagram just shows the intersections schematically and does not attempt to encode information in the area of each part. Euler plots for more than three groups are hard to generate and hard to make sense of. Looking at the proportions of lots of groups is a dataviz problem. One solution here would be to generate a further four Venn diagrams and, on each, display the proportion for one category as a fraction or percentage.

How to do it:

Last time, I described how to set up rtweet and make a Twitter app for use in R. You can use this to pull down lists of followers and extract their data. Using the intersect function you can work out the numbers of followers at each intersection. For four accounts, there will be 1 group of four, 4 groups of three, 6 groups of two. The VennDiagram package just needs the total numbers for all four groups and then details of the intersections, i.e. you don’t need to work out the groups minus their intersections – it does this for you.

library(rtweet)
library(httpuv)
library(VennDiagram)
## whatever name you assigned to your created app
appname <- "whatever_name"
## api key (example below is not a real key)
key <- "blah614h"
## api secret (example below is not a real key)
secret <- "blah614h"
## create token named "twitter_token"
twitter_token <- create_token(
app = appname,
consumer_key = key,
consumer_secret = secret)
clathrin_followers <- get_followers("clathrin", n = "all")
clathrin_followers_names <- lookup_users(clathrin_followers)
quantixed_followers <- get_followers("quantixed", n = "all")
quantixed_followers_names <- lookup_users(quantixed_followers)
cmcb_followers <- get_followers("Warwick_CMCB", n = "all")
cmcb_followers_names <- lookup_users(cmcb_followers)
roylelab_followers <- get_followers("roylelab", n = "all")
roylelab_followers_names <- lookup_users(roylelab_followers)
# a = clathrin
# b = quantixed
# c = cmcb
# d = roylelab
## now work out intersections
anb <- intersect(clathrin_followers_names$user_id,quantixed_followers_names$user_id)
anc <- intersect(clathrin_followers_names$user_id,cmcb_followers_names$user_id)
and <- intersect(clathrin_followers_names$user_id,roylelab_followers_names$user_id)
bnc <- intersect(quantixed_followers_names$user_id,cmcb_followers_names$user_id)
bnd <- intersect(quantixed_followers_names$user_id,roylelab_followers_names$user_id)
cnd <- intersect(cmcb_followers_names$user_id,roylelab_followers_names$user_id)
anbnc <- intersect(anb,cmcb_followers_names$user_id)
anbnd <- intersect(anb,roylelab_followers_names$user_id)
ancnd <- intersect(anc,roylelab_followers_names$user_id)
bncnd <- intersect(bnc,roylelab_followers_names$user_id)
anbncnd <- intersect(anbnc,roylelab_followers_names$user_id)
## four-set Venn diagram
venn.plot <- draw.quad.venn(
area1 = nrow(clathrin_followers_names),
area2 = nrow(quantixed_followers_names),
area3 = nrow(cmcb_followers_names),
area4 = nrow(roylelab_followers_names),
n12 = length(anb),
n13 = length(anc),
n14 = length(and),
n23 = length(bnc),
n24 = length(bnd),
n34 = length(cnd),
n123 = length(anbnc),
n124 = length(anbnd),
n134 = length(ancnd),
n234 = length(bncnd),
n1234 = length(anbncnd),
category = c("Clathrin", "quantixed", "CMCB", "RoyleLab"),
fill = c("dodgerblue1", "red", "goldenrod1", "green"),
lty = "dashed",
cex = 2,
cat.cex = 1.5,
cat.col = c("dodgerblue1", "red", "goldenrod1", "green"),
fontfamily = "Helvetica",
cat.fontfamily = "Helvetica"
);
# write to file
png(filename = "Quad_Venn_diagram.png");
grid.draw(venn.plot);
dev.off()
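
Since all of the follower ID vectors are still in memory at this point, it is also easy to put numbers on the proportions discussed above, for example for the quantixed account (a quick sketch reusing the objects created in the code above):

# proportion of quantixed followers who follow only quantixed
allQuantixed <- quantixed_followers_names$user_id
otherThree <- union(clathrin_followers_names$user_id,
                    union(cmcb_followers_names$user_id, roylelab_followers_names$user_id))
length(setdiff(allQuantixed, otherThree)) / length(allQuantixed)
# proportion of quantixed followers who are "all in" (follow all four accounts)
length(anbncnd) / length(allQuantixed)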

I’ll probably return to rtweet in future and will recycle the title if I do.

Like last time, the post title is from “I’m Not Following You” the final track from the 1997 LP of the same name from Edwyn Collins

Adventures in Code VI: debugging and silly mistakes

This deserved a bit of further explanation, due to the stupidity involved.

“Debugging is like being the detective in a crime movie where you are also the murderer.” – Filipe Fortes

My code was giving an unexpected result and I was having a hard time figuring out the problem. The unexpected result was that a resampled set of 2D coordinates was not being rotated randomly. I was fortunate to be able to see this; otherwise I would never have found the bug and would probably have propagated the error to other code projects.

I narrowed down the cause but ended up having to write some short code to check that it really did cause the error.

I was making a rotation matrix and then using it to rotate a 2D coordinate set by matrix multiplication. The angle was randomised in the loop. What could go wrong? I looked at this:

theta = pi*enoise(1)
rotMat = {{cos(theta),-sin(theta)},{sin(theta),cos(theta)}}

and thought “two lines – pah – it can be done in one!”. Since the rotation matrix is four numbers in the range [-1,1], I thought “I’ll just pick those numbers at random; I just want a random angle, don’t I?”

rotMat = enoise(1)

Why doesn’t an alarm go off when this happens? A flashing sign saying “are you sure about that?”…

My checks showed that a single point at (1,0), transformed with this method, ends up scattered away from the unit circle, when a true rotation should keep it on the unit circle.

And it’s so obvious when you’ve seen why. The four numbers in the rotation matrix are, of course, not independent.
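
The check itself is easy to reproduce. Here is an R version of it (the original was in Igor): take the point (1,0), transform it many times with a proper rotation matrix built from one random angle, and then with a matrix of four independent random numbers, and compare where it ends up.

set.seed(42)
pt <- c(1, 0)

# correct: a rotation matrix built from a single random angle
rotateProperly <- function(p) {
  theta <- runif(1, -pi, pi)
  R <- matrix(c(cos(theta), sin(theta), -sin(theta), cos(theta)), nrow = 2)
  as.vector(R %*% p)
}

# the bug: four independent random numbers in [-1, 1] are not a rotation matrix
rotateBadly <- function(p) {
  R <- matrix(runif(4, -1, 1), nrow = 2)
  as.vector(R %*% p)
}

good <- t(replicate(1000, rotateProperly(pt)))
bad <- t(replicate(1000, rotateBadly(pt)))

# a true rotation preserves length: the point stays on the unit circle
summary(sqrt(rowSums(good^2)))  # all exactly 1
summary(sqrt(rowSums(bad^2)))   # all over the place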

I won’t make that mistake again and I’m going to try to think twice when trying to save a line of code like this in the future!

Part of a series on computers and coding.

Do It Yourself: Lab Notebook Archiving Project

A while back, the lab moved to an electronic lab notebook (details here and here). One of the drivers for this move was the huge number of hard copy lab note books that had accumulated in the lab over >10 years. Switching to an ELN solved this problem for the future, but didn’t make the old lab note books disappear. So the next step was to archive them and free up some space.

We access the contents of these books fairly regularly so archiving had to mean digitising them as well as putting them into storage. I looked at a few options before settling on a very lo-fi solution.

Option 1: call in the professionals

I got a quote from our University’s preferred data archiving firm. The lab notebooks we use have 188 pages and I had 89 to archive. The quote was over £4000 + VAT for scanning only. This was too expensive and so I next looked at DIY options.

Option 2: scan the books

At the University we have good MPDs that will scan documents and store them on a server as a multipage PDF. There are two resolutions at which you can scan, both of which are good-but-not-amazing quality. The scanners have a feeder which would automate the scanning of a lab book, but it would mean destroying the books (which are hardbound) to scan them.

I tried scanning one book using this method. Disassembling a notebook with a razorblade was quite quick but the problem was that the scanner struggled with the little print outs that people stick in their lab books. Dealing with jams and misfired scans meant that this was not an option, and I didn’t want to destroy all of the books either.

Option 3: photography rigs

Next, I looked at book scanning projects to see how they were done. In these projects, the books are valuable and so can’t be destroyed, but the process must be automated… I found that these projects use a cradle to sit the book in. A platen is pushed against the pages (to flatten them) and then two cameras take a picture of the two pages, triggered in sync using an external button or foot pedal. An example of one Raspberry Pi-powered rig is here. Building one of these appealed to me but would still require some expense (and time and effort). I asked around to see if anyone else wanted to help with the build, thinking that others might want to archive their notebooks, but I got no takers.

Option 4: the zero-cost solution!

Inspiration came from a student who left my lab and wanted to photograph her lab books for future reference. She captured them on her camera phone by hand in a matter of minutes. Shooting two pages of a book from a single digital camera suspended above the notebook would be a good compromise. Luckily I had access to a digital camera and a few hundred Lego bricks. Total new spend = £0.

I know it looks terrible, but it was pretty effective!

I put the rig on a table (for ergonomic reasons), next to a window, and photographed each book using natural light. It took around 10 min to photograph one lab book. I took the images over a few weeks in amongst doing other stuff so that the job didn’t become too onerous. I shot the books at the highest resolution and stored the raw images on the server. I wrote a quick script to stack the images, scale them down to 25% and export them to PDF, to make an easy-to-consult PDF file for each lab book. Everyone in the lab can access these PDFs and, if needed, can pull down the high res versions. The lab books have now been stored in a sealed container. We can access the books if needed. However, having looked at the images, I think that if something is not readable from the file, it won’t be readable in the hard copy either.
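
For what it’s worth, the whole stack-scale-export step only takes a few lines. Here is a sketch of the idea using the magick package in R (this is not the script I used, and the folder and file names are placeholders):

library(magick)
# combine one lab book's photos into a single reduced-size PDF
# (assumes the full-resolution images for one book are in a folder called "book01")
files <- sort(list.files("book01", pattern = "\\.jpg$", full.names = TRUE))
pages <- image_read(files)          # read all pages into one multi-frame image
pages <- image_scale(pages, "25%")  # scale each page down to 25%
image_write(pages, path = "book01.pdf", format = "pdf")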

Was it worth it?

I think so. It took a while to get everything digitised but I’m glad it’s done. The benefits are:

  1. Easy access to all lab books for every member of the lab.
  2. Clearing a load of clutter from my office.
  3. The rig can be rebuilt easily, but is not otherwise sitting around gathering dust.
  4. Some of the older lab books were deteriorating and so capturing them before they got worse was a good idea (see picture above for some sellotape degradation).

The post title is taken from the LP “Do It Yourself” by The Seahorses.

Dividing Line: not so simple division in ctenophores

This wonderful movie has repeatedly popped up in my Twitter feed.

It was taken by Tessa Montague and is available here (tweet is here).

The movie is striking because of the way that cytokinesis starts at one side and moves to the other. Most model systems for cell division have symmetrical division.

Rob de Bruin commented that “it makes total sense to segregate this way”, implying that if a cell just gets cut in half, equal sharing of components is taken care of. This got me thinking…

It does make sense to share n identical objects this way. For example, vesiculation of the Golgi generates many equally sized vesicles. Cutting the cell in half ensures that each cell gets approximately half of the Golgi (although there is another pathway that actively segregates vesicular material, reviewed here). However, for segregation of genetic material – where it is essential that each cell receives one (and exactly one) copy of the genome – a cutting-in-half mechanism simply doesn’t cut it (pardon the pun).

The error rate of such a mechanism would be approximately 50%, which is far too high for something so important, especially at this (first) division shown in the movie.

I knew nothing about ctenophores (comb jellies) before seeing this movie, and with a bit of searching I found this paper. In it, the authors show that there is indeed a karyokinetic (mitotic) mechanism that segregates the genetic material and that this happens independently of the cytokinetic process, which is actin-dependent. So not so different after all. The asymmetric division, and the fact that these divisions are very rapid and synchronised, is very interesting. It’s very different to the sorts of cells that we study in the lab. Thanks to Tessa Montague for the amazing video that got me thinking about this.

Footnote: the 50% error rate can be calculated as follows. Although segregation is in 3D, this is a 1D problem. Assume that the cell divides down the centre of the long axis and that object 1 and object 2 can be randomly situated along that axis. There is an equal probability of each object ending up in each cell: object 1 can end up in either cell 1 or cell 2, as can object 2. The probability that objects 1 and 2 end up in the same cell is 50%, because there is a 25% chance of each of the four outcomes (object 1 in cell 1 and object 2 in cell 2; object 1 in cell 2 and object 2 in cell 1; both objects in cell 1; both objects in cell 2). It doesn’t matter how many objects we are talking about or the size of the cell. This is a highly simplified calculation but it serves the purpose of showing that another solution is needed to segregate objects with identity during cell division.
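
If you prefer simulation to probability, a few lines of R reproduce the same number (nothing assumed beyond the description above):

set.seed(1)
# simulate the cut-in-half mechanism: two objects at random positions along
# the long axis, cell cut at the midpoint
nTrials <- 100000
pos1 <- runif(nTrials)  # position of object 1 along the axis
pos2 <- runif(nTrials)  # position of object 2
sameCell <- (pos1 < 0.5) == (pos2 < 0.5)
mean(sameCell)  # ~0.5, i.e. an error rate of about 50%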

The post title comes from “Dividing Line” from the Icons of Filth LP Onward Christian Soldiers.