## Methods papers for MD997

I am now running a new module for masters students, MD997. The aim is to introduce the class to a range of advanced research methods and to get them to think about how to formulate their own research question(s).

The module is built around a paper which is allocated in the first session. I had to come up with a list of methods-type papers, which I am posting below. There are 16 students and I picked 23 papers. I aimed to cover their interests, which are biological but with some chemistry, physics and programming thrown in. The papers are a bit imaging-biased but I tried to get some ‘omics and neuro in there. There were some preprints on the list to make sure I covered the latest stuff.

The students picked their top 3 papers and we managed to assign them without too much trouble. Every paper got at least one vote. Some papers were in high demand. Fitzpatrick et al. on cryoEM of Alzheimer’s samples and the organoid paper from Lancaster et al. had by far the most votes.

The students present this paper to the class and also use it to formulate their own research proposal. Which one would you pick?

1. Booth, D.G. et al. (2016) 3D-CLEM Reveals that a Major Portion of Mitotic Chromosomes Is Not Chromatin. Mol Cell 64, 790-802. http://dx.doi.org/10.1016/j.molcel.2016.10.009
2. Chai, H. et al. (2017) Neural Circuit-Specialized Astrocytes: Transcriptomic, Proteomic, Morphological, and Functional Evidence. Neuron 95, 531-549.e9. http://dx.doi.org/10.1016/j.neuron.2017.06.029
3. Chang, J.B. et al. (2017) Iterative expansion microscopy. Nat Methods 14, 593-599. http://dx.doi.org/10.1038/nmeth.4261
4. Chen, B.C. et al. (2014) Lattice light-sheet microscopy: imaging molecules to embryos at high spatiotemporal resolution. Science 346, 1257998. http://dx.doi.org/10.1126/science.1257998
5. Chung, K. & Deisseroth, K. (2013) CLARITY for mapping the nervous system. Nat Methods 10, 508-13. http://dx.doi.org/10.1038/nmeth.2481
6. Eichler, K. et al. (2017) The Complete Connectome of a Learning and Memory Center in an Insect Brain. bioRxiv. http://dx.doi.org/10.1101/141762
7. Fitzpatrick, A.W.P. et al. (2017) Cryo-EM structures of tau filaments from Alzheimer’s disease. Nature 547, 185-190. http://dx.doi.org/10.1038/nature23002
8. Habib, N. et al. (2017) Massively parallel single-nucleus RNA-seq with DroNc-seq. Nat Methods 14, 955-958. http://dx.doi.org/10.1038/nmeth.4407
9. Hardman, G. et al. (2017) Extensive non-canonical phosphorylation in human cells revealed using strong-anion exchange-mediated phosphoproteomics. bioRxiv. http://dx.doi.org/10.1101/202820
10. Herzik, M.A., Jr. et al. (2017) Achieving better-than-3-Å resolution by single-particle cryo-EM at 200 keV. Nat Methods. http://dx.doi.org/10.1038/nmeth.4461
11. Jacquemet, G. et al. (2017) FiloQuant reveals increased filopodia density during breast cancer progression. J Cell Biol 216, 3387-3403. http://dx.doi.org/10.1083/jcb.201704045
12. Jungmann, R. et al. (2014) Multiplexed 3D cellular super-resolution imaging with DNA-PAINT and Exchange-PAINT. Nat Methods 11, 313-8. http://dx.doi.org/10.1038/nmeth.2835
13. Kim, D.I. et al. (2016) An improved smaller biotin ligase for BioID proximity labeling. Mol Biol Cell 27, 1188-96. http://dx.doi.org/10.1091/mbc.E15-12-0844
14. Lancaster, M.A. et al. (2013) Cerebral organoids model human brain development and microcephaly. Nature 501, 373-9. http://dx.doi.org/10.1038/nature12517
15. Madisen, L. et al. (2012) A toolbox of Cre-dependent optogenetic transgenic mice for light-induced activation and silencing. Nat Neurosci 15, 793-802. http://dx.doi.org/10.1038/nn.3078
16. Penn, A.C. et al. (2017) Hippocampal LTP and contextual learning require surface diffusion of AMPA receptors. Nature 549, 384-388. http://dx.doi.org/10.1038/nature23658
17. Qin, P. et al. (2017) Live cell imaging of low- and non-repetitive chromosome loci using CRISPR-Cas9. Nat Commun 8, 14725. http://dx.doi.org/10.1038/ncomms14725
18. Quick, J. et al. (2016) Real-time, portable genome sequencing for Ebola surveillance. Nature 530, 228-232. http://dx.doi.org/10.1038/nature16996
19. Ries, J. et al. (2012) A simple, versatile method for GFP-based super-resolution microscopy via nanobodies. Nat Methods 9, 582-4. http://dx.doi.org/10.1038/nmeth.1991
20. Rogerson, D.T. et al. (2015) Efficient genetic encoding of phosphoserine and its nonhydrolyzable analog. Nat Chem Biol 11, 496-503. http://dx.doi.org/10.1038/nchembio.1823
21. Russell, M.R. et al. (2017) 3D correlative light and electron microscopy of cultured cells using serial blockface scanning electron microscopy. J Cell Sci 130, 278-291. http://dx.doi.org/10.1242/jcs.188433
22. Strickland, D. et al. (2012) TULIPs: tunable, light-controlled interacting protein tags for cell biology. Nat Methods 9, 379-84. http://dx.doi.org/10.1038/nmeth.1904
23. Yang, J. et al. (2015) The I-TASSER Suite: protein structure and function prediction. Nat Methods 12, 7-8. http://dx.doi.org/10.1038/nmeth.3213

If you are going to do a similar exercise, Twitter is invaluable for suggestions for papers. None of the students complained that they couldn’t find three papers which matched their interests. I set up a slide carousel in Powerpoint with the front page of each paper together with some key words to tell the class quickly what the paper was about. I gave them some discussion time and then collated their choices on the board. Assigning the papers was quite straightforward, trying to honour the first choices as far as possible. Having an excess of papers prevented too much horse trading for the papers that multiple people had picked.
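If you want to automate the allocation step, the greedy honour-first-choices approach described above is easy to sketch. This is a quick Python illustration with hypothetical student names and paper numbers, not the actual process (which happened on the board):

```python
def allocate(preferences):
    """Greedy allocation: preferences maps student -> list of paper ids, best first.
    Honour as many first choices as possible, then second, then third."""
    assigned = {}   # student -> paper
    taken = set()   # papers already allocated
    for rank in range(3):
        for student, prefs in preferences.items():
            if student in assigned or rank >= len(prefs):
                continue
            paper = prefs[rank]
            if paper not in taken:
                assigned[student] = paper
                taken.add(paper)
    return assigned

# hypothetical top-3 lists; two students both want paper 7
prefs = {"A": [7, 14, 1], "B": [7, 2, 3], "C": [14, 7, 2]}
print(allocate(prefs))  # {'A': 7, 'C': 14, 'B': 2}
```

Having more papers than students means a lower-ranked choice is almost always free when a first choice clashes.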

Hopefully you find this list useful. I was inspired by Raphaël posting his own list here.

## The Sound of Clouds: wordcloud of tweets using R

Another post using R and looking at Twitter data.

As I was typing out a tweet, I had the feeling that my vocabulary is a bit limited. Papers I tweet about are either “great”, “awesome” or “interesting”. I wondered what my most frequently tweeted words are.

As with the last post, you can probably do what I’ll describe with an online tool somewhere, but why would you want to do that when you can DIY it in R?

First, I requested my tweet archive from Twitter. I wasn’t sure of the limits of rtweet for retrieving old tweets, and the archive request only takes a few minutes anyway. This gives you a download of everything, including a csv of all your tweets. The text of those tweets is in a column called ‘text’.


```r
## for text mining
library(tm)
## for stemming words
library(SnowballC)
## for making wordclouds
library(wordcloud)

tweets <- read.csv('tweets.csv', stringsAsFactors = FALSE)
## make a corpus of the text of the tweets
tCorpus <- Corpus(VectorSource(tweets$text))
## remove all the punctuation from tweets
tCorpus <- tm_map(tCorpus, removePunctuation)
## good idea to remove stopwords: high frequency words such as I, me and so on
tCorpus <- tm_map(tCorpus, removeWords, stopwords('english'))
## next step is to stem the words, so that talking and talked become talk
tCorpus <- tm_map(tCorpus, stemDocument)
## now display your wordcloud
wordcloud(tCorpus, max.words = 100, random.order = FALSE)
```

For my @clathrin account this gave:

So my most tweeted word is paper, followed by cell and lab. I’m quite happy about that. I noticed that great is also high frequency, which I had a feeling would be the case. It looks like @christlet, @davidsbristol, @jwoodgett and @cshperspectives are among my frequent twitterings; this is probably a function of the length of time we’ve been using Twitter. The cloud was generated from 10.9K tweets over seven years. It might be interesting to look at any changes over this time…

The cloud is a bit rough and ready. Further filtering would be a good idea, but this quick exercise only took a few minutes.

The post title comes from “The Sound of Clouds” by The Posies from their Solid States LP.

## I’m not following you: Twitter data and R

I wondered how many of the people that I follow on Twitter do not follow me back. A quick way to look at this is with R.

OK, a really quick way is to give a 3rd party application access rights to your account to do this for you, but a) that isn’t safe, b) you can’t look at anyone else’s data, and c) this is quantixed – doing nerdy stuff like this is what I do.

Now, the great thing about R is the availability of well-written packages to do useful stuff. I quickly found two packages, twitteR and rtweet, that are designed to harvest Twitter data. I went with rtweet and there were some great guides to setting up OAuth and getting going.

The code below sets up my environment and pulls down lists of my followers and my “friends”.
I’m looking at my main account and not the quantixed Twitter account.

```r
library(rtweet)
library(httpuv)
## set up your app name, api key and api secret
appname <- "whatever_name"
key <- "blah614h"
secret <- "blah614h"
## create token named "twitter_token"
twitter_token <- create_token(
  app = appname,
  consumer_key = key,
  consumer_secret = secret)

clathrin_followers <- get_followers("clathrin", n = "all")
clathrin_followers_names <- lookup_users(clathrin_followers)
clathrin_friends <- get_friends("clathrin")
clathrin_friends_names <- lookup_users(clathrin_friends)
```

The terminology is that people who follow me are called Followers and people that I follow are called Friends. These are the terms used by Twitter’s API. I have almost 3000 followers and around 1200 friends.

This was a bit strange… I had fewer followers with data than actual followers. Same for friends: missing a few hundred in total. I extracted a list of the Twitter IDs that had no data and tried a few other ways to look them up. All failed. I assume that these are users who have deleted their account (and the Twitter ID stays reserved) or maybe they are suspended for some reason. Very strange.

```r
## noticed something weird
## look at the twitter ids of followers and friends with no data
missing_followers <- setdiff(clathrin_followers$user_id, clathrin_followers_names$user_id)
missing_friends <- setdiff(clathrin_friends$user_id, clathrin_friends_names$user_id)
## find how many real followers/friends are in each set
aub <- union(clathrin_followers_names$user_id, clathrin_friends_names$user_id)
anb <- intersect(clathrin_followers_names$user_id, clathrin_friends_names$user_id)
## make an Euler plot to look at overlap (euler() is from the eulerr package)
library(eulerr)
fit <- euler(c(
  "Followers" = nrow(clathrin_followers_names) - length(anb),
  "Friends" = nrow(clathrin_friends_names) - length(anb),
  "Followers&Friends" = length(anb)))
plot(fit)
```

In the code above, I arranged in sets the “real Twitter users” who follow me or I follow them.
There was an overlap of 882 users, leaving 288 Friends who don’t follow me back – boo hoo!

I next wanted to see who these people are, which is pretty straightforward.

```r
## who are the people I follow who don't follow me back
bonly <- setdiff(clathrin_friends_names$user_id, anb)
no_follow_back <- lookup_users(bonly)
```



Looking at no_follow_back was interesting. There are a bunch of announcement accounts and people with huge follower counts that I wasn’t surprised do not follow me back. There are a few people on the list with whom I have interacted yet they don’t follow me, which is a bit odd. I guess they could have unfollowed me at some point in the past, but my guess is they were never following me in the first place. It used to be the case that you could only see tweets from people you followed, but the boundaries have blurred a lot in recent years. An intermediary only has to retweet something you have written for someone else to see it and you can then interact, without actually following each other. In fact, my own Twitter experience is mainly through lists, rather than my actual timeline. And to look at tweets in a list you don’t need to follow anyone on there. All of this led me to thinking: maybe other people (who follow me) are wondering why I don’t follow them back… I should look at what I am missing out on.

```r
## who are the people who follow me but I don't follow back
no_friend_back <- lookup_users(setdiff(clathrin_followers_names$user_id, anb))
```

```bash
cat a.csv b.csv c.csv > $OF
```

To crunch the data I wrote something in Igor which reads in the CSVs and plots out my data. This meant first getting a list of clusterIDs which correspond to my papers, in order to filter out other people’s work.

I have a surprising number of tracks in my library with Rollercoaster in the title. I will go with indie wannabe act Northern Uproar for the title of this post. “What goes up (must come down)” is from Graham & Brown’s Super Fresco wallpaper ad from 1984. “Please please tell me now” is a lyric from Duran Duran’s “Is There Something I Should Know?”.

## The Second Arrangement

To validate our analyses, I’ve been using randomisation to show that the results we see would not arise due to chance. For example, the location of pixels in an image can be randomised and the analysis rerun to see if – for example – there is still colocalisation. A recent task meant randomising live cell movies in the time dimension, where two channels were being correlated with one another. In exploring how to do this automatically, I learned a few new things about permutations.

Here is the problem: if we have two channels (fluorophores), we can test for colocalisation or cross-correlation and get a result. Now, how likely is it that this was due to chance? So we want to re-arrange the frames of one channel relative to the other such that frame i of channel 1 is never paired with frame i of channel 2. This is because we want all pairs to be different to the original pairing. It was straightforward to program this, but I became interested in the maths behind it.

The maths: rearranging n objects is known as permutation, but the problem described above is known as derangement. The number of permutations of n frames is n!, but we need to exclude cases where the ith member stays in the ith position. It turns out that to do this, you need to use the principle of inclusion and exclusion.
If you are interested, the solution boils down to

$$n!\sum_{k=0}^{n}\frac{(-1)^k}{k!}$$

which basically means: for n frames, there are n! permutations, but you need to subtract and add diminishing numbers of different permutations to get to the result. A full description is given in the wikipedia link; details of inclusion and exclusion are here. I had got as far as figuring out that the ratio of permutations to derangements converges to e. However, you can tell that I am not a mathematician, as I used brute force calculation to get there rather than writing out the solution.

Anyway, what this means in a computing sense is that if you do one random permutation, you might get a derangement; with two you’re very likely to have got one, and by three you’ll almost certainly have it.

Back to the problem at hand. It occurred to me that not only do we not want frame i of channel 1 paired with frame i of channel 2, but actually it would be preferable to exclude frames i ± 2, let’s say. Because if two vesicles are in the same location at frame i, they may also be colocalised at frame i-1, for example. This is more complex to write down, because for frames 1 and 2 and frames n and n-1, there are fewer possibilities for exclusion than for all other frames. For all other frames there are n-5 legal positions. This obviously sets a lower limit on the number of frames capable of being permuted.

The answer to this problem is given by rook polynomials. You can think of the original positions of frames as columns on an n x n chess board. The rows are the frames that need rearranging; excluded positions are coloured in. Now the permutations can be thought of as Rooks in a chess game (they can move horizontally or vertically but not diagonally). We need to work out how many arrangements of Rooks are possible such that there is one Rook per row and no Rook can take another. If we have a 7 frame movie, we have a 7 x 7 board looking like this (left).
The “illegal” squares are coloured in. Frame 1 must go in position D, E, F or G, but then frame 2 can only go in E, F or G. If a Rook is at E1, then we cannot have a Rook at E2, and so on. To calculate the derangements, the rook polynomial for the excluded squares is:

$$1 + 29x + 310x^2 + 1544x^3 + 3732x^4 + 4136x^5 + 1756x^6 + 172x^7$$

This is a polynomial expansion of the expression:

$$R_{m,n}(x) = n!x^nL_n^{m-n}(-x^{-1})$$

where $$L_n^\alpha(x)$$ is an associated Laguerre polynomial. The solution in this case is 8 possibilities, from 7! = 5040 permutations. Of course our movies have many more frames, and so the randomisation is not so limited. In this example, frame 4 can only go in position A or G.

Why is this important? The way the randomisation is done is: the frames get randomised and then checked to see if any “illegal” positions have been detected. If so, do it again. When no illegal positions are detected, shuffle the movie accordingly. In the first case (simple derangement), the computation time per frame is roughly constant, whereas in the second case (i ± 2 exclusion) it could take much longer, because there will be more rejections. In the case of 7 frames with the restriction of no frames at i ± 2, the failure rate is 5032/5040 = 99.8%. Depending on how the code is written, this can cause some (potentially lengthy) wait time. Luckily, the failure rate comes down with more frames.

What about in practice? The numbers involved in directly calculating the permutations and exclusions quickly become too big using non-optimised code on a simple desktop setup (a 12 x 12 board exceeds 20 GB). The numbers and rates don’t mean much; what I wanted to know was whether this slows down my code in a real test. To look at this, I ran 100 repetitions of permutations of movies with 10-1000 frames. Whereas with the simple derangement problem the permutation needed to be run only once or twice, with the greater restrictions it takes eight or nine attempts before a “correct” solution is found.
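Both of these counts are small enough to verify numerically. A quick Python sketch (for illustration only; the real randomisation runs in Igor):

```python
from itertools import permutations
from math import factorial

def subfactorial(n):
    """Number of derangements of n items: n! * sum_{k=0}^{n} (-1)^k / k!"""
    return round(factorial(n) * sum((-1) ** k / factorial(k) for k in range(n + 1)))

# the ratio of permutations to derangements converges to e
print(factorial(10) / subfactorial(10))  # ~2.718281...

# brute-force check of the i ± 2 exclusion for a 7-frame movie:
# frame i may not land within 2 positions of its original slot
legal = [p for p in permutations(range(7))
         if all(abs(p[i] - i) > 2 for i in range(7))]
print(len(legal))  # 8 legal arrangements from 5040, i.e. 5032/5040 = 99.8% rejections
```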
The code can be written in a way that means this calculation is done on a placeholder wave rather than the real data, and then applied to the data afterwards. This reduces computation time. For movies of around 300 frames, the total run time of my code (which does quite a few things besides this) is around 3 minutes, and I can live with that. So, applying this more stringent exclusion will work for long movies and the wait times are not too bad. I learned something about combinatorics along the way. Thanks for reading!

Further notes: the first derangement issue I mentioned is also referred to as the hat-check problem. This refers to people (numbered 1, 2, 3 … n) with corresponding hats (labelled 1, 2, 3 … n). How many ways can they be given the hats at random such that they do not get their own hat? Adding i+1 as an illegal position is known as the problème des ménages. This is a problem of how to seat married couples so that they sit in a man-woman arrangement without being seated next to their partner. Perhaps i ± 2 should be known as the vesicle problem?

The post title comes from “The Second Arrangement” by Steely Dan, an unreleased track recorded for the Gaucho sessions.

## Parallel lines: new paper on modelling mitotic microtubules in 3D

We have a new paper out! You can access it here.

**The people**

This paper really was a team effort. Faye Nixon and Tom Honnor are joint first authors. Faye did most of the experimental work in the final months of her PhD, and Tom came up with the idea for the mathematical modelling and helped to rewrite our analysis method in R. Other people helped in lots of ways. George did extra segmentation, rendering and movie making. Nick helped during the revisions of the paper. Ali helped to image samples… the list is quite long.

**The paper in a nutshell**

We used a 3D imaging technique called SBF-SEM to see microtubules in dividing cells, then used computers to describe their organisation.

**What’s SBF-SEM?**
Serial block face scanning electron microscopy. This method allows us to take an image of a cell and then remove a tiny slice, take another image, and so on. We then have a pile of images which covers the entire cell. Next we need to put them back together and make some sense of them.

**How do you do that?**

We use a computer to track where all the microtubules are in the cell. In dividing cells – in mitosis – the microtubules are in the form of a mitotic spindle. This is a machine that the cell builds to share the chromosomes between the two new cells. It’s very important that this process goes right. If it fails, mistakes can lead to diseases such as cancer. Before we started, it wasn’t known whether SBF-SEM had the power to see microtubules, but we show in this paper that it is possible. We can see lots of other cool things inside the cell too, like chromosomes, kinetochores, mitochondria and membranes. We made many interesting observations in the paper, although the focus was on the microtubules.

**So you can see all the microtubules – what’s interesting about that?**

The interesting thing is that our resolution is really good, and at a large scale. This means we can determine the direction of all the microtubules in the spindle and use this to understand how well the microtubules are organised. Previous work had suggested that proteins whose expression is altered in cancer cause changes in the organisation of spindle microtubules. Our computational methods allowed us to test these ideas for the first time.

**Resolution at a large scale – what does that mean?**

The spindle is made of thousands of microtubules. With a normal light microscope, we can see the spindle but we can’t tell individual microtubules apart. There are improvements in light microscopy (called super-resolution), but even with those improvements, right in the body of the spindle it is still not possible to resolve individual microtubules. SBF-SEM can do this.
It doesn’t have the best resolution available, though. A method called electron tomography has much higher resolution. However, to image microtubules at this large scale (meaning one whole spindle), it would take months or years of effort! SBF-SEM takes a few hours. Our resolution is better than light microscopy and worse than electron tomography, but because we can see the whole spindle and image more samples, it has huge benefits.

**What mathematical modelling did you do?**

Cells are beautiful things, but they are far from perfect. The microtubules in a mitotic spindle follow a pattern, but don’t do so exactly. So what we did was to create a “virtual spindle” where each microtubule had been made perfect. It was a bit like “photoshopping” the cell. Instead of straightening the noses of actresses, we corrected the path of every microtubule. How much photoshopping was needed told us how imperfect the microtubule’s direction was. This measure – a readout of microtubule “wonkiness” – could be applied to thousands of microtubules and tell us whether cancer-associated proteins really cause the microtubules to lose organisation.

**The publication process**

The paper is published in Journal of Cell Science and it was a great experience. Last November, we put up a preprint on this work and left it up for a few weeks. We got some great feedback and modified the paper a bit before submitting it to a journal. One reviewer gave us a long list of useful comments that we needed to address. However, the other two reviewers didn’t think our paper was a big enough breakthrough for that journal. Our paper was rejected*. This can happen sometimes, and it is frustrating as an author because it is difficult for anybody to judge which papers will go on to make an impact and which won’t. One of the two reviewers thought that because the resolution of SBF-SEM is lower than electron tomography, our paper was not good enough.
The other one thought that because SBF-SEM will not surpass light microscopy as an imaging method (really!**), and because EM cannot be done live (the cells have to be fixed), it was not enough of a breakthrough. As I explained above, the power is that SBF-SEM sits between these two methods. Somehow, the referees weren’t convinced. We did some more work, revised the paper, and sent it to J Cell Sci.

J Cell Sci is a great journal published by the Company of Biologists, a not-for-profit organisation who put a lot of money back into cell biology in the UK. They are preprint friendly, they allow the submission of papers in any format, and most importantly, they have a fast-track*** option. This allowed me to send on the reviews we had, together with our response to them. They sent the paper back to the reviewer who had given the list of useful comments, and they were happy with the changes we made. It was accepted just 18 days after we sent it in, and it was online 8 days later. I’m really pleased with the whole publishing experience with J Cell Sci.

* I’m writing about this because we all have papers rejected. There’s no shame in that at all. Moreover, it’s obvious from the dates on the preprint and on the JCS paper that our manuscript was rejected from another journal first.

** Anyone who knows something about microscopy will find this amusing and/or ridiculous.

*** Fast-track is offered by lots of journals nowadays. It allows authors to send in a paper that has been reviewed elsewhere along with the peer review file. How the paper has been revised in light of those comments is assessed by the Editor and one peer reviewer.

Parallel lines is of course the title of the seminal Blondie LP. I have used this title before for a blog post, but it matches the topic so well.

## Adventures in Code V: making a map of Igor functions

I’ve generated a lot of code for IgorPro.
Keeping track of it all has got easier since I started using GitHub – even so – I have found myself writing something only to discover that I had previously written the same thing. I thought it would be good to make a list of all the functions I’ve written, to help locate long lost functions. This question was brought up on the Igor mailing list a while back and there are several solutions – especially if you want to look at dependencies. However, this two-liner works to generate a file called funcfile.txt, which contains a list of functions and the ipf file that they appear in.

```bash
grep "^[ \t]*Function" *.ipf | grep -oE '[ \t]+[A-Za-z_0-9]+\(' | tr -d " " | tr -d "(" > output
for i in $(cat output); do grep -ie "$i" *.ipf | grep -w "Function" >> funcfile.txt ; done
```


Thanks to Thomas Braun on the mailing list for the idea. I have converted it to work on grep (BSD grep) 2.5.1-FreeBSD which runs on macOS. Use the terminal, cd to the directory containing your ipf files and run it. Enjoy!

EDIT: I did a bit more work on this idea and it has now expanded to its own repo. Briefly, funcfile.txt is converted to tsv and then parsed – using Igor – to json. This can be displayed using some d3.js magic.
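If you would rather stay out of Igor for the parsing step, something similar can be sketched in a few lines of Python. This assumes funcfile.txt lines look like `Utils.ipf:Function LoadAll(pathName)` (the format the grep pipeline above produces); the real repo does this conversion in Igor:

```python
import json
import re

def funcmap(lines):
    """Group function names by the ipf file they appear in.
    Expects grep-style lines: 'MyProcs.ipf:<whitespace>Function MyFunc(...'"""
    mapping = {}
    for line in lines:
        m = re.match(r"([^:]+\.ipf):\s*Function\s+(\w+)\s*\(", line, re.IGNORECASE)
        if m:
            mapping.setdefault(m.group(1), []).append(m.group(2))
    return mapping

lines = ["Utils.ipf:Function LoadAll(pathName)", "Plots.ipf:\tFunction MakeFig(w)"]
print(json.dumps(funcmap(lines)))  # {"Utils.ipf": ["LoadAll"], "Plots.ipf": ["MakeFig"]}
```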

Part of a series with code snippets and tips.

## Realm of Chaos

Caution: this post is for nerds only.

I watched this numberphile video last night and was fascinated by the point pattern that was created in it. I thought I would quickly program my own version to recreate it and then look at patterns made by more points.

I didn’t realise until afterwards that there is actually a web version of the program used in the video here. It is a bit limited though so my code was still worthwhile.

A fractal triangular pattern can be created by:

1. Setting three points
2. Picking a randomly placed seed point
3. Rolling a die and going halfway towards the result
4. Repeat last step

If the first three points are randomly placed the pattern is skewed, so I added the ability to generate an equilateral triangle. Here is the result.

and here are the results of a triangle through to a decagon.

All of these are generated with one million points using alpha=0.25. The triangle, pentagon and hexagon make nice patterns but the square and polygons with more than six points make pretty uninteresting patterns.

Watching the creation of the point pattern from a triangular set is quite fun. This is 30000 points with a frame every 10 points.

Here is the code.

Some other notes: this version runs in IgorPro. In my version, the seed is set at the centre of the image rather than at a random location, and I used random allocation of vertices rather than rolling a six-sided die.

The post title is taken from the title track from Bolt Thrower’s “Realm of Chaos”.