## The Sound of Clouds: wordcloud of tweets using R

Another post using R and looking at Twitter data.

As I was typing out a tweet, I had the feeling that my vocabulary is a bit limited. Papers I tweet about are either “great”, “awesome” or “interesting”. I wondered what my most frequently tweeted words are.

Like the last post you can (probably) do what I’ll describe online somewhere, but why would you want to do that when you can DIY in R?

First, I requested my tweets from Twitter. I wasn’t sure of the limits of rtweet for retrieving tweets and the request only takes a few minutes. This gives you a download of everything including a csv of all your tweets. The text of those tweets is in a column called ‘text’.


```r
## for text mining
library(tm)
## for stemming words
library(SnowballC)
## for making wordclouds
library(wordcloud)
## read in your tweets
tweets <- read.csv('tweets.csv', stringsAsFactors = FALSE)
## make a corpus of the text of the tweets
tCorpus <- Corpus(VectorSource(tweets$text))
## remove all the punctuation from tweets
tCorpus <- tm_map(tCorpus, removePunctuation)
## good idea to remove stopwords: high frequency words such as I, me and so on
tCorpus <- tm_map(tCorpus, removeWords, stopwords('english'))
## next step is to stem the words. Means that talking and talked become talk
tCorpus <- tm_map(tCorpus, stemDocument)
## now display your wordcloud
wordcloud(tCorpus, max.words = 100, random.order = FALSE)
```

For my @clathrin account this gave:

So my most tweeted word is paper, followed by cell and lab. I’m quite happy about that. I noticed that great is also high frequency, which I had a feeling would also be the case. It looks like @christlet, @davidsbristol, @jwoodgett and @cshperspectives are among my frequent twitterings; this is probably a function of the length of time we’ve been using Twitter. The cloud was generated using 10.9K tweets over seven years; it might be interesting to look at any changes over this time…

The cloud is a bit rough and ready. Further filtering would be a good idea, but this quick exercise just took a few minutes.

The post title comes from “The Sound of Clouds” by The Posies from their Solid States LP.

## The Second Arrangement

To validate our analyses, I’ve been using randomisation to show that the results we see would not arise due to chance. For example, the location of pixels in an image can be randomised and the analysis rerun to see if there is still colocalisation.

A recent task meant randomising live cell movies in the time dimension, where two channels were being correlated with one another. In exploring how to do this automatically, I learned a few new things about permutations.

Here is the problem: If we have two channels (fluorophores), we can test for colocalisation or cross-correlation and get a result. Now, how likely is it that this was due to chance?
So we want to re-arrange the frames of one channel relative to the other such that frame i of channel 1 is never paired with frame i of channel 2. This is because we want all pairs to be different to the original pairing. It was straightforward to program this, but I became interested in the maths behind it.

The maths: Rearranging n objects is known as permutation, but the problem described above is known as derangement. The number of permutations of n frames is n!, but we need to exclude cases where the ith member stays in the ith position. It turns out that to do this, you need to use the principle of inclusion and exclusion. If you are interested, the solution boils down to

$$n!\sum_{k=0}^{n}\frac{(-1)^k}{k!}$$

which basically means: for n frames, there are n! permutations, but you need to subtract and add diminishing numbers of different permutations to get to the result. Full description is given in the wikipedia link. Details of inclusion and exclusion are here.

I had got as far as figuring out that the ratio of permutations to derangements converges to e. However, you can tell that I am not a mathematician as I used brute force calculation to get there rather than write out the solution. Anyway, what this means in a computing sense is that a random permutation is a derangement a little over a third of the time (1/e ≈ 0.37), so a valid arrangement usually turns up within the first few shuffles rather than being guaranteed by the third.

Back to the problem at hand. It occurred to me that not only do we not want frame i of channel 1 paired with frame i of channel 2, but actually it would be preferable to exclude frames i ± 2, let’s say. Because if two vesicles are in the same location at frame i they may also be colocalised at frame i-1, for example. This is more complex to write down because for frames 1 and 2 and frames n and n-1, there are fewer possibilities for exclusion than for all other frames. For all other frames there are n-5 legal positions.
This obviously sets a lower limit for the number of frames capable of being permuted.

This problem is solved by rook polynomials. You can think of the original positions of frames as columns on an n x n chess board. The rows are the frames that need rearranging; excluded positions are coloured in. Now the permutations can be thought of as rooks in a chess game (they can move horizontally or vertically but not diagonally). We need to work out how many arrangements of rooks are possible such that there is one rook per row and such that no rook can take another.

If we have a 7 frame movie, we have a 7 x 7 board looking like this (left). The “illegal” squares are coloured in. Frame 1 must go in position D, E, F or G, but then frame 2 can only go in E, F or G. If a rook is at E1, then we cannot have a rook at E2. And so on.

To calculate the derangements:

$$1 + 29 x + 310 x^2 + 1544 x^3 + 3732 x^4 + 4136 x^5 + 1756 x^6 + 172 x^7$$

This is a polynomial expansion of this expression:

$$R_{m,n}(x) = n!x^nL_n^{m-n}(-x^{-1})$$

where $$L_n^\alpha(x)$$ is an associated Laguerre polynomial. The solution in this case is 8 possibilities, from 7! = 5040 permutations. Of course our movies have many more frames and so the randomisation is not so limited. In this example, frame 4 can only go in position A or G.

Why is this important? The way that the randomisation is done is: the frames get randomised and then checked to see if any “illegal” positions have been detected. If so, do it again. When no illegal positions are detected, shuffle the movie accordingly. In the first case (simple derangement), the computation time per frame is constant, whereas in the second case (excluding i ± 2) it could take much longer, because there will be more rejections. In the case of 7 frames, with the restriction of no frames at i ± 2, the failure rate is 5032/5040 = 99.8%. Depending on how the code is written, this can cause some (potentially lengthy) wait time. Luckily, the failure rate comes down with more frames.
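The counting and the shuffle-and-check loop described above can be sketched in Python. This is a minimal illustration rather than the Igor code used on the movies: the function names and the `gap` parameter (2 for the i ± 2 case, 0 for a plain derangement) are made up for the example, and the brute-force count is only practical for small n.

```python
import itertools
import math
import random

def derangement_count(n):
    """Number of derangements of n items via inclusion-exclusion."""
    # n! * sum((-1)^k / k!), kept in integer arithmetic to avoid rounding.
    return sum((-1) ** k * (math.factorial(n) // math.factorial(k))
               for k in range(n + 1))

def is_allowed(perm, gap):
    """True if every frame lands more than `gap` positions from its origin."""
    return all(abs(new - old) > gap for old, new in enumerate(perm))

def count_allowed(n, gap):
    """Brute-force count of allowed permutations (small n only)."""
    return sum(is_allowed(p, gap) for p in itertools.permutations(range(n)))

def shuffle_with_exclusion(n, gap, rng=random, max_tries=100000):
    """Shuffle frame indices until no frame is within `gap` of its origin.

    Returns the accepted order and the number of shuffles it took.
    """
    order = list(range(n))
    for attempt in range(1, max_tries + 1):
        rng.shuffle(order)
        if is_allowed(order, gap):
            return order, attempt
    raise RuntimeError("no allowed permutation found")

if __name__ == "__main__":
    # Plain derangements: formula and brute force should agree.
    print(derangement_count(7), count_allowed(7, 0))
    # The 7-frame movie with positions i ± 2 excluded.
    print(count_allowed(7, 2))
    # Shuffle-and-check on a longer movie; `tries` counts the rejections plus one.
    order, tries = shuffle_with_exclusion(300, 2, random.Random(1))
    print(tries)
```

Doing the check on a list of indices, and only reordering the real frames once an order is accepted, mirrors the placeholder idea: rejected shuffles never touch the image data.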
What about in practice? The numbers involved in directly calculating the permutations and exclusions quickly become too big using non-optimised code on a simple desktop setup (a 12 x 12 board exceeds 20 GB). The numbers and rates don’t mean much; what I wanted to know was whether this slows down my code in a real test. To look at this I ran 100 repetitions of permutations of movies with 10-1000 frames. Whereas with the simple derangement problem permutations needed to be run once or twice, with greater restrictions this means eight or nine times before a “correct” solution is found.

The code can be written in a way that means that this calculation is done on a placeholder wave rather than the real data and then applied to the data afterwards. This reduces computation time. For movies of around 300 frames, the total run time of my code (which does quite a few things besides this) is around 3 minutes, and I can live with that.

So, applying this more stringent exclusion will work for long movies and the wait times are not too bad. I learned something about combinatorics along the way. Thanks for reading!

### Further notes

The first derangement issue I mentioned is also referred to as the hat-check problem, which refers to people (numbered 1,2,3 … n) with corresponding hats (labelled 1,2,3 … n). How many ways can they be given the hats at random such that they do not get their own hat?

Adding i+1 as an illegal position is known as the problème des ménages. This is a problem of how to seat married couples so that they sit in a man-woman arrangement without being seated next to their partner. Perhaps i ± 2 should be known as the vesicle problem?

The post title comes from “The Second Arrangement” by Steely Dan, an unreleased track recorded for the Gaucho sessions.

## Adventures in Code V: making a map of Igor functions

I’ve generated a lot of code for IgorPro.
Keeping track of it all has got easier since I started using GitHub. Even so, I have found myself writing something only to discover that I had previously written the same thing. I was thinking that it would be good to make a list of all functions that I’ve written to locate long lost functions.

This question was brought up on the Igor mailing list a while back and there are several solutions, especially if you want to look at dependencies. However, this two liner works to generate a file called funcfile.txt which contains a list of functions and the ipf file that they appear in.

```bash
# collect the name of every function declared in the ipf files
grep "^[ \t]*Function" *.ipf | grep -oE '[ \t]+[A-Za-z_0-9]+\(' | tr -d " " | tr -d "(" > output
# for each name, record the file and the declaration line
for i in $(cat output); do grep -ie "$i" *.ipf | grep -w "Function" >> funcfile.txt ; done
```


Thanks to Thomas Braun on the mailing list for the idea. I have converted it to work on grep (BSD grep) 2.5.1-FreeBSD which runs on macOS. Use the terminal, cd to the directory containing your ipf files and run it. Enjoy!
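For anyone without grep to hand, the same map can be sketched in Python. This is a rough equivalent under stated assumptions, not the code in the repo: `map_functions` and `FUNC_PATTERN` are names invented here, and the pattern copies the grep above, so it only sees declarations that start a line with `Function` (it will miss `Static Function`, for example).

```python
import re
from pathlib import Path

# Same idea as the first grep: a line that starts with "Function" (after
# optional whitespace), then whitespace, the name and an opening bracket.
FUNC_PATTERN = re.compile(r"^[ \t]*Function\S*[ \t]+([A-Za-z_0-9]+)\(")

def map_functions(folder="."):
    """Return {function name: [ipf files that declare it]} for one folder."""
    found = {}
    for ipf in sorted(Path(folder).glob("*.ipf")):
        text = ipf.read_text(errors="replace")
        for line in text.splitlines():
            match = FUNC_PATTERN.match(line)
            if match:
                found.setdefault(match.group(1), []).append(ipf.name)
    return found

if __name__ == "__main__":
    # One tab-separated line per function, ready to be saved as a tsv.
    for name, files in sorted(map_functions().items()):
        print(name, ", ".join(files), sep="\t")
```

A function that turns up in more than one file is exactly the "I had previously written the same thing" case, so those entries are the ones worth checking first.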

EDIT: I did a bit more work on this idea and it has now expanded to its own repo. Briefly, funcfile.txt is converted to tsv and then parsed – using Igor – to json. This can be displayed using some d3.js magic.

Part of a series with code snippets and tips.

## Adventures in Code IV: correcting filenames

A large amount of the time spent on data analysis goes on cleaning, importing, reorganising and generally getting data ready to analyse rather than actually analysing it. I’ve been trying to get the idea over to non-coders in the group that strict naming conventions (for example) are important and very helpful to the poor person who has to deal with the data.

Things have improved a lot and datasets that used to take a few hours to clean up are now pretty much straightforward. A recent example is shown here. Almost 200 subconditions are plotted out and there is only one missing graph. I suspect the blood sugar levels were getting low in the person generating the data… the cause was a hyphen in the filename and not an underscore.

These data are read into Igor from CSVs outputted from Imaris. Here comes the problem: the folder and all files within it have the incorrect name.

There are 35 files in each folder and clearly this needs a computer to fix, even if it were just one folder’s worth at fault. The quickest way is to use the terminal and there are lots of ways to do it.

Now, as I said the problem is that the foldername and filenames both need correcting. Most terminal commands you can quickly find online actually fail because they try to rename the file and folder at the same time, and since the folder with the new name doesn’t exist… you get an error.

The solution is to rename the folders first and then the files.


```bash
find . -type d -maxdepth 2 -name "oldstring*" | while read FNAME; do mv "$FNAME" "${FNAME//oldstring/newstring}"; done
find . -type f -maxdepth 3 -name "oldstring*.csv" | while read FNAME; do mv "$FNAME" "${FNAME//oldstring/newstring}"; done
```



A simple tip, but effective and useful. HT this gist
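The same folders-first, files-second order can be sketched in Python. This is an illustration under assumptions rather than a drop-in replacement: `rename_matching` is a made-up name, it matches the old string anywhere in a name (not just at the start), and it has no depth limit or `.csv` filter like the `find` commands have.

```python
from pathlib import Path

def rename_matching(root, old, new):
    """Rename folders, then files, whose names contain `old`."""
    root = Path(root)
    renamed = []
    # Pass 1: folders, deepest first so parent paths stay valid while we work.
    folders = [p for p in root.rglob("*") if p.is_dir() and old in p.name]
    for folder in sorted(folders, key=lambda p: len(p.parts), reverse=True):
        target = folder.with_name(folder.name.replace(old, new))
        folder.rename(target)
        renamed.append(target)
    # Pass 2: files, searched for again now that the folders have new names.
    files = [p for p in root.rglob("*") if p.is_file() and old in p.name]
    for file in files:
        target = file.with_name(file.name.replace(old, new))
        file.rename(target)
        renamed.append(target)
    return renamed
```

Only the last part of each path is changed in any one rename, so the target folder always exists at the moment a file is moved. Try it on a copy of the data first.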

Part of a series on computers and coding

## Adventures in Code III: the quantixed ImageJ Update site

We have some macros for ImageJ/FIJI for making figures and blind analysis which could be useful to others.

I made an ImageJ Update Site so that the latest versions can be pushed out to the people in the lab, but this also gives the opportunity to share our code with the world. Feel free to add the quantixed ImageJ update site to your ImageJ or FIJI installation. Details of how to do that are here.

The code is maintained in this GitHub repo, which has a walkthrough for figure-making in the README. So, if you’d like to make figures the quantixed way, adding ROIs and zooms, then feel free to give this code a try. Please raise any issues there or get in touch some other way.

Disclaimer: this code is under development. I offer no guarantees to its usefulness. I am not responsible for data loss or injury that may result from its use!

Update @ 10:35 2016-12-20 I should point out that other projects already exist to make figures (MagicMontage, FigureJ, ScientiFig). These projects are fine but they didn’t do what I wanted, so I made my own.

## Tips from the blog X: multi-line commenting in Igor

This is part-tip, part-adventures in code. I found out recently that it is possible to comment out multiple lines of code in Igor and thought I’d put this tip up here.

Multi-line commenting in programming is useful for two reasons:

1. writing comments (instructions, guidance) that last more than one line
2. the ability to temporarily remove a block of code while testing

In each computer language there is the ability to comment out at least one line of code.

In Igor this is “//”, which comments out the whole line, but no more.

This is the same as in ImageJ macro language.

Now, to comment out whole sections in FIJI/ImageJ is easy. Insert “/*” where you want the comment to start, and then “*/” where it ends, multiple lines later.

I didn’t think this syntax was available in Igor, and it isn’t really. I was manually adding “//” for each line I wanted to remove, which was annoying. It turns out that you can use Edit > Commentize to add “//” to the start of all selected lines. The keyboard shortcut in IP7 is Cmd-/. You can reverse the process with Edit > Decommentize or Cmd-\.

There is actually another way. Igor can conditionally compile code. This is useful if, for example, you write for Igor 7 and Igor 6: you can get compilation of IP7 commands only if the user is running IP7. The same logic can be used to comment out code, by wrapping the block between “#if 0” and “#endif”.

The condition “#if 0” is never satisfied, so the code does not compile. The equivalent statement for IP7-specific compilation is “#if igorversion()>=7”.

So there you have it, two ways to comment out code in Igor. These tips were from IgorExchange.

If you want to read more about commenting in different languages and the origins of comments, read here.

This post is part of a series of tips.

## The International Language of Screaming

A couple of recent projects have meant that I had to get to grips more seriously with R and with MATLAB. Regular readers will know that I am a die-hard IgorPro user. Trying to tackle a new IDE is a frustrating experience, as anyone who has tried to speak a foreign language will know. The speed with which you can do stuff (or get your point across) is very slow. Not only that, but… if you could just revert to your mother tongue it would be so much easier…

What I needed was something like a Babel Fish. As I’m sure you’ll know, this fish is the creation of Douglas Adams. It allows instant translation of any language. The only downside is that you have to insert the fish into your ear.

The closest thing to the Babel Fish in computing is the cheat sheet. These sheets are typically a huge list of basic commands that you’ll need as you get going. I found a nice page which had cheat sheets which allowed easy interchange between R, MATLAB and python. There was no Igor version. Luckily, a user on IgorExchange had taken the R and MATLAB page and added some Igor commands. This was good, but it was a bit rough and incomplete. I took this version, formatted it for GitHub flavored markdown, and made some edits.

The repo is here. I hope it’s useful for others. I learned a lot putting it together. If you are an experienced user of R, MATLAB or IGOR (or better still can speak one or more of these languages), please fork and make edits or suggest changes via GitHub issues, or by leaving a comment on this page if you are not into GitHub. Thanks!

R-MATLAB-IGOR-CheatSheet

Here is a little snapshot to whet your appetite. Bon appetit!

The post title is taken from “The International Language of Screaming” by Super Furry Animals from their Radiator LP. Released as a single, the flip-side had a version called NoK which featured the backing track to the single. Gruff sings the Welsh alphabet with no letter K.

## Adventures in code II

I needed to generate a uniform random distribution of points inside a circle and, later, a sphere. This is part of a bigger project, but the code to do this is kind of interesting. There were no solutions available for IgorPro, but stackexchange had plenty of examples in python and mathematica. There are many ways to do this. The most popular seems to be to generate a uniform random set of points in a square or cube and then discard those that are greater than the radius away from the origin. I didn’t like this idea, because I needed to extend it to spheroids eventually, and as I saw it the computation time saved was minimal.

Here is the version for points in a circle (radius = 1, centred on the origin).

This gives a nice set of points, 1000 shown here.
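One rejection-free way to do this, sketched in Python rather than Igor, is to draw the angle uniformly and take the square root of a uniform number for the radius. This is not necessarily the method used in my Igor function; `points_in_circle` and its parameters are names made up for the illustration.

```python
import math
import random

def points_in_circle(n, radius=1.0, rng=random):
    """Return n (x, y) points spread uniformly over a disc centred on the origin."""
    points = []
    for _ in range(n):
        # The area inside radius r grows as r^2, so the square root of a
        # uniform number gives radii that fill the disc evenly.
        r = radius * math.sqrt(rng.random())
        theta = 2.0 * math.pi * rng.random()
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

if __name__ == "__main__":
    for x, y in points_in_circle(5, rng=random.Random(1)):
        print(round(x, 3), round(y, 3))
```

Without the square root the points bunch up near the centre, which is the usual way this goes wrong.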

And here is the version inside a sphere. This code has variable radius for the sphere.

The three waves (xw,yw,zw) can be concatenated and displayed in a Gizmo. The code just plots out the three views.
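The sphere version follows the same logic with a cube root for the radius. Again this is a Python sketch under my own naming (`points_in_sphere`, `radius`, `rng`), not a port of the Igor function.

```python
import math
import random

def points_in_sphere(n, radius=1.0, rng=random):
    """Return n (x, y, z) points spread uniformly through a sphere at the origin."""
    points = []
    for _ in range(n):
        # Volume grows as r^3, so use the cube root of a uniform number.
        r = radius * rng.random() ** (1.0 / 3.0)
        # A uniform cos(phi) spreads points evenly over each spherical shell.
        cos_phi = 2.0 * rng.random() - 1.0
        sin_phi = math.sqrt(1.0 - cos_phi * cos_phi)
        theta = 2.0 * math.pi * rng.random()
        points.append((r * sin_phi * math.cos(theta),
                       r * sin_phi * math.sin(theta),
                       r * cos_phi))
    return points

if __name__ == "__main__":
    pts = points_in_sphere(1000, radius=2.0, rng=random.Random(1))
    # Split into three coordinate lists, like the xw, yw and zw waves.
    xw, yw, zw = zip(*pts)
    print(len(xw), len(yw), len(zw))
```

Scaling the three coordinates by different factors afterwards would be one route to the spheroid case.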

My code uses the form var + enoise(var) to get a positive random variable. Because enoise(var) goes from -var to +var, this expression spans 0 to 2 × var, so 0.5 + enoise(0.5) gives a value between 0 and 1. There is an interesting discussion about whether this is a truly flat PDF here.

This is part of a bigger project where I’ve had invaluable help from Tom Honnor from Statistics.

This post is part of a series on esoterica in computer programming.

## Adventures in code

An occasional series in esoteric programming issues.

As part of a larger analysis project I needed to implement a short program to determine the closest distance of two line segments in 3D space. This will be used to sort out which segments to compare… like I say, part of a bigger project. The best method to do this is to find the closest distance one segment is to the other when the other one is represented as an infinite line. You can then check if that point is beyond the segment; if it is, you use the limits of the segment to calculate the distance. There’s a discussion on stackoverflow here. The solutions point to one in C++ and one in MATLAB. The C++ version is easiest to port to Igor due to the similarity of languages, but the explanation of the MATLAB code was more approachable. So I ported that to Igor to figure out how it works.

The MATLAB version is:

```matlab
>> P = [-0.43256      -1.6656      0.12533]
P =
-0.4326   -1.6656    0.1253
>> Q = [0.28768      -1.1465       1.1909]
Q =
0.2877   -1.1465    1.1909
>> R = [1.1892    -0.037633      0.32729]
R =
1.1892   -0.0376    0.3273
>> S = [0.17464     -0.18671      0.72579]
S =
0.1746   -0.1867    0.7258
>> N = null(P-Q)
N =
-0.3743   -0.7683
0.9078   -0.1893
-0.1893    0.6115
>> r = (R-P)*N
r =
0.8327   -1.4306
>> s = (S-P)*N
s =
1.0016   -0.3792
>> n = (s - r)*[0 -1;1 0];
>> n = n/norm(n);
>> n
n =
0.9873   -0.1587
>> d = dot(n,r)
d =
1.0491
>> d = dot(n,s)
d =
1.0491
>> v = dot(s-r,d*n-r)/dot(s-r,s-r)
v =
1.2024
>> u = (Q-P)'\((S - (S*N)*N') - P)'
u =
0.9590
>> P + u*(Q-P)
ans =
0.2582   -1.1678    1.1473
>> norm(P + u*(Q-P) - S)
ans =
1.0710
```


and in IgorPro:

```igor
Function MakeVectors()
    Make/O/D/N=(1,3) matP={{-0.43256},{-1.6656},{0.12533}}
    Make/O/D/N=(1,3) matQ={{0.28768},{-1.1465},{1.1909}}
    Make/O/D/N=(1,3) matR={{1.1892},{-0.037633},{0.32729}}
    Make/O/D/N=(1,3) matS={{0.17464},{-0.18671},{0.72579}}
End

Function DoCalcs()
    WAVE matP,matQ,matR,matS
    // N = null(P-Q)
    MatrixOp/O tempMat = matP - matQ
    MatrixSVD tempMat
    Make/O/D/N=(3,2) matN
    Wave M_VT
    matN = M_VT[p][q+1]
    // r = (R-P)*N
    MatrixOp/O tempMat2 = (matR - matP)
    MatrixMultiply tempMat2, matN
    Wave M_product
    Duplicate/O M_product, mat_r
    // s = (S-P)*N
    MatrixOp/O tempMat2 = (matS - matP)
    MatrixMultiply tempMat2, matN
    Duplicate/O M_product, mat_s
    // n = (s - r)*[0 -1;1 0]
    Make/O/D/N=(2,2) MatUnit
    matUnit = {{0,1},{-1,0}}
    MatrixOp/O tempMat2 = (mat_s - mat_r)
    MatrixMultiply tempMat2,MatUnit
    Duplicate/O M_Product, Mat_n
    Variable nn
    nn = norm(mat_n)
    MatrixOP/O new_n = mat_n / nn
    //new_n is now a vector with unit length
    Variable dd
    dd = MatrixDot(new_n,mat_r)
    //print dd
    //dd = MatrixDot(new_n,mat_s)
    //print dd
    dd = abs(dd)
    // now find v
    // v = dot(s-r,d*n-r)/dot(s-r,s-r)
    variable vv
    MatrixOp/O mat_s_r = mat_s - mat_r
    MatrixOp/O tempmat2 = dd * new_n - mat_r
    vv = MatrixDot(mat_s_r,tempmat2) / MatrixDot(mat_s_r,mat_s_r)
    //print vv
    //because vv > 1 then closest point is s (because rs(1) = s) and therefore closest point on RS to infinite line PQ is S
    //what about the point on PQ is this also outside the segment?
    // u = (Q-P)'\((S - (S*N)*N') - P)'
    variable uu
    MatrixOp/O matQ_P = matQ - matP
    MatrixTranspose matQ_P
    //MatrixOP/O tempMat2 = ((matS - (matS * matN) * MatrixTranspose(MatN)) - MatrixTranspose(matP))
    Duplicate/O MatN, matNprime
    MatrixTranspose matNprime
    MatrixMultiply matS, matN
    Duplicate/O M_Product, matSN
    MatrixMultiply M_Product, matNprime
    MatrixOP/O tempMat2 = ((matS - M_product) - matP)
    MatrixTranspose tempMat2
    MatrixLLS matQ_P tempMat2
    Wave M_B
    uu = M_B[0]
    // find point on PQ that is closest to RS
    // P + u*(Q-P)
    MatrixOp/O matQ_P = matQ - matP
    MatrixOp/O matPoint = MatP + (uu * MatQ_P)
    MatrixOP/O distpoint = matPoint - matS
    Variable dist
    dist = norm(distpoint)
    Print dist
End
```


The sticking points were finding the Igor equivalents of

• null()
• norm()
• dot()
• \ otherwise known as mldivide

Which are:

• MatrixSVD (answer is in the final two columns of wave M_VT)
• norm()
• MatrixDot()
• MatrixLLS

MatrixLLS wouldn’t accept a mix of single-precision and double-precision waves, so this needed to be factored into the code.

As you can see, the Igor code is much longer. Overall, I think MATLAB handles Matrix Math better than Igor. It is certainly easier to write. I suspect that there are a series of Igor operations that can do what I am trying to do here, but this was an exercise in direct porting.

More work is needed to condense this down and also deal with every possible case. Then it needs to be incorporated into the bigger program. SIMPLE! Anyway, hope this helps somebody.
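As a pointer towards handling every case, here is a compact Python sketch of the standard clamped closest-point calculation for two segments. It is a different route from the null-space method ported above, not a translation of it, and the names (`segment_distance`, `u`, `v`, `eps`) are my own choices for the illustration. It treats each segment as P + u(Q-P) and R + v(S-R), solves for the infinite lines, then clamps u and v back to the range 0 to 1.

```python
import math

def _sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _clamp(x):
    return max(0.0, min(1.0, x))

def segment_distance(p, q, r, s, eps=1e-12):
    """Closest distance between segments PQ and RS, plus the two closest points."""
    d1 = _sub(q, p)                 # direction of PQ
    d2 = _sub(s, r)                 # direction of RS
    w = _sub(p, r)
    a, e, f = _dot(d1, d1), _dot(d2, d2), _dot(d2, w)
    if a <= eps and e <= eps:       # both segments are single points
        u = v = 0.0
    elif a <= eps:                  # PQ is a single point
        u, v = 0.0, _clamp(f / e)
    else:
        c = _dot(d1, w)
        if e <= eps:                # RS is a single point
            u, v = _clamp(-c / a), 0.0
        else:
            b = _dot(d1, d2)
            denom = a * e - b * b   # zero when the segments are parallel
            u = _clamp((b * f - c * e) / denom) if denom > eps else 0.0
            v = (b * u + f) / e
            if v < 0.0:             # closest point lies before R: use R
                u, v = _clamp(-c / a), 0.0
            elif v > 1.0:           # closest point lies beyond S: use S
                u, v = _clamp((b - c) / a), 1.0
    on_pq = tuple(pi + u * di for pi, di in zip(p, d1))
    on_rs = tuple(ri + v * di for ri, di in zip(r, d2))
    return math.dist(on_pq, on_rs), on_pq, on_rs

if __name__ == "__main__":
    P = (-0.43256, -1.6656, 0.12533)
    Q = (0.28768, -1.1465, 1.1909)
    R = (1.1892, -0.037633, 0.32729)
    S = (0.17464, -0.18671, 0.72579)
    print(segment_distance(P, Q, R, S))
```

With the four vectors from the example, the unclamped v is greater than 1, so the closest point on RS is S, which is the same conclusion reached in the comments of the Igor code.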

The post title is taken from the band Adventures In Stereo.