## The Sound of Clouds: wordcloud of tweets using R

Another post using R and looking at Twitter data.

As I was typing out a tweet, I had the feeling that my vocabulary is a bit limited. Papers I tweet about are either “great”, “awesome” or “interesting”. I wondered what my most frequently tweeted words are.

Like the last post you can (probably) do what I’ll describe online somewhere, but why would you want to do that when you can DIY in R?

First, I requested my tweets from Twitter. I wasn’t sure of the limits of rtweet for retrieving tweets and the request only takes a few minutes. This gives you a download of everything including a csv of all your tweets. The text of those tweets is in a column called ‘text’.


## for text mining
library(tm)
## for building a corpus
library(SnowballC)
## for making wordclouds
library(wordcloud)
tweets <- read.csv('tweets.csv', stringsAsFactors = FALSE)
## make a corpus of the text of the tweets
tCorpus <- Corpus(VectorSource(tweets$text)) ## remove all the punctation from tweets tCorpus <- tm_map(tCorpus, removePunctuation) ## good idea to remove stopwords: high frequency words such as I, me and so on tCorpus <- tm_map(tCorpus, removeWords, stopwords('english')) ## next step is to stem the words. Means that talking and talked become talk tCorpus <- tm_map(tCorpus, stemDocument) ## now display your wordcloud wordcloud(tCorpus, max.words = 100, random.order = FALSE)  For my @clathrin account this gave: So my most tweeted word is paper, followed by cell and lab. I’m quite happy about that. I noticed that great is also high frequency, which I had a feeling would also be the case. It looks like @christlet, @davidsbristol, @jwoodgett and @cshperspectives are among my frequent twitterings, this is probably a function of the length of time we’ve been using twitter. The cloud was generated using 10.9K tweets over seven years, it might be interesting to look at any changes over this time… The cloud is a bit rough and ready. Further filtering would be a good idea, but this quick exercise just took a few minutes. The post title comes from “The Sound of Clouds” by The Posies from their Solid States LP. ## I’m not following you: Twitter data and R I wondered how many of the people that I follow on Twitter do not follow me back. A quick way to look at this is with R. OK, a really quick way is to give a 3rd party application access rights to your account to do this for you, but a) that isn’t safe, b) you can’t look at anyone else’s data, and c) this is quantixed – doing nerdy stuff like this is what I do. Now, the great thing about R is the availability of well-written packages to do useful stuff. I quickly found two packages twitteR and rtweet that are designed to harvest Twitter data. I went with rtweet and there were some great guides to setting up OAuth and getting going. The code below set up my environment and pulled down lists of my followers and my “friends”. I’m looking at my main account and not the quantixed twitter account.  library(rtweet) library(httpuv) ## setup your appname,api key and api secret appname <- "whatever_name" key <- "blah614h" secret <- "blah614h" ## create token named "twitter_token" twitter_token <- create_token( app = appname, consumer_key = key, consumer_secret = secret) clathrin_followers <- get_followers("clathrin", n = "all") clathrin_followers_names <- lookup_users(clathrin_followers) clathrin_friends <- get_friends("clathrin") clathrin_friends_names <- lookup_users(clathrin_friends)  The terminology is that people that follow me are called Followers and people that I follow are called Friends. These are the terms used by Twitter’s API. I have almost 3000 followers and around 1200 friends. This was a bit strange… I had fewer followers with data than actual followers. Same for friends: missing a few hundred in total. I extracted a list of the Twitter IDs that had no data and tried a few other ways to look them up. All failed. I assume that these are users who have deleted their account (and the Twitter ID stays reserved) or maybe they are suspended for some reason. Very strange.  ## noticed something weird ## look at the twitter ids of followers and friends with no data missing_followers <- setdiff(clathrin_followers$user_id,clathrin_followers_names$user_id) missing_friends <- setdiff(clathrin_friends$user_id,clathrin_friends_names$user_id) ## find how many real followers/friends are in each set aub <- union(clathrin_followers_names$user_id,clathrin_friends_names$user_id) anb <- intersect(clathrin_followers_names$user_id,clathrin_friends_names$user_id) ## make an Euler plot to look at overlap fit <- euler(c( "Followers" = nrow(clathrin_followers_names) - length(anb), "Friends" = nrow(clathrin_friends_names) - length(anb), "Followers&amp;Friends" = length(anb))) plot(fit) plot(fit)  In the code above, I arranged in sets the “real Twitter users” who follow me or I follow them. There was an overlap of 882 users, leaving 288 Friends who don’t follow me back – boo hoo! I next wanted to see who these people are, which is pretty straightforward.  ## who are the people I follow who don't follow me back bonly <- setdiff(clathrin_friends_names$user_id,anb)
no_follow_back <- lookup_users(bonly)



Looking at no_follow_back was interesting. There are a bunch of announcement accounts and people with huge follower counts that I wasn’t surprised do not follow me back. There are a few people on the list with whom I have interacted yet they don’t follow me, which is a bit odd. I guess they could have unfollowed me at some point in the past, but my guess is they were never following me in the first place. It used to be the case that you could only see tweets from people you followed, but the boundaries have blurred a lot in recent years. An intermediary only has to retweet something you have written for someone else to see it and you can then interact, without actually following each other. In fact, my own Twitter experience is mainly through lists, rather than my actual timeline. And to look at tweets in a list you don’t need to follow anyone on there. All of this led me to thinking: maybe other people (who follow me) are wondering why I don’t follow them back… I should look at what I am missing out on.

## who are the people who follow me but I don't follow back
cat a.csv b.csv c.csv > $OF  To crunch the data I wrote something in Igor which reads in the CSVs and plotted out my data. This meant first getting a list of clusterIDs which correspond to my papers in order to filter out other people’s work. I have a surprising number of tracks in my library with Rollercoaster in the title. I will go with indie wannabe act Northern Uproar for the title of this post. “What goes up (must come down)” is from Graham & Brown’s Super Fresco wallpaper ad from 1984. “Please please tell me now” is a lyric from Duran Duran’s “Is There Something I Should Know?”. ## The Second Arrangement To validate our analyses, I’ve been using randomisation to show that the results we see would not arise due to chance. For example, the location of pixels in an image can be randomised and the analysis rerun to see if – for example – there is still colocalisation. A recent task meant randomising live cell movies in the time dimension, where two channels were being correlated with one another. In exploring how to do this automatically, I learned a few new things about permutations. Here is the problem: If we have two channels (fluorophores), we can test for colocalisation or cross-correlation and get a result. Now, how likely is it that this was due to chance? So we want to re-arrange the frames of one channel relative to the other such that frame i of channel 1 is never paired with frame i of channel 2. This is because we want all pairs to be different to the original pairing. It was straightforward to program this, but I became interested in the maths behind it. The maths: Rearranging n objects is known as permutation, but the problem described above is known as Derangement. The number of permutations of n frames is n!, but we need to exclude cases where the ith member stays in the ith position. It turns out that to do this, you need to use the principle of inclusion and exclusion. If you are interested, the solution boils down to $$n!\sum_{k=0}^{n}\frac{(-1)^k}{k!}$$ Which basically means: for n frames, there are n! number of permutations, but you need to subtract and add diminishing numbers of different permutations to get to the result. Full description is given in the wikipedia link. Details of inclusion and exclusion are here. I had got as far as figuring out that the ratio of permutations to derangements converges to e. However, you can tell that I am not a mathematician as I used brute force calculation to get there rather than write out the solution. Anyway, what this means in a computing sense, is that if you do one permutation, you might get a unique combination, with two you’re very likely to get it, and by three you’ll certainly have it. Back to the problem at hand. It occurred to me that not only do we not want frame i of channel 1 paired with frame i of channel 2 but actually it would be preferable to exclude frames i ± 2, let’s say. Because if two vesicles are in the same location at frame i they may also be colocalised at frame i-1 for example. This is more complex to write down because for frames 1 and 2 and frames n and n-1, there are fewer possibilities for exclusion than for all other frames. For all other frames there are n-5 legal positions. This obviously sets a lower limit for the number of frames capable of being permuted. The answer to this problem is solved by rook polynomials. You can think of the original positions of frames as columns on a n x n chess board. The rows are the frames that need rearranging, excluded positions are coloured in. Now the permutations can be thought of as Rooks in a chess game (they can move horizontally or vertically but not diagonally). We need to work out how many arrangements of Rooks are possible such that there is one rook per row and such that no Rook can take another. If we have an 7 frame movie, we have a 7 x 7 board looking like this (left). The “illegal” squares are coloured in. Frame 1 must go in position D,E,F or G, but then frame 2 can only go in E, F or G. If a rook is at E1, then we cannot have a rook at E2. And so on. To calculate the derangements: $$1 + 29 x + 310 x^2 + 1544 x^3 + 3732 x^4 + 4136 x^5 + 1756 x^6 + 172 x^7$$ This is a polynomial expansion of this expression: $$R_{m,n}(x) = n!x^nL_n^{m-n}(-x^{-1})$$ where $$L_n^\alpha(x)$$ is an associated Laguerre polynomial. The solution in this case is 8 possibilities. From 7! = 5040 permutations. Of course our movies have many more frames and so the randomisation is not so limited. In this example, frame 4 can only either go in position A or G. Why is this important? The way that the randomisation is done is: the frames get randomised and then checked to see if any “illegal” positions have been detected. If so, do it again. When no illegal positions are detected, shuffle the movie accordingly. In the first case, the computation time per frame is constant, whereas in the second case it could take much longer (because there will be more rejections). In the case of 7 frames, with the restriction of no frames at i ±2, then the failure rate is 5032/5040 = 99.8%. Depending on how the code is written, this can cause some (potentially lengthy) wait time. Luckily, the failure rate comes down with more frames. What about it practice? The numbers involved in directly calculating the permutations and exclusions quickly becomes too big using non-optimised code on a simple desktop setup (a 12 x 12 board exceeds 20 GB). The numbers and rates don’t mean much, what I wanted to know was whether this slows down my code in a real test. To look at this I ran 100 repetitions of permutations of movies with 10-1000 frames. Whereas with the simple derangement problem permutations needed to be run once or twice, with greater restrictions, this means eight or nine times before a “correct” solution is found. The code can be written in a way that means that this calculation is done on a placeholder wave rather than the real data and then applied to the data afterwards. This reduces computation time. For movies of around 300 frames, the total run time of my code (which does quite a few things besides this) is around 3 minutes, and I can live with that. So, applying this more stringent exclusion will work for long movies and the wait times are not too bad. I learned something about combinatorics along the way. Thanks for reading! Further notes The first derangement issue I mentioned is also referred to as the hat-check problem. Which refers to people (numbered 1,2,3 … n) with corresponding hats (labelled 1,2,3 … n). How many ways can they be given the hats at random such that they do not get their own hat? Adding i+1 as an illegal position is known as problème des ménages. This is a problem of how to seat married couples so that they sit in a man-woman arrangement without being seated next to their partner. Perhaps i ±2 should be known as the vesicle problem? The post title comes from “The Second Arrangement” by Steely Dan. An unreleased track recorded for the Gaucho sessions. ## Adventures in Code V: making a map of Igor functions I’ve generated a lot of code for IgorPro. Keeping track of it all has got easier since I started using GitHub – even so – I have found myself writing something only to discover that I had previously written the same thing. I was thinking that it would be good to make a list of all functions that I’ve written to locate long lost functions. This question was brought up on the Igor mailing list a while back and there are several solutions – especially if you want to look at dependencies. However, this two liner works to generate a file called funcfile.txt which contains a list of functions and the ipf file that they are appear in. grep "^[ \t]*Function" *.ipf | grep -oE '[ \t]+[A-Za-z_0-9]+\(' | tr -d " " | tr -d "(" > output for i in cat output; do grep -ie "$i" *.ipf | grep -w "Function" >> funcfile.txt ; done


Thanks to Thomas Braun on the mailing list for the idea. I have converted it to work on grep (BSD grep) 2.5.1-FreeBSD which runs on macOS. Use the terminal, cd to the directory containing your ipf files and run it. Enjoy!

EDIT: I did a bit more work on this idea and it has now expanded to its own repo. Briefly, funcfile.txt is converted to tsv and then parsed – using Igor – to json. This can be displayed using some d3.js magic.

Part of a series with code snippets and tips.

## Realm of Chaos

Caution: this post is for nerds only.

I watched this numberphile video last night and was fascinated by the point pattern that was created in it. I thought I would quickly program my own version to recreate it and then look at patterns made by more points.

I didn’t realise until afterwards that there is actually a web version of the program used in the video here. It is a bit limited though so my code was still worthwhile.

A fractal triangular pattern can be created by:

1. Setting three points
2. Picking a randomly placed seed point
3. Rolling a die and going halfway towards the result
4. Repeat last step

If the first three points are randomly placed the pattern is skewed, so I added the ability to generate an equilateral triangle. Here is the result.

and here are the results of a triangle through to a decagon.

All of these are generated with one million points using alpha=0.25. The triangle, pentagon and hexagon make nice patterns but the square and polygons with more than six points make pretty uninteresting patterns.

Watching the creation of the point pattern from a triangular set is quite fun. This is 30000 points with a frame every 10 points.

Here is the code.

Some other notes: this version runs in IgorPro. In my version, the seed is set at the centre of the image rather than a random location. I used the random allocation of points rather than a six-sided dice.

The post title is taken from the title track from Bolt Thrower’s “Realm of Chaos”.

## Notes To The Future

Previously I wrote about our move to electronic lab notebooks (ELNs). This post contains the technical details to understand how it works for us. You can even replicate our setup if you want to take the plunge.

Why go electronic?

Many reasons: I wanted to be able to quickly find information in our lab books. I wanted lab members to be able to share information more freely. I wanted to protect against loss of a notebook. I think switching to ELNs is inevitable and not only that I needed to do something about the paper notebooks: my group had amassed 100 in 10 years.

We took the plunge and went electronic. To recap, I decided to use WordPress as a platform for our ELN.

Getting started

We had a Linux box on which I could install WordPress. This involved installing phpMyAdmin and registering a mySQL database and then starting up WordPress. If that sounds complicated, it really isn’t. I simply found a page on the web with step-by-step instructions for my box. You could run this on an old computer or even on a Raspberry Pi, it just has to be on a local network.

Next, I set myself up as admin and then created a user account for each person in the lab. Users can have different privileges. I set all people in the lab to Author. This means they can make, edit and delete posts. Being an Author is better than the other options (Contributor or Editor) which wouldn’t work for users to make entries, e.g. Contributors cannot upload images. Obviously authors being able to delete posts is not acceptable for an ELN, so I removed this capability with a plugin (see below).

I decided that we would all write in the same ELN. This makes searching the contents much easier for me, the PI. The people in the lab were a bit concerned about this because they were each used to having their own lab book. It would be possible to set up a separate ELN for each person but this would be too unwieldy for the PI, so I grouped everyone together. However, it doen’t feel like writing in a communal notebook because each Author of a post is identifiable and so it is possible to look at the ELN of just one user as a “virtual lab book”. To do this easily, you need a plugin (see below).

If we lost the WP installation it would be a disaster, so I setup a backup. This is done locally with a plugin (see below). Additionally, I set up an rsync routine from the box that goes off weekly to our main lab server. Our main lab server uses ZFS and is backed up to a further geographically distinct location. So this is pretty indestructible (if that statement is not tempting fate…). The box has a RAID6 array of disks but in the case of hardware failure plus corruption and complete loss of the array, we would lose one week of entries at most.

Theme

We tried out a few before settling on one that we liked. We might change and tweak this more as we go on.

The one we liked was called gista. It looks really nice, like a github page. It is no longer maintained unfortunately. Many of the other themes we looked at have really big fonts for the posts, which gives a really bloggy look, but is not conducive to a ELN.

Two things needed tweaking for gitsta to be just right: I wanted the author name to be visible directly after the title and I didn’t want comments to show up. This meant editing the content.php file. Finally, the style.css file needs changing to have the word gista-child in the comments, to allow it to get dependencies from gitsta and to show up in your list of themes to select.

The editing is pretty easy, since there are lots of guides online for doing this. If you just want to download our edited version to try it, you can get it from here (I might make some more changes in the future). If you want to use it, just download it, rename the directory as gitsta-child and then place it in WordPress/wp-content/themes/ of your installation – it should be good to go!

Plugins

As you saw above, I installed a few plugins which are essential for full functionality

• My Private Site – this plugin locks off the site so that only people with a login can access the site. Our ELN is secure – note that this is not a challenge to try to hack us – it sits inside our internal network and as such is not “on the internet”. Nonetheless, anyone with access to the network who could find the IP could potentially read our ELN. This plugin locks off access to everyone not in our lab.
• Authors Widget – this plugin allows the addition of a little menu to the sidebar (widget) allowing the selection of posts by one author. This allows us to switch between virtual labbooks for each lab member. Users can bookmark their own Author name so that they only see their labbook if they want.
• Capability Manager Enhanced – you can edit rights of each level of user or create new levels of user. I used this to remove the ability to delete posts.
• BackWPup – this allows the local backup of all WP content. It’s highly customisable and is recommended.

Other plugins which are non-essential-but-useful

• WP Statistics – this is a plugin that allows admin to see how many visits etc the ELN has had that day/week etc. This one works on a local installation like ours. Others will not work because they require the site to be on the internet.
• WP-Markdown – this allows you to write your posts in md. I like writing in md, nobody in my lab uses this function.

Gitsta wants to use gust rather than the native WP dashboard. But gust and md were too complicated for our needs, so I uninstalled gust.

Using the ELN

Lab members/users/authors make “posts” for each lab book entry. This means we have formalised how lab book entries are done. We already had a guide for best practice for labbook entries in our lab manual which translates wonderfully to the ELN. It’s nothing earth-shattering, just that each experiment has a title, aim, methods, results and conclusion (just like we were taught in school!). In a paper notebook this is actually difficult to do because our experiments run for days (sometimes weeks) and many experiments run simultaneously. This means you either have to budget pages in the notebook for each separate experiment, interleave entries (which is not very readable) or write up at the end (which is not best practice). With ELNs you just make one entry for each experiment and update all of them as you go along. Problem solved. Edits are possible and it is possible to see what changes have been made and it is even possible to roll back changes.

Posts are given a title. We have a system in the lab for initials plus numbers for each experiment. This is used for everything associated with that experiment, so the files are easy to find, the films can be located and databases can cross-reference. The ELN also allows us to add categories and tags. So we have wide ranging categories (these are set by admin) and tags which can be more granular. Each post created by an author is identifiable as such, even without the experiment code to the title. So it is possible to filter the view to see posts:

• by one lab member
• on Imaging (or whatever topic)
• by date or in a date range

Of course you can also search the whole ELN, which is the thing I need most of all because it gets difficult to remember who did what and when. Even lab members themselves don’t remember that they did an experiment two or more years previously! So this feature will be very useful in the future.

WordPress allows pictures to be uploaded and links to be added. Inserting images is easy to show examples of how an experiment went. For data that is captured digitally this is a case of uploading the file. For things that are printed out or are a physical thing, i.e. western films or gel doc pictures, we are currently taking a picture and adding these to the post. In theory we can add hard links to data on our server. This is certainly not allowed in many other ELNs for security reasons.

In many ways the ELN is no different to our existing lab books. Our ELN is not on the internet and as such is not accessible from home without VPN to the University. This is analogous to our current set up where the paper lab books have to stay in the lab and are not allowed to be taken home.

Finally, in response to a question on Twitter after the previous ELN post: how do we protect against manipulation? Well previously we followed best practice for paper books. We used hard bound books with numbered pages (ensuring pages couldn’t be removed), Tip-ex was not allowed, edits had to be done in a different colour pen and dated etc. I think the ELN is better in many ways. Posts cannot be deleted, edits are logged and timestamped. User permissions mean I know who has edited what and when. Obviously, as with paper books, if somebody is intent on deception, they can still falsify their own lab records in some way. In my opinion, the way to combat this is regular review of the primary data and also maintaining an environment where people don’t feel like they should deceive.

The post title is taken from “Notes To The Future” by Patti Smith , the version I have is recorded Live in St. Mark’s Church, NYC in 2002 from Land (1975-2002). I thought this was appropriate since a lab note book is essentially notes to your future self. ELNs are also the future of taking notes in the lab.

## The Soft Bulletin: Electronic Lab Notebooks

We finally took the plunge and adopted electronic lab notebook (ELNs) for the lab. This short post describes our choice of software. I will write another post about how it’s going, how I set it up and other technical details.

tl;dr we are using WordPress as our ELN.

First, so you can understand my wishlist of requirements for the perfect ELN.

1. Easy-to-use. Allow adding pictures and notes easily.
2. Versioning (ability to check edits and audit changes)
3. Backup and data security
4. Ability to export and go elsewhere if required
5. Free or low cost
6. Integration with existing lab systems if possible
7. Open software, future development
8. Clarity over who owns the software, who owns the data, and where the information is stored
9. Can be deployed for the entire lab

There are many ELN software solutions available, but actually very few fulfil all of those requirements. So narrowing down the options was quite straightforward in the end. Here is the path I went down.

Evernote

I have used Evernote as my ELN for over a year. I don’t do labwork these days, but I make notes when doing computer programming, data analysis and writing papers. I also use it for personal stuff. I like it a lot, but Evernote is not an ELN solution for a whole lab. First, there is an issue over people using it for work and for personal stuff. How do we archive their lab documents without accessing other data? How do we pay for it? What happens when they leave? These sorts of issues prevent the use of many of the available ELN software packages, for a whole lab. I think many ELN software packages would work well for individuals, but I wanted something to deploy for the whole lab. For example, so that I can easily search and find stuff long after the lab member has left and not have to go into different packages to do this.

OneNote

The next most obvious solution is OneNote from Microsoft. Our University provides free access to this package and so using it would get around any pricing problems. Each lab member could use it with their University identity, separating any problems with work/life. It has some nice features (shared by Evernote) such as photographing documents/whiteboards etc and saving them straight to notes. I know several individuals (not whole labs) using this as their ELN. I’m not a big fan of running Microsoft software on Macs and we are completely Apple native in the lab. Even so, OneNote was a promising solution.

I also looked into several other software packages:

I liked the sound of RSpace, but it wasn’t clear to me who they were, why they wanted to offer a free ELN service and where they would store our data and what they might want to do with it. Last year, the scare that Evernote were going to snoop on users’ data made me realise that when it came to our ELNs – we had to host the data. I didn’t want to trust a company to do this. I also didn’t want to rely on a company to:

• continue to do what we sign up for, e.g. provide a free software
• keep updating the software, e.g.  so that macOS updates don’t kill it
• not sell up to an evil company
• do something else that I didn’t agree with.

As I saw it, this left one option: self-hosting and not only that, there were only two possibilities.

Use a wiki

This is – in many ways – my preferred solution. Wikis have been going for years and they are widely used. I set one up and made a lab notebook entry. It was great. I could edit it and edits were timestamped. It looked OK (but not amazing). There were possibilities to add tables, links etc. However, I thought that doing the code to make an entry would be a challenge for some people in the lab. I know that wikis are everywhere and that editing them is simple, but I kept thinking of the project student that comes to the lab for a short project. They need to read papers to figure out their project, they have to learn to clone/run gels/image cells/whatever AND then they also have to learn to write in a wiki? Just to keep a log of what they are doing? For just a short stay? I could see this meaning that the ELN gets neglected and things didn’t get documented.

I know other labs are using a wiki as an ELN and they do it successfully. It is possible, but I don’t think it would work for us. I also needed to entice people in the lab to convert them from using paper lab notebooks. This meant something that looked nice.

Use WordPress

This option I did not take seriously at first. A colleague told me two years ago that WordPress would be the best platform for an ELN, and I smiled politely. I write this blog on a wordpress dot com platform, but somehow didn’t consider it as an ELN option. After looking for alternatives that we could self-host, it slowly dawned on me that WordPress (a self-hosted installation) actually meets all of the requirements for an ELN.

1. It’s easy-to-use. My father, who is in his 70s, edits a website using WordPress as a platform. So any person working in the lab should be able to do it.
2. Versioning. You can see edits and roll back changes if required. Not as granular as wiki but still good.
3. Backup and data security. I will cover our exact specification in a future post. Our ELN is internal and can’t be accessed from outside the University. We have backup and it is pretty secure. Obviously, self-hosting means that if we have a technical problem, we have to fix it. Although I could move it to new hardware very quickly.
4. Ability to export and go elsewhere if required. It is simple to pack up an xml and move to another platform. The ubiquity of WordPress means that this will always be the case.
5. Free or low cost. WordPress is free and you can have as many users as you like! The hardware has a cost, but we have that hardware anyway.
6. Integration with existing lab systems if possible. We use naming conventions for people’s lab book entries and experiments. Moving to WordPress makes this more formal. Direct links to the primary data on our lab server are possible (not necessarily true of other ELN software).
7. Open software, future development. Again WordPress is ubiquitous and so there are options for themes and plugins to help make it a good ELN. We can also do some development if needed. There is a large community, meaning tweaking the installation is easy to do.
8. Clarity over who owns the software, who owns the data, and where the information is stored. It’s installed on our machines and so we don’t have to worry about this.
9. It can be deployed for the whole lab. Details in the follow-up post.

It also looks good and has a more up-to-date feel to it than a wiki. A screenshot of an innocuous lab notebook entry is shown to the right. I’ve blurred out some details of our more exciting experiments.

It’s early days. I started by getting the newer people in the lab to convert. Anyone who had only a few months left in the lab was excused from using the new system. I’m happy with the way it looks and how it works. We’ll see how it works out.

The main benefits for me are readability and being able to look at what people are doing. I’m looking forward to being able to search back through the entries, as this can be a serious timesuck with paper lab notebooks.

Edit 2017-04-26T07:28:43Z After posting this yesterday a few other suggestions came through that you might want to consider.

Labfolder, I had actually looked at this and it seems good but at 10 euros per user per month, I thought it was too expensive. I get that good software solutions have a cost and am not against paying for good software. I’d prefer a one-off cost (well, of course I’d prefer free!).

Mary Elting alerted me to Shawn Douglas’s lektor-based ELN. Again this ticks all of the boxes I mentioned above.

Manuel Théry suggested ELab. Again, I hadn’t seen this and it looks like it meets the criteria.

The Soft Bulletin is an occasional series of posts about software choices in research. The name comes from The Flaming Lips LP of the same name.

## The Soft Bulletin: PDF organisation

I recently asked on Twitter for any recommendations for software to organise my PDFs. I got several replies, but nothing really fitted the bill. This is a brief summary.

My situation

I have quite a lot of books, textbooks, cheat sheets, manuals, protocols etc. in PDF format and I need a way to organise them. I don’t need to reference this content, I just need to search it and access it quickly – ideally across several devices.

Note: I don’t collect PDFs of research articles. I have a hundred or so articles that were difficult to get hold of, and I keep those, but I’m pretty complacent about my access to scholarly literature.

I currently use Papers2 for storing my PDFs. It’s OK, but there are some bugs in it. Papers3 came out a few years ago, but I didn’t do the upgrade because there are issues with sync across multiple computers. Now it doesn’t look like Papers will be supported in the future. For example, I heard on Twitter that there is no ETA for an issue with Papers3 on Sierra. Future proofing – I’ve come to realise – is important to me as I am pretty loyal to software, I don’t like to change to something else, but I do like new features and innovation.

I don’t need a solution for referencing. I am resigned to using EndNote for that.

Ideally I just want something like iTunes to organise my PDFs, but I don’t want to use iTunes! Perhaps my requirements are too particular and what I want just isn’t available.

The suggestions

Thanks to everyone who made suggestions. Together with other solutions they were (in no particular order):

Zotero

www.zotero.org

I downloaded this and gave it a brief try. PDF import worked well and the UI looked OK. I stumbled on the sync capabilities. I currently sync my computers with Unison and this is complicated (but not impossible) to do for Zotero. They want you to use cloud syncing – which I would probably be OK with. I need to test out which cloud service is best to use. There is a webDAV option which my University supports and I think this would work for me. I think this software is the most likely candidate for me to switch to.

Mendeley

www.mendeley.com

This software got the most recommendations. I have to admit that the Elsevier connection is a huge turn-off for me. Although the irony of using it to organise my almost exclusively Elsevier-free content would be quite nice. I know that most of this type of software has been bought out by the publishing giants (Papers by Springer, EndNote by Thomson Reuters/Clarivate), but I don’t like this and I don’t have to support it if I don’t want to. I didn’t look into sync capabilities here.

Bookends

www.sonnysoftware.com

People rave about this software package for Mac. I like the fact that it has a separate lineage to the other packages. It is very expensive and it is primarily a referencing package. Right now, I’m just looking for something to organise my PDFs and this seems to be overkill.

Evernote

www.evernote.com

I use Evernote as a lab notebook and it is possible to use it to store PDFs. You can make a NoteBook for them, add a Note for each one and attach the PDF. The major plus here is that I already use it (and pay for it). The big negative is that I would prefer a separate standalone package to organise my PDFs. I know, difficult to please aren’t I?

Finder and Spotlight

This is the D.I.Y. option.

I have to say that this is the most appealing in many ways. If you just name PDFs systematically and store them in a folder hierarchy that you organise and tag – it would work. Sync would work with my current solution. Searching with Spotlight would work just as well as any other program. I would not need another program! At some point in the past I organised my PDFs like this. I moved to storing them in Papers so that it would save them in a hierarchical structure for me. This is what I mean by an iTunes-like organiser. An app to name, tag and file-away the PDFs would be ideal. I don’t want to go back to this if I can help it.

Like Mendeley, this is an option that I did not seriously entertain. I think this is too far away from what I want. As I see it, this software is designed as a web extension and paper recommendation service, which is not what I’m looking for.

Papers3

papersapp.com

As mentioned above, the lack of updates to this software and problems with sync mean that I am looking for something else. I really liked Papers2 and would be happy to continue using this if various things like import and editing were improved. I guess the option here is to stay with Papers2 and put up with the little things that annoy me. At some point though there will be a macOS update which breaks it and then I will be stuck.

Endnote
endnote.com

I use Endnote for referencing. I hate Endnote with a passion. But I can use it. I know how to write styles etc. and edit term lists because I’ve used it since something like v3. At some point in the past I began to store papers in Endnote. I stopped doing this and moved to Papers2. I have to admit it’s OK at doing this, although the way it organises the PDFs on disk is a bit strange IMO. I don’t like storing books and other content in my library though so this is not a good solution.

iBooks

Here is a curveball. I use iBooks and Kindle app for reading books in mobi/epub/pdf format. Actually, iBooks works quite well for PDFs and has the ability to sync with other devices. I have a feeling this could work, although some of the PDFs I have are quite bulky and I’d need to figure out a way for them to stay in the cloud and not reside on mobile devices. It’s definitely designed for reading books and not for pulling up the PDF in Preview and quickly finding a specific thing. For this reason I don’t think it would work.

Note that there are other apps for this task. Also, if you search for “PDF” in the App Store, there plenty of other programs aimed at people outside academia. Maybe one of those would be OK.

So what did I do?

I doubt anyone has the precise requirements that I have and so you’re probably not interested in what I decided. However, the simplest thing to do was to import the next batch of PDFs into Papers2 and wait to see if something better comes along. I will try Zotero a bit more when I get some time and see if this is the solution for me.

The post title is taken from The Flaming Lips’ 1999 album “The Soft Bulletin”.

## Adventures in Code IV: correcting filenames

A large amount of time doing data analysis is the process of cleaning, importing, reorganising and generally not actually analysing data but getting it ready to analyse. I’ve been trying to get over the idea to non-coders in the group that strict naming conventions (for example) are important and very helpful to the poor person who has to deal with the data.

Things have improved a lot and dtatsets that used to take a few hours to clean up are now pretty much straightforward. A recent example is shown here. Almost 200 subconditions are plotted out and there is only one missing graph. I suspect the blood sugar levels were getting low in the person generating the data… the cause was a hyphen in the filename and not an underscore.

These data are read into Igor from CSVs outputted from Imaris. Here comes the problem: the folder and all files within it have the incorrect name.

There are 35 files in each folder and clearly this needs a computer to fix, even if it were just one foldersworth at fault. The quickest way is to use the terminal and there are lots of ways to do it.

Now, as I said the problem is that the foldername and filenames both need correcting. Most terminal commands you can quickly find online actually fail because they try to rename the file and folder at the same time, and since the folder with the new name doesn’t exist… you get an error.

The solution is to rename the folders first and then the files.


find . -type d -maxdepth 2 -name "oldstring*" | while read FNAME; do mv "$FNAME" "${FNAME//oldstring/newstring}"; done
find . -type f -maxdepth 3 -name "oldstring*.csv" | while read FNAME; do mv "$FNAME" "${FNAME//oldstring/newstring}"; done



A simple tip, but effective and useful. HT this gist

Part of a series on computers and coding