My Blank Pages III: The Art of Data Science

I recently finished reading The Art of Data Science by Roger Peng & Elizabeth Matsui. Roger, together with Jeff Leek, writes the Simply Statistics blog and he works at JHU with Elizabeth.

The aim of the book is to give a guide to data analysis. It is not meant as a comprehensive data analysis “how to”, nor is it a manual for statistics or programming. Instead it is a high-level guide: how to think about data analysis and how to go about doing it. This makes it an interesting read for anyone working with data.

I think anyone who reads the Simply Statistics blog or who has read the piece Roger and Jeff wrote for Science will be familiar with a lot of the content in here. At the beginning of the book, I didn’t feel like I learned too much. However, I can see that the “converted” are maybe not the target audience here. Towards the end of the book, the authors walk through a few examples of how to analyse some data, focussing on the question in mind, how to refine it and then how to start the analysis. This is the most useful aspect of the book in my opinion: seeing the approach to data analysis working in practice. The authors sum up the book early on by comparing it to books about songwriting. I admit to rolling my eyes at this comparison (data analysis as an artform…), but actually it is a good analogy. I think many people who work with data know how to do it, in the way that people who write songs know how to do it, although they probably have not had a formal course in the techniques that are being used. Equally, reading a guidebook on songwriting will not make you a great songwriter. A book can only get you so far; intuition and invention are required, and the same applies to data science.

The book was published via Leanpub, who have an interesting model where you pay a recommended price (or more!) but if you don’t have the money, you can pay less. Also, you can see what fraction goes to the author(s). The books can be updated continually as typos are fixed or code is updated. Roger and the Simply Stats people have put out a few books via this publisher. These books on R, programming, statistics and data science all look good and it seems more books are coming soon.

On a personal note: In 2014, I decided to try and read one book per month. I managed it, but in 2015, I am struggling. It is now November and this book is the 7th I’ve read this year. It was published in September but it took me until now to finish it. Too much going on…

My Blank Pages is a track by Velvet Crush. This is an occasional series of book reviews.

Trellisaze: Using Trello for lab organisation

Previously, I wrote a post with tips for new PIs on lab organisation. Since that time, I’ve started using Trello to organise operations in my lab.

Trello is basically a way to track the progress of projects. Collaborative working is built-in. A friend had begun using Trello as she got involved in building an app. It seems that Trello is popular among teams working to develop software. Sure enough, I asked for opinions on Trello via Twitter and got a nice email from somebody on the Open Microscopy Environment team on the pros and cons of using Trello. You can see one of their boards in action here; it is, after all, open! This convinced me to give it a go. I set up a few boards, invited the lab members and got stuck in.

I set up subject-specific and technique-specific boards (as well as my own to-do list and a board for tasks at home). All lab members are members of the Royle Lab group and we have two group boards – General and Molecular Biology. The General Stuff board contains information about lab meetings, one-to-one meetings, orders etc., even photos of lab socials. Everyone is a member of Molecular Biology because everybody does some cloning in my group. Then the Membrane Traffic people have a board that the others can’t see etc. I’ll probably move to making them all available to everybody in the group soon. The default is for boards to be closed, i.e. not possible for outsiders to see. You need to add people to your board for them to see it and to work with you.

Part of an example board is shown here:

[Image: part of the Membrane Traffic board]

I’ve redacted parts that we’re not ready to tell the world about just yet. There are many guides online to show you how to get going with Trello. Basically, you have Boards. Within each board you have Lists, the columns that you see above. On a list, you put Cards. On the back of the card you can comment, add checklists, files, links, due dates etc. People can be assigned to cards and can provide updates on how it’s going. All of these things can be easily edited as priorities change. For example, I am writing a paper with one person and so we have a list for the paper, with cards for each figure and a card for writing.

I’m happy with how this is working. For example, when writing a paper, the first author and I used to do an awful lot of rapid communication via email (I’ve previously called this Tiki-taka). It’s better if this is kept out of our Inboxes and organised somewhere. Also, how can we keep track of what still needs doing? Did that experiment get redone with the extra control? Which folder were the tracking experiments in? All of this can be recorded and managed using Trello. You can see the little speech bubble on each card indicating that we are talking to each other.

My tips/notes are:

  • In a team, there will always be some people who take to it and use it avidly, while others don’t engage.
  • To encourage take up, I communicate through Trello to the lab rather than using email.
  • Also, at our weekly one-to-one meetings, we edit cards together.
  • We are just using the free version. I’ve accumulated credits to go gold, but haven’t done so.
  • There are good iOS and Android apps for Trello. Notifications get pushed here if you subscribe to a board, list or card. It will ping you emails too, but you can switch this off.
  • File sharing is still done via our server (or Dropbox for small files), but notifications go on the board.
  • Make cards very specific, cards covering big lab projects will fester and clutter up the list.
  • The help files are incredibly nerdy… they even have a dog called Taco who pops up now and again.

Summary: I recommend Trello (note that other management software is available – kanbanflow, slack etc), particularly if you have a large group. Even for new PIs or those with small groups who might be on top of everything, I think there is still something that you’ll get out of it.

The post title is taken from Trellisaze by Slowdive from their Pygmalion LP.

Tips from the blog VIII: Time Machine on QNAP NAS

This is just a quick tip as it took me a little while to sort this out. In the lab we have two QNAP TS-869 Pro NAS devices. Each was set up with a single RAID6 storage pool and I ran them as a primary and a replica via rsync. We recently bought a bigger server and so the plan was to repurpose one of the NAS boxes to be a Time Machine for all the computers in the lab.

We have around 10 computers that use a Time Machine for small documents that reside on each computer (primary data is always on the server). So far, I’d relied on external hard drives to do this. However: 1) they fail, 2) they can get unplugged accidentally and fail to back up and 3) they take up space.

As always, the solution is simple. I’ll outline it along with its drawbacks, and then describe the other things I tried, to save you wasting time.

  1. Wipe the NAS. I went for RAID5, rather than RAID6. I figured this is safe enough. The NAS emails me if there is a problem with any of the disks and they can be replaced.
  2. Enable Time Machine. In the QNAP interface in Backup Server>Time Machine, click Enable Time Machine support. Set a limit if you like or 0 for no limit. Add a Username and Password.
  3. Pick the NAS as the Time Machine disk. On each Mac, wait for backup to complete, then turn Time Machine off. In Time Machine Preferences, pick a new disk. It will see your QNAP NAS Time Machine share. Click on it, enter user/pass and click OK. Don’t select use both disks (an option in Yosemite onwards).
  4. That’s it. Wait for backup to complete, check it. Unplug external HD and repurpose.

You don’t need each user to have a share or an account on the NAS. You don’t need to mount the volume on the Mac at startup. The steps above are all you need to do.

The major drawback is that all users share the same Time Machine space. In theory, one person could swamp the backup with their files and this will limit how far back all users can go in the Time Machine. The NAS is pretty big, so I think this will be OK. A rule for putting data/big files in a directory on a user’s Mac and then excluding this directory from the Backup seems the obvious workaround.
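The exclusion rule could also be applied from a script rather than through Time Machine Preferences. Below is a minimal Python sketch, assuming the Macs have the `tmutil` command-line tool that ships with macOS. The folder name `~/BigData` and both function names are hypothetical, chosen only for illustration. By default it only prints the command (a dry run), so nothing is changed until you ask for it.

```python
import subprocess
from pathlib import Path


def exclusion_command(directory):
    """Build the tmutil call that excludes one directory from Time Machine."""
    # 'tmutil addexclusion <path>' marks the item itself as excluded
    # (a "sticky" exclusion that follows the item if it is moved).
    return ["tmutil", "addexclusion", str(Path(directory).expanduser())]


def exclude_from_backup(directory, dry_run=True):
    """Print the command, and run it only when dry_run is False (macOS only)."""
    cmd = exclusion_command(directory)
    print(" ".join(cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Hypothetical folder where each user keeps data and big files.
    exclude_from_backup("~/BigData", dry_run=True)
```

Calling `exclude_from_backup("~/BigData", dry_run=False)` on each Mac would apply the rule for real. Whether a single agreed folder name suits every user is a lab policy question, not something the script decides.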

What not to try

There is a guide on the web (on the QNAP website!) which shows you how to add Time Machine support for a Mac. It requires some command line action and is pretty complicated. Don’t do it. My guess is that this was the original hack and this has now been superseded by the “official” support on the QNAP interface. I’m not sure why they haven’t taken this guide down. There is another page on the site, outlining the steps above. Just use that.

I read on the QNAP forums about the drawback of using the official Time Machine backup (i.e. all users having to share). My brainwave was to create a user for each Mac, enable a home folder, limit the account with a quota and then set the Mac to mount the volume on startup. This could then be the Time Machine and allow me to limit the size of the Time Machine backup on a per user basis. It didn’t work! Time Machine needs a specially formatted volume to work and so it cannot use a home folder in this way. Maybe there is a way to get this to work – it would be a better solution – but I couldn’t manage it.

So there you go. If you have one of these devices, this is what worked for us.

Tips from the Blog V: Advice for New PIs

I recently gave a talk at a retreat for new PIs working at QMUL. My talk was focussed on tips for getting started, i.e. the nitty gritty of running an efficient lab. It was a mix of things I’ve been told, worked out for myself or that I’d learned the hard way.

PIs are expected to be able to do lots of things that can be full-time jobs in themselves. In my talk, I focussed on ways to make yourself more efficient, to give yourself as much time as possible to tackle all these different roles that you need to take on. You don’t need to work 80 hours a week to succeed, but you do need to get organised.

1. Timelines

Get a plan together. A long-term (5-year) plan and a shorter (1-2 year) plan. What do you want to achieve in the lab? What papers do you want to publish? How many people do you need in the lab? What grants do you need? When are your next three grant applications due? When is the first one due? Work back from there. It’s January, the first one is due in September, better get that paper submitted! You need a draft application available for circulation to colleagues in good time to do something about the comments. Plan well. Don’t leave anything to the last minute. But don’t apportion too much time as the task will expand to fill it.

Always try to work towards the big goals. It’s too easy to spend all of your time on “urgent” things and busywork (fire-fighting). Prioritise Important over Urgent.

2. Time audit

Doing a time audit is a good way to identify where you are wasting time and how to reorganise your day to be more effective. Do you find it difficult to write first thing in the morning? If so, why not deal with your email or paperwork first thing since it requires less brain activity. Can you work during your commute? Save busywork for then. Can you switch between lab work and desk work well? Where are you fitting in teaching and admin? Try and find out answers to these questions with a time audit. It’s a horrible corporate thing to do, but I found it worked for me.

3. Lab manual

This was a popular idea. Paul Nurse’s lab had one – so should yours! The Royle lab manual has the following sections:

  • Lab organisation
  • Molecular Biology
  • Cell Biology
  • Biochemistry
  • Imaging

The lab organisation section has subsections on 1) how to keep a lab book; 2) lab organisation (databases, plasmid/antibody organisation); 3) computers/data storage; 4) lab calculations; 5) making figures. The other sections are a collection of our tried-and-tested protocols. New protocols are submitted to a share on the server and honed until ready for preservation in the Lab Manual. The idea is that you give the manual to all new starters and get them to stick to it and to consult the manual first when troubleshooting. People in the lab like it, because they are not left guessing exactly what you expect of them.

As part of this, you need to sort out lab databases and a lab server for all of the data. One suggestion was to give one person in the lab the job of looking after (a.k.a. policing) the databases and enforcing compliance. We don’t do this and instead do spot checks every few weeks/months to ensure that things haven’t drifted too far.

Occasionally, and at random, I’ll ask all lab members to bring their lab books to our lab meeting. I ask everyone to swap books with someone else. I then pick a random date and ask person X to describe (using the lab book) what person Y did on that day. It’s a bit of fun, but makes people realise how important keeping a good lab book is.
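The swap and the random date are easy to generate with a few lines of code if you prefer not to pick them by hand. This is only a sketch of one way to do it: the names and the date range are placeholders, and `swap_books` and `random_day` are made-up helper names. It assumes every name in the list is unique.

```python
import random
from datetime import date, timedelta


def swap_books(members, rng=random):
    """Assign each person someone else's lab book (nobody reads their own)."""
    if len(members) < 2:
        raise ValueError("need at least two lab members to swap books")
    shuffled = list(members)
    rng.shuffle(shuffled)
    # Rotating the shuffled list by one place guarantees a swap:
    # each reader gets the book of the next person in the shuffled order.
    return {reader: shuffled[(i + 1) % len(shuffled)]
            for i, reader in enumerate(shuffled)}


def random_day(start, end, rng=random):
    """Pick a random date between start and end, inclusive."""
    return start + timedelta(days=rng.randint(0, (end - start).days))


if __name__ == "__main__":
    lab = ["Person A", "Person B", "Person C", "Person D"]  # placeholder names
    day = random_day(date(2015, 1, 1), date(2015, 6, 30))   # placeholder range
    for reader, owner in swap_books(lab).items():
        print(f"{reader} describes what {owner} did on {day.isoformat()}")
```

The chosen date may land on a weekend or a holiday, in which case just draw again.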

4. Tame your email

There are lots of tips on how to do this – find something that works for you. For example, I set up several filters/rules that move messages that are low importance away from my inbox. I flag messages and deal with them later if they will take more than 5 sec to deal with. I’ve tried checking at specified times of the day – doesn’t work well for me – but it might for you. Out-of-hours email is a problem. Just remember that no email is so urgent that it cannot wait until the morning – otherwise they would phone you.

5. Automation

Again there are lots of tips out there, e.g. in this post from Sylvain Deville. I have set up macros for routine things like exporting figures in a variety of formats/resolutions and assembling them with a manuscript file to one PDF for circulating around the lab for comment. We have workflows for building figures and writing papers. Anything that you do frequently is worth automating.

6. Deposit your plasmids with Addgene

They’ll distribute them for you! This saves a lot of time. You still get to check who is requesting them if you are curious.

7. Organising frequently-used files

Spend some time making some really good schematic figures. They can be used, reused and rejigged time and again for a variety of purposes – talks, manuscripts etc. It’s worth doing this well and with a diagram that is definitely yours and not plundered from the web. Also, never retype text instructions – save them somewhere and just cut-and-paste. Examples include: answers for common questions from students, details of how to do something in the lab, details of how to get to the lab, brief biography, abstracts for talks…

Have a long-format CV always ready and keep updating it (I’ve not found a good way to automate this, yet). I get asked for my CV all the time for lots of different things. Have the long (master) CV set up so that you can delete sections as appropriate, depending on the purpose. Use the publication list from this for pasting into various boxes that you are required to fill out. An Endnote smart list of all of your papers is also handy for rapidly formatting a list of your papers according to different styles. Try to keep your web profiles up-to-date. If you publish a new paper, add it to your CV and all of your profiles so they don’t look out of date. ORCiD, Researchfish, whatever, try and keep them all current.

Get a slidedeck together of all your slides on a topic. Pull from here to put your talks together. Get a slidepack together to show to visitors to the lab at a moment’s notice. Also, when you publish a new paper, make slides of the final figure versions and add them to the master slidedeck.

8. Alerts

Set up literature alerts. My advice would be don’t have these coming to your inbox. Use RSS. This way you can efficiently mark interesting papers to look at later and keep your email free of clutter. Grab feeds for your favourite journals and for custom pubmed searches. Not just for subject keywords but also for colleagues and scientists who you think do interesting work. Set up Google Scholar to send you papers that have cited your work. Together with paper recommendations from Twitter (or maybe some other services – PubChase etc.) you’ll be as sure as you can be that you’re not missing anything. Also grab feeds from funding agencies, so that you don’t miss calls for grant applications. If all of these things are in place, you don’t need to browse the web which can be a huge time drain.
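If you would rather triage feeds with a script than in an RSS reader, the idea can be sketched in a few lines of Python using only the standard library. The feed below is invented for illustration (titles, links and journal name are placeholders); a real feed would be downloaded from a journal or a saved PubMed search, and `interesting_items` is a hypothetical helper name. It assumes a plain RSS 2.0 layout with `item`, `title` and `link` elements.

```python
import xml.etree.ElementTree as ET

# Made-up feed for illustration; a real one would be downloaded from
# a journal or a saved PubMed search.
SAMPLE_FEED = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example journal</title>
    <item>
      <title>Membrane traffic in dividing cells</title>
      <link>https://example.org/paper1</link>
    </item>
    <item>
      <title>A new cloning vector</title>
      <link>https://example.org/paper2</link>
    </item>
  </channel>
</rss>"""


def interesting_items(feed_xml, keywords):
    """Return (title, link) pairs whose title mentions any keyword."""
    root = ET.fromstring(feed_xml)
    wanted = [k.lower() for k in keywords]
    hits = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        link = item.findtext("link", default="")
        if any(k in title.lower() for k in wanted):
            hits.append((title, link))
    return hits


if __name__ == "__main__":
    for title, link in interesting_items(SAMPLE_FEED, ["membrane traffic"]):
        print(f"{title}\n  {link}")
```

Matching on title keywords alone is crude, so this is a first-pass filter rather than a replacement for actually skimming the feed.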

9. Synchronise

I have several computers synced via Unison (thanks to Daniel and Christophe who suggested this to me years ago). You can do this via Dropbox, but the space is limited. Unison syncs all my documents so that I am always able to work on the latest versions of things wherever I am. This is useful if, for some reason, you unexpectedly cannot make it in to work.

10. Paper of the day

This has worked at some level to make sure that I am reading papers regularly. Posts about this here and here.

11. Make use of admin staff

If you have access to administrative staff get them to do as much of your paperwork as is feasible so you can concentrate on other things. And be nice to them! They can help a lot if you are really stuck with something, like an imminent deadline; or they can… be less helpful.

12. Be a good colleague

There’s a temptation to perform badly in tasks so that you don’t get asked again in order to reduce your workload. Don’t do this. It is true that if you are efficient, you will get asked to do more things. This is good (because not all tasks are bad). If you have too much to do, you just need to manage it. Say “No” if your workload is too high. But don’t just do a bad job. This pushes the problem onto your colleagues. If nothing else, you need their help. Also, help your colleagues if they need it. Always make yourself available to comment on their grants and papers. Interacting with colleagues is one of the most fun parts of being a PI.

13. Don’t write a book chapter

It’s a waste of time. Nobody will read it. Nobody will cite it. It will take time away from publishing real papers. Also, think carefully about writing review articles. If you have something unique to say, then go for it. Don’t do it just because you’ve been asked…

In need of some more advice?

This post was focussed on technicalities of running a lab to make things more efficient. There’s obviously lots more to it than this: people management, networking etc.

A great recommendation that I got after I had been a PI for a few years was this excellent book by Kathy Barker, At The Helm: Leading your laboratory. I read this and wished I’d found it earlier. The sections on early stage negotiations and planning for the moment you become a PI are great (although it is very US-centric).

I’ve also been told that the EMBO Course for New Investigators is great, although I have not attended it.

Update 12:15 13/7/15: A reader sent this link to me via email. It’s a document from HHMI on scientific management for Postdocs and New PIs. Well worth a read!!

Update 07:41 4/2/15: We now use Trello for organising activities in the lab. You can read about how we do that here. I added the lab book audit anecdote and fixed some typos.

Thanks to attendees of the QMUL ECR Retreat for your feedback on my talk. I also incorporated a few points from Kebs Hodivala-Dilke’s presentation, which was focussed more on the big picture points. If you have any other time-saving tips for PIs, leave a comment.

My Blank Pages II: Statistics Done Wrong

I have just finished reading this excellent book, Statistics done wrong: a woefully complete guide by Alex Reinhart. I’d recommend it to anyone interested in quantitative biology and particularly to PhD students starting out in biomedical science.

Statistics is a topic that many people find difficult to grasp. I think there are a couple of reasons for this that I’ll go into below. The aim of this book is to comprehensively cover the common mistakes and errors that continually crop up in data analysis. The author writes in an easy-to-understand style and – this is the important bit – he dispenses with nearly all the equations. The result is an accessible guide on “what not to do” in significance testing.

I think there are two main reasons why people find statistics tough: uncertainty and mathematical anxiety.

First, uncertainty. What I mean is the uncertainty over what statistical approach to take, rather than the uncertainty that can be studied using statistics! It is very easy to find fault with the statistical approaches that a biologist has used in a study. Why did they show the confidence interval and not the standard deviation? Why haven’t they corrected for multiple testing…? Statistics has a “gotcha” reputation. The reason for the uncertainty is that it is difficult to come up with a hard-and-fast set of guidelines of approaches to take, because this depends a lot on the type of data that has been collected, what is being tested etc. And there are often several ways to do the same thing. This uncertainty doesn’t go away even with a firm grounding in statistics. The methods are nearly always up for debate as far as I can see. And I think it is this uncertainty that prevents people from really engaging with statistics. In the absence of clear direction, it seems like having in mind a set of “what not to do” is a useful approach to stats.

Second, mathematical anxiety, i.e. fear of maths. Biology has a reputation for being populated by people who ended up here through an affinity with science but a discomfort with physics and maths. This is unfair as there are many areas of biology where this is not true and statistical/quantitative approaches are right at the forefront. Nonetheless, there is a reason why there are umpteen “Statistics for Biologists” books in the bookshop. Now, the way that statistics is taught is to crunch through the equations that describe statistical concepts. Again, this means that people who really need to know about statistics for their research are held back if they don’t have a mathematical background or just find maths a bit daunting. The situation is well described by a recent post at Will Kurt’s excellent Count Bayesie blog on the teaching of statistics. His point is: insisting that students know these equations gets in the way of them understanding statistics. Nowadays, calculating something like the standard deviation is trivial using a computer and we are unlikely to need to know the derivation of an equation in order to do our work. We should just skip the equations and explain why.

The nice thing about this book is that the author has collected together all the faux pas that you’re likely to encounter and how to avoid them. This goes some way to addressing uncertainty in what methods to use. Secondly, the author has dispensed with the equations, so the mathematically anxious can pick it up without fear. These features make this book different to other stats books that I’ve read.

You can find copies at many online retailers. It’s published by No Starch. I picked up a copy after reading about it on Nathan Yau’s Flowing Data blog.

The post title comes from “My Blank Pages” by Velvet Crush from their Teenage Symphonies to God LP.

Middle of the road: pitching your paper

I saw this great tweet (fairly) recently:

https://twitter.com/jkpfeiff/status/589148184284254208/

I thought this was such a great explanation of when to submit your paper.

It reminded me of a diagram that I sketched out when talking to a student in my lab about a paper we were writing. I was trying to explain why we don’t exaggerate our findings. And conversely why we don’t undersell our results either. I replotted it below:

[Figure: PaperPitch]

Getting out to review is a major hurdle to publishing a paper. Therefore, convincing the Editor that you have found out something amazing is the first task. This is counterbalanced by peer review, which scrutinises the claims made in a paper for their experimental support. So, exaggerated claims might get you over the first hurdle, but they will give you problems during peer review (and afterwards if the paper makes it to print). Conversely, underselling or not interpreting all your data fully is a different problem. It’s unlikely to impress the Editor as it can make your paper seem “too specialised”, although if it made it to the hands of your peers they would probably like it! Obviously, at either end of the spectrum, no-one likes a dull/boring/incremental paper and everyone can smell a rat if the claims are completely overblown, e.g. genome sequence of Sasquatch.

So this is why we try to interpret our results fully but are careful not to exaggerate our claims. It might not get us out to review every time, but at least we can sleep at night.

I don’t know if this is a fair representation. Certainly depending on the journal the scale of the y-axis needs to change!

The post title is taken from “Middle of the Road” by Teenage Fanclub, a B-side from their single “I Don’t Want Control of You”.

If and When: publishing and productivity in the lab

I thought I’d share this piece of analysis looking at productivity of people in the lab. Here, productivity means publishing papers. This is unfortunate since some people in my lab have made some great contributions to other people’s projects or have generally got something going, but these haven’t necessarily transferred into print. Also, the projects people have been involved in have varied in toughness. I’ve had students on an 8-week rotation who just collected some data which went straight into a paper and I’ve had postdocs toil for two years trying to purify a protein complex… I wasn’t looking to single out who was the most productive person (I knew who that was already), but I was interested to look at other things, e.g. how long is it on average from someone joining the lab to them publishing their first paper?

The information below would be really useful if it was available for all labs. When trainees are looking for jobs, it would be worth knowing the productivity of a given lab. This can be very hard to discern, since it is difficult to see how many people have worked in the lab and for how long. Often all you have to go on is the PubMed record of the PI. Two papers published per year in the field of cell biology is a fantastic output, but not if you have a lab of thirty people. How likely are you – as a future lab member – to publish your own 1st author paper? This would be very handy to know before applying to a lab.

I extracted the date of online publication for all of our papers as well as author position and other details. I had a record of start and end dates for everyone in the lab. Although as I write this, I realise that I’ve left one person off by mistake. All of this could be dumped into IgorPro and I wrote some code to array the data in a plot vs time. People are anonymous – they’ll know who they are, if they’re reading. Also we have one paper which is close to being accepted so I included this although it is not in press yet.

[Figure: RoylePapers1]

The first plot shows when people joined the lab and how long they stayed. Each person has been colour-coded according to their position. The lines represent their time spent in the lab. Some post-graduates (PG) came as a masters student for a rotation and then came back for a PhD and hence have a broken line. Publications are shown by markers according to when a paper featuring that person was published online. There’s a key to indicate a paper versus review/methods paper and if the person was 1st author or not. We have published two papers that I would call collaborative, i.e. a minor component from our lab. Not shown here are the publications that are authored by me but don’t feature anyone else working in the lab.

This plot starts when I got my first independent position. As you can see it was ~1 year until I was able to recruit my first tech. It was almost another 2 years before we published our first paper. Our second one took almost another 2 years! What is not factored in here is the time spent waiting to get something published – see here. The early part of getting a lab going is tough, however you can see that once we were up-and-running the papers came out more quickly.

[Figure: RoylePapers2]

In the second plot, I offset the traces to show duration in the lab and relative time to publication from the start date in the lab. I also grouped people according to their position and ranked them by duration in the lab. This plot is clearer for comparing publication rates and lag to getting the first paper etc. This plot shows quite nicely that lots of people from the lab publish “posthumously”. This is thanks to the publication lag but also to things not getting finished or results that needed further experiments to make sense etc. Luckily the people in my lab have been well organised, which has made it possible to publish things after they’ve left.

I was surprised to see that five people published within ~1.5 years of joining the lab. However, in each case the papers came about because of some groundwork by other people.

I think the number of people and the number of papers are both too low to begin to predict how long someone will take to get their first paper out, but these plots give a sense of how tough it is and how much effort and work is required to make it into print.

Methods: To recreate this for your own lab, you just need a list of lab members with start and end dates. The rest can be extracted from PubMed. Dataviz fans may be interested that the colour scheme is taken from Paul Tol’s guide.
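For anyone without IgorPro, the core calculation (lag from joining the lab to a first paper) can be sketched in Python. This is not the code used for the plots above. The people, dates and the `years_to_first_paper` helper are all placeholders invented for illustration, and the sketch assumes you have already assembled two simple lists: start dates per person, and online publication dates per author.

```python
from datetime import date
from statistics import mean

# Placeholder records: (person, start date in the lab).
members = [
    ("Postdoc 1", date(2008, 1, 7)),
    ("PhD 1", date(2009, 10, 1)),
    ("Tech 1", date(2007, 6, 1)),
]

# Placeholder records: (person, date of online publication, first author?).
papers = [
    ("Postdoc 1", date(2010, 3, 15), True),
    ("Postdoc 1", date(2012, 9, 1), False),
    ("PhD 1", date(2013, 2, 20), True),
]


def years_to_first_paper(members, papers):
    """Map each person to the lag (in years) from joining to their first paper."""
    lags = {}
    for person, start in members:
        dates = [pub for who, pub, _ in papers if who == person]
        if dates:  # people without a paper are left out
            lags[person] = (min(dates) - start).days / 365.25
    return lags


if __name__ == "__main__":
    lags = years_to_first_paper(members, papers)
    for person, lag in sorted(lags.items(), key=lambda kv: kv[1]):
        print(f"{person}: {lag:.1f} years to first paper")
    print(f"mean lag: {mean(lags.values()):.1f} years")
```

Leaving out people with no papers flatters the average, so with a small lab it is probably more honest to report the individual lags alongside how many people have not published yet.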

The post title comes from “If and When” by The dB’s from Ride The Wild Tomtom.