## Tips from the blog XI: Overleaf

I was recently an external examiner for a PhD viva in Cambridge. As we were wrapping up, I asked “if you were to do it all again, what would you do differently?”. It’s one of my stock questions and normally the candidate says “oh I’d do it so much quicker!” or something similar. However, this time I got a surprise. “I would write my thesis in LaTeX!”, was the reply.

As a recent convert to LaTeX I could see where she was coming from. The last couple of manuscripts I have written were done in Overleaf and have been a breeze. This post is my summary of the site.

I have written ~40 manuscripts and countless other documents using Microsoft Word for Mac, with EndNote as a reference manager (although I have had some failed attempts to break free of that). I’d tried and failed to start using TeX last year, motivated by seeing nicely formatted preprints appearing online. A few months ago I had a new manuscript to write with a significant mathematical modelling component and I realised that now was the chance to make the switch. Not least because my collaborator said “if we are going to write this paper in Word, I wouldn’t know where to start”.

I signed up for an Overleaf account. For those that don’t know, Overleaf is an online TeX writing tool: you edit the TeX source on one half of the screen and see a rendered version of your manuscript on the other. The learning curve is quite shallow if you are used to any kind of programming or markup. There are many examples on the site, and finding out how to do stuff is quick thanks to the LaTeX wikibooks and StackExchange.

Beyond the TeX, the experience of writing a manuscript in Overleaf is very similar to editing a blog post in WordPress.

Collaboration

The best thing about Overleaf is the ability to collaborate easily. You can send a link to a collaborator and then work on the manuscript together. Using Word in this way can be done with Dropbox, but versioning and track changes often cause more problems than they’re worth, and most people still email Word versions to each other, which is a nightmare. Overleaf changes this by having a simple interface that can be accessed by multiple people. I have never used Google Docs for writing papers, but it does offer the same kind of functionality.

All projects are private by default, but you can put your document up on the site if you want to. You might want to do this if you have developed an example document in a certain style.

Versioning

Depending on the type of account you have, you can roll back changes. It is possible to ‘save’ versions, so if you get to a first draft and want to send it round for comment, you can save a version to come back to if required. This is handy insurance in case somebody comes in to edit the document and breaks something.

You can download a PDF at any point, or for that matter take all the files away as a zip. No more finalfinalpaper3final.docx…

If you’re keeping score, that’s Overleaf 2, Word nil.

Figures

Placing figures in the text is easy and all major formats are supported. What is particularly nice is that I can generate figures in an Igor layout, output directly to PDF and put that straight into Overleaf. In Word, the placement of figures can be fiddly: everyone knows the sensation of moving a picture slightly, only for it to disappear inexplicably onto another page. LaTeX will put the figure where you want it, or in the next best place. It just works.
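For example, with the graphicx package a figure is placed like this (a minimal sketch; the filename is hypothetical):

```latex
% in the preamble
\usepackage{graphicx}

% in the body; [htbp] lets LaTeX choose here/top/bottom/own-page placement
\begin{figure}[htbp]
  \centering
  \includegraphics[width=0.8\textwidth]{figure1.pdf} % hypothetical file
  \caption{An example figure.}
  \label{fig:example}
\end{figure}
```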

Equations

This is what LaTeX excels at. Microsoft Word has an equation editor which has varied over the years from terrible to just-about-usable. The current version actually uses elements of TeX (I think). The support for mathematical text in LaTeX is amazing, not surprising since this is the way that most papers in maths are written. Any biologist will find their needs met here.
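For example, the Michaelis–Menten equation – about as much maths as many cell biology papers need – takes one line:

```latex
\begin{equation}
  v = \frac{V_{\mathrm{max}}[S]}{K_{\mathrm{m}} + [S]}
\end{equation}
```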

Templates and formatting

There are lots of templates available on Overleaf and many more on the web. For example, there are nice PNAS and PLoS formats, as well as others for theses, CVs and other documents. The typesetting is beautiful. Setting out sections/subsections and a table of contents is easy. To be fair to Word, if you know how to use it properly, this is easy too; the problem is that most people don’t, and styles can get messed up too easily.
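As a sketch, sectioning and a table of contents need nothing more than:

```latex
\tableofcontents

\section{Introduction}
\subsection{Background}
% numbering and the contents list update automatically on each compile
```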

Referencing

This works by adding a BibTeX file to your project, which you can export from any reference manager. Because I have a huge EndNote database, I used that initially. For another manuscript I’ve been working on, my student started out with a Mendeley library, so we used that instead. It’s very flexible, if slightly more fiddly than Word with EndNote. However, I’ve had so many problems (and crashes) with that combination over the years that any alternative is a relief.
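A sketch of how the pieces fit together (the citation key and .bib filename here are hypothetical):

```latex
In-text citations use the BibTeX key, e.g. as previously shown \cite{Smith2010}.

% at the end of the document, point at the exported .bib file
\bibliographystyle{unsrt}
\bibliography{library} % i.e. library.bib from EndNote or Mendeley
```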

Compiling

You can set the view on the right to compile automatically, or you can force updates manually. Either way, the document must compile before the preview updates. If you have made a mistake, Overleaf will complain, try to guess what you have done wrong, and tell you. Errors that prevent the document from compiling are shown in red; less serious errors are yellow and allow compilation to go ahead. This can be slow going at first, but I found that I was soon up to speed with editing.

Preamble

This is the name for the material at the top of a TeX document, before \begin{document}. Here you can add all kinds of packages, covering the proper usage of units (siunitx) or chemical notation (mhchem), for example. They all have great documentation. All the basics, e.g. referencing, are included in Overleaf by default.
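A minimal sketch of a preamble using both of those packages (the body text is just an example):

```latex
\documentclass{article}
\usepackage{siunitx}           % typesetting of units
\usepackage[version=4]{mhchem} % chemical formulae

\begin{document}
Cells were imaged at \SI{37}{\celsius} in media containing \ce{CaCl2}.
\end{document}
```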

Offline

The entire concept of Overleaf is to work online; otherwise you could just use TeXShop or some other program. But how about times when you don’t have internet access? I was concerned about this at the start, but I found that, in practice, times when you don’t have a connection are now few and far between. However, I was recently travelling and wanted to work on an Overleaf manuscript on the aeroplane. Of course, with Word, this is straightforward.

With Overleaf it is possible. You can do one of two things. The first is to download your files ahead of your period of internet outage and edit your main.tex document in an editor of your choice. The second option is more sophisticated: you can clone your project with git and then work on that local clone. The instructions for how to do that are here (the instructions, from 2015, say it’s in beta, but it’s fully working). You can work on your document locally and then push changes back to Overleaf when you have access once more.

Downsides

OK, nothing is perfect, and I have noticed that typos and grammatical errors are more difficult for me to detect in Overleaf. I think this is because I am conditioned by years of Word use. The dictionary is smaller than Word’s, and it doesn’t try to correct your grammar like Word does (although this is probably a good thing!). Maybe I should try the rich text view and see if that helps. I guess the other downside is that the other authors need to know TeX rather than Word. As described above, if you are writing with a mathematician this is not a problem, but for biologists it could be a challenge.

Back to the PhD exam

I actually think that writing a thesis is probably a once-in-a-lifetime chance to understand how Microsoft Word (and EndNote) really works. The candidate explained that she didn’t trust Word enough to do everything right, so her thesis was made of several different documents that were fudged to look like one long thesis. I don’t think this is that unusual. She explained that she had used Word because her supervisor could only use Word and she had wanted to take advantage of the Review tools. Her heart had sunk when her supervisor simply printed out drafts and commented using a red pen, meaning that she could have done it all in LaTeX and it would have been fine.

Conclusion

I have been totally won over by Overleaf. It beats Microsoft Word in so many ways… I’ll stick to Word for grant applications and other non-manuscript documents, but I’m going to keep using Overleaf for manuscripts, with the exception of papers written with people who will only use Word.

## Tips from the blog X: multi-line commenting in Igor

This is part-tip, part-adventures in code. I found out recently that it is possible to comment out multiple lines of code in Igor and thought I’d put this tip up here.

Multi-line commenting in programming is useful for two reasons:

1. writing comments (instructions, guidance) that last more than one line
2. the ability to temporarily remove a block of code while testing

In each computer language there is the ability to comment out at least one line of code.

In Igor this is “//”, which comments out the whole line, but no more.

This is the same as in ImageJ macro language.

Now, commenting out whole sections in FIJI/ImageJ is easy: insert “/*” where you want the comment to start, and “*/” where it ends, multiple lines later.

I didn’t think this syntax was available in Igor, and it isn’t really. I was manually adding “//” for each line I wanted to remove, which was annoying. It turns out that you can use Edit > Commentize to add “//” to the start of all selected lines. The keyboard shortcut in IP7 is Cmd-/. You can reverse the process with Edit > Decommentize or Cmd-\.

There is actually another way. Igor can conditionally compile code. This is useful if for example you write for Igor 7 and Igor 6. You can get compilation of IP7 commands only if the user is running IP7 for example. This same logic can be used to comment out code as follows.
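A minimal sketch of the trick:

```
#if 0
	// nothing in this block is compiled
	Print "this code is effectively commented out"
#endif
```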

The condition if 0 is never satisfied, so the code inside the block never compiles. The equivalent statement for IP7-specific compilation is “#if igorversion()>=7”.

So there you have it, two ways to comment out code in Igor. These tips were from IgorExchange.

This post is part of a series of tips.

## Tips from the blog IX: running route

University of Warwick is a popular conference destination, with thousands of visitors per year. Next time you visit and stay on campus, why not bring your running shoes and try out these routes?

Route 1

This is just over 10K and it takes you from main campus out towards Cryfield Pavilion. A path goes to the Greenway (a former railway), which is a nice flat gravel track. It goes up to Burton Green and back to campus via Westwood Heath Road. To exit the Greenway at Burton Green you need to take the “offramp” at the bridge, otherwise you will end up heading to Berkswell. If you want to run totally off-road*, just turn back at this point (probably ~12K). The path out to the Greenway and the Greenway itself are unlit, so be careful early in the morning or late at night.

GPX of a trace put together on gpsies.

Route 2

This is a variation on Route 1. Instead of heading up the Greenway to Burton Green, take a left and head towards Kenilworth Common. With a bit of navigation you can run alongside a brook and pop out in Abbey Fields to see the ruins of Kenilworth Abbey. This is out-and-back, 12K; obviously you can turn back sooner if you prefer. It’s all off-road apart from a few hundred metres on quiet residential streets as you navigate from the Common to Abbey Fields. GPX from Uni to around the lake at Abbey Fields.

Route 3

This is a variation on Route 1 where you exit the Greenway and take a loop around Crackley Wood. The wood is nice and has wild deer and other interesting wildlife. This route is totally off-road and is shorter at ~8K. GPX from Uni to around the Wood.

Other Routes

There is a footpath next to a bike lane down the A429 which is popular for runners heading to do a lap or two of Memorial Park in Coventry. This is OK, but means that you run alongside cars a lot.

If you don’t have time for these routes, the official Warwick page has three very short running routes of around 3 to 5 km (1, 2 and 3). I think that these routes are the ones that are on the signpost near the Sports Centre.

* Here, off-road means on paths but not alongside a road on a pavement. It doesn’t mean across fields.

This post is part of a series of tips.

## Tips from the blog VIII: Time Machine on QNAP NAS

This is just a quick tip as it took me a little while to sort this out. In the lab we have two QNAP TS-869 Pro NAS devices. Each was set up with a single RAID6 storage pool and I ran them as a primary and a replica via rsync. We recently bought a bigger server, so the plan was to repurpose one of the NAS boxes as a Time Machine for all the computers in the lab.

We have around 10 computers that use Time Machine for the small documents that reside on each computer (primary data are always on the server). So far, I’d relied on external hard drives to do this. However: 1) they fail, 2) they can get unplugged accidentally and fail to back up, and 3) they take up space.

As always, the solution is simple. I’ll outline it along with the drawbacks, and then describe the other things I tried, to save you wasting time.

1. Wipe the NAS. I went for RAID5, rather than RAID6. I figured this is safe enough. The NAS emails me if there is a problem with any of the disks and they can be replaced.
2. Enable Time Machine. In the QNAP interface in Backup Server>Time Machine, click Enable Time Machine support. Set a limit if you like or 0 for no limit. Add a Username and Password.
3. Pick the NAS as the Time Machine disk. On each Mac, wait for the current backup to complete, then turn Time Machine off. In Time Machine Preferences, pick the new disk; it will see your QNAP NAS Time Machine share. Click on it, enter the username/password and click OK. Don’t select “use both disks” (an option in Yosemite onwards).
4. That’s it. Wait for backup to complete, check it. Unplug external HD and repurpose.

You don’t need each user to have a share or an account on the NAS. You don’t need to mount the volume on the Mac at startup. The steps above are all you need to do.

The major drawback is that all users share the same Time Machine space. In theory, one person could swamp the backup with their files, and this will limit how far back all users can go in Time Machine. The NAS is pretty big, so I think this will be OK. A rule for putting data/big files in a directory on a user’s Mac and then excluding this directory from the backup seems the obvious workaround.

What not to try

There is a guide on the web (on the QNAP website!) which shows you how to add Time Machine support for a Mac. It requires some command line action and is pretty complicated. Don’t do it. My guess is that this was the original hack and this has now been superseded by the “official” support on the QNAP interface. I’m not sure why they haven’t taken this guide down. There is another page on the site, outlining the steps above. Just use that.

I read on the QNAP forums about the drawback of using the official Time Machine backup (i.e. all users having to share). My brainwave was to create a user for each Mac, enable a home folder, limit the account with a quota and then set the Mac to mount the volume on startup. This could then be the Time Machine and allow me to limit the size of the Time Machine backup on a per-user basis. It didn’t work! Time Machine needs a specially formatted volume, so it cannot use a home folder in this way. Maybe there is a way to get this to work – it would be a better solution – but I couldn’t manage it.

So there you go: if you have one of these devices, this is what worked for us.

## Tips from the blog VII: Recolour Z-stack and Save Projection

I’m putting this up here in case it is useful for somebody.

We capture Z-stacks on a Perkin Elmer Spinning Disk microscope system. I wanted to turn each stack into a single image so that we could quickly compare them. This simple macro does the job.

1. We import the images straight from the *.mvd2 library using the wonderful BioFormats import tool. We open all files as composite hyperstacks.
2. This Macro is then run (works on all of the open images).
3. It finds the mid-Z point of the stack and sets the brightness/contrast for each channel to Auto.
4. It then recolours the channels to the right colours (we capture DAPI then GFP then mCherry, but the import colours them red, green and blue, respectively).
5. Then it makes a z-projection and saves the file in a directory that you specify at the start.
6. It closes the projection and the z-stack.

You can then open up all of the projections, tile them and take a look at what was going on in the experiment.


```
setBatchMode(true);
imgArray = newArray(nImages);
dir1 = getDirectory("Choose Destination Directory ");
// store the IDs of all open images
for (i = 0; i < nImages; i++) {
	selectImage(i + 1);
	imgArray[i] = getImageID();
}
for (i = 0; i < imgArray.length; i++) {
	selectImage(imgArray[i]);
	id = getImageID();
	win = getTitle();
	// getDimensions returns width, height, channels, slices, frames,
	// so "nFrames" here actually receives the number of z-slices
	getDimensions(w, h, c, nFrames, dummy);
	// go to the mid-point of the z-stack and auto-contrast each channel,
	// stepping through the channels with Next Slice
	Stack.setSlice(round(nFrames / 2));
	run("Enhance Contrast", "saturated=0.35");
	run("Next Slice [>]");
	run("Enhance Contrast", "saturated=0.35");
	run("Next Slice [>]");
	run("Enhance Contrast", "saturated=0.35");
	// recolour: DAPI/GFP/mCherry are imported as red/green/blue
	Stack.setChannel(1);
	run("Blue");
	Stack.setChannel(2);
	run("Green");
	Stack.setChannel(3);
	run("Red");
	// maximum-intensity projection, save, then close both windows
	run("Z Project...", "projection=[Max Intensity]");
	saveAs("TIFF", dir1 + win);
	close();
	selectImage(id);
	close();
}
```




Tips from the blog is a series and gets its name from a track from Black Sunday by Cypress Hill.

## Tips from the blog VI: doc to pdf

A while back I made this little Automator script to convert Microsoft Word doc and docx files to PDF. It’s useful for when you are sent a bunch of Word files for committee work. Opening PDFs in Preview is nice and hassle-free. Struggling with Word is not.

It’s not my own work, I just put it together after googling around a bit. I’ll put it here for anyone to use.

To get it working:

1. Open Automator and choose the Service template. You need to check: Service receives selected files and folders in Finder.app.
2. Set the actions: Get Folder Contents and Run AppleScript (this is where the script goes).
3. Now save the workflow. Suggested name: Doc to PDF.

Now to run it:

1. Select the doc/docx file(s) in the Finder window.
2. Right-click and look for the service in the contextual menu. This should be down the bottom near “Reveal in Finder”.
3. Run it.

If you want to put it onto another computer, go to ~/Library/Services and you will find the saved workflow there. Just copy it to the same location on your other computer.

Known bug: Word has to be open for the script to run. It also doesn’t shut down Word when it’s finished.

Here is the code.

```applescript
property theList : {"doc", "docx"}

on run {input, parameters}
	set output to {}
	-- remember Word's default file path so it can be restored at the end
	tell application "Microsoft Word" to set theOldDefaultPath to get default file path file path type documents path
	repeat with x in input
		try
			set theDoc to contents of x
			tell application "Finder"
				set theFilePath to container of theDoc as text
				set ext to name extension of theDoc
				if ext is in theList then
					-- build the output name: original name with .pdf in place of .doc(x)
					set theName to name of theDoc
					copy length of theName to l
					copy length of ext to exl
					set n to l - exl - 1
					copy characters 1 through n of theName as string to theFilename
					set theFilename to theFilename & ".pdf"
					tell application "Microsoft Word"
						set default file path file path type documents path path theFilePath
						open theDoc
						set theActiveDoc to the active document
						save as theActiveDoc file format format PDF file name theFilename
						copy (POSIX path of (theFilePath & theFilename as string)) to end of output
						close theActiveDoc
					end tell
				end if
			end tell
		end try
	end repeat
	-- restore the original default file path
	tell application "Microsoft Word" to set default file path file path type documents path path theOldDefaultPath
	return output
end run
```


Tips from the blog is a series and gets its name from a track from Black Sunday by Cypress Hill.

## Tips from the Blog V: Advice for New PIs

I recently gave a talk at a retreat for new PIs working at QMUL. My talk was focussed on tips for getting started, i.e. the nitty gritty of running an efficient lab. It was a mix of things I’ve been told, worked out for myself or that I’d learned the hard way.

PIs are expected to be able to do lots of things that can be full-time jobs in themselves. In my talk, I focussed on ways to make yourself more efficient to give yourself as much time to tackle all these different roles that you need to take on. You don’t need to work 80 hours a week to succeed, but you do need to get organised.

1. Timelines

Get a plan together. A long-term (5-year) plan and a shorter (1-2 year) plan. What do you want to achieve in the lab? What papers do you want to publish? How many people do you need in the lab? What grants do you need? When are your next three grant applications due? When is the first one due? Work back from there. It’s January, the first one is due in September: better get that paper submitted! You need a draft application available for circulation to colleagues in good time to do something about their comments. Plan well. Don’t leave anything to the last minute. But don’t apportion too much time, as the task will expand to fill it.

Always try to work towards the big goals. It’s too easy to spend all of your time on “urgent” things and busywork (fire-fighting). Prioritise Important over Urgent.

2. Time audit

Doing a time audit is a good way to identify where you are wasting time and how to reorganise your day to be more effective. Do you find it difficult to write first thing in the morning? If so, why not deal with your email or paperwork first thing since it requires less brain activity. Can you work during your commute? Save busywork for then. Can you switch between lab work and desk work well? Where are you fitting in teaching and admin? Try and find out answers to these questions with a time audit. It’s a horrible corporate thing to do, but I found it worked for me.

3. Lab manual

This was a popular idea. Paul Nurse’s lab had one – so should yours! The Royle lab manual has the following sections:

• Lab organisation
• Molecular Biology
• Cell Biology
• Biochemistry
• Imaging

The lab organisation section has subsections on 1) how to keep a lab book; 2) lab organisation (databases, plasmid/antibody organisation); 3) computers/data storage; 4) lab calculations; 5) making figures. The other sections are a collection of our tried-and-tested protocols. New protocols are submitted to a share on the server and honed until ready for preservation in the Lab Manual. The idea is that you give the manual to all new starters and get them to stick to it and to consult the manual first when troubleshooting. People in the lab like it, because they are not left guessing exactly what you expect of them.

As part of this, you need to sort out lab databases and a lab server for all of the data. One suggestion was to give one person in the lab the job of looking after (a.k.a. policing) the databases and enforcing compliance. We don’t do this; instead, we do spot checks every few weeks/months to ensure that things haven’t drifted too far.

Occasionally, and at random, I’ll ask all lab members to bring their lab books to our lab meeting. I ask everyone to swap books with someone else. I then pick a random date and ask person X to describe (using the lab book) what person Y did on that day. It’s a bit of fun, but makes people realise how important keeping a good lab book is.

4. Email

There are lots of tips on how to manage email – find something that works for you. For example, I set up several filters/rules that move low-importance messages away from my inbox. I flag messages that will take more than 5 sec to deal with, and handle them later. I’ve tried checking email only at specified times of the day – it doesn’t work well for me, but it might for you. Out-of-hours email is a problem. Just remember that no email is so urgent that it cannot wait until the morning – otherwise they would phone you.

5. Automation

Again there are lots of tips out there, e.g. in this post from Sylvain Deville. I have set up macros for routine things like exporting figures in a variety of formats/resolutions and assembling them with a manuscript file to one PDF for circulating around the lab for comment. We have workflows for building figures and writing papers. Anything that you do frequently is worth automating.

6. Deposit your plasmids with Addgene

They’ll distribute them for you! This saves a lot of time. You still get to check who is requesting them, if you are curious.

7. Organising frequently-used files

Spend some time making some really good schematic figures. These can be reused and rejigged time and again for a variety of purposes – talks, manuscripts etc. It’s worth doing this well, and with a diagram that is definitely yours and not plundered from the web. Also, never retype text instructions – save them somewhere and just cut-and-paste. Examples include: answers to common questions from students, details of how to do something in the lab, details of how to get to the lab, a brief biography, abstracts for talks…

Have a long-format CV always ready, and keep updating it (I’ve not found a good way to automate this yet). I get asked for my CV all the time, for lots of different things. Have the long (master) CV set up so that you can delete sections as appropriate, depending on the purpose. Use the publication list from this for pasting into the various boxes you are required to fill out. An EndNote smart list of all of your papers is also handy for rapidly formatting a list of your papers according to different styles. Try to keep your web profiles up-to-date: if you publish a new paper, add it to your CV and all of your profiles so they don’t look out of date. ORCiD, Researchfish, whatever – try and keep them all current.

Get a slidedeck together of all your slides on a topic. Pull from here to put your talks together. Get a slidepack together to show to visitors to the lab at a moment’s notice. Also, when you publish a new paper, make slides of the final figure versions and add them to the master slidedeck.

8. Literature alerts

Set up literature alerts. My advice would be: don’t have these coming to your inbox. Use RSS. This way you can efficiently mark interesting papers to look at later and keep your email free of clutter. Grab feeds for your favourite journals and for custom PubMed searches – not just for subject keywords, but also for colleagues and scientists who you think do interesting work. Set up Google Scholar to send you papers that have cited your work. Together with paper recommendations from Twitter (or maybe some other services – PubChase etc.), you’ll be as sure as you can be that you’re not missing anything. Also grab feeds from funding agencies, so that you don’t miss calls for grant applications. If all of these things are in place, you don’t need to browse the web, which can be a huge time drain.

9. Synchronise

I have several computers synced via Unison (thanks to Daniel and Christophe, who suggested this to me years ago). You can do this via Dropbox, but the space is limited. Unison syncs all my documents so that I am always able to work on the latest versions of things wherever I am. This is useful if, for some reason, you unexpectedly cannot make it in to work.
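For what it’s worth, a minimal Unison profile looks something like this (a sketch – the paths and server name are hypothetical):

```
# ~/.unison/docs.prf
# sync the local Documents folder with the copy on the server
root = /Users/me/Documents
root = ssh://myserver//Users/me/Documents
# accept non-conflicting changes without prompting
auto = true
```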

10. Paper of the day

11. Make use of admin staff

If you have access to administrative staff get them to do as much of your paperwork as is feasible so you can concentrate on other things. And be nice to them! They can help a lot if you are really stuck with something, like an imminent deadline; or they can… be less helpful.

12. Be a good colleague

13. Don’t write a book chapter

It’s a waste of time. Nobody will read it. Nobody will cite it. It will take time away from publishing real papers. Also, think carefully about writing review articles. If you have something unique to say, then go for it. Don’t do it just because you’ve been asked…

In need of some more advice?

This post was focussed on technicalities of running a lab to make things more efficient. There’s obviously lots more to it than this: people management, networking etc.

A great recommendation that I got after I had been a PI for a few years was this excellent book by Kathy Barker: At the Helm: Leading Your Laboratory. I read it and wished I’d found it earlier. The sections on early-stage negotiations and planning for the moment you become a PI are great (although it is very US-centric).

I’ve also been told that the EMBO Course for New Investigators is great, although I have not attended it.

Update 12:15 13/7/15: A reader sent this link to me via email. It’s a document from HHMI on scientific management for Postdocs and New PIs. Well worth a read!!

Update 07:41 4/2/15: We now use Trello for organising activities in the lab. You can read about how we do that here. I added the lab book audit anecdote and fixed some typos.

Thanks to attendees of the QMUL ECR Retreat for your feedback on my talk. I also incorporated a few points from Kebs Hodivala-Dilke’s presentation, which was focussed more on the big picture points. If you have any other time-saving tips for PIs, leave a comment.

## Tips from the blog IV – averaging

I put a recent code snippet up on IgorExchange. It’s a simple procedure for averaging a set of 1D waves and putting the results in a new wave. The difference between this code and Average Waves.ipf (which ships with Igor) is that this function takes the average of all points in each wave and places that single value in a new wave. You can specify whether the mean or median is used for the average.
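The snippet itself is on IgorExchange; purely as an illustration of the idea, a minimal version might look like this (my own names, not the posted code):

```
// collapse each 1D wave in a semicolon-separated list to a single value
Function CollapseWaves(listOfWaves, useMedian)
	String listOfWaves	// e.g. from WaveList("*", ";", "")
	Variable useMedian	// 0 = mean, 1 = median

	Variable nWaves = ItemsInList(listOfWaves)
	Make/O/N=(nWaves) collapsedW	// one point per input wave
	Variable i
	for(i = 0; i < nWaves; i += 1)
		Wave w = $StringFromList(i, listOfWaves)
		collapsedW[i] = useMedian ? StatsMedian(w) : mean(w)
	endfor
End
```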

I still don’t have a way to mark up Igor code in WordPress.

## Tips from the blog III – violin plots

Having recently got my head around violin plots, I thought I would explain what they are and why you might want to use them.

There are several options when it comes to plotting summary data. I list them here in order of granularity, before describing violin plots and how to plot them in some detail.

Bar chart

This is the mainstay of most papers in my field. Typically, a bar representing the mean value that’s been measured is shown with an error bar which shows either the standard error of the mean, the standard deviation, or more rarely a confidence interval. The data series plotted in all cases is the waiting time for Old Faithful eruptions (waiting), a classic dataset from R; I have added some noise to waiting (waiting_mod) to give a second series for comparison. I think it’s fair to say that most people feel that the bar chart has probably had its day and that we should be presenting our data in more informative ways*.

Pros: compact, easy to tell differences between groups

Cons: hides the underlying distribution, obscures the n number

Box plot

The box plot – like many things in statistics – was introduced by Tukey. It’s sometimes known as a Tukey plot, or a box-and-whiskers plot. The idea was to give an impression of the underlying distribution without showing a histogram (see below). Histograms are great, but when you need to compare many distributions they do not overlay well and take up a lot of space to show them side-by-side. In the simplest form, the “box” is the interquartile range (IQR, 25th and 75th percentiles) with a line to show the median. The whiskers show the 10th and 90th percentiles. There are many variations on this theme: outliers can be shown or not, the whiskers may show the limits of the dataset (or something else), the boxes can be notched or their width may represent the sample size…

Pros: compact, easy to tell differences between groups, shows normality/skewness

Cons: hides multimodal data, sometimes obscures the n number, many variations

Histogram

A histogram is a method of showing the distribution of a dataset and was introduced by Pearson. The number of observations within a bin are counted and plotted. The ‘bars’ sit next to each other, because the variable being measured is continuous. The variable being measured is on the x-axis, rather than the category (as in the other plots).

Often the area of all the bars is normalised to 1 in order to assess the distributions without being confused by differences in sample size. As you can see here, “waiting” is bimodal. This was hidden in the bar chart and in the box plot.

Related to histograms are other display types such as stemplots or stem-and-leaf plots.

Pros: shows multimodal data, shows normality/skewness clearly

Cons: not compact, difficult to overlay, bin size and position can be misleading

Scatter dot plot

It’s often said that if there are fewer than 10 data points, then best practice is to simply show the points. Typically the plot is shown together with a bar for the mean (or median), and maybe with error bars showing s.e.m., s.d. or IQR. There are a couple of methods of plotting the points, because they need to be scattered in x in order to be visualised. Adding random noise is one approach, but this looks a bit messy (top). A symmetrical scatter can be introduced by binning (middle), and a further iteration is to bin the y values rather than showing their true location (bottom). There’s a further iteration which constrains the category width and overlays multiple points, but again the density becomes difficult to see.

These plots still look quite fussy. The binned version is the clearest, but then we are losing the exact locations of the points, which seems counterintuitive. Another alternative to scattering the dots is to show a rug plot (see below), where there is no scatter.

Pros: shows all the observations

Cons: can be difficult to assess the distribution

Violin plot

This type of plot was introduced in the software package NCSS in 1997 and described in this paper: Hintze & Nelson (1998) The American Statistician 52(2):181-4 [PDF]. As the title says, violin plots are a synergism between box plot and density trace. A thin box plot is shown together with a symmetrical kernel density estimate (KDE, see explanation below). The point is to be able to quickly assess the distribution. You can see the bimodality of waiting in the plot, but there’s no complication of lots of points – just a smooth curve to show the data.

Pros: shows multimodal data, shows normality/skewness unambiguously

Cons: hides n, not familiar to many readers.

* Why is the bar chart so bad and why should I show my data another way?

The best demonstration of why the bar chart is bad is Anscombe’s Quartet (the figure to the right is taken from the Wikipedia page). These four datasets are completely different, yet they all have the same summary statistics. The point is, you would never know unless you plotted the data. A bar chart would look identical for all four datasets.

Making Violin Plots in IgorPro

I wanted to make Violin Plots in IgorPro, since we use Igor for absolutely everything in the lab. I wrote some code to do this and I might make some improvements to it in the future – if I find the time! This was an interesting exercise, because it meant forcing myself to understand how smoothing is done. What follows below is an aide memoire, but you may find it useful.

What is a kernel density estimate?

A KDE is a non-parametric method to estimate the probability density function of a variable. A histogram can be thought of as a simplistic non-parametric density estimate: a rectangle is used to represent each observation, and a bar gets bigger the more observations fall within it.

What’s wrong with using a histogram as a KDE?

The following examples are taken from here (which in turn are taken from the book by Bowman and Azzalini described below). A histogram is simplistic. We lose the location of each datapoint because of binning. Histograms are not smooth and the estimate is very sensitive to the size of the bins and also the starting location of the first bin. The histograms to the right show the same data points (in the rug plot).

Using the same bin size, they result in very different distributions depending on where the first bin starts. My first instinct to generate a KDE was to simply smooth a histogram, but this is actually quite inaccurate as it comes from a lossy source. Instead we need to generate a real KDE.

How do I make a KDE?

To do this we place a kernel (a Gaussian is commonly used) at each data point. The rationale behind this is that each observation can be thought of as being representative of a greater number of observations. It sounds a bit bizarre to assume normality in order to estimate a density non-parametrically, but it works. We can sum all of the kernels to give a smoothed distribution: the KDE. Easy? Well, yes, as long as you know how wide to make the kernels. To do this we need to find the bandwidth, h (also called the smoothing parameter).

It turns out that this is not completely straightforward. The answer is summarised in this book: Bowman & Azzalini (1997) Applied Smoothing Techniques for Data Analysis. In the original paper on violin plots, the authors did not have a good solution for selecting h when drawing the violins; they suggest trying several different values, with ~15% of the data range as a starting point. Obviously, if you are writing some code, the process of selecting h needs to be automatic.

Optimising h is necessary because if h is too large, the estimate will be oversmoothed and features will be lost. If it is too small, then it will be undersmoothed and bumpy. The examples to the right (again, taken from Bowman & Azzalini, via this page) show undersmoothed, oversmoothed and optimally smoothed versions.

An optimal solution to find h is

$$h = \left(\frac{4}{3n}\right)^{\frac{1}{5}}\sigma$$

This is termed Silverman’s rule-of-thumb. If smoothing is needed in more than one dimension, the multidimensional version is

$$h = \left\{\frac{4}{\left(p+2\right)n}\right\}^{\frac{1}{\left(p+4\right)}}\sigma$$

You might need multidimensional smoothing to contextualise more than one parameter being measured. The waiting data used above describes the time to wait until the next eruption from Old Faithful. The duration of the eruption is measured, and also the wait to the next eruption can be extracted, giving three parameters. These can give a 3D density estimate as shown here in the example.

Bowman & Azzalini recommend using the median absolute deviation estimator for $$\sigma$$, which is robust if the distribution is long-tailed:

$$\tilde\sigma=\operatorname{median}\left\{|y_i-\tilde\mu|\right\}/0.6745$$

where $$\tilde\mu$$ is the median of the sample. All of this is something you don’t need to worry about if you use R to plot violins: the implementation there is rock solid, having been written in S-Plus and then ported to R years ago. You can pick how the h selection is done in sm.density, or modify the optimal h directly using hmult.

To get this working in IgorPro, I used some code for 1D KDE that was already on IgorExchange. It needed a bit of modification because it used FastGaussTransform to sum the kernels as a shortcut. It’s a very fast method, but it initially gave an estimate that seemed to be undersmoothed, and I spent a while altering the formula for h, hence the detail above. To cut a long story short, FastGaussTransform uses a Taylor expansion of the Gauss transform, and it just needed more terms to do this accurately. This is set with the /TET flag. Note also that in Igor the width of a Gauss is sigma*sqrt(2).
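FastGaussTransform aside, the direct summation is easy to sketch (an illustration with my own names, not the IgorExchange code):

```
// 1D Gaussian KDE with Silverman's rule-of-thumb bandwidth
Function MakeKDE(dataW)
	Wave dataW

	Variable n = numpnts(dataW)
	Variable sigma = sqrt(Variance(dataW))
	Variable h = (4 / (3 * n))^(1 / 5) * sigma	// Silverman's rule-of-thumb

	// evaluate the estimate on a regular grid spanning the data
	Variable xlo = WaveMin(dataW) - 3 * h
	Variable xhi = WaveMax(dataW) + 3 * h
	Make/O/N=200 kdeW
	SetScale/I x, xlo, xhi, "", kdeW
	kdeW = 0
	Variable i
	for(i = 0; i < n; i += 1)
		kdeW += exp(-(x - dataW[i])^2 / (2 * h^2))	// one kernel per observation
	endfor
	kdeW /= n * h * sqrt(2 * pi)	// normalise so the density integrates to 1
End
```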

OK, so how do I make a Violin for plotting?

I used the draw tools to do this and placed the violins behind an existing box plot. This is necessary to be able to colour the violins (apparently transparency is coming to Igor in IP7). The other half of the violin needs to be calculated, and the two halves are then joined with the DrawPoly command. If the violins are trimmed, i.e. cut at the limits of the dataset, this requires an extra point to be added; without trimming, this step is not required. The only other issue is how wide the violins are plotted. In R, the violins are all normalised so that information about n is lost. In the current implementation, box width is 0.1 and the violins are normalised to the area under the curve*(0.1/2), so again information on n is lost.

Future improvements

Ideas for developments of the Violin Plot method in IgorPro

• incorporate it into the ipf for making boxplots so that it is integrated as an option to ‘calculate percentiles’
• find a better solution for setting the width of the violin
• add other bandwidth options, as in R
• add more options for colouring the violins

What do you think? Did I miss something? Let me know in the comments.

References

Bowman, A.W. & Azzalini, A. (1997) Applied Smoothing Techniques for Data Analysis: The Kernel Approach with S-Plus Illustrations. Oxford University Press.

Hintze, J.L. & Nelson, R.D. (1998) Violin plots: a box plot-density trace synergism. The American Statistician, 52(2):181-4.

## Tips from the Blog II

An IgorPro tip this week. The default font for plots is Geneva, but most of our figures are assembled using Helvetica for labelling. The default font can be changed in Igor Graph Preferences, but Preferences need to be switched on in order to be implemented. Anyway, I always seem to end up with a mix of Geneva plots and Helvetica plots. This can be annoying, as the fonts are pretty similar yet the spacing is different, and this can affect the plot size. Here is a quick procedure, Helvetica4All(), to rectify this for all graph windows.
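Something along these lines (a minimal sketch of the idea; the original Helvetica4All may differ):

```
Function Helvetica4All()
	String list = WinList("*", ";", "WIN:1")	// all graph windows
	String gName
	Variable i

	for(i = 0; i < ItemsInList(list); i += 1)
		gName = StringFromList(i, list)
		ModifyGraph/W=$gName gFont="Helvetica"	// set the graph's default font
	endfor
End
```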