# Step By Step: recreating a volcano plot in R

We have an analysis routine for proteomics data written for IgorPro. One output is a volcano plot. These plots show the fold change in one sample compared to another and plot that against a p-value to estimate how reproducible any changes observed are. This post is not about that software, but on the topic of how we can recreate this plot in R. What steps need to be considered?

### Quick note about volcano plots in R

There are plenty of ways to make volcano plots in R. One of the best is EnhancedVolcano which is available in Bioconductor. There is also a shiny app VolcaNoseR by Joachim Goedhart. If you want to make a volcano plot with your own data, start with one of those. The idea of this post is to walkthrough the steps to get a desired result.

## The original plot

An example output from VolcanoPlot is shown below. Each point represents a protein detected by mass spectrometry. The co-ordinates come from a Log2 representation of the fold-change on the x-axis, and on the y, the p-value from a t-test (typically there are three replicates for the test and for the control).

Some things to note about the plot before we try to recreate it in R.

• Points are coloured grey if the fold change is less than 2 and greater than 0.5 (i.e. 1 and -1 on a Log2 scale), not of interest.
• If the p-value is less than 0.05, the points are either blue or pink. Blue, the fold change is lower than 0.5 (enriched in the control); pink, the fold change is greater than 2 (enriched in the test condition). Changes greater than 2 but below the significance threshold are coloured red.
• Vertical dotted line shows the boundary between enrichment in test or control
• Horizontal dotted line shows the significance threshold.
• y axis is decreasing and log10-scaled (1 to 1e-4)
• x axis is on a log2 scale, symmetrical (-7 to 7) with respect to dotted line and labels show the exponent not the real value.

Aesthetically, we have 0.5 alpha on the points to help with visualisation. No stroke on the points. Helvetica throughout. There’s plenty of space on the plot, and it is tall. This is to aid labelling of proteins, none of which are shown here.

It also looks “Igor-like”: there’s a standoff on the axes, engineering style ticks on y to help show the log scale, open axes (no mirroring), no grid lines.

## Our mission is to recreate this plot in R

I’ll go through the major steps that need to be considered and worked through to get something close to the plot above (using ggplot).

We start with a tab-separated text file from Igor. This file has several columns including the names of the proteins, the p-value, the ratio of test vs control, and a number corresponding to the colouring. Let’s go!

```library(tidyverse)

# so_ratioWave and so_allTWave contain the ratios and p-values
ggplot(df, aes(x = so_ratioWave, y = so_allTWave)) +
geom_point()
```

This doesn’t look good! The p-values and ratios are their true values and we would like to represent them differently. In Igor, this is a simple option to change the axis type, set limits and you’re done. With ggplot, it is more tricky. We can try to play with the scales using `scale_y_log10` and so on. The simplest approach though is to make new columns in our data frame with the converted versions of these data. We need to do a `-log10()` transformation on the p-values and a `log2()` transformation on the ratios.

```df\$minus_log10_pvalue <- -log10(df\$so_allTWave)
df\$log2_fc <- log2(df\$so_ratioWave)

ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) +
geom_point()
```

Let’s make it look a tad more presentable. Add some labels and change the theme. Give the points an alpha of 0.5 like in the original plot. We can use `shape = 16` to give stroke-less points.

```ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) +
geom_point(shape = 16, alpha = 0.5) +
labs(x = "Fold change (log2)", y = "P-value") +
theme_bw(9) +
theme(legend.position = "none")
```

We’ll stick with this basic theme, although the Igor version had open axes and no gridlines. But we now have our basic plot sorted out.

### Next: How do we do the colours?

The dataframe has a column called so_colorWave which has 0, 1, 2, 3 to designate the grey, red, blue and pink colouring of the points.

We use `aes(colour = as.factor(so_colorWave))` to specify that we want to use this column to colour the points. I call `as.factor()` because when the column was imported it is an integer type.

Then we can use `scale_colour_manual()` to specify the colours. I use hex representation for each (matching the original). The `values` argument is supplied with a list of these values. Note that this is a named list. Without specifying which factor is coloured by which hex colour, we would get the same result in this case. However, if the dataframe did not have any 2s (blue points) for example, without a named list, the colours change order. So this is a more robust way to colour the points.

```ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) +
geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) +
scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) +
labs(x = "Fold change (log2)", y = "P-value") +
theme_bw(9) +
theme(legend.position = "none")
```

Let’s add the dashed lines. We’ll put them behind the points rather than in front as the original has it. Remember that ggplot adds things to the plot in the order you add them!

```ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") +
geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) +
scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) +
labs(x = "Fold change (log2)", y = "P-value") +
theme_bw(9) +
theme(legend.position = "none")
```

Getting closer… now let’s pay the axes some attention

### Next: scaling and labelling of axes

We would like symmetrical scaling about 0 on the x-axis. To do this we need to find the largest or smallest value in the dataset and use that to set the limits.

```max(abs(df\$log2_fc))
```

will give us this value, we use `abs()` so that negative values become positive and we then find the maximum of all values in one direction. We could set this as a value and use that, or we can get ggplot to calculate it on the fly, like this:

```ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") +
geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) +
scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) +
scale_x_continuous(limits = c(-max(abs(df\$log2_fc)),max(abs(df\$log2_fc)))) +
labs(x = "Fold change (log2)", y = "P-value") +
theme_bw(9) +
theme(legend.position = "none")
```

Now for the y-axis. We can fix the labelling like this:

```ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") +
geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) +
scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) +
scale_x_continuous(limits = c(-max(abs(df\$log2_fc)),max(abs(df\$log2_fc)))) +
scale_y_continuous(limits = c(0,NA), labels = function(i) 10^-i) +
labs(x = "Fold change (log2)", y = "P-value") +
theme_bw(9) +
theme(legend.position = "none")
```

Adding the log ticks is difficult. We can use `annotation_logticks()` and specify `sides = "l", outside = TRUE` to give us log ticks in the same place as the original. As default the plot clipping means that they are not visible, so we can add `coord_cartesian(clip = "off")` to give us this:

Now we have three problems. The log ticks are the wrong way around (and don’t match the axis), they overlap with the tick labels, and there are ticks beyond 1 which doesn’t make sense. There is no reverse option for `annotation_logticks()`, although an issue was filed on GitHub some time ago.

### Decision time: how do we solve the y axis problem?

Can we fix this problem? We can go back to the original y data, plot that (rather than the transform) and transform the axis.

```library(scales)
ggplot(df, aes(x = log2_fc, y = so_allTWave)) +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") +
geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) +
scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) +
scale_x_continuous(limits = c(-max(abs(df\$log2_fc)),max(abs(df\$log2_fc)))) +
scale_y_continuous(trans = c("log10", "reverse")) +
annotation_logticks(sides = "l", outside = TRUE) +
coord_cartesian(clip = "off") +
labs(x = "Fold change (log2)", y = "P-value") +
theme_bw(9) +
theme(legend.position = "none")
```

Use of `scale_y_continuous(trans = c("log10", "reverse"))` works to transform the data correctly. However, `annotation_logticks()` is not aware of this transformation so this doesn’t solve the problem. Note that the horizontal line is now misplaced and that would need fixing too.

So, decision time: I will drop the log ticks since it will be too much effort to fix this.

Since I failed with the ticks, let’s fix the gridlines a bit since it doesn’t make sense to have the minor gridlines at a non-integer location. And let’s mimic the Igor style a little closer.

```ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") +
geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) +
scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) +
scale_x_continuous(limits = c(-max(abs(df\$log2_fc)),max(abs(df\$log2_fc)))) +
scale_y_continuous(limits = c(0,NA), labels = function(i) 10^-i) +
labs(x = "Fold change (log2)", y = "P-value") +
theme_classic(9) +
theme(legend.position = "none")
```

I will change the scales so they match the original better. Oh, and the x-axis title needs fixing.

```ggplot(df, aes(x = log2_fc, y = minus_log10_pvalue)) +
geom_vline(xintercept = 0, linetype = "dashed", colour = "grey") +
geom_hline(yintercept = -log10(0.05), linetype = "dashed", colour = "grey") +
geom_point(aes(colour = as.factor(so_colorWave)), shape = 16, alpha = 0.5) +
scale_colour_manual(values = c("0" = "#808080", "1" = "#ff8080", "2" = "#8080ff", "3" = "#ff80ff")) +
scale_x_continuous(limits = c(-7.1,7.1), breaks = seq(-6,6,2)) +
scale_y_continuous(limits = c(0,4), labels = function(i) 10^-i) +
xlab(bquote(Test / Control (Log))) +
ylab("P-value") +
theme_classic(9) +
theme(legend.position = "none")
```

### A final side-by-side

Here is my attempt at a recreation, with the original. It’s reasonably close. Drag the slider to compare.

## Conclusion

The aim here was to show the steps needed to get a desired plot result. We were not going for an exact recreation of the original plot, just something close. I wanted to highlight how sometimes things don’t work out and you have to make some choices along the way – the important thing is a) to get a plot, and b) that you are happy with it.

The post title comes from “Step By Step” by The Alan Parsons Project from “Eye in the Sky”. Sorry to disappoint anyone who thought I would pick New Kids on The Block or Whitney Houston.

## 3 thoughts on “Step By Step: recreating a volcano plot in R”

1. quantixed says:

Thanks for the comment. I tried adding annotation_logticks and it currently doesn’t work as intended for Volcano plots. I should have linked to the reference though! I have filed an issue to see if it can be resolved.