## Pledging My Time III

I’ve previously crunched times for local Half and Full Marathons here on quantixed. Last weekend was the Kenilworth Half Marathon (2018) over a new course. I thought I’d have a look at the distributions of times and paces of the runners. The times are available here. If the Time and Category for finishers are saved as a csv, the script below works to generate the following plots.

Aggregated stats for the race are here. The beeswarm plot nicely shows the distribution of runners times and paces per category. There’s a bimodality to some of the age groups which is interesting. You can see from the average times that people get slower as they get older, as expected.

There was a roughly 2:1 split of M:F runners with a similar proportion in all categories. The ratio is similar for DNSers. The winning times were Andrew Savery of Leamington C A & C in MV35 with 01:12:51 and Polly Keen of Nuneaton Harriers in F sen with 01:23:46.

Congrats to everyone who ran and thanks to the organisers and all the supporters out on the course.

```
require(ggplot2)
require(ggbeeswarm)
file_name <- file.choose()
# aggregate M and F to a new category called Gender
df1\$Gender <- ifelse(startsWith(df1\$Category,"F"),"F","M")
# format Date column to POSIXct
df1\$Time <- as.POSIXct(strptime(df1\$Time, format = "%H:%M:%S"))
orig_var <- as.POSIXct("00:00:00", format = "%H:%M:%S")
p1 <- ggplot( data = df1, aes(x = Category,y = Time, color = Category)) +
geom_quasirandom(alpha = 0.5, stroke = 0) +
stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
scale_y_datetime(date_labels = "%H:%M:%S", limits = c(orig_var,NA))
p1
# instead of finishing time, let's look at pace (min/km)
df1\$Pace <- as.numeric(difftime(df1\$Time, orig_var) / 21.1) * 3600
df1\$Pace <- as.POSIXct(df1\$Pace, origin = orig_var, format = "%H:%M:%S")
p2 <- ggplot( data = df1, aes(x = Category,y = Pace, color = Category)) +
geom_quasirandom(alpha = 0.5, stroke = 0) +
stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
scale_y_datetime(date_labels = "%M:%S", limits = c(orig_var,NA))
p2
# calculate speeds rather than pace
df1\$Speed <- 21.1 / as.numeric(difftime(df1\$Time, orig_var))
p3 <- ggplot( data = df1, aes(x = Category, y = Speed, color = Category)) +
geom_quasirandom(alpha = 0.5, stroke = 0) +
stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
ylim(0,NA) + ylab("Speed (km/h)")
p3
# now make the same plots but by Gender rather than Category
p4 <- ggplot( data = df1, aes(x = Gender,y = Time, color = Gender)) +
geom_quasirandom(alpha = 0.5, stroke = 0) +
stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
scale_y_datetime(date_labels = "%H:%M:%S", limits = c(orig_var,NA))
p4
p5 <- ggplot( data = df1, aes(x = Gender,y = Pace, color = Gender)) +
geom_quasirandom(alpha = 0.5, stroke = 0) +
stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
scale_y_datetime(date_labels = "%M:%S", limits = c(orig_var,NA))
p5
p6 <- ggplot( data = df1, aes(x = Gender, y = Speed, color = Gender)) +
geom_quasirandom(alpha = 0.5, stroke = 0) +
stat_summary(fun.y = mean, geom = "point", size=2, aes(group = 1)) +
ylim(0,NA) + ylab("Speed (km/h)")
p6
ggsave("raceTimeByCat.png", plot = p1)
ggsave("racePaceByCat.png", plot = p2)
ggsave("raceSpeedByCat.png", plot = p3)
ggsave("raceTimeByGen.png", plot = p4)
ggsave("racePaceByGen.png", plot = p5)
ggsave("raceSpeedByGen.png", plot = p6)

```

Edit 2018-09-12T18:52:43Z I wasn’t happy with the plots and added a few more lines to look at gender as well as category. And show speed as well as pace and finishing time.

The post title is taken from “Pledging My Time” a track from Blonde on Blonde by Bob Dylan

## Rip It Up: Grabbing movies from Twitter for use in ImageJ

Some great scientific data gets posted on Twitter. Sometimes I want to take a closer look and this post describes a strategy to do so.

Edit: I received a request to take down the 3D volume images derived from the example dataset I used in this post. I’ve edited the post below so that is now a general guide.

Grab the video

It can be a bit difficult to the grab video from Twitter. The best way I’ve found is using youtube-dl. This works for downloading video and audio from YouTube to view offline, but it also works for other embedded video content on other websites.

`youtube-dl -o '%(title)s.%(ext)s' https://twitter.com/username/status/tweetID`

Convert to avi

Now, mp4 is a compressed file format which cannot be read directly by FIJI/ImageJ. Conversion to avi means that the file can be loaded. I like to use another command line tool, ffmpeg for video conversions.

`ffmpeg -i originalFile.mp4 -pix_fmt nv12 -f avi -vcodec rawvideo convertedFile.avi`

Now we have an avi file called convertedFile.avi that we can use.

The avi can be loaded into FIJI. At this point you can analyse the video. However, in the case of the video I was interested in, the data had been pseudocolored and was now in RGB format. I wanted to look at the original data. Converting to grayscale does not give the correct representation but conversion back to grayscale is possible if you know the LUT was applied. Even if you don’t, it’s possible to take a guess at the LUT and do the conversion.

Converting RGB to original values

I found a nice gist that does the conversion for a single image. I just modified this code to work for a stack. It requires the LUT to be displayed vertically in a window called LUT. Caution: this code runs very slowly because every pixel in every slice needs to be recalculated and ImageJ is slow… I took a guess that mpl-inferno was used (I don’t think is exactly right but it worked well enough). You can display the built-in LUTs in FIJI using Color > Display LUTs… and from there you can make the LUT window which the macro uses for the calculation. The macro to convert stacks to grayscale using the LUT is here.

I had a nice grayscale version of the data (inverted because I wanted to look at the volume). This let me see how the layers in the original video add together to make the full structure. I used ClearVolume which can be installed via Update Site in FIJI. I just made a quick video to show it in action (see below). You’ll have to take my word for it (video removed).

So extracting scientific data from Twitter or another online source is pretty straightforward. The extra complication was getting rid of the pseudocoloring, but once this was done, something very close to the original data was available.  Nonetheless this workflow is a fun way to take a closer look at some of the cool movies that people post on Twitter. I hope you find it useful.

The post title comes from “Rip It Up” by Orange Juice. A popular title in my library with versions from several different artists. I was thinking what is described is similar to ripping video content.

## Turn That Heartbeat Over Again: comparing wrist and chest-strap HRM

As a geek, the added bonus of exercise is the fun that you can have with the data you’ve generated. A recent conversation on Twitter about the accuracy of wrist-based HRMs got me thinking… how does a wrist-based HRM compare with a traditional chest-strap HRM? Conventional wisdom says that the chest-strap is more accurate, but my own experience of chest-strap HRMs is that they are a bit unreliable. Time to put it to the test.

I have a Garmin Fēnix 5 which records wrist-based HR and I have a Garmin chest-strap which uses ANT+ to transmit. I could pick up the ANT+ signal with a Garmin Edge 800 so that I could record both datasets simultaneously, on the same bike ride. Both the Fēnix and Edge can record GPS and Time (obviously) allowing accurate registration of the data. I also set both devices to receive cadence data via ANT+ from the same cadence/speed sensor so that I could control for (or at least look at) variability in recordings. I rode for a ~1 hr ~32 km to capture enough data. Obviously this is just one trial but the data gives a feel for the accuracy of the two HRMs. Biking, I figured was a fair activity since upper body and wrist movement is minimal, meaning that the contacts for both HRMs are more likely to stay in place than if I was running.

I’ll get to heart rate last. First, you can see that the GPS recording is virtually identical between the two units – so that’s a good sign. Second, elevation is pretty similar too. There’s a bit of divergence at the beginning and end of the ride, since those parts are over the same stretch of road, neither device looks totally accurate. I’d be inclined to trust the Fēnix here since has a newer altimeter in it. Third, cadence looks accurate. Remember this data is coming off the same sensor and so any divergence is in how it’s being logged. Finally, heart rate. You can see that at the beginning of the ride, both wrist-based and chest-strap HRMs struggle. Obviously I have no reference here to know which trace is correct, but the chest-strap recording looks like it has an erroneous low reading for several minutes but is otherwise normal. The wrist-based HRM looks like it is reading 120ish bpm values from the start and then snaps into reading the right thing after a while. The chest-strap makes best contact when some perspiration has built up, which might explain its readings. The comparisons are shown below in grey. The correlation is OK but not great. Compared to cadence, the data diverge a lot more which rules out simple logging differences as a cause.

I found a different story in Smart recording mode on the Fēnix. This is a lower frequency recording mode, which is recommended to preserve battery life for long activities.

So what can we see here? Well, the data from the Fēnix are more patchy but even so, the data are pretty similar except for heart rate. The Fēnix performs badly here. Again, you can see the drop out of the chest strap HRM for a few minutes at the start, but otherwise it seems pretty reliable.  The comparison graph for heart rate shows how poorly the wrist-based HRM measures heart rate, in this mode.

OK, this is just two rides, for one person – not exactly conclusive but it gives some idea about the data that are captured with each HRM.

Conclusions

Wrist-based HRM is pretty good (at the higher sampling rate only) especially considering how the technology works, plus chest-strap HRMs can be uncomfortable to wear, so wrist-based HRM may be all you need. If you are training to heart rate zones, or want the best data, chest-strap HRM is more reliable than wrist-based HRM generally. Neither are very good for very short activities (<15 min).

For nerds only

Comparisons like these are quite easy to do in desktop packages like Rubitrack (which I love) or Ascent or others. They tend to mask things like missing data points, they smooth out the data and getting the plots the way you want is not straightforward. So, I took the original FIT files from each unit and used these for processing. There’s a great package for R called cycleRtools. This worked great except for the smart recording data which was sampled irregularly and it turns out and the package requires monotonic sampling. I generated a gpx file and parsed the data for this activity in R using XML. I found this snippet on the web (modified slightly).

```library(XML)
library(plyr)
filename <- "myfile.gpx"
gpx.raw <- xmlTreeParse(filename, useInternalNodes = TRUE)
rootNode <- xmlRoot(gpx.raw)
gpx.rawlist <- xmlToList(rootNode)\$trk
gpx.list <- unlist(gpx.rawlist[names(gpx.rawlist) == "trkseg"], recursive = FALSE)
gpx <- do.call(rbind.fill, lapply(gpx.list, function(x) as.data.frame(t(unlist(x)), stringsAsFactors=F)))
names(gpx) <- c("ele", "time", "temp", "hr", "cad", "lat", "lon")
```

Otherwise:

```library(cycleRtools)
edge <- as.data.frame(read_fit(file = file.choose(), format = TRUE, CP = NULL, sRPE = NULL))
write.csv(edge, file = "edge.csv", row.names = F)
```

The resulting data frames could be saved out as csv and read into Igor to make the plots. I wrote a quick function in Igor to resample all datasets at 1 Hz. The plots and layout were generated by hand.

The post title comes from “Turn That Heartbeat Over Again” by Steely Dan from Can’t Buy A Thrill

## I’m not following you II: Twitter data and R

My activity on twitter revolves around four accounts.

I try to segregate what happens on each account, and there’s inevitably some overlap. But what about overlap in followers?

What lucky people are following all four? How many only see the individual accounts?

It’s quite easy to look at this in R.

So there are 36 lucky people (or bots!) following all four accounts. I was interested in the followers of the quantixed account since it seemed to me that it attracts people from a slightly different sphere. It looks like about one-third of quantixed followers only follow quantixed, about one-third follow clathrin also and more or less the remainder are “all in” following three accounts or all four. CMCB followers are split about the same. The lab account is a bit different, with close to one-half of the followers also following clathrin.

Extra nerd points:

This is a Venn diagram and not an Euler plot. Venn just shows schematically the intersections and does not attempt to encode information in the area of each part. Euler plots for greater than three groups are hard to generate and to make any sense of what is shown. It is a dataviz problem to look at the proportions or lots of groups. A solution here would be to generate a further four Venn diagrams. On each, display the proportion for one category as a fraction or percentage

How to do it:

Last time, I described how to set up rtweet and make a Twitter app for use in R. You can use this to pull down lists of followers and extract their data. Using the intersect function you can work out the numbers of followers at each intersection. For four accounts, there will be 1 group of four, 4 groups of three, 6 groups of two. The VennDiagram package just needs the total numbers for all four groups and then details of the intersections, i.e. you don’t need to work out the groups minus their intersections – it does this for you.

```library(rtweet)
library(httpuv)
library(VennDiagram)
## whatever name you assigned to your created app
appname <- "whatever_name"
## api key (example below is not a real key)
key <- "blah614h"
## api secret (example below is not a real key)
secret <- "blah614h"
app = appname,
consumer_key = key,
consumer_secret = secret)
clathrin_followers <- get_followers("clathrin", n = "all")
clathrin_followers_names <- lookup_users(clathrin_followers)
quantixed_followers <- get_followers("quantixed", n = "all")
quantixed_followers_names <- lookup_users(quantixed_followers)
cmcb_followers <- get_followers("Warwick_CMCB", n = "all")
cmcb_followers_names <- lookup_users(cmcb_followers)
roylelab_followers <- get_followers("roylelab", n = "all")
roylelab_followers_names <- lookup_users(roylelab_followers)
# a = clathrin
# b = quantixed
# c = cmcb
# d = roylelab
## now work out intersections
anb <- intersect(clathrin_followers_names\$user_id,quantixed_followers_names\$user_id)
anc <- intersect(clathrin_followers_names\$user_id,cmcb_followers_names\$user_id)
and <- intersect(clathrin_followers_names\$user_id,roylelab_followers_names\$user_id)
bnc <- intersect(quantixed_followers_names\$user_id,cmcb_followers_names\$user_id)
bnd <- intersect(quantixed_followers_names\$user_id,roylelab_followers_names\$user_id)
cnd <- intersect(cmcb_followers_names\$user_id,roylelab_followers_names\$user_id)
anbnc <- intersect(anb,cmcb_followers_names\$user_id)
anbnd <- intersect(anb,roylelab_followers_names\$user_id)
ancnd <- intersect(anc,roylelab_followers_names\$user_id)
bncnd <- intersect(bnc,roylelab_followers_names\$user_id)
anbncnd <- intersect(anbnc,roylelab_followers_names\$user_id)
## four-set Venn diagram
area1 = nrow(clathrin_followers_names),
area2 = nrow(quantixed_followers_names),
area3 = nrow(cmcb_followers_names),
area4 = nrow(roylelab_followers_names),
n12 = length(anb),
n13 = length(anc),
n14 = length(and),
n23 = length(bnc),
n24 = length(bnd),
n34 = length(cnd),
n123 = length(anbnc),
n124 = length(anbnd),
n134 = length(ancnd),
n234 = length(bncnd),
n1234 = length(anbncnd),
category = c("Clathrin", "quantixed", "CMCB", "RoyleLab"),
fill = c("dodgerblue1", "red", "goldenrod1", "green"),
lty = "dashed",
cex = 2,
cat.cex = 1.5,
cat.col = c("dodgerblue1", "red", "goldenrod1", "green"),
fontfamily = "Helvetica",
cat.fontfamily = "Helvetica"
);
# write to file
grid.draw(venn.plot);
dev.off()
```

I’ll probably return to rtweet in future and will recycle the title if I do.

Like last time, the post title is from “I’m Not Following You” the final track from the 1997 LP of the same name from Edwyn Collins

## Frankly, Mr. Shankly

I read about Antonio Sánchez Chinchón’s clever approach to use the Travelling Salesperson algorithm to generate some math-art in R. The follow up was even nicer in my opinion, Pencil Scribbles. The subject was Boris Karloff as the monster in Frankenstein. I was interested in running the code (available here and here), so I thought I’d run it on a famous scientist.

By happy chance one of the most famous scientists of the 20th Century, Rosalind Franklin, shares a nominative prefix with the original subject. There is also a famous portrait of her that I thought would work well.

I first needed needed to clear up the background because it was too dark.

Now to run the TSP code.

The pencil scribbles version is nicer I think.

The R scripts basically ran out-of-the-box. I was using a new computer that didn’t have X11quartz on it nor the packages required, but once that they were installed I just needed to edit the line to use a local file in my working directory. The code just ran. The outputs FrankyTSP and Franky_scribbles didn’t even need to be renamed, given my subject’s name.

Thanks to Antonio for making the code available and so easy to use.

The post title comes from “Frankly, Mr. Shankly” by The Smiths which appears on The Queen is Dead. If the choice of post title needs an explanation, it wasn’t a good choice…

## Paintball’s Coming Home: generating Damien Hirst spot paintings

A few days ago, I read an article about Damien Hirst’s new spot paintings. I’d forgotten how regular the spots were in the original spot paintings from the 1990s (examples are on his page here). It made me think that these paintings could be randomly generated and so I wrote a quick piece of code to do this (HirstGenerator).

I used Hirst’s painting ‘Abalone Acetone Powder’ (1991), which is shown on this page as photographed by Alex Hartley. A wrote some code to sample the colours of this image and then a script to replicate it. The original is shown below  © Damien Hirst and Science Ltd. Click them for full size.

and then this is the replica:

Now that I had a palette of the colours used in the original. It was simple to write a generator to make spot paintings where the spots are randomly assigned.

The generator can make canvasses at whatever size is required.

The code can be repurposed to make spot paintings with different palettes from his other spot paintings or from something else. So there you have it. Generative Hirst Spot Paintings.

For nerds only

My original idea was to generate a palette of unique colours from the original painting. Because of the way I sampled them, each spot is represented once in the palette. This means the same colour as used by the artist is represented as several very similar but nonidentical colours in the palette. My original plan was to find the euclidean distances between all spots in RGB colour space and to establish a distance cutoff to decide what is a unique colour.

That part was easy to write but what value to give for the cutoff was tricky. After some reading, it seems that other colour spaces are better suited for this task, e.g. converting RGB to a CIE colour space. For two reasons, I didn’t pursue this. First, quantixed coding is time-limited. Second. assuming that there is something to the composition of these spot paintings (and they are not a con trick) the frequency of spots must have artistic merit and so they should be left in the palette for sampling in the generated pictures. The representation of the palette in RGB colour space had an interesting pattern (shown in the GIF above).

The post title comes from “Paintball’s Coming Home” by Half Man Half Biscuit from Voyage To The Bottom Of The Road. Spot paintings are kind of paintballs, but mostly because I love the title of this song.

## Measured Steps: Garmin step adjustment algorithm

I recently got a new GPS running watch, a Garmin Fēnix 5. As well as tracking runs, cycling and swimming, it does “activity tracking” – number of steps taken in a day, sleep, and so on. The step goals are set to move automatically and I wondered how it worked. With a quick number crunch, the algorithm revealed itself. Read on if you are interested how it works.

The watch started out with a step target of 7500 steps in one day. I missed this by 2801 and the target got reduced by 560 to 6940 for the next day. That day I managed 12480, i.e. 5540 over the target. So the target went up by 560 to 7500. With me so far? Good. So next I went over the target and it went up again (but this time by 590 steps). I missed that target by a lot and the target was reduced by 530 steps. This told me that I’d need to collect a bit more data to figure out how the goal is set. Here are the first few days to help you see the problem.

 Actual steps Goal Deficit/Surplus Adjustment for Tomorrow 4699 7500 -2801 -560 12480 6940 5540 560 10417 7500 2917 590 2726 8090 -5364 -530 6451 7560 -1109 -220 8843 7340 1503 150 8984 7490 1494 300 9216 7790 1426 290

The data is available for download as a csv via the Garmin Connect website. After waiting to accumulate some more data, I plotted out the adjustment vs step deficit/surplus. The pattern was pretty clear.

There are two slopes here that pass through the origin. It doesn’t matter what the target was, the adjustment applied is scaled according to how close to the target I was, i.e. the step deficit or surplus. There was either a small (0.1) or large (0.2) scaling used to adjust the step target for the next day, but how did the watch decide which scale to use?

The answer was to look back at the previous day’s activity as well as the current day.

So if today you exceeded the target and you also exceeded the target yesterday then you get a small scale increase. Likewise if you fell short today and yesterday, you get a small scale decrease. However, if you’ve exceeded today but fell short yesterday, your target goes up by the big scaling. Falling short after exceeding yesterday is rewarded with a big scale decrease. The actual size of the decrease depends on the deficit or surplus on that day. The above plot is coloured according to the four possibilities described here.

I guess there is a logic to this. The goal could quickly get unreachable if it increased by 20% on a run of two days exceeding the target, and conversely, too easy if the decreases went down rapidly with consecutive inactivity. It’s only when there’s been a swing in activity that the goal should get moved by the large scaling. Otherwise, 10% in the direction of attainment is fine.

I have no idea if this is the algorithm used across all of Garmin’s watches or if other watch manufacturer’s use different target-setting algorithms.

The post title comes from “Measured Steps” by Edsel from their Techniques of Speed Hypnosis album.

## Inspiration Information: some book recommendations for kids

As with children’s toys and clothes, books aimed at children tend to be targeted in a gender-stereotyped way. This is a bit depressing. While books about princesses can be inspirational to young girls – if the protagonist decides to give it all up and have a career as a medic instead (the plot to Zog by Julia Donaldson) – mostly they are not. How about injecting some real inspiration into reading matter for kids?

Here are a few recommendations. This is not a survey of the entire market, just a few books that I’ve come across that have been road-tested and received a mini-thumbs up from little people I know.

Little People Big Dreams: Marie Curie by Isabel Sanchez Vegara & Frau Isa

This is a wonderfully illustrated book that tells the story of Marie Curie. From a young girl growing up in Poland, overcoming gender restrictions to go and study in France and subsequently winning two Nobel Prizes and being a war hero! The front part of the book is written in simple language that kids can read while the last few pages are (I guess) for an adult to read aloud to the child, or for older children to read for themselves.

This book is part of a series which features inspirational women: Ada Lovelace, Rosa Parks, Emmeline Pankhurst, Amelia Earhart. What is nice is that the series also has books on women from creative fields Coco Chanel, Audrey Hepburn, Frida Kahlo, Ella Fitzgerald. Often non-fiction books for kids are centred on science/tech/human rights which is great but, let’s face it, not all kids will engage with these topics. The bigger message here is to show young people that little people with big dreams can change the world.

Ada Twist, Scientist by Andrea Beaty & David Roberts

A story about a young scientist who keeps on asking questions. The moral of the story is that there is nothing wrong with asking “why?”. The artwork is gorgeous and there are plenty of things to spot and look at on each page. The mystery of the book is not exactly solved either so there’s fun to be had discussing this as well as reading the book straight. Ada Marie Twist is named after Ada Lovelace and Marie Curie, two female giants of science.

This book is highly recommended. It’s fun and crammed full with positivity.

Rosie Revere, Engineer by Andrea Beaty & David Roberts

By the same author and illustrator, ‘Rosie Revere…’ tells the story of a young inventor. She overcomes ridicule when she is taken under the wing of her great aunt who is an inspirational engineer. Her great aunt Rose is I think supposed to be Rosie the Riveter, be-headscarfed feminist icon from WWII. A wonderful touch.

Rosie is a classmate of Ada Twist (see above) and there is another book featuring a young (male) architect which we have not yet road-tested. Rather than recruitment propaganda for Engineering degrees, the broader message of ‘Rosie Revere…’ is that persevering with your ideas and interests is a good thing, i.e. never give up.

Good Night Stories for Rebel Girls by Elena Favilli & Francesca Cavallo
A wonderful book that gives brief biographies of inspiring women. Each two page spread has some text and an illustration of the rebel girl to inspire young readers. The book has a This book belongs to… page at the beginning, but in a move of pure genius, the book has two final pages for the owner of the book to write their own story. Just like the women featured in the book, the owner to the book can have their own one page story and draw their own self-portrait.
This book is highly recommended.
EDIT: this book was added to the list on 2018-02-26

Who was Charles Darwin? by Deborah Hopkinson & Nancy Harrison

This is a non-fiction book covering Darwin’s life from school days through the Beagle adventures and on to old age. It’s a book for children although compared to the books above, this is quite a dry biography with a few black-and-white illustrations. This says more about how well the books above are illustrated rather than anything particularly bad about “Who Was Charles Darwin?”. Making historical or biographical texts appealing to kids is a tough gig.

The text is somewhat inspirational – Darwin’s great achievements were made despite personal problems – but there is a disconnect between the life of a historical figure like Darwin and the children of today.

For older people

Quantum Mechanics by Jim Al-Khalili

Aimed at older children and adults, this book explains the basics behind the big concept of “Quantum Mechanics”. These Ladybird Expert books have a retro appeal, being similar to the original Ladybird books published over forty years ago. Jim Al-Khalili is a great science communicator and any young people (or adults) who have engaged with his TV work will enjoy this short format book.

Evolution by Steve Jones

This is another book in the Ladybird Expert series (there is one further book, on “Climate Change”). The brief here is the same: a short format explainer of a big concept, this time “Evolution”. The target audience is the same. It is too dry for young children but perfect for teens and for adults. Steve Jones is an engaging writer and this book doesn’t disappoint, although the format is limited to one-page large text vignettes on evolution with an illustration on the facing page.

It’s a gateway to further reading on the topic and there’s a nice list of resources at the end.

Computing for Kids

After posting this, I realised that we have lots of other children’s science and tech books that I could have included. The best of the rest is this “lift-the-flap” book on Computers and Coding published by Usborne. It’s a great book that introduces computing concepts in a fun gender-free way. It can inspire kids to get into programming perhaps making a step up from Scratch Jr or some other platform that they use at school.

I haven’t included any links to buy these books. Of course, they’re only a google search away. If you like the sound of any, why not drop in to your local independent bookshop and support them by buying a copy there.

The post title comes from the title track of the “Inspiration Information” LP by Shuggie Otis. The version I have is the re-release with  ‘Strawberry Letter 23’ on it from ‘Freedom Flight’ – probably his best known track – as well as a host of other great tunes. Highly underrated, check it out. There’s another recommendation for you.

## The Sound of Clouds: wordcloud of tweets using R

Another post using R and looking at Twitter data.

As I was typing out a tweet, I had the feeling that my vocabulary is a bit limited. Papers I tweet about are either “great”, “awesome” or “interesting”. I wondered what my most frequently tweeted words are.

Like the last post you can (probably) do what I’ll describe online somewhere, but why would you want to do that when you can DIY in R?

First, I requested my tweets from Twitter. I wasn’t sure of the limits of rtweet for retrieving tweets and the request only takes a few minutes. This gives you a download of everything including a csv of all your tweets. The text of those tweets is in a column called ‘text’.

```
## for text mining
library(tm)
## for building a corpus
library(SnowballC)
## for making wordclouds
library(wordcloud)
tweets <- read.csv('tweets.csv', stringsAsFactors = FALSE)
## make a corpus of the text of the tweets
tCorpus <- Corpus(VectorSource(tweets\$text))
## remove all the punctation from tweets
tCorpus <- tm_map(tCorpus, removePunctuation)
## good idea to remove stopwords: high frequency words such as I, me and so on
tCorpus <- tm_map(tCorpus, removeWords, stopwords('english'))
## next step is to stem the words. Means that talking and talked become talk
tCorpus <- tm_map(tCorpus, stemDocument)
wordcloud(tCorpus, max.words = 100, random.order = FALSE)

```

For my @clathrin account this gave:

So my most tweeted word is paper, followed by cell and lab. I’m quite happy about that. I noticed that great is also high frequency, which I had a feeling would also be the case. It looks like @christlet, @davidsbristol, @jwoodgett and @cshperspectives are among my frequent twitterings, this is probably a function of the length of time we’ve been using twitter. The cloud was generated using 10.9K tweets over seven years, it might be interesting to look at any changes over this time…

The cloud is a bit rough and ready. Further filtering would be a good idea, but this quick exercise just took a few minutes.

The post title comes from “The Sound of Clouds” by The Posies from their Solid States LP.

## I’m not following you: Twitter data and R

I wondered how many of the people that I follow on Twitter do not follow me back. A quick way to look at this is with R. OK, a really quick way is to give a 3rd party application access rights to your account to do this for you, but a) that isn’t safe, b) you can’t look at anyone else’s data, and c) this is quantixed – doing nerdy stuff like this is what I do. Now, the great thing about R is the availability of well-written packages to do useful stuff. I quickly found two packages twitteR and rtweet that are designed to harvest Twitter data. I went with rtweet and there were some great guides to setting up OAuth and getting going.

The code below set up my environment and pulled down lists of my followers and my “friends”. I’m looking at my main account and not the quantixed twitter account.

```
library(rtweet)
library(httpuv)
## setup your appname,api key and api secret
appname <- "whatever_name"
key <- "blah614h"
secret <- "blah614h"
app = appname,
consumer_key = key,
consumer_secret = secret)

clathrin_followers <- get_followers("clathrin", n = "all")
clathrin_followers_names <- lookup_users(clathrin_followers)
clathrin_friends <- get_friends("clathrin")
clathrin_friends_names <- lookup_users(clathrin_friends)

```

The terminology is that people that follow me are called Followers and people that I follow are called Friends. These are the terms used by Twitter’s API. I have almost 3000 followers and around 1200 friends.

This was a bit strange… I had fewer followers with data than actual followers. Same for friends: missing a few hundred in total. I extracted a list of the Twitter IDs that had no data and tried a few other ways to look them up. All failed. I assume that these are users who have deleted their account (and the Twitter ID stays reserved) or maybe they are suspended for some reason. Very strange.

```
## noticed something weird
## look at the twitter ids of followers and friends with no data
missing_followers <- setdiff(clathrin_followers\$user_id,clathrin_followers_names\$user_id)
missing_friends <- setdiff(clathrin_friends\$user_id,clathrin_friends_names\$user_id)

## find how many real followers/friends are in each set
aub <- union(clathrin_followers_names\$user_id,clathrin_friends_names\$user_id)
anb <- intersect(clathrin_followers_names\$user_id,clathrin_friends_names\$user_id)

## make an Euler plot to look at overlap
fit <- euler(c(
"Followers" = nrow(clathrin_followers_names) - length(anb),
"Friends" = nrow(clathrin_friends_names) - length(anb),
"Followers&amp;Friends" = length(anb)))
plot(fit)
plot(fit)

```

In the code above, I arranged in sets the “real Twitter users” who follow me or I follow them. There was an overlap of 882 users, leaving 288 Friends who don’t follow me back – boo hoo!

I next wanted to see who these people are, which is pretty straightforward.

```
## who are the people I follow who don't follow me back
bonly <- setdiff(clathrin_friends_names\$user_id,anb)
no_follow_back <- lookup_users(bonly)

```

Looking at no_follow_back was interesting. There are a bunch of announcement accounts and people with huge follower counts that I wasn’t surprised do not follow me back. There are a few people on the list with whom I have interacted yet they don’t follow me, which is a bit odd. I guess they could have unfollowed me at some point in the past, but my guess is they were never following me in the first place. It used to be the case that you could only see tweets from people you followed, but the boundaries have blurred a lot in recent years. An intermediary only has to retweet something you have written for someone else to see it and you can then interact, without actually following each other. In fact, my own Twitter experience is mainly through lists, rather than my actual timeline. And to look at tweets in a list you don’t need to follow anyone on there. All of this led me to thinking: maybe other people (who follow me) are wondering why I don’t follow them back… I should look at what I am missing out on.

```## who are the people who follow me but I don't follow back
aonly <- setdiff(clathrin_followers_names\$user_id,anb)
no_friend_back <- lookup_users(aonly)
## save csvs with all user data for unreciprocated follows
write.csv(no_follow_back, file = "nfb.csv")
write.csv(no_friend_back, file = "nfb2.csv")

```

With this last bit of code, I was able to save a file for each subset of unreciprocated follows/friends. Again there were some interesting people on this list. I must’ve missed them following me and didn’t follow back.

I used these lists to prune my friends and to follow some interesting new people. The csv files contain the Twitter bio of all the accounts so it’s quick to go through and check who is who and who is worth following. Obviously you can search all of this content for keywords and things you are interested in.

So there you have it. This is my first “all R” post on quantixed – hope you liked it!

The post title is from “I’m Not Following You” the final track from the 1997 LP of the same name from Edwyn Collins.