One of several features that Strava put behind a paywall was the ability to compare performance on similar courses. I miss this comparison tool and wondered how hard it would be to code my own.
This post is a walkthrough of how I approached the problem. The code is available here. It uses the trackeR library in R to convert the GPX tracks to a huge dataframe. This is then processed by IgorPro.
Step 1: data wrangling
I maintain a database of tracks using RubiTrack. I can export the tracks in GPX format and I can select the tracks to export based on date, country, and so on. The export results in one large GPX file containing all selected tracks. A short script in R processes these and calculates things like point-to-point distance. The resulting dataframe is saved as a CSV file, which is then read into Igor. Igor breaks the dataframe into separate tracks and records total distance, pace, speed etc. for each track. With that wrangling done, it’s on to the interesting stuff.
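To give a flavour of the point-to-point distance calculation, here is a minimal sketch. It is Python for illustration rather than the R and Igor code itself, and it assumes a track is simply a list of (lat, lon) pairs in degrees. The function names and the haversine formula with a mean Earth radius of 6371 km are my assumptions for this sketch, not necessarily what trackeR does internally.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius (assumption of this sketch)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def cumulative_distance_km(points):
    """Running distance along a track given as [(lat, lon), ...]."""
    total = [0.0]
    for (lat1, lon1), (lat2, lon2) in zip(points, points[1:]):
        total.append(total[-1] + haversine_km(lat1, lon1, lat2, lon2))
    return total
```

The last value of the cumulative list is the total distance of the track, which is what pace and speed are then calculated from.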
Step 2: finding similar tracks
This is the heart of the problem. I often wondered how Strava did this so efficiently… Here is my solution.
I was working with ~700 tracks, so I needed an efficient way to compare them. The major problem is track length: the shortest tracks were 4 km and the longest 42 km. How can we compare them? On top of that, differences in speed and sampling mean that the number of points differs even between tracks of equal length. To get around this, I generated uniform tracks, each consisting of the same number of points. 64 points was enough to describe a course well. This is easily done by projecting all lat/long coords to flat earth XY coords, then resampling the x and y coordinates separately against cumulative distance along the track.
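The resampling idea can be sketched like this. Again, this is a Python illustration under my own assumptions rather than the Igor code: the projection is a simple equirectangular one around a shared origin (`lat0`, `lon0`, which would need to be the same for every track so that tracks stay comparable), distances are in km, and the function names are made up for this example. Each track needs at least two points.

```python
import math


def to_flat_xy(points, lat0, lon0, radius_km=6371.0):
    """Project [(lat, lon), ...] in degrees to flat x, y in km around an origin."""
    cos0 = math.cos(math.radians(lat0))
    return [(radius_km * math.radians(lon - lon0) * cos0,
             radius_km * math.radians(lat - lat0))
            for lat, lon in points]


def resample_uniform(xy, n=64):
    """Return n points evenly spaced by cumulative distance along the track."""
    # Cumulative distance at each original point.
    cum = [0.0]
    for (x1, y1), (x2, y2) in zip(xy, xy[1:]):
        cum.append(cum[-1] + math.hypot(x2 - x1, y2 - y1))
    total = cum[-1]

    out = []
    j = 0  # index of the segment that contains the current target distance
    for i in range(n):
        target = total * i / (n - 1)
        while j < len(cum) - 2 and cum[j + 1] < target:
            j += 1
        seg = cum[j + 1] - cum[j]
        t = 0.0 if seg == 0 else (target - cum[j]) / seg
        t = min(max(t, 0.0), 1.0)  # guard against rounding at the ends
        # Interpolate x and y separately, both against distance.
        out.append((xy[j][0] + t * (xy[j + 1][0] - xy[j][0]),
                    xy[j][1] + t * (xy[j + 1][1] - xy[j][1])))
    return out
```

After this step a 4 km track and a 42 km track are both described by 64 (x, y) pairs, so they can be compared directly.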
Step 3: comparing tracks
The next problem is the comparison itself. We compare each uniform track to all other uniform tracks by finding the average distance between them, point-for-point. That is, we take the distance from the 1st point of one track to the 1st point of the other, the 2nd to the 2nd, and so on, then sum these and divide by 64. Doing this for all pairs of tracks gives us a dis-similarity matrix.
When comparing two tracks that are in different cities, the values are large since every pair of corresponding points is very far apart. Two tracks that start and end at the same place have a smaller value, and those that follow exactly the same course have a value approaching zero.
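As a minimal sketch of the point-for-point comparison, assuming each uniform track is a list of (x, y) tuples of equal length (the function names are mine, and this is Python for illustration rather than the Igor code):

```python
import math


def mean_point_distance(track_a, track_b):
    """Average distance between corresponding points of two uniform tracks."""
    if len(track_a) != len(track_b):
        raise ValueError("tracks must have the same number of points")
    return sum(math.dist(p, q) for p, q in zip(track_a, track_b)) / len(track_a)


def dissimilarity_matrix(tracks):
    """Symmetric matrix of mean point-for-point distances for all pairs."""
    n = len(tracks)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d[i][j] = d[j][i] = mean_point_distance(tracks[i], tracks[j])
    return d
```

The matrix is symmetric with zeros on the diagonal, so only the upper triangle needs to be computed. Note that the measure depends on direction: a track compared with its own reverse does not score zero, because the 1st point is matched to the last and so on.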
From here we can generate a dendrogram to look at the distances between tracks and find the closest pairings.
Clusters can be determined using a threshold on the y-axis and setting a lower limit for the size of the clusters. I could not find a good automated method for determining the threshold. Examining neighbouring tracks revealed that 0.02 was a good cutoff. This gave the coloured clusters shown below the dendrogram.
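Cutting at a threshold and discarding small clusters can be sketched as follows. Two loud assumptions here: the sketch uses single linkage (two tracks end up in the same cluster if a chain of pairs, each within the threshold, connects them), which may not match the linkage behind the dendrogram above, and `min_size=2` is a placeholder value for the lower limit. Only the 0.02 cutoff comes from the analysis itself. The input is a square dis-similarity matrix as a list of lists.

```python
def threshold_clusters(dist, threshold=0.02, min_size=2):
    """Group track indices whose single-linkage distance is within threshold."""
    n = len(dist)
    parent = list(range(n))  # union-find forest, one node per track

    def find(i):
        # Follow parents to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Join every pair of tracks that is closer than the threshold.
    for i in range(n):
        for j in range(i + 1, n):
            if dist[i][j] <= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)

    # Keep only clusters that meet the lower size limit.
    kept = [g for g in groups.values() if len(g) >= min_size]
    return sorted(kept, key=lambda g: g[0])
```

Tracks that end up alone, or in a group smaller than `min_size`, are simply left unclustered.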
Because the comparison is point-for-point, my method meant that clockwise and anti-clockwise tracks on the same course were clustered separately, and that outward tracks were clustered distinctly from the reverse route.
From here it was a matter of displaying the course and the summary statistics over time, to get a Strava style plot.
The sensitivity was such that tracks that differed only by slight diversions were still clustered together.
The course summaries have been useful for looking at my pace over time. It would be possible to map other variables onto these plots, such as time of day or time of year, to enable a clearer pace comparison.
Unexpectedly, this analysis revealed several local courses that I have not run for 18 months or more. So I now have inspiration for the next few runs.
I’m happy with how this turned out. I’m less happy with my pace over the last year as I’ve recovered from an injury, but that’s a whole other story.
The title of this post comes from “Cluster One” from Pink Floyd’s Division Bell LP.