Better Race Predictions with AI
It won’t stop you from bonking, but it’ll help your friends waste the right amount of time waiting at aid stations
Back in 2020, during a lockdown and while recovering from surgery, I published two articles on analyzing activity data in Python — scripts to visualize training efforts and extract information about pace, distance, or elevation from GPS data. I remember spending hours googling how to parse GPS coordinates, debugging the code one line at a time until I could finally see a plot that was essentially a less visually pleasing version of the usual run maps you can see in the running app of your choice. Duh! Golden indeed were the days when it was entirely possible to spend the better part of a day figuring out why your arrays are all in the wrong shape and why matplotlib won’t space things out the way you’d like. The ultimate irony is that all of this was supposed to be mere prep work for a personalized race time predictor I wanted to build. Alas, torn between a) feeling uninspired about the lowest-lift version of such an algorithm (fit a curve to your grade-adjusted pace and apply that to your upcoming race) and b) the daunting task of building something that I could truly take pride in, using all the data I’ve been recording in Strava over the years, I procrastinated just enough to forget about it entirely.
Until now! If you’re learning how to code in the year of the Lord 2025, the preceding paragraph probably strikes you as a curious relic from a bygone era. It may even sound bizarre that these were actual problems that programmers had to deal with just five years ago. Nowadays, all it takes is a simple (and often poorly specified) plain-English prompt like “make nice elevation plot for this gpx file” and within minutes, you’d have working code that is more cleanly written and produces nicer outputs than what hours of high cognitive load work got you before. The code won’t be perfect, mind you, but it runs and does more or less what you asked for — it gives you a basis from which to iterate.
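To give a flavor of what that looks like in practice, here is roughly the kind of script such a prompt produces. This is not the code I actually received, just a representative sketch that assumes a local file called race.gpx plus the gpxpy and matplotlib packages:

```python
# Rough sketch of an LLM-generated elevation plot (illustrative, not the
# exact output I received). Assumes: pip install gpxpy matplotlib
import gpxpy
import matplotlib.pyplot as plt

with open("race.gpx") as f:  # hypothetical file name
    gpx = gpxpy.parse(f)

# Flatten all track points into one list
points = [p for trk in gpx.tracks for seg in trk.segments for p in seg.points]

# Cumulative distance (km) and elevation (m) along the track
dist_km, elev_m, total_m = [], [], 0.0
for prev, curr in zip(points, points[1:]):
    total_m += curr.distance_2d(prev)  # metres between consecutive points
    dist_km.append(total_m / 1000)
    elev_m.append(curr.elevation)

plt.fill_between(dist_km, elev_m, alpha=0.4)
plt.plot(dist_km, elev_m)
plt.xlabel("Distance (km)")
plt.ylabel("Elevation (m)")
plt.title("Course elevation profile")
plt.show()
```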
It’s not hard to see that these developments would revolutionize the way we write code and develop software. And on a much smaller scale, they convinced me that it was time I gave that race time predictor another shot, this time with LLMs doing most of the heavy lifting. Besides providing me with answers to questions about my own race prep — When should I expect to get to that checkpoint where my family is waiting for me? Do I need to refill all of my bottles at this aid station, or can I go lighter and refill at the next? How do I decide whether a given performance was exceptional by my own standards? — I was also very curious whether LLMs could in fact create something that was both useful and user-friendly. Time to find out.
Since this was my first time doing an entire LLM-assisted coding project, I’ll begin with a few observations of what this actually looks like in practice, interlaced with some musings about what this may mean for the future of coding.
Building with LLMs: Can it live up to the hype?
Let’s begin with a simple acknowledgement of fact: These little chatbots are nothing short of stunning. Back in 2018 (when I got into data science and machine learning), smart people spent months building neural networks for single tasks — tumor classification, sentiment analysis, whatever — and celebrated when accuracy improved from 83% to 84%. We implemented (and trained) models from scratch, worked hard to understand the linear algebra behind backpropagation, and companies would hire anyone with STEM credentials who knew how to implement a basic k-means clustering algorithm. You’d even pay people on Amazon Mechanical Turk to label data by hand so you could train your models.
Today, you type “build me a chess app” and get a working prototype almost instantly, so you can spend more time ensuring that the pawns look like your least favorite classmates from high school. Contra Arthur C. Clarke, we’re lightning fast when it comes to normalizing the extraordinary, with any sufficiently advanced technology remaining indistinguishable from magic for only the briefest moment in time.
This wasn’t the first time I used LLMs to write code — in fact, they’ve been quite helpful for quickly generating data visualizations that would have taken me forever on my own. But despite being faster than working alone, it was still a rather frustrating enterprise, as the code often wouldn’t compile. That’s no longer true: even for someone who isn’t following the performance benchmarks of all the LLMs out there very closely, it was obvious that they’ve gotten much better over a timespan as short as half a year.
So I decided to take things to a completely new level — every step, from brainstorming the idea to (iteratively) building a prediction model and wrapping it all up in a nice, shiny app, was done with the help of an LLM. Even though the result you’ll see here took many more days to complete, it was rather fascinating to see that the LLM essentially provided me with a working app four messages into our conversation (the first three messages were just clarifications related to my prompt). It understood what I was asking for and built something that worked and looked like a pretty good approximation of what I had in mind.
For someone who learned programming the hard way, this was both revelatory and shocking. It finally made me understand (to a degree, anyway) those math teachers who insisted we learn long division by hand, despite their spectacularly wrong prediction that we wouldn’t always have a calculator in our pocket. They were anxious about something real: If you don’t understand the underlying process, how can you tell when the output is nonsense?
The programmers lamenting the rise of vibe coding today are singing the same tune, and sure as hell they’ve got a point. But, like it or not, there’s no turning back, no stopping the tide. The vast majority of code will soon be written by AI, and hopefully by then we’ll have devised systems that ensure we aren’t producing massive vulnerabilities. We’ll need to develop new instincts for catching the LLM’s failure modes, just as we developed instincts for spotting our own off-by-one errors and null pointer exceptions. Programming jobs won’t disappear, but they’ll require different skills: less syntax memorization, more systems thinking and skeptical verification. And let’s be honest — I’ve worked with a lot of programmers, and while some of them were genuinely impressive, many (including the author of these lines) often write pretty awful code. Did we really know what was happening in the libraries we happily imported, or — to ask the question is to answer it — what running a script would do at the circuit level? As science progresses, we’ll have to develop ever-higher powers of abstraction, and AI-generated code might be just another floor in that edifice.
So what did I observe, specifically, as I vibe-coded my way towards a race time predictor?
- AI-assisted coding makes you learn much faster. If you’re anything like me, nothing beats specific examples, and a piece of code that works is the ultimate case study. I had never before built an actual app with an interface, and I probably would’ve spent a lot of time trying to figure out the best starting point, which framework/library to use (I’m sure there’s lots of heated debate about this on the internet), and then painstakingly pieced it together, googling items one at a time (“How do I add a sidebar to this?”). Instead, an LLM just served me the entire thing on a silver platter, and next time I want to do something similar, it will be quick to find all the relevant Streamlit syntax. I’ve also never really worked directly with API calls, and again, having a concrete example of how this can be done makes it all seem less daunting than before.
- You don’t need to know how to code, but…: It’s technically true that you don’t have to know anything about programming to produce an artifact of your choice using LLMs. And there may be many use cases where this would do the job for you, even if you had absolutely no idea what was happening under the hood. That said, I felt it was a big advantage to have at least some familiarity with a programming language, because being able to read the generated code helps you spot the sometimes comical errors that AIs produce. It’s very tempting — I certainly felt this way — to simply cross your fingers and hope that your AI-generated code does the right thing. What I found instead is that AI is very good at creating something that looks very reasonable and logical, which makes it all the harder to spot the instances where it goes completely off the rails.
- The 80/20 rule is alive and well. It’s super easy to build a prototype. Naturally, I figured I’d need maybe a day or two to polish it up for release, write a short article here to explain what I’d done and bam, done! Two weeks later, I was still debugging why it gave me super reasonable estimates for one type of race and then suggested world-record-shattering arrival times for another. Hardly any of these extra days were spent building new features; instead, I tried to get the prediction logic right — fixing edge cases where rest times compounded exponentially, where altitude adjustments went negative, where the Monte Carlo simulation always predicted I’d win Western States. The gulf between “looks right” (AIs are devilishly good at this) and “is right” can be vast!
- The choice of LLM matters: I started this project working with ChatGPT 5 in “Thinking” mode, but later also tried Gemini 2.5 Pro for a bit before mostly switching to Claude Opus 4.1. I’m lucky enough to have premium subscriptions for all of them through my work and can therefore test their relative strengths, but this likely isn’t the case for everyone, and it’s probably also not necessary. (I do, however, recommend having at least a free version of a different model in addition to your “main” LLM — sometimes you just want a quick answer to a small problem, but if you ask this in your main chat, the AI will regenerate thousands of lines of code.) Overall, Claude was the most enjoyable tool of the three, while Gemini was an utter disaster (surprisingly, as it’s supposed to be one of the leading models out there). It’s hard to say whether this is due to the inherent capabilities of the competitor models, my prompts, the mode I chose or the specific tasks I asked them to do (e.g. I used Gemini only for refactoring). Nevertheless, a few things stood out:
ChatGPT 5 was pretty good at prototyping and making suggestions, and (very helpfully) it created an initial project in a zip folder, which made it very easy to set up. However, it would then regenerate the entire zip folder each time I asked for changes, until I told it to just tell me what changes to make in a specific file. That worked so-so — it was often vague about where to make changes, would suggest odd things like using imports inside a function instead of at the top of the file, and had a bit of a tendency to just add more stuff instead of focusing on the big picture. The code it wrote was also super dense and really hard to parse — maybe that’s me not being explicit with my prompts, or simply an attempt to avoid premature optimization — but it did suck to read.
Gemini 2.5 was a wildcard. I used it to suggest improvements for my initial code, and it came back with some pretty solid ideas for refactoring it. However, it then also just dropped lots and lots of code, and whenever I pointed out that it was missing entire sections, it would confirm my observation and continue overlooking them. I gave up pretty quickly, as this was rather frustrating.
Claude Opus finally sorted out the refactoring for me, and because that worked so well, I also used it to improve the interface and the prediction algorithm. I really like the fact that it provides all files in a “tab” format next to the chat plus some basic version control, so you can see exactly what changed. But while it was most enjoyable to work with, it did screw up big time on a number of occasions, such as by adding the last aid station’s predicted rest time to the next one (and so on until the last).
The app
Now that you’ve indulged me, let’s waste no more time and get right to the heart of the matter. Here’s the GitHub repo:
The app essentially does three things:
1. It pulls data from your Strava runs and repackages it into nice little “grade bins” — essentially making a prediction for how fast you typically run at a given grade (and altitude). This can be represented by a pace curve that looks as follows:
2. It visualizes the different legs of the course you’re about to run, so that you’ll know exactly what to expect before you hit the next aid station:
3. It combines your own data with the profile of the upcoming race to predict when you’re expected to hit each aid station, including confidence intervals (or more precisely, the 10th and 90th percentile):
Here’s a little video to demonstrate what the app looks like in practice:
For details on how the app creates these personalized, race-specific estimates, read on.
Inside the belly of the beast: How the algorithm predicts arrival times
Although the devil is certainly in the details, the basic idea behind predicting ETAs is not that hard to understand. I also think it’s actually fairly clever (I’m allowed to say this since I didn’t come up with it myself) — make judicious use of available data, but don’t overengineer it, and be realistic about the limitations of any model trying to forecast the performance of a particular human being on a particular day. I’ll explain the key points below; for more details, I invite you to take a look at the GitHub repo.
What data it uses
- Your Strava race history (excluding training runs): the time-series streams for distance, speed, grade, and elevation (see the sketch after this list for one way such streams can be pulled from the Strava API)
- The GPX of the course you’re about to run
- Your aid-station list (cumulative distances)
- A race conditions slider to adjust outputs based on race-day factors such as form, fatigue, weather or ruggedness of the terrain
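As an aside, and purely for illustration rather than as a description of how the app itself does it, pulling these streams from the Strava API v3 looks roughly like this (ACCESS_TOKEN is a placeholder you would obtain through Strava’s OAuth flow):

```python
# Hedged sketch of pulling race streams from the Strava API v3 -- not
# necessarily how the app does it. ACCESS_TOKEN is a placeholder obtained
# via Strava's OAuth flow. Requires: pip install requests
import requests

API = "https://www.strava.com/api/v3"
HEADERS = {"Authorization": "Bearer ACCESS_TOKEN"}

# 1) List recent activities and keep runs flagged as races
#    (for runs, Strava marks a race with workout_type == 1).
activities = requests.get(
    f"{API}/athlete/activities", headers=HEADERS, params={"per_page": 100}
).json()
races = [a for a in activities
         if a["type"] == "Run" and a.get("workout_type") == 1]

# 2) For each race, fetch the time-series streams the model needs
for race in races:
    streams = requests.get(
        f"{API}/activities/{race['id']}/streams",
        headers=HEADERS,
        params={"keys": "distance,velocity_smooth,grade_smooth,altitude",
                "key_by_type": "true"},
    ).json()
    distance = streams["distance"]["data"]      # metres from start
    speed = streams["velocity_smooth"]["data"]  # metres per second
    grade = streams["grade_smooth"]["data"]     # percent
    altitude = streams["altitude"]["data"]      # metres above sea level
```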
What it learns from your history
- Your personal speed vs. slope curve (at sea level): For every grade bin (steep down, moderate up, etc.), it calculates how fast you typically run that kind of terrain during races. Speeds recorded at high altitude are then adjusted to their sea-level equivalent. Recent races get a bit more weight, and the model also tracks per-grade variability that will later be used for the uncertainty bands.
- Your stamina with distance (the so-called Riegel exponent). The general idea behind this exponent is that your pace degrades the further into a race you get. The paces per grade bin above are averages; for a typical trail race you’ll be faster in the beginning and slower towards the end, sometimes significantly so.
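To make the Riegel idea concrete, here is a minimal sketch with invented numbers (illustrative only, not the app’s actual fitting code): the exponent comes from a log-log fit of finish time against distance, and can then be used to extrapolate a known result to a new distance.

```python
# Estimating a personal Riegel exponent from past races.
# All distances and times below are made up for illustration.
import numpy as np

# (distance_km, finish_time_h) from previous races
races = [(21.1, 2.2), (42.2, 4.6), (50.0, 5.6), (80.0, 9.5), (100.0, 12.1)]

log_d = np.log([d for d, _ in races])
log_t = np.log([t for _, t in races])

# Riegel's model: T = a * D**b, so log T = log a + b * log D
b, log_a = np.polyfit(log_d, log_t, 1)
print(f"Personal Riegel exponent: {b:.2f}")  # commonly cited around 1.06 for road running

# Extrapolate a known result to a new distance: T2 = T1 * (D2 / D1)**b
t_100k_h = 5.6 * (100.0 / 50.0) ** b
print(f"Implied 100 km time from a 5.6 h 50 km: {t_100k_h:.1f} h")
```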
How it predicts your next race
- Parse and process the course. The app reads the gpx file you provide and applies some smoothing. It then calculates the grades and bins them so the slope curve from above can be used. It also splits the course into legs at each aid station, since the goal is to predict arrival times for each of these.
- Calculate base time. Beginning with our sea-level normalized grade bins, we apply an altitude penalty for the course under consideration — a high alpine course should be slower than your backyard race, even holding distance and elevation gain constant. We can then sum up all of these bins to get a first baseline estimate based solely on the terrain (a simplified sketch of this step follows the list).
- Apply various adjustments for realistic estimates: For races that are similar to what you usually run (I do a lot of 50k’s, for example), not much more is needed for a good prediction. However, especially if you’re looking at races that are substantially longer, we’ll need to tweak the baseline estimates a bit. We do this in three different ways:
(a) By using the Riegel exponent that’s appropriate for this specific race (think of this as your personal endurance fingerprint) to factor in how your pace degrades with distance.
(b) By applying a “distance scaling” factor to further slow you down on very long races. The idea here is that the pace curve we calculated earlier likely places more weight on shorter races, but your 10% uphill pace for a 50k isn’t the same for a 100 miler, especially towards the end of it. There’s some conceptual overlap with the Riegel exponent here, but the main point is that even if you ran your race with perfect consistency (neither slowing down nor speeding up), you still couldn’t run your usual 50k pace on a race that’s more than 3x longer.
(c) By adding aid station and rest/sleep breaks. While you may buzz through the checkpoints during a quick race, in ultra territory you are definitely going to take time to refuel, sit down or even nap a bit. Since the pace curve was derived from your “while moving” speed (my Strava autopauses when I don’t move), we need to add rest back in to get realistic arrival times. Right now this is done in a bit of an ad-hoc fashion through a couple of parameters that were tuned to be in line with how long I actually stopped during previous races.
- Monte Carlo simulation for confidence intervals: This uses race conditions, day-to-day variability and correlated randomness (if you’re slower on the uphill today, you may also be slower on flat sections) to simulate 10th and 90th percentiles, with a built-in skewness that ensures that the 90th percentile is further from the best guess than the 10th. The reason for this is pretty straightforward: There are only so many ways to run faster than you typically do, and their effect tends to be pretty limited. By contrast, it’s very easy to blow up in a race for all kinds of reasons.
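To make the base-time and Monte Carlo steps above a bit more concrete, here is a heavily simplified sketch. The pace curve, altitude penalty, Riegel exponent and rest time are all made-up placeholders, and the real app works leg by leg with more adjustments; this only shows the shape of the computation.

```python
# Simplified sketch of predicting one leg's arrival time. All numbers are
# placeholders, not the app's tuned parameters.
import numpy as np

rng = np.random.default_rng(42)

# Personal sea-level pace curve: grade bin (%) -> typical moving speed (m/s)
pace_curve = {-20: 2.2, -10: 3.0, 0: 3.2, 10: 1.9, 20: 1.3}

# One leg of the course: (grade_bin, segment_length_m, mean_altitude_m)
leg = [(0, 3000, 1500), (10, 2500, 1800), (20, 1500, 2200), (-10, 3000, 2000)]

def altitude_factor(alt_m):
    """Assumed penalty: ~6% slower per 1000 m above 1500 m."""
    return 1.0 + 0.06 * max(0.0, (alt_m - 1500.0) / 1000.0)

def base_leg_time_s(leg):
    """Base moving time: segment length / altitude-adjusted pace-curve speed."""
    return sum(length / (pace_curve[grade] / altitude_factor(alt))
               for grade, length, alt in leg)

base_s = base_leg_time_s(leg)

# Riegel-style slowdown depending on how deep into the race this leg starts,
# relative to the distance the pace curve mostly "knows" about.
b = 1.08                       # personal Riegel exponent (placeholder)
ref_km, start_km = 50.0, 70.0  # reference distance vs. where this leg begins
riegel_factor = (max(start_km, ref_km) / ref_km) ** (b - 1.0)

rest_s = 8 * 60  # fixed aid-station stop, added back once (not compounded!)

best_guess_s = base_s * riegel_factor + rest_s

# Monte Carlo: one right-skewed "day factor" per simulated race, applied to
# the whole leg at once (correlated), so blow-ups outweigh great days.
day_factor = rng.lognormal(mean=0.0, sigma=0.08, size=10_000)
sim_s = base_s * riegel_factor * day_factor + rest_s
p10, p90 = np.percentile(sim_s, [10, 90])

print(f"Leg ETA: {best_guess_s/60:.0f} min (P10 {p10/60:.0f}, P90 {p90/60:.0f} min)")
```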
Of course, getting here involved some amusing fails as well as less uplifting roadblocks. Early versions confidently predicted that I could easily beat the UTMB course record on a good day, while simultaneously estimating that finishing a flat 10k road race in 55 minutes would be a spectacular success for me. Confidence intervals for the latter were also much wider in relative terms. (When pressed, the “reason” I got for this was “Counterintuitively, ultra runners are actually more consistent than short-distance road runners, since factors pulling in different directions tend to cancel out”, which makes a lot of sense until you start thinking about it.) It also developed an algorithm where rest time would compound, so that every 10 mins spent at the first aid station would give you an additional hour of rest at the last one — but then pretended that the rest time at the last aid station was equal to the total rest time for the entire race.
Model limitations
We need to be honest — this model is a sophisticated interpolation (or extrapolation?) engine that may give an illusion of precision it simply can’t offer. It is likely going to be better than a simple educated guess, but human performance clearly remains very variable. As with all predictive models, it’s fundamentally limited by the quality of the data you feed it. In my case, I could provide a list of some 50 races that I recorded in Strava, but I’m not sure that’s representative of most runners. And even with such a relatively well-documented running history, the model struggles to do a good job for cases it hasn’t seen much of — in my case, anything past the 100-mile mark or flat road races.
What’s more, many of the controls are fairly arbitrary — for example, the “race conditions” knob adds a bit of a penalty if the terrain’s really rough or it’s supposed to be very hot on D-Day, but honestly? A technical trail can massively drag down your average speed, as can excessive heat. Yet if you allowed that knob to influence outcomes more, you’d effectively be throwing out all the prediction logic you built before and replacing it with a slider that users can move until it gives them their desired result. There may not be a satisfying answer here.
What else? I’m trying to resist the temptation to provide a laundry list of uncertainties here, so let’s just pick out some that seem to matter more:
- Temporal fitness changes: The model assumes you’re an average of your past self, which may not be true. Recency weighting addresses some of that but won’t capture things like an injury that sets you back.
- Race effort heterogeneity: Not all “races” in Strava are true races — sometimes it’s running with a slower partner, sometimes you treat them as fun or race prep runs (where the main goal isn’t to arrive as fast as possible but to simulate conditions for a longer race — you may even have done a long workout the day before to tire your legs appropriately). The model treats all races as equivalent maximal efforts.
- Sequential effects: The model treats grade bins independently, but the order matters enormously. 2000m straight up followed by 2000m down is very different from alternating 100m segments, even though the grade distributions are identical.
- Night running penalty: Any ultra runner will confirm that you’re always slower during the night (a combination of circadian rhythm plus low visibility). A race that’s two nights one day will be slower than two days one night.
Roadmap
I have a number of ideas for additional features that could make the prediction better, more robust and more user-friendly. Whether and to what extent I’ll work on those (and others that I’m not thinking of right now) will to a certain degree depend on how much interest there is in this app. If it’s just my little side project, it may be good enough as it is right now; if I get a signal that other runners would like to use it too, I’d be happy to spend more time on it. So why don’t you let me know in the comments? New ideas, suggestions for improvements or (God forbid!) pointing out bugs I’ve overlooked are most welcome.
With that out of the way, here are some things I’ve been thinking about:
- Make it a desktop app/browser extension: Probably the biggest lift and also the most logical next step. Right now you’ll have to be at least a bit tech-savvy to get this to run — wouldn’t it be nice if you could simply connect your Strava account, the way e.g. Elevate does, without having to clone a repo and run bash commands? In that case, it would also be neat to expose critical parameters in the interface so users can fiddle around with them, instead of having to edit a config file.
- Use external data to improve predictions: Concretely, what I have in mind here are things like UTMB/ITRA indices (i.e. how you compare to other runners) or historical results from the race you’re about to run (this would provide valuable information on whether it’s a “typical” or a “hard” 100k).
- Real-time updates: Allow users to enter actual arrival times at different checkpoints and update estimates accordingly.
- Include training runs and allow users to select other sports: For this initial version, I’ve hard-coded a few things — by default, the app will only pick up activities that are labeled as both “run” and “race” in Strava. Of course, it would be entirely possible to relax these constraints and let users choose for themselves what they’d like to include.
- Base the rest/sleep function on actual race data. Derive the functional form of rest time vs. elapsed time from the data, rather than using a simplistic model. If you use auto-pause in Strava, it’ll provide both your “moving” and “elapsed” time for races, so it should be possible to model how much you typically rest per hour on the trails (a rough sketch of this follows the list).
- Small stuff: Let users enter the start time of the race so that the model predicts at which time of day a runner is supposed to arrive at an aid station. For aid stations, provide “in” and “out” estimates. Allow users to deselect races when building pace curves (e.g. a charity run with friends). What else comes to mind?
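On the rest/sleep point above, the raw material is already in Strava. Here is a rough sketch, with invented numbers, of what deriving a rest-per-hour estimate from moving vs. elapsed times might look like:

```python
# Rough sketch: estimate how much rest accumulates per moving hour, based
# on the gap between elapsed and moving time. Numbers are invented.
import numpy as np

# (elapsed_h, moving_h) pairs from past ultras
races = np.array([(6.0, 5.8), (12.0, 11.3), (24.0, 22.0), (32.0, 28.6)])
elapsed_h, moving_h = races[:, 0], races[:, 1]
rest_h = elapsed_h - moving_h

# Rest accumulated per hour on the move, per race: longer races tend to
# need more of it, which is exactly the shape the model should learn.
rest_per_moving_hour = rest_h / moving_h
for (e, m), r in zip(races, rest_per_moving_hour):
    print(f"{e:5.1f} h race: ~{r * 60:.0f} min rest per moving hour")

# A simple linear fit of that rate against moving time could then replace
# the hand-tuned rest parameters in the current model.
rate_slope, rate_at_zero = np.polyfit(moving_h, rest_per_moving_hour, 1)
print(f"Fitted rate: {rate_at_zero * 60:.1f} min/h "
      f"+ {rate_slope * 60:.2f} min/h per additional moving hour")
```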
