Why I stopped trusting my gut and started trusting Whoop
I used to schedule training like a calendar Tetris game. Slot workouts between calls, after dinner, or whenever the guilt kicked in.
Some days I felt amazing. Some days I felt like concrete. Same plan. Same program. Totally different output.
That annoyed me more than it should have.
I wear a Whoop. I write code. I mess with AI. At some point it felt stupid not to wire those together and ask a simple question.
Can I predict my best training windows from my wearable data?
Not vibes. Not mood. Actual data. And not just “train hard when recovery is green”. That advice is lazy and honestly not very helpful if you train consistently.
I wanted a rough forecast of when my body is likely to be most primed. Not perfect. Just better than guessing.
The actual question I care about
I coach baseball. I lift. I run. I do way too many experiments on myself. One pattern kept popping up.
- Some days, low recovery still produced strong performance.
- Other days, great recovery still felt weirdly flat.
- Time of day mattered more than I wanted to admit.
Most training plans ignore that. They treat your body like a calendar slot, not a dynamic system.
So I framed a better question:
Given my sleep, recovery, strain and timing patterns from Whoop, what is the probability that this upcoming 2–3 hour window will produce a “peak” training session?
I do not care if the probability is 0.76 or 0.81. I care if it is higher than tomorrow. Or higher than this evening. That is enough to make better choices.
What I pulled out of Whoop
First problem. Whoop gives you nice charts. It does not really want you scraping everything. But you can get your data out.
I used the Whoop export. CSV. Boring, but it works.
The useful stuff for this experiment:
- Daily recovery (0-100%)
- Resting heart rate
- Heart rate variability (HRV)
- Sleep duration and sleep efficiency
- Strain (and when it was accumulated)
- Bedtime / wake time and consistency
Then I added my own layer on top.
- Manual performance scores for sessions, 1–5.
- Session type: strength, sprint, skill work, game day.
- Session start time in local time.
- Notes like “slept like trash”, “travel day”, “cold”, “kids sick”, “double espresso”.
I did not overthink the scoring. 1 is garbage. 3 is fine. 5 is “I could play a doubleheader and still go lift”. Imperfect, but honest.
Defining “peak” without lying to myself
AI will happily optimise the wrong thing if you feed it fuzzy labels. I had to make “peak” concrete.
I decided on this rule:
- Peak session = performance score 4 or 5, and at least one objective sign of progress.
Objective signs looked like this:
- Strength: new rep PR at a given weight, or higher total volume at same RPE.
- Conditioning: faster repeat sprint times, same work at lower heart rate, or more intervals at same pace.
- Skill: better command in bullpen, higher strike percentage, or cleaner movement on video review.
So if a session felt good but numbers were flat, it did not count as a peak. I wanted real performance, not hype.
I added a binary column in the dataset: is_peak = 1 or 0.
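A minimal sketch of that labeling rule in plain Python. The `Session` class and its field names are made up for illustration; the only thing taken from the rule above is "score of 4 or 5, plus at least one objective sign of progress".

```python
from dataclasses import dataclass


@dataclass
class Session:
    score: int                # manual performance score, 1-5
    objective_progress: bool  # e.g. rep PR, faster repeat sprints, higher strike percentage


def is_peak(session: Session) -> int:
    """Return 1 only if the session felt good AND showed measurable progress."""
    return int(session.score >= 4 and session.objective_progress)


# Illustrative sessions, not real data.
sessions = [
    Session(score=5, objective_progress=True),   # felt great, new rep PR
    Session(score=4, objective_progress=False),  # felt good, numbers flat
    Session(score=3, objective_progress=True),   # progress, but only felt fine
]
labels = [is_peak(s) for s in sessions]
print(labels)  # [1, 0, 0]
```

The second session is the important case: it felt good, but with flat numbers it stays a 0.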
From raw data to features my model can actually use
Just feeding daily recovery into AI is lazy. The fun lives in combinations and lagging effects.
I used Python for this. Pandas, nothing fancy.
Some of the features that turned out useful:
- Rolling HRV baseline. 7-day average vs last night. I care more about the delta than the absolute number.
- Sleep debt. How far the last 3 nights fall below my personal optimal duration.
- Bedtime irregularity. Std dev of bedtimes over the last week.
- Strain balance. Yesterday’s strain vs 7-day average.
- Time since last hard session. Hours since a session scored 4 or 5.
- Circadian bucket. Session time grouped into chunks: early morning, late morning, afternoon, evening.
- Day-of-week. Because my life schedule is not symmetric.
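Two of these features have small traps, so here is a rough sketch of how they could be computed. It uses the standard library instead of Pandas to stay dependency-free. The bucket boundaries (9, 12, 17) and the function names are assumptions for illustration, not values from the real dataset. The bedtime trick is to measure minutes from noon, so that 23:30 and 00:30 end up an hour apart instead of 23 hours apart.

```python
import statistics
from datetime import time


def bedtime_minutes(t: time) -> int:
    """Minutes after noon, so bedtimes on either side of midnight stay close."""
    return ((t.hour - 12) % 24) * 60 + t.minute


def bedtime_irregularity(bedtimes: list) -> float:
    """Sample standard deviation of bedtimes, in minutes."""
    return statistics.stdev([bedtime_minutes(t) for t in bedtimes])


def circadian_bucket(start: time) -> str:
    """Group a session start time into a coarse bucket (boundaries are assumed)."""
    if start.hour < 9:
        return "early_morning"
    if start.hour < 12:
        return "late_morning"
    if start.hour < 17:
        return "afternoon"
    return "evening"


# Illustrative week of bedtimes, not real data.
week = [time(22, 45), time(23, 30), time(0, 30), time(23, 0),
        time(22, 30), time(1, 15), time(23, 15)]
print(round(bedtime_irregularity(week), 1), "minutes")
print(circadian_bucket(time(10, 0)))  # late_morning
```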
Most of this is simple math. For example, sleep debt is just:
sleep_debt = 3 * target_sleep - sum(last_3_nights_sleep)
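As runnable Python, that formula and the HRV delta could look something like this. The function names are hypothetical, and one detail is an assumption: the 7-day HRV baseline here excludes last night, so the delta compares last night against the seven nights before it.

```python
def sleep_debt(last_3_nights_sleep, target_sleep):
    """Hours of sleep missing over the last 3 nights (negative means a surplus)."""
    return 3 * target_sleep - sum(last_3_nights_sleep)


def hrv_delta(hrv_history):
    """Last night's HRV minus the mean of the 7 nights before it.

    Assumption: the baseline excludes last night.
    """
    if len(hrv_history) < 8:
        raise ValueError("need 7 baseline nights plus last night")
    *baseline, last_night = hrv_history[-8:]
    return last_night - sum(baseline) / len(baseline)


# Illustrative numbers, not real data.
print(sleep_debt([7.0, 6.5, 8.0], target_sleep=8.0))   # 2.5
print(hrv_delta([60, 62, 58, 61, 59, 63, 57, 66]))     # 6.0
```

Note that the formula goes negative after a few long nights. Whether to clip it at zero is a judgment call; leaving it signed keeps "banked" sleep visible to the model.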
I also added a really subjective feature: mental load score. 1–5, based on meetings, deadlines, context switching. This is where the “biohacking meets real life” part kicks in. Nobody has perfect lab conditions.
Where AI actually comes in
I did not want to build a massive ML stack for this. I like scrappy experiments that I can change in a weekend.
So I used two approaches in parallel:
- A basic gradient boosted model (XGBoost) trained locally.
- An LLM-based forecaster using an AI API where I feed engineered features as JSON and ask for a probability and explanation.
The traditional model gives me a number. The LLM gives me a narrative of why it thinks a window is good or bad. I like both.
The key prompt pattern for the LLM:
You are my training planner.
You get my last 10 days of Whoop metrics and context as JSON.
Your job: for each proposed 2-hour window in the next 36 hours,
output a probability (0-1) that this will be a high-performance session
(is_peak = 1), based on my historical patterns.
Focus on:
- HRV vs my rolling baseline
- Sleep debt and consistency
- Time of day effects from history
- Strain and time since last hard session
Output terse JSON only, with reasoning for each window.
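The plumbing around that prompt can stay small. The sketch below only builds the JSON payload and sanity-checks the reply; it makes no API call, since that depends on the provider. The field names (`history`, `windows`, `probability`, `reasoning`) are assumptions, because the prompt above does not pin down a schema.

```python
import json


def build_payload(daily_metrics, candidate_windows):
    """Bundle the last 10 days of features with the windows to score."""
    return json.dumps({
        "history": daily_metrics[-10:],   # most recent 10 days only
        "windows": candidate_windows,
    })


def parse_forecast(raw_reply):
    """Parse the model's reply and reject anything unusable."""
    forecasts = json.loads(raw_reply)
    for item in forecasts:
        p = item["probability"]
        if not isinstance(p, (int, float)) or not 0 <= p <= 1:
            raise ValueError(f"probability out of range: {p!r}")
        if not item.get("reasoning"):
            raise ValueError("missing reasoning")
    return forecasts


# Illustrative reply in the assumed schema.
reply = '[{"window": "10:00-12:00", "probability": 0.74, "reasoning": "low sleep debt"}]'
print(parse_forecast(reply)[0]["probability"])  # 0.74
```

Validating the reply matters more than it looks. An LLM will sometimes return a probability of 1.4 or wrap the JSON in chatter, and it is better to fail loudly than to schedule around garbage.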
I do not ask the LLM to train. I treat it like a pattern-matching forecaster on top of engineered features and my past labels. It is opinionated, but that is the point.
What the early results looked like
I trained the gradient boosted model on about 6 months of data. A bit over 250 labeled sessions. Not huge, but enough for some shape.
Accuracy metrics are nice for dashboards, but I care about something simpler.
When the model said a window had >70% “peak” probability, how often was the session actually a peak?
After cross-validation and a month of “live” use, this was roughly what I saw:
- High probability windows (>0.7): about 65–70% actually turned into peak sessions.
- Medium probability (0.4–0.7): 35–45% peaks.
- Low probability (<0.4): peaks dropped to 15–20%.
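That check is easy to reproduce. A minimal sketch, assuming you have a list of (predicted probability, is_peak) pairs; the bucket edges match the ones above, and the sample pairs are invented purely to show the mechanics.

```python
def bucket(prob):
    """Map a predicted probability to the high / medium / low bands."""
    if prob > 0.7:
        return "high"
    if prob >= 0.4:
        return "medium"
    return "low"


def peak_rate_by_bucket(predictions):
    """predictions: iterable of (predicted_probability, is_peak) pairs."""
    counts = {}
    for prob, is_peak in predictions:
        name = bucket(prob)
        total, peaks = counts.get(name, (0, 0))
        counts[name] = (total + 1, peaks + is_peak)
    return {name: peaks / total for name, (total, peaks) in counts.items()}


# Invented pairs for illustration, not the real results.
sample = [(0.80, 1), (0.75, 0), (0.90, 1), (0.50, 1), (0.45, 0), (0.20, 0), (0.30, 0)]
rates = peak_rate_by_bucket(sample)
for name in ("high", "medium", "low"):
    print(f"{name}: {rates[name]:.0%} peaks")
```

If the observed rate climbs from low to medium to high, the forecast is at least ranking windows in the right order, which is all the scheduling decision needs.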
Is it perfect? No. Is it better than my calendar Tetris? Absolutely.
The LLM layer added something extra. It gave me language for what my body was already doing.
Some common patterns it spotted:
- “You historically underperform in early sessions after two late bedtimes, even with high recovery scores.”
- “You hit more strength PRs in late-morning windows after moderate strain days, not low-strain rest days.”
- “Evening sessions after heavy cognitive load have lower peak probability, regardless of recovery.”
I could feel these patterns before. I just could not quantify them.
Scheduling around the forecast, not the calendar
I wired the whole thing into a tiny internal tool. Nothing pretty. A FastAPI backend and a minimal frontend on my local network.
Flow looks like this:
- Every morning, a script pulls my latest Whoop export and merges it.
- The model generates probabilities for a few candidate windows that actually fit my schedule.
- The LLM annotates each window with a one-sentence reason.
- I see a simple table: time window, probability, reason, recommended session type.
Example from a real day:
10:00–12:00 | 0.74 | Sleep debt low, HRV slightly above baseline, good history for strength here.
16:00–18:00 | 0.42 | Long meeting block before this, history of underperformance post-calls.
20:00–22:00 | 0.31 | Late window after two late bedtimes in a row, poor strength outcomes historically.
I picked the 10:00 window, moved a meeting, and loaded my heaviest work there. That session hit a 5. New rep PR. Felt exactly like the forecast described.
Does it always line up? No. But when it does not, I look for what I missed. Travel. Caffeine. Kids. Life variables I still need to encode.
What actually mattered in the model
The model’s feature importance chart was not surprising, but it was clarifying.
Top contributors:
- Time of day bucket. My late-morning sessions were quietly carrying my training.
- Sleep debt (3-day). When this crossed a threshold, everything sagged, even if recovery was “green”.
- Bedtime consistency. Not just duration. Erratic bedtimes killed peak probability.
- Time since last hard session. I respond well to about 36–48 hours after a very hard effort, not 24.
- Strain balance vs baseline. Both too low and too high strain yesterday were bad for peaks. The middle was better.
The model almost treated daily recovery as a supporting actor, not a star. I think most people overrate that single number.
The messy parts and why I am okay with them
This entire setup is personal. It is overfitted to my lifestyle, my schedule, my weird mix of baseball and lifting.
There are obvious problems:
- Self-reported performance scores are noisy.
- Whoop data itself has errors.
- External stress is hard to quantify cleanly.
- The LLM occasionally hallucinates causality that is not there.
I am fine with that. I am not submitting this to a journal. I just want more good sessions and fewer wasted ones.
The biggest win so far is not perfect prediction. It is permission to reschedule or downshift when the forecast is low and my body agrees.
Instead of forcing a max-effort day into a bad window, I flip the script:
- If peak probability is high: I protect that window. Phone off. Hard session.
- If it is middling: I run technique work or submax strength.
- If it is low: I do mobility, easy zone 2, or nothing, and I do not feel guilty.
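Those three rules fit in a few lines of code. This sketch assumes "high", "middling" and "low" use the same cut-offs as the probability bands from the results (above 0.7, 0.4 to 0.7, below 0.4); the function name is hypothetical. The windows are the ones from the example day.

```python
def plan_for(prob):
    """Translate a peak probability into a session recommendation."""
    if prob > 0.7:
        return "hard session, protect the window"
    if prob >= 0.4:
        return "technique work or submax strength"
    return "mobility, easy zone 2, or rest"


windows = [("16:00-18:00", 0.42), ("10:00-12:00", 0.74), ("20:00-22:00", 0.31)]

# Best window first.
ranked = sorted(windows, key=lambda w: w[1], reverse=True)
for label, prob in ranked:
    print(f"{label} | {prob:.2f} | {plan_for(prob)}")
```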
How you could steal this, without copying my stack
If you are not the type who wants to wire up Python, you can still use the basic idea.
Here is the lightweight version I would do if I started over:
- Export 3–6 months of Whoop data.
- Add a column in a spreadsheet where you score each session 1–5.
- Mark the top 20% of your sessions as “peak”.
- Calculate simple stuff: average HRV for the week, last night vs average, 3-day sleep total, day-of-week, time-of-day bucket.
- Use an AI tool to look for patterns and build plain language rules.
You will probably find something like:
- “Your best sprint sessions almost never happen before 9am.”
- “Heavy lifting goes better 24–36 hours after moderate strain, not after full rest.”
- “Late-night sessions are only good when sleep debt is basically zero.”
Once you see those rules, you cannot unsee them. You start scheduling around them. That is the real value.
Where I want to take this next
This system still feels like a prototype. It already changed how I train, but it can go further.
Next steps on my list:
- Include GPS and velocity data from sprint sessions and games, not just subjective scores.
- Track nutrition windows and caffeine intake relative to sessions.
- Split models for strength, speed and skill instead of a single generic “peak”.
- Automate session type suggestions based on forecast and upcoming schedule.
The end goal is simple. When I open my calendar, I do not just see meetings and tasks. I see where my body is likely to be most dangerous. Then I stack the important work right there.
Whoop is just the sensor. AI is just the pattern spotter. The interesting part lives in between, where you are honest about how you actually perform and you are willing to move your life a bit to match it.
I think that is where wearable data finally starts to earn its keep.