May 31, 2026 Biohacking

Biohacking With AI: How I Built Autonomous Wearable Health Insights

I wired my wearables into an AI loop that pushes real-time, personalized biohacking recommendations instead of generic scores. This is what actually worked and what broke.

Photo by Luke Chesser / Unsplash

Why I Stopped Trusting "Scores"

I like toys. Especially toys with sensors.

Oura, Whoop, Apple Watch, a cheap HR strap from Decathlon, even a continuous glucose monitor for a while. All of them tried to tell me how I was doing with one magic number.

Readiness: 82. Sleep score: 91. Strain: 13.2.

Nice. But my body did not care. Some days I had a high readiness score and still felt like a brick. Other days I felt sharp and the app nagged me to “take it easy”.

So I stopped asking: “What is my score?”

I started asking: “What should I do right now?”

That question is where AI actually got useful for biohacking. Not to generate pretty charts, but to make a decision: push, maintain, or back off.

The Goal: Autonomous, Boringly Simple Health Suggestions

I wanted a system that:

Pulls live data from my wearables automatically.
Understands my baselines, not population averages.
Spits out 1–2 concrete suggestions, not a 10-paragraph essay.
Runs in the background without me having to remember to open an app.

Think tiny health coach in the background. Not a life guru. More like a slightly annoying teammate that pings you with: “Take 10 minutes outside, your HRV tanked after that last call.”

So I wired my wearables to an LLM and tried to make it as autonomous as I could without blowing up my API bill or my sanity.

My Wearable Stack

Right now my core stack looks like this:

Oura Ring for sleep, HRV, resting HR.
Apple Watch for heart rate during training, HRV snapshots, and activity.
Cheap BLE chest strap for kettlebell and conditioning sessions when I care about accuracy.

I tried adding CGM data, but that turned into a distraction. Good for a focused 2-week experiment, not great for permanent background noise. So in this phase I cut it.

Important detail: I stopped trusting each platform’s “readiness” score and focused on raw-ish metrics:

HRV (rMSSD), nightly average and trend.
Resting heart rate and how fast it drops after workouts.
Sleep duration and fragmentation.
Training load from heart rate and RPE notes.

These are boring numbers. Which is perfect. Boring means predictable. Predictable means an LLM can reason about them without hallucinating some spiritual conclusion out of a 79 sleep score.

The Architecture: From Wrist To Recommendation

This is what I actually built. Not theoretical. Running on my stuff right now.

The system has five pieces:

Data collection workers
Normalization and feature builder
Long-term baseline and history store
LLM “coach” prompt layer
Notification layer (where it pokes me)

1. Data collection workers

I run small, boring scripts on a cheap VPS.

A cron job hits the Oura API every 30 minutes in the morning, then every 2 hours during the day for readiness updates.
Another job pulls Apple Health data via a personal shortcut export every few hours. Not pretty, but it works.
Workout sessions push data in directly when I finish them, through a tiny web form where I log RPE and what I actually did.

Everything lands in a Postgres database. Timestamps, raw values, the source device, plus some calculated stuff like rolling averages.

2. Normalization and feature builder

Each hour a job runs that:

Back-fills missing data, or marks gaps clearly instead of pretending it knows.
Builds “features” for the AI layer: HRV vs 7-day average, sleep vs 14-day average, last hard session delta, and so on.
Flags notable events. For example: “HRV dropped > 20% below 7-day mean”, or “Sleep < 80% of 14-day avg and two days in a row of heavy training”.

This part matters more than the AI model choice in my opinion. If your inputs are vague and inconsistent, your clever prompt will just generate poetic nonsense.

3. Baseline and history

I keep a rolling 90-day window for most metrics.

HRV especially is very personal. My “good” number might be your “I am dying” number. So the system stores per-user (just me right now) baselines and updates them every week.

That way the AI agent never sees: “HRV = 60”. It sees: “HRV = 60, which is 8% above his 30-day average, and he slept 15% more than usual”. Now it has context.

4. LLM coach layer

This is a small API that wraps the model. I currently use a GPT-4-level model, but the key is the system prompt and how I limit its job.

The system prompt looks roughly like this (shortened here):

You are a pragmatic health coach for a single user.
You must respond with 1-2 concrete actions only.
No more than 120 words. No general wellness advice.
Use only the data given. If data is missing, say so.

User data: <metrics>
User context: <recent training log + calendar notes>

Then I feed in the hourly feature bundle, plus a short snippet from my notes if I tagged something like “travel”, “sick”, or “poor sleep because of kids baseball tournament”.

The output format is strict JSON:

{
  "priority": "push" | "maintain" | "back_off",
  "reason": "short explanation",
  "actions": [
    "action 1",
    "action 2"
  ]
}

No Markdown. No “as an AI language model”. Just a tiny plan for the next few hours.

5. Notification layer

The last piece is the annoying one.

A small worker checks the new JSON and compares it to the previous recommendation. If the priority or actions really changed, it sends me a push notification through a custom iOS shortcut endpoint.

Example message:

Priority: back_off
Reason: HRV dropped 22% below 7-day avg after 2 heavy days.
Actions: 1) Keep training under 30 min, low intensity. 2) In bed by 22:00, no late screens.

If nothing meaningful changed, it stays quiet. Silence is part of the design. I do not want a Tamagotchi, I want a cold friend who only texts when it matters.

What "Real-time" Actually Means Here

People say “real-time” when they mean “not yesterday”. For biohacking I do not need sub-second updates. I need the system to react fast enough that I can still change something.

The cadence that felt right in practice:

Morning block: 2–3 runs in the first hour after waking, as data from Oura and Apple sync.
Daytime: Every 2 hours, unless there is a spike or a flagged event.
Training window: Right after a workout, when I log RPE and type of session.

I tried tighter loops. 15-minute cycles felt like noise. I got more pings during meetings and coding blocks, and I started ignoring them. Which kills the whole point.

So “real-time” in this project became: fast enough to adjust the next block of the day, slow enough that it does not spam me.

Concrete Examples From My Own Data

1. Catching overreach before it hit

Two heavy kettlebell days. Both felt good. My subjective rating: “Let’s keep going.”

The system saw:

HRV 18% down from 7-day average.
Resting heart rate up 4 bpm vs 14-day average.
Sleep time normal, but more wake events.

It flagged back_off and suggested: “Skill work only today, no conditioning. Prioritize 30-minute walk in sunlight.”

I listened. Next day HRV bounced back, and I avoided that familiar 3–4 day fatigue slump. Subjectively it felt like cheating. An invisible adult in the room.

2. Travel days and fake guilt

On travel days my normal apps still shame me. No rings closed. Activity score bad. Readiness confused.

With the AI layer, I tagged my calendar with “travel” and fed that into the context. The system adapted quickly:

Priority: maintain
Actions: “Walk 15 minutes between flights. Hydrate aggressively. Bed as early as possible after arrival.”

No comment on my lack of deadlifts. No nonsense about streaks. Just realistic constraints based on my situation, not some perfect training plan.

3. Late-night coding and hidden cost

I did a classic mistake. Pushed a feature late, bright screen, heavy React profiler sessions until midnight.

Next morning the system saw:

Sleep quantity acceptable, but deep sleep down by ~30% vs 14-day average.
Heart rate stayed elevated longer into the night.
Calendar showed “late deploy” tag from my notes.

Response:

Priority: back_off
Actions: “Avoid heavy training today. Block 20:30–22:00 for no-screen wind-down. If you work, do offline writing.”

The useful part is not the advice. I already know screens at night are bad. The useful part is timing. It caught it the morning after and forced the tradeoff into my face while I was planning the day.

Where AI Actually Helped (And Where It Did Not)

Helpful things first.

Pattern stitching. I am decent at looking at HRV or sleep in isolation. The LLM is better at noticing “three medium red flags” that together justify backing off.
Context awareness. Pulling in calendar tags like “travel”, “kids tournament”, “launch week” made the recommendations far less annoying. Generic apps ignore this.
Natural language summaries. I could generate the same decisions with hand-coded rules, but the explanation quality from the model made me trust the output more.

Not helpful:

Generic wellness advice. If you do not clamp the prompt hard, it will start telling you to drink water and meditate. I cut that ruthlessly.
Creating complex long-term plans. The model is bad at periodization. I still set my weekly and monthly training blocks manually.
Data cleaning. AI is not magic here. I tried having it infer missing values. It invented too much. Plain scripts with explicit rules won.

Constraints I Put On The System (For My Own Sanity)

Autonomous does not mean “free to be weird”. I added hard constraints.

No medical claims. The prompt explicitly blocks diagnosis talk. If metrics are extreme, it says: “These values are unusual; talk to a real doctor.”
Limited authority. The system never gets to cancel my workouts. It can label the day as back_off, but I still choose. This keeps the relationship healthy.
Strict word cap. Max 120 words. Otherwise it drifts into storytime. I am not reading essays between meetings.
Data transparency. Every recommendation includes a one-line summary of what it actually saw: “HRV -18% vs 7d, RHR +3 bpm, 2x heavy days.” No black box energy.

If You Want To Build Your Own Version

I am not packaging this as a product, at least not yet. But if you want to wire something similar for yourself, this order worked well for me:

Collect first, model later. Spend a few weeks just pulling data into a database. Look at it manually. See what actually matters for you.
Define 3–5 triggers. For example: big HRV drop, lack of sleep two days in a row, unusually high heart rate during light activity. Only then bring in AI.
Start with a fixed prompt. Do not build a full “agent” that can call tools and write its own rules. Keep the LLM as a reasoning layer on top of clear metrics.
Respect your annoyance budget. Tune notification frequency until you barely notice the system exists. If it feels loud, you will ignore it.

I think people overestimate what models can do and underestimate how powerful they are as boring, opinionated summarizers. A little reasoning plus your own data goes a long way.

Where I Want To Take This Next

I see three obvious next steps.

Stronger integration with training plans. Right now I still plan my cycles manually. I want the system to suggest minor adjustments: “swap heavy day to Thursday, based on recovery.”
Better short-term HRV snapshots. Continuous or at least scheduled HRV recordings during the day, not just night data, so it can react to work stress spikes too.
Team mode. I coach baseball. At some point I want a stripped-down version that helps me modulate practice intensity for players based on basic wearable data.

But I am careful. More data and more autonomy are not always better. The sweet spot seems to be:

Small, opinionated loop.
Clear question: “What should I do next?”
Minimal friction to follow the suggestion.

That is how I am using AI for biohacking right now. Not as a guru. As a slightly nerdy mirror that watches my signals and tells me, in plain language, when I am about to be stupid with training or sleep.

That is good enough for me.

Subscribe to my newsletter

Subscribe to my newsletter to get the latest updates and news

Biohacking With AI: How I Built Autonomous Wearable Health Insights

by Richard Lemon

Why I Stopped Trusting "Scores"

The Goal: Autonomous, Boringly Simple Health Suggestions

My Wearable Stack