User Intent Prediction #21: The Dumb One Won.
I had a problem that looked like modeling.
It wasn’t.
It was plumbing.
My predictions were numerically beautiful.
And operationally useless.
Because the model was overconfident by an order of magnitude.
It was shouting “90%!” in a world that converts at ~2%.
Three candidates
I tested three calibration fixes.
Two were elegant.
One was dumb.
Candidate 1: Temperature Scaling
One parameter.
Scale the logits.
Pretend the error is smooth.
It helped a little.
Not enough.
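For reference, the usual single-parameter version looks roughly like this. A minimal sketch assuming NumPy and SciPy; the function and variable names are mine, not the production code:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit one scalar T that divides the logits, chosen to minimize
    negative log-likelihood on a held-out window (illustrative sketch)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)

    def nll(T):
        p = 1.0 / (1.0 + np.exp(-logits / T))          # sigmoid of scaled logits
        p = np.clip(p, 1e-7, 1.0 - 1e-7)               # avoid log(0)
        return -np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

    return minimize_scalar(nll, bounds=(0.05, 50.0), method="bounded").x

# Serving: calibrated = 1 / (1 + exp(-logit / T))
```

One knob. If the miscalibration isn't a uniform stretch of the logits, one knob can't fix it.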
Candidate 2: Platt Scaling
Logistic regression on top of the score.
Assume the error follows a sigmoid.
It helped a little.
Not enough.
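Again for reference, the standard recipe as a sketch, assuming scikit-learn; the names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, labels):
    """Platt scaling: a 1-D logistic regression from raw score to outcome.
    Two parameters (slope, intercept) and a baked-in sigmoid assumption."""
    scores = np.asarray(scores, dtype=float).reshape(-1, 1)
    # Large C effectively turns off regularization, matching classic Platt scaling.
    return LogisticRegression(C=1e6).fit(scores, labels)

# Serving: calibrated = platt.predict_proba(new_scores.reshape(-1, 1))[:, 1]
```

Two parameters instead of one, but still one assumed shape.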
Candidate 3: Isotonic Regression
No assumptions.
No curve.
No elegance.
Just a monotonic mapping:
Raw score → measured conversion rate.
A lookup table built from outcomes.
It’s the “dumb” solution.
And it won.
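In code, it's barely anything. A sketch assuming scikit-learn's IsotonicRegression, with illustrative names rather than the exact production code:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(raw_scores, conversions):
    """Monotonic lookup: raw model score -> measured conversion rate.
    No shape assumed; just order-preserving averages of real outcomes."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(np.asarray(raw_scores, dtype=float),
            np.asarray(conversions, dtype=float))      # conversions are observed 0/1
    return iso

# Serving: calibrated = iso.predict(raw_scores_new)
```

Under the hood it's pool-adjacent-violators: the fitted mapping is ordered score thresholds paired with averaged outcomes. A table, in other words.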
Why the dumb one won
Because the error wasn’t smooth.
It wasn’t a clean sigmoid mismatch.
It was jagged.
Production behavior is jagged.
Parametric methods try to force the mess into a shape.
Isotonic regression doesn’t.
It fits the mess.
It takes the model’s “confidence” and replaces it with reality.
The production proof
After deploying the calibration layer, I pulled a new production window.
Five days.
Real traffic.
No synthetic tests.
269,320 predictions from 15,160 users.
The post-calibration behavior was exactly what I wanted:
- Mean prediction: 0.0334 (3.3%)
- Range: 0.02–0.07 (2–7%)
That’s not “exciting.”
That’s the point.
The model stopped hallucinating 90% intent.
It started speaking in base rates.
It became usable.
What “usable” means
Before calibration:
- Model says 0.95 → Actual ~7%
- Can’t trust thresholds
- Can’t bin users safely
- Interventions waste money
After calibration:
- Model says 0.05 → Actual ~5%
- Thresholds mean something
- Context bins are stable
- Interventions target correctly
The model didn’t get “smarter.”
It got honest.
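One way to check "usable": bin the predictions and compare each bin's mean prediction to its observed conversion rate. A sketch assuming NumPy; the quantile bins are my choice, not necessarily the original setup:

```python
import numpy as np

def reliability_table(preds, outcomes, n_bins=10):
    """Per-bin check: mean predicted probability vs. observed conversion rate.
    A calibrated model keeps the two columns close in every bin."""
    preds = np.asarray(preds, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.quantile(preds, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.clip(np.digitize(preds, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((float(preds[mask].mean()),     # mean predicted
                         float(outcomes[mask].mean()),  # observed rate
                         int(mask.sum())))              # users in the bin
    return rows
```

Before calibration, the first column sits far above the second in every bin. After, the two track each other.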
The hidden benefit
Calibration didn’t make the model smarter.
It made the system safer.
With honest probabilities, I can:
- Create stable context bins
- Trigger interventions without bankrupting myself
- Compare cohorts without mixing illusions
Most importantly:
I can finally tell the difference between:
- “This user is slightly above baseline.”
- “This user is meaningfully above baseline.”
Before calibration, both looked like “95%.”
Why this matters for bandits
I built a contextual bandit for adaptive interventions.
Thompson Sampling.
Real-time arm selection.
It requires calibrated probabilities.
Not rankings.
Not relative scores.
Actual probabilities.
Because the bandit needs to know which context bucket a user really belongs in:
“Is this user sitting near the 2% baseline, or up at 7%?”
If every prediction reads 70-95%, the buckets collapse into one.
If predictions honestly span 2-7%, the buckets track measured conversion rates, and they work.
Calibration saved the entire intervention system.
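To make that concrete, here's the shape of the thing. A minimal Beta-Bernoulli Thompson Sampling sketch, not the actual production bandit; the bucket edges and arm names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative only: bucket edges over the calibrated probability, made-up arm names.
BUCKET_EDGES = [0.0, 0.02, 0.03, 0.05, 0.07, 1.0]
ARMS = ["no_touch", "nudge", "discount"]

# One Beta posterior per (bucket, arm): alpha = successes + 1, beta = failures + 1.
alpha = np.ones((len(BUCKET_EDGES) - 1, len(ARMS)))
beta = np.ones((len(BUCKET_EDGES) - 1, len(ARMS)))

def choose_arm(calibrated_p):
    """Thompson Sampling: the calibrated probability picks the context bucket,
    then we sample each arm's posterior and play the argmax."""
    bucket = int(np.clip(np.digitize(calibrated_p, BUCKET_EDGES) - 1,
                         0, len(BUCKET_EDGES) - 2))
    samples = rng.beta(alpha[bucket], beta[bucket])
    return bucket, int(np.argmax(samples))

def record_outcome(bucket, arm, converted):
    """Conjugate update once the conversion (or non-conversion) is observed."""
    if converted:
        alpha[bucket, arm] += 1.0
    else:
        beta[bucket, arm] += 1.0
```

Each (bucket, arm) cell learns its own conversion rate. The calibrated probability does one job: route the user to the right row.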
The lesson
The smartest thing in this whole pipeline wasn’t a neural network.
It was a table.
A boring monotonic mapping learned from real outcomes.
The dumb one won because it matched reality.
And reality is the only metric that matters.
This closes the loop on the contradiction from Dec 8 (“Same User. Opposite Predictions.”): when two models disagree, it’s not a debate. It’s a measurement problem.