User Intent Prediction #21: The Dumb One Won.
I had a problem that looked like modeling.
It wasn’t.
It was plumbing.
My predictions were numerically beautiful.
And operationally useless.
Because the model was overconfident by an order of magnitude.
It was shouting “90%!” in a world that converts at ~2%.
Three candidates
I tested three calibration fixes.
Two were elegant.
One was dumb.
Candidate 1: Temperature Scaling
One parameter.
Scale the logits.
Pretend the error is smooth.
It helped a little.
Not enough.
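For reference, the usual single-parameter version looks roughly like this. A minimal sketch assuming NumPy and SciPy; the function and variable names are mine, not the production code:

```python
import numpy as np
from scipy.optimize import minimize_scalar

def fit_temperature(logits, labels):
    """Fit one scalar T that divides the logits, chosen to minimize
    negative log-likelihood on a held-out window (illustrative sketch)."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels, dtype=float)

    def nll(T):
        p = 1.0 / (1.0 + np.exp(-logits / T))          # sigmoid of scaled logits
        p = np.clip(p, 1e-7, 1.0 - 1e-7)               # avoid log(0)
        return -np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p))

    return minimize_scalar(nll, bounds=(0.05, 50.0), method="bounded").x

# Serving: calibrated = 1 / (1 + exp(-logit / T))
```

One knob. If the miscalibration isn't a uniform stretch of the logits, one knob can't fix it.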
Candidate 2: Platt Scaling
Logistic regression on top of the score.
Assume the error follows a sigmoid.
It helped a little.
Not enough.
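Again for reference, the standard recipe as a sketch, assuming scikit-learn; the names are illustrative:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_platt(scores, labels):
    """Platt scaling: a 1-D logistic regression from raw score to outcome.
    Two parameters (slope, intercept) and a baked-in sigmoid assumption."""
    scores = np.asarray(scores, dtype=float).reshape(-1, 1)
    # Large C effectively turns off regularization, matching classic Platt scaling.
    return LogisticRegression(C=1e6).fit(scores, labels)

# Serving: calibrated = platt.predict_proba(new_scores.reshape(-1, 1))[:, 1]
```

Two parameters instead of one, but still one assumed shape.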
Candidate 3: Isotonic Regression
No assumptions.
No curve.
No elegance.
Just a monotonic mapping:
Raw score → measured conversion rate.
A lookup table built from outcomes.
It’s the “dumb” solution.
And it won.
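In code, it's barely anything. A sketch assuming scikit-learn's IsotonicRegression, with illustrative names rather than the exact production code:

```python
import numpy as np
from sklearn.isotonic import IsotonicRegression

def fit_isotonic(raw_scores, conversions):
    """Monotonic lookup: raw model score -> measured conversion rate.
    No shape assumed; just order-preserving averages of real outcomes."""
    iso = IsotonicRegression(y_min=0.0, y_max=1.0, out_of_bounds="clip")
    iso.fit(np.asarray(raw_scores, dtype=float),
            np.asarray(conversions, dtype=float))      # conversions are observed 0/1
    return iso

# Serving: calibrated = iso.predict(raw_scores_new)
```

Under the hood it's pool-adjacent-violators: the fitted mapping is ordered score thresholds paired with averaged outcomes. A table, in other words.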
Why the dumb one won
Because the error wasn’t smooth.
It wasn’t a clean sigmoid mismatch.
It was jagged.
Production behavior is jagged.
Parametric methods try to force the mess into a shape.
Isotonic regression doesn’t.
It fits the mess.
It takes the model’s “confidence” and replaces it with reality.
The production proof
After deploying the calibration layer, I pulled a new production window.
Five days.
Real traffic.
No synthetic tests.
269,320 predictions from 15,160 users.
The post-calibration behavior was exactly what I wanted:
- Mean prediction: 0.0334 (3.3%)
- Range: 0.02–0.07 (2–7%)
That’s not “exciting.”
That’s the point.
The model stopped hallucinating 90% intent.
It started speaking in base rates.
It became usable.
What “usable” means
Before calibration:
- Model says 0.95 → Actual ~7%
- Can’t trust thresholds
- Can’t bin users safely
- Interventions waste money
After calibration:
- Model says 0.05 → Actual ~5%
- Thresholds mean something
- Context bins are stable
- Interventions target correctly
The model didn’t get “smarter.”
It got honest.
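One way to check "usable": bin the predictions and compare each bin's mean prediction to its observed conversion rate. A sketch assuming NumPy; the quantile bins are my choice, not necessarily the original setup:

```python
import numpy as np

def reliability_table(preds, outcomes, n_bins=10):
    """Per-bin check: mean predicted probability vs. observed conversion rate.
    A calibrated model keeps the two columns close in every bin."""
    preds = np.asarray(preds, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    edges = np.quantile(preds, np.linspace(0.0, 1.0, n_bins + 1))
    bin_ids = np.clip(np.digitize(preds, edges[1:-1]), 0, n_bins - 1)
    rows = []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            rows.append((float(preds[mask].mean()),     # mean predicted
                         float(outcomes[mask].mean()),  # observed rate
                         int(mask.sum())))              # users in the bin
    return rows
```

Before calibration, the first column sits far above the second in every bin. After, the two track each other.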
The hidden benefit
Calibration didn’t make the model smarter.
It made the system safer.
With honest probabilities, I can:
- Create stable context bins
- Trigger interventions without bankrupting myself
- Compare cohorts without mixing illusions
Most importantly:
I can finally tell the difference between:
- “This user is slightly above baseline.”
- “This user is meaningfully above baseline.”
Before calibration, both looked like “95%.”
Why this matters for bandits
I built a contextual bandit for adaptive interventions.
Thompson Sampling.
Real-time arm selection.
It requires calibrated probabilities.
Not rankings.
Not relative scores.
Actual probabilities.
Because the bandit needs to know which context bucket a user really belongs in:
“Is this user sitting near the 2% baseline, or up at 7%?”
If every prediction reads 70-95%, the buckets collapse into one.
If predictions honestly span 2-7%, the buckets track measured conversion rates, and they work.
Calibration saved the entire intervention system.
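To make that concrete, here's the shape of the thing. A minimal Beta-Bernoulli Thompson Sampling sketch, not the actual production bandit; the bucket edges and arm names are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative only: bucket edges over the calibrated probability, made-up arm names.
BUCKET_EDGES = [0.0, 0.02, 0.03, 0.05, 0.07, 1.0]
ARMS = ["no_touch", "nudge", "discount"]

# One Beta posterior per (bucket, arm): alpha = successes + 1, beta = failures + 1.
alpha = np.ones((len(BUCKET_EDGES) - 1, len(ARMS)))
beta = np.ones((len(BUCKET_EDGES) - 1, len(ARMS)))

def choose_arm(calibrated_p):
    """Thompson Sampling: the calibrated probability picks the context bucket,
    then we sample each arm's posterior and play the argmax."""
    bucket = int(np.clip(np.digitize(calibrated_p, BUCKET_EDGES) - 1,
                         0, len(BUCKET_EDGES) - 2))
    samples = rng.beta(alpha[bucket], beta[bucket])
    return bucket, int(np.argmax(samples))

def record_outcome(bucket, arm, converted):
    """Conjugate update once the conversion (or non-conversion) is observed."""
    if converted:
        alpha[bucket, arm] += 1.0
    else:
        beta[bucket, arm] += 1.0
```

Each (bucket, arm) cell learns its own conversion rate. The calibrated probability does one job: route the user to the right row.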
The lesson
The smartest thing in this whole pipeline wasn’t a neural network.
It was a table.
A boring monotonic mapping learned from real outcomes.
The dumb one won because it matched reality.
And reality is the only metric that matters.
This closes the loop on the contradiction from Dec 8 (“Same User. Opposite Predictions.”): when two models disagree, it’s not a debate. It’s a measurement problem.