User Intent Prediction #19: Thirty Days. Walking Dead.
The model didn’t die.
It kept running.
It kept returning numbers.
It kept looking confident.
That’s what made it dangerous.
It was walking dead.
The lie
I opened a production report and saw this:
High-confidence users (0.9–1.0):
- Predicted: 96.6% conversion
- Actual: 7.1% conversion
That’s not “a bit off.”
That’s a hallucination.
The model wasn’t wrong by 5%.
It was wrong by 89.5 percentage points.
And it wasn’t a one-off.
Medium-high confidence (0.7–0.8):
- Predicted: 74.5%
- Actual: 7.1%
Medium (0.5–0.6):
- Predicted: 57.5%
- Actual: 6.2%
Reality was screaming the same message:
Your probabilities are fiction.
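If you want to catch this in your own numbers, the check is one group-by: mean predicted probability vs. actual conversion rate per score bucket. A minimal sketch, assuming a DataFrame with `pred` and `converted` columns (the names are mine, not from the report):

```python
# Minimal reliability check: mean predicted probability vs. actual conversion
# rate per score bucket. Column names `pred` and `converted` are assumptions.
import pandas as pd

def reliability_table(df: pd.DataFrame, n_bins: int = 10) -> pd.DataFrame:
    buckets = pd.cut(df["pred"], bins=n_bins)
    return (
        df.groupby(buckets, observed=True)
          .agg(predicted=("pred", "mean"),
               actual=("converted", "mean"),
               users=("converted", "size"))
    )

# reliability_table(last_30_days)  # the gap between "predicted" and "actual" is the whole story
```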
Why this happens
If you’ve never deployed prediction models, you’ll miss this.
AUC is a seductive metric.
If AUC is 0.77, you think:
“Great. The model can separate buyers from non-buyers.”
It can.
But AUC doesn’t care about honesty.
AUC asks:
“Can you rank users correctly?”
Calibration asks:
“Can I trust the number?”
In production, I needed the number.
Because I wasn’t just forecasting.
I was building decision systems:
- When to intervene
- When to discount
- When to stop wasting money
A model that ranks well but lies about probability is not “mostly fine.”
It’s actively harmful.
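The cleanest way to see the difference: AUC only looks at ordering, so any monotone distortion of the scores leaves it untouched while the "probabilities" drift arbitrarily far from reality. A toy demonstration on synthetic data (not the production model):

```python
# Toy demonstration on synthetic data: a monotone distortion of the scores
# keeps AUC exactly the same while inflating the "probabilities".
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.02, size=100_000)                          # ~2% base rate
scores = np.clip(rng.normal(0.45, 0.12, y.size) + 0.15 * y, 0.01, 0.99)
inflated = scores ** 0.2                                          # same ranking, much higher numbers

print(roc_auc_score(y, scores), roc_auc_score(y, inflated))       # identical: the ranking is unchanged
print(scores.mean(), inflated.mean(), y.mean())                   # only the base rate is anywhere near 0.02
```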
The real punchline
The funnel converts at ~2%.
That means 98 out of 100 users don’t buy.
A model that predicts 96% confidence in that world is not confident.
It’s delusional.
This is the trap:
You train in one distribution.
You deploy into another.
And your “probabilities” are just scaled scores pretending to be truth.
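One common mechanism, and this part is my assumption rather than anything the post confirms: the model is trained against a rebalanced or historically richer base rate, so every output is scaled for the wrong prior. The textbook prior-shift correction looks like this:

```python
# Standard prior-shift correction, assuming the only difference between
# training and deployment is the base rate. Both priors below are
# illustrative assumptions, not numbers from the post.
def correct_for_prior(p: float, train_prior: float, deploy_prior: float) -> float:
    odds = p / (1 - p)
    prior_ratio = (deploy_prior / (1 - deploy_prior)) / (train_prior / (1 - train_prior))
    adjusted = odds * prior_ratio
    return adjusted / (1 + adjusted)

print(correct_for_prior(0.96, train_prior=0.5, deploy_prior=0.02))  # ~0.33
```

Even with a generously assumed 50% training prior, 0.96 only comes down to about 0.33, nowhere near the observed ~7%, which is one reason a formula alone can't be the whole fix here.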
The walking dead moment
The most infuriating part:
Nothing visibly broke.
No errors.
No downtime.
No alarms.
The model was returning perfect-looking JSON.
And every number was wrong.
That’s “walking dead” software.
Alive enough to run.
Dead enough to ruin decisions.
Why ranking hides this
If you sort 1,000 users by prediction:
- Top 100: 9 buyers
- Bottom 100: 0 buyers
The model is doing something useful.
That’s what AUC measures.
But if you decide:
“Give discounts to everyone above 0.8”
And 80% of users score above 0.8…
You just bankrupted yourself.
Because the threshold is meaningless.
The model doesn’t know what “80%” means.
It just knows “this user > that user.”
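Which is why, until the numbers are trustworthy, the only safe way to act on a model like this is to spend a fixed budget from the top of the ranking instead of thresholding the raw score. A sketch of the two policies (names are mine):

```python
# Two ways to pick discount recipients. Thresholding trusts the number the
# model returns; top-k only trusts the ordering, which is all AUC certifies.
import numpy as np

def threshold_policy(scores: np.ndarray, cutoff: float = 0.8) -> np.ndarray:
    """Everyone above the cutoff gets a discount — spend is whatever the model says it is."""
    return np.flatnonzero(scores > cutoff)

def top_k_policy(scores: np.ndarray, budget: int) -> np.ndarray:
    """Only the top `budget` users by rank get a discount — spend is fixed, however inflated the scores are."""
    return np.argsort(scores)[::-1][:budget]
```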
The intervention cost
I built an adaptive funnel engine on top of these predictions.
Thompson Sampling.
Contextual bandits.
Real-time interventions.
All of it required calibrated probabilities.
Without calibration, the system was:
- Wasting interventions on users it thought were 95% likely to convert (actually 6%)
- Ignoring users it thought were 20% likely to convert (some of whom would have bought)
The bandit couldn’t learn.
Because the signal was lies.
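The magnitude matters because, underneath the bandit, every intervention is roughly an expected-value bet. A stripped-down sketch of that gate; the lift, order value, and cost numbers are placeholders, not the real engine's parameters:

```python
# Stripped-down expected-value gate behind an intervention decision.
# lift, expected_order_value, and intervention_cost are placeholder assumptions.
def should_intervene(p_convert: float,
                     lift: float = 0.10,
                     expected_order_value: float = 50.0,
                     intervention_cost: float = 2.0) -> bool:
    """Intervene only if the expected incremental revenue beats the cost."""
    return lift * p_convert * expected_order_value > intervention_cost

print(should_intervene(0.95))  # True: 0.10 * 0.95 * 50 = 4.75 > 2.0 — fires on a user who is really ~6% likely
print(should_intervene(0.07))  # False: 0.10 * 0.07 * 50 = 0.35 — the honest number would have said no
```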
The fix (and the humiliation)
The fix wasn’t a new model.
It wasn’t more training.
It was a post-processing step.
A calibration layer.
I tried multiple calibration methods.
Two helped a little.
One actually made the model honest.
The winning method wasn’t elegant.
It wasn’t parametric.
It wasn’t “ML.”
It was a monotonic lookup table built from reality.
Raw score → measured conversion rate.
A model that says “0.9” gets mapped to “~0.07.”
Not because I felt like it.
Because that’s what happened.
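The post above doesn't name the technique, but a monotonic score-to-observed-rate map fitted on held-out outcomes is essentially isotonic regression. A minimal sketch with scikit-learn, on placeholder data standing in for production logs:

```python
# Minimal calibration layer: a monotonic map from raw score to measured
# conversion rate. The arrays below are placeholders for production logs.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(7)
raw_scores = rng.uniform(0.3, 1.0, size=50_000)               # what the model returned
converted = rng.binomial(1, 0.02 + 0.1 * (raw_scores - 0.3))  # what actually happened (0/1)

calibrator = IsotonicRegression(out_of_bounds="clip")         # clamp scores outside the fitted range
calibrator.fit(raw_scores, converted)

print(calibrator.predict([0.9]))  # roughly the conversion rate actually observed near a raw 0.9
```

It isn't elegant, and that's the point: the mapping is whatever the logged outcomes say it is.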
The lesson
A model can be accurate and dead at the same time.
Accurate at ranking.
Dead at telling the truth.
If you’re building an adaptive funnel, probability isn’t a nice-to-have.
It’s the foundation.
Without calibration, your intervention policy is built on lies.
And you don’t notice until you spend money.
So now I treat probability like milk.
It expires.
You don’t “set it and forget it.”
You test it against outcomes.
Regularly.
Because the worst failures don’t crash.
They keep running.
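In code, that recurring test can be one number on a schedule: expected calibration error over the latest window of outcomes, with an alert when it drifts. A sketch; the 0.05 threshold is an arbitrary placeholder:

```python
# Scheduled calibration check: expected calibration error (ECE) over the most
# recent window of predictions and outcomes. The 0.05 threshold is arbitrary.
import numpy as np

def expected_calibration_error(pred: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Weighted average gap between mean predicted probability and observed rate, per bin."""
    bins = np.clip((pred * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(pred[mask].mean() - actual[mask].mean())
    return ece

def calibration_is_stale(pred: np.ndarray, actual: np.ndarray, threshold: float = 0.05) -> bool:
    return expected_calibration_error(pred, actual) > threshold

# Run weekly against last week's predictions and outcomes; page someone when it returns True.
```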
This is what made the next discovery so painful: the same behavioral signal generalized across very different audiences. Same Pattern. Different People.