User Intent Prediction #19: Thirty Days. Walking Dead.
The model didn’t die.
It kept running.
It kept returning numbers.
It kept looking confident.
That’s what made it dangerous.
It was walking dead.
The lie
I opened a production report and saw this:
High-confidence users (0.9–1.0):
- Predicted: 96.6% conversion
- Actual: 7.1% conversion
That’s not “a bit off.”
That’s a hallucination.
The model wasn’t wrong by 5%.
It was wrong by 89.5 percentage points.
And it wasn’t a one-off.
Medium-high confidence (0.7–0.8):
- Predicted: 74.5%
- Actual: 7.1%
Medium (0.5–0.6):
- Predicted: 57.5%
- Actual: 6.2%
Reality was screaming the same message:
Your probabilities are fiction.
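If you want to catch this in your own numbers, the check is one group-by: mean predicted probability vs. actual conversion rate per score bucket. A minimal sketch, assuming a DataFrame with `pred` and `converted` columns (the names are mine, not from the report):

```python
# Minimal reliability check: mean predicted probability vs. actual conversion
# rate per score bucket. Column names `pred` and `converted` are assumptions.
import pandas as pd

def reliability_table(df: pd.DataFrame, n_bins: int = 10) -> pd.DataFrame:
    buckets = pd.cut(df["pred"], bins=n_bins)
    return (
        df.groupby(buckets, observed=True)
          .agg(predicted=("pred", "mean"),
               actual=("converted", "mean"),
               users=("converted", "size"))
    )

# reliability_table(last_30_days)  # the gap between "predicted" and "actual" is the whole story
```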
Why this happens
If you’ve never deployed prediction models, you’ll miss this.
AUC is a seductive metric.
If AUC is 0.77, you think:
“Great. The model can separate buyers from non-buyers.”
It can.
But AUC doesn’t care about honesty.
AUC asks:
“Can you rank users correctly?”
Calibration asks:
“Can I trust the number?”
In production, I needed the number.
Because I wasn’t just forecasting.
I was building decision systems:
- When to intervene
- When to discount
- When to stop wasting money
A model that ranks well but lies about probability is not “mostly fine.”
It’s actively harmful.
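The cleanest way to see the difference: AUC only looks at ordering, so any monotone distortion of the scores leaves it untouched while the "probabilities" drift arbitrarily far from reality. A toy demonstration on synthetic data (not the production model):

```python
# Toy demonstration on synthetic data: a monotone distortion of the scores
# keeps AUC exactly the same while inflating the "probabilities".
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.02, size=100_000)                          # ~2% base rate
scores = np.clip(rng.normal(0.45, 0.12, y.size) + 0.15 * y, 0.01, 0.99)
inflated = scores ** 0.2                                          # same ranking, much higher numbers

print(roc_auc_score(y, scores), roc_auc_score(y, inflated))       # identical: the ranking is unchanged
print(scores.mean(), inflated.mean(), y.mean())                   # only the base rate is anywhere near 0.02
```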
The real punchline
The funnel converts at ~2%.
That means 98 out of 100 users don’t buy.
A model that predicts 96% confidence in that world is not confident.
It’s delusional.
This is the trap:
You train in one distribution.
You deploy into another.
And your “probabilities” are just scaled scores pretending to be truth.
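One common mechanism, and this part is my assumption rather than anything the post confirms: the model is trained against a rebalanced or historically richer base rate, so every output is scaled for the wrong prior. The textbook prior-shift correction looks like this:

```python
# Standard prior-shift correction, assuming the only difference between
# training and deployment is the base rate. Both priors below are
# illustrative assumptions, not numbers from the post.
def correct_for_prior(p: float, train_prior: float, deploy_prior: float) -> float:
    odds = p / (1 - p)
    prior_ratio = (deploy_prior / (1 - deploy_prior)) / (train_prior / (1 - train_prior))
    adjusted = odds * prior_ratio
    return adjusted / (1 + adjusted)

print(correct_for_prior(0.96, train_prior=0.5, deploy_prior=0.02))  # ~0.33
```

Even with a generously assumed 50% training prior, 0.96 only comes down to about 0.33, nowhere near the observed ~7%, which is one reason a formula alone can't be the whole fix here.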
The walking dead moment
The most infuriating part:
Nothing visibly broke.
No errors.
No downtime.
No alarms.
The model was returning perfect-looking JSON.
And every number was wrong.
That’s “walking dead” software.
Alive enough to run.
Dead enough to ruin decisions.
Why ranking hides this
If you sort 1,000 users by prediction:
- Top 100: 9 buyers
- Bottom 100: 0 buyers
The model is doing something useful.
That’s what AUC measures.
But if you decide:
“Give discounts to everyone above 0.8”
And 80% of users score above 0.8…
You just bankrupted yourself.
Because the threshold is meaningless.
The model doesn’t know what “80%” means.
It just knows “this user > that user.”
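Which is why, until the numbers are trustworthy, the only safe way to act on a model like this is to spend a fixed budget from the top of the ranking instead of thresholding the raw score. A sketch of the two policies (names are mine):

```python
# Two ways to pick discount recipients. Thresholding trusts the number the
# model returns; top-k only trusts the ordering, which is all AUC certifies.
import numpy as np

def threshold_policy(scores: np.ndarray, cutoff: float = 0.8) -> np.ndarray:
    """Everyone above the cutoff gets a discount — spend is whatever the model says it is."""
    return np.flatnonzero(scores > cutoff)

def top_k_policy(scores: np.ndarray, budget: int) -> np.ndarray:
    """Only the top `budget` users by rank get a discount — spend is fixed, however inflated the scores are."""
    return np.argsort(scores)[::-1][:budget]
```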
The intervention cost
I built an adaptive funnel engine on top of these predictions.
Thompson Sampling.
Contextual bandits.
Real-time interventions.
All of it required calibrated probabilities.
Without calibration, the system was:
- Wasting interventions on users it thought were 95% likely to convert (actually 6%)
- Ignoring users it thought were 20% likely to convert (some of whom would have bought)
The bandit couldn’t learn.
Because the signal was lies.
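The magnitude matters because, underneath the bandit, every intervention is roughly an expected-value bet. A stripped-down sketch of that gate; the lift, order value, and cost numbers are placeholders, not the real engine's parameters:

```python
# Stripped-down expected-value gate behind an intervention decision.
# lift, expected_order_value, and intervention_cost are placeholder assumptions.
def should_intervene(p_convert: float,
                     lift: float = 0.10,
                     expected_order_value: float = 50.0,
                     intervention_cost: float = 2.0) -> bool:
    """Intervene only if the expected incremental revenue beats the cost."""
    return lift * p_convert * expected_order_value > intervention_cost

print(should_intervene(0.95))  # True: 0.10 * 0.95 * 50 = 4.75 > 2.0 — fires on a user who is really ~6% likely
print(should_intervene(0.07))  # False: 0.10 * 0.07 * 50 = 0.35 — the honest number would have said no
```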
The fix (and the humiliation)
The fix wasn’t a new model.
It wasn’t more training.
It was a post-processing step.
A calibration layer.
I tried multiple calibration methods.
Two helped a little.
One actually made the model honest.
The winning method wasn’t elegant.
It wasn’t parametric.
It wasn’t “ML.”
It was a monotonic lookup table built from reality.
Raw score → measured conversion rate.
A model that says “0.9” gets mapped to “~0.07.”
Not because I felt like it.
Because that’s what happened.
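The post above doesn't name the technique, but a monotonic score-to-observed-rate map fitted on held-out outcomes is essentially isotonic regression. A minimal sketch with scikit-learn, on placeholder data standing in for production logs:

```python
# Minimal calibration layer: a monotonic map from raw score to measured
# conversion rate. The arrays below are placeholders for production logs.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(7)
raw_scores = rng.uniform(0.3, 1.0, size=50_000)               # what the model returned
converted = rng.binomial(1, 0.02 + 0.1 * (raw_scores - 0.3))  # what actually happened (0/1)

calibrator = IsotonicRegression(out_of_bounds="clip")         # clamp scores outside the fitted range
calibrator.fit(raw_scores, converted)

print(calibrator.predict([0.9]))  # roughly the conversion rate actually observed near a raw 0.9
```

It isn't elegant, and that's the point: the mapping is whatever the logged outcomes say it is.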
The lesson
A model can be accurate and dead at the same time.
Accurate at ranking.
Dead at telling the truth.
If you’re building an adaptive funnel, probability isn’t a nice-to-have.
It’s the foundation.
Without calibration, your intervention policy is built on lies.
And you don’t notice until you spend money.
So now I treat probability like milk.
It expires.
You don’t “set it and forget it.”
You test it against outcomes.
Regularly.
Because the worst failures don’t crash.
They keep running.
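In code, that recurring test can be one number on a schedule: expected calibration error over the latest window of outcomes, with an alert when it drifts. A sketch; the 0.05 threshold is an arbitrary placeholder:

```python
# Scheduled calibration check: expected calibration error (ECE) over the most
# recent window of predictions and outcomes. The 0.05 threshold is arbitrary.
import numpy as np

def expected_calibration_error(pred: np.ndarray, actual: np.ndarray, n_bins: int = 10) -> float:
    """Weighted average gap between mean predicted probability and observed rate, per bin."""
    bins = np.clip((pred * n_bins).astype(int), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(pred[mask].mean() - actual[mask].mean())
    return ece

def calibration_is_stale(pred: np.ndarray, actual: np.ndarray, threshold: float = 0.05) -> bool:
    return expected_calibration_error(pred, actual) > threshold

# Run weekly against last week's predictions and outcomes; page someone when it returns True.
```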
This is what made the next discovery so painful: the same behavioral signal generalized across very different audiences. Same Pattern. Different People.