User Intent Prediction #17: Same User. Opposite Predictions.
I had two models running in production.
Both trained on the same 44,000 users.
Both validated on the same test set.
Both deployed simultaneously.
I pulled 493,963 predictions from production logs.
Model A and Model B had zero correlation.
Pearson r = -0.01.
They disagreed on 58% of users.
The Discovery
I wasn’t looking for this.
I was validating calibration stability after deployment.
I extracted a week of production data:
- 29,677 users
- 493,963 predictions
- Both models predicting simultaneously for every user
I plotted Model A vs Model B predictions.
I expected: A diagonal line (models agree).
I got: A cloud. No pattern. Random scatter.
Correlation: -0.01
That’s not “slightly different.”
That’s “completely unrelated.”
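Here's roughly what that check looks like in pandas. The file name and the `user_id`, `model_a`, `model_b` columns are hypothetical; "disagreement" here is one possible definition (the models land on opposite sides of 0.5).

```python
import pandas as pd

# Hypothetical export of one week of production logs: one row per prediction,
# with both models' purchase-intent scores side by side.
preds = pd.read_parquet("production_predictions.parquet")

# Pearson correlation between the two models' scores ("pearson" is the default).
r = preds["model_a"].corr(preds["model_b"])
print(f"Pearson r = {r:.2f}")

# One way to define "disagreement": the models land on opposite sides of 0.5.
disagree = (preds["model_a"] > 0.5) != (preds["model_b"] > 0.5)
print(f"Disagreement rate: {disagree.mean():.0%}")
```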
The Examples
Here’s what the data looked like:
User #4,832:
- Model A: 92% purchase intent
- Model B: 8% purchase intent
- Gap: 84 points
User #7,291:
- Model A: 7% purchase intent
- Model B: 89% purchase intent
- Gap: 82 points
User #12,445:
- Model A: 95% purchase intent
- Model B: 11% purchase intent
- Gap: 84 points
Same user. Same moment. Same input data.
Opposite conclusions.
I identified 533 cases where the gap exceeded 50 points.
That’s not disagreement. That’s contradiction.
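Flagging those contradictions is a one-liner on the same DataFrame (again, the column names are hypothetical):

```python
# Outright contradictions: the two scores are more than 50 points apart.
preds["gap"] = (preds["model_a"] - preds["model_b"]).abs()
contradictions = preds[preds["gap"] > 0.5]
print(len(contradictions), "predictions with a gap over 50 points")
print(contradictions.nlargest(3, "gap")[["user_id", "model_a", "model_b", "gap"]])
```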
The Pattern
I grouped predictions by confidence level to see where disagreement was highest:
| Confidence Range | Disagreement Rate | Interpretation |
|---|---|---|
| Low (0-30%) | 82% | Models can’t agree on dropouts |
| Medium (30-70%) | 35% | Some agreement on uncertain users |
| High (70-100%) | 84% | Models can’t agree on buyers |
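A sketch of that grouping, assuming the confidence bins are taken on Model A's score (the post doesn't say which model anchors the bin) and reusing the opposite-side-of-0.5 definition of disagreement:

```python
# Bin predictions by Model A's score and compute the disagreement rate per bin.
bins = pd.cut(
    preds["model_a"],
    bins=[0.0, 0.3, 0.7, 1.0],
    labels=["Low (0-30%)", "Medium (30-70%)", "High (70-100%)"],
    include_lowest=True,
)
disagree = (preds["model_a"] > 0.5) != (preds["model_b"] > 0.5)
print(disagree.groupby(bins, observed=True).mean().map("{:.0%}".format))
```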
They disagreed most at the extremes.
When Model A was confident (high or low), Model B was confident in the opposite direction.
Translation:
Model A: “This user will definitely buy!”
Model B: “This user will definitely drop out.”
They can’t both be right.
The Averages Tell the Story
Across all 493,963 predictions:
Model A: Mean prediction = 0.72
Model B: Mean prediction = 0.54
Gap: 18 points
Model A was systematically more optimistic.
Model B was systematically more conservative.
For every 100 users:
- Model A predicted ~72 would buy
- Model B predicted ~54 would buy
Actual conversion rate: ~2%
Both were wildly overconfident (despite calibration fixes).
But Model A was more wrong, more confidently.
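One quick way to quantify that: compare each model's mean prediction to the ~2% base rate cited above.

```python
# Mean predicted probability vs. the observed conversion base rate (~2%).
base_rate = 0.02
for name in ["model_a", "model_b"]:
    mean_pred = preds[name].mean()
    print(f"{name}: mean={mean_pred:.2f}, {mean_pred / base_rate:.0f}x the base rate")
```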
Why This Happened
The models weren’t broken.
They were learning different patterns.
Model A: Simple behavioral features
- Trained on: motivation_score (dwell time + idle gaps + position)
- Architecture: Dense neural network, 1 input feature
- Pattern: “High motivation = purchase”
Model B: Sequence model + behavioral features
- Trained on: Full screen sequences (LSTM) + motivation_score
- Architecture: LSTM encoder (128 dims) + dense head
- Pattern: “Temporal dynamics + motivation = purchase”
Model A learned: Static snapshots of user state.
Model B learned: Temporal trajectories through the funnel.
They weren’t measuring the same thing.
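Here's a rough sketch of the two architectures. The single-feature input and the 128-dim LSTM come from the description above; the framework (Keras here), hidden-layer width, and input encodings are my assumptions.

```python
import tensorflow as tf  # framework choice is an assumption

# Model A: one behavioral input (motivation_score) through a small dense net.
model_a = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),                      # motivation_score
    tf.keras.layers.Dense(32, activation="relu"),    # hidden width is a guess
    tf.keras.layers.Dense(1, activation="sigmoid"),  # purchase-intent probability
])

# Model B: LSTM encoder (128 dims) over the screen sequence, concatenated
# with motivation_score, then a dense head.
seq_in = tf.keras.Input(shape=(None, 1), name="screen_sequence")  # assumed encoding
motiv_in = tf.keras.Input(shape=(1,), name="motivation_score")
encoded = tf.keras.layers.LSTM(128)(seq_in)
merged = tf.keras.layers.Concatenate()([encoded, motiv_in])
intent = tf.keras.layers.Dense(1, activation="sigmoid")(merged)
model_b = tf.keras.Model(inputs=[seq_in, motiv_in], outputs=intent)
```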
Example:
User at Screen 45:
- Motivation score: 0.85 (high)
- Recent sequence: Screen 40 → 41 → 42 → 43 → 44 → 45 (fast progression)
Model A sees: High motivation (0.85) → Predict 90% intent
Model B sees: Fast progression = losing interest → Predict 12% intent
Model A saw a motivated user.
Model B saw a user rushing to quit.
Who’s right?
I didn’t know. I didn’t have ground truth yet.
But the fact they disagreed meant they were capturing fundamentally different signals.
The Implications
I built this system to make decisions:
Decision 1: Adaptive Discounts
If purchase_probability > 0.8, show full price.
If purchase_probability < 0.3, show discount.
Problem: Model A says 0.92, Model B says 0.08.
Do I show full price or discount?
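As code, Decision 1 looks something like this (the middle band's behavior isn't specified here, so "default offer" is a placeholder):

```python
def pricing_action(p_purchase: float) -> str:
    """Adaptive discount rule: full price for likely buyers, discount for likely dropouts."""
    if p_purchase > 0.8:
        return "full price"
    if p_purchase < 0.3:
        return "discount"
    return "default offer"  # middle band: not specified

# User #4,832: the two models demand opposite actions.
print(pricing_action(0.92))  # Model A says full price
print(pricing_action(0.08))  # Model B says discount
```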
Decision 2: Intervention Timing
Show targeted message when purchase_probability crosses 0.6.
Problem: Model A crosses 0.6 at Screen 10. Model B never crosses 0.6.
When do I intervene?
I built an adaptive funnel engine on contradictory predictions.
No wonder doing nothing won.
The Ensemble Attempt
I tried combining them:
Simple Average: (Model A + Model B) / 2
- User #4,832: (0.92 + 0.08) / 2 = 0.50
- User #7,291: (0.07 + 0.89) / 2 = 0.48
Result: Everything regresses to 0.5.
Weighted Average: 0.6 * Model A + 0.4 * Model B
- User #4,832: 0.6 * 0.92 + 0.4 * 0.08 = 0.58
- User #7,291: 0.6 * 0.07 + 0.4 * 0.89 = 0.40
Result: Slightly better, but still averaging out disagreement.
Max/Min: Take the higher or lower prediction
- Problem: Amplifies whichever model is more confident (often wrong)
The fundamental issue:
When models have zero correlation, averaging doesn’t create wisdom—it creates noise.
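Replaying those attempts on the two example users makes the washout obvious:

```python
import numpy as np

a = np.array([0.92, 0.07])  # Model A for users #4,832 and #7,291
b = np.array([0.08, 0.89])  # Model B for the same users

simple_avg = (a + b) / 2      # ~[0.50, 0.48]: everything drifts toward 0.5
weighted = 0.6 * a + 0.4 * b  # ~[0.58, 0.40]: still averages out the disagreement
take_max = np.maximum(a, b)   # ~[0.92, 0.89]: amplifies whichever model shouts loudest
```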
I couldn’t evaluate which ensemble worked best because I didn’t have purchase outcomes yet.
That validation came later (spoiler: neither model was right).
The Hypothesis
This wasn’t a bug. It was a discovery.
Model A (motivation-only): Captures static intent signals.
Model B (sequences+motivation): Captures temporal dynamics.
The fact they disagreed proved they were learning complementary information.
If they agreed perfectly, I wouldn’t need two models.
The hypothesis:
Maybe Model A is better for users with stable motivation.
Maybe Model B is better for users with volatile behavior.
Maybe an ensemble could outperform both.
The problem:
I couldn’t test this without ground truth.
I needed actual purchase outcomes to know who was right.
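For the record, the hypothesized routing is trivial to express. `motivation_volatility` is a hypothetical per-user feature (e.g. the spread of motivation_score over the session), and the cutoff is arbitrary until there are outcomes to tune it against.

```python
def routed_prediction(p_a: float, p_b: float, motivation_volatility: float) -> float:
    """Hypothesized ensemble: trust Model A for stable users, Model B for volatile ones."""
    return p_a if motivation_volatility < 0.2 else p_b  # 0.2 cutoff is a placeholder
```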
The Validation Plan
I couldn’t wait for A/B test results (weeks of data).
I needed validation now.
So I did something risky:
I queried our database for actual purchases from the past week.
Matched them against the 493,963 predictions.
Found 491 real purchase outcomes.
Ran the comparison.
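A sketch of that matching and scoring step. `db_engine`, `week_start`, and the purchases table are placeholders, `preds` is the prediction DataFrame from the earlier sketch, and the metrics (AUC, Brier) are my choice of yardstick.

```python
from sklearn.metrics import brier_score_loss, roc_auc_score
import pandas as pd

# Pull last week's purchases (schema is hypothetical) and label the predictions.
purchases = pd.read_sql(
    "SELECT DISTINCT user_id, 1 AS purchased FROM purchases "
    "WHERE purchased_at >= %(start)s",
    con=db_engine,
    params={"start": week_start},
)
labeled = preds.merge(purchases, on="user_id", how="left")
labeled["purchased"] = labeled["purchased"].fillna(0).astype(int)

# Score each model against the real outcomes.
for name in ["model_a", "model_b"]:
    auc = roc_auc_score(labeled["purchased"], labeled[name])
    brier = brier_score_loss(labeled["purchased"], labeled[name])
    print(f"{name}: AUC={auc:.3f}  Brier={brier:.3f}")
```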
The results shocked me: The Honest Liar.
The Lesson
Zero correlation doesn’t mean both models are broken.
It means they’re learning different patterns.
Model A: “Motivated users buy.”
Model B: “User trajectories predict purchase.”
Both could be capturing real signals.
Both could be missing critical information.
Both could be overconfident liars.
The only way to know: ground truth.
Without actual purchase outcomes, disagreement is just noise.
With purchase outcomes, disagreement becomes insight.
The checklist I should have followed:
- ✅ Deploy multiple models to production
- ❌ Validate correlation between model predictions
- ❌ Identify systematic biases (Model A optimistic, Model B conservative)
- ❌ Test against real purchase outcomes BEFORE trusting predictions
- ❌ Build ensemble strategies only after validation
I did step 1.
I skipped steps 2-5.
When two models contradict each other, don’t guess. Measure.
Because a 58% disagreement rate isn’t a bug: it’s an unanswered question.
This confusion led me to validate both models against real purchase data—and discover both were confidently wrong.