User Intent Prediction #17: Same User. Opposite Predictions.
I had two models running in production.
Both trained on the same 44,000 users.
Both validated on the same test set.
Both deployed simultaneously.
I pulled 493,963 predictions from production logs.
Model A and Model B had zero correlation.
Pearson r = -0.01.
They disagreed on 58% of users.
The Discovery
I wasn’t looking for this.
I was validating calibration stability after deployment.
I extracted a week of production data:
- 29,677 users
- 493,963 predictions
- Both models predicting simultaneously for every user
I plotted Model A vs Model B predictions.
I expected: A diagonal line (models agree).
I got: A cloud. No pattern. Random scatter.
Correlation: -0.01
That’s not “slightly different.”
That’s “completely unrelated.”
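Here's roughly what that check looks like in pandas. The file name and the `user_id`, `model_a`, `model_b` columns are hypothetical; "disagreement" here is one possible definition (the models land on opposite sides of 0.5).

```python
import pandas as pd

# Hypothetical export of one week of production logs: one row per prediction,
# with both models' purchase-intent scores side by side.
preds = pd.read_parquet("production_predictions.parquet")

# Pearson correlation between the two models' scores ("pearson" is the default).
r = preds["model_a"].corr(preds["model_b"])
print(f"Pearson r = {r:.2f}")

# One way to define "disagreement": the models land on opposite sides of 0.5.
disagree = (preds["model_a"] > 0.5) != (preds["model_b"] > 0.5)
print(f"Disagreement rate: {disagree.mean():.0%}")
```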
The Examples
Here’s what the data looked like:
User #4,832:
- Model A: 92% purchase intent
- Model B: 8% purchase intent
- Gap: 84 points
User #7,291:
- Model A: 7% purchase intent
- Model B: 89% purchase intent
- Gap: 82 points
User #12,445:
- Model A: 95% purchase intent
- Model B: 11% purchase intent
- Gap: 84 points
Same user. Same moment. Same input data.
Opposite conclusions.
I identified 533 cases where the gap exceeded 50 points.
That’s not disagreement. That’s contradiction.
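Flagging those contradictions is a one-liner on the same DataFrame (again, the column names are hypothetical):

```python
# Outright contradictions: the two scores are more than 50 points apart.
preds["gap"] = (preds["model_a"] - preds["model_b"]).abs()
contradictions = preds[preds["gap"] > 0.5]
print(len(contradictions), "predictions with a gap over 50 points")
print(contradictions.nlargest(3, "gap")[["user_id", "model_a", "model_b", "gap"]])
```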
The Pattern
I grouped predictions by confidence level to see where disagreement was highest:
| Confidence Range | Disagreement Rate | Interpretation |
|---|---|---|
| Low (0-30%) | 82% | Models can’t agree on dropouts |
| Medium (30-70%) | 35% | Some agreement on uncertain users |
| High (70-100%) | 84% | Models can’t agree on buyers |
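A sketch of that grouping, assuming the confidence bins are taken on Model A's score (the post doesn't say which model anchors the bin) and reusing the opposite-side-of-0.5 definition of disagreement:

```python
# Bin predictions by Model A's score and compute the disagreement rate per bin.
bins = pd.cut(
    preds["model_a"],
    bins=[0.0, 0.3, 0.7, 1.0],
    labels=["Low (0-30%)", "Medium (30-70%)", "High (70-100%)"],
    include_lowest=True,
)
disagree = (preds["model_a"] > 0.5) != (preds["model_b"] > 0.5)
print(disagree.groupby(bins, observed=True).mean().map("{:.0%}".format))
```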
They disagreed most at the extremes.
When Model A was confident (high or low), Model B was confident in the opposite direction.
Translation:
Model A: “This user will definitely buy!”
Model B: “This user will definitely drop out.”
They can’t both be right.
The Averages Tell the Story
Across all 493,963 predictions:
Model A: Mean prediction = 0.72
Model B: Mean prediction = 0.54
Gap: 18 points
Model A was systematically more optimistic.
Model B was systematically more conservative.
For every 100 users:
- Model A predicted ~72 would buy
- Model B predicted ~54 would buy
Actual conversion rate: ~2%
Both were wildly overconfident (despite calibration fixes).
But Model A was more wrong, more confidently.
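One quick way to quantify that: compare each model's mean prediction to the ~2% base rate cited above.

```python
# Mean predicted probability vs. the observed conversion base rate (~2%).
base_rate = 0.02
for name in ["model_a", "model_b"]:
    mean_pred = preds[name].mean()
    print(f"{name}: mean={mean_pred:.2f}, {mean_pred / base_rate:.0f}x the base rate")
```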
Why This Happened
The models weren’t broken.
They were learning different patterns.
Model A: Simple behavioral features
- Trained on: motivation_score (dwell time + idle gaps + position)
- Architecture: Dense neural network, 1 input feature
- Pattern: “High motivation = purchase”
Model B: Sequence model + behavioral features
- Trained on: Full screen sequences (LSTM) + motivation_score
- Architecture: LSTM encoder (128 dims) + dense head
- Pattern: “Temporal dynamics + motivation = purchase”
Model A learned: Static snapshots of user state.
Model B learned: Temporal trajectories through the funnel.
They weren’t measuring the same thing.
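Here's a rough sketch of the two architectures. The single-feature input and the 128-dim LSTM come from the description above; the framework (Keras here), hidden-layer width, and input encodings are my assumptions.

```python
import tensorflow as tf  # framework choice is an assumption

# Model A: one behavioral input (motivation_score) through a small dense net.
model_a = tf.keras.Sequential([
    tf.keras.Input(shape=(1,)),                      # motivation_score
    tf.keras.layers.Dense(32, activation="relu"),    # hidden width is a guess
    tf.keras.layers.Dense(1, activation="sigmoid"),  # purchase-intent probability
])

# Model B: LSTM encoder (128 dims) over the screen sequence, concatenated
# with motivation_score, then a dense head.
seq_in = tf.keras.Input(shape=(None, 1), name="screen_sequence")  # assumed encoding
motiv_in = tf.keras.Input(shape=(1,), name="motivation_score")
encoded = tf.keras.layers.LSTM(128)(seq_in)
merged = tf.keras.layers.Concatenate()([encoded, motiv_in])
intent = tf.keras.layers.Dense(1, activation="sigmoid")(merged)
model_b = tf.keras.Model(inputs=[seq_in, motiv_in], outputs=intent)
```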
Example:
User at Screen 45:
- Motivation score: 0.85 (high)
- Recent sequence: Screen 40 → 41 → 42 → 43 → 44 → 45 (fast progression)
Model A sees: High motivation (0.85) → Predict 90% intent
Model B sees: Fast progression = losing interest → Predict 12% intent
Model A saw a motivated user.
Model B saw a user rushing to quit.
Who’s right?
I didn’t know. I didn’t have ground truth yet.
But the fact they disagreed meant they were capturing fundamentally different signals.
The Implications
I built this system to make decisions:
Decision 1: Adaptive Discounts
If purchase_probability > 0.8, show full price.
If purchase_probability < 0.3, show discount.
Problem: Model A says 0.92, Model B says 0.08.
Do I show full price or discount?
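As code, Decision 1 looks something like this (the middle band's behavior isn't specified here, so "default offer" is a placeholder):

```python
def pricing_action(p_purchase: float) -> str:
    """Adaptive discount rule: full price for likely buyers, discount for likely dropouts."""
    if p_purchase > 0.8:
        return "full price"
    if p_purchase < 0.3:
        return "discount"
    return "default offer"  # middle band: not specified

# User #4,832: the two models demand opposite actions.
print(pricing_action(0.92))  # Model A says full price
print(pricing_action(0.08))  # Model B says discount
```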
Decision 2: Intervention Timing
Show targeted message when purchase_probability crosses 0.6.
Problem: Model A crosses 0.6 at Screen 10. Model B never crosses 0.6.
When do I intervene?
I built an adaptive funnel engine on contradictory predictions.
No wonder doing nothing won.
The Ensemble Attempt
I tried combining them:
Simple Average: (Model A + Model B) / 2
- User #4,832: (0.92 + 0.08) / 2 = 0.50
- User #7,291: (0.07 + 0.89) / 2 = 0.48
Result: Everything regresses to 0.5.
Weighted Average: 0.6 * Model A + 0.4 * Model B
- User #4,832: 0.6 * 0.92 + 0.4 * 0.08 = 0.58
- User #7,291: 0.6 * 0.07 + 0.4 * 0.89 = 0.40
Result: Slightly better, but still averaging out disagreement.
Max/Min: Take the higher or lower prediction
- Problem: Amplifies whichever model is more confident (often wrong)
The fundamental issue:
When models have zero correlation, averaging doesn’t create wisdom—it creates noise.
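Replaying those attempts on the two example users makes the washout obvious:

```python
import numpy as np

a = np.array([0.92, 0.07])  # Model A for users #4,832 and #7,291
b = np.array([0.08, 0.89])  # Model B for the same users

simple_avg = (a + b) / 2      # ~[0.50, 0.48]: everything drifts toward 0.5
weighted = 0.6 * a + 0.4 * b  # ~[0.58, 0.40]: still averages out the disagreement
take_max = np.maximum(a, b)   # ~[0.92, 0.89]: amplifies whichever model shouts loudest
```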
I couldn’t evaluate which ensemble worked best because I didn’t have purchase outcomes yet.
That validation came later (spoiler: neither model was right).
The Hypothesis
This wasn’t a bug. It was a discovery.
Model A (motivation-only): Captures static intent signals.
Model B (sequences+motivation): Captures temporal dynamics.
The fact they disagreed proved they were learning complementary information.
If they agreed perfectly, I wouldn’t need two models.
The hypothesis:
Maybe Model A is better for users with stable motivation.
Maybe Model B is better for users with volatile behavior.
Maybe an ensemble could outperform both.
The problem:
I couldn’t test this without ground truth.
I needed actual purchase outcomes to know who was right.
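For the record, the hypothesized routing is trivial to express. `motivation_volatility` is a hypothetical per-user feature (e.g. the spread of motivation_score over the session), and the cutoff is arbitrary until there are outcomes to tune it against.

```python
def routed_prediction(p_a: float, p_b: float, motivation_volatility: float) -> float:
    """Hypothesized ensemble: trust Model A for stable users, Model B for volatile ones."""
    return p_a if motivation_volatility < 0.2 else p_b  # 0.2 cutoff is a placeholder
```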
The Validation Plan
I couldn’t wait for A/B test results (weeks of data).
I needed validation now.
So I did something risky:
I queried our database for actual purchases from the past week.
Matched them against the 493,963 predictions.
Found 491 real purchase outcomes.
Ran the comparison.
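A sketch of that matching and scoring step. `db_engine`, `week_start`, and the purchases table are placeholders, `preds` is the prediction DataFrame from the earlier sketch, and the metrics (AUC, Brier) are my choice of yardstick.

```python
from sklearn.metrics import brier_score_loss, roc_auc_score
import pandas as pd

# Pull last week's purchases (schema is hypothetical) and label the predictions.
purchases = pd.read_sql(
    "SELECT DISTINCT user_id, 1 AS purchased FROM purchases "
    "WHERE purchased_at >= %(start)s",
    con=db_engine,
    params={"start": week_start},
)
labeled = preds.merge(purchases, on="user_id", how="left")
labeled["purchased"] = labeled["purchased"].fillna(0).astype(int)

# Score each model against the real outcomes.
for name in ["model_a", "model_b"]:
    auc = roc_auc_score(labeled["purchased"], labeled[name])
    brier = brier_score_loss(labeled["purchased"], labeled[name])
    print(f"{name}: AUC={auc:.3f}  Brier={brier:.3f}")
```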
The results shocked me: The Honest Liar.
The Lesson
Zero correlation doesn’t mean both models are broken.
It means they’re learning different patterns.
Model A: “Motivated users buy.”
Model B: “User trajectories predict purchase.”
Both could be capturing real signals.
Both could be missing critical information.
Both could be overconfident liars.
The only way to know: ground truth.
Without actual purchase outcomes, disagreement is just noise.
With purchase outcomes, disagreement becomes insight.
The checklist I should have followed:
- ✅ Deploy multiple models to production
- ❌ Validate correlation between model predictions
- ❌ Identify systematic biases (Model A optimistic, Model B conservative)
- ❌ Test against real purchase outcomes BEFORE trusting predictions
- ❌ Build ensemble strategies only after validation
I did step 1.
I skipped steps 2-5.
When two models contradict each other, don’t guess. Measure.
Because a 58% disagreement rate isn’t a bug: it’s an unanswered question.
This confusion led me to validate both models against real purchase data—and discover both were confidently wrong.