Sergey Kopanev - Entrepreneur & Systems Architect


User Intent Prediction #16: One Month Passed. Models Became Strangers.


I trained my model in October. It worked great.

In November, I tested it on new users.

It failed spectacularly.

Not “slightly worse.” Not “needs tuning.”

Completely broken.

The model that predicted purchase intent with 77% accuracy in October was now guessing randomly in November.

The Shock

I ran the October model on November data expecting minor drift.

Maybe 5% accuracy drop. Maybe 10% if I was unlucky.

I had read about model drift. I knew it was a thing. But I thought it happened over months, maybe quarters. Not weeks.

The Reality:

  • Sequence length distribution: Collapsed by 96%
  • KL Divergence: 22.39 (severe drift)
  • PSI Score: 27.13 (critical)
  • KS Statistic: 0.738 (models are strangers)

These aren’t “needs calibration” numbers.

These are “your model is dead” numbers.

For context: a KL divergence above 0.30 is considered “severe drift” and triggers emergency retraining. Mine was 22.39.

A PSI (Population Stability Index) above 0.25 means “significant population shift.” Mine was 27.13.

The Kolmogorov-Smirnov statistic measures the maximum gap between two cumulative distributions: 0 means identical, 1 means fully separated. A value of 0.738 means the two distributions barely overlap.

Translation: My October model and November data were living in different universes.
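For reference, all three metrics come from comparing the same two samples of per-user sequence lengths. A minimal sketch of how they can be computed (the binning here is illustrative, not the exact setup from my pipeline):

```python
import numpy as np
from scipy.stats import ks_2samp

def drift_metrics(train_lengths, prod_lengths, n_bins=20):
    """Compare two samples of per-user sequence lengths.

    Returns (KL divergence, PSI, KS statistic). Bin edges are fixed
    from the training sample so both histograms are comparable.
    """
    edges = np.histogram_bin_edges(train_lengths, bins=n_bins)
    p, _ = np.histogram(train_lengths, bins=edges)
    q, _ = np.histogram(prod_lengths, bins=edges)

    eps = 1e-6  # smooth empty bins so the logs stay finite
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()

    kl = float(np.sum(p * np.log(p / q)))           # KL(train || prod)
    psi = float(np.sum((p - q) * np.log(p / q)))    # Population Stability Index
    ks = float(ks_2samp(train_lengths, prod_lengths).statistic)
    return kl, psi, ks
```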

The Discovery Process

I didn’t notice this immediately.

I was busy fixing calibration issues and dealing with position 40 instability.

Then I decided to do a routine health check: “Let’s see how the October model performs on fresh November data.”

I extracted 30 days of production data from BigQuery:

  • 39,693 users
  • 788,274 events
  • 789 purchases (1.99% conversion)
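The pull itself is one query plus a couple of aggregations. A rough sketch, with placeholder table, column, and event names standing in for the real schema:

```python
from google.cloud import bigquery

client = bigquery.Client()  # assumes application-default credentials

# `project.analytics.events`, the column names, and the 'purchase'
# event name are placeholders, not the real schema.
sql = """
SELECT user_id, event_name, event_timestamp
FROM `project.analytics.events`
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
"""
events = client.query(sql).to_dataframe()

n_users = events["user_id"].nunique()
buyers = events.loc[events["event_name"] == "purchase", "user_id"].nunique()
print(f"{n_users:,} users, {len(events):,} events, "
      f"{buyers:,} purchases ({buyers / n_users:.2%} conversion)")
```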

I ran the October model on this data.

The predictions were nonsense.

Users with 2-3 events were getting 90%+ purchase probability.

Users with 50+ events were getting 10% purchase probability.

The model was backwards.

I thought: “Maybe I have a bug in my evaluation script.”

I checked. No bug.

I thought: “Maybe the data extraction is wrong.”

I checked. Data was fine.

Then I plotted the sequence length distributions.

That’s when I saw it.
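The plot was nothing fancier than two overlaid histograms of events per user. A sketch, using synthetic placeholder arrays so it runs standalone; in practice the inputs are the per-user event counts for each period (e.g. a groupby-size over the event table):

```python
import numpy as np
import matplotlib.pyplot as plt

# Synthetic placeholders purely so this snippet runs on its own.
rng = np.random.default_rng(0)
oct_lengths = rng.normal(51, 6, 44_134).clip(min=1)
nov_lengths = rng.exponential(3, 15_627).clip(min=1)

plt.hist(oct_lengths, bins=60, density=True, alpha=0.5, label="October (training)")
plt.hist(nov_lengths, bins=60, density=True, alpha=0.5, label="November (production)")
plt.xlabel("Events per user")
plt.ylabel("Density")
plt.legend()
plt.title("Sequence length: training vs. production")
plt.show()
```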

The Numbers Don’t Lie

October Training Data (PCOS cohort):

  • Median sequence length: 51 events
  • P95 sequence length: 61 events
  • Users: 44,134 (full funnel journeys)
  • Timespan: Complete user lifecycle (onboarding → outcome)

November Production Data (PCOS cohort):

  • Median sequence length: 2 events (-96.1%)
  • P95 sequence length: 58 events (-4.9%)
  • Users: 15,627 (30-day monitoring window)
  • Timespan: Recent activity only (30-day snapshot)

Wait. The median dropped from 51 to 2?

Yes.

My model was trained on users who completed 50+ screens.

My production data was capturing users who clicked 2 times and left.

The model had never seen this before.

It’s like training a language model on novels and then asking it to predict the next word in random Twitter fragments.

The model doesn’t just perform poorly—it’s fundamentally confused.

The Root Cause

October data represented full user journeys:

  • User starts onboarding
  • Answers 50+ quiz questions
  • Sees their personalized plan
  • Hits the paywall
  • Either buys or quits

Complete stories. Beginning to end.

November data represented 30-day production monitoring:

  • User logs in today
  • Clicks 2 screens
  • Leaves
  • Might come back tomorrow, might not

Incomplete snapshots. Random fragments.

The model was trained on novels and tested on random sentences.

Why did this happen?

Because I changed my data extraction query.

In October, I pulled “all users who completed the funnel” (either purchased or dropped out after seeing the paywall).

In November, I pulled “all users active in the last 30 days” (to monitor production performance).

Different query = Different distribution = Dead model.
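Side by side, the two filters look innocuous. A sketch of what each extraction amounts to (table and event names are placeholders):

```python
# October: a user qualifies once they have reached the paywall, and we pull
# their ENTIRE event history: complete journeys from onboarding to outcome.
OCT_FILTER = """
WHERE user_id IN (
  SELECT user_id
  FROM `project.analytics.events`
  WHERE event_name = 'paywall_view'  -- placeholder event name
)
"""

# November: anyone active in the trailing 30 days, keeping only the events
# that fall inside that window: fragments, not journeys.
NOV_FILTER = """
WHERE event_timestamp >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
"""
```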

I didn’t think this would matter. I thought “user behavior is user behavior.”

I was wrong.

The RICE Cohort Was Worse

I thought maybe PCOS was an outlier.

I checked RICE (different diet cohort).

RICE October:

  • Median: 52 events
  • P95: 106 events
  • Users: 15,100

RICE November:

  • Median: 3 events (-94.2%)
  • P95: 59 events (-44.3%)
  • Users: 24,066

Even worse drift.

The P95 dropped by 44%: the long tail of 100-plus-event journeys had all but vanished.

The pattern was consistent across cohorts:

  • PCOS KL Divergence: 22.39
  • RICE KL Divergence: 22.65

Both cohorts showed the same catastrophic drift.

This wasn’t a data quality issue. This was a fundamental mismatch between training and production.

The Architectural Mismatch

My LSTM sequence encoder was designed for long sequences:

  • Positional encodings calibrated for 50+ events
  • Masking layers optimized for P95=61-106 events
  • Time-series motivation features computed over full journeys

November reality:

  • 95% of sequences had <10 events
  • Masking tensors were 95% padding (wasted computation)
  • Motivation features couldn’t be computed on 2-event sequences

The model wasn’t just inaccurate.

It was architecturally incompatible with production data.
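The padding claim is easy to verify. A small sketch, assuming sequences are padded to a fixed length of 61 (October's P95) before they reach the encoder:

```python
import numpy as np

MAX_LEN = 61  # padding length calibrated on October's P95

def padding_fraction(seq_lengths, max_len=MAX_LEN):
    """Share of the padded batch tensor that is padding, not real events."""
    lengths = np.minimum(np.asarray(seq_lengths), max_len)
    return 1.0 - lengths.sum() / (len(lengths) * max_len)

print(padding_fraction([51, 55, 61, 48]))  # October-like batch: ~0.12
print(padding_fraction([2, 3, 2, 5]))      # November-like batch: ~0.95
```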

Example: My motivation score feature tracks “dwell time patterns across the funnel.”

On a 50-event sequence, this is rich signal:

  • User slows down at screen 15 (reading carefully)
  • User speeds up at screen 30 (losing interest)
  • User pauses at screen 45 (considering purchase)

On a 2-event sequence, this is noise:

  • User clicked twice
  • ???

The feature doesn’t exist.

And that feature was contributing 87% of the model’s predictive power (according to SHAP analysis).

No wonder the model failed.
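To make that concrete, here is a rough sketch of a dwell-time feature in this spirit; the exact definition is illustrative, not the production feature:

```python
import numpy as np

def dwell_time_features(event_ts):
    """Dwell times (seconds between consecutive events) plus a simple trend.

    event_ts: sorted event timestamps for one user, in seconds.
    """
    ts = np.asarray(event_ts, dtype=float)
    if len(ts) < 3:
        # Fewer than two dwell intervals: there is no trajectory to measure.
        return {"mean_dwell": np.nan, "dwell_trend": np.nan}
    dwell = np.diff(ts)                                     # time spent on each screen
    trend = np.polyfit(np.arange(len(dwell)), dwell, 1)[0]  # slowing down vs. speeding up
    return {"mean_dwell": float(dwell.mean()), "dwell_trend": float(trend)}

# 50-event journey: a real trajectory of where the user slows down or speeds up.
long_journey = np.cumsum(np.random.default_rng(0).uniform(2, 30, 50))
print(dwell_time_features(long_journey))

# 2-event fragment: the feature simply does not exist.
print(dwell_time_features([0.0, 4.2]))
```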

The Retraining Decision

I had three options:

Option 1: Accept Production Reality (Recommended)

  • Retrain models on production-representative data (partial sequences)
  • Pro: Models match deployment reality
  • Con: Lose information from early funnel stages
  • Effort: Moderate (retrain E10/E12 on Nov data)

Option 2: Hybrid Approach

  • Maintain two model variants:
    • Full-funnel models (Oct) for batch prediction on complete journeys
    • Real-time models (Nov) for production monitoring on partial sequences
  • Pro: Preserve full-funnel insights, optimize for real-time
  • Con: Operational complexity (dual model maintenance)
  • Effort: High (maintain two training pipelines)

Option 3: Data Pipeline Redesign

  • Modify production extraction to capture full user histories (not 30-day window)
  • Pro: Consistent train/prod data distributions
  • Con: Higher data volume, privacy/retention concerns
  • Effort: High (BigQuery query redesign, storage costs)

I chose Option 1.

Why? Because production data is the truth.

If my production system only sees 2-3 events per user (because that’s how users actually behave in a 30-day window), then my model needs to work on 2-3 events.

Training on full funnels was a luxury. Production is reality.

The Lesson

Models expire.

Not because they “forget.”

Not because users change behavior.

Because the data distribution shifts underneath them.

October = Full funnels (training environment)

November = Production snapshots (real world)

The gap between training and production killed the model.

I thought I was being smart by training on “real user data.”

But October’s “real data” was a curated dataset of complete journeys.

November’s “real data” was messy, incomplete, production chaos.

Monthly retraining isn’t optional. It’s survival.

And it’s not just about retraining the model.

It’s about validating that your training data distribution matches your production data distribution.

If they don’t match, your model is dead on arrival.

The checklist I should have followed:

  1. ✅ Train model on October data
  2. Validate that October distribution matches production distribution
  3. Monitor distribution drift in production
  4. Set up automated alerts for KL/PSI/KS thresholds
  5. Retrain monthly (or when drift exceeds thresholds)

I did step 1.

I skipped steps 2-5.

I paid the price.

Now I have drift detection running in production. If KL divergence exceeds 0.30, I get an alert.

If PSI exceeds 0.25, I schedule retraining within 2 weeks.

If KS statistic exceeds 0.50, I trigger emergency fallback to the baseline model.
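Those rules are simple enough to encode directly on top of the drift_metrics helper sketched earlier; the actions here are placeholders for whatever paging or scheduling system you use:

```python
KL_ALERT = 0.30      # severe drift: send an alert
PSI_RETRAIN = 0.25   # significant population shift: retrain within 2 weeks
KS_FALLBACK = 0.50   # distributions barely overlap: fall back to baseline

def check_drift(train_lengths, prod_lengths):
    """Run the drift check and return the metrics plus the actions to take."""
    kl, psi, ks = drift_metrics(train_lengths, prod_lengths)  # defined earlier
    actions = []
    if kl > KL_ALERT:
        actions.append("alert")               # placeholder: Slack/PagerDuty hook
    if psi > PSI_RETRAIN:
        actions.append("schedule_retrain")
    if ks > KS_FALLBACK:
        actions.append("fallback_to_baseline")
    return {"kl": kl, "psi": psi, "ks": ks, "actions": actions}
```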

Models don’t age gracefully. They die suddenly.

The only way to survive is to watch them closely and retrain before they become strangers.


Next: Two Models. Same User. Opposite Predictions. — When E10 says “buy” and E12 says “dropout,” who’s right?