User Intent Prediction #12: Smart Discounts Regret.
I built a machine that burns money to learn how to print money. It mostly just burned money.
I wanted to be smart. I didn’t want to give a 20% discount to everyone. That burns margin. I didn’t want to withhold it from everyone, either. That burns conversion.
I wanted to give it only to the people who needed it to buy.
The Solution: Contextual Bandits. A “Bandit” is an algorithm that learns by doing:
- It tries an action (Give Discount).
- It sees the result (Sale / No Sale).
- It updates its strategy. (Sketch below.)
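That loop is small enough to sketch. A minimal version in Python (epsilon-greedy here, just to show the try/observe/update cycle; the real setup used Thompson Sampling, covered below, and the action names are placeholders):

```python
import random

# Two actions, each tracked by its observed conversion rate.
arms = {"full_price": {"sales": 0, "trials": 0},
        "discount":   {"sales": 0, "trials": 0}}

def try_action(epsilon: float = 0.1) -> str:
    # Explore: with probability epsilon, pick a random action to learn from.
    if random.random() < epsilon:
        return random.choice(list(arms))
    # Exploit: otherwise pick the best conversion rate seen so far.
    return max(arms, key=lambda a: arms[a]["sales"] / max(arms[a]["trials"], 1))

def see_result(action: str, sale: bool) -> None:
    # Update the strategy with what actually happened.
    arms[action]["trials"] += 1
    arms[action]["sales"] += int(sale)
```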
It sounds perfect. It sounds like “Auto-Pilot for Revenue.”
The Concept of “Regret”
In Bandit theory, there is a metric called Regret. Regret = (Best Possible Reward) - (Actual Reward).
If the algorithm guesses wrong, it “regrets” it. It learns from the pain.
The Problem: In a simulation, “Regret” is just a number. In a startup, “Regret” is lost money.
Every time the Bandit explores (tries an uncertain option just to learn), I risk a lost sale or a wasted discount. I was paying real dollars to educate my algorithm.
Example:
- User A: High intent (0.9). Bandit gives full price. User buys anyway. (Good).
- User B: Medium intent (0.6). Bandit gives full price. User quits. (Lost $50 sale).
- User C: Low intent (0.2). Bandit gives discount. User still quits. (Lost $10 margin).
The Bandit is “exploring” to learn which users need discounts. But every exploration costs money.
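Here is that example booked against the regret formula. One plausible accounting (assuming a $50 sale and $10 of discount margin, with the “oracle” being a hypothetical policy that already knows each user’s intent):

```python
# (user, oracle's best reward, bandit's actual reward), in dollars.
log = [
    ("A", 50, 50),   # full price, bought anyway: the oracle does no better
    ("B", 50, 0),    # needed the nudge, got full price, quit: $50 regret
    ("C", 0, -10),   # wouldn't buy either way, got a discount: $10 regret
]
regret = sum(best - actual for _, best, actual in log)
print(f"Regret after 3 users: ${regret}")  # $60, from just three decisions
```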
The Setup
I used Thompson Sampling. It’s a probabilistic way to balance exploration (learning) and exploitation (earning).
How it works:
- For each user, the Bandit has a “belief” about the probability they will buy with/without a discount.
- It samples from this belief (a Beta distribution).
- It picks the action with the highest sampled value.
- It updates the belief based on the result. (Code below.)
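A sketch of that loop with Beta-Bernoulli beliefs. The two actions match the setup above; the intent buckets are an assumption for illustration (the real feature set isn’t shown), and a production version would weight samples by margin, since a full-price sale pays more than a discounted one:

```python
import random

ACTIONS = ("full_price", "discount")
BUCKETS = ("low", "mid", "high")  # assumed discretization of intent scores

# One Beta(alpha, beta) belief per (bucket, action), starting uniform.
beliefs = {(b, a): [1, 1] for b in BUCKETS for a in ACTIONS}

def bucket(intent: float) -> str:
    return "low" if intent < 0.4 else "mid" if intent < 0.7 else "high"

def choose(intent: float) -> str:
    b = bucket(intent)
    # Sample a plausible conversion rate from each belief; act on the best.
    samples = {a: random.betavariate(*beliefs[(b, a)]) for a in ACTIONS}
    return max(samples, key=samples.get)

def update(intent: float, action: str, bought: bool) -> None:
    alpha_beta = beliefs[(bucket(intent), action)]
    alpha_beta[0 if bought else 1] += 1  # sale -> alpha+1, no sale -> beta+1
```

Per session: `action = choose(user_intent)`, then `update(user_intent, action, bought)` once the outcome is known.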
The Theory:
- High Uncertainty: Explore more (try different actions to learn).
- High Certainty: Exploit the winner (use the best action).
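You can watch that trade-off in the Beta belief itself: few observations means wide, noisy samples (exploration); many observations means tight, stable samples (exploitation). The counts here are made up:

```python
import random

# Same ~40% conversion rate, very different amounts of evidence.
young = [random.betavariate(2, 3) for _ in range(5)]      # ~5 observations
old   = [random.betavariate(200, 300) for _ in range(5)]  # ~500 observations
print([round(s, 2) for s in young])  # scattered, e.g. [0.18, 0.71, 0.44, ...]
print([round(s, 2) for s in old])    # clustered tightly around 0.40
```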
I deployed it. I watched the logs. I saw the Bandit giving full price to hesitant users. (Lost sale). I saw the Bandit giving discounts to eager users. (Lost margin).
“It’s learning!” I told myself. “It needs time!”
The Math of Pain
After 1 week:
- Regret: $1,200 (lost sales + wasted discounts).
- Learning: Minimal (only 18 sales to learn from).
The Bandit needed 1,000+ sales to converge. At 1.5% conversion, that’s 66,000 users. At my traffic, that’s 3 months.
I couldn’t afford 3 months of “learning tax.”
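For the record, the back-of-envelope math (the monthly traffic figure is inferred from the “3 months” claim, not stated above):

```python
sales_needed  = 1_000                      # rough convergence target
conversion    = 0.015                      # funnel conversion rate
users_needed  = sales_needed / conversion  # ~66,667: the "66,000 users" above
monthly_users = 22_000                     # hypothetical traffic level
print(f"{users_needed:,.0f} users, {users_needed / monthly_users:.1f} months")
# -> 66,667 users, 3.0 months of paying the learning tax
```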
The Lesson
Learning isn’t free.
I treated the Bandit like a magic money printer. I forgot that it has to spend money to learn how to make money.
And in a funnel with 1.5% conversion, data is scarce. Learning takes a long time. And “Regret” piles up fast.
Which leads to the punchline: doing nothing beat the smart algorithm. Sometimes the cost of learning is higher than the value of the lesson.