User Intent Prediction #12: Smart Discounts Regret.


I built a machine that burns money to learn how to print money. It mostly just burned money.

I wanted to be smart. I didn’t want to give a 20% discount to everyone. That burns margin. I didn’t want to give it to no one. That burns conversion.

I wanted to give it only to the people who needed it to buy.

The Solution: Contextual Bandits. A “Bandit” is an algorithm that learns by doing; the “Contextual” part means its choice can depend on what it knows about the user (here, an intent score). The loop, sketched in code below:

  • It tries an action (Give Discount).
  • It sees the result (Sale / No Sale).
  • It updates its strategy.
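
In code, that try / see / update loop is tiny. Here is a minimal sketch using epsilon-greedy, the simplest bandit variant (my actual setup, Thompson Sampling, comes below); the names and the 10% exploration rate are invented for illustration:

```python
import random

ACTIONS = ("full_price", "discount")

# The bandit's entire memory: how often each action was tried, and how often it sold.
stats = {a: {"sales": 0, "trials": 0} for a in ACTIONS}

def choose_action(epsilon=0.1):
    """Try an action: usually the best one seen so far, sometimes a random one."""
    untried = [a for a in ACTIONS if stats[a]["trials"] == 0]
    if untried or random.random() < epsilon:
        return random.choice(untried or list(ACTIONS))  # explore
    return max(ACTIONS, key=lambda a: stats[a]["sales"] / stats[a]["trials"])  # exploit

def record_result(action, sale):
    """See the result (sale / no sale) and update the strategy."""
    stats[action]["trials"] += 1
    stats[action]["sales"] += int(sale)
```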

It sounds perfect. It sounds like “Auto-Pilot for Revenue.”

The Concept of “Regret”

In Bandit theory, there is a metric called Regret. Regret = (Best Possible Reward) - (Actual Reward).

If the algorithm guesses wrong, it “regrets” it. It learns from the pain.

The Problem: In a simulation, “Regret” is just a number. In a startup, “Regret” is lost money.

Every time the Bandit explores (deliberately tries an uncertain option to learn), I risk a lost sale or a wasted discount. I was paying real dollars to educate my algorithm.

Example:

  • User A: High intent (0.9). Bandit gives full price. User buys anyway. (Good).
  • User B: Medium intent (0.6). Bandit gives full price. User quits. (Lost $50 sale).
  • User C: Low intent (0.2). Bandit gives discount. User still quits. (Wasted $10 discount).

The Bandit is “exploring” to learn which users need discounts. But every exploration costs money.
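
Put dollar figures on those three users and you get a toy regret ledger (the numbers from the example above, not real logs):

```python
# Per-user regret = (reward the best action would have earned) - (actual reward).
ledger = [
    ("A", "full_price", 0),   # right call: user paid full price anyway
    ("B", "full_price", 50),  # wrong call: the $50 sale walked out the door
    ("C", "discount",   10),  # wrong call: $10 of discount offered for nothing
]
total_regret = sum(regret for _, _, regret in ledger)
print(f"Cumulative regret after 3 users: ${total_regret}")  # -> $60
```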

The Setup

I used Thompson Sampling. It’s a probabilistic way to balance exploration (learning) and exploitation (earning).

How it works:

  1. For each user, the Bandit has a “belief” about the probability they will buy with/without a discount.
  2. It samples from this belief (a Beta distribution).
  3. It picks the action with the highest sampled value.
  4. It updates the belief based on the result.
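
Here’s that cycle as a minimal sketch. I’m assuming one Beta belief per (intent bucket, action) pair; the bucket cutoffs, prices, and names are illustrative:

```python
import random
from collections import defaultdict

PRICE, DISCOUNT_COST = 50, 10      # illustrative numbers
ACTIONS = ("full_price", "discount")

# Beta(alpha, beta) belief over P(buy) for each (context, action) pair.
# The intent bucket is the "context" in Contextual Bandit.
belief = defaultdict(lambda: {"alpha": 1, "beta": 1})  # uniform prior

def bucket(intent):
    return "low" if intent < 0.4 else "mid" if intent < 0.7 else "high"

def revenue(action):
    return PRICE - (DISCOUNT_COST if action == "discount" else 0)

def choose(intent):
    """Steps 2-3: sample P(buy) from each belief, pick the action whose
    sampled conversion rate times its revenue is highest."""
    ctx = bucket(intent)
    def sampled_value(action):
        b = belief[(ctx, action)]
        return random.betavariate(b["alpha"], b["beta"]) * revenue(action)
    return max(ACTIONS, key=sampled_value)

def update(intent, action, bought):
    """Step 4: conjugate Beta update -- a sale bumps alpha, a miss bumps beta."""
    b = belief[(bucket(intent), action)]
    b["alpha" if bought else "beta"] += 1
```

One design note: the sampled conversion rate is weighted by revenue, so a discounted sale counts as $40, not $50. Without that weighting, the Bandit would just chase conversions and hand everyone the discount.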

The Theory:

  • High Uncertainty: Explore more (try different actions to learn).
  • High Certainty: Exploit the winner (use the best action).

I deployed it. I watched the logs. I saw the Bandit giving full price to hesitant users. (Lost sale). I saw the Bandit giving discounts to eager users. (Lost margin).

“It’s learning!” I told myself. “It needs time!”

The Math of Pain

After 1 week:

  • Regret: $1,200 (lost sales + wasted discounts).
  • Learning: Minimal (only 18 sales to learn from).

The Bandit needed 1,000+ sales to converge. At 1.5% conversion, that’s roughly 67,000 users. At my traffic, that’s 3 months.
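
The back-of-envelope version, with traffic as the one free parameter (the weekly figure below is hypothetical, not my real number):

```python
sales_needed = 1_000          # rough sample size for the beliefs to converge
conversion = 0.015            # funnel conversion rate
users_needed = sales_needed / conversion       # ~66,667 users

weekly_users = 5_500          # hypothetical traffic, for illustration only
weeks = users_needed / weekly_users            # ~12 weeks, about 3 months
print(f"{users_needed:,.0f} users, ~{weeks:.0f} weeks of learning tax")
```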

I couldn’t afford 3 months of “learning tax.”

The Lesson

Learning isn’t free.

I treated the Bandit like a magic money printer. I forgot that it has to spend money to learn how to make money.

And in a funnel with 1.5% conversion, data is scarce. Learning takes a long time. And “Regret” piles up fast.


This leads to why doing nothing beat the smart algorithm—sometimes the cost of learning is higher than the value of the lesson.