Sergey Kopanev: you sleep — agents ship

User Intent Prediction · Part 5

3,700x Faster. Still Not Fast Enough.


I optimized a model so fast the universe politely asked me to fuck off.

I spent 6 hours optimizing a model that was already 3,700 times faster than the blink of an eye.

I had a model. It ran in 27 microseconds. That’s 0.027 milliseconds. That’s fast enough to handle the traffic of Amazon, Google, and Facebook combined (on a single laptop).

Real-time funnel adaptation needs <50ms. Not <1ms. Definitely not <0.03ms.

I’m an engineer. So I thought: “I can make it faster.”

The result: I crashed my own toolchain.

The Benchmark: 27 Microseconds

I converted my model to ONNX.

  • Mean latency: 27μs
  • Throughput: 28,000 predictions per second

To put that in perspective:

  • A blink of an eye: 100,000μs
  • A standard API request: 50,000μs
  • My model: 27μs
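The harness behind numbers like these is nothing exotic. A minimal sketch in Python, with a cheap stand-in `predict` function in place of the real ONNX `session.run()` call (the stand-in and the run counts are illustrative, not my actual setup):

```python
import time

def predict(x):
    # Stand-in for the real ONNX session.run() call; any cheap
    # function works for demonstrating the measurement itself.
    return sum(v * 0.5 for v in x)

def benchmark(fn, x, n_warmup=1_000, n_runs=10_000):
    # Warm up before measuring, so cold caches don't skew the mean.
    for _ in range(n_warmup):
        fn(x)
    start = time.perf_counter()
    for _ in range(n_runs):
        fn(x)
    elapsed = time.perf_counter() - start
    mean_latency_us = elapsed / n_runs * 1e6
    throughput = n_runs / elapsed
    return mean_latency_us, throughput

latency_us, qps = benchmark(predict, [0.1] * 48)
print(f"mean latency: {latency_us:.1f}us, throughput: {qps:,.0f}/s")
```

The warmup loop matters: without it, the first few calls pay for cold caches and allocator setup, and the mean lies to you.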

I had solved the latency problem. The ticket was closed. I didn’t stop.

The Obsession: Quantization

“27μs is good,” I thought. “If I quantize it to INT8, I could hit 10μs!”
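INT8 quantization is, at its core, trivial arithmetic. A rough sketch of the symmetric per-tensor variant (illustrative only, not what the ONNX tooling does internally):

```python
def quantize_int8(weights):
    # Symmetric per-tensor quantization: map [-max|w|, +max|w|]
    # onto the signed 8-bit range [-127, 127] via one scale factor.
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights from the int8 values.
    return [v * scale for v in q]

weights = [0.9, -0.4, 0.05, -1.2]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
```

Each weight drops from 4 bytes to 1, at the cost of rounding error bounded by the scale factor.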

I wanted to optimize for the sake of optimization. I wanted the high score. My funnel had a 1.5% conversion, and I was here fighting microseconds.

The experiment: I ran the standard quantization tools. The result: Segmentation fault (core dumped)

The Crash

I tried again. Different tool. Different library. Exit code 139 (128 + signal 11: SIGSEGV).

The kernel was killing my optimization script on sight.

The reason: My model was too small. It had 48 parameters. Total.

The overhead of the quantization logic—setting up the lookup tables—was larger than the model itself.
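The back-of-envelope math, assuming standard 4-byte float32 weights:

```python
# The whole model: 48 float32 parameters.
n_params = 48
fp32_bytes = n_params * 4      # 192 bytes
int8_bytes = n_params * 1      # 48 bytes
saved = fp32_bytes - int8_bytes
print(f"FP32: {fp32_bytes} B, INT8: {int8_bytes} B, saved: {saved} B")
```

144 bytes saved. The scales, zero points, and dequantize plumbing the runtime has to wire in almost certainly cost more than that.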

I was packing a sandwich into a shipping container and calling it “efficiency.” The computer was literally rejecting my stupidity.

The Lesson

Optimization has a stopping point.

This wasn’t high-frequency trading. It was a diet quiz. I was optimizing for vanity metrics, not business value.

  • Goal: < 50,000μs (50ms)
  • Reality: 27μs
  • Margin: 1,850x faster than required

When to stop optimizing:

  1. When you hit your SLA.
  2. When the optimization costs more than the compute savings.
  3. When your tools start segfaulting because your problem is too small.
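Rule 1 fits in one function. `keep_optimizing` and its margin parameter are hypothetical, a sketch of the check rather than production code:

```python
def keep_optimizing(latency_us, sla_us, min_margin=2.0):
    # Keep optimizing only while latency, with a safety margin,
    # still threatens the SLA budget. Past that, it's vanity.
    return latency_us * min_margin > sla_us

print(keep_optimizing(27, 50_000))  # prints False
```

For the model in this post: 27μs times any sane margin is nowhere near 50,000μs. Stop.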

The Outcome

I deployed the un-quantized, “slow” 27μs model. It runs in production. It handles every user event.

Nobody noticed. Because nobody buys faster just because your matrix multiplies faster.


Next: Three Sequence Models. All Failed.