Sergey Kopanev: you sleep — agents ship

NBP Follow-ups · Part 2

The Night Realtime Stopped Talking


I opened the app to demo the live transcript.

The waveform moved. The recording timer ticked. The transcript pane was empty.

I talked into the mic for thirty seconds. Empty. I talked louder. Empty. I clicked stop. The final transcript existed — generated from the audio file after recording ended. The realtime pane had never produced a single word.

It had been broken for weeks. I just had not opened it during a long enough recording to notice.

What NBP Is

NBP records on macOS and runs Whisper locally to produce transcripts. The first article explains why that matters.

There are two transcript paths. Post-recording — the audio file goes through Whisper end-to-end after stop. Realtime — a sliding window through Whisper while the recording is still happening, words appearing as you speak.

Post-recording is the workhorse. Realtime is the demo. Realtime is also the feature people actually want when they are trying to follow a meeting on a second screen.

Both paths share Whisper. They do not share the orchestration.
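A minimal sketch of that split, with hypothetical names standing in for NBP's actual types — one shared model, two separate drivers:

```rust
// Illustrative only: WhisperModel stands in for the loaded Whisper context.
struct WhisperModel;

impl WhisperModel {
    // Stand-in for a real inference call.
    fn transcribe(&self, samples: &[f32]) -> String {
        format!("[{} samples transcribed]", samples.len())
    }
}

// Post-recording path: one end-to-end pass over the finished file.
fn post_recording(model: &WhisperModel, whole_file: &[f32]) -> String {
    model.transcribe(whole_file)
}

// Realtime path: one pass per sliding window, called repeatedly while recording.
fn realtime_pass(model: &WhisperModel, window: &[f32]) -> String {
    model.transcribe(window)
}

fn main() {
    let model = WhisperModel;
    let audio = vec![0.0f32; 16_000 * 5]; // 5 s at 16 kHz
    println!("{}", post_recording(&model, &audio));
    println!("{}", realtime_pass(&model, &audio[audio.len() - 16_000..]));
}
```

The model is the shared part. Everything around it — buffering, scheduling, params — is two codebases wearing one name.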

How It Got There

Six weeks before the empty-pane day, I ran a batch of agents overnight. Documented in agents coded all night. Documented again, less proudly, in brief and gate — the post about the fake review gate that let trash through.

The night dropped about fifty squash merges into the repo. Audio processing rewrites. Pipeline plumbing. New transcript JSON migration. Connector refactors.

Each merge had a green checkmark. Each merge had a passing review. Each merge had a feature description that read fine.

The realtime transcription file was not in the diff for any of those merges.

How It Broke Anyway

The realtime path reads from a shared audio buffer that the recording pipeline fills. It runs Whisper on a sliding window with a 1-second step. It carries prompt tokens from the previous pass to give the next pass context.
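The sliding-window shape can be sketched in plain Rust. The numbers (16 kHz audio, 5-second window, 1-second step) and the `next_window` helper are illustrative, not NBP's actual code:

```rust
// Illustrative constants; real values live in the recording pipeline.
const SAMPLE_RATE: usize = 16_000;

/// Return the latest `window_s`-second window from the shared buffer,
/// or None if fewer than `step_s` seconds of new audio have arrived
/// since the last pass ended at `last_end`.
fn next_window(
    buffer: &[f32],
    last_end: usize,
    window_s: usize,
    step_s: usize,
) -> Option<(usize, Vec<f32>)> {
    let step = step_s * SAMPLE_RATE;
    if buffer.len() < last_end + step {
        return None; // not enough new audio yet
    }
    let end = buffer.len();
    let start = end.saturating_sub(window_s * SAMPLE_RATE);
    Some((end, buffer[start..end].to_vec()))
}

fn main() {
    let buffer = vec![0.0f32; 6 * SAMPLE_RATE]; // 6 s recorded so far
    // Last pass ended at the 5 s mark; one full step has arrived since.
    if let Some((end, window)) = next_window(&buffer, 5 * SAMPLE_RATE, 5, 1) {
        println!("pass over {} samples, up to sample {}", window.len(), end);
    }
}
```

The fragility is visible in the signature: the function's behavior depends entirely on how the recording side fills `buffer`. Change the feeding cadence upstream and every pass downstream sees different shapes.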

Three things changed under it during the agent night, none in realtime_transcription.rs:

  1. The audio buffer’s chunk timing shifted. Recording fed it differently after the pipeline rewrite. Realtime’s “available samples” calls returned different shapes.
  2. The Whisper context loader picked up new params downstream. The realtime side built FullParams with set_language(None) — auto-detect. Auto-detect on a 1-second window with the new feeding cadence latched onto silence and refused to produce English.
  3. Prompt-token carryover started poisoning subsequent passes. The token vector survived across silences when it should have been cleared.

Each individual change made sense in its own PR. The interaction was nobody’s review.

The First Stabilization

I caught a chunk of it on Feb 26. The commit message is the most honest summary: stabilize realtime transcription after batch feature squashes.

That pass was cleanup, not fix. I deleted the dead Cloud arm of ActiveTranscriber — a feature flag had landed for cloud realtime and then been silently turned off, leaving an orphaned enum branch. I tightened the emit paths so transcript writes and UI events stayed in lockstep. I cleaned up the JSON write that was producing partial files mid-recording.
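The dead-arm cleanup looked roughly like this — variant and type names are hypothetical, not NBP's actual definitions:

```rust
// Before (sketch): a Cloud variant survived after its feature flag was
// silently turned off, leaving an arm no code path could construct.
//
// enum ActiveTranscriber {
//     Local(LocalTranscriber),
//     Cloud(CloudTranscriber), // orphaned
// }

// After: the enum collapses to the one arm that can actually run.
struct LocalTranscriber;

enum ActiveTranscriber {
    Local(LocalTranscriber),
}

impl ActiveTranscriber {
    fn describe(&self) -> &'static str {
        match self {
            // No dead Cloud arm left to match on, carry state for,
            // or keep compiling against.
            ActiveTranscriber::Local(_) => "local whisper",
        }
    }
}

fn main() {
    let t = ActiveTranscriber::Local(LocalTranscriber);
    println!("{}", t.describe());
}
```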

I told myself realtime was fixed.

It was not. It was less broken in obvious ways and still broken in the way that matters: the pane stayed empty.

The Real Fix

Six weeks later, on a Sunday evening, I sat down and ripped Whisper’s params apart.

let mut params = FullParams::new(SamplingStrategy::Greedy { best_of: 1 });
params.set_language(Some("en"));
params.set_translate(false);
params.set_single_segment(true);
params.set_no_context(true);
params.set_n_threads(4);

Five setters. Each one corresponded to a wrong assumption in the original implementation.

set_language(Some("en")) — auto-detect was the silent killer. On short windows, Whisper’s language detector kept guessing wrong and bailing. Pin the language. The app already knows it.

set_single_segment(true) — realtime is one rolling segment, not many. Telling Whisper that explicitly stops it from trying to chunk a 1-second window into pieces it cannot resolve.

set_no_context(true) — kill the prompt-token carryover entirely. The feature was supposed to give the next pass context from the last pass. In practice, it was carrying garbage from one silence into the next utterance and biasing toward repetition. No context > wrong context.

set_n_threads(4) — it was running on the default thread count, which on this hardware was 1. Inference took longer than the step interval. Realtime fell behind realtime.

Then the step interval bump: 1s → 2s. Inference at 4 threads on this model on this machine takes between 800ms and 1.6s. A 1-second step was a coin flip on whether the next pass started before the previous one finished. 2-second step gives slack and the pane updates feel smooth, not stuttery.

let step_interval = std::time::Duration::from_secs(2);
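The falling-behind failure is back-of-envelope arithmetic. A sketch using the numbers from the text (0.8–1.6 s inference), modeling the simple worst case where every pass over the step budget piles up as lag:

```rust
// Each pass is scheduled every `step_s` seconds but occupies `inference_s`;
// anything over the step accumulates as lag, pass after pass.
fn backlog_after(passes: u32, step_s: f64, inference_s: f64) -> f64 {
    (inference_s - step_s).max(0.0) * passes as f64
}

fn main() {
    // 1 s step, 1.6 s worst-case inference: 0.6 s of new lag per pass.
    println!("1s step, 60 passes: {:.1}s behind", backlog_after(60, 1.0, 1.6));
    // 2 s step: even the worst case finishes before the next pass is due.
    println!("2s step, 60 passes: {:.1}s behind", backlog_after(60, 2.0, 1.6));
}
```

At a 1-second step, a minute of bad luck leaves the pane more than half a minute behind the speaker. At 2 seconds the worst case still clears the budget, so the backlog never starts.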

After all of that, the pane filled with words while I was still talking.

The Toast That Should Have Been There

The other change in that commit was not in Whisper. It was a single line that made the bug discoverable:

let _ = app_handle.emit(EVENT_TRANSCRIPTION_ERROR, e);

For six weeks, realtime errors had gone to stderr via eprintln! and nowhere else. The user — me, in this case — got an empty pane and no signal that anything was wrong.

A toast on error would have told me about this on day one. The agents that wrote the surrounding code did not add that toast because none of their features needed it. The original realtime code did not have it because, on the day it shipped, it worked.

A feature that has no error surface is a feature you stop seeing when it stops working.

The Trade-Off

Cost of dropping prompt_tokens: realtime transcripts have slightly worse coherence across pauses. A name said in minute one is not carried into minute three; Whisper has to rediscover it from acoustics.

Cost of pinning the language: this app no longer transcribes my Russian voice notes in realtime. Final transcript still does, since it uses different params. Realtime is English-only now. That is acceptable for the use case I actually have.

Cost of bumping step to 2s: the latency on word appearance is slightly higher. The pane catches up in chunks of 2 seconds instead of 1. That trade buys reliability — words that appear are correct words.

The net: realtime works. Trade-offs are visible. The original implementation had the opposite shape — it worked by accident and stopped working without telling anyone.

The Lesson Is Not About Whisper

Agents merging fifty features overnight is fine. Reviews passing is fine. The thing that broke was the absence of a check the reviews were not designed to perform: does the feature you are not touching still work after this PR?

Realtime had no smoke test. It had no error surface. It had no monitor that would have noticed the empty pane. It was protected by exactly one thing — me opening the app and looking at the pane during a long-enough recording.

That is not protection. That is luck.

The fix to the bug was Whisper params. The fix to the class of bug is unfinished.

Takeaway

The dangerous bug in a multi-agent codebase is the bug nobody’s PR caused.

Add an error surface for every feature, even the ones that worked on day one. Check unrelated features after merge sprees. A green review queue is not a working app.


Next: Same .bin, Different File.