The Review Passed. Everything Was Broken.
Building AI Autopilot for code, research, and workflows.
At 11pm I set agents loose on NBP, a Tauri 2 + Rust macOS app. By morning they had completed 50 tasks. A review stage ran — a second model checked the output. The review passed. I opened the app. It wouldn’t launch.
Not “crashed on a specific action.” Wouldn’t launch.
This happened on a real product. Not a toy repo.
The goal was faster shipping with agent help.
What I got was proof that review without execution checks is theater.
What Happened
50 tasks. Overnight. No supervision. The agents wrote Rust, touched the Tauri config, reorganized startup logic. A review model read the output and said it looked complete.
In the morning I spent 2 hours with Claude Code debugging the mess. Tracing through startup failures, reverting changes, finding the point where the initialization sequence broke. Two hours to fix what agents built overnight.
The review had passed. The app was broken. Those two facts coexisted without contradiction.
Why the Review Was Useless
The review model asked the wrong question.
It asked: “Does this look complete?”
That is not a gate. That is a vibe check. One model reading another model’s output and deciding it “seems right” is an echo chamber with an API.
The agents produced code that was internally consistent. Correct syntax. Right shape. Read like something a competent person would write. The review model saw all of that and said yes.
Neither model ran the app. Neither model checked startup. Neither model had a definition of “correct” that went beyond reading the output and pattern-matching against what correct-looking code looks like.
The model is not lying when it approves. It genuinely cannot tell the difference between code that looks right and code that launches. It has never run anything. It has only ever read.
What Acceptance Actually Means
Acceptance is not a review. Acceptance is a gate.
A gate is binary. It opens or it stays shut. “Looks complete” is not a gate. “The app launches” is a gate.
Real acceptance criteria for NBP startup:
- The binary compiles.
- The app launches without error.
- The main window appears within 3 seconds.
- No panics in the log.
Not: “The startup logic seems sound.” Not: “I don’t see obvious issues.” Not: “This looks complete.”
The criteria have to be checkable by something that is not reading the code. A compiler. A test runner. An actual execution. Something that produces pass or fail based on observable behavior, not interpretation.
“Does it launch?” is a gate. “Does this look correct?” is not.
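The four criteria above are mechanically checkable. A minimal sketch in Rust, assuming a standard Cargo layout, a `target/release/nbp` binary, and an `nbp.log` startup log — all illustrative names, not NBP’s real configuration:

```rust
use std::process::{Command, Stdio};
use std::thread;
use std::time::Duration;

/// Pure check: a startup log passes only if it contains no panics.
/// (Assumes panics land in the log as lines containing "panicked".)
fn log_is_clean(log: &str) -> bool {
    !log.lines().any(|line| line.contains("panicked"))
}

/// The gate: binary pass/fail based on observable behavior, no interpretation.
fn gate() -> bool {
    // 1. The binary compiles.
    let built = Command::new("cargo")
        .args(["build", "--release"])
        .status()
        .map(|s| s.success())
        .unwrap_or(false);
    if !built {
        return false;
    }

    // 2. The app launches without error.
    let mut app = match Command::new("target/release/nbp")
        .stdout(Stdio::null())
        .spawn()
    {
        Ok(child) => child,
        Err(_) => return false,
    };

    // 3. Still alive 3 seconds after launch (a crude proxy for
    //    "the main window appears within 3 seconds").
    thread::sleep(Duration::from_secs(3));
    let alive = matches!(app.try_wait(), Ok(None));
    let _ = app.kill();
    if !alive {
        return false;
    }

    // 4. No panics in the log.
    let log = std::fs::read_to_string("nbp.log").unwrap_or_default();
    log_is_clean(&log)
}
```

Liveness after three seconds is a stand-in for “the main window appears” — a stricter gate would query the window system. But even this crude check fails an app that won’t launch, which is exactly what the review model couldn’t do.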
The Spec Has to Come First
The other failure: acceptance criteria were written after I saw the broken output. Or not written at all — implied.
That does not work.
The spec has to exist before the agents start. The agents ran 50 tasks toward an implied definition of done. The review model approved against that same implied definition. Nobody wrote down “the app must launch” before the run started.
When both the agents and the reviewer are guessing at what done means, you get consensus without correctness. They agreed. The app disagreed.
The Fix Is Not Better Agents
Every instinct after a failure like this points at the agents. Better prompt. Smarter model. More context in the system message.
Wrong direction.
The agents did what agents do. They generated plausible output. They cannot spontaneously develop the ability to run a Tauri app and verify it launches. That is not how they operate. Improving the model doesn’t change that.
The fix is the gate.
Define what done means before the agents start. Make the criteria executable. Run the app. Check the log. If it launches, it passes. If it doesn’t, the pipeline stops or the agents go again. No human judgment required. No model judgment required.
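That loop is small enough to write down. A sketch, with `run_agents` and `gate` as hypothetical hooks rather than any real API:

```rust
/// Drive the pipeline off the gate, not off model judgment: retry until
/// the executable check passes or attempts run out.
/// `run_agents` regenerates the change set; `gate` is the executable check.
fn run_pipeline(
    max_attempts: u32,
    mut run_agents: impl FnMut(),
    gate: impl Fn() -> bool,
) -> bool {
    for attempt in 1..=max_attempts {
        run_agents(); // generate or regenerate the change set
        if gate() {
            return true; // gate open: the app launches, ship it
        }
        eprintln!("attempt {attempt} failed the gate; agents go again");
    }
    false // gate stayed shut: stop the pipeline
}
```

The decision point never consults a model. The gate opens or it doesn’t, and that alone determines whether the pipeline proceeds.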
The 2 hours I spent debugging that morning were not an agent failure. They were a gate failure. There was no gate. There was a model politely agreeing with another model, and an app that wouldn’t open.
That is not a pipeline. That is theater.