Two Things. Brief and Gate.
Building AI Autopilot for code, research, and workflows.
Agents ran overnight. 50 tasks completed. Review stage passed. I opened the app in the morning — it wouldn’t launch. Startup was broken entirely. I spent 2 hours with Claude Code untangling it.
The agents did exactly what they were asked. That was the problem.
What Happened
NBP is a Tauri 2 + Rust app I’m building on macOS. I handed a batch of tasks to the pipeline before bed. Review ran — a second model checked the output, flagged nothing, passed everything. By morning, 50 tasks were marked done.
The app didn’t open.
The review model looked at the changes and decided they looked right. They looked right. They weren’t right. There’s a difference. Startup was broken because assumptions had compounded across 50 tasks. Each agent filled a gap the brief left open. Each assumption enabled the next. By task 50, I wasn’t debugging code — I was debugging a chain of decisions no human made.
Two hours later, it was fixed. The agents were fine. The brief was vague. The gate was fake.
The Brief
The brief is not a prompt. It’s a contract.
It answers three questions before any agent touches anything:
- What is the exact deliverable?
- What does success look like, concretely?
- What is out of scope?
If you can’t answer all three, the brief isn’t done.
A vague brief means agents fill gaps. Agents always fill gaps — that’s what they do. One gap, one assumption. One assumption enables the next. Across 50 tasks, assumptions stack. The output looks confident. The model doesn’t signal uncertainty. It just proceeds.
By the time you look at the result, the confidence reads as correctness. It isn’t.
A brief that says “improve startup performance” is not a brief. A brief that says “reduce cold start time below 800ms, no changes to initialization order, no new dependencies” is a contract. Agents can be gated against a contract. They cannot be gated against a vibe.
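To make that concrete, here is a sketch of what such a brief might look like as a document. The numbers and scope lines are illustrative, not from the actual NBP brief:

```markdown
# Brief: reduce cold start time

Deliverable: cold start completes in under 800ms on a release build,
averaged over 5 launches.

Done means:
- Release build succeeds with no new warnings
- Measured cold start < 800ms on the reference machine
- All existing tests pass

Out of scope:
- Changes to initialization order
- New dependencies
- Any UI changes
```

Every line is checkable. That is what makes it a contract rather than a vibe.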
The Gate
A gate is binary. Pass or fail. Nothing in between.
“Does this look correct?” is not a gate. It’s a question. Questions don’t stop bad output from moving forward. “Does the app launch?” is a gate.
The gate checks against the brief exactly. Not “is this reasonable?” — “does this match what was specified?” Those produce different answers. The second one is testable. The first one is guesswork.
A model reviewing another model’s output is two models agreeing. That’s not validation. The review passed because the code looked plausible. Plausible is not the same as functional.
Tests run. Models read. Only one of those is a gate.
Running the gate at the end of 50 tasks tells you the pipeline failed. It doesn’t tell you where. A gate after each task catches failures immediately. The failure stays local. Fixing one bad task is not the same as debugging 2 hours of compounded assumptions.
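A per-task gate can be as small as a loop over commands whose exit codes decide pass or fail. A minimal Rust sketch — the `true` commands are stand-ins for real checks like a release build or a launch smoke test, which the actual pipeline would run instead:

```rust
use std::process::Command;

/// Run one binary gate check: the command either exits 0 (pass) or not (fail).
/// There is no third state -- no "looks reasonable".
fn gate(name: &str, cmd: &str, args: &[&str]) -> bool {
    let ok = Command::new(cmd)
        .args(args)
        .status()
        .map(|s| s.success())
        .unwrap_or(false);
    println!("{name}: {}", if ok { "PASS" } else { "FAIL" });
    ok
}

fn main() {
    // Run after EVERY task, so a failure stays local to that task
    // instead of compounding across the batch.
    // `true` is a stand-in for a real check (build, launch test, etc.).
    let checks: &[(&str, &str, &[&str])] = &[
        ("app builds", "true", &[]),
        ("app launches", "true", &[]),
    ];
    let all_pass = checks.iter().all(|&(n, c, a)| gate(n, c, a));
    std::process::exit(if all_pass { 0 } else { 1 });
}
```

A nonzero exit stops the pipeline at that task. The next task never starts on top of a broken one.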
The Trap
Here’s how it happens.
You hand off 50 tasks. You watch them complete. The review model passes everything. It all looks busy and productive. You feel like something got done.
Then you write the gate to match what you have.
That’s the trap. The moment you see the output, your criteria shift. You start evaluating what you got instead of what you asked for. “This is mostly right” becomes “this passes.” The gate collapses because you wrote it after you already decided the output was acceptable.
Brief and gate written before agents start. Not after. That’s the rule.
When you know you’ll have to write a testable gate, vague briefs become impossible. You cannot write a gate for “make it better.” The constraint forces precision upstream.
In Practice
Before any agent run, two documents:
The brief. Exact deliverable. Explicit scope boundaries. Definition of done in concrete terms.
The gate checklist. Binary checks only. Either the output satisfies each one or it doesn’t. No partials. No judgment calls. “App launches in under 2 seconds on cold start” is on the list. “Code looks clean” is not.
The gate runs programmatically where possible. Where it can’t, the checklist is specific enough that a human check takes under a minute.
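The checklist itself can live next to the brief. A sketch — the items are invented for illustration, except the cold-start check from above:

```markdown
# Gate: startup batch

- [ ] Release build exits 0
- [ ] App launches in under 2 seconds on cold start
- [ ] All existing tests pass
- [ ] No new dependencies added
- [ ] Human check: main window renders (under a minute)
```

Each item is yes or no. Nothing on the list asks for an opinion.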
Bad output fails fast. Good output moves forward. The pipeline doesn’t accumulate ambiguity across 50 tasks.
That’s the whole system. Two documents. Written before the agents start.
Next: [coming soon]