Sergey Kopanev: you sleep — agents ship

Building AI Autopilot · Part 6

The Orchestrator Trap


Building AI Autopilot for code, research, and workflows.

I’m building HLTM — a personal automation layer that runs AI agents on my projects so I don’t have to babysit every task. The goal: drop a brief, walk away, come back to working code.

To do that, I needed an orchestrator. Something to read the brief, break it into tasks, hand them to the right agents, and verify the result.

I built it out of Claude.

It worked on the first try.

Claude called Claude. The planner broke the task into steps. The coder executed them. The reviewer checked the output.

Beautiful. Elegant. Completely wrong.

What I Built

The idea was simple.

One agent plans. One agent codes. One agent reviews.

Each agent gets a skill — a markdown file with instructions. The orchestrator reads the output, decides what to call next, passes context forward.

autopilot → [reads brief] → calls planner
planner → [writes tasks] → calls coder
coder → [writes code] → calls reviewer
reviewer → [checks code] → calls autopilot

I called it hltm-autopilot. It ran entirely inside Claude Code.

Claude orchestrating Claude.

The First Sign

Two days in, the reviewer started hallucinating problems.

Not small problems. Confident, detailed, completely fabricated problems.

“The function hashUser() has a SQL injection vulnerability.”

There was no hashUser(). There was no SQL.

The reviewer invented it.

I fixed the prompt. Ran it again. Different hallucination, same confidence.

I didn’t understand yet. I thought it was a prompt problem.

It wasn’t.

The Cascade

Here’s what actually happens when Claude calls Claude.

The orchestrator runs. It builds context. It calls the planner. The planner adds its own context. Now there’s twice as much to track.

The planner calls the coder. The coder reads everything — the original brief, the orchestrator’s reasoning, the planner’s task breakdown. All of it. Context is now enormous.

By the time the reviewer runs, it’s reading a wall of text.

Half of it is stale. A quarter is the orchestrator’s internal reasoning, not intended as instructions. The reviewer doesn’t know what’s current and what’s history.

So it guesses.

And it guesses confidently.
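The accumulation is easy to see in miniature. A toy sketch in bash, where each stage appends its transcript to a shared log and every later stage re-reads all of it (the stage names and log file are illustrative, not HLTM's actual pipeline):

```shell
# Toy model of context accumulation: every stage appends its full
# transcript, and every later stage re-reads everything before it.
: > context.log
for stage in autopilot planner coder reviewer; do
  prior=$(wc -l < context.log)
  echo "[$stage] reasoning, tool output, intermediate notes" >> context.log
  echo "$stage reads $prior prior lines, then adds its own"
done
```

Four stages in, the reviewer is already reading three transcripts that were never written for it. Real runs add tool output and file contents on top.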

The Spawn Problem

It gets worse.

Every agent spawn is a fresh API call. Fresh calls can fail. Rate limits. Timeouts. Malformed output.

When the planner fails, it doesn’t fail gracefully.

It returns something that looks like a plan. Partial. Incoherent. Formatted correctly enough that the orchestrator continues.

The coder gets a broken plan.

Executes it anyway.

The reviewer gets broken code from a broken plan.

Approves it. Because by this point, the context is so polluted it can’t tell broken from working.

One bad spawn, three agents down the line, code is merged.
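The failure is silent because nothing checks the shape of the output before the next spawn. A guard that would catch it can be a few lines of shell. Here `valid_plan` and the JSON shape are hypothetical, just to show the idea; real validation would check a schema:

```shell
# Hypothetical guard: refuse to forward a plan that doesn't even look
# structurally complete. Not HLTM's code, just the shape of the check.
valid_plan() {
  local plan="$1"
  [ -n "$plan" ] || return 1               # empty output: fail loudly
  case "$plan" in
    '{'*'}') ;;                            # must open and close as JSON
    *) return 1 ;;
  esac
  printf '%s' "$plan" | grep -q '"task"'   # must mention at least one task
}

if valid_plan '{"steps": "partial output'; then
  echo "forwarded to coder"
else
  echo "rejected before the coder ever runs"
fi
# → rejected before the coder ever runs
```

A partial, truncated plan fails the check and the run stops at the planner, instead of three agents later at merge.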

I didn’t notice until I ran it on a real project.

The Rebuild Loop

I thought I could fix it with better prompts.

v3.0.0: Stricter instructions. More explicit rules. NO ASSUMPTIONS added everywhere.

Still cascaded.

v4.0.0: Atomic skills. Each agent isolated. No cross-references between skills.

Better. Then broke in a different place.

v5.0.0: Autopilot as router. Agents don’t talk to each other, everything goes through the center.

Two days. Six commits. Still wrong.

v6.0.0: Major refactor. Slim orchestrator. Structured references. Opus for thinking tasks.

This one almost worked.

Four days. Six versions. Same architecture every time.

I was patching the symptoms.

The Real Problem

The orchestrator was Claude.

Every layer assumed Claude’s output format. Claude’s tool format. Claude’s context window. Claude’s way of signaling completion.

I couldn’t swap the coder for a cheaper model without rewriting the orchestrator.

I couldn’t run the reviewer in Docker without losing tool access.

I couldn’t test a single stage without running the whole pipeline.

The architecture was Claude-shaped.

And Claude’s failure modes — confident hallucinations, context poisoning, silent errors — were baked into every layer.

The vendor lock wasn’t the problem.

It was the symptom.

The real problem: I used an LLM to do a job that doesn’t need intelligence. It needs reliability.

The orchestrator’s job is to call the right agent at the right time. That’s not a reasoning task. That’s a state machine.

A state machine you build once and trust.

Not a model that re-reasons it every time with whatever context happened to accumulate.
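What that state machine might look like, sketched in bash. The `run_agent` helper, the stage names, and the files are all hypothetical stand-ins (the helper is stubbed so the sketch runs), not the actual replacement:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical stand-in for whatever actually invokes a model;
# stubbed here so the sketch runs end to end.
run_agent() { echo "agent:$1 input:$2"; }

state="plan"
while [ "$state" != "done" ]; do
  case "$state" in
    plan)   run_agent planner brief.md   > tasks.json; state="code"   ;;
    code)   run_agent coder   tasks.json > patch.diff; state="review" ;;
    review)
      # An exit code, not a vibe: the reviewer either passes or it doesn't.
      if run_agent reviewer patch.diff > review.txt; then
        state="done"
      else
        state="code"    # rejected: back to the coder, nothing re-reasoned
      fi
      ;;
  esac
done
```

The transitions are fixed. Nothing guesses what to call next, and because of `set -e`, a failed spawn stops the run instead of flowing downstream.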

What Comes Next

I deleted all six versions.

hltm-autopilot. hltm-implementation. hltm-actualize. Gone.

Replaced them with 300 lines of bash.


Next: Why every prompt became a lie, and the one rule that fixed it.