“Works on My Machine” Is Not a Passing Test
Building an AI autopilot for code, research, and workflows.
The develop stage runs on the host Mac. The review stage runs in Docker.
HLTM is a personal automation layer that runs AI agents autonomously. A bash loop, hltm-loop.sh, cycles agents through stages with fresh context each round. Develop writes the code; review decides if it actually works.
One rule for review: if it doesn’t build in the container, the task fails. No exceptions.
I added this boundary after too many “fixed” tasks broke on the next machine.
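Under those rules, the loop reduces to something like this sketch. The stage names and the <loop:failed> sentinel come from this post; everything else, including how an agent stage is actually invoked, is an assumption:

```shell
#!/bin/sh
run_stage() {
  # Hypothetical stand-in for launching an agent with fresh context;
  # the real invocation is whatever drives the develop/review agents.
  echo "running stage: $1"
}

while :; do
  run_stage develop >/dev/null        # agent writes code on the host Mac
  result=$(run_stage review)          # agent builds and tests in Docker
  case "$result" in
    *'<loop:failed>'*) continue ;;    # build broke: back to develop
    *) break ;;                       # review passed: task complete
  esac
done
```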
The Stage Boundary
The developer agent has everything. Your full tool stack, your cached deps, your globally installed binaries, your muscle-memory paths hardcoded into configs.
It builds. Of course it builds. It has everything.
That’s not a review. That’s a demo.
Docker enforces a different question: does it work with only what’s in the repo and what you explicitly declared?
Each review round spins up a fresh container. Branch checkout. Full build, lint, tests. If any step fails, the agent emits <loop:failed> and the task goes back to develop. No partial passes. No “it mostly worked.”
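One round could look roughly like this. Only the fresh container, the full pipeline, and the <loop:failed> sentinel are from this post; the image name, the npm commands, and the RUNNER indirection are assumptions:

```shell
#!/bin/sh
# Hypothetical sketch of a single review round. RUNNER defaults to a
# throwaway Docker container; the image name is made up.
RUNNER="${RUNNER:-docker run --rm hltm-review}"

review_round() {
  # Clean checkout, then the full pipeline: any failing step fails the task.
  $RUNNER sh -c "git checkout '$1' && npm ci && npm run lint && npm test" \
    || { echo '<loop:failed>'; return 1; }
}
```

Making RUNNER overridable keeps the round testable without Docker; in the real loop it is always a fresh container.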
What It Actually Caught
Three bugs that “worked on the host” made it to Docker and died there.
A missing dependency in package.json. The package was installed globally on the Mac. The developer agent ran it fine. The container had never seen it.
A hardcoded absolute path. /Users/skopanev/Projects/... buried in a config file. Obvious in retrospect. Invisible until something outside that path tried to run it.
A test that passed because of a leftover build artifact. The test wasn’t testing the new code — it was hitting a cached result from a previous run. Clean container, no cache, test failed.
None of these would have shipped to a real environment and worked.
The Cargo Shim Problem
The project uses Rust.
Rust on an ARM Mac does not cross-compile to Linux out of the box. The toolchain is host-specific. Running cargo build inside a Linux container on an M1 Mac fails in ways that have nothing to do with your code.
Solution: cargo inside Docker is a shim.
#!/bin/sh
# Inside Docker: /usr/local/bin/cargo (the shim)
rm -f /tmp/hltm-bridge/exit_code          # clear the previous run's result
echo "$*" > /tmp/hltm-bridge/request      # hand the command to the host
while [ ! -f /tmp/hltm-bridge/exit_code ]; do sleep 0.2; done
cat /tmp/hltm-bridge/response             # relay the host's cargo output
exit "$(cat /tmp/hltm-bridge/exit_code)"  # and its exit code
When the review agent runs cargo build, the shim writes the command to /tmp/hltm-bridge/request. A watcher on the host Mac reads it, runs real cargo, writes the output and exit code back. The agent in Docker gets the result as if cargo ran locally.
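The host side is a small watcher. A minimal sketch, assuming the bridge file names from the shim; the function name and the overridable CARGO binary are illustration:

```shell
#!/bin/sh
# Hypothetical sketch of the host-side watcher: wait for a request,
# run real cargo, hand output and exit code back through the bridge.
BRIDGE="${BRIDGE:-/tmp/hltm-bridge}"
CARGO="${CARGO:-cargo}"   # real cargo on the host Mac

serve_one() {
  while [ ! -f "$BRIDGE/request" ]; do sleep 0.2; done
  args=$(cat "$BRIDGE/request")
  rm -f "$BRIDGE/request"
  $CARGO $args > "$BRIDGE/response" 2>&1
  echo $? > "$BRIDGE/exit_code"   # written last: the shim polls for this file
}
```

Writing exit_code last matters: the shim in the container polls for that file, so output must already be complete when it appears.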
The bridge is mounted as a tmpfs volume. No files persist between runs. No state leaks from one review round to the next.
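The tmpfs mount itself is a single docker flag. The --mount syntax is real Docker; the helper function and image name wrapping it here are hypothetical:

```shell
#!/bin/sh
# Hypothetical helper: emit the flag that puts the bridge on tmpfs, so
# bridge files live in memory and vanish with the container.
# Usage: docker run --rm $(bridge_flags) hltm-review-image
bridge_flags() {
  printf -- '--mount type=tmpfs,destination=%s' "${1:-/tmp/hltm-bridge}"
}
```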
The agent doesn’t know any of this. It runs cargo build. It gets a result. Pass or fail.
The Trade-Off
Docker review adds time to each round. Container startup. Full clean build instead of incremental. Worth it.
The alternative is trusting that what worked on the dev machine works everywhere. That trust has a cost. You pay it later, in a different environment, when the fix is harder.
Catching a missing package.json entry in review costs thirty seconds.
Catching it in production costs your afternoon.