Last Friday, I ran an experiment I'd been building toward for months. I typed one command, walked away from my computer, and came back to find 7,808 lines of code refactored across 36 files. The whole thing took 21 minutes.
This wasn't a demo. It was real code for a client project. And it worked.
Well, mostly. There was one bug. Took about 30 seconds to fix.
How I Got Here
If there's a 101 class for AI-assisted coding, the first lesson is: actively manage your context window. Don't rely on auto-compacts. Don't let the conversation grow until the model forgets what you're building. Plan upfront. Break work into context-window-sized chunks. Burn the tokens on planning so execution stays sharp.
I learned this the hard way. The advanced nuance: AI gets worse even within a single context window. The first 50% of a conversation is sharp. After that, quality degrades, even for the best of the bunch (Opus 4.5). For real work, this is a disaster.
So I became ruthless about planning. And planning means more than just planning the build. I spend my cycles, and I spend my tokens, proportionately on architecture and planning before any coding begins. I pit AI against AI during the architectural stages, using different models to critique and compare approaches. I run security reviews upfront. I run plans through our Pattern Police, a reference architecture agent that evaluates for consistency and adherence to our standards. Only after all of that do I break work into phases. Each phase gets fresh context. Self-contained prompts with everything included, no "see previous conversation" references. I think through sequencing, identify what can run in parallel. Progress is tracked in files that survive restarts.
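Concretely, that planning output lives in plain files on disk, something along these lines (the names are made up for illustration, but the shape is the point):

build/
  phase-1-prompt.md    self-contained: goal, files to touch, constraints, how to validate
  phase-2-prompt.md
  phase-3-prompt.md
  ...
  progress.yaml        which phases are done, failed, or pending; survives restarts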
Plan, plan, plan. Then build.
Here's what I'd already automated: the architectural design agents that pit models against each other. The build planning agents that break work into phases. The Pattern Police that enforce our standards. All of that runs without me in the middle. Those were game-changing improvements.
But when it came time to actually execute the build, I was still the bottleneck. For a five-phase build, I'd spend an hour opening and closing sessions, manually launching each phase, babysitting the process. The planning was agentic. The build was not.
Then I found Geoffrey Huntley's Ralph.
The Missing Piece
Ralph's insight was embarrassingly simple: write a shell script that does what I was doing manually.
Loop through the phases. Invoke Claude for each one. Check if it succeeded. If it failed, retry with more reasoning. Track progress in a YAML file. That's it.
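Stripped down, it's barely more than a for loop. Here's a bare-bones sketch of the shape, not our actual run-build.sh: the file names, the progress format, and the way I nudge the model toward more reasoning on retry are all placeholders.

#!/usr/bin/env bash
# run-build.sh -- bare-bones sketch of the Ralph-style loop.
# File names, progress format, and the retry hint are illustrative.
set -uo pipefail   # no -e: a failed attempt should trigger a retry, not kill the run

PHASES=(phase-1 phase-2 phase-3 phase-4 phase-5)
PROGRESS=progress.yaml

for phase in "${PHASES[@]}"; do
  # A restart skips anything already marked done, so progress survives interruptions.
  grep -q "^${phase}: done" "$PROGRESS" 2>/dev/null && continue

  prompt="build/${phase}-prompt.md"
  extra=""        # escalation hint, only added after a failure
  attempts=0

  while (( attempts < 3 )); do
    attempts=$((attempts + 1))

    # Non-interactive Claude Code run with the self-contained phase prompt.
    claude -p "$(cat "$prompt")${extra}" --dangerously-skip-permissions

    # The phase's own validation command decides success, not Claude's exit code.
    if bash "build/${phase}-validate.sh"; then
      echo "${phase}: done" >> "$PROGRESS"
      break
    fi

    # Retry with a nudge toward more deliberate reasoning. How you escalate is
    # up to you; appending to the prompt is the simplest version.
    extra=$'\n\nThe previous attempt failed validation. Re-read the plan and think step by step before editing.'
    echo "${phase}: attempt ${attempts} failed" >> "$PROGRESS"
  done

  # Three strikes: stop and hand control back to a human.
  grep -q "^${phase}: done" "$PROGRESS" || { echo "${phase} failed" >&2; exit 1; }
done

echo "BUILD COMPLETE!"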
I'd read Anthropic's post on effective harnesses for long-running agents and absorbed some of those ideas already. But Ralph showed me the execution layer I was missing. Not a framework. Not infrastructure. Just a bash script that replaces me as the orchestrator.
Here's a peek behind the curtain: we've built our entire company on Claude Code. We have a reference architecture that documents our standards, automates our processes, and ensures everything is done according to our best practices. When it came time to incorporate Ralph into how we work, I updated that reference architecture and the associated agents we've already built: the architectural design agent, the build planning agent, the build execution agent. I added a run-build.sh script that implements Ralph's approach. The script handles sequential execution, failure detection, context exhaustion recovery, all the edge cases I was handling manually.
This experiment was the first real test. Could the script actually replace me?
The Experiment
I had a working prototype: about 4,700 lines of Python spread across a flat src/ directory. Well-written code, but we'd decided to refactor it into a modular, component-based architecture for reuse across additional use cases down the road.
BEFORE: src/*.py (monolithic codebase)
AFTER:  core_lib/{models,processing,analysis,output}/*.py
        agents/review/*.py
This refactoring would touch the entire application. Not super complicated work, but time-consuming. Even babysitting AI through the build was going to eat up a couple hours. I'd been postponing it for days.
This time, I wanted to walk away.
The Plan
I already had a detailed refactoring plan: 570 lines of markdown covering five sessions of work. My Ralph-enabled build agent converted this into a machine-readable YAML file with phases, dependencies, and validation commands. Self-contained prompts for each phase. A progress tracker that would survive interruptions.
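The generated plan isn't anything fancy. Its shape is roughly this; the field names and phase ids here are illustrative, not the real file:

# build-plan.yaml -- shape is illustrative
phases:
  - id: phase-1-core-models
    prompt: build/phase-1-prompt.md
    depends_on: []
    validate: build/phase-1-validate.sh
  - id: phase-2-processing
    prompt: build/phase-2-prompt.md
    depends_on: [phase-1-core-models]
    validate: build/phase-2-validate.sh
  # ...phases 3 through 5 follow the same pattern
progress: progress.yaml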
So, I typed ./run-build.sh and walked away.
First run failed. The terminal said "BUILD COMPLETE!" but the output directory was empty. Every phase had asked for file permissions and then exited. Claude Code needs approval to write files. In unattended mode, there's no human to approve. The fix: --dangerously-skip-permissions. One flag I'd forgotten to add.
Second run: 21 minutes later, actually complete. Each phase ran its validation checks: package installs, import smoke tests, architectural lints, Docker builds, health endpoint checks. 5,005 lines of Python, organized into a clean module structure.
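The validation commands themselves are nothing exotic. For a typical phase they amount to a handful of shell commands in this spirit; the module names, image tag, and the lint helper are placeholders:

# Illustrative phase validation -- names are placeholders
set -euo pipefail
pip install -e .                        # package installs
python -c "import core_lib, agents"     # import smoke test
python scripts/lint_architecture.py     # architectural lint (hypothetical helper)
docker build -t refactor-smoke .        # Docker build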
One bug remained. Docker Compose crashed on startup: a Pydantic forward reference issue that only surfaces when all models load together at runtime. The validation was import-level, not runtime-level. 30 seconds to fix.
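The lesson was cheap: add a runtime-level check to the final phase. Even something crude like this, with the port as a placeholder, would have surfaced the crash before I got back to my desk:

# Runtime smoke test: boot the whole stack and hit it
docker compose up -d --build
sleep 10                                  # crude; poll for readiness in real life
curl -fsS http://localhost:8000/health    # non-zero exit if the app crashed on startup
status=$?
docker compose down
exit $status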
My Rolling Epiphany
This keeps happening. Almost weekly, I have a holy-smokes breakthrough, and when I reflect on it, the breakthrough is rarely because the technology just got better. It's because I wondered: could AI do this too?
Two years ago, the models were the limitation. They hallucinated. They forgot instructions mid-task. That's not the limiting factor anymore. Models and their harnesses are so good that the limitation is almost always our imagination.
I thought I was out in front. Phased builds. Parallel execution. Pattern libraries. And yet. I hadn't considered automating myself out of the build loop. It took 30 seconds of a YouTube video: someone saying, "What if a bash script did the orchestration instead of you?" Once I heard it, building it was trivial. There's no IP here. The script is maybe 200 lines of bash. Anyone could write it. The capability was there. I just hadn't wondered.
What else am I not doing because I haven't wondered the right thing yet?
