A Bad Loop Ships Bad Code Faster: Evals Are the Real Discipline
Here is the sentence that should be taped above every agentic-coding team’s desk: a bad loop ships bad code faster. We first wrote it in passing in Designing Agent Loops That Run While You Sleep, and it has only gotten truer. The entire industry is optimizing for autonomy — longer runs, more parallel agents, less human babysitting. Far fewer people are optimizing for the thing that actually decides whether autonomy is an asset or a liability: verification.
The logic is unforgiving. Once an agent can generate code cheaply, in parallel, all night, generation stops being the bottleneck. As we put it in The Code Agent Orchestra, “the bottleneck is no longer generation, it’s verification.” And a fast generator wired to a weak verifier isn’t a productivity machine — it’s a machine for producing plausible-looking mistakes at a rate no human can review. The faster the loop, the more important the gate. This post is about the gate.
The verification gate, and why it has to come first
Every agentic loop has a moment where it decides “this work is done.” That decision is the verification gate, and it can be built from many materials: tests, linters, type checks, security scans, a golden dataset, or an independent reviewer model. The gate sets the ceiling on quality: the loop cannot reliably ship anything better than what the gate can tell apart from a mistake.
The single most common — and most expensive — mistake is letting the loop run before the gate is trustworthy. It feels like progress: the agent is busy, diffs are landing, the backlog is shrinking. But if the gate can’t reliably tell correct from incorrect, all that velocity is pointed in a random direction. You are not shipping faster; you are accumulating wrong work faster, and discovering it later, when it’s more expensive to unwind. Build the gate before you open the throttle, not after.
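To make "the gate" concrete, here is a minimal sketch of one assembled from independent checks. Everything in it is an assumption for illustration: the names `CheckResult`, `parses`, `no_eval_calls` and `run_gate` are hypothetical, and the two toy checks stand in for a real test runner, linter or security scanner.

```python
import ast
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str = ""


# A check takes candidate source code and returns a CheckResult.
Check = Callable[[str], CheckResult]


def parses(source: str) -> CheckResult:
    """Deterministic check: is the candidate valid Python syntax?"""
    try:
        ast.parse(source)
    except SyntaxError as exc:
        return CheckResult("parses", False, str(exc))
    return CheckResult("parses", True)


def no_eval_calls(source: str) -> CheckResult:
    """Toy stand-in for a security scan: reject direct eval()/exec() calls."""
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return CheckResult("no_eval_calls", False, "could not parse")
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id in {"eval", "exec"}
        ):
            detail = f"call to {node.func.id} on line {node.lineno}"
            return CheckResult("no_eval_calls", False, detail)
    return CheckResult("no_eval_calls", True)


def run_gate(source: str, checks: List[Check]) -> Tuple[bool, List[CheckResult]]:
    """The work counts as done only if every check passes."""
    results = [check(source) for check in checks]
    return all(r.passed for r in results), results
```

The design choice worth noticing is that the gate returns every result, not just the verdict. A loop that only hears "no" tends to retry blindly; one that hears which check failed, and why, has something to fix.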
This connects directly to two ideas we’ve argued elsewhere. It’s the same maker/checker boundary from how agents actually reason — separate the model that writes from the thing that checks — scaled up to govern an entire pipeline. And it’s the hidden risk in the self-improving harness: an agent that optimizes itself against a bad eval doesn’t get better, it gets more confidently wrong, faster. The eval is the thing everything else is steering toward. If it points the wrong way, more horsepower only hurts.
Evals are the new test suite
For decades, the artifact that encoded “is this correct?” was the test suite. In the agentic era, that artifact is the eval — and 2026 has turned eval design into a discipline of its own. The shape that’s emerging on serious teams looks like this:
- A golden dataset drawn from real failures. Not synthetic toy cases — a curated set built from the actual ways your system has gone wrong in production. Your incident history is your most valuable eval material.
- A mix of deterministic and judgment-based scorers. Cheap, exact checks (does it compile, do the tests pass, is the output well-formed) catch the mechanical failures; an LLM-based judge catches the semantic ones a regex never will.
- A judge calibrated against human reviewers. An LLM-as-judge is only trustworthy to the degree it agrees with people you trust. Calibration isn’t optional — an uncalibrated judge is just another confident guesser.
- A CI gate that blocks regressions. The eval runs on every change and stops the ones that make things worse. This is what turns evals from a dashboard you glance at into a gate that actually holds.
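
The scorer mix and the CI gate from the list above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: the two golden cases are made-up placeholders (real ones would come from your incident history), the judge is passed in as a plain function because the actual LLM call depends on your stack, and `well_formed`, `evaluate` and `blocks_regression` are hypothetical names.

```python
import json
from typing import Callable, Dict, List

# Placeholder golden cases; in practice these are curated from real incidents.
GOLDEN: List[Dict] = [
    {"id": "inc-001", "input": "total: 12.50", "expected": {"total": 12.5}},
    {"id": "inc-002", "input": "total: 7", "expected": {"total": 7.0}},
]

# A judge scores one (case, output) pair in [0, 1]. Assumed to wrap an LLM call.
Judge = Callable[[Dict, str], float]


def well_formed(output: str) -> bool:
    """Cheap deterministic check: is the output a JSON object?"""
    try:
        return isinstance(json.loads(output), dict)
    except json.JSONDecodeError:
        return False


def evaluate(system: Callable[[str], str], cases: List[Dict], judge: Judge) -> float:
    """Mean score over the golden set, running the cheap check first."""
    scores = []
    for case in cases:
        output = system(case["input"])
        if not well_formed(output):
            scores.append(0.0)  # mechanical failure: do not pay for the judge
        else:
            scores.append(judge(case, output))
    return sum(scores) / len(scores)


def blocks_regression(candidate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """True when the change scores worse than the baseline and must be blocked."""
    return candidate < baseline - tolerance
```

In CI, the wrapper script would exit non-zero whenever `blocks_regression` returns `True`, which is what turns the score into a gate rather than a number on a dashboard.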
The teams that build this — a real eval program with an incident taxonomy, golden datasets, a deterministic-plus-judge scorer mix, and CI regression gates — gain a structural quality advantage. Not a faster demo; a system that stays correct as it scales.
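Calibrating a judge against human reviewers comes down to measuring agreement on the same items. One common way to do that, sketched here as an assumption rather than the only option, is raw agreement plus Cohen's kappa, which discounts the agreement two raters would reach by chance. The function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def agreement_rate(judge: Sequence[str], human: Sequence[str]) -> float:
    """Fraction of items where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("label lists must be non-empty and the same length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: Sequence[str], human: Sequence[str]) -> float:
    """Agreement corrected for chance: (p_o - p_e) / (1 - p_e)."""
    n = len(judge)
    p_o = agreement_rate(judge, human)
    judge_counts, human_counts = Counter(judge), Counter(human)
    labels = set(judge_counts) | set(human_counts)
    # Chance agreement: probability both raters pick the same label at random.
    p_e = sum(judge_counts[label] * human_counts[label] for label in labels) / (n * n)
    if p_e == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and this sketch chooses to report it as perfect agreement.
        return 1.0
    return (p_o - p_e) / (1 - p_e)
```

For example, with judge labels `["pass", "pass", "fail", "fail"]` and human labels `["pass", "fail", "fail", "fail"]`, raw agreement is 0.75 but kappa is only 0.5, because half of that agreement is what chance alone would produce. What counts as "calibrated enough" is a threshold each team has to set for itself.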
From dashboard to runtime: the agent-as-judge
Two 2026 developments push this further, and both are worth knowing. The first is eval-driven development, where pre-production evals don’t just live in CI — they convert into runtime guardrails. Eval scores start controlling what the agent is allowed to do: which tools it can call, when it must escalate to a human, whether an action ships at all. The eval stops being a report card and becomes an active control in the decision loop — the guardrail layer fed by live measurement.
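As a rough illustration of an eval score acting as a runtime control, the sketch below maps a live score and a requested tool to one of three decisions. The thresholds, the tool names and the `GuardrailPolicy` and `decide` names are all hypothetical; a real policy would be tuned against your own eval data.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet


class Decision(Enum):
    ALLOW = "allow"
    ESCALATE = "escalate"  # hand off to a human
    BLOCK = "block"


@dataclass(frozen=True)
class GuardrailPolicy:
    allow_at: float = 0.9      # at or above this, any tool may run
    escalate_at: float = 0.6   # below this, nothing ships
    # High-risk tools that need the higher threshold (illustrative names).
    restricted_tools: FrozenSet[str] = frozenset({"deploy", "delete_records"})


def decide(policy: GuardrailPolicy, score: float, tool: str) -> Decision:
    """Turn a live eval score into what the agent may do next."""
    if score < policy.escalate_at:
        return Decision.BLOCK
    if score >= policy.allow_at:
        return Decision.ALLOW
    # Middle band: low-risk tools proceed, high-risk tools go to a human.
    if tool in policy.restricted_tools:
        return Decision.ESCALATE
    return Decision.ALLOW
```

Under this policy a score of 0.7 lets the agent read a file but sends a deploy to a human, and anything under 0.6 is blocked outright.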
The second is the agent-as-judge. Instead of scoring only the final output, an autonomous judge-agent observes the intermediate steps — it reads the action log, uses tools, and reasons about the trajectory while the worker agent runs. It can pinpoint which requirement was missed and which step went wrong, not just that the end result failed. As agents take on longer, multi-step tasks, judging the journey rather than only the destination is how you catch the error at step 3 instead of paying for it at step 30. The pattern pairs cheap distilled evaluators running continuously with a heavier agent-judge invoked selectively for deep verification — fast and thorough, without paying for thorough on every call.
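The cheap-evaluator-plus-selective-judge pairing can be sketched as a loop over the action log. This is one possible arrangement under assumptions, not a description of any particular product: both evaluators are passed in as plain functions, and `Step`, `Verdict` and `judge_trajectory` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence


@dataclass
class Step:
    index: int
    action: str
    observation: str


@dataclass
class Verdict:
    failed_step: Optional[int]  # index of the first step judged wrong, if any
    reason: str
    deep_calls: int             # how often the expensive judge was invoked


# Cheap evaluator: confidence in [0, 1] that a single step is fine.
CheapEvaluator = Callable[[Step], float]
# Deep judge: reads the trajectory so far, returns a failure reason or None.
DeepJudge = Callable[[Sequence[Step]], Optional[str]]


def judge_trajectory(
    steps: Sequence[Step],
    cheap: CheapEvaluator,
    deep: DeepJudge,
    suspicion_threshold: float = 0.5,
) -> Verdict:
    """Run the cheap evaluator on every step; escalate only suspicious ones."""
    deep_calls = 0
    for position, step in enumerate(steps):
        if cheap(step) >= suspicion_threshold:
            continue  # looks fine, no deep verification needed
        deep_calls += 1
        reason = deep(steps[: position + 1])
        if reason is not None:
            # Stop at the first confirmed failure instead of at the end.
            return Verdict(step.index, reason, deep_calls)
    return Verdict(None, "no failure found", deep_calls)
```

The returned `failed_step` is what lets the loop stop at step 3 rather than discover the problem at step 30, and `deep_calls` makes the cost of thoroughness visible.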
Why this is the highest-leverage work in Southeast Asia
It would be easy to read “evals” as unglamorous plumbing. It is exactly the opposite — it is the most defensible, most ownable engineering available right now, and it is shaped for Southeast Asia’s strengths.
Here’s why. An eval encodes what correct means for your problem — and that is irreducibly local. A frontier lab can hand you a brilliant model and a generic test harness. It cannot tell you whether a Khmer-language invoice was parsed correctly, whether a Cambodian compliance rule was actually satisfied, or what “done” means for an agricultural co-op’s records. That judgment lives in a golden dataset of your real failures and a judge calibrated against your domain experts. It is built from local knowledge, and it cannot be imported. This is the same thesis as spec-driven development seen from the other side: the spec says what to build, the eval proves it was built right, and both are writing-and-judgment work, not GPU spend.
And the leverage compounds. A team that owns a trustworthy eval can safely turn up autonomy — run more agents, longer, with less supervision — because the gate catches what slips. A team without one cannot, no matter how good its models are. The eval is what lets a small, capital-light team in Phnom Penh or Da Nang scale output without scaling risk. It is the permission slip for everything else in the agentic stack.
What to take from this
Build the gate before you open the throttle. Start a golden dataset from your real incidents today, even a small one. Mix cheap deterministic checks with a judge, and calibrate the judge against people you trust. Put the eval in CI as a gate that blocks regressions, then graduate it into a runtime guardrail. And treat the eval as a first-class, owned artifact — because it encodes the one thing the frontier lab can never ship you: what correct means here.
The race everyone can see is the race to make agents act. The race that actually decides the winners is quieter: making them act correctly, provably, at scale. A bad loop ships bad code faster. A good eval is how you make sure the loop is a good one.