When Agents Get Worse the Longer They Run

Most failure stories about coding agents are about a single bad answer: the agent misread the task, hallucinated an API, shipped a bug. But there’s a quieter, more expensive failure mode that only shows up when you let an agent keep working. It doesn’t fail on turn one. It succeeds — and then it succeeds again, and again, and with each passing iteration the code it’s building gets a little worse, until what started as a clean solution is a bloated, tangled thing that technically passes its checks and is miserable to maintain. The agent gets worse the longer it runs, and for most of that run, nobody notices.

This isn’t a vibe. In 2026 a set of benchmarks was built specifically to measure long-horizon agent behavior, and the numbers are sobering. They tell a consistent story: agents are far weaker at sustained, iterative, real-scale work than the single-shot benchmarks suggest — and a big part of the gap is self-inflicted degradation.

The benchmark built to measure decay

The sharpest of these is SlopCodeBench (arXiv 2603.24755), and its whole design is aimed at the problem above. Instead of one-shot tasks, it gives an agent an evolving specification and makes it repeatedly extend its own earlier solution — 36 problems across 196 checkpoints — while leaving the internal structure of the code entirely up to the agent. That last part matters: most iterative benchmarks constrain the design space so tightly that you can’t see how an agent’s early decisions poison its later ones. SlopCodeBench deliberately doesn’t.

The headline result is bleak. Across 15 coding agents, open and closed, no agent fully solved a single problem end-to-end, and the best one passed just 14.8% of checkpoints. But the more interesting numbers are about quality over time. The benchmark tracks two kinds of decay: structural erosion (complexity piling up in the wrong places) and verbosity (redundant, bloated code). Structural erosion rose across checkpoints in 77% of trajectories; verbosity rose in 75.5%. Compared against 473 real open-source Python repositories, agent code was 2.3x more verbose and 2.0x more eroded — and the human repositories, measured across their own git histories, degraded less often and by smaller margins.

In other words: even when an agent passes a checkpoint, the code behind that pass erodes and bloats with each turn. The checkpoint goes green. The codebase gets worse. That’s the slop, and it accumulates.

It isn’t only “slop” — the long horizon itself is hard

You might read SlopCodeBench as a code-quality story. Two companion benchmarks show it’s also a raw-capability story: agents simply fall apart at real engineering scale.

RoadmapBench (arXiv 2605.15846) grounds its tasks in real open-source version upgrades — 115 long-horizon tasks across 17 repositories and 5 languages, each requiring a median of 3,700 lines of changes across 51 files. This is what a real version iteration looks like, not a single-file bug fix. Tested on thirteen frontier models, the strongest — Claude-Opus-4.7 — resolved only 39.1% of tasks, while the weakest managed 5.2%. The authors are blunt that this is “in stark contrast to existing bug-fix benchmarks,” where scores are far higher: shrink the horizon and agents look great; stretch it to real scale and most of the capability evaporates.

AgencyBench (arXiv 2601.11044) pushes the horizon further still — 32 real-world scenarios, 138 tasks, each requiring an average of 90 tool calls, 1 million tokens, and hours of execution to resolve. Even the best class of models, closed-source frontier systems, hit only 48.4% (versus 32.1% for open-source). When a single task takes ninety tool calls and a million tokens, every weakness in sustained reasoning gets ninety chances to compound.
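To make the compounding concrete, here is a back-of-the-envelope sketch in Python. It rests on a deliberately crude assumption that is ours, not AgencyBench’s: every tool call succeeds independently with the same probability, and a single failure sinks the task. Real agents can recover from a bad step and real failures are correlated, so read it as an illustration of the shape, not a prediction. The function name is hypothetical; the 90 is the benchmark’s reported average tool calls per task.

```python
def chance_of_clean_run(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, ASSUMING independent, identical steps.

    This is a toy model for illustration only; it ignores recovery and
    correlated failures, both of which matter in real agent runs.
    """
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


# 90 = average tool calls per task reported for AgencyBench.
# The per-step success rates are hypothetical inputs, not measured values.
for p in (0.999, 0.99, 0.95):
    print(f"per-step success {p:.3f} over 90 steps -> {chance_of_clean_run(p, 90):.3f}")
```

Under that assumption, even a step that works 99% of the time leaves only about a 40% chance of getting through ninety of them without a miss. Small per-step weaknesses dominate once the horizon is long.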

Put the three together and the pattern is unmistakable. The benchmarks where agents shine are short. The moment the horizon gets long — many iterations, many files, many tool calls — performance collapses, and a meaningful slice of that collapse is the agent degrading its own work along the way.

Why quality decays as the iterations pile up

It’s worth being precise about why this happens, because it’s a different mechanism from the one we covered in our discussion of context engineering. Context rot is a perception problem: the model’s recall decays as the window fills with noise. Long-horizon degradation is downstream of that but distinct: it’s about the artifact decaying as each turn builds on the last.

Here’s the compounding loop. The agent makes an early structural choice — a slightly-too-broad function, a shortcut, a missing abstraction. The next checkpoint asks it to extend that code. Rather than refactor (which is risky and which nothing in the loop rewards), it bolts the new feature onto the existing shape. Now the foundation is a little more eroded, and the next extension is built on that. Each turn passes its checkpoint, so there’s no signal that anything is wrong — but the structure is quietly concentrating complexity and accumulating redundancy. By checkpoint twenty, the agent is reasoning over a codebase its own earlier turns made harder to reason over. Slop begets slop.

This is why it connects so directly to a point we’ve made before: a bad loop ships bad code faster. A long autonomous run with no quality gate doesn’t just risk one bad commit; it risks a slow-motion erosion that’s invisible until you read the diff. SlopCodeBench even tested the obvious fix: explicit quality guidance in the prompt. That improved the starting point, reducing initial verbosity and erosion by up to a third, but it did not change the rate of degradation. Telling the agent “write clean code” makes turn one cleaner. It does nothing about the slope of the decline.
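The difference between moving the starting point and moving the slope is easy to see in a toy model. The sketch below assumes, purely for illustration, that an erosion score grows linearly per checkpoint. The starting score and the rate are made-up numbers, not SlopCodeBench measurements, and the function names are hypothetical. The only thing borrowed from the benchmark is the shape of the finding: guidance lowers the starting level by up to a third and leaves the rate alone.

```python
def score_at(checkpoint: int, start: float, rate: float) -> float:
    """Toy erosion score after `checkpoint` extensions (linear growth is an assumption)."""
    return start + rate * checkpoint


def checkpoints_to_reach(level: float, start: float, rate: float) -> float:
    """How many checkpoints until the toy score climbs from `start` to `level`."""
    if rate <= 0:
        raise ValueError("rate must be positive for the score to climb")
    return max(0.0, (level - start) / rate)


# Hypothetical numbers, chosen only to illustrate the shape of the result.
BASELINE_START = 0.30
RATE_PER_CHECKPOINT = 0.02                   # same rate with or without guidance
GUIDED_START = BASELINE_START * (1 - 1 / 3)  # "up to a third" lower at turn one

for cp in (0, 10, 20):
    plain = score_at(cp, BASELINE_START, RATE_PER_CHECKPOINT)
    guided = score_at(cp, GUIDED_START, RATE_PER_CHECKPOINT)
    print(f"checkpoint {cp:2d}: plain={plain:.2f} guided={guided:.2f} gap={plain - guided:.2f}")

head_start = checkpoints_to_reach(BASELINE_START, GUIDED_START, RATE_PER_CHECKPOINT)
print(f"guided run reaches the unguided starting level after ~{head_start:.0f} checkpoints")
```

In this toy setup the guidance buys a fixed head start (about five checkpoints with these numbers) and nothing more: the two lines stay parallel, so the guided run ends up exactly as far from its own starting quality as the unguided one.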

The harness, not the prompt, is the fix

If a quality instruction can’t stop the decay, what can? The honest answer from the long-running-agent literature is structural, not verbal — and it’s the same discipline we keep coming back to: the loop has to be engineered, not just prompted.

Three moves do the heavy lifting. First, scope each unit of work small. The degradation curve is a function of how many turns ride on the same eroding foundation; if you decompose a feature into independent, well-bounded subtasks, each one starts from a clean structural baseline instead of inheriting twenty turns of slop. Second, reset context per task. A fresh window for each scoped unit denies the rot a place to accumulate, because the durable state lives in artifacts, not in an ever-growing transcript. Third, checkpoint with real recovery points. Commit working state to git after each verified unit so a degraded branch can be abandoned rather than extended. This is exactly the two-part, one-feature-at-a-time architecture we described in “the harness is the product”: an initializer that sets up durable state, then a coding agent that picks one feature, verifies it end-to-end, commits, and hands clean artifacts to the next session.
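Here is one way those three moves might look as a loop, sketched in Python. Everything named here (`Unit`, `RunLog`, `run_harness`, the callback parameters) is hypothetical and ours; it is not taken from any of the benchmarks or from a particular agent framework. The agent call and the verifier are passed in as plain functions so the control flow is the only thing on display, and the two git helpers assume the process is running inside an existing repository.

```python
import subprocess
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class Unit:
    """One small, independently verifiable piece of work (hypothetical shape)."""
    name: str
    spec: str


@dataclass
class RunLog:
    committed: List[str] = field(default_factory=list)
    abandoned: List[str] = field(default_factory=list)


Step = Callable[[Unit], None]


def run_harness(units: Iterable[Unit], work: Step, verify: Callable[[Unit], bool],
                commit: Step, rollback: Step, max_attempts: int = 2) -> RunLog:
    """Work through units one at a time; commit on a verified pass, roll back otherwise."""
    log = RunLog()
    for unit in units:
        for _ in range(max_attempts):
            # `work` is handed only this unit. Starting it with a fresh context
            # (no earlier transcript) is the caller's responsibility.
            work(unit)
            if verify(unit):
                commit(unit)          # recovery point after each verified unit
                log.committed.append(unit.name)
                break
            rollback(unit)            # discard the degraded attempt, do not extend it
        else:
            log.abandoned.append(unit.name)
    return log


def git_commit(unit: Unit) -> None:
    """Record the working tree as a recovery point (assumes cwd is a git repo)."""
    subprocess.run(["git", "add", "-A"], check=True)
    subprocess.run(["git", "commit", "-m", f"unit: {unit.name}"], check=True)


def git_rollback(unit: Unit) -> None:
    """Return to the last commit. Note: `git clean -fd` deletes untracked files."""
    subprocess.run(["git", "reset", "--hard", "HEAD"], check=True)
    subprocess.run(["git", "clean", "-fd"], check=True)


if __name__ == "__main__":
    # Dry run with stand-in callbacks; no agent and no git involved.
    failing = {"parse-config"}
    demo = run_harness(
        [Unit("parse-config", "read settings from a file"),
         Unit("add-cli-flag", "expose a --verbose flag")],
        work=lambda u: None,
        verify=lambda u: u.name not in failing,
        commit=lambda u: None,
        rollback=lambda u: None,
    )
    print(demo)
```

In a real setup you would pass `git_commit` and `git_rollback` as the `commit` and `rollback` arguments and put the agent session behind `work`. The design choice worth noticing is that a failed unit is rolled back and retried from the last good commit, never patched on top of, which is the loop-level version of abandoning a degraded branch.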

The unifying idea: degradation is a property of the loop’s shape, not the model’s IQ. A frontier model in a sloppy long-horizon loop will erode its own work. A modest model in a loop that scopes tightly, verifies hard, and resets often will hold quality far longer. The benchmarks measure agents in long, unbroken runs precisely because that’s the hardest case — and it’s the case a good harness is designed to avoid ever entering.

Why this favors Southeast Asia

Here’s the part that matters for this region, and it’s the same argument sharpened. The fix for long-horizon degradation is not a bigger model or more GPUs — it’s judgment about decomposition, verification, and when to reset. That is systems engineering, and it runs on a laptop.

A team that internalizes this has a real edge. The instinct everyone else is following — point a frontier agent at a big task and let it run for hours — is precisely the regime where these benchmarks show the worst decay and the highest token bills. A team in Phnom Penh or Da Nang that instead scopes work into small verified units gets better output for fewer tokens, which, as we argued in the economics of agentic AI, is the lever that actually compounds. Ninety tool calls and a million tokens per task is an enormous cost surface; the discipline that shrinks the horizon shrinks the bill at the same time it raises quality.

And it’s durable. Knowing how to keep a long run from rotting (where to cut a task, what gate to insert, when to throw away a branch and start fresh) is hard-won engineering judgment that no model release makes obsolete. Frontier labs will keep shipping agents that can run longer. What they can’t ship is the discipline to run them well. That discipline is the most underrated reliability skill in the agentic stack, and it’s available to anyone willing to engineer the loop instead of trusting the run.