Context Engineering: The Context Window Is Your Agent's Real Bottleneck
There’s a quiet assumption underneath a lot of agent disappointment: that if the agent is failing, you need a smarter model. For long-running agents, that assumption is usually wrong. The thing that breaks first isn’t the model’s intelligence — it’s the context window. And the fix isn’t a bigger window or a better model. It’s engineering what the model sees at each step. That discipline now has a name — context engineering — and it is quietly becoming the load-bearing skill of the agentic era.
The counterintuitive part is where it gets interesting. The instinct is to give the agent more context — the full history, every tool result, the whole conversation — on the theory that more information can only help. The evidence says the opposite. More context, past a point, makes agents worse.
The number that should change your mind
A 2026 study, “Less Context, Better Agents,” put this to a clean test. The researchers ran a GPT-5 agent through a 50-task expense-itemization benchmark, exactly the kind of long-horizon, tool-heavy work where context piles up fast, under different context policies.
The full-history agent, holding every tool call and response, scored 71.0% complete itemization. It burned roughly 1.48 million tokens and took 14.56 hours. The agent that kept only its last five tool calls plus a running summary of what it had evicted scored 91.6% — and did it on 553,000 tokens in 5.79 hours. That’s a 63.9% reduction in tokens and 60.2% less wall-clock time, while gaining more than twenty points of accuracy. The result held when they swapped GPT-5 for Claude Sonnet 4.5.
Read that again, because it inverts the intuition. The agent that saw less was more accurate, dramatically cheaper, and far faster. The full history wasn’t helping the model — it was drowning it.
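The percentages are easy to recheck from the totals quoted above. This is a minimal sketch using only the figures cited in this article; the helper name `pct_reduction` is ours. With the rounded token totals, the reduction comes out near 62.6% rather than the cited 63.9%. Our assumption, not something the study states, is that the gap comes from rounding in the totals quoted here.

```python
def pct_reduction(before: float, after: float) -> float:
    """Percentage drop from `before` to `after`."""
    return (before - after) / before * 100

# Figures as quoted in this article (token totals are rounded).
full_tokens, lean_tokens = 1_480_000, 553_000
full_hours, lean_hours = 14.56, 5.79
full_acc, lean_acc = 71.0, 91.6

token_cut = pct_reduction(full_tokens, lean_tokens)
time_cut = pct_reduction(full_hours, lean_hours)
acc_gain = lean_acc - full_acc  # percentage points, not percent

print(f"tokens: -{token_cut:.1f}%  time: -{time_cut:.1f}%  accuracy: +{acc_gain:.1f} points")
```

The time and accuracy figures reproduce exactly as cited (60.2% and 20.6 points).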
Why more context makes agents worse: “context rot”
The mechanism behind this has an industry name now: context rot. As the token count climbs, a model’s effective recall degrades — well before it hits the hard context limit. A 200K-token window does not mean the model reliably uses 200K tokens of information. Long before that ceiling, the signal it needs gets buried under accumulated tool dumps, stale intermediate steps, and dead ends it already abandoned. The window is full of stuff, and the stuff is mostly noise.
This is why a long-running agent degrades over a session even when it never technically “runs out” of context. Each verbose tool response it appends makes the next decision a little harder to ground in what actually matters. Capability doesn’t fall off a cliff; it erodes, turn by turn. The longer the agent runs, the more the rot compounds — and long-running is exactly the direction the whole field is moving.
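To make the compounding concrete, here is a toy model of an append-only log next to a keep-the-last-few policy. The 100 turns and 2,000 tokens per tool response are hypothetical numbers picked for illustration, not measurements, and the model only counts tokens. It says nothing about recall, and it ignores the running summary a real eviction policy would keep.

```python
from typing import List, Optional

def window_sizes(turns: int, tokens_per_turn: int,
                 keep_last: Optional[int] = None) -> List[int]:
    """Tokens resident in the window at each turn.

    keep_last=None models an append-only log; an integer models a policy
    that keeps only the most recent `keep_last` tool responses.
    """
    sizes = []
    for turn in range(1, turns + 1):
        resident = turn if keep_last is None else min(turn, keep_last)
        sizes.append(resident * tokens_per_turn)
    return sizes

# Hypothetical session: 100 turns, 2,000 tokens appended per turn.
full = window_sizes(turns=100, tokens_per_turn=2_000)
lean = window_sizes(turns=100, tokens_per_turn=2_000, keep_last=5)

print("final window:", full[-1], "vs", lean[-1])
print("total tokens re-read:", sum(full), "vs", sum(lean))
```

In this toy model the append-only window grows linearly, but the total the model has to re-read across the session grows quadratically, because every turn re-reads everything before it.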
Context as a managed resource
The productive reframe is to stop treating the context window as a passive log the agent writes to, and start treating it as a scarce, actively managed resource — the same way an operating system treats physical memory. That analogy isn’t loose: 2026 research on context management borrows directly from classical OS design — working-set theory, virtual memory, demand paging — paging the relevant information in on demand rather than keeping everything resident. The model’s attention is the RAM. Your job is the memory manager.
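The memory-manager analogy can be sketched in a few lines. This is a toy under stated assumptions, not an implementation from the research mentioned above: the class name `PagedContext`, the least-recently-used eviction policy, and the item keys are all ours. Everything written is persisted to an external store, only a small working set stays resident, and a read of a non-resident item pages it back in.

```python
from collections import OrderedDict

class PagedContext:
    """Toy LRU memory manager: a small resident set backed by an external store."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.resident = OrderedDict()  # what the model would see in-window
        self.store = {}                # durable storage outside the window
        self.page_faults = 0

    def write(self, key: str, text: str) -> None:
        """Persist new information; it also becomes resident for now."""
        self.store[key] = text
        self._touch(key)

    def read(self, key: str) -> str:
        """Return an item, paging it into the window if it was evicted."""
        if key not in self.resident:
            self.page_faults += 1
        self._touch(key)
        return self.resident[key]

    def _touch(self, key: str) -> None:
        self.resident[key] = self.store[key]
        self.resident.move_to_end(key)         # mark as most recently used
        while len(self.resident) > self.capacity:
            self.resident.popitem(last=False)  # evict least recently used

    def window(self) -> list:
        return list(self.resident)

ctx = PagedContext(capacity=2)
ctx.write("plan", "itemize receipts, then reconcile totals")
ctx.write("receipt_1", "3 items")
ctx.write("receipt_2", "5 items")   # "plan" is evicted from the window
before = ctx.window()
plan = ctx.read("plan")             # paged back in on demand
after = ctx.window()
```

Nothing is lost when an item leaves the window. It simply costs a page-in to get it back, which is the trade an operating system makes too.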
In practice, context engineering converges on a handful of moves — the field summarizes them as write, select, compress, isolate — and each one is a lever you control without touching the model:
- Compress (compaction + summarization). Distill older turns into compact summaries that preserve the decisions and discard the transcript. This is the move that drove the benchmark result above: evict the raw tool calls, keep a running summary. You lose the verbatim noise and keep the signal.
- Write (externalize state). Let the agent offload plans, findings, and progress to durable storage — a scratchpad, a progress file, the git history — instead of carrying everything in-window. The window holds what’s needed now; the rest lives outside it and is paged back in when relevant.
- Select (retrieve on demand). Pull only the specific files, docs, or prior results the current step needs, rather than front-loading everything the agent might use.
- Isolate (sub-agents). Give independent sub-tasks their own focused context windows instead of one bloated shared one. This is a context-engineering argument for the multi-agent pattern we covered in the code agent orchestra: a sub-agent that only knows about one file reasons better than one juggling the whole codebase — partly because its window isn’t rotting.
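Of the four moves, compress is the easiest to sketch. The following is a minimal illustration of the keep-the-last-five-plus-summary policy described earlier, not the study's actual implementation. The class name `CompactingHistory`, the tool names, and the `first_line` summarizer are assumptions; in a real agent the summarizer would more likely be a model call.

```python
from collections import deque
from typing import Callable

def first_line(text: str) -> str:
    """Stand-in summarizer (assumption): keep only the first line of a result."""
    return text.splitlines()[0] if text else ""

class CompactingHistory:
    """Keep the last `keep_last` tool calls verbatim; summarize what is evicted."""

    def __init__(self, keep_last: int = 5,
                 summarize: Callable[[str], str] = first_line):
        self.keep_last = keep_last
        self.summarize = summarize
        self.recent = deque()  # (tool_name, raw_result) pairs, newest last
        self.summary = []      # one short line per evicted tool call

    def add_tool_call(self, name: str, result: str) -> None:
        self.recent.append((name, result))
        while len(self.recent) > self.keep_last:
            old_name, old_result = self.recent.popleft()
            self.summary.append(f"{old_name}: {self.summarize(old_result)}")

    def render(self) -> str:
        """Build the text the model would see on the next step."""
        parts = []
        if self.summary:
            lines = "\n".join(f"- {s}" for s in self.summary)
            parts.append("Summary of earlier steps:\n" + lines)
        for name, result in self.recent:
            parts.append(f"[{name}]\n{result}")
        return "\n\n".join(parts)

history = CompactingHistory(keep_last=5)
for i in range(1, 9):
    # Hypothetical verbose tool output: one useful line plus 50 lines of noise.
    history.add_tool_call(f"read_receipt_{i}",
                          f"receipt {i}: 3 items\n" + "raw OCR line\n" * 50)
prompt = history.render()
```

After eight calls, the three oldest survive only as one-line summaries, while the five newest are still available verbatim. The quality of the summarizer is where the real engineering judgment goes.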
Crossing many windows: the long-running harness
The hardest version of this problem is the agent that has to work across many context windows — a task too big to fit in one session at all. Anthropic’s 2026 engineering guidance on long-running harnesses is blunt about the core difficulty: “each new session begins with no memory of what came before.” And it’s equally blunt that compaction alone doesn’t save you — even a frontier model like Opus 4.5 fails to build production-quality applications from high-level prompts when all it has is a compressed window.
Their answer is architectural, and it’s pure context engineering. A two-part harness: an initializer agent that sets up the environment once — a progress file, a structured feature list with hundreds of items marked passing or failing, an init.sh to run the project — and then a coding agent that, every session, reads the progress file and git log, picks one feature, implements it, verifies it with real end-to-end testing, commits, and leaves clean artifacts for the next session. The context window resets every session; the state persists outside it, in files the next agent pages back in. Incremental progress, one feature at a time, with the window deliberately kept lean.
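The shape of that two-part harness can be sketched as follows. This is our own simplified illustration, not Anthropic's code: the file names `feature_list.json` and `progress.md`, the function names, and the `implement` callback (standing in for the real implement-and-test step) are all hypothetical, and the git commit and `init.sh` steps are left out.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable, List, Optional

FEATURES = "feature_list.json"  # hypothetical file name
PROGRESS = "progress.md"        # hypothetical file name

def initialize(workdir: Path, features: List[str]) -> None:
    """Initializer: runs once and creates the artifacts later sessions rely on."""
    items = [{"name": name, "passing": False} for name in features]
    (workdir / FEATURES).write_text(json.dumps(items, indent=2))
    (workdir / PROGRESS).write_text("# Progress log\n")

def run_session(workdir: Path, implement: Callable[[str], bool]) -> Optional[str]:
    """One coding session: load state from disk, attempt one feature, save state."""
    items = json.loads((workdir / FEATURES).read_text())
    todo = next((item for item in items if not item["passing"]), None)
    if todo is None:
        return None                  # every feature already passes
    if implement(todo["name"]):      # stand-in for implement + end-to-end test
        todo["passing"] = True
        (workdir / FEATURES).write_text(json.dumps(items, indent=2))
        with (workdir / PROGRESS).open("a") as log:
            log.write(f"- done: {todo['name']}\n")
    return todo["name"]

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    initialize(workdir, ["login", "upload receipt", "export csv"])
    # Each call starts with no in-memory state, like a fresh context window.
    done = [run_session(workdir, implement=lambda name: True) for _ in range(4)]
    progress_log = (workdir / PROGRESS).read_text()
```

Note that `run_session` takes no history argument at all. Everything it knows about prior work comes from the files, which is the point of the design.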
This is the same insight as the benchmark, scaled up: the durable memory doesn’t live in the context window. It lives in the artifacts the harness curates around it. The window is working memory; the files are disk. We’ve argued before that the harness is the product — context management is the part of the harness that decides whether a long-running agent stays coherent or quietly rots, and it’s what lets loops run unattended while you sleep without drifting off the rails.
Why this is the highest-leverage, lowest-capital skill in the stack
Here’s the part that matters most for Southeast Asia, and it’s the sharpest version of the argument we keep making. Context engineering requires no GPU, no capital, and no frontier lab. It is almost pure systems judgment. The benchmark result above wasn’t won by training a better model. It was won by someone deciding, carefully, what the model should and shouldn’t see at each step. That decision is engineering, and it’s the kind that runs on a laptop.
That profile fits Southeast Asia’s developers exactly. A small team in Phnom Penh or Da Nang can’t out-train OpenAI, and doesn’t need to. It can absolutely out-engineer a competitor’s context strategy, and the payoff is direct: under per-token pricing, roughly 64% fewer tokens means an inference bill cut by about the same proportion, with higher accuracy on top. Context engineering converts careful thinking straight into lower cost and better results, which makes it the most capital-efficient lever in the entire agentic stack. It is the same GPU-free, durable capability we keep arguing the region should build, and it compounds: every domain (Khmer document workflows, local compliance, an agricultural co-op’s records) needs its own answer to what the agent should keep in working memory, and that answer can only be engineered from inside the problem.
The frontier lab will rent you a million-token window. What it can’t do is decide what belongs in it. That decision — what the model sees, when, and what gets paged out — is the work. And right now, it’s the most underrated engineering job in AI.