Models vs Agents: The Shifting Boundary

Here is a pattern anyone building with AI agents has lived through. You spend weeks engineering scaffolding around a model — a careful chain of prompts to force step-by-step planning, a routing layer to pick the right tool, glue code to decompose a big task into small ones. It works. You ship it. Then the next model drops, does all of that natively, and your clever scaffolding is suddenly dead weight you have to rip out.

This keeps happening, and it raises a genuinely hard architectural question — maybe the architectural question of the agentic era: as base models absorb more of what agent frameworks were built to provide, how much scaffolding still matters? Is the agent layer a temporary crutch the model is steadily swallowing — or is it the part that actually compounds over time?

It’s not an idle debate. The answer determines where you should be spending your engineering effort right now. Let’s take both sides seriously.

Two layers, one stack

First, definitions, because this conversation is drowning in overlapping terms — SDK, scaffolding, framework, harness, agent. Cobus Greyling and others have spent real effort disentangling them, and the distinction that matters is simple. There are two layers:

The model. Raw capability: reasoning, knowledge, planning, the ability to choose and call tools. What you get from the weights.
The scaffolding around it. Everything that turns a raw model into a goal-directed agent — the modular prompts, memory, tool wiring, and orchestration that ZBrain and others describe as “scaffolding.” This is the harness and the loop we’ve written about: the non-model runtime that makes the model do things reliably.

The whole “models vs agents” debate is really a question about the boundary between these two layers — and crucially, which direction it’s moving. Capabilities don’t sit still on one side. They migrate. And the direction of migration is what’s in dispute.

The case for “the model is swallowing the agent”

The strong version of one side, argued forcefully by DevIQ among others, is that the model is swallowing the agent. Each generation of frontier model absorbs abilities that previously required external scaffolding:

Planning and decomposition. Early agents needed elaborate prompt chains to force a model to break a task into steps. Newer models plan natively, in a single call, often better than the hand-built chain did.
Tool selection. Routing logic to decide which tool to use is increasingly redundant — the model just picks correctly, given good tool descriptions.
Multi-step reasoning and execution. The think-act-observe loop that frameworks were built to impose is becoming something models do internally.

The implication is uncomfortable for anyone who has invested heavily in a framework: a lot of agent scaffolding is temporary. It exists to compensate for a model limitation, and the moment the limitation disappears, so does the reason for the scaffolding. On this view, betting your architecture on elaborate hand-crafted process is betting against the one thing that has reliably improved: the model itself.

This is a real and recurring phenomenon. If you’ve ripped out a planning chain because the new model didn’t need it, you’ve felt the boundary move. Ignore this side and you’ll keep building scaffolding with a half-life of one model release.

The case for the harness

But there’s an equally serious counter-argument, and it’s the one we find more convincing for where the leverage actually is.

A 2026 line of work reframes the question as system scaling versus model scaling. The thesis: model scaling (bigger, smarter weights) and system scaling (memory, routing, orchestration, verification, governance) are distinct axes. The model getting better does not, on its own, give you durable memory across long runs, coordination of multiple agents, an audit trail for a regulated decision, or a verification step that catches the model’s confident mistakes. Those are system-level properties. They come from the harness, not the weights.

And here’s the counterintuitive part: as models get more capable and you trust them with longer-horizon, higher-stakes work, the system-level demands don’t shrink; they grow. A more autonomous agent needs more verification, not less. An agent doing hours of work needs more sophisticated memory and context management, not less. An agent making consequential decisions needs more governance and observability, not less. On this argument, the harness has to mature fastest at exactly the moment the model improves, because every capability gain widens what you are willing to delegate.
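To make the verification point concrete, here is a minimal sketch of a verification gate wrapped around a model call. Everything in it is an assumption for illustration: `generate` stands in for whatever model client you use, and `valid_invoice`, `stub_model`, and `run_with_verification` are hypothetical names, not part of any real framework.

```python
import json
from typing import Callable, List, Optional, Tuple

# A check returns an error message, or None when the output passes.
Check = Callable[[str], Optional[str]]


class VerificationError(Exception):
    """Raised when no attempt passes every check."""


def run_with_verification(
    generate: Callable[[List[str]], str],
    checks: List[Check],
    max_attempts: int = 3,
) -> Tuple[str, List[dict]]:
    """Call the model, verify the output, and retry with the failures as feedback."""
    feedback: List[str] = []
    history: List[dict] = []  # doubles as a simple audit record
    for attempt in range(1, max_attempts + 1):
        output = generate(feedback)
        results = (check(output) for check in checks)
        failures = [msg for msg in results if msg is not None]
        history.append({"attempt": attempt, "output": output, "failures": failures})
        if not failures:
            return output, history
        feedback = failures
    raise VerificationError(f"no output passed after {max_attempts} attempts")


def valid_invoice(output: str) -> Optional[str]:
    """Hypothetical check: the output must be a JSON object with a 'total' key."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return "output is not valid JSON"
    if not isinstance(data, dict) or "total" not in data:
        return "missing 'total' field"
    return None


def stub_model(feedback: List[str]) -> str:
    """Stand-in for a real model: wrong on the first try, fixed once it sees feedback."""
    return '{"total": 42}' if feedback else "total is 42"


output, history = run_with_verification(stub_model, [valid_invoice])
```

Nothing in the gate depends on how good the model is. A stronger model fails the checks less often, but the checks and the recorded history are what let you trust the result.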

So both things are true at once, which is what makes this genuinely hard: the model is swallowing the low-level scaffolding (planning chains, routing glue) while the system-level harness (memory, orchestration, verification, governance) becomes more important, not less. The boundary isn’t moving uniformly in one direction. It’s moving up the stack — eating the mechanical layer, exposing the architectural one.

A concrete example makes the split obvious. Two years ago, getting a model to reliably break “refactor this module” into ordered sub-steps took a hand-built planning prompt — pure scaffolding, and exactly the kind the model has now swallowed. But getting three agents to refactor three modules in parallel, share what each learned, reconcile conflicting edits, and produce an audit trail a reviewer can trust afterward? No model improvement gives you that for free. It’s orchestration and verification — system design that you build, that compounds, and that a regulated client will increasingly require. The planning prompt was disposable. The orchestration is an asset.
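A rough sketch of what that orchestration layer might look like follows. It assumes each agent is just a callable that takes a module name and returns proposed edits; the `Edit`, `RunReport`, `orchestrate`, and `make_stub_agent` names are hypothetical, and the stub agents stand in for real model-backed workers.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Edit:
    agent: str
    path: str
    new_text: str


@dataclass
class RunReport:
    accepted: List[Edit] = field(default_factory=list)
    conflicts: Dict[str, List[Edit]] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)


# Assumption: an agent maps a module name to a list of proposed edits.
Agent = Callable[[str], List[Edit]]


def orchestrate(assignments: Dict[str, Agent]) -> RunReport:
    """Run one agent per module in parallel, then reconcile their edits."""
    report = RunReport()
    modules = sorted(assignments)
    with ThreadPoolExecutor(max_workers=max(1, len(modules))) as pool:
        results = list(pool.map(lambda m: assignments[m](m), modules))

    # Group proposals by file so overlapping edits become visible.
    by_path: Dict[str, List[Edit]] = {}
    for module, edits in zip(modules, results):
        report.audit_log.append(f"{module}: {len(edits)} edit(s) proposed")
        for edit in edits:
            by_path.setdefault(edit.path, []).append(edit)

    for path in sorted(by_path):
        edits = by_path[path]
        if len({e.new_text for e in edits}) > 1:
            # Disagreeing agents: hold the file back for a reviewer.
            report.conflicts[path] = edits
            agents = ", ".join(e.agent for e in edits)
            report.audit_log.append(f"{path}: CONFLICT between {agents}")
        else:
            report.accepted.append(edits[0])
            report.audit_log.append(f"{path}: accepted from {edits[0].agent}")
    return report


def make_stub_agent(name: str, shared_text: str) -> Agent:
    """Stub worker: edits its own module plus one shared file."""
    def agent(module: str) -> List[Edit]:
        return [
            Edit(name, f"{module}.py", f"# refactored by {name}"),
            Edit(name, "shared/utils.py", shared_text),
        ]
    return agent


report = orchestrate({
    "billing": make_stub_agent("agent-a", "def helper(): return 1"),
    "auth": make_stub_agent("agent-b", "def helper(): return 2"),
    "search": make_stub_agent("agent-c", "def helper(): return 1"),
})
```

In this run the three per-module edits are accepted and `shared/utils.py` is held back as a conflict, with every decision written to `audit_log`. Swapping the stubs for better models changes the quality of the proposals, not the need for the reconciliation and the log.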

The Bitter Lesson lens

The deepest framing of this tension comes from Ethan Mollick, riffing on Rich Sutton’s famous “Bitter Lesson.” The Bitter Lesson, learned repeatedly across AI history, is that general methods that leverage computation and learning eventually beat hand-crafted, human-engineered approaches. Applied here: will general model learning ultimately beat the carefully engineered agent scaffold?

Mollick’s honest answer is that we’re about to find out — and he poses it as the open question of the moment: does process matter? If the Bitter Lesson holds all the way down, then elaborate agent process is a losing bet against scale, and the model swallows nearly everything. If it doesn’t — if there’s an irreducible layer of system design that no amount of model scale produces for free — then the harness is where durable value lives.

Our read: the Bitter Lesson is real and will keep deleting low-level scaffolding, but it operates inside a system that still has to be designed. A better-learned model makes a well-engineered loop better; it does not write the loop, define “done,” guarantee the data is right, or decide what the agent is allowed to do. The Bitter Lesson eats the cleverness. It doesn’t eat the engineering.

A practical rule of thumb

If you’re making architecture decisions today, here’s the rule that falls out of all this: build scaffolding that’s cheap to delete, and invest durably in the parts that survive model upgrades.

Concretely:

Treat low-level scaffolding as disposable. Planning chains, routing logic, prompt gymnastics to coax a behavior — build them lightly, expect to throw them away, and don’t let them calcify into your architecture. Each one is a bet against the next model, and you will often lose that bet.
Invest heavily in the layers that compound. Evaluations, data quality, verification, memory architecture, observability, and governance don’t get obsoleted by a model release — they get more valuable as you trust agents with more. McKinsey’s work on scaling agentic AI in the enterprise lands in the same place: the durable foundations are organizational and system-level, not a particular model or framework.
Use the three-layer lens. As we argued in our loop-engineering post: adopt the harness, engineer the loop, scale with orchestration. The harness commoditizes and the model improves underneath you, so your durable edge is the middle and upper layers, designed on the assumption that the model will keep getting better.
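One way to keep scaffolding cheap to delete is to make it an optional argument and let a durable evaluation suite decide whether it still earns its place. The sketch below assumes a model is any callable from prompt to answer; `planning_prompt`, `old_model`, `new_model`, and `scaffold_still_helps` are hypothetical stand-ins, and exact-match scoring is a simplification.

```python
from typing import Callable, List, Optional, Tuple

Model = Callable[[str], str]
Planner = Callable[[str], str]

PREFIX = "Plan step by step: "
ANSWERS = {"2+2": "4", "3+3": "6"}


def planning_prompt(task: str) -> str:
    """Disposable scaffolding: a prompt wrapper that forces planning."""
    return PREFIX + task


def run_task(task: str, model: Model, planner: Optional[Planner] = None) -> str:
    """The planner is optional, so deleting it is a one-argument change."""
    prompt = planner(task) if planner else task
    return model(prompt)


def evaluate(model: Model, cases: List[Tuple[str, str]],
             planner: Optional[Planner] = None) -> float:
    """Durable asset: an eval suite that outlives any single model or scaffold."""
    passed = sum(1 for task, expected in cases
                 if run_task(task, model, planner) == expected)
    return passed / len(cases)


def scaffold_still_helps(model: Model, cases: List[Tuple[str, str]],
                         planner: Planner, min_gain: float = 0.05) -> bool:
    """Keep the scaffold only if it measurably beats the bare model."""
    gain = evaluate(model, cases, planner) - evaluate(model, cases)
    return gain >= min_gain


def old_model(prompt: str) -> str:
    """Stub for an older model that only succeeds with the planning prefix."""
    return ANSWERS[prompt[len(PREFIX):]] if prompt.startswith(PREFIX) else "?"


def new_model(prompt: str) -> str:
    """Stub for a newer model that succeeds with or without the prefix."""
    task = prompt[len(PREFIX):] if prompt.startswith(PREFIX) else prompt
    return ANSWERS[task]


cases = [("2+2", "4"), ("3+3", "6")]
keep_for_old = scaffold_still_helps(old_model, cases, planning_prompt)
keep_for_new = scaffold_still_helps(new_model, cases, planning_prompt)
```

With these stubs the planning prompt is worth keeping for the older model and safe to delete for the newer one. The evaluation cases are the part that carries forward across both.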

The teams that get burned are the ones who build a moat out of model-compensating cleverness. The teams that compound are the ones whose investment gets more valuable every time the model improves.

Conclusion

Models vs agents isn’t a question with a single winner — it’s a moving boundary, and the skill is knowing which side of it you’re building on. The model is genuinely swallowing the low-level agent scaffolding, and it will keep doing so; betting against that is betting against the most reliable trend in the field. But the system-level harness — memory, orchestration, verification, governance — isn’t getting swallowed. It’s getting more important, because capability you can’t trust, audit, or stop isn’t capability you can ship.

Build for that. Make your scaffolding cheap to delete and your verification, data, and orchestration durable. Bet on the layers that survive the next model — because there will always be a next model.

This is exactly the kind of architecture call we help teams get right at Inference Loops. If you’re deciding where to invest in your agent stack — and what to leave disposable — let’s talk.