State of AI Coding: Context, Trust, and Subagents
The emerging playbook beyond the IDE
If you're still using an IDE by January 1 – you're a bad engineer!
Two days at the AI Engineering Code Summit in NYC (Nov 20-21) felt less like a conference and more like a working prototype of 2026. And if I had to point out one thing, it would be this:
2026: The Year the IDE Died
That was literally the title of Steve Yegge and Gene Kim’s talk. And while we’ve heard predictions about the “death of the IDE” before, what should we do about it, actually?
What we’re witnessing is a structural change in how engineers think about coding when the coder is no longer a person typing into a buffer. In a sense, the industry that builds software is shifting to a new set of physics: from the physics of a hands-on human to the physics of a distributed system of autonomous workers operating inside verification chains. So what becomes the role of the human? And what kind of system can we trust to build the software we rely on?
This article is an attempt to draw a map that emerged from those two days. If you believe a bigger model will save you, you are thinking about this moment the wrong way. But how to think about it? Let’s discuss the systems that will build the software of the future.
In today’s episode:
What actually changed: from "The Diver" to "The Ant Swarm"
Context engineering: the discipline everyone now practices
Verification as the real moat
The rise of parallel agents and the orchestrator class
Artifacts replaced chat
The struggles are real
The emerging playbook and the new agentic coding stack
Final thought
Resources
What Actually Changed: From "The Diver" to "The Ant Swarm"
Steve Yegge from Sourcegraph has so many bangers and sharp metaphors that I have to use him again. His framing of “an ant swarm” versus “a diver in the ocean” is a great explanation of why scaling the context window doesn’t solve the underlying problem:
“Your context window is like an oxygen tank. You’re sending a diver down into your codebase to swim around and fix things for you. One diver. And everyone says: we’ll just give him a bigger tank. One million tokens. But he’s still going to run out of oxygen!”
No matter how big the tank is, a single agent deep in the weeds eventually suffers from what might be called "nitrogen narcosis" – it loses the thread, hallucinates interfaces, and forgets the original goal. Very familiar, right?
Later, Dex Horthy from HumanLayer gave it another catchy name – the "Dumb Zone." He presented findings from 100,000 developer sessions: once you reach the middle 40-60% of a large context window, model recall degrades and reasoning falters. "The more you use the context window, the worse outcomes you get," Horthy warned.
So the "God Model" approach is dead. You cannot prompt your way out of the Dumb Zone.
What can you do?
Context Engineering: The Discipline Everyone Now Practices
The winning architecture is the "Ant Swarm" (or “Agent Swarm”): a group of specialized agents that operates like a factory. As an example:
The Planner Ant: Reads the issue and writes a spec.
The Research Ant: Greps the repo, reads 5 relevant files, and extracts interfaces.
The Coder Ant: Implements the function in a clean, isolated context.
The Tester Ant: Runs the build and reports the outcome.
Why this matters: This approach solves "Context Pollution." By giving each "ant" a blank slate for its specific task and only returning key information, we avoid filling up the orchestration agent with what can ultimately be distracting information, and we might be able to avoid Horthy's "Dumb Zone" entirely.
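Here is a minimal sketch of what that looks like in code. The agent roles and the `call_llm` wrapper are hypothetical placeholders, not any vendor's actual API – the shape is the point: every ant gets a fresh context, and only compact summaries flow back to the orchestrator.

```python
from dataclasses import dataclass

@dataclass
class AntResult:
    role: str
    summary: str  # only a compact summary flows back upward

def call_llm(system: str, prompt: str) -> str:
    """Stand-in for a real model client (swap in OpenAI, Anthropic, etc.)."""
    return f"[{system.split('.')[0]}] result for: {prompt[:60]}"

def run_ant(role: str, instructions: str, task: str) -> AntResult:
    # Fresh context per ant: only its instructions and its task go in.
    return AntResult(role=role, summary=call_llm(instructions, task))

def orchestrate(issue: str) -> AntResult:
    spec = run_ant("planner", "Read the issue and write a short spec.", issue)
    research = run_ant("researcher", "Grep the repo and list relevant interfaces.", spec.summary)
    patch = run_ant("coder", "Implement the spec using only this research.",
                    spec.summary + "\n" + research.summary)
    # The tester reports only the outcome, keeping the orchestrator's context clean.
    return run_ant("tester", "Run the build and report the outcome.", patch.summary)
```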
This reframes the engineering discipline. Your job is no longer coaxing clever behavior out of a single model (it’s impossible!). Your job is Architecting the Ant Farm – building the clean, deterministic, well-instrumented environment where a swarm of specialized agents can operate without crashing into each other.
So yes, if last year was the year of prompt engineering, this year is the year of context engineering and context management.
The only way to get better performance out of an LLM is to put better tokens in, and then you get better tokens out. Every turn of the loop, the only thing that influences what comes out next is what is in the conversation so far.
But "better tokens" doesn't mean "more tokens." It means Intentional Compaction.
The most effective teams are now adopting the RPI Loop (Research-Plan-Implement) to systematically avoid the Dumb Zone. It works by forcing a hard stop between thinking and doing.
Research: An agent scans the codebase. It does not write code. It produces a compact markdown summary of only the relevant state.
Plan: A reasoning model (or a human) reviews the research and writes a step-by-step plan. This plan compresses "intent" into a clean artifact.
Implement: A separate agent executes the plan with a fresh, empty context window.
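A minimal sketch of the loop, assuming a generic `agent()` wrapper around whatever model you use – the prompts here are illustrative, not any team's actual ones:

```python
def agent(role_prompt: str, input_text: str) -> str:
    """Stand-in for one fresh-context model invocation."""
    return f"{role_prompt[:30]}... -> processed {len(input_text)} chars"

def rpi_loop(task: str, codebase_index: str) -> str:
    # 1. Research: no code written; just a compact markdown summary of relevant state.
    research_md = agent(
        "Summarize only the files and interfaces relevant to the task. Do not write code.",
        f"Task: {task}\nIndex:\n{codebase_index}",
    )
    # 2. Plan: compress intent into a clean, reviewable artifact (a human can edit it here).
    plan_md = agent("Write a numbered, step-by-step implementation plan.", research_md)
    # 3. Implement: a fresh agent executes the plan; it never sees the raw research.
    return agent("Execute this plan exactly and output a unified diff.", plan_md)
```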
It means building what the OpenAI Fine-Tuning team calls "The Harness" – a rigid architectural scaffold that forces the model to behave. Call it "Harness Engineering" if you wish: deliberately making the process slower and harder to ensure the context stays pristine.
Others echoed similar patterns. Beyang Liu from Amp Code described the same failure mode through the lens of tool overload: "Context confusion. The more tools you add in the context window, the more things the agent has to choose from. If the tools are not relevant to the task at hand, it ends up getting confused."
At Anthropic, Katelyn Lesse showcased how they are productizing this discipline via Memory and Context Editing. They are giving developers tools to explicitly "prune" the context window – deleting old tool outputs and irrelevant files – to keep the model’s IQ high. The era of "just dump the whole repo into the prompt" is officially over.
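Whatever tooling you use, the mechanics are simple enough to sketch. The snippet below is an illustrative pruning pass over a chat history, not Anthropic's actual Memory or Context Editing API – the threshold and message format are assumptions:

```python
def prune_context(messages: list[dict], max_tool_chars: int = 2_000) -> list[dict]:
    """Drop stale, oversized tool outputs so the context window stays small and relevant."""
    pruned = []
    for msg in messages:
        if msg.get("role") == "tool" and len(msg.get("content", "")) > max_tool_chars:
            # Replace a huge tool dump with a one-line stub the model can still reference.
            pruned.append({"role": "tool",
                           "content": f"[pruned {len(msg['content'])} chars of stale output]"})
        else:
            pruned.append(msg)
    return pruned
```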
DO NOT OUTSOURCE THE THINKING. The model can only amplify the thinking that you've already done.
Use the RPI loop to force the agent to prove it understands the context before it writes a single line of code. If you skip the Research phase, you are diving straight into the Dumb Zone. That’s not what you want.
The Rise of Parallel Agents and the Orchestrator Class
If humans are unitaskers – capable of focusing on only one thread at a time – agents are fundamentally parallel.
But the current generation of AI tools (think ChatGPT or GitHub Copilot) honors this human limitation. They are reactive. You sit in front of a chat box, type a command, wait for the spinner, read the code, and prompt again. The AI is faster than you, but it is still locked to your linear timeline. You are the bottleneck.
The next generation, exemplified by Kath Korevec from Google’s Jules and new architectures from Replit, is breaking this linearity. They are moving from reactive assistants to proactive, asynchronous swarms. The word “Parallelism” was used a lot.
Imagine this scenario: You leave work at 6 PM. While you sleep, an agent named Jules observes that your dependencies are outdated. It spins up a workspace, reads the changelogs, updates package.json, runs the build, fixes three breaking changes, runs the test suite, and leaves a verified Pull Request waiting for you with a summary of its decisions.
This shift from "chatting" to "dispatching" unlocks true parallelism. Michele Catasta from Replit explained how they are moving toward a world where the primary interface is an orchestrator loop that dispatches tasks to sub-agents running concurrently. He described the initially generated interfaces as full of “painted doors” – at first glance they looked good, but more than 30% of the functionality wasn’t actually there; proactive agents can make sure those missing pieces aren’t forgotten. While one agent is refactoring the database schema, another is updating the frontend types to match, and a third is writing the documentation.

Image Credit: Replit
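The dispatch pattern itself is simple. Here is a toy sketch in Python's `asyncio` – `run_subagent` is a hypothetical placeholder for a sandboxed agent, not Replit's or Jules's actual interface:

```python
import asyncio

async def run_subagent(task: str) -> str:
    """Placeholder for a sandboxed agent working on one well-scoped task."""
    await asyncio.sleep(0)  # the real agent work would happen here
    return f"PR draft for: {task}"

async def dispatch(tasks: list[str]) -> list[str]:
    # Fan the tasks out; the sub-agents run concurrently, not in a chat loop.
    return list(await asyncio.gather(*(run_subagent(t) for t in tasks)))

if __name__ == "__main__":
    drafts = asyncio.run(dispatch([
        "refactor the database schema",
        "update frontend types to match",
        "write the documentation",
    ]))
    print("\n".join(drafts))  # the human Orchestrator reviews and merges these
```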
And you become this Orchestrator. But your job is not to write the loop; it is to define the constraints. You review the plans. You verify the tests. You merge the work. The skill set shifts from syntax recall to systems thinking. Can you define the boundaries of a task clearly enough that a swarm can execute it without its agents crashing into each other?
This is the new leverage. A senior engineer used to be 10x more productive than a junior because they could type the right solution faster. In the Orchestrator era, a senior engineer is 100x more productive because they can effectively manage a swarm of twenty agents working in parallel, while the junior engineer is still stuck chatting with one.
Verification as the Real Moat
Eno Reyes from Factory AI said it best: “Many tasks are much easier to verify than to solve.” A few speakers echoed: predictable AI behavior emerges not from better generation, but from better verification. Verification becomes your trust leverage.
Take Cline’s Nik Pash’s “tea kettle example”:
Task: boil water.
A good verifier asks one question: Is the kettle whistling?
That’s outcome-driven verification. It doesn’t care how you boiled it, or what stove you used – it only cares whether the water is boiling.
A bad verifier, on the other hand, checks for irrelevant process signals: Is the burner on high? Has five minutes passed? Is the lid on? Did you use filtered water? That’s process-driven verification – it measures activity, not results.
Just get a goddamn whistle already.
And if I need to decipher: stop micromanaging the process. Build verifiers that know what success looks like and let agents find their own path to it.
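In code, the contrast is almost embarrassingly small. A toy illustration – the `Kettle` class is purely hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Kettle:
    whistling: bool
    burner_setting: str
    minutes_elapsed: int

def outcome_verifier(k: Kettle) -> bool:
    # One question: did we get the result we wanted?
    return k.whistling

def process_verifier(k: Kettle) -> bool:
    # Measures activity, not results: it fails kettles that found another
    # path to boiling, and passes kettles that never boil at all.
    return k.burner_setting == "high" and k.minutes_elapsed >= 5

kettle = Kettle(whistling=True, burner_setting="medium", minutes_elapsed=8)
print(outcome_verifier(kettle))  # True: the water is boiling
print(process_verifier(kettle))  # False: wrong burner setting, result ignored
```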
Some interesting numbers came from Qodo’s CEO Itamar Friedman:
82% of developers now use AI assistants.
76% still don’t trust their output.
When asked why, most said the same thing: “We don’t trust the context the model has.” The model intelligence is fine – but it’s missing verification context. Without it, confidence collapses.
Friedman’s prescription: “Don’t accept this PR unless it meets your test coverage threshold.” He called these automated quality gates, and they’re becoming the new unit of trust.
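A quality gate can be as small as a script in CI. This sketch assumes a JSON coverage summary with `covered_lines` and `total_lines` fields and an 80% threshold – both are illustrative choices, not Qodo's actual implementation:

```python
import json
import sys

COVERAGE_THRESHOLD = 0.80  # assumed policy, not a number from the talk

def gate(coverage_report_path: str) -> int:
    """Exit non-zero (reject the PR) if coverage falls below the threshold."""
    with open(coverage_report_path) as f:
        report = json.load(f)
    total = report["total_lines"]
    ratio = report["covered_lines"] / total if total else 0.0
    if ratio < COVERAGE_THRESHOLD:
        print(f"REJECT: coverage {ratio:.0%} below {COVERAGE_THRESHOLD:.0%}")
        return 1
    print(f"ACCEPT: coverage {ratio:.0%}")
    return 0

if __name__ == "__main__":
    sys.exit(gate(sys.argv[1]))
```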
And the lesson from Replit that Michele Catasta presented was just as telling:
Out of 22 million creators using their agent, 30% of shipped features were broken.
His diagnosis: users hate testing.
Catasta’s team built autonomous testing agents – not for the users, but for the system itself. Each agent runs end-to-end verification passes after every generation, capturing screenshots, diffing outputs, and surfacing regressions before deployment.
That one decision – automating verification loops – cut failure rates by half and stabilized the rollout of Replit’s agentic workflows.
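Conceptually, that loop is a gate that runs after every generation. A hedged sketch – the function names are placeholders, not Replit's internals:

```python
from typing import Callable

def verify_generation(build: Callable[[], bool],
                      e2e_checks: list[Callable[[], bool]]) -> bool:
    """Run end-to-end verification after an agent's generation; block shipping on regressions."""
    if not build():
        print("regression: build failed")
        return False
    failures = [check.__name__ for check in e2e_checks if not check()]
    if failures:
        print(f"regressions surfaced before deployment: {failures}")  # e.g. screenshot or output diffs
        return False
    return True  # only verified generations are allowed to ship
```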
Capital One’s Max Kanat-Alexander framed the enterprise challenge in one sentence:
“We spend more time reading code than writing it, and even more so now.”
As AI accelerates output, review velocity becomes the bottleneck. The old workflow – write, test, review – breaks when agents write faster than humans can verify. For large teams, the real infrastructure investment isn’t compute – it’s verification tooling. Deterministic tests, trusted build pipelines, and automated validators become the only way to sustain speed without collapse.
“Agents did not attend your meeting that has no transcript,” Kanat-Alexander reminded the crowd. Documentation, specs, and validation logs are not bureaucratic overhead – they’re the substrate AI depends on.
The next decade of agentic infrastructure won’t be built around smarter prompts or bigger models. It will be built around systems that can prove their own correctness.
In the hierarchy of AI leverage:
Models generate.
Agents orchestrate.
Verification decides what’s real.
The Trust Stack in Practice: How Predictability Enables Adoption
What's good for humans is good for AI.
Max Kanat-Alexander's framework synthesizes these threads. His prescription for systems-level thinking connects context engineering to trust outcomes:
Standardize environments → reduces context variability → enables predictable verification
Improve deterministic validation → creates outcome-driven verification → builds predictability
Refactor for testability → makes verification easier → strengthens trust
Write down external context → eliminates tribal knowledge → improves context for all actors
Better tooling, better testing, better messages → "all that is old is new again, good software practices matter"
The Trust Stack is the formalization of what successful organizations discovered independently. What's good for humans is good for AI. Trust grows from predictable behavior. Predictability requires solid infrastructure. And infrastructure benefits everyone equally.
Artifacts Replaced Chat
So if the terminal and the text editor are dying, what is the new interface for intelligence?
It might be the Artifact.
Kevin Hou from Google DeepMind (presenting Antigravity) and the teams at Anthropic are converging on the same UI paradigm: The Agent Manager.
In the old world, you stared at a linear chat stream. But code isn't linear – it's structural.
In the new world, the chat is secondary. The primary focus is the Artifact – a live React component, a Mermaid architecture diagram, a project plan, or a visual diff – and it is constantly evolving.
This solves "Reviewer's Fatigue." You cannot review a stream of 50 chat messages to understand a refactor. But you can review a visual artifact that tracks state changes. We are moving from "chatting with a bot" to "collaborating on a document."
The Struggles are Real
Amidst the optimism of swarms and automated testing, a sobering reality check came from the data. Yegor Denisov-Blanch from Stanford presented findings on the "Productivity Paradox." In Greenfield projects (new apps, clean slate), AI drives massive gains – often boosting productivity by 30% to 40%. But in Brownfield projects (legacy code, complex dependencies, years of tech debt), productivity gains drop to near zero. In some cases, they go negative.

Image Credit: Stanford University
Why? Because legacy code is messy, and AI hates a mess.
Agents thrive in "clean room" environments where types are strict, documentation is current, and tests are deterministic. But real-world corporate codebases are rarely clean rooms. They are filled with "Vibe Coding" slop – code that was written quickly to solve a problem but wasn't engineered to last.
The struggles cluster around three silent killers:
Missing Context: The critical logic is buried in a Jira ticket from 2019, or worse, it’s tribal knowledge. "Oh, don't touch that file, Dave is the only one who knows how it works." An AI agent cannot know context that exists only in Dave’s head.
Flaky Tests: An agent learns from feedback. If your test suite fails randomly 10% of the time, the agent receives noisy signals. It tries to "fix" code that wasn't broken, introducing new bugs in the process (see the small detection sketch after this list).
The "Everything Touches Everything" Trap: In complex legacy systems, changing a CSS class on the login page might break the PDF export in the billing module. Humans navigate this with intuition and fear. Agents navigate it with confidence and chaos.
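On the flaky-tests point: one cheap defense is to classify a test's signal before you let an agent act on it. A minimal sketch – `run_test` is a placeholder for your actual test runner, and five attempts is an arbitrary choice:

```python
from collections import Counter
from typing import Callable

def classify_test(run_test: Callable[[], bool], attempts: int = 5) -> str:
    """Re-run a test several times; only deterministic results should feed an agent."""
    outcomes = Counter(run_test() for _ in range(attempts))
    if len(outcomes) == 1:
        return "pass" if outcomes[True] else "fail"  # stable signal, safe to act on
    return "flaky"  # quarantine it; fix the test before letting agents loop on it
```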
This leads to what Lei Zhang from Bloomberg described as a terrifying multiplier effect: "Two engineers can create the tech debt of fifty engineers [using AI]."
If you unleash a fast, tireless agent on a messy codebase without strict guardrails, it will not clean it up. It will generate spaghetti code at 100x the speed of a human. It will match the style of the existing mess, amplifying the entropy until the system becomes unmaintainable.
This creates the "Rich Get Richer" effect. Teams that have invested in "environment cleanliness" – strict linters, high test coverage, clear documentation – get a massive boost from AI. Teams that are drowning in tech debt find that AI offers them nothing but a faster way to drown. Refactoring is no longer just hygiene; it is the prerequisite for survival.
So, what do we do? How do we survive the transition from the old physics to the new? →
The Emerging Playbook
The rest of this fascinating story is available to our Premium users only. Highly recommended if you want to understand the bigger picture →
Editor of this article: Will Schenk / TheFocusAI
Thank you for reading and supporting Turing Post 🤍 We appreciate you