AI Flywheels: When Workflows Run Themselves

A closed loop is a workflow that feeds itself: the output of one run becomes the input of the next, with no human in between. Link several of these together and point them at a goal, and you have a flywheel – a system that generates, measures its own results, and decides what to try next without waiting for you. Coding agents already work this way. AI research labs are building their entire operation this way. This episode is about what happens when the flywheel starts spinning inside ordinary organizations, how it affects humans – and why the infrastructure to absorb it does not exist yet.

This article is part of our series The Org Age of AI. It is co-written by Will Schenk (TheFocus.AI) and Ksenia Se. Previous episodes: #1: AI Feels Powerful. So Why Is the ROI Still Missing?, #2: The Unsexy Truth of AI Adoption, #3: How to Build an AI-Native Startup from Day One, #4: There Are No AI-Native Enterprises Yet, #5: AI Workflow Patterns: The Real Unit of AI Adoption in 2026

If you need an unbiased view on your transition to becoming AI-native, you can schedule a 1-on-1 consultation with Will here. Will Schenk is a co-founder of TheFocus.AI, where he works directly with companies navigating these transitions.

What's in today's episode:

The ladder: pipeline vs workflow vs AI flywheel
You are already running flywheels
The labs are spinning the biggest flywheel
Why this reaches you on a schedule you don't control
The review bottleneck in closed-loop AI workflows
Three kinds of infrastructure that don't exist yet
The flywheel runs both ways
Which loops to close first
The new divide: machine-speed verification

Not interested in this topic? Check our YouTube, where we explain what recursive self-improvement is.

Pipeline vs Workflow vs AI Flywheel: What's the Difference?

In Episode #5 we drew a line between two things. A pipeline is a fixed sequence of steps – a cron job, a script, plumbing. Whatever branching it has is logical and mechanical; none of it depends on what we'd call human judgment. A workflow is a repeating sequence of decisions and actions with points along the way where a human exercises judgment. Strip the judgment out and a workflow collapses back into a pipeline. We ended on an observation: as a workflow matures, the human migrates from the middle to the edges. They set the parameters at the start and review the exceptions at the end. The middle belongs to the agent.

Imagine now that your organization has many workflows, each one having absorbed a slice of human judgment. What's the next level? What happens when you link them together, point them at a goal, and let them run?

That linked, goal-seeking system is a flywheel. It is a collection of workflows wired so the output of one becomes the input of the next, turning continuously toward an objective you defined once. A closed loop is the smallest flywheel – one workflow feeding itself. Link several and the flywheel gets bigger, but the machine is the same, so we'll use loop and flywheel interchangeably from here.

But what separates a flywheel from a pipeline? A pipeline repeats; a flywheel steers. And the steering wheel is measurement. The system acts, measures the result of its own action, and uses that measurement to decide the next action. Also, a flywheel has three beats, not one: generate, measure, decide what to try next – then generate again.
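The three beats can be sketched in a few lines of Python. This is a toy under stated assumptions, not a production design: `run_flywheel`, the single tunable parameter, and the measurement function that peaks at 3.0 are all hypothetical stand-ins for a real generator and a real verifier.

```python
from typing import Callable


def run_flywheel(
    generate: Callable[[float], float],
    measure: Callable[[float], float],
    start: float,
    step: float,
    turns: int,
) -> tuple[float, float]:
    """Tune one parameter by looping: generate, measure, decide, repeat."""
    best_param = start
    best_score = measure(generate(best_param))
    for _ in range(turns):
        # Generate: try a candidate one step away from the current best.
        candidate = best_param + step
        # Measure: score the candidate's output, with no human in between.
        score = measure(generate(candidate))
        # Decide: keep a winner; otherwise reverse direction and shrink the step.
        if score > best_score:
            best_param, best_score = candidate, score
        else:
            step = -step / 2
    return best_param, best_score


# Hypothetical task: the "output" is just the parameter itself, and the
# measurement is highest (1.0) when the output equals 3.0.
param, score = run_flywheel(
    generate=lambda p: p,
    measure=lambda out: 1.0 / (1.0 + (out - 3.0) ** 2),
    start=0.0,
    step=1.0,
    turns=20,
)
print(param, score)
```

A pipeline would call `generate` twenty times with the same input. The loop above changes its next input based on what it measured, which is the whole difference.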

There are two ways to close that loop: one right and one wrong.

The first is to remove the human checkpoint and hope. This is how most "we deployed autonomous agents" stories begin, and how most of the embarrassing ones end. This is the wrong way.

The second is to replace the human checkpoint with a verifier – something that can tell a good output from a bad one without a person reading it. A test suite. A schema validation. A reconciliation against known totals. A performance metric. The human judgment does not disappear; it gets encoded once, into the verifier, instead of being exercised by hand on every run.

That distinction is the whole episode. Loops do not close because someone decides to trust the model. They close where verification has been made cheap, fast, and objective. Everywhere else, the human stays.
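To make "encoded once, into the verifier" concrete, here is a minimal sketch that combines two of the checks named above: a schema validation and a reconciliation against a known total. The field names, the `Verdict` type, and the invoice scenario are our assumptions for illustration, not a reference implementation.

```python
from dataclasses import dataclass

# Hypothetical schema for one row of agent-produced output.
REQUIRED_FIELDS = {"invoice_id": str, "amount_cents": int}


@dataclass
class Verdict:
    ok: bool
    reasons: list[str]


def verify_batch(rows: list[dict], expected_total_cents: int) -> Verdict:
    """Accept a batch only if every row fits the schema and the total reconciles."""
    reasons: list[str] = []
    # Schema validation: every required field present with the right type.
    for i, row in enumerate(rows):
        for field, field_type in REQUIRED_FIELDS.items():
            if not isinstance(row.get(field), field_type):
                reasons.append(f"row {i}: {field} missing or not {field_type.__name__}")
    # Reconciliation: only meaningful once the schema check has passed.
    if not reasons:
        total = sum(row["amount_cents"] for row in rows)
        if total != expected_total_cents:
            reasons.append(f"total {total} != expected {expected_total_cents}")
    return Verdict(ok=not reasons, reasons=reasons)


batch = [
    {"invoice_id": "A-1", "amount_cents": 1200},
    {"invoice_id": "A-2", "amount_cents": 800},
]
print(verify_batch(batch, expected_total_cents=2000))
print(verify_batch(batch, expected_total_cents=2500))
```

The judgment about what counts as a valid batch was exercised once, by whoever wrote the two checks. Every later run reuses it at machine speed, and a failed verdict is what routes the batch back to a person.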

And notice where the human goes. A workflow moved them from the middle to the edges. A flywheel moves them up a level again – off the work, off the coordination between workflows, and onto the verifier itself. The judgment stays with humans.

AI Flywheel Examples: Coding Agents and Ad Optimization

If this sounds futuristic, look at how software gets written this year. A modern coding agent does not just write code – it runs an experiment: write code, run the tests, read the failures, rewrite, run the tests again. Nobody reviews iteration three of seven. The human reviews the final diff, and increasingly, for low-stakes changes, not even that. Anthropic says the majority of its own code is now written by Claude Code. OpenAI reported in February that GPT-5.3-Codex was instrumental in building itself – debugging its own training runs and analyzing its own evaluation results. Generate, measure, decide, repeat. That is the shape.

And it is not a coding-only shape. Picture an ad-optimization flywheel. One workflow generates the creative – headline, copy, image. A second pulls performance from the ad console – impressions, click-through, conversions. A third reads that performance and decides the next experiment: kill the loser, scale the winner, try a new angle. Wire the three together and you have a flywheel that runs marketing experiments around the clock, with no human between iterations. The reason it can run is the same reason coding could: the ad console is a verifier. Performance is measured, not vibed. The measurement closes the loop.

Why did coding close first? Because software spent forty years building the verification infrastructure that closed loops require. Compilers reject malformed programs. Type systems catch whole categories of mistakes. Test suites encode "what good looks like" in executable form. CI runs all of it automatically on every change. When LLMs arrived, the verifier was already sitting there, waiting. Advertising has a weaker version of the same gift – performance numbers are objective, if noisy. The strength of the verifier is what decides whether a loop can close at all.

This is the Factory AI principle from Episode #5, now operating at full strength: the ease of training an agent on a task is proportional to how verifiable the task is. Coding was the most verifiable knowledge work on earth, so the loop closed there first.

Now run the logic over your own organization. Which of your workflows has a test suite? Which has anything resembling one? For most companies the honest answer is that the workflows have humans. The human is the verification layer – and the coordination layer, the thing deciding which workflow runs next and whether the whole effort is working. Which means the human is the reason the flywheel cannot spin. Remove them, and you reveal a debt nobody scoped.

How AI Research Labs Use Closed-Loop Systems to Train Themselves

The most consequential closed loop being built right now is AI research itself.

The progression over the past year has been incredible. The AI Scientist project, reported in Nature in March, automates the research cycle end to end: it generates ideas, runs experiments, writes up results, and reviews its own papers. A startup literally named Recursive published results this week from a system that proposes a research idea, implements it, runs the experiment, validates the result, and uses what it learned to choose the next experiment – running many threads over long horizons, with explicit machinery to catch reward hacking before treating a gain as real. Anthropic published a piece this month titled "When AI builds itself," stating plainly that a growing share of its AI development is delegated to AI systems, and that taken far enough, the trend points toward systems that design their own successors. Jack Clark has put a number on it: roughly 60% probability of a system that can train a more powerful successor without human involvement by the end of 2028. Dean Ball's article “On Recursive Self-Improvement” from February argues that frontier labs are automating large fractions of their research operations, and that their effective workforces of agents will grow from thousands toward hundreds of thousands within a year or two.

Well, let’s not get crazy! Labs have incentives to describe their own momentum in the strongest possible terms. And they might get there earlier than others. But the organizational argument still holds. You do not really need superintelligence for the closed loop to work. You only need what is already happening: work loops closing in domains where verification is strong, plus one observation about how capability travels.

Why Closed-Loop AI Workflows Arrive Through Tools Before Strategy Does

The mechanism has three parts, and none of them require you to do anything.

First, the closed loop arrives through the tools. Labs are not closing research loops only for their own use. The capability gets packaged into products. A coding agent that proposes, tests, and revises code is already a productized version of that loop. A data tool that detects and repairs schema failures is another. This is the direction of travel across the software stack: from “drafts things for you” toward “handles a bounded task and reports back.” The flywheel may enter your organization through a vendor before anyone there has decided to build one.

Second, the shadow gets there before the strategy does. We wrote in Episode #4 that shadow AI is a diagnostic – it shows you exactly where the official path cannot carry the work. Flywheels will follow the same route. Somewhere in your organization right now, a mid-level hero has an agent watching something, fixing something, re-running something on a schedule, and the only reason you don't know is that it hasn't broken yet. The first flywheels in most enterprises will not be launched. They will be discovered.

Third, the economics of iteration change. When a loop closes, each additional attempt becomes much cheaper. An agent that can try several approaches will often do exactly that. The amount of machine-generated work – drafts, fixes, analyses, proposals – is no longer limited mainly by human effort. It is increasingly limited by compute budget, verification, and judgment. Cheap iteration is just the flywheel turning faster.

And that creates the real problem of this episode.

The Review Bottleneck: Why Human Oversight Can't Scale with AI Flywheels

Here is the basic arithmetic. Machine output scales with compute. Human review scales with people. These two curves are already diverging in software engineering, where teams are learning how much agent-generated work still needs judgment. The same pressure is likely to appear in other functions.
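A back-of-the-envelope version of that arithmetic, with made-up numbers: a hypothetical team of three reviewers, each able to review twenty items a day, facing first 40 and then 400 machine-generated items a day. The function and its parameters are ours, and constant daily rates are a simplifying assumption.

```python
def review_backlog(
    outputs_per_day: int,
    reviewers: int,
    reviews_per_reviewer_per_day: int,
    days: int,
) -> int:
    """Unreviewed items left after `days`, assuming constant daily rates."""
    capacity = reviewers * reviews_per_reviewer_per_day
    backlog = 0
    for _ in range(days):
        # Whatever exceeds the day's review capacity carries over.
        backlog = max(0, backlog + outputs_per_day - capacity)
    return backlog


# Hypothetical team: 3 reviewers, 20 reviews each per day, over 20 working days.
before = review_backlog(outputs_per_day=40, reviewers=3,
                        reviews_per_reviewer_per_day=20, days=20)
after = review_backlog(outputs_per_day=400, reviewers=3,
                       reviews_per_reviewer_per_day=20, days=20)
print(before, after)  # 0 6800
```

Under these assumptions, a tenfold jump in output turns an empty queue into 6,800 unreviewed items within a month. Adding reviewers moves the line. It does not change its slope relative to compute.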

Linking workflows makes it worse, not better. A single workflow produces a stream of outputs for a human to check. A flywheel produces outputs that feed other workflows that produce more outputs – and somewhere a person is still supposed to coordinate the whole thing, deciding whether run N was good enough to justify run N+1. That coordination is the seam where the human used to sit. It is also the seam that explodes first when the volume climbs.

When generation was expensive, review could stay informal. A senior person read the work, applied judgment, moved on. That works when a team ships a few artifacts a week. It does not work when the flywheel ships hundreds.

So organizations face a fork:

Review everything at human speed, and the bottleneck simply moves to review. The flywheel does not help much if every output waits in the same human queue.
Review nothing, and you have removed the checkpoint and accepted the risk at organizational scale.

The harder path is to build verification that runs closer to machine speed, and to move the human off the individual outputs and onto the verifier. That is the real option, and outside software, very few organizations are prepared for it.

This is why we keep insisting the Org Age is organizational before it is technical. The models will close loops wherever you let them. The question is whether your organization can tell, at machine speed, the difference between a flywheel compounding value and a flywheel compounding errors, noise, and bad assumptions.

The old rule was garbage in, garbage out. Flywheels turn the dial up to 11: garbage in, garbage amplified.

Which brings us to what's missing.

What Infrastructure AI Flywheels Require: Verification, Learning, Measurement

When the human leaves the loop, we start to see how many tacit jobs the nominal job was hiding. Each of those hidden jobs needs to be discovered and rebuilt as infrastructure. Each should get its own episode in the rest of this series (and most likely will!).

Verification. The human checkpoint was never just approval. It was also a definition of quality. The reviewer who “just looks it over” is applying standards that often exist only in their head. In other words, the organization has been running on tacit evals, just as it ran on tacit knowledge. Close the loop and you discover that the spec was never fully written. Before a workflow can run itself, someone has to define what “good” means in a form the system can check: pre-deployment evals, regression evals, online checks, drift detection. “The system is accurate” is not enough. We will spend Episode #7 on what is.
Learning. A flywheel that runs but does not improve makes the same mistake faster. In Episode #2 we described L5 as closing the feedback loop – expert corrections flowing back into system behavior. Flywheels make this urgent, because there is no longer a human in the middle quietly absorbing the lessons. When the verifier catches a failure, where does the correction go? Into the prompt? The retrieval layer? The policy? The eval set itself? An organization that cannot answer this has flywheels that spin but do not learn. Episode #8.
Measurement. Many of the metrics companies use to value work still assume a human process. Time saved assumes time was the main input. Headcount capacity assumes heads. We showed in Episode #4 how this breaks billing; flywheels break it for internal accounting too. What is the unit of a flywheel? What did it cost? What did it produce? How do you compare it to the old process when the old process no longer exists in the same form? Episode #9.

Verification, learning, measurement. That is the build list. The rest of the series is the spec.

AI Flywheel Risk: How Compounding Errors Amplify When the Loop Closes

A flywheel compounds. That is the point of one and the danger of one, because compounding runs in both directions.

Pointed well, it is the best machine you can own. Each turn makes the next turn cheaper. The verifier you wrote to close the first loop is the verifier you copy to close the second. The corrections from this week sharpen next week's runs. Value accumulates while you sleep. This is the version every vendor will sell you.

Now the other direction. Open-loop systems tend to fail locally – the agent produces a bad draft, a human catches it, the damage stays small. A flywheel fails cumulatively. The output of each run is the input to the next, so an error does not stay put; it propagates. A small bias, repeated, becomes a direction.
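A toy calculation shows the difference between failing locally and failing cumulatively. It assumes a hypothetical 1% bias per run and models the open loop as a human resetting the input to a trusted baseline before every run. Both are simplifications chosen for illustration.

```python
def final_value(runs: int, bias: float, closed_loop: bool, start: float = 100.0) -> float:
    """Apply the same small bias on every run and return the last output."""
    value = start
    for _ in range(runs):
        # Closed loop: the previous output is the next input.
        # Open loop: a human resets the input to the trusted baseline.
        source = value if closed_loop else start
        value = source * (1 + bias)
    return value


open_loop = final_value(runs=100, bias=0.01, closed_loop=False)
closed = final_value(runs=100, bias=0.01, closed_loop=True)
print(round(open_loop, 2), round(closed, 2))
```

With these numbers the open loop ends about 1% off the baseline, while the closed loop ends at roughly 2.7 times the baseline. The bias per run is identical in both cases. Only the wiring changed.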

Research labs know this one well. In Recursive's automated-research system, the hardest problem is reward hacking: the loop finding ways to score well on the verifier while missing the intent behind it. Their answer is layered validation – every apparent improvement is something to check, not something to trust.

You will meet the identical problem in the ad flywheel. Point it at click-through and it will faithfully maximize click-through – with clickbait that converts nobody, aimed at an audience that was never going to buy. Point it at purchases and it will optimize for sales, even if that means promising something wildly different from what you actually deliver. Returns go up and the brand erodes in a way the console does not measure. Nothing breaks. The numbers even look good. The flywheel is doing exactly what you asked, which turns out not to be what you wanted. Hence the posture a flywheel demands: suspicion built into the architecture.
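One common hedge, sketched here under our own assumptions, is to pair the metric being optimized with guardrail metrics that a variant must satisfy before it can win. The `AdResult` fields, the thresholds, and the sample numbers are hypothetical. Guardrails narrow the room for reward hacking, but they do not eliminate it.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdResult:
    name: str
    click_through: float    # the number the loop is told to maximize
    conversion_rate: float  # guardrail: clicks have to turn into purchases
    return_rate: float      # guardrail: purchases have to stick


def pick_winner(
    results: list[AdResult],
    min_conversion_rate: float = 0.01,
    max_return_rate: float = 0.10,
) -> Optional[AdResult]:
    """Best click-through among variants that pass every guardrail, else None."""
    eligible = [
        r for r in results
        if r.conversion_rate >= min_conversion_rate and r.return_rate <= max_return_rate
    ]
    if not eligible:
        return None  # nothing is safe to scale; hand the decision to a human
    return max(eligible, key=lambda r: r.click_through)


results = [
    AdResult("clickbait", click_through=0.09, conversion_rate=0.002, return_rate=0.05),
    AdResult("overpromise", click_through=0.06, conversion_rate=0.04, return_rate=0.35),
    AdResult("honest", click_through=0.03, conversion_rate=0.02, return_rate=0.04),
]
winner = pick_winner(results)
print(winner.name if winner else "escalate")
```

On click-through alone the clickbait variant wins. With both guardrails in place, the loop scales the honest one, and if no variant passes it stops and asks.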

Most of the time it is less dramatic than that. The system just gets a little worse each cycle, and nobody is watching – because reducing the watching was the whole point.

The defense is the same in the enterprise and in the lab: regression evals that protect what must not change, canary checks on each run, drift detection that compares this month's distribution against last month's, and a human whose job has shifted from reviewing every output to reviewing the verifier. That is where judgment goes when the loop closes. It does not disappear. It moves up a level – the same climb we traced in the ladder, now forced on you by the failure mode instead of chosen.
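As one illustration of the drift check, the sketch below compares how a hypothetical triage workflow distributed its routing labels last month and this month, using total variation distance. The labels, the counts, and the 0.2 alert threshold are assumptions. A real deployment would choose its own statistic and threshold.

```python
from collections import Counter


def total_variation(last_month: list[str], this_month: list[str]) -> float:
    """Half the summed absolute difference between two label distributions (0 to 1)."""
    p, q = Counter(last_month), Counter(this_month)
    n_p, n_q = len(last_month), len(this_month)
    labels = set(p) | set(q)
    # Counter returns 0 for labels that appear in only one of the months.
    return 0.5 * sum(abs(p[label] / n_p - q[label] / n_q) for label in labels)


# Hypothetical routing decisions from a triage workflow.
last_month = ["billing"] * 70 + ["support"] * 30
this_month = ["billing"] * 40 + ["support"] * 60

DRIFT_THRESHOLD = 0.2  # assumed alert level, to be tuned per workflow
score = total_variation(last_month, this_month)
drifted = score > DRIFT_THRESHOLD
print(round(score, 2), drifted)  # 0.3 True
```

A score above the threshold does not say the flywheel is wrong. It says the outputs have changed shape and a human should look at the verifier before the next cycle compounds the change.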

Which AI Workflows Should Close the Loop First?

Episode #5 gave four criteria for choosing which workflows to automate: frequency, reversibility, verifiability, and exception rate. Closing a loop uses the same logic, but with higher stakes. The criteria still apply, but verifiability is no longer just one factor among four. It becomes the gate.

The rule: a loop may close only where the verifier is stronger than the failure mode.

And in a flywheel the gate applies twice. Each workflow needs a verifier strong enough for its own output, and the flywheel needs one strong enough for the coordination – the judgment of whether the whole thing is moving toward the goal or merely moving. A chain of individually verified workflows can still drift as a system. Both gates have to hold.
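A small numeric sketch of why both gates matter, using invented tolerances: each of five linked workflows is allowed to move a value by up to 2%, and the system as a whole is allowed to move it by up to 5%.

```python
def step_ok(before: float, after: float, tolerance: float = 0.02) -> bool:
    """Per-workflow gate: the output stays within tolerance of its own input."""
    return abs(after - before) <= tolerance * abs(before)


def system_ok(start: float, end: float, tolerance: float = 0.05) -> bool:
    """Flywheel-level gate: the end result stays within tolerance of the start."""
    return abs(end - start) <= tolerance * abs(start)


# Five linked workflows, each nudging the value up by a hypothetical 1.5%.
values = [100.0]
for _ in range(5):
    values.append(values[-1] * 1.015)

steps_pass = all(step_ok(a, b) for a, b in zip(values, values[1:]))
system_pass = system_ok(values[0], values[-1])
may_close = steps_pass and system_pass
print(steps_pass, system_pass, may_close)  # True False False
```

Every individual step passes its own check, yet the chain ends more than 7% away from where it started. Only the system-level gate catches that.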

Run the eight patterns from Episode #5 through that gate and they sort themselves:

Pattern	Closes safely?	Why
Sync and transform	First	Deterministic rules, integrity tests after every load, errors are detectable
Triage	First	Misroutes are cheap and caught downstream
Monitoring and escalation	Already closed	The human was only ever on the exception path
Curation and delivery	Early, with drift checks	Cost of a dud is low, but watch the compounding
Investigation and recommendation	Partially	The investigation can close; the recommendation should land on a desk
Execution with approval	Late, narrow blast radius first	Consequences are real; close per-action-type, never wholesale
Draft and review	Last for external artifacts	Quality is taste-based; the verifier you'd need is the one nobody can write yet
Elicitation	Never	The entire point is extracting what only the human knows

Notice what the table implies. The loops that close first are usually the boring ones. The more visible ones, like an agent that handles customer-facing work end to end, close later because their verifiers are harder to build. Any vendor pitch that reverses this order should make you cautious. That is the hope-based path.

And notice something else: a closed loop is something a workflow earns, not a setting you turn on. The sequence is the one we laid out in Episode #1 and have followed since: trust first, then use, then responsibility. The loop closes at the end of that sequence, not at the beginning.

Machine-Speed Verification: The New Competitive Divide in Enterprise AI

In Episode #1 we said the real divide is between companies that can make themselves legible to machines and companies that cannot. Episode #4 sharpened it: between companies that can survive making themselves legible in public, and companies that cannot.

Flywheels sharpen it again. The next divide is between organizations that can verify machine work at machine speed, and organizations that can only verify it at human speed.

The second kind will still use AI. They will have copilots and drafts and impressive demos, and every loop will route through a human bottleneck that caps the whole system at the throughput of its busiest reviewer. The first kind will have done the unglamorous work – the evals, the regression suites, the drift monitors, the correction routing – and their flywheels will compound while everyone else's queue grows. That is the property of a flywheel that makes this a divide and not a gap: the lead widens on its own.

Writing this series made us realize Turing Post is still closer to the second kind. So yes, this is crunch time for us too. The unglamorous work of building the systems that remove the human bottleneck has officially started. Will, co-author of this article, is helping us with it.

So yes – the work loop is closing in the labs right now, at the frontier, with the strongest verifiers ever built. The same closure is coming downstream into ordinary organizations through tools, through shadows, through economics. The infrastructure to absorb it – verification, learning, measurement – is the work of the next three episodes.

Trust before autonomy was always the thesis. Now we build the trust machinery.

← Previous: AI Workflow Patterns: The Real Unit of AI Adoption in 2026
Next in the series: we will discuss verification in detail.

FAQ

What is a closed-loop AI workflow?

A workflow whose output feeds its own next run with no human review in between. The system acts, measures its own result, decides what to try next, and acts again.

What is an AI flywheel?

Several closed-loop workflows linked toward a single goal – one generates, one measures, one decides the next move – so the system runs experiments continuously and steers itself. A closed loop is the smallest flywheel; linking more workflows just makes it bigger.

How is a flywheel different from automation?

A pipeline (automation) repeats fixed steps. A flywheel steers: the result of each run changes the next one. Closing the loop safely means replacing the human checkpoint with an automated verifier, not just removing it.

Why did coding agents close the loop first?

Software already had decades of verification infrastructure – compilers, type systems, test suites, CI pipelines – so "what good looks like" was encoded in executable form before LLMs arrived. The strength of the verifier decides whether a loop can close.

Which workflows should close the loop first?

The ones whose verifier is stronger than their failure mode: sync-and-transform, triage, and monitoring patterns. Taste-based work like external-facing drafts should close last, if ever.

What is the main risk of a flywheel?

Compounding error and reward hacking. Because each run feeds the next, small mistakes propagate instead of staying local, and the system can optimize a measured number while missing the intent behind it. Defenses include regression evals, canary checks, drift detection, and a human who reviews the verifier instead of every output.

#6: The Flywheel: What Happens When Workflows Run Themselves