Quick answer: What is harness engineering?
Harness engineering is the design of the runtime layer around the model: tool interfaces, context assembly and compaction, sandboxed command execution, policy enforcement, observability, and failure recovery. In coding agents, this is what turns model intelligence into reliable, secure, repeatable work. The model proposes actions; the harness constrains, executes, and verifies them.
Subscribe for weekly operator-grade AI systems analysis:
https://www.turingpost.com/subscribe
Key concepts in this interview:
Agent harness (agent loop): the runtime layer that prompts the model, executes tool calls, and returns results.
Tool calling: how the model uses shell, web, and computer-use tools to take actions.
Security vs safety: harness security restricts what can run; model safety decides what should run.
Sandboxing: OS-level isolation and policy enforcement to protect user machines.
Model vs harness: model quality dominates outcomes, but harness reliability and sandboxing remain critical.
Agent-first workflow:
AGENTS.md, better repo structure, and explicit conventions improve agent performance.
Mission control UX: managing multiple parallel agent threads as a new developer interface.
Quick note: I'm in San Jose for NVIDIA GTC, and it's absolutely huge this year. I've heard more than 30,000 people are expected, and tickets have already sold out. The good news is that you can still register online and watch the sessions remotely for free. There are more than 1,000 of them, so yes, you'll have to choose.
The keynote is a must-watch, and there are plenty of sessions you really can't miss:
Claws: How to Build Safe and Secure Long Running Agents on March 16 at 4:00 PM PT
and Open Models: Where We Are and Where We're Headed, with a phenomenal lineup including Jensen Huang, Harrison Chase, Michael Truell, and Arthur Mensch, on March 18 at 12:30 PM PT
Advancing to AI's Next Frontier: Insights From Jeff Dean and Bill Dally on March 18 at 4:00 PM PT
and many, many more
And if you're around, send me a note and let's meet in person.
Now to our super interesting interview.
"Good developers are always looking to optimize their inner loop, but this is a new inner loop that everyone is still figuring out."
I like this line from Michael Bolin, lead for open-source Codex at OpenAI. It captures a shift that is easy to miss if you only look at benchmarks and model releases. Before OpenAI, Bolin spent years building developer tools and infrastructure across Google and Meta, including work associated publicly with projects like Buck, Nuclide, and DotSlash, so he comes to this moment with a long view on how developer workflows actually change.
The popular narrative around AI coding tools is this: the model writes the code. But when you talk to the people building and using these systems every day, the story becomes more nuanced. The bottleneck is no longer just model capability. Increasingly, it is the environment around the model — the tools it can call, the constraints it operates under, the structure of the repository it navigates, and the feedback loops that let it improve. That emerging layer is often called harness engineering.
In our conversation, Michael walks through how this shift is playing out inside Codex: what the harness actually does, how coding agents are changing developer workflows, why repositories suddenly need to be more legible, and where the balance between model capability and harness design might ultimately land.
We also move to a more personal question that many developers are asking themselves: what does it mean to be a programmer when you are no longer typing most of the code? Bolin has spent two decades writing software, yet he describes this transition not as a loss but as a shift toward shaping systems and artifacts at a higher level — with agents accelerating experimentation, prototypes becoming cheaper, and engineering taste becoming more important than raw typing speed.
I really enjoyed learning from Michael.
Subscribe to our YouTube channel, or listen to the interview on Spotify / Apple
We prepared a transcript for reference, but the full experience is in the video. And as always: like and comment. It helps us grow on YouTube and bring you more insights.
Ksenia:
Hello, everyone. I'm happy to have Michael Bolin today. He's the lead for open-source Codex. Michael, thank you for joining me.
Michael:
Thank you, it's great to be here.
Ksenia:
People often think the story of AI coding is just: the model writes code. But a lot of teams building agents say the real shift is designing the environment around the model. What side are you on?
Michael:
The model is going to dominate the experience, for sure. But we've found there's still a lot of room for innovation in the harness. It's not a pure research problem. For our team in particular, it's been about the relationship between the engineering side and the research side — co-developing the agent together, making sure the harness lets the agent shine and do the best things it can do. Then giving the agent the right tooling, making sure that tooling gets used in training so that things are in-distribution when we ship it as a product.
What Is the Harness?
Ksenia:
Let's define the harness and why it's become so important.
Michael:
Sure. The harness is what we sometimes call the agent loop — the bit that calls out to the model, samples it, and gives it context: here's what I'm trying to do, here are the tools available to you, tell me what to do next. Then it gets a response from the model — often a tool call — that says: here's the tool I want to call with these arguments, let me know what came back.
Sometimes these tools are pretty straightforward, like just run this executable and tell me what stdout was and what the exit code was. We've done a lot more experiments with more sophisticated tools for controlling the machine, for controlling the user's laptop — more like an interactive terminal session rather than simple command shelling. Or it could say, do this web search, or various other things.
For Codex specifically, because it is primarily a coding agent and we care tremendously about security and sandboxing, a lot of what the harness does is take shell commands or computer-use commands from the model and ensure they run in a sandbox or under whatever policy the user has given the agent. There turns out to be a lot of complexity in that area. It's critical that we not only expose all the intelligence of the model, but do it safely on the user's machine.
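The loop Michael describes can be sketched in a few lines. Everything here is illustrative rather than Codex's actual interface: `call_model` stands in for a real model API, and the single `shell` tool (run a command, report stdout and the exit code) is the simplest tool shape he mentions.

```python
import subprocess

def call_model(transcript):
    """Stand-in for the model call. A real harness would send the
    transcript to a model API and parse its response. Here we fake it:
    propose one shell command, then finish once a result has come back."""
    if not any(m["role"] == "tool" for m in transcript):
        return {"type": "tool_call", "tool": "shell",
                "args": {"command": ["echo", "hello"]}}
    return {"type": "answer", "content": "done"}

def run_shell(args):
    # Execute the command and report stdout plus the exit code back.
    proc = subprocess.run(args["command"], capture_output=True, text=True)
    return {"stdout": proc.stdout, "exit_code": proc.returncode}

def agent_loop(task):
    transcript = [{"role": "user", "content": task}]
    while True:
        step = call_model(transcript)
        if step["type"] == "answer":
            return step["content"]
        # The harness, not the model, executes the tool call...
        result = run_shell(step["args"])
        # ...and feeds the result back as context for the next turn.
        transcript.append({"role": "tool", "content": str(result)})
```

In a real harness this loop also enforces policy before `run_shell`, which is where the sandboxing discussed next comes in.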
Ksenia:
How do you handle safety when you're open-sourcing Codex?
Michael:
You can actually see all of it because it's in our repo. We do different things for each operating system. On macOS, there's a technology called Seatbelt. On Linux, we use a collection of libraries — Bubblewrap, seccomp, and Landlock. On Windows, we've actually built our own sandbox. Some of these things, like Seatbelt, are part of macOS, so they're not in the open-source repo — just how we call it. But our Windows sandbox code is in the open-source repo. We orchestrate all these calls to go through the sandbox in the appropriate way for each different tool call.
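The per-OS dispatch might look roughly like the sketch below, which only builds the command line and never executes it. The Seatbelt profile and Bubblewrap flags are plausible minimal policies for illustration, not Codex's actual profiles (those are in the repo).

```python
def wrap_in_sandbox(cmd, platform, writable_dir="/tmp/task"):
    """Wrap `cmd` in an OS-appropriate sandbox invocation.
    Illustrative only: flags and profiles are guesses at a minimal
    policy, not what Codex actually ships."""
    if platform == "darwin":
        # macOS Seatbelt via sandbox-exec: deny all writes except the task dir.
        profile = ('(version 1)(allow default)(deny file-write*)'
                   f'(allow file-write* (subpath "{writable_dir}"))')
        return ["sandbox-exec", "-p", profile] + cmd
    if platform == "linux":
        # Linux Bubblewrap: read-only root, one writable dir, no network.
        return ["bwrap", "--ro-bind", "/", "/",
                "--bind", writable_dir, writable_dir,
                "--unshare-net"] + cmd
    raise NotImplementedError(f"no sandbox wrapper for {platform}")
```

The point of building an argv rather than a wrapper process is that the same harness code path can run every tool call through whichever mechanism the host OS provides.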
Ksenia:
So when people fork Codex, all the safety rules are baked in?
Michael:
Right — though I should clarify a detail. Safety and security get used interchangeably in AI, but they are subtly different. What I'm describing is more on the security side: yes, you can run this tool, but you can only read these folders or write to these folders, that sort of thing.
What most people in the industry would call safety is actually happening more on the backend — making sure the tool calls the model suggests in the first place are appropriate to run. From the harness's perspective, it's following orders in a certain sense: faithfully executing the tool calls. But the decisions about what tool calls are safe or appropriate to run are made by the model.
So if you forked Codex and you're still talking to our models and relying on our model's safety, then yes, you get that. If you're running someone else's model, it's a little more up in the air.
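The security side of that distinction (what the agent may touch, regardless of what the model decides) reduces to simple mechanical checks. A sketch of one such check, with a hypothetical `write_allowed` helper that is not Codex's API:

```python
from pathlib import Path

def write_allowed(path, writable_roots):
    """Security-side policy check: is `path` inside one of the folders
    the user granted write access to? The path is resolved first so
    that `..` segments cannot escape a granted root."""
    target = Path(path).resolve()
    for root in writable_roots:
        root = Path(root).resolve()
        if target == root or root in target.parents:
            return True
    return False
```

A check like this runs in the harness on every proposed write; whether the write is a *good idea* remains the model-side safety question Michael describes.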
How Codex Has Grown
Ksenia:
Since you launched Codex, how has it performed? What are you seeing?
Michael:
The response has been very positive. Usage is up roughly five times from the start of the year. We launched in April of last year as part of the o3 and o4-mini launch — we were using reasoning models, but tool calling and instruction following weren't quite where we wanted them to be. Then in August, when GPT-5 came out, we did a refresh of the CLI, and that's really when it started moving. We had growth before, but it really started jumping up. Then we launched the VS Code extension later that summer and into the fall, and people really gravitated toward that — I believe VS Code overtook CLI usage. And then we launched the app at the start of this year, and that has really taken off. I think it's genuinely the first of its kind in a lot of ways.
Ksenia:
What's so new about it?
Michael:
Developers have historically spent most of their time in their IDE, so it makes sense to meet users where they are. Some users are in the terminal — that's why we have the CLI. A lot of users are in an IDE — that's why we're in VS Code, and now integrated into JetBrains and Xcode as well. Those are obvious, natural places to go.
With the Codex app, we've actually established a new surface. I like to think of it as a mission control interface — now I'm managing many conversations in parallel. But it still has key pieces you'd expect from a traditional IDE: you can browse the diff the agent has made, you can pop open the terminal with Command-J without switching to a different window if you want to do something ad hoc. It's really breaking the expectation that you have to have all your code in front of you at all times. For a lot of people, there's more value in being able to organize and work across multiple agents simultaneously. That's what we bring front and center.
How Coding Agents Change Developer Workflows
Ksenia:
How do coding agents like Codex change the way developers actually work day-to-day?
Michael:
I think the biggest change is throughput. If you really put attention into it, you can get a lot of work done in parallel. That does incur some amount of context switching that not everyone loves, but if you can master it, you can push a lot of things forward at once.
Personally, I have about five clones of the Codex repo that I hop between. Sometimes it's a small thing I noticed while doing something else — just a quick fix. Other times it's a full-day conversation where I'm working through a very large change with Codex throughout the day, between meetings. A lot of people who have five minutes between meetings will send another message just to push a task forward in another direction.
The second thing is that people are spending more time figuring out how to optimize this workflow. It's all very new, relatively speaking. Should I turn this thing I keep doing into a reusable skill? Should I share that skill with my teammates? Good developers are always looking to optimize their inner loop, but this is a new inner loop that everyone is still figuring out.
And the third thing that gets a lot of attention is code review. The volume of code review has gone up significantly, but Codex is also doing a lot of that code review itself, which saves a lot of time. Figuring out how to make the most of that is still a moving target.
Early Stories from Building Codex
Ksenia:
When you were working on Codex initially, were there any unexpected things you encountered?
Michael:
So much has changed. Codex is still not even quite a year old, which is pretty remarkable given the amount of change in that time.
When we launched in April 2025, that was part of the o3 and o4-mini launch. We were using reasoning models, but tool calling and instruction following weren't quite where we wanted them to be. Seeing that improve over time has been a big one.
One thing that was really exciting early on was getting Codex to write more of Codex itself — watching that bootstrapping happen. Things like AGENTS.md becoming more standard, getting that scaffolding in place so that you're building the tool that's optimizing your own workflow. That gives you a kind of exponential liftoff that was just exciting and fun. And seeing colleagues really start to get it and shift more of their work to Codex — that was great to watch.
Repositories in an Agentic World
Ksenia:
How do repositories and documentation need to look when an agent is navigating them instead of a human developer? You mentioned AGENTS.md — what's the biggest difference?
Michael:
It's funny — one interesting thing about this whole agentic coding journey is that there are practices considered best practice in software for a long time that we just never did. Documentation is one. Test-driven development is another. People didn't ignore them entirely, but the trade-off always felt questionable. Now, in an agent-first world, these things are clearly worth it. People are almost rediscovering them and genuinely caring about them.
When you think about AGENTS.md, for example, everything we write in ours I would say is also suitable for a human joining the team — all the things they need to know, all the best practices. It's actually kind of freeing to write these things down for both the agent and your teammates.
That said, on Codex we consider ourselves AGI-pilled — meaning the agent should really be deciding what to do rather than us feeding it more and more instructions. Rather than writing a document that runs parallel to the source code and risks being duplicative or inconsistent, we let the agent spend time reading the code and forming its own opinion. We try to put things in AGENTS.md that it wouldn't have gotten quickly from the code — things like: here's how you should run the tests, or these tests are more important than those. But we try not to overdo it, and let the agent decide the best way forward.
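An AGENTS.md in that spirit might look like the fragment below. Every path, command, and rule here is hypothetical, made up purely to illustrate the "only what the agent can't quickly infer from the code" principle:

```markdown
# AGENTS.md (hypothetical example)

## Running tests
- Run `cargo test -p core` first; it is the fast, high-signal suite.
- The integration tests under `tests/e2e/` are slow; only run them when
  you change the CLI surface.

## Conventions the code won't tell you
- User-facing error strings live in `src/messages/`; don't inline new ones.
- Prefer deleting code over adding compatibility shims.
```

Note what is absent: no restatement of the architecture, no style guide the linter already enforces — the agent can read those from the repo itself.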
Ksenia:
Do you think in the near future AGENTS.md will be written by an agent?
Michael:
A lot of people's already are. I know many folks who include in their personal instructions something like: when you're done, please update anything of interest in the AGENTS.md file — anything that wasn't obvious, or that you and Codex discovered along the way. We don't happen to do it as a general practice on our team — you can check the repo history on that one — but it's common. There are papers coming out on how much to tell the agent, and I'm sure it depends on the agent too. We take a modest approach: not tens of pages of instructions, more like a handful of things.
Ksenia:
Context engineering seems like an increasingly important part of this process. Is there such a thing as too much context for an agent?
Michael:
From my empirical experience rather than a research perspective: for a moderately sized task, I usually write about a paragraph and ask Codex to familiarize itself with that part of the code. Sometimes I give it explicit file pointers if I think it'll help, but often I don't — it does a good job searching the codebase on its own.
One subtle thing that matters more than people realize: making sure files and folders are well named. That's good practice anyway, but it's probably even more important when an agent is searching through code.
Most of the context is going to be AGENTS.md, the prompt I wrote, and maybe some file references. I also give Codex access to GitHub, so it can look at things like: a similar issue happened in this pull request, and it can see not just the code but the conversation that happened around that PR. But again, it's more about letting Codex know what options it has — here are the tools in your toolbox — without being prescriptive about how it should solve the problem. It's a good model, so it does a good job there.
Architecture, Slop, and What Surprises You
Ksenia:
It sounds like working this way pushes you toward stricter architecture.
Michael:
Certainly. Codex is going to follow patterns it sees in the codebase. If you have good architecture in the first place, it will follow it, maintain it, and enforce the invariants you set up — and you're going to be in a good position over time. That's also true of human developers, of course. It's just that the rate of change is now so much higher that if you have these standards, you're going to feel the benefits of them much more acutely.
Ksenia:
Do you still see a lot of slop coming out from the model and from coding agents? How do you fight it?
Michael:
I don't think I've seen things I'd call slop from Codex, honestly. What I see more is that these models like to write code. So sometimes the right answer is deleting code, and you might need to be a little more explicit about that. But that's not really slop — it's more like: you added 500 lines to this file, maybe you should have made a new file. Those are easier fixes.
What's actually far more common is that Codex knows an idiom or a language feature I just haven't encountered yet, and it uses it. I learn something new. That's more often how Codex surprises me — rather than slop.
Model vs. Harness: What Rules?
Ksenia:
What you're describing is that when Codex started, the model wasn't quite there yet. Now the model is much stronger, and the app itself is bringing a wider audience. But I'm trying to understand — big model or big harness, what matters more? Is there a point where the harness stops being a wrapper and becomes an environment that matters more, or does the model always rule?
Michael:
I see where you're going — is there a world where the harness kind of fades away and doesn't have much of a role? It's possible. In a lot of ways, we try to make the harness as small and as tight as possible.
One thing you'll notice in Codex compared to some others is that we try to give the agent very few tools. For example, Codex doesn't have an explicit tool for reading files. Instead, we give it control of a computer terminal — if it uses cat or sed or whatever command-line tool is present to read a file, great. That ties back to what I said about being AGI-pilled: we give it a large space to play in and let it find the best path forward.
The one place we compromise on that is security. The sandboxing is a very important backstop to just letting Codex run wild. And sometimes people play tricks trying to manage the context window by biasing the agent to do certain things — as the harness author, it's tempting to say, well, I know better. But we try to shy away from that. If Codex is about to run a tool that would spit out a gigabyte of data, our view would be: let's bias Codex to write that to a file and then grep it, but leave it free to choose how to solve the problem.
Ksenia:
Do you think it's possible to encode all of this — the safety rules, the sandboxing — or should there always be a human in the loop?
Michael:
For the kinds of coding tasks we focus on, I think sandboxing is really the main replacement for human-in-the-loop, at least for most of what we work on. You have a problem, you give it to Codex, it operates in a sandboxed environment constrained in certain ways, and letting it explore that space is going to get you the best solution — certainly at scale. I have five clones of Codex going. If I had to interject every few minutes on all five of them, that fundamentally limits the throughput you can get out of them.
More of those corrections should be happening at training time and then playing out at inference time, rather than requiring a human in the loop.
Ksenia:
So more will live in the model, not the harness?
Michael:
I would say so, yes. Though there are other parts that remain important — reliability of the harness is a big one. If the harness falls over, your session is over, and there's nothing the model can do. So performance and reliability are really important principles when implementing the harness.
As we inevitably move toward multi-agent and sub-agent setups — more agents talking across machines — the harness is no longer just a single process on a single machine. It becomes a network of agents. I expect there'll be a lot more interesting work to do there. I've spent most of my career writing tools for developers; now I'm writing more tools for agents. The agent can write its own tools as well, but like I said, we'd rather have a small number of very powerful tools that let it explore the space well — and we'll continue experimenting to find the right set of primitives.
The Primitives of Agentic Coding
Ksenia:
Do you know what that set of primitives looks like? Have you thought it through?
Michael:
I think we see a lot of the pieces already. There's what I called the shell tool, or terminal tool — the model's interface into using a computer terminal more like a human would, not just straight command execution. Things like dealing with streaming output and using that efficiently.
Memory is another big area. Historically, every time you started a conversation, it was from nothing — that's why you have AGENTS.md and all this context-stuffing to get information into the model quickly. If you look in the repo, there are a lot of experiments around memory.
There's also a lot happening with different types of context connectors. Originally we were focused on computer tasks on your local machine, but now it's also about getting work done in a broader sense — sending email on your behalf, creating documents, taking action in a web browser.
And then there's the standard LLM infrastructure: generally speaking, more context window is good; how you compact things when you hit the limits; all of that is still actively being pursued and contributes to the overall agent experience.
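The compaction problem mentioned above can be sketched naively: keep the system prompt, keep the newest turns that fit within a budget, and collapse everything in between into a marker. This is a deliberately simplified illustration; real harnesses summarize the dropped turns rather than discarding them, and count real tokens instead of the whitespace-split placeholder used here.

```python
def compact(messages, budget,
            count_tokens=lambda m: len(m["content"].split())):
    """Naive context compaction: always keep the first (system) message,
    keep as many of the most recent messages as fit in `budget` tokens,
    and replace the dropped middle with a single marker message."""
    system, rest = messages[0], messages[1:]
    kept, used = [], count_tokens(system)
    for msg in reversed(rest):          # walk backward from the newest turn
        used += count_tokens(msg)
        if used > budget:
            break
        kept.append(msg)
    kept.reverse()
    dropped = len(rest) - len(kept)
    if dropped:
        kept.insert(0, {"role": "system",
                        "content": f"[{dropped} earlier messages compacted]"})
    return [system] + kept
```

The design choice worth noting is that compaction is monotone from the newest turn backward, so the most recent tool results, which the model is most likely to need, always survive.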
This interview has been edited and condensed for clarity.
Further reading
Open-source Codex (repo, architecture, and sandboxing): https://github.com/openai/codex
AI coding agents and the new software stack: https://www.turingpost.com/p/aisoftwarestack
Spec-driven development for safer AI coding workflows: https://www.turingpost.com/p/spec-driven-development
Open models, safety, and real-world agent behavior (OpenClaw case): https://www.turingpost.com/p/openclaw
Alignment methods and RLHF variants in production stacks: https://www.turingpost.com/p/rlhfvariants

