Quick answer: What is harness engineering?

Harness engineering is the design of the runtime layer around the model: tool interfaces, context assembly and compaction, sandboxed command execution, policy enforcement, observability, and failure recovery. In coding agents, this is what turns model intelligence into reliable, secure, repeatable work. The model proposes actions; the harness constrains, executes, and verifies them.

Subscribe for weekly operator-grade AI systems analysis:
https://www.turingpost.com/subscribe

Key concepts in this interview:

  • Agent harness (agent loop): the runtime layer that prompts the model, executes tool calls, and returns results.

  • Tool calling: how the model uses shell, web, and computer-use tools to take actions.

  • Security vs safety: harness security restricts what can run; model safety decides what should run.

  • Sandboxing: OS-level isolation and policy enforcement to protect user machines.

  • Model vs harness: model quality dominates outcomes, but harness reliability and sandboxing remain critical.

  • Agent-first workflow: AGENTS.md, better repo structure, and explicit conventions improve agent performance.

  • Mission control UX: managing multiple parallel agent threads as a new developer interface.

Quick note: I’m in San Jose for NVIDIA GTC, and it’s absolutely huge this year. I’ve heard more than 30,000 people are expected, and tickets have already sold out. The good news is that you can still register online and watch the sessions remotely for free. There are more than 1,000 of them, so yes, you’ll have to choose.

The keynote is a must-watch, and there are plenty of sessions you really can’t miss.

And if you’re around, send me a note and let’s meet in person.

Now to our super interesting interview.

“Good developers are always looking to optimize their inner loop, but this is a new inner loop that everyone is still figuring out.”

I like this line from Michael Bolin, lead for open-source Codex at OpenAI. It captures a shift that is easy to miss if you only look at benchmarks and model releases. Before OpenAI, Bolin spent years building developer tools and infrastructure at Google and Meta, including publicly known projects like Buck, Nuclide, and DotSlash, so he comes to this moment with a long view on how developer workflows actually change.

The popular narrative around AI coding tools is this: the model writes the code. But when you talk to the people building and using these systems every day, the story becomes more nuanced. The bottleneck is no longer just model capability. Increasingly, it is the environment around the model – the tools it can call, the constraints it operates under, the structure of the repository it navigates, and the feedback loops that let it improve. That emerging layer is often called harness engineering.

In our conversation, Michael walks through how this shift is playing out inside Codex: what the harness actually does, how coding agents are changing developer workflows, why repositories suddenly need to be more legible, and where the balance between model capability and harness design might ultimately land.

We also move to a more personal question that many developers are asking themselves: what does it mean to be a programmer when you are no longer typing most of the code? Bolin has spent two decades writing software, yet he describes this transition not as a loss but as a shift toward shaping systems and artifacts at a higher level – with agents accelerating experimentation, prototypes becoming cheaper, and engineering taste becoming more important than raw typing speed.

I really enjoyed learning from Michael.

Subscribe to our YouTube channel, or listen to the interview on Spotify / Apple

We prepared a transcript for reference, but the full experience is in the video. And as always: like and comment. It helps us grow on YouTube and bring you more insights.

Ksenia:
Hello, everyone. I’m happy to have Michael Bolin today. He’s the lead for open-source Codex. Michael, thank you for joining me.

Michael:
Thank you, it’s great to be here.

Ksenia:
People often think the story of AI coding is just: the model writes code. But a lot of teams building agents say the real shift is designing the environment around the model. What side are you on?

Michael:
The model is going to dominate the experience, for sure. But we’ve found there’s still a lot of room for innovation in the harness. It’s not a pure research problem. For our team in particular, it’s been about the relationship between the engineering side and the research side – co-developing the agent together, making sure the harness lets the agent shine and do the best things it can do. Then giving the agent the right tooling, making sure that tooling gets used in training so that things are in-distribution when we ship it as a product.

What Is the Harness?

Ksenia:
Let’s define the harness and why it’s become so important.

Michael:
Sure. The harness is what we sometimes call the agent loop – the bit that calls out to the model, samples it, and gives it context: here’s what I’m trying to do, here are the tools available to you, tell me what to do next. Then it gets a response from the model – often a tool call – that says: here’s the tool I want to call with these arguments, let me know what came back.

Sometimes these tools are pretty straightforward, like just run this executable and tell me what stdout was and what the exit code was. We’ve done a lot more experiments with more sophisticated tools for controlling the machine, for controlling the user’s laptop – more like an interactive terminal session rather than simply shelling out one command at a time. Or it could say, do this web search, or various other things.
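The loop Michael describes can be sketched in a few lines. This is an illustrative reconstruction, not Codex’s actual implementation – the message format, the `model` callable, and the single `shell` tool are all stand-ins:

```python
import subprocess

def run_shell(args: dict) -> dict:
    """Illustrative 'shell' tool: run a command, report stdout and exit code."""
    proc = subprocess.run(args["command"], shell=True,
                          capture_output=True, text=True, timeout=60)
    return {"stdout": proc.stdout, "exit_code": proc.returncode}

TOOLS = {"shell": run_shell}

def agent_loop(model, task: str, max_turns: int = 20) -> str:
    """Minimal harness: sample the model, execute its tool calls,
    feed the results back, repeat until the model stops calling tools."""
    history = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        reply = model(history)              # hypothetical model interface
        history.append(reply)
        if reply.get("tool_call") is None:  # no tool call: model is done
            return reply["content"]
        call = reply["tool_call"]
        result = TOOLS[call["name"]](call["arguments"])
        history.append({"role": "tool", "content": result})
    raise RuntimeError("turn limit reached")
```

Everything else the interview covers – sandboxing, policy, context assembly – happens around this loop: before the tool call is executed and before the next sample is drawn.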

For Codex specifically, because it is primarily a coding agent and we care tremendously about security and sandboxing, a lot of what the harness does is take shell commands or computer-use commands from the model and ensure they run in a sandbox or under whatever policy the user has given the agent. There turns out to be a lot of complexity in that area. It’s critical that we not only expose all the intelligence of the model, but do it safely on the user’s machine.

Ksenia:
How do you handle safety when you’re open-sourcing Codex?

Michael:
You can actually see all of it because it’s in our repo. We do different things for each operating system. On macOS, there’s a technology called Seatbelt. On Linux, we use a collection of libraries – something called Bubblewrap, seccomp, and Landlock. On Windows, we’ve actually built our own sandbox. Some of these things, like Seatbelt, are part of macOS, so they’re not in the open-source repo – just how we call it. But our Windows sandbox code is in the open-source repo. We orchestrate all these calls to go through the sandbox in the appropriate way for each different tool call.
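On Linux, the orchestration Michael describes amounts to wrapping each command in a sandbox invocation before it runs. A rough sketch of that composition with Bubblewrap – the `bwrap` flags are real, but the policy shown (read-only system directories, one writable workspace, no network) is an invented example, not Codex’s actual rules:

```python
def bwrap_argv(command: list[str], workspace: str) -> list[str]:
    """Build a Bubblewrap command line around `command`.
    Illustrative policy: read-only system dirs, one writable path, no network."""
    return [
        "bwrap",
        "--ro-bind", "/usr", "/usr",        # system binaries, read-only
        "--ro-bind", "/lib", "/lib",
        "--bind", workspace, workspace,     # the only writable path
        "--unshare-net",                    # cut off network access
        "--die-with-parent",                # kill the sandbox if the harness dies
        "--",
    ] + command

# e.g. subprocess.run(bwrap_argv(["cargo", "test"], "/home/me/project"))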

Ksenia:
So when people fork Codex, all the safety rules are baked in?

Michael:
Right – though I should clarify a detail. Safety and security get used interchangeably in AI, but they are subtly different. What I’m describing is more on the security side: yes, you can run this tool, but you can only read these folders or write to these folders, that sort of thing.

What most people in the industry would call safety is actually happening more on the backend – making sure the tool calls the model suggests in the first place are appropriate to run. From the harness’s perspective, it’s following orders in a certain sense: faithfully executing the tool calls. But the decisions about what tool calls are safe or appropriate to run are made by the model.

So if you forked Codex and you’re still talking to our models and relying on our model’s safety, then yes, you get that. If you’re running someone else’s model, it’s a little more up in the air.

How Codex Has Grown

Ksenia:
Since you launched Codex, how has it performed? What are you seeing?

Michael:
The response has been very positive. Usage is up roughly five times from the start of the year. We launched in April of last year as part of the o3 and o4-mini launch – we were using reasoning models, but tool calling and instruction following weren’t quite where we wanted them to be. Then in August, when GPT-5 came out, we did a refresh of the CLI, and that’s really when it started moving. We had growth before, but it really started jumping up. Then we launched the VS Code extension later that summer and into the fall, and people really gravitated toward that – I believe VS Code overtook CLI usage. And then we launched the app at the start of this year, and that has really taken off. I think it’s genuinely the first of its kind in a lot of ways.

Ksenia:
What’s so new about it?

Michael:
Developers have historically spent most of their time in their IDE, so it makes sense to meet users where they are. Some users are in the terminal – that’s why we have the CLI. A lot of users are in an IDE – that’s why we’re in VS Code, and now integrated into JetBrains and Xcode as well. Those are obvious, natural places to go.

With the Codex app, we’ve actually established a new surface. I like to think of it as a mission control interface – now I’m managing many conversations in parallel. But it still has key pieces you’d expect from a traditional IDE: you can browse the diff the agent has made, you can pop open the terminal with Command-J without switching to a different window if you want to do something ad hoc. It’s really breaking the expectation that you have to have all your code in front of you at all times. For a lot of people, there’s more value in being able to organize and work across multiple agents simultaneously. That’s what we bring front and center.

How Coding Agents Change Developer Workflows

Ksenia:
How do coding agents like Codex change the way developers actually work day-to-day?

Michael:
I think the biggest change is throughput. If you really put attention into it, you can get a lot of work done in parallel. That does incur some amount of context switching that not everyone loves, but if you can master it, you can push a lot of things forward at once.

Personally, I have about five clones of the Codex repo that I hop between. Sometimes it’s a small thing I noticed while doing something else – just a quick fix. Other times it’s a full-day conversation where I’m working through a very large change with Codex throughout the day, between meetings. A lot of people who have five minutes between meetings will send another message just to push a task forward in another direction.

The second thing is that people are spending more time figuring out how to optimize this workflow. It’s all very new, relatively speaking. Should I turn this thing I keep doing into a reusable skill? Should I share that skill with my teammates? Good developers are always looking to optimize their inner loop, but this is a new inner loop that everyone is still figuring out.

And the third thing that gets a lot of attention is code review. The volume of code review has gone up significantly, but Codex is also doing a lot of that code review itself, which saves a lot of time. Figuring out how to make the most of that is still a moving target.

Early Stories from Building Codex

Ksenia:
When you were working on Codex initially, were there any unexpected things you encountered?

Michael:
So much has changed. Codex is still not even quite a year old, which is pretty remarkable given the amount of change in that time.

When we launched in April 2025, that was part of the o3 and o4-mini launch. We were using reasoning models, but tool calling and instruction following weren’t quite where we wanted them to be. Seeing that improve over time has been a big one.

One thing that was really exciting early on was getting Codex to write more of Codex itself – watching that bootstrapping happen. Things like agents.md becoming more standard, getting that scaffolding in place so that you’re building the tool that’s optimizing your own workflow. That gives you a kind of exponential liftoff that was just exciting and fun. And seeing colleagues really start to get it and shift more of their work to Codex – that was great to watch.

Repositories in an Agentic World

Ksenia:
How do repositories and documentation need to look when an agent is navigating them instead of a human developer? You mentioned agents.md – what’s the biggest difference?

Michael:
It’s funny – one interesting thing about this whole agentic coding journey is that there are practices considered best practice in software for a long time that we just never did. Documentation is one. Test-driven development is another. People didn’t ignore them entirely, but the trade-off always felt questionable. Now, in an agent-first world, these things are clearly worth it. People are almost rediscovering them and genuinely caring about them.

When you think about agents.md, for example, everything we write in ours I would say is also suitable for a human joining the team – all the things they need to know, all the best practices. It’s actually kind of freeing to write these things down for both the agent and your teammates.

That said, on Codex we consider ourselves AGI-pilled – meaning the agent should really be deciding what to do rather than us feeding it more and more instructions. Rather than writing a document that runs parallel to the source code and risks being duplicative or inconsistent, we let the agent spend time reading the code and forming its own opinion. We try to put things in agents.md that it wouldn’t have gotten quickly from the code – things like: here’s how you should run the tests, or these tests are more important than those. But we try not to overdo it, and let the agent decide the best way forward.
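An AGENTS.md written in that spirit can be only a few lines – here’s an invented example of the kind of content Michael describes (commands and file names are made up, not Codex’s actual file):

```markdown
# AGENTS.md

## Testing
- Run the unit tests with `make test` before proposing a change.
- The integration suite is slow; run it only if you touch the network layer.

## Conventions
- Prefer extending an existing module over adding a new top-level file.
- Read the code before asking: most answers are in `docs/` or the tests.
```

The point is the asymmetry: everything the agent could learn by reading the code stays out of the file.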

Ksenia:
Do you think in the near future agents.md will be written by an agent?

Michael:
A lot of people’s already are. I know many folks who include in their personal instructions something like: when you’re done, please update anything of interest in the agents.md file – anything that wasn’t obvious, or that you and Codex discovered along the way. We don’t happen to do it as a general practice on our team – you can check the repo history on that one – but it’s common. There are papers coming out on how much to tell the agent, and I’m sure it depends on the agent too. We take a modest approach: not tens of pages of instructions, more like a handful of things.

Ksenia:
Context engineering seems like an increasingly important part of this process. Is there such a thing as too much context for an agent?

Michael:
From my empirical experience rather than a research perspective: for a moderately sized task, I usually write about a paragraph and ask Codex to familiarize itself with that part of the code. Sometimes I give it explicit file pointers if I think it’ll help, but often I don’t – it does a good job searching the codebase on its own.

One subtle thing that matters more than people realize: making sure files and folders are well named. That’s good practice anyway, but it’s probably even more important when an agent is searching through code.

Most of the context is going to be agents.md, the prompt I wrote, and maybe some file references. I also give Codex access to GitHub, so it can look at things like: a similar issue happened in this pull request, and it can see not just the code but the conversation that happened around that PR. But again, it’s more about letting Codex know what options it has – here are the tools in your toolbox – without being prescriptive about how it should solve the problem. It’s a good model, so it does a good job there.

Architecture, Slop, and What Surprises You

Ksenia:
It sounds like working this way pushes you toward stricter architecture.

Michael:
Certainly. Codex is going to follow patterns it sees in the codebase. If you have good architecture in the first place, it will follow it, maintain it, and enforce the invariants you set up – and you’re going to be in a good position over time. That’s also true of human developers, of course. It’s just that the rate of change is now so much higher that if you have these standards, you’re going to feel the benefits of them much more acutely.

Ksenia:
Do you still see a lot of slop coming out from the model and from coding agents? How do you fight it?

Michael:
I don’t think I’ve seen things I’d call slop from Codex, honestly. What I see more is that these models like to write code. So sometimes the right answer is deleting code, and you might need to be a little more explicit about that. But that’s not really slop – it’s more like: you added 500 lines to this file, maybe you should have made a new file. Those are easier fixes.

What’s actually far more common is that Codex knows an idiom or a language feature I just haven’t encountered yet, and it uses it. I learn something new. That’s more often how Codex surprises me – rather than slop.

Model vs. Harness: What Rules?

Ksenia:
What you’re describing is that when Codex started, the model wasn’t quite there yet. Now the model is much stronger, and the app itself is bringing a wider audience. But I’m trying to understand – big model or big harness, what matters more? Is there a point where the harness stops being a wrapper and becomes an environment that matters more, or does the model always rule?

Michael:
I see where you’re going – is there a world where the harness kind of fades away and doesn’t have much of a role? It’s possible. In a lot of ways, we try to make the harness as small and as tight as possible.

One thing you’ll notice in Codex compared to some others is that we try to give the agent very few tools. For example, Codex doesn’t have an explicit tool for reading files. Instead, we give it control of a computer terminal – if it uses cat or sed or whatever command-line tool is present to read a file, great. That ties back to what I said about being AGI-pilled: we give it a large space to play in and let it find the best path forward.

The one place we compromise on that is security. The sandboxing is a very important backstop to just letting Codex run wild. And sometimes people play tricks trying to manage the context window by biasing the agent to do certain things – as the harness author, it’s tempting to say, well, I know better. But we try to shy away from that. If Codex is about to run a tool that would spit out a gigabyte of data, our view would be: let’s bias Codex to write that to a file and then grep it, but leave it free to choose how to solve the problem.

Ksenia:
Do you think it’s possible to encode all of this – the safety rules, the sandboxing – or should there always be a human in the loop?

Michael:
For the kinds of coding tasks we focus on, I think sandboxing is really the main replacement for human-in-the-loop, at least for most of what we work on. You have a problem, you give it to Codex, it operates in a sandboxed environment constrained in certain ways, and letting it explore that space is going to get you the best solution – certainly at scale. I have five clones of Codex going. If I had to interject every few minutes on all five of them, that fundamentally limits the throughput you can get out of them.

More of those corrections should be happening at training time and then playing out at inference time, rather than requiring a human in the loop.

Ksenia:
So more will live in the model, not the harness?

Michael:
I would say so, yes. Though there are other parts that remain important – reliability of the harness is a big one. If the harness falls over, your session is over, and there’s nothing the model can do. So performance and reliability are really important principles when implementing the harness.

As we inevitably move toward multi-agent and sub-agent setups – more agents talking across machines – the harness is no longer just a single process on a single machine. It becomes a network of agents. I expect there’ll be a lot more interesting work to do there. I’ve spent most of my career writing tools for developers; now I’m writing more tools for agents. The agent can write its own tools as well, but like I said, we’d rather have a small number of very powerful tools that let it explore the space well – and we’ll continue experimenting to find the right set of primitives.

The Primitives of Agentic Coding

Ksenia:
Do you know what that set of primitives looks like? Have you thought it through?

Michael:
I think we see a lot of the pieces already. There’s what I called the shell tool, or terminal tool – the model’s interface into using a computer terminal more like a human would, not just straight command execution. Things like dealing with streaming output and using that efficiently.

Memory is another big area. Historically, every time you started a conversation, it was from nothing – that’s why you have agents.md and all this context-stuffing to get information into the model quickly. If you look in the repo, there are a lot of experiments around memory.

There’s also a lot happening with different types of context connectors. Originally we were focused on computer tasks on your local machine, but now it’s also about getting work done in a broader sense – sending email on your behalf, creating documents, taking action in a web browser.

And then there’s the standard LLM infrastructure: generally speaking, more context window is good; how you compact things when you hit the limits; all of that is still actively being pursued and contributes to the overall agent experience.
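The compaction Michael mentions can be as simple as replacing the oldest turns with a summary once the transcript nears the window limit. A toy sketch, assuming a crude character-based token estimate and a caller-supplied `summarize` function (real harnesses typically use the model itself to summarize):

```python
def tokens(msg: dict) -> int:
    """Crude token estimate: roughly four characters per token."""
    return max(1, len(msg["content"]) // 4)

def compact(history: list[dict], budget: int, summarize) -> list[dict]:
    """If the transcript exceeds `budget` tokens, fold the oldest messages
    into one summary message, keeping the most recent turns verbatim."""
    if sum(tokens(m) for m in history) <= budget:
        return history
    kept, used = [], 0
    for msg in reversed(history):       # walk back from the newest turn
        used += tokens(msg)
        if used > budget // 2:          # reserve half the budget for recency
            break
        kept.append(msg)
    kept.reverse()
    dropped = history[:len(history) - len(kept)]
    summary = {"role": "system", "content": summarize(dropped)}
    return [summary] + kept
```

The design choice here is the recency bias: recent turns are kept word-for-word because they carry the live state of the task, while older turns degrade gracefully into a summary.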

This interview has been edited and condensed for clarity.
