Today is a very special edition of the Agentic Workflow series. Until now, we've been thoroughly systematizing the rapidly emerging knowledge of, and about, agents and agentic systems. (Editor's note: the promised Reward Hacking article is still in the pipeline.) But this week, we decided to shake things up a bit and go with a hands-on evaluation.
We're starting with the hot potato: coding agents. Because writing code without AI? How very late 2024.
The time of writing code from scratch, line-by-line, without an intelligent agent whispering in your ear (or, more accurately, bulldozing a pull request into your repo) is behind us. We've moved on. The hype cycle has churned, the dust is beginning to settle, and what we're left with is a landscape littered with software agents, all promising to completely reshape the engineering workflow. They're in our IDEs, our CLIs, and some of them are the entire stack.
So we offer you not a sterile benchmark, but a punchy, real-world shakedown of 15 of the most talked-about coding agents on the market as of June 2025.
"There's an impulse where I want to jam something through this dumb-ass thing just so I don't have a stupid-face empty screenshot there because its so dumb etc, which gives you a glimpse into how using this tool will make you feel especially considering the potential."
We tested them head-to-head across four categories: IDE Agents, CLI Agents, Full-Stack Agents, and Hybrid Platforms. Each agent was scored by an AI across five core dimensions: Code, Testing, Tooling, Docs, and Polish (25 points total). The AI also judged each agent as a hire (would you recommend hiring this developer?).
We also included the human part (a very important one!):
How difficult is it for a human to set up?
Does it spark joy?
We also marked each agent as "One Shot" or "Two Shot" to show whether it succeeded immediately or needed a retry to function properly.
The result is a clear picture of who's leading, who's trailing, and which workflows are worth your time right now. It's also a very emotional journey that we hope you enjoy. Dive in!
In today's episode, we will cover:
The Test: Non-Expert Empowerment
The Feeling of the Future: Sparks Joy, or Sparks Frustration?
The Output: A Tale of 15 Junior Developers
Recommendations: The Right Tool for the Job
The downloadable PDF with the detailed report
How We Tested 15 AI Coding Agents
To level the playing field, we didn't try to be clever. We gave every agent the exact same prompt in a clean, empty repository: a simple Node.js web app for collecting, voting on, and annotating ideas, complete with Dockerization and unit tests. The prompt was straightforward but intentionally a bit "ill-specified, poorly thought through," just like a real-world first-draft idea.
Build a simple webapp that makes it easy to collect ideas. The user should be able to enter in a new idea, see a list of existing ideas, and be able to "vote" on them which will move them up in the list. The user should also be able to add notes and to the ideas if they want more detail, including attaching files. Build it using node that will be deployed in a docker container with a persistent volume for storage, and make sure that everything has unit tests.
Then, we let them get on with it. We were just blindly YOLOing everything. No hand-holding. No code reviews mid-stream. We wanted to see what would happen. In other words, we were testing for non-expert empowerment. Could these tools take a vague idea and make something real happen, right out of the box?
This is the easiest possible task for an agent: a greenfield project with no legacy code or constraints. If they can't handle this, they can't handle much. The full report details every step of the process for each tool, from setup and installation to the final, often-surprising, output. With a lot of zingers!
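For a sense of scale, the core behavior the prompt asks for (add an idea, vote it up the list, attach notes) can be sketched in a few lines of JavaScript. This is our own minimal illustration, not the output of any agent in the test. Every name in it (`createIdeaBoard`, `addIdea`, `vote`, `addNote`, `list`) is hypothetical, and it deliberately leaves out the HTTP layer, file attachments, persistent storage, and Docker setup that the prompt also requires.

```javascript
// Minimal in-memory sketch of the idea board described in the prompt.
// All names are hypothetical; persistence, file uploads, HTTP, and Docker
// are intentionally omitted.
function createIdeaBoard() {
  const ideas = [];
  let nextId = 1;

  // Look up an idea by id, failing loudly if it does not exist.
  function find(id) {
    const idea = ideas.find((i) => i.id === id);
    if (!idea) throw new Error(`No idea with id ${id}`);
    return idea;
  }

  return {
    addIdea(title) {
      const idea = { id: nextId++, title, votes: 0, notes: [] };
      ideas.push(idea);
      return idea.id;
    },
    vote(id) {
      find(id).votes += 1;
    },
    addNote(id, text) {
      find(id).notes.push(text);
    },
    // Highest vote count first; ties fall back to oldest idea first.
    list() {
      return [...ideas].sort((a, b) => b.votes - a.votes || a.id - b.id);
    },
  };
}

const board = createIdeaBoard();
const darkMode = board.addIdea("Dark mode");
const csvExport = board.addIdea("Export to CSV");
board.vote(csvExport);
board.addNote(csvExport, "Finance team asked for this");
console.log(board.list().map((i) => i.title)); // [ 'Export to CSV', 'Dark mode' ]
```

The interesting part of the assignment is everything this sketch skips: wiring it to a web UI, surviving a container restart, and proving it with tests.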
Developer Experience: Which Coding Agents Are Actually Fun to Use
A tool is more than just its output. It's about the developer experience (DX). Does it feel good to use? Does it make you feel powerful? Or does it make you want to throw your laptop out the window? We rated each agent on a "Sparks Joy" metric, and the results were... varied.

Feel free to share this with the link to https://www.turingpost.com/c/coding-agents-2025
Some tools felt "comforting," like the OG agent Aider. It's a throwback, a reminder of how this all started, even if the git-based workflow is now a bit of a pain. Others delivered pure, unadulterated magic. Claude Code produced a moment of "Blinkenlights!": that feeling when the lights blink and you realize, "It works! It thinks!" For Cursor+, the feeling was a full "100%" joy, the kind of "huh, that's interesting" moment of discovery that quickly turns into an "off to the races" sprint of creativity.

[Image: that's Aider]
And then there was the other side of the coin.
The standard Copilot experience, in its current form, was one of "extreme frustration." I was looking for professional terms for "stupid-face" or "poopy-head". The promise is so immense, the potential so clear, that its stumbles are infuriating. COME ON! It would be so cool if this actually worked! And poor Windsurf... let's just say my reaction was visceral: "I feel physically ill." Why? The full review contains my therapy session on the matter, but it's a fascinating case study in how a tool's presentation can create an immediate, intuitive rejection, even if the underlying tech has merit.
These subjective impressions are critical. They are the friction, the dopamine hits, the paper cuts that define whether a tool gets adopted or abandoned. The full report (it's 60 pages and is available below) gives you the play-by-play for all 15 agents, so you can see which ones will make your team feel like superheroes and which will just make them sad.
Coding Agent Results: Scores and Code Quality
To objectively score the final code, we treated each agent like a junior developer submitting a take-home assignment. We even had an AI, Claude-3.7-Sonnet, perform the initial code review, rating each project on Code Quality, Testing, Tooling, Documentation, and overall Polish.
The high-level summary is this: the gap between the best and the worst is enormous.
The top of the class was a three-way tie among Cursor Background Agent (Cursor+), v0, and Warp, all scoring a stunning 24/25. These tools produced code that was not just functional but professional, well-architected, and production-ready. They didn't just meet the prompt; they anticipated needs, with thoughtful architecture and robust DevOps. The agent from Cursor, in particular, generated a project with "excellent organization, robust architecture" and "senior-level capabilities rather than junior-level skills."
Warp's primary focus isn't even software development (it's focused on being "a command line power user"), but excellent use of thinking and planning models behind the scenes makes it a top scorer even among the more focused tools.
Warp founder Zach Lloyd joined us to explain the philosophy behind this: why tight human-agent pairing beats full autonomy, and what the ADE (Agentic Development Environment) category means for developers.
Close behind were Copilot Agent and Jules, both scoring 21/25. They showed immense promise, producing clean, modular, and thoroughly tested applications. On the other end of the spectrum, tools like the base Copilot and Windsurf limped across the finish line with a score of 13/25. Their output was "functional but simplistic," with "incomplete test implementation" and "sparse documentation." They met the bare minimum requirements but lacked the polish and robustness you'd need to ship with confidence.
These scores, and the detailed AI-powered critiques behind them, are your cheat sheet. Want to know which agent writes the best tests? Or which one nails Docker configuration every time? The tables and detailed breakdowns in the main document have the answers.
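To make the scoring arithmetic concrete, here is a small sketch of how a five-dimension rubric rolls up into a total and how agents group into tiers. It is an illustration under stated assumptions, not the actual review pipeline: the 0 to 5 range per dimension is inferred from "five dimensions, 25 points total," and the names `totalScore` and `groupByTotal` are hypothetical. The totals in `reportedTotals` are only the seven quoted in this article; the full report covers all 15.

```javascript
// Rubric dimensions named in the article.
const DIMENSIONS = ["code", "testing", "tooling", "docs", "polish"];
// Assumption: each dimension is scored 0-5 (25 total / 5 dimensions).
const MAX_PER_DIMENSION = 5;

// Sum one agent's per-dimension scores, rejecting out-of-range values.
function totalScore(scores) {
  return DIMENSIONS.reduce((sum, dim) => {
    const value = scores[dim];
    if (!Number.isFinite(value) || value < 0 || value > MAX_PER_DIMENSION) {
      throw new RangeError(`${dim} must be a number from 0 to ${MAX_PER_DIMENSION}`);
    }
    return sum + value;
  }, 0);
}

// Group agents that share a total, highest total first.
function groupByTotal(totals) {
  const groups = new Map();
  for (const [agent, total] of Object.entries(totals)) {
    if (!groups.has(total)) groups.set(total, []);
    groups.get(total).push(agent);
  }
  return [...groups.entries()].sort((x, y) => y[0] - x[0]);
}

// Totals quoted in this article (out of 25); the other agents are omitted.
const reportedTotals = {
  "Cursor Background Agent": 24,
  v0: 24,
  Warp: 24,
  "Copilot Agent": 21,
  Jules: 21,
  Copilot: 13,
  Windsurf: 13,
};

for (const [total, agents] of groupByTotal(reportedTotals)) {
  console.log(`${total}/25: ${agents.join(", ")}`);
}
// 24/25: Cursor Background Agent, v0, Warp
// 21/25: Copilot Agent, Jules
// 13/25: Copilot, Windsurf
```

Grouping by total rather than ranking one by one keeps ties visible, which matters here because the top three agents are indistinguishable on score alone.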
Best Coding Agent for Each Use Case
So, after all the testing, who wins? It depends on who you are.
For Software Professionals: The undisputed champion is the combination of Cursor + Warp.
This duo gives you the best-in-class spectrum of tools for a serious developer. The workflow we landed on is a game-changer:
Start with a model like ChatGPT or Claude to flesh out the idea.
Use the Cursor Background Agent to implement the core of the project from a product-brief.md.
Then, use the Cursor IDE to sculpt the code, making small, targeted changes. Crucially, you must "always force it to assess the current state of the code, make sure that it writes test first, and keep an active-context.md."
Finally, as you move to deployment, shift into Warp to handle GitHub Actions, deployment scripts, and all the command-line heavy lifting. The transition is seamless and feels like the future of development.
For Business Value & Casual Users: Replit.
If you just want to solve a real problem and aren't worried about lock-in, nothing is easier. It's an entire, integrated universe of development and deployment. The visual planner is great, the backend services are a button-click away, and it just works. But be warned: you're in Replit-land, and during our test it even noted, "Docker containerization isn't available in our development environment." You play by their rules.
For Product Designers & UI Iteration: v0.
If your goal is to quickly mock up a UI and communicate a vision to an engineering team, v0 is the best. It's from Vercel, so it loves Next.js and has one-push deployment down to a science. It produces stunningly good-looking, well-architected frontend code. It's the king of the "modern bootstrap looking" MVP.
For Project and Product Managers: Evaluate Copilot Agent or Jules.
These are the platforms to watch. They are "still rough around the edges" but show the most promise for true SDLC integration. Copilot Agent, with its deep ties into the GitHub ecosystem, is "overwhelming superior positioned" to win the enterprise war. If it matures, it could be a world-changer. For the perspective of GitHub's CPO on where Copilot and coding agents are heading in 2026, see our interview with Mario Rodriguez.
For Experts and Tinkerers: RooCode and Goose.
For the hard-core among us who want to run local models and have total control, these are your tools. RooCode is a VSCode extension that "makes the world a better place because this is here," allowing you to plug in any LLM you want. Goose is a powerful CLI-based system for the sovereign developer. The performance gap is still wide, but as the report concludes, "ultimately the open tools will win, or at least we'll want to live in a world where they win."
This is just the tip of the iceberg. The full June 2025 Coding Agent Report is a 60-page deep dive into the nitty-gritty. It's packed with the exact developer experience logs, screenshots of the final apps (or the error messages), and the complete AI code review for every single agent. It would be too long to post it here, but it's available for you to download (no catch, no sponsors, it's just too heavy for a newsletter) here.
If you want to see the "well-organized, modular implementation" from Claude Code, the "absent testing infrastructure" of Replit, or the "production-quality, enterprise-ready" output from v0, you have to see the detailed results. The devil, and the delight, are in the details.
And if you like that report, subscribe to https://thefocus.ai. Will writes awesome stuff.

Want the full Coding Agent Report but not ready to pay? Just refer Turing Post to three real people using valid emails, and we'll send it right to you (you will also get a 1-month free subscription).









