Guest Post: We benchmarked 7 AI code review tools on large open-source projects.*
Here are the results

In this guest post, Akshay Utture, Applied AI Engineer at Augment Code, tackles a simple question with a messy answer: can AI reviewers catch real bugs without flooding PRs? His team tested seven tools on a public benchmark and saw the same pattern. Missed issues came from missing context. Noise came from guessing in the dark.
One system handled context retrieval far better than the rest, pulling in the dependencies and call chains needed for real reasoning. Across fifty PRs from major open-source projects, that difference shaped the outcome. Check the results below →
The real test: catching bugs without creating noise
AI code review tools are marketed as fast, accurate, and context-aware. But developers know the real test is whether these systems catch meaningful issues without overwhelming PRs with noise. To understand how these tools perform in practice, we evaluated seven leading AI review systems using the only public dataset for AI code review (more info below – feel free to check our work!). The results make clear that while many tools struggle with noisy or incomplete analysis, one tool – Augment Code Review – stands significantly above the rest.
How code review quality is measured
A review comment is considered correct if it matches a golden comment: a ground-truth issue that a competent human reviewer would be expected to catch. Golden comments reflect real correctness or architectural problems, not stylistic nits.
Each tool’s comments are labeled as:
True positives: match a golden comment
False positives: incorrect or irrelevant comments
False negatives: golden comments the tool missed
From these classifications we compute:
Precision: the share of a tool's comments that are correct (how trustworthy it is)
Recall: the share of golden comments the tool catches (how comprehensive it is)
F-score: the harmonic mean of precision and recall (overall quality)
High precision keeps developers engaged; high recall is what makes a tool genuinely useful. Only systems with strong context retrieval can achieve both.
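For concreteness, here is a minimal sketch in Python (not taken from the benchmark scripts) of how these three metrics fall out of the true-positive, false-positive, and false-negative counts defined above; the counts in the example are hypothetical, chosen only to reproduce the top row of the table below:

```python
def review_metrics(true_positives: int, false_positives: int, false_negatives: int):
    """Compute precision, recall, and F-score from labeled review comments.

    true_positives:  tool comments that match a golden comment
    false_positives: incorrect or irrelevant tool comments
    false_negatives: golden comments the tool missed
    """
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical counts that happen to reproduce the 65% / 55% / 59% row below.
p, r, f = review_metrics(true_positives=11, false_positives=6, false_negatives=9)
print(f"precision={p:.0%}  recall={r:.0%}  f-score={f:.0%}")
```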
Benchmark results
Here are the results, sorted by F-score – the single best measure of overall review quality:
| Tool | Precision | Recall | F-score |
| --- | --- | --- | --- |
| Augment Code Review | 65% | 55% | 59% |
| Cursor Bugbot | 60% | 41% | 49% |
| Greptile | 45% | 45% | 45% |
| Codex Code Review | 68% | 29% | 41% |
| CodeRabbit | 36% | 43% | 39% |
| Claude Code | 23% | 51% | 31% |
| GitHub Copilot | 20% | 34% | 25% |
Augment Code Review delivers the highest F-score by a meaningful margin, and importantly, it is one of the very few tools that achieves both high precision and high recall. Achieving this balance is extremely difficult: tools that push recall higher often become noisy, while tools tuned for precision usually miss a significant number of real issues. For example, Claude Code now reaches roughly 51% recall – close to Augment’s recall – but its precision is much lower, leading to a high volume of incorrect or low-value comments. This signal-to-noise tradeoff is the core challenge in AI code review. Developers will not adopt a tool that overwhelms PRs with noise. By maintaining strong precision while also achieving the highest recall in the evaluation, Augment provides materially higher signal and a far more usable review experience in practice.
Why recall is the hardest frontier, and why Augment leads
Precision can be dialed up with filtering and conservative heuristics, but recall requires something fundamentally harder: correct, complete, and intelligent context retrieval.
Most tools fail to retrieve:
dependent modules needed to evaluate correctness
type definitions influencing nullability or invariants
caller/callee chains across files
related test files and fixtures
historical context from previous changes
These gaps lead to missed bugs – not because the model can’t reason about them, but because the model never sees the relevant code.
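As a hypothetical illustration (the function and file names are invented for this example, not drawn from the benchmark), consider a nullability bug that is invisible unless the reviewer also retrieves the callee's definition from another file:

```python
from typing import Optional

# --- repository.py (NOT part of the PR diff) ----------------------------
class User:
    def __init__(self, name: str) -> None:
        self.name = name

_USERS = {1: User("ada")}

def find_user(user_id: int) -> Optional[User]:
    """Returns None when the user does not exist."""
    return _USERS.get(user_id)

# --- user_service.py (the file changed in the PR) ------------------------
def display_name(user_id: int) -> str:
    user = find_user(user_id)
    # Bug: raises AttributeError when find_user returns None.
    # A reviewer that never retrieves repository.py cannot see that
    # find_user is Optional-returning, so this issue is easy to miss.
    return user.name.title()

print(display_name(1))   # "Ada"
print(display_name(2))   # AttributeError: 'NoneType' object has no attribute 'name'
```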
Augment Code Review succeeds because it consistently surfaces the right context.
Its retrieval engine pulls in the exact set of files and relationships necessary for the model to reason about cross-file logic, API contracts, concurrency behavior, and subtle invariants. This translates directly into higher recall without sacrificing precision.
Why some tools perform better (and why Augment performs best)
Across all seven tools, three factors determined performance – and Augment excelled in each.
1. A superior Context Engine
This is the differentiator. Augment consistently retrieved the correct dependency chains, call sites, type definitions, tests, and related modules – the raw material needed for deep reasoning. No other system demonstrated comparable accuracy or completeness in context assembly.
2. Best combination of model, prompts, and tools
Starting with a strong underlying agent is a key requirement for good code review. A well-designed agent loop, careful context engineering, specialized agent tools, and evaluations go a long way toward building agents that know how to navigate your codebase (and, where needed, the web) and collect the information necessary for a comprehensive review.
3. Purpose-built code review tuning
Augment applies domain-specific logic to suppress lint-level clutter and focus on correctness issues. This keeps the signal high while avoiding the spammy behavior common in other tools.
Augment Code Review is also tuned over time: we can measure whether each comment Augment posts is later addressed by a human developer, and that data helps us specialize and tune our agent tools, prompts, and context to continually improve the review service.
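One simple way to approximate that "addressed" signal (a minimal sketch under our own assumptions, not Augment's actual pipeline) is to check whether a later commit on the same PR touches the lines a comment flagged:

```python
from dataclasses import dataclass

@dataclass
class ReviewComment:
    file: str
    line: int                            # line the comment was anchored to

@dataclass
class FollowUpCommit:
    changed_lines: dict[str, set[int]]   # file path -> line numbers touched

def comment_addressed(comment: ReviewComment,
                      later_commits: list[FollowUpCommit],
                      radius: int = 3) -> bool:
    """Heuristic: treat a comment as addressed if any later commit on the
    same PR modifies lines within `radius` lines of the comment's anchor."""
    window = set(range(comment.line - radius, comment.line + radius + 1))
    return any(window & commit.changed_lines.get(comment.file, set())
               for commit in later_commits)

# Example: the developer edited the flagged region in a follow-up commit.
comment = ReviewComment(file="auth/session.py", line=120)
fix = FollowUpCommit(changed_lines={"auth/session.py": {118, 119, 120, 121}})
print(comment_addressed(comment, [fix]))   # True
```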
Together, these factors produce the highest precision, the highest recall, and the strongest overall F-score.
The benchmark dataset
The benchmark spans 50 pull requests across five large open-source codebases: Sentry, Grafana, Cal.com, Discourse, and Keycloak. These repositories represent real-world engineering complexity: multi-module architectures, cross-file invariants, deep dependency trees, and nontrivial test suites. Evaluating AI reviewers on this kind of code is the only way to determine whether they behave like senior engineers – or shallow linters.
How we improved the dataset
The original public dataset was invaluable, but incomplete. Many PRs contained multiple meaningful issues that were missing from the golden set, making recall and precision impossible to measure accurately. We expanded and corrected the golden comments by reviewing each PR manually, verifying issues, and validating them against tool outputs. We also adjusted severity so that trivial suggestions do not inflate scores or penalize tools unfairly.
All corrected data and scripts are open-source: https://github.com/ai-code-review-evaluations
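As a purely illustrative sketch of what that scoring involves (the real scripts and dataset format live in the repository above and may differ), matching a tool's comments against the golden set reduces to a location-plus-issue check, with the semantic match in practice made by a human or an LLM judge:

```python
from dataclasses import dataclass

@dataclass
class Comment:
    file: str
    line: int
    issue: str          # short label for the underlying problem

def label_comments(tool_comments: list[Comment],
                   golden_comments: list[Comment],
                   radius: int = 5):
    """Split tool comments into true/false positives and list missed goldens.

    A tool comment matches a golden comment when it points at the same file,
    lands within `radius` lines, and names the same issue.
    """
    matched: set[int] = set()
    true_pos, false_pos = [], []
    for c in tool_comments:
        hit = next((i for i, g in enumerate(golden_comments)
                    if i not in matched
                    and g.file == c.file
                    and abs(g.line - c.line) <= radius
                    and g.issue == c.issue), None)
        if hit is None:
            false_pos.append(c)
        else:
            matched.add(hit)
            true_pos.append(c)
    false_neg = [g for i, g in enumerate(golden_comments) if i not in matched]
    return true_pos, false_pos, false_neg
```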
Here’s the TL;DR
AI code review is moving fast, but the gap between tools is wider than marketing pages suggest. Most systems struggle to retrieve the context necessary to catch meaningful issues, leading to low recall and reviews that feel shallow or noisy. The defining challenge in AI code review isn’t generation – it’s context: assembling the right files, dependencies, and invariants so the model can reason like an experienced engineer.
Augment Code Review is the only tool in this evaluation that consistently meets that standard. Our Context Engine enables recall far above the rest of the field, and its precision keeps the signal high. Augment produces reviews that feel substantive, architectural, and genuinely useful – closer to a senior teammate than an automated assistant. As codebases grow and teams demand deeper automation, the tools that master context will define the next era of software development. By that measure, Augment Code is already well ahead.
Catch bugs without the noise
Augment Code Review is available in GA starting today! Don’t be afraid to use it with messy, gnarly, large codebases. Learn why GPT-5.2 is the model of choice for Augment Code Review.
New to Augment Code? Check out our AI software development platform to support your work with:
A next-level pair-programming Agent
High-context Chat so you don’t get blocked hunting for answers
Next Edits so you can track ripple effects through your whole codebase
Personalized in-line Completions so you can code faster
Slack integration for quick, consistent answers about your team’s work
Auggie, the advanced CLI integration to help you ship fast and not break things
Augment Code doesn’t replace developers. It helps you get more out of every keystroke.
*This post was written by Akshay Utture, Applied AI Engineer at Augment Code, and originally published here. We thank Augment Code for their insights and support of Turing Post.