
Guest Post: We benchmarked 7 AI code review tools on large open-source projects.*

Here are the results

In this guest post, Akshay Utture, Applied AI Engineer at Augment Code, tackles a simple question with a messy answer: can AI reviewers catch real bugs without flooding PRs? His team tested seven tools on a public benchmark and saw the same pattern. Missed issues came from missing context. Noise came from guessing in the dark.

One system handled context retrieval far better than the rest, pulling in the dependencies and call chains needed for real reasoning. Across fifty PRs from major open-source projects, that difference shaped the outcome. Check the results below →

The real test: catching bugs without creating noise

AI code review tools are marketed as fast, accurate, and context-aware. But developers know the real test is whether these systems catch meaningful issues without overwhelming PRs with noise. To understand how these tools perform in practice, we evaluated seven leading AI review systems using the only public dataset for AI code review (more info below – feel free to check our work!). The results make clear that while many tools struggle with noisy or incomplete analysis, one tool – Augment Code Review – stands significantly above the rest.

How code review quality is measured

A review comment is considered correct if it matches a golden comment: a ground-truth issue that a competent human reviewer would be expected to catch. Golden comments reflect real correctness or architectural problems, not stylistic nits.

Each tool’s comments are labeled as:

  • True positives: match a golden comment

  • False positives: incorrect or irrelevant comments

  • False negatives: golden comments the tool missed

From these classifications we compute:

  • Precision: the share of the tool’s comments that are correct – how trustworthy it is

  • Recall: the share of golden comments the tool catches – how comprehensive it is

  • F-score: the harmonic mean of precision and recall – overall quality

High precision keeps developers engaged; high recall is what makes a tool genuinely useful. Only systems with strong context retrieval can achieve both.
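
For readers who want the arithmetic spelled out, here is a minimal Python sketch of how those three numbers relate. The tp/fp/fn counts are illustrative placeholders rather than values from the benchmark, and the F-score here is the standard F1 – the harmonic mean of precision and recall – which is consistent with the figures in the table below.

```python
# Minimal sketch: how precision, recall, and F-score relate, given
# per-tool counts of true positives (tp), false positives (fp), and
# false negatives (fn) against the golden comments.

def precision(tp: int, fp: int) -> float:
    """Fraction of the tool's comments that match a golden comment."""
    return tp / (tp + fp) if tp + fp else 0.0

def recall(tp: int, fn: int) -> float:
    """Fraction of golden comments the tool actually caught."""
    return tp / (tp + fn) if tp + fn else 0.0

def f_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (F1)."""
    return 2 * p * r / (p + r) if p + r else 0.0

# Sanity check against the table below: 65% precision and 55% recall
# land at roughly a 59% F-score.
print(round(f_score(0.65, 0.55), 3))  # ~0.596
```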

Benchmark results

Here are the results, sorted by F-score – the single best measure of overall review quality:

| Tool | Precision | Recall | F-score |
|---|---|---|---|
| Augment Code Review | 65% | 55% | 59% |
| Cursor Bugbot | 60% | 41% | 49% |
| Greptile | 45% | 45% | 45% |
| Codex Code Review | 68% | 29% | 41% |
| CodeRabbit | 36% | 43% | 39% |
| Claude Code | 23% | 51% | 31% |
| GitHub Copilot | 20% | 34% | 25% |

Augment Code Review delivers the highest F-score by a meaningful margin, and importantly, it is one of the very few tools that achieves both high precision and high recall. This balance is extremely difficult: tools that push recall higher often become noisy, while tools tuned for precision usually miss a significant number of real issues. For example, Claude Code reaches roughly 51% recall – close to Augment’s – but its precision is much lower, leading to a high volume of incorrect or low-value comments.

This signal-to-noise tradeoff is the core challenge in AI code review. Developers will not adopt a tool that overwhelms PRs with noise. By maintaining strong precision while also achieving the highest recall in the evaluation, Augment provides materially higher signal and a far more usable review experience in practice.

Why recall is the hardest frontier, and why Augment leads

Precision can be dialed up with filtering and conservative heuristics, but recall requires something fundamentally harder: correct, complete, and intelligent context retrieval.

Most tools fail to retrieve:

  • dependent modules needed to evaluate correctness

  • type definitions influencing nullability or invariants

  • caller/callee chains across files

  • related test files and fixtures

  • historical context from previous changes

These gaps lead to missed bugs – not because the model can’t reason about them, but because the model never sees the relevant code.
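
To make that failure mode concrete, here is a small, entirely hypothetical two-file sketch (module and function names are invented for illustration, not taken from the benchmark PRs): the PR only touches the caller, and the fact that the callee can return None lives in a type signature in another module. A reviewer that never retrieves that signature has no basis for flagging the missing check.

```python
"""Hypothetical example of a bug that only becomes visible with
cross-file context; all names here are made up for the sketch."""

from typing import Optional

# --- Imagine this half lives in users/repository.py (untouched by the PR) ---

class User:
    def __init__(self, email: str) -> None:
        self.email = email

_USERS = {1: User("dev@example.com")}

def find_user(user_id: int) -> Optional[User]:
    """Returns None when the user does not exist."""
    return _USERS.get(user_id)

# --- And this half in notifications/send.py (the only file in the diff) ---

def send_email(address: str, message: str) -> None:
    print(f"sending {message!r} to {address}")

def notify(user_id: int, message: str) -> None:
    user = find_user(user_id)
    # Bug: no None check. A reviewer that only sees this file cannot
    # know that find_user may return None; that fact lives in the
    # Optional[User] signature in the other module.
    send_email(user.email, message)

notify(1, "welcome!")  # works
try:
    notify(2, "welcome!")  # user 2 does not exist
except AttributeError as exc:
    print(f"crash a context-aware reviewer should have caught: {exc}")
```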

Augment Code Review succeeds because it consistently surfaces the right context.

Its retrieval engine pulls in the exact set of files and relationships necessary for the model to reason about cross-file logic, API contracts, concurrency behavior, and subtle invariants. This translates directly into higher recall without sacrificing precision.

Why some tools perform better (and why Augment performs best)

Across all seven tools, three factors determined performance – and Augment excelled in each.

1. A superior Context Engine

This is the differentiator. Augment consistently retrieved the correct dependency chains, call sites, type definitions, tests, and related modules – the raw material needed for deep reasoning. No other system demonstrated comparable accuracy or completeness in context assembly.

2. Best combination of model, prompts, and tools

Starting with a strong underlying agent is a key requirement for good code review. A well-designed agent loop, careful context engineering, specialized agent tools, and thorough evaluations go a long way toward building agents that know how to navigate your codebase (and the web, when needed) and collect the information required for a comprehensive review.

3. Purpose-built code review tuning

Augment applies domain-specific logic to suppress lint-level clutter and focus on correctness issues. This keeps the signal high while avoiding the spammy behavior common in other tools.

Augment Code Review is also tuned over time. We can measure whether each comment Augment posts is addressed by a human developer, and this data helps us specialize and tune our agent tools, prompts, and context to continually improve our code review service.

Together, these factors produce the highest precision, the highest recall, and the strongest overall F-score.

The benchmark dataset

The benchmark spans 50 pull requests across five large open-source codebases, including Sentry, Grafana, Cal.com, Discourse, and Keycloak. These repositories represent real-world engineering complexity: multi-module architectures, cross-file invariants, deep dependency trees, and nontrivial test suites. Evaluating AI reviewers on this kind of code is the only way to determine whether they behave like senior engineers – or shallow linters.

How we improved the dataset

The original public dataset was invaluable, but incomplete. Many PRs contained multiple meaningful issues that were missing from the golden set, making recall and precision impossible to measure accurately. We expanded and corrected the golden comments by reviewing each PR manually, verifying issues, and validating them against tool outputs. We also adjusted severity so that trivial suggestions do not inflate scores or penalize tools unfairly.
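
As a rough illustration of that severity adjustment, the sketch below only counts non-trivial golden comments when scoring. The schema and field names here are invented for the example; the actual data format and evaluation scripts live in the repository linked below.

```python
# Hypothetical severity-aware scoring sketch; the GoldenComment schema
# is illustrative, not the benchmark's real format.

from dataclasses import dataclass

@dataclass
class GoldenComment:
    pr: str
    description: str
    severity: str  # e.g. "high", "medium", "trivial"

def scoring_set(golden: list[GoldenComment]) -> list[GoldenComment]:
    """Drop trivial items so stylistic nits neither inflate recall
    nor penalize tools that skip them."""
    return [g for g in golden if g.severity != "trivial"]

golden = [
    GoldenComment("repo#123", "race condition on shared cache", "high"),
    GoldenComment("repo#123", "rename local variable for clarity", "trivial"),
]
print(len(scoring_set(golden)))  # 1: only the correctness issue counts
```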

All corrected data and scripts are open-source: https://github.com/ai-code-review-evaluations

Here’s the TL;DR

AI code review is moving fast, but the gap between tools is wider than marketing pages suggest. Most systems struggle to retrieve the context necessary to catch meaningful issues, leading to low recall and reviews that feel shallow or noisy. The defining challenge in AI code review isn’t generation – it’s context: assembling the right files, dependencies, and invariants so the model can reason like an experienced engineer.

Augment Code Review is the only tool in this evaluation that consistently meets that standard. Our Context Engine enables recall far above the rest of the field, while strong precision keeps the signal high. Augment produces reviews that feel substantive, architectural, and genuinely useful – closer to a senior teammate than an automated assistant. As codebases grow and teams demand deeper automation, the tools that master context will define the next era of software development. By that measure, Augment Code is already well ahead.

Catch bugs without the noise

Augment Code Review is available in GA starting today! Don’t be afraid to use it with messy, gnarly, large codebases. Learn why GPT-5.2 is the model of choice for Augment Code Review.

New to Augment Code? Check out our AI software development platform to support your work with:

  • A next-level pair-programming Agent

  • High-context Chat so you don’t get blocked hunting for answers

  • Next Edits so you can track ripple effects through your whole codebase 

  • Personalized in-line Completions so you can code faster

  • Slack integration for quick, consistent answers about your team’s work 

  • Auggie, the advanced CLI integration to help you ship fast and not break things

Augment Code doesn’t replace developers. It helps you get more out of every keystroke.

*This post was written by Akshay Utture, Applied AI Engineer at Augment Code, and originally published here. We thank Augment Code for their insights and support of Turing Post.
