
Guest Post: The Coding Personalities of Leading LLMs*

In this guest post, Prasenjit Sarkar from Sonar highlights findings from the new report The Coding Personalities of Leading LLMs. He explains the Engineering Productivity Paradox: AI boosts code output but not real velocity, since all that code still needs review and verification. A very interesting read.

Technology leaders are in a race to harness the power of AI to boost engineering productivity. We see the potential everywhere – from AI coding assistants generating over 30% of new code at Google to the promise of agentic workflows transforming the entire software development lifecycle (SDLC). But many leaders are also discovering what we call the Engineering Productivity Paradox: despite a massive increase in the volume of AI-generated code, overall engineering velocity is not increasing at the same rate.

The reason is simple: all that AI-generated code has to be reviewed and verified by humans. And just like human developers, AI models have their own unique styles, strengths, and weaknesses. Measuring their performance on benchmarking challenges alone isn't enough to understand the quality, security, and maintainability of the code they produce. 

Sonar recently published its latest report, “The Coding Personalities of Leading LLMs,” which goes beyond today’s LLM benchmarks to reveal a more nuanced view of LLM performance. The report, part of our ongoing The State of Code series, analyzes the code generated by six leading LLMs to uncover their distinct “coding personalities.” This new framework for evaluating these powerful tools will help you understand the hidden risks and opportunities of AI-assisted development.

Key findings from the LLM report

Our analysis, which used the SonarQube Enterprise static analysis engine to assess over 4,400 Java programming assignments completed by six leading LLMs, revealed several critical insights for any organization adopting AI. 

1. Each LLM has a unique coding personality

Just as every developer has a distinct style, so does every LLM. Our report identifies measurable "coding personalities" based on traits like:

  • Verbosity: The sheer volume of code generated to complete a task. Claude Sonnet 4, for example, produced more than three times as many lines of code as OpenCoder-8B to solve the exact same problems.

  • Complexity: The structural and logical intricacy of the code. A high-complexity score often correlates with a larger surface area for bugs. 

  • Communication: The tendency to document code with comments. Claude 3.7 Sonnet was a prolific commenter, with a 16.4% comment density, while GPT-4o was more taciturn, at just 4.4% (a minimal sketch of this metric follows this list).
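
As a rough illustration of how a trait like comment density can be quantified, here is a minimal, hypothetical Java sketch – not Sonar’s actual metric implementation, which relies on a full parser – that approximates comment density as comment lines over non-blank lines:

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical sketch: approximate the comment density of a Java source
// file by line counting. A real analyzer works from the parse tree; this
// approximation ignores trailing comments on code lines.
public class CommentDensity {
    public static double density(Path source) throws Exception {
        List<String> lines = Files.readAllLines(source);
        long comments = 0;
        long nonBlank = 0;
        boolean inBlock = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty()) continue;
            nonBlank++;
            if (inBlock) {
                comments++;
                if (line.contains("*/")) inBlock = false;
            } else if (line.startsWith("//")) {
                comments++;
            } else if (line.startsWith("/*")) {
                comments++;
                inBlock = !line.contains("*/");
            }
        }
        return nonBlank == 0 ? 0.0 : (double) comments / nonBlank;
    }

    public static void main(String[] args) throws Exception {
        System.out.printf("comment density: %.1f%%%n",
                100 * density(Path.of(args[0])));
    }
}
```

Run over a generated solution, this yields a percentage comparable to the 16.4% vs. 4.4% figures above.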

Our report introduces "coding archetypes" to bring these personalities to life:

  • The baseline performer (GPT-5 minimal) excels at security but creates verbose, complex code, and that complexity leads to the highest rate of code quality and maintainability issues, offsetting its security strengths.

  • The senior architect (Claude Sonnet 4) writes sophisticated, complex code, but this very ambition creates opportunities for high-severity bugs like resource leaks and concurrency issues.

  • The rapid prototyper (OpenCoder-8B) is the fastest and most concise, perfect for a proof-of-concept but at the cost of creating a technical debt minefield.

  • The unfulfilled promise (Llama 3.2 90B) promises top-tier skill but delivers mediocre results while hiding a dangerous security blind spot, producing the highest share of critical vulnerabilities.

  • The efficient generalist (GPT-4o) is a solid jack-of-all-trades but has a habit of fumbling logical details, leading to persistent quality problems over time.

  • The balanced predecessor (Claude Sonnet 3.7) is a capable and highly communicative developer, producing exceptionally well-documented code that is easier for humans to understand, yet still introduces a high number of severe vulnerabilities.

Understanding these traits is like understanding the work style of a new hire – it’s critical for knowing how to manage their output and integrate them into your team.

2. LLMs share impressive strengths

Our research confirms that the models have powerful capabilities that can speed up the initial stages of development.

  • Syntactically correct, fast code generation: All models consistently produced valid, executable code and robust boilerplate for frameworks and common functions, reliably speeding up the initial stages of development. 

  • Solid algorithmic and data structure fundamentals: Each model demonstrated a strong grasp of standard algorithms and data structures, creating viable solutions for well-defined problems – an essential foundation for adding more advanced capabilities. 

  • Effective cross-language translation: The LLMs were adept at translating code concepts and snippets between programming languages, making them powerful tools for teams working across diverse technology stacks.

3. LLMs have common blind spots

While the models we studied are incredibly capable of solving complex problems, our analysis found they share a consistent set of fundamental flaws.

  • Prevalent security gaps: All evaluated LLMs produced a disturbingly high percentage of high-severity vulnerabilities. For instance, for Llama 3.2 90B, over 70% of its vulnerabilities were rated ‘BLOCKER’, while for GPT-4o, the figure was 62.5%.

  • Struggle with engineering discipline: The models consistently introduced severe bugs like resource leaks and API contract violations, issues that require a holistic understanding of an application (a short example follows this list).

  • Inherent bias towards messy code: Perhaps most fundamentally, every model showed a deep tendency to write code that is hard to maintain. For all LLMs evaluated, "code smells" – indicators of poor structure and technical debt – made up over 90% of all issues found.
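
To make the resource-leak category concrete, here is a small, hypothetical Java example – illustrative only, not drawn from the report’s dataset – showing the pattern static analysis flags, next to the conventional fix:

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ResourceLeakExample {

    // Buggy version: the reader is never closed, so the file handle leaks
    // on every call – and if readLine() throws, there is no chance to
    // clean up. Static analyzers report this as a resource-leak bug.
    static String firstLineLeaky(String path) throws IOException {
        BufferedReader reader = new BufferedReader(new FileReader(path));
        return reader.readLine();
    }

    // Fixed version: try-with-resources guarantees the reader is closed
    // on every path, including exceptions.
    static String firstLineSafe(String path) throws IOException {
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            return reader.readLine();
        }
    }
}
```

The buggy variant compiles and passes a functional test that reads one line, which is exactly why issues like this survive benchmark-style evaluation and only surface under static analysis or in production.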

4. Newer models can be risky

One of the most surprising findings from our analysis is that a model “upgrade” can conceal an increase in real-world risk. When we compared Claude 3.7 Sonnet with its successor, Claude Sonnet 4, we saw this paradox in action. The newer model showed a 6.3% improvement on benchmark pass rates, but the bugs it introduced were over 93% more likely to be of ‘BLOCKER’ severity.

In its effort to solve more complex problems, the newer model generates more sophisticated – and more fragile – solutions. This shows why relying on performance benchmarks alone can be misleading; it's essential to analyze the quality and risk profile of the code, not just its functional correctness.

Similarly, GPT-5's improved functional correctness and major reduction in 'BLOCKER' vulnerabilities come at a cost, introducing a new, more complex risk profile. This is because the model's attempts at sophisticated solutions result in a far higher rate of code smells and advanced "Concurrency / Threading" bugs than its peers.
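
A classic instance of the “Concurrency / Threading” category is a lost update from an unsynchronized read-modify-write. The following hypothetical Java sketch – ours, not taken from the report – shows both the bug and the standard fix:

```java
import java.util.concurrent.atomic.AtomicInteger;

public class CounterRace {
    private int unsafeCount = 0;                                 // racy
    private final AtomicInteger safeCount = new AtomicInteger(); // thread-safe

    // Buggy: count++ is a read-modify-write, so two threads can read the
    // same value and one increment is silently lost.
    void incrementUnsafe() { unsafeCount++; }

    // Fixed: the atomic increment makes the read-modify-write indivisible.
    void incrementSafe() { safeCount.incrementAndGet(); }

    public static void main(String[] args) throws InterruptedException {
        CounterRace c = new CounterRace();
        Runnable work = () -> {
            for (int i = 0; i < 100_000; i++) {
                c.incrementUnsafe();
                c.incrementSafe();
            }
        };
        Thread t1 = new Thread(work), t2 = new Thread(work);
        t1.start(); t2.start();
        t1.join(); t2.join();
        // unsafeCount typically ends below 200000; safeCount is exactly 200000.
        System.out.println("unsafe=" + c.unsafeCount + " safe=" + c.safeCount.get());
    }
}
```

Run with two threads, the unsafe counter almost always finishes short of 200,000 while the atomic counter lands exactly on it – the kind of defect that passes a single-threaded benchmark untouched.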

Vibe, then verify: How Sonar helps you lead in the AI era

These findings don’t diminish the transformative potential of AI; they clarify the path forward. As developers increasingly “vibe” with AI to accelerate creation, success comes from a “trust but verify” approach. True engineering productivity requires a partner that brings confidence to both sides of the equation – fueling the vibe while fortifying the verification. This is where Sonar becomes an essential partner.

Sonar helps you solve the Engineering Productivity Paradox, enabling your teams to safely adopt AI without sacrificing speed or quality. Our platform is the industry standard for integrated code quality and code security, providing a consistent verification layer for all code, whether it’s written by a human or an AI.

No matter which coding personality you "hire" – from an ambitious “senior architect” like Claude Sonnet 4 to a speedy “rapid prototyper” like OpenCoder-8B – Sonar ensures the final output meets your organization's standards. We help you:

  • Fuel AI-enabled development: Integrate seamlessly with the AI coding tools your team uses, catch issues early with real-time feedback in the IDE, and leverage automated, AI-powered fixes for all code.

  • Build trust into every line of code: Sonar’s analysis engines detect the very security vulnerabilities, bugs, and maintainability issues that our report found are common in AI-generated code.  

  • Protect your next-gen SDLC: With Sonar, you can establish quality gates and automated controls to ensure all code – especially AI-generated code – is thoroughly vetted before it ever reaches production.

  • Supercharge developers: By catching issues early and providing automated fixes, we reduce the manual toil of reviewing AI code, freeing your developers to focus on innovation.

As AI rewrites the rules of software development, leaders can't afford to be surprised by the hidden risks. Relying on performance benchmarks alone is like hiring a developer based only on a resume – it doesn't tell you anything about their real-world habits or the quality of their work.

Download The Coding Personalities of Leading LLMs to see the complete analysis, explore all the coding archetypes, and get the data you need to make informed decisions about integrating AI into your SDLC.

*This post was written by Prasenjit Sarkar, Solutions Marketing Manager at Sonar, and originally published here. We thank Sonar for their insights and support of Turing Post.
