OpenAI AI models lead secure code generation as rivals stagnate
New research shows OpenAI's latest generative artificial intelligence models achieve a higher rate of secure code generation than competing models, while other major providers show little or no progress despite continued development activity across the sector.
Benchmark results
Veracode tested more than 100 large language models (LLMs) against a standard 80-task benchmark designed to assess code security.
OpenAI's GPT-5 Mini model achieved a 72 percent security pass rate, with the main GPT-5 model following at 70 percent. Both results are notably higher than the 50 to 60 percent range recorded historically for previous generations.
Models from Anthropic, Google, Qwen, and xAI fell within a 50 to 59 percent pass rate, and some even dropped slightly compared to earlier results.
For instance, Anthropic's Claude Sonnet 4.5 reached 50 percent, Google Gemini 2.5 Pro scored 59 percent, and xAI Grok 4 reached 55 percent. The findings indicate overall market progress has stalled outside OpenAI.
Impact of reasoning
The study attributed OpenAI's improved performance to the use of "reasoning alignment" techniques. This approach allows models to internally review and filter their outputs over several steps before generating final code solutions. OpenAI's non-reasoning GPT-5-chat model achieved only a 52 percent pass rate, in line with most other providers and well behind OpenAI's own reasoning-equipped models. The results suggest stepwise reasoning plays a critical role in securing code output and avoiding common vulnerabilities.
Enterprise language focus
In language-specific assessments, the most marked improvements were found in C# and Java. These programming languages are widely used in enterprise environments, indicating a possible shift in industry focus towards business-critical applications. By contrast, models made little progress in other popular languages such as Python and JavaScript, where security performance remained steady relative to previous tests.
Persistent challenges
The research noted that certain types of vulnerabilities remain challenging for all major LLMs. Across all vendors, success rates for defending against cross-site scripting (XSS) hovered around 13 percent, while log injection saw a pass rate of just 12 percent. In both cases, the need for deeper context sensitivity appears to limit the effectiveness of current models.
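To illustrate why these two classes demand context sensitivity, the sketch below (a minimal illustration, not taken from the Veracode benchmark) shows the kind of defence models often miss: log injection requires neutralising newlines before writing untrusted input to a log, and XSS mitigation depends on *where* in the output the value lands, since HTML-body escaping alone is not enough for attribute, JavaScript, or URL contexts.

```python
import html

def safe_log_line(user_input: str) -> str:
    # Log injection: untrusted input containing CR/LF can forge extra log
    # entries. Escaping the line breaks keeps the record on one line.
    return user_input.replace("\r", "\\r").replace("\n", "\\n")

def render_comment(comment: str) -> str:
    # XSS: html.escape is sufficient only for HTML text content. A value
    # placed inside a <script> block or a URL attribute would need a
    # different, context-specific encoder, which is what models get wrong.
    return "<p>" + html.escape(comment, quote=True) + "</p>"

forged = "ok\nINFO admin logged in"
print(safe_log_line(forged))        # stays on a single log line
print(render_comment("<script>alert(1)</script>"))
```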
On security tests involving cryptographic algorithms, models generally performed well, passing more than 85 percent of relevant tasks.
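These cryptographic tasks typically come down to choosing sound primitives. As a hedged illustration of the pattern (not an example from the study), the snippet below contrasts a broken hash with the safer defaults a model is expected to suggest:

```python
import hashlib

# Weak choice a secure model should avoid: MD5's collision resistance is
# broken, so it is unsuitable for any security-sensitive purpose.
weak = hashlib.md5(b"password").hexdigest()

# Safer general-purpose default: SHA-256.
digest = hashlib.sha256(b"message").hexdigest()

# For password storage specifically, a key-derivation function with a salt
# and a work factor is the expected suggestion.
key = hashlib.pbkdf2_hmac("sha256", b"password", b"per-user-salt", 100_000)
print(digest, len(key))
```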
There was also modest improvement in the handling of SQL injection vulnerabilities, as LLMs now more frequently suggest secure coding patterns such as parameterised queries.
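The parameterised-query pattern the report refers to can be sketched briefly. In this minimal example (using Python's built-in sqlite3 module; the table and payload are illustrative assumptions), the placeholder binds the attacker's input purely as data, so the classic `' OR '1'='1` payload matches nothing:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, role TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'admin')")

# A vulnerable pattern would concatenate this directly into the SQL string,
# letting the payload rewrite the WHERE clause.
attacker = "x' OR '1'='1"

# Secure pattern: the ? placeholder treats the input as a literal value.
rows = conn.execute(
    "SELECT name FROM users WHERE name = ?", (attacker,)
).fetchall()
print(rows)  # the injection payload matches no user
```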
Security implications
"These results are a clear indication that the industry needs a more consistent approach to AI code safety. While OpenAI's reasoning-enabled models have meaningfully advanced secure code generation, security performance remains highly variable and far from sufficient industry-wide," said Jens Wessling, Chief Technology Officer, Veracode. "Relying solely on model improvements is not a viable security strategy."