AI Code Review

Analytics | 2026-05-27, 15:16

Researchers from Aisle have published a study evaluating whether cheaper AI models with fewer parameters, including some with open weights, can identify the vulnerabilities previously discovered by Anthropic Mythos.

The organization notes that practical vulnerability discovery typically consists of a five-step workflow:

1. Large-scale code scanning.
2. Automated or semi-automated identification of potential vulnerabilities.
3. Triage and manual or semi-automated review of findings.
4. Patch preparation and verification.
5. Exploit development and validation of exploitability.

According to the organization, Anthropic's claim that Mythos "combines" all of these stages into a single autonomous system should be treated with caution. That narrative may create the impression that advanced models are required at every stage of vulnerability discovery. In practice, the workflow comprises different categories of tasks that demand different model capabilities and do not always require the most powerful models. The test results presented below support this conclusion.
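The idea that each stage can be served by a different class of model can be sketched as a simple routing table. This is a minimal Python illustration only: the tier labels ("small", "mid", "frontier") and the stage-to-tier assignment are hypothetical placeholders, not findings or recommendations from the Aisle study.

```python
from enum import Enum


class Stage(Enum):
    """The five workflow stages listed in the article."""
    SCAN = "large-scale code scanning"
    IDENTIFY = "identification of potential vulnerabilities"
    TRIAGE = "triage and review of findings"
    PATCH = "patch preparation and verification"
    EXPLOIT = "exploit development and validation of exploitability"


# Hypothetical assignment, for illustration only. A real deployment would
# fill this table in from its own per-stage evaluation results.
ROUTES = {
    Stage.SCAN: "small",
    Stage.IDENTIFY: "small",
    Stage.TRIAGE: "mid",
    Stage.PATCH: "mid",
    Stage.EXPLOIT: "frontier",
}


def pick_tier(stage: Stage, default: str = "frontier") -> str:
    """Return the model tier configured for a stage, or the default tier."""
    return ROUTES.get(stage, default)


for stage in Stage:
    print(f"{stage.value}: {pick_tier(stage)}")
```

Keeping the mapping in data rather than in code means a stage can be moved to a cheaper or stronger tier as new evaluation results come in, without touching the rest of the pipeline.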

To compare different models' capabilities, the organization ran a series of tests:

🔷 OWASP false-positive test. The models were given a code snippet that appeared vulnerable but was not. More than 25 models were tested. Claude Sonnet 4.5, GPT-4.1, GPT-5.4, and all Anthropic models up to Opus 4.5 incorrectly flagged the code as vulnerable, while the later Sonnet 4.6 and Opus 4.6 correctly identified it as safe. Notably, smaller models such as OpenAI o3, DeepSeek R1, and GPT-OSS-20B (3.6B active parameters) also solved the task correctly.

🔷 Detection of vulnerability CVE-2026-4747 discovered by Mythos. The researchers isolated the vulnerable function and provided context, then asked eight models to assess the code for vulnerabilities. All eight models succeeded.

🔷 Detection of CVE-2026-4747 in the patched software version. The researchers fixed the vulnerability and gave each model three attempts to recognize that the code was no longer vulnerable. Only GPT-OSS-120B (5.1B active parameters) identified the code as non-vulnerable in all three attempts. Qwen3 32B succeeded twice and Codestral 2508 once, while the remaining models failed to recognize the patched code.

🔷 Detection of the SACK bug in OpenBSD discovered by Mythos. The researchers made a single API call per model, with no prior fine-tuning. In this experiment, only GPT-OSS-120B (5.1B active parameters) and Kimi K2 (open weights) succeeded.
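The tests above score two different things: whether a single verdict matches the ground truth, and whether a model repeats the correct verdict across several attempts. The Python sketch below shows one way to express both. The function names and the "consistent" / "inconsistent" / "failed" labels are hypothetical and are not taken from the study's own evaluation matrices. The only data used are the per-model counts reported above for the patched CVE-2026-4747 code.

```python
ATTEMPTS = 3  # attempts per model in the patched-code test


def outcome(is_vulnerable: bool, flagged: bool) -> str:
    """Classify one model verdict against the ground truth."""
    if flagged:
        return "true positive" if is_vulnerable else "false positive"
    return "false negative" if is_vulnerable else "true negative"


def consistency(correct: int, attempts: int = ATTEMPTS) -> str:
    """Label a model by how many of its attempts were correct."""
    if not 0 <= correct <= attempts:
        raise ValueError("correct must be between 0 and attempts")
    if correct == attempts:
        return "consistent"
    if correct == 0:
        return "failed"
    return "inconsistent"


# The OWASP snippet only looks vulnerable, so flagging it is a false positive.
print(outcome(is_vulnerable=False, flagged=True))

# Correct verdicts out of three on the patched code, as reported in the article.
patched_results = {"GPT-OSS-120B": 3, "Qwen3 32B": 2, "Codestral 2508": 1}

for model, correct in patched_results.items():
    print(f"{model}: {correct}/{ATTEMPTS} ({consistency(correct)})")
```

Separating the single-verdict check from the repeat-attempt check makes it visible when a model is right once but not reliably, which is the distinction the patched-code test draws.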

The study shows that vulnerability discovery is not a monolithic ability but a fragmented set of tasks in which different models excel at different stages. The authors challenge the idea of a universal "supermodel" that autonomously solves vulnerability discovery end-to-end. They note that even lower-cost, open-weight models can already be competitive, and that the market is moving toward multi-model workflows.

Researchers' prompts, GitHub links, and evaluation matrices are available in the study results published here.
