EVMbench: AI agents for smart contract vulnerability detection and patching.

EVMbench: Putting AI Agents on the Smart‑Contract Auditing Hot Seat

Why I’m suddenly obsessing over “smart contracts”

Look, I’ve been covering everything from the first consumer‑grade VR headset to the latest quantum‑ready CPUs, and I still get a little jittery when I hear the phrase “$100 billion of crypto assets sit behind code you can’t see.” It feels a bit like watching a massive dam built out of glass—beautiful, impressive, and terrifying if a crack shows up.

Those “cracks” are the vulnerabilities that attackers hunt for, and they’re not just theoretical. In the past year alone, a handful of exploits have siphoned off tens of millions of dollars from DeFi platforms that many of us thought were “battle‑tested.”

Enter AI. The same large‑language models that can now write a decent sonnet or suggest a new recipe are getting good—sometimes frighteningly good—at reading, writing, and executing code. If an AI can suggest a bug‑fix for a Rust library, why not let it hunt for hidden flaws in a Solidity contract?

That’s the premise behind EVMbench, a new benchmark released jointly by OpenAI and the crypto‑research firm Paradigm. It’s a sandbox where AI agents are asked to do three things: spot a vulnerability, patch it, and—if you’re feeling mischievous—exploit it. The goal? Give us a concrete yardstick for how far AI‑driven security tools have come, and, more importantly, how far they still have to go.


A quick refresher: smart contracts in plain English

If you’ve ever used a ride‑sharing app, you already understand the idea of a “contract” that runs automatically when conditions are met. In the blockchain world, a smart contract is a piece of code that lives on a public ledger and enforces those conditions without a middleman.

  • Money moves when the contract says it should.
  • Rules are immutable (unless the contract itself includes an upgrade mechanism).
  • Everyone can read the code—but that doesn’t mean everyone can understand it.

Because these contracts often hold or move real value—think stablecoins, NFTs, or tokenized assets—their security is not a nice‑to‑have; it’s a make‑or‑break issue.
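The escrow intuition above can be sketched in a few lines of ordinary Python (a toy analogy, not actual Solidity; all names here are made up):

```python
# Toy escrow to illustrate the idea: the code, not a middleman, decides
# when money moves. (A plain-Python analogy, not real on-chain code.)
class Escrow:
    def __init__(self, buyer: str, seller: str, amount: int):
        self.buyer, self.seller, self.amount = buyer, seller, amount
        self.balances = {buyer: 0, seller: 0}
        self.delivered = False

    def confirm_delivery(self, caller: str) -> None:
        # Rule enforced by code: only the buyer can trigger the payout.
        if caller != self.buyer:
            raise PermissionError("only the buyer can confirm delivery")
        self.delivered = True
        self.balances[self.seller] += self.amount  # money moves automatically

deal = Escrow("alice", "bob", 100)
deal.confirm_delivery("alice")  # the payout rule fires, no intermediary
```

The catch, of course, is that on a real chain the rule is immutable once deployed, so a bug in `confirm_delivery` can't just be hot-fixed.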


AI as both the lock‑picker and the locksmith

I’ve watched the security community wrestle with a paradox for years: the same tools that help defenders can also empower attackers. Machine‑learning‑based fuzzers, static analysis tools, and now LLM‑driven code assistants are all double‑edged swords.

What makes EVMbench compelling is that it deliberately measures AI in all three roles—detect, patch, and exploit—so we can see where the balance tilts. Think of it as a triathlon for AI agents: the “swim” is spotting the problem, the “bike” is fixing it without breaking anything else, and the “run” is trying to break it all over again.


Inside the sandbox: how EVMbench is built

1. A curated set of 120 vulnerabilities

Paradigm’s auditors mined 40 real‑world audit reports, primarily from the Code4rena competition series, and distilled 120 high‑severity bugs. Most of these are the kind of “re‑entrancy” or “unchecked external call” issues that have historically led to big losses. A handful come from the Tempo L1 blockchain—a newer, high‑throughput chain focused on stablecoin payments. Including Tempo contracts nudges the benchmark toward a use case that’s gaining traction: AI‑driven stablecoin payments.

Why this matters: By grounding the test set in actual audit findings, the benchmark avoids the “toy‑problem” trap where models ace contrived examples but stumble on production code.

2. Three task modes, each with its own scoring logic

Mode    | What the agent does                                          | How we score it
--------|--------------------------------------------------------------|----------------
Detect  | Audits a repository, flags known bugs                        | Recall of ground‑truth vulnerabilities (higher recall = higher score)
Patch   | Submits a modified contract that should still work           | Automated test suite + exploit checks must pass; no compilation errors
Exploit | Sends transactions to a sandboxed blockchain to drain funds  | Transaction replay and on‑chain verification; success = points
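The patch-mode gate is worth spelling out, because it is the strictest of the three. Here is a minimal sketch of the pass/fail logic, assuming hypothetical names throughout (the real harness is written in Rust and its internals aren't published in this form):

```python
# Hypothetical sketch of the patch-mode gate: a patch only scores if it
# compiles, still passes the original test suite, AND blocks the exploit.
from dataclasses import dataclass

@dataclass
class PatchResult:
    compiled: bool         # did the modified contract compile?
    tests_passed: bool     # does the intended behavior survive?
    exploit_blocked: bool  # does the known exploit now fail?

def score_patch(result: PatchResult) -> bool:
    """A patch earns credit only when all three checks pass."""
    return result.compiled and result.tests_passed and result.exploit_blocked

# A patch that blocks the exploit but breaks functionality scores nothing:
broken = PatchResult(compiled=True, tests_passed=False, exploit_blocked=True)
good = PatchResult(compiled=True, tests_passed=True, exploit_blocked=True)
```

That conjunction is exactly why patch scores trail exploit scores: removing the bug is easy compared to removing the bug while keeping every functional invariant intact.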

The Rust‑based harness that powers the whole thing spins up a fresh Anvil (local Ethereum testnet) for every exploit run, ensuring deterministic results and no accidental spillover to a live network.
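As a rough sketch of that isolation pattern in Python (a stand-in for the Rust harness; `--port` and `--chain-id` are real Anvil flags, everything else here is hypothetical):

```python
# Sketch of per-run isolation: each exploit attempt gets its own Anvil
# instance on a fresh port, so state never leaks between runs.
# (Hypothetical wrapper; the real harness is written in Rust.)
import subprocess

def anvil_command(port: int, chain_id: int = 31337) -> list[str]:
    # --port and --chain-id are standard Anvil CLI flags.
    return ["anvil", "--port", str(port), "--chain-id", str(chain_id)]

def launch_sandbox(port: int) -> subprocess.Popen:
    """Start a throwaway local testnet; the caller terminates it afterwards."""
    return subprocess.Popen(anvil_command(port))

# Each exploit run would then look roughly like:
#   node = launch_sandbox(8545 + run_index)
#   ... replay the candidate exploit transactions against it ...
#   node.terminate()
```

Spinning up a fresh node per run is more expensive than reusing one, but it buys determinism: every agent attempt starts from an identical chain state.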

3. Guardrails against cheating

The OpenAI team didn’t just hand over a list of bugs and call it a day. They wrote custom graders, red‑teamed the environments, and even threw in “automated task auditing agents” to sniff out loopholes where a clever model might game the system (e.g., by submitting a contract that simply aborts every transaction).

Side note: This mirrors the cat‑and‑mouse game we see in Capture‑the‑Flag (CTF) competitions, where organizers constantly patch the challenge to keep it fair.


The headline numbers: GPT‑5.3‑Codex leads the pack

When we talk about “frontier agents,” we’re talking about the most recent, high‑capacity models that OpenAI has made available through its Codex CLI. Here’s a quick rundown of the results that OpenAI highlighted in the release:

Model                             | Detect recall | Patch success | Exploit score
----------------------------------|---------------|---------------|--------------
GPT‑5.3‑Codex (latest)            | 48 %          | 34 %          | 72.2 %
GPT‑5 (released 6 months earlier) | 31 %          | 19 %          | 31.9 %
GPT‑4.5‑Codex (baseline)          | 27 %          | 15 %          | 24.3 %

A few observations jump out:

  1. Exploit mode is where the AI shines. The objective is crystal clear: keep trying until the contract is emptied. The model can iterate quickly, try variations, and learn from the sandbox feedback.
  2. Detect and patch lag behind. Spotting a bug is one thing; fixing it without breaking the contract’s intended behavior is another. The patch scores suggest that the models still struggle to preserve functional invariants while removing subtle vulnerabilities.
  3. Rapid progress. GPT‑5.3 more than doubles GPT‑5’s exploit score (31.9 % → 72.2 %) in about six months. That’s a steep curve, and it mirrors the broader trend we’ve seen in LLMs, where a few months of additional training data and architecture tweaks translate into large gains on niche tasks.
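Observation 1 boils down to a tight feedback loop: propose a transaction, replay it, check the balance, repeat. A toy sketch of that loop (mock sandbox and mock agent; nothing here is the real harness):

```python
# Toy model of the exploit feedback loop: the agent keeps proposing
# candidate transactions until the sandbox reports the contract is empty.
# Everything here (MockSandbox, the agent lambda) is a stand-in.

def run_exploit_loop(sandbox, propose, max_attempts: int = 10) -> bool:
    feedback = None
    for _ in range(max_attempts):
        tx = propose(feedback)        # agent proposes based on prior feedback
        feedback = sandbox.apply(tx)  # sandbox replays it deterministically
        if sandbox.balance() == 0:    # success = funds fully drained
            return True
    return False

class MockSandbox:
    """A vault that only yields to the 'right' transaction."""
    def __init__(self, funds: int):
        self.funds = funds
    def apply(self, tx: str) -> str:
        if tx == "reentrant_withdraw":
            self.funds = 0
            return "drained"
        return "reverted"
    def balance(self) -> int:
        return self.funds

# An agent that "learns": a naive call first, then the reentrant variant.
attempts = iter(["plain_withdraw", "reentrant_withdraw"])
drained = run_exploit_loop(MockSandbox(100), lambda fb: next(attempts))
```

The crisp, machine-checkable success signal (balance hits zero) is precisely what makes this mode so friendly to iterative agents—detection and patching offer no equivalent oracle.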

The blind spots: where EVMbench falls short

No benchmark is perfect, and the authors are candid about the limitations.

Real‑world complexity is higher

The 120 bugs are high‑severity, but they’re drawn from competitions where participants already know they’re being judged. In the wild, contracts undergo multiple layers of review, and many vulnerabilities are hidden behind complex upgrade patterns, cross‑chain calls, or obscure op‑codes that simply don’t appear in a Code4rena dataset.

“Detect” only measures recall of known bugs

If an AI flags a genuine issue that human auditors missed, the current scoring system treats it as a false positive. This is a classic problem in security research: the ground truth is often incomplete. It means the detect scores are a lower bound on true capability.
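To make the “lower bound” point concrete, here’s the recall arithmetic in miniature (illustrative bug names only):

```python
# Illustrative: detect-mode recall counts only hits against the known
# ground truth. A genuinely novel finding ("bug_D" below) earns no
# credit -- which is why the scores are a lower bound on true capability.

def detect_recall(flagged: set[str], ground_truth: set[str]) -> float:
    return len(flagged & ground_truth) / len(ground_truth)

known = {"bug_A", "bug_B", "bug_C"}        # what the auditors documented
flags = {"bug_A", "bug_B", "bug_D"}        # bug_D: a real issue they missed

recall = detect_recall(flags, known)       # 2/3, bug_D changes nothing
```

A scoring scheme that let human reviewers promote findings like `bug_D` into the ground truth would raise the measured recall without the model getting any better.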

Timing and network effects are abstracted away

Exploit tasks run on a clean Anvil instance, not a fork of mainnet. Real attacks often rely on front‑running, maximal extractable value (MEV), or precise block‑timestamp manipulation—behaviors that are hard to capture in a deterministic replay environment.

Single‑chain focus

The benchmark only supports a single EVM‑compatible chain at a time. Multi‑chain DeFi protocols that stitch together assets across Ethereum, Polygon, and Arbitrum present a whole new attack surface that isn’t represented here.


Why this matters for developers, auditors, and the rest of us

1. A yardstick for defensive AI tools

If you’re a security team at a DeFi startup, you can now point to a concrete number: “Our AI‑assistant can detect 48 % of the known high‑severity bugs in EVMbench.” That’s more actionable than a vague claim that “our model is good at smart‑contract analysis.” It also gives you a baseline to compare against human auditors.

2. A warning for attackers

The exploit scores suggest that a competent LLM can autonomously craft a fund‑draining transaction in a sandbox with a roughly 70 % success rate. That’s a signal that threat actors could soon automate large‑scale probing of vulnerable contracts, lowering the barrier to entry for sophisticated attacks.

3. Incentives for the community

OpenAI is coupling the release with a $10 M API‑credits grant for projects focused on cyber defense. The idea is to lower the cost of integrating high‑capacity models into open‑source security tools. If you’re maintaining a popular Solidity library, you could apply for credits to run nightly AI‑driven audits on every PR.

4. A call for better benchmarks

EVMbench is a solid first step, but the community will need follow‑ups that address the limitations listed above—multi‑chain scenarios, MEV‑aware exploits, and a more flexible “detect” scoring that rewards novel findings. Think of it as the first episode of a series; the sequel will need to be bigger, messier, and more realistic.


My personal take: the “AI‑as‑co‑pilot” model feels right

When I first tried the Codex CLI on a simple ERC‑20 contract, the model suggested a patch that simply added a require(msg.sender == owner) guard. It “fixed” the re‑entrancy issue but broke the token’s transfer logic for everyone else. That was a classic case of over‑fitting to the test: the model saw the vulnerability, but didn’t understand the contract’s business intent.
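That failure mode is easy to reproduce in miniature. A toy Python stand-in for the Solidity contract (hypothetical names; the real contract and patch aren’t reproduced here):

```python
# Toy model of the over-eager patch: an owner-only guard dropped into
# transfer() blocks the exploit path -- and every legitimate user with it.
class PatchedToken:
    def __init__(self, owner: str):
        self.owner = owner
        self.balances: dict[str, int] = {"alice": 50, owner: 50}

    def transfer(self, sender: str, to: str, amount: int) -> None:
        # The naive "fix": restrict the call to the owner.
        if sender != self.owner:
            raise PermissionError("patched: only owner may call")
        self.balances[sender] -= amount
        self.balances[to] = self.balances.get(to, 0) + amount

token = PatchedToken("owner")
try:
    token.transfer("alice", "bob", 10)  # an ordinary user...
    alice_can_transfer = True
except PermissionError:
    alice_can_transfer = False          # ...is now locked out
```

A grader that only checked “exploit blocked” would wave this patch through; it’s the functional test suite that catches it.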

What EVMbench forces the model to do—preserve functionality while removing the bug—is exactly the kind of human‑in‑the‑loop problem we face every day. It tells me that AI can be a powerful co‑pilot, but the pilot still needs to be vigilant.

In my own workflow, I’m already experimenting with a lightweight version of the benchmark: I feed my contracts through an open‑source LLM, let it suggest patches, then run the same Rust harness locally to verify that the patched contract still passes my unit tests. The process adds about 10 minutes to my CI pipeline, but the peace of mind is worth it.


Looking ahead: what could the next version of EVMbench look like?

  1. Dynamic state modeling – Introduce scenarios where the exploit depends on transaction ordering or gas price manipulation.
  2. Cross‑chain bridges – Add contracts that interact with other EVM chains via trusted relayers, exposing a new class of “bridge‑hacking” bugs.
  3. Human‑in‑the‑loop scoring – Allow auditors to review AI‑found vulnerabilities and flag them as true positives, feeding back into a more nuanced recall metric.
  4. Open‑source leaderboard – Publish a public leaderboard where anyone can submit a model (or a fine‑tuned version) and see how it stacks up. Competition tends to accelerate progress, as we saw with the ImageNet challenge for computer vision.

If the community rallies around these ideas, we could end up with a benchmark that not only measures AI capability but also shapes the security practices of the next generation of blockchain developers.


TL;DR

  • EVMbench is a new, Rust‑powered benchmark that asks AI agents to detect, patch, and exploit 120 real‑world smart‑contract bugs.
  • GPT‑5.3‑Codex scores a solid 72 % on the exploit task, but still lags behind on detection (48 % recall) and patching (34 % success).
  • The benchmark is a useful yardstick for both defenders and attackers, but it doesn’t capture the full messiness of live DeFi ecosystems.
  • OpenAI is backing the effort with a $10 M API‑credit grant, encouraging developers to embed AI‑driven auditing into their pipelines.
  • Future iterations should broaden the attack surface (multi‑chain, MEV, bridge contracts) and refine the scoring to reward novel findings.

If you’re building or maintaining smart contracts, it’s worth giving EVMbench a spin—or at least borrowing its methodology for your own internal audits. The AI tools are getting better, but the stakes are high, and a little extra scrutiny never hurts.


Sources

  1. OpenAI & Paradigm. Introducing EVMbench: Making smart contracts safer by evaluating AI agents’ ability to detect, patch, and exploit vulnerabilities in blockchain environments. PDF. https://cdn.openai.com/evmbench/evmbench.pdf (accessed Feb 18 2026).
  2. Paradigm. Paradigm – Research & Investment. https://www.paradigm.xyz (accessed Feb 18 2026).
  3. Tempo. Tempo – High‑throughput L1 for stablecoin payments. https://tempo.xyz (accessed Feb 18 2026).
  4. Code4rena. Code4rena Auditing Competitions. https://code4rena.com (accessed Feb 18 2026).

