OpenAI unveils EVMbench to benchmark AI for smart-contract security
Overview
OpenAI announced a public benchmark named EVMbench, designed to evaluate how AI models handle security-critical code running on Ethereum-style virtual machines. Developed in partnership with Paradigm, the suite simulates realistic conditions by drawing on previously observed bug patterns and exploit scenarios. The launch signals a move from informal experiments to a structured testing regimen for models applied to blockchain code.
The aim is straightforward: measure model performance, compare systems against one another, and repeat as models evolve.
What the benchmark measures
EVMbench evaluates three discrete abilities: pinpointing security flaws, generating controlled exploits for validation, and producing corrected code that preserves contract behavior. Each ability is scored independently so progress on one axis does not mask regressions on another. The dataset pulls from audit discoveries and security competitions, prioritizing cases with real economic consequences.
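OpenAI has not published scoring code, but the independent-axis design described above can be illustrated with a minimal, hypothetical sketch. The names (`TaskResult`, `axis_averages`) and the 0.0–1.0 scale are assumptions for illustration, not part of EVMbench itself:

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    # One benchmark task, scored separately on each axis (assumed 0.0-1.0 scale).
    detect: float   # did the model pinpoint the flaw?
    exploit: float  # did its controlled exploit validate the flaw?
    patch: float    # did its fix preserve contract behavior?

def axis_averages(results: list[TaskResult]) -> dict[str, float]:
    """Average each axis independently, so a gain on one axis
    cannot mask a regression on another."""
    n = len(results)
    return {
        "detect": sum(r.detect for r in results) / n,
        "exploit": sum(r.exploit for r in results) / n,
        "patch": sum(r.patch for r in results) / n,
    }

# Example: strong detection but weak patching stays visible
# precisely because the axes are never collapsed into one number.
scores = axis_averages([
    TaskResult(detect=1.0, exploit=1.0, patch=0.0),
    TaskResult(detect=1.0, exploit=0.5, patch=1.0),
])
print(scores)  # {'detect': 1.0, 'exploit': 0.75, 'patch': 0.5}
```

A single blended score would report 0.75 for both tasks combined and hide the patching regression; keeping the axes separate is what makes that failure mode observable.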
Tests run against live-like bytecode and source variants to assess whether an AI’s output would be useful in practical audits or offensive research. That approach forces models to demonstrate both analytical depth and precision when changing sensitive on‑chain logic.
Why this matters
Smart contracts currently secure billions of dollars in user assets, and a steady stream of high-value exploits makes systematic evaluation timely. By codifying success criteria, EVMbench creates a shared reference for toolmakers, auditors, and regulators to judge AI-driven tooling. Collaboration with a crypto research investor like Paradigm suggests the benchmark balances academic rigor with field relevance.
Adoption could accelerate the integration of AI into security workflows, speed up audits, and change how teams triage vulnerabilities. It may also fuel an arms race in which defensive models improve while attackers tune models of their own to evade or exploit them, raising the bar for continuous evaluation.