by Abigail Wall

Why benchmarks are key to AI progress

feature
Aug 5, 2025 | 5 mins

Researchers are racing to develop more challenging, interpretable, and fair assessments of AI models that reflect real-world use cases. The stakes are high.


Benchmarks are often reduced to leaderboard standings in media coverage, but their role in AI development is far more critical. They are the backbone of model evaluation: guiding improvements, enabling reproducibility, and ensuring real-world applicability. Whether you're a developer, data scientist, or business leader, understanding benchmarks is essential for navigating the AI landscape effectively.

At their core, benchmarks are standardized evaluations designed to measure AI capabilities. Early examples like GLUE (General Language Understanding Evaluation) and SuperGLUE focused on natural language understanding tasks, such as sentence similarity, question answering, and textual entailment, using multiple-choice or span-based formats. Today's benchmarks are far more sophisticated, reflecting the complex demands AI systems face in production. Modern evaluations assess not only accuracy but also factors like code quality, robustness, interpretability, efficiency, and domain-specific compliance.
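To make the format concrete, here is a minimal sketch of how accuracy might be computed for a GLUE-style multiple-choice task. The model_answer callable is a hypothetical placeholder for whatever inference call your stack provides, not part of any real benchmark harness.

# Minimal sketch of accuracy scoring for a multiple-choice benchmark.
# `model_answer` is a hypothetical stand-in for your model's inference call.
from typing import Callable

def evaluate(items: list[dict], model_answer: Callable[[str, list[str]], str]) -> float:
    """Each item has a 'question', a list of 'choices', and a gold 'label'."""
    correct = 0
    for item in items:
        prediction = model_answer(item["question"], item["choices"])
        correct += int(prediction == item["label"])
    return correct / len(items)

# Example with a trivial "model" that always picks the first choice.
items = [
    {"question": "Does 'A cat sat.' entail 'An animal sat.'?",
     "choices": ["entailment", "not_entailment"], "label": "entailment"},
]
print(evaluate(items, lambda q, choices: choices[0]))  # 1.0 on this single item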

Contemporary benchmarks test advanced capabilities: maintaining long-context coherence, performing multimodal reasoning across text and images, and solving graduate-level problems in fields like physics, chemistry, and mathematics. For instance, GPQA (Graduate-Level Google-Proof Q&A Benchmark) challenges models with questions in biology, physics, and chemistry that even human experts find difficult, while MATH (Mathematics Aptitude Test of Heuristics) requires multi-step symbolic reasoning. These benchmarks increasingly use nuanced scoring rubrics to evaluate not just correctness but also the reasoning process, consistency, and in some cases explanations or chain-of-thought alignment.
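As a rough illustration of rubric-based scoring, the sketch below assigns weighted partial credit across a few criteria. The criteria and weights are hypothetical and not drawn from GPQA, MATH, or any specific benchmark.

# Hypothetical rubric: weighted partial credit across scoring dimensions.
# The criteria and weights are illustrative, not taken from any real benchmark.
RUBRIC = {
    "final_answer_correct": 0.5,
    "reasoning_steps_valid": 0.3,
    "explanation_consistent": 0.2,
}

def rubric_score(judgments: dict[str, bool]) -> float:
    """Combine per-criterion pass/fail judgments into a weighted score in [0, 1]."""
    return sum(weight for name, weight in RUBRIC.items() if judgments.get(name, False))

# Example: correct answer with a consistent explanation but a flawed reasoning step.
print(rubric_score({"final_answer_correct": True,
                    "reasoning_steps_valid": False,
                    "explanation_consistent": True}))  # 0.7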

As AI models improve, they can "saturate" benchmarks, reaching near-perfect scores that limit a test's ability to differentiate between strong and exceptional models. This phenomenon has created a benchmark arms race, prompting researchers to continuously develop more challenging, interpretable, and fair assessments that reflect real-world use cases without favoring specific modeling approaches.

Keeping up with evolving models

This evolution is particularly stark in the domain of AI coding agents. The leap from basic code completion to autonomous software engineering has driven major changes in benchmark design. For example, HumanEval, launched by OpenAI in 2021, evaluated Python function synthesis from prompts. Fast forward to 2025, and newer benchmarks like SWE-bench evaluate whether an AI agent can resolve actual GitHub issues drawn from widely used open-source repositories, involving multi-file reasoning, dependency management, and integration testing: tasks that typically require hours or days of human effort.
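HumanEval-style code benchmarks commonly report pass@k, the probability that at least one of k generated samples for a problem passes its unit tests. A minimal sketch of the standard unbiased estimator, given n samples per problem of which c pass:

# pass@k estimator for code-generation benchmarks (n samples, c correct).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn without replacement passes."""
    if n - c < k:
        return 1.0  # every size-k draw must include at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 200 samples per problem, 37 pass the tests, report pass@10.
print(round(pass_at_k(n=200, c=37, k=10), 3))

SWE-bench-style agent benchmarks typically report a simpler resolved rate, the fraction of issues whose fix passes the repository's tests, but the sampling logic above captures how function-synthesis scores are usually aggregated.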

Beyond traditional programming tasks, emerging benchmarks now test devops automation (e.g., CI/CD management), security-aware code reviews (e.g., identifying CVEs), and even product interpretation (e.g., translating feature specs into implementation plans). Consider a benchmark where an AI must migrate a full application from Python 2 to Python 3, a task involving syntax changes, dependency updates, test coverage, and deployment orchestration.

The trajectory is clear. As AI coding agents evolve from copilots to autonomous contributors, benchmarks will become more critical and credential-like. Drawing parallels to the legal field is apt: Law students may graduate, but passing the bar exam determines their right to practice. Similarly, we may see AI systems undergo domain-specific "bar exams" to earn deployment trust.

This is especially urgent in high-stakes sectors. A coding agent working on financial infrastructure may need to demonstrate competency in encryption, error handling, and compliance with banking regulations. An agent writing embedded code for medical devices would need to pass tests aligned with FDA standards and ISO safety certifications.

Quality control systems for AI

As AI agents gain autonomy in software development, the benchmarks used to evaluate them will become gatekeepers, deciding which systems are trusted to build and maintain critical infrastructure. And this trend won't stop at coding. Expect credentialing benchmarks for AI in medicine, law, finance, education, and beyond. These aren't just academic exercises. Benchmarks are positioned to become the quality control systems for an AI-governed world.

However, we're not there yet. Creating truly effective benchmarks is expensive, time-consuming, and surprisingly difficult. Consider what it takes to build something like SWE-bench: curating thousands of real GitHub issues, setting up testing environments, validating that problems are solvable, and designing fair scoring systems. This process requires domain experts, engineers, and months of refinement, all for a benchmark that may become obsolete as models rapidly improve.

Current benchmarks also have blind spots. Models can game tests without developing genuine capabilities, and performance often doesn't translate to real-world results. The measurement problem is fundamental: How do you test whether an AI can truly "understand" code versus just pattern-match its way to correct answers?

Investment in better benchmarks isn't just academic; it's infrastructure for an AI-driven future. The path from today's flawed tests to tomorrow's credentialing systems will require solving hard problems around cost, validity, and real-world relevance. Understanding both the promise and current limitations of benchmarks is essential for navigating how AI will ultimately be regulated, deployed, and trusted.

Abigail Wall is product manager at Runloop.


Generative AI Insights provides a venue for technology leaders to explore and discuss the challenges and opportunities of generative artificial intelligence. The selection is wide-ranging, from technology deep dives to case studies to expert opinion, but also subjective, based on our judgment of which topics and treatments will best serve InfoWorld's technically sophisticated audience. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Contact doug_dineley@foundryco.com.