New AI benchmarking tools evaluate real-world performance

Jun 25, 2025

Now open source, xbench uses an ever-changing evaluation mechanism to assess an AI model's ability to execute real-world tasks, while making it harder for model makers to train on the tests.


A new AI benchmark for enterprise applications is now available: xbench, a testing initiative developed in-house by Chinese venture capital firm HongShan Capital Group (HSG).

The challenge with many current benchmarks is that they are widely published, making it possible for model creators to train their models to perform well on them and, as a result, reducing their usefulness as a true measure of performance. HSG says it has created a suite of ever-changing benchmarking tests, making it harder for AI companies to train on the tests and forcing models to rely on more general test-taking capabilities.

HSG said its original intention in creating xbench was to turn its internal evaluation tool into "a public AI benchmark test, and to attract more AI talents and projects in an open and transparent way. We believe that the spirit of open source can make xbench evolve better and create greater value for the AI community."

On June 17, the company announced it had officially open-sourced two xbench benchmarks: xbench-ScienceQA and xbench-DeepSearch, promising "in the future, we will continuously and dynamically update the benchmarks based on the development of large models and AI Agents …."

Real-world relevance

AI models, said Mohit Agrawal, research director of AI and IoT at Counterpoint Research, "have outgrown traditional benchmarks, especially in subjective domains like reasoning. Xbench is a timely attempt to bridge that gap with real-world relevance and adaptability. It's not perfect, but it could lay the groundwork for how we track practical AI impact going forward."

In addition, he said, the models themselves "have progressed significantly over the last two to three years, and this means that the evaluation criteria need to evolve with their changing capabilities. Xbench aims to fill key gaps left by traditional evaluation methods, which is a welcome first step toward a more relevant and modern benchmark. It attempts to bring real-world relevance while remaining dynamic and adaptable."

However, said Agrawal, while it's relatively easy to evaluate models on math or coding tasks, "assessing models in subjective areas such as reasoning is much more challenging. Reasoning models can be applied across a wide variety of contexts, and models may specialize in particular domains. In such cases, the necessary subjectivity is difficult to capture with any benchmark. Moreover, this approach requires frequent updates and expert input, which may be difficult to maintain and scale."

Biases, he added, "may also creep into the evaluation, depending on the domain and geographic background of the experts. Overall, xbench is a strong first step, and over time, it may become the foundation for evaluating the practical impact and market readiness of AI agents."

Hyoun Park, CEO and chief analyst at Amalgam Insights, has some concerns. "The effort to keep AI benchmarks up-to-date and to improve them over time is a welcome one, because dynamic benchmarks are necessary in a market where models are changing on a monthly or even weekly basis," he said. "But my caveat is that AI benchmarks need to both be updated over time and actually change over time."

Benchmarking new use cases

He pointed out, "we are seeing with efforts such as Databricks' Agent Bricks that [it] is important to build independent benchmarks for new and emerging use cases. And Salesforce Research recently released a paper showing how LLMs fare poorly in conducting some practical tasks, even when they are capable of conducting the technical capabilities associated with the task."

The value of an LLM, said Park, is "often not in the ability to solve any specific problem, but to identify when a novel or difficult approach might be necessary. And that is going to be a challenge for even this approach to benchmarking models, as the current focus is on finding more complex questions that can be directly solved through LLMs rather than figuring out whether these complex tasks are necessary, based on more open-ended and generalized questioning."

Further to that, he suggested, "[it is] probably more important for 99% of users to simply be aware that they need to conceptually be aware of Vapnik-Chervonenkis complexity [a measure of the complexity of a model] to understand the robustness of a challenge that an AI model is trying to solve. And from a value perspective, it is more useful to simply provide context on whether the VC dimension of a challenge might be considered low or high, because there are practical ramifications on whether you use the small or large AI model to solve the problem, which can be orders of magnitude differences in cost."
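For readers unfamiliar with the term, the Vapnik-Chervonenkis (VC) dimension Park refers to is a standard concept from statistical learning theory rather than anything defined by xbench; a common textbook formulation, for a hypothesis class \(\mathcal{H}\) over an input space \(X\), is:

\[
\operatorname{VCdim}(\mathcal{H}) \;=\; \max\left\{\, n \;\middle|\; \exists\, x_1,\dots,x_n \in X :\ \bigl|\{(h(x_1),\dots,h(x_n)) : h \in \mathcal{H}\}\bigr| = 2^{n} \right\}
\]

In words, it is the size of the largest set of inputs the class can label in every possible way; a higher VC dimension loosely corresponds to the greater model capacity, and cost, that Park alludes to.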

Model benchmarking, Park said, "has been quite challenging, as the exercise is both extremely high stakes in the multibillion-dollar AI wars, and also poorly defined. There is a panoply of incentives for AI companies to cheat and overfit their models to specific tests and benchmarks."


Paul Barker is a freelance journalist whose work has appeared in a number of technology magazines and online, including IT World Canada, Channel Daily News, and Financial Post. He covers topics ranging from cybersecurity issues and the evolving world of edge computing to information management and artificial intelligence advances.

Paul was the founding editor of Dot Commerce Magazine, and held editorial leadership positions at Computing Canada and ComputerData Magazine. He earned a B.A. in Journalism from Ryerson University.
