New AI benchmarking tools evaluate real-world performance

Jun 25, 2025

Now open source, xbench uses an ever-changing evaluation mechanism to assess an AI model's ability to execute real-world tasks, while making it harder for model makers to train on the tests.


A new AI benchmark for enterprise applications is now available: xbench, a testing initiative developed in-house by Chinese venture capital firm HongShan Capital Group (HSG).

The challenge with many current benchmarks is that they are widely published, making it possible for model creators to train their models to perform well on them and, as a result, reducing their usefulness as a true measure of performance. HSG says it has created a suite of ever-changing benchmarking tests, making it harder for AI companies to train on the tests and forcing models to rely on more general test-taking capabilities.

HSG said its original intention in creating xbench was to turn its internal evaluation tool into "a public AI benchmark test, and to attract more AI talents and projects in an open and transparent way. We believe that the spirit of open source can make xbench evolve better and create greater value for the AI community."

On June 17, the company announced it had officially open-sourced two xbench benchmarks: xbench-ScienceQA and xbench-DeepSearch, promising "in the future, we will continuously and dynamically update the benchmarks based on the development of large models and AI Agents …."

Real-world relevance

AI models, said Mohit Agrawal, research director of AI and IoT at Counterpoint Research, "have outgrown traditional benchmarks, especially in subjective domains like reasoning. Xbench is a timely attempt to bridge that gap with real-world relevance and adaptability. It's not perfect, but it could lay the groundwork for how we track practical AI impact going forward."

In addition, he said, the models themselves "have progressed significantly over the last two to three years, and this means that the evaluation criteria need to evolve with their changing capabilities. Xbench aims to fill key gaps left by traditional evaluation methods, which is a welcome first step toward a more relevant and modern benchmark. It attempts to bring real-world relevance while remaining dynamic and adaptable."

However, said Agrawal, while it's relatively easy to evaluate models on math or coding tasks, "assessing models in subjective areas such as reasoning is much more challenging. Reasoning models can be applied across a wide variety of contexts, and models may specialize in particular domains. In such cases, the necessary subjectivity is difficult to capture with any benchmark. Moreover, this approach requires frequent updates and expert input, which may be difficult to maintain and scale."

Biases, he added, "may also creep into the evaluation, depending on the domain and geographic background of the experts. Overall, xbench is a strong first step, and over time, it may become the foundation for evaluating the practical impact and market readiness of AI agents."

Hyoun Park, CEO and chief analyst at Amalgam Insights, has some concerns. "The effort to keep AI benchmarks up-to-date and to improve them over time is a welcome one, because dynamic benchmarks are necessary in a market where models are changing on a monthly or even weekly basis," he said. "But my caveat is that AI benchmarks need to both be updated over time and actually change over time."

Benchmarking new use cases

He pointed out, "we are seeing with efforts such as Databricks' Agent Bricks that [it] is important to build independent benchmarks for new and emerging use cases. And Salesforce Research recently released a paper showing how LLMs fare poorly in conducting some practical tasks, even when they are capable of conducting the technical capabilities associated with the task."

The value of an LLM, said Park, is "often not in the ability to solve any specific problem, but to identify when a novel or difficult approach might be necessary. And that is going to be a challenge for even this approach to benchmarking models, as the current focus is on finding more complex questions that can be directly solved through LLMs rather than figuring out whether these complex tasks are necessary, based on more open-ended and generalized questioning."

Further to that, he suggested, "[it is] probably more important for 99% of users to simply be aware that they need to conceptually be aware of Vapnik-Chervonenkis complexity [a measure of the complexity of a model] to understand the robustness of a challenge that an AI model is trying to solve. And from a value perspective, it is more useful to simply provide context on whether the VC dimension of a challenge might be considered low or high, because there are practical ramifications on whether you use the small or large AI model to solve the problem, which can be orders of magnitude differences in cost."
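For readers unfamiliar with the term, the Vapnik-Chervonenkis (VC) dimension Park refers to is a standard concept from statistical learning theory rather than anything defined by xbench; a common textbook formulation, for a hypothesis class \(\mathcal{H}\) over an input space \(X\), is:

\[
\operatorname{VCdim}(\mathcal{H}) \;=\; \max\left\{\, n \;\middle|\; \exists\, x_1,\dots,x_n \in X :\ \bigl|\{(h(x_1),\dots,h(x_n)) : h \in \mathcal{H}\}\bigr| = 2^{n} \right\}
\]

In words, it is the size of the largest set of inputs the class can label in every possible way; a higher VC dimension loosely corresponds to the greater model capacity, and cost, that Park alludes to.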

Model benchmarking, Park said, "has been quite challenging, as the exercise is both extremely high stakes in the multibillion-dollar AI wars, and also poorly defined. There is a panoply of incentives for AI companies to cheat and overfit their models to specific tests and benchmarks."


Paul Barker is a freelance journalist whose work has appeared in a number of technology magazines and online, including IT World Canada, Channel Daily News, and Financial Post. He covers topics ranging from cybersecurity issues and the evolving world of edge computing to information management and artificial intelligence advances.

Paul was the founding editor of Dot Commerce Magazine, and held editorial leadership positions at Computing Canada and ComputerData Magazine. He earned a B.A. in Journalism from Ryerson University.
