New Benchmark Shows AI Agents Struggling with Real-World Tasks

Sierra, a customer experience AI startup, has developed a new benchmark for evaluating the performance of conversational AI agents. The benchmark, called TAU-bench, tests agents by having them hold conversations with LLM-simulated users while completing complex tasks. The results show that agents built on simple LLM constructs struggle with even relatively simple tasks, suggesting that companies will need more sophisticated AI agents for real-world work.

Sierra’s head of research, Karthik Narasimhan, says the benchmark helps measure the performance and reliability of AI agents as real-world users would experience them, which is essential if those agents are to be deployed in real-world settings. He adds that existing benchmarks such as SWE-bench, AgentBench, and WebArena were built for similar purposes but fall short: they evaluate only a single round of agent-human interaction and cannot capture the dynamic, multi-turn exchanges of real conversations, which limits how reliable and adaptable their measurements are.

TAU-bench was designed to address these shortcomings. Sierra set three requirements for it: agents should interact smoothly in realistic settings, follow the rules and policies laid down for each task, and be reliable enough that companies can deploy them without worrying about inconsistent results. The benchmark's tasks range from working with realistic databases and tool APIs to other complex assignments that require extended conversation, and every task tests whether the agent can retain information, carry out complex operations, and communicate through natural dialog.
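Conceptually, an episode in this style of benchmark pairs an agent with an LLM-simulated user and judges success objectively, for example by checking the resulting database state. The sketch below, which uses scripted stand-ins for both LLMs and entirely hypothetical class and function names, is one way to picture that loop; it is not Sierra's actual code.

```python
# Minimal, self-contained sketch of a TAU-bench-style episode: an agent
# converses with a simulated user and edits a database, and success is judged
# by comparing the final database state to a ground-truth target.
# All names and logic here are illustrative, not Sierra's implementation.

from dataclasses import dataclass


@dataclass
class Task:
    instruction: str      # goal given to the simulated user
    policy: str           # rules the agent is expected to follow
    expected_db: dict     # ground-truth database state after a correct run


class ScriptedUser:
    """Stands in for the LLM-simulated user: replays a fixed script."""
    def __init__(self, script):
        self.script = iter(script)

    def reply(self, agent_message: str) -> str:
        return next(self.script, "###DONE###")


class ScriptedAgent:
    """Stands in for the LLM agent: cancels an order when asked."""
    def open_conversation(self, policy: str) -> str:
        # A real agent would condition its behavior on the policy text.
        return "Hi, how can I help you today?"

    def respond(self, user_message: str, db: dict) -> str:
        if "cancel order" in user_message:
            order_id = user_message.split()[-1]
            db[order_id] = "cancelled"      # the agent's "tool call"
            return f"Order {order_id} has been cancelled."
        return "Could you give me more details?"


def run_episode(agent, user, task: Task, db: dict, max_turns: int = 10) -> bool:
    """One multi-turn conversation; success = final DB matches the target."""
    message = agent.open_conversation(task.policy)
    for _ in range(max_turns):
        user_message = user.reply(message)
        if "###DONE###" in user_message:    # user signals the task is over
            break
        message = agent.respond(user_message, db)
    return db == task.expected_db


task = Task(
    instruction="Cancel your order o42.",
    policy="Only cancel orders that have not yet shipped.",
    expected_db={"o42": "cancelled"},
)
db = {"o42": "pending"}
user = ScriptedUser(["Please cancel order o42", "###DONE###"])
print(run_episode(ScriptedAgent(), user, task, db))  # True if the task succeeded
```

Judging success by the final database state, rather than by grading the conversation itself, is what makes this kind of evaluation objective and repeatable.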

After experimenting with it, Sierra highlights four main features of TAU-bench: realistic dialog, open-ended and diverse tasks, faithful objective evaluation, and a modular framework. The benchmark was tested with 12 popular LLMs, including GPT-4, Claude 3, Gemini, and Llama. All of the agents performed poorly; even GPT-4 averaged a success rate below 50% across the tested domains.
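As a back-of-the-envelope illustration of what an average success rate across domains means, here is a small aggregation over per-episode outcomes; the domain names and numbers are invented for the example and do not reproduce Sierra's results.

```python
# Aggregate hypothetical episode outcomes into per-domain and overall
# success rates. The domains and data below are made up for illustration.

from statistics import mean

# Outcome per episode: True = task completed correctly, False = failed.
results = {
    "retail":  [True, False, False, True, False],
    "airline": [False, False, True, False, False],
}

per_domain = {d: mean(eps) for d, eps in results.items()}
overall = mean(v for eps in results.values() for v in eps)  # pooled over episodes

for domain, rate in per_domain.items():
    print(f"{domain}: {rate:.0%} success")
print(f"average success rate: {overall:.0%}")
```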


