Anthropic's Claude AI Faces Challenges with Simple Tasks, Study Finds

The rollout of Anthropic’s latest Computer Use feature last month created a lot of excitement among users. If you happen to be one of them, you might want to read on.

Show Lab’s latest study sheds light on what users can expect in terms of the feature’s capabilities and limitations. The researchers, based at the National University of Singapore, set out to show what users can expect from the latest generation of graphical user interface (GUI) agents.

Computer Use makes Claude the first frontier model to be offered as a GUI agent, one that interacts with a device through the same interface humans use. The model perceives the desktop solely through screenshots and acts by triggering mouse and keyboard events. That lets users automate tasks with simple natural-language instructions, without needing API access to the applications involved.
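In rough terms, such an agent runs a screenshot-in, actions-out loop. The sketch below is purely illustrative and assumes nothing about Anthropic’s actual API: request_next_action is a hypothetical stand-in for the model call, while pyautogui supplies real screenshot, mouse, and keyboard primitives.

```python
# Illustrative sketch of a screenshot-in, actions-out GUI agent loop.
# request_next_action is a hypothetical stand-in for the model call;
# pyautogui supplies real screenshot, mouse, and keyboard primitives.
import pyautogui


def request_next_action(instruction, screenshot):
    """Hypothetical model call: given the goal and the current screen,
    return the next UI action as a dict, e.g. {"type": "click", "x": 10, "y": 20}."""
    raise NotImplementedError("stand-in for the model call")


def run_agent(instruction: str, max_steps: int = 20) -> None:
    for _ in range(max_steps):
        # The agent's only view of the machine is a desktop screenshot.
        screenshot = pyautogui.screenshot()

        action = request_next_action(instruction, screenshot)

        if action["type"] == "click":
            pyautogui.click(action["x"], action["y"])
        elif action["type"] == "type":
            pyautogui.write(action["text"])
        elif action["type"] == "scroll":
            pyautogui.scroll(action["amount"])
        elif action["type"] == "done":
            break  # the model judges the task complete
```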

The researchers tested Claude across a range of task categories: web search, workflow, office productivity, and video games. Web search tasks involved navigating and interacting with different websites, including searching for and purchasing items and subscribing to news services.

Workflow tasks required coordinating across multiple applications, such as pulling data from a webpage and entering it into a spreadsheet. Office productivity tasks tested the agent on everyday operations such as formatting documents, sending emails, and building presentations.

Video game tasks probed the agent’s ability to carry out multi-step operations that demand an understanding of the game’s logic and careful action planning. Each task evaluates the model along three dimensions: planning, where the model draws up a plan for the task; action, where it carries the plan out by translating each step into concrete inputs; and, as a critical element, a critic role, where it evaluates its own progress and judges whether the task has succeeded.

The model is also expected to spot any errors it makes and correct them along the way; when that is not possible, it should at least offer a coherent explanation of the failure. According to the researchers, a new evaluation framework was built around these components, with the results scored by human judges.
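Put as pseudocode, the plan-act-critique structure the researchers describe might look like the following. This is a minimal sketch of the idea only; every function name is a hypothetical stand-in, not code from the study’s framework.

```python
# Minimal sketch of the plan / act / critique structure described in the
# study. Every function here is a hypothetical stand-in, not the
# researchers' actual framework code.

def make_plan(goal: str) -> list[str]:
    """Planning: ask the model to break the goal into ordered steps."""
    raise NotImplementedError("stand-in for a model call")


def execute(step: str) -> None:
    """Action: translate one step into concrete mouse/keyboard inputs."""
    raise NotImplementedError("stand-in for GUI automation")


def critique(goal: str, progress: str) -> str:
    """Critic: judge progress against the goal; returns 'ok', 'error', or 'success'."""
    raise NotImplementedError("stand-in for a model call")


def run_task(goal: str) -> bool:
    for step in make_plan(goal):
        execute(step)
        if critique(goal, step) == "error":
            # The agent should correct the error, or at least explain it.
            return False
    return critique(goal, "final check") == "success"
```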

On the whole, Claude performed well on complex tasks. It could plan and reason through the steps a task required and coordinate across applications, and it revisited pages to verify that its work matched the stated goal.

But it also made mistakes that any human would have avoided. In one task, the model failed to complete a subscription because it never scrolled down the page to find the corresponding button. In other cases it fumbled simple operations, such as selecting and replacing text or converting bullet points to numbered lists.

Worse, the model often failed to recognize the mistakes it had made and misjudged why it had fallen short of a goal. The researchers see such misjudgments as a major shortcoming, one the whole GUI-agent framework will need to address. Their conclusion: Claude cannot yet replicate the way humans use computers.

Image: DIW-Aigen
