New Study Shows AI Models Cannot Yet Handle Even Low-Level Software Engineering Tasks

OpenAI’s CEO, Sam Altman, says that many companies are incorporating AI into their systems, but companies should think carefully before replacing human engineers with AI because it still cannot do many tasks well. Researchers have developed a benchmark called SWE-Lancer to test how well large language models (LLMs) perform on real freelance software engineering tasks. The results showed that LLMs can fix bugs, but they often fail to understand what causes those bugs, and so they make mistakes.

The researchers tested Claude 3.5 Sonnet and OpenAI’s o1 and GPT-4o on 1,488 freelance software engineering tasks from Upwork. Together, those tasks were worth $1 million in payouts. The tasks fell into two categories: management tasks, in which the models acted as a manager and chose the best solution, and individual tasks, in which the models fixed bugs and implemented features. The results showed that real-world freelance software problems are hard to solve even for advanced AI models, which is why the models cannot yet fully replace humans.

The tasks, selected by the researchers and 100 other software professionals, were placed in Docker containers without internet access so that the models could not pull code from GitHub. The tasks were then added to the Expensify platform, and the researchers generated prompts from the task descriptions. Playwright tests simulated real-world user flows, and professional engineers triple-verified the tests to confirm that the models’ solutions actually worked.

The results showed that none of the models could earn the full value of the tasks given to them. The best-performing model was Claude 3.5 Sonnet, which earned $280,050 and solved 26.2% of the tasks. All the models performed best on manager tasks, which suggests that AI models can handle the reasoning and technical understanding needed for lower-level coding problems, but they still cannot replace low-level engineers.
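
To make the two headline numbers concrete, here is a minimal Python sketch of payout-based scoring. It assumes a model earns a task's full payout only when every test for that task passes and earns nothing otherwise. The task names, dollar amounts and the `score` helper are hypothetical illustrations, not data or code from the study.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    task_id: str       # hypothetical task label
    payout_usd: float  # price attached to the freelance task
    passed: bool       # True only if every end-to-end test passed


def score(results):
    """Return dollars earned, dollars available, and the two rates."""
    total = sum(r.payout_usd for r in results)
    earned = sum(r.payout_usd for r in results if r.passed)
    solved = sum(1 for r in results if r.passed)
    return {
        "earned_usd": earned,
        "total_usd": total,
        # Share of the available money the model earned
        "earn_rate": earned / total if total else 0.0,
        # Share of tasks the model solved, ignoring price
        "solve_rate": solved / len(results) if results else 0.0,
    }


# Made-up results: the model solves the two cheap tasks only.
results = [
    TaskResult("bug-fix-a", 250.0, True),
    TaskResult("feature-b", 1000.0, False),
    TaskResult("bug-fix-c", 500.0, True),
    TaskResult("feature-d", 8000.0, False),
]

summary = score(results)
print(f"Earned ${summary['earned_usd']:,.0f} of ${summary['total_usd']:,.0f}")
print(f"Solve rate: {summary['solve_rate']:.1%}")
print(f"Earn rate: {summary['earn_rate']:.1%}")
```

In this made-up example the model solves half of the tasks but earns far less than half of the money, because the expensive tasks are the ones it fails. Under this kind of scoring, the share of tasks solved and the share of dollars earned can differ widely.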


Image: DIW-Aigen
