In a paper by researchers from the Juelich Supercomputing Center (JSC), the School of Electrical and Electronic Engineering at the University of Bristol, and the LAION AI laboratory, the authors found that many LLMs can perform reasoning, but not consistently. According to the study, LLMs often fail even at basic tasks such as simple logical questions.
The authors argue that scientific and technological experts should reassess the claimed capabilities of large language models, and that the weaknesses and failures of LLMs should be analyzed systematically to expose how weak their basic reasoning really is.
The researchers called this task the AIW (Alice In Wonderland) problem and used several variations of it to assess how different models behave when the same question is systematically varied. They gave LLMs questions like, “Alice has X brothers and Y sisters. How many sisters do Alice’s brothers have?” The question is simple: since Alice’s brothers have all of Alice’s sisters plus Alice herself as sisters, the answer is Y + 1, something even schoolchildren could work out.
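To make the setup concrete, here is a minimal sketch of how such AIW prompt variations and their ground-truth answers could be generated. The function names and the specific (X, Y) value pairs are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch: generating AIW-style prompts with ground-truth answers.
# The prompt template follows the article; the value pairs below are
# illustrative assumptions, not the exact instances used in the paper.

def aiw_prompt(brothers: int, sisters: int) -> str:
    """Build one AIW problem instance."""
    return (f"Alice has {brothers} brothers and {sisters} sisters. "
            f"How many sisters do Alice's brothers have?")

def aiw_answer(sisters: int) -> int:
    """Ground truth: each brother has Alice's sisters plus Alice herself."""
    return sisters + 1

if __name__ == "__main__":
    for brothers, sisters in [(3, 6), (4, 1), (2, 2)]:  # assumed values
        print(aiw_prompt(brothers, sisters), "->", aiw_answer(sisters))
```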
Despite the question's simplicity, many LLMs could not solve it. They produced illogical reasoning and wrong answers while presenting them as correct. The wrong answers themselves are not the biggest problem; the bigger problem is that the models defend them with arguments so plausible and confident that it becomes hard to tell whether an answer is right or wrong.
Many LLMs showed a correct-response rate below 50%, with only the largest models, such as GPT-4o, reaching about 60%. Larger AI models do better than smaller ones, but their reasoning is still unreliable. The AIW problems showed a breakdown in basic reasoning: even models that score highly on standard capability benchmarks mostly failed to solve them.
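For illustration, a correct-response rate like the one reported could be computed by scoring model outputs against the Y + 1 ground truth. The sample responses and the simple last-number parsing rule below are assumptions made for this sketch, not the paper's actual evaluation setup.

```python
import re

# Sketch: scoring model responses against the AIW ground truth (Y + 1).
# The example responses below are invented for illustration; the paper's
# real evaluation pipeline is not reproduced here.

def extract_final_number(response: str) -> int | None:
    """Take the last integer in the response as the model's answer."""
    numbers = re.findall(r"-?\d+", response)
    return int(numbers[-1]) if numbers else None

def correct_rate(responses: list[str], sisters: int) -> float:
    """Fraction of responses whose final number equals sisters + 1."""
    truth = sisters + 1
    hits = sum(extract_final_number(r) == truth for r in responses)
    return hits / len(responses)

if __name__ == "__main__":
    sample = [  # invented model outputs for an instance with Y = 3 sisters
        "Each brother has 4 sisters.",       # correct: 3 + 1
        "Alice's brothers have 3 sisters.",  # wrong: forgot Alice herself
        "The answer is 4.",                  # correct
    ]
    print(f"correct-response rate: {correct_rate(sample, sisters=3):.2f}")
```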
Image: DIWAigen
Read next:
• X Rolls Out New ‘More About This Account’ Feature But The Results Aren’t Impressive
• ChatGPT vs Gemini vs Claude: Which Generative AI Model Is Best For Creating Trending Content?