New research suggests that an AI system from Google DeepMind can outperform human annotators at fact-checking. The findings were published in a paper titled “Long-form factuality in large language models”. The paper introduces a fact-checking method called Search-Augmented Factuality Evaluator (SAFE), which breaks a document down into individual facts and then checks each one using Google Search.
According to one of the study’s authors, SAFE uses a large language model (LLM) to break a response down into individual facts. It then checks each fact through a multi-step reasoning process that compares it against Google Search results. To test whether SAFE could replace human fact-checkers, the researchers compared its verdicts with those of human annotators on roughly 16,000 facts. SAFE agreed with the humans 72% of the time. In a sample of 100 cases where SAFE and the humans disagreed, SAFE turned out to be correct 76% of the time.
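The two steps described above, splitting a response into individual facts and checking each one against search results, can be illustrated with a toy sketch. This is not DeepMind’s SAFE code. It is a minimal sketch under stated assumptions: `split_into_facts`, `toy_search` and `toy_judge` are hypothetical stand-ins for the LLM and Google Search calls, using a naive sentence split and a tiny in-memory corpus. The real system also uses multi-step reasoning and can issue several search queries per fact; this sketch makes a single lookup per fact.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class FactVerdict:
    fact: str
    supported: bool
    evidence: List[str]


def split_into_facts(response: str) -> List[str]:
    # Hypothetical stand-in for the LLM step: naively treat each sentence as one fact.
    return [s.strip() for s in response.split(".") if s.strip()]


# Hypothetical tiny "search index" standing in for Google Search results.
TOY_CORPUS = {
    "paris": "Paris is the capital of France.",
    "everest": "Mount Everest is the highest mountain above sea level.",
}


def toy_search(query: str) -> List[str]:
    # Return every document whose keyword appears in the query.
    q = query.lower()
    return [text for key, text in TOY_CORPUS.items() if key in q]


def toy_judge(fact: str, evidence: List[str]) -> bool:
    # Hypothetical stand-in for the LLM's reasoning step: a fact counts as
    # supported only if some retrieved document contains it verbatim.
    return any(fact.lower() in doc.lower() for doc in evidence)


def rate_response(response: str,
                  search: Callable[[str], List[str]],
                  judge: Callable[[str, List[str]], bool]) -> List[FactVerdict]:
    # Check every fact on its own, as the article describes SAFE doing.
    verdicts = []
    for fact in split_into_facts(response):
        evidence = search(fact)
        verdicts.append(FactVerdict(fact, judge(fact, evidence), evidence))
    return verdicts


if __name__ == "__main__":
    text = "Paris is the capital of France. Mount Everest is in Spain."
    for v in rate_response(text, toy_search, toy_judge):
        print(v.fact, "->", "supported" if v.supported else "not supported")
```

Run on the sample text, the first fact is marked supported and the second is marked not supported, because the retrieved Everest document does not back it up. Passing `search` and `judge` in as functions keeps the per-fact loop separate from whichever model or search backend is plugged in.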
The paper claims that LLMs can achieve superhuman performance at rating factuality, but some researchers dispute what ‘superhuman’ really means here. AI researcher Gary Marcus said he was puzzled by the term. He argued that the researchers are over-hyping the model and that the wording does not fit it. In his view, the results only show that SAFE outperforms underpaid human fact-checkers. To deserve the label ‘superhuman’, SAFE would need to be benchmarked against expert human fact-checkers, because that is what matters for getting correct results.
The researchers also claim that SAFE is 20 times cheaper than human fact-checkers. They used SAFE to evaluate other models, such as ChatGPT, Gemini and Claude, to see how often these models make factual errors. The results showed that larger models generally make fewer factual errors. Even the best models still produced false claims, however, which shows that we should not over-rely on them for factual information. SAFE proved good at identifying those false claims.
SAFE’s code has been open-sourced on GitHub, so other researchers can use it to fact-check their own work and catch factual mistakes. Still, SAFE has a long way to go, and much more work is needed before it can truly compete with humans.
Image: DIW-AIgen