According to a new study published in npj Digital Medicine, Spanish researchers investigated whether large language models are reliable sources of health advice. The team tested seven LLMs, including OpenAI's ChatGPT and GPT-4 and Meta's Llama 3, on 150 medical questions, and found that results varied widely across models. Most AI-based search engines returned incomplete or incorrect results when users asked health-related questions. Although demand for AI-powered chatbots keeps growing, few rigorous studies have examined whether LLMs give reliable medical answers. This study found that LLM accuracy depends on phrasing, retrieval bias, and reasoning, and that the models can still produce misinformation.
For the study, the researchers assessed four search engines: Google, Yahoo!, DuckDuckGo, and Bing, and seven LLMs, including ChatGPT, GPT-4, Flan-T5, Llama3, and MedLlama3. ChatGPT, GPT-4, Llama3, and MedLlama3 led most evaluations, while Flan-T5 lagged behind the pack. For the search engines, the researchers analyzed the top 20 ranked results: a passage-extraction model identified relevant snippets, and a reading-comprehension model determined whether each snippet contained a definitive yes/no answer. Two user behaviors were also simulated: "lazy" users stopped searching as soon as they found the first clear answer, while "diligent" users cross-referenced three sources before settling on an answer. Lazy users got the most accurate answers, which suggests that top-ranked results are accurate most of the time.
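The two simulated behaviors can be sketched in a few lines. This is a hypothetical illustration, not the study's actual code: the function names and the representation of results as an ordered list of extracted yes/no answers (with `None` where no definitive answer was found) are assumptions; only the lazy-vs-diligent logic follows the description above.

```python
# Sketch of the two simulated user behaviors: a "lazy" user accepts the
# first definitive yes/no answer in the ranked results, while a "diligent"
# user collects the first three definitive answers and takes a majority vote.
from collections import Counter

def lazy_user(ranked_answers):
    """Return the first definitive answer, scanning top-ranked results first."""
    for answer in ranked_answers:
        if answer is not None:
            return answer
    return None  # no definitive answer anywhere in the result list

def diligent_user(ranked_answers, k=3):
    """Collect the first k definitive answers and return the majority vote."""
    votes = [a for a in ranked_answers if a is not None][:k]
    if not votes:
        return None
    return Counter(votes).most_common(1)[0][0]

# Each entry is the extracted answer from one ranked result (top result first).
results = ["yes", None, "no", "no", "yes"]
print(lazy_user(results))      # "yes" -- trusts the top-ranked answer
print(diligent_user(results))  # "no"  -- majority of first three definitive answers
```

On this toy input the two strategies disagree, which is exactly the situation the study probed: the finding that lazy users did better implies the top-ranked answer usually beat the cross-referenced vote.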
For the large language models, the researchers tried different prompting strategies: asking a question without any context, using friendly wording, and using expert wording. They also provided the LLMs with sample Q&As, which helped some models but had no effect on others. Retrieval-augmented generation was used as well: the LLMs were given search-engine results before generating their own responses. Performance was measured by accuracy, common errors in the responses, and improvement from retrieval augmentation.
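The prompting setups described above amount to building different prompt strings around the same question. The exact wording the researchers used is not given in the article, so the templates below are assumptions; only the general structure (plain, expert-tone, few-shot with sample Q&As, and retrieval-augmented) mirrors the description.

```python
# Hypothetical prompt builders for the four setups described in the study.
# All wording is illustrative; the study's actual templates are not public here.

def plain_prompt(question):
    """No-context prompt: just the question."""
    return f"{question} Answer yes or no."

def expert_prompt(question):
    """Expert-wording prompt: frame the model as a medical expert."""
    return ("You are a medical expert. Answer the following question "
            f"with yes or no: {question}")

def few_shot_prompt(question, examples):
    """Few-shot prompt: prepend sample Q&A pairs before the real question."""
    shots = "\n".join(f"Q: {q}\nA: {a}" for q, a in examples)
    return f"{shots}\nQ: {question}\nA:"

def rag_prompt(question, snippets):
    """Retrieval-augmented prompt: include search-engine snippets as context."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (f"Context from search results:\n{context}\n\n"
            f"Based on the context, answer yes or no: {question}")
```

The retrieval-augmented variant makes the study's last finding concrete: whatever lands in `snippets` is handed straight to the model, so off-topic or wrong snippets can drag an otherwise accurate model toward a wrong answer.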
The results showed that the search engines answered 50-70% of queries accurately, while the LLMs reached roughly 80% accuracy. LLM responses varied with how questions were framed; the expert prompt was the most effective but sometimes produced less definitive answers. Bing gave the most reliable answers, though it was not significantly better than Yahoo!, Google, or DuckDuckGo. Many search results were irrelevant or off-topic, but filtering for relevant answers improved precision to 80-90%. Smaller LLMs improved after search-engine snippets were added, yet poor-quality retrieval worsened LLM accuracy, especially for COVID-19-related queries.
Error analysis revealed three major failure modes for LLMs on health-related queries: misunderstanding the medical consensus, misinterpreting the question, and giving ambiguous answers. Performance also varied by dataset, with questions drawn from a 2020 dataset producing more accurate responses than those from a 2021 dataset.
Read next: AI Search Traffic Jumps 123% as ChatGPT and Perplexity Reshape SMB Strategies