Which AI Models Are Leading the Way in Reducing Hallucinations and Improving Accuracy?

AI models help in many areas, but they also tend to hallucinate and give inaccurate information. IBM defines hallucinations in AI chatbots or computer vision tools as outputs that are inaccurate because the model detects patterns that do not exist. Vectara ran 1,000 short documents through each LLM to detect hallucinations and ranked the 15 large language models with the lowest hallucination rates. According to the data, Zhipu AI’s GLM-4-9B-Chat has the lowest hallucination rate at 1.3%. Google’s Gemini-2.0-Flash-Exp ranks second, also at 1.3%.

The third-lowest rate belongs to OpenAI’s o1-mini at 1.4%, and GPT-4o is fourth at 1.5%. GPT-4o-mini and GPT-4-Turbo both have hallucination rates of 1.7%. OpenAI’s GPT-4 has a hallucination rate of 1.8%, while GPT-3.5-Turbo comes in at 1.9%. It was observed that smaller, more specialized models tend to have the lowest hallucination rates.

Low hallucination rates are important for AI systems to work properly, especially in high-stakes applications in healthcare, finance and law. Developers of smaller models are gradually reducing hallucinations as well, with Mistral’s Mixtral 8x7B models cited as reducing hallucinations in their generated text.

Vectara’s analysis underscores that reducing hallucination rates is critical for reliable AI systems in high-stakes fields.

| Model | Hallucination Rate | Factual Consistency Rate | Answer Rate | Average Summary Length (Words) |
|---|---|---|---|---|
| Zhipu AI GLM-4-9B-Chat | 1.3% | 98.7% | 100.0% | 58.1 |
| Google Gemini-2.0-Flash-Exp | 1.3% | 98.7% | 99.9% | 60 |
| OpenAI-o1-mini | 1.4% | 98.6% | 100.0% | 78.3 |
| GPT-4o | 1.5% | 98.5% | 100.0% | 77.8 |
| GPT-4o-mini | 1.7% | 98.3% | 100.0% | 76.3 |
| GPT-4-Turbo | 1.7% | 98.3% | 100.0% | 86.2 |
| GPT-4 | 1.8% | 98.2% | 100.0% | 81.1 |
| GPT-3.5-Turbo | 1.9% | 98.1% | 99.6% | 84.1 |
| DeepSeek-V2.5 | 2.4% | 97.6% | 100.0% | 83.2 |
| Microsoft Orca-2-13b | 2.5% | 97.5% | 100.0% | 66.2 |
| Microsoft Phi-3.5-MoE-instruct | 2.5% | 97.5% | 96.3% | 69.7 |
| Intel Neural-Chat-7B-v3-3 | 2.6% | 97.4% | 100.0% | 60.7 |
| Qwen2.5-7B-Instruct | 2.8% | 97.2% | 100.0% | 71 |
| AI21 Jamba-1.5-Mini | 2.9% | 97.1% | 95.6% | 74.5 |
| Snowflake-Arctic-Instruct | 3.0% | 97.0% | 100.0% | 68.7 |
| Qwen2.5-32B-Instruct | 3.0% | 97.0% | 100.0% | 67.9 |
| Microsoft Phi-3-mini-128k-instruct | 3.1% | 96.9% | 100.0% | 60.1 |
| OpenAI-o1-preview | 3.3% | 96.7% | 100.0% | 119.3 |
| Google Gemini-1.5-Flash-002 | 3.4% | 96.6% | 99.9% | 59.4 |
| 01-AI Yi-1.5-34B-Chat | 3.7% | 96.3% | 100.0% | 83.7 |
| Llama-3.1-405B-Instruct | 3.9% | 96.1% | 99.6% | 85.7 |
| Microsoft Phi-3-mini-4k-instruct | 4.0% | 96.0% | 100.0% | 86.8 |
| Llama-3.3-70B-Instruct | 4.0% | 96.0% | 100.0% | 85.3 |
| Microsoft Phi-3.5-mini-instruct | 4.1% | 95.9% | 100.0% | 75 |
| Mistral-Large2 | 4.1% | 95.9% | 100.0% | 77.4 |
| Llama-3-70B-Chat-hf | 4.1% | 95.9% | 99.2% | 68.5 |
| Qwen2-VL-7B-Instruct | 4.2% | 95.8% | 100.0% | 73.9 |
| Qwen2.5-14B-Instruct | 4.2% | 95.8% | 100.0% | 74.8 |
| Qwen2.5-72B-Instruct | 4.3% | 95.7% | 100.0% | 80 |
| Llama-3.2-90B-Vision-Instruct | 4.3% | 95.7% | 100.0% | 79.8 |
| XAI Grok | 4.6% | 95.4% | 100.0% | 91 |
| Anthropic Claude-3-5-sonnet | 4.6% | 95.4% | 100.0% | 95.9 |
| Qwen2-72B-Instruct | 4.7% | 95.3% | 100.0% | 100.1 |
| Mixtral-8x22B-Instruct-v0.1 | 4.7% | 95.3% | 99.9% | 92 |
| Anthropic Claude-3-5-haiku | 4.9% | 95.1% | 100.0% | 92.9 |
| 01-AI Yi-1.5-9B-Chat | 4.9% | 95.1% | 100.0% | 85.7 |
| Cohere Command-R | 4.9% | 95.1% | 100.0% | 68.7 |
| Llama-3.1-70B-Instruct | 5.0% | 95.0% | 100.0% | 79.6 |
| Llama-3.1-8B-Instruct | 5.4% | 94.6% | 100.0% | 71 |
| Cohere Command-R-Plus | 5.4% | 94.6% | 100.0% | 68.4 |
| Llama-3.2-11B-Vision-Instruct | 5.5% | 94.5% | 100.0% | 67.3 |
| Llama-2-70B-Chat-hf | 5.9% | 94.1% | 99.9% | 84.9 |
| IBM Granite-3.0-8B-Instruct | 6.5% | 93.5% | 100.0% | 74.2 |
| Google Gemini-1.5-Pro-002 | 6.6% | 93.7% | 99.9% | 62 |
| Google Gemini-1.5-Flash | 6.6% | 93.4% | 99.9% | 63.3 |
| Microsoft phi-2 | 6.7% | 93.3% | 91.5% | 80.8 |
| Google Gemma-2-2B-it | 7.0% | 93.0% | 100.0% | 62.2 |
| Qwen2.5-3B-Instruct | 7.0% | 93.0% | 100.0% | 70.4 |
| Llama-3-8B-Chat-hf | 7.4% | 92.6% | 99.8% | 79.7 |
| Google Gemini-Pro | 7.7% | 92.3% | 98.4% | 89.5 |
| 01-AI Yi-1.5-6B-Chat | 7.9% | 92.1% | 100.0% | 98.9 |
| Llama-3.2-3B-Instruct | 7.9% | 92.1% | 100.0% | 72.2 |
| databricks dbrx-instruct | 8.3% | 91.7% | 100.0% | 85.9 |
| Qwen2-VL-2B-Instruct | 8.3% | 91.7% | 100.0% | 81.8 |
| Cohere Aya Expanse 32B | 8.5% | 91.5% | 99.9% | 81.9 |
| IBM Granite-3.0-2B-Instruct | 8.8% | 91.2% | 100.0% | 81.6 |
| Mistral-7B-Instruct-v0.3 | 9.5% | 90.5% | 100.0% | 98.4 |
| Google Gemini-1.5-Pro | 9.1% | 90.9% | 99.8% | 61.6 |
| Anthropic Claude-3-opus | 10.1% | 89.9% | 95.5% | 92.1 |
| Google Gemma-2-9B-it | 10.1% | 89.9% | 100.0% | 70.2 |
| Llama-2-13B-Chat-hf | 10.5% | 89.5% | 99.8% | 82.1 |
| AllenAI-OLMo-2-13B-Instruct | 10.8% | 89.2% | 100.0% | 82 |
| AllenAI-OLMo-2-7B-Instruct | 11.1% | 88.9% | 100.0% | 112.6 |
| Mistral-Nemo-Instruct | 11.2% | 88.8% | 100.0% | 69.9 |
| Llama-2-7B-Chat-hf | 11.3% | 88.7% | 99.6% | 119.9 |
| Microsoft WizardLM-2-8x22B | 11.7% | 88.3% | 99.9% | 140.8 |
| Cohere Aya Expanse 8B | 12.2% | 87.8% | 99.9% | 83.9 |
| Amazon Titan-Express | 13.5% | 86.5% | 99.5% | 98.4 |
| Google PaLM-2 | 14.1% | 85.9% | 99.8% | 86.6 |
| Google Gemma-7B-it | 14.8% | 85.2% | 100.0% | 113 |
| Qwen2.5-1.5B-Instruct | 15.8% | 84.2% | 100.0% | 70.7 |
| Qwen-QwQ-32B-Preview | 16.1% | 83.9% | 100.0% | 201.5 |
| Anthropic Claude-3-sonnet | 16.3% | 83.7% | 100.0% | 108.5 |
| Google Gemma-1.1-7B-it | 17.0% | 83.0% | 100.0% | 64.3 |
| Anthropic Claude-2 | 17.4% | 82.6% | 99.3% | 87.5 |
| Google Flan-T5-large | 18.3% | 81.7% | 99.3% | 20.9 |
| Mixtral-8x7B-Instruct-v0.1 | 20.1% | 79.9% | 99.9% | 90.7 |
| Llama-3.2-1B-Instruct | 20.7% | 79.3% | 100.0% | 71.5 |
| Apple OpenELM-3B-Instruct | 24.8% | 75.2% | 99.3% | 47.2 |
| Qwen2.5-0.5B-Instruct | 25.2% | 74.8% | 100.0% | 72.6 |
| Google Gemma-1.1-2B-it | 27.8% | 72.2% | 100.0% | 66.8 |
| TII falcon-7B-instruct | 29.9% | 70.1% | 90.0% | 75.5 |
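
In the figures above, the hallucination rate and the factual consistency rate appear to be complements that add up to 100%. That is an assumption drawn from the numbers themselves, not something the article states. The following minimal Python sketch uses a handful of rows copied from the table to check that assumption and to rank models by hallucination rate. The function names `inconsistent_rows` and `rank_by_hallucination` are hypothetical helpers made up for this illustration.

```python
# Illustrative subset of rows copied from the table:
# (model, hallucination rate %, factual consistency rate %)
ROWS = [
    ("Zhipu AI GLM-4-9B-Chat", 1.3, 98.7),
    ("OpenAI-o1-mini", 1.4, 98.6),
    ("GPT-4o", 1.5, 98.5),
    ("Google Gemini-1.5-Pro-002", 6.6, 93.7),
    ("TII falcon-7B-instruct", 29.9, 70.1),
]


def inconsistent_rows(rows, tolerance=0.05):
    """Return rows whose two rates do not sum to 100% within `tolerance`.

    Assumption: factual consistency rate = 100% - hallucination rate.
    The tolerance absorbs floating-point noise in one-decimal figures.
    """
    return [r for r in rows if abs(r[1] + r[2] - 100.0) > tolerance]


def rank_by_hallucination(rows):
    """Sort rows from lowest to highest hallucination rate."""
    return sorted(rows, key=lambda r: r[1])


if __name__ == "__main__":
    for model, halluc, factual in inconsistent_rows(ROWS):
        print(f"{model}: {halluc}% + {factual}% = {halluc + factual:.1f}%")
```

On this subset, the only row flagged is Google Gemini-1.5-Pro-002, whose listed 6.6% and 93.7% sum to 100.3%, so one of those two published figures may be off by a few tenths of a point.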
