From Hype to Humble: Meta’s Llama 4 Lands at 32nd in AI Rankings

Tech giant Meta landed in hot water for using an experimental, unreleased variant of Llama 4 Maverick to post higher scores on the crowdsourced LM Arena benchmark. The episode prompted LM Arena's maintainers to acknowledge the lapse, update their policies, and rescore the leaderboard.

The unmodified model's showing tells its own story: it is simply not competitive, ranking far below OpenAI's GPT-4o, Anthropic's Claude 3.5 Sonnet, and Google's Gemini 1.5 Pro. And keep in mind that the competition it was measured against is months old.

The release version of Llama 4 Maverick was added to LM Arena after the benchmarking controversy came to light. If you haven't spotted it on the leaderboard, that's because it sits in 32nd place. The question now is why its performance is so poor.

Image: DIW

The tech giant defended itself by saying the experimental variant was built for conversation. That kind of optimization plays well on LM Arena, where human raters compare outputs from competing AI systems and vote for whichever response they prefer.
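To see why a chat-tuned model can climb that kind of leaderboard, here is a minimal, illustrative sketch of how pairwise human votes can be aggregated into Elo-style ratings. This is an assumption for explanation only, not LM Arena's actual scoring code, and the model names and votes below are made up.

```python
from collections import defaultdict

def update_elo(r_a, r_b, winner, k=32):
    """Standard Elo update for a single pairwise vote.

    winner: "a", "b", or "tie".
    """
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
    r_a_new = r_a + k * (score_a - expected_a)
    r_b_new = r_b + k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a_new, r_b_new

def rank_from_votes(votes, initial=1000.0):
    """Aggregate pairwise preference votes into a ranked leaderboard."""
    ratings = defaultdict(lambda: initial)
    for model_a, model_b, winner in votes:
        ratings[model_a], ratings[model_b] = update_elo(
            ratings[model_a], ratings[model_b], winner
        )
    return sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical votes: (model_a, model_b, winner)
votes = [
    ("gpt-4o", "llama-4-maverick", "a"),
    ("llama-4-maverick", "claude-3.5-sonnet", "b"),
    ("gemini-1.5-pro", "llama-4-maverick", "tie"),
]
for model, rating in rank_from_votes(votes):
    print(f"{model}: {rating:.1f}")
```

The takeaway from this sketch: because rankings are driven entirely by which answers raters happen to prefer, a model tuned to produce chatty, likeable responses can win votes without being stronger overall.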

LM Arena has never been the most reliable indicator of an AI model's real-world performance. Even so, tailoring a model to a specific benchmark is not only misleading; it also makes life harder for developers, who struggle to predict how the model will actually perform across different contexts.

In a new statement, a company spokesperson said Meta experiments with various customized variants, and that the one submitted was optimized for chat, which is why it performed well in that setting.

The version now available is the open-source release, which developers can customize for their own uses. Meta says it is excited to see what they build and is looking forward to their feedback.

Read next: Top AI Models Fail Simple Debugging Test — Human Coders Still Reign Supreme