Gemma-2: Impressive or Just Well-Dressed?

Google recently released their open-source Gemma-2 models (27b and 9b variants), which have been gaining attention in the AI community. In a LinkedIn post, Peter Gostev, Head of AI at Moonpig, highlighted that the 27b variant is now ranking slightly higher than Meta’s 70b model, despite being 2.5 times smaller.

However, digging into the technical report reveals some interesting details about Gemma-2’s training process. The model has been trained on 1 million prompts from the LMSYS Chatbot Arena, as stated in the technical report.

This training approach raises questions about the model’s performance in benchmarks. While some might see it as “outpacing older offerings,” I’d argue that “dressing up” might be a more accurate description. The model seems to have been specifically trained to perform well on the very benchmarks it’s being evaluated on.

It’s worth noting that this isn’t necessarily a new practice. There’s a strong possibility that other models, such as GPT-4o, might have employed similar strategies. The AI community often reuses the same prompts across different evaluations, which can lead to models being overfitted to these specific benchmarks.

This situation highlights two important points:

Creating a “people-pleaser” model that performs well on popular benchmarks isn’t inherently wrong. In fact, it can be seen as a smart strategy to showcase a model’s capabilities.
However, this approach can potentially dilute the effectiveness of these benchmarks. As more models are trained on the same set of prompts used in evaluations, the benchmarks may become less indicative of true generalization capabilities.

As the AI field continues to evolve, it’s crucial to maintain a critical eye on benchmark results and understand the context in which these models are trained and evaluated. While impressive performance numbers are certainly noteworthy, they should be considered alongside a broader understanding of a model’s training process and real-world applicability.