
Italian LLM Benchmark: INVALSI for AI

The University of Milano-Bicocca has published a significant piece of work for Generative AI in Italy. As Alessandro Vitale notes in his LinkedIn post, there was previously no benchmark for gauging how well LLMs perform in Italian. The new benchmark adapts the INVALSI tests, which are routinely administered to Italian students in elementary, middle, and high school.

Key points from Vitale’s post:

  1. Claude 3.5 Sonnet by Anthropic is currently the best-performing model.
  2. OpenAI models have some caveats, including ethical filters blocking 5.4% of responses to texts by Gianni Rodari and Ennio Flaiano.
  3. GPT-3.5, widely used in production, performs poorly compared to newer models.
  4. Google’s Gemma 2 9B (instruct version) performs surprisingly well despite its small size.
  5. Fine-tuned open models by Michele Montebovi and MII-LLM show significant improvements for Italian.
  6. Models created in Italy, like Ludovico Comito’s adaptation of Sapienza University of Rome’s Minerva model, still have room for improvement.

The leaderboard shows (visualized in the sketch below the list):

  1. Claude-3.5-sonnet: 92.2
  2. Claude-3-opus: 88.5
  3. Mistral-Large-Instruct-2407: 87.5
  4. Meta-Llama-3.1-405B-Instruct: 86.1
  5. GPT-4-turbo: 86.0
  6. Claude-3-sonnet: 83.1
  7. Meta-Llama-3.1-70B-Instruct: 82.7
  8. Gemini-pro-1.5: 81.2
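
For a quick side-by-side view, here is a minimal Python sketch that renders these scores as a text bar chart. The numbers are copied verbatim from the leaderboard excerpt above; the bar scaling (two characters per point above 80) is an arbitrary choice for readability.

```python
# Scores as listed in the INVALSI leaderboard excerpt above (higher is better).
leaderboard = {
    "Claude-3.5-sonnet": 92.2,
    "Claude-3-opus": 88.5,
    "Mistral-Large-Instruct-2407": 87.5,
    "Meta-Llama-3.1-405B-Instruct": 86.1,
    "GPT-4-turbo": 86.0,
    "Claude-3-sonnet": 83.1,
    "Meta-Llama-3.1-70B-Instruct": 82.7,
    "Gemini-pro-1.5": 81.2,
}

# Render a simple text bar chart: two '#' characters per point above 80.
for model, score in sorted(leaderboard.items(), key=lambda kv: kv[1], reverse=True):
    bar = "#" * int((score - 80) * 2)
    print(f"{model:<30} {score:5.1f} {bar}")
```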

Regarding a similar benchmark for German, Malte Ostendorff, Senior Research Engineer at Deutsche Telekom, pointed to the Occiglot leaderboard, which covers several European languages, including German. While it currently consists only of translated tasks, the Occiglot team is working on localized versions as well.