Ethan Mollick, Associate Professor at The Wharton School, recently noted some significant gaps in current LLM benchmarking:
- No benchmark for LLM hallucination rates
- Few benchmarks with human comparisons
- Lack of common benchmarks for use cases like innovation, writing, persuasion, human interaction, education, and creativity
Mollick points out that LLMs are often built toward benchmarks, which makes these gaps important: capabilities that no benchmark measures are less likely to be optimized.
However, there is some progress in this area. The Hughes Hallucination Evaluation Model (HHEM) leaderboard, developed by Vectara, measures how often an LLM introduces hallucinations when summarizing a document.
While this addresses one of Mollick’s points, at least for summarization, more comprehensive benchmarking is still needed to assess LLM capabilities across other domains and against human performance.
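
To make the idea of a hallucination rate concrete, here is a minimal Python sketch of how such a number could be computed once each summary has a factual-consistency score. It is an illustration under stated assumptions, not Vectara's actual pipeline: the function name `hallucination_rate`, the 0.5 threshold, and the sample scores are all hypothetical.

```python
from typing import Sequence


def hallucination_rate(consistency_scores: Sequence[float], threshold: float = 0.5) -> float:
    """Return the fraction of summaries scored below `threshold`.

    Assumption: each score is in [0, 1], where higher means the summary is
    better supported by its source document. The 0.5 cutoff is illustrative.
    """
    if not consistency_scores:
        raise ValueError("need at least one score")
    # Count summaries judged inconsistent with their source document.
    hallucinated = sum(1 for score in consistency_scores if score < threshold)
    return hallucinated / len(consistency_scores)


# Made-up scores for five summaries produced by one model.
scores = [0.92, 0.31, 0.78, 0.45, 0.99]
print(f"{hallucination_rate(scores):.0%}")  # prints "40%"
```

Two of the five sample scores fall below the cutoff, so the rate is 40%. A real evaluation would need a fixed document set and a scoring model applied identically to every LLM being compared.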