Ethan Mollick, Associate Professor at The Wharton School, recently shared two key developments:
- The CharXiv benchmark, a challenging real-world chart-reading test on which humans achieve 80% accuracy. Interestingly, Claude 3.5, currently the best-performing large language model (LLM) on this test, manages only 60%.
- The Chatbot Arena, which ranks AI vision answers by human preference. In this arena, GPT-4o emerges as the winner.
As Mollick notes, “We are seeing the first practical benchmarks for AI vision.”
However, I noticed a notable omission from the Chatbot Arena: the Reka AI models, which I’ve found particularly impressive for their concise outputs.
Another benchmark, the WildVision leaderboard, hosted on Hugging Face, does include Reka Flash.
When I inquired on X, Wei-Lin Chiang, one of the researchers behind the Chatbot Arena, confirmed that they “will add soon” the Reka models to their benchmark.