Nils Durner's Blog: Ahas, Breadcrumbs, Coding Epiphanies

Benchmarking AI Vision

Ethan Mollick, Associate Professor at The Wharton School, recently shared two key developments:

  1. The CharXiv benchmark, a challenging real-world chart-reading test on which humans achieve 80% accuracy. Interestingly, Claude 3.5, currently the best-performing Large Language Model (LLM) on this test, manages only 60%.

  2. The Chatbot Arena, which ranks AI vision answers by human preference. In this arena, GPT-4o emerges as the winner.

As Mollick notes, “We are seeing the first practical benchmarks for AI vision.”

However, I noticed a notable omission from the Chatbot Arena: the Reka AI models, which I've found particularly impressive for their concise outputs.

Another benchmark, the WildVision leaderboard hosted on Hugging Face, does include Reka Flash.

When I inquired on X, Wei-Lin Chiang, one of the researchers behind the Chatbot Arena, confirmed that they “will add soon” the Reka models.