Motivated by starkly different results from different Llama 3.1 405B providers on the one hand, and claims (particularly derived from the Chatbot Arena) that quantized versions are no different on the other, I have been wishing for a telltale sign that 1) conclusively proves otherwise and 2) tells providers apart. Good news: Simon Willison has started the pelican-on-a-bicycle benchmark, a gallery that compares LLM outputs for this simple prompt:
Generate an SVG of a pelican riding a bicycle
Towards answering my questions about Llama 3.1 405B, I have added the outputs obtained via AWS Bedrock and Hyperbolic: my fork on GitHub, thread on X.
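
For the Bedrock side, collecting an output is a one-shot API call. Here is a minimal sketch using boto3's Converse API, assuming AWS credentials are configured and the model is enabled in the chosen region; the model ID and the output filename are illustrative and may differ per account and region:

```python
# Minimal sketch: query Llama 3.1 405B on AWS Bedrock with the pelican prompt.
# Assumes configured AWS credentials; the model ID may vary by region/account.
import boto3

PROMPT = "Generate an SVG of a pelican riding a bicycle"

client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.converse(
    modelId="meta.llama3-1-405b-instruct-v1:0",
    messages=[{"role": "user", "content": [{"text": PROMPT}]}],
    inferenceConfig={"temperature": 0.0, "maxTokens": 2048},
)

# The reply may wrap the SVG in prose; extract the <svg>...</svg> block
# before saving if needed.
svg = response["output"]["message"]["content"][0]["text"]
with open("pelican-bedrock.svg", "w") as f:
    f.write(svg)
```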
Further, this was a quick way to check whether increasing the temperature for the new Anthropic Claude 3.5 Sonnet (20241022) release indeed fixes the purported regressions. Summary: no (gallery on X).
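
The temperature check amounts to regenerating the same prompt at a few settings and eyeballing the SVGs. A minimal sketch with the official anthropic SDK, assuming an ANTHROPIC_API_KEY in the environment; the temperature values and filenames are illustrative:

```python
# Minimal sketch: regenerate the pelican SVG from Claude 3.5 Sonnet (20241022)
# at several temperatures to see whether higher values change the output.
import anthropic

PROMPT = "Generate an SVG of a pelican riding a bicycle"

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

for temperature in (0.0, 0.5, 1.0):
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=2048,
        temperature=temperature,
        messages=[{"role": "user", "content": PROMPT}],
    )
    # One file per temperature, for side-by-side comparison in the gallery.
    with open(f"pelican-claude-t{temperature}.svg", "w") as f:
        f.write(message.content[0].text)
```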