Simon Willison, creator of Datasette and co-creator of Django, recently asked on Twitter for a “vibe check” on Llama 3.1 405B. He was particularly interested in whether it’s becoming a credible self-hosted alternative to the best OpenAI or Anthropic models, and if any companies previously hesitant about sending data to API providers are now using it.
Replying to a comment claiming that fp16 is superior to other quantizations, and that Hyperbolic Labs was the only provider offering it, I noted:
“AWS and Nvidia NGC are not explicit about bf16, but results appear superior to what’s dubbed ‘Llama 3.1 405B Turbo’ (quantized and perhaps otherwise meddled with).”
That other hosts may be serving quantized versions without saying so explicitly adds an interesting dimension to the discussion: the landscape of LLM hosting and deployment is still evolving, with providers experimenting with different techniques to optimize performance.
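One way to run this kind of "vibe check" yourself is to send the same prompt, with sampling pinned down as far as the backend allows, to two hosted endpoints and eyeball the divergence. The sketch below uses the OpenAI-compatible client that most providers accept; the base URLs, model names, and environment variable names are placeholders, not any provider's real values.

```python
# Sketch: spot-check two hosted Llama 3.1 405B endpoints with the same
# deterministic prompt and compare the outputs. Base URLs, model names,
# and env var names are placeholders (assumptions), not real provider values.
import os
from openai import OpenAI

PROMPT = "List the first ten prime numbers, comma-separated."

endpoints = {
    # provider label -> (base_url, model name) -- both hypothetical
    "provider_a": ("https://provider-a.example/v1", "llama-3.1-405b-instruct"),
    "provider_b": ("https://provider-b.example/v1", "llama-3.1-405b-instruct"),
}

for label, (base_url, model) in endpoints.items():
    client = OpenAI(base_url=base_url, api_key=os.environ[f"{label.upper()}_API_KEY"])
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
        temperature=0,   # make sampling as deterministic as the backend allows
        max_tokens=128,
        seed=0,          # honoured by some OpenAI-compatible backends, ignored by others
    )
    print(f"--- {label} ---")
    print(resp.choices[0].message.content)
```

It isn't a benchmark, but visibly different answers to the same greedy prompt are a strong hint that the weights, quantization, or inference stack differ between the two deployments.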
This observation aligns with a comment by Twitter user @_xjdr, who noted that their personal bf16 version of 405B Instruct has “pretty different (better) responses than almost all inference providers, especially for long prompts.” They suggested that “L3.1 is going to be very finicky with quantization, sampling and implementation differences.”
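The same effect is easy to reproduce locally if you can fit the weights: load the same checkpoint once in bf16 and once quantized, greedy-decode the same prompt, and diff the text. The sketch below uses the 8B Instruct model as a stand-in (405B needs a multi-node setup) and assumes the usual Hugging Face repo naming; it is an illustration of the comparison, not anyone's evaluation harness.

```python
# Sketch: greedy-decode the same prompt from a bf16 load and an 8-bit
# quantized load of a Llama 3.1 checkpoint and diff the outputs.
# The hub id below is an assumption; 405B works the same way in principle.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # assumed hub id, stand-in for 405B
tokenizer = AutoTokenizer.from_pretrained(model_id)

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Summarise the plot of Hamlet in two sentences."}],
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(prompt, return_tensors="pt")

def greedy(model):
    # Greedy decoding removes sampling noise, so any remaining difference
    # comes from the numerics of the weights/kernels, not the sampler.
    out = model.generate(**inputs.to(model.device), max_new_tokens=128, do_sample=False)
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

bf16_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
int8_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)

bf16_text, int8_text = greedy(bf16_model), greedy(int8_model)
print("Outputs match!" if bf16_text == int8_text else
      f"bf16:\n{bf16_text}\n\nint8:\n{int8_text}")
```

On short prompts the two runs often agree; the divergence @_xjdr describes tends to show up on long prompts, where small numerical errors have more tokens over which to compound.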
In response to a question about which inference provider offers managed service for 405B at bf16, I mentioned that “AWS has 405B available in us-west-2 now. No mention of quantisation and it looks much better there at first glance!”
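For anyone who wants to try the AWS route, here is a minimal sketch of calling the model through Amazon Bedrock's Converse API in us-west-2 with boto3. The model id string is my assumption about Bedrock's identifier; check the Bedrock console for the exact value and make sure model access has been granted in the region first.

```python
# Sketch: call Llama 3.1 405B Instruct via Amazon Bedrock in us-west-2.
# The modelId below is an assumed identifier; verify it in the Bedrock console.
import boto3

client = boto3.client("bedrock-runtime", region_name="us-west-2")

response = client.converse(
    modelId="meta.llama3-1-405b-instruct-v1:0",  # assumed Bedrock model id
    messages=[{"role": "user", "content": [{"text": "Give me a vibe check: who are you?"}]}],
    inferenceConfig={"maxTokens": 256, "temperature": 0.0},
)

print(response["output"]["message"]["content"][0]["text"])
```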
These observations highlight how much hosting and quantization choices for Llama 3.1 405B still vary, with potentially significant differences in output quality depending on the specific implementation.