Llama 4 has been released - partially. It’s a suite of three LLMs, with the biggest model (“Behemoth”) still in training. Notes:
- ~~apparently no restrictions on use within the EU~~ [see Update below]
- ~~trained in fp8 precision~~ [see Update below]
- already live on OpenRouter
- via Together: in my first tests, the smallest model, “Scout”, came out more expensive than Gemini 1.5 Pro
- also available from Groq and Fireworks, much cheaper than Together
- Hugging Face model collection
- Maxime Labonne points out license problems
[Update] The Acceptable Use Policy does place restrictions on EU citizens:
With respect to any multimodal models included in Llama 4, the rights granted under Section 1(a) of the Llama 4 Community License Agreement are not being granted to you if you are an individual domiciled in, or a company with a principal place of business in, the European Union. This restriction does not apply to end users of a product or service that incorporates any such multimodal models. (via)
[Update 2025-04-06] While it was trained in fp8, weights are released in bf16:
The Llama 4 Scout model is released as BF16 weights, but can fit within a single H100 GPU with on-the-fly int4 quantization; the Llama 4 Maverick model is released as both BF16 and FP8 quantized weights. The FP8 quantized weights fit on a single H100 DGX host while still maintaining quality. (Model card)
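A quick back-of-envelope check makes the model card's claim concrete. Assuming Scout's reported total parameter count of roughly 109B (an assumption on my part, not stated in the quote above), the weights alone need about 203 GiB at bf16 but only about 51 GiB at int4, which is why on-the-fly int4 quantization brings it under a single 80 GB H100:

```python
# Rough memory math for the model card's single-H100 claim.
# Assumptions: Scout has ~109B total parameters; an H100 has 80 GB.
# This ignores activation and KV-cache memory, so real headroom is smaller.

GiB = 1024**3

def weight_gib(n_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return n_params * bytes_per_param / GiB

scout_params = 109e9      # assumed total parameter count for Scout
h100_gib = 80             # single H100

bf16 = weight_gib(scout_params, 2.0)   # bf16 = 2 bytes/param
int4 = weight_gib(scout_params, 0.5)   # int4 = 0.5 bytes/param

print(f"bf16: {bf16:.0f} GiB, int4: {int4:.0f} GiB")
# bf16 (~203 GiB) needs multiple GPUs; int4 (~51 GiB) fits on one H100
```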
On OpenRouter, the Groq, Fireworks and Together providers are not listed as serving bf16, but DeepInfra is.
[Update 2025-04-06 #2] I tested both Scout and Maverick (fp8) on the process visualization practice examples from my recent article: they mostly fail, and they are never cheaper than Gemini 1.5 Pro.