Google has launched Gemini (“our most capable AI model”).
Upfront summary: “Overall, you should probably keep using GPT-4/Bing for any complex tasks - it is still the best AI available” (Ethan Mollick)
Literature:
Available through Bard immediately (although limited, and not in the EU) and through AI Studio and Vertex AI starting December 13.
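For the AI Studio route, the announced Python SDK (google-generativeai) should look roughly like this; treat it as a sketch based on the launch announcement, not verified against the final API surface:

```python
# Sketch: querying Gemini Pro via the google-generativeai SDK (AI Studio).
# pip install google-generativeai
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")  # key obtained from AI Studio

model = genai.GenerativeModel("gemini-pro")
response = model.generate_content("Explain attribution in one sentence.")
print(response.text)
```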
Notes:
- comes in four sizes: Ultra (not yet released), Pro (used in Bard), and Nano-1 & -2
- tuned for factuality: attribution and the goal of not hallucinating. “Rather, it should acknowledge that it cannot provide a response by hedging” (see the sketch after this list).
- Benchmarks suggest that Ultra will play in the same league as GPT-4/GPT-4V. Pro roughly seems to be GPT-3.5/Claude-2 grade, thus outperforming PaLM 2.
- (but they basically concur with Microsoft Research: “We believe there is a need for more robust and nuanced standardized evaluation benchmarks with no leaked data.”)
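To make the “hedging” behaviour concrete, here is a hypothetical (and deliberately crude) heuristic for spotting hedged responses in an evaluation script; the marker phrases and the function are my own illustration, not from the report:

```python
# Hypothetical hedging check; marker list and logic are illustrative only.
HEDGE_MARKERS = (
    "i cannot",
    "i can't",
    "i'm not sure",
    "i do not have enough information",
)

def is_hedged(answer: str) -> bool:
    """Crude heuristic: does the answer hedge instead of asserting a fact?"""
    lowered = answer.lower()
    return any(marker in lowered for marker in HEDGE_MARKERS)

print(is_hedged("I cannot verify that claim."))           # True
print(is_hedged("The capital of Australia is Sydney."))   # False - a hallucination slips through
```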
Bard:
- “For now, Bard with our specifically tuned version of Gemini Pro works for text-based prompts, with support for other content types coming soon.”
- will “come to more languages and places, like Europe, in the near future”
- “Gemini Ultra will come to Bard early next year in a new experience called Bard Advanced”
On the Technical Report:
- use-case examples in the Appendix, as well as in “5.2.5. Modality Combination”. Nice work! 👏🏻
- it says that the 32K context window is used “efficiently”, with “98% accuracy when queried across the full context length”. However, I find the accompanying chart (image #1) a little lacking in meaning: it plots the model’s confidence (the NLL, from which perplexity is commonly derived), not its accuracy. The chart therefore neither proves nor illustrates the claim that precedes it 🤔 and just seems to say “it works” (chart #2, from the RoPE paper, is an example of when things don’t work out that nicely). Also, no statement is made about content shorter than 32K, which makes a huge difference with both GPT-4 and Claude-2. (See the NLL-vs-perplexity sketch after this list.)
- Video content is actually ingested as several (key) frames, potentially at varying resolutions (a hypothetical frame-sampling sketch follows this list). Similarly, the mention of “USM features” for audio is noteworthy.
- they note that “We have also observed that data quality is more important than quantity”, but confirm that “web documents, books, and code” were used.
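On the NLL-vs-accuracy point from the context-window note above: perplexity is simply the exponential of the mean per-token NLL, i.e. a confidence measure over the reference tokens, which is why a flat NLL curve does not demonstrate retrieval accuracy. A minimal illustration:

```python
import math

def perplexity(token_nlls: list[float]) -> float:
    """Perplexity = exp(mean negative log-likelihood per token, natural log)."""
    return math.exp(sum(token_nlls) / len(token_nlls))

# Low NLL -> low perplexity -> the model is confident about the reference text;
# it says nothing about whether a query across the 32K context was answered correctly.
print(perplexity([0.2, 0.5, 0.1]))  # ≈ 1.31
```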
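And to make the video-ingestion note concrete: a hypothetical OpenCV sketch of uniform key-frame sampling (function name and parameters are mine, not Gemini’s actual pipeline) could look like this:

```python
# Hypothetical key-frame sampling with OpenCV; not Gemini's actual pipeline.
# pip install opencv-python
import cv2

def sample_key_frames(path: str, n_frames: int = 8):
    """Uniformly sample n_frames frames from a video file."""
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    frames = []
    for i in range(n_frames):
        cap.set(cv2.CAP_PROP_POS_FRAMES, i * total // n_frames)
        ok, frame = cap.read()
        if ok:
            frames.append(frame)  # each frame: H x W x 3 numpy array (BGR)
    cap.release()
    return frames
```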