Ethan Mollick, Associate Professor at The Wharton School, recently shared two key developments. The first is the CharXiv benchmark, a challenging real-life chart-reading test on which humans achieve 80% accuracy. Interestingly, Claude 3.5, currently the best-performing Large Language Model (LLM) on this test, manages 60% accuracy. The Chatbo... Read more 28 Jun 2024 - less than 1 minute read
Google recently released their open-source Gemma-2 models (27b and 9b variants), which have been gaining attention in the AI community. In a LinkedIn post, Peter Gostev, Head of AI at Moonpig, highlighted that the 27b variant is now ranking slightly higher than Meta’s 70b model, despite being 2.5 times smaller. However, digging into the technic... Read more 28 Jun 2024 - 1 minute read
A colleague pitched the idea of creating an “AI Wiki” with best practices. I am not fond of the idea because there is still far too much nuance to consider, which would likely render any Wiki article flat-out incorrect. A major reason for this is that perhaps most users are using AI systems and not really the models themselves - and there is ... Read more 26 Jun 2024 - less than 1 minute read
ClashEval: When LLM Safeguards Clash with RAG (2024-05-20; tags: llm, rag, misinformation, ai-safety, aleph alpha). Details the Clash evaluation framework for LLM comparison, describing methodology, scoring metrics, and case study results across multiple models. A recent paper ... Read more (Updated) - 1 minute read
A recent LinkedIn post about “DeutschlandGPT” caught my eye, promising an AI solution “Made in Germany”. Upon closer inspection, some concerning details emerged. The company behind DeutschlandGPT, as listed in their Impressum, is DeutschlandGPT GmbH based in Germering. This suggests they’re likely just an ordinary T-Systems customer rather than... Read more 24 Jun 2024 - 1 minute read