The OpenAI challenge for the community to surface previously unreported vulnerabilities and harmful behaviors in their new open-weights model gpt-oss-20b has concluded. I was awarded an Honorable Mention, and the jury of industry experts lauded my submission as “particularly interesting work in evaluation-aware sandbagging” - an emerging focus area ... Read more 25 Sep 2025 - 2 minute read
Reka AI has released a hand-curated dataset, benchmark and leaderboard to grade web search and answer generation of LLM systems. Their blog post describes “Research Eval” as Diverse (374 questions with grading guidelines, across a wide range of topics), Discriminative (current frontier models achieve between 26.7% and 59.1% accuracy), and High-q... Read more 31 Aug 2025 - 1 minute read
News: OpenAI Codex CLI, the standalone GPT-5 ⇆ computer interface that’s being positioned as a coding assistant, got a major overhaul. It was rewritten from the ground up, and many useful features were added. IDE integration: there is now a Visual Studio Code extension. How to activate: click on the OpenAI bloom logo on the uppe... Read more 28 Aug 2025 - 2 minute read
Introduction: OpenAI has released the much-anticipated GPT-5 model. Many technical background details are in the System card. It comes in five flavors: nano, mini, chat, what I call gpt-5 “proper”, and GPT-5 Pro (also known as “gpt-5-thinking-pro”). Some of these were beta-tested through the OpenRouter platform under the guise of “Horizon Alpha”... Read more 15 Aug 2025 (Updated) - 8 minute read
Intro: OpenAI has released the open-weights model that was announced at the end of March. It is a reasoning model that comes in two sizes, 20B and 120B, in a Mixture-of-Experts configuration. Model Card. Commenters call it “similar to o3” in performance, but that seems untrue - it’s about o3-mini grade, but lags behind o4-mini. Providers for quick... Read more 12 Aug 2025 (Updated) - 3 minute read