Nils Durner's Blog Ahas, Breadcrumbs, Coding Epiphanies

Kaggle OpenAI Red-Teaming Challenge

OpenAI's challenge to the community to surface previously unreported vulnerabilities and harmful behaviors in its new open-weights model gpt-oss-20b has concluded. I was awarded an Honorable Mention, and the jury of industry experts lauded my submission as “particularly interesting work in evaluation-aware sandbagging” - an emerging focus area ... Read more

Reka Web Search Benchmark extended

Reka AI has released a hand-curated dataset, benchmark and leaderboard to grade web search and answer generation of LLM systems. Their blog post describes “Research Eval” as Diverse (374 questions with grading guidelines, across a wide range of topics), Discriminative (current frontier models achieve between 26.7% and 59.1% accuracy), and High-q... Read more

OpenAI Codex CLI agent: Major Update

News: OpenAI Codex CLI, the standalone GPT-5 ⇆ computer interface that’s being positioned as a coding assistant, got a major overhaul. It was rewritten from the ground up, and many useful features were added. IDE integration: there is now a Visual Studio Code extension. How to activate: click on the OpenAI bloom logo on the uppe... Read more

[UPDATED] OpenAI GPT-5 release notes

Introduction: OpenAI has released the much-anticipated GPT-5 model. Many technical background details are in the System card. It comes in five flavors: nano, mini, chat, what I call gpt-5 “proper”, and GPT-5 Pro (also known as “gpt-5-thinking-pro”). Some of these were beta-tested through the OpenRouter platform under the guise of “Horizon Alpha”... Read more

[UPDATED] OpenAI GPT-OSS open weights model released

Intro: OpenAI has released the open-weights model that was announced at the end of March. It is a reasoning model that comes in two sizes, 20B and 120B, in a Mixture-of-Experts configuration. Model Card. Commenters call it “similar to o3” in performance, but that seems untrue - it’s about o3-mini grade, but lags behind o4-mini. Providers for quick... Read more