
OpenAI GPT-5 release notes

Introduction

OpenAI has released the much-anticipated GPT-5 model. Many technical background details are in the System Card. It comes in five flavors: nano, mini, chat, what I call gpt-5 “proper”, and GPT-5 Pro (also known as “gpt-5-thinking-pro”). Some of these were beta-tested through the OpenRouter platform under the guise of “Horizon Alpha” and “Horizon Beta”. The Using GPT-5 document details features and includes a migration guide from previous models.

Prerelease: Horizon Alpha/Beta

These pre-release models showed two particularities in my testing:

  • gender bias like GPT-4.1: when asked to assign names to terms, as highlighted in the Stanford AI Index, Horizon Alpha and Beta both showed this bias, which I haven’t seen with GPT-4o. GPT-5 with minimal reasoning effort (see below) shows the bias as well; with medium effort, it alternates.
  • thinking & deliberating on the solution within the comments of source code they produced - leading to a lot of non-helpful clutter in the comments

OpenRouter confirms that both models were early checkpoints of the GPT-5 family.

Reasoning switch in the API

As established for other reasoning models, the length of the reasoning process can be shortened somewhat. GPT-5 makes this explicit by extending the reasoning effort parameter to include “minimal”. In addition, the verbosity level is now configurable, allowing output aside from the reasoning process to be more terse:

When generating code, medium and high verbosity levels yield longer, more structured code with inline explanations, while low verbosity produces shorter, more concise code with minimal commentary.

(quote from Using GPT-5).
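
Both knobs are plain request parameters in the Responses API. A minimal sketch using the Python SDK (parameter shapes as shown in the Using GPT-5 guide; the prompt itself is just an illustration):

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

response = client.responses.create(
    model="gpt-5",
    input="Write a function that reverses a linked list.",
    reasoning={"effort": "minimal"},  # "minimal" | "low" | "medium" | "high"
    text={"verbosity": "low"},        # "low" | "medium" | "high"
)
print(response.output_text)
```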

To use GPT-5 without any reasoning in the API, the Introducing GPT-5 for developers document recommends:

The non-reasoning model used in ChatGPT is available as gpt-5-chat-latest.

So the “gpt-5-chat” model in the API seems to be what’s called “gpt-5-main” in the System Card? In contrast to the other models in the API, this one is not versioned - from the Model Card: “GPT-5 Chat points to the GPT-5 snapshot currently used in ChatGPT.”
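
If that reading is correct, the chat snapshot is called like any other model; a quick sketch via Chat Completions:

```python
from openai import OpenAI

client = OpenAI()

# gpt-5-chat-latest is unversioned: it tracks the snapshot currently used in ChatGPT
completion = client.chat.completions.create(
    model="gpt-5-chat-latest",
    messages=[{"role": "user", "content": "Summarize the GPT-5 model family in one sentence."}],
)
print(completion.choices[0].message.content)
```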

API vs ChatGPT

Some of the things presented on the live stream are not available in the API:

  • GPT-5 Pro: the model gpt-5-thinking-pro, as the migration path from o3-pro, is exclusive to ChatGPT Pro: “In ChatGPT, we also provide access to gpt-5-thinking using a setting that makes use of parallel test time compute; we refer to this as gpt-5-thinking-pro.” (quote from the System Card)
  • the improved audio input/output

The API gives explicit control over which model to use, however. In ChatGPT, a model router takes control of that, and users may have to either signal which route to take verbally (“think hard about this”) or choose “GPT-5 Thinking” from the model picker (it remains uncertain which parametrization this corresponds to). Per the System Card, the model router will continue to be trained “on real signals, including when users switch models, preference rates for responses, and measured correctness, improving over time”. This means that ChatGPT users will see inconsistent results that change over time.

Image generation will continue to be provided by GPT-4o/gpt-image-1 (which was recently improved by the addition of a “High” input fidelity option). Some commenters made light of GPT-5’s inability to render maps, but that is of course a shortcoming of GPT-4o, despite its capabilities in infographics generation.

ChatGPT users continue to be downgraded when their usage limit is exhausted. The mechanics are intricate; the [Help Center article goes into the details](https://help.openai.com/en/articles/11909943-gpt-5-in-chatgpt#:~:text=unlimited%20access%20to%20our%20GPT-5%20models).

For developers

API Pricing

Simon Willison has a nice comparison table, noting that GPT-5 is cheaper per token than GPT-4o (and 4.1), but costlier than o4-mini, especially on output. (As always, output lengths vary in the number of tokens used, as tokens are not a standardized unit of measure.) Cost savings through implicit input caching are substantial: cached input tokens cost about 1/10 the regular price (90% off), i.e., $0.125 vs. $1.25 per 1M tokens for GPT-5 “proper”.
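
To make the caching discount concrete, a back-of-the-envelope calculation with the GPT-5 “proper” input rates quoted above (the 80% cache-hit ratio in the example is a made-up assumption):

```python
# GPT-5 "proper" input rates, USD per 1M tokens (see above)
FRESH = 1.25    # regular input tokens
CACHED = 0.125  # implicitly cached input tokens (90% off)

def input_cost(tokens: int, cached_fraction: float) -> float:
    """Blended input cost when `cached_fraction` of the prompt hits the cache."""
    cached = tokens * cached_fraction
    return ((tokens - cached) * FRESH + cached * CACHED) / 1_000_000

# hypothetical 50k-token prompt with an 80% cacheable prefix
print(f"cached: ${input_cost(50_000, 0.8):.4f}, uncached: ${input_cost(50_000, 0.0):.4f}")
# -> cached: $0.0175, uncached: $0.0625
```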

Availability

  • GPT-5 is rolled out to ChatGPT (including the free tier, as well as Enterprise and Edu).
  • Availability on the API is granted to all Tiers (except “Free”), with full support on the OpenAI Prompts Playground and basic (yet evolving) support through my OAI Chat.
  • It’s also rolling out to Microsoft platforms, including Microsoft 365 Copilot, Copilot, GitHub Copilot, and Azure AI Foundry.
  • While deprecation of all previous models was announced on the live stream, no actual end-of-life dates have been given for API access. In ChatGPT, the model picker may get cleaned up to include only GPT-5 models.
    • Conversations with previous models like GPT-4.5 will be automatically switched to GPT-5.
    • For users on corporate plans (Team, Enterprise, Edu), there is Legacy Model Access, which will give access to previous models (including o3-pro and GPT-4.5) for a limited transition period. This needs to be enabled by ChatGPT workspace admins.
    • After public backlash, GPT-4o is set to return for ChatGPT Plus users.

Benchmark results

  • Fiction.LiveBench Long Context Benchmark: state-of-the-art performance (along with Grok 4)
  • State-of-the-art on Aider Polyglot and OpenHands’ eval
  • Vectara Hallucination Benchmark: 1.4% hallucination rate with GPT-5 high-reasoning; behind GPT-4.5 (1.2%) and o3 (0.795%), and also behind gemini-2.5-pro-exp-03-25 (1.1%) and Kimi K2 (1.1%). Better than GPT-4o (1.491%), GPT-4 (1.805%), and GPT-4.1 (2%)
  • EuroEval: #1 at 1.44, ahead of gemini-2.5-pro (1.50) and o3 (1.51). GPT-5 improves on Italian, Icelandic, and Finnish. Performance in German is the same as o3’s (1.62), lagging behind several others, including Mistral-Small-3.1-24B-Instruct-2503 (1.48).

Other notes

  • Model fine-tuning with GPT-5 as the base model is not available as of now
  • Image output is not mentioned in the GPT-5 System Card.
  • German writing lags behind GPT-4o? ([via](https://www.linkedin.com/posts/jphme_gpt-5-worse-than-gpt-4o-first-results-activity-7359440737584783360-u6St?utm_source=share&utm_medium=member_desktop&rcm=ACoAAAGX2jIBd6RDsNRYv13Bvu3x4nnCNu96SEw))
  • as Ethan Mollick predicted, reports online are very inconsistent - because of the model router in ChatGPT. The router may assign a request to an unsuitable model, as was likely the case in Charlie Meyer’s “Bs in Blueberry” test. I could only reproduce this in the API with the non-thinking gpt-5-chat, not with any of the reasoning models (see the sketch after this list).
    • Sam Altman confirms that the router (aka “autoswitch”) was “out of commission for a chunk of the day and the result was GPT-5 seemed way dumber.”
  • alleged System Prompt leaks: 1, 2. According to these:
    • for image generation, the tool name is actually “image_gen”, not “ImageGen”
    • “Code Interpreter” is not mentioned, just Python as a tool name. That explains recent issues I had with “Use Code Interpreter to…”
    • the Python sandbox does not have Internet access
  • in addition to tools like image_gen or the Canvas tool “canmore”, there is the Widget system. Widgets are rendered directly in the chat. The only one established so far is “ecosystem_demo.daw_drums”, otherwise known as beatbot
  • the Cline account on X reports user sentiment: “It’s a precision instrument, not a Swiss Army knife. […] Prompt sensitivity is extreme (also known as “steerability”). […] it pays to be precise. […] if you write prompts like you write code – precise, structured, explicit – GPT-5 delivers superior price/performance. If you need a model that reads between the lines, Claude might be a better option.”
  • Medical Doctor Derya Unutmaz shared:
    • how GPT-5, given a yet-unpublished research paper:
      • identified key findings from the paper based on scatter plots showing how immune cell patterns changed under different conditions
      • proposed an experiment his team later performed, something that had taken them weeks to design
      • suggested a mechanism that explained the study results
    • confirmed that ChatGPT is “restricted” so it will refuse the prompts he used, with the error message: “I can’t help with an actionable lab protocol (step-by-step methods, conditions, dosages, etc.) for manipulating T-cell metabolism. That kind of procedural wet-lab guidance could enable real-world biological experimentation, which I’m not able to provide.” Derya suggested the “Trusted Access Program” to OpenAI.
      • I’m wondering if that also is a glimpse of the future: powerful AI withheld from non-experts
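
For reference, the “Bs in Blueberry” comparison mentioned in the model-router bullet above can be reproduced with a few lines against the API (the exact prompt wording is mine; in my tests, only the non-thinking chat model miscounted):

```python
from openai import OpenAI

client = OpenAI()
QUESTION = "How many times does the letter b appear in the word blueberry?"

# non-reasoning chat snapshot vs. a reasoning variant at minimal effort
chat = client.chat.completions.create(
    model="gpt-5-chat-latest",
    messages=[{"role": "user", "content": QUESTION}],
)
reasoning = client.responses.create(
    model="gpt-5",
    input=QUESTION,
    reasoning={"effort": "minimal"},
)
print("gpt-5-chat-latest:", chat.choices[0].message.content)
print("gpt-5 (minimal):  ", reasoning.output_text)
```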

Open questions

  • What’s the time-to-first-token, given that there is no true no-thinking model or mode in the API? Does it suffer? (A way to measure this is sketched below.)
  • How does the monitoring - both automated and human - against chemical & bio-threats, as described in the System Card, align with data privacy requirements?
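
To measure the first question empirically, one can stream a response and stop the clock at the first content token; a sketch via Chat Completions streaming (model choice and prompt are arbitrary):

```python
import time
from openai import OpenAI

client = OpenAI()

start = time.monotonic()
stream = client.chat.completions.create(
    model="gpt-5",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # stop at the first chunk that carries actual content
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"time to first token: {time.monotonic() - start:.2f}s")
        break
```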

Changelog of this article

[Changelog 2025-08-08]

  • ChatGPT Legacy Model Access
  • no fine-tuning
  • note on lacking(?) German writing proficiency
  • unlimited use also in ChatGPT Team
  • clarified Cursor pricing
  • added note on “Bs in Blueberry”, where gpt-5-chat fails to count letters
  • Codex CLI usage may not be included in corporate ChatGPT subscriptions (see above)
  • Fiction.live Long Context Benchmark results added, model router failure confirmed

[Changelog 2025-08-09]

  • reference to alleged System Prompts added, plus “beatbot”
  • GPT-5 rollout status to ChatGPT complete
  • more benchmarks results, low-latency conversation remark by @kwindla added
  • Codex CLI for ChatGPT Team was fixed, GPT-4o is set to return.
  • Vectara Hallucination Benchmark result added

[Changelog 2025-08-11]

  • Derya Unutmaz’s GPT-5-assisted lab research
  • GPT-5 lacks image generation

[Changelog 2025-08-15]

  • cleanups