Simon Willison · Tech & AI
TIER 5 2025-08-07
<p>I've had preview access to the new GPT-5 model family for the past two weeks (see <a href="https://simonwillison.net/2025/Aug/7/previewing-gpt-5/">related video</a> and <a href="https://simonwillison.net/about/#disclosures">my disclosures</a>) and have been using GPT-5 as my daily-driver. It's my new favorite model. It's still an LLM - it's not a dramatic departure from what we've had before - but it rarely screws up and generally feels competent or occasionally impressive at the kinds of things I like to use models for.</p>
<p>I've collected a lot of notes over the past two weeks, so I've decided to break them up into <a href="https://simonwillison.net/series/gpt-5/">a series of posts</a>. This first one will cover key characteristics of the models, how they are priced and what we can learn from the <a href="https://openai.com/index/gpt-5-system-card/">GPT-5 system card</a>.</p>
<ul>
<li><a href="https://simonwillison.net/2025/Aug/7/gpt-5/#key-model-characteristics">Key model characteristics</a></li>
<li><a href="https://simonwillison.net/2025/Aug/7/gpt-5/#position-in-the-openai-model-family">Position in the OpenAI model family</a></li>
<li><a href="https://simonwillison.net/2025/Aug/7/gpt-5/#pricing-is-aggressively-competitive">Pricing is aggressively competitive</a></li>
<li><a href="https://simonwillison.net/2025/Aug/7/gpt-5/#more-notes-from-the-system-card">More notes from the system card</a></li>
<li><a href="https://simonwillison.net/2025/Aug/7/gpt-5/#prompt-injection-in-the-system-card">Prompt injection in the system card</a></li>
<li><a href="https://simonwillison.net/2025/Aug/7/gpt-5/#thinking-traces-in-the-api">Thinking traces in the API</a></li>
<li><a href="https://simonwillison.net/2025/Aug/7/gpt-5/#and-some-svgs-of-pelicans">And some SVGs of pelicans</a></li>
</ul>
<h4 id="key-model-characteristics">Key model characteristics</h4>
<p>Let's start with the fundamentals. GPT-5 in ChatGPT is a weird hybrid that switches between different models. Here's what the system card says about that (my highlights in bold):</p>
<blockquote>
<p>GPT-5 is a unified system with a smart and fast model that answers most questions, a deeper reasoning model for harder problems, and <strong>a real-time router that quickly decides which model to use based on conversation type, complexity, tool needs, and explicit intent</strong> (for example, if you say “think hard about this” in the prompt). [...] Once usage limits are reached, a mini version of each model handles remaining queries. In the near future, we plan to integrate these capabilities into a single model.</p>
</blockquote>
<p>GPT-5 in the API is simpler: it's available as three models - <strong>regular</strong>, <strong>mini</strong> and <strong>nano</strong> - which can each be run at one of four reasoning levels: minimal (a new level not previously available for other OpenAI reasoning models), low, medium or high.</p>
<p>The models have an input limit of 272,000 tokens and an output limit (which includes invisible reasoning tokens) of 128,000 tokens. They support text and image for input, text only for output.</p>
<p>I've mainly explored full GPT-5. My verdict: it's just <strong>good at stuff</strong>. It doesn't feel like a dramatic leap ahead from other LLMs but it exudes competence - it rarely messes up, and frequently impresses me. I've found it to be a very sensible default for everything that I want to do. At no point have I found myself wanting to re-run a prompt against a different model to try and get a better result.</p>
<p>Here are the OpenAI model pages for <a href="https://platform.openai.com/docs/models/gpt-5">GPT-5</a>, <a href="https://platform.openai.com/docs/models/gpt-5-mini">GPT-5 mini</a> and <a href="https://platform.openai.com/docs/models/gpt-5-nano">GPT-5 nano</a>. Knowledge cut-off is September 30th 2024 for GPT-5 and May 30th 2024 for GPT-5 mini and nano.</p>
<h4 id="position-in-the-openai-model-family">Position in the OpenAI model family</h4>
<p>The three new GPT-5 models are clearly intended as a replacement for most of the rest of the OpenAI line-up. This table from the system card is useful, as it shows how they see the new models fitting in:</p>
<table>
<thead>
<tr>
<th>Previous model</th>
<th>GPT-5 model</th>
</tr>
</thead>
<tbody>
<tr>
<td>GPT-4o</td>
<td>gpt-5-main</td>
</tr>
<tr>
<td>GPT-4o-mini</td>
<td>gpt-5-main-mini</td>
</tr>
<tr>
<td>OpenAI o3</td>
<td>gpt-5-thinking</td>
</tr>
<tr>
<td>OpenAI o4-mini</td>
<td>gpt-5-thinking-mini</td>
</tr>
<tr>
<td>GPT-4.1-nano</td>
<td>gpt-5-thinking-nano</td>
</tr>
<tr>
<td>OpenAI o3 Pro</td>
<td>gpt-5-thinking-pro</td>
</tr>
</tbody>
</table>
<p>That "thinking-pro" model is currently only available via ChatGPT where it is labelled as "GPT-5 Pro" and limited to the $200/month tier. It uses "parallel test time compute".</p>
<p>The only capabilities not covered by GPT-5 are audio input/output and image generation. Those remain covered by models like <a href="https://platform.openai.com/docs/models/gpt-4o-audio-preview">GPT-4o Audio</a> and <a href="https://platform.openai.com/docs/models/gpt-4o-realtime-preview">GPT-4o Realtime</a> and their mini variants and the <a href="https://platform.openai.com/docs/models/gpt-image-1">GPT Image 1</a> and DALL-E image generation models.</p>
<h4 id="pricing-is-aggressively-competitive">Pricing is aggressively competitive</h4>
<p>The pricing is <em>aggressively competitive</em> with other providers.</p>
<ul>
<li>GPT-5: $1.25/million for input, $10/million for output</li>
<li>GPT-5 Mini: $0.25/m input, $2.00/m output</li>
<li>GPT-5 Nano: $0.05/m input, $0.40/m output</li>
</ul>
<p>GPT-5 is priced at half the input cost of GPT-4o, and maintains the same price for output. Those invisible reasoning tokens count as output tokens so you can expect most prompts to use more output tokens than their GPT-4o equivalent (unless you set reasoning effort to "minimal").</p>
<p>The discount for token caching is significant too: 90% off on input tokens that have been used within the previous few minutes. This is particularly material if you are implementing a chat UI where the same conversation gets replayed every time the user adds another prompt to the sequence.</p>
<p>Here's a comparison table I put together showing the new models alongside the most comparable models from OpenAI's competition:</p>
<table>
<thead>
<tr>
<th>Model</th>
<th>Input $/m</th>
<th>Output $/m</th>
</tr>
</thead>
<tbody>
<tr>
<td>Claude Opus 4.1</td>
<td>15.00</td>
<td>75.00</td>
</tr>
<tr>
<td>Claude Sonnet 4</td>
<td>3.00</td>
<td>15.00</td>
</tr>
<tr>
<td>Grok 4</td>
<td>3.00</td>
<td>15.00</td>
</tr>
<tr>
<td>Gemini 2.5 Pro (>200,000)</td>
<td>2.50</td>
<td>15.00</td>
</tr>
<tr>
<td>GPT-4o</td>
<td>2.50</td>
<td>10.00</td>
</tr>
<tr>
<td>GPT-4.1</td>
<td>2.00</td>
<td>8.00</td>
</tr>
<tr>
<td>o3</td>
<td>2.00</td>
<td>8.00</td>
</tr>
<tr>
<td>Gemini 2.5 Pro (<200,000)</td>
<td>1.25</td>
<td>10.00</td>
</tr>
<tr>
<td><strong>GPT-5</strong></td>
<td>1.25</td>
<td>10.00</td>
</tr>
<tr>
<td>o4-mini</td>
<td>1.10</td>
<td>4.40</td>
</tr>
<tr>
<td>Claude 3.5 Haiku</td>
<td>0.80</td>
<td>4.00</td>
</tr>
<tr>
<td>GPT-4.1 mini</td>
<td>0.40</td>
<td>1.60</td>
</tr>
<tr>
<td>Gemini 2.5 Flash</td>
<td>0.30</td>
<td>2.50</td>
</tr>
<tr>
<td>Grok 3 Mini</td>
<td>0.30</td>
<td>0.50</td>
</tr>
<tr>
<td><strong>GPT-5 Mini</strong></td>
<td>0.25</td>
<td>2.00</td>
</tr>
<tr>
<td>GPT-4o mini</td>
<td>0.15</td>
<td>0.60</td>
</tr>
<tr>
<td>Gemini 2.5 Flash-Lite</td>
<td>0.10</td>
<td>0.40</td>
</tr>
<tr>
<td>GPT-4.1 Nano</td>
<td>0.10</td>
<td>0.40</td>
</tr>
<tr>
<td>Amazon Nova Lite</td>
<td>0.06</td>
<td>0.24</td>
</tr>
<tr>
<td><strong>GPT-5 Nano</strong></td>
<td>0.05</td>
<td>0.40</td>
</tr>
<tr>
<td>Amazon Nova Micro</td>
<td>0.035</td>
<td>0.14</td>
</tr>
</tbody>
</table>
<p>(Here's a good example of a GPT-5 failure: I tried to get it to <a href="https://chatgpt.com/share/6894d804-bca4-8006-ac46-580bf4a9bf5f">output that table sorted itself</a> but it put Nova Micro as more expensive than GPT-5 Nano, so I prompted it to "construct the table in Python and sort it there" and that fixed the issue.)</p>
<h4 id="more-notes-from-the-system-card">More notes from the system card</h4>
<p>As usual, <a href="https://openai.com/index/gpt-5-system-card/">the system card</a> is vague on what went into the training data. Here's what it says:</p>
<blockquote>
<p>Like OpenAI’s other models, the GPT-5 models were trained on diverse datasets, including information that is publicly available on the internet, information that we partner with third parties to access, and information that our users or human trainers and researchers provide or generate. [...] We use advanced data filtering processes to reduce personal information from training data.</p>
</blockquote>
<p>I found this section interesting, as it reveals that writing, code and health are three of the most common use-cases for ChatGPT. This explains why so much effort went into health-related questions, for both GPT-5 and the recently released OpenAI open weight models.</p>
<blockquote>
<p>We’ve made significant advances in <strong>reducing hallucinations, improving instruction following, and minimizing sycophancy</strong>, and have leveled up GPT-5’s performance in <strong>three of ChatGPT’s most common uses: writing, coding, and health</strong>. All of the GPT-5 models additionally feature <strong>safe-completions, our latest approach to safety training</strong> to prevent disallowed content.</p>
</blockquote>
<p>Safe-completions is later described like this:</p>
<blockquote>
<p>Large language models such as those powering ChatGPT have <strong>traditionally been trained to
either be as helpful as possible or outright refuse a user request</strong>, depending on whether the
prompt is allowed by safety policy. [...] Binary refusal boundaries are especially ill-suited for dual-use cases (such as biology
or cybersecurity), where a user request can be completed safely at a high level, but may lead
to malicious uplift if sufficiently detailed or actionable. <strong>As an alternative, we introduced safe-
completions: a safety-training approach that centers on the safety of the assistant’s output rather
than a binary classification of the user’s intent</strong>. Safe-completions seek to maximize helpfulness
subject to the safety policy’s constraints.</p>
</blockquote>
<p>So instead of straight up refusals, we should expect GPT-5 to still provide an answer but moderate that answer to avoid it including "harmful" content.</p>
<p>OpenAI have a paper about this which I haven't read yet (I didn't get early access): <a href="https://openai.com/index/gpt-5-safe-completions/">From Hard Refusals to Safe-Completions: Toward Output-Centric Safety Training</a>.</p>
<p>Sycophancy gets a mention, unsurprising given <a href="https://simonwillison.net/2025/May/2/what-we-missed-with-sycophancy/">their high profile disaster in April</a>. They've worked on this in the core model:</p>
<blockquote>
<p>System
prompts, while easy to modify, have a more limited impact on model outputs relative to changes in
post-training. For GPT-5, we post-trained our models to reduce sycophancy. Using conversations
representative of production data, we evaluated model responses, then assigned a score reflecting
the level of sycophancy, which was used as a reward signal in training.</p>
</blockquote>
<p>They claim impressive reductions in hallucinations. In my own usage I've not spotted a single hallucination yet, but that's been true for me for Claude 4 and o3 recently as well - hallucination is so much less of a problem with this year's models.</p>
<p><em><strong>Update</strong>: I have had some reasonable pushback against this point, so I should clarify what I mean here. When I use the term "hallucination" I am talking about instances where the model confidently states a real-world fact that is untrue - like the incorrect winner of a sporting event. I'm not talking about the models making other kinds of mistakes - they make mistakes all the time!</em></p>
<p><em>Someone <a href="https://news.ycombinator.com/item?id=44829896">pointed out</a> that it's likely I'm avoiding hallucinations through the way I use the models, and this is entirely correct: as an experienced LLM user I instinctively stay clear of prompts that are likely to trigger hallucinations, like asking a non-search-enabled model for URLs or paper citations. This means I'm much less likely to encounter hallucinations in my daily usage.</em></p>
<blockquote>
<p>One of our focuses when training the GPT-5 models was to reduce the frequency of factual
hallucinations. While ChatGPT has browsing enabled by default, many API queries do not use
browsing tools. Thus, we focused both on training our models to browse effectively for up-to-date
information, and on reducing hallucinations when the models are relying on their own internal
knowledge.</p>
</blockquote>
<p>The section about deception also incorporates the thing where models sometimes pretend they've completed a task that defeated them:</p>
<blockquote>
<p>We placed gpt-5-thinking in a variety of tasks that were partly or entirely infeasible to accomplish,
and <strong>rewarded the model for honestly admitting it can not complete the task</strong>. [...]</p>
<p>In tasks where the agent is required to use tools, such as a web browsing
tool, in order to answer a user’s query, previous models would hallucinate information when
the tool was unreliable. We simulate this scenario by purposefully disabling the tools or by
making them return error codes.</p>
</blockquote>
<h4 id="prompt-injection-in-the-system-card">Prompt injection in the system card</h4>
<p>There's a section about prompt injection, but it's pretty weak sauce in my opinion.</p>
<blockquote>
<p>Two external red-teaming groups conducted a two-week prompt-injection assessment targeting
system-level vulnerabilities across ChatGPT’s connectors and mitigations, rather than model-only
behavior.</p>
</blockquote>
<p>Here's their chart showing how well the model scores against the rest of the field. It's an impressive result in comparison - 56.8 attack success rate for gpt-5-thinking, where Claude 3.7 scores in the 60s (no Claude 4 results included here) and everything else is 70% plus:</p>
<p><img src="https://static.simonwillison.net/static/2025/prompt-injection-chart.jpg" alt="A bar chart titled "Behavior Attack Success Rate at k Queries" shows attack success rates (in %) for various AI models at k=1 (dark red) and k=10 (light red). For each model, the total height of the stacked bar represents the k=10 success rate (labeled above each bar), while the lower dark red section represents the k=1 success rate (estimated). From left to right: Llama 3.3 70B – k=10: 92.2%, k=1: ~47%; Llama 3.1 405B – k=10: 90.9%, k=1: ~38%; Gemini Flash 1.5 – k=10: 87.7%, k=1: ~34%; GPT-4o – k=10: 86.4%, k=1: ~28%; OpenAI o3-mini-high – k=10: 86.4%, k=1: ~41%; Gemini Pro 1.5 – k=10: 85.5%, k=1: ~34%; Gemini 2.5 Pro Preview – k=10: 85.0%, k=1: ~28%; Gemini 2.0 Flash – k=10: 85.0%, k=1: ~33%; OpenAI o3-mini – k=10: 84.5%, k=1: ~40%; Grok 2 – k=10: 82.7%, k=1: ~34%; GPT-4.5 – k=10: 80.5%, k=1: ~28%; 3.5 Haiku – k=10: 76.4%, k=1: ~17%; Command-R – k=10: 76.4%, k=1: ~28%; OpenAI o4-mini – k=10: 75.5%, k=1: ~17%; 3.5 Sonnet – k=10: 75.0%, k=1: ~13%; OpenAI o1 – k=10: 71.8%, k=1: ~18%; 3.7 Sonnet – k=10: 64.5%, k=1: ~17%; 3.7 Sonnet: Thinking – k=10: 63.6%, k=1: ~17%; OpenAI o3 – k=10: 62.7%, k=1: ~13%; gpt-5-thinking – k=10: 56.8%, k=1: ~6%. Legend shows dark red = k=1 and light red = k=10." style="max-width: 100%;" /></p>
<p>On the one hand, a 56.8% attack rate is cleanly a big improvement against all of those other models.</p>
<p>But it's also a strong signal that prompt injection continues to be an unsolved problem! That means that more than half of those k=10 attacks (where the attacker was able to try up to ten times) got through.</p>
<p>Don't assume prompt injection isn't going to be a problem for your application just because the models got better.</p>
<h4 id="thinking-traces-in-the-api">Thinking traces in the API</h4>
<p>I had initially thought that my biggest disappointment with GPT-5 was that there's no way to get at those thinking traces via the API... but that turned out <a href="https://bsky.app/profile/sophiebits.com/post/3lvtceih7222r">not to be true</a>. The following <code>curl</code> command demonstrates that the responses API <code>"reasoning": {"summary": "auto"}</code> is available for the new GPT-5 models:</p>
<pre><code>curl https://api.openai.com/v1/responses \
-H "Authorization: Bearer $(llm keys get openai)" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-5",
"input": "Give me a one-sentence fun fact about octopuses.",
"reasoning": {"summary": "auto"}
}'</code></pre>
<p>Here's <a href="https://gist.github.com/simonw/1d1013ba059af76461153722005a039d">the response</a> from that API call.</p>
<p>Without that option the API will often provide a lengthy delay while the model burns through thinking tokens until you start getting back visible tokens for the final response.</p>
<p>OpenAI offer a new <code>reasoning_effort=minimal</code> option which turns off most reasoning so that tokens start to stream back to you as quickly as possible.</p>
<h4 id="and-some-svgs-of-pelicans">And some SVGs of pelicans</h4>
<p>Naturally I've been running <a href="https://simonwillison.net/tags/pelican-riding-a-bicycle/">my "Generate an SVG of a pelican riding a bicycle" benchmark</a>. I'll actually spend more time on this in a future post - I have some fun variants I've been exploring - but for the moment here's <a href="https://gist.github.com/simonw/c98873ef29e621c0fe2e0d4023534406">the pelican</a> I got from GPT-5 running at its default "medium" reasoning effort:</p>
<p><img src="https://static.simonwillison.net/static/2025/gpt-5-pelican.png" alt="The bicycle is really good, spokes on wheels, correct shape frame, nice pedals. The pelican has a pelican beak and long legs stretching to the pedals." style="max-width: 100%;" /></p>
<p>It's pretty great! Definitely recognizable as a pelican, and one of the best bicycles I've seen yet.</p>
<p>Here's <a href="https://gist.github.com/simonw/9b5ecf61a5fb0794729aa0023aaa504d">GPT-5 mini</a>:</p>
<p><img src="https://static.simonwillison.net/static/2025/gpt-5-mini-pelican.png" alt="Blue background with clouds. Pelican has two necks for some reason. Has a good beak though. More gradents and shadows than the GPT-5 one." style="max-width: 100%;" /></p>
<p>And <a href="https://gist.github.com/simonw/3884dc8b186b630956a1fb0179e191bc">GPT-5 nano</a>:</p>
<p><img src="https://static.simonwillison.net/static/2025/gpt-5-nano-pelican.png" alt="Bicycle is two circles and some randomish black lines. Pelican still has an OK beak but is otherwise very simple." style="max-width: 100%;" /></p>