LLM 0.32a0 is a major backwards-compatible refactor

<p>I just released <a href="https://llm.datasette.io/en/latest/changelog.html#a0-2026-04-28">LLM 0.32a0</a>, an alpha release of my <a href="https://llm.datasette.io/">LLM</a> Python library and CLI tool for accessing LLMs, with some consequential changes that I've been working towards for quite a while.</p>

<p>Previous versions of LLM modeled the world in terms of prompts and responses. Send the model a text prompt, get back a text response.</p>

<pre><span class="pl-k">import</span> <span class="pl-s1">llm</span>



<span class="pl-s1">model</span> <span class="pl-c1">=</span> <span class="pl-s1">llm</span>.<span class="pl-c1">get_model</span>(<span class="pl-s">"gpt-5.5"</span>)

<span class="pl-s1">response</span> <span class="pl-c1">=</span> <span class="pl-s1">model</span>.<span class="pl-c1">prompt</span>(<span class="pl-s">"Capital of France?"</span>)

<span class="pl-en">print</span>(<span class="pl-s1">response</span>.<span class="pl-c1">text</span>())</pre>

<p>This made sense when I started working on the library back in April 2023. A lot has changed since then!</p>

<p>LLM provides an abstraction over thousands of different models via its <a href="https://llm.datasette.io/en/stable/plugins/index.html">plugin system</a>. The original abstraction - of text input that returns text output - was no longer able to represent everything I needed it to.</p>

<p>Over time LLM itself has grown <a href="https://simonwillison.net/2024/Oct/29/llm-multi-modal/">attachments</a> to handle image, audio, and video input, then <a href="https://simonwillison.net/2025/Feb/28/llm-schemas/">schemas</a> for outputting structured JSON, then <a href="https://simonwillison.net/2025/May/27/llm-tools/">tools</a> for executing tool calls. Meanwhile LLMs kept evolving, adding reasoning support and the ability to return images and all kinds of other interesting capabilities.</p>

<p>LLM needs to evolve to better handle the diversity of input and output types that can be processed by today's frontier models.</p>

<p>The 0.32a0 alpha has two key changes: model inputs can be represented as a sequence of messages, and model responses can be composed of a stream of differently typed parts.</p>

<h4 id="prompts-as-a-sequence-of-messages">Prompts as a sequence of messages</h4>

<p>LLMs accept input as text, but ever since ChatGPT demonstrated the value of a two-way conversational interface, the most common way to prompt them has been to treat that input as a sequence of conversational turns.</p>

<p>The first turn might look like this:</p>

<pre><code>user: Capital of France?

assistant: 

</code></pre>

<p>(The model then gets to fill out the reply from the assistant.)</p>

<p>But each subsequent turn needs to replay the entire conversation up to that point, as a sort of screenplay:</p>

<pre><code>user: Capital of France?

assistant: Paris

user: Germany?

assistant:

</code></pre>

<p>Most of the JSON APIs from the major vendors follow this pattern. Here's what the above looks like using the OpenAI chat completions API, which has been widely imitated by other providers:</p>

<div class="highlight highlight-source-shell"><pre>curl https://api.openai.com/v1/chat/completions \

  -H <span class="pl-s"><span class="pl-pds">"</span>Authorization: Bearer <span class="pl-smi">$OPENAI_API_KEY</span><span class="pl-pds">"</span></span> \

  -H <span class="pl-s"><span class="pl-pds">"</span>Content-Type: application/json<span class="pl-pds">"</span></span> \

  -d <span class="pl-s"><span class="pl-pds">'</span>{</span>

<span class="pl-s">    "model": "gpt-5.5",</span>

<span class="pl-s">    "messages": [</span>

<span class="pl-s">      {</span>

<span class="pl-s">        "role": "user",</span>

<span class="pl-s">        "content": "Capital of France?"</span>

<span class="pl-s">      },</span>

<span class="pl-s">      {</span>

<span class="pl-s">        "role": "assistant",</span>

<span class="pl-s">        "content": "Paris"</span>

<span class="pl-s">      },</span>

<span class="pl-s">      {</span>

<span class="pl-s">        "role": "user",</span>

<span class="pl-s">        "content": "Germany?"</span>

<span class="pl-s">      }</span>

<span class="pl-s">    ]</span>

<span class="pl-s">  }<span class="pl-pds">'</span></span></pre></div>

<p>Prior to 0.32, LLM modeled these as conversations:</p>

<pre><span class="pl-s1">model</span> <span class="pl-c1">=</span> <span class="pl-s1">llm</span>.<span class="pl-c1">get_model</span>(<span class="pl-s">"gpt-5.5"</span>)



<span class="pl-s1">conversation</span> <span class="pl-c1">=</span> <span class="pl-s1">model</span>.<span class="pl-c1">conversation</span>()

<span class="pl-s1">r1</span> <span class="pl-c1">=</span> <span class="pl-s1">conversation</span>.<span class="pl-c1">prompt</span>(<span class="pl-s">"Capital of France?"</span>)

<span class="pl-en">print</span>(<span class="pl-s1">r1</span>.<span class="pl-c1">text</span>())

<span class="pl-c"># Outputs "Paris"</span>



<span class="pl-s1">r2</span> <span class="pl-c1">=</span> <span class="pl-s1">conversation</span>.<span class="pl-c1">prompt</span>(<span class="pl-s">"Germany?"</span>)

<span class="pl-en">print</span>(<span class="pl-s1">r2</span>.<span class="pl-c1">text</span>())

<span class="pl-c"># Outputs "Berlin"</span></pre>

<p>This worked if you were building a conversation with the model from scratch, but it didn't provide a way to feed in a previous conversation from the start. This made tasks like building an emulation of the OpenAI chat completions API much harder than they should have been.</p>

<p>The <code>llm</code> CLI tool worked around this through a custom mechanism for persisting and inflating conversations using SQLite, but that never became a stable part of the LLM API - and there are many places you might want to use the Python library without committing to SQLite as the storage layer.</p>

<p>The new alpha now supports this:</p>

<pre><span class="pl-k">import</span> <span class="pl-s1">llm</span>

<span class="pl-k">from</span> <span class="pl-s1">llm</span> <span class="pl-k">import</span> <span class="pl-s1">user</span>, <span class="pl-s1">assistant</span>



<span class="pl-s1">model</span> <span class="pl-c1">=</span> <span class="pl-s1">llm</span>.<span class="pl-c1">get_model</span>(<span class="pl-s">"gpt-5.5"</span>)



<span class="pl-s1">response</span> <span class="pl-c1">=</span> <span class="pl-s1">model</span>.<span class="pl-c1">prompt</span>(<span class="pl-s1">messages</span><span class="pl-c1">=</span>[

    <span class="pl-en">user</span>(<span class="pl-s">"Capital of France?"</span>),

    <span class="pl-en">assistant</span>(<span class="pl-s">"Paris"</span>),

    <span class="pl-en">user</span>(<span class="pl-s">"Germany?"</span>),

])

<span class="pl-en">print</span>(<span class="pl-s1">response</span>.<span class="pl-c1">text</span>())</pre>

<p>The <code>llm.user()</code> and <code>llm.assistant()</code> functions are new builder functions designed to be used within that <code>messages=[]</code> array.</p>

<p>The previous <code>prompt=</code> option still works, but LLM upgrades it to a single-item messages array behind the scenes.</p>

<p>You can also now <em>reply</em> to a response, as an alternative to building a conversation:</p>

<pre><span class="pl-s1">response2</span> <span class="pl-c1">=</span> <span class="pl-s1">response</span>.<span class="pl-c1">reply</span>(<span class="pl-s">"How about Hungary?"</span>)

<span class="pl-en">print</span>(<span class="pl-s1">response2</span>) <span class="pl-c"># Default __str__() calls .text()</span></pre>

<h4 id="streaming-parts">Streaming parts</h4>

<p>The other major new interface in the alpha concerns streaming results back from a prompt.</p>

<p>Previously, LLM supported streaming like this:</p>

<pre><span class="pl-s1">response</span> <span class="pl-c1">=</span> <span class="pl-s1">model</span>.<span class="pl-c1">prompt</span>(<span class="pl-s">"Generate an SVG of a pelican riding a bicycle"</span>)

<span class="pl-k">for</span> <span class="pl-s1">chunk</span> <span class="pl-c1">in</span> <span class="pl-s1">response</span>:

    <span class="pl-en">print</span>(<span class="pl-s1">chunk</span>, <span class="pl-s1">end</span><span class="pl-c1">=</span><span class="pl-s">""</span>)</pre>

<p>Or this async variant:</p>

<pre><span class="pl-k">import</span> <span class="pl-s1">asyncio</span>

<span class="pl-k">import</span> <span class="pl-s1">llm</span>



<span class="pl-s1">model</span> <span class="pl-c1">=</span> <span class="pl-s1">llm</span>.<span class="pl-c1">get_async_model</span>(<span class="pl-s">"gpt-5.5"</span>)

<span class="pl-s1">response</span> <span class="pl-c1">=</span> <span class="pl-s1">model</span>.<span class="pl-c1">prompt</span>(<span class="pl-s">"Generate an SVG of a pelican riding a bicycle"</span>)



<span class="pl-k">async</span> <span class="pl-k">def</span> <span class="pl-en">run</span>():

    <span class="pl-k">async</span> <span class="pl-k">for</span> <span class="pl-s1">chunk</span> <span class="pl-c1">in</span> <span class="pl-s1">response</span>:

        <span class="pl-en">print</span>(<span class="pl-s1">chunk</span>, <span class="pl-s1">end</span><span class="pl-c1">=</span><span class="pl-s">""</span>, <span class="pl-s1">flush</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)



<span class="pl-s1">asyncio</span>.<span class="pl-c1">run</span>(<span class="pl-en">run</span>())</pre>

<p>Many of today's models return mixed types of content. A prompt run against Claude might return reasoning output, then text, then a JSON request for a tool call, then more text content.</p>

<p>Some models can even execute tools on the server-side, for example OpenAI's <a href="https://developers.openai.com/api/docs/guides/tools-code-interpreter?lang=curl">code interpreter tool</a> or Anthropic's <a href="https://platform.claude.com/docs/en/agents-and-tools/tool-use/web-search-tool">web search</a>. This means the results from the model can combine text, tool calls, tool outputs and other formats.</p>

<p>Multi-modal output models are starting to emerge too, which can return images or even <a href="https://developers.openai.com/api/docs/guides/audio#add-audio-to-your-existing-application">snippets of audio</a> intermixed into that streaming response.</p>

<p>The new LLM alpha models these as a stream of typed message parts. Here's what that looks like as a Python API consumer:</p>

<pre><span class="pl-k">import</span> <span class="pl-s1">asyncio</span>

<span class="pl-k">import</span> <span class="pl-s1">llm</span>



<span class="pl-s1">model</span> <span class="pl-c1">=</span> <span class="pl-s1">llm</span>.<span class="pl-c1">get_model</span>(<span class="pl-s">"gpt-5.5"</span>)

<span class="pl-s1">prompt</span> <span class="pl-c1">=</span> <span class="pl-s">"invent 3 cool dogs, first talk about your motivations"</span>



<span class="pl-k">def</span> <span class="pl-en">describe_dog</span>(<span class="pl-s1">name</span>: <span class="pl-smi">str</span>, <span class="pl-s1">bio</span>: <span class="pl-smi">str</span>) <span class="pl-c1">-&gt;</span> <span class="pl-smi">str</span>:

    <span class="pl-s">"""Record the name and biography of a hypothetical dog."""</span>

    <span class="pl-k">return</span> <span class="pl-s">f"<span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">name</span><span class="pl-kos">}</span></span>: <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">bio</span><span class="pl-kos">}</span></span>"</span>



<span class="pl-k">def</span> <span class="pl-en">sync_example</span>():

    <span class="pl-s1">response</span> <span class="pl-c1">=</span> <span class="pl-s1">model</span>.<span class="pl-c1">prompt</span>(

        <span class="pl-s1">prompt</span>,

        <span class="pl-s1">tools</span><span class="pl-c1">=</span>[<span class="pl-s1">describe_dog</span>],

    )

    <span class="pl-k">for</span> <span class="pl-s1">event</span> <span class="pl-c1">in</span> <span class="pl-s1">response</span>.<span class="pl-c1">stream_events</span>():

        <span class="pl-k">if</span> <span class="pl-s1">event</span>.<span class="pl-c1">type</span> <span class="pl-c1">==</span> <span class="pl-s">"text"</span>:

            <span class="pl-en">print</span>(<span class="pl-s1">event</span>.<span class="pl-c1">chunk</span>, <span class="pl-s1">end</span><span class="pl-c1">=</span><span class="pl-s">""</span>, <span class="pl-s1">flush</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)

        <span class="pl-k">elif</span> <span class="pl-s1">event</span>.<span class="pl-c1">type</span> <span class="pl-c1">==</span> <span class="pl-s">"tool_call_name"</span>:

            <span class="pl-en">print</span>(<span class="pl-s">f"<span class="pl-cce">\n</span>Tool call: <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">event</span>.<span class="pl-c1">chunk</span><span class="pl-kos">}</span></span>("</span>, <span class="pl-s1">end</span><span class="pl-c1">=</span><span class="pl-s">""</span>, <span class="pl-s1">flush</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)

        <span class="pl-k">elif</span> <span class="pl-s1">event</span>.<span class="pl-c1">type</span> <span class="pl-c1">==</span> <span class="pl-s">"tool_call_args"</span>:

            <span class="pl-en">print</span>(<span class="pl-s1">event</span>.<span class="pl-c1">chunk</span>, <span class="pl-s1">end</span><span class="pl-c1">=</span><span class="pl-s">""</span>, <span class="pl-s1">flush</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)



<span class="pl-k">async</span> <span class="pl-k">def</span> <span class="pl-en">async_example</span>():

    <span class="pl-s1">model</span> <span class="pl-c1">=</span> <span class="pl-s1">llm</span>.<span class="pl-c1">get_async_model</span>(<span class="pl-s">"gpt-5.5"</span>)

    <span class="pl-s1">response</span> <span class="pl-c1">=</span> <span class="pl-s1">model</span>.<span class="pl-c1">prompt</span>(

        <span class="pl-s1">prompt</span>,

        <span class="pl-s1">tools</span><span class="pl-c1">=</span>[<span class="pl-s1">describe_dog</span>],

    )

    <span class="pl-k">async</span> <span class="pl-k">for</span> <span class="pl-s1">event</span> <span class="pl-c1">in</span> <span class="pl-s1">response</span>.<span class="pl-c1">astream_events</span>():

        <span class="pl-k">if</span> <span class="pl-s1">event</span>.<span class="pl-c1">type</span> <span class="pl-c1">==</span> <span class="pl-s">"text"</span>:

            <span class="pl-en">print</span>(<span class="pl-s1">event</span>.<span class="pl-c1">chunk</span>, <span class="pl-s1">end</span><span class="pl-c1">=</span><span class="pl-s">""</span>, <span class="pl-s1">flush</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)

        <span class="pl-k">elif</span> <span class="pl-s1">event</span>.<span class="pl-c1">type</span> <span class="pl-c1">==</span> <span class="pl-s">"tool_call_name"</span>:

            <span class="pl-en">print</span>(<span class="pl-s">f"<span class="pl-cce">\n</span>Tool call: <span class="pl-s1"><span class="pl-kos">{</span><span class="pl-s1">event</span>.<span class="pl-c1">chunk</span><span class="pl-kos">}</span></span>("</span>, <span class="pl-s1">end</span><span class="pl-c1">=</span><span class="pl-s">""</span>, <span class="pl-s1">flush</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)

        <span class="pl-k">elif</span> <span class="pl-s1">event</span>.<span class="pl-c1">type</span> <span class="pl-c1">==</span> <span class="pl-s">"tool_call_args"</span>:

            <span class="pl-en">print</span>(<span class="pl-s1">event</span>.<span class="pl-c1">chunk</span>, <span class="pl-s1">end</span><span class="pl-c1">=</span><span class="pl-s">""</span>, <span class="pl-s1">flush</span><span class="pl-c1">=</span><span class="pl-c1">True</span>)



<span class="pl-en">sync_example</span>()

<span class="pl-s1">asyncio</span>.<span class="pl-c1">run</span>(<span class="pl-en">async_example</span>())</pre>

<p>Sample output (from just the first sync example):</p>

<blockquote>

<p><code>My motivation: create three memorable dogs with distinct “cool” styles—one cinematic, one adventurous, and one charmingly chaotic—so each feels like they could star in their own story.</code><br />

<code>Tool call: describe_dog({"name": "Nova Jetpaw", "bio": "A sleek silver-gray whippet who wears tiny aviator goggles and loves sprinting along moonlit beaches. Nova is fearless, elegant, and rumored to outrun drones just for fun."}</code><br />

<code>Tool call: describe_dog({"name": "Mochi Thunderbark", "bio": "A fluffy corgi with a dramatic black-and-gold bandana and the confidence of a rock star. Mochi is short, loud, loyal, and leads a neighborhood 'security patrol' made entirely of squirrels."}</code><br />

<code>Tool call: describe_dog({"name": "Atlas Snowfang", "bio": "A massive white husky with ice-blue eyes and a backpack full of trail snacks. Atlas is calm, heroic, and always knows the way home—even during blizzards, fog, or confusing camping trips."}</code></p>

</blockquote>

<p>At the end of the response you can call <code>response.execute_tool_calls()</code> to actually run the functions that were requested, or send a <code>response.reply()</code> to have those tools called and their return values sent back to the model:</p>

<pre><span class="pl-en">print</span>(<span class="pl-s1">response</span>.<span class="pl-c1">reply</span>(<span class="pl-s">"Tell me about the dogs"</span>))</pre>

<p>This new mechanism for streaming different token types means the CLI tool can now display "thinking" text in a different color from the text in the final response. The thinking text goes to stderr so it won't affect results that are piped into other tools.</p>

<p>This example uses Claude Sonnet 4.6 (with an updated streaming event version of the <a href="https://github.com/simonw/llm-anthropic">llm-anthropic</a> plugin) as Anthropic's models return their reasoning text as part of the response:</p>

<div class="highlight highlight-source-shell"><pre>llm -m claude-sonnet-4.6 <span class="pl-s"><span class="pl-pds">'</span>Think about 3 cool dogs then describe them<span class="pl-pds">'</span></span> \

  -o thinking_display 1</pre></div>

<p><img src="https://static.simonwillison.net/static/2026/claude-thinking-llm.gif" alt="Animated demo. Starts with ~/dev/scratch/llm-anthropic % uv run llm -m claude-sonnet-4.6 'Think about 3 cool dogs then describe them' -o thinking_display 1 - the text then streams in grey: The user wants me to think about 3 cool dogs and then describe them. Let me come up with 3 interesting, cool dogs and describe them. Then switches to regular color text for the output that describes the dogs." style="max-width: 100%;" /></p>

<p>You can suppress the output of reasoning tokens using the new <code>-R/--no-reasoning</code> flag. Surprisingly that ended up being the only CLI-facing change in this release.</p>

<h4 id="a-mechanism-for-serializing-and-deserializing-responses">A mechanism for serializing and deserializing responses</h4>

<p>As mentioned earlier, LLM has quite inflexible code at the moment for persisting conversations to SQLite. I've added a new mechanism in 0.32a0 that should provide Python API users a way to roll their own alternative:</p>

<pre><span class="pl-s1">serializable</span> <span class="pl-c1">=</span> <span class="pl-s1">response</span>.<span class="pl-c1">to_dict</span>()

<span class="pl-c"># serializable is a JSON-style dictionary</span>

<span class="pl-c"># store it anywhere you like, then inflate it:</span>

<span class="pl-s1">response</span> <span class="pl-c1">=</span> <span class="pl-v">Response</span>.<span class="pl-c1">from_dict</span>(<span class="pl-s1">serializable</span>)</pre>

<p>The dictionary this returns is actually a <code>TypedDict</code> defined in the new <a href="https://github.com/simonw/llm/blob/main/llm/serialization.py">llm/serialization.py</a> module.</p>

<h4 id="what-s-next-">What's next?</h4>

<p>I'm releasing this as an alpha so I can upgrade various plugins and exercise the new design in real world environments for a few days. I expect the stable 0.32 release will be very similar to this alpha, unless alpha testing reveals some design flaw in the way I've put this all together.</p>

<p>There's one remaining large task: I'd like to redesign the SQLite logging system to better capture the more finely grained details that are returned by this new abstraction.</p>

<p>Ideally I'd like to model this as a graph, to best support situations like an OpenAI-style chat completions API where the same conversations are constantly extended and then repeated with every prompt. I want to be able to store those without duplicating them in the database.</p>

<p>I'm undecided as to whether that should be a feature in 0.32 or I should hold it for 0.33.</p>