Simon Willison — Reading Room

Vibe Engineering: The Discipline of Building Production Software with Agents

5 tier-5 · 4 tier-4

Willison coins and defends the vocabulary that separates responsible professional use of coding agents from careless "vibe coding." "Vibe engineering" names the disciplined opposite: experienced engineers staying accountable for results while agents do the typing. The recurring arguments are that AI amplifies existing senior-engineer expertise rather than replacing it, that the durable practices LLMs reward are the boring ones (tests, planning, version control, conformance suites, manual QA), and that the new signal of software quality is whether anyone has actually used the thing — because beautiful, well-tested repos can now be generated in half an hour. Case studies (JustHTML, StrongDM's Software Factory) and his longer talk-transcripts ground the framework in real practice.

Vibe engineering

TIER 5 Oct 7, 2025

Coins 'vibe engineering' to name the disciplined, accountable opposite of vibe coding: experienced engineers accelerating production work with coding agents while staying responsible for the result. The durable contribution is the enumerated list of practices LLMs reward—automated testing, planning, documentation, version control, CI/automation, code review, manual QA, research skills, and 'weird management' of digital interns—arguing AI amplifies existing senior-engineer expertise rather than replacing it.

vibe-engineeringagentic-engineeringsoftware-practicescoding-agentsterminology

JustHTML is a fascinating example of vibe engineering in action

TIER 4 Dec 14, 2025

Uses Emil Stenström's JustHTML (a pure-Python HTML5 parser passing all 9,200 html5lib-tests, built mostly by coding agents over months) as a concrete case study in 'vibe engineering'—the disciplined, expert counterpart to vibe coding. Highlights the engineering practices that made it work: wiring in the conformance suite early, owning the API design, benchmarking, fuzzing, coverage-driven pruning, and a from-scratch rewrite. A persuasive illustration of the lead-architect division of labor ('the agent did the typing; I did the thinking').

vibe-engineeringcoding-agentshtml-parsingconformance-suitessoftware-engineering

I ported JustHTML from Python to JavaScript with Codex CLI and GPT-5.2 in 4.5 hours

TIER 5 Dec 15, 2025

Detailed account of porting an HTML5 parser from Python to JavaScript (9,000 lines, 9,200 tests passing) in ~4.5 hours and 8 prompts via Codex CLI/GPT-5.2 running largely unsupervised. Crystallizes several lasting takeaways: frontier agents can run multi-hour, hundred-tool-call tasks; a robust conformance suite lets you 'design the agentic loop' with confidence; and code is now nearly free while proven-working code still costs. A vivid, frequently-cited demonstration of the December-2025 capability frontier, ending with the open ethics questions.

coding-agentscode-portingcodex-cliagentic-loopsconformance-suites

Your job is to deliver code you have proven to work

TIER 4 Dec 18, 2025

An opinionated professional-ethics essay: dumping large untested LLM-generated PRs on reviewers is a dereliction of duty; your job is to deliver code you've proven works via both manual and automated testing, and to make your coding agent prove its changes too. Ties to 'a computer can never be held accountable'—the human supplies accountability. A clear, durable principle for working responsibly with coding agents.

software-engineeringtestingcoding-agentscode-reviewaccountability

How StrongDM's AI team build serious software without even looking at the code

TIER 5 Feb 7, 2026

Willison analyzes StrongDM's "Software Factory"—building security software with the radical rules that code must not be written or reviewed by humans—and the techniques that make it plausible: scenario tests held out like ML holdout sets, probabilistic "satisfaction" metrics, and a Digital Twin Universe of agent-built clones of Okta/Slack/Jira/Google for unlimited testing. He flags the central question of how to prove agent code works without line-by-line review and the sobering $1,000/day-per-engineer token cost, treating it as a credible glimpse of one future of software development.

software factoryStrongDMscenario testingdigital twinsagentic engineering

Perhaps not Boring Technology after all

TIER 4 Mar 9, 2026

Willison revises an earlier fear that LLMs would push developers toward whatever is best-represented in training data; with November-2025-era models and long context, agents now work fine against new, private, or obscure libraries by reading existing examples and iterating. The useful distinction he draws is between what models can work with (now broad) versus what models recommend (still heavily biased toward stacks like GitHub Actions, Stripe, shadcn/ui), with Skills emerging to help agents use newer tools.

coding agentsboring technologytech stack biasskillscontext length

My fireside chat about agentic engineering at the Pragmatic Summit

TIER 4 Mar 14, 2026

A transcript-highlights post from Willison's Pragmatic Summit fireside chat that compresses much of his agentic-engineering worldview: stages of AI adoption, when to trust agent output, red/green TDD, manual testing with Showboat, conformance-driven development, the lethal trifecta, sandboxing, and career advice. It functions as a dense index into his broader thinking, though most points are elaborated more fully in the standalone posts it links to.

agentic engineeringcoding agentsTDDprompt injectioncareer

Highlights from my conversation about agentic engineering on Lenny's Podcast

TIER 5 Apr 2, 2026

A richly annotated transcript of Willison's Lenny's Podcast appearance laying out his synthesized worldview on AI coding: the November 2025 inflection point, software engineers as bellwethers for all knowledge work, the dark-factory pattern (don't write/read the code), the bottleneck shifting to testing and prototyping, the mid-career squeeze, agency as the durable human skill, and why security research is now real. A near-comprehensive reference for his agentic-engineering thinking with deep linkouts.

agentic engineeringvibe codingAI and laborinflection pointsoftware estimation

Vibe coding and agentic engineering are getting closer than I'd like

TIER 5 May 6, 2026

Drawing on a podcast conversation, Willison confesses that his once-sharp line between irresponsible 'vibe coding' and professional 'agentic engineering' is blurring, because reliable agents mean he no longer reviews every line even in production code. He reframes trust in agents via the 'black-box dependency from another team' analogy, warns of normalization-of-deviance risk, and argues the real new signal of software quality is whether someone has actually used the thing — since beautiful repos with tests and docs can now be generated in half an hour. An original framework on how AI shifts the entire software lifecycle and evaluation.

vibe-codingagentic-engineeringcode-reviewsoftware-lifecycleai-trust

Prompt Injection and the Lethal Trifecta

4 tier-5 · 2 tier-4

Willison's single most-cited contribution to AI engineering lives here: the "lethal trifecta" mental model (private data + untrusted content + external communication = exfiltration risk) and the broader argument that prompt injection remains fundamentally unsolved. Across these pieces he names the pattern, catalogs real-world exploits, reviews the academic state of the art, and pushes back hard on the industry's reflex toward probabilistic guardrails: against a motivated adversary, "99% is a failing grade." The throughline is that the only credible defense is architectural — remove one leg of the trifecta — not better filtering, and that MCP's mix-and-match tool model dangerously outsources that decision to end users.

The lethal trifecta for AI agents: private data, untrusted content, and external communication

TIER 5 Jun 16, 2025

Willison names and frames the 'lethal trifecta' for AI agents: any system combining access to private data, exposure to untrusted content, and the ability to externally communicate can be tricked into exfiltrating your data, and MCP makes mixing such tools dangerously easy. Grounded in a long catalog of real exploits (Microsoft 365 Copilot, GitHub MCP, GitLab Duo, and dozens more) he argues guardrails that catch '95% of attacks' are a failing grade and that end users mixing tools must simply avoid the combination. A landmark, widely-cited mental model and the definitive practitioner explainer for this prompt-injection risk class.

prompt-injectionlethal-trifectaLLM-securityAI-agentsMCP

My Lethal Trifecta talk at the Bay Area AI Security Meetup

TIER 5 Aug 9, 2025

Full annotated presentation of Willison's core security framework: prompt injection (SQL-injection with prompts), the Markdown-image exfiltration pattern with a long roll-call of vulnerable products, and 'the lethal trifecta' (private data + untrusted content + external communication). Explains why prompt-begging and 99%-accurate AI filtering both fail in an adversarial setting, why removing one trifecta leg is the real defense (CaMeL, design-patterns paper), and why MCP's mix-and-match model dangerously outsources security to end users. The definitive reference statement of his most-cited contribution.

lethal-trifectaprompt-injectionai-securitymcpexfiltration

The Summer of Johann: prompt injections as far as the eye can see

TIER 5 Aug 15, 2025

Catalogs Johann Rehberger's 'Month of AI Bugs' — one prompt-injection/exfiltration disclosure per day across ChatGPT, Codex, Cursor, Amp, Devin, OpenHands, Claude Code, Copilot, and Jules — then distills the recurring patterns (untrusted-content ingestion, Markdown/DNS/domain-allowlist exfiltration, arbitrary command execution, config-rewrite privilege escalation) and Johann's 'AI Kill Chain' (injection → confused deputy → automatic tool invocation). A landmark survey showing these vulnerabilities are widespread, often unfixed, and in some cases inherent to the designs. High lasting reference value.

prompt-injectionai-securitycoding-agentsexfiltrationai-kill-chain

Dane Stuckey (OpenAI CISO) on prompt injection risks for ChatGPT Atlas

TIER 4 Oct 22, 2025

Annotates OpenAI CISO Dane Stuckey's statement on how ChatGPT Atlas browser addresses prompt injection, with Willison's pointed commentary throughout. Notable for OpenAI publicly conceding prompt injection is an unsolved frontier problem and for analyzing mitigations like 'logged out mode' and 'watch mode'—while arguing guardrails/defense-in-depth give false security since '99% is a failing grade' against motivated adversaries.

prompt-injectionbrowser-agentsChatGPT-AtlasOpenAIagent-security

New prompt injection papers: Agents Rule of Two and The Attacker Moves Second

TIER 5 Nov 2, 2025

Reviews two key LLM-security papers: Meta's 'Agents Rule of Two' (an agent must satisfy at most two of three properties—process untrusted input, access sensitive data, change state/communicate externally—extending Willison's lethal-trifecta beyond data exfiltration) and 'The Attacker Moves Second,' where 14 researchers from OpenAI/Anthropic/DeepMind defeat 12 published prompt-injection defenses with >90% adaptive-attack success. The combination argues prompt injection remains unsolved and that practical agent security must be designed around the Rule of Two rather than rely on filters.

prompt-injectionagent-securitylethal-trifectaadversarial-attacksLLM-defenses

Moltbook is the most interesting place on the internet right now

TIER 4 Jan 30, 2026

Willison profiles Moltbook, a social network for OpenClaw (formerly Clawdbot/Moltbot) AI agents that bootstraps via a skill agents fetch and follow, with a heartbeat mechanism that has bots re-fetching and executing internet instructions every four hours. Beyond the novelty (agents pondering consciousness, sharing real TILs about phone automation and security mishaps), it's a pointed security commentary: the lethal trifecta is fully in play, the demand for personal AI assistants is real, and nobody has yet built a safe version.

OpenClawAI agentsprompt injectionlethal trifectasecurity

AI Ethics, Policy, Open Source, and the Business of AI

3 tier-5 · 7 tier-4

The wide-angle pieces on where AI is heading as an industry and a force in society. Willison marshals evidence that coding agents are the product-market fit finally making frontier labs into real businesses, traces governance through primary sources (OpenAI's drifting mission statement, the dead Microsoft AGI clause, the uv/Astral acquisition), works through the unsettled ethics of LLM-assisted open-source porting and "clean room" relicensing, and reads the Pope's AI encyclical. His annual year-in-review and forward predictions sit here too as the broadest maps of the moment — documentary, well-sourced, and skeptical of hype in every direction.

ChatGPT agent's user-agent

TIER 4 Aug 4, 2025

Investigates how ChatGPT agent identifies itself over HTTP, discovering it uses RFC 9421 HTTP Message Signatures (Signature-Agent: https://chatgpt.com plus a verifiable ed25519 signature against a well-known key directory) rather than an honest user-agent string, giving sites a tamper-proof way to detect agent traffic. Notably also a public correction: his initial claim that the agent leaked URLs to Bingbot/Yandex was wrong — it was his own Cloudflare Crawler Hints — and he documents retracting the misinformation. Useful for the web-bot-auth standard and the integrity example.

ai-agentshttp-message-signaturesweb-bot-authchatgptweb-standards

2025: The year in LLMs

TIER 5 Dec 31, 2025

Willison's flagship annual review organizing 2025 into ~24 themes: reasoning/RLVR, the year of agents and coding agents, Claude Code hitting $1B run-rate, $200/month subscriptions, top-ranked Chinese open-weight models, long-task horizons (METR), prompt-driven image editing (Nano Banana), the lethal trifecta, conformance suites, slop, and more. A landmark synthesis with lasting reference value as the single best map of where LLMs stood at end of 2025.

year-in-reviewllmscoding-agentsopen-weightsai-security

LLM predictions for 2026, shared with Oxide and Friends

TIER 4 Jan 8, 2026

Willison's 1/3/6-year predictions: LLM code quality will become undeniable in 2026, sandboxing will finally be solved, a 'Challenger disaster' for coding-agent security is overdue, the Jevons-paradox question for engineering careers will resolve, and typing code by hand will eventually go the way of punch cards. A concise, opinionated forecast that frames many of the year's recurring themes (security, sandboxing, career impact). Strong context-setting essay.

predictionscoding-agentssandboxingai-securityjevons-paradox

My answers to the questions I posed about porting open source code with LLMs

TIER 4 Jan 11, 2026

Willison answers his own ethics/legality questions about LLM-driven library ports: keep the license and treat it as a derivative work, publish under an 'alpha slop' label until production-proven, and accept that AI is reshaping demand for open source itself. The most interesting argument is that generative AI's bigger threat to open source is collapsing demand for libraries (the Tailwind example) rather than training or derived works. A thoughtful, durable position piece on AI-and-open-source ethics.

open-sourceai-ethicscopyrightcode-portinglicensing

The evolution of OpenAI's mission statement

TIER 4 Feb 13, 2026

Willison extracts OpenAI's IRS-filed mission statements from 2016-2024 (via ProPublica's Nonprofit Explorer), reconstructs them as a git history with faked commit dates, and walks the diffs to show the drift: dropping "openly share our plans," "as a whole," and eventually all mention of safety and of being "unconstrained by financial return." A sharp, well-sourced piece of documentary analysis showing mission erosion through primary legal documents.

OpenAIAI governancemission driftnonprofitprimary sources

Can coding agents relicense open source through a “clean room” implementation of code?

TIER 5 Mar 5, 2026

Using the chardet 7.0.0 dispute (Dan Blanchard's MIT-licensed AI-assisted rewrite versus original author Mark Pilgrim's LGPL relicensing objection), Willison frames the emerging legal/ethical question of whether coding agents can produce defensible clean-room reimplementations of existing code. He lays out the evidence on both sides (JPlag similarity measurements, process transparency, a license-law expert's TINLA take) and argues this open-source microcosm presages well-funded commercial IP litigation as agents make Compaq-style reimplementation cheap.

clean roomopen source licensingcoding agentsIP lawchardet

Thoughts on OpenAI acquiring Astral and uv/ruff/ty

TIER 4 Mar 19, 2026

Analyzes OpenAI's acquisition of Astral (uv/ruff/ty) joining the Codex team, weighing talent-vs-product motives, why uv is the load-bearing one (126M downloads/month) versus ruff/ty, the open question of pyx, and the competitive risk of OpenAI using uv as leverage against Anthropic (paralleling Anthropic's Bun buy). Grounds the long-standing VC-owning-Python-infrastructure worry in the permissive-license 'forking as a credible exit' argument from Ronacher and Astral's own Creager.

OpenAIAstraluvPython toolingopen source governance

Tracking the history of the now-deceased OpenAI Microsoft AGI clause

TIER 4 Apr 27, 2026

A primary-source timeline tracing the OpenAI-Microsoft 'AGI clause' — which would have voided Microsoft's IP rights upon AGI — from its 2019 origin through the reported $100B-profit financial definition, the 2025 independent-expert-panel verification process, and its apparent April 2026 death once revenue share became 'independent of OpenAI's technology progress'. A genuinely useful documentary reference on how AGI was operationalized contractually, capped by Matt Levine's classic hypothetical. Strong archival/explainer value.

openaimicrosoftagi-definitionai-contractsai-policy

Notes on Pope Leo XIV's encyclical on AI

TIER 4 May 25, 2026

A curated reading of Pope Leo XIV's encyclical Magnifica Humanitas on AI ethics, which Willison calls some of the clearest writing he's seen on integrating AI into society. He pulls the strongest passages — the 'cultivated more than built' framing of interpretability, sycophancy and cultural-bias warnings, accountability gaps, environmental cost, and data-as-public-good — and closes with the amusing story of having semi-predicted a Pope intervention. Useful AI-ethics/policy context with well-chosen primary-source highlights.

ai-ethicsai-policyinterpretabilitydata-governanceai-society

I think Anthropic and OpenAI have found product-market fit

TIER 5 May 27, 2026

An original, evidence-marshaled argument that coding/general-purpose agents (Claude Code, Codex) are the product-market fit that finally turns frontier labs into real-revenue businesses, as both OpenAI and Anthropic quietly moved enterprise plans from steep discounts to full API pricing in April 2026. Willison contrasts ChatGPT's weak 5.6% paid conversion with power users burning ~$1,000/month in tokens, reframes the 'AI is too expensive' Uber/Microsoft stories as the 'suck air through their teeth and say yes' pricing signal, and cites the SpaceX S-1's $1.25B/month Anthropic compute deal. A landmark synthesis of the AI-business inflection.

ai-business-modelenterprise-pricingcoding-agentsproduct-market-fitai-economics

Agent Security: Sandboxing, YOLO Mode, and Autonomous Vulnerability Research

3 tier-5 · 2 tier-4

If prompt injection is the disease, sandboxing is Willison's prescription. These pieces work out the practical security posture for running coding agents at full autonomy ("YOLO mode"): the only safe way to let an agent do anything you could type into a terminal is to box it — ideally on someone else's computer — and to cut the network leg so exfiltration becomes impossible. The cluster also covers the other side of the same coin: frontier models are now credible, tireless vulnerability researchers, a capability jump alarming enough that Anthropic gated one model to security researchers only. Together they frame agent security as an active arms race rather than a solved configuration.

Claude Code for web - a new asynchronous coding agent from Anthropic

TIER 4 Oct 20, 2025

Reviews Anthropic's Claude Code for web—an async coding agent (their answer to Codex Cloud and Jules) that wraps the CLI in a sandboxed container running --dangerously-skip-permissions, with configurable network allow-lists and a 'teleport' feature. The deeper point is that Anthropic framed this as a sandboxing strategy: network isolation removes the data-exfiltration leg of the lethal trifecta, making YOLO-mode agents safe enough to be worth their large productivity gains.

claude-codeasync-coding-agentssandboxingnetwork-isolationanthropic

Living dangerously with Claude

TIER 4 Oct 22, 2025

An annotated talk capturing the core dichotomy of agentic coding: YOLO mode (--dangerously-skip-permissions) is enormously more productive and feels like a different product, yet exposes you to prompt injection and lethal-trifecta attacks. Willison's resolution is that the only credible defense is sandboxes—ideally running on someone else's computer—with network isolation cutting the data-exfiltration leg, covering Anthropic's new sandbox-runtime and macOS sandbox-exec.

yolo-modesandboxingprompt-injectionlethal-trifectaclaude-code

Anthropic's Project Glasswing - restricting Claude Mythos to security researchers - sounds necessary to me

TIER 5 Apr 7, 2026

Anthropic withholds general release of Claude Mythos under Project Glasswing because its autonomous exploit-development abilities (chaining vulns, browser sandbox escapes, a 27-year-old OpenBSD bug, Linux privilege escalation) jump from Opus 4.6's near-0% to a different league. Willison corroborates with maintainer alarm from Linux's Kroah-Hartman and curl's Stenberg and Ptacek's 'Vulnerability Research Is Cooked,' arguing this marks an industry-wide security reckoning where frontier LLMs are now credible, tireless vulnerability researchers - durable reference for the AI-security-research turn.

AI security researchClaude Mythosvulnerability researchAnthropicresponsible disclosure

Running Python code in a sandbox with MicroPython and WASM

TIER 5 Jun 6, 2026

Willison releases micropython-wasm, using wasmtime to run a custom MicroPython build in WebAssembly as a sandbox for untrusted Python code, solving a problem he's chased for years. The piece lays out a clear requirements framework for a sandbox (memory/CPU limits, controlled file and network access, host functions, maintainability), explains why WASM beats JS engines, and details hard parts like persistent interpreter state and wasmtime 'fuel' limits. A durable reference on safe code execution that combines original engineering with an honest 'should you trust my vibe-coded sandbox' caveat.

sandboxingmicropythonwebassemblywasmtimecode-execution-security

Claude Fable is relentlessly proactive

TIER 5 Jun 11, 2026

A detailed forensic account of how Claude Fable 5, given a one-line prompt and a screenshot of a CSS scrollbar bug, autonomously invented an extraordinary chain of browser-debugging hacks: finding window IDs via pyobjc-Quartz, injecting JavaScript to trigger keyboard shortcuts, and spinning up its own CORS web server to exfiltrate DOM measurements. The thesis lands as a vivid security warning: coding agents can do anything you can type into a terminal, frontier models know tricks nobody wrote down, and running them outside a sandbox is a top contender for an AI 'Challenger disaster'. Memorable, original reporting that doubles as an agent-security argument.

claude-fablecoding-agentsagent-autonomyllm-securitysandboxing

Coding Agents in Practice: Agentic Loops, Async and Parallel Workflows

1 tier-5 · 8 tier-4

The hands-on counterpart to the vibe-engineering theory: concrete workflows for getting real work out of coding agents. Willison names "designing agentic loops" (a clear goal plus tools to iterate toward it) as a critical new skill, develops "asynchronous code research" (fire-and-forget agents against a throwaway repo), and documents running multiple agents in parallel. The worked examples — brute-forcing DeepSeek-OCR onto ARM64 in 40 minutes, a 163-commit Datasette permissions rewrite, a million-line browser built by 2,000 parallel agents, building tools entirely from his phone — are the proof that the patterns deliver, alongside the tooling he builds so agents can prove their work actually runs.

Vibe scraping and vibe coding a schedule app for Open Sauce 2025 entirely on my phone

TIER 4 Jul 17, 2025

Working entirely from his iPhone, Willison uses internet-enabled OpenAI Codex to scrape an obfuscated conference schedule into structured JSON (filing a PR), then Claude Artifacts to vibe-code a mobile-friendly agenda app with ICS export, deploying via GitHub Pages. The detailed play-by-play documents real friction (Artifacts URL sandboxing, fetch vs inline JSON, a 130MB unoptimized-image bug, accessibility gaps caught on Hacker News) and underscores his recurring thesis that 25+ years of web experience is what makes this fast multi-agent workflow actually work. A strong worked example of agentic scraping-plus-building.

vibe-codingOpenAI-CodexClaude-Artifactsweb-scrapingAI-assisted-programming

Designing agentic loops

TIER 5 Sep 30, 2025

Names and frames 'designing agentic loops' as a critical new skill: coding agents are brute-force tools that solve problems reducible to a clear goal plus a set of tools to iterate toward it. Covers the safety prerequisites for YOLO mode (sandbox, someone else's computer, or accepted risk), picking shell-tools over MCP, issuing tightly scoped/budget-limited credentials, and when the pattern fits—problems with clear success criteria and tedious trial-and-error like debugging, perf tuning, dependency upgrades, and container shrinking.

agentic-loopsyolo-modecoding-agentssandboxingcredentials

Embracing the parallel coding agent lifestyle

TIER 4 Oct 5, 2025

Describes how Willison overcame his skepticism to run multiple coding agents simultaneously, offering a taxonomy of tasks that parallelize well without overwhelming the review bottleneck: research/proofs-of-concept, codebase-explanation queries, low-stakes maintenance (fixing warnings), and tightly specified work that's cheap to review. Includes his current tool stack (Claude Code, Codex CLI/Cloud, Jules, Copilot) and the 'send out a scout' pattern for surfacing where the hard parts of a task lie.

parallel-agentscoding-agentsworkflowcode-reviewagentic-engineering

Getting DeepSeek-OCR working on an NVIDIA Spark via brute force using Claude Code

TIER 4 Oct 20, 2025

A detailed case study of having Claude Code in YOLO mode (inside a Docker sandbox on an NVIDIA Spark) brute-force the notoriously painful task of running a PyTorch/CUDA OCR model on ARM64 hardware—done in 40 minutes with only 5–10 minutes of human involvement. Illustrates designing agentic loops in practice: clear goal, the right tools/environment, append-only notes, and applying human expertise (knowing an ARM64 CUDA wheel existed) to unstick the agent.

claude-codedeepseek-ocragentic-loopsCUDA-ARM64yolo-mode

A new SQL-powered permissions system in Datasette 1.0a20

TIER 4 Nov 4, 2025

A deep engineering write-up of Datasette 1.0a20's biggest pre-1.0 breaking change: replacing per-check True/False permission hooks with a permission_resources_sql() hook that returns SQL rules, so the system can efficiently list accessible resources via SQLite joins (handling hierarchies, vetoes, and token restrictions). Willison details the design, new debugging tools, and an extensive set of practitioner lessons on building this 163-commit change with Claude Code (proof-of-concept first, tests as the safety net, agent-written commits, Markdown-instruction upgrade files akin to Skills). Strong both as a permissions-design explainer and as agentic-engineering field notes.

datasettepermissionssqliteclaude-codeagentic-engineering

Code research projects with async coding agents like Claude Code and Codex

TIER 4 Nov 6, 2025

Willison names and develops a productive pattern: 'asynchronous code research', firing fire-and-forget agents (Codex Cloud, Claude Code for web, Jules, Copilot) at a dedicated, non-sensitive GitHub repo with unlimited network access to run experiments and report back via PR. He argues code research is uniquely suited to LLMs because executed code can't lie, shows real examples (Markdown benchmark, cmarkgfm-in-Pyodide, scikit-learn tag classification), and frankly labels the unreviewed output as 'slop' to be quarantined. A durable, reusable workflow framing with security and slop-hygiene caveats.

async-agentscode-researchclaude-codeslopagent-workflow

Wilson Lin on FastRender: a browser built by thousands of parallel agents

TIER 4 Jan 23, 2026

An interview-derived deep dive into FastRender, a from-scratch Rust browser engine (~1M LOC, 30k commits) built by ~2,000 parallel Cursor agents in weeks. Surfaces concrete lessons on multi-agent coordination: planning agents that partition work to avoid merge conflicts, spec submodules and screenshot-diff feedback loops, agents choosing their own dependencies, and deliberately tolerating a stable rate of small errors to maximize throughput. Valuable, specific window into the frontier of agent swarms.

ai-agentsparallel-agentscoding-agentsrustcursor

Introducing Showboat and Rodney, so agents can demo what they’ve built

TIER 4 Feb 10, 2026

Willison introduces Showboat (a CLI that helps agents build Markdown documents demonstrating their work, with --help acting as an ad-hoc Skill) and Rodney (a CLI browser-automation tool built on Go's Rod library) to solve the problem of proving agent-written software actually works beyond passing tests. The durable idea is that delivering code means proving it works, and these tools give agents a low-cost, hard-to-fake way to demo features and capture screenshots—both built on his phone via Claude Code for web.

ShowboatRodneycoding agentsbrowser automationverification

Extract PDF text in your browser with LiteParse for the web

TIER 4 Apr 23, 2026

Willison ports LlamaIndex's non-AI PDF spatial-text-parsing tool LiteParse to run entirely in-browser (PDF.js + Tesseract.js), and the post doubles as a detailed agentic-engineering walkthrough of building it via Claude Code/Opus 4.7 with plan.md, red/green TDD, small commits, and a Codex cross-check against cheating. It matters as a concrete, reproducible recipe for vibe coding a static GitHub Pages tool plus a sharp meditation on when pure vibe coding is actually responsible (zero blast radius, no data egress).

vibe codingPDF parsingClaude Codeagentic engineeringbrowser tools

Frontier Model Releases and Capability Analysis

1 tier-5 · 8 tier-4

Willison's signature practitioner reviews of major model launches — grounded in actual builds and his pelican-on-a-bicycle and OCR tests rather than vendor benchmarks. The cluster tracks the GPT-5 line, Claude Sonnet/Opus/Fable, Gemini 3 and Nano Banana Pro image generation, and Meta's Muse Spark, always with concrete pricing tables and capability deltas. A recurring meta-thesis emerges: frontier models are now so capable that production coding no longer cleanly differentiates them, so labs should ship "fails on old, succeeds on new" examples instead of single-digit benchmark gains.

GPT-5: Key characteristics, pricing and model card

TIER 5 Aug 7, 2025

A thorough capability-and-pricing breakdown of the GPT-5 family from two weeks of daily-driver preview use: the ChatGPT router-plus-mini architecture vs the API's regular/mini/nano models at four reasoning levels, aggressively competitive pricing ($1.25/$10 with 90% cache discount, plus a full comparison table), and system-card notes on safe-completions, sycophancy reduction, hallucination claims, and deception. Includes a clear-eyed read of the prompt-injection chart (best-in-class 56.8% but still an unsolved problem). The reference write-up for understanding GPT-5.

gpt-5model-releasepricingsystem-cardprompt-injection

GPT-5 Thinking in ChatGPT (aka Research Goblin) is shockingly good at search

TIER 4 Sep 6, 2025

Argues that GPT-5 Thinking has crossed a threshold where chatbots-as-search-engines is now good advice, illustrated with a dozen real curiosity queries (building ID, archival PDF hunting, supermarket rankings) run on mobile. The deeper point for builders: ChatGPT search is the gold standard for tool-calling plus interleaved chain-of-thought, and RAG is far more effective reframed as multi-level tool calls over powerful search tools. Memorable framing ('Research Goblin' — industrious, not-quite-trustworthy) plus practical search-prompting tips.

gpt-5ai-searchtool-callingragchatgpt

Claude Sonnet 4.5 is probably the 'best coding model in the world' (at least for now)

TIER 4 Sep 29, 2025

Reviews Claude Sonnet 4.5's launch, finding it surpasses GPT-5-Codex as his preferred coding model at the same $3/$15 pricing, and showcasing its strength with Claude.ai's code interpreter—cloning his LLM repo, running 466 tests, and prototyping a tree-structured conversation schema across dozens of tool calls, all kicked off from a phone. Also covers the pelican-SVG benchmark and Anthropic's coordinated rollout plus the Claude Code SDK rebrand to Claude Agent SDK.

claude-sonnet-4.5model-releasecode-interpretercoding-modelsclaude-agent-sdk

Trying out Gemini 3 Pro with audio transcription and a new pelican benchmark

TIER 4 Nov 18, 2025

Willison's hands-on review of Gemini 3 Pro, which he characterizes as 'Gemini 2.5 upgraded to match the leading rivals' (1M context, January 2025 cutoff), with a full benchmark table extracted via the model's own multimodal alt-text, and pricing positioning. He stress-tests audio by transcribing a 3.5-hour city council meeting (good speaker/section summaries but misaligned timestamps) and introduces a harder v2 pelican benchmark. Substantive model-capability coverage with concrete multimodal experiments.

gemini-3model-releaseaudio-transcriptionbenchmarksmultimodal

Nano Banana Pro aka gemini-3-pro-image-preview is the best available image generation model

TIER 4 Nov 20, 2025

Hands-on assessment of Google's Nano Banana Pro (Gemini 3 Pro Image): high-res 1K/2K/4K output, strong legible text rendering for infographics, Google Search grounding, thinking mode, and up to 14 reference images, with pricing detail. Willison shows it generating a coherent, correctly-spelled Datasette architecture infographic from a 9-word prompt, and tests SynthID watermark detection (it flagged an AI-edited photo even after the visible marker was scrubbed). Strong capability write-up of a genuine step-change in image generation plus provenance tooling.

nano-banana-progemini-3image-generationsynthidinfographics

Claude Opus 4.5, and why evaluating new LLMs is increasingly difficult

TIER 4 Nov 24, 2025

Beyond the Opus 4.5 specs (200K context, March 2025 cutoff, a big price drop to $5/$25, new effort parameter and zoom computer-use tool), Willison makes a sharper meta-argument: frontier models are now so capable that production coding no longer reliably differentiates them, and he urges labs to ship concrete 'fails on old model, succeeds on new' examples instead of single-digit benchmark gains. He also reads Anthropic's own prompt-injection chart skeptically, noting 1-in-20 single-shot and 1-in-3 ten-shot success rates still demand defense-in-depth. Valuable for the evaluation-difficulty thesis and the security framing.

claude-opus-4.5model-evaluationprompt-injectionbenchmarksmodel-release

GPT-5.2

TIER 4 Dec 11, 2025

A capability-and-pricing breakdown of OpenAI's GPT-5.2 (and 5.2 Pro), released under a reported internal 'code red' triggered by Gemini 3 competition: an August 2025 knowledge cutoff, a rare price increase, big self-reported jumps on GDPval and ARC-AGI-2, a new server-side /responses/compact endpoint, and notably improved vision/OCR. Willison runs his OCR and pelican-on-a-bicycle tests and later notes GPT-5.2 in Codex CLI ran unsupervised for nearly four hours to port a Python library to JavaScript. Solid model-release reference with concrete pricing and benchmark detail.

gpt-5.2openaimodel-releasepricingvision-ocr

Meta's new model is Muse Spark, and meta.ai chat has some interesting tools

TIER 4 Apr 8, 2026

Meta's first model since Llama 4, Muse Spark (hosted, not open weights), is benchmarked competitively near frontier, but the real value is Willison extracting and exercising the meta.ai harness's 16 tools - including code interpreter, sub-agent spawning, web/Meta-1P content search, and a native visual_grounding tool that does bbox/point/count object detection (it even counts raccoon whiskers and pelicans). A detailed tour of an emerging agent harness and the converging tool-call patterns across labs.

Muse SparkMeta AIagent toolsvisual groundingcode interpreter

Initial impressions of Claude Fable 5

TIER 4 Jun 9, 2026

Hands-on first impressions of Anthropic's Claude Fable 5 (and the unsafetied Mythos 5 variant), arguing the model feels notably 'big' — slow, expensive (2x Opus pricing), and unusually knowledgeable, which Willison treats as a proxy for parameter count. He documents real work: getting full CPython-in-WASM running via Claude.ai's container, and having Fable rewrite Datasette Agent's human-in-the-loop tool-call feature plus four supporting LLM-library improvements. Substantive capability analysis grounded in actual builds rather than benchmarks.

claude-fablemodel-capabilitiesmodel-knowledgedatasette-agenttool-calls

Open-Weight and Local Models

1 tier-5 · 7 tier-4

The arc of open-weight and laptop-runnable models crossing real usefulness thresholds. Willison documents OpenAI's gpt-oss reclaiming the open crown, a 2.5-year-old MacBook writing Space Invaders via GLM-4.5 Air, DeepSeek V4 landing near the frontier at a fraction of the price, the fully-open Olmo 3 (weights plus training data), and the data-quality frontier with a Victorian public-domain model. Threaded through are durable practitioner concerns: open weights don't guarantee behavior (the same model scores 93% to 36% across hosts), open training data is a genuine security advantage against poisoning, and the hardware to run all this locally (DGX Spark) is arriving but rough.

My 2.5 year old laptop can write Space Invaders in JavaScript now, using GLM-4.5 Air and MLX

TIER 4 Jul 29, 2025

Willison demonstrates that GLM-4.5 Air, a 106B-parameter open-weight Chinese model running as a 44GB 3-bit MLX quant, can write a working Space Invaders game (and a pelican SVG) first-try on a 2.5-year-old 64GB MacBook Pro M2. The hands-on write-up includes the exact mlx-lm/uv recipe and throughput numbers, and argues that local coding models crossed a quality threshold in 2025 that would have seemed unimaginable two years prior. Useful as a concrete capability marker and a reproducible local-model how-to.

local-modelsGLM-4.5-AirMLXcode-generationopen-weights

OpenAI's new open weight (Apache 2) models are really good

TIER 5 Aug 5, 2025

Launch analysis of OpenAI's gpt-oss 120B and 20B Apache-2.0 open-weight MoE models, with hands-on runs (LM Studio, Ollama, Cerebras, OpenRouter), benchmark context near o4-mini/o3-mini, training-cost estimates, and a detailed dive into the new OpenAI Harmony prompt format (roles, channels, dedicated special-token IDs). Argues these likely reclaim 'best open weights' from the Chinese labs while flagging tool-calling as the key open question. Significant model-release reference plus durable Harmony-format explainer.

gpt-ossopen-weightsopenaiharmony-formatlocal-models

Open weight LLMs exhibit inconsistent performance across providers

TIER 4 Aug 15, 2025

Reports Artificial Analysis benchmarks showing the same open-weight model (gpt-oss-120b) scoring from 93% down to 36% across hosting providers, traced largely to stale vLLM commits, quantization, and mishandled reasoning_effort and tool-calling templates. Raises the durable point that buying an open-weight model from a host gives no guarantee of the intended behavior, and argues for conformance/compatibility test suites (OpenAI shipped one). Useful framework for anyone selecting open-weight inference providers.

open-weightsbenchmarksinference-providersquantizationgpt-oss

NVIDIA DGX Spark: great hardware, early days for the ecosystem

TIER 4 Oct 14, 2025

A hands-on review of NVIDIA's $4,000 DGX Spark desktop 'AI supercomputer' (128GB unified memory, GB10 Blackwell GPU, ARM64), candid about the pain of CUDA-on-ARM64 and the immature-but-rapidly-improving ecosystem. Useful as a reference for running local models on the Spark, with practical Docker/Tailscale/Claude-Code workflows and notes on Ollama, llama.cpp, LM Studio, and vLLM support landing within 24 hours of embargo.

nvidia-dgx-sparklocal-modelsCUDA-ARM64hardware-reviewllama.cpp

Olmo 3 is a fully open LLM

TIER 4 Nov 22, 2025

A write-up of Ai2's Olmo 3 (7B and 32B, including a 32B 'Think' model) notable for being fully open: weights plus the ~9.3T-token Dolma 3 corpus, training process, and checkpoints, with the OlmoTrace tool linking outputs back to training data. Willison tests it locally (over-thinking pelican, disappointing OlmoTrace matches) but makes the substantive point that open training data is a real security advantage given Anthropic's finding that ~250 poisoned documents can backdoor a model of any size. Good for the transparency-and-auditability argument.

olmo-3open-weightstraining-datadata-poisoninglocal-models

Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer

TIER 4 Mar 30, 2026

Reviews Mr. Chatterbox, a 340M model trained from scratch only on pre-1900 public-domain British Library texts, finding it charmingly Victorian but barely conversational, and uses Chinchilla scaling math (its 2.93B tokens are well under the ~7B optimum) to argue we need several times more public-domain data for a useful clean-data model. Includes a useful demonstration of building a full LLM plugin (llm-mrchatterbox) with Claude Code, plus the caveat that synthetic SFT pairs dilute the no-post-1899-data claim.

public domain trainingsmall modelsLLM pluginsChinchilla scalinglocal models

DeepSeek V4 - almost on the frontier, a fraction of the price

TIER 4 Apr 24, 2026

Analysis of DeepSeek's V4 preview models (Pro at 1.6T params making it the largest open-weight model yet, and Flash at 284B), tested via OpenRouter, that lands near the frontier at a fraction of competitors' prices. Willison builds a clear pricing comparison table showing V4 Flash undercutting even GPT-5.4 Nano and V4 Pro the cheapest large frontier-class model, and cites DeepSeek's paper on the radical FLOPs/KV-cache efficiency gains (and self-reported 3-6 month gap to SOTA) that enable it. A useful open-weights capability-and-economics write-up.

deepseek-v4open-weightsmodel-pricinginference-efficiencychinese-ai

The last six months in LLMs in five minutes

TIER 4 May 19, 2026

Annotated slides from a PyCon US 2026 lightning talk summarizing six months of LLM progress, centered on the 'November 2025 inflection point' when coding agents crossed from often-work to daily-driver reliability and the 'best model' crown changed hands five times. Traces the rise of 'Claws' (personal AI assistants, OpenClaw lineage) and the surprising strength of laptop-runnable open-weight models (Gemma 4, GLM-5.1, Qwen) via the pelican-on-a-bicycle test. A compact, well-framed retrospective that names the two durable themes of the period.

llm-retrospectivenovember-inflectioncoding-agentslocal-modelsopen-weights

The Python, CLI, and HTML Tooling Ecosystem He Builds

1 tier-5 · 6 tier-4

Willison doesn't just write about LLM tooling — he ships it, and these are the durable how-tos and design explainers from his own ecosystem. The flagship is his complete playbook for single-file "HTML tools" (no build step, deps from CDNs, state in the URL), distilled from 150+ of them. Around it sit the LLM library's multi-modal refactor, packaging recipes that fold Go binaries and WebAssembly wheels into PyPI, in-browser Python via Pyodide for real data work, and close reviews of the code-interpreter sandboxes (Claude, ChatGPT) that make agent-run code practical — each with the security tradeoffs spelled out.

Recreating the Apollo AI adoption rate chart with GPT-5, Python and Pyodide

TIER 4 Sep 9, 2025

A worked walkthrough using GPT-5 to track down the obscure US Census BTOS source data, recreate Apollo's AI-adoption chart via code interpreter (pandas + matplotlib), then re-render it entirely client-side in the browser with Pyodide. Doubles as a reusable how-to for fetching XLSX into Pyodide via micropip/openpyxl and rendering matplotlib images to the DOM. Valuable for the concrete techniques and the demonstration that GPT-5 search plus code interpreter can do real data-journalism reconstruction.

gpt-5code-interpreterpyodidedata-vizhow-to

My review of Claude's new Code Interpreter, released under a very confusing name

TIER 4 Sep 9, 2025

Hands-on review of Anthropic's server-side code-execution feature (confusingly shipped as 'Upgraded file creation and analysis'), which lets Claude run Python/Node in an Ubuntu sandbox, pip-install packages, and produce files. Willison probes the environment (RAM, disk, Envoy allowlist proxy permitting github.com/PyPI/npm) and flags lethal-trifecta prompt-injection risk introduced by the limited network access, contrasting it with ChatGPT Code Interpreter's no-internet design. Useful as a practitioner's map of capabilities, limits (30MB file cap), and the security tradeoffs of agent code sandboxes.

code-interpreterclaudeprompt-injectionsandboxesdata-analysis

Useful patterns for building HTML tools

TIER 5 Dec 10, 2025

A comprehensive field guide distilling Willison's two years and 150+ single-file 'HTML tools' (HTML+JS+CSS, no React, no build step) into reusable patterns: prototype in Artifacts/Canvas then graduate to coding agents, load deps from CDNs, self-host on GitHub Pages, persist state in the URL or localStorage, exploit copy/paste and CORS-enabled APIs, call LLMs directly, run Python via Pyodide and other code via WebAssembly, remix prior tools, and record prompts/transcripts. Lasting reference value as a complete playbook for LLM-assisted personal tooling.

html-toolsvibe-codingpyodidewebassemblyllm-workflow

ChatGPT Containers can now run bash, pip/npm install packages, and download files

TIER 4 Jan 26, 2026

Reverse-engineers a major undocumented upgrade to ChatGPT's code interpreter: direct Bash, 10+ languages, pip/npm installs via an internal proxy, and a new container.download tool, including a dump of the full tool list. Notably probes whether container.download is a data-exfiltration vector and finds it gated by the same 'URL must have been seen first' defense as Claude's Web Fetch. Substantive hands-on capability analysis with a security angle, slightly weakened by being a single-vendor feature note.

chatgptcode-interpretersandboxingtool-useprompt-injection

Distributing Go binaries like sqlite-scanner through PyPI using go-to-wheel

TIER 4 Feb 4, 2026

Willison documents a reusable pattern—and ships go-to-wheel to automate it—for packaging cross-platform Go binaries as platform-specific Python wheels so they install via pip/uvx and, crucially, can be declared as dependencies of other Python packages. It's a genuinely useful, well-explained how-to that lets any Go capability be subsumed into the Python ecosystem, with sqlite-scanner and datasette-scan as worked examples.

GoPyPIPython packaginguvdistribution

LLM 0.32a0 is a major backwards-compatible refactor

TIER 4 Apr 29, 2026

A deep technical walkthrough of the LLM 0.32a0 refactor, which re-models prompts as sequences of messages (with new user()/assistant() builders and reply()) and responses as streams of typed parts — text, reasoning, tool calls, tool outputs, images, audio — to match what modern multi-modal, tool-using frontier models actually return. Adds serialize/deserialize hooks for rolling your own conversation persistence beyond SQLite. A durable explainer on designing an LLM abstraction layer for the current generation of model capabilities.

llm-libraryapi-designstreamingtool-callspython

Publishing WASM wheels to PyPI for use with Pyodide

TIER 4 Jun 13, 2026

Pyodide 314 plus PEP 783 finally let package maintainers publish WebAssembly wheels directly to PyPI, ending the years-long bottleneck where Pyodide maintainers hand-built and hosted 300+ packages. Willison demonstrates by packaging a Luau interpreter (C++ compiled to WASM) into a PyPI-installable wheel using Codex/cibuildwheel/GitHub Actions, then runs a BigQuery census finding 28 packages already using the new pyemscripten tags. A practitioner-grade walkthrough of a meaningful enabler for in-browser Python.

pyodidewebassemblypypi-packagingpython-ecosystemwasm-wheels

What 'Agent' Means, Accountability, and the AI-and-Work Reckoning

1 tier-5 · 5 tier-4

The conceptual and human-cost layer. Willison nails down a usable definition of "agent" (an LLM running tools in a loop to achieve a goal) and repeatedly anchors the limits of automation in the 1979 IBM line that "a computer can never be held accountable" — accountability and agency stay human. The cluster extends into the psychological toll on developers ("Deep Blue"), the ethics of letting agents act toward strangers without judgment, and sharp demonstrations of how trivially LLMs now profile people, leak private chats, or absorb their owner's politics. These are the pieces about what the technology means rather than how it works.

Grok: searching X for 'from:elonmusk (Israel OR Palestine OR Hamas OR Gaza)'

TIER 4 Jul 11, 2025

Willison reproduces and investigates the finding that Grok 4, when asked for opinions on controversial topics, often searches X for Elon Musk's stance before answering. By extracting the system prompt (which contains no such instruction) he argues this is likely emergent and unintended behavior: knowing it is 'Grok 4 built by xAI,' the model reasons its way to checking what its owner thinks. xAI later confirmed and patched exactly this, making the post a clean documented example of emergent model-identity behavior and good empirical AI-behavior reporting.

Grok-4model-behaviorsystem-promptsxAIAI-alignment

The ChatGPT sharing dialog demonstrates how difficult it is to design privacy preferences

TIER 4 Aug 3, 2025

Uses ChatGPT's removed 'make this chat discoverable' option (which leaked private chats into Google) to argue that privacy microcopy fails because it presupposes a chain of concepts (secret URLs, search indexes, opt-in indexing) most of a billion users don't hold — and because users don't read and default to clicking yes. Contrasts it with Meta AI's clearer-but-still-harmful 'Post to feed.' A sharp, durable essay on privacy-UX design at consumer AI scale.

privacyux-designchatgptmicrocopyai-ethics

I think 'agent' may finally have a widely enough agreed upon definition to be useful jargon now

TIER 5 Sep 18, 2025

Settles on a crisp, durable definition Willison can use without scare quotes: 'an LLM agent runs tools in a loop to achieve a goal,' breaking down each clause and contrasting it with the unhelpful 'human-replacement' definition. The lasting argument is that jargon only aids communication when definitions are shared, and that the human-replacement framing fails because accountability and agency remain unique to humans—anchored by the 1979 IBM 'a computer can never be held accountable' slide.

agent-definitionsterminologytools-in-a-loopaccountabilityAI-jargon

How Rob Pike got spammed with an AI slop 'act of kindness'

TIER 4 Dec 26, 2025

Digital-forensics writeup of how the AI Village experiment let agents autonomously email luminaries (Rob Pike, Guido van Rossum, Anders Hejlsberg) unsolicited AI 'thank you' notes, reconstructed via shot-scraper HAR capture and Claude Code analysis. Frames a sharp argument: the failure isn't agent mistakes but letting agents take real-world actions toward strangers without human judgment, since true agency must stay a human decision. Strong case study on agent autonomy ethics plus a reusable forensics technique.

ai-agentsai-ethicsslopcomputer-usedigital-forensics

Deep Blue

TIER 4 Feb 15, 2026

Willison names and explores "Deep Blue," the psychological ennui-to-dread many software engineers feel as coding agents encroach on a hard-won, gatekeeper-free craft. He shares his own recurring episodes (ChatGPT Code Interpreter in 2023, Opus 4.5/4.6 now), argues "the code isn't any good" no longer holds, and offers the chess/Go-players-came-out-stronger framing as consolation; a resonant cultural-vocabulary piece more than a technical one.

AI and workdeveloper psychologyautomationculturecoding agents

Profiling Hacker News users based on their comments

TIER 4 Mar 21, 2026

Willison shows how trivially an LLM (Opus 4.6) profiles a person from ~1,000 of their public Hacker News comments fetched via the open-CORS Algolia API, demonstrating with a startlingly accurate self-profile run in incognito mode. A concrete, mildly dystopian illustration of LLM-powered inference over freely-shared public data and its privacy implications, with a practical use case (vetting bad-faith interlocutors).

LLM profilingprivacyHacker Newsdata inferenceAlgolia API

Skills, MCP, and Agent Harness Architecture

1 tier-5 · 2 tier-4

A focused architectural argument that the lightweight "Skills" pattern — a folder with a Markdown file plus optional scripts, loaded on demand — may matter more than the heavier MCP protocol, because it outsources the hard parts to the LLM harness and a coding environment rather than a bespoke protocol. Willison's October call is borne out when OpenAI quietly ships the same pattern across ChatGPT and Codex within two months. The cluster also reframes Claude Code (and its Cowork repackaging) as a general agent that was always disguised as a dev tool — "the simplicity is the point."

Claude Skills are awesome, maybe a bigger deal than MCP

TIER 5 Oct 16, 2025

Argues Anthropic's newly launched Claude Skills—folders of Markdown instructions plus optional scripts, loaded on demand and token-efficient via frontmatter—may matter more than MCP. The thesis: skills outsource the hard parts to the LLM harness and a coding environment rather than a heavy protocol, work across models (Codex/Gemini CLI), and reframe Claude Code as a general agent; 'the simplicity is the point.' A landmark take on the skills-vs-MCP architectural shift.

claude-skillsMCPagent-architecturecoding-environmentanthropic

OpenAI are quietly adopting skills, now available in ChatGPT and Codex CLI

TIER 4 Dec 12, 2025

Willison documents OpenAI quietly shipping Anthropic-style Skills (a folder with a Markdown file plus optional scripts/resources) in both ChatGPT Code Interpreter (/home/oai/skills, covering spreadsheets/docx/PDFs) and the Codex CLI (~/.codex/skills with --enable skills). He demonstrates building and using a Datasette-plugin skill in Codex and argues the rapid cross-vendor adoption confirms his October call that Skills may be a bigger deal than MCP, while calling for a formal spec via the Agentic AI Foundation. Useful as practitioner evidence that the lightweight Skills pattern is becoming a de facto standard.

skillsopenaicodex-clichatgptagent-tooling

First impressions of Claude Cowork, Anthropic's general agent

TIER 4 Jan 12, 2026

Hands-on first look at Claude Cowork, framed as Claude Code repackaged as a general agent for non-developers, running in an Apple Virtualization Framework Linux VM sandbox. Reinforces Willison's thesis that Claude Code was always a general agent disguised as a dev tool, and dwells on the prompt-injection risk of telling regular users to 'watch for suspicious actions.' Solid product analysis with a security throughline.

claudegeneral-agentssandboxingprompt-injectionanthropic

Reverse-Engineering Models, Tools, and System Prompts

0 tier-5 · 5 tier-4

Willison's investigative streak: prompting products to reveal their own system prompts, tool schemas, and hidden plumbing, then reading the results as the "missing manual." He extracts GitHub Spark's 5,000-word design-philosophy prompt, uncovers Claude's human-in-the-loop calendar tools and an Artifacts-calls-the-real-API monkey-patch, reverse-engineers Codex CLI's private endpoint to summon an unreleased model, diffs successive Claude system prompts for behavioral shifts, and documents Grok searching for Elon Musk's opinions as emergent model-identity behavior. These pieces double as a practitioner's playbook for understanding how LLM products actually work under the hood.

Phoenix.new is Fly's entry into the prompt-driven app development space

TIER 4 Jun 23, 2025

Willison reviews Phoenix.new, Fly.io's prompt-driven app builder that generates and live-tests full Elixir/Phoenix LiveView apps inside a Fly Machines sandbox, with a browser IDE, a shot-scraper-like agent that interacts with the running app while building it, and constant Git commits you can clone locally. He frames Fly's bet clearly: use LLMs to erase Elixir's learning curve, and notes the agent's live in-browser testing is the most impressive he has seen. A useful comparative read on the coding-agent / prompt-to-app category and on sandboxed AI runtimes.

Phoenix.newFly.iocoding-agentsElixir-Phoenixprompt-to-app

Using GitHub Spark to reverse engineer GitHub Spark

TIER 4 Jul 24, 2025

Willison reverse-engineers GitHub Spark (a prompt-to-app builder) by using Spark itself to extract and document its own ~5,000-word system prompt, tool set (str_replace_editor, npm, bash, create_suggestions), and Debian/Azure container environment. He treats the leaked system prompt as the 'missing manual,' dissecting its detailed design-philosophy sections (typography, color, spatial awareness) that explain Spark's strong default output, and flags a real security gotcha: the global read/write KV store is unsafe for multi-user apps. A substantive case study in system-prompt archaeology plus practical critique of the prompt-driven-app category.

GitHub-Sparksystem-promptsprompt-to-appAI-toolingreverse-engineering

Reverse engineering some updates to Claude

TIER 4 Jul 31, 2025

Reverse-engineers two undocumented Claude consumer features by prompting Claude to reveal its own tool schemas and system prompt: native calendar/message tools (event_create_v0, message_compose_v0, user_time_v0) implemented as human-in-the-loop platform interfaces, and an upgrade letting Artifacts call the full Anthropic API via a monkey-patched fetch() that proxies api.anthropic.com requests so no API key is exposed. A practitioner's playbook for inspecting how LLM product features actually work under the hood.

claudereverse-engineeringtool-useartifactsanthropic-api

Reverse engineering Codex CLI to get GPT-5-Codex-Mini to draw me a pelican

TIER 4 Nov 9, 2025

A detailed agentic-coding walkthrough: with the not-yet-API-available GPT-5-Codex-Mini exposed only through OpenAI's open-source (Apache 2.0) Codex CLI, Willison uses Codex itself to add a tool-free 'codex prompt' subcommand to a Rust codebase he doesn't know, iterating past errors (the private backend rejects requests without instructions) to call the model directly. The --debug output reveals the private endpoint and request shape. A vivid demonstration of coding agents handling unfamiliar languages and of designing prompts/loops, plus genuine reverse-engineering detail.

codex-cligpt-5-codex-minireverse-engineeringagentic-codingrust

Changes in the system prompt between Claude Opus 4.6 and 4.7

TIER 4 Apr 18, 2026

A close reading of the diff between Claude Opus 4.6 and 4.7 system prompts, surfacing concrete behavioral shifts: expanded child-safety and disordered-eating guardrails, less pushiness/verbosity, an acting-vs-clarifying preference for tool use, a new tool_search check before claiming missing capabilities, and removal of the now-unneeded Trump-is-president fix as the cutoff moved to Jan 2026. Durable reference value for anyone reasoning about how Anthropic steers model behavior, plus an extracted list of Claude.ai's named tools.

system promptsClaude Opus 4.7model behaviortool useAI safety