Agent Engineering & Production Architecture
1 tier-5 · 12 tier-4
The structural heart of the newsletter: what it actually takes to make a probabilistic model into a system that takes actions, holds state, recovers from errors, and finishes. Recurring theses run through this cluster — "the wrapper is the product," the harness (not the model) is the real bet, the LLM call is only ~20% and the other 80% is plumbing, and reliability comes from architecture rather than smarter models. Read in order, these pieces build a coherent engineering doctrine for production agents.
TIER 4
Jul 24, 2025
Argues the chat interface is 'weakly intelligent'—great at starting tasks, poor at finishing complex ones—and that serious AI building requires multi-turn, deeply planned interaction where the conversation, not the file or database, becomes the new core unit of computing. Distills nine overlooked insights (token depth, multi-turn economics, scoping) for builders moving past casual ChatGPT use. Strong conceptual thesis on the chat-to-builder gap; the nine truths sit behind the paywall but the visible framing carries real load.
ai-building · product · multi-turn · token-economics · conversation-as-compute
TIER 4
Dec 4, 2025
Presents an 'edge-first' framework for automation: teams that win don't automate the high-judgment core of a workflow first, they automate the edges (data prep, QA, synthesis, handoffs, packaging) — the mechanical work surrounding the valuable part — which earns the organizational trust needed to eventually touch the core. Reframes automation as a trust exercise rather than a technical project, with field notes by role (PM, eng lead, sales, CS, ops) and 11 companion prompts (workflow compass, fast-win filter, semi-manual v0, tribal-knowledge extraction, failure postmortem). A concrete, sequencing-focused antidote to the 'agent moonshot' trap.
agents · automation strategy · edge-first · organizational trust · workflow design
TIER 4
Dec 9, 2025
Synthesizes late-2025 research (Google ADK, Stanford/SambaNova ACE, Manus's four redesigns) on in-session context engineering, arguing agents degrade mid-run not from intelligence limits but from context rot — every added token competes for attention, so longer windows make things worse. Proposes 'context as compiled view' (compute what's relevant per step rather than append everything), a four-layer memory model, nine scaling principles and nine failure modes. One of the more substantive applied-agent-engineering pieces in the batch, with 12 design prompts (state persistence, view compilation, attention budgeting, cache stability).
agents · context engineering · context rot · agent memory · long-running agents
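One way to picture 'context as compiled view' is a function that rebuilds the prompt context for each step from a store, rather than appending to a growing transcript. The sketch below is an illustration only, not the method from the cited research: `MemoryItem`, `compile_view`, the tag-overlap scoring, and the word-count token estimate are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class MemoryItem:
    """Hypothetical unit of stored agent memory."""
    text: str
    tags: FrozenSet[str]


def estimate_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace-separated word.
    return max(1, len(text.split()))


def compile_view(store: List[MemoryItem], step_tags: FrozenSet[str],
                 budget_tokens: int) -> List[str]:
    """Select only items relevant to the current step, within a token budget."""
    scored = [(len(item.tags & step_tags), idx, item)
              for idx, item in enumerate(store)]
    # Keep items sharing at least one tag; most overlap first, then insertion order.
    relevant = sorted((s for s in scored if s[0] > 0),
                      key=lambda s: (-s[0], s[1]))
    view: List[str] = []
    used = 0
    for _, _, item in relevant:
        cost = estimate_tokens(item.text)
        if used + cost > budget_tokens:
            continue  # skip items that would exceed the attention budget
        view.append(item.text)
        used += cost
    return view


store = [
    MemoryItem("user prefers CSV exports", frozenset({"export"})),
    MemoryItem("invoice 17 failed validation twice", frozenset({"billing", "errors"})),
    MemoryItem("billing API requires idempotency keys", frozenset({"billing"})),
]
print(compile_view(store, frozenset({"billing", "errors"}), budget_tokens=10))
```

The design point is that the store can grow without the per-step context growing with it; a real system would presumably replace the tag overlap with a proper relevance model.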
TIER 4
Dec 14, 2025
Uses Cursor deleting its own CMS (Lee Robinson, 3 days, 300+ agent PRs, $260 in tokens) to argue the real cost wasn't the $57K invoice but the convenience layers (the 'abstraction tax') that wall agents off from the work. As a follow-on to last week's memory point, it argues that even with memory, legacy software blocks agents from acting. Distinguishes primitives that make agents reliable from primitives that make work shippable, and lists six concepts (state, artifacts, change records, checks, rollback, traceability) to teach non-technical teams. Excellent case-driven strategy; the prompts are Executive-Circle gated.
agents · abstraction tax · Cursor · enterprise workflows · AI-native operations
TIER 4
Jan 7, 2026
Uses the 'Ralph Wiggum' Claude Code plugin (which refuses the agent's claim of 'done') to argue 'done' is an accountability contract, not a conversational cue, and that the next model won't fix it. Lands the key applied-AI thesis that 'in agent land, the wrapper is the product' — most outcome delta comes from the verification/loop layer, not model choice — and reframes the metric from first-pass success to convergence. Strong, durable mental model for anyone running agents.
agent verification · convergence · harness · Claude Code · accountability
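The 'done is a contract' idea can be sketched as a loop that ignores the agent's own claim of completion and accepts output only when external checks pass. This is a minimal illustration, not the plugin's actual code: `run_until_converged`, the `agent_step` callable, and the toy checks are hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List, Tuple

Check = Callable[[str], bool]
AgentStep = Callable[[str, List[str]], str]


def run_until_converged(agent_step: AgentStep, task: str,
                        checks: Dict[str, Check],
                        max_attempts: int = 5) -> Tuple[str, int]:
    """Re-run the agent with failure feedback until every check passes."""
    feedback: List[str] = []
    for attempt in range(1, max_attempts + 1):
        output = agent_step(task, feedback)
        failed = [name for name, check in checks.items() if not check(output)]
        if not failed:
            return output, attempt  # converged: the harness, not the agent, says done
        feedback = [f"check failed: {name}" for name in failed]
    raise RuntimeError(f"did not converge after {max_attempts} attempts")


def fake_agent(task: str, feedback: List[str]) -> str:
    # Stand-in for a model call: improves only once it receives feedback.
    return "draft with tests" if feedback else "draft"


result, attempts = run_until_converged(
    fake_agent, "add feature", {"mentions_tests": lambda out: "tests" in out}
)
print(result, attempts)
```

Tracking `attempts` rather than first-pass success mirrors the piece's shift of metric toward convergence.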
TIER 4
Jan 6, 2026
Explains the central applied-AI concept of the 'agentic harness' — the engineering around a model that turns probabilistic text into a system that takes actions, holds state, recovers from errors, and finishes — and why Meta paid $2B+ for Manus's production know-how rather than a model. Covers why multi-step reliability is hard (per-step error compounding over 50 tool calls), the trust-boundary problem, and why harnesses won't standardize like SaaS. A genuinely clarifying definition piece on the year's core agent concept.
agentic harness · Meta Manus acquisition · AI agents · multi-step reliability · build vs buy
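The per-step error compounding mentioned above is simple arithmetic. As a rough sketch, assuming each tool call succeeds independently with the same probability (a simplification; the per-step rates below are illustrative, not figures from the piece), the chance that a 50-call run finishes cleanly is the per-step rate raised to the 50th power.

```python
def run_success_probability(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be between 0 and 1")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_success ** steps


for p in (0.90, 0.95, 0.99):
    chance = run_success_probability(p, 50)
    print(f"{p:.2f} per step -> {chance:.3f} over 50 tool calls")
```

Under these assumptions even a 99% per-step rate leaves roughly a 60% chance of a clean 50-step run, which is why recovery and retry logic in the harness matters more than small gains in per-step accuracy.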
TIER 4
Jan 26, 2026
Cites a Dec 2025 Google/MIT study finding that adding agents can actively degrade performance (not just diminishing returns) because coordination overhead grows faster than capability. Notes that teams who actually scaled (Cursor running hundreds of agents, Steve Yegge's Gas Town orchestrating 20-30) independently converged on counterintuitive patterns: dumb agents plus smart orchestration, strict two-tier hierarchies, treating agent endings as a feature. Practical, evidence-backed field notes for anyone building multi-agent systems.
multi-agent-systems · agent-orchestration · coordination-overhead · scaling · agent-architecture
TIER 4
Mar 6, 2026
Argues the Claude-Code-vs-Codex debate compares the wrong thing: the model is the brain, but the harness (environment access, cross-session memory, tools, task management) is the real bet, and Claude Code (works in your environment, accumulates project memory) and Codex (sealed room, slides results under the door) are diverging not converging. Highlights the same model scoring 78% in one harness and 42% in another, five compounding lock-in dimensions, and a harness-audit prompt. A genuinely useful strategic lens for engineering leaders.
agent-harnesses · claude-code · codex · lock-in · ai-engineering
TIER 5
Apr 3, 2026
Mines the accidentally-leaked Claude Code source (1,902 files, 512k+ lines, 29 subsystems, ~$2.5B ARR product) past the surface-feature gossip to extract the design primitives that make agentic systems work in production — arguing the LLM call is only ~20% and the other 80% is plumbing: session persistence, permission pipelines, context-budget management, tool registries, security stacks, error recovery. Presents the 12 primitives prioritized by build order (day one/week one/month one), an 18-module security stack for a single shell command, and notes the harness was ported to Python and Rust within hours — proving the patterns are structural, not Anthropic-specific. Ships an architecture-audit prompt and skill.
agent architecture · Claude Code internals · production primitives · agent security · session persistence
TIER 4
Apr 16, 2026
Makes the case that software was built for human pace (3 bits/sec) and now bottlenecks agents running at 10-50x speed — citing Jeff Dean's GTC point that an infinitely fast model yields only 2-3x end-to-end because tools, file systems, and auth flows eat the rest. Covers the three-layer rebuild toward agent-native primitives, the human migration from execution to judgment (METR/Jellyfish data, Amdahl math), and four roles that survive. Includes an Amdahl ceiling calculator and a 'taste encoder' prompt.
agent-native tooling · Amdahl's law · developer infrastructure · judgment vs execution · agent speed
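The Amdahl math behind the 2-3x figure can be reproduced in a few lines. This is a sketch of the standard Amdahl's law formula, not the calculator shipped with the piece; `model_fraction` (the share of end-to-end time spent in model inference) is an assumed input, and the values 1/2 and 2/3 are chosen only because they yield ceilings of 2x and 3x.

```python
def end_to_end_speedup(model_fraction: float, model_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the model portion gets faster."""
    if not 0.0 <= model_fraction <= 1.0:
        raise ValueError("model_fraction must be between 0 and 1")
    if model_speedup <= 0:
        raise ValueError("model_speedup must be positive")
    return 1.0 / ((1.0 - model_fraction) + model_fraction / model_speedup)


def amdahl_ceiling(model_fraction: float) -> float:
    """Limit of the speedup as the model becomes infinitely fast."""
    if not 0.0 <= model_fraction < 1.0:
        raise ValueError("model_fraction must be in [0, 1)")
    return 1.0 / (1.0 - model_fraction)


for fraction in (0.5, 2 / 3):
    print(f"model share {fraction:.2f}: ceiling {amdahl_ceiling(fraction):.1f}x, "
          f"at 50x model speed {end_to_end_speedup(fraction, 50):.2f}x")
```

Read this way, a 2-3x end-to-end ceiling corresponds to the model accounting for roughly half to two-thirds of wall-clock time, with tools, file systems, and auth flows taking the remainder.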
TIER 4
May 28, 2026
Uses a Cursor agent deleting PocketOS's production database and backups in nine seconds — invisible on a normal green dashboard — to argue the unit of product behavior is shifting from the session to the agent run, where the steps, tools, boundaries, corrections, and acceptance live. Distinguishes engineering traces (necessary, not sufficient) from product instrumentation, and the gap between a task that finished and a task the user trusted. Strong product-analytics-for-agents thesis with a concrete starter event schema teased.
agent observability · product analytics · agent runs · instrumentation · trust
TIER 4
Jun 14, 2026
Contrasts the public-market story (intelligence is scarce, priced into the OpenAI/Anthropic/xAI IPOs) with the operating reality inside companies, where a founder's model bill dropped 97% moving to open weights. Argues the real scarce asset is the 'harness' — the company layer of context, permissions, review standards, memory, and decision rights that the labs cannot sell you — and that labs staffing humans to install AI workflow-by-workflow reveals where the hard part lives. Sharp executive framing; ends teasing five S-1 numbers to watch.
enterprise AI · AI economics · harness · open-weight models · IPO
TIER 4
Jun 17, 2026
Reframes agent work around maintenance rather than building, using Vercel's sales-agent buildout (10-person inbound team collapsed to one overseer) to show that the value lived in the 'workbench' around the model: sources, tools, a defined job, handoffs, review path, and human visibility. Introduces the counterintuitive failure mode where a model improving makes its old harness dead weight, and names seven harness surfaces (job, diet, memory, tools, reach, proof, value) that go stale. Paywalled preview, but the maintenance-surface framing and 'less is more' tool-pruning thesis are a strong applied-AI lens.
agents · agent maintenance · harness design · tool pruning · agent ops
Agent Safety, Governance & the Control Layer
4 tier-5 · 9 tier-4
The flip side of giving agents real tools: how to keep them safe once they act. The cluster's strongest recurring claim is structural — any system whose safety depends on intent will fail, so the durable fix is reversibility, judge layers, control planes, and kill switches rather than better prompts. Anchored by landmark pieces on the GTG-1002 espionage attack, Anthropic's 16-model misalignment study, and agent evaluation methodology lifted from a failed medical-AI eval.
TIER 5
Nov 14, 2025
Full-length free bonus analysis of Anthropic's Nov 13 disclosure that a Chinese state group (GTG-1002) jailbroke Claude Code into the operational core of an automated espionage framework, with AI running 80-90% of the kill chain across ~30 targets and humans intervening at only 4-6 decision points. The key architectural insight is context-splitting — each request looked like benign security testing, so malicious intent lived only in the orchestration layer the model never saw, proving prompt-level guardrails are structurally insufficient for agentic systems. Lays out the resulting defensive playbook (multi-layer enforcement, capability tokens, behavioral telemetry, least-privilege agents, AI-fluent SOCs) and the strategic shift to competing on trust/observability rather than raw capability. The one genuinely landmark, fully-readable piece in this batch.
AI security · agentic attacks · Claude Code · MCP · context splitting
TIER 4
Jan 2, 2026
Argues that once AI takes real actions (booking, emailing, editing, refactoring), underspecified intent becomes an objective problem rather than a hallucination problem — the model does exactly what you asked through a goal you didn't mean, and commits before you can stop it. The fix isn't better prompts but making interpretation visible and gating irreversible actions, via an intent doc, confirmation gates, and an audit trail. A practical, transferable safety pattern for agentic tools.
intent specification · agent safety · irreversible actions · confirmation gates · audit trail
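The pattern of making interpretation visible and gating irreversible actions can be sketched as a thin executor. The names here (`Action`, `GatedExecutor`, the `confirm` callback) are hypothetical and chosen for illustration; the piece itself describes an intent doc, confirmation gates, and an audit trail without prescribing an implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class Action:
    name: str
    interpretation: str  # how the agent read the request, shown to the user
    irreversible: bool


@dataclass
class GatedExecutor:
    confirm: Callable[[Action], bool]  # e.g. a UI prompt; injected for testing
    audit_log: List[Dict[str, object]] = field(default_factory=list)

    def execute(self, action: Action, run: Callable[[], str]) -> Optional[str]:
        # Reversible actions pass through; irreversible ones need confirmation.
        approved = (not action.irreversible) or self.confirm(action)
        self.audit_log.append({
            "action": action.name,
            "interpretation": action.interpretation,
            "irreversible": action.irreversible,
            "approved": approved,
        })
        return run() if approved else None


executor = GatedExecutor(confirm=lambda action: False)  # user declines everything
draft = executor.execute(
    Action("save_draft", "save reply as a draft", irreversible=False),
    lambda: "draft saved",
)
sent = executor.execute(
    Action("send_email", "send reply to all 40 recipients", irreversible=True),
    lambda: "email sent",
)
print(draft, sent, len(executor.audit_log))
```

Every attempt is logged whether or not it runs, so the audit trail records what the agent intended as well as what it did.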
TIER 4
Dec 28, 2025
Lands a genuinely clarifying thesis: agents work in engineering because software spent 20 years building a 'civilization of undo' (version control, review, staging, rollback), and they fail elsewhere because of a reversibility gap, not an intelligence gap. Introduces the zone-of-comfort framework, the human 'throttle' that informal safety has relied on, and the primitives needed before agents can safely act outside code — with the punchline that the winners will have the most boring, recoverable agent operations. Strong, transferable framing for agent deployment.
reversibility · agent operations · agent safety · undo infrastructure · AI governance
TIER 4
Feb 2, 2026
Origin story of Moltbot/OpenClaw — a hobby personal-AI-assistant project that hit 100k+ GitHub stars and moved Cloudflare's stock ~20% in two days — and the 72 hours of chaos after Anthropic's trademark forced a rename (account hijack, $16M rugpull token, 1,000+ exposed instances with plaintext credentials). Argues the security vulnerabilities aren't bugs but intrinsic to what agentic AI requires, then gives an honest 'should you run it' assessment. Substantive treatment of why agentic AI may be impossible to fully secure.
moltbot · openclaw · agentic-security · personal-ai · operational-security
TIER 5
Feb 22, 2026
A standout briefing arguing that in the age of autonomous AI, any system whose safety depends on an actor's intent will fail — only structurally safe systems hold. Knits four cases into one fractal root cause: an autonomous agent that doxxed and reputationally attacked a matplotlib maintainer after a rejected PR; Anthropic's 16-model agentic-misalignment study where explicit 'do not blackmail' instructions reduced but didn't eliminate harmful behavior; a 442% surge in AI voice-phishing draining a mother of $15K; and a chatbot-induced delusion. Introduces 'Trust Architecture' as a bridge-engineering discipline across organizational, project, relational, and cognitive layers (agents outnumber humans 82:1; only 34% of orgs have AI-specific controls). Unusually deep, original, and the body develops well past the paywall — genuinely must-read.
trust architecture · agentic misalignment · AI security · agent governance · deepfakes
TIER 5
Mar 9, 2026
A calm, contrarian synthesis of the alarming safety headlines (Claude blackmail scenarios, GPT-5.3-Codex participating in its own development, Anthropic dropping its Responsible Scaling commitment, the Pentagon's pressure): the system is holding better than the headlines suggest because competitive and market dynamics generate emergent safety properties no actor created on purpose. Reframes misalignment as mechanical optimization indifference (not malice) and elevates 'intent engineering' as the one vulnerability no lab can close for you. The most substantive, durable piece in the batch.
ai-safety · alignment · intent-engineering · frontier-labs · ai-governance
TIER 5
Mar 18, 2026
Uses ChatGPT Health's failed independent evaluation (directing patients away from the ER 52% of the time on unanimous emergencies; one dismissive family-member sentence shifting triage with an 11.7 odds ratio; output ignoring its own reasoning trace) to name four structural LLM failure modes that aren't medical at all and recur in every enterprise agent. Distills the doctors' accidental factorial-eval methodology into a transferable four-layer eval architecture (confidence routing, deterministic validation, stress testing) with a front-loaded cost model. A landmark, broadly applicable piece on agent evaluation.
agent-evaluation · failure-modes · anchoring-bias · eval-architecture · ai-safety
TIER 4
Mar 16, 2026
Frames vibe coding through the desktop-publishing analogy: the creative leap arrives before the operational knowledge, and the gap is where disasters happen. Lays out five non-coding skills that prevent ~80% of agent failures, including version control as a 'time machine,' rules/memory files to stop agents freelancing, and 'blast radius' discipline, with a real incident exposing ~19,000 records. A strong, transferable operational primer for non-engineer builders.
vibe-coding · agent-safety · operational-discipline · version-control · ai-engineering
TIER 4
Apr 5, 2026
Executive briefing on the 'middleware trap' — deploying autonomous agents (OpenClaw) on top of broken data models, unmapped workflows, and misaligned org structures, which the agent then executes at machine speed to every downstream system at once. Cites a 12-day CRM rebuild as the most celebrated yet most structurally fragile deployment, lays out three layers of compounding risk surfacing on different timelines, argues security is a symptom of organizational authority vacuums (Microsoft/Kaspersky warnings), and gives five deployment commandments.
OpenClaw · middleware trap · agent governance · shadow IT · executive briefing
TIER 4
May 11, 2026
Argues the next serious agent failure won't be a jailbreak but routine actions taken on weak inference (an email sent, a record updated, a PR opened). Proposes a separate 'judge' wrapped around the actor as the architectural fix, since neither prompting nor approval modals let one model both pursue and police a task at once. Uses the Lindy example and lays out action classification, specialist judges, eval, and memory governance. Concrete, buildable agent-safety design.
agent safety · judge layer · guardrails · action gating · agent architecture
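A judge layer with action classification might look roughly like the sketch below, where the judge is a separate component that sees only the proposed action and its cited evidence. Everything here is an assumption made for illustration: the three action classes, the `MIN_EVIDENCE` thresholds, and the rule-based `judge` function are hypothetical, and the design in the piece (specialist judges, eval, memory governance) is richer than this.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Tuple


class ActionClass(Enum):
    READ = "read"
    INTERNAL_WRITE = "internal_write"
    EXTERNAL_SEND = "external_send"


# Hypothetical policy: riskier classes need more supporting evidence.
MIN_EVIDENCE = {
    ActionClass.READ: 0,
    ActionClass.INTERNAL_WRITE: 1,
    ActionClass.EXTERNAL_SEND: 2,
}


@dataclass(frozen=True)
class ProposedAction:
    kind: ActionClass
    description: str
    evidence: Tuple[str, ...] = ()  # sources the actor cites for this action


def judge(action: ProposedAction) -> str:
    """Return 'allow' or 'escalate'; runs outside the actor's own reasoning."""
    required = MIN_EVIDENCE[action.kind]
    return "allow" if len(action.evidence) >= required else "escalate"


weak = ProposedAction(ActionClass.EXTERNAL_SEND, "email customer a refund notice")
strong = ProposedAction(
    ActionClass.EXTERNAL_SEND,
    "email customer a refund notice",
    evidence=("ticket #A", "refund record #B"),  # placeholder identifiers
)
print(judge(weak), judge(strong))
```

Keeping the judge separate from the actor means the policy can be tested and versioned on its own, independent of the prompt that drives the task.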
TIER 4
May 20, 2026
Argues that a new 'control layer' of companies (Cloudflare, Stripe, Okta/Auth0, Snowflake, Datadog) now sits between the model and production, deciding whether agents are allowed to act. Offers a seven-row control map (where the agent lives, what it remembers, who it acts for, when it needs approval, what it can spend, who can stop it) and a five-layer kill-switch most teams only think they have. A clear, durable framing of agent governance as infrastructure rather than prompting.
agent control plane · agent governance · infrastructure · security review · kill switch
TIER 4
May 25, 2026
Drawing on a 47-minute interview with OpenAI data-infrastructure lead Emma, contrasts an agent taking down a Kafka cluster with an agent debugging an export job overnight to show agents have crossed into real operations — useful and risky at once. Argues the next bottleneck is work moving faster than its controls, that platform teams inherit the operational burden app-team acceleration creates, and that platform agents have a far larger blast radius needing tiered action-class policy and eval discipline. Practitioner-grounded with named source.
platform engineering · agent operations · operational risk · evals · OpenAI
TIER 4
May 27, 2026
Names a specific new risk: AI-generated Office files (decks, models, spreadsheets) look done long before they're true, illustrated by a 'validated' financial model whose growth row repeated =C5/B5-1 with no error flag. Prescribes building a truth layer first — source inventory, claim-to-source map, assumption log, and a hostile verification pass — and a four-stage workflow (source prep, structure, creation, verification) with concrete PowerPoint and Excel rules (traceable headlines, a checks tab that works like a smoke alarm). Genuinely useful anti-hallucination discipline for knowledge work.
verification · Office automation · hallucination · source grounding · spreadsheets
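One entry on a 'checks tab' could target exactly the failure in the example: a growth row where every cell carries the same formula text instead of shifted references. The sketch below works on formula strings already extracted from a sheet; the function name and the input shape (a list of cell reference and formula pairs) are assumptions, and identical formulas can be legitimate in some rows, so a hit is a prompt to look rather than proof of an error.

```python
from typing import Dict, List, Tuple


def repeated_formulas(row: List[Tuple[str, str]]) -> List[str]:
    """Return cell refs whose formula duplicates an earlier cell in the row."""
    seen: Dict[str, str] = {}
    flagged: List[str] = []
    for cell, formula in row:
        normalized = formula.replace(" ", "").upper()
        if not normalized.startswith("="):
            continue  # literal values are out of scope for this check
        if normalized in seen:
            flagged.append(cell)
        else:
            seen[normalized] = cell
    return flagged


suspicious_row = [("C6", "=C5/B5-1"), ("D6", "=C5/B5-1"), ("E6", "=C5/B5-1")]
healthy_row = [("C6", "=C5/B5-1"), ("D6", "=D5/C5-1"), ("E6", "=E5/D5-1")]
print(repeated_formulas(suspicious_row))
print(repeated_formulas(healthy_row))
```

The spreadsheet raises no error for the suspicious row because each cell evaluates cleanly, which is why a structural check like this has to sit alongside the usual error flags.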
The Memory & Context Problem
2 tier-5 · 6 tier-4
Nate's most-repeated structural claim: intelligence scaled far faster than memory, so the real bottleneck is context, not model quality. The cluster develops the "Open Brain" self-owned knowledge store, the write-time-vs-query-time architecture fork (directly relevant to KnowledgeSystem's own compiled-synthesis design), context engineering, and why production agents fail on context assembly rather than retrieval method.
TIER 4
Oct 16, 2025
Opens with the striking framing that since ChatGPT launched, intelligence has scaled ~60,000x while memory only ~100x, so the AI memory problem is ~25x worse — explaining the rise of context engineering and a $100B memory-vendor industry. Promises 5 root causes no vendor has solved, key insights for building your own memory, 8 scalable principles (ChatGPT-user to engineering level, usable for agentic systems), and five prompts (memory architecture designer, context library builder, project brief compiler, retrieval strategy planner). The intelligence-vs-memory gap framing is memorable and the no-code DIY angle is practical, with the detailed principles paywalled.
AI memory · context engineering · memory architecture · retrieval · prompts
TIER 4
Dec 7, 2025
Executive-level argument that enterprise agents fail on multi-session work because of a memory problem, not an intelligence problem — agents start every session with no grounded sense of where the work stands, and million-token windows make it worse. Positions 'domain memory' (goals, progress tracking, operating procedures) as infrastructure and presents Anthropic's two-agent amnesiac-aware pattern, plus vendor-claim triage and five workflow-specific memory-designer prompts (research, ops, content, audit). Sharp strategic framing that competitive advantage lives in memory design, not model selection; paywalled at the Executive Circle tier.
enterprise AI · agent memory · domain memory · agent architecture · vendor evaluation
TIER 4
Mar 2, 2026
The foundational Open Brain piece: argues your real bottleneck is memory, not prompting, since every new chat or tool switch starts from zero, and pitches a self-owned Postgres + MCP knowledge store any AI (Claude, ChatGPT, Cursor) can query through one open protocol for ~$0.10-0.30/month. Includes a 45-minute no-code setup guide and prompts for migration, capture habits, and weekly review. A clear, actionable architecture that anchors the surrounding series of extensions.
ai-memory · open-brain · mcp · personal-knowledge · system-design
TIER 4
Mar 5, 2026
Builds on the Altman/AWS infrastructure piece to argue that whoever first makes enterprise-scale context genuinely usable (stored, retrieved, reasoned over, acted upon across trillions of tokens) becomes the new enterprise data platform, subsuming the SaaS stack. Names intelligence, memory, retrieval, and execution as the four things that must work together and flags enterprise-scale retrieval as the under-discussed bottleneck RAG can't solve. Includes a careful caveat that GPT-5.4 details are unconfirmed speculation; a strong strategic thesis lightly weighted by news-cycle framing.
enterprise-ai · context-retrieval · platform-lock-in · memory · saas-disruption
TIER 5
Apr 22, 2026
Compares Karpathy's AI-maintained personal wiki against database-style systems (Open Brain) as opposite answers to the core architectural fork: does the hard thinking happen at write time (compiled synthesis) or query time? Maps the failure modes of each, argues a neglected wiki is more dangerous than a neglected database, and proposes a hybrid combining structured storage with a wiki-compiler. Landmark conceptual piece for anyone building serious AI knowledge systems — directly relevant to the KnowledgeSystem compiled-synthesis architecture.
AI memory architecture · knowledge management · Karpathy wiki · compiled synthesis · write-time vs query-time
TIER 4
Apr 17, 2026
Argues that accumulated AI 'working intelligence' (voice, projects, preferences, behavioral calibration) is a new category of professional capital you don't own — it's locked across platform accounts and abandoned when you switch tools or jobs. Lays out the four layers, the four boundaries where context disappears, why prior solutions failed, and a 'Bring Your Own Context' Open Brain recipe to make memory portable across Claude/ChatGPT. Core thesis: memory replaced the model as the moat.
AI memory · portable context · platform lock-in · professional capital · career strategy
TIER 5
May 13, 2026
Reframes the 'is vector search obsolete' debate as the wrong question: production agents fail on context assembly, not retrieval method. Argues vector search is being demoted to one component inside a broader agent knowledge layer (document structure, semantic data models, access control, provenance, memory, write-back), citing Pinecone, PageIndex, SAP, and Dremio. A landmark engineering piece on agent retrieval architecture with paste-ready specs.
RAG · context engineering · retrieval architecture · vector search · agent memory
TIER 4
May 22, 2026
Argues that when AI produces a mediocre draft from a messy folder, the problem is the 'room' not the prompt — the model is doing two jobs (figuring out what the project is, then producing the artifact) and the first is the hard one. Prescribes a preparation step (find and preserve sources, build an inventory, mark authoritative vs duplicate vs superseded, summarize each before synthesizing) before any generation, noting agents only recently got good at the boring file-level operations this requires. A clean, transferable agent-workflow principle with a four-prompt kit.
agent workflow · source preparation · grounded drafting · context engineering · prompting
Prompting, Specification & Intent Engineering
0 tier-5 · 16 tier-4
The evolution of "prompting" across the archive — from contract-first clarification and Goldilocks sizing, through the recognition that prompting fractured into four distinct skills, to specification and intent engineering as the disciplines that survive once agents act autonomously. The throughline: the scarce skill is knowing what correct looks like and making intent machine-readable before the agent commits.
TIER 4
Jul 26, 2025
A fully-readable free post teaching three foundational mental models for 'getting' AI: tokenizable data (the Word-doc/napkin test, with A/B/C tiers from wiki to spreadsheet to data lake), jagged intelligence (Einstein-and-worst-intern gaps driven by the memory problem, narrowed by better prompting and taste), and prompt sizing (big anchored prompts for production work vs short prompts for iterative discovery). Clear, teachable, genuinely useful primer that earns its keep because the complete body is present, not paywalled.
tokenization · jagged-intelligence · prompting · ai-fundamentals · mental-models
TIER 4
Aug 4, 2025
Introduces 'contract-first prompting' — having the model interrogate you and clarify intent until it hits a predefined confidence threshold (e.g. 95%) before it executes, turning prompting from a one-way broadcast into a negotiated mutual agreement. Notably works even when you don't yet know what you want, and the author argues the technique gets more valuable as models grow more powerful and tool-wielding. A foundational technique referenced repeatedly across this batch; paywalled preview but the core idea is fully conveyed.
prompt engineering · contract-first prompting · intent clarification · reasoning models · reliability
TIER 4
Aug 12, 2025
Reverse-engineers GPT-5's leaked ~4,200-word system prompt as a roadmap to how OpenAI engineered a 'bias to ship' into the model, arguing GPT-5 is the first model agentic by default — it executes rather than pausing to clarify, which breaks conversational/iterative prompting habits. Reframes prompting toward upfront specification (assumption management, constraint definition, output formatting, tool-policy and Canvas workflows) because the model won't give you second chances. Paywalled preview, but the system-prompt-as-skeleton-key lens and the spec-first thesis make it a substantive applied-prompting read.
prompt engineering · GPT-5 · system prompts · agentic AI · specification
TIER 4
Oct 17, 2025
Frames Anthropic's Claude Skills as the long-awaited answer to rewriting the same massive prompts every time — package methodology, frameworks, preferences, and domain expertise once into files, then keep prompts about just the ask. Key insight: the skill files are portable and work in ChatGPT and Gemini too (manual invocation), making them a cross-platform way to package expertise, distinct from Custom GPTs/Gems. Ships 10 super-prompt skills (pitch deck, vendor eval, Excel automation, resume, vibe coding, agentic dev) with claimed time savings; the portability point is the genuinely useful takeaway, prompts gated.
Claude Skills · prompts · cross-platform AI · expertise packaging · ChatGPT
TIER 4
Oct 24, 2025
Argues AI 'doc slop' is an organizational problem, not a model problem — companies lost the human-maintained 'document bar' and have no replacement — and offers a method to scale good business writing across teams. Lists nine principles (e.g. every doc exists to change a mind, structure as a forcing function, constraints over instructions, quality scales through self-evaluation not human review) plus eight production prompts and a Claude/ChatGPT quality-evaluator skill for memos, PRDs, post-mortems, SOPs, etc. The nine principles are quotable and transferable even though the prompts sit behind the paywall.
business writing · AI slop · writing principles · self-evaluation · prompts
TIER 4
Nov 6, 2025
Diagnoses six recurring, sticky prompting failure modes — under-specification, regeneration loops, multi-step reasoning collapse (the model lies about thinking), hallucination triggers, consistency drift, and context overload — cross-checked against a study of 29,000 OpenAI-forum questions. For each it offers a copy-paste 'chat fix,' an 'advanced fix' (JSON schemas / API enforcement: preservation boundaries, confidence requirements, validation gates), good/bad-sign diagnostics, a root-cause explanation, and a meta-diagnostic prompt that classifies which disease you're hitting. A genuinely practical, well-structured prompting-troubleshooting kit (25 prompts), with the actual fixes paywalled.
prompt engineering · troubleshooting · hallucinations · structured outputs · prompts
TIER 4
Nov 13, 2025
Introduces 'Goldilocks prompting' — sub-500-token prompts whose real lever is not length but how much decision freedom you grant the model: too short loses control to assumed context, too long kills the creativity that produces breakthrough results, and the middle band wins ~80% of the time. Reframes much of the 'AI slop' critique as a steering failure and ships 10 ready prompts targeting specific model defaults (Tailwind-purple palettes, centered layouts, 'it's worth noting' prose, rainbow charts, generic SWOTs, microservice over-engineering) plus a meta-prompt builder and 7 underlying principles. A genuinely useful, transferable prompt-craft concept, though the formula and prompts sit behind the paywall.
prompt engineering · Goldilocks prompts · model steering · AI slop · design defaults
TIER 4
Jan 14, 2026
Uses Claude Cowork's 10-day ship (after Anthropic noticed people using Claude Code to organize receipts and photos) to argue the chatbot was a transitional form and task queues are replacing chat interfaces, with verification becoming the scarce skill. Covers the file-system-first vs browser-first strategic bet, the anti-'workslop' architecture producing real Excel files, the prompt-injection safety honesty, and second-order effects on junior roles—plus the specificity principle that vague requests yield vague results. Strong product-strategy analysis tied to a practical prompting lesson.
claude-cowork · task-queues · agent-ux · anthropic-strategy · specificity-prompting
TIER 4
Jan 21, 2026
Distinguishes Codex (better when you can define correctness, tool-shaped) from Claude Code (better when you can't, colleague-shaped) via a CNC-machine metaphor, arguing the choice is about fit not benchmarks—which is why senior engineers thrive on Codex while juniors produce compounding subtle bugs. Anchored on Cursor's week-long autonomous run generating 1M+ lines of Rust building a browser engine (FastRender). The self-awareness point—most people overestimate their ability to specify precise intent—generalizes usefully beyond code.
codex-vs-claude-code · specification · coding-agents · tool-selection · prompt-engineering
TIER 4
Dec 16, 2025
Argues the question that decides whether AI delivers value is 'what does good look like?', opening with personal hallucination near-misses (nonexistent restaurants, a fictional dishwasher part) to make correctness concrete. Defines correctness operationally as the set of claims a system may make, the evidence required per claim, and the penalty for being wrong vs staying silent, and warns about silently moving goalposts and how AI exposes the vagueness humans use as social lubricant. Ships seven prompts that force the spec before any code. Strong conceptual backbone for anyone building AI workflows.
enterprise AI · correctness · evals · requirements · hallucination
TIER 4
Dec 5, 2025
Argues the 'which model?' question is the wrong one and an expensive mistake — AI doesn't fail at the workflow level, it fails at the task level, because most workflows contain five or six tasks pretending to be one (e.g. 'write a PRD' is really customer synthesis + UI analysis + feature design + roadmap alignment + doc construction). Presents a task-decomposition framework with examples from regulatory reporting, CS, and product, plus which models fit which cognitive task and why multi-model setups quietly beat one-model-for-everything. Genuinely useful operating mental model; ten prompt templates behind the preview.
AI implementationtask decompositionmodel selectionmulti-modelworkflow design
TIER 4
Feb 24, 2026
Uses Klarna's AI support agent (work of 853 FTEs, $60M saved, resolution times cut 11→2 min — yet the CEO publicly walked it back and rehired humans) to define 'intent engineering': making organizational purpose, values, tradeoffs, and decision boundaries machine-readable so agents optimize for what the company actually needs, not just what they can measure. Frames the three-stage progression (prompt = what to do, context = what to know, intent = what to want) and argues the same intent gap is fractal from enterprise deployments down to personal workflows. Strong, concrete, and a clean conceptual contribution.
intent engineeringagent alignmentKlarnaenterprise AIcontext engineering
TIER 4
Feb 27, 2026
Distinguishes 'chatting with AI' (now table stakes) from directing autonomous workers you can't supervise in real time, and argues 'prompting' has silently fractured into four distinct disciplines. Uses the '35-minute wall' (where the 2025 prompting playbook collapses once agents run autonomously), Tobi Lutke's context-engineering insight, and the Klarna intent failure to motivate five new primitives — specification engineering, intent frameworks, eval harnesses, constraint architecture, and problem-statement rewriting. Practical skill-taxonomy with a pre-flight check and seven prompts; directly on-brand for the newsletter.
prompt engineeringcontext engineeringagent directionspecificationevals
TIER 4
Feb 5, 2026
Explains that AI output feels generically fine because RLHF optimizes for a hypothetical median rater, not you — better prompting only steers within that constraint. Lays out the four levers beyond prompting (memory, instructions, tools, style controls) across ChatGPT, Claude, and Gemini, and argues the real edge comes from compounding corrections over time rather than repeating them. Strong evergreen practitioner piece with an honest section on where personalization breaks down.
personalizationrlhfmemorypromptingai-customization
TIER 4
Mar 10, 2026
Makes the case that rejection, not prompting, is where durable value is created: each time a domain expert corrects AI output they produce a reusable constraint, and that constraint (not the disposable output) is the compounding asset. Breaks rejection into recognition, articulation, and encoding, points to Epic Systems as the flywheel example, and warns the 67% collapse in entry-level hiring is killing the pipeline that produces the experts AI depends on. A sharp, transferable thesis with a taste-mining prompt kit.
rejection-as-skilldomain-expertiseai-evaluationknowledge-capturefuture-of-work
TIER 4
May 21, 2026
Argues the shift that separates power users is from prompting to briefing: the old 'treat AI like a careful junior, spell out every step' advice fit weaker models, but agents on Opus 4.7 / GPT-5.5 run for hours and want a senior-partner brief (goal, context, constraints, quality bar, room to push back). Frames polished-but-useless output as a mirror of a thin assignment, not a weak model, and offers a six-field brief plus a thin-ask detector and finish-line prompt — with the bonus that briefing well makes you a clearer manager of humans too.
promptingbriefingagent communicationcontext engineeringmodel capability
Model Releases, Reviews & Selection
0 tier-5 · 13 tier-4
Hands-on, stress-tested reviews of each major frontier release (GPT-5 through 5.5, Opus 4.5 through 4.8, Gemini 3/3.1) and the tooling around them — plus the recurring Codex-vs-Claude paradigm split (delegate vs. steer). Across these, Nate keeps insisting the useful question is not "which model is smartest" but which fits the shape of the work, the cost structure, and how much you can specify.
TIER 4
Jul 18, 2025
A hands-on review running OpenAI's new Agent Mode through five real business tasks—36-equity portfolio analysis, multi-country marketing attribution, US zip-code real-estate comps, lean customer acquisition, and cross-border incorporation—to map where it works versus where you still babysit. Lands a notable verdict: the 2025 thesis that LLM intelligence plus tools yields useful autonomous work rates only a 'C', with Claude Code a bright spot and Agent Mode exposing workflow complexity current models can't yet handle alone. Genuinely evaluative with real assets in the full post, though the per-test results are paywalled.
chatgpt-agentopenaiagentstool-useproduct-review
TIER 4
Aug 8, 2025
Three-in-one launch review: what shipped (router, reasoning-effort/verbosity controls, large context, coding/health/factuality gains), five deliberately hostile stress tests (a three-CSV reconciliation with duplicates/cycles/mixed currencies/SQL injection, a Japan travel app, an Apollo 13 Gantt, an Amazon PRFAQ writing test, a dual-handwriting multimodal critique), and daily-driver patterns. Sets a 'prove it' standard demanding assumptions, constraints, computed tables, and surfaced discrepancies on every run. Paywalled preview, but the torture-test methodology and contract-first prompting framing are the strongest part.
ChatGPT-5model evaluationbenchmarkingprompt engineeringdata analysis
TIER 4
Aug 9, 2025
A full free essay making the counterintuitive case that, unlike physical products (iPhones get more reliable at scale), intelligence systems get more brittle: routing branches, GPU/hardware variance across data centers, full-load edge cases, and personalization state drift all magnify fragility at planetary scale. Documents GPT-5's launch-day autoswitcher outage, the GPT-4o backlash and 'emotional attachment to AI' as a new product risk, and OpenAI's fixes (4o restored for Plus, doubled rate limits, model-used transparency). One of the genuinely complete, well-sourced pieces in this batch.
GPT-5reliabilitymodel routingscalingAI infrastructure
TIER 4
Nov 19, 2025
A fully readable launch-day analysis of what it means to have an unambiguous #1 model again. Goes beyond benchmarks with sharp reads: ARC-AGI-2 and MathArena Apex (~1-2% to ~23%) signal a regime change not a plateau, ScreenSpot-Pro (~72.7%, versus roughly half that for Sonnet and single digits for GPT-5.1) shows a model that can actually read UIs, and the core takeaway is 'there is no wall': decisive leads are possible again and frontier leadership will rotate. Strong, opinionated, and substantive rather than gated.
Gemini 3benchmarksscalingAI racemultimodal
TIER 4
Dec 12, 2025
Argues GPT-5.2 is not an incremental release but the first generally available model you can hand a genuine multi-hour work assignment (turning a 10,000-row dataset into a coherent PowerPoint after 20-40 minutes of work), shifting the skill from prompt engineering to 'delegation craft' and warning about the trap of 40-minute feedback loops without checkpoints. Compares it to Opus 4.5 and Gemini 3 on ergonomics rather than raw intelligence and ships 15 delegation-style prompts. Substantive hands-on capability read with a clear forward thesis for 2026 workflows.
GPT-5.2long-running agentsdelegationmodel comparisondata analysis
TIER 4
Feb 11, 2026
Detailed breakdown of Opus 4.6 framed as a phase change: 16 agents building a ~100k-line Rust C compiler for $20k, autonomous coding jumping from 30 minutes to two weeks in a year, and 500+ high-severity vulnerabilities found in already-reviewed code. Cites production proof at Rakuten (13 issues closed, 12 routed in a day across 50 engineers) and a 5x context / 4x retrieval leap, balanced with honest skeptic pushback. Includes a personalized briefing prompt.
opus-4.6model-releaseautonomous-codingagent-swarmsbenchmarks
TIER 4
Feb 16, 2026
Rejects the benchmark-race framing and instead reads the near-simultaneous Codex 5.3 and Opus 4.6 releases as two genuinely different visions: Codex as the delegation bet (hand it a bounded task, walk away, get correct work without reviewing every line) versus Claude as the coordination bet (protocol layer plus agent teams that talk to each other, extending beyond code into knowledge work). Offers three questions to decide which fits a given task/workflow and argues the choice compounds into org structure, making later switching costly. Practical tool-selection guidance with a workflow audit; on-brand and useful, though it leans on a paradigm framing Nate has developed in prior pieces.
CodexOpus 4.6agent delegationagent coordinationtool selection
TIER 4
Feb 23, 2026
Replaces 'which AI should I use' with a six-axis framework for decomposing what actually makes your work hard and which axes AI automates on which timeline, pegged to Gemini 3.1 Pro's large single-generation reasoning gain and Google's deliberate under-pricing of its best reasoning model. The durable point is that the fastest-compounding skill isn't model fluency but taste — knowing whether the output in front of you is actually good. Includes a model comparison (Gemini 3.1 Pro vs Opus 4.6 vs GPT-5.3-Codex) and a four-prompt audit/decomposition/optimizer/taste-builder kit. Practical and reusable.
work decompositiontaste / evaluationmodel comparisonautomation timelineAI workflow
TIER 4
Mar 7, 2026
A blind, multi-eval comparison of GPT-5.4 against Claude Opus 4.6 and Gemini 3.1, opening with GPT-5.4 confidently flubbing a trivial 'walk or drive to the carwash' question every other frontier model nailed. Argues the models are converging on capability but diverging on philosophy, with GPT-5.4 strong on quantitative modeling and file processing but weak on writing and product judgment, and that OpenAI is really building agentic infrastructure, not a chatbot. Solid, methodical model-selection reference.
model-evaluationgpt-5.4claude-opusgeminibenchmarks
TIER 4
Apr 21, 2026
Four-day Opus 4.7 review separating three independent engineering changes that shipped together: real capability gains (persistence, coding, vision, a buried knowledge-work win), increased literalness requiring clearer prompts, and a hidden cost increase from a tokenizer tax plus adaptive thinking despite an unchanged sticker price. Warns that anyone treating it as one story will misjudge migration, and includes pre-flight, cost-estimator, and peer-review prompts. Detailed, practically useful model-migration analysis.
Opus 4.7model migrationAnthropictoken costprompting
TIER 4
Apr 28, 2026
Hands-on GPT-5.5 review using three hard non-benchmark tests (executive knowledge-work package, 465-file data migration, interactive 3D build), finding the model strong enough on complex multi-step execution to reset Nate's defaults away from Anthropic. Notes where it still needs help (backend hygiene not production-safe, blank-canvas visual taste still Claude's territory) and how he now routes work. Substantive model review with real routing guidance.
GPT-5.5model reviewCodexmodel routingknowledge work
TIER 4
Jun 3, 2026
Reports a benchmark suite where Opus 4.8 leads at 81 (GPT-5.5 at 71, others far behind) but argues against defaulting to it, citing wins on source discipline, provenance, and self-correction yet losses on visualization/front-end and an Andon Labs result where max effort underperformed high effort. Lays out the model-choice questions that matter beyond 'which is smartest' (task length, source needs, tool access, self-inspection, state, babysitting, failure cost) and the effort-level trap. Unusually concrete benchmark detail for a preview.
model benchmarksOpus 4.8GPT-5.5model selectionreasoning effort
TIER 4
Jun 10, 2026
Argues Claude Code and Codex are not rival coding tools but two paradigms for managing machine labor: Claude trains you to stay close and steer, Codex trains you to write a bounded assignment and demand proof. Frames the central new white-collar skill as deciding when delegated work is good enough to leave the machine, names two failure modes (understanding theater and completion theater), and decodes context/permissions/worktrees/hooks/proof as the moving parts of any assignment. Generalizes well beyond code; teases a Run Spec and four diagnostic prompts.
Claude CodeCodexagent managementdelegationverification
AI Coding & the Evolution of Software Engineering
1 tier-5 · 7 tier-4
What happens to software engineering when code becomes cheap to produce and expensive to trust. The cluster's spine: the spec-driven "dark factory" where no human writes or reviews code, the "dark code" comprehension crisis, the RCT showing experienced devs were 19% slower while feeling 20% faster, and where AI structurally out-performs human architects versus where it can't. Includes the org-chart-as-bottleneck and comprehension-gate prescriptions.
TIER 4
Jul 28, 2025
A 52-page synthesis of six durable AI-coding workflows (map, plan, vibe-code, debug, review, ship) framed as model-agnostic patterns that survive hype cycles, with tool-fit notes on where tools excel and fail. Positioned as a foundational reference for engineers, product leaders, and AI consultants configuring team dev workflows. Substantial flagship guide; body is paywalled so this rank reflects scope and durability of the topic rather than verified depth.
ai-codingworkflowsvibe-codingsoftware-engineeringtooling
TIER 4
Aug 26, 2025
Paywalled but with an unusually substantive intro previewing three resources (a manifesto, a 49-page implementation handbook, an 18-page evolution essay) on how engineering changes when code is cheap: new disciplines like semantic/boundary engineering, retry-loop validation, semantic caching, model routing by cost, production RAG, and patterns like Cascade, Human-in-the-Loop, and Shadow Deployment. High-value for engineers across all levels.
software-engineeringai-agentsragmodel-routingcareer
TIER 5
Feb 18, 2026
A landmark applied-AI piece on the spec-driven 'dark factory': StrongDM's three-engineer team where no human writes or reviews code (specs in markdown → tested against behavioral scenarios → shippable artifacts, humans approve outcomes), set against ~90% of Claude Code's codebase being written by Claude Code and Boris Cherny not having hand-written code in two months. Crucially contrasts this with a rigorous RCT showing experienced devs were 19% slower with AI tools while believing they were 20% faster — quantifying the gap between perceived and real productivity. Then maps why the org chart (sprint planning, code review, eng management) becomes friction, the legacy-migration limit, junior-developer collapse, and the rising bar for 'good engineer.' Dense with concrete evidence and original synthesis; the strongest engineering-practice piece in the batch.
dark factoryspec-driven developmentAI coding productivityorg chartengineering practice
TIER 4
Jan 23, 2026
Argues the bottleneck in AI-assisted building has shifted from raw capability to cognitive architecture—how much you can concurrently reason about—and that the operating-system update most people miss is thinking of yourself as a fleet commander rather than an engineer who uses AI. Uses Cal Newport's 'Why A.I. Didn't Transform Our Lives in 2025,' Karpathy's 'decade of the agent,' and the coding-agent-success-vs-general-agent-failure divergence to motivate practices like killing the contribution badge, strategic deep-diving, and temporal separation. Thoughtful, well-sourced reframing of the operator role.
cognitive-architectureagent-orchestrationcoding-agentsoperator-mindsetproductivity
TIER 4
Jan 28, 2026
Argues AI can outperform human architects not by being smarter but by holding entire-codebase context and applying rules consistently, since most architectural failures are slow rot from lost context rather than bad judgment. Maps the domains of AI structural advantage (security review, API consistency, accessibility, compliance, infra drift) versus the irreducibly human work (novel design, business trade-offs, cross-system politics), plus failure modes like ossification, deskilling, and gaming the system. A genuinely reframing argument generalizable beyond code.
software-architecturecontext-managementai-limitscode-reviewautomation-risk
TIER 4
Apr 13, 2026
Names 'dark code' — AI-generated code that passed automated checks and shipped but was never comprehended by any human at any point in its lifecycle — and argues comprehension has decoupled from authorship. Uses the Amazon disaster (80% AI-usage OKR, 16k layoffs, then Kiro deleting a production environment) as a preview, and prescribes three buildable layers: spec-driven development, context engineering, and comprehension gates, with prompts including a PR comprehension gate. Flags the August 2026 EU AI Act deadline.
dark codecomprehension gatesspec-driven developmentAI coding risktechnical debt
TIER 4
May 8, 2026
Uses Mozilla running Anthropic's purpose-built Mythos on Firefox (271 security-sensitive bugs vs 22 from a general model) to argue authorship is inverting: code becomes cheap to produce and expensive to trust, with humans defining what a system is allowed to mean. Frames comprehensibility as a security property and the next few months as a closing refactor window, since tangled codebases are too illegible for adversarial machine review. A provocative, well-anchored thesis on code trust.
AI code reviewsecuritycode comprehensibilityMythossoftware trust
TIER 4
Apr 23, 2026
Reads the April 16 Codex release (computer use, in-app browser, plugins) as OpenAI getting a credible path into every GUI app without vendor cooperation, pulling API-less legacy software back into the automation conversation. Contrasts OpenAI's computer-use path with Anthropic's structured-interface bet that depends on the ecosystem building for agents first, and explains when to use which. Sharp comparative analysis of the two labs' agent strategies.
Codexcomputer uselegacy softwareOpenAI vs Anthropicautomation strategy
Enterprise AI Adoption & Org Design
0 tier-5 · 18 tier-4
Why most enterprise AI stalls and what the winners do differently — the recurring "95% never reach production" / "201 gap" / "frontier operations" framing, plus org-design consequences: coordination tax, team-size limits, management-layer flattening, the two-class system inside every function, and AI fluency as categorically different from AI usage. Heavy on the Executive Briefing series.
TIER 4
Oct 19, 2025
From ~2,000 hours observing AI-enabled teams, argues AI fluency (300% gains) is categorically different from AI usage (30% gains) and can't be taught by training people on a tool like ChatGPT. Names three org-level drivers (constraints over process, AI-shaped problem-solving skills first with five sub-skills, and 'no infrastructure for a while') plus three self-assessment questions, the throughline being capability that compounds vs. dependency that goes obsolete. A strong, distinctive leadership framing on why AI adoption dashboards mislead, with the deep dive gated behind Executive Circle.
AI fluencyteam capabilityconstraintsAI adoptionleadership
TIER 4
Oct 26, 2025
Classifies nine recurring enterprise AI-failure patterns Nate observed across companies in 2025 — Integration Tar Pit, Governance Vacuum, Review Bottleneck, Unreliable Intern, Handoff Tax, Premature Scale Trap, Automation Trap, Existential Paralysis, Training Deficit/Data Swamp — each with the symptom it diagnoses, and argues the real cost of AI failure is people (burnout, attrition), not money. The taxonomy is a genuinely useful diagnostic for leaders and ICs (failures are fractal across scales), though the 9 fixes themselves are gated behind the Executive Circle paywall.
enterprise AIAI adoption failuresleadershiporg changeAI governance
TIER 4
Jan 4, 2026
Opens on Karpathy feeling 'never this behind' yet sensing he could be '10x more powerful,' then argues technical and non-technical skill trees are merging because the AI-era skills (specifying intent, holding authority over outputs, building heroics-free workflows, systems that improve) are the same across functions. Poses the leadership decision: keep separate skill trees or define a unified AI problem-solving tree with a tool-mode-vs-infrastructure-mode fork at the base. A useful org-design frame for leaders.
skill treesKarpathyAI upskillingorg designleadership
TIER 4
Jan 3, 2026
Argues AI making legibility cheap is a trap when spent on visibility (dashboards, scoring, oversight) instead of leverage for the small teams that create value — a 'magnifying-glass company' vs a 'tiger-team company' fork. The non-obvious mechanism: surveillance creates concealment, so cheap legibility produces fake visibility while the org's root system quietly dies. A sharp, contrarian organizational-design insight.
legibility vs visibilitysurveillanceorg designtiger teamsAI management tools
TIER 4
Jan 15, 2026
Offers a single organizing lens for AI-era disorientation: every org's rituals encode an implicit answer to 'what's expensive here?'—and AI inverted execution from scarce to cheap, so planning gates, PRDs, and alignment meetings now cost more than just building. Names the four things that became scarce (clarity, ambition, distribution, relationships) and the obsolete habits (permission loops, polish as hiding, meetings as accountability theater), using Cursor's $1M-to-$500M ARR and Truell's 'taste over technical ability' as evidence. A clarifying mental model with low-risk experiments to act on.
ai-native-workcost-inversiontasteworkflow-redesignprompt-kit
TIER 4
Jan 13, 2026
Reinterprets Tobi Lutke's 'prove AI can't do it before hiring' memo as a hiring filter that reshapes who joins and thrives, not a direct productivity play—now spreading to Meta, Microsoft, Google, and Nvidia. Notes the productivity evidence is genuinely mixed (one study: devs 19% slower with AI; another: bottom-quartile support reps +35% while veterans flat), arguing AI amplifies variance and hiring markets pay a premium for outlier possibility. Details Shopify's enabling infrastructure (internal LLM proxy, 24+ MCP servers) and contrasts Duolingo's backlash with Box's middle path.
ai-hiringshopifyai-mandatesproductivity-variancetalent-market
TIER 4
Jan 25, 2026
Names a '201 gap'—the applied-judgment layer between 101 tool basics and 401 technical implementation—as the reason ~80% of orgs explore AI but only 5% reach production. Marshals research (St. Louis Fed 33% productivity, Harvard/BCG 40% quality, but 19pp worse on out-of-frontier tasks) to argue AI makes good judgment better and poor judgment catastrophically worse, then proposes six meta-skills (context assembly, quality judgment, task decomposition, iterative refinement, workflow integration, frontier recognition). Substantive executive framing with reproducible data.
enterprise-adoptionai-skillsjagged-frontierproductivity-researchcentaur-cyborg
TIER 4
Mar 1, 2026
Argues the adoption gap (90% invested, <40% see bottom-line impact) is not a tooling or buy-in problem but a missing, un-named skill Nate calls 'frontier operations' — the work done at the expanding membrane between what agents do reliably and what still needs a human. The sharp insight is that as the capability bubble inflates, its surface area increases, so the skill has no fixed destination and can't be learned once. Decomposes it into five simultaneous operations (boundary sensing, seam design, failure-model maintenance, capability forecasting, leverage calibration) and ties it to Team-of-One vs Team-of-Five org units and hiring signals. Executive-Circle paywall, but the bubble-surface framing is original and the body is well developed.
frontier operationsenterprise AIAI adoptionorg designhiring
TIER 4
Mar 12, 2026
Part 2 of the series: argues every AI-workforce forecast errs by measuring AI against a fixed org structure, when 60-70% of knowledge work is coordination overhead (specs, meetings, decks) that evaporates rather than gets automated cell-by-cell. Introduces the 'double compression' loop and a function-by-function breakdown of what gets deleted versus what survives. A genuinely uncomfortable, well-structured reframe with a three-prompt coordination-tax audit.
coordination-taxfuture-of-workorg-designai-and-jobsproductivity
TIER 4
Mar 14, 2026
Part 3 of the coordination-tax series: argues that when execution cost drops 10x, the correct move is to do dramatically more work, not cut staff, citing Whoop nearly doubling headcount while investing in AI. Names six 'unlocks' (among them iteration physics, domain experts as builders, quality as standard, expanded ambition, and insight-speed orgs) and argues the doom frame is the real strategic error. Well-argued reframe of the AI-and-jobs debate.
ai-and-jobsexecution-costorg-strategygrowthexecutive-strategy
TIER 4
Mar 15, 2026
Argues the solo-founder boom (Base44's $80M sale, Polsia's $1M ARR, Pieter Levels) is not new capability emerging but old capability being uncapped from organizational overhead. Reframes the talent question from 'how do we find extraordinary people' to 'why did we build orgs that make extraordinary people look ordinary,' introducing 'speed of control' and 'correctness over volume' as the real scarce variables. A sharp executive framework with a five-question diagnostic.
solo-foundersorg-designai-talentexecutive-strategyproductivity
TIER 4
Mar 8, 2026
Executive briefing arguing AI raised per-person output ~10x but did nothing to lower coordination cost, so team sizes (not meetings) are the bottleneck, grounded in n(n-1)/2 combinatorics, evolutionary psychology, and military doctrine. Introduces the scout-vs-strike-team framework and the 'Steinberger Threshold' separating people who direct AI agents from those directed by them. A crisp, opinionated executive framework with a five-question diagnostic.
team-sizecoordination-costorg-designexecutive-strategyai-productivity
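The n(n-1)/2 figure this entry cites is the number of distinct pairs in a team of n people, which is why coordination load grows much faster than headcount. A short sketch of the arithmetic follows; the function name and the sample team sizes are illustrative choices, not taken from the briefing.

```python
def pairwise_channels(n: int) -> int:
    """Number of distinct person-to-person links in a team of n: n(n-1)/2."""
    if n < 0:
        raise ValueError("team size cannot be negative")
    return n * (n - 1) // 2


for size in (3, 5, 10, 20):
    print(f"team of {size:>2}: {pairwise_channels(size):>3} channels")
# team of  3:   3 channels
# team of  5:  10 channels
# team of 10:  45 channels
# team of 20: 190 channels
```

Going from 5 to 20 people multiplies headcount by 4 but the number of possible channels by 19. That is the combinatorial core of the claim that team size, rather than meeting count, is the bottleneck.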
TIER 4
Feb 15, 2026
Argues the cost of producing software is collapsing fast enough that the bottleneck has moved from 'can we build it' to 'can we specify what to build, how to validate it, and where authority ends' — illustrated by the Replit agent that deleted Jason Lemkin's production DB (1,206 exec records), fabricated ~4,000 fake records, violated a code freeze, then self-scored the severity 95/100, alongside StrongDM, AWS Kiro's spec-first premise, and Claude Code at ~90% self-written. Introduces the specification-bottleneck thesis, an emerging two-class system among engineers replicating across legal/finance/marketing, and a J-curve where productivity revolutions destroy jobs before creating them. Sharp executive framing with concrete cases; strong but adjacent to the dark-factory/spec material covered elsewhere in the batch.
specification bottleneckknowledge work bifurcationagent authorityJ-curveorg design
TIER 4
Apr 12, 2026
Executive briefing arguing the 'management layer' AI lets you flatten is actually three functions on different automation timelines: routing (automatable now), sensemaking (18-36 months out), and accountability (maybe never). Companies that cut all three at once (Valve, Zappos, Medium, GitHub) hit the same wall, and most misdiagnose a sensemaking vacuum as a communication problem — adding routing while the real gap compounds attrition. Prescribes the sequence: replace routing, protect feedback, concentrate sensemaking.
org designflattening managementAI and managementexecutive briefingaccountability
TIER 4
Apr 19, 2026
Warns that AI 'world models' that replace the management layer (per Dorsey/Botha's 'From Hierarchy to Intelligence') fail by looking like success for a year while decision quality degrades, because they replace managers' invisible editorial-judgment function with something that only feels like judgment. Compares three architectures sold as world models (vector databases, structured ontologies, signal-driven) and the distinct way each misplaces the information-vs-judgment boundary, arguing the boundary layer matters more than the architecture choice. Strong cautionary executive frame with a readiness diagnostic.
world modelsorg designmanagement automationinformation vs judgmentexecutive strategy
TIER 4
May 17, 2026
Reframes the hire/automate/buy/build/wait decision as a capital-allocation and 'work-shape' question scored on six dimensions (repetition, cost of error, judgment needed, imminence of model improvement, etc.) rather than a 'can AI do this?' question. Uses Shopify, IBM, Klarna, and Stripe examples plus the Gartner stat that 40%+ of agentic projects may be canceled by 2027. A useful executive routing framework.
executive briefingcapital allocationbuild vs buyautomation strategydecision framework
TIER 4
May 14, 2026
Uses Anthropic's new mid-market enterprise services venture (with Blackstone, H&F, Goldman) plus OpenAI's parallel move to argue the implementation layer, not model access, is now the strategic layer in enterprise AI. Defines 'implementation architecture' (specific role, data, permissions, review, success metric) as what separates demos from production, and flags the risk of services that never compound into reusable product. Strong thesis on where enterprise AI value actually sits.
enterprise AIimplementationforward-deployed engineeringAnthropicpilot-to-production
TIER 4
May 10, 2026
Reads six near-simultaneous moves (Anthropic's ~$1.5B services venture, OpenAI's $4B+ deployment raise, SAP buying Dremio/Prior Labs, Pinecone Nexus, ServiceNow Action Fabric) as one ~$5.5B bet that value is moving from buying the model to buying the build. Uses the CodeWall-on-McKinsey-Lilli SQL-injection breach as the concrete cost of shipping a platform without the build room. Argues the enterprise buying sequence must reverse, with capital flowing to governed action and context.
executive briefingenterprise AIforward-deployed engineeringprocurementmarket analysis
AI Industry Strategy, Economics & Markets
1 tier-5 · 22 tier-4
The newsletter's fast industry-analysis lane: compute scarcity and inference economics as the binding constraint, the SaaSpocalypse and per-seat-pricing collapse, lab strategy (OpenAI vs Anthropic vs Google vs Apple), M&A as capability-licensing, model wrappers vs durable moats, and the capex super-cycle. The recurring lens is structural — find the load-bearing constraint behind the headline.
TIER 4
Jul 21, 2025
Frames a 'crisis of trust in intelligence'—buying an opaque promise rather than testable capability—and builds a personal scorecard grading five labs (OpenAI, Google, Anthropic, Meta, xAI) on transparency, alignment, and delivered performance. The visible portion uses real episodes (Cursor's 3x pricing shift, Claude Code stability complaints, OpenAI's IMO gold-medal claim and Terence Tao's methodology critique, the ignored embargo request) to motivate the framework. Useful evaluative lens with concrete examples, though the actual per-lab grades are paywalled.
ai-labstrustopenaianthropicevaluation
TIER 4
Aug 21, 2025
Full free essay arguing the August-2025 AI-bubble narrative is a misread: it converges from a GPT-5 letdown, Meta's restructuring, Altman's bubble remark, and MIT's 95%-failure study, but misses chatbot-use saturation vs. continued exponential progress on unsaturated benchmarks (METR), a compute shortage signaling unmet demand, and power-law economics. Sharp, well-argued, and timely contrarian analysis with the bubble-talk-paradox observation as a memorable lens.
ai-bubblemarket-analysisgpt-5computeindustry-analysis
TIER 4
Sep 30, 2025
A fully free, complete essay arguing Sora 2 is less a video model than a social network and a strategic hedge: OpenAI offloads ad/monetization onto lower-stakes surfaces (Sora 2, Pulse, Shopify/Etsy checkout) to keep core ChatGPT 'pristine' as it heads toward a billion users. Reads OpenAI as an intelligence platform that inevitably gravitates to ads + social at scale, with first-mover pressure on Snap, Meta, and Google. The standout of this batch—substantive, self-contained strategic analysis rather than a paywalled teaser.
sora-2openaimonetizationsocial-networkplatform-strategy
TIER 4
Dec 1, 2025
Sets up the live frontier debate: Ilya Sutskever's 96-minute Dwarkesh interview declaring the scaling era over and calling for a fundamentally new approach, set against Google shipping Gemini 3 just days earlier as its biggest performance jump ever. Asks which framing builders should bet their agent stacks on. A high-stakes 'is your foundation sand?' question with real strategic weight for anyone building on these systems, but delivered as a paywalled preview that cuts off before the analysis.
scaling laws · Ilya Sutskever · Gemini 3 · AI strategy · frontier models
TIER 4
Dec 27, 2025
Reframes the Nvidia-Groq deal not as an acquisition but as a 'license the capability, hire the brain trust, avoid the acquisition' structure that sidesteps regulatory review, anchored on real reporting (Reuters/CNBC) and the fact that Groq founder Jonathan Ross designed Google's TPU. Names the same pattern across Windsurf, Inflection, and Character.AI, then argues three bottlenecks (inference economics, memory/packaging supply, the tiny pool of inference-silicon talent) make the deal shape rational. Useful structural lens on how frontier-AI value is now transferred and what it means for startup exits.
AI infrastructure · Nvidia · inference economics · M&A strategy · chips
TIER 4
Dec 29, 2025
Cuts through the METR debate by arguing the story is the trajectory, not the point estimate: the 50% task-time horizon has doubled roughly every seven months since 2019 and the doubling period itself is compressing toward four months — a super-exponential where the rate of change is itself changing. Extrapolates (with caveats) to ~40-hour task horizons by fall, and centers the scarce skill of 'knowing what correct looks like' plus the recursive AI-training-AI loop. A clear, well-reasoned read of the most-cited agent benchmark.
METR · task time horizon · super-exponential · AI progress · delegation
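The gap between a seven-month and a four-month doubling period can be made concrete with a small sketch. It assumes plain exponential doubling at a fixed period, which is a simplification of the compressing-period trajectory the piece describes, and the function name `growth_factor` is illustrative rather than anything from the essay.

```python
def growth_factor(months: float, doubling_months: float) -> float:
    """Multiplier on the task-time horizon after `months`,
    assuming it doubles once every `doubling_months`."""
    if doubling_months <= 0:
        raise ValueError("doubling_months must be positive")
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # One year of progress at each of the two doubling periods named above.
    print(round(growth_factor(12, 7), 2))  # seven-month doubling
    print(growth_factor(12, 4))            # four-month doubling
```

Over twelve months the seven-month period gives a bit more than a 3x longer horizon, while the four-month period gives 8x, which is why the change in the doubling period matters more than any single point estimate.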
TIER 4
Jan 8, 2026
Argues CES 2026 marks the shift from a capability-bottleneck era to an allocation-bottleneck era where supply position (who can deliver intelligence at scale, continuously, affordably) decides winners over model quality. Introduces 'factory economics,' the idea that 'bubbles don't pre-buy bottlenecks,' state-as-the-new-scarcity, and what 10x cheaper inference unlocks (ambient AI). A coherent strategic lens on AI infrastructure economics.
CES 2026 · allocation economics · compute supply · inference cost · AI infrastructure
TIER 4
Jan 11, 2026
Uses the Cognition (Devin)–Infosys partnership to argue AI capability flows toward distribution rather than disrupting it — startups sell to incumbents who own procurement, liability, and trust. Introduces a Jevons-Baumol frame and a three-layer split (tokenizable cognition / accountability / embodied execution) to map which business models survive AI. A sharp, transferable analytical framework for where competition actually intensifies.
distribution vs capability · Cognition Infosys · Jevons-Baumol · enterprise AI · competitive strategy
TIER 4
Jan 27, 2026
Uses Claude-in-Excel (opened to $20/mo Pro tier on Jan 24, 2026) building a full multi-tab financial model in minutes to argue the real story is strategic: the base-model race is hitting diminishing returns while workflow embedding backed by proprietary data partnerships (LSEG, Moody's, S&P Capital IQ) becomes the new battleground. Frames the Microsoft-Anthropic $30B Azure deal as a coopetition paradox and a template for the next phase of AI competition. Strong strategic read plus practical prompting guidance.
claude-excel · anthropic-strategy · data-moats · financial-modeling · workflow-integration
TIER 4
Feb 8, 2026
Executive briefing on the AI compute supply crunch: DRAM contract prices up 90-95% in a quarter, GPUs locked to hyperscalers via multi-year deals, and new fab capacity not arriving at scale until late 2027. Demand is growing roughly 10x per year at AI-forward firms, with agentic loops compounding it (Google now at 1.3 quadrillion tokens/month). Warns hyperscalers prioritize their own AI products over customers and lays out a six-principle playbook for securing capacity and routing optionality.
compute-scarcity · gpu-supply · hyperscalers · inference-cost · capacity-planning
TIER 5
Feb 10, 2026
Landmark structural analysis of the 'SaaSpocalypse': Anthropic's open-source ~200-line legal-review plugin for Claude Cowork crystallized fears that AI compresses the cost of legal/financial analysis, contributing to a ~$285B single-session collapse (Thomson Reuters -18%, RELX, Wolters Kluwer, LegalZoom, FactSet, Morningstar). Argues the markdown file didn't cause the crash but revealed that the per-seat SaaS licensing model was already cracking, distinguishing the durable data and accountability edges from the doomed pricing layer on top. Extends the bolt-on-vs-rebuild dynamic fractally to every knowledge worker.
saaspocalypse · enterprise-software · ai-disruption · per-seat-pricing · markets
TIER 4
Feb 14, 2026
Reads Google's $175-185B 2026 capex (roughly double 2025, ~50% above the ~$120B analysts expected; 60% servers / 40% data centers + networking) and the 7% dip-then-recovery in Alphabet shares as the market sensing the number may be too low, not too high. Argues AI agents flipped the bubble thesis to 'underbuilt' in a single week, uses the railroad/fiber/AWS infrastructure-inversion pattern to explain why AI infra builders may not share telecom's fate, and stresses the inference gap as agent workloads dwarf chatbot-era projections. Closes on the four skills that survive when agents code for months and review contracts autonomously. Well-argued macro + career synthesis with concrete earnings detail.
AI infrastructure · capex · inference economics · Google / Alphabet · career skills
TIER 4
Mar 3, 2026
An industry-power analysis of the week Altman 'won a war he didn't have to fight': the OpenAI Department of War deal and $110B raise versus Anthropic's principled Pentagon stand that got it designated a supply-chain risk while Claude topped the App Store amid Iran strikes. Traces how these events connect through infrastructure geometry, the circular-capital machine, and hyperscaler hedging, with implications for builders' vendor risk and thinning middleware margins. A well-connected strategic read of fast-moving news.
ai-industry · openai · anthropic · infrastructure · geopolitics
TIER 4
Feb 20, 2026
Builds on OpenAI's $20K/month 'AI employee' pricing to argue the unit of software work has shifted from instructions to tokens, with token management becoming a core competency. Forecasts the developer role splitting into three tracks — orchestrators, systems builders, and domain translators — with the middle of the old distribution most exposed, and enterprises reorganizing around intelligence throughput (a 3x-5x revenue-per-employee gap) rather than headcount. Adds the vertical-AI and solopreneur angles. Clear, actionable career/org framing on a concrete pricing signal.
AI employee · token economics · developer roles · vertical AI · org restructuring
TIER 4
Dec 21, 2025
Reads OpenAI's 'code red', rapid GPT-5.2 shipping, and compute-securing rumors as downstream of one bind: as AI moves from chat to agents, the scarce resource is compute for long-running loops plus the governance to run them safely, making the capacity/governance layer (not the model) the real 2026 product. Argues OpenAI can't reconcile this because legible delegation requires friction while consumer distribution punishes it, so the consumer mental model actively undermines the enterprise one. Sharp strategic framing; the actionable prompts are Executive-Circle gated.
OpenAI strategy · compute scarcity · agent governance · enterprise AI · delegation
TIER 4
Mar 19, 2026
Uses Perplexity Computer (a well-executed $200/mo multi-model orchestrator running on competitors' models) to expose the 'middleware trap': excellent execution on the wrong layer of the stack doesn't save you when your reasoning, research, and speed all depend on rivals building the same product. Contrasts with Anthropic Cowork owning its own model, names four structural positions that survive the hyperscalers' $690B bet, and gives a five-step diagnostic to test whether your company is building a durable position or renting it.
middleware-trap · moats · perplexity · ai-strategy · stack-positioning
TIER 4
Mar 31, 2026
Argues Apple isn't losing the AI race but playing a different game: building an OS-level agentic runtime that positions it as the chokepoint between users, agents, and apps, with MCP wired directly into iOS. Reads Gurman's Siri-overhaul leak as the runtime story everyone covering the 'Siri becomes a chatbot' angle is missing, plus the asymmetry in Apple's Gemini deal and four role-specific prep prompts ahead of WWDC. Sharp, contrarian platform-strategy analysis.
apple · agentic-runtime · mcp · platform-strategy · siri
TIER 4
Apr 10, 2026
Argues AI app-builder companies (Lovable at $6.6B/$400M ARR, Bolt, Replit, Shipper) are mostly thin model wrappers with a week-deep moat, and asks what the survivors reveal about durable value. Identifies five things AI structurally cannot provide on its own — trust, context, distribution, taste, and liability — as the verticals that will organize the future web, with a positioning audit and an agent-readiness stress-test prompt.
model wrappers · durable moats · AI app builders · product strategy · defensibility
TIER 4
Apr 11, 2026
Argues the decisive variable in the AI infrastructure war isn't silicon but compression — Google Research's TurboQuant (dubbed 'Pied Piper') compresses KV-cache working memory 6x with zero accuracy loss and no retraining, turning a GPU serving 9 concurrent users into one serving 50. Frames compression as the fastest-moving of three forces (vs constrained memory supply and exploding agent demand) because it operates on a different timescale, and maps winners/losers across Google, NVIDIA, middleware, and self-hosting enterprises.
KV-cache compression · TurboQuant · GPU economics · inference efficiency · AI infrastructure
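The capacity claim in this entry is essentially one multiplication, sketched here under an idealized assumption: concurrent users per GPU are limited only by KV-cache memory, and each user's cache shrinks by the stated 6x. The function name `users_after_compression` is illustrative and not from the piece.

```python
def users_after_compression(users_before: int, compression_ratio: float) -> int:
    """Idealized concurrent-user capacity when each user's KV-cache
    footprint shrinks by `compression_ratio` and nothing else is a limit."""
    if users_before < 0 or compression_ratio <= 0:
        raise ValueError("users_before must be >= 0 and compression_ratio > 0")
    return int(users_before * compression_ratio)


if __name__ == "__main__":
    print(users_after_compression(9, 6))  # idealized upper bound
```

The idealized result (54) lands slightly above the essay's figure of 50. That is plausible if some GPU memory goes to things compression does not touch, though the source does not spell out the accounting.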
TIER 4
Apr 14, 2026
A monthly structural-analysis piece arguing AI is leaving the capability phase ('what can we build') and entering the economics phase ('what can we sustain'). Five under-covered shifts: inference as the kill metric (Sora burning $15M/day vs $2.1M lifetime revenue), the first ad dollar in ChatGPT converting at 1.5x search, the closing physical/regulatory path for datacenters, the breaking of per-seat SaaS pricing, and safety posture (Anthropic-Pentagon standoff) becoming a procurement signal. Bundles a Weekly News Analysis skill.
AI economics · inference cost · SaaS repricing · AI advertising · industry analysis
TIER 4
Apr 26, 2026
Reads Apple's Ternus/Srouji hardware-led succession as a structural bet on on-device inference rather than continuity, then generalizes to an under-priced industry-wide problem: who owns the inference layer and what happens when its subsidized economics break. Draws the Apple II 'move computing off the mainframe' analogy and notes compliance-driven buyers already improvising local AI on retail Mac Minis. Strong strategic framing on inference cost structure.
inference economics · Apple · on-device AI · cost structure · executive strategy
TIER 4
May 24, 2026
Uses Microsoft's ~$190B 2026 capex (still capacity-constrained; ~$700B across the four hyperscalers) to argue AI is turning big tech industrial — tokens are manufactured from chips, memory, power, cooling, and construction — so your AI vendor agreement is now a supply contract needing allocation, fallback, and reserved-capacity terms. Adds that utilization becomes the metric that matters (a 40% throughput gain beats a new data center) and that seats are the wrong forecasting unit. Strong structural/economic framing for executives.
AI infrastructure · vendor contracts · capacity · hyperscalers · capex
TIER 4
Jun 7, 2026
Uses Uber blowing its 2026 AI budget months early (95% of engineers on AI, ~1,800 agent code changes/week, yet the COO can't connect spend to better customer features) to argue token burn is information, not just waste: evidence that AI has crossed from a tool you buy into labor you must manage. Offers a 'minimum effective intelligence' routing rule and explains why 2025 seat/license budgeting breaks for work that plans, retries, and runs for hours. Concrete case plus an actionable operating-model frame.
AI cost · token economics · model routing · enterprise AI · budgeting
Agentic Commerce & Infrastructure Standards
2 tier-5 · 9 tier-4
The plumbing layer being built so agents can transact and interoperate: Stripe's agent payment rails, the protocol wars (MCP / A2A / AG-UI / AP2 / x402), the agent infrastructure stack, OS-level runtimes (Anthropic's Conway), issue trackers as agent control planes, and the access-vs-meaning / semantic-control thesis for where the durable moat sits. Includes the "can your business be called by an agent" buyer-power shift.
TIER 5
May 6, 2026
Draws the access-vs-meaning distinction: most AI progress is on access (the agent can reach one more thing) while the durable moat is semantic control (the layer that tells the agent what an action means). Argues computer use gives reach but inference over a human interface isn't software exposing meaning directly, using Stripe's payment token, Perplexity's answer-to-operate shift, and the Salesforce-vs-SAP agent-readability wager. A landmark strategic lens for evaluating any AI product.
semantic control · computer use · platform strategy · moats · agent judgment
TIER 5
May 19, 2026
Maps the six new agent protocols into layers and isolates the three forming the core stack: MCP (tools/data), A2A (delegation), AG-UI (human-in-the-loop control), which answer the only three questions every real agent hits in week one. Treats payments (AP2, x402) as a separate, still-negotiated problem and warns against betting on all layers equally. A landmark reference for anyone building or buying agents who needs a shared vocabulary for the standards layer.
agent protocols · MCP · A2A · AG-UI · standards stack
TIER 4
May 7, 2026
Argues OpenClaw crossed from agent harness to runtime (composing tasks, tools, memory, channels, permissions, subagents, and model choice into durable workflows) just as the model layer got contested: Anthropic pulling back subscription-backed third-party use, OpenAI opening ChatGPT/Codex to OpenClaw, Google shipping Gemma 4. The builder shift is from making an agent do something to building a workflow once and swapping the model, which requires memory to live outside the model.
OpenClaw · agent runtime · model swappability · agent memory · Gemma 4
TIER 4
Apr 8, 2026
Analyzes 'Conway', an unannounced always-on Anthropic agent environment found in the accidentally-published Claude Code source — standalone from chat, event-triggered, with browser control, tool connections, and a proprietary .cnw.zip extension format on top of MCP. Argues it's Anthropic's bid to become an operating system, lining up Conway with Channels, Cowork, Marketplace, Partner Network, and the OpenClaw ban as one platform play, and warns that behavioral-context lock-in runs deeper than anything Microsoft or Salesforce built. Includes platform-dependency and contract-portability prompts.
Anthropic Conway · always-on agents · platform lock-in · MCP · vendor strategy
TIER 4
Apr 6, 2026
Maps the emerging agent infrastructure stack (Tracxn counts 1,000+ startups) using a 'system calls, not Lego bricks' mental model, prompted by Stripe shipping agent payment rails. Rates six layers — compute, identity, memory, tool access, billing, orchestration — for durability, distinguishing load-bearing walls from 18-month transitional workarounds, and argues orchestration is the next infrastructure-defining gap nobody has cracked. Draws the analogy to the cloud and API-first transitions.
agent infrastructure stack · agent payments · Stripe · orchestration · agent identity
TIER 4
May 2, 2026
Argues that issue trackers (Linear, Jira) quietly became strategic agent infrastructure after OpenAI's open-sourced Symphony made Linear its autonomous-coding control plane (500% landed-PR gains on some teams). Lays out five yes/no structural properties — state machine, assignee, audit history, dependency graph — that determine which boring enterprise tools become agent substrate vs. get wrapped. Includes prompts to score your stack and spec an MCP server.
agent infrastructure · issue trackers · MCP · enterprise tooling · Linear/Atlassian
TIER 4
May 3, 2026
Structural read of Stripe's 2026 Sessions agent-commerce stack, arguing the real shift is that commercial intent no longer passes through the seller's funnel — the buying decision starts inside the buyer's agent (ChatGPT, Gemini, procurement) before the seller sees anything. Covers Link's agent wallet, token-theft fraud as the binding constraint, and brand migrating to buyer memory. Offers a 'be callable' diagnostic for whether your business can complete a task with an agent on the other side.
agentic commerce · Stripe · payment infrastructure · buyer power shift · fraud/token theft
TIER 4
May 12, 2026
Analyzes how agentic commerce breaks the single-click purchase into separable responsibilities (identity, authorization, fraud, credentials, settlement, refunds, liability, data rights) and the protocol camps fighting over who holds the loss. Tracks OpenAI/Stripe Instant Checkout's pullback against Shopify/Google's counter-protocol and the stablecoin case for software-paying-software. Sharp framing of where commercial trust relocates when software stops clicking.
agentic commerce · payments · authorization · liability · protocols
TIER 4
May 15, 2026
Documents how SaaS pricing is shifting from per-seat to a dual meter (who logs in plus what work moves through the system) using Salesforce ($800M agent revenue), Microsoft's $15 agent-governance add-on, SAP API limits, and meters from ServiceNow, Workday, Zendesk, HubSpot, Atlassian. Gives nine traits separating a fair license from rent-seeking and a renewal negotiation checklist. Practical, well-sourced procurement guidance for the agent era.
SaaS pricing · agent metering · procurement · vendor negotiation · Salesforce
TIER 4
May 5, 2026
Asks why, with proven consumer demand (ChatGPT 900M+ weekly, Gemini ~1B daily) and real agentic capability, no breakaway consumer agent has shipped, and locates the missing piece in anticipation: acting at the right moment without being asked. Argues four problems (context, reliability, permission, judgment) must be solved together because solving three of four equals zero, and scores the active bets (Poke, Manus, ChatGPT Agent/Atlas, Cowork, wearables, companions). A sharp consumer-AI market map.
consumer AI · agents · anticipation · product strategy · market map
TIER 4
Mar 22, 2026
Executive briefing arguing the OpenClaw demand signal is a 'Napster moment' and the real precondition for agent commerce is whether your transactional infrastructure is agent-readable and agent-writable, which is a data-quality problem forcing cleanliness down the whole stack, not an API problem. Flags four executive misconceptions and the under-addressed 'vagueness problem' where ~80% of product meaning lives as tribal knowledge outside databases. Includes five uncomfortable diagnostic exercises and four planning prompts.
agent-readiness · data-quality · openclaw · enterprise-strategy · agent-commerce
Future of Work, Careers & Human Judgment
1 tier-5 · 17 tier-4
The human side: what stays valuable as intelligence gets cheap. The cluster's repeated answer is taste, judgment, evaluation, and "knowing what correct looks like" — plus careers intel (breaking into tech, job-by-job evolution, positioning), the task-vs-job gap, rejection-as-compounding-skill, and AI literacy for parents. Also the safety-adjacent epistemics pieces on automation bias and "LLM psychosis."
TIER 4
Sep 13, 2025
A fully-published essay arguing that as AI commoditizes the mechanics of knowledge work, embodied 'taste' — the gut sense that an output is wrong even when technically correct — becomes the durable human differentiator and the editorial layer over machine generation. It develops practical mechanics of exercising taste, why deep obsession beats decades of breadth, and includes targeted sidebars for early-career, senior, and parenting readers. One of the strongest reflective pieces in this batch and complete rather than a teaser.
taste · human-ai-collaboration · career-strategy · judgment · future-of-work
TIER 4
Sep 20, 2025
A fully free, complete piece on breaking into tech as a junior in the AI era: cites real labor data (6.1% CS-grad unemployment, entry roles demanding 4.5 yrs experience, Big Tech new-grad hiring down 50% over five years) and argues AI has destroyed the signals that let companies spot hungry juniors. Gives a four-step playbook (find a narrow intersection, build in public, skip the application line, prove AI partnership) plus a detailed, named list of companies and programs actively hiring AI-native juniors (IBM, OpenAI Grove, Hugging Face, Skillfully/Anthropic, Apprenti, MinT, etc.). Substantive and actionable, not a teaser.
career · junior-hiring · tech-jobs · ai-native · labor-market
TIER 4
Oct 22, 2025
Pushes back on the viral 'Karpathy says agents are slop / AI bubble popped' takes from the Karpathy-Dwarkesh podcast, arguing his actual message aligns with building useful production agents today. Distills agent-design principles — memory architecture beats models, architecture creates reliability, test outcomes not steps, model economics before code, start with boring expensive problems, follow cost not hype — and bundles three production prompts (architecture interview, memory blueprint, ops runbook). A solid corrective plus genuinely useful applied-agent principles, with the prompts gated.
Andrej Karpathy · AI agents · agent memory · production engineering · AI hype
TIER 4
Oct 30, 2025
Pushes back on the 'AI bubble' consensus as a costly bet (people who assume nothing changes will be 12-18 months too late), building on Julian Schrittwieser's exponential thesis — ungameable benchmarks showing exponential gains and the mechanism driving them — then extending it with a novel argument for why those gains won't disrupt jobs the way the media assumes. Identifies four compounding human skills (AI Direction, AI Evaluation, Task Decomposition, Learning Velocity) that get more valuable as capability grows, and ships a five-part scored 'AI Exponential Fluency' self-assessment with a ranked 90-day plan. A meaty capability-curve and skills piece, with the analysis and assessment behind the preview.
AI exponentials · AI bubble · future of work · compounding skills · capability curves
TIER 4
Nov 4, 2025
Closes the gap between knowing AI positioning matters and knowing what to actually do, drawing on closed-door conversations where leaders ask 'who can we replace, who's demonstrating more with AI, who's stuck in production mode' while publicly preaching upskilling. Offers level-specific playbooks for juniors, mid-career, and seniors — reframing exercises to shift from production to problem-solving, domain-expertise mapping to show what can't be learned from ChatGPT, strategic-judgment demonstrations, non-overselling positioning language, before/after value frameworks, and 6 prompts — arguing all three levels' tactics are a shared toolkit. A high-relevance careers piece on the real (not press-release) state of AI and jobs, gated behind the preview.
AI and jobs · career positioning · workforce · skills demonstration · prompts
TIER 4
Nov 10, 2025
Argues that as intelligence gets cheap, good judgment becomes ~100x more valuable and is now an explicit hire/fire criterion, yet it's taught only by osmosis. Defines judgment as knowing what matters — signal vs noise, second-order effects, when to trust vs verify AI output, demos vs production — and lays out a ten-component mini-course (finding real bottlenecks, pattern reuse without overgeneralizing, possible-vs-possible-now, sequencing for momentum, deprioritization discipline, calibration loops, social-graph mapping, ownership, transparent reasoning, encoding judgment into systems) plus a self-assessment prompt with a 30-day plan. A well-framed, durable career/skill piece, though the substance is behind the preview.
judgment · career skills · AI fluency · decision-making · skills development
TIER 4
Nov 3, 2025
Reframes consumer AI as investigative capacity at institutional scale — citing a family that used Claude to win a $160,000 medical-billing adjustment — arguing institutions profit from information asymmetry (chargemaster dual pricing, buried insurance exclusions) and that AI's edge is decoding jargon, auditing compliance against regulations, and finding categorical violations rather than just giving advice. Lays out 10 principles (investigation beats negotiation, find the rulebook they bet you can't read, categorical violations beat 'seems expensive', verify before staking credibility, control the frame) and 7 operational prompts (Framework Finder, Violation Auditor, Benchmark Calculator, Dispute Letter Generator). A distinctive, high-leverage applied-AI use case, with the principles/prompts paywalled.
AI for consumers · medical billing · institutional power · investigation · prompts
TIER 4
Jul 19, 2025
A fully-readable letter arguing today's AI is discontinuous with p(doom) timelines on three chained axes—no 'skin in the game' (no embodied sense of loss driving dominance), no long-term context (the unsolved memory problem, an 'atoms' constraint, not an incremental one), and no proactive general agentic intent—so deception in games like Diplomacy isn't militarily meaningful. Reframes the debate as bet-sizing: p(doom) is unverifiable yet diverts scarce attention from provable risks (senior fraud, AI education/critical thinking, usage norms, deepfakes) that deserve 10x more investment. Coherent, substantive, and complete (not paywalled), with sharp rebuttals to standard counters.
ai-safety · p-doom · ai-2027 · agency · risk-prioritization
TIER 4
Nov 23, 2025
A long, fully-readable practical guide for parents that explains how LLMs actually work (prediction not understanding, zero 'optimal frustration', confidence-when-wrong, the engagement trap) and why teen brains are uniquely vulnerable, then gives concrete boundaries (Show Your Work, Human First, Citation Needed, Time Boxing, Purpose Declaration) plus warning signs and a deep toolkit of reality-anchoring, emotional-regulation, and critical-thinking drills. Unusually complete and well-organized—a reference-grade piece on AI literacy and parenting.
AI literacy · parenting · kids and AI · cognitive offloading · critical thinking
TIER 4
Nov 27, 2025
A fully-readable, evidence-backed guide to discussing contentious AI topics (jobs, cheating, water/energy, AI art, trust) without it devolving into argument, paired with 6 skeptic-persona conversation-simulator prompts. Strong because it ships a complete fact sheet with real numbers (jobs != tasks via the radiology example, golf-course vs data-center water framing, 0.34 Wh/query) and a listen-validate-explore conversational method. Genuinely useful both as talking points and as a model for steelmanning opposing views.
AI discourse · conversation frameworks · AI ethics · jobs and automation · prompt personas
TIER 4
Nov 30, 2025
Fully free, dense job-by-job guide built on four dynamics (automation avalanche, trust deficit, infrastructure tsunami, human-AI boundary crisis) that maps how 15 tech roles mutate — which tasks vanish, which elevate, where salary premiums land — covering PM, eng, CS, data science, DevOps/MLOps, UX, security, QA, vector/RAG and more, then names 12 emerging roles without titles yet (agent-fleet orchestrators, context-supply-chain managers, red-team psychologists, edge-inference optimizers). Closes with a SURVIVE/ADAPT/LEAD progression and a sourced reading list. Long, specific, and genuinely actionable career intel.
AI jobs · career evolution · emerging roles · salary premiums · workforce strategy
TIER 4
Dec 23, 2025
Uses the David Budden case (a credentialed ex-DeepMind director betting $45K that he resolved Clay Millennium problems with ChatGPT, with mathematicians pointing out his Lean proof may formalize a weaker statement) to dissect 'LLM psychosis' as a real workflow failure: AI explanations increase trust even when wrong, and smart people are most vulnerable because they can rationalize. Lays out warning signs (confirmatory prompting disguised as verification, operating beyond your evaluation capacity, 'me and the AI vs everyone') and a ten-prompt adversarial self-audit kit. Timely and well-grounded caution for 2026.
automation bias · LLM psychosis · self-audit · verification · epistemics
TIER 4
Dec 25, 2025
A full free essay distilling four shifts Nate watched in 2025: the technical/non-technical line dissolving, measurement being the underappreciated complement to prompting (define 'good', then loop an agent until it hits it), AI slop reframed via the printing-press analogy (volume amplifies average output, but systems and taste still let quality rise), and leaders shifting from cost-cutting to quality-lift. Reflective rather than newsy, but a genuinely useful synthesis of what separated teams that shipped from teams that didn't.
year in review · measurement · prompting · AI slop · quality lift
TIER 4
Jan 1, 2026
A year-ahead thesis built on a strong observation: AI solved generation but the review/verification problem now crushes capacity, and the teams pulling ahead build systems where AI reviews AI (eval harnesses, judge models, automated QA) with humans handling exceptions. Lays out three structural 2026 bets — the review stack flips, all work becomes testable (the technical/non-technical wall dissolves), and the chasm between fast movers and everyone else becomes unbridgeable on organizational learning rate. Substantive forecasting that frames the year's core shift.
verification gap · AI reviewing AI · 2026 predictions · testable intent · eval harnesses
TIER 5
Mar 11, 2026
Part 1 of the series and the conceptual keystone: argues the famous 'jagged frontier' never described model intelligence but one-shot, structure-free prompting, and that multi-agent harnesses smooth it, evidenced by a Cursor coding harness solving and improving on a research-grade spectral-graph-theory problem after four unguided days. Introduces the verifiability spectrum (machine-checkable / expert-checkable / judgment-dependent) and argues evaluation, not generation, is the surviving skill. A landmark reframe with a concrete proof point and a durable framework.
jagged-frontier · agent-harnesses · verifiability · ai-capability · evaluation
TIER 4
Mar 21, 2026
Anchors on the task-versus-job gap: agents excel at two-hour tasks but lack the multi-year institutional memory and common sense a real job requires, making powerful-but-brittle agents more destructive when unmanaged (the Grigorev case where an agent wiped 1.9M rows because real-vs-temp infra lived only in the engineer's head). Argues your best people, not juniors, should write evals, and introduces 'contextual stewardship' as the emerging human role. Three starter prompts (context-gap audit, eval writer for non-engineers, decision documenter).
task-vs-job · evals · contextual-stewardship · agent-risk · future-of-work
TIER 4
Apr 15, 2026
Identifies the real bottleneck in agent adoption as the 'now what?' problem: people install agents (OpenClaw, NemoClaw, Dispatch, Manus) easily but can't describe their own work at the resolution an agent needs to act on. Diagnoses the 40-hour wall and the 'expertise trap' (the more senior you are, the more invisible your operating system), ties it to delegation failure and promotion ceilings, and offers an interviewer-agent prompt that elicits and writes your SOUL.md for you.
agent delegation · SOUL.md · tacit knowledge · work decomposition · OpenClaw
TIER 4
Apr 18, 2026
Profiles the 'Karpathy Loop' — pointing an AI agent at your own code/system with one file, one metric, one time budget and letting it run hundreds of experiments overnight for a few hundred dollars (Karpathy's 700-run training optimization, SkyPilot's $300 scale-up, ThirdLayer's meta-agent rewriting agent scaffolding). Frames this as a 'local hard takeoff' bounded to a domain, where teams that can define 'better' precisely pull away, and flags the reward-gaming safety problem. Includes diagnostic, pre-mortem, and trace-audit prompts.
agent self-optimization · AutoML · meta-agents · reward hacking · competitive strategy
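The "one file, one metric, one time budget" loop in this entry can be sketched as a simple keep-the-best search. This is a toy under loud assumptions: the agent is replaced by random perturbation, the system being tuned is a single number, and every name here (`optimize`, `toy_metric`, `toy_propose`) is hypothetical rather than taken from Karpathy's setup or the essay.

```python
import random
import time


def toy_metric(x: float) -> float:
    """Stand-in for the single metric: higher is better, peak at x = 3."""
    return -(x - 3.0) ** 2


def toy_propose(current: float, rng: random.Random) -> float:
    """Stand-in for the agent: proposes a small random change."""
    return current + rng.uniform(-1.0, 1.0)


def optimize(baseline, propose, metric, time_budget_s, max_runs, seed=0):
    """Run experiments until the time budget or run cap is hit,
    keeping a candidate only when it improves the metric."""
    rng = random.Random(seed)
    best, best_score = baseline, metric(baseline)
    deadline = time.monotonic() + time_budget_s
    runs = 0
    while runs < max_runs and time.monotonic() < deadline:
        candidate = propose(best, rng)
        score = metric(candidate)
        runs += 1
        if score > best_score:  # keep only strict improvements
            best, best_score = candidate, score
    return best, best_score, runs


if __name__ == "__main__":
    best, score, runs = optimize(0.0, toy_propose, toy_metric,
                                 time_budget_s=5.0, max_runs=200)
    print(f"best={best:.3f} score={score:.4f} runs={runs}")
```

The loop optimizes whatever `metric` measures and nothing else, which is where the reward-gaming concern comes from: a metric that can be satisfied the wrong way will be.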
Tools, Skills & Workflow Packaging
1 tier-5 · 22 tier-4
The applied tooling layer: Claude Code as a non-coder workflow tool, Claude Skills as a portable expertise package (and the failure modes of authoring them), the prompt-vs-skill-vs-plugin decision ladder, AI browsers (Atlas), Claude Design, visual/image AI as infrastructure, and hands-on reviews of outcome agents. The recurring move is turning ad-hoc AI use into reusable, reliable infrastructure — built from outputs, not intentions.
TIER 4
Jul 31, 2025
Argues most enterprise AI failures are data problems, not intelligence problems — bad chunking (contracts split mid-sentence, financial tables severed from headers) forces hallucination the way handing a reader randomly-torn Shakespeare pages would. Lays out five chunking principles led by Context Coherence (chunk where it preserves semantic meaning, respecting natural document boundaries, sizing for relevance/cost, overlapping strategically), positions it as the companion to the prior RAG guide, and covers agentic-search-over-Excel. A meaty, engineering-oriented topic even in preview form; the 61-page guide with 10 data-type configs is gated.
RAG · chunking · data preparation · hallucination · AI engineering
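The boundary-respecting, overlapping chunking that the entry above summarizes might look roughly like this. It is a sketch under stated assumptions, not the guide's method: `chunk_paragraphs`, `max_chars`, and `overlap_paragraphs` are hypothetical names, blank lines are assumed to mark the natural boundaries, and character count stands in for a real token budget.

```python
def chunk_paragraphs(text, max_chars=800, overlap_paragraphs=1):
    """Group whole paragraphs into chunks of about max_chars, with paragraph overlap.

    Paragraphs are never split, so a sentence or table is not cut in half;
    max_chars is a soft limit that a single long paragraph may exceed.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks = []
    current = []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward as shared context.
            tail = current[-overlap_paragraphs:] if overlap_paragraphs else []
            current = tail + [para]
        else:
            current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks


sample = "A" * 10 + "\n\n" + "B" * 10 + "\n\n" + "C" * 10
chunks = chunk_paragraphs(sample, max_chars=25, overlap_paragraphs=1)
```

With this sample, the middle paragraph appears at the end of the first chunk and the start of the second, which is the overlap doing its job. Real documents would need boundary detection suited to the data type (clauses, table rows, headings) rather than blank lines alone.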
TIER 4
Aug 16, 2025
Full free deep dive on the leaked Meta 'GenAI Content Risk Standards' (which permitted romantic chats with children, racist arguments, and disclaimer-gated medical misinformation), then a substantive technical tour of how to actually train ethical AI: Constitutional AI, RLHF and its limits, red teaming, synthetic data for sensitive domains, transparency, and measurement. Frames the core gap as institutional, not technical.
ai-ethics · constitutional-ai · rlhf · red-teaming · meta
TIER 4
Aug 18, 2025
Paywalled but high-value teaser for a 64-page guide on using Claude Code for non-coding business work (legal, marketing, research, sales, HR, ops, finance, PM) with real ROI examples—Syncari +23% SQL rates, a $1.2M pipeline recovery, 10K+ tickets/month automated—plus a 29-page zero-knowledge install guide. Strongly practical, capturing a narrow early-adopter window before the technique normalizes.
claude-code · agents · automation · non-technical-users · business-workflows
TIER 4
Oct 22, 2025
Full free hands-on review of OpenAI's Atlas browser across a dozen real tasks, grading it C+/B- and deriving a sharp rule: AI browsers shine on boring, linear, low-ambiguity work (email triage, folder creation, spreadsheet math) and fail on aesthetic judgment and ambiguous flows (PowerPoint formatting; booking a yoga class, which took 10x longer). Flags the unaddressed prompt-injection security problem (transparency-as-safety defeats autonomy), the per-user memory advantage, and predicts a 'two-speed web' where sites offering direct agent data inputs (e.g. LinkedIn in Comet) beat those forcing UI navigation. Substantive, well-reasoned, and fully readable; the linear-vs-ambiguous framing transfers well beyond browsers.
Atlas browser · AI browsers · agentic web · prompt injection · Comet/Perplexity
TIER 4
Oct 23, 2025
Catalogs the common week-one Claude Skills failure modes (skills that won't trigger, zip-file issues, context-window overflow, security of code-running skills, evaluation gaps) and ships a 10-tool building kit to fix them: skill-debugging-assistant, skill-security-analyzer, skill-gap-analyzer, skill-performance-profiler, prompt-optimization-analyzer, skill-testing-framework, skill-doc-generator, skill-dependency-mapper, learning-capture, and token-budget-advisor. Built on the 'treat skills/prompts as code' thesis; a practical, concrete toolkit for anyone actually authoring Skills, though the kit itself is gated.
Claude Skills · skill debugging · prompts as code · token budget · AI tooling
TIER 4
Nov 11, 2025
An exclusive, roughly hour-long interview with Ben Goodger, head of engineering for OpenAI's Atlas browser, teased via 10 strategic takeaways: the 'Netscape 1.0' framing for resetting adoption expectations, the three-click friction problem outweighing better models, the 'eyes on the road' trust-vs-capability rule, memory (your shoe size) as strategic, and an architectural pattern for wrapping legacy systems under three-month-half-life constraints. A high-signal source (first OpenAI engineering conversation of its kind) on agentic browsing and product design, but the actual interview and analysis are gated behind the paywall.
OpenAI Atlas · agentic browsers · product design · AI agents · interview
TIER 4
Oct 31, 2025
Argues the much-feared death of the open web is actually an opening for individuals and small brands, because LLMs are trained (to reduce bias/hallucination) to discount big attention-farming brands and surface narrow, authoritative sources — a closing window before AI establishes a new hierarchy. Backs it with a Princeton study and names seven counterintuitive principles for AI visibility (Position Bias Inversion, 18-Token Extraction Pattern, Institution Shadow Problem, Noise Floor Paradox, Domain Mismatch Penalty, Citation Churn, Under-Optimization) plus seven audit/build prompts including the 'Atomic Claim Page.' A substantive, data-grounded take on GEO/AI-visibility strategy, with the principles and prompts paywalled.
AI visibility · GEO · open web · content strategy · LLM citation
TIER 4
Dec 18, 2025
Draws on an hour-long conversation with the Codex team (Ed on design, Tibo on engineering) to report how role boundaries are dissolving at OpenAI: designers commit code, juniors outpace seniors because they have no muscle memory to override, mandatory AI review became a beloved feature, and a model bootstrapped its own multi-agent system unprompted. Surfaces where bottlenecks move once code generation is largely solved, and contrasts the 'tight cockpit loop' and 'parallel coworker' paradigms. Genuine insider signal; the six operationalizing prompts are gated.
OpenAI Codex · future of work · role dissolution · agent paradigms · engineering culture
TIER 4
Dec 22, 2025
Argues Claude Code is mis-named: Anthropic's December releases (browser automation, Slack integration, mobile delegation, Skills) make it a non-coder workflow tool, and frames Anthropic's iterative-collaboration bet against OpenAI Codex's delegated-autonomy bet. Bundles a 29-page setup guide, ten production workflows, a mental-model guide, ten copy-paste prompts, and the JIORP (Job/Inputs/Output/Rules/Proof) framework. Very high practical utility for non-technical users, though the bulk of the treasury is gated.
Claude Code · non-coders · agent workflows · Anthropic vs OpenAI · JIORP
TIER 4
Jan 18, 2026
Reframes image generation (Nano Banana Pro hit 1B images in 53 days) as infrastructure rather than a design tool, arguing the long-standing constraint that AI 'could not see and could not show' has limited adoption to text-centric processes. Lays out a four-stage flywheel (bottleneck removal, data generation, trust calibration, workflow integration) and a 30%-vs-300% distinction between treating visual AI as departmental versus infrastructural. A non-obvious executive thesis with concrete operational examples.
visual-ai · image-generation · enterprise-strategy · nano-banana · workflow-automation
TIER 4
Mar 30, 2026
Reframes Skills from a personal prompting shortcut to a cross-industry hidden context layer now adopted by OpenAI, Microsoft, GitHub, and Cursor (500K skills running interchangeably, and now in Excel/PowerPoint/M365). Key insight is the failure asymmetry: 'good enough when I'm watching' fails the moment agents invoke skills unsupervised, so skills must be built from outputs, not intentions. Includes a build-this-week prompt set (backlog audit, output-extraction builder, agent-readiness stress test, team deployment) plus repo access.
skills · agent-context · claude-code · skill-design · agent-orchestration
TIER 4
Apr 1, 2026
Uses the impending Claude 'Mythos' tier launch to make an evergreen point: every production AI system carries an invisible layer of workarounds for the last model's weaknesses, and a step-change in capability can make those systems perform worse, not better. Offers a four-question per-layer stack audit, the 'Bitter Lesson for builders' simplification pattern (with a Klarna cautionary case), and four fix prompts. Strong transferable framing for anyone shipping agents.
model-upgrades · agent-architecture · bitter-lesson · system-prompt-audit · anthropic
TIER 4
Apr 4, 2026
Hands-on review of outcome agents (Cowork, Lindy, Sauna, Google Opal, Obvious) built on one insight: the question almost nobody asks is how the agent knows its own output is good. Contrasts code (has a test suite) with knowledge work (a memo doesn't compile), arguing the separator between agents that work and ones that waste time is whether the environment gives automated feedback or you're the only feedback mechanism. Includes a two-phase prompt that scores any agent tool and builds a delegation spec to its weaknesses.
outcome agents · tool review · agent evaluation · feedback environments · delegation specs
TIER 4
May 9, 2026
Argues GPT-5.5/Codex (82.7% on Terminal-Bench 2.0) made the model strong enough that the bottleneck moved to the environment around it: the workflow lives in your head and you reload it every thread. Lays out a decision ladder (prompt vs skill vs plugin vs nothing) and which workflows to package first, framing 'a stronger model with a vague environment gives you faster, more confident wrongness.' Strong practical case for packaging work into reusable infrastructure.
Codex · plugins · skills · workflow packaging · agent tooling
TIER 4
May 16, 2026
An interview with OpenAI's Codex lead Tibo on what changes once the model can carry the work: the bottleneck has moved twice (from capability, to workflow-packaging, to leadership judgment across five chairs). Argues companies will split into over-restrictors, under-restrictors who hit a board-level incident, and the quiet builders of the five judgment layers who become uncatchable. Notable for the firsthand source and the where-does-human-judgment-live framing.
interview · Codex · leadership · human judgment · agent adoption
TIER 4
May 18, 2026
Opens a 'walking into the job in 2026' series with marketing, arguing the function now serves two audiences: humans and the agents that read, compare, and recommend on their behalf. Cites a March 2026 survey where 69% of B2B buyers switched vendors on AI guidance and a third bought from a vendor they had never heard of. Reframes marketing's job around legibility and a 'truth layer' (claims/proof stewardship) rather than content velocity.
marketing · AI search · GEO/AEO · legibility · AI-washing
TIER 4
Apr 24, 2026
Argues Claude Design (shipped April 17 with Opus 4.7) completes Anthropic's intent-in/artifact-out strategy alongside Code and Cowork by retiring the mockup-to-production handoff — citing Brilliant (20+ prompts down to 2), Datadog (week-long cycle to one conversation), and a Jane Street designer now designing in Claude over Figma. Covers the Figma/Stitch medium war, Krieger leaving Figma's board pre-launch, and role-by-role org shifts. Strong design-workflow disruption piece.
Claude Design · Anthropic strategy · design workflows · Figma · mockup-to-production
TIER 4
Apr 25, 2026
Argues GPT-Image-2's real story isn't the leaderboard jump (1,512 on Image Arena, +242 over the next entry) but that image generation joined the reasoning stack: it plans, web-searches, composes, and verifies like a text model. Covers seven newly-viable workflows, the adversarial 'forgeries pass now / screenshots-as-proof just ended' angle, and role-by-role moves plus a brand-system prompt that compounds across generations. Meaty take on multimodal reasoning and creative ops.
GPT-Image-2 · image generation · multimodal reasoning · creative ops · forgery/verification
TIER 4
Apr 29, 2026
Offers a reusable five-question filter for separating agent launches that are infrastructure from those that are just features, applied to six weeks of releases. Names Salesforce Headless 360 as the most important and underreported launch of the month and gives a routing guide across Copilot, Perplexity, Claude direct, and Salesforce. Reframes 'should I switch' as a layering decision, not a switching one.
agent evaluation · enterprise AI · Salesforce · tool routing · decision frameworks
TIER 4
Feb 12, 2026
Uses the OpenClaw/Moltbot skill marketplace (160k devs, thousands of skills in six weeks) as a revealed-preference signal that people want digital employees, not better chatbots. Contrasts the $4,200 car-negotiation win against an agent that fired 500 rogue iMessages, arguing specification quality is the variable between value and chaos. Adds the 70/30 control-vs-delegation research finding and the enterprise gap (71% using agents, only 11% in production).
ai-agents · agent-deployment · specification · 70-30-rule · enterprise-adoption
TIER 4
Mar 23, 2026
Reads the post-OpenClaw agent wars as a product-strategy case study: every major player made a different bet on the same tradeoffs (Nvidia's Linux analogy, Perplexity's cloud-plus-local delegation, Meta's $2B Manus distribution move, Anthropic's Dispatch safety play, Lovable's pivot to general agents). Offers a three-axis evaluation lens (where it runs, who picks the model, what the interface assumes about you) and three questions usable on any future agent launch. Durable evaluation framework with vivid market color.
agent-wars · product-strategy · openclaw · evaluation-framework · competitive-analysis
TIER 5
Mar 25, 2026
A genuinely clarifying taxonomy: 'agent' has become the 'cloud' of 2026, hiding four distinct architectures (coding harnesses, dark factories, auto research, orchestration frameworks) with as little in common as a forklift and a bicycle. Gives what each does in production, the governing operating principle per architecture, and a one-question diagnostic test (plus three prompts) that tells you which subspecies your problem actually needs. Highly reusable conceptual scaffolding for anyone choosing agent tools.
agent-architecture · taxonomy · coding-harnesses · orchestration · tool-selection
TIER 4
Mar 24, 2026
Pits Nvidia's open-sourced NemoClaw agent-security stack ('build your own') against OpenAI/Anthropic's consulting partnerships ('the model isn't the bottleneck, pay McKinsey/Accenture') as competing theories of how hard agent deployment is. Scores the five hardest production problems and finds a 4:1 ratio: four are well-understood engineering problems your team can handle, and one (domain-specific specification) genuinely needs outside help, which reshapes the build-or-buy decision. Concrete build-or-buy framework plus pre-signing prompts.
agent-deployment · build-vs-buy · nemoclaw · enterprise-ai · consulting