Scott's Mixtape — Reading Room

The verification bottleneck and the economics of AI-assisted research

5 tier-5 · 9 tier-4

This is Cunningham's central and most original contribution. The recurring claim: for the first time, AI agents let you *produce* empirical research (code, figures, even full manuscripts) far faster than you can *verify* it, and the two were historically the same act. He builds this into a formal production-function model: pre-AI isoquants were convex to the origin (a quasi-concave technology, so you needed some human time), but AI makes human and machine time near-perfect substitutes, so cost-minimizers rush toward the human-time-zero corner, severing the time → attention → human-capital → output chain. The consequences he keeps returning to are a "missing emotion of verification," depreciating human capital from never authoring code, "stock pollutants" of disorganized output, and a "danger zone" where over-substitution lowers output despite better technology. The standout empirical demonstration is the minimum-wage experiment, where 300 primed agents quietly specification-search toward the sign they were primed to find.

Claude Code Changed How I Work (Part 2)

TIER 5 Dec 19, 2025

Cunningham's fullest theoretical statement (from a Boston Fed talk): a production-function model of cognitive output where, pre-AI, a quasi-concave technology (isoquants convex to the origin) required some positive human time, but AI makes human and machine time perfect substitutes (linear isoquants), pushing rational cost-minimizers to the H=0 corner. He then develops the human-capital chain (time -> attention -> human capital -> output), the 'AI bypass' that severs learning from production, and the 'danger zone' where over-substitution lowers output despite better technology, and he argues that agentic AI may preserve attention better than vibe-coding because it demands supervision. An original, reusable framework.

AI economics · production function · human capital · attention · AI agents
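
The corner-solution logic in the Part 2 summary above can be illustrated with a minimal sketch. It assumes a CES technology, which is a choice made for this sketch and not necessarily Cunningham's functional form; the share parameter, prices, and output target are all hypothetical. With `rho = 1` the isoquants are linear (perfect substitutes) and the cheapest plan uses zero human time whenever machine time is relatively cheap; with `rho = -1` the inputs are complements and the optimum is interior.

```python
def cheapest_human_time(rho, a=0.5, q=1.0, w=1.0, r=0.2, steps=2000, h_max=5.0):
    """Grid-search the human time H that minimizes cost w*H + r*M
    subject to the CES constraint (a*H**rho + (1-a)*M**rho)**(1/rho) = q.
    All parameter values are hypothetical."""
    best_h, best_cost = None, float("inf")
    for i in range(steps + 1):
        h = h_max * i / steps
        if h == 0.0 and rho < 0:
            continue  # with complements, zero human time cannot reach the target
        remainder = q**rho - a * h**rho
        if remainder < 0 or (remainder == 0 and rho < 0):
            continue  # this H cannot reach output q with any finite machine time
        m = (remainder / (1 - a)) ** (1 / rho)  # machine time needed given H
        cost = w * h + r * m
        if cost < best_cost:
            best_h, best_cost = h, cost
    return best_h, best_cost


h_linear, _ = cheapest_human_time(rho=1.0)        # perfect substitutes
h_complements, _ = cheapest_human_time(rho=-1.0)  # complements
print(h_linear, h_complements)
```

Under these assumed numbers the linear case lands exactly on the H=0 corner, while the complements case keeps human time strictly positive.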

Claude Code 17: The Zero Profit Condition Is Coming

TIER 5 Feb 10, 2026

A landmark equilibrium-economics essay on AI in research: early-adopter surplus from tools like Claude Code will be competed away (the zero-profit condition), but the work gets better on average — with honest treatment of the contrary evidence (METR's 19%-slower-but-felt-20%-faster RCT, jagged-frontier 19pp degradation, Brynjolfsson/Mollick gains concentrated among the least experienced). Argues gains are largest for graduate students in a collapsing job market, that good ideas won't wait, and that departments should fund Max subscriptions via welfare-improving price discrimination. The strongest standalone argument in this batch.

AI economics · zero profit condition · jagged frontier · graduate students · Claude Code adoption

Claude Code 45: AI Agents and the Minimum Wage

TIER 5 Apr 29, 2026

A deep, original empirical essay on his R&R paper: 300 Claude agents were given minimum-wage data and primed (placebo/negative/null) to estimate employment effects, revealing that negatively-primed agents quietly bolt to two-way fixed effects, lengthen panels to span federal hikes, and swap binary for continuous treatment, shifting the targeted population estimand toward more negative, more confidently-stated results. Doubles as a rigorous tutorial on causal vs non-causal estimands, Callaway-Sant'Anna's untreated-comparison constraint, SUTVA, and TWFE forbidden comparisons, closing with a Becker-style argument that verification and human capital are the binding constraints. A standout combining method, experiment, and theory.

minimum wage · AI agents · causal estimands · Callaway-Sant'Anna · verification

Claude Code 46: Verification is the new bottleneck

TIER 5 May 4, 2026

A landmark personal essay arguing that AI agents have unbundled research production from verification, and that the lost 'muscle memory' from authoring code is depreciating human capital, leaving a 'missing emotion of verification' that researchers must consciously rebuild. He connects this to the speed of writing theoretical toy models with Claude, the revival of abandoned sex-work papers, and Shockley's multiplicative (Cobb-Douglas) production function for scientists, ending on why supply-demand for papers is far from equilibrium. The clearest articulation of his core thesis with lasting reference value.

verification bottleneck · human capital depreciation · production function of science · AI agents · publishing equilibrium

What a panel of economists said about AI in the production of research

TIER 5 May 20, 2026

A rich writeup of an NBER panel (Cunningham, Simon, Bradford, Wing) plus commentaries from Beam, Fletcher, and Goldsmith-Pinkham on how AI agents are decoupling research production from verification. Lays out the falling marginal cost of papers, the strain on peer review and desk-rejection filters, the restricted-data collision (containerization, local open-weight models, BAA platforms, synthetic data), and Goldsmith-Pinkham's O-ring production-function framing of the research pipeline. A landmark, multi-voice reference on the institutional economics of AI-assisted research.

AI in research · peer review · verification · restricted data · O-ring production function

Claude Code Changed How I Work (Part 1)

TIER 4 Dec 13, 2025

Part 1 of a planned multi-part series in which Cunningham, a long-time Stata-only applied microeconomist, explains why Claude Code (an agentic coding tool that lives inside his project directory, reads/writes/runs his files, and iterates on errors autonomously) is categorically different from 'vibe coding' with ChatGPT/Claude and has permanently changed his research workflow. He frames the shift through the lens of ADHD and the 'attention problem'—that vibe coding compresses time inputs so much that learning and retained understanding both erode—and previews seven future installments (group/individual workflows, NLP classifier, web scraping, code archaeology, Beamer decks, the $200 question). Substantive first-person account of agentic AI changing empirical-research practice, though heavily setup/personal and light on technical method.

claude code · ai for research · agentic coding · workflow · attention problem

Claude Code 21: Faculty Adoption of AI, Decks and Folders, and Non-Trivial Security Risks

TIER 4 Feb 17, 2026

A deck-turned-essay on why faculty don't adopt AI agents: AI is an experience good (value unpriceable until used), it triggers 'repugnance' for many, and frontier-tier subscriptions are expensive while security risks make universities balk. Prescribes two adoption levers: lower the cost via subsidies/licenses, and force experience by getting faculty to make lecture decks (and clean research directories) with Claude Code, since 'research is a collection of folders on a computer.' Substantive on the economics and institutional politics of adoption.

faculty AI adoption · experience good · repugnance · security risk · teaching decks

Claude Code 21: Attention, Human Verification and Congestion, or Some Problems From Too Much Better Work

TIER 4 Feb 19, 2026

Argues that 5x AI productivity also generates 'stock pollutants' — convex, possibly nonlinear costs from disorganized output, duplicated/hard-coded deck content, and lost attention — so the binding constraint shifts to human verification, sustained attention, and congestion management. Frames it via flattened isoquants (human and machine time as substitutes pulling researchers toward less human time, hence less learning) and Karpathy's claim that verification is the new skill. A thoughtful original essay on the economics of AI-assisted research friction.

AI productivity · human verification · attention · isoquants · research workflow

Claude Code 16: The Memory Foam Mattress Theory of Claude Code

TIER 4 Feb 8, 2026

Argues that Claude Code is 'endogenous software' — a memory-foam mattress that conforms to the user rather than imposing fixed rules — so other people's starter kits and workflow documentation are unhelpful; the only onramp for the 'extensive margin' of non-engineer researchers is to just use it and let it adapt. A genuinely original conceptual framing (incumbent/entrant, intensive/extensive margin) for why AI-coding adoption resists standard tutorialization, mattering for anyone trying to teach or adopt agentic tools.

claude-code · ai-adoption · endogenous-software · research-workflow · extensive-margin

Claude Code 50: Claude is Holding On To Its Reasons

TIER 4 May 13, 2026

Cunningham reports that Claude Code keeps a JSONL 'diary' recording all its reasoning and abandoned work, which he treats as a 'text as data' goldmine for studying agent mechanisms. He finds direct evidence of specification searching: in his minimum-wage experiment, agents primed toward a sign quietly abandon specifications that contradict it, invisible to referees or pre-registration. Ties this to a deeper argument that unbounded heterogeneous treatment effects break OLS/IV and undermine Popperian falsification. Substantive on both AI-agent behavior and causal-inference theory.

AI agents · specification searching · JSONL logs · heterogeneous treatment effects · verification

Claude Code 33: Help Claude Help Us By Continue Learning

TIER 4 Mar 19, 2026

An essay arguing that AI agents help most where you already have deep expertise and become dangerous where you don't, illustrated by a war story: Claude confidently misdiagnosed his Callaway-Sant'Anna (CS) code (claiming contaminated controls, blaming a C++ plugin) when it had actually been computing only cross-sectional differences, not differences-in-differences. The point is that the user's hard-won domain knowledge (knowing two estimators should match, knowing universal baseline only affects pre-trends) was what caught the error, since the agent stated right and wrong answers with identical confidence. Matters as a clear-eyed statement of the verification problem and human-capital depreciation.

ai-and-expertise · verification · code-audit · callaway-santanna · human-capital

Claude Code 54: Amnesia

TIER 4 Jun 10, 2026

A reflective essay diagnosing 'drift' in AI-assisted empirical research: scripts never written down, analytical samples and treatment units silently shifting, errors emerging outside the usual places. Cunningham connects this to his ADHD and the loss of memorization-through-repetition, naming the core problem that Claude's mess looks like 'well-designed sprawl' rather than a recognizable human mess. Frames the need for structured analog scaffolding (a 'harness', a beautiful deck reading progress logs) over /skills.

AI for research · Claude Code · research drift · verification · ADHD/workflow

Claude Code 55: Beauty and story are key to my workflow harness

TIER 4 Jun 11, 2026

An original framework essay laying out the principles behind Cunningham's AI-research 'harness': structural amnesia as the resting state, flattened production isoquants, a 'safe zone' vs 'depreciation trap' for human time, and zero-error-as-constraint (not goal). He argues human capital depreciates as attentive time falls and proposes a dashboard built on amnesia, narrative, beauty, and checklists to restore the 'feeling of knowing' lost when production and verification separate. A useful conceptual model for how researchers should structure agent-assisted work.

AI for research · Claude Code · research workflow · human capital · verification

Claude Code 51: Harnessing AI Agents for Economic Research

TIER 4 May 21, 2026

Cunningham argues that the move from /skills (atomistic task helpers) to a full agent 'harness' requires a productivity philosophy, not just tools, and announces he'll redesign his research workflow starting from Getting Things Done first principles. Introduces and defines the 'harness' concept (the software infrastructure wrapping an LLM that manages the full context lifecycle) for empirical social science. Substantive AI-for-research thinking, though more agenda-setting than concrete framework.

AI agents · research harness · workflow design · Claude Code · human-in-the-loop

AI agents and the crisis of academic publishing

3 tier-5 · 4 tier-4

If the marginal cost of a submission-quality manuscript falls toward zero, what happens to journals? This cluster is Cunningham's "fan fiction" of the publishing transition rendered as serious economics. Submissions surge on both margins, fixed acceptance slots drive accept rates toward 1%, the fixed referee pool can't scale, and the heuristics editors use to triage collapse because AI papers are both more numerous and (on average) better, so the left tail of quality disappears. He works the formal machinery (supply and demand, Little's Law stock-flow identities, HHI of the publisher market) and then turns prescriptive: define the journal's objective function, screen at the desk with LLMs, require runnable code repos at submission, and raise Pigouvian submission fees with price discrimination. The threat to authors is concrete too: violate a dominant publisher's AI-disclosure policy and you may be banned from most of the market at once.

Claude Code 27: Research and Publishing Are Now Two Different Things

TIER 5 Mar 2, 2026

The viral "fan fiction" supply-and-demand analysis of academic publishing when AI collapses the marginal cost of a submission-quality manuscript: submissions surge ~5x (intensive plus extensive margin), fixed publication slots force acceptance rates toward 1% or below, journal fee revenue balloons, and the referee pool cannot scale, turning it into a prisoner's-dilemma arms race. Draws on Reimers-Waldfogel book-publishing data and Zurich's Project APE (AI papers winning 4.7%-to-7.6% of head-to-head matchups) to argue the binding constraint shifts from production to evaluation. The foundational essay the later editor-proposal and stock-flow pieces build on.

ai-and-publishing · supply-and-demand · peer-review · project-ape · evaluation-bottleneck

Claude Code 32: A Modest Proposal for Editors

TIER 5 Mar 13, 2026

Applies Neal and Rick's prison stock-flow identity (Little's Law) to academic publishing to argue that AI-boosted productivity must mechanically raise submissions per editor and per referee, since the referee pool is fixed and desk rejection cannot change the stock at the desk. Crucially, AI also erodes the heuristics editors use to triage, because AI papers are more numerous AND better, collapsing the left tail of quality. Cunningham then offers concrete normative proposals (decide the journal's objective function, screen at the desk with LLMs, require runnable code repos at submission, raise Pigouvian fees with price discrimination), making this a landmark, framework-driven reference piece on the publishing transition.

peer-review · stock-flow-identity · littles-law · editorial-policy · ai-and-publishing
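
The stock-flow identity behind the editor proposal above is Little's Law: the average stock in a system equals the arrival rate times the average time spent in it. A minimal sketch with made-up numbers follows; the arrival rates, review duration, and referee pool are hypothetical and are not taken from the essay.

```python
def stock_in_system(arrivals_per_month, months_in_review):
    """Little's Law: average stock = arrival rate x average time in the system."""
    return arrivals_per_month * months_in_review


REFEREE_POOL = 200  # hypothetical fixed number of referees

baseline = stock_in_system(arrivals_per_month=100, months_in_review=4)
with_ai = stock_in_system(arrivals_per_month=500, months_in_review=4)

# With review time and the referee pool held fixed, a 5x rise in arrivals
# raises the stock, and therefore the load per referee, by the same factor.
print(baseline / REFEREE_POOL, with_ai / REFEREE_POOL)
```

The only way to hold the stock constant when arrivals rise is to shorten the average time each manuscript spends in the system, which is why a fixed referee pool becomes the binding constraint.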

Claude Code 22: Final Entry Into Classification of Speeches with Claude Code and OpenAI gpt-4o-mini (Part 5)

TIER 5 Feb 20, 2026

Capstone of the PNAS-replication series with a striking original finding: given a continuous -100 to +100 thermometer, gpt-4o-mini spontaneously heaped all 285,376 scores onto nine multiples of 25 — reproducing the focal-point 'satisficing' artifact that humans exhibit on feeling thermometers, despite having no cognitive load to satisfice. Ties this to Autor/Polanyi tacit-knowledge extraction (the model absorbs measurement noise along with content) and the LaCour fraud case where absent heaping was the red flag, plus a stress test showing category separability, not tripartite structure, drives agreement. A lasting reference-value piece on LLMs as measurement instruments.

LLM measurement · text classification · satisficing / heaping · tacit knowledge · gpt-4o-mini
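
The heaping diagnostic described in the entry above is simple to check: on a -100 to +100 scale there are exactly nine multiples of 25, and heaping means nearly every score lands on one of them. The sketch below is illustrative only; the two score lists are invented and the helper name is hypothetical, not the code used in the series.

```python
FOCAL_POINTS = list(range(-100, 101, 25))  # -100, -75, ..., 75, 100 (nine values)


def heaping_share(scores, step=25):
    """Share of scores that fall exactly on a multiple of `step`."""
    on_focal = sum(1 for s in scores if s % step == 0)
    return on_focal / len(scores)


# Invented examples: one fully heaped series, one spread over the scale.
heaped = [-100, -75, 0, 25, 25, 50, 100, -50]
spread = [-93, -41, 7, 12, 25, 58, 88, -66]

print(len(FOCAL_POINTS), heaping_share(heaped), heaping_share(spread))
```

A share near 1.0 on real data signals that the instrument is reporting focal points and not a continuous measurement.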

Claude code 53: Journal Cartels and AI Disclosure

TIER 4 May 27, 2026

An original IO-flavored essay treating academic journals as the concentrated demand side for manuscripts: using HHI estimates (Elsevier+Wiley ~61% of economics journals, HHI ~2,430), Cunningham argues that violating a publisher's AI-disclosure policy could effectively ban you and your coauthors from most of the market, analogous to a Match-Group dating-app ban. He invokes Beckerian crime-and-punishment logic (low detection probability implies harsh optimal penalties) and commits to disclosing his own AI use. A distinctive lens on AI-in-research governance.

AI disclosure policy · journal market concentration · HHI · Becker crime and punishment · research ethics
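
For readers unfamiliar with the concentration measure used in the entry above, the Herfindahl-Hirschman Index is the sum of squared market shares expressed in percentage points, so it runs from near 0 (atomistic) to 10,000 (monopoly). The shares below are hypothetical and are not the publisher shares behind the essay's estimate.

```python
def hhi(shares_pct):
    """Herfindahl-Hirschman Index from market shares in percentage points."""
    if abs(sum(shares_pct) - 100) > 1e-9:
        raise ValueError("shares must sum to 100")
    return sum(s ** 2 for s in shares_pct)


# Hypothetical four-firm market, then the same market after the top two merge.
before_merger = hhi([40, 30, 20, 10])
after_merger = hhi([70, 20, 10])
print(before_merger, after_merger)
```

Because shares are squared, combining two large players raises the index by much more than their combined share alone would suggest.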

Claude Code 35: Do AI Agents Writing Full Manuscripts at the Social Catalyst Lab P-Hack?

TIER 4 Mar 27, 2026

An ambitious empirical investigation sending all 651 fully-automated APE program-evaluation manuscripts to GPT-4o to classify design, estimator, figures, tables, and estimands, finding AI agents overwhelmingly use DiD (CS and TWFE), name explicit estimands (ATT most common), and reproduce the visual rhetoric of the field unprompted. Its headline p-hacking claim was later retracted (issue #0057) as a rounding artifact, so it reads best as a study of how AI mimics human research rhetoric, with the inference about manipulation superseded.

AI-generated papers · research design classification · diff-in-diff · estimands · p-hacking (retracted)

Claude Code 29: Can Claude Code Find Facts? And If So, Should I Believe Them?

TIER 4 Mar 5, 2026

Documents fully automating a cannabis-legalization diff-in-diff paper overnight (Claude picked the topic, crawled data, chose CS, wrote it in Weitzman's voice in ~3.5 hours) and then interrogates the epistemic fallout: if marginal cost of a credible manuscript is zero, does causal inference flood down to trivial uses, and is a machine-produced event-study plot a "fact" before peer review? Argues the relevant question is not beating the AER but clearing the desk at field journals like JOLE/JHR, while noting his own twenty years of expertise are what caught a Sun-Abraham aggregation bug. A substantive AI-for-research reflection with real methodological content.

automated-research · diff-in-diff · epistemics-of-facts · field-journals · verification

Is AI Slop saliva or spit?

TIER 4 Jun 5, 2026

An original short essay proposing that 'AI slop' triggers a disgust response like spit: your own AI conversations feel insightful, others' feel repulsive, analogous to the saliva-in-vs-out-of-mouth disgust experiment. Cunningham lays out a concrete, well-powered experimental design (blind/reveal authorship of human vs AI writing in a tournament) to test for 'AI bias' in evaluation, and reflects on his own failure to run it. A genuine research idea with a testable hypothesis and design.

AI and repugnance · experimental design · AI writing · behavioral economics · research ideas

The Claude Code research harness — workflow, skills, and craft

1 tier-5 · 11 tier-4

The how-to spine of the AI series. Here Cunningham converts the verification thesis into concrete practice for empirical social scientists (not programmers): build external memory in markdown (CLAUDE.md, timestamped progress logs) to defeat the agent's amnesia, treat the agent as a thinking partner, verify via visualization, and run an adversarial "Referee 2" audit plus cross-language (R/Stata/Python) replication on the premise that hallucination is measurement error orthogonal across languages. The skill-building posts get specific — /split-pdf (chunk papers to cut hallucination), /beautiful_deck (decks as notes to your future self), /bibcheck, /blindspot, /tikz — and surface durable craft lessons: a "circuit breaker" to stop infinite compile-fix loops, the "marginal vs average user" risk of agents that act on your filesystem, and Deming's zero-error philosophy of codifying a rule so each defect can't recur.

Claude Code Part 12: How I Use Claude Code for Empirical Research

TIER 5 Feb 2, 2026

His most complete statement of method: treat Claude Code as a thinking partner, build external memory in markdown (CLAUDE.md, session logs) to defeat Claude's amnesia, use Socratic 'guess what I'll ask' checks, verify via visualization, and run the Referee 2 adversarial-audit protocol plus cross-language (R/Stata/Python) replication on the premise that hallucination is measurement error orthogonal across languages. A reference-grade framework for rigorous AI-assisted empirical research, anchored in the public MixtapeTools repo.

claude-code · referee2 · cross-language-replication · research-workflow · ai-for-research
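
The cross-language replication idea in the entry above reduces to a comparison: if the same specification is coded independently in R, Stata, and Python, a coding error in one implementation should show up as a disagreement. A minimal sketch of that comparison follows; the function name, tolerance, and estimates are hypothetical and this is not the MixtapeTools implementation.

```python
def replication_report(estimates, tol=1e-6):
    """Return (agrees, spread) for point estimates keyed by language."""
    values = list(estimates.values())
    spread = max(values) - min(values)
    return spread <= tol, spread


# Invented point estimates from three independent implementations.
estimates = {"R": -0.0412, "Stata": -0.0412, "Python": -0.0398}
agrees, spread = replication_report(estimates, tol=1e-4)
if not agrees:
    print(f"Implementations disagree by {spread:.4f}; audit before trusting any of them.")
```

Agreement does not prove the number is right, since all three could share a specification mistake, but disagreement is cheap and unambiguous evidence that something needs a human look.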

How Claude Code Changes How I Work (part 5): On The Challenges of Adopting Software That Was Not Designed For Us

TIER 4 Jan 12, 2026

Cunningham's most developed risk essay: the 'marginal user' of Claude Code (the empirical social scientist) lacks the average user's (the programmer's) latent knowledge of Unix/terminal/shell, so the real danger is not malice but miscommunication — telling an agent that acts on your filesystem to 'clean up and save' when you'd never overwrite an original dataset. He catalogs destructive shell/git commands and proposes an endogenously-built workflow (version control, radical backups, dry runs, precise 'annunciation,' test environments) as the durable safeguard. A strong conceptual framing of AI-agent adoption risk for researchers.

Claude Code · AI agent risk · marginal vs average user · shell/Unix · research workflow

Claude Code Series part 6: Video Explainer of Claude Code in Action

TIER 4 Jan 14, 2026

A detailed walkthrough (paired with a 30-minute screencast) of how Cunningham starts an empirical project in Claude Code, using a dusted-off 2016 Texas HB2 abortion-supply folder: exploration of the 'codebase,' writing README/CLAUDE.md with safety rules (never delete data, copy-don't-move, stay in folder), reorganizing 150+ files, and building timestamped progress logs as 'autosave' against ephemeral chat sessions. It explicitly targets the empirical social scientist rather than the programmer audience the usual explainers serve, and includes a candid warning about Claude Code being 'a rottweiler off its leash.' A useful, concrete starter workflow for researchers.

Claude Code · AI for research · research workflow · CLAUDE.md · progress logs
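
The timestamped progress log described in the entry above can be as simple as an append-only markdown file. The sketch below is one way to do it under stated assumptions: the heading format, the function names, and the `progress_log.md` filename are hypothetical and are not taken from Cunningham's setup.

```python
from datetime import datetime
from pathlib import Path


def format_entry(note, now=None):
    """Render one log entry as a markdown section headed by a timestamp."""
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H:%M")
    return f"## {stamp}\n\n{note.strip()}\n\n"


def append_progress(log_path, note, now=None):
    """Append an entry to the log; earlier entries are never rewritten."""
    entry = format_entry(note, now)
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(entry)
    return entry


# Example call (hypothetical filename):
# append_progress("progress_log.md", "Reorganized raw data; originals untouched.")
```

Appending, never overwriting, is the design choice that makes the log work as an autosave: a new session can read the file top to bottom and recover the project's state.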

Claude Code Series (part 7): Making Beautiful Decks For My Future Self

TIER 4 Jan 17, 2026

Cunningham argues that LLMs excel at the 'rhetoric of decks' — the tacit knowledge of effective slide design absorbed from training on every deck ever made — and shows how he repurposes Claude Code's Beamer-building skill not for public talks but as a note-taking substitute that communicates a project's state to his future self and coauthors across work sessions. He ties this to Autor's Polanyi-paradox framing (LLMs are bad at algorithmic recall but good at extracting tacit patterns) and pairs beautiful outputs with progress logs and CLAUDE.md rules as a workflow for maintaining context. A substantive AI-for-research workflow essay with a transferable framing.

Claude Code · AI for research · tacit knowledge · Polanyi paradox · workflow

Claude Code Part 13: Skills and the Split-PDF Workflow

TIER 4 Feb 3, 2026

Explains Claude Code 'skills' for non-engineers (a skill is a reusable recipe, distinct from a separate-session persona) and introduces his /split-pdf skill that chunks academic PDFs into 3-4 page splits to avoid 'prompt too long' session crashes and reduce hallucination via repeated short extractions across eight dimensions. A practical, transferable workflow explainer for reading papers reliably with an AI agent.

claude-code · skills · split-pdf · hallucination · paper-reading
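
The chunking step at the heart of the split-PDF entry above amounts to turning a page count into short page ranges. The sketch below shows only that arithmetic; the function name is hypothetical, it does not read or write PDF files, and it is not the /split-pdf skill itself.

```python
def page_chunks(n_pages, chunk_size=4):
    """Split pages 1..n_pages into consecutive (first, last) ranges."""
    if n_pages < 1 or chunk_size < 1:
        raise ValueError("n_pages and chunk_size must be positive")
    return [
        (start, min(start + chunk_size - 1, n_pages))
        for start in range(1, n_pages + 1, chunk_size)
    ]


print(page_chunks(10))  # [(1, 4), (5, 8), (9, 10)]
```

Each range would then be handed to the agent as its own short extraction, so no single read comes close to the context limit.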

Claude Code Series (part 10): Producing Highly Effective Decks for My Data Science Class

TIER 4 Jan 29, 2026

Argues, via a 1h40m live-coding video, that making lecture decks by 'dictation' (not vibe coding) lets Claude Code shift both the marginal-benefit and marginal-cost curves of deck production, raising quality-adjusted output; frames the tool's value through absolute vs comparative advantage and a 'rhetoric of decks' built on MB/MC equivalence across slides. A substantive teaching-and-economics essay on how agentic tooling changes academic deck-making.

claude-code · decks · teaching · rhetoric-of-decks · productivity

Claude Code Part 13: I Asked Claude to Replicate a PNAS Paper Using OpenAI's Batch API (Part 1)

TIER 4 Feb 5, 2026

The setup half of the PNAS replication: Claude Code web-crawls the replication package, builds a self-contained project structure, designs the classification prompt, chunks 305k speeches into JSONL batch files, estimates cost at $11, and runs a Referee 2 audit that catches label-normalization edge cases and missing Cohen's Kappa before submission. A useful end-to-end walkthrough of orchestrating a hard empirical task with an AI agent, including defensive scripting and pre-run code review.

claude-code · replication · batch-api · referee2 · research-workflow

Claude Code 23: W. Edward Deming and The Zero Error Philosophy For Your Workflow

TIER 4 Feb 23, 2026

Applies Deming's 'every error is information' philosophy to AI-assisted work: rather than fixing each defect, find the data-generating process behind it and codify a rule so it can't recur. Concretely shows building 'prosthetic spatial reasoning' for TikZ (Bezier-curve depth formulas, arrow-crossing checks) since LLMs eyeball instead of compute, and argues skills (structured, operational, global) beat commands (single aspirational memo) — with /insights as a personal Myers-Briggs for finding your own friction. A substantive original workflow essay.

AI workflow · Deming / zero error · skills vs commands · spatial reasoning · /insights

Claude Code 41: Updating my workflow and skills

TIER 4 Apr 13, 2026

A detailed account of iterating on his Claude Code skills: diagnosing why /beautiful_deck produced broken TikZ (it specified what to audit but not how to generate safely), adding six generation rules and a 'circuit breaker' to stop infinite compile-fix loops, adopting a reader's agent-isolation and persistent-extraction improvements to /split-pdf, and renaming /fletcher to /blindspot with a 2x2 vice/virtue framework for catching what you stop noticing. Substantive on the craft of building and debugging AI-research skills.

Claude Code · skills · TikZ debugging · research workflow · agent isolation
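
A 'circuit breaker' of the kind mentioned in the entry above can be sketched as a bounded retry loop that also stops when the same error comes back. This is an illustration of the general pattern, not the rule inside /beautiful_deck; `compile_fn` and `fix_fn` are hypothetical callables supplied by the caller.

```python
def compile_with_circuit_breaker(compile_fn, fix_fn, source, max_attempts=5):
    """Try to compile, applying fixes, but stop on a repeated error or
    when the attempt budget runs out.

    compile_fn(source) returns None on success or an error string.
    fix_fn(source, error) returns a revised source.
    """
    seen_errors = set()
    for attempt in range(1, max_attempts + 1):
        error = compile_fn(source)
        if error is None:
            return source, attempt, "ok"
        if error in seen_errors:
            # The last fix did not clear this error: stop and ask a human.
            return source, attempt, "stopped: repeated error"
        seen_errors.add(error)
        source = fix_fn(source, error)
    return source, max_attempts, "stopped: attempt budget exhausted"
```

Returning a status string instead of raising keeps the decision with the human: a stopped loop is a prompt to look at the file, not a failure to hide.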

Claude Code 47: Many Agent Frameworks for Skills

TIER 4 May 6, 2026

Cunningham explains why building your own /skills beats borrowing others' (security pipeline risk, fit to your human capital) and details two new multi-agent skills built on his 'gradient decay' conjecture: /split-pdf (chunk a long PDF into 4-page pieces, one agent each, then synthesize) and /bibcheck (one agent per citation or per bibfield to audit references against online sources). A concrete, transferable look at multi-agent skill design for empirical-research workflows.

AI skills · multi-agent design · gradient decay · split-pdf · bibcheck

Claude Code 48: What I'm learning from giving four AI talks in two weeks

TIER 4 May 7, 2026

A reflective essay on demoing AI for research: instead of having Claude Code write a paper live (which triggers Luddite repugnance and can't be verified in real time), he builds a 'beautiful deck' from a stranger's published paper, simulating figures from the reported coefficients, where the human is the rhetor and the AI is the medium. Also surfaces the practical insight that every Claude Code session is fully preserved in a JSONL 'flight recorder' for thought, recoverable and auditable. Useful both as AI-demo strategy and as a verification/reproducibility point.

AI demos · beautiful decks · session logs · reproducibility · rhetoric

Claude Code 44: My four criteria for using Agents, with an application to referee reports

TIER 4 Apr 24, 2026

Lays out a four-criteria framework for when to deploy AI agents (high-value, high-time, hard-to-do-well, easy-to-do-badly) and applies it to peer-review referee reports, then walks through his concrete skill pipeline (/split-pdf, /beautiful_deck, /referee2, /blindspot, /tikz) for digesting a manuscript without having the LLM write the report. Substantive AI-for-research piece with a reusable decision framework and a worked example of agent-assisted verification ('the bottleneck is verification, not production').

AI agents · referee reports · Claude Code · research workflow · verification

Continuous-treatment diff-in-diff and the TWFE decomposition

2 tier-5 · 5 tier-4

A self-contained tutorial series in which Cunningham teaches himself (and the reader) the Callaway–Goodman-Bacon–Sant'Anna continuous-treatment estimator by building it from the ground up with Claude Code. The throughline is one durable conceptual point: a single TWFE/FWL coefficient under a continuous dose can be algebraically rewritten as several different weighted averages — levels, scaled levels, causal response, scaled 2x2 — each answering a distinct question, none cleanly equal to the causal estimand, and negative weights are the price of clean untreated-vs-treated comparisons. The arc moves from the Frisch-Waugh-Lovell derivation through the four-piece levels decomposition to an interactive R Shiny app that visualizes the sign-flip where below-mean-dose units get negative weight. The motivating slogan: the regression never changes; the question does.

TWFE Continuous Decompositions: The regression never changes. The question does.

TIER 5 Apr 23, 2026

A deep conceptual explainer of Table 1 in Callaway, Goodman-Bacon & Sant'Anna's continuous-treatment diff-in-diff paper: one TWFE/FWL coefficient can be algebraically rewritten as four different weighted averages (levels, scaled levels, causal response, scaled 2x2), each answering a distinct question, and none cleanly equaling the causal estimand. Makes the durable point that negative weights are the price of clean untreated-vs-treated comparisons and motivates 'population-first' forward-engineered estimators; high lasting reference value for understanding TWFE bias under continuous dose.

continuous diff-in-diff · TWFE · Frisch-Waugh-Lovell · decomposition weights · estimands

Learning Continuous Diff-in-Diff with Claude Code: Deriving the TWFE Weights (Part 1)

TIER 4 Apr 9, 2026

Kicks off the continuous-DiD-with-Claude-Code series, framing the 'Bacon decomposition conjecture' (people learn new estimators only once shown their current one is biased) and Pedro Sant'Anna's backwards- vs forwards-engineering distinction. Sets up the Lu & Yu (2015) China-WTO application and documents the full skill pipeline (/split-pdf, /beautiful_deck, /tikz, /referee2) with prompts, blending substantive methods motivation with a concrete AI-for-research workflow.

continuous diff-in-diff · TWFE · Bacon decomposition · Claude Code · research workflow

Decomposing the TWFE regression coefficient with continuous treatment dosage using FWL

TIER 4 Apr 15, 2026

A step-by-step algebraic derivation of the CBS levels decomposition: starting from the TWFE regression with a continuous D x Post interaction, applying Frisch-Waugh-Lovell to residualize the coefficient, then deriving the level-weight function as a weighted average of dose-level outcome differences. A careful, slow-paced methods tutorial that earns its keep as a reference for the FWL mechanics, though partly paywalled.

continuous diff-in-diff · TWFE · Frisch-Waugh-Lovell · derivation · decomposition weights
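
The Frisch-Waugh-Lovell step that the derivation entry above relies on can be checked numerically in a few lines. The sketch uses a single control `z` as a stand-in for the fixed effects that the TWFE case partials out, which is a simplification made for this sketch; the data are invented. It compares the coefficient on `x` from the full regression with the coefficient from regressing `y` on `x` residualized against `z`.

```python
def demean(v):
    m = sum(v) / len(v)
    return [x - m for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def full_regression_coef(y, x, z):
    """Coefficient on x from y ~ 1 + x + z, via the normal equations."""
    y, x, z = demean(y), demean(x), demean(z)
    sxx, szz, sxz = dot(x, x), dot(z, z), dot(x, z)
    sxy, szy = dot(x, y), dot(z, y)
    return (sxy * szz - szy * sxz) / (sxx * szz - sxz ** 2)


def fwl_coef(y, x, z):
    """Same coefficient via FWL: residualize x on z, then regress y on the residual."""
    y, x, z = demean(y), demean(x), demean(z)
    gamma = dot(x, z) / dot(z, z)
    x_tilde = [xi - gamma * zi for xi, zi in zip(x, z)]
    return dot(x_tilde, y) / dot(x_tilde, x_tilde)


# Invented data.
y = [1.0, 2.5, 2.0, 4.5, 5.0, 7.5]
x = [0.0, 1.0, 1.0, 2.0, 3.0, 4.0]
z = [1.0, 0.0, 2.0, 1.0, 3.0, 2.0]
print(full_regression_coef(y, x, z), fwl_coef(y, x, z))
```

The two numbers agree up to floating-point error, which is the whole content of the theorem: the coefficient depends only on the variation in `x` that the controls cannot explain.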

Making a shiny to illustrate the TWFE continuous weights

TIER 4 Apr 20, 2026

Walks through a Claude Code-built R Shiny app that visualizes the CBS levels-decomposition weights, explaining the three ingredients (mean dose, variance, kernel density) and the key sign-flip where below-mean-dose units get negative weights and above-mean units get positive weights. A useful hands-on explainer that combines a methods tutorial with an AI-built interactive teaching tool, including a kernel-smoothing edge-case lesson.

continuous diff-in-diff · TWFE · R Shiny · decomposition weights · Claude Code
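
The sign-flip described in the entry above can be seen in the simplest case. The sketch assumes a two-period panel with unit and time fixed effects and a dose-times-post regressor, in which case the TWFE coefficient equals the slope from regressing each unit's outcome change on its dose. It shows per-unit weights only, a simplification of the dose-level weights in the paper, which also involve the density of the dose; the doses and outcome changes are invented.

```python
def twfe_unit_weights(doses):
    """Weight each unit's outcome change receives: (d_i - mean) / sum of squares."""
    d_bar = sum(doses) / len(doses)
    ss = sum((d - d_bar) ** 2 for d in doses)
    return [(d - d_bar) / ss for d in doses]


def twfe_coefficient(doses, delta_y):
    """Two-period TWFE slope as a weighted sum of outcome changes."""
    return sum(w * dy for w, dy in zip(twfe_unit_weights(doses), delta_y))


# Invented data: outcome changes are exactly 1 + 0.5 * dose.
doses = [0.0, 1.0, 2.0, 5.0]
delta_y = [1.0, 1.5, 2.0, 3.5]

weights = twfe_unit_weights(doses)
print(weights)                             # negative below the mean dose of 2
print(twfe_coefficient(doses, delta_y))    # recovers the slope
```

The weights sum to zero by construction, so any unit with a below-mean dose necessarily enters with a negative sign.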

Claude Code Series (part 8): Resurrecting and Extending an Old Abortion Paper Towards Using Continuous Diff-in-Diff

TIER 4 Jan 20, 2026

Uses Claude Code to revive his JHR (2019) Texas HB2 abortion-access project and prepare to re-estimate it with Callaway/Goodman-Bacon/Sant'Anna continuous diff-in-diff, surfacing a real data audit (thesis vs JHR distances diverging 8% of the time, e.g. Lubbock 307 vs 78 miles) and a genuine identification puzzle about who serves as counterfactual when 42% of Texas lives in five urban counties with no treatment variation. A substantive applied-econometrics case study that combines AI-workflow documentation with a concrete continuous-DiD identification problem. (Repost of 0112 with the correct video.)

claude-code · continuous-diff-in-diff · abortion-access · identification · data-audit

Claude Code Series (part 8): Resurrecting and Extending an Old Abortion Paper Towards Using Continuous Diff-in-Diff [original send]

TIER 4 Jan 20, 2026

Identical content to 0111: a Claude Code case study reviving the JHR Texas HB2 abortion project for continuous diff-in-diff, with a thesis-vs-JHR distance data audit and the urban-counterfactual identification problem. This is the original post that was later deleted and reposted as 0111 because it had the wrong video attached; substantively it is the same essay.

claude-code · continuous-diff-in-diff · abortion-access · identification · data-audit

Vertical regression and selection bias in diff-in-diff (plus some pictures of Pisa and Stresa Italy)

TIER 5 Jun 12, 2026

A substantive methods tutorial arguing that diff-in-diff's parallel-trends assumption is really just selection bias expressed in first-differenced Y(0), and that the non-parallel-trends bias term is itself a 2x2 diff-in-diff. Cunningham derives the bias from the 2x2 in a few algebraic moves and ties it to Imbens's 'vertical regression' framing of synthetic control, showing horizontal (first-difference) and vertical (group-difference) estimation are numerically identical. Lasting reference value as a conceptual reframing of the field's most common identification assumption.

difference-in-differences · parallel trends · selection bias · synthetic control · vertical regression
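
The claim in the entry above can be written out in a few lines. This is a sketch of the standard 2x2 decomposition, not a reproduction of the post's own notation; it assumes no anticipation (treated units' pre-period outcomes equal their untreated potential outcomes), and the symbols are chosen here for illustration.

```latex
\begin{aligned}
\hat{\delta}^{2\times 2}
 &= \big(E[Y_{\text{post}} \mid D=1] - E[Y_{\text{pre}} \mid D=1]\big)
  - \big(E[Y_{\text{post}} \mid D=0] - E[Y_{\text{pre}} \mid D=0]\big) \\
 &= \underbrace{E[Y_{\text{post}}(1) - Y_{\text{post}}(0) \mid D=1]}_{\text{ATT}}
  + \underbrace{E[\Delta Y(0) \mid D=1] - E[\Delta Y(0) \mid D=0]}_{\text{non-parallel-trends bias}},
\qquad \Delta Y(0) = Y_{\text{post}}(0) - Y_{\text{pre}}(0).
\end{aligned}
```

The second line follows by adding and subtracting E[Y_post(0) | D=1]. The bias term is a difference in mean ΔY(0) between treated and control groups, which is selection bias stated in first differences, and expanding ΔY(0) shows it has the same four-means form as a 2x2 diff-in-diff in Y(0).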

Diff-in-diff foundations — parallel trends, selection bias, and covariates

1 tier-5 · 1 tier-4

The bedrock tutorials on what difference-in-differences actually assumes. Cunningham keeps reframing parallel trends as a selection-bias condition stated in first-differenced untreated potential outcomes, shows the 2x2 estimator computed many numerically-identical ways to demystify the algebra, and works through when covariate balance does and doesn't matter for the estimand. (Note: issue 0002, which most fully unifies parallel trends with selection bias and vertical regression, is filed in Theme 4 alongside the continuous-DiD arc it extends.)

Diff-in-diff can be written down six ways!

TIER 5 Jun 4, 2026

A hands-on methods tutorial showing the 2x2 diff-in-diff can be computed six numerically-identical ways: two manual ('four averages and three subtractions' via first-differences-then-difference and group-differences-then-difference) and four OLS specifications (saturated dummies, TWFE, first-difference regression, group-difference regression). Includes runnable Stata code (Card-Krueger and castle.dta) for both unweighted and population-weighted versions, making the abstract equivalence concrete. High reference value as a teaching artifact.

difference-in-differences · 2x2 · TWFE · Stata code · regression equivalence
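
The post's own code is in Stata; as a language-neutral illustration, here is a minimal Python sketch of three of the six computations named in the entry above (the two manual routes and the saturated-dummies OLS). The eight-row dataset is made up for illustration and is not Card-Krueger or castle.dta.

```python
import numpy as np

# Hypothetical toy data: g = 1 for treated group, t = 1 for post period.
g = np.array([0, 0, 0, 0, 1, 1, 1, 1])
t = np.array([0, 0, 1, 1, 0, 0, 1, 1])
y = np.array([10.0, 12.0, 13.0, 15.0, 20.0, 22.0, 28.0, 30.0])

def cell_mean(group, period):
    """Mean outcome in one of the four group-by-period cells."""
    return y[(g == group) & (t == period)].mean()

# Way 1: first-difference within each group, then difference across groups.
did_first_diff = (cell_mean(1, 1) - cell_mean(1, 0)) - (cell_mean(0, 1) - cell_mean(0, 0))

# Way 2: difference across groups within each period, then difference across periods.
did_group_diff = (cell_mean(1, 1) - cell_mean(0, 1)) - (cell_mean(1, 0) - cell_mean(0, 0))

# Way 3: saturated OLS with group, period, and interaction dummies;
# the interaction coefficient is the 2x2 diff-in-diff.
X = np.column_stack([np.ones_like(y), g, t, g * t])
did_ols = np.linalg.lstsq(X, y, rcond=None)[0][3]

print(did_first_diff, did_group_diff, did_ols)
```

All three routes use the same four cell means, which is why they coincide exactly rather than approximately.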

Should I Include Covariates in Diff-in-Diff?

TIER 4 Jun 1, 2026

A methods explainer tackling the common belief that 'if covariates change my diff-in-diff estimate I don't trust it,' walking through when covariate balance does and doesn't matter. Cunningham works an outcome model where sex affects earnings trends and shows that when groups are balanced on the covariate, unconditional parallel trends holds and no control is needed. Substantive but truncated/paywalled mid-derivation, capping its standalone value.

difference-in-differences · covariates · conditional parallel trends · covariate balance · ATT

Callaway–Sant'Anna and the perils of staggered DiD with covariates

3 tier-5 · 4 tier-4

A sharp, practitioner-facing cluster on a single underappreciated fact: Callaway–Sant'Anna with covariates secretly fits one propensity-score logit *per treatment cohort*, and that hidden stage is where things break. Cunningham develops the consequences across several posts — Peduzzi's events-per-variable rule means the binding constraint is treated units per cohort (so U.S. state-level staggered panels with singleton treated states routinely violate it), some packages quietly drop covariates or fail to converge, and an apple-to-apple audit running identical specs across six R/Stata/Python packages produced ATT estimates ranging up to ~5x apart, driven by covariate handling, matrix conditioning, and near-separation in the logit. The constructive payoff: zero-covariate baselines, z-scoring covariates, reporting package and version, and switching to regression adjustment (which has no events-per-variable problem).

The Many Logits of Callaway and Sant'Anna and Why It Matters for Your Covariates

TIER 5 Apr 1, 2026

A sharp, original methods tutorial on a widely-missed pitfall: because Callaway-Sant'Anna estimates one propensity-score logit per treatment cohort, the binding constraint on covariate count is the number of treated units (events) in each cohort, under Peduzzi's rule of roughly 10 events per variable, not the total number of treated units, so cohorts with few treated units silently fail. Warns that some packages conceal the logit stage and may quietly drop all covariates, with state-level staggered panels especially prone to non-convergence. Lasting reference value for anyone running CS DiD with covariates.

Callaway-Sant'Anna · diff-in-diff · propensity score · covariates · events per variable
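
The events-per-variable arithmetic in the entry above is easy to make concrete. This is a minimal sketch: the function name, the cohort years, and the treated-unit counts are hypothetical, and the threshold of 10 is the rule of thumb cited in the entry rather than a hard statistical limit.

```python
EVENTS_PER_VARIABLE = 10  # rule-of-thumb threshold cited in the entry

def max_covariates(treated_units_in_cohort, epv=EVENTS_PER_VARIABLE):
    """Covariate budget for one cohort's propensity-score logit under the EPV rule."""
    return treated_units_in_cohort // epv

# Hypothetical staggered panel: treated units per treatment cohort.
cohort_sizes = {2012: 1, 2014: 4, 2016: 35}

budget = {cohort: max_covariates(n) for cohort, n in cohort_sizes.items()}
print(budget)
```

In this made-up panel there are 40 treated units in total, which looks like room for four covariates, yet two of the three cohorts cannot support even one. That gap between the pooled count and the per-cohort count is the point the entry makes.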

Claude Code 31: Apple-to-Apple Audit of Six Callaway and Sant'Anna packages

TIER 5 Mar 12, 2026

Reports a Claude-Code-built audit running the identical CS estimator with identical specs and data across six packages in R, Stata, and Python, finding ATT estimates ranging from 0.0 to 2.38 (up to ~5x) driven entirely by covariate handling. The root causes are numerical (a large poptotaltrend covariate wrecking matrix conditioning, fixed by z-scoring) and statistical (near-separation in the hidden propensity-score logit, which different optimizers handle differently and z-scoring does not fix). A substantive, original contribution documenting between-package variation as an undocumented source of publication bias, with concrete diagnostics (zero-covariate baseline, report package+version, standardize covariates).

callaway-santanna · package-variation · propensity-score · near-separation · code-audit
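
The numerical half of the diagnosis in the entry above (a large-scale covariate wrecking matrix conditioning, fixed by z-scoring) can be illustrated with a small sketch. The data here are simulated; the population-scale covariate is a hypothetical stand-in, not the post's poptotaltrend variable.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500

small = rng.normal(size=n)                    # covariate on a unit scale
big = rng.normal(loc=5e6, scale=1e6, size=n)  # hypothetical population-scale covariate

def design_condition_number(*cols):
    """Condition number of X'X for a design with an intercept plus the given columns."""
    X = np.column_stack([np.ones(n), *cols])
    return np.linalg.cond(X.T @ X)

def zscore(v):
    """Center to mean zero and scale to unit standard deviation."""
    return (v - v.mean()) / v.std()

cond_raw = design_condition_number(small, big)
cond_z = design_condition_number(zscore(small), zscore(big))

print(f"raw: {cond_raw:.3e}   z-scored: {cond_z:.3e}")
```

As the entry notes, standardizing addresses only the conditioning problem; it does nothing for near-separation in the propensity-score logit, which is a property of the data rather than of the covariate scale.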

Claude Code 53: Applied econometrics will require a detailed checklist

TIER 5 Jun 8, 2026

A strong applied-practice essay arguing that the returns to econometrics knowledge are higher, not lower, under AI: Claude specification-searches when given discretion and silently stores divergent decisions in JSON, so 'plan then let it rip' backfires. Cunningham recounts a multi-hour CSDID debugging saga (the wrong CRAN-vs-GitHub 'did' install) that only his deep human capital caught, and reproduces 'Pedro's Checklist' (name the estimand in potential outcomes, cohort tables, covariate balance, panelview rollout, outcome-evolution plots) as the structured-analog defense. Core thesis: make zero error the binding constraint.

AI for research · Callaway-Sant'Anna · staggered diff-in-diff · checklist/workflow · estimand specification

Claude Code 34: Using "Dispatch" on my phone with Claude Code to revisit the cannabis paper

TIER 4 Mar 20, 2026

Framed around the new phone-based "Dispatch" remote-control feature, but the substantive core is a methods lesson: Callaway-Sant'Anna with covariates secretly fits a per-cohort logit propensity score, and Peduzzi et al.'s "events per variable" (EPV) rule means you need ~10 treated units per covariate per cohort-year to avoid biased propensity scores. Cunningham argues U.S. state-level staggered DiD routinely violates this (singleton treated states), and the fixes are dropping to county-level data or switching from doubly-robust/IPW to regression adjustment, which has no events-per-variable problem. Useful for anyone running CS with covariates.

callaway-santanna · events-per-variable · propensity-score · conditional-parallel-trends · claude-code

Claude Code 24: Multiple Agents Auditing Your Diff-in-Diff Code (Part 1)

TIER 4 Feb 25, 2026

Proposes treating LLM coding hallucination as classical measurement error and exploits its cross-language independence: if R, Python, and Stata errors are stochastic and independent, replicating a deterministic pipeline in all three and demanding identical output to several digits becomes a powerful code audit. Lays out where the method works (OLS, DiD, IV, analytical SEs) and where it fails (bootstrap, MCMC, simulation/ML with random seeds), using five Callaway-Sant'Anna packages as the case study. A genuinely useful, transferable verification framework.

code audit · difference-in-differences · measurement error · LLM hallucination · replication
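
The audit rule in the entry above (replicate a deterministic pipeline in several languages and demand identical output to several digits) reduces to a tolerance check. This is a minimal sketch: the function name, the six-digit default, and the estimates are all hypothetical, and in practice the numbers would be read from each language's output file.

```python
import math

def estimates_agree(estimates, digits=6):
    """True if every implementation's estimate matches the first to `digits` decimals."""
    values = list(estimates.values())
    tolerance = 10 ** -digits
    return all(math.isclose(v, values[0], rel_tol=0.0, abs_tol=tolerance) for v in values)

# Hypothetical point estimates from three independent implementations.
matching = {"R": 0.1234567, "Python": 0.1234567, "Stata": 0.1234568}
diverging = {"R": 0.1234567, "Python": 0.1234567, "Stata": 0.1301000}

print(estimates_agree(matching))
print(estimates_agree(diverging))
```

A check like this is only meaningful for deterministic estimators; for the bootstrap, MCMC, or seeded simulation cases the entry lists as failures, exact agreement across languages is not expected even when all three implementations are correct.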

Claude Code 26: Multiple Agents Auditing Your Callaway and Sant'Anna Diff-in-Diff (Part 2)

TIER 4 Feb 27, 2026

Runs 15 isolated AI agents (5 packages across Stata/R/Python) on the same Brazilian CAPS diff-in-diff dataset to recreate a 'many-analyst design' and locate where researcher discretion ('non-standard errors') enters staggered DiD. Key finding: agents agreed unanimously on every structural choice (control group, base period, trimming) but diverged entirely on covariate selection, so the confounder-vs-mediator boundary is where the variation lives. A substantive methods+AI piece, though it ends on a cliffhanger before showing the event-study spread.

difference-in-differences · Callaway-Sant'Anna · non-standard errors · multi-analyst design · AI agents

Claude Code 28: Multiple Agents Auditing Your Callaway and Sant'Anna Diff-in-Diff (Part 3)

TIER 4 Mar 4, 2026

Reviews the results of a multi-analyst CS experiment (5 packages, 3 runs each, 15 estimates) where the only varying dimension was covariate selection, finding that more covariates yield larger ATTs, ~77% of variation comes from package choice, and one estimate is ~11x another despite identical data and estimator. The standard deviation across estimates (0.442) is 2.4x the average reported standard error (0.185), demonstrating that conventional standard errors miss analyst-driven uncertainty. A concrete, useful empirical demonstration with teaching value on universal baseline and parallel-trends-as-long-differences.

callaway-santanna · many-analyst-design · covariate-selection · non-standard-errors · code-audit

Synthetic control, matching, and identification theory

2 tier-5 · 0 tier-4

The two most reference-worthy pure-method tutorials in the archive, both about identification beyond the DiD parallel-trends frame. One derives the identifying assumption of synthetic control under the Abadie–Diamond–Hainmueller factor model and contrasts outcome-model identification with design-based (random-assignment) identification, with Monte Carlo code showing why a long, well-fitting pre-period guards against matching on noise. The other shows the Abadie–Imbens bias correction for nearest-neighbor matching written two numerically-identical ways — standard imputation and an "augmentation" form that slides the matched control's outcome along the regression line — and connects it to augmented synthetic control and the Wald/2SLS equivalence. Together they form Cunningham's clearest writing on what makes a non-experimental estimate credible.

Identification in Synthetic Control

TIER 5 Dec 12, 2025

A careful methods tutorial deriving the identifying assumption of synthetic control (Abadie, Diamond & Hainmueller's factor model) and contrasting outcome-model identification with design-based (random-assignment) identification, with Python Monte Carlo code illustrating how everything cancels under exogenous treatment but the donor weights and unobserved factor loadings matter under non-random assignment. Cunningham explains why a long pre-period plus good pre-fit guards against 'matching on noise,' relates synth's bias terms to the unobserved heterogeneity mu and transitory shocks, and compares the role of the pre-period in synth versus the event-study falsification logic in diff-in-diff. High lasting reference value as an original explainer of synthetic control identification.

synthetic control · factor model · identification · diff-in-diff · causal inference
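
For reference, the factor model named in the entry above is usually written as follows. This is the standard Abadie-Diamond-Hainmueller form with unit 1 treated and donor weights that sum to one; the notation is the conventional one and may differ from the post's.

```latex
\begin{aligned}
Y_{it}(0) &= \delta_t + \theta_t Z_i + \lambda_t \mu_i + \varepsilon_{it}, \\
Y_{1t}(0) - \sum_{j \ge 2} w_j Y_{jt}(0)
 &= \theta_t \Big(Z_1 - \sum_{j \ge 2} w_j Z_j\Big)
  + \lambda_t \Big(\mu_1 - \sum_{j \ge 2} w_j \mu_j\Big)
  + \sum_{j \ge 2} w_j \big(\varepsilon_{1t} - \varepsilon_{jt}\big).
\end{aligned}
```

Here Z_i are observed covariates, mu_i are the unobserved factor loadings, and epsilon_it are transitory shocks. If the weights reproduce both Z_1 and mu_1, only the mean-zero shock term remains. Since mu_i is never observed, pre-period fit is the only evidence that the weights match it, which is why a short pre-period risks fitting the shocks instead.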

Two Ways to See Abadie-Imbens Bias Correction (And Why It Might Matter)

TIER 5 Dec 24, 2025

A full, original methods tutorial showing that the Abadie-Imbens (2011) bias correction for nearest-neighbor matching can be written two numerically-identical ways: the standard imputation form and an 'augmentation' form that slides the matched control's outcome along the regression line by (slope x covariate discrepancy). Cunningham proves the equivalence (the intercept cancels under differencing), connects it to Ben-Michael-Feller-Rothstein's augmented synthetic control and to the Wald/2SLS equivalence, and works it with toy data and Stata teffects code. The most complete and reference-worthy econometrics piece in this batch.

matching · Abadie-Imbens · bias correction · augmented synthetic control · regression adjustment
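
The post works the equivalence in Stata; the following is a minimal Python sketch of the same idea with one covariate and one nearest neighbor per treated unit. The toy data are invented, and fitting the outcome regression on all control units is a simplifying assumption made here for illustration.

```python
import numpy as np

# Hypothetical toy data: one covariate x, outcome y.
x_treated = np.array([2.2, 5.0, 7.0])
y_treated = np.array([9.0, 14.0, 20.0])
x_control = np.array([1.0, 3.0, 4.5, 8.0, 10.0])
y_control = np.array([3.0, 6.5, 9.0, 16.0, 21.0])

# Linear outcome model for untreated outcomes, fit on the controls.
slope, intercept = np.polyfit(x_control, y_control, 1)

def mu0(x):
    return intercept + slope * x

# Nearest-neighbor match on x (one control per treated unit, with replacement).
match = np.abs(x_treated[:, None] - x_control[None, :]).argmin(axis=1)
x_m, y_m = x_control[match], y_control[match]

# Imputation form: matched outcome plus the difference in fitted values.
y0_hat = y_m + mu0(x_treated) - mu0(x_m)
att_imputation = np.mean(y_treated - y0_hat)

# Augmentation form: slide the matched outcome along the regression line
# by slope times the covariate discrepancy.
y_m_slid = y_m + slope * (x_treated - x_m)
att_augmentation = np.mean(y_treated - y_m_slid)

print(att_imputation, att_augmentation)
```

The two numbers coincide because the intercept cancels when the fitted values are differenced, leaving only slope times the covariate gap.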

LLMs as measurement instruments — text-as-data and replication

1 tier-5 · 1 tier-4

A focused empirical arc treating large language models not as writing assistants but as classification and measurement tools, and asking whether they reproduce trained-classifier results. Cunningham re-classifies hundreds of thousands of congressional speeches from Card et al.'s PNAS immigration-rhetoric paper with gpt-4o-mini via the Batch API ($11, ~2.6 hours), gets 69% agreement with a fine-tuned RoBERTa, and shows the paper's polarization findings survive — because disagreements cluster at the decision boundary and largely cancel in the net-tone difference. The capstone finding (the gpt-4o-mini thermometer-heaping result, issue 0088) and the setup half of the replication (issue 0101) are filed in Themes 2 and 3 respectively; the lower-tier installments of the "why does it work" sub-arc (0094, 0095) are counted but not keepers for this guide. The cluster also includes the Ludwig–Mullainathan–Rambachan framework distinguishing prediction tasks (which require no training leakage) from estimation tasks (which need a small human-coded validation sample to debias LLM labels).

Claude Code 15: The Results Are In: Can LLMs Replicate a PNAS Paper? (Part 2)

TIER 5 Feb 6, 2026

Reports results of using gpt-4o-mini via OpenAI's Batch API to re-classify 305k congressional speeches from Card et al.'s PNAS immigration-rhetoric paper: 69% agreement with the fine-tuned RoBERTa classifier, $11, 2.6 hours, with disagreements clustering toward NEUTRAL and direct PRO/ANTI flips rare. A substantive, reusable demonstration that zero-shot LLM classification can replicate trained-classifier results cheaply and that the paper's polarization findings are robust, with practical lessons on the Batch API, Cohen's kappa, and transition matrices.

llm-classification · replication · batch-api · text-analysis · ai-for-research
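
The agreement diagnostics named in the entry above (raw agreement, Cohen's kappa, and a transition matrix between the two classifiers) take only a few lines to compute. This is a minimal sketch on ten made-up labels; the arrays are illustrative and do not reproduce any figure from the post.

```python
import numpy as np

LABELS = ["ANTI", "NEUTRAL", "PRO"]

# Hypothetical labels for ten speeches from the two classifiers.
roberta = np.array(["PRO", "PRO", "NEUTRAL", "ANTI", "ANTI",
                    "NEUTRAL", "PRO", "ANTI", "NEUTRAL", "PRO"])
llm = np.array(["PRO", "NEUTRAL", "NEUTRAL", "ANTI", "NEUTRAL",
                "NEUTRAL", "PRO", "ANTI", "NEUTRAL", "ANTI"])

# Transition matrix: rows are the RoBERTa label, columns are the LLM label.
index = {label: k for k, label in enumerate(LABELS)}
transition = np.zeros((len(LABELS), len(LABELS)), dtype=int)
for a, b in zip(roberta, llm):
    transition[index[a], index[b]] += 1

# Cohen's kappa: observed agreement corrected for chance agreement,
# where chance agreement comes from the two sets of marginal label shares.
n = transition.sum()
observed = np.trace(transition) / n
expected = (transition.sum(axis=1) @ transition.sum(axis=0)) / n**2
kappa = (observed - expected) / (1 - expected)

print(transition)
print(f"agreement = {observed:.2f}, kappa = {kappa:.3f}")
```

The off-diagonal cells are what distinguish drift toward NEUTRAL from direct PRO/ANTI flips, which is the distinction the entry relies on.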

Explainer of Ludwig, Mullainathan and Rambachan's 2026 Econometrics of LLM Paper

TIER 4 Mar 25, 2026

A substantive econometrics explainer of the Ludwig-Mullainathan-Rambachan framework for using LLMs in research, distinguishing prediction tasks (which require 'no training leakage,' unattainable via prompt engineering since models memorize training text) from estimation tasks (which require a small human-coded validation sample to debias LLM labels). High value as a guide to a method economists increasingly need, though the back half is paywalled.

LLM econometrics · measurement error · validation sample · text classification · prediction vs estimation

Teaching econometrics in the age of AI

0 tier-5 · 2 tier-4

Cunningham's pedagogy essays, unified by one claim: generative AI has decoupled task-completion from learning, which historically were the same act, so teaching must change to force the time-on-mechanics that builds real human capital. The concrete moves are by-hand spreadsheet worksheets (decomposing a difference in means into ATE + selection bias + reweighted ATT/ATU; computing regression coefficients manually and checking against R), and — paradoxically — using AI to *manufacture* teaching examples: having Claude generate exemplar papers in three genres (descriptive, predictive, causal) so students can study the rhetoric of each, even while he bans AI in his courses. The "rules before strategy" framing argues a skill splits into two halves that should be taught separately.

Rules before strategy or how I'm trying to teach statistics and causal inference

TIER 4 Nov 13, 2025

A substantive teaching essay arguing that learning a skill splits into 'rules' and 'strategy' which should be taught separately, and applying this to his 200-student Harvard Gov 50 course via by-hand spreadsheet worksheets (decomposing a difference in means into ATE + selection bias + reweighted ATT/ATU; manually computing bivariate and multivariate regression coefficients, verified against R's lm). The deeper claim: because gen AI decouples task completion from learning, which were historically the same act, pedagogy must change to force the time-on-mechanics that builds real human capital, lest a generation produce research outputs it doesn't understand. Original pedagogy-of-causal-inference argument with shareable worksheets.

teaching · pedagogy · regression mechanics · ai and learning · causal inference
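
The worksheet decomposition mentioned in the entry above can be verified on a handful of rows. This is a minimal sketch assuming the standard form of the identity (simple difference in means = ATE + selection bias + share-untreated times (ATT - ATU)); the six units and their potential outcomes are invented, and the worksheet itself may lay the terms out differently.

```python
import numpy as np

# Hypothetical potential outcomes for six units; the first two are treated.
y0 = np.array([5.0, 7.0, 3.0, 4.0, 2.0, 6.0])
y1 = np.array([9.0, 11.0, 4.0, 6.0, 3.0, 8.0])
d = np.array([1, 1, 0, 0, 0, 0])

treated, control = d == 1, d == 0
pi = treated.mean()  # share of units treated

ate = np.mean(y1 - y0)
att = np.mean((y1 - y0)[treated])
atu = np.mean((y1 - y0)[control])

# Simple difference in observed mean outcomes.
sdo = y1[treated].mean() - y0[control].mean()

# The two bias terms.
selection_bias = y0[treated].mean() - y0[control].mean()
heterogeneity_bias = (1 - pi) * (att - atu)

print(sdo, ate + selection_bias + heterogeneity_bias)
```

The identity holds exactly for any data, so the exercise is about seeing which term absorbs which gap, not about estimation.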

A professor's use case for AI generated papers

TIER 4 Apr 16, 2026

An original teaching essay arguing that AI agents can now produce full journal-submittable empirical papers, which both threatens the value of human attention/learning and opens a new pedagogical use: having Claude generate exemplar papers in three distinct genres (descriptive, predictive, causal) so students can study the rhetoric of each. Cunningham explains why he bans AI in his courses yet uses it to manufacture teaching examples, a substantive reflection on AI's effect on econometrics pedagogy.

AI-generated papers · econometrics teaching · research genres · Claude Code · academic policy

Applied empirical essays and the sociology of the discipline

0 tier-5 · 5 tier-4

The remaining substantive essays that sit outside the methods tutorials and the AI-workflow spine: applied economics, the statistics of research integrity, and the intellectual history of the field. Cunningham dispatches Claude Code to crawl EDGAR and build an HHI analysis of Match Group's dating-app portfolio (a zero-price market that evades antitrust); works through why reconstructing t-statistics from rounded coefficients manufactures false bunching at t=2 (and publicly retracts his own p-hacking claim when he finds the artifact); distinguishes three sources of uncertainty — sampling, design-based, and analyst/researcher uncertainty that standard errors never capture, now operationalizable via many-analyst automation; and maps the family tree of causal inference through Orley Ashenfelter's academic lineage.

The Orley Genealogy Project: Mapping the Family Tree of Causal Inference

TIER 4 Dec 31, 2025

Cunningham lays out his intellectual-history project tracing causal inference in economics through Orley Ashenfelter's academic lineage — the convergence of the Princeton quasi-experimental tradition, the Rubin potential-outcomes framework (linked via Imbens and Angrist), and Heckman's Chicago structural branch — built into a ~1,100-economist genealogy database he's crowdsourcing. A substantive sociology-of-the-discipline essay with lasting reference value for understanding how the credibility revolution propagated, plus a foreshadowing of a planned book.

causal inference history · Orley Ashenfelter · academic genealogy · Princeton IRS · credibility revolution

Swiping Under Monopoly: Market Power and Welfare in Online Dating

TIER 4 Apr 6, 2026

An original applied-economics essay arguing that Match Group's portfolio (Tinder, Hinge, OkCupid, etc.) constitutes critical social infrastructure with an HHI near double the DOJ 'highly concentrated' threshold, yet escapes antitrust scrutiny because zero-price markets evade the consumer-welfare price-increase test. Doubles as a worked demonstration of dispatching Claude Code to crawl EDGAR filings and build the HHI/revenue analysis, with Cunningham flagging his own unverified calculations.

antitrust · market concentration · zero-price markets · online dating · Claude Code research
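
The HHI figure in the entry above is a sum of squared market shares. This is a minimal sketch of that arithmetic: the shares below are made up for illustration and are not the essay's numbers, and no particular regulatory threshold is assumed.

```python
def hhi(shares_percent):
    """Herfindahl-Hirschman Index: sum of squared market shares, shares in percent."""
    return sum(s ** 2 for s in shares_percent)

# Hypothetical market shares in percent (they sum to 100).
shares = [60.0, 25.0, 10.0, 5.0]

index_value = hhi(shares)
print(index_value)
```

With shares in percent the index runs from near 0 (many tiny firms) to 10,000 (a single firm), so any threshold comparison should use the same units.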

Claude Code 37: Building an Understanding of Rounding for the Purposes of Publishing and Evidence for P-Hacking

TIER 4 Apr 2, 2026

A substantive methods explainer showing why reconstructing t-statistics from rounded coefficients and standard errors manufactures false heaping at t=2 (because small-scale outcomes collapse rounded ratios onto simple integer ratios, with t being scale-invariant), creating spurious p-hacking signals. Built with a Shiny app, video walkthrough, and a Claude Code transcript, and sets up the Brodeur-comment history where the diff-in-diff p-hacking finding dissolved after the rounding fix.

p-hacking · rounding artifacts · t-statistics · Shiny app · research methods
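
The rounding artifact described in the entry above can be reproduced with a short simulation. This is a minimal sketch under assumed settings: true t-statistics drawn from a smooth distribution, small-scale standard errors, and two-decimal reporting. None of these choices come from the post; they are picked only to make the mechanism visible.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Hypothetical small-scale estimates: smooth true t-statistics, no manipulation.
se = rng.uniform(0.005, 0.05, size=n)
t_true = rng.normal(loc=2.0, scale=1.0, size=n)
coef = t_true * se

# What a reader sees in a published table: both numbers rounded to two decimals.
coef_r = np.round(coef, 2)
se_r = np.round(se, 2)
keep = se_r > 0  # drop rows whose standard error rounds to zero
t_recon = coef_r[keep] / se_r[keep]

# Share of t-statistics sitting (numerically) exactly at 2.
share_true = np.mean(np.isclose(t_true, 2.0, rtol=0.0, atol=1e-9))
share_recon = np.mean(np.isclose(t_recon, 2.0, rtol=0.0, atol=1e-9))

print(f"exactly 2 among true t: {share_true:.4f}")
print(f"exactly 2 among reconstructed t: {share_recon:.4f}")
```

After rounding, both numerator and denominator are small multiples of 0.01, so the reconstructed ratios land on a short list of fractions and 2 is one of the most common. The true t-statistics have no mass there at all.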

Claude Code 36: I Was Wrong About P-Hacking (And Here's What I Actually Found)

TIER 4 Mar 30, 2026

A candid correction retracting the prior claim that APE AI-generated papers p-hack: prompted by a reader's question, Cunningham realized his spike at t=2 was a rounding artifact from dividing rounded coefficients by rounded standard errors, not manipulation, and a donut-hole drop of exact-2s collapsed his bunching ratio from 1.52 to 1.02. Valuable both as a methods lesson on why Brodeur's test uses true t-stats and as a model of public error-correction.

p-hacking · rounding artifacts · correction · Brodeur test · AI-generated papers

If Non-Standard Errors Are Measuring Real Uncertainty, Should We Report Them?

TIER 4 Mar 6, 2026

A conceptual statistics essay distinguishing three sources of uncertainty in estimates: sampling (iid), design-based (treatment assignment), and a third, researcher/analyst uncertainty surfaced by many-analyst designs (Silberzahn et al.), which standard errors never capture because they hold the researcher fixed. Argues Claude Code makes the previously theoretical many-analyst design operational by automating perturbation of discretionary nodes (covariates, package) to build a forest plot and possible p-value, while flagging the open problem of enumerating all endogenous discretionary nodes. A thoughtful, original framing of non-standard errors.

non-standard-errors · many-analyst-design · researcher-uncertainty · inference · claude-code