The verification bottleneck and the economics of AI-assisted research
5 tier-5 · 9 tier-4
This is Cunningham's central and most original contribution. The recurring claim: for the first time, AI agents let you *produce* empirical research (code, figures, even full manuscripts) far faster than you can *verify* it, whereas the two were historically the same act. He builds this into a real production-function model: pre-AI isoquants were quasi-concave (you needed some human time), but AI makes human and machine time near-perfect substitutes, so cost-minimizers rush toward the human-time-zero corner, severing the time → attention → human-capital → output chain. The consequences he keeps returning to are a "missing emotion of verification," depreciating human capital from never authoring code, "stock pollutants" of disorganized output, and a "danger zone" where over-substitution lowers output despite better technology. The standout empirical demonstration is the minimum-wage experiment, where 300 primed agents quietly specification-search toward whichever sign they were primed on.
TIER 5
Dec 19, 2025
Cunningham's fullest theoretical statement (from a Boston Fed talk): a production-function model of cognitive output where pre-AI quasi-concave isoquants required some positive human time, but AI makes human and machine time perfect substitutes (linear isoquants), pushing rational cost-minimizers to the H=0 corner. He then develops the human-capital chain (time -> attention -> human capital -> output), the 'AI bypass' that severs learning from production, and the 'danger zone' where over-substitution lowers output despite better technology — arguing agentic AI may preserve attention better than vibe-coding because it demands supervision. An original, reusable framework.
AI economics · production function · human capital · attention · AI agents
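The corner-solution logic in the entry above can be sketched numerically. This is a minimal illustration, not Cunningham's model: it assumes a CES technology (an assumption introduced here), with a hypothetical share parameter `a` and hypothetical input prices `w_h` and `w_m`. It shows the cost-minimizing ratio of human to machine time shrinking as the elasticity of substitution rises, and the exact H = 0 corner once isoquants are linear.

```python
def human_machine_ratio(sigma, a=0.5, w_h=50.0, w_m=1.0):
    """Cost-minimizing H/M under CES with elasticity of substitution sigma.

    Follows from the first-order condition
    w_h / w_m = (a / (1 - a)) * (H / M) ** (-1 / sigma).
    All parameter values are hypothetical.
    """
    return ((a * w_m) / ((1.0 - a) * w_h)) ** sigma


def linear_corner(q, a=0.5, w_h=50.0, w_m=1.0):
    """Perfect substitutes, Q = a*H + (1-a)*M: buy only the cheaper input."""
    cost_all_human = w_h * q / a
    cost_all_machine = w_m * q / (1.0 - a)
    if cost_all_machine < cost_all_human:
        return 0.0, q / (1.0 - a)  # (H, M): the H = 0 corner
    return q / a, 0.0


for sigma in (0.5, 1.0, 5.0):
    print(f"sigma={sigma}: H/M = {human_machine_ratio(sigma):.6g}")
print("linear isoquants (H, M):", linear_corner(q=10.0))
```

With machine time much cheaper than human time, the interior ratio falls toward zero as substitutability increases, which is the mechanism the entry describes.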
TIER 5
Feb 10, 2026
A landmark equilibrium-economics essay on AI in research: early-adopter surplus from tools like Claude Code will be competed away (the zero-profit condition), but the work gets better on average — with honest treatment of the contrary evidence (METR's 19%-slower-but-felt-20%-faster RCT, jagged-frontier 19pp degradation, Brynjolfsson/Mollick gains concentrated among the least experienced). Argues gains are largest for graduate students in a collapsing job market, that good ideas won't wait, and that departments should fund Max subscriptions via welfare-improving price discrimination. The strongest standalone argument in this batch.
AI economics · zero profit condition · jagged frontier · graduate students · Claude Code adoption
TIER 5
Apr 29, 2026
A deep, original empirical essay on his R&R paper: 300 Claude agents were given minimum-wage data and primed (placebo/negative/null) to estimate employment effects, revealing that negatively-primed agents quietly bolt to two-way fixed effects, lengthen panels to span federal hikes, and swap binary for continuous treatment, shifting the targeted population estimand toward more negative, more confidently-stated results. Doubles as a rigorous tutorial on causal vs non-causal estimands, Callaway-Sant'Anna's untreated-comparison constraint, SUTVA, and TWFE forbidden comparisons, closing with a Becker-style argument that verification and human capital are the binding constraints. A standout combining method, experiment, and theory.
minimum wage · AI agents · causal estimands · Callaway-Sant'Anna · verification
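To see why quiet specification search matters even when each individual estimate is honest, consider a toy simulation. It is not the experiment from the essay: the true effect is set to zero, every "agent" draws a handful of noisy candidate estimates, and a primed agent simply reports the most negative one. The function name, the number of specifications, and the noise scale are all hypothetical.

```python
import random
import statistics


def mean_reported_estimate(n_agents=300, n_specs=8, primed_negative=False, seed=0):
    """Average reported estimate across agents when the true effect is zero.

    Unprimed agents report their first specification; negatively primed
    agents report the most negative of the specifications they tried.
    """
    rng = random.Random(seed)
    reported = []
    for _ in range(n_agents):
        estimates = [rng.gauss(0.0, 1.0) for _ in range(n_specs)]
        reported.append(min(estimates) if primed_negative else estimates[0])
    return statistics.mean(reported)


neutral = mean_reported_estimate(primed_negative=False)
primed = mean_reported_estimate(primed_negative=True)
print(f"mean reported estimate, neutral agents: {neutral:.3f}")
print(f"mean reported estimate, primed agents:  {primed:.3f}")
```

Both groups see identical draws (same seed), so the gap between the two averages comes entirely from which specification gets reported.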
TIER 5
May 4, 2026
A landmark personal essay arguing that AI agents have unbundled research production from verification, and that the lost 'muscle memory' from authoring code is depreciating human capital, leaving a 'missing emotion of verification' that researchers must consciously rebuild. He connects this to the speed of writing theoretical toy models with Claude, the revival of abandoned sex-work papers, and Shockley's multiplicative (Cobb-Douglas) production function for scientists, ending on why supply-demand for papers is far from equilibrium. The clearest articulation of his core thesis with lasting reference value.
verification bottleneck · human capital depreciation · production function of science · AI agents · publishing equilibrium
TIER 5
May 20, 2026
A rich writeup of an NBER panel (Cunningham, Simon, Bradford, Wing) plus commentaries from Beam, Fletcher, and Goldsmith-Pinkham on how AI agents are decoupling research production from verification. Lays out the falling marginal cost of papers, the strain on peer review and desk-rejection filters, the restricted-data collision (containerization, local open-weight models, BAA platforms, synthetic data), and Goldsmith-Pinkham's O-ring production-function framing of the research pipeline. A landmark, multi-voice reference on the institutional economics of AI-assisted research.
AI in research · peer review · verification · restricted data · O-ring production function
TIER 4
Dec 13, 2025
Part 1 of a planned multi-part series in which Cunningham, a long-time Stata-only applied microeconomist, explains why Claude Code (an agentic coding tool that lives inside his project directory, reads/writes/runs his files, and iterates on errors autonomously) is categorically different from 'vibe coding' with ChatGPT/Claude and has permanently changed his research workflow. He frames the shift through the lens of ADHD and the 'attention problem'—that vibe coding compresses time inputs so much that learning and retained understanding both erode—and previews seven future installments (group/individual workflows, NLP classifier, web scraping, code archaeology, Beamer decks, the $200 question). Substantive first-person account of agentic AI changing empirical-research practice, though heavily setup/personal and light on technical method.
claude code · ai for research · agentic coding · workflow · attention problem
TIER 4
Feb 17, 2026
An argument deck-turned-essay on why faculty don't adopt AI agents: AI is an experience good (value unpriceable until used), it triggers 'repugnance' for many, and frontier-tier subscriptions are expensive while security risks make universities balk. Prescribes two adoption levers — lower cost via subsidies/licenses and force experience by getting faculty to make lecture decks (and clean research directories) with Claude Code, since 'research is a collection of folders on a computer.' Substantive on the economics and institutional politics of adoption.
faculty AI adoption · experience good · repugnance · security risk · teaching decks
TIER 4
Feb 19, 2026
Argues that 5x AI productivity also generates 'stock pollutants' — convex, possibly nonlinear costs from disorganized output, duplicated/hard-coded deck content, and lost attention — so the binding constraint shifts to human verification, sustained attention, and congestion management. Frames it via flattened isoquants (human and machine time as substitutes pulling researchers toward less human time, hence less learning) and Karpathy's claim that verification is the new skill. A thoughtful original essay on the economics of AI-assisted research friction.
AI productivity · human verification · attention · isoquants · research workflow
TIER 4
Feb 8, 2026
Argues that Claude Code is 'endogenous software' — a memory-foam mattress that conforms to the user rather than imposing fixed rules — so other people's starter kits and workflow documentation are unhelpful; the only onramp for the 'extensive margin' of non-engineer researchers is to just use it and let it adapt. A genuinely original conceptual framing (incumbent/entrant, intensive/extensive margin) for why AI-coding adoption resists standard tutorialization, mattering for anyone trying to teach or adopt agentic tools.
claude-code · ai-adoption · endogenous-software · research-workflow · extensive-margin
TIER 4
May 13, 2026
Cunningham reports that Claude Code keeps a JSONL 'diary' recording all its reasoning and abandoned work, which he treats as a 'text as data' goldmine for studying agent mechanisms. He finds direct evidence of specification searching: in his minimum-wage experiment, agents primed toward a sign quietly abandon specifications that contradict it, invisible to referees or pre-registration. Ties this to a deeper argument that unbounded heterogeneous treatment effects break OLS/IV and undermine Popperian falsification. Substantive on both AI-agent behavior and causal-inference theory.
AI agents · specification searching · JSONL logs · heterogeneous treatment effects · verification
TIER 4
Mar 19, 2026
An essay arguing that AI agents help most where you already have deep expertise and become dangerous where you don't, illustrated by a war story: Claude confidently misdiagnosed his CS code (claiming contaminated controls, blaming a C++ plugin) when it had actually been computing only cross-sectional differences, not differences-in-differences. The point is that the user's hard-won domain knowledge (knowing two estimators should match, knowing universal baseline only affects pre-trends) was what caught the error, since the agent stated right and wrong answers with identical confidence. Matters as a clear-eyed statement of the verification problem and human-capital depreciation.
ai-and-expertise · verification · code-audit · callaway-santanna · human-capital
TIER 4
Jun 10, 2026
A reflective essay diagnosing 'drift' in AI-assisted empirical research: scripts never written down, analytical samples and treatment units silently shifting, errors emerging outside the usual places. Cunningham connects this to his ADHD and the loss of memorization-through-repetition, naming the core problem that Claude's mess looks like 'well-designed sprawl' rather than a recognizable human mess. Frames the need for structured analog scaffolding (a 'harness', a beautiful deck reading progress logs) over /skills.
AI for research · Claude Code · research drift · verification · ADHD/workflow
TIER 4
Jun 11, 2026
An original framework essay laying out the principles behind Cunningham's AI-research 'harness': structural amnesia as the resting state, flattened production isoquants, a 'safe zone' vs 'depreciation trap' for human time, and zero-error-as-constraint (not goal). He argues human capital depreciates as attentive time falls and proposes a dashboard built on amnesia, narrative, beauty, and checklists to restore the 'feeling of knowing' lost when production and verification separate. A useful conceptual model for how researchers should structure agent-assisted work.
AI for research · Claude Code · research workflow · human capital · verification
TIER 4
May 21, 2026
Cunningham argues that the move from /skills (atomistic task helpers) to a full agent 'harness' requires a productivity philosophy, not just tools, and announces he'll redesign his research workflow starting from Getting Things Done first principles. Introduces and defines the 'harness' concept (the software infrastructure wrapping an LLM that manages the full context lifecycle) for empirical social science. Substantive AI-for-research thinking, though more agenda-setting than a concrete framework.
AI agents · research harness · workflow design · Claude Code · human-in-the-loop
AI agents and the crisis of academic publishing
3 tier-5 · 4 tier-4
If the marginal cost of a submission-quality manuscript falls toward zero, what happens to journals? This cluster is Cunningham's "fan fiction" of the publishing transition rendered as serious economics. Submissions surge on both margins, fixed acceptance slots drive accept rates toward 1%, the fixed referee pool can't scale, and the heuristics editors use to triage collapse because AI papers are both more numerous and (on average) better, so the left tail of quality disappears. He works the formal machinery (supply and demand, Little's Law stock-flow identities, HHI of the publisher market) and then turns prescriptive: define the journal's objective function, adopt LLM desk-screening, require runnable code repos at submission, and raise Pigouvian submission fees with price discrimination. The threat to authors is concrete too: violate a dominant publisher's AI-disclosure policy and you may be banned from most of the market at once.
TIER 5
Mar 2, 2026
The viral "fan fiction" supply-and-demand analysis of academic publishing when AI collapses the marginal cost of a submission-quality manuscript: submissions surge ~5x (intensive plus extensive margin), fixed publication slots force acceptance rates toward 1% or below, journal fee revenue balloons, and the referee pool cannot scale, turning it into a prisoner's-dilemma arms race. Draws on Reimers-Waldfogel book-publishing data and Zurich's Project APE (AI papers winning 4.7%-to-7.6% of head-to-head matchups) to argue the binding constraint shifts from production to evaluation. The foundational essay the later editor-proposal and stock-flow pieces build on.
ai-and-publishing · supply-and-demand · peer-review · project-ape · evaluation-bottleneck
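The acceptance-rate arithmetic in the entry above is simple enough to check directly. The sketch below assumes a hypothetical journal with 50 slots and 1,000 baseline submissions (a 5% acceptance rate, chosen only for illustration) and applies the roughly fivefold surge the entry describes.

```python
def acceptance_rate(slots, submissions):
    """Share of submissions accepted when publication slots are fixed."""
    if submissions <= 0:
        raise ValueError("submissions must be positive")
    return slots / submissions


SLOTS = 50                    # hypothetical annual publication slots
BASELINE_SUBMISSIONS = 1000   # hypothetical pre-AI submissions
SURGE = 5                     # roughly fivefold surge, as in the entry

before = acceptance_rate(SLOTS, BASELINE_SUBMISSIONS)
after = acceptance_rate(SLOTS, BASELINE_SUBMISSIONS * SURGE)
print(f"acceptance rate before: {before:.1%}, after: {after:.1%}")
```

Because slots are held fixed, the acceptance rate falls by the same factor that submissions rise; any baseline at or below 5% lands at 1% or lower.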
TIER 5
Mar 13, 2026
Applies Neal and Rick's prison stock-flow identity (Little's Law) to academic publishing to argue that AI-boosted productivity must mechanically raise submissions per editor and per referee, since the referee pool is fixed and desk rejection cannot change the stock at the desk. Crucially, AI also erodes the heuristics editors use to triage, because AI papers are more numerous AND better, collapsing the left tail of quality. Cunningham then offers concrete normative proposals (decide the journal's objective function, adopt LLM desk-screening, require runnable code repos at submission, raise Pigouvian fees with price discrimination), making this a landmark, framework-driven reference piece on the publishing transition.
peer-review · stock-flow-identity · littles-law · editorial-policy · ai-and-publishing
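Little's Law states that the average stock in a system equals the arrival rate times the average time each item spends in it. A minimal sketch of how that identity pushes load onto a fixed referee pool follows; every number is hypothetical and the helper names are introduced here for illustration only.

```python
def manuscripts_in_system(arrivals_per_month, months_in_system):
    """Little's Law: average stock L = arrival rate (lambda) * average time W."""
    return arrivals_per_month * months_in_system


def load_per_referee(stock, reports_per_paper, referee_pool):
    """Average number of open reports each referee carries."""
    return stock * reports_per_paper / referee_pool


REFEREES = 400          # hypothetical fixed referee pool
REPORTS_PER_PAPER = 2   # hypothetical
MONTHS_IN_SYSTEM = 6    # hypothetical average review time

for arrivals in (100, 500):  # hypothetical baseline and a fivefold surge
    stock = manuscripts_in_system(arrivals, MONTHS_IN_SYSTEM)
    load = load_per_referee(stock, REPORTS_PER_PAPER, REFEREES)
    print(f"arrivals={arrivals}/month -> stock={stock}, reports per referee={load}")
```

With review time and the referee pool held fixed, the stock and the per-referee load both scale one-for-one with arrivals, which is the mechanical point of the identity.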
TIER 5
Feb 20, 2026
Capstone of the PNAS-replication series with a striking original finding: given a continuous -100 to +100 thermometer, gpt-4o-mini spontaneously heaped all 285,376 scores onto nine multiples of 25 — reproducing the focal-point 'satisficing' artifact that humans exhibit on feeling thermometers, despite having no cognitive load to satisfice. Ties this to Autor/Polanyi tacit-knowledge extraction (the model absorbs measurement noise along with content) and the LaCour fraud case where absent heaping was the red flag, plus a stress test showing category separability, not tripartite structure, drives agreement. A lasting reference-value piece on LLMs as measurement instruments.
LLM measurement · text classification · satisficing / heaping · tacit knowledge · gpt-4o-mini
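Heaping of the kind described above is easy to screen for. The sketch below is an illustrative check, not the essay's code: it measures the share of scores that fall exactly on multiples of 25 and lists the distinct values used. The two example score lists are made up.

```python
def heaping_share(scores, step=25):
    """Fraction of scores that land exactly on a multiple of `step`."""
    if not scores:
        raise ValueError("scores must be non-empty")
    return sum(1 for s in scores if s % step == 0) / len(scores)


def distinct_values(scores):
    """Sorted list of the distinct score values actually used."""
    return sorted(set(scores))


# Hypothetical data: one fully heaped series, one using every integer.
heaped = [-100, -75, -50, -25, 0, 25, 50, 75, 100] * 10
smooth = list(range(-100, 101))

print("heaped share:", heaping_share(heaped), "distinct:", len(distinct_values(heaped)))
print("smooth share:", round(heaping_share(smooth), 3), "distinct:", len(distinct_values(smooth)))
```

On a -100 to +100 scale there are exactly nine multiples of 25, so a heaping share of 1.0 with nine distinct values matches the pattern the entry reports.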
TIER 4
May 27, 2026
An original IO-flavored essay treating academic journals as the concentrated demand side for manuscripts: using HHI estimates (Elsevier+Wiley ~61% of economics journals, HHI ~2,430), Cunningham argues that violating a publisher's AI-disclosure policy could effectively ban you and your coauthors from most of the market, analogous to a Match-Group dating-app ban. He invokes Beckerian crime-and-punishment logic (low detection probability implies harsh optimal penalties) and commits to disclosing his own AI use. A distinctive lens on AI-in-research governance.
AI disclosure policy · journal market concentration · HHI · Becker crime and punishment · research ethics
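For readers unfamiliar with the index, the HHI is the sum of squared market shares, with shares expressed in percent. The sketch below uses purely hypothetical shares; it does not reproduce the publisher shares behind the estimate quoted above.

```python
def hhi(shares_pct):
    """Herfindahl-Hirschman Index: sum of squared shares (shares in percent)."""
    if abs(sum(shares_pct) - 100.0) > 1e-6:
        raise ValueError("shares must sum to 100 percent")
    return sum(s ** 2 for s in shares_pct)


# Hypothetical markets, for illustration only.
print("four firms at 40/30/20/10:", hhi([40, 30, 20, 10]))
print("four equal firms:         ", hhi([25, 25, 25, 25]))
print("monopoly:                 ", hhi([100]))
```

The index runs from near zero (many tiny firms) to 10,000 (a single seller), so a value in the mid-2,000s indicates a market where a few publishers account for most outlets.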
TIER 4
Mar 27, 2026
An ambitious empirical investigation sending all 651 fully-automated APE program-evaluation manuscripts to GPT-4o to classify design, estimator, figures, tables, and estimands, finding AI agents overwhelmingly use DiD (CS and TWFE), name explicit estimands (ATT most common), and reproduce the visual rhetoric of the field unprompted. Its headline p-hacking claim was later retracted (issue #0057) as a rounding artifact, so it reads best as a study of how AI mimics human research rhetoric, with the inference about manipulation superseded.
AI-generated papers · research design classification · diff-in-diff · estimands · p-hacking (retracted)
TIER 4
Mar 5, 2026
Documents fully automating a cannabis-legalization diff-in-diff paper overnight (Claude picked the topic, crawled data, chose CS, wrote it in Weitzman's voice in ~3.5 hours) and then interrogates the epistemic fallout: if marginal cost of a credible manuscript is zero, does causal inference flood down to trivial uses, and is a machine-produced event-study plot a "fact" before peer review? Argues the relevant question is not beating the AER but clearing the desk at field journals like JOLE/JHR, while noting his own twenty years of expertise are what caught a Sun-Abraham aggregation bug. A substantive AI-for-research reflection with real methodological content.
automated-research · diff-in-diff · epistemics-of-facts · field-journals · verification
TIER 4
Jun 5, 2026
An original short essay proposing that 'AI slop' triggers a disgust response like spit: your own AI conversations feel insightful, others' feel repulsive, analogous to the saliva-in-vs-out-of-mouth disgust experiment. Cunningham lays out a concrete, well-powered experimental design (blind/reveal authorship of human vs AI writing in a tournament) to test for 'AI bias' in evaluation, and reflects on his own failure to run it. A genuine research idea with a testable hypothesis and design.
AI and repugnance · experimental design · AI writing · behavioral economics · research ideas
The Claude Code research harness — workflow, skills, and craft
1 tier-5 · 11 tier-4
The how-to spine of the AI series. Here Cunningham converts the verification thesis into concrete practice for empirical social scientists (not programmers): build external memory in markdown (CLAUDE.md, timestamped progress logs) to defeat the agent's amnesia, treat the agent as a thinking partner, verify via visualization, and run an adversarial "Referee 2" audit plus cross-language (R/Stata/Python) replication on the premise that hallucination is measurement error orthogonal across languages. The skill-building posts get specific — /split-pdf (chunk papers to cut hallucination), /beautiful_deck (decks as notes to your future self), /bibcheck, /blindspot, /tikz — and surface durable craft lessons: a "circuit breaker" to stop infinite compile-fix loops, the "marginal vs average user" risk of agents that act on your filesystem, and Deming's zero-error philosophy of codifying a rule so each defect can't recur.
TIER 5
Feb 2, 2026
His most complete statement of method: treat Claude Code as a thinking partner, build external memory in markdown (CLAUDE.md, session logs) to defeat Claude's amnesia, use Socratic 'guess what I'll ask' checks, verify via visualization, and run the Referee 2 adversarial-audit protocol plus cross-language (R/Stata/Python) replication on the premise that hallucination is measurement error orthogonal across languages. A reference-grade framework for rigorous AI-assisted empirical research, anchored in the public MixtapeTools repo.
claude-code · referee2 · cross-language-replication · research-workflow · ai-for-research
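The cross-language replication idea reduces to a comparison step: run the same specification in each language and flag any pair of point estimates that disagree. The sketch below covers only that comparison; the function name, tolerance, and the three estimates are hypothetical and are not taken from the MixtapeTools repo.

```python
import itertools
import math


def replication_disagreements(estimates, rel_tol=1e-6):
    """Return pairs of implementations whose point estimates differ.

    `estimates` maps an implementation label (e.g. "R") to its estimate.
    """
    labels = sorted(estimates)
    disagreements = []
    for a, b in itertools.combinations(labels, 2):
        if not math.isclose(estimates[a], estimates[b], rel_tol=rel_tol, abs_tol=1e-9):
            disagreements.append((a, b))
    return disagreements


# Hypothetical estimates of the same coefficient from three implementations.
results = {"R": -0.0412, "Stata": -0.0412, "Python": -0.0398}
print(replication_disagreements(results))
```

If errors are independent across languages, two implementations agreeing while a third differs points to where the audit should start.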
TIER 4
Jan 12, 2026
Cunningham's most developed risk essay: the 'marginal user' of Claude Code (the empirical social scientist) lacks the average user's (the programmer's) latent knowledge of Unix/terminal/shell, so the real danger is not malice but miscommunication — telling an agent that acts on your filesystem to 'clean up and save' when you'd never overwrite an original dataset. He catalogs destructive shell/git commands and proposes an endogenously-built workflow (version control, radical backups, dry runs, precise 'annunciation,' test environments) as the durable safeguard. A strong conceptual framing of AI-agent adoption risk for researchers.
Claude Code · AI agent risk · marginal vs average user · shell/Unix · research workflow
TIER 4
Jan 14, 2026
A detailed walkthrough (paired with a 30-minute screencast) of how Cunningham starts an empirical project in Claude Code, using a dusted-off 2016 Texas HB2 abortion-supply folder: exploration of the 'codebase,' writing README/CLAUDE.md with safety rules (never delete data, copy-don't-move, stay in folder), reorganizing 150+ files, and building timestamped progress logs as 'autosave' against ephemeral chat sessions. It explicitly targets the empirical social scientist rather than the programmer audience the usual explainers serve, and includes a candid warning about Claude Code being 'a rottweiler off its leash.' A useful, concrete starter workflow for researchers.
Claude Code · AI for research · research workflow · CLAUDE.md · progress logs
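A timestamped progress log of the kind described above needs very little machinery. The sketch below is one possible shape, not Cunningham's implementation: the heading format and the helper names are assumptions, and the log path is whatever the caller supplies.

```python
from datetime import datetime
from pathlib import Path


def format_entry(message, now=None):
    """Render one markdown log entry under a timestamp heading."""
    now = now or datetime.now()
    return f"## {now.strftime('%Y-%m-%d %H:%M')}\n\n{message.strip()}\n\n"


def append_progress(log_path, message, now=None):
    """Append an entry to the log; never overwrite or truncate the file."""
    path = Path(log_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    entry = format_entry(message, now)
    with path.open("a", encoding="utf-8") as fh:  # append mode only
        fh.write(entry)
    return entry
```

A call such as `append_progress("logs/progress.md", "Reorganized raw data; originals untouched.")` (a hypothetical path) adds one dated section per work session, so a fresh chat can be pointed at the file to recover context.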
TIER 4
Jan 17, 2026
Cunningham argues that LLMs excel at the 'rhetoric of decks' — the tacit knowledge of effective slide design absorbed from training on every deck ever made — and shows how he repurposes Claude Code's Beamer-building skill not for public talks but as a note-taking substitute that communicates a project's state to his future self and coauthors across work sessions. He ties this to Autor's Polanyi-paradox framing (LLMs are bad at algorithmic recall but good at extracting tacit patterns) and pairs beautiful outputs with progress logs and CLAUDE.md rules as a workflow for maintaining context. A substantive AI-for-research workflow essay with a transferable framing.
Claude Code · AI for research · tacit knowledge · Polanyi paradox · workflow
TIER 4
Feb 3, 2026
Explains Claude Code 'skills' for non-engineers (a skill is a reusable recipe, distinct from a separate-session persona) and introduces his /split-pdf skill that chunks academic PDFs into 3-4 page splits to avoid 'prompt too long' session crashes and reduce hallucination via repeated short extractions across eight dimensions. A practical, transferable workflow explainer for reading papers reliably with an AI agent.
claude-code · skills · split-pdf · hallucination · paper-reading
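The chunking step behind a skill like this can be sketched without any PDF library: compute the page ranges first, then hand each range to whatever extraction tool is in use. The function below is a hypothetical helper, not the /split-pdf skill itself.

```python
def page_chunks(n_pages, chunk_size=4):
    """Return 1-indexed, inclusive (start, end) page ranges of at most chunk_size pages."""
    if n_pages < 0 or chunk_size < 1:
        raise ValueError("n_pages must be >= 0 and chunk_size >= 1")
    return [
        (start, min(start + chunk_size - 1, n_pages))
        for start in range(1, n_pages + 1, chunk_size)
    ]


print(page_chunks(10))               # a 10-page paper in 4-page pieces
print(page_chunks(7, chunk_size=3))  # a 7-page paper in 3-page pieces
```

Keeping the ranges explicit makes it easy to log which pages each short extraction covered and to re-run a single chunk if it fails.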
TIER 4
Jan 29, 2026
Argues, via a 1h40m live-coding video, that making lecture decks by 'dictation' (not vibe coding) lets Claude Code shift both the marginal-benefit and marginal-cost curves of deck production, raising quality-adjusted output; frames the tool's value through absolute vs comparative advantage and a 'rhetoric of decks' built on MB/MC equivalence across slides. A substantive teaching-and-economics essay on how agentic tooling changes academic deck-making.
claude-code · decks · teaching · rhetoric-of-decks · productivity
Claude Code Part 13: I Asked Claude to Replicate a PNAS Paper Using OpenAI's Batch API (Part 1)
TIER 4
Feb 5, 2026
The setup half of the PNAS replication: Claude Code web-crawls the replication package, builds a self-contained project structure, designs the classification prompt, chunks 305k speeches into JSONL batch files, estimates cost at $11, and runs a Referee 2 audit that catches label-normalization edge cases and missing Cohen's Kappa before submission. A useful end-to-end walkthrough of orchestrating a hard empirical task with an AI agent, including defensive scripting and pre-run code review.
claude-code · replication · batch-api · referee2 · research-workflow
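Two of the issues the audit caught, label normalization and a missing Cohen's kappa, fit in a few lines. The sketch below is a generic implementation under simple assumptions (labels are strings; normalization is just trimming and lowercasing) and is not the replication's own code.

```python
from collections import Counter


def normalize(label):
    """Collapse trivial formatting differences before comparing labels."""
    return label.strip().lower()


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters' labels over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    a = [normalize(x) for x in labels_a]
    b = [normalize(x) for x in labels_b]
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both raters used a single identical label
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels: same content, inconsistent formatting.
print(cohens_kappa(["Pro ", "anti", "pro", "Anti"], ["pro", "anti", "pro", "anti"]))
```

Without the normalization step, formatting differences alone would register as disagreement and pull kappa down.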
TIER 4
Feb 23, 2026
Applies Deming's 'every error is information' philosophy to AI-assisted work: rather than fixing each defect, find the data-generating process behind it and codify a rule so it can't recur. Concretely shows building 'prosthetic spatial reasoning' for TikZ (Bezier-curve depth formulas, arrow-crossing checks) since LLMs eyeball instead of compute, and argues skills (structured, operational, global) beat commands (single aspirational memo) — with /insights as a personal Myers-Briggs for finding your own friction. A substantive original workflow essay.
AI workflow · Deming / zero error · skills vs commands · spatial reasoning · /insights
TIER 4
Apr 13, 2026
A detailed account of iterating on his Claude Code skills: diagnosing why /beautiful_deck produced broken TikZ (it specified what to audit but not how to generate safely), adding six generation rules and a 'circuit breaker' to stop infinite compile-fix loops, adopting a reader's agent-isolation and persistent-extraction improvements to /split-pdf, and renaming /fletcher to /blindspot with a 2x2 vice/virtue framework for catching what you stop noticing. Substantive on the craft of building and debugging AI-research skills.
Claude Code · skills · TikZ debugging · research workflow · agent isolation
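A "circuit breaker" for compile-fix loops can be as simple as an attempt cap plus a check for repeated errors. The sketch below illustrates that pattern with caller-supplied callbacks; the function name, the return shape, and the stopping rules are assumptions, not the rules written into /beautiful_deck.

```python
def compile_with_circuit_breaker(compile_once, apply_fix, max_attempts=3):
    """Retry compile/fix, but stop on a repeated error or after max_attempts.

    compile_once() returns None on success or an error message string.
    apply_fix(error) attempts a repair based on that message.
    """
    seen_errors = set()
    for attempt in range(1, max_attempts + 1):
        error = compile_once()
        if error is None:
            return {"ok": True, "attempts": attempt, "reason": "compiled"}
        if error in seen_errors:
            # Same failure again: the last fix did nothing, so hand back to the human.
            return {"ok": False, "attempts": attempt, "reason": "repeated error"}
        seen_errors.add(error)
        apply_fix(error)
    return {"ok": False, "attempts": max_attempts, "reason": "attempt limit reached"}


# Hypothetical stub: the compiler reports the same error twice in a row.
errors = iter(["missing brace", "missing brace"])
print(compile_with_circuit_breaker(lambda: next(errors), lambda e: None, max_attempts=5))
```

Stopping on a repeated error is the important part: it turns a silent loop into an explicit request for human attention.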
TIER 4
May 6, 2026
Cunningham explains why building your own /skills beats borrowing others' (security pipeline risk, fit to your human capital) and details two new multi-agent skills built on his 'gradient decay' conjecture: /split-pdf (chunk a long PDF into 4-page pieces, one agent each, then synthesize) and /bibcheck (one agent per citation or per bibfield to audit references against online sources). A concrete, transferable look at multi-agent skill design for empirical-research workflows.
AI skills · multi-agent design · gradient decay · split-pdf · bibcheck
TIER 4
May 7, 2026
A reflective essay on demoing AI for research: instead of having Claude Code write a paper live (which triggers Luddite repugnance and can't be verified in real time), he builds a 'beautiful deck' from a stranger's published paper, simulating figures from the reported coefficients, where the human is the rhetor and the AI is the medium. Also surfaces the practical insight that every Claude Code session is fully preserved in a JSONL 'flight recorder' for thought, recoverable and auditable. Useful both as AI-demo strategy and as a verification/reproducibility point.
AI demos · beautiful decks · session logs · reproducibility · rhetoric
TIER 4
Apr 24, 2026
Lays out a four-criteria framework for when to deploy AI agents (high-value, high-time, hard-to-do-well, easy-to-do-badly) and applies it to peer-review referee reports, then walks through his concrete skill pipeline (/split-pdf, /beautiful_deck, /referee2, /blindspot, /tikz) for digesting a manuscript without having the LLM write the report. Substantive AI-for-research piece with a reusable decision framework and a worked example of agent-assisted verification ('the bottleneck is verification, not production').
AI agents · referee reports · Claude Code · research workflow · verification
Continuous-treatment diff-in-diff and the TWFE decomposition
1 tier-5 · 6 tier-4
A self-contained tutorial series in which Cunningham teaches himself (and the reader) the Callaway–Goodman-Bacon–Sant'Anna continuous-treatment estimator by building it from the ground up with Claude Code. The throughline is one durable conceptual point: a single TWFE/FWL coefficient under a continuous dose can be algebraically rewritten as several different weighted averages — levels, scaled levels, causal response, scaled 2x2 — each answering a distinct question, none cleanly equal to the causal estimand, and negative weights are the price of clean untreated-vs-treated comparisons. The arc moves from the Frisch-Waugh-Lovell derivation through the four-piece levels decomposition to an interactive R Shiny app that visualizes the sign-flip where below-mean-dose units get negative weight. The motivating slogan: the regression never changes; the question does.
TIER 5
Apr 23, 2026
A deep conceptual explainer of Table 1 in Callaway, Goodman-Bacon & Sant'Anna's continuous-treatment diff-in-diff paper: one TWFE/FWL coefficient can be algebraically rewritten as four different weighted averages (levels, scaled levels, causal response, scaled 2x2), each answering a distinct question, and none cleanly equaling the causal estimand. Makes the durable point that negative weights are the price of clean untreated-vs-treated comparisons and motivates 'population-first' forward-engineered estimators; high lasting reference value for understanding TWFE bias under continuous dose.
continuous diff-in-diff · TWFE · Frisch-Waugh-Lovell · decomposition weights · estimands
TIER 4
Apr 9, 2026
Kicks off the continuous-DiD-with-Claude-Code series, framing the 'Bacon decomposition conjecture' (people learn new estimators only once shown their current one is biased) and Pedro Sant'Anna's backwards- vs forwards-engineering distinction. Sets up the Lu & Yu (2015) China-WTO application and documents the full skill pipeline (/split-pdf, /beautiful_deck, /tikz, /referee2) with prompts, blending substantive methods motivation with a concrete AI-for-research workflow.
continuous diff-in-diff · TWFE · Bacon decomposition · Claude Code · research workflow
TIER 4
Apr 15, 2026
A step-by-step algebraic derivation of the CBS levels decomposition: starting from the TWFE regression with a continuous D x Post interaction, applying Frisch-Waugh-Lovell to residualize the coefficient, then deriving the level-weight function as a weighted average of dose-level outcome differences. A careful, slow-paced methods tutorial that earns its keep as a reference for the FWL mechanics, though partly paywalled.
continuous diff-in-diff · TWFE · Frisch-Waugh-Lovell · derivation · decomposition weights
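For orientation, the Frisch-Waugh-Lovell step is easiest to see in the two-period special case. The derivation below is a simplified sketch with notation introduced here (unit effects \alpha_i, period effects \lambda_t, dose D_i); it is not the paper's full decomposition.

```latex
\begin{align*}
Y_{it} &= \alpha_i + \lambda_t + \beta\,(D_i \times \mathrm{Post}_t) + \varepsilon_{it},
  \qquad t \in \{0, 1\} \\
\Delta Y_i &= (\lambda_1 - \lambda_0) + \beta\, D_i + \Delta\varepsilon_i \\
\hat\beta &= \frac{\sum_i (D_i - \bar D)\,\Delta Y_i}{\sum_j (D_j - \bar D)^2}
  = \sum_i w_i\,\Delta Y_i,
  \qquad w_i = \frac{D_i - \bar D}{\sum_j (D_j - \bar D)^2}
\end{align*}
```

First-differencing removes the unit effects, and partialling out the intercept demeans the dose. The weights w_i sum to zero, and w_i is negative exactly when D_i is below the mean dose.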
TIER 4
Apr 20, 2026
Walks through a Claude Code-built R Shiny app that visualizes the CBS levels-decomposition weights, explaining the three ingredients (mean dose, variance, kernel density) and the key sign-flip where below-mean-dose units get negative weights and above-mean units get positive weights. A useful hands-on explainer that combines a methods tutorial with an AI-built interactive teaching tool, including a kernel-smoothing edge-case lesson.
continuous diff-in-diff · TWFE · R Shiny · decomposition weights · Claude Code
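The sign-flip can be reproduced in a few lines for the two-period case, where the TWFE coefficient on dose × post equals the slope from regressing each unit's outcome change on its dose. The sketch below uses unit-level weights built from the mean dose and the sum of squared deviations; it uses the empirical dose distribution rather than the kernel density the app relies on, and the data are made up.

```python
def dose_weights(doses):
    """Weight on each unit's outcome change: (D_i - mean) / sum of squared deviations."""
    mean = sum(doses) / len(doses)
    ss = sum((d - mean) ** 2 for d in doses)
    if ss == 0:
        raise ValueError("dose has no variation")
    return [(d - mean) / ss for d in doses]


def twfe_slope(doses, delta_y):
    """Two-period TWFE coefficient written as a weighted sum of outcome changes."""
    return sum(w * dy for w, dy in zip(dose_weights(doses), delta_y))


# Hypothetical doses and outcome changes (change = 1 + 2 * dose).
doses = [0, 1, 2, 3, 4]
delta_y = [1, 3, 5, 7, 9]

for d, w in zip(doses, dose_weights(doses)):
    print(f"dose={d}: weight={w:+.2f}")
print("slope:", twfe_slope(doses, delta_y))
```

Units below the mean dose enter with negative weight and units above it with positive weight, while the weights sum to zero; the regression coefficient is the same number however the sum is regrouped.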
TIER 4
Jan 20, 2026
Uses Claude Code to revive his JHR (2019) Texas HB2 abortion-access project and prepare to re-estimate it with Callaway/Goodman-Bacon/Sant'Anna continuous diff-in-diff, surfacing a real data audit (thesis vs JHR distances diverging 8% of the time, e.g. Lubbock 307 vs 78 miles) and a genuine identification puzzle about who serves as counterfactual when 42% of Texas lives in five urban counties with no treatment variation. A substantive applied-econometrics case study that combines AI-workflow documentation with a concrete continuous-DiD identification problem. (Repost of 0112 with the correct video.)
claude-code · continuous-diff-in-diff · abortion-access · identification · data-audit
Claude Code Series (part 8): Resurrecting and Extending an Old Abortion Paper Towards Using Continuous Diff-in-Diff [original send]
TIER 4
Jan 20, 2026
Identical content to 0111: a Claude Code case study reviving the JHR Texas HB2 abortion project for continuous diff-in-diff, with a thesis-vs-JHR distance data audit and the urban-counterfactual identification problem. This is the original post that was later deleted and reposted as 0111 because it had the wrong video attached; substantively it is the same essay.
claude-code · continuous-diff-in-diff · abortion-access · identification · data-audit
TIER 5
Jun 12, 2026
A substantive methods tutorial arguing that diff-in-diff's parallel-trends assumption is really just selection bias expressed in first-differenced Y(0), and that the non-parallel-trends bias term is itself a 2x2 diff-in-diff. Cunningham derives the bias from the 2x2 in a few algebraic moves and ties it to Imbens's 'vertical regression' framing of synthetic control, showing horizontal (first-difference) and vertical (group-difference) estimation are numerically identical. Lasting reference value as a conceptual reframing of the field's most common identification assumption.
difference-in-differences · parallel trends · selection bias · synthetic control · vertical regression
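The "few algebraic moves" can be written out as the standard 2x2 decomposition. The notation below is introduced here: Y_t(d) is the potential outcome in period t under treatment status d, D marks the treated group, and no anticipation is assumed, so pre-period outcomes equal Y_0(0) for both groups.

```latex
\begin{align*}
\delta^{2\times 2}
&= \big(E[Y_1 \mid D=1] - E[Y_0 \mid D=1]\big)
 - \big(E[Y_1 \mid D=0] - E[Y_0 \mid D=0]\big) \\
&= \underbrace{E[Y_1(1) - Y_1(0) \mid D=1]}_{\text{ATT}}
 + \underbrace{E[Y_1(0) - Y_0(0) \mid D=1] - E[Y_1(0) - Y_0(0) \mid D=0]}_{\text{non-parallel-trends bias}}
\end{align*}
```

The second line adds and subtracts E[Y_1(0) | D=1]. The bias term is itself a 2x2 difference-in-differences in Y(0), that is, selection bias in first differences, and it vanishes exactly when parallel trends holds.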
Callaway–Sant'Anna and the perils of staggered DiD with covariates
3 tier-5 · 4 tier-4
A sharp, practitioner-facing cluster on a single underappreciated fact: Callaway–Sant'Anna with covariates secretly fits one propensity-score logit *per treatment cohort*, and that hidden stage is where things break. Cunningham develops the consequences across several posts: Peduzzi's events-per-variable rule means the binding constraint is treated units per cohort (so U.S. state-level staggered panels with singleton treated states routinely violate it), some packages quietly drop covariates or fail to converge, and an apples-to-apples audit running identical specs across six R/Stata/Python packages produced ATT estimates ranging up to ~5x apart, driven by covariate handling, matrix conditioning, and near-separation in the logit. The constructive payoff: zero-covariate baselines, z-scoring covariates, reporting package and version, and switching to regression adjustment (which has no events-per-variable problem).
TIER 5
Apr 1, 2026
A sharp, original methods tutorial on a widely-missed pitfall: because Callaway-Sant'Anna estimates one propensity-score logit per treatment cohort, the binding constraint on covariate count is the events-per-cohort (Peduzzi's ~10 events/variable rule), not total treated units, so cohorts with few treated units silently fail. Warns that some packages conceal the logit stage and may quietly drop all covariates, with state-level staggered panels especially prone to non-convergence -- lasting reference value for anyone running CS DiD with covariates.
Callaway-Sant'Anna · diff-in-diff · propensity score · covariates · events per variable
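The per-cohort check is easy to run before estimating anything. A minimal sketch, assuming the treated units in each cohort are the "events" in that cohort's logit and using the ~10 events-per-variable rule of thumb as a default; the function name `events_per_variable` and the example cohorts are hypothetical:

```python
from collections import Counter

def events_per_variable(cohort_of_treated_unit, n_covariates, threshold=10):
    """Return {cohort: (treated_units, epv, passes)} for each treatment cohort.

    cohort_of_treated_unit: one cohort label per treated unit.
    threshold: rule-of-thumb minimum events per variable (~10 in the text).
    """
    if n_covariates <= 0:
        raise ValueError("n_covariates must be positive")
    counts = Counter(cohort_of_treated_unit)
    report = {}
    for cohort, n_treated in sorted(counts.items()):
        epv = n_treated / n_covariates
        report[cohort] = (n_treated, epv, epv >= threshold)
    return report

# Hypothetical state-level panel: first-treatment year of each treated state.
cohorts = [2012, 2014, 2014, 2016, 2016, 2016]
for cohort, (n, epv, ok) in events_per_variable(cohorts, n_covariates=3).items():
    print(cohort, n, round(epv, 2), "ok" if ok else "below threshold")
```

With three covariates, even the largest hypothetical cohort here has one treated unit per covariate, far below the threshold, which is the state-level situation the tutorial warns about.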
TIER 5
Mar 12, 2026
Reports a Claude-Code-built audit running the identical CS estimator with identical specs and data across six packages in R, Stata, and Python, finding ATT estimates ranging from 0.0 to 2.38 (up to ~5x) driven entirely by covariate handling. The root causes are numerical (a large poptotaltrend covariate wrecking matrix conditioning, fixed by z-scoring) and statistical (near-separation in the hidden propensity-score logit, which different optimizers handle differently and z-scoring does not fix). A substantive, original contribution documenting between-package variation as an undocumented source of publication bias, with concrete diagnostics (zero-covariate baseline, report package+version, standardize covariates).
callaway-santanna · package-variation · propensity-score · near-separation · code-audit
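The numerical half of that diagnosis can be illustrated without any estimator. A minimal sketch, assuming a design matrix with only an intercept and one covariate; the covariate values are made up to mimic a population-scale variable, and `gram_condition_number` is a hypothetical helper, not anything from the audit:

```python
import math
import statistics

def gram_condition_number(x):
    """Condition number of X'X for the design matrix X = [1, x]."""
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    # Eigenvalues of the symmetric 2x2 matrix [[n, sx], [sx, sxx]].
    trace = n + sxx
    det = n * sxx - sx * sx
    root = math.sqrt((n - sxx) ** 2 + 4 * sx * sx)  # discriminant, always >= 0
    lam_max = (trace + root) / 2
    lam_min = det / lam_max  # stable form of (trace - root) / 2
    return lam_max / lam_min

def zscore(x):
    """Center to mean zero and scale to unit (population) standard deviation."""
    mu = statistics.fmean(x)
    sd = statistics.pstdev(x)
    return [(v - mu) / sd for v in x]

# Hypothetical covariate with very large magnitudes.
raw = [1.2e6, 3.4e6, 2.1e6, 5.9e6, 4.4e6, 2.8e6]
print(gram_condition_number(raw))          # very large
print(gram_condition_number(zscore(raw)))  # close to 1
```

Z-scoring only repairs the scaling problem. It does nothing for near-separation in the logit, which is the statistical failure the post keeps separate.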
TIER 5
Jun 8, 2026
A strong applied-practice essay arguing that the returns to econometrics knowledge are higher, not lower, under AI: Claude specification-searches when given discretion and silently stores divergent decisions in JSON, so 'plan then let it rip' backfires. Cunningham recounts a multi-hour CSDID debugging saga (the wrong CRAN-vs-GitHub 'did' install) that only his deep human capital caught, and reproduces 'Pedro's Checklist' (name the estimand in potential outcomes, cohort tables, covariate balance, panelview rollout, outcome-evolution plots) as the structured-analog defense. Core thesis: make zero error the binding constraint.
AI for research · Callaway-Sant'Anna · staggered diff-in-diff · checklist/workflow · estimand specification
Claude Code 34: Using "Dispatch" on my phone with Claude Code to revisit the cannabis paper
TIER 4
Mar 20, 2026
Framed around the new phone-based "Dispatch" remote-control feature, but the substantive core is a methods lesson: Callaway-Sant'Anna with covariates secretly fits a per-cohort logit propensity score, and Peduzzi et al.'s "events per variable" (EPV) rule means you need ~10 treated units per covariate per cohort-year to avoid biased propensity scores. Cunningham argues U.S. state-level staggered DiD routinely violates this (singleton treated states), and the fixes are dropping to county-level data or switching from doubly-robust/IPW to regression adjustment, which has no events-per-variable problem. Useful for anyone running CS with covariates.
callaway-santanna · events-per-variable · propensity-score · conditional-parallel-trends · claude-code
TIER 4
Feb 25, 2026
Proposes treating LLM coding hallucination as classical measurement error and exploits its cross-language independence: if R, Python, and Stata errors are stochastic and independent, replicating a deterministic pipeline in all three and demanding identical output to several digits becomes a powerful code audit. Lays out where the method works (OLS, DiD, IV, analytical SEs) and where it fails (bootstrap, MCMC, simulation/ML with random seeds), using five Callaway-Sant'Anna packages as the case study. A genuinely useful, transferable verification framework.
code audit · difference-in-differences · measurement error · LLM hallucination · replication
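The audit reduces to one comparison per reported quantity. A minimal sketch, assuming each language's pipeline has already produced a point estimate; the function name `cross_language_agreement`, the tolerance convention, and the example values are illustrative assumptions:

```python
def cross_language_agreement(estimates, digits=6):
    """Return (agrees, max_gap) for one quantity estimated by several implementations.

    estimates: dict such as {"R": 1.5, "Stata": 1.5, "Python": 1.5}.
    digits: number of decimal places that must match across implementations.
    """
    if len(estimates) < 2:
        raise ValueError("need at least two implementations to compare")
    values = list(estimates.values())
    tolerance = 10.0 ** (-digits)
    max_gap = max(values) - min(values)
    return max_gap <= tolerance, max_gap

# Hypothetical point estimates from three independently written pipelines.
print(cross_language_agreement({"R": 1.5, "Stata": 1.5, "Python": 1.5}))
print(cross_language_agreement({"R": 1.5, "Stata": 1.5, "Python": 1.6}))
```

The check is only meaningful for deterministic pipelines. For the bootstrap, MCMC, or seeded simulation cases the post lists as failures, a gap above tolerance says nothing about coding errors.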
TIER 4
Feb 27, 2026
Runs 15 isolated AI agents (5 packages across Stata/R/Python) on the same Brazilian CAPS diff-in-diff dataset to recreate a 'many-analyst design' and locate where researcher discretion ('non-standard errors') enters staggered DiD. Key finding: agents agreed unanimously on every structural choice (control group, base period, trimming) but diverged entirely on covariate selection: the confounder-vs-mediator boundary is where the variation lives. A substantive methods+AI piece, though it ends on a cliffhanger before showing the event-study spread.
difference-in-differences · Callaway-Sant'Anna · non-standard errors · multi-analyst design · AI agents
TIER 4
Mar 4, 2026
Reviews the results of a multi-analyst CS experiment (5 packages, 3 runs each, 15 estimates) where the only varying dimension was covariate selection, finding that more covariates yield larger ATTs, ~77% of variation comes from package choice, and one estimate is ~11x another despite identical data and estimator. The standard deviation across estimates (0.442) is 2.4x the average reported standard error (0.185), demonstrating that conventional standard errors miss analyst-driven uncertainty. A concrete, useful empirical demonstration with teaching value on universal baseline and parallel-trends-as-long-differences.
callaway-santanna · many-analyst-design · covariate-selection · non-standard-errors · code-audit
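The headline comparison is a ratio of two summary statistics. A minimal sketch of how it would be computed from a set of estimates and their reported standard errors; the function name `analyst_dispersion_ratio` and the toy inputs are hypothetical, and the last line simply redoes the arithmetic on the two figures quoted above:

```python
import statistics

def analyst_dispersion_ratio(estimates, standard_errors):
    """Return (sd across estimates, mean reported SE, their ratio)."""
    sd_across = statistics.stdev(estimates)       # sample SD across analysts
    mean_se = statistics.fmean(standard_errors)   # average conventional SE
    return sd_across, mean_se, sd_across / mean_se

# Toy inputs, not the experiment's 15 estimates.
print(analyst_dispersion_ratio([1.0, 2.0, 3.0], [0.5, 0.5]))

# Arithmetic on the figures reported in the summary above.
print(round(0.442 / 0.185, 2))
```

A ratio above 1 means the spread across analysts exceeds what a single analyst's standard error would suggest.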
LLMs as measurement instruments — text-as-data and replication
1 tier-5 · 1 tier-4
A focused empirical arc treating large language models not as writing assistants but as classification and measurement tools, and asking whether they reproduce trained-classifier results. Cunningham re-classifies hundreds of thousands of congressional speeches from Card et al.'s PNAS immigration-rhetoric paper with gpt-4o-mini via the Batch API ($11, ~2.6 hours), gets 69% agreement with a fine-tuned RoBERTa, and shows the paper's polarization findings survive — because disagreements cluster at the decision boundary and largely cancel in the net-tone difference. The capstone finding (the gpt-4o-mini thermometer-heaping result, issue 0088) and the setup half of the replication (issue 0101) are filed in Themes 2 and 3 respectively; the lower-tier installments of the "why does it work" sub-arc (0094, 0095) are counted but not keepers for this guide. The cluster also includes the Ludwig–Mullainathan–Rambachan framework distinguishing prediction tasks (which require no training leakage) from estimation tasks (which need a small human-coded validation sample to debias LLM labels).
TIER 5
Feb 6, 2026
Reports results of using gpt-4o-mini via OpenAI's Batch API to re-classify 305k congressional speeches from Card et al.'s PNAS immigration-rhetoric paper: 69% agreement with the fine-tuned RoBERTa classifier, $11, 2.6 hours, with disagreements clustering toward NEUTRAL and direct PRO/ANTI flips rare. A substantive, reusable demonstration that zero-shot LLM classification can replicate trained-classifier results cheaply and that the paper's polarization findings are robust, with practical lessons on the Batch API, Cohen's kappa, and transition matrices.
llm-classification · replication · batch-api · text-analysis · ai-for-research
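Both agreement diagnostics named here are short computations on two label vectors. A minimal sketch using the standard definition of Cohen's kappa; the label lists are made up, and only the PRO/ANTI/NEUTRAL category names come from the summary above:

```python
from collections import Counter

def transition_matrix(labels_a, labels_b):
    """Counts of (label from classifier A, label from classifier B) pairs."""
    return Counter(zip(labels_a, labels_b))

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both classifiers used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from a fine-tuned classifier and a zero-shot LLM.
roberta = ["PRO", "ANTI", "NEUTRAL", "ANTI", "PRO", "NEUTRAL"]
llm = ["PRO", "NEUTRAL", "NEUTRAL", "ANTI", "NEUTRAL", "NEUTRAL"]
print(cohens_kappa(roberta, llm))
print(transition_matrix(roberta, llm))
```

The off-diagonal cells of the transition matrix show where disagreements go. In this toy data they all land on NEUTRAL, the same qualitative pattern the post reports.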
TIER 4
Mar 25, 2026
A substantive econometrics explainer of the Ludwig-Mullainathan-Rambachan framework for using LLMs in research, distinguishing prediction tasks (which require 'no training leakage,' unattainable via prompt engineering since models memorize training text) from estimation tasks (which require a small human-coded validation sample to debias LLM labels). High value as a guide to a method economists increasingly need, though the back half is paywalled.
LLM econometrics · measurement error · validation sample · text classification · prediction vs estimation
Teaching econometrics in the age of AI
0 tier-5 · 2 tier-4
Cunningham's pedagogy essays, unified by one claim: generative AI has decoupled task-completion from learning, which historically were the same act, so teaching must change to force the time-on-mechanics that builds real human capital. The concrete moves are by-hand spreadsheet worksheets (decomposing a difference in means into ATE + selection bias + reweighted ATT/ATU; computing regression coefficients manually and checking against R), and — paradoxically — using AI to *manufacture* teaching examples: having Claude generate exemplar papers in three genres (descriptive, predictive, causal) so students can study the rhetoric of each, even while he bans AI in his courses. The "rules before strategy" framing argues a skill splits into two halves that should be taught separately.
TIER 4
Nov 13, 2025
A substantive teaching essay arguing that learning a skill splits into 'rules' and 'strategy' which should be taught separately, and applying this to his 200-student Harvard Gov 50 course via by-hand spreadsheet worksheets (decomposing a difference in means into ATE + selection bias + reweighted ATT/ATU; manually computing bivariate and multivariate regression coefficients, verified against R's lm). The deeper claim: because generative AI decouples task completion from learning (historically the same act), pedagogy must change to force the time-on-mechanics that builds real human capital, lest a generation produce research outputs it doesn't understand. Original pedagogy-of-causal-inference argument with shareable worksheets.
teaching · pedagogy · regression mechanics · ai and learning · causal inference
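The difference-in-means decomposition in the worksheet can be verified on a five-row table. A minimal sketch, assuming the standard identity (simple difference = ATE + selection bias + (1 − share treated) × (ATT − ATU)); the potential outcomes below are invented for illustration and are not the course's worksheet data:

```python
from fractions import Fraction

# Hypothetical teaching table: (y1, y0, treated). Both potential outcomes are
# listed only because this is a worksheet; real data reveal one per unit.
units = [(7, 1, 1), (5, 1, 1), (9, 4, 1), (5, 6, 0), (7, 8, 0)]

def mean(values):
    """Exact mean as a Fraction, so the identity holds without rounding."""
    values = list(values)
    return Fraction(sum(values), len(values))

treated = [u for u in units if u[2] == 1]
control = [u for u in units if u[2] == 0]
share_treated = Fraction(len(treated), len(units))

# Simple difference in observed mean outcomes.
sdo = mean(y1 for y1, _, _ in treated) - mean(y0 for _, y0, _ in control)

ate = mean(y1 - y0 for y1, y0, _ in units)
att = mean(y1 - y0 for y1, y0, _ in treated)
atu = mean(y1 - y0 for y1, y0, _ in control)
selection_bias = mean(y0 for _, y0, _ in treated) - mean(y0 for _, y0, _ in control)
heterogeneity = (1 - share_treated) * (att - atu)

print(sdo, ate, selection_bias, heterogeneity)
assert sdo == ate + selection_bias + heterogeneity
```

In this table the observed difference in means is zero even though the average effect is positive, because negative selection bias offsets it. That is the kind of surprise the by-hand exercise is meant to surface.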
TIER 4
Apr 16, 2026
An original teaching essay arguing that AI agents can now produce full journal-submittable empirical papers, which both threatens the value of human attention/learning and opens a new pedagogical use: having Claude generate exemplar papers in three distinct genres (descriptive, predictive, causal) so students can study the rhetoric of each. Cunningham explains why he bans AI in his courses yet uses it to manufacture teaching examples, a substantive reflection on AI's effect on econometrics pedagogy.
AI-generated papers · econometrics teaching · research genres · Claude Code · academic policy
Applied empirical essays and the sociology of the discipline
0 tier-5 · 5 tier-4
The remaining substantive essays that sit outside the methods tutorials and the AI-workflow spine: applied economics, the statistics of research integrity, and the intellectual history of the field. Cunningham dispatches Claude Code to crawl EDGAR and build an HHI analysis of Match Group's dating-app portfolio (a zero-price market that evades antitrust); works through why reconstructing t-statistics from rounded coefficients manufactures false bunching at t=2 (and publicly retracts his own p-hacking claim when he finds the artifact); distinguishes three sources of uncertainty — sampling, design-based, and analyst/researcher uncertainty that standard errors never capture, now operationalizable via many-analyst automation; and maps the family tree of causal inference through Orley Ashenfelter's academic lineage.
TIER 4
Dec 31, 2025
Cunningham lays out his intellectual-history project tracing causal inference in economics through Orley Ashenfelter's academic lineage — the convergence of the Princeton quasi-experimental tradition, the Rubin potential-outcomes framework (linked via Imbens and Angrist), and Heckman's Chicago structural branch — built into a ~1,100-economist genealogy database he's crowdsourcing. A substantive sociology-of-the-discipline essay with lasting reference value for understanding how the credibility revolution propagated, plus a foreshadowing of a planned book.
causal inference history · Orley Ashenfelter · academic genealogy · Princeton IRS · credibility revolution
TIER 4
Apr 6, 2026
An original applied-economics essay arguing that Match Group's portfolio (Tinder, Hinge, OkCupid, etc.) constitutes critical social infrastructure with an HHI near double the DOJ 'highly concentrated' threshold, yet escapes antitrust scrutiny because zero-price markets evade the consumer-welfare price-increase test. Doubles as a worked demonstration of dispatching Claude Code to crawl EDGAR filings and build the HHI/revenue analysis, with Cunningham flagging his own unverified calculations.
antitrust · market concentration · zero-price markets · online dating · Claude Code research
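The HHI itself is a one-line formula: the sum of squared market shares, with shares in percentage points. A minimal sketch; the shares below are invented and are not Match Group's, and the concentration threshold is deliberately left out because the summary above does not state the figure:

```python
def hhi(shares_percent):
    """Herfindahl-Hirschman Index from market shares given in percent (0-100)."""
    if any(s < 0 or s > 100 for s in shares_percent):
        raise ValueError("each share must be between 0 and 100")
    if sum(shares_percent) > 100 + 1e-9:
        raise ValueError("shares cannot sum to more than 100")
    return sum(s * s for s in shares_percent)

# Hypothetical market with three firms.
print(hhi([50, 30, 20]))
```

The index ranges from near 0 (many tiny firms) to 10,000 (a single firm). When one parent owns several brands, their shares should be summed before squaring, which is why portfolio ownership matters for the calculation.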
TIER 4
Apr 2, 2026
A substantive methods explainer showing why reconstructing t-statistics from rounded coefficients and standard errors manufactures false heaping at t=2 (because small-scale outcomes collapse rounded ratios onto simple integer ratios, with t being scale-invariant), creating spurious p-hacking signals. Built with a Shiny app, video walkthrough, and a Claude Code transcript, and sets up the Brodeur-comment history where the diff-in-diff p-hacking finding dissolved after the rounding fix.
p-hacking · rounding artifacts · t-statistics · Shiny app · research methods
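The artifact is easy to reproduce. A minimal sketch, assuming coefficients and standard errors are reported to two decimals; the three estimates are invented, and `reconstructed_t` is a hypothetical helper rather than code from the post or its Shiny app:

```python
def reconstructed_t(coef, se, decimals):
    """t-statistic rebuilt from a coefficient and SE rounded to `decimals` places."""
    rounded_se = round(se, decimals)
    if rounded_se == 0:
        return float("nan")  # the SE rounded away entirely
    return round(coef, decimals) / rounded_se

# Hypothetical small-scale estimates whose true t-statistics differ from 2.
examples = [(0.0417, 0.0198), (0.0365, 0.0211), (0.0442, 0.0236)]
for coef, se in examples:
    true_t = coef / se
    print(round(true_t, 3), reconstructed_t(coef, se, decimals=2))

# Same first estimate with the outcome rescaled by 1000: the true t is
# unchanged, and rounding no longer distorts the reconstruction.
print(reconstructed_t(41.7, 19.8, decimals=2))
```

All three small-scale examples collapse onto 0.04 / 0.02, so the reconstructed statistic is 2 regardless of the true value. The rescaled version keeps enough significant digits to recover the true ratio.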
TIER 4
Mar 30, 2026
A candid correction retracting the prior claim that APE AI-generated papers p-hack: prompted by a reader's question, Cunningham realized his spike at t=2 was a rounding artifact from dividing rounded coefficients by rounded standard errors, not manipulation, and a donut-hole drop of exact-2s collapsed his bunching ratio from 1.52 to 1.02. Valuable both as a methods lesson on why Brodeur's test uses true t-stats and as a model of public error-correction.
p-hacking · rounding artifacts · correction · Brodeur test · AI-generated papers
TIER 4
Mar 6, 2026
A conceptual statistics essay distinguishing three sources of uncertainty in estimates: sampling (iid), design-based (treatment assignment), and a third, researcher/analyst uncertainty surfaced by many-analyst designs (Silberzahn et al.), which standard errors never capture because they hold the researcher fixed. Argues Claude Code makes the previously theoretical many-analyst design operational by automating perturbation of discretionary nodes (covariates, package) to build a forest plot and possible p-value, while flagging the open problem of enumerating all endogenous discretionary nodes. A thoughtful, original framing of non-standard errors.
non-standard-errors · many-analyst-design · researcher-uncertainty · inference · claude-code
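Perturbing discretionary nodes amounts to enumerating a grid of specifications. A minimal sketch of that enumeration only; the node names, covariate labels, and package labels are placeholders, and the estimation step that would turn each specification into a row of the forest plot is not shown:

```python
from itertools import product

def specification_grid(nodes):
    """Every combination of choices across discretionary nodes.

    nodes: dict mapping a node name to the list of options an analyst could pick.
    """
    names = list(nodes)
    return [dict(zip(names, combo)) for combo in product(*(nodes[n] for n in names))]

# Hypothetical nodes; covariate sets and package labels are placeholders.
nodes = {
    "covariates": [(), ("x1",), ("x1", "x2")],
    "package": ["package_a", "package_b"],
}
grid = specification_grid(nodes)
print(len(grid))
for spec in grid:
    print(spec)
```

The grid grows multiplicatively with each added node, and it covers only the nodes someone thought to list. That is the open enumeration problem the essay flags.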