
Tech & AI

AI Snake Oil

Arvind Narayanan & Sayash Kapoor

23 issues · 20 keepers · 12 tier-5 · 8 tier-4

AI Agents — Reliability & Benchmark Reform

2 tier-5 · 1 tier-4

Across this cluster Narayanan and Kapoor build the case that the way the field measures AI agents is broken, and that the broken measurement is *why* agents impress on leaderboards while disappointing in deployment. Their core moves: accuracy is the wrong sole metric (cost must be co-optimized via Pareto curves), benchmarks reward shortcuts and irreproducible setups, and "average task success rate" hides the consistency/robustness/calibration failures that actually block economic impact. Reliability — not raw capability — is the research bottleneck, and they propose concrete reforms and metrics to make it measurable.

AI leaderboards are no longer useful. It's time to switch to Pareto curves.

TIER 5 Apr 30, 2024

Argues that agent accuracy leaderboards are misleading because they ignore cost: state-of-the-art agent architectures (LDB, LATS, Reflexion) are no more accurate than simple baselines (retry, warming, escalation) while costing up to 50x more, so evaluations must use accuracy-cost Pareto curves and report real dollar costs rather than proxies like parameter count. It introduces the model-evaluation vs downstream-evaluation distinction and documents pervasive reproducibility failures in published agent results. The original framework that seeded the 'AI agents that matter' research line.

AI agents · Pareto curves · cost-controlled evaluation · leaderboards · reproducibility
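To make the accuracy-cost Pareto idea in the entry above concrete, here is a minimal sketch of selecting the agents that sit on the frontier. The agent names, the `AgentResult` type, and all numbers are hypothetical and chosen only for illustration; they are not results from the essay.

```python
from typing import List, NamedTuple


class AgentResult(NamedTuple):
    name: str
    cost_usd: float  # hypothetical average dollar cost per task
    accuracy: float  # fraction of benchmark tasks solved


def pareto_frontier(results: List[AgentResult]) -> List[AgentResult]:
    """Keep every agent that no other agent dominates.

    An agent is dominated if another agent is at least as cheap and at
    least as accurate, and strictly better on one of the two axes.
    """
    frontier = []
    for r in results:
        dominated = any(
            o.cost_usd <= r.cost_usd
            and o.accuracy >= r.accuracy
            and (o.cost_usd < r.cost_usd or o.accuracy > r.accuracy)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    return sorted(frontier, key=lambda r: r.cost_usd)


# Made-up numbers for illustration only.
results = [
    AgentResult("simple_baseline", 0.50, 0.85),
    AgentResult("retry_baseline", 1.00, 0.90),
    AgentResult("complex_agent_a", 25.00, 0.88),
    AgentResult("complex_agent_b", 10.00, 0.90),
]
frontier_names = [r.name for r in pareto_frontier(results)]
print(frontier_names)
```

In this toy data the two complex agents drop off the frontier because a cheaper baseline matches or beats their accuracy, which is the kind of pattern an accuracy-only leaderboard would hide.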

New paper: AI agents that matter

TIER 5 Jul 3, 2024

Announces the influential 'AI Agents That Matter' paper, which argues that agent benchmarks reward systems that score well without being useful in practice. It offers a spectrum definition of 'agentic' (environment/goals, supervision, system design) and five reforms to agent evaluation: cost-controlled evaluation, joint accuracy-cost (Pareto) optimization, separating model from downstream benchmarking, preventing benchmark shortcuts via proper hold-outs, and improving reproducibility. Foundational reference for rigorous agent evaluation, with reliability framed as the central research bottleneck.

AI agents · agent benchmarks · evaluation methodology · reliability · reproducibility

New Paper: Towards a science of AI agent reliability

TIER 4 Feb 24, 2026

Presents the paper 'Towards a Science of AI Agent Reliability', arguing average task success rate is a poor measure and that reliability (borrowing from aviation and nuclear safety) decomposes into consistency, robustness, calibration/predictability, and safety — twelve metrics in all. Testing 14 models over 18 months on GAIA and TauBench, they find capability rose sharply while reliability gained only modestly across all major providers, explaining why agent economic impact lags benchmark scores; proposes a reliability index.

AI agents · reliability · evaluation · benchmarks · calibration
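One way to see why average task success rate can hide consistency failures is a toy comparison of two agents with the same average. The `all_runs_success` measure below is a hypothetical stand-in chosen for illustration; it is not claimed to be one of the paper's twelve metrics, and the run data is invented.

```python
from typing import List


def mean_success(runs: List[List[bool]]) -> float:
    """Average success over every (task, run) pair."""
    total = sum(sum(task_runs) for task_runs in runs)
    count = sum(len(task_runs) for task_runs in runs)
    return total / count


def all_runs_success(runs: List[List[bool]]) -> float:
    """Fraction of tasks solved on every repeated run (illustrative consistency measure)."""
    return sum(all(task_runs) for task_runs in runs) / len(runs)


# Rows are tasks, columns are repeated runs of the same task (invented data).
steady_agent = [
    [True, True, True, True],
    [True, True, True, True],
    [False, False, False, False],
    [False, False, False, False],
]
flaky_agent = [
    [True, False, True, False],
    [False, True, False, True],
    [True, True, False, False],
    [False, False, True, True],
]

print(mean_success(steady_agent), all_runs_success(steady_agent))
print(mean_success(flaky_agent), all_runs_success(flaky_agent))
```

Both agents average 50% success, but only the steady one solves any task on every attempt, so a user cannot rely on the flaky agent for any single task.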

Capability Claims, Scaling & the Limits of AGI

2 tier-5 · 2 tier-4

This is the authors' direct assault on trend-extrapolation AGI forecasting. They dismantle the assumption that scaling laws plus a rising curve equal imminent superintelligence: scaling predicts perplexity, not the "emergent abilities" people care about; data is exhausting; AGI is not even a coherent milestone to scale *toward*; and the heuristics used to reason about capability (Moravec's paradox, scaling extrapolations, industry leaders' flip-flopping forecasts) are far weaker than they appear. The throughline: capability prediction is unreliable, and you don't need it anyway because diffusion is slow.

AI scaling myths

TIER 5 Jun 27, 2024

Dismantles the belief that continued scaling will produce AGI: scaling laws only predict perplexity, not the 'emergent abilities' users care about, which follow no such law; high-quality training data is nearly exhausted and synthetic data won't add volume; and market pressure is pushing models smaller, not bigger. It reframes progress as a 'ladder of generality' of punctuated equilibria where the field has consistently failed to predict the next step. A widely cited counter to trend-extrapolation AGI timelines (e.g. Aschenbrenner).

scaling laws · emergence · AGI · training data · synthetic data

AGI is not a milestone

TIER 5 May 1, 2025

Argues AGI is not a milestone: it is unobservable (no significant capability threshold), has no immediate impact, and any transformation can only be judged retrospectively because impact depends on the deployment environment, not system properties. Distinguishes capability from power to reject the idea of a critical point at which humanity loses control, using nuclear weapons as an explicit anti-analogy for AGI.

AGI · capability vs power · diffusion · superintelligence · AI milestones

Is AI progress slowing down?

TIER 4 Dec 18, 2024

Examines whether AI progress is slowing after reports that OpenAI, Anthropic, and Google all hit walls with next-gen models, making four points: declaring model scaling dead is premature, industry leaders' flip-flopping shows their forecasts are unreliable and interest-driven, inference (test-time-compute) scaling is real but unpredictable and uneven, and capability gains link weakly to social/economic impact (which is gated by products and adoption, not capability).

AI scaling · inference scaling · AI progress · forecasting · capability vs impact

Fact checking Moravec's paradox

TIER 4 Jan 29, 2026

Fact-checks Moravec's paradox (tasks hard for humans are easy for AI and vice versa), showing it was never empirically tested, is really a statement about what the AI field finds worth working on, has no predictive power, and rests on dubious evolutionary just-so reasoning. Argues this style of thinking has fueled both alarmism about superintelligent reasoning and false comfort in robotics, and that adapting to AI does not require predicting capability breakthroughs since diffusion is slow.

Moravec's paradox · capability prediction · AI hype · robotics · diffusion

AI in Science & the Reproducibility Crisis

2 tier-5 · 0 tier-4

The authors' home turf as researchers: the use of machine learning *inside* science, and the methodological rot it can spread. One essay documents data leakage as a cross-disciplinary reproducibility crisis spanning hundreds of papers, fed by publish-or-perish incentives and hype that suppresses skepticism; the other inverts the usual optimism to argue AI may *slow* science by flooding a non-market system with papers while the real bottleneck (turning output into progress) goes untouched. Together they argue AI-for-science funding may be aimed at the wrong target.

Scientists should use AI as a tool, not an oracle

TIER 5 Jun 3, 2024

An update on the reproducibility crisis in ML-based science: data leakage now spans 30 disciplines and ~650+ papers, compounded by deep cultural problems (publish-or-perish, no incentive to debunk) and AI hype that suppresses scientific skepticism in a self-reinforcing feedback loop. It warns that hockey-stick AI adoption curves across fields are a danger signal, that AI-for-science funding may be making things worse, and offers mitigations (the REFORMS checklist, reproducibility incentives). A definitive treatment of leakage and ML reproducibility.

reproducibility crisis · data leakage · ML-based science · REFORMS checklist · AI hype

Could AI slow science?

TIER 5 Jul 16, 2025

Argues that AI, far from accelerating science, may slow it, because science is a complex non-market system and AI is being used to accelerate paper production when the real bottleneck is the production-progress paradox (exponential output, flat progress). Warns AI worsens reproducibility problems, may entrench flawed theories, and erodes the human understanding that drives genuine breakthroughs, calling on funders and AI companies to target actual bottlenecks rather than just output.

AI in science · metascience · reproducibility · production-progress paradox · scientific progress

From Capability to Impact — Products, Jobs & Commercial Return

2 tier-5 · 0 tier-4

The economic counterweight to the hype: even where capability is real, the chain from "model can do X" to "industry is transformed / workers replaced / money is made" is long and full of friction. One essay anatomizes why generative AI has produced little commercial return (the "big five" barriers and a missing product-market-fit discipline); the other shows why software engineers — the most AI-exposed, most AI-adopting workers — are not being replaced, using a "decide-execute-deliver" model and unmasking "AI washing" of ordinary layoffs.

AI companies are pivoting from creating gods to building products. Good.

TIER 5 Aug 19, 2024

Diagnoses why generative AI has shown little commercial return despite trillion-dollar spend: companies confused proofs-of-concept with reliable products, with OpenAI/Anthropic neglecting products and Google/Microsoft cramming AI everywhere, both ignoring product-market fit. It then lays out the 'big five' barriers to useful consumer AI (cost, reliability, privacy, safety/security, user interface) and argues these are sociotechnical, so real impact unfolds over a decade not a year. A durable framework for why capability gains don't translate to products.

product-market fit · LLM reliability · AI products · cost vs accuracy · AI bubble

Why AI hasn’t replaced software engineers, and won’t

TIER 5 Jun 11, 2026

Argues that AI will not cause mass software-engineering layoffs even where capability and adoption are furthest along, debunking high-profile 'AI-driven layoff' stories as 'AI washing' for ordinary financial restructuring. Introduces the 'decide-execute-deliver sandwich' model: AI compresses the middle (execution) layer but the deciding and delivering layers resist automation because they require judgment, accountability, and deep understanding, so demand for engineers stays healthy.

AI and jobs · software engineering · AI washing · decide-execute-deliver · agentic engineering

AI as Normal Technology — The Framework & Its Domains

1 tier-5 · 2 tier-4

The spine of the whole publication. The 15,000-word manifesto recasts AI as a controllable tool that diffuses through society on decades-long timescales — not a separate superintelligent species — and the long causal chain from capability to impact becomes the source of human leverage. A companion FAQ clears up the predictable misreadings ("normal" ≠ mundane). A third piece stress-tests the framework against a specific domain (law), showing exactly why capability gains stall against regulatory, adversarial, and human-in-the-loop bottlenecks. Read the manifesto first; the others depend on it.

AI as Normal Technology

TIER 5 Apr 15, 2025

The landmark, 15,000-word manifesto laying out 'AI as normal technology' — simultaneously a description of current AI, a prediction, and a prescription that AI is a controllable tool, not a separate superintelligent species. Distinguishes invention/innovation/adoption (operating at different, decades-long timescales), proposes a human-AI division of labor where control stays with people, reframes accident/arms-race/misuse/misalignment risks, and advocates resilience and reducing uncertainty over drastic precautionary policy. The framework-defining essay the rest of the newsletter builds on.

AI as normal technology · AI policy · diffusion · existential risk · resilience

AI Won’t Automatically Make Legal Services Cheaper

TIER 4 Feb 12, 2026

Applies the 'AI as Normal Technology' framework to law, arguing advanced AI will not automatically make legal services cheaper because three bottlenecks block the path from capability to outcome: unauthorized-practice-of-law and entity regulations, the adversarial structure of litigation (relative-quality arms races that keep total cost high), and irreducible human involvement (judges and clients must still understand and adjudicate). A rigorous, domain-specific case study of why capability gains don't translate into cost savings.

AI in law · legal services · AI as normal technology · bottlenecks · regulation

A guide to understanding AI as normal technology

TIER 4 Sep 9, 2025

A companion/FAQ to 'AI as Normal Technology' that clears up common misreadings — chiefly that 'normal' does not mean mundane or predictable (powerful technologies produce unpredictable emergent social effects) — and restates the thesis that the long causal chain from capability to impact gives many points of human leverage. Contrasts the framework with AI 2027 and explains why there is little genuine 'middle ground' between the normal-technology and superintelligence worldviews.

AI as normal technology · AI 2027 · resilience · diffusion · AI discourse

Open-World & Reproducibility Evaluations

1 tier-5 · 1 tier-4

A methodological strand distinct from leaderboard reform: how to measure frontier AI in messy, real-world conditions rather than saturated benchmarks. One essay defines and formalizes "open-world evaluations" (a sample of one, human-in-the-loop, log-analyzed) and launches the CRUX cross-sector collaboration to run them; the other introduces CORE-Bench to test whether agents can automate computational reproducibility — and finds that a cheaply-adapted generalist agent does well, hinting that "generality" may be a red herring for economic impact.

Can AI automate computational reproducibility?

TIER 4 Sep 18, 2024

Introduces CORE-Bench, a benchmark measuring how well AI agents can automate computational reproducibility (re-running a paper's code and data to reproduce its findings), motivated by the failure of Sakana's 'AI Scientist' and the broader weakness of agent evaluation. The best agent scored only 22% on the hardest level; the key conceptual takeaway is that a generalist agent could be cheaply adapted into a strong task-specific one, which suggests 'generality' may be a red herring for economic impact.

CORE-Bench · reproducibility · AI agents · benchmarks · AI in science

Open-world evaluations for measuring frontier AI capabilities

TIER 5 Apr 16, 2026

Defines and formalizes 'open-world evaluations' — long, messy, real-world AI tests (a sample of one, human-in-the-loop, log-analyzed) that complement saturated benchmarks, surveying ten prior examples to extract best practices and pitfalls. Introduces CRUX, a 17-member cross-sector collaboration to run such evals regularly; its first experiment had an agent build and publish an iOS app to the App Store (two errors, one needing intervention), flagged as an early warning for app-store spam.

open-world evaluations · benchmarks · CRUX · AI agents · evaluation methodology

Existential Risk, Forecasting & the Case Against Precautionary Regulation

1 tier-5 · 1 tier-4

The authors' direct intervention in the "doomer" debate, and their sharpest methodological critique. The landmark essay shows that quantified extinction-risk probabilities carry no evidentiary weight — every route to a credible number fails — so they cannot justify policy. The companion replies to the argument that emergent AI risk warrants "extraordinary" precautionary intervention, contending that nonproliferation is brittle for AI (no enriched-uranium chokepoint) and that investing in societal resilience is cheaper and more durable than escalating control over what may be built or published.

AI existential risk probabilities are too unreliable to inform policy

TIER 5 Jul 26, 2024

A rigorous argument that quantified AI extinction-risk probabilities carry no evidentiary weight for policymaking because all three routes to a credible estimate fail: induction lacks a reference class, deduction lacks a theory (unlike asteroid impacts), and subjective forecasts are feelings dressed as numbers whose skill cannot be measured for unique tail-risk events. It shows scoring rules are insensitive to overestimating tail risks, exposes selection-bias inflation in forecaster communities, warns against Pascal's-wager utility maximization, and argues policy should instead forecast concrete AI milestones and economic impacts. A definitive, framework-defining contribution to the x-risk debate.

existential risk · forecasting · reference class · tail risk · AI policy
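The claim in the entry above that scoring rules are insensitive to overestimated tail risks can be illustrated with a small worked calculation. This sketch assumes the Brier score as the scoring rule and uses invented probabilities; the essay's own analysis may use a different rule or different numbers.

```python
def expected_brier(forecast: float, true_prob: float) -> float:
    """Expected Brier score (lower is better) for a binary event.

    With probability true_prob the event happens (outcome 1) and the
    squared error is (1 - forecast)^2; otherwise it is forecast^2.
    """
    return true_prob * (1 - forecast) ** 2 + (1 - true_prob) * forecast ** 2


# Assumed true tail probability of 1 in 10,000 (illustrative only).
true_tail = 1e-4
calibrated = expected_brier(1e-4, true_tail)
inflated = expected_brier(1e-2, true_tail)  # a 100x overestimate
tail_gap = inflated - calibrated            # equals (0.01 - 0.0001) ** 2

# For contrast: a small miss on a 50/50 event.
mid_gap = expected_brier(0.6, 0.5) - expected_brier(0.5, 0.5)  # equals 0.1 ** 2

print(f"penalty for 100x tail overestimate: {tail_gap:.6f}")
print(f"penalty for 0.6 vs 0.5 on a coin flip: {mid_gap:.6f}")
```

Under these assumptions, overstating a tail risk a hundredfold costs less than a hundredth of the penalty for being off by ten points on a coin flip, so a forecaster's track record barely registers the inflation.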

Do AI Risks Require Extraordinary Government Intervention?

TIER 4 May 21, 2026

Responds to Derek Thompson's argument that AI's emergent risks justify 'extraordinary' precautionary government intervention (restricting what companies can release), defining such intervention along three axes: precautionary, burdening non-culpable actors, and bypassing normal governance. Argues nonproliferation is brittle for AI because there is no physical chokepoint like enriched uranium, and that investing in societal resilience (cyber red-teaming, biosecurity screening, bug bounties) is a less costly, more durable defense than escalating control over what can be built or published.

AI policy · existential risk · nonproliferation · resilience · AI regulation

Predictive AI & Its Social Harms

1 tier-5 · 1 tier-4

The other half of the AI Snake Oil thesis: where the rest of the newsletter is cautiously open on generative AI, here the authors are pointedly skeptical of *predictive* AI — systems that make consequential decisions about people. One essay empirically demolishes the AI-election-misinformation panic (78 cases studied; the binding constraint is demand, not AI-enabled supply); the other dissects a national liver-allocation algorithm to ground the "Against Predictive Optimization" argument that predictive logic over people carries recurring, inherent flaws.

Does the UK’s liver transplant matching algorithm systematically exclude younger patients?

TIER 4 Nov 11, 2024

A case study of the UK's national liver-allocation algorithm (Transplant Benefit Score), which appeared to systematically disadvantage younger patients, used to ground the broader 'Against Predictive Optimization' argument that predictive logic over people has recurring inherent flaws. Examines whether the harm stems from the score's design, miscalculation, or the lack of physician override and appeals process, asking when predictive algorithms are legitimate at all versus when simpler formulas suffice.

predictive AI · algorithmic decision-making · healthcare · against predictive optimization · fairness

We Looked at 78 Election Deepfakes. Political Misinformation is not an AI Problem.

TIER 5 Dec 13, 2024

Analyzes all 78 instances of AI use in 2024 global elections from the WIRED AI Elections Project, finding that half were not deceptive, deceptive content was cheap to make without AI anyway, and the binding constraint is demand for misinformation, not supply. Concludes political misinformation is not fundamentally an AI problem and that fixes require structural/institutional change rather than curbing AI-generated content — empirically grounding their long-standing prediction against a misinformation apocalypse.

election deepfakes · misinformation · empirical study · supply vs demand · AI harms