Reasoning Models, RLVR, and the o1/o3 Era
8 tier-5 · 11 tier-4
The intellectual spine of the modern archive. Lambert charts the reasoning paradigm from before it had a name -- his Q* hypothesis and o1 reverse-engineering -- through the public arrival of RLVR (Reinforcement Learning with Verifiable Rewards) and the realization that o1/o3 are large-scale outcome-reward RL rather than test-time tree search. Across these pieces he argues that chain-of-thought is a natural fit for autoregressive models, that reasoning will generalize well beyond math and code, that over-optimization returns in new and weirder forms as RL scales, and -- in the landmark "How to scale RL" -- that RL now has its own scaling laws distinct from pretraining. Together they form the most coherent running account anywhere of how reasoning models are actually trained.
TIER 5
Nov 24, 2023
The widely-cited literature-grounded hypothesis on OpenAI's leaked Q*, arguing it links Q-learning and A* search via tree-of-thoughts reasoning over language steps scored by process reward models, bringing AlphaGo-style self-play and look-ahead planning to LLMs. A landmark interpretive post that shaped how the field understood the path toward reasoning models well before o1.
Q-star · reasoning · process-reward-models · tree-of-thoughts · search
TIER 4
Aug 8, 2024
A deep, full-transcript interview with Ross Taylor (Galactica lead, Llama post-training reasoning lead) on the nature of LLM reasoning, chain-of-thought vs. adaptive computation, process/outcome reward models and MCPE, why RLHF beat the SFT-only camp at Meta, and why frontier-lab advantage is brute-force execution over secret methods. It matters for the candid, expert insider perspective on reasoning research and post-training culture.
interview · LLM reasoning · Galactica · RLHF · reward models
TIER 4
Sep 5, 2024
A prescient explainer framing inference-time compute as a distinct scaling law, reading OpenAI's Strawberry/Q*/Orion rumors as self-talk reasoning plus a verifier, and surveying the early test-time-compute literature (best-of-N, Large Language Monkeys, proposer-verifier). It matters as an early, well-sourced articulation of the reasoning-model paradigm that would define the next era of LLMs.
inference scaling · reasoning · test-time compute · OpenAI Strawberry · RLHF
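The best-of-N and proposer-verifier setups named in the entry above share one loop: sample several candidates, score each with a verifier, keep the top one. A minimal sketch, assuming hypothetical `toy_generate` and `toy_verify` stand-ins in place of a real language model and a real verifier (neither comes from the post):

```python
import random
from typing import Callable, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    verify: Callable[[str, str], float],
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n candidates and return the one the verifier scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(prompt) for _ in range(n)]
    scores = [verify(prompt, c) for c in candidates]
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index], scores[best_index]


def toy_generate(prompt: str) -> str:
    # Hypothetical stand-in for sampling one completion from a model.
    return str(random.randint(0, 20))


def toy_verify(prompt: str, answer: str) -> float:
    # Hypothetical stand-in for a verifier: answers closer to 12 score higher.
    return -abs(int(answer) - 12)


if __name__ == "__main__":
    random.seed(0)
    answer, score = best_of_n("What is 7 + 5?", toy_generate, toy_verify, n=16)
    print(answer, score)
```

Raising `n` spends more inference-time compute for a better chance that some sample passes the verifier, which is why the quality of the verifier bounds what this loop can deliver.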
TIER 5
Sep 16, 2024
The definitive early reverse-engineering of OpenAI's o1, framing it as an RL-trained reasoning system that does online/test-time search rather than plain autoregression and thereby confirms inference scaling laws. Lays out the Q*-to-Strawberry-to-o1 lineage, the generator/verifier search structure, and what an open-source o1 would require—an influential reference piece for the reasoning-model paradigm.
o1 · reasoning · inference scaling laws · reinforcement learning · test-time search
TIER 4
Dec 4, 2024
Lambert reverses his earlier reading of OpenAI's o1, arguing the model needs no test-time tree search or process rewards: all the 'search' lives inside large-scale outcome-reward RL, with verifiable answers and LLM-judged continuations doing the supervision, and the test-time-compute scaling curve possibly an artifact of bucketing sampled generations by token count. It is an influential early mechanistic account of reasoning-model training that aligns o1 with the Bitter Lesson and Ai2's own RLVR 'wait, let me check' behaviors. Matters as a foundational framing of how the reasoning-model era actually works.
o1 / reasoning models · RLVR · test-time compute · process rewards · post-training
TIER 4
Dec 5, 2024
A deep technical interview with researcher Finbarr Timbers (ex-DeepMind, Midjourney) tracing the full decade of deep RL, from DQN and AlphaZero through the slowdown to RLHF and o1's revival of the field. The full transcript covers RL fundamentals, the bitter lesson, reward modeling, exploration, Tulu 3, and AI research management, making it a substantive discussion of the history and current state of RL.
reinforcement learning · deep RL history · AlphaZero · RLHF · podcast interview
TIER 4
Dec 11, 2024
A substantive analysis of OpenAI's Reinforcement Finetuning (RFT) API, framed through LeCun's cake metaphor, arguing it brings RL to the masses and signals that RL training stability is now solved enough to expose to the public. It explains RFT's grader configs, the contrast with SFT/LoRA finetuning, and the potential data flywheel it gives OpenAI for future reasoning models, noting Ai2's open RLVR work is closely analogous.
reinforcement finetuning · RFT API · RLVR · OpenAI · RL stability
TIER 5
Dec 20, 2024
The definitive analysis of OpenAI's o3 announcement, the major model event of late 2024, detailing its step-change results on ARC-AGI (87%), FrontierMath (from 2% to 25%), and SWE-Bench, and the price/compute axes behind them. Lambert argues o3 is most likely the same o1 RL methodology scaled up (no tree search) on a larger base, signaling that RL-driven reasoning is the next hill the industry climbs.
OpenAI o3 · ARC-AGI · reasoning models · inference scaling · model release
TIER 4
Jan 2, 2025
A NeurIPS talk (full transcript) addressing whether language models actually reason, arguing for a scoped, behavior-level definition of reasoning rather than an AGI debate, and disambiguating post-training, reasoning, and inference-time compute. It connects o1, RLVR, and the reinforcement-finetuning grader API into a grounded conceptual framing for reasoning research in 2025.
reasoning · inference-time compute · o1 · RLVR · podcast talk
TIER 5
Jan 21, 2025
The definitive early breakdown of DeepSeek R1's four-stage RL-heavy training recipe (cold-start SFT, large-scale RL, rejection sampling, final RLHF), arguing it is the first open seminal paper that locks in reasoning-model research and ends the o1 replication uncertainty. It walks through R1-Zero, RLVR reward design, GRPO, distillation, and the open questions, making it a lasting reference for how reasoning models are actually trained.
DeepSeek R1 · reasoning models · RLVR · GRPO · o1 replication
TIER 5
Jan 28, 2025
Makes the influential argument that chain-of-thought reasoning is a natural fit for autoregressive LLMs (handling recurrence in state-space rather than parameters), so RL-trained reasoning will generalize well beyond code and math — really teaching models to allocate more compute to harder problems. Marshals early evidence (deliberative-alignment safety generalization, R1 topping creative-writing/Humanity's-Last-Exam/calibration leaderboards) and frames the generator-verifier flywheel. A foundational conceptual piece that anchors much of Lambert's later reasoning coverage.
reasoning models · chain of thought · generalization · RL · verifiers
TIER 4
Feb 24, 2025
Analyzes Claude 3.7 Sonnet's release (and Claude Code preview) as a tidy SWE/tool-use SOTA improvement, using it to explain Anthropic's single-model approach to reasoning, developer-controllable thinking-token budgets, and visible reasoning traces. The more durable contribution is the explainer on where inference-time scaling goes next — parallel test-time compute, pass@N vs answer-extraction, and verifiers as the real performance limiter. A strong release-plus-explainer that doubles as a reference on inference-time scaling mechanics.
Claude 3.7 · inference-time scaling · reasoning · verifiers · parallel sampling
TIER 5
Apr 19, 2025
Introduces a three-era framework for RL over-optimization (control: brittle environments; RLHF: bad reward functions; RLVR: effective-but-weird models) to explain o3's new failure mode—tool-use RL produces strong agentic capability alongside fabricated actions and hallucinated tool calls. Argues o3's hacking comes from scaling RL with softer/LLM-judge verifiers and that legibility, not outcomes, is what degrades. Matters as an original, lasting conceptual frame for reasoning-with-tools models and the shift to 'reliable interaction with the external world' as the new frontier.
o3 · over-optimization · RLVR · tool use · reward hacking
TIER 5
Jun 4, 2025
Introduces a four-part framework — skills, calibration, strategy, abstraction — for understanding current and next-generation reasoning models, ordered as the path from single-pass problem solving to agentic planning. Explains how parallel inference-time compute differs from training-time RL, why models overthink, and why bootstrapping planning (a Q*-style data-curation effort) is the next race. A clean, original, reusable conceptual framework that recurs across his later writing and talks.
reasoning models · agents · planning · inference-time scaling · taxonomy
TIER 4
Jun 9, 2025
Maps three RL futures — continuing to scale RLVR (likely, yielding more frequent model releases), pushing RL to sparser long-horizon domains (Lambert is skeptical, citing the credit-assignment and off-policy bottlenecks and how Deep Research actually trains on sub-tasks not end results), and true continual learning (a low-probability algorithmic breakthrough). Notably argues continual learning where the model learns from you is borderline dystopian and that personalization is the safer framing. A clear-eyed, well-structured RL roadmap.
reinforcement learning · RLVR · continual learning · long-horizon RL · credit assignment
TIER 4
Jun 12, 2025
A philosophical essay rebutting the Apple 'Illusion of Thinking' paper, arguing that individual failures (e.g. Tower of Hanoi token limits) cannot prove the absence of reasoning, which should be defined by whether structures are used to solve real tasks. Uses the ornithopter/flight analogy — human reasoning is the inspiration, not the endpoint — and engages Ilya's understanding-vs-awareness framing to claim we've passed the Wright Brothers moment for artificial reasoners. A thoughtful conceptual contribution to the reasoning debate.
reasoning · Illusion of Thinking · philosophy of AI · chain-of-thought · AGI
TIER 4
Jun 18, 2025
A podcast/talk (full transcript) on why benchmark-topping models can flop in real use — the 'art of the model' in RLHF, the Goldilocks zone between evals, vibes, and price, and how over-optimization (sycophancy, unit-test gaming, GPT-4.5's 'not a frontier model') emerges from multi-objective hillclimbing. Ends with his skills/calibration/strategy/abstraction taxonomy and the shift of compute toward RL post-training. A substantive recap of his 2025 thinking with a strong reading list.
reasoning models · RLHF · over-optimization · sycophancy · model evaluation
TIER 4
Jun 23, 2025
A three-part outlook essay: (1) o3's relentless search behavior is an under-appreciated technical breakthrough no other lab has matched; (2) agent progress will be high-variance but rapid because post-training fixes small reliability failures fast; (3) parameter scaling for consumer models has fizzled, replaced by efficiency marches and inference-time scaling. A genuinely useful synthesis of where frontier capability and product trajectories are heading.
o3 · search agents · agents · model scaling · inference-time compute
TIER 5
Oct 20, 2025
A definitive technical explainer of the first major RL-scaling-laws paper (ScaleRL, Khatri & Madaan et al.), which fits sigmoid curves (peak A, slope B, compute C) to extrapolate RL learning curves and ablate design choices. Lambert clarifies how RL scaling differs from pretraining laws (extracting maximum performance from a base model, not configuring one big run) and surveys the now-essential ingredients: truncated importance sampling, GSPO, CISPO, and PipelineRL's in-flight updates plus continuous batching (4x+ throughput). He marks the open questions (data regime, base-model choice) and argues the public tooling must be rebuilt to close the gap to frontier RL stacks. A landmark reference on the state of RL scaling.
RL scaling laws · reinforcement learning · ScaleRL · importance sampling · RL infrastructure · post-training
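The entry above describes fitting a sigmoid with a peak and a slope against training compute. One saturating form consistent with that description is sketched below. The exact parameterization is an assumption made for illustration, not a transcription of the paper, and the names `midpoint` and `start` are hypothetical:

```python
def sigmoid_rl_curve(
    compute: float,
    asymptote: float,
    slope: float,
    midpoint: float,
    start: float = 0.0,
) -> float:
    """Saturating performance-vs-compute curve (illustrative form only).

    asymptote: performance ceiling approached at large compute (the "peak").
    slope:     how sharply performance rises around the midpoint.
    midpoint:  compute at which half of the total gain has been realized.
    start:     performance before any RL compute is spent.
    """
    if compute <= 0 or midpoint <= 0:
        raise ValueError("compute and midpoint must be positive")
    gain = asymptote - start
    return start + gain / (1.0 + (midpoint / compute) ** slope)


if __name__ == "__main__":
    for c in (1.0, 10.0, 100.0, 1000.0, 10000.0):
        print(c, round(sigmoid_rl_curve(c, 0.6, 2.0, 100.0, 0.2), 4))
```

Under this form, two recipes can be compared by their fitted ceilings rather than by where their curves happen to sit at a small compute budget, which is the sense in which such a fit supports extrapolation and ablation.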
Open-Model Releases: OLMo, Tulu, Llama, and the Truly-Open Standard
5 tier-5 · 13 tier-4
Definitive, often same-day teardowns of the releases that defined what "open" means. The Ai2 cluster (OLMo, OLMoE, OLMo 2, Olmo 3, Olmo Hybrid, Tulu 3, Molmo) is reported from the inside by the team that built it, including the first fully-open frontier post-training recipe and the introduction of RLVR to the broad ecosystem. The Meta cluster (Llama 2, Llama 3, Llama 3.1 405B) tracks Llama from "first open model that matches ChatGPT" to its strategic peak as the open frontier. Alongside sit the other open standard-bearers -- Mixtral, DBRX, Gemma 3, Phi/Arctic, Nvidia Nemotron, Arcee -- and interviews with the practitioners who trained them. The throughline is the slow construction of a "truly open" reference point (data + code + weights + logs) the rest of the field is measured against.
TIER 5
Jul 18, 2023
Lambert's same-day, early-access technical breakdown of the Llama 2 release — the first open model he's convinced matches ChatGPT (outside coding) — covering the base model (2T tokens, 4k context, grouped-query attention), the two-reward-model RLHF to dodge the safety-helpfulness tradeoff, two-stage rejection-sampling-then-PPO pipeline, Ghost Attention, ~$25M preference-data cost, license restrictions, and the not-truly-open-source verdict. As the definitive contemporaneous analysis of a landmark open-weights release, with insider technical depth, this is lasting reference material for the open-model era.
Llama 2 · open weights · RLHF pipeline · Meta · model release analysis
TIER 4
Dec 11, 2023
Analysis of Mistral's Mixtral 8x7B sparse-MoE release (52B total / 12B active, beating Llama-2-70B) as the best open model, paired with a primer on why labs use mixture-of-experts for performance and distributed-inference scaling. Reads the release velocity, Mistral Medium hints, and $400M raise as signs of Mistral's momentum, while cautioning that head-to-head chat vibes, not benchmarks, decide real strength.
mixtral · mixture-of-experts · open-models · mistral · model-release
TIER 5
Feb 1, 2024
Announces AI2's first OLMo models (7B and 1B) as the first state-of-the-art LLM in years that is fully open — weights, training code, eval code, and the Dolma pretraining dataset all released — and argues this enables scientific research (data attribution, contamination, optimizer-state fine-tuning) impossible on closed models. A landmark release in the open-model movement and a foundational reference for the 'truly open' standard the field would measure against.
OLMo · open models · AI2 · Dolma · pretraining
TIER 4
Feb 22, 2024
Analyzes Google's Gemma open-weight release (7B/2B, nonstandard architecture, Gemini tokenizer, pretraining annealing) and confirms RLHF details — Google using REINFORCE, a large reward model, and an InstructGPT-style KL penalty — alongside a primer on REINFORCE/PPO/TRPO history and the RAIL license terms. Matters as a substantive technical read on a major open release and on how RLHF is regressing toward simpler policy-gradient methods. Public preview of a paywalled post, but the visible technical content is rich.
Gemma · RLHF · REINFORCE · open models · licenses
TIER 4
Mar 28, 2024
Walks through Lambert's full model-familiarization process on Databricks' DBRX, which takes the absolute-best-open-model crown (though Mixtral stays more efficient on active params), and reads Databricks' open-LLM strategy via the Mosaic acquisition. Frames the broader efficiency trend that GPT-4-level performance is steadily reproduced and cheapened across orgs, trending toward effectively free.
dbrx · databricks · mixture-of-experts · model-release-analysis · cost-efficiency
TIER 5
Apr 18, 2024
Definitive day-one analysis of Meta's Llama 3 release covering pretraining and the 15T-token over-Chinchilla data strategy, the SFT/rejection-sampling/PPO/DPO post-training stack and ~$10M human preference data estimate, human evals, the licensing/ecosystem-takeover terms, and what 70B open weights mean for API providers. A reference-grade model-release teardown from a post-training insider.
llama-3 · open-models · pretraining · rlhf-post-training · model-release-analysis
TIER 4
Apr 30, 2024
Reads two outlier open releases as previews of where the field is heading: Snowflake's Arctic (a 480B/17B-active many-expert dense-MoE hybrid that trades consumer accessibility for enterprise inference efficiency) and Microsoft's Phi 3 (synthetic-textbook training that inflates MMLU). Argues outlier models break naive compute-vs-MMLU scaling and that the next big open opportunity is small, high-performance-per-parameter models.
mixture-of-experts · synthetic-data · small-models · phi-3 · scaling-laws
TIER 4
Jul 23, 2024
A definitive release analysis of Llama 3.1 405B as the first open model that fairly compares to closed frontier models (Claude 3, GPT-4o), arguing Meta is trying to absorb the entire open ecosystem under the Llama brand while sitting at the wrong layer of the stack (the real lock-in is Nvidia/CUDA and HuggingFace, not the model). It dissects the license's open-washing (synthetic-data permission plus naming/branding capture), Zuckerberg's commoditize-your-complements strategy, and the looming regulatory debate. Matters as a clear-eyed framing of why open-weight models, not 'open-source AI,' now have guaranteed multi-year relevance.
Llama 3.1 405B · Meta · open models · licensing · open-source AI
TIER 4
Aug 1, 2024
A full-transcript interview with educator and 'Build a Large Language Model from Scratch' author Sebastian Raschka covering keeping up with AI research, the post-Llama-3.1 open LLM ecosystem, architectures (MoE, early vs. late multimodal fusion), distillation, and implementation pitfalls. It matters as a substantive, accessible survey of open-model practice from a respected explainer of the field.
interview · open LLMs · model architectures · distillation · AI education
TIER 4
Sep 4, 2024
Pairs the OLMoE Mixture-of-Experts model release with an insider account of how frontier model orgs actually operate: compute allocation splits (~60% pretraining, 25% post-training, 10% data), de-risking via FLOP-efficiency-vs-risk tradeoffs, path-dependent capability unlocking, and progress as compounding small wins rather than secret tricks. It matters as a rare, concrete window into the organizational mechanics of building foundation models.
OLMoE · mixture of experts · frontier labs · compute allocation · post-training
TIER 4
Sep 25, 2024
Surveys the still-undefined multimodal LLM space through two parallel releases: Ai2's fully-open Molmo (Apache 2.0, built on Qwen/OLMo) and Meta's more-restricted Llama 3.2 Vision, both late-fusion models. Explains why late-fusion dominates, why web-element understanding is the key unsolved capability gating web agents, and argues the small 1B/3B Llama models matter most.
multimodal · Molmo · Llama 3.2 Vision · open models · late fusion
TIER 5
Nov 21, 2024
Launch of Tülu 3, the first fully-open frontier post-training recipe, surpassing Meta's Llama 3.1 8B/70B instruct versions via scaled on-policy preference data and the newly introduced Reinforcement Learning with Verifiable Rewards (RLVR), all released with datasets, code, and paper. Framed within a history of open post-training (Alpaca → DPO → the current closed-vs-open gap). A landmark open-recipe release that introduced RLVR to the broad ecosystem and set a reference point for what open post-training can achieve.
Tulu 3 · RLVR · open post-training · DPO / preference data · open models
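The idea behind a verifiable reward, as named in the entry above, is that the training signal comes from a programmatic check against a known answer rather than from a learned reward model. A minimal sketch, assuming a binary reward and a hypothetical last-number extraction rule (this is not the Tülu 3 implementation):

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the last number in the completion, or None if there is none."""
    matches = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return matches[-1] if matches else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer equals the reference, else 0.0."""
    answer = extract_final_answer(completion)
    if answer is not None and answer == reference:
        return 1.0
    return 0.0


if __name__ == "__main__":
    print(verifiable_reward("7 + 5 = 12, so the answer is 12", "12"))
    print(verifiable_reward("I think the answer is 13.", "12"))
```

An RL loop would score each sampled completion with a function like this and update the policy toward the completions that pass; string matching is the simplest case, and unit tests or format checkers fill the same role in other domains.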
TIER 4
Nov 26, 2024
A two-part post: a management essay on what makes LM-training teams effective (detail-oriented owners, context-holding managers, brutal prioritization, compounding small gains) and the OLMo 2 7B/13B release, fully-open models trained on 4-5T tokens that beat Llama 3.1 8B and Qwen 2.5 via the Tulu 3 recipe plus RLVR. Useful both as a model-release analysis and a rare candid look at training-org dynamics, including the lesson that RL finetuning needs multiple seeds. Matters as a milestone in fully-open model quality.
OLMo 2 · open models · team management · RLVR · post-training
TIER 4
Jan 22, 2025
A deep technical interview with the OLMo 2 pretraining and data leads (Dirk Groeneveld, Kyle Lo, Luca Soldaini) covering the project's early history, a failed 70B run, the long quest for training stability, and the many small decisions behind data work, muP, scaling laws, and MoE. The full transcript is a rare practitioner's-eye account of what it actually takes to build a frontier-competitive open model, valuable to anyone training LMs.
OLMo 2 · pretraining · training stability · open models · podcast interview
TIER 4
Mar 13, 2025
Announces OLMo 2 32B (a fully open-source, data-and-code GPT-4-class model from Ai2) alongside analysis of Google's Gemma 3, using both to argue the open-closed gap has narrowed to roughly 18 months and that the era has shifted from 'what is open' to 'why open.' Distinguishes true knowledge distillation (matching the teacher's distribution) from colloquial output-distillation, and reads ChatBotArena/SimpleQA/GPQA results to show small open models now clear meaningful capability thresholds. A strong state-of-open-models essay paired with a notable release.
OLMo 2 · Gemma 3 · open-source AI · distillation · open-closed gap
TIER 5
Nov 20, 2025
Lambert's launch analysis of Ai2's Olmo 3 (7B/32B), the first 32B+ fully open reasoning model with full data, code, checkpoints, and logs, including the best open 32B base model and a detailed post-training recipe (SFT → DPO → scaled RLVR) plus the 'Model Flow' (Instruct/Think/RL Zero) framing. It discloses real recipe specifics — the delta-learning DPO trick (Qwen3-32B chosen vs 0.6B rejected), 'active refilling' in async RL, and RL Zero datasets enabling study of RLVR contamination on Qwen. A definitive, authoritative model-release analysis from the team that built it, with lasting research value.
Olmo 3 · open models · post-training · RLVR · DPO · reasoning models
TIER 4
Jan 27, 2026
A founder interview (full transcript) with Arcee AI's Mark McQuade and Lucas Atkins on pivoting from post-training open models for customer domains to pretraining larger US-built open models (Trinity-Large, a 400B ultra-sparse MoE), described as the most genuinely revenue-oriented approach to monetizing open models Lambert has found. The deep discussion covers their business model, pretraining at scale on a startup budget, and the case for American open models. Substantive insider view of an emerging US open-model company's strategy and technical choices.
Arcee AI · open models · pretraining · American AI · open-model business models
TIER 4
Feb 4, 2026
A podcast (full transcript) with NVIDIA's Bryan Catanzaro on the strategic logic of NVIDIA releasing open models (Nemotron): fundamentally to grow the ecosystem of people building on open weights and to understand what they need from hardware next, the one clear commercial reason to be open. The deep technical and strategic discussion covers NVIDIA's compression/pruning research direction, open data and tech reports, and how open models serve NVIDIA's GPU business. Substantive insider view of a key player's open-model rationale.
NVIDIA · open models · Nemotron · Bryan Catanzaro · AI hardware
RLHF, Reward Models, and Post-Training Methods
3 tier-5 · 27 tier-4
The author's home turf: how preference fine-tuning actually shapes a model, written by a researcher who built open RLHF recipes. These pieces dissect the mechanics -- why RLHF is "style transfer plus gentle bug-squashing," the DPO-vs-PPO debate, the chattiness/length paradox, sycophancy and over-optimization, reward models as first-class auditing instruments, and the elicitation view that post-training mostly amplifies latent base-model behavior. They culminate in the canonical "recipe for frontier model post-training" and the synthetic-data frameworks that replaced human labels, plus the newer subfield of character/personality training. This is the reference layer for anyone going deep on post-training.
TIER 4
Mar 27, 2023
Traces RL's progression from cost functions to rewards to preferences, showing how each step loosens the problem spec and injects uncertainty, and argues preference models (relative, deterministic-by-assumption, but distorted by mere-exposure/conditioning/hedonic-adaptation effects) have no guarantee of matching real preferences and are under-studied. A conceptually rich, original framing from Lambert's RL/robotics background that anchors much of his later reward-modeling work.
preference modeling · reward models · reinforcement learning · RLHF · objective mismatch
TIER 4
Apr 26, 2023
Explains RLAIF (from Anthropic's Constitutional AI paper) and proposes a broader, more useful framing, RLCF (reinforcement learning from computational feedback): any computed signal (unit tests, NER privacy checks, sentiment) can be optimized via RL, which frees the method from human-data bottlenecks and from language alone. A genuinely original reframing and an early call on a method that later became central; its 'RLCF' coinage prefigures RLVR.
RLAIF · RLCF · Constitutional AI · computational feedback · reward design
TIER 4
May 3, 2023
Argues the term 'hallucination' is overloaded and harmful (anthropomorphizing deterministic token sampling), then disaggregates it across domains—chat, code, control/decision-making, healthcare, enterprise—proposing axes (consumer vs business, downside risk, accumulated degradation, what counts as out-of-distribution) for when hallucination is value-generative versus dangerous. A thoughtful conceptual framework piece that reframes a heavily-hyped term.
hallucination · stochastic parrots · out-of-distribution · RLHF · anthropomorphization
TIER 5
Jun 21, 2023
A landmark early explainer dissecting why RLHF works and why it is so hard to set up, walking the full pipeline (capable base model → preference data → RL) and the data/optimization conditions—pairwise preference signal, distribution matching, scaling to 50B+—that make it succeed. Frames RLHF as 'style transfer plus gentle bug-squashing' (a topic filter, not a capability or fact adder) and previews DPO; one of the most-cited reference posts in the archive and foundational to Lambert's later RLHF book.
RLHF · preference data · reward models · Constitutional AI · DPO
TIER 4
Jul 21, 2023
A technical follow-up to the Llama 2 release: Lambert diagnoses the model's over-cautious 'evasiveness through harmlessness' (refusing up to 27% of borderline asks) as RLHF-hammer over-optimization against imperfect reward models, then provides practical GPU/inference and fine-tuning sizing guidance plus deep notes on Ghost Attention (GAtt) for multi-turn consistency and Meta's rejection-sampling/PPO RLHF details. He flags the helpful-vs-harmless tradeoff as a fundamental open-source problem. A genuinely useful post-training explainer with corrections to his prior cost estimates.
Llama 2 · RLHF over-optimization · Ghost Attention · GPU sizing · safety-helpfulness tradeoff
TIER 4
Aug 2, 2023
Synthesizing John Schulman's ICML talk, the DPO paper, and the 'Open Problems in RLHF' paper, Lambert explains RLHF's core flaw — optimizing a proxy reward model that's only correlated with chatbot quality, leading to Goodhart-style over-optimization (true objective improves then degrades) measured via KL distance. He covers PPO vs. best-of-N efficiency, why bigger reward models help, his reservations about DPO, and implicit-feedback/steering-loss directions. A high-signal technical explainer of RLHF objective mechanics from a post-training specialist.
RLHF · proxy objectives · over-optimization · DPO · reward models
TIER 4
Oct 18, 2023
Synthesizes safety papers showing that parameter-level safety training is brittle — as few as 10-100 fine-tuning examples (even benign or 'identity-shifting' data) strip Llama-2-chat or GPT-3.5 filters — meaning safety in accessible weights protects liability, not against misuse. Argues true robustness would require embedding safety in pretraining, with implications for what open releases can responsibly claim.
RLHF · AI-safety · jailbreaking · fine-tuning · open-models
TIER 4
Oct 25, 2023
First in a literature-review series, mapping how RLHF research trends align with industry practice and arguing the RL part rests on shaky foundations — notably the absence of real exploration (each completion is one action, no multi-step credit assignment) and its low compute/data footprint. Sets up two futures: RLHF folding into pretraining, or a new RL paradigm enabling continual learning from implicit user feedback.
RLHF · reinforcement-learning · exploration · literature-review · continual-learning
TIER 4
Nov 22, 2023
Reports AI2's Tulu 2 as the first 70B DPO run to converge with strong results (roughly original-ChatGPT level), crediting a surprisingly low 5e-7 learning rate and the UltraFeedback dataset over optimizer tweaks. A grounded practitioner's progress update on open RLHF that doubles as a case study in DPO scaling, overfitting concerns, and the limits of MT-Bench/AlpacaEval vibe evals.
RLHF · DPO · tulu · zephyr · evaluation
TIER 5
Nov 29, 2023
Comprehensive survey establishing synthetic data as the central resource of the post-ChatGPT era — used by frontier labs (Anthropic CAI, OpenAI Superalignment) to remove humans from alignment and by open builders to cheaply chase SOTA. Covers the scaling-data debate, robustness gains, distillation dynamics, and open examples, serving as a durable reference on synthetic-data methods and their divergent open vs. closed goals.
synthetic-data · constitutional-AI · alignment · distillation · pretraining
TIER 4
Dec 6, 2023
Frames the central RLHF question of the moment (do we need true RL, with value functions and policy gradients, or does Direct Preference Optimization suffice?), arguing DPO's breakthroughs (Tulu, Zephyr) owed more to data, hyperparameters, and implementation simplicity than to the loss function being magic. Concludes the real bottleneck for open RLHF is data, tooling, and evaluation, not optimizer choice, and cites Starling's APA result as counter-evidence to a DPO-only view, since it shows that strong PPO/online models exist.
RLHF · DPO · PPO · preference-optimization · post-training
TIER 4
Jan 12, 2024
A curated, categorized index of RLHF learning resources — talks, podcasts, code repos (alignment-handbook, TRL), key models/datasets (Zephyr, UltraFeedback), evaluations, and foundational papers (InstructGPT, DPO, Llama 2) — assembled by a leading practitioner. Useful as a durable entry-point reference for anyone going deep on RLHF, with the author's light annotations on why each item matters.
RLHF · learning resources · DPO · reward models · reference
TIER 4
Feb 14, 2024
Argues that even as DPO and LLM-as-judge displace explicit RL, scalar reward models remain a uniquely valuable, underused tool — a clean classifier interface for auditing model representations, biases, and preference data without prompting or per-token compute limits, illustrated with a per-token reward-tracing experiment. Matters as an original RLHF research thesis (a precursor to RewardBench-style work) reframing reward models from disposable training artifacts to first-class analysis instruments.
reward models · RLHF · DPO · alignment · interpretability
TIER 4
Mar 4, 2024
A deeply technical full-transcript interview with Louis Castricato (Synth Labs, EleutherAI, ex-CarperAI) covering RLHF, DPO, preference data, reward models, multimodal/long-context RLHF, OpenAI's influence on the field, and the Gemini bias episode. Matters as a rare candid deep dive from a leading open-RLHF practitioner across nearly every active fine-tuning question.
RLHF · DPO · reward models · preference data · interview
TIER 4
Apr 17, 2024
Argues RLHF researchers keep reinventing solutions that established social-science fields already solved, and that reward models are the underused transparency/auditing lever. Introduces social choice theory (social welfare functions, cardinal reward modeling, personalization) and pluralistic alignment as concrete ways to redesign preference data and reward models, with an OLMo 1.7-7B update appended.
rlhf · reward-models · social-choice-theory · pluralistic-alignment · olmo
TIER 4
May 1, 2024
A grounded explainer on what preference fine-tuning (DPO/PPO/KTO etc.) actually does to models: the 'chattiness paradox' where DPO boosts gameable evals (AlpacaEval) without moving ChatBotArena, the parameter-level mechanism by which GPT-4-styled chosen responses get upweighted into longer outputs, why KL constraints don't fully prevent it, and a defense of style transfer as real value. Closes with a credible research agenda (reasoning/code RLHF, online methods, KTO) that aged well.
RLHFDPO vs PPOpreference fine-tuningchattiness/length biasKTO
TIER 4
May 10, 2024
A close reading of OpenAI's first Model Spec as a major RLHF-transparency artifact: it documents intended behaviors (objectives, defaults, rules), the 'chain of command' platform/developer/user prompt hierarchy, and quietly reveals OpenAI's verbosity penalty (Schulman's 'laziness from fear of running out of tokens') and the answer-quality ordering. Lambert argues it exposes how hard balancing many RLHF goals is and frames the unsolved 'Aggregator's AI risk' of one-answer monoliths, pointing toward personalization as the open market.
Model SpecRLHFtransparencychain of commandpersonalization
TIER 4
Jun 21, 2024
An eight-point framework on how frontier labs actually use synthetic data, organized into transferable insights: direct distillation is still king, Gemini Flash is confirmed distilled from Pro (and Claude Haiku likely too), filtering plus data accumulation prevents model collapse, license clauses against training-on-outputs often originate from data vendors, multi-source preference datasets (UltraFeedback/Nectar) have a lower ceiling than on-policy generation, structured/verifiable synthetic data drives IFEval-style gains, weak-to-strong generalization is increasingly real, and synthetic prompt generation is underrated. Matters as a high-signal, durable mental model of synthetic-data practice that reads industry tea leaves better than the academic literature.
synthetic datadistillationmodel collapseweak-to-strongpost-training
TIER 4
Jun 26, 2024
A substantive technical roundup reporting the team's effort to make PPO work 'as well as industry says,' finding that open RLHF tooling is fickle, that bigger reward models and extra reasoning prompts didn't reliably help, and that variation between datasets exceeds variation between DPO/PPO algorithm variants — so dataset work is where open labs should differentiate. It pairs this with a RewardBench retrospective (90% peak accuracy, why it got adopted, reward-model-as-judge beating LLM-as-judge), an Epoch AI estimate that post-training gets ~50% of frontier-lab compute, and an LMSYS/Kaggle preference-prediction competition. Matters as a candid practitioner's state-of-play for open post-training.
RLHFPPO vs DPORewardBenchreward modelspost-training
TIER 5
Aug 7, 2024
Synthesizes the Llama 3.1, Nemotron 340B, and Apple foundation-model reports into a definitive 'new standard pipeline' for frontier post-training: scaled iterative RLHF over instruction tuning, synthetic data surpassing humans, multi-round training, and data filtering as the king variable. It is a landmark reference that crystallized how the field's top labs converged on post-training, widely cited as the canonical articulation of the modern RLHF recipe.
post-trainingRLHFsynthetic dataLlama 3.1data filtering
TIER 4
Jan 8, 2025
A useful state-of-the-field overview accompanying a tutorial talk, proposing the clean three-bucket taxonomy of modern finetuning (instruction tuning, preference tuning, and the new reinforcement finetuning) and arguing post-training has become more impactful, more expensive, less human-data-reliant, and the gateway to reasoning models. A solid explainer for where open post-training stood entering 2025.
post-trainingRLHFreinforcement finetuningopen recipestutorial
TIER 4
Feb 26, 2025
Opens the conversation, largely absent outside frontier labs, on 'character training' — the post-training subset that shapes a model's manner rather than its content — prompted by a sweeping, undocumented GPT-4o personality update and Anthropic's Claude character/constitution work. Argues model specs and behavior-change evaluations should become a community norm (a race to the top), and that character training marks RLHF's shift from an alignment philosophy to an empirical performance tool. A genuinely original framing of an under-documented technique.
character trainingRLHFmodel specpersonalitypost-training
TIER 4
Mar 10, 2025
Proposes the 'elicitation interpretation' of post-training — that post-training mostly extracts and amplifies latent behaviors already present in the base model rather than teaching new skills — using an F1-season analogy and contrasting it with the Superficial Alignment Hypothesis (which it argues gets the intuition right for the wrong reasons). Frames why RL on a few thousand prompts beats SFT on millions of math samples, and why stronger base models are better RL starting points. A clean, reusable conceptual frame for thinking about where post-training gains come from.
post-trainingelicitationRLVRsuperficial alignmentbase models
TIER 4
Mar 31, 2025
A technical paper walkthrough of the reasoning-RL literature—Kimi k1.5, Open-Reasoner-Zero, DAPO, and Dr. GRPO—deflating GRPO hype by showing it is closely related to PPO/RLOO and not a uniquely special algorithm, while explaining the genuinely useful tweaks and base-model RL findings. Valuable as a grounded explainer that corrects the 'GRPO ushered in a new RL era' narrative and curates the key reproductions. A solid technical reference for practitioners following reasoning training.
GRPOreasoning RLDAPObase model RLpaper review
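The claim that GRPO is closely related to RLOO can be illustrated with a small sketch. It assumes the commonly described forms of the two advantage estimators: group mean and standard-deviation normalization for GRPO, and a leave-one-out mean baseline for RLOO. The function names, the use of population standard deviation, and the toy rewards are illustrative assumptions, not taken from the post or from any particular library.

```python
from statistics import mean, pstdev

def grpo_advantages(rewards, eps=1e-6):
    """Group-normalized advantages for one prompt: (r - mean) / (std + eps).

    Assumption: population std; real implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def rloo_advantages(rewards):
    """Leave-one-out advantages: each sample minus the mean of the other samples."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]

# Hypothetical verifiable rewards (1 = correct, 0 = incorrect) for 8 samples of one prompt.
rewards = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
grpo = grpo_advantages(rewards)
rloo = rloo_advantages(rewards)

for r, g, l in zip(rewards, grpo, rloo):
    print(f"reward={r:.0f}  grpo={g:+.3f}  rloo={l:+.3f}")
```

Under these assumptions both estimators are rescalings of the same mean-centered reward, r - mean(rewards): RLOO multiplies it by k/(k-1), while GRPO divides it by the group's standard deviation. The two therefore always agree in sign and differ only by a per-prompt scale factor.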
TIER 4
Apr 5, 2025
A backlog of RL notes covering OpenAI's RFT/RLVR showing up across Operator, Deep Research, and Copilot (RLEF); a careful reading of DeepSeek's claims that RL-after-distillation is crucial and that small models gain less from large-scale RL; a well-argued case that DeepSeek did not distill o1 CoTs; and why latent/compressed reasoning is an interesting frontier. Closes with Sutton's full Verification essay. Matters as a substantive technical synthesis of how leading labs actually use RL and of distillation dynamics.
reinforcement learningRLVRdistillationDeepSeeklatent reasoning
TIER 4
May 4, 2025
[Paywalled preview] Uses the GPT-4o sycophancy episode and rollback to argue RLHF/preference tuning remains a central, unsolved problem even in the reasoning era, dissecting OpenAI's post-mortem: a new thumbs-up/down reward signal weakened the primary reward and let sycophancy get baked in via over-optimization. Frames it as a structural limit of optimizing one default model for all users and a case for Model Specs over system prompts. A strong explainer on RLHF over-optimization and release-evaluation blind spots; visible preview already carries the core argument.
sycophancyRLHFGPT-4oover-optimizationModel Spec
TIER 4
May 27, 2025
Covers a UW paper Lambert contributed to showing that RLVR on Qwen 2.5 Math models improves MATH-500 by 15-25 points even with random, incorrect, or format-only 'spurious' rewards — but not on Llama or OLMo. Explains the mechanism: RLVR is largely surfacing a pre-existing code-reasoning strategy Qwen learned in pretraining (likely from synthetic math SFT data with perturbed-number variants), and random rewards work via GRPO's per-prompt advantage structure. An important cautionary finding about doing RL science on a contaminated base model.
RLVRQwen 2.5spurious rewardsGRPOdata contamination
TIER 4
Nov 10, 2025
A research walkthrough (paywalled preview) of a character-training paper from Lambert's group, arguing character/personality training will be a permanent, philosophically rich subfield of RLHF as models grow more persuasive. It details two post-training stages (DPO distillation then SFT) driven by per-personality constitutions ('I am…' rather than Anthropic's 'choose the response that…'), finds personality is easy to imprint but hard to align to intent, and notes Qwen's internal personality is unusually rigid versus Llama/Gemma. A useful technical explainer with concrete methodology.
character trainingRLHFpersonalityconstitutionspost-training
TIER 4
Nov 16, 2025
An original argument that LLMs write poorly not by accident but as a structural consequence of how they are post-trained: style is one weak signal among many, aggregate preferences suppress quirks, per-instance data labeling and length/sycophancy biases penalize richness, and forced neutrality kills voice. Lambert contends base models can write (and Sydney/Bing showed voice) but there are no market incentives to do the full post-training refresh needed to unlock it. A thoughtful, insider explainer connecting craft to training mechanics.
AI writingpost-trainingvoiceRLHF biasesmodel personality
TIER 4
Jun 16, 2026
A deep podcast interview with Finbarr Timbers (full transcript plus an accompanying slide deck) tracing the history of post-training recipes from InstructGPT to today and dissecting 2026's frontier open recipes (DeepSeek V4, GLM 5, Kimi K2.6, MiMo). It maps what an Olmo-style open recipe would need to reach the frontier, making it a substantive technical reference on the post-training stack.
post-trainingRLHFpodcastopen modelsrecipes
The Open vs. Closed Thesis: Ecosystem, Economics, and the ATOM / American DeepSeek Project
3 tier-5 · 22 tier-4
Lambert's signature argument, built up over three years: open models matter as a counterweight to concentrated power and as the engine of AI research, but "train a good open model and release it with no product" is not a viable business. He works through open-model economics (specialization or bust), the four motivational camps of openness, why open and closed ride different exponentials, and ultimately the American DeepSeek Project and the ATOM (American Truly Open Models) manifesto -- the case that the US is ceding open-model leadership to China and should fund at least one frontier-scale truly-open lab. The later pieces propose an open-model consortium as the only stable funding structure once training costs reach the billions.
TIER 4
May 17, 2023
A point-by-point rebuttal to Google's leaked 'We Have No Moat' memo, arguing the durable moats are diverse high-quality fine-tuning/RLHF prompt data and consumer habit/brand, not the model weights themselves—open prompt datasets barely crack 100k usable samples while OpenAI/Google see ~100k queries a day. A well-argued contrarian take with lasting relevance to the open-vs-closed debate.
moatsopen vs closedprompt dataRLHFcompetitive dynamics
TIER 4
Jun 7, 2023
Identifies the growing gap between helpfulness (where open models race ahead on leaderboards) and harmlessness (where the community lags), explaining red-teaming economics, the 'uncensored'/filtered confusion, alignment taxes, and Anthropic's helpful-then-harmless CAI pipeline as the path open source has barely begun. A useful, well-structured explainer of where the open ecosystem stood on safety and why incentives stall it.
open modelsharmlessnessred-teamingConstitutional AIalignment tax
TIER 4
Jun 14, 2023
Argues that not every future LLM needs to be a ChatGPT clone: open source will win by producing smaller, better models for narrow use-cases, and lays out a stakeholder taxonomy (vertical big tech, horizontal big tech, open source, academia) plus the base-model-as-reset-point dynamic that lets open ecosystems leap. A substantive structural read on the open-model ecosystem whose framing recurs throughout Lambert's later work.
open modelsecosystembase modelsacademiaChatGPT
TIER 4
Jul 26, 2023
Lambert argues ML is straining the OSS community's definitions and needs a new taxonomy, using Llama 2 — which is downloadable but not open-source (700M-user clause, no-train-other-LLMs clause, no dataset) — as the focal case. He traces open-source history (GNU, OSI, the SSPL/MongoDB split), separates the replication vs. safety values of openness, and reflects honestly on 'open-source as vibes' rhetoric and the OSI's call to define open-source AI. A substantive, well-sourced framing of the open-source-AI definition debate that anticipated later OSI work.
open source definitionLlama 2 licenseOSImodel release taxonomydata transparency
TIER 4
Oct 4, 2023
Prompted by Mistral's 7B torrent release, Lambert argues that 'train a good open LLM and release it with no product' is not a viable business strategy: smaller open labs can't out-spend OpenAI/Google/Meta, current open models are PR/recruiting tools rather than monetizable artifacts, and without data-sharing or specialization they're on an acquisition-or-bust path. He frames two futures — collaborative openness vs. status-quo consolidation — and urges specialization plus radical data transparency as the only durable moat. A prescient strategic argument about open-model economics that held up well over subsequent years.
open model economicsMistralmoatsdata transparencyscaling costs
TIER 4
Nov 1, 2023
A practical 3-prerequisites / 3-actions / 3-benefits playbook for companies releasing open-weight LLMs, centered on the thesis that open labs must own a niche and build products rather than compete head-on with OpenAI's compute platforms. A useful strategic framework for the business of open models, grounded in the author's HuggingFace/Zephyr experience.
open-modelsbusiness-strategyopen-sourcestartupsniche
TIER 4
Jan 5, 2024
Argues that even when open models (post-Gemini/Mixtral) match GPT-4's benchmark scores in 2024, they will still trail the real product because open models are just weights while ChatGPT is an extensive ML system (safety filters, serving, prompting) — and because open RLHF/DPO tooling remains a starting point, not a solution. The visible preview makes a clear models-vs-products distinction and calls for compounding investment in data and evaluation rather than vibes-based eval. (Paywalled; public preview only.)
open vs closed modelsGPT-4RLHFevaluationDPO
TIER 4
Mar 6, 2024
Proposes a clearer taxonomy for open models — 'Openly Trained Models' (OLMo/Pythia: data+code+weights) vs 'Permissible Usage Models' (Llama/Mistral/Gemma: weights only) vs Closed — and walks the OSI definition history, licenses/copyright, bio-risk debunking, and Mistral/EU politics. Matters as a foundational framing for AI-openness policy and the recurring confusion between 'open-source' and 'open-weight' that misleads policymakers.
open-source definitionmodel taxonomyAI policylicensescopyright
TIER 4
Apr 3, 2024
Maps the open-LLM movement into four motivational camps (accelerationists/capitalists, transparency-driven scientists, inclusion/anti-concentration advocates, and freedom advocates) and argues disagreement on definitions is healthy and expected, not a failure. Introduces a disclosure/accessibility/availability framework for reading any 'open' release through the PR speak.
open-source-aiopenness-taxonomyai-policyecosystemtransparency
TIER 4
Apr 15, 2024
With Command R+, Mixtral 8x22B, DBRX and Grok arriving, argues there is no longer a single 'best open LLM' — the space has fragmented by use case. Builds a compute-vs-MMLU model showing most open-model gains over 18 months came simply from throwing more compute at the problem, and that accessibility now diverges sharply by parameter footprint.
open-modelscompute-efficiencymmluscalingmixture-of-experts
TIER 4
Aug 28, 2024
Dissects the OSI's draft v0.0.9 open-source-AI definition, explaining why the data clause is a deliberate compromise (provenance and copyright/GDPR jeopardy make full data release infeasible) and why a stable definition matters for regulatory carve-outs amid a shrinking data commons. It matters as a clear, authoritative treatment of how open-source AI is being formally defined and why it diverges from open-source software.
open source AIOSI definitiondata commonscopyrightAI policy
TIER 4
Oct 30, 2024
A personal manifesto, on his 30th birthday and first Ai2 anniversary, for why truly open (not Meta-dependent) language models matter: reducing concentration of power, security, research access, regulatory insight, and a more diverse AI economy. Describes his 'white rice research' building OLMo as open research infrastructure and a call for more people to fight for open AI before regulatory or market capture forecloses it. A clear statement of the open-source thesis that animates much of his other writing.
open sourceOLMoAI policyconcentration of powermanifesto
TIER 4
Feb 5, 2025
Argues the DeepSeek moment resets the open-vs-closed narrative and that open-source AI is driven by ideology more than economics (citing parallel 'national standard' framings from Zuckerberg and DeepSeek's Liang Wenfeng). Makes the policy case that restricting open model weights is a losing strategy — export controls should target compute, not weights — and that the US should fund open research and Western alternatives so the global open default is American. A timely, well-reasoned open-AI policy argument with concrete proposals.
open-source AIAI policyDeepSeekexport controlsUS-China
TIER 5
Jul 4, 2025
Lambert's manifesto and multi-year mission statement: build a fully open-source (data, code, logs, weights) model at frontier scale within two years to keep the AI research default on Western technology and prevent a future split between closed American models and ubiquitous, hard-to-trust open Chinese ones. Lays out the structural advantages China holds, the $100M-500M cost estimate, the agentic-era window of opportunity, and why open models are a quintessentially American counterweight to concentrated corporate power. The defining argument that anchors much of his subsequent writing.
open source AIAmerican DeepSeekUS-ChinaAI governanceAi2
TIER 5
Aug 4, 2025
The full manifesto for the ATOM (American Truly Open Models) Project, arguing the US is losing open-model leadership to China and recommending at least one US lab with 10,000+ GPUs dedicated to training frontier open models, backed by detailed download/finetune-adoption data and the case for open models as the engine of AI research. This is the influential, foundational policy argument that anchors much of his subsequent writing.
ATOM Projectopen models policyUS vs ChinaAI researchmanifesto
TIER 4
Nov 23, 2025
This roundup (paywalled preview) leads with a genuinely useful inventory of every serious US open-model lab (Ai2, Arcee, Google, IBM, Nvidia, OpenAI, Stanford Marin, etc.) with each org's representative model and license posture, plus an articulation of the four-step Chinese model-release playbook (social presence, day-zero ecosystem support with free API, Claude-Code-compatible coding subs, in-house tooling). It also flags the recurring problem of third-party providers mis-serving open models (Kimi K2 on Vending-Bench). The lab map and playbook give it lasting reference value.
open modelsUS labsChina release playbooklicensesmodel serving
TIER 4
Dec 14, 2025
A year-end synthesis (paywalled preview) of the open-model year: DeepSeek R1, Qwen 3, and Kimi K2 named the top three for outsized ecosystem impact, with runners-up (MiniMax M2, GLM-4.5, GPT-OSS, Gemma 3, Olmo 3) and niche winners (Parakeet 3, Moondream 3, Granite 4, SmolLM3). It culminates in a full tier list of US/China/world model makers and 2026 predictions, making it a useful reference map of the open ecosystem. Strong curation though catalog-shaped.
open modelsyear in reviewtier listDeepSeekQwen
TIER 4
Jan 7, 2026
A data-driven walkthrough of eight ATOM Project charts showing China's accelerating dominance of open-model adoption metrics: Qwen alone out-downloads roughly the rest of the ecosystem, Llama remains the most-downloaded Western model despite being abandoned, new entrants (Z.ai, MiniMax, Kimi, Nvidia) barely register, and Chinese models remain the smartest open models. It matters as a quantitative, sourced baseline for the open-model adoption debate, with caveats on the noisiness of HuggingFace downloads. A genuinely useful empirical explainer.
open modelsChinaQwenadoption metricsATOM Project
TIER 4
Feb 17, 2026
Lambert argues the recurring 'open models are closer than ever' narrative is overblown: the ~6-month gap is holding steady, and aggregate indices like Artificial Analysis compress too much and likely understate the true frontier, which has never been harder to capture in public benchmarks. He then walks seven trends including the brutally concentrated open-model market, the missing specialized small-model segment, sovereign AI as the entry point for nations, and China's idea-sharing ecosystem being the most likely place a 'who wins' breakthrough emerges. Substantive multi-trend synthesis even as a paywalled preview.
open modelsbenchmarkssovereign AIChinese AI ecosystemmodel adoption
TIER 5
Mar 16, 2026
Lambert's most structural open-models essay, arguing the open-closed gap is more likely to grow than shrink and laying out three future model classes: true closed frontier models, open frontier models, and small/cheap/specialized open models as 'distributed intelligence' that complement closed agents. Drawing on Google's 'Meaning of Open' and Gurley's Android-as-moat framing, he contends the under-served opportunity is boring, specific small models that orchestrating agents are desperate to call as tools, and that open AI must become an ecosystem rather than a pack chasing the frontier. Lasting reference value for framing open-model strategy.
open modelsAI ecosystemssmall modelsbusiness strategyAI systems
TIER 4
Apr 11, 2026
Argues that as frontier training costs reach billions and capitalism pushes labs to keep their best models closed, a multi-company consortium funding shared near-frontier open models becomes the only stable path, with Nvidia's Nemotron Coalition as an early single-company bootstrap. Predicts that Chinese open-weight startups (Moonshot, MiniMax, Z.ai) will hit funding stress first, and that demand for guaranteed open intelligence will eventually force the consortium model.
open model consortiumAI economicsNemotronfundingfrontier costs
TIER 4
Apr 15, 2026
A consolidated 13-point thesis on the open-model ecosystem distilled from a spring of writing: closed models stay ahead on robustness, the race is mostly economic staying power (with Chinese labs facing funding stress first), open models win repetitive automation share, bans are impractical, and US adoption recovers from 2027. A useful, dense reference list of the author's positions.
open modelspredictionsAI economicsChina fundingregulation
TIER 4
Apr 20, 2026
Unpacks why reducing the open-closed gap to a single number (e.g. the Artificial Analysis Index) is misleading, explaining how the benchmarked 'frontier' shifts every 12-18 months across task paradigms and how RL environments (not just distillation) are the real lever letting Chinese labs keep pace. Argues the gap is increasingly about hard-to-measure robustness and private-data domains, leaving benchmark confidence at a low.
benchmarksopen vs closed gapRL environmentsevaluationRLVR
TIER 4
May 12, 2026
Argues that since ~80% of a frontier model's compute goes to R&D rather than the final training run, China's all-open ecosystem gains a real cost-structure advantage by sharing insights across labs and avoiding double-spend, partially mirroring open-source software's compounding. Notes open AI lacks OSS's user-back feedback loop and that forking open tools into internal versions undermines the compounding, reinforcing the case for an open-model consortium.
open ecosystemR&D computeChinaOSS economicsconsortium
TIER 4
Jun 1, 2026
Argues the open vs closed balance is fundamentally economic: closed labs ride an integrated exponential by monetizing the top of knowledge work (coding agents people will pay a large premium for), evolving into Apple+Microsoft-like $2-10T oligopolies, while open models ride a slower, broader exponential capturing diffuse, commodity-priced enterprise inference. A clear framework for why both ecosystems grow on different curves.
open vs closedAI economicscoding agentsmarket structureinference
China's AI Labs and the Distillation Wars
3 tier-5 · 8 tier-4
Primary-source reporting and analysis on why China leads the open-model race. The centerpieces are a tiered ranking of nineteen Chinese open-model labs and a firsthand field report from visiting nearly every leading one, arguing Chinese research culture is built to be an ideal fast-follower. The release analyses (Kimi K2, Kimi K2 Thinking, DeepSeek V3, Ant/InclusionAI) track Chinese labs reaching and matching the frontier, while the distillation pieces cut through the "distillation attacks" panic -- quantifying its real (modest) impact, naming API abuse as the actual issue, and arguing distillation matters less in the RL era because on-policy generation can't be borrowed.
TIER 5
Jan 9, 2025
A definitive analysis of DeepSeek V3's training efficiency (MLA, MoE, multi-token prediction, FP8, custom comms) and a careful debunking of the viral '$5.5M model' narrative, showing the cited figure excludes research, ablations, salaries, and capex that put real annual cost closer to $500M-$1B+. It reframes how to think about frontier training cost and what 'open-source AI' actually requires, a widely-cited reference during the DeepSeek panic.
DeepSeek V3training costMoEcompute efficiencyopen models
TIER 4
May 6, 2025
Argues from insider hearsay that Western enterprises largely won't deploy Qwen/DeepSeek open weights despite leading performance, driven by information-hazard and code-backdoor fears rather than technical security, opening a real opportunity for permissively licensed Western open models (OLMo, Phi, Mistral). Pairs this with SpeechMap.ai data showing Chinese models are often less censored than expected on Western-relevant topics and that newer OpenAI models refuse more. Matters as a counterintuitive, evidence-backed reframing of the open-model competitive landscape and the adoption-vs-capability gap.
Chinese open modelsenterprise adoptioncensorshipopen-source licensingOLMo
TIER 4
Jul 14, 2025
Analysis of Moonshot AI's Kimi K2, a 1T-param (32B active) permissively licensed open model competitive with frontier coding models, argued to be a second 'DeepSeek moment' showing that High-Flyer (the firm behind DeepSeek) is not unique, China is at/near the frontier, and the West keeps falling further behind on open models. Ties the release to algorithmic efficiency gains (similar token budget to DeepSeek V3), the Claude-compatible API driving fast adoption, and OpenAI's reactive open-model delay. A sharp release-analysis-plus-geopolitics piece.
Kimi K2open modelsChina AIDeepSeek momentfrontier models
TIER 4
Jul 29, 2025
A deep, full-transcript interview with researcher Ross Taylor (ex-Meta, Galactica) covering why Chinese labs win on open models, why most lab failures are organizational rather than talent-bound, the limits of academic RL/reasoning research without compute, the rise of rubric-based rewards and the evals crisis, and AlphaEvolve as evidence the future isn't only RL. Substantive frontier-practitioner discussion with many concrete, reusable claims about training organizations and RL.
interviewRL/reasoning researchtraining organizationsevals crisisAlphaEvolve
TIER 5
Aug 17, 2025
A definitive, tiered survey ranking 19 Chinese open-model labs by quantity and quality of open contributions, from frontier players (DeepSeek, Qwen) through close competitors (Moonshot/Kimi, Zhipu) to rising and honorable-mention orgs, with per-lab profiles of strategy, architecture, and licensing. This is a high-value, lasting reference map of the Chinese open ecosystem that Lambert repeatedly cites and that drew direct engagement from the labs themselves.
China AI labsopen modelsDeepSeek/Qwenlab surveyreference
TIER 4
Sep 9, 2025
Examines whether China will double down on or change course from its open-source AI strategy, citing the 'AI+' plan, Premier Li's statements, Beijing's city-level model targets, and anecdotes about high engagement and morale in Chinese labs. It argues open models are shifting from 'soft power' to just 'power,' and that by 2026 Chinese open models may widen their performance/adoption lead with real geopolitical and regulatory consequences. A well-sourced policy-and-ecosystem analysis.
China AI policyopen source strategyAI+ plangeopoliticsopen models
TIER 4
Nov 6, 2025
A structured five-point analysis of Moonshot's Kimi K2 Thinking (1T total / 32B active MoE, 256K context), framing it as the closest open models have come to the closed frontier since DeepSeek R1. Key points: China's faster release cadence (gap ~4-6 months), the shift from benchmaxing to real user behaviors, native INT4 quantization-aware training reported at serving precision, 200-300 sequential tool calls with interleaved thinking, and intensifying pricing/mindshare pressure on US labs. A sharp, technically grounded model-release analysis.
Kimi K2 Thinkingopen modelsChinaQAT INT4tool useMoE
TIER 4
Nov 12, 2025
A deep interview (with full transcript) with Richard Bian and Ling technical leads Chen Liang and Ziqi Liu of Ant Group's InclusionAI / Ant Ling, covering how a fintech giant became a frontier-lab contender in eight months, FP8 pre-training, modeling-strategy choices (size, multimodality), gaps in the open ecosystem, and why China is winning the open race. The transcript plus a substantive written overview of InclusionAI's model lineup give it real technical and strategic substance. A strong primary-source look inside a Chinese frontier lab.
Chinese AI labsInclusionAIAnt LingFP8 trainingopen modelsinterview
TIER 4
Feb 24, 2026
Responding to Anthropic naming DeepSeek, Moonshot, and MiniMax for industrial-scale distillation of Claude (16M exchanges via ~24K fraudulent accounts), Lambert quantifies the likely impact (150-400B tokens, meaningful but not crucial) and argues distillation is just a shortcut to more compute that everyone, including Ai2, effectively uses. His key technical point is that distillation matters less in the RL era, since on-policy generation dominates compute and can't be borrowed, so restricting distillation across distributed access is near-impossible and far less impactful than GPU export controls. Timely, well-calibrated analysis of a high-profile dispute.
distillationsynthetic dataChinese AI labsAnthropicUS-China AI
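A quick back-of-envelope check on the figures quoted in the summary above, as a minimal sketch. The inputs (16M exchanges, ~24K accounts, 150-400B tokens) are the ones in the summary; the per-exchange and per-account averages are derived here for illustration and are not stated in the post.

```python
# Figures as quoted in the summary; treat them as approximate.
EXCHANGES = 16_000_000                # reported Claude exchanges
ACCOUNTS = 24_000                     # reported fraudulent accounts (approximate)
TOKEN_ESTIMATE_LOW = 150e9            # low end of Lambert's token estimate
TOKEN_ESTIMATE_HIGH = 400e9           # high end of Lambert's token estimate

def tokens_per_exchange(total_tokens: float, exchanges: int = EXCHANGES) -> float:
    """Average tokens per exchange implied by a total token estimate."""
    return total_tokens / exchanges

low = tokens_per_exchange(TOKEN_ESTIMATE_LOW)
high = tokens_per_exchange(TOKEN_ESTIMATE_HIGH)
exchanges_per_account = EXCHANGES / ACCOUNTS

print(f"implied tokens per exchange: {low:,.0f} to {high:,.0f}")
print(f"implied exchanges per account: {exchanges_per_account:,.0f}")
```

On these numbers the estimate implies long exchanges, on the order of ten thousand tokens or more each, spread over several hundred exchanges per account.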
TIER 4
May 4, 2026
Argues 'distillation attacks' is a dangerous misnomer: distillation is an industry-standard technique (Nemotron, Olmo, and even xAI use it) and the real issue is a few Chinese labs jailbreaking/hacking APIs, which should be named as abuse. Warns that conflating the two is fueling a snowballing regulatory push (congressional bill, executive order) that could effectively ban Chinese open-weight models and cripple Western academics and small labs.
distillationpolicyChinaopen-weight regulationterminology
TIER 5
May 7, 2026
A firsthand report from visiting nearly every leading Chinese AI lab (Z.ai, Moonshot, Tsinghua, Meituan, Xiaomi, 01.ai, Alibaba, Ant), arguing China's culture makes it an ideal fast-follower: less ego, student-heavy teams, build-not-buy data, Claude-pilled developers, ownership mentality, and ambiguous-but-real government aid. A rare, high-signal primary-source account that reshapes how to read the open-model ecosystem.
ChinaAI labsresearch cultureopen modelsfield report
AI Progress, Scaling Limits, and the Takeoff Debate
2 tier-5 · 6 tier-4
Lambert's clear-eyed, repeatedly-revised position on how fast AI is actually moving. He separates the technical scaling claim (test loss still falls) from the product claim (user-visible chat gains are saturating), argues "emergent" abilities are mostly reliability gains, and pushes back on intelligence-explosion narratives -- from the AI-2027 software-singularity thesis to recursive self-improvement -- with his own "lossy self-improvement" framework: models become core to AI R&D, but friction keeps progress closer to linear than exponential. AGI, he argues, is an ungrounded symbol that bends to each speaker's goals.
TIER 4
Apr 24, 2024
Argues AGI is an ungrounded symbol whose definition shifts to fit each speaker's goals, walking through academic, OpenAI-corporate, and 'modern test' definitions and showing how the Microsoft-OpenAI contract literally makes AGI a legal/financial question. Adds RL's outsized cultural hold on the discourse plus the real ceilings (power, data, agency) that bound any intelligence-explosion narrative.
agiai-discourseopenaireinforcement-learningcompute-constraints
TIER 4
Oct 9, 2024
Grounds the scaling debate mechanically: power-law loss decreases don't imply AGI, and 'emergent' abilities are mostly gains in reliability, illustrated by a semiconductor-yield analogy where extra nines of per-token reliability compound over long token/agent sequences. Separates what scaling de-risks (next-gen loss) from the AGI storytelling it has no bearing on.
scaling lawsemergencereliabilityAGIagents
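The compounding behind the yield analogy can be sketched in a few lines. This is an illustrative model only, not code from the post: it assumes every step succeeds independently with the same probability, and the function name `sequence_success` is made up for the example.

```python
def sequence_success(per_step_reliability: float, steps: int) -> float:
    """Probability that every one of `steps` steps succeeds.

    Assumption: steps fail independently with identical probability,
    which is a simplification of real token or agent sequences.
    """
    if not 0.0 <= per_step_reliability <= 1.0:
        raise ValueError("per_step_reliability must be in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step_reliability ** steps


if __name__ == "__main__":
    # Compare two, three, and four "nines" of per-step reliability.
    for nines in (2, 3, 4):
        p = 1 - 10 ** -nines
        for steps in (100, 1_000, 10_000):
            print(f"p={p:.4f} steps={steps:>6} -> {sequence_success(p, steps):.4f}")
```

Under these assumptions, each extra nine of per-step reliability extends the sequence length that can be completed at a given success rate by roughly a factor of ten.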
TIER 4
Nov 14, 2024
Responding to 'scaling is dead' reports, Lambert argues both narratives are true: scaling laws (test loss vs compute) still work technically, but user-visible gains from a bigger GPT-5/Claude 4/Gemini 2-class chat model are slowing because chat goalposts are nearly saturated. The real frontier shifts to specialized models, agents, and new form factors like o1, leaving a large capability-product overhang. A clear, frequently-cited framing of the scaling debate that separates the technical claim from the product claim.
scaling lawsAI economicsGPT-5capability overhangagents
TIER 4
Feb 12, 2025
Uses OpenAI's Deep Research to ask what AI does to science, distinguishing 'information' (which current models accelerate enormously) from 'insight' (genuinely novel discovery, which he argues these models cannot yet produce: 'to an LLM, a novel discovery is indistinguishable from an error'). Contrasts grand AI-for-science projects (AlphaFold-style) with mass-market tools that compress the practice of normal science toward 'instantaneous PhDs,' and frames the coming strain on scientific institutions through Kuhn's paradigms. A thoughtful, distinctive essay on AI and the structure of scientific progress.
Deep ResearchAI for scienceinsight vs informationKuhnscientific institutions
TIER 5
Apr 30, 2025
A rebuttal to the AI 2027 software-singularity thesis arguing benchmarks go vertical because labs hill-climb on training-goal evals (not held-out tests), AI is broadening not narrowing into AlphaGo-style superintelligence, ML research is bottlenecked on messy data work rather than compute-efficient implementation, and multi-domain RL will resemble slow robotics progress. Proposes the compute-share-shifting-to-inference signal as an empirical test. Matters as a clear, reusable framework for thinking about AI progress rates and the limits of recursive self-improvement.
intelligence explosionAI 2027benchmark saturationRL scalingAI research bottlenecks
TIER 4
Aug 15, 2025
Rebuts Dwarkesh Patel's claim that the lack of continual learning is a fundamental bottleneck, arguing that continual learning is a systems/context problem, not a learning-algorithm problem — solvable via memory, retrieval, and massive explicit context fed to reasoning models (he projects 2026-2027). The core move is rejecting the demand that LLMs learn 'like humans,' paralleling his reasoning argument. A pointed, well-argued contribution to a prominent timelines debate.
continual learningAGI debatecontext/memoryreasoning modelsDwarkesh
TIER 4
Oct 7, 2025
Lambert refines his AI-timelines argument from The Curve conference: automating the AI research engineer is plausible in 3-7 years but full automation of AI research and a singularity are not, because compute scaling and rising system complexity will make progress feel linear rather than exponential. He pairs this with reflections on AI 'jaggedness' (via Helen Toner) and renewed urgency on open models versus China. A substantive, well-reasoned position piece on the pace of progress.
AI timelinesautomation of researchintelligence explosionopen modelscompute scaling
TIER 5
Mar 22, 2026
Lambert introduces an original framework, 'lossy self-improvement' (LSI), as a counter to recursive self-improvement (RSI): models do become core to the AI development loop, but friction (research being too narrow, diminishing returns of parallel agents per Amdahl's law, resource/political bottlenecks) breaks every assumption needed for a closed, self-amplifying exponential. He argues progress will feel like a huge step yet remain closer to linear, with no fundamental change convincing him takeoff is imminent. A durable conceptual contribution to the takeoff debate that names a distinct alternative model.
recursive self-improvementAI takeoffAGIautomated researchscaling
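The Amdahl's law point about parallel agents can be made concrete with the textbook formula. This is a sketch under stated assumptions, not the post's own calculation: `parallel_fraction` (the share of research work that can be split across agents) is a hypothetical parameter chosen for illustration.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: overall speedup when only part of the work parallelizes.

    speedup = 1 / ((1 - p) + p / n)
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / workers)


if __name__ == "__main__":
    # Hypothetical: 90% of the work can be handed to parallel agents.
    for n in (1, 10, 100, 1_000, 10_000):
        print(f"{n:>6} agents -> {amdahl_speedup(0.9, n):.2f}x")
```

With any serial remainder, the speedup is capped at 1 / (1 - p) no matter how many agents are added, which is the diminishing-returns shape the summary refers to.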
Frontier (Closed) Model-Release Analyses
1 tier-5 · 14 tier-4
What a major closed release actually means, judged on more than benchmark deltas. From GPT-4 onward through GPT-5, the Claude line (3.5, 4, Fable, Mythos), Gemini 2.5 Pro, the Grok releases, Qwen 3, and gpt-oss, Lambert reads each launch for its strategic signal: GPT-5 as proof AI is on a normal technological path (performance + price + product), Claude 4 as Anthropic's narrowing bet on code, Gemini 2.5 as Google's full-stack second chance, and the "post-benchmark era" in which release-day scores barely convey signal. The latest entries cover the AGI-era governance flashpoints around Claude Fable/Mythos and undisclosed safety nerfs.
TIER 4
Mar 20, 2023
A release-analysis of GPT-4 that deliberately ignores the hype and dissects the under-discussed details: the absence of architecture/data disclosure, the OCR-grade vision encoder, the central role of clean data as the real engine, and the argument that RLHF is far more 'significant' to making the model usable than OpenAI's report admits. Lambert also reads the societal/safety framing critically (anthropomorphizing, EA-aligned red-teamers, terms-of-service-as-policy) and predicts OpenAI's research lead will erode as product demands degrade its research teams. A substantive frontier-release breakdown that sets the template for his later model-review posts.
GPT-4model release analysisRLHFdata moatsAI safety discourse
TIER 4
May 15, 2024
Analysis of the GPT-4o launch (and Google's mirroring response) as both a product/culture inflection and a genuine architectural shift: 'omnimodal' models that natively ingest and emit audio/image/text tokens in one pass, eliminating the speech-to-text/LLM/text-to-speech handoff, evidenced by the new tokenizer with a roughly 200k-token vocabulary. Pairs the technical read with a pointed critique of OpenAI's safety posture turning product-first amid the Ilya/superalignment departures.
GPT-4oomnimodal modelstokenizerOpenAI vs GoogleAI safety
TIER 4
Oct 23, 2024
Hands-on first look at Anthropic's Claude 3.5 Sonnet (New) and the Claude Computer Use beta, judging the agent an intuitive 'aha' moment despite refusals, rate limits, and latency, then a four-factor map of the frontier: Anthropic best for chat/coding, Google's Gemini Flash best small/cheap (via documented distillation), OpenAI best at reasoning (o1) but unclear how to use, and all labs sitting on larger unreleased models used mainly to train smaller ones. A useful synthesis of the late-2024 frontier with the insight that model strengths mirror org cultures.
Claudecomputer use / agentsfrontier modelsknowledge distillationGemini Flash
TIER 4
Feb 18, 2025
Treats xAI's Grok 3 as evidence that frontier capability is no longer concentrated in OpenAI/Anthropic/Google and that competitive plus regulatory pressure (post-DeepSeek, post-Vance-AI-summit deregulation) will shrink the gap from training to release. Reads the thin 4-eval benchmark slate skeptically while crediting xAI's velocity, and steps back to question whether increasingly hard-but-unrepresentative evals actually track usefulness — calling for more transparency on internal evals. A good industry-trajectory essay framed around a release.
Grok 3xAIAI competitionderegulationevaluation
TIER 4
Feb 28, 2025
Reads OpenAI's strange GPT-4.5 release — the biggest model the public has tested, yet pitched as 'not a frontier model' — as the clearest signal yet that pretraining-scaling alone no longer yields obvious capability jumps. Estimates ~10x GPT-4 compute (~5-7T params), notes the gains showed up only in hallucinations/SimpleQA/emotional-intelligence rather than coding, and argues its real value is as a base for distillation into future reasoning models. A useful, honest model-release analysis (including a partial mea culpa on scaling expectations).
GPT-4.5scalingpretrainingmodel releasedistillation
TIER 4
Mar 26, 2025
Calls Gemini 2.5 Pro the biggest eval jump in a while (a 40+ Elo lead on LMArena, the second-largest top-model jump in LMSYS history) and uses it to argue Google has finally righted its strategic error and has the only full stack (models, TPUs, cloud). Pairs the release with a clarifying framing that reasoning is now a spectrum (DeepSeek V3 0324, hybrid reasoners) and argues Google should stop chasing ChatGPT and win as the AI platform via products and Cloud. Matters as a strong release-plus-strategy analysis of Google's resurgence and the commoditization-vs-distribution debate.
Gemini 2.5 ProGooglereasoning spectrumLMArenaAI platform strategy
TIER 4
Apr 7, 2025
A pointed release analysis of Llama 4 (Scout/Maverick/Behemoth MoEs) arguing Meta botched it with bizarre Saturday timing, a misleading LMArena entry using an unreleased experimental chat model, an off-putting juvenile default model, an onerous license, and architectures aimed at the wrong (hyperscaler) audience rather than the GPU-poor open community Llama built. Concludes Llama is no longer the open standard and Meta's GenAI org faces cultural and strategic crisis. Matters as the definitive contemporaneous account of Llama's fall from open-default status.
Llama 4Metaopen modelsMoELMArena gaming
TIER 4
Apr 28, 2025
A release analysis of Alibaba's Qwen 3 suite (6 dense models 0.6B-32B plus two MoEs, mostly Apache 2.0, thinking on/off toggles), arguing it validates the DeepSeek R1 recipe and distillation and crowns Qwen as the best-of-both-worlds open standard: DeepSeek-tier peak performance with a full Llama-style size ladder. Flags open questions on robustness, the SFT-distilled smaller models, lack of native multimodality, and undocumented training-token budgets. Matters as the definitive contemporaneous read on the model that displaced Llama as the open default.
Qwen 3open modelsDeepSeek recipedistillationMoE
TIER 4
May 27, 2025
A close release analysis arguing Claude 4 (Opus/Sonnet) is a reliability and reward-hacking-reduction step rather than a benchmark leap, with Anthropic narrowly curating coding/agentic evals and even regressing on some. Lambert questions the 'best coding model wins the AGI race' thesis and positions the major labs (OpenAI as consumer leader, Google as enterprise leader, Anthropic as the software-engineering specialist with a lower ceiling). Matters as a definitive read on Anthropic's strategic narrowing and on how benchmark presentation (parallel test-time compute, shaded bars) misleads.
Claude 4Anthropiccoding agentsbenchmark analysisAGI strategy
TIER 4
Jul 12, 2025
A detailed (paywalled preview) breakdown of xAI's Grok 4: leading benchmarks (HLE, GPQA, ARC-AGI) and 10x RL compute, but middling vibe tests, an o3-like search-heavy style, weak product differentiation, and severe brand/culture risk. The thesis is that catching up on benchmarks is now easy while finding genuine usefulness as performance commoditizes is the singular challenge, and that o3's search reliability still hasn't been matched. Substantive release analysis even in preview form.
Grok 4xAIRL scalingfrontier modelssearch agents
TIER 4
Aug 5, 2025
Analyzes OpenAI's first open-weight release since GPT-2 (gpt-oss 20B/120B, Apache 2.0) as a strategic move that validates open models, reveals more of OpenAI's stack (raw CoT, harmony format, instruction hierarchy), and undercuts its own API pricing. It catalogs open questions (MXFP4 quantization, anti-finetuning safety, tool-use messiness, no base models) and argues this shifts the 'second derivative' for US open models without ending the China gap. A substantive, well-structured release breakdown.
gpt-ossOpenAIopen weightsMoE architectureATOM/open ecosystem
TIER 5
Aug 7, 2025
A definitive release analysis arguing GPT-5 — best-in-class across evals while cheap enough to serve ~1B users via a real-time router — proves AI is on a traditional technological path (performance + price + product) rather than an exponential takeoff, with abilities developing more slowly than products. It draws out implications for the AGI fundraising narrative, the product overhang, and Jevons-paradox adoption. A reference-quality read on what a major model release actually means.
GPT-5model release analysisAI progressrouter/systemAGI narrative
TIER 4
Apr 3, 2026
Lambert lays out a five-factor framework (performance/size, country of origin, license, day-one tooling, fine-tunability) for judging which open-weight model is worth investing in, arguing that benchmarks at release are a deeply incomplete story and that ease of adaptation is what actually decides adoption. He applies the framework to Google's Gemma 4, praising its move to an Apache 2.0 license and arguing its success will hinge on usability rather than a 5-10% benchmark swing. Useful because it reframes open-model evaluation around the post-release adoption realities most release coverage ignores.
open modelsGemma 4model licensingfine-tunabilityopen-source tooling
TIER 4
Apr 9, 2026
Pushes back on the wave of anti-open-weight panic after Claude Mythos's cyber capabilities, arguing critics conflate a static open-closed gap with domain-specific risk and ignore the deployment realities (huge model size, harnesses, ~$10K/day inference) that limit proliferation. Acknowledges cybersecurity could be a genuine red line and proposes three concrete study questions rather than a blanket ban that would just cede open-model leadership to another country.
open-weight riskClaude MythoscybersecurityAI policymodel deployment
TIER 4
Jun 9, 2026
A detailed analysis of Claude Fable 5's release: it is the smartest public model yet, but ships with safety classifiers that downgrade some queries to Opus 4.8 and, more damningly, a silent, undisclosed safeguard that degrades frontier-LLM-development requests via prompt modification/steering vectors/PEFT. Lambert frames the undisclosed nerf as categorically misaligned and a market-entrenchment tactic dressed as safety, arguing it strengthens the case for open models.
AnthropicAI safetymodel releasedistillationclassifiers
Architectures, Multimodality, and Robotics Foundations
1 tier-5 · 18 tier-4
The technical substrate beneath the models, plus the RL/robotics roots of the author's thinking. The architecture pieces take non-attention models seriously (the landmark state-space/Mamba explainer, the Tri Dao / Michael Poli interview, hybrid RNN+attention, mixture-of-experts, model merging) and read multimodal/video releases (Sora, Gemini 1.5, Molmo, Apple Intelligence, robotic foundation models). The early RL essays -- written before the LLM era -- lay the conceptual groundwork Lambert later carried into RLHF: how all ML becomes RL, why reward is not enough, the slipperiness of reward specification, and RL as metaphor vs. tool vs. framework.
TIER 4
Feb 19, 2021
An explainer on why RL is hard to define and deploy: the slipperiness of reward specification and agent/environment duality, the case for treating every reward as a slightly wrong interpretation, and three structural tradeoffs (model-based vs end-to-end, exploration vs offline, plus practical blockers like compute cost and walled-garden simulators). A genuinely useful conceptual map of RL's design tensions circa 2021.
RL fundamentalsmodel-based RLoffline RLreward specificationend-to-end learning
TIER 4
Jun 14, 2021
Develops the thesis that any iteratively-retrained, user-facing ML system (recommendation, churn, ad delivery) is effectively a reinforcement-learning loop, exhibiting RL's core properties — feedback, policy fragility, exploitation, replay-buffer/distribution-shift dynamics — even when engineers don't call it RL. Distinguishes 'coursework RL' (update before next rollout) from the 'applied reduction to RL' (deploy, collect, retrain weeks later) and illustrates with Facebook A/B-stopping and NextDoor model-tracking bloopers. A clear, original conceptual framing.
reinforcement learningfeedback loopsrecommendation systemsdistribution shiftMLOps
TIER 4
Jun 21, 2021
A critique of the RL 'reward hypothesis' and the Silver/Singh/Precup/Sutton 'Reward is Enough' paper, arguing that while a scalar reward may exist for any single agent, finding it is impractical (rewards are non-stationary, infinitely tunable, and break down under multi-objective and societal-scale optimization). Ties the argument to dopamine/neuroscience and to attention-metric optimization in social media, warning against companies covertly re-tuning users' reward functions. A substantive conceptual essay foreshadowing his later RLHF reward-modeling work.
reward hypothesisRL theorymulti-objective optimizationAI ethicsneuroscience
TIER 4
Feb 8, 2022
Announces and summarizes the academic paper 'Choices, Risks, and Reward Reports', which formalizes three types of RL feedback (control, behavioral, exogenous) and four design risks (scoping the horizon, defining rewards, pruning information, training multiple agents), then proposes Reward Reports as living documentation for automated decision systems and maps governance/legal entry points. A genuine policy-framework contribution that grounds Lambert's recurring documentation thesis, presented here as a guided summary with reader-by-background pointers.
RL governanceReward ReportsAI policyreward hackingdocumentation
TIER 4
Jan 16, 2023
Uses the 2018-2022 arc of quadrupedal locomotion papers (ANYmal, A Walk in the Park, egocentric-vision end-to-end work) to argue RL's real success metric is taking over a vertical through repeated independent reproduction, not flashy AlphaGo-style one-offs. Sim-to-real pretraining for legged robots is RL's first proven engineering vertical, and the field should tout that — not DeepMind game wins — as the model of success. A well-curated literature walk that doubles as a thesis about redefining what counts as RL working.
reinforcement learningquadruped locomotionsim-to-realroboticsbitter lesson
TIER 4
Feb 1, 2023
Argues that scaling laws have not meaningfully arrived for RL/robotics because the field hasn't laid out what scaling should even look like in decision-making, and that single-environment scaling (DreamerV3, the just-published single-agent scaling-laws paper) misses the point. Proposes three prerequisites — avoiding the closed-world 'games effect', integrating exploration to escape Gato-style data curation, and demonstrating generalization before scaling — concluding that environments, generalization, and exploration, not scale, are RL's real bottlenecks. A substantive, opinionated explainer on why the bitter lesson hasn't transferred to RL.
scaling lawsreinforcement learningroboticsgeneralizationexploration
TIER 4
Feb 15, 2023
Co-authored framework distinguishing three ways RL is invoked — as a metaphor for general intelligence ('reward is enough', DL+RL=AGI), as an engineering tool (RLHF, quadruped locomotion), and as a framework for understanding any deployed feedback system (recommenders, predictive policing). Argues that conflating these three framings is the root of miscommunication about RL progress, and that documentation tools like Reward Reports must capture the time-evolving, optimization-intent dimension that Model Cards miss. A genuinely useful conceptual lens on RL discourse.
reinforcement learningRL taxonomyReward ReportsdocumentationAGI
TIER 4
Sep 29, 2023
Lambert (a robotics/RL PhD) traces Google Brain's scaling progression (QT-Opt → RT-1 → RT-2 vision-language-action models) and weighs it against Moravec's paradox — the observation that sensorimotor control is harder to engineer than reasoning. His thesis: robotics research is progressing well but environment-transfer is far harder than language transfer, so expect slow-burn domain-specific gains rather than a GenAI-style takeoff, and he's openly skeptical of humanoid-robot hype. A substantive explainer connecting RL/robotics history to scaling-law expectations, though only the public preview of this paywalled post is present.
roboticsMoravec's paradoxscaling lawsRT-2 / VLAreinforcement learning
TIER 5
Dec 20, 2023
Landmark deep-technical explainer of non-attention architectures, framing Mamba and StripedHyena reaching Llama-2/Mistral-7B territory as the moment to take SSMs seriously. Walks through the attention-vs-recurrence tradeoff, the SSM continuous/discrete formulation, Mamba's selection mechanism and hardware-aware scan, plus open challenges (GPU utilization, fine-tuning, in-context learning) — a durable reference on alternative LLM architectures.
state-space-modelsmambastripedhyenaarchitecturesRNN
TIER 4
Dec 21, 2023
Full-transcript interview with Tri Dao (Mamba, FlashAttention) and Michael Poli (StripedHyena) on why attention scales quadratically, how SSMs/linear-RNNs/linear-attention converge on one mathematical core, and hardware-aware design. Both predict attention persists as a primitive but hybrid architectures and architectural innovation grow, with data quality (not architecture) setting the scaling-law slope.
state-space-modelsmambaattentionarchitecturesinterview
TIER 4
Jan 24, 2024
Argues that local on-device LLMs will win on latency (and power-efficiency scaling laws) rather than the often-cited personalization angle, since OS-level inference avoids the cloud round-trips and sandboxing bottlenecks that otherwise keep real-time audio from being viable. Extends this into a strategic read of Big Tech (why Apple, Google's Pixel TPU, and Meta's open-source play are tied together), making it a substantive analysis of where the local-inference market is headed.
local LLMslatencyon-device inferenceAppleMeta strategy
TIER 4
Jan 29, 2024
A technical explainer (co-written with AI2's Jacob Morrison) on why model merging — averaging the weights of separately fine-tuned models — actually works, tracing it back through stochastic weight averaging, linear mode connectivity, and flat-minima generalization, with a quote from SWA author Andrew Gordon Wilson and a visual literature review. Useful because it grounds a meme-y GPU-poor technique (popularized by the anime/'waifu' merging community) in decades of optimization theory and distinguishes it from ensembling and mixture-of-experts.
model mergingweight averagingSWAmixture of expertsliterature review
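The core operation, uniform averaging of matching parameters, fits in a short sketch. It is a toy, not the post's code: weights are plain dictionaries of float lists rather than real tensors, and `merge_by_averaging` is a name invented for this example. Real merges assume the models share an architecture and usually a common base checkpoint.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # parameter name -> flattened values


def merge_by_averaging(models: List[Weights]) -> Weights:
    """Uniformly average parameters across models with identical shapes."""
    if not models:
        raise ValueError("need at least one model")
    names = set(models[0])
    for m in models:
        if set(m) != names:
            raise ValueError("models must have the same parameter names")
    merged: Weights = {}
    for name in models[0]:
        length = len(models[0][name])
        if any(len(m[name]) != length for m in models):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Average element-wise across models.
        merged[name] = [
            sum(m[name][i] for m in models) / len(models) for i in range(length)
        ]
    return merged


if __name__ == "__main__":
    # Two hypothetical fine-tunes of the same base model.
    chat_tuned = {"layer.weight": [1.0, 3.0], "layer.bias": [0.0]}
    code_tuned = {"layer.weight": [3.0, 5.0], "layer.bias": [1.0]}
    print(merge_by_averaging([chat_tuned, code_tuned]))
```

The result is a single set of weights with the inference cost of one model. Ensembling instead keeps every model and averages their outputs, and mixture-of-experts routes between sub-networks inside one model.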
TIER 4
Feb 16, 2024
Release-day technical summary of three simultaneous events: OpenAI's Sora (diffusion transformer over video patches, emergent physical-world simulation, likely YouTube + procedurally generated data), Gemini 1.5 Pro (MoE, up to 10M-token context implying a non-Transformer routing scheme, near-1.0-Ultra quality), and a stealth Mistral-next model in the arena. Matters as the definitive concise first-pass analysis of a landmark day for video generation and long context.
SoraGemini 1.5long contextdiffusion transformerMistral
TIER 4
Feb 19, 2024
A ranked ten-item follow-up digging into Sora and Gemini 1.5: deepfake/Gaussian-splatting coherence tests, whole-codebase-in-context use, Gemini's contamination and citation oddities, YouTube token-count Fermi estimates, Sora as a world-simulator for model-based RL/robotics, Midjourney style overlap, inference costs, and resulting pressure on Llama/Mistral. Matters as a dense set of practical and technical observations that go well beyond the release-day overview.
SoraGemini 1.5long contextworld modelsinference costs
TIER 4
Jun 5, 2024
Argues robotics is undergoing the same 'everything is a token / transformerification' shift as LLMs, with new foundation-model labs (Physical Intelligence) aiming to be the 'OpenAI for robotics' via cross-robot, language-promptable policies. Lays out the four gating factors — multi-robot policies, plain-text prompting, teleoperation markets, and crucially the manufacturing cost of robots and data scarcity — explaining why prior robot-learning startups were too early and what has to break for this generation to succeed.
roboticsfoundation modelsPhysical Intelligenceembodied AIscaling
TIER 4
Jun 12, 2024
Reads Apple Intelligence as a bet that AI follows prior tech revolutions (incremental, embedded, privacy-preserving) rather than a winner-take-all race, then digs into the disclosed technicals: a ~3B on-device model plus a GPT-4-class server model, adapter/quantization orchestration, and notably two novel post-training algorithms (rejection sampling with a teacher committee, and RLHF via mirror descent policy optimization with a leave-one-out advantage estimator). It matters because Apple putting a non-PPO RLHF algorithm into a shipping product signals current PPO recipes aren't optimal and validates on-device models.
Apple IntelligenceRLHFMDPO vs PPOon-device modelspost-training
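A leave-one-out advantage estimator, in its generic form, scores each sampled response against the mean reward of the other samples for the same prompt. The sketch below shows that generic form only. It is not Apple's implementation, and the function name and example rewards are assumptions made for illustration.

```python
from typing import List


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Advantage of each sample relative to the mean of the *other* samples.

    A_i = r_i - (sum_{j != i} r_j) / (k - 1)
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("need at least two samples per prompt")
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]


if __name__ == "__main__":
    # Hypothetical reward-model scores for three responses to one prompt.
    print(leave_one_out_advantages([1.0, 2.0, 3.0]))
```

Because each sample's baseline excludes its own reward, the baseline does not depend on the response being scored, and no separate learned value network is needed.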
TIER 4
Nov 7, 2024
A full, deep technical interview with Tim Dettmers (QLoRA/bitsandbytes author, then Ai2, incoming CMU professor) covering the state of open-source models, agents and SWE-Bench, doing high-impact research while GPU-poor, model merging and optimization landscapes, knowledge distillation, state-space models vs transformers, and the fundamental limits of quantization. Substantive technical discussion from a leading open-source researcher with full transcript. Worth reading for its grounded takes on efficiency and academic AI.
interviewquantizationopen sourceagentsmodel merging
TIER 4
Mar 12, 2025
A deep technical interview (with full transcript) with NYU professor and RL researcher Eugene Vinitsky on scaling self-play in simulated multi-agent RL — including the Gigaflow self-driving result where a single shared-weight policy with randomized rewards produces human-like driving without ever seeing humans drive. The conversation distills how self-play, inference-time compute, and RL scaling laws relate to the current RL-for-LLM takeoff, plus LLMs as in-context reward designers. Substantive enough on the research to be worth reading beyond the episode notes.
reinforcement learningself-playself-drivingmulti-agent RLpodcast
TIER 4
Mar 5, 2026
Announcing Ai2's Olmo Hybrid (a 7B attention+GDN model), Lambert explains why hybrid RNN/attention architectures are being adopted everywhere at once and shares the accompanying paper's theory that hybrids are strictly more expressive than transformers or GDN alone and translate that into better token efficiency (~2x pretraining gain). He is candid about the mixed post-training results and 'horrific' open-source tooling that currently erases inference-throughput gains, and speculates a frontier closed model being an RNN is roughly a coin flip. Strong technical explainer grounded in a real release and paper.
hybrid architecturesGated DeltaNetOlmoscaling lawspost-training
Evaluation and Benchmarks
0 tier-5 · 11 tier-4
A sustained critique of how the field measures models -- and why it keeps getting fooled. Lambert shows how ChatBotArena rewards style and low refusal over capability, how Big Tech "evals" are marketing that closed labs can't run fairly on rivals, and how the whole enterprise is sinking into "evaluation quicksand" of contamination, private suites, and unreproducible scores. He defends the open leaderboard as a discovery tool while announcing RewardBench, the first reward-model benchmark, and argues that deploy-and-chat "vibes" evaluation -- and ultimately causal/A-B testing -- is where real model research happens.
TIER 4
May 31, 2023
A thorough early treatment of LLM evaluation: the limits of the two contemporary leaderboards (HuggingFace Open LLM vs. LMSys ChatBot Arena), prompt/tokenization sensitivity, paper-vs-reproduction gaps and eval leakage, and the biases of GPT-4-as-judge and human position bias. Argues for curated held-out per-task prompt sets and ultimately causal/AB-tested evaluation—a substantive, durable explainer on a topic central to the archive.
evaluationleaderboardsChatbot ArenaLLM-as-judgebenchmarks
TIER 4
Sep 13, 2023
Responding to SemiAnalysis's claim that HuggingFace's Open LLM Leaderboard 'actively hurts' open source, Lambert (a contributor) explains how the leaderboard actually functions as a discovery and oversight tool, who uses it across five user segments, its community moderation against benchmark-gaming, and the often-misunderstood probability-based scoring of multiple-choice evals. He rebuts the GPU-rich/poor framing's dismissal of fine-tuning-focused companies and previews the fall open-model wave. A genuinely useful explainer of LLM evaluation mechanics and open-ecosystem dynamics.
open LLM leaderboardevaluationHuggingFacebenchmark gamingSemiAnalysis rebuttal
TIER 4
Oct 26, 2023
Detailed critique (co-written with EleutherAI) of Stanford CRFM's Foundation Model Transparency Index, arguing it measures product documentation rather than true transparency, can't verify closed-model claims, uses a misleading scorecard, and is systematically biased against open models. An influential argument reframing transparency as openness in good faith; public-preview only but the visible critique is substantive.
transparencyFMTIopen-vs-closedAI-policyevaluation
TIER 4
Nov 15, 2023
Argues that automated benchmarks miss what matters and that the deploy-and-chat workflow (vibes-based evals, MT-Bench/AlpacaEval as proxies) is now central to good model research, giving an engineering edge to teams with fast checkpoint-to-endpoint tooling. A useful framing of why deployment engineering and real interaction, not eval numbers, drive progress — and why this advantages cheap small models.
evaluationvibes-evalsdeploymentresearch-workflowtooling
TIER 4
Dec 13, 2023
Argues that Big Tech benchmark comparisons (Microsoft's Medprompt MMLU plot, Google's Gemini-vs-GPT4 chart) are marketing not science, since closed labs cannot fairly evaluate competitors' models they can't access, and contamination plus undisclosed training data make scores untrustworthy. Recommends judging models by hands-on chat, pushing in-context learning over prompting hype, and building on Eleuther's eval harness for the open ecosystem.
evaluationbenchmarksMMLUopen-vs-closedin-context-learning
TIER 4
Feb 28, 2024
A ten-point practical heuristic guide for separating signal from noise in ML content — prioritizing demos/model access, depth-vs-breadth focus, reproducibility, no-free-lunch sanity checks (scaling beats small, simple beats complex), distrust of leaderboard-only claims, and knowing publisher incentives. A genuinely useful, transferable explainer for anyone trying to keep up with AI releases.
information dietevaluation literacyreproducibilitymediadeep learning heuristics
TIER 4
Mar 20, 2024
Argues LLM evaluation is getting harder along three axes — trust (few orgs have no incentive to cook the books), rising price (human and API-credit costs now price out academics and open labs), and vibes-based judgment — with a case for government-funded hidden eval sets. Bundles the launch of RewardBench, the first reward-model evaluation benchmark, covering 30+ RMs across chat/safety/code/math.
evaluationrewardbenchreward-modelsrlhfbenchmarks
TIER 4
May 8, 2024
A comprehensive insider analysis of ChatBotArena: what it is, its Elo statistics and noise floor, who actually needs it (attention-economy chat companies, not enterprise), how style/length 'doping' games it, the 'preference data gap' versus what labs buy from Scale, and where LLM evaluation is heading. Even as a paywalled preview, the visible argument is a rich, reference-quality treatment of the most influential public LLM eval; docked from 5 only because the incentives/gpt2chatbot section is truncated.
ChatBotArenaLLM evaluationElo / preference datastyle biasLMSYS
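For reading Elo gaps on a leaderboard, the standard Elo expected-score formula converts a rating difference into a head-to-head win probability. This is the textbook formula on the conventional 400-point scale, offered as a reading aid. It is not a description of how the arena computes or fits its published ratings, and the ratings below are made up.

```python
def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Expected score (win probability, ties counted as half) of A against B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


if __name__ == "__main__":
    # Hypothetical ratings: what do small gaps mean in head-to-head votes?
    for gap in (0, 10, 40, 100):
        p = elo_expected_score(1200 + gap, 1200)
        print(f"gap={gap:>3} Elo -> A preferred in {p:.1%} of votes")
```

Small rating gaps correspond to win rates only slightly above 50%, which is why the noise floor matters when ranking closely matched models.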
TIER 4
Jul 31, 2024
Argues that GPT-4o-mini's surprise top-tier ChatBotArena ranking exposes how the leaderboard rewards style (lists, line breaks, upbeat tone) and low refusal rates rather than peak capability, and surveys partial fixes (hard-prompt categories, Scale's leaderboard) and their conflicts of interest. It matters as an influential, well-evidenced critique of crowd-vote LLM evaluation and the style-vs-substance problem in RLHF.
evaluation · ChatBotArena · RLHF · model style · GPT-4o-mini
TIER 4
Sep 30, 2024
A full-transcript masterclass with Scale AI's Riley Goodside on why prompting matters and how it interacts with post-training and evaluation: ChatGPT/ChatML as the biggest event in prompt-engineering history, prompting as the experimental edge that gets absorbed into models, o1's odd steerability, and equal-inference-cost eval comparisons. Rich, durable reference on prompting craft and eval methodology.
prompting · evaluation · o1 · post-training · interview
TIER 4
Oct 16, 2024
Argues that LLM evaluation has become unreliable 'quicksand': closed labs tune private eval suites that can't be reproduced, the open community has failed to standardize tooling, and synthetic data introduces new contamination vectors. Calls for the open ecosystem to converge on a common evaluation standard and previews an economy of expensive expert-built evals (Humanity's Last Exam).
evaluation · benchmarks · contamination · open models · synthetic data
AI Policy, Governance, and the Definitions of Open
0 tier-5 · 12 tier-4
Where Lambert turns researcher knowledge into policy argument. The throughline: regulate deployed systems and known harms, not model weights or compute thresholds -- so SB 1047 is on the wrong side of history, export controls should target compute not weights, and Model Specs are the right transparency-first abstraction for light-touch regulation. He co-authored Ai2's OSTP comment and reads the White House AI Action Plan, defends the NAIRR as public AI infrastructure, untangles the OSI open-source-AI definition, and warns that the AGI era of governance (gating releases on vibes) will eventually hit open models too. Includes the policy-focused interviews with Dean Ball, Arvind Narayanan, and Andrew Trask.
TIER 4
Sep 6, 2023
Lambert dismantles the popular 'Manhattan Project for AI' analogy (unlike the bomb, AI has no clear target, leaves no radioactive signature for monitoring, and emerges amid distrust of academic institutions and arXiv-driven mass participation) while drawing real lessons about scientists losing political sway. The second half is a sharp analysis of how social-media algorithms, embellishment incentives, and incomplete corporate participation are destabilizing AI research norms and credentialing. A substantive sociology-of-AI-research essay with several transferable observations.
Oppenheimer analogy · AI governance · research norms · scientific distribution · institutions
TIER 4
Jun 27, 2024
A deep, full-transcript interview with policy scholar Dean Ball that walks through SB 1047's mechanics (Frontier Model Division, the $100M threshold, lowerable fine-tune FLOP thresholds, mandatory 2028 audits) and the state-legislature politics that determine its fate, then ranges over how a minor AI 'disaster' would likely be blamed on AI without forensic proof, whether Meta releases the 405B, export controls and China's compute ceiling, synthetic-data license enforceability, and a skeptical read of scaling-law/intelligence framings. Matters because the transcript carries substantive, well-informed analysis from a frontier AI-policy voice rather than a thin episode promo.
AI policy · SB 1047 · interview · China export controls · scaling laws
TIER 4
Jul 17, 2024
An influential policy argument that California's SB 1047 is on the wrong side of history by regulating models (and compute/FLOP thresholds) rather than deployed systems, with the standout principle that developers should be liable only for known harms native to the model, not downstream derivatives. It also reads the broader cultural shift — antitrust pressure on big tech, China/national-security stakes, and the politicization of open source — as unlikely allies of open weights, and closes with a concrete list of what Lambert would actually regulate (CSAM/deepfakes, human-content watermarking, PII deletion, researcher access). Matters as a coherent open-models-favorable governance worldview at a pivotal regulatory moment.
SB 1047 · AI regulation · open weights · antitrust · compute thresholds
TIER 4
Sep 9, 2024
Argues that AI regulation should move away from compute thresholds toward auditing post-training, with mandated Model Specs (OpenAI-style behavior documents) as the regulatory abstraction that distinguishes intentional from unintentional model behaviors. It matters as an original, transparency-first policy framework that bridges safety and acceleration camps without forcing labs to disclose trade secrets.
AI policy · model specs · post-training · regulation · transparency
TIER 4
Oct 2, 2024
Argues that AI Safety culture is structurally losing to capitalism as labs need ~$100B for next-gen models, forcing them to 'appear normal' to investors (OpenAI shedding nonprofit governance) and dissolving the structural incentives that once distinguished them. Pairs this with SB 1047's veto as a litmus test, concluding that staying inside established labs is more impactful than repeatedly forking new orgs.
AI safety · AI policy · SB 1047 · OpenAI · industry economics
TIER 4
Oct 10, 2024
A full-transcript interview with OpenMined's Andrew Trask on secure enclaves for pre-release model testing (Anthropic/UK AISI), structured transparency, data-store language models as a better path to 'open training data,' and running models on air-gapped networks. Substantive on privacy-preserving infrastructure and the future governance of model access.
secure enclaves · privacy · structured transparency · OpenMined · AI governance
TIER 4
Oct 17, 2024
A full-transcript interview with the AI Snake Oil co-author on disentangling AI hype from reality, the capability-reliability gap in agents, why generality is a red herring for economic AGI, and diffusion as a speed limit on innovation. Substantive on AI policy, agent evaluation (CORE-Bench), scaling, and the predictive-vs-generative-AI distinction.
AI hype · agents · AGI · AI policy · evaluation
TIER 4
Nov 13, 2024
A policy argument to save the National AI Research Resource (NAIRR), which loses funding in January 2025 absent congressional action, framing it as critical infrastructure to keep academic and non-profit AI relevant against big-tech buildout. Distinguishes an 'AI' vs a narrower 'language model' research resource, makes the case for prioritized compute allocation over 'democratizing AI', and rounds up post-election policy risks (state legislation, Elon vs Trump, anti-open-source FUD, agents). A substantive, well-sourced policy piece with lasting relevance to the public-AI-infrastructure debate.
AI policy · NAIRR · public compute · open source · regulation
TIER 4
Apr 28, 2025
Uses OpenAI's undocumented quiet GPT-4o update to argue AI is becoming a 'normal technology' where product trumps research transparency, then lays out a 'priority stack' framework for understanding why different actors want different kinds of openness (capability, base-model, reward-model, training-spec, structured access). Distinguishes concentration-of-power concerns from intelligence-explosion concerns as the drivers of transparency demands. Matters as a structured taxonomy of transparency and a clear read on OpenAI's shift away from documented releases.
transparency · openness · OpenAI · priority stack · AI governance
TIER 4
Jul 23, 2025
An annotated read of the White House AI Action Plan's open-model and AI-research provisions, by a co-author of Ai2's official OSTP comment, arguing the plan rightly endorses more investment in open models (compute markets, NAIRR, NTIA adoption drives) for the right reasons. Lambert flags the gap it leaves — building strong fully open models, not just dispersing compute — plus missing immigration policy and the slippery 'Chinese values' evaluation mandate. Useful as an expert policy reading tied directly to his American DeepSeek thesis.
AI policy · open models · White House Action Plan · research compute · US-China
TIER 4
Mar 6, 2026
A podcast (full transcript) with policy analyst Dean Ball arguing that the U.S. Department of War's designation of Anthropic as a supply-chain risk, while bad, points toward open models being the stable 5-10 year equilibrium for power centers, because no global entity will let a single U.S. company control its relationship to the most important technology. The discussion covers funding open models against a widening frontier gap, sovereign AI and foreign distrust of closed models, nationalization risk, and Ball's idea of financializing compute. Substantive policy reasoning, though it is an exploratory conversation rather than a definitive argument.
AI policy · open models · government control · sovereign AI · Anthropic
TIER 4
Jun 14, 2026
Argues that the U.S. government forcing Anthropic to suspend foreign access to Claude Fable/Mythos is the 'starting gun' of a new AGI era of AI governance, where releases get gated on vibes by a technically thin executive branch. Lays out lasting positions (export bans are bad policy, Anthropic's fear-mongering accelerated this, the open community shouldn't celebrate) and warns the same heavy-handed treatment will eventually hit open models.
AI governance · Anthropic · export controls · policy · open models
AI Economics, Moats, and Industry Structure
0 tier-5 · 9 tier-4
How value actually accrues in AI. Lambert argues the model itself is not a moat -- weights leak, get distilled, get commoditized -- so durable advantage lives in data, distribution, inference economics, and product. He applies Aggregation Theory to ask whether inference-time compute breaks zero-marginal-cost economics, sizes the real exponential growth of inference usage, dissects the data-foundry (Scale AI) and alignment-as-a-service businesses, reframes the "data wall" as an open-ecosystem problem rather than a frontier one, and -- in "Burning out" -- argues the binding constraint on frontier AI is shifting from financial to human capital.
TIER 4
Dec 28, 2022
Argues that in ML the model itself is not the moat — models leak, get fine-tuned, or get distilled via API outputs — so data, infrastructure, and feedback loops are the durable advantages. Introduces emergent behavior as a new kind of moat: because abilities appear nonlinearly past data thresholds, concentrated user data (and the feedback loop it feeds) can confer lasting advantage in a way the 'data moats are fake' a16z thesis didn't anticipate. A useful, still-relevant business-strategy framework for AI companies.
ML moats · data advantage · emergent abilities · model distillation · AI business strategy
TIER 4
Feb 7, 2024
Analyzes Scale AI's ~$750M annualized revenue from selling RLHF/human-preference data and whether any moat protects a labor-heavy data-services business, then sketches 'alignment-as-a-service' (AaaS) as a startup category built on recurring model-monitoring and continual-training rather than one-off data labeling. Matters because it frames the economics of the RLHF supply chain and the existential question of what happens to these businesses if synthetic data or stronger base models reduce the need for human preference data.
RLHF · Scale AI · data labeling · AI economics · synthetic data
TIER 4
Mar 13, 2024
Argues that as GPT-4-class capability is replicated across Gemini, Claude 3, Mistral Large, and Inflection, the model itself is no longer a moat; durable advantage shifts to product, sticky user habits, distribution, cheap inference as a loss-leader, and eventually advertising economics. Matters because it reframes the 'no moat' debate around system/product moats versus model moats and flags that open ecosystems still face an unsolved data-coordination ('sinkhole') problem in preference alignment.
model commoditization · moats · open vs closed · inference economics · RLHF data
TIER 4
May 29, 2024
Reframes the 'data wall' narrative: frontier labs aren't out of data, the open ecosystem is — closed labs can self-generate ~1 trillion synthetic tokens/day from inference traffic for a few million dollars, sign exclusive licensing deals (Reddit, news, Stack Overflow) as legal moats, and use search/best-of-N to manufacture better tokens. The thesis matters because it predicts the data bottleneck pinches open players and small labs (Mistral, Cohere) rather than OpenAI/Google, with human data as the last expensive frontier.
data wall · synthetic data · data licensing · open vs closed · scaling
TIER 4
Sep 11, 2024
Dissects Scale AI's 'data foundry' business as caught between rising RLHF demand and synthetic data eating the human-instruction market, with a rare insider account of what procuring human data from vendors actually involves. Concludes that data foundries are aggregators (Uber/Airbnb-tier valuations, not Apple/Meta) increasingly exposed to synthetic data and Nvidia capturing the margins.
data foundry · Scale AI · RLHF · synthetic data · AI industry economics
TIER 4
Mar 5, 2025
Applies Ben Thompson's Aggregation Theory to ask whether inference-time compute breaks the zero-marginal-cost economics that powered internet giants, concluding that consumer chat will stay aggregator-shaped (ad-supported, near-zero cost) while inference-heavy, high-value tasks push AI companies toward platform economics and a 'barbell' market. Argues parallel sampling plus strong verifiers (not just longer single generations) is the real scaling axis, invoking Jevons' paradox and Noam Shazeer's 'making models more expensive is worth it.' A solid economics-of-AI explainer connecting reasoning models to business structure.
inference-time scaling · aggregation theory · AI economics · verifiers · business models
TIER 4
Mar 19, 2025
Drawing on off-the-record talks with frontier-lab leadership plus a detailed Tulu 3 case study, Lambert lays out how to structure model-training teams: keep core modeling teams small, preserve bottom-up information flow, scale only where co-design isn't needed, and avoid the politics that make large orgs 'unable to put it together.' The Tulu 3 walkthrough (project length, researcher/engineer ratio, ~1000 checkpoints, 8B/70B/405B iteration split, compute as the binding constraint) is a rare concrete look at how a mid-sized post-training effort actually runs. Valuable operational knowledge that is otherwise a closely guarded secret.
org design · training teams · post-training · Tulu 3 · management
TIER 4
May 21, 2025
Uses Google I/O's token-throughput slide (480T+ tokens/month, up from ~10T a year prior) plus Azure and OpenAI figures to argue AI inference usage is growing exponentially and is largely profitable, with reasoning models and code agents about to push per-task token use 10-100x higher. The framing that Google now processes more tokens each month than the Common Crawl contains, and that the internet is being rebuilt as AI-first, is a useful quantitative grounding. Matters for sizing the real economics and trajectory of AI adoption beyond saturated benchmarks.
AI usage growth · inference economics · token throughput · Google I/O · industry scale
TIER 4
Oct 25, 2025
A reflective essay on the brutal work culture of frontier AI (996/997/002, 100-hour weeks) and why the closing window to stay at the cutting edge is real, not just perceived. Lambert draws an elite-athletics analogy (team culture beats individual talent) and advances a notable thesis: the binding constraint on AI is shifting from financial capital to human capital, since stabilizing and replicating known-good recipes takes focused grinding that money can't shortcut, which is a structural argument for why from-scratch labs (SSI, Reflection) face long odds. Substantive industry analysis beyond the personal frame.
AI work culture · burnout · human capital · frontier labs · team culture
Coding Agents and the Agentic-Work Transition
0 tier-5 · 9 tier-4
The most recent capability inflection and what it does to how we work. Lambert calls coding the epicenter of AI progress -- the last broadly tractable frontier and the template for everything else -- and tracks the CLI-agent jump (Claude Code, Codex) where the same model yields wildly different agent quality depending on scaffolding. His "Get good at agents" thesis -- agents push humans up the org chart, being good at using AI beats working hard -- pairs with concrete workflow advice and a taxonomy of the overloaded term "agent," from tool-use to fully agentic. ChatGPT's turn into the vertical "Agentic App" rounds out the cluster.
TIER 4
May 25, 2023
Makes the case that code is the highest-value, most sustainable frontier for LLMs and RL: code-in-pretraining likely drives chain-of-thought reasoning, and code's computable correctness (syntax/runs/tests) is an ideal RLCF reward signal, while flagging risks like accumulated technical debt and lagging code-eval tools. A prescient, substantive argument given how central coding later became to AI progress.
code generation · Copilot · chain-of-thought · RLCF · reasoning
TIER 4
Jul 12, 2023
Argues that the real blocker for LLM agents is not capability but robust enterprise integration—security, trust/reliability, and dramatic failure modes when a model controls reputation and finances—surveying the 2023 landscape (ChatGPT Plugins, LangChain, Adept, Lindy) and diagnosing the 'LangChain debacle' as a symptom of the integration wall. A substantive, well-reasoned early analysis whose 'agents are the self-driving cars of digital tech' framing aged well.
LLM agents · enterprise integration · LangChain · security · RAG
TIER 4
Dec 18, 2024
A conceptual framework arguing the term 'AI agent' is overloaded and proposing a spectrum from tool-use LMs to orchestration LMs to fully agentic LMs, with a six-step 'agent cartography' of increasing complexity. It repurposes an RL-systems regulation framework (scoping the horizon, defining utility, pruning information, multiple agents) to give agents crisper definitions ahead of the 2025 agent push.
AI agents · taxonomy · tool use · orchestration · framework
TIER 4
Sep 18, 2025
Argues coding is the last broadly tractable domain of continued frontier progress and the template for how AI capabilities will be built and absorbed elsewhere, with CLI agents (Claude Code, Codex) as the biggest recent capability jump. It notes that product/scaffolding now matters as much as the model (same Claude model, wildly different agent quality) and that coding is shifting toward asynchronous, autonomous PR-generating agents. A substantive, evidence-rich essay on the coding-agent inflection.
coding agents · Claude Code/Codex · GPT-5-Codex · autonomous PRs · AI progress
TIER 4
Sep 22, 2025
Proposes a clean three-primitive framework for modern reasoning models — thinking (reasoning traces), searching (non-parametric knowledge), and acting (code/tool execution) — and argues these will outlast static model weights as the durable technology layer. It reframes hallucinations, tokenomics, and the open-vs-closed tool-integration gap through this lens. A genuinely useful conceptual framework that recurs across his other posts.
reasoning models · tool use · search · agents · inference infrastructure
TIER 4
Sep 30, 2025
Analyzes OpenAI's 'Buy It in ChatGPT' launch and the Agentic Commerce Protocol as the start of ChatGPT becoming the one vertical 'Agentic App,' arguing that where models act (store networks, APIs) now matters as much as the weights. It frames model specialization (OpenAI's consumer/search vertical vs Anthropic's coding bet) as splitting the industry and reducing the number of model releases. A sharp, timely strategic read on AI monetization and the agentic app paradigm.
agentic apps · OpenAI · monetization/commerce · model specialization · ChatGPT
TIER 4
Jan 21, 2026
Lambert argues that applying old work habits to coding agents is fundamentally wrong: the shift to Claude Code with Opus 4.5 changes the question from how to solve a problem to what to work on, pushing humans toward more open-ended, ambitious, asynchronous direction-setting while agents do the hard work in parallel. His thesis — 'agents push the humans up the org chart' and 'being good at using AI is a better moat than working hard' — is paired with a concrete workflow (GPT 5 Pro for planning, Claude Code for implementation) and a curated reading list. Influential framing of the agent-native work transition.
coding agents · Claude Code · agentic workflows · future of work · productivity
TIER 4
Feb 9, 2026
Comparing Claude Opus 4.6 and GPT-5.3-Codex, Lambert finds Codex 5.3 has become far more Claude-like and is the better top-end coding model, while Opus retains a usability edge for broad, loosely-specified tasks. The larger argument is that we have entered a 'post-benchmark era' where release-day scores barely convey signal (he cites Gemini 3's two-month fall from coronation), validating Anthropic's early bet that real-world agentic gains matter more than evaluation deltas. Worth reading for the post-benchmark thesis and the agent-comparison method.
Claude Opus 4.6 · Codex · coding agents · benchmarks · model evaluation
TIER 4
Mar 18, 2026
A hands-on review arguing GPT 5.4 is a meaningful step that puts OpenAI back in the agent wars, removing the 'death by a thousand cuts' (failed git ops, context anxiety) that drove Lambert off prior Codex versions. He contrasts model philosophies: Claude reads intent and has warmth/character while GPT 5.4 is meticulous and precisely instruction-following, suited to the 'master agent coordinator,' and notes OpenAI's edge on rate limits, reasoning efficiency, and context management. Substantive because it articulates the multi-axis (correctness/usability/speed/cost) view of agentic models rather than a single benchmark.
GPT 5.4 · Codex · coding agents · model comparison · agentic AI