Tech & AI

Lil'Log

Lilian Weng

44 issues · 44 keepers · 17 tier-5 · 27 tier-4

Generative Models — Diffusion, VAEs, GANs & Flows

4 tier-5 · 1 tier-4

This is the most concentrated cluster of landmark surveys in the archive — Weng's from-scratch derivations of the major generative-model families. Read together they form a unified tour of likelihood-based and adversarial generation: GANs (implicit, adversarial), VAEs (variational latent-variable), normalizing flows (exact tractable density), and diffusion (the family that came to dominate image and video generation). She consistently builds the math up from first principles — the GAN minimax game and its Jensen-Shannon connection, the VAE ELBO and reparameterization trick, the change-of-variable theorem for flows, the DDPM variational bound and its link to score-based modeling — making this the canonical reference set for understanding how modern generators actually work.

From GAN to WGAN

TIER 5 Aug 20, 2017

A rigorous and widely-cited derivation of the GAN framework — the minimax game, its connection to Jensen-Shannon divergence, and why GAN training is unstable and prone to mode collapse and vanishing gradients — then motivating Wasserstein GAN as a fix via the smoother Earth-Mover distance and Lipschitz/weight-clipping constraint. Later posted to arXiv, it stands as a reference explainer of GAN math and the WGAN improvement.

GAN · WGAN · Wasserstein distance · generative models · training stability

From Autoencoder to Beta-VAE

TIER 5 Aug 12, 2018

A foundational survey tracing the autoencoder family from vanilla, denoising, sparse, and contractive variants to the Variational Autoencoder and its descendants. It derives the VAE ELBO and reparameterization trick in full, then covers beta-VAE (disentanglement), VQ-VAE/VQ-VAE-2 (discrete latents), and TD-VAE — serving as a canonical reference for latent-variable generative modeling.

VAE · autoencoders · beta-VAE · VQ-VAE · ELBO
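The reparameterization trick and the KL term of the ELBO named in this entry can be sketched in a few lines. This is a minimal illustration assuming a diagonal Gaussian encoder and a standard normal prior; the function names are hypothetical and not taken from the post.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I).

    The randomness lives in eps, so z stays a deterministic, differentiable
    function of mu and log_var (hypothetical helper for illustration).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, -2.0], rng)   # one latent sample
kl = kl_to_standard_normal([0.0, 1.0], [0.0, -2.0])
```

The KL term is zero exactly when the encoder outputs the prior (mu = 0, log_var = 0), which is why it acts as a regularizer pulling the latent code toward N(0, I).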

Flow-based Deep Generative Models

TIER 5 Oct 13, 2018

A definitive deep-dive on normalizing flows, the third class of generative model that, unlike GANs and VAEs, explicitly learns the exact data density via a chain of invertible transformations and the change-of-variable theorem. It builds rigorously from Jacobian determinants through RealNVP, NICE, and Glow, then covers autoregressive models (MADE, PixelRNN, WaveNet) and the autoregressive flows MAF and IAF, making it a widely-used reference for understanding tractable-likelihood generative modeling.

normalizing flows · generative models · RealNVP · Glow · autoregressive flow
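The change-of-variable theorem named in this entry can be illustrated with the simplest possible flow. The sketch below assumes a one-dimensional affine map x = a * z + b over a standard normal base; the affine choice and all function names are assumptions for illustration, not something the post prescribes.

```python
import math

def standard_normal_logpdf(z):
    """Log-density of N(0, 1) at z."""
    return -0.5 * (z * z + math.log(2.0 * math.pi))

def affine_flow_logpdf(x, a, b):
    """Exact log p(x) for x = a * z + b with z ~ N(0, 1) and a != 0.

    Change of variables: log p_x(x) = log p_z(f^{-1}(x)) + log |d f^{-1}/dx|.
    For the affine map the Jacobian term is -log|a|.
    """
    if a == 0:
        raise ValueError("a must be non-zero so the map is invertible")
    z = (x - b) / a                      # invert the flow
    return standard_normal_logpdf(z) - math.log(abs(a))

def gaussian_logpdf(x, mean, std):
    """Reference log-density of N(mean, std^2), used only as a cross-check."""
    return (-0.5 * ((x - mean) / std) ** 2
            - math.log(std) - 0.5 * math.log(2.0 * math.pi))
```

Because an affine map of a Gaussian is again Gaussian, `affine_flow_logpdf(x, a, b)` should agree with `gaussian_logpdf(x, b, abs(a))`, which gives a quick sanity check on the Jacobian term.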

What are Diffusion Models?

TIER 5 Jul 11, 2021

A landmark, heavily-cited survey deriving diffusion-based generative models from scratch — forward/reverse Markov chains, the DDPM variational bound and simplified noise-prediction objective, the connection to score-based (NCSN) modeling and Langevin dynamics, then classifier and classifier-free guidance, DDIM/distillation/consistency speedups, and the GLIDE/unCLIP/Imagen/latent-diffusion model family. It is one of the canonical reference explainers for how modern image generators work, with lasting tutorial value.

diffusion models · generative models · DDPM · score-based · classifier-free guidance
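The forward Markov chain mentioned in this entry has a closed form: x_t can be sampled directly from x_0 as sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps. A minimal sketch follows, assuming the linear beta schedule (1e-4 to 0.02 over 1000 steps) from the DDPM paper; the function names are hypothetical.

```python
import math
import random

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (assumed schedule)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot; t is a 0-indexed step.

    Returns (x_t, eps) so that eps can serve as the noise-prediction target.
    """
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(ab) * x + math.sqrt(1.0 - ab) * e for x, e in zip(x0, eps)]
    return xt, eps

alpha_bars = cumulative_alphas(linear_beta_schedule(1000))
x_t, eps = q_sample([1.0, -2.0], 999, alpha_bars, random.Random(0))
```

At the last step alpha_bar is close to zero, so x_t is almost pure noise; the simplified training objective asks a network to recover `eps` from `x_t` and `t`.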

Diffusion Models for Video Generation

TIER 4 Apr 12, 2024

A technical deep-dive extending diffusion modeling from images to video, covering the harder temporal-consistency and data-scarcity challenges. It works through parameterization and sampling math (v-prediction, DDIM), training video models from scratch, adapting pretrained image models with temporal layers, and training-free approaches. A solid, math-heavy reference for video generative modeling.

diffusion-models · video-generation · generative-models · temporal-consistency

Transformers, Attention & Language Models

3 tier-5 · 1 tier-4

The architectural backbone of modern NLP, told across Weng's most-shared explainers. Start with the attention mechanism (the seq2seq bottleneck through Bahdanau attention to the full Transformer), move to her continuously-updated map of Transformer architecture variants, then to the survey of pretrained language models that defined the pretrain-then-transfer paradigm — with word embeddings as the historical on-ramp. This cluster traces the line from "how does attention work" to "why did large pretrained LMs take over the field," and remains the go-to reference set for the Transformer family.

Learning Word Embedding

TIER 4 Oct 15, 2017

A clear technical survey of word-embedding methods, contrasting count-based (co-occurrence matrix factorization) with context-based predictive models, and detailing word2vec (skip-gram, CBOW), negative sampling, GloVe, and their loss-function designs. A useful explainer of how dense word vectors capture semantic relationships.

word embedding · word2vec · GloVe · skip-gram · NLP

Attention? Attention!

TIER 5 Jun 24, 2018

A landmark explainer on the attention mechanism, starting from the seq2seq bottleneck and Bahdanau additive attention, cataloging the full family of alignment-score functions (dot-product, scaled dot-product, self-attention, soft/hard, global/local), then deriving the Transformer (Q/K/V, multi-head self-attention, encoder-decoder), plus Neural Turing Machines, Pointer Networks, SNAIL, and SAGAN. It remains one of the most widely-shared introductions to attention and the Transformer.

attention · transformer · self-attention · seq2seq · neural turing machine
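Scaled dot-product attention, the alignment score at the core of the Transformer described in this entry, fits in a short dependency-free sketch. It assumes plain Python lists for Q, K, and V and a single head with no masking; the function names are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for list-of-lists inputs.

    Q: n_queries x d_k, K: n_keys x d_k, V: n_keys x d_v.
    Returns (outputs, weights); each weight row sums to 1.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        w = softmax(scores)
        out = [sum(wj * v[c] for wj, v in zip(w, V))
               for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(w)
    return outputs, all_weights
```

When all keys are identical the weights are uniform and the output is simply the mean of the value vectors; a key that aligns better with the query pulls the output toward its value.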

Generalized Language Models

TIER 5 Jan 31, 2019

A widely-cited survey charting the rise of large pretrained language models — from contextual embeddings (CoVe, ELMo, ULMFiT) through the transformer-based GPT/BERT family and its descendants (GPT-2, GPT-3, RoBERTa, ALBERT, XLNet, BART, ELECTRA, T5). A foundational reference explainer of the pretraining-then-transfer paradigm that defined modern NLP, kept updated as the field's flagship models appeared.

language models · BERT · GPT · pretraining · contextual embeddings

The Transformer Family Version 2.0

TIER 5 Jan 27, 2023

A major refactor and expansion (roughly double the original) of her widely-used Transformer-architecture survey, covering attention variants, positional encodings, longer-context and external-memory schemes, recurrence (Universal/Transformer-XL), adaptive computation, and efficient/sparse attention. A go-to reference map of Transformer architecture improvements.

transformers · attention · architecture · positional-encoding · efficient-attention

LLM Agents, Reasoning & Prompt Engineering

3 tier-5 · 0 tier-4

Weng's most influential modern work — three landmark, field-defining surveys on how to get more out of large language models without retraining them. The prompt-engineering survey maps in-context steering (few-shot, CoT, self-consistency, tree-of-thoughts); the LLM-agent post supplied the canonical "controller + Planning + Memory + Tool Use" architecture that became the vocabulary of the entire agent field; and the reasoning deep-dive (co-edited with John Schulman) systematizes test-time compute and why letting models "think longer" works. Together they are arguably the most-cited cluster she has written, and a core reading list for anyone building with LLMs today.

Prompt Engineering

TIER 5 Mar 15, 2023

A foundational, heavily-referenced survey of in-context prompting methods for steering autoregressive LLMs without weight updates: zero/few-shot learning and its biases/calibration, instruction prompting, chain-of-thought, self-consistency, tree-of-thoughts, retrieval-augmented and tool-augmented prompting. Long the canonical practitioner-and-researcher reference for the prompt-engineering landscape.

prompt-engineering · in-context-learning · chain-of-thought · few-shot · llm

LLM Powered Autonomous Agents

TIER 5 Jun 23, 2023

The seminal, extraordinarily widely-cited framework that defined the canonical LLM-agent architecture: LLM as controller plus Planning (task decomposition via CoT/ToT, self-reflection via ReAct/Reflexion), Memory (short- vs long-term with vector stores), and Tool Use (API calling, MRKL, Toolformer). It became the de facto reference diagram and vocabulary for the entire agent field and remains a must-read.

llm-agents · planning · memory · tool-use · autonomous-agents

Why We Think

TIER 5 May 1, 2025

A landmark deep-dive survey (co-edited with John Schulman) on test-time compute and chain-of-thought reasoning: why letting models 'think longer' helps, framed via psychology (System 1/2), computation-as-resource, and latent-variable modeling. It systematically reviews parallel sampling vs sequential revision, RL for reasoning (DeepSeek-R1, o-series, the 'aha moment'), CoT faithfulness and the reward-hacking risks of optimizing CoT, plus thinking in continuous space (recurrent architectures, thinking/pause tokens) and thinking as latent variables (EM and STaR). One of her most-cited reasoning references.

chain-of-thought · test-time-compute · reasoning · reinforcement-learning · cot-faithfulness

Reinforcement Learning — Foundations & Algorithms

2 tier-5 · 5 tier-4

The largest cluster by article count, reflecting Weng's deep RL roots. It spans the full stack: the foundational formalism (MDPs, Bellman equations, value functions, Q-learning), the canonical algorithm lineage from REINFORCE through PPO/SAC/TD3, and the harder open problems, namely exploration under sparse reward, the exploration-exploitation tradeoff in its purest bandit form, curriculum design, meta-RL for fast adaptation, and sim2real transfer for robotics. The two foundational tier-5 surveys ("A (Long) Peek into Reinforcement Learning" and "Policy Gradient Algorithms") are the entry points the rest build on.

A (Long) Peek into Reinforcement Learning

TIER 5 Feb 19, 2018

A thorough foundational overview of reinforcement learning, defining the core formalism (agent, state, action, reward, policy, value functions, MDP, Bellman equations) and then surveying classic solution methods — dynamic programming, Monte Carlo, TD learning, SARSA, Q-learning, policy gradients, and function approximation. It serves as the widely-referenced entry point that her other RL deep-dives build on.

reinforcement learning · MDP · Q-learning · Bellman equations · TD learning
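The Q-learning update listed in this entry is a one-line application of the Bellman optimality equation. The sketch below assumes a tabular setting with a dictionary-backed Q-table; the function name, the state and action labels, and the default step size and discount are illustrative assumptions.

```python
from collections import defaultdict

def q_learning_update(Q, state, action, reward, next_state, actions,
                      alpha=0.5, gamma=0.9, done=False):
    """Apply Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    Q maps (state, action) pairs to value estimates. The max over next
    actions makes this off-policy: the target ignores which action the
    behavior policy will actually take next.
    """
    best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]

Q = defaultdict(float)                    # unseen pairs default to 0.0
Q[("s1", "right")] = 2.0                  # hypothetical existing estimate
new_value = q_learning_update(Q, "s0", "right", 1.0, "s1", ["left", "right"])
```

Here the TD target is 1.0 + 0.9 * 2.0 = 2.8, so the estimate for ("s0", "right") moves halfway from 0.0 toward it.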

The Multi-Armed Bandit Problem and Its Solutions

TIER 4 Jan 23, 2018

A focused technical explainer of the multi-armed bandit problem as the canonical illustration of the exploration-vs-exploitation dilemma, deriving and comparing solution strategies (epsilon-greedy, UCB, Thompson sampling) with regret analysis and a Bernoulli-bandit implementation. A solid, self-contained deep-dive on a fundamental decision-making problem.

multi-armed bandit · exploration-exploitation · Thompson sampling · UCB · regret
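The three strategies compared in this entry can be placed side by side on a Bernoulli bandit. This is a rough sketch rather than the post's own implementation: the helper names, the UCB1 bonus form, the uniform Beta(1, 1) prior, and the arm probabilities are all assumptions chosen for illustration.

```python
import math
import random

def choose_arm(strategy, pulls, wins, t, rng, eps=0.1):
    """Pick an arm index from per-arm pull and win counts (hypothetical helper)."""
    k = len(pulls)
    for arm in range(k):
        if pulls[arm] == 0:              # try every arm once before comparing
            return arm
    if strategy == "epsilon_greedy":
        if rng.random() < eps:           # explore uniformly with probability eps
            return rng.randrange(k)
        return max(range(k), key=lambda a: wins[a] / pulls[a])
    if strategy == "ucb1":
        # empirical mean plus an optimism bonus that shrinks with more pulls
        return max(range(k), key=lambda a: wins[a] / pulls[a]
                   + math.sqrt(2.0 * math.log(t) / pulls[a]))
    if strategy == "thompson":
        # sample a plausible mean from each arm's Beta posterior
        return max(range(k), key=lambda a: rng.betavariate(
            wins[a] + 1, pulls[a] - wins[a] + 1))
    raise ValueError(f"unknown strategy: {strategy}")

def run_bandit(probs, strategy, steps=2000, seed=0):
    """Simulate a Bernoulli bandit; returns (total_reward, pulls_per_arm)."""
    rng = random.Random(seed)
    pulls = [0] * len(probs)
    wins = [0] * len(probs)
    for t in range(1, steps + 1):
        arm = choose_arm(strategy, pulls, wins, t, rng)
        reward = 1 if rng.random() < probs[arm] else 0
        pulls[arm] += 1
        wins[arm] += reward
    return sum(wins), pulls
```

Calling `run_bandit([0.1, 0.5, 0.9], "thompson")` and comparing the pull counts across strategies shows how each one shifts traffic toward the best arm while still spending some pulls on exploration.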

Policy Gradient Algorithms

TIER 5 Apr 8, 2018

A comprehensive, repeatedly-updated reference survey of policy-gradient methods in RL, deriving the policy gradient theorem and then systematically covering the entire algorithm lineage: REINFORCE, actor-critic, A3C/A2C, DPG/DDPG/D4PG, MADDPG, TRPO, PPO, ACER, ACKTR, SAC, TD3, SVPG, IMPALA, and PPG. It is one of the most-cited single-page overviews of modern policy-gradient RL.

policy gradient · reinforcement learning · PPO · actor-critic · TRPO

Domain Randomization for Sim2Real Transfer

TIER 4 May 5, 2019

A survey of domain randomization for closing the simulation-to-reality gap in robotics, contrasting it with system identification and domain adaptation, then covering uniform, guided, and automatic/active randomization plus data-augmentation strategies (AutoAugment family). A useful, well-scoped reference for robot-learning practitioners on training policies robust enough to transfer to physical hardware.

sim2real · domain randomization · robotics · transfer learning · reinforcement learning

Curriculum for Reinforcement Learning

TIER 4 Jan 29, 2020

A survey of curriculum learning for RL — ordering training experience from easy to hard — covering task-specific curricula, teacher-student/automatic curricula, goal-generation (HER, automatic goal GANs), procedural content generation, and curriculum-through-distillation. A solid technical deep-dive on a training-efficiency technique, organized into clean categories with the underlying principle of selecting examples that are neither too easy nor too hard.

reinforcement learning · curriculum learning · automatic curriculum · goal generation · training efficiency

Exploration Strategies in Deep Reinforcement Learning

TIER 4 Jun 7, 2020

A wide survey of exploration in deep RL, framing the hard-exploration and noisy-TV problems and then cataloging approaches: classic methods (epsilon-greedy, UCB, Thompson sampling), intrinsic-reward bonuses (count-based, prediction-error/curiosity, RND, information-gain), memory/episodic methods, Q-value uncertainty, and goal-based exploration. A useful, well-organized reference for anyone working on sparse-reward RL.

reinforcement learning · exploration · intrinsic reward · curiosity · RND

Meta Reinforcement Learning

TIER 4 Jun 23, 2019

A survey of meta-RL — training agents over a distribution of tasks so they adapt quickly to new ones — tracing the idea from Hochreiter's 2001 recurrent meta-learner through RL^2 and Wang et al. (2016), and dissecting the three key components (meta-learning algorithm, meta-learned model with memory, and the task distribution). A solid conceptual deep-dive linking recurrent dynamics to fast adaptation.

meta-learning · reinforcement learning · meta-RL · fast adaptation · RL^2

AI Safety, Alignment & Reward Hacking

2 tier-5 · 3 tier-4

Drawn from Weng's years leading safety work at OpenAI, this cluster traces the alignment problem from early concrete subproblems to its modern, foundational treatments. The reward-hacking survey (Goodhart's Law through RLHF overoptimization, sycophancy, and LLM-as-grader bias) and the extrinsic-hallucination survey (factuality, detection, and "knowing what you don't know") are widely-cited reference works; around them sit careful deep-dives on adversarial attacks/jailbreaks, toxicity reduction, and the human-data pipeline that quietly determines model quality. The set reads as one researcher's evolving map of what can go wrong with capable models and how to measure and mitigate it.

Reducing Toxicity in Language Models

TIER 4 Mar 21, 2021

A structured deep-dive on LLM safety that organizes the toxicity problem into three parts: collecting labeled toxic/safe data (annotation schemes, adversarial human-and-model loops), detecting toxic content, and detoxifying model generation. Matters as an early, careful framing of an AI-safety subproblem from a researcher who later led safety work, surveying taxonomies, benchmarks, and mitigation methods.

AI safety · toxicity · language models · content moderation · detoxification

Adversarial Attacks on LLMs

TIER 4 Oct 25, 2023

A survey of adversarial attacks and jailbreaks against aligned LLMs, written from OpenAI safety experience. It lays out threat models (white/black box, classification vs generation) and walks through five attack categories — token manipulation, gradient-based, jailbreak prompting, human red-teaming, and model red-teaming — plus mitigation directions. A solid security/safety reference for LLM robustness.

adversarial-attacks · jailbreak · llm-security · red-teaming · ai-safety

Thinking about High-Quality Human Data

TIER 4 Feb 5, 2024

An analysis of how to obtain high-quality human-annotated data, the fuel for supervised and RLHF training, organized around the rater-quality pipeline (task design, rater selection/training, aggregation). It surveys wisdom-of-the-crowd, rater-agreement metrics (Cohen's kappa, probabilistic graph models), disagreement-as-signal, and data-cleaning techniques. A useful, less-discussed reference on the 'data work' behind model quality.

data-quality · human-annotation · rlhf · crowdsourcing · rater-agreement
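Cohen's kappa, one of the rater-agreement metrics this entry mentions, corrects raw agreement for the agreement two raters would reach by chance: kappa = (p_o - p_e) / (1 - p_e). A small sketch for two raters follows; the function name and the toy labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items.

    p_o is the observed agreement rate; p_e is the agreement expected if
    each rater labeled independently with their own label frequencies.
    Returns NaN when p_e == 1 (both raters used one identical label), since
    kappa is undefined there.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if p_e == 1.0:
        return float("nan")
    return (p_o - p_e) / (1.0 - p_e)

rater_1 = ["y", "y", "n", "n"]            # toy annotations
rater_2 = ["y", "n", "n", "n"]
kappa = cohens_kappa(rater_1, rater_2)
```

In the toy case the raters agree on 3 of 4 items (p_o = 0.75) while chance agreement is 0.5, so kappa comes out at 0.5, noticeably lower than the raw agreement rate suggests.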

Extrinsic Hallucinations in LLMs

TIER 5 Jul 7, 2024

A landmark survey narrowing 'hallucination' to extrinsic cases where output is ungrounded by world knowledge, then mapping causes (pretraining data, fine-tuning on new knowledge), detection (FactualityPrompt, FActScore, SelfCheckGPT, retrieval-augmented and sampling-based methods), and anti-hallucination techniques (RAG, factuality-tuning, citation, sampling). Widely cited as the reference treatment of LLM factuality and 'knowing what you don't know'.

hallucination · factuality · llm-evaluation · retrieval-augmentation · ai-safety

Reward Hacking in Reinforcement Learning

TIER 5 Nov 28, 2024

A comprehensive reference survey of reward hacking: when an RL agent exploits flaws in the reward function to win reward without doing the intended task. It builds from reward-shaping theory and Goodhart's Law through extensive catalogs of hacking in RL, LLM, and real-world settings, then dissects RLHF-specific failure modes (RM overoptimization scaling laws, U-Sophistry, sycophancy, LLM-as-grader positional/self bias, in-context reward hacking) and surveys mitigations. A foundational alignment-and-safety reference.

reward-hacking · rlhf · ai-safety · goodharts-law · alignment

Meta-Learning & Learning with Not Enough Data

1 tier-5 · 3 tier-4

How models learn efficiently when labels are scarce. The anchor is Weng's landmark meta-learning survey, which organizes "learning to learn" into metric-, model-, and optimization-based approaches (MAML, Prototypical Networks, Reptile) under the unifying "train as you test" principle. Around it sits her three-part "Learning with not Enough Data" series, which frames the four responses to label scarcity and then surveys semi-supervised learning, active learning (budget-aware labeling), and synthetic data generation in turn. Read together, they form a complete toolkit for the low-data regime.

Meta-Learning: Learning to Learn Fast

TIER 5 Nov 30, 2018

A landmark, heavily-cited survey of meta-learning ('learning to learn') that organizes the field into three coherent approaches — metric-based (Siamese nets, Matching Networks, Prototypical/Relation Networks), model-based (Memory-Augmented NNs, MetaNet), and optimization-based (LSTM meta-learner, MAML, FOMAML, Reptile). It matters as a canonical reference map of few-shot learning, deriving the math behind each model class and the unifying 'train as you test' principle.

meta-learning · few-shot learning · MAML · metric learning · memory-augmented NN

Learning with not Enough Data Part 1: Semi-Supervised Learning

TIER 4 Dec 5, 2021

Part 1 of the low-data series, framing the four approaches to label scarcity (pretrain+finetune, semi-supervised, active learning, auto-generation) and then surveying semi-supervised learning in depth: consistency regularization, pseudo-labeling, and combined methods (Pi-model, Mean Teacher, MixMatch, FixMatch, etc.) via the labeled+unlabeled loss framework. A useful reference for SSL methods.

semi-supervised-learning · consistency-regularization · pseudo-labeling · low-resource · fixmatch

Learning with not Enough Data Part 2: Active Learning

TIER 4 Feb 20, 2022

Part 2 of the low-data series, surveying active learning: how to choose which unlabeled samples are most worth labeling under a budget. It covers acquisition strategies (uncertainty sampling, query-by-committee, diversity/density-based, measuring training effects) and deep-learning-era methods. A solid explainer/reference on budget-aware labeling.

active-learning · data-labeling · uncertainty-sampling · low-resource · acquisition-functions

Learning with not Enough Data Part 3: Data Generation

TIER 4 Apr 15, 2022

Part 3 of the low-data-regime series, covering synthetic data generation: data augmentation for images and text (AutoAugment, RandAugment, back-translation, mixup, etc.) and generating wholly new labeled data via pretrained language models and few-shot prompting. A practical reference for combating label scarcity through generated data.

data-generation · data-augmentation · synthetic-data · low-resource · few-shot

Self-Supervised, Contrastive & Multimodal Learning

1 tier-5 · 2 tier-4

Learning useful representations without labels, and extending them across modalities. The contrastive-learning survey is the widely-cited map of the self-supervised landscape (InfoNCE, SimCLR, BYOL, MoCo, CLIP, SimCSE) and the loss functions and ingredients — augmentation, large batches, hard-negative mining — that make it work. It pairs with the earlier broad survey of self-supervised pretext tasks across images, video, and control, and with the survey of vision-language models that extend pretrained LMs to consume visual signals. The throughline: build strong representations from unlabeled data, then bridge modalities.

Self-Supervised Representation Learning

TIER 4 Nov 10, 2019

A broad survey of self-supervised pretext tasks across images, video, and control — distortion/rotation/colorization/jigsaw/context prediction, temporal/contrastive video tasks, contrastive predictive coding, and bisimulation for RL. A well-organized reference on learning representations from unlabeled data, complementary to the later dedicated contrastive-learning post that absorbed its momentum-contrast section.

self-supervised · representation learning · pretext tasks · CPC · computer vision

Contrastive Representation Learning

TIER 5 May 31, 2021

A comprehensive reference survey of contrastive learning objectives (contrastive/triplet/lifted-structured/N-pair losses, NCE, InfoNCE, soft-nearest-neighbors) and the key ingredients — heavy augmentation, large batches, hard-negative mining/debiasing — followed by the major vision and language methods (SimCLR, Barlow Twins, BYOL, MoCo, CLIP, SimCSE). Widely cited as the go-to map of the self-supervised contrastive landscape, useful both as theory and as an implementation guide.

contrastive learning · self-supervised · InfoNCE · SimCLR · representation learning
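The InfoNCE objective named in this entry treats contrastive learning as classifying the positive among a set of negatives. The sketch below computes the loss for a single anchor, assuming cosine similarity and a temperature of 0.1 (a common choice in SimCLR-style setups, used here as an assumption); the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """-log( exp(s_pos / tau) / sum_j exp(s_j / tau) ) for one anchor.

    The sum in the denominator runs over the positive and all negatives,
    so the loss is a softmax cross-entropy with the positive as the label.
    """
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)                       # log-sum-exp with max subtraction
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[0]

loss_aligned = info_nce_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0], [-1.0, 0.0]])
loss_confused = info_nce_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
```

The loss is near zero when the positive is the closest vector to the anchor and grows when a negative is closer, which is the pressure that hard-negative mining intensifies.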

Generalized Visual Language Models

TIER 4 Jun 9, 2022

A survey of how to extend pretrained language models to consume visual signals, grouping vision-language models into four families: joint image-text token training, learned image prefixes for frozen LMs, cross-attention fusion, and training-free combinations. It walks through representative models (VisualBERT, Frozen, Flamingo, etc.). A useful reference on the design space for multimodal LLMs.

vision-language-models · multimodal · image-captioning · cross-attention · frozen-lm

Scaling Models — Distributed Training & Inference

1 tier-5 · 1 tier-4

The systems side of large models: how to train them across many GPUs and then serve them cheaply. The distributed-training survey (later co-published on the OpenAI blog with Greg Brockman) is the canonical map of large-scale training infrastructure — data/model/pipeline/tensor parallelism, mixture-of-experts, ZeRO sharding, activation checkpointing, mixed precision, offloading. Its counterpart on inference optimization covers the deployment half: KV cache, quantization, pruning, distillation, and smart batching. Together they bookend the scaling lifecycle from training to production.

How to Train Really Large Models on Many GPUs?

TIER 5 Sep 24, 2021

A widely-referenced systems survey of the parallelism and memory-saving techniques needed to train very large neural networks across many GPUs: data parallelism (BSP/ASP, gradient accumulation), model/pipeline/tensor parallelism, mixture-of-experts, ZeRO/sharding, activation checkpointing, mixed precision, and CPU offloading. The canonical map of large-scale training infrastructure (later co-published on the OpenAI blog with Greg Brockman).

distributed-training · model-parallelism · data-parallelism · mixture-of-experts · zero

Large Transformer Model Inference Optimization

TIER 4 Jan 10, 2023

A technical survey of techniques to make large Transformer inference cheaper and faster despite its memory footprint (KV cache, quadratic attention) and low parallelizability. It covers parallelism, memory offloading, smart batching, network compression (pruning, quantization, distillation), and architecture-specific attention/sparsity improvements. A solid systems-oriented reference for inference efficiency.

inference-optimization · transformers · quantization · pruning · kv-cache

Computer Vision — Detection, NAS & Interpretability

0 tier-5 · 4 tier-4

Weng's computer-vision and AutoML deep-dives, mostly from her earlier years. The object-detection pieces cover the two dominant architecture families — the two-stage R-CNN lineage (region proposals, RoI pooling, Mask R-CNN) and the one-stage real-time detectors (YOLO, SSD, RetinaNet with focal loss) — and the speed/accuracy tradeoff between them. Alongside sit the neural-architecture-search survey (search space, search algorithm, evaluation strategy, from RL-driven toward differentiable DARTS) and a practically-grounded review of model interpretability motivated by regulated domains. Solid applied references, narrower in reach than the marquee LLM and generative surveys.

How to Explain the Prediction of a Machine Learning Model?

TIER 4 Aug 1, 2017

A survey of model interpretability covering both intrinsically interpretable models with model-specific interpretation and post-hoc methods for explaining black boxes (e.g., LIME-style approaches), motivated by real regulatory needs in finance, medicine, and justice. A solid, practically-grounded review of explainable AI from her time at Affirm.

interpretability · explainable AI · model explanation · black-box models · LIME

Object Detection for Dummies Part 3: R-CNN Family

TIER 4 Dec 31, 2017

Part 3 of the object-detection series, a detailed walkthrough of the R-CNN lineage — R-CNN, Fast R-CNN, Faster R-CNN, and Mask R-CNN — covering region proposals, RoI pooling, bounding-box regression, and the speed/accuracy improvements across versions. A useful, well-illustrated explainer of the region-based detection family.

object detection · R-CNN · Faster R-CNN · Mask R-CNN · computer vision

Object Detection Part 4: Fast Detection Models

TIER 4 Dec 27, 2018

Part 4 of the object-detection series, covering one-stage (proposal-free) detectors — the YOLO family, SSD, and RetinaNet with focal loss — and how they trade a little accuracy for real-time speed versus the two-stage R-CNN family. A clear, well-illustrated technical explainer of the dense-detection architectures, useful as a computer-vision reference though narrower than the marquee LLM/generative surveys.

object detection · YOLO · SSD · RetinaNet · computer vision
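Focal loss, which this entry credits to RetinaNet, down-weights easy examples so that the many background anchors in a dense detector do not swamp training: FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t). A minimal per-example sketch for the binary case follows, using the commonly cited defaults gamma = 2 and alpha = 0.25 as assumptions; the function name is hypothetical.

```python
import math

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """Binary focal loss for one prediction.

    p is the predicted probability of the positive class (0 < p < 1) and
    y is the true label (1 or 0). With gamma = 0 the modulating factor
    disappears and this reduces to alpha-weighted cross-entropy.
    """
    p_t = p if y == 1 else 1.0 - p        # probability assigned to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

easy = focal_loss(0.9, 1)                 # confident and correct
hard = focal_loss(0.1, 1)                 # confident and wrong
```

The (1 - p_t)^gamma factor shrinks the easy example's loss by a factor of 100 relative to cross-entropy while leaving the hard example almost untouched, which is the rebalancing effect the method relies on.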

Neural Architecture Search

TIER 4 Aug 6, 2020

A clear survey of neural architecture search decomposed into its three components — search space, search algorithm (RL controllers, evolution, gradient-based DARTS), and evaluation strategy (weight sharing, performance prediction, one-shot models) — covering the path from expensive RL-driven search toward cheaper differentiable methods. A solid technical explainer of an AutoML subfield, though less central than the LLM/diffusion surveys.

neural architecture search · AutoML · DARTS · weight sharing · search space

Deep Learning Theory & Generalization

0 tier-5 · 3 tier-4

The "why does it work" cluster — Weng's more theoretical essays on the puzzles of deep learning. Why do massively over-parameterized networks generalize instead of overfitting (Occam/MDL, double descent, intrinsic dimension, the lottery ticket hypothesis)? What does information theory say about training dynamics (the Information Bottleneck fit-then-compress story)? And what does the infinite-width limit reveal (Neural Tangent Kernel convergence)? These are curated tours of debated, foundational theory rather than implementation guides — for the reader who wants to understand the principles under the empirical success.

Anatomize Deep Learning with Information Theory

TIER 4 Sep 28, 2017

A summary of Naftali Tishby's Information Bottleneck theory of deep learning, explaining the claim that DNN training has two distinct phases — fitting (representing the input, minimizing generalization error) then compression (forgetting irrelevant detail) — and proposing a new IB-based learning bound. A substantive conceptual deep-dive into a notable (and debated) theory of why deep nets generalize.

information theory · information bottleneck · deep learning theory · generalization · mutual information

Are Deep Neural Networks Dramatically Overfitted?

TIER 4 Mar 14, 2019

A thoughtful essay on the puzzle of why heavily over-parameterized deep networks generalize, working through classic theory (Occam's razor, MDL, expressive power and universal approximation), generalization-and-memorization experiments, the bias-variance/double-descent intuition, intrinsic-dimension and heterogeneous-layer-robustness findings, and the lottery ticket hypothesis. Valuable as a curated tour of deep-learning generalization theory for the curious reader.

generalization · overfitting · deep learning theory · lottery ticket · MDL

Some Math behind Neural Tangent Kernel

TIER 4 Sep 8, 2022

A focused, math-intensive deep-dive into Neural Tangent Kernel theory, explaining why infinitely-wide networks converge to a global minimum under gradient descent and behave like a fixed kernel during training. Unlike her usual broad surveys, this one works carefully through the core derivations and convergence proof. A clear theoretical reference for the NTK framework.

neural-tangent-kernel · deep-learning-theory · optimization · infinite-width · convergence

Text Generation, Retrieval & NLP

0 tier-5 · 2 tier-4

How to steer and ground language generation — the applied-NLP work that predates and frames much of the later prompting and RAG literature. The controllable-generation survey organizes steerability along three axes (decoding strategies, prompt design, fine-tuning/steerable layers) and reads as an early map of techniques that the alignment and prompt-engineering posts would later absorb. The open-domain QA survey covers retrieval-augmented architectures end to end (closed-book parametric knowledge vs open-book retriever-reader/generator pipelines: DPR, REALM, RAG, Fusion-in-Decoder), a foundational reference for grounded systems.

How to Build an Open-Domain Question Answering System?

TIER 4 Oct 29, 2020

A detailed walkthrough of open-domain QA architectures spanning closed-book (parametric knowledge in the LM) and open-book retriever-reader and retriever-generator pipelines (TF-IDF/BM25, DPR, ORQA, REALM, RAG, Fusion-in-Decoder). Valuable as a foundational reference for retrieval-augmented systems, including the data-leakage caveat about train/test question overlap that affects how such systems are evaluated.

question answering · retrieval · DPR · RAG · reading comprehension

Controllable Neural Text Generation

TIER 4 Jan 2, 2021

A thorough survey of how to steer an unconditioned language model along three axes — decoding strategies (greedy/beam/top-k/nucleus/penalized and guided decoding), prompt design (including early prompt/P-tuning), and fine-tuning or steerable-layer methods (CTRL, PPLM, GeDi, RL/RLHF-style, unlikelihood training). A useful, well-organized reference for steerability that predates and frames much of the later prompt-engineering and alignment literature.

text generation · decoding · controllable generation · prompting · fine-tuning
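Of the decoding strategies this entry lists, nucleus (top-p) sampling is easy to show concretely: keep the smallest set of highest-probability tokens whose cumulative mass reaches p, renormalize, and sample from that set. The sketch below assumes the next-token distribution is already given as a list of probabilities indexed by token id; the function names and toy distribution are hypothetical.

```python
import random

def top_p_filter(probs, p=0.9):
    """Return {token_id: renormalized_prob} for the nucleus of mass >= p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= p:               # smallest prefix reaching the threshold
            break
    total = sum(probs[i] for i in kept)
    return {i: probs[i] / total for i in kept}

def sample_top_p(probs, p, rng):
    """Draw one token id from the renormalized nucleus."""
    nucleus = top_p_filter(probs, p)
    tokens = list(nucleus)
    weights = [nucleus[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

toy_probs = [0.5, 0.3, 0.15, 0.05]        # toy next-token distribution
nucleus = top_p_filter(toy_probs, p=0.9)
token = sample_top_p(toy_probs, 0.9, random.Random(0))
```

With p = 0.9 the lowest-probability token is cut from the toy distribution, so the long tail can never be sampled; top-k differs in that it fixes the number of kept tokens rather than the kept probability mass.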

Field Overviews & Optimization

0 tier-5 · 2 tier-4

Two pieces that sit slightly apart from the topical clusters: a broad on-ramp to the field and an alternative to gradient-based optimization. The deep-learning overview (adapted from a meetup talk) tours the classical model families — CNNs, RNN/LSTM, seq2seq, autoencoders, RL, GANs — with intuition and canonical pointers, serving as an accessible entry point. The evolution-strategies explainer makes the case for gradient-free, black-box optimization (CMA-ES, natural evolution strategies, OpenAI's parallelizable ES) as a viable alternative to SGD, especially in deep RL. Useful bookends: one for breadth, one for an underused optimization family.

An Overview of Deep Learning for Curious People

TIER 4 Jun 21, 2017

A broad introductory survey (adapted from a meetup talk) explaining why deep learning took off when it did (more data plus more compute) and then touring the classical model families: CNNs, RNN/LSTM, sequence-to-sequence, autoencoders, reinforcement learning via AlphaGo, and GANs, with intuition and canonical citations for each. It functions as an accessible on-ramp to the field with good pointers (Goodfellow's book, Hinton's course, Karpathy/colah blogs) rather than a technical deep-dive: well-organized, with broad coverage but introductory depth.

deep learning overview · CNN · RNN/LSTM · GAN · reinforcement learning

Evolution Strategies

TIER 4 Sep 5, 2019

A focused explainer on evolution strategies as a black-box, gradient-free alternative to SGD — from simple Gaussian ES through CMA-ES and natural evolution strategies to OpenAI's parallelizable ES for deep RL. A useful technical deep-dive on an underused optimization family, clear on the math and the practical trade-offs versus backprop-based methods.

evolution strategies · black-box optimization · CMA-ES · reinforcement learning · gradient-free