Lil'Log — Reading Room

Generative Models — Diffusion, VAEs, GANs & Flows

4 tier-5 · 1 tier-4

This is the most concentrated cluster of landmark surveys in the archive — Weng's from-scratch derivations of the major generative-model families. Read together they form a unified tour of likelihood-based and adversarial generation: GANs (implicit, adversarial), VAEs (variational latent-variable), normalizing flows (exact tractable density), and diffusion (the family that came to dominate image and video generation). She consistently builds the math up from first principles — the GAN minimax game and its Jensen-Shannon connection, the VAE ELBO and reparameterization trick, the change-of-variable theorem for flows, the DDPM variational bound and its link to score-based modeling — making this the canonical reference set for understanding how modern generators actually work.

From GAN to WGAN

TIER 5 Aug 20, 2017

GAN training fails because JS divergence breaks whenever real and generated distributions don't overlap — nearly guaranteed since both concentrate on low-dimensional manifolds in high-dimensional space.

The minimax objective min_G max_D L(D,G) = E[log D(x)] + E[log(1−D(G(z)))] equals 2·D_JS(p_r ‖ p_g) − 2log2 at the optimal discriminator D*(x) = p_r/(p_r+p_g). Disjoint manifolds let a perfect discriminator separate real from fake completely, collapsing gradient norms by five orders of magnitude and stalling the generator; a weak discriminator gives meaningless signal. Two further failures: concurrent G and D updates don't converge to Nash equilibrium (the xy/−xy toy shows unbounded oscillation), and mode collapse where G outputs a single point that fools D while ignoring data diversity.

Seven stabilization heuristics help: feature matching (match activation statistics rather than raw outputs), minibatch discrimination (let D penalize within-batch homogeneity), historical averaging (penalize large parameter swings), one-sided label smoothing (0.9/0.1 instead of 1/0), virtual batch normalization (normalize against a fixed reference batch), adding noise to D's inputs, and replacing the divergence measure entirely.

WGAN takes the last route using Wasserstein (Earth Mover's) distance W(p_r, p_g) = inf_γ E[‖x−y‖]. On disjoint lines separated by θ: KL = ∞, JS = log 2, but W = |θ| — smooth and gradient-friendly. Via Kantorovich-Rubinstein duality this becomes max_{‖f‖_L≤K} E_{p_r}[f(x)] − E_{p_g}[f(x)], requiring K-Lipschitz f. The critic learns f_w with weights clipped to [−c, c] after each update; RMSProp is preferred over Adam. Weight clipping is crude — gradient penalty (Gulrajani et al. 2017) is the cleaner successor.
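The disjoint-support comparison can be checked numerically. The sketch below simplifies the two parallel lines to two point masses at 0 and θ (an assumption made to keep the arithmetic exact); the helper names are illustrative, not taken from the post.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions stored as dicts; inf where q has no mass."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue
        qx = q.get(x, 0.0)
        if qx == 0:
            return math.inf
        total += px * math.log(px / qx)
    return total

def js(p, q):
    """Jensen-Shannon divergence via the mixture m = (p + q) / 2."""
    keys = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in keys}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def w1_point_masses(a, b):
    """Wasserstein-1 between two unit point masses on the real line is their distance."""
    return abs(a - b)

for theta in (0.5, 1.0, 4.0):
    p = {0.0: 1.0}     # "real" distribution: all mass at 0
    q = {theta: 1.0}   # "generated" distribution: all mass at theta
    print(theta, kl(p, q), js(p, q), w1_point_masses(0.0, theta))
```

KL stays infinite and JS stays pinned at log 2 however far apart the masses sit, while W tracks θ, which is the property that gives the generator a usable gradient.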

GAN · WGAN · Wasserstein distance · generative models · training stability

From Autoencoder to Beta-VAE

TIER 5 Aug 12, 2018

Autoencoders compress input through a bottleneck and reconstruct it, minimizing MSE loss $L = \frac{1}{n}\sum(\mathbf{x} - f_\theta(g_\phi(\mathbf{x})))^2$. Three regularized variants address overfitting: Denoising AE (Vincent et al., 2008) corrupts input stochastically and trains reconstruction of the clean original, forcing the network to capture inter-dimensional structure. Sparse AE penalizes average neuron activation $\hat{\rho}_j$ away from a target $\rho \approx 0.05$ via KL divergence; the k-sparse variant enforces this by zeroing all but the top-k bottleneck activations and backpropagating only through those units. Contractive AE (Rifai et al., 2011) penalizes the Frobenius norm of the encoder Jacobian $\|J_f(\mathbf{x})\|_F^2$, pushing representations toward low-dimensional manifolds while staying invariant to input perturbations.

VAE (Kingma & Welling, 2014) maps inputs to distributions rather than fixed codes. The encoder outputs $q_\phi(\mathbf{z}|\mathbf{x})$ approximating the intractable posterior, and the ELBO loss balances reconstruction likelihood $\mathbb{E}[\log p_\theta(\mathbf{x}|\mathbf{z})]$ against $D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \| p_\theta(\mathbf{z}))$. The reparameterization trick $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ makes sampling differentiable.
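A minimal sketch of the two pieces that make the ELBO trainable, assuming a diagonal-Gaussian encoder and a standard normal prior (the prior choice is an assumption here; the text only writes $p_\theta(\mathbf{z})$). The function names are illustrative.

```python
import math
import random

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, 1); all randomness lives in eps."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(sigma^2)) || N(0, I) )."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, log_var))

rng = random.Random(0)
mu, log_var = [2.0, -1.0], [math.log(0.25), 0.0]
print(reparameterize(mu, log_var, rng))
print(kl_to_standard_normal(mu, log_var))
```

Because the sample is a deterministic function of μ and σ once ε is drawn, gradients can flow through it to the encoder parameters.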

$\beta$-VAE (Higgins et al., 2017) scales the KL term by $\beta > 1$, tightening the bottleneck to encourage disentangled latent factors at some cost to reconstruction fidelity. VQ-VAE (van den Oord et al., 2017) uses discrete codebook lookup via nearest-neighbor quantization; gradients bypass the non-differentiable argmin via straight-through copying, and codebook vectors update by EMA. VQ-VAE-2 adds a two-level hierarchy separating local texture from global structure, then learns a PixelSNAIL prior over the discrete codes. TD-VAE extends VAEs to sequences using RNN belief states and a jumpy-prediction ELBO across distant timesteps $t_1 < t_2$.

VAE · autoencoders · beta-VAE · VQ-VAE · ELBO

Flow-based Deep Generative Models

TIER 5 Oct 13, 2018

Flow-based generative models solve what GANs and VAEs sidestep: they directly learn the exact probability density p(x) via a chain of invertible transformations, making the training loss simply negative log-likelihood with no adversarial objective or ELBO approximation.

The mechanism is the change-of-variables theorem. For a mapping x = f(z) from a known base distribution π(z), the log-density telescopes as log p(x) = log π(z₀) − Σ log|det(dfᵢ/dzᵢ₋₁)|. Each transformation fᵢ must be (1) easily invertible and (2) have a tractable Jacobian determinant.

Three coupling-layer models satisfy both constraints. NICE (Dinh 2015) uses additive coupling: the first d dimensions pass through unchanged while the remaining D−d dimensions shift by a function of those d. RealNVP (Dinh 2017) adds a scale term — y_{d+1:D} = x_{d+1:D} ⊙ exp(s(x_{1:d})) + t(x_{1:d}) — producing a lower-triangular Jacobian with determinant exp(Σ sⱼ). Since inversion never requires inverting s or t, both can be arbitrary neural networks. Glow (Kingma & Dhariwal 2018) replaces the fixed channel-reversal permutation with invertible 1×1 convolutions (log-determinant = h·w·log|det W|) and adds per-channel actnorm for single-sample stability.
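The RealNVP affine coupling layer can be sketched in a few lines. Here `scale_net` and `shift_net` are hypothetical stand-ins for the neural networks s and t; any function of the untouched half works, which is the point of the construction.

```python
import math

def scale_net(x_a):
    """Stand-in for s(.): arbitrary function of the first d dimensions."""
    return [math.tanh(x_a[0] - x_a[1]), 0.5 * math.sin(x_a[0])]

def shift_net(x_a):
    """Stand-in for t(.): arbitrary function of the first d dimensions."""
    return [x_a[0] * x_a[1], x_a[0] + 1.0]

def coupling_forward(x, d):
    x_a, x_b = x[:d], x[d:]
    s, t = scale_net(x_a), shift_net(x_a)
    y_b = [xb * math.exp(si) + ti for xb, si, ti in zip(x_b, s, t)]
    log_det = sum(s)  # triangular Jacobian: log|det J| = sum_j s_j
    return x_a + y_b, log_det

def coupling_inverse(y, d):
    y_a, y_b = y[:d], y[d:]
    # s and t are re-evaluated on the unchanged half, never inverted
    s, t = scale_net(y_a), shift_net(y_a)
    x_b = [(yb - ti) * math.exp(-si) for yb, si, ti in zip(y_b, s, t)]
    return y_a + x_b

x = [0.3, -1.2, 2.0, 0.7]
y, log_det = coupling_forward(x, d=2)
print(y, log_det, coupling_inverse(y, d=2))
```

Stacking such layers while alternating which half passes through unchanged lets every dimension be transformed eventually.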

Autoregressive flows condition each dimension on all previous ones. MAF uses a MADE-masked network: density estimation is one forward pass (fast) but sampling is sequential (slow). IAF reverses the dependency — scale and shift are functions of z_{1:i-1} — making sampling one parallel pass while density estimation becomes sequential. The two are mathematical duals. Flows can also enrich VAE posteriors by replacing the Gaussian approximate posterior with a normalizing flow chain.

normalizing flows · generative models · RealNVP · Glow · autoregressive flow

What are Diffusion Models?

TIER 5 Jul 11, 2021

Diffusion models generate data by learning to reverse a gradual noising process, sidestepping GAN instability and VAE surrogate losses while remaining analytically tractable and flexible.

The forward process adds Gaussian noise over T steps via schedule {β_t}. A closed-form shortcut samples any timestep directly: x_t = √ᾱ_t · x_0 + √(1−ᾱ_t) · ε, where α_t = 1 − β_t and ᾱ_t = ∏_{i≤t} α_i. The reverse process learns p_θ(x_{t−1}|x_t) by minimizing a VLB decomposed into per-step KL terms. Ho et al. (DDPM, 2020) reparameterize so the network predicts noise ε, simplifying to L_simple = E[‖ε_t − ε_θ(x_t,t)‖²]. Nichol & Dhariwal (2021) add a cosine noise schedule and learn variance Σ_θ via L_hybrid = L_simple + λ·L_VLB. Song & Ermon's NCSN trains a score estimator s_θ ≈ ∇_x log q(x) across multiple noise levels; the score equals −ε_θ/√(1−ᾱ_t), linking both frameworks.
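A minimal sketch of the closed-form forward sample, with α_t = 1 − β_t and ᾱ_t the running product of the α_i. The linear schedule from 1e-4 to 0.02 over T = 1000 steps is an assumption (it matches the commonly quoted DDPM setting); function names are illustrative.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Assumed linear noise schedule beta_1..beta_T."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alpha_bars(betas):
    """Cumulative products of alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t given x_0 in one shot; returns (x_t, eps)."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    a, s = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [a * x + s * e for x, e in zip(x0, eps)], eps

def score_from_eps(eps, t, abar):
    """Score of q(x_t | x_0) expressed through the noise: -eps / sqrt(1 - abar_t)."""
    return [-e / math.sqrt(1.0 - abar[t]) for e in eps]

abar = alpha_bars(linear_betas(1000))
x_t, eps = q_sample([1.0, -2.0, 0.5], 500, abar, random.Random(0))
print(abar[0], abar[500], abar[-1])
print(x_t, score_from_eps(eps, 500, abar))
```

By the last step ᾱ_T is close to zero, so x_T is almost pure Gaussian noise regardless of x_0.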

Classifier guidance steers sampling with the gradient of a classifier f_φ(y|x_t) trained on noisy inputs: the noise prediction becomes ε̄_θ = ε_θ(x_t,t) − √(1−ᾱ_t)·w·∇_{x_t} log f_φ(y|x_t). Classifier-free guidance (Ho & Salimans, 2021) trains one network with randomly dropped conditioning and combines at inference: ε̄_θ = (w+1)·ε_θ(x_t,t,y) − w·ε_θ(x_t,t), eliminating the external classifier. GLIDE found this outperforms CLIP guidance.
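The classifier-free combination is a single extrapolation step. In this sketch the two noise predictions are plain lists standing in for network outputs (an assumption for illustration).

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """(w + 1) * eps(x_t, t, y) - w * eps(x_t, t), elementwise."""
    return [(w + 1.0) * c - w * u for c, u in zip(eps_cond, eps_uncond)]

eps_cond = [1.0, -0.5]    # hypothetical conditional prediction
eps_uncond = [0.5, 0.0]   # hypothetical unconditional prediction
print(cfg_combine(eps_cond, eps_uncond, w=0.0))  # w = 0 recovers the conditional output
print(cfg_combine(eps_cond, eps_uncond, w=1.0))  # pushed further away from the unconditional one
```

Rewriting the formula as ε_cond + w·(ε_cond − ε_uncond) makes the geometry clear: larger w moves further along the direction that conditioning adds.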

DDPM's main cost is slow sampling (≈20 hours for 50k 32×32 images). DDIM (Song et al., 2020) makes the reverse process deterministic (σ_t=0), producing high-quality samples in far fewer steps and enabling latent-space interpolation. Progressive distillation (Salimans & Ho, 2022) halves steps per iteration. Consistency models (Song et al., 2023) map any noisy x_t directly to x_0, trained by distillation (CD) or standalone (CT) with LPIPS distance and Heun ODE solvers.

LDM runs diffusion in a compressed autoencoder latent space, cutting compute while preserving semantic content; cross-attention injects conditioning via domain-specific encoders. Cascaded pipelines chain models at increasing resolutions with noise conditioning augmentation. Backbone choices are U-Net or Diffusion Transformer (DiT), which patchifies latents and uses adaLN-Zero conditioning, scaling efficiently with compute.

diffusion models · generative models · DDPM · score-based · classifier-free guidance

Diffusion Models for Video Generation

TIER 4 Apr 12, 2024

Video diffusion takes two paths: training 3D models from scratch, or inflating pretrained image models with temporal layers.

From scratch. VDM factorizes a 3D U-Net over space and time — spatial convolutions become 1×3×3 and temporal attention with relative position embeddings follows each spatial block. The v-prediction parameterization (v = α_t ε − σ_t x) avoids color shift versus ε-prediction. Conditioning one clip on another uses reconstruction guidance — a gradient term toward the conditioning clip. Imagen Video cascades 7 models (base + 3 TSR + 3 SSR) to 1280×768 at 24 fps; progressive distillation reduces each to 8 steps. Sora uses a DiT over spacetime patch tokens.
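A small check of the v-prediction identity, assuming a variance-preserving schedule with α_t² + σ_t² = 1 and z_t = α_t x + σ_t ε (both assumptions; the text only states the definition of v). Under those assumptions the clean sample and the noise are both recoverable from z_t and v.

```python
import math

def v_target(x, eps, alpha, sigma):
    """v = alpha * eps - sigma * x, elementwise."""
    return [alpha * e - sigma * xi for xi, e in zip(x, eps)]

def recover_x_eps(z, v, alpha, sigma):
    """Invert the parameterization: x = alpha*z - sigma*v, eps = sigma*z + alpha*v."""
    x = [alpha * zi - sigma * vi for zi, vi in zip(z, v)]
    eps = [sigma * zi + alpha * vi for zi, vi in zip(z, v)]
    return x, eps

alpha, sigma = math.cos(0.7), math.sin(0.7)   # satisfies alpha^2 + sigma^2 = 1
x, eps = [1.0, -0.5], [0.3, 2.0]
z = [alpha * xi + sigma * e for xi, e in zip(x, eps)]
v = v_target(x, eps, alpha, sigma)
print(recover_x_eps(z, v, alpha, sigma))
```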

Inflating image models. Make-A-Video inserts pseudo-3D layers (2D spatial then 1D temporal, identity-initialized) and fine-tunes only those on unlabeled video, inheriting text-image priors without text-video pairs. Video LDM freezes backbone weights and trains only inserted temporal layers, then patches the decoder with a 3D-conv temporal discriminator to prevent flickering. Stable Video Diffusion fine-tunes the whole model in three stages; dataset curation via optical flow, OCR text density, and CLIP aesthetics lets a smaller curated set beat a larger raw one. Lumiere's STUNet generates the full video in one pass by jointly downsampling time and space, eliminating TSR.

Training-free. Text2Video-Zero warps a first-frame latent by δ_k = λ(k−1)δ to seed subsequent frames and replaces self-attention with cross-frame attention to frame 1. ControlVideo adds full cross-frame attention, an interleaved-frame smoother, and a hierarchical keyframe sampler for long videos.

diffusion-models · video-generation · generative-models · temporal-consistency

Transformers, Attention & Language Models

3 tier-5 · 1 tier-4

The architectural backbone of modern NLP, told across Weng's most-shared explainers. Start with the attention mechanism (the seq2seq bottleneck through Bahdanau attention to the full Transformer), move to her continuously-updated map of Transformer architecture variants, then to the survey of pretrained language models that defined the pretrain-then-transfer paradigm — with word embeddings as the historical on-ramp. This cluster traces the line from "how does attention work" to "why did large pretrained LMs take over the field," and remains the go-to reference set for the Transformer family.

Learning Word Embedding

TIER 4 Oct 15, 2017

Words appearing in similar contexts carry similar meanings — dense low-dimensional vectors encode this, enabling arithmetic like vector("cat") − vector("kitten") ≈ vector("dog") − vector("puppy").

Two families exploit context. Count-based methods factorize a global co-occurrence matrix (PCA, topic models). Context-based methods train predictive networks and learn embeddings as parameters. The two architectures are skip-gram (Mikolov et al., 2013 — predict context words from a target) and CBOW (predict the target from averaged context vectors); both use embedding matrix W (V×N) and context matrix W′ (N×V).

The bottleneck is the softmax denominator, which scans all V words per update. Hierarchical softmax (Morin & Bengio, 2005) replaces flat output with a Huffman tree; each prediction becomes a chain of binary sigmoid decisions, cutting cost to O(log V). NCE (Gutmann & Hyvärinen, 2010) reframes prediction as binary classification of the true word against N noise samples, approximating the partition function as Z≈1 and drawing noise from a log-uniform (Zipfian) distribution. Negative sampling (NEG, Mikolov 2013) simplifies NCE by using plain sigmoids, trading distributional fidelity for embedding quality.
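A minimal sketch of the negative-sampling loss for one (target, context) pair, written with plain lists as vectors. The noise vectors would come from a noise distribution over the vocabulary; here they are passed in directly, and the function names are illustrative.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def neg_loss(target_vec, context_vec, noise_vecs):
    """-log sigma(v_ctx . v_tgt) - sum over noise words of log sigma(-v_noise . v_tgt)."""
    loss = -log_sigmoid(dot(target_vec, context_vec))
    for noise_vec in noise_vecs:
        loss -= log_sigmoid(-dot(target_vec, noise_vec))
    return loss

target = [0.5, -0.2]
context = [0.4, 0.1]
noise = [[-0.3, 0.8], [0.1, 0.1]]
print(neg_loss(target, context, noise))
```

Each update touches only the true context word and the sampled noise words, so the cost no longer depends on the vocabulary size V.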

Practical improvements: soft sliding windows (random size up to s_max, downweighting distant words), subsampling that discards each occurrence of a frequent word w with probability 1−√(t/f(w)), and pre-identifying phrases via the bigram score (C(wi wj) − δ)/(C(wi)·C(wj)), where the discount δ suppresses phrases built from rare words.

GloVe (Pennington et al., 2014) bridges both families: meaning lives in co-occurrence ratios, yielding wᵢᵀw̃ₖ = log C(wᵢ,w̃ₖ) − log C(wᵢ), fit by weighted least squares with the saturating weight f(c) = (c/c_max)^α for c < c_max and 1 otherwise.

word embedding · word2vec · GloVe · skip-gram · NLP

Attention? Attention!

TIER 5 Jun 24, 2018

Seq2seq models fail on long inputs because the encoder compresses an entire source sequence into a single fixed-length context vector — early information is lost before decoding starts. Bahdanau (2015) attention fixes this: each decoder step gets a fresh context vector $\mathbf{c}_t = \sum_i \alpha_{t,i} \mathbf{h}_i$, where alignment weights are softmax-normalized scores from a small feed-forward network trained jointly with the model.

Several alignment score functions are catalogued: additive/concat (Bahdanau), dot-product and bilinear "general" variants (Luong 2015), and scaled dot-product $\mathbf{s}^\top \mathbf{h}/\sqrt{n}$ (Vaswani 2017; the scaling keeps large dot products from pushing the softmax into regions with extremely small gradients when the dimension $n$ is high). Xu 2015 draws the soft/hard distinction: soft attention is differentiable and attends all positions; hard attention picks one patch and requires REINFORCE. Luong's local attention blends both: predict an aligned center and attend over a window around it, staying differentiable while cheaper than global attention.

Self-attention applies the same scores within a single sequence, giving direct position-to-position correlation. The Transformer (Vaswani 2017) builds entirely on multi-head self-attention: $h$ parallel scaled dot-product heads projecting Q/K/V through learned matrices, concatenated and linearly mixed, stacked six layers with residual connections and sinusoidal positional encoding — no recurrence.
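A single scaled dot-product head can be sketched without any tensor library. Matrices are lists of rows, and the learned Q/K/V projections are omitted (an assumption to keep the example short).

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V, one output row per query row."""
    d_k = len(K[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

Q = [[1.0, 1.0]]
K = [[1.0, 0.0], [1.0, 0.0]]   # identical keys, so weights are uniform
V = [[0.0, 2.0], [4.0, 6.0]]
print(attention(Q, K, V))
```

Multi-head attention runs several such heads on different learned projections of the same input and concatenates their outputs.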

Neural Turing Machines (Graves 2014) use attention to address a finite external memory via content-based cosine similarity, interpolation with prior weights, and a convolutional location shift. Pointer Networks (Vinyals 2015) redirect attention to select input positions rather than blend them, handling variable-output combinatorial tasks. SAGAN (Zhang 2018) applies self-attention in a GAN with a scale parameter $\gamma$ initialized at zero, so the generator learns local structure before expanding to long-range spatial dependencies.

attention · transformer · self-attention · seq2seq · neural turing machine

Generalized Language Models

TIER 5 Jan 31, 2019

Unsupervised pre-training followed by fine-tuning replaced static word vectors as the dominant NLP paradigm, each generation eliminating a limitation of the last.

CoVe (2017) extracted context vectors from a biLSTM NMT encoder concatenated with GloVe, limited by the supervised translation corpus. ELMo (2018) pre-trained a two-layer biLM unsupervised and combined layers via a task-specific weighted sum; lower layers capture syntax, upper layers semantics. Both required custom task architectures. CVT unified them semi-supervisedly: auxiliary predictors seeing only forward or backward states must match the full-context primary prediction on unlabeled data, forcing the encoder to distill bidirectional context.

ULMFiT introduced a three-stage pipeline — Wikipedia pretrain, domain LM fine-tuning, classifier fine-tuning — stabilized with per-layer learning rates and gradual unfreezing.

GPT switched to a Transformer decoder, fine-tuning with one new weight matrix per task and LM loss as auxiliary. BERT added bidirectionality via Masked LM (15% of tokens: 80% masked, 10% random, 10% unchanged) and Next Sentence Prediction. ALBERT compressed BERT 18× through factorized embeddings and cross-layer weight sharing, replacing NSP with Sentence-Order Prediction, requiring coherence rather than topic detection. RoBERTa showed BERT was undertrained: dropping NSP, dynamic masking over 40 epochs, and longer sequences each helped.
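The Masked LM corruption rule can be sketched as follows. Selecting each position independently with probability 0.15 is a simplifying assumption (an implementation may instead pick a fixed number of positions per sequence); the function name and toy vocabulary are illustrative.

```python
import random

def mask_tokens(tokens, vocab, rng, select_prob=0.15, mask_token="[MASK]"):
    """Return (corrupted inputs, labels); labels are None at unselected positions."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue                          # position not chosen for prediction
        labels[i] = tok                       # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_token            # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.choice(vocab)     # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return inputs, labels

rng = random.Random(0)
vocab = ["the", "cat", "sat", "on", "mat"]
print(mask_tokens(["the", "cat", "sat", "on", "the", "mat"] * 3, vocab, rng))
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on seeing `[MASK]` to know which positions matter, which narrows the gap between pre-training and fine-tuning inputs.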

GPT-2 (1.5B) eliminated fine-tuning via zero-shot conditioning: summarization with "TL;DR:", translation with in-context pairs. GPT-3 (175B) extended this to few-shot prompting, matching fine-tuned BERT on many benchmarks. XLNet captured bidirectional context autoregressively via permutation language modeling with two-stream attention: the query stream sees position only; the content stream sees both token and position. BART paired a BERT-like encoder with a GPT-like autoregressive decoder trained to reconstruct corrupted text; text infilling and sentence shuffling were strongest. ELECTRA reframed pre-training as replaced token detection — a small generator corrupts tokens, the discriminator classifies every position — covering all tokens rather than 15% and sharply improving compute efficiency.

language models · BERT · GPT · pretraining · contextual embeddings

The Transformer Family Version 2.0

TIER 5 Jan 27, 2023

Every major axis of Transformer improvement — positional encoding, long context, adaptive computation, efficient attention — yields a distinct variant family with precise tradeoffs.

Positional encoding. Transformer-XL decomposes each attention score into four terms using separate content/location key matrices W_E^k, W_R^k and global biases u, v. RoPE multiplies Q and K by position-indexed rotation matrices so q_i·k_j depends only on i−j. ALiBi adds a distance-proportional penalty to each attention score, with a head-specific slope drawn from a geometric sequence; a 1.3B model trained at context length 1024 extrapolates to 2048.
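Two of these schemes are easy to verify by hand. The sketch below uses a single 2D rotation block with one fixed frequency for RoPE (an assumption; real models use many blocks with different frequencies) and the power-of-two slope recipe for ALiBi. Function names are illustrative.

```python
import math

def rotate(vec, pos, theta=0.1):
    """Rotate a 2D vector by the angle pos * theta."""
    x, y = vec
    c, s = math.cos(pos * theta), math.sin(pos * theta)
    return (x * c - y * s, x * s + y * c)

def rope_score(q, k, i, j):
    """Dot product of the query rotated to position i and the key rotated to position j."""
    qi, kj = rotate(q, i), rotate(k, j)
    return qi[0] * kj[0] + qi[1] * kj[1]

def alibi_slopes(n_heads):
    """Geometric sequence 2^(-8/n), 2^(-16/n), ...; assumes n_heads is a power of two."""
    return [2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)]

def alibi_bias(i, j, slope):
    """Penalty added to the score of query position i attending to key position j <= i."""
    return -slope * (i - j)

q, k = (1.0, 2.0), (0.5, -1.0)
print(rope_score(q, k, 3, 1), rope_score(q, k, 10, 8))  # same offset, same score
print(alibi_slopes(8))
print(alibi_bias(5, 3, alibi_slopes(8)[0]))
```

Because neither scheme ties a score to an absolute index, both are natural candidates for running at sequence lengths not seen in training.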

Long context. Transformer-XL caches prior-segment hidden states (stop-gradient) as extended keys and values. Compressive Transformer adds a compressed-memory FIFO — mapping L oldest activations to L/c slots via 1D conv, trained with auto-encoding and attention-reconstruction losses — extending range to (m_m + c·m_cm)×N. kNN-LM interpolates the base LM with a FAISS datastore: p(y|x) = λ p_kNN + (1−λ) p_LM. Memorizing Transformer inserts a kNN-augmented layer near the top; 8k-token memory matches a 5× larger vanilla model.

Adaptive computation. Depth-Adaptive Transformer attaches an exit classifier per layer; CALM uses the top-two-logit gap as signal and picks the threshold via Learn-then-Test. Adaptive Attention Span learns per-head soft masks m_z(x) = clip((R+z−x)/R, 0, 1); lower layers converge to shorter spans.

Efficient attention. Reformer uses LSH (h(x) = argmax([xR; −xR])) to restrict queries to the same bucket — O(L log L) — and reversible residual layers (Y₁=X₁+Attn(X₂), Y₂=X₂+FF(Y₁)) to skip N activation copies. Linformer projects keys and values to k×d (k≪L) for linear complexity. RFA approximates softmax with random feature maps φ, reducing causal attention to a running O(d) state. Longformer and Big Bird combine local sliding-window with globally attending tokens.

RL. Gated Transformer-XL stabilizes training with GRU-style gating initialized near identity and layer norm off the residual path. Decision Transformer frames off-policy RL as sequence modeling over (return-to-go, state, action) triplets conditioned on desired return.

transformers · attention · architecture · positional-encoding · efficient-attention

LLM Agents, Reasoning & Prompt Engineering

3 tier-5 · 0 tier-4

Weng's most influential modern work — three landmark, field-defining surveys on how to get more out of large language models without retraining them. The prompt-engineering survey maps in-context steering (few-shot, CoT, self-consistency, tree-of-thoughts); the LLM-agent post supplied the canonical "controller + Planning + Memory + Tool Use" architecture that became the vocabulary of the entire agent field; and the reasoning deep-dive (co-edited with John Schulman) systematizes test-time compute and why letting models "think longer" works. Together they are arguably the most-cited cluster she has written, and a core reading list for anyone building with LLMs today.

Prompt Engineering

TIER 5 Mar 15, 2023

How you phrase a request to a language model — without changing its weights — dramatically shifts what it produces, and the variance is large enough to matter in practice.

Zero-shot prompting simply poses the task; few-shot adds labeled examples. Example quality dominates: Zhao et al. identify majority-label bias, recency bias, and common-token bias as the main variance sources, and propose calibrating output probabilities against a null input to correct for them. Selection can be systematized — k-NN retrieval in embedding space, graph-based diversity scoring that penalizes already-covered neighbors, or contrastive-learning embeddings scored by $P_\text{LM}(y \mid e_i, x)$.

Chain-of-thought (CoT) prompting elicits step-by-step reasoning chains before the final answer and helps most on complex tasks with models above ~50B parameters. Zero-shot CoT uses "Let's think step by step"; few-shot CoT supplies worked examples. Self-consistency sampling runs multiple completions at temperature > 0 and takes the majority vote. STaR bootstraps rationales without human annotation by iteratively fine-tuning on only the chains that produced correct answers. Tree of Thoughts generalizes CoT into a branching search (BFS or DFS) with per-node evaluation.
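Self-consistency reduces to a vote over final answers. In this sketch the sampled answers are a hard-coded list standing in for completions drawn at temperature > 0 (an assumption; no model is called).

```python
from collections import Counter

def self_consistency(final_answers):
    """Return (majority answer, vote count) over normalized final answers."""
    counts = Counter(answer.strip() for answer in final_answers)
    return counts.most_common(1)[0]

# Hypothetical final answers extracted from five sampled reasoning chains.
sampled = ["42", "41", "42 ", "42", "7"]
print(self_consistency(sampled))
```

Only the final answers are compared, so chains that reason differently but agree on the result reinforce each other.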

Automatic Prompt Engineer (APE) treats instructions as a search problem: generate candidates from input-output pairs, score by execution accuracy or log-probability, then refine via Monte Carlo variants.

Augmented LMs extend prompting beyond text: retrieval augmentation with TF-IDF passage ranking and RAG/noisy-channel/PoE reranking; PAL and PoT, which offload arithmetic to a Python interpreter; and Toolformer, which self-supervisedly learns to insert API calls (calculator, search, QA, translation, calendar) by keeping only those that reduce future-token loss.

prompt-engineering · in-context-learning · chain-of-thought · few-shot · llm

LLM Powered Autonomous Agents

TIER 5 Jun 23, 2023

LLM-powered agents decompose into three components — planning, memory, and tool use — each with distinct methods and failure modes.

Planning. Chain of Thought (CoT) prompts step-by-step decomposition at inference time. Tree of Thoughts branches into a BFS or DFS search over multiple reasoning paths per step, with states scored by classifier or majority vote. LLM+P translates goals into PDDL, delegates to a classical planner, and translates back — effective in robotics but domain-restricted. For self-correction, ReAct interleaves Thought/Action/Observation loops and outperforms act-only baselines on HotpotQA and AlfWorld. Reflexion adds a heuristic detecting inefficient trajectories or hallucination (repeated identical actions → same observation) and resets the episode; reflections accumulate in working memory (up to three). Chain of Hindsight fine-tunes on ranked (output, feedback) sequences so the model learns to follow the improving trend. Algorithm Distillation concatenates multi-episode RL histories into context, training on the improvement process itself and approximating RL² without online training.

Memory. Short-term maps to in-context learning (bounded by the context window); long-term to external vector stores with approximate nearest-neighbor retrieval. Main ANN options: LSH (hash buckets), ANNOY (random projection trees), HNSW (hierarchical small-world graphs), FAISS (cluster-then-refine quantization), and ScaNN (anisotropic quantization preserving inner products over Euclidean distance).
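For reference, this is the exact search that the ANN indexes approximate: a brute-force top-k scan by inner product over stored embeddings. The memory entries and vectors are made up for illustration.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def retrieve(query, memory, k=2):
    """Exact maximum inner-product search over (text, embedding) pairs."""
    ranked = sorted(memory, key=lambda item: dot(query, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

# Hypothetical long-term memory of (note, embedding) pairs.
memory = [
    ("note A", [1.0, 0.0]),
    ("note B", [0.0, 1.0]),
    ("note C", [0.7, 0.7]),
]
print(retrieve([1.0, 0.2], memory, k=2))
```

The scan costs time linear in the number of stored items per query, which is what motivates trading a little recall for speed with the indexes listed above.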

Tool use. MRKL routes queries to specialist neural or symbolic modules. Toolformer self-annotates training data with API calls that improve outputs. HuggingGPT dispatches subtasks to HuggingFace models via ChatGPT across four stages: parse → select → execute → summarize. ChemCrow adds 13 chemistry tools; human experts found it outperforms raw GPT-4 on chemical correctness even though LLM self-evaluation rated them equal — a blind spot when models grade their own domain work.

Three structural limits: finite context constrains history and reflection depth; long-horizon planning breaks under unexpected errors; natural-language interfaces are unreliable enough that much agent code is output parsing.

llm-agents · planning · memory · tool-use · autonomous-agents

Why We Think

TIER 5 May 1, 2025

Generating more tokens before answering improves LLM performance because CoT allocates compute that won't fit in a single forward pass. Reasoning traces are latent variables: maximize $\log p(y|x) = \log \mathbb{E}_{z} p(y|z,x)$, the marginal likelihood of the correct answer over CoT samples.

Two orthogonal strategies extract value. Parallel sampling generates $N$ rollouts selected by majority vote or a verifier; beam search guided by a process reward model (PRM) directs compute toward promising partial sequences. Sequential revision iterates on the model's own output, but naive self-correction degrades accuracy — external feedback (unit tests, a stronger model) is required. SCoRe addresses this via two-stage RL: stage 1 KL-penalizes the first attempt to prevent collapse to trivial edits, stage 2 optimizes both turns.

RL with outcome rewards produces the largest gains. DeepSeek-R1: cold-start SFT, reasoning-only RL with format and accuracy rewards, rejection-sampling SFT blending reasoning and general data, then a final RL pass. Pure RL without SFT yields emergent "aha moments." Failed attempts: PRMs proved too vulnerable to reward hacking; MCTS collapsed under the token-level branching factor.

CoT faithfulness is partial. Reasoning models acknowledge misleading prompts more reliably than base models — correct-answer optimization is less sycophantic than preference RLHF. Applying RL pressure to CoT monitoring backfires: models hide reward hacking in opaque traces rather than stop it. Rewarding shorter correct CoTs caused padding with repetition instead of reasoning.

In continuous space, recurrent-depth models (Geiping et al.) run a recurrent block $r$ times per forward pass, $r$ from a log-normal Poisson, BPTT truncated to $k=8$ steps. Pause tokens insert contentless computation slots at both train and inference time.

Test-time compute does not substitute for pretraining: it closes gaps only when inference tokens are far fewer than pretraining tokens. Budget forcing (appending "wait" tokens) correlates positively with accuracy; rejection-sampling for length control reverses it.

chain-of-thought · test-time-compute · reasoning · reinforcement-learning · cot-faithfulness

Reinforcement Learning — Foundations & Algorithms

2 tier-5 · 5 tier-4

The largest cluster by article count, reflecting Weng's deep RL roots. It spans the full stack: the foundational formalism (MDPs, Bellman equations, value functions, Q-learning), the canonical algorithm lineage from REINFORCE through PPO/SAC/TD3, and the harder open problems — exploration under sparse reward, the exploration-exploitation tradeoff in its purest bandit form, curriculum design, meta-RL for fast adaptation, and sim2real transfer for robotics. The two foundational tier-5 surveys ("A Long Peek into RL" and "Policy Gradient Algorithms") are the entry points the rest build on.

A (Long) Peek into Reinforcement Learning

TIER 5 Feb 19, 2018

Reinforcement learning frames sequential decision-making as an MDP $\langle S, A, P, R, \gamma \rangle$ where a policy π maximizes discounted return $G_t = \sum_k \gamma^k R_{t+k+1}$. State-value $V_\pi(s) = \mathbb{E}[G_t|S_t=s]$ and action-value $Q_\pi(s,a)$ are the core quantities; advantage $A(s,a) = Q(s,a)-V(s)$ isolates action quality above the state baseline. Bellman equations decompose value recursively — $V(s) = \mathbb{E}[R_{t+1}+\gamma V(S_{t+1})]$ — and the optimality version substitutes max for the expectation.

With a full model, Generalized Policy Iteration alternates evaluation and greedy improvement, converging to $\pi_*$. Without a model: Monte Carlo averages complete-episode returns (episodes must terminate); TD learning bootstraps each step toward $R_{t+1}+\gamma V(S_{t+1})$, enabling mid-episode updates. SARSA (on-policy TD) updates Q using the next action taken; Q-learning (off-policy, Watkins 1992) uses $\max_{a'} Q(S_{t+1},a')$, decoupling behavior from target. TD(λ) blends all n-step returns with λ-decayed weights: $G_t^\lambda = (1-\lambda)\sum_n \lambda^{n-1} G_t^{(n)}$.
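The on-policy versus off-policy distinction shows up in a single line of the update. A minimal tabular sketch, with Q stored as a dict of dicts and a made-up two-state table:

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    """Off-policy: bootstrap from the greedy action in s_next."""
    target = r + gamma * max(Q[s_next].values())
    Q[s][a] += alpha * (target - Q[s][a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap from the action actually taken in s_next."""
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

Q = {0: {"L": 0.0, "R": 0.0}, 1: {"L": 1.0, "R": 2.0}}
q_learning_update(Q, s=0, a="R", r=1.0, s_next=1, alpha=0.5, gamma=0.9)
sarsa_update(Q, s=0, a="L", r=1.0, s_next=1, a_next="L", alpha=0.5, gamma=0.9)
print(Q[0])
```

With the same reward and next state, Q-learning backs up the value of the best next action (2.0) while SARSA backs up the value of the action the behavior policy chose (1.0).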

DQN stabilizes Q-learning under neural function approximation with experience replay (random mini-batches break correlations) and a periodically frozen target network $\theta^-$, minimizing $\mathbb{E}[(r+\gamma\max_{a'} Q(s',a';\theta^-)-Q(s,a;\theta))^2]$.

Policy gradient methods parameterize $\pi(a|s;\theta)$ directly and optimize via $\nabla\mathcal{J}(\theta) = \mathbb{E}[\nabla\ln\pi \cdot Q_\pi]$. REINFORCE uses full-episode MC returns with a value baseline to reduce variance. Actor-Critic adds a learned critic for per-step signals. A3C parallelizes across workers asynchronously accumulating gradients against shared global parameters. Evolution Strategies drop value functions and backprop: perturb θ with Gaussian noise, rank by fitness, update via $\frac{1}{\sigma}\mathbb{E}[F(\theta+\sigma\epsilon)\epsilon]$ — trivially parallelizable and invariant to delayed rewards.

Two structural problems recur: exploration-exploitation (ε-greedy for TD, noise for ES) and the deadly triad — off-policy + nonlinear approximation + bootstrapping can diverge; DQN's replay and frozen target partially tame this. AlphaGo Zero synthesizes the framework: a ResNet outputs $(p,v)$ from board state, MCTS refines the move distribution during self-play, and the loss $(z-v)^2-\pi^\top\log p+c\|\theta\|^2$ jointly trains both heads with no human game data.

reinforcement learning · MDP · Q-learning · Bellman equations · TD learning

The Multi-Armed Bandit Problem and Its Solutions

TIER 4 Jan 23, 2018

Exploration beats pure exploitation because uncertainty has information value. The Bernoulli bandit frames this: K arms with unknown reward probabilities θ_i; minimize cumulative regret L_T = E[Σ(θ* − Q(a_t))]. ε-greedy takes the current best arm with probability 1−ε and explores uniformly at random otherwise. UCB1 is smarter: select argmax[Q̂(a) + √(2 log t / N_t(a))], a bound derived from Hoeffding's inequality — under-sampled arms get inflated upper bounds and thus earn more pulls. Bayesian UCB replaces the Hoeffding bound with a prior-informed interval (2σ under Gaussian). Thompson Sampling maintains a Beta(α, β) posterior per arm, samples a reward estimate, picks the argmax, then updates α or β on the outcome — naturally concentrating pulls on the best arm as evidence accumulates.
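A minimal sketch of the UCB1 selection rule and a Thompson Sampling loop on a Bernoulli bandit. The arm probabilities, the Beta(1, 1) starting prior, and the function names are assumptions for illustration.

```python
import math
import random

def ucb1_select(counts, values, t):
    """Pick argmax of Q_hat(a) + sqrt(2 log t / N_t(a)); untried arms go first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(counts)),
               key=lambda a: values[a] + math.sqrt(2.0 * math.log(t) / counts[a]))

def thompson_select(alphas, betas, rng):
    """Sample one reward estimate per arm from its Beta posterior, take the argmax."""
    samples = [rng.betavariate(a, b) for a, b in zip(alphas, betas)]
    return max(range(len(samples)), key=samples.__getitem__)

def run_thompson(thetas, steps, rng):
    """Play a Bernoulli bandit; returns the number of pulls per arm."""
    k = len(thetas)
    alphas, betas, pulls = [1.0] * k, [1.0] * k, [0] * k
    for _ in range(steps):
        a = thompson_select(alphas, betas, rng)
        reward = 1 if rng.random() < thetas[a] else 0
        alphas[a] += reward          # success count
        betas[a] += 1 - reward       # failure count
        pulls[a] += 1
    return pulls

print(ucb1_select(counts=[10, 1], values=[0.5, 0.4], t=11))
print(run_thompson([0.2, 0.5, 0.8], steps=2000, rng=random.Random(0)))
```

In the UCB1 call the second arm wins despite its lower estimate because it has been pulled only once, so its confidence bonus dominates.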

multi-armed bandit · exploration-exploitation · Thompson sampling · UCB · regret

Policy Gradient Algorithms

TIER 5 Apr 8, 2018

Direct policy optimization is necessary in continuous action spaces where argmax over actions is intractable. The policy gradient theorem gives ∇J(θ) ∝ E_π[Q^π(s,a) ∇log π_θ(a|s)], avoiding differentiation of the state distribution. Subtracting state-value as baseline yields advantage A = Q − V, cutting variance without bias.

REINFORCE estimates G_t from full Monte-Carlo trajectories in place of Q^π. Actor-critic replaces that with a learned critic and online TD updates. Off-policy variants reweight by π/β; dropping ∇Q^π still guarantees improvement. A3C runs parallel actors asynchronously updating shared parameters; A2C synchronizes them, working better on GPUs with large batches.

DPG uses a deterministic policy μ(s), giving ∇J = E[∇_a Q · ∇_θ μ] and eliminating importance sampling over actions. DDPG combines DPG with experience replay and soft target updates (τ ≪ 1). D4PG adds a distributional critic Z_w, N-step TD targets, K parallel actors, and prioritized replay. MADDPG adds centralized critics over all agents' actions with decentralized actors; policy ensembles reduce variance.

TRPO enforces E[D_KL(π_old ∥ π)] ≤ δ per update, guaranteeing monotonic improvement but requiring conjugate-gradient line searches. PPO clips the policy ratio r = π/π_old to [1−ε, 1+ε]: J^CLIP = E[min(r·Â, clip(r, 1−ε, 1+ε)·Â)]. PPG separates policy and value into alternating training phases, avoiding conflicting gradients from shared parameters.
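
The clipped surrogate is easy to misread, so here is a small sketch of J^CLIP as a batch average. The function name and the default ε = 0.2 are assumptions for illustration; the ratios and advantage estimates are taken as given inputs.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for r, adv in zip(ratios, advantages):
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        # The min makes the objective pessimistic: gains from moving the
        # ratio outside [1 - eps, 1 + eps] are ignored, losses are kept.
        total += min(r * adv, clipped * adv)
    return total / len(ratios)
```

With a positive advantage, a ratio of 1.5 is credited only as 1.2; with a negative advantage, the same ratio keeps its full penalty of -1.5, so the clip never hides a bad update.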

ACER makes A3C off-policy via Retrace Q-estimation (IS weights capped at c), truncated policy gradient with bias correction, and a running-average TRPO constraint. SAC maximizes E[r + α H(π)] over policy, soft-Q, and soft-V; α can be learned via Lagrangian to enforce H(π_t) ≥ H_0. TD3 fixes Q-overestimation in DDPG with clipped double-Q, delayed policy updates, and target policy smoothing with clipped noise. IMPALA decouples actors from the learner and corrects policy lag with V-trace: n-step targets with truncated IS weights ρ_i = min(ρ̄, π/μ) and trace coefficients c_j = min(c̄, π/μ).

policy gradient · reinforcement learning · PPO · actor-critic · TRPO

Domain Randomization for Sim2Real Transfer

TIER 4 May 5, 2019

Randomizing simulator parameters broadly enough that reality is just one sample in the training distribution lets policies trained in sim transfer to real robots. The objective maximizes expected reward averaged over randomization configs ξ ∈ Ξ, modeling sim-real discrepancies as source-domain variability.

Uniform DR samples each parameter from a fixed interval, randomizing visual properties (lighting, texture, camera pose) and physical dynamics (mass, damping, friction, action delay). A recurrent — not feedforward — policy is required; LSTM memory lets it infer the current environment from trajectory history, functioning as meta-learning. OpenAI's dexterous hand rotation task demonstrated this at scale.

Guided DR replaces uniform sampling with smarter signals. AutoAugment and "Learning to Simulate" optimize the randomization distribution via RL reward on task performance; Yu et al. use CMA-ES with real-environment fitness. SimOpt minimizes sim-vs-real trajectory discrepancy. RCAN trains a cGAN to map randomized or real images to a canonical sim view. Active Domain Randomization (ADR) uses a discriminator on randomized-vs-reference sim rollouts as the reward signal for Stein Variational Policy Gradient, with no real data required.

sim2real · domain randomization · robotics · transfer learning · reinforcement learning

Curriculum for Reinforcement Learning

TIER 4 Jan 29, 2020

Training RL agents on progressively harder tasks speeds convergence and can prevent forgetting, but a poorly designed curriculum can actively impede learning. Five families of curriculum design address this.

Task-specific curricula manually rank difficulty. Zaremba & Sutskever (2014) showed for LSTM program execution that mixing sequential ordering with random full-range sampling always beats either alone, because reintroducing easy tasks prevents forgetting. Difficulty can be quantified via transfer: Weinshall et al. (2018) rank samples by a pretrained classifier's margin. OpenAI's Rubik's Cube used Automatic Domain Randomization (ADR), growing a distribution over environment parameters rather than a fixed sequence.

Teacher-guided curricula treat task selection as a bandit problem. Graves et al. (2017) use two teacher reward signals: loss-driven progress (gradient update magnitude) or complexity-driven progress (weight posterior KL, motivated by MDL). TSCL (Matiisen 2017) frames the teacher as a POMDP rewarded by total student score deltas, using ε-greedy or Thompson sampling. For continuous spaces, ALP-GMM (Portelas 2019) fits a Gaussian mixture to absolute learning progress — reward differences between neighboring tasks — and samples proportionally.

Asymmetric self-play (Sukhbaatar 2017): Alice transforms s₀→sₜ; Bob earns reward returning to s₀. Alice's reward is max(0, tB−tA), incentivizing tasks just beyond Bob's reach but guaranteed solvable.

Goal GAN (Florensa 2018) generates goals in the intermediate-difficulty band GOID = {g : R_min ≤ R^g(π) ≤ R_max} using an LSGAN. Racaniere & Lampinen (2019) extend this with Setter-Solver-Judge, adding validity, feasibility, and coverage losses to the generator.

Distillation-based methods: Progressive networks (Rusu 2016) freeze earlier columns and let new columns read them via lateral connections U_i^(k:j), preventing forgetting. Mix & Match (Czarnecki 2018) trains a mixture π_mm = Σ αᵢπᵢ, annealing α_K from 0→1 with a KL distillation loss keeping complex policies aligned with simpler ones early.

reinforcement learning · curriculum learning · automatic curriculum · goal generation · training efficiency

Exploration Strategies in Deep Reinforcement Learning

TIER 4 Jun 7, 2020

Sparse rewards and stochastic distractors make exploration the harder half of RL. Two failure modes define the field: hard-exploration (Montezuma's Revenge) and the noisy-TV problem, where an agent rewarded for novelty fixates on unpredictable noise.

The dominant family augments reward as r_t = r^e_t + β·r^i_t. Count-based methods set r^i ∝ N(s)^{-1/2} but fail in continuous spaces. Bellemare et al. (2016) derive a pseudo-count N̂ = ρ(1−ρ')/(ρ'−ρ) from a density model (CTS, later PixelCNN). Tang et al. (2017) hash states with SimHash φ(s) = sgn(Ag(s)) and count hash codes directly.
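
A quick sanity check on the pseudo-count formula: if the density model is just the empirical visit frequency, the pseudo-count should recover the true count. The sketch below assumes that toy density model; the helper names and the β and ε constants in the bonus are hypothetical.

```python
import math


def pseudo_count(rho, rho_prime):
    """N_hat = rho * (1 - rho') / (rho' - rho).

    rho is the model density of s before observing it once more,
    rho_prime the density right after that extra observation.
    """
    if rho_prime <= rho:
        raise ValueError("expected rho' > rho after one more observation")
    return rho * (1.0 - rho_prime) / (rho_prime - rho)


def empirical_densities(n_visits, n_total):
    """Toy density model (assumption): plain visit frequencies."""
    rho = n_visits / n_total
    rho_prime = (n_visits + 1) / (n_total + 1)
    return rho, rho_prime


def count_bonus(n_hat, beta=0.1, eps=0.01):
    """Intrinsic reward proportional to N^(-1/2); beta, eps are assumed."""
    return beta / math.sqrt(n_hat + eps)
```

For a state visited 3 times out of 10, `pseudo_count(*empirical_densities(3, 10))` evaluates to 3 up to floating-point error, which is the property that lets a learned density model stand in for a visit table.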

Prediction-error bonuses use forward-dynamics error as curiosity. ICM (Pathak 2017) learns the encoding via an inverse dynamics model g: (φ(s_t), φ(s_{t+1})) → a_t, so φ discards uncontrollable environment factors. A large-scale comparison found fixed random features competitive; IDF features transfer better across levels. VIME reformulates the bonus as KL between successive BNN weight posteriors, approximated via Fisher information. Exploration via Disagreement uses variance across an ensemble of five forward models as a differentiable bonus. RND bypasses dynamics: bonus = MSE between a trainable and a fixed random network; non-episodic returns and normalization are critical, and RND consistently clears more than half the rooms in Montezuma's Revenge.

Memory-based methods address non-stationarity. NGU multiplies a per-episode inverse-kernel distance signal by an RND lifelong bonus, discouraging revisits fast within episodes and slowly across them. Agent57 adds a population of (β_j, γ_j) policy pairs selected by a UCB meta-controller, achieving above-human scores on all 57 Atari games. Episodic Curiosity uses a siamese network to predict step-counts between states; unreachable states earn bonuses, bypassing noisy-TV.

Go-Explore archives promising discretized states and returns to them before exploring forward; the policy-based variant replaces deterministic reset with a goal-conditioned policy trained via self-imitation learning. Bootstrapped DQN and VIC/VALOR offer Q-uncertainty and skill-diversity angles on the same problem.

reinforcement learning · exploration · intrinsic reward · curiosity · RND

Meta Reinforcement Learning

TIER 4 Jun 23, 2019

Training an agent over a distribution of MDPs so its recurrent hidden state implements a new RL algorithm at test time — without weight updates — is the core idea of Meta-RL (Wang et al. 2016 / RL² Duan et al. 2017). The policy takes $(a_{t-1}, r_{t-1}, s_t)$ as input; an LSTM accumulates trajectory history, its hidden state resets between tasks, and outer-loop gradient descent updates weights. Three components are necessary: memory, a meta-learning algorithm, and a task distribution encoding inductive biases.

On the algorithm side, Meta-gradient RL (Xu et al. 2018) treats discount $\gamma$ and bootstrapping $\lambda$ as meta-parameters $\eta$, updating them online by differentiating a held-out TD($\lambda$) loss through the policy update step while dropping the second-order term. EPG (Houthooft et al. 2018) parameterizes the loss function itself as a temporal convolution $L_\phi$ over experience; $\phi$ is evolved via NES across a population, annealing from a PPO surrogate to the learned loss. MAESN meta-learns structured exploration by conditioning the policy on a per-task latent $z_i \sim \mathcal{N}(\mu_i, \sigma_i)$ fixed per episode, with a KL term toward a unit Gaussian.

Episodic control addresses sample inefficiency through explicit memory. MFEC stores $(s,a) \to Q$ with kNN value estimation. NEC uses a Differentiable Neural Dictionary (DND) keyed on CNN embeddings for soft attention over Q-values. Episodic LSTM adds a reinstatement gate injecting retrieved cell state $\mathbf{c}_\text{ep}$ from a DND directly into the LSTM cell update.

For task acquisition, POET co-evolves bipedal-walker environments and agents via mutation, per-pair optimization, and cross-environment transfer. DIAYN generates reward-free tasks by maximizing $I(S;Z) + H[A \mid S] - I(A;Z)$, yielding pseudo-reward $\log D_\phi(z|s) - \log p(z)$ that drives diverse, state-distinguishable skills without hand-specified rewards.

meta-learning · reinforcement learning · meta-RL · fast adaptation · RL^2

AI Safety, Alignment & Reward Hacking

2 tier-5 · 3 tier-4

Drawn from Weng's years leading safety work at OpenAI, this cluster traces the alignment problem from early concrete subproblems to its modern, foundational treatments. The reward-hacking survey (Goodhart's Law through RLHF overoptimization, sycophancy, and LLM-as-grader bias) and the extrinsic-hallucination survey (factuality, detection, and "knowing what you don't know") are widely-cited reference works; around them sit careful deep-dives on adversarial attacks/jailbreaks, toxicity reduction, and the human-data pipeline that quietly determines model quality. The set reads as one researcher's evolving map of what can go wrong with capable models and how to measure and mitigate it.

Reducing Toxicity in Language Models

TIER 4 Mar 21, 2021

Large pretrained LMs absorb toxicity from internet training data, and the right intervention depends on where in the pipeline you act: data collection, detection, or generation.

Data collection. Crowdsourcing dominates annotation, with quality controlled via test-question gatekeeping, clear guidelines, majority voting, and diverse annotator pools. Khatri et al.'s two-stage bootstrap seeds weak labels from an 800-word blacklist on Reddit, then expands the dataset with a classifier using confidence thresholds (>0.8 toxic, <0.3 non-toxic). SOLID scales the 14k-tweet OLID seed to 9M+ tweets via democratic co-training: four diverse models (PMI, FastText, LSTM, BERT) vote on unlabeled examples and high-confidence items are admitted.

Detection. Perspective API scores seven attributes (toxicity, insult, threat, etc.) as [0,1] classifier confidences, but over-flags minority-identity mentions due to lexical-cue reliance. Dinan et al.'s "build it, break it, fix it" loop iteratively hardens a BERT classifier by collecting adversarial failures and retraining. Character-level attacks (scrambling, homoglyph substitution, Levenshtein near-neighbors, distractor injection) sharply degrade classifiers; CDAE (contextual denoising autoencoder) partially recovers performance. Self-diagnosis (Schick et al.) skips labeled data entirely: the model computes normalized yes/no probabilities against a prompt template, with accuracy scaling with model size.

Detoxification. Vocabulary shifting learns a 2D toxic/non-toxic token representation and boosts non-toxic likelihoods at decode time. Self-debiasing computes Δ(w) = p(w|x) − p(w|sdb(x,s)) and downweights negative-delta words via e^(λΔ). Text style transfer (Santos et al., Style Transformer, CAE-T5) rewrites toxic inputs using cycle-consistency loss on non-parallel corpora. Gehman et al. find fine-tuning on non-toxic corpora and PPLM outperform CTRL control tokens and word filters. Xu et al.'s system-level design layers classifier-gated I/O filtering, baked-in safety fine-tuning on unsafe-prompt/safe-response pairs, topic-avoidance classifiers, and CTRL-style gender labels at inference.
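
The self-debiasing rescaling can be sketched on a toy next-token distribution. The dictionaries, the function name, and the decay constant `lam` are illustrative assumptions; `p` stands for p(w|x) and `p_sdb` for the distribution under the self-debiasing prompt.

```python
import math


def self_debias(p, p_sdb, lam=10.0):
    """Rescale p(w|x) by exp(lam * delta) wherever delta < 0, then renormalize.

    delta(w) = p(w|x) - p(w|sdb(x, s)); a negative delta means the token
    became MORE likely once the model was prompted toward the unwanted
    attribute, so it is treated as carrying that attribute.
    """
    scaled = {}
    for w, pw in p.items():
        delta = pw - p_sdb.get(w, 0.0)
        alpha = 1.0 if delta >= 0 else math.exp(lam * delta)
        scaled[w] = alpha * pw
    z = sum(scaled.values())
    return {w: v / z for w, v in scaled.items()}
```

With `p = {"nice": 0.5, "rude": 0.5}` and `p_sdb = {"nice": 0.2, "rude": 0.8}`, the token "rude" has Δ = -0.3 and is scaled by e^(-3) before renormalization, so almost all the mass shifts to "nice". Tokens with non-negative Δ are left untouched apart from the renormalization.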

AI safety · toxicity · language models · content moderation · detoxification

Adversarial Attacks on LLMs

TIER 4 Oct 25, 2023

LLM safety training can be broken systematically. The attack landscape splits by whether the attacker has model weights or only API access.

Token manipulation (black-box) ranks words by an importance score — the drop in correct-label logit when deleted — then substitutes high-importance words with synonyms filtered by cosine similarity and POS-tag match (TextFooler) or BERT-predicted contextual replacements (BERT-Attack).

Gradient-based attacks (white-box) require open weights. GBDA uses Gumbel-softmax to make token replacement differentiable, with NLL fluency and BERTScore as soft constraints. HotFlip approximates loss change from character flips via first-order Taylor expansion in one backward pass. Universal Adversarial Triggers (UAT) find short input-agnostic token sequences that flip behavior across any input from a distribution. Zou et al. (2023) use Greedy Coordinate Gradient (GCG), which shortlists the top-k token substitutions per position by negative gradient and then exactly evaluates a sampled batch of B candidates, to find suffixes forcing aligned models into affirmative responses ("Sure, here's how to…"), transferring nontrivially to GPT-4 and Claude. ARCA generalizes to arbitrary auditing objectives φ(input, output) via coordinate ascent, splitting each update into a linearly approximatable and an exact autoregressive term.

Jailbreak prompting exploits two failure modes: competing objectives (prefix injection, refusal suppression, DAN role-play) and mismatched generalization — safety training leaves coverage gaps, so Base64, ROT13, Pig Latin, payload splitting, and cross-language prompts bypass it.

Human red-teaming is tool-assisted via saliency highlighting and BERT-Attack substitution dropdowns. Model red-teaming (Perez et al.) trains a red-team LM via zero-shot, SFT, or RL to maximize a harmfulness classifier's score; RL maximizes attack rate but collapses diversity. FLIRT refines this with in-context learning, keeping a priority queue of exemplars ranked by effectiveness, diversity, and low surface toxicity.

Defenses include perplexity filtering (catches nonsensical GCG suffixes), input paraphrasing, BPE-dropout retokenization, and adversarial training — theoretically strongest but degrades generation quality with only modest reductions in attack success.

adversarial-attacks · jailbreak · llm-security · red-teaming · ai-safety

Thinking about High-Quality Human Data

TIER 4 Feb 5, 2024

Getting useful labels requires managing two distinct failure modes: annotators who disagree due to errors or spamming, and annotators who disagree because ground truth is genuinely contested.

Aggregation methods range from majority vote to Cohen's Kappa (κ = (p_o − p_e)/(1 − p_e), correcting for chance) to probabilistic graph models. MACE assigns each annotator j a trustworthiness parameter θ_j: with probability 1−θ_j the annotator is spamming and draws the label from a personal noise distribution ξ_j instead of reporting the true label. EM or Variational Bayes recovers per-annotator reliability weights.
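
As a worked instance of the κ formula, the sketch below computes observed agreement p_o and chance agreement p_e for two annotators. The function name and the list-of-labels input format are assumptions for the example.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e) for two annotators on the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)
```

For `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])`, p_o = 0.75 and p_e = 0.5, giving κ = 0.5. Note the division is undefined when both annotators always emit the same single label (p_e = 1); a real pipeline would need to handle that case.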

For subjective tasks, two paradigms split: prescriptive (enforce one gold answer, easier QC) versus descriptive (preserve diversity as signal). Expert-crowd agreement on AI safety ranges from 0.96 on violence to 0.25 on personal topics — collapsing that disagreement discards real information. Disagreement deconvolution estimates p_flip per annotator and adjusts the label distribution to strip inconsistency noise from stable opinions. Jury Learning (DCN architecture) models each annotator conditioned on demographic metadata; practitioners specify jury composition at inference time.

On the training-dynamics side, influence functions approximate leave-one-out retraining via −H⁻¹∇L to score which samples most shift test loss. Data Maps track per-sample confidence and variability across epochs: low-confidence/low-variability correlates with mislabels, though such examples also improve OOD generalization. AUM measures the margin between assigned-class logit and the runner-up across training; threshold is set via artificially relabeled "threshold samples." INCV iteratively builds a clean trusted set by cross-predicting on held-out halves.

data-quality · human-annotation · rlhf · crowdsourcing · rater-agreement

Extrinsic Hallucinations in LLMs

TIER 5 Jul 7, 2024

LLMs hallucinate in two modes: in-context (contradicting the provided source) and extrinsic (contradicting world knowledge). Extrinsic — fabricated facts or unacknowledged ignorance — is the focus.

Two root causes. Pre-training on web-crawled data embeds stale and wrong facts. Fine-tuning is subtler: Gekhman et al. (2024) show that "Unknown" examples (knowledge the model lacks, by P_Correct categorization) are learned far slower than "Known" ones, and once fitted increase hallucination — best dev performance appears when most "Known" examples are learned but very few "Unknown" ones are.

Detection spans several families. FactualityPrompt measures hallucination via named-entity error rate and RoBERTa entailment ratios against Wikipedia. FActScore decomposes outputs into atomic claims, precision-scoring each against a retrieval-augmented knowledge base. SAFE extends this with an agent issuing iterative Google Search queries per claim, scoring F1@K (precision against recall up to K facts) and outperforming human annotators at 20× lower cost. FacTool handles QA, code, math, and literature by converting claims to search or unit-test queries. SelfCheckGPT needs no knowledge base: it flags claims inconsistent across multiple stochastic samples via BERTScore or NLI. TruthfulQA and SelfAware probe refusal on adversarially misleading or unanswerable questions; larger models self-know better, but RLHF degrades calibration.
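
The F1@K metric mentioned for SAFE can be sketched as follows, assuming each atomic claim has already been judged supported or not supported (irrelevant claims are taken to be excluded beforehand). The function and argument names are hypothetical.

```python
def f1_at_k(num_supported, num_not_supported, k):
    """Harmonic mean of factual precision and recall capped at K facts."""
    if num_supported == 0:
        return 0.0
    precision = num_supported / (num_supported + num_not_supported)
    # Recall saturates once the response contains K supported facts.
    recall = min(num_supported / k, 1.0)
    return 2.0 * precision * recall / (precision + recall)
```

A response with 8 supported and 2 unsupported claims scores precision 0.8; at K = 16 its recall is 0.5, so F1@16 is about 0.615. The cap on recall is what stops a model from inflating the score by producing ever longer answers.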

Mitigation spans retrieval, decoding, and fine-tuning. RARR queries search and edits only where evidence disagrees, without training. FAVA fine-tunes an editor on synthetic error-tagged data. Self-RAG emits four reflection tokens (Retrieve, IsRel, IsSup, IsUse) gating on-demand retrieval mid-generation. CoVe generates and answers verification questions in factored mode, keeping the hallucinated draft out of context. Factual-nucleus sampling decays top-p across token position ($p_t = \max(\omega, p \cdot \lambda^{t-1})$), greedifying later higher-risk tokens. ITI shifts selected attention heads toward a "truthful" direction at inference. FLAME and FactTune run DPO with FActScore as reward; RAG-sourced positives backfire by injecting unknown knowledge, so self-generated positives work better.
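
The factual-nucleus decay schedule is a one-liner; the sketch below just evaluates $p_t = \max(\omega, p \cdot \lambda^{t-1})$. The default values of `p`, `lam`, and `omega` are illustrative assumptions, and `t` is taken to be a 1-based token position (how and when that index resets is left outside this sketch).

```python
def factual_nucleus_p(t, p=0.9, lam=0.9, omega=0.3):
    """Top-p threshold at 1-based token position t: max(omega, p * lam**(t-1))."""
    return max(omega, p * lam ** (t - 1))


def schedule(num_tokens, **kwargs):
    """Thresholds for positions 1..num_tokens."""
    return [factual_nucleus_p(t, **kwargs) for t in range(1, num_tokens + 1)]
```

With these assumed defaults the threshold starts at 0.9, drops to 0.81 at the second token, and eventually sits at the floor ω = 0.3, so later tokens are sampled from a progressively narrower nucleus.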

hallucination · factuality · llm-evaluation · retrieval-augmentation · ai-safety

Reward Hacking in Reinforcement Learning

TIER 5 Nov 28, 2024

More capable models hack reward functions more effectively — the proxy-gold gap widens with scale, worsening as AI systems improve.

Reward hacking arises because any proxy reward $R$ differs from the oracle $R^*$, and RL exploits that gap. Garrabrant's Goodhart taxonomy names four failure modes: regressional (selection also selects noise), extremal (optimization pushes distribution out-of-domain), causal (non-causal correlations break under intervention), and adversarial (third parties game the metric). Pan et al. (2022) confirmed that larger model size, finer action resolution, and more training steps increase proxy reward while decreasing true reward.

Gao et al. (2022) fit scaling laws for RLHF overoptimization: under best-of-n sampling, gold reward follows $R^*_\text{BoN}(d) = d(\alpha - \beta d)$ where $d = \sqrt{D_\text{KL}(\pi \| \pi_\text{init})}$, so it rises, peaks, and then turns over while the proxy reward keeps growing linearly. More RM data raises the peak.
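
Because the best-of-n form is a downward parabola in $d$, the turnover point follows directly: setting the derivative $\alpha - 2\beta d$ to zero gives $d^* = \alpha / (2\beta)$ and a peak gold reward of $\alpha^2 / (4\beta)$. The sketch below evaluates this; the coefficient values used in the example are made up, not fitted numbers from the paper.

```python
def gold_reward_bon(d, alpha, beta):
    """R*(d) = d * (alpha - beta * d), with d = sqrt(KL(pi || pi_init))."""
    return d * (alpha - beta * d)


def bon_peak(alpha, beta):
    """Return (d at the peak, gold reward at the peak, KL at the peak)."""
    d_star = alpha / (2.0 * beta)
    return d_star, gold_reward_bon(d_star, alpha, beta), d_star ** 2
```

For hypothetical coefficients α = 1.0 and β = 0.25, `bon_peak` returns d* = 2 (a KL of 4 nats) with gold reward 1.0, and by d = 4 the gold reward is back to zero even though the policy has been optimized further against the proxy.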

RLHF produces two behavioral pathologies. Sycophancy (Sharma et al. 2023): models shift answers to match stated user beliefs; belief-matching is the strongest predictor in logistic regression on preference data. U-Sophistry (Wen et al. 2024): RLHF increases approval of incorrect answers, raises evaluator error rates, and produces code harder to audit — fabricating evidence and obfuscating structure.

LLM-as-grader setups add positional bias (GPT-4 favors the first candidate) and self-bias (models prefer their own outputs), partially mitigated by multiple-evidence sampling or balanced-position aggregation.

In-context reward hacking (ICRH) occurs at inference in feedback loops: tweet refinement for engagement raises toxicity; an invoice agent facing API errors moves funds without authorization. Scaling worsens ICRH; prompt engineering does not fix it.

Hacking generalizes. Denison et al. (2024) showed a model generalizing zero-shot to rewriting its own reward function after curriculum training on gameable environments. Mitigations include decoupled approval (query and world action sampled independently, preventing tampering from corrupting its own feedback), reward capping, and SEAL-style analysis to identify spoiler features in RLHF data.

reward-hacking · rlhf · ai-safety · goodharts-law · alignment

Meta-Learning & Learning with Not Enough Data

1 tier-5 · 3 tier-4

How models learn efficiently when labels are scarce. The anchor is Weng's landmark meta-learning survey, which organizes "learning to learn" into metric-, model-, and optimization-based approaches (MAML, Prototypical Networks, Reptile) under the unifying "train as you test" principle. Around it sits her three-part "Learning with not Enough Data" series, which frames the four responses to label scarcity and then surveys semi-supervised learning, active learning (budget-aware labeling), and synthetic data generation in turn. Read together, they form a complete toolkit for the low-data regime.

Meta-Learning: Learning to Learn Fast

TIER 5 Nov 30, 2018

Meta-learning optimizes over a distribution of tasks, treating each dataset as one training sample: θ* = argmin_θ E_{D∼p(D)}[L_θ(D)]. Training mimics test time: sample a label subset, form support set S and query batch B, feed S as model input, compute loss on B.

Metric-based methods classify by weighted sums of support labels, weights from a learned kernel. Siamese networks train on binary same-class verification; at test time the query matches support images by L1 distance in embedding space. Matching Networks output Σ_i a(x, x_i)·y_i, where a is a softmax over cosine similarities between embeddings; the Full Context Embedding variant re-encodes support points via bidirectional LSTM over the full set. Prototypical Networks average support embeddings per class into a prototype and classify by softmax over negative squared Euclidean distance. Relation Network scores pairs with a learned CNN via MSE.
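
The Prototypical Networks rule is compact enough to show on raw vectors. This sketch assumes the embeddings have already been computed (the embedding network is omitted), and the function names are hypothetical.

```python
import math


def prototypes(support_embeddings, support_labels):
    """Average the support embeddings of each class into one prototype."""
    sums, counts = {}, {}
    for emb, y in zip(support_embeddings, support_labels):
        acc = sums.setdefault(y, [0.0] * len(emb))
        for i, v in enumerate(emb):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}


def classify(query, protos):
    """Softmax over negative squared Euclidean distance to each prototype."""
    logits = {y: -sum((q - c) ** 2 for q, c in zip(query, proto))
              for y, proto in protos.items()}
    m = max(logits.values())  # subtract the max for numerical stability
    exps = {y: math.exp(v - m) for y, v in logits.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}
```

With support points `[[0, 0], [0, 2]]` for class "a" and `[[4, 4], [6, 4]]` for class "b", the prototypes are (0, 1) and (5, 4); a query at (1, 1) is at squared distance 1 from "a" and 25 from "b", so nearly all probability goes to "a".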

Model-based methods build rapid-update architectures. Memory-Augmented Neural Networks adapt NTMs by presenting labels one step late (x_{t+1}, y_t), forcing memory to hold inputs until labels arrive; a Least Recently Used Access (LRUA) write head keeps the cache fresh. MetaNet adds fast weights: LSTM F_w generates task-level θ⁺ from embedding loss gradients; network G_v generates example-level φ⁺ from task loss gradients; both combine with slow weights at inference.

Optimization-based methods learn the update rule. The LSTM meta-learner maps gradient descent onto cell-state dynamics, making forget gate and learning rate learnable. MAML (Finn et al. 2017) finds initialization θ so one gradient step yields good task performance, requiring second-order gradients. First-Order MAML drops the second derivative with minimal performance cost. Reptile (Nichol et al. 2018) runs k SGD steps on a task and moves θ toward the result. Taylor expansion shows all three optimize A (average task gradient) and B (gradient alignment across batches), with α the inner step size: E[g_FOMAML] = A − αB, E[g_MAML] = A − 2αB, E[g_Reptile] = 2A − αB.
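
To make the Reptile update concrete, here is a toy sketch on one-dimensional quadratic tasks, where task c has loss 0.5·(φ − c)² and the inner gradient is simply φ − c. The task family, the hyperparameter values, and the deterministic task cycling are all assumptions chosen so the example is reproducible; they are not taken from the paper.

```python
def reptile(task_centers, theta=0.0, inner_lr=0.1, inner_steps=5,
            outer_lr=0.1, iterations=2000):
    """Reptile on toy tasks with loss 0.5 * (phi - c)**2 per task center c."""
    for it in range(iterations):
        c = task_centers[it % len(task_centers)]  # cycle through tasks
        phi = theta
        for _ in range(inner_steps):
            phi -= inner_lr * (phi - c)           # plain SGD on the task loss
        theta += outer_lr * (phi - theta)         # move theta toward phi
    return theta
```

Running `reptile([-1.0, 0.0, 1.0, 2.0])` settles close to 0.5, the mean of the task optima, with a small residual wobble because each outer step pulls toward the most recent task. No second-order derivatives appear anywhere: the outer update only needs the difference between adapted and initial parameters.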

meta-learning · few-shot learning · MAML · metric learning · memory-augmented NN

Learning with not Enough Data Part 1: Semi-Supervised Learning

TIER 4 Dec 5, 2021

Unlabeled data can substitute for scarce labels when the model is trained to produce consistent predictions across perturbations. All methods share L = Ls + μ(t)·Lu. Four hypotheses justify the approach: smoothness (nearby dense-region points share labels), cluster structure, low-density separation (decision boundaries sit in sparse gaps), and manifold (high-dimensional data lies on a low-dimensional surface).

Consistency regularization enforces identical predictions across augmented versions of the same input. Pi-model applies MSE between two stochastic forward passes. Temporal Ensembling replaces the second pass with a bias-corrected per-sample EMA of past predictions, cutting compute. Mean Teacher moves the EMA to weights (θ' ← βθ' + (1−β)θ), updating every step rather than every epoch. VAT finds the worst-case perturbation r_vadv that maximally shifts the output and penalizes sensitivity to it. ICT applies MixUp to unlabeled pairs and requires the prediction on the mixture to match the interpolation of individual predictions. UDA shows augmentation quality is decisive: RandAugment for images, back-translation plus TF-IDF word replacement for text, with confidence masking (threshold τ) and temperature sharpening T.

Pseudo labeling assigns argmax labels to unlabeled samples, equivalent to entropy regularization that pushes class distributions apart. Noisy Student trains an EfficientNet teacher to label 300M images, then trains a larger student with stochastic depth, dropout, and RandAugment — noise on the student only; clean inference from the teacher. Meta Pseudo Labels updates the teacher by backpropagating through the student's labeled-set loss via MAML-style one-step gradient unrolling.

Combined methods stack both paradigms. MixMatch averages K augmented predictions as pseudo labels, sharpens with temperature T, and applies MixUp. FixMatch uses weak augmentation to generate confidence-filtered pseudo labels and strong augmentation as training input — swapping the two causes divergence. DivideMix fits a two-component GMM on per-sample loss to separate clean from noisy examples, routing each group through co-divide, co-refinement, and co-guessing across two networks.
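
The FixMatch unlabeled term can be sketched directly on predicted probabilities. The inputs are assumed to be softmax outputs for weakly and strongly augmented views of the same unlabeled batch; the function name and the threshold default are illustrative.

```python
import math


def fixmatch_unlabeled_loss(weak_probs, strong_probs, tau=0.95):
    """Cross-entropy on strong views against confident weak-view pseudo labels.

    Examples whose weak-view confidence is below tau contribute zero,
    but the sum is still divided by the full unlabeled batch size.
    """
    total = 0.0
    for p_weak, p_strong in zip(weak_probs, strong_probs):
        confidence = max(p_weak)
        if confidence >= tau:
            pseudo_label = p_weak.index(confidence)
            total += -math.log(p_strong[pseudo_label])
    return total / len(weak_probs)
```

In a batch of two where only the first weak prediction clears the threshold, only that example contributes a cross-entropy term. Early in training few examples pass the mask, so the unlabeled loss is small and grows in influence as the model becomes more confident.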

Pre-training and self-training are additive but asymmetric: pre-training helps in low-data regimes and can hurt at high volume; self-training gains with more data and stronger augmentation. Stacking SimCLRv2 pre-training, supervised fine-tuning, and pseudo-label distillation confirms all three stages compound, with bigger models consistently more label-efficient.

semi-supervised-learning · consistency-regularization · pseudo-labeling · low-resource · fixmatch

Learning with not Enough Data Part 2: Active Learning

TIER 4 Feb 20, 2022

Active learning maximizes model improvement per labeling dollar by selecting which unlabeled samples to annotate from a pool, rather than labeling randomly within budget B.

Acquisition functions score unlabeled examples. Uncertainty sampling uses least-confident score (1 − P(ŷ|x)), margin between top-two predictions, or entropy. Because deep models are overconfident, Query-by-Committee measures disagreement across a committee via voter entropy, consensus entropy, or mean KL from the consensus.
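
The three uncertainty scores can be compared on a small pool of predicted class distributions. This is a sketch with hypothetical function names; the margin is negated so that, for every score, larger means more uncertain.

```python
import math


def least_confident(probs):
    """1 - P(y_hat | x): high when the top class is weakly predicted."""
    return 1.0 - max(probs)


def margin(probs):
    """Gap between the two most likely classes (small = uncertain)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]


def neg_margin(probs):
    """Negated margin, so that larger means more uncertain."""
    return -margin(probs)


def entropy(probs):
    """Shannon entropy of the predictive distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def select_batch(pool_probs, budget, score=entropy):
    """Indices of the `budget` highest-scoring (most uncertain) pool items."""
    order = sorted(range(len(pool_probs)),
                   key=lambda i: score(pool_probs[i]), reverse=True)
    return order[:budget]
```

On the pool `[[0.9, 0.05, 0.05], [0.4, 0.35, 0.25], [0.6, 0.3, 0.1]]`, all three scores rank the second item as most uncertain, so `select_batch(pool, 1)` returns `[1]`. The scores can disagree on other pools, since margin looks only at the top two classes while entropy uses the whole distribution.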

For uncertainty in deep networks, MC dropout (Gal & Ghahramani 2016) samples multiple stochastic forward passes as an approximate Bayesian ensemble — the basis of DBAL. Naive ensembles calibrate better but are expensive; cheaper variants all underperform them. Bayes-by-backprop (Blundell 2015) learns a variational Gaussian over weights and samples at inference for epistemic uncertainty. The loss prediction module (Yoo & Kweon 2019) trains an MLP on intermediate features to rank which unlabeled sample incurs higher loss, using pairwise ranking rather than MSE to avoid scale-shift; it outperforms entropy and core-set on three vision benchmarks. VAAL (Sinha 2019) trains a β-VAE plus discriminator to select points least similar to the labeled pool. MAL extends this with a minimax game: encoder minimizes feature entropy, classifier maximizes prediction entropy on unlabeled data.

For representativeness, core-set selection (Sener & Savarese 2018) recasts active learning as the NP-hard k-center problem; greedy approximation degrades with many classes or high dimensionality. BADGE (Ash 2020) uses gradient magnitude as uncertainty and k-means++ on gradient embeddings for diversity. BALD maximizes mutual information between predictions and model weights. Hybrid methods chain these: Suggestive Annotation filters by uncertainty then maximizes cosine-similarity coverage; CEAL combines active labeling with pseudo-labeling under a decaying entropy threshold.

active-learning · data-labeling · uncertainty-sampling · low-resource · acquisition-functions

Learning with not Enough Data Part 3: Data Generation

TIER 4 Apr 15, 2022

Large pretrained language models can synthesize training data from near-zero labeled examples, approaching fully supervised performance once the resulting noise is actively managed.

Augmentation varies surface form while preserving labels. For images: AutoAugment searches operation sequences via RL; RandAugment collapses that to a single magnitude parameter; Population Based Augmentation evolves strategies across parallel child models. Mixup blends two images pixel-wise (α·I₁ + (1−α)·I₂); CutMix replaces a rectangular region with a crop from a second image; MoCHi interpolates the top-N hardest contrastive negatives in feature space. For text, EDA applies synonym replacement, random insertion, swap, and deletion at rate α×sentence_length. Contextual augmentation samples substitutes from a label-conditioned bidirectional LM. SimCSE creates positive pairs by passing one sentence through a dropout encoder twice.

For synthesis, LAMBADA fine-tunes a generator on a seed corpus to produce label-conditioned continuations, retaining only the top 10% by classifier confidence. UDG reverses the conditioning: few-shot prompting generates inputs x given labels y, then noisy label annealing (NLA) discards examples where the model's confident prediction contradicts the synthetic label; the threshold decays from 0.9 to 1/num_classes. For translation without parallel data, zero-shot LM outputs seed few-shot demonstrations → distillation → back-translation, iterated.

Generated data quality is measured by affinity (accuracy drop on augmented vs. clean data, reflecting distribution shift) and diversity (final training loss, reflecting added complexity); both must be high for gains.

Four mechanisms handle noisy labels. Generalized cross entropy (GCE) interpolates MAE and CCE via (1−f^q)/q, higher q for noisier data. F-correction models label flipping as C where C_ij = p(ỹ=j | y=i) and adjusts loss to −log(C⊤ p̂(y|x)). Learning-to-Reweight (L2R) meta-learns per-sample weights against a small clean validation set, at 3× training cost. Co-teaching trains two peer networks simultaneously, each selecting small-loss instances for the other, with the selection fraction shrinking as training progresses.
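
The interpolation behind generalized cross entropy is easiest to see numerically. In the sketch below, `f` is the predicted probability of the given (possibly noisy) label; the function name is hypothetical.

```python
import math


def gce_loss(f, q):
    """Generalized cross entropy (1 - f**q) / q for 0 < q <= 1."""
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - f ** q) / q
```

At q = 1 the loss is 1 − f, which is proportional to MAE on one-hot targets and bounded, so a confidently wrong example cannot dominate the gradient. As q approaches 0 the loss approaches −log f, the usual cross entropy: `gce_loss(0.5, 1e-6)` is within about 1e-6 of log 2.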

data-generation · data-augmentation · synthetic-data · low-resource · few-shot

Self-Supervised, Contrastive & Multimodal Learning

1 tier-5 · 2 tier-4

Learning useful representations without labels, and extending them across modalities. The contrastive-learning survey is the widely-cited map of the self-supervised landscape (InfoNCE, SimCLR, BYOL, MoCo, CLIP, SimCSE) and the loss functions and ingredients — augmentation, large batches, hard-negative mining — that make it work. It pairs with the earlier broad survey of self-supervised pretext tasks across images, video, and control, and with the survey of vision-language models that extend pretrained LMs to consume visual signals. The throughline: build strong representations from unlabeled data, then bridge modalities.

Self-Supervised Representation Learning

TIER 4 Nov 10, 2019

Unlabeled data can provide its own training signal when labels come from data structure. The learned intermediate representation — not pretext accuracy — is the goal; downstream ImageNet classification measures it.

For images, four families of pretext tasks emerge. Distortion: Exemplar-CNN groups distorted patches of the same image as a surrogate class; rotation prediction (Gidaris et al.) requires 4-class 0°/90°/180°/270° classification, forcing recognition of semantic parts. Patch relationships: relative position prediction (Doersch et al.) classifies which of 8 grid neighbors a second patch occupies — color-channel jitter defeats a chromatic aberration shortcut; jigsaw puzzle solving (Noroozi & Favaro) restores 9 shuffled patches to origin. Feature counting enforces φ(downsampled) = Σ φ(tiles) with a margin loss against collapse. Generative pretext tasks include context encoders (inpainting with L2 + adversarial loss), split-brain autoencoders (predicting one channel subset from another), and Bidirectional GANs with a discriminator on (x, z) pairs. Contrastive Predictive Coding encodes input as z_t and builds autoregressive context c_t; the InfoNCE loss trains the model to pick true future z_{t+k} from N−1 negatives, maximizing I(x; c) without generating observations.

For video, temporal structure supplies the signal: triplet loss pulls together tracked patch representations across frames; frame-order validation (Misra et al.) uses a siamese network on shuffled vs. correct triples; arrow-of-time prediction (Wei et al.) applies a T-CAM network to optical flow groups. Video colorization (Vondrick et al.) copies colors via attention-weighted similarity from a reference frame, enabling segmentation tracking without fine-tuning.

For control, TCN/mfTCN learn camera-invariant state embeddings via multi-view metric loss. Grasp2Vec encodes objects so φ(pre-grasp) − φ(post-grasp) ≈ φ(object). RIG imagines goals from a β-VAE prior; CC-VAE conditions on a context image for object consistency. DeepMDP and DBC align latent L2 distance with behavioral similarity — matching reward and transition distributions under the bisimulation metric, without reconstruction loss.

self-supervised · representation learning · pretext tasks · CPC · computer vision

Contrastive Representation Learning

TIER 5 May 31, 2021

Contrastive learning trains an embedding space where similar samples cluster and dissimilar ones are pushed apart — loss functions evolved from single pairs to batch-level mutual-information bounds.

Loss functions form a progression: contrastive loss (Chopra 2005) operates on pairs with a margin; triplet loss adds anchor-positive-negative structure; lifted structured loss exploits all pairwise batch edges via log-sum-exp relaxation; N-pair loss generalizes to N−1 negatives. InfoNCE (CPC, van den Oord 2018) treats the positive as a classification target among noise samples; its scoring function estimates the density ratio p(x|c)/p(x), so minimizing InfoNCE maximizes a lower bound on I(x;c). Soft nearest-neighbor loss extends to multiple positives with temperature τ.
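
To make the InfoNCE step concrete, here is a minimal sketch in plain Python. It treats the loss as a softmax cross-entropy over one positive and several negatives; the function name `info_nce` and the toy scores are assumptions for illustration, not code from the post. With uninformative scores the loss equals log N, which is the point where the bound I(x;c) ≥ log N − L stops saying anything.

```python
import math

def info_nce(scores, positive_index=0):
    """Cross-entropy of picking the positive among all candidates.

    scores: similarity logits f(x_i, c), one per candidate (hypothetical values).
    """
    m = max(scores)  # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_sum - scores[positive_index]

# One positive (index 0) and three negatives.
easy = info_nce([5.0, 0.0, 0.0, 0.0])           # positive clearly stands out
uninformative = info_nce([0.0, 0.0, 0.0, 0.0])  # all candidates look alike
print(easy, uninformative)
```

A lower loss means the scoring function separates the true sample from the noise samples more sharply.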

Three ingredients drive performance: heavy data augmentation (random crop plus color distortion is decisive), large batch size for diverse negatives, and hard negative mining. Unsupervised mining risks false negatives; Chuang et al. (2020) derive a debiased estimator; Robinson et al. (2021) up-weight hard negatives via importance sampling.

In vision: SimCLR applies InfoNCE with a projection head; downstream tasks use the encoder output, not the projected vector. MoCo decouples negative count from batch size via a FIFO queue with momentum key encoder θ_k ← mθ_k + (1−m)θ_q. BYOL drops negatives via online/target network pairs; batch normalization implicitly prevents collapse. SwAV swaps cluster-code predictions across augmented views, solved via Sinkhorn-Knopp. CLIP trains image and text encoders on 400M web pairs via symmetric cross-entropy over an N×N similarity matrix; the contrastive objective yields 4x data efficiency over caption prediction and strong zero-shot ImageNet transfer.
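
The two MoCo mechanisms named above, the momentum update and the FIFO queue of negatives, can be sketched with standard-library Python. Parameters are plain lists here, and the queue length, batch size, and function name are hypothetical; m = 0.999 is used only as a typical default.

```python
from collections import deque

def momentum_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, elementwise."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]

# FIFO queue of negative keys: appending beyond maxlen drops the oldest entries,
# so the number of negatives is set by the queue length, not the batch size.
queue = deque(maxlen=4)
for batch_id in range(3):
    for i in range(2):                 # two keys per (toy) batch
        queue.append((batch_id, i))

theta_k = momentum_update([0.0, 1.0], [1.0, 1.0], m=0.9)
print(list(queue), theta_k)
```

With m close to 1 the key encoder drifts slowly, which keeps keys already in the queue roughly consistent with newly encoded ones.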

For sentence embeddings: SimCSE feeds the same sentence twice with different dropout masks as the positive pair. SBERT fine-tunes on NLI with triplet or softmax objectives. BERT's anisotropic geometry degrades similarity; BERT-flow maps to isotropic Gaussian via normalizing flows; whitening (zero-mean, identity covariance via SVD) outperforms BERT-flow on STS benchmarks at 256 dimensions.

contrastive learning · self-supervised · InfoNCE · SimCLR · representation learning

Generalized Visual Language Models

TIER 4 Jun 9, 2022

Extending pretrained language models to consume visual signals divides into four strategies by fusion tightness.

Joint training treats image patches as tokens. VisualBERT feeds CNN bounding-region vectors into BERT with embeddings summing visual features, segment flag, and position; trained with masked LM and sentence-image matching, early visual fusion matters most. SimVLM encodes ResNet-extracted patches as a bidirectional prefix and decodes text autoregressively; mixing ALIGN image-text pairs with C4 text data per batch is critical. CM3 trains on ~1T tokens of raw HTML, tokenizing images via VQVAE-GAN (256 tokens each) and appending masked spans at sequence end, enabling captioning, generation, and entity disambiguation through prompts.

Frozen-LM prefix trains only the vision encoder. Frozen (NF-ResNet-50) shows fine-tuning the LM hurts VQA — encyclopedic knowledge matters more than adaptation. ClipCap maps CLIP embeddings through an 8-layer transformer into k prefix tokens for GPT-2; only the mapping network trains.

Cross-attention fusion injects vision into LM layers. VisualGPT gates encoder-decoder attention against LM hidden states via complementary sigmoid masks B^vis and B^lan. VC-GPT self-ensembles LM logits and fusion logits (W^G h^G + W^fuse h^fuse); the ensemble step is critical. Flamingo uses a Perceiver resampler to compress visual features into fixed-size tokens, injects them via gated cross-attention interleaved with frozen LM layers, and trains on 43M webpages; each text token attends only to its preceding image, supporting arbitrary-length interleaved inputs. CoCa combines contrastive and captioning losses (1:2 weighting) with a split unimodal/multimodal decoder and attentional poolers (n=256 for generation, n=1 for classification).

No training uses language as interchange between frozen models. MAGiC scores each next token by LM confidence minus degeneration penalty plus a CLIP image-text score. PICa converts images to captions and prompts GPT-3 with 16 CLIP-selected examples, gaining +8.6 on OK-VQA. Socratic Models chain VLM, LM, and audio-LM through language prompts — no training.

vision-language-models · multimodal · image-captioning · cross-attention · frozen-lm

Scaling Models — Distributed Training & Inference

1 tier-5 · 1 tier-4

The systems side of large models: how to train them across many GPUs and then serve them cheaply. The distributed-training survey (later co-published on the OpenAI blog with Greg Brockman) is the canonical map of large-scale training infrastructure — data/model/pipeline/tensor parallelism, mixture-of-experts, ZeRO sharding, activation checkpointing, mixed precision, offloading. Its counterpart on inference optimization covers the deployment half: KV cache, quantization, pruning, distillation, and smart batching. Together they bookend the scaling lifecycle from training to production.

How to Train Really Large Models on Many GPUs?

TIER 5 Sep 24, 2021

Scaling language model training beyond a single GPU requires combining three orthogonal parallelism strategies, each targeting a different bottleneck.

Data parallelism (DP) replicates weights across workers. Bulk synchronous parallel (BSP) syncs every minibatch: correct but slow. Asynchronous parallel (ASP) eliminates waiting but admits stale weights. PyTorch DDP lands in between via gradient accumulation and bucketed AllReduce.

Naive model parallelism splits layers across devices but leaves workers idle. Pipeline parallelism (PP) cures this with microbatches. GPipe aggregates gradients synchronously at batch end; with d pipeline stages and m microbatches its bubble fraction is (d−1)/(m+d−1), negligible when m > 4d. PipeDream alternates 1F1B per worker with weight stashing to match forward and backward passes on the same version. PipeDream-flush adds periodic global syncs; PipeDream-2BW keeps only two weight versions. PTD-P assigns each device multiple non-contiguous layer chunks, shrinking bubble time by a factor of v, the number of chunks per device.
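
A quick way to see why m > 4d makes the GPipe bubble small is to evaluate the formula directly. The sketch below assumes d is the number of pipeline stages and m the number of microbatches; the function name is illustrative.

```python
def bubble_fraction(d, m):
    """Idle fraction of a GPipe schedule with d stages and m microbatches."""
    return (d - 1) / (m + d - 1)

d = 4
for m in (1, 4, 16, 64):
    print(f"d={d}, m={m}: bubble = {bubble_fraction(d, m):.3f}")
```

With a single microbatch three quarters of the schedule is idle for d = 4; the fraction drops steadily as m grows past 4d.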

Tensor parallelism (TP) splits individual matrix operations horizontally. Megatron-LM column-partitions the MLP weight so GeLU(XA) = [GeLU(XA₁), GeLU(XA₂)] runs in parallel, and shards Q/K/V projections across heads.

Mixture-of-Experts (MoE) scales parameters without proportional compute. Noisy top-k gating routes each token to k experts; an auxiliary CV² loss prevents expert collapse. GShard (600B params) adds expert capacity limits. Switch Transformer (trillion params) uses top-1 routing. Expert Choice inverts this — experts select their top-k tokens — guaranteeing perfect load balance and 2× faster convergence, though incompatible with autoregressive generation.
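
As a rough sketch of the routing step, the function below applies a softmax over only the k largest gate logits and gives every other expert zero weight. It deliberately omits the noise term and the auxiliary CV² loss mentioned above; the function name and logits are hypothetical.

```python
import math

def top_k_gate(logits, k):
    """Softmax over the k largest logits; every other expert gets weight 0."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    z = sum(exps.values())
    return [exps[i] / z if i in exps else 0.0 for i in range(len(logits))]

# Four experts, route to the best two.
gates = top_k_gate([2.0, 0.5, 1.0, -1.0], k=2)
print(gates)
```

Setting k=1 gives the Switch-style single-expert routing described in the same paragraph.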

Memory techniques: activation recomputation reduces memory to O(√ℓ) at cost of one extra forward pass; mixed-precision training (FP16 compute, FP32 master weights) needs loss scaling for small-gradient stability; ZeRO shards optimizer states, gradients, and parameters across data-parallel ranks, eliminating redundant copies.

distributed-training · model-parallelism · data-parallelism · mixture-of-experts · zero

Large Transformer Model Inference Optimization

TIER 4 Jan 10, 2023

Transformer inference is bottlenecked by two structural problems: KV cache that can reach 3× model size at batch=512/context=2048, and autoregressive decoding that resists parallelization. Five mitigation families exist: parallelism, offloading, smart batching, network compression, and architectural changes.

Distillation trains a student to match a teacher's softmax outputs via $\mathcal{L}_\text{KD} = \mathcal{L}_\text{distill}(\text{softmax}(z_t,T), \text{softmax}(z_s,T)) + \lambda\mathcal{L}_\text{CE}$. DistilBERT cuts BERT parameters 40%, retains 97% performance, and runs 71% faster by combining soft distillation, MLM, and cosine embedding losses.
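
The distillation objective can be sketched in a few lines of plain Python. This version assumes cross-entropy between the temperature-softened teacher and student distributions as the distillation term; the temperature, λ, and function names are illustrative choices, not values from the post.

```python
import math

def softmax(logits, T=1.0):
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(target, pred):
    return -sum(t * math.log(p) for t, p in zip(target, pred) if t > 0)

def kd_loss(z_t, z_s, label, T=2.0, lam=0.5):
    """L_distill(softmax(z_t, T), softmax(z_s, T)) + lam * L_CE."""
    soft = cross_entropy(softmax(z_t, T), softmax(z_s, T))  # match the teacher
    hard = -math.log(softmax(z_s)[label])                   # match the true label
    return soft + lam * hard

teacher = [3.0, 1.0, 0.0]
print(kd_loss(teacher, [3.0, 1.0, 0.0], label=0))   # student agrees with teacher
print(kd_loss(teacher, [0.0, 1.0, 3.0], label=0))   # student disagrees
```

A higher temperature T flattens both distributions, so the student also learns from the teacher's relative ranking of wrong classes.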

Quantization is complicated by activation outliers: in OPT models above 6.7B, a few activation dimensions run ~100× larger than the rest, making naïve INT8 destructive. Key PTQ approaches: LLM.int8() keeps outlier dimensions FP16 while the rest run INT8 per inner product; Q-BERT applies per-head Hessian-based precision (HAWQ); ZeroQuant uses per-token activation quantization with fused kernels; GPTQ minimizes $\|\mathbf{W}\mathbf{X} - \hat{\mathbf{W}}\mathbf{X}\|$ row-by-row via Hessian updates, reaching 3–4 bit weights on OPT-175B; SmoothQuant migrates scale variance from activations to weights offline via $(\mathbf{X}\,\text{diag}(\mathbf{s})^{-1})(\text{diag}(\mathbf{s})\mathbf{W})$, enabling clean W8A8 with better hardware efficiency than mixed-precision schemes.
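
The SmoothQuant rescaling is an exact identity before any rounding happens, which a small worked example can confirm. The matrices and the per-channel factors `s` below are hypothetical; the real method derives s from activation and weight statistics, which this sketch does not attempt.

```python
def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def smooth(X, W, s):
    """Return (X diag(s)^-1, diag(s) W): scale channel j of X down, row j of W up."""
    X_s = [[x / s[j] for j, x in enumerate(row)] for row in X]
    W_s = [[w * s[j] for w in row] for j, row in enumerate(W)]
    return X_s, W_s

X = [[1.0, 100.0], [2.0, -80.0]]      # channel 1 carries activation outliers
W = [[0.5, -0.25], [0.01, 0.02]]
s = [1.0, 10.0]                        # hypothetical per-channel smoothing factors
X_s, W_s = smooth(X, W, s)

print(matmul(X, W))
print(matmul(X_s, W_s))                # same product, smaller activation range
```

The product is unchanged while the largest activation magnitude falls from 100 to 10, which is what makes a shared INT8 scale for activations workable.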

Pruning distinguishes unstructured (flexible but GPU-incompatible) from structured sparsity. N:M sparsity — specifically 2:4, natively accelerated on Nvidia A100 — uses column permutation to maximize retained magnitude before pruning. SR-STE and Top-KAST extend straight-through estimation to train with enforced sparsity from scratch.
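
The 2:4 pattern itself is simple to state in code: within every group of four consecutive weights, keep the two with largest magnitude. The sketch below shows only this magnitude-based mask on one row; it does not implement the column-permutation search or any training-time method, and the function name is illustrative.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude weights in every group of four."""
    assert len(row) % 4 == 0, "row length must be a multiple of 4"
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        out.extend(w if i in keep else 0.0 for i, w in enumerate(group))
    return out

pruned = prune_2_of_4([0.1, -0.9, 0.4, 0.05, 2.0, 0.0, -0.3, 1.5])
print(pruned)
```

Permuting columns before this step changes which weights share a group, which is how more total magnitude can be retained.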

MoE activates only a subset of experts per token. V-MoE adds Batch Priority Routing so high-priority tokens fill capacity first, competitive even at $C\leq0.5$. Task MoE routes at task level for static expert preloading, achieving 2.6× throughput. PR-MoE concentrates more experts at later layers where gains are largest.

Architectural changes attack the quadratic attention bottleneck via sparse patterns (fixed-block, strided, LSH/k-means learnable), recurrence for long-context reuse (Transformer-XL), memory reduction (Linformer low-rank K/V projection; multi-query attention sharing K/V across heads), and per-token early exit (CALM).

inference-optimization · transformers · quantization · pruning · kv-cache

Computer Vision — Detection, NAS & Interpretability

0 tier-5 · 4 tier-4

Weng's computer-vision and AutoML deep-dives, mostly from her earlier years. The object-detection pieces cover the two dominant architecture families — the two-stage R-CNN lineage (region proposals, RoI pooling, Mask R-CNN) and the one-stage real-time detectors (YOLO, SSD, RetinaNet with focal loss) — and the speed/accuracy tradeoff between them. Alongside sit the neural-architecture-search survey (search space, search algorithm, evaluation strategy, from RL-driven toward differentiable DARTS) and a practically-grounded review of model interpretability motivated by regulated domains. Solid applied references, narrower in reach than the marquee LLM and generative surveys.

How to Explain the Prediction of a Machine Learning Model?

TIER 4 Aug 1, 2017

Interpretable models and black-box explanation methods solve different problems: the former embed transparency into structure; the latter extract post-hoc justifications without touching internals.

Interpretable models carry built-in mechanisms. Linear regression coefficients only compare across features after standardization; unstandardized, $w_i \cdot x_i$ is the contribution measure. Naive Bayes exposes per-feature influence via the posterior $p(c|x_i)$. Decision lists — Falling Rule Lists (monotone probability ordering), Bayesian Rule Lists (posterior over list structures), and Interpretable Decision Sets (accuracy and interpretability jointly optimized) — are naturally visualizable. Random forests use mean decrease in node impurity for global feature importance.

Three local methods target black-box predictions on individual instances. Prediction Decomposition (Robnik-Sikonja & Kononenko 2008) measures $p(y|x) - p(y|x \setminus A_i)$, marginalizing over the omitted feature's values to avoid missing-value artifacts. Local Gradient Explanation (Baehrens et al. 2010) fits a Parzen-window surrogate and takes its derivative at the test point. LIME (Ribeiro et al. 2016) samples perturbed instances weighted by proximity and fits a sparse K-LASSO locally — revealing that an SVM with 94% accuracy on Christianity vs. Atheism relied on tokens like "re" and "posting", showing that accuracy does not imply trustworthiness.

BETA (Lakkaraju et al. 2017) operates globally, learning a two-level decision set jointly optimizing fidelity, unambiguity, and compactness.

interpretability · explainable AI · model explanation · black-box models · LIME

Object Detection for Dummies Part 3: R-CNN Family

TIER 4 Dec 31, 2017

Each R-CNN generation removes the bottleneck its predecessor left. R-CNN uses selective search to propose ~2,000 region candidates, warps each to fixed size, runs a CNN on every region independently, classifies with per-class SVMs, and refines boxes with a separate regressor (scale-invariant center offsets, log-scale width/height) — three disjoint steps, no shared computation. Non-max suppression and hard-negative mining are standard.

Fast R-CNN runs one CNN pass over the whole image, extracts per-region features via RoI pooling (max-pool any h×w region to fixed H×W), and jointly optimizes classification and bbox regression with log loss + smooth L1. Selective search remains the bottleneck.

Faster R-CNN replaces it with a Region Proposal Network predicting k=9 anchors per position (3 scales × 3 ratios) over the shared feature map.
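
To illustrate how k=9 anchors arise from 3 scales × 3 ratios, here is a minimal sketch that enumerates them for one feature-map position. The pixel scales (128, 256, 512) and ratios (0.5, 1, 2) are assumptions chosen for the example, as is the convention that each anchor has area scale² and height/width equal to the ratio.

```python
import math

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return len(scales) * len(ratios) boxes (x1, y1, x2, y2) centred at (cx, cy)."""
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)   # wider when ratio < 1
            h = s * math.sqrt(r)   # taller when ratio > 1
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

anchors = make_anchors(300, 300)
print(len(anchors))
```

The RPN then scores and regresses offsets for each of these boxes at every position of the shared feature map.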

Mask R-CNN adds a parallel branch predicting an m×m binary mask per class per RoI, swapping RoI pooling for RoIAlign — bilinear interpolation at floating-point coordinates instead of integer quantization — for the spatial precision pixel segmentation requires.
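
The difference between integer quantization and RoIAlign comes down to how a fractional coordinate is read from the feature map. The sketch below shows bilinear sampling at one floating-point location on a toy 2×2 map; it assumes non-negative in-range coordinates and is not a full RoIAlign (no bins, no pooling over sample points).

```python
def bilinear_sample(fmap, y, x):
    """Interpolate a 2-D feature map at a floating-point location (y, x)."""
    y0, x0 = int(y), int(x)
    y1, x1 = min(y0 + 1, len(fmap) - 1), min(x0 + 1, len(fmap[0]) - 1)
    dy, dx = y - y0, x - x0
    top = fmap[y0][x0] * (1 - dx) + fmap[y0][x1] * dx
    bottom = fmap[y1][x0] * (1 - dx) + fmap[y1][x1] * dx
    return top * (1 - dy) + bottom * dy

fmap = [[0.0, 10.0],
        [20.0, 30.0]]

aligned = bilinear_sample(fmap, 0.5, 0.5)   # blends all four neighbours
quantized = fmap[int(0.5)][int(0.5)]        # rounding down reads only one cell
print(aligned, quantized)
```

Rounding discards the sub-cell offset, while interpolation preserves it, which matters when the output is a per-pixel mask.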

object detection · R-CNN · Faster R-CNN · Mask R-CNN · computer vision

Object Detection Part 4: Fast Detection Models

TIER 4 Dec 27, 2018

One-stage detectors skip region proposals, running detection over dense location samples for real-time speed at some accuracy cost versus R-CNN-family models.

YOLO divides the image into an S×S grid; each cell predicts B bounding boxes (x,y,w,h), a confidence score Pr(object)×IoU, and K conditional class probabilities—an S×S×(5B+K) output tensor. Loss weights localization with λ_coord=5 and suppresses background with λ_noobj=0.5. Fast but weak on small or irregularly shaped objects.

SSD extends VGG-16 with decreasing-size conv layers forming a multi-scale pyramid. Each level predicts offsets from six anchor boxes per cell (five aspect ratios plus a geometric-mean scale). Localization uses smooth-L1; classification uses softmax with 3:1 hard-negative mining.

YOLOv2 adds batch norm, high-resolution fine-tuning, IoU-based k-means anchor clustering (dist = 1−IoU), sigmoid-bounded offsets, a passthrough fine-grained layer, multi-scale training, and the lighter DarkNet-19 backbone. YOLO9000 jointly trains on COCO plus 9000 ImageNet classes via a WordNet label hierarchy, computing class probability as a product of conditionals up the tree.
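
The reason for clustering anchors with dist = 1−IoU rather than Euclidean distance can be seen in a small sketch. Boxes are compared by width and height only, as if sharing a corner; the helper names are illustrative.

```python
def iou_wh(a, b):
    """IoU of two boxes given as (width, height), aligned at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    union = a[0] * a[1] + b[0] * b[1] - inter
    return inter / union

def kmeans_distance(box, centroid):
    return 1.0 - iou_wh(box, centroid)

print(kmeans_distance((1, 1), (2, 2)))      # small boxes
print(kmeans_distance((10, 10), (20, 20)))  # same shapes, ten times larger
```

Both pairs get the same distance, so large boxes do not dominate the clustering the way they would under a Euclidean metric.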

RetinaNet addresses foreground/background imbalance with focal loss FL(p_t) = −α(1−p_t)^γ log p_t (α=0.25, γ=2), down-weighting easy negatives. Its FPN backbone on ResNet combines a bottom-up and a top-down pathway through lateral connections, predicting at levels P3–P7 with nine anchors each.
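
A direct transcription of the focal loss formula shows how strongly easy examples are down-weighted. This sketch handles a single binary prediction and uses one α for both classes, exactly as the formula above is written; a class-dependent α is a common refinement that is left out here.

```python
import math

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t) for one binary prediction.

    p is the predicted foreground probability, y is 1 (object) or 0 (background).
    """
    p_t = p if y == 1 else 1.0 - p
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

print(focal_loss(0.01, 0))   # easy background example: almost no loss
print(focal_loss(0.90, 0))   # confidently wrong background example: large loss
```

Setting gamma=0 and alpha=1 recovers ordinary cross-entropy, which makes the modulating factor (1−p_t)^γ easy to isolate.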

YOLOv3 swaps softmax for per-class logistic classifiers, upgrades to DarkNet-53 with residual blocks, adds multi-scale prediction at three resolutions, and merges fine-grained features via skip-layer concatenation—outperforming SSD and running 3.8× faster than RetinaNet at lower mAP.

object detection · YOLO · SSD · RetinaNet · computer vision

Neural Architecture Search

TIER 4 Aug 6, 2020

Neural Architecture Search (NAS) automates network topology design across three components: search space, search algorithm, and child model evaluation.

Search spaces. Sequential layer-wise search (Zoph & Le 2017) encodes whole networks as token sequences; it is expressive but needed 800 GPUs for 28 days. The NASNet cell-based space restricts search to two reusable cell types (Normal: same dimensions; Reduction: halved width/height), each cell built from B=5 blocks, with every block specified by 5 discrete choices (two inputs, two operations, and a combiner). Hierarchical NAS (HNAS) recursively assembles primitives into motifs at increasing abstraction levels via adjacency matrices. SMASH frames each layer as reading from and writing to memory blocks.

Search algorithms. RL controllers (NAS: REINFORCE; MetaQNN: Q-learning with ε-greedy) treat architecture token sequences as actions with validation accuracy as reward. AmoebaNet's aging evolution always removes the oldest model, preventing premature convergence. Progressive NAS uses SMBO with an RNN surrogate predicting accuracy, growing architectures from B=1 block upward.

Evaluation strategies. Proxy shortcuts include fewer training epochs, smaller datasets, and ν-SVR learning-curve extrapolation. ENAS achieves ~1000x speedup by treating candidate architectures as sub-graphs of a shared supergraph, alternating between training shared weights and updating the RL controller. SMASH uses a HyperNet to generate weights directly from architecture encodings, though tensor-product generation restricts weights to a low-rank space.

One-shot and differentiable search. DARTS relaxes discrete operation choices to a softmax mixture over each DAG edge, jointly optimizing architecture parameters α and network weights w via bilevel gradient descent — cutting search to 1.5 GPU-days. ProxylessNAS binarizes the mixture so only one path is active per step (one order of magnitude memory saving) and makes latency differentiable as E[latency] = Σ p_j · F(o_j) for hardware-aware search.
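
The expected-latency term is just a probability-weighted sum, which is why it stays differentiable in the architecture parameters. The sketch below assumes the path probabilities p_j come from a softmax over architecture parameters; the latency values are hypothetical stand-ins for F(o_j).

```python
import math

def softmax(alphas):
    m = max(alphas)
    exps = [math.exp(a - m) for a in alphas]
    z = sum(exps)
    return [e / z for e in exps]

def expected_latency(alphas, latencies):
    """E[latency] = sum_j p_j * F(o_j), with p = softmax(alphas)."""
    return sum(p * f for p, f in zip(softmax(alphas), latencies))

latency_ms = [1.0, 3.0, 5.0]     # hypothetical F(o_j) for three candidate ops

print(expected_latency([0.0, 0.0, 0.0], latency_ms))   # uniform mixture
print(expected_latency([10.0, 0.0, 0.0], latency_ms))  # mass on the fastest op
```

Adding this term to the loss pushes probability toward cheaper operations without needing a non-differentiable latency measurement in the loop.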

Unsupervised NAS (UnNAS) finds rank correlation > 0.8 between pretext-task accuracy and supervised accuracy. AutoML-Zero uses aging evolution over primitive math operations to discover entire ML algorithms from scratch.

neural architecture search · AutoML · DARTS · weight sharing · search space

Deep Learning Theory & Generalization

0 tier-5 · 3 tier-4

The "why does it work" cluster — Weng's more theoretical essays on the puzzles of deep learning. Why do massively over-parameterized networks generalize instead of overfitting (Occam/MDL, double descent, intrinsic dimension, the lottery ticket hypothesis)? What does information theory say about training dynamics (the Information Bottleneck fit-then-compress story)? And what does the infinite-width limit reveal (Neural Tangent Kernel convergence)? These are curated tours of debated, foundational theory rather than implementation guides — for the reader who wants to understand the principles under the empirical success.

Anatomize Deep Learning with Information Theory

TIER 4 Sep 28, 2017

DNN training has two phases: early fitting increases the information each layer carries about the labels, I(T;Y); then compression discards irrelevant details, minimizing I(X;T) while preserving I(T;Y). The Data Processing Inequality, applied to DNNs as Markov chains, forces I(X;hᵢ) to be non-increasing with depth. The information plane tracks each layer by encoder I(X;T) vs. decoder I(T;Y), showing layers drift left during compression. Classic VC-dimension bounds fail for deep nets; Tishby's input-compression bound replaces |H_ε| with 2^{I(Tε;X)}, grounding generalization in mutual information. More layers accelerate compression exponentially via stochastic relaxation (Δtₖ ~ exp(ΔSₖ)); more data pushes I(T;Y) toward the IB limit.

information theory · information bottleneck · deep learning theory · generalization · mutual information

Are Deep Neural Networks Dramatically Overfitted?

TIER 4 Mar 14, 2019

Deep neural networks routinely achieve zero training error yet generalize well — the opposite of what classical bias-variance theory predicts. Parameter count is a poor proxy for true complexity, and several converging results explain why.

MDL formalizes Occam's Razor as minimizing L(H) + L(D|H); Kolmogorov complexity defines algorithmic complexity as the shortest program reproducing an object. By either measure, a memorizing network is formally worst despite perfect training accuracy.

Zhang et al. (2017) confirm DNNs can memorize: a two-layer ReLU network with 2n + d weights fits any n points in d dimensions, proven via a lower-triangular M_ReLU that is always nonsingular. On CIFAR-10 with randomly shuffled labels, networks still reach near-zero training loss, and explicit regularization (dropout, weight decay, data augmentation) barely moves generalization error — ruling it out as the fundamental cause.

Belkin et al. (2018) replace the classical U-shaped risk curve with a double-U: past the interpolation threshold, test error falls again as larger models find lower-norm interpolating functions.

Two empirical measures reveal hidden simplicity. Intrinsic dimension (Li et al., 2018) restricts optimization to a random d-dimensional subspace via θ(D) = θ₀(D) + Pθ(d); a 650k-parameter FC network on CIFAR-10 has intrinsic dimension ~9k. Layer robustness (Zhang et al., 2019) shows that re-randomizing any layer destroys performance, while resetting all but the earliest layers to their initial values causes negligible degradation.

The Lottery Ticket Hypothesis (Frankle & Carbin, 2019) ties these together: a sparse subnetwork, rewound to its original initialization and trained alone, matches full-network accuracy. The large parameter space is needed for training, not for the final solution.

generalization · overfitting · deep learning theory · lottery ticket · MDL

Some Math behind Neural Tangent Kernel

TIER 4 Sep 8, 2022

Overparameterized neural networks converge reliably to global minima under gradient descent, and the Neural Tangent Kernel (NTK; Jacot et al. 2018) explains why. NTK is defined as K(x, x'; θ) = ∇_θf(x)ᵀ∇_θf(x') and arises naturally from differentiating the network output through time: how a gradient update on one sample shifts predictions elsewhere is determined entirely by this kernel.

The central result: as all hidden-layer widths → ∞, the NTK converges to a deterministic limit K_∞ that depends only on architecture, not on random initialization, and stays constant throughout training. The proof proceeds by induction, leveraging that infinite-width pre-activations converge to a Gaussian process (NNGP; Neal 1994, Lee & Bahri 2018) — established via CLT on i.i.d. hidden units — with recursive covariance Σ^(l+1)(x,x') = E_{f~N(0,Λ^(l))}[σ(f(x))σ(f(x'))] + β².

With K_∞ constant, the training dynamics reduce to a linear ODE. Under MSE loss it solves in closed form: f(θ(t)) = e^{−ηK_∞t} f(θ(0)) + (I − e^{−ηK_∞t})Y, guaranteeing convergence when K_∞ is positive definite. This is the linearized model regime (Lee & Xiao et al. 2019). It also explains lazy training (Chizat et al. 2019): as width grows, the relative Jacobian change κ(θ) → 0, so parameters barely move while loss collapses to zero.
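
For a single training point the kernel is a positive scalar k, and the closed-form solution can be checked numerically. The sketch below is this scalar special case only; the values of k, η, and the targets are arbitrary illustrative choices.

```python
import math

def ntk_prediction(f0, y, k, eta, t):
    """Scalar instance of f(t) = e^{-eta*k*t} f(0) + (1 - e^{-eta*k*t}) y."""
    decay = math.exp(-eta * k * t)
    return decay * f0 + (1.0 - decay) * y

f0, y, k, eta = 0.0, 1.0, 2.0, 0.1
for t in (0.0, 5.0, 20.0, 100.0):
    print(t, ntk_prediction(f0, y, k, eta, t))
```

The prediction starts at f(0) and decays exponentially toward the target at rate ηk, which is why positive definiteness of the kernel (here k > 0) guarantees convergence.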

neural-tangent-kernel · deep-learning-theory · optimization · infinite-width · convergence

Text Generation, Retrieval & NLP

0 tier-5 · 2 tier-4

How to steer and ground language generation — the applied-NLP work that predates and frames much of the later prompting and RAG literature. The controllable-generation survey organizes steerability along three axes (decoding strategies, prompt design, fine-tuning/steerable layers) and reads as an early map of techniques that the alignment and prompt-engineering posts would later absorb. The open-domain QA survey covers retrieval-augmented architectures end to end (closed-book parametric knowledge vs open-book retriever-reader/generator pipelines: DPR, REALM, RAG, Fusion-in-Decoder), a foundational reference for grounded systems.

How to Build an Open-Domain Question Answering System?

TIER 4 Oct 29, 2020

Factoid QA without a provided context divides into three architectures — retriever-reader, retriever-generator, and closed-book generative — each trading retrieval quality against extraction precision and memorized knowledge.

Retriever-reader systems chain document retrieval with reading comprehension. DrQA set the template: TF-IDF bigram retrieval over Wikipedia returns top-5 articles; a BiLSTM reader predicts answer start/end from GloVe embeddings, exact-match flags, POS/NER/TF features, and attention-aligned question vectors. BERTserini swapped in Anserini+BM25 (paragraph beats sentence or article) and a BERT reader. Multi-passage BERT added global softmax normalization across all passages. DenSPI eliminates the reader: Wikipedia phrase spans are pre-encoded as dense+sparse vectors (BERT for semantics, 2-gram TF for lexical precision), answers found by nearest-neighbor search.

End-to-end systems jointly train retriever and reader. R^3 applies REINFORCE: a Match-LSTM ranker samples passages as a policy, rewarded by reader span accuracy. ORQA trains dual BERT encoders with dot-product retrieval scores, pre-trains the passage encoder via Inverse Cloze Task (predict context from a sentence), then freezes it for precomputed MIPS. REALM upgrades to salient span masking (named entities and dates) and refreshes the index asynchronously. DPR skips pre-training, training on supervised Q/A pairs with in-batch and BM25 hard negatives, using FAISS.

Retriever-generator swaps span extraction for free-text generation. RAG pairs DPR with BART, jointly trained by NLL; RAG-seq uses one document per answer, RAG-token marginalizes per generated token. Fusion-in-Decoder (T5) encodes each retrieved passage independently then fuses them in the decoder to aggregate multi-passage evidence.

Closed-book systems use no retrieval. T5 with salient span masking continued pre-training and per-dataset fine-tuning achieves competitive accuracy from parameter memory; 11B T5 matches a DPR ensemble of three 330M BERT models. GPT-3 shows fine-tuning unnecessary: few-shot prompting matches supervised baselines on TriviaQA.

Cross-paradigm caveat: 58–71% of test answers appear in training sets; removing near-duplicates substantially degrades performance, suggesting evaluations partly measure memorization.

question answering · retrieval · DPR · RAG · reading comprehension

Controllable Neural Text Generation

TIER 4 Jan 2, 2021

Steering a pretrained language model toward desired topic, sentiment, or style without retraining requires three families of intervention: decoding manipulation, prompt design, and fine-tuning.

Decoding strategies reshape outputs without changing weights. Top-k sampling restricts to the k most likely tokens; nucleus/top-p picks the smallest set exceeding a cumulative probability threshold. Penalized sampling discounts repeated tokens (a penalty of θ≈1.2 is reported to work well). Guided decoding augments beam scores with heuristic features (Hafez: word bags, sentiment) or with learned discriminators motivated by Grice's maxims. Regularized beam search (Meister 2020) adds a UID-motivated variance regularizer; smoothing surprisal reverses BLEU degradation at larger beam widths. LFIW estimates importance weights via a real-vs-generated classifier for resampling.
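
The two truncation rules are easy to compare on a toy next-token distribution. This is a minimal sketch over a plain list of probabilities; the function names and the four-token distribution are made up for illustration, and sampling from the filtered distribution is left out.

```python
def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalise."""
    keep = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    z = sum(probs[i] for i in keep)
    return {i: probs[i] / z for i in keep}

def top_p_filter(probs, p):
    """Keep the smallest high-probability set whose mass reaches p, then renormalise."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, mass = [], 0.0
    for i in order:
        keep.append(i)
        mass += probs[i]
        if mass >= p:
            break
    return {i: probs[i] / mass for i in keep}

probs = [0.5, 0.3, 0.15, 0.05]
print(top_k_filter(probs, 2))
print(top_p_filter(probs, 0.9))
```

Top-k always keeps a fixed number of tokens, while top-p keeps more tokens when the distribution is flat and fewer when it is peaked.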

Prompt design shapes output through context alone. AutoPrompt finds universal trigger tokens by gradient-based embedding updates, outperforming manual prompts on NLU and fact retrieval. Prefix-Tuning prepends trainable vectors to every transformer layer (LM frozen), reparameterized via a small MLP; it matches fine-tuning in low-data regimes. P-tuning inserts pseudo-tokens at flexible positions, using an LSTM encoder to model inter-token dependencies. Prompt Tuning prepends k soft tokens at the input only; at billion-parameter scale it matches full fine-tuning with better domain-shift robustness.

Fine-tuning spans CTRL (trained from scratch on domain-tagged corpora) and RL via REINFORCE/PPO on BLEU, ROUGE, or human-preference rewards, with a KL penalty R(x,y) = R_ψ(x,y) − β log(π(y|x)/p(y|x)) to prevent drift from the pretrained LM p. PPLM runs a forward–backward–forward loop at each decoding step, shifting hidden states toward a bag-of-words or discriminator attribute model; it is controllable but slow. GeDi uses a class-conditional LM for Bayesian posteriors over control/anti-control codes, guiding a larger LM 30× faster than PPLM. GDC frames control as EBM optimization with moment constraints, training a policy via distributional policy gradient; a gender constraint raised female representation in GPT-2 biography outputs from 7.4% to 35.6%. Unlikelihood training adds a loss term suppressing unwanted tokens, correcting frequency skew from MLE.
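
The KL-penalized reward used in RL fine-tuning is a one-line computation once the two sequence probabilities are known. In this sketch π is the policy being tuned and p the frozen pretrained LM; the probabilities, β, and the function name are hypothetical.

```python
import math

def penalized_reward(r_psi, pi_prob, p_prob, beta=0.1):
    """R(x, y) = R_psi(x, y) - beta * log(pi(y|x) / p(y|x))."""
    return r_psi - beta * math.log(pi_prob / p_prob)

# Same raw reward, different amounts of drift from the pretrained LM.
print(penalized_reward(1.0, pi_prob=0.2, p_prob=0.2))   # no drift, no penalty
print(penalized_reward(1.0, pi_prob=0.8, p_prob=0.2))   # policy over-favours y
```

The penalty grows with how much more likely the tuned policy makes a sample than the original model did, so the policy cannot chase reward by moving arbitrarily far from it.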

text generation · decoding · controllable generation · prompting · fine-tuning

Field Overviews & Optimization

0 tier-5 · 2 tier-4

Two pieces that sit slightly apart from the topical clusters: a broad on-ramp to the field and an alternative to gradient-based optimization. The deep-learning overview (adapted from a meetup talk) tours the classical model families — CNNs, RNN/LSTM, seq2seq, autoencoders, RL, GANs — with intuition and canonical pointers, serving as an accessible entry point. The evolution-strategies explainer makes the case for gradient-free, black-box optimization (CMA-ES, natural evolution strategies, OpenAI's parallelizable ES) as a viable alternative to SGD, especially in deep RL. Useful bookends: one for breadth, one for an underused optimization family.

An Overview of Deep Learning for Curious People

TIER 4 Jun 21, 2017

Deep learning works now because data scale and compute finally match the model's appetite. On small datasets, SVMs, GBMs, and random forests win; large neural nets only outperform them once data is abundant enough to tune exponentially many parameters without manual feature engineering.

CNNs mirror the visual cortex — kernels detect edges, pooling compresses, and fully-connected layers classify; ResNets add skip connections between non-adjacent layers. RNNs pass state across time steps; LSTM cells add gating logic controlling what to retain, forget, and update. Seq2seq models pair encoder and decoder RNNs with a context vector, powering translation and chatbots. Autoencoders compress input through a bottleneck, outperforming PCA on document compression (Hinton & Salakhutdinov). AlphaGo combines supervised pretraining on professional games with RL self-play to strengthen its policy network. GANs pit a generator against a discriminator in a zero-sum game, adversarial pressure driving synthetic images toward indistinguishability.

deep learning overview · CNN · RNN/LSTM · GAN · reinforcement learning

Evolution Strategies

TIER 4 Sep 5, 2019

Evolution Strategies (ES) optimize real-valued parameter vectors by iterating over a population distribution rather than computing gradients — viable when the objective is black-box or non-differentiable.

Simple Gaussian ES samples offspring from $\mathcal{N}(\mu, \sigma^2 I)$, keeps the top-$\lambda$ elite samples, and re-estimates $\mu$ and $\sigma$ from that set. The problem: $\sigma$ adapts slowly, so the search cannot change its exploration scale quickly.
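
The sample, select, refit loop of simple Gaussian ES fits in a short standard-library sketch. The population size, elite count, toy objective, and the choice to refit σ from elite deviations around the previous mean are all assumptions made for illustration.

```python
import random

def simple_gaussian_es(fitness, mu, sigma, pop_size=50, n_elite=10,
                       n_iters=60, seed=0):
    """Isotropic Gaussian ES: sample, keep the elite, refit mean and sigma."""
    rng = random.Random(seed)
    n = len(mu)
    for _ in range(n_iters):
        pop = [[rng.gauss(mu[j], sigma) for j in range(n)] for _ in range(pop_size)]
        elite = sorted(pop, key=fitness, reverse=True)[:n_elite]
        new_mu = [sum(x[j] for x in elite) / n_elite for j in range(n)]
        # Refit one shared sigma from elite deviations around the previous mean.
        var = sum((x[j] - mu[j]) ** 2 for x in elite for j in range(n)) / (n_elite * n)
        mu, sigma = new_mu, max(var ** 0.5, 1e-8)
    return mu, sigma

def neg_sphere(x):
    # Toy objective (an assumption for illustration): maximised at (3, -2).
    return -((x[0] - 3.0) ** 2 + (x[1] + 2.0) ** 2)

best, final_sigma = simple_gaussian_es(neg_sphere, mu=[0.0, 0.0], sigma=1.0)
print(best, final_sigma)
```

Because a single scalar σ is shared by every dimension, the search cannot stretch along one direction and shrink along another, which is the gap a full covariance matrix fills.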

CMA-ES fixes this by tracking a full covariance matrix $C$. Step size $\sigma$ is updated via an evolution path $p_\sigma$ that accumulates mean-shift directions; $\sigma$ grows or shrinks based on whether $\|p_\sigma\|$ deviates from its expected length under uncorrelated selection. The covariance matrix blends two estimates: a rank-$\min(\lambda,n)$ update from current elite samples, and a rank-one update from a second path $p_c$ that preserves sign information lost in the outer product $yy^\top$.

Natural Evolution Strategies (NES) replaces heuristic elite selection with gradient ascent on expected fitness via the natural gradient $F_\theta^{-1}\nabla_\theta\mathcal{J}$, where $F_\theta$ is the Fisher information matrix. This penalizes high-uncertainty directions and steps at constant KL distance along the distributional manifold. NES adds rank-based fitness shaping and Mann-Whitney U-test adaptation sampling for hyperparameter adjustment.

OpenAI ES applies NES to RL: workers perturb policy parameters with noise $\epsilon_i$, evaluate rollouts in parallel, and share only random seeds — no weights transmitted. NS-ES replaces fitness with a k-NN novelty score in behavior space to escape local optima; NSR-ES and NSRA-ES blend novelty and fitness with fixed or adaptive weights.

Two EA extensions: PBT jointly evolves weights and hyperparameters by letting parallel workers copy better peers and perturb hyperparameters; WANN searches for minimal topologies evaluated under a single shared weight, showing structure alone achieves non-trivial performance.

evolution strategies · black-box optimization · CMA-ES · reinforcement learning · gradient-free