Tech & AI

Eugene Yan

100 issues · 100 keepers · 31 tier-5 · 69 tier-4

Recommendation & Search System Design

6 tier-5 · 4 tier-4

The spine of Eugene's applied-ML reputation: how discovery systems are actually built. The cornerstone is the offline/online × retrieval/ranking 2×2 that NVIDIA and Xavier Amatriain went on to cite; around it sit a baseline recommender built on a laptop, the graph+NLP follow-up that beats it, a real-time-ML teardown with worked equations, the query-matching survey for search, and two recent landmarks bridging recsys with language modeling (LLM-augmented search/rec and a from-scratch LLM-RecSys hybrid with semantic IDs). The cluster matters because it gives a coherent architecture for any retrieve-then-rank product and tracks where the field moved as LLMs arrived.

Building a Strong Baseline Recommender in PyTorch, on a Laptop

TIER 5 Jan 6, 2020

Part 1 of the recsys series, building an item-item collaborative-filtering baseline via matrix factorization on a 16GB laptop using the Amazon product-relationships dataset. Details the full pipeline—parsing huge JSON, deriving weighted product-pair scores, an efficient negative-sampling hack (100x faster), and pair-by-pair PyTorch matrix factorization with continuous/binary labels and L2 regularization—reaching AUC-ROC ~0.8. A thorough, reproducible applied-ML walkthrough that anchors the series.

recsys · matrix-factorization · pytorch · collaborative-filtering · negative-sampling

Beating the Baseline Recommender with Graph & NLP in Pytorch

TIER 5 Jan 13, 2020

Part 2 of a recsys series showing how graph + NLP techniques (random walks over a product graph + word2vec node embeddings) beat a matrix-factorization baseline, lifting AUC-ROC from ~0.8 to ~0.91-0.96. Covers transition-matrix/sparse-matrix tricks for memory efficiency, node2vec pitfalls, gensim vs. from-scratch PyTorch word2vec (subsampling, negative sampling, skip-gram), and extending embeddings with side info. A rigorous, code-backed deep-dive with lasting reference value for recsys and embedding practitioners.

recsys · word2vec · graph-embeddings · pytorch · nlp
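
To make the random-walk step in the summary above concrete, here is a minimal sketch, assuming a small weighted product graph stored as a dict of dicts. The graph, the function name `random_walks`, and its parameters are illustrative assumptions, not code from the post. Each walk is a sequence of product IDs that could then be fed to a word2vec implementation as a "sentence".

```python
import random


def random_walks(graph, walk_length, walks_per_node, seed=0):
    """Generate weighted random walks over a graph.

    graph: dict mapping node -> {neighbor: edge_weight} (hypothetical format).
    Returns a list of walks, each a list of node IDs.
    """
    rng = random.Random(seed)  # seeded so runs are repeatable
    walks = []
    for start in graph:
        for _ in range(walks_per_node):
            walk = [start]
            while len(walk) < walk_length:
                neighbors = graph.get(walk[-1])
                if not neighbors:
                    break  # dead end: stop this walk early
                nodes = list(neighbors)
                weights = [neighbors[n] for n in nodes]
                # Pick the next node with probability proportional to edge weight.
                walk.append(rng.choices(nodes, weights=weights, k=1)[0])
            walks.append(walk)
    return walks


# Toy product graph: edge weights stand in for product-pair scores.
graph = {
    "A": {"B": 2.0, "C": 1.0},
    "B": {"A": 2.0, "C": 0.5},
    "C": {"A": 1.0, "B": 0.5},
}
walks = random_walks(graph, walk_length=5, walks_per_node=2)
```

The walks play the role of sentences and the product IDs the role of words, which is why a skip-gram model can learn node embeddings from them.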

eugeneyan

TIER 4 Apr 26, 2020

A survey, drawn from 10+ papers, on why recommender systems need more than accuracy: serendipity (and its cousins diversity, novelty, surprise) keeps users engaged and supports assortment/seller health, even though accuracy metrics dominate because they're easier to measure and tool-supported. A rigorous explainer on beyond-accuracy evaluation in recsys.

recsys · serendipity · evaluation · diversity · survey

eugeneyan

TIER 4 Sep 27, 2020

Notes and paper highlights from RecSys 2020: rising emphasis on bias/ethics and inverse propensity scoring for debiasing, a shift toward sequence models and bandits/RL, Netflix's user research on recommendation complexity (placement/person/context and 1:1 vs 1:many expectations), and work showing how offline-evaluation sampling and data-splitting choices change relative model rankings. A substantive conference digest valuable for recsys practitioners.

recsys · recommendation-systems · bias · offline-evaluation · sequence-models

eugeneyan

TIER 5 Jan 10, 2021

A deep teardown of real-time ML for recommendations: when real-time is (and isn't) worth the ops cost vs. batch, how China and US companies implement it (collaborative filtering and Alibaba's Swing algorithm with worked equations and code, candidate-generation-plus-ranking architecture, Alibaba 1688's design), and how to design and build a simple MVP. Includes primers; a landmark, reference-grade recsys-engineering piece.

recommendation systems · real-time ML · collaborative filtering · Swing algorithm · production ML

eugeneyan

TIER 4 Apr 25, 2021

A technical deep-dive on query processing and matching in search, contrasting lexical, graph-based, and embedding-based approaches. It explains normalization, spellcheck, query expansion/relaxation/translation, and retrieval, illustrated with DoorDash's spell-correction ordering trick and Yahoo's bipartite click-graph translation model for cold-start queries. A solid applied-search reference.

search · query-matching · query-expansion · information-retrieval · embeddings

eugeneyan

TIER 5 Jun 27, 2021

The reference essay establishing a 2x2 mental model for discovery (recommendation and search) system design: offline vs. online environments crossed with candidate retrieval vs. ranking. It explains how offline batch processes produce artifacts (embeddings, ANN indices, feature stores) consumed by online serving, and grounds the pattern in Alibaba, Facebook, JD, and DoorDash examples; the 2x2 has since been cited by NVIDIA and Xavier Amatriain. A landmark ML-system-design framework.

system-design · recsys · search · retrieval-and-ranking · ml-architecture

eugeneyan

TIER 4 Dec 24, 2023

An industry teardown of push notifications framed as a recsys variant where user intent is unknown, covering what to push (complements over substitutes per Alibaba, personalized insights for power users per JOOL, recovering/sleeping bandits per Duolingo, hyper-local relevance per DPG Media), what not to push (LinkedIn engagement filtering, Pinterest unsubscribe prediction), and how many to push to minimize unsubscribes. A substantive, well-sourced applied-ML design reference.

push-notifications · recsys · bandits · industry-teardown · engagement

Improving Recommendation Systems & Search in the Age of LLMs

TIER 5 Mar 16, 2025

A sweeping (43-minute) survey of how industrial search and recommendation systems evolved over the prior year by drawing on LLMs, organized by model architectures (semantic IDs, multimodal content towers, LLM-augmented ID models), data generation (LLM-curated metadata and synthetic queries), training paradigms (distillation, pretraining-then-finetune, optimizer/dropout tricks), and unified search+rec frameworks (LinkedIn 360Brew, Netflix UniCoRn). Cites dozens of industry papers (YouTube, Bing, Spotify, Best Buy, etc.). A landmark reference bridging RecSys and language modeling.

recsys · search · llm · survey · system-design

Training an LLM-RecSys Hybrid for Steerable Recs with Semantic IDs

TIER 5 Sep 14, 2025

A from-scratch technical walkthrough (with code) of training a 'bilingual' Qwen3-8B that speaks both English and item IDs: it extends the LLM vocabulary with semantic-ID tokens generated by an RQ-VAE, applies continued pretraining plus behavioral finetuning, and yields a model that recommends from the catalog and can be steered and explained via natural language. Covers RQ-VAE loss derivation, codebook/commitment-weight ablations, and a SASRec item-ID vs semantic-ID baseline comparison. A landmark applied experiment marrying RecSys and language modeling.

recsys · semantic-ids · rq-vae · llm-finetuning · generative-recommendation
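
The semantic-ID idea in the summary above can be sketched as residual quantization: at each level, pick the nearest codebook vector, subtract it, and pass the residual to the next level. The sketch below assumes fixed toy codebooks (in an RQ-VAE they are learned), and the token format `<sid_LEVEL_INDEX>` is a hypothetical illustration, not the post's vocabulary.

```python
def nearest(codebook, vector):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    best_index, best_dist = 0, float("inf")
    for index, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vector, code))
        if dist < best_dist:
            best_index, best_dist = index, dist
    return best_index


def semantic_id(embedding, codebooks):
    """Quantize an embedding level by level; return (ids, final residual)."""
    residual = list(embedding)
    ids = []
    for codebook in codebooks:
        index = nearest(codebook, residual)
        ids.append(index)
        # Subtract the chosen code so the next level models what is left over.
        residual = [r - c for r, c in zip(residual, codebook[index])]
    return tuple(ids), residual


def to_tokens(ids):
    """Hypothetical token format for extending an LLM vocabulary."""
    return [f"<sid_{level}_{index}>" for level, index in enumerate(ids)]


# Toy, hand-written codebooks (assumption): one coarse level, one fine level.
codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    [[0.2, 0.0], [0.0, 0.2], [-0.2, 0.0], [0.0, -0.2]],
]
ids, residual = semantic_id([0.9, 0.25], codebooks)
tokens = to_tokens(ids)
```

Items with similar embeddings share leading IDs, which is what lets a language model treat the ID prefix as a coarse category.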

LLM Evals & LLM-as-Judge

5 tier-5 · 3 tier-4

Eugene's most concentrated tier-5 cluster argues a single thesis from many angles: evals are the moat for LLM products, off-the-shelf evals usually don't work, and the discipline is the scientific method in disguise — look at your data first, then write criteria, then align a judge per dimension and measure it with classification metrics (precision/recall/Cohen's Kappa) against a human-level bar. The pieces move from broad surveys (task-specific evals, LLM-as-judge, summarization, long-context Q&A) to a tight reusable three-step recipe, plus a build log (AlignEval) and a corrective insisting evals are practices, not tools you buy.

eugeneyan

TIER 5 Sep 3, 2023

A 23-minute survey on evaluating abstractive summaries and detecting hallucination, organized around the four dimensions (fluency, coherence, relevance, consistency) and four families of metrics: reference-based (ROUGE, METEOR, BERTScore, MoverScore), context-based reference-free (ROUGE-C, adapted embeddings, G-Eval), preference-based, and sampling-based, with NLI and QA as the core hallucination-detection methods. It also documents why references are a bottleneck (often lower quality than LLM output), making it a strong reference for summarization evals.

summarization-evals · hallucination · nli · rouge-bertscore · survey

eugeneyan

TIER 4 Nov 5, 2023

An experimental write-up showing that pre-finetuning a BART-MNLI model on out-of-domain Wikipedia summaries (USB) before finetuning on news summaries (FIB) yields a 23% PR-AUC improvement and far better probability separation than finetuning on FIB alone, despite USB alone barely helping. The takeaway is that transfer learning extends beyond pretraining into finetuning, so permissive open-source data can bootstrap task-specific hallucination detection. A concrete, code-backed applied-ML result.

hallucination-detection · finetuning · nli · transfer-learning · qlora

eugeneyan

TIER 5 Mar 31, 2024

A 33-minute survey of task-specific LLM evals that actually work, arguing off-the-shelf evals usually don't and offering concrete, discriminative metrics per task: classification/extraction (recall/precision, ROC-AUC, PR-AUC, separation of distributions/JSD), summarization (NLI-based consistency, reward-model relevance, length), translation (chrF, BLEURT, COMET), copyright regurgitation, and toxicity. It also covers when human evaluation is still needed and how to calibrate the eval bar to risk, making it a foundational reference for applied eval design.

llm-evals · classification-metrics · summarization · nli · survey

eugeneyan

TIER 5 Aug 18, 2024

A 49-minute survey distilling two dozen papers into a mental model for LLM-evaluators (LLM-as-judge), covering the key upfront decisions: baseline to compare against, direct scoring vs. pairwise vs. reference-based, and classification vs. correlation metrics. It then walks through use cases, prompting techniques, alignment, finetuning evaluator models, and critiques, arguing for binary outputs and classification metrics over correlation. A landmark reference for anyone adopting LLM-evaluators in production.

llm-as-judge · llm-evals · survey · metrics · prompting

eugeneyan

TIER 4 Oct 27, 2024

Introduces AlignEval, a free app for building LLM-evaluators in four steps (upload data, label binary pass/fail, write criteria and run the evaluator, optimize against labels), with the core thesis that you must calibrate human criteria to AI output by looking at the data before writing eval criteria. Doubles as a build log covering framework choice (Next.js + FastAPI), structured outputs, hosting on Railway, and a semi-automated dev/test optimization loop. Notable for later being integrated into LangSmith's Align Evals.

llm-evals · llm-as-judge · tooling · app-build · data-labeling

An LLM-as-Judge Won't Save The Product—Fixing Your Process Will

TIER 4 Apr 20, 2025

Argues that evals are practices, not artifacts or tools: building product evals is the scientific method in disguise (observe data, annotate a balanced set, hypothesize failure causes, run experiments, measure, iterate), reinforced by eval-driven development (define success criteria before building, evaluate every change) and ongoing human-in-the-loop monitoring of AI output. Matters as a corrective to the belief that buying another eval tool or LLM judge fixes a broken process.

evals · eval-driven-development · scientific-method · monitoring · process

Evaluating Long-Context Question & Answer Systems

TIER 5 Jun 22, 2025

A rigorous deep-dive on evaluating long-context Q&A along two orthogonal axes (faithfulness and helpfulness), plus citation accuracy, covering how to build eval datasets via summary-based question generation, claim-decomposition for faithfulness, pairwise comparison for helpfulness, and calibrating LLM-evaluators on precision/recall/Cohen's Kappa. Includes a survey of benchmarks (NarrativeQA, QASPER, L-Eval, NovelQA, Loong) and practical advice for tailoring evals to a use case. Strong reference for RAG and document-QA evaluation.

evals · long-context · question-answering · faithfulness · benchmarks

Product Evals in Three Simple Steps

TIER 5 Nov 23, 2025

A tight, reusable recipe for building product evals: (i) label a small balanced dataset favoring binary pass/fail (and win/lose for subjective) labels with 50-100 fail cases, (ii) align one LLM-evaluator per dimension on a held-out split, controlling for position bias and measuring precision/recall/Cohen's Kappa against human-level (not perfect) benchmarks, and (iii) integrate an eval harness into the experiment pipeline for tight feedback loops, with sample-size guidance via confidence intervals. Lasting reference value for anyone shipping LLM products.

evals · llm-as-judge · production · data-labeling · experimentation
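
Step (ii) of the recipe above measures an LLM-evaluator against human labels with classification metrics. A minimal sketch of that measurement, assuming binary labels where 1 means "fail" and 0 means "pass"; the function name and the toy labels are illustrative, not taken from the post.

```python
def precision_recall_kappa(human, judge):
    """Compare binary judge labels against human labels (1 = fail, 0 = pass)."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)
    tp = sum(1 for h, j in zip(human, judge) if h == 1 and j == 1)
    fp = sum(1 for h, j in zip(human, judge) if h == 0 and j == 1)
    fn = sum(1 for h, j in zip(human, judge) if h == 1 and j == 0)
    tn = n - tp - fp - fn

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    # Cohen's kappa: observed agreement corrected for chance agreement.
    observed = (tp + tn) / n
    expected = ((tp + fn) / n) * ((tp + fp) / n) + ((tn + fp) / n) * ((tn + fn) / n)
    # Kappa is undefined when chance agreement is 1; this sketch returns 0.0 then.
    kappa = (observed - expected) / (1 - expected) if expected < 1 else 0.0
    return precision, recall, kappa


# Toy held-out split: 4 human-labeled fails, 6 passes.
human = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
judge = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
precision, recall, kappa = precision_recall_kappa(human, judge)
```

Treating "fail" as the positive class keeps precision and recall focused on the defects the evaluator is meant to catch.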

Productionizing ML — Testing, MLOps & Infrastructure

4 tier-5 · 7 tier-4

The "what happens after you train it" cluster — Eugene's most practitioner-oriented body of work. The tier-5 anchors operationalize ML/pipeline testing (software-vs-ML tests, pre-train and behavioral tests, the additive-vs-retroactive distinction for why pipeline tests break) and the feature-store hierarchy of needs. Around them: the post-deployment challenges and the practices that mitigate them, the design-doc checklist for ML systems, ML-specific unit testing, reproducible experimentation tooling (Jupyter/Papermill/MLflow), data-discovery platforms, and conference notes on ML-systems infra.

Simpler Experimentation with Jupyter, Papermill, and MLflow

TIER 4 Mar 15, 2020

A hands-on MLOps workflow for running and tracking many ML experiments without notebook duplication: Jupyter for development, Papermill to parametrize and execute one notebook across many configs (each saved to its own notebook), and MLflow to consolidate metrics, artifacts, and model binaries in a single UI. Demonstrated end-to-end on a stock-index prediction pipeline with runnable code. A useful, reproducible-experimentation explainer.

mlops · experimentation · jupyter · mlflow · reproducibility

eugeneyan

TIER 4 May 18, 2020

Catalogs six little-known challenges that emerge after ML is deployed, from the bottom up: silent schema/data changes, unwanted model interactions, messy infra and codebases, real-world bias and adversaries, org-structure friction, and the ongoing burden of customer service. A grounded, often-cited articulation of the realities of production ML maintenance.

machine-learning · production · mlops · data-drift · engineering

eugeneyan

TIER 4 May 25, 2020

A practical companion to the post-deployment challenges piece, offering ~20 concrete practices for maintaining ML in production: validating incoming data and distributions, monitoring models on retraining, simplifying engineering, minimizing feedback loops and bias, structuring teams, and crowdsourcing customer complaints, with prioritized must-haves vs. good-to-haves. A useful MLOps reference.

mlops · machine-learning · production · data-validation · monitoring

eugeneyan

TIER 4 Jun 28, 2020

Conference notes on application-agnostic Spark+AI Summit 2020 talks: deep-learning efficiency via pruning, quantization, and knowledge distillation; how (not) to scale DL training (GPUs, early stopping, Petastorm, Horovod); PyTorch production tooling; probabilistic data structures (Bloom filters, HyperLogLog, count-min sketch as monoids); and the economics of broadcast vs. sort-merge joins. A substantive distillation of practical ML-systems knowledge.

deep-learning · model-compression · distillation · probabilistic-data-structures · spark
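
The "probabilistic data structures as monoids" point above means two sketches built on separate partitions can be merged with an associative operation that has an identity element. A minimal Bloom filter sketch illustrating this, assuming a Python int as the bit array; the class, its sizes, and the hashing scheme are illustrative assumptions, not code from the talks.

```python
import hashlib


class BloomFilter:
    def __init__(self, size=256, num_hashes=3, bits=0):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bits  # Python int used as a bit array

    def _positions(self, item):
        # Derive several bit positions by salting one hash function.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def __contains__(self, item):
        # True may be a false positive; False is always correct.
        return all((self.bits >> pos) & 1 for pos in self._positions(item))

    def merge(self, other):
        """Monoid operation: bitwise OR. The empty filter is the identity."""
        if (self.size, self.num_hashes) != (other.size, other.num_hashes):
            raise ValueError("filters must share size and hash count")
        return BloomFilter(self.size, self.num_hashes, self.bits | other.bits)


# Build one filter per partition, then combine them in any grouping.
a, b, c = BloomFilter(), BloomFilter(), BloomFilter()
a.add("x")
b.add("y")
c.add("z")
merged = a.merge(b).merge(c)
```

Because the merge is associative, partial filters can be combined in any order across workers, which is what makes the structure convenient for distributed aggregation.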

eugeneyan

TIER 4 Jul 5, 2020

Detailed conference notes on application-specific talks from Spark+AI Summit 2020, covering Airbnb's Zipline (point-in-time feature engineering with Abelian-group and binary-tree aggregations), Gojek's Feast feature store, Netflix's data-quality via KS-tests and swim lanes, LinkedIn's abuse detection with isolation forests and graph clustering, and Zynga's production reinforcement learning. A dense, practitioner-grade survey of real-world ML data infrastructure.

spark · feature-store · data-quality · anomaly-detection · reinforcement-learning

eugeneyan

TIER 5 Sep 6, 2020

A hands-on deep-dive (with a companion GitHub repo) on how to test ML code and systems, distinguishing software tests (written logic) from ML tests (learned logic) and splitting the latter into pre-train tests (shape, output range, leakage, overfit-on-train) and post-train behavioral tests (invariance, directional expectations, minimum functionality) inspired by the CheckList paper. It is a widely referenced applied-ML engineering resource that operationalizes ML testing with concrete pytest examples.

ml-testing · mlops · engineering · pytest · behavioral-testing
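
The test categories named above can be illustrated with a short pytest-style sketch. It assumes a toy keyword scorer, `toy_sentiment`, as a hypothetical stand-in for a trained model; the word lists and test cases are illustrative and are not taken from the post or its companion repo.

```python
POSITIVE = {"great", "good", "love"}
NEGATIVE = {"bad", "terrible", "hate"}


def toy_sentiment(text):
    """Hypothetical stand-in for a trained model: returns a score in (0, 1)."""
    words = text.lower().split()
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return (1 + pos) / (2 + pos + neg)


def test_output_range():
    # Output-range check: scores must stay within [0, 1].
    for text in ["", "great good love", "bad terrible hate"]:
        assert 0.0 <= toy_sentiment(text) <= 1.0


def test_invariance():
    # Swapping a person's name should not change the prediction.
    assert toy_sentiment("Alice said the food was great") == toy_sentiment(
        "Bob said the food was great"
    )


def test_directional_expectation():
    # Appending a negative clause should lower the score.
    base = "the food was good"
    assert toy_sentiment(base + " but the service was terrible") < toy_sentiment(base)


def test_minimum_functionality():
    # Simple, unambiguous inputs the model must get right.
    assert toy_sentiment("great") > 0.5
    assert toy_sentiment("terrible") < 0.5
```

With a real model, the same tests would call its predict function instead; the assertions encode expected behavior rather than exact outputs.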

eugeneyan

TIER 4 Oct 25, 2020

A technical survey of 10+ data discovery platforms (Lyft Amundsen, Facebook Nemo, LinkedIn DataHub, Spotify, etc.), organizing them around the questions users ask and the features that solve them: free-text/popularity-ranked search, schemas and column statistics, data lineage, and surfacing owners/frequent users. A genuinely useful reference for anyone building or evaluating data catalogs, with attention to open-source options.

data-discovery · data-catalog · metadata · data-engineering · system-design

eugeneyan

TIER 5 Feb 21, 2021

Organizes feature-store capabilities into a Maslow-style hierarchy of needs — access (reduce duplication/reuse), serving (real-time, low-latency), integrity (train-serve skew, point-in-time correctness), convenience (easy APIs), and autopilot (backfilling, monitoring) — and grounds each level in how GoJek (Feast), Uber (Palette), Monzo, Airbnb, and DoorDash built theirs. An original framework and a definitive teardown of the feature-store landscape, with lasting reference value.

feature stores · MLOps · ML infrastructure · train-serve skew · recsys
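
Point-in-time correctness, named at the integrity level above, means a training row may only see feature values that existed at the time of its label event. A minimal sketch, assuming a feature log sorted by timestamp; the function name and the toy data are illustrative and do not reflect any particular feature store's API.

```python
from bisect import bisect_right


def point_in_time_value(feature_log, event_time):
    """Return the latest feature value recorded at or before `event_time`.

    feature_log: list of (timestamp, value) pairs sorted by timestamp.
    Returns None if no value existed yet.
    """
    timestamps = [ts for ts, _ in feature_log]
    index = bisect_right(timestamps, event_time)
    if index == 0:
        return None  # the feature had not been computed yet
    return feature_log[index - 1][1]


# Toy log of one user's feature value over time.
feature_log = [(10, 0.1), (20, 0.4), (30, 0.9)]

# Label events at t=5, t=20, t=25: none may see the value written at t=30.
values = [point_in_time_value(feature_log, t) for t in (5, 20, 25)]
```

Joining on the latest value (0.9 here) would leak future information into training, which is one source of the train-serve skew the hierarchy calls out. Whether a value written exactly at the event time counts as visible is a design choice; this sketch includes it.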

eugeneyan

TIER 5 Mar 7, 2021

A detailed checklist for writing design docs for ML systems, organized as Why/What (motivation, success criteria, requirements, scope, assumptions) and How (methodology: problem framing, data, techniques, validation/A-B, human-in-the-loop; and implementation: high-level design, infra/scalability, latency, security, privacy, monitoring, cost, integration). Adds a two-stage pre-review/review process and a lean template repo — a lasting, reference-grade resource for ML practitioners.

ML system design · design docs · ML engineering · A/B testing · MLOps

eugeneyan

TIER 5 Sep 4, 2022

A thorough analysis of why data/ML pipeline tests break even when new code is correct, walking through a concrete CTR pipeline with row/column/table-level unit tests, schema tests, and integration tests, then showing how adding new data (client-side impressions) affects each. Introduces the key additive-vs-retroactive distinction: row-level and schema tests are robust to change while column/table/integration tests are brittle, with guidance on test granularity, validity, and property-based testing. A reference-grade essay on testing ML pipelines.

pipeline-testing · ml-engineering · unit-tests · data-quality · mlops

eugeneyan

TIER 4 Feb 25, 2024

Argues that conventional unit-testing practices break down for ML because code learns logic from data rather than containing it, so mocking the model defeats the purpose. Offers practical guidelines: use small inline data samples, test against random/empty weights for shape and device checks, write critical (slow-marked) tests against the actual model to verify training dynamics and output semantics, and don't test external libraries. A useful, opinionated MLOps/engineering explainer.

unit-testing · mlops · ml-engineering · python · testing

Applied-ML Design Patterns & Pragmatism

4 tier-5 · 3 tier-4

The "patterns" half of Eugene's applied-ML writing: reusable, named solutions to recurring production problems, plus the pragmatist gospel that frames them. The two design-pattern catalogs (nine ML-system patterns; GoF patterns mapped onto ML code) give shared vocabulary; the content-moderation/fraud teardown shows the patterns composing into a playbook; and the pragmatism pieces — start without ML, applying-ML-as-metagame, and complexity-vs-simplicity — supply the judgment for when to reach for which.

eugeneyan

TIER 4 May 2, 2021

Frames applying ML at work as the 'metagame' beyond knowing ML itself, opening with vivid metagame analogies (rock-paper-scissors, the Cash WinFall lottery syndicates). The core advice: start from the problem not the tech (peel the onion with repeated 'why'), favor system and training-data design over model architecture, and keep designs simple (e.g., Monzo's minimal feature store). An original framing that seeded ApplyingML.com.

applied-ml · ml-systems · problem-framing · training-data · feature-store

eugeneyan

TIER 5 Sep 19, 2021

The widely-cited argument (top of Hacker News) that the first rule of ML is to start without ML: launch with heuristics and rules first, echoing Google's Rule #1 and corroborated by practitioners from Tumblr, GitHub, and Spotify. It then walks through understanding data (correlations, scatter/box plots) and concrete heuristic baselines for recommendations, classification, and spam, with real cases where regex beat deep learning. A landmark, memorable framing on applied-ML pragmatism.

machine-learning · heuristics · baselines · applied-ml · pragmatism

eugeneyan

TIER 5 Jun 12, 2022

Maps classic Gang-of-Four design patterns onto machine learning code and systems with concrete library examples: factory (PyTorch Dataset), adapter (Pandas/Spark/Arrow readers), decorator (lru_cache, pytest fixtures, timers), strategy (XGBoost custom objectives, HF pipelines), iterator (DataLoader), and pipeline (sklearn/Spark MLlib), plus system-level proxy and mediator patterns. A shared-vocabulary reference essay that helps engineers recognize and apply intent in ML libraries and systems.

design-patterns · ml-systems · software-engineering · python · system-design

eugeneyan

TIER 4 Aug 14, 2022

An argument that complexity sells better (signaling effort, mastery, innovation, features) while simplicity is the real advantage—easier to adopt, build, scale, and operate, with lower maintenance cost. Marshals ML evidence that simple methods (tree models, dot products, simple averaging) often beat sophisticated ones, and warns that rewarding complexity in promotions and paper reviews incentivizes needless complication. A memorable, well-reasoned framing piece on the complexity bias, including a thoughtful counterpoint addendum.

simplicity · complexity-bias · system-design · engineering-culture · ml

eugeneyan

TIER 5 Feb 26, 2023

A 16-minute industry teardown identifying five recurring patterns in content moderation and fraud/anomaly detection systems: human-in-the-loop ground truth, data augmentation, the cascade pattern, combining supervised and unsupervised learning, and explainability. It synthesizes concrete practices from Stack Exchange, LinkedIn, Uber, DoorDash, Airbnb, Meta, and Cloudflare into a coherent playbook (e.g., multiple binary classifiers vs single multiclass, isolation forests, layered spam defense). High synthesis and reference value for applied ML practitioners.

content-moderation · fraud-detection · anomaly-detection · ml-patterns · industry-teardown

eugeneyan

TIER 5 Apr 23, 2023

A 20-minute catalog of nine reusable design patterns for ML systems (process raw data once, human-in-the-loop, data augmentation, hard negative mining, reframing, cascade, data flywheel, business-rules layer, evaluate-before-deploy), each with pros/cons and rich industry examples (DoorDash, Airbnb, Meta, Cloudflare, Twitter, Amazon, Tesla). It gives practitioners a shared vocabulary and time-tested solutions for real ML production problems. High reference value and a frequent go-to for ML system design.

ml-design-patterns · ml-systems · data-augmentation · cascade · production-ml

eugeneyan

TIER 4 Apr 30, 2023

An argument and prototype against chat as the default LLM interface: most user intent is better expressed via clicks plus implicit context (location, persona, history) than typed text. The demo blends recommendation systems (ANN over item embeddings), pre-cached 'vibe' filters, and a minimal-chat LLM 'librarian' that already knows the user's context. A thoughtful, durable take on LLM UX beyond the textbox.

llm-ux · recommendations · embeddings · product-design · prototype

Building with LLMs — Patterns, Prompting & Agents

3 tier-5 · 7 tier-4

The applied-LLM cluster, anchored by Eugene's most-cited single essay — the seven LLM patterns (evals, RAG, fine-tuning, caching, guardrails, defensive UX, user feedback) — and its decision-aid follow-up that maps the patterns onto failure modes. Around it: a prompting-fundamentals primer, two hands-on RAG builds (Obsidian-Copilot, and the Discord assistant that doubles as a retrieval-failure diagnosis), an attention/Transformer intuition explainer, an agents/MCP build, an AI-reading-club product build, condensed conference lessons, and the meta-level framework for working productively with coding agents. The cluster matters as the practical curriculum for shipping LLM systems.

eugeneyan

TIER 5 Apr 9, 2023

A 14-minute experiments post (marked a site favorite) building LLM-augmented Discord tools (/summarize, /eli5, /sql-agent, /search, /board of advisors, /ask-ey) and then diagnosing why retrieval fails. The standout is a careful treatment of four retrieval failure modes—suboptimal ANN tuning, off-the-shelf embeddings transferring poorly, inadequate chunking, and embedding-only retrieval—with concrete fixes (hybrid search, triplet fine-tuning, paragraph chunking, reranking). Strong, original diagnostic value for anyone building RAG.

rag · retrieval-failures · agents · chunking · embeddings

eugeneyan

TIER 4 May 21, 2023

An intuition-first explainer on attention and the Transformer aimed at readers who've seen the paper but want grounding, using a library (query/key/value) analogy and plain-language answers. It explains why attention matters (removing the fixed-vector bottleneck, parallelism, long-range dependencies) and the rationale for multiple heads, multiple layers, and skip connections. A clear, lasting conceptual reference rather than a math derivation.

transformers · attention · deep-learning · explainer · llm

eugeneyan

TIER 4 Jun 11, 2023

A build write-up for Obsidian-Copilot, a prototype writing/reflecting assistant that drafts paragraphs via retrieval-augmented generation over local notes. It walks through the concrete engineering: bullet-based chunking, hybrid retrieval combining BM25 (OpenSearch) with semantic search (e5-small-v2), a FastAPI service, and an Obsidian TypeScript plugin. Valuable as a hands-on, reproducible RAG implementation with real design choices and lessons (hybrid beats embedding-only).

rag · retrieval · hybrid-search · embeddings · engineering
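
One common way to combine a keyword ranking with a semantic ranking is reciprocal rank fusion (RRF). The summary above does not say how Obsidian-Copilot merges its BM25 and embedding results, so the sketch below swaps in RRF as an assumption; the note IDs and the constant `k=60` are illustrative.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken by document ID so the output is deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retriever for the same query.
bm25_hits = ["note_a", "note_b", "note_c"]
semantic_hits = ["note_c", "note_a", "note_d"]
fused = reciprocal_rank_fusion([bm25_hits, semantic_hits])
```

RRF only needs ranks, not scores, so it sidesteps the problem of BM25 scores and cosine similarities living on different scales.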

eugeneyan

TIER 5 Jul 30, 2023

Eugene Yan's landmark 66-minute synthesis of practical patterns for building LLM systems and products, covering seven patterns (evals, RAG, fine-tuning, caching, guardrails, defensive UX, collecting user feedback) along axes of performance-vs-cost and data-vs-user. It distills academic research, industry practice, and metrics (BLEU/ROUGE/BERTScore, G-Eval, LLM-as-judge, DPR/RETRO/HyDE, ANN indices) into one canonical reference. Widely cited and durable; one of the foundational practitioner texts for productionizing LLMs.

llm-patterns · evals · rag · fine-tuning · production-llm

eugeneyan

TIER 4 Aug 13, 2023

A follow-up to the LLM-patterns post that maps the seven patterns (evals, RAG, fine-tuning, caching, guardrails, defensive UX, user feedback) onto concrete LLM problems, organized by external-vs-internal models and data-vs-non-data patterns. It's a useful decision-aid: for each failure mode (no task metrics, external/internal model underperforming, latency, unreliable output, UX paper cuts, no customer-impact visibility) it names the patterns that mitigate it. Practical reference for practitioners choosing where to invest when building with LLMs.

llm-patterns · production-llm · evals · rag · system-design

eugeneyan

TIER 5 May 26, 2024

A prompting-fundamentals explainer built on the mental model of prompts as conditioning a probabilistic model toward desired output. Covers practical techniques with Claude examples: assigning roles/responsibilities, structured input/output via XML, prefilling assistant responses, and chain-of-thought as more than 'think step by step'. A high-signal, frequently-cited primer for building reliably with LLMs.

prompting · llm · chain-of-thought · structured-output · claude

eugeneyan

TIER 4 Nov 3, 2024

Thirty-nine condensed lessons from 2024 ML conferences, organized into building effective ML systems, production/scaling, execution/collaboration, building for users, and conference etiquette. The value is in the aphoristic, hard-won rules-of-thumb (reward-function engineering is half the battle, evals are a moat, 'the model isn't your product, the system around it is', start simple, design for the data flywheel) that compress a lot of practitioner wisdom into a quick reference.

ml-systems · lessons · production · evals · leadership

Building AI Reading Club: Features & Behind the Scenes

TIER 4 Jan 12, 2025

A build-in-public walkthrough of AI Reading Club and its companion 'Dewey,' an AI reading assistant that uses selected text and page as explicit context (plus the rest of the book as implicit context) to answer queries, make quizzes, recap, look up terms, and revisit past discussions. Documents the end-to-end vibe-coding flow: Claude for MoSCoW requirements, SVG wireframes, and DB schema; v0.dev for the UI skeleton; then Cursor for backend (Next.js, Supabase, multi-provider APIs, Railway). A useful applied prototyping case study.

prototyping · ai-product · ux-design · vibe-coding · reading

Building News Agents for Daily News Recaps with MCP, Q, and tmux

TIER 4 May 4, 2025

A hands-on build of a daily news-recap agent on Amazon Q CLI: a main agent reads a feeds list, splits it into chunks, spawns three sub-agents in separate tmux panes via MCP tools (per-feed RSS reader/parser/formatter with the @mcp.tool() decorator), and combines their summaries. A practical, code-level primer on MCP tool construction and simple multi-agent orchestration with tmux for visibility.

mcp · agents · amazon-q · tmux · orchestration

How to Work and Compound with AI

TIER 4 May 3, 2026

An original five-principle framework for working productively with coding agents like Claude Code: context as infrastructure (INDEX.md, per-project CLAUDE.md, a disk-backed memory layer), taste as configuration (scoped CLAUDE.md, lazy-loaded guides, bootstrapping skills from transcripts), verification for autonomy (shift-left hooks, watcher sessions for drift), scaling via delegation (parallel sessions, git worktrees), and closing the loop (mining transcripts for config updates). Matters as a concrete, repeatable mechanism for AI-assisted engineering that doubles as a guide for designing agent harnesses and team norms.

ai-agents · claude-code · productivity · workflow · context-engineering

Deep-Learning & Generative-Model Foundations

3 tier-5 · 3 tier-4

The "understand the models" cluster — survey-grade explainers of the techniques underneath everything else. Two tier-5 references chart NLP's evolution from RNNs to T5 and the four building blocks of text-to-image diffusion. Around them sit the data side of training: the under-discussed art of bootstrapping labels when none exist, how to write annotation guidelines, and a 42-minute survey of synthetic data for finetuning (distillation vs. self-improvement) with its ToS caveats.

eugeneyan

TIER 5 Aug 16, 2020

A 23-minute chronological survey of NLP for supervised learning, tracing the field from RNNs (1985), LSTMs, and GRUs through word embeddings (Word2vec, GloVe, FastText), contextual embeddings (ELMo), attention/Transformers, pre-training (ULMFiT, GPT), BERT and its variants, to T5's text-to-text framing. It includes mechanistic explanations (CBOW vs skip-gram, negative sampling, gating) and comparisons, serving as a durable reference primer on NLP's evolution.

nlp · word-embeddings · transformers · bert · survey

eugeneyan

TIER 4 Aug 1, 2021

A deep-dive on the under-discussed art of collecting training labels when none exist, covering semi-supervised (pseudo-labels), active learning (uncertainty sampling, query-by-committee, information density), and weakly supervised approaches. Grounds the methods in industry examples from DoorDash menu tagging, Facebook, Google, and Apple, including augmentation, golden datasets, and annotation workflows. A rigorous applied-ML reference on label bootstrapping.

data-labelingactive-learningsemi-supervisedweak-supervisionapplied-ml

eugeneyan

TIER 5 Nov 27, 2022

A rigorous 19-min deep dive into the fundamentals of text-to-image generation, organized into four building blocks: diffusion (DDPM forward/reverse process with training and sampling algorithms), text conditioning (CLIP, DALL-E, DALL-E 2/unCLIP, Imagen's T5 cross-attention), classifier and classifier-free guidance, and latent-space diffusion (Stable Diffusion's VAE compression). Includes the author's own DDPM experiments and a full paper reference list, making it a lasting reference for understanding the diffusion model landscape.

text-to-imagediffusionstable-diffusionCLIPdeep-learning

eugeneyan

TIER 4 Mar 12, 2023

A practical how-to/survey on writing data labeling and annotation guidelines, structured around five Why/What/How questions a good guideline should answer, with worked examples from Google's and Bing's published search-quality guidelines. It covers motivating annotators, defining terms, decision trees and calibration examples, logistics, and measuring inter-rater reliability (Cohen's kappa) plus tips like binary/objective labels and an 'Unsure' option. Useful, concrete reference for anyone building an annotation process.

data-labelingannotationground-truthinter-rater-reliabilityml-process
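
As a rough illustration of the inter-rater reliability measure named above, the sketch below computes Cohen's kappa for two annotators from scratch. The function name and the toy labels are assumptions made for this example and are not taken from the essay.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; treat as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy binary relevance labels (made up): the raters agree on 8 of 10 items.
rater_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]
kappa = cohens_kappa(rater_a, rater_b)
```

Here raw agreement is 0.8, but half of that is expected by chance with balanced labels, so kappa comes out near 0.6.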

eugeneyan

TIER 4 Jan 7, 2024

A curated language-modeling reading list of 40+ foundational papers (Attention Is All You Need, GPT/BERT/T5, scaling laws, Chinchilla, LoRA/QLoRA, RAG, FlashAttention, DPO, MoE, CLIP/ViT, and more), each with a witty one-sentence 'X is all you need' summary, designed to seed a weekly paper club. High reference value as a structured curriculum, though it is a list rather than a deep-dive essay.

reading-listlanguage-modelspaperspaper-clubfundamentals

eugeneyan

TIER 5 Feb 11, 2024

A 42-minute survey on generating and using synthetic data for finetuning, framed around two axes: distillation from a stronger model vs. self-improvement on a model's own output, applied across pretraining, instruction-tuning, and preference-tuning. It walks through canonical papers (Self-Instruct, Unnatural Instructions, Alpaca, WizardLM, CAI, and more), with attention to ToS/legal risks of distilling external models. A comprehensive reference for the synthetic-data landscape.

synthetic-datafinetuningdistillationinstruction-tuningsurvey

Career Growth & Senior-IC Frameworks

2 tier-5 · 7 tier-4

The personal-career half of the archive: how to grow as a technical IC and how to choose roles well. The tier-5 anchors are the principal-IC field guide (the shift from doing to multiplying) and the ML/AI-engineer hiring framework — one essay on how to be evaluated, one on how to evaluate. Around them: career-planning by values and superpowers, the expert-beginner / beginner's-mind essays, onboarding plans for senior roles, MOOC diminishing returns, the portfolio "why," and the red-flag checklist for vetting a team before you join.

eugeneyan

TIER 4 Jun 25, 2017

A getting-started guide for aspiring data scientists structured as three tools (SQL, Python/R, Spark), three skills (probability/statistics, machine learning, communication), and three avenues for practice (personal projects, NGO volunteering, speaking/writing), with curated MOOC links for each. It matters as a clear, opinionated onboarding roadmap, including the realistic note that ML is only ~20% of the job versus 80% on data and stakeholder work.

data science careerlearning pathSQLmachine learningskills

eugeneyan

TIER 4 Aug 23, 2020

A career essay on the 'expert beginner' trap: people who achieve narrow success, get labeled experts, and stop learning, eventually creating teams and orgs (the 'Dead Sea effect') that resist new technology and stagnate. The antidote is Shoshin (beginner's mind), staying humble and curious, distinguishing book knowledge from tacit practice-gained knowledge, and continuing to 'pedal' (learn) to maintain momentum.

careerlearningbeginners-mindexpertisegrowth

eugeneyan

TIER 4 Jan 24, 2021

Argues that past the first one or two courses, additional MOOCs hit diminishing returns and often become procrastination or self-handicapping, because watching videos is passive consumption on the 'happy path.' The alternative: learn by doing real projects off the happy path with just-in-time, YAGNI-style learning. A persuasive learning-philosophy piece structured as a rebuttal to common objections.

learningMOOCsjust-in-time learningpersonal projectscareer

eugeneyan

TIER 4 Apr 4, 2021

A career-planning framework built on two reflective exercises: identifying your values (what you gain fulfillment from, ranked via pairwise comparison) and your superpowers (what you do better than 95-99% of people, sometimes disguised as weaknesses), then choosing roles that align with both. The 'be so good they can't ignore you' examples (SpaceX's Tom Mueller, Project Loon's Daniel Bowen) and the actionable prompts make it a genuinely useful self-assessment tool.

career planningvaluesstrengthsself-assessmentskill mastery

eugeneyan

TIER 4 Oct 17, 2021

Eugene's own lessons from writing online: write even when unqualified (expertise is a spectrum; people learn best from those a few steps ahead), write first for yourself then for one person, exploit writing's O(1) scaling, think in years/decades and favor durable fundamentals, use portable formats, and prioritize quantity over quality early. A well-argued, durable framework on writing to learn and build a career.

writingcareerlearningcompoundingpersonal-brand

eugeneyan

TIER 4 Feb 13, 2022

A checklist of red flags to vet before joining a data team: poor/inaccessible data, no value-delivery roadmap, title-vs-role mismatch, bad managers, non-transferable tooling, unusual org reporting lines (e.g., DS rolling up to CFO/CMO), and ill-fitting iteration speed. Each flag comes with concrete reverse-interview questions, making it a practical framework for evaluating offers in tech/DS roles.

careerdata-sciencehiringjob-searchteam-evaluation

eugeneyan

TIER 4 May 22, 2022

A guide to onboarding effectively in mid-to-senior tech roles: adopt the right mindset (own your onboarding, beginner's mind, resist the urge to change things, invest in culture and relationships), then execute a 100-day plan with concrete weekly milestones for learning, relationships, and early-win deliverables. Offers a useful 2x2 (learning vs action) and timeline templates, making it a solid, reusable career framework.

onboardingcareer100-day-planleadershipnew-role

eugeneyan

TIER 5 Jul 7, 2024

A thorough framework (co-authored with Jason Liu) for interviewing and hiring ML/AI engineers, covering what to assess (software basics, data literacy, comfort with opaque models, evals, science breadth/depth/application) and the non-technical AICE dimensions (ambiguity, influence, complexity, execution), plus how to run phone screens, STAR-based loops, and debriefs. Closes with an opinionated coach-vs-hire view (hunger, judgment, empathy are mostly hired) that makes it a durable career-and-leadership reference.

hiringinterviewingml-careersdata-literacyleadership

Advice for New Principal Tech ICs (i.e., Notes to Myself)

TIER 5 Oct 19, 2025

A distilled 31-point field guide to operating as a principal engineer/scientist, drawn from Amazon role models and mentor quotes: the core work shifts from doing to multiplying (vision, design feedback, sponsorship, connecting dots, scaling through others), being right is less than half the battle, guard your time, define an owner/sponsor/consultant charter, and remove yourself from the critical path. A high-value, durable career framework for senior-IC growth.

careerleadershipprincipal-engineeric-growthmentorship

Bandits, Exploration & Recsys Evaluation

2 tier-5 · 4 tier-4

The harder-edged sequel to Theme 2: how to evaluate and explore in recommendation when logged data lies. The throughline is that recommendations are an interventional problem usually treated as observational, so naive offline metrics mislead — the fix is counterfactual/off-policy evaluation (IPS, SNIPS), and exploration via bandits (epsilon-greedy, UCB, Thompson Sampling) that also handle cold-start. Surrounding pieces tackle position bias and its self-reinforcing feedback loop, personalization design patterns, and reinforcement learning for recs and search.

eugeneyan

TIER 4 Jun 13, 2021

A design-pattern survey of approaches to personalization, bucketed into bandit, sequential, graph-based, and other groups. It details contextual bandits and their advantage over batch ML (lower regret, better cold-start handling) with worked industry examples from Netflix (image personalization with replay evaluation), DoorDash (multi-level geolocation bandits), and Spotify (recsplanations). A substantive applied design-patterns essay.

personalizationcontextual-banditsrecsysdesign-patternscold-start

eugeneyan

TIER 4 Sep 5, 2021

A technical survey of reinforcement learning for recommendations and search, motivated by the limits of batch recommenders (short-term reward focus, popularity bias, poor cold-start). It covers contextual bandits (Yahoo news, Netflix artwork personalization, including off-policy/replay evaluation) plus value-based and policy-based methods, with concrete industry feature engineering and evaluation details. A substantive applied-RL deep-dive.

reinforcement-learningcontextual-banditsrecsyssearchoff-policy-evaluation

eugeneyan

TIER 5 Apr 10, 2022

Argues that recsys offline evaluation is mis-framed: recommendations are an interventional problem treated as observational, so fitting logged data doesn't measure whether new recs increase clicks. Introduces counterfactual evaluation via Inverse Propensity Scoring (with worked importance-weight examples) and its pitfalls (insufficient support, high variance), plus the CIPS and SNIPS variants and empirical findings that SNIPS performs best without parameter tuning. A conceptually sharp, reference-grade essay on simulating A/B tests offline.

counterfactual-evaluationIPSSNIPSrecsysoff-policy-evaluation
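
To make the estimators named above concrete, here is a minimal sketch of IPS, its clipped variant, and SNIPS on a toy log. The function names, the clip value, and all numbers are assumptions for illustration; they are not the essay's worked examples.

```python
def importance_weights(target_probs, logging_probs):
    """Per-impression weight: pi_new(action) / pi_logging(action)."""
    return [t / l for t, l in zip(target_probs, logging_probs)]


def ips(rewards, target_probs, logging_probs, clip=None):
    """IPS estimate; passing `clip` caps each weight (the clipped variant)."""
    weights = importance_weights(target_probs, logging_probs)
    if clip is not None:
        weights = [min(w, clip) for w in weights]
    return sum(w * r for w, r in zip(weights, rewards)) / len(rewards)


def snips(rewards, target_probs, logging_probs):
    """Self-normalized IPS: divide by the sum of weights instead of n."""
    weights = importance_weights(target_probs, logging_probs)
    return sum(w * r for w, r in zip(weights, rewards)) / sum(weights)


# Toy log (made-up numbers): observed clicks, plus each policy's probability
# of showing the item that was actually shown on that impression.
rewards = [1, 0, 1, 0]
logging_probs = [0.5, 0.5, 0.25, 0.25]
target_probs = [0.25, 0.25, 0.5, 0.5]

ips_value = ips(rewards, target_probs, logging_probs)             # 0.625
cips_value = ips(rewards, target_probs, logging_probs, clip=1.5)  # 0.5
snips_value = snips(rewards, target_probs, logging_probs)         # 0.5
```

The large weights on the last two impressions pull the plain IPS estimate up; clipping or self-normalizing dampens that, which is the variance trade-off the summary refers to.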

eugeneyan

TIER 4 Apr 17, 2022

Explains position bias—higher-ranked items get clicked regardless of true relevance, creating a self-reinforcing feedback loop—and surveys methods to measure it (RandTopN, exploiting inherent randomness across rankers/widgets/contexts, expectation maximization, FairPairs/RandPair swaps, Boltzmann exploration on scores) trading off cleanliness vs customer impact. Then covers mitigation via randomness, inverse-propensity debiasing, and positional features set to 1 at serving. A focused, practical applied-ML explainer.

position-biaslearning-to-rankrecsysdebiasingevaluation

eugeneyan

TIER 5 May 8, 2022

A comprehensive survey of bandits for recommender systems: explains epsilon-greedy, UCB, and Thompson Sampling, then catalogs industry implementations (Spotify, Yahoo, Alibaba, DoorDash, Amazon, Twitter, Netflix, Deezer) and distills cross-cutting lessons: UCB/TS beat epsilon-greedy, TS is more robust to delayed feedback, pessimistic initialization helps, plus approaches to exploration, dimensionality, warm-starting, and off-policy (replay) evaluation. A landmark applied-ML reference on modeling uncertainty and exploration in recsys.

banditsrecsysthompson-samplingUCBexploration
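
Of the three algorithms named above, Thompson Sampling is perhaps the least obvious in code, so here is a minimal Beta-Bernoulli sketch. The simulated click rates, the uniform Beta(1, 1) prior, and the function name are assumptions for illustration, not any of the surveyed industry setups.

```python
import random


def thompson_sampling(true_rates, rounds, seed=0):
    """Simulate Beta-Bernoulli Thompson Sampling; return pulls per arm."""
    rng = random.Random(seed)
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw one plausible click rate per arm from its Beta posterior
        # (Beta(1, 1) prior, i.e. uniform, before any feedback).
        samples = [
            rng.betavariate(s + 1, f + 1)
            for s, f in zip(successes, failures)
        ]
        # Play the arm whose sampled rate is highest.
        arm = max(range(len(samples)), key=samples.__getitem__)
        # Simulated feedback: a click with the arm's (hidden) true rate.
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]


# Three hypothetical items with hidden click rates; the last one is best.
pulls = thompson_sampling([0.1, 0.5, 0.9], rounds=2000)
```

Exploration falls out of the posterior sampling: arms with little data have wide posteriors and still get picked occasionally, while traffic concentrates on the best arm as evidence accumulates.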

eugeneyan

TIER 4 Oct 2, 2022

A detailed RecSys 2022 conference recap summarizing 17 papers plus three favorites: recency-based sequence sampling, extending Open Bandit Pipeline to industry challenges (off-policy vs on-policy, concept drift, delayed rewards), and Google's 'On the Factory Floor' ad-CTR engineering paper. Rich with applied lessons (reward design, system vs model innovation, offline-online metric correlation) drawn from Apple, Stitch Fix, Spotify, Pinterest, Netflix and more—valuable applied-ML survey, though it is fundamentally a curated paper roundup.

recsysconference-recapbanditssequential-recommendationml-engineering

Data-Science Practice, Process & Project Mechanics

1 tier-5 · 13 tier-4

The largest tier-4 cluster: the craft and process of doing data-science work, from "what the job actually is" to running projects start-to-finish. The tier-5 piece is the durable four-roles taxonomy (data scientist / applied scientist / research scientist / ML engineer). Around it: the realities of the role (ML is <20% of the work), where Agile/Scrum fits DS, the full three-part DS-project-practices series (before/during/after), Python project setup and patterns, productionizing classifiers, the career narrative, and influencing without authority.

eugeneyan

TIER 4 Oct 11, 2016

Part 1 of the product-classification series on data acquisition and category formatting, using Julian McAuley's 9.4M-product Amazon metadata, converting JSON to CSV (including a memory-safe row-by-row approach for large data), flattening category lists into path strings, and progressively filtering down to ~4.59M usable products. It matters as a practical, code-level walkthrough of the messy data-wrangling work that precedes any classifier.

data acquisitiondata wranglingpandasAmazon datasetcategory cleaning

eugeneyan

TIER 4 Dec 11, 2016

Part 2 of the product-classification series detailing text data preparation: defining a data-'purity' metric to detect mislabeled products, ASCII encoding/normalization, custom regex tokenization that preserves informative punctuation, and removing stopwords, numerics, short words, and duplicates. It matters as a thorough, code-level NLP preprocessing walkthrough grounded in cross-validation justifications for each cleaning step.

NLP preprocessingtext cleaningtokenizationdata qualityPython

eugeneyan

TIER 4 Feb 13, 2017

Part 3 of a product-classification series showing how to productionize an ML model by wrapping it in a TitleCategorize class, exposing it via a Flask app with routes and HTML, and using a Python decorator for reusable timing/logging. It matters as a concrete, end-to-end walkthrough of serving an ML model behind a simple API, with idiomatic code (chained methods, DRY via decorators).

ML in productionFlask APIPythonmodel servingdecorators
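
The reusable timing/logging decorator mentioned above might look roughly like the sketch below. The names `timed` and `categorize` are hypothetical stand-ins for this example, not the series' actual code or its TitleCategorize class.

```python
import functools
import logging
import time

logger = logging.getLogger(__name__)


def timed(func):
    """Log how long each call to `func` takes, without changing its result."""
    @functools.wraps(func)  # keep the wrapped function's name and docstring
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            logger.info("%s took %.2f ms", func.__name__, elapsed_ms)
    return wrapper


@timed
def categorize(title):
    """Hypothetical stand-in for a model call, not a real classifier."""
    return "books" if "novel" in title.lower() else "unknown"


result = categorize("A Classic Novel")
```

Writing the timing logic once and applying it with `@timed` is the DRY benefit the summary points to: any route handler or model method can be instrumented with one line.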

Data Science and Agile (What Works, and What Doesn't)

TIER 4 Jan 26, 2019

Part 1 examining where Agile/Scrum fits data science: planning and prioritization, clearly defined tasks with deliverables/timelines, and end-of-sprint retrospectives and demos work well, while estimation difficulty, shifting scope, and expectations of working software each sprint do not. Grounds the analysis in concrete examples (NPS root-cause analysis, the data-product build flow). A substantive engineering-leadership essay (tagged 🔥) on process fit.

data-scienceagilescrumteam-processleadership

Data Science and Agile (Frameworks for Effectiveness)

TIER 4 Feb 2, 2019

Part 2 on data science and Agile, proposing concrete adaptations: time-boxed iterations with go/no-go gates (Feasibility Assessment → POC → Deploy to Production → Operational Maintenance), writing up projects before starting, and reserving innovation time. Quantifies the prototype-to-production gap (a cited 2 vs. 117 man-months) to justify production engineering. A practical management framework for running DS teams.

data-scienceagileteam-processmlopsproductivity

What does a Data Scientist really do?

TIER 4 Apr 30, 2019

Debunks the perception that data science is mostly machine learning and requires a PhD, arguing ML is <20% of the work. Lays out the full data-product lifecycle (framing, data acquisition/prep, building validation frameworks and pipelines, experimentation, productionizing) and, citing interviews with data leaders, concludes the defining skill is delivering measurable value. A useful, framework-style corrective on the realities of the role.

data-sciencecareerml-systemsdata-productsexpectations

My Journey from Psych Grad to Leading Data Science at Lazada

TIER 4 Feb 27, 2020

A career narrative of moving from a Psychology degree and investment-analyst role into leading a 12-person data science team at Lazada, framed around three keys: continuous self-learning, getting shit done (measurable value), and emphatic communication. Includes vivid anecdotes (Kaggle top-3%, a failed A/B test that lost millions, the communication pivot) that ground the lessons. A substantive, motivating career framework, especially for non-traditional entrants.

careerdata-scienceleadershipcommunicationself-learning

eugeneyan

TIER 4 Jun 7, 2020

Argues for adapting Scrum to data science work despite the apparent mismatch with open-ended research: time-boxed iterations speed learning and limit loss, prioritization keeps teams focused, demos boost accountability, and retrospectives close the improvement loop. A thoughtful team-practices essay grounded in lived experience.

scrumagiledata-scienceteam-practicesleadership

eugeneyan

TIER 4 Jun 21, 2020

A 20-minute hands-on guide to setting up a Python project for automation and collaboration: version manager, virtualenv/Docker, consistent project structure, unit tests, coverage, linting, type-checking, a make-check wrapper, and CI on git push. A thorough, reusable engineering how-to for production-quality Python.

pythonengineeringtestingci-cddeveloper-experience

eugeneyan

TIER 4 Jul 12, 2020

Part two of the data-science-project-practices series, on execution: start with a quick literature review of what others did and what worked, experiment and iterate fast (Jupyter/Papermill/MLflow), use stand-ups plus an informal end-of-day debrief (EODD) for alignment and feedback, and run regular stakeholder check-ins to stay aligned with business goals. Concrete, experience-tested team practices for delivering DS projects.

data-scienceexecutionteam-practicesstandupsproject-management

eugeneyan

TIER 4 Jul 19, 2020

Part three of a data-science-project-practices series on what to do after a project: make work reproducible (combat dead-end code, run-sequence dependencies, and poor Jupyter-git diffs by converting notebooks to .py at milestones), and document/share work to save future time, enable knowledge transfer, and spark collaboration. Illustrated with how a product-classifier write-up snowballed into multiple new internal projects.

data-sciencereproducibilityjupyterdocumentationworkflow

eugeneyan

TIER 5 Nov 8, 2020

A widely referenced taxonomy distinguishing the four data roles, defining each by goal, tools, and deliverables: data scientists (analysis to guide decisions), applied scientists (ML systems for business outcomes), research scientists (new methodology), and ML engineers (infra and platforms). Also traces how the title inflated (Facebook, Lyft rebrands) and what that means for teams. A lasting reference for understanding the field's role landscape.

data-science-rolesapplied-scientistml-engineerresearch-scientistcareer

eugeneyan

TIER 4 Mar 6, 2022

A practical quick-start checklist for data science projects: understand intent and context, define requirements/constraints/metrics (framing constraints as enabling), dig into the data early with a quick baseline to test feasibility, consult domain experts/papers/open-source, and standardize/automate the experiment pipeline (notebook hygiene, Hyperopt/Optuna, MLflow, Papermill). Useful, experience-grounded guidance for going from nebulous problem to usable prototype.

data-scienceproject-managementml-workflowmetricsexperimentation

eugeneyan

TIER 4 Jul 31, 2022

A practical tour of uncommon Python patterns learned by reading widely used libraries (requests, flask, scikit-learn, pytorch, etc.): calling super() in base classes for cooperative multiple inheritance, when to use mixins, relative imports, what to put in __init__.py, instance/class/static methods, a hidden conftest.py path trick, and design principles from sklearn, fastai, and PyTorch. A useful, code-rich engineering explainer for writing more maintainable Python.

pythonsoftware-engineeringlibrary-designcode-patternspytest

Engineering & DS Leadership

1 tier-5 · 8 tier-4

How to build and run data/ML teams, written from the lead's chair. The tier-5 centerpiece is the influential argument for end-to-end data scientists over fragmented specialist hand-offs, grounded in communication-cost math and social-loafing research. Around it: the team-growth playbook (hiring, training, innovation, discipline, camaraderie), team-of-teams mechanisms (debriefs, reviews, input/output metrics), project prioritization via cost-benefit, the intent-vs-requirements delegation model, the team-culture document, prototypes-win-buy-in, and a project-success mechanisms set.

eugeneyan

TIER 4 May 12, 2018

Eugene Yan describes how, as Lazada's Data Science Lead, he drafted a team culture document to combat dilution as the team scaled from 5 to 40+ people, building it on five values (ownership, collaboration, communication, innovation, impact) drawn from studying Netflix, Valve, Google, and Amazon. It matters as a concrete, reusable framework for codifying engineering/DS team culture and using it as a scalable hiring and onboarding artifact.

team cultureleadershipdata sciencehiringmanagement

eugeneyan

TIER 4 Jun 15, 2020

Part one of a three-part series on data science project practices, sharing three pre-project mechanisms: a one-pager mapping intent/outcome/deliverable/constraints (with Amazon's Working Backwards), time-boxing effort, and breaking work down to spot rabbit holes and dead ends. A practical framework for de-risking DS projects before diving in.

data-scienceproject-managementmechanismworking-backwardsproductivity

eugeneyan

TIER 5 Aug 9, 2020

Eugene's influential argument that data scientists should be more end-to-end (own problem framing through deployment) rather than fragmenting the process across specialized roles, which causes coordination overhead, diffusion of responsibility, and lost context. He grounds it in communication-cost math (N(N-1)/2 links), social-loafing/bystander research, autonomy-mastery-purpose motivation, and the real-world experiences of Stitch Fix and Netflix, while acknowledging when specialists are still needed.

data-scienceteam-structureend-to-endownershiporganization
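
The communication-cost formula cited above is easy to check numerically. This small sketch (the function name and the team sizes are assumptions for illustration) shows how pairwise links grow as more people are involved in a hand-off chain.

```python
def communication_links(n):
    """Number of pairwise communication links among n people: n(n-1)/2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return n * (n - 1) // 2


# Links for a few hypothetical team sizes.
links_by_size = {n: communication_links(n) for n in (2, 5, 10, 20)}
# {2: 1, 5: 10, 10: 45, 20: 190}
```

Doubling a team from 5 to 10 people more than quadruples the links (10 to 45), which is the coordination overhead the end-to-end argument leans on.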

eugeneyan

TIER 4 Oct 11, 2020

Argues that interactive prototypes win buy-in where research-backed proposals and roadmaps fail, because they are concrete, serve as proof of technology, and make feedback easy. Illustrated with how a self-built image-classifier/search demo kick-started a computer-vision investment, plus the lesson that non-technical users won't adopt anything without a GUI. Practical guidance on prototyping tools (FastAPI, Streamlit, Docker) and when prototypes don't suffice.

prototypingstakeholder-buy-inml-productfastapicommunication

eugeneyan

TIER 4 Jan 31, 2021

Yan's playbook for growing and running a data-science team across five levers: hiring (the highest-leverage act; look for curiosity, grit, humility, end-to-end skill), training (via teammates, demos, paper-lunches, reviews), innovation (sharing failure, supporting 20% projects against the innovator's dilemma), discipline (one-pagers, time-boxing, standards), and camaraderie (trust enabling Crocker's-Law candor). A substantive, experience-grounded leadership essay.

DS leadershiphiringteam buildinginnovationmanagement

eugeneyan

TIER 4 Mar 21, 2021

A practical framework for prioritizing data-science/ML projects: quantify via cost-benefit analysis (extent x severity), recognize that capabilities act as multipliers and learning exercises unlock previously-unsolvable problems, and trade off incremental vs. disruptive (S-curve) work while avoiding resume-driven development and pet projects. Synthesizes into a benefits-vs-innovation 2x2, giving DS leads a reusable mental model for the 'which problems' question.

project prioritizationcost-benefit analysisML strategyDS leadershipinnovation

eugeneyan

TIER 4 Mar 20, 2022

Distinguishes high-level intent (the why/what) from low-level requirements (the how) and offers a framework for choosing between them based on three factors: the executor's context/experience (often seniority), the situation's uncertainty (knowledge gap and effects gap), and the maturity of the profession (mature software engineering vs younger data science). A clear, transferable mental model for delegation, spec-writing, and growing people.

intent-vs-requirementsdelegationleadershipspec-writingmanagement

eugeneyan

TIER 4 Jan 22, 2023

A practitioner framework for increasing the success rate of ML projects via four mechanisms: pilot/co-pilot pairing for review and blindspot-catching, structured literature review, methodology review (a code-review analogue for experiments that surfaces data leaks and invalid splits), and timeboxing distinct from estimates. Each comes with concrete guidance (10% co-pilot effort, three-pass paper reading, time/feature/leakage checks, three ways to set timeboxes). Useful, transferable advice for running applied ML work.

ml-projectsmechanismsmethodology-reviewtimeboxingapplied-ml

eugeneyan

TIER 4 Feb 5, 2023

A leadership/operations essay describing four mechanisms for effective technical teams and teams-of-teams: end-of-week debriefs, monthly learning sessions, quarterly reviews, and weekly business reviews built on input/output metrics. It draws on Amazon and Netflix practices (controllable input metrics, 'highly aligned, loosely coupled', Working Backwards) and gives concrete formats and ML-team metric examples. A practical, reusable framework for engineering/ML team management.

engineering-leadershipteam-mechanismsmetricsteam-managementcareer

Writing, Learning & Communication

0 tier-5 · 10 tier-4

The personal-development backbone that connects the whole archive: how Eugene learns continuously, writes to think, and communicates to scale. Pieces cover writing-as-learning and the reading → notes → writing cycle, the Zettelkasten method (>600 HN points), lessons from teaching himself non-fiction writing, ML unit-testing philosophy as a learning artifact, why reading papers makes you more effective, staying current in a fast field, the portfolio "why," writing as a force multiplier as careers grow, the Why/What/How doc framework, and influencing without authority.

eugeneyan

TIER 4 Sep 25, 2017

Reflections on Yan's first 100 days transitioning from individual contributor to Data Science Lead at Lazada, covering the mindset shift away from technical depth, aligning team mission with company goals, instituting monthly 1-on-1s, and learning to delegate. It matters as a practical IC-to-manager transition guide, including concrete 1-on-1 questions for engagement, culture, and productivity.

leadershipmanagement transition1-on-1sdelegationcareer

Writing is Learning: How I Learned an Easier Way to Write

TIER 4 Mar 28, 2020

Reframes writing as a learning-and-thinking process rather than just output, structured as a cycle of reading (consume) → note-taking (collect) → writing (create). Offers concrete habits—read three books at a time, read with intent to write, write from notes rather than a blank page—to make regular writing sustainable. A useful, transferable framework on the reading-to-writing pipeline.

writinglearningreadingnote-takingproductivity

Stop Taking Regular Notes; Use a Zettelkasten Instead

TIER 4 Apr 5, 2020

Argues that regular note-taking fails because notes stay disconnected, and presents the Zettelkasten (slip-box) method—idea cards linked to other idea cards and grouped by topic—as a way to build connections at capture time. Walks through a concrete digital implementation in Roam (literature notes, permanent notes, topic tags) that makes review and synthesis dramatically easier. A widely-cited, practical knowledge-management framework (hit >600 points on Hacker News).

note-takingzettelkastenknowledge-managementwritingproductivity

eugeneyan

TIER 4 Aug 2, 2020

Lessons on writing non-fiction that schooling never taught, drawn from a self-assembled curriculum of the best books, essays, and videos. Key ideas: writing is 80% preparation (reading, note-taking, Zettelkasten) and 20% writing; writing is hard for everyone; your niche emerges from consistent shipping rather than upfront choice; and above all writing must be useful to readers (citing Larry McEnerney and Paul Graham).

writingnon-fictionnote-takingcraftusefulness

eugeneyan

TIER 4 Aug 30, 2020

Argues that reading papers makes data scientists more effective by widening perspective, keeping them current, and saving rework, opening with a story of how a paper-informed teammate (LinkedIn's kNN+SVM label-cleaning trick) unblocked a product classifier to 95% accuracy. It then gives a practical method for choosing papers and a three-pass reading-and-note-taking workflow plus curated resources (Papers with Code, applied-ml, ml-surveys).

reading-paperslearningcareerresearch-workflownote-taking

eugeneyan

TIER 4 Oct 4, 2020

Explores how the writing-versus-coding balance shifts as technical careers grow, with input from four tech leaders (Amazon, OLX, Jina, Netflix): senior contributors create leverage by writing context, sharing the 'why' over the 'what/how', and looking around corners, while writing itself sharpens thinking (Bezos's six-pager, Andy Grove's reports-as-self-discipline). A useful IC/leadership framework on communication as a force multiplier.

writingcareer-growthleadershipsix-pagertechnical-communication

eugeneyan

TIER 4 Oct 18, 2020

Reframes the data science portfolio away from 'how to build one' toward the 'why' (intrinsic motivation: mastery, purpose, fun beats job-chasing alone) and the 'what' (code projects vs. content/writing, and the skills each demonstrates). Illustrated with creators like Vincent Warmerdam, Jay Alammar, and Amit Chaudhary sharing their motivations. A thoughtful, sustainable framing of personal projects and learning in public.

portfoliopersonal-projectsmotivationwritingcareer

eugeneyan

TIER 4 Feb 28, 2021

Introduces the Why-What-How(-Who) framework for structuring work documents, applied to three doc types Yan writes at Amazon: one-pagers (stakeholder alignment), design docs (technical feedback), and after-action reviews (reflection, including Correction-of-Errors). Worked examples (an i2i recommender one-pager, design doc, and an 11/11 incident AAR) make it an actionable writing explainer; part I of the design-doc pair with #0086.

technical writingWhy-What-Howone-pagersafter-action reviewdocumentation

eugeneyan

TIER 4 Jul 4, 2021

A practical playbook for influencing without authority as a data scientist: show data (quantitative and qualitative), use the Socratic method so people convince themselves, discuss in small groups to avoid defensiveness, write ideas down (writing scales O(1)), and learn whether stakeholders are Why/What/How people. A useful, transferable career and collaboration framework.

careerinfluencecommunicationstakeholder-managementleadership

eugeneyan

TIER 4 Jan 19, 2022

A personal system for staying current in a fast-moving ML field: try something new each project, do stretching personal projects at a sustainable pace, attend meetups/conferences, read papers consistently (using the three-pass method), find mentors a few steps ahead, and adopt a beginner's mind. The throughline is that writing/sharing consolidates each learning input, making it a useful career-growth explainer.

learningcareermachine-learningmentorshipreading-papers