Author

Henry V. Primeaux

Year

2025

Review

This book is aimed at a technical audience, but it offers useful insights clearly gleaned from real world experience creating continuous evaluation systems in high stake environments.

You Might Also Like:

Understanding Variation

Wheeler highlights the value of Process Behaviour Charts, which present data as a time series with control limits. This approach accounts for natural process variation and provides a more accurate understanding than single-point comparisons. The book, clear and accessible, discusses recording and analysing data and differentiating between signal and noise.These insights can greatly improve the way we understand data and manage processes, particularly for those in product management roles.

andrewclark.co.uk

Book Review and Summary: Designing Machine Learning Systems by Huyen Chip for Product Managers

Get a comprehensive review and summary of the best-selling book "Designing Machine Learning Systems" by Huyen Chip, specifically tailored for product managers. Learn how to apply the key insights and takeaways from this must-read for tech leaders and professionals to your role as a product manager.

andrewclark.co.uk

Book Review and Summary: Designing Machine Learning Systems by Huyen Chip for Product Managers

AI Engineering

I focus my reading on evergreen books, as the technical landscape changes so quickly. Chip Huyen's "Designing Machine Learning Systems" stands out as one of my favourites. Her books are challenging to get through (they're long and technical) but they leave you with a much deeper understanding of the field. Though most tools and papers referenced in the book will inevitably be surpassed, the timeless principles and insights made the investment worthwhile. Our world is about to be turned upside down by AI, and I want to be ready. This book certainly helped me hone my understanding.

andrewclark.co.uk

Key Takeaways

The 20% that gave me 80% of the value.

A robust evaluation system isn't a one-time benchmark or a report you generate before launch. It's a living framework embedded in how you build, ship, and maintain AI. Its job is to prove your system works reliably for its intended purpose under real conditions, and catch failures before users do. Robust evals blend representative user inputs, stress and adversarial cases, multiple kinds of metrics, and ongoing monitoring, because good results in preliminary testing often collapse in production when traffic shifts, edge cases appear, or workflows change.

Start with the real goal, not the metric

Everything begins with defining success in outcome terms (resolution rate, fewer escalations, compliance, user satisfaction) and translating that into observable criteria. If eval scores rise but business outcomes don't, you're measuring the wrong thing. You also need to define what failure looks like, especially high-severity scenarios, and make those "must-not-fail" cases first-class citizens in your test suite.

A key discipline here is defining the estimand: the exact property of model behaviour you want to know in real usage, under defined conditions. That means specifying who and what is in scope (population), how you treat edge conditions (timeouts, refusals, ambiguous inputs, system errors), and what counts as success (partial credit vs strict correctness; whether safe refusal is success when uncertain). Most evaluation failures come from silent assumptions: filtering hard cases, ignoring abstentions, or assuming offline distributions match production.

Design evaluation scope as a layered system

Different scopes answer different questions. Unit tests catch component regressions fast. System tests catch integration issues and interface drift. End-to-end tests tell you what the user will experience. Online tests (A/B, canary) measure true impact but require strong guardrails. The most effective programmes keep these layers together so an end-to-end failure can be traced back quickly to the component and input pattern that caused it.

Treat evaluation artefacts like production assets: version datasets, prompts, metric definitions, code/config, and model versions so any result can be reproduced and audited. Plan for nondeterminism: report variance and avoid overreacting to noise.

Build evaluation data that reflects reality and goes beyond it

Your evaluation set should mirror production traffic first, then intentionally extend beyond it. Real inputs (logs, tickets, live queries) anchor you in reality, including the messy cases that cause real failures. Don't sanitise away complexity in the name of cleanliness. Stratify inputs by the contexts that matter (domain, user segment, length, difficulty, interaction mode) so you can see slice-level failures rather than hiding them inside averages. Reserve an untouched slice to detect evaluation overfitting and slow drift in what you test.

Synthetic data fills gaps but must be validated. It's most useful when it's generated from controlled transformations of real inputs (typos, paraphrases, missing context), template variations at scale, or targeted generation of rare scenarios. Validate synthetic sets for plausibility, novelty, and similarity to real distributions, then spot-check where risk is high.

Adversarial inputs are their own regime. They probe robustness and safety boundaries through perturbations, injection attempts, and creative "red-team" strategies. Treat these tests with stricter criteria (e.g., zero policy violations, safe refusal, no leakage), and keep a library of successful attacks as permanent regression tests because many vulnerabilities transfer across versions.

Prompts and labels are part of evaluation design

Prompt wording can change model rankings. Relying on a single "canonical prompt" is brittle, so evaluate across a prompt family: structured variants that reflect realistic rewrites and formatting changes. Measure prompt sensitivity using stability statistics (variance, worst-case, low percentiles), not just a single average score.

Human labels are your ground truth. Weak labelling corrupts everything. Use clear guidelines, pilot on difficult cases, analyse disagreements, and iterate until the task is consistently interpretable. Calibrate annotators, use control items, measure agreement, and build adjudication paths. Representativeness and bias require active engineering: stratify sampling, oversample rare/high-risk strata, and report subgroup performance with confidence intervals so disparities don't stay invisible.

Metrics are instruments: use several, and know their failure modes

No single metric captures quality. Use exact match/accuracy when there's one precise answer. Use precision/recall/F1 when outputs are sets of items and the tradeoff matches product costs. Treat BLEU/ROUGE as weak, task-specific overlap signals: useful for diagnostics, unreliable when multiple phrasings are valid.

Semantic metrics (embedding similarity, BERTScore) better reflect "meaning match" for free-form outputs, but can blur factual mistakes. Calibrate thresholds with manual review, pick embedding models that match your domain, and treat semantic scores as complementary, especially when small errors are costly.

LLM-as-judge provides scalable rubric-based evaluation for nuanced criteria (instruction following, coherence, safety), often correlating with humans when prompts are disciplined. Reliability requires explicit rubrics, fixed output formats, rationale requests for auditability, prompt sensitivity testing, and guardrails against "judge-generator" circularity. Expect judge errors on edge cases; sample disagreements and audit rationales.

Multi-evaluator strategies are how you reduce blind spots: define evaluation axes (factuality, safety, relevance, fluency), then assign at least two independent signals per axis (e.g., semantic metric + rubric judge; safety classifier + safety-focused judge). Disagreement becomes a triage signal for underspecified rubrics, evaluator miscalibration, or genuinely ambiguous cases.

When aggregating metrics, normalise first and weight intentionally based on priorities (often safety and factuality dominate). Report distributions, not just means: include median and worst-case or low-percentile behaviour so catastrophic failures don't disappear in averages. Always report uncertainty and variance across samples, prompts, and evaluators.

Build the eval pipeline like a release-quality system

A real evaluation pipeline ingests inputs, runs models and evaluators, handles failures, logs everything, and scales. Design evaluators as modular components with consistent interfaces so you can swap metrics without rewrites. Keep evaluation behaviour configuration-driven (which evaluators, thresholds, weights, order) so changes are reviewable and reversible. Log enough metadata to explain outcomes: inputs/outputs, scores, prompts/rubrics, versions, run IDs, and failure states.

Regression testing becomes the central decision mechanism: compare every challenger to a reproducible champion baseline on locked datasets and rules, with explicit allowable deltas and slice-level gates.

Make evaluation part of deployment, not a side ritual

Evals matter only when embedded in the release process. Use tiered checks (fast smoke, full gate, scheduled runs) and define "non-negotiables" that block releases. Validate under real traffic with shadow testing (observe alongside production) and canary rollouts (small exposure with monitoring and fast rollback). Version everything that changes behaviour and use feature flags for controlled exposure and instant reversibility.

Traceability is part of the deliverable: every evaluation should answer what ran, on what data, with what rules, and when. When evaluation fails, the system should automatically halt or roll back, alert the right people, and preserve evidence for diagnosis.

Monitor in production to prevent silent decay

Once deployed, models face data pipeline breakage, training-serving skew, input drift, concept drift, infrastructure issues, and gradual output degradation. Monitor a compact set of signals across the chain: input validity and distribution, output distributions (refusals, fallbacks, length, confidence proxies), system health (latency/errors/cost), and user impact (complaints, escalation, satisfaction proxies). Distinguish drift types because the right response differs: pipeline fixes for skew, targeted re-evals for drift, retraining when the world changes. Use anomaly detection to catch rare spikes and outliers. Decide in advance what triggers action and what action follows.

Safety, robustness, and advanced systems require system-level thinking

Robustness and safety evaluation stress-test boundaries: adversarial prompts, edge cases, out-of-distribution scenarios, and sensitive contexts. Track distinct safety risks (hallucination, toxicity, bias, sensitivity) with clear thresholds and escalation paths. Use human-in-the-loop as a planned layer for grey areas, with explicit triggers and auditable decisions. Audit systems as repeatable cycles: run combined suites, categorise failures, remediate, rerun, and convert lessons into permanent tests.

As systems become multimodal, retrieval-augmented, or agentic, evaluation shifts from single-turn output scoring to measuring groundedness, evidence use, trajectory correctness, long-term consistency, and end-to-end task success under cost/latency constraints.

Scale across teams with governance, shared assets, and clear reporting

At organisational scale, evaluation needs standards, enforceable policies, auditability, shared libraries, and a benchmark catalogue. Reporting must serve both diagnosis (engineers) and decisions (leaders), with drill-down to concrete failing cases. Regulatory and ethical requirements must be embedded: privacy-safe data handling, bias and safety checks, traceable logs, documented limitations, and human oversight where risk is high. Onboarding becomes a repeatable playbook supported by mentorship and contributions back into shared assets.

The core lesson across all chapters: robust evaluation isn't a metric. It's an operating system for trust, safety, and continuous improvement.

Deep Summary

Longer form notes, typically condensed, reworded and de-duplicated.

Introduction

What Robust AI Evals Mean

A robust evaluation system is a living framework embedded in model development and deployment. It answers: Is your AI working reliably for its intended purpose, under real-world conditions, and can you prove it?

Effective evals use real user data, adversarial inputs, and evolving metrics, not just static test sets. They handle changing data distributions, unknown failure modes, and complex LLM workflows. They adapt when use cases, priorities, or regulations change.

Most evaluation frameworks are static, shallow, or disconnected from production. They rely on outdated datasets, overfit to narrow benchmarks, fail to monitor after deployment, or ignore edge cases. This produces impressive numbers that don't hold up in real usage.

Chapter 1: Laying the Foundations: Eval Requirements & Planning

Before you even write a line of evaluation code, ask yourself what "success" looks like both from a business and technical standpoint.

Every evaluation should be grounded in your Al system's intended use.

Define success in business terms (e.g., fewer escalations, higher resolution rate, compliance) and translate that into technical criteria that are observable and repeatable.

If your evals look good, but you’re not moving the outcome you care about you’re evaluating the wrong thing.

Make success criteria concrete: objective and repeatable. Use explicit thresholds.

Think about failure too, unacceptable high severity failure scenarios are important to track.

Prioritise what matters most. Classify metrics as primary (drive deployment decisions (go/no-go decisions) or secondary (help interpretation and monitoring).

Surface evals early in the development process, to flush out stakeholder differences in evaluation priorities. Run a small pilot eval on representative inputs, share results with limitations, and use that evidence to negotiate metrics, thresholds, and escalation paths before the system hardens.

Be explicit about what you are estimating, not just what you can score. The estimand is the precise target: the property of model behaviour you actually want to know in real usage, under defined conditions.

How to define your estimand:

Specify the population: which users/queries/tasks are in scope (and over what time window).
Define inclusion/exclusion rules: how to handle ambiguous inputs, empty queries, out-of-domain requests, system errors, retries, and fallbacks.
Define outcomes: what “success” means (partial credit vs strict correctness; whether safe refusal is success when uncertain; how timeouts are scored).
Ensure alignment: the estimand should map directly to the product/business outcome you’re optimising.

Common evaluation failures come from silent assumptions: filtering out hard cases, counting only “easy wins,” ignoring abstentions/timeouts, or assuming offline distributions match real traffic. Treat the estimand as a living contract; revisit it when the product surface changes, the user base shifts, regulations evolve, or new failure modes appear.

Choose evaluation scope based on what you need to learn and how fast you need feedback. Different scopes answer different questions:

Unit: validate a single component in isolation (fast, pinpoint regressions).
System: validate interactions between components (catches integration failures and interface drift).
End-to-end: validate user-visible outcomes under realistic and adversarial scenarios (holistic, slower, higher signal for release readiness).
Online/A/B: validate true impact with real users against a baseline (best realism, requires guardrails and careful gating).

Use unit and system checks to iterate quickly early on, then shift weight toward end-to-end as the pipeline stabilises, and use online testing as a final gate for changes that must prove real-world impact. Maintain layering so an end-to-end failure can be traced back to a specific component via targeted tests, reducing diagnosis time and preventing “mystery regressions.”

Map stakeholders and constraints as first-class inputs to the evaluation design. Identify who can veto deployment and what they consider failure. Document constraints as design parameters (privacy limits, annotation budget, compute latency, data retention, audit requirements) and turn priorities into operational rules: which metrics gate releases, which thresholds block, and what reporting/escalation is required when something drifts.

Treat evaluation artefacts like production assets: version datasets, prompts, metric definitions, code, configurations, and model weights so any result can be reproduced and audited. Regression testing should compare every candidate against a known baseline with explicit allowable deltas, and must include rare but critical edge cases and safety boundaries…not just average performance.

Aim for repeatability: fixed seeds, stable splits, cached references where appropriate; if nondeterminism is unavoidable, report variance and avoid overreacting to noise. Plan for safe evolution: update test suites when new query types or risks appear, keep legacy tests for backward compatibility during major architecture shifts, and use shadow deployments when possible to observe behaviour alongside production before switching. Keep a human-readable changelog of what changed and why, so future decisions stay anchored to intent rather than folklore.

Chapter 2: Data & Prompt Design for Eval

Evaluation inputs should mirror production reality, then deliberately extend beyond it to cover rare conditions and hostile behaviour. Build and maintain an input set that is representative, stress-testing, and hard to “game” through tuning.

Real inputs are the anchor:

Pull inputs from logs/ production traffic/ support tickets (keep sampling broad enough to capture segment and time variation
Deduplicate and nomalise (remove trivial inputs) without sanitising away the messiness that causes real failures.
Stratify real inputs by the contexts that matter (domain, user type, length/complexity, interaction mode) so you don’t over-measure the easy middle.
Reserve an untouched slice as a future check against evaluation overfitting and gradual “drift” in what you test.

Synthetic inputs fill coverage gaps but need tight validation. Useful generation methods include:

Controlled mutations of real inputs (typos, dropped context, paraphrases, noise)
Template-driven variants (1000 variations of a simple template)
LLM-assisted generation: proposals for rare/ambiguous cases
Procedural generation for structured tasks (filling forms, constrain based queries etc).

Validate synthetics against real distributions (length/topic/vocab), plausibility/coherence, and novelty (remove duplicates and “too easy” items), with manual spot checks where risk is high.

Adversarial inputs probe robustness and safety boundaries. Use expert red-teaming, subtle perturbations that change interpretation, prompt-injection/jailbreak patterns that attempt instruction hijacking, and automated generation pipelines to scale coverage. Treat adversarial evaluation as its own regime with stricter criteria (e.g., zero policy violations, safe refusal/fallback, no sensitive leakage), and reuse effective attacks across model versions because many transfer.

Combine the mix intentionally rather than accumulating examples randomly:

Maintain explicit proportions (70% real, 20% synthetic, 10% adversarial)
Label by difficulty and key attributes; track performance per stratum, not just overall.
Version datasets and only add; keep “known-bad” historical cases that previously caused failures.
Periodically inject newly sampled live inputs (unseen) to detect drift and new failure modes.

Prompt choice can change model ranking, so a single “canonical” prompt is a brittle basis for evaluation. Use a prompt family: one base template plus structured variants that reflect realistic rewrites. Vary: instruction phrasing, context markers/delimiters, context formatting, few-shot vs zero-shot, output constraints, system-message framing). Run the same inputs across variants to measure sensitivity and identify prompts that systematically break formatting or degrade correctness.

From that base, produce variants that change:

Instruction phrasing: Change "You are a knowledgeable assistant" to "You are an expert in domain X"
Context markers / delimiters: Wrap context in tags, or prefix with "Here is the passage: ..."
Example style: Include or omit examples (few-shot), or change how examples are formatted
Answer instructions: Require "in bullet points," "in less than 100 words," or "explain step by step"
System messages: Prepend "System: " or "Assistant, follow these rules..."

Include edge-case prompts that stress instruction handling: malformed/truncated prompts, ambiguous or conflicting instructions, nested tasks, and injection attempts embedded in user content. Score across variants and summarise stability using distributional statistics (median/quantiles, worst-case, variance), not just a single score that may be an artefact of wording.

Human labels are the ground truth; weak labelling corrupts every downstream metric. Start with explicit guidelines (labels, decision rules, edge cases, positive/negative examples), run a pilot on a challenging subset, analyse disagreement, then revise instructions until the task is consistently interpretable.

Choose labeling labor based on risk and complexity. Crowdsourcing scales for simpler judgments if you decompose tasks into atomic questions, add control items, use multiple annotators per item, and aggregate (majority/weighted). Use internal experts for domain-heavy or high-stakes cases, with adjudication paths for ambiguous items and regular calibration sessions to keep judgments aligned.

Treat label quality as a continuous system: iterative calibration on a gold set, consensus workflows with escalation when no clear majority exists, and routine inter-annotator agreement measurement (tracked overall, per annotator, and per category). When agreement drops, pause to diagnose (guideline ambiguity, fatigue, new edge cases) and retrain or re-scope.

Bias and representativeness require explicit engineering. Define the key axes of variation (language/region, topic, length, user type, difficulty, sensitive attributes where appropriate), stratify sampling so each stratum is adequately powered, and oversample rare or high-risk strata rather than letting random sampling erase them. Report subgroup metrics and error types, use paired counterfactual examples to isolate attribute effects, and automate stratified dashboards with confidence intervals/significance checks so disparities are detected early and remain traceable over time.

Chapter 3: Metrics That Work - Automatic, Hybrid, and Human

Metrics are only useful if they match the decision they’re meant to drive. Treat them as instruments with failure modes: a metric that’s easy to compute but misaligned will systematically reward the wrong behavior and hide risk.

Use exactness metrics when the task has a small, well-defined target. Accuracy and exact match work for classification, multiple choice, and factoid QA with a single acceptable answer, but they become brittle as soon as multiple phrasings are valid or the output is longer than a short span.

Use set-based metrics when outputs are collections of items. Precision captures how many predicted items are correct, recall captures how many required items were found, and F1 balances the two. These are most informative when you can meaningfully define “items” (entities, slots, citations, required facts) and when precision/recall tradeoffs reflect real product costs.

Overlap metrics are best treated as weak, task-specific signals. BLEU and ROUGE reward n-gram overlap with references, which can be useful for translation and summarisation diagnostics, but they punish legitimate paraphrases and can over-reward verbosity or copying. They’re most reliable when outputs are expected to be close to a reference and least reliable when multiple distinct answers are acceptable.

Semantic metrics measure meaning rather than surface overlap by comparing embeddings of candidate and reference. They are valuable for free-form generation where correct answers vary in wording (summaries, paraphrases, open-ended QA, dialogue), and they’re often a better fit for “did it say the same thing?” than BLEU/ROUGE.

Embedding-based scores can still be fooled. High similarity can hide subtle but critical factual errors, and short outputs can behave erratically. Choose embedding models that match your domain, calibrate acceptance thresholds by reviewing samples at different score levels, and treat semantic scores as complementary—especially when factuality or safety hinges on small details.

LLM-as-judge replaces rigid formulas with rubric-based scoring from a capable model. It scales nuanced evaluation (coherence, completeness, instruction following, safety) faster than humans and often tracks human preference when the rubric and prompt are well-designed.

To make LLM-judging reliable, control its main failure modes:

Use explicit rubrics, fixed output formats, and require brief rationales for auditability.
Test prompt sensitivity with variants before standardising a judging prompt.
Avoid “judge–generator” circularity by using a stronger or different judge model and periodically checking against humans and non-LLM metrics.
Expect judge errors on edge cases; sample and review disagreements and rationales, especially for adversarial inputs.

Single-metric evaluation leaves blind spots; combine evaluators to cover different axes (factuality, relevance, safety, fluency, style). When multiple evaluators agree, confidence rises; when they disagree, treat it as a triage signal: either the output is genuinely ambiguous, the rubric is underspecified, or an evaluator is miscalibrated.

Multi-evaluator setups work best when you define axes first, then assign at least two independent signals per axis (e.g., a semantic metric plus a rubric judge; a safety classifier plus a safety-focused judge). Specialised judge ensembles (multiple LLM evaluators prompted for different criteria) broaden error detection, but they increase complexity and require disciplined logging and periodic recalibration of prompts, weights, and thresholds.

Human judgments remain the reference point for what “good” looks like, particularly for user value and subtle quality attributes. Pairwise comparisons (“which is better?”) reduce scale interpretation and often produce cleaner preference signals than absolute ratings, especially when differences are small.

Rating scales provide per-dimension granularity (factuality, relevance, fluency, safety) but require strong calibration to prevent rater drift and inconsistent use of the scale. Comparative ranking extends pairwise to multiple candidates when you need to order many variants, then aggregate preferences into a stable system-level ranking.

Structured protocols keep human evaluation reproducible: define goals and rubrics up front, run calibration rounds on gold items, randomise and balance samples, include control items for quality monitoring, measure agreement, and keep an audit trail of guidelines, changes, and adjudications.

When combining metrics, normalise before aggregating. Metrics live on incompatible scales and directions (higher-better vs lower-better); without normalisation, the largest numeric range dominates regardless of importance. Prefer simple, interpretable scaling, and keep the normalisation constants versioned so comparisons remain valid across runs.

Weight metrics to match priorities, not convenience. Set weights explicitly (often with safety/factuality dominating), compute composite scores per sample, and summarise across samples with robust statistics (median/trimmed mean, plus worst-case or low-percentile scores) so catastrophic failures don’t vanish inside averages.

Always report uncertainty and spread, not just a single number. Track variance across samples, prompt variants, and evaluators; use confidence intervals when comparing systems; and document every choice (rubrics, prompts, normalisation, weights, model versions) so results are reproducible and changes don’t quietly rewrite history.

Chapter 4: Building the Eval Pipeline

An evaluation pipeline is a release-quality system, not a one-off test. It should reliably answer “is this version safe to ship?” across changing models, data, and priorities, and do so in a way that’s explainable to stakeholders when results are questioned.

Design evaluators as interchangeable components with a consistent “contract” (same kind of inputs, same kind of outputs). That lets you add, remove, or upgrade metrics without rebuilding the whole workflow, and it keeps the organisation from getting stuck with yesterday’s scoring approach when priorities shift (e.g., safety and compliance suddenly matter more than style). Chaining evaluators also supports simple decision logic (e.g., only run deeper checks when earlier checks pass, or treat certain failures as immediate blockers).

Keep evaluation behaviour configuration-driven rather than hardwired: which evaluators run, in what order, and which thresholds or weights matter. This makes changes deliberate, reviewable, and easy to roll back, and it reduces “silent” metric drift caused by ad-hoc edits.

Operational rigour is part of evaluation quality. A useful pipeline records enough context to explain outcomes: inputs, outputs, scores, the rubric/prompt used for any model-based judging, run identifiers, and the versions of data and scoring rules. When failures happen (timeouts, malformed outputs, evaluator errors), the system should degrade gracefully: capture partial results, flag what’s missing, and make it easy to pinpoint whether the issue is model quality or pipeline reliability.

Treat regression testing as the main decision mechanism. Establish a reproducible baseline (“champion”) and compare every candidate (“challenger”) against it on the same locked evaluation set and rules. Use clear acceptance gates that reflect non-negotiables, not just average improvement.

A minimal release gate checklist:

No regressions on critical metrics and “must-not-fail” cases.
Slice-level checks so rare but important segments can’t be masked by overall averages.
A clear promote/block decision plus an audit trail explaining why.

Chapter 5: Integration into MLOps / Deployment Workflow

Evaluations only change outcomes when they’re wired into the release process, not run ad hoc. Treat them as standard checks that run whenever something meaningful changes: model, prompts, retrieval logic, or data—and make the “pass/fail” decision explicit rather than interpretive.

Use tiers of evaluation speed vs confidence. A fast check catches obvious breakage early, a full run is the release gate, and scheduled runs catch slow drift. What matters is consistency: the same rules, the same reference sets, and clear criteria for what blocks a change.

Make gates about regressions and non‑negotiables, not just “overall improvement.” Define minimum acceptable levels for critical qualities (often safety and factuality) and require “no worse than baseline” on must-not-fail scenarios. Always look at key slices (e.g., high-risk intents, sensitive topics, important user segments) so improvements in the average can’t hide damage in the edge cases.

Prove performance under real traffic before full rollout. Shadow testing compares the candidate against the current system on live requests without showing its outputs to users; canary testing then exposes a small percentage of users, watches both user-facing KPIs and operational health, and ramps up only if results stay within safe bounds. Rollback needs to be fast and routine, not a heroic response.

Control and reversibility are as important as quality. Version everything that can change behaviour (model, prompts, datasets, eval rules) and use feature flags to limit exposure, target cohorts, and turn changes off instantly when something goes wrong: without waiting for a new deploy cycle.

Treat traceability as part of the deliverable: every evaluation result should answer “what ran, on what data, with what rules, and when?” Store artefacts and metadata so you can audit decisions, investigate surprises, and compare runs over time without guesswork.

When evaluations fail, the system should do three things automatically:

Stop the rollout or revert to the last safe version
Alert the right people with a concise explanation of what breached and where to look
Preserve the evidence (inputs, outputs, configs, deltas) so the fix is driven by facts rather than speculation.

Chapter 6: Monitoring & Drift Detection in Production

Monitoring is what keeps a good launch from turning into silent decay. Once real traffic hits, inputs change, upstream systems break, user behaviour shifts, and even small infrastructure issues can distort outcomes long before anyone notices in product metrics.

The most common post-deploy failures are boring but damaging: malformed or missing inputs, new categories the model never saw, pipeline changes that alter preprocessing, and “training vs serving” mismatches where production data no longer resembles what was used to build the model. These can cause nonsense outputs, biased behaviour, or a slow drop in usefulness with no obvious outage.

Track a small set of signals that cover the whole chain:

Inputs: missing fields, unexpected formats, out-of-range values, new/unseen categories, spikes in nulls.
Data consistency: how production inputs compare to a fixed baseline profile from launch (are we seeing the same kinds of requests/users?).
Outputs: shifts in what the model tends to produce (more refusals, more fallbacks, longer/shorter answers, more low-confidence behaviour, more “edge” responses).
System health: latency, timeouts, error rates, cost per request.
User/business impact: escalations, complaints/tickets, churn/conversion, satisfaction proxies.

Distinguish these three problem types, because each requires a different response:

Data drift means inputs changed.
Concept drift means the world changed, so the old logic no longer fits—even if inputs look similar.
Training-serving skew means you accidentally changed the recipe (feature extraction, prompt template, retrieval, routing), which is usually fixable without retraining but must be caught quickly.

Anomaly detection complements drift monitoring by catching rare spikes and outliers that averages hide (sudden bursts of strange inputs, weird output clusters, unexpected drops in a key metric). Treat anomalies as investigation triggers: capture the exact examples, attach full context, and look for a common cause (pipeline bug, new user segment, adversarial behaviour, policy edge).

Decide in advance what happens when monitoring flags a problem. Use simple thresholds plus “trend” triggers (sustained degradation, repeated anomalies, or a drift score that keeps climbing), then route to one of a few actions: targeted re-evaluation on the affected slice, data collection/labelling for that slice, prompt/policy adjustments, or retraining. Keep humans in the loop for high-severity categories, and avoid whiplash by not retraining on normal short-term noise.

Make every alert actionable: what changed, how bad, which segment, and what the recommended next step is. If a team can’t quickly answer “what happened and why?”, monitoring becomes noise instead of a safety net.

Chapter 7: Robustness, Safety & Adversarial Evaluation

Robustness and safety evaluation treat real users as worst‑case testers: they will be messy, adversarial, and unpredictable. The goal is to find failure modes before they become incidents, not to prove the model works on average.

Adversarial testing has three complementary modes:

Perturbations check whether small, realistic input changes cause big behaviour shifts (including bypassing safety).
Attacks are deliberate attempts to override rules (prompt injection, jailbreak phrasing, obfuscation).
Red‑teaming is a sustained campaign: human and/or automated—to discover novel ways the system can be pushed into unsafe or misleading output.

Build adversarial suites around your actual risk boundaries, not generic benchmarks. Start from “what must never happen” (unsafe instructions, privacy leakage, biased responses, confident misinformation), then generate families of prompts that probe those boundaries across wording, formatting, and context. Treat every successful attack as a new permanent test case, not a one-off curiosity.

Stress testing expands beyond “malicious” into “unexpected”: extreme input lengths, malformed formats, ambiguity, mixed languages, rare high-stakes topics, and out-of-distribution requests that resemble how scope creep happens in production. Evaluate not just correctness, but whether the system fails safely (refusal, escalation, or constrained response) instead of hallucinating or complying.

Safety checks need to cover distinct risks because they fail differently:

hallucination (confidently wrong)
toxicity (harmful language)
bias (systematic differences across groups)
sensitivity (inappropriate responses in vulnerable contexts

Track these as explicit rates and severities, with clear thresholds for blocking release and for escalating review.

Automated filters won’t catch everything, so define human-in-the-loop as a planned safety layer, not an emergency measure. Use explicit escalation triggers (high-risk topics, low confidence, policy flags, user reports), give reviewers a small set of actions (approve, edit, block, escalate), and log decisions so you can audit outcomes and improve the system using real, labeled failures.

Treat auditing as a repeatable process, not a one-time certification:

Fix the scope and objectives (which risks, which user groups, which policies).
Run benchmark + adversarial + stress + sensitive scenarios together.
Categorise failures by type and severity, and verify escalation worked when it should.
Remediate, then rerun the same suite to prove the fix and prevent regressions.

Chapter 8: Evaluation for Advanced Models & Agents

Single-turn text scoring stops working once systems mix vision, retrieval, tools, memory, and multi-step planning. Evaluation has to move from "was this answer good?" to "did the system reliably achieve the user goal, using the right information, under real operating constraints?"

For vision+language systems, the key risk is plausible text that isn't grounded in the image. Treat "groundedness" as a first-class measure: does the response accurately reference what's visually present, avoid inventing details, and stay relevant to the prompt? Reference-based metrics can be useful for scale, but they miss subtle visual hallucinations; use targeted reviews (human or trusted judging) to classify failures like hallucination, omission, misidentification, and overconfident guesses. Add robustness tests for multi-object scenes, compositional questions, and out-of-distribution visual styles.

For RAG (retrieval-augmented generation), separate component evaluation from end-to-end quality. Measure the retriever's ability to surface the right sources (and how often it misses), then measure whether the generator uses retrieved evidence faithfully rather than filling gaps with confident speculation. Classify failures explicitly: retriever miss (no relevant context), generator failure (relevant context present but wrong answer), and grounding failure (partly correct but adds unsupported claims). Track latency and safety risks introduced by retrieval (bad sources in, bad answers out).

Agent and interactive systems need trajectory evaluation, not just final output scoring. Define protocols for representative tasks (objective, allowed tools, constraints, success criteria), then score:

task completion rate
protocol adherence (used the right tools in acceptable ways)
error recovery (what happens when tools fail or inputs are ambiguous)
efficiency (steps/calls/time/cost)Human review becomes important for judging "reasonable behaviour" in edge cases and for auditing failures that automated checks can't interpret.

Advanced systems also need long-term behaviour checks. Test memory and consistency across many turns, paraphrases, and slight context shifts; quantify contradiction rates and the ability to maintain stable facts/preferences over time. Probe "emergent" capability with composite tasks that require multiple skills together (retrieve → compute → explain → format), and track regressions between versions because these abilities can appear or disappear unexpectedly.

System-level evaluation should be the final lens: did the whole stack resolve the issue within acceptable time, cost, and reliability, including retriever/tool failures and any human escalation? Monitor end-to-end success, latency/error rates, escalation rates, and user-confirmed resolution, and use these as the primary gates alongside the component metrics that explain why the system succeeded or failed.

Chapter 9: Scaling Across Teams & Governance

Evaluation at scale needs shared rules, not shared opinions. Standardise what gets measured, how it’s calculated, which datasets are acceptable, and what “pass” means for each model type and risk level. Without that, teams can’t compare results, and regressions slip through because every group uses a different yardstick.

Turn standards into enforceable policies: which changes require evaluation, what thresholds are non‑negotiable (especially safety, bias, privacy), how often to re-evaluate after release, and what happens on failure (block, rollback, escalate). Make exceptions rare, visible, and time‑boxed so “temporary” doesn’t become the norm.

Auditability is the insurance policy. Every evaluation run should be traceable to the exact model version, dataset version, evaluation rules, and reviewer decisions. If a result can’t be reproduced later, it can’t credibly support a release decision or a compliance response.

Shared evaluation libraries reduce reinvention and quietly prevent metric drift. Provide common metric implementations, safety checks, and reporting formats so teams don’t create incompatible variants. Pair this with a benchmarking catalog: curated datasets and test suites (core use cases, edge cases, adversarial and safety scenarios), with clear metadata, ownership, and versioning so updates don’t invalidate historical comparisons.

Reporting should match decision needs. Keep one line of sight for overall health and risk (trendlines, regressions, safety/compliance status), and another for diagnosis (where it broke, which slice, what changed). Dashboards work best when they make regressions obvious, allow drill-down into failing cases, and preserve the trail from high-level indicators to concrete examples.

Regulatory, ethical, and compliance requirements need to be built into evaluation rather than bolted on later. That means: privacy-safe data handling, explicit bias/fairness checks where relevant, safety thresholds that block release, documented limitations, and human oversight for high‑risk areas. Treat “proof of diligence” (logs, model cards, review records) as a required output, not optional paperwork.

Onboarding new teams should be a repeatable path, not tribal knowledge. A lightweight playbook plus mentorship gets projects aligned quickly:

Use the shared library + benchmark catalog first; only customise with a documented reason.
Produce a minimum compliant evaluation pack (metrics, safety checks, audit log, release recommendation).
Contribute back new edge cases, failure modes, and benchmarks so the system improves over time.

Chapter 10: Case Studies, Lessons Learned & Getting Unstuck

Real-world eval systems are continuous operations, not experiments. The strongest setups run continuously (or on a cadence), tie results to real outcomes, and keep a clear trail of what was tested and why.

High-stakes domains use layered evaluation. Teams combine fast automated checks with expert review and “does this actually work in the workflow?” validation, because a good metric score can still fail in practice.

Reusable eval assets are a force multiplier. Shared test suites, versioned datasets, and a library of “known bad cases” let teams scale quality without re-litigating basics every project.

The most common failure mode is measuring the wrong thing. Over-focusing on one headline metric or relying on a too-clean test set produces false confidence and surprise failures after launch.

Think of recovery is a capability, not a scramble. Good teams detect regressions quickly, roll back safely, isolate the cause (data shift vs. model change vs. pipeline issue), and then promote the failure into a permanent regression test.

Scaling eval infrastructure is mainly about consistency and visibility. Automation, centralised results, clear ownership, and dashboards that surface regressions early matter more than “fancier” scoring.

Evals must evolve as models gain new abilities. When models become better at reasoning, tool-use, retrieval, or multimodal tasks, you need new tests for those capabilities without losing comparability to older baselines.

Future direction: evaluate systems and behaviours, not just outputs. Expect more focus on end-to-end task success, safety under stress, long-horizon consistency, agent trajectories (did it follow the right steps), and practical constraints like latency, cost, and escalation rates.

Building Robust AI Evals

Review

You Might Also Like:

Understanding Variation

Book Review and Summary: Designing Machine Learning Systems by Huyen Chip for Product Managers

AI Engineering

Key Takeaways

Deep Summary

Introduction

Chapter 1: Laying the Foundations: Eval Requirements & Planning

Chapter 2: Data & Prompt Design for Eval

Chapter 3: Metrics That Work - Automatic, Hybrid, and Human

Chapter 4: Building the Eval Pipeline

Chapter 5: Integration into MLOps / Deployment Workflow

Chapter 6: Monitoring & Drift Detection in Production

Chapter 7: Robustness, Safety & Adversarial Evaluation

Chapter 8: Evaluation for Advanced Models & Agents

Chapter 9: Scaling Across Teams & Governance

Chapter 10: Case Studies, Lessons Learned & Getting Unstuck