Product #122

AI Engineering · Chip Huyen · 2025

I’ll read everything Chip Huyen writes. Her books are challenging to get through but they leave you with a much deeper understanding of the field. Though most tools and papers referenced in the book will inevitably fade into irrelevance, the timeless principles and insights won’t.

Key Insights

AI models are becoming more capable but require substantial data, compute, and specialised talent to train. Model-as-a-service providers have made foundation models accessible by abstracting training costs.

Language models work with tokens: characters, words, or word parts. Autoregressive models predict the next token using preceding context and are more popular than masked models. Models can be trained through self-supervision, where they infer labels from input data, removing the need for expensive labelled datasets. Model size is measured by parameters, with larger models requiring more training data to maximise performance.
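To make the autoregressive loop concrete, here is a toy sketch of greedy next-token decoding. The five-word vocabulary, the hand-written bigram scores standing in for a real model's logits, and the greedy choice of the most probable token are all illustrative assumptions, not anything from the book.

```python
import math

VOCAB = ["the", "cat", "sat", "on", "mat", "<eos>"]

# Hypothetical bigram scores standing in for a real network's logits:
# each current token assigns a score to every possible next token.
BIGRAM_LOGITS = {
    "the": [0, 3, 0, 0, 1, 0],   # "the" -> most likely "cat"
    "cat": [0, 0, 3, 0, 0, 1],   # "cat" -> "sat"
    "sat": [0, 0, 0, 3, 0, 1],   # "sat" -> "on"
    "on":  [1, 0, 0, 0, 3, 0],   # "on"  -> "mat"
    "mat": [0, 0, 0, 0, 0, 3],   # "mat" -> "<eos>"
}

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt, max_tokens=8):
    tokens = prompt.split()
    for _ in range(max_tokens):
        probs = softmax(BIGRAM_LOGITS[tokens[-1]])    # P(next token | context)
        next_token = VOCAB[probs.index(max(probs))]   # greedy decoding
        if next_token == "<eos>":
            break
        tokens.append(next_token)                     # feed the prediction back in
    return " ".join(tokens)

print(generate("the cat"))  # -> "the cat sat on mat"
```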

Foundation models can be multimodal, handling text, images, and video. They can be adapted through prompt engineering, retrieval-augmented generation (RAG), or fine-tuning. AI engineering builds applications on existing models rather than creating new ones, focusing on adapting models through prompt-based techniques or fine-tuning.

Evaluation should consider the entire system, not isolated components. It mitigates risks by identifying failure points and uncovering opportunities. Foundation models are challenging to evaluate because they're open-ended and increasingly intelligent. As models approach human-level performance, evaluation becomes more time-consuming and difficult.

Important language modeling metrics include cross-entropy, perplexity, bits-per-character (BPC), and bits-per-byte (BPB). These metrics measure how well a model predicts the next token in the training data. Lower entropy or cross-entropy means the language is more predictable (i.e., the model assigns higher probability to the correct tokens). Perplexity is an exponential transformation of cross-entropy that reflects the model’s uncertainty: lower perplexity indicates better predictions. However, a perplexity of 3 does not mean a one-in-three chance of correct prediction; it means that, on average, the model is as uncertain as if it had to choose uniformly among 3 equally likely options.
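As a concrete illustration of how these metrics relate, the sketch below computes cross-entropy, perplexity, and bits per token from a handful of made-up probabilities that a model might assign to the correct next tokens; a real evaluation would take these from the model on held-out text.

```python
import math

# Made-up probabilities the model assigns to the correct token at each position.
token_probs = [0.40, 0.25, 0.70, 0.10, 0.55]

cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)  # in nats
perplexity = math.exp(cross_entropy)            # exp of cross-entropy, roughly 3 here
bits_per_token = cross_entropy / math.log(2)    # the same quantity expressed in bits

print(f"cross-entropy: {cross_entropy:.3f} nats")
print(f"perplexity:    {perplexity:.2f}")
print(f"bits/token:    {bits_per_token:.2f}")
```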

Functional correctness evaluates whether systems perform intended tasks. For code generation, this means checking if code executes and passes unit tests. For tasks lacking automated evaluation, similarity measurements against reference data work well. Four approaches measure text similarity: human judgement, exact match, lexical similarity (token overlap), and semantic similarity (meaning-based using embeddings).
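A minimal sketch of the two cheapest checks, exact match and lexical similarity via token-level F1; the example strings are invented, and semantic similarity would instead embed both texts and compare them with cosine similarity.

```python
from collections import Counter

def exact_match(pred: str, ref: str) -> bool:
    return pred.strip().lower() == ref.strip().lower()

def lexical_f1(pred: str, ref: str) -> float:
    # Token-overlap F1: how many words the prediction shares with the reference.
    p, r = pred.lower().split(), ref.lower().split()
    common = sum((Counter(p) & Counter(r)).values())
    if common == 0:
        return 0.0
    precision, recall = common / len(p), common / len(r)
    return 2 * precision * recall / (precision + recall)

pred = "Paris is the capital of France"
ref = "The capital of France is Paris"
print(exact_match(pred, ref), lexical_f1(pred, ref))  # False 1.0: same words, different order
```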

AI as a judge has become common for evaluating open-ended responses. It's fast, cheap, and doesn't require reference data. However, AI judges have biases: self-bias (favouring own responses), first-position bias, and verbosity bias. They should supplement exact or human evaluation rather than replace it entirely.
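A sketch of what an AI judge might look like in code. The prompt wording, the 1-5 scale, and the call_model stub (which returns a canned reply so the example runs) are assumptions for illustration, not an API from the book.

```python
JUDGE_PROMPT = """You are grading an answer to a user question.
Question: {question}
Answer: {answer}
Score the answer from 1 (poor) to 5 (excellent) for correctness and helpfulness.
Reply with only the number."""

def call_model(prompt: str) -> str:
    # Stand-in for a real model API call; returns a canned reply here.
    return "4"

def judge(question: str, answer: str) -> int:
    reply = call_model(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.strip())

print(judge("What is the capital of France?", "Paris."))  # -> 4
```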

Comparative evaluation ranks models by having them compete in pairwise matches, using rating algorithms like Elo to generate rankings. This approach is borrowed from sports and captures human preference better than absolute scoring, though it faces scalability challenges and potential biases.
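To show how pairwise matches become a ranking, here is a minimal Elo update; the K-factor of 32 and the starting rating of 1000 are conventional but arbitrary choices for this sketch.

```python
def expected_score(r_a: float, r_b: float) -> float:
    # Probability that A beats B implied by the current ratings.
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def update(r_a: float, r_b: float, a_won: bool, k: float = 32):
    e_a = expected_score(r_a, r_b)
    s_a = 1.0 if a_won else 0.0
    return r_a + k * (s_a - e_a), r_b + k * (e_a - s_a)

ratings = {"model_a": 1000.0, "model_b": 1000.0}
# Each match records which model's response was preferred.
matches = [("model_a", "model_b"), ("model_a", "model_b"), ("model_b", "model_a")]
for winner, loser in matches:
    ratings[winner], ratings[loser] = update(ratings[winner], ratings[loser], True)
print(ratings)
```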

Models are best evaluated in context for their intended purposes. Start by defining evaluation criteria: domain-specific capability, generation capability, instruction-following ability, cost, and latency. Domain capabilities are tested through benchmarks. Generation capabilities focus on factual consistency, safety, toxicity, and biases. Instruction-following measures how well models follow given instructions.

Model selection involves balancing quality, cost, and latency through four steps: build-versus-buy decisions filtering by hard attributes, public benchmark evaluation, task-specific private evaluation, and online monitoring with user feedback. Public benchmarks help filter bad models but won't identify the best model for specific applications. Creating custom evaluation pipelines is essential.

Prompt engineering crafts instructions to generate desired outputs without changing model weights. It's the easiest adaptation technique. Effective prompting requires clear instructions, examples, sufficient context, breaking complex tasks into subtasks, allowing thinking time, and iteration. In-context learning allows models to incorporate new information through few-shot examples in prompts.
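For instance, a few-shot prompt for in-context learning might look like the sketch below; the sentiment task, example reviews, and labels are invented for illustration.

```python
# Two in-prompt examples teach the model the task and output format;
# no weights are changed.
FEW_SHOT_PROMPT = """Classify the sentiment of each review as positive or negative.

Review: The battery lasts all day and the screen is gorgeous.
Sentiment: positive

Review: Stopped working after a week and support never replied.
Sentiment: negative

Review: {review}
Sentiment:"""

print(FEW_SHOT_PROMPT.format(review="Setup was painless and it just works."))
```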

RAG retrieves relevant information from external sources before generation, helping with knowledge-intensive tasks and reducing hallucinations. It uses retrievers to index and query data, with term-based retrieval (keyword matching) or embedding-based retrieval (semantic similarity). Hybrid search combines both approaches. Optimisation strategies include chunking, reranking, query rewriting, and contextual retrieval.
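A minimal sketch of the retrieve-then-generate pattern, using crude term overlap as a stand-in for BM25 or embedding search; the corpus, query, and prompt template are invented for illustration.

```python
import re

CORPUS = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Shipping to the EU typically takes 5-7 business days.",
    "Premium support is available on the enterprise plan.",
]

def tokens(text: str) -> set:
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list, k: int = 1) -> list:
    # Rank documents by how many query terms they share (a crude retriever).
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)[:k]

def build_prompt(query: str) -> str:
    context = "\n".join(retrieve(query, CORPUS))
    return (f"Answer using only the context below.\n"
            f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")

print(build_prompt("What is your refund policy for returns?"))
```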

Agents are systems that perceive environments and act on them using tools and planning. Tools expand agent capabilities through knowledge augmentation, capability extension, and multimodality bridges. Planning decomposes complex tasks into executable steps. Agents can reflect on their work through self-critique to improve reliability, though this increases cost and latency.
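A stripped-down sketch of the plan-then-act loop with a single tool; the hard-coded plan() stands in for model-generated planning so the example stays self-contained.

```python
def calculator(expression: str) -> str:
    # Toy tool: evaluates "a <op> b" for + and * only.
    a, op, b = expression.split()
    ops = {"+": lambda x, y: x + y, "*": lambda x, y: x * y}
    return str(ops[op](float(a), float(b)))

TOOLS = {"calculator": calculator}

def plan(task: str) -> list:
    # Stand-in for a model-produced plan: a list of (tool, argument) steps.
    return [("calculator", "128 * 46")]

def run_agent(task: str) -> str:
    observations = []
    for tool_name, arg in plan(task):            # decompose, then act step by step
        result = TOOLS[tool_name](arg)
        observations.append(f"{tool_name}({arg}) = {result}")
    return f"Task: {task}\nObservations: {observations}"

print(run_agent("What is 128 multiplied by 46?"))
```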

Fine-tuning adapts models by adjusting weights, building on transfer learning. It's remarkably sample-efficient, achieving comparable results with hundreds of examples versus millions needed for training from scratch. Types include continued pre-training, supervised fine-tuning, and preference fine-tuning. Small fine-tuned models often outperform much larger general-purpose models on specific tasks.
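The schematic PyTorch loop below shows the mechanics shared by these fine-tuning types (forward pass, loss on target tokens, gradient update); the toy embedding model, random data, and hyperparameters are placeholders for a real pretrained model and a curated dataset.

```python
import torch
import torch.nn as nn

vocab_size, dim = 100, 32
# Toy stand-in for a pretrained language model.
model = nn.Sequential(nn.Embedding(vocab_size, dim), nn.Linear(dim, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
loss_fn = nn.CrossEntropyLoss()

# Toy batch: input token ids and their next-token targets.
inputs = torch.randint(0, vocab_size, (8, 16))
targets = torch.randint(0, vocab_size, (8, 16))

for step in range(3):
    logits = model(inputs)                                   # (batch, seq, vocab)
    loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    print(f"step {step}: loss {loss.item():.3f}")
```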

Fine-tune when you need specific output formats, domain-specific knowledge, semantic parsing, bias mitigation, consistent behaviour, or smaller deployable models. Don't fine-tune if prompting hasn't been thoroughly tested, you lack ML expertise and infrastructure, or you need one model for many unrelated tasks. RAG addresses information problems; fine-tuning addresses behaviour problems.

Dataset engineering creates the smallest, highest-value dataset that achieves target performance within budget. High-quality data is relevant, aligned with task requirements, consistent, correctly formatted, sufficiently unique, and compliant. Data must cover the range of problems users will present, with diversity across appropriate dimensions.

Synthetic data generation increases quantity and coverage, sometimes improves quality, mitigates privacy concerns, and enables model distillation. AI-powered synthesis enables paraphrasing, translation, code generation with verification, and self-play. However, synthetic data has limitations: quality control challenges, superficial imitation risks, potential model collapse from recursive training, and obscured lineage.

Typical data processing involves inspecting data thoroughly, deduplicating to avoid biases and test contamination, cleaning and filtering low-quality content, and formatting to match model expectations. Manual inspection has high value despite low prestige.
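A sketch of the exact-deduplication step using normalised hashing; real pipelines also catch near-duplicates with techniques such as MinHash, which this does not attempt, and the sample data is invented.

```python
import hashlib

def dedupe(examples: list) -> list:
    seen, kept = set(), []
    for text in examples:
        # Normalise whitespace and case before hashing so trivial variants collide.
        key = hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept

data = ["The cat sat.", "the cat  sat.", "A different example."]
print(dedupe(data))  # keeps two of the three
```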

Inference optimisation improves cost and latency through model-level techniques like quantization and distillation, and service-level approaches like batching and parallelism. Key metrics include time to first token, time per output token, and throughput. Different workloads require different optimisation priorities.
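To make the metrics concrete, the snippet below derives time to first token, time per output token, and throughput from invented per-request timestamps of the kind a serving system would normally log.

```python
request_start = 0.00     # seconds; all values invented for illustration
first_token_at = 0.35
last_token_at = 2.15
tokens_generated = 90

ttft = first_token_at - request_start                              # time to first token
tpot = (last_token_at - first_token_at) / (tokens_generated - 1)   # time per output token
throughput = tokens_generated / (last_token_at - request_start)    # tokens per second

print(f"TTFT {ttft:.2f}s, TPOT {tpot * 1000:.1f}ms/token, throughput {throughput:.1f} tok/s")
```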

Successful AI architectures start simple and add complexity gradually. Begin with query-model-response, then enhance context through retrieval and memory. Implement guardrails for input and output safety. Add model routers to direct queries appropriately. Reduce latency with caching. Incorporate agent patterns carefully, especially for write actions.

Critical metrics include reliability (mean time to detect and resolve), business outcomes (task success, retention), quality (factuality, safety), latency, throughput, and cost. Log raw inputs, prompts, context, tool usage, and trace end-to-end requests. Detect drift early through monitoring template changes, user behaviour shifts, and upstream model swaps.

Feedback powers evaluation and future improvements. Collect explicit signals like ratings and implicit signals like edits and regeneration requests. Design for micro-interactions, randomised presentation to reduce bias, and clear consent. Expect biases in feedback and mitigate through randomisation and counterfactual testing.

Successful AI engineering requires cross-functional collaboration between product, design, engineering, data, security, and support teams. Roll out incrementally, starting with basic functionality and progressively adding retrieval, guardrails, routing, caching, and agent capabilities based on measured impact. Favour context enhancement and routing before deeper complexity, making safety and observability defaults throughout.

Full Book Summary · Amazon

Quick Links

How to effectively pitch constraints · Article

How to build a resilient design team · Article

Reading vs Action · Article

Your target market isn’t demographic · Article

Psychological Strategies for Winning a Geopolitical Forecasting Tournament

Mellers, Ungar, Baron, Ramos, Gurcay, Fincher, Scott, Moore, Atanasov, Swift, Murray, Stone, Tetlock. 2014. (View Paper →)

Probability training corrected cognitive biases, encouraged forecasters to use reference classes, and provided heuristics such as averaging when multiple estimates were available. Teaming allowed forecasters to share information and discuss the rationales behind their beliefs. Tracking placed the highest performers (the top 2% from Year 1) in elite teams that worked together. Probability training, team collaboration, and tracking each improved both calibration and resolution. Forecasting is often viewed as a statistical problem, but these results show forecasts can be improved with behavioural interventions: training, teaming, and tracking are psychological interventions that dramatically increased accuracy. Statistical algorithms (reported elsewhere) improved the accuracy of the aggregation, and putting both statistics and psychology to work produced the best forecasts two years in a row.

Key Highlights

Train and fine-tune your understanding of probabilities: Consider reference classes when making predictions. Average multiple estimates when available (a short sketch after these highlights illustrates this). Avoid common biases like overconfidence and base-rate neglect. Use statistical reasoning and decision trees.

Collaborate with others to forecast: Share information and discuss rationale. Keep each other motivated through social interactions. Exchange news articles and evidence. Engage in constructive critique of different viewpoints.

Track good forecasters and put them together: Track forecast records to identify top performers. Group them into teams and get them working well together. High performers learn faster when surrounded by other high performers. Encourage them to make multiple predictions per question.
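As a toy illustration of two of the heuristics above, averaging independent estimates and checking the result with the Brier score (the squared error between forecast and outcome, lower is better); the probabilities and outcome are invented.

```python
# Three independent forecasts for the same event, all invented.
estimates = [0.60, 0.75, 0.55]
aggregate = sum(estimates) / len(estimates)   # simple averaging heuristic

outcome = 1                                   # the event happened
brier = (aggregate - outcome) ** 2            # 0 is perfect, 1 is worst

print(f"aggregate forecast: {aggregate:.2f}, Brier score: {brier:.3f}")
```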

Book Highlights

Scenarios are constructed from the information gathered during our initial investigation phase. Typically, in both interviews and direct observation of users, we learn a lot about their tasks. Goals are stable and permanent, but tasks are fluid, changeable, and often unnecessary in computerized systems. As we develop scenarios, we need to seek out and eliminate tasks whose only justification is historical. Alan Cooper · The Inmates Are Running the Asylum
The most common problems created by the organisation itself arise from its history of specialisation and, usually, success. Richard Rumelt · The Crux
Let the business strategy guide the product strategy… Roman Pichler · Strategize

Quotes & Tweets

The reward for good work is more work. Tom Sachs
We offer no explanation as to why these architectures seem to work; we attribute their success, as all else, to divine benevolence. Noam Shazeer