Chip Huyen
Review
I focus my reading on evergreen books, as the technical landscape changes so quickly. Chip Huyen's "Designing Machine Learning Systems" stands out as one of my favourites. Her books are challenging to get through (they're long and technical) but they leave you with a much deeper understanding of the field. Though most tools and papers referenced in the book will inevitably be surpassed, the timeless principles and insights made the investment worthwhile. Our world is about to be turned upside down by AI, and I want to be ready. This book certainly helped me hone my understanding.
Key Takeaways
The 20% that gave me 80% of the value.
AI models are becoming more capable but require substantial data, compute, and specialised talent to train. Model-as-a-service providers have made foundation models accessible by abstracting training costs.
Language models work with tokens: characters, words, or word parts. Autoregressive models predict the next token using preceding context and are more popular than masked models. Models can be trained through self-supervision, where they infer labels from input data, removing the need for expensive labelled datasets. Model size is measured by parameters, with larger models requiring more training data to maximise performance.
Foundation models can be multimodal, handling text, images, and video. They can be adapted through prompt engineering, retrieval-augmented generation (RAG), or fine-tuning. AI engineering builds applications on existing models rather than creating new ones, focusing on adapting models through prompt-based techniques or fine-tuning.
Evaluation should consider the entire system, not isolated components. It mitigates risks by identifying failure points and uncovering opportunities. Foundation models are challenging to evaluate because they're open-ended and increasingly intelligent. As models approach human-level performance, evaluation becomes more time-consuming and difficult.
Important language modeling metrics include cross-entropy, perplexity, bits-per-character (BPC), and bits-per-byte (BPB). These metrics measure how well a model predicts the next token in the training data. Lower entropy or cross-entropy means the language is more predictable (i.e., the model assigns higher probability to the correct tokens). Perplexity is an exponential transformation of cross-entropy that reflects the model’s uncertainty: lower perplexity indicates better predictions. However, a perplexity of 3 does not mean a one-in-three chance of correct prediction; it means that, on average, the model is as uncertain as if it had to choose uniformly among 3 equally likely options.
Functional correctness evaluates whether systems perform intended tasks. For code generation, this means checking if code executes and passes unit tests. For tasks lacking automated evaluation, similarity measurements against reference data work well. Four approaches measure text similarity: human judgement, exact match, lexical similarity (token overlap), and semantic similarity (meaning-based using embeddings).
AI as a judge has become common for evaluating open-ended responses. It's fast, cheap, and doesn't require reference data. However, AI judges have biases: self-bias (favouring own responses), first-position bias, and verbosity bias. They should supplement exact or human evaluation rather than replace it entirely.
Comparative evaluation ranks models by having them compete in pairwise matches, using rating algorithms like Elo to generate rankings. This approach is borrowed from sports and captures human preference better than absolute scoring, though it faces scalability challenges and potential biases.
Models are best evaluated in context for their intended purposes. Start by defining evaluation criteria: domain-specific capability, generation capability, instruction-following ability, cost, and latency. Domain capabilities are tested through benchmarks. Generation capabilities focus on factual consistency, safety, toxicity, and biases. Instruction-following measures how well models follow given instructions.
Model selection involves balancing quality, cost, and latency through four steps: build-versus-buy decisions filtering by hard attributes, public benchmark evaluation, task-specific private evaluation, and online monitoring with user feedback. Public benchmarks help filter bad models but won't identify the best model for specific applications. Creating custom evaluation pipelines is essential.
Prompt engineering crafts instructions to generate desired outputs without changing model weights. It's the easiest adaptation technique. Effective prompting requires clear instructions, examples, sufficient context, breaking complex tasks into subtasks, allowing thinking time, and iteration. In-context learning allows models to incorporate new information through few-shot examples in prompts.
RAG retrieves relevant information from external sources before generation, helping with knowledge-intensive tasks and reducing hallucinations. It uses retrievers to index and query data, with term-based retrieval (keyword matching) or embedding-based retrieval (semantic similarity). Hybrid search combines both approaches. Optimisation strategies include chunking, reranking, query rewriting, and contextual retrieval.
Agents are systems that perceive environments and act on them using tools and planning. Tools expand agent capabilities through knowledge augmentation, capability extension, and multimodality bridges. Planning decomposes complex tasks into executable steps. Agents can reflect on their work through self-critique to improve reliability, though this increases cost and latency.
Fine-tuning adapts models by adjusting weights, building on transfer learning. It's remarkably sample-efficient, achieving comparable results with hundreds of examples versus millions needed for training from scratch. Types include continued pre-training, supervised fine-tuning, and preference fine-tuning. Small fine-tuned models often outperform much larger general-purpose models on specific tasks.
Fine-tune when you need specific output formats, domain-specific knowledge, semantic parsing, bias mitigation, consistent behaviour, or smaller deployable models. Don't fine-tune if prompting hasn't been thoroughly tested, you lack ML expertise and infrastructure, or you need one model for many unrelated tasks. RAG addresses information problems; fine-tuning addresses behaviour problems.
Dataset engineering creates the smallest, highest-value dataset that achieves target performance within budget. High-quality data is relevant, aligned with task requirements, consistent, correctly formatted, sufficiently unique, and compliant. Data must cover the range of problems users will present, with diversity across appropriate dimensions.
Synthetic data generation increases quantity and coverage, sometimes improves quality, mitigates privacy concerns, and enables model distillation. AI-powered synthesis enables paraphrasing, translation, code generation with verification, and self-play. However, synthetic data has limitations: quality control challenges, superficial imitation risks, potential model collapse from recursive training, and obscured lineage.
Typical data processing involves inspecting data thoroughly, deduplicating to avoid biases and test contamination, cleaning and filtering low-quality content, and formatting to match model expectations. Manual inspection has high value despite low prestige.
Inference optimisation improves cost and latency through model-level techniques like quantization and distillation, and service-level approaches like batching and parallelism. Key metrics include time to first token, time per output token, and throughput. Different workloads require different optimisation priorities.
Successful AI architectures start simple and add complexity gradually. Begin with query-model-response, then enhance context through retrieval and memory. Implement guardrails for input and output safety. Add model routers to direct queries appropriately. Reduce latency with caching. Incorporate agent patterns carefully, especially for write actions.
Critical metrics include reliability (mean time to detect and resolve), business outcomes (task success, retention), quality (factuality, safety), latency, throughput, and cost. Log raw inputs, prompts, context, tool usage, and trace end-to-end requests. Detect drift early through monitoring template changes, user behaviour shifts, and upstream model swaps.
Feedback powers evaluation and future improvements. Collect explicit signals like ratings and implicit signals like edits and regeneration requests. Design for micro-interactions, randomised presentation to reduce bias, and clear consent. Expect biases in feedback and mitigate through randomisation and counterfactual testing.
Successful AI engineering requires cross-functional collaboration between product, design, engineering, data, security, and support teams. Roll out incrementally, starting with basic functionality and progressively adding retrieval, guardrails, routing, caching, and agent capabilities based on measured impact. Favour context enhancement and routing before deeper complexity, making safety and observability defaults throughout.
Deep Summary
Longer form notes, typically condensed, reworded and de-duplicated.
Chapter 1: Introduction to Building AI Applications with Foundation Models
AI models are getting more capable, but they also require more data, compute and specialised talent to train and scale. 'Model as a service' providers have abstracted away much of the cost of training and creating foundation models, making them more accessible to other organisations.
Language Models
The basic unit of a language model is a token: a token can be a character, a word or part of a word. Breaking text into tokens is called tokenisation. The set of tokens a model can work with is the model's vocabulary.
There are two types of language models:
- Masked: Trained to predict missing tokens in a sequence (fill in the blank) using context from before and afterward. Typically used for sentiment analysis and text classification. But they're useful for debugging too.
- Autoregressive: Trained to predict the next token in a sequence, using only the preceding tokens. Generally better at text prediction, they're more popular today.
A model that can generate open-ended outputs is called generative. Completions and predictions are based on probabilities - the probabilistic nature of language models makes them both exciting and frustrating. Completion, although powerful, isn't the same as engaging in conversation.
Language models can be trained using self-supervision, while many other models require supervision (labelled data).
- Supervision: label examples, to show the model what behaviours you expect. Expensive and time consuming.
- Self-supervision: the model infers labels from the input data. The input provides both the tokens to be predicted and the context, removing the training data bottleneck - the model can learn from any text sequence.
Model size is typically measured by the number of parameters (variables that are updated during training). What's considered large changes over time: roughly 100 million parameters in 2018, over 1 trillion in 2025.
Larger models have more capacity to learn, and need more training data to maximise their performance.
Foundation Models:
Many 'Foundation' models now include other modalities beyond text (images, video etc). They are multimodal. 'Foundation' refers to the model's ability to be built upon for different needs. Foundation models are capable of a wide range of tasks - you can tweak them to be better at a specific task if needed.
OpenAI created CLIP, an embedding model that produces joint embeddings of both text and images. It was trained on 400 million text/image pairs.
There are a number of techniques you can use to adapt an existing powerful model to get better outputs (generally much easier than building a model from scratch). For example…
- Prompt Engineering: detailed instructions with example output
- Retrieval-Augmented Generation (RAG): using a vector database to supplement instructions
- Fine-tuning: further train the model on a dataset
AI Engineering
AI engineering is the process of building applications on top of foundation models. ML engineering typically involved building models, whereas AI engineering leverages existing ones. AI engineering is growing quickly because AI capabilities are general purpose, investment in AI is increasing, and barriers to entry for building AI applications are falling.
Use Cases
Common Generative AI use cases include: programming, image and video products, writing, education, conversational bots, information aggregation, data organisation and workflow automation. Applications can involve multiple use cases.
Companies are favouring applications with lower risks at the moment (e.g. internal not customer facing).
Planning AI Applications
Experimenting with and building AI applications is one of the best ways for individuals and companies to learn.
Use Case Evaluation: Be clear on why you want to build the application. Assess the risks and opportunities (risks include forced obsolescence by competitors adopting AI, missing out on productivity or profits).
Think about the role of AI in the system:
- Critical or complementary: Will the app still work without that AI feature? If so then it's complementary.
- Reactive or proactive: reactive features respond quickly to user input; proactive features interrupt user journeys when there's an opportunity and can often be pre-computed.
- Dynamic or static: dynamic systems are updated continually with user feedback (e.g. Google Photos); static systems are updated periodically.
Think about the role of humans in the system. Will AI support humans in decision making, make decisions directly, or do both? Microsoft's crawl-walk-run model for humans in the loop shows how involvement could change over time as confidence grows:
- Crawl: human involvement mandatory
- Walk: AI can directly interact with internal employees
- Run: increased automation, direct interaction with external users
Launching an AI Product / Application
Three typical types of competitive advantages: technology, data, and distribution. Foundation models mean that everyone is using similar technology.
Setting Expectations: Think about how you'll measure success. Think about:
- The most important business metric you're looking to influence
- Quality metrics to measure the quality of the chatbot's responses
- Performance metrics: TTFT (time to first token), TPOT (time per output token), total latency
- Cost metrics: how much will it cost per inference request
- Interpretability
- Fairness
It's super important that AI applications solve business problems, and business metrics are mapped to ML metrics.
Don't neglect thinking about maintenance of the solution over time.
Three layers of the AI Technology Stack:
- Application Development: AI interface, prompt engineering, context construction, evaluation
- Model Development: Inference optimisation, dataset engineering, modelling & training, evaluation
- Infrastructure: Compute management, data management, serving, monitoring
AI Engineering vs ML Engineering: AI engineering is less about model development and more about adapting what models we have. Models can be adapted primarily in two ways:
- Prompt-based techniques: adapt a model without updating the model weights
- Fine-tuning: adapts a model by updating model weights. More complex and can require more data, but can improve quality, latency and cost considerably
Training is an umbrella term that typically includes:
- Pre-training: training a model from scratch, e.g. starting with random weights
- Fine-tuning: continuing to train a previously trained model.
- Post-training: used somewhat interchangeably with fine-tuning, but mostly refers to model developers (e.g. at hyperscalers) tweaking their own models after pre-training
Inference optimisation is making models faster and cheaper to run. Language models are often autoregressive - tokens are generated sequentially. Getting latency down to the ~100ms expected of a web application is incredibly hard (at 100ms interactions feel instantaneous).
Evaluation is about mitigating risks and uncovering opportunities. Evaluation happens throughout the process, to help select models, benchmark progress and discuss deployment readiness.
Prompt engineering is about getting AI models to express desirable behaviours using inputs alone. It can make a massive difference to results - and can help models perform the task you want and respond in the format you want. For complex tasks with long context - you might need to help with memory management.
As AI engineering includes the AI interface, it looks a lot closer to full stack development. AI engineers need to be much more involved in building the product. So there's a different emphasis on where you start and how to approach things:
- AI Engineering: Product → Data → Model
- Traditional ML Engineering: Data → Model → Product
Chapter 2: Understanding Foundation Models
Models are only as good as the data they're trained on.
If you want a model to get better at a certain task - then you should think about including more data for that task in the training data.
Given training data is often costly or difficult to obtain, model builders tend to 'use what we have, not what we want'. Common Crawl for example is a text dataset, it crawls 2-3b websites each month but it includes a bunch of fake news and undesirable content. E.g. clickbait, misinformation, propaganda, conspiracy theories, racism, misogyny etc. Some model builders use heuristics to filter out low quality data (e.g. only include reddit threads with lots of upvotes).
English is the dominant language of the internet (45.88%); Russian (5.97%), German (5.88%), Chinese (4.8%) and Japanese (4.7%) follow far behind. General purpose models work much better in English than in other languages.
Domain Specific vs General Purpose
General-purpose models are good at coding, law, science, business, sports etc because those domains are well represented in training data. But they're not amazing at domain specific tasks, especially if they haven't seen them in training. Domain specific models can be very powerful if there's good data to train them on (e.g. AlphaFold).
Model Architecture
The dominant LLM architecture is the transformer, based on the attention mechanism, which addressed many limitations of previously popular architectures (such as recurrent neural networks, for which attention was originally developed). The attention mechanism allows the model to weigh the importance of input tokens when generating output tokens.
- The query vector (Q) represents the state of the decoder at each decoding step.
- Each key vector (K) represents a previous token.
- Each value vector (V) represents the value of a previous token, as learned by the model.
The attention mechanism decides how much attention to give an input token by performing a dot product between the query vector and its key vector. A high score means more of that token's content (value vector) will be used when generating the response.
The longer the sequence, the more key and value vectors need to be computed and stored. This is what makes extended context lengths hard.
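A minimal NumPy sketch of scaled dot-product attention, to make the query/key/value mechanics concrete (shapes and values are illustrative, not the book's code):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention. Q: (n_queries, d), K: (n_tokens, d), V: (n_tokens, d_v)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])  # dot product of each query with each key
    weights = softmax(scores, axis=-1)       # how much attention each previous token receives
    return weights @ V                       # weighted sum of value vectors

# one decoding step attending over three previous tokens
Q = np.random.randn(1, 8)
K = np.random.randn(3, 8)
V = np.random.randn(3, 8)
print(attention(Q, K, V).shape)  # (1, 8)
```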
Inference for transformer-based models has two steps:
- Prefill: process input tokens, creates an intermediate state needed to generate the first output token (which includes key and value vectors for all input tokens).
- Decode: The model generates one output token at a time.
Prefill is parallelisable, but decoding is sequential.
High Level Architecture:
- Transformer Block: contains the attention module and the multi-layer perceptron (MLP) module.
- Attention Module: four weight matrices: query, key, value and output projection
- MLP Module: linear layers (weight matrix), separated by nonlinear activation functions (that allow linear layers to learn nonlinear patterns).
- Embedding module: before the transformer blocks, converts tokens and their positions into embedding vectors. The number of position embeddings determines the max context length.
- Output layer: maps output vectors into token probabilities. Also called the unembedding layer, it's the model's last layer before output generation.
Model Size
The number of parameters is sometimes appended to the model name (Llama-13B refers to 13 billion parameters). Increasing parameters increases capacity to learn. You can estimate the cost to train and run (inference) a model based on the parameter count; it determines the number of GPUs required for inference.
Sparsity (the % of zero-value parameters) can make model size misleading. A large sparse model can require less compute than a small dense model. Mixture of experts (MoE) is a common sparse architecture: only the relevant experts are used to process each token.
A larger model can underperform a smaller one if not trained on enough data. The number of samples isn't a good metric for dataset size (as the definition of a sample varies so much); a better measure is the number of tokens in the dataset. The largest models are trained on trillions of tokens. The number of tokens in a model's dataset isn't the same as its number of training tokens (you can make multiple passes / epochs over the same data).
Three numbers signal a model's scale:
- Number of parameters (a proxy for learning capacity)
- Number of tokens trained on (proxy for what was learnt)
- Number of FLOPs (a proxy for training cost)
Scaling Law
Important observations when thinking about scaling:
- Model performance depends on model size and dataset size.
- Bigger models and bigger datasets require more compute
- Compute costs money
Compute is often the limiting factor, so we're often optimising to be compute-optimal (achieving the best performance for a fixed compute budget). Given a budget, you can use the Chinchilla scaling law to calculate the optimal model and dataset size.
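A rough back-of-the-envelope sketch of the Chinchilla heuristic: compute-optimal training uses roughly 20 tokens per parameter, and training compute is roughly 6 FLOPs per parameter per token (these are the commonly cited fitted values, not figures quoted from the book):

```python
def chinchilla_estimate(n_params: float) -> dict:
    """Rough compute-optimal estimates for a model with n_params parameters."""
    optimal_tokens = 20 * n_params                # ~20 training tokens per parameter
    train_flops = 6 * n_params * optimal_tokens   # ~6 FLOPs per parameter per token
    return {"params": n_params, "tokens": optimal_tokens, "flops": train_flops}

# e.g. a 70B-parameter model -> ~1.4 trillion tokens, ~6e23 FLOPs
print(chinchilla_estimate(70e9))
```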
The cost of achieving the same model performance is decreasing over time, but the cost of further performance improvements remains high: improving a model's accuracy from 90% to 95% is more expensive than from 80% to 85%.
Small changes in modelling loss can lead to big differences in quality.
A model's performance depends on the values of its hyperparameters (parameters set by research scientists / engineers to configure the model and control how it learns). Smaller models can be trained many times with different hyperparameters, but this isn't feasible for the biggest models. For the largest models you have to use scaling extrapolation or hyperparameter transfer to guess the right hyperparameters.
Scaling Bottlenecks: every order of magnitude increase in model size has led to an increase in model performance. We may run out of training data from the internet.
Recursively training new AI models on AI generated data could cause the new models to gradually forget the original data patterns, degrading their performance over time. Proprietary data could become a competitive advantage.
Data-centre electricity could become the leading bottleneck soon, replacing chip supply. Data-centres are likely to move from 2% of energy production to 20%.
Post-Training
Pre-trained models have 2 problems:
- Self-supervision optimises the model for text completion, not conversations.
- Models trained indiscriminately on internet data can produce undesirable outputs (racist, sexist, rude or wrong)
Post-training steps:
- Supervised fine-tuning: on high-quality instruction data to optimise for conversation not completion
- Adding more context to questions / asking follow up questions / giving helpful instructions
- Preference fine-tuning: to output responses that align with human preference, normally via reinforcement learning from human feedback (RLHF).
- This requires defining a reward model from comparison data ('I prefer this response over that response').
Pre-training is about increasing the accuracy of token prediction. Post-training is about producing outputs users prefer.
Post-training consumes about 2% of the energy vs pre-training 98%.
Sampling
Sampling makes AI's outputs probabilistic, which creates inconsistencies and hallucinations.
Given an input, a neural network produces an output by first computing probabilities over the possible outcomes. For a classification model, these are limited to the classes (e.g. 90% spam | 10% not spam). You can then make product decisions on top of that.
An LLM computes a probability distribution over all tokens in its vocabulary.
A common strategy would then be to pick the output with the highest probability. This is called greedy sampling.
Given an input a neural network outputs a logit vector, each logit corresponds to a token in the model vocabulary. Logits don't represent probabilities, they don't sum to one and can be negative.
Temperature: always choosing the highest-probability output restricts creativity. The temperature divides the logits before the softmax; a higher temperature reduces the probability of common tokens and increases the probability of rarer tokens.
You can debug AI models by looking at the probabilities computed for a given input. If probabilities look random, the model hasn't learned much. 'Logprobs' are probabilities on a log scale.
Top-K: sample only from the k highest-probability tokens. It reduces computation without sacrificing too much response diversity.
Top-P (nucleus sampling): sample only from the smallest set of tokens whose cumulative probability exceeds p, so outputs are more contextually appropriate.
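A minimal NumPy sketch of how temperature, top-k and top-p interact when sampling the next token from raw logits (illustrative only; real inference engines implement this far more efficiently):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample a token id from raw logits using temperature, top-k and top-p filtering."""
    logits = np.asarray(logits, dtype=float) / max(temperature, 1e-6)  # temperature scaling
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()                           # softmax -> probabilities

    if top_k is not None:                          # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
        probs /= probs.sum()

    if top_p is not None:                          # keep the smallest set with cumulative prob >= p
        order = np.argsort(probs)[::-1]
        cum = np.cumsum(probs[order])
        keep = order[: np.searchsorted(cum, top_p) + 1]
        mask = np.zeros_like(probs)
        mask[keep] = 1.0
        probs = probs * mask
        probs /= probs.sum()

    return int(np.random.choice(len(probs), p=probs))

print(sample_next_token([2.0, 1.0, 0.1, -1.0], temperature=0.7, top_k=3, top_p=0.9))
```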
Stopping conditions: ask models to stop after a certain number of tokens, or when specific stop words or tokens are generated. But you can encounter premature stopping.
Test Time Compute: generating multiple responses per query to increase the chances of a good response. You can use beam search to generate a fixed number of promising candidates. Increasing the diversity of outputs can yield better candidates, but this is expensive. You can let the user pick the best output, pick the one with the highest probability (controlling for response length), or use a reward model to score them. You can also produce responses in parallel and show the user the one that completes first.
Some tasks require structured outputs, notably semantic parsing (e.g. text-to-SQL) and tasks whose outputs are consumed by downstream applications (agentic workflows). OpenAI has a JSON mode that only outputs valid JSON. Prompting can help restrict output, but it depends on how good the model is at following instructions. You can also use AI to validate and correct the output of the original prompt.
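A minimal sketch of validate-then-retry post-processing for structured output; `generate` is a placeholder for whatever model call you use (a hypothetical helper, not an API from the book):

```python
import json

def generate(prompt: str) -> str:
    """Placeholder for a call to your model provider's API."""
    raise NotImplementedError

def generate_json(prompt: str, max_retries: int = 2) -> dict:
    """Ask for JSON, validate it, and retry with the parse error if validation fails."""
    attempt_prompt = prompt + "\nRespond with valid JSON only."
    for _ in range(max_retries + 1):
        raw = generate(attempt_prompt)
        try:
            return json.loads(raw)             # valid JSON -> done
        except json.JSONDecodeError as err:    # feed the parse error back to the model
            attempt_prompt = (
                prompt
                + f"\nYour previous output was not valid JSON ({err}). "
                  "Return corrected JSON only."
            )
    raise ValueError("Model failed to produce valid JSON")
```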
Post-processing: models make similar mistakes across queries. If you find common mistakes you can write scripts to correct them.
Constrained sampling: guiding text generation toward outputs that satisfy certain constraints.
Fine-tuning a model on examples like your desirable format is the most effective and general approach to get models to generate outputs in the right format.
The Probabilistic Nature of AI:
AI models are probabilistic by nature, meaning they sample responses based on likelihood rather than providing deterministic answers. This probabilistic sampling creates two major challenges:
- Inconsistency (same inputs producing different outputs or slight input changes causing dramatically different responses)
- Hallucinations (generating responses not grounded in facts).
Inconsistency can be partially mitigated by fixing sampling parameters like temperature and seed values, though consistency isn't guaranteed. Hallucinations are more complex, with two leading theories: models can't distinguish between given data and their own generated content (leading to "snowballing" errors), or there's a mismatch between the model's knowledge and the training data from human labellers.
Practical mitigation strategies include caching responses for consistency, using verification prompts like "say 'I don't know' if unsure", requesting concise responses to reduce the opportunity for errors, and careful prompt engineering. Completely eliminating these issues remains challenging and requires ongoing detection and measurement.
Chapter 3: Evaluation Methodology
Evaluation has to be considered in the context of the whole system, not in isolation.
Evaluation aims to mitigate risks (identify where your system is likely to fail) and uncover opportunities. You may have to redesign your system to increase visibility into its failures.
Because evaluation is difficult, people may settle for eyeballing results and gut checks.
Systematic evaluation can make results more reliable.
Evaluating foundation models is challenging because they are open-ended.
‘AI as a judge’ or using AI to evaluate AI responses is gaining rapid traction.
Challenges of Evaluating Foundation Models
- As models get more intelligent, they become harder to evaluate. Evaluation of sophisticated tasks can be time consuming. E.g. Ask a model to summarise a book - unless you read the book, it’s hard to judge the quality of the summary.
- Open-ended models can’t always be checked against a ground truth. If there are many possible correct responses, it’s impossible to curate a list of precise expected outputs.
- Foundation models are black boxes - often you don’t know the architecture, training data and training process.
- As intelligence becomes more general, and advances quickly, public benchmarks can quickly become irrelevant or be surpassed.
- It’s harder to evaluate general purpose models vs task specific models (where the domain is more constrained).
- It’s hard to evaluate the potential of systems that in some aspects surpass human capabilities.
Understanding Language Model Metrics
There are four important language modelling metrics that can be used to evaluate anything that generates tokens: cross entropy; its variations, bits-per-character (BPC) and bits-per-byte (BPB); and its close relative, perplexity.
Language models try to learn the distribution of their training data so they can assess the probability of the next tokens, and make accurate predictions. In general, the closer the model response is to the training data the better.
Entropy: a measure of how much information, on average, a token carries. High entropy means a token carries more information, and more bits are needed to represent it. Entropy measures how difficult it is to predict what comes next in a language: the lower the entropy, the more predictable the language.
Cross Entropy: a measure of how difficult it is for the language model to predict what comes next in the dataset. It depends on two things:
- the training data’s predictability, measured by the training data’s entropy
- how the distribution captured by the language model diverges from the true distribution of the training data
Language models are trained to minimise cross entropy with respect to the training data - if the language model learns perfectly, the model's cross entropy will be exactly the same as the entropy of the training data.
Bits-per-character and bits-per-byte: if the cross entropy of a language model is 6 bits, the model needs 6 bits to represent each token. Bits-per-character divides this by the average number of characters per token; bits-per-byte is the number of bits the model needs to represent one byte of training data.
Cross entropy tells us how efficient a language model will be at compressing text.
Perplexity (or PPL): the exponential of entropy and cross entropy. If cross entropy measures how difficult it is for a model to predict the next token, perplexity measures the amount of uncertainty it has when predicting the next token. Higher uncertainty means there are more possible options for the next token.
TensorFlow and PyTorch use nat (natural log) as the unit for entropy and cross entropy.
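A small sketch of the relationship between cross entropy and perplexity, computed from the probabilities a model assigns to the observed next tokens (numbers are illustrative):

```python
import numpy as np

# probability the model assigned to each actual next token in a short sequence
token_probs = np.array([0.50, 0.25, 0.10, 0.40])

cross_entropy_nats = -np.mean(np.log(token_probs))   # in nats, as PyTorch/TensorFlow report it
cross_entropy_bits = cross_entropy_nats / np.log(2)  # convert nats to bits
perplexity = np.exp(cross_entropy_nats)              # perplexity is the exponential of cross entropy

print(round(cross_entropy_bits, 2), round(perplexity, 2))
```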
Perplexity Interpretation and use cases:
- More structured data gives lower expected perplexity (as it’s more predictable). HTML would be less than everyday text.
- The bigger the vocabulary, the higher the perplexity: the more tokens there are to choose from, the harder prediction is. A bedtime story would score lower than War and Peace.
- The longer the context length, the lower the perplexity. The more context a model has, the less uncertainty it’ll have predicting the next token.
A perplexity of 3 doesn't literally mean a one-in-three chance of predicting the next token correctly; it means the model is, on average, as uncertain as if it were choosing uniformly among 3 equally likely tokens. With 10k+ tokens to choose from, that's pretty impressive.
Post-training can collapse entropy: when models are post-trained to be better at answering questions or engaging in conversation, they're being trained to respond in a way users prefer, not to predict the next token. Perplexity therefore ceases to be a great measure in this scenario.
Perplexity can be used to detect whether a text was in a model’s training data. This can help you determine what went into a model, but also what you might want to supplement it with in post-training.
Perplexity is highest for unpredictable texts and texts that express unusual ideas, so it can be used to detect abnormal texts.
Exact vs Subjective Evaluation
Exact evaluation produces judgement without ambiguity. Subjective evaluation is, well, subjective as the name suggests (the same person could give a different score some time apart). You can use a prompt to set guidelines to make subjective evaluation more exact.
Functional Correctness is evaluating a system based on whether it performs what you intended. If you ask a model to create a thing, does the model succeed. It’s the ultimate metric for evaluating the performance of any application - but it can sometimes be difficult to pin down.
- Functional correctness (or execution accuracy) for code generation can be automated. You could, say, use a Python interpreter to check that the code runs, and then test whether it produces the expected outputs for given inputs. Essentially: does it pass unit tests (see the sketch below).
- There are benchmarks for code generation, essentially sets of problems; the % of problems a model solves is its benchmark result. The more attempts a model takes at a problem, the higher the chance of passing, so results are often reported per number of attempts (e.g. 50% within 3 attempts, or 90% within 10).
Tasks with measurable objectives can typically be evaluated using functional correctness.
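A toy sketch of functional-correctness checking for generated code: execute the candidate and check it against expected outputs. (This is not a benchmark implementation; real harnesses sandbox the execution of untrusted code.)

```python
def check_functional_correctness(candidate_code: str, test_cases: list[tuple]) -> bool:
    """Run generated code that defines `solve(x)` and check it against (input, expected) pairs."""
    namespace: dict = {}
    try:
        exec(candidate_code, namespace)   # caution: only run untrusted code in a sandbox
        solve = namespace["solve"]
        return all(solve(x) == expected for x, expected in test_cases)
    except Exception:
        return False

generated = "def solve(x):\n    return x * 2\n"
print(check_functional_correctness(generated, [(2, 4), (5, 10)]))  # True
```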
Similarity Measurements Against Reference Data
If you can’t automate evaluation of your task for functional correctness then consider evaluating the AI’s output against reference data. Each example in reference data follows the format (input, reference responses). Reference responses are sometimes called ground truths or canonical responses. This approach is limited by how much and how fast reference data can be generated.
- If you use human generated data as references, then human performance is the ‘gold standard’.
- Using AI generated reference data, with some human oversight is much cheaper and faster.
Metrics that require references are called reference-based, those that don’t are called reference-free.
Generated responses that are more similar to reference responses are considered better. The 4 ways to measure similarity between two open-ended texts are:
- Asking an evaluator to make the judgement whether two texts are the same.
- Exact match: whether the generated response matches one of the reference responses exactly. A binary score. Works for tasks that expect short, exact responses. There's a variation of exact match which accepts any response that contains the reference response somewhere. Exact match rarely works beyond simple tasks.
- Lexical similarity: measures how much two texts overlap. Break them into tokens, then count and compare them. You can do approximate string matching (fuzzy matching), which counts how many edits it would take to convert one text into the other (edit distance), or n-gram similarity, which measures the overlapping sequences of tokens. This approach requires a comprehensive set of reference responses, and low-quality reference data can cause issues. Higher lexical similarity scores don't always mean better responses: incorrect responses can have high similarity to reference answers. Optimising for similarity isn't the same as optimising for correctness.
- Semantic similarity: how close the generated response is to the reference responses in meaning (semantics). It requires transforming text into a numerical representation (embedding). You can then compute cosine similarity. Semantic similarity is also called embedding similarity. BERTScore and MoverScore are standard metrics. It doesn't require reference data. Reliability depends on the embedding algorithm, and it may take a while to compute all the embeddings.
Similarity measures can also be used for retrieval and search, ranking, clustering, anomaly detection and data duplication.
Introduction to embedding
The goal of producing embeddings is to create a numerical (vector) representation that captures the essence (or meaning) of the original input data. We can verify the effectiveness of embeddings by seeing if more-similar texts have closer embeddings as measured by cosine similarity (or related metrics) OR based on their utility for your task (classification, topic modelling, recommender systems, RAG).
- The MTEB (Massive Text Embedding Benchmark) measures embeddings against multiple tasks.
- You can even generate embeddings for retail products.
It is possible to create a multi-modal embedding space too, such as image text pairs etc.
The number of elements in an embedding vector can be between 100 and 10,000. The goal is to produce embeddings that capture the essence of the original data. There are models that are trained to produce embeddings (Google BERT, OpenAI CLIP, Open AI Embeddings API). Many models first require their input data to be transformed into vector representations.
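A minimal sketch of cosine similarity between embedding vectors; `embed` is a hypothetical stand-in for whichever embedding model or API you use:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (1.0 = pointing the same way)."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def embed(text: str) -> np.ndarray:
    """Placeholder for a call to an embedding model or embeddings API."""
    raise NotImplementedError

# usage sketch: semantic similarity between a generated and a reference response
# score = cosine_similarity(embed(generated_response), embed(reference_response))
```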
AI as a Judge
Evaluating open-ended responses programmatically was historically very difficult, so many fell back to human evaluation. But now that models are getting better, using AI to evaluate AI is very common. 'AI as a judge' or 'AI judge' is the name given to this approach.
AI as a judge is fast, easy and cheap compared to human evaluators. They can also work without reference data, which can be costly to produce.
You can judge on different criteria: correctness, repetitiveness, toxicity, wholesomeness, hallucinations etc. Responses from AI judges can become strongly correlated to human evaluators. AI judges can also explain their judgement, which helps with auditability. For some applications it's the only option, and even when it's not as good as humans, it's often good enough to get a project going.
There are many different methods:
- Evaluate the quality of a response - given the question.
- Compare the response to a reference response for similarity.
- Compare two generated responses to determine which is better / which a user will prefer.
A general purpose AI judge can be asked to use any criteria. Generally judging criteria isn’t standardised.
How do you prompt an AI judge? Explain the task the model is to perform, the criteria for evaluation, and the scoring system (classification, discrete numerical values, or continuous numerical values). Prompts with examples perform better.
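A sketch of what such a judge prompt might look like; the wording, criteria and scale are illustrative, not taken from the book:

```python
JUDGE_PROMPT = """You are evaluating an answer to a user question.

Task: score the answer for factual consistency with the provided context.
Criteria: every claim in the answer must be supported by the context.
Scoring: reply with a single integer from 1 (unsupported) to 5 (fully supported),
followed by one sentence of justification.

Example:
Context: "The Eiffel Tower is in Paris."
Answer: "The Eiffel Tower is in Berlin."
Score: 1 - the answer contradicts the context.

Context: {context}
Answer: {answer}
Score:"""

def build_judge_prompt(context: str, answer: str) -> str:
    return JUDGE_PROMPT.format(context=context, answer=answer)
```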
Limitations of AI Judges:
- Many teams are hesitant to adopt this approach - it seems tautological.
- As AI Judges are probabilistic, inconsistency makes it hard to reproduce evaluation results.
- Criteria ambiguity in open source tools makes it easy to misinterpret and misuse them.
As an application evolves over time, the way it's evaluated should ideally stay fixed. That way, evaluation metrics can be used to monitor the application's changes.
Do not trust an AI judge if you can’t see the model and the prompt used for the judge.
You can use them in experimentation but also in production as guardrails to reduce risks. You can reduce costs by using weaker models as judges - or by spot-checking a subset of results in production.
AI judges in production can add latency, so you might want to evaluate responses after they have been returned to users, if the use case allows.
AI Judge Biases:
- Self-Bias: they favour their own responses over those of other models
- First-position Bias: favouring the first answer in pairwise comparison
- Verbosity Bias: favouring longer answers
Supplement AI judges with exact evaluation or human evaluation if you can.
Models can be judged by stronger or weaker models.
You can use stronger models to improve weaker ones. You might not be able to use the stronger model in production due to cost or latency issues.
You can use weaker models to judge stronger ones. Judging is easier than generating. Stronger models though are better correlated to humans.
Specialised judges:
- Reward model: Take in a (prompt, response) pair and score how good the response is given the prompt.
- Reference-based judge: evaluate the generated response against other responses with a similarity score.
- Preference model: takes in (prompt, response 1, response 2) as input and outputs which of the two responses is better.
Comparative Evaluation → Evaluating models against each other to see which is performing better. For subjective domains, comparative evaluation is easier than 'point-wise evaluation': it's easier to say which of two songs you prefer than to score each out of 100.
- Comparative evaluation is sometimes used in production too, LLMs give two responses and ask an evaluator which they prefer. Interestingly the evaluator can be human or AI.
- Not all questions should be answered by preference - in some cases ‘correctness’ is more important. E.g. if somebody asks a medical question of an LLM, it would be frustrating and dangerous to be asked to rank two results giving different advice.
- Therefore you need to determine which questions should and shouldn't be decided by preference voting. Preference voting works when:
- Users are knowledgeable about the subject and the AI is acting like an intern in the workflow.
- The response is more subjective and less about correctness
- Comparative evaluation isn’t A/B testing because each user sees the response from two models not one.
- What gets measured in comparative evaluation is the win rate of each model over another: the probability that one model beats another, estimated from the outcomes of their matches.
| Match # | Model A | Model B | Winner |
| --- | --- | --- | --- |
| 1 | Model 1 | Model 2 | Model 2 |
| 2 | Model 2 | Model 10 | Model 2 |
| 3 | Model 5 | Model 6 | Model 6 |
- A rating algorithm (e.g. Elo or Bradley-Terry) can take the pairwise results and rank models by their scores (a minimal sketch follows this list). Comparative evaluation as a concept was borrowed from sports and video games. Rankings are correct if, for any pair, the higher-ranked model is more likely to win a match against the lower-ranked one.
- A ranking is only as good as it is at predicting future outcomes.
- Challenges of comparative evaluation include:
- Scalability bottlenecks: the number of pairs grows quickly with the number of models. You can infer some pairwise results via transitivity (if A>B and B>C, we can assume A>C), but human preferences are not necessarily transitive.
- When you introduce new models to the pairwise matrix, you can change all the rankings. It can be compute intensive to recalculate them.
- It's hard to know or standardise what users are ranking on; they may, for example, prefer polite but incorrect responses.
- Commissioning evaluations for the purpose of evaluating models, can result in evaluations happening outside of the typical workflow/environment.
- You can only use evaluators that you trust - but doing so is expensive.
- For simple prompts and responses, it’s hard to differentiate a model’s performance - and that can pollute ranking data.
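A minimal sketch of an Elo-style rating update applied to pairwise match results like those in the table above (the K-factor and starting rating are conventional defaults, not values from the book):

```python
def elo_update(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Update two models' ratings after one pairwise match."""
    expected_a = 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))  # expected win probability
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - expected_a)
    return rating_a + delta, rating_b - delta

ratings = {"Model 1": 1000.0, "Model 2": 1000.0, "Model 5": 1000.0,
           "Model 6": 1000.0, "Model 10": 1000.0}
matches = [("Model 1", "Model 2", "Model 2"),
           ("Model 2", "Model 10", "Model 2"),
           ("Model 5", "Model 6", "Model 6")]
for a, b, winner in matches:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_wins=(winner == a))
print(sorted(ratings.items(), key=lambda kv: -kv[1]))  # ranked leaderboard
```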
Some teams prefer AI evaluators because, although they might not be as good as trained human experts, they can be more reliable than random internet users.
Comparative performance vs absolute performance: sometimes it's helpful to know which model is better; sometimes you just want to know if a model is good enough. In the real world better models usually come with higher costs, so you may have to determine whether the extra performance is worth it.
Comparative evaluation aims to capture the quality we care about: human preference.
Comparative evaluation can be complementary to A/B testing in online evaluation, and offline it's a good data point to have alongside benchmarks.
Chapter 4: Evaluate AI Systems
A model is only useful if it works for its intended purposes → models are therefore best evaluated in context.
Evaluation-Driven Development: Before you spend time, money and resources building an AI application, define your evaluation criteria first.
The most common ML enterprise applications in production are those with clear evaluation criteria (e.g. Recommender systems → engagement, fraud detection systems → fraud prevention).
Start with a list of evaluation criteria specific to the application. For example: domain-specific capability, generation capability, instruction-following capability, cost and latency.
Domain Specific Capability
A model's domain-specific capabilities are constrained by its architecture, size, training data, etc. To assess whether a model has the domain capability you need, you'll typically test it on a domain-specific benchmark (public or private). Exact evaluation is typically used for domain-specific evaluation (e.g. code generation checked for correctness), but you might also look at efficiency (runtime resource usage) or latency.
Most domain-specific capabilities (other than code generation) are evaluated using close-ended tasks like multiple choice questions (MCQs). Close-ended outputs are easy to produce and verify. Excluding open-ended tasks on purpose helps avoid inconsistent assessment. Accuracy is a common metric: how many questions the model gets right. You can also allocate more points to harder questions, and if there are multiple correct answers, award points for each correct answer.
- If you're doing classification, where the answer is selected from a defined set (e.g. Negative, Positive, Neutral), you can use F1 score, accuracy and precision.
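A minimal sketch of computing those classification metrics with scikit-learn (the labels and predictions are made up for illustration):

```python
from sklearn.metrics import accuracy_score, precision_score, f1_score

labels      = ["Positive", "Negative", "Neutral", "Positive", "Negative"]
predictions = ["Positive", "Negative", "Positive", "Positive", "Neutral"]

print("accuracy :", accuracy_score(labels, predictions))
print("precision:", precision_score(labels, predictions, average="macro", zero_division=0))
print("f1       :", f1_score(labels, predictions, average="macro", zero_division=0))
```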
Multiple choice questions are easy to create, verify and evaluate against the random baseline. Although be careful, models are sensitive to small changes in how the questions are worded.
MCQs test the ability to differentiate good vs bad responses → that's not the same as how to generate good responses. Use them to evaluate knowledge and reasoning, don't use them to evaluate generation capabilities like summarisation, translation and essay writing.
Generation Capability
The old traditional measures of text generation were things like:
- Fluency: is the text grammatically correct and does it sound natural?
- Coherence: how well structured is the text?
- Faithfulness: how faithful is the translation to the original sentence?
- Relevance: does the summary focus on the most important aspects of the source document?
Those metrics are less relevant now that capabilities have improved so much. One of the most pressing issues today is hallucination → so 'factual consistency' is now important. Now that these models are used in production, safety metrics are important too (toxicity and biases).
Factual consistency can be verified in two settings:
- Local factual consistency: evaluated against the relevant context, and important in tasks like summarisation, customer support and business analysis.
- Global factual consistency: evaluated against open knowledge / commonly accepted facts
Factual consistency is easier to verify against specific facts, e.g. if you're given the context of scientific papers etc. If no context is given, then you need to search for sources and derive them → the hardest part of factual consistency verification is determining what the facts are. Often what's considered factual depends on the sources that you trust.
- The internet is full of misinformation.
- The absence of evidence fallacy can also occur.
Models rely heavily on things like the relevance of websites, and don't naturally weight scientific references as much as humans do.
If you're evaluating for hallucinations, focus on the queries that are likely to generate them.
- Queries that involve niche knowledge
- Queries asking for things that don't exist
AI judges can evaluate factual consistency when they have the context to evaluate an output against. AI as a judge techniques include:
- Predicting if humans will find a claim truthful (TruthfulQA is 90-96% accurate)
- Self-verification: if a model generates multiple outputs that disagree, the original output is likely hallucinated.
- Knowledge-augmented verification: DeepMind's SAFE (Search-Augmented Factuality Evaluator) leverages search engine results to verify a response. It follows 4 steps: decompose the response into individual statements, revise each statement to make it self-contained, propose fact-checking queries to send to the Google Search API, and use AI to determine whether each statement is consistent with the search results.
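A structural sketch of that SAFE-style loop. The callables are hypothetical stand-ins for LLM prompts and a search client, not DeepMind's implementation:

```python
from typing import Callable

def verify_response(
    response: str,
    decompose: Callable[[str], list[str]],            # LLM call: split response into claims
    make_self_contained: Callable[[str, str], str],   # LLM call: add missing context to a claim
    propose_queries: Callable[[str], list[str]],      # LLM call: fact-checking search queries
    search: Callable[[str], str],                     # search API call
    judge: Callable[[str, list[str]], bool],          # LLM judge: is the claim supported by evidence?
) -> list[dict]:
    """Score each statement in a response as supported / unsupported by search results."""
    results = []
    for statement in decompose(response):
        claim = make_self_contained(statement, response)
        evidence = [search(q) for q in propose_queries(claim)]
        results.append({"claim": claim, "supported": judge(claim, evidence)})
    return results
```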
Textual entailment is a longstanding NLP technique that determines the relationship between two statements. Given a premise, it determines which category a hypothesis falls into:
- Entailment: the hypothesis can be inferred from the premise (implies factual consistency)
- Contradiction: the hypothesis contradicts the premise (implies factual inconsistency)
- Neutral: the premise neither entails nor contradicts the hypothesis (consistency can't be determined)
Instead of using general-purpose AI judges, you can train scorers specialised in factual consistency prediction, taking (premise, hypothesis) pairs as input and predicting entailment.
TruthfulQA is a benchmark that includes 817 questions AI is likely to get wrong (because they cover common misconceptions).
Safety: Models can be harmful beyond factual consistency. The following are often of concern:
- Inappropriate language, profanity and explicit content
- Harmful recommendations and tutorials (rob a bank, self harm)
- Hate speech (racist, sexist, homophobic, discrimination)
- Violence (threats, graphic details)
- Stereotypes (CEOs all having male names)
- Biases (political, religious)
You can use general-purpose AI judges to detect these or you can use / develop moderation tools.
There are benchmarks to measure toxicity (RealToxicityPrompts).
Instruction-Following Capability: a measure of how good the model is at following the instructions you give it. It's important for applications that require structured outputs. It can be approximated with instruction-following benchmarks like IFEval (a set of 25 types of instructions that can be automatically verified). INFObench checks a model's ability to stick to content constraints (e.g. discuss only climate change).
Create your own benchmark to evaluate your model’s capability to follow your instructions using your own criteria.
Roleplaying: can be useful for adopting a character (for users to interact with) or as a prompt engineering technique to improve model outputs. RoleLLM and CharacterEval are roleplaying benchmarks. Roleplaying should be evaluated on both style and knowledge/content.
Cost and Latency:
Model quality needs to be balanced with cost and latency. Often cost is the limiting factor.
Latency metrics: time to first token, time per token, time between tokens, time per query etc. You need to determine what matters for your application. Latency doesn’t just depend on the model, but the prompt itself and the nature of the response (number of tokens).
You may be able to influence latency and cost by asking the model to be concise or by setting stopping conditions.
You can either host models, or have them hosted. If you host your own models your cost per token can become cheaper with scale.
Model Selection: As you move through the development process, you'll have to keep re-running model selection. You might begin with prompt engineering on expensive models to evaluate feasibility, then see if smaller, cheaper models would work. It's typically a two-step process: 1) figure out the best achievable performance, 2) map models along the cost-performance axes and choose the model that provides the best value.
A 4 step process for model selection:
- Build vs Buy decision: Filter out models by hard attributes
- Public model evaluation: public benchmarks and leaderboards
- Task-specific evaluation: private prompts, metrics and evals
- Online evaluation / monitoring / user feedback.
During steps 2 and 3 you’re often balancing model quality, cost and latency.
Build vs Buy: You can look into open source models, but be mindful of commercial-use restrictions and whether the licence allows you to train or improve other models with them. Whether to host a model yourself or use a model API depends on the use case, and the answer for the same use case can change over time. Seven axes to consider: data privacy, data lineage (the legalities of what the model was trained on), performance, functionality, costs, control, and on-device deployment.
Navigate Public Benchmarks:
An evaluation harness helps you evaluate a model on multiple benchmarks. OpenAI evals lets you run any of 500 benchmarks and register new ones. Aggregating benchmark results to rank models gives you a leaderboard. The question becomes what benchmarks to include and how to aggregate them to rank models.
There are many public leaderboards that rank models on their aggregated performance over a subset of benchmarks; some focus on maths, law, problem solving, science questions, knowledge, reasoning, etc. Some benchmarks are closely correlated, so many leaderboards try to balance compute cost and coverage.
When evaluating models for a specific application, you should create a private leaderboard that ranks models based on your evaluation criteria.
Create a custom leaderboard with public benchmarks. Gather a list of recent / relevant benchmarks to evaluate the capabilities important to your application. You’ll need to think about how to aggregate the scores.
Many benchmarks might not be measuring what you expect them to measure.
The goal of using public benchmarks is usually to pick a shortlist of models to experiment with more rigorously using your own benchmarks.
Data contamination (data leakage, training on the test set) is when a model was trained on the same data it's evaluated on. Benchmark data published before a model was trained is likely included in the model's training data. Benchmarks become saturated so quickly that developers often need to create new benchmarks to evaluate their new models.
You can detect contamination using heuristics like:
- n-gram overlapping: checking whether certain sequences of tokens from the evaluation sample also appear in the training data (a minimal sketch follows this list).
- Perplexity: a measure of how difficult it is for a model to predict a given text. If a model's perplexity on evaluation data is unusually low, the model could have seen that data during training.
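A minimal sketch of the n-gram overlap heuristic: check what fraction of an evaluation sample's n-grams also appear in the training corpus (13 is a commonly used n, but the exact choice varies; whitespace tokenisation keeps the sketch simple):

```python
def ngrams(tokens: list[str], n: int) -> set[tuple]:
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(eval_text: str, training_text: str, n: int = 13) -> float:
    """Fraction of the eval sample's n-grams that also appear in the training data."""
    eval_ngrams = ngrams(eval_text.split(), n)
    train_ngrams = ngrams(training_text.split(), n)
    if not eval_ngrams:
        return 0.0
    return len(eval_ngrams & train_ngrams) / len(eval_ngrams)

# a high rate suggests the benchmark sample leaked into the training data
print(contamination_rate("the cat sat on the mat today", "yesterday the cat sat on the mat", n=5))
```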
Public benchmarks will help you filter out bad models, but they won't help you find the best model for your application.
Design Your Evaluation Pipeline:
Designing a reliable evaluation pipeline is essential for AI applications, especially on open-ended tasks. Your goal is to consistently distinguish good outcomes from bad ones. Start by evaluating both end-to-end task success and the intermediate outputs of each component, at per-turn and per-task levels. This pinpoints failure modes (e.g., PDF text extraction vs. employer extraction) and highlights efficiency (how many turns it takes to complete a task). Task-based evaluation matters most because users care about outcomes, but defining task boundaries can be tricky in conversational systems.
Create a clear evaluation guideline. Specify not only what the system should do but also what’s out of scope and how to handle it. Define criteria that reflect quality in your domain (e.g., relevance, factual consistency, safety) and recognise that “correct” may still be “bad” if it’s unhelpful. Build unambiguous scoring rubrics with examples, choose appropriate scales (binary, ternary, or continuous), and validate them with humans until they’re easy to apply. Tie these metrics to business goals so you know what level of model quality unlocks which outcomes (e.g., how factual consistency translates to percent of support automation), and set usefulness thresholds. Be deliberate about engagement metrics to avoid perverse incentives.
Then, choose evaluation methods and data. Mix specialised classifiers, similarity measures, logprob-based confidence, and AI judges, using cheaper signals broadly and expensive ones sparingly. Combine automatic metrics with ongoing human review, even in production, to catch regressions and unusual patterns. Curate annotated datasets that cover components, turns, and tasks; use real production data when possible, and reuse the guideline for future labelling or fine-tuning. Slice data (by user segment, channel, input length, topic, etc.) to find biases, debug weak spots, and avoid Simpson's paradox. Maintain multiple eval sets: one mirroring the production distribution, plus stress sets for common failure and out-of-scope cases.
Size your evaluation sets so results are stable and actionable. Use bootstrapping to check variance; if scores swing widely across resamples, add more data. Remember that evaluation should support comparisons (models, prompts, components), so ensure your sample sizes are large enough to detect meaningful differences.
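A minimal bootstrapping sketch along these lines: resample per-example scores with replacement and look at how wide the interval of aggregate scores is. Wide swings suggest you need more evaluation data. The scores and resample count below are illustrative.

```python
import random

def bootstrap_interval(scores: list[float], n_resamples: int = 1000) -> tuple[float, float]:
    """Rough 95% interval for the aggregate eval score, via resampling with replacement."""
    means = []
    for _ in range(n_resamples):
        sample = random.choices(scores, k=len(scores))
        means.append(sum(sample) / len(sample))
    means.sort()
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples)]

per_example_scores = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0, 1.0, 0.0, 1.0]  # e.g. pass/fail per example
low, high = bootstrap_interval(per_example_scores)
print(f"95% bootstrap interval: {low:.2f}-{high:.2f}")  # wide interval => add more eval data
```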
Finally, evaluate—and iterate on—your evaluation. Check that higher-quality responses truly earn higher scores and that better metrics correlate with better business outcomes. Measure reliability (reproducibility, variance across runs and datasets), metric correlations (drop redundant ones), and the added cost/latency of the pipeline. As needs and behaviors change, update criteria, rubrics, and datasets—but keep enough consistency to track progress. Log everything that can affect results (data, rubrics, prompts, and judge settings) to enable trustworthy, repeatable experimentation.
In more detail, the pipeline for evaluating open-ended tasks breaks into a systematic three-step process:
- Evaluate All System Components
Real-world AI applications require evaluation at multiple levels: per task, per turn, and per intermediate output. Each component should be evaluated independently to identify exactly where failures occur. For example, a CV-parsing application might fail at either PDF text extraction or employer identification; evaluating each step separately pinpoints the problem.
Turn-based evaluation measures individual output quality, while task-based evaluation assesses whether the system completes entire tasks. Task-based evaluation is more important since users care about accomplishing goals, but defining task boundaries can be challenging in conversational applications.
- Create Clear Evaluation Guidelines
Defining evaluation criteria is the most critical step. Determine what constitutes a "good" response; correctness alone isn't sufficient. For instance, LinkedIn's Job Assessment tool found that "You are a terrible fit" might be accurate but unhelpful without explaining gaps and improvement suggestions.
Develop specific criteria (e.g., relevance, factual consistency, safety) by testing real queries and generating multiple responses. Create scoring rubrics with concrete examples for each score level, then validate them with humans to ensure clarity. Ambiguous guidelines produce unreliable results.
Connect evaluation metrics to business outcomes. Map technical scores to business impact (e.g., "80% factual consistency enables automating 30% of support requests; 90% enables 50%"). This helps determine usefulness thresholds and justify resource investment.
- Define Methods and Data
Select appropriate evaluation methods for each criterion: specialised classifiers for toxicity, semantic similarity for relevance, AI judges for factual consistency, or logprobs for confidence measurement. Mix methods when needed: use cheap classifiers on all data and expensive AI judges on samples for cost-effective validation.
Use automatic metrics where possible, but don't avoid human evaluation. Many teams treat human assessment as the North Star metric, with experts evaluating daily subsets to detect performance changes.
Curate annotated evaluation sets representing production data. Slice data by user tiers, traffic sources, common mistakes, and edge cases. Evaluation sets should be large enough for reliable results but small enough to be practical—benchmarks typically use 300-1,000+ examples. Use bootstrapping to verify if your sample size provides consistent results.
Evaluate and Iterate
Assess your pipeline's quality: Do better responses get higher scores? How reliable and reproducible are results? What's the cost-latency tradeoff? Update criteria and rubrics as needs evolve, while maintaining consistency through proper experiment tracking of all variables (data, rubrics, prompts, configurations).
Chapter 5: Prompt Engineering
Prompt engineering is about crafting an instruction that gets a model to generate the desired outcome. It’s the easiest and most common adaptation technique: unlike finetuning, it changes the model’s behaviour without changing its weights. Make the most of prompting before moving on to finetuning. Think of prompt engineering as human-to-AI communication.
A prompt is an instruction given to a model to perform a task. Prompts generally consist of a task description (the model’s role and the output you want), one or more example outputs, and the concrete task itself.
For prompting to work, the model has to be able to follow instructions. How much engineering is required depends on how robust the model is to prompt perturbation (when small changes in the prompt cause big changes in the output). You can test a model’s robustness by trying slight variations of a prompt.
In-Context Learning: Zero-Shot and Few-Shot
In-context learning allows a model to continually incorporate new information when making decisions, preventing it from becoming outdated. You can include new information in the prompt without having to retrain the model. Teaching a model to learn from a few examples in the prompt is called few-shot learning. Experiment to determine how many examples are needed; include too many and you eat into the context length. For domain-specific use cases, examples are likely to be more important.
Prompt and context are sometimes used interchangeably. Some think context is the contextual information in the prompt; others think it’s everything that goes into the model (i.e. the whole prompt).
Think of LLMs as a library of programs, and you can choose which ones to activate with prompts. You can have a system prompt (instructions put in by developers) and a user prompt (instructions put in by users).
If a model has a chat template, make sure you’re using the right one and doing it properly.
Prompt Engineering Best Practices:
- Write clear and explicit instructions
- Provide examples
- Provide sufficient context (context construction).
- Break complex tasks into simpler subtasks (prompt decomposition helps with monitoring, debugging and parallelisation)
- Give the model time to think - when you can afford the latency
- Ask the model to self-critique / check its own results
- Iterate on your prompts
There are now tools (e.g. OpenPrompt) that aim to aid and automate prompt engineering, for example by applying a mutator prompt to your original prompt to generate variations. Still, start without any tools and write your first prompts yourself.
Organise and version prompts: separate prompts from code, put them in a file such as prompts.py, and reference them when creating model queries. This has advantages: reusability, testing, readability, and collaboration.
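A minimal sketch of this separation. The file names, prompt text, and version label are illustrative, and the message format is just one common convention.

```python
# prompts.py - keep prompt text and versions out of application code.
SUMMARISE_V2 = """You are a concise assistant.
Summarise the following support ticket in three bullet points.

Ticket:
{ticket}"""

# app.py - reference the versioned prompt when building model queries
# (the message dict is a placeholder for whichever SDK you use).
from prompts import SUMMARISE_V2

def build_messages(ticket: str) -> list[dict]:
    return [{"role": "user", "content": SUMMARISE_V2.format(ticket=ticket)}]
```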
Defensive prompt engineering: protect yourself against the three common techniques: prompt extraction, jailbreaking / prompt injection and information extraction. Risks from prompt attacks include remote code execution, data leaks, social harm, misinformation, service interruption, corporate espionage and brand risk.
Chapter 6: RAG and Agents
To solve a task, a model needs the instruction and the necessary information to do it. AI models are more likely to hallucinate when missing context.
There are two main ways to construct context: RAG (retrieval-augmented generation) and agents.
- RAG allows the model to retrieve relevant information from external data sources
- Agents allow models to use tools such as web search and news APIs to gather information
RAG
RAG enhances a model’s generation by retrieving relevant information from external memory sources. The retrieve-then-generate approach works well for knowledge-intensive tasks where all the available knowledge can’t be fed into the model directly. RAG selects the information most relevant to the query (as determined by the retriever) and inputs it into the model. It helps generate better responses and reduces hallucinations. RAG constructs context relevant to each query, instead of using the same context for all queries.
Some think longer context windows will be the death of RAG. Two reasons that might not be true:
- What if the amount of relevant information grows faster than context windows?
- What if larger context windows can’t be used well? Context length vs. context efficiency: there’s a benefit to having less, but more relevant, information.
RAG Architecture: has two components, a retriever that retrieves information from external memory sources and a generator that produces a response based on the retrieved information. The generator is now usually a large foundation model, but the two can be trained together for better responses. The retriever is the important bit: it indexes and queries the data. Indexing means processing the data so it can be retrieved quickly.
Documents can vary in size, so documents are split into more manageable chunks. For each query the goal is to return the most relevant chunks.
Retrieval Algorithms: Retrieval is an old problem. Traditional retrieval algorithms can be used for RAG.
Retrieval typically refers to one database or system, whereas search involves retrieval across many systems.
Retrieval works by ranking documents based on relevance to a given query. There are two main approaches:
- Term-based retrieval: find relevant documents via keywords (lexical retrieval). Algorithms use metrics like term frequency (TF), but some terms are common everywhere, so inverse document frequency (IDF) measures how significant it is that a document contains a term; TF-IDF combines the two (sketched after this list). Elasticsearch uses an inverted index that maps terms to the documents containing them, allowing fast retrieval of documents given a term.
- Embedding-based retrieval: term-based retrieval computes relevance at a lexical level rather than a semantic level; embedding-based retrievers (semantic retrieval) rank documents by how closely their meanings align with the query. Indexing here means converting data chunks into embeddings and storing them in a vector database. Retrieval is then framed as a nearest-neighbour search: compute similarity scores between the query embedding and the document embeddings and return those with the highest scores. Larger datasets use approximate nearest neighbour (ANN) algorithms, and vector databases are typically organised into buckets, trees, or graphs.
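To make the two approaches concrete, here is a minimal sketch: a simplified TF-IDF scorer for term-based retrieval (production systems typically use BM25 over an inverted index) and cosine similarity for embedding-based retrieval, where `embed()` stands in for whatever embedding model you would use.

```python
import math
from collections import Counter

docs = ["refund policy for damaged items", "shipping times and tracking", "refund timelines"]

# --- Term-based: score docs by summed TF-IDF of the query terms -----------
def tf_idf_scores(query: str, docs: list[str]) -> list[float]:
    tokenised = [d.lower().split() for d in docs]
    n = len(docs)
    scores = []
    for toks in tokenised:
        tf = Counter(toks)
        score = 0.0
        for term in query.lower().split():
            df = sum(term in d for d in tokenised)      # document frequency
            if df:
                idf = math.log(n / df)                   # rarer terms weigh more
                score += (tf[term] / len(toks)) * idf
        scores.append(score)
    return scores

# --- Embedding-based: rank by cosine similarity to the query embedding ----
def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_scores(query_emb: list[float], doc_embs: list[list[float]]) -> list[float]:
    return [cosine(query_emb, e) for e in doc_embs]

print(tf_idf_scores("refund", docs))   # lexical relevance; misses paraphrases
# semantic_scores(embed("money back"), [embed(d) for d in docs])  # would catch paraphrases
```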
The quality of retrieval algorithms can be evaluated based on the quality of the data they retrieve:
- Context precision: out of all the documents retrieved, what percentage is relevant to the query
- Context recall: out of all the documents that are relevant to the query, what percentage is retrieved
To compute precision you need to curate an evaluation set with a list of test queries and test documents. There’s no way to compute recall without annotating the relevance of all documents to that query. (A small sketch of both metrics follows below.)
- NDCG (normalised discounted cumulative gain): a measure that rewards rankings which place the most relevant results higher up the search results.
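A minimal sketch of context precision and recall for a single query, assuming retrieved and annotated-relevant chunks are identified by id.

```python
def context_precision(retrieved: set[str], relevant: set[str]) -> float:
    """Of everything retrieved, what fraction was relevant?"""
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0

def context_recall(retrieved: set[str], relevant: set[str]) -> float:
    """Of everything relevant, what fraction was retrieved?"""
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0

retrieved = {"chunk-1", "chunk-4", "chunk-9"}
relevant = {"chunk-1", "chunk-2", "chunk-4"}
print(context_precision(retrieved, relevant))   # 2/3
print(context_recall(retrieved, relevant))      # 2/3
```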
The quality of a retriever should be evaluated in the context of the whole RAG system. RAG can add latency, and generating the embeddings can be costly.
You can make a trade-off between indexing and querying. The more detailed the index is, the more accurate the retrieval process will be but the indexing process will be slower and more costly.
Hybrid search combines term-based retrieval and embedding-based retrieval. The two can be used in sequence or in parallel as an ensemble.
Four Retrieval Optimisation strategies:
- Chunking Strategy: Deciding how to split documents into manageable pieces for retrieval, such as by fixed character/word counts, sentences, paragraphs, or recursive splitting. Key considerations include chunk size (smaller allows more diverse information but may lose context), overlap (prevents cutting off important information), and the trade-off between retrieval quality and computational costs. (A minimal chunking sketch follows this list.)
- Reranking: After initial retrieval, documents can be reordered to improve accuracy and reduce the number needed. A common pattern uses a cheap retriever to fetch candidates, then a more expensive, precise algorithm reranks them; documents can also be prioritised by recency for time-sensitive applications.
- Query Rewriting: This reformulates ambiguous or context-dependent user queries into clear, standalone questions that will retrieve better results. It's especially important for conversational AI where follow-up questions lack context (e.g., "How about Emily Doe?" becomes "When was the last time Emily Doe bought something from us?"), and can be done using AI models with appropriate prompts.
- Contextual Retrieval: Each document chunk is augmented with relevant context (metadata, tags, keywords, questions it can answer, or AI-generated summaries) to make retrieval easier. This helps the system understand what isolated chunks are about by prepending context that explains the chunk's relationship to its original document, typically 50-100 tokens describing its position and relevance.
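A minimal sketch of fixed-size chunking with overlap; splitting on words and the specific sizes are simplifications, since real pipelines often chunk by tokens, sentences, or document structure.

```python
def chunk(text: str, chunk_size: int = 200, overlap: int = 40) -> list[str]:
    """Split text into word-based chunks; overlap keeps context that would otherwise be cut off."""
    words = text.split()
    step = chunk_size - overlap
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, max(len(words) - overlap, 1), step)]

chunks = chunk("some long document text " * 100, chunk_size=200, overlap=40)
print(len(chunks), "chunks")
```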
Agents
Intelligent agents that can act as our assistants have long been considered one of the ultimate goals of AI. Two aspects of an agent determine its capabilities: tools and planning. Agents are systems that perceive an environment and act on it. An agent is defined by its environment and its action set, which is expanded by the tools it can use. There’s a tight coupling: environments constrain possible actions, and a tool inventory constrains which environments the agent can operate in. Modern LLM-driven products (RAG systems, ChatGPT-style assistants, coding bots) are agents with tools.
Why agents need stronger models. Multi-step work compounds errors because step-wise accuracy multiplies: at 95% accuracy per step, a 10-step task succeeds only about 60% of the time (0.95^10 ≈ 0.60). Tool use also raises the stakes. Autonomy can still be worthwhile by saving human time, but cost/latency must be managed.
Tool inventory: the set of external functions/APIs an agent may call. Tools both read (perceive) and write (change) the environment. Read tools (retrievers, SQL SELECT, email reader) expand perception; write tools (SQL UPDATE/DELETE, email sender, banking API) enable real-world impact and thus require trust, safeguards, and permissions.
- Knowledge augmentation (context construction): retrievers, SQL executors, enterprise APIs, web/search/news/social APIs; combats model staleness but increases exposure to low‑quality sources—choose APIs carefully.
- Capability extension: calculators, converters, calendars, translators, code interpreters (powerful but must be secured against code injection).
- Multimodality bridges: text↔image generation, OCR, transcription, image captioning, LaTeX renderers, charting.
Empirically, tool use can outperform base models on reasoning/QA benchmarks.
Planning is getting from a goal to actions. Task = goal + constraints. Complex tasks require a plan (task decomposition).
- Decouple plan and execution: generate a plan, validate it (heuristics or AI judges), then execute; iterate as needed. Parallel plan proposals cut latency at higher cost.
- Intent classification: helps pick tools and detect IRRELEVANT requests to avoid wasted compute.
- FM vs. RL planners: RL agents learn a planner via RL (resource‑intensive). Foundation‑model (FM) agents use the model as the planner (prompted/finetuned). The two paradigms can be combined.
- Function calling: declare tools (names, params, docs); at query time specify which are available and whether tool use is required/none/auto. Always log the parameter values the model proposes (a minimal dispatch sketch follows this list).
- Planning granularity: detailed plans are harder to draft but easier to execute; high-level plans are the reverse. A hierarchical approach works well. Natural-language plans are API-robust but require a translator layer.
- Control flows: beyond sequential, agents may use parallel, if/branching, and loops; frameworks differ in support, and parallelism can greatly reduce latency.
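A minimal, provider-agnostic sketch of the function-calling loop: declare tools, let the model propose a call, validate and log it, then execute. `call_model`, the tool schema, and `get_weather` are placeholders; real providers each define their own schema and required/none/auto flags.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)

def get_weather(city: str) -> str:
    return f"Sunny in {city}"                  # stand-in for a real API call

TOOLS = {
    "get_weather": {
        "fn": get_weather,
        "description": "Look up current weather",
        "params": {"city": "string"},
    },
}

def call_model(prompt: str, tools: dict) -> dict:
    # Placeholder: a real model call would decide which tool to use and with what arguments.
    return {"tool": "get_weather", "arguments": {"city": "London"}}

def run_turn(prompt: str) -> str:
    proposal = call_model(prompt, TOOLS)
    logging.info("tool call proposed: %s", json.dumps(proposal))  # always log proposed params
    name, args = proposal["tool"], proposal["arguments"]
    if name not in TOOLS:                                          # guard against invalid tools
        return "error: unknown tool"
    return TOOLS[name]["fn"](**args)

print(run_turn("What's the weather in London?"))
```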
Reflection & error correction. Reflection: self‑critique to evaluate plans and outcomes, then revise. Common pattern: interleave reasoning and actions (e.g., ReAct‑style loops). Reflection boosts reliability but increases tokens, cost, and latency. Can be single‑agent (self‑critique) or multi‑agent (actor + evaluator).
Tool selection: No universal recipe. Experimentation and ablation help right‑size the inventory. Track distribution of tool calls, remove under‑ or mis‑used tools, or replace tools the model struggles to use. Different tasks/models prefer different tools. Ensure the framework supports your needed planners/tools and is extensible.
Failure modes & evaluation:
- Planning failures: invalid tools; invalid parameters; wrong parameter values; goal failures (violating constraints like budget/time). Measure rates of valid plans/tool calls and retries to validity.
- Tool failures: buggy outputs, poor translators, or missing tools for the domain. Always print tool calls and outputs for inspection; benchmark translators; collaborate with domain experts.
- Efficiency: track steps, cost per task, and per‑action latency; compare to baselines (agents or humans). Optimise for the agent’s strengths (e.g., parallel browsing).
Memory: making agents durable. Three mechanisms:
- internal knowledge (weights), short‑term memory (context window), long‑term memory (external stores via retrieval).
- Why it matters: handle context overflow, persist preferences across sessions, boost consistency, and preserve structured data integrity (tables/queues).
- Management strategies: FIFO/token windows (simple but lossy), summarization + entity tracking to reduce redundancy, and reflection‑based memory updates (insert/merge/replace; handle contradictions by policy).
Chapter 7: Finetuning
Overview and History: Finetuning is the process of adapting a model to a specific task by adjusting its weights, building on transfer learning concepts introduced in 1976. It represents a critical bridge between general-purpose pre-trained models and task-specific applications. Knowledge gained from pre-training on abundant data like text completion can be transferred to specialized tasks with limited data, making models remarkably sample-efficient. While training from scratch might require millions of examples, finetuning can achieve comparable results with just hundreds, fundamentally changing the economics of model development.
Types of finetuning include:
- Self‑supervised finetuning (continued pre‑training): further train on raw, unlabeled, task‑related text (e.g. legal corpora) before costly supervised steps.
- Infilling finetuning: train the model to fill in masked spans (useful for editing/code debug). Can be added even if base was autoregressive.
- Supervised finetuning (SFT): train on (input, output) pairs (instructions → responses) to align with formats, styles, safety.
- Finetuning for human preference: optimise to prefer “winning” over “losing” responses (comparative data); commonly implemented with RL‑style preference tuning.
You can also finetune to extend context length: modify positional embeddings/attention so the model handles longer sequences; this is harder to do and can degrade short-context performance.
Economic Considerations
The cost dynamics strongly favour finetuning over training from scratch. Finetuning dramatically reduces model costs by starting from existing capabilities rather than random weights, requiring orders of magnitude less data and compute time. Remarkably, small finetuned models often outperform much larger general-purpose models on specific tasks. Grammarly found their finetuned Flan-T5 models outperformed GPT-3 variants on writing tasks despite being 60 times smaller.
Both model developers and application developers engage in finetuning, though their approaches differ. Model developers typically release various finetuned variants of their base models, while application developers adapt these models for specific use cases. The accessibility of finetuning has expanded significantly with the proliferation of open-source models and frameworks.
When not to Finetune: If prompting hasn't been thoroughly tested with systematic experiments, finetuning is premature. The lack of ML expertise and infrastructure can make finetuning prohibitively complex. The inability to maintain and update models long-term creates technical debt. Critically, finetuning for one task often degrades performance on others, creating challenges for applications expecting diverse prompts. New base models released frequently might outperform your carefully finetuned version, making the investment questionable.
Finetune when…
- When you need specific output formats (JSON/YAML/DSL)
- When the model lacks domain-specific knowledge crucial to your task
- When you need semantic parsing (e.g. learning syntax, or understanding what ambiguous language means in your context, such as "show me last week's numbers")
- You need to mitigate bias with curated counter‑bias data.
- When a stricter style is needed or safety is important
- When you need consistent behaviour across diverse inputs
- You’re deploying a smaller, cheaper model that must match a big model (via distillation).
Do not finetune (yet) when…
- Prompting and context (tools, examples) haven’t been explored systematically.
- You mainly lack facts or freshness → use RAG first.
- You can’t commit to data curation, MLOps, and ongoing maintenance.
- You need one model for many unrelated tasks (finetuning for one can degrade others), instead consider multiple adapters/models.
As general models get better, they often beat domain‑specific models on domain tasks; always compare before investing.
Finetuning versus RAG: This depends fundamentally on whether failures are information-based or behavior-based.
- RAG excels at addressing factual inaccuracies where the model lacks information or has outdated knowledge. Starting with simple term-based search like BM25 before moving to more complex embedding-based approaches often provides the best results.
- Finetuning, conversely, addresses behavioral issues where outputs are factually correct but irrelevant, malformatted, or stylistically inappropriate.
The practical workflow for model adaptation typically progresses through several stages:
- Begin with careful prompt engineering using best practices and systematic versioning.
- Add examples progressively, typically between one and fifty depending on the use case.
- If the model fails due to missing information, implement RAG starting with basic retrieval methods.
- For persistent behavioral issues like irrelevant or malformatted responses, consider finetuning.
- Many applications ultimately benefit from combining both approaches for maximum performance.
Memory Bottlenecks and Technical Foundations
Memory is often the primary bottleneck for foundation model finetuning, with requirements determined by the number of parameters, trainable parameters, and numerical representations. Understanding these bottlenecks is crucial for selecting appropriate techniques. The training process involves both forward passes that compute outputs and backward passes that update weights using loss gradients. During the backward pass, the model compares outputs to ground truth, computes each parameter's contribution through gradients, and adjusts parameters using optimisers.
Training is much more memory-intensive than inference. Inference memory roughly equals the number of parameters times bytes per parameter times 1.2 for overhead. Training memory includes model weights, activations, gradients, and optimiser states. For a 13 billion parameter model using Adam optimisation, gradients and optimiser states alone require 78GB of memory.
Models use various numerical representations that directly impact memory usage. FP32 uses 32 bits as the standard format, while FP16 and BF16 use 16 bits with different trade-offs between range and precision. Integer formats like INT8 and INT4 enable extreme compression. Quantization, the process of converting to lower-precision formats, provides dramatic memory savings. A 13 billion parameter model requires 52GB in FP32 but only 26GB in FP16. Post-training quantization has become standard for inference, while training quantization remains more challenging and typically uses mixed precision approaches.
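The back-of-the-envelope arithmetic behind those numbers, as a rough sketch (activation memory and framework overhead vary a lot, and the 78GB figure assumes gradients plus two Adam states kept in 16-bit):

```python
params = 13e9                      # 13 billion parameters

def gb(n_bytes: float) -> float:
    return n_bytes / 1e9

# Inference: weights x bytes-per-parameter x ~1.2 overhead.
print(gb(params * 2 * 1.2))        # ~31 GB to serve a 13B model in FP16

# Weights alone at different precisions.
print(gb(params * 4))              # ~52 GB in FP32
print(gb(params * 2))              # ~26 GB in FP16

# Training with Adam: gradients + two optimiser states per parameter,
# on top of the weights and activations.
print(gb(params * 2 * 3))          # ~78 GB just for gradients + optimiser states in 16-bit
```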
A breakdown of the key Finetuning Techniques
- Full FT (Full Finetuning): Updates all weights in the model. Use when: You have ample data and compute resources, need maximum quality, and are willing to pay the memory and engineering costs.
- Partial finetuning: Updates only a subset of model layers. Use when: Full finetuning is too computationally expensive but you need better performance than PEFT methods.
- PEFT (Parameter-Efficient Fine-Tuning): Updates a tiny fraction of parameters while achieving comparable performance to full finetuning. Use when: You have limited computational resources but need good performance.
- LoRA/QLoRA: Decomposes weight matrices into products of smaller matrices. Use when: You need the best cost-quality tradeoff, want multi-tenant serving, or have single-GPU constraints. (A minimal LoRA sketch follows this list.)
- IA3/BitFit: Ultra-lightweight parameter-efficient methods. Use when: You need quick baseline improvements or want to implement multi-task batching with minimal resources.
- Soft prompts: Adds trainable tokens to inputs. Use when: You need the smallest possible footprint or just modest behavior/style adjustments to the model.
- Long-context finetune: Modifies positional embeddings/attention for longer sequences. Use when: You must process long documents or code, but be sure to validate for potential short-context side effects.
- Merging: Combines multiple existing models without additional training. Use when: You want to combine task strengths, need on-device consolidation, or implement federated aggregation of models.
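A minimal, PyTorch-style sketch of the LoRA idea referenced above: freeze the pretrained weight and learn a low-rank update, so only a tiny fraction of parameters train. The dimensions, rank, and scaling are illustrative, not a drop-in for any particular library.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """y = Wx + scale * B(Ax), with W frozen and only A, B trainable."""
    def __init__(self, d_in: int, d_out: int, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False               # frozen pretrained weight W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # trainable low-rank factor
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero-init so training starts at W
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

layer = LoRALinear(d_in=4096, d_out=4096, r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)   # 65,536 trainable parameters vs ~16.8M in the full 4096x4096 matrix
```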
Chapter 8: Dataset Engineering
Dataset engineering builds the smallest, highest‑value dataset that gets your model to target performance within budget. The quality of a model depends fundamentally on the quality of its training data. Dataset engineering aims to create datasets that enable training the best possible model within budget constraints. When many teams use the same base models and those base models become more expensive to train, data has become a key differentiator for AI performance. Clean, targeted, well‑verified data reliably beats marginal architecture tweaks for downstream performance. The field has evolved from model-centric approaches (improving architectures and training techniques) to data-centric approaches (enhancing data quality and processing).
Teams are now challenged to develop the best dataset for a given model, rather than the best model for a given dataset. However, meaningful progress typically requires investment in both dimensions.
Data Curation: Different training approaches require different types of data:
- Self-supervised finetuning needs raw sequences.
- Instruction finetuning requires (instruction, response) pairs.
- Preference finetuning needs (instruction, winning response, losing response) triplets.
- Reward model training uses annotated scores.
- Chain-of-thought reasoning requires training data with step-by-step explanations, not just final answers.
- Tool use (agents) examples present unique challenges. Human experts might miss steps or perform actions inefficiently for AI agents. Humans prefer web interfaces while models work better with APIs. This often necessitates simulations and synthetic techniques to generate appropriate tool use data, including special multi-message formats to handle messages sent to different destinations.
Data Quality: The Six Characteristics of High-Quality Data
- Relevant: Training examples align with the task domain and timeframe
- Aligned with task requirements: Annotations match what the task actually needs (factual accuracy, creativity, conciseness, etc.)
- Consistent: Different annotators produce similar annotations for the same examples
- Correctly formatted: Examples follow expected formats without extraneous tokens
- Sufficiently unique: Minimal duplication to avoid bias and contamination
- Compliant: Adheres to all relevant policies, laws, and regulations
Data Coverage and Diversity
Training data must cover the range of problems users will present. Different applications require different diversity dimensions. A French-to-English translator needs topic and style diversity but not language diversity. A global chatbot needs linguistic and cultural diversity but perhaps not domain diversity.
For general-purpose use cases, diversity becomes critical across multiple axes: task types, topics, instruction formats, output lengths, and speaking patterns. Meta revealed that Llama 3's performance gains came primarily from improvements in data quality and diversity rather than architectural changes. Interestingly, math, reasoning, and code accounted for nearly half their training data despite representing far less than 50% of internet content: high-quality code and math data disproportionately boosts reasoning capabilities.
Data Quantity
"Asking how much data you need is like asking how much money you need" it varies dramatically by situation. Three factors determine requirements:
- Finetuning technique: Full finetuning requires orders of magnitude more data than PEFT methods like LoRA
- Task complexity: Simple classification needs far less data than complex question answering
- Base model performance: Better base models need fewer examples to reach target performance
OpenAI's finetuning research showed that with few examples (100), advanced models perform better. But after training on many examples (550,000), all models performed similarly. In short: if you have limited data, use PEFT on advanced models. If you have abundant data, use full finetuning with smaller models.
Start with a small, well-crafted dataset (50-100 examples) to validate improvement potential. Clear gains suggest more data will help; no improvement rarely means more data will solve the problem. Plot performance against dataset size to estimate how much additional data you'll need. Steep slopes indicate doubling data will significantly improve results, while plateaus suggest diminishing returns.
Data Acquisition and Annotation: Start with your product data and create a data flywheel that leverages user-generated content to build a significant competitive advantage.
Later, supplement with public/proprietary sources; always check licences and provenance. Write crisp guidelines (rubrics, examples, edge cases, scoring anchors). Expect to iterate; spot-check regularly; measure inter-annotator agreement.
Where to find public data: Hugging Face, Kaggle, Google Dataset Search, data.gov / data.gov.in, ICPSR, UCI ML Repository, OpenML, Open Data Network, AWS Open Data, TensorFlow Datasets, lm-evaluation-harness corpora, Stanford Large Network Dataset Collection.
Annotation guidelines prove among the most challenging parts of AI engineering. Teams must explicitly define what makes responses good, distinguish between correct-but-unhelpful and high-quality answers, and clarify scoring criteria. The good news: these guidelines serve double duty for both training and evaluation data.
Data Augmentation and Synthesis
Data augmentation creates new data from existing real data (like flipping an image of a cat). Data synthesis generates data that mimics real data properties (like simulating bot movements). While distinct, both aim to automate data creation.
Five Key Reasons for Synthetic Data
- Increase quantity: Generate abundant data for training and testing, especially for rare scenarios
- Increase coverage: Create targeted data with specific characteristics (adversarial examples, rare classes, toxic phrases for detection)
- Increase quality: Sometimes AI generates better data than humans (tool use patterns, complex math problems, consistent preference ratings)
- Mitigate privacy: Enable use cases where real data can't be used (healthcare records, insurance claims)
- Distill models: Train smaller, faster models to mimic larger ones
Traditional Synthesis Techniques
Rule-based synthesis uses templates and predefined rules to generate structured content like transactions, invoices, or math equations. Simple transformations can augment existing data: rotating images, replacing words with synonyms, or perturbation (adding noise). AlphaGeometry trained on 100 million synthetic geometry problems. Perturbation can also make models more robust: changing one pixel can fool models 67% of the time, so training on perturbed data improves resilience.
Simulation enables virtual experimentation without real-world costs or dangers. Self-driving cars test highway scenarios virtually. Robots learn to pour coffee through simulated joint movements. Finance simulates bankruptcies and market crashes. Climate scientists create variations in temperature and weather patterns.
AI-Powered Synthesis
Modern AI dramatically expands synthesis possibilities. AI can simulate API outcomes without actual calls, enabling faster, cheaper tool use training. Self-play lets models learn by competing against themselves. OpenAI's Dota 2 bot played 180 years' worth of games daily.
AI paraphrasing and translation augment datasets efficiently. "How to reset my password?" becomes "I forgot my password," "How can I change my password?" and "Steps to reset passwords." Researchers rewrote 15,000 math examples multiple ways to create 400,000 examples, significantly improving performance.
Llama 3's synthetic data workflow demonstrates AI synthesis sophistication:
- Generate diverse programming problem descriptions
- Create solutions in multiple programming languages
- Generate unit tests to verify code
- Prompt AI to fix errors in synthesised code
- Translate code to other languages, filtering failures
- Generate code explanations and documentation, verifying with back-translation
This pipeline produced over 2.7 million synthetic coding examples.
Data Verification
Synthetic data quality can be measured through functional correctness (does code execute properly) and AI judges (scoring or classifying examples as good/bad). Coding examples dominate synthetic data because they're verifiable: most Llama 3 synthetic data was coding-related.
For data that can't be functionally verified, AI verifiers assign quality scores or detect issues like factual inconsistency. Creative approaches include training content detectors (if synthetic data is easily distinguished from real data, it's not good enough) or anomaly detection to identify outliers.
Limitations of AI-Generated Data
Four major limitations prevent synthetic data from completely replacing human data:
- Quality control: AI can generate low-quality data. "Garbage in, garbage out" remains true. Reliable evaluation methods are essential.
- Superficial imitation: Models may mimic style without substance. Imitating a capable model's math solutions can teach a weaker model to produce solution-looking responses rather than actual solutions, forcing hallucination.
- Potential model collapse: Recursively training on AI-generated data can cause irreversible performance degradation. AI overrepresents probable events (not having cancer) while underrepresenting rare events (having cancer), causing models to forget rare cases over iterations. Mixing synthetic with real data appears necessary, though optimal ratios remain unclear.
- Obscure lineage: AI generation hides data origins. If model X trained on copyright-violating data generates your training data, your model inherits the violation risk. Data contamination becomes harder to detect.
Model Distillation
Model distillation trains a small student model to mimic a larger teacher model, producing smaller, faster models that retain comparable performance. DistilBERT reduced BERT's size by 40% while retaining 97% capability and being 60% faster. Alpaca finetuned Llama-7B on examples from the 175B-parameter text-davinci-003, creating a model that behaved similarly at 4% the teacher's size.
Note that not all synthetic data training is distillation. Students can outperform teachers. NVIDIA's Nemotron-4-340B-Instruct trained on data from the smaller Mixtral-8x7B-Instruct and outperformed its teacher. The key is introducing quality verification mechanisms for synthetic data rather than training indiscriminately.
Typical Data Processing Steps:
- Inspect Data: Understand origins, processing history, and usage. Plot token distributions, input/response lengths, topics, and languages. Analyse by source, time, and annotator to spot patterns and outliers. Manual inspection for 15 minutes typically reveals insights that save hours of problems. As OpenAI co-founder Greg Brockman noted: "Manual inspection of data has probably the highest value-to-prestige ratio of any activity in machine learning."
- Deduplicate Data: Duplicated data skews distributions, introduces biases, causes test contamination, and wastes compute. Repeating 0.1% of data 100 times can degrade an 800M parameter model to 400M parameter performance. Deduplication approaches include pairwise comparison (expensive), hashing (grouping similar examples), and dimensionality reduction (reducing data dimensions before comparison). See the hashing sketch after this list.
- Clean and Filter Data: Remove extraneous formatting tokens (HTML tags, Markdown). Databricks found removing these improved accuracy by 20% while reducing input length by 60%. Remove non-compliant data (PII, sensitive info, copyrighted material, toxic content). Filter low-quality data using heuristics discovered through manual inspection. Researchers found annotations made in the second half of sessions are lower quality due to fatigue.
- Format Data: Convert to the specific chat template expected by your model. Wrong templates cause strange bugs. Finetuning instructions typically don't need task descriptions or examples that base model prompts require. For example, a three-shot prompt might become simply "Classify: {example}" during finetuning. Ensure inference prompts match training format exactly. "burger ->" differs from "burger", "Item: burger -->", or "burger -->" (extra space).
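A minimal sketch of hash-based deduplication: normalise each example and drop exact duplicates. Near-duplicate detection typically uses MinHash/LSH or embedding similarity instead; the examples below are toy data.

```python
import hashlib

def dedup(examples: list[str]) -> list[str]:
    seen, unique = set(), []
    for ex in examples:
        # Normalise whitespace and casing before hashing so trivial variants collapse.
        key = hashlib.sha256(" ".join(ex.lower().split()).encode()).hexdigest()
        if key not in seen:                 # keep the first occurrence only
            seen.add(key)
            unique.append(ex)
    return unique

print(dedup(["Translate: hello", "translate:  hello", "Translate: goodbye"]))
```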
Dataset engineering requires iteration between curation, generation, and processing. It's mostly toil, tears, and sweat but it's also the foundation on which model performance is built.
Working style reminders: Keep originals immutable; pipeline in cheap to expensive order; dry run every step on a small shard; log provenance and filters; measure effects on your eval set at each stage.
Chapter 9: Inference Optimisation
Model usability hinges on inference cost and latency. Cheaper inference makes AI more affordable, while faster inference enables broader application integration. Key concepts:
- Efficiency metrics: For LLMs, latency includes time to first token (TTFT) from prefilling and time per output token (TPOT) from decoding. Throughput metrics directly relate to cost, with clear latency-throughput tradeoffs.
- Hardware considerations: Performance depends heavily on accelerator hardware. Different hardware requires different optimisation approaches.
- Optimisation techniques: These exist at model level (quantisation, distillation, attention optimisation) and service level (batching strategies, parallelism). Model-level changes may affect behaviour, while service-level optimisations preserve the model's characteristics.
- Workload-specific choices: KV caching matters more for long contexts; prompt caching helps with overlapping prompts or multi-turn conversations. Performance requirements also determine the approach: prioritising latency may mean using more replicas despite higher cost.
Most impactful techniques across use cases include quantisation, tensor parallelism, replica parallelism, and attention mechanism optimisation.
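As a rough sketch of how the latency metrics combine (ignoring queueing and network time, and with illustrative numbers):

```python
def total_latency(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    """Approximate request latency: time to first token plus time per output token for the rest."""
    return ttft_s + tpot_s * output_tokens

# e.g. 0.5s to first token, 20ms per subsequent token, 400-token answer:
print(total_latency(0.5, 0.02, 400))   # 8.5 seconds
```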
Chapter 10: AI Engineering Architecture and UI
Start Simple, Add Complexity Gradually: Successful AI products emerge from simple systems that grow only as concrete needs arise. Begin with the simplest architecture (query → model → response). Then evolve the production systems by adding components to solve concrete problems: missing facts, safety risks, rising latency/costs, or capability limits.
Common Progression In Production
Step 1: Enhance Context - Most quality gains come from better context, which functions like feature engineering for foundation models. Add retrieval mechanisms (documents, tables, images), memory, and read-only tools (search, SQL, web) to provide necessary information. Treat context building like feature engineering: version it, test it, and measure its impact.
Step 2: Implement Guardrails - Guardrails mitigate risks through input controls (preventing private information leaks) and output controls (catching failures like malformatted responses, toxicity, or brand-risk content). Develop clear policies for handling failures: retry, repair, or escalate to humans. Guardrails involve trade-offs between safety and latency; balance accordingly.
Step 3: Add Model Router and Gateway - As applications grow, routers direct different query types to appropriate solutions (cheap models for easy tasks, specialists for niche tasks, humans when needed). Model gateways provide unified interfaces to different models, simplifying maintenance and centralising authentication, rate limits, and access controls. They're natural homes for logging, analytics, and sometimes caching.
Step 4: Reduce Latency with Caches - Caching reduces latency and cost through exact caching (reusing identical requests) and semantic caching (reusing similar requests via embeddings). Implement clear eviction policies and TTL strategies based on query reuse likelihood. Never cache user-specific answers as global results. Only adopt semantic caching when measured benefits exceed complexity costs.
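A minimal sketch of an exact cache keyed on the normalised request with a TTL eviction policy; semantic caching would key on embedding similarity instead. The normalisation and TTL are illustrative, and user-specific answers should never go into a shared cache like this.

```python
import hashlib
import time

class ExactCache:
    def __init__(self, ttl_s: float = 3600):
        self.ttl_s, self.store = ttl_s, {}

    def _key(self, prompt: str) -> str:
        # Light normalisation so trivially different requests still hit the cache.
        return hashlib.sha256(prompt.strip().lower().encode()).hexdigest()

    def get(self, prompt: str):
        hit = self.store.get(self._key(prompt))
        if hit and time.time() - hit[1] < self.ttl_s:   # honour the TTL eviction policy
            return hit[0]
        return None

    def put(self, prompt: str, response: str) -> None:
        self.store[self._key(prompt)] = (response, time.time())

cache = ExactCache(ttl_s=600)
cache.put("What is your refund policy?", "Refunds within 30 days...")
print(cache.get("what is your refund policy?"))   # hit despite casing difference
```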
Step 5: Add Agent Patterns Carefully - Advanced applications incorporate loops, parallel tool use, and conditional branching to handle complex tasks. Model outputs can invoke write actions (sending emails, placing orders, initiating transfers), which expand capability but significantly increase risk exposure. Treat write actions as privileged operations requiring confirmations, dry runs, audit logs, and human approval where needed.
Key Metrics:
- Reliability: Aim to lower MTTD (Mean Time To Detect), MTTR (Mean Time To Resolve), and CFR (Change Failure Rate). Tie monitoring to evaluation so pre-launch metrics predict production behaviour, and production incidents feed back into test suites.
- Business/UX: task success/"problem solved?", abandonment, DAU/retention, turn count and dialogue diversity by task.
- Quality/Safety: format validity (e.g., JSON parse rate), factuality (context-groundedness), refusal rate, toxicity, sensitive-data flags.
- Latency/Throughput: TTFT, time per output token, total latency; requests/sec; queueing.
- Cost: input/output tokens, cache hit rate, unit cost per solved task.
- Breakdowns: by release, prompt/chain version, model, cohort, geography, and time of day. Correlate these to your north-star metrics.
Logs and traces you'll actually use. Log the raw user input (with privacy controls), final prompt, retrieved context pointers, tool I/O, model/config versions, and per-step latency/cost. Build traces that reconstruct a request end-to-end so anyone can see what ran, where it slowed, what it cost, and why it failed.
Detect drift early. Watch for (1) silent template/system-prompt edits, (2) user behaviour shifts (e.g., shorter prompts as users learn), and (3) upstream model swaps hidden behind stable API names. Use canaries and regression suites on critical paths.
Orchestrate late, not early. Start without a heavy orchestrator; add one when complexity demands it. Favour tools that integrate with your models/retrievers/gateway, support branching and parallelism, and don't hide latency or make surprise calls. Extensibility and transparency matter as much as features.
Feedback is fuel: treat it like data. Feedback powers evaluation, roadmap, personalisation, and future training. Be explicit about use, gain consent where needed, and keep provenance.
- Explicit signals: thumbs/stars, "better/worse/same," pairwise choices; clear but sparse and biased.
- Implicit signals: early stop, "No, I meant…" corrections, user edits of outputs (gold preference pairs), regenerate requests, human handoffs, conversation deletes/renames/shares, turn count and diversity, sentiment shifts, refusal spikes.
- When to ask: onboarding (light calibration), low confidence (offer options), after errors. Keep prompts nonintrusive and throttled; offer an "I don't know" in comparisons.
- Design patterns: micro-interactions embedded in the flow (accept/edit/ignore for code/text), randomised option order to reduce position bias, previews for side-by-side. For standalone apps, consider "data donation" consent to attach recent context to a feedback item.
Feedback limits and mitigations. Expect leniency, position, recency, and length biases; expect random clicks. Randomise presentation, hold out traffic, and run counterfactual tests. Beware degenerate loops (optimising for clicks changes who shows up and what they click). Guard against sycophancy when training on user preferences; keep a ground-truth track separate from "user-pleasing" signals.
Cross-functional responsibilities.
- Product: define success and failure modes, escalation policies, and the feedback flywheel.
- Design/Research: specify feedback touchpoints, consent, copy, and bias mitigations.
- Engineering/ML: implement context, routing, guardrails, gateway, caching, and traces; keep eval ↔ monitoring parity.
- Data/Analytics: build dashboards, drift/bias checks, and cost guardrails.
- Security/Privacy/Legal: sensitive-data classes, retention, license/compliance rules (incl. feedback use).
- Support/Ops: human takeover runbooks, SLOs, and incident management.
A pragmatic rollout plan.
- Ship the thin slice (query → model → response) with basic logging.
- Add retrieval/tools for the top failure mode.
- Add input/output guardrails for the riskiest vector.
- Introduce routing + gateway for cost/perf control.
- Add exact caching; measure hit rate and savings.
- Pilot agent loops on one high-leverage task; gate write actions.
- Stand up dashboards (quality/safety/latency/cost) with traces.
- Layer in feedback capture and a small human-in-the-loop flow.
- Only then consider semantic cache or a full orchestrator.
Bottom line. Grow architecture stepwise; favour context, routing, and caching before deeper complexity. Make safety and observability defaults. Treat feedback as a durable data asset. Add (and remove) components based on measured impact on user and business outcomes.