Author
Martin Zinkevich
Year
2023
Review
This short best practice guide by Martin Zinkevich is well worth reading. It shares some of Google's hard-earned machine learning best practices. It's written to be accessible to less technical readers, though a basic understanding of ML is still required.
Key Takeaways
The 20% that gave me 80% of the value.
- Don't be afraid to launch a product without machine learning. Unless you have great data, machine learning will likely underperform basic heuristics (e.g. when ranking apps in a marketplace, use downloads and install rate).
- First, design and implement metrics. Instrument metrics now, decide on what’s important, establish a performance baseline. Track future optimisation goals and non-goals (what you don’t want to degrade).
- Choose machine learning over complex heuristics. If you've moved beyond simple heuristics to complex heuristics, you're ready to start with machine learning. ML can be easier to maintain than complex heuristics.
Your First Pipeline
- Keep the first model simple and get the infrastructure right. The first model deployment often provides the biggest performance bump, so keep it simple to deliver value quickly. Locating training data, defining success, and integrating into production are a lot to tackle, so don't add more complexity than required.
- Test the infrastructure independently from the machine learning. Test getting data to the algorithm. Test getting models out of the training algorithm.
- Be careful about dropped data when copying pipelines. Some pipelines deliberately drop data they don't need, so make sure you understand what you're copying if you copy a pipeline to set up a new one.
- Turn heuristics into features, or handle them externally. Often ML systems are replacing existing systems built on rules and heuristics. Mine those heuristics for information. Four ways you can use an existing heuristic (see the sketch after this list):
- Pre-process using the heuristic (e.g. if a sender has already been blacklisted for spam, use that signal directly rather than trying to relearn it)
- Create a feature directly from the heuristic
- Mine the raw inputs of the heuristic
- Modify the label
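A minimal sketch of three of these options, assuming a hypothetical spam-sender blacklist and an install-rate heuristic (the names and data shapes are illustrative, not from the guide):

```python
# Sketch: reusing an existing heuristic in an ML pipeline.
# BLACKLISTED_SENDERS and the example dict shape are hypothetical.

BLACKLISTED_SENDERS = {"spammer@example.com"}

def preprocess(example):
    """Option 1: pre-process using the heuristic.
    Blacklisted senders are filtered out before the model ever sees them,
    so the model never has to relearn the blacklist."""
    if example["sender"] in BLACKLISTED_SENDERS:
        return None  # dropped / short-circuited; never scored by the model
    return example

def heuristic_features(example):
    """Options 2 and 3: create a feature from the heuristic's output,
    and also expose the heuristic's raw inputs to the model."""
    install_rate = example["installs"] / max(example["downloads"], 1)
    return {
        "heuristic_install_rate": install_rate,  # the heuristic's score
        "raw_downloads": example["downloads"],   # the heuristic's raw inputs
        "raw_installs": example["installs"],
    }
```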
Monitoring
- Know the freshness requirements of your system. Does model performance degrade over a day, week or quarter?
- Detect problems before exporting models. Sanity check before exporting.
- Watch for silent failures. Make sure the data tables feeding your models are still updating; otherwise data goes stale and model performance quietly degrades (see the freshness check sketched after this list).
- Give feature columns owners and documentation. Otherwise you'll struggle to maintain them.
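A minimal sketch of a staleness check, assuming each feature table exposes a last-updated timestamp (the table names and freshness thresholds are illustrative):

```python
import datetime

# Hypothetical freshness requirements per feature table, in hours.
MAX_AGE_HOURS = {"user_history": 24, "item_stats": 6}

def stale_tables(last_updated: dict) -> list:
    """Return the tables whose data is older than their allowed age.
    Run this before exporting a model and alert on it in production."""
    now = datetime.datetime.utcnow()
    stale = []
    for table, max_hours in MAX_AGE_HOURS.items():
        age = now - last_updated[table]
        if age > datetime.timedelta(hours=max_hours):
            stale.append(table)
    return stale
```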
Your First Objective
- Machine learning algorithms often require a single objective to optimise.
- Don’t overthink which objective you choose to optimise. Introducing ML is often so powerful that most of the metrics you care about will go up.
- Choose a simple, observable and attributable metric for your first objective. It should be easy to measure and a good proxy for the 'true' objective. Train on the simple objective and consider having a 'policy layer' to add additional logic for final ranking. The easiest thing to model is a user behaviour that is directly observed and attributable to an action of the system (e.g. a button was clicked). Indirect metrics can be useful, but are better kept as KPIs (e.g. time on site).
- Starting with an interpretable model makes debugging easier.
- Separate spam filtering and quality ranking in a policy layer. Quality ranking should focus on ranking content posted in good faith, so remove spam from the training data for the quality classifier. Racy content should be handled separately from quality ranking too (a minimal policy-layer sketch follows this list).
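A minimal sketch of the policy-layer idea, assuming a trained quality model plus separate spam and raciness classifiers (all function names are hypothetical):

```python
def final_ranking(candidates, quality_score, is_spam, is_racy):
    """Policy layer: the quality model only ranks good-faith content;
    spam removal and racy-content handling are applied outside it."""
    # 1. Remove spam before the quality model sees it.
    eligible = [c for c in candidates if not is_spam(c)]
    # 2. Rank by the simple, trained objective (e.g. predicted click).
    ranked = sorted(eligible, key=quality_score, reverse=True)
    # 3. Apply extra policy rules: push racy content to the bottom.
    #    Python's sort is stable, so quality order is kept within each group.
    return sorted(ranked, key=is_racy)
```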
Feature Engineering
- Plan to launch and iterate. Don’t expect the current model to be the last. Models will be regularly updated as features and objectives evolve.
- Start with directly observed features. Avoid relying on learned features initially; directly observed data is less prone to issues. By creating a model without deep features, you can get an excellent baseline performance. After this baseline is achieved, you can try more esoteric approaches.
- Explore features that generalise across contexts. Content-related features, such as word count or number of actions, often work across different scenarios.
- Use very specific features when possible. Specific features tend to be more informative than generalised ones. Don’t be afraid of groups of features where each feature applies to a very small fraction of your data, but overall coverage is above 90%. You can use regularisation to eliminate the features that apply to too few examples.
- Combine and modify existing features to create new features in human-understandable ways. Discretisation consists of taking a continuous feature and creating many discrete features from it. Crosses combine two or more feature columns (see the sketch after this list).
- The number of feature weights you can learn is proportional to the amount of data. In linear models, more data allows for more complex features.
- Clean up unused features. Regularly remove outdated or unused features to simplify the system.
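A minimal sketch of discretisation and a feature cross, assuming a continuous age feature and two categorical columns (the boundaries and column values are illustrative):

```python
def discretize(value: float, boundaries: list) -> str:
    """Turn one continuous feature into one of several discrete bucket features,
    e.g. boundaries [18, 35, 65] map a value to 'bucket_0' .. 'bucket_3'."""
    bucket = sum(value >= b for b in boundaries)
    return "bucket_%d" % bucket

def cross(*feature_values: str) -> str:
    """Cross two or more feature columns into a single combined feature."""
    return "_x_".join(feature_values)

# Usage on a hypothetical example row:
age_bucket = discretize(42.0, boundaries=[18.0, 35.0, 65.0])  # -> 'bucket_2'
combo = cross("country=US", "language=en")  # -> 'country=US_x_language=en'
```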
Human Analysis of the System
- You are not a typical end user. Always remember that your expertise skews your perspective; you’re too close to the code.
- Measure the delta between models. Track how different the new results are from production. Example: if you have a ranking problem, run both models on a sample of queries through the entire system, and look at the size of the symmetric difference of the results (weighted by ranking position). If the difference is very small, you can tell without running an experiment that there will be little change. If the difference is very large, then you want to make sure that the change is good (a sketch of this comparison follows the list).
- Utilitarian performance trumps predictive power. Prioritise model utility over theoretical accuracy. If a change improves log loss but degrades the system's performance, look for another feature. When this starts happening more often, it is time to revisit the objective of your model.
- Look for patterns in errors. Use observed errors to create new features and improve the model. Look for trends in the examples the model got wrong that fall outside your current feature set.
- Quantify undesirable behaviour. Try to put numbers on negative user behaviours to better address them. If issues are measurable, you can start using them as features, objectives, or metrics. Measure first, optimise second.
- Short-term behaviour isn’t always predictive of long-term results. Identical short-term results might lead to very different long-term outcomes.
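A minimal sketch of the delta measurement for a ranking problem, comparing production and candidate results for one query; the 1/rank weighting is one reasonable choice, not prescribed by the guide:

```python
def weighted_symmetric_difference(prod_results: list, new_results: list) -> float:
    """Size of the symmetric difference between two ranked result lists,
    weighting each differing item by 1 / rank so changes near the top
    count more. Returns 0.0 when both lists contain the same items."""
    def weights(results):
        return {item: 1.0 / (rank + 1) for rank, item in enumerate(results)}
    prod_w, new_w = weights(prod_results), weights(new_results)
    only_in_prod = set(prod_w) - set(new_w)
    only_in_new = set(new_w) - set(prod_w)
    return (sum(prod_w[i] for i in only_in_prod)
            + sum(new_w[i] for i in only_in_new))
```

Average this over a sample of queries: a small value suggests the new model will change little in practice; a large value means the change needs careful review before launch.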
Training-Serving Skew
- Training-serving skew is a difference between performance during training and performance during serving. This skew can be caused by:
- A discrepancy between how you handle data in the training and serving pipelines.
- A change in the data between when you train and when you serve.
- A feedback loop between your model and your algorithm.
- Train using serving-time features. Ensure consistency by using the same set of features for both training and serving.
- Importance-weight sampled data. Avoid dropping data arbitrarily; if you must down-sample, weight the kept examples by the inverse of the sampling rate (see the sketch after this list).
- Beware of table joins. Data can change between training and serving, especially when features are joined in from tables that update over time.
- Reuse code between training and serving pipelines. Minimise discrepancies by reusing code wherever possible.
- Test models on future data. After training on data from a specific date, test on data from the following days.
- Prioritise clean data over short-term performance in binary classification. For filtering tasks like spam detection, clean data is more valuable than temporary performance gains.
- Beware of inherent skew in ranking problems. Ranking issues often come with hidden biases.
- Avoid feedback loops with positional features. Be cautious of features based on item positions, as they can introduce feedback loops.
- Measure training/serving skew. Regularly check for discrepancies between the feature values and predictions seen at training time and at serving time (a measurement sketch follows the list).
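A minimal sketch of importance weighting instead of dropping data, assuming you down-sample one slice of examples (e.g. negatives) at a known rate; the rate shown is illustrative:

```python
import random

def sample_with_weight(example: dict, keep_prob: float = 0.3):
    """Rather than silently throwing away 70% of a slice, keep 30% of it
    (keep_prob=0.3) and up-weight each kept example by 1/keep_prob so the
    training data still reflects the serving distribution."""
    if random.random() < keep_prob:
        example["weight"] = 1.0 / keep_prob  # importance weight used in the loss
        return example
    return None  # dropped, but accounted for by the weights on kept examples
```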
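And a minimal sketch of measuring skew directly, by comparing feature values logged at serving time with the values recomputed by the training pipeline for the same examples (the log format and tolerance are assumptions):

```python
def feature_skew_rate(serving_logs: list, training_rows: list,
                      feature: str, tolerance: float = 1e-6) -> float:
    """Fraction of examples where a feature's serving-time value differs
    from its training-time value by more than `tolerance`. Assumes both
    lists are aligned and carry a shared example 'id'."""
    mismatches = 0
    for served, trained in zip(serving_logs, training_rows):
        assert served["id"] == trained["id"], "join the two logs on example id first"
        if abs(served[feature] - trained[feature]) > tolerance:
            mismatches += 1
    return mismatches / max(len(serving_logs), 1)
```

Track this rate per feature over time; a jump usually points to a pipeline discrepancy or a stale table rather than a modelling problem.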
Optimisation and Complex Models
- Don’t add new features if the problem is misaligned objectives. As improvements plateau, don’t be sucked into looking at issues that are outside the scope of the objectives. If the product goals are not covered by the existing algorithmic objective, you need to change either your objective or your product goals.
- Launch decisions are a proxy for long-term product goals. Evaluate models using a range of metrics, not just one. Launch decisions depend on multiple criteria, only some of which can be directly optimised using ML. The only easy launch decisions are when all the metrics get better.
- Keep ensembles simple. Avoid overly complex model ensembles unless absolutely necessary.
- Seek new data sources when performance plateaus. Look for new qualitative information instead of endlessly refining existing features.
- Don’t assume that diversity, personalisation, and relevance always correlate with user popularity.