Product #101

Trustworthy Online Controlled Experiments · 2020

Ron Kohavi, Diane Tang and Ya Xu

This book was well received in product development circles for making complex A/B testing concepts accessible through real-world examples. Going beyond the basics, it covers potential pitfalls, best practices, and key guidelines. The book emphasises practical aspects such as experiment design, business alignment, and metric selection rather than pure statistics.

Key Insights

Online controlled experiments, also known as A/B tests, are the most reliable way to determine causation and measure the impact of changes. They work by randomly splitting users into different variants, including a control, and comparing metrics across the groups.

There are four essential requirements for useful controlled experiments:

  1. An appropriate unit of randomisation (usually individual users)
  2. A sufficiently large user base to achieve statistical power
  3. Well-defined metrics that genuinely reflect business objectives
  4. The ability to implement and iterate on changes rapidly
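
Requirement 2 touches on statistical power. As a rough illustration, here is a minimal sketch of a sample-size calculation for a conversion metric using statsmodels; the baseline rate, lift, power, and significance level are hypothetical.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Hypothetical numbers: 5% baseline conversion, and we want to detect a
# 5% relative lift (5.0% -> 5.25%) with 80% power at a 5% significance level.
baseline, treatment = 0.05, 0.0525
effect_size = proportion_effectsize(treatment, baseline)  # Cohen's h

n_per_variant = NormalIndPower().solve_power(
    effect_size=effect_size, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"Users needed per variant: {n_per_variant:,.0f}")
```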

Organisations typically progress through four maturity phases in adopting experimentation:

  1. Crawl - Only a few experiments run, focus on basic instrumentation
  2. Walk - Standardise metrics, increase experiment trust with checks like Sample Ratio Mismatch (SRM)
  3. Run - Conduct large-scale testing, adopt an Overall Evaluation Criterion (OEC), run most changes as experiments
  4. Fly - Make experimentation routine, with automation and institutional memory driving continuous learning
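
The Sample Ratio Mismatch check mentioned in the Walk phase can be as simple as a chi-squared goodness-of-fit test on assignment counts. A minimal sketch, with hypothetical counts for a configured 50/50 split:

```python
from scipy.stats import chisquare

# Hypothetical assignment counts for a 50/50 split.
observed = [50_212, 49_388]          # users actually logged per variant
expected = [sum(observed) / 2] * 2   # counts a true 50/50 split would give

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
# A very small p-value (commonly < 0.001) signals an SRM: the observed split
# is unlikely under the configured ratio, so results should not be trusted
# until the randomisation or logging bug is found.
print(f"SRM check p-value: {p_value:.5f}")
```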

Leadership buy-in is key at every phase to align teams on shared metrics and guardrails, set goals based on measurable improvements, and empower fast failure as part of the learning process. A robust experimentation platform and culture reduce errors and maximise learning.

To design a good experiment:

  1. Define a clear hypothesis
  2. Determine the randomisation unit, target population, desired effect size, and timeline
  3. Ensure proper instrumentation to log user actions and variant assignments
  4. Check invariant metrics to validate the experiment ran correctly
  5. Look at the primary metric and make decisions based on statistical and practical significance thresholds
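
For step 5, the decision rests on both statistical and practical significance. The sketch below uses simulated per-user revenue and a hypothetical practical-significance threshold; it illustrates the decision logic rather than the book's exact procedure.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Simulated per-user revenue for control and treatment (hypothetical data).
control = rng.gamma(shape=2.0, scale=5.0, size=20_000)
treatment = rng.gamma(shape=2.0, scale=5.15, size=20_000)

diff = treatment.mean() - control.mean()
t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)

# 95% confidence interval for the difference (Welch approximation).
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
ci = (diff - 1.96 * se, diff + 1.96 * se)

PRACTICAL_THRESHOLD = 0.10  # hypothetical: ship only if the lift is at least $0.10/user
print(f"diff={diff:.3f}, p={p_value:.4f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
print("ship?", p_value < 0.05 and ci[0] >= PRACTICAL_THRESHOLD)
```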

Common pitfalls in interpreting the statistics behind experiments include:

  • Assuming a non-significant p-value proves no effect rather than insufficient power
  • Misstating what a p-value implies about the truth of the hypothesis
  • Peeking repeatedly at p-values and inflating false positives
  • Ignoring multiple hypothesis comparisons
  • Misusing confidence intervals, especially concluding overlap means no difference
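
The peeking pitfall is easy to demonstrate by simulation: repeatedly testing an A/A comparison and stopping at the first "significant" result inflates the false positive rate well above the nominal 5%. A rough sketch with arbitrary simulation parameters:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
N_SIMULATIONS, N_USERS, N_PEEKS = 1_000, 10_000, 10

false_positives = 0
for _ in range(N_SIMULATIONS):
    # A/A data: both variants drawn from the same distribution (no true effect).
    a = rng.normal(size=N_USERS)
    b = rng.normal(size=N_USERS)
    # Peek at evenly spaced interim points and stop at the first p < 0.05.
    for k in np.linspace(N_USERS // N_PEEKS, N_USERS, N_PEEKS, dtype=int):
        if stats.ttest_ind(a[:k], b[:k]).pvalue < 0.05:
            false_positives += 1
            break

# A single fixed-horizon test would flag ~5% of these A/A runs;
# peeking ten times pushes the rate well above that.
print(f"False positive rate with peeking: {false_positives / N_SIMULATIONS:.1%}")
```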

Internal validity threats include violations of the Stable Unit Treatment Value Assumption (SUTVA), survivorship bias, sample ratio mismatch, and carryover effects. External validity threats include limited generalisation beyond a population, seasonality, novelty effects, and primacy effects where users need time to adapt.

Simpson's paradox can occur when combining data reverses the direction of an effect, often stemming from uneven splits or weighting across segments. Segment differences may reveal crucial insights if a result is driven by certain subgroups, but anomalies should first be verified as real rather than artefacts of logging flaws or side effects.
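
A toy example of Simpson's paradox with made-up numbers, where the treatment wins within every segment yet loses in the combined view because traffic is split unevenly across segments:

```python
import pandas as pd

# Hypothetical conversion data: treatment beats control in both segments,
# but an uneven traffic split reverses the conclusion overall.
df = pd.DataFrame({
    "segment":     ["new", "new", "returning", "returning"],
    "variant":     ["control", "treatment", "control", "treatment"],
    "users":       [1_000, 9_000, 9_000, 1_000],
    "conversions": [  100, 1_080, 4_500,   520],
})
df["rate"] = df["conversions"] / df["users"]
print(df)  # per-segment rates: treatment wins in both

overall = df.groupby("variant")[["conversions", "users"]].sum()
print(overall["conversions"] / overall["users"])  # ...but loses overall
```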

Metrics for experiments can be categorised as:

  • Goal metrics - Capture ultimate success but may be slow-moving
  • Driver metrics - More sensitive, serve as leading indicators
  • Guardrail metrics - Protect against unintended harm while pursuing goal or driver improvements

Principles for good experiment metrics include keeping them simple, aligning them with goals, making them actionable and sensitive, and combining them into a weighted OEC. Gameability should be avoided - reward outcomes you truly care about, not superficial measures. Incorporate notions of quality and negative indicators as guardrails.
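
At its simplest, a weighted OEC is just a weighted sum of normalised metric movements. A deliberately simplified sketch; the metric names and weights below are invented for illustration:

```python
# Hypothetical weights: two positive drivers and one negative (guardrail-like) signal.
WEIGHTS = {"sessions_per_user": 0.4, "revenue_per_user": 0.4, "dissatisfaction_clicks": -0.2}

def oec(deltas: dict[str, float]) -> float:
    """Combine relative (%) metric changes, treatment vs control, into one score."""
    return sum(WEIGHTS[metric] * delta for metric, delta in deltas.items())

print(oec({"sessions_per_user": 1.2, "revenue_per_user": 0.5, "dissatisfaction_clicks": 2.0}))
```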

To validate metrics, triangulate with user research, conduct observational analyses on correlations, run experiments explicitly testing causal links between drivers and goals, and draw on other companies' data. As the business evolves, metrics must be updated while preserving continuity.

An experimentation platform typically consists of:

  1. Experiment management interface to define and track metadata
  2. Deployment mechanism for variant assignment
  3. Instrumentation to collect user actions and variant assignment logs
  4. Automated analysis to compute metrics, check significance, and generate insights
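
The variant-assignment component is often built as a deterministic hash of user and experiment identifiers, so a user always sees the same variant without any per-user state being stored. A minimal sketch; the function name and bucket count are illustrative:

```python
import hashlib

def assign_variant(user_id: str, experiment_id: str,
                   variants=("control", "treatment")) -> str:
    """Deterministic, roughly uniform assignment: the same user always gets the
    same variant for a given experiment, and different experiments are independent."""
    key = f"{experiment_id}:{user_id}".encode()
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 1000
    return variants[0] if bucket < 500 else variants[1]

print(assign_variant("user-42", "new-checkout-flow"))
```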

Complementary techniques support discovery, refinement, and validation of ideas:

  • Logs analysis reveals patterns and opportunities but not the "why" behind user behaviour
  • User experience research explores struggles, interpretations and emotions in-depth
  • Surveys reach broader audiences to capture offline activities or sentiment
  • External data validates internal findings but may lack clear methodology

When controlled experiments aren't feasible, observational causal studies can estimate effects:

  • Difference-in-differences compares a treated group's pre/post trends to an untreated one
  • Interrupted time series models the counterfactual using the same population over time
  • Regression discontinuity exploits thresholds that approximate random assignment
  • Instrumental variables use "as good as random" factors that affect treatment but not outcome

However, these methods rely on strong assumptions and can still suffer from unmeasured confounders or spurious correlations. Randomised experiments remain the gold standard for causation.
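
As an illustration of the first technique listed above, a difference-in-differences estimate can be read off the interaction term of a simple regression. The sketch uses simulated data with a known effect of 2.0; the column names are hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 4_000
# Hypothetical panel: treated/untreated groups observed before and after a change.
df = pd.DataFrame({
    "treated": rng.integers(0, 2, n),
    "post":    rng.integers(0, 2, n),
})
# Simulated outcome: group and time effects, plus a true treatment effect of 2.0.
df["y"] = (5 + 1.5 * df["treated"] + 0.8 * df["post"]
           + 2.0 * df["treated"] * df["post"] + rng.normal(0, 1, n))

# The coefficient on treated:post is the difference-in-differences estimate.
model = smf.ols("y ~ treated * post", data=df).fit()
print(model.params["treated:post"])
```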

To analyse results, key statistical concepts include:

  • P-values - The probability of seeing a difference at least as extreme as the one observed, assuming no true effect exists
  • Confidence intervals - A range of plausible values for the true effect; a 95% interval excludes 0 exactly when the result is significant at the 5% level
  • Type I/II errors - False positives and false negatives based on significance thresholds
  • Power - The probability of detecting a difference of a certain size
  • Multiple testing - Examining many metrics inflates the odds of lucky significance
  • Meta-analysis - Pooling evidence across repeated or parallel experiments
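
For the multiple-testing point, one standard mitigation is to adjust p-values across all examined metrics, for example with a Benjamini-Hochberg correction. A minimal sketch with made-up p-values:

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from examining many metrics in a single experiment.
p_values = [0.003, 0.020, 0.041, 0.049, 0.120, 0.350, 0.600, 0.810]

reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="fdr_bh")
for p, p_adj, significant in zip(p_values, p_adjusted, reject):
    print(f"raw p={p:.3f}  adjusted p={p_adj:.3f}  significant: {significant}")
```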

Accurate variance estimation is vital for reliable p-values and intervals. Pitfalls include treating correlated ratio metric components as independent, or letting single outliers inflate variance. Improving sensitivity involves reducing variance through granular randomisation, advanced designs like pairwise testing, or pre-experiment stratification.
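
For ratio metrics whose numerator and denominator are correlated (for example clicks per pageview when randomising by user), the delta method gives a variance estimate that respects that correlation. A sketch under that assumption, with simulated per-user data:

```python
import numpy as np

def ratio_metric_variance(x: np.ndarray, y: np.ndarray) -> float:
    """Delta-method variance of the ratio of means mean(x)/mean(y), accounting
    for the covariance between numerator and denominator instead of treating
    them as independent."""
    n = len(x)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(ddof=1), y.var(ddof=1)
    cov = np.cov(x, y)[0, 1]
    return (vx / my**2 - 2 * mx * cov / my**3 + mx**2 * vy / my**4) / n

# Hypothetical per-user pageviews (y) and clicks (x).
rng = np.random.default_rng(7)
y = rng.poisson(10, 5_000) + 1
x = rng.binomial(y, 0.3)
print(ratio_metric_variance(x, y))
```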

A/A tests, which serve identical experiences to both groups, validate the experimentation platform. P-values should be uniformly distributed, with roughly 5% falling below 0.05. Deviations suggest improper variance calculation, mismatched randomisation, or hidden bias.
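
A quick way to run this validation offline is to simulate (or replay) many A/A comparisons and check that the resulting p-values look uniform. A minimal sketch with simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
# Simulate 1,000 A/A tests: both "variants" come from identical distributions.
p_values = [
    stats.ttest_ind(rng.normal(size=2_000), rng.normal(size=2_000)).pvalue
    for _ in range(1_000)
]

print(f"Fraction below 0.05: {np.mean(np.array(p_values) < 0.05):.1%}")  # expect ~5%
# A tiny KS p-value here would suggest variance estimation or randomisation issues.
print(stats.kstest(p_values, "uniform"))
```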

Guardrail metrics protect against unintended harm to the user experience or business fundamentals. They capture areas you refuse to degrade, like latency, crashes, or spam, even if a feature appears to improve surface metrics. When a guardrail is breached, the experiment is halted or its results are treated as untrustworthy. Sample Ratio Mismatch (SRM) specifically indicates potential randomisation or logging bugs.

Leakage and interference arise when variant assignments affect each other, violating independence assumptions. Network effects, shared resources, or time-sliced infrastructure can all create spillovers. Mitigation involves isolating clusters to the same variant, splitting key resources, or time-based designs.

Accurately measuring long-term effects requires extended duration, cohort tracking, post-period analysis, holdback groups, or reverse experiments. Each approach has tradeoffs in dilution, survivorship bias, power or coverage.

At scale, an efficient data pipeline moves from raw telemetry to reliable insights. Near-real-time processing catches egregious errors, while batching allows complex metrics. Surfacing results with an OEC, guardrail monitoring, and segment drilldowns empowers decisions.

Fostering a culture of responsible experimentation that continuously improves user experience and business value hinges on robust metrics, thoughtful design, transparent results, and a commitment to learning from failures and successes. By combining careful statistics with customer focus, organisations turn experimentation into a powerful engine for innovation.

Full Book Summary · Amazon

Quick Links

A north star for product transformation by John Cutler · Article

Make every detail perfect, limit the number of details · Short Video

Navigating the complex world of Product through mapping · Article

How to ship like a startup · Article

The case against conversational interfaces · Article

8 things about product frameworks I wish I knew earlier · Article

What to do · Article

Disruptive Technologies: Catching the Wave

Joseph L. Bower, Clayton M. Christensen · 1995 (View Paper)

One of the most consistent patterns in business is the failure of leading companies to stay at the top of their industries when technologies or markets change…. Although most managers like to think they are in control, customers wield extraordinary power in directing a company’s investments.

This paper summarises the key ideas that co-author Clayton M. Christensen later expanded in his book ‘The Innovator’s Dilemma’.

  • Leading companies often fail to maintain their dominance when new technologies or markets emerge, despite investing heavily in technologies for their current customers.
  • Established companies tend to focus on existing customers' needs, which can cause them to overlook or underinvest in emerging technologies that do not initially meet these needs.
  • Disruptive technologies initially offer lower performance in key areas valued by mainstream customers but often improve rapidly, eventually meeting or surpassing market needs.
  • To successfully capitalise on disruptive technologies, companies must protect them from internal processes focused on existing markets. This often requires creating separate, independent organisations.
  • Different types of innovations impact product performance differently. Sustaining technologies improve performance in established markets, while disruptive technologies introduce new performance attributes that eventually appeal to mainstream markets.
  • Managers should identify and assess disruptive technologies by considering their potential future performance and the emergence of new markets, rather than focusing solely on immediate profitability.
  • Established companies often struggle to allocate resources to disruptive technologies because they appear financially unattractive compared to sustaining technologies.
  • Companies should manage disruptive technologies in independent units to avoid conflicts with the mainstream business, allowing the new technology to develop fully.
  • For a corporation to thrive, it must be willing to allow older business units to decline as new ones rise, even if that means disrupting its own market.

Book Highlights

So people became cynical about all the meetings, just turned up to be seen and deferred to the most senior person present. Many people stopped taking any decisions at all and delegated upwards. One senior executive responsible for a budget of $2bn told me that the last straw was when he had been asked by a refurbishment committee to decide on the color to paint the walls of a meeting room on the floor below. The people on the refurbishment committee either did not know what decision-making authority they had or were not prepared to use it. Few people knew what their freedoms were or where their boundaries lay. Because boundaries were so unclear, the only safe course of action was not to explore them, but to keep your head down and play safe. Stepping over them could result in punishment. Stephen Bungay · The Art of Action
To handle the complexity, we’ve split up the tasks among various specialties. But even divvied up, the work can become overwhelming. Atul Gawande · The Checklist Manifesto
The biggest pitfall I see people falling into once they begin capturing digital notes is saving too much. Tiago Forte · Building a Second Brain
In computerised systems, there are only two states—nonexistence and full compliance—and no intermediate states are recognised or accepted. In any manual system, there is an important but paradoxical state—unspoken, undocumented, but widely relied upon—of suspense... Alan Cooper · The Inmates Are Running the Asylum

Quotes & Tweets

Many people, myself included, didn't try to build a product around a language model because during the time you would work on a business-specific dataset, a larger generalist model will be released that will be as good for your business tasks as your smaller specialised model.

The disappointing releases of both GPT-4.5 and Llama 4 have shown that if you don't train a model to reason with reinforcement learning, increasing its size no longer provides benefits.

Reinforcement learning is limited only to domains where a reward can be assigned to the generation result. Until recently, these domains were math, logic, and code. Recently, these domains have also included factual question answering, where, to find an answer, the model must learn to execute several searches. This is how these "deep search" models have likely been trained.

If your business idea isn't in these domains, now is the time to start building your business-specific dataset. The potential increase in generalist models' skills will no longer be a threat. Andriy Burkov

Everyone dreams big until they see the price tag. Shane Parrish