Precision, Recall and F1

The concept of ‘precision and recall’ has caught my attention on a number of occasions. I first came across it when looking at information architecture and search. I stumbled across it again when working on recommender systems, and later, on a machine learning project. Explainer articles seem unnecessarily academic - so I intend to detail the concept here in my own words.

What would you do?

If you were asked to manually test the effectiveness of a search engine, what would you do? Instinctively, I think you’d type in a search term and assess whether the presented results were relevant. You might also search for a piece of information you know to be in the database, to check whether the engine returned it. If so, you’d be close to testing for Precision and Recall - two key components of relevance. To measure Precision and Recall, we have to define them more precisely. In this instance we can define them as follows:

Precision = of retrieved artefacts, what fraction were relevant?
Recall = what fraction of relevant artefacts were returned?
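
As a rough illustration of these two definitions, here is a minimal Python sketch. The function name and the document IDs are made up for the example, and it assumes we already know which artefacts are relevant for the query:

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single search query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant artefacts that were actually returned
    # Convention assumed here: report 0.0 when a denominator would be zero.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical query: 4 artefacts returned, 5 relevant artefacts in the database.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc1", "doc3", "doc5", "doc6", "doc7"]

precision, recall = precision_recall(retrieved, relevant)
print(precision, recall)  # 0.5 0.4
```

Two of the four returned artefacts are relevant (Precision of 0.5), and two of the five relevant artefacts were found (Recall of 0.4).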

The Precision-Recall Tradeoff

Precision and Recall are intertwined. At the limit, increasing the number of artefacts returned increases Recall toward 100%. But returning an entire dataset would also mean returning all of the artefacts in the dataset that aren’t relevant, making Precision lower. This is what’s known as the ‘Precision-Recall Tradeoff.’
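
To make the tradeoff concrete, here is a small sketch using a made-up ranked result list of ten artefacts (the whole toy dataset), where True marks a relevant artefact. The data and the function name are assumptions for illustration only:

```python
# Hypothetical ranked results for one query: True means the artefact is relevant.
ranked = [True, True, False, True, False, False, True, False, False, False]

def precision_recall_at_k(ranked, k):
    """Precision and Recall if the engine returns only the top k results."""
    hits = sum(ranked[:k])        # relevant artefacts among the top k
    total_relevant = sum(ranked)  # relevant artefacts in the whole dataset
    return hits / k, hits / total_relevant

for k in (2, 5, 10):
    p, r = precision_recall_at_k(ranked, k)
    print(f"top {k}: precision={p:.2f} recall={r:.2f}")
# top 2: precision=1.00 recall=0.50
# top 5: precision=0.60 recall=0.75
# top 10: precision=0.40 recall=1.00
```

Returning all ten artefacts pushes Recall to 100%, but Precision falls to 40% because six of the ten are not relevant.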

Depending on the context, Precision or Recall could be more important. Product Managers should come with an opinion about how to calibrate their system. For example, we tend to calibrate recommendation systems loosely (favouring Recall over Precision), since the cost of a few rogue suggestions is low.

Precision and Recall in Machine Learning

So far I’ve used ‘relevant’ or ‘not relevant’ as labels to aid my explanation, but relevance is in the eye of the beholder, which makes my definitions above a little clumsy. That said, noisy labels are a common problem in machine learning, which is where I want to apply this concept next.

Let’s apply Precision and Recall to a machine learning system designed to detect cancer in patients’ lungs. The system takes an X-ray image as an input, and the output is a verdict: CANCER / NOT CANCER.

We have a prediction and a label - and the label is the ground truth for the patient. We define the positive case as the one we’re interested in, so for us that’s ‘CANCER’. Depending on how the prediction compares with the label, there are 4 possible scenarios.

Scenarios	Prediction	Label
True Positive (TP)	CANCER	CANCER
True Negative (TN)	NOT CANCER	NOT CANCER
False Positive (FP)	CANCER	NOT CANCER
False Negative (FN)	NOT CANCER	CANCER
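
The four scenarios can be counted with a few lines of Python. This is only a sketch: the function name and the five example patients are invented for illustration.

```python
from collections import Counter

def confusion_counts(predictions, labels, positive="CANCER"):
    """Count TP, TN, FP and FN by comparing each prediction with its label."""
    counts = Counter()
    for predicted, actual in zip(predictions, labels):
        if predicted == positive:
            counts["TP" if actual == positive else "FP"] += 1
        else:
            counts["FN" if actual == positive else "TN"] += 1
    return counts

# Hypothetical verdicts and ground-truth labels for five patients.
predictions = ["CANCER", "NOT CANCER", "CANCER", "NOT CANCER", "NOT CANCER"]
labels = ["CANCER", "NOT CANCER", "NOT CANCER", "CANCER", "NOT CANCER"]

counts = confusion_counts(predictions, labels)
print(counts["TP"], counts["TN"], counts["FP"], counts["FN"])  # 1 2 1 1
```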

Defining Accuracy

The Accuracy of our model is defined as the correct predictions as a fraction of total predictions:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy = Correct Predictions / Total Predictions

Crucially, Accuracy can be a misleading measurement if the dataset is imbalanced. For example, if only 1 in 1000 labels is CANCER, then our model could simply predict NOT CANCER every time and achieve an Accuracy of 99.9% despite missing every case of CANCER. Many real-world datasets are skewed in this way (e.g. detecting fraudulent transactions).
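
Here is the arithmetic behind that 99.9% figure as a short sketch. It assumes 1000 patients, exactly one of whom has cancer, and a model that always predicts NOT CANCER:

```python
def accuracy(tp, tn, fp, fn):
    """Correct predictions as a fraction of total predictions."""
    return (tp + tn) / (tp + tn + fp + fn)

# The model never predicts CANCER, so there are no positive predictions at all.
tp, fp = 0, 0
# 999 healthy patients are correctly cleared; the 1 patient with cancer is missed.
tn, fn = 999, 1

acc = accuracy(tp, tn, fp, fn)
print(f"{acc:.1%}")  # 99.9%
```

The score looks excellent even though the model found none of the cases we care about.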

Defining Precision

Precision is a more informative measure. Precision is defined as the correct positive predictions as a fraction of all positive predictions. It answers ‘How precise are our positive predictions?’ or, in other words, it measures the quality of predictions:

Precision = TP / (TP + FP)
Precision = Predicted Positive Correctly / All Predicted Positive

But Precision as a measure has a weakness too. It takes no account of false negatives: the positive cases the model misses. A model that predicts the positive case incredibly conservatively can therefore have 100% Precision whilst clocking up a huge number of false negatives. This would mean giving patients with cancer the all clear in error.
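
A quick sketch of that weakness, using invented numbers: suppose 100 patients have cancer, and a very conservative model flags only 2 of them, both correctly.

```python
def precision(tp, fp):
    """Correct positive predictions as a fraction of all positive predictions."""
    # Convention assumed here: 0.0 when the model makes no positive predictions.
    return tp / (tp + fp) if (tp + fp) else 0.0

# Hypothetical conservative model: 2 correct CANCER verdicts, no false alarms,
# and 98 patients with cancer wrongly given the all clear.
tp, fp, fn = 2, 0, 98

p = precision(tp, fp)
print(p)   # 1.0
print(fn)  # 98
```

Precision is a perfect 100%, yet 98 cases were missed, and nothing in the Precision formula reflects that.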

Defining Recall

Recall is defined as the correct positive predictions as a fraction of all positive labels. It answers the question: ‘How good is the model at finding all of the positive labels?’ It measures the completeness of predictions.

Recall = TP / (TP + FN)
Recall = Predicted Positive Correctly / All Positive Labels

Recall has a weakness too: it can be gamed by predicting CANCER for everyone. You’d get a Recall score of 100%, and every cancer-free patient would be given devastating news.
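
The same trick in numbers, as a sketch with made-up figures: 1000 patients, 10 of whom have cancer, and a model that predicts CANCER for every one of them.

```python
def recall_score(tp, fn):
    """Correct positive predictions as a fraction of all positive labels."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def precision_score(tp, fp):
    """Correct positive predictions as a fraction of all positive predictions."""
    return tp / (tp + fp) if (tp + fp) else 0.0

# Everyone is predicted CANCER: all 10 real cases are caught,
# but the 990 cancer-free patients all become false positives.
tp, fn = 10, 0
fp, tn = 990, 0

r = recall_score(tp, fn)
p = precision_score(tp, fp)
print(r, p)  # 1.0 0.01
```

Recall is 100%, but Precision collapses to 1%, which is why the two measures need to be read together.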

Introducing and defining F1

Precision and Recall can’t be gamed at the same time. Artificially increasing your Recall will generally reduce your Precision, and vice versa. So to understand how well a machine learning system is performing you have to report both.

The F1 score is a clever way of combining both measures into a single metric. It is the harmonic mean of Precision and Recall: a type of average that is pulled towards the lower of the two values. Unlike a simple average, it stays low unless both Precision and Recall are high, so it punishes a large gap between the two:

F1 = 2 x (Precision x Recall) / (Precision + Recall)

If your model is perfect, you’ll have an F1 score of 1. If it’s perfectly imperfect, you’ll have a score of 0. The F1 score weighs Precision and Recall equally. Although, as mentioned earlier, for some models you want to prioritise one over the other. For our lung cancer system above, that would arguably be Recall, since missing a cancer is far more costly than a false alarm. You can use something called an FBeta score to do that. But that’s for another day!
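
As a sketch of the F1 formula above, the snippet below compares a balanced model with a lopsided one. The Precision and Recall values are invented, and returning 0 when both are 0 is a convention I’m assuming to avoid dividing by zero:

```python
def f1_score(precision, recall):
    """Harmonic mean of Precision and Recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention: avoids division by zero
    return 2 * precision * recall / (precision + recall)

balanced = f1_score(0.8, 0.8)   # both measures are reasonably high
lopsided = f1_score(1.0, 0.01)  # perfect Precision, almost no Recall

print(round(balanced, 4))  # 0.8
print(round(lopsided, 4))  # 0.0198
print(f1_score(1.0, 1.0))  # 1.0
print(f1_score(0.0, 0.0))  # 0.0
```

A simple average of 1.0 and 0.01 would be about 0.5, which flatters the lopsided model. The harmonic mean gives it roughly 0.02, much closer to the weaker of the two values.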

Conclusion

It will become increasingly important for product managers to understand the trade-offs and performance of machine learning systems. These are powerful systems, but they must be used with a strong understanding of the context in which they’re deployed.