Metrics for measuring your Classification Model

In this post, we will look at Precision and Recall performance measures you can use to evaluate your model for a binary classification problem.

Sample Model

To understand the concepts, let's take an example where we predict whether video content is harmful so that it can be broadcast on TV. If the content is harmful, it should not be broadcast, since it will be watched by kids at home. Also note that there is one more level of manual scrutiny if the model flags something as harmful, whereas content flagged as non-harmful is broadcast directly.

In this example, false negatives are clearly worse. False positives are acceptable because there is a second level of scrutiny.

For our purposes, we'll label harmful content as 1 and everything else as 0. Out of the total content available for training, let's say 80 videos are not harmful and 20 are harmful.

Let's say we have built this model and want to measure its performance.

Confusion Matrix

The confusion matrix is the simplest and most intuitive way to visualize the predictions of a model.

It is a table with two dimensions: Actual and Predicted.

The matrix itself is not a performance metric, but all the actual performance metrics are derived from it.

True Positives

True positives are the cases where the actual value is positive (1) and the predicted value is also positive (1).

Ex: Cases where the video content is actually harmful and the model also predicts it as harmful are True Positives.

True Negatives

True negatives are the cases where the actual value is negative (0) and the predicted value is also negative (0).

Ex: Cases where the video content is actually not harmful and the model also predicts it as not harmful are True Negatives.

False Positives

False positives are the cases where the actual value is negative (0) but the predicted value is positive (1). This is an interesting thing to note: the prediction is called False because what the model says is not correct.

Ex: Cases where the video content is actually not harmful but the model predicts it as harmful are False Positives.

False Negatives

By now, you can probably guess the definition of False Negatives: these are the cases where the actual value is positive but the predicted value is negative. Again, the model's prediction is wrong, hence they are called False Negatives.
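The four cells of the confusion matrix can be counted directly from paired label lists. Here is a minimal sketch in Python; the example labels are made up for illustration (1 = harmful, 0 = not harmful):

```python
# Hypothetical ground-truth and predicted labels for eight videos.
actual    = [1, 0, 1, 1, 0, 0, 0, 1]
predicted = [1, 0, 0, 1, 1, 0, 0, 1]

# Count each confusion-matrix cell by comparing actual vs. predicted.
tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)

print(tp, tn, fp, fn)  # → 3 3 1 1
```

Every example is exactly one of TP, TN, FP, or FN, so the four counts always sum to the number of predictions.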


Accuracy

Accuracy is the simplest and most intuitive metric anyone can use. It is measured as the total number of correct predictions divided by the total number of predictions made.

If the model predicts that there is no harmful content at all, then the accuracy of this model will be 80/100 = 80%. But this is a terrible model: it has high accuracy, yet it cannot be used in real life, because it lets all 20 harmful videos be broadcast on TV. This model has a high number of false negatives.

If the model predicts that all content is harmful, then accuracy will be 20/100 = 20%. This model has poor accuracy and would flag the remaining 80 non-harmful videos as harmful. It has a high number of false positives.

Let’s get into the formula for accuracy from Confusion Matrix:

Accuracy = (TP + TN)/(TP+FP+FN+TN)

When to use Accuracy:

Accuracy is a good measure when the target classes in the data are nearly balanced. If the data is imbalanced, accuracy becomes misleading. As explained above, our example data is imbalanced, hence we can't rely on the accuracy metric.
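The accuracy trap above is easy to reproduce. This sketch uses the post's counts for the degenerate "predict everything as not harmful" model (TN = 80, FN = 20, no positive predictions at all):

```python
# Degenerate model: predicts "not harmful" for all 100 videos.
tp, tn, fp, fn = 0, 80, 0, 20

# Accuracy = (TP + TN) / (TP + FP + FN + TN)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # → 0.8  (80%, yet every harmful video slips through)
```

An 80% accuracy here is purely an artifact of class imbalance; the model never catches a single harmful video.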


Precision

Precision = TP / (TP+FP)

It is the number of correct positive predictions (TP) divided by the total number of positive predictions made. Precision can be thought of as a measure of a classifier's exactness. A low precision indicates a large number of False Positives.

A trivial way to get perfect Precision is to make one single positive prediction and ensure it is correct (1/1 = 100%).

In our example, let's say only five out of 100 videos are harmful. If our model is bad and predicts every video as harmful, then TP = 5 and FP = 95, so precision will be 5 / (5+95) = 5%.
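This worst case is straightforward to check numerically, using the counts from the example above (5 harmful videos out of 100, every video flagged as harmful):

```python
# "Flag everything" model on a set with 5 harmful videos out of 100:
# all 5 harmful ones are caught (TP), the other 95 are false alarms (FP).
tp, fp = 5, 95

precision = tp / (tp + fp)
print(precision)  # → 0.05  (5%)
```

Predicting every video as positive maximizes recall but drives precision down to the base rate of the positive class.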


Recall

Recall = TP / (TP+FN)

It is the number of correct positive predictions (TP) divided by the number of positive class values in the test data. This is the subtle difference between precision and recall: for precision we divide by the total positives predicted by the model, whereas for recall we divide by the total positives in the test set.

Precision – Recall

Recall tells you what percentage of the positive values in the test set the model is able to find. Precision tells you, out of the values the model predicted as positive, how many are actually positive.

Ex: In our example there are 80 negative values and 20 positive values. Let's say the model flags 15 videos as harmful (positive). The process doesn't stop there; we still need to verify whether those 15 predictions are actually positive.

Suppose that out of the 15, 13 are True Positives and 2 are False Positives. Then precision is 13 / (13+2), and recall is 13 / 20, since the model found 13 of the 20 actually harmful videos.

It is always advisable to use both precision and recall together to correctly gauge the performance of the model.
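The worked example above translates directly into code. This sketch assumes, as stated, 20 actually harmful videos, 15 of which the model flags, with 13 of those flags correct:

```python
# 20 actually-harmful videos; the model flags 15, of which 13 are correct.
tp, fp = 13, 2        # flagged harmful: 13 right, 2 wrong
fn = 20 - tp          # harmful videos the model missed entirely

precision = tp / (tp + fp)  # 13/15
recall = tp / (tp + fn)     # 13/20
print(round(precision, 2), round(recall, 2))  # → 0.87 0.65
```

Reading both numbers together: when this model says "harmful" it is right about 87% of the time, but it only catches 65% of the harmful videos.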

Confused about Precision and Recall?

Consider another example as shown in the screenshot below:

In the test set there are 10 values, of which six are positive and four are negative. After the model is run, we get predictions for the same set, shown as Model Predictions.

If the model predicts a value as positive and the test data also has a positive value for the corresponding entry, then we call that model output a True Positive (green cells). We can similarly label all the model outputs as shown in the table above.

Note that when the model predicts an output as Positive, it doesn't know whether it is actually positive, so we need to verify it separately. Similarly, when a model predicts an output as Negative, we should verify that too. This is where the confusion matrix helps. In the figure above it is shown as an entry-by-entry match against the test data set; it can also be viewed as a matrix, as the name suggests:

What does ‘Recall’ do ?

Recall means: out of the total positive values in your test data set, how many were actually predicted as positive by the model. We're not worried about negative values here.

There are 6 positive values in the test data set and the model predicts 6 values as positive, so recall should be 100%. Wait! That's not correct. Remember the formula for Recall:

TP / (TP+FN)

What it says is: take all the positive values in the test data set (green + yellow cells) and compare that with the True Positives (green cells). We want to see how many values the model correctly classified as positive out of the total positive values. In our example it is 4/6 = 66%. This means the model is able to identify 66% of the positive values in the test data set. Do you think this is enough?

If the model says an output is positive, can you trust that it actually is? Check the table above again.

What is the role of Precision?

If the model predicts a value as positive, precision tells you what percentage of such predictions are actually true.

In our example, the total number of positive values predicted by the model is 4 + 2 (green + blue). This includes both True Positives and False Positives. So precision measures: out of the total values predicted as positive, how many are actually positive. In our example it is 4/6 = 66%. Note that here 6 is the sum of the green and blue cells.

Precision and Recall Summary

If our model predicts that a value is positive, it is correct only 66% of the time (Precision). On the other hand, it is able to identify only 66% of the positive values (Recall).
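Since the screenshot's table is not reproduced here, this sketch reconstructs it from the counts the text gives: 6 actual positives and 4 actual negatives, with the model getting 4 positives right (green), missing 2 (yellow), and wrongly flagging 2 negatives (blue):

```python
# Reconstructed counts from the 10-value example described in the text.
tp, fn, fp, tn = 4, 2, 2, 2

precision = tp / (tp + fp)  # green / (green + blue)
recall = tp / (tp + fn)     # green / (green + yellow)
print(round(precision, 2), round(recall, 2))  # → 0.67 0.67
```

Both metrics land at the same 4/6 here by coincidence of the example; in general they move independently.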


Specificity

Specificity = TN / (TN+FP)

Out of all the negative values in the test set, how many were identified as negative by the model? This is Specificity: it mirrors Recall, but for the negative class.
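Applied to the same reconstructed 10-value example (4 actual negatives, of which the model labels 2 correctly), specificity works out as:

```python
# Negative side of the reconstructed example: 4 actual negatives,
# 2 correctly identified (TN) and 2 wrongly flagged as positive (FP).
tn, fp = 2, 2

specificity = tn / (tn + fp)
print(specificity)  # → 0.5  (model finds half of the negative values)
```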

F1 Score

Instead of calculating Precision and Recall every time, what if we had one metric that summarizes both? The F1 score does exactly that.

Would taking the average of Precision and Recall work? Imagine a case where 97 out of 100 values are negative and 3 are positive. If a model predicts all cases as positive, then Precision will be 3/(97+3) = 3% and Recall will be 3/(3+0) = 100%. So the model identifies 100% of the positive values, but it is right only 3% of the time.

For the F1 Score, if we took the arithmetic mean of the two, the score would be about 51%, which is misleading. So we take the harmonic mean of the two instead.

The harmonic mean equals the arithmetic mean when both inputs are equal. But when the inputs differ, it is pulled toward the smaller number rather than the larger one.

F1 Score = (2 * Precision * Recall) / (Precision + Recall)

F1 Score = (2 * 3 * 100) / (3 + 100) ≈ 5.8%
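The contrast between the two means is easy to verify for the 97/3 example above:

```python
# Precision and recall for the "predict everything positive" model
# on the 97-negative / 3-positive example.
precision, recall = 0.03, 1.0

arithmetic_mean = (precision + recall) / 2          # misleadingly high
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
print(round(arithmetic_mean, 3), round(f1, 3))  # → 0.515 0.058
```

The harmonic mean stays near the weaker of the two metrics, which is exactly the behavior we want from a summary score.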


Fbeta Score

Fbeta is the general form underlying the F1 Score; F1 is presented separately above only because it is the most widely used special case.

In some cases, we might be interested in an F-measure with more attention given to precision, such as when false positives are more important to minimize. In other cases, we might be interested in an F-measure with more attention put on recall, such as when false negatives are more important to minimize.

The solution is Fbeta.

Fbeta = ((1 + beta^2) * Precision * Recall) / (beta^2 * Precision + Recall)

The chosen value of the beta parameter appears in the name of the Fbeta-measure.

For example, a beta value of 2 is referred to as F2-measure or F2-score. A beta value of 1 is referred to as the F1-measure or the F1-score.
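The formula generalizes cleanly into a small helper. This sketch evaluates it on the same 97/3 example (precision 3%, recall 100%); the function name is chosen here for illustration:

```python
def fbeta(precision, recall, beta):
    """General F-measure: beta > 1 weights recall more, beta < 1 weights precision more."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)

p, r = 0.03, 1.0
print(round(fbeta(p, r, 1.0), 3))  # → 0.058  (F1, same as the harmonic mean)
print(round(fbeta(p, r, 2.0), 3))  # → 0.134  (F2 leans toward the high recall)
```

Note how F2 sits above F1 for this model: emphasizing recall rewards the "flag everything" strategy more, which is appropriate only when false negatives truly are the costlier mistake.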


We've seen different metrics for measuring the performance of a classification model. We covered Precision and Recall in depth, along with the Fbeta score, which combines the two.

There are a couple more ways to measure classification models, which I will cover in future posts.