Don’t Be Confused by a Confusion Matrix

Maria Vasilenko
6 min read · Jul 7, 2019


Ever used a confusion matrix to visualize the performance of a machine learning classifier, but got confused by the terminology every time? In this post, I’ll share a simple approach that I came up with to (finally) get my head around the confusion matrix.

I recently worked on a data science project aimed at predicting the probability of a woman getting pregnant. This project was part of my internship (I finished the MS in Data Science program at the University of San Francisco), and it very much resonated with me as a mom and a data scientist. In my view, it had a very noble purpose: reaching out early to women who are likely to be pregnant and offering them various educational programs. I think this project is very timely, given that the United States, unfortunately, has one of the highest pregnancy-related mortality rates among developed countries, and this number is only rising.

The problem we were solving was essentially a binary classification task. To evaluate the performance of the model, we could have used the accuracy score, but we were dealing with an unbalanced dataset, in which case accuracy can be very misleading!

Let’s look at an example. Imagine we are a healthcare provider, and among our 100 female patients, 10 are in the early stages of pregnancy. Our model labels 8 women as pregnant, of whom only 6 actually are. What would be the accuracy of this model?
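To make the arithmetic concrete, here is a minimal Python sketch; the counts simply encode the example above, so the numbers are purely illustrative:

```python
# Counts from the example: 100 patients, 10 pregnant, 8 predicted pregnant, 6 of them correct
tp = 6   # pregnant women correctly labeled as pregnant
fp = 2   # non-pregnant women incorrectly labeled as pregnant
fn = 4   # pregnant women the model missed
tn = 88  # non-pregnant women correctly labeled as non-pregnant

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.94
```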

An accuracy of 94% looks decent! But does our model produce decent results? That’s hard to say… Let’s consider another example: imagine our model predicts that there are NO pregnant women at all. Even then, we get an accuracy of 90%!

To get a better grasp of the model’s performance, we need to look at the recall and precision scores. Recall shows how well the model retrieves the pregnant class, and precision shows how confident we can be that those labeled pregnant by the model are indeed pregnant. In other words, precision characterizes the quality of the model, its ability to identify only the relevant data points, whereas recall quantifies the completeness of the results.
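If you want to see these scores in action, here is a hedged sketch using scikit-learn; the label arrays are made up so that they reproduce the counts from the example above (10 pregnant women, 8 predicted pregnant, 6 of them correct):

```python
from sklearn.metrics import precision_score, recall_score

# 1 = pregnant, 0 = not pregnant; the arrays encode the example above
y_true = [1] * 10 + [0] * 90                     # 10 truly pregnant women among 100 patients
y_pred = [1] * 6 + [0] * 4 + [1] * 2 + [0] * 88  # 6 TP, 4 FN, 2 FP, 88 TN

print(recall_score(y_true, y_pred))     # 0.6  -> the model finds 6 of the 10 pregnant women
print(precision_score(y_true, y_pred))  # 0.75 -> 6 of the 8 "pregnant" predictions are correct
```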

Another convenient way to visualize the performance of classification algorithms (usually in supervised learning problems) is to look at a confusion matrix. It is essentially a table that compares the true labels with the predictions of your classification model by reporting True Positives (TP), True Negatives (TN), False Positives (FP), and False Negatives (FN).

That’s where things start to look complicated. Is a False Positive a woman who is not pregnant but labeled as pregnant, or the other way around? And what do these numbers have to do with the model’s performance?

To get my head around the confusion matrix and (finally!) remember what stands behind the aforementioned scores, I devised a simple procedure to follow when evaluating the model performance and constructing a confusion matrix.

Step 1: State the Null and Alternative hypotheses!

This is actually the most important step, which is often taken for granted and for some reason omitted. In my opinion, this is the only reason why it’s hard to follow the Wikipedia article: without knowing the Null, it’s hard to understand what stands behind the True/False Positives and Negatives. To define Type I and Type II errors, you need to know exactly which Null you are testing.

Continuing with the previous example, the Null Hypothesis is “a woman is not pregnant” vs. the Alternative “a woman is pregnant”. In clinical trials, the Null could be “the drug has no effect”. For fraud detection problems, the Null could be “a transaction is not fraudulent”. For text classification problems, “a text comment is not toxic”. You get the idea.

Step 2: Define the key components.

What do you identify as the Positive class? Being pregnant? A fraudulent transaction? An effect from administering a drug? A non-zero effect from an economic policy?

Now it is much easier to formulate the four components of a confusion matrix. Let’s continue with the pregnancy prediction example.

True Positive: a pregnant person is correctly labeled by the model

True Negative: a non-pregnant person is correctly labeled as non-pregnant

False Positive: a non-pregnant person is falsely identified as pregnant

False Negative: a pregnant person is falsely identified as non-pregnant

Step 3: Draw a confusion matrix and identify Type I and Type II errors

Now, you are ready to draw a confusion matrix:
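For our pregnancy example (TP = 6, FP = 2, FN = 4, TN = 88), it might look like this:

```
                       Predicted: pregnant    Predicted: not pregnant
Actual: pregnant             TP = 6                  FN = 4
Actual: not pregnant         FP = 2                  TN = 88
```

(If you build the matrix with scikit-learn’s confusion_matrix, keep in mind that it lists the negative class first, so the array comes out as [[TN, FP], [FN, TP]].)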

Combining Steps 1 and 2 allows us to (almost) effortlessly identify Type I and Type II errors.

What is a Type I error? It is rejecting the Null when it is actually true. You know the Null from Step 1, and you identified the Positive/Negative classes in Step 2. Thus, a Type I error in this case corresponds to a False Positive. In our example, we predict that a woman belongs to the Positive class (rejecting the Null), i.e. that she is pregnant, while in reality she is not pregnant (the Null is true). A Type II error is a False Negative: we fail to reject the Null (we predict she belongs to the Negative class, i.e. not pregnant), while she is actually pregnant.

Let’s also show how to decide which metric, recall or precision, we should focus on. By definition, recall is the share of correctly identified Positive class instances out of the total number of truly Positive class instances (which is the sum of True Positives and False Negatives, since False Negatives are Positive class instances that were predicted as Negative):
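Recall = TP / (TP + FN)

In our example, recall = 6 / (6 + 4) = 0.6.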

Precision is the share of correctly labeled Positive instances out of the total number of predicted Positive class instances, which is the sum of True Positives and False Positives (i.e. we look at the Positive class from the model’s prediction perspective):
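Precision = TP / (TP + FP)

In our example, precision = 6 / (6 + 2) = 0.75.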

So, if you want to minimize the Type II error, that is, decrease the number of False Negatives, then you should maximize the model’s recall. In the pregnancy example, this would be the case when we want to identify as many pregnant women as we can and avoid mislabeling pregnant women as non-pregnant. Maximizing recall is actually pretty common in medical research: you would rather make an error and ask a healthy person, whom the model falsely identifies as having a high chance of a disease, to go through additional diagnostics, than miss a person who actually has the disease.

However, if you do not want to frustrate those who are identified by the model as pregnant but are actually not, and want to be more confident in your results, then you should be concentrating on minimizing Type I error and maximizing the precision of your classifier. This is especially true for those running marketing campaigns and sending promo offers.

If you are unsure which metric, precision or recall, to prioritize, then it is a good idea to try to maximize the F1 score, which is the harmonic mean of the two:
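F1 = 2 × Precision × Recall / (Precision + Recall)

For our example, F1 = 2 × 0.75 × 0.6 / (0.75 + 0.6) ≈ 0.67.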

To summarize, remember that the accuracy score might not be the best metric for a classification problem, especially if you are dealing with an unbalanced dataset. It is always a good idea to visualize classification results using a confusion matrix, and I hope the simple 3-step procedure I outlined above comes in handy. Finally, the F1 score could be your metric of choice if you care about both precision and recall.

