Theory Behind Confusion Matrix

OCTAVE - John Keells Group
8 min read · Nov 27, 2023

What is confusion matrix?

A confusion matrix is a summarized table of the numbers of correct and incorrect predictions (actual versus predicted values) yielded by a classifier, most commonly presented for binary classification tasks.

In simple terms, the confusion matrix evaluates the performance of a machine learning model.

By examining the diagonal values of the confusion matrix, one can count the number of correct classifications and quickly assess the model’s accuracy.

In terms of structure, the size of the matrix is determined by the number of output classes: a problem with n classes produces an n x n matrix.

The confusion matrix is a square matrix with the actual values along one axis and the model’s predicted values along the other (the orientation varies by convention). Specifically,

1. A confusion matrix presents the ways in which a classification model becomes confused while making predictions.

2. A good matrix will have large values across the diagonal and small values off the diagonal.

3. Examining a confusion matrix provides better insight into the particulars of what our classification model is getting correct and what types of errors it is making.
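
To make the structure concrete, here is a minimal sketch (assuming scikit-learn is available and using small, hypothetical label lists) that builds a 2x2 confusion matrix and shows that the correct predictions sit on the diagonal:

```python
# Build a confusion matrix from hypothetical actual and predicted labels
# (1 = positive class, 0 = negative class).
from sklearn.metrics import confusion_matrix

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

cm = confusion_matrix(y_actual, y_predicted)
print(cm)
# Rows are actual classes, columns are predicted classes:
# [[4 1]   <- actual 0: 4 correct (TN), 1 incorrect (FP)
#  [1 4]]  <- actual 1: 1 incorrect (FN), 4 correct (TP)
# Large values on the diagonal indicate mostly correct predictions.
```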

True Positive, True Negative, False Positive, and False Negative

The terms “True Positive,” “True Negative,” “False Positive,” and “False Negative” are frequently used to characterize the results of a binary classification problem in statistics, classification, and machine learning. These terms are essential for evaluating a model’s performance, and knowing how they differ will help you determine a classification system’s strengths and weaknesses.

In a 2x2 confusion matrix, TP, TN, FP, and FN are typically arranged with rows representing the actual class labels and columns representing the predicted class labels. The diagonal cells (TP and TN) contain correct predictions, in contrast to the off-diagonal cells (FP and FN).

Consider a spam email classifier as an example. After receiving an email, the classifier must determine if it is spam or not. A true positive occurs when the classifier correctly predicts that an email is spam and it actually is. It is a true negative if the classifier predicts that an email is not spam and it is not spam. A false positive occurs when the classifier thinks that an email is spam when it is not. Last but not least, a false negative occurs when the classifier determines that an email is not spam when it is indeed spam.
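
As a small illustration of the spam example (the labels below are hypothetical, with “spam” treated as the positive class), the four outcomes can be counted directly from the actual and predicted labels:

```python
# Count TP, TN, FP, and FN for a toy spam-classification example.
actual    = ["spam", "not spam", "spam", "not spam", "spam", "not spam"]
predicted = ["spam", "not spam", "not spam", "spam", "spam", "not spam"]

tp = sum(a == "spam" and p == "spam" for a, p in zip(actual, predicted))
tn = sum(a == "not spam" and p == "not spam" for a, p in zip(actual, predicted))
fp = sum(a == "not spam" and p == "spam" for a, p in zip(actual, predicted))
fn = sum(a == "spam" and p == "not spam" for a, p in zip(actual, predicted))

print(tp, tn, fp, fn)  # 2 2 1 1
```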

Let’s now analyze why these terms matter. A classification model’s primary objective is to correctly classify the incoming data. TP and TN measure how often the model classifies the input data correctly, while false positives and false negatives measure how often it misclassifies the data.

In the spam email example, a false positive might result in genuine messages being labeled as spam, which is annoying for the recipient. A false negative, on the other hand, lets spam emails through to the recipient’s mailbox.

To sum up, it is critical to comprehend the meanings of true positive, true negative, false positive, and false negative when assessing the effectiveness of a binary classification model. While FP and FN gauge the model’s error, TP and TN gauge the model’s accuracy. We may choose whether the model is appropriate for a certain application by understanding how important these terms are in that application.

Benefits of the confusion matrix

1) It gives information about the errors made by the classifier and the types of errors being made.

2) It reflects how a classification model gets confused while making predictions.

3) This helps in overcoming the limitations of relying on classification accuracy alone.

4) It is useful when the classification problem is highly imbalanced and one class predominates over the others.

5) The confusion matrix is well suited for calculating Recall, Precision, Specificity, Accuracy, and the AUC-ROC curve, as sketched below.
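
As a quick sketch of point 5 (assuming scikit-learn and reusing the hypothetical labels from the earlier example), several of these metrics can be read off in one go:

```python
# Report precision, recall, and F1 per class, plus overall accuracy,
# for the same hypothetical predictions used above.
from sklearn.metrics import classification_report

y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

print(classification_report(y_actual, y_predicted))
```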

Recall, Precision, Accuracy, and F-measure in the confusion matrix

What is the Precision score?

The model precision score represents the proportion of positive predictions that were actually correct. Precision is also known as the positive predictive value. Precision and recall together capture the trade-off between false positives and false negatives. Precision is sensitive to class distribution: when the negative class heavily outnumbers the positive class, even a few false positives can noticeably lower precision. One way to think of precision is as an indicator of exactness or quality. A model with high precision is the one we would use if we wanted to reduce false positives; on the other hand, we would pick a model with high recall if we wanted to reduce the number of false negatives. Precision matters most when predicting the positive class correctly is essential because false positives are more costly than false negatives, as in spam filtering. For instance, if a model flags emails as spam with 99% accuracy but only 50% precision, then only half of the emails it flags as spam are actually spam.

When the classes are highly imbalanced, the precision score is a helpful indicator of prediction quality. Mathematically, it reflects the ratio of true positives to the sum of true and false positives.

Precision Score = TP / (FP + TP)
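
As a small sketch of this formula (the counts are the hypothetical ones from the confusion matrix above; scikit-learn is assumed for the cross-check):

```python
# Precision from raw counts: TP / (TP + FP).
tp, fp = 4, 1
precision = tp / (tp + fp)
print(precision)  # 0.8

# The same result via scikit-learn, using the hypothetical labels from above.
from sklearn.metrics import precision_score
y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(precision_score(y_actual, y_predicted))  # 0.8
```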

What is a recall score?

The model recall score represents the model’s ability to correctly predict the positives out of the actual positives. This is unlike precision, which measures how many of the model’s positive predictions are actually positive. For example, if your machine learning model is trying to identify positive reviews, the recall score would be the percentage of actual positive reviews that the model correctly predicted as positive. In other words, it measures how good our machine learning model is at identifying all the actual positives that exist within a dataset. Recall is also known as sensitivity or the true positive rate.

The higher the recall score, the better the machine learning model is at identifying positive examples. Conversely, a low recall score indicates that the model misses many positive examples.

Recall is often used in conjunction with other performance metrics, such as precision and accuracy, to get a complete picture of the model’s performance. Mathematically, it represents the ratio of true positives to the sum of true positives and false negatives.

Recall Score = TP / (FN + TP)

From the above formula, you can see that the number of false negatives directly impacts the recall score. Thus, while building predictive models, you may choose to focus on reducing false negatives if a high recall score is important for the business requirements.
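
A matching sketch for recall (again with the hypothetical counts and labels used earlier, and scikit-learn assumed):

```python
# Recall from raw counts: TP / (TP + FN).
tp, fn = 4, 1
recall = tp / (tp + fn)
print(recall)  # 0.8

# The same result via scikit-learn.
from sklearn.metrics import recall_score
y_actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]
print(recall_score(y_actual, y_predicted))  # 0.8
```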

Precision Recall Tradeoff

When assessing a classification model’s performance, the precision-recall tradeoff is a common issue. Precision and recall are two criteria frequently used to assess a classifier’s performance, and they are often in tension with one another.

The precision of a model is the percentage of its positive predictions that are correct (i.e., the number of correct positive predictions divided by the total number of positive predictions). It is a helpful indicator for assessing how well the model avoids producing false positives.

Recall, on the other hand, measures the fraction of actual positive cases that the model correctly identified (i.e., the number of correct positive predictions divided by the total number of actual positive cases). It is an effective statistic for assessing the model’s resistance to false negatives.

A model’s recall will typically decrease as its precision increases, and vice versa. This is the essence of the tradeoff between precision and recall: enhancing one will often induce a decline in the other. For example, a model with high precision will make few false positive predictions, but it may also miss some true positive cases. On the other hand, a model with high recall will correctly identify most of the true positive cases, but it may also make more false positive predictions.

Instead of focusing on just one of these metrics, it is crucial to take into account both precision and recall when assessing a classification model. The right balance between precision and recall will vary depending on the particular objectives and constraints of the model as well as the characteristics of the dataset. Good precision may be more crucial in some situations (such as medical diagnosis), whereas good recall may be more crucial in others (e.g., fraud detection).

To balance precision and recall, practitioners often use the F1 score, which is a combination of the two metrics. The F1 score is calculated as the harmonic mean of precision and recall, and it provides a balance between the two metrics. However, even the F1 score is not a perfect solution, as it can be difficult to determine the optimal balance between precision and recall for a given application.
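
One way to see the tradeoff in practice is to sweep the decision threshold applied to a classifier’s predicted probabilities. The sketch below uses hypothetical scores and scikit-learn’s metric functions; as the threshold rises, precision improves while recall falls:

```python
# Demonstrate the precision-recall tradeoff by varying the decision threshold.
from sklearn.metrics import precision_score, recall_score

y_actual = [1, 0, 1, 1, 0, 0, 1, 0]
y_scores = [0.9, 0.4, 0.65, 0.8, 0.3, 0.55, 0.45, 0.2]  # predicted P(positive)

for threshold in (0.3, 0.5, 0.7):
    y_predicted = [1 if s >= threshold else 0 for s in y_scores]
    p = precision_score(y_actual, y_predicted)
    r = recall_score(y_actual, y_predicted)
    print(f"threshold={threshold}: precision={p:.2f}, recall={r:.2f}")

# threshold=0.3: precision=0.57, recall=1.00
# threshold=0.5: precision=0.75, recall=0.75
# threshold=0.7: precision=1.00, recall=0.50
```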

What is an Accuracy Score?

Model accuracy is a machine learning classification performance metric defined as the ratio of true positives and true negatives to all positive and negative observations. In other words, accuracy tells us how often we can expect the machine learning model to correctly predict an outcome out of the total number of predictions it made.

For example:

Let’s assume that you were testing your machine learning model with a dataset of 100 records and that the model predicted 90 of those instances correctly.

The accuracy metric, in this case, would be: (90/100) = 90%.

The accuracy rate is great but it doesn’t tell us anything about the errors our machine learning models make on new data we haven’t seen before.

Mathematically, it represents the ratio of the sum of true positives and true negatives to the total number of predictions.

Accuracy Score = (TP + TN)/ (TP + FN + TN + FP)
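
As a minimal sketch of the accuracy calculation (the counts are hypothetical and chosen to mirror the 90-out-of-100 example above):

```python
# Accuracy from raw counts: (TP + TN) / (TP + TN + FP + FN).
tp, tn, fp, fn = 45, 45, 5, 5  # hypothetical counts summing to 100 records

accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.9, i.e. 90%
```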

What is the F1 Score?

The F1 score represents the model’s performance as a function of its precision and recall scores. The F-score is a machine learning performance metric that gives equal weight to both precision and recall, making it an alternative to the accuracy metric.

It is often used as a single value that provides high-level information about the model’s output quality. It is a useful measure in scenarios where optimizing for either precision or recall alone causes the other, and hence overall model performance, to suffer. The following points illustrate the issues with optimizing for only one of precision or recall (using cancer detection as an example):

· Optimizing for recall helps minimize the chance of not detecting cancer. However, this comes at the cost of predicting cancer in patients who are actually healthy (a high number of FP).

· Optimizing for precision helps ensure that a prediction of cancer is correct. However, this comes at the cost of missing actual cancer cases more frequently (a high number of FN).

Mathematically, it can be represented as a harmonic mean of precision and recall score.

F1 Score = 2 * Precision Score * Recall Score / (Precision Score + Recall Score)
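
As a small sketch of the harmonic mean (the precision and recall values are the hypothetical 0.8 figures from the earlier examples):

```python
# F1 score as the harmonic mean of precision and recall.
precision, recall = 0.8, 0.8

f1 = 2 * precision * recall / (precision + recall)
print(f1)  # 0.8 -- equals precision and recall when the two are equal
```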

