Performance Metrics for Classification: Data Science with Python

Table of Contents

  1. Introduction
  2. Classification metrics:
    1. Accuracy
    2. Confusion Matrix
    3. Precision and Recall
    4. F1 Score
    5. AUC-ROC
    6. Log Loss
    7. Hamming Loss
  3. Conclusion


Classification models are among the most popular models in machine learning. Because they are so widely used, it is crucial to know how to build an accurate model that generalizes well. Many performance metrics are available for evaluating a classifier, and choosing the most appropriate ones is important for fine-tuning your model based on its performance.

This article will discuss the mathematical basis, applications, and pros and cons of evaluation metrics in classification problems.

Confusion matrix

A confusion matrix is one of the most widely used classification model evaluation methods. Although the matrix is not a metric in itself, a matrix representation can be used to define various metrics, each of which is important in a particular case or scenario. It can be created by comparing the predicted class label of a data point with its actual class label. This comparison is repeated for the entire dataset and the results of this comparison are compiled in a matrix or tabular format.

This resultant matrix is called the confusion matrix. A confusion matrix can be created for a binary classification (2 classes) as well as a multi-class (more than 2 classes) classification model.

Let’s build a random forest model on our dataset and look at the confusion matrix for the model predictions on the test dataset. The link for the dataset is here.

Python Code Implementation

Python code to implement the confusion matrix of a machine learning model.

Confusion Matrix Code
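
As a minimal sketch of how such a matrix is assembled, the snippet below builds it in plain Python. The helper name `confusion_matrix_counts` and the synthetic label lists are assumptions for illustration; the lists are constructed to reproduce the counts quoted in this section rather than taken from a fitted model. With real predictions, scikit-learn's `confusion_matrix(y_true, y_pred)` should give the same table.

```python
def confusion_matrix_counts(y_true, y_pred, labels=(0, 1)):
    """Rows are true labels, columns are predicted labels."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix

# Synthetic labels (assumption): 90 samples of label 0 with 88 predicted
# correctly, and 53 samples of label 1 with 52 predicted correctly.
y_true = [0] * 90 + [1] * 53
y_pred = [0] * 88 + [1] * 2 + [1] * 52 + [0] * 1

cm = confusion_matrix_counts(y_true, y_pred)
for label, row in zip((0, 1), cm):
    print(f"true label {label}: {row}")
```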

The preceding output presents the confusion matrix with the necessary annotations. Out of 90 observations with label 0 (malignant), our model correctly predicted 88. Similarly, out of 53 observations with label 1 (benign), it correctly predicted 52.

Understanding the Confusion Matrix

To reiterate what you learned in the previous section, the confusion matrix is a tabular structure to keep track of true as well as false classifications. This is useful to evaluate the performance of a classification model where we know the real data labels and can compare them with the predicted data labels.

Each column of the confusion matrix counts the cases assigned to a class by the model's predictions, and each row counts the cases that belong to a real/true class label. For a binary classification problem, one class label is designated the positive class, which is essentially the class we are interested in. For example, in our breast cancer dataset (data.csv), suppose we want to detect when a patient does not have breast cancer (benign). Then label 1 is our positive class. If instead our class of interest is cancerous (malignant), we can choose label 0 as the positive class.

Below is a typical confusion matrix for a binary classification problem, where p represents the positive class and n represents the negative class.

Image: Actual vs. Predicted Values. (Source) 
  • True Positive (TP): The number of positive-class instances that the model correctly predicted as positive.
  • False Positive (FP): The number of negative-class instances that the model misclassified as positive, hence the name.
  • True Negative (TN): The number of negative-class instances that the model correctly predicted as negative.
  • False Negative (FN): The number of positive-class instances that the model misclassified as negative.

A confusion matrix can be used to calculate several metrics that are useful measures for different scenarios.
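
Each of these metrics starts from the four counts defined above. As a minimal sketch, the helper below (the name `binary_counts` and the synthetic label lists are assumptions for illustration) tallies TP, FP, TN, and FN for whichever label is chosen as the positive class. The labels are built to reproduce the counts quoted for the test set: 90 observations of label 0 with 88 correct, and 53 of label 1 with 52 correct.

```python
def binary_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the positive class."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, really positive
            else:
                fp += 1  # predicted positive, really negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, really positive
            else:
                tn += 1  # predicted negative, really negative
    return tp, fp, tn, fn

# Synthetic labels (assumption) matching the counts quoted in the text.
y_true = [0] * 90 + [1] * 53
y_pred = [0] * 88 + [1] * 2 + [1] * 52 + [0] * 1

benign_positive = binary_counts(y_true, y_pred, positive=1)
malignant_positive = binary_counts(y_true, y_pred, positive=0)
print("label 1 as positive (TP, FP, TN, FN):", benign_positive)
print("label 0 as positive (TP, FP, TN, FN):", malignant_positive)
```

Switching the positive class swaps the roles of the counts: what was a true negative becomes a true positive, and false positives and false negatives trade places.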


Accuracy

A simple and widely used performance measure is accuracy. It is defined as the ratio of the number of correct predictions to the total number of predictions. It lies in the range [0, 1]. The higher the accuracy, the better the model (TP and TN must be high).

Accuracy = (TP + TN) / (TP + TN + FP + FN)


Python Code Implementation

Python code for computing the accuracy of a machine learning model.

Accuracy Code
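
A minimal sketch of the calculation, using the counts from the confusion matrix discussed earlier with label 1 as the positive class. The function name `accuracy` is an assumption for illustration; given label arrays, scikit-learn's `accuracy_score(y_true, y_pred)` computes the same ratio.

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

# Counts from the confusion matrix above (label 1 treated as positive).
tp, fn = 52, 1
tn, fp = 88, 2

acc = accuracy(tp, tn, fp, fn)
print(f"accuracy = {acc:.4f}")  # 140 correct out of 143, about 0.979
```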

Accuracy is most useful when the classes are nearly balanced and correct predictions for each class are equally important.

Advantages of Accuracy

  1. It is the most popular and simple measure for evaluating machine learning models.

Disadvantages of Accuracy

  1. Accuracy is only reliable on datasets that are close to balanced, which is rarely the case for real-world datasets.
  2. Accuracy does not let data scientists weight True Positives and True Negatives differently.
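
The first disadvantage is easy to demonstrate. The sketch below uses a hypothetical dataset (an assumption, not data from this article) in which 93% of samples are negative, and a "model" that always predicts the negative class. Its accuracy looks good even though it never finds a single positive case.

```python
# Hypothetical imbalanced dataset: 930 negatives (0) and 70 positives (1).
y_true = [0] * 930 + [1] * 70
# A trivial classifier that always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(actual == predicted for actual, predicted in zip(y_true, y_pred))
acc = correct / len(y_true)

# Share of actual positives that were found by the classifier.
found_positives = sum(
    actual == 1 and predicted == 1 for actual, predicted in zip(y_true, y_pred)
)
positive_hit_rate = found_positives / y_true.count(1)

print(f"accuracy = {acc:.2f}")
print(f"share of positives found = {positive_hit_rate:.2f}")
```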


Recall (Sensitivity)

Recall, also known as sensitivity, hit rate, coverage, or True Positive Rate (TPR), is the proportion of samples that actually belong to the positive class that the classifier correctly predicted as positive. It summarizes how well the classifier finds the positive class.

The formula for recall is:

Recall = TP / (TP + FN)


Python Code Implementation

The following code displays recall on our model predictions.

Recall Code
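
A minimal sketch of the calculation from the confusion-matrix counts, with label 1 as the positive class. The function name `recall` and the zero-division guard are assumptions of this sketch; with label arrays, scikit-learn's `recall_score(y_true, y_pred)` computes the same quantity.

```python
def recall(tp, fn):
    """Share of actual positives that the model predicted as positive."""
    if tp + fn == 0:
        return 0.0  # no actual positives; returning 0.0 is a convention chosen here
    return tp / (tp + fn)

# From the confusion matrix above: 53 actual positives, 52 of them found.
rec = recall(tp=52, fn=1)
print(f"recall = {rec:.4f}")  # 52 / 53, about 0.981
```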

Recall is an important metric when we want to catch as many instances of a particular class as possible, even if that increases our false positives. For example, in bank fraud detection, a model with high recall flags a larger share of the actual fraud cases, which helps us raise the alarm for the most suspicious ones.

Precision (Positive predictive value)

Precision is the number of correct positive predictions out of all the predictions made for the positive class. It summarizes how many of the samples predicted as positive are actually positive. The formula for precision is:

Precision = TP / (TP + FP)

Python Code Implementation

The following code displays the precision of our model predictions.

Precision Code
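
A minimal sketch of the calculation from the confusion-matrix counts, with label 1 as the positive class. The function name `precision` and the zero-division guard are assumptions of this sketch; with label arrays, scikit-learn's `precision_score(y_true, y_pred)` computes the same quantity.

```python
def precision(tp, fp):
    """Share of positive predictions that are actually positive."""
    if tp + fp == 0:
        return 0.0  # model never predicted positive; 0.0 is a convention chosen here
    return tp / (tp + fp)

# From the confusion matrix above: 54 positive predictions, 52 of them correct.
prec = precision(tp=52, fp=2)
print(f"precision = {prec:.4f}")  # 52 / 54, about 0.963
```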

A high-precision model is right more often when it predicts the positive class than a low-precision model. Precision becomes important when false positives are costly and we want the positive predictions to be trustworthy, even if the total accuracy reduces.

F1 Score

In some cases, a balanced optimization of both precision and recall is required. The F1 score, or F-measure, is the harmonic mean of precision and recall, and it helps us optimize a classifier for balanced precision and recall performance.

The F1 score lies in the range [0, 1]. The higher the value, the better the model. The formula for the F1 score is:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

Python Code Implementation

Let’s calculate the F1 score based on the predictions made by our model using the following code:

F1 Score
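
A minimal sketch of the harmonic-mean calculation, using the precision and recall implied by the confusion matrix above (label 1 as the positive class). The function name `f1_from` is an assumption for illustration; with label arrays, scikit-learn's `f1_score(y_true, y_pred)` computes the same quantity. The second call shows why the harmonic mean is used: a lopsided precision/recall pair scores far below its arithmetic mean.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # both are zero; 0.0 is a convention chosen here
    return 2 * precision * recall / (precision + recall)

# Precision and recall from the confusion matrix above.
f1 = f1_from(precision=52 / 54, recall=52 / 53)
print(f"F1 = {f1:.4f}")  # about 0.972

# A lopsided (hypothetical) model: perfect precision, very poor recall.
lopsided = f1_from(precision=1.0, recall=0.1)
print(f"lopsided F1 = {lopsided:.4f}, arithmetic mean = {(1.0 + 0.1) / 2:.2f}")
```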

When should I use the F1 score?

  1. The F1 score is often used in the field of information retrieval to measure search performance, document classification, and query classification.
  2. It is also applicable in various natural language processing tasks like evaluation of named entity recognition, span-based question answering, and word segmentation.

Advantages of F1-score

  1. It accounts for class imbalance. For example, if a dataset is highly imbalanced (e.g., 93% of all students pass and 7% fail), the F1 score gives a better assessment of model performance than accuracy.

Disadvantages of F1-score

  1. The F1 score is difficult to interpret as it is a blend of the model’s precision and recall.

Area Under the Receiver Operating Characteristic Curve (ROC AUC)

The ROC curve is a visualization technique for displaying the tradeoff between the true positive rate (TPR) and false positive rate (FPR) of a classifier.

TPR = positives correctly classified/total positives (plotted along the Y-axis).

FPR = negatives incorrectly classified/total negatives (plotted along the X-axis).

Each classification threshold yields one (FPR, TPR) point on the ROC curve, so the curve summarizes a classifier's performance over all possible thresholds. The area under the curve (AUC) condenses this into a single number that can be used to assess a model's performance.

A high-performing model has a ROC curve that runs close to the upper-left corner of the plot and therefore has more area below it. This is illustrated in the figure below:

Image: ROC Curve. (Source)

The higher the AUC, the better the classifier.

Important points (TPR, FPR)

  1. (0,0): everything is declared to be the negative class,
  2. (1,1): everything is declared to be the positive class,
  3. (1,0): ideal, since every positive is found with no false positives.

Python Code Implementation

In Python, the ROC curve can be visualized by computing the true positive rate and false positive rate at each threshold.


ROC Code
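
As a minimal sketch of what happens behind the plot, the snippet below sweeps the threshold over a small set of hypothetical scores (an assumption, not the article's model output), collects the (FPR, TPR) points, and integrates them with the trapezoidal rule. The helper names `roc_points` and `trapezoid_auc` are illustrative; in practice scikit-learn's `roc_curve` and `roc_auc_score` serve this purpose, and the points can be passed to a plotting library.

```python
def roc_points(y_true, scores):
    """Return (FPR, TPR) points from sweeping the threshold high to low."""
    positives = sum(y_true)
    negatives = len(y_true) - positives
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]  # threshold above every score: all predicted negative
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Samples with equal scores cross the threshold together.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points

def trapezoid_auc(points):
    """Area under a piecewise-linear curve given as (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical true labels and predicted scores for the positive class.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

points = roc_points(y_true, scores)
auc_value = trapezoid_auc(points)
print(points)
print(f"AUC = {auc_value:.2f}")
```

The curve always starts at (0, 0) and ends at (1, 1), which correspond to declaring everything negative and everything positive.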

Image: ROC Curve

Any model whose ROC curve lies above the random-guessing line (the diagonal) performs better than chance. A model whose ROC curve lies below that line performs worse than random guessing and can be rejected.

Advantages of ROC AUC

  1. AUC is scale-invariant. It measures how well the predictions are ranked rather than their absolute values.
  2. AUC is insensitive to the classification threshold. It evaluates the quality of the model's predictions regardless of which threshold is chosen.

Disadvantages of ROC AUC

  1. It can be misleading when the data is imbalanced (highly skewed).
  2. It says nothing about whether the model's predicted scores can be interpreted as probabilities.
  3. On highly imbalanced data, a higher AUC does not necessarily reflect a better classifier. It can simply be a side effect of having too many negative examples.

Logistic Loss (Log Loss or cross-entropy loss)

Log loss is defined as the negative average of the logarithm of the corrected predicted probabilities for each instance, that is, the probability the model assigned to the instance's true class. Logistic loss (cross-entropy loss) measures the performance of a classification model whose output is a probability value between 0 and 1. The log loss increases as the predicted probability diverges from the true/actual label. Therefore, predicting a probability of 0.012 when the true observation label is 1 would be bad and lead to a high loss value. An ideal model would have zero log loss.

Mathematically, Log Loss is defined as:

Log Loss = -(1/N) Σ_{i=1}^{N} Σ_{j=1}^{M} y_ij log(p_ij)

where N is the number of sample instances, M is the number of possible labels, y_ij is a binary indicator of whether label j is the correct classification for instance i, and p_ij is the probability the model assigns to label j for instance i.
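
The double sum described above can be sketched in a few lines of plain Python. Because the indicator is 1 only for the true label, only the probability assigned to the true class contributes for each instance. The function name `multiclass_log_loss`, the clipping constant `eps`, and the example probabilities are assumptions for illustration.

```python
import math

def multiclass_log_loss(y_true, probabilities, eps=1e-15):
    """y_true holds class indices; probabilities holds one row of M values per instance."""
    total = 0.0
    for label, row in zip(y_true, probabilities):
        # Only the true class has indicator 1, so only its term survives.
        p = min(max(row[label], eps), 1 - eps)  # clip to avoid log(0)
        total += math.log(p)
    return -total / len(y_true)

# Hypothetical 3-class example: true class indices and predicted probabilities.
y_true = [0, 2, 1]
probabilities = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
    [0.2, 0.5, 0.3],
]

loss = multiclass_log_loss(y_true, probabilities)
print(f"log loss = {loss:.4f}")  # about 0.520
```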

Image: Log loss graph. (Source)

Less ideal classifiers have progressively larger log loss values. If there are only two classes, the above expression simplifies to:

Log Loss = -(1/N) Σ_{i=1}^{N} [y_i log(p_i) + (1 - y_i) log(1 - p_i)]

Advantages of Log loss

  1. The model's outputs can be interpreted as probabilities.
  2. It is widely used as a training loss in deep learning, in part because it helps mitigate the vanishing gradient problem.

Disadvantages of Log loss

  1. If there are many near-boundary predictions, the error metric can be very sensitive to false positives or false negatives.

Python Code Implementation

Implementation using Scikit-Learn’s log_loss function:

Log Loss Code
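
For comparison with the library call, here is a minimal plain-Python sketch of the two-class case. The function name `binary_log_loss` and the clipping constant `eps` are assumptions of this sketch. The first example reuses the case mentioned earlier: predicting 0.012 when the true label is 1.

```python
import math

def binary_log_loss(y_true, p_pred, eps=1e-15):
    """Average negative log-likelihood; p_pred is the predicted probability of label 1."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip so log(0) never occurs
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)

# Confident and wrong: true label 1, predicted probability 0.012.
confident_and_wrong = binary_log_loss([1], [0.012])
# Confident and right: true label 1, predicted probability 0.98.
confident_and_right = binary_log_loss([1], [0.98])

print(f"p = 0.012 -> loss = {confident_and_wrong:.2f}")  # about 4.42
print(f"p = 0.98  -> loss = {confident_and_right:.2f}")  # about 0.02
```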

Hamming Loss

Hamming loss is the fraction of labels that are predicted incorrectly. The best value for Hamming loss is 0 and the worst value is 1.

Python Code Implementation

It can be calculated as:

Hamming Loss
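
A minimal plain-Python sketch for the multi-label case, where each sample carries a row of binary label indicators. The function name `hamming_loss_rows` and the example rows are assumptions for illustration; scikit-learn provides `hamming_loss(y_true, y_pred)` in `sklearn.metrics` for the same purpose.

```python
def hamming_loss_rows(y_true, y_pred):
    """Fraction of individual label slots where prediction and truth differ."""
    total = 0
    wrong = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            total += 1
            if t != p:
                wrong += 1
    return wrong / total

# Hypothetical multi-label data: 2 samples, 3 labels each.
y_true = [[1, 0, 1], [0, 1, 0]]
y_pred = [[1, 1, 1], [0, 1, 1]]

h_loss = hamming_loss_rows(y_true, y_pred)
print(f"Hamming loss = {h_loss:.4f}")  # 2 wrong labels out of 6
```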


Hamming loss gives a single, unambiguous performance value for multi-label problems, whereas precision, recall, and F1 are computed per label, as if each label had its own independent binary classifier.

Classification report

Scikit-learn provides a convenience report for classification problems that gives you a quick idea of the quality of your model using a variety of metrics. The classification_report() function displays the precision, recall, f1-score, and support for each class.

The example below shows a binary classification problem report.
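
As a rough sketch of what such a report contains, the snippet below computes the same four columns per class in plain Python. The helper name `per_class_report` and the synthetic labels (built to match the confusion matrix counts quoted earlier) are assumptions for illustration; `classification_report(y_true, y_pred)` from `sklearn.metrics` produces the formatted library version.

```python
def per_class_report(y_true, y_pred):
    """Precision, recall, f1-score, and support, treating each class as positive in turn."""
    pairs = list(zip(y_true, y_pred))
    report = {}
    for label in sorted(set(y_true)):
        tp = sum(1 for a, p in pairs if a == label and p == label)
        fp = sum(1 for a, p in pairs if a != label and p == label)
        fn = sum(1 for a, p in pairs if a == label and p != label)
        report[label] = {
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
            "f1-score": 2 * tp / (2 * tp + fp + fn) if tp else 0.0,
            "support": tp + fn,  # number of true instances of this class
        }
    return report

# Synthetic labels (assumption) matching the counts quoted earlier.
y_true = [0] * 90 + [1] * 53
y_pred = [0] * 88 + [1] * 2 + [1] * 52 + [0] * 1

report = per_class_report(y_true, y_pred)
print(f"{'class':>5} {'precision':>9} {'recall':>7} {'f1-score':>8} {'support':>7}")
for label, row in report.items():
    print(
        f"{label:>5} {row['precision']:>9.2f} {row['recall']:>7.2f} "
        f"{row['f1-score']:>8.2f} {row['support']:>7}"
    )
```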



Conclusion

Data scientists across domains and industries must have a sound knowledge of these machine learning model evaluation metrics. Knowing which metrics to use for unbalanced or balanced data is important for a clear understanding of the performance and purpose of your model.

Understanding how well machine learning models perform on unseen data is the ultimate goal of working with these metrics. So, depending on the problem type (regression or classification), you can use some of these known metrics to evaluate the performance of your machine-learning model.

You can find the notebook with all the code used in this article here.

Stay Tuned

Click here to learn about performance metrics for regression.

Keep learning and keep implementing!!
