Once you’ve built a machine learning classifier, the next step is to validate it and see how well it fits the data. This short post lists the common metrics we can use to evaluate a classifier. The metrics are explained for a binary classifier, but the ideas extend to multi-class classification. Since these metrics are fundamental and very well known, I will pay particular attention to how they are worked out and computed from a training dataset.

## Confusion Matrix

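A confusion matrix is a table that categorises model predictions according to whether they match observations. One dimension indicates the possible categories of predicted values, and the other dimension indicates the possible categories of actual values. The general layout of a 2 x 2 confusion matrix, associated with a binary classifier, is

| | Predicted positive | Predicted negative |
|---|---|---|
| Actual positive | True Positive (TP) | False Negative (FN) |
| Actual negative | False Positive (FP) | True Negative (TN) |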

When the predicted value is equal to the actual value, the model classification is correct. Correct predictions fall on the main diagonal of the table. The off-diagonal entries indicate incorrect classifications.

The performance measures for classification models are based on the counts of predictions falling on and off the diagonal of the confusion matrix. Typically, we are interested in one class over another (or over others if we have a multi-class problem). The class of interest is called the positive class while the others are known as the negative class. The relationship between the positive class and negative class predictions can be categorised as:

1). True Positive (TP): A positive instance correctly classified as positive

2). True Negative (TN): A negative instance correctly classified as negative

3). False Positive (FP): A negative instance incorrectly classified as positive

4). False Negative (FN): A positive instance incorrectly classified as negative
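The four counts can be tallied directly from paired actual and predicted labels. The following is a minimal sketch, not a reference implementation: the function name `confusion_counts`, the `positive` parameter and the label vectors are all hypothetical and invented purely for illustration.

```python
def confusion_counts(actual, predicted, positive=1):
    """Tally TP, TN, FP and FN for a binary problem.

    Any label other than `positive` is treated as the negative class.
    """
    tp = tn = fp = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive and a == positive:
            tp += 1   # predicted positive, actually positive
        elif p != positive and a != positive:
            tn += 1   # predicted negative, actually negative
        elif p == positive and a != positive:
            fp += 1   # predicted positive, actually negative
        else:
            fn += 1   # predicted negative, actually positive
    return tp, tn, fp, fn


# Hypothetical labels, invented for illustration only.
actual    = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(actual, predicted)
print(tp, tn, fp, fn)  # 3 5 1 1
```

The correct predictions (TP and TN) are the diagonal entries of the confusion matrix; FP and FN are the off-diagonal entries.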

## Measures of Performance

Once we have a confusion matrix, a number of performance measures can be extracted to determine the goodness of fit for a classifier. These measures are computed for a 2 x 2 confusion matrix but can be extended to larger matrices.

1. Prediction accuracy – Proportion that represents the number of true positives and true negatives, divided by the total number of predictions. Formally, it is defined as

$$\text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN}$$

1. The error rate is defined as

$$\text{error} = \frac{FP + FN}{TP + TN + FP + FN} = 1 - \text{accuracy}$$

1. Sensitivity (True Positive Rate / Recall) – Measures the proportion of positive examples that were correctly classified.

$$\text{Sensitivity} = \frac{TP}{TP + FN} = \text{Recall}$$

1. Specificity (True Negative Rate) – Measures the proportion of negative examples that were correctly classified.

$$\text{Specificity} = \frac{TN}{TN + FP}$$

1. Precision (Positive Predictive Value) – Proportion of positive examples that are truly positive – when a model predicts the positive class, how often is it correct?

$$\text{Precision} = \frac{TP}{TP + FP}$$

1. Kappa Statistic – Adjusts accuracy by accounting for the possibility of a correct prediction by chance alone. This is important for datasets with a severe class imbalance, because the Kappa Statistic only rewards a classifier if it is correct more often than chance agreement would predict. Kappa has a maximum of 1, which indicates perfect agreement between model predictions and the actual values; values less than 1 indicate imperfect agreement. A value of 0 indicates agreement no better than chance, and negative values (agreement worse than chance) are possible, although rare in practice. The definition of the Kappa Statistic is

$$\kappa = \frac{Pr(a) - Pr(e)}{1 - Pr(e)}$$

where $Pr(a)$ refers to the proportion of actual agreement between the classifier and the actual values, and $Pr(e)$ refers to the expected agreement between the classifier and the actual values, under the assumption that they were chosen at random.

The formulas for $Pr(a)$ and $Pr(e)$ are

$$Pr(a) = \frac{TN}{TP + TN + FP + FN} + \frac{TP}{TP + TN + FP + FN}$$

$$Pr(e) = (\frac{TN + FP}{TP + TN + FP + FN} \times \frac{TN + FN}{TP + TN + FP + FN}) + (\frac{FN + TP}{TP + TN + FP + FN} \times \frac{FP + TP}{TP + TN + FP + FN})$$
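To see how the Kappa Statistic penalises agreement that could arise by chance, the formulas for $Pr(a)$ and $Pr(e)$ can be evaluated numerically. This is a small sketch under stated assumptions: the function name `kappa` and both sets of counts are hypothetical, and the function does not guard against the degenerate case $Pr(e) = 1$, where the denominator is zero.

```python
def kappa(tp, tn, fp, fn):
    """Kappa statistic computed from the four confusion-matrix counts."""
    n = tp + tn + fp + fn
    # Observed agreement: proportion of predictions on the diagonal.
    pr_a = tn / n + tp / n
    # Expected agreement if predictions and actual labels were independent.
    pr_e = ((tn + fp) / n) * ((tn + fn) / n) + ((fn + tp) / n) * ((fp + tp) / n)
    return (pr_a - pr_e) / (1 - pr_e)


# Hypothetical counts, invented for illustration only.
print(round(kappa(tp=3, tn=5, fp=1, fn=1), 3))  # 0.583

# A classifier that always predicts the negative class on a 95/5 split:
# accuracy is 0.95, yet kappa is 0 because chance alone explains the agreement.
print(kappa(tp=0, tn=95, fp=0, fn=5))  # 0.0
```

The second call is the class-imbalance case: high accuracy, but no agreement beyond what chance would produce.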

1. The F-measure – Describes the model performance in a single number which combines precision and recall – it provides a convenient way to compare several models side by side. The definition is given by

$$\text{F-measure} = \frac{2 \times precision \times recall}{recall + precision} = \frac{2 \times TP}{(2 \times TP) + FP + FN}$$

Proof:

$$\text{Precision} = \frac{TP}{TP + FP}$$ and $$\text{Recall} = \frac{TP}{TP + FN}$$

$$2 \times \text{Precision} \times \text{Recall} = 2 \times \frac{TP}{TP + FP} \times \frac{TP}{TP + FN} = \frac{2(TP)(TP)}{(TP + FP)(TP + FN)}$$

$$\text{Recall} + \text{Precision} = \frac{TP}{TP + FN} + \frac{TP}{TP + FP} = \frac{TP(TP + FP)}{(TP + FN)(TP + FP)} + \frac{TP(TP + FN)}{(TP + FN)(TP + FP)} = \frac{TP[(TP + FP) + (TP + FN)]}{(TP + FN)(TP + FP)}$$

$$\frac{2 \times precision \times recall}{recall + precision} = \frac{2(TP)(TP)}{(TP+FP)(TP + FN)} \times \frac{(TP + FN)(TP + FP)}{TP[(TP + FP) + (TP + FN)]} = \frac{2TP}{(TP + FP) + (TP + FN)} = \frac{2TP}{2TP + FP + FN} = \frac{2 \times TP}{(2 \times TP) + FP + FN}$$
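The measures in this section can be computed together from the four counts, which also gives a quick numerical check that the two forms of the F-measure agree. This is an illustrative sketch only: the function name `classification_metrics` and the counts are hypothetical, and the code assumes none of the denominators is zero.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Return the basic performance measures as a dictionary."""
    total = tp + tn + fp + fn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
        "sensitivity": recall,              # true positive rate / recall
        "specificity": tn / (tn + fp),      # true negative rate
        "precision": precision,             # positive predictive value
        "f_measure": 2 * precision * recall / (recall + precision),
    }


# Hypothetical counts, invented for illustration only.
tp, tn, fp, fn = 3, 5, 1, 1
m = classification_metrics(tp, tn, fp, fn)
print(m["accuracy"], m["sensitivity"], m["precision"])  # 0.8 0.75 0.75

# The closed form derived above gives the same F-measure.
closed_form = 2 * tp / (2 * tp + fp + fn)
print(math.isclose(m["f_measure"], closed_form))  # True
```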

## ROC Curve

A Receiver Operating Characteristic (ROC) curve is a graphical approach for displaying the trade-off between the True Positive Rate (TPR) and False Positive Rate (FPR) of a classifier, over varying score thresholds. In a ROC curve, the TPR is plotted along the y-axis and FPR is shown on the x-axis. Each point along the curve corresponds to a classification model generated by placing a threshold on the test (or validation) probability scores produced by the classifier.
The procedure for computing a ROC curve is:

1). Sort instances in ascending order of their output probability scores.

2). Select the lowest ranked instance (the instance with the lowest score). Assign the selected instance and those ranked above it to the positive class. Every positive example is then classified correctly and every negative example is misclassified, so TPR = FPR = 1.

3). Select the next instance in the sorted list. Classify the selected instance and those ranked above it as positive, and those ranked below it as negative. Update the counts of TP and FP by examining the actual class label of the previously selected instance, which has just moved from the positive to the negative side of the threshold. If that instance belongs to the positive class, the TP count is decremented and the FP count remains the same. If it belongs to the negative class, the FP count is decremented and the TP count remains the same.

4). Repeat step 3 and update the TP and FP counts accordingly until the highest ranked instance is selected. Once the threshold moves above the highest score, all instances are labeled as negative and TPR = FPR = 0.

5). Plot the TPR (y-axis) against FPR (x-axis) of the classifier.
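The procedure above can be sketched in a few lines of Python. This is one possible reading of the steps rather than a definitive implementation: the function name `roc_points`, the 0/1 label encoding and the data are hypothetical, and the sketch assumes that both classes are present and that all scores are distinct (ties are not treated specially). Plotting (step 5) is left to whichever plotting library you prefer.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) pairs from sweeping the threshold upwards.

    Assumes labels are 1 (positive) or 0 (negative) and scores are distinct.
    """
    ranked = sorted(zip(scores, labels))   # step 1: ascending by score
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos

    # Step 2: threshold at the lowest score, so everything is positive.
    tp, fp = n_pos, n_neg
    points = [(fp / n_neg, tp / n_pos)]    # (1.0, 1.0)

    # Steps 3 and 4: move the threshold past one instance at a time.
    for _, label in ranked:
        # This instance now falls below the threshold and is labeled negative.
        if label == 1:
            tp -= 1
        else:
            fp -= 1
        points.append((fp / n_neg, tp / n_pos))

    return points                          # last point is (0.0, 0.0)


# Hypothetical scores and labels, invented for illustration only.
scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
print(roc_points(scores, labels))
# [(1.0, 1.0), (0.5, 1.0), (0.5, 0.5), (0.0, 0.5), (0.0, 0.0)]
```

Each pair is one operating point, listed from the lowest threshold (top-right corner of the plot) to the highest (bottom-left corner).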

Example 1: Consider the model output for a dataset with $n = 10$ observations (instances) for a binary classifier.

Pass 1:

Pass 2:

Pass 3:

Pass 4:

Continuing in the same way gives a final summary table

This generates a ROC curve with the shape

## Properties of ROC Curve

There are critical points along the ROC curve which have well-known interpretations:

1. TPR = 0, FPR = 0: Classifier predicts every instance to be a negative class.
2. TPR = 1, FPR = 1: Classifier predicts every instance to be a positive class.
3. TPR = 1, FPR = 0: Perfect classifier with 0 misclassifications

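Graphically, the worst possible classifier is no better than random guessing; its ROC curve is the diagonal line from (0, 0) to (1, 1), where TPR = FPR at every threshold.

The best possible classifier predicts every instance correctly; its ROC curve rises vertically from (0, 0) to (0, 1) and then runs horizontally to (1, 1), passing through the perfect point TPR = 1, FPR = 0.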

Every point on the ROC curve represents the performance of a classifier generated using a particular score threshold. The points can be viewed as different operating points of the classifier.

## Area under the ROC Curve (AUC)

To summarise the aggregate behaviour across all the operating points, one measure is the area under the ROC curve (AUC). If the classifier is perfect, the AUC = 1. If the classifier is no better than random guessing, the AUC = 0.5.
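One common way to compute the AUC from a finite set of operating points is the trapezoidal rule. The sketch below assumes that approach; the function name `auc_trapezoid` and the three sets of points are hypothetical and chosen only to reproduce the two reference values quoted above, plus one intermediate case.

```python
def auc_trapezoid(points):
    """Area under a ROC curve given as (FPR, TPR) pairs, via the trapezoidal rule."""
    pts = sorted(points)  # order by FPR, then TPR
    area = 0.0
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        # Width along the FPR axis times the average height of the segment.
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Hypothetical operating points, invented for illustration only.
perfect = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
random_guess = [(0.0, 0.0), (1.0, 1.0)]
example = [(1.0, 1.0), (0.5, 1.0), (0.5, 0.5), (0.0, 0.5), (0.0, 0.0)]

print(auc_trapezoid(perfect))       # 1.0
print(auc_trapezoid(random_guess))  # 0.5
print(auc_trapezoid(example))       # 0.75
```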

One caveat applies when using the AUC to compare two distinct classifiers, $c_1$ and $c_2$: we cannot conclude that one classifier is better than the other simply because it has a higher AUC. The ROC curve of $c_1$ may dominate (be strictly better than) the ROC curve of $c_2$ across some operating points, while the ROC curve of $c_2$ dominates that of $c_1$ across others, so that neither curve dominates the other across all points. Both ROC and AUC are invariant to class imbalance, hence ROC curves are not suitable for measuring the impact of class skewness on classification performance.

## References

1. Introduction to Data Mining. P.-N. Tan, M. Steinbach, A. Karpatne, V. Kumar