
Wednesday, September 26, 2012

DATA WAREHOUSING AND MINING ENGINEERING LECTURE NOTES--Evaluating the Accuracy of a Classifier


Basic Evaluation Measures for Classifier Performance

Introduction

In bioinformatics, and in machine learning in general, there is a large variation in the measures used to evaluate prediction systems, and the measures used in a particular research project may not be appropriate to its problem domain. This page explains some of the advantages and disadvantages of the different measures.

First, an introduction to some terminology is necessary. The following confusion matrix depicts how predictions on instances are tabulated:

                            Predicted Label
                            Positive                Negative
Known Label    Positive     True Positive (TP)      False Negative (FN)
               Negative     False Positive (FP)     True Negative (TN)

For simplicity, the assumption is that each instance can only be assigned one of two classes: Positive or Negative (e.g. a patient's tumor may be malignant or benign). Each instance (e.g. a patient) has a Known label, and a Predicted label. Some method is used (e.g. cross-validation) to make predictions on each instance. Each instance then increments one cell in the confusion matrix.
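
To make the tabulation concrete, here is a minimal sketch in Python of how each (Known, Predicted) pair increments one cell of the matrix; the label lists and variable names are made up for illustration and are not from the notes.

    # A minimal sketch of filling the confusion matrix: each (known, predicted)
    # pair increments exactly one of the four cells. The label lists here are
    # hypothetical examples.
    known     = ["Positive", "Negative", "Positive", "Negative", "Negative"]
    predicted = ["Positive", "Positive", "Negative", "Negative", "Negative"]

    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for k, p in zip(known, predicted):
        if k == "Positive" and p == "Positive":
            counts["TP"] += 1      # correctly predicted positive
        elif k == "Positive" and p == "Negative":
            counts["FN"] += 1      # positive instance missed
        elif k == "Negative" and p == "Positive":
            counts["FP"] += 1      # negative instance wrongly flagged
        else:
            counts["TN"] += 1      # correctly predicted negative

    print(counts)                  # {'TP': 1, 'FP': 1, 'FN': 1, 'TN': 2}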

A confusion matrix can be summarized using various formulas. This table shows the most commonly used formulas:

Measure                Formula                            Intuitive Meaning
Precision              TP / (TP + FP)                     The percentage of positive predictions that are correct.
Recall / Sensitivity   TP / (TP + FN)                     The percentage of positive-labeled instances that were predicted as positive.
Specificity            TN / (TN + FP)                     The percentage of negative-labeled instances that were predicted as negative.
Accuracy               (TP + TN) / (TP + TN + FP + FN)    The percentage of all predictions that are correct.
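
The formulas translate directly into code. The following Python helpers are a minimal sketch; the function names and the handling of a zero denominator are choices made here, not part of the notes.

    # Helper functions matching the formulas in the table above.
    def precision(tp, fp):
        # Undefined (0/0) when no positive predictions are made.
        return tp / (tp + fp) if (tp + fp) > 0 else float("nan")

    def recall(tp, fn):            # also called sensitivity
        return tp / (tp + fn)

    def specificity(tn, fp):
        return tn / (tn + fp)

    def accuracy(tp, tn, fp, fn):
        return (tp + tn) / (tp + tn + fp + fn)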
 
Discussion of Measures

Different problem domains call for different measures of prediction quality.
1.) For example, in a data set of 10,000 samples where 100 are labeled positive, a predictor that outputs "Negative" for every instance it is presented with evaluates to Precision = 100%, Accuracy = 99%, and Specificity = 100%. (Strictly, Precision is 0/0 here, since no positive predictions are made at all; it is counted as 100% because no positive prediction is wrong.) This predictor is entirely useless, yet these measures suggest it performs very well. The same predictor evaluates to Recall = 0%, and in this case Recall is most in tune with how well the classifier is actually performing.

2.) If a classifier predicts positive on all instances in the data set of case 1, then Precision = 1%, Recall = 100%, Accuracy = 1%, and Specificity = 0%. In this case, Precision, Accuracy, and Specificity show that this classifier is problematic.

3.) The other extreme is a data set where most of the examples are positive. For example, if 9,900 out of 10,000 instances are positive and a classifier predicts positive on all instances, then Precision = 99%, Accuracy = 99%, Specificity = 0%, and Recall = 100%. In this case, Specificity shows that this classifier is problematic.

4.) If a classifier predicts negative on all instances in the data set of case 3, then Precision = 100% (again 0/0, counted as 100%), Recall = 0%, Specificity = 100%, and Accuracy = 1%. Here, Recall and Accuracy inform us that there is a problem with this classifier.

Out of these four cases, the sets of measures that inform us of problems with the classifier are {Recall} (Case 1), {Precision, Accuracy, Specificity} (Case 2), {Specificity} (Case 3), and {Recall, Accuracy} (Case 4). The smallest set of measures that covers all of these cases is {Recall, Specificity}.
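
As a sanity check, the four boundary cases can be reproduced numerically with a short, self-contained Python sketch; following the discussion above, an undefined (0/0) Precision is reported as 100% because no positive prediction is wrong.

    def measures(tp, fp, fn, tn):
        prec = tp / (tp + fp) if (tp + fp) > 0 else 1.0   # 0/0 treated as 100%, as in the text
        rec  = tp / (tp + fn)
        spec = tn / (tn + fp)
        acc  = (tp + tn) / (tp + tn + fp + fn)
        return prec, rec, spec, acc

    # (TP, FP, FN, TN) counts implied by the four boundary cases above.
    cases = [
        ("Case 1", 0,    0,    100,  9900),   # all-negative predictor, 100/10,000 positive
        ("Case 2", 100,  9900, 0,    0),      # all-positive predictor, 100/10,000 positive
        ("Case 3", 9900, 100,  0,    0),      # all-positive predictor, 9,900/10,000 positive
        ("Case 4", 0,    0,    9900, 100),    # all-negative predictor, 9,900/10,000 positive
    ]
    for name, tp, fp, fn, tn in cases:
        prec, rec, spec, acc = measures(tp, fp, fn, tn)
        print(f"{name}: Precision={prec:.0%}  Recall={rec:.0%}  "
              f"Specificity={spec:.0%}  Accuracy={acc:.0%}")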

This seems to suggest that, without any knowledge of the distribution of the data, the best measures for uncovering problems with a classifier are Recall (Sensitivity) and Specificity. However, many cases can arise beyond these four boundary cases. Consider the following confusion matrix for a data set with 600 positive instances out of 11,100:

                            Predicted Label
                            Positive      Negative
Known Label    Positive     500           100
               Negative     500           10,000
 
In this case, Precision = 50%, Recall = 83%, Specificity = 95%, and Accuracy = 95%. Precision is low, which means the classifier is predicting positives poorly, yet the other three measures suggest that this is a good classifier. This goes to show that the problem domain has a major impact on which measures should be used to evaluate a classifier, and that checking only the four simple cases presented above is not sufficient.
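
The same arithmetic, sketched in Python, confirms the numbers quoted above (rounded to whole percentages):

    # Confusion matrix above: TP = 500, FN = 100, FP = 500, TN = 10,000.
    tp, fn, fp, tn = 500, 100, 500, 10000
    print(f"Precision   = {tp / (tp + fp):.0%}")                   # 50%
    print(f"Recall      = {tp / (tp + fn):.0%}")                   # 83%
    print(f"Specificity = {tn / (tn + fp):.0%}")                   # 95%
    print(f"Accuracy    = {(tp + tn) / (tp + tn + fp + fn):.0%}")  # 95%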

 
