Basic Evaluation Measures for Classifier Performance
Introduction
In bioinformatics, and in machine learning more generally, a wide variety of measures is used to evaluate prediction systems. However, the measures used in a particular research project may not be appropriate to its problem domain. This page explains some of the advantages and disadvantages of different measures.
First, some terminology is necessary. The following confusion matrix depicts how predictions on instances are tabulated:
| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Known Positive | True Positive (TP) | False Negative (FN) |
| Known Negative | False Positive (FP) | True Negative (TN) |
For simplicity, the assumption is that each instance can be assigned only one of two classes: Positive or Negative (e.g. a patient's tumor may be malignant or benign). Each instance (e.g. a patient) has a known label and a predicted label. Some method (e.g. cross-validation) is used to make a prediction for each instance, and each instance then increments exactly one cell in the confusion matrix.
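To make the tabulation concrete, here is a minimal Python sketch; the `confusion_matrix` helper and the label values are illustrative, not taken from any particular library.

```python
# A minimal sketch: tabulate TP, FN, FP, TN from paired label lists.
# The function name and label values are hypothetical.
def confusion_matrix(known, predicted, positive="Positive"):
    tp = fn = fp = tn = 0
    for k, p in zip(known, predicted):
        if k == positive and p == positive:
            tp += 1  # positive instance predicted positive
        elif k == positive:
            fn += 1  # positive instance predicted negative
        elif p == positive:
            fp += 1  # negative instance predicted positive
        else:
            tn += 1  # negative instance predicted negative
    return tp, fn, fp, tn
```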
A confusion matrix can be summarized using various formulas. This table shows the most commonly used ones:
| Measure | Formula | Intuitive Meaning |
| --- | --- | --- |
| Precision | TP / (TP + FP) | The percentage of positive predictions that are correct. |
| Recall / Sensitivity | TP / (TP + FN) | The percentage of positive-labeled instances that were predicted as positive. |
| Specificity | TN / (TN + FP) | The percentage of negative-labeled instances that were predicted as negative. |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) | The percentage of all predictions that are correct. |
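These formulas translate directly into code. Below is a minimal sketch, assuming the counts come from the hypothetical `confusion_matrix` helper above; it returns NaN when a denominator is zero, since (as the cases below show) a degenerate classifier can empty a denominator entirely.

```python
# Compute the four summary measures from confusion-matrix counts.
# Returns NaN for any measure whose denominator is zero (undefined).
def measures(tp, fn, fp, tn):
    def ratio(num, den):
        return num / den if den else float("nan")
    return {
        "precision":   ratio(tp, tp + fp),
        "recall":      ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy":    ratio(tp + tn, tp + fn + fp + tn),
    }
```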
Discussion of Measures
Different problem domains call for different measures for summarizing prediction quality.
1.) For example, in a data set of 10,000 samples, where 100 of these samples are labeled positive, a predictor that predicts "Negative" for every instance it is presented with evaluates to Accuracy = 99% and Specificity = 100%; its Precision is 0/0, which is undefined but often charitably reported as 100%, since none of its (nonexistent) positive predictions are wrong. This predictor would be entirely useless, and yet these measures show it performs very well. The same predictor would evaluate to Recall = 0%. In this case, Recall seems to be most in tune with how well the classifier is actually performing.
2.) If a classifier predicts positive on all instances in the data set from case 1.), then Precision = 1%, Recall = 100%, Accuracy = 1%, and Specificity = 0%. In this case, Precision, Accuracy, and Specificity show that this classifier is problematic.
3.) The other extreme is a data set where most of the examples are positive. For example, if 9,900 out of 10,000 instances are positive and a classifier predicts positive on all instances, then Precision = 99%, Accuracy = 99%, Specificity = 0%, and Recall = 100%. In this case, Specificity shows that this classifier is problematic.
4.) If a classifier predicts negative on all instances in the data set from case 3.), then Precision is again undefined (0/0, often reported as 100%), Recall = 0%, Specificity = 100%, and Accuracy = 1%. Here, Recall and Accuracy inform us that there is a problem with this classifier system.
Out of these four cases, the sets of
measures that inform us of problems with the classifier are {Recall} (Case 1),
{Precision, Accuracy, Specificity} (Case 2), {Specificity} (Case 3), and
{Recall, Accuracy} (Case 4). The smallest set of measures that covers all of
these cases is {Recall, Specificity}.
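As a quick sanity check, the sketch below reruns the four boundary cases through the hypothetical `measures()` helper defined earlier; each case is given as (TP, FN, FP, TN) counts.

```python
# The four boundary cases, as (TP, FN, FP, TN) counts.
cases = {
    "1: 100/10,000 positive, predict all negative":   (0, 100, 0, 9900),
    "2: 100/10,000 positive, predict all positive":   (100, 0, 9900, 0),
    "3: 9,900/10,000 positive, predict all positive": (9900, 0, 100, 0),
    "4: 9,900/10,000 positive, predict all negative": (0, 9900, 0, 100),
}
for name, counts in cases.items():
    print(name, measures(*counts))
```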
This seems to suggest that, without any knowledge of the distribution of the data, the best measures for finding problems with a classifier are Recall (Sensitivity) and Specificity. However, many cases other than these four boundary cases can arise. Consider the following confusion matrix for a data set with 600 positive instances out of 11,100:
| | Predicted Positive | Predicted Negative |
| --- | --- | --- |
| Known Positive | 500 | 100 |
| Known Negative | 500 | 10,000 |
In this case, Precision = 50%, Recall = 83%, Specificity = 95%, and Accuracy = 95%. Precision is low, which means the classifier predicts positives poorly, yet the other three measures suggest that this is a good classifier. This goes to show that the problem domain has a major impact on which measures should be used to evaluate a classifier, and that looking at the four simple cases presented above is not sufficient.
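Running this matrix through the same hypothetical helper confirms the numbers (values are approximate):

```python
# The 600-positive / 11,100-instance example from the table above.
print(measures(tp=500, fn=100, fp=500, tn=10000))
# -> precision 0.50, recall ~0.83, specificity ~0.95, accuracy ~0.95
```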