Performance Measures (Metrics)

Last modified: 2019-12-17


ROC curve

A receiver operating characteristic curve (ROC curve) is a graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied.

Generating an ROC curve

To produce an ROC curve, the sensitivities and specificities for different values of a continuous test measure are first tabulated. This results, essentially, in a list of various test values and the corresponding sensitivity and specificity of the test at that value. Then, the graphical ROC curve is produced by plotting sensitivity (TPR: true positive rate; recall) on the y-axis against 1-specificity (FPR: false positive rate) on the x-axis for the various values tabulated.
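The tabulation described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `roc_points` and the toy labels and scores are assumptions made for the example, and a higher score is assumed to mean "more likely positive".

```python
def roc_points(y_true, scores):
    """Return one (FPR, TPR) pair per candidate threshold.

    y_true: 1 for an actual positive, 0 for an actual negative.
    scores: continuous test values (assumed: higher = more likely positive).
    """
    positives = sum(y_true)
    negatives = len(y_true) - positives
    # Start above every score so that the first point is (0, 0).
    thresholds = [float("inf")] + sorted(set(scores), reverse=True)
    points = []
    for t in thresholds:
        # Predict "yes" whenever the score reaches the threshold.
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        tpr = tp / positives   # sensitivity, plotted on the y-axis
        fpr = fp / negatives   # 1 - specificity, plotted on the x-axis
        points.append((fpr, tpr))
    return points


# Hypothetical toy data: three positives and three negatives.
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
curve = roc_points(y_true, scores)
for fpr, tpr in curve:
    print(f"FPR={fpr:.2f}  TPR={tpr:.2f}")
```

Plotting the pairs in `curve` with FPR on the x-axis and TPR on the y-axis gives the ROC curve for this toy test.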

Using an ROC curve to understand the diagnostic value of a test

An ROC curve that follows the diagonal line y=x produces false positive results at the same rate as true positive results. Therefore, we expect a diagnostic test with reasonable accuracy to have an ROC curve in the upper left triangle above the y=x line (‘reference line’), as shown in this figure:

[Figure: ROC curve]

Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems. The AUC represents a model’s ability to discriminate between positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model as good as random. ROC can be broken down into sensitivity and specificity. A binary classification problem is really a trade-off between sensitivity and specificity.
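One way to make the "ability to discriminate" concrete is the pairwise reading of AUC: the fraction of (positive, negative) pairs in which the positive case receives the higher score, with ties counted as half. The sketch below assumes that reading; the function name `auc_by_pairs` and the toy data are illustrative assumptions.

```python
def auc_by_pairs(y_true, scores):
    """AUC as the share of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # a tie counts as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: one negative (0.7) outranks one positive (0.6).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = auc_by_pairs(y_true, scores)            # 8 of 9 pairs are ordered correctly
perfect = auc_by_pairs([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
print(round(auc, 3), perfect)
```

A model that ranks every positive above every negative scores 1.0, and a model that scores at random is expected to land near 0.5.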

We can judge how good a classifier is by looking at the AUC together with other measures derived from the confusion matrix. A confusion matrix is a table that is often used to describe the performance of a classification model on a set of test data for which the true values are known. All of the measures except AUC can be calculated from its four cells: true positives, true negatives, false positives, and false negatives.

Accuracy, Precision, Recall, F1 Score, …

True positives and true negatives are the observations that are correctly predicted; we want to maximize them (conversely, we want to minimize false positives and false negatives).

[Figure: Confusion matrix 1]

True Positives (TP) are the correctly predicted positive values; e.g., if the actual class value indicates that this patient survived, the predicted class tells you the same thing.

True Negatives (TN) are the correctly predicted negative values; e.g., if the actual class says this patient did not survive, the predicted class tells you the same thing.

False positive and false negative values occur when your actual class contradicts the predicted class.

False Positives (FP) occur when the actual class is no and the predicted class is yes; e.g., if the actual class says this patient did not survive, but the predicted class tells you that this patient will survive.

False Negatives (FN) occur when the actual class is yes but the predicted class is no; e.g., if the actual class value indicates that this patient survived but the predicted class tells you that the patient will die.

Once you understand these four parameters you can calculate accuracy, precision, recall and F1 score.

[Figure: Confusion matrix 3]

What can we learn from this matrix?

  • There are two possible predicted classes: “yes” and “no”. If we were predicting the presence of a disease, for example, “yes” would mean they have the disease, and “no” would mean they don’t have the disease.
  • The classifier made a total of 165 predictions (e.g., 165 patients were being tested for the presence of that disease).
  • Out of those 165 cases, the classifier predicted “yes” 110 times, and “no” 55 times.
  • In reality, 105 patients in the sample have the disease, and 60 patients do not.

  • true positives (TP): These are cases in which we predicted yes (they have the disease), and they do have the disease.
  • true negatives (TN): We predicted no, and they don’t have the disease.
  • false positives (FP): We predicted yes, but they don’t actually have the disease. (Also known as a “Type I error.”)
  • false negatives (FN): We predicted no, but they actually do have the disease. (Also known as a “Type II error.”)
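The four cells can be counted directly from paired lists of actual and predicted labels. The sketch below rebuilds the 165-patient example from its stated counts (TP = 100, FP = 10, FN = 5, TN = 50); the function name `confusion_counts` and the way the label lists are laid out are assumptions for illustration.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count TP, TN, FP and FN for a binary yes/no problem."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for a, p in zip(actual, predicted):
        if p == positive:
            # Predicted yes: correct only if the patient really has the disease.
            counts["TP" if a == positive else "FP"] += 1
        else:
            # Predicted no: correct only if the patient really is healthy.
            counts["FN" if a == positive else "TN"] += 1
    return counts


# Label lists arranged to reproduce the example's counts.
actual = ["yes"] * 100 + ["no"] * 10 + ["yes"] * 5 + ["no"] * 50
predicted = ["yes"] * 100 + ["yes"] * 10 + ["no"] * 5 + ["no"] * 50
counts = confusion_counts(actual, predicted)
print(counts)
print("predicted yes:", counts["TP"] + counts["FP"])   # 110
print("actual yes:", counts["TP"] + counts["FN"])      # 105
```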

Accuracy is the most intuitive performance measure (overall, how often is the classifier correct?). It is simply the ratio of correctly predicted observations to total observations. One may think that a model with high accuracy must be the best model. Accuracy is a good measure, but only for symmetric datasets in which the numbers of false positives and false negatives are roughly equal. Otherwise, you have to look at other parameters to evaluate the performance of your model.

Accuracy = (TP+TN)/(TP+FP+FN+TN)
Accuracy = (TP+TN)/total = (100+50)/165 = 0.91

Misclassification Rate: overall, how often is it wrong?

(FP+FN)/total = (10+5)/165 = 0.09
  • equivalent to 1 minus Accuracy
  • also known as “Error Rate”
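As a quick check of the arithmetic, and of the warning that accuracy can mislead on uneven data, here is a small sketch. The imbalanced case (10 sick patients among 1000, with a classifier that always answers "no") is a hypothetical example, not data from this page.

```python
def accuracy(tp, tn, fp, fn):
    """Share of all predictions that were correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def error_rate(tp, tn, fp, fn):
    """Share of all predictions that were wrong (1 - accuracy)."""
    return (fp + fn) / (tp + tn + fp + fn)


# The 165-patient example.
example_accuracy = accuracy(tp=100, tn=50, fp=10, fn=5)
example_error = error_rate(tp=100, tn=50, fp=10, fn=5)
print(round(example_accuracy, 2), round(example_error, 2))

# Hypothetical imbalanced sample: the classifier never predicts "yes".
always_no = accuracy(tp=0, tn=990, fp=0, fn=10)
print(always_no)  # high accuracy, yet not a single sick patient is found
```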

Precision (when it predicts yes, how often is it correct?) is the ratio of correctly predicted positive observations to the total predicted positive observations. The question this metric answers is: of all the passengers labeled as survived, how many actually survived? High precision means few false positives among the predicted positives. A precision score of 1.0 for entity type A means that every mention that was labeled as entity type A does indeed belong to that classification.

Precision = TP/(TP+FP)
Precision = TP/predicted yes = 100/110 = 0.91

Recall (sensitivity, true positive rate; when it’s actually yes, how often does it predict yes?) is the ratio of correctly predicted positive observations to all observations in the actual class “yes”. The question recall answers is: of all the passengers that truly survived, how many did we label as survived? A recall score of 1.0 means that every mention that should have been labeled as entity type A was labeled correctly.

Recall = TP/(TP+FN)
Recall = TP/actual yes = 100/105 = 0.95
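Precision and recall differ only in their denominators, which a short sketch makes visible. Returning 0.0 when a denominator is zero is a convention assumed here for illustration; other conventions exist.

```python
def precision(tp, fp):
    """Of everything predicted yes, how much was actually yes?"""
    predicted_yes = tp + fp
    return tp / predicted_yes if predicted_yes else 0.0  # assumed convention


def recall(tp, fn):
    """Of everything actually yes, how much was predicted yes?"""
    actual_yes = tp + fn
    return tp / actual_yes if actual_yes else 0.0  # assumed convention


# The 165-patient example: TP = 100, FP = 10, FN = 5.
p = precision(tp=100, fp=10)
r = recall(tp=100, fn=5)
print(round(p, 2), round(r, 2))
```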

False Positive Rate: when it’s actually no, how often does it predict yes?

FP/actual no = 10/60 = 0.17

Specificity: when it’s actually no, how often does it predict no?

TN/actual no = 50/60 = 0.83
equivalent to 1 minus False Positive Rate

Prevalence: how often does the yes condition actually occur in our sample?

actual yes/total = 105/165 = 0.64
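These rates tie the confusion matrix back to the ROC curve: at a single threshold, the pair (false positive rate, true positive rate) is one point on that curve. A minimal sketch using the example's counts follows; the variable names are illustrative.

```python
TP, TN, FP, FN = 100, 50, 10, 5
actual_yes = TP + FN            # 105 patients have the disease
actual_no = TN + FP             # 60 patients do not
total = actual_yes + actual_no  # 165 patients in all

true_positive_rate = TP / actual_yes    # recall / sensitivity
false_positive_rate = FP / actual_no
specificity = TN / actual_no            # equals 1 - false positive rate
prevalence = actual_yes / total

# One threshold gives one point on the ROC curve: (x, y) = (FPR, TPR).
roc_point = (false_positive_rate, true_positive_rate)
print(round(false_positive_rate, 2), round(specificity, 2), round(prevalence, 2))
```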

$\small F_1$ score.  The $\small F_1$ score is the harmonic mean of precision and recall; therefore, this score takes both false positives and false negatives into account. Intuitively it is not as easy to understand as accuracy, but $\small F_1$ is usually more useful than accuracy, especially if you have an uneven class distribution. Accuracy works best if false positives and false negatives have similar cost. If the costs of false positives and false negatives are very different, it’s better to look at both precision and recall. An $\small F_1$ score reaches its best value at 1, and its worst value at 0.

    $\small F_1 = 2 \times \dfrac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$
    $\small F_1 = 2 \times \dfrac{0.95 \times 0.91}{0.95 + 0.91} \approx 0.93$
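The formula can be checked against the raw counts. Computing from unrounded precision and recall gives 200/215, which is about 0.93. The sketch also includes a hypothetical lopsided case to show how the score drops when either input is low; the function name and the zero-denominator convention are assumptions.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0  # assumed convention for the undefined case
    return 2 * precision * recall / (precision + recall)


TP, FP, FN = 100, 10, 5
precision = TP / (TP + FP)
recall = TP / (TP + FN)

f1 = f1_score(precision, recall)
f1_from_counts = 2 * TP / (2 * TP + FP + FN)   # algebraically the same value
print(round(f1, 4), round(f1_from_counts, 4))

# Hypothetical lopsided classifier: perfect precision, very poor recall.
lopsided = f1_score(1.0, 0.1)
arithmetic_mean = (1.0 + 0.1) / 2
print(round(lopsided, 2), arithmetic_mean)     # F1 is far below the plain average
```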

Jakub Czakon, a senior data scientist, has a blog post that nicely summarizes which evaluation metric to choose for binary classification problems: 24 Evaluation Metrics for Binary Classification (And When to Use Them). He states:

“You will learn about a bunch of common and lesser-known evaluation metrics and charts to understand how to choose the model performance metric for your problem. Specifically, for each metric, I will talk about:

  • What is the definition and intuition behind it,
  • The non-technical explanation that you can communicate to business stakeholders,
  • How to calculate or plot it,
  • When should you use it.”

Performance Metrics Relevant to Machine Learning, Natural Language Processing

  • Exact Match (EM).  Classification and other tasks are sometimes assessed through partial or complete class label matching; the latter is often referred to as exact matching.

    SQuAD uses two different metrics to evaluate how well a system does on the benchmark. The Exact Match (EM) metric measures the percentage of predictions that exactly match any one of the ground truth answers. The $\small F_1$ score metric more loosely measures the average overlap between the prediction and ground truth answer.
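A simplified sketch of the two metrics for a single question follows. It is not the official SQuAD evaluation script: the normalization here (lowercasing and whitespace splitting) is an assumption and is cruder than what that script does, and the function names and example answers are hypothetical.

```python
from collections import Counter


def normalize(text):
    # Assumed, simplified normalization: lowercase and split on whitespace.
    return text.lower().split()


def exact_match(prediction, truths):
    """1.0 if the prediction equals any ground-truth answer, else 0.0."""
    return float(any(normalize(prediction) == normalize(t) for t in truths))


def token_f1(prediction, truth):
    """F1 over the tokens shared by the prediction and one ground truth."""
    pred, gold = normalize(prediction), normalize(truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction, truths):
    """Score against every ground truth and keep the best overlap."""
    return max(token_f1(prediction, t) for t in truths)


# Hypothetical question with two acceptable answers.
truths = ["Denver Broncos", "The Denver Broncos"]
print(exact_match("denver broncos", truths))   # matches the first answer
print(exact_match("the broncos", truths))      # no exact match
print(round(best_f1("the broncos", truths), 2))
```

Averaging these per-question scores over a dataset gives the EM and $\small F_1$ percentages described above.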

  • $\small F_1$ score

  • Mean Reciprocal Rank (MRR) is a statistical measure for evaluating any process that produces a list of possible responses to a sample of queries, ordered by probability of correctness. The reciprocal rank of a query response is the multiplicative inverse of the rank of the first correct answer: 1 for first place, 1⁄2 for second place, 1⁄3 for third place and so on. The mean reciprocal rank is the average of the reciprocal ranks of results for a sample of queries Q … For example, suppose we have the following three sample queries for a system that tries to translate English words to their plurals. In each case, the system makes three guesses, with the first one being the one it thinks is most likely correct:

    Query    Proposed results         Correct response    Rank    Reciprocal rank
    cat      catten, cati, cats       cats                3       1/3
    tori     torii, tori, toruses     tori                2       1/2
    virus    viruses, virii, viri     viruses             1       1

    Given those three samples, we could calculate the mean reciprocal rank as (1/3 + 1/2 + 1)/3 = 11/18 or about 0.61. If none of the proposed results are correct, reciprocal rank is 0. Note that only the rank of the first relevant answer is considered; any further relevant answers are ignored.
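The plural-translation example can be reproduced exactly with Python's `fractions` module. The function names below are illustrative assumptions; the queries and guesses are the three from the example.

```python
from fractions import Fraction


def reciprocal_rank(proposed, correct):
    """1/rank of the first correct guess, or 0 if no guess is correct."""
    for rank, guess in enumerate(proposed, start=1):
        if guess == correct:
            return Fraction(1, rank)
    return Fraction(0)


def mean_reciprocal_rank(queries):
    """Average the reciprocal ranks over (proposed, correct) pairs."""
    return sum(reciprocal_rank(p, c) for p, c in queries) / len(queries)


queries = [
    (["catten", "cati", "cats"], "cats"),        # rank 3
    (["torii", "tori", "toruses"], "tori"),      # rank 2
    (["viruses", "virii", "viri"], "viruses"),   # rank 1
]
mrr = mean_reciprocal_rank(queries)
print(mrr, round(float(mrr), 2))
```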

    Filtered MRR. These metrics are indicative but can be flawed when some corrupted triplets end up being valid ones, from the training set for instance. In this case, those may be ranked above the test triplet, but this should not be counted as an error because both triplets are true. To avoid such misleading behavior, we remove from the list of corrupted triplets all the triplets that appear either in the training, validation or test set (except the test triplet of interest). This ensures that all corrupted triplets do not belong to the data set. We report mean ranks and hits@10 according to both settings: the original (possibly flawed) one is termed “raw”, while the newer one is referred to as “filtered.” [Source: Translating Embeddings for Modeling Multi-relational Data]

    Restated in various papers:

    • “The filtered metrics are computed after removing all the other positive observed triples that appear in either training, validation or test set from the ranking, whereas the raw metrics do not remove these.”

    • “For the “filtered” setting protocol described in Bordes et al. (2013), we removed any corrupted triples that appear in the knowledge base, to avoid cases where a correct corrupted triple might be ranked higher than the test triple. The “filtered” setting thus provides a clearer view on the ranking performance.”

    With MRR and fMRR, higher values are generally better.
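The difference between the raw and filtered settings can be sketched as a ranking function with an optional filter. Everything here is hypothetical: the triples, the scores, the function name, and the assumption that a higher score means a more plausible triple (some models rank by a distance, where lower is better).

```python
def rank_of_test_triple(scored_candidates, test_triple, known_triples=None):
    """Rank of the test triple among scored candidates (1 = best).

    scored_candidates: list of (triple, score); higher score = more plausible.
    known_triples: if given, other known-true triples are removed first
                   (the "filtered" setting); if None, nothing is removed ("raw").
    """
    if known_triples is not None:
        scored_candidates = [
            (t, s) for t, s in scored_candidates
            if t == test_triple or t not in known_triples
        ]
    ordered = sorted(scored_candidates, key=lambda ts: ts[1], reverse=True)
    return 1 + [t for t, _ in ordered].index(test_triple)


# Hypothetical tail-corruption example.
test = ("france", "has_city", "lyon")
candidates = [
    (("france", "has_city", "paris"), 0.9),   # also true, seen in training
    (test, 0.8),
    (("france", "has_city", "berlin"), 0.3),  # a genuinely false corruption
]
known = {("france", "has_city", "paris"), test}

raw_rank = rank_of_test_triple(candidates, test)
filtered_rank = rank_of_test_triple(candidates, test, known_triples=known)
print(raw_rank, filtered_rank)   # reciprocal ranks: 1/2 (raw) and 1 (filtered)
```

In the raw setting the model is penalized for ranking another true triple first; filtering removes that triple before ranking.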

  • Perplexity: In information theory, perplexity is a measurement of how well a probability distribution or probability model predicts a sample. It may be used to compare probability models. A low perplexity indicates the probability distribution is good at predicting the sample. In NLP, perplexity refers to the prediction capability of the language model: if the model is less perplexed, then it is a good model.

    For example, suppose your task is to read the first few letters of a word a user is typing on a smartphone keyboard, and to offer a list of possible completion words. Perplexity, $\small P$, for this task is approximately the number of guesses you need to offer in order for your list to contain the actual word the user is trying to type. Perplexity is related to cross-entropy (measured in bits) as follows:

    $\ \ \ \ \small P = 2^{\text{(cross entropy)}}$
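A small numeric sketch may help. It takes cross-entropy to be the average negative base-2 log of the probability the model assigned to each item that actually occurred, so that perplexity is 2 raised to that value. The function names and probabilities are assumptions for illustration.

```python
import math


def cross_entropy_bits(probabilities):
    """Average negative log2 probability assigned to the observed items."""
    return -sum(math.log2(p) for p in probabilities) / len(probabilities)


def perplexity(probabilities):
    """2 raised to the cross-entropy measured in bits."""
    return 2 ** cross_entropy_bits(probabilities)


# A model that spreads its belief evenly over 8 candidate words
# assigns probability 1/8 to whichever word the user really typed.
uniform_model = [1 / 8] * 5
print(perplexity(uniform_model))     # as uncertain as an 8-way guess

# A more confident (hypothetical) model assigns 1/2 each time.
confident_model = [1 / 2] * 5
print(perplexity(confident_model))   # as uncertain as a 2-way guess
```

This matches the keyboard intuition: a perplexity of 8 behaves like having to offer about eight guesses, and lower is better.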

  • Origin of the term ‘perplexity’: Jelinek F, Mercer RL, Bahl LR & Baker JK. (1977) “Perplexity - A Measure of the Difficulty of Speech Recognition Tasks.” The Journal of the Acoustical Society of America, 62(S1), S63-S63. doi:10.1121/1.2016299.

    • discussion: “Perplexity is ubiquitous in the evaluation of language models. But where did it come from? Jelinek et al. (1977) invented the measure and give a justification for its use.”


  • Additional discussion (reddit) here.

  • BLEU (BiLingual Evaluation Understudy) is an algorithm for evaluating the quality of text which has been machine-translated from one natural language to another. Quality is considered to be the correspondence between a machine’s output and that of a human: the central idea behind BLEU is “the closer a machine translation is to a professional human translation, the better it is.” BLEU was one of the first metrics to claim a high correlation with human judgements of quality, and remains one of the most popular automated and inexpensive metrics. Scores are calculated for individual translated segments – generally sentences – by comparing them with a set of good quality reference translations. Those scores are then averaged over the whole corpus to reach an estimate of the translation’s overall quality. Intelligibility or grammatical correctness are not taken into account. BLEU’s output is always a number between 0 and 1; this value indicates how similar the candidate text is to the reference texts, with values closer to 1 representing more similar texts. Few human translations will attain a score of 1, since this would indicate that the candidate is identical to one of the reference translations. For this reason, it is not necessary to attain a score of 1.
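The text above gives no formula for BLEU, so the following is only a sketch of the commonly described formulation for a single segment: clipped ("modified") n-gram precisions for n = 1 to 4, combined by a geometric mean and multiplied by a brevity penalty. It uses no smoothing, assumes pre-tokenized input, and is not a substitute for an established implementation; the function names are assumptions.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def modified_precision(candidate, references, n):
    """Share of candidate n-grams found in a reference, with clipped counts."""
    cand_counts = ngrams(candidate, n)
    if not cand_counts:
        return 0.0
    max_ref = Counter()
    for ref in references:
        for gram, count in ngrams(ref, n).items():
            max_ref[gram] = max(max_ref[gram], count)
    # A candidate n-gram is credited at most as often as it appears in a reference.
    clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
    return clipped / sum(cand_counts.values())


def bleu(candidate, references, max_n=4):
    """Unsmoothed single-segment BLEU sketch."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0:
        return 0.0
    log_mean = sum(math.log(p) for p in precisions) / max_n
    c = len(candidate)
    # Reference length closest to the candidate length.
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_mean)


reference = "the cat is on the mat".split()
print(bleu(reference, [reference]))                        # identical text
print(modified_precision(["the"] * 7, [reference], n=1))   # clipping limits "the" to 2 of 7
```

The second call shows why counts are clipped: a candidate that just repeats a common word cannot earn credit for it more often than a reference uses it.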

  • The Meteor automatic evaluation metric (Automatic Machine Translation Evaluation System) scores machine translation hypotheses by aligning them to one or more reference translations. Alignments are based on exact, stem, synonym, and paraphrase matches between words and phrases. Segment and system level metric scores are calculated based on the alignments between hypothesis-reference pairs. [ … snip … ]

    Meteor is implemented in pure Java and requires no installation or dependencies to score MT [machine translation] output. On average, hypotheses are scored at a rate of 500 segments per second per CPU core. Meteor consistently demonstrates high correlation with human judgments in independent evaluations such as EMNLP WMT 2011 and NIST Metrics MATR 2010.

    Meteor X-ray uses XeTeX and Gnuplot to create visualizations of alignment matrices and score distributions from the output of Meteor. These visualizations allow easy comparison of MT systems or system configurations and facilitate in-depth performance analysis by examination of underlying Meteor alignments. Final output is in PDF form with intermediate TeX and Gnuplot files preserved for inclusion in reports or presentations. The Examples section includes sample alignment matrices and score distributions from Meteor X-ray.