This page shows the performance measures available for the different types of learning problems as well as general performance measures in alphabetical order. (See also the documentation about measures() and makeMeasure() for available measures and their properties.)
If you find that a measure is missing, you can either open an issue or try to implement a measure yourself.
Column Minim. indicates if the measure is minimized during, e.g., tuning or feature selection. Best and Worst show the best and worst values the performance measure can attain. For classification, column Multi indicates if a measure is suitable for multi-class problems. If not, the measure can only be used for binary classification problems.
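The next six columns refer to information required to calculate the performance measure.

- Pred.: The Prediction() object.
- Truth: The true values of the response variable(s) (for supervised learning).
- Probs: The predicted probabilities (might be needed for classification).
- Model: The WrappedModel (makeWrappedModel()) (e.g., for calculating the training time).
- Task: The Task() (relevant for cost-sensitive classification).
- Feats: The predicted data (relevant for clustering).

Aggr. shows the default aggregation method (aggregations()) tied to the measure.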
ID / Name | Minim. | Best | Worst | Multi | Pred. | Truth | Probs | Model | Task | Feats | Aggr. | Note |
---|---|---|---|---|---|---|---|---|---|---|---|---|
acc Accuracy |
1 | 0 | X | X | X | test.mean | Defined as: mean(response == truth) | |||||
auc Area under the curve |
1 | 0 | X | X | X | test.mean | Integral over the graph that results from computing fpr and tpr for many different thresholds. | |||||
bac Balanced accuracy |
1 | 0 | X | X | X | test.mean | For binary tasks, mean of true positive rate and true negative rate. | |||||
ber Balanced error rate |
X | 0 | 1 | X | X | X | test.mean | Mean of misclassification error rates on all individual classes. | ||||
brier Brier score |
X | 0 | 1 | X | X | X | test.mean | The Brier score is defined as the mean quadratic difference between the predicted probability and the value (1, 0) for the class, i.e., the target classes are represented numerically as 1 and 0. It is similar to the mean squared error in regression. multiclass.brier is the sum over all one vs. all comparisons, which for binary classification equals 2 * brier. | ||||
brier.scaled Brier scaled |
1 | 0 | X | X | X | test.mean | Brier score scaled to [0,1], see http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3575184/. | |||||
f1 F1 measure |
1 | 0 | X | X | test.mean | Defined as: 2 * tp/ (sum(truth == positive) + sum(response == positive)) | ||||||
fdr False discovery rate |
X | 0 | 1 | X | X | test.mean | Defined as: fp / (tp + fp). | |||||
fn False negatives |
X | 0 | Inf | X | X | test.mean | Number of positive-class observations misclassified as negative. Also called misses. | |||||
fnr False negative rate |
X | 0 | 1 | X | X | test.mean | Proportion of positive-class observations misclassified as negative. | |||||
fp False positives |
X | 0 | Inf | X | X | test.mean | Number of negative-class observations misclassified as positive. Also called false alarms. | |||||
fpr False positive rate |
X | 0 | 1 | X | X | test.mean | Proportion of negative-class observations misclassified as positive. Also called false alarm rate or fall-out. | |||||
gmean G-mean |
1 | 0 | X | X | test.mean | Geometric mean of recall and specificity. | ||||||
gpr Geometric mean of precision and recall. |
1 | 0 | X | X | test.mean | Defined as: sqrt(ppv * tpr) | ||||||
kappa Cohen’s kappa |
1 | -1 | X | X | X | test.mean | Defined as: 1 - (1 - p0) / (1 - pe), with p0 = ‘observed frequency of agreement’ and pe = ‘expected agreement frequency under independence’. | |||||
logloss Logarithmic loss |
X | 0 | Inf | X | X | X | test.mean | Defined as: -mean(log(p_i)), where p_i is the predicted probability of the true class of observation i. Inspired by https://www.kaggle.com/wiki/MultiClassLogLoss. | ||||
lsr Logarithmic Scoring Rule |
0 | -Inf | X | X | X | test.mean | Defined as: mean(log(p_i)), where p_i is the predicted probability of the true class of observation i. This scoring rule is the same as the negative logloss, self-information or surprisal. See: Bickel, J. E. (2007). Some comparisons among quadratic, spherical, and logarithmic scoring rules. Decision Analysis, 4(2), 49-65. | |||||
mcc Matthews correlation coefficient |
1 | -1 | X | X | test.mean | Defined as (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)), denominator set to 1 if 0 | ||||||
mmce Mean misclassification error |
X | 0 | 1 | X | X | X | test.mean | Defined as: mean(response != truth) | ||||
multiclass.au1p Weighted average 1 vs. 1 multiclass AUC |
1 | 0.5 | X | X | X | X | test.mean | Computes AUC of c(c - 1) binary classifiers while considering the a priori distribution of the classes. See Ferri et al.: https://www.math.ucdavis.edu/~saito/data/roc/ferri-class-perf-metrics.pdf. | ||||
multiclass.au1u Average 1 vs. 1 multiclass AUC |
1 | 0.5 | X | X | X | X | test.mean | Computes AUC of c(c - 1) binary classifiers (all possible pairwise combinations) while considering uniform distribution of the classes. See Ferri et al.: https://www.math.ucdavis.edu/~saito/data/roc/ferri-class-perf-metrics.pdf. | ||||
multiclass.aunp Weighted average 1 vs. rest multiclass AUC |
1 | 0.5 | X | X | X | X | test.mean | Computes the AUC treating a c-dimensional classifier as c two-dimensional classifiers, taking into account the prior probability of each class. See Ferri et al.: https://www.math.ucdavis.edu/~saito/data/roc/ferri-class-perf-metrics.pdf. | ||||
multiclass.aunu Average 1 vs. rest multiclass AUC |
1 | 0.5 | X | X | X | X | test.mean | Computes the AUC treating a c-dimensional classifier as c two-dimensional classifiers, where classes are assumed to have uniform distribution, in order to have a measure which is independent of class distribution change. See Ferri et al.: https://www.math.ucdavis.edu/~saito/data/roc/ferri-class-perf-metrics.pdf. | ||||
multiclass.brier Multiclass Brier score |
X | 0 | 2 | X | X | X | X | test.mean | Defined as: (1/n) sum_i sum_j (y_ij - p_ij)^2, where y_ij = 1 if observation i has class j (else 0), and p_ij is the predicted probability of observation i for class j. From http://docs.lib.noaa.gov/rescue/mwr/078/mwr-078-01-0001.pdf. | |||
npv Negative predictive value |
1 | 0 | X | X | test.mean | Defined as: tn / (tn + fn). | ||||||
ppv Positive predictive value |
1 | 0 | X | X | test.mean | Defined as: tp / (tp + fp). Also called precision. If the denominator is 0, PPV is set to be either 1 or 0 depending on whether the highest probability prediction is positive (1) or negative (0). | ||||||
qsr Quadratic Scoring Rule |
1 | -1 | X | X | X | test.mean | Defined as: 1 - (1/n) sum_i sum_j (y_ij - p_ij)^2, where y_ij = 1 if observation i has class j (else 0), and p_ij is the predicted probability of observation i for class j. This scoring rule is the same as 1 - multiclass.brier. See: Bickel, J. E. (2007). Some comparisons among quadratic, spherical, and logarithmic scoring rules. Decision Analysis, 4(2), 49-65. | |||||
ssr Spherical Scoring Rule |
1 | 0 | X | X | X | test.mean | Defined as: mean(p_i / sqrt(sum_j(p_ij^2))), where p_i is the predicted probability of the true class of observation i and p_ij is the predicted probability of observation i for class j. See: Bickel, J. E. (2007). Some comparisons among quadratic, spherical, and logarithmic scoring rules. Decision Analysis, 4(2), 49-65. | |||||
tn True negatives |
Inf | 0 | X | X | test.mean | Sum of correctly classified observations in the negative class. Also called correct rejections. | ||||||
tnr True negative rate |
1 | 0 | X | X | test.mean | Percentage of correctly classified observations in the negative class. Also called specificity. | ||||||
tp True positives |
Inf | 0 | X | X | test.mean | Sum of all correctly classified observations in the positive class. | ||||||
tpr True positive rate |
1 | 0 | X | X | test.mean | Percentage of correctly classified observations in the positive class. Also called hit rate or recall or sensitivity. | ||||||
wkappa Mean quadratic weighted kappa |
1 | -1 | X | X | X | test.mean | Defined as: 1 - sum(weights * conf.mat) / sum(weights * expected.mat), the weight matrix measures seriousness of disagreement with the squared euclidean metric. |
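As a rough illustration of how several definitions in the Note column of the classification table fit together, the following base R sketch computes them by hand on a small made-up binary problem. It does not call mlr; the data, the choice of "pos" as the positive class, and all variable names are assumptions made for this example only.

```r
# Toy binary problem with made-up data; "pos" is assumed to be the positive class.
lv       <- c("pos", "neg")
truth    <- factor(c("pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"), levels = lv)
response <- factor(c("pos", "pos", "neg", "neg", "neg", "pos", "neg", "pos"), levels = lv)
# Made-up predicted probabilities of the positive class.
prob.pos <- c(0.9, 0.8, 0.4, 0.2, 0.1, 0.6, 0.3, 0.7)

# Confusion-matrix counts.
tp <- sum(truth == "pos" & response == "pos")
fn <- sum(truth == "pos" & response == "neg")
fp <- sum(truth == "neg" & response == "pos")
tn <- sum(truth == "neg" & response == "neg")

# Measures that only need the predicted labels and the truth.
acc  <- mean(response == truth)
mmce <- mean(response != truth)
tpr  <- tp / (tp + fn)
tnr  <- tn / (tn + fp)
ppv  <- tp / (tp + fp)
bac  <- mean(c(tpr, tnr))
f1   <- 2 * tp / (sum(truth == "pos") + sum(response == "pos"))
gpr  <- sqrt(ppv * tpr)
mcc  <- (tp * tn - fp * fn) / sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))

# Cohen's kappa: observed agreement p0 versus agreement expected under independence pe.
p0 <- acc
pe <- sum(prop.table(table(truth)) * prop.table(table(response)))
kappa.val <- 1 - (1 - p0) / (1 - pe)

# Measures that need predicted probabilities.
p.true  <- ifelse(truth == "pos", prob.pos, 1 - prob.pos)  # probability of the true class
logloss <- -mean(log(p.true))
lsr     <- mean(log(p.true))
brier   <- mean((as.numeric(truth == "pos") - prob.pos)^2)
probs   <- cbind(pos = prob.pos, neg = 1 - prob.pos)
y       <- cbind(pos = truth == "pos", neg = truth == "neg") * 1
multiclass.brier <- mean(rowSums((y - probs)^2))
qsr     <- 1 - multiclass.brier

print(round(c(acc = acc, bac = bac, f1 = f1, mcc = mcc, kappa = kappa.val,
              logloss = logloss, brier = brier, qsr = qsr), 3))
```

On this toy data the binary relationships stated in the table can be checked directly: multiclass.brier equals 2 * brier, and lsr is the negative of logloss.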
ID / Name | Minim. | Best | Worst | Pred. | Truth | Probs | Model | Task | Feats | Aggr. | Note |
---|---|---|---|---|---|---|---|---|---|---|---|
expvar Explained variance |
1 | 0 | X | X | test.mean | Similar to measure rsq (R-squared). Defined as explained_sum_of_squares / total_sum_of_squares. | |||||
kendalltau Kendall’s tau |
1 | -1 | X | X | test.mean | Defined as: Kendall’s tau correlation between truth and response. Only looks at the order. See Rosset et al.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.1398&rep=rep1&type=pdf. | |||||
mae Mean of absolute errors |
X | 0 | Inf | X | X | test.mean | Defined as: mean(abs(response - truth)) | ||||
mape Mean absolute percentage error |
X | 0 | Inf | X | X | test.mean | Defined as: mean(abs((truth - response) / truth)). Won’t work if any truth value is equal to zero. In this case the output will be NA. | ||||
medae Median of absolute errors |
X | 0 | Inf | X | X | test.mean | Defined as: median(abs(response - truth)). | ||||
medse Median of squared errors |
X | 0 | Inf | X | X | test.mean | Defined as: median((response - truth)^2). | ||||
mse Mean of squared errors |
X | 0 | Inf | X | X | test.mean | Defined as: mean((response - truth)^2) | ||||
msle Mean squared logarithmic error |
X | 0 | Inf | X | X | test.mean | Defined as: mean((log(response + 1, exp(1)) - log(truth + 1, exp(1)))^2). This measure is mostly used for count data. Note that all predicted and actual target values must be greater than or equal to -1 to compute the measure. | ||||
rae Relative absolute error |
X | 0 | Inf | X | X | test.mean | Defined as sum_of_absolute_errors / mean_absolute_deviation. Undefined for single instances and when every truth value is identical. In this case the output will be NA. | ||||
rmse Root mean squared error |
X | 0 | Inf | X | X | test.rmse | The RMSE is aggregated as sqrt(mean(rmse.vals.on.test.sets^2)). If you don’t want that, you could also use test.mean. | ||||
rmsle Root mean squared logarithmic error |
X | 0 | Inf | X | X | test.mean | Defined as: sqrt(msle). Definition taken from: https://www.kaggle.com/wiki/RootMeanSquaredLogarithmicError. This measure is mostly used for count data. Note that all predicted and actual target values must be greater than or equal to -1 to compute the measure. | ||||
rrse Root relative squared error |
X | 0 | Inf | X | X | test.mean | Defined as: sqrt(sum_of_squared_errors / total_sum_of_squares). Undefined for single instances and when every truth value is identical. In this case the output will be NA. | ||||
rsq Coefficient of determination |
1 | -Inf | X | X | test.mean | Also called R-squared, which is 1 - residual_sum_of_squares / total_sum_of_squares. | |||||
sae Sum of absolute errors |
X | 0 | Inf | X | X | test.mean | Defined as: sum(abs(response - truth)) | ||||
spearmanrho Spearman’s rho |
1 | -1 | X | X | test.mean | Defined as: Spearman’s rho correlation between truth and response. Only looks at the order. See Rosset et al.: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.95.1398&rep=rep1&type=pdf. | |||||
sse Sum of squared errors |
X | 0 | Inf | X | X | test.mean | Defined as: sum((response - truth)^2) |
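The regression definitions above translate almost literally into base R. The sketch below uses made-up numbers and illustrative variable names, and it does not call mlr. The last lines also illustrate the RMSE aggregation described in the table, using two hypothetical per-test-set RMSE values.

```r
# Made-up true values and predictions.
truth    <- c(3, 5, 2, 7, 4)
response <- c(2.5, 5, 3, 8, 4.5)

err   <- response - truth
mae   <- mean(abs(err))
medae <- median(abs(err))
sae   <- sum(abs(err))
mse   <- mean(err^2)
medse <- median(err^2)
sse   <- sum(err^2)
rmse  <- sqrt(mse)

# Relative measures; these need truth values that are not all identical.
tss  <- sum((truth - mean(truth))^2)  # total sum of squares
rsq  <- 1 - sse / tss
rrse <- sqrt(sse / tss)

# Percentage and logarithmic errors; mape needs non-zero truth values.
mape  <- mean(abs((truth - response) / truth))
msle  <- mean((log(response + 1) - log(truth + 1))^2)
rmsle <- sqrt(msle)

# RMSE aggregation over test sets (hypothetical per-fold values):
# test.rmse pools the squared values, test.mean simply averages them.
rmse.vals.on.test.sets <- c(0.5, 1.5)
agg.test.rmse <- sqrt(mean(rmse.vals.on.test.sets^2))
agg.test.mean <- mean(rmse.vals.on.test.sets)

print(round(c(mae = mae, mse = mse, rmse = rmse, rsq = rsq, mape = mape,
              rmsle = rmsle, test.rmse = agg.test.rmse, test.mean = agg.test.mean), 3))
```

The two aggregated values differ whenever the per-test-set RMSE values are not all equal, which is why the choice between test.rmse and test.mean matters.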
ID / Name | Minim. | Best | Worst | Pred. | Truth | Probs | Model | Task | Feats | Aggr. | Note |
---|---|---|---|---|---|---|---|---|---|---|---|
cindex Harrell’s Concordance index |
1 | 0 | X | X | test.mean | Fraction of all pairs of subjects whose predicted survival times are correctly ordered among all subjects that can actually be ordered. In other words, it is the probability of concordance between the predicted and the observed survival. | |||||
cindex.uno Uno’s Concordance index |
1 | 0 | X | X | X | X | test.mean | Fraction of all pairs of subjects whose predicted survival times are correctly ordered among all subjects that can actually be ordered. In other words, it is the probability of concordance between the predicted and the observed survival. Corrected by weighting with IPCW as suggested by Uno. Implemented in survAUC::UnoC. | |||
iauc.uno Uno’s estimator of cumulative AUC for right censored time-to-event data |
1 | 0 | X | X | X | X | test.mean | To set an upper time limit, set argument max.time (defaults to max time in complete task). Implemented in survAUC::AUC.uno. | |||
ibrier Integrated Brier score using Kaplan-Meier estimator for weighting |
X | 0 | 1 | X | X | X | test.mean | Only works for methods for which probabilities are provided via pec::predictSurvProb. Currently these are only coxph and randomForestSRC. To set an upper time limit, set argument max.time (defaults to max time in test data). Implemented in pec::pec. |
ID / Name | Minim. | Best | Worst | Pred. | Truth | Probs | Model | Task | Feats | Aggr. | Note |
---|---|---|---|---|---|---|---|---|---|---|---|
db Davies-Bouldin cluster separation measure |
X | 0 | Inf | X | X | test.mean | Ratio of the within cluster scatter, to the between cluster separation, averaged over the clusters. See ?clusterSim::index.DB. | ||||
G1 Calinski-Harabasz pseudo F statistic |
Inf | 0 | X | X | test.mean | Defined as ratio of between-cluster variance to within cluster variance. See ?clusterSim::index.G1. | |||||
G2 Baker and Hubert adaptation of Goodman-Kruskal’s gamma statistic |
1 | 0 | X | X | test.mean | Defined as: (number of concordant comparisons - number of discordant comparisons) / (number of concordant comparisons + number of discordant comparisons). See ?clusterSim::index.G2. | |||||
silhouette Rousseeuw’s silhouette internal cluster quality index |
Inf | 0 | X | X | test.mean | Silhouette value of an observation is a measure of how similar an object is to its own cluster compared to other clusters. The measure is calculated as the average of all silhouette values. See ?clusterSim::index.S. |
ID / Name | Minim. | Best | Worst | Pred. | Truth | Probs | Model | Task | Feats | Aggr. | Note |
---|---|---|---|---|---|---|---|---|---|---|---|
mcp Misclassification penalty |
X | 0 | Inf | X | X | test.mean | Average difference between costs of oracle and model prediction. | ||||
meancosts Mean costs of the predicted choices |
X | 0 | Inf | X | X | test.mean | Defined as: mean(y), where y is the vector of costs for the predicted classes. |
Note that in case of ordinary misclassification costs you can also generate performance measures from cost matrices by function makeCostMeasure(). For details see the tutorial page on cost-sensitive classification and also the page on custom performance measures.
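To make the two cost-sensitive measures in the table above concrete, here is a small base R sketch of their definitions. The cost matrix, the predicted classes, and the variable names are invented for illustration, and the code does not use mlr or makeCostMeasure().

```r
# Hypothetical costs: one row per observation, one column per class.
# Each entry is the cost of predicting that class for that observation.
costs <- matrix(c(1, 2, 5,
                  3, 1, 2,
                  4, 1, 0,
                  0, 6, 2),
                ncol = 3, byrow = TRUE,
                dimnames = list(NULL, c("a", "b", "c")))

# Made-up predicted class for each observation.
response <- c("a", "c", "b", "b")

# y: cost incurred by the predicted class of each observation.
idx <- cbind(seq_len(nrow(costs)), match(response, colnames(costs)))
y   <- costs[idx]

# meancosts: mean cost of the predicted choices.
meancosts <- mean(y)

# mcp: average difference between the model's cost and the cost of an
# oracle that always picks the cheapest class.
oracle <- apply(costs, 1, min)
mcp    <- mean(y - oracle)

print(c(meancosts = meancosts, mcp = mcp))
```

Both values are 0 only if the model always picks a zero-cost class (meancosts) or always matches the cheapest class (mcp), which is consistent with the Best column.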
ID / Name | Minim. | Best | Worst | Pred. | Truth | Probs | Model | Task | Feats | Aggr. | Note |
---|---|---|---|---|---|---|---|---|---|---|---|
multilabel.acc Accuracy (multilabel) |
1 | 0 | X | X | test.mean | Averaged proportion of correctly predicted labels with respect to the total number of labels for each instance, following the definition by Charte and Charte: https://journal.r-project.org/archive/2015-2/charte-charte.pdf. Fractions where the denominator becomes 0 are replaced with 1 before computing the average across all instances. | |||||
multilabel.f1 F1 measure (multilabel) |
1 | 0 | X | X | test.mean | Harmonic mean of precision and recall on a per instance basis (Micro-F1), following the definition by Montanes et al.: http://www.sciencedirect.com/science/article/pii/S0031320313004019. Fractions where the denominator becomes 0 are replaced with 1 before computing the average across all instances. | |||||
multilabel.hamloss Hamming loss |
X | 0 | 1 | X | X | test.mean | Proportion of labels that are predicted incorrectly, following the definition by Charte and Charte: https://journal.r-project.org/archive/2015-2/charte-charte.pdf. | ||||
multilabel.ppv Positive predictive value (multilabel) |
1 | 0 | X | X | test.mean | Also called precision. Averaged ratio of correctly predicted labels for each instance, following the definition by Charte and Charte: https://journal.r-project.org/archive/2015-2/charte-charte.pdf. Fractions where the denominator becomes 0 are ignored in the average calculation. | |||||
multilabel.subset01 Subset-0-1 loss |
X | 0 | 1 | X | X | test.mean | Proportion of observations where the complete multilabel set (all 0-1-labels) is predicted incorrectly, following the definition by Charte and Charte: https://journal.r-project.org/archive/2015-2/charte-charte.pdf. | ||||
multilabel.tpr TPR (multilabel) |
1 | 0 | X | X | test.mean | Also called recall. Averaged proportion of predicted labels which are relevant for each instance, following the definition by Charte and Charte: https://journal.r-project.org/archive/2015-2/charte-charte.pdf. Fractions where the denominator becomes 0 are ignored in the average calculation. |
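The multilabel measures above differ mainly in how they treat instances with an empty label set. The following base R sketch, on made-up logical label matrices with illustrative variable names, follows the wording of the table: zero denominators are replaced with 1 for multilabel.acc and multilabel.f1, and ignored for multilabel.ppv and multilabel.tpr. It is a reading of those descriptions, not mlr's own code.

```r
# Made-up label matrices: one row per instance, one column per label.
truth <- matrix(c(TRUE,  FALSE, TRUE,
                  FALSE, FALSE, TRUE,
                  TRUE,  TRUE,  FALSE,
                  FALSE, FALSE, FALSE), ncol = 3, byrow = TRUE)
response <- matrix(c(TRUE,  FALSE, FALSE,
                     FALSE, FALSE, TRUE,
                     TRUE,  TRUE,  FALSE,
                     FALSE, FALSE, FALSE), ncol = 3, byrow = TRUE)

# Hamming loss: proportion of individual labels predicted incorrectly.
hamloss <- mean(truth != response)

# Subset 0-1 loss: proportion of instances whose full label set is wrong.
subset01 <- mean(!apply(truth == response, 1, all))

# Per-instance counts used by the remaining measures.
n.both  <- rowSums(truth & response)  # correctly predicted labels
n.union <- rowSums(truth | response)
n.true  <- rowSums(truth)
n.pred  <- rowSums(response)

# Accuracy and F1: a zero denominator is replaced with 1.
acc <- mean(ifelse(n.union == 0, 1, n.both / n.union))
f1  <- mean(ifelse(n.true + n.pred == 0, 1, 2 * n.both / (n.true + n.pred)))

# Precision and recall: instances with a zero denominator are ignored.
ppv <- mean((n.both / n.pred)[n.pred > 0])
tpr <- mean((n.both / n.true)[n.true > 0])

print(round(c(hamloss = hamloss, subset01 = subset01, acc = acc,
              f1 = f1, ppv = ppv, tpr = tpr), 3))
```

The last instance has no true and no predicted labels, so it counts as a perfect score for acc and f1 but is left out of the averages for ppv and tpr.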
ID / Name | Minim. | Best | Worst | Pred. | Truth | Probs | Model | Task | Feats | Aggr. | Note |
---|---|---|---|---|---|---|---|---|---|---|---|
featperc Percentage of original features used for model |
X | 0 | 1 | X | X | test.mean | Useful for feature selection. | ||||
timeboth timetrain + timepredict |
X | 0 | Inf | X | X | test.mean | |||||
timepredict Time of predicting test set |
X | 0 | Inf | X | test.mean | ||||||
timetrain Time of fitting the model |
X | 0 | Inf | X | test.mean |