Extended definition
Classification metrics are the family of measures used to evaluate supervised classification models. Four are central in the binary case, derived from the confusion matrix (TP, FP, TN, FN — true/false positives/negatives):
Accuracy is the overall proportion of correct classifications, (TP + TN) / (TP + FP + TN + FN), but it is misleading on imbalanced datasets. Precision, TP / (TP + FP), measures the reliability of predicted positives. Recall (also called sensitivity or TPR), TP / (TP + FN), measures coverage of actual positives. F1 is the harmonic mean of the two, 2 × Precision × Recall / (Precision + Recall), balancing the trade-off. AUC-ROC, the area under the curve of TPR against FPR, summarizes performance across all decision thresholds. Powers (2011) offered the canonical mathematical treatment, deriving informedness, markedness, and Matthews correlation as additional metrics; Sokolova and Lapalme (2009) systematized comparison among metrics in multi-class tasks. Multi-class generalization uses the macro-average (unweighted mean of per-class scores) or the weighted average (per-class scores weighted by class frequency, i.e. support).
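The four confusion-matrix metrics, plus the Matthews correlation coefficient, can be computed directly from the counts. The sketch below is illustrative only: the function name `binary_metrics` and the choice to return 0.0 when a denominator is zero are assumptions made here, not conventions taken from the sources cited above.

```python
from math import sqrt


def binary_metrics(tp, fp, tn, fn):
    """Compute binary classification metrics from confusion-matrix counts.

    Assumption: any metric whose denominator is zero is reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    pr_sum = precision + recall
    f1 = 2 * precision * recall / pr_sum if pr_sum else 0.0
    # Matthews correlation coefficient uses all four cells of the matrix.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "mcc": mcc,
    }


# Hypothetical counts: 50 actual positives among 1000 examples.
metrics = binary_metrics(tp=40, fp=10, tn=940, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these made-up counts, accuracy is 0.98 while precision, recall, and F1 are all 0.80, which illustrates how accuracy can look stronger than performance on the positive class.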
When it applies
Classification metrics apply to any supervised problem with a categorical outcome, from prototyping to final evaluation. Metric choice should reflect the relative cost of false positives vs. false negatives in the domain: in medical diagnosis of serious disease, high recall is critical (missing a case is worse than a false suspicion); in spam systems, high precision is critical (flagging a legitimate email as spam is worse than letting one spam message through). F1 is appropriate when both errors are similarly costly. AUC-ROC is appropriate for comparing models when the final decision threshold will be calibrated later.
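The precision/recall trade-off described above comes from where the decision threshold is placed on the model's scores. The following sketch uses invented scores and labels, and a hypothetical helper `precision_recall_at`, to show how moving the threshold shifts the balance; it reports precision as 0.0 when nothing is predicted positive, which is an assumption of this example.

```python
def precision_recall_at(scores, labels, threshold):
    """Return (precision, recall) when predicting positive for score >= threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(p == 1 and y == 1 for p, y in zip(preds, labels))
    fp = sum(p == 1 and y == 0 for p, y in zip(preds, labels))
    fn = sum(p == 0 and y == 1 for p, y in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Invented model scores and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

for threshold in (0.85, 0.50, 0.25):
    p, r = precision_recall_at(scores, labels, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
```

On this toy data a strict threshold (0.85) gives precision 1.00 and recall 0.50, which suits a spam-like setting, while a lenient threshold (0.25) gives recall 1.00 at precision 0.57, which suits a screening-like setting.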
When it does not apply
Accuracy in isolation is not suitable for imbalanced-class problems: when 95% of examples belong to the majority class, a trivial model that always predicts that class reaches 95% accuracy without learning anything. Under severe imbalance, balanced accuracy, F1, MCC, or PR-AUC are alternatives. AUC-ROC can also be misleading on heavily imbalanced datasets, where PR-AUC is often preferable. Classification metrics do not apply directly to regression (use error measures such as RMSE or MAE instead). In multi-label problems (each example can carry several labels at once), the metrics require specific extensions (Hamming loss, subset accuracy).
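The majority-class baseline can be checked numerically. This is a minimal sketch on synthetic labels (50 positives, 950 negatives); the helper `evaluate` is hypothetical, and reporting MCC as 0.0 when its denominator is zero is a convention assumed here.

```python
from math import sqrt


def evaluate(labels, preds):
    """Accuracy, recall, balanced accuracy and MCC for binary 0/1 labels."""
    tp = sum(y == 1 and p == 1 for y, p in zip(labels, preds))
    tn = sum(y == 0 and p == 0 for y, p in zip(labels, preds))
    fp = sum(y == 0 and p == 1 for y, p in zip(labels, preds))
    fn = sum(y == 1 and p == 0 for y, p in zip(labels, preds))
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / len(labels),
        "recall": recall,
        # Balanced accuracy averages recall over both classes.
        "balanced_accuracy": (recall + specificity) / 2,
        # Assumption: MCC is reported as 0.0 when undefined.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Synthetic data: 5% positives, and a model that always predicts the majority.
labels = [1] * 50 + [0] * 950
preds = [0] * len(labels)

report = evaluate(labels, preds)
# accuracy 0.95, recall 0.0, balanced accuracy 0.5, MCC 0.0
print(report)
```

Accuracy alone suggests a strong model, while recall, balanced accuracy, and MCC all show that the positive class is never detected.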
Applications by field
- Health and biomedical sciences: sensitivity (recall) and specificity are standard; AUC-ROC is reported in diagnostic studies.
- Fraud and security detection: precision and recall, with curves calibrated for different operating points.
- NLP: micro/macro F1 in text classification, NER, and multi-class tasks; benchmarks like GLUE report multiple metrics.
- Computer vision: mAP (mean Average Precision) in object detection and IoU in segmentation, both domain-specific metrics.
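The micro/macro F1 variants mentioned for NLP differ only in how per-class results are aggregated. The sketch below, for single-label multi-class data, uses hypothetical helper names and invented labels; per-class F1 is reported as 0.0 when a class has no true or predicted examples, which is an assumption of this example.

```python
from collections import Counter


def per_class_f1(labels, preds):
    """Return ({class: F1}, {class: support}) for single-label predictions."""
    classes = sorted(set(labels) | set(preds))
    support = Counter(labels)
    scores = {}
    for c in classes:
        tp = sum(y == c and p == c for y, p in zip(labels, preds))
        fp = sum(y != c and p == c for y, p in zip(labels, preds))
        fn = sum(y == c and p != c for y, p in zip(labels, preds))
        denom = 2 * tp + fp + fn
        scores[c] = 2 * tp / denom if denom else 0.0
    return scores, support


def macro_f1(scores):
    """Unweighted mean of per-class F1: every class counts equally."""
    return sum(scores.values()) / len(scores)


def weighted_f1(scores, support):
    """Mean of per-class F1 weighted by the number of true examples per class."""
    total = sum(support.values())
    return sum(scores[c] * support[c] for c in scores) / total


def micro_f1(labels, preds):
    """Pool counts over classes; for single-label data this equals accuracy."""
    correct = sum(y == p for y, p in zip(labels, preds))
    return correct / len(labels)


# Invented data: class "a" is frequent, class "b" is rare.
labels = ["a"] * 8 + ["b"] * 2
preds = ["a"] * 8 + ["a", "b"]

scores, support = per_class_f1(labels, preds)
print("per-class:", scores)
print("macro:", round(macro_f1(scores), 3))
print("weighted:", round(weighted_f1(scores, support), 3))
print("micro:", round(micro_f1(labels, preds), 3))
```

On this toy data the macro average (about 0.80) is pulled down by the weak rare class, while the weighted (about 0.89) and micro (0.90) averages are dominated by the frequent class.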
Common pitfalls
The first pitfall is reporting accuracy without mentioning class prevalence: under severe imbalance, high accuracy hides failure on the minority class. The second is optimizing for one metric and ignoring others: a model with a high aggregate F1 (for example, weighted-average) can still have low recall on the rare class, a critical problem in some domains. The third is confusing AUC-ROC with PR-AUC: the two diverge dramatically on imbalanced datasets, and PR-AUC is more informative when the positive class is rare. The fourth is reading the confusion matrix without normalizing it: raw frequencies hide patterns in imbalanced datasets. The fifth is choosing metrics after seeing results: best practice is to define the metric of interest before evaluation, tied to the project goal and the cost of each error type.
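The fourth pitfall can be illustrated by normalizing each row of the confusion matrix by the number of true examples in that class, so that each row shows per-class rates. This is a sketch on invented spam/ham data; the helper names are hypothetical, and rows are true classes while columns are predicted classes.

```python
def confusion_matrix(labels, preds, classes):
    """Raw counts: rows are true classes, columns are predicted classes."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for y, p in zip(labels, preds):
        matrix[index[y]][index[p]] += 1
    return matrix


def normalize_rows(matrix):
    """Divide each row by its total so entries become per-class rates."""
    normalized = []
    for row in matrix:
        total = sum(row)
        normalized.append([c / total if total else 0.0 for c in row])
    return normalized


# Invented data: 95 legitimate messages ("ham") and 5 spam messages.
classes = ["ham", "spam"]
labels = ["ham"] * 95 + ["spam"] * 5
preds = ["ham"] * 93 + ["spam"] * 2 + ["ham"] * 4 + ["spam"] * 1

raw = confusion_matrix(labels, preds, classes)
rates = normalize_rows(raw)

for name, counts, row in zip(classes, raw, rates):
    print(name, counts, [round(v, 3) for v in row])
```

In raw counts the 4 missed spam messages look negligible next to 93 correct ham predictions; the normalized spam row shows that 80% of spam is misclassified.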