Class imbalance — Glossary Aria Research

Extended definition

Class imbalance is the situation in which the classes of a classification problem are far from equally represented: a majority class dominates and one or more minority classes are rare. The problem is central because the rare class is usually the one of interest: fraud, disease, failure, dropout. He and Garcia (2009), in their reference review, show that classifiers trained without correction tend to favor the majority class, maximizing overall accuracy while ignoring the minority. The most widespread response at the data level is resampling. SMOTE, proposed by Chawla and colleagues (2002), generates synthetic minority examples by interpolating between a minority example and its nearest minority-class neighbors rather than replicating records, which broadens the minority decision region instead of merely duplicating points. Fernández and colleagues (2018), in their fifteen-year assessment of SMOTE, record that it became the de facto standard and catalog dozens of variants, alongside algorithm-level alternatives such as cost-sensitive learning.
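
The interpolation idea behind SMOTE can be illustrated with a minimal sketch in plain Python. This is not the reference implementation from Chawla and colleagues: the function name smote_sketch, its parameters, and the four sample points are hypothetical, and the sketch assumes purely numeric features compared by Euclidean distance.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Illustrative only: assumes numeric features and Euclidean distance.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority examples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot use more neighbors than other points
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # k nearest minority neighbors of the base point, excluding the point itself.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbors = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from base to neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority examples with two numeric features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=6, k=2)
```

Each synthetic point lies on the segment between a real minority example and one of its neighbors, so the new points fill the space between existing ones rather than repeating them.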

When it applies

Handling imbalance applies when the class of interest is rare and the cost of missing it is high: fraud detection, diagnosis of low-prevalence diseases, failure prediction, and any scenario where a false negative is more serious than a false positive. SMOTE and its variants act at the data level, before training, to rebalance the training sample. Cost-sensitive learning acts at the algorithm level, penalizing errors on the minority class more heavily. Handling imbalance also extends to the choice of metrics: since accuracy is misleading under imbalance, sound evaluation uses precision, recall, F1, and the area under the precision-recall curve. Fernández and colleagues (2018) recommend combining resampling with adequate evaluation, never one without the other.
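
Why accuracy misleads can be seen in a small worked example. The sketch below computes accuracy, precision, recall, and F1 from label lists in plain Python. The function names and both sets of predictions are hypothetical, and reporting a metric as 0.0 when its denominator is zero is a convention assumed here, not a universal rule.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) with `positive` treated as the minority class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    return tp, fp, fn, tn


def minority_metrics(y_true, y_pred, positive=1):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / len(y_true)
    # Assumed convention: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical data: 990 majority (0) and 10 minority (1) examples.
y_true = [0] * 990 + [1] * 10

# Model A always predicts the majority class.
always_majority = [0] * 1000
# Model B raises 20 false alarms but finds 8 of the 10 minority examples.
some_recall = [1] * 20 + [0] * 970 + [1] * 8 + [0] * 2

baseline = minority_metrics(y_true, always_majority)
better = minority_metrics(y_true, some_recall)
```

In this constructed case the always-majority model reaches 0.99 accuracy with zero recall, while the second model has lower accuracy (0.978) but recovers 80% of the minority class. Only the minority-focused metrics reveal which model is useful.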

When it does not apply

Imbalance correction does not apply automatically to every unequal dataset. When data is abundant and the minority class, though proportionally small, is numerous in absolute terms, rebalancing may be unnecessary. SMOTE does not work well on very high-dimensional data or data with many categorical variables, where interpolation between neighbors loses geometric meaning. It must not be applied before the train-test split: generating synthetic examples and only then splitting contaminates the test set with training information and artificially inflates measured performance. It is not a fix for noisy labels: oversampling mislabeled examples amplifies the noise. And it is of little use without revising the metric: rebalancing the data while still evaluating by accuracy hides the very problem one set out to solve.
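
The ordering constraint (split first, resample afterwards, and only the training portion) can be sketched as follows. To keep the example self-contained, plain random oversampling by duplication stands in for SMOTE. The function names, the 95:5 dataset, and the 25% test fraction are all hypothetical.

```python
import random


def stratified_split(labels, test_fraction=0.25, seed=0):
    """Return (train_idx, test_idx), splitting each class separately."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = max(1, round(len(idx) * test_fraction))
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return train_idx, test_idx


def oversample_minority(indices, labels, minority=1, seed=0):
    """Duplicate minority indices until both classes have the same count.

    Random duplication is used here as a stand-in for SMOTE.
    Assumes `indices` contains at least one minority example.
    """
    rng = random.Random(seed)
    minority_idx = [i for i in indices if labels[i] == minority]
    majority_idx = [i for i in indices if labels[i] != minority]
    n_extra = len(majority_idx) - len(minority_idx)
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return majority_idx + minority_idx + extra


labels = [0] * 95 + [1] * 5  # hypothetical 95:5 dataset

# 1. Split first, so the test set keeps the original class proportions.
train_idx, test_idx = stratified_split(labels)
# 2. Resample the training indices only; the test indices are never touched.
balanced_train = oversample_minority(train_idx, labels)
```

Because resampling only ever draws from the training indices, no test example (or anything derived from one) can reach the training set, and the test set still reflects the imbalance the model will face in use.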

Applications by field

Fraud detection and security: rare positive classes, where minority recall is the central objective.
Health and diagnosis: low-prevalence diseases, with a high false-negative cost, handled by resampling or cost-sensitive learning.
Predictive maintenance: rare failures in operational time series, where the event of interest is the minority.
Credit risk and churn: infrequent events in large tabular datasets, evaluated by the precision-recall curve.

Common pitfalls

The first pitfall is evaluating by accuracy under imbalance: a model that always predicts the majority class looks excellent and is useless for the minority. The second is applying SMOTE before splitting train and test, leaking information and inflating the result. The third is using interpolation in spaces where it makes no sense, such as data dominated by categorical variables, generating unrealistic synthetic examples. The fourth is resampling noisy data, multiplying label errors instead of correcting them. The fifth is treating rebalancing as the sole solution, when adjusting the decision threshold and using cost-sensitive learning are often as effective as, or more effective than, resampling the data.
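
Adjusting the decision threshold under unequal error costs, the alternative named in the fifth pitfall, can be sketched without touching the sample at all. The scores, the candidate thresholds, and the 10:1 cost ratio between false negatives and false positives are hypothetical values chosen for illustration, as are the function names.

```python
def total_cost(y_true, scores, threshold, cost_fn=10.0, cost_fp=1.0):
    """Sum misclassification costs when predicting positive at score >= threshold."""
    cost = 0.0
    for truth, score in zip(y_true, scores):
        predicted = 1 if score >= threshold else 0
        if truth == 1 and predicted == 0:
            cost += cost_fn  # missed minority example
        elif truth == 0 and predicted == 1:
            cost += cost_fp  # false alarm
    return cost


def best_threshold(y_true, scores, candidates, cost_fn=10.0, cost_fp=1.0):
    """Pick the candidate threshold with the lowest total cost."""
    return min(
        candidates,
        key=lambda th: total_cost(y_true, scores, th, cost_fn, cost_fp),
    )


# Hypothetical validation labels and model scores (8 majority, 2 minority).
y_val = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.05, 0.10, 0.15, 0.20, 0.30, 0.35, 0.45, 0.55, 0.40, 0.70]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]

chosen = best_threshold(y_val, scores, candidates)
cost_at_default = total_cost(y_val, scores, 0.5)
cost_at_chosen = total_cost(y_val, scores, chosen)
```

With these made-up numbers the conventional 0.5 threshold misses one minority example and costs 11, while a threshold of 0.4 accepts two false alarms and costs 2. In practice the threshold would be chosen on validation data, never on the test set.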