AI & MACHINE LEARNING

Cross-validation

Predictive model evaluation technique that partitions the dataset into k subsets, trains k times alternating which subset serves as validation, and reports the mean error. Standard for small datasets where a fixed train/test split is unstable.

Extended definition

Cross-validation (CV) is a technique for evaluating predictive models. It partitions the dataset into $k$ subsets (folds) and trains the model $k$ times, each time holding out a different fold for validation and training on the remaining $k - 1$. The final error estimate is the mean over the $k$ folds:

$$\text{CV}_k = \frac{1}{k}\sum_{i=1}^{k} \text{error}_i$$
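The procedure behind this formula can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the helper names (`k_fold_indices`, `cross_validate`), the toy mean-predictor model, and the use of squared error as the per-fold error are all assumptions made for the example.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k nearly equal folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_validate(xs, ys, k, fit, predict, seed=0):
    """Return the per-fold mean squared errors and their mean (CV_k)."""
    folds = k_fold_indices(len(xs), k, seed)
    fold_errors = []
    for i, val_idx in enumerate(folds):
        # Train on every fold except fold i, validate on fold i.
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        model = fit([xs[j] for j in train_idx], [ys[j] for j in train_idx])
        errors = [(predict(model, xs[j]) - ys[j]) ** 2 for j in val_idx]
        fold_errors.append(mean(errors))
    return fold_errors, mean(fold_errors)


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return mean(train_y)


def predict_mean(model, x):
    return model


xs = list(range(20))
ys = [2.0 * x + 1.0 for x in xs]
fold_errors, cv_error = cross_validate(xs, ys, k=5, fit=fit_mean, predict=predict_mean)
print(len(fold_errors), cv_error)
```

Passing `fit` and `predict` as functions keeps the splitting logic independent of the model, so the same loop can be reused with any estimator.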

The canonical formalization is due to Stone (1974, Journal of the Royal Statistical Society), who established theoretical properties of the technique for model selection. Kohavi (1995) published an influential empirical comparison of $k$-fold CV and the bootstrap for error estimation, which helped establish $k = 10$ as the standard convention as a balance between bias and variance. Variants include leave-one-out CV ($k = n$, low bias but computationally expensive), stratified $k$-fold (preserves the class proportions in each fold, for imbalanced problems), and nested CV (an outer CV for evaluation plus an inner CV for tuning, which avoids optimism in the estimate). Repeated $k$-fold runs the whole procedure several times with different random seeds to reduce the variance of the estimate.
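As an illustration of the stratified variant, the sketch below assigns examples to folds class by class, so that each fold keeps roughly the overall class proportions. The function name `stratified_k_fold` and the 90/10 label vector are hypothetical, chosen only for the example.

```python
import random
from collections import Counter, defaultdict


def stratified_k_fold(labels, k, seed=0):
    """Assign each example index to one of k folds, class by class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        # Deal the members of this class round-robin across the folds.
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    return folds


# Hypothetical imbalanced labels: 90 negatives, 10 positives.
labels = [0] * 90 + [1] * 10
folds = stratified_k_fold(labels, k=5)
for fold in folds:
    print(Counter(labels[i] for i in fold))
```

With these labels every fold receives 18 negatives and 2 positives. When class sizes are not divisible by $k$, this simple round-robin gives the earlier folds one extra example per class, which a production implementation would balance more carefully.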

When it applies

Cross-validation is appropriate for small to medium datasets (hundreds to tens of thousands of examples), where a fixed train/test split would be too unstable. It is standard in applied ML research, in comparisons among candidate models, and in hyperparameter tuning via scikit-learn's GridSearchCV or RandomizedSearchCV. Stratified $k$-fold should be used for classification with imbalanced classes, since a random split can leave a fold with few or no minority-class examples. Nested CV is the rigorous approach when the goal is both to select a model and to estimate its performance: it avoids the optimism bias of tuning and evaluating on the same set. In ML competitions (Kaggle), a well-designed CV scheme often separates competitive solutions from suboptimal ones.
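A minimal sketch of nested CV with scikit-learn follows, assuming scikit-learn is installed. The synthetic dataset, the logistic regression model, the grid of `C` values, and the fold counts are illustrative assumptions, not recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Synthetic, mildly imbalanced data standing in for a real dataset.
X, y = make_classification(
    n_samples=300, n_features=10, weights=[0.8, 0.2], random_state=0
)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

# Inner loop: choose the regularization strength C using training data only.
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
    cv=inner_cv,
    scoring="accuracy",
)

# Outer loop: each outer fold reruns the whole search on its training part,
# then scores the tuned model on data the search never saw.
outer_scores = cross_val_score(search, X, y, cv=outer_cv, scoring="accuracy")
print(outer_scores.mean(), outer_scores.std())
```

Passing the `GridSearchCV` object itself to `cross_val_score` is what makes the procedure nested: tuning happens inside every outer training set rather than once on the full data.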

When it does not apply

Standard CV with random splitting does not apply to time series: shuffling lets the model train on observations that come after the ones it is validated on, leaking future information into the past. Adequate alternatives are time series split, blocked CV, or rolling-origin evaluation. It also does not apply to grouped data (for example, several observations per subject) unless the split respects the groups; group $k$-fold is the alternative. For very large datasets, where the computational cost of $k$ trainings is prohibitive, a fixed train/validation/test split is more practical. Finally, CV does not replace external validation in a different domain: it measures generalization within the same dataset, not to distinct populations or contexts.
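For time-ordered data, a rolling-origin scheme with an expanding training window can be sketched as below. The function name `rolling_origin_splits` and its parameters are hypothetical; scikit-learn's `TimeSeriesSplit` offers a comparable splitter.

```python
def rolling_origin_splits(n, n_splits, min_train):
    """Yield (train_indices, test_indices) with training always preceding testing."""
    test_size = (n - min_train) // n_splits
    if test_size < 1:
        raise ValueError("not enough observations for the requested splits")
    for i in range(n_splits):
        train_end = min_train + i * test_size
        test_end = train_end + test_size
        # The training window grows; the test block is the next unseen stretch.
        yield list(range(train_end)), list(range(train_end, test_end))


for train_idx, test_idx in rolling_origin_splits(n=12, n_splits=3, min_train=6):
    print(train_idx[-1], test_idx)
```

Every test index is later than every training index, so no future observation can influence a model that is evaluated on the past.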

Applications by field

Health and biomedical sciences: stratified CV is standard in clinical ML; nested CV is used in rigorous biomarker selection studies.

NLP and computer vision: CV is used at the prototyping stage; final comparisons on large benchmarks (GLUE, ImageNet) use fixed splits.

Bioinformatics: careful CV is essential in problems with few samples and many features (genomics, proteomics).

Finance and econometrics: CV adapted for time series (time series split) replaces random CV.

Common pitfalls

The first pitfall is doing hyperparameter tuning and error estimation on the same CV: repeatedly exploring the same validation folds produces an optimistic estimate. Nested CV is the correction. The second is ignoring natural grouping in the data: random CV with multiple observations per subject creates leakage, which group $k$-fold solves. The third is applying standard CV to time series, which violates the essential chronological order. The fourth is trusting CV with small $k$ on a small dataset: $k = 5$ on $n = 50$ produces validation folds of size 10, so each fold's error has high variance. Increasing $k$ or using leave-one-out is safer on small datasets. The fifth is using CV to estimate production performance: CV measures generalization to samples from the same distribution, so distribution shifts in production are not captured and external validation is needed.
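The grouping pitfall can be illustrated with a small group-aware splitter that assigns whole groups, rather than individual observations, to folds. The function name `group_k_fold` and the six-subject dataset are hypothetical, introduced only for this sketch.

```python
import random


def group_k_fold(groups, k, seed=0):
    """Split example indices into k folds so that no group spans two folds."""
    unique_groups = sorted(set(groups))
    if len(unique_groups) < k:
        raise ValueError("need at least k distinct groups")
    random.Random(seed).shuffle(unique_groups)
    # Assign whole groups to folds round-robin.
    fold_of_group = {g: i % k for i, g in enumerate(unique_groups)}
    folds = [[] for _ in range(k)]
    for idx, g in enumerate(groups):
        folds[fold_of_group[g]].append(idx)
    return folds


# Hypothetical data: 6 subjects with 3 observations each.
groups = [subject for subject in "ABCDEF" for _ in range(3)]
folds = group_k_fold(groups, k=3)
for fold in folds:
    print(sorted({groups[i] for i in fold}))
```

Because each subject lands in exactly one fold, a model is never validated on a subject it has already seen during training.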
