Cross-validation — Glossary Aria Research

Extended definition

Cross-validation (CV) is a technique for evaluating predictive models. It partitions the dataset into $k$ subsets (folds) and trains the model $k$ times, each time holding out a different fold for validation and training on the remaining $k - 1$ folds. The final error estimate is the mean of the validation errors over the $k$ folds:

$$\text{CV}_k = \frac{1}{k}\sum_{i=1}^{k} \text{error}_i$$
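As a minimal sketch of the formula above, the following pure-Python code computes the per-fold errors and their mean. The helper names (`k_fold_indices`, `cross_val_error`), the toy mean predictor, and the squared-error loss are illustrative assumptions rather than part of any library, and the data are assumed to be already shuffled so that contiguous folds are acceptable.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Split indices 0..n-1 into k contiguous folds whose sizes differ by at most 1."""
    base, extra = divmod(n, k)
    folds, start = [], 0
    for i in range(k):
        size = base + (1 if i < extra else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_val_error(xs, ys, k, fit, predict, loss):
    """Return (CV_k, per-fold errors) for a user-supplied model."""
    fold_errors = []
    for val_idx in k_fold_indices(len(xs), k):
        held_out = set(val_idx)
        train_idx = [i for i in range(len(xs)) if i not in held_out]
        # Train on the k-1 remaining folds only.
        model = fit([xs[i] for i in train_idx], [ys[i] for i in train_idx])
        # error_i: mean loss on the held-out fold.
        fold_errors.append(mean(loss(ys[i], predict(model, xs[i])) for i in val_idx))
    return mean(fold_errors), fold_errors


# Hypothetical toy model: always predict the mean of the training targets.
def fit_mean(train_x, train_y):
    return mean(train_y)


def predict_mean(model, x):
    return model


def squared_error(y_true, y_pred):
    return (y_true - y_pred) ** 2


xs = list(range(10))
ys = [float(v) for v in range(1, 11)]
cv_error, fold_errors = cross_val_error(xs, ys, 5, fit_mean, predict_mean, squared_error)
print(cv_error, fold_errors)
```

The `fit`, `predict`, and `loss` callables are passed in so that the same loop can wrap any model; only the fold bookkeeping is specific to CV.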

The canonical formalization is due to Stone (1974, Journal of the Royal Statistical Society), who proved theoretical properties of the technique for model selection. Kohavi (1995) offered the seminal empirical comparison between $k$-fold CV and the bootstrap for error estimation, and his recommendation helped establish $k = 10$ as the standard convention for balancing bias and variance. Variants include leave-one-out CV ($k = n$; nearly unbiased, but computationally expensive and often high in variance), stratified $k$-fold (preserves the class proportions in each fold, which matters for imbalanced problems), and nested CV (an outer CV loop for evaluation plus an inner CV loop for tuning, which avoids optimism in the estimate). Repeated $k$-fold reruns the procedure several times with different random seeds to reduce the variance of the estimate.
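To illustrate how stratified $k$-fold preserves class proportions, here is a small sketch that deals the samples of each class round-robin across folds. The function name `stratified_fold_assignment` and the toy labels are assumptions made for illustration; library implementations typically also shuffle within each class.

```python
from collections import Counter, defaultdict


def stratified_fold_assignment(labels, k):
    """Return a list giving the fold number (0..k-1) of every sample.

    Samples of each class are dealt round-robin, so per-class counts
    differ by at most one between any two folds.
    """
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    fold_of = [None] * len(labels)
    offset = 0  # running offset keeps overall fold sizes balanced too
    for label in sorted(by_class):
        for j, idx in enumerate(by_class[label]):
            fold_of[idx] = (offset + j) % k
        offset += len(by_class[label])
    return fold_of


# Imbalanced toy problem: 16 negatives and 4 positives, split into 4 folds.
labels = [0] * 16 + [1] * 4
fold_of = stratified_fold_assignment(labels, 4)
per_fold = [Counter(l for l, f in zip(labels, fold_of) if f == fold) for fold in range(4)]
print(per_fold)
```

With a plain random split of the same data, a fold could easily end up with no positive examples at all, which is the situation stratification is meant to prevent.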

When it applies

Cross-validation is appropriate for small to medium datasets (hundreds to tens of thousands of examples), where a single fixed train/test split would give too unstable an estimate. It is standard in applied ML research, in comparisons among candidate models, and in hyperparameter tuning via scikit-learn's GridSearchCV or RandomizedSearchCV. Stratified $k$-fold is the usual choice for classification with imbalanced classes, since an unstratified fold may contain few or no minority-class examples. Nested CV is the rigorous approach when the goal is to select a model and estimate its performance at the same time: it avoids the optimism bias of tuning and evaluating on the same set. In ML competitions (Kaggle), a well-designed CV scheme is often what separates competitive from suboptimal solutions.
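The nested CV idea can be sketched in pure Python as follows. Everything here is an assumption chosen for illustration: the 1-D nearest-neighbour regressor, the hyperparameter grid, the synthetic noisy sine data, and all function names. In scikit-learn the same structure is usually obtained by passing a GridSearchCV object as the estimator to an outer cross-validation call.

```python
import math
import random
from statistics import mean


def interleaved_folds(n, k):
    """Fold f holds indices f, f+k, f+2k, ... (assumes order carries no signal)."""
    return [list(range(start, n, k)) for start in range(k)]


def split(data, val_idx):
    held_out = set(val_idx)
    train = [p for i, p in enumerate(data) if i not in held_out]
    validation = [data[i] for i in val_idx]
    return train, validation


def knn_predict(train, x, n_neighbors):
    """Hypothetical toy model: average the targets of the nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:n_neighbors]
    return mean(y for _, y in nearest)


def fold_mse(train, validation, n_neighbors):
    return mean((y - knn_predict(train, x, n_neighbors)) ** 2 for x, y in validation)


def cv_mse(data, k, n_neighbors):
    """Plain k-fold CV error for one hyperparameter value."""
    return mean(fold_mse(*split(data, idx), n_neighbors)
                for idx in interleaved_folds(len(data), k))


def nested_cv(data, outer_k, inner_k, grid):
    outer_errors, chosen = [], []
    for val_idx in interleaved_folds(len(data), outer_k):
        outer_train, outer_val = split(data, val_idx)
        # Inner loop: tune using the outer-training data only.
        best = min(grid, key=lambda g: cv_mse(outer_train, inner_k, g))
        chosen.append(best)
        # Outer loop: score the tuned model on data the tuning never saw.
        outer_errors.append(fold_mse(outer_train, outer_val, best))
    return mean(outer_errors), chosen


rng = random.Random(0)
data = [(i / 10, math.sin(i / 10) + rng.gauss(0, 0.1)) for i in range(40)]
estimate, chosen = nested_cv(data, outer_k=4, inner_k=3, grid=[1, 3, 5, 9])
print(estimate, chosen)
```

The hyperparameter chosen may differ from one outer fold to the next; the outer mean estimates the performance of the whole tune-then-fit procedure rather than of one fixed configuration.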

When it does not apply

Standard CV does not apply to time series: random splitting leaks future information into the training data. Suitable alternatives are time series split, blocked CV, or rolling-origin evaluation. Nor does it apply to grouped data (for example, several observations per subject) unless the split respects the groups; group $k$-fold is the alternative. For enormous datasets where the computational cost of $k$ trainings is prohibitive, a fixed train/validation/test split is more practical. Finally, CV does not replace external validation in a different domain: it measures generalization within the same dataset, not to distinct populations or contexts.
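For the time series case, a rolling-origin scheme with an expanding training window can be sketched as below. The function name `rolling_origin_splits` and its parameters are hypothetical; the only property the sketch is meant to show is that every training index precedes every test index.

```python
def rolling_origin_splits(n, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs in chronological order.

    The training window always starts at 0 and grows by test_size
    at each step; the test block immediately follows it.
    """
    first_test_start = n - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough observations for the requested splits")
    for s in range(n_splits):
        test_start = first_test_start + s * test_size
        train_idx = list(range(test_start))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(rolling_origin_splits(n=10, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    print(train_idx, test_idx)
```

For 10 observations, 3 splits, and a test size of 2, the training windows end at indices 3, 5, and 7, and the test blocks are [4, 5], [6, 7], and [8, 9].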

Applications by field

— Health and biomedical sciences: stratified CV is standard in clinical ML; nested CV in rigorous biomarker selection studies.
— NLP and computer vision: CV used at prototyping stage; final comparisons on large datasets (GLUE, ImageNet) use fixed splits.
— Bioinformatics: careful CV is essential in problems with few samples and many features (genomics, proteomics).
— Finance and econometrics: CV adapted for time series (time series split) replaces random CV.

Common pitfalls

The first pitfall is tuning hyperparameters and estimating error with the same CV: repeatedly exploring the validation folds produces an optimistic estimate. Nested CV is the correction. The second is ignoring natural grouping in the data: random CV with multiple observations per subject creates leakage, which group $k$-fold solves. The third is applying standard CV to time series, which violates the essential chronological order. The fourth is trusting CV with small $k$ on a small dataset: $k = 5$ on $n = 50$ gives validation folds of only 10 examples, so the per-fold errors have high variance. Increasing $k$, using leave-one-out, or repeating $k$-fold with different seeds is safer on small datasets. The fifth is using CV to estimate production performance: CV measures generalization to samples from the same distribution, so shifts in the production distribution are not captured and external validation is needed.
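The grouping pitfall can be made concrete with a short sketch of group $k$-fold, in which all observations from one subject land in the same fold. The function name `group_k_fold`, the greedy size-balancing rule, and the toy subject IDs are assumptions for illustration; library implementations may balance folds differently.

```python
from collections import defaultdict


def group_k_fold(groups, k):
    """Return k validation folds (lists of sample indices); no group spans two folds."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    if len(members) < k:
        raise ValueError("need at least as many groups as folds")

    folds = [[] for _ in range(k)]
    # Greedy balancing: place the largest groups first,
    # each into whichever fold is currently smallest.
    ordered = sorted(members, key=lambda g: (-len(members[g]), str(g)))
    for group in ordered:
        smallest = min(range(k), key=lambda f: len(folds[f]))
        folds[smallest].extend(members[group])
    return folds


# Hypothetical subject IDs: several observations per subject.
subjects = ["a", "a", "a", "b", "b", "c", "c", "d"]
folds = group_k_fold(subjects, k=2)
print(folds)
```

Because whole subjects are held out together, the validation error reflects performance on unseen subjects instead of on unseen rows from subjects already in the training set.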