Random forest — Glossary Aria Research

Extended definition

Random forest is an ensemble method that combines many independently built decision trees and aggregates their predictions by majority vote in classification or by averaging in regression. Two sources of randomness make the trees differ from one another. The first is bagging: each tree is trained on a bootstrap sample of the data, drawn with replacement. The second is random feature selection: at each node split, only a randomly drawn subset of the predictors is considered. Breiman (2001), who formalized the method, showed that this double randomness decorrelates the trees, and it is the averaging of weakly correlated models that reduces variance without raising bias. Biau and Scornet (2016), in their reference review, systematize the theory behind the algorithm, the choice of parameters, the resampling mechanism, and the variable-importance measures. Unlike gradient boosting, which builds its trees sequentially, a random forest builds its trees independently of one another, so they can be trained in parallel.
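The decorrelation argument above can be made concrete with a standard textbook idealization. The symbols are assumptions introduced here for illustration and do not come from the cited papers: B is the number of trees, T_b(x) is the prediction of tree b at a point x, each tree is assumed to have the same variance \sigma^2, and any two trees are assumed to share the same positive correlation \rho.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b(x)\right)
  = \rho\,\sigma^2 + \frac{1-\rho}{B}\,\sigma^2
```

Under these assumptions, adding trees shrinks only the second term. The first term is a floor set by the correlation between trees, which is why lowering \rho through random feature selection matters in addition to bagging.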

When it applies

Random forest applies as a robust baseline on tabular data, with good performance and little need for tuning. It applies when a model that works well almost out of the box is wanted: it resists overfitting thanks to averaging over many trees and tolerates irrelevant variables, noise, and nonlinear relationships. Couronné and colleagues (2018), in a benchmark of 243 real datasets, found random forest beating logistic regression on about 69% of the datasets with default parameters. It applies well where the out-of-bag error estimate removes the need for a separate validation set, and where variable importance guides predictor screening. It is strong in risk classification and regression, ecology, genomics, and any tabular problem where stability matters more than the last point of performance.
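The out-of-bag estimate works because each bootstrap sample leaves some rows unseen by its tree, and those rows can serve as test cases for that tree. The sketch below only simulates the resampling step and builds no trees; the function name `oob_fraction`, the dataset size, and the number of trees are hypothetical choices made for illustration.

```python
import random


def oob_fraction(n_rows: int, rng: random.Random) -> float:
    """Fraction of rows that one bootstrap sample of size n_rows never draws."""
    # Draw n_rows indices with replacement and keep the distinct ones.
    drawn = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1.0 - len(drawn) / n_rows


rng = random.Random(0)       # fixed seed so the run is repeatable
n_rows, n_trees = 1000, 200  # illustrative sizes, not recommendations

# One bootstrap sample per hypothetical tree.
fractions = [oob_fraction(n_rows, rng) for _ in range(n_trees)]
mean_oob = sum(fractions) / n_trees

# Probability that a given row is missed by n_rows draws with replacement.
# It tends to exp(-1), roughly 0.37, as n_rows grows.
limit = (1 - 1 / n_rows) ** n_rows

print(f"mean out-of-bag fraction: {mean_oob:.3f}, theoretical value: {limit:.3f}")
```

Roughly a third of the rows are out-of-bag for any given tree, so in a forest of a few hundred trees every row is typically scored by many trees that never trained on it.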

When it does not apply

Random forest does not apply when maximum predictive performance on tabular data is sought: at that frontier, gradient boosting usually wins, at the cost of more tuning. It does not apply to very high-dimensional, sparse data such as raw text or images, where specialized models dominate. It does not apply when the interpretability of a single model is required: a forest of hundreds of trees is not readable like a single tree or a regression. It does not apply to extrapolation beyond the training range, a limitation inherited from trees, which predict constants within each leaf and therefore do not project trends. And the variable-importance measure, though useful, is not causal evidence and is not neutral: high-cardinality predictors, those with many categories or many distinct values, can be artificially favored, which calls for careful interpretation.
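The extrapolation limit can be illustrated with a deliberately small sketch. It assumes a single predictor, so only bagging is at work and there is no random feature selection, and it uses a hand-written regression tree rather than any library implementation. The data (a linear trend y = 2x plus noise on the range 0 to 10), the tree depth, and all function names are hypothetical.

```python
import random


def sse(values):
    """Sum of squared deviations from the mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def fit_tree(points, depth):
    """Fit a 1-D regression tree: a leaf is a float, a node is (threshold, left, right)."""
    ys = [y for _, y in points]
    xs = sorted({x for x, _ in points})
    if depth == 0 or len(xs) < 2:
        return sum(ys) / len(ys)
    best = None
    # Candidate thresholds are midpoints between consecutive distinct x values.
    for lo, hi in zip(xs, xs[1:]):
        thr = (lo + hi) / 2
        left = [y for x, y in points if x <= thr]
        right = [y for x, y in points if x > thr]
        cost = sse(left) + sse(right)
        if best is None or cost < best[0]:
            best = (cost, thr)
    thr = best[1]
    return (thr,
            fit_tree([p for p in points if p[0] <= thr], depth - 1),
            fit_tree([p for p in points if p[0] > thr], depth - 1))


def predict_tree(tree, x):
    """Walk down the tree until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        thr, left, right = tree
        tree = left if x <= thr else right
    return tree


rng = random.Random(0)
# Training data: x in [0, 10], y follows a rising linear trend plus noise.
data = [(i / 5, 2 * (i / 5) + rng.gauss(0, 0.5)) for i in range(51)]

# Bagging: each tree sees a bootstrap sample drawn with replacement.
forest = [fit_tree([rng.choice(data) for _ in data], depth=4) for _ in range(50)]


def forest_predict(x):
    """Average the predictions of all trees."""
    return sum(predict_tree(t, x) for t in forest) / len(forest)


inside = forest_predict(5.0)
just_outside = forest_predict(10.5)
far_outside = forest_predict(1000.0)
print(inside, just_outside, far_outside)
```

Every split threshold lies inside the training range, so any input above 10 falls into the same rightmost leaf of each tree. The forest therefore returns the same value at 10.5 and at 1000, even though the underlying trend keeps rising.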

Applications by field

Genomics and bioinformatics: classification and predictor selection on moderately high-dimensional data, with variable importance.
Ecology and environment: species distribution modeling and process modeling from heterogeneous variables.
Risk and finance: scoring and classification as a stable baseline before more tuned models.
Health: outcome prediction from structured variables, with out-of-bag error for internal evaluation.

Common pitfalls

The first pitfall is reading variable importance as causality or as a neutral measure, ignoring the bias toward high-cardinality predictors. The second is expecting boosting’s frontier performance from random forest and concluding the method is weak, when it trades a little accuracy for robustness. The third is applying it to raw text or images without an adequate representation. The fourth is using it to extrapolate beyond the training domain and obtaining flat estimates. The fifth is neglecting the out-of-bag estimate: it offers an almost free internal evaluation, and building a redundant validation set instead is wasted effort.