DATA & STATISTICS

Logistic regression

Statistical model for categorical dependent variable that estimates the probability of belonging to a category as a logistic function of predictors. Variants: binary, multinomial, and ordinal. Cox (1958) formalized it for binary response.

Extended definition

Logistic regression is a statistical model for a categorical dependent variable that estimates the probability of belonging to a category as a logistic function of predictors. For a binary response (Y{0,1}Y \in \{0, 1\}), the canonical form is:

log(p1p)=β0+β1x1+β2x2++βkxk\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k

where p=P(Y=1X)p = P(Y=1 \mid X) and log(p1p)\log\left(\frac{p}{1-p}\right) is the logit (log-odds). Coefficients βi\beta_i are interpreted as change in log-odds per unit increase in xix_i, or — after exponentiation — as odds ratios (eβie^{\beta_i} = odds ratio). Cox (1958, JRSS B) formalized logistic regression in the modern framework; Hosmer, Lemeshow, and Sturdivant (2013, Applied Logistic Regression) consolidated the standard technical reference. Variants include multinomial (more than 2 categories with no natural order — softmax), ordinal (ordered categories — proportional odds), and conditional (matched case-control studies).

When it applies

Logistic regression applies in any problem with a categorical outcome that must be modeled as a function of continuous or categorical predictors: disease presence/absence in epidemiology, success/failure of intervention, vote/abstention, default/non-default in credit, binary classification in classical ML. It is the standard technique for association analysis in case-control studies (analytic epidemiology) and pairs well with odds-ratio confidence intervals. In ML, logistic regression serves as a strong baseline before more complex models (random forest, gradient boosting, neural networks) — often hard to beat on tabular problems with well-engineered features.

When it does not apply

It does not apply for continuous dependent variables — use linear regression. It does not apply for ordinal variables with more than 4-5 categories if the distance between categories is informative — ordinal models or linear regression may be more appropriate. It does not apply for outcomes with extreme class imbalance (<5%<5\% of one class) without adjustments (Firth correction, downsampling, weighting). It does not apply to data with strong dependence structure (repeated measures, clustering) without extensions: mixed models (GLMM), GEE, or hierarchical models are appropriate. In modern ML with high-dimensional features and non-linear relationships, simple logistic regression is often suboptimal.

Applications by field

Epidemiology: standard for odds ratios in case-control studies; confounder adjustment via covariate inclusion. — Finance: credit scoring, default prediction, fraud — logistic regression is a regulatory baseline in many contexts. — Marketing: churn modeling, conversion, campaign response — coefficient interpretability is a differentiator. — Applied ML: baseline for tabular classification problems before non-linear models.

Common pitfalls

The first pitfall is interpreting βi\beta_i as direct effect on pp — it is effect on the logit; change in pp depends on the starting value (sigmoid curve is non-linear). The second is confusing odds ratio (eβe^{\beta}) with relative risk — they coincide when the outcome is rare (<10%<10\%) but diverge in common outcomes; reporting as relative risk when it is odds ratio is a frequent error in epidemiology. The third is not checking assumptions: logit linearity in continuous predictors, absence of severe multicollinearity (VIF), independence of observations. The fourth is including variables based only on univariate p<0.05p < 0.05: inclusion should follow theoretical framework, not fishing. The fifth is interpreting pseudo-R2R^2 (Nagelkerke, McFadden) as the linear-regression R2R^2 — they are not equivalent; typical Nagelkerke values are between 0.1 and 0.4 even in good models.

Last updated —