DATA & STATISTICS

Confidence interval

Range of values constructed from sample data which, under repeated use, contains the true population parameter with probability equal to the nominal confidence level (typically 95%). Formalized by Neyman in 1937.

Extended definition

A confidence interval (CI) is a range of values constructed from sample data which, under repeated use of the same methodology, contains the true population parameter with probability equal to the nominal confidence level, typically 95%. The canonical formalization is Neyman (1937), a break from the purely Fisherian tradition centered on the p-value. For a population mean in a large sample, where the sample standard deviation stands in for the unknown population value, the classical form is:

$$\text{CI}_{1-\alpha} = \bar{x} \pm z_{1-\alpha/2} \cdot \frac{s}{\sqrt{n}}$$

where $\bar{x}$ is the sample mean, $s$ the sample standard deviation, $n$ the sample size, and $z_{1-\alpha/2}$ the critical value of the standard normal distribution (1.96 for 95%). When the population variance is known, $s$ is replaced by $\sigma$; for small samples with unknown variance, $z$ is replaced by the corresponding critical value of Student’s $t$ with $n-1$ degrees of freedom. The correct interpretation is probabilistic over the procedure, not over the specific computed interval: in 95% of repeated applications of the method, the interval will capture the parameter. Hoekstra et al. (2014) documented, in a study of roughly 600 psychology researchers and students, that most participants endorsed incorrect readings of a CI, such as “the probability that the parameter lies in the interval”. This fallacy parallels the common misreading of the p-value as the probability that the null hypothesis is true.
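As a minimal sketch of the formula above, the following Python snippet computes a normal-approximation interval from summary statistics. The function name `mean_ci_z` and the input values (mean 50, standard deviation 10, n = 100) are hypothetical, chosen only for illustration; only the standard library is used.

```python
from math import sqrt
from statistics import NormalDist


def mean_ci_z(mean, sd, n, level=0.95):
    """Normal-approximation CI for a mean, from summary statistics.

    Illustrative helper (not from the source text); appropriate only
    when n is large or the population variance is known.
    """
    alpha = 1.0 - level
    # Critical value z_{1 - alpha/2}; about 1.96 when level = 0.95.
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z * sd / sqrt(n)
    return mean - half_width, mean + half_width


# Hypothetical summary statistics.
lower, upper = mean_ci_z(mean=50.0, sd=10.0, n=100)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")  # 95% CI: (48.04, 51.96)
```

Here the half-width is 1.96 × 10 / √100 ≈ 1.96, so the interval runs from about 48.04 to 51.96. For small samples the z critical value would be replaced by a Student’s t critical value, which the standard library does not provide.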

When it applies

Confidence intervals are expected in modern reporting of any point estimate: mean, proportion, between-group difference, odds ratio, regression coefficient, effect size. APA, AMA, ICMJE, and CONSORT call for CIs in results communication. They are especially useful for communicating statistical precision: a narrow CI signals a precise estimate, a wide CI signals uncertainty. In meta-analysis, the CIs of individual studies are a primary tool for assessing consistency among studies. In clinical decisions, the CI guides judgment on practical relevance: an effect whose CI includes zero (or a neutral value such as 1 for ratios) signals that the true effect may be null.
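To illustrate checking a ratio estimate against its neutral value of 1, here is a small sketch for an odds ratio from a 2×2 table. It assumes the common log-scale (Woolf) normal approximation, which the text above does not prescribe; the function name `odds_ratio_ci` and the cell counts are hypothetical.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def odds_ratio_ci(a, b, c, d, level=0.95):
    """Odds ratio and log-scale (Woolf) CI for a 2x2 table.

    Assumed layout: a, b = exposed cases / non-cases;
                    c, d = unexposed cases / non-cases.
    All four cells must be nonzero for this approximation.
    """
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    or_hat = (a * d) / (b * c)
    # Standard error of log(OR) under the Woolf approximation.
    se_log = sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = exp(log(or_hat) - z * se_log)
    upper = exp(log(or_hat) + z * se_log)
    return or_hat, lower, upper


# Hypothetical counts: 20/80 among exposed, 10/90 among unexposed.
or_hat, lower, upper = odds_ratio_ci(20, 80, 10, 90)
print(f"OR = {or_hat:.2f}, 95% CI ({lower:.2f}, {upper:.2f})")
print("CI includes 1:", lower <= 1.0 <= upper)
```

With these counts the point estimate is 2.25, yet the interval runs from just below 1 to roughly 5.1. The data are compatible with no effect and with a fivefold increase in odds, which is more informative than a bare non-significant test result.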

When it does not apply

A CI is not a substitute for the effect size or the p-value; the three complement each other. It does not replace Bayesian analysis when the context requires a direct probability statement about the parameter (the Bayesian credible interval is the analogous object). In very small samples ($n < 30$) from a non-normal distribution, the classical CI based on normality is unreliable; bootstrap or non-parametric methods are alternatives. For non-standard parameters (median, quantiles, complex model parameters), an analytic CI may not exist and numerical methods are required.
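As one example of such a numerical method, the sketch below computes a percentile bootstrap interval for a median. The function name `bootstrap_percentile_ci`, the resample count, the seed, and the skewed sample are all hypothetical; the simple index-based percentile rule is one of several conventions.

```python
import random
from statistics import median


def bootstrap_percentile_ci(data, stat=median, level=0.95,
                            n_resamples=5000, seed=0):
    """Percentile bootstrap CI for an arbitrary statistic (illustrative)."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(data)
    # Recompute the statistic on resamples drawn with replacement.
    estimates = sorted(
        stat(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = 1.0 - level
    lo_idx = int((alpha / 2.0) * n_resamples)
    hi_idx = int((1.0 - alpha / 2.0) * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical right-skewed sample.
sample = [1.2, 1.5, 1.7, 2.0, 2.1, 2.4, 2.8, 3.5, 4.9, 7.3, 12.6, 25.0]
lower, upper = bootstrap_percentile_ci(sample)
print(f"median = {median(sample)}, 95% bootstrap CI ({lower}, {upper})")
```

The percentile method is the simplest bootstrap interval. With samples this small it can still undercover, so corrected variants such as BCa may be preferable in practice.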

Applications by field

Health and biomedical sciences: mandatory standard in clinical trials (required by CONSORT); the odds ratio with its CI is the basic structure of epidemiological reporting.
Applied social sciences: complements the p-value in modern reporting; required by APA.
Economics and finance: CIs for regression coefficients, time series forecasts, structural parameters.
Engineering: CIs for process parameters, instrument calibration, statistical quality control.

Common pitfalls

The first pitfall is interpreting the CI as the “probability that the parameter lies in the interval”, a fallacy documented by Hoekstra et al. (2014); the probability refers to the long-run procedure, not to a specific already-computed interval. The second is trusting the classical CI under violated assumptions: skewness, heteroscedasticity, or small samples require robust alternatives (percentile or BCa bootstrap). The third is equating the CI with a hypothesis test: a CI that includes the null value (zero, or 1 for ratios) is richer evidence than just $p > 0.05$, because it informs not only about rejection but also about plausible magnitude. The fourth is assuming a symmetric CI: for odds ratios or relative risks, CIs are typically symmetric on the log scale, not on the original scale. The fifth is trusting a CI without the reported sample size and standard deviation: with small $n$ the variance is poorly estimated, and an interval built with the normal critical value is artificially narrow.
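The first and fifth pitfalls can be made concrete with a small simulation, sketched here under stated assumptions: normally distributed data, a known true mean of 0, and hypothetical names and settings (`z_interval_coverage`, 2000 trials, seed 1). It estimates how often the z-based interval captures the true mean over repeated samples.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_interval_coverage(n, true_mean=0.0, true_sd=1.0, level=0.95,
                        n_trials=2000, seed=1):
    """Fraction of simulated z-based intervals that contain true_mean."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1.0 - (1.0 - level) / 2.0)
    hits = 0
    for _ in range(n_trials):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        center = mean(sample)
        # Uses the sample SD with the normal critical value (no t correction).
        half_width = z * stdev(sample) / sqrt(n)
        if center - half_width <= true_mean <= center + half_width:
            hits += 1
    return hits / n_trials


small_n_cov = z_interval_coverage(n=5)
large_n_cov = z_interval_coverage(n=100)
print(f"coverage with n=5:   {small_n_cov:.3f}")
print(f"coverage with n=100: {large_n_cov:.3f}")
```

The 95% figure describes the long-run hit rate of the procedure, which is what the simulation counts. With n = 100 the estimated coverage should sit close to the nominal level, while with n = 5 the z-based interval is too narrow and covers the true mean noticeably less often (about 88% in theory), which is why Student’s t is used for small samples.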
