A reflexive answer circulates in graduate programs: if data have nested structure (students within schools, patients within hospitals, repeated measures within individuals), the appropriate analysis is multilevel modeling (MLM). The answer is correct in direction but insufficient in precision. Not every nested dataset requires MLM, and the argument that it always does fails under competent peer review.
The question that defines the choice is not whether structure exists (it almost always does in social, educational, organizational, or clinical research) but whether the structure affects the parameters of interest enough to justify the additional complexity of MLM. The operational heuristic, derived from simulations documented in Hox, Moerbeek, and van de Schoot (2017) and revisited in Sommet and Morselli (2017) and McNeish and Wentzel (2017), rests on the intraclass correlation coefficient (ICC).
What ICC measures
ICC, in its basic form for linear mixed models, is the proportion of total variance attributable to between-cluster variance. Formally, ICC = τ₀²/(τ₀² + σ²), where τ₀² is the between-cluster (random-intercept) variance and σ² is the within-cluster (residual) variance, both taken from the intercept-only model. ICC = 0 indicates that cluster means do not differ on the response variable; ICC = 1 indicates all variance is between clusters and none is within.
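As a minimal sketch of the formula, the two variance components can be approximated with the one-way ANOVA (method-of-moments) estimator. This is a stand-in for the variance components of a fitted null mixed model, not a replacement for them, and the function name `anova_icc` and its dict-of-lists input are illustrative assumptions.

```python
from statistics import fmean


def anova_icc(clusters):
    """Estimate ICC = tau2 / (tau2 + sigma2) from {cluster_id: [values]}.

    Uses the one-way ANOVA estimator, which also handles unequal cluster
    sizes. Assumes at least two clusters and some within-cluster variation.
    """
    groups = list(clusters.values())
    sizes = [len(g) for g in groups]
    n_total = sum(sizes)
    n_clusters = len(groups)

    grand_mean = fmean(y for g in groups for y in g)
    ss_between = sum(len(g) * (fmean(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum((y - fmean(g)) ** 2 for g in groups for y in g)

    ms_between = ss_between / (n_clusters - 1)
    ms_within = ss_within / (n_total - n_clusters)

    # Size adjustment; equals the common cluster size when data are balanced.
    n0 = (n_total - sum(s * s for s in sizes) / n_total) / (n_clusters - 1)

    tau2 = max((ms_between - ms_within) / n0, 0.0)  # truncate negatives at 0
    sigma2 = ms_within
    return tau2 / (tau2 + sigma2)


# Hypothetical toy data: two classes of three students each.
scores = {"class_a": [1, 2, 3], "class_b": [4, 5, 6]}
print(round(anova_icc(scores), 3))
```

With real data the estimate from a null mixed model (restricted maximum likelihood) will usually be close to this one but not identical, especially with few or very unbalanced clusters.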
The practical interpretation is direct. ICC is the expected correlation between two observations drawn from the same cluster. In a classroom, for example, the ICC of mathematics performance measures how similar students in the same class are in performance, whether through the teacher, the curriculum, or the class culture. Low ICC indicates the class matters little; high ICC indicates the class matters substantially.
The cost of ignoring high ICC
Ignoring nested structure when ICC is non-trivial is not a stylistic decision: it inflates the Type I error rate in a documented, quantifiable way. OLS treats observations within a cluster as independent, so it overstates the effective sample size and understates standard errors, most severely for predictors that vary only between clusters. Classical simulations show that, for a test with nominal α = 0.05 conducted via OLS on nested data with typical cluster size (n ≈ 20), the observed Type I error rate grows non-linearly with ICC. At ICC = 0.05, the rate rises to approximately 11%. At ICC = 0.10, to 18%. At ICC = 0.20, to 33%.
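A complementary way to see why even a small ICC matters is the Kish design effect, 1 + (n − 1) × ICC, which approximates how much the variance of a cluster-level estimate is understated when clustering is ignored. The sketch below assumes equal cluster sizes; it is an approximation and is not the simulation behind the figures quoted above. The function names and the total sample size of 1,000 are hypothetical.

```python
def design_effect(cluster_size, icc):
    """Kish design effect for equal-sized clusters: 1 + (n - 1) * ICC."""
    return 1.0 + (cluster_size - 1) * icc


def effective_sample_size(n_total, cluster_size, icc):
    """Number of independent observations the nested sample is 'worth'."""
    return n_total / design_effect(cluster_size, icc)


# Hypothetical study: 1,000 observations in clusters of 20.
for icc in (0.05, 0.10, 0.20):
    deff = design_effect(20, icc)
    n_eff = effective_sample_size(1000, 20, icc)
    print(f"ICC={icc:.2f}  design effect={deff:.2f}  effective N={n_eff:.0f}")
```

For clusters of 20, the design effect is 1.95 at ICC = 0.05, 2.90 at ICC = 0.10, and 4.80 at ICC = 0.20: at the upper value, the naive variance of a cluster-level estimate is too small by a factor of nearly five.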
The operational consequence is severe. A manuscript reporting “p < 0.05” on nested data with ICC = 0.20 is, in effect, testing at a false-positive rate of around 33%, not 5%. A methodological reviewer familiar with this pattern will request reanalysis via MLM or via OLS with cluster-corrected standard errors, and the author who replied in the first round that “OLS is robust” loses the round.
The heuristic that sustains the decision
The operational rule that works in peer review has three bands.
ICC below 0.05 indicates that nested structure is negligible for inferential purposes. OLS with robust standard errors may suffice, but the decision needs to be justified with the reported ICC.
ICC between 0.05 and 0.20 indicates non-trivial structure. The choice is between MLM and OLS with cluster-robust standard errors (Cameron and Miller 2015). Both are defensible; the choice depends on whether the analytical focus includes between-cluster variance as an object of interest.
ICC above 0.20 indicates that nested structure is central. MLM is the required choice.
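To make the OLS alternative concrete, here is a minimal sketch of a cluster-robust (sandwich) standard error for the slope of a simple one-predictor regression. It assumes the basic estimator without the finite-sample corrections that statistical software typically applies, so it will not match packaged output exactly. The function name and toy data are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def slope_with_cluster_se(x, y, cluster):
    """OLS slope of y on x with naive and cluster-robust standard errors.

    Returns (slope, naive_se, cluster_se). No small-sample correction is
    applied to the cluster-robust estimate.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]

    # Naive SE: assumes independent, homoskedastic errors.
    naive_se = sqrt(sum(e * e for e in resid) / (n - 2) / sxx)

    # Cluster-robust SE: sum the score (x - x_bar) * e within each cluster
    # first, so correlated residuals in a cluster are not treated as independent.
    score = defaultdict(float)
    for xi, ei, g in zip(x, resid, cluster):
        score[g] += (xi - x_bar) * ei
    cluster_se = sqrt(sum(s * s for s in score.values())) / sxx

    return slope, naive_se, cluster_se


# Hypothetical toy data: four observations in two clusters.
print(slope_with_cluster_se([0, 1, 2, 3], [0, 1, 1, 4], ["a", "a", "b", "b"]))
```

The toy data only illustrate the mechanics. With so few clusters the sandwich estimate is itself unreliable, which is one reason the number of clusters belongs among the decision criteria.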
The nuance competent reviewers add is that the ICC rule is not the only criterion. Number of clusters matters: MLM with fewer than 20 clusters produces unstable variance-component estimates. Homogeneous cluster size enables simplifications; heavily unbalanced cluster size complicates estimation. Analytical intent matters: studying between-cluster variation as a construct of interest requires MLM even at low ICC.
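The three bands and the complementary criteria can be sketched as a small decision helper. The function name, its arguments, and the wording of the returned strings are illustrative assumptions; the thresholds are the ones stated above, and cluster-size imbalance is left out because the text gives no cut-off for it.

```python
def recommend_analysis(icc, n_clusters, between_cluster_focus=False):
    """Return (recommendation, notes) following the ICC bands and caveats."""
    notes = []
    if n_clusters < 20:
        notes.append(
            "fewer than 20 clusters: variance-component estimates may be unstable"
        )

    if between_cluster_focus:
        # Analytical intent overrides the ICC bands.
        notes.append("between-cluster variation is a construct of interest")
        return "MLM", notes

    if icc < 0.05:
        recommendation = "OLS with robust standard errors (report the ICC)"
    elif icc <= 0.20:
        recommendation = "MLM or OLS with cluster-robust standard errors"
    else:
        recommendation = "MLM"
    return recommendation, notes


# Hypothetical study: ICC of 0.12 estimated from 15 schools.
print(recommend_analysis(0.12, 15))
```

The helper is a checklist in code form and does not replace judgment: it returns the caveats alongside the recommendation so that both can be reported.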
What to report in a manuscript
The methods section of a manuscript with nested data should report, in the order reviewers check: (1) the nested structure made explicit, that is, how many levels, how many clusters at each level, and the distribution of cluster sizes; (2) the ICC calculated from the null (unconditional) model; (3) the justification of the analytical choice based on ICC and complementary criteria; (4) the models actually fitted, with named fixed and random effects; and (5) the fit criteria used in model comparison (deviance, AIC, BIC).
The cost of doing this sequence correctly is low. The cost of not doing it is an extra revision round, or a rejection motivated by methodological inadequacy.