A p-value alone won't cut it: what Q1 reviewers read in your results section

Q1 journals did not ban the p-value. They banned the p-value standing alone. The distinction is small in wording and enormous in what it asks of researchers when they write up results. In 2016, the American Statistical Association issued the first institutional position statement in its history on a specific aspect of statistical practice. Since then, editorial practice at top-tier journals in psychology, the social sciences, public health, and the biomedical sciences has moved toward a minimum reporting package that reviewers now look for automatically: effect size, confidence interval, justification of statistical power, and substantive interpretation kept distinct from inferential interpretation. A researcher who still presents results as “p < 0.05, therefore significant” is delivering a 2010 results section to a venue that reads in 2026.

Figure: Cohen's d magnitude in observed empirical research (forest plot). Cohen's (1988) traditional benchmarks and the typical range documented by Funder and Ozer (2019) cover values between 0.20 and 1.30; the four outliers identified by Fricker and colleagues (2019) in manuscripts published in BASP in 2016, after the p-value ban, sit between d = 2.85 and d = 3.71, more than twice the upper end of that range.

The contrast in the chart illustrates the problem the statistical community came to recognize publicly from 2016 onward. When authors remove significance inference and keep only the magnitude of the effect, without complementary editorial discipline, estimates start drifting substantially away from what the field’s literature supports as plausible. The minimum package Q1 reviewers look for today exists to prevent exactly that scenario, and it answers four distinct questions that no isolated p-value can address.

What the ASA did in 2016 (and what kept happening after)

The statement and its six principles

The American Statistical Association had never issued a position document on a specific statistical practice in its 177-year history. The 2016 statement, drafted by Ronald Wasserstein and Nicole Lazar and built over more than a year by a group of over twenty specialists with deliberately divergent viewpoints, was the first ¹. The document articulates six principles. They hold that p-values can indicate how incompatible the data are with a specified statistical model; that they do not measure the probability that a hypothesis is true; that they do not measure the size of an effect or the importance of a finding; that a scientific decision should not rest only on whether a p-value crosses a fixed threshold; that a p-value reported in isolation provides weak evidence; and that proper inference requires full reporting and transparency.

A literal reading of the six principles shows that the ASA did not ban the p-value: it asked that the p-value stop being treated as evidence of something it is not. Three years later, in an entire supplementary issue of The American Statistician dedicated to the topic, Wasserstein, Schirm, and Lazar added one further instruction: stop using the term “statistically significant” ². The supplement carried 43 articles offering concrete proposals for how to report inference without falling into the significant/not-significant dichotomy. The direction is clear: keep computing p-values, but stop using them as the switch that separates relevant from irrelevant findings.

The BASP case and what journals did afterward

Before the ASA spoke, the journal Basic and Applied Social Psychology had taken a more radical decision. In February 2015, the editorial by Trafimow and Marks announced the complete ban of null hypothesis significance testing, prohibiting p-values, t-values, F-values, and even confidence intervals in accepted manuscripts. One year after the ban took effect, Ronald Fricker and colleagues audited the 31 articles published in BASP in 2016 and found a troubling pattern: with no significance testing in place, authors frequently sustained conclusions stronger than the data would have supported if inferential statistics had been applied ³. Readers, lacking access to the testing apparatus, had no way to recognize the fragility of the results.

The BASP experiment serves less as a model to follow and more as empirical evidence that the central problem is not the p-value itself: it is what gets done with it. Hence the approach that gained more traction: not a ban, but a requirement that p-values be accompanied by substantive information about the magnitude and precision of the effect. McShane, Gal, Gelman, Robert, and Tackett went further and proposed dropping the notion of “statistical significance” as a binary category, treating the p-value as one piece of continuous evidence alongside others: effect magnitude, plausibility of mechanism, design quality, real-world costs ⁴. There are meaningful counterpoints. Leek and Peng argued in a Nature commentary that p-values are merely the tip of the iceberg, and that the real problem lies in every decision earlier in the research pipeline (experimental design, confounder control, measurement quality) that no p-value ban resolves ⁵. The point holds: the problem runs deeper than the p-value. But that is precisely what justifies the minimum reporting package Q1 reviewers look for today.

The minimum package Q1 reviewers look for today

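A reviewer at a Q1 journal opens a results section looking for four things: effect size, confidence interval, power justification, and substantive interpretation kept distinct from inferential interpretation. The first three are taken in turn below; the fourth runs through each of them. When one is missing, the verdict moves to major revision or rejection.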

Effect size as an argument of relevance

The thing most often missing. Effect size is the magnitude of the finding: Cohen’s d, η², Pearson’s r, odds ratio, mean difference in original units. Without it, the reviewer cannot tell whether a significant effect is large enough to matter. An effect that crosses p < 0.05 with n = 10,000 may be trivially small in magnitude. An effect that does not cross p < 0.05 with n = 30 may be substantively relevant but studied with too little power to be detected. The p-value does not distinguish these two cases; effect size does. Nakagawa and Cuthill established the canonical argument in biology: null hypothesis significance testing fails to provide two essential pieces of information, the magnitude of the effect of interest and the precision of that magnitude. They advocate reporting effect sizes and confidence intervals in all biological journals ⁶. The argument generalizes.
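The contrast between the two cases can be sketched numerically. The snippet below is a minimal illustration, not a recommended test: it assumes two independent groups of equal size, uses a normal approximation in place of the t distribution, and the effect sizes (d = 0.06 and d = 0.50) are hypothetical values chosen for the example.

```python
from math import sqrt
from statistics import NormalDist


def approx_two_sided_p(d: float, n_per_group: int) -> float:
    """Normal-approximation p-value for a standardized mean difference d
    between two independent groups of n_per_group observations each."""
    z = d * sqrt(n_per_group / 2)  # d divided by its approximate standard error
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical trivial effect, total n = 10,000 (5,000 per group).
p_trivial_large_n = approx_two_sided_p(d=0.06, n_per_group=5000)

# Hypothetical moderate effect, total n = 30 (15 per group).
p_moderate_small_n = approx_two_sided_p(d=0.50, n_per_group=15)

print(f"d = 0.06, n = 10,000: p = {p_trivial_large_n:.4f}")
print(f"d = 0.50, n = 30:     p = {p_moderate_small_n:.4f}")
```

Under these assumptions the trivial effect comes out at p of about 0.003 and the moderate effect at p of about 0.17, so the p-value ranks the two findings in the opposite order from their magnitude.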

Reporting effect size without interpreting it is only half the work. Funder and Ozer show that using arbitrary benchmarks (small/medium/large in Cohen’s convention) is often misleading, and that effect sizes only become meaningful when compared with well-understood benchmarks from the specific literature or with concrete consequences ⁷. An r of 0.10 that looks “small” by Cohen’s table may be substantively consequential at population scale. An r of 0.40 that looks “large” may be an overestimate from a small sample, unlikely to hold up under replication.

Confidence interval as an argument of precision

The second thing missing. The confidence interval reports precision: how narrow or wide the plausible range is for the true population value of the parameter. Hespanhol and colleagues explain that reporting the interval lets the reader judge whether a significant finding is compatible with clinically meaningful effects, or whether it covers so wide a range that it spans from irrelevant to large effects, in which case the result may be “significant” while its precision is too low to sustain practical inference ⁸. The confidence interval is not decoration; it is the honest reading of uncertainty around the finding.

Trained reviewers know to read the interval two ways: its width (precision) and the values it covers (substantive relevance). A 95% CI spanning from clinically trivial to clinically important is a different finding from a narrow CI centered on an important value, even when both cross the significance threshold.
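The two readings of an interval can be made concrete with a short sketch. Everything numeric here is hypothetical: the mean difference of 5.0 units, the two standard errors, and the minimal important difference of 4.0 units are illustration values, and the interval uses a normal approximation, not an exact t-based calculation.

```python
from statistics import NormalDist


def mean_diff_ci(diff: float, se: float, level: float = 0.95) -> tuple[float, float]:
    """Normal-approximation confidence interval for a mean difference."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return diff - z * se, diff + z * se


def describe(ci: tuple[float, float], minimal_important: float) -> str:
    """Classify an interval against a (hypothetical) minimal important difference."""
    lo, hi = ci
    if lo >= minimal_important:
        return "whole interval at or above the minimal important difference"
    if hi < minimal_important:
        return "whole interval below the minimal important difference"
    return "interval spans trivial and important values"


# Same point estimate, different precision (hypothetical numbers).
wide = mean_diff_ci(diff=5.0, se=2.4)
narrow = mean_diff_ci(diff=5.0, se=0.4)

for name, ci in (("wide", wide), ("narrow", narrow)):
    print(f"{name}: [{ci[0]:.2f}, {ci[1]:.2f}] -> {describe(ci, minimal_important=4.0)}")
```

Both intervals exclude zero, so both findings would be called significant, yet only the narrow one supports a claim that the effect is large enough to matter.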

Power justification as an argument of design

The third thing missing, and the one that betrays poorly planned projects before the analysis even begins. Justification of statistical power answers a question the reviewer asks in the second paragraph of the methods: given the expected effect size for this type of phenomenon, is this n adequate to detect it with acceptable probability? Without that justification, the reader cannot distinguish a genuine null finding (effect does not exist) from a null finding by underpowering (effect exists but the sample was too small to detect).

Daniël Lakens organizes six defensible approaches to justifying a sample size, of which a priori power analysis is only one: collecting data from the entire population, resource constraints, planning for desired accuracy, use of heuristics, and even explicit acknowledgment of the absence of justification are all legitimate when well articulated ⁹. The reviewer’s expectation is not that every analysis needs a formal power calculation; it is that every analysis needs an explicit narrative about why the chosen n is sufficient for the declared inferential goals.
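For the a priori power analysis route, the arithmetic behind a sample size claim can be sketched as follows. This is a simplified illustration under stated assumptions: a two-sided comparison of two independent groups of equal size, a normal approximation (exact t-based software typically returns one or two more participants per group), and conventional but arbitrary defaults of α = 0.05 and 80% power. The function names are made up for the example.

```python
from math import ceil, sqrt
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def n_per_group(d: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate participants per group needed to detect a standardized
    difference d with a two-sided test (normal approximation)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    z_power = _STD_NORMAL.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) / d) ** 2)


def achieved_power(d: float, n: int, alpha: float = 0.05) -> float:
    """Approximate power with n participants per group, ignoring the
    negligible probability of rejecting in the wrong direction."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    return _STD_NORMAL.cdf(d * sqrt(n / 2) - z_alpha)


print(n_per_group(0.5))          # per-group n for a hypothetical expected d = 0.5
print(n_per_group(0.2))          # per-group n for a hypothetical expected d = 0.2
print(achieved_power(0.5, 15))   # power of a 15-per-group study if d = 0.5
```

The sketch makes the reviewer's question explicit: the required n depends entirely on the expected effect size, so the justification has to say where that expectation comes from.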

The three patterns that sink the results section

Q1 reviewers recognize three quick patterns that signal methodological fragility. One in a manuscript is a warning; two is major revision; three is desk reject or rejection after the first round.

P-value alone as proof of effect

Sentences like “there was a significant difference between groups (p = 0.03)” with no accompanying effect size, no confidence interval, no substantive interpretation. The reviewer reads this and automatically asks: difference of what magnitude, in what direction, with what precision, and is it a difference that matters in the literature of the field? When the answer is not in the manuscript, the problem is not stylistic. It is that the author does not know whether the finding is substantive or an artifact of a large sample applied to a trivial effect.

Effect size without substantive interpretation

The opposite signal, and nearly as common: the author reports a d = 0.42 or an R² = 0.15 and moves to the next result without explaining what that means in the context of the field’s literature. The trained reviewer knows that d = 0.42 in a field where the literature sustains average effects between 0.10 and 0.20 is a notable result; the same d = 0.42 in a field where average effects circle 0.60 is below average. Effect size only speaks when contextualized.

Multiple tests without declared correction

The pattern most consistently signaling p-hacking, intentional or not. When a manuscript reports ten, twenty, thirty hypothesis tests with no mention of either multiple-comparison correction or why correction does not apply, the reviewer does the arithmetic: with α = 0.05 and ten independent tests, the probability of at least one false positive approaches 40%. Streiner synthesizes the practical options and when each one applies: Bonferroni, Holm, Hochberg, false discovery rate control via Benjamini-Hochberg, and resampling-based methods ¹⁰. The point is not that every multiple analysis needs correction. It is that every multiple analysis needs an explicit discussion of why correction was or was not applied, in what family of tests, and under what rationale.
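The reviewer's arithmetic, and two of the corrections named above, can be sketched in a few lines. This is an illustrative implementation written for this example, not code from Streiner's paper; the five p-values are hypothetical, and in practice a maintained statistics library would normally be used instead of hand-rolled functions.

```python
def familywise_error(alpha: float, m: int) -> float:
    """Probability of at least one false positive across m independent tests."""
    return 1 - (1 - alpha) ** m


def holm(pvals: list[float]) -> list[float]:
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[idx])
        adjusted[idx] = min(1.0, running_max)
    return adjusted


def benjamini_hochberg(pvals: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values (FDR control), in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p-value down
        idx = order[rank]
        running_min = min(running_min, pvals[idx] * m / (rank + 1))
        adjusted[idx] = running_min
    return adjusted


print(f"FWER for 10 tests at alpha = 0.05: {familywise_error(0.05, 10):.3f}")

raw = [0.001, 0.012, 0.021, 0.035, 0.20]  # hypothetical p-values
print("Holm:", [round(p, 4) for p in holm(raw)])
print("BH:  ", [round(p, 4) for p in benjamini_hochberg(raw)])
```

With these hypothetical values, four raw p-values fall below 0.05, two survive Holm, and four survive Benjamini-Hochberg, which is why the manuscript has to state which error rate it controls and over which family of tests.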

Stefan and Schönbrodt compiled twelve documented p-hacking strategies and simulated each one’s impact on false-positive rates ¹¹. Much of what they document leaves no visible trace in the final manuscript. Selective outlier exclusion, late choice of statistical test, stopping data collection after a significant result: all invisible in the writing, all leaving statistical traces that trained reviewers have learned to recognize. Altman and Krzywinski showed in a Nature Methods column that even reporting confidence intervals does not correct for selecting the most significant result among multiple tests: the interval of the “winner” has coverage below nominal ¹². The defense is not stylistic, it is structural: declare how many tests were conducted, in what order, under which pre-specified hypotheses.
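One of the strategies mentioned above, stopping data collection once a result turns significant, is easy to simulate. The sketch below is a simplified illustration written for this article, not a reproduction of Stefan and Schönbrodt's simulations: it assumes a true null effect, normally distributed data with known standard deviation 1, a z-test, and an arbitrary schedule of five looks at the data.

```python
import random
from math import sqrt
from statistics import NormalDist


def p_value_one_sample(values: list[float]) -> float:
    """Two-sided z-test of mean = 0, assuming a known standard deviation of 1."""
    z = sum(values) / sqrt(len(values))
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(looks: list[int], n_sims: int = 2000,
                        alpha: float = 0.05, seed: int = 1) -> float:
    """Share of null simulations declared significant when the analyst tests
    at each sample size in `looks` and stops at the first p < alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        data: list[float] = []
        for n in looks:
            while len(data) < n:
                data.append(rng.gauss(0.0, 1.0))  # the true effect is zero
            if p_value_one_sample(data) < alpha:
                hits += 1
                break
    return hits / n_sims


fixed = false_positive_rate(looks=[50])
peeking = false_positive_rate(looks=[10, 20, 30, 40, 50])
print(f"fixed n = 50:       {fixed:.3f}")
print(f"peek every 10 obs.: {peeking:.3f}")
```

The fixed design stays near the nominal 5%, while the peeking design produces a clearly higher false-positive rate, even though the final manuscript in both cases would report a single test at n of 50 or fewer.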

How to rewrite the results section thinking about Q1 reviewers

Calin-Jageman and Cumming synthesized the editorial framework in three questions that organize the textual genre of the results section in contemporary Q1 publishing: how much, how uncertain, what else is known ¹³. The first question demands effect size; the second demands confidence interval and power justification; the third demands integration with the literature, meta-analytic when available, comparative when not. When the three elements appear together for each central finding, the results section starts reading as substantive inference. When one is missing, the reviewer perceives it immediately.
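As a writing aid, the three questions can be turned into a template that refuses to produce a sentence unless all the pieces are present. The helper below is a hypothetical sketch, not a format prescribed by Calin-Jageman and Cumming or by any journal; the function name, the field names, and the numbers in the usage example are invented for illustration.

```python
def report_result(label: str, effect_name: str, estimate: float,
                  ci_low: float, ci_high: float, p_value: float,
                  context: str) -> str:
    """Assemble one results sentence covering how much (estimate),
    how uncertain (interval), and what else is known (context)."""
    if not context.strip():
        raise ValueError("context is required: say how the estimate compares "
                         "with the field's literature")
    if not ci_low <= estimate <= ci_high:
        raise ValueError("estimate must lie inside its confidence interval")
    return (f"{label}: {effect_name} = {estimate:.2f}, "
            f"95% CI [{ci_low:.2f}, {ci_high:.2f}], p = {p_value:.3f}. {context}")


# Hypothetical numbers, for illustration only.
sentence = report_result(
    label="Intervention vs. control",
    effect_name="d",
    estimate=0.42,
    ci_low=0.12,
    ci_high=0.72,
    p_value=0.006,
    context="The interval runs from a small effect to one well above "
            "the range typical for this literature.",
)
print(sentence)
```

The design choice is deliberate: making the context argument mandatory encodes the rule that an effect size without interpretation is an incomplete report.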

Editorial direction at specific Q1 journals has moved this way. In October 2023, The Journal of Physiology published an editorial by Williams, Carson, and Tóth explicitly recommending that authors submitting to the journal report effect size and confidence interval alongside any p-value, and framing this expectation as editorial direction, not as a cosmetic request ¹⁴. The move replicates across other venues. Reviewers trained at these journals carry this expectation when assessing manuscripts elsewhere.

The rewriting that sustains the modern editorial argument is not cosmetic. It demands reorganizing the presentation of each result around magnitude, precision, and context, and consciously separating what is inferential evidence from what is substantive interpretation. That requires time, statistical expertise calibrated to the editorial state of the art, and a sensitive reading of what each target journal is looking for.

This analysis reflects Aria's practice in Statistical Analysis and Revision and Rewriting.
