Extended definition
GRADE, short for Grading of Recommendations Assessment, Development and Evaluation, is a system for rating the certainty of a body of evidence and the strength of a recommendation. Its central innovation, described by Guyatt and colleagues (2008), is to separate two things that were previously blended: how confident we are in the effect estimate, and how strong the recommendation that follows from it is. The certainty of the evidence is graded into four levels (high, moderate, low, and very low), and Balshem and colleagues (2011) make explicit that the rating applies to a set of studies per outcome, not to an isolated study. The starting point is the study design: randomized trials begin as high certainty, observational studies as low. From there, certainty can be downgraded for risk of bias, inconsistency, indirectness, imprecision, or publication bias, and upgraded for a large effect, a dose-response gradient, or plausible residual confounding that would, if anything, have reduced the observed effect. Guyatt and colleagues (2011) organize the result into evidence profiles and summary-of-findings tables.
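The bookkeeping behind a rating can be sketched in a few lines of Python. This is a minimal illustration, not an official GRADE tool: the class `OutcomeRating`, its fields, and the example outcomes are hypothetical names introduced here. The number of levels moved in each domain is a reviewer judgment entered by hand; the code only records those judgments and keeps the result within the four levels.

```python
from dataclasses import dataclass, field

# Four certainty levels, ordered from lowest to highest.
LEVELS = ("very low", "low", "moderate", "high")

# Domains named in the definition above.
DOWNGRADE_DOMAINS = frozenset({
    "risk of bias", "inconsistency", "indirectness",
    "imprecision", "publication bias",
})
UPGRADE_DOMAINS = frozenset({
    "large effect", "dose-response gradient", "plausible confounding",
})


@dataclass
class OutcomeRating:
    """Hypothetical record of reviewer judgments for ONE outcome."""
    outcome: str
    randomized: bool  # True if the body of evidence comes from randomized trials
    downgrades: dict = field(default_factory=dict)  # domain -> levels judged down
    upgrades: dict = field(default_factory=dict)    # domain -> levels judged up

    def certainty(self) -> str:
        unknown = (set(self.downgrades) - DOWNGRADE_DOMAINS) | (
            set(self.upgrades) - UPGRADE_DOMAINS
        )
        if unknown:
            raise ValueError(f"unknown domain(s): {sorted(unknown)}")
        # Randomized trials start as high, observational studies as low.
        start = LEVELS.index("high" if self.randomized else "low")
        index = start - sum(self.downgrades.values()) + sum(self.upgrades.values())
        # Clamp to the four available levels.
        return LEVELS[max(0, min(len(LEVELS) - 1, index))]


mortality = OutcomeRating("mortality", randomized=True,
                          downgrades={"imprecision": 1})
bleeding = OutcomeRating("major bleeding", randomized=False,
                         upgrades={"large effect": 1})
print(mortality.certainty())  # moderate
print(bleeding.certainty())   # moderate
```

The arithmetic is deliberately trivial. What matters in practice is the reasoning behind each entry in `downgrades` and `upgrades`, which the sketch takes as given.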
When it applies
GRADE applies to the development of clinical guidelines and to evidence synthesis, where a set of studies must be turned into a statement about how much the effect can be trusted. It applies outcome by outcome: certainty can be high for a benefit and low for a harm in the same body of evidence. It applies when distinguishing certainty from strength: a strong recommendation can rest on moderate-certainty evidence if values and preferences and the balance of benefits and harms are clear, and a weak recommendation can accompany high-certainty evidence when that balance is uncertain. It applies to transparent communication, with summary-of-findings tables making explicit the reason for each level. It is today the standard adopted by many of the organizations that produce guidelines.
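The separation of certainty from strength can be made concrete with a small record type. The following Python sketch is illustrative only: `Recommendation`, `summarize`, and the two example statements are hypothetical. It shows that strength is stored as its own judgment, with its own rationale, and is never computed from certainty.

```python
from dataclasses import dataclass
from enum import Enum


class Certainty(Enum):
    HIGH = "high"
    MODERATE = "moderate"
    LOW = "low"
    VERY_LOW = "very low"


class Strength(Enum):
    STRONG = "strong"
    WEAK = "weak"


@dataclass(frozen=True)
class Recommendation:
    """Hypothetical record: certainty and strength are independent fields."""
    statement: str
    certainty: Certainty  # how much the effect estimate can be trusted
    strength: Strength    # judged from balance of effects, values, preferences
    rationale: str        # why the panel chose this strength


def summarize(rec: Recommendation) -> str:
    """Render a one-line label for a recommendation."""
    return (f"{rec.strength.value.capitalize()} recommendation, "
            f"{rec.certainty.value} certainty: {rec.statement}")


# A strong recommendation on moderate-certainty evidence ...
a = Recommendation("use intervention A", Certainty.MODERATE, Strength.STRONG,
                   "clear net benefit; patient preferences are consistent")
# ... and a weak recommendation despite high-certainty evidence.
b = Recommendation("consider intervention B", Certainty.HIGH, Strength.WEAK,
                   "benefits and harms are closely balanced")

print(summarize(a))
print(summarize(b))
```

Keeping a `rationale` field next to `strength` mirrors the point made above: the strength of a recommendation has to be argued for, not read off the certainty level.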
When it does not apply
GRADE does not apply to an individual study: it assesses a body of evidence per outcome, and using it to label a single article distorts the method. It does not apply as a mechanical calculation: the decisions to downgrade or upgrade certainty involve structured judgment, and treating them as an automatic formula produces falsely objective ratings. It does not apply as the risk-of-bias assessment of a study, which is one of GRADE’s inputs, not its object; confusing the two erases the distinction between the appraisal of a single study and the certainty of the body of evidence. It does not apply without first defining the critical outcomes: grading everything without prioritizing dilutes the analysis. And it does not apply as a substitute for judgment about values and preferences, which enter the strength of the recommendation and cannot be deduced from the certainty of the evidence alone.
Applications by field
- Clinical guidelines: the standard for linking the certainty of evidence to the strength of practice recommendations.
- Systematic reviews: rating of certainty by outcome, with a summary-of-findings table.
- Public health and policy: decisions where benefits, harms, and uncertainty must be weighed transparently.
- Health technology assessment: structured synthesis that informs coverage and recommendation.
Common pitfalls
The first pitfall is applying GRADE to an isolated study, when it assesses the body of evidence per outcome. The second is confusing certainty of evidence with strength of recommendation, treating them as if they always moved together. The third is mechanizing the downgrades and upgrades, losing the structured judgment the method requires. The fourth is confusing GRADE with risk-of-bias assessment, when the latter is only one of the five domains that can downgrade certainty. The fifth is grading all outcomes without first defining which are critical, spreading the analysis across outcomes that do not decide the recommendation.