
Generative AI in Systematic Review: Tool or Shortcut?

Generative AI speeds up the systematic review, but it becomes a shortcut the moment it replaces, rather than assists, human judgment under a documented protocol. The data show why: LLM screeners trade sensitivity for specificity. What makes the use legitimate is the protocol: pre-registration, validation, the model as a second screener with human arbitration, and reporting of prompt, model and version.

Generative AI entered the systematic review through the screening door, the stage where thousands of titles and abstracts must be read to decide what enters the synthesis. The promise is real: a language model reads in minutes what a team takes weeks to examine. But the same ease that makes the tool attractive is what turns it into a shortcut, and the difference between the two lies not in the model but in the protocol. A systematic review is, by definition, a transparent and reproducible method; using a language model without subjecting it to that method trades rigor for speed and hands the reader a synthesis whose selection went unaudited.

The starting point is to understand what these models actually do in screening, and that has been measured. Sanghera and colleagues (2025) [1] compared six language models on title and abstract screening, replicating inclusion decisions from 23 Cochrane reviews over a balanced set of 800 abstracts.

Figure: grouped bar chart of sensitivity and specificity for four language models in abstract screening, from the Sanghera and colleagues (2025) measurement on the 800-abstract set. GPT-3.5 scores 1.000 and 0.393; GPT-4, 0.605 and 0.975; GPT-4o, 0.911 and 0.896; Sonnet 3.5, 0.819 and 0.966. No model is high on both: GPT-3.5's perfect recall (1.000) comes with a specificity of 0.393, which excludes fewer than four in ten irrelevant records.

How to read these numbers is the central argument. GPT-3.5 reached perfect sensitivity, 1.000, missing no relevant study; but its specificity collapsed to 0.393, meaning it lets about 61% of irrelevant records through and so saves little of the work that should justify its use. GPT-4 inverts the picture, with 0.605 sensitivity and 0.975 specificity, selective to the point of discarding relevant records. GPT-4o balances the two at 0.911 and 0.896, and Sonnet 3.5 lands at 0.819 and 0.966. No model is at once safe enough to miss no studies and selective enough to cut work. That is exactly the gap human judgment has to cover, and the reason the model is a screener, not a decider.
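The trade-off can be restated in reviewer terms with a short sketch. It uses the four sensitivity and specificity pairs quoted above; the prevalence of 0.5 is an assumption (the set is described as balanced, but the exact share of relevant records is not given here), and `screening_tradeoff` is a hypothetical helper written for illustration, not something taken from the study.

```python
# Sensitivity and specificity as quoted in the text (Sanghera and colleagues, 2025).
REPORTED = {
    "GPT-3.5": (1.000, 0.393),
    "GPT-4": (0.605, 0.975),
    "GPT-4o": (0.911, 0.896),
    "Sonnet 3.5": (0.819, 0.966),
}


def screening_tradeoff(sensitivity: float, specificity: float, prevalence: float) -> dict:
    """Translate sensitivity/specificity into two reviewer-facing shares.

    prevalence is the share of truly relevant records in the corpus.
    It is supplied by the caller and is an assumption, not a study figure.
    """
    kept_relevant = sensitivity * prevalence
    kept_irrelevant = (1.0 - specificity) * (1.0 - prevalence)
    return {
        # Share of the relevant studies the model would wrongly exclude.
        "missed_relevant": 1.0 - sensitivity,
        # Share of the whole corpus the model passes on for human reading.
        "still_to_read": kept_relevant + kept_irrelevant,
    }


for name, (sens, spec) in REPORTED.items():
    result = screening_tradeoff(sens, spec, prevalence=0.5)  # assumed prevalence
    print(
        f"{name:<11} missed={result['missed_relevant']:.3f} "
        f"still_to_read={result['still_to_read']:.3f}"
    )
```

Under the assumed prevalence, GPT-3.5 still passes roughly 80% of the corpus to the human reviewers, while GPT-4 passes roughly 32% but misses roughly 40% of the relevant studies. In a real review relevant records are usually far rarer, so the `prevalence` argument should be set from the validation subset rather than left at 0.5.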

There is a subtler problem still: apparent performance can be an artifact of measurement. Khraisha and colleagues (2024) [2], in a pre-registered human-out-of-the-loop evaluation, found apparent parity with humans that vanished once chance agreement and dataset imbalance were discounted. In other words, a high number without the right adjustment can announce a competence that is not there. That is why model screening has to be validated against a human gold standard before it is trusted, not accepted on the raw metric.
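A small sketch shows how a raw metric can flatter a screener on an imbalanced set. The confusion-matrix counts below are invented for illustration and are not taken from Khraisha and colleagues; Cohen's kappa and balanced accuracy are used here as two common adjustments for chance agreement and imbalance, which may differ from the statistics the study itself reported.

```python
def raw_agreement(tp: int, fp: int, fn: int, tn: int) -> float:
    """Share of records on which model and human gold standard agree."""
    return (tp + tn) / (tp + fp + fn + tn)


def cohen_kappa(tp: int, fp: int, fn: int, tn: int) -> float:
    """Agreement corrected for the agreement expected by chance alone."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement: both say "include" plus both say "exclude",
    # computed from each rater's marginal rates.
    chance_include = ((tp + fp) / n) * ((tp + fn) / n)
    chance_exclude = ((fn + tn) / n) * ((fp + tn) / n)
    expected = chance_include + chance_exclude
    if expected == 1.0:
        return 0.0  # degenerate case: one label only, kappa is undefined
    return (observed - expected) / (1.0 - expected)


def balanced_accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """Mean of sensitivity and specificity, insensitive to class imbalance."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2.0


# Hypothetical counts: 1000 records, only 50 truly relevant.
counts = dict(tp=20, fp=20, fn=30, tn=930)

print(f"raw agreement     = {raw_agreement(**counts):.3f}")
print(f"Cohen's kappa     = {cohen_kappa(**counts):.3f}")
print(f"balanced accuracy = {balanced_accuracy(**counts):.3f}")
```

With these invented counts the raw agreement is 0.95, yet kappa is about 0.42 and balanced accuracy about 0.69, because the screener finds only 20 of the 50 relevant records. The high raw figure comes almost entirely from the easy majority of irrelevant records.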

What separates the tool from the shortcut is, then, a verifiable protocol, and the literature already describes it. Oami and colleagues (2024) [4] show that the screening result is a function of the prompt: sensitivity jumped from 0.75 to 0.91 with one change to the instruction, which means the prompt is a methodological decision and must be pre-specified and reported, not improvised. Cao and colleagues (2025) [5] take that to its correct extreme, developing and validating generic prompts across ten reviews, with 97.7% sensitivity against the near-random performance of zero-shot prompts. And Guo and colleagues (2023) [3] frame the legitimate use: the model as an aid that prioritizes the workflow and explains its decisions, never as a replacement for the reviewer.

Screening is, even so, the easiest stage for a model, because it is a binary decision with explicit criteria. The later stages are less forgiving. Data extraction requires reading tables, reconciling units and locating the right number in a figure, and that is exactly where model performance falls and the errors begin to contaminate the meta-analysis, not just the selection. Risk-of-bias assessment depends on fine methodological judgment, the kind of decision a model can imitate but not reliably sustain. Treating competence at screening as proof of competence at these stages is the error that turns a useful tool into a generator of unfounded synthesis.

There is also the matter of reproducibility, which is the heart of the systematic review. A proprietary model can change version without notice, and the same instruction can return different decisions across runs, because generation is not deterministic. A review that does not record the model, the version, the date and, when available, the seed cannot be reproduced or audited, and loses the property that distinguishes it from an informal search. Fixing and reporting these parameters is not bureaucracy; it is what keeps the synthesis verifiable.
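One way to make that record concrete is a small, immutable run log written alongside the screening output. The sketch below is an assumption about how such a log might look, not a reporting standard: the class name, field names and example values are all hypothetical, and the prompt is stored as a SHA-256 digest so that the exact text in the supplement can be checked against what was actually run.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class ScreeningRunRecord:
    """Hypothetical minimal record of one model screening run."""
    model: str
    model_version: str
    run_date: str            # ISO 8601 date of the run
    prompt_sha256: str       # digest of the exact prompt text used
    seed: Optional[int] = None  # only when the provider exposes one


def make_record(model: str, model_version: str, run_date: str,
                prompt: str, seed: Optional[int] = None) -> ScreeningRunRecord:
    """Build a record, hashing the prompt so later edits are detectable."""
    digest = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
    return ScreeningRunRecord(model, model_version, run_date, digest, seed)


# Placeholder values for illustration only.
record = make_record(
    model="example-model",
    model_version="example-version",
    run_date="2025-01-01",
    prompt="Include the record only if it meets all eligibility criteria.",
    seed=None,
)

# Serialise with sorted keys so the log itself is stable across runs.
print(json.dumps(asdict(record), indent=2, sort_keys=True))
```

Hashing the prompt rather than trusting a file name means a single changed word in the instruction produces a different digest, which matters given how strongly the prompt is reported to affect sensitivity.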

The operating rule fits a sequence any reviewer can demand. Pre-register the model’s use, stating which model, which version and which prompt, because version and prompt change the result in measured ways. Validate the screener against a human-labeled subset before applying it to the whole corpus, reporting sensitivity and specificity together with agreement adjusted for chance and dataset imbalance. Use the model as a second parallel screener, or as an initial triage whose exclusions are always reviewed by a human, never as an autonomous excluder. Keep the human in the loop for the steps that require interpretation, such as data extraction and risk-of-bias assessment, where performance is most fragile. And report all of it in the methods section, with the prompts in the supplement. A systematic review that uses generative AI this way is still a systematic review; one that delegates selection to a model with no protocol is only a fast summary wearing the appearance of a method.
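The second-screener arrangement can be sketched as a simple routing rule. This is one possible reading of the rule above, under the assumption that a human and the model each vote independently on every record: agreement settles the record, and any disagreement goes to a human arbiter, so the model never excludes a record on its own. All names here are hypothetical.

```python
from collections import Counter
from enum import Enum
from typing import Iterable, Tuple


class Decision(Enum):
    INCLUDE = "include"
    EXCLUDE = "exclude"
    HUMAN_ARBITRATION = "human_arbitration"


def combine_votes(human_include: bool, model_include: bool) -> Decision:
    """Route one record given a human vote and a model vote.

    A record is excluded only when the human screener also excludes it;
    every disagreement is escalated to a human arbiter.
    """
    if human_include == model_include:
        return Decision.INCLUDE if human_include else Decision.EXCLUDE
    return Decision.HUMAN_ARBITRATION


def tally(votes: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count how many records land in each route."""
    return Counter(combine_votes(human, model) for human, model in votes)


# Invented (human_include, model_include) votes for five records.
votes = [(True, True), (False, False), (True, False), (False, True), (False, False)]

for decision, count in tally(votes).items():
    print(f"{decision.value}: {count}")
```

The size of the arbitration queue is itself worth reporting, since it shows how often the model and the human screener disagreed.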

References

  1. Sanghera, R.; Thirunavukarasu, A. J.; El Khoury, M.; et al. (2025). High-performance automated abstract screening with large language model ensembles. https://doi.org/10.1093/jamia/ocaf050
  2. Khraisha, Q.; Put, S.; Kappenberg, J.; Warraitch, A.; Hadfield, K. (2024). Can large language models replace humans in systematic reviews? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. https://doi.org/10.1002/jrsm.1715
  3. Guo, E.; Gupta, M.; Deng, J.; et al. (2023). Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study. https://doi.org/10.2196/48996
  4. Oami, T.; Okada, Y.; Nakada, T. A. (2024). Performance of a Large Language Model in Screening Citations. https://doi.org/10.1001/jamanetworkopen.2024.20496
  5. Cao, C.; Sang, J.; Arora, R.; et al. (2025). Development of Prompt Templates for Large Language Model-Driven Screening in Systematic Reviews. https://doi.org/10.7326/annals-24-02189

This analysis reflects Aria's practice in Generative AI Applied to Research and Bibliometric Analysis.
