Semantic embeddings for systematic review screening

Large-scale manual screening has a 5-12% human error rate and zero documented traceability. Semantic embeddings preserve recall above 90% and make every exclusion auditable against a declared threshold.

The first phase of a systematic review is title and abstract screening. The author retrieves an initial set from databases such as PubMed, Scopus, or Web of Science, typically between two thousand and ten thousand records, and must decide for each one whether to proceed to full-text reading. The unwritten operational heuristic is that two independent reviewers read all abstracts, with a third resolving disagreements. The cost is time: for a corpus of five thousand records at forty seconds of reading per abstract, that is approximately fifty-five hours of blind reading per reviewer, before the substantive work of the review even begins.
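The time estimate can be checked with a few lines of arithmetic. This is a minimal sketch; the function name `screening_hours` is hypothetical, and the inputs are the corpus size and reading speed quoted in the paragraph.

```python
def screening_hours(n_records: int, seconds_per_abstract: float) -> float:
    """Hours one reviewer spends reading every abstract once."""
    return n_records * seconds_per_abstract / 3600


# Figures from the text: five thousand records, forty seconds per abstract.
hours = screening_hours(5000, 40)
print(f"{hours:.1f} hours per reviewer")  # prints "55.6 hours per reviewer"
```

With two independent reviewers, the same corpus costs roughly twice that before any disagreement is resolved.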

There is a documented alternative in recent methodological literature. Pretrained semantic embeddings — Sentence-BERT, SPECTER2, or compact derived models — combined with classifiers trained over a seed set of human-annotated inclusions/exclusions can reduce screening volume by sixty to ninety percent, with recall consistently above ninety percent. The operation preserves human auditing and adds traceability that pure manual screening does not offer.

The basic architecture

The pipeline has four stages. The first is the conversion of each title plus abstract into a dense vector, typically of 384 to 768 dimensions, via a pretrained model. SPECTER2, trained specifically on a scientific corpus, tends to outperform generic Sentence-BERT for this use. The second stage is the manual annotation of a seed set of between ten and eighty records as eligible or not eligible under the review’s criteria. The third stage is the training of a classifier (logistic regression, linear SVM, or a small MLP) on the embeddings of the annotated records. The fourth stage is the application of the classifier to all remaining records, with a probability threshold calibrated to the desired recall level.
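The four stages can be sketched end to end in plain Python. This is an illustration under stated assumptions, not a production pipeline: toy three-dimensional vectors stand in for real embeddings, the logistic regression is a hand-written gradient descent rather than a library classifier, and every name (`train_logistic`, `threshold_for_recall`, the `rec_*` identifiers) is hypothetical.

```python
import math


def predict_proba(w, b, x):
    """Inclusion probability for one embedding under a logistic model."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=500):
    """Stage 3: fit logistic regression by batch gradient descent."""
    dim = len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * dim, 0.0
        for xi, yi in zip(X, y):
            err = predict_proba(w, b, xi) - yi
            grad_w = [g + err * xj for g, xj in zip(grad_w, xi)]
            grad_b += err
        w = [wj - lr * g / len(X) for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / len(X)
    return w, b


def threshold_for_recall(eligible_scores, target_recall):
    """Highest cutoff that still keeps target_recall of known eligible records."""
    ranked = sorted(eligible_scores, reverse=True)
    keep = math.ceil(target_recall * len(ranked))
    return ranked[keep - 1]


# Stage 1 stand-in: toy 3-dimensional vectors (hypothetical values) instead of
# real 384 to 768 dimensional embeddings from a pretrained model.
seed_vectors = [[0.9, 0.1, 0.2], [0.8, 0.2, 0.1], [0.7, 0.3, 0.3],
                [0.1, 0.9, 0.4], [0.2, 0.8, 0.5], [0.3, 0.7, 0.2]]
# Stage 2: human labels for the seed set (1 = eligible, 0 = not eligible).
seed_labels = [1, 1, 1, 0, 0, 0]
unlabeled = {"rec_A": [0.85, 0.15, 0.2],
             "rec_B": [0.15, 0.85, 0.3],
             "rec_C": [0.55, 0.45, 0.3]}

w, b = train_logistic(seed_vectors, seed_labels)

# Stage 4: score the remaining records and rank them by inclusion probability.
scores = {rid: predict_proba(w, b, vec) for rid, vec in unlabeled.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

# Calibrated on the seed's own eligible records for brevity; held-out
# annotated records would give a less optimistic cutoff.
eligible_scores = [predict_proba(w, b, x)
                   for x, y in zip(seed_vectors, seed_labels) if y == 1]
threshold = threshold_for_recall(eligible_scores, target_recall=0.95)
print(ranking, round(threshold, 3))
```

In a real review the vectors would come from an embedding model and the classifier from an established library; the shape of the data flow is the part this sketch is meant to show.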

The output of the process is not “include” or “exclude.” It is a ranking of unannotated records by inclusion probability. The human reviewer reads in decreasing order of probability until reaching a stopping criterion, typically thirty to fifty consecutive records that the model classifies as exclusions and the human confirms as exclusions.
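The stopping criterion can be written as a short loop. The sketch below assumes that a record breaks the streak whenever the human includes it or the model scored it at or above the threshold; the text does not settle that detail, and the function name, record identifiers, probabilities, and the 0.8 threshold are all hypothetical.

```python
def screen_ranked(ranked, threshold, human_includes, patience=30):
    """Read records in ranked order and stop after `patience` consecutive
    records that the model scored below `threshold` and the human excluded."""
    included, n_read, streak = [], 0, 0
    for record_id, probability in ranked:
        n_read += 1
        human_yes = human_includes(record_id)
        if human_yes:
            included.append(record_id)
        # The streak grows only when model and human both say "exclude".
        if probability < threshold and not human_yes:
            streak += 1
        else:
            streak = 0
        if streak >= patience:
            break
    return included, n_read


# Toy corpus: 200 records already sorted by decreasing inclusion probability.
ranked = [(f"r{i}", (200 - i) / 200) for i in range(200)]
truly_eligible = {"r0", "r3", "r7", "r15", "r38"}  # stands in for human judgment

included, n_read = screen_ranked(
    ranked, threshold=0.8, human_includes=lambda rid: rid in truly_eligible
)
workload_reduction = 1 - n_read / len(ranked)
print(included, n_read, f"{workload_reduction:.1%}")
```

A larger `patience` trades more reading for a lower chance of stopping before a late eligible record appears.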

Figure: Workload reduction in systematic review screening via semantic embeddings, in studies published between 2024 and 2025 (horizontal bar chart). Each bar represents the central finding of an independent study, with recall preserved above ninety percent in all cases. Results are consistent across Yamada and colleagues (2025) in JMIR Medical Informatics, Qin and colleagues in an earlier review, the few-shot learning framework with Sentence-BERT (Wang and colleagues, 2024), and compact LLMs such as GPT-4o mini, Llama 3.1, and Gemma 2 (Sciurti and colleagues, 2025). The highlighted category, BERT with component selection, reached the highest observed point: an 88.6 percent reduction.

Where the operation gains rigor

The main advantage of embeddings over manual screening is not just volume reduction. Assisted screening also has three complementary properties that increase the methodological rigor of the review.

The first is traceability. Each automated exclusion decision comes with a model-calculated inclusion probability. In manual screening, the record of why an abstract was excluded often reduces to “not relevant.” In assisted screening, there is a number, and the threshold below which exclusions are automatic is a declared methodological decision, not an implicit intuition.
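One way to make that traceability concrete is to store the probability and the declared threshold next to every decision. The sketch below is illustrative only: the class name, field names, decision labels, record identifiers, and the 0.10 threshold are hypothetical.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ScreeningDecision:
    record_id: str
    inclusion_probability: float
    threshold: float  # the declared cutoff below which exclusion is automatic

    @property
    def decision(self) -> str:
        if self.inclusion_probability < self.threshold:
            return "auto_exclude"
        return "human_review"


def audit_log(scores, threshold):
    """One auditable row per record: the probability, the cutoff, the outcome."""
    rows = []
    for record_id, probability in sorted(scores.items()):
        entry = ScreeningDecision(record_id, probability, threshold)
        rows.append({**asdict(entry), "decision": entry.decision})
    return rows


log = audit_log({"PMID_0001": 0.03, "PMID_0002": 0.41}, threshold=0.10)
print(json.dumps(log, indent=2))
```

Because the threshold travels with each row, a reader of the log can see exactly which declared rule produced each automatic exclusion.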

The second is reproducibility. Manual screening depends on the specific reviewer and on mood, time of day, and fatigue. A classifier trained on the same seed with the same settings produces the same output on any day. Systematic model errors can be characterized, debated, and corrected by expanding the seed.

The third is reviewer bias detection. When the classifier disagrees with a human on a specific case, the disagreement is itself information. It does not mean the model is correct, but it does reflect what was consistently annotated as inclusion in the training data. In a review with two manual reviewers, disagreement is resolved by a third human; in an assisted review, human-model disagreement can be audited against the seed set itself.
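One possible way to audit a disagreement against the seed set is to look up the seed records most similar to the disputed one and inspect how they were labeled. This is a sketch under assumptions, not a prescribed procedure: cosine similarity is a swapped-in choice, and the vectors, identifiers, and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_seed(vector, seed, k=2):
    """Return the k seed records most similar to `vector`, with human labels."""
    scored = [(cosine(vector, vec), seed_id, label)
              for seed_id, (vec, label) in seed.items()]
    return sorted(scored, reverse=True)[:k]


# Toy seed set: embedding stand-in plus the human annotation.
seed = {"seed_1": ([0.9, 0.1, 0.2], "include"),
        "seed_2": ([0.8, 0.2, 0.1], "include"),
        "seed_3": ([0.1, 0.9, 0.4], "exclude")}

# A record the human excluded but the model ranked as likely eligible.
disputed = [0.85, 0.15, 0.2]
neighbours = nearest_seed(disputed, seed)
for similarity, seed_id, label in neighbours:
    print(seed_id, label, round(similarity, 3))
```

If the closest seed records were all annotated as inclusions, the disagreement points either to an inconsistent human decision or to a seed that under-represents a legitimate exclusion reason.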

The methodological cost of not justifying

A systematic review published in 2026 that performs manual screening without justifying why it did not use assisted screening may receive methodological critique in peer review. The human error rate in title-abstract screening is documented at around five to twelve percent, a rate comparable to or higher than that of well-trained models. The choice of manual screening now requires defense, not presumption.

A reasonable defense of manual screening exists and is typically one of three: the corpus is small enough (under three hundred records) that the gain does not justify the infrastructure investment, the inclusion criteria involve complex qualitative judgment that classifiers trained on embeddings do not capture, or the review team has training and supervision that produce an error rate below that of state-of-the-art automated models. Without one of these defenses, manual screening on large reviews begins to look like a choice made by inertia.

References

  1. Yamada, T., et al. (2025). Improving Systematic Review Updates With Natural Language Processing Through Abstract Component Classification and Selection. https://doi.org/10.2196/65371
  2. Wang, S., et al. (2024). Development and Validation of a Literature Screening Tool: Few-Shot Learning Approach in Systematic Reviews. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11669879/
  3. Sciurti, A., Migliara, G., Siena, L. M., et al. (2025). Compact large language models for title and abstract screening in systematic reviews: An assessment of feasibility, accuracy, and workload reduction. https://doi.org/10.1017/rsm.2025.10044
  4. Reimers, N., & Gurevych, I. (2019). Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. https://doi.org/10.18653/v1/D19-1410
  5. Cohen, A. M., Hersh, W. R., Peterson, K., & Yen, P. Y. (2006). Reducing workload in systematic review preparation using automated citation classification. https://doi.org/10.1197/jamia.M1929

This analysis reflects Aria's practice in NLP and Text Mining and Complete Data Science Pipeline.
