AUC 0.95 won't publish in Q1: what reviewers read in medical computer vision manuscripts

The literature on computer vision applied to medical imaging has two published extremes, both in Q1 journals, both canonical in any literature review. On one side, models reporting metrics that approach specialist-level performance on curated datasets. On the other, studies demonstrating that those same metrics collapse when the model is evaluated on data from another hospital, another piece of equipment, or another population. The distance between these two extremes is what determines whether a medical computer vision manuscript clears desk rejection at venues like JAMA, Nature Medicine, Radiology: Artificial Intelligence, or BMJ. The metric itself does not decide the outcome. What decides it is what is done with the metric.

Figure: Performance drop between internal and external validation across six canonical medical computer vision studies. Horizontal bars compare internal and external AUC for each study; the drop peaks in Xin 2021, from 0.95 to 0.54.

Gulshan and colleagues (2016) reported AUC 0.99 on EyePACS-1 and Messidor-2; Voets, Møllersen, and Bongo (2019) attempted to reproduce the method on public data and obtained AUC 0.85 on Messidor-2; Zech and colleagues (2018) trained on NIH and MSH and measured a significant AUC drop when testing on IU. Xin and colleagues (2021), highlighted in the chart, show the maximum magnitude of the effect: internal AUC of 0.95 collapses to 0.54 when tested outside the training dataset. The pattern repeats across venues and modalities, and it is exactly what Q1 reviewers look for when they open the validation section.

The distance between the two extremes is not an isolated methodological anomaly; it is the baseline scenario of the past decade. Recognizing this pattern, three editorial groups consolidated AI-specific reporting frameworks for medical imaging between 2020 and 2025, and reviewers trained at these journals open manuscripts looking for explicit checklist adherence. Authors who do not know the frameworks publish in Q2 or Q3. Authors who master them write the results section so that the verdict comes back favorable.

The literature between two extremes

What CheXNet, Gulshan, and the 2016 wave established

In December 2016, Varun Gulshan and colleagues published in JAMA what would become the anchor paper for the medical computer vision wave: a deep learning algorithm trained on 128,175 fundus images graded by a panel of 54 ophthalmologists, reaching AUC of 0.991 on EyePACS-1 and 0.990 on Messidor-2 for referable diabetic retinopathy ¹. The editorial impact was immediate. Within months, analogous papers started appearing in other modalities and pathologies, all with the same narrative: a convolutional network trained on tens of thousands of images, compared against a specialist panel, with AUC approaching 1.0 on a validation set. The cycle replicated in chest radiographs, dermatological lesions, cardiac MRIs, and ophthalmological imagery.

In November 2018, Pranav Rajpurkar and colleagues extended the argument to chest radiography. CheXNeXt, the peer-reviewed successor of the original CheXNet preprint, was trained to detect 14 pathologies on frontal radiographs and reached performance equivalent to board-certified radiologists on 11 of the 14 conditions ⁴. The paper accumulated hundreds of citations within months and consolidated the expectation that computer vision for medical imaging was a solved problem. Much of the material that circulates in AI-applied-to-medicine conferences still operates within that narrative.

The drop when the model leaves the training dataset

In parallel to that wave, three independent groups started publishing uncomfortable results. The most influential came out in November 2018 in PLOS Medicine. John Zech and colleagues trained convolutional networks on chest radiographs from three hospital systems (NIH, Mount Sinai, and Indiana University) and systematically measured what happened when the model was evaluated on data from another hospital. The central finding was unambiguous: in three of the five natural comparisons, external performance was significantly inferior to internal. Worse, the networks learned to detect with 99.95% accuracy the originating hospital system of a radiograph, calibrating predictions accordingly ³. The model was not learning pneumonia. It was learning to distinguish hospitals.

Separately, Mike Voets, Kajsa Møllersen, and Lars Ailo Bongo attempted to reproduce Gulshan’s results using public data. They reimplemented the method because the source code was not available, trained on EyePACS-Kaggle, and tested on Messidor-2. The AUC obtained on Messidor-2 was 0.853, against the 0.990 reported in the original paper ². The gap was not marginal: the reproduction failed to validate the result. The authors were explicit about the lessons they drew: use public data or describe curation in detail, publish source code, and detail every hyperparameter and preprocessing step. Without that, the work does not hold up under methodological review.

The pattern repeats across modalities. In pediatrics, models trained on Guangzhou radiographs to detect pneumonia reached AUC of 0.95 on the internal set and dropped to 0.54 on radiographs from NIH ChestXray14. The distance between these two situations is not only statistical; it is editorial. By 2026, Q1 reviewers no longer accept the first number without the second, and the presence of the Zech paper in the discussion section bibliography is practically mandatory in any manuscript that reports internal performance above 0.90.
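The internal-versus-external comparison that these studies report can be sketched in a few lines. What follows is a minimal illustration in plain Python, not the method of any cited paper: the function names `auc` and `bootstrap_ci`, and the toy labels and scores, are assumptions made up for the example.

```python
import random

def auc(labels, scores):
    """AUC as the probability that a random positive case outscores
    a random negative case (ties count as one half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(labels, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the AUC."""
    if len(set(labels)) < 2:
        raise ValueError("need both classes to bootstrap an AUC")
    rng = random.Random(seed)
    idx = range(len(labels))
    stats = []
    while len(stats) < n_boot:
        sample = [rng.choice(idx) for _ in idx]
        ys = [labels[i] for i in sample]
        if len(set(ys)) < 2:
            continue  # resample lacked one class; draw again
        stats.append(auc(ys, [scores[i] for i in sample]))
    stats.sort()
    lo_i = max(0, int((alpha / 2) * n_boot))
    hi_i = min(n_boot - 1, int((1 - alpha / 2) * n_boot))
    return stats[lo_i], stats[hi_i]

# Hypothetical toy data: same labels, scores from an internal hold-out
# and from an external site where the model separates cases less well.
y = [1, 1, 1, 1, 0, 0, 0, 0]
internal_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.65]
external_scores = [0.6, 0.4, 0.7, 0.3, 0.5, 0.35, 0.65, 0.2]

for name, s in [("internal", internal_scores), ("external", external_scores)]:
    low, high = bootstrap_ci(y, s)
    print(f"{name}: AUC={auc(y, s):.3f}  95% CI=({low:.3f}, {high:.3f})")
```

Reporting both numbers side by side, each with an interval, makes the size of the drop visible instead of leaving it implicit. The pairwise AUC here is quadratic in sample size, so a real evaluation would likely use a rank-based implementation from an established library.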

STARD-AI, TRIPOD+AI, and CLAIM as editorial gates

What each framework requires

Three editorial groups formalized reporting expectations between 2020 and 2025. The first to appear was CLAIM, the Checklist for Artificial Intelligence in Medical Imaging, published in March 2020 in Radiology: Artificial Intelligence by John Mongan, Linda Moy, and Charles Kahn. The checklist has 42 items spread across title and abstract, introduction, methods, results, discussion, and other information, and was designed specifically for medical imaging, including items on de-identification, handling of missing data, rationale for the gold standard, interpretability maps, and failure analysis ⁵. CLAIM received an update in 2024 incorporating four years of usage feedback, but the structure remains the editorial reference for any medical imaging manuscript submitted to RSNA journals.

In April 2024, Gary Collins and colleagues published in BMJ the TRIPOD+AI extension, updating the original 2015 TRIPOD for predictive models that use machine learning methods. The new version consolidates 27 reporting items and introduces an abstract-specific checklist, emphasizing transparency on data source, population definition, variable handling, internal and external validation, and model calibration ⁶. The original TRIPOD had 22 items and was endorsed by more than half of the top medical journals; the TRIPOD+AI version completely supersedes it and is today the reference framework for clinical prediction models.

Completing the trio, in September 2025 Viknesh Sounderajah and the STARD-AI consortium published in Nature Medicine the STARD-AI extension for diagnostic accuracy studies with artificial intelligence. The document adds 18 new or modified items to STARD 2015, focused on dataset description, the AI index test and how it was evaluated, and explicit considerations on algorithmic bias and fairness ⁷. The development process involved more than 240 international stakeholders, and the final checklist covers exactly the kind of information reviewers at JAMA, Lancet Digital Health, and Nature Medicine have been expecting since 2024.

The common point: explicit human in the pipeline

The three frameworks use distinct language and have partially overlapping items, but they converge on one editorial point that reviewers read as a priority: the pipeline must document where and how the human intervenes. CLAIM requires explicit description of the annotation gold standard and inter-reader agreement. TRIPOD+AI demands transparency on how the model integrates into clinical decision-making. STARD-AI requires complete description of the clinical reference standard and the criteria used for ground truth. Together, the three checklists have established an expectation that operates as an editorial filter: manuscripts that treat the pipeline as a black box, with no explicit human at any stage, do not clear methodological review at Q1 journals.
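Inter-reader agreement, which the paragraph above lists among the CLAIM requirements, is commonly summarized with Cohen's kappa for two readers. The sketch below is one plain-Python way to compute it; the function name `cohen_kappa` and the two reader label lists are hypothetical, and none of the checklists prescribes this particular statistic or code.

```python
from collections import Counter

def cohen_kappa(reader_a, reader_b):
    """Cohen's kappa: agreement between two readers, corrected for the
    agreement expected by chance from each reader's label frequencies."""
    if len(reader_a) != len(reader_b) or not reader_a:
        raise ValueError("readers must label the same non-empty set of cases")
    n = len(reader_a)
    observed = sum(a == b for a, b in zip(reader_a, reader_b)) / n
    count_a, count_b = Counter(reader_a), Counter(reader_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both readers used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical gradings of ten images (1 = referable, 0 = not referable).
reader_1 = [1, 1, 1, 0, 0, 0, 1, 0, 1, 0]
reader_2 = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
print(f"kappa = {cohen_kappa(reader_1, reader_2):.2f}")
```

With more than two readers or ordinal grades, a manuscript would more plausibly report Fleiss' kappa or a weighted kappa; the point of the sketch is only that the agreement figure is cheap to compute and easy to report next to the reference standard description.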

The three patterns that sink Q1 manuscripts

Reviewers trained on these frameworks recognize quick patterns that signal methodological fragility. Three of them appear with enough frequency to be cataloged as canonical reasons for major revision or rejection.

Validation only on hold-out from the original dataset

The most common pattern. The manuscript reports high accuracy on a test set, but the test set comes from the same dataset used for training, with the same demographic distribution, the same equipment, the same acquisition protocol. The systematic analysis by Myura Nagendran and colleagues in BMJ in 2020 documented the pattern across 81 non-randomized studies comparing AI to clinicians: only nine were prospective, only six were tested in real clinical environments, and the median number of specialists in the comparator group was four ⁸. Sixty-one of the eighty-one studies claimed in the abstract that AI performance was comparable to or better than that of clinicians. Risk of bias was rated as high in fifty-eight of the eighty-one. Q1 reviewers today know this paper, and cite it when confronting manuscripts with the same structure.
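One way to avoid a test set that shares its source with the training data is to split by site instead of by image. The following is a minimal sketch under assumed names: the `site` field, the `leave_one_site_out` helper, and the three site labels are illustrative and do not come from any of the cited studies.

```python
def leave_one_site_out(records, site_key="site"):
    """Yield (held_out_site, train, test) so that every test fold comes
    from a site that contributed nothing to the training fold."""
    sites = sorted({r[site_key] for r in records})
    if len(sites) < 2:
        raise ValueError("external validation needs at least two sites")
    for held_out in sites:
        train = [r for r in records if r[site_key] != held_out]
        test = [r for r in records if r[site_key] == held_out]
        yield held_out, train, test

# Hypothetical image records tagged with their originating hospital.
records = [
    {"image_id": "a1", "site": "hospital_A", "label": 1},
    {"image_id": "a2", "site": "hospital_A", "label": 0},
    {"image_id": "b1", "site": "hospital_B", "label": 1},
    {"image_id": "b2", "site": "hospital_B", "label": 0},
    {"image_id": "c1", "site": "hospital_C", "label": 0},
]

for site, train, test in leave_one_site_out(records):
    print(f"test on {site}: {len(train)} training cases, {len(test)} test cases")
```

A split like this only approximates external validation when the sites genuinely differ in equipment, protocol, or population; several partitions of one hospital's archive would not qualify.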

Absence of demographic subgroup breakdown

The second pattern, and the one that quickly sinks manuscripts at journals that position themselves on equity. Yuzhe Yang and colleagues showed in October 2024 in Nature Medicine that medical imaging models use demographic shortcuts, and that this implicit encoding appears in radiology, dermatology, and ophthalmology with equal frequency. More importantly for the editorial argument: models with less encoding of demographic attributes perform better in external test environments. Models that look excellent on the training dataset may be capitalizing precisely on the attributes that need to be corrected for fair performance ⁹. Manuscripts that report aggregate AUC without a breakdown by demographic groups relevant to the clinical domain receive an explicit revision request today.
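A subgroup breakdown of the kind reviewers request can be as simple as recomputing sensitivity and specificity per group at the chosen operating threshold. The sketch below assumes a hypothetical `subgroup_report` helper, a threshold of 0.5, and invented group labels "A" and "B"; which attributes count as relevant subgroups depends on the clinical domain.

```python
from collections import defaultdict

def subgroup_report(labels, scores, groups, threshold=0.5):
    """Sensitivity and specificity per subgroup at a fixed threshold."""
    counts = defaultdict(lambda: {"tp": 0, "fn": 0, "tn": 0, "fp": 0})
    for y, s, g in zip(labels, scores, groups):
        predicted_positive = s >= threshold
        if y == 1:
            key = "tp" if predicted_positive else "fn"
        else:
            key = "fp" if predicted_positive else "tn"
        counts[g][key] += 1
    report = {}
    for g, c in counts.items():
        pos, neg = c["tp"] + c["fn"], c["tn"] + c["fp"]
        report[g] = {
            "n": pos + neg,
            # None marks a subgroup with no cases of that class.
            "sensitivity": c["tp"] / pos if pos else None,
            "specificity": c["tn"] / neg if neg else None,
        }
    return report

# Hypothetical data: the aggregate hides a weaker subgroup "B".
labels = [1, 1, 0, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.2, 0.1, 0.6, 0.3, 0.4, 0.7]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

for group, row in sorted(subgroup_report(labels, scores, groups).items()):
    print(group, row)
```

Including the per-group sample size in the table matters as much as the metrics, since a small subgroup yields an estimate too noisy to support a fairness claim either way.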

Lack of documented human-in-the-loop protocol

The third pattern, and the one that separates mature pipelines from proofs of concept. The manuscript presents a high-accuracy model but does not say where the human enters the clinical flow, in which cases the model defers to the specialist, and what the review protocol is for the model’s predictions. The absence of this documentation is problematic on two axes. Editorially, it violates explicit items in CLAIM and in STARD-AI on integration with clinical workflow. Substantively, it signals that the pipeline was designed for benchmarking only, not for real deployment.

What a Q1-calibrated pipeline delivers

In July 2023, Krishnamurthy Dvijotham and colleagues published in Nature Medicine a system called CoDoC, Complementarity-Driven Deferral to Clinical Workflow, which concretely demonstrates what Q1 reviewers expect to see. CoDoC learns to decide when to trust the model’s prediction and when to defer to the human specialist, based on the complementarity pattern between the two decision sources. Applied to breast cancer screening, it reduced false positives by 25% at the same false-negative rate, with 66% reduction in clinician workload. Applied to tuberculosis triaging, it reduced false positives by 5 to 15% for three of five commercial systems evaluated ¹⁰. The paper is exemplary not for the metric but for the transparency: documented pipeline, integration with workflow described item by item, validation across multiple commercial systems, open-source code.
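To make the idea of a documented deferral protocol concrete, here is a deliberately simplified sketch. It is not the CoDoC algorithm, which learns its deferral behavior from data: this version uses two fixed, hand-picked thresholds, and the names `triage`, `summarize`, `low`, and `high` are assumptions made up for the illustration.

```python
def triage(score, low=0.2, high=0.8):
    """Route one case: accept the model's call when the score is
    confidently high or low, otherwise defer to the clinician."""
    if not low < high:
        raise ValueError("low must be strictly below high")
    if score >= high:
        return "model_positive"
    if score <= low:
        return "model_negative"
    return "defer_to_clinician"

def summarize(scores, low=0.2, high=0.8):
    """Apply the rule to a batch and report the share of deferred cases."""
    decisions = [triage(s, low, high) for s in scores]
    deferred = decisions.count("defer_to_clinician")
    return {"decisions": decisions, "deferral_rate": deferred / len(decisions)}

# Hypothetical model scores for five screening cases.
batch = [0.95, 0.05, 0.5, 0.85, 0.3]
result = summarize(batch)
print(result["decisions"])
print(f"deferral rate: {result['deferral_rate']:.0%}")
```

Even a rule this simple gives a manuscript something to document: the thresholds, how they were chosen, the fraction of cases sent to the specialist, and the error rates on the cases the model decides alone.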

A second reference model is the paper by Daniel Ting and colleagues, published in December 2017 also in JAMA. The group trained a deep learning system on 494,661 fundus images from multiethnic populations with diabetes and validated performance across ten additional cohorts covering Singapore, the United States, China, Hong Kong, Mexico, and Australia. AUCs reported across the ten external cohorts ranged from 0.889 to 0.983, with sensitivity and specificity explicitly described by demographic subgroup ¹¹. The presentation of results is what distinguishes this work from most of the literature: external validation across multiple sites, demographic breakdown at each one, clear declaration of the conditions under which the system is proposed for deployment.

The rewriting that sustains the modern editorial argument in medical computer vision is not incremental. It demands reorganizing the pipeline presentation around multi-site external validation, explicit demographic breakdown, documented human-in-the-loop protocol, and adherence to at least one of the three reporting frameworks. Manuscripts that deliver this set go through to peer review at Q1 journals. Those that do not, do not.

References

This analysis reflects Aria's practice in Computer Vision and Complete Data Science Pipeline.
