
M-PreSS: a transparent, open-source approach to study screening in systematic reviews


Overview

Screening thousands of titles and abstracts is often the single biggest bottleneck in a systematic review workflow. In this new medRxiv preprint, we describe M-PreSS: a model pre-training approach that aims to make screening faster without relying on closed, black-box systems.

The key idea is to start from an open biomedical language model (BlueBERT) and fine-tune it for screening using a Siamese neural network setup, so that the resulting model can generalise across different review topics rather than needing a brand-new model each time.

Xu et al. (2025). "M-PreSS: A Model Pre-training Approach for Study Screening in Systematic Reviews." medRxiv. DOI: 10.1101/2025.04.08.25325463

What we did

In M-PreSS, we fine-tuned BlueBERT to produce representations of study records (titles/abstracts) that can be used to score relevance for screening decisions. We then evaluated several training strategies in seven COVID-19 systematic reviews, focusing on whether a model trained on some topics could transfer to another topic.
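
For readers who want a concrete picture of what a Siamese (bi-encoder) fine-tuning setup of this kind can look like, here is a minimal sketch using the sentence-transformers library with a BlueBERT checkpoint. The checkpoint name, example pairs, and hyperparameters are illustrative assumptions, not the configuration used in the preprint.

    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses, models

    # Build a bi-encoder from a BlueBERT checkpoint (the checkpoint name is an
    # assumption; the preprint may use a different BlueBERT variant).
    word_embedding = models.Transformer(
        "bionlp/bluebert_pubmed_mimic_uncased_L-12_H-768_A-12", max_seq_length=256
    )
    pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
    model = SentenceTransformer(modules=[word_embedding, pooling])

    # Toy training pairs: (topic definition, study record) with a relevance label.
    # Real training data would come from labelled screening decisions across review topics.
    topic = "Effectiveness of remote consultations during the COVID-19 pandemic"
    train_examples = [
        InputExample(texts=[topic, "Telehealth for chronic disease management during COVID-19: a cohort study."], label=1.0),
        InputExample(texts=[topic, "Structural properties of novel alloys at high temperature."], label=0.0),
    ]

    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
    train_loss = losses.CosineSimilarityLoss(model)  # pulls relevant pairs together, pushes irrelevant ones apart

    # Placeholder hyperparameters, not the values used in the preprint.
    model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=10)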

Two practical variations explored in the preprint are:

  • Enriching the “topic definition” used for training by adding explicit study selection criteria (the kind you would normally write in a protocol); a short sketch after this list illustrates the idea.
  • Training on more related review topics, to encourage broader generalisation.
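
To make the first variation more tangible, the sketch below shows how a topic definition enriched with explicit inclusion and exclusion criteria could be scored against candidate records using cosine similarity. The model path, topic wording, and criteria are illustrative assumptions rather than the preprint's actual inputs.

    from sentence_transformers import SentenceTransformer, util

    # Hypothetical path to a fine-tuned screening model; not a released artefact from the preprint.
    model = SentenceTransformer("path/to/fine-tuned-screening-model")

    # A topic definition enriched with explicit selection criteria, of the kind
    # you would normally write in a review protocol (wording is illustrative).
    topic_enriched = (
        "Effectiveness of remote consultations during the COVID-19 pandemic. "
        "Include: randomised or observational studies of telehealth interventions reporting patient outcomes. "
        "Exclude: editorials, commentaries, and study protocols."
    )

    records = [
        "Telehealth uptake in primary care during COVID-19: a cohort study of patient outcomes.",
        "A commentary on the future of digital health policy.",
    ]

    topic_emb = model.encode(topic_enriched, convert_to_tensor=True)
    record_embs = model.encode(records, convert_to_tensor=True)

    # Cosine similarity as a relevance score; records above a tuned threshold
    # would go forward for human screening.
    scores = util.cos_sim(topic_emb, record_embs)[0]
    for record, score in zip(records, scores):
        print(f"{float(score):.3f}  {record}")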

What we found

Across the seven COVID-19 reviews, the approach showed good cross-topic performance:

  • Average recall/sensitivity was reported as 0.86 (range 0.67–1.00).
  • Average false positive rate was 6.48% (range 1.38%–11.41%).

Two additional findings are especially relevant if you are thinking about deploying screening models in real review pipelines:

  • Adding study selection criteria into the topic definition improved precision–recall performance (PRAUC) by 2.74%.
  • Adding more related topics during training increased performance by 15.82%.

We also report that, on the COVID-19 topics used for comparison, this fine-tuned open model outperformed ChatGPT/GPT-4 in two of three previously reported screening settings, while using substantially fewer computational resources.

Why this matters

From our perspective, this work lands in a useful “sweet spot”:

  • Transparent and reproducible: the underlying model is open, and the training approach can be documented and rerun.
  • Generalises across topics: a single trained model can be reused rather than a bespoke model being built from scratch for every review.
  • Practical levers to improve performance: especially the finding that writing selection criteria in a structured way can directly help the model.

That combination is important if we want screening automation to be something review teams can actually trust, maintain, and update over time.

Limitations and next steps

A couple of things we will be considering as we continue to work in this space:

  • Beyond COVID-19: the evaluation focuses on COVID-19 reviews, so it will be interesting to see how well the approach transfers to other domains (e.g. nutrition, cancer epidemiology, environmental exposures).
  • Human-in-the-loop integration: the biggest real-world gains often come from pairing models with active learning, prioritisation, and clear stopping rules; how M-PreSS plugs into those workflows will matter (a toy sketch follows below).
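
To show what that integration could look like in the simplest possible terms, here is a toy sketch of a prioritised screening loop with a naive stopping heuristic. It is not a description of M-PreSS or of any established stopping rule; the function name, batch size, and threshold are assumptions for illustration, and a full active-learning setup would also retrain the model between batches.

    import numpy as np

    def prioritised_screening(scores, human_decisions, batch_size=50, stop_after_irrelevant=200):
        """Toy prioritised screening: present records to a human screener in descending
        model-score order and stop after a long run of consecutive excludes.

        `scores` are model relevance scores; `human_decisions[i]` stands in for the
        human screener's verdict on record i (1 = include, 0 = exclude).
        """
        order = np.argsort(-np.asarray(scores))
        included, screened, irrelevant_streak = [], 0, 0
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            decisions = [human_decisions[i] for i in batch]  # human-in-the-loop step
            included.extend(int(i) for i, d in zip(batch, decisions) if d == 1)
            screened += len(batch)
            irrelevant_streak = 0 if any(decisions) else irrelevant_streak + len(batch)
            if irrelevant_streak >= stop_after_irrelevant:
                break  # stop once recent batches contain no includes
        return included, screened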

🧪 A Note on Preprints

This work is currently a preprint, meaning it has not yet been peer-reviewed. Preprints are early reports of research findings: they are valuable for rapid dissemination, but their results should be treated as preliminary.