Pilot analysis on BioRxiv and MedRxiv full text data to facilitate comprehensive data mining on biomedical literature

We have curated the full text data archives of BioRxiv and MedRxiv preprints and conducted some exploratory analyses for our next stage research projects in text mining biomedical literature with automated approaches.
seedcorn funding
text mining
NLP
Author

Yi Liu

Published

August 21, 2023

Overview

The BioRxiv and MedRxiv preprint facilities are vital infrastructure for the biomedical research community, which also provide a rich and comprehensive resource for data mining biomedical literature for investigations on research trends, interests, and novel findings. In our previous works we have conducted extensive literature mining efforts on BioRxiv and MedRxiv to extract structural literature knowledge into EpiGraphDB[1] and derive research claims from recent preprints to be triangulated with other evidence on ASQ[2].

Supported by the Elizabeth Blackwell Institute Rapid Research Funding Call, in this project we have acquired the full text data archives for BioRxiv and MedRxiv preprints from 2013 to May 2023, and we have also conducted some exploratory analysis on the data archives.

Full text archives of BioRxiv / MedRxiv preprints

Web scraping pre-print text is time consuming as well as error prone. However, BioRxiv and MedRxiv have provided archive data on the preprints for the purpose of text and data mining hosted on Amazon AWS S3 as Requester Pays Buckets. Full text archives from BioRxiv and MedRxiv are stored in two S3 buckets biorxiv-src-monthly and medrxiv-src-monthly respectively, where the organization structure is roughly as follows (high level description is also available on the tdm pages):

biorxiv-src-montly
├── Back_Content
└── Current_Content
    ├── April_2019
    ├── April_2020
    │   ├── 0002415e-6e79-1014-bad3-d7b11ff8718c.meca
    │   ├── 0008729e-7222-1014-9e12-b08e9cbb4568.meca
    │   ├── 00114dfb-6f21-1014-b58f-80aa2e0e89bd.meca
    │   ├── 00114e62-6cf9-1014-aedd-beed0b185e0b.meca
    │   ├── 0025be66-6dea-1014-8d12-849fadd63f55.meca
    │   ├── ....
    │   ├── ....
    ├── April_2021
    ├── April_2022
    ├── April_2023
    ├── August_2019
    ├── August_2020
    ├── August_2021
    ├── August_2022
    ├── December_2018
    ├── December_2019
    ├── December_2020
    ├── December_2021
    ├── December_2022
    ├── ...
    ├── ...
medrxiv-src-monthly
├── Back_Content
└── Current_Content

Each .meca file is an archive in .zip format which contains individual files associated with one preprint submission. For example the following structure corresponds to files associated with this preprint (Shrestha et al., 2022, Knowledge, Attitude and Practice (KAP) study on COVID-19 among the general population of Nepal). Specifically the file ./content/22279527.xml is the “full text / manuscript” file in .xml format.

0a2ef310-6c04-1014-8ee5-ac250845df11.meca
├── content
│   ├── 22279527.pdf
│   ├── 22279527v1_tbl1.tif
│   ├── 22279527v1_tbl2.tif
│   ├── 22279527v1_tbl3.tif
│   ├── 22279527v1_tbl4.tif
│   ├── 22279527v1_tbl5a.tif
│   ├── 22279527v1_tbl5.tif
│   ├── 22279527v1_tbl6a.tif
│   ├── 22279527v1_tbl6b.tif
│   ├── 22279527v1_tbl6.tif
│   └── 22279527.xml
├── directives.xml
├── manifest.xml
├── mimetype
└── transfer.xml

How do we know which .meca archive filename corresponds to which preprint DOI? Unfortunately we haven’t found a way to know this without actually opening the meca file and parsing the full text xml. In addition we have identified several rounds of changes to the storage structure, as well as text format (e.g. how a key information such as submission date is represented in xml tags), which means that it takes effort to systematically and robustly curate various information such as metadata as well as sections (e.g. abstracts, methods, results, conclusions) in the manuscripts. We will need to curate these information as separate intermediate datasets for our follow-up research projects.

Some exploratory results

As part of our on-going efforts in parsing and curating the preprint information from the raw archives, here we discuss some of the gathered results which will share insights into future projects.

Topic analysis

To further demonstrate the distribution of research topics, we did some simple topic analysis, where we used the BERTopic library to efficiently embed and model all (over 300,000) article titles to derive topics corresponding to research topics. Figure 4 shows the trends of the dynamic topics, where we visually identify a topic related to the transmission of SARS-CoV-2 virus (“sarscov2”, “transmission”, “seroprevalence”, etc.) dominate over others during the early phase (2020-2021) of the pandemic and its gradual decline. In addition, a topic related to vaccines (“vaccination”, “vaccine hesitancy”, etc.) can also been seen being prevalent during 2021. Figure 5 then further demonstrates how each topic relate to others in the topic clusters.

Figure 4: Trends of topics dynamic topic modelling analysis Dynamic topics are modelled by associating the submission month of the preprint with the topics represented from the article titles.

Figure 5: Topic clusters We randomly select a sample of 100_000 documents (article titles) to generate the topic model, and then used the UMAP clustering method to visualize the clustering of documents. For all the generated topics, we highlight those associated with the COVID-19 pandemic, as well as for comparison a few others related to health data science and genetic epidemiology topics.

Summary

The above preliminary results on the exploratory analysis demonstrate the richness in what can be data mined from analysing the BioRxiv / MedRxiv preprints. We are excited at the prospect of research projects on text mining biomedical studies involving the full historical archive of biomedical research (as proxied by the preprints), not just on public health and epidemiology topics, but also cross-discipline ones such as social-economic factors influencing the eventual publication of a preprint, effective modelling of research trends via various network or clustering analysis methods. For potential collaboration interests please reach out to Yi Liu (yi6240.liu[at]bristol.ac.uk) and Tom Gaunt (tom.gaunt[at]bristol.ac.uk).

Code availability

Our code in data accessing as well as research analysis have been published as a GitHub repository here https://github.com/MRCIEU/biorxiv-medrxiv-tdm, which we will continue to update in the future as our current results are just a first step towards bigger analysis projects.

References

[1] Yi Liu, Benjamin Elsworth, et al., Tom Gaunt, EpiGraphDB: a database and data mining platform for health data science, Bioinformatics, Volume 37, Issue 9, May 2021, Pages 1304–1311, https://doi.org/10.1093/bioinformatics/btaa961

[2] Yi Liu, Tom R Gaunt, Triangulating evidence in health sciences with Annotated Semantic Queries medRxiv 2022.04.12.22273803; doi: https://doi.org/10.1101/2022.04.12.22273803