MR-KG: A Knowledge Graph of Mendelian Randomization Evidence Powered by Large Language Models


📌 Background

Mendelian randomization (MR) is a powerful causal inference method that uses genetic variants as natural experiments to assess causal relationships between putative risk factors and disease outcomes. MR studies are increasingly abundant, but synthesising evidence across them remains challenging due to heterogeneity in reporting, traits examined, and the structure of the published literature.

To address this, Liu, Burton, Gatua, Hemani & Gaunt (2025) introduce MR-KG — a knowledge graph of MR evidence automatically extracted from published studies using large language models (LLMs).

Liu et al. "MR-KG: A knowledge graph of Mendelian randomization evidence powered by large language models". 2025, medRxiv DOI:10.64898/2025.12.14.25342218

Genetics as a side‑effect detective for antipsychotic medicines


Schematic of the genetics + pharmacology pipeline used to infer drug side-effect mechanisms

Side‑effects are one of the main reasons people stop taking antipsychotic medicines — even when the drugs are helping with symptoms. But when someone reports “I’ve gained weight” or “my blood pressure has changed”, it’s often hard to know whether the drug truly caused it, which biological target is responsible, and whether that target is the one we wanted to hit in the first place.

In work led by Andrew Elmore, published in PLOS Genetics, we combine pharmacology (what receptors a drug binds) with human genetics (natural experiments) to map side‑effects back to specific receptors.

M-PreSS: a transparent, open-source approach to study screening in systematic reviews


Overview

Screening thousands of titles and abstracts is often the single biggest bottleneck in a systematic review workflow. In this new medRxiv pre-print, we describe M-PreSS: a model pre-training approach that aims to make screening faster without relying on closed, black-box systems.

The key idea is to start from an open biomedical language model (BlueBERT) and fine-tune it for screening using a Siamese neural network setup, so that the resulting model can generalise across different review topics rather than needing a brand-new model each time.
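As a rough illustration of the Siamese idea (not the M-PreSS implementation), the sketch below passes two texts through the same encoder and compares the resulting vectors. Everything here is an assumption made for illustration: the bag-of-words `encode` function is a toy stand-in for a fine-tuned language model such as BlueBERT, pairing a review query with a candidate abstract is a hypothetical choice of inputs, and the `threshold` value is arbitrary.

```python
import math
from collections import Counter


def encode(text: str) -> Counter:
    """Toy shared encoder: bag-of-words counts (stand-in for a fine-tuned language model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def siamese_score(query: str, abstract: str) -> float:
    """Pass both inputs through the same encoder, then compare the embeddings."""
    return cosine(encode(query), encode(abstract))


def screen(query: str, abstracts: list[str], threshold: float = 0.2) -> list[tuple[str, bool]]:
    """Flag abstracts whose similarity to the query meets a (hypothetical) threshold."""
    return [(a, siamese_score(query, a) >= threshold) for a in abstracts]


if __name__ == "__main__":
    # Invented example texts, for illustration only.
    query = "statin therapy and cardiovascular outcomes in adults"
    abstracts = [
        "A randomised trial of statin therapy and cardiovascular outcomes",
        "Soil microbiome diversity in alpine meadows",
    ]
    for text, keep in screen(query, abstracts):
        print(keep, text)
```

Because the encoder is shared, the same scoring function can be reused for a new review topic by changing only the query text, which is the property the Siamese setup is meant to provide.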

Integrating Mendelian randomization and literature mining to map breast cancer risk factors


Illustration of integrating MR and literature-mined evidence to identify breast cancer risk pathways.

Breast cancer research spans epidemiology, molecular biology, clinical trials, and a vast and rapidly growing literature. One challenge is triangulating across these evidence types: when different sources point in the same direction, we can be more confident we are seeing something causal rather than correlational.

In a paper led by Marina Vabistsevits published in the Journal of Biomedical Informatics, we show how to bring two complementary sources together:

  1. Mendelian randomization (MR) evidence generated at scale using MR-EvE (“Everything-vs-Everything”), and
  2. Literature-mined relationships stored in EpiGraphDB, our biomedical knowledge graph.

Dissecting blood pressure and BMI: a pathway- and tissue-partitioned Mendelian randomization comparison


Pathway- vs tissue-partitioned MR, simplified schematic.

Complex traits like blood pressure (BP) and body mass index (BMI) are highly polygenic: hundreds of associated variants can be used as instruments in Mendelian randomization (MR). But those variants don’t all “mean the same thing” biologically—some may act through kidney physiology, others through vasculature, neurobiology, metabolism, and so on. If we can separate instruments into interpretable biological subsets, we can start asking questions like:

  • Which component of BP is most responsible for coronary heart disease risk?
  • Are BMI → atrial fibrillation effects more “metabolic” or more “neuro-behavioural”?

Work led by Genevieve Leyden and Maria Sobczyk and now published in Genome Medicine sets out to do exactly this by comparing two ways of partitioning genetic instruments before running MR.
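To make the idea of partitioned MR concrete, here is a minimal sketch that runs a standard fixed-effect inverse-variance weighted (IVW) estimate separately on each subset of instruments. It is not the method from the paper: the partition labels are borrowed from the examples above, the summary statistics are invented toy numbers, and the function names are hypothetical.

```python
from math import sqrt

# One instrument = (beta_exposure, beta_outcome, se_outcome); toy summary statistics.
Instrument = tuple[float, float, float]


def ivw_estimate(instruments: list[Instrument]) -> tuple[float, float]:
    """Fixed-effect IVW estimate and standard error, using first-order weights 1 / se_outcome^2."""
    num = sum(bx * by / se**2 for bx, by, se in instruments)
    den = sum(bx**2 / se**2 for bx, _, se in instruments)
    return num / den, sqrt(1.0 / den)


def partitioned_mr(partitions: dict[str, list[Instrument]]) -> dict[str, tuple[float, float]]:
    """Run IVW separately within each (hypothetical) biological partition of the instruments."""
    return {label: ivw_estimate(ivs) for label, ivs in partitions.items()}


if __name__ == "__main__":
    # Invented numbers, for illustration only.
    toy_partitions = {
        "kidney": [(0.10, 0.050, 0.010), (0.20, 0.100, 0.020)],
        "vasculature": [(0.10, 0.020, 0.010), (0.30, 0.060, 0.020)],
    }
    for label, (beta, se) in partitioned_mr(toy_partitions).items():
        print(f"{label}: beta={beta:.3f}, se={se:.3f}")
```

Comparing the per-partition estimates is what lets you ask which biological component of an exposure appears to drive its effect on an outcome.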

CanDrivR-CS: cancer-specific machine learning to separate recurrent from rare missense variants


Overview

Cancer genomes contain huge numbers of mutations, but only a subset are functionally important. One simple clue is recurrence: if the same missense variant shows up repeatedly across patients with the same cancer type, that can suggest positive selection for growth advantage. At the same time, rare variants can still matter (for example, if they emerge under treatment as resistance mechanisms).

In work led by Amy Francis, we introduce CanDrivR-CS, a framework that trains cancer-type-specific machine-learning models to distinguish recurrent from rare somatic missense variants. It’s a useful reminder that “one-size-fits-all” predictors can miss disease-context signals, and that relatively interpretable models can still surface mechanistic hypotheses.

DrivR-Base: a feature extraction toolkit for variant effect prediction


Understanding which genetic variants are likely to be functional (and which are probably benign) is a cornerstone of modern human genetics. Over the last decade, variant-effect predictors have become increasingly sophisticated — but behind every model sits the same practical headache: assembling a sensible set of features (annotations) for millions of variants from dozens of databases.

In a 2024 Bioinformatics paper led by Amy Francis, we introduce DrivR-Base, a reproducible, Dockerised toolkit that turns this feature-extraction step into something you can run and re-run with far less pain.

Pilot analysis on BioRxiv and MedRxiv full text data to facilitate comprehensive data mining on biomedical literature


Overview

The BioRxiv and MedRxiv preprint servers are vital infrastructure for the biomedical research community. They also provide a rich and comprehensive resource for mining the biomedical literature to investigate research trends, interests, and novel findings. In previous work we have mined BioRxiv and MedRxiv extensively, both to extract structured literature knowledge into EpiGraphDB[1] and to derive research claims from recent preprints that can be triangulated with other evidence in ASQ[2].

Proteome-wide Mendelian randomization in global biobank to identify multi-ancestry drug targets


Overview

Genetic studies have been heavily biased towards populations of European ancestry in western Europe and the United States of America, and this has led to a significant bias in the application of Mendelian randomization (MR) to identify intervention targets. In this project we worked with a leading international genetics consortium, the Global Biobank Meta-analysis Initiative (GBMI), to evaluate the differences in predicted drug target effects between African and European ancestry populations.

Triangulating evidence in health sciences with Annotated Semantic Queries


Update: The ASQ work has now been published in Bioinformatics.

Yi Liu, Tom R Gaunt, Triangulating evidence in health sciences with Annotated Semantic Queries, Bioinformatics, Volume 40, Issue 9, September 2024, btae519, https://doi.org/10.1093/bioinformatics/btae519

Overview

Integrating information from data sources representing different study designs has the potential to strengthen evidence in population health research. However, this concept of evidence “triangulation” presents a number of challenges for systematically identifying and integrating relevant information.

In this medRxiv preprint we present ASQ (Annotated Semantic Queries), a natural language query interface to the integrated biomedical entities and epidemiological evidence in EpiGraphDB. ASQ enables users to extract “claims” from a piece of unstructured text, and then investigate the evidence that could support or contradict those claims, or offer additional information relevant to the query.

The ASQ approach has the potential to support the rapid review of pre-prints, grant applications, conference abstracts and articles submitted for peer review. ASQ implements strategies to harmonize biomedical entities in different taxonomies and evidence from different sources, to facilitate evidence triangulation and interpretation.

ASQ is openly available at https://asq.epigraphdb.org.

Systematic comparison of Mendelian randomization studies and randomized controlled trials using electronic databases


Overview

Mendelian randomization (MR) uses genetic instrumental variables to make causal inferences. Whilst sometimes referred to as “nature’s randomized trial”, it has distinct assumptions that make comparisons between the results of MR studies and those of actual randomized controlled trials (RCTs) invaluable.

Evaluating the potential benefits and pitfalls of combining protein and expression quantitative trait loci in evidencing drug targets


Overview

Molecular quantitative trait loci (molQTL), which can provide functional evidence on the mechanisms underlying phenotype-genotype associations, are increasingly used in drug target validation and safety assessment. In particular, protein abundance QTLs (pQTLs) and gene expression QTLs (eQTLs) are the most commonly used for this purpose. However, questions remain on how to best consolidate results from pQTLs and eQTLs for target validation.

Senior Research Associate / Research Fellow in Health Data Science


The role:

We are seeking a talented postdoctoral scientist with expertise in biomedical data integration and analysis, data mining and causal inference. As the successful candidate you will join a vibrant interdisciplinary research environment in the MRC Integrative Epidemiology Unit, working within a programme that applies data mining approaches to epidemiological research questions (www.biocompute.org.uk). The post holder will be appointed at either Senior Research Associate (grade J) or Research Fellow (grade K) depending on their level of experience. If successful, you will have the opportunity to develop your own research portfolio within the programme, contribute to teaching and postgraduate training and will be supported in your career progression. Closing date: 6th Feb 2022

Trans-ethnic Mendelian-randomization study reveals causal relationships between cardiometabolic factors and chronic kidney disease


Overview

Most Mendelian randomization (MR) studies focus on European populations because of the wealth of genome-wide association study (GWAS) datasets available from European ancestry population samples, in contrast to other populations. However, new GWAS summary datasets from studies such as Biobank Japan, China Kadoorie Biobank and the Japan Kidney Biobank enable us to run ancestry-specific MR analyses to compare causal effects of risk factors across populations. This approach is important in the use of MR to inform public health priorities and interventions in other populations and sub-populations that have historically been under-represented in research.

EpiGraphDB platform version 1.0


EpiGraphDB version 1.0

The EpiGraphDB platform has been updated with a new major release (version 1.0). This is the first release since version 0.3 in 2020 (what a year!) as well as since the publication of the journal article in Bioinformatics. We believe the underlying integration pipeline, data structure and architecture for the EpiGraphDB platform have now progressed sufficiently to a stable state that we are pleased to announce this major release as version 1.0!

Neo4j data integration pipeline


Background

We’ve been using Neo4j for around five years in a variety of projects, sometimes as the main database (MELODI) and sometimes as part of a larger platform (OpenGWAS). We find creating queries with Cypher intuitive and query performance to be good. However, integrating data into a graph is still a challenge, especially when the data come from a variety of sources. Our latest project, EpiGraphDB, uses data from over 20 independent sources, most of which require cleaning and QC before they can be incorporated. In addition, each build of the graph needs to contain information on the versions of data, the schema of the graph and so on.