DrivR-Base: a feature extraction toolkit for variant effect prediction

Posted by Tom Gaunt, with initial draft by AI on April 30, 2024

Understanding which genetic variants are likely to be functional (and which are probably benign) is a cornerstone of modern human genetics. Over the last decade, variant-effect predictors have become increasingly sophisticated — but behind every model sits the same practical headache: assembling a sensible set of features (annotations) for millions of variants from dozens of databases.

In a 2024 Bioinformatics paper led by Amy Francis, we introduce DrivR-Base, a reproducible, Dockerised toolkit that turns this feature-extraction step into something you can run and re-run with far less pain.

What problem is DrivR-Base trying to solve?

Most variant-effect prediction methods are “integrative”: they combine signals about a variant’s genomic context (e.g. conservation), regulatory annotations (e.g. ENCODE peaks), and protein-level consequences (e.g. amino-acid change, structure). The data exist — but pulling them together is often:

time-consuming (lots of sources, formats, and edge cases),
hard to reproduce (different software versions and dependencies), and
risky (you can spend weeks extracting features that later turn out not to help your model).

DrivR-Base’s core idea is simple: provide a single, consistent pipeline that can extract a broad set of annotations for any SNV in GRCh38, so we can spend more time modelling and less time wrangling.

What is DrivR-Base?

DrivR-Base is a feature extraction toolkit for human single nucleotide variants (SNVs) in the GRCh38 genome build. It produces a table where each row is a variant and columns are feature values drawn from multiple sources (genome- and protein-level). It’s packaged for Docker, which helps make installs and runs repeatable across machines and over time.
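
As a rough illustration of that table shape, the sketch below joins per-group feature dictionaries on a variant key. The key layout (chromosome, position, ref, alt), the group-prefixed column names, and the merge_feature_groups helper are all assumptions made for illustration; they are not DrivR-Base's actual output format, which is described in the repository.

```python
from collections import defaultdict

# Hypothetical variant key: (chromosome, 1-based position, ref allele, alt allele).
# Group and feature names below are illustrative, not DrivR-Base's real headers.


def merge_feature_groups(groups):
    """Outer-join per-group features on the variant key.

    `groups` maps a group name to {variant_key: {feature_name: value}}.
    Columns are prefixed with the group name to avoid name collisions;
    a feature that is missing for a variant is left as None.
    """
    columns = []
    for group_name, rows in groups.items():
        names = sorted({name for feats in rows.values() for name in feats})
        columns.extend(f"{group_name}.{name}" for name in names)

    # Every new row starts with all columns present and set to None.
    table = defaultdict(lambda: dict.fromkeys(columns))
    for group_name, rows in groups.items():
        for key, feats in rows.items():
            for name, value in feats.items():
                table[key][f"{group_name}.{name}"] = value
    return columns, dict(table)


groups = {
    "conservation": {("chr1", 12345, "A", "G"): {"phylop": 1.2}},
    "gc": {
        ("chr1", 12345, "A", "G"): {"gc_fraction": 0.5},
        ("chr2", 500, "C", "T"): {"gc_fraction": 0.4},
    },
}
columns, table = merge_feature_groups(groups)
for key, row in table.items():
    print(key, [row[c] for c in columns])
```

Keeping missing values explicit (None here) matters in practice, because protein-level features only exist for coding variants and a model needs to handle those gaps deliberately.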

The paper highlights a few motivating use-cases beyond “classic” pathogenicity prediction, including haploinsufficiency prediction and feature sets that could feed into drug repurposing workflows.

What features does it extract?

DrivR-Base groups its outputs into ten feature groups, spanning sequence context, regulatory genomics, and protein structure.

Conservation and mappability
PhyloP/PhastCons conservation scores across multiple alignments, plus Umap/Bismap mappability (useful for flagging regions prone to sequencing ambiguity).
Variant Effect Predictor (VEP) annotations
Transcript consequences (one-hot encoded), predicted amino acids (wild-type vs mutant), and distances to transcripts when multiple are affected.
Dinucleotide properties (DiProDB)
Thermodynamic and conformational properties for dinucleotide contexts around the variant, captured under wild-type and mutant configurations.
DNA shape (DNAshapeR)
Local structural properties like minor groove width, helix twist, propeller twist, roll, and electrostatic potential in a configurable window around the SNV.
GC content and CpG metrics
GC fraction, CpG counts, and observed/expected CpG across multiple window sizes.
Kernel-based sequence similarity (spectrum kernels)
K-mer based comparisons between wild-type and mutant sequence windows as a compact way to encode “sequence disruption”.
Amino-acid substitution matrices
Substitution rates from common matrices (e.g. BLOSUM, PAM, JTT variants) for non-synonymous variants.
Amino-acid properties
Hundreds of amino-acid descriptors (e.g. hydrophobicity, polarity, flexibility) for wild-type and mutant residues.
ENCODE-derived regulatory features
Peaks and signal summaries across multiple assay types (TF ChIP-seq, histone marks, DNase/ATAC, eCLIP, etc.). Note: the paper reports that this step can require substantial local storage (roughly 160 GB) because it downloads large ENCODE datasets.
Protein structure features from AlphaFold (and PDB)
For coding variants, mapping to protein positions enables extraction of AlphaFold structural information (e.g. atom coordinates and conformation-type encodings).
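
To make two of the sequence-based groups above more concrete, here is a minimal sketch of GC/CpG metrics and a spectrum kernel computed on a wild-type window and its mutant counterpart. The function names, the window, and the choice of k are assumptions for illustration; the window sizes, k values, and any normalisation DrivR-Base actually uses are not reproduced here.

```python
from collections import Counter


def apply_snv(window, offset, ref, alt):
    """Return the mutant window after substituting `alt` at `offset` (0-based)."""
    if window[offset] != ref:
        raise ValueError("reference allele does not match the window")
    return window[:offset] + alt + window[offset + 1:]


def gc_cpg_metrics(window):
    """GC fraction, CpG count and observed/expected CpG for one DNA window."""
    seq = window.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")
    gc_fraction = (c + g) / n if n else 0.0
    # Common observed/expected definition: (CpG count * length) / (C count * G count).
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"gc_fraction": gc_fraction, "cpg_count": cpg, "cpg_obs_exp": obs_exp}


def kmer_counts(seq, k):
    """Count every overlapping k-mer in `seq`."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a, b, k):
    """Inner product of the k-mer count vectors of sequences a and b."""
    counts_a, counts_b = kmer_counts(a, k), kmer_counts(b, k)
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


wild_type = "ACGTCGA"                      # toy window; real windows are longer
mutant = apply_snv(wild_type, 3, "T", "C")  # T>C at the centre of the window

print(gc_cpg_metrics(wild_type))
print(gc_cpg_metrics(mutant))
print(spectrum_kernel(wild_type, wild_type, 2), spectrum_kernel(wild_type, mutant, 2))
```

A lower kernel value between wild-type and mutant than between wild-type and itself indicates that the substitution changed the k-mer composition of the window, which is the "sequence disruption" signal in compact form.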

Why this matters for our work

A lot of what we do in the DMER team sits at the interface of genetic evidence and downstream biology — and variant-level annotations are often the glue. Even when our end goal isn’t “variant pathogenicity prediction”, having a robust, standardised way to pull out features can help with:

building or benchmarking new predictors (and understanding why they behave as they do),
prioritising variants for experimental follow-up, and
reusing the same feature definitions across projects to avoid “feature drift”.

Just as importantly, DrivR-Base makes it easier to ask the boring-but-essential questions early, like: Which feature groups are actually informative for my prediction task? That can save a lot of iteration time.
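
One simple way to ask that question is a leave-one-group-out ablation: score the model with every feature group, then again with each group removed, and rank groups by how much the score drops. The sketch below is a generic skeleton, not part of DrivR-Base; leave_one_group_out, the column names, and the toy scoring function are hypothetical, and in practice score_fn would wrap something like a cross-validated metric for your own model.

```python
def leave_one_group_out(feature_groups, score_fn):
    """Rank feature groups by the score lost when each one is removed.

    `feature_groups` maps a group name to its list of column names.
    `score_fn(columns)` returns a score where higher is better.
    """
    all_columns = [c for cols in feature_groups.values() for c in cols]
    baseline = score_fn(all_columns)
    drops = {}
    for group, cols in feature_groups.items():
        excluded = set(cols)
        kept = [c for c in all_columns if c not in excluded]
        drops[group] = baseline - score_fn(kept)
    ranking = sorted(drops.items(), key=lambda item: item[1], reverse=True)
    return baseline, ranking


# Toy stand-in for model evaluation: each column contributes a fixed amount.
toy_weights = {"phylop": 3, "gc_fraction": 1, "blosum62": 0}


def toy_score(columns):
    return sum(toy_weights[c] for c in columns)


groups = {
    "conservation": ["phylop"],
    "gc": ["gc_fraction"],
    "substitution": ["blosum62"],
}
baseline, ranking = leave_one_group_out(groups, toy_score)
print(baseline, ranking)
```

A group whose removal barely changes the score is a candidate for dropping, although correlated groups can mask each other, so the ranking is a starting point rather than a verdict.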

Getting started

DrivR-Base is distributed via GitHub with Docker instructions. The paper and repository are the best places to start:

Paper (open access via PubMed Central): https://pmc.ncbi.nlm.nih.gov/articles/PMC11057939/
Code: https://github.com/amyfrancis97/DrivR-Base

Reference

Francis A, Campbell C, Gaunt TR. DrivR-Base: a feature extraction toolkit for variant effect prediction model construction. Bioinformatics (2024).
