DrivR-Base: a feature extraction toolkit for variant effect prediction
Understanding which genetic variants are likely to be functional (and which are probably benign) is a cornerstone of modern human genetics. Over the last decade, variant-effect predictors have become increasingly sophisticated — but behind every model sits the same practical headache: assembling a sensible set of features (annotations) for millions of variants from dozens of databases.
In a 2024 Bioinformatics paper led by Amy Francis, we introduce DrivR-Base, a reproducible, Dockerised toolkit that turns this feature-extraction step into something you can run and re-run with far less pain.
What problem is DrivR-Base trying to solve?
Most variant-effect prediction methods are “integrative”: they combine signals about a variant’s genomic context (e.g. conservation), regulatory annotations (e.g. ENCODE peaks), and protein-level consequences (e.g. amino-acid change, structure). The data exist — but pulling them together is often:
- time-consuming (lots of sources, formats, and edge cases),
- hard to reproduce (different software versions and dependencies), and
- risky (you can spend weeks extracting features that later turn out not to help your model).
DrivR-Base’s core idea is simple: provide a single, consistent pipeline that extracts a broad set of annotations for all possible SNVs in GRCh38, so we can spend more time modelling and less time wrangling.
What is DrivR-Base?
DrivR-Base is a feature extraction toolkit for human single nucleotide variants (SNVs) in the GRCh38 genome build. It produces a table where each row is a variant and columns are feature values drawn from multiple sources (genome- and protein-level). It’s packaged for Docker, which helps make installs and runs repeatable across machines and over time.
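To make the "one row per variant, one column per feature" shape concrete, here is a minimal Python sketch of how per-group outputs could be combined into a single table. It is not DrivR-Base's own code or output schema: the variant key layout, the group and column names, and the numeric values are all made up for illustration.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical variant key: (chrom, pos, ref, alt) on GRCh38.
VariantKey = Tuple[str, int, str, str]
FeatureRows = Dict[VariantKey, Dict[str, float]]
Row = Dict[str, Optional[float]]


def merge_feature_groups(
    groups: Dict[str, FeatureRows],
) -> Tuple[List[str], List[Tuple[VariantKey, Row]]]:
    """Combine per-group feature dicts into one row per variant.

    Columns are prefixed with the group name to avoid clashes.
    A feature that is missing for a variant is left as None.
    """
    columns = sorted(
        f"{group}.{name}"
        for group, rows in groups.items()
        for name in {n for feats in rows.values() for n in feats}
    )
    variants = sorted({key for rows in groups.values() for key in rows})
    table = []
    for key in variants:
        row: Row = dict.fromkeys(columns)  # every column starts as None
        for group, rows in groups.items():
            for name, value in rows.get(key, {}).items():
                row[f"{group}.{name}"] = value
        table.append((key, row))
    return columns, table


# Illustrative values only; these are not real annotations.
example_groups = {
    "conservation": {
        ("chr1", 12345, "A", "G"): {"phylop": 1.2},
        ("chr2", 500, "C", "T"): {"phylop": -0.3},
    },
    "gc": {
        ("chr1", 12345, "A", "G"): {"gc_fraction": 0.55},
    },
}
columns, table = merge_feature_groups(example_groups)
for key, row in table:
    print(key, row)
```

Keeping missing values explicit, rather than dropping the variant, matters in practice because protein-level features only exist for coding variants.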
The paper highlights a few motivating use-cases beyond “classic” pathogenicity prediction, including haploinsufficiency prediction and feature sets that could feed into drug repurposing workflows.
What features does it extract?
DrivR-Base groups its outputs into ten feature groups, spanning sequence context, regulatory genomics, and protein structure.
- Conservation and mappability: PhyloP/PhastCons conservation scores across multiple alignments, plus Umap/Bismap mappability (useful for flagging regions prone to sequencing ambiguity).
- Variant Effect Predictor (VEP) annotations: Transcript consequences (one-hot encoded), predicted amino acids (wild-type vs mutant), and distances to transcripts when multiple are affected.
- Dinucleotide properties (DiProDB): Thermodynamic and conformational properties for dinucleotide contexts around the variant, captured under wild-type and mutant configurations.
- DNA shape (DNAShapeR): Local structural properties like minor groove width, helix twist, propeller twist, roll, and electrostatic potential in a configurable window around the SNV.
- GC content and CpG metrics: GC fraction, CpG counts, and observed/expected CpG across multiple window sizes.
- Kernel-based sequence similarity (spectrum kernels): K-mer based comparisons between wild-type and mutant sequence windows as a compact way to encode “sequence disruption”.
- Amino-acid substitution matrices: Substitution rates from common matrices (e.g. BLOSUM, PAM, JTT variants) for non-synonymous variants.
- Amino-acid properties: Hundreds of amino-acid descriptors (e.g. hydrophobicity, polarity, flexibility) for wild-type and mutant residues.
- ENCODE-derived regulatory features: Peaks and signal summaries across multiple assay types (TF ChIP-seq, histone marks, DNase/ATAC, eCLIP, etc.). Note: the authors report this step can require substantial local storage (approximately 160 GB) because it downloads large ENCODE datasets.
- Protein structure features from AlphaFold (and PDB): For coding variants, mapping to protein positions enables extraction of AlphaFold structural information (e.g. atom coordinates and conformation-type encodings).
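Two of the sequence-derived groups above, GC/CpG metrics and spectrum kernels, are easy to illustrate on a single window around a variant. The following is a minimal Python sketch, not DrivR-Base's implementation: the function names, the toy sequence, and the choice of k = 3 are assumptions. Observed/expected CpG uses one common definition (CpG count × window length / (C count × G count)); the toolkit may define it differently.

```python
from collections import Counter


def gc_cpg_metrics(window: str) -> dict:
    """GC fraction, CpG count and observed/expected CpG for one window."""
    seq = window.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    expected = (c * g) / n if n else 0.0
    return {
        "gc_fraction": (c + g) / n if n else 0.0,
        "cpg_count": cpg,
        "cpg_obs_exp": cpg / expected if expected else 0.0,
    }


def spectrum(seq: str, k: int) -> Counter:
    """Count every overlapping k-mer in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(a: str, b: str, k: int = 3) -> int:
    """Inner product of the k-mer count vectors of a and b."""
    counts_a, counts_b = spectrum(a.upper(), k), spectrum(b.upper(), k)
    return sum(count * counts_b[kmer] for kmer, count in counts_a.items())


def apply_snv(window: str, offset: int, alt: str) -> str:
    """Return the window with the base at offset replaced by alt."""
    return window[:offset] + alt + window[offset + 1:]


# Toy 11-bp window with a hypothetical G>T change at the centre (offset 5).
wild_type = "ACGTCGACGGA"
mutant = apply_snv(wild_type, 5, "T")

print(gc_cpg_metrics(wild_type))
print(gc_cpg_metrics(mutant))
print(spectrum_kernel(wild_type, wild_type), spectrum_kernel(wild_type, mutant))
```

Comparing the wild-type/mutant kernel value with the wild-type self-similarity gives a rough measure of how many k-mers the substitution disrupts; repeating the calculation over several window sizes or values of k yields one feature per setting.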
Why this matters for our work
A lot of what we do in the DMER team sits at the interface of genetic evidence and downstream biology — and variant-level annotations are often the glue. Even when our end goal isn’t “variant pathogenicity prediction”, having a robust, standardised way to pull out features can help with:
- building or benchmarking new predictors (and understanding why they behave as they do),
- prioritising variants for experimental follow-up, and
- reusing the same feature definitions across projects to avoid “feature drift”.
Just as importantly, DrivR-Base makes it easier to ask the boring-but-essential questions early, such as “Which feature groups are actually informative for my prediction task?” Answering that before committing to a full feature set can save a lot of iteration time.
Getting started
DrivR-Base is distributed via GitHub with Docker instructions. The paper and repository are the best places to start:
- Paper (open access via PubMed Central): https://pmc.ncbi.nlm.nih.gov/articles/PMC11057939/
- Code: https://github.com/amyfrancis97/DrivR-Base
Reference
Francis A, Campbell C, Gaunt TR. DrivR-Base: a feature extraction toolkit for variant effect prediction model construction. Bioinformatics (2024).