CanDrivR-CS: cancer-specific machine learning to separate recurrent from rare missense variants


Overview

Cancer genomes contain large numbers of mutations, but only a subset are functionally important. One simple clue is recurrence: if the same missense variant shows up repeatedly across patients with the same cancer type, that can suggest positive selection for a growth advantage. At the same time, rare variants can still matter (for example, if they emerge under treatment as resistance mechanisms).

In work led by Amy Francis, we introduce CanDrivR-CS, a framework that trains cancer-type-specific machine-learning models to distinguish recurrent from rare somatic missense variants. The work is a useful reminder that “one-size-fits-all” predictors can miss disease-context signals, and that relatively interpretable models can still surface mechanistic hypotheses.

What we did

We curated missense variant data from the International Cancer Genome Consortium (ICGC) and trained a suite of gradient boosting classifiers, one per cancer type, alongside a baseline pan-cancer model. The goal was not to label variants as “pathogenic” in the clinical sense, but to learn patterns that separate variants from two cancer-relevant frequency regimes: those that recur across samples versus those that appear rarely.
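To make the two frequency regimes concrete, here is a minimal Python sketch of how variants might be labelled within each cancer type by counting distinct donors. It is an illustration only, not the labelling procedure used in the paper: the function name `label_by_recurrence`, the input tuple layout, and the `recurrent_min` cut-off are all assumptions chosen for the example.

```python
from collections import defaultdict


def label_by_recurrence(observations, recurrent_min=2):
    """Label each (cancer_type, variant_id) pair as 'recurrent' or 'rare'.

    observations: iterable of (cancer_type, donor_id, variant_id) tuples.
    recurrent_min: hypothetical cut-off; a variant seen in at least this many
        distinct donors of one cancer type is called 'recurrent'.
    """
    donors_per_variant = defaultdict(set)
    for cancer_type, donor_id, variant_id in observations:
        # Use a set so the same donor is never counted twice for a variant.
        donors_per_variant[(cancer_type, variant_id)].add(donor_id)

    return {
        key: "recurrent" if len(donors) >= recurrent_min else "rare"
        for key, donors in donors_per_variant.items()
    }


# Toy data with placeholder identifiers (not real ICGC records).
toy = [
    ("SKCM", "donor_1", "var_A"),
    ("SKCM", "donor_2", "var_A"),
    ("SKCM", "donor_1", "var_B"),
    ("SKCA", "donor_3", "var_A"),
]
labels = label_by_recurrence(toy)
```

Because counting is done per cancer type, the same variant can be recurrent in one cancer and rare in another, which is the motivation for training one model per cancer type.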

A practical detail was our evaluation setup: we report leave-one-group-out cross-validation (LOGO-CV), which tests generalisation by holding one meaningful group (e.g. a gene or cohort) out of training in each fold and evaluating on that unseen group.
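The splitting logic behind LOGO-CV is simple enough to sketch in a few lines of plain Python. The helper name `leave_one_group_out` and the toy group labels below are assumptions for illustration; in practice, scikit-learn's `LeaveOneGroupOut` splitter in `sklearn.model_selection` provides the same behaviour.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices) once per group.

    groups: sequence giving the group label (e.g. gene or cohort) of each sample.
    """
    for held_out in sorted(set(groups)):
        # Every sample from the held-out group goes to the test fold ...
        test_idx = [i for i, g in enumerate(groups) if g == held_out]
        # ... and every other sample is used for training.
        train_idx = [i for i, g in enumerate(groups) if g != held_out]
        yield held_out, train_idx, test_idx


# Toy example: five variants drawn from three hypothetical groups.
toy_groups = ["gene_1", "gene_1", "gene_2", "gene_3", "gene_2"]
for held_out, train_idx, test_idx in leave_one_group_out(toy_groups):
    print(held_out, "train:", train_idx, "test:", test_idx)
```

The key property is that no group ever appears on both sides of a fold, so a model cannot score well simply by memorising group-specific patterns.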

Key results

  • Cancer-type-specific models outperformed the pan-cancer baseline, with LOGO-CV F1 scores reaching 0.90 for skin cutaneous melanoma (CanDrivR-SKCM) and 0.89 for skin adenocarcinoma (CanDrivR-SKCA), versus 0.792 for the baseline model.
  • DNA-shape properties consistently ranked among the most informative features across cancer types. We report that recurrent missense variants were enriched in regions associated with DNA bends and rolls, raising the possibility that local structural context contributes to mutational hotspots (for example via replication or repair dynamics).
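For readers less familiar with the metric, the F1 score quoted above is the harmonic mean of precision and recall. The sketch below computes the binary form with "recurrent" as the positive class; that choice, the function name `f1_binary`, and the toy labels are assumptions for illustration, and the paper may use a different averaging scheme.

```python
def f1_binary(y_true, y_pred, positive="recurrent"):
    """Binary F1 score: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and/or recall is zero, so F1 is zero.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: one true positive, one false positive, one false negative.
y_true = ["recurrent", "recurrent", "rare", "rare"]
y_pred = ["recurrent", "rare", "recurrent", "rare"]
score = f1_binary(y_true, y_pred)
```

Under LOGO-CV, a score like this would typically be computed on each held-out group and then summarised across folds.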

Why this matters

From a translational perspective, separating “common” from “rare” somatic variants is not the whole driver/passenger story, but it is a useful lens:

  • It can help prioritise variants for follow-up in cancer-type-specific settings (where selection pressures differ).
  • It provides an interpretable way to test whether adding new feature classes (like DNA-shape) improves discrimination.
  • It highlights the value of open, reusable pipelines for variant feature engineering and modelling.

Resources

Paper

Francis A, Campbell C, Gaunt T. CanDrivR-CS: A Cancer-Specific Machine Learning Framework for Distinguishing Recurrent and Rare Variants. bioRxiv (posted Sep 23, 2024). https://www.biorxiv.org/content/10.1101/2024.09.19.613896v1