
Tissue-of-Origin Classifier

Single-sample tissue-of-origin inference using an AutoGluon ensemble trained on AACR GENIE v18 data.

Overview

Predict the tissue of origin for a tumor sample using somatic mutations, copy-number alterations, structural variants, and clinical features. The classifier integrates 7 feature modalities into 4,320 engineered features and predicts across 22 tumor types.

Output is a self-contained HTML report with top-3 predictions, full probability distributions, SHAP explanations, and a summary of the input data.
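As a rough illustration of what "top-3 predictions" means, the sketch below ranks a probability distribution and keeps the three most likely classes. The tumor-type labels, the probabilities, and the function name `top_k` are all made up for this example; it is not the classifier's own code.

```python
def top_k(probabilities, k=3):
    """Return the k (label, probability) pairs with the highest probability."""
    ranked = sorted(probabilities.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]


# Hypothetical probability distribution over a few tumor types (illustrative only).
example = {
    "Lung": 0.62,
    "Breast": 0.05,
    "Colorectal": 0.21,
    "Pancreas": 0.09,
    "Other": 0.03,
}

top3 = top_k(example)
for label, prob in top3:
    print(f"{label}: {prob:.2f}")
```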

Performance

Balanced accuracy: 85.0%
Top-3 accuracy: 94.8%
Tumor types: 22
Features: 4,320
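For readers unfamiliar with the two headline metrics, here is a minimal sketch of how they are commonly defined. It assumes the standard definitions (balanced accuracy as the mean of per-class recall, top-k accuracy as the fraction of samples whose true label appears among the k highest-ranked predictions); the source does not state exactly how its figures were computed, and the labels below are toy data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)


def top_k_accuracy(y_true, ranked_preds, k=3):
    """Fraction of samples whose true label is among the first k ranked predictions."""
    hits = sum(truth in ranked[:k] for truth, ranked in zip(y_true, ranked_preds))
    return hits / len(y_true)


# Toy data (hypothetical class names A-D).
y_true = ["A", "A", "A", "B"]
y_pred = ["A", "A", "B", "B"]
ranked = [["A", "B", "C"], ["A", "C", "B"], ["B", "C", "A"], ["C", "A", "D"]]

bal_acc = balanced_accuracy(y_true, y_pred)   # recall(A) = 2/3, recall(B) = 1
top3_acc = top_k_accuracy(y_true, ranked, k=3)
```

Balanced accuracy is the more conservative number when class sizes differ, because a model cannot score well by favoring the most common tumor types.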

Evaluation

Model performance. (A) Per-class sensitivity across 22 tumor types on the held-out test set. (B) Row-normalized confusion matrix. (C) Top-1 accuracy and sample coverage as a function of confidence threshold. (D) External validation on GENIE holdout, UCSF, and MSK Medical cohorts.
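The accuracy-versus-coverage curve in panel (C) can be sketched as follows. This is an assumption about the usual construction (keep only samples whose top-1 confidence meets a threshold, then report the fraction kept and the accuracy among them), not the project's evaluation code; the confidences and outcomes are toy values.

```python
def accuracy_coverage(confidences, correct, thresholds):
    """For each threshold, keep samples with confidence >= threshold and
    return (threshold, coverage, accuracy). Accuracy is None if nothing is kept."""
    rows = []
    n = len(confidences)
    for t in thresholds:
        kept = [ok for conf, ok in zip(confidences, correct) if conf >= t]
        coverage = len(kept) / n
        accuracy = sum(kept) / len(kept) if kept else None
        rows.append((t, coverage, accuracy))
    return rows


# Toy top-1 confidences and whether each top-1 prediction was correct.
confidences = [0.95, 0.80, 0.60, 0.40]
correct = [True, True, False, True]

curve = accuracy_coverage(confidences, correct, thresholds=[0.0, 0.5, 0.9])
for threshold, coverage, accuracy in curve:
    print(threshold, coverage, accuracy)
```

Raising the threshold trades coverage for accuracy: fewer samples receive a call, but the calls that remain tend to be more reliable.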

Feature modalities

Installation

Requires Python ≥ 3.11 and uv.

git clone https://github.com/viktorlj/tissue-classifier.git
cd tissue-classifier
uv venv && uv sync

Usage

# Basic prediction
tissue-classifier predict \
    --maf sample.maf \
    --seg sample.seg \
    --age 65 --sex Male

# With all options
tissue-classifier predict \
    --maf sample.maf \
    --seg sample.seg \
    --age 65 --sex Male \
    --genome hg19 \
    --output ./results \
    --sample-id PATIENT_001

# Validate input files
tissue-classifier validate --maf sample.maf --seg sample.seg

# Show model info
tissue-classifier info
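When running many samples from a script, it may help to assemble the command programmatically. The sketch below builds an argument list using only the flags shown in the examples above; the helper name `build_predict_command` is hypothetical, and treating every flag except `--maf` as optional is an assumption rather than documented behavior.

```python
import shlex


def build_predict_command(maf, seg=None, age=None, sex=None,
                          genome=None, output=None, sample_id=None):
    """Assemble an argv list for `tissue-classifier predict`.

    Only flags that appear in the usage examples are emitted; values left
    as None are skipped.
    """
    cmd = ["tissue-classifier", "predict", "--maf", str(maf)]
    optional = [
        ("--seg", seg),
        ("--age", age),
        ("--sex", sex),
        ("--genome", genome),
        ("--output", output),
        ("--sample-id", sample_id),
    ]
    for flag, value in optional:
        if value is not None:
            cmd += [flag, str(value)]
    return cmd


cmd = build_predict_command("sample.maf", seg="sample.seg", age=65, sex="Male")
print(shlex.join(cmd))
# To execute it, pass the list to subprocess.run(cmd, check=True).
```

Passing a list to `subprocess.run` avoids shell quoting problems with file paths that contain spaces.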

Input files

MAF (required)

Tab-delimited mutation annotation file with columns: Hugo_Symbol, Chromosome, Start_Position, End_Position, Reference_Allele, Tumor_Seq_Allele2, Variant_Classification, Variant_Type, Tumor_Sample_Barcode.
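A quick pre-flight check of the header can catch a malformed MAF before running the classifier. The sketch below is only an illustration (the bundled `tissue-classifier validate` command remains the authoritative check); the function name is hypothetical, and skipping leading `#` comment lines is an assumption based on common MAF conventions.

```python
import csv
import io

REQUIRED_MAF_COLUMNS = [
    "Hugo_Symbol", "Chromosome", "Start_Position", "End_Position",
    "Reference_Allele", "Tumor_Seq_Allele2", "Variant_Classification",
    "Variant_Type", "Tumor_Sample_Barcode",
]


def missing_maf_columns(handle):
    """Return the required columns absent from a tab-delimited MAF header.

    Blank lines and lines starting with '#' before the header are skipped.
    """
    for line in handle:
        if line.startswith("#") or not line.strip():
            continue
        header = next(csv.reader([line], delimiter="\t"))
        return [col for col in REQUIRED_MAF_COLUMNS if col not in header]
    return list(REQUIRED_MAF_COLUMNS)  # no header found: everything is missing


# In-memory example whose header lacks the last required column.
example = io.StringIO("#version 2.4\n" + "\t".join(REQUIRED_MAF_COLUMNS[:-1]) + "\n")
missing = missing_maf_columns(example)
print(missing)
```

For a file on disk, the same function can be called as `missing_maf_columns(open("sample.maf", newline=""))`.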

SEG (optional)

Copy-number segmentation file with columns: ID, chrom, loc.start, loc.end, seg.mean.
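The following sketch reads a SEG file with the columns listed above and converts coordinates and `seg.mean` to numbers. It assumes the file is tab-delimited with integer coordinates, which the source does not state explicitly; the function name and the example values are made up.

```python
import csv
import io

SEG_COLUMNS = ["ID", "chrom", "loc.start", "loc.end", "seg.mean"]


def read_seg(handle):
    """Parse a tab-delimited SEG file into a list of dicts with typed values.

    Raises ValueError if a column is missing or a segment has start > end.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    missing = [c for c in SEG_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing SEG columns: {missing}")
    segments = []
    for row in reader:
        start, end = int(row["loc.start"]), int(row["loc.end"])
        if start > end:
            raise ValueError(f"segment start > end: {row}")
        segments.append({
            "ID": row["ID"],
            "chrom": row["chrom"],
            "start": start,
            "end": end,
            "seg_mean": float(row["seg.mean"]),
        })
    return segments


# Made-up two-segment example.
text = (
    "ID\tchrom\tloc.start\tloc.end\tseg.mean\n"
    "PATIENT_001\t1\t1000\t5000\t0.42\n"
    "PATIENT_001\t1\t5001\t9000\t-0.80\n"
)
segments = read_seg(io.StringIO(text))
```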

SV (optional)

Structural variant file with columns: Sample_Id, Site1_Hugo_Symbol, Site2_Hugo_Symbol.
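To show how the three SV columns fit together, the sketch below counts gene pairs for one sample. It assumes a tab-delimited file and treats pairs as unordered, which is an illustrative choice; the source does not describe how the classifier converts structural variants into features, and the function name and rows are hypothetical.

```python
import csv
import io
from collections import Counter


def sv_gene_pairs(handle, sample_id):
    """Count unordered (gene, gene) pairs for one sample in a tab-delimited SV table."""
    pairs = Counter()
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["Sample_Id"] != sample_id:
            continue
        pair = tuple(sorted((row["Site1_Hugo_Symbol"], row["Site2_Hugo_Symbol"])))
        pairs[pair] += 1
    return pairs


# Made-up rows: two events for PATIENT_001 and one for another sample.
text = (
    "Sample_Id\tSite1_Hugo_Symbol\tSite2_Hugo_Symbol\n"
    "PATIENT_001\tTMPRSS2\tERG\n"
    "PATIENT_001\tERG\tTMPRSS2\n"
    "PATIENT_002\tEML4\tALK\n"
)
pairs = sv_gene_pairs(io.StringIO(text), "PATIENT_001")
```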

Related repositories

tissue-classifier: Inference pipeline & pre-trained model
too-panelseq: Training code & evaluation (not yet public)