AstraZeneca Quantum Machine Learning for Genomic Biomarker Discovery

AstraZeneca partnered with Cambridge Quantum (Quantinuum) to apply quantum kernel SVMs to RNA-seq expression data from cancer cell lines, targeting responder vs non-responder classification for oncology clinical trials.

Key Outcome
The quantum kernel SVM achieved 79% AUC for BRCA1-mutation responder classification, versus 81% for XGBoost. The study also identified a quantum feature embedding that appears to capture epistatic interactions invisible to linear models.

The Problem

Identifying which cancer patients will respond to a given therapy (the responder vs non-responder classification problem) is one of the most commercially and clinically critical challenges in oncology drug development. RNA-seq expression profiling generates readouts for 20,000 or more genes per patient sample. Hidden within that high-dimensional signal are combinations of gene interactions (epistatic effects) that determine drug response. These interactions are often non-linear and non-additive: gene A being overexpressed matters only when gene B is also silenced.
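
To make the non-additive pattern concrete, here is a minimal toy sketch (synthetic values, not study data) of the "gene A matters only when gene B is silenced" case. The variable names and the four-point design are illustrative assumptions. An additive least-squares model leaves residual error, while adding the A-times-B interaction term fits the response exactly.

```python
import numpy as np

# Toy design, not study data: a = 1 if gene A is overexpressed,
# b = 1 if gene B is silenced; response occurs only when both hold.
a = np.array([0.0, 0.0, 1.0, 1.0])
b = np.array([0.0, 1.0, 0.0, 1.0])
y = a * b

def sum_squared_error(design, target):
    """Least-squares fit of target on design; returns the residual sum of squares."""
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    residual = target - design @ coef
    return float(residual @ residual)

ones = np.ones_like(a)
# Additive model: intercept + A + B
sse_additive = sum_squared_error(np.column_stack([ones, a, b]), y)
# Same model plus the pairwise interaction term A * B
sse_interaction = sum_squared_error(np.column_stack([ones, a, b, a * b]), y)

print(f"additive SSE: {sse_additive:.3f}, with interaction: {sse_interaction:.3f}")
```

With 20,000 genes the number of candidate pairwise terms is on the order of 200 million, which is why enumerating interactions explicitly is impractical and kernel or tree-based methods are used instead.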

Classical machine learning handles this through ensemble methods like XGBoost, which build decision trees over feature combinations, or through deep neural networks. Both approaches require large labeled datasets (hundreds to thousands of patients) before the non-linear interactions become learnable. In early-phase oncology trials, labeled samples are scarce. AstraZeneca’s hypothesis: a quantum kernel, by mapping genomic features into a high-dimensional Hilbert space, might capture gene interaction structure with fewer training examples than classical kernels.

Dimensionality Reduction from 20,000 Genes to 20 Features

The Quantinuum H1-2 operates on 20 qubits. A quantum kernel SVM with angle encoding needs either one qubit per input feature or repeated encoding layers that reuse the same qubits. The first challenge was therefore compressing RNA-seq profiles from 20,000 genes to 20 features without destroying the epistatic signal the quantum kernel was designed to capture.

AstraZeneca used a two-stage reduction. First, a differential expression step narrowed candidates to the 500 most variable genes across the BRCA1-mutation cell line cohort (the code below implements this as a ranking by expression variance). Second, PCA compressed those 500 genes to 20 principal components, each scaled to [0, pi] for angle encoding. The 20 components, ordered by variance explained, retained approximately 72% of the total expression variance.

import numpy as np
import pennylane as qml
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

# Stage 1: select top variable genes
def select_variable_genes(X, n_genes=500):
    gene_variances = np.var(X, axis=0)
    top_idx = np.argsort(gene_variances)[::-1][:n_genes]
    return X[:, top_idx], top_idx

# Stage 2: PCA to 20 components for angle encoding
def reduce_to_qubit_space(X, n_components=20):
    pca = PCA(n_components=n_components)
    X_pca = pca.fit_transform(X)
    scaler = MinMaxScaler(feature_range=(0, np.pi))
    X_scaled = scaler.fit_transform(X_pca)
    print(f"Variance retained: {pca.explained_variance_ratio_.sum():.2%}")
    return X_scaled, pca, scaler

n_qubits = 20
dev = qml.device("default.qubit", wires=n_qubits)

def genomic_feature_map(x):
    """
    Two-layer angle encoding with entanglement.
    Layer 1: RY rotations encode the scaled principal components.
    Entangling layer: CNOT ring (a linear chain plus a closing CNOT from
    the last qubit back to the first) correlates neighboring features.
    Layer 2: RZ rotations re-encode the same values as phases.
    """
    # First rotation layer
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    # Entangling layer -- gene co-expression correlations
    for i in range(n_qubits - 1):
        qml.CNOT(wires=[i, i + 1])
    qml.CNOT(wires=[n_qubits - 1, 0])  # periodic boundary
    # Second rotation layer
    for i in range(n_qubits):
        qml.RZ(x[i], wires=i)

@qml.qnode(dev)
def kernel_circuit(x1, x2):
    genomic_feature_map(x1)
    qml.adjoint(genomic_feature_map)(x2)
    return qml.probs(wires=range(n_qubits))

def quantum_kernel(x1, x2):
    probs = kernel_circuit(x1, x2)
    return float(probs[0])
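
As a sanity check on the fidelity kernel above, the following sketch re-implements the same RY, CNOT-ring, RZ feature map for 3 qubits in plain NumPy. The 3-qubit size, the helper names, and the explicit matrix construction are illustrative assumptions; the 20-qubit PennyLane circuit remains the one described in the text. The kernel value is the squared overlap of the two encoded states, which is the same quantity as the all-zeros probability returned by kernel_circuit.

```python
import numpy as np

N_QUBITS = 3  # toy size for illustration; the case study uses 20

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(theta):
    return np.array([[np.exp(-1j * theta / 2), 0],
                     [0, np.exp(1j * theta / 2)]], dtype=complex)

def single_qubit_layer(gates):
    """Kronecker product of one 2x2 gate per qubit (qubit 0 is the leftmost factor)."""
    full = np.array([[1.0 + 0j]])
    for g in gates:
        full = np.kron(full, g)
    return full

def cnot(control, target, n=N_QUBITS):
    """Permutation matrix for a CNOT on n qubits (qubit 0 = most significant bit)."""
    dim = 2 ** n
    mat = np.zeros((dim, dim), dtype=complex)
    for basis in range(dim):
        bits = [(basis >> (n - 1 - q)) & 1 for q in range(n)]
        if bits[control]:
            bits[target] ^= 1
        out = sum(bit << (n - 1 - q) for q, bit in enumerate(bits))
        mat[out, basis] = 1.0
    return mat

def feature_state(x):
    """Apply the RY layer, the CNOT ring, then the RZ layer to |0...0>."""
    state = np.zeros(2 ** N_QUBITS, dtype=complex)
    state[0] = 1.0
    state = single_qubit_layer([ry(v) for v in x]) @ state
    for i in range(N_QUBITS - 1):
        state = cnot(i, i + 1) @ state
    state = cnot(N_QUBITS - 1, 0) @ state  # periodic boundary
    state = single_qubit_layer([rz(v) for v in x]) @ state
    return state

def fidelity_kernel(x1, x2):
    """Squared overlap |<phi(x2)|phi(x1)>|^2 of the two encoded states."""
    return float(np.abs(np.vdot(feature_state(x2), feature_state(x1))) ** 2)

x_a = np.array([0.3, 1.2, 2.5])
x_b = np.array([2.0, 0.1, 0.7])
print(fidelity_kernel(x_a, x_a), fidelity_kernel(x_a, x_b))
```

A valid fidelity kernel returns 1 for identical inputs, is symmetric in its arguments, and stays within [0, 1]. Checking these properties on a small instance is a cheap way to catch wiring mistakes before paying for 20-qubit simulation or hardware time.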

Cross-Validated AUC Comparison

The quantum kernel SVM was evaluated against XGBoost and a classical RBF kernel SVM using five-fold stratified cross-validation on 112 BRCA1-mutation cancer cell lines with known drug response labels. Computing the full kernel matrix required approximately 6,000 kernel evaluations per fold. That is tractable on a simulator for this cohort size but expensive on H1-2 hardware, so hardware runs were reserved for a subset of 40 samples.
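
The figure of roughly 6,000 kernel evaluations per fold can be reproduced with a short calculation. The sketch assumes near-equal fold sizes (stratification can shift them slightly) and that the symmetric training matrix is filled from its upper triangle including the diagonal, as in the build_kernel_matrix function below. The function name kernel_evaluations is introduced here for illustration.

```python
def kernel_evaluations(n_train, n_test):
    """Count kernel calls for one fold.

    The training matrix is symmetric, so only the upper triangle
    (diagonal included) is evaluated; the test block is rectangular.
    """
    train_evals = n_train * (n_train + 1) // 2
    test_evals = n_test * n_train
    return train_evals + test_evals

n_samples, n_splits = 112, 5
# Near-equal fold sizes: the first (n_samples % n_splits) folds get one extra sample.
fold_sizes = [n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
              for i in range(n_splits)]
per_fold = [kernel_evaluations(n_samples - size, size) for size in fold_sizes]

print(fold_sizes)     # test-set size of each fold
print(per_fold)       # kernel evaluations for each fold
print(sum(per_fold))  # total across the five folds
```

Because the training block grows quadratically with cohort size, the same calculation at 500 to 1,000 samples gives a sense of why hardware execution was restricted to a subset.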

def build_kernel_matrix(X_train, X_test=None):
    """
    Build kernel matrix for training or train-test evaluation.
    K_train[i,j] = quantum_kernel(X_train[i], X_train[j])
    K_test[i,j]  = quantum_kernel(X_test[i],  X_train[j])
    """
    n_train = len(X_train)
    K_train = np.zeros((n_train, n_train))
    for i in range(n_train):
        for j in range(i, n_train):
            val = quantum_kernel(X_train[i], X_train[j])
            K_train[i, j] = val
            K_train[j, i] = val

    if X_test is None:
        return K_train

    n_test = len(X_test)
    K_test = np.zeros((n_test, n_train))
    for i in range(n_test):
        for j in range(n_train):
            K_test[i, j] = quantum_kernel(X_test[i], X_train[j])
    return K_train, K_test

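# Five-fold cross-validated AUC.
# Assumes X_scaled (n_samples x 20, values in [0, pi], e.g. from
# reduce_to_qubit_space above) and y (binary response labels as a
# NumPy array) are already defined.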
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
aucs = []

for train_idx, test_idx in skf.split(X_scaled, y):
    X_tr, X_te = X_scaled[train_idx], X_scaled[test_idx]
    y_tr, y_te = y[train_idx], y[test_idx]

    K_tr, K_te = build_kernel_matrix(X_tr, X_te)
    clf = SVC(kernel="precomputed", probability=True, C=1.0)
    clf.fit(K_tr, y_tr)
    probs = clf.predict_proba(K_te)[:, 1]
    aucs.append(roc_auc_score(y_te, probs))

print(f"Quantum kernel SVM AUC: {np.mean(aucs):.3f} +/- {np.std(aucs):.3f}")
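
The loop above consumes an X_scaled array produced before the split, which suggests the PCA and scaler were fitted on the full cohort (an inference from the code as shown, not a statement about the actual pipeline). A stricter protocol fits both transforms on the training fold only and applies them to the held-out fold. The sketch below shows one way to do that; the function name fit_reduction_on_fold and the clipping step are assumptions added for illustration, and the demo uses random data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import MinMaxScaler

def fit_reduction_on_fold(X_train, X_test, n_components=20):
    """Fit PCA and [0, pi] scaling on the training fold, then transform both folds.

    Held-out values can fall outside the training range, so they are
    clipped to [0, pi] to stay within the angle-encoding interval.
    """
    pca = PCA(n_components=n_components)
    scaler = MinMaxScaler(feature_range=(0, np.pi))
    X_tr = scaler.fit_transform(pca.fit_transform(X_train))
    X_te = scaler.transform(pca.transform(X_test))
    X_te = np.clip(X_te, 0.0, np.pi)
    return X_tr, X_te

# Synthetic demo data only: 60 samples, 30 features, 5 components.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(60, 30))
X_tr_demo, X_te_demo = fit_reduction_on_fold(X_demo[:48], X_demo[48:], n_components=5)
print(X_tr_demo.shape, X_te_demo.shape)
```

Inside the cross-validation loop this would replace the direct indexing of X_scaled, with the gene-variance filter ideally moved inside the fold as well.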

Epistatic Interactions and the Case for Quantum Kernels

The AUC gap (79% for the quantum kernel vs 81% for XGBoost) understates what the experiment revealed. Feature importance analysis on the XGBoost model identified the top predictive genes as individual biomarkers (BRCA1 expression level, TP53 status). The quantum kernel SVM disagreed with XGBoost on approximately 14% of the cohort: cases where individual gene expression was unremarkable but the combination of expression patterns across principal components led to a different prediction.
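
A discordance figure like the 14% above can be computed by comparing the two models' held-out predictions sample by sample. The sketch below is a minimal version using toy labels (not study data); the function name discordance and the example arrays are illustrative assumptions.

```python
import numpy as np

def discordance(pred_a, pred_b):
    """Return the fraction of samples where two classifiers disagree, and their indices."""
    pred_a = np.asarray(pred_a)
    pred_b = np.asarray(pred_b)
    if pred_a.shape != pred_b.shape:
        raise ValueError("prediction arrays must have the same shape")
    mask = pred_a != pred_b
    return float(mask.mean()), np.flatnonzero(mask)

# Toy predictions only, chosen to illustrate the calculation.
xgb_pred = np.array([1, 0, 1, 1, 0, 0, 1, 0])
qk_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
rate, idx = discordance(xgb_pred, qk_pred)
print(f"discordant fraction: {rate:.2f}, sample indices: {idx.tolist()}")
```

The returned indices are the starting point for follow-up analysis, for example inspecting which principal components differ most on the discordant samples.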

This subset of discordant predictions is the scientifically interesting finding. Gene interaction networks (epistasis) produce drug response phenotypes that individual-gene biomarkers cannot capture. A quantum kernel with entangling layers naturally encodes pairwise and higher-order correlations between the encoded features in its Hilbert space representation. Whether this translates to consistent classification advantage at scale (with hundreds of patients and 50+ qubit hardware) is the question AstraZeneca and Quantinuum are pursuing in follow-on work.

The AUC gap at this scale is expected to be narrower than it would be on fault-tolerant hardware. The real test is at cohort sizes of 500 to 1,000 patients, where epistatic effects should dominate simpler biomarker signals and the quantum kernel’s structured Hilbert space becomes a genuine inductive prior rather than an expensive equivalent of an RBF kernel.
