Quantum Transfer Learning with PennyLane

Why Transfer Learning Works

Deep convolutional neural networks learn hierarchical feature representations. When you train a ResNet-18 on ImageNet’s 1.2 million images, the layers develop a progression of increasingly abstract detectors:

Early layers (conv1, layer1): Edge detectors, color gradients, and texture filters. These are nearly universal across all image domains.
Middle layers (layer2, layer3): Combinations of edges into shapes, corners, curves, and simple object parts. Still broadly generic.
Late layers (layer4): High-level semantic features like wheels, wings, fur patterns, or window grids. These are more task-specific but still transfer well to related tasks.
Final fully connected layer (fc): A linear classifier mapping 512 features to 1000 ImageNet classes. This layer is entirely task-specific and gets discarded during transfer learning.

The key insight is that the 512-dimensional feature vector produced by the penultimate layer already encodes rich, discriminative information about any natural image. For many downstream tasks, these features are so well-structured that even a simple linear classifier achieves strong accuracy. This is precisely what makes quantum transfer learning viable: if the features are already well-separated, a small parameterized quantum circuit can learn the decision boundary without needing the representational power of a deep network.

Installing Dependencies

pip install pennylane torch torchvision scikit-learn matplotlib

You need PyTorch for the classical CNN backbone, PennyLane for the quantum circuit layer, scikit-learn for PCA, and matplotlib for visualizations.

Step 1: Load ResNet-18 and Freeze It

import torch
import torch.nn as nn
import torchvision.models as models

# Load ResNet-18 pretrained on ImageNet
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze all parameters so we only train the quantum head
for param in model.parameters():
    param.requires_grad = False

# The final layer maps 512 features to 1000 ImageNet classes
print(model.fc)
# Linear(in_features=512, out_features=1000, bias=True)

Freezing the backbone means that during backpropagation, gradients stop at the boundary between the frozen ResNet and the trainable quantum head. This dramatically reduces the number of trainable parameters and prevents catastrophic forgetting of the learned representations.

Step 2: Extract Features and Visualize Separability

Before building the quantum circuit, it is worth verifying that the ResNet features actually separate our target classes. We extract features from CIFAR-10 airplanes (class 0) and automobiles (class 1), reduce them to 2D with PCA, and plot the result.

import numpy as np
from sklearn.decomposition import PCA
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader, Subset
import matplotlib.pyplot as plt

transform = transforms.Compose([
    transforms.Resize(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Load only classes 0 (airplane) and 1 (automobile)
cifar = CIFAR10(root="./data", train=True, download=True, transform=transform)
idx = [i for i, (_, label) in enumerate(cifar) if label in [0, 1]]
subset = Subset(cifar, idx[:2000])
loader = DataLoader(subset, batch_size=64, shuffle=False)

# Build a feature extractor by removing the fc layer
feature_extractor = nn.Sequential(*list(model.children())[:-1])
feature_extractor.eval()

all_features, all_labels = [], []
with torch.no_grad():
    for imgs, labels in loader:
        feats = feature_extractor(imgs).squeeze()
        all_features.append(feats.numpy())
        all_labels.append(labels.numpy())

X = np.concatenate(all_features)
y = np.concatenate(all_labels)

# Visualize with 2D PCA
pca_2d = PCA(n_components=2)
X_2d = pca_2d.fit_transform(X)

plt.figure(figsize=(8, 6))
for label, name, color in [(0, "Airplane", "blue"), (1, "Automobile", "red")]:
    mask = y == label
    plt.scatter(X_2d[mask, 0], X_2d[mask, 1], c=color, label=name, alpha=0.5, s=15)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.title("ResNet-18 Features in 2D PCA Space")
plt.legend()
plt.tight_layout()
plt.savefig("feature_separability.png", dpi=150)
plt.show()

You will see that the two classes form mostly distinct clusters with a clear linear boundary between them. The first two principal components already capture enough variance to separate airplanes from automobiles. This confirms that even a simple quantum circuit with a few parameters can learn to classify these features, since the hard work of feature extraction was already done by ResNet.

Step 3: Reduce Dimensions with PCA

For the actual quantum circuit, we reduce from 512 to 4 dimensions. With 4 qubits and angle encoding, each qubit receives one PCA component as its rotation angle.

n_qubits = 4
pca = PCA(n_components=n_qubits)
X_pca = pca.fit_transform(X)

# Normalize to [-pi, pi] for angle encoding
# Each feature is scaled independently to preserve relative structure
X_min = X_pca.min(axis=0, keepdims=True)
X_max = X_pca.max(axis=0, keepdims=True)
X_pca_norm = 2 * np.pi * (X_pca - X_min) / (X_max - X_min + 1e-8) - np.pi

Note that we normalize each feature column independently. Normalizing all features together (using a global min/max) would distort the relative scales of the principal components, throwing away the variance information that PCA captured.

Step 4: Encoding Strategies Comparison

The choice of how to embed classical data into a quantum state significantly affects circuit expressivity and qubit efficiency. Here are three common strategies for encoding 4 features.

Angle Encoding

The simplest approach: one qubit per feature, each encoded as a rotation angle.

import pennylane as qml

dev_angle = qml.device("default.qubit", wires=4)

@qml.qnode(dev_angle)
def angle_encoding_circuit(x):
    """Angle encoding: 4 features on 4 qubits via Ry rotations."""
    for i in range(4):
        qml.RY(x[i], wires=i)
    return [qml.expval(qml.PauliZ(i)) for i in range(4)]

# Draw the circuit
x_example = np.array([0.5, -0.3, 1.2, -0.8])
print(qml.draw(angle_encoding_circuit)(x_example))

Angle encoding scales linearly (n features require n qubits) and is straightforward to implement. The limitation is that each qubit encodes exactly one feature with no cross-feature interactions in the encoding layer itself. Any feature mixing must come from the variational layers that follow.

Amplitude Encoding

Amplitude encoding stores N features in the amplitudes of log2(N) qubits. For 4 features, you need only 2 qubits.

dev_amp = qml.device("default.qubit", wires=2)

@qml.qnode(dev_amp)
def amplitude_encoding_circuit(x):
    """Amplitude encoding: 4 features in amplitudes of 2 qubits."""
    # x must be normalized to unit length
    qml.AmplitudeEmbedding(x, wires=range(2), normalize=True)
    return qml.probs(wires=range(2))

print(qml.draw(amplitude_encoding_circuit)(x_example))

Amplitude encoding is exponentially more compact, but it has two practical drawbacks. First, the state preparation circuit for arbitrary amplitudes can be deep (O(2^n) gates in the worst case). Second, the input vector must be normalized to unit length, which discards magnitude information. For high-dimensional data where you want qubit efficiency, amplitude encoding is attractive; for small feature counts like our PCA-reduced data, angle encoding is simpler and equally effective.

IQP (Instantaneous Quantum Polynomial) Encoding

IQP encoding embeds features as single-qubit rotations and then adds entangling gates that create cross-terms between features.

dev_iqp = qml.device("default.qubit", wires=4)

@qml.qnode(dev_iqp)
def iqp_encoding_circuit(x):
    """IQP encoding: features as Rz angles with entangling X interactions."""
    # First layer: Hadamard + Rz encoding
    for i in range(4):
        qml.Hadamard(wires=i)
        qml.RZ(x[i], wires=i)
    # Entangling layer: ZZ interactions encode feature cross-terms
    for i in range(3):
        qml.CNOT(wires=[i, i + 1])
        qml.RZ(x[i] * x[i + 1], wires=i + 1)
        qml.CNOT(wires=[i, i + 1])
    # Second encoding layer for increased expressivity
    for i in range(4):
        qml.Hadamard(wires=i)
        qml.RZ(x[i], wires=i)
    return [qml.expval(qml.PauliZ(i)) for i in range(4)]

print(qml.draw(iqp_encoding_circuit)(x_example))

The ZZ interaction terms (x_i * x_j) create feature cross-correlations that cannot be captured by single-qubit angle encoding alone. IQP circuits have been shown to produce kernel functions that are classically hard to compute, making them theoretically interesting for quantum machine learning. The tradeoff is increased circuit depth and more sensitivity to hardware noise.

For this tutorial, we use angle encoding because it is the most transparent and works well when the features are already well-separated by PCA.

Step 5: Build the Quantum Circuit Layer

dev = qml.device("default.qubit", wires=n_qubits)

def variational_layer(weights, wires):
    """Single variational layer: Ry-Rz rotations + nearest-neighbor CNOTs."""
    for i, w in enumerate(wires):
        qml.RY(weights[i, 0], wires=w)
        qml.RZ(weights[i, 1], wires=w)
    # Entangling: linear chain of CNOTs
    for i in range(len(wires) - 1):
        qml.CNOT(wires=[wires[i], wires[i + 1]])

@qml.qnode(dev, interface="torch", diff_method="best")
def quantum_circuit(inputs, weights):
    """Quantum classification circuit with angle encoding and variational layers."""
    # Angle encoding: one Ry per qubit
    for i in range(n_qubits):
        qml.RY(inputs[i], wires=i)
    # Variational layers
    for layer in range(weights.shape[0]):
        variational_layer(weights[layer], range(n_qubits))
    # Measure Z expectation on qubit 0 for binary classification
    return qml.expval(qml.PauliZ(0))

Step 6: Measurement Strategies

The choice of measurement determines what information you extract from the quantum state and how many output values the circuit produces.

Single Expectation Value (Binary Classification)

The current circuit returns qml.expval(PauliZ(0)), a single scalar in [-1, 1]. This is natural for binary classification: map positive values to class 0, negative values to class 1. Feed it through BCEWithLogitsLoss by treating it as a logit.

Multiple Expectation Values (Multi-Class)

For K-class classification, measure PauliZ on K qubits and use softmax:

dev_multi = qml.device("default.qubit", wires=4)

@qml.qnode(dev_multi, interface="torch", diff_method="best")
def multiclass_circuit(inputs, weights):
    """4-class circuit: one PauliZ measurement per class."""
    for i in range(4):
        qml.RY(inputs[i], wires=i)
    for layer in range(weights.shape[0]):
        variational_layer(weights[layer], range(4))
    # Return 4 expectation values, one per class
    return [qml.expval(qml.PauliZ(i)) for i in range(4)]

# Usage in a PyTorch model:
# logits = multiclass_circuit(features, weights)  # shape: (4,)
# probs = torch.softmax(logits, dim=-1)           # shape: (4,)
# loss = nn.CrossEntropyLoss()(logits, target)

Probability Distribution

For circuits where you want the full measurement statistics:

@qml.qnode(dev_multi, interface="torch", diff_method="best")
def probs_circuit(inputs, weights):
    """Returns 2^4 = 16 probabilities for all computational basis states."""
    for i in range(4):
        qml.RY(inputs[i], wires=i)
    for layer in range(weights.shape[0]):
        variational_layer(weights[layer], range(4))
    return qml.probs(wires=range(4))

# Output shape: (16,). You can feed this into a classical linear layer
# to map 16 probabilities to K classes.

The probabilities approach is the most flexible but also the most expensive: for n qubits you get 2^n outputs, and each probability requires many measurement shots on real hardware to estimate accurately.

Step 7: Wrap as a TorchLayer

PennyLane’s TorchLayer converts the QNode into a standard PyTorch module. You specify which arguments are trainable weights and their shapes.

n_layers = 2
weight_shapes = {"weights": (n_layers, n_qubits, 2)}  # [n_layers, n_qubits, (RY, RZ)]

qlayer = qml.qnn.TorchLayer(quantum_circuit, weight_shapes)

The qlayer is now a proper nn.Module with trainable parameters accessible to any PyTorch optimizer. Under the hood, TorchLayer handles batching, gradient computation, and parameter registration.

Step 8: Build the Quantum Head Module

class QuantumHead(nn.Module):
    def __init__(self, pca, qlayer):
        super().__init__()
        # Store PCA parameters as buffers (not trainable, but move with .to(device))
        self.register_buffer("pca_mean", torch.tensor(pca.mean_, dtype=torch.float32))
        self.register_buffer("pca_components", torch.tensor(pca.components_, dtype=torch.float32))
        self.register_buffer("pca_min", torch.tensor(X_pca.min(axis=0), dtype=torch.float32))
        self.register_buffer("pca_max", torch.tensor(X_pca.max(axis=0), dtype=torch.float32))
        self.qlayer = qlayer

    def forward(self, x):
        # Apply PCA projection inline (so we can use the full ResNet pipeline)
        x = x - self.pca_mean
        x = x @ self.pca_components.T
        # Normalize each feature to [-pi, pi] using training set statistics
        x = 2 * torch.pi * (x - self.pca_min) / (self.pca_max - self.pca_min + 1e-8) - torch.pi
        return self.qlayer(x)

model.fc = QuantumHead(pca, qlayer)

Using register_buffer instead of plain tensors ensures the PCA parameters travel with the model when you call .to(device) or torch.save(). This avoids silent CPU/GPU mismatches.

Step 9: Data Re-uploading for Greater Expressivity

Standard angle encoding embeds the input data once at the beginning of the circuit. Data re-uploading is a technique where the input features are re-encoded at multiple points in the circuit, alternating with variational layers:

Encode(x) -> Variational(layer 0) -> Encode(x) -> Variational(layer 1) -> Measure

This is analogous to a classical neural network where each layer processes the input through a non-linear transformation. A single data upload limits the circuit to a single Fourier frequency in the input, while re-uploading increases the frequency spectrum and expressivity.

dev_reupload = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev_reupload, interface="torch", diff_method="best")
def reupload_circuit(inputs, weights):
    """Data re-uploading: alternate between encoding and variational layers."""
    n_layers = weights.shape[0]
    for layer in range(n_layers):
        # Re-encode the input at each layer
        for i in range(n_qubits):
            qml.RY(inputs[i], wires=i)
        # Variational layer
        variational_layer(weights[layer], range(n_qubits))
    return qml.expval(qml.PauliZ(0))

# Compare: standard circuit encodes once with 2 variational layers.
# Re-uploading circuit encodes twice with 2 variational layers.
# Same parameter count, but the re-uploading circuit can represent
# a richer set of decision boundaries.

weight_shapes_reupload = {"weights": (2, n_qubits, 2)}
qlayer_reupload = qml.qnn.TorchLayer(reupload_circuit, weight_shapes_reupload)

Data re-uploading is particularly useful when you have a small number of qubits and cannot increase circuit width. By encoding the data multiple times, you trade circuit depth for expressivity without adding qubits. In practice, 2-3 re-uploads are a good default for 4-qubit circuits.

Step 10: Ansatz Depth vs Barren Plateaus

When you scale parameterized quantum circuits, gradients tend to vanish exponentially with depth. This is the barren plateau problem: for a random circuit on n qubits, the variance of the gradient of any single parameter decreases as approximately 1/4^n for sufficiently deep circuits. Circuit depth amplifies this effect.

Here is an empirical measurement of gradient variance as a function of circuit depth:

import pennylane as qml
import numpy as np

def measure_gradient_variance(n_qubits, n_layers, n_samples=100):
    """Compute variance of dL/dtheta_0 across random initializations."""
    dev = qml.device("default.qubit", wires=n_qubits)

    @qml.qnode(dev, diff_method="parameter-shift")
    def circuit(weights):
        for layer in range(n_layers):
            for i in range(n_qubits):
                qml.RY(weights[layer, i, 0], wires=i)
                qml.RZ(weights[layer, i, 1], wires=i)
            for i in range(n_qubits - 1):
                qml.CNOT(wires=[i, i + 1])
        return qml.expval(qml.PauliZ(0))

    gradients = []
    for _ in range(n_samples):
        weights = np.random.uniform(0, 2 * np.pi, (n_layers, n_qubits, 2))
        grad = qml.grad(circuit)(weights)
        # Gradient of the first parameter: grad[0, 0, 0]
        gradients.append(grad[0, 0, 0])

    return np.var(gradients)

# Measure for depths 1 through 5
depths = [1, 2, 3, 4, 5]
variances = []
for d in depths:
    var = measure_gradient_variance(n_qubits=4, n_layers=d, n_samples=100)
    variances.append(var)
    print(f"Depth {d}: gradient variance = {var:.6f}")

# Plot on log scale
plt.figure(figsize=(8, 5))
plt.semilogy(depths, variances, "bo-", label="Measured variance")
# Theoretical scaling: proportional to 1/4^depth (for 4 qubits)
theoretical = [variances[0] * (0.25 ** (d - 1)) for d in depths]
plt.semilogy(depths, theoretical, "r--", label=r"$\propto (1/4)^{d}$ scaling")
plt.xlabel("Number of variational layers")
plt.ylabel("Var(dL/d$\\theta_0$)")
plt.title("Gradient Variance vs Circuit Depth (4 qubits)")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("barren_plateaus.png", dpi=150)
plt.show()

You will observe that the gradient variance drops by roughly a factor of 4 with each added layer. For 4 qubits and 5 layers, the variance is on the order of 10^-4, meaning the optimizer receives essentially random gradient signals. Practical guideline: for 4 qubits, keep n_layers <= 2 to maintain trainable gradients. For 8 qubits, even n_layers = 1 can show significant barren plateau effects.

Step 11: Full Training Pipeline with DataLoader

The training loop below is a realistic pipeline with proper batching, train/validation splits, data augmentation, learning rate scheduling, and early stopping.

import torch.optim as optim
from torch.utils.data import TensorDataset, DataLoader, random_split

# Prepare the pre-computed PCA features as tensors
X_tensor = torch.tensor(X_pca_norm, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).unsqueeze(1)

# Train/validation split: 80/20
dataset = TensorDataset(X_tensor, y_tensor)
n_train = int(0.8 * len(dataset))
n_val = len(dataset) - n_train
train_set, val_set = random_split(dataset, [n_train, n_val])

train_loader = DataLoader(train_set, batch_size=32, shuffle=True)
val_loader = DataLoader(val_set, batch_size=32, shuffle=False)

# Set up the model (quantum head only, since features are pre-extracted)
optimizer = optim.Adam(qlayer.parameters(), lr=0.01)
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min",
                                                   factor=0.5, patience=5)
loss_fn = nn.BCEWithLogitsLoss()

# Training loop with early stopping
best_val_loss = float("inf")
patience_counter = 0
max_patience = 10
train_losses, val_losses = [], []
train_accs, val_accs = [], []

for epoch in range(50):
    # Training phase
    qlayer.train()
    epoch_loss, epoch_correct, epoch_total = 0.0, 0, 0
    for x_batch, y_batch in train_loader:
        optimizer.zero_grad()
        out = qlayer(x_batch).unsqueeze(1)
        loss = loss_fn(out, y_batch)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item() * len(x_batch)
        epoch_correct += ((out > 0) == y_batch.bool()).sum().item()
        epoch_total += len(x_batch)

    train_losses.append(epoch_loss / epoch_total)
    train_accs.append(epoch_correct / epoch_total)

    # Validation phase
    qlayer.eval()
    val_loss, val_correct, val_total = 0.0, 0, 0
    with torch.no_grad():
        for x_batch, y_batch in val_loader:
            out = qlayer(x_batch).unsqueeze(1)
            loss = loss_fn(out, y_batch)
            val_loss += loss.item() * len(x_batch)
            val_correct += ((out > 0) == y_batch.bool()).sum().item()
            val_total += len(x_batch)

    val_losses.append(val_loss / val_total)
    val_accs.append(val_correct / val_total)
    scheduler.step(val_loss / val_total)

    # Early stopping check
    if val_loss / val_total < best_val_loss:
        best_val_loss = val_loss / val_total
        patience_counter = 0
    else:
        patience_counter += 1
        if patience_counter >= max_patience:
            print(f"Early stopping at epoch {epoch + 1}")
            break

    if (epoch + 1) % 10 == 0:
        print(f"Epoch {epoch+1:3d}  "
              f"Train Loss: {train_losses[-1]:.4f}  Train Acc: {train_accs[-1]:.3f}  "
              f"Val Loss: {val_losses[-1]:.4f}  Val Acc: {val_accs[-1]:.3f}")

# Plot training curves
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(14, 5))
ax1.plot(train_losses, label="Train")
ax1.plot(val_losses, label="Validation")
ax1.set_xlabel("Epoch")
ax1.set_ylabel("Loss")
ax1.set_title("Loss Curves")
ax1.legend()
ax1.grid(True, alpha=0.3)

ax2.plot(train_accs, label="Train")
ax2.plot(val_accs, label="Validation")
ax2.set_xlabel("Epoch")
ax2.set_ylabel("Accuracy")
ax2.set_title("Accuracy Curves")
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.savefig("training_curves.png", dpi=150)
plt.show()

Step 12: Quantum vs Classical Parameter Efficiency

A central question in quantum machine learning is whether quantum circuits offer any advantage in parameter efficiency. Let us compare three heads on the same frozen ResNet features.

# Quantum head: 4 qubits, 2 layers
# Parameters: 2 layers * 4 qubits * 2 rotations = 16 parameters
print(f"Quantum head: {sum(p.numel() for p in qlayer.parameters())} parameters")

# Classical linear head: 4 inputs -> 1 output
linear_head = nn.Linear(4, 1)
print(f"Linear head: {sum(p.numel() for p in linear_head.parameters())} parameters")
# 4 weights + 1 bias = 5 parameters

# Classical MLP: 4 -> 8 -> 1
mlp_head = nn.Sequential(
    nn.Linear(4, 8),
    nn.ReLU(),
    nn.Linear(8, 1),
)
print(f"MLP head: {sum(p.numel() for p in mlp_head.parameters())} parameters")
# (4*8 + 8) + (8*1 + 1) = 49 parameters

Learning Curve: Accuracy vs Training Samples

To test whether any head generalizes better with limited data, we train each on increasing fractions of the dataset.

sample_fractions = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]
results = {"quantum": [], "linear": [], "mlp": []}

for frac in sample_fractions:
    n_samples = int(frac * n_train)

    for name, head_fn in [
        ("quantum", lambda: qml.qnn.TorchLayer(quantum_circuit,
                                                {"weights": (2, n_qubits, 2)})),
        ("linear", lambda: nn.Linear(4, 1)),
        ("mlp", lambda: nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))),
    ]:
        head = head_fn()
        opt = optim.Adam(head.parameters(), lr=0.01)

        # Subset of training data
        subset_indices = list(range(n_samples))
        x_sub = X_tensor[subset_indices]
        y_sub = y_tensor[subset_indices]

        # Train for 30 epochs
        for epoch in range(30):
            opt.zero_grad()
            out = head(x_sub).unsqueeze(1) if name == "quantum" else head(x_sub)
            loss = loss_fn(out, y_sub)
            loss.backward()
            opt.step()

        # Evaluate on validation set
        head.eval()
        with torch.no_grad():
            x_val = X_tensor[n_train:]
            y_val = y_tensor[n_train:]
            out_val = head(x_val).unsqueeze(1) if name == "quantum" else head(x_val)
            acc = ((out_val > 0) == y_val.bool()).float().mean().item()
            results[name].append(acc)

    print(f"Frac {frac:.1f}: Q={results['quantum'][-1]:.3f}  "
          f"Lin={results['linear'][-1]:.3f}  MLP={results['mlp'][-1]:.3f}")

plt.figure(figsize=(8, 5))
for name, marker in [("quantum", "o"), ("linear", "s"), ("mlp", "^")]:
    plt.plot([int(f * n_train) for f in sample_fractions],
             results[name], f"-{marker}", label=name.capitalize())
plt.xlabel("Training Samples")
plt.ylabel("Validation Accuracy")
plt.title("Learning Curve: Quantum vs Classical Heads")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("learning_curve.png", dpi=150)
plt.show()

On classical image data with well-separated ResNet features, you will typically find that all three heads converge to similar accuracy (around 0.90-0.95). The quantum head does not show a clear advantage. This is expected: quantum advantage in machine learning is most likely to emerge with quantum-native data (data generated by quantum processes) rather than classical images.

Step 13: Comparing Quantum Head Sizes

Does increasing the number of qubits improve classification? We test this by reducing the PCA features to match each qubit count and measuring validation accuracy.

qubit_configs = [2, 4, 6, 8]
n_seeds = 3
qubit_results = []

for nq in qubit_configs:
    # Reduce to nq features with PCA
    pca_nq = PCA(n_components=nq)
    X_nq = pca_nq.fit_transform(X)
    X_nq_min = X_nq.min(axis=0, keepdims=True)
    X_nq_max = X_nq.max(axis=0, keepdims=True)
    X_nq_norm = 2 * np.pi * (X_nq - X_nq_min) / (X_nq_max - X_nq_min + 1e-8) - np.pi
    X_nq_t = torch.tensor(X_nq_norm, dtype=torch.float32)

    # Keep depth at 1 to avoid barren plateaus at higher qubit counts
    n_lay = 1
    dev_nq = qml.device("default.qubit", wires=nq)

    @qml.qnode(dev_nq, interface="torch", diff_method="best")
    def circuit_nq(inputs, weights):
        for i in range(nq):
            qml.RY(inputs[i], wires=i)
        for layer in range(weights.shape[0]):
            for i in range(nq):
                qml.RY(weights[layer, i, 0], wires=i)
                qml.RZ(weights[layer, i, 1], wires=i)
            for i in range(nq - 1):
                qml.CNOT(wires=[i, i + 1])
        return qml.expval(qml.PauliZ(0))

    seed_accs = []
    for seed in range(n_seeds):
        torch.manual_seed(seed)
        np.random.seed(seed)
        ql = qml.qnn.TorchLayer(circuit_nq, {"weights": (n_lay, nq, 2)})
        opt = optim.Adam(ql.parameters(), lr=0.01)

        for epoch in range(30):
            opt.zero_grad()
            out = ql(X_nq_t[:n_train]).unsqueeze(1)
            loss = loss_fn(out, y_tensor[:n_train])
            loss.backward()
            opt.step()

        ql.eval()
        with torch.no_grad():
            out_val = ql(X_nq_t[n_train:]).unsqueeze(1)
            acc = ((out_val > 0) == y_tensor[n_train:].bool()).float().mean().item()
            seed_accs.append(acc)

    n_params = n_lay * nq * 2
    mean_acc = np.mean(seed_accs)
    std_acc = np.std(seed_accs)
    qubit_results.append((nq, n_params, mean_acc, std_acc))
    print(f"n_qubits={nq:2d}  params={n_params:3d}  "
          f"val_acc={mean_acc:.3f} +/- {std_acc:.3f}")

# Display as table
print("\n| n_qubits | n_parameters | val_accuracy (mean +/- std) |")
print("|----------|-------------|----------------------------|")
for nq, np_, mean, std in qubit_results:
    print(f"| {nq:8d} | {np_:11d} | {mean:.3f} +/- {std:.3f}              |")

You will generally find that increasing qubit count beyond 4 does not improve accuracy on this task. The additional PCA components carry less variance, and the larger circuits have more parameters that are harder to train. More qubits increase the Hilbert space dimension exponentially, but the classification problem itself is low-dimensional.

Step 14: Gradient Computation Modes

PennyLane supports two gradient computation strategies when using the torch interface. Understanding the difference is important for performance.

Parameter-Shift Rule

The parameter-shift rule computes the gradient of a quantum circuit analytically by evaluating the circuit at two shifted parameter values:

dC/dθ = [C(θ + π/2) - C(θ - π/2)] / 2

For a circuit with P parameters, each gradient step requires 2P circuit evaluations. With our 16-parameter circuit, that is 32 evaluations per backward pass.

Backpropagation (Simulator Only)

When running on a classical simulator, PennyLane can differentiate through the simulation directly using standard backpropagation. This requires only one forward pass to compute all gradients simultaneously.

import time

# Parameter-shift gradient timing
dev_ps = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev_ps, interface="torch", diff_method="parameter-shift")
def circuit_ps(inputs, weights):
    for i in range(n_qubits):
        qml.RY(inputs[i], wires=i)
    for layer in range(weights.shape[0]):
        variational_layer(weights[layer], range(n_qubits))
    return qml.expval(qml.PauliZ(0))

# Backpropagation gradient timing
dev_bp = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev_bp, interface="torch", diff_method="backprop")
def circuit_bp(inputs, weights):
    for i in range(n_qubits):
        qml.RY(inputs[i], wires=i)
    for layer in range(weights.shape[0]):
        variational_layer(weights[layer], range(n_qubits))
    return qml.expval(qml.PauliZ(0))

# Time both methods
x_test = torch.tensor([0.5, -0.3, 1.2, -0.8], dtype=torch.float64,
                       requires_grad=False)
w_ps = torch.randn(2, n_qubits, 2, dtype=torch.float64, requires_grad=True)
w_bp = w_ps.clone().detach().requires_grad_(True)

# Warm up both circuits
_ = circuit_ps(x_test, w_ps)
_ = circuit_bp(x_test, w_bp)

# Time parameter-shift
start = time.time()
for _ in range(10):
    out = circuit_ps(x_test, w_ps)
    out.backward()
    w_ps.grad = None
ps_time = (time.time() - start) / 10

# Time backpropagation
start = time.time()
for _ in range(10):
    out = circuit_bp(x_test, w_bp)
    out.backward()
    w_bp.grad = None
bp_time = (time.time() - start) / 10

print(f"Parameter-shift: {ps_time:.4f}s per gradient step")
print(f"Backpropagation: {bp_time:.4f}s per gradient step")
print(f"Speedup: {ps_time / bp_time:.1f}x")

On a simulator, backpropagation is typically 10-30x faster than parameter-shift for circuits with 16+ parameters. However, backpropagation requires access to the full statevector during differentiation, which is only available on simulators. When deploying to real quantum hardware (IBM, IonQ, Rigetti), you must use diff_method="parameter-shift" or diff_method="best" (which automatically selects the appropriate method).

Step 15: Transfer to a Different Domain

The power of transfer learning is that the same feature extractor works across domains. Here we show how to swap the dataset while keeping the quantum head architecture identical.

# Simulate a "medical imaging" dataset using CIFAR-10 classes 3 (cat) and 5 (dog)
# In a real application, you would load actual medical images here.
idx_new = [i for i, (_, label) in enumerate(cifar) if label in [3, 5]]
subset_new = Subset(cifar, idx_new[:2000])
loader_new = DataLoader(subset_new, batch_size=64, shuffle=False)

# Extract features using the same frozen ResNet
new_features, new_labels = [], []
with torch.no_grad():
    for imgs, labels in loader_new:
        feats = feature_extractor(imgs).squeeze()
        new_features.append(feats.numpy())
        # Remap labels: 3 -> 0, 5 -> 1
        remapped = (labels == 5).long()
        new_labels.append(remapped.numpy())

X_new = np.concatenate(new_features)
y_new = np.concatenate(new_labels)

# Fit a new PCA on this domain's features
pca_new = PCA(n_components=n_qubits)
X_new_pca = pca_new.fit_transform(X_new)
X_new_min = X_new_pca.min(axis=0, keepdims=True)
X_new_max = X_new_pca.max(axis=0, keepdims=True)
X_new_norm = 2 * np.pi * (X_new_pca - X_new_min) / (X_new_max - X_new_min + 1e-8) - np.pi

X_new_t = torch.tensor(X_new_norm, dtype=torch.float32)
y_new_t = torch.tensor(y_new, dtype=torch.float32).unsqueeze(1)

# Train a fresh quantum head (same architecture, new random weights)
qlayer_new = qml.qnn.TorchLayer(quantum_circuit, {"weights": (2, n_qubits, 2)})
opt_new = optim.Adam(qlayer_new.parameters(), lr=0.01)

n_train_new = int(0.8 * len(X_new_t))
for epoch in range(30):
    opt_new.zero_grad()
    out = qlayer_new(X_new_t[:n_train_new]).unsqueeze(1)
    loss = loss_fn(out, y_new_t[:n_train_new])
    loss.backward()
    opt_new.step()

qlayer_new.eval()
with torch.no_grad():
    out_val = qlayer_new(X_new_t[n_train_new:]).unsqueeze(1)
    acc = ((out_val > 0) == y_new_t[n_train_new:].bool()).float().mean().item()
    print(f"Cat vs Dog validation accuracy: {acc:.3f}")

The key point: the ResNet backbone was never retrained. Only the PCA and quantum head are fitted to the new domain. This modularity is the core value of transfer learning. You can swap the dataset, refit PCA, and retrain only the lightweight quantum head.

Step 16: Hybrid Architecture Variations

The “CNN backbone + quantum head” pattern is the simplest hybrid architecture. Here are two alternatives.

Quantum Bottleneck

Insert the quantum circuit in the middle of the network as a compression layer. Classical features go in, a lower-dimensional quantum representation comes out, and a final classical layer produces the prediction.

class QuantumBottleneck(nn.Module):
    """Classical -> Quantum -> Classical architecture."""
    def __init__(self, input_dim, n_qubits, n_classes):
        super().__init__()
        # Classical pre-processing: reduce to n_qubits features
        self.pre = nn.Sequential(
            nn.Linear(input_dim, 16),
            nn.ReLU(),
            nn.Linear(16, n_qubits),
            nn.Tanh(),  # Output in [-1, 1], scale to [-pi, pi] below
        )
        # Quantum bottleneck layer
        dev_bn = qml.device("default.qubit", wires=n_qubits)

        @qml.qnode(dev_bn, interface="torch", diff_method="best")
        def bottleneck_circuit(inputs, weights):
            for i in range(n_qubits):
                qml.RY(inputs[i] * np.pi, wires=i)  # Scale [-1,1] to [-pi,pi]
            variational_layer(weights[0], range(n_qubits))
            return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

        self.quantum = qml.qnn.TorchLayer(
            bottleneck_circuit, {"weights": (1, n_qubits, 2)}
        )
        # Classical post-processing
        self.post = nn.Linear(n_qubits, n_classes)

    def forward(self, x):
        x = self.pre(x)
        x = self.quantum(x)
        return self.post(x)

bottleneck_model = QuantumBottleneck(input_dim=4, n_qubits=4, n_classes=1)
print(f"Bottleneck parameters: {sum(p.numel() for p in bottleneck_model.parameters())}")

The quantum bottleneck forces information through a quantum channel, which may capture correlations differently than a classical bottleneck (e.g., an autoencoder). Whether this is beneficial depends on the data structure.

Parallel Quantum-Classical

Run quantum and classical branches in parallel on the same features, then combine their outputs.

class ParallelQC(nn.Module):
    """Parallel quantum and classical branches with fusion."""
    def __init__(self, input_dim, n_qubits):
        super().__init__()
        # Classical branch
        self.classical = nn.Sequential(
            nn.Linear(input_dim, 8),
            nn.ReLU(),
        )
        # Quantum branch
        dev_par = qml.device("default.qubit", wires=n_qubits)

        @qml.qnode(dev_par, interface="torch", diff_method="best")
        def parallel_circuit(inputs, weights):
            for i in range(n_qubits):
                qml.RY(inputs[i], wires=i)
            variational_layer(weights[0], range(n_qubits))
            return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]

        self.quantum = qml.qnn.TorchLayer(
            parallel_circuit, {"weights": (1, n_qubits, 2)}
        )
        # Fusion layer: combine 8 classical + n_qubits quantum outputs
        self.fusion = nn.Linear(8 + n_qubits, 1)

    def forward(self, x):
        c_out = self.classical(x)
        q_out = self.quantum(x)
        combined = torch.cat([c_out, q_out], dim=-1)
        return self.fusion(combined)

parallel_model = ParallelQC(input_dim=4, n_qubits=4)
print(f"Parallel model parameters: {sum(p.numel() for p in parallel_model.parameters())}")

The parallel architecture lets the classical and quantum branches specialize on different aspects of the data. The fusion layer learns how to weight each branch’s contribution. This can be a safer starting point if you are unsure whether the quantum branch adds value: the classical branch provides a fallback.

Common Mistakes and How to Avoid Them

1. Not normalizing PCA features before angle encoding

The RY(x) gate rotates the qubit state by angle x. If your PCA features range from, say, -50 to 120, the rotations wrap around multiple times and the encoding becomes meaningless. Always normalize each PCA component to the range [-pi, pi] using the training set statistics:

# Correct: per-feature normalization
X_min = X_pca.min(axis=0, keepdims=True)
X_max = X_pca.max(axis=0, keepdims=True)
X_normalized = 2 * np.pi * (X_pca - X_min) / (X_max - X_min + 1e-8) - np.pi

# Wrong: no normalization (features may be outside [-pi, pi])
# X_normalized = X_pca  # Don't do this!

2. Using backprop gradient mode on real hardware

The diff_method="backprop" mode requires access to the full statevector simulation, which is only available on classical simulators like default.qubit. If you deploy to a real quantum device (e.g., qml.device("qiskit.remote", ...); the older qiskit.ibmq device name was removed along with IBM’s retired IBM Quantum channel), you must use diff_method="parameter-shift" or diff_method="best". Attempting backprop on real hardware will raise an error.

# For simulators (fast):
@qml.qnode(dev, interface="torch", diff_method="backprop")

# For real hardware (required):
@qml.qnode(dev, interface="torch", diff_method="parameter-shift")

# Let PennyLane choose (safe default):
@qml.qnode(dev, interface="torch", diff_method="best")

3. Freezing ResNet but forgetting model.eval()

Batch normalization layers behave differently during training and evaluation. In training mode, they compute running statistics from the current batch. In eval mode, they use the stored running mean and variance from pre-training. If you freeze the ResNet backbone but leave it in training mode, the batch norm layers will compute incorrect statistics from your (different, smaller) dataset.

# Correct: freeze AND set to eval mode
for param in model.parameters():
    param.requires_grad = False
model.eval()  # Critical for batch norm layers

# If you only train the quantum head, the backbone should stay in eval mode
# throughout training. Only call model.train() on the quantum head.

4. Using too many variational layers

As demonstrated in the barren plateau section, each additional variational layer roughly halves the gradient variance (for random circuits). For 4 qubits:

Layers	Approx. gradient variance
1	~10^-1
2	~10^-2
3	~10^-3
4	~10^-4

At 10^-4 variance, the optimizer cannot distinguish meaningful gradients from noise. Stick to 1-2 layers for 4-8 qubits. If you need more expressivity, use data re-uploading (repeated encoding) rather than deeper variational blocks.

5. Unfair classical-quantum comparisons

The fair comparison for a quantum classification head is against a classical head with the same frozen backbone and the same input features. Comparing the quantum head’s 93% accuracy to a fully retrained classical ResNet-18 at 97% is misleading because the classical model also retrained the backbone (11 million parameters vs. 16 quantum parameters). The right comparison is:

# Fair comparison: same frozen features, same input dimensionality
quantum_head = qlayer           # 16 parameters
linear_head = nn.Linear(4, 1)  # 5 parameters
mlp_head = nn.Sequential(      # 49 parameters
    nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1)
)

# Unfair comparison: fully retrained classical model
# model_full = models.resnet18(weights=...)
# for param in model_full.parameters():
#     param.requires_grad = True  # Retraining 11M parameters!

When reporting results, always specify what was frozen and what was trained. The quantum vs. classical comparison is only meaningful when both have access to the same features and similar parameter budgets.