
Quantum Machine Learning with PennyLane: Your First Classifier

Build a quantum classifier using PennyLane's parameterized circuits and train it to classify a simple dataset using gradient descent.

What you'll learn

  • quantum machine learning
  • PennyLane
  • parameterized circuits
  • classification

Prerequisites

  • Python proficiency
  • Beginner quantum computing concepts (superposition, entanglement)
  • Linear algebra basics

Quantum machine learning (QML) applies parameterized quantum circuits the same way classical ML applies neural networks: encode data, run a model, measure the output, compare to a label, and update parameters via gradient descent.

Whether quantum circuits can outperform classical models on real-world data is still an open research question. But the mechanics are clean and PennyLane makes them easy to experiment with. This tutorial builds a working quantum binary classifier from scratch, then extends it to multiclass problems, quantum kernels, hybrid models, and hardware-realistic simulations.

Installation

pip install pennylane scikit-learn matplotlib torch

We install PyTorch alongside PennyLane because later sections use PennyLane’s torch interface for hybrid quantum-classical models. If you only plan to run the early sections, torch is optional.

The QML Framework

Every quantum machine learning model follows three steps:

  1. Data encoding. Map a classical input vector x to a quantum state |phi(x)>. This is the quantum analogue of a feature map in classical kernel methods.
  2. Processing. Apply a parameterized unitary U(theta) to produce |psi(x, theta)> = U(theta)|phi(x)>. The parameters theta are the trainable weights.
  3. Readout. Measure an observable O to extract a scalar prediction: output = <psi(x, theta)|O|psi(x, theta)>.

Training adjusts theta to minimize a loss function that compares the readout to the true label. PennyLane computes gradients of quantum circuits using the parameter-shift rule, which evaluates the circuit at shifted parameter values rather than relying on backpropagation through the simulation.

The kernel interpretation

There is a second way to understand this pipeline. The encoding step defines an implicit kernel between any two data points:

K(x, x’) = |<phi(x)|phi(x’)>|^2

This kernel measures how similar the two quantum states are. It defines a reproducing kernel Hilbert space (RKHS) in which the model operates. Different encoding strategies produce different kernels, which in turn define different function classes. This perspective connects QML directly to classical kernel methods and provides theoretical tools for analyzing what a quantum model can and cannot learn.

We return to this perspective in the “Quantum Kernel Methods” section below, where we compute the kernel matrix explicitly and train a classical SVM on top of it.

The Dataset

We use a synthetic 2D dataset where class membership depends on whether a point lies inside or outside a circle. The classes are not linearly separable, which makes this a fair test for any nonlinear classifier.

import numpy as np
import pennylane as qml
from pennylane import numpy as pnp
from sklearn.datasets import make_circles
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split

# Generate 200 points in two concentric circles
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=42)

# Scale features to [0, pi] for angle encoding
scaler = MinMaxScaler(feature_range=(0, np.pi))
X = scaler.fit_transform(X)

# Labels: convert 0 -> -1 so they match the range of PauliZ expectation values
y = np.where(y == 0, -1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

Scaling to [0, pi] is important. RY(theta) produces a rotation proportional to theta, so the range [0, pi] maps features across the full range of the Bloch sphere’s z-axis. Feeding unscaled features (which might be in [-1.5, 1.5]) wastes part of the rotation range and compresses the representation.

Data Encoding Strategies

Angle encoding loads each feature into a rotation gate angle. A 2D input (x1, x2) maps to rotations on two qubits. This is a simple and common strategy for small feature vectors, but it is not the only option.

Angle encoding

Each feature becomes one rotation angle on one qubit. An n-dimensional input requires n qubits.

dev_demo = qml.device("default.qubit", wires=4)

@qml.qnode(dev_demo)
def angle_encoding_demo(x):
    qml.AngleEmbedding(x, wires=range(4), rotation="Y")
    return [qml.expval(qml.PauliZ(i)) for i in range(4)]

# 4 features -> 4 qubits, each gets one RY rotation
x = np.array([0.5, 1.2, 2.1, 0.8])
print(angle_encoding_demo(x))

The rotation parameter selects which Pauli rotation to use. "Y" (RY) is the most common choice because it keeps the state amplitudes real, mapping real-valued features onto the x-z plane of the Bloch sphere.

Amplitude encoding

The feature vector becomes the amplitudes of the quantum state. An n-qubit system has 2^n amplitudes, so you can encode 2^n features into n qubits.

dev_amp = qml.device("default.qubit", wires=2)

@qml.qnode(dev_amp)
def amplitude_encoding_demo(x):
    qml.AmplitudeEmbedding(x, wires=range(2), normalize=True)
    return qml.state()

# 4 features -> 2 qubits (2^2 = 4 amplitudes)
x = np.array([0.5, 1.2, 2.1, 0.8])
print(amplitude_encoding_demo(x))

The normalize=True flag rescales the input so that the amplitudes satisfy the normalization constraint (sum of squared amplitudes equals 1). The trade-off: amplitude encoding uses far fewer qubits, but the state preparation circuit requires O(2^n) gates in the worst case, which negates the qubit savings for large inputs.

IQP-style encoding

Repeating the encoding layer multiple times, interleaved with entangling gates, increases the expressivity of the feature map. (Strictly speaking, IQP circuits interleave Hadamard layers with diagonal phase gates; the simplified variant below repeats an angle-embedding layer with CNOTs, which enriches the feature map in a similar spirit.) This is sometimes called "data re-uploading at the encoding level."

dev_iqp = qml.device("default.qubit", wires=2)

@qml.qnode(dev_iqp)
def iqp_encoding_demo(x):
    for _ in range(3):  # repeat 3 times
        qml.AngleEmbedding(x, wires=range(2), rotation="Y")
        qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

x = np.array([0.5, 1.2])
print(iqp_encoding_demo(x))

Repeating the encoding creates higher-order Fourier components in the model’s output as a function of the input features. A single encoding layer produces a model that can represent functions with at most first-order Fourier terms; each repetition adds higher harmonics.

Choosing an encoding

Encoding               Qubits for n features   Circuit depth    Expressivity
Angle                  n                       O(1)             Low (linear)
Amplitude              log2(n)                 O(n)             Moderate
IQP (repeated angle)   n                       O(repetitions)   High
Data re-uploading      as few as 1             O(repetitions)   High

For this tutorial we start with angle encoding for simplicity, then explore data re-uploading and IQP-style approaches in later sections.

Building the Classifier

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev)
def quantum_classifier(x, weights):
    # Angle encoding: load the 2 features as RY rotation angles
    qml.RY(x[0], wires=0)
    qml.RY(x[1], wires=1)

    # Variational layers: parameterized rotations + entanglement
    for layer_weights in weights:
        qml.RY(layer_weights[0], wires=0)
        qml.RY(layer_weights[1], wires=1)
        qml.RZ(layer_weights[2], wires=0)
        qml.RZ(layer_weights[3], wires=1)
        qml.CNOT(wires=[0, 1])

    # Measure the first qubit's expectation value as the prediction
    return qml.expval(qml.PauliZ(0))

The circuit encodes data once at the start, then applies trainable rotation gates organized into layers. The CNOT gate between layers creates entanglement that lets the two qubits exchange information, enabling the circuit to learn non-linear boundaries.

The Cost Function

The cost is mean squared error between predictions and labels. Because labels are +1 or -1 and qml.expval(PauliZ) returns a value in [-1, 1], this works directly.

def cost(weights, X_batch, y_batch):
    predictions = pnp.array([quantum_classifier(x, weights) for x in X_batch])
    return pnp.mean((predictions - y_batch) ** 2)

def accuracy(weights, X_data, y_data):
    preds = pnp.array([quantum_classifier(x, weights) for x in X_data])
    predicted_labels = pnp.sign(preds)
    return pnp.mean(predicted_labels == y_data)

Training

# Initialize weights for 3 variational layers, 4 parameters each
n_layers = 3
np.random.seed(0)
weights = pnp.array(
    np.random.uniform(-np.pi, np.pi, size=(n_layers, 4)),
    requires_grad=True
)

opt = qml.AdamOptimizer(stepsize=0.1)

batch_size = 20
train_costs = []
train_accs = []

for epoch in range(60):
    # Mini-batch gradient descent
    idx = np.random.choice(len(X_train), size=batch_size, replace=False)
    X_batch = X_train[idx]
    y_batch = y_train[idx]

    weights, c = opt.step_and_cost(
        lambda w: cost(w, X_batch, y_batch),
        weights
    )

    if epoch % 10 == 0:
        acc = accuracy(weights, X_train, y_train)
        train_costs.append(float(c))
        train_accs.append(float(acc))
        print(f"Epoch {epoch:3d}: cost={c:.4f}, train accuracy={acc:.2%}")

# Epoch   0: cost=0.9412, train accuracy=49.29%
# Epoch  10: cost=0.6221, train accuracy=68.57%
# Epoch  20: cost=0.4108, train accuracy=79.29%
# Epoch  30: cost=0.3047, train accuracy=85.00%
# Epoch  40: cost=0.2531, train accuracy=87.86%
# Epoch  50: cost=0.2203, train accuracy=89.29%

test_acc = accuracy(weights, X_test, y_test)
print(f"\nTest accuracy: {test_acc:.2%}")

Expressivity vs. Trainability: The Layer Depth Trade-off

Deeper circuits can express more complex functions, but they are harder to train. This tension is called the barren plateau problem: as circuit depth grows, gradients of random circuits shrink exponentially, making optimization from random initialization nearly impossible.

The following experiment trains the same architecture with 1, 3, 5, and 7 layers and records the final test accuracy and training loss curve.

def train_with_layers(n_layers, n_epochs=60, batch_size=20, seed=0):
    """Train the quantum classifier with a given number of layers."""
    dev_layers = qml.device("default.qubit", wires=2)

    @qml.qnode(dev_layers)
    def circuit(x, weights):
        qml.RY(x[0], wires=0)
        qml.RY(x[1], wires=1)
        for layer_weights in weights:
            qml.RY(layer_weights[0], wires=0)
            qml.RY(layer_weights[1], wires=1)
            qml.RZ(layer_weights[2], wires=0)
            qml.RZ(layer_weights[3], wires=1)
            qml.CNOT(wires=[0, 1])
        return qml.expval(qml.PauliZ(0))

    def cost_fn(w, X_b, y_b):
        preds = pnp.array([circuit(x, w) for x in X_b])
        return pnp.mean((preds - y_b) ** 2)

    def acc_fn(w, X_d, y_d):
        preds = pnp.array([circuit(x, w) for x in X_d])
        return float(pnp.mean(pnp.sign(preds) == y_d))

    np.random.seed(seed)
    w = pnp.array(
        np.random.uniform(-np.pi, np.pi, size=(n_layers, 4)),
        requires_grad=True
    )
    optimizer = qml.AdamOptimizer(stepsize=0.1)
    loss_history = []

    for epoch in range(n_epochs):
        idx = np.random.choice(len(X_train), size=batch_size, replace=False)
        w, c = optimizer.step_and_cost(
            lambda ww: cost_fn(ww, X_train[idx], y_train[idx]),
            w
        )
        loss_history.append(float(c))

    final_acc = acc_fn(w, X_test, y_test)
    return loss_history, final_acc

# Run experiments
results = {}
for n_l in [1, 3, 5, 7]:
    losses, acc = train_with_layers(n_l)
    results[n_l] = (losses, acc)
    print(f"Layers={n_l}: test accuracy={acc:.2%}")

# Plot training curves
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 5))
for n_l, (losses, acc) in results.items():
    plt.plot(losses, label=f"{n_l} layers (acc={acc:.0%})")
plt.xlabel("Epoch")
plt.ylabel("Training Loss")
plt.title("Effect of Circuit Depth on Training")
plt.legend()
plt.tight_layout()
plt.savefig("layer_depth_comparison.png", dpi=150)
plt.show()

With 2 qubits and this small dataset, you will typically observe that 3 layers reach good accuracy, 1 layer underfits, and 7 layers converge more slowly due to a more complex loss landscape. On larger systems (10+ qubits), the barren plateau effect becomes severe: 7-layer random circuits may fail to train entirely.

Data Re-uploading

A single encoding of the input data limits the Fourier spectrum of the model’s output function. Data re-uploading interleaves encoding layers with variational layers, feeding the data into the circuit multiple times at different depths.

The key result: a single qubit with enough re-uploading layers can approximate any continuous function mapping the input to [-1, 1]. This is the quantum analogue of the universal approximation theorem for neural networks.

dev_reup = qml.device("default.qubit", wires=2)

@qml.qnode(dev_reup)
def reupload_classifier(x, weights):
    n_uploads = weights.shape[0]
    for i in range(n_uploads):
        # Re-encode the data at each layer
        qml.RY(x[0], wires=0)
        qml.RY(x[1], wires=1)
        # Variational rotation
        qml.RY(weights[i, 0], wires=0)
        qml.RY(weights[i, 1], wires=1)
        qml.RZ(weights[i, 2], wires=0)
        qml.RZ(weights[i, 3], wires=1)
        qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

# Train the re-uploading classifier
n_uploads = 4
np.random.seed(42)
reup_weights = pnp.array(
    np.random.uniform(-np.pi, np.pi, size=(n_uploads, 4)),
    requires_grad=True
)

opt_reup = qml.AdamOptimizer(stepsize=0.1)

def reup_cost(w, X_b, y_b):
    preds = pnp.array([reupload_classifier(x, w) for x in X_b])
    return pnp.mean((preds - y_b) ** 2)

for epoch in range(80):
    idx = np.random.choice(len(X_train), size=20, replace=False)
    reup_weights, c = opt_reup.step_and_cost(
        lambda w: reup_cost(w, X_train[idx], y_train[idx]),
        reup_weights
    )
    if epoch % 20 == 0:
        preds = pnp.array([reupload_classifier(x, reup_weights) for x in X_test])
        acc = float(pnp.mean(pnp.sign(preds) == y_test))
        print(f"Epoch {epoch:3d}: cost={c:.4f}, test accuracy={acc:.2%}")

Data re-uploading typically achieves equal or better accuracy than single-encoding circuits with the same number of parameters, because each re-upload adds new Fourier frequencies to the model’s representational capacity.

Quantum Kernel Methods

Instead of optimizing a variational circuit, you can use the quantum circuit purely as a feature map and train a classical SVM on the resulting kernel matrix. This avoids the optimization challenges of variational circuits entirely.

The quantum kernel between two points x and z is:

K(x, z) = |<0|U(x)^dagger U(z)|0>|^2

This is the probability of measuring the all-zeros state after preparing |phi(x)> and then applying the inverse of the encoding circuit for z. If K(x, z) = 1, the two states are identical; if K(x, z) = 0, they are orthogonal.

from sklearn.svm import SVC

dev_kernel = qml.device("default.qubit", wires=2)

@qml.qnode(dev_kernel)
def kernel_circuit(x1, x2):
    """Compute the transition amplitude |<phi(x1)|phi(x2)>|^2."""
    # Encode x1
    qml.AngleEmbedding(x1, wires=range(2), rotation="Y")
    # Apply adjoint of encoding x2
    qml.adjoint(qml.AngleEmbedding)(x2, wires=range(2), rotation="Y")
    # Probability of measuring |00>
    return qml.probs(wires=range(2))

def quantum_kernel(x1, x2):
    """Return the kernel value K(x1, x2)."""
    probs = kernel_circuit(x1, x2)
    return float(probs[0])  # probability of |00> state

# Build kernel matrices
def kernel_matrix(X_a, X_b):
    """Compute the kernel matrix between two sets of data points."""
    n_a = len(X_a)
    n_b = len(X_b)
    K = np.zeros((n_a, n_b))
    for i in range(n_a):
        for j in range(n_b):
            K[i, j] = quantum_kernel(X_a[i], X_b[j])
    return K

# Compute kernel matrices
K_train = kernel_matrix(X_train, X_train)
K_test = kernel_matrix(X_test, X_train)

# Train a classical SVM with the quantum kernel
svm = SVC(kernel="precomputed")
svm.fit(K_train, y_train)
svm_acc = svm.score(K_test, y_test)
print(f"Quantum kernel SVM test accuracy: {svm_acc:.2%}")

The kernel approach has two advantages. First, there are no variational parameters to optimize in the quantum circuit, so barren plateaus are not an issue. Second, the SVM training is a convex optimization problem with a unique global minimum. The disadvantage is that computing the full kernel matrix requires O(n^2) circuit evaluations, which becomes expensive for large datasets.

PennyLane also provides qml.kernels.kernel_matrix as a convenience function that automates the pairwise computation.

Overfitting in QML

Quantum models can overfit, just like classical models. A circuit with many parameters trained for many epochs on a small dataset will memorize the training data and generalize poorly.

# Train for many epochs and track both train and test accuracy
np.random.seed(0)
overfit_weights = pnp.array(
    np.random.uniform(-np.pi, np.pi, size=(5, 4)),  # 5 layers, 20 params
    requires_grad=True
)

opt_of = qml.AdamOptimizer(stepsize=0.1)

dev_of = qml.device("default.qubit", wires=2)

@qml.qnode(dev_of)
def overfit_circuit(x, weights):
    qml.RY(x[0], wires=0)
    qml.RY(x[1], wires=1)
    for lw in weights:
        qml.RY(lw[0], wires=0)
        qml.RY(lw[1], wires=1)
        qml.RZ(lw[2], wires=0)
        qml.RZ(lw[3], wires=1)
        qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

def overfit_cost(w, X_b, y_b):
    preds = pnp.array([overfit_circuit(x, w) for x in X_b])
    return pnp.mean((preds - y_b) ** 2)

def overfit_acc(w, X_d, y_d):
    preds = pnp.array([overfit_circuit(x, w) for x in X_d])
    return float(pnp.mean(pnp.sign(preds) == y_d))

train_acc_history = []
test_acc_history = []

for epoch in range(150):
    idx = np.random.choice(len(X_train), size=20, replace=False)
    overfit_weights, c = opt_of.step_and_cost(
        lambda w: overfit_cost(w, X_train[idx], y_train[idx]),
        overfit_weights
    )
    if epoch % 10 == 0:
        tr_acc = overfit_acc(overfit_weights, X_train, y_train)
        te_acc = overfit_acc(overfit_weights, X_test, y_test)
        train_acc_history.append(tr_acc)
        test_acc_history.append(te_acc)
        print(f"Epoch {epoch:3d}: train={tr_acc:.2%}, test={te_acc:.2%}")

plt.figure(figsize=(7, 5))
plt.plot(range(0, 150, 10), train_acc_history, "o-", label="Train")
plt.plot(range(0, 150, 10), test_acc_history, "s-", label="Test")
plt.xlabel("Epoch")
plt.ylabel("Accuracy")
plt.title("Overfitting in a Quantum Classifier")
plt.legend()
plt.tight_layout()
plt.savefig("qml_overfitting.png", dpi=150)
plt.show()

L2 regularization

The simplest countermeasure is L2 regularization, which adds a penalty on the magnitude of the weights:

weight_decay = 0.01

def regularized_cost(w, X_b, y_b):
    preds = pnp.array([overfit_circuit(x, w) for x in X_b])
    mse = pnp.mean((preds - y_b) ** 2)
    l2_penalty = weight_decay * pnp.sum(w ** 2)
    return mse + l2_penalty

Other strategies include early stopping (halt training when test accuracy stops improving) and reducing circuit depth. In QML, regularization through circuit architecture (fewer layers, fewer parameters) is often more effective than explicit weight penalties, because the circuit structure directly controls the model’s expressivity.
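Early stopping needs only a patience counter around the training loop. The sketch below uses toy stand-ins for the train step and validation metric (a fake accuracy curve that peaks and then decays, mimicking overfitting); swap in the real cost and accuracy helpers from the sections above:

```python
import numpy as np

# Toy stand-ins: "training" increments a scalar weight, and validation
# accuracy peaks at w = 50 before degrading (simulated overfitting).
def train_step(w):
    return w + 1

def validation_accuracy(w):
    return 0.9 - 0.0001 * (w - 50) ** 2

patience = 10
weights = 0
best_acc, best_weights, wait = -np.inf, None, 0

for epoch in range(200):
    weights = train_step(weights)
    acc = validation_accuracy(weights)
    if acc > best_acc:
        best_acc, best_weights, wait = acc, weights, 0
    else:
        wait += 1
        if wait >= patience:  # stop after `patience` epochs without improvement
            break

print(best_weights)  # 50: the weight value at the validation peak
```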

Using PyTorch with PennyLane

PennyLane integrates with PyTorch’s automatic differentiation engine. This unlocks PyTorch’s optimizers, loss functions, learning rate schedulers, and GPU support. For larger QML experiments, this is the preferred approach.

import torch
import torch.nn as nn

dev_torch = qml.device("default.qubit", wires=2)

@qml.qnode(dev_torch, interface="torch")
def torch_circuit(x, weights):
    qml.RY(x[0], wires=0)
    qml.RY(x[1], wires=1)
    for lw in weights:
        qml.RY(lw[0], wires=0)
        qml.RY(lw[1], wires=1)
        qml.RZ(lw[2], wires=0)
        qml.RZ(lw[3], wires=1)
        qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

# Convert data to PyTorch tensors
X_train_t = torch.tensor(X_train, dtype=torch.float64, requires_grad=False)
y_train_t = torch.tensor(y_train, dtype=torch.float64, requires_grad=False)
X_test_t = torch.tensor(X_test, dtype=torch.float64, requires_grad=False)
y_test_t = torch.tensor(y_test, dtype=torch.float64, requires_grad=False)

# Trainable weights as a PyTorch parameter
torch.manual_seed(0)
torch_weights = torch.tensor(
    np.random.uniform(-np.pi, np.pi, size=(3, 4)),
    dtype=torch.float64,
    requires_grad=True
)

optimizer = torch.optim.Adam([torch_weights], lr=0.1)
loss_fn = nn.MSELoss()

for epoch in range(60):
    optimizer.zero_grad()

    # Mini-batch
    idx = torch.randperm(len(X_train_t))[:20]
    X_batch = X_train_t[idx]
    y_batch = y_train_t[idx]

    # Forward pass: compute predictions one by one
    predictions = torch.stack([torch_circuit(x, torch_weights) for x in X_batch])
    loss = loss_fn(predictions, y_batch)

    # Backward pass
    loss.backward()
    optimizer.step()

    if epoch % 10 == 0:
        with torch.no_grad():
            all_preds = torch.stack(
                [torch_circuit(x, torch_weights) for x in X_test_t]
            )
            acc = (torch.sign(all_preds) == y_test_t).float().mean()
            print(f"Epoch {epoch:3d}: loss={loss.item():.4f}, test accuracy={acc:.2%}")

The PyTorch interface also enables features like learning rate scheduling (torch.optim.lr_scheduler), gradient clipping, and weight decay built into the optimizer (torch.optim.AdamW).
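A compact sketch of those three features together, using a placeholder quadratic loss in place of the quantum circuit (the pattern transfers unchanged to torch_weights and the circuit-based MSE loss):

```python
import torch

# Placeholder parameter standing in for torch_weights
w = torch.randn(3, 4, dtype=torch.float64, requires_grad=True)

opt = torch.optim.AdamW([w], lr=0.1, weight_decay=0.01)            # built-in weight decay
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=20, gamma=0.5)

for epoch in range(60):
    opt.zero_grad()
    loss = (w ** 2).sum()                           # placeholder for the circuit loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_([w], max_norm=1.0)  # gradient clipping
    opt.step()
    sched.step()                                    # halve the learning rate every 20 epochs

print(sched.get_last_lr())  # [0.0125]: 0.1 halved at epochs 20, 40, and 60
```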

Multiclass Classification

Binary classification uses a single expectation value and a sign function. Multiclass classification requires multiple outputs. We use the Iris dataset (4 features, 3 classes) and measure one expectation value per class.

from sklearn.datasets import load_iris
from sklearn.preprocessing import OneHotEncoder

# Load and prepare the Iris dataset
iris = load_iris()
X_iris = iris.data
y_iris = iris.target

# Scale features to [0, pi]
scaler_iris = MinMaxScaler(feature_range=(0, np.pi))
X_iris = scaler_iris.fit_transform(X_iris)

# One-hot encode labels for cross-entropy
onehot = OneHotEncoder(sparse_output=False)
y_iris_oh = onehot.fit_transform(y_iris.reshape(-1, 1))

X_ir_train, X_ir_test, y_ir_train, y_ir_test, y_raw_train, y_raw_test = (
    train_test_split(X_iris, y_iris_oh, y_iris, test_size=0.3, random_state=42)
)

# Convert to torch tensors
X_ir_train_t = torch.tensor(X_ir_train, dtype=torch.float64, requires_grad=False)
y_ir_train_t = torch.tensor(y_ir_train, dtype=torch.float64, requires_grad=False)
X_ir_test_t = torch.tensor(X_ir_test, dtype=torch.float64, requires_grad=False)
y_raw_test_t = torch.tensor(y_raw_test, dtype=torch.long, requires_grad=False)

dev_multi = qml.device("default.qubit", wires=4)

@qml.qnode(dev_multi, interface="torch")
def multiclass_circuit(x, weights):
    # Angle encoding: 4 features -> 4 qubits
    qml.AngleEmbedding(x, wires=range(4), rotation="Y")
    # Variational layers
    for lw in weights:
        for i in range(4):
            qml.RY(lw[i], wires=i)
            qml.RZ(lw[i + 4], wires=i)
        qml.CNOT(wires=[0, 1])
        qml.CNOT(wires=[1, 2])
        qml.CNOT(wires=[2, 3])
    # Measure 3 qubits for 3 classes
    return [qml.expval(qml.PauliZ(i)) for i in range(3)]

# Weights: 3 layers, 8 parameters per layer (4 RY + 4 RZ)
torch.manual_seed(0)
multi_weights = torch.tensor(
    np.random.uniform(-np.pi, np.pi, size=(3, 8)),
    dtype=torch.float64,
    requires_grad=True
)

optimizer_multi = torch.optim.Adam([multi_weights], lr=0.05)

for epoch in range(100):
    optimizer_multi.zero_grad()

    idx = torch.randperm(len(X_ir_train_t))[:25]
    X_b = X_ir_train_t[idx]
    y_b = y_ir_train_t[idx]

    # Forward pass
    outputs = torch.stack([
        torch.stack(multiclass_circuit(x, multi_weights)) for x in X_b
    ])

    # Softmax + cross-entropy. Softmax accepts arbitrary real logits,
    # so the affine shift below is optional; it simply rescales the
    # expectation values from [-1, 1] to [0, 1].
    logits = (outputs + 1) / 2
    log_probs = torch.log_softmax(logits, dim=1)
    loss = -torch.mean(torch.sum(y_b * log_probs, dim=1))

    loss.backward()
    optimizer_multi.step()

    if epoch % 20 == 0:
        with torch.no_grad():
            test_outputs = torch.stack([
                torch.stack(multiclass_circuit(x, multi_weights))
                for x in X_ir_test_t
            ])
            predicted = torch.argmax(test_outputs, dim=1)
            acc = (predicted == y_raw_test_t).float().mean()
            print(f"Epoch {epoch:3d}: loss={loss.item():.4f}, test accuracy={acc:.2%}")

The key changes from binary to multiclass are: (1) multiple expectation value outputs instead of one, (2) softmax to convert outputs to class probabilities, and (3) cross-entropy loss instead of MSE.

Transfer Learning: Hybrid Quantum-Classical Models

For high-dimensional data, encoding every feature directly into a quantum circuit is impractical. A hybrid approach uses a classical neural network to compress the input to a small number of features, then feeds those features into a quantum circuit.

class HybridModel(nn.Module):
    def __init__(self, input_dim, n_qubits, n_layers):
        super().__init__()
        self.n_qubits = n_qubits

        # Classical feature extractor
        self.classical = nn.Sequential(
            nn.Linear(input_dim, 16),
            nn.ReLU(),
            nn.Linear(16, n_qubits),
            nn.Sigmoid()  # map to [0, 1]
        )
        self.classical = self.classical.double()

        # Scale factor to map [0, 1] -> [0, pi]
        self.scale = np.pi

        # Quantum weights
        self.q_weights = nn.Parameter(
            torch.tensor(
                np.random.uniform(-np.pi, np.pi, size=(n_layers, 2 * n_qubits)),
                dtype=torch.float64
            )
        )

        # Quantum device and circuit
        self.dev = qml.device("default.qubit", wires=n_qubits)

        @qml.qnode(self.dev, interface="torch")
        def circuit(inputs, weights):
            qml.AngleEmbedding(inputs, wires=range(n_qubits), rotation="Y")
            for lw in weights:
                for i in range(n_qubits):
                    qml.RY(lw[i], wires=i)
                    qml.RZ(lw[i + n_qubits], wires=i)
                for i in range(n_qubits - 1):
                    qml.CNOT(wires=[i, i + 1])
            return qml.expval(qml.PauliZ(0))

        self.circuit = circuit

    def forward(self, x):
        # Classical compression
        features = self.classical(x) * self.scale
        # Quantum processing
        output = self.circuit(features, self.q_weights)
        return output

# Example: Iris binary classification (classes 0 vs 1)
mask = y_iris < 2  # only classes 0 and 1
X_bin = X_iris[mask]
y_bin = (y_iris[mask] * 2 - 1).astype(float)  # map to -1, +1

X_hb_train, X_hb_test, y_hb_train, y_hb_test = train_test_split(
    X_bin, y_bin, test_size=0.3, random_state=42
)

X_hb_train_t = torch.tensor(X_hb_train, dtype=torch.float64)
y_hb_train_t = torch.tensor(y_hb_train, dtype=torch.float64)
X_hb_test_t = torch.tensor(X_hb_test, dtype=torch.float64)
y_hb_test_t = torch.tensor(y_hb_test, dtype=torch.float64)

model = HybridModel(input_dim=4, n_qubits=2, n_layers=2)
opt_hybrid = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(80):
    opt_hybrid.zero_grad()
    preds = torch.stack([model(x) for x in X_hb_train_t])
    loss = nn.MSELoss()(preds, y_hb_train_t)
    loss.backward()
    opt_hybrid.step()

    if epoch % 20 == 0:
        with torch.no_grad():
            test_preds = torch.stack([model(x) for x in X_hb_test_t])
            acc = (torch.sign(test_preds) == y_hb_test_t).float().mean()
            print(f"Epoch {epoch:3d}: loss={loss.item():.4f}, test accuracy={acc:.2%}")

This hybrid architecture is the most practical QML approach for near-term hardware. The classical network handles dimensionality reduction (where classical hardware excels), and the quantum circuit processes a small, compressed representation (where quantum effects can contribute). The entire pipeline is end-to-end differentiable: gradients flow from the loss, through the quantum circuit (via the parameter-shift rule), and into the classical network (via backpropagation).

Shot Noise and Hardware Effects

All the experiments above use default.qubit, which computes exact expectation values through statevector simulation. Real quantum hardware estimates expectation values from a finite number of measurement samples (shots). This introduces shot noise.

For a Pauli-Z measurement, the variance of the estimated expectation value is:

Var(<Z>) = (1 - <Z>^2) / shots

With 1000 shots and <Z> = 0 (the noisiest case), the standard deviation is sqrt(1/1000) ≈ 0.032. This noise directly affects gradient estimates and training stability.
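The formula can be checked by sampling directly: a Pauli-Z measurement returns +1 or -1, and with <Z> = 0 each outcome has probability 1/2.

```python
import numpy as np

rng = np.random.default_rng(0)
shots = 1000
n_trials = 2000

# Each trial estimates <Z> from `shots` samples of +/-1 with p(+1) = 0.5
estimates = rng.choice([-1, 1], size=(n_trials, shots)).mean(axis=1)
print(estimates.std())  # close to sqrt(1/1000), i.e. about 0.032
```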

# Simulate with finite shots
dev_shots = qml.device("default.qubit", wires=2, shots=1000)

@qml.qnode(dev_shots)
def noisy_classifier(x, weights):
    qml.RY(x[0], wires=0)
    qml.RY(x[1], wires=1)
    for lw in weights:
        qml.RY(lw[0], wires=0)
        qml.RY(lw[1], wires=1)
        qml.RZ(lw[2], wires=0)
        qml.RZ(lw[3], wires=1)
        qml.CNOT(wires=[0, 1])
    return qml.expval(qml.PauliZ(0))

# Train with shot noise
np.random.seed(0)
noisy_weights = pnp.array(
    np.random.uniform(-np.pi, np.pi, size=(3, 4)),
    requires_grad=True
)

opt_noisy = qml.AdamOptimizer(stepsize=0.05)  # smaller step size for noisy gradients
noisy_losses = []

def noisy_cost(w, X_b, y_b):
    preds = pnp.array([noisy_classifier(x, w) for x in X_b])
    return pnp.mean((preds - y_b) ** 2)

for epoch in range(80):
    idx = np.random.choice(len(X_train), size=20, replace=False)
    noisy_weights, c = opt_noisy.step_and_cost(
        lambda w: noisy_cost(w, X_train[idx], y_train[idx]),
        noisy_weights
    )
    noisy_losses.append(float(c))

    if epoch % 20 == 0:
        preds = pnp.array([noisy_classifier(x, noisy_weights) for x in X_test])
        acc = float(pnp.mean(pnp.sign(preds) == y_test))
        print(f"Epoch {epoch:3d}: cost={c:.4f}, test accuracy={acc:.2%}")

# Plot noisy vs ideal training curves
plt.figure(figsize=(8, 5))
plt.plot(noisy_losses, alpha=0.7, label="shots=1000")
plt.plot(range(0, 60, 10), train_costs, "o-",
         alpha=0.7, label="exact (statevector)")
plt.xlabel("Epoch")
plt.ylabel("Training Loss")
plt.title("Training with Shot Noise")
plt.legend()
plt.tight_layout()
plt.savefig("shot_noise_training.png", dpi=150)
plt.show()

Mitigation strategies for shot noise:

  • Increase shots. More shots reduce variance linearly: doubling shots halves the variance. The cost is proportionally more circuit executions per gradient step.
  • Smaller learning rate. Noisy gradients with a large step size cause the optimizer to overshoot. Reducing the learning rate from 0.1 to 0.05 or 0.01 stabilizes training at the cost of slower convergence.
  • Gradient accumulation. Average gradients over multiple mini-batches before applying an update. This smooths out shot noise without increasing the per-batch shot count.
  • Use lightning.qubit. For simulation, PennyLane’s lightning.qubit device is a C++-backed statevector simulator that provides exact gradients much faster than default.qubit. On real hardware, this is not an option, but during development it accelerates the iteration cycle.

Comparing to Logistic Regression

The same dataset with a classical logistic regression baseline shows what the quantum circuit is up against:

from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Plain logistic regression cannot separate concentric circles
lr = LogisticRegression()
lr.fit(X_train, y_train)
print(f"Logistic regression test accuracy: {lr.score(X_test, y_test):.2%}")
# ~50% -- it cannot learn this boundary without feature engineering

# With polynomial features it can
lr_poly = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
lr_poly.fit(X_train, y_train)
print(f"Logistic regression (poly features) test accuracy: {lr_poly.score(X_test, y_test):.2%}")
# ~93% -- classical approach with explicit feature map

The quantum circuit achieves a comparable result without manual feature engineering. The entangling layers implicitly create higher-order feature interactions, similar to how a kernel method expands the feature space.

Benchmarking: Honest Classical vs. Quantum Comparison

A responsible QML tutorial requires an honest comparison. For the concentric circles dataset, here are typical results:

from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

# Quantum classifier (3 layers, angle encoding)
# ~88% test accuracy (from training above)

# Classical baselines
lr_plain = LogisticRegression()
lr_plain.fit(X_train, y_train)
print(f"Logistic regression:         {lr_plain.score(X_test, y_test):.2%}")

lr_poly2 = make_pipeline(PolynomialFeatures(degree=2), LogisticRegression())
lr_poly2.fit(X_train, y_train)
print(f"Logistic reg (poly deg 2):   {lr_poly2.score(X_test, y_test):.2%}")

svm_rbf = SVC(kernel="rbf")
svm_rbf.fit(X_train, y_train)
print(f"SVM (RBF kernel):            {svm_rbf.score(X_test, y_test):.2%}")

mlp = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=500, random_state=42)
mlp.fit(X_train, y_train)
print(f"MLP (2x16 hidden):           {mlp.score(X_test, y_test):.2%}")

Typical results:

Model                           Test Accuracy
Logistic regression             ~50%
Logistic reg + poly features    ~93%
SVM (RBF kernel)                ~97%
MLP (2 hidden layers)           ~96%
Quantum circuit (3 layers)      ~88%
Quantum kernel SVM              ~90%

The quantum circuit is competitive but not superior on this classical task. This is expected. Quantum advantage in ML, meaning a provable and practical speedup over the best classical method, remains unproven for real-world datasets.

Where quantum advantage might appear:

  • Learning from quantum data. When the input data is itself a quantum state (e.g., classifying phases of a quantum system), classical models must first perform expensive tomography while a quantum model can process the state directly.
  • Specific kernel problems. Certain quantum kernels correspond to classically intractable feature maps. If a dataset’s structure aligns with such a kernel, quantum methods could have an advantage.
  • Variational quantum simulation. Using quantum circuits to learn ground states or dynamics of quantum systems is a natural application where the quantum representation is inherently compact.

Visualizing the Decision Boundary

resolution = 40
x1_range = np.linspace(0, np.pi, resolution)
x2_range = np.linspace(0, np.pi, resolution)
xx, yy = np.meshgrid(x1_range, x2_range)
grid = np.c_[xx.ravel(), yy.ravel()]

# Predict across the grid
preds_grid = np.array([float(quantum_classifier(pt, weights)) for pt in grid])
zz = preds_grid.reshape(resolution, resolution)

plt.figure(figsize=(7, 6))
plt.contourf(xx, yy, zz, levels=20, cmap="RdBu", alpha=0.7)
plt.colorbar(label="QNode output")
plt.scatter(X_test[:, 0], X_test[:, 1], c=y_test, cmap="RdBu",
            edgecolors="k", linewidths=0.5)
plt.title("Quantum Classifier Decision Boundary")
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.tight_layout()
plt.savefig("qml_boundary.png", dpi=150)
plt.show()

Common Mistakes

These errors come up repeatedly in QML code. Knowing them in advance saves hours of debugging.

1. Not scaling data before angle encoding. RY(x) for x outside [0, pi] produces a valid rotation, but values beyond pi wrap around the Bloch sphere and compress the feature representation. Always scale inputs to [0, pi] (or [-pi, pi] for encodings that use the full range).
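A minimal scaling recipe using scikit-learn's MinMaxScaler (the raw feature values below are made up purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical raw features on very different scales
X_raw = np.array([[150.0, -0.3],
                  [230.0,  0.9],
                  [ 80.0,  0.1],
                  [310.0, -1.2]])

# Map each feature into [0, pi] so RY angles cover the Bloch sphere once
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X_raw)

print(X_scaled.min(), X_scaled.max())  # 0.0 and pi
```

Fit the scaler on the training set only, then apply the same fitted scaler to the test set, so the scaling does not leak test-set statistics into training.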

2. Setting requires_grad=True on input data. Only the variational weights should require gradients. Marking the input x as differentiable causes PennyLane to compute unnecessary gradients and can produce silent errors in some interfaces.

# Wrong: data should not require gradients
x_bad = pnp.array([0.5, 1.2], requires_grad=True)

# Correct: only weights require gradients
x_good = pnp.array([0.5, 1.2], requires_grad=False)
weights = pnp.array([0.1, 0.2, 0.3, 0.4], requires_grad=True)

3. Evaluating full-dataset accuracy every epoch. Each forward pass requires a full quantum circuit simulation. Computing accuracy on 140 training points every epoch means 140 circuit evaluations per epoch on top of the training batch. For quick feedback during development, evaluate on a small held-out subset (20-30 points) and compute full accuracy only every 10 or 20 epochs.
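The control flow for periodic evaluation is simple. The sketch below uses a cheap stand-in predict function instead of the quantum circuit, just to show the pattern; the subset size and evaluation interval are illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
X_small = rng.uniform(0, np.pi, size=(25, 2))   # small held-out subset
y_small = np.sign(X_small[:, 0] - X_small[:, 1])

def predict(X):
    # Stand-in for the quantum classifier's sign(expval) prediction
    return np.sign(X[:, 0] - X[:, 1])

for epoch in range(50):
    # ... one optimization step would go here ...
    if epoch % 10 == 0:  # evaluate only every 10 epochs
        acc = np.mean(predict(X_small) == y_small)
        print(f"Epoch {epoch}: subset accuracy={acc:.2%}")
```

With a real QNode, each accuracy check on the 25-point subset costs 25 circuit evaluations, but only once every 10 epochs instead of 140 evaluations every epoch.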

4. Confusing the return order of step_and_cost. opt.step_and_cost(cost_fn, weights) returns (updated_weights, cost_value), not (cost_value, updated_weights). Getting this wrong means you silently replace your weights with a scalar cost and your cost with a weight array, producing nonsensical training behavior.

# Correct
weights, cost_val = opt.step_and_cost(cost_fn, weights)

# Wrong -- swapped order
cost_val, weights = opt.step_and_cost(cost_fn, weights)  # this will break training

5. Expecting mini-batches to speed up simulation. In classical deep learning, mini-batches enable parallelism across a GPU. In PennyLane, each circuit evaluation is a separate simulation: a batch of 20 samples runs 20 sequential simulations, not one parallel operation. True parallelism requires device-level support for batched or vectorized execution, or distributing circuits across multiple device instances.

What to Try Next

  • Swap default.qubit for lightning.qubit to speed up simulation for larger circuits
  • Implement data re-uploading: repeat the data encoding 3 times with trainable rotations between repetitions
  • Try the quantum kernel approach on a dataset where the RBF kernel struggles (high-dimensional, sparse data)
  • Explore PennyLane’s built-in templates: qml.StronglyEntanglingLayers, qml.BasicEntanglerLayers, and qml.SimplifiedTwoDesign
  • Run on real hardware using PennyLane’s IBM Quantum plugin (pennylane-qiskit) or Amazon Braket plugin (amazon-braket-pennylane-plugin)
  • Read about barren plateaus in depth: McClean et al., “Barren plateaus in quantum neural network training landscapes,” Nature Communications 9 (2018)
  • See the PennyLane Reference for the full optimizer and embedding API
