Building a Variational Quantum Classifier from Scratch
Build a complete binary quantum classifier with angle encoding, strongly entangling layers, BCE loss, and Adam optimizer, trained on the moons dataset, with decision boundary visualization and comparison to logistic regression.
What Is a Variational Quantum Classifier?
A variational quantum classifier (VQC) is a hybrid quantum-classical model. It uses a parameterized quantum circuit to transform input features into a measurement outcome, then a classical optimizer to tune those parameters by minimizing a loss function. The quantum circuit plays the role of a nonlinear feature map and classifier combined.
VQCs are not universally better than classical classifiers. On near-term hardware they face noise and limited qubit counts. But they are a productive research area for understanding whether quantum feature maps can provide any advantage, and they are an excellent way to learn the mechanics of variational quantum algorithms.
Theoretical Motivation for VQCs
A VQC maps classical data x to a quantum feature space via an encoding circuit U(x). A set of trainable parameters w controls a variational ansatz V(w). The full quantum state is:
|ψ(x, w)⟩ = V(w) U(x) |0…0⟩
The model makes predictions based on the expectation value ⟨ψ(x, w)|O|ψ(x, w)⟩ for some observable O (typically a Pauli-Z on one qubit). The classical optimizer updates w to minimize the loss between these predictions and the true labels.
The key question is whether the quantum feature space provides any advantage over classical kernels. To understand this, consider the connection to kernel methods. A VQC implicitly defines a quantum kernel:
K(x, x′) = |⟨ψ(x)|ψ(x′)⟩|²
This kernel measures how similar two data points are in the quantum feature space. If this kernel is hard to compute classically (that is, no efficient classical algorithm can approximate it), then a quantum speedup is possible in principle. The encoding circuit U(x) determines the kernel. Simple encodings like angle encoding produce kernels that are easy to compute classically, so they do not offer a quantum advantage. More complex encodings, such as IQP-style circuits with entangling gates that depend on products of features, can produce kernels that are conjectured to be classically intractable.
In practice, “quantum advantage for classification” means finding a dataset and encoding where the quantum kernel separates the classes better than any efficient classical kernel. This remains an active area of research. For this tutorial, the goal is to understand the mechanics, not to claim advantage.
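To make the "classically easy" claim concrete: for RY-based angle encoding on independent qubits, the per-qubit overlaps factorize, so the kernel has the closed form K(x, x′) = Π_i cos²((x_i − x′_i)/2). The following sketch (plain NumPy, separate from the tutorial's main code) checks this closed form against a directly constructed product statevector:

```python
import numpy as np

def ry_state(angles):
    """Statevector of RY(angle_i)|0> applied to each qubit (tensor product)."""
    state = np.array([1.0])
    for a in angles:
        qubit = np.array([np.cos(a / 2), np.sin(a / 2)])
        state = np.kron(state, qubit)
    return state

def angle_kernel_exact(x1, x2):
    """Closed-form angle-encoding kernel: product of cos^2((x_i - x'_i)/2)."""
    return float(np.prod(np.cos((np.asarray(x1) - np.asarray(x2)) / 2) ** 2))

x1 = np.array([0.3, 1.1])
x2 = np.array([0.9, 2.0])
overlap = np.dot(ry_state(x1), ry_state(x2))  # states are real-valued here
print(abs(overlap) ** 2, angle_kernel_exact(x1, x2))  # the two values agree
```

Because the kernel is a product of n cosines, it can be evaluated in O(n) classical time, which is exactly why plain angle encoding cannot yield a quantum kernel advantage.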
The Dataset: Moons
The sklearn make_moons dataset is a classic for nonlinear binary classification. Its two interleaved crescent-shaped clusters cannot be separated by a straight line, so the classifier must learn a curved decision boundary. It is complex enough to be interesting but small enough to train quickly.
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
# Scale features to [0, pi] for angle encoding
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.25, random_state=42
)
y_train = 2 * y_train - 1 # Map {0,1} -> {-1, +1} for Pauli-Z expectation
y_test = 2 * y_test - 1
Data Encoding Strategies
The choice of encoding circuit determines what quantum feature space the data lives in. There are three main strategies, each with different trade-offs in qubit count, circuit depth, and expressiveness.
Angle Encoding
Angle encoding maps each classical feature to a rotation angle applied to one qubit. For a 2-feature dataset we use 2 qubits and encode x_0 via RY(x_0) on qubit 0 and x_1 via RY(x_1) on qubit 1.
This encoding is linear in the data dimension: n features require n qubits. The circuit depth is O(1) since all rotations are applied in parallel. It is a simple, hardware-efficient encoding, and is the right starting point for low-dimensional data.
Amplitude Encoding
Amplitude encoding maps a classical vector x = (x_0, x_1, …, x_{N-1}) directly into the amplitudes of a quantum state:
|ψ⟩ = Σ_i x_i |i⟩
where the vector must be normalized (||x|| = 1). This requires only log₂(N) qubits for an N-dimensional vector. For a 4-feature dataset, amplitude encoding uses just 2 qubits.
The catch is that preparing an arbitrary amplitude-encoded state requires an exponentially deep circuit in the worst case. PennyLane’s qml.AmplitudeEmbedding handles the state preparation automatically, but the circuit depth grows with the number of features. For NISQ devices, this cost can be prohibitive.
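As a quick sanity check of what amplitude encoding actually stores (a NumPy-only sketch, independent of PennyLane): the normalized feature vector becomes the state's amplitudes, and computational-basis measurement probabilities are the squared normalized features:

```python
import numpy as np

# A 4-feature sample maps onto the 4 amplitudes of a 2-qubit state
x = np.array([0.5, 1.2, 0.8, 2.1])

# Normalize to unit length: these are the state amplitudes
amps = x / np.linalg.norm(x)

# Measuring in the computational basis yields |i> with probability amps[i]^2
probs = amps ** 2
print("amplitudes:", amps)
print("probabilities:", probs, "sum =", probs.sum())
```

This also shows a limitation: the encoding only sees the direction of x, not its magnitude, so samples that differ only by overall scale become indistinguishable.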
IQP (Instantaneous Quantum Polynomial) Encoding
IQP encoding applies single-qubit phase gates and two-qubit entangling gates that depend on products of features:
- Apply Hadamard gates to all qubits.
- Apply phase gates e^(i x_j Z_j) on each qubit j.
- Apply entangling phase gates e^(i x_j x_k ZZ) on pairs (j, k).
- Optionally repeat the encoding block multiple times.
This creates a feature map that is conjectured to be hard to simulate classically. The products x_j * x_k in the entangling gates introduce nonlinear interactions between features, which makes the resulting quantum kernel richer than what angle encoding produces.
Code Comparison: All Three Encodings
The following code shows all three encoding strategies applied to a 4-feature input:
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
# Sample 4-feature input
x = np.array([0.5, 1.2, 0.8, 2.1])
# --- Angle Encoding (4 qubits for 4 features) ---
dev_angle = qml.device("default.qubit", wires=4)
@qml.qnode(dev_angle)
def angle_encoding(x):
for i in range(4):
qml.RY(x[i], wires=i)
return [qml.expval(qml.PauliZ(i)) for i in range(4)]
print("Angle encoding output:", angle_encoding(x))
print("Angle encoding circuit:")
print(qml.draw(angle_encoding)(x))
# --- Amplitude Encoding (2 qubits for 4 features) ---
dev_amp = qml.device("default.qubit", wires=2)
@qml.qnode(dev_amp)
def amplitude_encoding(x):
# AmplitudeEmbedding normalizes the input vector automatically
qml.AmplitudeEmbedding(features=x, wires=range(2), normalize=True)
return [qml.expval(qml.PauliZ(i)) for i in range(2)]
print("\nAmplitude encoding output:", amplitude_encoding(x))
print("Amplitude encoding circuit:")
print(qml.draw(amplitude_encoding)(x))
# --- IQP Encoding (4 qubits for 4 features) ---
dev_iqp = qml.device("default.qubit", wires=4)
@qml.qnode(dev_iqp)
def iqp_encoding(x):
# First layer: Hadamards + single-qubit phase gates
for i in range(4):
qml.Hadamard(wires=i)
for i in range(4):
qml.RZ(x[i], wires=i)
# Entangling phase gates based on feature products
for i in range(4):
for j in range(i + 1, 4):
qml.CNOT(wires=[i, j])
qml.RZ(x[i] * x[j], wires=j)
qml.CNOT(wires=[i, j])
# Repeat the encoding for richer feature map
for i in range(4):
qml.Hadamard(wires=i)
for i in range(4):
qml.RZ(x[i], wires=i)
for i in range(4):
for j in range(i + 1, 4):
qml.CNOT(wires=[i, j])
qml.RZ(x[i] * x[j], wires=j)
qml.CNOT(wires=[i, j])
return [qml.expval(qml.PauliZ(i)) for i in range(4)]
print("\nIQP encoding output:", iqp_encoding(x))
print("IQP encoding circuit:")
print(qml.draw(iqp_encoding)(x))
| Encoding | Qubits for N features | Circuit Depth | Kernel Complexity |
|---|---|---|---|
| Angle | N | O(1) | Classically easy |
| Amplitude | log₂(N) | O(N) | Depends on preparation |
| IQP | N | O(N²) | Conjectured classically hard |
Circuit Architecture
The full circuit is:
- Angle encoding layer: RY(x_i) on each qubit.
- Variational layers (repeated L times):
  - Single-qubit rotations: Rot(phi, theta, omega) on each qubit (three parameters per qubit per layer).
  - Entangling CNOT ring: CNOT from qubit i to qubit (i+1) mod n.
- Measure the Pauli-Z expectation value of qubit 0.
The output is a scalar in [-1, +1]. We apply a sigmoid to map it to a probability, then use binary cross-entropy loss.
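Before the full implementation, here is the classical postprocessing step in isolation (a minimal NumPy sketch). Note that because the expectation value is bounded in [-1, +1], the sigmoid output stays within roughly (0.27, 0.73), so the BCE loss can never saturate at 0 or blow up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(prob, y01):
    """Binary cross-entropy for a single prediction."""
    prob = np.clip(prob, 1e-7, 1 - 1e-7)
    return -(y01 * np.log(prob) + (1 - y01) * np.log(1 - prob))

# Raw circuit outputs span [-1, +1]; sigmoid maps them to probabilities
for expval in (-1.0, 0.0, 1.0):
    p = sigmoid(expval)
    print(f"<Z>={expval:+.1f} -> P(class 1)={p:.3f} | "
          f"loss if y=1: {bce(p, 1):.3f} | loss if y=0: {bce(p, 0):.3f}")
```

The bounded output range keeps gradients well behaved, at the cost of the model never expressing extreme confidence; scaling the expectation value before the sigmoid is a common (optional) tweak.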
Full PennyLane Implementation
import matplotlib
matplotlib.use("Agg")
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
# Circuit configuration
n_qubits = 2
n_layers = 3
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit(params, x):
"""
Variational quantum classifier circuit.
params: shape (n_layers, n_qubits, 3) - rotation angles per layer per qubit
x: shape (n_qubits,) - input features
"""
# Angle encoding
for i in range(n_qubits):
qml.RY(x[i], wires=i)
# Variational layers
for layer in range(n_layers):
# Parameterized rotations
for qubit in range(n_qubits):
qml.Rot(
params[layer, qubit, 0],
params[layer, qubit, 1],
params[layer, qubit, 2],
wires=qubit,
)
# Entangling layer: CNOT ring
for qubit in range(n_qubits - 1):
qml.CNOT(wires=[qubit, qubit + 1])
if n_qubits > 2:
qml.CNOT(wires=[n_qubits - 1, 0])
return qml.expval(qml.PauliZ(0))
def sigmoid(z):
return 1.0 / (1.0 + pnp.exp(-z))
def bce_loss(params, X_batch, y_batch):
"""Binary cross-entropy loss."""
total_loss = 0.0
for x, y in zip(X_batch, y_batch):
pred_raw = circuit(params, x)
# Convert {-1,+1} label to {0,1} for BCE
y_01 = (y + 1) / 2
prob = sigmoid(pred_raw)
prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
total_loss += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
return total_loss / len(X_batch)
def predict(params, X):
raw = np.array([circuit(params, x) for x in X])
return np.sign(raw)
def accuracy(params, X, y):
preds = predict(params, X)
return np.mean(preds == y)
# Initialize parameters with small random values
np.random.seed(42)
params = pnp.array(
np.random.uniform(-np.pi / 4, np.pi / 4, size=(n_layers, n_qubits, 3)),
requires_grad=True,
)
# Adam optimizer
opt = qml.AdamOptimizer(stepsize=0.05)
batch_size = 16
n_epochs = 40
print(f"Training VQC: {n_layers} layers, {n_qubits} qubits, {n_layers * n_qubits * 3} parameters")
print(f"Initial train accuracy: {accuracy(params, X_train, y_train_pm):.3f}")
for epoch in range(n_epochs):
# Shuffle and create batches
perm = np.random.permutation(len(X_train))
X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
for start in range(0, len(X_train), batch_size):
X_batch = X_shuf[start : start + batch_size]
y_batch = y_shuf[start : start + batch_size]
params, loss_val = opt.step_and_cost(lambda p: bce_loss(p, X_batch, y_batch), params)
if (epoch + 1) % 10 == 0:
train_acc = accuracy(params, X_train, y_train_pm)
test_acc = accuracy(params, X_test, y_test_pm)
print(f"Epoch {epoch+1:3d} | Loss: {loss_val:.4f} | Train acc: {train_acc:.3f} | Test acc: {test_acc:.3f}")
print(f"\nFinal test accuracy (VQC): {accuracy(params, X_test, y_test_pm):.3f}")
Using Strongly Entangling Layers
The manual CNOT ring above is one way to build entanglement, but PennyLane provides qml.StronglyEntanglingLayers, a template that uses Rot gates (3 parameters each) and CNOTs in a pattern designed to create long-range entanglement. In each layer, the CNOT pattern shifts so that different qubit pairs become entangled across layers, providing better coverage of the Hilbert space.
The parameter count is the same: 3 parameters per qubit per layer. But the entanglement pattern is more structured and generally more expressive for the same depth.
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Dataset setup (same as before)
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
n_qubits = 2
n_layers = 3
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit_sel(params, x):
"""VQC using StronglyEntanglingLayers."""
# Angle encoding
for i in range(n_qubits):
qml.RY(x[i], wires=i)
# Strongly entangling variational layers
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
# Draw the circuit to see the structure
print("StronglyEntanglingLayers circuit:")
dummy_params = pnp.zeros((n_layers, n_qubits, 3))
dummy_x = pnp.zeros(n_qubits)
print(qml.draw(circuit_sel)(dummy_params, dummy_x))
# Parameter count comparison
manual_params = n_layers * n_qubits * 3
sel_shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
sel_params = np.prod(sel_shape)
print(f"\nManual ansatz parameters: {manual_params}")
print(f"StronglyEntanglingLayers parameters: {sel_params}")
print(f"Both use {n_layers} layers x {n_qubits} qubits x 3 rotation angles = {manual_params}")
# Train with StronglyEntanglingLayers
np.random.seed(42)
shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
params_sel = pnp.array(
np.random.uniform(-np.pi / 4, np.pi / 4, size=shape),
requires_grad=True,
)
def sigmoid(z):
return 1.0 / (1.0 + pnp.exp(-z))
def bce_loss_sel(params, X_batch, y_batch):
total_loss = 0.0
for x, label in zip(X_batch, y_batch):
pred_raw = circuit_sel(params, x)
y_01 = (label + 1) / 2
prob = sigmoid(pred_raw)
prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
total_loss += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
return total_loss / len(X_batch)
opt = qml.AdamOptimizer(stepsize=0.05)
batch_size = 16
n_epochs = 40
for epoch in range(n_epochs):
perm = np.random.permutation(len(X_train))
X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
for start in range(0, len(X_train), batch_size):
X_batch = X_shuf[start : start + batch_size]
y_batch = y_shuf[start : start + batch_size]
params_sel, loss_val = opt.step_and_cost(
lambda p: bce_loss_sel(p, X_batch, y_batch), params_sel
)
if (epoch + 1) % 10 == 0:
preds = np.sign(np.array([circuit_sel(params_sel, x) for x in X_test]))
test_acc = np.mean(preds == y_test_pm)
print(f"Epoch {epoch+1:3d} | Loss: {loss_val:.4f} | Test acc: {test_acc:.3f}")
preds_final = np.sign(np.array([circuit_sel(params_sel, x) for x in X_test]))
print(f"\nFinal test accuracy (StronglyEntanglingLayers): {np.mean(preds_final == y_test_pm):.3f}")
The StronglyEntanglingLayers template typically produces comparable results to the manual CNOT ring on a 2-qubit circuit. The difference becomes more pronounced on 4+ qubits, where the shifting CNOT pattern creates long-range entanglement that a simple nearest-neighbor ring misses.
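To see why the shifting pattern helps, the following pure-Python sketch lists the CNOT pairs per layer, assuming the template's documented default range pattern r_l = (l mod (n_wires − 1)) + 1, with each CNOT acting from wire i to wire (i + r_l) mod n_wires:

```python
def sel_cnot_pairs(n_wires, n_layers):
    """Sketch of StronglyEntanglingLayers' default CNOT connectivity,
    assuming range r_l = (l mod (n_wires - 1)) + 1 in layer l."""
    layers = []
    for l in range(n_layers):
        r = (l % (n_wires - 1)) + 1  # controlled-gate "range" for this layer
        layers.append([(i, (i + r) % n_wires) for i in range(n_wires)])
    return layers

for l, pairs in enumerate(sel_cnot_pairs(4, 3)):
    print(f"Layer {l}: CNOT pairs {pairs}")
```

On 4 qubits, layer 0 entangles nearest neighbors, layer 1 skips one qubit, and layer 2 skips two, so every ordered pair of qubits is directly coupled within three layers, unlike a fixed nearest-neighbor ring.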
Decision Boundary Visualization
def plot_decision_boundary(params, X, y, title="VQC Decision Boundary"):
h = 0.05
x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
grid = np.c_[xx.ravel(), yy.ravel()]
raw_preds = np.array([float(circuit(params, pt)) for pt in grid])
Z = np.sign(raw_preds).reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap="RdBu")
y_01 = (y + 1) // 2
plt.scatter(X[:, 0], X[:, 1], c=y_01, cmap="RdBu", edgecolors="k", s=40)
plt.title(title)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.tight_layout()
plt.savefig("vqc_decision_boundary.png", dpi=120)
plot_decision_boundary(params, X_test, y_test_pm)
Comparison to Logistic Regression
# Classical baseline: logistic regression (inherently linear, cannot separate moons well)
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_test)
lr_acc = accuracy_score(y_test, lr_preds)
print(f"Logistic Regression test accuracy: {lr_acc:.3f}")
Logistic regression typically achieves around 85-87% accuracy on this dataset because it is constrained to a linear decision boundary. The VQC, with entangling layers, can learn a curved boundary and typically reaches 88-93% with this configuration.
Barren Plateaus: The Main Challenge for VQCs at Scale
A barren plateau is a phenomenon where the gradient of the cost function becomes exponentially small as the number of qubits grows. Specifically, for a random parameterized circuit on n qubits, the variance of the gradient with respect to any single parameter scales as:
Var(∂C/∂θ) ~ O(1/2^n)
This means that for large circuits, the gradient landscape is essentially flat everywhere except in an exponentially small region. The optimizer sees near-zero gradients and cannot make progress. This is not a numerical issue; it is a fundamental property of high-dimensional quantum state spaces.
The following code demonstrates barren plateaus empirically by measuring the gradient variance for circuits of increasing size:
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
def compute_gradient_variance(n_qubits, n_layers=2, n_samples=200):
"""Compute the variance of gradients for random parameters."""
dev = qml.device("default.qubit", wires=n_qubits)
    @qml.qnode(dev)
    def random_circuit(params, features):
        # Fixed random encoding: features are sampled outside the QNode,
        # so the circuit is deterministic within each gradient evaluation
        for i in range(n_qubits):
            qml.RY(features[i], wires=i)
        # Variational layers
        qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
        # Cost function: expectation of Z on qubit 0
        return qml.expval(qml.PauliZ(0))
    shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
    grad_fn = qml.grad(random_circuit, argnum=0)
    gradients = []
    for _ in range(n_samples):
        features = np.random.uniform(0, np.pi, size=n_qubits)
        params = pnp.array(
            np.random.uniform(-np.pi, np.pi, size=shape), requires_grad=True
        )
        grad = grad_fn(params, features)
        # Record the gradient of the first parameter
        gradients.append(grad[0, 0, 0])
return np.var(gradients)
# Measure gradient variance for increasing qubit counts
qubit_counts = [2, 4, 6, 8]
variances = []
for n in qubit_counts:
var = compute_gradient_variance(n, n_layers=2, n_samples=200)
variances.append(var)
print(f"n_qubits={n}: gradient variance = {var:.6f}")
# The variance decreases roughly exponentially with qubit count
print("\nRatio of successive variances (expect ~0.25 for exponential decay):")
for i in range(1, len(variances)):
ratio = variances[i] / variances[i - 1]
print(f" Var(n={qubit_counts[i]}) / Var(n={qubit_counts[i-1]}) = {ratio:.4f}")
You should see the gradient variance drop by roughly a factor of 4 each time the qubit count increases by 2, consistent with the 1/2^n scaling.
Mitigation Strategies
Several approaches help mitigate barren plateaus:
- Layer-by-layer training: Train one layer at a time, keeping previous layers fixed. Each layer is optimized in a shallow landscape before the next layer is added. This avoids the exponentially flat landscape of the full circuit.
- Local cost functions: Cost functions built from global observables that act on every qubit at once (for example, a projector onto the all-zeros state) flatten fastest. Replacing them with sums of few-qubit local observables, such as Σ_i ⟨Z_i⟩, yields gradient variances that decay only polynomially with system size in shallow circuits.
- Parameter initialization near identity: Initialize parameters close to zero so the circuit starts near the identity operation. This places the initial point in a region with meaningful gradients.
- Quantum natural gradient: Use the quantum Fisher information matrix to precondition the gradient, which accounts for the geometry of the parameter space. We cover this in the next section.
Quantum Natural Gradient Optimizer
Standard gradient descent treats all parameter directions equally. But in a parameterized quantum circuit, small changes in one parameter can cause large changes in the quantum state, while large changes in another parameter barely move the state. The quantum natural gradient (QNG) accounts for this by preconditioning the gradient with the inverse of the quantum Fisher information matrix (also called the Fubini-Study metric tensor).
The update rule for QNG is:
w ← w - η F⁻¹ ∇C(w)
where F is the quantum Fisher information matrix with entries F_ij = Re(⟨∂_i ψ|∂_j ψ⟩ - ⟨∂_i ψ|ψ⟩⟨ψ|∂_j ψ⟩).
This accounts for the geometry of the quantum state space, making the optimizer take steps that are uniform in terms of state-space distance rather than parameter-space distance.
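As a toy check of the definition (NumPy only, not part of the classifier code): for a single qubit prepared as RY(θ)|0⟩, the metric is the constant F = 1/4 for every θ, which the formula above reproduces using analytic state derivatives:

```python
import numpy as np

def psi(theta):
    """Single-qubit state RY(theta)|0>."""
    return np.array([np.cos(theta / 2), np.sin(theta / 2)])

def dpsi(theta):
    """Analytic derivative of the state with respect to theta."""
    return np.array([-np.sin(theta / 2), np.cos(theta / 2)]) / 2

theta = 0.7
s, ds = psi(theta), dpsi(theta)
# F = Re(<d psi|d psi> - <d psi|psi><psi|d psi>)
F = np.real(np.vdot(ds, ds) - np.vdot(ds, s) * np.vdot(s, ds))
print(F)  # 0.25, independent of theta
```

A QNG step on this one-parameter model is therefore ordinary gradient descent rescaled by F⁻¹ = 4: the step size is measured in state-space distance rather than raw parameter distance.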
PennyLane provides qml.QNGOptimizer for this purpose. The following code compares Adam and QNG on the moons classification task:
import matplotlib
matplotlib.use("Agg")
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
n_qubits = 2
n_layers = 2
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit_qng(params, x):
for i in range(n_qubits):
qml.RY(x[i], wires=i)
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
def sigmoid(z):
return 1.0 / (1.0 + pnp.exp(-z))
def cost_fn(params):
"""Full-batch cost for QNG (QNG works best with full-batch or large-batch)."""
total_loss = 0.0
for x, label in zip(X_train, y_train_pm):
pred = circuit_qng(params, x)
y_01 = (label + 1) / 2
prob = sigmoid(pred)
prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
total_loss += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
return total_loss / len(X_train)
def test_accuracy(params):
preds = np.sign(np.array([circuit_qng(params, x) for x in X_test]))
return np.mean(preds == y_test_pm)
# Train with Adam
shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
np.random.seed(42)
init_params = np.random.uniform(-np.pi / 4, np.pi / 4, size=shape)
params_adam = pnp.array(init_params.copy(), requires_grad=True)
opt_adam = qml.AdamOptimizer(stepsize=0.05)
adam_losses = []
for epoch in range(30):
params_adam, loss = opt_adam.step_and_cost(cost_fn, params_adam)
adam_losses.append(float(loss))
if (epoch + 1) % 10 == 0:
acc = test_accuracy(params_adam)
print(f"Adam Epoch {epoch+1:3d} | Loss: {loss:.4f} | Test acc: {acc:.3f}")
# Train with QNG
params_qng = pnp.array(init_params.copy(), requires_grad=True)
opt_qng = qml.QNGOptimizer(stepsize=0.01)
# QNGOptimizer needs the metric tensor of the underlying QNode. Because our
# cost function averages over many circuit calls, we pass an explicit
# metric_tensor_fn evaluated at a representative training point (an
# approximation: the metric depends on the encoded input).
metric_fn = qml.metric_tensor(circuit_qng, approx="block-diag")
qng_losses = []
for epoch in range(30):
    params_qng, loss = opt_qng.step_and_cost(
        cost_fn, params_qng, metric_tensor_fn=lambda p: metric_fn(p, X_train[0])
    )
    qng_losses.append(float(loss))
    if (epoch + 1) % 10 == 0:
        acc = test_accuracy(params_qng)
        print(f"QNG Epoch {epoch+1:3d} | Loss: {loss:.4f} | Test acc: {acc:.3f}")
# Plot convergence comparison
plt.figure(figsize=(8, 5))
plt.plot(adam_losses, label="Adam (lr=0.05)")
plt.plot(qng_losses, label="QNG (lr=0.01)")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Adam vs Quantum Natural Gradient Convergence")
plt.legend()
plt.tight_layout()
plt.savefig("adam_vs_qng.png", dpi=120)
print("\nConvergence plot saved to adam_vs_qng.png")
QNG typically converges in fewer steps, especially in the early phase of training. However, each QNG step requires O(p²) circuit evaluations to estimate the Fisher information matrix, where p is the number of parameters. For the 12-parameter circuit above, this overhead is manageable. For circuits with hundreds of parameters, the per-step cost becomes prohibitive, and block-diagonal approximations to the Fisher matrix are used instead.
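The block-diagonal idea can be illustrated without any quantum machinery (a hedged NumPy sketch with a made-up SPD "metric"): keep only the per-layer blocks of F, which reduces the number of matrix entries to estimate while still capturing curvature within each layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SPD "metric" for 2 layers x 2 parameters = 4 parameters total
A = rng.normal(size=(4, 4))
F = A @ A.T + 0.5 * np.eye(4)
grad = rng.normal(size=4)

# Full natural-gradient direction: solve F @ step = grad
step_full = np.linalg.solve(F, grad)

# Block-diagonal approximation: keep only the per-layer 2x2 blocks of F
F_block = np.zeros_like(F)
for b in range(0, 4, 2):
    F_block[b:b + 2, b:b + 2] = F[b:b + 2, b:b + 2]
step_block = np.linalg.solve(F_block, grad)

print("full step:  ", step_full)
print("block step: ", step_block)
```

The block-diagonal solve decomposes into independent per-layer solves, which is why it scales so much better: the cross-layer curvature is simply discarded.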
Amplitude Encoding for Higher-Dimensional Data
For datasets with more than a few features, angle encoding requires one qubit per feature, which quickly exceeds hardware limits. Amplitude encoding offers a logarithmic alternative: a 4-dimensional feature vector maps to a 2-qubit state.
The following example trains a VQC on the Iris dataset (setosa vs. versicolor, 4 features) using amplitude encoding:
import matplotlib
matplotlib.use("Agg")
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load Iris dataset (2 classes: setosa=0, versicolor=1)
iris = load_iris()
mask = iris.target < 2 # Select only setosa and versicolor
X_iris = iris.data[mask]
y_iris = iris.target[mask]
# Standardize features, then normalize each sample to unit length for amplitude encoding
scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris)
X_iris_norm = X_iris_scaled / np.linalg.norm(X_iris_scaled, axis=1, keepdims=True)
X_train, X_test, y_train, y_test = train_test_split(
X_iris_norm, y_iris, test_size=0.25, random_state=42
)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
# Amplitude encoding: 4 features -> 2 qubits
n_qubits = 2
n_layers = 3
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit_amp(params, x):
"""VQC with amplitude encoding for 4-feature data."""
# Amplitude encoding: maps 4D vector to 2-qubit state
qml.AmplitudeEmbedding(features=x, wires=range(n_qubits), normalize=False)
# Variational layers
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
# Print circuit structure
shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
dummy_params = pnp.zeros(shape)
dummy_x = np.array([0.5, 0.5, 0.5, 0.5])
print("Amplitude encoding circuit (4 features, 2 qubits):")
print(qml.draw(circuit_amp)(dummy_params, dummy_x))
# Compare circuit depths
dev_4q = qml.device("default.qubit", wires=4)
@qml.qnode(dev_4q)
def circuit_angle_4(params, x):
"""VQC with angle encoding for 4-feature data (4 qubits)."""
for i in range(4):
qml.RY(x[i], wires=i)
qml.StronglyEntanglingLayers(params, wires=range(4))
return qml.expval(qml.PauliZ(0))
shape_4q = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=4)
print(f"\nAngle encoding: 4 qubits, {np.prod(shape_4q)} variational parameters")
print(f"Amplitude encoding: 2 qubits, {np.prod(shape)} variational parameters")
print("Amplitude encoding uses fewer qubits but a deeper state preparation circuit.")
# Train the amplitude-encoded VQC
np.random.seed(42)
params_amp = pnp.array(
np.random.uniform(-np.pi / 4, np.pi / 4, size=shape), requires_grad=True
)
def sigmoid(z):
return 1.0 / (1.0 + pnp.exp(-z))
def bce_loss_amp(params, X_batch, y_batch):
total_loss = 0.0
for x, label in zip(X_batch, y_batch):
pred = circuit_amp(params, x)
y_01 = (label + 1) / 2
prob = sigmoid(pred)
prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
total_loss += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
return total_loss / len(X_batch)
opt = qml.AdamOptimizer(stepsize=0.05)
for epoch in range(40):
perm = np.random.permutation(len(X_train))
X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
for start in range(0, len(X_train), 16):
X_batch = X_shuf[start : start + 16]
y_batch = y_shuf[start : start + 16]
params_amp, loss_val = opt.step_and_cost(
lambda p: bce_loss_amp(p, X_batch, y_batch), params_amp
)
if (epoch + 1) % 10 == 0:
preds = np.sign(np.array([circuit_amp(params_amp, x) for x in X_test]))
test_acc = np.mean(preds == y_test_pm)
print(f"Epoch {epoch+1:3d} | Loss: {loss_val:.4f} | Test acc: {test_acc:.3f}")
preds_final = np.sign(np.array([circuit_amp(params_amp, x) for x in X_test]))
print(f"\nFinal test accuracy (Amplitude VQC on Iris): {np.mean(preds_final == y_test_pm):.3f}")
Setosa and versicolor are linearly separable, so both quantum and classical methods achieve near-perfect accuracy on this pair. The value of amplitude encoding becomes apparent on higher-dimensional datasets where angle encoding would require too many qubits.
Multi-Class Extension
The binary VQC measures one qubit and uses the sign of the expectation value. For k classes, we extend this by measuring k qubits and predicting the class with the highest expectation value. The output is a vector of Pauli-Z expectation values (⟨Z_0⟩, ⟨Z_1⟩, …, ⟨Z_{k-1}⟩), and we apply softmax to convert these into class probabilities.
The following example builds a 3-class VQC for the full Iris dataset:
import matplotlib
matplotlib.use("Agg")
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load full Iris dataset (3 classes)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target # 0, 1, 2
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)
# Scale to [0, pi] for angle encoding
from sklearn.preprocessing import MinMaxScaler
scaler2 = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler2.fit_transform(X_scaled)
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y_iris, test_size=0.25, random_state=42
)
# Circuit: 4 qubits (one per feature), measure qubits 0, 1, 2 for 3 classes
n_qubits = 4
n_layers = 3
n_classes = 3
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit_multi(params, x):
"""Multi-class VQC: returns 3 Pauli-Z expectation values."""
# Angle encoding: 4 features on 4 qubits
for i in range(n_qubits):
qml.RY(x[i], wires=i)
# Variational layers
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
# Measure one qubit per class
return [qml.expval(qml.PauliZ(i)) for i in range(n_classes)]
def softmax(z):
"""Numerically stable softmax."""
z_shifted = z - pnp.max(z)
exp_z = pnp.exp(z_shifted)
return exp_z / pnp.sum(exp_z)
def cross_entropy_loss(params, X_batch, y_batch):
"""Multi-class cross-entropy loss with softmax."""
total_loss = 0.0
for x, label in zip(X_batch, y_batch):
raw_outputs = pnp.array(circuit_multi(params, x))
probs = softmax(raw_outputs)
probs = pnp.clip(probs, 1e-7, 1.0)
total_loss += -pnp.log(probs[label])
return total_loss / len(X_batch)
def predict_multi(params, X):
predictions = []
for x in X:
raw = np.array(circuit_multi(params, x))
predictions.append(np.argmax(raw))
return np.array(predictions)
# Initialize and train
shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
np.random.seed(42)
params_multi = pnp.array(
np.random.uniform(-np.pi / 4, np.pi / 4, size=shape), requires_grad=True
)
opt = qml.AdamOptimizer(stepsize=0.05)
batch_size = 16
for epoch in range(50):
perm = np.random.permutation(len(X_train))
X_shuf, y_shuf = X_train[perm], y_train[perm]
for start in range(0, len(X_train), batch_size):
X_batch = X_shuf[start : start + batch_size]
y_batch = y_shuf[start : start + batch_size]
params_multi, loss_val = opt.step_and_cost(
lambda p: cross_entropy_loss(p, X_batch, y_batch), params_multi
)
if (epoch + 1) % 10 == 0:
preds = predict_multi(params_multi, X_test)
acc = np.mean(preds == y_test)
print(f"Epoch {epoch+1:3d} | Loss: {loss_val:.4f} | Test acc: {acc:.3f}")
# Final VQC accuracy
preds_vqc = predict_multi(params_multi, X_test)
vqc_acc = np.mean(preds_vqc == y_test)
print(f"\nFinal 3-class VQC test accuracy: {vqc_acc:.3f}")
# Classical baseline: logistic regression (multi-class)
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_test)
lr_acc = accuracy_score(y_test, lr_preds)
print(f"Logistic Regression test accuracy: {lr_acc:.3f}")
The 3-class VQC typically reaches 85-95% accuracy on Iris, comparable to logistic regression. The key design choices for multi-class VQCs are: (1) allocate one measurement qubit per class, (2) use softmax to convert expectation values to probabilities, and (3) use cross-entropy loss instead of BCE.
Quantum Kernel Method Comparison
An alternative to training a variational circuit is to use the quantum circuit purely as a kernel function. The quantum kernel between two data points is:
K(x, x’) = |⟨0…0| U(x)† U(x’) |0…0⟩|²
This equals the probability of measuring the all-zeros state after applying U(x’) followed by U(x)†. Once we compute the kernel matrix, we pass it to a classical SVM and let the classical optimizer handle classification. No variational parameters need to be trained on the quantum side.
import pennylane as qml
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
n_qubits = 2
dev = qml.device("default.qubit", wires=n_qubits)
def encoding_circuit(x, wires):
    """IQP-style encoding for richer kernel."""
    for i in wires:
        qml.Hadamard(wires=i)
    for i in wires:
        qml.RZ(x[i], wires=i)
    for i in range(len(wires)):
        for j in range(i + 1, len(wires)):
            qml.CNOT(wires=[wires[i], wires[j]])
            qml.RZ(x[wires[i]] * x[wires[j]], wires=wires[j])
            qml.CNOT(wires=[wires[i], wires[j]])

@qml.qnode(dev)
def kernel_circuit(x1, x2):
    """Compute |<0|U(x1)†U(x2)|0>|^2 via the compute/uncompute (adjoint) trick."""
    # Apply encoding for x1
    encoding_circuit(x1, wires=range(n_qubits))
    # Apply adjoint of encoding for x2
    qml.adjoint(encoding_circuit)(x2, wires=range(n_qubits))
    # Probability of all-zeros state
    return qml.probs(wires=range(n_qubits))

def quantum_kernel(x1, x2):
    """Compute the quantum kernel value between two data points."""
    probs = kernel_circuit(x1, x2)
    # Probability of the |00...0> state
    return float(probs[0])

def compute_kernel_matrix(X1, X2):
    """Compute the kernel matrix between two sets of data points."""
    n1, n2 = len(X1), len(X2)
    K = np.zeros((n1, n2))
    for i in range(n1):
        for j in range(n2):
            K[i, j] = quantum_kernel(X1[i], X2[j])
    return K
# Compute kernel matrices
print("Computing training kernel matrix...")
K_train = compute_kernel_matrix(X_train, X_train)
print("Computing test kernel matrix...")
K_test = compute_kernel_matrix(X_test, X_train)
# Train SVM with precomputed quantum kernel
svm = SVC(kernel="precomputed")
svm.fit(K_train, y_train)
svm_preds = svm.predict(K_test)
svm_acc = accuracy_score(y_test, svm_preds)
print(f"\nQuantum Kernel SVM test accuracy: {svm_acc:.3f}")
# Compare with RBF kernel SVM
svm_rbf = SVC(kernel="rbf")
svm_rbf.fit(X_train, y_train)
rbf_preds = svm_rbf.predict(X_test)
rbf_acc = accuracy_score(y_test, rbf_preds)
print(f"Classical RBF SVM test accuracy: {rbf_acc:.3f}")
When to Use Kernel Methods vs. Variational Methods
Quantum kernel methods have a key advantage: they separate the quantum computation (kernel evaluation) from the classical optimization (SVM training). The kernel matrix is computed once, and the SVM solver is guaranteed to find the global optimum. There are no barren plateaus or vanishing gradients.
The trade-off is cost. Computing the full kernel matrix requires O(N²) circuit evaluations for N training samples. For large datasets, this is more expensive than the O(N * epochs) evaluations needed for VQC training with mini-batches.
Use kernel methods when: the dataset is small (under a few hundred samples), you want guaranteed convergence, or you need the kernel for interpretability. Use variational methods when: the dataset is large, you need fast inference (the trained VQC applies one circuit per input), or you want to co-optimize the encoding and classification.
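The cost trade-off above is easy to make concrete. The helper functions and the sample counts below are illustrative (the symmetric training Gram matrix could be roughly halved; gradient evaluations multiply the variational count by a constant factor; both are ignored here):

```python
def kernel_evals(n_train, n_test):
    # Full training Gram matrix plus the test-vs-train block
    return n_train * n_train + n_test * n_train

def vqc_evals(n_train, n_epochs):
    # One forward circuit evaluation per sample per epoch
    return n_train * n_epochs

# 150 training samples, 50 test samples, 50 epochs
print(kernel_evals(150, 50))  # 150^2 + 50*150 = 30000 circuit runs
print(vqc_evals(150, 50))     # 150*50 = 7500 circuit runs
```

The quadratic term dominates quickly: doubling the training set quadruples the kernel cost but only doubles the variational cost.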
Noise Impact on Classification
Real quantum hardware introduces noise through decoherence, gate errors, and measurement errors. Understanding how noise affects VQC performance is critical for deploying on real devices. PennyLane’s mixed-state simulator lets us model these effects.
import matplotlib
matplotlib.use("Agg")
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
n_qubits = 2
n_layers = 3
def create_noisy_circuit(noise_prob):
    """Create a VQC circuit with depolarizing noise after each gate."""
    dev = qml.device("default.mixed", wires=n_qubits)

    @qml.qnode(dev)
    def noisy_circuit(params, x):
        # Angle encoding with noise
        for i in range(n_qubits):
            qml.RY(x[i], wires=i)
            if noise_prob > 0:
                qml.DepolarizingChannel(noise_prob, wires=i)
        # Variational layers with noise after each gate
        for layer in range(n_layers):
            for qubit in range(n_qubits):
                qml.Rot(
                    params[layer, qubit, 0],
                    params[layer, qubit, 1],
                    params[layer, qubit, 2],
                    wires=qubit,
                )
                if noise_prob > 0:
                    qml.DepolarizingChannel(noise_prob, wires=qubit)
            for qubit in range(n_qubits - 1):
                qml.CNOT(wires=[qubit, qubit + 1])
                if noise_prob > 0:
                    qml.DepolarizingChannel(noise_prob, wires=qubit)
                    qml.DepolarizingChannel(noise_prob, wires=qubit + 1)
        return qml.expval(qml.PauliZ(0))

    return noisy_circuit
def sigmoid(z):
    return 1.0 / (1.0 + pnp.exp(-z))

def train_noisy_vqc(noise_prob, n_epochs=40):
    """Train a VQC with a given noise level and return test accuracy."""
    circ = create_noisy_circuit(noise_prob)

    def bce_loss(params, X_batch, y_batch):
        total_loss = 0.0
        for x, label in zip(X_batch, y_batch):
            pred = circ(params, x)
            y_01 = (label + 1) / 2
            prob = sigmoid(pred)
            prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
            total_loss += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
        return total_loss / len(X_batch)

    np.random.seed(42)
    params = pnp.array(
        np.random.uniform(-np.pi / 4, np.pi / 4, size=(n_layers, n_qubits, 3)),
        requires_grad=True,
    )
    opt = qml.AdamOptimizer(stepsize=0.05)
    batch_size = 16
    for epoch in range(n_epochs):
        perm = np.random.permutation(len(X_train))
        X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
        for start in range(0, len(X_train), batch_size):
            X_batch = X_shuf[start : start + batch_size]
            y_batch = y_shuf[start : start + batch_size]
            params, _ = opt.step_and_cost(lambda p: bce_loss(p, X_batch, y_batch), params)
    preds = np.sign(np.array([float(circ(params, x)) for x in X_test]))
    test_acc = np.mean(preds == y_test_pm)
    return test_acc, params
# Train at three noise levels
noise_levels = [0.0, 0.01, 0.05]
results = {}
for p in noise_levels:
    print(f"\nTraining with noise p={p}...")
    acc, trained_params = train_noisy_vqc(p, n_epochs=40)
    results[p] = acc
    print(f" Test accuracy: {acc:.3f}")
print("\n--- Noise Impact Summary ---")
for p, acc in results.items():
    print(f" p={p:.2f}: test accuracy = {acc:.3f}")
Typical observations:
- Noiseless (p=0.0): Highest training accuracy, but may overfit slightly to the training data.
- Mild noise (p=0.01): Accuracy stays close to the noiseless case and may even improve slightly on the test set. The noise acts like regularization, preventing the model from fitting too precisely to training-set idiosyncrasies.
- Strong noise (p=0.05): Accuracy drops noticeably. The depolarizing channel pushes all expectation values toward zero, reducing the circuit’s ability to distinguish classes. The decision boundary becomes fuzzy and less useful.
This pattern, where mild noise helps generalization and strong noise destroys it, is analogous to dropout regularization in classical neural networks. It suggests that near-term noisy quantum devices may be able to produce useful classifiers if the noise is kept below a threshold.
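The "expectation values pushed toward zero" effect can be quantified. For a single-qubit depolarizing channel with parameter p in PennyLane's convention, rho maps to (1-p) rho + (p/3)(X rho X + Y rho Y + Z rho Z), so each application shrinks the Z expectation value by a factor of (1 - 4p/3). A sketch of the cumulative contraction, under the simplifying assumption that k channels act directly on the readout qubit:

```python
def z_contraction(p, k):
    # Each single-qubit depolarizing channel scales <Z> by (1 - 4p/3);
    # k sequential channels compound the shrinkage
    return (1 - 4 * p / 3) ** k

# An ideal signal of <Z> = 0.9 with ~8 noisy gates on the readout path
for p in [0.0, 0.01, 0.05]:
    print(f"p={p}: <Z> shrinks to ~{0.9 * z_contraction(p, 8):.3f}")
```

At p=0.05, eight channels already cut the signal to roughly 58% of its noiseless value, which is why the decision boundary blurs: the sigmoid sees inputs compressed toward zero, and class probabilities crowd around 0.5.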
Hardware-Efficient Ansatz Selection
Choosing the right ansatz for a VQC involves balancing three criteria:
- Expressibility: Can the ansatz represent the target function? A more expressive ansatz can approximate a wider range of decision boundaries, but also has more parameters to train.
- Entanglement capability: How much entanglement can the ansatz create? Entanglement is what gives a quantum circuit power beyond independent single-qubit operations. Insufficient entanglement means the circuit is effectively classical.
- Hardware efficiency: How many native 2-qubit gates does the ansatz require? On real hardware, 2-qubit gates are 10-100x noisier than single-qubit gates. Minimizing their count reduces total error.
The following code compares three ansatze on the moons dataset:
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
n_qubits = 2
n_layers = 3
dev = qml.device("default.qubit", wires=n_qubits)
def sigmoid(z):
    return 1.0 / (1.0 + pnp.exp(-z))

# --- Ansatz 1: Manual CNOT chain + Rot layers ---
@qml.qnode(dev)
def circuit_manual(params, x):
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    for layer in range(n_layers):
        for qubit in range(n_qubits):
            qml.Rot(params[layer, qubit, 0], params[layer, qubit, 1],
                    params[layer, qubit, 2], wires=qubit)
        for qubit in range(n_qubits - 1):
            qml.CNOT(wires=[qubit, qubit + 1])
    return qml.expval(qml.PauliZ(0))

# --- Ansatz 2: StronglyEntanglingLayers ---
@qml.qnode(dev)
def circuit_strong(params, x):
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))

# --- Ansatz 3: SimplifiedTwoDesign ---
@qml.qnode(dev)
def circuit_simple(initial_layer_weights, weights, x):
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    qml.SimplifiedTwoDesign(
        initial_layer_weights=initial_layer_weights,
        weights=weights,
        wires=range(n_qubits),
    )
    return qml.expval(qml.PauliZ(0))
def train_ansatz(circuit_fn, params_init, n_epochs=40, label=""):
    """Train a circuit and return final test accuracy."""
    if isinstance(params_init, tuple):
        # SimplifiedTwoDesign has two parameter groups
        params = tuple(
            pnp.array(p.copy(), requires_grad=True) for p in params_init
        )
    else:
        params = pnp.array(params_init.copy(), requires_grad=True)
    opt = qml.AdamOptimizer(stepsize=0.05)
    batch_size = 16
    for epoch in range(n_epochs):
        perm = np.random.permutation(len(X_train))
        X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
        for start in range(0, len(X_train), batch_size):
            X_batch = X_shuf[start : start + batch_size]
            y_batch = y_shuf[start : start + batch_size]

            def cost(*p):
                total = 0.0
                for x, lab in zip(X_batch, y_batch):
                    pred = circuit_fn(*p, x)
                    y_01 = (lab + 1) / 2
                    prob = sigmoid(pred)
                    prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
                    total += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
                return total / len(X_batch)

            if isinstance(params, tuple):
                # With multiple trainable args, step_and_cost returns
                # (new_args, cost), where new_args holds the updated arrays
                new_params, _ = opt.step_and_cost(cost, *params)
                params = tuple(new_params)
            else:
                params, _ = opt.step_and_cost(cost, params)
    # Evaluate
    if isinstance(params, tuple):
        preds = np.sign(np.array([float(circuit_fn(*params, x)) for x in X_test]))
    else:
        preds = np.sign(np.array([float(circuit_fn(params, x)) for x in X_test]))
    acc = np.mean(preds == y_test_pm)
    print(f"{label}: test accuracy = {acc:.3f}")
    return acc
np.random.seed(42)
# Ansatz 1: Manual
p1 = np.random.uniform(-np.pi / 4, np.pi / 4, size=(n_layers, n_qubits, 3))
manual_params_count = n_layers * n_qubits * 3
manual_cnots = n_layers * (n_qubits - 1)
# Ansatz 2: StronglyEntanglingLayers
shape_sel = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
p2 = np.random.uniform(-np.pi / 4, np.pi / 4, size=shape_sel)
sel_params_count = np.prod(shape_sel)
sel_cnots = n_layers * n_qubits # One CNOT per qubit per layer
# Ansatz 3: SimplifiedTwoDesign (its entanglers are CZ gates, not CNOTs)
init_weights = np.random.uniform(-np.pi / 4, np.pi / 4, size=(n_qubits,))
layer_weights = np.random.uniform(-np.pi / 4, np.pi / 4, size=(n_layers, n_qubits - 1, 2))
simple_params_count = n_qubits + n_layers * (n_qubits - 1) * 2
simple_cnots = n_layers * (n_qubits - 1)  # CZ count per circuit
print("Ansatz Comparison:")
print(f" Manual Rot+CNOT: {manual_params_count} params, {manual_cnots} 2-qubit gates/circuit")
print(f" StronglyEntanglingLayers: {sel_params_count} params, {sel_cnots} 2-qubit gates/circuit")
print(f" SimplifiedTwoDesign: {simple_params_count} params, {simple_cnots} 2-qubit gates/circuit")
print()
acc1 = train_ansatz(circuit_manual, p1, label="Manual Rot+CNOT")
acc2 = train_ansatz(circuit_strong, p2, label="StronglyEntanglingLayers")
acc3 = train_ansatz(circuit_simple, (init_weights, layer_weights), label="SimplifiedTwoDesign")
For a 2-qubit moons classifier, all three ansatze typically achieve similar accuracy (88-93%). The differences become meaningful on larger circuits. The recommendation is to start with the simplest ansatz that achieves acceptable accuracy. If SimplifiedTwoDesign works, there is no reason to use the more parameter-heavy alternatives. Fewer parameters means faster training and reduced risk of barren plateaus.
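The resource gap the comparison hints at grows with qubit count. A small pure-Python helper makes this visible; the counting rules follow the three templates above (3 parameters per qubit per layer for the manual and StronglyEntanglingLayers ansatze, n + 2(n-1) per layer for SimplifiedTwoDesign, and the default one-entangler-per-qubit ring pattern assumed for StronglyEntanglingLayers):

```python
def ansatz_resources(n_qubits, n_layers):
    # (parameter count, two-qubit gate count) for each ansatz
    return {
        "manual": (3 * n_qubits * n_layers, (n_qubits - 1) * n_layers),
        "strongly_entangling": (3 * n_qubits * n_layers, n_qubits * n_layers),
        "simplified_two_design": (
            n_qubits + 2 * (n_qubits - 1) * n_layers,
            (n_qubits - 1) * n_layers,  # CZ entanglers
        ),
    }

# At 8 qubits and 4 layers the parameter counts diverge noticeably
for name, (n_params, n_twoq) in ansatz_resources(8, 4).items():
    print(f"{name}: {n_params} params, {n_twoq} two-qubit gates")
```

At 8 qubits and 4 layers, SimplifiedTwoDesign uses 64 parameters against 96 for the other two, with the same two-qubit gate count as the manual chain, which is exactly the kind of saving that matters on noisy hardware.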
Hyperparameter Tuning
VQCs have several hyperparameters that significantly affect performance. The most important are:
- Number of layers (L): Controls circuit depth and parameter count. Too few layers limits expressibility; too many causes barren plateaus and slow convergence.
- Learning rate: Standard trade-off: too high causes oscillation, too low causes slow convergence.
- Encoding method: Determines the quantum feature space (covered above).
- Batch size: Larger batches give more stable gradients but slower epochs.
The following grid search explores layers and learning rate:
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
n_qubits = 2
def sigmoid(z):
    return 1.0 / (1.0 + pnp.exp(-z))

def run_experiment(n_layers, lr, n_epochs=20):
    """Train VQC with given hyperparameters and return test accuracy."""
    dev = qml.device("default.qubit", wires=n_qubits)

    @qml.qnode(dev)
    def circ(params, x):
        for i in range(n_qubits):
            qml.RY(x[i], wires=i)
        qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
        return qml.expval(qml.PauliZ(0))

    shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
    np.random.seed(42)
    params = pnp.array(
        np.random.uniform(-np.pi / 4, np.pi / 4, size=shape), requires_grad=True
    )
    opt = qml.AdamOptimizer(stepsize=lr)
    batch_size = 16
    for epoch in range(n_epochs):
        perm = np.random.permutation(len(X_train))
        X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
        for start in range(0, len(X_train), batch_size):
            X_batch = X_shuf[start : start + batch_size]
            y_batch = y_shuf[start : start + batch_size]

            def cost(p):
                total = 0.0
                for x, lab in zip(X_batch, y_batch):
                    pred = circ(p, x)
                    y_01 = (lab + 1) / 2
                    prob = sigmoid(pred)
                    prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
                    total += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
                return total / len(X_batch)

            params, _ = opt.step_and_cost(cost, params)
    preds = np.sign(np.array([float(circ(params, x)) for x in X_test]))
    return np.mean(preds == y_test_pm)
# Grid search
layers_options = [1, 2, 3, 4]
lr_options = [0.01, 0.05, 0.1]
print("Hyperparameter Grid Search (20 epochs each)")
print("-" * 55)
print(f"{'Layers':<10} {'LR=0.01':<15} {'LR=0.05':<15} {'LR=0.1':<15}")
print("-" * 55)
results = {}
for n_layers in layers_options:
    row = []
    for lr in lr_options:
        acc = run_experiment(n_layers, lr, n_epochs=20)
        results[(n_layers, lr)] = acc
        row.append(f"{acc:.3f}")
    print(f"{n_layers:<10} {' '.join(f'{r:<13}' for r in row)}")
print("-" * 55)
# Find best combination
best_key = max(results, key=results.get)
print(f"\nBest: {best_key[0]} layers, lr={best_key[1]} -> accuracy={results[best_key]:.3f}")
For the moons dataset, 2-3 layers with a learning rate of 0.05 is typically optimal. Adding a 4th layer does not improve accuracy and can hurt convergence because the additional parameters make the loss landscape harder to navigate. A learning rate of 0.01 is too slow for 20 epochs, while 0.1 can cause instability.
Common Mistakes
Building VQCs involves several subtle pitfalls. This section covers the most common ones and how to avoid them.
1. Using regular NumPy instead of PennyLane NumPy
PennyLane’s automatic differentiation requires that all differentiable operations use pennylane.numpy (aliased as pnp). If you use standard numpy for parameter arrays, PennyLane cannot compute gradients.
# WRONG: standard numpy breaks automatic differentiation
import numpy as np
params = np.array([0.1, 0.2, 0.3]) # No gradient tracking
# CORRECT: use pennylane.numpy with requires_grad=True
from pennylane import numpy as pnp
params = pnp.array([0.1, 0.2, 0.3], requires_grad=True)
Standard numpy is fine for non-differentiable operations like data loading, indexing, and post-processing predictions. The rule is: any array that the optimizer needs to differentiate through must be a pnp array with requires_grad=True.
2. Not scaling data before angle encoding
Angle encoding applies RY(x_i), which has a period of 2π. If features are in wildly different ranges (for example, one feature ranges from 0 to 1000), the rotation wraps around many times and the encoding loses meaningful structure. Always scale features to a range like [0, π] or [-π, π] before encoding.
# WRONG: raw features can be outside the useful range
qml.RY(raw_feature, wires=0) # If raw_feature = 500, this wraps ~80 times
# CORRECT: scale first
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
qml.RY(X_scaled[0, 0], wires=0) # Now in [0, pi], meaningful range
3. Mismatched loss function and label format
Binary cross-entropy (BCE) expects labels in {0, 1} and probabilities in (0, 1). The Pauli-Z expectation value gives outputs in [-1, +1]. You need to be consistent about which format you use. There are two valid approaches:
Approach A: Use {-1, +1} labels with MSE loss and raw expectation values.
# Approach A: MSE with {-1, +1} labels
y_pm = 2 * y - 1 # Labels in {-1, +1}
loss = pnp.mean((circuit_output - y_pm) ** 2)
Approach B: Use {0, 1} labels with BCE loss and sigmoid-transformed expectation values.
# Approach B: BCE with {0, 1} labels
y_01 = (y_pm + 1) / 2 # Convert to {0, 1}
prob = sigmoid(circuit_output) # Map [-1, +1] to (0, 1)
loss = -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
Mixing these (for example, using BCE with {-1, +1} labels) produces nonsensical gradients.
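The failure mode is easy to verify numerically. Plugging a {-1, +1} label into the BCE formula produces a "loss" that is unbounded below, so minimizing it is meaningless. A small check with made-up probability values:

```python
import math

def bce(y, prob):
    # Binary cross-entropy for a single sample; valid only for y in {0, 1}
    return -(y * math.log(prob) + (1 - y) * math.log(1 - prob))

# Valid usage: label 1 with a confident correct prediction -> small loss
print(bce(1, 0.9))

# Invalid usage: a {-1, +1} label fed to BCE. The (1 - y) factor
# becomes 2, the result can go negative, and it diverges to -inf as
# prob -> 0, so the optimizer chases a bottomless objective.
print(bce(-1, 0.1))
```

The second call returns a negative number, which is the telltale sign: cross-entropy of a true probability distribution is never negative.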
4. Forgetting requires_grad=True
In recent PennyLane versions, pnp.array actually defaults to requires_grad=True, but relying on the default is fragile: converting a parameter array through plain numpy (np.array, np.copy) silently drops gradient tracking, and data arrays should be explicitly non-trainable so the optimizer does not try to update them. Being explicit on both sides removes the ambiguity.
# AMBIGUOUS: trainability depends on defaults and on how the array was built
params = pnp.array(np.random.randn(3, 2, 3))
# EXPLICIT: mark parameters as differentiable and data as fixed
params = pnp.array(np.random.randn(3, 2, 3), requires_grad=True)
features = pnp.array(X_train, requires_grad=False)
If the parameters end up with requires_grad=False, the optimizer sees nothing to train. A common symptom of this mistake is that the loss stays constant across all epochs.
5. Too many qubits for too little data
More qubits means a larger Hilbert space, which sounds good for expressibility. But the barren plateau phenomenon means that gradient variance decreases exponentially with qubit count. For a small 2-feature dataset like moons, using 8 qubits (with redundant encoding) makes the circuit much harder to train without any benefit.
The rule of thumb: use the minimum number of qubits needed for the encoding. For angle encoding, that is one qubit per feature. For amplitude encoding, that is log₂(features) qubits. Adding extra “ancilla” qubits for expressibility is rarely worth the barren plateau cost.
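The rule of thumb above can be written down directly. A minimal sketch (the function name and the two encoding labels are illustrative, not a PennyLane API):

```python
import math

def qubits_needed(n_features, encoding):
    # Angle encoding: one rotation, hence one qubit, per feature
    # Amplitude encoding: features packed into 2^n amplitudes
    if encoding == "angle":
        return n_features
    if encoding == "amplitude":
        return math.ceil(math.log2(n_features))
    raise ValueError(f"unknown encoding: {encoding}")

print(qubits_needed(2, "angle"))       # moons dataset: 2 qubits
print(qubits_needed(64, "amplitude"))  # 64 features fit in 6 qubits
```

The ceiling handles feature counts that are not powers of two; the leftover amplitudes are simply padded with zeros before normalization.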
6. Measuring all qubits when only one is needed
For binary classification, you only need one expectation value. Measuring all qubits and trying to combine them adds complexity without clear benefit, and it discards the entanglement structure that the circuit created.
# WRONG (for binary classification): measuring everything
return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]
# CORRECT (for binary classification): measure the readout qubit
return qml.expval(qml.PauliZ(0))
For multi-class classification (k classes), you do measure k qubits, one per class. But for binary problems, stick with one qubit.
Interpretation and Limitations
The VQC here uses n_layers * n_qubits * 3 = 18 parameters, a tiny model. Classical neural networks with the same parameter count would perform comparably or better. The purpose of this tutorial is not to claim quantum advantage but to understand the mechanics.
Key takeaways:
- Angle encoding is a direct and natural interface between classical data and qubit rotations.
- Entangling layers are what give the circuit expressive power beyond independent single-qubit operations.
- Gradient computation relies on PennyLane's automatic differentiation. On the default.qubit simulator this is typically backpropagation; on hardware (or with diff_method="parameter-shift") PennyLane uses the parameter-shift rule, which obtains exact gradients by evaluating the circuit at shifted parameter values, with no finite differences.
- Barren plateaus (exponentially small gradients for large circuits) are a known challenge for VQCs on more qubits. Starting with shallow circuits and careful initialization helps.
- Quantum kernel methods offer an alternative that avoids variational training entirely but require O(N²) circuit evaluations for the kernel matrix.
- Noise at mild levels can act as regularization, but strong noise destroys classification ability.
- The choice of ansatz, encoding, and hyperparameters all matter, and systematic exploration is essential.
For real quantum advantage in classification you would look at quantum kernel methods where the quantum circuit defines a kernel that may be classically hard to evaluate. That is the direction the research community is actively pursuing.