Introduction to Quantum Machine Learning
A conceptual and practical introduction to quantum machine learning: what QML is, data encoding strategies, parameterized quantum circuits, and a complete classification example.
What Quantum Machine Learning Is
Quantum machine learning (QML) applies quantum circuits as trainable function approximators, analogous to neural networks, but implemented on quantum hardware. Where a classical neural network transforms inputs through layers of weighted linear operations and nonlinearities, a parameterized quantum circuit transforms an encoded input through a sequence of gates with tunable angles.
The fundamental appeal is that quantum circuits operate in exponentially large Hilbert spaces. In principle, a circuit on n qubits can represent correlations across 2^n dimensions with only polynomial circuit depth. Whether that theoretical advantage translates into practical learning benefits on real-world data is an open and contested question.
Three Types of QML
QML research falls into three distinct categories with very different prospects for quantum advantage:
Quantum-native data: data produced by quantum systems (molecular simulations, quantum sensor readouts, quantum communication protocols). Here a quantum circuit processes data that is already quantum, avoiding the encoding overhead entirely. This is the most promising setting for near-term quantum advantage.
Classical data on quantum hardware: the most common QML experiment today. Classical data (images, text, tabular) is encoded into quantum states and processed by a quantum circuit. The encoding cost is significant and erases most theoretical advantages.
Quantum-inspired classical algorithms: classical algorithms redesigned by studying quantum linear algebra. These run entirely on classical hardware but borrow ideas from quantum computing. They do not require quantum hardware at all.
This tutorial focuses on the second category: encoding classical data and training a quantum classifier.
The Quantum Model Space
Before diving into code, it helps to understand mathematically what a QML model computes.
A parameterized quantum circuit defines a function f(x, theta) through three stages:
- Encoding: a unitary S(x) maps the classical input x into a quantum state |phi(x)> = S(x)|0>^n.
- Variational processing: a parameterized unitary U(theta) transforms the encoded state.
- Measurement: an observable O (typically a Pauli operator) produces the scalar output.
The full model output is:
f(x, theta) = <0|^n S(x)^dag U(theta)^dag O U(theta) S(x) |0>^n
Or equivalently, writing |psi(x, theta)> = U(theta) S(x) |0>^n:
f(x, theta) = <psi(x, theta)| O |psi(x, theta)>
This structure maps directly onto the neural network analogy:
Classical NN:              Quantum Circuit:
-------------              ----------------
Input layer      <--->     Encoding S(x)
Hidden layers    <--->     Variational U(theta)
Output neuron    <--->     Measurement <O>
A natural question arises: since n qubits span a Hilbert space of dimension 2^n, does the QML model have 2^n effective parameters? The answer is no. The number of trainable parameters equals the number of rotation angles in U(theta), which scales polynomially with n and the circuit depth. The 2^n-dimensional Hilbert space provides representational capacity (the space of functions the circuit could express), but the parameterization only explores a low-dimensional manifold within that space. Conflating Hilbert space dimension with model capacity is one of the most common misconceptions in QML.
Data Encoding Strategies
How you encode classical data into qubits determines much of the circuit’s behavior. There are four main approaches, each with distinct tradeoffs.
Basis Encoding
Each classical binary string maps to a computational basis state. Requires n qubits for n bits. Efficient in qubit count but rigid; no interpolation between data points.
For example, encoding the binary string “101” into 3 qubits:
import pennylane as qml
import numpy as np
dev_basis = qml.device("default.qubit", wires=3)
@qml.qnode(dev_basis)
def basis_encode_101():
# Encode "101": flip qubits 0 and 2 to represent |101>
qml.PauliX(wires=0) # bit 0 = 1
# qubit 1 stays |0> # bit 1 = 0
qml.PauliX(wires=2) # bit 2 = 1
return qml.state()
state = basis_encode_101()
# The state vector has a 1.0 at index 5 (binary 101), zeros elsewhere
print("State vector:", np.round(state, 4))
print("Index of nonzero entry:", np.argmax(np.abs(state)))
Basis encoding is conceptually simple but only suitable for problems where the input is already binary.
Amplitude Encoding
A vector of 2^n values is encoded as the amplitudes of an n-qubit state. Exponentially compact (encoding 2^n features into n qubits) but requires a complex state preparation circuit that often costs more gates than the learning circuit itself.
dev_amp = qml.device("default.qubit", wires=3)
@qml.qnode(dev_amp)
def amplitude_encode(x):
# Encode an 8-dimensional unit vector into 3 qubits
qml.AmplitudeEmbedding(features=x, wires=range(3), normalize=True)
return qml.state()
# An arbitrary 8-dimensional vector (will be normalized automatically)
features = np.array([0.5, 0.3, 0.1, 0.7, 0.2, 0.4, 0.6, 0.1])
state = amplitude_encode(features)
print("Encoded state:", np.round(state, 4))
print("Sum of |amplitudes|^2:", np.round(np.sum(np.abs(state)**2), 6))
The normalize=True flag handles normalization for you, but be aware that the circuit depth required for arbitrary amplitude preparation scales as O(2^n), which partially negates the compression advantage.
Angle Encoding
Each feature x_i is mapped to a rotation angle. For n features you use n qubits with RY(x_i) gates. Simple, hardware-friendly, and widely used in practice.
n_qubits = 4
dev_angle = qml.device("default.qubit", wires=n_qubits)
def angle_encoding_ry(x):
"""Standard angle encoding with RY gates."""
for i in range(n_qubits):
qml.RY(x[i], wires=i)
def angle_encoding_rx_rz(x):
"""Alternate encoding: use RX for the first half, RZ for the second half.
This places features on different rotation axes, increasing diversity."""
half = len(x) // 2
for i in range(half):
qml.RX(x[i], wires=i)
for i in range(half, len(x)):
qml.RZ(x[i], wires=i)
After RY encoding, qubit i is in state cos(x_i/2)|0> + sin(x_i/2)|1>. Features should be scaled to [0, pi] or [-pi, pi] before encoding to use the full rotation range.
An important subtlety: rotation gates are periodic with period 2*pi. If your feature values span a range much larger than 2*pi, distinct inputs will map to the same quantum state (aliasing). If the range is much smaller, you use only a tiny portion of the Bloch sphere, reducing the model’s discriminative power. Proper feature scaling is not optional.
IQP Encoding
IQP (Instantaneous Quantum Polynomial) encoding uses diagonal gates to create correlations between features. The circuit structure alternates between single-qubit RZ rotations and two-qubit controlled-RZ gates:
dev_iqp = qml.device("default.qubit", wires=4)
@qml.qnode(dev_iqp)
def iqp_encode(x):
# IQP encoding: Hadamard, then RZ + controlled-RZ, then Hadamard again
# PennyLane's IQPEmbedding handles this pattern automatically
qml.IQPEmbedding(features=x, wires=range(4), n_repeats=2)
return qml.state()
features = np.array([0.5, 1.2, 0.8, 2.1])
state = iqp_encode(features)
print("IQP-encoded state:", np.round(state[:8], 4), "...")
The circuit applies Hadamard gates to all qubits, then RZ(x_i) on each qubit and controlled-RZ(x_i * x_j) on pairs, then Hadamard gates again. The n_repeats parameter controls how many times this block is repeated. The theoretical motivation is that classically simulating IQP circuits is believed to be hard (under plausible complexity assumptions), which suggests the resulting feature map may create quantum states that are classically intractable to reproduce. However, hardness of simulation does not automatically guarantee useful classification performance.
For most beginner and intermediate QML experiments, angle encoding is the right starting point.
Parameterized Quantum Circuits
A parameterized quantum circuit (PQC) has trainable rotation angles, just like weights in a neural network. A typical variational layer combines single-qubit rotations with entangling gates:
def variational_layer(weights, layer_idx):
for i in range(n_qubits):
qml.RY(weights[layer_idx, i, 0], wires=i)
qml.RZ(weights[layer_idx, i, 1], wires=i)
for i in range(n_qubits - 1):
qml.CNOT(wires=[i, i + 1])
# Ring entanglement
qml.CNOT(wires=[n_qubits - 1, 0])
Stacking multiple variational layers increases the circuit’s expressive power, though it also increases the barren plateau risk.
Hardware-Efficient Ansatz Design
The term “hardware-efficient” means the circuit respects the physical constraints of real quantum hardware: shallow depth (few layers), only native two-qubit gates (CNOT or CZ depending on the device), and a connectivity pattern that matches the chip topology. A circuit that requires all-to-all qubit connectivity looks elegant on paper but compiles to many SWAP gates on a device with nearest-neighbor connectivity, inflating depth and noise.
PennyLane provides several built-in ansatz templates.
SimplifiedTwoDesign
This template alternates single-qubit rotation layers with controlled-Z entanglers. It is designed to approximate a unitary 2-design (a distribution that mimics Haar-random unitaries up to the second moment), which makes it a good default for expressibility studies.
n_qubits = 4
n_layers = 3
dev_s2d = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev_s2d)
def simplified_two_design_circuit(initial_layer_weights, weights):
qml.SimplifiedTwoDesign(
initial_layer_weights=initial_layer_weights,
weights=weights,
wires=range(n_qubits)
)
return qml.expval(qml.PauliZ(0))
# Initial layer: one RY rotation per qubit
init_weights = np.random.uniform(0, 2 * np.pi, (n_qubits,))
# Variational layers: each layer has (n_qubits - 1) pairs, 2 parameters each
layer_weights = np.random.uniform(0, 2 * np.pi, (n_layers, n_qubits - 1, 2))
result = simplified_two_design_circuit(init_weights, layer_weights)
print(f"SimplifiedTwoDesign output: {result:.4f}")
StronglyEntanglingLayers
This is the most commonly used template in PennyLane tutorials. Each layer applies three rotations (Rot gate: RZ, RY, RZ) to every qubit, followed by CNOT entanglers with a configurable connectivity pattern that shifts across layers.
import pennylane as qml
import pennylane.numpy as pnp
n_qubits = 4
n_layers = 3
dev_sel = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev_sel)
def strongly_entangling_circuit(weights):
qml.StronglyEntanglingLayers(weights=weights, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
# Shape: (n_layers, n_qubits, 3) for the three rotation angles per qubit per layer
sel_weights = pnp.random.uniform(
-np.pi, np.pi, (n_layers, n_qubits, 3), requires_grad=True
)
result = strongly_entangling_circuit(sel_weights)
print(f"StronglyEntanglingLayers output: {result:.4f}")
print(f"Total trainable parameters: {n_layers * n_qubits * 3}")
Custom Hardware-Efficient Ansatz for Linear Topology
If your qubits are connected in a line (0-1-2-3), you should only place CNOT gates between adjacent pairs. This avoids SWAP overhead:
n_qubits = 4
dev_custom = qml.device("default.qubit", wires=n_qubits)
def custom_linear_ansatz(weights, n_layers):
"""Hardware-efficient ansatz for linear qubit connectivity 0-1-2-3.
Each layer: RY + RZ on each qubit, then CNOTs on adjacent pairs only."""
for layer in range(n_layers):
# Single-qubit rotations
for q in range(n_qubits):
qml.RY(weights[layer, q, 0], wires=q)
qml.RZ(weights[layer, q, 1], wires=q)
# Entangling gates: only adjacent pairs (linear topology)
for q in range(n_qubits - 1):
qml.CNOT(wires=[q, q + 1])
@qml.qnode(dev_custom)
def custom_circuit(weights):
n_layers = weights.shape[0]
custom_linear_ansatz(weights, n_layers)
return qml.expval(qml.PauliZ(0))
custom_weights = np.random.uniform(-np.pi, np.pi, (3, n_qubits, 2))
result = custom_circuit(custom_weights)
print(f"Custom linear ansatz output: {result:.4f}")
This circuit has 3 layers, 4 qubits, 2 parameters per qubit per layer, giving 24 total parameters and a circuit depth that stays manageable on real hardware.
A Complete QML Classifier
We train a binary classifier on the breast cancer dataset from scikit-learn, using PCA to reduce 30 features to 4, then angle encoding into 4 qubits.
Prepare the Data
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import pennylane.numpy as pnp
import numpy as np
n_qubits = 4
data = load_breast_cancer()
X, y = data.data, data.target
# PCA to 4 features
pca = PCA(n_components=n_qubits)
X_pca = pca.fit_transform(X)
# Scale to [0, pi] for angle encoding
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X_pca)
# Convert labels: {0, 1} -> {-1, +1}
y_pm = 2 * y - 1
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y_pm, test_size=0.2, random_state=42
)
Define the QNode
n_layers = 2
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit(x, weights):
# Angle encoding
for i in range(n_qubits):
qml.RY(x[i], wires=i)
# Variational layers
for l in range(n_layers):
variational_layer(weights, l)
# Measure Z on qubit 0 as the classifier output
return qml.expval(qml.PauliZ(0))
Train with Adam
def loss(weights, X_batch, y_batch):
predictions = pnp.array([circuit(x, weights) for x in X_batch])
# Squared-error loss on +/-1 labels (a simple differentiable surrogate for classification loss)
return pnp.mean((predictions - y_batch) ** 2)
# Initialize weights using pennylane.numpy so gradients are tracked
weights = pnp.random.uniform(-np.pi, np.pi, (n_layers, n_qubits, 2), requires_grad=True)
opt = qml.AdamOptimizer(stepsize=0.05)
batch_size = 16
for epoch in range(30):
idx = np.random.choice(len(X_train), batch_size, replace=False)
X_b, y_b = X_train[idx], y_train[idx]
weights, current_loss = opt.step_and_cost(
lambda w: loss(w, X_b, y_b), weights
)
if (epoch + 1) % 10 == 0:
train_preds = np.sign([circuit(x, weights) for x in X_train])
acc = np.mean(train_preds == y_train)
print(f"Epoch {epoch+1:3d} Loss: {current_loss:.4f} Train acc: {acc:.3f}")
Evaluate
test_preds = np.sign([circuit(x, weights) for x in X_test])
test_acc = np.mean(test_preds == y_test)
print(f"Test accuracy: {test_acc:.3f}")
On 4 qubits with 2 layers and this dataset, you should see test accuracy around 0.82-0.90. A classical logistic regression on the same 4 PCA features typically achieves 0.92-0.95, so the quantum circuit is competitive but not superior at this scale.
The Parameter-Shift Rule
One of the most elegant aspects of quantum computing for machine learning is that gradients of quantum circuits can be computed exactly, not approximately. This is the parameter-shift rule.
Mathematical Foundation
Consider a quantum gate G(theta) = exp(-i * theta * P / 2) where P is a Pauli generator (a Hermitian matrix with eigenvalues +/- 1). The expectation value of an observable depends on theta through the circuit. The gradient with respect to theta is:
df/d_theta = (1/2) * [ f(theta + pi/2) - f(theta - pi/2) ]
where f(theta) denotes the circuit's expectation value as a function of the parameter. This formula is exact. It requires evaluating the circuit at only two shifted parameter values per gradient component. Unlike classical finite differences ([f(theta + epsilon) - f(theta)] / epsilon), which introduce approximation error proportional to epsilon, the parameter-shift rule yields the true analytical gradient. (A more general form allows any shift s with a 1/(2 sin s) prefactor; s = pi/2 gives the simple 1/2 coefficient above.)
The proof follows from the structure of the rotation gate. Since exp(-i * theta * P / 2) is a linear combination of cos(theta/2) * I and -i * sin(theta/2) * P, the expectation value is sinusoidal in theta, and the derivative of a sinusoid can be expressed as the difference of two shifted evaluations.
Verification in Code
import pennylane as qml
import pennylane.numpy as pnp
import numpy as np
dev_ps = qml.device("default.qubit", wires=1)
@qml.qnode(dev_ps)
def simple_circuit(theta):
qml.RY(theta, wires=0)
return qml.expval(qml.PauliZ(0))
theta_val = pnp.array(0.7, requires_grad=True)
# Method 1: PennyLane's automatic gradient (uses parameter-shift internally)
grad_auto = qml.grad(simple_circuit)(theta_val)
# Method 2: Manual parameter-shift rule
shift = np.pi / 2
grad_manual = 0.5 * (simple_circuit(theta_val + shift) - simple_circuit(theta_val - shift))
print(f"Automatic gradient: {float(grad_auto):.8f}")
print(f"Manual parameter-shift: {float(grad_manual):.8f}")
print(f"Difference: {abs(float(grad_auto) - float(grad_manual)):.2e}")
# The two values match to machine precision
The parameter-shift rule extends to gates with more general generators, though the formula becomes more complex (requiring more shift terms). PennyLane handles this automatically when you use qml.grad or qml.jacobian.
Quantum Kernel Methods
An alternative to the variational classifier approach is to use quantum circuits as kernel functions. Instead of training circuit parameters, you use the quantum feature map to define a similarity measure between data points, then hand the resulting kernel matrix to a classical SVM.
What Is a Quantum Kernel?
Given a feature map S(x) that encodes data point x into a quantum state |phi(x)> = S(x)|0>, the quantum kernel between two data points is:
K(x_i, x_j) = |<phi(x_i)|phi(x_j)>|^2
This is the fidelity (overlap squared) between the two encoded quantum states. If two inputs produce similar quantum states, their kernel value is close to 1. If the states are nearly orthogonal, the kernel is close to 0.
The key insight is that this kernel operates in the 2^n-dimensional Hilbert space without explicitly computing in that space. Computing K(x_i, x_j) on a quantum computer requires only polynomial resources, but evaluating the same kernel classically could require exponential resources if the feature map is sufficiently complex.
Computing the Kernel Matrix
import pennylane as qml
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
n_qubits = 4
dev_kernel = qml.device("default.qubit", wires=n_qubits)
def kernel_feature_map(x):
"""Angle encoding followed by entangling layer for richer feature map."""
for i in range(n_qubits):
qml.RY(x[i], wires=i)
for i in range(n_qubits - 1):
qml.CNOT(wires=[i, i + 1])
for i in range(n_qubits):
qml.RZ(x[i], wires=i)
@qml.qnode(dev_kernel)
def kernel_circuit(x1, x2):
"""Compute |<phi(x1)|phi(x2)>|^2 using the swap test alternative:
apply S(x1), then S(x2)^dag, then measure probability of |0...0>."""
kernel_feature_map(x1)
qml.adjoint(kernel_feature_map)(x2)
return qml.probs(wires=range(n_qubits))
def quantum_kernel(x1, x2):
"""Return the kernel value: probability of measuring all zeros."""
probs = kernel_circuit(x1, x2)
return probs[0] # |0...0> probability
def compute_kernel_matrix(X_a, X_b):
"""Compute the kernel matrix K[i, j] = quantum_kernel(X_a[i], X_b[j])."""
n_a, n_b = len(X_a), len(X_b)
K = np.zeros((n_a, n_b))
for i in range(n_a):
for j in range(n_b):
K[i, j] = quantum_kernel(X_a[i], X_b[j])
return K
# Prepare data (same pipeline as before)
data = load_breast_cancer()
X, y = data.data, data.target
pca = PCA(n_components=n_qubits)
X_pca = pca.fit_transform(X)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X_pca)
# Use a small subset for speed (kernel matrix is O(n^2) in dataset size)
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, random_state=42
)
X_train_small = X_train[:80]
y_train_small = y_train[:80]
X_test_small = X_test[:40]
y_test_small = y_test[:40]
# Compute kernel matrices
K_train = compute_kernel_matrix(X_train_small, X_train_small)
K_test = compute_kernel_matrix(X_test_small, X_train_small)
# Train a classical SVM with the quantum kernel
svm = SVC(kernel="precomputed")
svm.fit(K_train, y_train_small)
y_pred = svm.predict(K_test)
kernel_acc = accuracy_score(y_test_small, y_pred)
print(f"Quantum kernel SVM test accuracy: {kernel_acc:.3f}")
The quantum kernel approach has an advantage over the variational classifier: there are no barren plateaus because there are no quantum parameters to train. The downside is the O(n^2) cost of computing all pairwise kernel values, which becomes expensive for large datasets.
Expressibility and Entanglement Capacity
When choosing an ansatz, it helps to quantify how expressive it is and how much entanglement it generates. Two metrics are commonly used.
Expressibility
Expressibility measures how uniformly the ansatz samples from the space of all possible unitaries (the Haar measure). A highly expressive ansatz can reach states distributed uniformly across the Hilbert space. A low-expressibility ansatz remains confined to a small subspace of states.
The standard approach (Sim et al. 2019) compares the distribution of state fidelities generated by the ansatz to the Haar-random distribution. The KL divergence between the two distributions quantifies expressibility.
import pennylane as qml
import numpy as np
n_qubits = 4
n_layers = 2
dev_expr = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev_expr)
def expressibility_circuit(weights):
qml.StronglyEntanglingLayers(weights=weights, wires=range(n_qubits))
return qml.state()
def estimate_expressibility(n_samples=500):
"""Estimate expressibility by sampling state overlaps."""
fidelities = []
shape = (n_layers, n_qubits, 3)
for _ in range(n_samples):
# Sample two random parameter sets
w1 = np.random.uniform(0, 2 * np.pi, shape)
w2 = np.random.uniform(0, 2 * np.pi, shape)
state1 = expressibility_circuit(w1)
state2 = expressibility_circuit(w2)
# Fidelity = |<psi1|psi2>|^2
fidelity = np.abs(np.dot(np.conj(state1), state2)) ** 2
fidelities.append(fidelity)
return np.array(fidelities)
fidelities = estimate_expressibility(n_samples=300)
print(f"Mean fidelity: {np.mean(fidelities):.4f}")
print(f"Std fidelity: {np.std(fidelities):.4f}")
# For a Haar-random distribution on 2^4 = 16 dimensions,
# the expected mean fidelity is 1/16 = 0.0625.
# A highly expressive ansatz will produce values close to this.
print(f"Haar-random expected mean: {1 / 2**n_qubits:.4f}")
Entanglement Capacity
Entanglement capacity measures how much entanglement the ansatz generates across qubits. The von Neumann entropy of a subsystem quantifies this:
dev_ent = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev_ent)
def entanglement_circuit(weights):
qml.StronglyEntanglingLayers(weights=weights, wires=range(n_qubits))
return qml.vn_entropy(wires=[0, 1]) # entropy of the first 2 qubits
def estimate_entanglement_capacity(n_samples=200):
"""Average von Neumann entropy over random parameter samples."""
entropies = []
shape = (n_layers, n_qubits, 3)
for _ in range(n_samples):
w = np.random.uniform(0, 2 * np.pi, shape)
entropy = entanglement_circuit(w)
entropies.append(float(entropy))
return np.array(entropies)
entropies = estimate_entanglement_capacity(n_samples=200)
print(f"Mean entanglement entropy: {np.mean(entropies):.4f}")
print(f"Max possible (2 qubits): {np.log(2**2):.4f}")
# High mean entropy relative to the maximum indicates the ansatz
# generates significant entanglement across the bipartition.
An ansatz that is both highly expressive and highly entangling is powerful but may be harder to train (see barren plateaus below).
Barren Plateaus: The Core Challenge
As you scale up the number of qubits or layers, gradients of the loss function with respect to circuit parameters shrink exponentially. This is the barren plateau problem:
Var[dL/d_theta] ~ O(1 / 2^n)
For 20 qubits, the gradient variance is roughly 65,000 times smaller than for 4 qubits (a factor of 2^16). Training becomes effectively impossible without exponentially more shots to estimate gradients accurately.
Empirical Evidence
The following code measures gradient variance as a function of qubit count and prints the results as a table. The exponential decay is visible even at small scales:
import pennylane as qml
import numpy as np
def measure_gradient_variance(n_qubits, n_layers=2, n_samples=200):
"""Measure variance of dL/d_theta_0 for a random circuit."""
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def random_circuit(weights):
qml.StronglyEntanglingLayers(weights=weights, wires=range(n_qubits))
# Global cost function: measure Z on all qubits
return qml.expval(
qml.prod(*[qml.PauliZ(i) for i in range(n_qubits)])
)
grad_fn = qml.grad(random_circuit)
shape = (n_layers, n_qubits, 3)
grads = []
for _ in range(n_samples):
w = np.random.uniform(0, 2 * np.pi, shape)
w_pnp = qml.numpy.array(w, requires_grad=True)
g = grad_fn(w_pnp)
# Take gradient of the first parameter
grads.append(float(g[0, 0, 0]))
return np.var(grads)
print(f"{'Qubits':>8} | {'Grad Variance':>15} | {'1/2^n':>12}")
print("-" * 42)
for n in range(2, 11):
var = measure_gradient_variance(n, n_layers=2, n_samples=200)
theoretical = 1.0 / 2**n
print(f"{n:>8} | {var:>15.8f} | {theoretical:>12.8f}")
You will see that the measured variance drops roughly in proportion to 1/2^n, confirming the barren plateau scaling.
Four Mitigation Strategies
1. Local cost functions: instead of measuring all qubits (global cost), measure only one or two qubits near the parameter of interest. This slows the exponential decay of gradient variance from O(1/2^n) to a more favorable polynomial scaling for shallow circuits.
2. Layer-by-layer training: train the first variational layer while keeping the rest fixed, then progressively unfreeze deeper layers. This avoids random initialization in the full parameter space, where barren plateaus are most severe.
3. Identity initialization: initialize parameters so that each variational layer acts as the identity (all angles set to zero or to values that make the layer an identity up to a global phase). Training starts near a known point and gradually moves away, avoiding the flat landscape of random initialization.
4. Quantum natural gradient: the standard gradient treats all parameter directions equally, but the quantum state space has a non-Euclidean geometry described by the Fubini-Study metric. The quantum natural gradient rescales the gradient by the inverse of this metric tensor, giving larger effective steps in directions where the landscape is flat. PennyLane implements this via qml.QNGOptimizer.
Data Re-uploading
Standard angle encoding applies the input data once at the beginning of the circuit. The data re-uploading technique (Perez-Salinas et al. 2020) interleaves data encoding with variational layers, so the circuit “sees” the input multiple times at different depths.
This dramatically increases expressiveness because the circuit becomes a composition of multiple data-dependent unitaries. Mathematically, instead of U(theta) S(x) |0>, the state becomes U_L(theta_L) S(x) ... U_1(theta_1) S(x) |0>. Each re-upload effectively introduces a new “Fourier frequency” in the model’s representation of the input, making the function approximation more powerful.
Re-uploading Circuit
import pennylane as qml
import pennylane.numpy as pnp
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
n_qubits = 4
n_layers = 3
dev_reup = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev_reup)
def reupload_circuit(x, weights):
for layer in range(n_layers):
# Re-encode the input at every layer
for i in range(n_qubits):
qml.RY(x[i], wires=i)
# Variational block
for i in range(n_qubits):
qml.RY(weights[layer, i, 0], wires=i)
qml.RZ(weights[layer, i, 1], wires=i)
for i in range(n_qubits - 1):
qml.CNOT(wires=[i, i + 1])
qml.CNOT(wires=[n_qubits - 1, 0])
return qml.expval(qml.PauliZ(0))
@qml.qnode(dev_reup)
def no_reupload_circuit(x, weights):
# Encode input only once
for i in range(n_qubits):
qml.RY(x[i], wires=i)
for layer in range(n_layers):
for i in range(n_qubits):
qml.RY(weights[layer, i, 0], wires=i)
qml.RZ(weights[layer, i, 1], wires=i)
for i in range(n_qubits - 1):
qml.CNOT(wires=[i, i + 1])
qml.CNOT(wires=[n_qubits - 1, 0])
return qml.expval(qml.PauliZ(0))
# Prepare data
data = load_breast_cancer()
X, y = data.data, data.target
pca = PCA(n_components=n_qubits)
X_pca = pca.fit_transform(X)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X_pca)
y_pm = 2 * y - 1
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y_pm, test_size=0.2, random_state=42
)
def train_and_evaluate(circuit_fn, label):
weights = pnp.random.uniform(
-np.pi, np.pi, (n_layers, n_qubits, 2), requires_grad=True
)
opt = qml.AdamOptimizer(stepsize=0.05)
batch_size = 16
for epoch in range(40):
idx = np.random.choice(len(X_train), batch_size, replace=False)
X_b, y_b = X_train[idx], y_train[idx]
def cost(w):
preds = pnp.array([circuit_fn(x, w) for x in X_b])
return pnp.mean((preds - y_b) ** 2)
weights, _ = opt.step_and_cost(cost, weights)
test_preds = np.sign([circuit_fn(x, weights) for x in X_test])
acc = np.mean(test_preds == y_test)
print(f"{label}: test accuracy = {acc:.3f}")
return acc
acc_reup = train_and_evaluate(reupload_circuit, "With re-uploading ")
acc_no_reup = train_and_evaluate(no_reupload_circuit, "Without re-uploading")
You should observe that the re-uploading version achieves noticeably higher accuracy, especially as you increase the number of layers. The improvement comes from the richer Fourier spectrum of the re-uploading model.
Transfer Learning in QML
Encoding raw high-dimensional data (such as images) directly into qubits is impractical. A 28x28 grayscale image has 784 pixels, requiring either 784 qubits for angle encoding or a 10-qubit amplitude encoding circuit of extreme depth. Neither option is viable.
The practical solution is classical-to-quantum transfer learning: use a pre-trained classical neural network to compress the input into a low-dimensional embedding, then feed that embedding into the quantum circuit.
Architecture
+------------------+ +------------------+ +----------------+
| Pre-trained CNN | | Quantum Circuit | | Classical |
| (ResNet, VGG, | ----> | (4-qubit PQC | ----> | Post-process |
| MobileNet) | | with angle | | (argmax, |
| | | encoding) | | threshold) |
| Input: 224x224x3 | | Input: 4 floats | | Output: class |
| Output: 4 floats | | Output: <Z> | | |
+------------------+ +------------------+ +----------------+
Classical Quantum Classical
(frozen weights) (trainable theta)
The classical CNN (with frozen pre-trained weights) acts as a feature extractor, mapping high-dimensional inputs to a small number of features that capture the essential structure. The quantum circuit then acts as the trainable classifier head. This is practical because:
- The quantum circuit receives only 4-8 features, well within current hardware limits.
- The classical feature extractor handles the hard problem of dimensionality reduction.
- The overall pipeline is end-to-end differentiable if the classical network is implemented in a compatible framework (PyTorch + PennyLane via
qml.qnn.TorchLayer).
This approach is the most realistic path to using quantum circuits for image or text classification tasks today.
Noise Effects on QML Models
Real quantum hardware introduces noise at every gate. Understanding how noise affects QML performance is critical for practical applications.
Simulating Noisy Circuits
import pennylane as qml
import pennylane.numpy as pnp
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
n_qubits = 4
n_layers = 2
# Ideal (noiseless) device
dev_ideal = qml.device("default.qubit", wires=n_qubits)
# Noisy device with depolarizing noise
dev_noisy = qml.device("default.mixed", wires=n_qubits)
def encoding_and_layers(x, weights):
    """Shared circuit logic for both ideal and noisy versions."""
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    for layer in range(n_layers):
        for i in range(n_qubits):
            qml.RY(weights[layer, i, 0], wires=i)
            qml.RZ(weights[layer, i, 1], wires=i)
        for i in range(n_qubits - 1):
            qml.CNOT(wires=[i, i + 1])
        qml.CNOT(wires=[n_qubits - 1, 0])

@qml.qnode(dev_ideal)
def ideal_circuit(x, weights):
    encoding_and_layers(x, weights)
    return qml.expval(qml.PauliZ(0))
@qml.qnode(dev_noisy)
def noisy_circuit(x, weights):
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    for layer in range(n_layers):
        for i in range(n_qubits):
            qml.RY(weights[layer, i, 0], wires=i)
            qml.DepolarizingChannel(0.01, wires=i)
            qml.RZ(weights[layer, i, 1], wires=i)
            qml.DepolarizingChannel(0.01, wires=i)
        for i in range(n_qubits - 1):
            qml.CNOT(wires=[i, i + 1])
            # Two-qubit gates are noisier in practice: noise hits both wires
            qml.DepolarizingChannel(0.01, wires=i)
            qml.DepolarizingChannel(0.01, wires=i + 1)
        qml.CNOT(wires=[n_qubits - 1, 0])
        qml.DepolarizingChannel(0.01, wires=n_qubits - 1)
        qml.DepolarizingChannel(0.01, wires=0)
    return qml.expval(qml.PauliZ(0))
# Prepare data
data = load_breast_cancer()
X, y = data.data, data.target
pca = PCA(n_components=n_qubits)
X_pca = pca.fit_transform(X)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X_pca)
y_pm = 2 * y - 1
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y_pm, test_size=0.2, random_state=42
)
# Train on ideal, evaluate on both
weights = pnp.random.uniform(-np.pi, np.pi, (n_layers, n_qubits, 2), requires_grad=True)
opt = qml.AdamOptimizer(stepsize=0.05)
for epoch in range(30):
    idx = np.random.choice(len(X_train), 16, replace=False)
    X_b, y_b = X_train[idx], y_train[idx]

    def cost(w):
        preds = pnp.array([ideal_circuit(x, w) for x in X_b])
        return pnp.mean((preds - y_b) ** 2)

    weights, _ = opt.step_and_cost(cost, weights)
# Compare ideal vs noisy inference
ideal_preds = np.sign([ideal_circuit(x, weights) for x in X_test])
noisy_preds = np.sign([noisy_circuit(x, weights) for x in X_test])
ideal_acc = np.mean(ideal_preds == y_test)
noisy_acc = np.mean(noisy_preds == y_test)
print(f"Ideal simulator accuracy: {ideal_acc:.3f}")
print(f"Noisy simulator accuracy: {noisy_acc:.3f}")
print(f"Accuracy drop: {ideal_acc - noisy_acc:.3f}")
Why Noise Is Not Always Catastrophic
A quantum classifier is fundamentally a statistical model: it maps inputs to probability distributions over measurement outcomes. Moderate noise can act as a form of regularization, smoothing out sharp features in the decision boundary, roughly analogous to how dropout regularizes classical neural networks.
At low noise levels (p < 0.01 per gate), the accuracy drop is often modest (a few percentage points). At higher noise levels, the circuit output converges toward completely mixed states, destroying all learned structure. The practical threshold depends on circuit depth: deeper circuits accumulate more noise and degrade faster. This is another reason to prefer shallow, hardware-efficient ansatze.
Where QML Shows Genuine Promise
The near-term use cases with the clearest path to advantage are:
- Quantum chemistry: optimizing ground state energies of molecules where the data is inherently quantum (VQE, quantum phase estimation).
- Quantum data classification: classifying states produced by quantum experiments or sensors without first converting to classical data.
- Quantum kernel methods: using a quantum circuit as a kernel function whose evaluation is classically hard to simulate.
For standard classical datasets (images, text, tabular data), there is no known theoretical advantage and no empirical evidence of advantage at any useful scale.
Quantum Kernel Advantage
A quantum kernel is useful specifically when the feature map S(x) creates a kernel function that is classically intractable to evaluate. Liu et al. (2021) proved that there exist classification problems where a quantum kernel achieves exponentially better prediction error than any classical kernel, provided the data distribution is specifically designed to exploit quantum structure. The critical caveat is that this advantage is data-dependent: for generic classical data, quantum kernels offer no guaranteed speedup.
The practical implication is that quantum kernels are most promising for data with inherent quantum structure, such as outputs from quantum simulations or quantum communication channels.
Quantum Generative Models
Beyond classification, quantum generative models represent a legitimate near-term application. Two notable examples:
Quantum GANs (QGANs): a quantum circuit acts as the generator, producing quantum states that a discriminator (quantum or classical) tries to distinguish from real data. QGANs are particularly natural for generating quantum states (e.g., for quantum chemistry initialization).
Born machines: these exploit the fact that measuring a quantum circuit produces samples from a probability distribution defined by the circuit’s amplitudes. The distribution p(x) = |<x|psi(theta)>|^2 can express correlations that are provably hard for classical probabilistic models. Born machines are a setting where quantum advantage is plausible even for near-term devices.
Common Mistakes in QML
Beginners (and sometimes experienced practitioners) frequently make these mistakes:
1. Using pnp arrays for non-gradient computations
PennyLane’s pennylane.numpy (pnp) wraps NumPy arrays with autograd tracking. Using pnp arrays for data loading, preprocessing, or evaluation (where you do not need gradients) adds unnecessary overhead. Use plain numpy for everything except the trainable parameters:
import numpy as np
import pennylane.numpy as pnp
# WRONG: using pnp for data (slow, no benefit)
X_data = pnp.array(some_data, requires_grad=False)
# RIGHT: plain numpy for data, pnp only for trainable weights
X_data = np.array(some_data)
weights = pnp.random.uniform(-np.pi, np.pi, shape, requires_grad=True)
2. Forgetting to normalize features before angle encoding
Rotation gates are periodic with period 2*pi. If your features range from 0 to 1000, many distinct inputs will alias to the same rotation angle. Always scale features to [0, pi] or [-pi, pi] before encoding:
from sklearn.preprocessing import MinMaxScaler
# Always do this before angle encoding
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X_raw)
3. Setting too many layers or qubits and hitting barren plateaus
More is not better in QML. A 20-qubit, 10-layer circuit has gradient variance on the order of 1/2^20, making optimization nearly impossible. Start with 4 qubits and 2 layers, verify that training converges, then scale incrementally.
4. Confusing expressibility with trainability
A highly expressive circuit can represent complex functions, but that does not mean you can find the right parameters. The most expressive circuits (deep, highly entangled) are often the hardest to train due to barren plateaus. A less expressive but trainable circuit frequently outperforms a more expressive but untrainable one.
5. Using global cost functions that worsen barren plateaus
Measuring the expectation of a tensor product of Pauli operators across all qubits (global cost) causes gradient variance to decay exponentially in n. Measuring only one or two qubits (local cost) significantly mitigates this:
# AVOID: global cost function (measures all qubits)
@qml.qnode(dev)
def global_cost_circuit(weights):
    # ... circuit ...
    return qml.expval(
        qml.prod(*[qml.PauliZ(i) for i in range(n_qubits)])
    )

# PREFER: local cost function (measures one qubit)
@qml.qnode(dev)
def local_cost_circuit(weights):
    # ... circuit ...
    return qml.expval(qml.PauliZ(0))
6. Not comparing to a classical baseline
Every QML experiment should include a classical baseline on the same feature set. If you reduce 30 features to 4 via PCA and train a quantum classifier, you must also train a classical model (logistic regression, SVM, small neural network) on those same 4 features. Without this comparison, you cannot claim any quantum advantage, and in practice the classical baseline often wins.
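For the pipeline used in this tutorial, such a baseline takes a few lines. Logistic regression is one reasonable choice among several (an SVM or a small MLP would serve equally well); the preprocessing mirrors the quantum experiment above, 30 features reduced to 4 PCA components:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Identical preprocessing to the quantum classifier: 30 features -> 4 components
data = load_breast_cancer()
X_pca = PCA(n_components=4).fit_transform(data.data)
X_scaled = MinMaxScaler(feature_range=(0, np.pi)).fit_transform(X_pca)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, data.target, test_size=0.2, random_state=42
)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Classical baseline accuracy: {baseline.score(X_test, y_test):.3f}")
```

Only if the quantum classifier beats this number on the same split is there anything to explain; if it merely matches it, the quantum circuit has added cost without benefit.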
Summary
PennyLane makes it straightforward to prototype QML models: define a QNode, wrap it as an optimizer-compatible cost function, and train with AdamOptimizer. The key design decisions are the encoding strategy (angle encoding is the practical default), ansatz structure (hardware-efficient, depth-limited), and cost function locality (local measurements to avoid barren plateaus).
The parameter-shift rule provides exact gradients, quantum kernels offer an alternative to variational training, and data re-uploading increases expressiveness without adding qubits. Noise is a real concern but not always fatal at moderate levels.
Be aware that results on small qubit counts do not generalize to larger circuits due to barren plateaus, and that quantum advantage for classical data classification is not established. Focus QML efforts on genuinely quantum data, quantum kernel methods with provably hard feature maps, or quantum generative models for the best chance of near-term impact.