Introduction to Quantum Machine Learning
A conceptual and practical introduction to quantum machine learning: what QML is, data encoding strategies, parameterized quantum circuits, and a complete classification example.
What Quantum Machine Learning Is
Quantum machine learning (QML) applies quantum circuits as trainable function approximators, analogous to neural networks, but implemented on quantum hardware. Where a classical neural network transforms inputs through layers of weighted linear operations and nonlinearities, a parameterized quantum circuit transforms an encoded input through a sequence of gates with tunable angles.
The fundamental appeal is that quantum circuits operate in exponentially large Hilbert spaces. In principle, a circuit on n qubits can represent correlations across 2^n dimensions with only polynomial circuit depth. Whether that theoretical advantage translates into practical learning benefits on real-world data is an open and contested question.
Three Types of QML
QML research falls into three distinct categories with very different prospects for quantum advantage:
Quantum-native data: data produced by quantum systems (molecular simulations, quantum sensor readouts, quantum communication protocols). Here a quantum circuit processes data that is already quantum, avoiding the encoding overhead entirely. This is the most promising setting for near-term quantum advantage.
Classical data on quantum hardware: the most common QML experiment today. Classical data (images, text, tabular) is encoded into quantum states and processed by a quantum circuit. The encoding cost is significant and erases most theoretical advantages.
Quantum-inspired classical algorithms: classical algorithms redesigned by studying quantum linear algebra. These run entirely on classical hardware but borrow ideas from quantum computing. They do not require quantum hardware at all.
This tutorial focuses on the second category: encoding classical data and training a quantum classifier.
The Quantum Model Space
Before diving into code, it helps to understand mathematically what a QML model computes.
A parameterized quantum circuit defines a function f(x, theta) through three stages:
- Encoding: a unitary S(x) maps the classical input x into a quantum state |phi(x)> = S(x)|0>^n.
- Variational processing: a parameterized unitary U(theta) transforms the encoded state.
- Measurement: an observable O (typically a Pauli operator) produces the scalar output.
The full model output is:
f(x, theta) = <0|^n S(x)^dag U(theta)^dag O U(theta) S(x) |0>^n
Or equivalently, writing |psi(x, theta)> = U(theta) S(x) |0>^n:
f(x, theta) = <psi(x, theta)| O |psi(x, theta)>
This structure maps directly onto the neural network analogy:
Classical NN:              Quantum Circuit:
-------------              ----------------
Input layer      <--->     Encoding S(x)
Hidden layers    <--->     Variational U(theta)
Output neuron    <--->     Measurement <O>
A natural question arises: since n qubits span a Hilbert space of dimension 2^n, does the QML model have 2^n effective parameters? The answer is no. The number of trainable parameters equals the number of rotation angles in U(theta), which scales polynomially with n and the circuit depth. The 2^n-dimensional Hilbert space provides representational capacity (the space of functions the circuit could express), but the parameterization only explores a low-dimensional manifold within that space. Conflating Hilbert space dimension with model capacity is one of the most common misconceptions in QML.
Data Encoding Strategies
How you encode classical data into qubits determines much of the circuit’s behavior. There are four main approaches, each with distinct tradeoffs.
Basis Encoding
Each classical binary string maps to a computational basis state. Requires n qubits for n bits. Efficient in qubit count but rigid; no interpolation between data points.
For example, encoding the binary string “101” into 3 qubits:
import pennylane as qml
import numpy as np
dev_basis = qml.device("default.qubit", wires=3)
@qml.qnode(dev_basis)
def basis_encode_101():
# Encode "101": flip qubits 0 and 2 to represent |101>
qml.PauliX(wires=0) # bit 0 = 1
# qubit 1 stays |0> # bit 1 = 0
qml.PauliX(wires=2) # bit 2 = 1
return qml.state()
state = basis_encode_101()
# The state vector has a 1.0 at index 5 (binary 101), zeros elsewhere
print("State vector:", np.round(state, 4))
print("Index of nonzero entry:", np.argmax(np.abs(state)))
Basis encoding is conceptually simple but only suitable for problems where the input is already binary.
Amplitude Encoding
A vector of 2^n values is encoded as the amplitudes of an n-qubit state. Exponentially compact (encoding 2^n features into n qubits) but requires a complex state preparation circuit that often costs more gates than the learning circuit itself.
dev_amp = qml.device("default.qubit", wires=3)
@qml.qnode(dev_amp)
def amplitude_encode(x):
# Encode an 8-dimensional unit vector into 3 qubits
qml.AmplitudeEmbedding(features=x, wires=range(3), normalize=True)
return qml.state()
# An arbitrary 8-dimensional vector (will be normalized automatically)
features = np.array([0.5, 0.3, 0.1, 0.7, 0.2, 0.4, 0.6, 0.1])
state = amplitude_encode(features)
print("Encoded state:", np.round(state, 4))
print("Sum of |amplitudes|^2:", np.round(np.sum(np.abs(state)**2), 6))
The normalize=True flag handles normalization for you, but be aware that the circuit depth required for arbitrary amplitude preparation scales as O(2^n), which partially negates the compression advantage.
Angle Encoding
Each feature x_i is mapped to a rotation angle. For n features you use n qubits with RY(x_i) gates. Simple, hardware-friendly, and widely used in practice.
n_qubits = 4
dev_angle = qml.device("default.qubit", wires=n_qubits)
def angle_encoding_ry(x):
"""Standard angle encoding with RY gates."""
for i in range(n_qubits):
qml.RY(x[i], wires=i)
def angle_encoding_rx_rz(x):
"""Alternate encoding: use RX for the first half, RZ for the second half.
This places features on different rotation axes, increasing diversity."""
half = len(x) // 2
for i in range(half):
qml.RX(x[i], wires=i)
for i in range(half, len(x)):
qml.RZ(x[i], wires=i)
After RY encoding, qubit i is in state cos(x_i/2)|0> + sin(x_i/2)|1>. Features should be scaled to [0, pi] or [-pi, pi] before encoding to use the full rotation range.
An important subtlety: rotation gates are periodic with period 2*pi. If your feature values span a range much larger than 2*pi, distinct inputs will map to the same quantum state (aliasing). If the range is much smaller, you use only a tiny portion of the Bloch sphere, reducing the model’s discriminative power. Proper feature scaling is not optional.
IQP Encoding
IQP (Instantaneous Quantum Polynomial) encoding uses diagonal gates to create correlations between features. The circuit structure alternates between single-qubit RZ rotations and two-qubit controlled-RZ gates:
dev_iqp = qml.device("default.qubit", wires=4)
@qml.qnode(dev_iqp)
def iqp_encode(x):
# IQP encoding: Hadamard, then RZ + controlled-RZ, then Hadamard again
# PennyLane's IQPEmbedding handles this pattern automatically
qml.IQPEmbedding(features=x, wires=range(4), n_repeats=2)
return qml.state()
features = np.array([0.5, 1.2, 0.8, 2.1])
state = iqp_encode(features)
print("IQP-encoded state:", np.round(state[:8], 4), "...")
The circuit applies Hadamard gates to all qubits, then RZ(x_i) on each qubit and controlled-RZ(x_i * x_j) on pairs, then Hadamard gates again. The n_repeats parameter controls how many times this block is repeated. The theoretical motivation is that classically simulating IQP circuits is believed to be hard (under plausible complexity assumptions), which suggests the resulting feature map may create quantum states that are classically intractable to reproduce. However, hardness of simulation does not automatically guarantee useful classification performance.
For most beginner and intermediate QML experiments, angle encoding is the right starting point.
Parameterized Quantum Circuits
A parameterized quantum circuit (PQC) has trainable rotation angles, just like weights in a neural network. A typical variational layer combines single-qubit rotations with entangling gates:
def variational_layer(weights, layer_idx):
for i in range(n_qubits):
qml.RY(weights[layer_idx, i, 0], wires=i)
qml.RZ(weights[layer_idx, i, 1], wires=i)
for i in range(n_qubits - 1):
qml.CNOT(wires=[i, i + 1])
# Ring entanglement
qml.CNOT(wires=[n_qubits - 1, 0])
Stacking multiple variational layers increases the circuit’s expressive power, though it also increases the barren plateau risk.
Hardware-Efficient Ansatz Design
The term “hardware-efficient” means the circuit respects the physical constraints of real quantum hardware: shallow depth (few layers), only native two-qubit gates (CNOT or CZ depending on the device), and a connectivity pattern that matches the chip topology. A circuit that requires all-to-all qubit connectivity looks elegant on paper but compiles to many SWAP gates on a device with nearest-neighbor connectivity, inflating depth and noise.
PennyLane provides several built-in ansatz templates.
SimplifiedTwoDesign
This template alternates single-qubit rotation layers with controlled-Z entanglers. It is designed to approximate a unitary 2-design (a distribution that mimics Haar-random unitaries up to the second moment), which makes it a good default for expressibility studies.
n_qubits = 4
n_layers = 3
dev_s2d = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev_s2d)
def simplified_two_design_circuit(initial_layer_weights, weights):
qml.SimplifiedTwoDesign(
initial_layer_weights=initial_layer_weights,
weights=weights,
wires=range(n_qubits)
)
return qml.expval(qml.PauliZ(0))
# Initial layer: one RY rotation per qubit
init_weights = np.random.uniform(0, 2 * np.pi, (n_qubits,))
# Variational layers: each layer has (n_qubits - 1) pairs, 2 parameters each
layer_weights = np.random.uniform(0, 2 * np.pi, (n_layers, n_qubits - 1, 2))
result = simplified_two_design_circuit(init_weights, layer_weights)
print(f"SimplifiedTwoDesign output: {result:.4f}")
StronglyEntanglingLayers
This is the most commonly used template in PennyLane tutorials. Each layer applies three rotations (Rot gate: RZ, RY, RZ) to every qubit, followed by CNOT entanglers with a configurable connectivity pattern that shifts across layers.
import pennylane as qml
import pennylane.numpy as pnp
n_qubits = 4
n_layers = 3
dev_sel = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev_sel)
def strongly_entangling_circuit(weights):
qml.StronglyEntanglingLayers(weights=weights, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
# Shape: (n_layers, n_qubits, 3) for the three rotation angles per qubit per layer
sel_weights = pnp.random.uniform(
-np.pi, np.pi, (n_layers, n_qubits, 3), requires_grad=True
)
result = strongly_entangling_circuit(sel_weights)
print(f"StronglyEntanglingLayers output: {result:.4f}")
print(f"Total trainable parameters: {n_layers * n_qubits * 3}")
Custom Hardware-Efficient Ansatz for Linear Topology
If your qubits are connected in a line (0-1-2-3), you should only place CNOT gates between adjacent pairs. This avoids SWAP overhead:
n_qubits = 4
dev_custom = qml.device("default.qubit", wires=n_qubits)
def custom_linear_ansatz(weights, n_layers):
"""Hardware-efficient ansatz for linear qubit connectivity 0-1-2-3.
Each layer: RY + RZ on each qubit, then CNOTs on adjacent pairs only."""
for layer in range(n_layers):
# Single-qubit rotations
for q in range(n_qubits):
qml.RY(weights[layer, q, 0], wires=q)
qml.RZ(weights[layer, q, 1], wires=q)
# Entangling gates: only adjacent pairs (linear topology)
for q in range(n_qubits - 1):
qml.CNOT(wires=[q, q + 1])
@qml.qnode(dev_custom)
def custom_circuit(weights):
n_layers = weights.shape[0]
custom_linear_ansatz(weights, n_layers)
return qml.expval(qml.PauliZ(0))
custom_weights = np.random.uniform(-np.pi, np.pi, (3, n_qubits, 2))
result = custom_circuit(custom_weights)
print(f"Custom linear ansatz output: {result:.4f}")
This circuit has 3 layers, 4 qubits, 2 parameters per qubit per layer, giving 24 total parameters and a circuit depth that stays manageable on real hardware.
A Complete QML Classifier
We train a binary classifier on the breast cancer dataset from scikit-learn, using PCA to reduce 30 features to 4, then angle encoding into 4 qubits.
Prepare the Data
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
import pennylane.numpy as pnp
import numpy as np
n_qubits = 4
data = load_breast_cancer()
X, y = data.data, data.target
# PCA to 4 features
pca = PCA(n_components=n_qubits)
X_pca = pca.fit_transform(X)
# Scale to [0, pi] for angle encoding
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X_pca)
# Convert labels: {0, 1} -> {-1, +1}
y_pm = 2 * y - 1
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y_pm, test_size=0.2, random_state=42
)
Define the QNode
n_layers = 2
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit(x, weights):
# Angle encoding
for i in range(n_qubits):
qml.RY(x[i], wires=i)
# Variational layers
for l in range(n_layers):
variational_layer(weights, l)
# Measure Z on qubit 0 as the classifier output
return qml.expval(qml.PauliZ(0))
Train with Adam
def loss(weights, X_batch, y_batch):
predictions = pnp.array([circuit(x, weights) for x in X_batch])
# Squared-error loss on +/-1 labels (a simple differentiable surrogate for classification loss)
return pnp.mean((predictions - y_batch) ** 2)
# Initialize weights using pennylane.numpy so gradients are tracked
weights = pnp.random.uniform(-np.pi, np.pi, (n_layers, n_qubits, 2), requires_grad=True)
opt = qml.AdamOptimizer(stepsize=0.05)
batch_size = 16
for epoch in range(30):
idx = np.random.choice(len(X_train), batch_size, replace=False)
X_b, y_b = X_train[idx], y_train[idx]
weights, current_loss = opt.step_and_cost(
lambda w: loss(w, X_b, y_b), weights
)
if (epoch + 1) % 10 == 0:
train_preds = np.sign([circuit(x, weights) for x in X_train])
acc = np.mean(train_preds == y_train)
print(f"Epoch {epoch+1:3d} Loss: {current_loss:.4f} Train acc: {acc:.3f}")
Evaluate
test_preds = np.sign([circuit(x, weights) for x in X_test])
test_acc = np.mean(test_preds == y_test)
print(f"Test accuracy: {test_acc:.3f}")
On 4 qubits with 2 layers and this dataset, you should see test accuracy around 0.82-0.90. A classical logistic regression on the same 4 PCA features typically achieves 0.92-0.95, so the quantum circuit is competitive but not superior at this scale.
The Parameter-Shift Rule
One of the most elegant aspects of quantum computing for machine learning is that gradients of quantum circuits can be computed exactly, not approximately. This is the parameter-shift rule.
Mathematical Foundation
Consider a quantum gate G(theta) = exp(-i * theta * P / 2) where P is a Pauli generator (a Hermitian matrix with eigenvalues +/- 1). The expectation value of an observable depends on theta through the circuit. The gradient with respect to theta is:
df/d_theta = (1/2) * [ f(theta + pi/2) - f(theta - pi/2) ]
where f(theta) denotes the circuit's expectation value as a function of the parameter. This formula is exact. It requires evaluating the circuit at only two shifted parameter values per gradient component. Unlike classical finite differences ([f(theta + epsilon) - f(theta)] / epsilon), which introduce approximation error proportional to epsilon, the parameter-shift rule yields the true analytical gradient. (A more general form allows any shift s with a 1/(2 sin s) prefactor; s = pi/2 gives the simple 1/2 coefficient above.)
The proof follows from the structure of the rotation gate. Since exp(-i * theta * P / 2) is a linear combination of cos(theta/2) * I and -i * sin(theta/2) * P, the expectation value is sinusoidal in theta, and the derivative of a sinusoid can be expressed as the difference of two shifted evaluations.
Verification in Code
import pennylane as qml
import pennylane.numpy as pnp
import numpy as np
dev_ps = qml.device("default.qubit", wires=1)
@qml.qnode(dev_ps)
def simple_circuit(theta):
qml.RY(theta, wires=0)
return qml.expval(qml.PauliZ(0))
theta_val = pnp.array(0.7, requires_grad=True)
# Method 1: PennyLane's automatic gradient (uses parameter-shift internally)
grad_auto = qml.grad(simple_circuit)(theta_val)
# Method 2: Manual parameter-shift rule
shift = np.pi / 2
grad_manual = 0.5 * (simple_circuit(theta_val + shift) - simple_circuit(theta_val - shift))
print(f"Automatic gradient: {float(grad_auto):.8f}")
print(f"Manual parameter-shift: {float(grad_manual):.8f}")
print(f"Difference: {abs(float(grad_auto) - float(grad_manual)):.2e}")
# The two values match to machine precision
The parameter-shift rule extends to gates with more general generators, though the formula becomes more complex (requiring more shift terms). PennyLane handles this automatically when you use qml.grad or qml.jacobian.
Quantum Kernel Methods
An alternative to the variational classifier approach is to use quantum circuits as kernel functions. Instead of training circuit parameters, you use the quantum feature map to define a similarity measure between data points, then hand the resulting kernel matrix to a classical SVM.
What Is a Quantum Kernel?
Given a feature map S(x) that encodes data point x into a quantum state |phi(x)> = S(x)|0>, the quantum kernel between two data points is:
K(x_i, x_j) = |<phi(x_i)|phi(x_j)>|^2
This is the fidelity (overlap squared) between the two encoded quantum states. If two inputs produce similar quantum states, their kernel value is close to 1. If the states are nearly orthogonal, the kernel is close to 0.
The key insight is that this kernel operates in the 2^n-dimensional Hilbert space without explicitly computing in that space. Computing K(x_i, x_j) on a quantum computer requires only polynomial resources, but evaluating the same kernel classically could require exponential resources if the feature map is sufficiently complex.
Computing the Kernel Matrix
import pennylane as qml
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
n_qubits = 4
dev_kernel = qml.device("default.qubit", wires=n_qubits)
def kernel_feature_map(x):
"""Angle encoding followed by entangling layer for richer feature map."""
for i in range(n_qubits):
qml.RY(x[i], wires=i)
for i in range(n_qubits - 1):
qml.CNOT(wires=[i, i + 1])
for i in range(n_qubits):
qml.RZ(x[i], wires=i)
@qml.qnode(dev_kernel)
def kernel_circuit(x1, x2):
"""Compute |<phi(x1)|phi(x2)>|^2 using the swap test alternative:
apply S(x1), then S(x2)^dag, then measure probability of |0...0>."""
kernel_feature_map(x1)
qml.adjoint(kernel_feature_map)(x2)
return qml.probs(wires=range(n_qubits))
def quantum_kernel(x1, x2):
"""Return the kernel value: probability of measuring all zeros."""
probs = kernel_circuit(x1, x2)
return probs[0] # |0...0> probability
def compute_kernel_matrix(X_a, X_b):
"""Compute the kernel matrix K[i, j] = quantum_kernel(X_a[i], X_b[j])."""
n_a, n_b = len(X_a), len(X_b)
K = np.zeros((n_a, n_b))
for i in range(n_a):
for j in range(n_b):
K[i, j] = quantum_kernel(X_a[i], X_b[j])
return K
# Prepare data (same pipeline as before)
data = load_breast_cancer()
X, y = data.data, data.target
pca = PCA(n_components=n_qubits)
X_pca = pca.fit_transform(X)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X_pca)
# Use a small subset for speed (kernel matrix is O(n^2) in dataset size)
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.2, random_state=42
)
X_train_small = X_train[:80]
y_train_small = y_train[:80]
X_test_small = X_test[:40]
y_test_small = y_test[:40]
# Compute kernel matrices
K_train = compute_kernel_matrix(X_train_small, X_train_small)
K_test = compute_kernel_matrix(X_test_small, X_train_small)
# Train a classical SVM with the quantum kernel
svm = SVC(kernel="precomputed")
svm.fit(K_train, y_train_small)
y_pred = svm.predict(K_test)
kernel_acc = accuracy_score(y_test_small, y_pred)
print(f"Quantum kernel SVM test accuracy: {kernel_acc:.3f}")
The quantum kernel approach has an advantage over the variational classifier: there are no barren plateaus because there are no quantum parameters to train. The downside is the O(n^2) cost of computing all pairwise kernel values, which becomes expensive for large datasets.
Expressibility and Entanglement Capacity
When choosing an ansatz, it helps to quantify how expressive it is and how much entanglement it generates. Two metrics are commonly used.
Expressibility
Expressibility measures how uniformly the ansatz samples from the space of all possible unitaries (the Haar measure). A highly expressive ansatz can reach states distributed uniformly across the Hilbert space. A low-expressibility ansatz remains confined to a small subspace of states.
The standard approach (Sim et al. 2019) compares the distribution of state fidelities generated by the ansatz to the Haar-random distribution. The KL divergence between the two distributions quantifies expressibility.
import pennylane as qml
import numpy as np
n_qubits = 4
n_layers = 2
dev_expr = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev_expr)
def expressibility_circuit(weights):
qml.StronglyEntanglingLayers(weights=weights, wires=range(n_qubits))
return qml.state()
def estimate_expressibility(n_samples=500):
"""Estimate expressibility by sampling state overlaps."""
fidelities = []
shape = (n_layers, n_qubits, 3)
for _ in range(n_samples):
# Sample two random parameter sets
w1 = np.random.uniform(0, 2 * np.pi, shape)
w2 = np.random.uniform(0, 2 * np.pi, shape)
state1 = expressibility_circuit(w1)
state2 = expressibility_circuit(w2)
# Fidelity = |<psi1|psi2>|^2
fidelity = np.abs(np.dot(np.conj(state1), state2)) ** 2
fidelities.append(fidelity)
return np.array(fidelities)
fidelities = estimate_expressibility(n_samples=300)
print(f"Mean fidelity: {np.mean(fidelities):.4f}")
print(f"Std fidelity: {np.std(fidelities):.4f}")
# For a Haar-random distribution on 2^4 = 16 dimensions,
# the expected mean fidelity is 1/16 = 0.0625.
# A highly expressive ansatz will produce values close to this.
print(f"Haar-random expected mean: {1 / 2**n_qubits:.4f}")
Entanglement Capacity
Entanglement capacity measures how much entanglement the ansatz generates across qubits. The von Neumann entropy of a subsystem quantifies this:
dev_ent = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev_ent)
def entanglement_circuit(weights):
qml.StronglyEntanglingLayers(weights=weights, wires=range(n_qubits))
return qml.vn_entropy(wires=[0, 1]) # entropy of the first 2 qubits
def estimate_entanglement_capacity(n_samples=200):
"""Average von Neumann entropy over random parameter samples."""
entropies = []
shape = (n_layers, n_qubits, 3)
for _ in range(n_samples):
w = np.random.uniform(0, 2 * np.pi, shape)
entropy = entanglement_circuit(w)
entropies.append(float(entropy))
return np.array(entropies)
entropies = estimate_entanglement_capacity(n_samples=200)
print(f"Mean entanglement entropy: {np.mean(entropies):.4f}")
print(f"Max possible (2 qubits): {np.log(2**2):.4f}")
# High mean entropy relative to the maximum indicates the ansatz
# generates significant entanglement across the bipartition.
An ansatz that is both highly expressive and highly entangling is powerful but may be harder to train (see barren plateaus below).
Barren Plateaus: The Core Challenge
As you scale up the number of qubits or layers, gradients of the loss function with respect to circuit parameters shrink exponentially. This is the barren plateau problem:
Var[dL/d_theta] ~ O(1 / 2^n)
For 20 qubits, the gradient variance is roughly 65,000 times smaller than for 4 qubits (a factor of 2^16). Training becomes effectively impossible without exponentially more shots to estimate gradients accurately.
Empirical Evidence
The following code measures gradient variance as a function of qubit count and prints the results as a table. The exponential decay is visible even at small scales:
import pennylane as qml
import numpy as np
def measure_gradient_variance(n_qubits, n_layers=2, n_samples=200):
"""Measure variance of dL/d_theta_0 for a random circuit."""
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def random_circuit(weights):
qml.StronglyEntanglingLayers(weights=weights, wires=range(n_qubits))
# Global cost function: measure Z on all qubits
return qml.expval(
qml.prod(*[qml.PauliZ(i) for i in range(n_qubits)])
)
grad_fn = qml.grad(random_circuit)
shape = (n_layers, n_qubits, 3)
grads = []
for _ in range(n_samples):
w = np.random.uniform(0, 2 * np.pi, shape)
w_pnp = qml.numpy.array(w, requires_grad=True)
g = grad_fn(w_pnp)
# Take gradient of the first parameter
grads.append(float(g[0, 0, 0]))
return np.var(grads)
print(f"{'Qubits':>8} | {'Grad Variance':>15} | {'1/2^n':>12}")
print("-" * 42)
for n in range(2, 11):
var = measure_gradient_variance(n, n_layers=2, n_samples=200)
theoretical = 1.0 / 2**n
print(f"{n:>8} | {var:>15.8f} | {theoretical:>12.8f}")
You will see that the measured variance drops roughly in proportion to 1/2^n, confirming the barren plateau scaling.
Four Mitigation Strategies
1. Local cost functions: instead of measuring all qubits (global cost), measure only one or two qubits near the parameter of interest. This slows the exponential decay of gradient variance from O(1/2^n) to a more favorable polynomial scaling for shallow circuits.
2. Layer-by-layer training: train the first variational layer while keeping the rest fixed, then progressively unfreeze deeper layers. This avoids random initialization in the full parameter space, where barren plateaus are most severe.
3. Identity initialization: initialize parameters so that each variational layer acts as the identity (all angles set to zero or to values that make the layer an identity up to a global phase). Training starts near a known point and gradually moves away, avoiding the flat landscape of random initialization.
4. Quantum natural gradient: the standard gradient treats all parameter directions equally, but the quantum state space has a non-Euclidean geometry described by the Fubini-Study metric. The quantum natural gradient rescales the gradient by the inverse of this metric tensor, giving larger effective steps in directions where the landscape is flat. PennyLane implements this via qml.QNGOptimizer.
Data Re-uploading
Standard angle encoding applies the input data once at the beginning of the circuit. The data re-uploading technique (Perez-Salinas et al. 2020) interleaves data encoding with variational layers, so the circuit “sees” the input multiple times at different depths.
This dramatically increases expressiveness because the circuit becomes a composition of multiple data-dependent unitaries. Mathematically, instead of U(theta) S(x) |0>, the state becomes U_L(theta_L) S(x) ... U_1(theta_1) S(x) |0>. Each re-upload effectively introduces a new “Fourier frequency” in the model’s representation of the input, making the function approximation more powerful.
Re-uploading Circuit
import pennylane as qml
import pennylane.numpy as pnp
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
n_qubits = 4
n_layers = 3
dev_reup = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev_reup)
def reupload_circuit(x, weights):
for layer in range(n_layers):
# Re-encode the input at every layer
for i in range(n_qubits):
qml.RY(x[i], wires=i)
# Variational block
for i in range(n_qubits):
qml.RY(weights[layer, i, 0], wires=i)
qml.RZ(weights[layer, i, 1], wires=i)
for i in range(n_qubits - 1):
qml.CNOT(wires=[i, i + 1])
qml.CNOT(wires=[n_qubits - 1, 0])
return qml.expval(qml.PauliZ(0))
@qml.qnode(dev_reup)
def no_reupload_circuit(x, weights):
# Encode input only once
for i in range(n_qubits):
qml.RY(x[i], wires=i)
for layer in range(n_layers):
for i in range(n_qubits):
qml.RY(weights[layer, i, 0], wires=i)
qml.RZ(weights[layer, i, 1], wires=i)
for i in range(n_qubits - 1):
qml.CNOT(wires=[i, i + 1])
qml.CNOT(wires=[n_qubits - 1, 0])
return qml.expval(qml.PauliZ(0))
# Prepare data
data = load_breast_cancer()
X, y = data.data, data.target
pca = PCA(n_components=n_qubits)
X_pca = pca.fit_transform(X)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X_pca)
y_pm = 2 * y - 1
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y_pm, test_size=0.2, random_state=42
)
def train_and_evaluate(circuit_fn, label):
weights = pnp.random.uniform(
-np.pi, np.pi, (n_layers, n_qubits, 2), requires_grad=True
)
opt = qml.AdamOptimizer(stepsize=0.05)
batch_size = 16
for epoch in range(40):
idx = np.random.choice(len(X_train), batch_size, replace=False)
X_b, y_b = X_train[idx], y_train[idx]
def cost(w):
preds = pnp.array([circuit_fn(x, w) for x in X_b])
return pnp.mean((preds - y_b) ** 2)
weights, _ = opt.step_and_cost(cost, weights)
test_preds = np.sign([circuit_fn(x, weights) for x in X_test])
acc = np.mean(test_preds == y_test)
print(f"{label}: test accuracy = {acc:.3f}")
return acc
acc_reup = train_and_evaluate(reupload_circuit, "With re-uploading ")
acc_no_reup = train_and_evaluate(no_reupload_circuit, "Without re-uploading")
You should observe that the re-uploading version achieves noticeably higher accuracy, especially as you increase the number of layers. The improvement comes from the richer Fourier spectrum of the re-uploading model.
Transfer Learning in QML
Encoding raw high-dimensional data (such as images) directly into qubits is impractical. A 28x28 grayscale image has 784 pixels, requiring either 784 qubits for angle encoding or a 10-qubit amplitude encoding circuit of extreme depth. Neither option is viable.
The practical solution is classical-to-quantum transfer learning: use a pre-trained classical neural network to compress the input into a low-dimensional embedding, then feed that embedding into the quantum circuit.
Architecture
+------------------+ +------------------+ +----------------+
| Pre-trained CNN | | Quantum Circuit | | Classical |
| (ResNet, VGG, | ----> | (4-qubit PQC | ----> | Post-process |
| MobileNet) | | with angle | | (argmax, |
| | | encoding) | | threshold) |
| Input: 224x224x3 | | Input: 4 floats | | Output: class |
| Output: 4 floats | | Output: <Z> | | |
+------------------+ +------------------+ +----------------+
Classical Quantum Classical
(frozen weights) (trainable theta)
The classical CNN (with frozen pre-trained weights) acts as a feature extractor, mapping high-dimensional inputs to a small number of features that capture the essential structure. The quantum circuit then acts as the trainable classifier head. This is practical because:
- The quantum circuit receives only 4-8 features, well within current hardware limits.
- The classical feature extractor handles the hard problem of dimensionality reduction.
- The overall pipeline is end-to-end differentiable if the classical network is implemented in a compatible framework (PyTorch + PennyLane via
qml.qnn.TorchLayer).
This approach is the most realistic path to using quantum circuits for image or text classification tasks today.
Noise Effects on QML Models
Real quantum hardware introduces noise at every gate. Understanding how noise affects QML performance is critical for practical applications.
Simulating Noisy Circuits
import pennylane as qml
import pennylane.numpy as pnp
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
n_qubits = 4
n_layers = 2
# Ideal (noiseless) device
dev_ideal = qml.device("default.qubit", wires=n_qubits)
# Noisy device with depolarizing noise
dev_noisy = qml.device("default.mixed", wires=n_qubits)
def encoding_and_layers(x, weights):
    """Shared circuit logic for both ideal and noisy versions."""
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    for layer in range(n_layers):
        for i in range(n_qubits):
            qml.RY(weights[layer, i, 0], wires=i)
            qml.RZ(weights[layer, i, 1], wires=i)
        for i in range(n_qubits - 1):
            qml.CNOT(wires=[i, i + 1])
        qml.CNOT(wires=[n_qubits - 1, 0])

@qml.qnode(dev_ideal)
def ideal_circuit(x, weights):
    encoding_and_layers(x, weights)
    return qml.expval(qml.PauliZ(0))
@qml.qnode(dev_noisy)
def noisy_circuit(x, weights):
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    for layer in range(n_layers):
        for i in range(n_qubits):
            qml.RY(weights[layer, i, 0], wires=i)
            qml.DepolarizingChannel(0.01, wires=i)
            qml.RZ(weights[layer, i, 1], wires=i)
            qml.DepolarizingChannel(0.01, wires=i)
        for i in range(n_qubits - 1):
            qml.CNOT(wires=[i, i + 1])
            # Two-qubit gates are noisier in practice: noise hits both wires
            qml.DepolarizingChannel(0.01, wires=i)
            qml.DepolarizingChannel(0.01, wires=i + 1)
        qml.CNOT(wires=[n_qubits - 1, 0])
        qml.DepolarizingChannel(0.01, wires=n_qubits - 1)
        qml.DepolarizingChannel(0.01, wires=0)
    return qml.expval(qml.PauliZ(0))
# Prepare data
data = load_breast_cancer()
X, y = data.data, data.target
pca = PCA(n_components=n_qubits)
X_pca = pca.fit_transform(X)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X_pca)
y_pm = 2 * y - 1
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y_pm, test_size=0.2, random_state=42
)
# Train on ideal, evaluate on both
weights = pnp.random.uniform(-np.pi, np.pi, (n_layers, n_qubits, 2), requires_grad=True)
opt = qml.AdamOptimizer(stepsize=0.05)
for epoch in range(30):
    idx = np.random.choice(len(X_train), 16, replace=False)
    X_b, y_b = X_train[idx], y_train[idx]

    def cost(w):
        preds = pnp.array([ideal_circuit(x, w) for x in X_b])
        return pnp.mean((preds - y_b) ** 2)

    weights, _ = opt.step_and_cost(cost, weights)
# Compare ideal vs noisy inference
ideal_preds = np.sign([ideal_circuit(x, weights) for x in X_test])
noisy_preds = np.sign([noisy_circuit(x, weights) for x in X_test])
ideal_acc = np.mean(ideal_preds == y_test)
noisy_acc = np.mean(noisy_preds == y_test)
print(f"Ideal simulator accuracy: {ideal_acc:.3f}")
print(f"Noisy simulator accuracy: {noisy_acc:.3f}")
print(f"Accuracy drop: {ideal_acc - noisy_acc:.3f}")
Why Noise Is Not Always Catastrophic
A quantum classifier is fundamentally a statistical model: it maps inputs to probability distributions over measurement outcomes. Moderate noise can act as a form of regularization, smoothing out sharp features in the decision boundary, roughly analogous to how dropout regularizes classical neural networks.
At low noise levels (p < 0.01 per gate), the accuracy drop is often modest (a few percentage points). At higher noise levels, the circuit output converges toward completely mixed states, destroying all learned structure. The practical threshold depends on circuit depth: deeper circuits accumulate more noise and degrade faster. This is another reason to prefer shallow, hardware-efficient ansatze.
Where QML Shows Genuine Promise
The near-term use cases with the clearest path to advantage are:
- Quantum chemistry: optimizing ground state energies of molecules where the data is inherently quantum (VQE, quantum phase estimation).
- Quantum data classification: classifying states produced by quantum experiments or sensors without first converting to classical data.
- Quantum kernel methods: using a quantum circuit as a kernel function whose evaluation is classically hard to simulate.
For standard classical datasets (images, text, tabular data), there is no known theoretical advantage and no empirical evidence of advantage at any useful scale.
Quantum Kernel Advantage
A quantum kernel is useful specifically when the feature map S(x) creates a kernel function that is classically intractable to evaluate. Liu et al. (2021) proved that there exist classification problems where a quantum kernel achieves exponentially better prediction error than any classical kernel, provided the data distribution is specifically designed to exploit quantum structure. The critical caveat is that this advantage is data-dependent: for generic classical data, quantum kernels offer no guaranteed speedup.
The practical implication is that quantum kernels are most promising for data with inherent quantum structure, such as outputs from quantum simulations or quantum communication channels.
Quantum Generative Models
Beyond classification, quantum generative models represent a legitimate near-term application. Two notable examples:
Quantum GANs (QGANs): a quantum circuit acts as the generator, producing quantum states that a discriminator (quantum or classical) tries to distinguish from real data. QGANs are particularly natural for generating quantum states (e.g., for quantum chemistry initialization).
Born machines: these exploit the fact that measuring a quantum circuit produces samples from a probability distribution defined by the circuit’s amplitudes. The distribution p(x) = |<x|psi(theta)>|^2 can express correlations that are provably hard for classical probabilistic models. Born machines are a setting where quantum advantage is plausible even for near-term devices.
Common Mistakes in QML
Beginners (and sometimes experienced practitioners) frequently make these mistakes:
1. Using pnp arrays for non-gradient computations
PennyLane’s pennylane.numpy (pnp) wraps NumPy arrays with autograd tracking. Using pnp arrays for data loading, preprocessing, or evaluation (where you do not need gradients) adds unnecessary overhead. Use plain numpy for everything except the trainable parameters:
import numpy as np
import pennylane.numpy as pnp
# WRONG: using pnp for data (slow, no benefit)
X_data = pnp.array(some_data, requires_grad=False)
# RIGHT: plain numpy for data, pnp only for trainable weights
X_data = np.array(some_data)
weights = pnp.random.uniform(-np.pi, np.pi, shape, requires_grad=True)
2. Forgetting to normalize features before angle encoding
Rotation gates are periodic with period 2*pi. If your features range from 0 to 1000, many distinct inputs will alias to the same rotation angle. Always scale features to [0, pi] or [-pi, pi] before encoding:
from sklearn.preprocessing import MinMaxScaler
# Always do this before angle encoding
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X_raw)
3. Setting too many layers or qubits and hitting barren plateaus
More is not better in QML. A 20-qubit, 10-layer circuit has gradient variance on the order of 1/2^20, making optimization nearly impossible. Start with 4 qubits and 2 layers, verify that training converges, then scale incrementally.
4. Confusing expressibility with trainability
A highly expressive circuit can represent complex functions, but that does not mean you can find the right parameters. The most expressive circuits (deep, highly entangled) are often the hardest to train due to barren plateaus. A less expressive but trainable circuit frequently outperforms a more expressive but untrainable one.
5. Using global cost functions that worsen barren plateaus
Measuring the expectation of a tensor product of Pauli operators across all qubits (global cost) causes gradient variance to decay exponentially in n. Measuring only one or two qubits (local cost) significantly mitigates this:
# AVOID: global cost function (measures all qubits)
@qml.qnode(dev)
def global_cost_circuit(weights):
    # ... circuit ...
    return qml.expval(
        qml.prod(*[qml.PauliZ(i) for i in range(n_qubits)])
    )

# PREFER: local cost function (measures one qubit)
@qml.qnode(dev)
def local_cost_circuit(weights):
    # ... circuit ...
    return qml.expval(qml.PauliZ(0))
6. Not comparing to a classical baseline
Every QML experiment should include a classical baseline on the same feature set. If you reduce 30 features to 4 via PCA and train a quantum classifier, you must also train a classical model (logistic regression, SVM, small neural network) on those same 4 features. Without this comparison, you cannot claim any quantum advantage, and in practice the classical baseline often wins.
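For the pipeline used in this tutorial, such a baseline takes a few lines. Logistic regression is one reasonable choice among several (an SVM or a small MLP would serve equally well); the preprocessing mirrors the quantum experiment above, 30 features reduced to 4 PCA components:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Identical preprocessing to the quantum classifier: 30 features -> 4 components
data = load_breast_cancer()
X_pca = PCA(n_components=4).fit_transform(data.data)
X_scaled = MinMaxScaler(feature_range=(0, np.pi)).fit_transform(X_pca)
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, data.target, test_size=0.2, random_state=42
)

baseline = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print(f"Classical baseline accuracy: {baseline.score(X_test, y_test):.3f}")
```

Only if the quantum classifier beats this number on the same split is there anything to explain; if it merely matches it, the quantum circuit has added cost without benefit.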
Summary
PennyLane makes it straightforward to prototype QML models: define a QNode, wrap it as an optimizer-compatible cost function, and train with AdamOptimizer. The key design decisions are the encoding strategy (angle encoding is the practical default), ansatz structure (hardware-efficient, depth-limited), and cost function locality (local measurements to avoid barren plateaus).
The parameter-shift rule provides exact gradients, quantum kernels offer an alternative to variational training, and data re-uploading increases expressiveness without adding qubits. Noise is a real concern but not always fatal at moderate levels.
Be aware that results on small qubit counts do not generalize to larger circuits due to barren plateaus, and that quantum advantage for classical data classification is not established. Focus QML efforts on genuinely quantum data, quantum kernel methods with provably hard feature maps, or quantum generative models for the best chance of near-term impact.