Building a Variational Quantum Classifier from Scratch
Build a complete binary quantum classifier with angle encoding, strongly entangling layers, BCE loss, and Adam optimizer, trained on the moons dataset, with decision boundary visualization and comparison to logistic regression.
What Is a Variational Quantum Classifier?
A variational quantum classifier (VQC) is a hybrid quantum-classical model. It uses a parameterized quantum circuit to transform input features into a measurement outcome, then a classical optimizer to tune those parameters by minimizing a loss function. The quantum circuit plays the role of a nonlinear feature map and classifier combined.
VQCs are not universally better than classical classifiers. On near-term hardware they face noise and limited qubit counts. But they are a productive research area for understanding whether quantum feature maps can provide any advantage, and they are an excellent way to learn the mechanics of variational quantum algorithms.
Theoretical Motivation for VQCs
A VQC maps classical data x to a quantum feature space via an encoding circuit U(x). A set of trainable parameters w controls a variational ansatz V(w). The full quantum state is:
|ψ(x, w)⟩ = V(w) U(x) |0…0⟩
The model makes predictions based on the expectation value ⟨ψ(x, w)|O|ψ(x, w)⟩ for some observable O (typically a Pauli-Z on one qubit). The classical optimizer updates w to minimize the loss between these predictions and the true labels.
The key question is whether the quantum feature space provides any advantage over classical kernels. To understand this, consider the connection to kernel methods. A VQC implicitly defines a quantum kernel:
K(x, x′) = |⟨ψ(x)|ψ(x′)⟩|²
This kernel measures how similar two data points are in the quantum feature space. If this kernel is hard to compute classically (that is, no efficient classical algorithm can approximate it), then a quantum speedup is possible in principle. The encoding circuit U(x) determines the kernel. Simple encodings like angle encoding produce kernels that are easy to compute classically, so they do not offer a quantum advantage. More complex encodings, such as IQP-style circuits with entangling gates that depend on products of features, can produce kernels that are conjectured to be classically intractable.
In practice, “quantum advantage for classification” means finding a dataset and encoding where the quantum kernel separates the classes better than any efficient classical kernel. This remains an active area of research. For this tutorial, the goal is to understand the mechanics, not to claim advantage.
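To make the "classically easy" claim concrete: for RY-based angle encoding on independent qubits, the per-qubit overlaps factorize, so the kernel has the closed form K(x, x′) = Π_i cos²((x_i − x′_i)/2). The following sketch (plain NumPy, separate from the tutorial's main code) checks this closed form against a directly constructed product statevector:

```python
import numpy as np

def ry_state(angles):
    """Statevector of RY(angle_i)|0> applied to each qubit (tensor product)."""
    state = np.array([1.0])
    for a in angles:
        qubit = np.array([np.cos(a / 2), np.sin(a / 2)])
        state = np.kron(state, qubit)
    return state

def angle_kernel_exact(x1, x2):
    """Closed-form angle-encoding kernel: product of cos^2((x_i - x'_i)/2)."""
    return float(np.prod(np.cos((np.asarray(x1) - np.asarray(x2)) / 2) ** 2))

x1 = np.array([0.3, 1.1])
x2 = np.array([0.9, 2.0])
overlap = np.dot(ry_state(x1), ry_state(x2))  # states are real-valued here
print(abs(overlap) ** 2, angle_kernel_exact(x1, x2))  # the two values agree
```

Because the kernel is a product of n cosines, it can be evaluated in O(n) classical time, which is exactly why plain angle encoding cannot yield a quantum kernel advantage.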
The Dataset: Moons
The sklearn make_moons dataset is a classic for nonlinear binary classification. Its two interleaved crescent-shaped clusters cannot be separated by a straight line, so the classifier must learn a curved decision boundary. It is complex enough to be interesting but small enough to train quickly.
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import numpy as np
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
# Scale features to [0, pi] for angle encoding
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.25, random_state=42
)
y_train = 2 * y_train - 1 # Map {0,1} -> {-1, +1} for Pauli-Z expectation
y_test = 2 * y_test - 1
Data Encoding Strategies
The choice of encoding circuit determines what quantum feature space the data lives in. There are three main strategies, each with different trade-offs in qubit count, circuit depth, and expressiveness.
Angle Encoding
Angle encoding maps each classical feature to a rotation angle applied to one qubit. For a 2-feature dataset we use 2 qubits and encode x_0 via RY(x_0) on qubit 0 and x_1 via RY(x_1) on qubit 1.
This encoding is linear in the data dimension: n features require n qubits. The circuit depth is O(1) since all rotations are applied in parallel. It is a simple, hardware-efficient encoding, and is the right starting point for low-dimensional data.
Amplitude Encoding
Amplitude encoding maps a classical vector x = (x_0, x_1, …, x_{N-1}) directly into the amplitudes of a quantum state:
|ψ⟩ = Σ_i x_i |i⟩
where the vector must be normalized (||x|| = 1). This requires only log₂(N) qubits for an N-dimensional vector. For a 4-feature dataset, amplitude encoding uses just 2 qubits.
The catch is that preparing an arbitrary amplitude-encoded state requires an exponentially deep circuit in the worst case. PennyLane’s qml.AmplitudeEmbedding handles the state preparation automatically, but the circuit depth grows with the number of features. For NISQ devices, this cost can be prohibitive.
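As a quick sanity check of what amplitude encoding actually stores (a NumPy-only sketch, independent of PennyLane): the normalized feature vector becomes the state's amplitudes, and computational-basis measurement probabilities are the squared normalized features:

```python
import numpy as np

# A 4-feature sample maps onto the 4 amplitudes of a 2-qubit state
x = np.array([0.5, 1.2, 0.8, 2.1])

# Normalize to unit length: these are the state amplitudes
amps = x / np.linalg.norm(x)

# Measuring in the computational basis yields |i> with probability amps[i]^2
probs = amps ** 2
print("amplitudes:", amps)
print("probabilities:", probs, "sum =", probs.sum())
```

This also shows a limitation: the encoding only sees the direction of x, not its magnitude, so samples that differ only by overall scale become indistinguishable.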
IQP (Instantaneous Quantum Polynomial) Encoding
IQP encoding applies single-qubit phase gates and two-qubit entangling gates that depend on products of features:
- Apply Hadamard gates to all qubits.
- Apply phase gates e^(i x_j Z_j) on each qubit j.
- Apply entangling phase gates e^(i x_j x_k ZZ) on pairs (j, k).
- Optionally repeat the encoding block multiple times.
This creates a feature map that is conjectured to be hard to simulate classically. The products x_j * x_k in the entangling gates introduce nonlinear interactions between features, which makes the resulting quantum kernel richer than what angle encoding produces.
Code Comparison: All Three Encodings
The following code shows all three encoding strategies applied to a 4-feature input:
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
# Sample 4-feature input
x = np.array([0.5, 1.2, 0.8, 2.1])
# --- Angle Encoding (4 qubits for 4 features) ---
dev_angle = qml.device("default.qubit", wires=4)
@qml.qnode(dev_angle)
def angle_encoding(x):
for i in range(4):
qml.RY(x[i], wires=i)
return [qml.expval(qml.PauliZ(i)) for i in range(4)]
print("Angle encoding output:", angle_encoding(x))
print("Angle encoding circuit:")
print(qml.draw(angle_encoding)(x))
# --- Amplitude Encoding (2 qubits for 4 features) ---
dev_amp = qml.device("default.qubit", wires=2)
@qml.qnode(dev_amp)
def amplitude_encoding(x):
# AmplitudeEmbedding normalizes the input vector automatically
qml.AmplitudeEmbedding(features=x, wires=range(2), normalize=True)
return [qml.expval(qml.PauliZ(i)) for i in range(2)]
print("\nAmplitude encoding output:", amplitude_encoding(x))
print("Amplitude encoding circuit:")
print(qml.draw(amplitude_encoding)(x))
# --- IQP Encoding (4 qubits for 4 features) ---
dev_iqp = qml.device("default.qubit", wires=4)
@qml.qnode(dev_iqp)
def iqp_encoding(x):
# First layer: Hadamards + single-qubit phase gates
for i in range(4):
qml.Hadamard(wires=i)
for i in range(4):
qml.RZ(x[i], wires=i)
# Entangling phase gates based on feature products
for i in range(4):
for j in range(i + 1, 4):
qml.CNOT(wires=[i, j])
qml.RZ(x[i] * x[j], wires=j)
qml.CNOT(wires=[i, j])
# Repeat the encoding for richer feature map
for i in range(4):
qml.Hadamard(wires=i)
for i in range(4):
qml.RZ(x[i], wires=i)
for i in range(4):
for j in range(i + 1, 4):
qml.CNOT(wires=[i, j])
qml.RZ(x[i] * x[j], wires=j)
qml.CNOT(wires=[i, j])
return [qml.expval(qml.PauliZ(i)) for i in range(4)]
print("\nIQP encoding output:", iqp_encoding(x))
print("IQP encoding circuit:")
print(qml.draw(iqp_encoding)(x))
| Encoding | Qubits for N features | Circuit Depth | Kernel Complexity |
|---|---|---|---|
| Angle | N | O(1) | Classically easy |
| Amplitude | log₂(N) | O(N) | Depends on preparation |
| IQP | N | O(N²) | Conjectured classically hard |
Circuit Architecture
The full circuit is:
- Angle encoding layer: RY(x_i) on each qubit.
- Variational layers (repeated L times):
  - Single-qubit rotations: Rot(phi, theta, omega) on each qubit (three parameters per qubit per layer).
  - Entangling CNOT ring: CNOT from qubit i to qubit (i+1) mod n.
- Measure the Pauli-Z expectation value of qubit 0.
The output is a scalar in [-1, +1]. We apply a sigmoid to map it to a probability, then use binary cross-entropy loss.
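Before the full implementation, here is the classical postprocessing step in isolation (a minimal NumPy sketch). Note that because the expectation value is bounded in [-1, +1], the sigmoid output stays within roughly (0.27, 0.73), so the BCE loss can never saturate at 0 or blow up:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(prob, y01):
    """Binary cross-entropy for a single prediction."""
    prob = np.clip(prob, 1e-7, 1 - 1e-7)
    return -(y01 * np.log(prob) + (1 - y01) * np.log(1 - prob))

# Raw circuit outputs span [-1, +1]; sigmoid maps them to probabilities
for expval in (-1.0, 0.0, 1.0):
    p = sigmoid(expval)
    print(f"<Z>={expval:+.1f} -> P(class 1)={p:.3f} | "
          f"loss if y=1: {bce(p, 1):.3f} | loss if y=0: {bce(p, 0):.3f}")
```

The bounded output range keeps gradients well behaved, at the cost of the model never expressing extreme confidence; scaling the expectation value before the sigmoid is a common (optional) tweak.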
Full PennyLane Implementation
import matplotlib
matplotlib.use("Agg")
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
# Circuit configuration
n_qubits = 2
n_layers = 3
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit(params, x):
"""
Variational quantum classifier circuit.
params: shape (n_layers, n_qubits, 3) - rotation angles per layer per qubit
x: shape (n_qubits,) - input features
"""
# Angle encoding
for i in range(n_qubits):
qml.RY(x[i], wires=i)
# Variational layers
for layer in range(n_layers):
# Parameterized rotations
for qubit in range(n_qubits):
qml.Rot(
params[layer, qubit, 0],
params[layer, qubit, 1],
params[layer, qubit, 2],
wires=qubit,
)
# Entangling layer: CNOT ring
for qubit in range(n_qubits - 1):
qml.CNOT(wires=[qubit, qubit + 1])
if n_qubits > 2:
qml.CNOT(wires=[n_qubits - 1, 0])
return qml.expval(qml.PauliZ(0))
def sigmoid(z):
return 1.0 / (1.0 + pnp.exp(-z))
def bce_loss(params, X_batch, y_batch):
"""Binary cross-entropy loss."""
total_loss = 0.0
for x, y in zip(X_batch, y_batch):
pred_raw = circuit(params, x)
# Convert {-1,+1} label to {0,1} for BCE
y_01 = (y + 1) / 2
prob = sigmoid(pred_raw)
prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
total_loss += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
return total_loss / len(X_batch)
def predict(params, X):
raw = np.array([circuit(params, x) for x in X])
return np.sign(raw)
def accuracy(params, X, y):
preds = predict(params, X)
return np.mean(preds == y)
# Initialize parameters with small random values
np.random.seed(42)
params = pnp.array(
np.random.uniform(-np.pi / 4, np.pi / 4, size=(n_layers, n_qubits, 3)),
requires_grad=True,
)
# Adam optimizer
opt = qml.AdamOptimizer(stepsize=0.05)
batch_size = 16
n_epochs = 40
print(f"Training VQC: {n_layers} layers, {n_qubits} qubits, {n_layers * n_qubits * 3} parameters")
print(f"Initial train accuracy: {accuracy(params, X_train, y_train_pm):.3f}")
for epoch in range(n_epochs):
# Shuffle and create batches
perm = np.random.permutation(len(X_train))
X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
for start in range(0, len(X_train), batch_size):
X_batch = X_shuf[start : start + batch_size]
y_batch = y_shuf[start : start + batch_size]
params, loss_val = opt.step_and_cost(lambda p: bce_loss(p, X_batch, y_batch), params)
if (epoch + 1) % 10 == 0:
train_acc = accuracy(params, X_train, y_train_pm)
test_acc = accuracy(params, X_test, y_test_pm)
print(f"Epoch {epoch+1:3d} | Loss: {loss_val:.4f} | Train acc: {train_acc:.3f} | Test acc: {test_acc:.3f}")
print(f"\nFinal test accuracy (VQC): {accuracy(params, X_test, y_test_pm):.3f}")
Using Strongly Entangling Layers
The manual CNOT ring above is one way to build entanglement, but PennyLane provides qml.StronglyEntanglingLayers, a template that uses Rot gates (3 parameters each) and CNOTs in a pattern designed to create long-range entanglement. In each layer, the CNOT pattern shifts so that different qubit pairs become entangled across layers, providing better coverage of the Hilbert space.
The parameter count is the same: 3 parameters per qubit per layer. But the entanglement pattern is more structured and generally more expressive for the same depth.
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Dataset setup (same as before)
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
n_qubits = 2
n_layers = 3
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit_sel(params, x):
"""VQC using StronglyEntanglingLayers."""
# Angle encoding
for i in range(n_qubits):
qml.RY(x[i], wires=i)
# Strongly entangling variational layers
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
# Draw the circuit to see the structure
print("StronglyEntanglingLayers circuit:")
dummy_params = pnp.zeros((n_layers, n_qubits, 3))
dummy_x = pnp.zeros(n_qubits)
print(qml.draw(circuit_sel)(dummy_params, dummy_x))
# Parameter count comparison
manual_params = n_layers * n_qubits * 3
sel_shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
sel_params = np.prod(sel_shape)
print(f"\nManual ansatz parameters: {manual_params}")
print(f"StronglyEntanglingLayers parameters: {sel_params}")
print(f"Both use {n_layers} layers x {n_qubits} qubits x 3 rotation angles = {manual_params}")
# Train with StronglyEntanglingLayers
np.random.seed(42)
shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
params_sel = pnp.array(
np.random.uniform(-np.pi / 4, np.pi / 4, size=shape),
requires_grad=True,
)
def sigmoid(z):
return 1.0 / (1.0 + pnp.exp(-z))
def bce_loss_sel(params, X_batch, y_batch):
total_loss = 0.0
for x, label in zip(X_batch, y_batch):
pred_raw = circuit_sel(params, x)
y_01 = (label + 1) / 2
prob = sigmoid(pred_raw)
prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
total_loss += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
return total_loss / len(X_batch)
opt = qml.AdamOptimizer(stepsize=0.05)
batch_size = 16
n_epochs = 40
for epoch in range(n_epochs):
perm = np.random.permutation(len(X_train))
X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
for start in range(0, len(X_train), batch_size):
X_batch = X_shuf[start : start + batch_size]
y_batch = y_shuf[start : start + batch_size]
params_sel, loss_val = opt.step_and_cost(
lambda p: bce_loss_sel(p, X_batch, y_batch), params_sel
)
if (epoch + 1) % 10 == 0:
preds = np.sign(np.array([circuit_sel(params_sel, x) for x in X_test]))
test_acc = np.mean(preds == y_test_pm)
print(f"Epoch {epoch+1:3d} | Loss: {loss_val:.4f} | Test acc: {test_acc:.3f}")
preds_final = np.sign(np.array([circuit_sel(params_sel, x) for x in X_test]))
print(f"\nFinal test accuracy (StronglyEntanglingLayers): {np.mean(preds_final == y_test_pm):.3f}")
The StronglyEntanglingLayers template typically produces comparable results to the manual CNOT ring on a 2-qubit circuit. The difference becomes more pronounced on 4+ qubits, where the shifting CNOT pattern creates long-range entanglement that a simple nearest-neighbor ring misses.
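To see why the shifting pattern helps, the following pure-Python sketch lists the CNOT pairs per layer, assuming the template's documented default range pattern r_l = (l mod (n_wires − 1)) + 1, with each CNOT acting from wire i to wire (i + r_l) mod n_wires:

```python
def sel_cnot_pairs(n_wires, n_layers):
    """Sketch of StronglyEntanglingLayers' default CNOT connectivity,
    assuming range r_l = (l mod (n_wires - 1)) + 1 in layer l."""
    layers = []
    for l in range(n_layers):
        r = (l % (n_wires - 1)) + 1  # controlled-gate "range" for this layer
        layers.append([(i, (i + r) % n_wires) for i in range(n_wires)])
    return layers

for l, pairs in enumerate(sel_cnot_pairs(4, 3)):
    print(f"Layer {l}: CNOT pairs {pairs}")
```

On 4 qubits, layer 0 entangles nearest neighbors, layer 1 skips one qubit, and layer 2 skips two, so every ordered pair of qubits is directly coupled within three layers, unlike a fixed nearest-neighbor ring.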
Decision Boundary Visualization
def plot_decision_boundary(params, X, y, title="VQC Decision Boundary"):
h = 0.05
x_min, x_max = X[:, 0].min() - 0.1, X[:, 0].max() + 0.1
y_min, y_max = X[:, 1].min() - 0.1, X[:, 1].max() + 0.1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
grid = np.c_[xx.ravel(), yy.ravel()]
raw_preds = np.array([float(circuit(params, pt)) for pt in grid])
Z = np.sign(raw_preds).reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, Z, alpha=0.3, cmap="RdBu")
y_01 = (y + 1) // 2
plt.scatter(X[:, 0], X[:, 1], c=y_01, cmap="RdBu", edgecolors="k", s=40)
plt.title(title)
plt.xlabel("Feature 1")
plt.ylabel("Feature 2")
plt.tight_layout()
plt.savefig("vqc_decision_boundary.png", dpi=120)
plot_decision_boundary(params, X_test, y_test_pm)
Comparison to Logistic Regression
# Classical baseline: logistic regression (inherently linear, cannot separate moons well)
lr = LogisticRegression()
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_test)
lr_acc = accuracy_score(y_test, lr_preds)
print(f"Logistic Regression test accuracy: {lr_acc:.3f}")
Logistic regression typically achieves around 85-87% accuracy on this dataset because it is constrained to a linear decision boundary. The VQC, with entangling layers, can learn a curved boundary and typically reaches 88-93% with this configuration.
Barren Plateaus: The Main Challenge for VQCs at Scale
A barren plateau is a phenomenon where the gradient of the cost function becomes exponentially small as the number of qubits grows. Specifically, for a random parameterized circuit on n qubits, the variance of the gradient with respect to any single parameter scales as:
Var(∂C/∂θ) ~ O(1/2^n)
This means that for large circuits, the gradient landscape is essentially flat everywhere except in an exponentially small region. The optimizer sees near-zero gradients and cannot make progress. This is not a numerical issue; it is a fundamental property of high-dimensional quantum state spaces.
The following code demonstrates barren plateaus empirically by measuring the gradient variance for circuits of increasing size:
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
def compute_gradient_variance(n_qubits, n_layers=2, n_samples=200):
"""Compute the variance of gradients for random parameters."""
dev = qml.device("default.qubit", wires=n_qubits)
    @qml.qnode(dev)
    def random_circuit(params, features):
        # Fixed random encoding: features are sampled outside the QNode,
        # so the circuit is deterministic within each gradient evaluation
        for i in range(n_qubits):
            qml.RY(features[i], wires=i)
        # Variational layers
        qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
        # Cost function: expectation of Z on qubit 0
        return qml.expval(qml.PauliZ(0))
    shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
    grad_fn = qml.grad(random_circuit, argnum=0)
    gradients = []
    for _ in range(n_samples):
        features = np.random.uniform(0, np.pi, size=n_qubits)
        params = pnp.array(
            np.random.uniform(-np.pi, np.pi, size=shape), requires_grad=True
        )
        grad = grad_fn(params, features)
        # Record the gradient of the first parameter
        gradients.append(grad[0, 0, 0])
return np.var(gradients)
# Measure gradient variance for increasing qubit counts
qubit_counts = [2, 4, 6, 8]
variances = []
for n in qubit_counts:
var = compute_gradient_variance(n, n_layers=2, n_samples=200)
variances.append(var)
print(f"n_qubits={n}: gradient variance = {var:.6f}")
# The variance decreases roughly exponentially with qubit count
print("\nRatio of successive variances (expect ~0.25 for exponential decay):")
for i in range(1, len(variances)):
ratio = variances[i] / variances[i - 1]
print(f" Var(n={qubit_counts[i]}) / Var(n={qubit_counts[i-1]}) = {ratio:.4f}")
You should see the gradient variance drop by roughly a factor of 4 each time the qubit count increases by 2, consistent with the 1/2^n scaling.
Mitigation Strategies
Several approaches help mitigate barren plateaus:
- Layer-by-layer training: Train one layer at a time, keeping previous layers fixed. Each layer is optimized in a shallow landscape before the next layer is added. This avoids the exponentially flat landscape of the full circuit.
- Local cost functions: Cost functions built from global observables that act on every qubit at once (for example, a projector onto the all-zeros state) flatten fastest. Replacing them with sums of few-qubit local observables, such as Σ_i ⟨Z_i⟩, yields gradient variances that decay only polynomially with system size in shallow circuits.
- Parameter initialization near identity: Initialize parameters close to zero so the circuit starts near the identity operation. This places the initial point in a region with meaningful gradients.
- Quantum natural gradient: Use the quantum Fisher information matrix to precondition the gradient, which accounts for the geometry of the parameter space. We cover this in the next section.
Quantum Natural Gradient Optimizer
Standard gradient descent treats all parameter directions equally. But in a parameterized quantum circuit, small changes in one parameter can cause large changes in the quantum state, while large changes in another parameter barely move the state. The quantum natural gradient (QNG) accounts for this by preconditioning the gradient with the inverse of the quantum Fisher information matrix (also called the Fubini-Study metric tensor).
The update rule for QNG is:
w ← w - η F⁻¹ ∇C(w)
where F is the quantum Fisher information matrix with entries F_ij = Re(⟨∂_i ψ|∂_j ψ⟩ - ⟨∂_i ψ|ψ⟩⟨ψ|∂_j ψ⟩).
This accounts for the geometry of the quantum state space, making the optimizer take steps that are uniform in terms of state-space distance rather than parameter-space distance.
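As a toy check of the definition (NumPy only, not part of the classifier code): for a single qubit prepared as RY(θ)|0⟩, the metric is the constant F = 1/4 for every θ, which the formula above reproduces using analytic state derivatives:

```python
import numpy as np

def psi(theta):
    """Single-qubit state RY(theta)|0>."""
    return np.array([np.cos(theta / 2), np.sin(theta / 2)])

def dpsi(theta):
    """Analytic derivative of the state with respect to theta."""
    return np.array([-np.sin(theta / 2), np.cos(theta / 2)]) / 2

theta = 0.7
s, ds = psi(theta), dpsi(theta)
# F = Re(<d psi|d psi> - <d psi|psi><psi|d psi>)
F = np.real(np.vdot(ds, ds) - np.vdot(ds, s) * np.vdot(s, ds))
print(F)  # 0.25, independent of theta
```

A QNG step on this one-parameter model is therefore ordinary gradient descent rescaled by F⁻¹ = 4: the step size is measured in state-space distance rather than raw parameter distance.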
PennyLane provides qml.QNGOptimizer for this purpose. The following code compares Adam and QNG on the moons classification task:
import matplotlib
matplotlib.use("Agg")
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
n_qubits = 2
n_layers = 2
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit_qng(params, x):
for i in range(n_qubits):
qml.RY(x[i], wires=i)
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
def sigmoid(z):
return 1.0 / (1.0 + pnp.exp(-z))
def cost_fn(params):
"""Full-batch cost for QNG (QNG works best with full-batch or large-batch)."""
total_loss = 0.0
for x, label in zip(X_train, y_train_pm):
pred = circuit_qng(params, x)
y_01 = (label + 1) / 2
prob = sigmoid(pred)
prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
total_loss += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
return total_loss / len(X_train)
def test_accuracy(params):
preds = np.sign(np.array([circuit_qng(params, x) for x in X_test]))
return np.mean(preds == y_test_pm)
# Train with Adam
shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
np.random.seed(42)
init_params = np.random.uniform(-np.pi / 4, np.pi / 4, size=shape)
params_adam = pnp.array(init_params.copy(), requires_grad=True)
opt_adam = qml.AdamOptimizer(stepsize=0.05)
adam_losses = []
for epoch in range(30):
params_adam, loss = opt_adam.step_and_cost(cost_fn, params_adam)
adam_losses.append(float(loss))
if (epoch + 1) % 10 == 0:
acc = test_accuracy(params_adam)
print(f"Adam Epoch {epoch+1:3d} | Loss: {loss:.4f} | Test acc: {acc:.3f}")
# Train with QNG
params_qng = pnp.array(init_params.copy(), requires_grad=True)
opt_qng = qml.QNGOptimizer(stepsize=0.01)
# QNGOptimizer needs the metric tensor of the underlying QNode. Because our
# cost function averages over many circuit calls, we pass an explicit
# metric_tensor_fn evaluated at a representative training point (an
# approximation: the metric depends on the encoded input).
metric_fn = qml.metric_tensor(circuit_qng, approx="block-diag")
qng_losses = []
for epoch in range(30):
    params_qng, loss = opt_qng.step_and_cost(
        cost_fn, params_qng, metric_tensor_fn=lambda p: metric_fn(p, X_train[0])
    )
    qng_losses.append(float(loss))
    if (epoch + 1) % 10 == 0:
        acc = test_accuracy(params_qng)
        print(f"QNG Epoch {epoch+1:3d} | Loss: {loss:.4f} | Test acc: {acc:.3f}")
# Plot convergence comparison
plt.figure(figsize=(8, 5))
plt.plot(adam_losses, label="Adam (lr=0.05)")
plt.plot(qng_losses, label="QNG (lr=0.01)")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.title("Adam vs Quantum Natural Gradient Convergence")
plt.legend()
plt.tight_layout()
plt.savefig("adam_vs_qng.png", dpi=120)
print("\nConvergence plot saved to adam_vs_qng.png")
QNG typically converges in fewer steps, especially in the early phase of training. However, each QNG step requires O(p²) circuit evaluations to estimate the Fisher information matrix, where p is the number of parameters. For the 12-parameter circuit above, this overhead is manageable. For circuits with hundreds of parameters, the per-step cost becomes prohibitive, and block-diagonal approximations to the Fisher matrix are used instead.
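The block-diagonal idea can be illustrated without any quantum machinery (a hedged NumPy sketch with a made-up SPD "metric"): keep only the per-layer blocks of F, which reduces the number of matrix entries to estimate while still capturing curvature within each layer:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy SPD "metric" for 2 layers x 2 parameters = 4 parameters total
A = rng.normal(size=(4, 4))
F = A @ A.T + 0.5 * np.eye(4)
grad = rng.normal(size=4)

# Full natural-gradient direction: solve F @ step = grad
step_full = np.linalg.solve(F, grad)

# Block-diagonal approximation: keep only the per-layer 2x2 blocks of F
F_block = np.zeros_like(F)
for b in range(0, 4, 2):
    F_block[b:b + 2, b:b + 2] = F[b:b + 2, b:b + 2]
step_block = np.linalg.solve(F_block, grad)

print("full step:  ", step_full)
print("block step: ", step_block)
```

The block-diagonal solve decomposes into independent per-layer solves, which is why it scales so much better: the cross-layer curvature is simply discarded.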
Amplitude Encoding for Higher-Dimensional Data
For datasets with more than a few features, angle encoding requires one qubit per feature, which quickly exceeds hardware limits. Amplitude encoding offers a logarithmic alternative: a 4-dimensional feature vector maps to a 2-qubit state.
The following example trains a VQC on the Iris dataset (setosa vs. versicolor, 4 features) using amplitude encoding:
import matplotlib
matplotlib.use("Agg")
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
# Load Iris dataset (2 classes: setosa=0, versicolor=1)
iris = load_iris()
mask = iris.target < 2 # Select only setosa and versicolor
X_iris = iris.data[mask]
y_iris = iris.target[mask]
# Standardize features, then normalize each sample to unit length for amplitude encoding
scaler = StandardScaler()
X_iris_scaled = scaler.fit_transform(X_iris)
X_iris_norm = X_iris_scaled / np.linalg.norm(X_iris_scaled, axis=1, keepdims=True)
X_train, X_test, y_train, y_test = train_test_split(
X_iris_norm, y_iris, test_size=0.25, random_state=42
)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
# Amplitude encoding: 4 features -> 2 qubits
n_qubits = 2
n_layers = 3
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit_amp(params, x):
"""VQC with amplitude encoding for 4-feature data."""
# Amplitude encoding: maps 4D vector to 2-qubit state
qml.AmplitudeEmbedding(features=x, wires=range(n_qubits), normalize=False)
# Variational layers
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
# Print circuit structure
shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
dummy_params = pnp.zeros(shape)
dummy_x = np.array([0.5, 0.5, 0.5, 0.5])
print("Amplitude encoding circuit (4 features, 2 qubits):")
print(qml.draw(circuit_amp)(dummy_params, dummy_x))
# Compare circuit depths
dev_4q = qml.device("default.qubit", wires=4)
@qml.qnode(dev_4q)
def circuit_angle_4(params, x):
"""VQC with angle encoding for 4-feature data (4 qubits)."""
for i in range(4):
qml.RY(x[i], wires=i)
qml.StronglyEntanglingLayers(params, wires=range(4))
return qml.expval(qml.PauliZ(0))
shape_4q = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=4)
print(f"\nAngle encoding: 4 qubits, {np.prod(shape_4q)} variational parameters")
print(f"Amplitude encoding: 2 qubits, {np.prod(shape)} variational parameters")
print("Amplitude encoding uses fewer qubits but a deeper state preparation circuit.")
# Train the amplitude-encoded VQC
np.random.seed(42)
params_amp = pnp.array(
np.random.uniform(-np.pi / 4, np.pi / 4, size=shape), requires_grad=True
)
def sigmoid(z):
return 1.0 / (1.0 + pnp.exp(-z))
def bce_loss_amp(params, X_batch, y_batch):
total_loss = 0.0
for x, label in zip(X_batch, y_batch):
pred = circuit_amp(params, x)
y_01 = (label + 1) / 2
prob = sigmoid(pred)
prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
total_loss += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
return total_loss / len(X_batch)
opt = qml.AdamOptimizer(stepsize=0.05)
for epoch in range(40):
perm = np.random.permutation(len(X_train))
X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
for start in range(0, len(X_train), 16):
X_batch = X_shuf[start : start + 16]
y_batch = y_shuf[start : start + 16]
params_amp, loss_val = opt.step_and_cost(
lambda p: bce_loss_amp(p, X_batch, y_batch), params_amp
)
if (epoch + 1) % 10 == 0:
preds = np.sign(np.array([circuit_amp(params_amp, x) for x in X_test]))
test_acc = np.mean(preds == y_test_pm)
print(f"Epoch {epoch+1:3d} | Loss: {loss_val:.4f} | Test acc: {test_acc:.3f}")
preds_final = np.sign(np.array([circuit_amp(params_amp, x) for x in X_test]))
print(f"\nFinal test accuracy (Amplitude VQC on Iris): {np.mean(preds_final == y_test_pm):.3f}")
Setosa and versicolor are linearly separable, so both quantum and classical methods achieve near-perfect accuracy on this pair. The value of amplitude encoding becomes apparent on higher-dimensional datasets where angle encoding would require too many qubits.
Multi-Class Extension
The binary VQC measures one qubit and uses the sign of the expectation value. For k classes, we extend this by measuring k qubits and predicting the class with the highest expectation value. The output is a vector of Pauli-Z expectation values (⟨Z_0⟩, ⟨Z_1⟩, …, ⟨Z_{k-1}⟩), and we apply softmax to convert these into class probabilities.
The following example builds a 3-class VQC for the full Iris dataset:
import matplotlib
matplotlib.use("Agg")
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Load full Iris dataset (3 classes)
iris = load_iris()
X_iris = iris.data
y_iris = iris.target # 0, 1, 2
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_iris)
# Scale to [0, pi] for angle encoding
from sklearn.preprocessing import MinMaxScaler
scaler2 = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler2.fit_transform(X_scaled)
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y_iris, test_size=0.25, random_state=42
)
# Circuit: 4 qubits (one per feature), measure qubits 0, 1, 2 for 3 classes
n_qubits = 4
n_layers = 3
n_classes = 3
dev = qml.device("default.qubit", wires=n_qubits)
@qml.qnode(dev)
def circuit_multi(params, x):
"""Multi-class VQC: returns 3 Pauli-Z expectation values."""
# Angle encoding: 4 features on 4 qubits
for i in range(n_qubits):
qml.RY(x[i], wires=i)
# Variational layers
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
# Measure one qubit per class
return [qml.expval(qml.PauliZ(i)) for i in range(n_classes)]
def softmax(z):
"""Numerically stable softmax."""
z_shifted = z - pnp.max(z)
exp_z = pnp.exp(z_shifted)
return exp_z / pnp.sum(exp_z)
def cross_entropy_loss(params, X_batch, y_batch):
"""Multi-class cross-entropy loss with softmax."""
total_loss = 0.0
for x, label in zip(X_batch, y_batch):
raw_outputs = pnp.array(circuit_multi(params, x))
probs = softmax(raw_outputs)
probs = pnp.clip(probs, 1e-7, 1.0)
total_loss += -pnp.log(probs[label])
return total_loss / len(X_batch)
def predict_multi(params, X):
predictions = []
for x in X:
raw = np.array(circuit_multi(params, x))
predictions.append(np.argmax(raw))
return np.array(predictions)
# Initialize and train
shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
np.random.seed(42)
params_multi = pnp.array(
np.random.uniform(-np.pi / 4, np.pi / 4, size=shape), requires_grad=True
)
opt = qml.AdamOptimizer(stepsize=0.05)
batch_size = 16
for epoch in range(50):
perm = np.random.permutation(len(X_train))
X_shuf, y_shuf = X_train[perm], y_train[perm]
for start in range(0, len(X_train), batch_size):
X_batch = X_shuf[start : start + batch_size]
y_batch = y_shuf[start : start + batch_size]
params_multi, loss_val = opt.step_and_cost(
lambda p: cross_entropy_loss(p, X_batch, y_batch), params_multi
)
if (epoch + 1) % 10 == 0:
preds = predict_multi(params_multi, X_test)
acc = np.mean(preds == y_test)
print(f"Epoch {epoch+1:3d} | Loss: {loss_val:.4f} | Test acc: {acc:.3f}")
# Final VQC accuracy
preds_vqc = predict_multi(params_multi, X_test)
vqc_acc = np.mean(preds_vqc == y_test)
print(f"\nFinal 3-class VQC test accuracy: {vqc_acc:.3f}")
# Classical baseline: logistic regression (multi-class)
lr = LogisticRegression(max_iter=1000, random_state=42)
lr.fit(X_train, y_train)
lr_preds = lr.predict(X_test)
lr_acc = accuracy_score(y_test, lr_preds)
print(f"Logistic Regression test accuracy: {lr_acc:.3f}")
The 3-class VQC typically reaches 85-95% accuracy on Iris, comparable to logistic regression. The key design choices for multi-class VQCs are: (1) allocate one measurement qubit per class, (2) use softmax to convert expectation values to probabilities, and (3) use cross-entropy loss instead of BCE.
Quantum Kernel Method Comparison
An alternative to training a variational circuit is to use the quantum circuit purely as a kernel function. The quantum kernel between two data points is:
K(x, x’) = |⟨0…0| U(x)† U(x’) |0…0⟩|²
This equals the probability of measuring the all-zeros state after applying U(x’) followed by U(x)†. Once we compute the kernel matrix, we pass it to a classical SVM and let the classical optimizer handle classification. No variational parameters need to be trained on the quantum side.
import pennylane as qml
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
n_qubits = 2
dev = qml.device("default.qubit", wires=n_qubits)
def encoding_circuit(x, wires):
    """IQP-style encoding for richer kernel."""
    for i in wires:
        qml.Hadamard(wires=i)
    for i in wires:
        qml.RZ(x[i], wires=i)
    for i in range(len(wires)):
        for j in range(i + 1, len(wires)):
            qml.CNOT(wires=[wires[i], wires[j]])
            qml.RZ(x[wires[i]] * x[wires[j]], wires=wires[j])
            qml.CNOT(wires=[wires[i], wires[j]])

@qml.qnode(dev)
def kernel_circuit(x1, x2):
    """Compute |<0|U(x1)†U(x2)|0>|^2 via the compute/uncompute (adjoint) trick."""
    # Apply encoding for x1
    encoding_circuit(x1, wires=range(n_qubits))
    # Apply adjoint of encoding for x2
    qml.adjoint(encoding_circuit)(x2, wires=range(n_qubits))
    # Probability of all-zeros state
    return qml.probs(wires=range(n_qubits))

def quantum_kernel(x1, x2):
    """Compute the quantum kernel value between two data points."""
    probs = kernel_circuit(x1, x2)
    # Probability of the |00...0> state
    return float(probs[0])

def compute_kernel_matrix(X1, X2):
    """Compute the kernel matrix between two sets of data points."""
    n1, n2 = len(X1), len(X2)
    K = np.zeros((n1, n2))
    for i in range(n1):
        for j in range(n2):
            K[i, j] = quantum_kernel(X1[i], X2[j])
    return K
# Compute kernel matrices
print("Computing training kernel matrix...")
K_train = compute_kernel_matrix(X_train, X_train)
print("Computing test kernel matrix...")
K_test = compute_kernel_matrix(X_test, X_train)
# Train SVM with precomputed quantum kernel
svm = SVC(kernel="precomputed")
svm.fit(K_train, y_train)
svm_preds = svm.predict(K_test)
svm_acc = accuracy_score(y_test, svm_preds)
print(f"\nQuantum Kernel SVM test accuracy: {svm_acc:.3f}")
# Compare with RBF kernel SVM
svm_rbf = SVC(kernel="rbf")
svm_rbf.fit(X_train, y_train)
rbf_preds = svm_rbf.predict(X_test)
rbf_acc = accuracy_score(y_test, rbf_preds)
print(f"Classical RBF SVM test accuracy: {rbf_acc:.3f}")
When to Use Kernel Methods vs. Variational Methods
Quantum kernel methods have a key advantage: they separate the quantum computation (kernel evaluation) from the classical optimization (SVM training). The kernel matrix is computed once, and the SVM solver is guaranteed to find the global optimum. There are no barren plateaus or vanishing gradients.
The trade-off is cost. Computing the full kernel matrix requires O(N²) circuit evaluations for N training samples. For large datasets, this is more expensive than the O(N * epochs) evaluations needed for VQC training with mini-batches.
Use kernel methods when: the dataset is small (under a few hundred samples), you want guaranteed convergence, or you need the kernel for interpretability. Use variational methods when: the dataset is large, you need fast inference (the trained VQC applies one circuit per input), or you want to co-optimize the encoding and classification.
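The cost trade-off above is easy to make concrete. The helper functions and the sample counts below are illustrative (the symmetric training Gram matrix could be roughly halved; gradient evaluations multiply the variational count by a constant factor; both are ignored here):

```python
def kernel_evals(n_train, n_test):
    # Full training Gram matrix plus the test-vs-train block
    return n_train * n_train + n_test * n_train

def vqc_evals(n_train, n_epochs):
    # One forward circuit evaluation per sample per epoch
    return n_train * n_epochs

# 150 training samples, 50 test samples, 50 epochs
print(kernel_evals(150, 50))  # 150^2 + 50*150 = 30000 circuit runs
print(vqc_evals(150, 50))     # 150*50 = 7500 circuit runs
```

The quadratic term dominates quickly: doubling the training set quadruples the kernel cost but only doubles the variational cost.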
Noise Impact on Classification
Real quantum hardware introduces noise through decoherence, gate errors, and measurement errors. Understanding how noise affects VQC performance is critical for deploying on real devices. PennyLane’s mixed-state simulator lets us model these effects.
import matplotlib
matplotlib.use("Agg")
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
n_qubits = 2
n_layers = 3
def create_noisy_circuit(noise_prob):
    """Create a VQC circuit with depolarizing noise after each gate."""
    dev = qml.device("default.mixed", wires=n_qubits)

    @qml.qnode(dev)
    def noisy_circuit(params, x):
        # Angle encoding with noise
        for i in range(n_qubits):
            qml.RY(x[i], wires=i)
            if noise_prob > 0:
                qml.DepolarizingChannel(noise_prob, wires=i)
        # Variational layers with noise after each gate
        for layer in range(n_layers):
            for qubit in range(n_qubits):
                qml.Rot(
                    params[layer, qubit, 0],
                    params[layer, qubit, 1],
                    params[layer, qubit, 2],
                    wires=qubit,
                )
                if noise_prob > 0:
                    qml.DepolarizingChannel(noise_prob, wires=qubit)
            for qubit in range(n_qubits - 1):
                qml.CNOT(wires=[qubit, qubit + 1])
                if noise_prob > 0:
                    qml.DepolarizingChannel(noise_prob, wires=qubit)
                    qml.DepolarizingChannel(noise_prob, wires=qubit + 1)
        return qml.expval(qml.PauliZ(0))

    return noisy_circuit
def sigmoid(z):
    return 1.0 / (1.0 + pnp.exp(-z))

def train_noisy_vqc(noise_prob, n_epochs=40):
    """Train a VQC with a given noise level and return test accuracy."""
    circ = create_noisy_circuit(noise_prob)

    def bce_loss(params, X_batch, y_batch):
        total_loss = 0.0
        for x, label in zip(X_batch, y_batch):
            pred = circ(params, x)
            y_01 = (label + 1) / 2
            prob = sigmoid(pred)
            prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
            total_loss += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
        return total_loss / len(X_batch)

    np.random.seed(42)
    params = pnp.array(
        np.random.uniform(-np.pi / 4, np.pi / 4, size=(n_layers, n_qubits, 3)),
        requires_grad=True,
    )
    opt = qml.AdamOptimizer(stepsize=0.05)
    batch_size = 16
    for epoch in range(n_epochs):
        perm = np.random.permutation(len(X_train))
        X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
        for start in range(0, len(X_train), batch_size):
            X_batch = X_shuf[start : start + batch_size]
            y_batch = y_shuf[start : start + batch_size]
            params, _ = opt.step_and_cost(lambda p: bce_loss(p, X_batch, y_batch), params)
    preds = np.sign(np.array([float(circ(params, x)) for x in X_test]))
    test_acc = np.mean(preds == y_test_pm)
    return test_acc, params
# Train at three noise levels
noise_levels = [0.0, 0.01, 0.05]
results = {}
for p in noise_levels:
    print(f"\nTraining with noise p={p}...")
    acc, trained_params = train_noisy_vqc(p, n_epochs=40)
    results[p] = acc
    print(f" Test accuracy: {acc:.3f}")
print("\n--- Noise Impact Summary ---")
for p, acc in results.items():
    print(f" p={p:.2f}: test accuracy = {acc:.3f}")
Typical observations:
- Noiseless (p=0.0): Highest training accuracy, but may overfit slightly to the training data.
- Mild noise (p=0.01): Accuracy stays close to the noiseless case and may even improve slightly on the test set. The noise acts like regularization, preventing the model from fitting too precisely to training-set idiosyncrasies.
- Strong noise (p=0.05): Accuracy drops noticeably. The depolarizing channel pushes all expectation values toward zero, reducing the circuit’s ability to distinguish classes. The decision boundary becomes fuzzy and less useful.
This pattern, where mild noise helps generalization and strong noise destroys it, is analogous to dropout regularization in classical neural networks. It suggests that near-term noisy quantum devices may be able to produce useful classifiers if the noise is kept below a threshold.
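The "expectation values pushed toward zero" effect can be quantified. For a single-qubit depolarizing channel with parameter p in PennyLane's convention, rho maps to (1-p) rho + (p/3)(X rho X + Y rho Y + Z rho Z), so each application shrinks the Z expectation value by a factor of (1 - 4p/3). A sketch of the cumulative contraction, under the simplifying assumption that k channels act directly on the readout qubit:

```python
def z_contraction(p, k):
    # Each single-qubit depolarizing channel scales <Z> by (1 - 4p/3);
    # k sequential channels compound the shrinkage
    return (1 - 4 * p / 3) ** k

# An ideal signal of <Z> = 0.9 with ~8 noisy gates on the readout path
for p in [0.0, 0.01, 0.05]:
    print(f"p={p}: <Z> shrinks to ~{0.9 * z_contraction(p, 8):.3f}")
```

At p=0.05, eight channels already cut the signal to roughly 58% of its noiseless value, which is why the decision boundary blurs: the sigmoid sees inputs compressed toward zero, and class probabilities crowd around 0.5.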
Hardware-Efficient Ansatz Selection
Choosing the right ansatz for a VQC involves balancing three criteria:
- Expressibility: Can the ansatz represent the target function? A more expressive ansatz can approximate a wider range of decision boundaries, but also has more parameters to train.
- Entanglement capability: How much entanglement can the ansatz create? Entanglement is what gives a quantum circuit power beyond independent single-qubit operations. Insufficient entanglement means the circuit is effectively classical.
- Hardware efficiency: How many native 2-qubit gates does the ansatz require? On real hardware, 2-qubit gates are 10-100x noisier than single-qubit gates. Minimizing their count reduces total error.
The following code compares three ansatze on the moons dataset:
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
n_qubits = 2
n_layers = 3
dev = qml.device("default.qubit", wires=n_qubits)
def sigmoid(z):
    return 1.0 / (1.0 + pnp.exp(-z))

# --- Ansatz 1: Manual CNOT chain + Rot layers ---
@qml.qnode(dev)
def circuit_manual(params, x):
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    for layer in range(n_layers):
        for qubit in range(n_qubits):
            qml.Rot(params[layer, qubit, 0], params[layer, qubit, 1],
                    params[layer, qubit, 2], wires=qubit)
        for qubit in range(n_qubits - 1):
            qml.CNOT(wires=[qubit, qubit + 1])
    return qml.expval(qml.PauliZ(0))

# --- Ansatz 2: StronglyEntanglingLayers ---
@qml.qnode(dev)
def circuit_strong(params, x):
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
    return qml.expval(qml.PauliZ(0))

# --- Ansatz 3: SimplifiedTwoDesign ---
@qml.qnode(dev)
def circuit_simple(initial_layer_weights, weights, x):
    for i in range(n_qubits):
        qml.RY(x[i], wires=i)
    qml.SimplifiedTwoDesign(
        initial_layer_weights=initial_layer_weights,
        weights=weights,
        wires=range(n_qubits),
    )
    return qml.expval(qml.PauliZ(0))
def train_ansatz(circuit_fn, params_init, n_epochs=40, label=""):
    """Train a circuit and return final test accuracy."""
    if isinstance(params_init, tuple):
        # SimplifiedTwoDesign has two parameter groups
        params = tuple(
            pnp.array(p.copy(), requires_grad=True) for p in params_init
        )
    else:
        params = pnp.array(params_init.copy(), requires_grad=True)
    opt = qml.AdamOptimizer(stepsize=0.05)
    batch_size = 16
    for epoch in range(n_epochs):
        perm = np.random.permutation(len(X_train))
        X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
        for start in range(0, len(X_train), batch_size):
            X_batch = X_shuf[start : start + batch_size]
            y_batch = y_shuf[start : start + batch_size]

            def cost(*p):
                total = 0.0
                for x, lab in zip(X_batch, y_batch):
                    pred = circuit_fn(*p, x)
                    y_01 = (lab + 1) / 2
                    prob = sigmoid(pred)
                    prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
                    total += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
                return total / len(X_batch)

            if isinstance(params, tuple):
                # With multiple trainable args, step_and_cost returns
                # (new_args, cost), where new_args holds the updated arrays
                new_params, _ = opt.step_and_cost(cost, *params)
                params = tuple(new_params)
            else:
                params, _ = opt.step_and_cost(cost, params)
    # Evaluate
    if isinstance(params, tuple):
        preds = np.sign(np.array([float(circuit_fn(*params, x)) for x in X_test]))
    else:
        preds = np.sign(np.array([float(circuit_fn(params, x)) for x in X_test]))
    acc = np.mean(preds == y_test_pm)
    print(f"{label}: test accuracy = {acc:.3f}")
    return acc
np.random.seed(42)
# Ansatz 1: Manual
p1 = np.random.uniform(-np.pi / 4, np.pi / 4, size=(n_layers, n_qubits, 3))
manual_params_count = n_layers * n_qubits * 3
manual_cnots = n_layers * (n_qubits - 1)
# Ansatz 2: StronglyEntanglingLayers
shape_sel = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
p2 = np.random.uniform(-np.pi / 4, np.pi / 4, size=shape_sel)
sel_params_count = np.prod(shape_sel)
sel_cnots = n_layers * n_qubits # One CNOT per qubit per layer
# Ansatz 3: SimplifiedTwoDesign (its entanglers are CZ gates, not CNOTs)
init_weights = np.random.uniform(-np.pi / 4, np.pi / 4, size=(n_qubits,))
layer_weights = np.random.uniform(-np.pi / 4, np.pi / 4, size=(n_layers, n_qubits - 1, 2))
simple_params_count = n_qubits + n_layers * (n_qubits - 1) * 2
simple_cnots = n_layers * (n_qubits - 1)  # CZ count per circuit
print("Ansatz Comparison:")
print(f" Manual Rot+CNOT: {manual_params_count} params, {manual_cnots} 2-qubit gates/circuit")
print(f" StronglyEntanglingLayers: {sel_params_count} params, {sel_cnots} 2-qubit gates/circuit")
print(f" SimplifiedTwoDesign: {simple_params_count} params, {simple_cnots} 2-qubit gates/circuit")
print()
acc1 = train_ansatz(circuit_manual, p1, label="Manual Rot+CNOT")
acc2 = train_ansatz(circuit_strong, p2, label="StronglyEntanglingLayers")
acc3 = train_ansatz(circuit_simple, (init_weights, layer_weights), label="SimplifiedTwoDesign")
For a 2-qubit moons classifier, all three ansatze typically achieve similar accuracy (88-93%). The differences become meaningful on larger circuits. The recommendation is to start with the simplest ansatz that achieves acceptable accuracy. If SimplifiedTwoDesign works, there is no reason to use the more parameter-heavy alternatives. Fewer parameters means faster training and reduced risk of barren plateaus.
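The resource gap the comparison hints at grows with qubit count. A small pure-Python helper makes this visible; the counting rules follow the three templates above (3 parameters per qubit per layer for the manual and StronglyEntanglingLayers ansatze, n + 2(n-1) per layer for SimplifiedTwoDesign, and the default one-entangler-per-qubit ring pattern assumed for StronglyEntanglingLayers):

```python
def ansatz_resources(n_qubits, n_layers):
    # (parameter count, two-qubit gate count) for each ansatz
    return {
        "manual": (3 * n_qubits * n_layers, (n_qubits - 1) * n_layers),
        "strongly_entangling": (3 * n_qubits * n_layers, n_qubits * n_layers),
        "simplified_two_design": (
            n_qubits + 2 * (n_qubits - 1) * n_layers,
            (n_qubits - 1) * n_layers,  # CZ entanglers
        ),
    }

# At 8 qubits and 4 layers the parameter counts diverge noticeably
for name, (n_params, n_twoq) in ansatz_resources(8, 4).items():
    print(f"{name}: {n_params} params, {n_twoq} two-qubit gates")
```

At 8 qubits and 4 layers, SimplifiedTwoDesign uses 64 parameters against 96 for the other two, with the same two-qubit gate count as the manual chain, which is exactly the kind of saving that matters on noisy hardware.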
Hyperparameter Tuning
VQCs have several hyperparameters that significantly affect performance. The most important are:
- Number of layers (L): Controls circuit depth and parameter count. Too few layers limits expressibility; too many causes barren plateaus and slow convergence.
- Learning rate: Standard trade-off: too high causes oscillation, too low causes slow convergence.
- Encoding method: Determines the quantum feature space (covered above).
- Batch size: Larger batches give more stable gradients but slower epochs.
The following grid search explores layers and learning rate:
import pennylane as qml
from pennylane import numpy as pnp
import numpy as np
from sklearn.datasets import make_moons
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
# Dataset
X, y = make_moons(n_samples=200, noise=0.2, random_state=42)
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.25, random_state=42)
y_train_pm = 2 * y_train - 1
y_test_pm = 2 * y_test - 1
n_qubits = 2
def sigmoid(z):
    return 1.0 / (1.0 + pnp.exp(-z))

def run_experiment(n_layers, lr, n_epochs=20):
    """Train VQC with given hyperparameters and return test accuracy."""
    dev = qml.device("default.qubit", wires=n_qubits)

    @qml.qnode(dev)
    def circ(params, x):
        for i in range(n_qubits):
            qml.RY(x[i], wires=i)
        qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
        return qml.expval(qml.PauliZ(0))

    shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
    np.random.seed(42)
    params = pnp.array(
        np.random.uniform(-np.pi / 4, np.pi / 4, size=shape), requires_grad=True
    )
    opt = qml.AdamOptimizer(stepsize=lr)
    batch_size = 16
    for epoch in range(n_epochs):
        perm = np.random.permutation(len(X_train))
        X_shuf, y_shuf = X_train[perm], y_train_pm[perm]
        for start in range(0, len(X_train), batch_size):
            X_batch = X_shuf[start : start + batch_size]
            y_batch = y_shuf[start : start + batch_size]

            def cost(p):
                total = 0.0
                for x, lab in zip(X_batch, y_batch):
                    pred = circ(p, x)
                    y_01 = (lab + 1) / 2
                    prob = sigmoid(pred)
                    prob = pnp.clip(prob, 1e-7, 1 - 1e-7)
                    total += -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
                return total / len(X_batch)

            params, _ = opt.step_and_cost(cost, params)
    preds = np.sign(np.array([float(circ(params, x)) for x in X_test]))
    return np.mean(preds == y_test_pm)
# Grid search
layers_options = [1, 2, 3, 4]
lr_options = [0.01, 0.05, 0.1]
print("Hyperparameter Grid Search (20 epochs each)")
print("-" * 55)
print(f"{'Layers':<10} {'LR=0.01':<15} {'LR=0.05':<15} {'LR=0.1':<15}")
print("-" * 55)
results = {}
for n_layers in layers_options:
    row = []
    for lr in lr_options:
        acc = run_experiment(n_layers, lr, n_epochs=20)
        results[(n_layers, lr)] = acc
        row.append(f"{acc:.3f}")
    print(f"{n_layers:<10} {' '.join(f'{r:<13}' for r in row)}")
print("-" * 55)
# Find best combination
best_key = max(results, key=results.get)
print(f"\nBest: {best_key[0]} layers, lr={best_key[1]} -> accuracy={results[best_key]:.3f}")
For the moons dataset, 2-3 layers with a learning rate of 0.05 is typically optimal. Adding a 4th layer does not improve accuracy and can hurt convergence because the additional parameters make the loss landscape harder to navigate. A learning rate of 0.01 is too slow for 20 epochs, while 0.1 can cause instability.
Common Mistakes
Building VQCs involves several subtle pitfalls. This section covers the most common ones and how to avoid them.
1. Using regular NumPy instead of PennyLane NumPy
PennyLane’s automatic differentiation requires that all differentiable operations use pennylane.numpy (aliased as pnp). If you use standard numpy for parameter arrays, PennyLane cannot compute gradients.
# WRONG: standard numpy breaks automatic differentiation
import numpy as np
params = np.array([0.1, 0.2, 0.3]) # No gradient tracking
# CORRECT: use pennylane.numpy with requires_grad=True
from pennylane import numpy as pnp
params = pnp.array([0.1, 0.2, 0.3], requires_grad=True)
Standard numpy is fine for non-differentiable operations like data loading, indexing, and post-processing predictions. The rule is: any array that the optimizer needs to differentiate through must be a pnp array with requires_grad=True.
2. Not scaling data before angle encoding
Angle encoding applies RY(x_i), which has a period of 2π. If features are in wildly different ranges (for example, one feature ranges from 0 to 1000), the rotation wraps around many times and the encoding loses meaningful structure. Always scale features to a range like [0, π] or [-π, π] before encoding.
# WRONG: raw features can be outside the useful range
qml.RY(raw_feature, wires=0) # If raw_feature = 500, this wraps ~80 times
# CORRECT: scale first
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler(feature_range=(0, np.pi))
X_scaled = scaler.fit_transform(X)
qml.RY(X_scaled[0, 0], wires=0) # Now in [0, pi], meaningful range
3. Mismatched loss function and label format
Binary cross-entropy (BCE) expects labels in {0, 1} and probabilities in (0, 1). The Pauli-Z expectation value gives outputs in [-1, +1]. You need to be consistent about which format you use. There are two valid approaches:
Approach A: Use {-1, +1} labels with MSE loss and raw expectation values.
# Approach A: MSE with {-1, +1} labels
y_pm = 2 * y - 1 # Labels in {-1, +1}
loss = pnp.mean((circuit_output - y_pm) ** 2)
Approach B: Use {0, 1} labels with BCE loss and sigmoid-transformed expectation values.
# Approach B: BCE with {0, 1} labels
y_01 = (y_pm + 1) / 2 # Convert to {0, 1}
prob = sigmoid(circuit_output) # Map [-1, +1] to (0, 1)
loss = -(y_01 * pnp.log(prob) + (1 - y_01) * pnp.log(1 - prob))
Mixing these (for example, using BCE with {-1, +1} labels) produces nonsensical gradients.
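The failure mode is easy to verify numerically. Plugging a {-1, +1} label into the BCE formula produces a "loss" that is unbounded below, so minimizing it is meaningless. A small check with made-up probability values:

```python
import math

def bce(y, prob):
    # Binary cross-entropy for a single sample; valid only for y in {0, 1}
    return -(y * math.log(prob) + (1 - y) * math.log(1 - prob))

# Valid usage: label 1 with a confident correct prediction -> small loss
print(bce(1, 0.9))

# Invalid usage: a {-1, +1} label fed to BCE. The (1 - y) factor
# becomes 2, the result can go negative, and it diverges to -inf as
# prob -> 0, so the optimizer chases a bottomless objective.
print(bce(-1, 0.1))
```

The second call returns a negative number, which is the telltale sign: cross-entropy of a true probability distribution is never negative.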
4. Forgetting requires_grad=True
In recent PennyLane versions, pnp.array actually defaults to requires_grad=True, but relying on the default is fragile: converting a parameter array through plain numpy (np.array, np.copy) silently drops gradient tracking, and data arrays should be explicitly non-trainable so the optimizer does not try to update them. Being explicit on both sides removes the ambiguity.
# AMBIGUOUS: trainability depends on defaults and on how the array was built
params = pnp.array(np.random.randn(3, 2, 3))
# EXPLICIT: mark parameters as differentiable and data as fixed
params = pnp.array(np.random.randn(3, 2, 3), requires_grad=True)
features = pnp.array(X_train, requires_grad=False)
If the parameters end up with requires_grad=False, the optimizer sees nothing to train. A common symptom of this mistake is that the loss stays constant across all epochs.
5. Too many qubits for too little data
More qubits means a larger Hilbert space, which sounds good for expressibility. But the barren plateau phenomenon means that gradient variance decreases exponentially with qubit count. For a small 2-feature dataset like moons, using 8 qubits (with redundant encoding) makes the circuit much harder to train without any benefit.
The rule of thumb: use the minimum number of qubits needed for the encoding. For angle encoding, that is one qubit per feature. For amplitude encoding, that is log₂(features) qubits. Adding extra “ancilla” qubits for expressibility is rarely worth the barren plateau cost.
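The rule of thumb above can be written down directly. A minimal sketch (the function name and the two encoding labels are illustrative, not a PennyLane API):

```python
import math

def qubits_needed(n_features, encoding):
    # Angle encoding: one rotation, hence one qubit, per feature
    # Amplitude encoding: features packed into 2^n amplitudes
    if encoding == "angle":
        return n_features
    if encoding == "amplitude":
        return math.ceil(math.log2(n_features))
    raise ValueError(f"unknown encoding: {encoding}")

print(qubits_needed(2, "angle"))       # moons dataset: 2 qubits
print(qubits_needed(64, "amplitude"))  # 64 features fit in 6 qubits
```

The ceiling handles feature counts that are not powers of two; the leftover amplitudes are simply padded with zeros before normalization.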
6. Measuring all qubits when only one is needed
For binary classification, you only need one expectation value. Measuring all qubits and trying to combine them adds complexity without clear benefit, and it discards the entanglement structure that the circuit created.
# WRONG (for binary classification): measuring everything
return [qml.expval(qml.PauliZ(i)) for i in range(n_qubits)]
# CORRECT (for binary classification): measure the readout qubit
return qml.expval(qml.PauliZ(0))
For multi-class classification (k classes), you do measure k qubits, one per class. But for binary problems, stick with one qubit.
Interpretation and Limitations
The VQC here uses n_layers * n_qubits * 3 = 18 parameters, a tiny model. Classical neural networks with the same parameter count would perform comparably or better. The purpose of this tutorial is not to claim quantum advantage but to understand the mechanics.
Key takeaways:
- Angle encoding is a direct and natural interface between classical data and qubit rotations.
- Entangling layers are what give the circuit expressive power beyond independent single-qubit operations.
- Gradient computation relies on PennyLane's automatic differentiation. On the default.qubit simulator this is typically backpropagation; on hardware (or with diff_method="parameter-shift") PennyLane uses the parameter-shift rule, which obtains exact gradients by evaluating the circuit at shifted parameter values, with no finite differences.
- Barren plateaus (exponentially small gradients for large circuits) are a known challenge for VQCs on more qubits. Starting with shallow circuits and careful initialization helps.
- Quantum kernel methods offer an alternative that avoids variational training entirely but require O(N²) circuit evaluations for the kernel matrix.
- Noise at mild levels can act as regularization, but strong noise destroys classification ability.
- The choice of ansatz, encoding, and hyperparameters all matter, and systematic exploration is essential.
For real quantum advantage in classification you would look at quantum kernel methods where the quantum circuit defines a kernel that may be classically hard to evaluate. That is the direction the research community is actively pursuing.