Quantum Natural Gradient in PennyLane
Implement quantum natural gradient descent in PennyLane using the quantum Fisher information matrix. Compare convergence against standard gradient descent on a VQE benchmark.
Classical Natural Gradient: Where the Idea Begins
Before touching quantum circuits, it helps to understand the classical natural gradient and why it exists. The idea originates from Shun-ichi Amari’s 1998 paper on information geometry. His key observation: the parameter space of a statistical model is not flat. Moving one unit in parameter space does not always produce the same change in the model’s output distribution. Standard gradient descent ignores this curvature and treats all directions equally, which leads to slow, zigzagging convergence paths.
Consider a neural network with parameters theta that defines a conditional distribution p(y|x, theta). The standard gradient update is:
theta_{t+1} = theta_t - lr * grad(L)
This moves in the steepest descent direction measured by Euclidean distance in parameter space. But the “right” distance metric for probability distributions is the Kullback-Leibler (KL) divergence. The steepest descent direction under KL divergence is the natural gradient:
theta_{t+1} = theta_t - lr * F^(-1) * grad(L)
where F is the classical Fisher information matrix (FIM):
F_{ij} = E_{y ~ p(y|theta)}[ (d/d_theta_i) log p(y|theta) * (d/d_theta_j) log p(y|theta) ]
The FIM acts as a metric tensor on the parameter manifold. Multiplying the gradient by F^(-1) “straightens out” the curved geometry. In flat regions (where many parameter directions produce nearly identical distributions), the FIM has small eigenvalues, and F^(-1) amplifies the gradient in those directions. In steep regions, it dampens overshoot.
For neural networks, computing the exact FIM is intractable because it requires an expectation over the full data distribution. Practitioners use the empirical Fisher (an average of outer products of per-sample gradients) as an approximation. For large networks, even the empirical Fisher is too expensive to store as a full matrix. The KFAC (Kronecker-Factored Approximate Curvature) method approximates the FIM as a Kronecker product of smaller matrices, one per layer, reducing storage from O(p^2) to O(p).
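A toy numpy example makes the rescaling concrete. The one-dimensional Gaussian family has a closed-form Fisher information matrix, so we can compare plain gradient descent against the natural gradient directly. This is a self-contained sketch; the data, learning rate, and step count are illustrative:

```python
import numpy as np

# Fit the mean and log-std of a Gaussian N(mu, exp(2s)) by maximum
# likelihood. For this family the FIM is analytic: F = diag(exp(-2s), 2),
# so the natural gradient simply rescales each coordinate.
rng = np.random.default_rng(0)
data = rng.normal(3.0, 0.5, size=1000)

def nll_grad(mu, s):
    # Gradient of the mean negative log-likelihood of N(mu, exp(2s))
    var = np.exp(2 * s)
    d_mu = (mu - data.mean()) / var
    d_s = 1.0 - ((data - mu) ** 2).mean() / var
    return np.array([d_mu, d_s])

def fisher(s):
    # Analytic FIM for parameters (mu, s): diag(1/var, 2)
    return np.diag([np.exp(-2 * s), 2.0])

theta_gd = np.array([0.0, 1.0])  # (mu, log-std), deliberately far off
theta_ng = theta_gd.copy()
lr = 0.1
for _ in range(200):
    theta_gd = theta_gd - lr * nll_grad(*theta_gd)
    g = nll_grad(*theta_ng)
    theta_ng = theta_ng - lr * np.linalg.inv(fisher(theta_ng[1])) @ g

print("GD :", theta_gd)  # mu converges slowly while the variance is large
print("NG :", theta_ng)  # recovers mu ~ 3.0 and sigma = exp(s) ~ 0.5
```

The natural gradient update for mu reduces to `mu -= lr * (mu - mean)`, independent of the current variance: exactly the "straightening out" described above.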
The quantum natural gradient applies the same principle to variational quantum circuits, replacing the classical FIM with the quantum Fisher information matrix (QFIM).
Why Standard Gradient Descent Struggles on Quantum Circuits
Variational quantum circuits have a parameter space with non-Euclidean geometry. The Euclidean distance between two parameter vectors does not reflect the actual distance between the quantum states they produce. Two points that are close in parameter space can produce very different quantum states, and vice versa.
Standard gradient descent treats all parameters as if they live in flat Euclidean space. This leads to slow convergence and sensitivity to the parameterization: rotating your basis or reparameterizing the circuit changes the convergence rate even though the optimization problem is identical.
Natural gradient descent corrects for this by using the Fisher information matrix to measure distances in the space of probability distributions. For quantum circuits, the relevant quantity is the quantum Fisher information matrix (QFIM), which measures the Bures distance between quantum states.
This problem becomes more severe as circuits grow deeper. In a hardware-efficient ansatz with many layers, some parameter combinations produce large state changes while others produce nearly none. Gradient descent cannot distinguish between these cases and allocates equal step sizes to both, wasting optimization budget on insensitive directions and overshooting on sensitive ones.
The Geometry of Quantum Circuit Parameter Space
A parameterized quantum circuit maps parameters theta to quantum states |psi(theta)>. The set of all states reachable by the circuit forms a manifold embedded in Hilbert space. This manifold inherits a natural distance metric from the Hilbert space inner product, and that metric is the Fubini-Study metric (equivalently, the Bures metric for pure states).
The infinitesimal distance between neighboring states |psi(theta)> and |psi(theta + d_theta)> is given by:
ds^2 = (1/4) * d_theta^T * F * d_theta
where F is the quantum Fisher information matrix. This is the Bures distance to second order. The factor of 1/4 is a convention that ensures consistency with the standard quantum Fisher information definition.
The geometric picture clarifies what QNG does. In flat Euclidean space, the steepest descent direction is simply the negative gradient. On a curved manifold, the steepest descent follows geodesics, and the natural gradient computes the tangent vector to the geodesic that decreases the cost function most rapidly. The update F^(-1) * grad projects the Euclidean gradient onto this geodesic direction.
This geometry also explains barren plateaus from a new angle. A barren plateau occurs when large regions of parameter space map to nearly identical quantum states. In geometric language, the manifold is nearly degenerate: it has very low curvature in most directions. The QFIM eigenvalues in these flat regions are exponentially small, which means the gradient per unit of state-space distance is exponentially small. QNG compensates by amplifying the gradient in flat directions (dividing by the small eigenvalues of F), but this amplification has limits when the QFIM itself becomes numerically singular.
The Quantum Geometric Tensor
The QFIM is actually one part of a richer object: the Quantum Geometric Tensor (QGT). The full QGT is a complex matrix defined as:
g_{ij} = <d_i psi | d_j psi> - <d_i psi | psi><psi | d_j psi>
where |d_i psi> = d/d_theta_i |psi(theta)>. The QGT has two meaningful parts:
- Real part: Re[g_{ij}] = (1/4) * F_{ij}. This is the Fubini-Study metric tensor, proportional to the QFIM. It governs distances between quantum states in parameter space.
- Imaginary part: Im[g_{ij}] = (1/2) * Omega_{ij}, where Omega is the Berry curvature tensor. The Berry curvature governs geometric (Berry) phases acquired during adiabatic evolution around closed loops in parameter space. It is central to topological phenomena in condensed matter physics.
For optimization, we only need the real part (the QFIM). But the full QGT is interesting for characterizing the circuit’s geometry, and PennyLane gives you access to it.
In PennyLane, you can compute the full metric tensor using qml.metric_tensor:
```python
import pennylane as qml
from pennylane import numpy as np

dev = qml.device("default.qubit", wires=2)

@qml.qnode(dev, diff_method="parameter-shift")
def example_circuit(params):
    qml.RY(params[0], wires=0)
    qml.RY(params[1], wires=1)
    qml.CNOT(wires=[0, 1])
    qml.RZ(params[2], wires=1)
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

params = np.array([0.5, 0.8, 1.2], requires_grad=True)

# The metric tensor is the real part of the QGT (the Fubini-Study metric).
# This is (1/4) * QFIM.
mt = qml.metric_tensor(example_circuit)(params)
print(f"Metric tensor shape: {mt.shape}")
print(f"Metric tensor:\n{np.round(mt, 4)}")

# The QFIM is 4 * metric_tensor
qfim = 4 * mt
print(f"\nQFIM:\n{np.round(qfim, 4)}")
```
The qml.gradients.quantum_fisher function returns the full QFIM (4 times the metric tensor) directly. For optimization you rarely need it: PennyLane's QNGOptimizer works with qml.metric_tensor internally, and the factor of 4 is absorbed into the step size.
The Quantum Natural Gradient Update Rule
The standard gradient update is:
theta_{t+1} = theta_t - lr * grad(C)
The quantum natural gradient update is:
theta_{t+1} = theta_t - lr * F^+ * grad(C)
where F is the QFIM and F^+ is its Moore-Penrose pseudoinverse. The pseudoinverse is used instead of the regular inverse because F can be singular when some parameter directions produce no state change (redundant parameters).
For a circuit that prepares state |psi(theta)>, the QFIM element is:
F_{ij} = 4 * Re[<d_i psi | d_j psi> - <d_i psi | psi><psi | d_j psi>]
PennyLane computes this automatically inside QNGOptimizer. Understanding the formula helps you reason about what QNG is doing: it measures how much the quantum state changes when you perturb two parameters simultaneously, subtracting out the “trivial” overlap that comes from global phase changes.
Computing QFIM with Parameter-Shift Rules
Each element F_{ij} of the QFIM requires evaluating overlaps of parameter-shifted states. Here is why the cost scales as O(p^2).
The derivative |d_i psi> with respect to parameter theta_i can be expressed via the parameter-shift rule. For a gate of the form exp(-i * theta_i * G / 2), the derivative of the state involves evaluating the circuit at theta_i + pi/2 and theta_i - pi/2. Computing the overlap <d_i psi | d_j psi> then requires combining shifted circuits for both parameter i and parameter j.
Each off-diagonal element F_{ij} requires 4 circuit evaluations:
- Both theta_i and theta_j shifted by +pi/2
- theta_i shifted by +pi/2, theta_j shifted by -pi/2
- theta_i shifted by -pi/2, theta_j shifted by +pi/2
- Both theta_i and theta_j shifted by -pi/2
The diagonal elements F_{ii} each require 2 circuit evaluations. For p parameters, the total number of circuit evaluations is:
Total = 4 * p*(p-1)/2 (off-diagonal) + 2*p (diagonal) = 2*p^2
(Some implementations report 4*p^2 because they evaluate each pair independently without exploiting symmetry F_{ij} = F_{ji}.)
For our 4-qubit benchmark with 2 layers and 2 rotation gates per qubit per layer, we have p = 16 parameters. That means roughly 512 circuit evaluations per QNG step for the QFIM alone, plus 32 more for the gradient. Compare that to just 32 evaluations for standard gradient descent. The QFIM computation dominates the cost.
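The counting above can be wrapped in a two-line helper (illustrative bookkeeping, not a PennyLane API):

```python
# Circuit-evaluation count for one QNG step with the symmetric full QFIM
# (parameter-shift rule), matching the formulas above.
def qng_step_evals(p):
    qfim = 4 * (p * (p - 1) // 2) + 2 * p   # off-diagonal + diagonal = 2 * p**2
    grad = 2 * p                            # parameter-shift gradient
    return qfim, grad

print(qng_step_evals(16))  # (512, 32) for the 16-parameter benchmark
```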
Setting Up the VQE Benchmark
We use a 4-qubit transverse-field Ising model as the benchmark. The Hamiltonian is:
H = -J * sum_i Z_i Z_{i+1} - h * sum_i X_i
This model has a well-defined ground state, and the circuit has enough parameters to show convergence differences between optimizers.
```python
import pennylane as qml
from pennylane import numpy as np
import matplotlib.pyplot as plt

# Define the Hamiltonian: transverse-field Ising model
n_qubits = 4
J = 1.0  # ZZ coupling
h = 0.5  # transverse field (X terms)

coeffs = []
obs = []
for i in range(n_qubits - 1):
    coeffs.append(-J)
    obs.append(qml.PauliZ(i) @ qml.PauliZ(i + 1))
for i in range(n_qubits):
    coeffs.append(-h)
    obs.append(qml.PauliX(i))

H = qml.Hamiltonian(coeffs, obs)

# Exact ground state energy for reference (via numpy)
H_matrix = qml.matrix(H, wire_order=list(range(n_qubits)))
exact_energy = float(np.min(np.linalg.eigvalsh(H_matrix)))
print(f"Exact ground state energy: {exact_energy:.6f}")

# Ansatz: hardware-efficient 2-layer RY/RZ + CNOT ring
dev = qml.device("default.qubit", wires=n_qubits)
n_layers = 2

@qml.qnode(dev, diff_method="parameter-shift")
def circuit(params):
    for layer in range(n_layers):
        for q in range(n_qubits):
            qml.RY(params[layer, q, 0], wires=q)
            qml.RZ(params[layer, q, 1], wires=q)
        for q in range(n_qubits):
            qml.CNOT(wires=[q, (q + 1) % n_qubits])
    return qml.expval(H)
```
Running Quantum Natural Gradient
PennyLane provides QNGOptimizer which computes the metric tensor and applies the natural gradient automatically.
```python
from pennylane.optimize import QNGOptimizer, GradientDescentOptimizer

n_steps = 80
np.random.seed(42)
init_params = np.random.uniform(0, np.pi, (n_layers, n_qubits, 2), requires_grad=True)

# ---- Standard gradient descent ----
params_gd = init_params.copy()
optimizer_gd = GradientDescentOptimizer(stepsize=0.1)
energies_gd = []
for step in range(n_steps):
    params_gd, energy = optimizer_gd.step_and_cost(circuit, params_gd)
    energies_gd.append(float(energy))

# ---- Quantum natural gradient ----
params_qng = init_params.copy()
# lam adds regularization to the metric tensor pseudoinverse (prevents instability)
optimizer_qng = QNGOptimizer(stepsize=0.1, lam=0.001)
energies_qng = []
for step in range(n_steps):
    params_qng, energy = optimizer_qng.step_and_cost(circuit, params_qng)
    energies_qng.append(float(energy))
    if step % 20 == 0:
        print(f"QNG step {step:3d}: energy = {energy:.6f}")

# Compare final energies
print(f"\nExact energy:     {exact_energy:.6f}")
print(f"GD final energy:  {energies_gd[-1]:.6f}")
print(f"QNG final energy: {energies_qng[-1]:.6f}")
```
Visualising Convergence
```python
import matplotlib.pyplot as plt

plt.figure(figsize=(8, 4))
plt.plot(energies_gd, label="Gradient Descent", color="steelblue")
plt.plot(energies_qng, label="Quantum Natural Gradient", color="darkorange")
plt.axhline(exact_energy, color="black", linestyle="--", label=f"Exact ({exact_energy:.3f})")
plt.xlabel("Optimisation step")
plt.ylabel("Energy")
plt.title("QNG vs Standard Gradient Descent: 4-qubit Ising Model")
plt.legend()
plt.tight_layout()
plt.savefig("qng_comparison.png", dpi=150)
plt.show()

# Typical result: QNG converges in ~20-30 steps, GD needs 60-80 steps
```
Computing the QFIM Directly
You can inspect the quantum Fisher information matrix explicitly. This is useful for diagnosing circuit geometry, detecting barren plateaus, and understanding why QNG outperforms standard methods.
```python
# Compute the QFIM at a specific parameter value
params_test = np.zeros((n_layers, n_qubits, 2), requires_grad=True)

# quantum_fisher returns the full QFIM (= 4 * metric_tensor)
qfim = qml.gradients.quantum_fisher(circuit)(params_test)
print(f"QFIM shape: {qfim.shape}")

# Flatten to a 2D matrix for linear algebra
n_params = n_layers * n_qubits * 2  # 16 parameters
qfim_2d = np.array(qfim).reshape(n_params, n_params)
print(f"QFIM condition number: {np.linalg.cond(qfim_2d):.1f}")

# A high condition number indicates ill-conditioned geometry,
# which often correlates with slow convergence of standard GD
eigenvalues = np.sort(np.linalg.eigvalsh(qfim_2d))[::-1]
print(f"Top 5 QFIM eigenvalues: {eigenvalues[:5]}")
print(f"Bottom 5 QFIM eigenvalues: {eigenvalues[-5:]}")
```
QFIM Eigenvalue Spectrum and Barren Plateau Diagnosis
The eigenvalue spectrum of the QFIM reveals whether your circuit is in a barren plateau and whether QNG will be effective.
In a healthy circuit (away from a barren plateau), the QFIM eigenvalues span a moderate range. There may be a few small eigenvalues corresponding to redundant parameters, but most eigenvalues are of order 1. In a barren plateau, most QFIM eigenvalues are exponentially small in the number of qubits. This means the circuit is nearly insensitive to parameter changes in most directions.
When the QFIM has very small eigenvalues, the QNG update F^+ * grad divides by those small values. This amplification is helpful when the small eigenvalues correspond to “flat but still informative” directions. But in a genuine barren plateau, the gradient itself is also exponentially small, so dividing a tiny gradient by a tiny eigenvalue yields a noisy, unreliable update.
The lam regularization parameter in QNGOptimizer prevents this blow-up. It adds lambda * I to the metric tensor before inversion:
(g + lambda * I)^(-1) * grad
This caps the maximum amplification at 1/lambda, trading off exact natural gradient geometry for numerical stability.
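A quick back-of-the-envelope check shows the cap: along an eigenvector of the metric tensor with eigenvalue e, the gradient component is scaled by 1/(e + lambda), which saturates at 1/lambda as e shrinks:

```python
# Amplification cap from the regularizer: the scaling factor 1/(e + lam)
# can never exceed 1/lam, no matter how small the eigenvalue e gets.
lam = 0.001
for e in [1.0, 1e-2, 1e-4, 1e-8]:
    print(f"eigenvalue {e:8.0e} -> amplification {1 / (e + lam):10.1f}")
# As e -> 0 the amplification saturates near 1/lam = 1000
```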
```python
# Visualize the QFIM eigenvalue spectrum at different parameter values
fig, axes = plt.subplots(1, 3, figsize=(14, 4))
for idx, (label, pvals) in enumerate([
    ("All zeros", np.zeros((n_layers, n_qubits, 2), requires_grad=True)),
    ("Small random", np.random.uniform(0, 0.1, (n_layers, n_qubits, 2), requires_grad=True)),
    ("Large random", np.random.uniform(0, 2 * np.pi, (n_layers, n_qubits, 2), requires_grad=True)),
]):
    qfim_check = qml.gradients.quantum_fisher(circuit)(pvals)
    qfim_flat = np.array(qfim_check).reshape(n_params, n_params)
    eigs = np.sort(np.linalg.eigvalsh(qfim_flat))[::-1]
    axes[idx].bar(range(len(eigs)), eigs, color="teal")
    axes[idx].set_title(f"QFIM eigenvalues: {label}")
    axes[idx].set_xlabel("Eigenvalue index")
    axes[idx].set_ylabel("Eigenvalue")
    axes[idx].set_yscale("log")
    axes[idx].set_ylim(bottom=1e-10)
plt.tight_layout()
plt.savefig("qfim_spectrum.png", dpi=150)
plt.show()
```
Practical guidelines for tuning lam:
- Start with lam=0.001 (a common choice; note that PennyLane's QNGOptimizer defaults to lam=0, i.e. no regularization).
- If training shows erratic jumps in energy, increase lam to 0.01 or 0.1.
- If QNG converges slowly (almost like standard GD), decrease lam: a large lambda suppresses the natural gradient correction and reverts toward vanilla gradient descent.
- Monitor the QFIM condition number during training. If it exceeds 10^6, increase lam.
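These guidelines can be folded into a small helper. The function below is a hypothetical heuristic for illustration, not a PennyLane feature:

```python
import numpy as np

# Raise lam whenever the metric tensor's condition number crosses the
# 1e6 threshold suggested above (illustrative heuristic only).
def adapt_lam(g, lam, threshold=1e6, factor=10.0, lam_max=0.1):
    cond = np.linalg.cond(g)
    if cond > threshold:
        lam = min(lam * factor, lam_max)
    return lam, cond

g_ill = np.diag([1.0, 1e-8])  # badly conditioned toy metric tensor
lam, cond = adapt_lam(g_ill, 0.001)
print(f"cond = {cond:.1e}, lam -> {lam}")
```

Calling this once per optimization step (with the metric tensor the optimizer just computed) costs nothing extra beyond the condition-number evaluation.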
Block-Diagonal and Diagonal QFIM Approximations
Computing the full QFIM is expensive. For p parameters, the full metric tensor requires O(p^2) circuit evaluations. PennyLane’s QNGOptimizer and qml.metric_tensor support several approximation levels that trade accuracy for speed.
Full QFIM: O(p^2) circuit evaluations. Exact but expensive. Use this for small circuits (p < 30) or for benchmarking.
Block-diagonal approximation: Approximate the metric tensor as block-diagonal, where each block corresponds to one layer of the circuit. This assumes that cross-layer correlations are small. The cost drops to O(p) because you only compute intra-layer elements.
Diagonal approximation: Only compute the diagonal elements g_{ii}. This requires O(p) evaluations (specifically, 2 shifted circuits per parameter). The diagonal approximation captures how sensitive the state is to each individual parameter but ignores all correlations between parameters. It is a reasonable approximation for circuits with low entanglement between layers.
```python
from pennylane.optimize import QNGOptimizer

# Full metric tensor: approx=None (note that "block-diag" is PennyLane's
# default). The full tensor's off-block elements are evaluated with
# Hadamard tests, which need one auxiliary wire, so the full-QFIM run
# uses a device with an extra wire.
dev_aux = qml.device("default.qubit", wires=n_qubits + 1)
circuit_aux = qml.QNode(circuit.func, dev_aux, diff_method="parameter-shift")
optimizer_full = QNGOptimizer(stepsize=0.1, lam=0.001, approx=None)

# Block-diagonal approximation
optimizer_block = QNGOptimizer(stepsize=0.1, lam=0.001, approx="block-diag")

# Diagonal approximation
optimizer_diag = QNGOptimizer(stepsize=0.1, lam=0.001, approx="diag")

# Run all three for comparison
np.random.seed(42)
init_params = np.random.uniform(0, np.pi, (n_layers, n_qubits, 2), requires_grad=True)
n_steps = 60
results = {}
for name, opt, qnode in [("QNG-full", optimizer_full, circuit_aux),
                         ("QNG-block", optimizer_block, circuit),
                         ("QNG-diag", optimizer_diag, circuit)]:
    params = init_params.copy()
    energies = []
    for step in range(n_steps):
        params, energy = opt.step_and_cost(qnode, params)
        energies.append(float(energy))
    results[name] = energies
    print(f"{name:12s} final energy: {energies[-1]:.6f}")

# Plot
plt.figure(figsize=(8, 4))
for name, energies in results.items():
    plt.plot(energies, label=name)
plt.axhline(exact_energy, color="black", linestyle="--", label="Exact")
plt.xlabel("Optimisation step")
plt.ylabel("Energy")
plt.title("QFIM Approximation Comparison")
plt.legend()
plt.tight_layout()
plt.savefig("qng_approximations.png", dpi=150)
plt.show()
```
When to use each approximation:
| Parameters (p) | Recommended | Circuits per QNG step |
|---|---|---|
| p < 30 | Full | ~2p^2 |
| 30 to 100 | Block-diagonal | ~2p |
| p > 100 | Diagonal | ~2p |
Stochastic Quantum Natural Gradient
Computing the full QFIM at every optimization step is the main bottleneck of QNG. Several strategies reduce this cost without abandoning the natural gradient entirely.
Subsampled QFIM: Instead of computing all p^2 elements of the metric tensor, randomly sample a subset of k parameters at each step and compute only the k x k submatrix. Update only those k parameters using the natural gradient, and update the remaining parameters using the vanilla gradient. Cycling through different subsets across steps covers the full parameter space over time.
Exponential moving average (EMA): Keep a running average of the metric tensor across steps:
g_avg_t = beta * g_avg_{t-1} + (1 - beta) * g_t
This amortizes the cost over many steps. You compute the full metric tensor every N steps (say N=5 or N=10) and reuse the averaged version in between. The metric tensor changes smoothly during optimization (since parameters change by small increments per step), so a slightly stale estimate is still much better than the identity matrix that standard GD implicitly uses.
This EMA approach is analogous to the second-moment tracking in Adam. In fact, the diagonal QNG with EMA is conceptually similar to Adam applied to quantum circuits, except that the QFIM diagonal captures quantum state geometry rather than just gradient magnitude statistics.
```python
# Manual implementation of EMA-QNG (conceptual, not a PennyLane built-in)
n_steps = 60
beta = 0.9            # EMA decay factor
recompute_every = 5   # Recompute metric tensor every N steps
lam = 0.001
lr = 0.1

params = init_params.copy()
energies_ema = []
g_avg = None
grad_fn = qml.grad(circuit)

for step in range(n_steps):
    # Compute the gradient at the current parameters
    grad_val = grad_fn(params)

    # Recompute the metric tensor periodically and fold it into the EMA
    if step % recompute_every == 0:
        g_new = qml.metric_tensor(circuit)(params)
        g_new_flat = np.array(g_new).reshape(n_params, n_params)
        if g_avg is None:
            g_avg = g_new_flat
        else:
            g_avg = beta * g_avg + (1 - beta) * g_new_flat

    # Regularize and invert
    g_reg = g_avg + lam * np.eye(n_params)
    g_inv = np.linalg.inv(g_reg)

    # Natural gradient update
    grad_flat = np.array(grad_val).flatten()
    nat_grad = g_inv @ grad_flat
    params = params - lr * nat_grad.reshape(params.shape)

    energy = circuit(params)
    energies_ema.append(float(energy))

print(f"EMA-QNG final energy: {energies_ema[-1]:.6f}")
```
The EMA approach computes the metric tensor only once every 5 steps in this example, reducing the per-step cost by roughly 5x while retaining most of the convergence benefit.
QNG with VQE on the H2 Molecule
To see QNG on a real chemistry problem, consider finding the ground state energy of molecular hydrogen (H2) at its equilibrium bond length. This is a 2-qubit problem when using a minimal basis set and the Bravyi-Kitaev transformation.
```python
# H2 Hamiltonian at bond length 0.735 angstroms (minimal STO-3G basis)
# Coefficients from a standard Bravyi-Kitaev transformation
# Exact ground state energy: -1.13728 Ha
h2_coeffs = [
    -0.04207897,  # Identity (energy offset, ignored in VQE)
    0.17771287,   # Z0
    0.17771287,   # Z1
    -0.24274281,  # Z0 Z1
    0.17059738,   # X0 X1
]

# Build the Hamiltonian (dropping the identity offset for VQE)
H_h2 = qml.Hamiltonian(
    [h2_coeffs[1], h2_coeffs[2], h2_coeffs[3], h2_coeffs[4]],
    [qml.PauliZ(0), qml.PauliZ(1), qml.PauliZ(0) @ qml.PauliZ(1),
     qml.PauliX(0) @ qml.PauliX(1)],
)
offset = h2_coeffs[0]
exact_h2 = -1.13728  # Known exact value

# Simple UCCSD-inspired ansatz for H2
dev_h2 = qml.device("default.qubit", wires=2)

@qml.qnode(dev_h2, diff_method="parameter-shift")
def h2_circuit(params):
    # Hartree-Fock reference: flip qubit 0 to prepare |10>
    qml.PauliX(wires=0)
    # Single and double excitation
    qml.RY(params[0], wires=0)
    qml.RY(params[1], wires=1)
    qml.CNOT(wires=[0, 1])
    qml.RY(params[2], wires=0)
    qml.RY(params[3], wires=1)
    return qml.expval(H_h2)
```
```python
# Compare QNG, Adam, and gradient descent
from pennylane.optimize import QNGOptimizer, GradientDescentOptimizer, AdamOptimizer

np.random.seed(123)
init_h2 = np.random.uniform(-0.5, 0.5, 4, requires_grad=True)
n_steps_h2 = 60

optimizers_h2 = {
    "GradientDescent": GradientDescentOptimizer(stepsize=0.4),
    "Adam": AdamOptimizer(stepsize=0.1),
    "QNG": QNGOptimizer(stepsize=0.2, lam=0.001),
}

results_h2 = {}
for name, opt in optimizers_h2.items():
    params = init_h2.copy()
    energies = []
    for step in range(n_steps_h2):
        params, energy = opt.step_and_cost(h2_circuit, params)
        energies.append(float(energy) + offset)  # Add back the identity offset
    results_h2[name] = energies
    print(f"{name:20s} final energy: {energies[-1]:.6f} Ha")
print(f"{'Exact':20s} energy: {exact_h2:.6f} Ha")

# Plot
plt.figure(figsize=(8, 4))
for name, energies in results_h2.items():
    plt.plot(energies, label=name)
plt.axhline(exact_h2, color="black", linestyle="--", label=f"Exact ({exact_h2:.5f})")
plt.xlabel("Optimisation step")
plt.ylabel("Energy (Ha)")
plt.title("H2 Ground State: Optimizer Comparison")
plt.legend()
plt.tight_layout()
plt.savefig("h2_qng.png", dpi=150)
plt.show()
```
QNG typically converges to chemical accuracy (within 1.6 mHa of exact) in fewer than 15 steps on this problem. Adam reaches the same accuracy in roughly 30 steps. Standard gradient descent needs 40 or more steps, depending on the learning rate.
Comparing Optimizers: A Comprehensive Benchmark
Step count alone does not tell the full story. On real quantum hardware, the relevant cost metric is the total number of circuit evaluations (each of which requires many shots). Here is a comprehensive comparison for the 4-qubit Ising benchmark:
| Optimizer | Steps to converge | Circuits per step | Total circuits | Notes |
|---|---|---|---|---|
| GradientDescent | ~80 | 2p = 32 | ~2,560 | Reliable but slow |
| Adam | ~40 | 2p = 32 | ~1,280 | Momentum helps significantly |
| QNG (full) | ~20 | 2p^2 + 2p = 544 | ~10,880 | Fewest steps, most circuits |
| QNG (block-diag) | ~25 | ~2p + overhead | ~1,600 | Good middle ground |
| QNG (diagonal) | ~30 | 4p = 64 | ~1,920 | Modest per-step cost |
Key takeaways:
- QNG with the full QFIM uses the fewest steps but the most total circuits. On a simulator where circuit evaluation is fast, QNG wins. On real hardware where each circuit evaluation is expensive, QNG-full may actually be slower in wall-clock time.
- QNG with the diagonal approximation is a practical sweet spot: it converges in 30 steps with only 64 circuits per step, for a total circuit count competitive with Adam.
- Adam is an excellent baseline. It requires no QFIM computation and converges reliably. Many practitioners use Adam as their default and switch to QNG only when Adam stalls.
- The circuits-per-step count for QNG-full scales as O(p^2), which makes it impractical for circuits with more than about 50 parameters without approximation.
Equivariant QNG for Symmetric Circuits
When your ansatz has symmetries (for example, permutation symmetry across qubits, or a lattice translation symmetry matching the Hamiltonian), the QFIM inherits a block structure that respects those symmetries. Exploiting this structure can reduce the effective size of the QFIM and make QNG cheaper.
For an equivariant circuit (one whose structure commutes with the symmetry group action), parameters related by symmetry have identical QFIM blocks. Instead of computing the full p x p QFIM, you compute a reduced matrix whose size equals the number of independent parameter groups.
As a concrete example, consider a translationally invariant ansatz where each qubit in a layer has the same parameters. If you have n qubits and k unique parameters per layer, the effective QFIM is k x k instead of (nk) x (nk). For a 10-qubit circuit with 2 unique parameters per layer, this reduces a 20 x 20 QFIM to a 2 x 2 matrix.
This idea connects to the broader study of equivariant quantum circuits by Meyer et al. (2023) and Larocca et al. (2022), who showed that symmetry-respecting ansatze avoid barren plateaus more effectively and have more favorable QFIM spectra. If your Hamiltonian has a symmetry, designing an ansatz that shares it improves both the convergence rate and the cost of QNG.
QNG on Quantum Hardware
On real quantum hardware, each circuit evaluation requires a finite number of shots, and the expectation values are noisy. This introduces additional considerations for QNG.
The QFIM is estimated from finite-shot expectation values. With S shots per circuit, each QFIM element has a statistical error of order 1/sqrt(S). Since the QFIM has p^2 elements, the total shot budget per QNG step is S * O(p^2). For p=16 and S=1000, that is roughly 512,000 shots for the QFIM alone.
Noisy QFIM estimates lead to noisy QNG updates. If the QFIM is poorly estimated, the inverse (or pseudoinverse) amplifies the noise, producing erratic parameter updates. Increasing lam helps, but at the cost of losing the natural gradient geometry.
```python
# Example: setting up QNG with a shot-based device
dev_shots = qml.device("default.qubit", wires=n_qubits, shots=1000)

@qml.qnode(dev_shots, diff_method="parameter-shift")
def circuit_shots(params):
    for layer in range(n_layers):
        for q in range(n_qubits):
            qml.RY(params[layer, q, 0], wires=q)
            qml.RZ(params[layer, q, 1], wires=q)
        for q in range(n_qubits):
            qml.CNOT(wires=[q, (q + 1) % n_qubits])
    return qml.expval(H)

# With shots, increase lam for stability
optimizer_hw = QNGOptimizer(stepsize=0.05, lam=0.01)
```
A practical strategy for real hardware: use QNG on a noiseless simulator to find a good initialization point, then switch to Adam (or another low-cost optimizer) for fine-tuning on the actual device. The simulator-based QNG handles the “coarse” optimization where geometry matters most, and the hardware-based Adam handles the “fine” optimization near the minimum where the cost landscape is approximately quadratic and geometry matters less.
Common Mistakes
1. Using QNGOptimizer with an incompatible differentiation method. QNG requires the metric tensor, which PennyLane computes using parameter-shift rules or finite differences. If your QNode uses diff_method="backprop" or diff_method="adjoint", the metric tensor computation will fall back to a compatible method or raise an error. Always use diff_method="parameter-shift" for circuits you plan to optimize with QNG, especially if you want the metric tensor to match the actual hardware cost.
2. Forgetting the lam regularization. Without regularization, near-singular QFIM values cause the pseudoinverse to produce huge entries, leading to parameter updates that overshoot wildly. This typically manifests as the energy spiking on step 2 to 5 of training, then oscillating. If you see this, increase lam.
3. Using QNG for circuits with many parameters without approximation. The O(p^2) cost of the full QFIM becomes impractical beyond about 50 parameters. For a 100-parameter circuit, each QNG step requires roughly 20,000 circuit evaluations. Use the diagonal or block-diagonal approximation for large circuits.
4. Comparing QNG to gradient descent with the same learning rate. QNG’s effective step size is scaled by the QFIM inverse. A learning rate of 0.1 for QNG produces much larger parameter updates than 0.1 for gradient descent, because the QFIM inverse amplifies gradients in flat directions. The optimal learning rate for QNG is typically smaller than for GD. Start with lr=0.01 to 0.1 for QNG and lr=0.1 to 0.5 for GD when comparing.
5. Caching the QFIM across multiple steps. The QFIM is state-dependent: it changes as parameters change. Reusing a QFIM from step 0 at step 10 introduces errors proportional to the parameter distance traveled. If you cache the QFIM for efficiency (as in the EMA approach), recompute it frequently enough that the parameters have not moved far between updates. A safe rule: recompute every 3 to 5 steps for small learning rates, every 1 to 2 steps for large learning rates.
6. Expecting QNG to solve barren plateaus. QNG rescales gradients by the QFIM inverse, which can help navigate mild flat regions. But in a genuine barren plateau (where both the gradient and the QFIM eigenvalues are exponentially small), QNG divides a vanishingly small gradient by a vanishingly small eigenvalue. The result is numerically unstable and does not reliably point toward the minimum. Barren plateaus require architectural solutions (local cost functions, shallow circuits, symmetry-respecting ansatze), not just a better optimizer.
Summary
The quantum natural gradient adapts the classical natural gradient idea to variational quantum circuits. By using the quantum Fisher information matrix (the metric tensor of the quantum state manifold) to rescale gradients, QNG takes steps of equal size in state space rather than parameter space, leading to faster convergence.
The practical trade-off is cost: the full QFIM requires O(p^2) circuit evaluations per step. Block-diagonal and diagonal approximations reduce this to O(p) while retaining most of the convergence benefit. On real hardware, where circuit evaluations are expensive, the total circuit count (not just the step count) determines the practical efficiency of QNG.
For small to moderate circuits (fewer than 50 parameters), QNG with full QFIM is the gold standard for convergence speed. For larger circuits, diagonal QNG or the EMA approach provides a practical balance. And for the largest circuits, Adam remains a strong baseline that requires no QFIM computation at all.
The QFIM itself is a valuable diagnostic tool even if you do not use QNG for optimization. Its eigenvalue spectrum reveals the circuit’s sensitivity to each parameter direction, helping diagnose barren plateaus, identify redundant parameters, and understand the geometry of the optimization landscape.