
Finite-Shot Optimization in PennyLane

Optimize variational circuits using finite shots in PennyLane. Compare parameter-shift, SPSA, and natural gradient estimators for shot-limited hardware settings.

What you'll learn

  • PennyLane
  • shots
  • gradient estimation
  • parameter-shift rule
  • SPSA

Prerequisites

  • Python proficiency
  • Beginner quantum computing concepts (superposition, entanglement)
  • Linear algebra basics

Overview

When you run a variational algorithm on real quantum hardware, you pay for every circuit execution. The gradient of a parameterized circuit requires many evaluations, and each evaluation burns device time. The standard parameter-shift rule requires 2 circuit evaluations per parameter per gradient step, so an 8-parameter circuit needs 16 evaluations per step. At 256 shots each, that is 4,096 shots just to compute one gradient. Over 100 training steps, the total reaches 409,600 shots. On cloud-accessible quantum hardware, shots cost real money and queue time, so understanding how to reduce that shot cost while maintaining convergence is essential for practical NISQ applications.

This tutorial compares three gradient strategies for finite-shot optimization:

  • Parameter-shift rule: accurate but expensive. Computes an unbiased gradient using 2 evaluations per parameter.
  • SPSA (Simultaneous Perturbation Stochastic Approximation): cheap but noisy. Estimates the full gradient with only 2 evaluations total, regardless of parameter count.
  • Quantum Natural Gradient (QNG): expensive per step but fast-converging. Preconditions the gradient using the geometry of the quantum state space.

By the end, you will know how to choose the right optimizer for a given shot budget, parameter count, and convergence requirement.

What Does “Finite Shots” Mean?

Without a shots argument, PennyLane computes exact expectation values using statevector simulation. The simulator applies the circuit unitary to the state vector and calculates <psi|O|psi> analytically. This produces a perfect, noise-free result every time you evaluate the circuit.

With shots=N, PennyLane instead samples from the measurement distribution N times and averages the results, exactly as real hardware would. Each sample collapses the quantum state and returns a bitstring drawn from the Born-rule probability distribution. The expectation value is then estimated as the sample mean of the observable’s eigenvalues.

This introduces statistical noise. The standard error of the mean scales as 1/sqrt(N), which means halving the noise requires quadrupling the shots. For shots=256, the standard error on a Pauli expectation value (which has eigenvalues +/-1) is at most 1/sqrt(256) = 0.0625. For shots=1024, it drops to 1/sqrt(1024) ≈ 0.031.

This noise propagates directly into gradient estimates. When the gradient of a cost function is computed using finite-shot expectation values, the gradient itself becomes a noisy random variable. If the gradient noise is large relative to the true gradient magnitude, the optimizer cannot distinguish signal from noise and optimization stalls or diverges. The choice of gradient estimation strategy determines how many shots you spend per step and how much noise appears in the resulting gradient vector.
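
The 1/sqrt(N) scaling is easy to verify with a pure-NumPy stand-in for the device (no quantum simulation needed): sample N eigenvalues in {+1, -1} from a Born-rule distribution with a known expectation value and watch the spread of the sample mean shrink.

```python
import numpy as np

rng = np.random.default_rng(0)
true_exp = 0.3                      # target <Z>; P(+1) = (1 + <Z>) / 2
p_plus = (1 + true_exp) / 2

def estimate(shots):
    # Draw `shots` eigenvalues in {+1, -1} and return the sample mean.
    outcomes = rng.choice([1.0, -1.0], size=shots, p=[p_plus, 1 - p_plus])
    return outcomes.mean()

for shots in (64, 256, 1024):
    estimates = np.array([estimate(shots) for _ in range(2000)])
    # Theory: std of the mean = sqrt(1 - <Z>^2) / sqrt(shots) <= 1 / sqrt(shots)
    print(f"shots={shots:5d}  empirical std={estimates.std():.4f}  "
          f"bound={1 / np.sqrt(shots):.4f}")
```

Quadrupling the shot count halves the empirical spread, matching the 1/sqrt(N) bound.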

Setting Up a Finite-Shot Device

PennyLane’s default.qubit simulator accepts a shots argument to mimic the statistical noise from finite sampling on hardware. The circuit below defines an 8-parameter ansatz on 4 qubits: a layer of RY rotations, a chain of CNOT entangling gates, and a layer of RZ rotations. The cost function measures the two-qubit correlation <Z0 Z1>.

import pennylane as qml
import numpy as np

dev = qml.device("default.qubit", wires=4, shots=256)

@qml.qnode(dev)
def ansatz(params):
    for i in range(4):
        qml.RY(params[i], wires=i)
    for i in range(3):
        qml.CNOT(wires=[i, i + 1])
    for i in range(4):
        qml.RZ(params[i + 4], wires=i)
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

Every call to ansatz(params) now returns a stochastic estimate. Calling it twice with the same parameters produces different values. This is the fundamental challenge of finite-shot optimization: every piece of information you extract from the quantum device is noisy.

Parameter-Shift Rule

The parameter-shift rule computes the exact gradient formula for gates with two eigenvalues (which includes all standard rotation gates like RX, RY, RZ). For each parameter theta_i, the partial derivative of the expectation value is:

d<f>/d(theta_i) = [f(theta_i + pi/2) - f(theta_i - pi/2)] / 2

This formula is exact in the analytic (infinite-shot) case. In the finite-shot case, each of the two function evaluations f(theta_i + pi/2) and f(theta_i - pi/2) is itself a noisy estimate, so the resulting gradient is noisy too. However, the estimate remains unbiased: the expected value of the gradient estimate equals the true gradient, even with shot noise. The noise averages out over many optimization steps.
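
As a quick sanity check (pure NumPy, no device), take the single-qubit cost f(theta) = cos(theta), which is the analytic <Z> after an RY rotation on |0>. The shift rule reproduces the derivative -sin(theta) exactly:

```python
import numpy as np

def f(theta):
    # Analytic <Z> after RY(theta) on |0>
    return np.cos(theta)

theta = 0.7
shift_grad = (f(theta + np.pi / 2) - f(theta - np.pi / 2)) / 2
exact_grad = -np.sin(theta)
print(shift_grad, exact_grad)  # identical up to floating point
```

Note this is not a finite-difference approximation: the shifts are a fixed pi/2, not a small step, and the formula is exact for any two-eigenvalue gate.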

The shot cost calculation for this ansatz is straightforward:

  • 8 parameters, each requiring 2 circuit evaluations = 16 evaluations per gradient step
  • Each evaluation uses 256 shots
  • Total shots per step: 16 x 256 = 4,096 shots
  • Over 60 steps: 60 x 4,096 = 245,760 total shots
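
These numbers generalize directly; a small helper (hypothetical, not part of PennyLane) makes the budget arithmetic explicit:

```python
def param_shift_budget(n_params, shots_per_eval, n_steps, include_cost_eval=False):
    """Total shots for parameter-shift gradient descent.

    Counts 2 evaluations per parameter per step, plus an optional
    forward pass for the cost value itself.
    """
    evals_per_step = 2 * n_params + (1 if include_cost_eval else 0)
    return n_steps * evals_per_step * shots_per_eval

print(param_shift_budget(8, 256, 60))  # 245760, matching the tally above
```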

PennyLane’s GradientDescentOptimizer uses the parameter-shift rule automatically when the QNode has finite shots. You do not need to specify the differentiation method explicitly.

import pennylane.numpy as pnp
params = pnp.array(np.random.uniform(-np.pi, np.pi, 8), requires_grad=True)

opt = qml.GradientDescentOptimizer(stepsize=0.1)

for step in range(60):
    params, cost = opt.step_and_cost(ansatz, params)
    if step % 10 == 0:
        print(f"Step {step:3d} | cost = {cost:.4f}")

The GradientDescentOptimizer calls qml.grad, which detects the parameter-shift rule as the appropriate differentiation method for hardware-compatible QNodes. Each call to opt.step_and_cost triggers 16 circuit evaluations (2 per parameter) plus 1 evaluation for the cost itself, for a total of 17 evaluations per step.

The parameter-shift rule is the right default when accuracy matters and the parameter count is manageable. Its main limitation is that shot cost scales linearly with parameter count, making it expensive for large ansatze (20+ parameters).

SPSA: Fewer Circuits Per Step

Simultaneous Perturbation Stochastic Approximation (SPSA) is a classical optimization technique originally developed for control theory problems, later adapted for quantum circuits. Instead of computing the gradient of each parameter independently, SPSA perturbs all parameters simultaneously using a random binary vector delta where each component is +1 or -1 with equal probability. The gradient estimate is:

grad_estimate_i = [f(theta + c*delta) - f(theta - c*delta)] / (2 * c * delta_i)

Here c is the perturbation size and delta_i is the i-th component of the perturbation vector (dividing by delta_i is equivalent to multiplying by its reciprocal, since each component is +/-1). This gives a noisy but unbiased estimate of the full gradient vector using only 2 circuit evaluations, regardless of the number of parameters.
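
A pure-NumPy sketch of the estimator shows the key property: a single estimate is noisy, but averaging over many random perturbation directions recovers the true gradient (here on a classical quadratic cost standing in for the circuit).

```python
import numpy as np

rng = np.random.default_rng(1)
A = np.diag([1.0, 2.0, 3.0, 4.0])

def cost(theta):
    return theta @ A @ theta          # gradient is 2 * A @ theta

def spsa_grad(theta, c=0.1):
    delta = rng.choice([1.0, -1.0], size=theta.size)   # random +/-1 perturbation
    diff = cost(theta + c * delta) - cost(theta - c * delta)
    return diff / (2 * c * delta)                      # element-wise division

theta = np.array([0.5, -0.3, 0.2, 0.1])
true_grad = 2 * A @ theta
avg = np.mean([spsa_grad(theta) for _ in range(5000)], axis=0)
print(avg)        # approaches true_grad as the number of samples grows
```

A single `spsa_grad` call can point in a very different direction from the true gradient; it is only correct on average, which is why SPSA needs more iterations.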

The shot cost per step is:

  • 2 circuit evaluations x 256 shots = 512 shots per step
  • This is 8x fewer shots than parameter-shift for this 8-parameter circuit
  • Over 60 steps: 60 x 512 = 30,720 total shots

The trade-off is variance. Each SPSA gradient estimate has much higher variance than the parameter-shift gradient because a single random perturbation direction provides limited information about the full gradient landscape. SPSA converges, but typically requires more iterations to reach the same accuracy as parameter-shift.

from pennylane.optimize import SPSAOptimizer

params_spsa = pnp.array(np.random.uniform(-np.pi, np.pi, 8), requires_grad=True)
opt_spsa = SPSAOptimizer(maxiter=60)

for step in range(60):
    params_spsa, cost = opt_spsa.step_and_cost(ansatz, params_spsa)
    if step % 10 == 0:
        print(f"SPSA step {step:3d} | cost = {cost:.4f}")

The SPSA optimizer in PennyLane has two key hyperparameters that control the learning rate schedule:

  • alpha (default 0.602): controls how fast the step size decays. The step size at iteration k is proportional to 1 / (k + 1)^alpha. A larger alpha means faster decay, which stabilizes late-stage convergence but can slow early progress.
  • gamma (default 0.101): controls how fast the perturbation size c decays. The perturbation at iteration k is proportional to 1 / (k + 1)^gamma. The perturbation must shrink for convergence guarantees to hold, but shrinking too fast makes gradient estimates unreliable.

These defaults follow Spall's (1998) practical guidelines, which are grounded in SPSA's convergence analysis: both exponents must be positive, and the step size must decay faster than the perturbation (alpha > gamma) for the convergence guarantees to hold. In practice, the defaults work well for many quantum optimization problems, but you may need to tune them if convergence is too slow or too unstable. A common approach is to run a short pilot (10-20 steps) and check whether the cost is decreasing on average.
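
The decay schedules themselves are simple power laws. Here is a sketch of the standard Spall-style schedules (the constants a, c, and A are hypothetical tuning choices, not PennyLane defaults):

```python
import numpy as np

def spsa_schedules(n_steps, a=0.2, c=0.2, A=10, alpha=0.602, gamma=0.101):
    k = np.arange(n_steps)
    a_k = a / (k + 1 + A) ** alpha    # step-size schedule
    c_k = c / (k + 1) ** gamma        # perturbation-size schedule
    return a_k, c_k

a_k, c_k = spsa_schedules(100)
print(f"step size:     {a_k[0]:.4f} -> {a_k[-1]:.4f}")
print(f"perturbation:  {c_k[0]:.4f} -> {c_k[-1]:.4f}")
```

Because gamma is much smaller than alpha, the perturbation shrinks far more slowly than the step size, keeping the gradient estimates usable late in the run.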

SPSA is particularly valuable when the parameter count is large (20+ parameters) and your shot budget is tight. It is also useful as a first pass to get parameters into the right neighborhood before switching to a more accurate optimizer for fine-tuning.

Quantum Natural Gradient

The Quantum Natural Gradient (QNG) preconditions the gradient by the inverse of the quantum Fisher information matrix, also known as the Fubini-Study metric tensor. This transforms the update rule from ordinary gradient descent into natural gradient descent on the manifold of quantum states.

To understand why this helps, consider what ordinary gradient descent does. It moves parameters in the direction of steepest descent in Euclidean parameter space, treating all parameter directions equally. But the mapping from parameters to quantum states is not uniform. A small change in one parameter might produce a large change in the output state, while the same-sized change in another parameter might barely affect the state at all. This mismatch is especially pronounced near barren plateaus, where the gradient is small not because you are near a minimum, but because the parameterization is locally insensitive.

QNG corrects for this by measuring the local geometry of the state space. The Fubini-Study metric tensor g_ij quantifies how much the quantum state changes when you perturb parameters theta_i and theta_j. The QNG update rule is:

theta_new = theta - stepsize * g_inverse @ gradient

This rescales the gradient so that each step produces a roughly uniform change in the quantum state, regardless of the local parameterization. In practice, QNG often converges in 2-5x fewer iterations than vanilla gradient descent on variational circuits.
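
The update rule itself is just a linear solve. A toy NumPy sketch with a hypothetical 2x2 metric tensor shows how the metric rescales an ill-conditioned gradient:

```python
import numpy as np

# Hypothetical metric: parameter 0 moves the state 100x more than parameter 1
g = np.array([[1.0, 0.0],
              [0.0, 0.01]])
gradient = np.array([0.5, 0.005])   # tiny component along the insensitive direction
stepsize = 0.1

vanilla_step = stepsize * gradient
qng_step = stepsize * np.linalg.solve(g, gradient)  # g_inverse @ gradient

print("vanilla update:", vanilla_step)   # barely moves parameter 1
print("QNG update:    ", qng_step)       # both parameters move comparably
```

Vanilla gradient descent nearly ignores the insensitive parameter; the metric inverse boosts it so both directions change the state by a similar amount.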

The cost of QNG is computing the metric tensor, which requires additional circuit evaluations. For a general n-parameter circuit, the full metric tensor has n(n+1)/2 independent entries, each requiring its own circuit evaluations. The approx="block-diag" option reduces this cost by assuming the metric tensor has block-diagonal structure, where each block corresponds to one layer of the circuit. This approximation is accurate when the circuit has a layered structure (as our ansatz does) and the inter-layer correlations in the metric are small.

from pennylane.optimize import QNGOptimizer

dev_qng = qml.device("default.qubit", wires=4, shots=512)

@qml.qnode(dev_qng)
def ansatz_qng(params):
    for i in range(4):
        qml.RY(params[i], wires=i)
    for i in range(3):
        qml.CNOT(wires=[i, i + 1])
    for i in range(4):
        qml.RZ(params[i + 4], wires=i)
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))

params_qng = pnp.array(np.random.uniform(-np.pi, np.pi, 8), requires_grad=True)
opt_qng = QNGOptimizer(stepsize=0.1, approx="block-diag")

for step in range(40):
    params_qng, cost = opt_qng.step_and_cost(ansatz_qng, params_qng)
    if step % 10 == 0:
        print(f"QNG step {step:3d} | cost = {cost:.4f}")

Note that this QNode uses shots=512 (higher than the 256 used for parameter-shift and SPSA). QNG benefits from more accurate expectation values because the metric tensor inversion amplifies noise. With too few shots, the estimated metric tensor can become ill-conditioned, producing erratic parameter updates. Using 512 or 1024 shots per evaluation is a reasonable starting point for QNG.

The shot cost per step for QNG with block-diagonal approximation is roughly:

  • Gradient: 2 x 8 x 512 = 8,192 shots (parameter-shift for the gradient itself)
  • Metric tensor (block-diag): approximately 2 x 8 x 512 = 8,192 additional shots
  • Total per step: approximately 16,384 shots
  • Over 40 steps: 40 x 16,384 = 655,360 total shots

QNG is the most expensive per step, but it typically converges in far fewer iterations. Whether it wins on total shot budget depends on the problem.

Head-to-Head Comparison

A fair comparison between optimizers plots cost vs. total shots consumed, not cost vs. iteration count. An optimizer that converges in 20 iterations is not better if each iteration costs 10x more shots.

# Track cumulative shots manually per optimizer
# parameter-shift: 2 * n_params * shots per step = 2 * 8 * 256 = 4096 shots/step
# SPSA: 2 * shots per step = 512 shots/step
# QNG (block-diag): roughly 2 * (2 * n_params * shots) per step (gradient + metric tensor)

print("Shots per step - param-shift:", 2 * 8 * 256)
print("Shots per step - SPSA:       ", 2 * 256)
print("Shots per step - QNG:        ", 2 * (2 * 8 * 512))

Here is a summary of the per-step and total shot costs for each optimizer on this 8-parameter circuit:

Optimizer          Shots per eval   Evals per step   Shots per step   Steps to converge (typical)   Total shots (typical)
Parameter-shift    256              16 (+ 1 cost)    ~4,096           50-60                         ~200,000-250,000
SPSA               256              2 (+ 1 cost)     ~512             100-150                       ~50,000-75,000
QNG (block-diag)   512              ~32              ~16,384          20-30                         ~330,000-490,000

These numbers are approximate and problem-dependent. The key takeaways:

  • SPSA wins on total shot budget for this problem size. Despite needing more iterations, its per-step cost is so low that it uses fewer total shots.
  • QNG wins on iteration count but its per-step overhead makes it the most expensive overall for small circuits. QNG becomes more competitive as circuit complexity increases and vanilla gradient descent needs many more iterations to navigate the landscape.
  • Parameter-shift is the reliable middle ground. Predictable convergence, moderate shot cost, no hyperparameter tuning beyond the learning rate.

To run a proper benchmark, fix the random seed and initial parameters, then run all three optimizers from the same starting point:

np.random.seed(42)
init_params = np.random.uniform(-np.pi, np.pi, 8)

# Run each optimizer from init_params, tracking (cumulative_shots, cost) pairs
# Plot cost vs. cumulative_shots for all three on the same axes

This gives you a direct visual comparison of convergence efficiency per shot spent.

Shot Budget Planning

How do you decide how many shots to use per circuit evaluation? The core principle is that gradient noise should be small relative to the gradient signal. If the noise overwhelms the signal, the optimizer performs a random walk instead of descending.

For a Pauli expectation value, the standard error of the mean is at most 1/sqrt(shots):

Shots   Standard error   Gradient noise (parameter-shift, per component)
64      0.125            ~0.088
256     0.0625           ~0.044
1024    0.031            ~0.022
4096    0.016            ~0.011

The gradient noise per component for the parameter-shift rule is sqrt(2) * (1/sqrt(shots)) / 2 ≈ 0.707 / sqrt(shots), because the gradient formula subtracts two noisy estimates and divides by 2.

The rule of thumb: if your expected gradient magnitude is ~0.1 and your gradient noise per component is ~0.05, the optimizer still makes progress because the signal-to-noise ratio is about 2:1. If the noise exceeds the signal (ratio below 1:1), you are wasting shots on random walks.
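
You can invert the noise formula to turn this rule of thumb into a rough shot requirement. From noise ≈ 0.707/sqrt(shots) per gradient component, hitting a target signal-to-noise ratio requires (this helper is a sketch, not a PennyLane utility):

```python
import numpy as np

def shots_for_snr(grad_magnitude, target_snr=2.0):
    """Shots per evaluation so that parameter-shift gradient noise
    (~0.707 / sqrt(shots) per component) sits at 1/target_snr of the signal."""
    noise_target = grad_magnitude / target_snr
    return int(np.ceil((0.707 / noise_target) ** 2))

print(shots_for_snr(0.1))    # ~200 shots for a gradient of ~0.1 at SNR 2:1
print(shots_for_snr(0.02))   # small gradients late in training need far more
```

The quadratic dependence is the painful part: a gradient five times smaller needs twenty-five times the shots for the same signal-to-noise ratio.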

A practical approach:

  1. Start with 256 shots. This is enough for most variational circuits in the early stages of optimization, where gradients are larger.
  2. Monitor convergence. If the cost function oscillates wildly instead of trending downward, the gradient noise is too high.
  3. Increase to 1024 shots if convergence stalls. This cuts gradient noise in half.
  4. Consider adaptive shots. Use fewer shots in early iterations (where gradients are large and you just need the right direction) and more shots in later iterations (where gradients are small and precision matters). PennyLane does not have built-in adaptive shot scheduling, but you can implement it by creating new devices with different shot counts during training.
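
Step 4 can be sketched as a simple schedule that tells you when to rebuild the device with a new shot count (the breakpoints here are hypothetical; PennyLane has no built-in scheduler):

```python
def shot_schedule(step, breakpoints=((0, 128), (30, 256), (60, 1024))):
    """Shots to use at a given optimization step.

    Hypothetical piecewise schedule: cheap, noisy gradients early;
    precise gradients once the optimizer is near a minimum.
    """
    shots = breakpoints[0][1]
    for start, n in breakpoints:
        if step >= start:
            shots = n
    return shots

# Usage sketch: rebuild the device whenever the schedule changes, e.g.
#   dev = qml.device("default.qubit", wires=4, shots=shot_schedule(step))
print([shot_schedule(s) for s in (0, 29, 30, 59, 60, 99)])
```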

For SPSA, the noise characteristics are different because the gradient estimate has higher variance by construction. SPSA is more tolerant of low shot counts because it already expects noisy gradients. Starting with 128-256 shots per evaluation is reasonable for SPSA.

Recommendations

Choose your optimizer based on parameter count, shot budget, and convergence requirements:

Use parameter-shift when:

  • You have fewer than 15-20 parameters
  • You need reliable, predictable convergence
  • Your shot budget is moderate (100k-500k total shots)
  • You want minimal hyperparameter tuning (only the learning rate)

Use SPSA when:

  • You have 20+ parameters and the per-step cost of parameter-shift is prohibitive
  • Your total shot budget is tight (under 100k shots)
  • You are willing to tune the alpha, gamma, and initial perturbation size
  • You want a quick initial optimization pass before switching to a more accurate method

Use QNG when:

  • Your circuit has a layered structure (so the block-diagonal approximation is valid)
  • Vanilla gradient descent converges slowly, suggesting the optimization landscape is poorly conditioned
  • You can afford the per-step overhead (enough shots for accurate metric tensor estimation)
  • The circuit is deep enough that the metric tensor geometry matters (shallow circuits often do not benefit much from QNG)

Hybrid strategies work well in practice. Start with SPSA to cheaply explore the landscape, then switch to parameter-shift or QNG for fine-tuning near the minimum. You can also combine shot adaptation with optimizer switching: use low shots + SPSA for the first 50 steps, then high shots + parameter-shift for the final 30 steps.

Common Mistakes

Comparing optimizers by iteration count instead of total shot count. An optimizer that converges in 20 iterations but costs 16,000 shots per iteration is not necessarily better than one that takes 100 iterations at 512 shots each. Always compare convergence as a function of cumulative shots consumed.

Setting shots too low. If gradient noise exceeds the gradient signal, the optimizer performs a random walk. With 64 shots and a gradient magnitude of 0.05, the gradient noise (~0.088 per component for parameter-shift) is larger than the signal. The optimizer cannot make meaningful progress. When in doubt, increase shots and check if convergence improves.

Using QNG with block-diagonal approximation on non-layered ansatze. The approx="block-diag" option assumes the circuit decomposes into independent layers. If your ansatz has cross-layer parameter dependencies (for example, parameters that appear in multiple layers or hardware-efficient ansatze with irregular structure), the block-diagonal approximation can be inaccurate. This leads to poorly conditioned updates that hurt rather than help convergence. For non-layered circuits, consider using the full metric tensor (more expensive) or sticking with parameter-shift.

Forgetting that SPSA hyperparameters need tuning. The default alpha=0.602 and gamma=0.101 values are theoretically motivated but not universally optimal. If SPSA convergence is slow, try increasing the initial step size. If it is unstable (cost function jumps around), try decreasing the perturbation size or increasing gamma to make the perturbation decay faster. Run a short pilot of 10-20 steps to calibrate before committing to a long optimization run.

Ignoring the cost of the forward pass. The shot counts above account for gradient evaluations, but most optimizers also evaluate the cost function itself at the current parameters (the “forward pass”). This adds one extra circuit evaluation per step. For parameter-shift this is a small overhead (1 out of 17 evaluations), but for SPSA it increases the per-step cost by 50% (from 2 to 3 evaluations). Some implementations skip the forward evaluation when only the parameter update is needed.
