PennyLane · Intermediate · Free · 23/26 in series · 25 min

Shot-Adaptive Optimization in PennyLane

Use PennyLane's shot-adaptive optimizer to allocate measurement shots intelligently across circuit parameters.

What you'll learn

  • PennyLane
  • shot-adaptive optimizer
  • resource efficiency
  • gradient
  • quantum optimization

Prerequisites

  • Python proficiency
  • Beginner quantum computing concepts (superposition, entanglement)
  • Linear algebra basics

Overview

Variational quantum algorithms like VQE and QAOA rely on gradient-based optimization to train parameterized circuits. Estimating gradients requires circuit evaluations, and each circuit evaluation requires measurement shots. On real quantum hardware, every shot costs time and money. A 100-step optimization loop with 8 parameters using the parameter-shift rule at 256 shots per evaluation consumes over 400,000 shots. That can eat through a free-tier quantum budget in minutes.

The naive approach to shot allocation is uniform: give every parameter the same number of shots at every step. This is wasteful. If some parameters sit near a local minimum and their gradients are nearly zero, measuring them precisely produces no useful information. The optimizer already knows those parameters should barely move. Meanwhile, parameters with large gradients and high variance need precise measurements to determine the correct update direction.

Shot-adaptive optimization solves this problem by dynamically reallocating shots toward the parameters that matter most at each training step. Parameters whose gradient estimates are still significant but noisy receive more shots. Parameters near convergence receive the minimum allocation. The result is the same convergence quality as fixed-budget optimization, but with significantly fewer total shots consumed.

PennyLane provides the ShotAdaptiveOptimizer class, which implements this strategy out of the box. This tutorial walks through how it works, how to use it, and how to verify that it actually saves shots in practice.

How the Rosalin Algorithm Works

PennyLane’s ShotAdaptiveOptimizer implements the Rosalin algorithm (Random Operator Sampling for Adaptive Learning with Individual Number of shots), introduced by Arrasmith et al. (2020). Rosalin builds on the iCANS (individual Coupled Adaptive Number of Shots) strategy introduced by Kübler et al. (2020).

The core insight is straightforward. Not all parameters contribute equally to the cost function at any given step. Some parameters have large gradients and high measurement variance, meaning they strongly influence the cost and require precise estimation. Other parameters have near-zero gradients, meaning their updates are negligible regardless of measurement precision.

At each optimization step, the algorithm performs the following:

  1. Estimate gradient magnitude and variance for each parameter. The optimizer evaluates the gradient using the parameter-shift rule, collecting enough samples to estimate both the mean gradient and its variance for every parameter independently.

  2. Compute the shot allocation. Using the gradient and variance estimates from step 1, the optimizer chooses a per-parameter shot count aimed at maximizing the expected improvement in the cost per shot spent. Under the underlying iCANS rule, the recommended shot count for parameter i grows with the variance of its gradient estimate relative to the gradient magnitude: parameters whose gradients are noisy but still informative receive more shots, while parameters with clean, decisive gradient estimates need fewer.

  3. Enforce minimum shot constraints. Every parameter receives at least min_shots measurements, preventing the optimizer from starving any parameter entirely. This floor ensures that the optimizer can detect when a previously dormant parameter becomes relevant again.

  4. Adapt the learning rate. The optimizer scales the learning rate for each parameter based on the shot allocation it received. Parameters measured with fewer shots have noisier gradient estimates, so the optimizer applies a smaller effective step size to those parameters. This prevents noisy low-shot estimates from destabilizing the optimization.

The result is an optimizer that automatically concentrates measurement resources where they produce the most improvement, while maintaining stability across all parameters.
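To make the allocation idea concrete, here is a toy sketch in NumPy. It is not PennyLane's implementation — the proportional-to-variance weighting, the fixed budget, and the names (allocate_shots, grad_var) are simplifying assumptions — but it captures the shape of steps 1–3: weight each parameter by how noisy its gradient estimate is, split a fixed budget by those weights, and enforce a floor.

```python
import numpy as np

def allocate_shots(grad_var, budget, min_shots):
    """Toy allocator: split a fixed shot budget across parameters in
    proportion to the variance of each gradient estimate, with a floor."""
    weights = np.maximum(grad_var, 1e-12)        # noisier estimates weigh more
    raw = budget * weights / weights.sum()       # proportional split of the budget
    return np.maximum(np.round(raw).astype(int), min_shots)

grad_var = np.array([0.40, 0.02, 0.25, 0.01])    # per-parameter gradient variance
shots = allocate_shots(grad_var, budget=200, min_shots=10)
print(shots)   # noisy parameters dominate; quiet ones sit at the floor
```

Note that the floor means the allocations can slightly exceed the nominal budget; the real optimizer handles this bookkeeping more carefully.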

Why Shot Efficiency Matters

Shot efficiency is not an abstract concern. On current quantum hardware, shots translate directly to wall-clock time and cost.

Consider a concrete example. On the IBM Quantum free tier, you receive 10 minutes of quantum processing time per month. Each shot on a 4-qubit circuit takes roughly 1 millisecond of device time (including gate execution and measurement). A typical VQE optimization with 8 parameters using the parameter-shift rule requires 2 * 8 = 16 circuit evaluations per step (one forward shift and one backward shift per parameter). At 256 shots per evaluation over 100 optimization steps, that totals:

100 steps * 16 evaluations/step * 256 shots/evaluation = 409,600 shots
409,600 shots * 1 ms/shot = 409.6 seconds ≈ 6.8 minutes

That is 68% of the monthly free-tier budget on a single optimization run.

If shot-adaptive optimization reduces total shots by 50% (a typical improvement for circuits with many parameters), the same optimization consumes roughly 3.4 minutes of device time. That leaves enough budget for a second run, or for experimentation with different ansatz structures.

On pay-per-use plans, the savings translate directly to cost. On shared academic hardware with job queues, fewer shots mean shorter jobs and faster turnaround. Even on simulators with finite shots, reducing shot count speeds up the optimization wall-clock time.
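The budget arithmetic above can be packaged into a small helper. The function and its parameter names are ad hoc (not part of any PennyLane API); it assumes the parameter-shift rule (2 circuit evaluations per parameter per step) and a fixed per-shot device time.

```python
def shot_budget(steps, n_params, shots_per_eval, ms_per_shot=1.0):
    """Total shots and device minutes for a parameter-shift optimization:
    2 circuit evaluations per parameter per step."""
    total_shots = steps * (2 * n_params) * shots_per_eval
    minutes = total_shots * ms_per_shot / 1000 / 60
    return total_shots, minutes

total, minutes = shot_budget(steps=100, n_params=8, shots_per_eval=256)
print(f"{total} shots ≈ {minutes:.1f} minutes of device time")
# → 409600 shots ≈ 6.8 minutes of device time
```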

Setting Up the Device and Circuit

Use a simulated finite-shot device to mimic hardware conditions.

import pennylane as qml
import numpy as np
import pennylane.numpy as pnp

dev = qml.device("default.qubit", wires=4, shots=50)

@qml.qnode(dev)
def cost_fn(params):
    for i in range(4):
        qml.RY(params[i], wires=i)
    for i in range(3):
        qml.CNOT(wires=[i, i + 1])
    for i in range(4):
        qml.RZ(params[i + 4], wires=i)
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1) @ qml.PauliZ(2))

The shots=50 argument on the device sets the initial shot budget per circuit evaluation. This is the baseline that the shot-adaptive optimizer starts from. The optimizer uses this value as a reference point for its first step, then adapts the per-parameter allocation in subsequent steps. It may allocate more or fewer shots than this baseline depending on the current gradient landscape.

A few things to note about this circuit:

  • The circuit has 8 trainable parameters (4 RY rotations and 4 RZ rotations), giving the optimizer enough parameters to demonstrate meaningful shot reallocation.
  • The CNOT ladder creates entanglement, which means the gradient with respect to each parameter depends on the full quantum state rather than just the local qubit.
  • The cost function measures a three-qubit Pauli-Z correlation. Measuring multi-qubit observables typically produces higher variance than single-qubit measurements, making shot allocation more impactful.

Running the Shot-Adaptive Optimizer

from pennylane.optimize import ShotAdaptiveOptimizer

params = pnp.array(np.random.uniform(-np.pi, np.pi, 8), requires_grad=True)

opt = ShotAdaptiveOptimizer(min_shots=10)

cost_history = []
shot_history = []

for step in range(50):
    params, cost = opt.step_and_cost(cost_fn, params)
    cost_history.append(float(cost))
    shot_history.append(opt.total_shots_used)

    if step % 10 == 0:
        print(f"Step {step:2d} | Cost = {cost:.4f} | Shots used = {opt.total_shots_used}")

min_shots sets the floor for how few shots any parameter can receive in a single step.

Here is what happens internally at each step of the loop:

  1. Cost evaluation. The optimizer evaluates the cost function at the current parameters using the device’s base shot count. This gives the current cost value returned by step_and_cost.

  2. Gradient estimation with variance tracking. The optimizer evaluates the gradient using the parameter-shift rule. For each parameter, it computes both the gradient estimate and the variance of that estimate across the shots used. This requires 2 circuit evaluations per parameter (one with +pi/2 shift, one with -pi/2 shift).

  3. Shot allocation computation. Using the gradient magnitudes and variances from step 2, the optimizer computes the shot allocation for the next step, directing more shots to parameters whose gradient estimates are noisy relative to their magnitude. Every parameter receives at least min_shots.

  4. Parameter update. The optimizer takes a gradient descent step with a learning rate adapted to the per-parameter variance. Parameters measured with fewer shots receive smaller effective step sizes to compensate for noisier estimates.
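To see step 2 in isolation, here is a minimal toy model with no PennyLane dependency: a one-parameter "circuit" whose exact expectation value is cos(theta), with shot noise simulated by sampling ±1 outcomes. The parameter-shift rule gives an unbiased estimate of the derivative −sin(theta), and repeating the estimate shows how its spread shrinks as the shot count grows. The names and the noise model are illustrative assumptions, not PennyLane internals.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_expval(theta, shots):
    """Simulate measuring <Z> with true value cos(theta) using a finite
    number of +1/-1 shots."""
    p_plus = (1 + np.cos(theta)) / 2             # probability of the +1 outcome
    outcomes = 2 * (rng.random(shots) < p_plus) - 1
    return outcomes.mean()

def shift_gradient(theta, shots):
    """Parameter-shift estimate of d<Z>/dtheta = -sin(theta)."""
    return (noisy_expval(theta + np.pi / 2, shots)
            - noisy_expval(theta - np.pi / 2, shots)) / 2

theta = 0.7
spread = {}
for shots in (10, 100, 10000):
    estimates = [shift_gradient(theta, shots) for _ in range(200)]
    spread[shots] = np.std(estimates)
    print(f"{shots:6d} shots: mean {np.mean(estimates):+.3f}, "
          f"std {spread[shots]:.3f}   (exact gradient: {-np.sin(theta):+.3f})")
```

The per-parameter variance tracked this way is exactly the signal the allocator in step 3 consumes.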

To see the shot allocation at each step (not just the final allocation), extend the training loop:

params = pnp.array(np.random.uniform(-np.pi, np.pi, 8), requires_grad=True)
opt = ShotAdaptiveOptimizer(min_shots=10)

for step in range(50):
    params, cost = opt.step_and_cost(cost_fn, params)

    if step % 10 == 0:
        print(f"\nStep {step:2d} | Cost = {cost:.4f}")
        print(f"  Total shots so far: {opt.total_shots_used}")
        print(f"  Per-parameter shots: {opt.s[0]}")

Early in training, the allocation tends to be relatively uniform because most parameters have significant gradients. As training progresses and some parameters converge, the allocation becomes increasingly concentrated on the remaining active parameters.

Comparing Shot Usage Against a Fixed-Budget Optimizer

The real value of shot-adaptive optimization shows up in a shot-efficiency comparison.

from pennylane.optimize import GradientDescentOptimizer

# Fixed-shot baseline: 100 shots per circuit evaluation
dev_fixed = qml.device("default.qubit", wires=4, shots=100)

@qml.qnode(dev_fixed)
def cost_fn_fixed(params):
    for i in range(4):
        qml.RY(params[i], wires=i)
    for i in range(3):
        qml.CNOT(wires=[i, i + 1])
    for i in range(4):
        qml.RZ(params[i + 4], wires=i)
    return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1) @ qml.PauliZ(2))

params_fixed = pnp.array(np.random.uniform(-np.pi, np.pi, 8), requires_grad=True)
opt_fixed = GradientDescentOptimizer(stepsize=0.1)

# Run the fixed-budget baseline for the same number of steps
for step in range(50):
    params_fixed, cost_fixed = opt_fixed.step_and_cost(cost_fn_fixed, params_fixed)

# parameter-shift uses 2 * n_params circuit evaluations per step
shots_per_step_fixed = 2 * 8 * 100   # 1600 shots/step
total_fixed = shots_per_step_fixed * 50
print(f"Fixed optimizer total shots (50 steps): {total_fixed}")
print(f"Shot-adaptive total shots (50 steps):   {opt.total_shots_used}")

Note that this comparison counts total shots consumed, not iteration count. This distinction matters. The shot-adaptive optimizer may take more iterations to converge (because some steps use fewer total shots and produce noisier updates), but it typically reaches the same cost value using fewer total shots. Comparing iteration count alone can make shot-adaptive optimization look worse, when it is actually more efficient.
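One way to make that comparison quantitative is to ask how many cumulative shots each optimizer needed before the cost first dropped below a target value. A small ad hoc helper (not a PennyLane utility) over (shots, cost) histories:

```python
def shots_to_reach(shot_history, cost_history, target):
    """Cumulative shot count at which the cost first drops below
    `target`, or None if it never does."""
    for shots, cost in zip(shot_history, cost_history):
        if cost < target:
            return shots
    return None

# Illustrative histories (made-up numbers, not output from a real run)
shots_hist = [500, 1200, 2100, 3300]
cost_hist = [0.9, 0.4, -0.2, -0.6]
print(shots_to_reach(shots_hist, cost_hist, target=0.0))   # → 2100
```

Comparing this number across the two optimizers, rather than iteration counts, is the fair efficiency metric.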

Convergence Analysis

To see the shot-efficiency advantage clearly, plot the cost as a function of cumulative shots consumed rather than step number.

import matplotlib.pyplot as plt

# Run shot-adaptive optimizer, tracking cost and cumulative shots
np.random.seed(42)
init_params = pnp.array(np.random.uniform(-np.pi, np.pi, 8), requires_grad=True)

# Shot-adaptive run
params_sa = init_params.copy()
opt_sa = ShotAdaptiveOptimizer(min_shots=10)
cost_sa = []
shots_sa = []

for step in range(80):
    params_sa, c = opt_sa.step_and_cost(cost_fn, params_sa)
    cost_sa.append(float(c))
    shots_sa.append(opt_sa.total_shots_used)

# Fixed-budget run from the same starting point
params_fb = init_params.copy()
opt_fb = GradientDescentOptimizer(stepsize=0.1)
cost_fb = []
shots_fb = []
cumulative_shots_fb = 0
shots_per_step = 2 * 8 * 100  # parameter-shift with 100 shots

for step in range(80):
    params_fb, c = opt_fb.step_and_cost(cost_fn_fixed, params_fb)
    cumulative_shots_fb += shots_per_step
    cost_fb.append(float(c))
    shots_fb.append(cumulative_shots_fb)

# Plot cost vs. total shots consumed
plt.figure(figsize=(10, 5))
plt.plot(shots_sa, cost_sa, label="Shot-adaptive", linewidth=2)
plt.plot(shots_fb, cost_fb, label="Fixed-budget (100 shots)", linewidth=2)
plt.xlabel("Total shots consumed")
plt.ylabel("Cost")
plt.title("Convergence: Cost vs. Total Shots")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

In a typical run, the shot-adaptive curve reaches low cost values at a lower total shot count than the fixed-budget curve. The fixed-budget optimizer spends the same number of shots at step 1 (when all parameters matter) and at step 50 (when most parameters have converged). The shot-adaptive optimizer front-loads its budget during the early high-gradient phase and tapers off as parameters settle.

Inspecting Per-Parameter Shot Allocation

The optimizer exposes its internal shot allocation after each step.

# Re-run a single step and inspect the allocation
params_test = pnp.array(np.random.uniform(-np.pi, np.pi, 8), requires_grad=True)
opt_inspect = ShotAdaptiveOptimizer(min_shots=5)

opt_inspect.step_and_cost(cost_fn, params_test)

print("Shots allocated per parameter:")
for i, shots in enumerate(opt_inspect.s[0]):
    print(f"  param[{i}]: {int(shots)} shots")

Parameters with larger gradients and higher variance receive more shots. Early in training, when many parameters matter, the allocation is more uniform. Later, it concentrates on the remaining sensitive parameters.

To visualize how the allocation evolves over training, collect the per-parameter shots at each step:

params_track = pnp.array(np.random.uniform(-np.pi, np.pi, 8), requires_grad=True)
opt_track = ShotAdaptiveOptimizer(min_shots=10)

allocation_history = []
for step in range(50):
    params_track, _ = opt_track.step_and_cost(cost_fn, params_track)
    allocation_history.append(list(opt_track.s[0]))

allocation_history = np.array(allocation_history)

plt.figure(figsize=(10, 5))
for i in range(8):
    plt.plot(allocation_history[:, i], label=f"param[{i}]", alpha=0.7)
plt.xlabel("Step")
plt.ylabel("Shots allocated")
plt.title("Per-Parameter Shot Allocation Over Training")
plt.legend(bbox_to_anchor=(1.05, 1), loc="upper left")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

You will typically see some parameters receiving the minimum shot count after the first few steps, while one or two parameters continue to receive elevated allocations. This reflects the optimizer discovering which parameters still have significant gradients and concentrating its measurement budget there.

Tuning the Optimizer

The ShotAdaptiveOptimizer accepts several hyperparameters that control its behavior.

# Increase learning rate and minimum shots for a faster but noisier run
opt_fast = ShotAdaptiveOptimizer(min_shots=20, stepsize=0.2)

# Tighten convergence with more minimum shots
opt_precise = ShotAdaptiveOptimizer(min_shots=50, stepsize=0.05)

A higher min_shots reduces gradient noise at the cost of more total shots. Start with min_shots=10 and increase if the optimization is unstable.

Here is a breakdown of the key hyperparameters:

min_shots sets the minimum number of shots any parameter receives per step. Lower values allow more aggressive shot savings but risk noisy gradient estimates that destabilize training. Higher values provide a more reliable gradient floor but reduce the potential shot savings.

stepsize (default: 0.07) controls the base learning rate. The optimizer internally adapts this per parameter based on variance, but the base value still matters. Too large and the optimizer overshoots; too small and convergence stalls.

mu (default: 0.99) is the running-average constant for the exponential moving averages of the gradient and its variance. Values closer to 1.0 weight historical gradient information more heavily, producing smoother but slower-adapting shot allocations. Lower values make the allocation more responsive to recent gradient changes but potentially more erratic.

b (default: 1e-6) is a small regularization bias in the adaptive learning-rate calculation. It should be kept small but nonzero; it prevents the per-parameter learning rate from diverging when the running gradient estimates approach zero.

term_sampling (default: None) controls how shots are distributed across the terms of a Hamiltonian cost function. Setting term_sampling="weighted_random_sampling" enables the full Rosalin strategy, in which the shots for each expectation value are sampled across Hamiltonian terms with probabilities proportional to the magnitudes of their coefficients.

Aggressive vs. Conservative Tuning

Aggressive tuning prioritizes speed:

opt_aggressive = ShotAdaptiveOptimizer(
    min_shots=5,
    stepsize=0.2,
    mu=0.9
)

This configuration uses very few minimum shots, a large step size, and a lower running-average constant mu, which responds quickly to gradient changes. It converges fast when it works, but can diverge if gradient noise is too high.

Conservative tuning prioritizes reliability:

opt_conservative = ShotAdaptiveOptimizer(
    min_shots=50,
    stepsize=0.05,
    mu=0.999
)

This configuration ensures precise gradient estimates at every step, takes small careful steps, and smooths the shot allocation over many steps. It rarely diverges but may consume more shots than necessary.

For most problems, starting with the defaults and adjusting min_shots first is a reasonable strategy. Increase min_shots if you see cost oscillations. Decrease it if shot savings are minimal.

When to Use Shot-Adaptive Optimization

Shot-adaptive optimization is not universally the best choice. Its advantages depend on the problem structure and execution context.

Best for:

  • Circuits with many parameters. The shot reallocation benefit grows with parameter count. With 2 parameters, there is little room to redistribute. With 20 or more parameters, the savings can be substantial because many parameters converge at different rates.
  • Limited shot budgets. When you have a fixed total shot budget (hardware time constraints, cost limits), shot-adaptive optimization extracts more optimization progress per shot than uniform allocation.
  • Hardware execution. On real quantum devices, shots are expensive. The overhead of the adaptive allocation logic is negligible compared to the cost of unnecessary circuit evaluations.
  • Problems where parameters converge at different rates. If some parameters quickly reach their optimal values while others require many more steps, shot-adaptive optimization stops wasting shots on the converged parameters.

Not ideal for:

  • Small parameter counts. With fewer than 4 parameters, the overhead of tracking variance and computing allocations outweighs the savings from reallocation. Standard gradient descent with a fixed shot count works fine.
  • Exact (statevector) simulators. When using default.qubit without a shot count, expectation values are computed analytically. There are no shots to optimize, so the shot-adaptive machinery provides no benefit. Use a standard optimizer like Adam or L-BFGS instead.
  • Problems requiring very few optimization steps. If the optimization converges in under 10 steps, the adaptive allocation has little time to learn the gradient landscape and provide savings.

Comparison to SPSA:

The Simultaneous Perturbation Stochastic Approximation (SPSA) optimizer is another shot-efficient alternative. SPSA estimates the gradient using only 2 circuit evaluations per step regardless of parameter count, while the parameter-shift rule (used by ShotAdaptiveOptimizer) requires 2 evaluations per parameter. This makes SPSA dramatically cheaper per step for circuits with many parameters. However, SPSA gradient estimates have higher variance because they combine all parameter directions into a single perturbation. Shot-adaptive optimization provides more accurate per-parameter gradients and allocates shots intelligently, which often leads to better convergence per total shot consumed. The choice depends on whether your bottleneck is per-step cost (favoring SPSA) or total-shot efficiency (favoring shot-adaptive).
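The per-step evaluation counts behind that comparison are simple to tabulate. A quick sketch (plain Python; the function name is illustrative):

```python
def evals_per_step(n_params, method):
    """Circuit evaluations needed per optimization step for each
    gradient-estimation method."""
    if method == "parameter_shift":
        return 2 * n_params   # one +pi/2 and one -pi/2 shift per parameter
    if method == "spsa":
        return 2              # two perturbed evaluations, any parameter count
    raise ValueError(f"unknown method: {method}")

for n in (2, 8, 50):
    print(f"{n:3d} params: parameter-shift {evals_per_step(n, 'parameter_shift'):4d}"
          f" evals/step, SPSA {evals_per_step(n, 'spsa')} evals/step")
```

The gap widens linearly with parameter count, which is why SPSA is attractive for very deep ansatz circuits despite its noisier gradients.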

Common Mistakes

Setting min_shots too low. With min_shots=1 or min_shots=2, the gradient estimates for low-allocation parameters become dominated by shot noise. A single shot produces a binary outcome (+1 or -1 for a Pauli measurement), giving a gradient estimate with maximum variance. The optimizer may receive wildly incorrect gradient signals and diverge. Use at least min_shots=5, and prefer min_shots=10 or higher for stability.

Comparing optimizers by iteration count instead of total shots. The shot-adaptive optimizer may take 80 iterations to reach a cost that fixed-budget gradient descent reaches in 50 iterations. This does not mean it is slower. If the shot-adaptive optimizer consumed 30,000 total shots while the fixed-budget optimizer consumed 80,000, the shot-adaptive approach is significantly more efficient. Always compare cost vs. cumulative shots, not cost vs. step number.

Using shot-adaptive optimization on a statevector simulator. If you create a device with qml.device("default.qubit", wires=4) (no shots argument), PennyLane computes exact expectation values analytically. Running ShotAdaptiveOptimizer in this context either raises an error or provides no benefit, because there is no shot noise to optimize against. Always set a finite shot count on the device when using this optimizer.

Not monitoring opt.total_shots_used to verify savings. The optimizer adapts shots automatically, but that does not guarantee savings in every scenario. Always track opt.total_shots_used and compare it against the fixed-budget baseline for your specific circuit. If the adaptive optimizer uses more total shots than fixed-budget (which can happen for small circuits or unusual cost landscapes), switch to a simpler optimizer.

Forgetting to use pnp.array with requires_grad=True. PennyLane’s shot-adaptive optimizer requires autograd-compatible arrays to compute gradients. Using plain NumPy arrays (np.array) silently disables gradient tracking, causing the optimizer to receive zero gradients and make no progress. Always wrap initial parameters with pennylane.numpy and set requires_grad=True.

Summary

The shot-adaptive optimizer redistributes measurement resources to where they matter most, reducing total shots compared to fixed-budget gradient descent. The Rosalin algorithm at its core estimates per-parameter gradient magnitudes and variances, then allocates shots proportionally to concentrate measurement effort on parameters that most influence the cost function.

Use it when you are optimizing on real hardware or a finite-shot simulator and want to minimize device time. Set min_shots to control the noise floor, tune mu and stepsize to balance responsiveness and stability, and monitor opt.total_shots_used to compare efficiency against fixed-budget baselines. For circuits with many parameters and limited shot budgets, shot-adaptive optimization can cut total shot usage by 30% to 60% while achieving equivalent convergence quality.
