PennyLane Advanced · 15/26 in series · 35 min

Diagnosing and Avoiding Barren Plateaus in PennyLane

Understand why barren plateaus occur in deep variational circuits, how to detect them in PennyLane, and practical strategies to avoid them including layerwise training and structured ansatze.

What you'll learn

  • PennyLane
  • barren plateaus
  • gradient vanishing
  • trainability
  • variational circuits

Prerequisites

  • Strong Python skills
  • Solid quantum computing foundations
  • Linear algebra and complex numbers

What Is a Barren Plateau?

A barren plateau is a region of the parameter landscape where the gradient of a variational quantum circuit’s cost function becomes exponentially small in the number of qubits. As the system size n grows, the variance of the cost function’s partial derivative with respect to any trainable parameter θ_k scales as:

Var[∂C/∂θ_k] ~ O(2^{-n})   (or worse, O(4^{-n}) for global observables)

Concretely, for 20 qubits with a global cost function, the gradient variance can be as small as 4^{-20} ≈ 10^{-12}. A gradient this small is indistinguishable from shot noise on real hardware and from floating-point noise in simulation. The optimizer sees a flat landscape in every direction and makes no progress. Training time grows exponentially with system size, erasing any quantum advantage the algorithm might otherwise provide.

Barren plateaus are not merely a nuisance. They represent a fundamental obstruction to scalability. If your ansatz exhibits a barren plateau, adding more qubits does not buy more computational power; it makes the circuit harder to train without returning anything useful.
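The mismatch against shot noise can be made concrete with a few lines of arithmetic (a back-of-the-envelope sketch; the shot budget of 10,000 is an illustrative choice):

```python
import numpy as np

# Typical gradient magnitude under a global-cost barren plateau: sqrt(Var) ~ 2^{-n}
# Statistical error of an expectation value estimated with S shots: ~ 1/sqrt(S)
shots = 10_000
noise_floor = 1 / np.sqrt(shots)

for n in [4, 10, 20]:
    grad_scale = np.sqrt(4.0 ** (-n))  # sqrt of the O(4^{-n}) variance
    resolvable = grad_scale > noise_floor
    print(f"n={n:2d}: |grad| ~ {grad_scale:.1e}, shot noise ~ {noise_floor:.1e}, "
          f"resolvable: {resolvable}")
```

Beyond roughly 7 qubits at this shot budget, the typical gradient component falls below the statistical noise floor, and increasing shots only helps polynomially against an exponentially shrinking signal.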

Two Distinct Sources of Barren Plateaus

Barren plateaus arise from two independent mechanisms. A circuit can suffer from one, the other, or both simultaneously.

Source 1: Expressibility-Induced Barren Plateaus

When a parameterized circuit is sufficiently expressive, random parameter values make it approximate a unitary 2-design: a distribution over unitaries that matches the first two moments of the Haar (uniform) measure on U(2^n). Under a 2-design, the expected gradient is exactly zero, and the variance is exponentially suppressed.

The foundational result is due to McClean et al. (2018): for a parameterized quantum circuit that forms an approximate 2-design, the expected value of any partial derivative of the cost function is zero,

E[∂C/∂θ_k] = 0

and the variance satisfies

Var[∂C/∂θ_k] ≤ F(n, H)

where F(n, H) is an exponentially small function of n that depends on the cost observable H. The precise scaling depends on whether the observable is global or local (see the next section for exact formulas).

The physical intuition: a 2-design spreads the circuit’s output state uniformly over the entire Hilbert space. Since the Hilbert space dimension grows as 2^n, any particular direction in parameter space has exponentially small influence on the cost function. Small rotations of one parameter barely change the output state’s overlap with the cost observable.

Source 2: Noise-Induced Barren Plateaus

Wang et al. (2021) proved that hardware noise independently causes barren plateaus, even for shallow circuits with local cost functions. If each gate in the circuit is followed by a noise channel (depolarizing, amplitude damping, etc.), the noisy circuit’s output converges toward the maximally mixed state as the number of noisy gates grows. The gradient variance then scales as:

Var[∂C/∂θ_k] ≤ F_noise(n, L, p)

where L is the circuit depth and p is the per-gate noise rate. The key difference from expressibility-induced BPs: the severity scales with the total number of noisy gates (roughly n * L), not just the number of qubits. This means that even a 2-layer circuit on 20 qubits with realistic noise rates can exhibit a barren plateau.

This result has a sobering implication: on NISQ hardware, you face barren plateaus from both directions simultaneously. Deep circuits hit expressibility BPs; shallow-but-noisy circuits hit noise-induced BPs. The trainable regime is a narrow window of moderate depth and low noise.

Quantitative Gradient Variance Formulas

The exact scaling laws determine how quickly trainability degrades. Here are the key formulas.

Global cost function (observable acts on all qubits, e.g., H = Z_0 ⊗ Z_1 ⊗ … ⊗ Z_{n-1}):

Var[∂C/∂θ_k] = (2 / (4^n - 1)) * (Tr[H²] / 2^n - (Tr[H])² / 4^n)

For a traceless observable with Tr[H] = 0 and Tr[H²] = 2^n (which holds for tensor products of Pauli operators), this simplifies to:

Var[∂C/∂θ_k] ≈ 2 / (4^n - 1) ≈ 2 * 4^{-n}

Local cost function (observable acts on a single qubit, e.g., H = Z_0 ⊗ I ⊗ … ⊗ I):

The scaling improves but remains exponential. For a local observable on one qubit:

Var[∂C/∂θ_k] ~ O(1 / (n * 2^n))    for parameters near the measurement
Var[∂C/∂θ_k] ~ O(4^{-n})            for parameters deep in the circuit

“Near the measurement” means only a few entangling layers separate the gate from the measured qubit, so the gate still sits inside the observable’s small light cone. “Deep in the circuit” means many scrambling layers intervene between the gate and the measurement. The improvement over the global case is large, but the variance itself remains exponentially small in n. Local cost functions delay the onset of barren plateaus to larger n; for deep circuits they do not prevent them.
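Plugging numbers into the simplified global-cost expression Var ≈ 2 / (4^n − 1) shows how quickly the trainable window closes (plain arithmetic, no simulation):

```python
import numpy as np

print("Analytic gradient variance, global Pauli-string cost under a 2-design:")
for n in [2, 4, 8, 12, 16, 20]:
    var = 2.0 / (4.0 ** n - 1.0)
    print(f"  n={n:2d}: Var ~ {var:.2e}, typical |grad| ~ {np.sqrt(var):.2e}")
```

Already at n = 12 the typical gradient component is around 10^{-4}, comparable to the statistical error of a 10^8-shot estimate.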

Numerical Verification of Gradient Variance Scaling

The following code measures gradient variance for increasing qubit counts and checks the exponential scaling prediction. For a circuit deep enough to form an approximate 2-design, log(variance) vs n should have slope approximately -ln(4) ≈ -1.386 per qubit.

import pennylane as qml
from pennylane import numpy as np
import matplotlib.pyplot as plt

def make_hardware_efficient_circuit(n_qubits, n_layers):
    """Build a hardware-efficient ansatz with RY-RZ rotations and CNOT entanglement."""
    def circuit(params):
        for layer in range(n_layers):
            for q in range(n_qubits):
                qml.RY(params[layer, q, 0], wires=q)
                qml.RZ(params[layer, q, 1], wires=q)
            for q in range(n_qubits - 1):
                qml.CNOT(wires=[q, q + 1])
        return qml.expval(qml.PauliZ(0))
    return circuit

def measure_gradient_variance(n_qubits, n_layers, n_samples=200):
    """Compute the variance of the gradient of the first parameter over random initializations."""
    dev = qml.device("default.qubit", wires=n_qubits)
    circuit = make_hardware_efficient_circuit(n_qubits, n_layers)
    qnode = qml.QNode(circuit, dev, diff_method="parameter-shift")

    grads = []
    for _ in range(n_samples):
        params = np.random.uniform(0, 2 * np.pi, (n_layers, n_qubits, 2))
        g = qml.grad(qnode)(params)
        grads.append(float(g[0, 0, 0]))

    return float(np.var(grads))

# Measure gradient variance for n = 2, 3, 4, 5, 6 qubits with 4 layers
qubit_counts = [2, 3, 4, 5, 6]
variances = []

print("Gradient variance vs system size (4 layers, Z_0 readout):")
for n in qubit_counts:
    var = measure_gradient_variance(n, n_layers=4, n_samples=200)
    variances.append(var)
    print(f"  n={n} qubits: Var[dC/dtheta] = {var:.4e}")

# Compute the empirical scaling exponent via linear regression on log(var) vs n
log_vars = np.log(np.array(variances))
ns = np.array(qubit_counts, dtype=float)
slope = np.polyfit(ns, log_vars, 1)[0]
print(f"\nEmpirical slope of log(Var) vs n: {slope:.3f}")
print(f"Theoretical slope for 4^{{-n}} scaling: {-np.log(4):.3f}")

# Plot the scaling
plt.figure(figsize=(6, 4))
plt.semilogy(qubit_counts, variances, "o-", label="Measured variance")
plt.semilogy(qubit_counts, [variances[0] * 4**(qubit_counts[0] - n) for n in qubit_counts],
             "--", label=r"$4^{-n}$ reference")
plt.xlabel("Number of qubits")
plt.ylabel("Var[dC/dθ]")
plt.title("Gradient Variance Scaling")
plt.legend()
plt.tight_layout()
plt.savefig("gradient_variance_scaling.png", dpi=150)
plt.show()

The measured slope should be close to -1.386 (i.e., -ln(4)). If your ansatz gives a significantly shallower slope, the circuit is not yet expressive enough to form a 2-design at that depth, which is actually good for trainability.

Detecting Barren Plateaus in Practice

The standard diagnostic measures gradient variance across many random parameter initializations and checks how it scales with system size. Here is a complete, production-quality diagnostic.

import pennylane as qml
from pennylane import numpy as np

def make_hardware_efficient_ansatz(n_qubits, n_layers):
    """Returns a hardware-efficient variational circuit."""
    def circuit(params):
        for layer in range(n_layers):
            for q in range(n_qubits):
                qml.RY(params[layer, q, 0], wires=q)
                qml.RZ(params[layer, q, 1], wires=q)
            for q in range(n_qubits - 1):
                qml.CNOT(wires=[q, q + 1])
        return qml.expval(qml.PauliZ(0))
    return circuit

def full_gradient_variance_diagnostic(n_qubits, n_layers, n_samples=200):
    """Measure variance of ALL gradient components, not just the first one.
    
    Returns the minimum, median, and maximum variance across all parameters.
    Reporting only the first component can be misleading (see Common Mistakes).
    """
    dev = qml.device("default.qubit", wires=n_qubits)
    circuit = make_hardware_efficient_ansatz(n_qubits, n_layers)
    qnode = qml.QNode(circuit, dev, diff_method="parameter-shift")

    n_params = n_layers * n_qubits * 2
    all_grads = np.zeros((n_samples, n_params))

    for i in range(n_samples):
        params = np.random.uniform(0, 2 * np.pi, (n_layers, n_qubits, 2))
        g = qml.grad(qnode)(params)
        all_grads[i] = g.flatten()

    # Variance of each gradient component across samples
    component_variances = np.var(all_grads, axis=0)

    return {
        "min_variance": float(np.min(component_variances)),
        "median_variance": float(np.median(component_variances)),
        "max_variance": float(np.max(component_variances)),
        "all_variances": component_variances,
    }

# Run the diagnostic
print("Full gradient variance diagnostic (4 qubits, 4 layers):")
result = full_gradient_variance_diagnostic(4, 4, n_samples=200)
print(f"  Min variance:    {result['min_variance']:.4e}")
print(f"  Median variance: {result['median_variance']:.4e}")
print(f"  Max variance:    {result['max_variance']:.4e}")

The minimum variance across all gradient components is the most informative number. If the minimum is already exponentially small, the circuit has parameters that are essentially untrainable regardless of optimizer choice.

Visualizing the Loss Landscape

Numbers alone can be abstract. Visualizing the cost function over a 2D slice of parameter space makes barren plateaus viscerally clear.

Fix all parameters except two (θ₁, θ₂) and sweep both over [0, 2π]. For a shallow circuit, the landscape has clear hills and valleys. For a deep circuit, the landscape is flat with only noise-level fluctuations.

import pennylane as qml
from pennylane import numpy as np
import matplotlib.pyplot as plt

def landscape_2d(n_qubits, n_layers, grid_points=50):
    """Compute the cost function over a 2D parameter slice."""
    dev = qml.device("default.qubit", wires=n_qubits)

    @qml.qnode(dev)
    def circuit(params_flat):
        params = params_flat.reshape(n_layers, n_qubits, 2)
        for layer in range(n_layers):
            for q in range(n_qubits):
                qml.RY(params[layer, q, 0], wires=q)
                qml.RZ(params[layer, q, 1], wires=q)
            for q in range(n_qubits - 1):
                qml.CNOT(wires=[q, q + 1])
        return qml.expval(qml.PauliZ(0))

    # Fix all parameters randomly, then vary two of them
    total_params = n_layers * n_qubits * 2
    base_params = np.random.uniform(0, 2 * np.pi, total_params)
    
    # Indices of the two parameters to vary
    idx1, idx2 = 0, 1

    theta_range = np.linspace(0, 2 * np.pi, grid_points)
    landscape = np.zeros((grid_points, grid_points))

    for i, t1 in enumerate(theta_range):
        for j, t2 in enumerate(theta_range):
            p = base_params.copy()
            p[idx1] = t1
            p[idx2] = t2
            landscape[i, j] = circuit(p)

    return theta_range, landscape

# Compare shallow (2 layers) vs deep (10 layers) on 6 qubits
fig, axes = plt.subplots(1, 2, figsize=(12, 5))

for ax, n_layers, title in [
    (axes[0], 2, "Shallow circuit (2 layers, 6 qubits)"),
    (axes[1], 10, "Deep circuit (10 layers, 6 qubits)"),
]:
    theta_range, landscape = landscape_2d(6, n_layers, grid_points=50)
    im = ax.imshow(
        landscape,
        extent=[0, 2 * np.pi, 0, 2 * np.pi],
        origin="lower",
        cmap="RdBu",
        aspect="equal",
    )
    ax.set_xlabel("θ₁")
    ax.set_ylabel("θ₂")
    ax.set_title(title)
    plt.colorbar(im, ax=ax, label="Cost")

plt.tight_layout()
plt.savefig("loss_landscape_comparison.png", dpi=150)
plt.show()

In the shallow case, you see broad, smooth regions with distinct minima and maxima. The color range spans most of [-1, 1]. In the deep case, the entire landscape compresses to a narrow band around 0, with only tiny random fluctuations. An optimizer starting anywhere in the deep landscape receives no useful signal about which direction to move.

Why 2-Designs Cause Barren Plateaus

To understand barren plateaus at a deeper level, consider what it means for a circuit distribution to form a unitary 2-design.

The Haar measure is the uniform probability distribution over the unitary group U(2^n). It is the unique measure that is invariant under left and right multiplication by any fixed unitary. Sampling a unitary from the Haar measure gives you a “random” unitary with no preferred direction.

A unitary t-design is a finite set (or parameterized family) of unitaries whose average over the first t tensor powers matches that of the Haar measure. Specifically, a set S = {U_i} is a 2-design if:

(1/|S|) Σ_i (U_i ⊗ U_i*) ρ (U_i† ⊗ U_i†*) = ∫ (U ⊗ U*) ρ (U† ⊗ U†*) dμ_Haar(U)

for all density matrices ρ on the doubled Hilbert space.

The barren plateau theorem only requires a 2-design (not a full t-design for large t). This is because the gradient variance involves second moments of the circuit distribution: the variance is E[(∂C/∂θ)²] - (E[∂C/∂θ])², which depends on correlations of pairs of circuit instances.

Here is why forming a 2-design causes flat gradients. Under the Haar measure, the output state of the circuit is distributed uniformly over the Hilbert space. The cost function C(θ) = <ψ(θ)|H|ψ(θ)> then becomes a random variable whose expectation is Tr[H]/2^n (the “infinite temperature” value) and whose variance is exponentially small. Changing one parameter by a small amount δθ only rotates the output state in one direction within a 2^n-dimensional space. The overlap of this rotation with the cost observable H is exponentially small because H has bounded operator norm while the state space is exponentially large.

Shallow circuits do not form 2-designs. A circuit with depth d can only generate unitaries in a subset of U(2^n) of limited “volume.” This is precisely why shallow circuits avoid barren plateaus: they are not expressive enough to spread probability mass uniformly, so the cost function retains exploitable structure.

The practical implication: there is an inherent tension between expressibility and trainability. More expressive circuits can represent better solutions but are harder to train. Less expressive circuits are easier to train but may not contain the optimal solution. Finding the right balance is the central challenge of variational quantum algorithm design.
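The concentration behind this tension can be checked without any circuits by sampling Haar-random states directly (a sketch; a normalized complex Gaussian vector is Haar-distributed over pure states, and Z_0 here means Z on the first qubit):

```python
import numpy as np

rng = np.random.default_rng(0)

def haar_random_state(n_qubits):
    """Normalized complex Gaussian vector = Haar-random pure state."""
    dim = 2 ** n_qubits
    psi = rng.normal(size=dim) + 1j * rng.normal(size=dim)
    return psi / np.linalg.norm(psi)

def z0_expectation(psi):
    """<psi| Z_0 (x) I ... I |psi>: +1 weight where the first qubit is 0."""
    probs = np.abs(psi) ** 2
    half = len(psi) // 2
    return probs[:half].sum() - probs[half:].sum()

for n in [2, 4, 8]:
    vals = [z0_expectation(haar_random_state(n)) for _ in range(2000)]
    print(f"n={n}: mean <Z_0> = {np.mean(vals):+.4f}, Var[<Z_0>] = {np.var(vals):.2e}")
```

The exact Haar variance for a traceless single-qubit Pauli is 1/(2^n + 1), so doubling the qubit count roughly squares the suppression: cost values concentrate at the infinite-temperature mean long before any optimizer sees a useful signal.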

Strategy 1: Use Local Cost Functions

Global observables (those that act nontrivially on all qubits) produce the worst barren plateau scaling: Var ~ O(4^{-n}). Local observables (those that act on a constant number of qubits) produce milder scaling: Var ~ O(1/poly(n)) for shallow circuits.

The prescription: decompose your cost function into a sum of local terms whenever possible.

import pennylane as qml
from pennylane import numpy as np

n_qubits = 8
dev = qml.device("default.qubit", wires=n_qubits)

# Global cost: tensor product observable acting on all qubits
@qml.qnode(dev)
def global_cost(params):
    for q in range(n_qubits):
        qml.RY(params[q], wires=q)
    for q in range(n_qubits - 1):
        qml.CNOT(wires=[q, q + 1])
    # This is a global observable: Z on every qubit
    return qml.expval(
        qml.prod(*[qml.PauliZ(q) for q in range(n_qubits)])
    )

# Local cost: expectation of a sum of single-qubit Z terms
@qml.qnode(dev)
def local_cost(params):
    for q in range(n_qubits):
        qml.RY(params[q], wires=q)
    for q in range(n_qubits - 1):
        qml.CNOT(wires=[q, q + 1])
    # A QNode must return measurement processes, not a Python sum of them,
    # so measure the summed observable as a single expectation value
    return qml.expval(qml.sum(*[qml.PauliZ(q) for q in range(n_qubits)]))

# Compare gradient variances
def compare_cost_functions(n_samples=200):
    global_grads = []
    local_grads = []

    for _ in range(n_samples):
        params = np.random.uniform(0, 2 * np.pi, n_qubits)
        g_global = qml.grad(global_cost)(params)
        g_local = qml.grad(local_cost)(params)
        global_grads.append(float(g_global[0]))
        local_grads.append(float(g_local[0]))

    print(f"Global cost gradient variance: {np.var(global_grads):.4e}")
    print(f"Local cost gradient variance:  {np.var(local_grads):.4e}")
    print(f"Ratio (local / global):        {np.var(local_grads) / np.var(global_grads):.1f}x")

compare_cost_functions()

For VQE on a molecular Hamiltonian, the Hamiltonian naturally decomposes into a sum of local Pauli terms (e.g., Z_0 Z_1 + 0.5 X_0 + …). Each term acts on at most a few qubits, so the cost function is already local. The barren plateau risk comes primarily from the ansatz depth, not the observable.

For classification tasks, you might be tempted to use a global observable like the parity of all qubits. Instead, measure a single output qubit or a small subset. This sacrifices some representational capacity but preserves trainability.

Strategy 2: Layerwise Training

Layerwise training builds the circuit incrementally. Start with one layer, train it to convergence, then freeze its parameters and add a second layer initialized near identity. Train the second layer. Repeat.

This approach works because at each stage, the circuit is shallow enough to avoid expressibility-induced barren plateaus. The newly added layer starts near the identity, so it makes only small perturbations to the already-trained circuit. Gradients with respect to the new parameters are not exponentially suppressed because the new layer’s effect on the output state is small and structured.

from pennylane import numpy as np
import pennylane as qml
from pennylane.optimize import AdamOptimizer

def layerwise_train(n_qubits, max_layers, steps_per_layer=100):
    """Train a variational circuit one layer at a time.

    Each new layer is initialized with all rotation angles set to zero,
    making the new layer act approximately as identity (the CNOT gates
    are still present, but the rotations contribute no additional change).
    Note: this implementation keeps earlier layers trainable as the circuit
    grows (a warm-start variant); strict layerwise training would freeze
    their parameters after each stage.
    """
    dev = qml.device("default.qubit", wires=n_qubits)
    all_params = []
    optimizer = AdamOptimizer(stepsize=0.05)
    cost_history = []

    for n_layers in range(1, max_layers + 1):
        # Initialize new layer near zero (not uniformly random)
        new_layer = np.zeros((n_qubits, 2), requires_grad=True)
        all_params.append(new_layer)
        params_stack = np.concatenate([p.flatten() for p in all_params])

        @qml.qnode(dev)
        def circuit(params):
            reshaped = params.reshape(n_layers, n_qubits, 2)
            for layer in range(n_layers):
                for q in range(n_qubits):
                    qml.RY(reshaped[layer, q, 0], wires=q)
                    qml.RZ(reshaped[layer, q, 1], wires=q)
                for q in range(n_qubits - 1):
                    qml.CNOT(wires=[q, q + 1])
            return qml.expval(qml.PauliZ(0))

        layer_costs = []
        for step in range(steps_per_layer):
            params_stack, cost = optimizer.step_and_cost(circuit, params_stack)
            layer_costs.append(float(cost))

        cost_history.append(layer_costs)
        print(f"After layer {n_layers}: cost = {layer_costs[-1]:.4f}")
        all_params = [params_stack.reshape(n_layers, n_qubits, 2)]

    return params_stack, cost_history

# Run layerwise training on 6 qubits
params, history = layerwise_train(n_qubits=6, max_layers=5, steps_per_layer=100)

A critical detail: the new layer must be initialized near identity, not with random parameters. If you initialize each new layer randomly, you reintroduce the barren plateau at each stage (see Common Mistakes below).

Strategy 3: Identity Block Initialization

Identity block initialization is a refined version of “initialize near zero.” The idea: design each variational layer so that at initialization, the entire layer implements the identity operation exactly. This guarantees that at the start of training, the circuit output is |00…0> regardless of how many layers you stack.

A simple identity block for two qubits uses the structure RY(θ) - CNOT - RY(-θ) - CNOT, which collapses to the identity when θ = 0: the rotations vanish and the two CNOTs cancel (CNOT² = I). For a more general construction, pair each rotation with its inverse:

import pennylane as qml
from pennylane import numpy as np

def identity_block_layer(params, n_qubits, layer_idx):
    """A single variational layer that acts as identity when all params are zero.
    
    Structure per qubit pair (q, q+1):
      RY(θ1) on q
      RZ(θ2) on q
      CNOT(q, q+1)
      RZ(-θ2) on q
      RY(-θ1) on q
      CNOT(q, q+1)
    
    When θ1 = θ2 = 0, the two CNOTs cancel (CNOT^2 = I) and the
    rotations cancel pairwise, giving the identity on both qubits.
    """
    for q in range(0, n_qubits - 1, 2):
        # Forward rotations
        qml.RY(params[layer_idx, q, 0], wires=q)
        qml.RZ(params[layer_idx, q, 1], wires=q)
        # Entangling gate
        qml.CNOT(wires=[q, q + 1])
        # Inverse rotations (negative angles)
        qml.RZ(-params[layer_idx, q, 1], wires=q)
        qml.RY(-params[layer_idx, q, 0], wires=q)
        # Second CNOT to cancel the first
        qml.CNOT(wires=[q, q + 1])

def verify_identity_initialization(n_qubits, n_layers):
    """Verify that the circuit acts as identity when all parameters are zero."""
    dev = qml.device("default.qubit", wires=n_qubits)

    @qml.qnode(dev)
    def circuit(params):
        for layer in range(n_layers):
            identity_block_layer(params, n_qubits, layer)
        return qml.state()

    params = np.zeros((n_layers, n_qubits, 2))
    output_state = circuit(params)

    # The initial state is |00...0>, which has amplitude 1 on the first component
    fidelity = np.abs(output_state[0]) ** 2
    print(f"Fidelity with |00...0> at initialization: {fidelity:.6f}")
    assert np.isclose(fidelity, 1.0), "Identity block is not acting as identity!"
    return True

# Verify for various sizes
for n in [4, 6, 8]:
    print(f"n={n} qubits, 5 layers: ", end="")
    verify_identity_initialization(n, 5)

The advantage of identity blocks over simple zero initialization: with zero-initialized RY-RZ layers followed by CNOTs, the circuit at initialization is not the identity operator (the CNOTs still act nontrivially, even though they happen to leave the input |00…0> unchanged). With identity blocks, the circuit is exactly the identity at initialization for every input state. During training, the parameters evolve away from zero, and the circuit gradually builds up the target unitary.

Strategy 4: Structured Ansatze

Problem-inspired ansatze like UCCSD (for quantum chemistry) or QAOA (for combinatorial optimization) have far fewer parameters than generic hardware-efficient circuits and concentrate probability mass on physically relevant states. This prevents the circuit from forming a 2-design.

QAOA (Quantum Approximate Optimization Algorithm): For a combinatorial optimization problem with cost Hamiltonian H_C and mixer Hamiltonian H_M = Σ X_i, the QAOA ansatz is:

|γ, β> = e^{-iβ_p H_M} e^{-iγ_p H_C} ... e^{-iβ_1 H_M} e^{-iγ_1 H_C} |+>^n

This has only 2p parameters (p = number of layers), regardless of the number of qubits. The structure is determined entirely by the problem Hamiltonian, so the circuit never becomes a 2-design.

UCCSD (Unitary Coupled Cluster Singles and Doubles): For quantum chemistry, the UCCSD ansatz applies excitation operators:

|ψ> = e^{T - T†} |HF>

where T = T_1 + T_2 contains single and double excitation operators. The number of parameters scales polynomially with the number of orbitals, not exponentially.

For applications without a natural problem structure, consider:

  • Correlation-aware connectivity: use qubit connectivity that mirrors the problem’s interaction graph rather than a generic brick-wall or linear pattern.
  • Shallow circuits with expressive gates: use 2-3 layers with gates chosen to cover the relevant subspace efficiently.
  • Data-driven ansatz design: use classical pre-processing (e.g., from a Hartree-Fock solution) to inform the initial circuit structure.

Quantum Convolutional Neural Networks (QCNNs) as BP-Free Ansatze

Pesah et al. (2021) proved that quantum convolutional neural networks (QCNNs) provably avoid barren plateaus under certain conditions. The QCNN architecture has a hierarchical structure that naturally restricts expressibility while maintaining representational power.

A QCNN consists of:

  1. Convolutional layers: local 2-qubit unitaries applied in a brick-wall pattern across all qubits.
  2. Pooling layers: measure every other qubit and, conditioned on the measurement outcome, apply a single-qubit unitary to the neighboring kept qubit. This halves the number of active qubits at each level.
  3. Repeat: apply another convolutional layer to the remaining qubits, then pool again.
  4. Final measurement: after log(n) pooling steps, one qubit remains. Measure it.

Because each pooling layer halves the system size, the total depth is O(log n). The cost function is a local measurement on one qubit at each level, avoiding global observables.

import pennylane as qml
from pennylane import numpy as np

def qcnn_circuit(params, n_qubits):
    """
    Implement a QCNN for n_qubits (must be a power of 2).
    
    params structure:
      - params[level] contains parameters for convolutional + pooling at each level
      - At level k, there are n_qubits / 2^k active qubits
    """
    active_wires = list(range(n_qubits))
    level = 0

    while len(active_wires) > 1:
        n_active = len(active_wires)

        # Convolutional layer: 2-qubit unitaries on adjacent pairs
        for i in range(0, n_active - 1, 2):
            q0, q1 = active_wires[i], active_wires[i + 1]
            # Parameterized 2-qubit gate (6 parameters per pair)
            qml.RY(params[level, i, 0], wires=q0)
            qml.RY(params[level, i, 1], wires=q1)
            qml.CNOT(wires=[q0, q1])
            qml.RY(params[level, i, 2], wires=q0)
            qml.RY(params[level, i, 3], wires=q1)

        # Odd-offset convolutional gates (shifted by 1)
        for i in range(1, n_active - 1, 2):
            q0, q1 = active_wires[i], active_wires[i + 1]
            qml.RY(params[level, i, 0] * 0.5, wires=q0)
            qml.RY(params[level, i, 1] * 0.5, wires=q1)
            qml.CNOT(wires=[q0, q1])

        # Pooling: keep every other qubit, apply conditioned rotation
        new_active = []
        for i in range(0, n_active, 2):
            kept = active_wires[i]
            if i + 1 < n_active:
                discarded = active_wires[i + 1]
                # Controlled rotation from discarded to kept qubit
                qml.CRY(params[level, i, 4] if params.shape[-1] > 4 else 0.0,
                         wires=[discarded, kept])
            new_active.append(kept)

        active_wires = new_active
        level += 1

    # Return expectation value on the final remaining qubit
    return qml.expval(qml.PauliZ(active_wires[0]))

def compare_qcnn_vs_hardware_efficient(n_qubits=8, n_samples=100):
    """Compare gradient variance between QCNN and hardware-efficient ansatz."""
    dev = qml.device("default.qubit", wires=n_qubits)
    n_levels = int(np.log2(n_qubits))

    # QCNN circuit
    @qml.qnode(dev)
    def qcnn_qnode(params):
        return qcnn_circuit(params, n_qubits)

    # Hardware-efficient circuit with similar depth
    @qml.qnode(dev)
    def hea_qnode(params):
        n_layers = n_levels * 2  # Similar depth to QCNN
        reshaped = params[:n_layers * n_qubits * 2].reshape(n_layers, n_qubits, 2)
        for layer in range(n_layers):
            for q in range(n_qubits):
                qml.RY(reshaped[layer, q, 0], wires=q)
                qml.RZ(reshaped[layer, q, 1], wires=q)
            for q in range(n_qubits - 1):
                qml.CNOT(wires=[q, q + 1])
        return qml.expval(qml.PauliZ(0))

    # Measure QCNN gradient variance
    qcnn_grads = []
    qcnn_param_shape = (n_levels, n_qubits, 6)
    for _ in range(n_samples):
        params = np.random.uniform(0, 2 * np.pi, qcnn_param_shape)
        g = qml.grad(qcnn_qnode)(params)
        qcnn_grads.append(float(g.flatten()[0]))

    # Measure HEA gradient variance
    hea_grads = []
    hea_n_params = n_levels * 2 * n_qubits * 2
    for _ in range(n_samples):
        params = np.random.uniform(0, 2 * np.pi, hea_n_params)
        g = qml.grad(hea_qnode)(params)
        hea_grads.append(float(g[0]))

    print(f"QCNN gradient variance:              {np.var(qcnn_grads):.4e}")
    print(f"Hardware-efficient gradient variance: {np.var(hea_grads):.4e}")
    print(f"Ratio (QCNN / HEA):                  {np.var(qcnn_grads) / np.var(hea_grads):.1f}x")

compare_qcnn_vs_hardware_efficient()

The QCNN should show significantly higher gradient variance than the hardware-efficient ansatz at the same qubit count, confirming that its hierarchical structure avoids barren plateaus.

Noise-Induced Barren Plateaus: Demonstration

Even with a shallow circuit that avoids expressibility-induced barren plateaus, hardware noise can independently suppress gradients. The following code demonstrates this by adding depolarizing noise at varying rates.

import pennylane as qml
from pennylane import numpy as np
import matplotlib.pyplot as plt

def measure_noisy_gradient_variance(n_qubits, n_layers, noise_rate, n_samples=100):
    """Measure gradient variance under depolarizing noise."""
    dev = qml.device("default.mixed", wires=n_qubits)

    @qml.qnode(dev, diff_method="parameter-shift")
    def noisy_circuit(params):
        for layer in range(n_layers):
            for q in range(n_qubits):
                qml.RY(params[layer, q], wires=q)
                if noise_rate > 0:
                    qml.DepolarizingChannel(noise_rate, wires=q)
            for q in range(n_qubits - 1):
                qml.CNOT(wires=[q, q + 1])
                if noise_rate > 0:
                    # Apply depolarizing noise to both qubits after CNOT
                    qml.DepolarizingChannel(noise_rate, wires=q)
                    qml.DepolarizingChannel(noise_rate, wires=q + 1)
        return qml.expval(qml.PauliZ(0))

    grads = []
    for _ in range(n_samples):
        params = np.random.uniform(0, 2 * np.pi, (n_layers, n_qubits))
        g = qml.grad(noisy_circuit)(params)
        grads.append(float(g[0, 0]))

    return float(np.var(grads))

# Shallow 2-layer circuit on 6 qubits with varying noise
noise_rates = [0.0, 0.001, 0.005, 0.01, 0.02, 0.05]
variances = []

print("Noise-induced barren plateau (6 qubits, 2 layers):")
for p in noise_rates:
    var = measure_noisy_gradient_variance(6, 2, p, n_samples=100)
    variances.append(var)
    print(f"  noise rate p={p:.3f}: Var[dC/dθ] = {var:.4e}")

# Plot the result
plt.figure(figsize=(6, 4))
plt.semilogy(noise_rates, variances, "o-")
plt.xlabel("Depolarizing noise rate (p)")
plt.ylabel("Var[dC/dθ]")
plt.title("Gradient Variance vs. Noise Rate\n(6 qubits, 2 layers)")
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("noise_induced_bp.png", dpi=150)
plt.show()

Notice that even with only 2 layers, increasing the noise rate from 0 to 0.05 can reduce gradient variance by orders of magnitude. This happens because each noisy gate pushes the output state closer to the maximally mixed state I/2^n, where all expectation values are zero and all gradients vanish. The total number of noisy gates in a 2-layer, 6-qubit circuit is roughly 2 * (6 + 5*2) = 32, so even a small per-gate noise rate compounds rapidly.
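A back-of-envelope check of this compounding, as a rough estimate rather than an exact simulation: for PennyLane's DepolarizingChannel parameterization, each single-qubit channel with probability p shrinks Bloch-vector components (and hence single-qubit expectation values) by a factor of (1 - 4p/3). Multiplying this factor across the 32 noisy channels counted above:

```python
# Rough signal-suppression estimate: each single-qubit depolarizing channel
# with probability p shrinks Bloch-vector components by (1 - 4p/3), so a
# circuit with G noisy channels retains roughly (1 - 4p/3)^G of its signal.
def residual_signal(p, n_channels):
    return (1 - 4 * p / 3) ** n_channels

G = 2 * (6 + 5 * 2)  # 32 noisy channels in the 2-layer, 6-qubit circuit above
for p in [0.001, 0.01, 0.05]:
    print(f"p = {p:.3f}: residual signal ~ {residual_signal(p, G):.3f}")
```

At p = 0.05 the retained signal is roughly a tenth of the noiseless value, consistent with the orders-of-magnitude drop in gradient variance seen in the experiment.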

Correlation Between Gradient Variance and Trainability

The gradient variance diagnostic is only useful if low variance actually predicts poor training performance. The following experiment confirms this connection empirically.

We run VQE on a 4-qubit transverse-field Ising model with three ansatz types of varying expressibility. For each, we measure the initial gradient variance and then run optimization. The ansatz with the highest gradient variance should converge fastest.

import pennylane as qml
from pennylane import numpy as np
from pennylane.optimize import AdamOptimizer

n_qubits = 4

# Transverse-field Ising Hamiltonian: H = -Σ Z_i Z_{i+1} - 0.5 Σ X_i
coeffs = []
obs = []
for i in range(n_qubits - 1):
    coeffs.append(-1.0)
    obs.append(qml.PauliZ(i) @ qml.PauliZ(i + 1))
for i in range(n_qubits):
    coeffs.append(-0.5)
    obs.append(qml.PauliX(i))
H = qml.Hamiltonian(coeffs, obs)

dev = qml.device("default.qubit", wires=n_qubits)

# Ansatz A: Deep hardware-efficient (8 layers, likely barren plateau)
@qml.qnode(dev)
def ansatz_deep(params):
    n_layers = 8
    reshaped = params.reshape(n_layers, n_qubits, 2)
    for layer in range(n_layers):
        for q in range(n_qubits):
            qml.RY(reshaped[layer, q, 0], wires=q)
            qml.RZ(reshaped[layer, q, 1], wires=q)
        for q in range(n_qubits - 1):
            qml.CNOT(wires=[q, q + 1])
    return qml.expval(H)

# Ansatz B: Shallow hardware-efficient (2 layers)
@qml.qnode(dev)
def ansatz_shallow(params):
    n_layers = 2
    reshaped = params.reshape(n_layers, n_qubits, 2)
    for layer in range(n_layers):
        for q in range(n_qubits):
            qml.RY(reshaped[layer, q, 0], wires=q)
            qml.RZ(reshaped[layer, q, 1], wires=q)
        for q in range(n_qubits - 1):
            qml.CNOT(wires=[q, q + 1])
    return qml.expval(H)

# Ansatz C: QAOA-inspired (2 layers, problem-structured)
@qml.qnode(dev)
def ansatz_qaoa(params):
    # Initial superposition
    for q in range(n_qubits):
        qml.Hadamard(wires=q)
    # QAOA layers
    for layer in range(2):
        gamma = params[2 * layer]
        beta = params[2 * layer + 1]
        # Cost unitary: e^{-i gamma H_C}
        for i in range(n_qubits - 1):
            qml.CNOT(wires=[i, i + 1])
            qml.RZ(2 * gamma, wires=i + 1)
            qml.CNOT(wires=[i, i + 1])
        # Mixer unitary: e^{-i beta H_M}
        for i in range(n_qubits):
            qml.RX(2 * beta, wires=i)
    return qml.expval(H)

# Measure initial gradient variance for each ansatz
def measure_ansatz_grad_var(qnode, param_shape, n_samples=100):
    grads = []
    for _ in range(n_samples):
        params = np.random.uniform(0, 2 * np.pi, param_shape)
        g = qml.grad(qnode)(params)
        # Single component tracked here for speed; a full diagnostic would
        # cover every parameter (see Common Mistakes below)
        grads.append(float(g.flatten()[0]))
    return float(np.var(grads))

var_deep = measure_ansatz_grad_var(ansatz_deep, (8 * n_qubits * 2,))
var_shallow = measure_ansatz_grad_var(ansatz_shallow, (2 * n_qubits * 2,))
var_qaoa = measure_ansatz_grad_var(ansatz_qaoa, (4,))

print("Initial gradient variance:")
print(f"  Deep HEA (8 layers):    {var_deep:.4e}")
print(f"  Shallow HEA (2 layers): {var_shallow:.4e}")
print(f"  QAOA (2 layers):        {var_qaoa:.4e}")

# Now run optimization for each ansatz
n_steps = 100

results = {}
for name, qnode, param_shape in [
    ("Deep HEA", ansatz_deep, (8 * n_qubits * 2,)),
    ("Shallow HEA", ansatz_shallow, (2 * n_qubits * 2,)),
    ("QAOA", ansatz_qaoa, (4,)),
]:
    # Fresh optimizer per ansatz: Adam keeps per-parameter moment estimates,
    # so its state must not be shared across runs with different shapes
    optimizer = AdamOptimizer(stepsize=0.05)
    params = np.random.uniform(0, 2 * np.pi, param_shape)
    costs = []
    for step in range(n_steps):
        params, cost = optimizer.step_and_cost(qnode, params)
        costs.append(float(cost))
    results[name] = costs
    print(f"{name}: initial cost = {costs[0]:.3f}, final cost = {costs[-1]:.3f}")

# The ansatz with highest initial gradient variance should show the most progress

You should observe that the deep hardware-efficient ansatz makes little to no progress (stuck near the random initialization energy), the shallow hardware-efficient ansatz makes moderate progress, and the QAOA ansatz converges most reliably toward the ground state energy. This directly validates the gradient variance diagnostic as a predictor of trainability.

Hardware-Efficient Ansatz Depth Analysis

A practical question: for a given qubit count, how many layers can you add before hitting a barren plateau? The answer depends on the ansatz type.

For random hardware-efficient ansatze, gradient variance decreases exponentially with depth. For structured ansatze (QAOA, UCCSD), the decrease is polynomial or even absent. The following experiment maps this boundary.

import pennylane as qml
from pennylane import numpy as np
import matplotlib.pyplot as plt

def grad_var_vs_depth(n_qubits, max_layers, n_samples=100):
    """Measure gradient variance as a function of circuit depth."""
    variances = []
    layer_counts = list(range(1, max_layers + 1))

    for n_layers in layer_counts:
        dev = qml.device("default.qubit", wires=n_qubits)

        @qml.qnode(dev)
        def circuit(params):
            reshaped = params.reshape(n_layers, n_qubits, 2)
            for layer in range(n_layers):
                for q in range(n_qubits):
                    qml.RY(reshaped[layer, q, 0], wires=q)
                    qml.RZ(reshaped[layer, q, 1], wires=q)
                for q in range(n_qubits - 1):
                    qml.CNOT(wires=[q, q + 1])
            return qml.expval(qml.PauliZ(0))

        grads = []
        for _ in range(n_samples):
            params = np.random.uniform(0, 2 * np.pi, (n_layers * n_qubits * 2,))
            g = qml.grad(circuit)(params)
            grads.append(float(g[0]))
        variances.append(float(np.var(grads)))

    return layer_counts, variances

# Run for 6 qubits
layers, vars_6q = grad_var_vs_depth(6, max_layers=10, n_samples=100)

plt.figure(figsize=(6, 4))
plt.semilogy(layers, vars_6q, "o-", label="6 qubits")
plt.xlabel("Number of layers")
plt.ylabel("Var[dC/dθ]")
plt.title("Gradient Variance vs. Circuit Depth")
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.savefig("depth_analysis.png", dpi=150)
plt.show()

# Practical guideline
print("\nRule of thumb: for n qubits, keep n_layers < n/2")
print("to stay in the trainable regime for hardware-efficient ansatze.")
print(f"For 6 qubits: max recommended layers = {6 // 2}")

The key observation: gradient variance typically saturates (reaches its minimum plateau value) after roughly n/2 layers for an n-qubit hardware-efficient ansatz. Adding more layers beyond this point increases circuit depth and noise susceptibility without improving expressibility in a useful way. This gives a practical guideline: for n qubits, keep the layer count below n/2 when using hardware-efficient ansatze.

Gradient-Free Optimization as a Workaround

When gradients are too small to measure reliably, gradient-free optimizers can still make progress on small problems. These methods use only cost function evaluations, not gradients, so they are immune to the measurement precision issue (though they still suffer from the flat landscape itself for large systems).

SPSA (Simultaneous Perturbation Stochastic Approximation) is particularly well-suited to quantum optimization. It estimates the gradient using only 2 function evaluations per step, regardless of the number of parameters:

g_est = [C(θ + δ·Δ) - C(θ - δ·Δ)] / (2δ) · Δ

where Δ is a random vector with entries drawn from {+1, -1} and δ is a small perturbation magnitude. Compare this to parameter-shift gradients, which require 2 evaluations per parameter.
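To make the estimator concrete, here is a minimal NumPy sketch (not PennyLane's implementation) that applies the formula above to an ordinary Python cost function. For a quadratic cost C(θ) = Σθ_k², averaging many estimates should recover the true gradient 2θ:

```python
import numpy as np

def spsa_grad(cost, theta, delta=0.1, rng=None):
    """One SPSA gradient estimate: two cost evaluations, any dimension."""
    rng = rng or np.random.default_rng()
    Delta = rng.choice([-1.0, 1.0], size=theta.shape)  # Rademacher perturbation
    diff = cost(theta + delta * Delta) - cost(theta - delta * Delta)
    # For +/-1 entries, 1/Delta_k equals Delta_k, matching the formula above
    return diff / (2 * delta) * Delta

# Averaging many estimates for C(theta) = sum(theta^2) recovers grad = 2*theta
theta = np.array([0.5, -0.3, 1.0])
rng = np.random.default_rng(42)
est = np.mean([spsa_grad(lambda t: np.sum(t**2), theta, rng=rng)
               for _ in range(2000)], axis=0)
print("averaged SPSA estimate:", est)  # close to [1.0, -0.6, 2.0]
```

PennyLane's built-in SPSAOptimizer, used below, wraps this idea with the standard decaying step-size and perturbation schedules.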

import pennylane as qml
from pennylane import numpy as np
from pennylane.optimize import SPSAOptimizer

n_qubits = 4
dev = qml.device("default.qubit", wires=n_qubits)

# A cost function for SPSA optimization
@qml.qnode(dev)
def cost_circuit(params):
    for q in range(n_qubits):
        qml.RY(params[q], wires=q)
    for q in range(n_qubits - 1):
        qml.CNOT(wires=[q, q + 1])
    for q in range(n_qubits):
        # RX rather than RZ here: a final RZ layer commutes with the PauliZ
        # measurement, leaving those parameters with identically zero gradient
        qml.RX(params[n_qubits + q], wires=q)
    return qml.expval(qml.PauliZ(0))

# PennyLane's built-in SPSA optimizer
optimizer = SPSAOptimizer(maxiter=200)
params = np.random.uniform(0, 2 * np.pi, 2 * n_qubits)

costs = []
for step in range(200):
    params, cost = optimizer.step_and_cost(cost_circuit, params)
    costs.append(float(cost))
    if step % 50 == 0:
        print(f"Step {step:3d}: cost = {cost:.4f}")

print(f"Final cost: {costs[-1]:.4f}")

SPSA has important limitations. It works well for small to moderate parameter counts (up to a few hundred) but converges slowly for large parameter spaces. It also does not solve the fundamental problem: if the cost landscape is genuinely flat (as in a barren plateau on a large system), SPSA will also fail to find useful descent directions. Gradient-free methods buy you some robustness against measurement noise but do not overcome exponentially vanishing cost function variation.

Other gradient-free options include Nelder-Mead and COBYLA, available by passing the QNode's cost function directly to scipy.optimize.minimize. COBYLA tends to be more sample-efficient than SPSA for low-dimensional problems, while SPSA scales better to higher dimensions.

Strategies Comparison Table

| Strategy | Gradient Scaling Improvement | Depth Constraint | Works for Noise-Induced BPs | Implementation Complexity | Hardware Suitable |
|---|---|---|---|---|---|
| Local cost functions | 4^{-n} improved to ~1/(n·2^n) for early-layer params | None | No | Low | Yes |
| Layerwise training | Maintains O(1) variance at each stage | Grows incrementally | Partially (if layers stay shallow) | Medium | Yes |
| Identity block init | Preserves O(1) variance at initialization | None | No | Medium | Yes |
| Structured ansatze (QAOA) | Polynomial in n, not exponential | Problem-dependent | No | Low (if problem structure known) | Yes |
| Structured ansatze (UCCSD) | Polynomial in orbitals | Chemistry-specific | No | High (fermion-to-qubit mapping) | Limited (deep circuits) |
| QCNN | Provably O(1) for local cost | O(log n) depth | No | Medium | Yes |
| Quantum Natural Gradient | Does not fix BP, but improves convergence in non-BP regime | None | No | High (requires Fisher info matrix) | Limited (overhead) |
| SPSA / gradient-free | N/A (bypasses gradient) | None | Partially (avoids gradient measurement) | Low | Yes |
| Error mitigation | Does not fix BP, but reduces noise floor | None | Partially | High | Yes (designed for it) |

No single strategy works universally. In practice, you often combine several: use a structured ansatz with local cost functions, identity block initialization, and layerwise training. For NISQ hardware, add error mitigation on top.

Common Mistakes

Understanding what not to do is as important as knowing the correct strategies.

Mistake 1: Diagnosing with Only One Gradient Component

Measuring the variance of a single gradient component (e.g., the first parameter) and concluding the circuit is trainable can be misleading. Different parameters can have dramatically different gradient variances depending on their position in the circuit. A parameter in the first layer, close to the input, may have reasonable gradient variance, while a parameter in the last layer may already be in the barren plateau regime.

Always measure gradient variance for all (or a representative sample of) parameters and report the minimum variance. If any parameter has exponentially small gradient variance, that parameter is untrainable.

Mistake 2: Confusing Barren Plateaus with Local Minima

A local minimum is a single point where the gradient is zero but the cost function is not globally optimal. A barren plateau is a region where the gradient is near zero everywhere across the entire parameter space. The distinction matters for strategy selection:

  • Local minima can be escaped with techniques like random restarts, momentum, or simulated annealing.
  • Barren plateaus cannot be escaped by any optimizer because the gradient is small everywhere, not just at the current point.

If you run optimization from 100 random starting points and all of them show near-zero gradients, you are in a barren plateau, not stuck in local minima.

Mistake 3: Excessive Optimization Before Diagnosing

If gradients are exponentially small, running 10,000 optimization steps instead of 100 makes no difference. The optimizer is making random steps of size ~10^{-12} in random directions. Always run the gradient variance diagnostic before committing to long optimization runs. A 30-second diagnostic can save hours of wasted computation.
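The cost asymmetry is easy to quantify. With the parameter-shift rule, each gradient component costs two circuit evaluations; the arithmetic below (illustrative numbers matching the deep ansatz earlier in this tutorial) compares the diagnostic's budget against a long optimization run:

```python
# Circuit-evaluation budget: variance diagnostic vs. blind optimization
n_params = 64          # e.g., the 8-layer, 4-qubit ansatz above
n_diag_samples = 100   # random initializations for the variance estimate

# Diagnostic: one gradient component per sample, 2 evals each (parameter shift)
diag_evals = n_diag_samples * 2

# Optimization: a full parameter-shift gradient at every one of 10,000 steps
opt_evals = 10_000 * (2 * n_params)

print(f"diagnostic:   {diag_evals:,} circuit evaluations")
print(f"optimization: {opt_evals:,} circuit evaluations")
print(f"ratio: {opt_evals // diag_evals}x")
```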

Mistake 4: Random Initialization of New Layers in Layerwise Training

Layerwise training works because each new layer starts near the identity and makes small perturbations. If you initialize each new layer with random parameters drawn from [0, 2π], you immediately put the circuit into the expressive regime for that layer, reintroducing the barren plateau. Always initialize new layers with parameters near zero.

Mistake 5: Believing Adam Optimizer Solves Barren Plateaus

Adam normalizes the gradient by its running root-mean-square:

θ_{t+1} = θ_t - lr * m_t / (sqrt(v_t) + ε)

where m_t is the gradient’s exponential moving average and v_t is the squared gradient’s moving average. When all gradients are near zero (say, ~10^{-12}), Adam computes m_t ≈ 10^{-12} and v_t ≈ 10^{-24}, giving a normalized step direction of magnitude ~10^{-12} / (10^{-12} + 10^{-8}) ≈ 10^{-4}, which is then scaled by the learning rate. This looks like progress, but Adam is amplifying noise, not signal. The updates are random walks that do not converge to anything meaningful.

You can verify this by running Adam on a barren plateau circuit and checking whether the cost function actually decreases over time. It will fluctuate randomly without trending downward.
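The effect is also visible without any quantum circuit. The sketch below implements the Adam update in plain NumPy and feeds it pure noise of barren-plateau magnitude (~10^{-12}); the resulting steps come out many orders of magnitude larger than the gradients that produced them:

```python
import numpy as np

def adam_step(m, v, g, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update; returns new moment estimates and the step."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    m_hat = m / (1 - b1**t)  # bias-corrected moments
    v_hat = v / (1 - b2**t)
    return m, v, lr * m_hat / (np.sqrt(v_hat) + eps)

rng = np.random.default_rng(0)
m = v = 0.0
steps = []
for t in range(1, 101):
    g = 1e-12 * rng.standard_normal()  # pure noise at barren-plateau scale
    m, v, step = adam_step(m, v, g, t)
    steps.append(abs(step))
print(f"gradient scale: 1e-12, largest Adam step: {max(steps):.1e}")
```

The steps are large relative to the gradients, but their direction is random: repeated application is a random walk, not descent.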

Summary

Barren plateaus are the central trainability challenge for variational quantum algorithms. They arise from two independent sources: circuit expressibility (forming approximate 2-designs) and hardware noise (effective depolarization). The gradient variance diagnostic, measured across many random initializations and all parameter components, reliably predicts trainability.

Key quantitative facts to remember:

  • Global cost functions: gradient variance scales as O(4^{-n})
  • Local cost functions: gradient variance scales as O(1/(n * 2^n)) for early-layer parameters
  • Hardware-efficient ansatze: barren plateau onset at roughly n/2 layers for n qubits
  • Noise-induced BPs: severity scales with total number of noisy gates, not just qubit count

Mitigation strategies form a toolkit, not a single solution:

  1. Use local cost functions to improve the base scaling.
  2. Use structured ansatze (QAOA, UCCSD, QCNN) to avoid forming 2-designs.
  3. Initialize near identity (identity blocks or zero initialization).
  4. Train layerwise to keep the effective circuit shallow at each optimization stage.
  5. Diagnose before optimizing to avoid wasting compute.
  6. On NISQ hardware, account for noise-induced BPs by keeping total gate count low.

The fundamental tension between expressibility and trainability means there is no free lunch. The best variational quantum algorithms carefully balance circuit expressiveness against trainability, using problem structure to guide ansatz design rather than relying on generic parameterized circuits.
