PennyLane · Advanced · Free · 4/26 in series · 70 minutes

Noise-Aware Training for Variational Quantum Circuits

Train variational quantum circuits directly on realistic noise models using PennyLane. Compare circuits trained with and without noise, insert depolarizing and amplitude damping channels, and apply noise injection techniques to improve real hardware performance.

What you'll learn

  • noise-aware training
  • device noise
  • PennyLane
  • variational circuits
  • hardware deployment

Prerequisites

  • Strong Python skills
  • Solid quantum computing foundations
  • Linear algebra and complex numbers

One of the most persistent gaps in near-term quantum machine learning is the discrepancy between circuits trained in noiseless simulation and their performance on real hardware. A circuit optimized to a deep minimum of the noiseless loss landscape may sit on a flat plateau when noise is introduced, because noise effectively smooths the cost function. Noise-aware training closes this gap by incorporating a realistic device noise model directly into the training loop, so the optimizer learns parameters that are robust to the noise it will encounter at inference time.

The Problem with Noiseless Training

Consider a parameterized circuit $U(\theta)$ trained to minimize $\mathcal{L}(\theta) = \langle 0 | U(\theta)^\dagger H U(\theta) | 0 \rangle$. On real hardware, the expectation value becomes:

$$\tilde{\mathcal{L}}(\theta) = \text{Tr}\left[\mathcal{E}_\theta(\rho_0)\, H\right]$$

where $\mathcal{E}_\theta$ is the noisy quantum channel implementing $U(\theta)$. The gradient $\nabla_\theta \tilde{\mathcal{L}}$ differs from $\nabla_\theta \mathcal{L}$ in both magnitude and direction, especially after many gates. Parameters that minimize the noiseless cost often perform poorly under the true noisy channel.

Setup: Noisy Device in PennyLane

PennyLane’s default.mixed device propagates density matrices through the circuit, supporting all standard noise channels.

import pennylane as qml
from pennylane import numpy as np
import matplotlib.pyplot as plt

n_qubits = 4
n_layers = 3

# Clean (statevector) device
dev_clean = qml.device("default.qubit", wires=n_qubits)

# Noisy (density matrix) device -- same structure, mixed states
dev_noisy = qml.device("default.mixed", wires=n_qubits)

# Depolarizing error rates representative of superconducting hardware
SINGLE_QUBIT_DEPOL = 0.002   # ~0.2% single-qubit error
TWO_QUBIT_DEPOL = 0.01       # ~1% two-qubit error
T1_AMPLITUDE_DAMPING = 0.005 # amplitude damping per gate (T1 decay)

Defining the Noisy Ansatz

We use a hardware-efficient ansatz: layers of $R_Y$ and $R_Z$ rotations followed by CNOT entangling gates. After each gate, we insert the appropriate noise channel.

def noisy_layer(params_layer, depol_1q, depol_2q, amp_damp):
    """
    One layer of the ansatz with noise channels inserted after each gate.
    params_layer: shape (n_qubits, 2) -- [RY angle, RZ angle] per qubit
    """
    # Single-qubit rotation block
    for i in range(n_qubits):
        qml.RY(params_layer[i, 0], wires=i)
        qml.DepolarizingChannel(depol_1q, wires=i)
        qml.AmplitudeDamping(amp_damp, wires=i)

        qml.RZ(params_layer[i, 1], wires=i)
        qml.DepolarizingChannel(depol_1q, wires=i)
        qml.AmplitudeDamping(amp_damp, wires=i)

    # Entangling block: linear chain of CNOTs
    for i in range(n_qubits - 1):
        qml.CNOT(wires=[i, i + 1])
        # Two-qubit depolarizing (approximated as single-qubit on each)
        qml.DepolarizingChannel(depol_2q / 2, wires=i)
        qml.DepolarizingChannel(depol_2q / 2, wires=i + 1)

def clean_layer(params_layer):
    """Same ansatz without noise channels."""
    for i in range(n_qubits):
        qml.RY(params_layer[i, 0], wires=i)
        qml.RZ(params_layer[i, 1], wires=i)
    for i in range(n_qubits - 1):
        qml.CNOT(wires=[i, i + 1])

Building Clean and Noisy QNodes

We define both a clean and a noisy QNode with the same parameter structure, but evaluated on different devices.

# Hamiltonian: simple Z-Z correlator task
H = qml.Hamiltonian(
    [1.0, 0.5, 0.5, 0.3],
    [
        qml.PauliZ(0) @ qml.PauliZ(1),
        qml.PauliZ(1) @ qml.PauliZ(2),
        qml.PauliZ(2) @ qml.PauliZ(3),
        qml.PauliX(0) @ qml.PauliX(2),
    ]
)

@qml.qnode(dev_clean, diff_method="backprop")
def circuit_clean(params):
    for layer in range(n_layers):
        clean_layer(params[layer])
    return qml.expval(H)

@qml.qnode(dev_noisy, diff_method="parameter-shift")
def circuit_noisy(params,
                  depol_1q=SINGLE_QUBIT_DEPOL,
                  depol_2q=TWO_QUBIT_DEPOL,
                  amp_damp=T1_AMPLITUDE_DAMPING):
    for layer in range(n_layers):
        noisy_layer(params[layer], depol_1q, depol_2q, amp_damp)
    return qml.expval(H)

# Initialize parameters
np.random.seed(42)
params_shape = (n_layers, n_qubits, 2)
init_params = np.random.uniform(-np.pi, np.pi, params_shape, requires_grad=True)

print(f"Clean circuit energy: {circuit_clean(init_params):.4f}")
print(f"Noisy circuit energy: {circuit_noisy(init_params):.4f}")

Training Without Noise

def train_circuit(cost_fn, params, n_steps=80, lr=0.05, label=""):
    """Generic training loop using gradient descent."""
    params = params.copy()
    opt = qml.GradientDescentOptimizer(stepsize=lr)
    history = []

    for step in range(n_steps):
        params, cost = opt.step_and_cost(cost_fn, params)
        history.append(float(cost))
        if step % 20 == 0:
            print(f"  [{label}] Step {step:3d}: cost = {cost:.6f}")

    return params, history

print("Training on clean (noiseless) circuit...")
params_clean_trained, hist_clean = train_circuit(
    circuit_clean, init_params, n_steps=20, label="clean"
)

Training With Noise (Noise-Aware)

def noisy_cost(params):
    return circuit_noisy(params)

print("\nTraining on noisy circuit (noise-aware)...")
params_noise_trained, hist_noise = train_circuit(
    noisy_cost, init_params, n_steps=20, label="noisy"
)

Evaluating on the Noisy Device

After training, we evaluate both parameter sets on the noisy device. The key question is which parameters achieve lower energy (better performance) when noise is actually present.

# Evaluate both parameter sets on the noisy device
e_clean_on_noisy = float(circuit_noisy(params_clean_trained))
e_noise_on_noisy = float(circuit_noisy(params_noise_trained))

print("\n--- Final Evaluation on Noisy Device ---")
print(f"Parameters trained clean, evaluated noisy: {e_clean_on_noisy:.6f}")
print(f"Parameters trained noisy, evaluated noisy: {e_noise_on_noisy:.6f}")
print(f"Noise-aware advantage: {e_clean_on_noisy - e_noise_on_noisy:.6f}")

# Also evaluate on clean device for reference
e_clean_on_clean = float(circuit_clean(params_clean_trained))
e_noise_on_clean = float(circuit_clean(params_noise_trained))
print(f"\nParameters trained clean, evaluated clean: {e_clean_on_clean:.6f}")
print(f"Parameters trained noisy, evaluated clean: {e_noise_on_clean:.6f}")

Noise Injection as Regularization

Noise-aware training can be viewed as a form of regularization. Adding noise during training discourages the optimizer from finding parameters that exploit the sharp features of the noiseless landscape, features that disappear under noise. This is analogous to dropout in classical neural networks.

A controlled noise injection schedule works well: start with a slightly higher noise level than the target device to force the optimizer to find robust solutions, then anneal the noise down toward the actual device level.

def train_with_noise_schedule(params, n_steps=100, lr=0.05):
    """
    Noise-aware training with a noise decay schedule.
    Start with 2x the target noise, end at target noise level.
    """
    params = params.copy()
    opt = qml.GradientDescentOptimizer(stepsize=lr)
    history = []

    for step in range(n_steps):
        # Linearly decay noise from 2x down to the 1x target level
        scale = 2.0 - step / max(n_steps - 1, 1)
        depol_1q_scaled = SINGLE_QUBIT_DEPOL * scale
        depol_2q_scaled = TWO_QUBIT_DEPOL * scale
        amp_damp_scaled = T1_AMPLITUDE_DAMPING * scale

        def cost_scaled(p):
            return circuit_noisy(p, depol_1q_scaled, depol_2q_scaled, amp_damp_scaled)

        params, cost = opt.step_and_cost(cost_scaled, params)
        history.append(float(cost))

        if step % 10 == 0:
            print(f"  [scheduled] Step {step:3d}: cost = {cost:.6f}, noise_scale = {scale:.2f}")

    return params, history

print("\nTraining with noise decay schedule...")
params_scheduled, hist_scheduled = train_with_noise_schedule(init_params, n_steps=20)

# Final comparison
e_scheduled_on_noisy = float(circuit_noisy(params_scheduled))
print(f"\nScheduled noise-aware training, evaluated noisy: {e_scheduled_on_noisy:.6f}")

Plotting the Training Curves

plt.figure(figsize=(9, 4))
plt.plot(hist_clean, label='Trained clean')
plt.plot(hist_noise, label='Trained noisy')
plt.plot(hist_scheduled, label='Trained with noise schedule', linestyle='--')
plt.axhline(e_clean_on_noisy, color='blue', linestyle=':', alpha=0.5, label='Clean params (noisy eval)')
plt.axhline(e_noise_on_noisy, color='orange', linestyle=':', alpha=0.5, label='Noisy params (noisy eval)')
plt.xlabel('Training Step')
plt.ylabel('Cost (training device)')
plt.title('Noise-Aware vs Standard Training')
plt.legend(fontsize=8)
plt.tight_layout()
plt.savefig('noise_aware_training.png', dpi=150)

Key Takeaways and Extensions

Noise-aware training consistently outperforms noiseless training when evaluated on real or simulated noisy devices, with the margin growing with circuit depth and noise level. For shallow circuits (1-2 layers) the effect is small; for 5+ layers on realistic hardware, it can be decisive.

The gradient landscape changes under noise. The parameter-shift rule remains valid for computing gradients through noisy channels, but the gradients themselves are smaller (noise damps energy differences). This can slow convergence, requiring lower learning rates and more steps.

Circuit cutting (sometimes called circuit knitting) is a complementary technique: instead of training a deep noisy circuit, you decompose it into smaller fragments, execute each fragment separately, and recombine the results classically. This reduces per-fragment depth and noise at the cost of additional circuit evaluations. PennyLane’s qml.cut_circuit transform supports this workflow.

Noise model accuracy matters. Training on an inaccurate noise model can be worse than noiseless training. For real hardware deployment, use characterization data (e.g., from randomized benchmarking) to build the noise model rather than relying on default depolarizing assumptions. The closer the training noise matches the device, the better the transfer.
