High-Performance Simulation with PennyLane Lightning and GPU
Explore PennyLane's Lightning family of high-performance simulation backends, learn how to use adjoint differentiation for efficient gradient computation, and understand when to deploy lightning.qubit, lightning.gpu, and lightning.kokkos.
PennyLane’s default simulator (default.qubit) is a pure-Python statevector simulator ideal for learning and small circuits. Once your circuits grow beyond 15-20 qubits, or once you need to compute gradients over thousands of training steps, you need something faster. The Lightning family of backends provides C++ and GPU-accelerated simulation with the same PennyLane interface.
The Three Lightning Backends
lightning.qubit is a C++ statevector simulator that replaces the Python inner loop with optimized native code. It supports all standard PennyLane gates, runs on any CPU, and requires no specialized hardware. It is typically 5-15x faster than default.qubit for circuits with 20-30 qubits.
lightning.gpu accelerates statevector simulation using NVIDIA CUDA via the cuQuantum library. It keeps the statevector in GPU memory and runs gate operations as batched matrix-vector multiplications on the GPU. For circuits with 25+ qubits, it can outperform lightning.qubit by 10-100x depending on the circuit and GPU model.
lightning.kokkos is a performance-portable backend built on the Kokkos framework. It supports multi-CPU parallelism (OpenMP), AMD GPUs (ROCm/HIP), and multi-GPU configurations. If you do not have NVIDIA hardware or need to run across multiple GPUs or cluster nodes, lightning.kokkos is the right choice.
Installation
# CPU-only Lightning
pip install pennylane-lightning
# GPU Lightning (requires CUDA 11+ and cuQuantum)
pip install pennylane-lightning[gpu]
# Kokkos backend (with OpenMP for multi-core CPU)
pip install pennylane-lightning[kokkos]
Selecting a Device
Switching backends in PennyLane requires only changing the device name:
import pennylane as qml
import numpy as np
n_qubits = 28
# CPU C++ backend
dev_cpu = qml.device("lightning.qubit", wires=n_qubits)
# NVIDIA GPU backend
dev_gpu = qml.device("lightning.gpu", wires=n_qubits)
# Kokkos backend (multi-core CPU or AMD GPU)
dev_kokkos = qml.device("lightning.kokkos", wires=n_qubits)
@qml.qnode(dev_gpu)
def circuit(params):
for i in range(n_qubits):
qml.RY(params[i], wires=i)
for i in range(n_qubits - 1):
qml.CNOT(wires=[i, i + 1])
return qml.expval(qml.PauliZ(0))
No other changes to your circuit definitions or training loop are needed.
Memory and Scaling Analysis
Before choosing a backend, you need to understand the memory requirements of statevector simulation. A statevector for n qubits has complex amplitudes. In single precision (complex64), each amplitude requires 8 bytes: 4 bytes for the real part plus 4 bytes for the imaginary part. The total memory for the statevector is:
This scaling is exponential, and it places hard limits on which qubit counts are practical for each hardware configuration.
| Qubits (n) | Amplitudes () | Statevector Size (complex64) | Statevector Size (complex128) |
|---|---|---|---|
| 20 | 1,048,576 | 8 MB | 16 MB |
| 22 | 4,194,304 | 32 MB | 64 MB |
| 25 | 33,554,432 | 256 MB | 512 MB |
| 28 | 268,435,456 | 2 GB | 4 GB |
| 30 | 1,073,741,824 | 8 GB | 16 GB |
| 32 | 4,294,967,296 | 32 GB | 64 GB |
| 33 | 8,589,934,592 | 64 GB | 128 GB |
| 34 | 17,179,869,184 | 128 GB | 256 GB |
An NVIDIA A100 GPU has 80 GB of HBM2e memory. Since lightning.gpu also needs memory for gate matrices, temporary buffers, and the adjoint differentiation state, the practical upper limit is approximately 33 qubits in single precision on a single A100. For double precision (complex128), the limit drops to about 32 qubits.
A consumer GPU such as an RTX 4090 with 24 GB of VRAM can handle up to about 30 qubits in single precision or 29 qubits in double precision.
You can verify the memory requirement programmatically before allocating:
import sys
def statevector_memory_gb(n_qubits, precision="complex64"):
"""Calculate statevector memory in gigabytes."""
bytes_per_amplitude = 8 if precision == "complex64" else 16
total_bytes = (2 ** n_qubits) * bytes_per_amplitude
return total_bytes / (1024 ** 3)
# Check before creating a device
n_qubits = 30
mem_gb = statevector_memory_gb(n_qubits, "complex64")
print(f"{n_qubits} qubits require {mem_gb:.1f} GB for the statevector (complex64)")
# For adjoint differentiation, you need ~2x this amount
# (one statevector for the forward state, one for the adjoint vector)
adjoint_mem_gb = 2 * mem_gb
print(f"Adjoint differentiation needs ~{adjoint_mem_gb:.1f} GB total")
For variational circuits on classical hardware, these memory constraints define the boundary of what is simulable. A circuit with 50 qubits would require petabytes of memory for the statevector alone. This is why quantum advantage is expected for problems at 50+ qubits: classical simulation becomes physically impossible with current technology.
Adjoint Differentiation: The Key Performance Advantage
Parameter-shift differentiation computes the gradient of an -parameter circuit by running the circuit times (two shifted evaluations per parameter). For a circuit with 1000 parameters, that means 2000 circuit executions per gradient step.
Adjoint differentiation computes the full gradient in a single forward and backward pass through the circuit, with memory cost proportional to circuit depth rather than number of parameters. For large ansatze, this is a dramatic improvement.
How the Adjoint Method Works
The adjoint differentiation algorithm operates in two phases.
Forward pass: Execute the circuit gate by gate, applying each unitary to the statevector. At the end, you have the final state . Store checkpoints of the statevector at regular intervals during this process.
Backward pass: Initialize the adjoint vector as , where is the observable being measured. Then step backward through the circuit in reverse gate order. At each gate :
- Apply (the adjoint of gate ) to to propagate it backward.
- If gate depends on parameter , compute the partial derivative of the unitary: .
- Compute the gradient contribution: , where is the state just before gate was applied.
- Apply to the forward state to recover the state at the previous step (or retrieve it from a checkpoint).
The crucial insight is that this process requires exactly two statevectors in memory at any time: the current forward state and the adjoint vector . The memory cost is regardless of the number of parameters .
Memory and Time Comparison
| Method | Circuit evaluations | Memory (statevectors) | Best for |
|---|---|---|---|
| Parameter-shift | (two per parameter) | 1 statevector per evaluation | Hardware devices, small |
| Adjoint | 1 forward + 1 backward | 2 statevectors + checkpoints | Simulators, large |
| Finite difference | 1 statevector per evaluation | Quick estimates, not precise |
For a variational circuit with 200 parameters, parameter-shift requires 400 circuit evaluations while adjoint requires exactly one forward and one backward pass. The speedup scales linearly with parameter count.
dev = qml.device("lightning.gpu", wires=20)
@qml.qnode(dev, diff_method="adjoint")
def variational_circuit(params):
qml.StronglyEntanglingLayers(params, wires=range(20))
return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))
# Initialize random parameters
shape = qml.StronglyEntanglingLayers.shape(n_layers=3, n_wires=20)
params = np.random.uniform(-np.pi, np.pi, shape, requires_grad=True)
# Compute gradient: adjoint diff does this in ~2x circuit cost regardless of parameter count
grad_fn = qml.grad(variational_circuit)
grads = grad_fn(params)
print(f"Gradient shape: {grads.shape}")
print(f"Number of parameters: {np.prod(shape)}")
# With 3 layers and 20 wires, this is 180 parameters.
# Parameter-shift would need 360 circuit runs. Adjoint needs 1 forward + 1 backward.
When Adjoint Differentiation Is NOT Available
Adjoint differentiation has strict requirements. The following circuit patterns will raise errors or produce incorrect results when used with diff_method="adjoint".
1. Circuits with mid-circuit measurements. When a measurement occurs mid-circuit, the statevector branches into multiple outcomes. The adjoint method assumes a single deterministic path through the circuit, so mid-circuit measurement breaks this assumption.
dev = qml.device("lightning.gpu", wires=3)
@qml.qnode(dev, diff_method="adjoint")
def circuit_with_mcm(x):
qml.RX(x, wires=0)
qml.measure(0) # Mid-circuit measurement
qml.RY(x, wires=1)
return qml.expval(qml.PauliZ(1))
# This raises: "Adjoint differentiation does not support mid-circuit measurements."
# Fix: use diff_method="parameter-shift" or diff_method="backprop" with default.qubit.
2. Circuits returning probabilities or samples. The adjoint method computes gradients of scalar expectation values by backpropagating through the circuit. Probability distributions (qml.probs) and sample outputs (qml.sample) are not scalar, so the adjoint gradient is undefined for these return types.
dev = qml.device("lightning.gpu", wires=4)
@qml.qnode(dev, diff_method="adjoint")
def circuit_with_probs(x):
qml.RX(x, wires=0)
qml.CNOT(wires=[0, 1])
return qml.probs(wires=[0, 1]) # Returns a 4-element probability vector
# This raises: "Adjoint differentiation method does not support probs."
# Fix: use qml.expval with the specific observable you need,
# or switch to diff_method="parameter-shift".
3. Custom unitaries without a defined adjoint. If you apply qml.QubitUnitary with an arbitrary matrix, the Lightning backend needs to know how to compute the adjoint (conjugate transpose) and the parameter derivative. Without explicit registration, this fails.
dev = qml.device("lightning.gpu", wires=2)
@qml.qnode(dev, diff_method="adjoint")
def circuit_with_custom_unitary(x):
# A parameterized custom unitary
U = np.array([[np.cos(x), -np.sin(x)],
[np.sin(x), np.cos(x)]])
qml.QubitUnitary(U, wires=0)
return qml.expval(qml.PauliZ(0))
# QubitUnitary is non-differentiable with adjoint because PennyLane
# cannot derive dU/dx from the matrix alone.
# Fix: use the equivalent native gate. Here, qml.RY(2*x, wires=0) is equivalent.
4. Noisy circuits with density matrix simulation. The adjoint method operates on pure statevectors. Noise channels (depolarizing, amplitude damping, etc.) require density matrix simulation via default.mixed, which uses a density matrix instead of a statevector. Lightning backends do not support density matrix simulation.
# This device uses density matrices, not statevectors
dev_noisy = qml.device("default.mixed", wires=3)
@qml.qnode(dev_noisy, diff_method="adjoint")
def noisy_circuit(x):
qml.RX(x, wires=0)
qml.DepolarizingChannel(0.1, wires=0) # Noise channel
return qml.expval(qml.PauliZ(0))
# This raises an error: adjoint is not supported on default.mixed.
# Fix: use diff_method="backprop" on default.mixed, or use
# diff_method="parameter-shift" which works with any backend.
Benchmarking Backends
Here is a simple benchmark comparing forward pass and gradient computation speed:
import time
import pennylane as qml
import numpy as np
def benchmark(device_name, n_qubits=22, n_layers=3, n_trials=5):
dev = qml.device(device_name, wires=n_qubits)
@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
params = np.random.uniform(-np.pi, np.pi, shape, requires_grad=True)
grad_fn = qml.grad(circuit)
# Warm-up
_ = grad_fn(params)
times = []
for _ in range(n_trials):
t0 = time.perf_counter()
_ = grad_fn(params)
times.append(time.perf_counter() - t0)
mean_t = np.mean(times)
print(f"{device_name}: {mean_t:.3f}s per gradient (n_qubits={n_qubits}, layers={n_layers})")
# Representative results on a workstation with RTX 3090:
# lightning.qubit: ~4.2s per gradient at 22 qubits, 3 layers
# lightning.gpu: ~0.31s per gradient at 22 qubits, 3 layers (~13x speedup)
# lightning.kokkos: ~1.8s per gradient at 22 qubits, 3 layers (~2.3x speedup vs CPU)
The GPU advantage grows with qubit count. At 26 qubits (67M complex amplitudes, ~1 GB statevector), lightning.gpu is typically 50-100x faster than the pure-Python baseline.
Finding the CPU-GPU Crossover Point
At low qubit counts, lightning.gpu can actually be slower than lightning.qubit. The GPU incurs overhead from host-to-device data transfer over the PCIe bus (typically ~16 GB/s for PCIe 4.0 x16). For a 20-qubit statevector of 8 MB, the transfer time is roughly 0.5 ms, which can equal or exceed the GPU’s actual compute time for such a small state. Meanwhile, lightning.qubit processes the same 8 MB state entirely in CPU cache with no transfer overhead.
As qubit count increases, the GPU’s massively parallel architecture takes over. Gate application on a statevector is essentially a sparse matrix-vector product across amplitudes, and GPUs excel at this kind of data-parallel work.
The crossover point depends on circuit depth:
- Shallow circuits (1-2 layers): GPU becomes faster around 22-24 qubits
- Deep circuits (5+ layers): GPU becomes faster around 18-20 qubits, because the ratio of compute to transfer overhead increases with depth
Here is a benchmark that sweeps qubit counts to find the crossover empirically:
import time
import pennylane as qml
import numpy as np
def find_crossover(n_layers=3, n_trials=3):
"""Find the qubit count where lightning.gpu overtakes lightning.qubit."""
results = {"lightning.qubit": {}, "lightning.gpu": {}}
for n_qubits in range(18, 29):
for backend in ["lightning.qubit", "lightning.gpu"]:
try:
dev = qml.device(backend, wires=n_qubits)
except Exception as e:
print(f"Skipping {backend} at {n_qubits} qubits: {e}")
continue
@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
shape = qml.StronglyEntanglingLayers.shape(
n_layers=n_layers, n_wires=n_qubits
)
params = np.random.uniform(-np.pi, np.pi, shape, requires_grad=True)
grad_fn = qml.grad(circuit)
# Warm-up
_ = grad_fn(params)
times = []
for _ in range(n_trials):
t0 = time.perf_counter()
_ = grad_fn(params)
times.append(time.perf_counter() - t0)
mean_t = np.mean(times)
results[backend][n_qubits] = mean_t
print(f"{backend} | {n_qubits} qubits | {mean_t:.4f}s")
# Find crossover
print("\n--- Crossover Analysis ---")
for nq in sorted(results["lightning.qubit"].keys()):
if nq in results["lightning.gpu"]:
cpu_t = results["lightning.qubit"][nq]
gpu_t = results["lightning.gpu"][nq]
faster = "GPU" if gpu_t < cpu_t else "CPU"
ratio = cpu_t / gpu_t if gpu_t < cpu_t else gpu_t / cpu_t
print(f"{nq} qubits: CPU={cpu_t:.4f}s GPU={gpu_t:.4f}s -> {faster} wins ({ratio:.1f}x)")
find_crossover(n_layers=3)
Typical output on an RTX 3090 with a Ryzen 5950X:
18 qubits: CPU=0.0021s GPU=0.0089s -> CPU wins (4.2x)
20 qubits: CPU=0.0085s GPU=0.0094s -> CPU wins (1.1x)
22 qubits: CPU=0.0340s GPU=0.0103s -> GPU wins (3.3x)
24 qubits: CPU=0.1380s GPU=0.0195s -> GPU wins (7.1x)
26 qubits: CPU=0.5520s GPU=0.0580s -> GPU wins (9.5x)
28 qubits: CPU=2.2100s GPU=0.1950s -> GPU wins (11.3x)
The crossover here occurs between 20 and 22 qubits. Below that threshold, the GPU transfer overhead dominates. Above it, the GPU’s parallel throughput delivers steadily increasing speedups.
Configuring lightning.kokkos for Multi-Core CPU
If you have a multi-core workstation but no NVIDIA GPU, lightning.kokkos with OpenMP provides significant speedup over single-threaded lightning.qubit. Kokkos uses shared-memory parallelism, distributing gate operations across CPU cores.
Configure the thread count via the OMP_NUM_THREADS environment variable before launching Python:
# Set the number of OpenMP threads (match your physical core count)
export OMP_NUM_THREADS=8
# Verify the setting
echo "OpenMP threads: $OMP_NUM_THREADS"
# Run your script
python my_vqe_script.py
In your Python script, use the Kokkos device normally:
import os
import time
import pennylane as qml
import numpy as np
# Verify OpenMP configuration
print(f"OMP_NUM_THREADS = {os.environ.get('OMP_NUM_THREADS', 'not set')}")
n_qubits = 26
dev = qml.device("lightning.kokkos", wires=n_qubits)
@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
shape = qml.StronglyEntanglingLayers.shape(n_layers=3, n_wires=n_qubits)
params = np.random.uniform(-np.pi, np.pi, shape, requires_grad=True)
grad_fn = qml.grad(circuit)
# Warm-up
_ = grad_fn(params)
# Time the gradient computation
t0 = time.perf_counter()
grads = grad_fn(params)
elapsed = time.perf_counter() - t0
print(f"Gradient computed in {elapsed:.3f}s with {os.environ.get('OMP_NUM_THREADS', '?')} threads")
To verify that multiple threads are active, you can compare timings with different thread counts:
# Single-threaded baseline
OMP_NUM_THREADS=1 python my_vqe_script.py
# Gradient computed in 4.210s with 1 threads
# 8 threads
OMP_NUM_THREADS=8 python my_vqe_script.py
# Gradient computed in 0.580s with 8 threads (7.3x speedup)
# 16 threads
OMP_NUM_THREADS=16 python my_vqe_script.py
# Gradient computed in 0.440s with 16 threads (9.6x speedup)
On a 16-core workstation, expect roughly 8-10x speedup for circuits with 24+ qubits. The scaling is sub-linear because statevector simulation involves memory-bound operations where cache contention limits the benefit of additional cores. Setting OMP_NUM_THREADS higher than your physical core count (including hyperthreads) typically hurts performance due to context switching overhead.
Gradient Checkpointing for Large Circuits
For very deep circuits (hundreds or thousands of gates), the adjoint backward pass needs access to the intermediate forward states at each gate step. Storing all of these is impractical: a 28-qubit circuit with 1000 gates would need of state snapshots.
The Lightning backends handle this with a checkpoint-and-recompute strategy. Instead of storing every intermediate state, the simulator stores the statevector at regularly spaced checkpoints (every gates). During the backward pass, when a state between two checkpoints is needed, the simulator recomputes forward from the nearest earlier checkpoint.
This creates a tunable time-memory trade-off:
| Checkpoint interval | Memory (for gates, qubits) | Recomputation overhead |
|---|---|---|
| 1 (store every state) | bytes | None |
| bytes | Up to extra gate applications per step | |
| (store only initial state) | bytes | Full forward pass recomputed |
Lightning’s default behavior in the adjoint pass is to recompute from the beginning, which uses minimal memory (just 2 statevectors) at the cost of additional computation. For most variational circuits where the number of layers is moderate (under 20-30 layers), this default works well.
If you need to inspect or control intermediate states manually, PennyLane provides qml.Snapshot:
import pennylane as qml
import numpy as np
dev = qml.device("lightning.qubit", wires=4)
@qml.qnode(dev)
def circuit_with_snapshots(params):
qml.RX(params[0], wires=0)
qml.RY(params[1], wires=1)
qml.Snapshot("after_single_qubit_gates")
qml.CNOT(wires=[0, 1])
qml.CNOT(wires=[1, 2])
qml.Snapshot("after_entangling_gates")
qml.RZ(params[2], wires=2)
return qml.expval(qml.PauliZ(0))
params = np.array([0.5, 0.3, 0.7])
result = qml.snapshots(circuit_with_snapshots)(params)
for tag, state in result.items():
if tag != "execution_results":
print(f"Snapshot '{tag}': statevector norm = {np.linalg.norm(state):.6f}")
For production workloads on lightning.gpu, the adjoint differentiation engine handles checkpointing internally. The key point is that memory usage during gradient computation is bounded by a small multiple of the statevector size, not by the circuit depth times the statevector size.
Batched Execution for Hyperparameter Sweeps
When running experiments that vary circuit structure or parameters (for example, sweeping ansatz depth in a VQE study), you can evaluate multiple circuit configurations efficiently by structuring the work as independent QNode calls. Each call reuses the same device instance, which avoids redundant GPU memory allocation.
import time
import pennylane as qml
import numpy as np
n_qubits = 20
dev = qml.device("lightning.gpu", wires=n_qubits)
def create_vqe_qnode(n_layers):
"""Create a QNode with a specific number of variational layers."""
@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))
return circuit
# Sweep over ansatz depths: 1, 2, 3, 4, 5 layers
layer_counts = [1, 2, 3, 4, 5]
results = {}
t_start = time.perf_counter()
for n_layers in layer_counts:
circuit = create_vqe_qnode(n_layers)
shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
params = np.random.uniform(-np.pi, np.pi, shape, requires_grad=True)
# Compute energy and gradient
energy = circuit(params)
grad_fn = qml.grad(circuit)
grads = grad_fn(params)
results[n_layers] = {
"energy": float(energy),
"grad_norm": float(np.linalg.norm(grads)),
"n_params": int(np.prod(shape)),
}
print(f"Layers={n_layers}: energy={energy:.4f}, "
f"|grad|={np.linalg.norm(grads):.4f}, params={np.prod(shape)}")
total_time = time.perf_counter() - t_start
print(f"\nTotal sweep time: {total_time:.2f}s for {len(layer_counts)} configurations")
For embarrassingly parallel workloads (where each configuration is fully independent), you can use Python’s concurrent.futures to overlap CPU-side preprocessing with GPU execution:
from concurrent.futures import ProcessPoolExecutor
import pennylane as qml
import numpy as np
import time
def evaluate_config(n_layers):
"""Evaluate a single VQE configuration. Runs in a separate process."""
n_qubits = 20
dev = qml.device("lightning.gpu", wires=n_qubits)
@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))
shape = qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)
params = np.random.uniform(-np.pi, np.pi, shape, requires_grad=True)
energy = circuit(params)
grad_fn = qml.grad(circuit)
grads = grad_fn(params)
return {"layers": n_layers, "energy": float(energy), "grad_norm": float(np.linalg.norm(grads))}
# Note: when using ProcessPoolExecutor, each process creates its own device instance.
# Be mindful of GPU memory: each process allocates a separate statevector on the GPU.
# For 20 qubits, each process uses ~8 MB, so this is fine.
# For 28 qubits, each process uses ~2 GB, so limit parallelism accordingly.
t0 = time.perf_counter()
with ProcessPoolExecutor(max_workers=2) as executor:
futures = [executor.submit(evaluate_config, nl) for nl in [1, 2, 3, 4, 5]]
for f in futures:
print(f.result())
print(f"Parallel time: {time.perf_counter() - t0:.2f}s")
Integration with JAX
JAX provides just-in-time (JIT) compilation and automatic differentiation, and lightning.gpu supports JAX as its interface. JIT compilation caches the compiled circuit computation graph, so repeated calls with the same circuit structure but different parameter values skip the graph-construction overhead entirely.
import pennylane as qml
import jax
import jax.numpy as jnp
# Enable 64-bit precision in JAX (important for gradient accuracy)
jax.config.update("jax_enable_x64", True)
n_qubits = 20
dev = qml.device("lightning.gpu", wires=n_qubits)
@qml.qnode(dev, diff_method="adjoint", interface="jax")
def circuit(params):
"""Variational circuit with JAX interface."""
for i in range(n_qubits):
qml.RX(params[i], wires=i)
for i in range(n_qubits - 1):
qml.CNOT(wires=[i, i + 1])
for i in range(n_qubits):
qml.RY(params[n_qubits + i], wires=i)
return qml.expval(qml.PauliZ(0))
# Create JAX-compatible parameters
key = jax.random.PRNGKey(42)
params = jax.random.uniform(key, shape=(2 * n_qubits,), minval=-jnp.pi, maxval=jnp.pi)
# JIT-compile the circuit and gradient computation
jit_circuit = jax.jit(circuit)
jit_grad = jax.jit(jax.grad(circuit))
# First call includes compilation overhead
energy = jit_circuit(params)
print(f"Energy: {energy:.6f}")
# Subsequent calls are faster due to cached compilation
grads = jit_grad(params)
print(f"Gradient norm: {jnp.linalg.norm(grads):.6f}")
You can combine JAX JIT with a full optimization loop using jax.value_and_grad to compute both the function value and gradient in a single call:
import pennylane as qml
import jax
import jax.numpy as jnp
import time
jax.config.update("jax_enable_x64", True)
n_qubits = 16
dev = qml.device("lightning.gpu", wires=n_qubits)
@qml.qnode(dev, diff_method="adjoint", interface="jax")
def cost_fn(params):
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))
shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_qubits)
params = jax.random.uniform(jax.random.PRNGKey(0), shape=shape, minval=-jnp.pi, maxval=jnp.pi)
# JIT-compiled value-and-gradient function
value_and_grad_fn = jax.jit(jax.value_and_grad(cost_fn))
# Simple gradient descent loop
learning_rate = 0.1
t0 = time.perf_counter()
for step in range(50):
val, grads = value_and_grad_fn(params)
params = params - learning_rate * grads
if step % 10 == 0:
print(f"Step {step}: cost = {val:.6f}")
elapsed = time.perf_counter() - t0
print(f"\n50 optimization steps in {elapsed:.2f}s ({elapsed/50*1000:.1f} ms/step)")
Using Lightning with Hybrid Training Loops
The GPU backend integrates transparently with PyTorch for hybrid quantum-classical training:
import torch
import pennylane as qml
from pennylane import numpy as np
dev = qml.device("lightning.gpu", wires=10)
@qml.qnode(dev, diff_method="adjoint", interface="torch")
def quantum_layer(inputs, weights):
qml.AngleEmbedding(inputs, wires=range(10))
qml.StronglyEntanglingLayers(weights, wires=range(10))
return [qml.expval(qml.PauliZ(i)) for i in range(10)]
class HybridModel(torch.nn.Module):
def __init__(self):
super().__init__()
shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=10)
self.q_weights = torch.nn.Parameter(
torch.tensor(np.random.uniform(-np.pi, np.pi, shape), dtype=torch.float64)
)
self.fc = torch.nn.Linear(10, 1)
def forward(self, x):
q_out = torch.stack([quantum_layer(xi, self.q_weights) for xi in x])
return self.fc(q_out.float())
model = HybridModel()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
# Training loop
for step in range(20):
x_batch = torch.rand(8, 10, dtype=torch.float64)
y_batch = torch.rand(8, 1)
optimizer.zero_grad()
y_pred = model(x_batch)
loss = torch.nn.functional.mse_loss(y_pred, y_batch)
loss.backward()
optimizer.step()
if step % 5 == 0:
print(f"Step {step}: loss = {loss.item():.4f}")
Profile-Guided Optimization Workflow
When optimizing simulation performance, guessing at bottlenecks is unreliable. A systematic profiling workflow identifies exactly where time is spent.
Step 1: Python-Level Profiling with cProfile
Start with a high-level profile to see which functions dominate execution time:
import cProfile
import pstats
import pennylane as qml
import numpy as np
def run_vqe_step():
n_qubits = 25
dev = qml.device("lightning.gpu", wires=n_qubits)
@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
shape = qml.StronglyEntanglingLayers.shape(n_layers=4, n_wires=n_qubits)
params = np.random.uniform(-np.pi, np.pi, shape, requires_grad=True)
grad_fn = qml.grad(circuit)
# Profile 5 gradient evaluations
for _ in range(5):
_ = grad_fn(params)
# Run the profiler
profiler = cProfile.Profile()
profiler.enable()
run_vqe_step()
profiler.disable()
# Print the top 20 functions by cumulative time
stats = pstats.Stats(profiler)
stats.sort_stats("cumulative")
stats.print_stats(20)
For a typical 25-qubit VQE circuit, expect the profile to show roughly this breakdown:
- ~80% in gate application (the CUDA kernel calls via cuQuantum)
- ~15% in gradient accumulation and adjoint propagation
- ~5% in data transfer, QNode overhead, and Python bookkeeping
Step 2: GPU-Level Profiling with NVIDIA Nsight Systems
For lightning.gpu, you can profile the CUDA kernel execution to identify whether the GPU is fully utilized or waiting on data transfers:
# Profile a Lightning GPU script with Nsight Systems
nsys profile --stats=true --output=vqe_profile python my_vqe_script.py
# View the summary
nsys stats vqe_profile.nsys-rep
The Nsight Systems output shows:
- CUDA kernel execution time: how long each gate kernel runs
- Memory copy time: host-to-device and device-to-host transfer durations
- GPU idle time: gaps where the GPU is waiting for the CPU to dispatch work
If you see significant GPU idle time, the bottleneck is on the CPU side (circuit construction, Python overhead). If memory copy time is high relative to kernel time, you may be at a qubit count where the transfer overhead negates the GPU advantage.
Step 3: Targeted Timing
Once you know the bottleneck category, add targeted timing around specific operations:
import time
import pennylane as qml
import numpy as np
n_qubits = 25
dev = qml.device("lightning.gpu", wires=n_qubits)
@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
shape = qml.StronglyEntanglingLayers.shape(n_layers=4, n_wires=n_qubits)
params = np.random.uniform(-np.pi, np.pi, shape, requires_grad=True)
# Time forward pass separately from gradient
t0 = time.perf_counter()
val = circuit(params)
t_forward = time.perf_counter() - t0
grad_fn = qml.grad(circuit)
t0 = time.perf_counter()
grads = grad_fn(params)
t_gradient = time.perf_counter() - t0
print(f"Forward pass: {t_forward:.4f}s")
print(f"Full gradient: {t_gradient:.4f}s")
print(f"Gradient overhead: {t_gradient / t_forward:.1f}x forward time")
# For adjoint diff, expect gradient to be ~2x the forward pass time
Multi-GPU Simulation with lightning.kokkos
For circuits requiring more memory than a single GPU provides (for example, 35 qubits need 256 GB in complex64), lightning.kokkos supports distributing the statevector across multiple GPUs using MPI. Each GPU holds a portion of the statevector, and gate operations that require communication between partitions use MPI for data exchange.
Memory Limits by GPU Count
| GPU Count | Total VRAM (A100 80GB) | Max Qubits (complex64) | Max Qubits (complex128) |
|---|---|---|---|
| 1 | 80 GB | ~33 | ~32 |
| 2 | 160 GB | ~34 | ~33 |
| 4 | 320 GB | ~35 | ~34 |
| 8 | 640 GB | ~36 | ~35 |
Note: the actual limit is slightly lower than these theoretical values because each GPU also needs memory for gate matrices, communication buffers, and the adjoint differentiation state.
Setting Up Multi-GPU Execution
First, install the MPI-enabled version of lightning.kokkos:
# Install with MPI support
pip install pennylane-lightning[kokkos]
# Ensure mpi4py is available
pip install mpi4py
Create your simulation script (save as multi_gpu_vqe.py):
import pennylane as qml
import numpy as np
n_qubits = 33
dev = qml.device("lightning.kokkos", wires=n_qubits)
@qml.qnode(dev, diff_method="adjoint")
def circuit(params):
qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
return qml.expval(qml.PauliZ(0))
shape = qml.StronglyEntanglingLayers.shape(n_layers=2, n_wires=n_qubits)
params = np.random.uniform(-np.pi, np.pi, shape, requires_grad=True)
# Compute energy
energy = circuit(params)
print(f"Energy: {energy:.6f}")
# Compute gradient
grad_fn = qml.grad(circuit)
grads = grad_fn(params)
print(f"Gradient norm: {np.linalg.norm(grads):.6f}")
Launch with MPI across 4 GPUs:
# Run on 4 GPUs (each MPI rank maps to one GPU)
mpirun -n 4 python multi_gpu_vqe.py
Multi-GPU simulation introduces communication overhead for gates that act across the statevector partition boundary. Single-qubit gates and CNOT gates between qubits within the same partition execute at full GPU speed. Gates crossing the partition boundary require an MPI all-to-all exchange, which adds latency proportional to the partition size. In practice, multi-GPU simulation of 35 qubits on 4 A100 GPUs runs roughly 2-3x slower per gate than single-GPU simulation of 33 qubits, but it makes otherwise impossible simulations feasible.
VQE Benchmark: Comparing All Backends
To give concrete performance numbers, here is a complete VQE experiment for the H2 molecule at bond distance 0.735 angstroms. This is a standard benchmark: 4 qubits, a hardware-efficient ansatz with 3 variational parameters, and 100 optimization steps with the Adam optimizer.
import time
import pennylane as qml
import numpy as np
# H2 Hamiltonian at equilibrium bond distance (0.735 Angstroms)
# Using the STO-3G basis, the qubit Hamiltonian has 4 qubits
coeffs = [
-0.04207897647782277,
0.17771287465139946,
0.17771287465139946,
-0.24274280513140456,
-0.24274280513140456,
0.17059738328801052,
0.04475014401535161,
-0.04475014401535161,
-0.04475014401535161,
0.04475014401535161,
0.12293305056183798,
0.16768319457718950,
0.16768319457718950,
0.12293305056183798,
0.17627640804319591,
]
obs = [
qml.Identity(0),
qml.PauliZ(0),
qml.PauliZ(1),
qml.PauliZ(2),
qml.PauliZ(3),
qml.PauliZ(0) @ qml.PauliZ(1),
qml.PauliY(0) @ qml.PauliX(1) @ qml.PauliX(2) @ qml.PauliY(3),
qml.PauliY(0) @ qml.PauliY(1) @ qml.PauliX(2) @ qml.PauliX(3),
qml.PauliX(0) @ qml.PauliX(1) @ qml.PauliY(2) @ qml.PauliY(3),
qml.PauliX(0) @ qml.PauliY(1) @ qml.PauliY(2) @ qml.PauliX(3),
qml.PauliZ(0) @ qml.PauliZ(2),
qml.PauliZ(0) @ qml.PauliZ(3),
qml.PauliZ(1) @ qml.PauliZ(2),
qml.PauliZ(1) @ qml.PauliZ(3),
qml.PauliZ(2) @ qml.PauliZ(3),
]
H = qml.Hamiltonian(coeffs, obs)
# Exact ground state energy for comparison
E_exact = -1.136189454088 # Hartree
def run_vqe(backend_name, n_steps=100):
"""Run VQE on a given backend and return timing and energy results."""
dev = qml.device(backend_name, wires=4)
@qml.qnode(dev, diff_method="adjoint")
def cost_fn(params):
# Hardware-efficient ansatz
qml.RY(params[0], wires=0)
qml.RY(params[1], wires=1)
qml.CNOT(wires=[0, 1])
qml.CNOT(wires=[1, 2])
qml.CNOT(wires=[2, 3])
qml.RY(params[2], wires=2)
return qml.expval(H)
params = np.array([0.1, 0.2, 0.3], requires_grad=True)
opt = qml.AdamOptimizer(stepsize=0.05)
# Warm-up
_ = cost_fn(params)
t0 = time.perf_counter()
energies = []
for step in range(n_steps):
params, energy = opt.step_and_cost(cost_fn, params)
energies.append(float(energy))
total_time = time.perf_counter() - t0
final_energy = energies[-1]
time_per_step = total_time / n_steps
error = abs(final_energy - E_exact)
return {
"backend": backend_name,
"time_per_step_ms": time_per_step * 1000,
"total_time_s": total_time,
"final_energy": final_energy,
"error_hartree": error,
}
# Run on each backend (skip unavailable backends gracefully)
backends = ["default.qubit", "lightning.qubit", "lightning.gpu", "lightning.kokkos"]
results = []
for backend in backends:
try:
result = run_vqe(backend)
results.append(result)
print(f"{result['backend']:20s} | {result['time_per_step_ms']:8.1f} ms/step | "
f"total: {result['total_time_s']:6.2f}s | E = {result['final_energy']:.6f} | "
f"error: {result['error_hartree']:.6f}")
except Exception as e:
print(f"{backend:20s} | UNAVAILABLE: {e}")
print(f"\nExact energy: {E_exact:.6f} Hartree")
Representative results on a workstation with RTX 3090 and Ryzen 5950X:
| Backend | Time per step | Total time (100 steps) | Final energy (Hartree) | Error |
|---|---|---|---|---|
default.qubit | 12.3 ms | 1.23 s | -1.13619 | 0.00000 |
lightning.qubit | 2.1 ms | 0.21 s | -1.13619 | 0.00000 |
lightning.gpu | 3.8 ms | 0.38 s | -1.13619 | 0.00000 |
lightning.kokkos | 2.9 ms | 0.29 s | -1.13619 | 0.00000 |
At only 4 qubits, lightning.gpu is actually slower than lightning.qubit because of the GPU transfer overhead discussed earlier. The GPU backend shows its strength at higher qubit counts. For a 24-qubit Hamiltonian simulation, lightning.gpu would be 10-50x faster than lightning.qubit, making the GPU backend essential for practical quantum chemistry simulations.
Common Mistakes and How to Avoid Them
1. Using parameter-shift instead of adjoint on Lightning backends
# WRONG: this works, but wastes the Lightning backend's key advantage
@qml.qnode(dev_gpu, diff_method="parameter-shift")
def circuit(params):
qml.StronglyEntanglingLayers(params, wires=range(20))
return qml.expval(qml.PauliZ(0))
# RIGHT: always use adjoint with Lightning for maximum performance
@qml.qnode(dev_gpu, diff_method="adjoint")
def circuit(params):
qml.StronglyEntanglingLayers(params, wires=range(20))
return qml.expval(qml.PauliZ(0))
Parameter-shift on lightning.gpu runs each shifted circuit on the GPU (so each individual run is fast), but for a 200-parameter circuit, that is still 400 GPU kernel launches versus 1 forward + 1 backward pass with adjoint. The speedup from adjoint scales linearly with parameter count.
2. Not checking GPU availability before selecting lightning.gpu
If no CUDA-capable GPU is present or if the cuQuantum library is not installed, qml.device("lightning.gpu", ...) raises an ImportError. Always add a fallback:
import pennylane as qml
def get_best_device(wires):
"""Select the fastest available Lightning backend."""
for backend in ["lightning.gpu", "lightning.kokkos", "lightning.qubit"]:
try:
dev = qml.device(backend, wires=wires)
print(f"Using backend: {backend}")
return dev
except (ImportError, RuntimeError):
continue
# Final fallback: always available
print("Using backend: default.qubit")
return qml.device("default.qubit", wires=wires)
dev = get_best_device(wires=24)
This pattern is especially important for code that runs in CI/CD pipelines, shared notebooks, or cloud environments where GPU availability is not guaranteed.
3. Creating too many device instances
Each qml.device("lightning.gpu", wires=n) call allocates GPU memory for a full statevector. If you create multiple devices without releasing them, you exhaust GPU memory quickly:
# WRONG: each device allocates ~2 GB for a 28-qubit statevector
devices = [qml.device("lightning.gpu", wires=28) for _ in range(10)]
# This allocates ~20 GB of GPU memory for devices alone
# RIGHT: reuse a single device for multiple QNodes
dev = qml.device("lightning.gpu", wires=28)
@qml.qnode(dev, diff_method="adjoint")
def circuit_a(params):
qml.RX(params[0], wires=0)
return qml.expval(qml.PauliZ(0))
@qml.qnode(dev, diff_method="adjoint")
def circuit_b(params):
qml.RY(params[0], wires=0)
return qml.expval(qml.PauliZ(0))
Multiple QNodes can share the same device instance. The device resets its internal state between QNode executions automatically.
4. Using qml.probs or qml.sample with adjoint differentiation
Adjoint differentiation only supports scalar expectation values. If your circuit returns probabilities or samples, you need to switch differentiation methods:
dev = qml.device("lightning.gpu", wires=4)
# WRONG: qml.probs is not compatible with adjoint
@qml.qnode(dev, diff_method="adjoint")
def circuit_probs(x):
qml.RX(x, wires=0)
qml.CNOT(wires=[0, 1])
return qml.probs(wires=[0, 1])
# RIGHT (option A): use expval instead
@qml.qnode(dev, diff_method="adjoint")
def circuit_expval(x):
qml.RX(x, wires=0)
qml.CNOT(wires=[0, 1])
return qml.expval(qml.PauliZ(0) @ qml.PauliZ(1))
# RIGHT (option B): keep probs but switch to parameter-shift
@qml.qnode(dev, diff_method="parameter-shift")
def circuit_probs_ps(x):
qml.RX(x, wires=0)
qml.CNOT(wires=[0, 1])
return qml.probs(wires=[0, 1])
5. Not using float64 for gradient accuracy in quantum chemistry
lightning.gpu defaults to single precision (float32) for speed. This is fine for most machine learning tasks, but quantum chemistry calculations targeting chemical accuracy (1.6 milli-Hartree, or Hartree) can suffer from accumulated floating-point errors in float32.
import torch
import pennylane as qml
dev = qml.device("lightning.gpu", wires=8)
# PROBLEM: float32 precision may not reach chemical accuracy
@qml.qnode(dev, diff_method="adjoint", interface="torch")
def circuit_f32(params):
qml.StronglyEntanglingLayers(params, wires=range(8))
return qml.expval(qml.PauliZ(0))
shape = qml.StronglyEntanglingLayers.shape(n_layers=3, n_wires=8)
# WRONG: float32 parameters
params_32 = torch.nn.Parameter(torch.randn(shape, dtype=torch.float32))
# RIGHT: float64 parameters for chemistry-grade accuracy
params_64 = torch.nn.Parameter(torch.randn(shape, dtype=torch.float64))
For JAX, enable 64-bit precision globally:
import jax
jax.config.update("jax_enable_x64", True)
Without this setting, JAX silently truncates all values to float32, which can cause VQE optimizations to plateau above chemical accuracy even though the optimizer has converged.
Choosing the Right Backend
| Scenario | Recommended backend |
|---|---|
| Up to ~20 qubits, prototyping | default.qubit |
| 20-30 qubits, CPU-only | lightning.qubit |
| 25+ qubits, NVIDIA GPU available | lightning.gpu |
| Multi-GPU or AMD GPU cluster | lightning.kokkos |
| Multi-core CPU, no GPU | lightning.kokkos with OpenMP |
| Mid-circuit measurements needed | lightning.qubit (GPU backends have limited support) |
| Noise simulation required | default.mixed (Lightning is pure statevector only) |
| Hardware device targeting | Real device (Lightning is simulation only) |
The Lightning backends require no changes to circuit definitions or observable structure. If you are hitting simulation time bottlenecks, switching from default.qubit to lightning.gpu with diff_method="adjoint" is often the single highest-leverage optimization available.
Was this tutorial helpful?