Accelerating Quantum Simulation with CUDA Quantum and GPU

Why GPU Acceleration Matters for Quantum Simulation

Classical simulation of quantum circuits is exponentially hard. A full state vector for an n-qubit system requires storing 2^n complex amplitudes. For 30 qubits that is roughly one billion complex numbers, each 16 bytes, about 16 GB just for the state vector itself. CPUs struggle to manipulate arrays of that size efficiently because memory bandwidth and floating-point throughput quickly become bottlenecks.

NVIDIA’s CUDA Quantum platform (the cudaq Python library) addresses this directly. It offloads state vector operations to GPU kernels, where thousands of CUDA cores can apply gate matrices in parallel across the full amplitude vector. The custatevec backend (part of the cuQuantum SDK) is the workhorse library that implements these parallel gate operations and is exposed automatically when you target a GPU device.

This tutorial assumes familiarity with quantum circuits and Python, and that you have an NVIDIA GPU (Ampere architecture or newer is strongly recommended) with the cudaq package installed.

Installing CUDA Quantum

pip install cudaq

For GPU support you also need the NVIDIA CUDA toolkit (12.x) and cuQuantum installed system-wide or inside a container. NVIDIA provides prebuilt containers on NGC that bundle everything:

docker pull nvcr.io/nvidia/nightly/cuda-quantum:latest

How custatevec Works Under the Hood

Before writing any CUDA Quantum code, it helps to understand what custatevec actually does when you apply a gate. This knowledge will inform your decisions about circuit design, qubit count, and performance expectations.

Gate Application as Batched Matrix-Vector Operations

A quantum state vector of n qubits is a flat array of 2^n complex amplitudes stored in GPU VRAM. When you apply a single-qubit gate U to qubit k, the 2^n amplitudes are logically partitioned into 2^(n-1) pairs. Each pair consists of the amplitude for a computational basis state where qubit k is 0 and the corresponding state where qubit k is 1. Concretely, for every bitstring b where bit k is 0, custatevec groups amplitude a[b] (the |…0_k…> component) with amplitude a[b ^ (1 << k)] (the |…1_k…> component) and applies the 2x2 unitary matrix:

[a'[b]          ]   [U00  U01] [a[b]          ]
[a'[b ^ (1<<k)] ] = [U10  U11] [a[b ^ (1<<k)] ]

This is equivalent to a batched 2x2 matrix-vector multiplication across 2^(n-1) independent pairs. custatevec implements this via cuBLAS-style CUDA kernels that use coalesced memory access patterns, meaning consecutive GPU threads read and write consecutive memory addresses. Coalesced access is critical because it maximizes the utilization of the GPU’s memory bus.

Two-Qubit Gates

For a two-qubit gate like CNOT acting on qubits j and k, the amplitudes are grouped into sets of four (corresponding to the four combinations of bits j and k). The gate matrix is 4x4, and 2^(n-2) such groups are transformed in parallel. Each group requires 4 reads and 4 writes from GPU VRAM.

Memory Bandwidth Math

Knowing this structure lets you estimate the time for a single gate operation. Consider a 28-qubit state vector:

State vector size: 2^28 amplitudes * 8 bytes per amplitude (complex64) = 2,147,483,648 bytes = 2 GB
CNOT gate data movement: The gate reads all 2 GB and writes all 2 GB, for a total of 4 GB of memory traffic.
A100 GPU memory bandwidth: approximately 2 TB/s (HBM2e)
Estimated gate time: 4 GB / 2 TB/s = 2 ms per gate

For fp64 (complex128), the state vector doubles to 4 GB, and each gate takes roughly 4 ms. This is why custatevec defaults to complex64: halving the precision halves the per-gate time and doubles the maximum qubit count for a given GPU.

In practice, gate times are slightly longer than the bandwidth-limited estimate because of kernel launch overhead (typically 0.01-0.1 ms per launch) and cache effects. But the bandwidth model gives you accurate order-of-magnitude predictions and explains why GPU memory bandwidth, not FLOPS, is the primary bottleneck for state vector simulation.

The Kernel Decorator

CUDA Quantum separates the quantum program from the classical host using the @cudaq.kernel decorator. Inside a kernel you write quantum operations; outside you collect results. This separation allows the compiler to lower the kernel to either a CPU simulator or GPU-accelerated backend without changing your source code.

import cudaq

@cudaq.kernel
def bell_state():
    qubits = cudaq.qvector(2)
    h(qubits[0])
    cx(qubits[0], qubits[1])
    mz(qubits)

A few things to notice:

cudaq.qvector(n) allocates n qubits, all initialized to |0>.
Gate names (h, cx, mz) are lowercase and called as free functions inside the kernel.
There is no explicit return; measurement results are collected by the sampling API on the host side.

Kernel Restrictions and Workarounds

CUDA Quantum kernels are not ordinary Python functions. The @cudaq.kernel decorator triggers a compilation step that lowers the kernel body to MLIR (Multi-Level Intermediate Representation) and then to QIR (Quantum Intermediate Representation). This compilation requires static analysis of the kernel code, which means several standard Python constructs are not allowed inside kernels.

What Is Not Allowed

Python lists: You cannot create or manipulate Python list objects inside a kernel.
NumPy arrays: No numpy operations or array creation.
Dynamic memory allocation: No append, no variable-length data structures.
Custom classes: No instantiation of user-defined Python classes.
Most standard library calls: No math.sin(), print(), or similar.

The reason for these restrictions is that MLIR/QIR must know the exact structure of the computation at compile time. Python’s dynamic typing and runtime object creation cannot be statically analyzed by the CUDA Quantum compiler.

What Is Allowed

cudaq.qvector(n) for qubit allocation (n must be a known integer or int parameter)
float and int parameters
list[float] parameters (these are passed as fixed-size arrays from the host)
for loops with integer range indices
if/else with classical int or bool conditions
All quantum gate operations (h, cx, ry, rz, etc.)
mz for measurement

Example: A Common Mistake and Its Fix

This kernel will fail to compile because it uses a Python list inside the kernel body:

import cudaq

# THIS WILL FAIL
@cudaq.kernel
def bad_kernel():
    angles = [0.1, 0.2, 0.3]  # Python list: not allowed
    qubits = cudaq.qvector(3)
    for i in range(3):
        ry(angles[i], qubits[i])
    mz(qubits)

The fix is to pass the angles as a list[float] parameter from the host:

import cudaq

cudaq.set_target("nvidia")

@cudaq.kernel
def good_kernel(angles: list[float]):
    qubits = cudaq.qvector(3)
    for i in range(3):
        ry(angles[i], qubits[i])
    mz(qubits)

# Pass the angles from the host side
result = cudaq.sample(good_kernel, [0.1, 0.2, 0.3], shots_count=1000)
print(result)

The key insight is that data flows into the kernel through its parameters, not through Python objects created inside it.

Choosing the GPU Backend

Before running, set the target:

cudaq.set_target("nvidia")          # single GPU, custatevec
# cudaq.set_target("nvidia-mgpu")   # multi-GPU
# cudaq.set_target("qpp-cpu")       # reference CPU simulator

Setting the target is the only change needed to switch from CPU to GPU simulation. The kernel source stays identical.

Sampling a Bell State

import cudaq

cudaq.set_target("nvidia")

@cudaq.kernel
def bell_state():
    qubits = cudaq.qvector(2)
    h(qubits[0])
    cx(qubits[0], qubits[1])
    mz(qubits)

result = cudaq.sample(bell_state, shots_count=10000)
print(result)
# Expected: roughly 50% |00> and 50% |11>

cudaq.sample runs the circuit the requested number of times (shots) and returns a SampleResult dictionary mapping bitstrings to counts.

GHZ State on Many Qubits

The GHZ state generalizes the Bell state to n qubits: a superposition of |000…0> and |111…1>. It is a useful stress test because the entanglement chain touches every qubit.

import cudaq

cudaq.set_target("nvidia")

@cudaq.kernel
def ghz_state(n: int):
    qubits = cudaq.qvector(n)
    h(qubits[0])
    for i in range(n - 1):
        cx(qubits[i], qubits[i + 1])
    mz(qubits)

# 28-qubit GHZ state: state vector is 2 GB (complex64) on GPU
result = cudaq.sample(ghz_state, 28, shots_count=1000)
print(result.most_probable())

On a CPU simulator a 28-qubit state vector takes minutes; on a modern GPU with custatevec the same circuit typically completes in a few seconds.

Computing Expectation Values with cudaq.observe

Shot-based sampling (cudaq.sample) introduces statistical noise. If you need the expectation value of an observable, you would need thousands of shots to get a precise estimate. cudaq.observe computes exact expectation values by calculating <psi|H|psi> directly from the state vector, with no shot noise at all.

Building a Hamiltonian with SpinOperator

CUDA Quantum represents observables as SpinOperator objects built from Pauli operators:

import cudaq

# Build H = Z0 + Z1 + Z0*Z1 for a 4-qubit system
# cudaq.spin.z(i) creates a Pauli-Z on qubit i
hamiltonian = cudaq.spin.z(0) + cudaq.spin.z(1) + cudaq.spin.z(0) * cudaq.spin.z(1)
print("Hamiltonian:", hamiltonian)

You can also build more complex Hamiltonians using cudaq.spin.x(i), cudaq.spin.y(i), and scalar multiplication:

# Transverse-field Ising model on 4 qubits
# H = -J * sum(Z_i * Z_{i+1}) - h * sum(X_i)
J = 1.0
h_field = 0.5
hamiltonian = 0.0 * cudaq.spin.i(0)  # initialize to zero operator
for i in range(3):
    hamiltonian += -J * cudaq.spin.z(i) * cudaq.spin.z(i + 1)
for i in range(4):
    hamiltonian += -h_field * cudaq.spin.x(i)

print("Ising Hamiltonian:", hamiltonian)

Using cudaq.observe

The observe function takes a kernel (without mz calls), a Hamiltonian, and any kernel parameters. It returns an ObserveResult containing the expectation value:

import cudaq

cudaq.set_target("nvidia")

# Parameterized ansatz: no mz call (observe handles expectation computation)
@cudaq.kernel
def ansatz(theta: float):
    qubits = cudaq.qvector(2)
    ry(theta, qubits[0])
    cx(qubits[0], qubits[1])

# Hamiltonian: H = Z0 + Z1 + 0.5 * Z0 * Z1
hamiltonian = (
    cudaq.spin.z(0)
    + cudaq.spin.z(1)
    + 0.5 * cudaq.spin.z(0) * cudaq.spin.z(1)
)

# Compute <psi(theta)|H|psi(theta)> exactly
theta = 0.7
result = cudaq.observe(ansatz, hamiltonian, theta)
print(f"<H> at theta={theta}: {result.expectation():.6f}")

This is exact (no shot noise) and faster than cudaq.sample for expectation values because it avoids repeated circuit execution. For variational algorithms like VQE, observe is the preferred interface since each energy evaluation needs a precise expectation value, not a histogram of bitstrings.

Gradient Computation with the Parameter-Shift Rule

Variational algorithms require gradients of the cost function with respect to circuit parameters. For a parameterized gate of the form R(theta) = exp(-i * theta * G / 2), the parameter-shift rule gives the exact gradient:

dE/dtheta = ( E(theta + pi/2) - E(theta - pi/2) ) / 2

This requires two evaluations of cudaq.observe per parameter. Here is a manual implementation for a 2-parameter ansatz:

import cudaq
import math

cudaq.set_target("nvidia")

@cudaq.kernel
def two_param_ansatz(params: list[float]):
    qubits = cudaq.qvector(2)
    ry(params[0], qubits[0])
    ry(params[1], qubits[1])
    cx(qubits[0], qubits[1])

hamiltonian = cudaq.spin.z(0) + 0.5 * cudaq.spin.z(0) * cudaq.spin.z(1)

def compute_gradient(params, hamiltonian):
    """Compute gradient using the parameter-shift rule."""
    shift = math.pi / 2.0
    gradient = []
    for i in range(len(params)):
        # Shift parameter i forward
        params_plus = list(params)
        params_plus[i] += shift

        # Shift parameter i backward
        params_minus = list(params)
        params_minus[i] -= shift

        # Evaluate energy at both shifted points
        e_plus = cudaq.observe(two_param_ansatz, hamiltonian, params_plus).expectation()
        e_minus = cudaq.observe(two_param_ansatz, hamiltonian, params_minus).expectation()

        gradient.append((e_plus - e_minus) / 2.0)
    return gradient

# Evaluate gradient at a specific point
params = [0.5, 1.2]
energy = cudaq.observe(two_param_ansatz, hamiltonian, params).expectation()
grad = compute_gradient(params, hamiltonian)
print(f"Energy: {energy:.6f}")
print(f"Gradient: [{grad[0]:.6f}, {grad[1]:.6f}]")

Each gradient evaluation requires 2 * (number of parameters) calls to cudaq.observe. For a VQE ansatz with 50 parameters, that means 100 observe calls per optimization step. This is where GPU acceleration pays off most: each observe call is fast on GPU, so the total gradient computation remains tractable.

Comparison with cudaq.vqe()

While manual gradient computation gives you full control, CUDA Quantum provides cudaq.vqe() which handles the optimization loop internally, including gradient computation and parameter updates.

Running VQE with cudaq.vqe()

The cudaq.vqe() function combines GPU-accelerated expectation value computation with a classical optimizer in a single call. This is the most convenient way to run variational quantum eigensolver computations.

A Simple H2 Molecule Example

The following example finds the ground state energy of a simplified 2-qubit Hamiltonian representing the hydrogen molecule at a fixed bond distance:

import cudaq

cudaq.set_target("nvidia")

# Simplified H2 Hamiltonian at bond distance 0.7414 Angstrom
# In a real application you would generate this from a chemistry library
hamiltonian = (
    -0.8124 * cudaq.spin.i(0) * cudaq.spin.i(1)
    + 0.1712 * cudaq.spin.z(0) * cudaq.spin.i(1)
    - 0.2257 * cudaq.spin.i(0) * cudaq.spin.z(1)
    + 0.1712 * cudaq.spin.z(0) * cudaq.spin.z(1)
    + 0.0453 * cudaq.spin.x(0) * cudaq.spin.x(1)
    + 0.0453 * cudaq.spin.y(0) * cudaq.spin.y(1)
)

# Parameterized ansatz for VQE
@cudaq.kernel
def h2_ansatz(theta: float):
    qubits = cudaq.qvector(2)
    x(qubits[0])               # start in |10> (one electron)
    ry(theta, qubits[1])
    cx(qubits[1], qubits[0])

# Run VQE with the COBYLA optimizer (gradient-free)
optimizer = cudaq.optimizers.COBYLA()
optimizer.max_iterations = 100

energy, optimal_params = cudaq.vqe(
    kernel=h2_ansatz,
    spin_operator=hamiltonian,
    optimizer=optimizer,
    parameter_count=1,
)

print(f"Optimal energy: {energy:.6f} Hartree")
print(f"Optimal theta: {optimal_params[0]:.6f}")
# Expected: energy close to -1.137 Hartree (exact ground state of H2)

Supported Optimizers

CUDA Quantum includes several built-in optimizers:

cudaq.optimizers.COBYLA(): Constrained Optimization BY Linear Approximation. Gradient-free, works well for small parameter counts.
cudaq.optimizers.NelderMead(): Gradient-free simplex method. Robust but slow for many parameters.
cudaq.optimizers.LBFGS(): Limited-memory Broyden-Fletcher-Goldfarb-Shanno. Requires gradients, efficient for large parameter counts.

For LBFGS, you need to provide a gradient strategy. The parameter-shift approach described earlier can be wrapped into a custom gradient function, or you can use finite differences as a simpler (but less accurate) alternative.

Amplitude Estimation with cudaq.observe

Beyond VQE, cudaq.observe is useful for any algorithm that computes expectation values. Quantum Phase Estimation (QPE) is a good example. In QPE, you estimate the eigenphase of a unitary operator by measuring the state of ancilla qubits after a series of controlled rotations. Using observe, you can extract the phase information without shot noise.

Here is a simplified QPE circuit that estimates the phase of a T gate (which has eigenvalue exp(i*pi/4), so the phase is 1/8 of a full rotation):

import cudaq
import math

cudaq.set_target("nvidia")

@cudaq.kernel
def qpe_circuit(n_ancilla: int):
    # Allocate ancilla qubits + 1 target qubit
    ancilla = cudaq.qvector(n_ancilla)
    target = cudaq.qvector(1)

    # Prepare target in eigenstate |1> of the T gate
    x(target[0])

    # Put ancilla qubits in superposition
    for i in range(n_ancilla):
        h(ancilla[i])

    # Controlled rotations: apply T^(2^i) controlled on ancilla[i]
    # T gate has phase pi/4, so T^(2^i) has phase pi/4 * 2^i
    for i in range(n_ancilla):
        angle = math.pi / 4.0 * (2 ** i)
        # Controlled Rz rotation (approximating controlled T^(2^i))
        cr1(angle, ancilla[i], target[0])

    # Inverse QFT on ancilla register
    for i in range(n_ancilla // 2):
        swap(ancilla[i], ancilla[n_ancilla - 1 - i])
    for i in range(n_ancilla):
        for j in range(i):
            angle = -math.pi / (2 ** (i - j))
            cr1(angle, ancilla[j], ancilla[i])
        h(ancilla[i])

# Use observe to measure Z on each ancilla qubit
n_ancilla = 4
for i in range(n_ancilla):
    obs = cudaq.spin.z(i)
    result = cudaq.observe(qpe_circuit, obs, n_ancilla)
    z_exp = result.expectation()
    # Convert Z expectation to probability of |0>: P(0) = (1 + <Z>)/2
    p0 = (1.0 + z_exp) / 2.0
    print(f"Ancilla qubit {i}: <Z> = {z_exp:.4f}, P(|0>) = {p0:.4f}")

This demonstrates how observe extends naturally to algorithms beyond simple variational circuits. Any computation that reduces to measuring Pauli expectation values can benefit from the exact, shot-free evaluation that GPU-accelerated observe provides.

Noise Models on GPU

CUDA Quantum supports noise modeling via cudaq.NoiseModel. You attach channels to specific gate types or qubit indices, then pass the model to cudaq.sample.

import cudaq

cudaq.set_target("density-matrix-cpu")

# Build a depolarizing noise model
noise = cudaq.NoiseModel()
depolarizing = cudaq.DepolarizingChannel(0.01)   # 1% error rate
noise.add_channel("h", [0], depolarizing)
noise.add_channel("cx", [0, 1], depolarizing)

@cudaq.kernel
def noisy_bell():
    qubits = cudaq.qvector(2)
    h(qubits[0])
    cx(qubits[0], qubits[1])
    mz(qubits)

noisy_result = cudaq.sample(noisy_bell, noise_model=noise, shots_count=10000)
print("Noisy result:", noisy_result)

clean_result = cudaq.sample(noisy_bell, shots_count=10000)
print("Clean result:", clean_result)

The GPU backend applies noise channels using density matrix evolution when noise is present, transparently switching from the pure state-vector path to a density matrix representation. This is significantly more memory intensive (the density matrix is 2^n x 2^n), so noise simulations beyond 15 qubits on current hardware are generally infeasible. See the memory management section below for the exact numbers.

Memory Management for Large Circuits

Understanding GPU memory requirements is essential for planning your simulations. Running out of VRAM produces cryptic CUDA errors, so it is better to check requirements up front.

State Vector Memory Formula

For state vector simulation, the memory required is:

complex64 (fp32): 2^n * 8 bytes per amplitude
complex128 (fp64): 2^n * 16 bytes per amplitude

custatevec defaults to complex64, which halves memory compared to complex128. For most applications, fp32 precision is sufficient.

Density Matrix Memory Formula

For density matrix simulation (used with noise models), the memory required is:

complex64: (2^n)^2 * 8 = 4^n * 8 bytes
complex128: (2^n)^2 * 16 = 4^n * 16 bytes

This grows dramatically faster than state vector simulation.

Reference Table

Qubits	State Vector (fp32)	State Vector (fp64)	Density Matrix (fp32)
20	8 MB	16 MB	8 TB
24	128 MB	256 MB	too large
26	512 MB	1 GB	too large
28	2 GB	4 GB	too large
30	8 GB	16 GB	too large
32	32 GB	64 GB	too large
34	128 GB	256 GB	too large

Note: The density matrix entry for 20 qubits (8 TB) illustrates why noise simulation is limited to roughly 14-15 qubits on current GPUs. At 15 qubits, the density matrix in fp32 is 4^15 * 8 = approximately 8.6 GB, which fits on an A100 80 GB with room to spare, but at 16 qubits it jumps to 34 GB.

Python Memory Estimator

def estimate_gpu_memory(n_qubits, simulation_type="statevector", precision="fp32"):
    """Estimate GPU memory in bytes for a given simulation configuration.

    Args:
        n_qubits: Number of qubits.
        simulation_type: "statevector" or "density_matrix".
        precision: "fp32" (complex64) or "fp64" (complex128).

    Returns:
        Memory in bytes as an integer.
    """
    bytes_per_amplitude = 8 if precision == "fp32" else 16

    if simulation_type == "statevector":
        return (2 ** n_qubits) * bytes_per_amplitude
    elif simulation_type == "density_matrix":
        return (4 ** n_qubits) * bytes_per_amplitude
    else:
        raise ValueError(f"Unknown simulation type: {simulation_type}")


def format_bytes(n_bytes):
    """Format byte count as a human-readable string."""
    for unit in ["B", "KB", "MB", "GB", "TB"]:
        if n_bytes < 1024:
            return f"{n_bytes:.1f} {unit}"
        n_bytes /= 1024
    return f"{n_bytes:.1f} PB"


# Example usage
for n in [20, 24, 28, 30, 32]:
    sv_mem = estimate_gpu_memory(n, "statevector", "fp32")
    print(f"{n} qubits, statevector fp32: {format_bytes(sv_mem)}")

# Check if your circuit fits on an RTX 4090 (24 GB VRAM)
gpu_vram = 24 * (1024 ** 3)  # 24 GB in bytes
for n in range(20, 36):
    mem = estimate_gpu_memory(n, "statevector", "fp32")
    if mem > gpu_vram:
        print(f"\nMax qubits on RTX 4090 (fp32 statevector): {n - 1}")
        break

Running this script shows that an RTX 4090 with 24 GB of VRAM can handle up to 31 qubits in fp32 statevector mode.

Multi-GPU Simulation with nvidia-mgpu

For circuits exceeding what a single GPU can hold, CUDA Quantum supports multi-GPU state vector distribution. The nvidia-mgpu target (also reachable as the mgpu option of the nvidia target in newer releases) pools the memory of multiple GPUs and partitions the state vector across them. Note that the separate nvidia-mqpu target does something different: it simulates multiple independent QPUs, one per GPU, for parallelizing many circuit evaluations rather than enlarging a single state vector.

import cudaq

cudaq.set_target("nvidia-mgpu")  # requires multiple NVIDIA GPUs

@cudaq.kernel
def large_ghz(n: int):
    qubits = cudaq.qvector(n)
    h(qubits[0])
    for i in range(n - 1):
        cx(qubits[i], qubits[i + 1])
    mz(qubits)

# 32-qubit simulation distributed across 4 GPUs (each holds 1/4 of state vector)
result = cudaq.sample(large_ghz, 32, shots_count=500)
print(result.most_probable())

With 4x A100 80 GB GPUs you can comfortably reach 34-36 qubits in the state vector picture before hitting combined memory limits. Beyond that, tensor network methods are the next step.

Tensor Network Target for Deep Circuits

State vector simulation stores the full 2^n amplitude vector, which is infeasible beyond about 35 qubits even with multi-GPU setups. Tensor network simulation takes a fundamentally different approach: it represents the quantum state as a network of tensors connected by contracted indices. When the entanglement in the circuit is limited (meaning the bond dimension of the tensor network stays small), this approach can simulate 50, 100, or even more qubits.

When to Use Tensor Networks

The tensor network approach works best for:

1D nearest-neighbor circuits: Circuits where gates only connect adjacent qubits. These naturally decompose into matrix product states (MPS) with bounded bond dimension.
Low-depth circuits: Circuits where the entanglement has not had time to spread across the full system.
Circuits with structure: Periodic or translationally invariant circuits where the tensor network can exploit symmetry.

The approach breaks down when entanglement is high. A fully entangling circuit (like a random circuit on all-to-all connectivity) forces the bond dimension to grow exponentially, and the tensor network becomes as expensive as full state vector simulation.

Running a 50-Qubit Circuit with tensornet

import cudaq

# Switch to the tensor network backend
cudaq.set_target("tensornet")

@cudaq.kernel
def mps_friendly_circuit(n: int):
    """A 1D nearest-neighbor circuit that stays low entanglement.
    This circuit applies layers of nearest-neighbor CNOT gates
    with single-qubit rotations, mimicking a variational ansatz
    on a 1D chain."""
    qubits = cudaq.qvector(n)

    # Initial superposition
    for i in range(n):
        h(qubits[i])

    # Three layers of nearest-neighbor entangling gates
    for layer in range(3):
        for i in range(0, n - 1, 2):
            cx(qubits[i], qubits[i + 1])
        for i in range(1, n - 1, 2):
            cx(qubits[i], qubits[i + 1])
        # Single-qubit rotation layer
        for i in range(n):
            rz(0.3, qubits[i])

    mz(qubits)

# 50-qubit simulation: infeasible with statevector, tractable with tensornet
result = cudaq.sample(mps_friendly_circuit, 50, shots_count=100)
print(f"Most probable bitstring: {result.most_probable()}")
print(f"Number of unique bitstrings sampled: {len(result)}")

This circuit has 50 qubits and only 3 layers of nearest-neighbor entanglement, so the MPS bond dimension stays manageable. On a single GPU, this simulation completes in seconds. The same circuit in statevector mode would require 2^50 * 8 bytes = 8 PB of memory, which is clearly impossible.

Comparing statevector and tensornet

A useful mental model: choose nvidia (statevector) when you need circuits with high entanglement on up to 30-35 qubits. Choose tensornet when you need circuits with limited entanglement on 35+ qubits. For circuits that have both high entanglement and many qubits, no classical simulation method will help; that is the regime where quantum hardware is needed.

Async Sampling for Throughput

When running variational algorithms, you often need to evaluate the same circuit at many different parameter values. Instead of evaluating them one at a time (each waiting for the GPU to finish before submitting the next), you can use cudaq.observe_async to submit multiple evaluations concurrently. This keeps the GPU busy by overlapping kernel launches.

import cudaq

cudaq.set_target("nvidia")

@cudaq.kernel
def param_ansatz(params: list[float]):
    qubits = cudaq.qvector(3)
    ry(params[0], qubits[0])
    ry(params[1], qubits[1])
    ry(params[2], qubits[2])
    cx(qubits[0], qubits[1])
    cx(qubits[1], qubits[2])

hamiltonian = (
    cudaq.spin.z(0) * cudaq.spin.z(1)
    + cudaq.spin.z(1) * cudaq.spin.z(2)
    + 0.5 * cudaq.spin.x(0)
)

# Generate 16 different parameter sets for a parameter sweep
import math
param_sets = [
    [0.1 * i, 0.2 * i, 0.3 * i] for i in range(16)
]

# Launch all 16 observe calls asynchronously
futures = []
for i in range(16):
    future = cudaq.observe_async(param_ansatz, hamiltonian, param_sets[i])
    futures.append(future)

# Collect results (blocks until each computation finishes)
energies = [f.get().expectation() for f in futures]

for i, energy in enumerate(energies):
    print(f"Parameter set {i}: energy = {energy:.6f}")

# Find the minimum energy configuration
min_idx = energies.index(min(energies))
print(f"\nBest parameters: {param_sets[min_idx]}")
print(f"Minimum energy: {energies[min_idx]:.6f}")

The observe_async pattern is especially valuable for:

Grid searches over parameter space
Parallel gradient evaluation: You can launch all 2p observe calls for a p-parameter gradient in parallel using async
Ensemble methods: Evaluating multiple ansatz configurations simultaneously

The GPU scheduler handles the concurrent kernel launches internally, maximizing utilization without requiring you to manage CUDA streams directly.

CPU vs GPU Speed Comparison

The following table gives approximate wall-clock times for sampling a GHZ state (1000 shots) at different qubit counts, comparing the qpp-cpu and nvidia (A100) backends:

Qubits	CPU time	A100 GPU time	Speedup
20	1.2 s	0.08 s	15x
24	18 s	0.4 s	45x
28	290 s	3.5 s	83x
30	~80 min	28 s	~170x

The speedup grows with qubit count because the state vector grows exponentially while the GPU’s parallelism absorbs much of that growth. At 20 qubits, the state vector fits comfortably in CPU cache, so the CPU is relatively efficient. At 30 qubits, the state vector is 8 GB (fp32), far exceeding any CPU cache, and the GPU’s high memory bandwidth dominates.

Parameterized Kernels

Kernels can accept classical parameters, enabling variational workflows:

import cudaq
import numpy as np

cudaq.set_target("nvidia")

@cudaq.kernel
def ry_ansatz(theta: float):
    q = cudaq.qvector(2)
    ry(theta, q[0])
    cx(q[0], q[1])
    mz(q)

for angle in np.linspace(0, np.pi, 8):
    counts = cudaq.sample(ry_ansatz, angle, shots_count=2000)
    print(f"theta={angle:.2f}: {counts}")

This pattern underpins VQE and QAOA on GPU; the outer optimization loop runs on CPU while each energy evaluation is a GPU-accelerated sampling call.

Connecting to NVIDIA Quantum Cloud

If you do not have a local NVIDIA GPU, or you need to simulate circuits larger than your local hardware can handle, NVIDIA Quantum Cloud (NVQC) provides remote access to GPU-accelerated quantum simulation. This service runs the same custatevec and tensornet backends on NVIDIA’s managed infrastructure, enabling 40+ qubit simulations from any machine with an internet connection.

Authentication Setup

First, obtain an API key from the NVIDIA Developer portal. Set it as an environment variable:

export NVQC_API_KEY="your-api-key-here"

Targeting the Cloud Backend

import cudaq

# Connect to NVIDIA Quantum Cloud with fp64 statevector backend
cudaq.set_target("nvqc", backend="custatevec_fp64")

@cudaq.kernel
def large_ghz(n: int):
    qubits = cudaq.qvector(n)
    h(qubits[0])
    for i in range(n - 1):
        cx(qubits[i], qubits[i + 1])
    mz(qubits)

# 35-qubit GHZ state: requires 512 GB in fp64
# This is infeasible on a single consumer GPU but runs on NVQC
result = cudaq.sample(large_ghz, 35, shots_count=500)
print(f"Most probable: {result.most_probable()}")
print(f"|{'0'*35}> count: {result['0'*35]}")
print(f"|{'1'*35}> count: {result['1'*35]}")

NVQC supports both custatevec_fp32, custatevec_fp64, and tensornet backends. The cloud handles multi-GPU distribution automatically, so you do not need to manage GPU allocation yourself.

When to Use NVQC

Your local machine has no GPU, or only a consumer GPU with limited VRAM
You need fp64 precision at 32+ qubits (requiring 64+ GB)
You are prototyping on a laptop and want to test at scale without provisioning cloud GPU instances manually
You need reproducible simulation results for a publication and want to run on standardized hardware

Performance Profiling with NVIDIA Nsight

When your simulation is slower than expected, profiling reveals where the time is spent. NVIDIA Nsight Systems (nsys) captures a timeline of all CUDA operations, including kernel launches, memory transfers, and synchronization points.

Running the Profiler

Save your CUDA Quantum script as sim.py and run:

nsys profile -o sim_profile python sim.py

This produces a sim_profile.nsys-rep file that you can open in Nsight Systems GUI, or analyze from the command line:

nsys stats sim_profile.nsys-rep

Key Metrics to Watch

Kernel launch overhead: Typically 0.01-0.1 ms per kernel launch. If your circuit has thousands of gates, the cumulative launch overhead can become significant. custatevec batches gate operations to minimize this.
Gate operation time: Proportional to the state vector size. For a 28-qubit circuit in fp32, each gate touches 2 GB of data, so gate times are bounded by memory bandwidth (roughly 1-2 ms per gate on an A100).
Memory transfer time: Should be near zero if the state vector stays on GPU for the entire simulation. If you see large host-to-device or device-to-host transfers in the timeline, that indicates unnecessary data movement. The most common cause is a Python callback or print statement between circuit operations that forces a synchronization.
Shot-to-shot overhead: When running multiple shots, custatevec resets the state vector and reapplies all gates for each shot. The reset itself is fast (a single memset), but if you see gaps between shots in the timeline, check for unnecessary synchronization.

Example Profiling Workflow

# Profile with CUDA API tracing enabled
nsys profile -t cuda,nvtx -o detailed_profile python sim.py

# View summary statistics
nsys stats detailed_profile.nsys-rep

# Look for:
# 1. Total GPU kernel time (should dominate)
# 2. CUDA API call time (should be small)
# 3. Memory operations (should be minimal)

If profiling reveals that most time is spent in memory transfers rather than gate computation, restructure your code to avoid pulling results back to the CPU between circuit operations. Use cudaq.observe instead of cudaq.sample when you only need expectation values, as this avoids the measurement-and-reconstruct cycle entirely.

Common Mistakes

1. Using Python Objects Inside Kernels

The most common error for new CUDA Quantum users. Python lists, NumPy arrays, and custom classes cannot appear inside a @cudaq.kernel function. The compiler needs to lower the kernel to MLIR/QIR, which requires static types.

# WRONG: Python list inside kernel
@cudaq.kernel
def bad():
    angles = [0.1, 0.2]  # fails at compilation
    q = cudaq.qvector(2)
    ry(angles[0], q[0])

# CORRECT: Pass data through parameters
@cudaq.kernel
def good(angles: list[float]):
    q = cudaq.qvector(2)
    ry(angles[0], q[0])

Only use int, float, list[float], list[int], and cudaq.qvector inside kernels.

2. Setting the Target After Running Circuits

cudaq.set_target configures the backend for subsequent compilation and execution. If you change the target after already calling cudaq.sample or cudaq.observe, previously compiled kernels may have been compiled for the wrong backend. Always set your target once, at the top of your script, before defining any kernels:

import cudaq

# Set target FIRST
cudaq.set_target("nvidia")

# Then define and run kernels
@cudaq.kernel
def my_circuit():
    q = cudaq.qvector(2)
    h(q[0])
    mz(q)

result = cudaq.sample(my_circuit, shots_count=1000)

3. Running Density Matrix Simulation Without Checking Memory

Density matrix simulation requires (2^n)^2 memory, which grows as 4^n. This is dramatically more expensive than state vector simulation:

15 qubits: 4^15 * 8 bytes = 8.6 GB (fits on an A100)
16 qubits: 4^16 * 8 bytes = 34.4 GB (needs A100 80 GB)
20 qubits: 4^20 * 8 bytes = 8.8 TB (infeasible on any single GPU)

If you need noise simulation at 20+ qubits, use a noise mitigation technique (like probabilistic error cancellation or zero-noise extrapolation) on top of state vector simulation, rather than attempting full density matrix simulation.

4. Not Using complex64 When Memory Is Tight

custatevec defaults to complex64 (8 bytes per amplitude), but some workflows request fp64 precision (16 bytes per amplitude). For a 30-qubit circuit:

fp32: 2^30 * 8 = 8 GB (fits on an RTX 4090)
fp64: 2^30 * 16 = 16 GB (fits on an RTX 4090, but leaves little room for custatevec workspace)

Always check whether your application actually requires fp64 precision. For most variational algorithms, fp32 is sufficient. The numerical difference in energy expectation values between fp32 and fp64 is typically on the order of 1e-6, which is smaller than the optimization convergence threshold.

5. Using mz Inside cudaq.observe

This is a conceptual error that produces confusing results. mz performs a projective measurement that collapses the quantum state. cudaq.observe computes <psi|H|psi> by evaluating the expectation value on the intact state vector, without collapse.

If you include mz in a kernel passed to cudaq.observe, the measurement collapses the state before the expectation value is computed, giving incorrect results:

# WRONG: mz inside a kernel used with observe
@cudaq.kernel
def bad_ansatz(theta: float):
    q = cudaq.qvector(2)
    ry(theta, q[0])
    cx(q[0], q[1])
    mz(q)  # This collapses the state!

# CORRECT: No mz when using observe
@cudaq.kernel
def good_ansatz(theta: float):
    q = cudaq.qvector(2)
    ry(theta, q[0])
    cx(q[0], q[1])
    # No mz: observe computes the expectation value directly

hamiltonian = cudaq.spin.z(0) + cudaq.spin.z(1)
result = cudaq.observe(good_ansatz, hamiltonian, 0.5)
print(f"<H> = {result.expectation():.6f}")

The rule is simple: use mz with cudaq.sample, and omit mz with cudaq.observe.

When to Use CUDA Quantum

You need to simulate circuits with 24-36 qubits and cannot wait for CPU runtimes measured in hours.
You want noise-aware simulation at scale without writing custom CUDA code.
Your workflow involves many parameterized circuits (VQE, QAOA) where GPU throughput pays off across thousands of energy evaluations.
You plan to run on NVIDIA Quantum Cloud or a multi-GPU HPC cluster.
You want exact expectation values via cudaq.observe instead of noisy shot-based estimates.
You need tensor network simulation for 50+ qubit circuits with limited entanglement.

Next Steps

Try modifying the VQE example to use a deeper ansatz with more parameters and observe how GPU acceleration keeps the optimization tractable. Experiment with the tensornet backend on circuits that are too large for state vector simulation. Profile your scripts with Nsight Systems to understand where your simulation time is actually spent. If you do not have a local GPU, sign up for NVIDIA Quantum Cloud to run these examples on remote hardware.