Hello World with CUDA Quantum
Your first quantum circuit with NVIDIA's CUDA Quantum framework - create a Bell state, run on CPU and GPU simulators, and compute expectation values.
CUDA Quantum, also imported as cudaq, is NVIDIA’s open-source framework for hybrid quantum-classical computing. NVIDIA built it to solve a real bottleneck in quantum research: simulating circuits large enough to be interesting requires enormous memory and compute, and GPU hardware is well-suited to the linear algebra underneath statevector simulation. CUDA Quantum provides a single API that can target a CPU simulator, a local NVIDIA GPU, a cluster of GPUs, or real quantum hardware from IonQ or Quantinuum, all without changing your circuit code.
Why GPU Simulation Matters
Statevector simulation stores the full quantum state as an array of 2^n complex numbers, where n is the number of qubits. Each complex number takes 16 bytes (two 64-bit floats). The memory requirement grows exponentially:
Memory = 2^n × 16 bytes
| Qubits | Amplitudes | Memory Required |
|---|---|---|
| 20 | ~1 million | 16 MB |
| 25 | ~33 million | 512 MB |
| 28 | ~268 million | 4 GB |
| 30 | ~1 billion | 16 GB |
| 32 | ~4 billion | 64 GB |
| 35 | ~34 billion | 512 GB |
| 40 | ~1 trillion | 16 TB |
A standard workstation with 16 GB of RAM can simulate roughly 28 qubits using the CPU backend. An NVIDIA A100 GPU has 80 GB of high-bandwidth memory (HBM), enough for about 32 qubits without any CPU-GPU data transfers. A DGX system with 8 A100 GPUs can reach 35 qubits by distributing the state vector across GPUs using NCCL (NVIDIA’s Collective Communications Library). CUDA Quantum’s nvidia-mgpu backend handles this distribution automatically.
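These figures follow directly from the 2^n × 16-byte formula. A tiny helper (plain Python, included only as an illustration, not part of the CUDA Quantum API) reproduces the table:

```python
def statevector_bytes(n_qubits: int) -> int:
    """A dense statevector holds 2^n complex128 amplitudes at 16 bytes each."""
    return (2 ** n_qubits) * 16

for n in (20, 28, 30, 32, 35):
    print(f"{n} qubits: {statevector_bytes(n) / 2**30:g} GiB")
```

Running this confirms, for example, that 30 qubits needs 16 GiB and 35 qubits needs 512 GiB.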
The other approach is tensor network simulation, which avoids storing the full state vector. Instead, it decomposes the circuit into a network of tensors and contracts them efficiently. For circuits with low to moderate entanglement, this can handle 50 or more qubits on a single GPU. The tradeoff: highly entangled circuits cause the tensor dimensions to explode, making contraction expensive.
| Target | Backend Type | Max Practical Qubits | GPU Required | Best Use Case |
|---|---|---|---|---|
| `qpp-cpu` | CPU statevector | ~28 (16 GB RAM) | No | Development, testing, small circuits |
| `nvidia` | Single GPU statevector | ~32 (80 GB A100) | Yes | Medium circuits, fast iteration |
| `nvidia-mgpu` | Multi-GPU statevector | ~35 (8× A100) | Yes (multiple) | Large statevector simulations |
| `tensornet` | GPU tensor network | 50+ (low entanglement) | Yes | Wide, shallow circuits |
| `density-matrix-cpu` | CPU density matrix | ~14 | No | Noise simulation (small circuits) |
| `qpp-openmp` | CPU statevector (parallel) | ~28 | No | Multi-core CPU acceleration |
| `ionq` | IonQ hardware | Hardware-dependent | No | Real hardware execution |
| `quantinuum` | Quantinuum hardware | Hardware-dependent | No | Real hardware execution |
Installation
pip install cuda-quantum
The base install gives you the CPU simulator (qpp-cpu). Note that newer releases are published on PyPI under the name `cudaq` (the framework is now branded CUDA-Q), so if `cuda-quantum` is missing or outdated, try `pip install cudaq`. GPU acceleration requires an NVIDIA GPU with CUDA Toolkit 11.8 or 12.x installed. If you want to skip environment setup, the Docker image has everything:
docker pull nvcr.io/nvidia/cuda-quantum:latest
docker run --gpus all -it nvcr.io/nvidia/cuda-quantum:latest
Before attempting GPU simulation of large circuits, verify your GPU memory with:
nvidia-smi
CUDA Quantum Architecture
Understanding the compilation pipeline helps you write correct kernels and debug errors.
When you decorate a Python function with @cudaq.kernel, CUDA Quantum does not execute it as normal Python code. Instead, the decorator triggers a JIT (just-in-time) compilation pipeline:
- Python AST to MLIR: The function body is parsed and converted to MLIR (Multi-Level Intermediate Representation), a compiler framework that is part of the LLVM project.
- MLIR to QIR: The MLIR is lowered to QIR (Quantum Intermediate Representation), a standard IR for quantum programs based on LLVM IR.
- QIR to backend: The QIR is compiled and dispatched to whichever target you selected (CPU simulator, GPU simulator, or hardware).
This compilation model explains several things about kernel behavior:
- Gate names like `h`, `cx`, `ry`, and `mz` are recognized keywords in the CUDA Quantum kernel language. You do not import them; they are resolved by the compiler.
- The kernel body is not arbitrary Python. It is a restricted language that looks like Python but compiles to quantum circuit instructions.
- You cannot use Python lists, dictionaries, numpy arrays, or print statements inside a kernel. The compiler does not know how to lower these to QIR.
What you can use inside a kernel:
- `cudaq.qvector(n)` and `cudaq.qubit()` for qubit allocation
- Gate operations: `h`, `x`, `y`, `z`, `s`, `t`, `rx`, `ry`, `rz`, `cx`, `cy`, `cz`, `swap`, `mz`, and others
- Arithmetic on `float` and `int` arguments
- Control flow: `if`, `for`, `while`
- Calls to other `@cudaq.kernel` functions
import cudaq
@cudaq.kernel
def valid_kernel(theta: float):
q = cudaq.qvector(2)
# Arithmetic on parameters is fine
half_theta = theta / 2.0
ry(half_theta, q[0])
# Control flow is fine
for i in range(2):
h(q[i])
cx(q[0], q[1])
mz(q[0])
mz(q[1])
The following will not compile:
import cudaq
import numpy as np
@cudaq.kernel
def invalid_kernel():
q = cudaq.qvector(2)
angles = [0.1, 0.2] # ERROR: Python lists not allowed
arr = np.array([1, 2]) # ERROR: numpy not available in kernels
print("hello") # ERROR: print not available in kernels
The @cudaq.kernel Decorator
The central concept in CUDA Quantum is the kernel: a Python function decorated with @cudaq.kernel that defines your quantum circuit. The decorator JIT-compiles the function body into an intermediate form that can be lowered to any target backend.
import cudaq
@cudaq.kernel
def my_first_kernel():
q = cudaq.qvector(2) # allocate 2 qubits, both start in |0>
h(q[0]) # Hadamard gate on qubit 0
cx(q[0], q[1]) # CNOT: q[0] controls q[1]
mz(q[0]) # measure qubit 0 in the Z basis
mz(q[1]) # measure qubit 1 in the Z basis
A few things to notice:
- Gate names (`h`, `cx`, `mz`) are bare function calls inside the kernel body. You do not import them separately; the compiler resolves them from the cudaq gate set.
- `cudaq.qvector(n)` allocates a register of `n` qubits, all initialized to |0>.
- A single qubit can be allocated with `cudaq.qubit()`.
Running the Bell State
Use cudaq.sample() to execute the kernel and collect measurement outcomes:
import cudaq
@cudaq.kernel
def bell_state():
q = cudaq.qvector(2)
h(q[0])
cx(q[0], q[1])
mz(q[0])
mz(q[1])
result = cudaq.sample(bell_state, shots_count=1000)
print(result)
Expected output:
{ 00:497 11:503 }
The result is a CountsDictionary. The two entries 00 and 11 each appear roughly half the time, which confirms the qubits are entangled: measuring one determines the other.
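The counts will not be exactly 500/500: with 1000 shots, each count fluctuates around its mean with standard deviation sqrt(N p (1-p)). A quick back-of-the-envelope check (plain Python, independent of cudaq) shows the observed spread is expected:

```python
import math

shots = 1000
p = 0.5  # ideal probability of "00" (and likewise of "11")
sigma = math.sqrt(shots * p * (1 - p))  # std. deviation of the observed count
print(f"expected count: {int(p * shots)} +/- {sigma:.0f}")
```

So a 497/503 split is well within one standard deviation of the ideal.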
Drawing and Inspecting Circuits
Before running a circuit, you can visualize it. cudaq.draw() prints an ASCII circuit diagram for any kernel:
import cudaq
@cudaq.kernel
def bell_state():
q = cudaq.qvector(2)
h(q[0])
cx(q[0], q[1])
mz(q[0])
mz(q[1])
print(cudaq.draw(bell_state))
Output:
╭───╮
q0 : ┤ h ├──●──mz──
╰───╯╭─┴─╮
q1 : ─────┤ x ├─mz─
╰───╯
You can also extract the full statevector without measurement using cudaq.get_state(). This returns the complex amplitudes of the quantum state:
import cudaq
@cudaq.kernel
def bell_no_measure():
q = cudaq.qvector(2)
h(q[0])
cx(q[0], q[1])
state = cudaq.get_state(bell_no_measure)
print(state)
This prints the state vector [0.707+0j, 0+0j, 0+0j, 0.707+0j], confirming the Bell state (|00> + |11>)/sqrt(2). The get_state function is useful for debugging and verifying that your circuit produces the intended quantum state.
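You can reproduce these amplitudes without cudaq at all. As an independent sanity check (numpy only, not part of the CUDA Quantum API), the following sketch applies the same two gates as matrices to |00>:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)   # Hadamard
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])                 # control = qubit 0, target = qubit 1

# Start in |00>, apply H on qubit 0, then CNOT
state = CNOT @ np.kron(H, np.eye(2)) @ np.array([1.0, 0, 0, 0])
print(state)  # [0.7071 0. 0. 0.7071]
```

The result matches what cudaq.get_state reports for the Bell kernel.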
Reading Results
CountsDictionary has several useful accessors:
result = cudaq.sample(bell_state, shots_count=1000)
print(result["00"]) # count for bitstring "00"
print(result.most_probable()) # bitstring with highest count
print(result.probability("11")) # fraction of shots that gave "11"
for bitstring, count in result.items():
print(f"{bitstring}: {count}")
GHZ State: Scaling Beyond Two Qubits
The Bell state is a 2-qubit entangled state. The natural generalization is the GHZ (Greenberger-Horne-Zeilinger) state, which entangles n qubits into the superposition (|00…0> + |11…1>)/sqrt(2). The recipe: apply a Hadamard to the first qubit, then chain CNOT gates from each qubit to the next.
CUDA Quantum kernels accept integer parameters, so you can write a single kernel that creates a GHZ state of any size:
import cudaq
@cudaq.kernel
def ghz_state(n_qubits: int):
q = cudaq.qvector(n_qubits)
h(q[0])
for i in range(1, n_qubits):
cx(q[i - 1], q[i])
mz(q)
# Run for different sizes
for n in [4, 8, 16]:
result = cudaq.sample(ghz_state, n, shots_count=1000)
zeros = "0" * n
ones = "1" * n
print(f"n={n}: P(|{zeros}>) = {result.probability(zeros):.3f}, "
f"P(|{ones}>) = {result.probability(ones):.3f}")
Expected output:
n=4: P(|0000>) = 0.498, P(|1111>) = 0.502
n=8: P(|00000000>) = 0.507, P(|11111111>) = 0.493
n=16: P(|0000000000000000>) = 0.501, P(|1111111111111111>) = 0.499
To see the GPU advantage, try larger qubit counts with timing:
import cudaq
import time
cudaq.set_target("qpp-cpu")
@cudaq.kernel
def ghz_state(n_qubits: int):
q = cudaq.qvector(n_qubits)
h(q[0])
for i in range(1, n_qubits):
cx(q[i - 1], q[i])
mz(q)
for n in [16, 20, 24]:
start = time.time()
result = cudaq.sample(ghz_state, n, shots_count=1000)
elapsed = time.time() - start
print(f"n={n}: {elapsed:.3f}s on CPU")
On a GPU target, the same code runs significantly faster for 20+ qubits because the state vector operations (matrix-vector multiplications) map naturally to GPU parallel execution.
Kernel Composition
As circuits grow in complexity, you want to decompose them into reusable building blocks. CUDA Quantum supports this: one kernel can call another kernel. The inner kernel gets inlined at compile time, so there is no runtime overhead.
import cudaq
@cudaq.kernel
def state_preparation(q: cudaq.qview):
"""Prepare a specific initial state on the given qubits."""
h(q[0])
cx(q[0], q[1])
@cudaq.kernel
def full_circuit(theta: float):
q = cudaq.qvector(2)
# Call the state preparation kernel
state_preparation(q)
# Apply parameterized rotation on top
rz(theta, q[0])
ry(theta, q[1])
mz(q)
result = cudaq.sample(full_circuit, 0.5, shots_count=1000)
print(result)
The key type here is cudaq.qview, which represents a reference to qubits that were allocated elsewhere. When full_circuit passes q to state_preparation, the inner kernel operates on the same qubits. This is the idiomatic way to build modular quantum programs in CUDA Quantum. You can compose as many layers as needed, and the compiler flattens everything into a single circuit.
Switching Targets
The target controls where the circuit runs. You set it once before calling sample or observe:
import cudaq
# CPU simulator (default, works everywhere)
cudaq.set_target("qpp-cpu")
# Single NVIDIA GPU (requires CUDA)
cudaq.set_target("nvidia")
# GPU tensor network (handles 50+ qubit circuits)
cudaq.set_target("tensornet")
The circuit code does not change when you switch targets. This is the main practical benefit of the unified CUDA Quantum API: you can develop and test on CPU, then move to GPU or hardware by changing a single line.
import cudaq
cudaq.set_target("nvidia") # comment this out to fall back to CPU
@cudaq.kernel
def bell_state():
q = cudaq.qvector(2)
h(q[0])
cx(q[0], q[1])
mz(q[0])
mz(q[1])
result = cudaq.sample(bell_state, shots_count=1000)
print(result)
The Spin Module for Observables
For variational algorithms, you need to define Hamiltonians as sums of Pauli operators. CUDA Quantum’s cudaq.spin module provides the building blocks.
The four single-qubit Pauli operators are:
- `spin.x(i)`: Pauli X on qubit i
- `spin.y(i)`: Pauli Y on qubit i
- `spin.z(i)`: Pauli Z on qubit i
- `spin.i(i)`: Identity on qubit i
These compose with standard arithmetic. Multiplication creates tensor products, and addition creates sums:
from cudaq import spin
# Single Pauli term
z0 = spin.z(0)
# Two-qubit ZZ interaction
zz = spin.z(0) * spin.z(1)
# Scaled term
scaled_x = 0.5 * spin.x(0)
# Sum of terms
simple_hamiltonian = spin.z(0) + 0.5 * spin.x(0) * spin.x(1)
print(simple_hamiltonian)
A realistic example: the molecular hydrogen (H2) Hamiltonian in the STO-3G basis, reduced to 2 qubits via the Bravyi-Kitaev transformation at equilibrium bond distance:
from cudaq import spin
H2_hamiltonian = (
-1.0523 * spin.i(0) * spin.i(1)
+ 0.3979 * spin.z(0) * spin.i(1)
- 0.3979 * spin.i(0) * spin.z(1)
- 0.0112 * spin.z(0) * spin.z(1)
+ 0.1809 * spin.x(0) * spin.x(1)
)
print(H2_hamiltonian)
This Hamiltonian has five terms. Its minimum eigenvalue, the electronic ground-state energy, is approximately -1.857 Hartree; adding the nuclear repulsion energy (about 0.72 Hartree at the equilibrium bond distance of 0.735 angstroms) gives the total ground-state energy of approximately -1.137 Hartree. We will use this Hamiltonian in the VQE examples below.
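A quick numpy cross-check (independent of cudaq) builds the 4×4 matrix from Kronecker products and diagonalizes it. Its minimum eigenvalue is the electronic energy; adding the nuclear repulsion (about 0.72 Hartree at this geometry) recovers the -1.137 Hartree total:

```python
import numpy as np

I2, Z = np.eye(2), np.diag([1.0, -1.0])
X = np.array([[0.0, 1.0], [1.0, 0.0]])

# Same five terms as the cudaq.spin expression above
H = (-1.0523 * np.kron(I2, I2)
     + 0.3979 * np.kron(Z, I2)
     - 0.3979 * np.kron(I2, Z)
     - 0.0112 * np.kron(Z, Z)
     + 0.1809 * np.kron(X, X))

electronic = np.linalg.eigvalsh(H).min()
nuclear_repulsion = 0.7199  # approximate, at 0.735 angstrom
print(f"electronic: {electronic:.4f} Ha, total: {electronic + nuclear_repulsion:.4f} Ha")
```

This gives an electronic energy of about -1.857 Hartree and a total of about -1.137 Hartree.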
Computing Expectation Values with cudaq.observe
For variational algorithms you rarely want raw counts. You want the expectation value of some observable (the Hamiltonian). cudaq.observe computes this directly from the statevector, without requiring you to measure and post-process:
import cudaq
from cudaq import spin
# Define a simple two-qubit Hamiltonian: Z0 * Z1
hamiltonian = spin.z(0) * spin.z(1)
@cudaq.kernel
def bell_state_no_measure():
q = cudaq.qvector(2)
h(q[0])
cx(q[0], q[1])
# No mz() calls: observe works on the state, not measurement outcomes
result = cudaq.observe(bell_state_no_measure, hamiltonian)
print(f"<Z0 Z1> = {result.expectation():.4f}")
# Output: <Z0 Z1> = 1.0000
The Bell state |00> + |11> is a +1 eigenstate of Z0*Z1, so the expectation value is 1.0.
For a multi-term Hamiltonian, cudaq.observe computes the total expectation value as the weighted sum of each Pauli term’s contribution:
import cudaq
from cudaq import spin
H2_hamiltonian = (
-1.0523 * spin.i(0) * spin.i(1)
+ 0.3979 * spin.z(0) * spin.i(1)
- 0.3979 * spin.i(0) * spin.z(1)
- 0.0112 * spin.z(0) * spin.z(1)
+ 0.1809 * spin.x(0) * spin.x(1)
)
@cudaq.kernel
def hf_state():
"""Hartree-Fock state for H2: |01>"""
q = cudaq.qvector(2)
x(q[1]) # flip qubit 1 to |1>
result = cudaq.observe(hf_state, H2_hamiltonian)
print(f"Hartree-Fock energy: {result.expectation():.4f} Hartree")
# Output: Hartree-Fock energy: -1.8270 Hartree
Parameterized Kernels
Variational algorithms require running the same circuit with different parameter values. CUDA Quantum kernels accept classical float arguments:
import cudaq
import math
@cudaq.kernel
def ry_ansatz(theta: float):
q = cudaq.qvector(1)
ry(theta, q[0])
mz(q[0])
# Sample at several angles
for angle in [0.0, math.pi / 4, math.pi / 2, math.pi]:
result = cudaq.sample(ry_ansatz, angle, shots_count=500)
count_one = result.get("1", 0)
print(f"theta={angle:.3f} P(|1>) ~ {count_one / 500:.2f}")
Expected output (approximate):
theta=0.000 P(|1>) ~ 0.00
theta=0.785 P(|1>) ~ 0.15
theta=1.571 P(|1>) ~ 0.50
theta=3.142 P(|1>) ~ 1.00
At theta = pi/2, the RY rotation puts the qubit into an equal superposition of |0> and |1>. At theta = pi, it fully flips to |1>.
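These numbers follow the closed form P(|1>) = sin²(θ/2) for RY(θ) applied to |0>, which you can verify directly without running any circuits:

```python
import math

# Exact P(|1>) after ry(theta) on |0>: sin^2(theta / 2)
for theta in (0.0, math.pi / 4, math.pi / 2, math.pi):
    p1 = math.sin(theta / 2) ** 2
    print(f"theta={theta:.3f}: exact P(|1>) = {p1:.3f}")
```

The exact values (0.000, 0.146, 0.500, 1.000) match the sampled estimates above up to shot noise.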
You can also visualize parameterized kernels:
print(cudaq.draw(ry_ansatz, 1.0))
Manual Parameter Scan
import cudaq
from cudaq import spin
import math
cudaq.set_target("qpp-cpu")
# 1. Define a parameterized two-qubit ansatz
@cudaq.kernel
def two_qubit_ansatz(theta: float):
q = cudaq.qvector(2)
ry(theta, q[0])
cx(q[0], q[1])
# 2. Define a Hamiltonian
hamiltonian = 0.5 * spin.z(0) + 0.5 * spin.z(1) + spin.x(0) * spin.x(1)
# 3. Scan theta and find the minimum energy
best_energy = float("inf")
best_theta = 0.0
for i in range(20):
theta = i * math.pi / 10
obs_result = cudaq.observe(two_qubit_ansatz, hamiltonian, theta)
energy = obs_result.expectation()
if energy < best_energy:
best_energy = energy
best_theta = theta
print(f"Minimum energy: {best_energy:.4f} at theta={best_theta:.3f} rad")
This is a simplified version of the variational approach used by VQE. The manual scan becomes impractical for circuits with many parameters. The next section shows CUDA Quantum’s built-in VQE optimizer.
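For this particular ansatz the energy has a closed form, which makes the scan easy to sanity-check by hand. The prepared state is cos(θ/2)|00> + sin(θ/2)|11>, so <Z0> = <Z1> = cos θ and <X0 X1> = sin θ, and the expectation reduces to cos θ + sin θ (a hand-derived check, independent of cudaq):

```python
import math

# E(theta) = 0.5<Z0> + 0.5<Z1> + <X0 X1> = cos(theta) + sin(theta)
def energy(theta: float) -> float:
    return math.cos(theta) + math.sin(theta)

# The true minimum is -sqrt(2), at theta = 5*pi/4; the pi/10 grid
# in the scan above lands close to, but not exactly on, this point.
print(f"{energy(5 * math.pi / 4):.4f}")  # -1.4142
```

This explains why the grid scan reports an energy slightly above -1.414: the step size misses the exact minimizer.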
Built-in VQE with cudaq.vqe()
CUDA Quantum provides a cudaq.vqe() function that handles the optimization loop for you. It takes a kernel, a Hamiltonian, an optimizer, and the number of parameters, then returns the minimum energy and optimal parameters.
import cudaq
from cudaq import spin
# H2 Hamiltonian (2-qubit reduced form)
H2_hamiltonian = (
-1.0523 * spin.i(0) * spin.i(1)
+ 0.3979 * spin.z(0) * spin.i(1)
- 0.3979 * spin.i(0) * spin.z(1)
- 0.0112 * spin.z(0) * spin.z(1)
+ 0.1809 * spin.x(0) * spin.x(1)
)
# Minimal ansatz: Hartree-Fock reference plus one Ry excitation.
# (Starting from x(q[0]) is needed so the ansatz spans the |01>/|10>
# sector that contains the ground state of this Hamiltonian.)
@cudaq.kernel
def h2_ansatz(theta: list[float]):
    q = cudaq.qvector(2)
    x(q[0])
    ry(theta[0], q[1])
    cx(q[1], q[0])
# Choose an optimizer
optimizer = cudaq.optimizers.COBYLA()
optimizer.max_iterations = 50
# Run VQE
energy, optimal_params = cudaq.vqe(
h2_ansatz,
H2_hamiltonian,
optimizer,
parameter_count=1
)
print(f"VQE ground state energy: {energy:.4f} Hartree")
print(f"Optimal parameter: {optimal_params[0]:.4f} rad")
Available optimizers include:
- `cudaq.optimizers.COBYLA()`: Constrained Optimization BY Linear Approximations. Gradient-free, good for noisy cost functions.
- `cudaq.optimizers.NelderMead()`: Gradient-free simplex method. Robust for low-dimensional parameter spaces.
- `cudaq.optimizers.LBFGS()`: Gradient-based quasi-Newton method. Requires gradient information but converges faster on smooth landscapes.
Compare this to the manual scan: cudaq.vqe() finds the minimum automatically and typically converges in far fewer function evaluations than a brute-force grid search.
Gradient Computation with Parameter-Shift
For gradient-based optimizers, CUDA Quantum supports automatic gradient computation. The parameter-shift rule gives exact gradients for quantum circuits: for a parameter theta controlling a rotation gate, the gradient is:
dE/dtheta = [E(theta + pi/2) - E(theta - pi/2)] / 2
This requires two circuit evaluations per parameter. CUDA Quantum implements this (and central difference approximation) in the cudaq.gradients module:
import cudaq
from cudaq import spin
H2_hamiltonian = (
-1.0523 * spin.i(0) * spin.i(1)
+ 0.3979 * spin.z(0) * spin.i(1)
- 0.3979 * spin.i(0) * spin.z(1)
- 0.0112 * spin.z(0) * spin.z(1)
+ 0.1809 * spin.x(0) * spin.x(1)
)
@cudaq.kernel
def h2_ansatz(theta: list[float]):
    q = cudaq.qvector(2)
    x(q[0])  # Hartree-Fock reference, as in the VQE example above
    ry(theta[0], q[1])
    cx(q[1], q[0])
# Use parameter-shift gradient with L-BFGS optimizer
gradient = cudaq.gradients.ParameterShift()
optimizer = cudaq.optimizers.LBFGS()
energy, optimal_params = cudaq.vqe(
h2_ansatz,
H2_hamiltonian,
optimizer,
gradient=gradient,
parameter_count=1
)
print(f"VQE energy (gradient-based): {energy:.4f} Hartree")
print(f"Optimal parameter: {optimal_params[0]:.4f} rad")
You can also use cudaq.gradients.CentralDifference(), which approximates the gradient numerically with finite differences. Both methods cost two circuit evaluations per parameter; the difference is that parameter-shift is exact (up to shot noise), while central difference depends on a step size and its error grows when the cost function is noisy. That makes parameter-shift the preferred choice on quantum hardware.
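Unlike a finite-difference estimate, the shift rule is exact for rotation gates. A quick numeric check using the analytic expectation <Z> = cos θ after RY(θ)|0> (plain Python, no cudaq needed):

```python
import math

def energy(theta: float) -> float:
    # <Z> after ry(theta) on |0> is cos(theta)
    return math.cos(theta)

def shift_gradient(theta: float) -> float:
    # Parameter-shift rule: [E(theta + pi/2) - E(theta - pi/2)] / 2
    return (energy(theta + math.pi / 2) - energy(theta - math.pi / 2)) / 2

theta = 0.7
print(shift_gradient(theta), -math.sin(theta))  # equal to machine precision
```

The shift-rule value coincides with the analytic derivative d(cos θ)/dθ = -sin θ exactly, not approximately.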
Noise Models
Real quantum hardware has noise: gates are imperfect, qubits decohere, and measurements have errors. CUDA Quantum lets you simulate these effects so you can test how your algorithm performs under realistic conditions.
The basic building block is a noise channel. A depolarizing channel applies a random Pauli error (X, Y, or Z) with some probability after a gate:
import cudaq
# Create a noise model
noise_model = cudaq.NoiseModel()
# Add 1% depolarizing noise after every X gate on any qubit
depolarizing = cudaq.DepolarizationChannel(0.01)
noise_model.add_all_qubit_channel("x", depolarizing)
# Add 1% depolarizing noise after every Hadamard gate
noise_model.add_all_qubit_channel("h", depolarizing)
# Add 1% depolarizing noise after every CNOT gate
two_qubit_depolarizing = cudaq.DepolarizationChannel(0.01)
noise_model.add_all_qubit_channel("cx", two_qubit_depolarizing)
Now run the Bell state with noise:
import cudaq
noise_model = cudaq.NoiseModel()
depolarizing = cudaq.DepolarizationChannel(0.01)
noise_model.add_all_qubit_channel("h", depolarizing)
noise_model.add_all_qubit_channel("cx", depolarizing)
@cudaq.kernel
def bell_state():
q = cudaq.qvector(2)
h(q[0])
cx(q[0], q[1])
mz(q[0])
mz(q[1])
# Noiseless run
clean_result = cudaq.sample(bell_state, shots_count=10000)
print("Noiseless:", clean_result)
# Noisy run
noisy_result = cudaq.sample(bell_state, noise_model=noise_model, shots_count=10000)
print("Noisy: ", noisy_result)
Expected output (approximate):
Noiseless: { 00:4987 11:5013 }
Noisy: { 00:4900 01:52 10:55 11:4993 }
In the noisy simulation, the 01 and 10 outcomes appear at small but nonzero probability. These are errors introduced by the depolarizing channel: the noise occasionally flips one qubit relative to the other, breaking the perfect entanglement correlation.
CUDA Quantum also supports other noise channels:
- `cudaq.AmplitudeDampingChannel(probability)`: Models energy relaxation (T1 decay)
- `cudaq.PhaseFlipChannel(probability)`: Models dephasing (T2 decay)
- `cudaq.BitFlipChannel(probability)`: Models classical bit-flip errors
Async Execution
For production workflows where you need to run many circuits, cudaq.sample_async() returns a future that you can await later. This lets you submit multiple circuits in parallel:
import cudaq
@cudaq.kernel
def parameterized_circuit(theta: float):
q = cudaq.qvector(2)
ry(theta, q[0])
cx(q[0], q[1])
mz(q)
# Submit multiple circuits asynchronously
import math
futures = []
angles = [i * math.pi / 10 for i in range(20)]
for angle in angles:
future = cudaq.sample_async(parameterized_circuit, angle, shots_count=1000)
futures.append((angle, future))
# Collect results later
for angle, future in futures:
result = future.get()
p11 = result.probability("11")
print(f"theta={angle:.3f}: P(|11>) = {p11:.3f}")
This is especially useful when targeting remote backends (IonQ, Quantinuum) where circuit submission has network latency. The async API lets you queue many jobs without waiting for each one to complete.
MPI for Multi-Node GPU Clusters
For the largest simulations, CUDA Quantum supports MPI (Message Passing Interface) to distribute computation across multiple nodes, each with multiple GPUs. Set the target to nvidia-mgpu and launch with mpirun:
mpirun -n 4 python my_circuit.py
Inside the script, set the target before defining kernels:
import cudaq
cudaq.set_target("nvidia-mgpu")
cudaq.mpi.initialize()
@cudaq.kernel
def large_ghz(n_qubits: int):
q = cudaq.qvector(n_qubits)
h(q[0])
for i in range(1, n_qubits):
cx(q[i - 1], q[i])
mz(q)
result = cudaq.sample(large_ghz, 34, shots_count=1000)
if cudaq.mpi.rank() == 0:
print(result)
cudaq.mpi.finalize()
This is the path to 35+ qubit statevector simulation, limited only by the total GPU memory across your cluster.
Common Mistakes
1. Using Python data structures inside kernels
# WRONG: Python lists do not work in kernels
@cudaq.kernel
def bad_kernel():
angles = [0.1, 0.2, 0.3] # compile error
Use kernel parameters instead. Pass data into kernels as float, int, or list[float] arguments.
2. Calling cudaq.sample on a kernel without measurements
@cudaq.kernel
def no_measurements():
q = cudaq.qvector(2)
h(q[0])
cx(q[0], q[1])
# forgot mz()!
result = cudaq.sample(no_measurements, shots_count=1000)
# Depending on the CUDA Quantum version, this returns empty results
# or implicitly measures every qubit at the end of the circuit
Be explicit: add mz() calls so that cudaq.sample knows which qubits you intend to measure. If you want the expectation value of an observable without measurement, use cudaq.observe instead.
3. Setting the target too late
import cudaq
@cudaq.kernel
def my_kernel():
q = cudaq.qvector(2)
h(q[0])
mz(q[0])
# This kernel may already be compiled for the default target
cudaq.set_target("nvidia") # too late for kernels defined above
Call cudaq.set_target() before defining or executing any kernels. The target must be set before kernel compilation occurs.
4. Tensor network limitations
The tensornet target handles large qubit counts efficiently for circuits with low entanglement. However, it does not support all gate types, and highly entangled circuits (deep random circuits, for example) cause the tensor bond dimensions to grow exponentially, negating the advantage. If your circuit creates volume-law entanglement, use the statevector backend instead.
5. Insufficient GPU memory for large circuits
GPU simulation of 30+ qubits requires substantial GPU memory. A 30-qubit simulation needs 16 GB; 32 qubits needs 64 GB. Always check available memory with nvidia-smi before attempting large simulations. If you run out of memory, the kernel will crash without a helpful error message.
Where to Go Next
- Full API reference: /reference/cuda-quantum
- Official docs and C++ examples: nvidia.github.io/cuda-quantum
- For variational algorithms at scale, explore the `cudaq.vqe()` function with gradient-based optimizers
- For large circuits, try the `tensornet` target and compare runtimes against `qpp-cpu`
- For noise-aware algorithm development, build `NoiseModel` objects that match the hardware you plan to deploy on