
Hello World with CUDA Quantum

Your first quantum circuit with NVIDIA's CUDA Quantum framework: create a Bell state, run on CPU and GPU simulators, and compute expectation values.

What you'll learn

  • CUDA Quantum
  • NVIDIA
  • GPU simulation
  • quantum circuits

Prerequisites

  • Basic Python (variables, functions, loops)
  • No quantum physics background needed

CUDA Quantum (imported in Python as cudaq) is NVIDIA's open-source framework for hybrid quantum-classical computing. NVIDIA built it to address a real bottleneck in quantum research: simulating circuits large enough to be interesting requires enormous memory and compute, and GPU hardware is well suited to the linear algebra underlying statevector simulation. CUDA Quantum provides a single API that can target a CPU simulator, a local NVIDIA GPU, a cluster of GPUs, or real quantum hardware from IonQ or Quantinuum, all without changing your circuit code.

Why GPU Simulation Matters

Statevector simulation stores the full quantum state as an array of 2^n complex numbers, where n is the number of qubits. Each complex number takes 16 bytes (two 64-bit floats). The memory requirement grows exponentially:

Memory = 2^n × 16 bytes
Qubits | Amplitudes   | Memory Required
20     | ~1 million   | 16 MB
25     | ~33 million  | 512 MB
28     | ~268 million | 4 GB
30     | ~1 billion   | 16 GB
32     | ~4 billion   | 64 GB
35     | ~34 billion  | 512 GB
40     | ~1 trillion  | 16 TB
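
The table values follow directly from the formula. A few lines of plain Python (no cudaq required, and the helper name is my own) reproduce them:

```python
def statevector_memory_bytes(n_qubits: int) -> int:
    """Memory for a full statevector: 2^n complex amplitudes x 16 bytes each."""
    return (2 ** n_qubits) * 16

for n in [28, 30, 32, 40]:
    print(f"{n} qubits -> {statevector_memory_bytes(n) / 2**30:,.0f} GiB")
```

Running this prints 4, 16, 64, and 16,384 GiB, matching the table rows above.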

A standard workstation with 16 GB of RAM can simulate roughly 28 qubits using the CPU backend. An NVIDIA A100 GPU has 80 GB of high-bandwidth memory (HBM), enough for about 32 qubits without any CPU-GPU data transfers. A DGX system with 8 A100 GPUs can reach 35 qubits by distributing the state vector across GPUs using NCCL (NVIDIA’s Collective Communications Library). CUDA Quantum’s nvidia-mgpu backend handles this distribution automatically.

The other approach is tensor network simulation, which avoids storing the full state vector. Instead, it decomposes the circuit into a network of tensors and contracts them efficiently. For circuits with low to moderate entanglement, this can handle 50 or more qubits on a single GPU. The tradeoff: highly entangled circuits cause the tensor dimensions to explode, making contraction expensive.

Target             | Backend Type                | Max Practical Qubits   | GPU Required   | Best Use Case
qpp-cpu            | CPU statevector             | ~28 (16 GB RAM)        | No             | Development, testing, small circuits
nvidia             | Single-GPU statevector      | ~32 (80 GB A100)       | Yes            | Medium circuits, fast iteration
nvidia-mgpu        | Multi-GPU statevector       | ~35 (8× A100)          | Yes (multiple) | Large statevector simulations
tensornet          | GPU tensor network          | 50+ (low entanglement) | Yes            | Wide, shallow circuits
density-matrix-cpu | CPU density matrix          | ~14                    | No             | Noise simulation (small circuits)
qpp-openmp         | CPU statevector (parallel)  | ~28                    | No             | Multi-core CPU acceleration
ionq               | IonQ hardware               | Hardware-dependent     | No             | Real hardware execution
quantinuum         | Quantinuum hardware         | Hardware-dependent     | No             | Real hardware execution

Installation

pip install cuda-quantum

The base install gives you the CPU simulator (qpp-cpu). GPU acceleration requires an NVIDIA GPU with CUDA Toolkit 11.8 or 12.x installed. If you want to skip environment setup, the Docker image has everything:

docker pull nvcr.io/nvidia/cuda-quantum:latest
docker run --gpus all -it nvcr.io/nvidia/cuda-quantum:latest

Before attempting GPU simulation of large circuits, verify your GPU memory with:

nvidia-smi

CUDA Quantum Architecture

Understanding the compilation pipeline helps you write correct kernels and debug errors.

When you decorate a Python function with @cudaq.kernel, CUDA Quantum does not execute it as normal Python code. Instead, the decorator triggers a JIT (just-in-time) compilation pipeline:

  1. Python AST to MLIR: The function body is parsed and converted to MLIR (Multi-Level Intermediate Representation), a compiler framework from the LLVM project.
  2. MLIR to QIR: The MLIR is lowered to QIR (Quantum Intermediate Representation), a standard IR for quantum programs based on LLVM IR.
  3. QIR to backend: The QIR is compiled and dispatched to whichever target you selected (CPU simulator, GPU simulator, or hardware).

This compilation model explains several things about kernel behavior:

  • Gate names like h, cx, ry, and mz are recognized keywords in the CUDA Quantum kernel language. You do not import them; they are resolved by the compiler.
  • The kernel body is not arbitrary Python. It is a restricted language that looks like Python but compiles to quantum circuit instructions.
  • You cannot use Python lists, dictionaries, numpy arrays, or print statements inside a kernel. The compiler does not know how to lower these to QIR.

What you can use inside a kernel:

  • cudaq.qvector(n) and cudaq.qubit() for qubit allocation
  • Gate operations: h, x, y, z, s, t, rx, ry, rz, cx, cy, cz, swap, mz, and others
  • Arithmetic on float and int arguments
  • Control flow: if, for, while
  • Calls to other @cudaq.kernel functions

A minimal kernel that uses only these features:

import cudaq

@cudaq.kernel
def valid_kernel(theta: float):
    q = cudaq.qvector(2)
    # Arithmetic on parameters is fine
    half_theta = theta / 2.0
    ry(half_theta, q[0])
    # Control flow is fine
    for i in range(2):
        h(q[i])
    cx(q[0], q[1])
    mz(q[0])
    mz(q[1])

The following will not compile:

import cudaq
import numpy as np

@cudaq.kernel
def invalid_kernel():
    q = cudaq.qvector(2)
    angles = [0.1, 0.2]     # ERROR: Python lists not allowed
    arr = np.array([1, 2])  # ERROR: numpy not available in kernels
    print("hello")           # ERROR: print not available in kernels

The @cudaq.kernel Decorator

The central concept in CUDA Quantum is the kernel: a Python function decorated with @cudaq.kernel that defines your quantum circuit. The decorator JIT-compiles the function body into an intermediate form that can be lowered to any target backend.

import cudaq

@cudaq.kernel
def my_first_kernel():
    q = cudaq.qvector(2)   # allocate 2 qubits, both start in |0>
    h(q[0])                # Hadamard gate on qubit 0
    cx(q[0], q[1])         # CNOT: q[0] controls q[1]
    mz(q[0])               # measure qubit 0 in the Z basis
    mz(q[1])               # measure qubit 1 in the Z basis

A few things to notice:

  • Gate names (h, cx, mz) are bare function calls inside the kernel body. You do not import them separately; the compiler resolves them from the cudaq gate set.
  • cudaq.qvector(n) allocates a register of n qubits, all initialised to |0>.
  • A single qubit can be allocated with cudaq.qubit().

Running the Bell State

Use cudaq.sample() to execute the kernel and collect measurement outcomes:

import cudaq

@cudaq.kernel
def bell_state():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q[0])
    mz(q[1])

result = cudaq.sample(bell_state, shots_count=1000)
print(result)

Expected output:

{ 00:497 11:503 }

The result is a CountsDictionary. The two entries 00 and 11 each appear roughly half the time, which confirms the qubits are entangled: measuring one determines the other.
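
The counts will not split exactly 500/500. With 1000 shots of an ideal 50/50 outcome, the spread follows binomial statistics, which you can check with plain Python (no cudaq required):

```python
import math

shots = 1000
p = 0.5  # ideal probability of "00" for a Bell state
std_dev = math.sqrt(shots * p * (1 - p))  # binomial standard deviation
print(f"expected {shots * p:.0f} +/- {std_dev:.1f} counts per outcome")
```

The standard deviation is about 16 counts, so a result like 497/503 is well within one standard deviation of the ideal split.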

Drawing and Inspecting Circuits

Before running a circuit, you can visualize it. cudaq.draw() prints an ASCII circuit diagram for any kernel:

import cudaq

@cudaq.kernel
def bell_state():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q[0])
    mz(q[1])

print(cudaq.draw(bell_state))

Output:

     ╭───╮          
q0 : ┤ h ├──●──mz──
     ╰───╯╭─┴─╮    
q1 : ─────┤ x ├─mz─
           ╰───╯    

You can also extract the full statevector without measurement using cudaq.get_state(). This returns the complex amplitudes of the quantum state:

import cudaq

@cudaq.kernel
def bell_no_measure():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])

state = cudaq.get_state(bell_no_measure)
print(state)

This prints the state vector [0.707+0j, 0+0j, 0+0j, 0.707+0j], confirming the Bell state (|00> + |11>)/sqrt(2). The get_state function is useful for debugging and verifying that your circuit produces the intended quantum state.
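
You can verify these amplitudes independently of cudaq by multiplying the gate matrices with numpy. This sketch places qubit 0 in the left tensor factor, an arbitrary convention:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I2 = np.eye(2)
# CNOT with qubit 0 (left factor) as control, qubit 1 as target
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

state = np.zeros(4)
state[0] = 1.0                   # start in |00>
state = np.kron(H, I2) @ state   # Hadamard on qubit 0
state = CNOT @ state             # entangle
print(state)                     # amplitudes of (|00> + |11>)/sqrt(2)
```

The first and last amplitudes come out to 1/sqrt(2) ≈ 0.707 and the middle two are zero, matching the cudaq.get_state output.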

Reading Results

CountsDictionary has several useful accessors:

result = cudaq.sample(bell_state, shots_count=1000)

print(result["00"])             # count for bitstring "00"
print(result.most_probable())   # bitstring with highest count
print(result.probability("11")) # fraction of shots that gave "11"

for bitstring, count in result.items():
    print(f"{bitstring}: {count}")

GHZ State: Scaling Beyond Two Qubits

The Bell state is a 2-qubit entangled state. The natural generalization is the GHZ (Greenberger-Horne-Zeilinger) state, which entangles n qubits into the superposition (|00…0> + |11…1>)/sqrt(2). The recipe: apply a Hadamard to the first qubit, then chain CNOT gates from each qubit to the next.

CUDA Quantum kernels accept integer parameters, so you can write a single kernel that creates a GHZ state of any size:

import cudaq

@cudaq.kernel
def ghz_state(n_qubits: int):
    q = cudaq.qvector(n_qubits)
    h(q[0])
    for i in range(1, n_qubits):
        cx(q[i - 1], q[i])
    mz(q)

# Run for different sizes
for n in [4, 8, 16]:
    result = cudaq.sample(ghz_state, n, shots_count=1000)
    zeros = "0" * n
    ones = "1" * n
    print(f"n={n}: P(|{zeros}>) = {result.probability(zeros):.3f}, "
          f"P(|{ones}>) = {result.probability(ones):.3f}")

Expected output:

n=4: P(|0000>) = 0.498, P(|1111>) = 0.502
n=8: P(|00000000>) = 0.507, P(|11111111>) = 0.493
n=16: P(|0000000000000000>) = 0.501, P(|1111111111111111>) = 0.499

To see the GPU advantage, try larger qubit counts with timing:

import cudaq
import time

cudaq.set_target("qpp-cpu")

@cudaq.kernel
def ghz_state(n_qubits: int):
    q = cudaq.qvector(n_qubits)
    h(q[0])
    for i in range(1, n_qubits):
        cx(q[i - 1], q[i])
    mz(q)

for n in [16, 20, 24]:
    start = time.time()
    result = cudaq.sample(ghz_state, n, shots_count=1000)
    elapsed = time.time() - start
    print(f"n={n}: {elapsed:.3f}s on CPU")

On a GPU target, the same code runs significantly faster for 20+ qubits because the state vector operations (matrix-vector multiplications) map naturally to GPU parallel execution.

Kernel Composition

As circuits grow in complexity, you want to decompose them into reusable building blocks. CUDA Quantum supports this: one kernel can call another kernel. The inner kernel gets inlined at compile time, so there is no runtime overhead.

import cudaq

@cudaq.kernel
def state_preparation(q: cudaq.qview):
    """Prepare a specific initial state on the given qubits."""
    h(q[0])
    cx(q[0], q[1])

@cudaq.kernel
def full_circuit(theta: float):
    q = cudaq.qvector(2)
    # Call the state preparation kernel
    state_preparation(q)
    # Apply parameterized rotation on top
    rz(theta, q[0])
    ry(theta, q[1])
    mz(q)

result = cudaq.sample(full_circuit, 0.5, shots_count=1000)
print(result)

The key type here is cudaq.qview, which represents a reference to qubits that were allocated elsewhere. When full_circuit passes q to state_preparation, the inner kernel operates on the same qubits. This is the idiomatic way to build modular quantum programs in CUDA Quantum. You can compose as many layers as needed, and the compiler flattens everything into a single circuit.

Switching Targets

The target controls where the circuit runs. You set it once before calling sample or observe:

import cudaq

# CPU simulator (default, works everywhere)
cudaq.set_target("qpp-cpu")

# Single NVIDIA GPU (requires CUDA)
cudaq.set_target("nvidia")

# GPU tensor network (handles 50+ qubit circuits)
cudaq.set_target("tensornet")

The circuit code does not change when you switch targets. This is the main practical benefit of the unified CUDA Quantum API: you can develop and test on CPU, then move to GPU or hardware by changing a single line.

import cudaq

cudaq.set_target("nvidia")   # comment this out to fall back to CPU

@cudaq.kernel
def bell_state():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q[0])
    mz(q[1])

result = cudaq.sample(bell_state, shots_count=1000)
print(result)

The Spin Module for Observables

For variational algorithms, you need to define Hamiltonians as sums of Pauli operators. CUDA Quantum’s cudaq.spin module provides the building blocks.

The four single-qubit Pauli operators are:

  • spin.x(i): Pauli X on qubit i
  • spin.y(i): Pauli Y on qubit i
  • spin.z(i): Pauli Z on qubit i
  • spin.i(i): Identity on qubit i

These compose with standard arithmetic. Multiplication creates tensor products, and addition creates sums:

from cudaq import spin

# Single Pauli term
z0 = spin.z(0)

# Two-qubit ZZ interaction
zz = spin.z(0) * spin.z(1)

# Scaled term
scaled_x = 0.5 * spin.x(0)

# Sum of terms
simple_hamiltonian = spin.z(0) + 0.5 * spin.x(0) * spin.x(1)

print(simple_hamiltonian)

A realistic example: the molecular hydrogen (H2) Hamiltonian in the STO-3G basis, reduced to 2 qubits via the Bravyi-Kitaev transformation at equilibrium bond distance:

from cudaq import spin

H2_hamiltonian = (
    -1.0523 * spin.i(0) * spin.i(1)
    + 0.3979 * spin.z(0) * spin.i(1)
    - 0.3979 * spin.i(0) * spin.z(1)
    - 0.0112 * spin.z(0) * spin.z(1)
    + 0.1809 * spin.x(0) * spin.x(1)
)

print(H2_hamiltonian)

This Hamiltonian has five terms. Its lowest eigenvalue, about -1.857 Hartree, is the electronic ground state energy; adding the nuclear repulsion energy (roughly 0.72 Hartree at the equilibrium bond distance of 0.735 angstroms) gives the familiar total ground state energy of approximately -1.137 Hartree. We will use this Hamiltonian in the VQE examples below.
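
These numbers can be checked without any quantum software by diagonalizing the Hamiltonian's 4x4 matrix with numpy. The sketch below puts qubit 0 in the left tensor factor (a convention that does not affect eigenvalues); the minimum eigenvalue is the electronic energy, and adding the ~0.72 Hartree nuclear repulsion recovers roughly -1.137 Hartree:

```python
import numpy as np

I2 = np.eye(2)
X = np.array([[0, 1], [1, 0]])
Z = np.array([[1, 0], [0, -1]])

# Build the 2-qubit Hamiltonian matrix term by term
H = (-1.0523 * np.kron(I2, I2)
     + 0.3979 * np.kron(Z, I2)
     - 0.3979 * np.kron(I2, Z)
     - 0.0112 * np.kron(Z, Z)
     + 0.1809 * np.kron(X, X))

electronic = np.linalg.eigvalsh(H).min()   # exact electronic ground energy
nuclear_repulsion = 0.7199                 # Hartree at 0.735 angstroms
print(f"electronic: {electronic:.4f}  total: {electronic + nuclear_repulsion:.4f}")
```

This is also the target that the VQE examples below should approach.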

Computing Expectation Values with cudaq.observe

For variational algorithms you rarely want raw counts. You want the expectation value of some observable (the Hamiltonian). cudaq.observe computes this directly from the statevector, without requiring you to measure and post-process:

import cudaq
from cudaq import spin

# Define a simple two-qubit Hamiltonian: Z0 * Z1
hamiltonian = spin.z(0) * spin.z(1)

@cudaq.kernel
def bell_state_no_measure():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    # No mz() calls: observe works on the state, not measurement outcomes

result = cudaq.observe(bell_state_no_measure, hamiltonian)
print(f"<Z0 Z1> = {result.expectation():.4f}")
# Output: <Z0 Z1> = 1.0000

The Bell state |00> + |11> is a +1 eigenstate of Z0*Z1, so the expectation value is 1.0.
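
A numpy cross-check of the same expectation value (no cudaq required):

```python
import numpy as np

bell = np.array([1, 0, 0, 1]) / np.sqrt(2)   # (|00> + |11>)/sqrt(2)
Z = np.array([[1, 0], [0, -1]])
ZZ = np.kron(Z, Z)                           # the Z0 * Z1 observable
expectation = bell @ ZZ @ bell               # <psi| Z0 Z1 |psi>
print(expectation)  # -> 1.0
```

Both |00> and |11> are +1 eigenvectors of Z0*Z1, so any superposition of them gives expectation 1.0.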

For a multi-term Hamiltonian, cudaq.observe computes the total expectation value as the weighted sum of each Pauli term’s contribution:

import cudaq
from cudaq import spin

H2_hamiltonian = (
    -1.0523 * spin.i(0) * spin.i(1)
    + 0.3979 * spin.z(0) * spin.i(1)
    - 0.3979 * spin.i(0) * spin.z(1)
    - 0.0112 * spin.z(0) * spin.z(1)
    + 0.1809 * spin.x(0) * spin.x(1)
)

@cudaq.kernel
def hf_state():
    """Hartree-Fock state for H2 in this encoding: |10>"""
    q = cudaq.qvector(2)
    x(q[0])  # flip qubit 0 to |1>

result = cudaq.observe(hf_state, H2_hamiltonian)
print(f"Hartree-Fock energy: {result.expectation():.4f} Hartree")
# Output: Hartree-Fock energy: -1.8369 Hartree

Parameterized Kernels

Variational algorithms require running the same circuit with different parameter values. CUDA Quantum kernels accept classical float arguments:

import cudaq
import math

@cudaq.kernel
def ry_ansatz(theta: float):
    q = cudaq.qvector(1)
    ry(theta, q[0])
    mz(q[0])

# Sample at several angles
for angle in [0.0, math.pi / 4, math.pi / 2, math.pi]:
    result = cudaq.sample(ry_ansatz, angle, shots_count=500)
    count_one = result.get("1", 0)
    print(f"theta={angle:.3f}  P(|1>) ~ {count_one / 500:.2f}")

Expected output (approximate):

theta=0.000  P(|1>) ~ 0.00
theta=0.785  P(|1>) ~ 0.15
theta=1.571  P(|1>) ~ 0.50
theta=3.142  P(|1>) ~ 1.00

At theta = pi/2, the RY rotation puts the qubit into an equal superposition of |0> and |1>. At theta = pi, it fully flips to |1>.
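
These probabilities follow the closed form P(|1>) = sin^2(theta/2), which you can confirm with plain Python:

```python
import math

for theta in [0.0, math.pi / 4, math.pi / 2, math.pi]:
    p_one = math.sin(theta / 2) ** 2   # probability that RY(theta)|0> measures |1>
    print(f"theta={theta:.3f}  P(|1>) = {p_one:.4f}")
```

The exact values 0.0000, 0.1464, 0.5000, and 1.0000 match the sampled estimates above within shot noise.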

You can also visualize parameterized kernels:

print(cudaq.draw(ry_ansatz, 1.0))

Manual Parameter Scan

import cudaq
from cudaq import spin
import math

cudaq.set_target("qpp-cpu")

# 1. Define a parameterized two-qubit ansatz
@cudaq.kernel
def two_qubit_ansatz(theta: float):
    q = cudaq.qvector(2)
    ry(theta, q[0])
    cx(q[0], q[1])

# 2. Define a Hamiltonian
hamiltonian = 0.5 * spin.z(0) + 0.5 * spin.z(1) + spin.x(0) * spin.x(1)

# 3. Scan theta and find the minimum energy
best_energy = float("inf")
best_theta = 0.0

for i in range(20):
    theta = i * math.pi / 10
    obs_result = cudaq.observe(two_qubit_ansatz, hamiltonian, theta)
    energy = obs_result.expectation()
    if energy < best_energy:
        best_energy = energy
        best_theta = theta

print(f"Minimum energy: {best_energy:.4f} at theta={best_theta:.3f} rad")

This is a simplified version of the variational approach used by VQE. The manual scan becomes impractical for circuits with many parameters. The next section shows CUDA Quantum’s built-in VQE optimizer.
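
For this particular ansatz the scan can be checked analytically. RY(theta) on q[0] followed by CNOT prepares cos(theta/2)|00> + sin(theta/2)|11>, for which <Z0> = <Z1> = cos(theta) and <X0 X1> = sin(theta), so the energy is E(theta) = cos(theta) + sin(theta) with true minimum -sqrt(2) at theta = 5*pi/4. A plain-Python sketch of the same scan (my own derivation, no cudaq):

```python
import math

def energy(theta: float) -> float:
    # <Z0> = <Z1> = cos(theta), <X0 X1> = sin(theta) for this ansatz,
    # so E = 0.5*cos + 0.5*cos + sin
    return math.cos(theta) + math.sin(theta)

best_theta = min((i * math.pi / 10 for i in range(20)), key=energy)
print(f"grid minimum: {energy(best_theta):.4f} at theta={best_theta:.3f}")
print(f"true minimum: {-math.sqrt(2):.4f} at theta={5 * math.pi / 4:.3f}")
```

The grid minimum (about -1.397) slightly overshoots the true minimum of about -1.414 because theta = 5*pi/4 falls between two grid points, which illustrates why a fixed grid is a blunt instrument compared to an optimizer.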

Built-in VQE with cudaq.vqe()

CUDA Quantum provides a cudaq.vqe() function that handles the optimization loop for you. It takes a kernel, a Hamiltonian, an optimizer, and the number of parameters, then returns the minimum energy and optimal parameters.

import cudaq
from cudaq import spin

# H2 Hamiltonian (2-qubit reduced form)
H2_hamiltonian = (
    -1.0523 * spin.i(0) * spin.i(1)
    + 0.3979 * spin.z(0) * spin.i(1)
    - 0.3979 * spin.i(0) * spin.z(1)
    - 0.0112 * spin.z(0) * spin.z(1)
    + 0.1809 * spin.x(0) * spin.x(1)
)

# Minimal ansatz: Ry rotation followed by entangling CNOT
@cudaq.kernel
def h2_ansatz(theta: list[float]):
    q = cudaq.qvector(2)
    ry(theta[0], q[0])
    cx(q[0], q[1])

# Choose an optimizer
optimizer = cudaq.optimizers.COBYLA()
optimizer.max_iterations = 50

# Run VQE
energy, optimal_params = cudaq.vqe(
    h2_ansatz,
    H2_hamiltonian,
    optimizer,
    parameter_count=1
)

print(f"VQE ground state energy: {energy:.4f} Hartree")
print(f"Optimal parameter: {optimal_params[0]:.4f} rad")

Available optimizers include:

  • cudaq.optimizers.COBYLA(): Constrained Optimization BY Linear Approximations. Gradient-free, good for noisy cost functions.
  • cudaq.optimizers.NelderMead(): Gradient-free simplex method. Robust for low-dimensional parameter spaces.
  • cudaq.optimizers.LBFGS(): Gradient-based quasi-Newton method. Requires gradient information but converges faster for smooth landscapes.

Compare this to the manual scan: cudaq.vqe() finds the minimum automatically and typically converges in far fewer function evaluations than a brute-force grid search.

Gradient Computation with Parameter-Shift

For gradient-based optimizers, CUDA Quantum supports automatic gradient computation. The parameter-shift rule gives exact gradients for quantum circuits: for a parameter theta controlling a rotation gate, the gradient is:

dE/dtheta = [E(theta + pi/2) - E(theta - pi/2)] / 2

This requires two circuit evaluations per parameter. CUDA Quantum implements this (and central difference approximation) in the cudaq.gradients module:

import cudaq
from cudaq import spin

H2_hamiltonian = (
    -1.0523 * spin.i(0) * spin.i(1)
    + 0.3979 * spin.z(0) * spin.i(1)
    - 0.3979 * spin.i(0) * spin.z(1)
    - 0.0112 * spin.z(0) * spin.z(1)
    + 0.1809 * spin.x(0) * spin.x(1)
)

@cudaq.kernel
def h2_ansatz(theta: list[float]):
    q = cudaq.qvector(2)
    ry(theta[0], q[0])
    cx(q[0], q[1])

# Use parameter-shift gradient with L-BFGS optimizer
gradient = cudaq.gradients.ParameterShift()
optimizer = cudaq.optimizers.LBFGS()

energy, optimal_params = cudaq.vqe(
    h2_ansatz,
    H2_hamiltonian,
    optimizer,
    gradient=gradient,
    parameter_count=1
)

print(f"VQE energy (gradient-based): {energy:.4f} Hartree")
print(f"Optimal parameter: {optimal_params[0]:.4f} rad")

You can also use cudaq.gradients.CentralDifference(), which approximates the gradient numerically with a small finite step. Both methods cost two circuit evaluations per parameter; the difference is accuracy. Parameter-shift yields the exact gradient, which makes it the standard choice on quantum hardware, while central difference carries a step-size-dependent truncation error.
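
As a sanity check of the shift rule itself, take the one-qubit energy E(theta) = <0|RY(theta)^dag Z RY(theta)|0> = cos(theta), whose exact derivative is -sin(theta). The shifted evaluations reproduce it exactly:

```python
import math

def energy(theta: float) -> float:
    # <Z> after RY(theta) is applied to |0>
    return math.cos(theta)

theta = 0.7
# Parameter-shift rule: dE/dtheta = [E(theta + pi/2) - E(theta - pi/2)] / 2
shift_grad = (energy(theta + math.pi / 2) - energy(theta - math.pi / 2)) / 2
print(shift_grad, -math.sin(theta))  # the two values agree
```

Unlike a finite-difference estimate, this identity holds for any shift-rule-compatible gate, not just in the limit of a small step.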

Noise Models

Real quantum hardware has noise: gates are imperfect, qubits decohere, and measurements have errors. CUDA Quantum lets you simulate these effects so you can test how your algorithm performs under realistic conditions.

The basic building block is a noise channel. A depolarizing channel applies a random Pauli error (X, Y, or Z) with some probability after a gate:

import cudaq

# Create a noise model
noise_model = cudaq.NoiseModel()

# Add 1% depolarizing noise after every X gate on any qubit
depolarizing = cudaq.DepolarizationChannel(0.01)
noise_model.add_all_qubit_channel("x", depolarizing)

# Add 1% depolarizing noise after every Hadamard gate
noise_model.add_all_qubit_channel("h", depolarizing)

# Add 1% depolarizing noise after every CNOT gate
two_qubit_depolarizing = cudaq.DepolarizationChannel(0.01)
noise_model.add_all_qubit_channel("cx", two_qubit_depolarizing)

Now run the Bell state with noise:

import cudaq

noise_model = cudaq.NoiseModel()
depolarizing = cudaq.DepolarizationChannel(0.01)
noise_model.add_all_qubit_channel("h", depolarizing)
noise_model.add_all_qubit_channel("cx", depolarizing)

@cudaq.kernel
def bell_state():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q[0])
    mz(q[1])

# Noiseless run
clean_result = cudaq.sample(bell_state, shots_count=10000)
print("Noiseless:", clean_result)

# Noisy run
noisy_result = cudaq.sample(bell_state, noise_model=noise_model, shots_count=10000)
print("Noisy:    ", noisy_result)

Expected output (approximate):

Noiseless: { 00:4987 11:5013 }
Noisy:     { 00:4900 01:52 10:55 11:4993 }

In the noisy simulation, the 01 and 10 outcomes appear at small but nonzero probability. These are errors introduced by the depolarizing channel: the noise occasionally flips one qubit relative to the other, breaking the perfect entanglement correlation.
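
You can estimate the mismatch rate with a small density-matrix calculation in numpy. The model below is my own simplification (one depolarizing application of strength p per qubit rather than one per gate), so it only approximates the sampled counts above:

```python
import numpy as np

X = np.array([[0, 1], [1, 0]], dtype=complex)
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2)

def depolarize(rho, p, qubit):
    """Single-qubit depolarizing channel on one qubit of a 2-qubit state."""
    paulis = [np.kron(P, I2) if qubit == 0 else np.kron(I2, P) for P in (X, Y, Z)]
    return (1 - p) * rho + (p / 3) * sum(P @ rho @ P.conj().T for P in paulis)

bell = np.array([1, 0, 0, 1], dtype=complex) / np.sqrt(2)
rho = np.outer(bell, bell.conj())
for q in (0, 1):
    rho = depolarize(rho, 0.01, q)

mismatch = (rho[1, 1] + rho[2, 2]).real  # P("01") + P("10")
print(f"P(mismatch) ~ {mismatch:.4f}")
```

This predicts a mismatch probability of roughly 1.3%, the same order as the 107 stray counts out of 10000 shots in the sampled output.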

CUDA Quantum also supports other noise channels:

  • cudaq.AmplitudeDampingChannel(probability): Models energy relaxation (T1 decay)
  • cudaq.PhaseFlipChannel(probability): Models dephasing (T2 decay)
  • cudaq.BitFlipChannel(probability): Models classical bit-flip errors

Async Execution

For production workflows where you need to run many circuits, cudaq.sample_async() returns a future that you can await later. This lets you submit multiple circuits in parallel:

import cudaq

@cudaq.kernel
def parameterized_circuit(theta: float):
    q = cudaq.qvector(2)
    ry(theta, q[0])
    cx(q[0], q[1])
    mz(q)

# Submit multiple circuits asynchronously
import math
futures = []
angles = [i * math.pi / 10 for i in range(20)]

for angle in angles:
    future = cudaq.sample_async(parameterized_circuit, angle, shots_count=1000)
    futures.append((angle, future))

# Collect results later
for angle, future in futures:
    result = future.get()
    p11 = result.probability("11")
    print(f"theta={angle:.3f}: P(|11>) = {p11:.3f}")

This is especially useful when targeting remote backends (IonQ, Quantinuum) where circuit submission has network latency. The async API lets you queue many jobs without waiting for each one to complete.

MPI for Multi-Node GPU Clusters

For the largest simulations, CUDA Quantum supports MPI (Message Passing Interface) to distribute computation across multiple nodes, each with multiple GPUs. Set the target to nvidia-mgpu and launch with mpirun:

mpirun -n 4 python my_circuit.py

Inside the script, set the target before defining kernels:

import cudaq

cudaq.set_target("nvidia-mgpu")
cudaq.mpi.initialize()

@cudaq.kernel
def large_ghz(n_qubits: int):
    q = cudaq.qvector(n_qubits)
    h(q[0])
    for i in range(1, n_qubits):
        cx(q[i - 1], q[i])
    mz(q)

result = cudaq.sample(large_ghz, 34, shots_count=1000)

if cudaq.mpi.rank() == 0:
    print(result)

cudaq.mpi.finalize()

This is the path to 35+ qubit statevector simulation, limited only by the total GPU memory across your cluster.

Common Mistakes

1. Using Python data structures inside kernels

# WRONG: Python lists do not work in kernels
@cudaq.kernel
def bad_kernel():
    angles = [0.1, 0.2, 0.3]  # compile error

Use kernel parameters instead. Pass data into kernels as float, int, or list[float] arguments.

2. Calling cudaq.sample on a kernel without measurements

@cudaq.kernel
def no_measurements():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    # forgot mz()!

result = cudaq.sample(no_measurements, shots_count=1000)
# Returns empty or all-zeros results

cudaq.sample requires explicit mz() calls to know which qubits to measure. If you want the expectation value of an observable without measurement, use cudaq.observe instead.

3. Setting the target too late

import cudaq

@cudaq.kernel
def my_kernel():
    q = cudaq.qvector(2)
    h(q[0])
    mz(q[0])

# This kernel may already be compiled for the default target
cudaq.set_target("nvidia")  # too late for kernels defined above

Call cudaq.set_target() before defining or executing any kernels. The target must be set before kernel compilation occurs.

4. Tensor network limitations

The tensornet target handles large qubit counts efficiently for circuits with low entanglement. However, it does not support all gate types, and highly entangled circuits (deep random circuits, for example) cause the tensor bond dimensions to grow exponentially, negating the advantage. If your circuit creates volume-law entanglement, use the statevector backend instead.

5. Insufficient GPU memory for large circuits

GPU simulation of 30+ qubits requires substantial GPU memory: a 30-qubit statevector needs 16 GB, and a 32-qubit statevector needs 64 GB. Always check available memory with nvidia-smi before attempting large simulations. If you run out of memory, the simulation may fail without a helpful error message.

Where to Go Next

  • Full API reference: /reference/cuda-quantum
  • Official docs and C++ examples: nvidia.github.io/cuda-quantum
  • For variational algorithms at scale, explore the cudaq.vqe() function with gradient-based optimizers
  • For large circuits, try the tensornet target and compare runtimes against qpp-cpu
  • For noise-aware algorithm development, build NoiseModel objects that match the hardware you plan to deploy on
