CUDA Quantum
NVIDIA's unified programming model for quantum-classical computing at GPU scale
Quick install
pip install cuda-quantum
Background and History
CUDA Quantum originated as QODA (Quantum Optimized Device Architecture), which NVIDIA announced in mid-2022 as part of its broader push into quantum computing infrastructure; it was relaunched under the CUDA Quantum name at GTC (GPU Technology Conference) in March 2023. The framework was developed by NVIDIA’s quantum computing team, led by Tim Costa, as an extension of NVIDIA’s existing CUDA parallel computing platform into the quantum domain. The code was open-sourced on GitHub in 2023, and the framework was later rebranded as CUDA-Q.
NVIDIA’s entry into quantum computing software was driven by a clear thesis: quantum computers will operate as accelerators alongside classical GPUs, and the programming model should reflect this hybrid reality. CUDA Quantum provides a unified API where quantum kernels (decorated with @cudaq.kernel in Python) can be compiled and dispatched to CPU simulators, GPU-accelerated simulators, or real quantum hardware through the same interface. The GPU backends leverage NVIDIA’s cuQuantum library, which includes cuStateVec for statevector simulation and cuTensorNet for tensor network contraction.
The framework’s GPU-accelerated simulators are its primary differentiator. The nvidia backend offloads statevector computation to a single GPU, enabling simulation of circuits with 30 or more qubits at speeds that far exceed CPU-based simulators. The nvidia-mgpu backend distributes the statevector across multiple GPUs for larger simulations, and the tensornet backend uses GPU-accelerated tensor network methods to handle circuits with 50 or more qubits for certain circuit structures. These capabilities make CUDA Quantum particularly attractive for variational algorithm research where thousands of circuit evaluations need to be batched efficiently.
CUDA Quantum reached version 0.8 by early 2025 and supports hardware targets including IonQ, Quantinuum, and ORCA Computing, each accessed through that provider’s cloud service. The framework provides both Python and C++ APIs, with the C++ path offering lower-level control for performance-critical applications. As of 2025, CUDA Quantum is actively developed with regular releases. Its community is growing, though it remains smaller than Qiskit’s or PennyLane’s. NVIDIA’s investment in the project signals a long-term commitment, and the framework is well positioned as quantum hardware scales to the point where tight classical-quantum co-processing becomes essential.
Overview
CUDA Quantum is NVIDIA’s entry into quantum computing infrastructure. Its core differentiator is GPU-accelerated simulation: the nvidia backend offloads statevector computation to a single NVIDIA GPU, while the tensornet backend uses GPU tensor network contraction to simulate circuits with 50 or more qubits that would be infeasible for full-statevector CPU simulators.
The framework targets hybrid quantum-classical workflows where classical GPU workloads and quantum circuits are tightly coupled. This makes it especially useful for variational algorithms (VQE, QAOA) where many circuit evaluations are batched and the gradient computation can stay on GPU.
CUDA Quantum exposes both a Python API and a lower-level C++ API. The Python API (imported as cudaq) is sufficient for most use cases and is the focus of this reference.
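The gradient pattern these variational workflows rely on can be sketched without any quantum backend. The snippet below models f(theta) = ⟨Z⟩ after an Ry rotation in plain NumPy and recovers its exact gradient with the parameter-shift rule; it illustrates the math only and uses no CUDA Quantum API.

```python
import numpy as np

# Toy model of a variational workload: f(theta) = <Z> after Ry(theta)|0>,
# which equals cos(theta) analytically.
def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def expval_z(theta):
    psi = ry(theta) @ np.array([1.0, 0.0])   # Ry(theta)|0>
    return psi[0] ** 2 - psi[1] ** 2          # <Z> = P(0) - P(1)

theta = 0.7
# Parameter-shift rule: exact gradient from two shifted circuit evaluations
grad = 0.5 * (expval_z(theta + np.pi / 2) - expval_z(theta - np.pi / 2))
print(grad, -np.sin(theta))  # the two values agree
```

In a real workflow, each `expval_z` call would be a circuit evaluation (e.g. via an observe-style API), which is why batching many evaluations efficiently matters so much for these algorithms.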
Installation
CPU-only simulation (no NVIDIA GPU required):
pip install cuda-quantum
For GPU-accelerated backends, you also need:
- NVIDIA GPU with CUDA Compute Capability 7.0 or higher
- CUDA Toolkit 11.8 or 12.x
- cuQuantum library (installed automatically with the GPU extras)
The easiest path to a fully GPU-enabled environment is the official Docker image:
docker pull nvcr.io/nvidia/cuda-quantum:latest
docker run --gpus all -it nvcr.io/nvidia/cuda-quantum:latest
Core Concepts
The @cudaq.kernel Decorator
Quantum circuits in CUDA Quantum are written as ordinary Python functions decorated with @cudaq.kernel. The decorator JIT-compiles the function to an intermediate representation that can be lowered to any supported target.
import cudaq

@cudaq.kernel
def my_circuit():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q[0])
    mz(q[1])
Gate names inside kernels are called as bare functions (h, cx, mz). The compiler resolves them from the cudaq gate set.
Qubit Types
| Type | Description |
|---|---|
| cudaq.qubit | Single qubit |
| cudaq.qvector(n) | Fixed-size register of n qubits |
Execution Methods
| Method | Returns | Use case |
|---|---|---|
| cudaq.sample(kernel, shots_count=N) | CountsDictionary | Measurement outcomes |
| cudaq.observe(kernel, hamiltonian) | ObserveResult | Expectation value of an operator |
| cudaq.get_state(kernel) | cudaq.State | Full statevector (simulation only) |
Code Examples
Bell State with Sampling
import cudaq

@cudaq.kernel
def bell_state():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q[0])
    mz(q[1])
result = cudaq.sample(bell_state, shots_count=1000)
print(result)
# Output: { 00:496 11:504 }
print(result.most_probable()) # '00' or '11'
print(result["00"]) # count for the 00 outcome
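The Bell-state counts can be cross-checked analytically. The following plain-NumPy calculation (independent of cudaq; qubit 0 is taken as the left factor in the Kronecker product, an illustrative ordering choice) reproduces the 50/50 split over 00 and 11:

```python
import numpy as np

# H on qubit 0, then CNOT(control=0, target=1), acting on |00>
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

psi = CNOT @ np.kron(H, I) @ np.array([1.0, 0.0, 0.0, 0.0])
probs = np.abs(psi) ** 2
print(probs)  # ~[0.5, 0, 0, 0.5]: only 00 and 11 appear, matching the counts
```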
Setting the Execution Target
import cudaq
# Default CPU simulator (no GPU needed)
cudaq.set_target("qpp-cpu")
# Single NVIDIA GPU
cudaq.set_target("nvidia")
# Multi-GPU (requires multiple GPUs)
cudaq.set_target("nvidia-mgpu")
# GPU tensor network (large circuits, 50+ qubits)
cudaq.set_target("tensornet")
# Real hardware via IonQ (credentials are typically supplied via the
# IONQ_API_KEY environment variable)
cudaq.set_target("ionq")
Targets must be set before calling cudaq.sample or cudaq.observe. Switching targets at runtime is supported.
Parameterized Kernels
Kernels accept classical parameters, which is the standard pattern for variational algorithms:
import cudaq

@cudaq.kernel
def ry_circuit(theta: float):
    q = cudaq.qvector(1)
    ry(theta, q[0])
    mz(q[0])
# Sweep over angles
import math
for angle in [0.0, math.pi / 4, math.pi / 2, math.pi]:
    result = cudaq.sample(ry_circuit, angle, shots_count=500)
    print(f"theta={angle:.2f} |1> count: {result.count('1')}")
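The sweep above has a closed-form prediction: ry(theta) applied to |0> yields P(|1>) = sin^2(theta/2), so the 500-shot counts should cluster around these fractions. A quick backend-free check of the expected values:

```python
import math

# Analytic |1> probability for each swept angle: P(1) = sin^2(theta / 2)
for angle in [0.0, math.pi / 4, math.pi / 2, math.pi]:
    p1 = math.sin(angle / 2) ** 2
    print(f"theta={angle:.2f}  expected |1> fraction: {p1:.3f}")
```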
Expectation Values with cudaq.observe
observe computes the expectation value of a SpinOperator (Hamiltonian) without explicit measurement:
import cudaq
from cudaq import spin
# Hamiltonian: Z0 tensor Z1
hamiltonian = spin.z(0) * spin.z(1)
@cudaq.kernel
def ansatz(theta: float):
    q = cudaq.qvector(2)
    ry(theta, q[0])
    cx(q[0], q[1])
import math
result = cudaq.observe(ansatz, hamiltonian, math.pi / 4)
print(f"Expectation value: {result.expectation():.4f}")
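For this particular ansatz the answer can be verified by hand: assuming the standard gate conventions, ry(theta) followed by CNOT prepares cos(theta/2)|00> + sin(theta/2)|11>, and both of those basis states are +1 eigenstates of Z0 Z1, so the expectation is 1 for every theta. A plain-NumPy check (no cudaq required):

```python
import numpy as np

def expval_zz(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    psi = np.array([c, 0.0, 0.0, s])       # cos|00> + sin|11>
    zz = np.diag([1.0, -1.0, -1.0, 1.0])   # Z (x) Z in the computational basis
    return psi @ zz @ psi

print(expval_zz(np.pi / 4))  # ~1.0, independent of theta
```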
Retrieving the Full Statevector
import cudaq
@cudaq.kernel
def superposition():
    q = cudaq.qvector(2)
    h(q[0])
state = cudaq.get_state(superposition)
print(state)
# Prints the 4-element complex amplitude vector
Backends and Hardware
| Target name | Type | Notes |
|---|---|---|
| qpp-cpu | CPU simulator | Default, no GPU needed, exact statevector |
| nvidia | GPU simulator | Single NVIDIA GPU, fast for 20-30 qubits |
| nvidia-mgpu | Multi-GPU simulator | Distributes statevector across GPUs |
| tensornet | GPU tensor network | Handles 50+ qubits on structured circuits |
| ionq | Real hardware | IonQ trapped-ion processors, API key required |
| quantinuum | Real hardware | Quantinuum H-series, API key required |
| orca | Real hardware | Photonic hardware, limited availability |
The tensornet backend is particularly useful for shallow circuits on many qubits: it avoids storing the full statevector by contracting the tensor network on the fly.
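The contraction idea can be illustrated in a few lines of NumPy: represent each gate as a small tensor and contract them with einsum, never materializing gate matrices on the full Hilbert space. This is a toy sketch of the principle only; the real tensornet backend uses cuTensorNet with optimized contraction paths on GPU.

```python
import numpy as np

# Bell circuit as a tiny tensor network: per-qubit input tensors, a 2x2 H,
# and CNOT as a rank-4 tensor with indices
# (control_out, target_out, control_in, target_in).
zero = np.array([1.0, 0.0])
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.zeros((2, 2, 2, 2))
for c in range(2):
    for t in range(2):
        CNOT[c, (t + c) % 2, c, t] = 1.0

# Contract the whole network in one einsum call:
# x is the leg between |0> and H; a,b feed the CNOT; i,j are the outputs.
psi = np.einsum("ijab,ax,x,b->ij", CNOT, H, zero, zero)
print(psi.reshape(4))  # amplitudes of 00,01,10,11: [0.707, 0, 0, 0.707]
```

For shallow circuits on many qubits, a good contraction order keeps every intermediate tensor small, which is exactly how the tensor-network approach sidesteps the exponential statevector.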
Common Gate Reference
Inside @cudaq.kernel functions, gates are bare function calls:
| Gate call | Description |
|---|---|
| h(q) | Hadamard |
| x(q) | Pauli-X (NOT) |
| y(q) | Pauli-Y |
| z(q) | Pauli-Z |
| s(q) | S gate |
| t(q) | T gate |
| rx(theta, q) | X-rotation by theta |
| ry(theta, q) | Y-rotation by theta |
| rz(theta, q) | Z-rotation by theta |
| cx(control, target) | CNOT |
| cz(control, target) | Controlled-Z |
| swap(q0, q1) | SWAP |
| mz(q) | Measure in Z basis |
| my(q) | Measure in Y basis |
| mx(q) | Measure in X basis |
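The rotation rows above follow the standard convention r_p(theta) = exp(-i theta P / 2) (an assumption consistent with common quantum SDK conventions). The identities below, checked in plain NumPy, make that convention concrete:

```python
import numpy as np

# Standard single-qubit rotation matrices, exp(-i * t * P / 2)
def rx(t):
    return np.array([[np.cos(t / 2), -1j * np.sin(t / 2)],
                     [-1j * np.sin(t / 2), np.cos(t / 2)]])

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]])

X = np.array([[0, 1], [1, 0]])
# rx(pi) equals Pauli-X up to a global phase of -i
print(np.allclose(rx(np.pi), -1j * X))   # True
# ry(pi)|0> = |1> exactly (real matrix, no phase)
print(ry(np.pi) @ np.array([1.0, 0.0]))  # [~0, 1]
```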
Limitations
- GPU backends require an NVIDIA GPU with CUDA support. AMD and Apple Silicon GPUs are not supported.
- The Python @cudaq.kernel decorator imposes restrictions on what Python code can appear inside the kernel body: no arbitrary Python objects, no dynamic list comprehensions, and limited control flow compared to standard Python.
- The Python API is newer than the C++ API. Some advanced features, including multi-QPU parallel execution (MQPU) and distributed simulation across nodes, require the C++ interface or specific container environments.
- The community is smaller than Qiskit or PennyLane, so third-party tutorials and Stack Overflow answers are less abundant.
- Hardware targets (IonQ, Quantinuum, ORCA) require separate accounts and API credentials with those providers; jobs are submitted to the providers’ cloud queues, so availability, latency, and cost are governed by those platforms rather than by CUDA Quantum itself.