CUDA Quantum
NVIDIA's unified programming model for quantum-classical computing at GPU scale
Quick install
pip install cuda-quantum
Background and History
CUDA Quantum originated as QODA (Quantum Optimized Device Architecture), which NVIDIA announced in mid-2022 as part of its broader push into quantum computing infrastructure; it was relaunched under the CUDA Quantum name at GTC (GPU Technology Conference) in March 2023. The framework was developed by NVIDIA’s quantum computing team, led by Tim Costa, as an extension of NVIDIA’s existing CUDA parallel computing platform into the quantum domain. The code was open-sourced on GitHub in 2023, and the framework was later rebranded as CUDA-Q.
NVIDIA’s entry into quantum computing software was driven by a clear thesis: quantum computers will operate as accelerators alongside classical GPUs, and the programming model should reflect this hybrid reality. CUDA Quantum provides a unified API where quantum kernels (decorated with @cudaq.kernel in Python) can be compiled and dispatched to CPU simulators, GPU-accelerated simulators, or real quantum hardware through the same interface. The GPU backends leverage NVIDIA’s cuQuantum library, which includes cuStateVec for statevector simulation and cuTensorNet for tensor network contraction.
The framework’s GPU-accelerated simulators are its primary differentiator. The nvidia backend offloads statevector computation to a single GPU, enabling simulation of circuits with 30 or more qubits at speeds that far exceed CPU-based simulators. The nvidia-mgpu backend distributes the statevector across multiple GPUs for larger simulations, and the tensornet backend uses GPU-accelerated tensor network methods to handle circuits with 50 or more qubits for certain circuit structures. These capabilities make CUDA Quantum particularly attractive for variational algorithm research where thousands of circuit evaluations need to be batched efficiently.
CUDA Quantum reached version 0.8 by early 2025 and supports hardware targets including IonQ, Quantinuum, and ORCA Computing, each accessed through that provider’s cloud service. The framework provides both Python and C++ APIs, with the C++ path offering lower-level control for performance-critical applications. As of 2025, CUDA Quantum is actively developed with regular releases. Its community is growing, though it remains smaller than Qiskit’s or PennyLane’s. NVIDIA’s investment in the project signals a long-term commitment, and the framework is well positioned as quantum hardware scales to the point where tight classical-quantum co-processing becomes essential.
Overview
CUDA Quantum is NVIDIA’s entry into quantum computing infrastructure. Its core differentiator is GPU-accelerated simulation: the nvidia backend offloads statevector computation to a single NVIDIA GPU, while the tensornet backend uses GPU tensor network contraction to simulate circuits with 50 or more qubits that would be infeasible for full-statevector CPU simulators.
The framework targets hybrid quantum-classical workflows where classical GPU workloads and quantum circuits are tightly coupled. This makes it especially useful for variational algorithms (VQE, QAOA) where many circuit evaluations are batched and the gradient computation can stay on GPU.
CUDA Quantum exposes both a Python API and a lower-level C++ API. The Python API (imported as cudaq) is sufficient for most use cases and is the focus of this reference.
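The gradient pattern these variational workflows rely on can be sketched without any quantum backend. The snippet below models f(theta) = ⟨Z⟩ after an Ry rotation in plain NumPy and recovers its exact gradient with the parameter-shift rule; it illustrates the math only and uses no CUDA Quantum API.

```python
import numpy as np

# Toy model of a variational workload: f(theta) = <Z> after Ry(theta)|0>,
# which equals cos(theta) analytically.
def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def expval_z(theta):
    psi = ry(theta) @ np.array([1.0, 0.0])   # Ry(theta)|0>
    return psi[0] ** 2 - psi[1] ** 2          # <Z> = P(0) - P(1)

theta = 0.7
# Parameter-shift rule: exact gradient from two shifted circuit evaluations
grad = 0.5 * (expval_z(theta + np.pi / 2) - expval_z(theta - np.pi / 2))
print(grad, -np.sin(theta))  # the two values agree
```

In a real workflow, each `expval_z` call would be a circuit evaluation (e.g. via an observe-style API), which is why batching many evaluations efficiently matters so much for these algorithms.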
Installation
CPU-only simulation (no NVIDIA GPU required):
pip install cuda-quantum
For GPU-accelerated backends, you also need:
- NVIDIA GPU with CUDA Compute Capability 7.0 or higher
- CUDA Toolkit 11.8 or 12.x
- cuQuantum library (installed automatically with the GPU extras)
The easiest path to a fully GPU-enabled environment is the official Docker image:
docker pull nvcr.io/nvidia/cuda-quantum:latest
docker run --gpus all -it nvcr.io/nvidia/cuda-quantum:latest
Core Concepts
The @cudaq.kernel Decorator
Quantum circuits in CUDA Quantum are written as ordinary Python functions decorated with @cudaq.kernel. The decorator JIT-compiles the function to an intermediate representation that can be lowered to any supported target.
import cudaq

@cudaq.kernel
def my_circuit():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q[0])
    mz(q[1])
Gate names inside kernels are called as bare functions (h, cx, mz). The compiler resolves them from the cudaq gate set.
Qubit Types
| Type | Description |
|---|---|
| cudaq.qubit | Single qubit |
| cudaq.qvector(n) | Fixed-size register of n qubits |
Execution Methods
| Method | Returns | Use case |
|---|---|---|
| cudaq.sample(kernel, shots_count=N) | CountsDictionary | Measurement outcomes |
| cudaq.observe(kernel, hamiltonian) | ObserveResult | Expectation value of an operator |
| cudaq.get_state(kernel) | cudaq.State | Full statevector (simulation only) |
Code Examples
Bell State with Sampling
import cudaq

@cudaq.kernel
def bell_state():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q[0])
    mz(q[1])
result = cudaq.sample(bell_state, shots_count=1000)
print(result)
# Output: { 00:496 11:504 }
print(result.most_probable()) # '00' or '11'
print(result["00"]) # count for the 00 outcome
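The Bell-state counts can be cross-checked analytically. The following plain-NumPy calculation (independent of cudaq; qubit 0 is taken as the left factor in the Kronecker product, an illustrative ordering choice) reproduces the 50/50 split over 00 and 11:

```python
import numpy as np

# H on qubit 0, then CNOT(control=0, target=1), acting on |00>
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
I = np.eye(2)
CNOT = np.array([[1, 0, 0, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1],
                 [0, 0, 1, 0]])

psi = CNOT @ np.kron(H, I) @ np.array([1.0, 0.0, 0.0, 0.0])
probs = np.abs(psi) ** 2
print(probs)  # ~[0.5, 0, 0, 0.5]: only 00 and 11 appear, matching the counts
```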
Setting the Execution Target
import cudaq
# Default CPU simulator (no GPU needed)
cudaq.set_target("qpp-cpu")
# Single NVIDIA GPU
cudaq.set_target("nvidia")
# Multi-GPU (requires multiple GPUs)
cudaq.set_target("nvidia-mgpu")
# GPU tensor network (large circuits, 50+ qubits)
cudaq.set_target("tensornet")
# Real hardware via IonQ (credentials are typically supplied via the
# IONQ_API_KEY environment variable)
cudaq.set_target("ionq")
Targets must be set before calling cudaq.sample or cudaq.observe. Switching targets at runtime is supported.
Parameterized Kernels
Kernels accept classical parameters, which is the standard pattern for variational algorithms:
import cudaq

@cudaq.kernel
def ry_circuit(theta: float):
    q = cudaq.qvector(1)
    ry(theta, q[0])
    mz(q[0])
# Sweep over angles
import math
for angle in [0.0, math.pi / 4, math.pi / 2, math.pi]:
    result = cudaq.sample(ry_circuit, angle, shots_count=500)
    print(f"theta={angle:.2f} |1> count: {result.count('1')}")
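The sweep above has a closed-form prediction: ry(theta) applied to |0> yields P(|1>) = sin^2(theta/2), so the 500-shot counts should cluster around these fractions. A quick backend-free check of the expected values:

```python
import math

# Analytic |1> probability for each swept angle: P(1) = sin^2(theta / 2)
for angle in [0.0, math.pi / 4, math.pi / 2, math.pi]:
    p1 = math.sin(angle / 2) ** 2
    print(f"theta={angle:.2f}  expected |1> fraction: {p1:.3f}")
```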
Expectation Values with cudaq.observe
observe computes the expectation value of a SpinOperator (Hamiltonian) without explicit measurement:
import cudaq
from cudaq import spin
# Hamiltonian: Z0 tensor Z1
hamiltonian = spin.z(0) * spin.z(1)
@cudaq.kernel
def ansatz(theta: float):
    q = cudaq.qvector(2)
    ry(theta, q[0])
    cx(q[0], q[1])
import math
result = cudaq.observe(ansatz, hamiltonian, math.pi / 4)
print(f"Expectation value: {result.expectation():.4f}")
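For this particular ansatz the answer can be verified by hand: assuming the standard gate conventions, ry(theta) followed by CNOT prepares cos(theta/2)|00> + sin(theta/2)|11>, and both of those basis states are +1 eigenstates of Z0 Z1, so the expectation is 1 for every theta. A plain-NumPy check (no cudaq required):

```python
import numpy as np

def expval_zz(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    psi = np.array([c, 0.0, 0.0, s])       # cos|00> + sin|11>
    zz = np.diag([1.0, -1.0, -1.0, 1.0])   # Z (x) Z in the computational basis
    return psi @ zz @ psi

print(expval_zz(np.pi / 4))  # ~1.0, independent of theta
```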
Retrieving the Full Statevector
import cudaq
@cudaq.kernel
def superposition():
    q = cudaq.qvector(2)
    h(q[0])
state = cudaq.get_state(superposition)
print(state)
# Prints the 4-element complex amplitude vector
Backends and Hardware
| Target name | Type | Notes |
|---|---|---|
| qpp-cpu | CPU simulator | Default, no GPU needed, exact statevector |
| nvidia | GPU simulator | Single NVIDIA GPU, fast for 20-30 qubits |
| nvidia-mgpu | Multi-GPU simulator | Distributes statevector across GPUs |
| tensornet | GPU tensor network | Handles 50+ qubits on structured circuits |
| ionq | Real hardware | IonQ trapped-ion processors, API key required |
| quantinuum | Real hardware | Quantinuum H-series, API key required |
| orca | Real hardware | Photonic hardware, limited availability |
The tensornet backend is particularly useful for shallow circuits on many qubits: it avoids storing the full statevector by contracting the tensor network on the fly.
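The contraction idea can be illustrated in a few lines of NumPy: represent each gate as a small tensor and contract them with einsum, never materializing gate matrices on the full Hilbert space. This is a toy sketch of the principle only; the real tensornet backend uses cuTensorNet with optimized contraction paths on GPU.

```python
import numpy as np

# Bell circuit as a tiny tensor network: per-qubit input tensors, a 2x2 H,
# and CNOT as a rank-4 tensor with indices
# (control_out, target_out, control_in, target_in).
zero = np.array([1.0, 0.0])
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
CNOT = np.zeros((2, 2, 2, 2))
for c in range(2):
    for t in range(2):
        CNOT[c, (t + c) % 2, c, t] = 1.0

# Contract the whole network in one einsum call:
# x is the leg between |0> and H; a,b feed the CNOT; i,j are the outputs.
psi = np.einsum("ijab,ax,x,b->ij", CNOT, H, zero, zero)
print(psi.reshape(4))  # amplitudes of 00,01,10,11: [0.707, 0, 0, 0.707]
```

For shallow circuits on many qubits, a good contraction order keeps every intermediate tensor small, which is exactly how the tensor-network approach sidesteps the exponential statevector.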
Common Gate Reference
Inside @cudaq.kernel functions, gates are bare function calls:
| Gate call | Description |
|---|---|
| h(q) | Hadamard |
| x(q) | Pauli-X (NOT) |
| y(q) | Pauli-Y |
| z(q) | Pauli-Z |
| s(q) | S gate |
| t(q) | T gate |
| rx(theta, q) | X-rotation by theta |
| ry(theta, q) | Y-rotation by theta |
| rz(theta, q) | Z-rotation by theta |
| cx(control, target) | CNOT |
| cz(control, target) | Controlled-Z |
| swap(q0, q1) | SWAP |
| mz(q) | Measure in Z basis |
| my(q) | Measure in Y basis |
| mx(q) | Measure in X basis |
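The rotation rows above follow the standard convention r_p(theta) = exp(-i theta P / 2) (an assumption consistent with common quantum SDK conventions). The identities below, checked in plain NumPy, make that convention concrete:

```python
import numpy as np

# Standard single-qubit rotation matrices, exp(-i * t * P / 2)
def rx(t):
    return np.array([[np.cos(t / 2), -1j * np.sin(t / 2)],
                     [-1j * np.sin(t / 2), np.cos(t / 2)]])

def ry(t):
    return np.array([[np.cos(t / 2), -np.sin(t / 2)],
                     [np.sin(t / 2),  np.cos(t / 2)]])

X = np.array([[0, 1], [1, 0]])
# rx(pi) equals Pauli-X up to a global phase of -i
print(np.allclose(rx(np.pi), -1j * X))   # True
# ry(pi)|0> = |1> exactly (real matrix, no phase)
print(ry(np.pi) @ np.array([1.0, 0.0]))  # [~0, 1]
```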
Limitations
- GPU backends require an NVIDIA GPU with CUDA support. AMD and Apple Silicon GPUs are not supported.
- The Python @cudaq.kernel decorator imposes restrictions on what Python code can appear inside the kernel body: no arbitrary Python objects, no dynamic list comprehensions, and limited control flow compared to standard Python.
- The Python API is newer than the C++ API. Some advanced features, including multi-QPU parallel execution (MQPU) and distributed simulation across nodes, require the C++ interface or specific container environments.
- The community is smaller than Qiskit or PennyLane, so third-party tutorials and Stack Overflow answers are less abundant.
- Hardware targets (IonQ, Quantinuum, ORCA) require separate accounts and API credentials with those providers; jobs are submitted to the providers’ cloud queues, so availability, latency, and cost are governed by those platforms rather than by CUDA Quantum itself.