Python / C++ v0.8

CUDA Quantum

NVIDIA's unified programming model for quantum-classical computing at GPU scale

Quick install

pip install cuda-quantum

Background and History

CUDA Quantum was first announced by NVIDIA in mid-2022 under the name QODA (Quantum Optimized Device Architecture) and relaunched as CUDA Quantum at GTC (GPU Technology Conference) in March 2023, with Jensen Huang presenting it as part of NVIDIA’s broader push into quantum computing infrastructure. The framework was developed by NVIDIA’s quantum computing team, led by Tim Costa, as an extension of NVIDIA’s existing CUDA parallel computing platform into the quantum domain. The initial release, called “CUDA Quantum” (later rebranded CUDA-Q), was open-sourced on GitHub in 2023.

NVIDIA’s entry into quantum computing software was driven by a clear thesis: quantum computers will operate as accelerators alongside classical GPUs, and the programming model should reflect this hybrid reality. CUDA Quantum provides a unified API where quantum kernels (decorated with @cudaq.kernel in Python) can be compiled and dispatched to CPU simulators, GPU-accelerated simulators, or real quantum hardware through the same interface. The GPU backends leverage NVIDIA’s cuQuantum library, which includes cuStateVec for statevector simulation and cuTensorNet for tensor network contraction.

The framework’s GPU-accelerated simulators are its primary differentiator. The nvidia backend offloads statevector computation to a single GPU, enabling simulation of circuits with 30 or more qubits at speeds that far exceed CPU-based simulators. The nvidia-mgpu backend distributes the statevector across multiple GPUs for larger simulations, and the tensornet backend uses GPU-accelerated tensor network methods to handle circuits with 50 or more qubits for certain circuit structures. These capabilities make CUDA Quantum particularly attractive for variational algorithm research where thousands of circuit evaluations need to be batched efficiently.

CUDA Quantum reached version 0.8 by early 2025 and supports hardware targets including IonQ, Quantinuum, and ORCA Computing, each accessed through the provider’s own cloud service. The framework provides both Python and C++ APIs, with the C++ path offering lower-level control for performance-critical applications. As of 2025, CUDA Quantum is actively developed with regular releases. Its community is growing, though it remains smaller than Qiskit’s or PennyLane’s. NVIDIA’s investment in the project signals a long-term commitment, and the framework is well positioned as quantum hardware scales to the point where tight classical-quantum co-processing becomes essential.

Overview

CUDA Quantum is NVIDIA’s entry into quantum computing infrastructure. Its core differentiator is GPU-accelerated simulation: the nvidia backend offloads statevector computation to a single NVIDIA GPU, while the tensornet backend uses GPU tensor network contraction to simulate circuits with 50 or more qubits, far beyond what any full statevector simulator, CPU or GPU, could hold in memory.

The framework targets hybrid quantum-classical workflows where classical GPU workloads and quantum circuits are tightly coupled. This makes it especially useful for variational algorithms (VQE, QAOA) where many circuit evaluations are batched and the gradient computation can stay on GPU.

CUDA Quantum exposes both a Python API and a lower-level C++ API. The Python API (imported as cudaq) is sufficient for most use cases and is the focus of this reference.

Installation

CPU-only simulation (no NVIDIA GPU required):

pip install cuda-quantum

For GPU-accelerated backends, you also need:

  • NVIDIA GPU with CUDA Compute Capability 7.0 or higher
  • CUDA Toolkit 11.8 or 12.x
  • cuQuantum library (installed automatically with the GPU extras)

The easiest path to a fully GPU-enabled environment is the official Docker image:

docker pull nvcr.io/nvidia/cuda-quantum:latest
docker run --gpus all -it nvcr.io/nvidia/cuda-quantum:latest

Core Concepts

The @cudaq.kernel Decorator

Quantum circuits in CUDA Quantum are written as ordinary Python functions decorated with @cudaq.kernel. The decorator JIT-compiles the function to an MLIR-based intermediate representation (the Quake dialect) that can be lowered to any supported target.

import cudaq

@cudaq.kernel
def my_circuit():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q[0])
    mz(q[1])

Gate names inside kernels are called as bare functions (h, cx, mz). The compiler resolves them from the cudaq gate set.

Qubit Types

Type                Description
cudaq.qubit         Single qubit
cudaq.qvector(n)    Fixed-size register of n qubits

Execution Methods

Method                                 Returns        Use case
cudaq.sample(kernel, shots_count=N)    SampleResult   Measurement outcomes
cudaq.observe(kernel, hamiltonian)     ObserveResult  Expectation value of an operator
cudaq.get_state(kernel)                cudaq.State    Full statevector (simulation only)

Code Examples

Bell State with Sampling

import cudaq

@cudaq.kernel
def bell_state():
    q = cudaq.qvector(2)
    h(q[0])
    cx(q[0], q[1])
    mz(q[0])
    mz(q[1])

result = cudaq.sample(bell_state, shots_count=1000)
print(result)
# Output: { 00:496 11:504 }

print(result.most_probable())  # '00' or '11'
print(result["00"])             # count for the 00 outcome

Setting the Execution Target

import cudaq

# Default CPU simulator (no GPU needed)
cudaq.set_target("qpp-cpu")

# Single NVIDIA GPU
cudaq.set_target("nvidia")

# Multi-GPU (requires multiple GPUs)
cudaq.set_target("nvidia-mgpu")

# GPU tensor network (large circuits, 50+ qubits)
cudaq.set_target("tensornet")

# Real hardware via IonQ (requires the IONQ_API_KEY environment variable)
cudaq.set_target("ionq")

Targets must be set before calling cudaq.sample or cudaq.observe. Switching targets at runtime is supported.

Parameterized Kernels

Kernels accept classical parameters, which is the standard pattern for variational algorithms:

import cudaq
from cudaq import spin

@cudaq.kernel
def ry_circuit(theta: float):
    q = cudaq.qvector(1)
    ry(theta, q[0])
    mz(q[0])

# Sweep over angles
import math
for angle in [0.0, math.pi / 4, math.pi / 2, math.pi]:
    result = cudaq.sample(ry_circuit, angle, shots_count=500)
    print(f"theta={angle:.2f}  |1> count: {result['1']}")

Expectation Values with cudaq.observe

observe computes the expectation value of a SpinOperator (Hamiltonian) without explicit measurement:

import cudaq
from cudaq import spin

# Hamiltonian: Z0 tensor Z1
hamiltonian = spin.z(0) * spin.z(1)

@cudaq.kernel
def ansatz(theta: float):
    q = cudaq.qvector(2)
    ry(theta, q[0])
    cx(q[0], q[1])

import math
result = cudaq.observe(ansatz, hamiltonian, math.pi / 4)
print(f"Expectation value: {result.expectation():.4f}")

Retrieving the Full Statevector

import cudaq

@cudaq.kernel
def superposition():
    q = cudaq.qvector(2)
    h(q[0])

state = cudaq.get_state(superposition)
print(state)
# Prints the 4-element complex amplitude vector

Backends and Hardware

Target name   Type                 Notes
qpp-cpu       CPU simulator        Default, no GPU needed, exact statevector
nvidia        GPU simulator        Single NVIDIA GPU, fast for 20-30 qubits
nvidia-mgpu   Multi-GPU simulator  Distributes statevector across GPUs
tensornet     GPU tensor network   Handles 50+ qubits on structured circuits
ionq          Real hardware        IonQ trapped-ion processors, API key required
quantinuum    Real hardware        Quantinuum H-series, API key required
orca          Real hardware        Photonic hardware, limited availability

The tensornet backend is particularly useful for shallow circuits on many qubits: it avoids storing the full statevector by contracting the tensor network on the fly.
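
The memory argument is easy to make concrete: a full double-precision statevector stores 2^n complex128 amplitudes at 16 bytes each, which is why statevector simulation stalls in the 30-40 qubit range while tensor network methods can reach 50+:

```python
# Bytes needed for a full double-precision (complex128) statevector.
def statevector_bytes(n_qubits: int) -> int:
    return (2 ** n_qubits) * 16

for n in (30, 40, 50):
    gib = statevector_bytes(n) / 2**30
    print(f"{n} qubits: {gib:,.0f} GiB")
# 30 qubits: 16 GiB; 40 qubits: 16,384 GiB; 50 qubits: 16,777,216 GiB
```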

Common Gate Reference

Inside @cudaq.kernel functions, gates are bare function calls:

Gate call              Description
h(q)                   Hadamard
x(q)                   Pauli-X (NOT)
y(q)                   Pauli-Y
z(q)                   Pauli-Z
s(q)                   S gate
t(q)                   T gate
rx(theta, q)           X-rotation by theta
ry(theta, q)           Y-rotation by theta
rz(theta, q)           Z-rotation by theta
cx(control, target)    CNOT
cz(control, target)    Controlled-Z
swap(q0, q1)           SWAP
mz(q)                  Measure in Z basis
my(q)                  Measure in Y basis
mx(q)                  Measure in X basis

Limitations

  • GPU backends require an NVIDIA GPU with CUDA support. AMD and Apple Silicon GPUs are not supported.
  • The Python @cudaq.kernel decorator imposes restrictions on what Python can appear inside the kernel body: no arbitrary Python objects, no dynamic list comprehensions, and limited control flow compared to standard Python.
  • The Python API is newer than the C++ API. Some advanced features, including multi-QPU parallel execution (MQPU) and distributed simulation across nodes, require the C++ interface or specific container environments.
  • The community is smaller than Qiskit or PennyLane, so third-party tutorials and Stack Overflow answers are less abundant.
  • Hardware targets (IonQ, Quantinuum, ORCA) are reached through each provider’s own cloud service, so you still manage separate accounts and credentials per provider, and queue times and pricing are set by the providers rather than by CUDA Quantum.