Measuring Quantum Advantage: Benchmarks, Metrics, and What They Really Mean
Understand quantum volume, CLOPS, randomized benchmarking, and cross-entropy benchmarking: what each metric actually measures, how each can be gamed, and what genuine quantum advantage requires beyond laboratory demonstrations.
Why Benchmarking Is Hard
Classical computers have straightforward benchmarks: FLOPS for floating-point throughput, SPEC CPU for workload-representative performance, memory bandwidth for data-movement bound tasks. These metrics are imperfect but they correlate with real application performance.
Quantum computers are harder to benchmark for several reasons. The device space is multidimensional: qubit count, coherence time, gate fidelity, connectivity, and measurement fidelity all matter and can be traded against each other. Different applications weight these properties differently. And the field is commercially competitive, creating incentives to optimize metrics rather than genuine performance.
Understanding what each benchmark actually measures (and what it does not) is essential for evaluating vendor claims and choosing the right platform for a given problem.
Randomized Benchmarking: Gate Fidelity Without State Tomography
The most widely used gate-level benchmark is randomized benchmarking (RB). The idea is elegant: apply a random sequence of Clifford gates followed by the recovery gate that ideally returns the qubit to its initial state. If the gates were perfect, the final state would always be |0>. With noise, fidelity decays exponentially with the number of gates:
F(m) = A * p^m + B
where m is the number of Clifford gates, p is the depolarizing parameter, A and B absorb state preparation and measurement (SPAM) errors, and the average gate error is:
r = (1 - p) * (d - 1) / d
with d = 2^n for an n-qubit system.
RB is robust to SPAM errors (they are absorbed into A and B), insensitive to specific gate set choice (Clifford gates form a group, so random sequences are easy to construct), and gives a single interpretable number: the average Clifford gate error rate.
Typical values: superconducting single-qubit gates achieve 0.01% to 0.1% error (1e-4 to 1e-3). Two-qubit gates are 0.1% to 1% (1e-3 to 1e-2). Trapped ions achieve similar single-qubit rates but somewhat better two-qubit rates (0.1% to 0.5%) due to longer coherence times.
RB does not tell you about crosstalk (how one gate affects neighboring qubits), coherence limits (T1 and T2), or how errors accumulate in structured circuits. Interleaved RB adds a specific gate between the random Cliffords to isolate that gate’s error rate.
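To make the decay model concrete, here is a minimal sketch that fits F(m) = A * p^m + B to synthetic single-qubit RB data and extracts the average gate error via r = (1 - p)(d - 1)/d. The true decay parameter, sequence lengths, and noise level are illustrative assumptions, not data from any real device.

```python
import numpy as np
from scipy.optimize import curve_fit

def rb_decay(m, A, p, B):
    """RB survival probability model: F(m) = A * p^m + B."""
    return A * p**m + B

rng = np.random.default_rng(0)

# Synthetic single-qubit RB data with a known depolarizing parameter.
# A and B absorb SPAM errors, so they need not be 0.5 exactly.
p_true, A_true, B_true = 0.995, 0.45, 0.5
lengths = np.array([1, 5, 10, 25, 50, 100, 200, 400])
survival = rb_decay(lengths, A_true, p_true, B_true)
survival += rng.normal(0, 0.005, size=lengths.shape)  # shot noise

# Fit the exponential decay
popt, _ = curve_fit(rb_decay, lengths, survival,
                    p0=[0.5, 0.99, 0.5],
                    bounds=([0, 0.9, 0], [1, 1, 1]))
A_fit, p_fit, B_fit = popt

# Average Clifford gate error for a single qubit (d = 2)
d = 2
r = (1 - p_fit) * (d - 1) / d
print(f"fitted p = {p_fit:.5f}, average Clifford error r = {r:.2e}")
```

Note that SPAM errors shift A and B but leave the fitted p, and therefore r, unchanged; this is the robustness property mentioned above.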
Quantum Volume: A Holistic Single-Number Score
IBM introduced Quantum Volume (QV) in 2019 as a more comprehensive benchmark that captures the system’s ability to run useful circuits. QV is defined as:
QV = 2^n
where n is the largest square circuit (n qubits wide and n layers deep) that the device can run with heavy output probability greater than 2/3. The heavy output probability is the probability that a random circuit produces an output in the “heavy” set, the 50% of outcomes with the highest ideal probabilities.
The key idea: heavy output probability can be verified by classical simulation (which is feasible for small n) and provides a task that rewards both gate fidelity and effective qubit connectivity simultaneously. A device with many low-quality qubits scores lower than a device with fewer high-quality qubits.
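A small numerical sketch makes the heavy-set construction concrete. It assumes the standard approximation that ideal random-circuit output probabilities follow a Porter-Thomas (exponential) distribution; under that assumption an ideal sampler lands in the heavy set about 85% of the time ((1 + ln 2)/2 ≈ 0.85), while a fully depolarized device scores exactly 0.5, which is why 2/3 sits between them as the pass threshold.

```python
import numpy as np

rng = np.random.default_rng(1)

# Ideal output probabilities of a deep random circuit approximately follow
# the Porter-Thomas distribution: exponential samples, then normalized.
dim = 2**10
ideal = rng.exponential(size=dim)
ideal /= ideal.sum()

# Heavy set: the half of bitstrings with ideal probability above the median
heavy = ideal > np.median(ideal)

# An ideal device samples from `ideal`; a fully depolarized one is uniform
hop_ideal = ideal[heavy].sum()   # -> ~0.85, i.e. (1 + ln 2) / 2
hop_uniform = heavy.mean()       # -> 0.5: exactly half the outcomes

print(f"ideal HOP   = {hop_ideal:.3f}")
print(f"uniform HOP = {hop_uniform:.3f}")
```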
QV progression over time illustrates hardware improvement: IBM Quantum achieved QV 32 in 2020, QV 64 in 2021, QV 512 in 2022, and QV 4096 (n=12) in 2023. Honeywell/Quantinuum achieved QV 32768 (n=15) in 2022 with their trapped ion systems, reflecting the higher gate fidelity of ion trap hardware.
The limitation of QV: it does not scale much beyond n = 20 or so, because the classical simulation required for verification becomes exponentially expensive. And it aggregates performance over random square circuits rather than reflecting any specific algorithm: a device might have QV 128 but still fail at a particular 10-qubit algorithm due to specific connectivity constraints.
CLOPS: Speed Matters Too
A device with high QV but slow execution is useless for practical computation. IBM introduced Circuit Layer Operations Per Second (CLOPS) in 2021 to measure execution throughput:
CLOPS = (M * K * S * D) / time
where M = number of circuit templates (typically 100), K = parameter updates per template (typically 10), S = shots per circuit (typically 100), and D = number of QV layers per circuit (log2 of the measured QV). The numerator counts total circuit-layer operations (the "useful work") and the denominator is the wall-clock time including classical control overhead.
CLOPS captures the end-to-end latency including: circuit compilation, qubit reset, gate execution, measurement, and readout. Early superconducting devices had CLOPS in the hundreds. IBM’s Heron processors (2023) achieved CLOPS of around 10,000 to 100,000 depending on configuration.
For variational algorithms (VQE, QAOA), CLOPS is often the binding constraint because they require thousands of circuit executions with parameter updates per optimization step.
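As a back-of-the-envelope sketch of the definition above (M, K, and S follow IBM's published parameter choices; the 400-second wall-clock time and the QAOA workload are hypothetical figures for illustration):

```python
import math

def clops(m_templates, k_updates, s_shots, qv, wall_time_s):
    """CLOPS = (M * K * S * D) / time, with D = log2(QV) layers per circuit."""
    d_layers = int(math.log2(qv))
    return m_templates * k_updates * s_shots * d_layers / wall_time_s

# IBM's published parameter choices: M = 100 templates, K = 10 parameter
# updates, S = 100 shots; a QV 256 device runs D = 8 layers per circuit.
# The 400 s wall-clock time is a made-up figure for illustration.
rate = clops(100, 10, 100, qv=256, wall_time_s=400.0)
print(f"CLOPS = {rate:.0f}")  # 800,000 layer operations / 400 s = 2000

# Why CLOPS binds variational algorithms: a hypothetical QAOA run with
# 200 optimizer iterations x 4000 shots x 20-layer circuits
layer_ops = 200 * 4000 * 20
hours = layer_ops / rate / 3600
print(f"estimated wall-clock time at this CLOPS: {hours:.1f} hours")
```

At a few thousand CLOPS even a modest variational workload takes hours, which is why throughput, not just circuit quality, determines practical usefulness.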
Cross-Entropy Benchmarking: The Google Sycamore Approach
Cross-entropy benchmarking (XEB) was developed to characterize devices at scales where full state tomography or RB is infeasible. It measures how well a noisy device tracks the ideal output distribution of random circuits.
For a random circuit with ideal output probabilities p_U(x), the linear XEB fidelity is:
F_XEB = <2^n * p_U(x)> - 1
where the expectation is over experimental measurement outcomes x. For a perfect device, F_XEB = 1. For a fully depolarized device producing a uniform distribution, F_XEB = 0.
XEB is scalable: you can compute it for large circuits where you cannot verify individual output probabilities by using a subset of circuit instances where classical simulation is feasible. This is how Google validated their 2019 Sycamore “quantum supremacy” experiment; they ran 53-qubit random circuits, measured F_XEB values consistent with their error model, and extrapolated to conclude that the full circuits were executed correctly.
The controversy. Google claimed that simulating their circuits on classical hardware would take 10,000 years on the best available supercomputer. IBM researchers responded within days that an optimized classical simulation using tensor network methods could do it in 2.5 days with sufficient disk space, later improved to hours. In 2022, a Chinese team reported classical simulation in hundreds of seconds using improved contraction methods.
This does not invalidate the experiment (the circuits were run and F_XEB values were measured), but it dramatically reduced the quantum speedup claimed. The boundary of “classical hardness” for random circuit sampling is an active research area and has moved repeatedly.
Mirror Circuits: A More Practical Benchmark
Mirror circuits (Sandia National Laboratories, 2021) address a key weakness of QV and XEB: both require classical simulation for verification, which limits how far they scale.
A mirror circuit appends the inverse of a random circuit to itself:
C_mirror = C^dagger * C
If the device were perfect, the output would always be the input state (typically |00…0>). Because the ideal output distribution puts all its weight on that single state, verification does not require classical simulation: you just check how often you get the input state back.
Mirror circuit fidelity can be measured for circuits far beyond classical simulation limits, making it a scalable benchmark for 100+ qubit devices. Sandia's volumetric benchmarking framework uses mirror circuits to map out which circuit shapes (width versus depth) a device can run reliably.
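A toy density-matrix sketch shows why mirror-circuit verification is trivial. A single depolarizing step of assumed strength p stands in for all of the circuit's noise; the noiseless mirror returns the input state with probability exactly 1, and the noisy survival probability is (1 - p) + p/2^n.

```python
import numpy as np

rng = np.random.default_rng(3)

def haar_unitary(dim):
    """Haar-random unitary via QR decomposition with phase fix."""
    z = rng.standard_normal((dim, dim)) + 1j * rng.standard_normal((dim, dim))
    q, r = np.linalg.qr(z)
    return q * (np.diagonal(r) / np.abs(np.diagonal(r)))

n = 3
dim = 2**n
C = haar_unitary(dim)  # stands in for the random half of the mirror circuit

psi0 = np.zeros(dim, dtype=complex)
psi0[0] = 1.0  # input state |000>

# Perfect device: C^dagger C = I, so the mirror circuit returns |000> exactly
psi = C.conj().T @ (C @ psi0)
survival_perfect = abs(psi[0])**2
print(f"noiseless survival = {survival_perfect:.6f}")  # 1.000000

# Noisy device (sketch): one depolarizing step between the two halves
p = 0.1
rho = np.outer(C @ psi0, (C @ psi0).conj())
rho = (1 - p) * rho + p * np.eye(dim) / dim
rho = C.conj().T @ rho @ C
survival_noisy = np.real(rho[0, 0])
print(f"noisy survival     = {survival_noisy:.4f}")  # (1-p) + p/8 = 0.9125
```

The survival probability is read directly from the measurement statistics; no classical simulation of C is needed at any point.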
Python: Computing Quantum Volume from a Simulated Device
The following code simulates a noisy quantum device and computes its Quantum Volume score.
import numpy as np
def random_su4():
"""Generate a random SU(4) matrix using QR decomposition."""
Z = np.random.randn(4, 4) + 1j * np.random.randn(4, 4)
Q, R = np.linalg.qr(Z)
# Make Q unitary with det +1
d = np.diagonal(R)
Q = Q * (d / np.abs(d))
Q /= np.linalg.det(Q) ** 0.25
return Q
def apply_depolarizing(rho, p_error, n_qubits):
"""Apply depolarizing noise: mix with maximally mixed state."""
identity = np.eye(2**n_qubits) / (2**n_qubits)
return (1 - p_error) * rho + p_error * identity
def simulate_qv_circuit(n_qubits, n_layers, two_qubit_error, measurement_error=0.01):
    """
    Simulate a QV circuit on n_qubits with n_layers of random SU(4) gates.
    The ideal (noiseless) and noisy states are tracked separately: the heavy
    set must be defined from the ideal output distribution, not the noisy one.
    Returns (ideal_probs, noisy_probs).
    """
    dim = 2 ** n_qubits
    # Initial state |00...0> as a density matrix
    rho_ideal = np.zeros((dim, dim), dtype=complex)
    rho_ideal[0, 0] = 1.0
    rho_noisy = rho_ideal.copy()
    for layer in range(n_layers):
        # Random permutation of qubits, then pair them up
        perm = np.random.permutation(n_qubits)
        pairs = [(perm[i], perm[i + 1]) for i in range(0, n_qubits - 1, 2)]
        for q1, q2 in pairs:
            # Same random SU(4) applied to both copies; for small n we can
            # afford to build the full 2^n x 2^n unitary
            U_full = _embed_two_qubit_gate(random_su4(), q1, q2, n_qubits)
            rho_ideal = U_full @ rho_ideal @ U_full.conj().T
            rho_noisy = U_full @ rho_noisy @ U_full.conj().T
            # Two-qubit depolarizing noise after each gate, noisy copy only
            rho_noisy = apply_depolarizing(rho_noisy, two_qubit_error, n_qubits)
    # Measurement: the diagonal of each density matrix gives probabilities
    ideal_probs = np.clip(np.real(np.diag(rho_ideal)), 0, 1)
    ideal_probs /= ideal_probs.sum()
    probs = np.clip(np.real(np.diag(rho_noisy)), 0, 1)
    probs /= probs.sum()
    # Readout error: mix the noisy distribution with the uniform distribution
    noisy_probs = (1 - measurement_error) * probs + measurement_error * np.ones(dim) / dim
    return ideal_probs, noisy_probs
def _embed_two_qubit_gate(U2, q1, q2, n_total):
"""Embed a 2-qubit gate U2 acting on qubits q1 and q2 into full n_total qubit space."""
dim = 2 ** n_total
U_full = np.zeros((dim, dim), dtype=complex)
for i in range(dim):
for j in range(dim):
# Extract bits for qubits q1 and q2
bi1 = (i >> (n_total - 1 - q1)) & 1
bi2 = (i >> (n_total - 1 - q2)) & 1
bj1 = (j >> (n_total - 1 - q1)) & 1
bj2 = (j >> (n_total - 1 - q2)) & 1
# Check if other bits match
mask = ~((1 << (n_total - 1 - q1)) | (1 << (n_total - 1 - q2)))
if (i & mask) == (j & mask):
u_idx_row = bi1 * 2 + bi2
u_idx_col = bj1 * 2 + bj2
U_full[i, j] = U2[u_idx_row, u_idx_col]
return U_full
def compute_heavy_output_probability(ideal_probs, noisy_probs, n_samples=5000):
"""
Compute the heavy output probability:
fraction of samples from noisy device that fall in the heavy set.
Heavy set: outputs with ideal probability > median ideal probability.
"""
median_prob = np.median(ideal_probs)
heavy_set = set(np.where(ideal_probs > median_prob)[0])
# Sample from noisy distribution
outcomes = np.random.choice(len(noisy_probs), size=n_samples, p=noisy_probs)
heavy_count = sum(1 for o in outcomes if o in heavy_set)
return heavy_count / n_samples
def find_quantum_volume(max_n=6, two_qubit_error=0.005, n_circuits=20):
"""
Find the QV of a device with given two-qubit error rate.
    QV = 2^n where n is the largest n whose mean heavy output probability exceeds 2/3 with 2-sigma confidence.
"""
qv = 1
for n in range(2, max_n + 1):
hop_values = []
for _ in range(n_circuits):
ideal_probs, noisy_probs = simulate_qv_circuit(n, n, two_qubit_error)
hop = compute_heavy_output_probability(ideal_probs, noisy_probs)
hop_values.append(hop)
mean_hop = np.mean(hop_values)
std_hop = np.std(hop_values) / np.sqrt(n_circuits)
        passes = (mean_hop - 2 * std_hop) > 2 / 3  # 2-sigma confidence, per the QV protocol
print(f" n={n}: mean HOP={mean_hop:.4f} +/- {std_hop:.4f} -- {'PASS' if passes else 'FAIL'}")
if passes:
qv = 2 ** n
else:
break
return qv
# Test different error rates (small n for speed)
print("Quantum Volume vs Two-Qubit Gate Error Rate:")
print("=" * 55)
for error_rate in [0.001, 0.005, 0.01, 0.02, 0.05]:
print(f"\nTwo-qubit error rate: {error_rate*100:.1f}%")
np.random.seed(42)
qv = find_quantum_volume(max_n=5, two_qubit_error=error_rate, n_circuits=15)
print(f" --> Quantum Volume = {qv}")
Running this simulation reveals the relationship between hardware error rates and QV score. A device with 0.1% two-qubit gate error (world-class superconducting or trapped-ion hardware) passes every width up to the simulation cap (QV 32 at max_n=5), while higher error rates fail at progressively smaller n. The exponential scaling of QV with n means even small improvements in gate fidelity translate to significant QV gains.
What “Quantum Advantage” Actually Requires
Laboratory demonstrations of quantum supremacy or quantum advantage have a fundamental caveat: they use random circuits specifically designed to be hard to simulate classically. These circuits have no obvious practical utility. Quantum advantage for a useful problem is a much higher bar.
For quantum advantage on a useful problem, three conditions must hold simultaneously:
- The quantum algorithm provides a theoretical speedup. This is known for factoring (Shor), search (Grover), simulation of quantum systems, and a handful of others. For many proposed quantum algorithms, the speedup is conditional, asymptotic, or unproven.
- The hardware is good enough to run the algorithm before errors dominate. For Shor's algorithm on 2048-bit RSA, current estimates require millions of noisy physical qubits supporting thousands of error-corrected logical qubits. We have at most a few thousand noisy physical qubits today.
- The classical comparison is fair. Classical algorithms improve too. When a "quantum advantage" claim is made, the classical baseline matters enormously. If the comparison is against a naive classical algorithm rather than the best available one, the advantage may be illusory.
IBM's Heron processors (2023-2024) demonstrate high-fidelity operation on 100+ qubit systems. The Google Sycamore XEB experiments show that random circuit sampling is hard to simulate classically, but the gap is narrowing as classical simulation methods improve.
The most credible near-term path to quantum advantage is quantum simulation of quantum systems themselves: molecular electronic structure, condensed matter physics, and quantum chemistry. Here the problem is genuinely quantum mechanical, classical algorithms scale exponentially with system size, and approximate answers from a quantum computer may still be useful.
The benchmarks discussed in this tutorial (QV, CLOPS, RB, XEB) are all necessary tools for tracking hardware progress. None of them, individually, tells you whether a quantum computer will outperform classical computing on a problem you care about. That determination requires problem-specific analysis, classical algorithm comparison, and careful accounting of end-to-end latency including compilation, data input/output, and post-processing.
The field is progressing rapidly. What counts as a credible benchmark will continue to evolve as the hardware improves and as our understanding of the classical/quantum boundary sharpens.