Surface Code Simulation: Stim + PyMatching Tutorial

Stim is Google’s high-performance stabilizer circuit simulator, purpose-built for quantum error correction research. Unlike general-purpose simulators, Stim represents circuits as stabilizer tableaus and can simulate millions of shots per second on a CPU, making it practical to collect the statistics needed to measure logical error rates at realistic code distances.

PyMatching is the standard minimum-weight perfect matching (MWPM) decoder for surface codes. Together, Stim + PyMatching form the standard simulation stack for fault-tolerant quantum computing research.

Installation

pip install stim pymatching numpy matplotlib

What is a Stabilizer Circuit?

Before writing code, a quick refresher on why Stim is fast. A surface code operates on Pauli stabilizers: tensor products of Pauli operators I, X, Y, Z. The key property is that all operations in an error correction round (CNOT gates, Hadamard gates, and measurements in the X or Z basis) preserve the stabilizer structure. Stim tracks stabilizers directly rather than statevectors, making memory linear in qubit count rather than exponential.

For a distance-d surface code with d^2 + (d-1)^2 physical qubits:

Statevector simulator: needs 2^n complex amplitudes
Stim: needs n + n^2 bits for the stabilizer tableau

For d=3 (17 qubits): statevector needs 2^17 = 131,072 amplitudes, Stim needs ~289 bits.

Defining a Surface Code in Stim

Stim uses a domain-specific language for defining circuits. Here’s how to generate a distance-3 surface code circuit:

import stim

# Stim can generate standard codes programmatically
circuit = stim.Circuit.generated(
    code_task="surface_code:rotated_memory_z",
    distance=3,
    rounds=3,          # number of error correction rounds
    after_clifford_depolarization=0.001,   # gate error rate
    before_round_data_depolarization=0.001,
    before_measure_flip_probability=0.001,
)

print(circuit)        # Prints the full Stim circuit
print(f"Qubits: {circuit.num_qubits}")
print(f"Measurements: {circuit.num_measurements}")

The surface_code:rotated_memory_z task encodes a single logical qubit in the rotated surface code and runs syndrome measurement rounds, tracking whether the logical Z observable is flipped at the end.

Available code tasks:

surface_code:rotated_memory_z (Z-basis logical qubit)
surface_code:rotated_memory_x (X-basis logical qubit)
repetition_code:memory (simpler 1D code for testing)

Collecting Syndrome Data

Stim’s detector error model (DEM) is the key interface to decoders. A detector is a set of measurements whose XOR should be 0 in the absence of errors. When a detector fires (XOR = 1), that signals an error somewhere.

import stim
import numpy as np

# Generate d=3 surface code, 3 rounds, p=1% error rate
circuit = stim.Circuit.generated(
    "surface_code:rotated_memory_z",
    distance=3,
    rounds=3,
    after_clifford_depolarization=0.01,
    before_round_data_depolarization=0.01,
    before_measure_flip_probability=0.01,
)

# Compile a fast sampler
sampler = circuit.compile_detector_sampler()

# Sample 100,000 shots
# detection_events: (n_shots, n_detectors) bool array
# observable_flips: (n_shots, n_observables) bool array (the "answer")
detection_events, observable_flips = sampler.sample(
    shots=100_000,
    separate_observables=True,
)

print(f"Detection events shape: {detection_events.shape}")
print(f"Observable flips shape: {observable_flips.shape}")
print(f"Fraction of shots with any detection: {detection_events.any(axis=1).mean():.3f}")
print(f"Raw logical error rate (no decoding): {observable_flips[:, 0].mean():.4f}")

Decoding with PyMatching

The detection events tell us something went wrong but not where. The decoder’s job is to infer the most likely correction given the syndrome pattern. Minimum-weight perfect matching (MWPM) finds the minimum-weight set of errors consistent with the observed syndrome.

import pymatching

# Build the detector error model from the circuit
# This encodes the graph structure that the MWPM decoder needs
dem = circuit.detector_error_model(decompose_errors=True)
matcher = pymatching.Matching.from_detector_error_model(dem)

# Decode all shots at once (vectorized, fast)
predictions = matcher.decode_batch(detection_events)

# Compare predictions to actual observable flips
# A logical error occurs when the decoder's prediction differs from the actual flip
num_errors = np.sum(predictions != observable_flips)
logical_error_rate = num_errors / (len(predictions) * observable_flips.shape[1])

print(f"Logical error rate (with MWPM decoding): {logical_error_rate:.5f}")

For a distance-3 code at 1% physical error rate, the logical error rate with MWPM decoding is typically around 0.1-0.2%, about 5-10x lower than the physical rate, confirming the code is below threshold.

Measuring the Code Threshold

The threshold is the physical error rate below which increasing the code distance reduces the logical error rate. Below threshold, distance d+2 has lower logical error rate than distance d. Above threshold, the opposite is true.

import stim
import pymatching
import numpy as np
import matplotlib.pyplot as plt

distances = [3, 5, 7]
error_rates = np.logspace(-3, -1.3, 12)  # 0.001 to 0.05
shots = 50_000

results = {}

for d in distances:
    logical_errors = []
    for p in error_rates:
        circuit = stim.Circuit.generated(
            "surface_code:rotated_memory_z",
            distance=d,
            rounds=d,               # rounds = distance is standard
            after_clifford_depolarization=p,
            before_round_data_depolarization=p,
            before_measure_flip_probability=p,
        )
        dem = circuit.detector_error_model(decompose_errors=True)
        matcher = pymatching.Matching.from_detector_error_model(dem)

        sampler = circuit.compile_detector_sampler()
        detection_events, observable_flips = sampler.sample(
            shots=shots, separate_observables=True
        )
        predictions = matcher.decode_batch(detection_events)
        
        logical_err = np.sum(predictions != observable_flips) / (shots * observable_flips.shape[1])
        logical_errors.append(logical_err)
        print(f"d={d}, p={p:.4f}, logical_err={logical_err:.5f}")
    
    results[d] = logical_errors

# Plot threshold curve
plt.figure(figsize=(8, 5))
for d in distances:
    plt.semilogy(error_rates * 100, results[d], 'o-', label=f'd={d}')
plt.xlabel("Physical error rate (%)")
plt.ylabel("Logical error rate (log scale)")
plt.title("Surface Code Threshold")
plt.legend()
plt.grid(True, alpha=0.3)
plt.axvline(x=1.0, color='gray', linestyle='--', alpha=0.5, label='~1% threshold')
plt.tight_layout()
plt.savefig("surface_code_threshold.png", dpi=150)
plt.show()

# The curves cross near p ~ 0.9-1.1%: this is the threshold
# Below: higher d -> lower logical error rate (good!)
# Above: higher d -> higher logical error rate (bad)

The crossing point of these curves is the threshold. For the depolarizing noise model, the surface code threshold is approximately 1%. This is why achieving 99%+ two-qubit gate fidelity is the critical benchmark for hardware targeting fault-tolerant operation.

Logical Error Rate vs. Distance (Below Threshold)

Below threshold, the logical error rate drops exponentially with distance. A useful approximation:

logical_error_rate ≈ A * (p / p_threshold)^((d+1)/2)

Let’s verify this scaling:

# Fit the sub-threshold scaling at p = 0.005 (half of threshold)
p_fixed = 0.005
distances_fit = [3, 5, 7, 9, 11]
logical_errs_fit = []

for d in distances_fit:
    circuit = stim.Circuit.generated(
        "surface_code:rotated_memory_z",
        distance=d,
        rounds=d,
        after_clifford_depolarization=p_fixed,
        before_round_data_depolarization=p_fixed,
        before_measure_flip_probability=p_fixed,
    )
    dem = circuit.detector_error_model(decompose_errors=True)
    matcher = pymatching.Matching.from_detector_error_model(dem)
    sampler = circuit.compile_detector_sampler()
    detection_events, observable_flips = sampler.sample(100_000, separate_observables=True)
    predictions = matcher.decode_batch(detection_events)
    logical_errs_fit.append(np.sum(predictions != observable_flips) / (100_000 * observable_flips.shape[1]))

# Fit log(p_L) vs d
log_errs = np.log(logical_errs_fit)
coeffs = np.polyfit(distances_fit, log_errs, 1)
print(f"Exponential suppression factor per 2 distance steps: {np.exp(2 * coeffs[0]):.3f}")
# Expected: roughly (0.005/0.01)^1 = 0.5 per step of (d+1)/2

Speedup Notes

For production use:

Sample in batches matching your RAM: sampler.sample(shots=1_000_000) is fine with 8GB RAM for d=7
pymatching uses MWPM in C++, runs in seconds for millions of shots
For cross-validation, also try beliefmatching (belief propagation + MWPM) which can outperform pure MWPM at high distances

# Runtime benchmark
import time

circuit = stim.Circuit.generated(
    "surface_code:rotated_memory_z",
    distance=7,
    rounds=7,
    after_clifford_depolarization=0.005,
)
dem = circuit.detector_error_model(decompose_errors=True)
matcher = pymatching.Matching.from_detector_error_model(dem)
sampler = circuit.compile_detector_sampler()

t0 = time.time()
detection_events, observable_flips = sampler.sample(1_000_000, separate_observables=True)
predictions = matcher.decode_batch(detection_events)
t1 = time.time()

print(f"1M shots of d=7, decoded in {t1-t0:.1f}s")
# Typically 3-10 seconds on a laptop -- Stim is fast

What Comes Next

This tutorial covered the basics of simulating the surface code under a simple depolarizing noise model. Real quantum error correction research goes further:

Correlated noise: circuit-level noise with crosstalk and leakage requires modified decoding
Biased noise: some hardware has predominantly dephasing or relaxation errors; tailored codes (XZZX surface code) exploit this
Fault-tolerant gates: the tutorial only covers memory experiments; implementing logical gates requires transversal operations or code switching
Magic state distillation: implementing T gates fault-tolerantly requires factory circuits that can be simulated with Stim
Belief propagation decoders: outperform MWPM on higher-rate codes and near-threshold error rates

Stim’s documentation includes worked examples for all of these. The Stim paper and PyMatching paper are the primary references.