
Quantum GANs with PennyLane

Build a quantum generative adversarial network in PennyLane with a quantum generator and classical discriminator.

What you'll learn

  • PennyLane
  • quantum GAN
  • generative model
  • quantum machine learning
  • PyTorch

Prerequisites

  • Strong Python skills
  • Solid quantum computing foundations
  • Linear algebra and complex numbers

Classical GANs: A Quick Refresher

Before jumping into the quantum version, it helps to understand how a classical GAN works. A GAN consists of two neural networks locked in a competitive game. The generator G takes a random latent vector z (sampled from a simple distribution like a standard normal) and maps it to a data sample G(z). The discriminator D takes any sample (real or generated) and outputs a probability D(x) that the sample is real. The generator’s goal is to produce outputs that the discriminator cannot distinguish from real data.

Training alternates between two steps. First, the discriminator updates its weights to better classify real data as real and generated data as fake. Second, the generator updates its weights to make the discriminator’s job harder. The original GAN formulation expresses this as a minimax game:

min_G max_D  E[log D(x)] + E[log(1 - D(G(z)))]

At Nash equilibrium, the generator produces samples indistinguishable from the real data distribution, and the discriminator outputs 0.5 for every input (it cannot tell the difference). In practice, the standard minimax loss suffers from vanishing gradients early in training when the generator is poor and D(G(z)) is close to zero. The non-saturating variant flips the generator’s objective to maximize E[log D(G(z))] instead of minimizing E[log(1 - D(G(z)))]. This provides stronger gradients when the generator is far from optimal, and it is the variant we use in this tutorial.
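The difference in gradient signal is easy to see numerically. Below is a small PyTorch sketch with made-up discriminator logits: when D(G(z)) is near zero, the minimax loss's gradient with respect to the discriminator's logits is -D(G(z)) (vanishing), while the non-saturating loss's is -(1 - D(G(z))), close to -1.

```python
import torch

# Hypothetical discriminator logits on generated samples early in training:
# large and negative, so D(G(z)) = sigmoid(logits) is close to zero.
logits = torch.tensor([-9.0, -8.0, -7.0], requires_grad=True)
d_fake = torch.sigmoid(logits)

# Minimax generator objective: minimize log(1 - D(G(z))).
# Its gradient w.r.t. the logits is -D(G(z)), which vanishes when D rejects fakes.
torch.log(1 - d_fake).sum().backward()
minimax_grad = logits.grad.abs().max().item()

logits.grad = None
d_fake = torch.sigmoid(logits)

# Non-saturating objective: minimize -log D(G(z)).
# Its gradient w.r.t. the logits is -(1 - D(G(z))), close to -1 in the same regime.
(-torch.log(d_fake)).sum().backward()
ns_grad = logits.grad.abs().min().item()

print(minimax_grad)  # ~1e-3: almost no training signal
print(ns_grad)       # ~1.0: strong training signal
```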

Why Use a Quantum Generator?

A quantum generator replaces the classical neural network with a parameterized quantum circuit (PQC). The circuit acts on n qubits initialized in the |0…0> state, applies a sequence of parameterized gates, and then measures. The measurement probabilities form a discrete probability distribution over 2^n basis states.

For n = 4 qubits, this gives a 16-dimensional probability vector. The circuit uses O(n * L) parameters where L is the number of layers. A classical generator producing a 16-dimensional output needs at least 16 parameters (one per output bin) and typically many more for an expressive architecture. The quantum circuit, on the other hand, generates the full distribution through quantum interference and entanglement, which allows certain probability distributions to emerge naturally from the circuit structure.

The quantum advantage claim for QGANs is specific and narrow: for distributions that arise from quantum processes (such as the output statistics of a quantum system), a quantum circuit can generate them efficiently while a classical network may require exponentially many parameters. For classical distributions like the Gaussian target in this tutorial, there is no quantum advantage. We use a Gaussian here purely as a pedagogical example because it is easy to visualize and verify.

Practical applications where QGANs show genuine promise include generating molecular orbital distributions for drug discovery, simulating correlated asset returns in quantitative finance, and preparing quantum states that approximate ground states of physical systems. We discuss these in more detail at the end of the tutorial.

Setup

Install the required packages:

pip install pennylane pennylane-lightning torch numpy matplotlib

We use pennylane-lightning for the high-performance C++ simulator backend. For circuits with 4+ qubits, lightning.qubit is typically 10-100x faster than the pure-Python default.qubit device.

Defining the Quantum Generator

The generator is a 4-qubit PQC. Its output is the measurement probability vector over all 16 basis states, which we treat as a discrete probability distribution.

import pennylane as qml
import torch
import torch.nn as nn
import numpy as np

n_qubits = 4
n_layers = 3
dev = qml.device("lightning.qubit", wires=n_qubits)

@qml.qnode(dev, interface="torch")
def quantum_generator(params):
    """Strongly entangling ansatz as generator."""
    qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
    return qml.probs(wires=range(n_qubits))

def generator_params_shape():
    return qml.StronglyEntanglingLayers.shape(n_layers=n_layers, n_wires=n_qubits)

shape = generator_params_shape()
gen_params = torch.nn.Parameter(
    torch.tensor(np.random.uniform(0, 0.1, shape), dtype=torch.float64),
    requires_grad=True,
)

Notice that we initialize parameters in the range [0, 0.1] rather than [0, pi]. Small initial parameters keep the circuit close to the identity operation at the start of training, which helps avoid barren plateaus. We discuss this initialization strategy in detail later.

Understanding StronglyEntanglingLayers

The qml.StronglyEntanglingLayers template is a popular ansatz for variational quantum circuits. Each layer consists of two stages:

  1. Single-qubit rotations: Every qubit receives a general Rot gate, which decomposes as RZ(phi) * RY(theta) * RZ(omega). This applies three rotation angles per qubit per layer, giving the circuit full single-qubit expressibility.

  2. Entangling CNOT gates: A pattern of CNOT gates connects qubit pairs. The pattern changes with each layer so that after a few layers, every pair of qubits has been directly entangled. This is the “strongly entangling” property: unlike a linear nearest-neighbor chain, the circuit builds correlations across all qubit pairs quickly.

The parameter tensor has shape (n_layers, n_qubits, 3) because each qubit needs 3 rotation angles per layer. For our configuration of 3 layers and 4 qubits, the total parameter count is:

3 layers * 4 qubits * 3 angles = 36 parameters

These 36 parameters control the full 16-dimensional output distribution. The entanglement structure allows the circuit to represent distributions with complex correlations between basis states that would require many more parameters in a purely classical model with independent outputs.

The probs() Measurement and Its Gradient

The qml.probs(wires=range(n_qubits)) measurement returns the probability of observing each computational basis state. For 4 qubits, this is a 16-element vector [p_0000, p_0001, …, p_1111] that sums to 1. Each probability is computed as p_k = |<k|psi(theta)>|^2, where |psi(theta)> is the state produced by the parameterized circuit.
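As a concrete check of p_k = |&lt;k|psi>|^2, consider the uniform superposition produced by applying a Hadamard to every qubit. The sketch below works directly with the statevector in NumPy rather than through PennyLane, just to make the arithmetic visible:

```python
import numpy as np

# Statevector after applying H to each of 4 qubits starting from |0000>:
# every amplitude is 1/sqrt(16).
n = 4
state = np.full(2**n, 1 / np.sqrt(2**n))

# probs() computes p_k = |<k|psi>|^2 for each basis state k
probs = np.abs(state) ** 2

print(probs.sum())  # 1.0: probabilities always sum to one
print(probs[0])     # 0.0625 = 1/16: uniform over all 16 basis states
```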

This measurement is differentiable. PennyLane computes gradients using the parameter-shift rule: for each parameter theta_i, the gradient is calculated as:

d(p_k)/d(theta_i) = [p_k(theta_i + pi/2) - p_k(theta_i - pi/2)] / 2

This requires two circuit evaluations per parameter per output probability, but it gives exact analytical gradients (no finite-difference approximation needed). You can verify the gradient computation produces the expected shapes:

# Verify gradient computation (run once, then remove).
# Since the QNode uses the torch interface, we use torch's jacobian helper
from torch.autograd.functional import jacobian
jac = jacobian(quantum_generator, gen_params.detach())
print(f"Jacobian shape: {tuple(jac.shape)}")  # (16, n_layers, n_qubits, 3)
print(f"Per-parameter sums over all 16 outputs: {jac.sum(dim=0).flatten()[:3]}")  # near zero (probs sum to 1)

The Jacobian has shape (16, 3, 4, 3), meaning we get the derivative of each of the 16 output probabilities with respect to each of the 36 parameters. Since the probabilities must always sum to 1, the gradients across all 16 outputs for a given parameter must sum to zero.

Alternative Generator: A Simpler Custom Ansatz

The StronglyEntanglingLayers template is convenient, but building a custom ansatz can be more instructive and gives you finer control over the circuit structure. Here is a minimal 2-layer ansatz with fewer parameters:

@qml.qnode(dev, interface="torch")
def custom_generator(params):
    """Custom 2-layer ansatz with 12 parameters."""
    # Layer 1: RY and RZ rotations on each qubit
    for i in range(n_qubits):
        qml.RY(params[0, i, 0], wires=i)
        qml.RZ(params[0, i, 1], wires=i)
    # Entangling layer: linear chain of CNOTs
    for i in range(n_qubits - 1):
        qml.CNOT(wires=[i, i + 1])
    # Layer 2: RY rotations only
    for i in range(n_qubits):
        qml.RY(params[1, i, 0], wires=i)
    return qml.probs(wires=range(n_qubits))

# Parameter shape: layer 0 has 4*2=8 params, layer 1 has 4*1=4 params
# We pad to a uniform shape for convenience
custom_shape = (2, n_qubits, 2)
custom_params = torch.nn.Parameter(
    torch.tensor(np.random.uniform(0, 0.1, custom_shape), dtype=torch.float64),
    requires_grad=True,
)

This circuit uses only 12 active parameters (the params[1, :, 1] entries are unused and can be set to zero). It is less expressive than StronglyEntanglingLayers but trains faster and can still learn simple distributions. For the Gaussian target, this simpler ansatz often converges within 150 epochs.

We continue the rest of the tutorial using the StronglyEntanglingLayers version for its greater expressiveness.

Defining the Classical Discriminator

The discriminator is a small MLP that takes a probability vector and outputs a scalar score (probability of being “real”).

class Discriminator(nn.Module):
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.LeakyReLU(0.2),
            nn.Linear(32, 16),
            nn.LeakyReLU(0.2),
            nn.Linear(16, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.net(x.float())

discriminator = Discriminator(input_dim=2**n_qubits)

The architecture choices here matter. LeakyReLU (rather than ReLU) prevents dead neurons when the discriminator receives near-zero probability entries. The final Sigmoid squashes the output to [0, 1] for use with binary cross-entropy loss. The hidden layer sizes (32 and 16) are deliberately small because our input is only 16-dimensional; a larger discriminator would overpower the quantum generator too easily.
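The dead-neuron claim is easy to verify in isolation. This quick PyTorch check (a standalone illustration, not part of the training code) compares gradients for a negative pre-activation, which near-zero probability entries easily produce:

```python
import torch
import torch.nn as nn

x = torch.tensor([-0.5, 0.5], requires_grad=True)

# LeakyReLU passes a scaled gradient through negative inputs...
nn.LeakyReLU(0.2)(x).sum().backward()
leaky_grad = [round(g, 2) for g in x.grad.tolist()]
print(leaky_grad)  # [0.2, 1.0]

# ...while plain ReLU blocks it entirely, leaving a "dead" neuron.
x.grad = None
nn.ReLU()(x).sum().backward()
relu_grad = x.grad.tolist()
print(relu_grad)   # [0.0, 1.0]
```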

Real Data Distribution

We target a Gaussian-like distribution over the 16 basis states by assigning probabilities proportional to a Gaussian evaluated at each basis index.

def real_distribution(n_states=16):
    indices = np.arange(n_states)
    center = (n_states - 1) / 2.0
    sigma = n_states / 5.0
    probs = np.exp(-0.5 * ((indices - center) / sigma) ** 2)
    probs /= probs.sum()
    return torch.tensor(probs, dtype=torch.float32)

real_probs = real_distribution()

This produces a bell-shaped curve centered at index 7.5, with standard deviation 3.2 bins. The distribution is smooth and unimodal, which makes it a reasonable first target for the QGAN. More complex targets (bimodal distributions, sharp peaks) require deeper circuits and longer training.
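You can sanity-check those numbers directly. The snippet below recomputes the discretized distribution and its moments; note the empirical standard deviation comes out slightly under 3.2 because the tails are truncated at the edges of the 16-bin grid:

```python
import numpy as np

indices = np.arange(16)
center, sigma = 7.5, 3.2
probs = np.exp(-0.5 * ((indices - center) / sigma) ** 2)
probs /= probs.sum()

mean = (indices * probs).sum()
std = np.sqrt(((indices - mean) ** 2 * probs).sum())
print(round(mean, 2))  # 7.5: symmetric around the center
print(std < sigma)     # True: truncation at the grid edges narrows the distribution
```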

Understanding the Bhattacharyya Coefficient

We need a metric to evaluate how well the generator has learned the target distribution. The Bhattacharyya coefficient (BC) measures the overlap between two discrete probability distributions p and q:

BC(p, q) = sum_k sqrt(p_k * q_k)

Key properties of the Bhattacharyya coefficient:

  • BC = 1 means the distributions are identical
  • BC = 0 means the distributions have completely disjoint support (no overlap)
  • BC is always in the range [0, 1]
  • BC is symmetric: BC(p, q) = BC(q, p)
  • Unlike KL divergence, BC never diverges to infinity, even when one distribution assigns zero probability to a bin where the other does not

The Bhattacharyya coefficient relates to other distance measures. The Hellinger distance is H = sqrt(1 - BC), and the Bhattacharyya distance is D_B = -log(BC). For our training, the starting BC depends on initialization: the near-identity initialization we use concentrates probability on |0000>, which overlaps the centered Gaussian only weakly, whereas a deep random circuit’s roughly uniform output would already give BC around 0.8. Either way, successful training should push BC toward 0.95-0.99.

def bhattacharyya_coefficient(p, q):
    """Compute BC between two probability distributions."""
    return float(torch.sum(torch.sqrt(p * q)))
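The boundary cases are worth verifying by hand. Here is a dependency-free sketch of the same computation (mirroring the torch version, just with plain Python lists) on three toy distributions:

```python
import math

def bc(p, q):
    # Same formula as the torch version: sum_k sqrt(p_k * q_k)
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))

uniform = [0.25] * 4
peaked  = [1.0, 0.0, 0.0, 0.0]
shifted = [0.0, 0.0, 0.0, 1.0]

print(bc(uniform, uniform))  # 1.0: identical distributions
print(bc(peaked, shifted))   # 0.0: disjoint support
print(bc(uniform, peaked))   # 0.5: partial overlap, sqrt(0.25 * 1.0)
```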

Training Loop

We use the non-saturating GAN loss. The discriminator maximizes log D(real) + log(1 - D(fake)); the generator maximizes log D(fake) to get stronger gradients early in training.

Note the asymmetric learning rates: the generator uses lr=0.05 and the discriminator uses lr=0.005. The discriminator learns 10x slower, which prevents it from dominating the generator early in training. This is a common and important trick for stable GAN training.

gen_optimizer = torch.optim.Adam([gen_params], lr=0.05)
disc_optimizer = torch.optim.Adam(discriminator.parameters(), lr=0.005)
bce = nn.BCELoss()

n_epochs = 200

for epoch in range(n_epochs):
    # --- Train discriminator ---
    disc_optimizer.zero_grad()

    real_input = real_probs.unsqueeze(0)            # shape (1, 16)
    real_label = torch.ones(1, 1)
    d_real = discriminator(real_input)
    loss_real = bce(d_real, real_label)

    # .detach() is critical: we do not want gradients flowing back
    # through the generator when training the discriminator
    fake_probs = quantum_generator(gen_params).float().unsqueeze(0).detach()
    fake_label = torch.zeros(1, 1)
    d_fake = discriminator(fake_probs)
    loss_fake = bce(d_fake, fake_label)

    disc_loss = loss_real + loss_fake
    disc_loss.backward()
    disc_optimizer.step()

    # --- Train generator ---
    gen_optimizer.zero_grad()

    fake_probs_gen = quantum_generator(gen_params).float().unsqueeze(0)
    d_fake_for_gen = discriminator(fake_probs_gen)
    # Generator wants discriminator to output 1 (fooled)
    gen_loss = bce(d_fake_for_gen, torch.ones(1, 1))
    gen_loss.backward()
    gen_optimizer.step()

    if epoch % 40 == 0:
        bc = bhattacharyya_coefficient(fake_probs_gen.detach().squeeze(), real_probs)
        d_real_score = float(d_real.detach())
        d_fake_score = float(d_fake.detach())
        print(f"Epoch {epoch:3d} | D loss: {disc_loss.item():.4f} "
              f"| G loss: {gen_loss.item():.4f} | BC: {bc:.4f} "
              f"| D(real): {d_real_score:.3f} | D(fake): {d_fake_score:.3f}")

The monitoring output includes D(real) and D(fake) scores, which are essential for diagnosing training health. Healthy training shows D(real) gradually decreasing from ~1.0 toward ~0.5, while D(fake) gradually increases from ~0.0 toward ~0.5. If D(real) stays pinned at 1.0 and D(fake) stays at 0.0, the discriminator is dominating and you need to reduce its learning rate.

Training Stability: Diagnosing Failure Modes

GANs are notoriously difficult to train. Here are the main failure modes and how to detect them:

Mode collapse occurs when the generator learns to produce a single output that always fools the discriminator, instead of learning the full target distribution. Signs: the generator loss drops quickly, but the Bhattacharyya coefficient plateaus at a low value (0.5-0.7). The generator has found a “shortcut” rather than learning the true distribution. Fix: increase the discriminator’s capacity or add noise to the discriminator’s inputs.

Discriminator dominance occurs when the discriminator becomes too accurate too fast. The generator receives near-zero gradients because D(fake) is pinned at 0. Signs: D(real) stays near 1.0, D(fake) stays near 0.0, the discriminator loss collapses toward 0, and the generator loss climbs (with the non-saturating objective, -log D(fake) grows as D(fake) approaches 0) while its gradients vanish. Fix: reduce the discriminator’s learning rate, or train the generator for multiple steps per discriminator step.

Generator dominance is the opposite problem: the generator fools the discriminator so easily that it stops improving. Signs: G loss approaches 0 quickly, but BC does not reach 0.95+. The discriminator is too weak to provide useful gradient signal. Fix: train the discriminator for 2-5 steps per generator step.

Oscillation occurs when the two networks chase each other without converging. Signs: losses oscillate with increasing amplitude. Fix: reduce both learning rates and consider using gradient penalty (see the Wasserstein GAN section below).
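These symptoms can be turned into a crude automated check. The function below just encodes the rules of thumb from this section; the thresholds are illustrative heuristics, not canonical values:

```python
def diagnose(d_real, d_fake, bc):
    """Rough training-health heuristic from recent average scores.
    All thresholds are illustrative, not canonical."""
    if d_real > 0.95 and d_fake < 0.05:
        return "discriminator dominance: lower its learning rate"
    if d_fake > 0.45 and bc < 0.8:
        return "generator fooling D without learning the target: strengthen D"
    if abs(d_real - 0.5) < 0.1 and abs(d_fake - 0.5) < 0.1 and bc > 0.95:
        return "near equilibrium"
    return "training in progress"

print(diagnose(0.99, 0.01, 0.65))  # discriminator dominance: lower its learning rate
print(diagnose(0.52, 0.49, 0.97))  # near equilibrium
```

Calling this every few epochs with running averages of the logged scores gives an early warning before a failure mode fully sets in.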

Evaluating the Trained Generator

After training, compare the learned distribution to the target:

with torch.no_grad():
    learned = quantum_generator(gen_params).numpy()

import matplotlib.pyplot as plt

x = np.arange(2**n_qubits)
plt.figure(figsize=(8, 4))
plt.bar(x - 0.2, real_probs.numpy(), width=0.4, label="Target (Gaussian)", alpha=0.8)
plt.bar(x + 0.2, learned, width=0.4, label="Quantum Generator", alpha=0.8)
plt.xlabel("Basis state index")
plt.ylabel("Probability")
plt.title("QGAN: Target vs Learned Distribution")
plt.legend()
plt.tight_layout()
plt.savefig("qgan_result.png", dpi=150)

A Bhattacharyya coefficient above 0.95 indicates the learned distribution closely matches the target. Values above 0.99 are achievable with good hyperparameter choices and sufficient training epochs.

Loss Function Variant: Wasserstein GAN

The standard BCE loss relies on the Jensen-Shannon divergence between distributions, which can have poor gradient properties when the real and generated distributions have little overlap. The Wasserstein GAN (WGAN) uses the Earth Mover’s distance instead, which provides meaningful gradients even when distributions do not overlap.

The key changes for a WGAN:

  1. Remove the Sigmoid from the discriminator (it becomes a “critic” that outputs an unbounded score)
  2. Replace BCE loss with the Wasserstein distance estimate
  3. Add a gradient penalty to enforce the Lipschitz constraint

class WassersteinCritic(nn.Module):
    """Critic (not discriminator) for WGAN. No Sigmoid output."""
    def __init__(self, input_dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 32),
            nn.LeakyReLU(0.2),
            nn.Linear(32, 16),
            nn.LeakyReLU(0.2),
            nn.Linear(16, 1),  # No Sigmoid here
        )

    def forward(self, x):
        return self.net(x.float())

# Wasserstein losses (inside the training loop):
# Critic loss: -(critic(real) - critic(fake))  -> maximize the margin
# Generator loss: -critic(fake)                -> maximize the critic's score on fakes

The Wasserstein distance provides a smooth, continuous loss landscape that correlates with sample quality. In practice, WGAN training tends to be more stable than standard GAN training, though it requires careful tuning of the gradient penalty coefficient (typically lambda = 10). For quantum GANs at small qubit counts, the standard BCE loss usually works fine, but the Wasserstein variant becomes increasingly useful as the circuit size grows and the optimization landscape becomes more complex.
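The gradient penalty from step 3 is not shown above. Here is one common formulation (the WGAN-GP style, with the conventional lambda = 10 and random interpolation between real and fake samples), sketched for our 16-dimensional probability vectors; treat the details as a starting point rather than a fixed recipe:

```python
import torch
import torch.nn as nn

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP: penalize deviation of the critic's gradient norm from 1
    on random interpolates between real and fake samples."""
    eps = torch.rand(real.size(0), 1)
    interp = (eps * real + (1 - eps) * fake).requires_grad_(True)
    grads, = torch.autograd.grad(critic(interp).sum(), interp, create_graph=True)
    return lam * ((grads.norm(2, dim=1) - 1) ** 2).mean()

# Sanity check: a fixed linear critic w.x with ||w|| = 1 has gradient norm
# exactly 1 everywhere, so its penalty is zero.
critic = nn.Linear(16, 1, bias=False)
with torch.no_grad():
    critic.weight.zero_()
    critic.weight[0, 0] = 1.0
real, fake = torch.rand(8, 16), torch.rand(8, 16)
print(float(gradient_penalty(critic, real, fake)))  # 0.0
```

In the training loop, the penalty is simply added to the critic loss before calling backward().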

Quantum Discriminator Variant

So far we have used a classical MLP as the discriminator. For a fully quantum GAN, we can replace it with a quantum circuit. The quantum discriminator takes a probability vector as input (via amplitude encoding) and outputs a scalar expectation value as the real/fake score:

dev_d = qml.device("default.qubit", wires=n_qubits)

@qml.qnode(dev_d, interface="torch")
def quantum_discriminator(probs_input, params):
    """Quantum discriminator using amplitude encoding."""
    # Encode the probability vector into quantum amplitudes
    qml.AmplitudeEmbedding(features=probs_input, wires=range(n_qubits), normalize=True)
    # Trainable variational layer
    qml.StronglyEntanglingLayers(params, wires=range(n_qubits))
    # Output: expectation of Z on first qubit, range [-1, 1]
    return qml.expval(qml.PauliZ(0))

The output is a scalar in [-1, 1]. To use it with BCE loss, map it to [0, 1] via (output + 1) / 2. A fully quantum GAN (quantum generator + quantum discriminator) is theoretically appealing because it keeps the entire computation in quantum space. However, it is significantly harder to train in practice: both networks suffer from barren plateaus independently, and the interaction between two quantum optimizers makes the loss landscape even more complex. For pedagogical and practical purposes, the hybrid approach (quantum generator, classical discriminator) is the recommended starting point.
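Wiring this score into BCELoss takes one small helper. A minimal sketch of the mapping (the clamp is our addition, guarding against log(0) when the expectation saturates at exactly +/-1):

```python
import torch

def to_probability(expval, eps=1e-7):
    """Map a PauliZ expectation in [-1, 1] to (0, 1) for use with BCELoss.
    Clamping avoids an infinite loss when the expectation saturates at +/-1."""
    return torch.clamp((expval + 1) / 2, eps, 1 - eps)

print(float(to_probability(torch.tensor(0.0))))   # 0.5: undecided
print(float(to_probability(torch.tensor(1.0))))   # just under 1.0: "real"
print(float(to_probability(torch.tensor(-1.0))))  # just above 0.0: "fake"
```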

Barren Plateaus: The Central Challenge

The barren plateau problem is the most significant obstacle to scaling variational quantum algorithms, including QGANs. In a barren plateau, the gradient of the cost function vanishes exponentially with the number of qubits. Concretely, for a randomly initialized circuit on n qubits, the variance of any gradient component scales as O(1/2^n). At n = 20 qubits the variance is on the order of 10^-6, so typical gradient components are around 10^-3 and shrink by a further factor of ~sqrt(2) with every added qubit, quickly making gradient-based optimization essentially impossible.

Why Barren Plateaus Occur

Barren plateaus arise from the geometry of the unitary group. A randomly parameterized circuit on n qubits samples from a distribution that approximates the Haar measure on U(2^n). In this regime, the cost function becomes nearly flat everywhere except in an exponentially small region of parameter space. The circuit is so expressive that a random initialization lands in a featureless landscape with overwhelming probability.

Prevention Strategies

Several strategies mitigate barren plateaus:

Identity initialization: Set all parameters close to zero so the circuit starts near the identity operation. This places the initial state in a region with non-vanishing gradients. The parameter space near identity has structure because the circuit has not yet “scrambled” the input state.

# Good: small random initialization near identity
params = np.random.uniform(0, 0.1, shape)

# Bad: large random initialization (barren plateau territory)
params = np.random.uniform(0, np.pi, shape)

Layer-by-layer training: Train a single layer to convergence, then freeze it and add the next layer. Each new layer starts near identity (small parameters), so it inherits the non-trivial gradient landscape from the already-trained layers. This strategy dramatically improves convergence for circuits with 3+ layers.
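One simple way to implement this with the (n_layers, n_qubits, 3) parameter tensor from earlier is gradient masking: after each backward pass, zero out the gradients of the frozen layers. A sketch (the outer stage loop is omitted; gen_params refers to the parameter tensor defined in the setup section):

```python
import torch

n_layers, n_qubits = 3, 4

def freeze_mask(active_layer):
    """Gradient mask: 1 for the layer currently being trained, 0 for frozen layers."""
    mask = torch.zeros(n_layers, n_qubits, 3)
    mask[active_layer] = 1.0
    return mask

# Inside the training loop for stage k, after gen_loss.backward():
#     gen_params.grad *= freeze_mask(k)
mask = freeze_mask(1)
print(int(mask.sum()))  # 12: only layer 1's 4 qubits x 3 angles receive updates
```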

Local cost functions: Instead of measuring a global observable across all qubits, measure local observables (single-qubit expectations). The variance of gradients for local cost functions vanishes polynomially rather than exponentially with qubit count. For the QGAN, this means the discriminator should ideally process local features of the distribution rather than the full 2^n-dimensional vector.

Correlation-aware initialization: If you have domain knowledge about the target distribution, encode it in the initial parameters. For example, if the target is symmetric, initialize the circuit with parameters that produce a symmetric state.

Hardware-efficient ansatze: Match the circuit structure to the hardware’s native connectivity. Unnecessary SWAP gates to implement all-to-all connectivity add depth without adding expressiveness, increasing the risk of barren plateaus.

Common Mistakes

Here are pitfalls that frequently trip up newcomers to quantum GANs:

Missing .detach() when training the discriminator: When computing fake_probs for the discriminator update, you must call .detach() to prevent gradients from flowing back through the quantum generator. Without detach, PyTorch tries to build a computation graph through two separate backward passes, which causes errors or incorrect gradients.

# Correct: detach fake probabilities for discriminator training
fake_probs = quantum_generator(gen_params).float().unsqueeze(0).detach()

# Wrong: missing detach causes computation graph issues
fake_probs = quantum_generator(gen_params).float().unsqueeze(0)

Double-normalizing the output: qml.probs() already returns a normalized probability distribution that sums to 1. Applying softmax or dividing by the sum again changes the distribution shape and introduces unnecessary computation.

Using default.qubit for non-trivial circuits: The default.qubit device is a pure-Python simulator. For circuits with 4+ qubits, switch to lightning.qubit, which uses an optimized C++ backend and is 10-100x faster. The interface is identical; just change the device name.

Equal learning rates for generator and discriminator: Setting the same learning rate for both networks almost always leads to discriminator dominance because the classical MLP learns faster than the quantum circuit. Start with the discriminator learning rate 5-10x smaller than the generator’s, and adjust from there.

Large parameter initialization: Initializing circuit parameters uniformly in [0, pi] or [0, 2*pi] places the circuit in barren plateau territory from the start. Use a narrow range like [0, 0.1] for much better convergence.

Real-World QGAN Applications

QGANs are an active research area with several promising application domains:

Drug discovery: Molecular properties are governed by quantum mechanics, and the probability distributions describing molecular orbitals and electron densities arise naturally from quantum processes. QGANs can potentially generate valid molecular configurations by learning these quantum-native distributions directly, rather than encoding them classically first.

Quantitative finance: Correlated asset return distributions are challenging to model classically, especially in the tails where extreme events cluster. QGANs can learn joint probability distributions over multiple assets, with the entanglement structure naturally capturing correlations that require copulas or other complex constructions in classical models.

High-energy physics: Simulating particle collision events for detector calibration is computationally expensive. QGANs can learn to generate synthetic collision data that matches the statistical properties of real detector output, potentially accelerating Monte Carlo simulation pipelines at facilities like CERN.

Quantum state preparation: One of the most natural applications is using a QGAN to learn an unknown quantum state. Given samples from a target quantum state (measurement outcomes), the quantum generator learns to prepare a circuit that produces the same measurement statistics. This is useful for quantum chemistry and condensed matter simulations where the target state is complex but can be sampled experimentally.

All current practical demonstrations use small qubit counts (4-10 qubits) with classical simulation. Near-term quantum hardware introduces noise that further complicates training, though error mitigation techniques can partially compensate. Scaling QGANs to useful problem sizes remains an open challenge, with barren plateaus and hardware noise as the two primary obstacles.

Key Points

The quantum generator acts as a differentiable probabilistic program: gradients flow through qml.probs back to the circuit parameters via the parameter-shift rule. This makes the training loop structurally identical to an ordinary PyTorch training loop, with PennyLane handling the quantum gradient computation transparently.

The hybrid architecture (quantum generator, classical discriminator) offers the best balance of trainability and quantum expressiveness for current hardware and simulators. A fully quantum GAN is possible but significantly harder to optimize. Start with the hybrid approach, verify that training converges on simple targets, and then explore quantum discriminators once you are comfortable with the training dynamics.

For readers looking to go deeper, the PennyLane documentation includes demos on quantum GANs with different ansatz choices, and the original QGAN paper by Dallaire-Demers and Killoran (2018) provides the theoretical foundation for the approach used in this tutorial.
