
The HHL Algorithm: Quantum Linear Systems

A conceptual and mathematical introduction to the HHL algorithm for solving linear systems Ax=b, including its assumptions, exponential speedup claim, and important caveats.

What you'll learn

  • HHL
  • linear systems
  • quantum algorithm
  • quantum advantage
  • sparse matrices

Prerequisites

  • Strong Python skills
  • Solid quantum computing foundations
  • Linear algebra and complex numbers

What Problem Does HHL Solve?

Solving a linear system Ax = b is one of the most common tasks in science and engineering. Given a matrix A and a vector b, find the vector x that satisfies the equation. Classical algorithms like Gaussian elimination run in O(N^3) time for an N-dimensional system. Even the best iterative classical methods (conjugate gradient, GMRES) scale at least linearly in N. For large systems with N = 10^6 or beyond, these costs become substantial.

The HHL algorithm, introduced by Harrow, Hassidim, and Lloyd in 2009, promises exponential speedup under specific conditions. Instead of recovering x directly, HHL prepares the quantum state |x> proportional to the solution vector. When you only need to estimate some property of x (rather than all its components), HHL can be far faster than any classical approach.

This tutorial walks through the algorithm in detail: the mathematical setup, a concrete worked example, circuit-level intuition for each phase, resource scaling, practical limitations, and an honest assessment of where HHL stands today.

The Mathematical Setup

Assume A is an s-sparse Hermitian matrix (at most s non-zero entries per row) with condition number kappa. The condition number is the ratio of the largest to the smallest eigenvalue in absolute value: kappa = |lambda_max| / |lambda_min|. The input is a quantum state |b> encoding the right-hand side. The output is a quantum state |x> = A^(-1)|b> / ||A^(-1)|b>||.

HHL achieves this in time O(log(N) * s^2 * kappa^2 / epsilon), where epsilon is the desired precision. Compared with a dense classical solver at O(N^3), this is an exponential improvement in N, provided s and kappa grow at most polylogarithmically with N and the required precision epsilon is not too demanding.
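To get a feel for these scalings, the two cost models can be tabulated side by side. This is purely illustrative: all constant factors are set to 1, and the numbers are operation counts in an idealized model, not wall-clock times.

```python
import math

def classical_dense_cost(N):
    # Gaussian elimination: O(N^3) arithmetic operations
    return N**3

def hhl_cost(N, s, kappa, eps):
    # O(log(N) * s^2 * kappa^2 / eps), constants set to 1 for illustration
    return math.log2(N) * s**2 * kappa**2 / eps

for N in (10**3, 10**6, 10**9):
    c = classical_dense_cost(N)
    q = hhl_cost(N, s=10, kappa=100, eps=0.01)
    print(f"N={N:.0e}  classical ~{c:.1e}  HHL ~{q:.1e}")
```

The gap widens dramatically with N, but notice how sensitive the HHL estimate is to kappa and epsilon: squaring kappa or halving epsilon moves the crossover point substantially.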

Why A Must Be Hermitian

HHL as typically stated requires A to be Hermitian (equal to its conjugate transpose, A = A-dagger). This requirement exists for two reasons:

  1. Hermitian matrices have real eigenvalues, which the algorithm encodes as phases via quantum phase estimation.
  2. The Hamiltonian simulation subroutine (computing e^{iAt}) requires a Hermitian operator to generate a valid unitary evolution.

Many real-world linear systems involve non-Hermitian matrices. For a general matrix A, there is a standard trick: construct the augmented system

[[0,  A ],   [y]   [b]
 [A†, 0 ]] * [x] = [0]

This doubled system is Hermitian by construction. Its solution has the form (y, x) = (0, A^{-1}b): the auxiliary block y vanishes (the bottom rows force A†y = 0, so y = 0 when A is invertible), and the lower block x recovers the original solution. The cost is doubling the system size, which adds only one extra qubit to the quantum register.
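The dilation trick is easy to verify classically. The numpy sketch below builds the augmented matrix for a random non-Hermitian A and checks that the lower block of the solution solves the original system:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
A = rng.normal(size=(N, N)) + 1j * rng.normal(size=(N, N))  # generic non-Hermitian matrix
b = rng.normal(size=N)

# Augmented system: C = [[0, A], [A^dagger, 0]] is Hermitian by construction
C = np.block([[np.zeros((N, N)), A],
              [A.conj().T, np.zeros((N, N))]])
assert np.allclose(C, C.conj().T)

rhs = np.concatenate([b, np.zeros(N)])
z = np.linalg.solve(C, rhs)

# Solution has the form (0, x): upper block vanishes, lower block solves Ax = b
print(np.allclose(z[:N], 0))       # True
print(np.allclose(A @ z[N:], b))   # True
```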

The Three Phases of HHL

At a high level, HHL performs three steps: (1) decompose the input into eigenvalue components using quantum phase estimation, (2) invert each eigenvalue using controlled rotations, and (3) undo the decomposition and postselect to obtain the result.

Phase 1: Quantum Phase Estimation (QPE)

QPE is the engine that drives HHL. It extracts the eigenvalues of A and writes them into a quantum register.

The algorithm begins by decomposing |b> in the eigenbasis of A:

|b> = sum_j beta_j |u_j>

where |u_j> are eigenvectors of A with eigenvalues lambda_j. The coefficients beta_j are the overlaps of |b> with each eigenvector.
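This decomposition is ordinary linear algebra and can be checked classically. The sketch below uses a hypothetical 2x2 Hermitian A (chosen here only for illustration) to compute the overlaps beta_j and confirm both the reconstruction of b and the eigenbasis form of A^{-1}b:

```python
import numpy as np

# A small Hermitian example; columns of U are the eigenvectors u_j
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
b = np.array([1.0, 0.0])

lams, U = np.linalg.eigh(A)      # lams[j] = lambda_j, U[:, j] = u_j
betas = U.conj().T @ b           # beta_j = <u_j|b>

# |b> reconstructs as sum_j beta_j |u_j>
print(np.allclose(U @ betas, b))                                # True
# and A^{-1} b = sum_j (beta_j / lambda_j) |u_j>
print(np.allclose(U @ (betas / lams), np.linalg.solve(A, b)))   # True
```

The second check is exactly the identity HHL exploits: dividing each beta_j by lambda_j in the eigenbasis implements A^{-1}.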

QPE works by using the matrix exponential e^{iAt} as the controlled unitary. The key identity is that if A|u_j> = lambda_j|u_j>, then e^{iAt}|u_j> = e^{i * lambda_j * t}|u_j>. The eigenvalue lambda_j appears as a phase, and QPE extracts it.

The QPE circuit uses p precision qubits in an ancilla register (the “clock” register). These qubits are initialized in the |+> state via Hadamard gates. Controlled applications of e^{iAt} for different powers of t entangle the clock register with the eigenvalue information. An inverse quantum Fourier transform on the clock register then converts these phases into binary representations of the eigenvalues.

After QPE, the joint state of the clock register and system register is:

sum_j beta_j |tilde{lambda_j}> |u_j>

where |tilde{lambda_j}> is the p-bit binary approximation of lambda_j.

How many precision qubits? The number of precision qubits p directly controls accuracy. With p qubits, eigenvalues are resolved to precision 2^{-p}. To achieve overall error epsilon in the final solution, you need p = O(log(1/epsilon) + log(kappa)) precision qubits. More qubits give better eigenvalue resolution, which in turn gives more accurate matrix inversion.

Phase 2: Controlled Rotation

With eigenvalues encoded in the clock register, the algorithm applies a controlled rotation on a single flag qubit (also called the ancilla qubit). This is the step that performs the actual matrix inversion.

For each eigenvalue lambda_j stored in the clock register, the algorithm applies a Y-rotation of angle 2 * arcsin(C / lambda_j) to the flag qubit. The constant C is chosen so that C / lambda_j is at most 1 for all eigenvalues; typically C = lambda_min or smaller.

After this rotation, the state becomes:

|lambda_j>|0>_flag -> |lambda_j>(sqrt(1 - C^2/lambda_j^2)|0> + C/lambda_j|1>)_flag

The |1> component of the flag qubit now carries an amplitude of C/lambda_j. This is the critical insight: the rotation converts each eigenvalue lambda_j into its reciprocal 1/lambda_j in the amplitude of the flag qubit.
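The reciprocal-in-amplitude effect can be checked directly with the standard single-qubit Ry matrix:

```python
import numpy as np

def ry(theta):
    # Standard single-qubit Y-rotation matrix
    return np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
                     [np.sin(theta / 2),  np.cos(theta / 2)]])

C, lam = 1.0, 2.0
theta = 2 * np.arcsin(C / lam)
flag = ry(theta) @ np.array([1.0, 0.0])   # rotate the flag qubit from |0>

print(np.isclose(flag[1], C / lam))                       # amplitude of |1> is C/lambda: True
print(np.isclose(flag[0], np.sqrt(1 - (C / lam)**2)))     # amplitude of |0>: True
```

Since Ry(theta)|0> = cos(theta/2)|0> + sin(theta/2)|1>, choosing theta = 2*arcsin(C/lambda) places exactly C/lambda on the |1> component.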

Circuit structure. The controlled rotation is implemented as a sequence of controlled-Ry gates, where each qubit of the clock register controls a rotation by a different angle. The circuit looks like this (schematically):

Clock qubit 0: ----●-----------------
                   |
Clock qubit 1: ----|----●------------
                   |    |
Clock qubit 2: ----|----|---------●--
                   |    |         |
Flag qubit:    ---Ry---Ry--------Ry--
               (theta_0)(theta_1)(theta_2)

Each controlled rotation adds a contribution to the total rotation angle based on whether that clock qubit is |0> or |1>. Because these contributions add linearly in the clock bits while 2 * arcsin(C / lambda) is nonlinear in lambda, one rotation per clock qubit only approximates the target angle; an exact inversion uses a multiplexed (uniformly controlled) rotation with one angle per clock register value. Either way, the angles are chosen so that when the clock register encodes a binary value representing lambda_j, the total rotation is (approximately) 2 * arcsin(C / lambda_j).
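For the exact multiplexed version, one angle is tabulated per clock value m. A small sketch of that tabulation follows; the evolution time t and the encoding lambda = 2*pi*m / (2^p * t) are illustrative assumptions, not values fixed by the algorithm:

```python
import numpy as np

p = 3            # clock qubits
C = 1.0
t = np.pi / 8    # hypothetical QPE evolution time

# Clock value m encodes the eigenvalue lambda = 2*pi*m / (2^p * t).
# One rotation angle per clock value; m = 0 is skipped because it
# would correspond to lambda = 0, i.e. a singular matrix.
angles = {}
for m in range(1, 2**p):
    lam = 2 * np.pi * m / (2**p * t)
    angles[m] = 2 * np.arcsin(min(1.0, C / lam))  # clamp so arcsin is defined

for m, theta in angles.items():
    lam = 2 * np.pi * m / (2**p * t)
    print(f"clock={m:03b}  lambda={lam:6.3f}  theta={theta:.3f}")
```

Note how the angle shrinks as lambda grows: large eigenvalues get small reciprocal amplitudes, exactly as the 1/lambda inversion requires.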

Phase 3: Uncompute and Postselect

QPE is run in reverse (the inverse QPE circuit) to uncompute the eigenvalue register, disentangling it from the system register. After this step, the full state is:

sum_j beta_j |0…0>_clock (sqrt(1 - C^2/lambda_j^2)|0> + C/lambda_j|1>)_flag |u_j>_system

The algorithm then measures the flag qubit. When the measurement outcome is |1>, the remaining system register collapses to:

sum_j beta_j * (C / lambda_j) |u_j> (up to normalization)

This is exactly proportional to A^{-1}|b>, because A^{-1} acting on |b> = sum_j beta_j |u_j> gives sum_j (beta_j / lambda_j)|u_j>.

Postselection success probability. The probability of measuring |1> on the flag qubit is:

P(success) = sum_j |beta_j|^2 * C^2 / lambda_j^2

This probability is proportional to ||A^{-1}b||^2 * C^2. For a well-conditioned system where all eigenvalues are of similar magnitude, this probability is reasonable. For an ill-conditioned system (large kappa), the constraint C <= |lambda_min| forces C to be small, so when |b> overlaps the large-eigenvalue components the success probability can be as low as O(lambda_min^2 / lambda_max^2) = O(1/kappa^2). This means you may need to repeat the entire algorithm O(kappa^2) times to get a single successful postselection. This repetition cost is one of the hidden expenses of HHL and is already reflected in the O(kappa^2) factor of the runtime.
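To see the kappa dependence concretely, the sketch below evaluates the success-probability formula with eigenvalues rescaled into [1/kappa, 1] and C = 1/kappa (the usual normalization; the specific two-eigenvalue setup is an illustration):

```python
import numpy as np

def success_probability(betas, lams, C):
    # P(flag = 1) = sum_j |beta_j|^2 * C^2 / lambda_j^2
    return float(np.sum(np.abs(betas)**2 * C**2 / lams**2))

for kappa in (10.0, 100.0):
    lams = np.array([1.0 / kappa, 1.0])  # eigenvalues rescaled into [1/kappa, 1]
    C = 1.0 / kappa                      # C must not exceed lambda_min
    p_best = success_probability(np.array([1.0, 0.0]), lams, C)   # |b> on lambda_min
    p_worst = success_probability(np.array([0.0, 1.0]), lams, C)  # |b> on lambda_max
    print(f"kappa={kappa:5.0f}  best={p_best:.4f}  worst={p_worst:.6f}")
```

The best case stays at 1, but the worst case falls off as 1/kappa^2, which is the source of the O(kappa^2) repetition count.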

Putting It Together in Pseudocode

# HHL conceptual pseudocode (not runnable directly)
# Requires: quantum phase estimation, controlled rotations

def hhl_pseudocode(A, b_state, precision_qubits):
    """
    A: sparse Hermitian matrix (encoded as Hamiltonian)
    b_state: quantum state encoding b
    precision_qubits: number of qubits for eigenvalue register
    """
    # Step 1: prepare |b>
    # (assume b_state is already prepared)

    # Step 2: QPE on e^{iAt} with ancilla register
    qpe_register = quantum_phase_estimation(A, b_state, precision_qubits)
    # ancilla now encodes eigenvalues lambda_j

    # Step 3: controlled rotation on flag qubit
    flag = controlled_rotation(qpe_register, C=smallest_eigenvalue)
    # flag qubit encodes 1/lambda_j amplitude

    # Step 4: uncompute QPE
    inverse_qpe(qpe_register)

    # Step 5: postselect on flag qubit = |1>
    result = postselect(flag, outcome=1)
    # result register now holds |x> proportional to A^{-1}|b>

    return result

Worked Example: A 2x2 Diagonal System

To build concrete intuition, let us walk through HHL on the simplest possible case.

The problem. Solve Ax = b where:

A = [[2, 0],    b = [1, 0]
     [0, 1]]

The classical solution is x = A^{-1}b = [1/2, 0].

Step 1: Prepare |b>. The vector b = [1, 0] corresponds to the quantum state |0> on a single qubit. This is the default initial state of a qubit, so no preparation circuit is needed.

Step 2: Eigendecomposition. Since A is diagonal, the eigenvectors are the computational basis states:

  • |u_0> = |0> with eigenvalue lambda_0 = 2
  • |u_1> = |1> with eigenvalue lambda_1 = 1

The input |b> = |0> = 1.0 * |u_0> + 0.0 * |u_1>, so beta_0 = 1 and beta_1 = 0.

Step 3: QPE encodes eigenvalues. After QPE (using one precision qubit for simplicity), the clock register stores the eigenvalue. Since |b> is entirely in the lambda_0 = 2 eigenvector, the state after QPE is:

|tilde{lambda_0}>_clock |u_0>_system = |lambda=2>_clock |0>_system

In a more general case where |b> overlaps with both eigenvectors, we would have a superposition over both eigenvalue branches.

Step 4: Controlled rotation. We choose C = 1 (the smallest eigenvalue). The rotation angle for lambda_0 = 2 is:

theta = 2 * arcsin(C / lambda_0) = 2 * arcsin(1/2) = 2 * (pi/6) = pi/3

After the controlled rotation on the flag qubit:

|lambda=2>_clock (sqrt(1 - 1/4)|0> + 1/2|1>)_flag |0>_system

The amplitude of |1> on the flag qubit is C/lambda_0 = 1/2.

Step 5: Uncompute and postselect. Inverse QPE removes the clock register. Measuring the flag qubit and postselecting on |1> leaves the system register in the state |0>.

The final state is proportional to (1/2)|0> + 0|1>, which, after normalization, is just |0>. Reading this out, we get x proportional to [1/2, 0], the correct solution.

What if |b> were a superposition? If b = [1/sqrt(2), 1/sqrt(2)] (so |b> = |+>), then beta_0 = beta_1 = 1/sqrt(2). After controlled rotation, the flag qubit amplitudes would be C/lambda_0 = 1/2 for the first branch and C/lambda_1 = 1 for the second branch. Postselecting on |1> would yield the state proportional to (1/2)|0> + (1)|1>, giving x proportional to [1/2, 1], which matches A^{-1}b.
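Both cases can be verified classically. The sketch below applies the ideal HHL map beta_j -> beta_j * C / lambda_j for the diagonal A of this example (exact arithmetic, no circuit):

```python
import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
C = 1.0  # the smallest eigenvalue

def hhl_postselected_state(b):
    # Ideal HHL on a diagonal A: eigenvectors are the basis states,
    # postselected amplitudes are beta_j * C / lambda_j, then normalize
    lams = np.diag(A)
    amps = b * C / lams
    return amps / np.linalg.norm(amps)

# Case 1: b = [1, 0] -> x proportional to [1/2, 0], i.e. the state |0>
print(hhl_postselected_state(np.array([1.0, 0.0])))   # [1. 0.]

# Case 2: b = |+> -> x proportional to [1/2, 1]
out = hhl_postselected_state(np.array([1.0, 1.0]) / np.sqrt(2))
x = np.linalg.solve(A, np.array([1.0, 1.0]) / np.sqrt(2))
print(np.allclose(out, x / np.linalg.norm(x)))        # True
```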

Qiskit Implementation Sketch

The following Qiskit code constructs the key components of HHL for the 2x2 diagonal case. This is a high-level sketch showing how the circuit pieces connect; a production implementation would use a dedicated linear-solver library (older Qiskit releases shipped a built-in HHL class) or a more careful Hamiltonian simulation routine.

from qiskit import QuantumCircuit, QuantumRegister, ClassicalRegister
import numpy as np

# Registers:
#   clock: 1 qubit for eigenvalue encoding (simplified QPE)
#   system: 1 qubit for |b>
#   flag: 1 qubit for controlled rotation / postselection

clock = QuantumRegister(1, name='clock')
system = QuantumRegister(1, name='system')
flag = QuantumRegister(1, name='flag')
c_measure = ClassicalRegister(1, name='flag_meas')

qc = QuantumCircuit(clock, system, flag, c_measure)

# Step 1: Prepare |b> = |0> on the system register
# (already initialized to |0>, so nothing to do)

# Step 2: QPE -- for diagonal A with eigenvalues 2 and 1,
# we implement controlled-e^{iAt} directly with phase gates.
# Hadamard on clock qubit to create superposition
qc.h(clock[0])

# Controlled-U where U = e^{iAt}: when clock = |1>, the system picks up
# phase e^{i*2t} on |0> (lambda = 2) and e^{i*t} on |1> (lambda = 1).
# With t = pi, lambda = 2 maps to phase 2*pi (clock bit 0) and
# lambda = 1 maps to phase pi (clock bit 1).
t = np.pi  # chosen for this specific example
qc.p(2 * t, clock[0])            # phase e^{i*2t} on the clock = |1> branch
qc.cp(-t, clock[0], system[0])   # correct the phase to e^{i*t} when system = |1>

# Inverse QFT on clock register (for 1 qubit, this is just H)
qc.h(clock[0])

# Step 3: Controlled rotation on flag qubit
# If clock = |0> (encoding lambda=2), rotate by 2*arcsin(C/2)
# If clock = |1> (encoding lambda=1), rotate by 2*arcsin(C/1)
C = 0.5
theta_lambda2 = 2 * np.arcsin(C / 2)  # 2*arcsin(0.25)
theta_lambda1 = 2 * np.arcsin(C / 1)  # 2*arcsin(0.5)

# An unconditional Ry supplies the clock=|0> angle; a controlled-Ry adds
# the difference when clock=|1>. Rotations about the same axis compose
# additively, so each clock value gets exactly the angle it needs.
qc.ry(theta_lambda2, flag[0])
qc.cry(theta_lambda1 - theta_lambda2, clock[0], flag[0])

# Step 4: Uncompute QPE (exact inverse of step 2, gates in reverse order)
qc.h(clock[0])
qc.cp(t, clock[0], system[0])
qc.p(-2 * t, clock[0])
qc.h(clock[0])

# Step 5: Measure flag qubit, postselect on |1>
qc.measure(flag, c_measure)

print(qc.draw(output='text'))

This circuit demonstrates the structural flow: prepare |b>, run QPE, perform eigenvalue-dependent rotations on a flag qubit, uncompute QPE, and measure the flag. In practice, the Hamiltonian simulation step (controlled-e^{iAt}) requires a much more sophisticated decomposition for non-trivial matrices, and the number of clock qubits scales with the desired precision.
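The circuit can be cross-checked without any quantum SDK by multiplying out explicit gate matrices in numpy. The sketch below assembles its own gate sequence for the same 2x2 problem independently of the Qiskit snippet; the qubit ordering, the phase-gate decomposition of controlled-e^{iAt}, and the input b = |+> are choices made here for illustration:

```python
import numpy as np

# Qubit order: clock (x) system (x) flag; basis index = 4*c + 2*s + f
I2 = np.eye(2)
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
P0 = np.array([[1, 0], [0, 0]])   # projector |0><0| (for building controlled gates)
P1 = np.array([[0, 0], [0, 1]])   # projector |1><1|

def kron3(a, b, c):
    return np.kron(np.kron(a, b), c)

def ry(theta):
    return np.array([[np.cos(theta / 2), -np.sin(theta / 2)],
                     [np.sin(theta / 2),  np.cos(theta / 2)]])

def phase(phi):
    return np.diag([1, np.exp(1j * phi)])

t = np.pi
C = 0.5
th2 = 2 * np.arcsin(C / 2)   # rotation when clock = |0> (lambda = 2)
th1 = 2 * np.arcsin(C / 1)   # rotation when clock = |1> (lambda = 1)

# |b> = |+> on the system qubit; clock and flag start in |0>
psi = kron3([1, 0], np.array([1, 1]) / np.sqrt(2), [1, 0]).astype(complex)

gates = [
    kron3(H, I2, I2),                                  # H on clock
    kron3(phase(2 * t), I2, I2),                       # phase e^{i*2t} on clock |1> branch
    kron3(P0, I2, I2) + kron3(P1, phase(-t), I2),      # controlled-phase clock -> system
    kron3(H, I2, I2),                                  # inverse QFT (one qubit: just H)
    kron3(I2, I2, ry(th2)),                            # unconditional Ry(th2) on flag
    kron3(P0, I2, I2) + kron3(P1, I2, ry(th1 - th2)),  # controlled correction clock -> flag
    kron3(H, I2, I2),                                  # uncompute QPE (inverse of step 2)
    kron3(P0, I2, I2) + kron3(P1, phase(t), I2),
    kron3(phase(-2 * t), I2, I2),
    kron3(H, I2, I2),
]
for g in gates:
    psi = g @ psi

# Postselect flag = |1> (odd basis indices); rows index clock, columns system
post = psi[1::2].reshape(2, 2)
print(np.allclose(post[1], 0))        # clock uncomputed back to |0>: True
x = post[0] / np.linalg.norm(post[0])
print(np.round(np.abs(x), 4))         # proportional to [1/2, 1]: [0.4472 0.8944]
```

The postselected system state is proportional to [1/2, 1], matching A^{-1}b for b = |+> as in the worked example.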

Resource Scaling

The following table summarizes how the main circuit resources scale with problem parameters:

Resource | Scaling | Notes
---------|---------|------
System qubits | O(log N) | Encodes the N-dimensional vectors
Precision qubits | O(log(1/epsilon) + log(kappa)) | Clock register for QPE
Circuit depth | O(kappa * s * log(N) * poly(log(1/epsilon))) | Dominated by Hamiltonian simulation
Hamiltonian simulation queries | O(s^2 * kappa / epsilon) | Calls to the oracle for entries of A
Postselection repetitions | O(kappa^2) | Probability of flag = 1 is O(1/kappa^2)

Concrete estimate. For a realistic problem with N = 10^6, kappa = 100, epsilon = 0.01, and s = 10, the algorithm requires roughly:

  • System qubits: ~20 (since log2(10^6) is about 20)
  • Precision qubits: ~14 (about 7 bits for 1/epsilon, 7 for kappa)
  • Total logical qubits: several hundred (including ancillas for Hamiltonian simulation and error correction overhead)

This is well within reach of a future fault-tolerant quantum computer, but far beyond current NISQ hardware capabilities.
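The qubit counts in that estimate follow directly from the scalings above; a back-of-the-envelope calculator (constants set to 1, purely illustrative, and omitting the large ancilla overhead for Hamiltonian simulation):

```python
import math

def hhl_resource_estimate(N, kappa, eps):
    # Rough logical-qubit counts; not a rigorous bound
    system = math.ceil(math.log2(N))                                         # amplitude encoding
    precision = math.ceil(math.log2(1 / eps)) + math.ceil(math.log2(kappa))  # clock register
    return system, precision

sys_q, prec_q = hhl_resource_estimate(N=10**6, kappa=100, eps=0.01)
print(sys_q, prec_q)   # 20 14
```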

The State Loading Problem

One of the most significant practical challenges for HHL is preparing the quantum state |b> from classical data.

The problem. HHL assumes that |b> is available as a quantum state. But if b is a classical vector stored in ordinary memory, you must load it into a quantum register. For a general N-dimensional vector, this state preparation step requires O(N) gates in the worst case, because you need to encode N amplitudes into log(N) qubits.

This O(N) loading cost completely negates the exponential speedup: if loading the input already takes O(N) time, the total runtime is O(N), which classical algorithms can match or beat.
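A quick bit of arithmetic makes the tension concrete (illustrative only; real state-preparation circuits carry additional constant factors):

```python
# An arbitrary N-dimensional vector has N independent amplitudes, so a
# generic state-preparation circuit needs on the order of N parameterized
# gates -- this is where the O(N) loading cost comes from, even though
# the state itself fits into only log2(N) qubits.
for n_qubits in (10, 20, 30):
    N = 2**n_qubits
    print(f"{n_qubits} qubits -> N = {N:,} amplitudes -> ~{N:,} gates to load generically")
```

Thirty qubits hold a billion amplitudes, but loading an arbitrary billion-entry vector still takes on the order of a billion operations.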

Quantum RAM (QRAM). The proposed solution is quantum random-access memory, a hypothetical hardware device that loads classical data into quantum superposition in O(log N) time. QRAM would function like classical RAM but support quantum queries: given an address register in superposition, QRAM returns the corresponding data in superposition.

However, QRAM remains a speculative technology. No physical implementation exists as of 2026. Proposed architectures based on “bucket brigade” designs face serious noise and resource overhead concerns. Some researchers argue that building QRAM with sufficiently low error rates is as hard as building a full fault-tolerant quantum computer.

When state preparation is efficient. There are scenarios where |b> can be prepared without QRAM:

  • When b has efficient structure (e.g., uniform superposition, or b is the output of another quantum computation)
  • When the linear system arises naturally within a larger quantum algorithm (e.g., as a subroutine in quantum simulation)
  • When b can be prepared by a quantum circuit of polylog(N) depth

These special cases are where HHL is most likely to provide genuine speedup.

The Speedup is Conditional

HHL’s exponential speedup comes with four conditions that must all hold simultaneously.

Condition 1: Small condition number. If kappa is large (ill-conditioned matrix), the runtime scales as O(kappa^2), which can eliminate the speedup. Classical iterative methods like conjugate gradient also depend on kappa, but often with better constants. Preconditioning techniques that work classically may not have quantum analogues.

Condition 2: Sparse input. If A is dense, simulating e^{iAt} costs O(N^2) per step, negating the log(N) advantage. The sparsity s must be polylogarithmic in N for HHL to win.

Condition 3: Efficient state loading. Preparing the quantum state |b> from classical data requires QRAM or a state preparation circuit. As discussed above, these often cost O(N) in the worst case, removing the speedup entirely.

Condition 4: Quantum output suffices. HHL returns |x>, not the classical vector x. Reading out all N components requires O(N) measurements. HHL provides advantage only when you need a single expectation value like <x|M|x> for some observable M, or when |x> feeds into another quantum subroutine.

When HHL Provides Genuine Advantage

HHL is most useful when:

  1. You want to compute a single linear functional of x, such as a dot product with a known vector v.
  2. The matrix A is sparse and well-conditioned (kappa = polylog(N)).
  3. Quantum state preparation of |b> is efficient (e.g., b has special structure or arises from a quantum computation).
  4. The output feeds directly into another quantum subroutine, avoiding classical readout entirely.

The ideal use case is as a subroutine inside a larger quantum algorithm, where both the input and output remain in quantum form throughout.

Tang’s Dequantization and Its Consequences

In 2019, Ewin Tang (then an undergraduate at UT Austin) published a landmark result that reshaped the landscape of quantum machine learning. Tang showed that for the quantum recommendation systems problem, a proposed application of HHL-style techniques, there exists a classical algorithm that matches the quantum version’s polynomial scaling.

The key insight. Many quantum machine learning algorithms assume the ability to access data in superposition (via QRAM). Tang demonstrated that classical algorithms with “sampling access” to the data (the ability to draw random samples proportional to row norms or entry magnitudes) can achieve comparable performance. This type of sampling access is a classical analogue of QRAM and is often achievable in practice.

What “dequantization” means. A quantum algorithm is “dequantized” when a classical algorithm is found that matches its runtime up to polynomial factors, given the same type of data access assumptions. The quantum exponential speedup disappears; what remains is at best a polynomial advantage.

Dequantized algorithms. The following quantum ML algorithms, all based on HHL-style linear algebra subroutines, have been dequantized:

  • Quantum recommendation systems (Kerenidis and Prakash, 2017; dequantized by Tang, 2019)
  • Quantum principal component analysis (Lloyd, Mohseni, Rebentrost, 2014; dequantized by Tang, 2021)
  • Quantum supervised clustering (dequantized by Tang and others)
  • Quantum linear regression (Wiebe, Braun, Lloyd, 2012; classical analogue demonstrated)

These results do not mean HHL is useless. They mean that the specific machine learning applications where HHL was thought to provide exponential speedup do not, in fact, provide that speedup when the classical algorithm is given fair data access assumptions.

Applications: Current Status

The following table summarizes proposed HHL applications and where they stand after dequantization results:

Application | Theoretical speedup | Status
-----------|---------------------|-------
Recommendation systems | Exponential (claimed) | Dequantized by Tang 2019
Quantum linear regression | Exponential (claimed) | Dequantized
Quantum SVM | Exponential (claimed) | Dequantized
CFD (computational fluid dynamics) | Polynomial | Active research, not dequantized
Semidefinite programming | Quadratic | Active research
Finite element methods | Problem-dependent | Active research, promising for structured problems
Quantum chemistry (as subroutine) | Varies | Strong candidate; input/output stay quantum

The pattern is clear: applications where data starts and ends in classical form are vulnerable to dequantization. Applications where the linear system arises within a quantum computation (quantum chemistry, quantum simulation) are more likely to retain quantum advantage.

HHL on NISQ vs. Fault-Tolerant Hardware

NISQ hardware (today). HHL requires phase estimation with many ancilla qubits and circuit depths in the hundreds to thousands of gates. On current noisy intermediate-scale quantum (NISQ) devices with error rates around 10^{-3} per gate and limited qubit counts (order 100-1000 physical qubits), HHL provides no practical advantage. Noise overwhelms the computation well before any useful result emerges. Small proof-of-concept demonstrations of HHL on 3-4 qubits have been published, but these solve systems that a calculator can handle instantly.

Fault-tolerant hardware (future). Genuine HHL advantage requires error-corrected logical qubits with logical error rates of roughly 10^{-10} or below. For a problem of practical interest (N = 10^6, kappa = 100), you need several hundred logical qubits, each composed of thousands of physical qubits (depending on the error correction code). This puts useful HHL in the realm of fault-tolerant quantum computing, which is estimated to be 10-15 years away as of 2026.

Variational alternatives. Researchers have explored variational quantum linear solvers (VQLS) as NISQ-friendly alternatives to HHL. These replace the deep QPE circuit with a parameterized ansatz optimized classically. While VQLS can run on shallower circuits, it lacks the theoretical speedup guarantees of HHL and faces its own challenges with barren plateaus, convergence, and scaling.

Common Misconceptions

“HHL solves any linear system exponentially faster than classical algorithms.” This is false. The exponential speedup requires all four conditions simultaneously: sparse A, small condition number, efficient state loading, and quantum-compatible output. Violating any single condition can reduce or eliminate the advantage.

“HHL can run on today’s quantum computers.” This is false. Current NISQ hardware lacks the qubit count, coherence time, and gate fidelity needed for useful HHL. Small toy demonstrations (2x2, 4x4 systems) exist but provide no computational advantage.

“The output of HHL is the solution vector x.” This is misleading. HHL produces the quantum state |x>, not a classical vector. Extracting all N components of x from |x> requires O(N) measurements (via quantum state tomography), which negates the exponential speedup. HHL is valuable only when you need a summary statistic of x (an expectation value, a projection, a sampling outcome) or when |x> feeds into another quantum computation.

“Quantum computers try all solutions simultaneously.” This common oversimplification misrepresents how HHL works. The algorithm does create a superposition over eigenvalue branches, but the speedup comes from the structure of quantum interference and phase estimation, not from parallel evaluation. The controlled rotation step carefully encodes 1/lambda_j into amplitudes, and postselection extracts the correct answer probabilistically. It is a precisely choreographed sequence of operations, not a brute-force search.

Key Takeaways

HHL is a landmark theoretical result showing that quantum computers can solve linear systems exponentially faster than classical algorithms in an idealized setting. It introduced the paradigm of quantum linear algebra, inspiring an entire field of quantum machine learning algorithms.

In practice, the assumptions required (sparsity, small condition number, efficient state preparation, quantum output) are rarely all satisfied simultaneously. Tang’s dequantization results further narrowed the regime of quantum advantage by showing that many proposed applications have efficient classical analogues.

The most promising path forward for HHL is as a subroutine within larger quantum algorithms, particularly in quantum simulation and quantum chemistry, where the linear system arises naturally in quantum form and the solution remains in quantum form. For these applications, the state loading and output extraction problems vanish, and the genuine quantum advantage of HHL can shine.

Understanding HHL deeply, including both its power and its limitations, is essential for evaluating claims about quantum advantage in machine learning, scientific computing, and beyond.

Further Reading

  • Harrow, Hassidim, Lloyd. “Quantum algorithm for linear systems of equations.” Physical Review Letters 103.15 (2009).
  • Childs, Kothari, Somma. “Quantum algorithm for systems of linear equations with exponentially improved dependence on precision.” SIAM Journal on Computing 46.6 (2017).
  • Tang, Ewin. “A quantum-inspired classical algorithm for recommendation systems.” STOC 2019.
  • Aaronson, Scott. “Read the fine print.” Nature Physics 11.4 (2015). A clear discussion of HHL’s caveats.
  • Dervovic et al. “Quantum linear systems algorithms: a primer.” arXiv:1802.08227 (2018). An accessible technical overview.
