Efficient optimization for quadratic functions. Preconditioning.
Quick unlock: the prerequisites take real investment, but the final step is simple. Verify prerequisites first.
Conjugate Gradient (CG) is what you use when “gradient descent feels right” but you can’t afford to be slow: it keeps the simplicity of iterative descent, but exploits the structure of symmetric positive-definite (SPD) quadratics to converge in dramatically fewer steps—often without ever forming a matrix inverse, and sometimes without even storing the matrix explicitly.
For the SPD linear system A x = b (equivalently minimizing f(x)=½xᵀAx−bᵀx), CG builds a sequence of search directions that are mutually A-orthogonal (A-conjugate). Each step does an exact 1-D line minimization, and in exact arithmetic it reaches the exact solution in at most n steps. In practice, convergence depends on the condition number κ(A); preconditioning replaces the geometry with an easier one (via an SPD M≈A) to accelerate convergence.
CG is not a general-purpose optimizer. It is specialized for a very important class of problems:
1) Solve a linear system A x = b,
2) where A is symmetric positive-definite (SPD): Aᵀ = A and xᵀAx > 0 for all x ≠ 0.
This single assumption (SPD) is what makes CG fast, stable, and geometric.
The key equivalence is that solving A x = b is the same as minimizing a strictly convex quadratic: f(x) = ½xᵀAx − bᵀx.
Compute the gradient (showing the work):
For symmetric A, ∇(½xᵀAx) = Ax,
and ∇(bᵀx) = b.
So ∇f(x) = Ax − b.
Setting ∇f(x) = 0 gives A x = b. So the minimizer and the linear-system solution coincide.
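This identity is easy to sanity-check numerically. The sketch below (an illustration with an arbitrarily chosen small SPD matrix, not part of the derivation) compares the analytic gradient Ax − b against a central finite-difference estimate of ∇f:

```python
# Sketch: verify ∇f(x) = A x - b by finite differences for a 2x2 SPD A.

def f(x, A, b):
    """f(x) = 1/2 x^T A x - b^T x."""
    Ax = [A[0][0]*x[0] + A[0][1]*x[1], A[1][0]*x[0] + A[1][1]*x[1]]
    return 0.5*(x[0]*Ax[0] + x[1]*Ax[1]) - (b[0]*x[0] + b[1]*x[1])

A = [[4.0, 1.0], [1.0, 3.0]]   # SPD
b = [1.0, 2.0]
x = [0.3, -0.7]                # arbitrary test point
h = 1e-6

analytic = [A[0][0]*x[0] + A[0][1]*x[1] - b[0],
            A[1][0]*x[0] + A[1][1]*x[1] - b[1]]   # A x - b
numeric = []
for i in range(2):
    xp, xm = x[:], x[:]
    xp[i] += h
    xm[i] -= h
    numeric.append((f(xp, A, b) - f(xm, A, b)) / (2*h))

assert all(abs(a - n) < 1e-5 for a, n in zip(analytic, numeric))
```

Because f is quadratic, the central difference matches the analytic gradient up to floating-point roundoff.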
If you already know gradient descent, you might propose: xₖ₊₁ = xₖ − α∇f(xₖ) = xₖ + α(b − Axₖ).
That works, but on ill-conditioned problems it “zig-zags” along narrow valleys (elliptical level sets). Convergence becomes painfully slow because the best step size is constrained by the largest eigenvalue while progress is governed by the smallest.
CG fixes this by changing the question from “which way is steepest right now?” to:
“Which direction lets me make progress without undoing what I already fixed?”
CG is easiest to understand using two complementary pictures:
1) Residual view (linear algebra): the residual rₖ = b − Axₖ is the negative gradient −∇f(xₖ); driving it to zero solves the system.
2) Energy/inner-product view (geometry): SPD A defines an inner product ⟨u, v⟩_A = uᵀAv and the energy norm ||u||_A = √(uᵀAu).
CG chooses directions that are orthogonal under this A-inner-product (called A-conjugate), which is the right notion of “independent directions” for this quadratic.
Suggested SVG anchor for this node: /images/cg-geometry.svg
What it should depict (single figure with labeled layers): the elliptical level sets of f, the residual vectors rₖ, and the A-conjugate search directions pₖ.
Even if learners skip interactivity, this one image should connect “ellipses” ↔ “residuals” ↔ “conjugate directions.”
For the quadratic f(x) = ½xᵀAx − bᵀx, curvature is encoded by A. If you move along direction p, the second derivative along that line is pᵀAp. So if two directions are “independent with respect to curvature,” they should satisfy: pᵢᵀApⱼ = 0 for i ≠ j.
This is A-conjugacy (also called A-orthogonality).
Why it matters: if you minimize the quadratic along p₀ and then later move along some other direction p₁ that is A-conjugate to p₀, you do not spoil the minimization you already achieved along p₀ in the quadratic’s geometry.
In Euclidean space, we make directions orthogonal using dot products uᵀv = 0. For CG, the orthogonality condition is instead uᵀAv = 0.
Conceptually, CG is building an A-orthogonal basis of directions p₀, p₁, …, pₙ₋₁ and doing an exact line minimization along each.
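A-conjugacy is easy to check concretely. The sketch below uses the two directions that the worked example later in this lesson produces for A = [[4,1],[1,3]], and shows they are A-conjugate without being Euclidean-orthogonal:

```python
# Sketch: p0 and p1 satisfy p0^T A p1 = 0 (A-conjugate) even though
# their ordinary dot product is nonzero. Values taken from the worked
# CG example later in this lesson.

A = [[4.0, 1.0], [1.0, 3.0]]
p0 = [1.0, 2.0]          # first CG direction (= r0)
p1 = [-7/16, 3/8]        # second CG direction

def matvec(A, v):
    return [A[0][0]*v[0] + A[0][1]*v[1],
            A[1][0]*v[0] + A[1][1]*v[1]]

Ap1 = matvec(A, p1)
conj = p0[0]*Ap1[0] + p0[1]*Ap1[1]   # p0^T A p1
dot = p0[0]*p1[0] + p0[1]*p1[1]      # plain p0^T p1

assert abs(conj) < 1e-12   # A-conjugate
assert abs(dot) > 1e-3     # but NOT Euclidean-orthogonal
```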
A second major structural fact is that CG iterates live in expanding Krylov subspaces.
Let the initial residual be r₀ = b − Ax₀.
Define the Krylov subspace of order k: Kₖ(A, r₀) = span{r₀, Ar₀, A²r₀, …, Aᵏ⁻¹r₀}.
CG chooses xₖ from the affine space x₀ + Kₖ(A, r₀).
This matters because: each matrix–vector product grows the subspace by at most one dimension, and xₖ is the best approximation (in the A-norm of the error) within x₀ + Kₖ(A, r₀).
This is the conceptual reason for the celebrated finite-step property:
In exact arithmetic, CG finds the exact solution in at most n iterations (n = dimension), because after n steps the Krylov subspace can span the whole space.
In practice, roundoff and “near dependence” mean you often stop earlier based on tolerance.
A surprising and useful property (for exact arithmetic) is:
Residuals are mutually orthogonal in the standard dot product, even though search directions are A-orthogonal.
This is one of the best places to use visualization.
Interactive canvas suggestion (two linked panels):
Panel A: Level sets + directions (geometry)
1) “Residual orthogonality”: draw r₀, r₁, r₂ from a common origin and show right-angle markers for rᵢ ⟂ rⱼ.
2) “A-conjugacy”: draw pᵢ and show right-angle markers in the A-metric (see below).
Panel B: Krylov subspace growth (algebra) — show the spanning vectors r₀, Ar₀, A²r₀, … adding one dimension per iteration.
How to visualize A-orthogonality concretely:
A-orthogonality is Euclidean orthogonality after a change of variables. Since A is SPD, write A = CᵀC (e.g., Cholesky). Then pᵢᵀApⱼ = (Cpᵢ)ᵀ(Cpⱼ).
So pᵢ ⟂ₐ pⱼ means transformed vectors Cpᵢ are orthogonal in Euclidean space. The canvas can show both p and Cp.
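A small sketch of this change of variables, using a hand-rolled 2×2 Cholesky factor and the two conjugate directions from the worked example later in this lesson:

```python
# Sketch: after mapping p -> Cp with A = C^T C (C upper triangular),
# A-conjugate directions become plainly orthogonal.

import math

A = [[4.0, 1.0], [1.0, 3.0]]
# 2x2 Cholesky A = C^T C with C upper triangular:
c11 = math.sqrt(A[0][0])            # 2.0
c12 = A[0][1] / c11                 # 0.5
c22 = math.sqrt(A[1][1] - c12**2)   # sqrt(2.75)

def apply_C(p):
    return [c11*p[0] + c12*p[1], c22*p[1]]

p0 = [1.0, 2.0]
p1 = [-7/16, 3/8]    # the two A-conjugate CG directions

q0, q1 = apply_C(p0), apply_C(p1)
assert abs(q0[0]*q1[0] + q0[1]*q1[1]) < 1e-12   # Cp0 ⟂ Cp1 (Euclidean)
```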
CG is not “steepest descent with a fancy step size.” It is closer to: steepest descent plus a memory term, with each new direction corrected so it stays A-conjugate to all previous ones.
That combination produces fast progress on SPD problems.
We track three sequences: the iterates xₖ, the residuals rₖ = b − Axₖ, and the search directions pₖ.
Initialize: r₀ = b − Ax₀ and p₀ = r₀.
The first direction equals steepest descent. After that, directions become “corrected” to maintain conjugacy.
We move along pₖ: xₖ₊₁ = xₖ + αₖpₖ.
Choose αₖ to minimize f(xₖ + αpₖ).
Compute it explicitly. Let φ(α) = f(xₖ + αpₖ).
Start expanding: φ(α) = ½(xₖ + αpₖ)ᵀA(xₖ + αpₖ) − bᵀ(xₖ + αpₖ).
Differentiate w.r.t. α: φ′(α) = pₖᵀA(xₖ + αpₖ) − pₖᵀb.
Group terms: φ′(α) = α·pₖᵀApₖ − pₖᵀ(b − Axₖ).
Recall rₖ = b − Axₖ: φ′(α) = α·pₖᵀApₖ − pₖᵀrₖ.
Set φ′(α) = 0: αₖ = (pₖᵀrₖ)/(pₖᵀApₖ).
In standard CG (with pₖ built from rₖ), one can show pₖᵀrₖ = rₖᵀrₖ, giving the commonly used formula: αₖ = (rₖᵀrₖ)/(pₖᵀApₖ).
Rather than recompute rₖ₊₁ = b − Axₖ₊₁ from scratch, use: xₖ₊₁ = xₖ + αₖpₖ.
So rₖ₊₁ = b − A(xₖ + αₖpₖ) = rₖ − αₖApₖ.
This highlights the computational core: each iteration needs one matrix-vector product Apₖ.
We want pₖ₊₁ to be in the span of the new residual plus the previous direction, but also maintain A-conjugacy: pₖ₊₁ = rₖ₊₁ + βₖpₖ.
Choose βₖ so that pₖᵀApₖ₊₁ = 0.
Compute: pₖᵀA(rₖ₊₁ + βₖpₖ) = pₖᵀArₖ₊₁ + βₖ·pₖᵀApₖ = 0.
So βₖ = −(pₖᵀArₖ₊₁)/(pₖᵀApₖ).
With additional CG identities, this becomes the well-known Fletcher–Reeves-style expression: βₖ = (rₖ₊₁ᵀrₖ₊₁)/(rₖᵀrₖ).
(For SPD quadratics and exact arithmetic, these are consistent. In numerical practice, variants exist.)
Given SPD A, b, and x₀:
1) r₀ = b − Ax₀
2) p₀ = r₀
3) For k = 0, 1, 2, … until converged:
   αₖ = (rₖᵀrₖ)/(pₖᵀApₖ)
   xₖ₊₁ = xₖ + αₖpₖ
   rₖ₊₁ = rₖ − αₖApₖ
   βₖ = (rₖ₊₁ᵀrₖ₊₁)/(rₖᵀrₖ)
   pₖ₊₁ = rₖ₊₁ + βₖpₖ
CG is typically stopped when the residual is small: ||rₖ|| ≤ tol·||b||.
Because rₖ is the gradient (up to sign), small residual also means near-stationary for the quadratic.
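The recipe above can be sketched in pure Python (dense lists, no external libraries; a minimal illustration rather than a production solver):

```python
def cg(A, b, x0, tol=1e-10, max_iter=100):
    """Conjugate Gradient for a dense SPD system A x = b."""
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    x = list(x0)
    r = [bi - Avi for bi, Avi in zip(b, matvec(x))]   # r0 = b - A x0
    p = list(r)                                       # p0 = r0
    rr = dot(r, r)
    nb = dot(b, b) ** 0.5
    for _ in range(max_iter):
        if rr ** 0.5 <= tol * nb:                     # ||r|| <= tol * ||b||
            break
        Ap = matvec(p)                                # the one matvec per step
        alpha = rr / dot(p, Ap)                       # exact line minimization
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * Api for ri, Api in zip(r, Ap)]
        rr_new = dot(r, r)
        beta = rr_new / rr                            # Fletcher-Reeves ratio
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x

# The 2x2 SPD system from the worked example in this lesson:
x = cg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])
assert abs(x[0] - 1/11) < 1e-9 and abs(x[1] - 7/11) < 1e-9
```

In 2D this converges in two iterations, matching the finite-step property.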
CG is attractive when: A is large and sparse, A is available only implicitly through matrix–vector products, or factoring A (e.g., Cholesky) is too expensive.
Per iteration cost is dominated by one product Apₖ plus a handful of dot products and axpy operations.
Memory is small: a few vectors (often 4–6) of length n.
This is why CG is the workhorse for PDE discretizations, large least-squares normal equations (carefully), and many kernel / Gaussian process subproblems.
In exact arithmetic, CG solves A x = b in ≤ n steps. So why do people obsess over convergence speed?
Because in realistic problems: n is huge (often millions), so running n iterations is unaffordable, and floating-point roundoff erodes exact conjugacy long before step n.
So the practical question becomes:
How many iterations to reach tolerance?
This is governed largely by the condition number κ(A) = λₘₐₓ(A)/λₘᵢₙ(A).
When κ(A) is large, the ellipsoids are extremely stretched, and CG must work hard to correct error components along small-eigenvalue directions.
A classic bound (informal but insightful) is that the A-norm error decreases like: ||x* − xₖ||_A ≤ 2·((√κ − 1)/(√κ + 1))ᵏ·||x* − x₀||_A.
You don’t need to memorize this; the message is: the iteration count scales roughly with √κ(A), so shrinking κ directly buys speed.
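A quick sketch of what this implies, tabulating the per-step factor (√κ − 1)/(√κ + 1) for a few condition numbers:

```python
# Sketch: the per-iteration error-reduction factor from the classic CG
# bound creeps toward 1 as kappa grows — that is why ill-conditioned
# problems need many iterations.

rhos = []
for kappa in [1.0, 10.0, 100.0, 10000.0]:
    s = kappa ** 0.5
    rhos.append((s - 1) / (s + 1))

assert rhos[0] == 0.0                               # kappa = 1: immediate
assert all(a < b for a, b in zip(rhos, rhos[1:]))   # worsens as kappa grows
assert abs(rhos[-1] - 99/101) < 1e-12               # kappa = 10^4: factor ~0.98
```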
A preconditioner is an SPD matrix M that approximates A but is easier to invert (or solve with).
We want to solve A x = b
but instead run CG on a transformed system with better conditioning.
There are multiple equivalent views.
View 1 (left preconditioning): solve M⁻¹A x = M⁻¹b.
This does not preserve symmetry in the usual Euclidean inner product, but it can be handled carefully (see View 3).
View 2 (symmetric transformation): use M = CᵀC (Cholesky of M). Define x̂ = Cx.
Start with: A x = b.
Substitute x = C⁻¹x̂: AC⁻¹x̂ = b.
Multiply by C⁻ᵀ: C⁻ᵀAC⁻¹x̂ = C⁻ᵀb.
Now the matrix Â = C⁻ᵀAC⁻¹
is SPD, and hopefully has a much smaller condition number than A.
Run standard CG on Âx̂ = C⁻ᵀb and map back x = C⁻¹x̂.
View 3 (changed inner product): Preconditioned CG (PCG) can be seen as running CG where the notion of orthogonality is changed by M.
Instead of using raw residual rₖ, define a “preconditioned residual” zₖ = M⁻¹rₖ.
Think: zₖ is what you get after cheaply solving Mzₖ = rₖ.
Then PCG uses dot products like rₖᵀzₖ to measure progress.
Given SPD A, SPD M, b, x₀:
1) r₀ = b − Ax₀; solve Mz₀ = r₀; set p₀ = z₀
2) For k = 0, 1, 2, …:
   αₖ = (rₖᵀzₖ)/(pₖᵀApₖ)
   xₖ₊₁ = xₖ + αₖpₖ
   rₖ₊₁ = rₖ − αₖApₖ
   solve Mzₖ₊₁ = rₖ₊₁
   βₖ = (rₖ₊₁ᵀzₖ₊₁)/(rₖᵀzₖ)
   pₖ₊₁ = zₖ₊₁ + βₖpₖ
Compared to CG, you add one extra operation per iteration: apply M⁻¹ (implemented as solving Mz = r).
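A matching sketch of PCG in pure Python, using a Jacobi preconditioner M = diag(A) as an illustrative (and assumed) choice of M:

```python
def pcg(A, b, x0, tol=1e-10, max_iter=100):
    """Preconditioned CG with Jacobi M = diag(A) for a dense SPD system."""
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    def precond(r):
        return [r[i] / A[i][i] for i in range(n)]   # solve M z = r

    x = list(x0)
    r = [bi - Avi for bi, Avi in zip(b, matvec(x))]
    z = precond(r)
    p = list(z)
    rz = dot(r, z)                                  # r^T z replaces r^T r
    nb = dot(b, b) ** 0.5
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 <= tol * nb:
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * Api for ri, Api in zip(r, Ap)]
        z = precond(r)
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x

x = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0], [0.0, 0.0])
assert abs(x[0] - 1/11) < 1e-9 and abs(x[1] - 7/11) < 1e-9
```

Note the only structural changes versus plain CG: the extra `precond` solve and rᵀz replacing rᵀr.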
A good M satisfies a three-way tradeoff:
| Goal | What you want | Why it matters |
|---|---|---|
| Accurate | M ≈ A (spectrally) | Reduces κ of the preconditioned system |
| Cheap | Solve M z = r fast | Each iteration must remain inexpensive |
| Stable | M SPD and numerically well-behaved | Ensures PCG theory + avoids breakdown |
Common choices: Jacobi (M = diag(A)), incomplete Cholesky, SSOR, and multigrid or other problem-specific approximations.
Preconditioning’s geometric effect is easier than its algebra: it reshapes the stretched ellipsoidal level sets toward spheres, so curvature looks similar in every direction.
Interactive canvas suggestion (add-on mode): let the learner toggle M on and off and watch the level sets round out while the iteration count drops.
This makes the “conditioning → iteration count” link tangible.
Many problems arrive as least squares: min ||Bx − c||².
Normal equations yield SPD system: BᵀB x = Bᵀc.
You can run CG on BᵀB, but note: κ(BᵀB) = κ(B)², so the conditioning squares.
Often better: use methods like LSQR / LSMR (Krylov methods tailored to least squares), or precondition aggressively.
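The squaring effect is easy to see in a tiny sketch, using a hypothetical diagonal B (so singular values are just the diagonal entries):

```python
# Sketch: for B = diag(1, 10), the normal-equations matrix
# B^T B = diag(1, 100) has condition number kappa(B)^2.

sigma = [1.0, 10.0]                        # singular values of diagonal B
kappa_B = max(sigma) / min(sigma)          # kappa(B) = 10
eig_BtB = [s * s for s in sigma]           # eigenvalues of B^T B
kappa_BtB = max(eig_BtB) / min(eig_BtB)    # kappa(B^T B) = 100

assert kappa_BtB == kappa_B ** 2
```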
CG is a bridge between optimization and linear algebra: the same iteration is an exact-line-search descent method on a strictly convex quadratic and a Krylov-subspace solver for A x = b.
Solve A x = b with
A = [[4, 1], [1, 3]] (SPD), b = [1, 2]^T, start x0 = [0, 0]^T.
We will compute two CG iterations and observe that in 2D it reaches the exact solution in ≤ 2 steps (in exact arithmetic).
1) Initialize
x0 = [0, 0]^T
r0 = b - A x0 = b = [1, 2]^T
p0 = r0 = [1, 2]^T
2) Compute α0
First compute A p0:
A p0 = [[4,1],[1,3]] [1,2]^T = [4·1+1·2, 1·1+3·2]^T = [6, 7]^T
Now compute dot products:
r0^T r0 = 1^2 + 2^2 = 5
p0^T A p0 = p0^T (A p0) = [1,2]·[6,7] = 1·6 + 2·7 = 6 + 14 = 20
So α0 = (r0^T r0)/(p0^T A p0) = 5/20 = 1/4
3) Update x1 and r1
x1 = x0 + α0 p0 = [0,0] + (1/4)[1,2] = [1/4, 1/2]
r1 = r0 - α0 A p0 = [1,2] - (1/4)[6,7] = [1 - 6/4, 2 - 7/4] = [-1/2, 1/4]
4) Check residual orthogonality (r1 ⟂ r0)
r0^T r1 = [1,2]·[-1/2, 1/4] = -1/2 + 2*(1/4) = -1/2 + 1/2 = 0
So r0 ⟂ r1 in Euclidean inner product (as theory predicts).
5) Compute β0 and p1
r1^T r1 = (-1/2)^2 + (1/4)^2 = 1/4 + 1/16 = 5/16
β0 = (r1^T r1)/(r0^T r0) = (5/16)/5 = 1/16
p1 = r1 + β0 p0 = [-1/2, 1/4] + (1/16)[1,2]
= [-1/2 + 1/16, 1/4 + 2/16]
= [(-8+1)/16, (4+2)/16] = [-7/16, 6/16] = [-7/16, 3/8]
6) Compute α1
Compute A p1:
A p1 = [[4,1],[1,3]] [-7/16, 3/8]^T
First component: 4(-7/16) + 1(3/8) = -28/16 + 6/16 = -22/16 = -11/8
Second component: 1(-7/16) + 3(3/8) = -7/16 + 9/8 = -7/16 + 18/16 = 11/16
Now compute:
r1^T r1 = 5/16 (already)
p1^T A p1 = p1^T (A p1) = [-7/16, 3/8]·[-11/8, 11/16]
= (-7/16)(-11/8) + (3/8)(11/16)
= 77/128 + 33/128 = 110/128 = 55/64
So α1 = (r1^T r1)/(p1^T A p1) = (5/16)/(55/64) = (5/16)*(64/55) = 20/55 = 4/11
7) Update x2 and r2
x2 = x1 + α1 p1
= [1/4, 1/2] + (4/11)[-7/16, 3/8]
Compute the increment:
(4/11)*(-7/16) = -28/176 = -7/44
(4/11)*(3/8) = 12/88 = 3/22
So x2 = [1/4 - 7/44, 1/2 + 3/22]
Convert:
1/4 = 11/44, so 11/44 - 7/44 = 4/44 = 1/11
1/2 = 11/22, so 11/22 + 3/22 = 14/22 = 7/11
Thus x2 = [1/11, 7/11]
Now residual:
r2 = r1 - α1 A p1
= [-1/2, 1/4] - (4/11)[-11/8, 11/16]
Increment:
(4/11)*(-11/8) = -44/88 = -1/2
(4/11)*(11/16) = 44/176 = 1/4
So r2 = [-1/2 - (-1/2), 1/4 - 1/4] = [0, 0]
Insight: In 2D, CG terminates in at most 2 steps (exact arithmetic) because it constructs two A-conjugate directions spanning the space. The residuals became orthogonal (r0 ⟂ r1), and after the second step the residual is exactly zero, meaning A x2 = b. The solution is x* = [1/11, 7/11].
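To double-check the hand arithmetic, a short script with exact rational arithmetic (Python's fractions module) replays both iterations:

```python
# Sketch: replay the two hand-computed CG iterations for
# A = [[4,1],[1,3]], b = [1,2], x0 = [0,0] with exact fractions.

from fractions import Fraction as F

A = [[F(4), F(1)], [F(1), F(3)]]
b = [F(1), F(2)]

def matvec(v):
    return [A[0][0]*v[0] + A[0][1]*v[1], A[1][0]*v[0] + A[1][1]*v[1]]

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

x, r = [F(0), F(0)], list(b)
p = list(r)
for _ in range(2):
    Ap = matvec(p)
    alpha = dot(r, r) / dot(p, Ap)
    x = [x[0] + alpha*p[0], x[1] + alpha*p[1]]
    r_new = [r[0] - alpha*Ap[0], r[1] - alpha*Ap[1]]
    beta = dot(r_new, r_new) / dot(r, r)
    p = [r_new[0] + beta*p[0], r_new[1] + beta*p[1]]
    r = r_new

assert x == [F(1, 11), F(7, 11)]   # exact solution after 2 steps
assert r == [F(0), F(0)]           # residual exactly zero
```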
Consider A = diag(1, 100) and b arbitrary. This is SPD but ill-conditioned: κ(A)=100. We show how a simple preconditioner makes the problem perfectly conditioned.
We focus on the geometry/conditioning rather than running many iterations.
1) Compute condition number
Eigenvalues of A are 1 and 100.
So κ(A) = 100/1 = 100.
Level sets of f(x)=½ x^T A x - b^T x are very elongated ellipses (narrow valley).
2) Choose a preconditioner M ≈ A
Let M = diag(1, 100) = A itself (an “ideal” but usually impractical preconditioner).
Then M is SPD and easy to invert here.
3) Form the symmetrically preconditioned matrix
Take C such that M = C^T C. Since M is diagonal with positive entries, pick C = diag(1, 10).
Then
Â = C^{-T} A C^{-1}.
But A = M = C^T C, so:
Â = C^{-T} (C^T C) C^{-1} = (C^{-T} C^T)(C C^{-1}) = I.
4) Condition number after preconditioning
κ(Â) = κ(I) = 1.
So in transformed coordinates, the quadratic has spherical level sets and CG converges in essentially one step (again, in idealized exact arithmetic).
5) Interpret for realistic M
In real problems, M is not exactly A, but if M captures most of A’s spectral shape, Â’s eigenvalues cluster more tightly and κ(Â) drops dramatically—leading to far fewer CG iterations.
Insight: Preconditioning is best understood as changing the metric so the ellipsoids become closer to circles. CG’s iteration count is controlled by the spread of eigenvalues; reducing that spread (lower κ) is the direct route to faster convergence.
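The diagonal case makes this checkable in a few lines (eigenvalues of a diagonal SPD matrix are just its diagonal entries — a sketch, not a general eigenvalue routine):

```python
# Sketch: kappa before/after symmetric preconditioning for A = diag(1, 100)
# with the ideal choice M = A, so C = diag(1, 10).

A_diag = [1.0, 100.0]
kappa_A = max(A_diag) / min(A_diag)
assert kappa_A == 100.0

C_diag = [a ** 0.5 for a in A_diag]                    # M = A = C^T C
A_hat = [a / (c * c) for a, c in zip(A_diag, C_diag)]  # C^{-T} A C^{-1}
assert A_hat == [1.0, 1.0]                             # identity: kappa = 1
```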
Use the same system as Example 1 but apply a simple Jacobi preconditioner M = diag(A) = diag(4,3).
We show one iteration of PCG to see how z = M^{-1} r enters the formulas.
A = [[4,1],[1,3]], b = [1,2]^T, x0=[0,0]^T.
1) Initialize
r0 = b - A x0 = [1,2]^T
Solve M z0 = r0 with M=diag(4,3):
z0 = [1/4, 2/3]^T
p0 = z0
2) Compute α0
Compute A p0:
A p0 = [[4,1],[1,3]] [1/4, 2/3]^T
First component: 4(1/4) + 1(2/3) = 1 + 2/3 = 5/3
Second component: 1(1/4) + 3(2/3) = 1/4 + 2 = 9/4
Compute dot products:
r0^T z0 = [1,2]·[1/4,2/3] = 1/4 + 4/3 = (3/12 + 16/12)=19/12
p0^T A p0 = [1/4,2/3]·[5/3,9/4] = (1/4)(5/3) + (2/3)(9/4) = 5/12 + 18/12 = 23/12
So α0 = (r0^T z0)/(p0^T A p0) = (19/12)/(23/12) = 19/23
3) Update x1 and r1
x1 = x0 + α0 p0 = (19/23)[1/4,2/3] = [19/92, 38/69]
r1 = r0 - α0 A p0 = [1,2] - (19/23)[5/3,9/4]
Compute components:
First: 1 - (19/23)(5/3) = 1 - 95/69 = (69-95)/69 = -26/69
Second: 2 - (19/23)(9/4) = 2 - 171/92 = (184-171)/92 = 13/92
4) Compute z1 and β0
Solve M z1 = r1:
z1 = [(-26/69)/4, (13/92)/3] = [-26/276, 13/276] = [-13/138, 13/276]
Compute β0 = (r1^T z1)/(r0^T z0)
First compute r1^T z1:
r1^T z1 = [-26/69, 13/92]·[-13/138, 13/276]
= (-26/69)(-13/138) + (13/92)(13/276)
= 338/(9522) + 169/(25392)
Simplify:
338/9522 = 169/4761
The two fractions don’t share a convenient common denominator, so keep numeric values:
169/4761 ≈ 0.03550
169/25392 ≈ 0.00666
So r1^T z1 ≈ 0.04216
r0^T z0 = 19/12 ≈ 1.5833
Thus β0 ≈ 0.04216 / 1.5833 ≈ 0.0266
5) Update p1
p1 = z1 + β0 p0
This shows PCG mixes the new preconditioned residual with the previous direction, analogous to CG but using M-weighted inner products.
Insight: In PCG, the key scalar is rᵀz where z=M^{-1}r. This is the natural measure of residual size in the preconditioned geometry, and it replaces rᵀr throughout the recurrence.
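The key PCG quantities above can be verified with exact rational arithmetic (a sketch using Python's fractions module):

```python
# Sketch: check the hand-computed PCG step for A = [[4,1],[1,3]],
# b = [1,2], M = diag(4,3) with exact fractions.

from fractions import Fraction as F

A = [[F(4), F(1)], [F(1), F(3)]]
b = [F(1), F(2)]
Mdiag = [F(4), F(3)]

def matvec(v):
    return [A[0][0]*v[0] + A[0][1]*v[1], A[1][0]*v[0] + A[1][1]*v[1]]

def dot(u, v):
    return u[0]*v[0] + u[1]*v[1]

r0 = list(b)
z0 = [r0[i] / Mdiag[i] for i in range(2)]   # solve M z0 = r0
p0 = list(z0)
Ap0 = matvec(p0)
alpha0 = dot(r0, z0) / dot(p0, Ap0)
assert alpha0 == F(19, 23)

r1 = [r0[i] - alpha0*Ap0[i] for i in range(2)]
assert r1 == [F(-26, 69), F(13, 92)]

z1 = [r1[i] / Mdiag[i] for i in range(2)]   # solve M z1 = r1
beta0 = dot(r1, z1) / dot(r0, z0)
assert abs(float(beta0) - 0.0266) < 1e-3    # matches the numeric estimate
```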
CG is designed for SPD systems A x = b, equivalently minimizing the strictly convex quadratic f(x)=½xᵀAx−bᵀx.
Residuals rₖ = b − Axₖ are negative gradients; CG uses them to build improved search directions.
CG search directions are A-conjugate: pᵢᵀ A pⱼ = 0 for i≠j, enabling independent progress on the quadratic’s curvature.
Each iteration performs an exact 1-D minimization along pₖ, with αₖ = (rₖᵀrₖ)/(pₖᵀA pₖ) in standard CG.
In exact arithmetic, CG reaches the exact solution in at most n steps because it searches over expanding Krylov subspaces Kₖ(A,r₀).
Practical convergence depends strongly on conditioning; iteration counts track eigenvalue spread and κ(A).
Preconditioning (PCG) introduces an SPD M≈A, using z=M^{-1}r and replacing rᵀr with rᵀz, often reducing iterations dramatically.
The most useful visual intuition: CG builds new dimensions (Krylov growth) while keeping directions independent under the A-metric (conjugacy).
Applying CG to non-SPD matrices (indefinite or nonsymmetric) and expecting it to work; use MINRES/GMRES or reformulate.
Using normal equations A=BᵀB without considering conditioning (κ squares) and then blaming CG for slow convergence.
Choosing a preconditioner M that is not SPD (or implemented inconsistently), which can break PCG assumptions and cause instability.
Stopping based only on small step sizes or objective decrease rather than monitoring the residual norm (relative ||r||/||b||).
Show that minimizing f(x)=½xᵀAx−bᵀx with SPD A has a unique minimizer x* and that it satisfies A x* = b.
Hint: Compute ∇f(x). Use SPD to argue strict convexity and uniqueness.
Compute the gradient:
∇f(x)=A x − b (since A is symmetric).
A critical point satisfies A x*=b.
Because A is SPD, f is strictly convex (Hessian ∇²f=A ≻ 0), so the critical point is unique and is the global minimizer.
Run one full CG iteration for A = [[2,0],[0,8]], b=[2,8]^T, x0=[0,0]^T. Compute r0, p0, α0, x1, r1, and verify r0ᵀr1=0.
Hint: Because A is diagonal, A p0 is easy. Use α0=(r0ᵀr0)/(p0ᵀA p0).
r0=b−Ax0=b=[2,8]^T. p0=r0.
A p0 = [2·2, 8·8]^T = [4, 64]^T.
r0ᵀr0 = 2^2+8^2=68.
p0ᵀA p0 = [2,8]·[4,64]=8+512=520.
α0=68/520=17/130.
x1=x0+α0p0=(17/130)[2,8]=[34/130, 136/130]=[17/65, 68/65].
r1=r0−α0A p0=[2,8]−(17/130)[4,64]=[2−68/130, 8−1088/130]
= [ (260−68)/130, (1040−1088)/130 ] = [192/130, −48/130] = [96/65, −24/65].
Check orthogonality:
r0ᵀr1 = [2,8]·[96/65, −24/65] = 192/65 − 192/65 = 0.
Consider preconditioning A=diag(1,100,10000) with M=diag(1,100,10000). Compute κ(A) and κ(C^{-T}AC^{-1}) where M=CᵀC. Explain in one sentence what this predicts about CG iteration count.
Hint: For diagonal SPD, eigenvalues are diagonal entries. If M=A then the symmetrically preconditioned matrix becomes I.
Eigenvalues of A are {1,100,10000}, so κ(A)=10000/1=10000.
With M=A and C=diag(1,10,100), we get C^{-T}AC^{-1}=I, so κ=1.
Prediction: preconditioning collapses the eigenvalue spread, so CG should converge in dramatically fewer iterations (in the ideal case, essentially immediately).