Krylov subspace methods and CG

The Krylov subspace

A set of basis vectors for the Krylov subspace $\mathcal{K}_k(A, b)$ is formed by repeated application of a matrix $A$ to a vector $b$:

$$ \mathcal{K}_k(A, b) = \text{span}\{ b, Ab, A^2b, \dots, A^{k-1}b \} $$
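As a concrete illustration (the code and example values are not from the original text), the Krylov vectors can be generated with a few lines of NumPy using only matrix-vector products:

```python
import numpy as np

def krylov_basis(A, b, k):
    """Return the Krylov vectors b, Ab, ..., A^(k-1)b as the columns of a matrix."""
    vectors = [b]
    for _ in range(k - 1):
        vectors.append(A @ vectors[-1])  # repeated application of A
    return np.column_stack(vectors)

# Small arbitrary example
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(krylov_basis(A, b, 2))
```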

Krylov subspace methods

Krylov subspace methods are an important family of iterative algorithms for solving $Ax = b$. Let us suppose that $A$ is an $n \times n$ invertible matrix, and that our only knowledge of $A$ is its matrix-vector product with an arbitrary vector $\mathbf{x}$. Applying $A$ repeatedly, $n$ times, to an initial vector $\mathbf{b}$ gives the sequence of vectors $\mathbf{b}, A\mathbf{b}, A^2\mathbf{b}, \dots, A^n\mathbf{b}$. Since we are in $n$-dimensional space, these $n+1$ vectors are guaranteed to be linearly dependent, and therefore

$$ a_0 \mathbf{b} + a_1 A \mathbf{b} + \dots + a_n A^n \mathbf{b} = 0 $$

for some set of coefficients $a_i$, not all zero. Let us now pick the smallest index $k$ such that $a_k \ne 0$, and multiply the above equation by $\frac{1}{a_k} A^{-k-1}$, giving

$$ A^{-1} \mathbf{b} = -\frac{1}{a_k} \left( a_{k+1} \mathbf{b} + a_{k+2} A \mathbf{b} + \dots + a_n A^{n-k-1} \mathbf{b} \right) $$

This shows that $A^{-1}\mathbf{b}$ can be constructed using only repeated applications of $A$ (i.e. matrix-vector products), and it motivates searching for the solution within the Krylov subspace.
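As a numerical sanity check (an illustration of the argument above, with arbitrarily chosen values), the sketch below verifies that $A^{-1}b$ does indeed lie in the span of the Krylov vectors $b, Ab, \dots, A^{n-1}b$:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n)) + n * np.eye(n)  # an arbitrary invertible example matrix
b = rng.standard_normal(n)

# Columns are the Krylov vectors b, Ab, ..., A^(n-1) b
K = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(n)])

x_exact = np.linalg.solve(A, b)
coeffs, *_ = np.linalg.lstsq(K, x_exact, rcond=None)  # expand x_exact in the Krylov basis
print(np.allclose(K @ coeffs, x_exact))               # True: A^(-1) b lies in the Krylov subspace
```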

For each iteration $k$ of a Krylov subspace method, we choose the "best" linear combination of the Krylov basis vectors $\mathbf{b}, A\mathbf{b}, \dots, A^{k-1}\mathbf{b}$ to form an improved solution $\mathbf{x}_k$. Different methods give different definitions of "best", for example:

  1. The residual $\mathbf{r}_k = \mathbf{b} - A\mathbf{x}_k$ is orthogonal to $\mathcal{K}_k$ (Conjugate Gradients).
  2. The residual $\mathbf{r}_k$ has minimum norm for $\mathbf{x}_k$ in $\mathcal{K}_k$ (GMRES and MINRES).
  3. $\mathbf{r}_k$ is orthogonal to a different space $\mathcal{K}_k(A^T)$ (BiConjugate Gradients).

Conjugate Gradient Method

Here we give a brief summary of the CG method; for more details, consult the text by Golub and Van Loan (Chapter 10).

The CG method applies to symmetric positive definite matrices $A$, and is based on minimising the function

$$ \phi(x) = \frac{1}{2} x^T A x - x^T b $$

If we set $x$ to the solution of $Ax = b$, that is $x = A^{-1} b$, then the value of $\phi(x)$ is at its minimum $\phi(A^{-1} b) = -b^T A^{-1} b / 2$, showing that solving $Ax = b$ and minimising $\phi$ are equivalent.
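For completeness (this one-line argument is standard but not spelled out above), the equivalence follows from the gradient of $\phi$:

$$ \nabla \phi(x) = \tfrac{1}{2}\left(A + A^T\right) x - b = A x - b \quad \text{(since } A = A^T\text{)} $$

so $\nabla \phi(x) = 0$ exactly when $Ax = b$, and positive definiteness of $A$ ensures that this stationary point is the unique minimum.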

At each iteration $k$ of CG we are concerned with the residual, defined as $r_k = b - A x_k$. If the residual is nonzero, then at each step we wish to find a positive $\alpha$ such that $\phi(x_k + \alpha p_k) < \phi(x_k)$, where $p_k$ is the search direction at step $k$. For the classical steepest descent optimisation algorithm the search direction would simply be the residual, $p_k = r_k$; however, steepest descent can converge very slowly. Instead we aim to find a set of search directions $p_k$ such that $p_k^T r_{k-1} \ne 0$ (i.e. at each step we are guaranteed to reduce $\phi$), and such that the search directions are linearly independent. The latter condition guarantees that the method converges in at most $n$ steps, where $n$ is the size of the square matrix $A$.

It can be shown that the best set of search directions can be achieved by setting

$$ \begin{aligned} \beta_k &= \frac{-p^T_{k-1} A r_{k-1}}{p^T_{k-1} A p_{k-1}} \\ p_k &= r_{k-1} + \beta_k p_{k-1} \\ \alpha_k &= \frac{p^T_k r_{k-1}}{p^T_k A p_k} \end{aligned} $$

This ensures that the search direction $\mathbf{p}_k$ is the closest vector to $\mathbf{r}_{k-1}$ that is also $A$-conjugate to $\mathbf{p}_1, \dots, \mathbf{p}_{k-1}$, i.e. $p_i^T A p_j = 0$ for all $i \ne j$, which gives the algorithm its name. After $k$ iterations the residuals $\mathbf{r}_i$ for $i = 1, \dots, k$ form a set of mutually orthogonal vectors that span the Krylov subspace $\mathcal{K}_k$.

Directly using the above equations in an iterative algorithm gives the standard CG algorithm. A more efficient version can be derived by computing the residuals recursively via $r_k = r_{k-1} - \alpha_k A p_k$, leading to the final algorithm given below (reproduced from Wikipedia):

[Conjugate Gradient algorithm]
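Since the algorithm listing above appears as a figure in the original, the following is a minimal NumPy sketch of the unpreconditioned CG iteration with the recursive residual update (the tolerance and example matrix are illustrative choices, not prescribed by the text):

```python
import numpy as np

def conjugate_gradient(A, b, x0=None, eps=1e-8, max_iter=None):
    """Solve A x = b for symmetric positive definite A using the CG method."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x                      # initial residual
    p = r.copy()                       # initial search direction
    rs_old = r @ r
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        if np.linalg.norm(r) < eps * np.linalg.norm(b):
            break                      # terminate on the relative residual norm
        Ap = A @ p
        alpha = rs_old / (p @ Ap)      # step length along the search direction
        x += alpha * p
        r -= alpha * Ap                # recursive residual update
        rs_new = r @ r
        beta = rs_new / rs_old         # coefficient for the next A-conjugate direction
        p = r + beta * p
        rs_old = rs_new
    return x

# Small arbitrary SPD example; the result should agree with np.linalg.solve(A, b)
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
print(conjugate_gradient(A, b))
```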

Preconditioning

The CG method works well (i.e. converges quickly) if the condition number of the matrix $A$ is low. The condition number measures how much the solution $x$ changes in response to a small change in the input $b$; it is a property of the matrix $A$ itself, and so varies from problem to problem. In order to keep the number of iterations small for iterative solvers, it is therefore often necessary to use a preconditioner, which is a method of transforming what might be a difficult problem with a poorly conditioned $A$ into a well conditioned problem that is easy to solve.
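As a quick illustration (the matrix here is an arbitrary example), NumPy can report the 2-norm condition number directly:

```python
import numpy as np

A = np.array([[4.0, 1.0], [1.0, 3.0]])
print(np.linalg.cond(A))  # ratio of largest to smallest singular value; smaller means better conditioned
```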

Consider preconditioning for the CG method: we start from the standard problem $Ax = b$, and we wish to solve an equivalent transformed problem given by

$$ \tilde{A} \tilde{x} = \tilde{b} $$

where $\tilde{A} = C^{-1} A C^{-1}$, $\tilde{x} = C x$, $\tilde{b} = C^{-1} b$, and $C$ is a symmetric positive definite matrix.
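Substituting the definitions back in (a one-line check, added here for clarity) confirms that the transformed problem is equivalent to the original one:

$$ \tilde{A}\tilde{x} = \tilde{b} \iff C^{-1} A C^{-1} (C x) = C^{-1} b \iff C^{-1}(A x - b) = 0 \iff A x = b $$

and since $A$ and $C$ are symmetric positive definite, so is $\tilde{A}$, meaning CG can be applied to the transformed system.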

We then simply apply the standard CG method as given above to this transformed problem. This leads to an algorithm that can be simplified by writing it in terms of the untransformed quantities, using the relations $\tilde{p}_k = C p_k$, $\tilde{x}_k = C x_k$, and $\tilde{r}_k = C^{-1} r_k$. Finally, we define the matrix $M = C^2$, known as the preconditioner, leading to the final preconditioned CG algorithm given below (reproduced and edited from Wikipedia):

$\mathbf{r}_0 := \mathbf{b} - \mathbf{A} \mathbf{x}_0$
$\mathbf{z}_0 := \mathbf{M}^{-1} \mathbf{r}_0$
$\mathbf{p}_0 := \mathbf{z}_0$
$k := 0$
repeat until $\| \mathbf{r}_k \|_2 < \epsilon \| \mathbf{b} \|_2$
   $\alpha_k := \frac{\mathbf{r}_k^T \mathbf{z}_k}{\mathbf{p}_k^T \mathbf{A} \mathbf{p}_k}$
   $\mathbf{x}_{k+1} := \mathbf{x}_k + \alpha_k \mathbf{p}_k$
   $\mathbf{r}_{k+1} := \mathbf{r}_k - \alpha_k \mathbf{A} \mathbf{p}_k$
   if $\| \mathbf{r}_{k+1} \|$ is sufficiently small then exit loop end if
   $\mathbf{z}_{k+1} := \mathbf{M}^{-1} \mathbf{r}_{k+1}$
   $\beta_k := \frac{\mathbf{r}_{k+1}^T \mathbf{z}_{k+1}}{\mathbf{r}_k^T \mathbf{z}_k}$
   $\mathbf{p}_{k+1} := \mathbf{z}_{k+1} + \beta_k \mathbf{p}_k$
   $k := k + 1$
end repeat
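The listing above translates almost line for line into NumPy. The sketch below is an illustration, not part of the original notes: the preconditioner is supplied as a function solve_M that returns $M^{-1} r$, and the simple diagonal (Jacobi) preconditioner at the end is used purely as an example of an "easy to solve" $M$:

```python
import numpy as np

def preconditioned_cg(A, b, solve_M, x0=None, eps=1e-8, max_iter=None):
    """Preconditioned CG: solve_M(r) must return M^{-1} r for the chosen preconditioner M."""
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else x0.copy()
    r = b - A @ x
    z = solve_M(r)                     # z_0 = M^{-1} r_0
    p = z.copy()
    rz_old = r @ z
    max_iter = n if max_iter is None else max_iter
    for _ in range(max_iter):
        if np.linalg.norm(r) < eps * np.linalg.norm(b):
            break                      # repeat until ||r_k|| < eps ||b||
        Ap = A @ p
        alpha = rz_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = solve_M(r)                 # apply the preconditioner
        rz_new = r @ z
        beta = rz_new / rz_old
        p = z + beta * p
        rz_old = rz_new
    return x

# Example: Jacobi (diagonal) preconditioner, M = diag(A), chosen only for illustration
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
d = np.diag(A)
print(preconditioned_cg(A, b, solve_M=lambda r: r / d))
```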

The key point to note here is that the preconditioner is applied by solving a system with $M$ (i.e. computing $M^{-1} \mathbf{r}_{k+1}$), so this matrix must be "easy" to solve in some fashion, while also producing a transformed problem with better conditioning.

Termination: The CG algorithm is normally run until convergence to a given tolerance based on the norm of the input vector $b$. In the algorithm above we iterate until the residual norm is less than some user-specified fraction $\epsilon$ of the norm of $b$.

The choice of preconditioner is often highly problem-specific, but some useful general-purpose preconditioners exist, such as the incomplete Cholesky preconditioner for preconditioned CG (see Chapter 10.3.2 of the Golub & Van Loan text cited below). Chapter 3 of the Barrett et al. text, also cited below, describes a few more commonly used preconditioners.
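For sparse problems, one practical option (a tooling suggestion, not something prescribed by the notes) is to build the preconditioner from SciPy's incomplete LU factorisation, scipy.sparse.linalg.spilu, which plays a role similar to incomplete Cholesky for the symmetric systems considered here, and to pass it to SciPy's own cg solver:

```python
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

# Sparse SPD example: a 1D Laplacian (an arbitrary illustrative choice)
n = 100
main = 2.0 * np.ones(n)
off = -1.0 * np.ones(n - 1)
A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")
b = np.ones(n)

ilu = spla.spilu(A)                                # incomplete LU factorisation of A
M = spla.LinearOperator((n, n), matvec=ilu.solve)  # action of M^{-1} on a vector
x, info = spla.cg(A, b, M=M)
print(info)                                        # 0 indicates successful convergence
```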

Other Reading

  • Golub, G. H., & Van Loan, C. F. (1996). Matrix Computations (3rd ed.). Johns Hopkins University Press. Chapter 10.
  • Barrett, R., Berry, M., Chan, T. F., Demmel, J., Donato, J., Dongarra, J., ... & Van der Vorst, H. (1994). Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. Society for Industrial and Applied Mathematics.