Chapter 5 Solution of linear systems of equations: \(A\vec x=\vec b\)

Problem For a given \(d\times d\) matrix \(A\) and vector \(\vec b \in \mathbb{R}^d\), find \(\vec x\in\mathbb{R}^d\) such that \(A\vec x=\vec b\). If the entries of \(A\) are \(a_{ij}\) (row \(i\), column \(j\)) and the entries of \(\vec x\) and \(\vec b\) are \(x_j\) and \(b_i\), this means \[ \sum_{j=1}^d a_{ij} x_j =b_i,\qquad i=1,\dots,d. \]

You have already studied row-reduction techniques. In numerical analysis, these are developed, with attention to numerical stability, into the standard technique of Gaussian elimination with partial pivoting. This is an example of a **direct method**: the solution \(\vec x\) is found by a finite number of arithmetic operations. Packages such as `scipy.linalg` use Gaussian elimination to find an LU factorisation.

Definition 5.1 A matrix \(P\) is a permutation matrix if each row and column has exactly one non-zero entry equal to one; multiplication by a permutation matrix \(P\) in \(PA\) permutes the rows of \(A\).

A matrix \(L\) is lower triangular if \(L_{ij}=0\) for \(i< j\) and unit lower triangular if additionally \(L_{ii}=1\). A matrix \(U\) is upper triangular if \(U_{ij}=0\) for \(i> j\).

The LU factorisation consists of a permutation matrix \(P\), a unit lower-triangular matrix \(L\), and an upper-triangular matrix \(U\) such that \(PA=LU\); this can be found in Python using `P, L, U = scipy.linalg.lu(A)`. For example, \[ L= \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 1/2 & 1/2 & 1 \end{bmatrix},\quad% U= \begin{bmatrix} 4 & -1 & 1 \\ 0 & -1 & 2\\ 0 & 0 & -3/2 \end{bmatrix},\quad% P=\begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1\\ 1 & 0 & 0 \end{bmatrix} \] is an LU factorisation of \[ A= \begin{bmatrix} 2 & -1 & 0\\ 4 & -1 & 1\\ 0 & -1 & 2 \end{bmatrix}. \]

You should verify that \(PA=LU\). The entries of \(L\) and \(U\) express the row reductions normally performed in Gaussian elimination. We do not show how to compute \(L\) and \(U\) in detail by hand. There are many matrix factorisations in numerical linear algebra and this one is useful for solving linear systems of equations.
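
As a quick check (a minimal sketch assuming `numpy` is installed; the matrices are copied from the example above), we can confirm numerically that \(PA=LU\):

```
import numpy as np

# Matrices copied from the example above.
A = np.array([[2., -1., 0.],
              [4., -1., 1.],
              [0., -1., 2.]])
P = np.array([[0., 1., 0.],
              [0., 0., 1.],
              [1., 0., 0.]])
L = np.array([[1., 0., 0.],
              [0., 1., 0.],
              [0.5, 0.5, 1.]])
U = np.array([[4., -1., 1.],
              [0., -1., 2.],
              [0., 0., -1.5]])

print(np.allclose(P @ A, L @ U))  # True
```

Conventions differ between texts and libraries (some return matrices satisfying \(A=PLU\) rather than \(PA=LU\)), so it is worth checking the documentation of `scipy.linalg.lu` for the convention it uses.
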

Example 5.1 (LU factorisation) To solve a linear system of equations with the LU factorisation, substitute \(PA=LU\) into the linear system \(A\vec x=\vec b\): \[ PA \vec x=L U \vec x= P \vec b. \] Let \(\vec y\colon= U \vec x\). Given \(\vec b\in\mathbb{R}^d\), we solve the triangular system \[ L \vec y=P \vec b, \qquad \text{to find $\vec y\in\mathbb{R}^d$} \] and \[ U \vec x= \vec y,\qquad \text{to find $\vec x\in \mathbb{R}^d$.} \] We have replaced the problem of solving \(A\vec x=\vec b\) by solving two much simpler linear systems. Both systems are triangular and their solutions are easily found by forward and backward substitution.
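
A minimal sketch of this procedure in Python, using `scipy.linalg.lu` and `scipy.linalg.solve_triangular` (the right-hand side \(\vec b\) is an arbitrary illustrative choice; `scipy.linalg.lu` returns matrices in the convention \(A=PLU\), so the permuted right-hand side below is \(P^\mathsf{T}\vec b\)):

```
import numpy as np
from scipy.linalg import lu, solve_triangular

A = np.array([[2., -1., 0.],
              [4., -1., 1.],
              [0., -1., 2.]])
b = np.array([1., 2., 3.])    # arbitrary right-hand side

P, L, U = lu(A)               # A = P @ L @ U, so P.T @ A = L @ U

y = solve_triangular(L, P.T @ b, lower=True)   # forward substitution: L y = P^T b
x = solve_triangular(U, y, lower=False)        # backward substitution: U x = y

print(np.allclose(A @ x, b))  # True
```
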

We spend most of our time studying iterative methods, where \(\vec x\) occurs as the limit of approximations \(\vec x_n\) as \(n\to \infty\). In general, direct methods (such as the LU factorisation or Gaussian elimination) are good for dense matrices (where \(a_{ij}\ne 0\) for nearly all \(i,j\)) and the complexity of such a linear solve is \(\mathcal{O}(d^3)\). If \(d\) is very large, it may be impossible to store \(A\) in memory and to perform row reductions. On the other hand, the matrix may be sparse (\(a_{ij}=0\) for many \(i,j\)) and it may be easy to compute matrix–vector products \(A \vec x\). In this case, iterative methods are valuable.

We will work with the following example of a sparse matrix throughout.

Example 5.2 (finite–difference matrix) Suppose that \(u\) is a smooth real-valued function on \([0,1]\) and we want to approximate its second derivative based on evaluations on the mesh \(x_i=ih\) for some mesh spacing \(h=1/(d+1)\). By Taylor’s theorem, \[\begin{align*} u(x+h)&=u(x)+h u'(x)% +\frac 12 h^2 u''(x)% +\frac 16 h^3 u'''(x)+\mathcal{O}(h^4),\\ u(x-h)&=u(x)-h u'(x)% +\frac 12 h^2 u''(x)% -\frac 16 h^3 u'''(x)+\mathcal{O}(h^4). \end{align*}\]

Then, \[ u''(x)% =\frac{u(x+h)-2u(x)+u(x-h)}{h^2}% +\mathcal{O}(h^2). \] By dropping the \(\mathcal{O}(h^2)\) term, we have a finite-difference approximation to the second derivative. This can be used to find an approximate solution to the two-point boundary value problem: for a given function \(f\colon[0,1]\to\mathbb{R}\), find \(u(x)\) such that \[ -u''(x)=f(x),\qquad u(0)=u(1)=0. \] Using the finite-difference approximation, \[ -\begin{bmatrix} u''(x_1)\\ u''(x_2)\\\vdots\\u''(x_d) \end{bmatrix} =\frac{1}{h^2} A \begin{bmatrix} u(x_1)\\ u(x_2)\\\vdots\\u(x_d) \end{bmatrix}+\mathcal{O}(h^2) \] for \[\begin{equation} \tag{5.1} A\colon= \begin{bmatrix} 2 & -1 & & \\ -1& 2 & -1 & \\ & \ddots & \ddots& \ddots \\ & &-1& 2 \end{bmatrix}. \end{equation}\] Then \(u''(x)=-f(x)\) gives \(\frac{1}{h^2}A \vec u= \vec f\) if we neglect the \(\mathcal{O}(h^2)\) term, for \[\begin{equation} \vec f\colon=\begin{bmatrix} f(x_1)\\ f(x_2)\\\vdots\\f(x_d) \end{bmatrix},\quad \vec u\colon=\begin{bmatrix} u(x_1)\\ u(x_2)\\\vdots\\u(x_d) \end{bmatrix}. \end{equation}\] We have a linear system of equations that can be solved to determine an approximate solution of the boundary-value problem.

Only the main diagonal and the two adjacent off-diagonals of \(A\) are non-zero. All other entries are zero and the matrix is sparse. We will use the finite-difference matrix \(A\) as a prototype example as we develop iterative methods. The matrix is typical of the ones that arise in the numerical solution of PDEs.
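
As a sketch of how this looks in code (assuming `scipy` is available), the matrix can be stored in sparse format so that only its \(3d-2\) non-zero entries are kept in memory:

```
import numpy as np
import scipy.sparse as sp

d = 5
# Finite-difference matrix (5.1): 2 on the diagonal, -1 on the two off-diagonals.
A = sp.diags([-np.ones(d - 1), 2 * np.ones(d), -np.ones(d - 1)],
             offsets=[-1, 0, 1], format="csr")
print(A.toarray())
```
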

5.1 Iterative methods

Suppose we wish to solve \(A\vec x=\vec b\) for \(\vec x\in\mathbb{R}^d\) given a \(d\times d\) matrix \(A\) and \(\vec b\in\mathbb{R}^d\). Write \(A=A_1-A_2\) so that \[ A_1\vec x= A_2 \vec x+ \vec b. \] This motivates the following iterative method: find \(\vec x_{n+1}\) such that \[ A_1 \vec x_{n+1}% =A_2 \vec x_n+\vec b. \] When \(\vec x_n\) is known and \(A_1\) is well chosen, we easily find \(\vec x_{n+1}\) and generate a sequence \(\vec x_1,\vec x_2,\dots\) that we hope converges to the solution \(\vec x\). We can interpret this as a fixed-point iteration and \[ \vec x_{n+1}=\vec g(\vec x_n),\qquad \vec g(\vec x)\colon= A_1^{-1}(A_2\vec x+\vec b), \] where we assume the inverse matrix \(A_1^{-1}\) exists.

Example 5.3 (Jacobi) Let \(A_1\) denote the diagonal part of \(A\) and \(A_2=A_1-A=-(L+U)\) (for the lower- and upper-triangular parts \(L\) and \(U\) of \(A\)). Take for example \[ A= \begin{bmatrix} 2&-1\\-1&2 \end{bmatrix},\qquad A_1= \begin{bmatrix} 2&0\\0&2 \end{bmatrix},\qquad A_2= \begin{bmatrix} 0&1\\1&0 \end{bmatrix}. \] The Jacobi iteration is \[ \begin{bmatrix} 2 & 0 \\ 0 &2 \end{bmatrix}\vec x_{n+1}% = \begin{bmatrix} 0 &1\\1 & 0 \end{bmatrix}\vec x_n +\vec b \] or \[ \vec x_{n+1} = \begin{bmatrix} 0 & 1/2 \\ 1/2 & 0 \end{bmatrix}\vec x_n + \frac 12 \vec b. \] Notice how simple it is to evaluate the right-hand side given \(\vec x_n\) and \(\vec b\).

For the finite-difference example (Example 5.2), the Jacobi method is \[\begin{equation} \tag{5.2} \vec x_{n+1}=\frac 12 \begin{bmatrix} 0 & 1 & & & \\ 1 & 0 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 0 & 1 \\ & & & 1 & 0 \end{bmatrix}\vec x_n% +\frac12 \vec b. \end{equation}\] Even when \(d\) is large, the right-hand side is cheap to compute and the matrix–vector product is easy to evaluate without storing the matrix explicitly.
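
A minimal matrix-free sketch of the iteration (5.2) in Python; the function name, iteration count, and the comparison with a dense direct solve are illustrative choices, not part of the notes:

```
import numpy as np

def jacobi_fd(b, n_iters):
    """Jacobi iteration (5.2) for the finite-difference matrix (5.1),
    written matrix-free: only neighbouring entries of x are combined."""
    d = len(b)
    x = np.zeros(d)
    for _ in range(n_iters):
        s = np.zeros(d)
        s[1:] += x[:-1]     # sub-diagonal contribution
        s[:-1] += x[1:]     # super-diagonal contribution
        x = 0.5 * (s + b)
    return x

# Compare with a dense direct solve on a small example.
d = 20
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
b = np.ones(d)
print(np.linalg.norm(jacobi_fd(b, 2000) - np.linalg.solve(A, b)))  # small
```
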
Example 5.4 (Gauss–Seidel) Let \(A_1=D+L\) (the diagonal and lower-triangular part) and \(A_2=-U\). For the finite-difference matrix, \[ \underbrace{ \begin{bmatrix} 2 & & & \\ -1& 2 & & \\ &-1& 2 & \\ &&\ddots&\ddots \end{bmatrix}}_{=A_1}\vec x_{n+1}% =\underbrace{\begin{bmatrix} 0& 1 & & \\ & 0 & 1 & \\ && 0 & 1 \\ &&&\ddots \end{bmatrix}}_{=A_2}\vec x_{n}+\vec b. \] This time a linear solve is required. As the matrix on the left-hand side is lower triangular, this can be done efficiently by forward substitution to find \(\vec x_{n+1}\).
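
A corresponding sketch of the Gauss–Seidel iteration, written in component form so that the forward substitution with \(A_1=D+L\) happens implicitly as each entry is updated in turn (again, the function name and iteration count are illustrative choices):

```
import numpy as np

def gauss_seidel(A, b, n_iters):
    """Gauss-Seidel: solve (D + L) x_{n+1} = -U x_n + b by forward substitution."""
    d = len(b)
    x = np.zeros(d)
    for _ in range(n_iters):
        for i in range(d):
            # Entries j < i already hold the new iterate, entries j > i the old one.
            x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i + 1:] @ x[i + 1:]) / A[i, i]
    return x

d = 20
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)
b = np.ones(d)
print(np.linalg.norm(gauss_seidel(A, b, 1000) - np.linalg.solve(A, b)))  # small
```
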

We now develop some tools for understanding the convergence of iterative methods.

5.2 Vector and matrix norms

To understand convergence of \(\vec x_n\), we introduce a way to measure distance on \(\mathbb{R}^d\). It turns out to be very convenient to have more than one measure of distance.

Definition 5.2 (vector norm) A vector norm on \(\mathbb{R}^d\) is a real-valued function \(\| \cdot \|\) on \(\mathbb{R}^d\) such that

  1. \(\| \vec x \|\ge 0\) for all \(\vec x\in \mathbb{R}^d\),
  2. \(\| \vec x \|=0\) if and only if \(\vec x=\vec 0\),
  3. \(\| \alpha \vec x \|=| \alpha |\,\| \vec x \|\) for \(\alpha\in \mathbb{R}\), and
  4. \(\| \vec x+\vec y \|\le \| \vec x \|+\| \vec y \|\) for all \(\vec x, \vec y\in\mathbb{R}^d\) (the triangle inequality).

The standard Euclidean norm \(\| \vec x \|_2\colon= \left( \vec x^\mathsf{T}\vec x \right)^{1/2}\equiv \left( x_1^2+\dots+x_d^2 \right)^{1/2}\) satisfies these conditions and is a vector norm. Conditions 1–3 are easy to verify. The last one follows from the Cauchy–Schwarz inequality, which says that \(\vec x^\mathsf{T}\vec y \le \| \vec x \|_2 \| \vec y \|_2\) and so \[\begin{align*} \| \vec x+\vec y \|_2^2% &= (\vec x+\vec y)^\mathsf{T}(\vec x+\vec y)% =\vec x^\mathsf{T}\vec x + 2 \vec x^\mathsf{T}\vec y+ \vec y^\mathsf{T}\vec y% \le \| \vec x \|_2^2 + 2 \| \vec x \|_2 \| \vec y \|_2 + \| \vec y \|_2^2\\ &= \left( \| \vec x \|_2 + \| \vec y \|_2 \right)^2. \end{align*}\] We will make use of two more examples: \[ \| \vec x \|_\infty% \colon= \max_{j=1,\dots, d}| x_j |,\qquad \| \vec x \|_1% \colon= \sum_{j=1}^{d}| x_j |. \] Don’t forget the absolute-value signs on the right-hand side! Verification of the norm axioms is straightforward here.

These give different numbers and, for \(\vec x=(-1,1,2)^\mathsf{T}\), we get \[ \| \vec x \|_2=\sqrt{6},\qquad \| \vec x \|_1=4,\qquad \| \vec x \|_\infty=2. \]
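
These values are easy to check with `numpy.linalg.norm` (a quick illustrative check):

```
import numpy as np

x = np.array([-1., 1., 2.])
print(np.linalg.norm(x, 2))       # 2.449... = sqrt(6)
print(np.linalg.norm(x, 1))       # 4.0
print(np.linalg.norm(x, np.inf))  # 2.0
```
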

We also need to measure the size of matrices. The obvious way to do this is to treat a \(d\times d\) matrix as a vector with \(d^2\) entries and thereby inherit the norms defined for vectors. This, however, says nothing about the multiplicative nature of matrices, and so we develop the following concept.

Definition 5.3 (matrix norm) A matrix norm \(\| A \|\) of a \(d\times d\) matrix \(A\) is a real-valued function on \(\mathbb{R}^{d\times d}\) such that

  1. \(\| A \|\ge 0\) for all \(A\in \mathbb{R}^{d\times d}\),
  2. \(\| A \|=0\) if and only if \(A=0\),
  3. \(\| \alpha A \|=| \alpha |\,\| A \|\) for \(\alpha\in \mathbb{R}\),
  4. \(\| A+B \|\le \| A \|+\| B \|\) (the triangle inequality) and
  5. \(\| A B \| \le \| A \|\,\| B \|\) for all \(A, B\in\mathbb{R}^{d\times d}\).

Conditions 1–4 correspond to the ones for vector norms. The last, the sub-multiplicativity condition, relates to matrix products.

Vector norms lead naturally to a corresponding matrix norm, known as the subordinate or operator norm.

Definition 5.4 The operator norm \(\| A \|_{\mathrm{op}}\) with respect to a vector norm \(\| \vec x \|\) is defined by \[ \| A \|_{\mathrm{op}}\colon= \sup_{\vec x\ne \vec 0} \frac{\| A\vec x \|}{\| \vec x \|}. \] Equivalently, because of condition 3, \[ \| A \|_{\mathrm{op}}= \sup_{\| \vec x \|=1} {\| A\vec x \|} \] (\(\sup\) means least upper bound or roughly “the maximum”). The operator norm describes the maximum stretch that can be achieved by multiplication by \(A\).
Theorem 5.1 The operator norm \(\| A \|_{\mathrm{op}}\) is a matrix norm.
Proof. We show condition 5. As \(\| A \|_{\mathrm{op}}=\sup \| A\vec x \|/\| \vec x \|\), we have \[\begin{equation} \tag{5.3} \| A \vec x \|% \le \| A \|_{\mathrm{op}} \| \vec x \| \end{equation}\] for any \(\vec x\in \mathbb{R}^d\). Then, applying (5.3) twice, \[ \| A B \vec x \|% \le \| A \|_{\mathrm{op}} \| B\vec x \|% \le \| A \|_{\mathrm{op}} \| B \|_{\mathrm{op}} \| \vec x \|. \] Hence, \[ \| AB \|_{\mathrm{op}}% =\sup_{\| \vec x \|=1} \| A B \vec x \| \le \| A \|_{\mathrm{op}} \| B \|_{\mathrm{op}}. \] The remaining conditions are left for a problem sheet.
Let \(\| A \|_1\) be the operator norm associated to the vector norm \(\| \vec x \|_1\), and \(\| A \|_\infty\) be the operator norm associated to the vector norm \(\| \vec x \|_\infty\).
Theorem 5.2 Let \(A\) be a \(d\times d\) matrix. Then, \[\begin{align*} \| A \|_1&% =\max_{j=1,\dots,d} \sum_{i=1}^d |a_{ij}|,\qquad \text{maximum column sum}\\ \| A \|_\infty&% =\max_{i=1,\dots,d} \sum_{j=1}^d |a_{ij}|,\qquad \text{maximum row sum}. \end{align*}\]

To remember which way round it is, \(\| A \|_1\) is the maximum column sum and the subscript \(1\) looks like a column. Don’t forget the absolute-value signs!

Proof. We prove that \[ \| A \|_\infty=\max_i \sum_j|a_{ij}| =\colon f(A). \] The argument for \(\| A \|_1\) is similar and addressed on the problem sheet. We divide and conquer, first showing that \(\| A \|_\infty \le f(A)\) and then \(\| A \|_\infty \ge f(A).\)

To show that \(\| A \|_\infty \le f(A)\), consider \(\vec x\in\mathbb{R}^d\) with \(\| \vec x \|_\infty=1\). Then, \(|x_j|\le 1\) for all \(j=1,\dots,d\) and hence \[ \Bigl| \sum_{j=1}^d a_{ij} x_j \Bigr|% \le\sum_{j=1}^d | a_{ij} x_j |% \le \sum_{j=1}^d |a_{ij}|% \le f(A). \] That is, each entry of \(A\vec x\) is at most \(f(A)\) in absolute value. Hence, \(\| A\vec x \|_\infty \le f(A)\) for every such \(\vec x\), and so \(\| A \|_\infty\le f(A)\).

To show that \(\| A \|_\infty \ge f(A)\), suppose that \(f(A)=\sum_j |a_{ij}|\) (i.e., row \(i\) has the maximum sum). Let \[ x_j% \colon= \begin{cases} +1,& a_{ij}\ge 0,\\ -1, & a_{ij}<0. \end{cases} \] Clearly then \(\| \vec x \|_\infty=1\) and \[ A\vec x = \begin{bmatrix} \times\\\times\\\vdots\\ \sum_j a_{ij} x_j\\ \times \end{bmatrix}% = \begin{bmatrix} \times\\\times\\\vdots \\ \sum_j | a_{ij} | \\\times \end{bmatrix}= \begin{bmatrix} \times\\ \times\\\vdots \\ f(A) \\ \times \end{bmatrix}, \] where only the \(i\)th entry is shown and \(\times\) denotes the remaining entries. The largest entry of \(A\vec x\) in magnitude is therefore at least \(f(A)\) and \[ \| A\vec x \|_\infty \ge f(A), \] completing the proof.
Both these operator norms are very easy to compute.
Example 5.5 Let \[ A= \begin{bmatrix} 3 &-8& -9\\ 1 &-2& 0\\ 9 &-14& 6 \end{bmatrix}. \] Then \(\| A \|_1=\max\{13, 24,15\}=24\) and \(\| A \|_\infty=\max\{20,3,29\}=29\).
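
A quick check with `numpy.linalg.norm`, which computes the matrix \(1\)- and \(\infty\)-norms as the maximum column and row sums respectively:

```
import numpy as np

A = np.array([[3., -8., -9.],
              [1., -2., 0.],
              [9., -14., 6.]])
print(np.linalg.norm(A, 1))       # 24.0 (maximum column sum)
print(np.linalg.norm(A, np.inf))  # 29.0 (maximum row sum)
```
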
The matrix 2-norm \(\| A \|_2\) induced by the Euclidean vector norm \(\| \vec x \|_2\) is much harder to understand. We quote the following theorem:
Theorem 5.3 Let \(A\) be a \(d\times d\) matrix; then \[\begin{equation} \tag{5.4} \| A \|_2% =\sqrt{\rho(A^\mathsf{T}A)}, \end{equation}\] where \(\rho(B)\) is the spectral radius or size of the largest eigenvalue defined by \[\rho(B)\colon= \max\{|\lambda| \colon \text{$\lambda$ is an eigenvalue of $B$ so that $B\vec u=\lambda \vec u$ for some $\vec u\ne \vec 0$}\}.\] When \(A\) is symmetric, we have simply that \[ \| A \|_2= \rho(A). \]

The proof of (5.4) is not covered. In the symmetric case, \(A=A^\mathsf{T}\) and, if \(\lambda\) is an eigenvalue of \(A\), then \(\lambda^2\) is an eigenvalue of \(A^\mathsf{T}A=A^2\). We expect \(\rho(A^\mathsf{T}A)=\rho(A)^2\).

Eigenvalues of large matrices are difficult to compute, though we can easily do it for a \(2\times 2\) example.
Example 5.6 Let \[ A= \begin{bmatrix} 3 & 1 \\ 0 &1 \end{bmatrix}, \qquad A^\mathsf{T}A= \begin{bmatrix} 9 & 3 \\ 3 &2 \end{bmatrix}. \] The eigenvalues \(\lambda\) are the solutions of \[ \det(A^\mathsf{T} A -\lambda I)% =\det \begin{bmatrix} 9-\lambda & 3 \\ 3 & 2-\lambda \end{bmatrix}% =(9-\lambda)(2-\lambda) -9 =\lambda^2-11\lambda+9=0. \] The quadratic formula gives \(\lambda=(11\pm \sqrt{121-36})/2\) and \(\| A \|_2= \sqrt{\rho(A^\mathsf{T} A)} =\sqrt{(11+\sqrt{85})/2}\).
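
A numerical check of this calculation (illustrative only):

```
import numpy as np

A = np.array([[3., 1.],
              [0., 1.]])
lam_max = np.max(np.linalg.eigvalsh(A.T @ A))  # largest eigenvalue of A^T A
print(np.sqrt(lam_max))                        # sqrt(rho(A^T A))
print(np.linalg.norm(A, 2))                    # the matrix 2-norm
print(np.sqrt((11 + np.sqrt(85)) / 2))         # the value computed by hand
# all three agree
```
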

Example 5.7 Recall the \(d\times d\) finite-difference matrix \[ A% = \begin{bmatrix} 2& -1 & & \\ -1 & 2 &-1 & \\ & \ddots&\ddots&\ddots \\ & & -1 & 2 \end{bmatrix}. \] We find the eigenvalues of \(A\), which allows us to find \(\| A \|_2\). Let \(h\colon= 1/(d+1)\) and \[ \vec u_k% \colon= ( \sin k\pi h, \sin 2k\pi h, \dots, \sin d k \pi h)^\mathsf{T}% \in \mathbb{R}^d. \] We show that \(\vec u_k\) is an eigenvector of \(A\). The \(j\)th component of \(A\vec u_k\) is \[ (A \vec u_k)_j% = 2 \sin (j k \pi h)% - \sin((j-1) k \pi h)% - \sin((j+1)k \pi h), \] where we use \(\sin((j-1)k\pi h)=0\) for \(j=1\) and \(\sin((j+1)k \pi h)=\sin(k \pi)=0\) for \(j=d\).

The trig identity \(\sin(X+Y)=\cos X \sin Y+\cos Y \sin X\) gives \[\begin{align*} (A \vec u_k)_j% &= 2 \sin (j k \pi h) - \left( \cos (k\pi h) \sin (j k \pi h) - \cos (j k \pi h) \sin (k \pi h) \right)\\ &\qquad- \left( \cos (k \pi h) \sin (j k \pi h) + \cos (j k \pi h) \sin( k \pi h) \right)\\ &= 2(1-\cos k \pi h) \sin (j k \pi h)\\ &= \lambda_k \times \text{the $j$th component of $\vec u_k$,} \end{align*}\] where \(\lambda_k\colon= 2(1-\cos (k\pi h))\). In other words, \(\lambda_k\) is an eigenvalue of \(A\) with eigenvector \(\vec u_k\). This gives \(d\) distinct eigenvalues for \(A\). We conclude that \[ \rho(A)% = \max \{\lambda_k\colon k=1,\dots,d \}% =2\left( 1-\cos \frac{d\pi}{d+1} \right). \] As \(A\) is symmetric, Theorem 5.3 gives \[ \| A \|_2% =\sqrt{\rho(A^\mathsf{T}A)}% =\rho(A)% =2\left( 1-\cos \frac{d\pi}{d+1} \right). \]
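
The eigenvalue formula can be checked numerically for a small \(d\) (a sketch assuming `numpy`):

```
import numpy as np

d = 10
h = 1.0 / (d + 1)
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)

k = np.arange(1, d + 1)
analytic = np.sort(2 * (1 - np.cos(k * np.pi * h)))
numeric = np.linalg.eigvalsh(A)                 # returned in ascending order
print(np.allclose(analytic, numeric))           # True
print(np.linalg.norm(A, 2), 2 * (1 - np.cos(d * np.pi * h)))  # equal
```
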

5.3 Convergence of iterative methods

If \(A=D+L+U\) (sum of diagonal and lower- and upper-triangular parts), the Jacobi method is \[ D \vec x_{n+1}=-(L+U) \vec x_n+\vec b\] and the Gauss–Seidel method is \[ (L+D) \vec x_{n+1} = -U \vec x_n + \vec b. \] If \(D\) is non-singular, both can be written \[ \vec x_{n+1}% =T \vec x_n+\vec c \] where \[\begin{alignat*}{3} T=T_{\textrm J} &\colon= -D^{-1}(L+U),& \qquad \vec c &=D^{-1} \vec b, &\qquad &\text{Jacobi},\\ T=T_{\textrm GS} &\colon= -(L+D)^{-1}U,& \qquad \vec c&=(L+D)^{-1} \vec b, &\qquad &\text{Gauss-Seidel}. \end{alignat*}\]

Lemma 5.1 Suppose that \(A\) and \(D\) are non-singular. Consider \(T=T_{\mathrm J}\) or \(T=T_{\mathrm{GS}}\). Then \(\vec x\) is the solution of \(A\vec x=\vec b\) if and only if \(\vec x=\vec g(\vec x)\colon= T\vec x+\vec c\) (i.e., \(\vec{x}\) is a fixed point of \(\vec g\); see Definition 4.1).
Proof. Elementary.
Theorem 5.4 Suppose that the conditions of Lemma 5.1 hold, so that \(A \vec x=\vec b\) has a unique solution \(\vec x\). Suppose that \(\vec x_n\) is a sequence in \(\mathbb{R}^d\) generated by \[\begin{equation*} \vec x_{n+1} = T \vec x_n+\vec c, \end{equation*}\] for some initial vector \(\vec x_0\). Then, \[ \| \vec x_n-\vec x \|% \le \| T \|_{\mathrm{op}}^n \| \vec x_0-\vec x \|, \] where \(\| T \|_{\mathrm{op}}\) is the operator norm associated to a vector norm \(\| \vec x \|\).
Proof. We have \(\vec x_{n+1}=T\vec x_n+\vec c\) and \(\vec x=T \vec x+\vec c\); then \[ \vec x_{n+1}-\vec x% =\left( T\vec x_n-T \vec x \right)% + \left( \vec c-\vec c \right). \] Apply the vector norm: \[ \| \vec x_{n+1}-\vec x \|% =\| T\vec x_n-T \vec x \|. \] Using (5.3), \[ \| \vec x_{n+1}-\vec x \|% \le\| T \|_{\mathrm{op}}\,\| \vec x_n- \vec x \|. \] A simple induction argument shows that \(\| \vec x_{n}-\vec x \| \le \| T \|_{\mathrm{op}}^n \| \vec x_0-\vec x \|.\)
Corollary 5.1 If \(\| T \|_{\mathrm{op}}<1\), then \(\vec x_n\) converges to \(\vec x\) as \(n\to \infty\). The convergence is linear (see Definition 4.2); that is, \(\| \vec x_{n+1}-\vec x \|\le K \| \vec x_n-\vec x \|\) for \(K=\| T \|_{\mathrm{op}}<1\).

When \(\| T \|_{\mathrm{op}}\) is small, the convergence is more rapid and this is desirable in numerical calculations.

In finite dimensions, it turns out that all norms are equivalent and any operator norm can be used as long as \(\| T \|_{\mathrm{op}}<1\).

An alternative characterisation of convergence can be given in terms of eigenvalues.
Corollary 5.2 The sequence \(\vec x_{n}\) given by \(\vec x_{n+1}=T\vec x_n+\vec c\) converges to the fixed point \(\vec x\) satisfying \(\vec x=T\vec x+\vec c\) for any \(\vec x_0\in \mathbb{R}^d\) if \(\rho(T)<1\).

Proof. This is a corollary of Theorem 5.4 when \(T\) is symmetric as \(\rho(T)=\| T \|_2\).

Suppose for simplicity that \(T\) has \(d\) distinct eigenvalues \(\lambda_j\) with corresponding eigenvectors \(\vec u_j\), so that \(T \vec u_j=\lambda_j \vec u_j\). Then, the vectors \(\vec u_j\) form a basis for \(\mathbb{R}^d\) and \[ \vec x_0-\vec x=\sum_{j=1}^d \alpha_j \vec u_j, \] for some \(\alpha_j\in\mathbb{R}\). Let \(\vec e_n=\vec x_n-\vec x\). Then, \(\vec e_{n+1}=T\vec e_n\) and \[ \vec e_{n}% =T^n \vec e_0% =T^n \sum_{j=1}^d \alpha_j \vec u_j% = \sum_{j=1}^d\alpha_j T^n \vec u_j% = \sum_{j=1}^d \alpha_j \lambda_j^n \vec u_j. \] If \(\rho(T)<1\) then all eigenvalues satisfy \(|\lambda_j|<1\). Hence, \(\lambda_j^n \to 0\) as \(n\to \infty\). Thus \(\vec e_n\to \vec 0\) as \(n\to \infty\) and the iterative method converges.

The case where the eigenvalues of \(T\) are not distinct is omitted.
Example 5.8 Consider \[ \begin{bmatrix} 8 & -1 & 0 \\ 1 & 5 & 2 \\ 0 & 2 & 4 \end{bmatrix}% \vec x% = \vec b% = \begin{bmatrix} 1 \\2 \\3 \end{bmatrix}. \] The Jacobi iteration is \[ \begin{bmatrix} 8 &0 &0 \\ 0& 5 &0 \\0 &0 &4 \end{bmatrix} \vec x_{n+1}% = \begin{bmatrix} 0 & 1 & 0\\ -1 & 0 &-2 \\ 0& -2 & 0 \end{bmatrix}\vec x_n + \begin{bmatrix} 1 \\ 2 \\3 \end{bmatrix} \] and \[ T_{\mathrm{J}}= \begin{bmatrix} 8 & 0&0 \\0 & 5 &0 \\0 &0 &4 \end{bmatrix}^{-1} \begin{bmatrix} 0 & 1 & 0\\ -1 & 0 &-2 \\ 0& -2 & 0 \end{bmatrix}% = \begin{bmatrix} 0 & 1/8 & 0\\ -1/5 & 0 & -2/5\\ 0 & -1/2 & 0 \end{bmatrix}. \] Then, \[ \| T_J \|_\infty= \max\left\{\frac 18, \frac 35, \frac 12\right\}% =\frac 35,\qquad \text{max row sum} \] \[ \| T_J \|_1 = \max\left\{\frac 15, \frac 58, \frac 25\right\}% =\frac 58,\qquad \text{max column sum}. \] The matrix norms are less than one and the Jacobi iteration converges for this example.
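
A short sketch that forms \(T_{\mathrm J}\), checks the two norms, and runs the iteration (the choice of 50 iterations is arbitrary):

```
import numpy as np

A = np.array([[8., -1., 0.],
              [1., 5., 2.],
              [0., 2., 4.]])
b = np.array([1., 2., 3.])

D = np.diag(np.diag(A))
T = np.linalg.solve(D, D - A)   # T_J = -D^{-1}(L + U) = D^{-1}(D - A)
c = np.linalg.solve(D, b)

print(np.linalg.norm(T, np.inf), np.linalg.norm(T, 1))  # 0.6 0.625

x = np.zeros(3)
for _ in range(50):
    x = T @ x + c
print(np.linalg.norm(A @ x - b))  # tiny residual
```
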

Example 5.9 (finite-difference matrix) Recall the Jacobi iteration for the finite-difference matrix (5.2). Then, \(\vec x_{n+1}=T \vec x_n+\vec c\) for \[ T% = \frac 12 \begin{bmatrix} 0 & 1 & & & \\ 1 & 0 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & 1 & 0 & 1 \\ & & & 1 & 0 \end{bmatrix}, \qquad% \vec c% =\frac 12 \vec b. \]

Note that \(\| T \|_\infty=\| T \|_1=1\) so that Corollary 5.1 does not apply. We work out the eigenvalues similarly to Example 5.7. Let \(\vec u_k=[\sin( k\pi h),\sin (2k\pi h),\dots, \sin (d k \pi h)]^\mathsf{T}\) for \(h=1/(d+1)\). Then, the \(j\)th component of \(T\vec u_k\) is \(\frac 12 \bigl(\sin((j-1)k \pi h) + \sin ((j+1) k\pi h)\bigr)\) and, by applying a trig identity, this is \(\cos (k\pi h) \sin(j k \pi h)\). Thus, \(\lambda_k=\cos (k\pi h)\) for \(k=1,\dots,d\) are the eigenvalues of \(T\) and, as \(| \lambda_k |<1\), the convergence of the Jacobi iteration follows for the finite-difference matrix.

Note however that \(\rho(T)=\max\{|\cos (k\pi h)|\colon k=1,\dots,d\}=\cos(\pi h)\to 1\) as \(h\to 0\), since \(\cos (\pi h)\approx 1\) for \(h\approx 0\). This means that when \(h\) is small the Jacobi iteration converges slowly. This is a significant problem in applications, where small \(h\) corresponds to accurate resolution of the underlying physics. In other words, the more accurate our discretisation, the slower the Jacobi iteration converges.
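
The following sketch illustrates the slow-down: it tabulates \(\rho(T)=\cos(\pi h)\) and a rough estimate of the number of Jacobi iterations needed to reduce the error by a factor of \(10^6\) (the target factor is an arbitrary choice):

```
import numpy as np

for d in [10, 100, 1000]:
    h = 1.0 / (d + 1)
    rho = np.cos(np.pi * h)                        # spectral radius of T
    # smallest n with rho**n <= 1e-6, i.e. error reduced by a factor of a million
    n = int(np.ceil(-6 * np.log(10) / np.log(rho)))
    print(d, rho, n)
```
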

For convergence of Gauss–Seidel with the finite-difference matrix, see Problem Sheets.

5.4 Condition number

We’d like to estimate the error in the approximation when solving a linear system. Our methods provide a computed value \(\vec x_c\) that we hope approximates the solution \(\vec x\) of \(A\vec x=\vec b\). We cannot evaluate the norm of \(\vec x_c-\vec x\) as \(\vec x\) is usually unknown. What we do have is the residual defined by \[ \vec r\colon= A\vec x_c-\vec b. \] Note that \[ \vec r= A\vec x_c-A\vec x= A(\vec x_c-\vec x) \] and \(\vec x_c-\vec x=A^{-1} \vec r\). Apply (5.3) to get \[ \| \vec x_c-\vec x \|\le \| A^{-1} \|_{\mathrm{op}} \| \vec r \|. \] Furthermore, \(A\vec x=\vec b\) gives \[ \| \vec b \|=\| A\vec x \| \le \| A \|_{\mathrm{op}}\, \| \vec x \|. \] Combining the two inequalities, \[ \frac{\| \vec x_c-\vec x \|}{\| A \|_{\mathrm{op}}\,\| \vec x \|}% \le \frac{\| A^{-1} \|_{\mathrm{op}} \, \| \vec r \|}{\| \vec b \|}. \] Rearranging we get \[ \underbrace{\frac{\| \vec x_c-\vec x \|}{\| \vec x \|}}_{\text{relative error}}% \le \underbrace{\vphantom{\frac{\| x \|}{\| x \|}}\| A^{-1} \|_{\mathrm{op}}\, \| A \|_{\mathrm{op}}}_{\text{condition number}} \, \underbrace{\frac{\| \vec r \|}{\| \vec b \|}.}_{\text{relative residual}} \] Thus, the relative error is bounded by a constant times the relative residual. The constant is known as the condition number:

Definition 5.5 (condition number) The condition number of a non-singular matrix \(A\) is defined by \[ \operatorname{Cond}(A)% \colon=\| A \|_{\mathrm{op}}\,\| A^{-1} \|_{\mathrm{op}}. \] Each choice of operator norm \(\| A \|_{\mathrm{op}}\) gives a different condition number corresponding to the choice of vector norm \(\| \vec x \|\) above, and \(\operatorname{Cond}_1(A)\), \(\operatorname{Cond}_2(A)\), \(\operatorname{Cond}_\infty(A)\) denote the condition numbers with respect to the \(1\)-, \(2\)-, and \(\infty\)-norms. Often \(\kappa(A)\) is used to denote the condition number.

The condition number is a widely used measure of the difficulty of solving \(A\vec x=\vec b\). When \(\operatorname{Cond}(A)\) is large, it may be impossible to get accurate results.

Example 5.10 Returning to (1.1), we have \[ A= \begin{bmatrix} \epsilon & 1 \\ 0 & 1 \end{bmatrix},\qquad A^{-1}= \frac 1 \epsilon\begin{bmatrix} 1 & -1 \\ 0 & \epsilon \end{bmatrix}. \] Suppose that \(0<\epsilon<1\). Then, \(\| A \|_1=2\) and \(\| A^{-1} \|_1=(1+\epsilon)/\epsilon\) and hence \(\operatorname{Cond}_1(A)=2(1+\epsilon)/\epsilon\). This is large when \(\epsilon\) is small and the matrix is ill-conditioned. This reflects the sensitivity of \(\vec x\) to changes in \(\vec b\) found when solving \(A\vec x=\vec b\) there.
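
A quick check with `numpy.linalg.cond` (the values of \(\epsilon\) are arbitrary illustrative choices):

```
import numpy as np

for eps in [1e-2, 1e-6, 1e-10]:
    A = np.array([[eps, 1.],
                  [0., 1.]])
    print(eps, np.linalg.cond(A, 1), 2 * (1 + eps) / eps)  # the last two agree
```
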

Example 5.11 (Hilbert matrix) Refer back to Problem Sheet 1, where we performed experiments with the Hilbert matrix \(A\), which is the \(d\times d\) matrix with \((i,j)\) entry \(1/(i+j-1)\). We found that it was difficult to solve the linear system of equations \(A\vec x=\vec b\) for a given \(\vec b\in\mathbb{R}^d\) if \(d\) is moderately large (e.g., \(d=10\)). In Python, `numpy.linalg` provides the function `cond` for finding the condition number. We apply this to the Hilbert matrix:

```
from numpy.linalg import cond
from scipy.linalg import hilbert

d = 4
A = hilbert(d)
cond_A = cond(A, p=1)
print(A)
print(cond_A)
```

The output is

```
[[1.         0.5        0.33333333 0.25      ]
 [0.5        0.33333333 0.25       0.2       ]
 [0.33333333 0.25       0.2        0.16666667]
 [0.25       0.2        0.16666667 0.14285714]]
28374.99999999729
```

Repeating this experiment for \(d=6\) we find a condition number of \(2.9070\times 10^7\); for \(d=8\), the condition number is \(3.3872\times 10^{10}\); for \(d=10\), the condition number is \(3.534\times 10^{13}\). Even for small systems (problems in real-world applications can easily have millions of unknowns), the condition number is extremely large.

5.4.1 Second derivation of \(\operatorname{Cond}(A)\)

Suppose that \[ A\vec x= \vec b,\qquad (A+\Delta A)\vec x_c= \vec b, \] where \(\Delta A\) can be thought of as the effect of rounding errors. Then, \[ \vec x=A^{-1} \vec b= A^{-1}(A+\Delta A) \vec x_c% =\vec x_c+ A^{-1} \Delta A \vec x_c. \] Hence, \[ \vec x-\vec x_c = A^{-1} \Delta A \vec x_c. \] Applying norms, we find that \[ \| \vec x-\vec x_c \|% \le \| A^{-1} \|_{\mathrm{op}} \| \Delta A \|_{\mathrm{op}} \| \vec x_c \|. \] We can rewrite this in terms of the condition number: \[ \frac{\| \vec x-\vec x_c \|}{\| \vec x_c \|}% \le \operatorname{Cond}(A) \frac{\| \Delta A \|_{\mathrm{op}}}{\| A \|_{\mathrm{op}}}. \] The relative error is less than the condition number times the size of the relative change in \(A\).
Example 5.12 (finite-difference matrix) Let \(A\) be the \(d\times d\) finite-difference matrix of Example 5.2. Then, \(\operatorname{Cond}_2(A)\) is easy to find, because we know all the eigenvalues of \(A\). The eigenvalues of \(A^{-1}\) are simply \(\lambda^{-1}\) where \(\lambda\) is an eigenvalue of \(A\). So, as \(A\) is symmetric, \[ \operatorname{Cond}_2(A)=\| A \|_2 \| A^{-1} \|_2=\rho(A) \rho(A^{-1}). \] The eigenvalues are \(\lambda_k=2(1-\cos (k\pi h))\). Note that \(\lambda_k\) increases from \(\lambda_1=2(1-\cos(\pi h))\) to \(\lambda_d=2(1-\cos(d\pi/(d+1)))\). Then, \(\rho(A)=2(1-\cos(d\pi /(d+1)))\) and \(\rho(A^{-1})=1/(2(1-\cos( \pi h)))\), and \[ \operatorname{Cond}_2(A)=\frac{1-\cos(d\pi/(d+1))}{ 1-\cos(\pi h)} \to \infty, \qquad \text{ as $h\downarrow 0$.} \] In other words, the finite-difference matrix becomes more ill-conditioned as we make \(h\) small and approximate the second derivative accurately. This is a common problem when approximating PDEs numerically.
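
A numerical check of this formula against `numpy.linalg.cond` (the value of \(d\) is an arbitrary choice):

```
import numpy as np

d = 50
h = 1.0 / (d + 1)
A = 2 * np.eye(d) - np.eye(d, k=1) - np.eye(d, k=-1)

formula = (1 - np.cos(d * np.pi / (d + 1))) / (1 - np.cos(np.pi * h))
print(np.linalg.cond(A, 2), formula)  # the two values agree
```
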