Chapter 4 Solving nonlinear equations
Root finding problem For a given function \(f\colon \mathbb{R} \to \mathbb{R}\), determine a solution \(x\) of the equation \(f(x)=0\). A solution \(x\) is known as a root of \(f\).
For most \(f\), there is no formula to give \(x\) explicitly and numerical methods are required. For example, we may use the bisection method and choose an interval \([a,b]\) that contains the root by checking that \(f(a)>0\) and \(f(b)<0\) or vice versa (\(f\) changes sign and has a root in the interval if it is continuous). Bisecting the interval and choosing the subinterval where \(f\) changes sign at the endpoints gives a new interval containing a root. Then, we iterate to find successively smaller intervals and better approximations to the root. Given a suitable initial interval, this bisection method is simple to apply but does not generalise easily to higher dimensions. Instead, we focus on fixed-point iterations.
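To make the bisection procedure concrete, here is a minimal Python sketch (illustrative code, not part of the notes; the tolerance and iteration cap are arbitrary choices):

```python
def bisection(f, a, b, tol=1e-10, max_iter=100):
    """Find a root of f in [a, b], assuming f(a) and f(b) have opposite signs."""
    fa, fb = f(a), f(b)
    if fa * fb > 0:
        raise ValueError("f must change sign on [a, b]")
    for _ in range(max_iter):
        m = (a + b) / 2
        fm = f(m)
        if fm == 0 or (b - a) / 2 < tol:
            return m
        # keep the subinterval where f changes sign at the endpoints
        if fa * fm < 0:
            b, fb = m, fm
        else:
            a, fa = m, fm
    return (a + b) / 2

# Example: f(x) = x^3 + 4x^2 - 10 has a root in [1, 2]
print(bisection(lambda x: x**3 + 4 * x**2 - 10, 1.0, 2.0))   # approx 1.36523
```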
4.1 Fixed-point iteration
Example 4.1 Let \(f(x)=x^3+4x^2-10\). There are many ways of posing this as a fixed-point (FP) problem \(g(x)=x\).
Let \(g_1(x)=x-x^3-4x^2+10\). Then, it is easy to check that \(g_1(x)=x\) if and only if \(f(x)=0\).
Let \(g_2(x)=\frac 12 (10-x^3)^{1/2}\) (positive root). For \(x>0\), \(g_2(x)=x\) if and only if \(f(x)=0\).
Let \(g_3(x)= \left({\frac{10}{4+x}}\right)^{1/2}\), which is again a FP problem for the root-finding problem for \(f\).
Define the sequence \(x_n\) by \(x_{n+1}=g(x_n)\), given an initial condition \(x_0\). Under what conditions does \(x_n\) converge to a fixed point of \(g\), and can this be used for computing the root? To answer this question, we have the fixed-point theorem: suppose that \(g\) is smooth on \([a,b]\) and that (i) \(g(x)\in[a,b]\) for all \(x\in[a,b]\), and (ii) there exists \(\lambda<1\) such that \(\lvert g'(x) \rvert \le \lambda\) for all \(x\in[a,b]\); then \(g\) has a unique fixed point \(x\in[a,b]\) and \(x_n\to x\) for any \(x_0\in[a,b]\). We use the term ‘smooth’ to mean the function has enough continuous derivatives.
Example 4.2 We look back at the fixed point problems in Example 4.1:
\(g_1(x)=x-x^3-4x^2+10\) and \([a,b]=[1,2]\). Then \(g_1(1)=6\notin[1,2]\) and condition (i) fails. The FP theorem does not apply to the iteration based on \(g_1\).
\(g_2(x)=\frac 12 \left({10-x^3}\right)^{1/2}\) and \[\begin{align*} g_2'(x) &=\frac 14 \left({10-x^3}\right)^{-1/2} (-3x^2) =\frac{-3x^2 }{4(10-x^3)^{1/2}},\\ g_2'(2) &=\frac{-3\cdot 4}{4(10-8)^{1/2}} =\frac{-3}{\sqrt{2}} \approx -2.12. \end{align*}\] Hence, \(\lvert g_2'(2) \rvert >1\) and condition (ii) fails on \([1,2]\). Again, the FP theorem does not apply.
\(g_3(x)=\left({\frac{10}{4+x}}\right)^{1/2}\) and \[ g'_3(x)=\frac 12 \left({\frac{10}{4+x}}\right)^{-1/2} \left({\frac{-10}{(4+x)^2}}\right) = \frac{-5}{(4+x)^{3/2}\sqrt{10}}. \] As \(g_3\) is decreasing and \(g_3(2)=\sqrt{10/6}\in[1,2]\) and \(g_3(1)=\sqrt 2\in[1,2]\), we see that condition (i) holds. Further, \[ \lvert g'_3(x) \rvert \le \frac{5}{\sqrt{10}} \frac{1}{5^{3/2}} <1\qquad \text{for $x\in [1,2]$}. \] Hence (ii) holds. The FP theorem applies and \(x_n\to x\), the unique fixed point of \(g_3\) and the root of \(f\): try it, \[ x_1=1.5,\qquad x_2=g_3(1.5)\approx 1.3484,\qquad x_3\approx 1.3674,\qquad x_4\approx 1.365. \] We see that the first three digits of \(x_n\) have already converged and that \(f(1.365)=-0.0038\), indicating that \(1.365\) is close to the root of \(f\).
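These iterates are easy to reproduce with a short fixed-point loop; the following Python sketch is illustrative (the starting value \(1.5\) matches the computation above):

```python
def fixed_point(g, x0, n):
    """Apply the fixed-point iteration x_{k+1} = g(x_k) for n steps."""
    x = x0
    for _ in range(n):
        x = g(x)
        print(x)
    return x

g3 = lambda x: (10 / (4 + x)) ** 0.5
fixed_point(g3, 1.5, 4)   # approx 1.3484, 1.3674, 1.3650, 1.3653
```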
For the proof of the fixed point theorem, we use the mean-value theorem from MA10207.
Proof. Let \(\varphi(x)=g(x)-x\). Then, by (i), \(\varphi(a)=g(a)-a\ge 0\) and \(\varphi(b)=g(b)-b\le 0\). By the intermediate-value theorem, there exists \(x\in [a,b]\) such that \(\varphi(x)=0\). In other words, there exists \(x\in[a,b]\) so that \(g(x)=x\).
Consider the iteration \(x_{n+1}=g(x_n)\) and the fixed-point equation \(x=g(x)\). Then, \[ x_{n+1}-x =g(x_n)-g(x). \] By the mean-value theorem, there exists \(\xi\in(a,b)\) so that \[ x_{n+1}-x =g'(\xi)(x_{n}-x) \] (as \(g\) is smooth). Now \(|g'(\xi)|\le \lambda\) and \[ \lvert x_{n+1}-x \rvert \le \lambda \lvert x_n-x \rvert. \] By a simple induction argument, this implies that \(|x_{n}-x|\le \lambda^n |x_0-x|\).
Finally, to show uniqueness, consider two fixed points \(x,y\) of \(g\). Then \(g(x)=x\) and \(g(y)=y\) and hence \[ x-y =g(x)-g(y) =g'(\xi)(x-y). \] As \(\lvert g'(\xi) \rvert \le \lambda\), we see that \[ \lvert x-y \rvert \le \lambda \lvert x-y \rvert. \] As \(\lambda<1\), it must hold that \(x=y\) and there is only one fixed point of \(g\) in \([a,b]\).

4.2 Newton’s method
The most well-known example of a fixed-point iteration is Newton’s method. This is the iteration \[ x_{n+1}=g(x_n),\qquad g(x)\colon= x-\frac{f(x)}{f'(x)}, \] where we assume that \(f'(x)\ne 0\). Clearly, \(f(x)=0\) if and only if \(g(x)=x\).
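A minimal Python sketch of Newton’s method (illustrative; here applied to \(f(x)=x^3+4x^2-10\) from Example 4.1, with the derivative supplied by hand):

```python
def newton(f, df, x0, tol=1e-12, max_iter=50):
    """Newton's method: x_{n+1} = x_n - f(x_n)/f'(x_n)."""
    x = x0
    for _ in range(max_iter):
        fx = f(x)
        if abs(fx) < tol:
            return x
        x = x - fx / df(x)   # assumes df(x) != 0
    return x

f  = lambda x: x**3 + 4 * x**2 - 10
df = lambda x: 3 * x**2 + 8 * x
print(newton(f, df, 1.5))   # approx 1.3652300134...
```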
We show the FP theorem applies: if \(f\) is smooth with \(f(x)=0\) and \(f'(x)\ne 0\), then \(g'(x)=f(x)f''(x)/f'(x)^2=0\) at the root, so conditions (i) and (ii) hold on an interval \([x-\epsilon,x+\epsilon]\) for sufficiently small \(\epsilon>0\), and Newton’s method converges for any starting point \(x_0\) with \(\lvert x_0-x \rvert\le \epsilon\). This result is problematic for the practitioner: it says that Newton’s method converges if we can start close enough to the root! We don’t usually know the root and we don’t usually know what \(\epsilon\) is (i.e., what close enough means). However, Newton’s method is often effective and, when it works, it is often very fast.
To quantify the speed of convergence, we define the order of convergence.

Definition 4.2 (order of convergence) Consider a sequence \(x_n\) approximating \(x\). Let \(e_n\colon= \lvert{x_n-x}\rvert\), the error in the approximation. We say that \(x_n\) converges to \(x\) with order \(r\) if
case \(r=1\) (linear convergence): \(e_{n+1} \le K e_n\) for all \(n\in \mathbb{N}\), for some \(K<1\);
case \(r>1\): \(e_{n+1} \le K e_n^r\) for all \(n\in \mathbb{N}\), for some \(K>0\). The case \(r=2\) is known as quadratic convergence.
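When the limit \(x\) (and hence the errors \(e_n\)) is known, the order can be estimated numerically: from \(e_{n+1}\approx K e_n^r\) it follows that \(r\approx \log(e_{n+1}/e_n)\,/\,\log(e_n/e_{n-1})\). A small Python sketch of this estimate (illustrative only; the sample errors are made up so that \(e_{n+1}=e_n^2\)):

```python
import math

def estimate_order(errors):
    """Estimate r from consecutive errors e_{n-1}, e_n, e_{n+1}."""
    return [math.log(errors[i + 1] / errors[i]) / math.log(errors[i] / errors[i - 1])
            for i in range(1, len(errors) - 1)]

# Errors of a quadratically convergent sequence (each error is the square of the previous)
print(estimate_order([1e-1, 1e-2, 1e-4, 1e-8]))   # approx [2.0, 2.0]
```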
One of the most useful tools in numerical analysis is Taylor’s theorem from MA10207:
For example, applying Newton’s method to \(f(x)=\sin(x)\) gives the fixed-point iteration \[ x_{n+1}=x_n-\tan(x_n). \]
The example nicely illustrates quadratic convergence: we see that the number of correct digits increases rapidly (3, 10, then 11 digits; the last iterate is affected by rounding error).
4.3 Solution of Initial-Value Problems (IVPs)
An IVP in standard form is a system of \(N\) differential equations: \[\begin{equation} \tag{4.2} \frac{d\mathbf y}{dt} =\mathbf f(\mathbf y,t),\quad t\geq 0, \end{equation}\] where \(\mathbf f \colon \mathbb{R}^{N+1} \to \mathbb{R}^N\) is a given function and \(\mathbf y (t) \in \mathbb{R}^N\) is the solution (for \(t>0\)), to be found subject to initial conditions of the form: \[\begin{equation} \tag{4.3} \mathbf y(0)=\mathbf y_0, \end{equation}\] with initial data \(\mathbf y_0 \in \mathbb{R}^N\) given.
Example 4.4 (\(N = 1\)) Let \(f(y,t) = y^2\) and \(y_0 = 1\). Then
\[ \frac{d y}{d t}=y^2, \quad \text{subject to} \quad y(0)=1. \] It is easy to see that the exact solution is \(y(t)=1/(1-t).\) (Please check!) Usually the exact solution is not so easy to find.

4.4 Euler’s method
Given Examples 4.4 and 4.5, suppose we want to find \(\vec y (t)\) for \(t=h\) for some small \(h\). Integrating each side of (4.2) over \([0,h]\) gives
\[\begin{equation} \vec y(h) - \vec y(0) = \int^h_0\frac{d \vec y}{d t}\,d t = \int^h_0 f(\vec y(t),t)\,d t \tag{4.5} \end{equation}\]
The integral on the right-hand side is still unknown, but we can approximate it, for example, by replacing the integrand by its value at \(t = 0\) (which will be okay if \(h\) is small). This yields \[\begin{equation} \vec y(h) \approx \vec y(0) + h \, \mathbf{f}(\vec y(0),0) = \vec y_0 + h \,\mathbf{f}(\vec y_0,0). %\Rightarrow &\vec y(h)=\vec y(0)+\int^h_0 f(\vec y(t),t)\,d t. \end{equation}\]
Everything on the right-hand side is known so we can compute an approximation of \(\vec y(h)\). Then \(\vec y(2h)\) can be approximated by taking another step, and so on. This is Euler’s Method, which computes a sequence of approximations \(\vec Y_j\) to \(\vec y(t_j)\), where \(t_j = jh\), \(j = 1,2,\ldots\) by \[\begin{equation} \begin{aligned} \vec Y_0& = \vec y_0\\ \vec Y_{j}& = \vec Y_{j-1} + h\,\mathbf{f}(\vec Y_{j-1},t_{j-1}), \qquad \text{for all } j\geq 1. \end{aligned} \end{equation}\]
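A minimal Python sketch of Euler’s method for a single equation (illustrative; this is not the course program referred to below, and `f` here takes the arguments `(y, t)` as in (4.2)):

```python
def euler(f, y0, h, n):
    """n steps of Euler's method: Y_j = Y_{j-1} + h f(Y_{j-1}, t_{j-1})."""
    Y, t = y0, 0.0
    ys = [Y]
    for _ in range(n):
        Y = Y + h * f(Y, t)
        t = t + h
        ys.append(Y)
    return ys

# Example 4.6: dy/dt = y^2, y(0) = 1, h = 0.1
print(euler(lambda y, t: y**2, 1.0, 0.1, 2))   # approximately [1.0, 1.1, 1.221]
```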
Example 4.6 Solve \[\begin{equation} \tag{4.6} \frac{d y}{d t}=y^2, \quad \quad y(0) = 1 \end{equation}\] using Euler’s method with \(h=0.1\); that is, \[\begin{align*} Y_0&=y_0= 1\\ Y_1&=Y_0 + hf(Y_0)=1+0.1(1^2) = 1.1\\ Y_2&=Y_1+ hf(Y_1)=1.1+0.1(1.1)^2=1.221. \end{align*}\]
Using the program, we approximate the solution of (4.6) at the time \(T = 1/2\) for various \(h\). The number of steps taken in Euler’s method is then \(n = T/h\). Since we know \(y(1/2) = 2\) in this case (from the exact solution in Example 4.4), we can find the error exactly.
Results:

| \(n = T/h\) | \(h\) | \(\lvert y(1/2) - Y_n \rvert\) | Ratio |
|---|---|---|---|
| 4 | 1/8 | 0.2338 | 0.59 |
| 8 | 1/16 | 0.1389 | 0.55 |
| 16 | 1/32 | 0.07696 | 0.53 |
| 32 | 1/64 | 0.04073 | |
The ratios of successive errors approach \(1/2\), which suggests that \(| y(1/2) - Y_n | = \mathcal{O}(h)\) as \(h \rightarrow 0\). This is what we prove in the next section.
4.5 Convergence of one-step methods
We now restrict to the simplified version of (4.2) with \(N = 1\) and \(f(y,t) = f(y)\): \[\begin{equation} \tag{4.7} \frac{d y}{d t}=f(y) \end{equation}\] subject to the initial condition \[\begin{equation} y(0)=y_0 . \tag{4.8} \end{equation}\]
We shall consider “one step” methods of the form : \[\begin{align} Y_0& = y_0 \nonumber\\ Y_{j}&= Y_{j-1} + h\,F_h(Y_{j-1}), \qquad \text{for all } j\geq 1 \tag{4.9} \end{align}\] where \(F_h\) is a function to be specified.
An example is Euler’s method, where \(F_h \equiv f\). The error at the \(j\)th step is defined to be \[ e_j = y(t_j)-Y_j \quad \text{where} \quad t_j = jh. \]
The convergence analysis proceeds by (i) bounding the error in the computed solutions \(Y_j\) in terms of the local truncation errors and (ii) estimating the local truncation error using Taylor’s theorem.
Proof. By definition of \(\tau_j\), \[\begin{equation} \tag{4.11} y(t_{j}) =y(t_{j-1}) + h\,F_h(y(t_{j-1})) + h\,\tau_j \, . \end{equation}\] So with \(e_j = y(t_j) - Y_j\), we have by subtracting (4.9) from (4.11): \[ e_{j}=e_{j-1}+h \, \Big(F_h(y(t_{j-1})) - F_h(Y_{j-1})\Big) +h \,\tau_j. \] By the triangle inequality and the Lipschitz continuity of \(F_h\), \[\begin{align*} | e_{j} | & \leq | e_{j-1} | + h \, | F_h(y(t_{j-1})) - F_h(Y_{j-1}) | + h \, | \tau_j | \\ & \leq (1 + hL) | e_{j-1} | + h \, | \tau_j |. \end{align*}\] We see that (4.10) holds.
Now we shall prove by induction that, for all \(n \geq 1\), \[\begin{equation} \tag{4.12} | e_n | \leq h \sum_{j = 0} ^{n-1} (1 + hL)^j \, |\tau_{n-j}|. \end{equation}\] Clearly (4.12) holds for \(n =1\), since the first equation in (4.9) implies \(| e_0| = 0\) and (4.10) then gives \(| e_1 | \leq h | \tau_1|\). Now if (4.12) holds for some \(n\), then (4.10) implies \[\begin{align*} | e_{n+1} | & \leq h \, \sum_{j =0}^{n-1} (1 + hL)^{j+1} \, | \tau_{n-j}| + h \, | \tau_{n+1} | \\ & = h \, \sum_{j =1}^{n} (1 + h L)^{j} \, | \tau_{n+1-j}| + h \, | \tau_{n+1} |\\ & = h \, \sum_{j =0}^{n} (1 + hL)^{j} \, | \tau_{n+1-j}|. \end{align*}\] Hence, (4.12) holds for \(n+1\), and for all \(n\) by induction.
Finally, from (4.12) we have, since \(\displaystyle \sum_{j=0}^{n-1} q^j = \frac{q^n-1}{q-1}\), \[ | e_n | \leq \frac{(1+ hL)^{n}-1}{L} \max_{1 \le j \le n} | \tau_j | \leq \frac{\exp(nhL)-1}{L} \max_{1 \le j \le n} | \tau_j |, \] since \(1 + x \leq \exp x\) for all \(x \geq 0\). And so the result follows for all \(nh\leq T\).

Normally the local truncation error is estimated by applying Taylor’s theorem.
Example 4.7 Euler’s method is (4.9) with \(F_h (Y) \colon= f(Y)\). If we assume that \(f\) is Lipschitz continuous, then Theorem 4.6 applies, and to show convergence we have to estimate \(\tau_j\).
To do this we write (using the definition of \(\tau_j\)): \[ \tau_j = \frac{y(t_{j} ) - y(t_{j-1})}{h} - f(y(t_{j-1})) = \frac{y(t_{j-1} + h ) - y(t_{j-1})}{h} - f(y(t_{j-1})), \] which we can expand via Taylor’s Theorem, with \(\xi_j \in (t_{j-1} , t_{j})\), such that \[\begin{align*} \tau_j & = \frac{\displaystyle \left( y(t_{j-1}) + h \frac{dy}{dt}(t_{j-1}) + \frac{h^2}{2} \frac{d^2y}{dt^2} (\xi_j) - y(t_{j-1}) \right) }{h} - f(y(t_{j-1})) \\ & = \left[ \frac{dy}{dt}(t_{j-1}) - f(y(t_{j-1})) \right] + \frac{h}{2} \, \frac{d^2y}{dt^2} (\xi_j). \end{align*}\]
Now, since \(y(t)\) is the solution of (4.7), the term in square brackets vanishes and we have, for all \(j = 1, \ldots , n\): \[\begin{equation} | \tau_j| \leq \frac{1}{2} \max_{t \in [0,T]} \left| \frac{d^2 y }{dt^2} (t) \right| \, h. \end{equation}\] Hence if \[\begin{equation} \tag{4.13} \left| \frac{d^2 y }{dt^2} (t) \right| \qquad \text{is bounded for $t \in \mathbb{R}$}, \end{equation}\]
then Theorem 4.6 implies that \[\begin{equation} \tag{4.14} | e_n | \leq C(T) h , \end{equation}\]
where \(C(T)\) is a constant depending on \(T\); and, for fixed \(T\), we have convergence (i.e., \(e_n \rightarrow 0\) as \(h \rightarrow 0\)).
Often one just assumes (4.13) and then concludes (4.14). To show (4.13) rigorously, we need to make some assumptions on the given function \(f\). In particular, assume that \[\begin{align} | f(x) | & \leq M, \quad x \in \mathbb{R}, \\[1.5ex] \text{ and } \quad | f'(x) | & \leq L , \quad x \in \mathbb{R}. \end{align}\] Then \(f\) is Lipschitz continuous with Lipschitz constant \(L\), and by (4.7) and the chain rule, \[ \left| \frac{d^2y}{dt^2} (t) \right| = \left| \frac{d}{dt}f(y(t))\right| = \left| f'(y(t)) \frac{dy}{dt}(t) \right| = \left| f'(y(t)) \right| \left| f(y(t)) \right| \leq LM . \]

4.6 Higher-order methods
We saw in (4.14) that Euler’s Method converges with \(\mathcal{O}(h)\). This is relatively slow. Higher-order methods can be found by employing higher-order quadrature in (4.5). For simplicity, restrict again to the case \(N=1\) and \(f(y,t) = f(y)\): then (4.5) is \[ y(h)-y(0)=\int^h_0f(y(t))\,d t. \] Instead of approximating the right-hand side by the one-point quadrature rule at \(0\), consider using the trapezium rule instead, to obtain
\[\begin{equation} y(h)\approx y(0)+\frac h2 \Big[{f(y(0))+f(y(h))}\Big]. \tag{4.15} \end{equation}\] This motivates the Crank–Nicolson (or Trapezoidal) Method for solving (4.2): \[\begin{equation} \begin{aligned} Y_0& = y_0\\ Y_{j}&= Y_{j-1} + \frac h2 \Big[{f(Y_{j-1})+f(Y_{j})}\Big], \qquad \text{for all $j\geq 1$.} \end{aligned} \tag{4.16} \end{equation}\] This method has a big disadvantage, since to find \(Y_{j}\) from \(Y_{j-1}\) in (4.16), we have to solve a (possibly nonlinear) equation \[ Y_{j}-\frac h2 f(Y_{j}) = Y_{j-1}+\frac h2 f(Y_{j-1}) \, . \]
If there are \(N\) differential equations, then there are \(N\) (possibly nonlinear) equations to solve at each timestep. Despite this extra cost such “implicit” methods are often preferred because of their good stability properties. (Stability is an advanced topic.)
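As a sketch of how such an implicit step can be handled in practice, the following Python code applies the trapezoidal method (4.16) to a scalar problem and solves the nonlinear equation at each step with a few inner fixed-point iterations (one could equally use Newton’s method from Section 4.2; the inner iteration count here is an arbitrary illustrative choice):

```python
def trapezoidal(f, y0, h, n, inner_iters=20):
    """Crank-Nicolson / trapezoidal rule: Y_j = Y_{j-1} + h/2 (f(Y_{j-1}) + f(Y_j))."""
    Y = y0
    for _ in range(n):
        rhs = Y + 0.5 * h * f(Y)
        Z = Y + h * f(Y)                 # Euler predictor as a starting guess
        for _ in range(inner_iters):
            Z = rhs + 0.5 * h * f(Z)     # fixed-point iteration for the implicit equation
        Y = Z
    return Y

# dy/dt = y^2, y(0) = 1; approximate y(1/2) = 2
print(trapezoidal(lambda y: y**2, 1.0, 1 / 32, 16))   # close to 2, error of order h^2
```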
To estimate the local truncation error for (4.16), write the method as \[ \frac{Y_{j} - Y_{j-1}}{h} = \frac{1}{2} \Big[f(Y_{j-1})+f(Y_{j})\Big]. \]
Then using (4.7) and applying Taylor’s theorem to both \(y\) and \(dy/dt\) we get: \[\begin{align*} \tau_j&=\frac{y(t_{j})-y(t_{j-1})}{h}-\frac12[f(y(t_{j-1}))+f(y(t_{j}))] \\ &= \frac{dy}{dt}(t_{j-1}) + \frac{h}{2} \frac{d^2y}{dt^2}(t_{j-1}) + \frac{h^2}{6} \frac{d^3y}{dt^3}(t_{j-1}) - \frac12\left[ \frac{dy}{dt}(t_{j-1}) + \frac{dy}{dt}(t_{j}) \right] + \mathcal{O}(h^3)\\ &= \left( \frac{dy}{dt} + \frac{h}{2} \frac{d^2y}{dt^2} + \frac{h^2}{6} \frac{d^3y}{dt^3} - \frac{dy}{dt} - \frac{h}{2} \frac{d^2y}{dt^2} - \frac{h^2}{4} \frac{d^3y}{dt^3} \right)(t_{j-1}) + \mathcal{O}(h^3)\\ &= - \frac{1}{12} \frac{d^3y}{dt^3}(t_{j-1})\,h^2 + \mathcal{O}(h^3). \end{align*}\] Hence, provided \(y\) has three bounded derivatives, \(|\tau_j| = \mathcal{O}(h^2)\). Note that \[ \frac{d^3y}{dt^3} = \frac{d^2}{dt^2} f(y) = \frac{d}{dt} \big(f'(y) y'\big) = \frac{d}{dt} \big(f'(y) f(y)\big) = f''(y) (f(y))^2 + (f'(y))^2 f(y)\,. \]
Proof. From (4.16), \[ Y_j = Y_{j-1} + \frac{h}{2} \, \Big[f(Y_{j-1}) + f(Y_{j}) \Big] \] and by definition of the truncation error \[ y(t_j) = y(t_{j-1}) + \frac{h}{2} \, \left[ f(y(t_{j-1})) + f(y(t_{j})) \right] + h \tau_j\,. \] Hence, subtracting, \[ |e_j| = \left| e_{j-1} + \frac{h}{2} \, \left[ f(y(t_{j-1})) - f(Y_{j-1}) \right] + \frac{h}{2} \, \left[ f(y(t_{j})) - f(Y_{j}) \right] + h \tau_j \right| \] and, using the triangle inequality and Lipschitz continuity of \(f\), we have \[ |e_j| \le | e_{j-1} | + \frac{hL}{2} \, | e_{j-1} | + \frac{hL}{2} \, | e_{j} | + h \, |\tau_j |. \] Rearrange as follows: \[\begin{align*} \left( 1 - \frac{hL}{2} \right) |e_j| &\le \left( 1 + \frac{hL}{2} \right) | e_{j-1} | + h \, |\tau_j | \\ \Rightarrow \qquad |e_j| &\le \left( 1 - \frac{hL}{2} \right)^{-1} \left[ \left( 1 + \frac{hL}{2} \right) | e_{j-1} | + h \, |\tau_j | \right]. \end{align*}\] Now, for any \(0 \le x \le \frac{1}{2}\), \[ (1-x)^{-1} \le 1+2x \qquad \text{(check!)} \] and so for \(hL \le 1\) (since \((1-\frac{hL}{2})^{-1} \le 1+hL\) and \(1+\frac{hL}{2} \le 1+hL\)) \[ |e_j| \le ( 1 + hL )^2 | e_{j-1} | + h( 1 + hL ) \, |\tau_j |. \]
The rest of the proof is Problem E8.3.

4.6.1 Higher-order explicit methods
There are ways of achieving higher order without using implicitness. Assume that we have computed an approximation \(Y_{j-1}\) to \(y(t_{j-1})\). The improved Euler method first uses the standard Euler method to get an approximation \(\widehat Y_{j}\) to \(y(t_{j})\) and then the trapezium rule to improve it: \[\begin{align} \widehat Y_{j} = &Y_{j-1}+hf(Y_{j-1})\qquad\qquad\qquad\quad \text{``prediction''}\\ Y_{j}=&Y_{j-1}+\frac h2\left[ f(Y_{j-1})+f(\widehat Y_j) \right] \qquad \text{``correction''} \end{align}\] This method fits into the framework of Theorem 4.6 because it can be written: \[ Y_{j} = Y_{j-1} + \frac h2 \left( f(Y_{j-1}) + f\left( Y_{j-1} + hf(Y_{j-1}) \right) \right), \] which is of the form (4.9) with \(F_h(Y) \colon= \frac 12 f(Y) + \frac 12 f\left( Y + hf(Y) \right).\) The truncation error is \[ \tau_j = \frac{y(t_{j}) - y(t_{j-1})}{h} - \frac{1}{2} \left[{f(y(t_{j-1})) + f\left( y(t_{j-1}) + h\,f(y(t_{j-1})) \right) }\right], \] and it turns out that \(|\tau_j| = \mathcal{O}(h^2)\) for sufficiently smooth \(f\) (see Problem E8.2).
Moreover if \(f\) is Lipschitz then so is \(F_h\), since \[\begin{align*} | F_h(Y) - F_h(Z)| &\leq \frac 12 | f(Y) - f(Z) | + \frac 12 | f(Y + hf(Y)) - f(Z + h f(Z)) |\\ &\leq \frac L2 | Y - Z | + \frac L2 | (Y + hf(Y)) - (Z + h f(Z)) |\\ &\leq L | Y - Z | + \frac {hL}2 | f(Y) - f(Z) | \leq \left( L + \frac 12 hL^2 \right)| Y - Z |. \end{align*}\]
So under these conditions, Theorem 4.6 implies that the improved Euler method converges with order \(\mathcal{O}(h^2)\), at the small additional cost of an extra evaluation of \(f\) at each time step.
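A Python sketch of the improved Euler (predictor-corrector) method, with an illustrative check of the \(\mathcal{O}(h^2)\) behaviour on the test problem (4.6):

```python
def improved_euler(f, y0, h, n):
    """Predictor-corrector: predict with Euler, correct with the trapezium rule."""
    Y = y0
    for _ in range(n):
        Y_hat = Y + h * f(Y)                     # prediction
        Y = Y + 0.5 * h * (f(Y) + f(Y_hat))      # correction
    return Y

# dy/dt = y^2, y(0) = 1, exact y(1/2) = 2: errors should shrink roughly like h^2
for n in (4, 8, 16, 32):
    h = 0.5 / n
    print(n, abs(2 - improved_euler(lambda y: y**2, 1.0, h, n)))
```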
Higher-order methods can be built up using more evaluations of \(f\). In general these methods are called Runge–Kutta methods.
Example 4.8 Let \[\begin{equation*} \left. \begin{aligned} K_1&=f(Y_0),\\ K_2&=f(Y_0+\frac h2 K_1),\\ K_3&=f(Y_0+\frac h2 K_2),\\ K_4&=f(Y_0+hK_3),\\ Y_1&=Y_0+\frac h6[K_1+2K_2+2K_3+K_4]. \end{aligned} \right\} \end{equation*}\]
This is an order-4 Runge–Kutta method (showing this requires some analysis).
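A Python sketch of this method, written as a reusable single step (illustrative; the test problem is (4.6) again):

```python
def rk4_step(f, Y, h):
    """One step of the classical fourth-order Runge-Kutta method."""
    K1 = f(Y)
    K2 = f(Y + 0.5 * h * K1)
    K3 = f(Y + 0.5 * h * K2)
    K4 = f(Y + h * K3)
    return Y + h / 6 * (K1 + 2 * K2 + 2 * K3 + K4)

# dy/dt = y^2, y(0) = 1: take n steps of size h to approximate y(1/2) = 2
Y, h, n = 1.0, 1 / 16, 8
for _ in range(n):
    Y = rk4_step(lambda y: y**2, Y, h)
print(abs(2 - Y))   # very small error, consistent with order 4
```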