Chapter 4 Decision theory

Any situation in which choices are to be made amongst two or more possible alternative courses of action is a decision problem. If we are certain of the consequences of these choices, or actions, then making a decision is relatively straightforward: you can simply write down all of the possible options and choose the one which you like best. When the consequences are uncertain the problem is much less straightforward and, consequently, much more interesting. We study how decisions ought to be made in these circumstances; in short, what is the optimal decision?

The approach is prescriptive (that is, how people should rationally make decisions which satisfy some criteria) rather than descriptive (that is, studying the decisions that people actually make).

Let \(\mathcal{D}\) be the class of possible decisions. For each \(d \in \mathcal{D}\) let \(\Theta\) be the set of relevant events which affect the result of choosing \(d\). It is often helpful to think of each \(\theta \in \Theta\) as describing a state of nature so the actual value denotes the “true state of nature”.

Having chosen \(d\), we need a way of assessing how good or bad the consequence of choosing \(d\) was under the event \(\theta\). This measurement is called your utility.

We will largely restrict our attention to statistical decision theory where we regard inference as a decision problem. We can typically think of each \(d\) as representing a method of estimating \(\theta\) so that the utility measures how good or bad the estimation procedure is.

It is clear that the utility will depend upon context. For example, in some cases it may be much worse to overestimate a parameter \(\theta\) than to underestimate it, whilst in others it may be equally serious to under- or overestimate the parameter. Thus, the optimal decision will also depend upon context.

4.1 Utility

You can think about the pair \((\theta, d)\) as defining your status, \(S\), and denote the utility for this status by \(u(S)\). We now consider how such a utility can be obtained.

Suppose that \(S_{1}, \ldots, S_{n}\) constitute a collection of \(n\) statuses. We shall assume that you can compare these statuses. So, for any two statuses \(S_{i}\), \(S_{j}\) we write

  • \(S_{j} \prec^{\star} S_{i}\), you prefer \(S_{i}\) to \(S_{j}\), if you would pay an amount of money (however small) in order to swap \(S_{j}\) for \(S_{i}\).

  • \(S_{j} \sim^{\star} S_{i}\), you are indifferent between \(S_{i}\) and \(S_{j}\), if neither \(S_{j} \prec^{\star} S_{i}\) nor \(S_{i} \prec^{\star} S_{j}\) holds.

  • \(S_{j} \preceq^{\star} S_{i}\), \(S_{i}\) is at least as good as \(S_{j}\), if one of \(S_{j} \prec^{\star} S_{i}\) or \(S_{j} \sim^{\star} S_{i}\) holds.

Notice that this gives us a framework for comparing anything.

Example 4.1 A bakery has five types of cake available and I assert that \[\begin{eqnarray} \mbox{fruit cake} \prec^{\star} \mbox{carrot cake} \prec^{\star} \mbox{banana cake} \prec^{\star} \mbox{chocolate cake} \prec^{\star} \mbox{cheese cake}. \nonumber \end{eqnarray}\] Thus, I would be willing to pay to exchange a fruit cake for a carrot cake and then pay again to exchange the carrot cake for a banana cake and so on.

We make two assumptions about our preferences over statuses. Suppose that \(S_{1}, S_{2}, \ldots, S_{n}\) constitute a collection of \(n\) statuses. We assume

  1. (COMPARABILITY) For any \(S_{i}\), \(S_{j}\) exactly one of \(S_{i} \prec^{\star} S_{j}\), \(S_{j} \prec^{\star} S_{i}\), \(S_{i} \sim^{\star} S_{j}\) holds.

  2. (COHERENCE) If \(S_{i} \prec^{\star} S_{j}\) and \(S_{j} \prec^{\star} S_{k}\) then \(S_{i} \prec^{\star} S_{k}\).

Comparability ensures that we can express a preference between any two statuses, whilst coherence ensures that our preferences are transitive; without coherence, as the following example shows, our preferences could be exploited.

Example 4.2 Suppose that I didn’t have coherence over my preferences for cakes. For example, suppose that I assert \(\mbox{carrot cake} \prec^{\star} \mbox{banana cake}\) and \(\mbox{banana cake} \prec^{\star} \mbox{chocolate cake}\) but that \(\mbox{chocolate cake} \prec^{\star} \mbox{carrot cake}\). Then I would pay money to swap from carrot cake to banana cake and then from banana cake to chocolate cake. I am then willing to pay to switch the chocolate cake for a carrot cake. I am back in my original position, but I have spent money to maintain this status quo. I am a money pump.
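
To make the money-pump argument concrete, here is a minimal Python sketch; the cake names and preference pairs are hypothetical, echoing Examples 4.1 and 4.2. It detects an incoherent set of strict preferences by searching for a cycle:

```python
# A minimal sketch (hypothetical cake names): detect incoherent strict
# preferences by searching for a cycle, i.e. a money pump.
prefs = [("carrot", "banana"),     # carrot cake <* banana cake
         ("banana", "chocolate"),  # banana cake <* chocolate cake
         ("chocolate", "carrot")]  # chocolate cake <* carrot cake: suspect!

def has_money_pump(prefs):
    """True if the strict preferences contain a cycle."""
    graph = {}
    for worse, better in prefs:
        graph.setdefault(worse, set()).add(better)

    def trades_up_to(start, target, seen):
        # Depth-first search: can we pay to trade up from start to target?
        for nxt in graph.get(start, ()):
            if nxt == target or (nxt not in seen and
                                 trades_up_to(nxt, target, seen | {nxt})):
                return True
        return False

    # A cycle means we pay at every step yet end where we began.
    return any(trades_up_to(s, s, {s}) for s in graph)

print(has_money_pump(prefs))  # True: I am a money pump
```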

The consequence of these assumptions is that, for a collection of \(n\) statuses \(S_{1}, S_{2}, \ldots, S_{n}\), there is a labelling \(S_{(1)}, S_{(2)}, \ldots, S_{(n)}\) such that \[\begin{eqnarray} S_{(1)} \preceq^{\star} S_{(2)} \preceq^{\star} \cdots \preceq^{\star} S_{(n)}. \nonumber \end{eqnarray}\] This is termed a preference ordering for the statuses. In particular, there is a worst status \(S_{(1)}\) and a best status \(S_{(n)}\). Notice that these need not necessarily be unique.

In many situations, we are not certain as to which status will occur. This can be viewed as a gamble. We write \[\begin{eqnarray} G & = & p_{1}S_{1} +_{g} p_{2}S_{2} +_{g} \cdots +_{g} p_{n}S_{n} \nonumber \end{eqnarray}\] for the gamble that returns \(S_{1}\) with probability \(p_{1}\), \(S_{2}\) with probability \(p_{2}\), \(\ldots\), \(S_{n}\) with probability \(p_{n}\). We make two assumptions to ensure that our gambles are coherently compared.

  1. If \(S_{j} \preceq^{\star} S_{i}\), \(p < q\) then \(pS_{i} +_{g} (1-p)S_{j} \preceq^{\star} qS_{i} +_{g} (1-q)S_{j}\): a larger chance of the better status is preferred.

  2. If \(S_{j} \preceq^{\star} S_{i}\) then \(pS_{j} +_{g} (1-p)S_{k} \preceq^{\star} pS_{i} +_{g} (1-p)S_{k}\) for any \(S_{k}\): replacing one component of a gamble with a status that is at least as good cannot make the gamble worse.

Gambles provide the link between probability, preference and utility.

Definition 4.1 (Utility) A utility function \(u(\cdot)\) on gambles \(G = p_{1}S_{1} +_{g} p_{2}S_{2} +_{g} \cdots +_{g} p_{n}S_{n}\) over statuses \(S_{1}, S_{2}, \ldots, S_{n}\) assigns a real number \(u(G)\) to each \(G\) subject to the following conditions

  1. Let \(G_{i}\), \(G_{j}\) be any two gambles. If \(G_{j} \prec^{\star} G_{i}\) then \(u(G_{j}) < u(G_{i})\), and if \(G_{j} \sim^{\star} G_{i}\) then \(u(G_{j}) = u(G_{i})\).

  2. For any \(p \in [0, 1]\) and any statuses \(A\), \(B\), \[\begin{eqnarray} u(pA +_{g} (1-p)B) & = & pu(A) + (1-p)u(B). \nonumber \end{eqnarray}\]

  • Condition \(1.\) says that utilities agree with preferences, so you choose the gamble with the highest utility.

  • Condition \(2.\) says that, for the generic gamble \(G= p_{1}S_{1} +_{g} p_{2}S_{2} +_{g} \cdots +_{g} p_{n}S_{n}\), \(u(G) = p_{1}u(S_{1}) + p_{2}u(S_{2}) + \cdots + p_{n}u(S_{n})\). Hence, \(u(G) = E\{u(G)\}\).

i.e. Expected utility of a gamble \(=\) Actual utility of that gamble.

  • Conditions \(1.\) and \(2.\) combined imply that we choose the gamble with the highest expected utility. So, if we can specify a utility function over statuses, we can solve any decision problem by choosing the decision which maximises expected utility.
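
The following minimal Python sketch illustrates Conditions 1. and 2.; the statuses, utilities and gambles are invented purely for illustration. Each gamble is scored by \(u(G) = \sum_{i} p_{i}u(S_{i})\) and we choose the gamble with the highest expected utility:

```python
# A minimal sketch of expected-utility maximisation.
# The utilities over statuses are assumed numbers, for illustration only.
utility = {"fruit": 0.0, "carrot": 0.3, "banana": 0.5,
           "chocolate": 0.8, "cheese": 1.0}

# Each gamble G = p1 S1 +g ... +g pn Sn is a dict of status -> probability.
gambles = {
    "G1": {"fruit": 0.5, "cheese": 0.5},
    "G2": {"banana": 1.0},
    "G3": {"carrot": 0.2, "chocolate": 0.8},
}

def expected_utility(gamble):
    # Condition 2: u(G) = sum_i p_i u(S_i).
    return sum(p * utility[s] for s, p in gamble.items())

# Condition 1: utilities agree with preferences, so pick the largest.
best = max(gambles, key=lambda g: expected_utility(gambles[g]))
for name, g in gambles.items():
    print(name, expected_utility(g))
print("choose:", best)   # G3: 0.2*0.3 + 0.8*0.8 = 0.70
```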

Notice that a utility function over an ordered set of statuses \(S_{(1)} \preceq^{\star} S_{(2)} \preceq^{\star} \cdots \preceq^{\star} S_{(n)}\) is often constructed by setting \(u(S_{(1)}) = 0\) and \(u(S_{(n)}) = 1\) and, for each \(1 < i < n\), defining \(u(S_{(i)})\) to be the probability \(p\) such that \[\begin{eqnarray} S_{(i)} \sim^{\star} (1-p)S_{(1)} +_{g} pS_{(n)}. \nonumber \end{eqnarray}\] Thus, \(p\) is the probability at which you are indifferent between a guaranteed status of \(S_{(i)}\) and a gamble which gives status \(S_{(n)}\) with probability \(p\) and \(S_{(1)}\) with probability \((1-p)\). This \(p\) is often termed the indifference probability for status \(S_{(i)}\). A utility function is unique up to a positive linear transformation.
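
As a sketch of this construction, suppose that indifference probabilities for the cakes of Example 4.1 have been elicited (the values below are assumed). The utilities are then simply these probabilities, and a positive linear transformation leaves every comparison of gambles unchanged:

```python
# A sketch of utility construction via indifference probabilities.
# The elicited probabilities below are assumed values for illustration.
ordered = ["fruit", "carrot", "banana", "chocolate", "cheese"]  # S_(1)..S_(n)
p = {"carrot": 0.3, "banana": 0.5, "chocolate": 0.8}  # indifference probs

u = {ordered[0]: 0.0, ordered[-1]: 1.0}  # u(S_(1)) = 0, u(S_(n)) = 1
u.update(p)                              # u(S_(i)) = indifference probability

# Unique up to a positive linear transformation: v = a*u + b with a > 0
# ranks every gamble exactly as u does.
a, b = 10.0, -2.0
v = {s: a * us + b for s, us in u.items()}

eu = lambda g, w: sum(q * w[s] for s, q in g.items())  # expected utility
g1 = {"fruit": 0.5, "cheese": 0.5}
g2 = {"carrot": 1.0}
print(eu(g1, u) > eu(g2, u), eu(g1, v) > eu(g2, v))    # True True
```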

4.2 Statistical decision theory

In statistical decision theory we consider inference about a parameter \(\theta\) as a decision problem. For any parameter value \(\theta \in \Theta\), where \(\Theta\) is the parameter space, and decision \(d \in \mathcal{D}\), where \(\mathcal{D}\) is the decision space, we have \(u(\theta, d)\), the utility of choosing \(d\) when \(\theta\) is the true value. We shall define loss to be \[\begin{eqnarray} L(\theta, d) & = & - u(\theta, d). \tag{4.1} \end{eqnarray}\] The three main types of inference about \(\theta\) are:

  1. point estimation,
  2. set estimation,
  3. hypothesis testing.

It is a great conceptual and practical simplification that statistical decision theory distinguishes between these three types simply according to their decision spaces, which are: \[\begin{eqnarray*} \begin{array}{c|c} \mbox{Type of inference} & \mbox{Decision space $\mathcal{D}$} \\ \hline \mbox{Point estimation} & \mbox{The parameter space, $\Theta$.} \\ \mbox{Set estimation} & \mbox{A set of subsets of $\Theta$.} \\ \mbox{Hypothesis testing} & \mbox{A specified partition of $\Theta$, denoted $\mathcal{H}$.} \end{array} \end{eqnarray*}\] In examples, we’ll typically focus on point estimation and so specify a single value \(d\) as an estimate of \(\theta\).

A statistical decision problem has a number of ingredients.

  1. The possible values of the parameter: \(\Theta\), the parameter space.

  2. The set of possible decisions: \(\mathcal{D}\), the decision space.

  3. The probability distribution on \(\Theta\), \(\pi(\theta)\). For example,

    1. this could be a prior distribution, \(\pi(\theta) = f(\theta)\).

    2. this could be a posterior distribution, \(\pi(\theta) = f(\theta \, | \, x)\) following the receipt of some data \(x\).

    3. this could be a posterior distribution \(\pi(\theta) = f(\theta \, | \, x, y)\) following the receipt of some data \(x, y\).

  4. The loss function \(L(\theta, d)\).

From (4.1), the decision which maximises the expected utility is the one which minimises the expected loss. Thus, we choose \(d\) to minimise \[\begin{eqnarray} \rho(\pi, d) & = & \int_{\theta} L(\theta, d)\pi(\theta) \, d\theta \tag{4.2} \end{eqnarray}\] the risk of \(d\) under \(\pi(\theta)\). The decision problem is completely specified by \([\Theta, \mathcal{D}, \pi(\theta), L(\theta, d)]\).

Definition 4.2 (Bayes rule and Bayes risk) The Bayes risk \(\rho^{*}(\pi)\) is the infimum of the expected loss, \[\begin{eqnarray} \rho^{*}(\pi) & = & \inf_{d \in \mathcal{D}} \rho(\pi, d), \nonumber \end{eqnarray}\] where the risk is computed with respect to \(\pi(\theta)\). A decision \(d^{*} \in \mathcal{D}\) for which \(\rho(\pi, d^{*}) = \rho^{*}(\pi)\) is a Bayes (decision) rule against \(\pi(\theta)\).

The Bayes rule may not be unique and, in some pathological cases, it may not exist. Typically, we solve \([\Theta, \mathcal{D}, \pi(\theta), L(\theta, d)]\) by finding \(\rho^{*}(\pi)\) and (at least one) \(d^{*}\).
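
When no closed form is available, a decision problem \([\Theta, \mathcal{D}, \pi(\theta), L(\theta, d)]\) can be solved numerically. Here is a minimal Python sketch, assuming for illustration a \(Gamma(2, 3)\) prior and absolute loss \(L(\theta, d) = |\theta - d|\), for which the Bayes rule is (a standard result) the median of \(\pi(\theta)\):

```python
import numpy as np
from scipy import stats

# A minimal numerical sketch (assumed ingredients: Gamma(2, 3) prior,
# absolute loss): approximate the risk integral (4.2) on a grid over
# Theta and minimise over a grid of decisions d.
alpha, beta = 2.0, 3.0
theta = np.linspace(1e-4, 10.0, 4001)            # grid over Theta
dtheta = theta[1] - theta[0]
pi = stats.gamma.pdf(theta, a=alpha, scale=1.0 / beta)

def risk(d):
    """rho(pi, d), eq. (4.2), under L(theta, d) = |theta - d|."""
    return np.sum(np.abs(theta - d) * pi) * dtheta

ds = np.linspace(0.0, 5.0, 2001)                 # grid over D
risks = np.array([risk(d) for d in ds])
print("Bayes rule ~", ds[np.argmin(risks)])      # ~ median of pi
print("exact median:", stats.gamma.ppf(0.5, a=alpha, scale=1.0 / beta))
```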

Example 4.3 Quadratic Loss. We consider the loss function \[\begin{eqnarray} L(\theta, d) & = & (\theta - d)^{2}. \nonumber \end{eqnarray}\] From (4.2), the risk of decision \(d\) is \[\begin{eqnarray} \rho(\pi, d) & = & E\{L(\theta, d) \, | \, \theta \sim \pi(\theta)\} \nonumber \\ & = & E_{(\pi)}\{(\theta - d)^{2}\} \nonumber \\ & = & E_{(\pi)}(\theta^{2}) - 2dE_{(\pi)}(\theta) + d^{2}, \nonumber \end{eqnarray}\] where \(E_{(\pi)}(\cdot)\) denotes the expectation computed using the distribution \(\pi(\theta)\). Differentiating with respect to \(d\) we have \[\begin{eqnarray} \frac{\partial}{\partial d} \rho(\pi, d) & = & -2E_{(\pi)}(\theta) + 2d. \nonumber \end{eqnarray}\] Equating this to zero gives the Bayes rule \(d^{*} = E_{(\pi)}(\theta)\); since the second derivative is \(2 > 0\), this is indeed a minimum. The corresponding Bayes risk is \[\begin{eqnarray} \rho^{*}(\pi) \ = \ \rho(\pi, d^{*}) & = & E_{(\pi)}(\theta^{2}) - 2d^{*}E_{(\pi)}(\theta) + (d^{*})^{2} \nonumber \\ & = & E_{(\pi)}(\theta^{2}) - 2E_{(\pi)}^{2}(\theta) + E_{(\pi)}^{2}(\theta) \nonumber \\ & = & E_{(\pi)}(\theta^{2}) - E_{(\pi)}^{2}(\theta) \nonumber\\ & = & Var_{(\pi)}(\theta) \nonumber \end{eqnarray}\] where \(Var_{(\pi)}(\theta)\) is the variance of \(\theta\) computed using the distribution \(\pi(\theta)\).

  1. If \(\pi(\theta) = f(\theta)\), a prior for \(\theta\), then the Bayes rule of an immediate decision is \(d^{*} = E(\theta)\) with corresponding Bayes risk \(\rho^{*} = Var(\theta)\).

  2. If we observe sample data \(x\) then the Bayes rule given this sample information is \(d^{*} = E(\theta \, | \, x)\) with corresponding Bayes risk \(\rho^{*} = Var(\theta \, | \, x)\) as \(\pi(\theta) = f(\theta \, | \, x)\).
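
A quick Monte Carlo check of these conclusions, assuming for illustration a \(Beta(2, 5)\) prior for \(\theta\): the risk should be minimised at \(d^{*} = E(\theta) = 2/7\), with minimum value \(Var(\theta) = 10/392\).

```python
import numpy as np

# A Monte Carlo check of Example 4.3 under an assumed Beta(2, 5) prior:
# the Bayes rule should be d* = E(theta) and the Bayes risk Var(theta).
rng = np.random.default_rng(0)
theta = rng.beta(2.0, 5.0, size=200_000)   # draws from the prior pi(theta)

def mc_risk(d):
    """Monte Carlo estimate of the risk (4.2) under quadratic loss."""
    return np.mean((theta - d) ** 2)

d_star = 2.0 / 7.0                                # analytic prior mean
print(mc_risk(d_star))                            # ~ Var(theta) ~ 0.0255
print(mc_risk(d_star) < mc_risk(d_star + 0.05),   # True: nearby decisions
      mc_risk(d_star) < mc_risk(d_star - 0.05))   # True: have higher risk
```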

Typically we can solve \([\Theta, \mathcal{D}, f(\theta), L(\theta, d)]\), the immediate decision problem, and \([\Theta, \mathcal{D}, f(\theta \, | \, x), L(\theta, d)]\), the decision problem after sample information. Often, we may be interested in the risk of the sampling procedure, before observing the sample, in order to decide whether or not to sample. For each possible sample, we need to specify which decision to make. We have a decision function \[\begin{eqnarray} \delta : X \rightarrow \mathcal{D} \nonumber \end{eqnarray}\] where \(X\) is the data or sample information. Let \(\Delta\) be the collection of all decision functions, so \(\delta \in \Delta \Rightarrow \delta(x) \in \mathcal{D} \ \forall x \in X\). The risk of decision function \(\delta\) is \[\begin{eqnarray} \rho(f(\theta), \delta) & = & \int_{x}\int_{\theta} L(\theta, \delta(x))f(\theta, x) \, d\theta dx \nonumber \\ & = & \int_{x}\int_{\theta} L(\theta, \delta(x))f(\theta \, | \, x)f(x) \, d\theta dx \nonumber \\ & = & \int_{x} \left\{\int_{\theta} L(\theta, \delta(x))f(\theta \, | \, x) \, d\theta \right\} f(x) \, dx \nonumber \\ & = & \int_{x} E\{L(\theta, \delta(x)) \, | \, X\} f(x) \, dx \tag{4.3} \end{eqnarray}\] where, from (4.2), \(E\{L(\theta, \delta(x)) \, | \, X\} = \rho(f(\theta \, | \, x), \delta(x))\), the posterior risk. We want to find the Bayes decision function \(\delta^{*}\) for which \[\begin{eqnarray} \rho(f(\theta), \delta^{*}) & = & \inf_{\delta \in \Delta} \rho(f(\theta), \delta). \nonumber \end{eqnarray}\] From (4.3), as \(f(x) \geq 0\), \(\delta^{*}\) may equivalently be found by choosing \(\delta^{*}(x)\), for each \(x\), to minimise the posterior risk, \[\begin{eqnarray} E\{L(\theta, \delta^{*}(x)) \, | \, X\} & = & \inf_{d \in \mathcal{D}} E\{L(\theta, d) \, | \, X\}. \tag{4.4} \end{eqnarray}\] The corresponding risk of the sampling procedure is \[\begin{eqnarray} \rho^{*}_{n} & = & E[E\{L(\theta, \delta^{*}(x)) \, | \, X\}]. \tag{4.5} \end{eqnarray}\] So, from (4.4), the Bayes decision function is the Bayes rule of the decision problem \([\Theta, \mathcal{D}, f(\theta \, | \, x), L(\theta, d)]\) considered as a function of (random) \(x\) whilst, from (4.5), the Bayes risk of the sampling procedure is the expected value of the Bayes risk of that problem, again considered as a function of (random) \(x\).
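
A sketch of this preposterior calculation for an assumed conjugate model: a \(Beta(a, b)\) prior with \(X \, | \, \theta \sim Binomial(n, \theta)\) and quadratic loss, so that the posterior risk is \(Var(\theta \, | \, x)\). We simulate \(X\) from its marginal distribution and average the posterior risk, as in (4.5):

```python
import numpy as np

# A sketch of preposterior analysis for an assumed conjugate model:
# theta ~ Beta(a, b), X | theta ~ Binomial(n, theta), quadratic loss, so
# the posterior risk is Var(theta | x) and (4.5) averages it over X.
rng = np.random.default_rng(1)
a, b, n = 2.0, 3.0, 10

def posterior_var(x):
    """Var(theta | x) for the conjugate Beta(a + x, b + n - x) posterior."""
    a1, b1 = a + x, b + n - x
    return a1 * b1 / ((a1 + b1) ** 2 * (a1 + b1 + 1.0))

theta = rng.beta(a, b, size=100_000)    # theta ~ prior
x = rng.binomial(n, theta)              # X ~ marginal (prior predictive)
rho_n = posterior_var(x).mean()         # Bayes risk of sampling, eq. (4.5)

prior_var = a * b / ((a + b) ** 2 * (a + b + 1.0))
print(rho_n, "<", prior_var)            # sampling lowers the expected risk
```

The printed inequality reflects the tower property of variances: the expected posterior variance cannot exceed the prior variance, so here sampling is worthwhile before any cost of sampling is accounted for.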

Example 4.4 Suppose that we wish to estimate the parameter, \(\theta\), of a Poisson distribution. Our prior for \(\theta\) is \(Gamma(\alpha, \beta)\). The loss function, for estimate \(d\) and value \(\theta\), is \[\begin{eqnarray} L(\theta, d) & = & \theta(\theta-d)^{2}. \nonumber \end{eqnarray}\]

  1. Find the Bayes rule and Bayes risk of an immediate decision.

  2. Find the Bayes rule and Bayes risk if we take a sample of size \(n\).

  3. Find the Bayes risk of the sampling procedure.

We consider the decision problem \([\Theta, \mathcal{D}, \pi(\theta), L(\theta, d)]\). Relative to distribution \(\pi\) the expected loss is \[\begin{eqnarray} E_{(\pi)}\{L(\theta, d)\} & = & E_{(\pi)}\{\theta(\theta-d)^{2}\} \nonumber \\ & = & E_{(\pi)}(\theta^{3} - 2d\theta^{2} + d^{2}\theta) \nonumber \\ & = & E_{(\pi)}(\theta^{3}) - 2dE_{(\pi)}(\theta^{2}) + d^{2}E_{(\pi)}(\theta). \nonumber \end{eqnarray}\] Differentiating with respect to \(d\) we find \[\begin{eqnarray} \frac{\partial}{\partial d} E_{(\pi)}\{L(\theta, d)\} & = & - 2E_{(\pi)}(\theta^{2}) + 2dE_{(\pi)}(\theta) \nonumber \end{eqnarray}\] so that, equating this derivative to zero, the Bayes rule is \[\begin{eqnarray} d^{*} & = & \frac{E_{(\pi)}(\theta^{2})}{E_{(\pi)}(\theta)} \tag{4.6} \end{eqnarray}\] with corresponding Bayes risk \[\begin{eqnarray} \rho^{*}(\pi) & = & E_{(\pi)}(\theta^{3}) - 2d^{*}E_{(\pi)}(\theta^{2}) + (d^{*})^{2}E_{(\pi)}(\theta) \nonumber \\ & = & E_{(\pi)}(\theta^{3}) - 2\frac{E_{(\pi)}(\theta^{2})}{E_{(\pi)}(\theta)}E_{(\pi)}(\theta^{2}) + \left\{\frac{E_{(\pi)}(\theta^{2})}{E_{(\pi)}(\theta)}\right\}^{2}E_{(\pi)}(\theta) \nonumber \\ & = & E_{(\pi)}(\theta^{3}) - \frac{E_{(\pi)}^{2}(\theta^{2})}{E_{(\pi)}(\theta)}. \tag{4.7} \end{eqnarray}\] We now consider the immediate decision by solving the decision problem \([\Theta, \mathcal{D}, f(\theta), L(\theta, d)]\). As \(\theta \sim Gamma(\alpha, \beta)\) then \[\begin{eqnarray} E(\theta^{k}) & = & \int_{0}^{\infty} \theta^{k} \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, \theta^{\alpha-1}e^{-\beta \theta} \, d\theta \nonumber \\ & = & \frac{\Gamma(\alpha + k)}{\Gamma(\alpha)}\times \frac{\beta^{\alpha}}{\beta^{\alpha +k}} \int_{0}^{\infty} \frac{\beta^{\alpha+k}}{\Gamma(\alpha+k)}\, \theta^{\alpha+k-1}e^{-\beta \theta} \, d\theta \tag{4.8} \\ & = & \frac{\Gamma(\alpha + k)}{\Gamma(\alpha)}\times \frac{\beta^{\alpha}}{\beta^{\alpha +k}} \tag{4.9} \end{eqnarray}\] provided that \(\alpha + k > 0\), so that the integral in (4.8) is of the density of a \(Gamma(\alpha + k, \beta)\) distribution and hence equal to one. Now, from (4.9), for \(k = 1, 2, 3\), \[\begin{eqnarray} E(\theta) & = & \frac{\Gamma(\alpha + 1)\beta^{\alpha}}{\Gamma(\alpha)\beta^{\alpha+1}} \ = \ \frac{\alpha \Gamma(\alpha)}{\beta \Gamma(\alpha)} \ = \ \frac{\alpha}{\beta} \tag{4.10} \\ E(\theta^{2}) & = & \frac{\Gamma(\alpha + 2)\beta^{\alpha}}{\Gamma(\alpha)\beta^{\alpha+2}} \ = \ \frac{(\alpha+1)\alpha \Gamma(\alpha)}{\beta^{2} \Gamma(\alpha)} \ = \ \frac{(\alpha+1)\alpha}{\beta^{2}} \tag{4.11} \\ E(\theta^{3}) & = & \frac{\Gamma(\alpha + 3)\beta^{\alpha}}{\Gamma(\alpha)\beta^{\alpha+3}} \ = \ \frac{(\alpha+2)(\alpha+1)\alpha \Gamma(\alpha)}{\beta^{3} \Gamma(\alpha)} \ = \ \frac{(\alpha+2)(\alpha+1)\alpha}{\beta^{3}}. \tag{4.12} \end{eqnarray}\] Substituting (4.10) and (4.11) into (4.6), the Bayes rule of the immediate decision is \[\begin{eqnarray} d^{*} & = & \frac{(\alpha+1)\alpha}{\beta^{2}} \times \frac{\beta}{\alpha} \ = \ \frac{\alpha +1}{\beta}. \tag{4.13} \end{eqnarray}\] Substituting (4.10) - (4.12) into (4.7), the Bayes risk of the immediate decision is \[\begin{eqnarray} \rho^{*}(f(\theta)) & = & \frac{(\alpha+2)(\alpha+1)\alpha}{\beta^{3}} - \frac{(\alpha+1)\alpha}{\beta^{2}} \times \frac{\alpha +1}{\beta} \ = \ \frac{\alpha(\alpha+1)}{\beta^{3}}. \tag{4.14} \end{eqnarray}\] We now consider the problem after observing a sample of size \(n\) by solving \([\Theta, \mathcal{D}, f(\theta \, | \, x), L(\theta, d)]\) where \(x = (x_{1}, \ldots, x_{n})\).
As \(\theta \sim Gamma(\alpha, \beta)\) and \(X_{i} \, | \, \theta \sim Po(\theta)\) then \(\theta \, | \, x \sim Gamma(\alpha + \sum_{i=1}^{n} x_{i}, \beta + n)\) (see, for example, Question Sheet Three Exercise 1.(b)).

We can exploit this conjugacy to observe that the Bayes rule and Bayes risk after sampling can be found by substituting \(\alpha + \sum_{i=1}^{n} x_{i}\) for \(\alpha\) and \(\beta + n\) for \(\beta\) in (4.13) and (4.14) to obtain \[\begin{eqnarray} d^{*} & = & \frac{\alpha + \sum_{i=1}^{n} x_{i}+1}{\beta+n} \tag{4.15} \end{eqnarray}\] as the Bayes rule after observing a sample of size \(n\), with corresponding Bayes risk \[\begin{eqnarray} \rho^{*}(f(\theta \, | \, x)) & = & \frac{(\alpha + \sum_{i=1}^{n} x_{i})(\alpha+\sum_{i=1}^{n} x_{i} + 1)}{(\beta+n)^{3}}. \tag{4.16} \end{eqnarray}\] We now consider the risk of the sampling procedure. From (4.4) the Bayes decision function is (4.15) viewed as a random variable, that is \[\begin{eqnarray} \delta^{*} & = & \frac{\alpha + \sum_{i=1}^{n} X_{i}+1}{\beta+n}. \nonumber \end{eqnarray}\] From (4.5), the risk of the sampling procedure, \(\rho^{*}_{n}\), is the expected value of (4.16) when viewed as a random variable, \[\begin{eqnarray} \rho^{*}_{n} & = & E\left\{\frac{(\alpha + \sum_{i=1}^{n} X_{i})(\alpha+\sum_{i=1}^{n} X_{i} + 1)}{(\beta+n)^{3}}\right\} \nonumber\\ & = & \frac{\alpha(\alpha+1)+(2\alpha+1)E(\sum_{i=1}^{n} X_{i})+ E\{(\sum_{i=1}^{n} X_{i})^{2}\}}{(\beta+n)^{3}}. \tag{4.17} \end{eqnarray}\] Now we utilise the tower property of expectations to find \(E(\sum_{i=1}^{n} X_{i})\) and \(E\{(\sum_{i=1}^{n} X_{i})^{2}\}\); Question Sheet One Exercise 5 gives further details of the tower property.

We have that \[\begin{eqnarray} E\left(\sum_{i=1}^{n} X_{i}\right) & = & E\left.\left\{E\left(\sum_{i=1}^{n} X_{i} \, \right| \, \theta\right)\right\} \nonumber \\ & = & E\left\{\sum_{i=1}^{n} E(X_{i} \, | \, \theta)\right\} \nonumber \\ & = & \sum_{i=1}^{n}E(\theta) \ = \ \frac{n\alpha}{\beta}; \tag{4.18} \\ E\left\{\left(\sum_{i=1}^{n} X_{i}\right)^{2}\right\} & = & E\left.\left[E\left\{\left(\sum_{i=1}^{n} X_{i}\right)^{2} \, \right| \, \theta\right\}\right] \nonumber \\ & = & E\left.\left\{Var\left(\sum_{i=1}^{n} X_{i} \, \right| \, \theta\right) + E^{2}\left.\left(\sum_{i=1}^{n} X_{i} \, \right| \, \theta\right)\right\} \nonumber \\ & = & E(n \theta + n^{2}\theta^{2}) \ = \ \frac{n\alpha}{\beta} + \frac{n^{2}\alpha(\alpha+1)}{\beta^{2}}. \tag{4.19} \end{eqnarray}\] Notice that we have exploited the independence of the \(X_{i}\) given \(\theta\) and that \(X_{i} \, | \, \theta \sim Po(\theta)\) with \(\theta \sim Gamma(\alpha, \beta)\). A slightly quicker, though less general, approach is to note that \(\sum_{i=1}^{n} X_{i} \, | \, \theta \sim Po(n\theta)\). Substituting (4.18) and (4.19) into (4.17) gives \[\begin{eqnarray} \rho^{*}_{n} & = & \frac{1}{(\beta+n)^{3}}\left\{\alpha(\alpha+1)+(2\alpha+1)\frac{n\alpha}{\beta} + \frac{n\alpha}{\beta} + \frac{n^{2}\alpha(\alpha+1)}{\beta^{2}}\right\} \nonumber \\ & = & \frac{\alpha(\alpha+1)}{\beta^{2}(\beta+n)^{3}}\left\{\beta^{2} + 2\beta n + n^{2}\right\} \nonumber \\ & = & \frac{\alpha(\alpha+1)}{\beta^{2}(\beta+n)}. \nonumber \end{eqnarray}\] Notice that when \(n=0\), \(\rho^{*}_{n=0} = \frac{\alpha(\alpha+1)}{\beta^{3}} = \rho^{*}(f(\theta))\), the Bayes risk of the immediate decision. As \(n\) increases, \(\rho^{*}_{n}\) decreases towards zero: the more we sample, the smaller the risk.
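
Finally, a Monte Carlo check of the whole calculation, with illustrative values \(\alpha = 2\), \(\beta = 3\) and \(n = 5\): simulate \(\theta\) from the prior, draw \(\sum_{i=1}^{n} X_{i} \, | \, \theta \sim Po(n\theta)\), and average the posterior risk (4.16), which should recover \(\rho^{*}_{n} = \alpha(\alpha+1)/\{\beta^{2}(\beta+n)\}\):

```python
import numpy as np

# A Monte Carlo check of Example 4.4 with assumed values alpha=2, beta=3,
# n=5: the risk of the sampling procedure should be
# alpha(alpha+1) / (beta^2 (beta+n)).
rng = np.random.default_rng(2)
alpha, beta, n, reps = 2.0, 3.0, 5, 200_000

theta = rng.gamma(alpha, 1.0 / beta, size=reps)  # theta ~ Gamma(alpha, beta)
s = rng.poisson(n * theta)                       # sum_i X_i | theta ~ Po(n theta)

# Posterior risk (4.16) for each simulated sample, averaged as in (4.5).
post_risk = (alpha + s) * (alpha + s + 1.0) / (beta + n) ** 3
print(post_risk.mean())                                  # Monte Carlo estimate
print(alpha * (alpha + 1.0) / (beta ** 2 * (beta + n)))  # exact: 6/72 ~ 0.0833
```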