Chapter 3 Statistical Decision Theory

3.1 Introduction

The basic premise of Statistical Decision Theory is that we want to make inferences about the parameter of a family of distributions in the statistical model \[\begin{eqnarray*} \mathcal{E} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\} , \end{eqnarray*}\] typically following observation of sample data, or information, \(x\). We would like to understand how to construct the “\(\mbox{Ev}\)” function from Chapter 2, in such a way that it reflects our needs, which will vary from application to application, and which assesses the consequences of making a good or bad inference.

The set of possible inferences, or decisions, is termed the decision space, denoted \(\mathcal{D}\). For each \(d \in \mathcal{D}\), we want a way to assess how good or bad the choice of decision \(d\) is when the parameter takes the value \(\theta\).

Definition 3.1 (Loss function) A loss function is any function \(L\) from \(\Theta \times \mathcal{D}\) to \([0, \infty)\).

The loss function measures the penalty, or error, \(L(\theta, d)\), of making decision \(d\) when the parameter takes the value \(\theta\). Thus, larger values indicate worse consequences.

The three main types of inference about \(\theta\) are:

  1. point estimation,
  2. set estimation,
  3. hypothesis testing.

It is a great conceptual and practical simplification that Statistical Decision Theory distinguishes between these three types simply according to their decision spaces, which are:

| Type of inference | Decision space \(\mathcal{D}\) |
|-------------------|--------------------------------|
| Point estimation | The parameter space, \(\Theta\). See Section 3.4. |
| Set estimation | A set of subsets of \(\Theta\). See Section 3.6. |
| Hypothesis testing | A specified partition of \(\Theta\), denoted \(\mathcal{H}\). See Section 3.7. |

3.2 Bayesian statistical decision theory

In a Bayesian approach, a statistical decision problem \([\Theta, \mathcal{D}, \pi(\theta), L(\theta, d)]\) has the following ingredients.

  1. The possible values of the parameter: \(\Theta\), the parameter space.

  2. The set of possible decisions: \(\mathcal{D}\), the decision space.

  3. The probability distribution on \(\Theta\), \(\pi(\theta)\). For example,

    1. this could be a prior distribution, \(\pi(\theta) = f(\theta)\).

    2. this could be a posterior distribution, \(\pi(\theta) = f(\theta \, | \, x)\) following the receipt of some data \(x\).

    3. this could be a posterior distribution \(\pi(\theta) = f(\theta \, | \, x, y)\) following the receipt of some data \(x, y\).

  4. The loss function \(L(\theta, d)\).

In this setting, only \(\theta\) is random and we can calculate the expected loss, or risk.

Definition 3.2 (Risk) The risk of decision \(d \in \mathcal{D}\) under the distribution \(\pi(\theta)\) is \[\begin{eqnarray} \rho(\pi(\theta), d) & = & \int_{\theta} L(\theta, d)\pi(\theta) \, d\theta. \tag{3.1} \end{eqnarray}\]

We choose \(d\) to minimise this risk.

Definition 3.3 (Bayes rule and Bayes risk) The Bayes risk \(\rho^{*}(\pi)\) is the minimum expected loss, \[\begin{eqnarray*} \rho^{*}(\pi) & = & \inf_{d \in \mathcal{D}} \rho(\pi, d). \end{eqnarray*}\] A decision \(d^{*} \in \mathcal{D}\) for which \(\rho(\pi, d^{*}) = \rho^{*}(\pi)\) is a Bayes rule against \(\pi(\theta)\).

The Bayes rule may not be unique, and in pathological cases it may not exist. Typically, we solve \([\Theta, \mathcal{D}, \pi(\theta), L(\theta, d)]\) by finding \(\rho^{*}(\pi)\) and (at least one) \(d^{*}\).

Example 3.1 (Quadratic Loss) Suppose that \(\Theta \subset \mathbb{R}\). We consider the loss function \[\begin{eqnarray*} L(\theta, d) & = & (\theta - d)^{2}. \end{eqnarray*}\] From (3.1), the risk of decision \(d\) is \[\begin{eqnarray*} \rho(\pi, d) & = & \mathbb{E}\{L(\theta, d) \, | \, \theta \sim \pi(\theta)\} \\ & = & \mathbb{E}_{(\pi)}\{(\theta - d)^{2}\} \\ & = & \mathbb{E}_{(\pi)}(\theta^{2}) - 2d\mathbb{E}_{(\pi)}(\theta) + d^{2}, \end{eqnarray*}\] where \(\mathbb{E}_{(\pi)}(\cdot)\) is a notational device to denote the expectation computed using the distribution \(\pi(\theta)\). Differentiating with respect to \(d\), we have \[\begin{eqnarray*} \frac{\partial}{\partial d} \rho(\pi, d) & = & -2\mathbb{E}_{(\pi)}(\theta) + 2d. \end{eqnarray*}\] Setting this equal to zero, and noting that the second derivative is \(2 > 0\), the Bayes rule is \(d^{*} = \mathbb{E}_{(\pi)}(\theta)\). The corresponding Bayes risk is \[\begin{eqnarray*} \rho^{*}(\pi) \ = \ \rho(\pi, d^{*}) & = & \mathbb{E}_{(\pi)}(\theta^{2}) - 2d^{*}\mathbb{E}_{(\pi)}(\theta) + (d^{*})^{2} \\ & = & \mathbb{E}_{(\pi)}(\theta^{2}) - 2\mathbb{E}_{(\pi)}^{2}(\theta) + \mathbb{E}_{(\pi)}^{2}(\theta) \\ & = & \mathbb{E}_{(\pi)}(\theta^{2}) - \mathbb{E}_{(\pi)}^{2}(\theta) \\ & = & Var_{(\pi)}(\theta), \end{eqnarray*}\] where \(Var_{(\pi)}(\theta)\) is the variance of \(\theta\) computed using the distribution \(\pi(\theta)\).

  1. If \(\pi(\theta) = f(\theta)\), a prior for \(\theta\), then the Bayes rule of an immediate decision is \(d^{*} = \mathbb{E}(\theta)\) with corresponding Bayes risk \(\rho^{*} = Var(\theta)\).
  2. If we observe sample data \(x\), so that \(\pi(\theta) = f(\theta \, | \, x)\), then the Bayes rule given this sample information is \(d^{*} = \mathbb{E}(\theta \, | \, x)\) with corresponding Bayes risk \(\rho^{*} = Var(\theta \, | \, x)\).
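
As a quick numerical check of Example 3.1, the following sketch (with an arbitrary illustrative choice of \(\pi(\theta)\) as a Gamma distribution; nothing here depends on that choice) minimises a Monte Carlo estimate of the risk over a grid of decisions, and confirms that the minimiser is the mean and the minimum is the variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice of pi(theta): a Gamma(3, 1) distribution, so that
# E(theta) = 3 and Var(theta) = 3. Any distribution with finite variance works.
theta = rng.gamma(shape=3.0, scale=1.0, size=100_000)

# Monte Carlo estimate of the risk rho(pi, d) = E_pi[(theta - d)^2] on a grid.
grid = np.linspace(0.0, 8.0, 401)
risk = np.array([np.mean((theta - d) ** 2) for d in grid])

print(f"Bayes rule  d*  ~ {grid[np.argmin(risk)]:.2f}   (E(theta)   = {theta.mean():.2f})")
print(f"Bayes risk rho* ~ {risk.min():.2f}   (Var(theta) = {theta.var():.2f})")
```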

Typically we can solve \([\Theta, \mathcal{D}, f(\theta), L(\theta, d)]\), the immediate decision problem, and solve \([\Theta, \mathcal{D}\), \(f(\theta \, | \, x), L(\theta, d)]\), the decision problem after sample information. Often, we may be interested in the risk of the sampling procedure, before observing the sample, to decide whether or not to sample. For each possible sample, we need to specify which decision to make. This gives us the idea of a decision rule.

Definition 3.4 (Decision rule) A decision rule \(\delta(x)\) is a function from \(\mathcal{X}\) into \(\mathcal{D}\), \[\begin{eqnarray*} \delta : \mathcal{X} \rightarrow \mathcal{D}. \end{eqnarray*}\]
If \(X = x\) is the observed value of the sample information then \(\delta(x)\) is the decision that will be taken. The collection of all decision rules is denoted by \(\Delta\), so that \(\delta \in \Delta \Rightarrow \delta(x) \in \mathcal{D} \ \forall x \in \mathcal{X}\).

In this case, we wish to solve the problem \([\Theta, \Delta, f(\theta, x), L(\theta, \delta(x))]\). In analogy to Definition 3.3, we make the following definition.

Definition 3.5 (Bayes (decision) rule and risk of the sampling procedure) The decision rule \(\delta^{*}\) is a Bayes (decision) rule exactly when \[\begin{eqnarray} \mathbb{E}\{L(\theta, \delta^{*}(X))\} & \leq & \mathbb{E}\{L(\theta, \delta(X))\} \tag{3.2} \end{eqnarray}\] for all \(\delta \in \Delta\). The corresponding risk \(\rho^{*} = \mathbb{E}\{L(\theta, \delta^{*}(X))\}\) is termed the risk of the sampling procedure.

If the sample information consists of \(X = (X_{1}, \ldots, X_{n})\) then \(\rho^{*}\) will be a function of \(n\) and so can be used to help determine sample size choice.
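
For instance, in the Normal model of Example 3.2 below with quadratic loss, the posterior variance, and hence the posterior risk, does not depend on \(x\), so the risk of the sampling procedure is \(\rho^{*}(n) = (1/\sigma_{0}^{2} + n/\sigma^{2})^{-1}\). The short sketch below (illustrative parameter values only, which are assumptions) tabulates this against \(n\); note that \(n = 0\) recovers the Bayes risk of the immediate decision.

```python
# Sketch: risk of the sampling procedure for X_i | theta ~ N(theta, sigma2)
# with prior theta ~ N(mu0, s02) and quadratic loss. The posterior variance
# is free of x, so rho*(n) = (1/s02 + n/sigma2)^(-1). Values are illustrative.
sigma2, s02 = 4.0, 1.0

for n in [0, 1, 5, 10, 50, 100]:
    rho_star = 1.0 / (1.0 / s02 + n / sigma2)
    print(f"n = {n:3d}   risk of the sampling procedure = {rho_star:.4f}")
```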

We have the following theorem. Note that finiteness of \(\mathcal{D}\) ensures existence of a Bayes rule. Similar but more general results are possible, but they require more topological conditions to ensure a minimum occurs within \(\mathcal{D}\).

Theorem 3.1 (Bayes rule theorem, BRT) Suppose that a Bayes rule exists for \([\Theta, \mathcal{D}\), \(f(\theta \, | \, x), L(\theta, d)]\). Then \[\begin{eqnarray} \delta^*(x) = \mathop{\mbox{arg min}}_{d \in \mathcal{D}} \mathbb{E}(L(\theta, d) \, | \, X = x). \tag{3.3} \end{eqnarray}\]

Proof: Let \(\delta\) be arbitrary. Then \[\begin{eqnarray} \mathbb{E}\{L(\theta, \delta(X))\}& = & \int_{x}\int_{\theta} L(\theta, \delta(x))f(\theta, x) \, d\theta dx \nonumber \\ & = & \int_{x}\int_{\theta} L(\theta, \delta(x))f(\theta \, | \, x)f(x) \, d\theta dx \nonumber \\ & = & \int_{x} \left\{\int_{\theta} L(\theta, \delta(x))f(\theta \, | \, x) \, d\theta \right\} f(x) \, dx \nonumber \\ & = & \int_{x} \mathbb{E}\{L(\theta, \delta(x)) \, | \, X = x\} f(x) \, dx \tag{3.4} \end{eqnarray}\] where, from (3.1), \(\mathbb{E}\{L(\theta, \delta(x)) \, | \, X = x\} = \rho(f(\theta \, | \, x), \delta(x))\), the posterior risk. We want to find the Bayes decision function \(\delta^{*}\) for which \[\begin{eqnarray*} \mathbb{E}\{L(\theta, \delta^{*}(X))\} & = & \inf_{\delta \in \Delta} \mathbb{E}\{L(\theta, \delta(X))\}. \end{eqnarray*}\] From (3.4), as \(f(x) \geq 0\), \(\delta^{*}\) may equivalently be found by minimising the posterior risk pointwise in \(x\), \[\begin{eqnarray} \rho(f(\theta \, | \, x), \delta^{*}(x)) & = & \inf_{d \in \mathcal{D}} \mathbb{E}\{L(\theta, d) \, | \, X = x\}, \tag{3.5} \end{eqnarray}\] giving equation (3.3). \(\Box\)
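
Before commenting on this result, here is a minimal discrete sketch of it in action (all numbers invented for illustration): with finite \(\Theta\), \(\mathcal{X}\) and \(\mathcal{D}\), the Bayes rule is computed separately for each \(x\) by minimising the posterior expected loss.

```python
import numpy as np

# Minimal discrete illustration of the BRT (all numbers invented).
# Theta = {0, 1}, X in {0, 1, 2}, D = Theta with 0-1 loss.
prior = np.array([0.7, 0.3])            # pi(theta)
like = np.array([[0.6, 0.3, 0.1],       # f(x | theta = 0)
                 [0.1, 0.3, 0.6]])      # f(x | theta = 1)
loss = np.array([[0.0, 1.0],            # L(theta, d), rows theta, columns d
                 [1.0, 0.0]])

for x in range(3):
    post = prior * like[:, x]
    post /= post.sum()                  # f(theta | x)
    exp_loss = post @ loss              # E{L(theta, d) | X = x} for each d
    print(f"x = {x}: posterior = {post.round(3)}, Bayes decision d* = {exp_loss.argmin()}")
```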

This astounding result indicates that the minimisation of expected loss over the space of all functions from \(\mathcal{X}\) to \(\mathcal{D}\) can be achieved by the pointwise minimisation over \(\mathcal{D}\) of the expected loss conditional on \(X = x\). It converts an apparently intractable problem into a simple one. We could consider \(\Delta\), the set of decision rules, to be our possible set of inferences about \(\theta\) when the sample is observed so that \(\mbox{Ev}(\mathcal{E}, x)\) is \(\delta^*(x)\). We thus have the following result.

Theorem 3.2 The Bayes rule for the posterior decision respects the strong likelihood principle.

Proof: Consider two Bayesian models with the same prior distribution, \(\mathcal{E}_{B, 1} = \{\mathcal{X}_{1}, \Theta\), \(f_{X_{1}}(x_{1} \, | \, \theta), \pi(\theta)\}\) and \(\mathcal{E}_{B, 2} = \{\mathcal{X}_{2}, \Theta, f_{X_{2}}(x_{2} \, | \, \theta), \pi(\theta)\}\). As in (2.13), if \(f_{X_{1}}(x_{1} \, | \, \theta) = c(x_{1}, x_{2})f_{X_{2}}(x_{2} \, | \, \theta)\) then the corresponding posterior distributions \(\pi(\theta \, | \, x_{1})\) and \(\pi(\theta \, | \, x_{2})\) are the same, and so the corresponding Bayes rule (and risk) is the same. \(\Box\)

3.3 Admissible rules

Bayes rules rely upon a prior distribution for \(\theta\): the risk of Definition 3.2 averages over \(\theta\) and so is a function of \(d\) only. In classical statistics there is no distribution for \(\theta\), and so another approach is needed. This involves the classical risk.

Definition 3.6 (The classical risk) For a decision rule \(\delta (x)\), the classical risk for the model \(\mathcal{E} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\}\) is \[\begin{eqnarray*} R(\theta, \delta) & = & \int_{x} L(\theta, \delta(x))f_{X}(x \, | \, \theta) \, dx. \end{eqnarray*}\]

The classical risk is thus, for each \(\delta\), a function of \(\theta\).

Example 3.2 Let \(X = (X_{1}, \ldots, X_{n})\) where \(X_{i} \sim N(\theta, \sigma^{2})\) and \(\sigma^{2}\) is known. Suppose that \(L(\theta, d) = (\theta - d)^{2}\) and consider a conjugate prior \(\theta \sim N(\mu_{0}, \sigma^{2}_{0})\). Possible decision functions include:

  1. \(\delta_{1}(x) = \overline{x}\), the sample mean.
  2. \(\delta_{2}(x) = \mbox{med}\{x_{1}, \ldots, x_{n}\} = \tilde{x}\), the sample median.
  3. \(\delta_{3}(x) = \mu_{0}\), the prior mean.
  4. \(\delta_{4}(x) = \mu_{n}\), the posterior mean where \[\begin{eqnarray*} \mu_{n} & = & \left(\frac{1}{\sigma_{0}^{2}} + \frac{n}{\sigma^{2}}\right)^{-1}\left(\frac{\mu_{0}}{\sigma_{0}^{2}} + \frac{n\overline{x}}{\sigma^{2}}\right), \end{eqnarray*}\] the weighted average of the prior and sample mean accorded to their respective precisions.

The respective classical risks are

  1. \(R(\theta,\delta_{1}) = \frac{\sigma^{2}}{n}\), a constant for \(\theta\), since \(\overline{X} \sim N(\theta, \sigma^{2}/n)\).
  2. \(R(\theta,\delta_{2}) = \frac{\pi \sigma^{2}}{2n}\), a constant for \(\theta\), since \(\tilde{X} \sim N(\theta, \pi\sigma^{2}/2n)\) (approximately).
  3. \(R(\theta, \delta_{3}) = (\theta - \mu_{0})^{2} = \sigma^{2}_{0}\left(\frac{\theta - \mu_{0}}{\sigma_{0}}\right)^{2}\).
  4. \(R(\theta, \delta_{4}) = \left(\frac{1}{\sigma_{0}^{2}} + \frac{n}{\sigma^{2}}\right)^{-2}\left\{\frac{1}{\sigma_{0}^{2}}\left(\frac{\theta - \mu_{0}}{\sigma_{0}}\right)^{2} + \frac{n}{\sigma^{2}}\right\}\).

Which decision do we choose? We observe that \(R(\theta,\delta_{1}) < R(\theta,\delta_{2})\) for all \(\theta \in \Theta\), since \(\pi/2 > 1\), but the other comparisons depend upon \(\theta\).
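
The following sketch (with illustrative parameter values, which are assumptions rather than anything fixed by the example) evaluates the four classical risks on a grid of \(\theta\), making the \(\theta\)-dependence of the comparisons visible.

```python
import numpy as np

# Classical risks of Example 3.2 on a grid of theta (illustrative values).
mu0, s02, sigma2, n = 0.0, 1.0, 1.0, 10
theta = np.linspace(-3.0, 3.0, 7)

R1 = np.full_like(theta, sigma2 / n)                         # sample mean
R2 = np.full_like(theta, np.pi * sigma2 / (2 * n))           # sample median (approx.)
R3 = (theta - mu0) ** 2                                      # prior mean
w = 1.0 / s02 + n / sigma2
R4 = w ** -2 * ((theta - mu0) ** 2 / s02 ** 2 + n / sigma2)  # posterior mean

for t, r1, r2, r3, r4 in zip(theta, R1, R2, R3, R4):
    print(f"theta = {t:+.1f}:  R1 = {r1:.3f}  R2 = {r2:.3f}  R3 = {r3:.3f}  R4 = {r4:.3f}")
```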

The accepted approach for classical statisticians is to narrow the set of possible decision rules by ruling out those that are obviously bad.

Definition 3.7 (Admissible decision rule) A decision rule \(\delta_{0}\) is inadmissible if there exists a decision rule \(\delta_{1}\) which dominates it, that is \[\begin{eqnarray*} R(\theta, \delta_{1}) & \leq & R(\theta, \delta_{0}) \end{eqnarray*}\] for all \(\theta \in \Theta\), with \(R(\theta_{0}, \delta_{1}) < R(\theta_{0}, \delta_{0})\) for at least one value \(\theta_{0} \in \Theta\). If no such \(\delta_{1}\) exists then \(\delta_{0}\) is admissible.

If \(\delta_{0}\) is dominated by \(\delta_{1}\) then the classical risk of \(\delta_{0}\) is never smaller than that of \(\delta_{1}\), and is strictly larger at \(\theta_{0}\). Thus, you would never want to use \(\delta_{0}\).

Note: here I am assuming that all other considerations are the same in the two cases: e.g. for all \(x \in \mathcal{X}\), \(\delta_{1}(x)\) and \(\delta_{0}(x)\) take about the same amount of resource to compute.

Hence, the accepted approach is to reduce the set of possible decision rules under consideration by only using admissible rules. It is hard to disagree with this approach, although one wonders how big the set of admissible rules will be, and how easy it is to enumerate the set of admissible rules in order to choose between them. It turns out that admissible rules can be related to a Bayes rule \(\delta^{*}\) for a prior distribution \(\pi(\theta)\) (as given by Definition 3.5).

Theorem 3.3 If a prior distribution \(\pi(\theta)\) is strictly positive on \(\Theta\) with finite Bayes risk, and the classical risk \(R(\theta, \delta)\) is a continuous function of \(\theta\) for all \(\delta\), then the Bayes rule \(\delta^{*}\) is admissible.

Proof: We follow p75 of Robert (2007). Suppose that \(\delta^{*}\) is inadmissible and dominated by \(\delta_{1}\): by continuity of the classical risks, there is an open set \(C \subset \Theta\) on which \(R(\theta, \delta_{1}) < R(\theta, \delta^{*})\), with \(R(\theta, \delta_{1}) \leq R(\theta, \delta^{*})\) elsewhere. In an analogous way to the proof of Theorem 3.1, but now writing \(f(\theta, x) = f_{X}(x \, | \, \theta)\pi(\theta)\), for any decision rule \(\delta\), \[\begin{eqnarray*} \mathbb{E}\{L(\theta, \delta(X))\} \ = \ \int_{\Theta} R(\theta, \delta)\pi(\theta) \, d\theta. \end{eqnarray*}\] As \(\pi(\theta)\) is strictly positive, the set \(C\) has positive prior probability, so \(\mathbb{E}\{L(\theta, \delta_{1}(X))\} < \mathbb{E}\{L(\theta, \delta^{*}(X))\}\), contradicting \(\delta^{*}\) being the Bayes rule. \(\Box\)

The relationship between a Bayes rule with prior \(\pi(\theta)\) and an admissible decision rule is even stronger and described in the following very beautiful result, originally due to an iconic figure in Statistics, Abraham Wald (1902-1950).

Theorem 3.4 (Wald's Complete Class Theorem, CCT) In the case where the parameter space \(\Theta\) and sample space \(\mathcal{X}\) are finite, a decision rule \(\delta\) is admissible if and only if it is a Bayes rule for some prior distribution \(\pi(\theta)\) with strictly positive values.

An illuminating blackboard proof of this result can be found in Section 11.6 of Cox and Hinkley (1974). There are generalisations of this theorem to non-finite decision sets, parameter spaces, and sample spaces, but the results are highly technical. See Chapter 3 of Schervish (1995), Chapters 4 and 8 of J. O. Berger (1985), and Chapter 2 of Ghosh and Meeden (1997) for more details and references to the original literature. In the rest of this section we will assume the more general result, which holds for practical purposes: a decision rule is admissible if and only if it is a Bayes rule for some prior distribution \(\pi(\theta)\).

So what does the CCT say? First of all, admissible decision rules respect the SLP. This follows from the fact that admissible rules are Bayes rules which respect the SLP: see Theorem 3.2. Insofar as we think respecting the SLP is a good thing, this provides support for using admissible decision rules, because we cannot be certain that inadmissible rules respect the SLP. Second, if you select a Bayes rule according to some positive prior distribution \(\pi(\theta)\) then you cannot ever choose an inadmissible decision rule. So the CCT states that there is a very simple way to protect yourself from choosing an inadmissible decision rule.

But here is where you must pay close attention to logic. Suppose that \(\delta'\) is inadmissible and \(\delta\) is admissible. It does not follow that \(\delta\) dominates \(\delta'\). So just knowing of an admissible rule does not mean that you should abandon your inadmissible rule \(\delta'\). You can argue that although you know that \(\delta'\) is inadmissible, you do not know of a rule which dominates it. All you know, from the CCT, is the family of rules within which the dominating rule must live: it will be a Bayes rule for some positive \(\pi(\theta)\). Statisticians sometimes use inadmissible rules. They can argue that yes, their rule \(\delta'\) is or may be inadmissible, which is unfortunate, but since the identity of the dominating rule is not known, it is not wrong to go on using \(\delta'\). Do not attempt to explore this rather arcane line of reasoning with your client!

3.4 Point estimation

For point estimation the decision space is \(\mathcal{D} = \Theta\), and the loss function \(L(\theta, d)\) represents the (negative) consequence of choosing \(d\) as a point estimate of \(\theta\). There will be situations where an obvious loss function \({L: \Theta \times \Theta \to \mathbb{R}}\) presents itself. But not very often. Hence the need for a generic loss function which is acceptable over a wide range of situations. A natural choice in the very common case where \(\Theta\) is a convex subset of \(\mathbb{R}^{p}\) is a convex loss function, \[\begin{eqnarray*} L(\theta, d) & = & h(d - \theta) \end{eqnarray*}\] where \(h : \mathbb{R}^{p} \to \mathbb{R}\) is a smooth non-negative convex function with \(h(0) = 0\). This type of loss function asserts that small errors are much more tolerable than large ones. One possible further restriction is that \(h\) is an even function, \(h(d - \theta) = h(\theta - d)\), so that \(L(\theta, \theta + \epsilon) = L(\theta, \theta - \epsilon)\): under-estimation incurs the same loss as over-estimation.

As we saw in Example 3.1, the (univariate) quadratic loss function \(L(\theta, d) = (\theta - d)^{2}\) has attractive features; in terms of the classical risk, it corresponds to the mean squared error (MSE) of the estimator. As we will see, this result generalises to \(\mathbb{R}^{p}\) in a similar way.

There are, however, many situations where this is not appropriate: the loss function should be asymmetric, and a generic loss function should be replaced by a more specific one.

Example 3.3 (Bilinear loss) The bilinear loss function for \(\Theta \subset \mathbb{R}\) is, for \(\alpha, \beta > 0\), \[\begin{eqnarray*} L(\theta, d) & = & \left\{\begin{array}{ll} \alpha(\theta - d) & \mbox{if $d \leq \theta$}, \\ \beta(d - \theta) & \mbox{if $d \geq \theta$}. \end{array} \right. \end{eqnarray*}\] The Bayes rule is a \(\frac{\alpha}{\alpha + \beta}\)-fractile of \(\pi(\theta)\).

Note that if \(\alpha = \beta = 1\) then \(L(\theta, d) = |\theta - d|\), the absolute loss, which gives a Bayes rule of the median of \(\pi(\theta)\). \(|\theta - d|\) is smaller than \((\theta - d)^{2}\) for \(|\theta - d| > 1\) and so absolute loss is smaller than quadratic loss for large deviations. Thus, it takes less account of the tails of \(\pi(\theta)\), leading to the choice of the median. The choice of \(\alpha\) and \(\beta\) can account for asymmetry. If \(\alpha > \beta\), so \(\frac{\alpha}{\alpha + \beta} > 0.5\), then under-estimation is penalised more than over-estimation and so the Bayes rule is more likely to be an over-estimate.
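
The claim that the Bayes rule is a fractile can be checked numerically; in the sketch below, \(\pi(\theta)\) is taken to be standard normal purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, beta = 3.0, 1.0                   # under-estimation penalised more
theta = rng.normal(size=100_000)         # illustrative pi(theta): N(0, 1)

# Monte Carlo risk of the bilinear loss over a grid of decisions d.
grid = np.linspace(-3.0, 3.0, 601)
risk = [np.mean(np.where(d <= theta, alpha * (theta - d), beta * (d - theta)))
        for d in grid]

print(f"numerical Bayes rule:         {grid[np.argmin(risk)]:.3f}")
print(f"alpha/(alpha+beta)-fractile:  {np.quantile(theta, alpha / (alpha + beta)):.3f}")
```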

Example 3.4 (Example 2.1.2 of Robert (2007)) Suppose \(X\) is distributed as the \(p\)-dimensional normal distribution with mean \(\theta\) and known variance matrix \(\Sigma\) which is diagonal with diagonal elements \(\sigma^{2}_{i}\) for each \(i = 1, \ldots, p\). Then \(\mathcal{D} = \mathbb{R}^{p}\). We might consider a loss function of the form \[\begin{eqnarray*} L(\theta, d) & = &\sum_{i=1}^{p} \left(\frac{d_{i} - \theta_{i}}{\sigma_{i}}\right)^{2} \end{eqnarray*}\] so that the total loss is the sum of the squared component-wise errors.

In this case, we observe that if \(Q = \Sigma^{-1}\) then the loss function is a form of quadratic loss which we generalise in the following example.

Example 3.5 If \(\Theta \subset \mathbb{R}^{p}\), the Bayes rule \(\delta^{*}\) associated with the prior distribution \(\pi(\theta)\) and the quadratic loss \[\begin{eqnarray*} L(\theta, d) & = & (d - \theta)^{T} Q \, (d - \theta) \end{eqnarray*}\] is the posterior expectation \(\mathbb{E}(\theta \, | \, X)\) for every positive-definite symmetric \(p \times p\) matrix \(Q\).
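
To see why, expand the posterior risk and differentiate with respect to \(d\): \[\begin{eqnarray*} \nabla_{d} \, \mathbb{E}\{(d - \theta)^{T}Q(d - \theta) \, | \, X\} & = & 2Q\{d - \mathbb{E}(\theta \, | \, X)\}, \end{eqnarray*}\] which, since \(Q\) is positive-definite, vanishes at the unique minimum \(d^{*} = \mathbb{E}(\theta \, | \, X)\).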

Thus, as the Bayes rule does not depend upon \(Q\), it is the same for an uncountably large class of loss functions. If we apply the Complete Class Theorem, Theorem 3.4, to this result we see that for quadratic loss, a point estimator for \(\theta\) is admissible if and only if it is the conditional expectation with respect to some positive prior distribution \(\pi(\theta)\). The value, and interpretability, of the quadratic loss can be further observed by noting that, from a Taylor series expansion, an even, differentiable and strictly convex loss function can be approximated by a quadratic loss function.

3.5 Stein’s example

Let \(X = (X_{1}, \ldots, X_{p})^{T}\) and suppose that \(X \, | \, \theta \sim N_{p}(\theta, I_{p})\) where \(I_{p}\) is the \(p \times p\) identity matrix and \(\theta = (\theta_{1}, \ldots, \theta_{p})^{T}\) is a vector of parameters. Thus, given \(\theta\), the \(X_{i}\)s are independent \(N(\theta_{i}, 1)\). Suppose we consider a single observation, \(X = x\). The likelihood for \(\theta\) is \[\begin{eqnarray*} L_{X}(\theta; x) & = & \prod_{i=1}^{p} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(x_{i} - \theta_{i})^{2}}{2}\right\}, \end{eqnarray*}\] so that the maximum likelihood estimate is \(x\). The corresponding maximum likelihood estimator is \(X\), which is unbiased.

3.5.1 Estimation under quadratic loss

We consider point estimation of \(\theta\) using quadratic loss. Following Examples 3.4 and 3.5, with \(Q = \Sigma^{-1} = I_{p}\), we have
\[\begin{eqnarray*} L(\theta, d) \ = \ (d - \theta )^{T}(d - \theta) \ = \ \sum_{i=1}^{p} (d_{i} - \theta_{i})^{2} \end{eqnarray*}\] for decision \(d = (d_{1}, \ldots, d_{p})^{T} \in \mathbb{R}^{p}\). We consider the decision rule \(\delta_{0}(X) = X\), the maximum likelihood estimator. From Definition 3.6, the classical risk of \(\delta_{0}\) is \[\begin{eqnarray*} R(\theta, \delta_{0}) & = & \mathbb{E}[L(\theta, \delta_{0}(X)) \, | \, \theta] \\ & = & \sum_{i=1}^{p} \mathbb{E}[(\theta_{i} - X_{i})^{2} \, | \, \theta] \\ & = & \sum_{i=1}^{p} Var(X_{i} \, | \, \theta) \ = \ p. \end{eqnarray*}\] We’ll show that, for \(p \geq 3\), \(\delta_{0}\) is inadmissible by finding a decision rule which dominates it. Consider the set of James-Stein estimators \[\begin{eqnarray*} \delta_{a}(X) & = & \left(1 - \frac{a}{X^{T}X}\right) X \end{eqnarray*}\] for \(a \geq 0\). Notice that \(a = 0\) gives \(\delta_{0}(X) = X\), and that for \(a > 0\), \(\delta_{a}(X)\) is biased and shrinks \(X\) towards \(0\).

Lemma 3.1 (Stein's Lemma) If \(X \, | \, \theta \sim N_{p}(\theta, I_{p})\) and \(g(X)\) is a suitably behaved real-valued function then \[\begin{eqnarray*} \mathbb{E}(g(X)(X_{i} - \theta_{i}) \, | \, \theta) & = & \mathbb{E}\left.\left[ \frac{\partial g(X)}{\partial X_{i}} \, \right| \, \theta \right]. \end{eqnarray*}\]

Proof: First note that \[\begin{eqnarray*} \mathbb{E}(g(X)(X_{i} - \theta_{i}) \, | \, \theta) & = & \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} g(x)(x_{i} - \theta_{i}) \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(x_{j} - \theta_{j})^{2}}{2}\right\} dx_{1} \ldots dx_{p}. \end{eqnarray*}\] Now, for \(j = i\), using integration by parts, \[\begin{eqnarray*} \int_{-\infty}^{\infty} g(x)(x_{i} - \theta_{i}) \exp\left\{-\frac{(x_{i} - \theta_{i})^{2}}{2}\right\} dx_{i} & = & \int_{-\infty}^{\infty} \frac{\partial g(x)}{\partial x_{i}} \exp\left\{-\frac{(x_{i} - \theta_{i})^{2}}{2}\right\} dx_{i} + \\ & & \left[-g(x) \exp\left\{-\frac{(x_{i} - \theta_{i})^{2}}{2}\right\} \right]_{-\infty}^{\infty} \\ & = & \int_{-\infty}^{\infty} \frac{\partial g(x)}{\partial x_{i}} \exp\left\{-\frac{(x_{i} - \theta_{i})^{2}}{2}\right\} dx_{i} \end{eqnarray*}\] for suitably behaved \(g(x)\), since the boundary term vanishes. Consequently, \[\begin{eqnarray*} \mathbb{E}(g(X)(X_{i} - \theta_{i}) \, | \, \theta) & = & \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{\partial g(x)}{\partial x_{i}} \prod_{j=1}^{p} \frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{(x_{j} - \theta_{j})^{2}}{2}\right\} dx_{1} \ldots dx_{p} \\ & = & \mathbb{E}\left.\left[ \frac{\partial g(X)}{\partial X_{i}} \, \right| \, \theta \right], \end{eqnarray*}\] which completes the proof. \(\Box\)
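
The lemma is easy to check by simulation. The sketch below (with an illustrative \(\theta\); assumes NumPy) verifies it for \(g(X) = X_{1}/X^{T}X\), the function used in the risk calculation that follows.

```python
import numpy as np

rng = np.random.default_rng(2)
theta = np.array([1.0, -0.5, 2.0])       # illustrative theta, p = 3
X = rng.normal(loc=theta, size=(500_000, 3))

# Check the lemma for g(X) = X_1 / X^T X, used in the risk calculation below:
# E[g(X)(X_1 - theta_1)] should match E[(X^T X - 2 X_1^2) / (X^T X)^2].
ss = np.sum(X ** 2, axis=1)              # X^T X for each draw
lhs = np.mean((X[:, 0] / ss) * (X[:, 0] - theta[0]))
rhs = np.mean((ss - 2.0 * X[:, 0] ** 2) / ss ** 2)
print(f"LHS = {lhs:.5f},  RHS = {rhs:.5f}")
```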

We now calculate the classical risk of \(\delta_{a}\) under quadratic loss. We have \[\begin{eqnarray*} R(\theta, \delta_{a}) & = & \mathbb{E} [(\theta - \delta_{a}(X))^{T}(\theta - \delta_{a}(X)) \, | \, \theta] \\ & = & \mathbb{E}\left.\left[\left((\theta -X) + \frac{aX}{X^{T}X}\right)^{T} \left((\theta -X) + \frac{aX}{X^{T}X}\right) \, \right| \, \theta \right] \\ & = & \mathbb{E} [(\theta - X)^{T}(\theta - X) \, | \, \theta] + a^{2} \mathbb{E} \left.\left[\frac{1}{X^{T}X} \, \right| \, \theta \right] \\ & & -2a \mathbb{E} \left.\left[\frac{X^{T}(X - \theta)}{X^{T}X} \, \right| \, \theta \right] \\ & = & R(\theta, \delta_{0}) + a^{2} \mathbb{E} \left.\left[\frac{1}{X^{T}X} \, \right| \, \theta \right] -2a \sum_{i=1}^{p} \mathbb{E} \left.\left[\frac{X_{i}(X_{i} - \theta_{i})}{X^{T}X} \, \right| \, \theta \right]. \end{eqnarray*}\] Now, using Stein’s Lemma with \(g(X) = X_{i}/X^{T}X\) we have \[\begin{eqnarray*} \sum_{i=1}^{p} \mathbb{E} \left.\left[\frac{X_{i}}{X^{T}X}(X_{i} - \theta_{i}) \, \right| \, \theta \right] &= & \sum_{i=1}^{p} \mathbb{E} \left.\left[\frac{\partial}{\partial X_{i}} \frac{X_{i}}{X^{T}X} \, \right| \, \theta \right] \\ & = & \sum_{i=1}^{p} \mathbb{E} \left.\left[\frac{X^{T}X - 2X_{i}^{2}}{(X^{T}X)^{2}} \, \right| \, \theta \right] \\ & = & \mathbb{E} \left.\left[\frac{pX^{T}X - 2 \sum_{i=1}^{p} X_{i}^{2}}{(X^{T}X)^{2}} \, \right| \, \theta \right] \\ & = & (p-2) \mathbb{E} \left.\left[\frac{1}{X^{T}X} \, \right| \, \theta \right] \end{eqnarray*}\] since \(\sum_{i=1}^{p} X_{i}^{2} = X^{T}X\). Thus, \[\begin{eqnarray*} R(\theta, \delta_{a}) & = & R(\theta, \delta_{0}) + (a^{2}-2a(p-2)) \mathbb{E} \left.\left[\frac{1}{X^{T}X} \, \right| \, \theta \right]. \end{eqnarray*}\] Now, \(\mathbb{E} [1/X^{T}X \, | \, \theta]\) is strictly positive, and finite for \(p \geq 3\), and thus if \(a^{2}-2a(p-2) < 0\) then \[\begin{eqnarray*} R(\theta, \delta_{a}) & < & R(\theta, \delta_{0}). \end{eqnarray*}\] Hence, if \(0 < a < 2(p-2)\) then \(\delta_{0}\) is inadmissible as, for all \(\theta\), \(R(\theta, \delta_{a}) < R(\theta, \delta_{0})\). Consequently, for \(p \geq 3\) the maximum likelihood estimator \(\delta_{0}\) is inadmissible.

Notice that \(a = p-2\) minimises \(R(\theta, \delta_{a})\) and that if \(\theta = 0\) then \(X^{T}X \sim \chi_{p}^{2}\), the chi-squared distribution with \(p\) degrees of freedom. Thus, \(\mathbb{E} [1/X^{T}X \, | \, \theta = 0] = 1/(p-2)\). Recalling that \(R(\theta, \delta_{0}) = p\) then \(R(0, \delta_{p-2}) = 2\) and so when \(p\) is large, this will be much smaller than the corresponding risk of \(\delta_{0}\).
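
A small simulation makes the gain tangible. This sketch (with \(\theta = 0\), where the gain is largest, and an illustrative choice of \(p\)) estimates the risks of \(\delta_{0}\) and \(\delta_{p-2}\) by Monte Carlo.

```python
import numpy as np

rng = np.random.default_rng(3)
p, reps = 10, 200_000
theta = np.zeros(p)                      # theta = 0: largest gain over the MLE

X = rng.normal(loc=theta, size=(reps, p))
mle_loss = np.sum((X - theta) ** 2, axis=1)

shrink = 1.0 - (p - 2) / np.sum(X ** 2, axis=1)   # a = p - 2
js_loss = np.sum((shrink[:, None] * X - theta) ** 2, axis=1)

print(f"R(0, delta_0)     ~ {mle_loss.mean():.2f}   (theory: {p})")
print(f"R(0, delta_(p-2)) ~ {js_loss.mean():.2f}   (theory: 2)")
```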

The \(i\)th component of \(\delta_{a}(X) = \left(1 - \frac{a}{X^{T}X}\right) X\) is \(\left(1 - \frac{a}{X^{T}X}\right) X_{i}\), and so depends on all of \(X_{1}, \ldots, X_{p}\) even though the \(X_{i}\)s are independent. As Young and Smith (2005) comment on p35: “at first sight, the result seems incredible: there is no apparent ‘tying together’ of the losses, yet the obvious estimator, the sample ‘mean’ X, is not admissible.” Indeed, the result caused such consternation when first published, see Stein (1956) and James and Stein (1961), that it might be termed “Stein’s bombshell”; the phenomenon can be shown to occur in many situations beyond normal means with known common variance, whenever three or more populations are compared. See Efron and Morris (1977) for more details, and Samworth (2012) for an accessible overview. Note that whilst its admissibility under quadratic loss is questionable, the MLE remains the dominant point estimator in applied statistics.

3.6 Set estimation

For set estimation the decision space is a set of subsets of \(\Theta\), so that each \(d \subset \Theta\). There are two contradictory requirements for set estimators of \(\theta\). We want the sets to be small, but we also want them to contain \(\theta\). There is a simple way to represent these two requirements as a loss function, which is to use \[\begin{eqnarray} L(\theta, d) & = & |d| + \kappa(1 - \mathbb{I}_{\theta \in d}) \tag{3.6} \end{eqnarray}\] for some \(\kappa > 0\), where \(|d|\) is the volume of \(d\). The value of \(\kappa\) controls the trade-off between the two requirements. If \(\kappa \downarrow 0\) then minimising the expected loss will always produce the empty set. If \(\kappa \uparrow \infty\) then minimising the expected loss will always produce \(\Theta\). For \(\kappa\) in-between, the Bayes rule will depend on beliefs about \(X\) and the value \(x\). For loss functions of the form (3.6) there is a simple necessary condition for a rule to be a Bayes rule. A set \(d \subset \Theta\) is a level set of the posterior distribution exactly when \(d = \{\theta \, : \, \pi(\theta \, | \, x) \geq k\}\) for some \(k\).

Theorem 3.5 (Level set property, LSP) If \(\delta^{*}\) is a Bayes rule for the loss function in (3.6) then it is a level set of the posterior distribution.

Proof: For fixed \(x\), we show that if \(d\) is not a level set of the posterior distribution then there is a \(d' \neq d\) which has a smaller expected loss, so that \(\delta^{*}(x) \neq d\). Note that \[\begin{eqnarray} \mathbb{E}\{L(\theta, d) \, | \, X\} & = & |d| + \kappa\mathbb{P}(\theta \notin d \, | \, X). \tag{3.7} \end{eqnarray}\] Suppose that \(d\) is not a level set of \(\pi(\theta \, | \, x)\). Then there is a \(\theta \in d\) and \(\theta' \notin d\) for which \(\pi(\theta' \, | \, x) > \pi(\theta \, | \, x)\). Let \(d' = (d \cup d\theta') \setminus d\theta\), where \(d\theta\) is the tiny region of \(\Theta\) around \(\theta\) and \(d\theta'\) is the tiny region of \(\Theta\) around \(\theta'\), for which \(|d\theta| = |d\theta'|\). Then \(|d'| = |d|\) but \[\begin{eqnarray*} \mathbb{P}(\theta \notin d' \, | \, X) & < & \mathbb{P}(\theta \notin d \, | \, X). \end{eqnarray*}\] Thus, from equation (3.7), \(\mathbb{E}\{L(\theta, d') \, | \, X\} < \mathbb{E}\{L(\theta, d) \, | \, X\}\), showing that \(\delta^{*}(x) \neq d\). \(\Box\)

Now relate this result to the CCT (Theorem 3.4). First, Theorem 3.5 asserts that \(\delta\) having the LSP is necessary (but not sufficient) for \(\delta\) to be a Bayes rule for loss functions of the form (3.6). Second, the CCT asserts that being a Bayes rule is a necessary (but not sufficient) condition for \(\delta\) to be admissible. So, being a level set of a posterior distribution for some prior distribution \(\pi(\theta)\) is a necessary condition for being an admissible set estimator for loss functions of the form (3.6). Bayesian HPD (highest posterior density) regions satisfy this necessary condition, whilst classical set estimators achieve a similar outcome if they are level sets of the likelihood function, because the posterior is proportional to the likelihood under a uniform prior distribution.

In the case where \(\Theta\) is unbounded, this prior distribution may have to be truncated to be proper.
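
As a computational aside, the Bayes rule for the loss (3.6) can be found by scanning level sets of a discretised posterior; the sketch below uses an illustrative normal posterior and an illustrative \(\kappa\), and assumes SciPy is available for the density.

```python
import numpy as np
from scipy import stats

# Sketch: Bayes rule for the loss (3.6) as a level set of a discretised
# posterior. The posterior N(1, 0.5^2) and kappa = 2 are illustrative.
grid = np.linspace(-2.0, 4.0, 6001)
dx = grid[1] - grid[0]
post = stats.norm.pdf(grid, loc=1.0, scale=0.5)
kappa = 2.0

# Adding grid points in decreasing order of density traces out the level
# sets; for each one, expected loss = |d| + kappa * P(theta not in d | x).
order = np.argsort(post)[::-1]
volume = dx * np.arange(1, grid.size + 1)
coverage = np.cumsum(post[order]) * dx
exp_loss = volume + kappa * (1.0 - coverage)

best = order[: np.argmin(exp_loss) + 1]
print(f"Bayes set estimate ~ [{grid[best].min():.3f}, {grid[best].max():.3f}]")
```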

3.7 Hypothesis tests

For hypothesis tests, the decision space is a partition of \(\Theta\), denoted \[\begin{eqnarray*} \mathcal{H} & := & \{ H_0, H_1, \dots, H_d\} . \end{eqnarray*}\] Each element of \(\mathcal{H}\) is termed a hypothesis; it is traditional to number the hypotheses from zero. The loss function \(L(\theta, H_i)\) represents the (negative) consequences of choosing element \(H_i\) when the parameter takes the value \(\theta\). It would be usual for the loss function to satisfy \[\begin{eqnarray*} \theta \in H_i \implies L(\theta, H_i) & = & \min_{j} L(\theta, H_{j}) \end{eqnarray*}\] on the grounds that an incorrect choice of element should never incur a smaller loss than the correct choice. There is a generic loss function for hypothesis tests: the 0-1 (“zero-one”) loss function \[\begin{eqnarray*} L(\theta, H_i) & = & 1 - \mathbb{I}_{\{\theta \in H_i\}} , \end{eqnarray*}\] i.e., zero if \(\theta\) is in \(H_i\), and one if it is not. The corresponding Bayes rule is to select the hypothesis with the largest posterior probability.
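
A sketch of the 0-1 Bayes rule (the partition, the posterior, and all numbers are invented for illustration): estimate each hypothesis's posterior probability from posterior draws and select the largest.

```python
import numpy as np

# Sketch: Bayes rule under 0-1 loss selects the hypothesis with the largest
# posterior probability. The partition and the posterior are invented.
hypotheses = {"H0: theta <= 0": lambda t: t <= 0,
              "H1: theta > 0":  lambda t: t > 0}

rng = np.random.default_rng(4)
draws = rng.normal(loc=0.4, scale=0.5, size=100_000)   # draws from pi(theta | x)

probs = {name: np.mean(h(draws)) for name, h in hypotheses.items()}
for name, pr in probs.items():
    print(f"P({name} | x) = {pr:.3f}")
print("selected hypothesis:", max(probs, key=probs.get))
```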

It is arguable whether the 0-1 loss function approximates a wide range of actual loss functions, and an alternative approach has proved more popular. This is to co-opt the theory of set estimators, for which there is a defensible generic loss function with strong implications for the selection of decision rules (see Section 3.6). The statistician can use her set estimator \(\delta\) to make at least some distinctions between the members of \(\mathcal{H}\):

  • “Accept” \(H_i\) exactly when \(\delta(x) \subset H_i\),
  • “Reject” \(H_i\) exactly when \(\delta(x) \cap H_i = \emptyset\),
  • “Undecided” about \(H_i\) otherwise.

Note that these three terms are given in quotes, to indicate that they acquire a technical meaning in this context. We do not use the quotes in practice, but we always bear in mind that we are not “accepting \(H_i\)” in the vernacular sense, but simply asserting that \(\delta(x) \subset H_i\) for our particular choice of \(\delta\).
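
A sketch of this classification logic for interval hypotheses on the real line (the set estimate and the partition are invented, and boundary cases are treated loosely):

```python
# Sketch: using a set estimate delta(x) to "accept", "reject" or remain
# "undecided" about interval hypotheses. All intervals are invented.
delta_x = (0.2, 1.3)                       # e.g. an HPD interval
partition = {"H0": (float("-inf"), 0.0), "H1": (0.0, float("inf"))}

for name, (lo, hi) in partition.items():
    if lo <= delta_x[0] and delta_x[1] <= hi:          # delta(x) subset of H_i
        verdict = "accept"
    elif delta_x[1] <= lo or hi <= delta_x[0]:         # delta(x) disjoint from H_i
        verdict = "reject"
    else:
        verdict = "undecided"
    print(f"{name}: {verdict}")
```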

Bibliography

Berger, J. O. 1985. Statistical Decision Theory and Bayesian Analysis. 2nd ed. New York, NY, USA: Springer-Verlag.
Cox, D. R., and D. V. Hinkley. 1974. Theoretical Statistics. London, UK: Chapman & Hall.
Efron, B., and C. Morris. 1977. “Stein’s Paradox in Statistics.” Scientific American 236 (5): 119–27.
Ghosh, M., and G. Meeden. 1997. Bayesian Methods for Finite Population Sampling. London, UK: Chapman & Hall.
James, W., and C. Stein. 1961. “Estimation with Quadratic Loss.” In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, 1:361–80. University of California Press.
Robert, C. P. 2007. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. New York, USA: Springer.
Samworth, R. J. 2012. “Stein’s Paradox.” Eureka 62: 38–41.
Schervish, M. J. 1995. Theory of Statistics. New York NY, USA: Springer.
Stein, C. 1956. “Inadmissibility of the Usual Estimator for the Mean of a Multivariate Normal Distribution.” In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, 1:197–206. University of California Press.
Young, G. A., and R. L. Smith. 2005. Essentials of Statistical Inference. Cambridge UK: Cambridge University Press.