Chapter 5 Schools of thought for statistical inference
There are two broad approaches to statistical inference, generally termed the classical approach and the Bayesian approach. The former approach is also called frequentist. In brief, the difference between the two lies in their interpretation of the parameter \(\theta\). In a classical setting, the parameter is viewed as a fixed unknown constant and inferences are made utilising the distribution \(f_{X}(x \, | \, \theta)\) even after the data \(x\) has been observed. Conversely, in a Bayesian approach, parameters are treated as random and so may be equipped with a probability distribution. We now give a short overview of each school.
5.1 Classical inference
In a classical approach to statistical inference, no further probabilistic assumptions are made once the parametric model \(\mathcal{E} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\}\) is specified. In particular, \(\theta\) is treated as an unknown constant and interest centres on constructing good methods of inference.
To illustrate the key ideas, we shall initially consider point estimators. The most familiar classical point estimator is the maximum likelihood estimator (MLE). The MLE \(\hat{\theta} = \hat{\theta}(X)\) satisfies (see Definition 4.1) \[\begin{eqnarray*} L_{X}(\hat{\theta}(x); x) & \geq & L_{X}(\theta; x) \end{eqnarray*}\] for all \(\theta \in \Theta\). Intuitively, the MLE is a reasonable choice for an estimator: it’s the value of \(\theta\) which makes the observed sample most likely. In general, the MLE can be viewed as a good point estimator with a number of desirable properties. For example, it satisfies the invariance property that if \(\hat{\theta}\) is the MLE of \(\theta\) then for any function \(g(\theta)\), the MLE of \(g(\theta)\) is \(g(\hat{\theta})\). For a proof of this property, see Theorem 7.2.10 of Casella and Berger (2002). However, there are drawbacks which come from the difficulties of finding the maximum of a function.
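As a minimal illustrative sketch (the Bernoulli model and the numbers here are our own assumptions, not taken from the text), we can maximise a log-likelihood numerically and check that the result agrees with the analytic MLE \(\hat{\theta}(x) = \overline{x}\):

```python
# A minimal sketch, assuming a simulated Bernoulli(0.3) sample:
# numerically maximise the log-likelihood and compare with the
# analytic MLE, theta-hat = x-bar.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
x = rng.binomial(1, 0.3, size=100)

def neg_log_lik(theta):
    # negative log-likelihood of an i.i.d. Bernoulli(theta) sample
    return -np.sum(x * np.log(theta) + (1 - x) * np.log(1 - theta))

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x, x.mean())  # the two estimates agree to numerical precision
```

By the invariance property, the MLE of, say, the odds \(\theta/(1-\theta)\) is then simply \(\hat{\theta}/(1-\hat{\theta})\).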
Efron and Hastie (2016) consider that there are three ages of statistical inference: the pre-computer age (essentially the period from 1763 and the publication of Bayes’ rule up until the 1950s), the early-computer age (from the 1950s to the 1990s), and the current age (a period of computer-dependence with enormously ambitious algorithms and model complexity). With these developments in mind, it is clear that there exists a hierarchy of statistical models.
- Models where \(f_{X}(x \, | \, \theta)\) has a known analytic form.
- Models where \(f_{X}(x \, | \, \theta)\) can be evaluated.
- Models where we can simulate \(X\) from \(f_{X}(x \, | \, \theta)\).
Between the first case and the second case lie models where \(f_{X}(x \, | \, \theta)\) can be evaluated up to an unknown constant, which may or may not depend upon \(\theta\).
In the first case, we might be able to derive an analytic expression for \(\hat{\theta}\) or to prove that \(f_{X}(x \, | \, \theta)\) has a unique maximum so that any numerical maximisation will converge to \(\hat{\theta}(x)\).
Example 5.1 We revisit Examples 4.1 and 4.4 and the case when \(\theta = (\alpha, \beta)\) are the parameters of a Gamma distribution. In this case, the maximum likelihood estimators \(\hat{\theta} = (\hat{\alpha}, \hat{\beta})\) satisfy the equations \[\begin{eqnarray*} \hat{\beta} & = & \frac{\hat{\alpha}}{\overline{X}}, \\ 0 & = & n \log \hat{\alpha} - n \log \overline{X} - n \frac{\Gamma^{\prime}(\hat{\alpha})}{\Gamma(\hat{\alpha})} + \sum_{i=1}^{n} \log X_{i}. \end{eqnarray*}\] Thus, numerical methods are required to find \(\hat{\theta}\).
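Dividing the second equation by \(n\) and writing \(\psi = \Gamma^{\prime}/\Gamma\) for the digamma function, \(\hat{\alpha}\) solves \(\log \alpha - \psi(\alpha) = \log \overline{X} - \frac{1}{n}\sum_{i=1}^{n} \log X_{i}\), a one-dimensional root-finding problem. A sketch (with an assumed simulated sample) is:

```python
# A sketch of the numerical solution of Example 5.1, assuming a
# simulated Gamma(alpha = 2, rate beta = 3) sample: solve
# log(a) - digamma(a) = log(xbar) - mean(log x) for alpha-hat,
# then set beta-hat = alpha-hat / xbar.
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(1)
x = rng.gamma(shape=2.0, scale=1.0 / 3.0, size=500)  # scale = 1/beta

rhs = np.log(x.mean()) - np.log(x).mean()
alpha_hat = brentq(lambda a: np.log(a) - digamma(a) - rhs, 1e-3, 1e3)
beta_hat = alpha_hat / x.mean()
print(alpha_hat, beta_hat)  # close to (2, 3) for this simulated sample
```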
In the second case, we could still numerically maximise \(f_{X}(x \, | \, \theta)\) but the maximiser may converge to a local maximum rather than the global maximum \(\hat{\theta}(x)\). Consequently, any algorithm utilised for finding \(\hat{\theta}(x)\) must include additional procedures to ensure that it is not trapped at a suboptimal local maximum. This is a non-trivial task in practice. In the third case, it is extremely difficult to find the MLE and other estimators of \(\theta\) may be preferable. This discussion shows that the choice of algorithm is critical: the MLE is a good method of inference only if:
- you can prove that it has good properties for your choice of \(f_{X}(x \, | \, \theta)\) and
- you can prove that the algorithm you use to maximise \(f_{X}(x \, | \, \theta)\) does indeed find the MLE.
The second point arises once the choice of estimator has been made. We now consider how to assess whether a chosen point estimator is a good estimator. One possible attractive feature is that the method is, on average, correct. An estimator \(T = t(X)\) is said to be unbiased if \[\begin{eqnarray*} \mbox{bias}(T \, | \, \theta) & = & \mathbb{E}(T \, | \, \theta) - \theta \end{eqnarray*}\] is zero for all \(\theta \in \Theta\). This is a superficially attractive criterion but it can lead to unexpected results (which are not sensible estimators) even in simple cases.
Example 5.2 (Example 8.1 of Cox and Hinkley (1974)) Let \(X\) denote the number of independent \(\mbox{Bernoulli}(\theta)\) trials up to and including the first success so that \(X \sim \mbox{Geom}(\theta)\) with \[\begin{eqnarray*} f_{X}(x \, | \, \theta) & = & (1-\theta)^{x-1}\theta \end{eqnarray*}\] for \(x = 1, 2, \ldots\) and zero otherwise. If \(T = t(X)\) is an unbiased estimator of \(\theta\) then \[\begin{eqnarray*} \mathbb{E}(T \, | \, \theta) \ = \ \sum_{x=1}^{\infty} t(x)(1-\theta)^{x-1}\theta \ = \ \theta. \end{eqnarray*}\] Letting \(\phi = 1 - \theta\) we thus have \[\begin{eqnarray*} \sum_{x=1}^{\infty} t(x) \phi^{x-1}(1 - \phi) & = & 1 - \phi. \end{eqnarray*}\] Thus, equating the coefficients of powers of \(\phi\), we find that the unique unbiased estimate of \(\theta\) is \[\begin{eqnarray*} t(x) & = & \left\{\begin{array}{ll} 1 & x = 1, \\ 0 & x = 2,3, \ldots. \end{array}\right. \end{eqnarray*}\] This is clearly not a sensible estimator.
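A quick simulation sketch (with an assumed value \(\theta = 0.3\)) confirms the calculation: since \(\mathbb{E}(T \, | \, \theta) = \mathbb{P}(X = 1 \, | \, \theta) = \theta\), the estimator is unbiased even though it only ever reports 0 or 1:

```python
# A sketch, assuming theta = 0.3: the estimator t(X) = 1 if X = 1 and
# 0 otherwise is unbiased for theta, despite never taking a value
# strictly between 0 and 1.
import numpy as np

rng = np.random.default_rng(2)
theta = 0.3
x = rng.geometric(theta, size=100_000)  # trials up to the first success
t = (x == 1).astype(float)
print(t.mean())  # close to theta = 0.3
```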
Another drawback with the bias is that it is not, in general, transformation invariant. For example, if \(T\) is an unbiased estimator of \(\theta\) then \(T^{-1}\) is not, in general, an unbiased estimator of \(\theta^{-1}\) as \(\mathbb{E}(T^{-1} \, | \, \theta) \neq 1/\mathbb{E}(T \, | \, \theta) = \theta^{-1}\). An alternative, and better, criterion is that \(T\) has small mean square error (MSE), \[\begin{eqnarray*} \mbox{MSE}(T \, | \, \theta) & = & \mathbb{E}((T - \theta)^{2} \, | \, \theta) \\ & = & \mathbb{E}(\{(T - \mathbb{E}(T \, | \, \theta)) + (\mathbb{E}(T \, | \, \theta) - \theta)\}^{2} \, | \, \theta) \\ & = & Var(T \, | \, \theta) + \mbox{bias}(T \, | \, \theta)^{2}, \end{eqnarray*}\] where the cross term vanishes because \(\mathbb{E}(T - \mathbb{E}(T \, | \, \theta) \, | \, \theta) = 0\). Thus, estimators with a small mean square error will typically have small variance and bias, and it’s possible to trade unbiasedness for a smaller variance. What this discussion does make clear is that it is properties of the distribution of the estimator \(T\), known as the sampling distribution, across the range of possible values of \(\theta\) that are used to determine whether or not \(T\) is a good inference rule. Moreover, this assessment is made not for the observed data \(x\) but based on the distributional properties of \(X\). In this sense, we justify methods of inference by calibrating how they would perform were they to be used repeatedly. As Cox (2006, p. 8) notes, “we intend, of course, that this long-run behaviour is some assurance that with our particular data currently under analysis sound conclusions are drawn.”
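As an illustrative sketch of this trade-off (assuming normal data with known variance \(\sigma^{2} = 4\); the choice of divisors is ours), compare variance estimators of the form \(\sum_{i}(X_{i} - \overline{X})^{2}/c\): the divisor \(c = n - 1\) gives zero bias, but a larger divisor accepts some bias in exchange for a smaller variance and, for normal data, a smaller MSE:

```python
# A sketch comparing estimated bias and MSE of variance estimators
# with divisors n-1, n and n+1, assuming i.i.d. N(0, 4) samples of
# size n = 10.
import numpy as np

rng = np.random.default_rng(3)
n, sigma2, reps = 10, 4.0, 200_000
x = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)

for divisor in (n - 1, n, n + 1):
    est = ss / divisor
    print(divisor,
          est.mean() - sigma2,           # estimated bias
          ((est - sigma2) ** 2).mean())  # estimated MSE
```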
Example 5.3 Let \(X = (X_{1}, \ldots, X_{n})\) and suppose that the \(X_{i}\) are independent and identically distributed normal random variables with mean \(\theta\) and variance \(1\). Letting \(\overline{X} = \frac{1}{n}\sum_{i=1}^{n} X_{i}\) then \[\begin{eqnarray*} \mathbb{P}\left(\left.\theta - \frac{1.96}{\sqrt{n}} \leq \overline{X} \leq \theta + \frac{1.96}{\sqrt{n}} \, \right| \, \theta\right) \ = \ \mathbb{P}\left(\left.\overline{X} - \frac{1.96}{\sqrt{n}} \leq \theta \leq \overline{X} + \frac{1.96}{\sqrt{n}} \, \right| \, \theta\right) \ = \ 0.95. \end{eqnarray*}\] Thus, \((\overline{X} - \frac{1.96}{\sqrt{n}}, \overline{X} + \frac{1.96}{\sqrt{n}})\) is a set estimator for \(\theta\) with a coverage probability of 0.95. We can consider this as a method of inference, or algorithm. If we observe \(X = x\) corresponding to \(\overline{X} = \overline{x}\) then our algorithm is \[\begin{eqnarray*} x & \mapsto & \left(\overline{x} - \frac{1.96}{\sqrt{n}}, \overline{x} + \frac{1.96}{\sqrt{n}}\right) \end{eqnarray*}\] which produces a 95% confidence interval for \(\theta\). Notice that we report two things: the result of the algorithm (the actual interval) and the certification of the algorithm (that it is a 95% confidence interval), the latter justified by the long-run properties of the algorithm.
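A simulation sketch of this long-run justification (the values \(\theta = 1.5\) and \(n = 25\) are assumed for illustration): over repeated samples, the interval covers \(\theta\) about 95% of the time:

```python
# A sketch of the coverage property of Example 5.3, assuming
# theta = 1.5 and n = 25: the interval xbar +/- 1.96/sqrt(n) covers
# theta in roughly 95% of repeated samples.
import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 1.5, 25, 100_000
xbar = rng.normal(theta, 1.0, size=(reps, n)).mean(axis=1)
half = 1.96 / np.sqrt(n)
covered = (xbar - half <= theta) & (theta <= xbar + half)
print(covered.mean())  # close to 0.95
```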
As the example demonstrates, the certification is determined by the sampling distribution (\(\overline{X}\) has a normal distribution with mean \(\theta\) and variance \(1/n\)) whilst the choice of algorithm is determined by the certification: in this case, the coverage probability of 0.95. If we wanted a coverage of 0.90 then we would amend the algorithm by replacing 1.96 in the interval calculation with 1.645. This is an inverse problem in the sense that we work backwards from the required certificate to the choice of algorithm. Notice that we are able to compute the coverage for every \(\theta \in \Theta\) because we have a pivot: \(\sqrt{n}(\overline{X} - \theta)\) has a normal distribution with mean \(0\) and variance \(1\) whatever the value of \(\theta\), and so is parameter free. For more complex models it will not be straightforward to do this.
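A one-line sketch of this inverse problem: given a required coverage, the corresponding critical value is a quantile of the standard normal pivot:

```python
# A sketch: recover the critical value from the required coverage via
# the standard normal quantile function.
from scipy.stats import norm

for coverage in (0.90, 0.95, 0.99):
    print(coverage, norm.ppf((1 + coverage) / 2))  # 1.645, 1.960, 2.576
```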
We can generalise the idea exhibited in Example 5.3 into a key principle of the classical approach:
- Every algorithm is certified by its sampling distribution, and
- The choice of algorithm depends on this certification.
Thus, point estimators of \(\theta\) may be certified by their mean square error function; set estimators of \(\theta\) may be certified by their coverage probability; hypothesis tests may be certified by their power function. The definition of each of these certifications is not important here, though they are easy to look up. What is important to understand is that in each case an algorithm is proposed, the sampling distribution is inspected, and then a certificate is issued. Individuals and user communities develop conventions about the certificates they like their algorithms to possess, and thus they choose an algorithm according to its certification. For example, in clinical trials, it is conventional to require a hypothesis test to have a type I error rate below 5% together with large power.
We now consider prediction in a classical setting. As in Section 3, see equation (3.1), from a parametric model for \((X, Y)\), \(\mathcal{E} = \{\mathcal{X}\times \mathcal{Y}, \Theta, f_{X, Y}(x, y \, | \, \theta)\}\), we can calculate the predictive model \[\begin{eqnarray*} \mathcal{E}^{*} & = & \{\mathcal{Y}, \Theta, f_{Y | X}(y \, | \, x, \theta)\}. \end{eqnarray*}\] The difficulty here is that \(\mathcal{E}^{*}\) is a family of distributions and we seek to reduce this down to a single distribution; effectively, to “get rid of” \(\theta\). If we accept, as our working hypothesis, that one of the elements in the family of distributions is true, that is, that there is a \(\theta^{*} \in \Theta\) which is the true value of \(\theta\), then the corresponding predictive distribution \(f_{Y | X}(y \, | \, x, \theta^{*})\) is the true predictive distribution for \(Y\). The classical solution is to replace \(\theta^{*}\) by plugging in an estimate based on \(x\).
Example 5.4 If we use the MLE \(\hat{\theta} = \hat{\theta}(x)\) then we have an algorithm \[\begin{eqnarray*} x & \mapsto & f_{Y | X}(y \, | \, x, \hat{\theta}(x)). \end{eqnarray*}\]
The estimator does not have to be the MLE and so we see that different estimators produce different algorithms.
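As a sketch, assume (our choice, for illustration) that \(X_{1}, \ldots, X_{n}, Y\) are independent \(N(\theta, 1)\), so that the MLE is \(\hat{\theta}(x) = \overline{x}\) and the plug-in algorithm maps \(x\) to a \(N(\overline{x}, 1)\) predictive density for \(Y\):

```python
# A sketch of the plug-in predictive, assuming X_1, ..., X_n, Y are
# i.i.d. N(theta, 1): replace theta by the MLE xbar.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(5)
x = rng.normal(2.0, 1.0, size=50)
theta_hat = x.mean()  # the MLE of theta

def plug_in_predictive(y):
    # f_{Y|X}(y | x, theta-hat): here a N(theta-hat, 1) density
    return norm.pdf(y, loc=theta_hat, scale=1.0)

print(plug_in_predictive(np.array([1.0, 2.0, 3.0])))
```

Replacing the MLE by any other estimator \(\tilde{\theta}(x)\) changes only the line defining `theta_hat`, which is precisely the sense in which different estimators produce different algorithms.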
5.2 Bayesian inference
In a Bayesian approach to statistical inference, we consider that, in addition to the parametric model \(\mathcal{E} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\}\), the uncertainty about the parameter \(\theta\) prior to observing \(X\) can be represented by a prior distribution \(\pi\) on \(\theta\). We can then utilise Bayes’s theorem to obtain the posterior distribution \(\pi(\theta \, | \, x)\) of \(\theta\) given \(X = x\), \[\begin{eqnarray*} \pi(\theta \, | \, x) & = & \frac{f_{X}(x \, | \, \theta) \pi(\theta)}{\int_{\Theta} f_{X}(x \, | \, \theta) \pi(\theta) \, d\theta}. \end{eqnarray*}\] We make the following definition.
Definition 5.1 (Bayesian statistical model) A Bayesian statistical model is the collection \(\mathcal{E}_{B} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta), \pi(\theta)\}\).
As O’Hagan and Forster (2004) note on p5, “the posterior distribution encapsulates all that is known about \(\theta\) following the observation of the data \(x\), and can be thought of as comprising an all-embracing inference statement about \(\theta\).” In the context of algorithms, we have \[\begin{eqnarray*} x & \mapsto & \pi(\theta \, | \, x) \end{eqnarray*}\] where each choice of prior distribution produces a different algorithm. In this course, our primary focus is upon general theory and methodology and so, at this point, we shall merely note that both specifying a prior distribution for the problem at hand and deriving the corresponding posterior distribution are decidedly non-trivial tasks. Indeed, in the same way that we discussed a hierarchy of statistical models for \(f_{X}(x \, | \, \theta)\) in Section 5.1, an analogous hierarchy exists for the posterior distribution \(\pi(\theta \, | \, x)\).
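As an illustrative sketch of the algorithm \(x \mapsto \pi(\theta \, | \, x)\) (the Binomial likelihood and Beta(2, 2) prior are assumed choices), we can approximate the posterior on a grid, mirroring the numerator-over-denominator structure of Bayes’s theorem; here the conjugate answer, a Beta(9, 5) posterior with mean \(9/14\), is available as a check:

```python
# A sketch of grid approximation of the posterior for an assumed
# Binomial(n = 10) likelihood with x = 7 successes and an assumed
# Beta(2, 2) prior.
import numpy as np
from scipy.stats import binom, beta

x, n = 7, 10
grid = np.linspace(0.001, 0.999, 999)
d = grid[1] - grid[0]
prior = beta.pdf(grid, 2, 2)
lik = binom.pmf(x, n, grid)
post = lik * prior
post /= post.sum() * d          # normalise: the denominator integral
print((grid * post).sum() * d)  # posterior mean; conjugacy gives 9/14
```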
In contrast to the plug-in classical approach to prediction, the Bayesian approach can be viewed as integrate-out. If \(\mathcal{E}_{B} = \{\mathcal{X}\times \mathcal{Y}, \Theta, f_{X, Y}(x, y \, | \, \theta), \pi(\theta)\}\) is our Bayesian model for \((X, Y)\) and we are interested in prediction for \(Y\) given \(X = x\) then we can integrate out \(\theta\) to obtain the parameter-free conditional distribution \(f_{Y | X}(y \, | \, x)\): \[\begin{eqnarray} f_{Y | X}(y \, | \, x) & = & \int_{\Theta} f_{Y | X}(y \, | \, x, \theta)\pi(\theta \, | \, x) \, d\theta. \tag{5.1} \end{eqnarray}\] In terms of an algorithm, we have \[\begin{eqnarray*} x & \mapsto & f_{Y | X}(y \, | \, x) \end{eqnarray*}\] where, since equation (5.1) integrates \(\theta\) out according to the posterior distribution, each choice of prior distribution produces a different algorithm.
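Continuing the sketch above (still an assumed Beta-Bernoulli model), suppose \(Y\) is a further Bernoulli(\(\theta\)) trial. Then equation (5.1) gives \(\mathbb{P}(Y = 1 \, | \, x) = \int_{\Theta} \theta \, \pi(\theta \, | \, x) \, d\theta\), the posterior mean, which for a Beta(\(a\), \(b\)) posterior is \(a/(a+b)\):

```python
# A sketch of equation (5.1) for an assumed Beta(9, 5) posterior and a
# new Bernoulli trial Y: integrate theta against the posterior to get
# P(Y = 1 | x), and compare with the closed form a / (a + b).
import numpy as np
from scipy.stats import beta

a, b = 9, 5
grid = np.linspace(0.001, 0.999, 999)
d = grid[1] - grid[0]
post = beta.pdf(grid, a, b)
p_y1 = (grid * post).sum() * d  # grid version of the integral in (5.1)
print(p_y1, a / (a + b))        # both close to 9/14
```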
Whilst the posterior distribution expresses all of our knowledge about the parameter \(\theta\) given the data \(x\), in order to express this knowledge in clear and easily understood terms we need to derive appropriate summaries of the posterior distribution. Typical summaries include point estimates, interval estimates, and probabilities of specified hypotheses.
Example 5.5 Suppose that \(\theta\) is a univariate parameter and we consider summarising \(\theta\) by a number \(d\). We may compute the posterior expectation of the squared distance between \(d\) and \(\theta\): \[\begin{eqnarray*} \mathbb{E}((d - \theta)^{2} \, | \, X) & = & \mathbb{E}(d^{2} - 2d\theta + \theta^{2} \, | \, X) \\ & = & d^{2} -2d\mathbb{E}(\theta \, | \, X) + \mathbb{E}(\theta^{2} \, | \, X) \\ & = & (d - \mathbb{E}(\theta \, | \, X))^{2} + Var(\theta \, | \, X). \end{eqnarray*}\] Consequently \(d = \mathbb{E}(\theta \, | \, X)\), the posterior expectation, minimises the posterior expected square error, and the minimum value of this error is \(Var(\theta \, | \, X)\), the posterior variance.
In this way, we have a justification for \(\mathbb{E}(\theta \, | \, X)\) as an estimate of \(\theta\). We could view \(d\) as a decision, the result of which is to incur an error \(d - \theta\). In this example we choose to measure how good or bad a particular decision was by the squared error, suggesting that we are equally happy to overestimate \(\theta\) as to underestimate it and that large errors are more serious than they would be if an alternative measure such as \(|d - \theta|\) were used.
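A small numerical check of Example 5.5 (using draws from an assumed Beta(9, 5) posterior): over a grid of candidate decisions \(d\), the posterior expected squared error is minimised at the posterior mean, and the minimum is the posterior variance:

```python
# A sketch checking Example 5.5 with an assumed Beta(9, 5) posterior:
# expand E((d - theta)^2 | X) = d^2 - 2 d E(theta | X) + E(theta^2 | X)
# exactly as in the display above, and minimise over a grid of d.
import numpy as np

rng = np.random.default_rng(6)
theta = rng.beta(9, 5, size=200_000)  # draws from the assumed posterior
ds = np.linspace(0, 1, 1001)          # candidate decisions d
risk = ds**2 - 2 * ds * theta.mean() + (theta**2).mean()
print(ds[risk.argmin()], theta.mean())  # minimiser ~ posterior mean
print(risk.min(), theta.var())          # minimum ~ posterior variance
```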
5.3 Inference as a decision problem
In the second half of the course we will study inference as a decision problem. In this context we assume that we make a decision \(d\) which acts as an estimate of \(\theta\). The consequence of this decision in a given context can be represented by a specific loss function \(L(\theta, d)\) which measures the quality of the choice \(d\) when \(\theta\) is known. In this setting, decision theory allows us to identify a best decision. As we will see, this approach has two benefits. Firstly, we can form a link between Bayesian and classical procedures, in particular the extent to which classical estimators, confidence intervals and hypothesis tests can be interpreted within a Bayesian framework. Secondly, we can provide Bayesian solutions to the inference questions addressed in a classical approach.