Chapter 4 Some principles of statistical inference
In the first half of the course we shall consider principles for statistical inference. These principles guide the way in which we learn about \(\theta\) and are meant to be either self-evident, or logical implications of principles which are self-evident. In this section we aim to motivate three of these principles: the weak likelihood principle, the strong likelihood principle, and the sufficiency principle. The first two principles relate to the concept of the likelihood and the third to the idea of a sufficient statistic.
4.1 Likelihood
In the model \(\mathcal{E} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\}\), \(f_{X}\) is a function of \(x\) for known \(\theta\). If we have instead observed \(x\) then we could consider viewing this as a function, termed the likelihood, of \(\theta\) for known \(x\). This provides a means of comparing the plausibility of different values of \(\theta\).
Definition 4.1 (Likelihood) The likelihood for \(\theta\) given observations \(x\) is \[\begin{eqnarray*} L_{X}(\theta; x) & = & f_{X}(x \, | \, \theta), \ \ \theta \in \Theta \end{eqnarray*}\] regarded as a function of \(\theta\) for fixed \(x\).
If \(L_{X}(\theta_{1}; x) > L_{X}(\theta_{2}; x)\) then the observed data \(x\) were more likely to occur under \(\theta = \theta_{1}\) than \(\theta_{2}\) so that \(\theta_{1}\) can be viewed as more plausible than \(\theta_{2}\). Note that we choose to make the dependence on \(X\) explicit as the measurement scale affects the numerical value of the likelihood.
Example 4.1 Let \(X = (X_{1}, \ldots, X_{n})\) and suppose that, for given \(\theta = (\alpha, \beta)\), the \(X_{i}\) are independent and identically distributed \(\mbox{Gamma}(\alpha, \beta)\) random variables. Then, \[\begin{eqnarray} f_{X}(x \, | \, \theta) & = & \frac{\beta^{n\alpha}}{\Gamma^{n}(\alpha)} \left(\prod_{i=1}^{n} x_{i}\right)^{\alpha - 1} \exp\left(-\beta \sum_{i=1}^{n} x_{i}\right) \tag{4.1} \end{eqnarray}\] if \(x_{i} > 0\) for each \(i \in \{1, \ldots, n\}\) and zero otherwise. If, for each \(i\), \(Y_{i} = X_{i}^{-1}\) then the \(Y_{i}\) are independent and identically distributed \(\mbox{Inverse-Gamma}(\alpha, \beta)\) random variables with \[\begin{eqnarray*} f_{Y}(y \, | \, \theta) & = & \frac{\beta^{n\alpha}}{\Gamma^{n}(\alpha)} \left(\prod_{i=1}^{n} \frac{1}{y_{i}}\right)^{\alpha +1} \exp\left(-\beta \sum_{i=1}^{n} \frac{1}{y_{i}}\right) \end{eqnarray*}\] if \(y_{i} > 0\) for each \(i \in \{1, \ldots, n\}\) and zero otherwise. Thus, \[\begin{eqnarray*} L_{Y}(\theta; y) & = & \left(\prod_{i=1}^{n} \frac{1}{y_{i}}\right)^{2}L_{X}(\theta; x). \end{eqnarray*}\] If we are interested in inferences about \(\theta = (\alpha, \beta)\) following the observation of the data, then it seems reasonable that these should be invariant to the choice of measurement scale: it should not matter whether \(x\) or \(y\) was recorded.
In the course, we will see that this idea can be developed into an inference principle called the Transformation Principle.
More generally, suppose that \(X\) is a continuous vector random variable and that \(Y = g(X)\) is a one-to-one transformation of \(X\) with non-vanishing Jacobian \(\partial x/\partial y\). Then the probability density function of \(Y\) is \[\begin{eqnarray} f_{Y}(y \, | \, \theta) & = & f_{X}(x \, | \, \theta) \left|\frac{\partial x}{\partial y}\right|, \tag{4.2} \end{eqnarray}\] where \(x = g^{-1}(y)\) and \(| \cdot |\) denotes the absolute value of the determinant. Consequently, as Cox and Hinkley (1974), p12, observe, if we are interested in comparing two possible values of \(\theta\), \(\theta_{1}\) and \(\theta_{2}\) say, using the likelihood then we should consider the ratio of the likelihoods rather than, for example, the difference since \[\begin{eqnarray*} \frac{f_{Y}(y \, | \, \theta = \theta_{1})}{f_{Y}(y \, | \, \theta = \theta_{2})} & = & \frac{f_{X}(x \, | \, \theta = \theta_{1})}{f_{X}(x \, | \, \theta = \theta_{2})} \end{eqnarray*}\] so that the comparison does not depend upon whether the data were recorded as \(x\) or as \(y = g(x)\). It seems reasonable that the proportionality of the likelihoods given by equation (4.2) should lead to the same inference about \(\theta\).
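The invariance of the likelihood ratio, and the lack of invariance of the difference, can be checked numerically for the Gamma and Inverse-Gamma pair of Example 4.1. The following is a minimal sketch: the data values, the two parameter values and the function names are hypothetical choices made purely for illustration, and \(\beta\) is treated as a rate parameter, as in equation (4.1).

```python
import math

def gamma_loglik(alpha, beta, xs):
    """Log of equation (4.1): i.i.d. Gamma(alpha, beta) data, beta a rate."""
    n = len(xs)
    return (n * alpha * math.log(beta) - n * math.lgamma(alpha)
            + (alpha - 1) * sum(math.log(x) for x in xs)
            - beta * sum(xs))

def inv_gamma_loglik(alpha, beta, ys):
    """Log joint density of i.i.d. Inverse-Gamma(alpha, beta) data."""
    n = len(ys)
    return (n * alpha * math.log(beta) - n * math.lgamma(alpha)
            - (alpha + 1) * sum(math.log(y) for y in ys)
            - beta * sum(1.0 / y for y in ys))

xs = [0.8, 1.5, 2.3, 0.4]        # hypothetical observations on the x scale
ys = [1.0 / x for x in xs]       # the same observations recorded as y = 1/x
theta1 = (2.0, 1.0)              # hypothetical (alpha, beta) values to compare
theta2 = (3.0, 2.0)

# Log likelihood ratios on each measurement scale.
log_ratio_x = gamma_loglik(*theta1, xs) - gamma_loglik(*theta2, xs)
log_ratio_y = inv_gamma_loglik(*theta1, ys) - inv_gamma_loglik(*theta2, ys)

# Likelihood differences on each measurement scale.
diff_x = math.exp(gamma_loglik(*theta1, xs)) - math.exp(gamma_loglik(*theta2, xs))
diff_y = math.exp(inv_gamma_loglik(*theta1, ys)) - math.exp(inv_gamma_loglik(*theta2, ys))

# Jacobian factor from Example 4.1: (prod 1/y_i)^2 = (prod x_i)^2.
scale = math.prod(xs) ** 2

print("log ratio (x scale):", log_ratio_x)
print("log ratio (y scale):", log_ratio_y)
print("difference (x scale):", diff_x)
print("difference (y scale):", diff_y)
```

The two log ratios agree up to floating-point error, whereas the two differences are related by the Jacobian factor `scale` and so change with the measurement scale.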
4.1.1 The likelihood principle
Our discussion of the likelihood function suggests that it is the ratio of the likelihoods for differing values of \(\theta\) that should drive our inferences about \(\theta\). In particular, if two likelihoods are proportional for all values of \(\theta\) then the corresponding likelihood ratios for any two values \(\theta_{1}\) and \(\theta_{2}\) are identical. Initially, we consider two outcomes \(x\) and \(y\) from the same model: this gives us our first possible principle of inference.
Definition 4.2 (The weak likelihood principle) If \(X = x\) and \(X = y\) are two observations for the experiment \(\mathcal{E}_{X} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\}\) such that \[\begin{eqnarray*} L_{X}(\theta; y) & = & c(x, y)L_{X}(\theta; x) \end{eqnarray*}\] for all \(\theta \in \Theta\) then the inference about \(\theta\) should be the same irrespective of whether \(X = x\) or \(X = y\) was observed.
A stronger principle can be developed if we consider two random variables \(X\) and \(Y\) corresponding to two different experiments, \(\mathcal{E}_{X} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\}\) and \(\mathcal{E}_{Y} = \{\mathcal{Y}, \Theta, f_{Y}(y \, | \, \theta)\}\) respectively, for the same parameter \(\theta\). Notice that this situation includes the case where \(Y = g(X)\) (see equation (4.2)) but is not restricted to that.
Example 4.2 Consider, given \(\theta\), a sequence of independent Bernoulli trials with parameter \(\theta\). We wish to make inference about \(\theta\) and consider two possible methods. In the first, we carry out \(n\) trials and let \(X\) denote the total number of successes in these trials. Thus, \(X \, | \, \theta \sim Bin(n, \theta)\) with \[\begin{eqnarray*} f_{X}(x \, | \, \theta) & = & \binom{n}{x} \theta^{x}(1- \theta)^{n-x}, \ \ x = 0, 1, \ldots, n. \end{eqnarray*}\] In the second method, we count the total number \(Y\) of trials up to and including the \(r\)th success so that \(Y \, | \, \theta \sim Nbin(r, \theta)\), the negative binomial distribution, with \[\begin{eqnarray*} f_{Y}(y \, | \, \theta) & = & \binom{y-1}{r-1} \theta^{r}(1- \theta)^{y-r}, \ \ y = r, r+1, \ldots. \end{eqnarray*}\] Suppose that we observe \(X = x = r\) and \(Y = y = n\). Then in each experiment we have seen \(x\) successes in \(n\) trials and so it may be reasonable to conclude that we make the same inference about \(\theta\) from each experiment. Notice that in this case \[\begin{eqnarray*} L_{Y}(\theta; y) \ = \ f_{Y}(y \, | \, \theta) & = & \frac{x}{y} f_{X}(x \, | \, \theta) \ = \ \frac{x}{y} L_{X}(\theta; x) \end{eqnarray*}\] so that the likelihoods are proportional.
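The proportionality constant \(x/y\) in Example 4.2 can be verified directly. The sketch below assumes hypothetical values \(n = 12\) and \(x = 3\), and the function names are illustrative only; it evaluates both probability mass functions at several values of \(\theta\) and compares their ratio with \(x/y\).

```python
from math import comb

def binom_pmf(x, n, theta):
    """P(X = x) for X | theta ~ Bin(n, theta)."""
    return comb(n, x) * theta**x * (1 - theta)**(n - x)

def nbinom_pmf(y, r, theta):
    """P(Y = y), where Y is the number of trials up to and including the r-th success."""
    return comb(y - 1, r - 1) * theta**r * (1 - theta)**(y - r)

n, x = 12, 3          # hypothetical: 3 successes observed in 12 trials
r, y = x, n           # the matching negative binomial observation

thetas = (0.1, 0.25, 0.5, 0.9)
ratios = [nbinom_pmf(y, r, t) / binom_pmf(x, n, t) for t in thetas]

for t, ratio in zip(thetas, ratios):
    print(f"theta = {t}: L_Y / L_X = {ratio:.6f}, x / y = {x / y:.6f}")
```

The ratio is the same for every \(\theta\), which is exactly the condition required by a likelihood principle spanning two different experiments.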
Motivated by this example, a second possible principle of inference is a strengthening of the weak likelihood principle.
Definition 4.3 (The strong likelihood principle) Let \(\mathcal{E}_{X}\) and \(\mathcal{E}_{Y}\) be two experiments which have the same parameter \(\theta\). If \(X = x\) and \(Y = y\) are two observations such that \[\begin{eqnarray*} L_{Y}(\theta; y) & = & c(x, y)L_{X}(\theta; x) \end{eqnarray*}\] for all \(\theta \in \Theta\) then the inference about \(\theta\) should be the same irrespective of whether \(X = x\) or \(Y = y\) was observed.
4.2 Sufficient statistics
Consider the model \(\mathcal{E} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\}\). If a sample \(X = x\) is obtained there may be cases when, rather than knowing each individual value of the sample, certain summary statistics suffice to capture all of the relevant information in the sample. This leads to the idea of a sufficient statistic.
Definition 4.4 (Sufficient statistic) A statistic \(S = s(X)\) is sufficient for \(\theta\) if the conditional distribution of \(X\) given the value of \(s(X)\) (and \(\theta\)), \(f_{X | S}(x \, | \, s, \theta)\), does not depend upon \(\theta\).
Note that, in general, \(S\) is a vector and that if \(S\) is sufficient then so is any one-to-one function of \(S\). It should be clear from Definition 4.4 that the sufficiency of \(S\) for \(\theta\) is dependent upon the choice of the family of distributions in the model.
Example 4.3 Let \(X = (X_{1}, \ldots, X_{n})\) and suppose that, for given \(\theta\), the \(X_{i}\) are independent and identically distributed \(\mbox{Po}(\theta)\) random variables. Then \[\begin{eqnarray*} f_{X}(x \, | \, \theta) & = & \prod_{i=1}^{n} \frac{\theta^{x_{i}}\exp (-\theta)}{x_{i}!} \ = \ \frac{\theta^{\sum_{i=1}^{n} x_{i}} \exp (-n \theta)}{\prod_{i=1}^{n} x_{i}!}, \end{eqnarray*}\] if \(x_{i} \in \{0, 1, \ldots\}\) for each \(i \in \{1, \ldots, n\}\) and zero otherwise. Let \(S = \sum_{i=1}^{n} X_{i}\) then \(S \sim \mbox{Po}(n\theta)\) so that \[\begin{eqnarray*} f_{S}(s \, | \, \theta) & = & \frac{(n\theta)^{s} \exp (-n\theta)}{s!} \end{eqnarray*}\] for \(s \in \{0, 1, \ldots\}\) and zero otherwise. Thus, if \(f_{S}(s \, | \, \theta) > 0\) then, as \(s = \sum_{i=1}^{n} x_{i}\), \[\begin{eqnarray*} f_{X | S}(x \, | \, s, \theta) \ = \ \frac{f_{X}(x \, | \, \theta)}{f_{S}(s \, | \, \theta)} \ = \ \frac{(\sum_{i=1}^{n} x_{i})!}{\prod_{i=1}^{n} x_{i}!} n^{-\sum_{i=1}^{n} x_{i}} \end{eqnarray*}\] which does not depend upon \(\theta\). Hence, \(S = \sum_{i=1}^{n} X_{i}\) is sufficient for \(\theta\). Similarly, the sample mean \(\frac{1}{n}S\) is also sufficient.
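The calculation in Example 4.3 can be checked numerically. The sketch below uses a hypothetical sample and illustrative function names; it computes the ratio \(f_{X}(x \, | \, \theta)/f_{S}(s \, | \, \theta)\) for several values of \(\theta\) and compares it with the closed form derived in the example.

```python
from math import exp, factorial, prod

def poisson_joint(xs, theta):
    """Joint pmf of i.i.d. Po(theta) observations xs."""
    return prod(theta**x * exp(-theta) / factorial(x) for x in xs)

def poisson_pmf(s, mean):
    """pmf of a single Po(mean) random variable at s."""
    return mean**s * exp(-mean) / factorial(s)

def conditional_pmf(xs, theta):
    """f_{X|S}(x | s, theta) computed as f_X(x | theta) / f_S(s | theta)."""
    n, s = len(xs), sum(xs)
    return poisson_joint(xs, theta) / poisson_pmf(s, n * theta)

xs = [2, 0, 3, 1]     # hypothetical sample with n = 4 and s = 6

# Closed form from Example 4.3: s! / (prod x_i!) * n^(-s).
closed_form = (factorial(sum(xs)) / prod(factorial(x) for x in xs)
               * len(xs) ** (-sum(xs)))

values = [conditional_pmf(xs, t) for t in (0.5, 1.0, 4.0)]
print("conditional pmf at each theta:", values)
print("closed form:", closed_form)
```

Every entry of `values` matches `closed_form`, which is a multinomial probability with equal cell probabilities \(1/n\) and so carries no information about \(\theta\).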
Sufficiency for a parameter \(\theta\) can be viewed as the idea that \(S\) captures all of the information about \(\theta\) contained in \(X\). Having observed \(S\), nothing further can be learnt about \(\theta\) by observing \(X\) as \(f_{X | S}(x \, | \, s, \theta)\) has no dependence on \(\theta\).
Definition 4.4 is confirmatory rather than constructive: in order to use it we must somehow guess a statistic \(S\), find its distribution and then check that the ratio of the distribution of \(X\) to the distribution of \(S\) does not depend upon \(\theta\). However, the following theorem allows us to find a sufficient statistic easily. For a proof see, for example, p276 of Casella and Berger (2002).
Theorem 4.1 (Fisher-Neyman Factorization) The statistic \(S = s(X)\) is sufficient for \(\theta\) if and only if, for all \(x\) and \(\theta\), \[\begin{eqnarray*} f_{X}(x \, | \, \theta) & = & g(s(x), \theta)h(x) \end{eqnarray*}\] for some pair of functions \(g(s(x), \theta)\) and \(h(x)\).
Example 4.4 We revisit Example 4.1 and the case where the \(X_{i}\) are independent and identically distributed \(\mbox{Gamma}(\alpha, \beta)\) random variables. From equation (4.1) we have \[\begin{eqnarray*} f_{X}(x \, | \, \theta) & = & \frac{\beta^{n\alpha}}{\Gamma^{n}(\alpha)} \left(\prod_{i=1}^{n} x_{i}\right)^{\alpha} \exp\left(-\beta \sum_{i=1}^{n} x_{i}\right)\left(\prod_{i=1}^{n} x_{i}\right)^{ - 1} \\ & = & g\left(\prod_{i=1}^{n} x_{i}, \sum_{i=1}^{n} x_{i}, \theta\right)h(x) \end{eqnarray*}\] so that \(S = \left(\prod_{i=1}^{n} X_{i}, \sum_{i=1}^{n} X_{i}\right)\) is sufficient for \(\theta\).
Notice that \(S\) defines a data reduction. In Example 4.3, \(S = \sum_{i=1}^{n} X_{i}\) is a scalar so that all of the information in the \(n\)-vector \(x = (x_{1}, \ldots, x_{n})\) relating to the scalar \(\theta\) is contained in just one number. In Example 4.4, all of the information in the \(n\)-vector for the two dimensional parameter \(\theta = (\alpha, \beta)\) is contained in just two numbers. Using the Fisher-Neyman Factorization Theorem, we can easily obtain the following result for models drawn from the exponential family.
Theorem 4.2 Let \(X = (X_{1}, \ldots, X_{n})\) and suppose that the \(X_{i}\) are independent and identically distributed from the exponential family of distributions given by \[\begin{eqnarray*} f_{X_{i}}(x_{i} \, | \, \theta) & = & h(x_{i})c(\theta) \exp \left(\sum_{j=1}^{k} a_{j}(\theta) b_{j}(x_{i})\right), \end{eqnarray*}\] where \(\theta = (\theta_{1}, \ldots, \theta_{d})\) for \(d \leq k\). Then \[\begin{eqnarray*} S & = & \left(\sum_{i=1}^{n} b_{1}(X_{i}), \ldots, \sum_{i=1}^{n} b_{k}(X_{i})\right) \end{eqnarray*}\] is a sufficient statistic for \(\theta\).
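A brief sketch of why Theorem 4.2 follows from Theorem 4.1 may be helpful; it is offered as an illustration rather than as a formal proof, and the symbol \(\tilde{h}\) is introduced here only to distinguish the function of the whole sample from the \(h\) in the statement of the theorem. By independence, \[\begin{eqnarray*} f_{X}(x \, | \, \theta) & = & \prod_{i=1}^{n} h(x_{i})c(\theta) \exp \left(\sum_{j=1}^{k} a_{j}(\theta) b_{j}(x_{i})\right) \\ & = & \left\{c(\theta)\right\}^{n} \exp \left(\sum_{j=1}^{k} a_{j}(\theta) \sum_{i=1}^{n} b_{j}(x_{i})\right) \prod_{i=1}^{n} h(x_{i}) \\ & = & g(s(x), \theta)\tilde{h}(x), \end{eqnarray*}\] where \(s(x) = \left(\sum_{i=1}^{n} b_{1}(x_{i}), \ldots, \sum_{i=1}^{n} b_{k}(x_{i})\right)\), \(g(s, \theta) = \{c(\theta)\}^{n} \exp\left(\sum_{j=1}^{k} a_{j}(\theta) s_{j}\right)\) and \(\tilde{h}(x) = \prod_{i=1}^{n} h(x_{i})\). This is of the form required by the Fisher-Neyman Factorization Theorem.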
Example 4.5 The Poisson distribution, see Example 4.3, is a member of the exponential family where \(d = k = 1\) and \(b_{1}(x_{i}) = x_{i}\) giving the sufficient statistic \(S = \sum_{i=1}^{n} X_{i}\). The Gamma distribution, see Example 4.4, is also a member of the exponential family with \(d = k = 2\) and \(b_{1}(x_{i}) = x_{i}\) and \(b_{2}(x_{i}) = \log x_{i}\) giving the sufficient statistic \(S = \left(\sum_{i=1}^{n} X_{i}, \sum_{i=1}^{n} \log X_{i}\right)\) which is equivalent to the pair \(\left(\sum_{i=1}^{n} X_{i}, \prod_{i=1}^{n} X_{i}\right)\).
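To illustrate the data reduction for the Gamma model, the following sketch evaluates the log of equation (4.1) twice: once from the full sample and once from \(n\) and \(S = \left(\sum x_{i}, \sum \log x_{i}\right)\) alone. The two samples and the parameter values are hypothetical, chosen so that both samples share the same sum and the same product; the function names are illustrative only.

```python
import math

def gamma_loglik(alpha, beta, xs):
    """Log of equation (4.1), evaluated from the full sample."""
    n = len(xs)
    return (n * alpha * math.log(beta) - n * math.lgamma(alpha)
            + (alpha - 1) * sum(math.log(x) for x in xs)
            - beta * sum(xs))

def suff_stat(xs):
    """S = (sum of x_i, sum of log x_i), as in Example 4.5."""
    return sum(xs), sum(math.log(x) for x in xs)

def gamma_loglik_from_s(alpha, beta, n, sum_x, sum_log_x):
    """The same log-likelihood, using only n and the sufficient statistic."""
    return (n * alpha * math.log(beta) - n * math.lgamma(alpha)
            + (alpha - 1) * sum_log_x - beta * sum_x)

x = [1.0, 6.0, 6.0]   # hypothetical sample: sum 13, product 36
y = [2.0, 2.0, 9.0]   # a different sample with the same sum and product

thetas = [(0.5, 0.2), (2.0, 1.0), (4.0, 0.7)]   # hypothetical (alpha, beta) values
results = []
for alpha, beta in thetas:
    full_x = gamma_loglik(alpha, beta, x)
    full_y = gamma_loglik(alpha, beta, y)
    via_s = gamma_loglik_from_s(alpha, beta, len(x), *suff_stat(x))
    results.append((full_x, full_y, via_s))
    print(f"alpha={alpha}, beta={beta}: {full_x:.6f} {full_y:.6f} {via_s:.6f}")
```

For each parameter value the three numbers coincide up to floating-point error: the likelihood depends on the data only through \(S\), so the two samples cannot be distinguished by it.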
4.2.1 The sufficiency principle
Following Section 2.2(iii) of Cox and Hinkley (1974), we may interpret sufficiency as follows. Consider two individuals who both assert the model \(\mathcal{E} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\}\). The first individual observes \(x\) directly. The second individual also observes \(x\) but in a two-stage process:
- They first observe a value \(s(x)\) of a sufficient statistic \(S\) with distribution \(f_{S}(s \, | \, \theta)\).
- They then observe the value \(x\) of the random variable \(X\) with distribution \(f_{X | S}(x \, | \, s)\) which does not depend upon \(\theta\).
It may well then be reasonable to argue that, as the final distributions of \(X\) for the two individuals are identical, the conclusions drawn from the observation of a given \(x\) should be identical for the two individuals. That is, they should make the same inference about \(\theta\). For the second individual, when sampling from \(f_{X | S}(x \, | \, s)\) they are sampling from a fixed distribution and so, assuming the correctness of the model, only the first stage is informative: all of the knowledge about \(\theta\) is contained in \(s(x)\). If one takes these two statements together then the inference to be made about \(\theta\) depends only on the value \(s(x)\) and not the individual values \(x_{i}\) contained in \(x\). This leads us to a third possible principle of inference.
Definition 4.5 (The sufficiency principle) If \(S = s(X)\) is a sufficient statistic for \(\theta\) and \(x\) and \(y\) are two observations such that \(s(x) = s(y)\), then the inference about \(\theta\) should be the same irrespective of whether \(X = x\) or \(X = y\) was observed.
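As an illustration of the setting of Definition 4.5, consider the Poisson model of Example 4.3 with two hypothetical samples that share the same total. The sketch below, with illustrative function names, shows that their likelihoods are proportional in \(\theta\), with a constant given by the ratio of the factorial terms.

```python
from math import exp, factorial, prod

def poisson_lik(theta, xs):
    """Likelihood L_X(theta; x) for i.i.d. Po(theta) observations xs."""
    return prod(theta**x * exp(-theta) / factorial(x) for x in xs)

x = [1, 2, 3]     # hypothetical sample with s(x) = 6
y = [0, 0, 6]     # a different sample with s(y) = 6

thetas = [0.5, 1.0, 2.0, 5.0]
ratios = [poisson_lik(t, y) / poisson_lik(t, x) for t in thetas]

# From the factorisation, the constant is prod x_i! / prod y_i!.
expected = prod(factorial(v) for v in x) / prod(factorial(v) for v in y)

print("L(theta; y) / L(theta; x) at each theta:", ratios)
print("prod x_i! / prod y_i!:", expected)
```

Since the ratio does not vary with \(\theta\), the two observations satisfy the proportionality condition of Definition 4.2 as well as the condition \(s(x) = s(y)\) of Definition 4.5.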