Chapter 4 Confidence sets and p-values
4.1 Confidence procedures and confidence sets
We consider interval estimation, or more generally set estimation. Consider the model \(\mathcal{E} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\}\). For given data \(X = x\), we wish to construct a set \(C = C(x) \subset \Theta\) and the inference is the statement that \(\theta \in C\). If \(\theta \in \mathbb{R}\) then the set estimate is typically an interval. As Casella and Berger (2002) note in Section 9.1, the goal of a set estimator is to have some guarantee of capturing the parameter of interest. With this in mind, we make the following definition.
Definition 4.1 (Confidence procedure) A random set \(C(X)\) is a level-\((1-\alpha)\) confidence procedure exactly when \[\begin{eqnarray*} \mathbb{P}(\theta \in C(X) \, | \, \theta) & \geq & 1 - \alpha \end{eqnarray*}\] for all \(\theta \in \Theta\). \(C\) is an exact level-\((1-\alpha)\) confidence procedure exactly when the probability equals \((1-\alpha)\) for all \(\theta\).
Thus, exact is a special case and typically \(\mathbb{P}(\theta \in C(X) \, | \, \theta)\) will depend upon \(\theta\). The value \(\mathbb{P}(\theta \in C(X) \, | \, \theta)\) is termed the coverage of \(C\) at \(\theta\). Thus a \(95\%\) confidence procedure has coverage of at least \(95\%\) for all \(\theta\), and an exact \(95\%\) confidence procedure has coverage of exactly \(95\%\) for all \(\theta\). If it is necessary to emphasise that \(C\) is not exact, then the term conservative is used.
Example 4.1 Let \(X_{1}, \ldots, X_{n}\) be independent and identically distributed \(\mbox{Unif}(0, \theta)\) random variables where \(\theta > 0\). Let \(Y = \max\{X_{1}, \ldots, X_{n}\}\). For observed \(x_{1}, \ldots, x_{n}\), we have that \(\theta > y\). Noting that \(X_{i}/\theta \sim \mbox{Unif}(0, 1)\) then if \(T = Y/\theta\) we have that \(\mathbb{P}(T \leq t) = t^{n}\) for \(0 \leq t \leq 1\). We consider two possible sets: \((aY, bY)\) where \(1 \leq a < b\) and \((Y + c, Y + d)\) where \(0 \leq c < d\). Notice that \[\begin{eqnarray*} \mathbb{P}(\theta \in (aY, bY) \, | \, \theta) &= & \mathbb{P}(aY < \theta < bY \, | \, \theta) \\ & = & \mathbb{P}(b^{-1} < T < a^{-1} \, | \, \theta) \\ & = & \left(\frac{1}{a}\right)^{n} - \left(\frac{1}{b}\right)^{n}. \end{eqnarray*}\] Thus, the coverage probability of the interval does not depend upon \(\theta\). However, \[\begin{eqnarray*} \mathbb{P}(\theta \in (Y + c, Y + d) \, | \, \theta) &= & \mathbb{P}(Y +c < \theta < Y+d \, | \, \theta) \\ & = & \mathbb{P}\left(1 - \frac{d}{\theta} < T < 1 - \frac{c}{\theta} \, | \, \theta \right) \\ & = & \left(1 - \frac{c}{\theta}\right)^{n} - \left(1 - \frac{d}{\theta}\right)^{n}. \end{eqnarray*}\] In this case, the coverage probability of the interval does depend upon \(\theta\).
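The two coverage formulae are easy to check by simulation. The following is a minimal sketch (assuming Python with numpy); the values of \(n\), \(a\), \(b\), \(c\) and \(d\) are illustrative choices only.

```python
import numpy as np

# Monte Carlo check of the coverage formulae in Example 4.1.
rng = np.random.default_rng(1)
n, reps = 10, 100_000
a, b = 1.0, 1.5          # interval (aY, bY), with 1 <= a < b
c, d = 0.05, 0.50        # interval (Y + c, Y + d), with 0 <= c < d

for theta in (1.0, 2.0, 5.0):
    y = rng.uniform(0.0, theta, size=(reps, n)).max(axis=1)
    cover_mult = np.mean((a * y < theta) & (theta < b * y))
    cover_add = np.mean((y + c < theta) & (theta < y + d))
    # Theory: (1/a)^n - (1/b)^n, free of theta, versus
    #         (1 - c/theta)^n - (1 - d/theta)^n, which depends on theta.
    print(theta, round(cover_mult, 3), round(cover_add, 3))
```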
It is helpful to distinguish between the confidence procedure \(C\), which is a random set, i.e. a function defined for each possible \(x\), and the result when \(C\) is evaluated at the observation \(x\), which is a set in \(\Theta\). We follow the terms used in Morey et al. (2016), which we will later adapt to \(p\)-values; see, for example, Definition 4.8.
Definition 4.2 (Confidence set) The observed \(C(x)\) is a level-\((1-\alpha)\) confidence set exactly when the random \(C(X)\) is a level-\((1-\alpha)\) confidence procedure.
If \(\Theta \subset \mathbb{R}\) and \(C(x)\) is convex, i.e. an interval, then a confidence set (interval) is represented by a lower and upper value. We should write, for example, “using procedure \(C\), the \(95\%\) confidence interval for \(\theta\) is \((0.55, 0.74)\)”, inserting “exact” if the confidence procedure \(C\) is exact.
The challenge with confidence procedures is to construct one with a specified level. One could propose an arbitrary \(C\), and then laboriously compute the coverage for every \(\theta \in \Theta\). At that point we would know the level of \(C\) as a confidence procedure, but it is unlikely to be \(95\%\); adjusting \(C\) and iterating this procedure many times until the minimum coverage was equal to \(95\%\) would be exceedingly tedious. So we need to go backwards: start with the level, e.g. \(95\%\), then construct a \(C\) guaranteed to have this level. With this in mind, we can generalise Definition 4.1.
Definition 4.3 (Family of confidence procedures) \(C(X; \alpha)\) is a family of confidence procedures exactly when \(C(X; \alpha)\) is a level-\((1-\alpha)\) confidence procedure for every \(\alpha \in [0, 1]\). \(C\) is a nesting family exactly when \(\alpha < \alpha'\) implies that \(C(x; \alpha') \subset C(x; \alpha)\).
If we start with a family of confidence procedures for a specified model, then we can compute a confidence set for any level we choose.
4.2 Constructing confidence procedures
The general approach to constructing a confidence procedure is to invert a test statistic. In Example 4.1, the coverage of the procedure \((aY, bY)\) does not depend upon \(\theta\) because the coverage probability can be expressed in terms of \(T = Y/\theta\), whose distribution does not depend upon \(\theta\). \(T\) is an example of a pivot. As Example 4.1 shows, confidence procedures are straightforward to compute from a pivot. However, a drawback to this approach is that, in general, there is no hard and fast method for finding a pivot.
An alternative method, which does work generally, is to exploit the property that every confidence procedure corresponds to a hypothesis test and vice versa. Consider a hypothesis test where we have to decide either to accept that an hypothesis \(H_{0}\) is true or to reject \(H_{0}\) in favour of an alternative hypothesis \(H_{1}\) based on a sample \(x \in \mathcal{X}\). The set of \(x\) for which \(H_{0}\) is rejected is called the rejection region, with its complement, where \(H_{0}\) is accepted, the acceptance region. A hypothesis test can be constructed from any statistic \(T = T(X)\); one popular choice, which is optimal in some cases, is the likelihood ratio test.
Definition 4.4 (Likelihood Ratio Test, LRT) The likelihood ratio test (LRT) statistic for testing \(H_{0}: \theta \in \Theta_{0}\) versus \(H_{1}: \theta \in \Theta_{0}^{c}\), where \(\Theta_{0} \cup \Theta_{0}^{c} = \Theta\), is \[\begin{eqnarray} \lambda(x) & = & \frac{\sup_{\theta \in \Theta_{0}}L_{X}(\theta; x)}{\sup_{\theta \in \Theta} L_{X}(\theta; x)}. \tag{4.1} \end{eqnarray}\] A LRT at significance level \(\alpha\) has a rejection region of the form \(\{x \, : \, \lambda(x) \leq c\}\) where \(0 \leq c \leq 1\) is chosen so that \(\mathbb{P}(\mbox{Reject } H_{0} \, | \, \theta) \leq \alpha\) for all \(\theta \in \Theta_{0}\).
We consider the following example.
Example 4.2 Let \(X = (X_{1}, \ldots, X_{n})\) and suppose that the \(X_{i}\) are independent and identically distributed \(N(\theta, \sigma^{2})\) random variables where \(\sigma^{2}\) is known and consider the likelihood ratio test for \(H_{0}: \theta = \theta_{0}\) versus \(H_{1}: \theta \neq \theta_{0}\). Then, as the maximum likelihood estimate of \(\theta\) is \(\overline{x}\), \[\begin{eqnarray*} \lambda(x) & = & \frac{L_{X}(\theta_{0}; x)}{L_{X}(\overline{x}; x)} \\ & = & \exp\left\{-\frac{1}{2\sigma^{2}} \sum_{i=1}^{n} \left( (x_{i} - \theta_{0})^{2} - (x_{i} - \overline{x})^{2}\right)\right\} \\ & = & \exp\left\{-\frac{1}{2\sigma^{2}} n(\overline{x} - \theta_{0})^{2}\right\}. \end{eqnarray*}\] Notice that, under \(H_{0}\), \(\frac{\sqrt{n}(\overline{X} - \theta_{0})}{\sigma} \sim N(0, 1)\) so that \[\begin{eqnarray} -2\log \lambda(X) \ = \ \frac{n(\overline{X} - \theta_{0})^{2}}{\sigma^{2}} \sim \chi^{2}_{1}, \tag{4.2} \end{eqnarray}\] the chi-squared distribution with one degree of freedom. Letting \(\chi^{2}_{1, \alpha}\) be such that \(\mathbb{P}(\chi^{2}_{1} \geq \chi^{2}_{1, \alpha}) = \alpha\) then, as the rejection region \(\{x \, : \, \lambda(x) \leq c\}\) corresponds to \(\{x \, : \, -2 \log \lambda(x) \geq k\}\) where \(k = -2 \log c\), setting \(k = \chi^{2}_{1, \alpha}\) gives a test at the (exact) significance level \(\alpha\). The corresponding acceptance region of this test is \(\{x \, : \, -2 \log \lambda(x) < \chi^{2}_{1, \alpha}\}\) where \[\begin{eqnarray} \mathbb{P}\left(\left.\frac{n(\overline{X} - \theta_{0})^{2}}{\sigma^{2}} < \chi^{2}_{1, \alpha} \, \right| \, \theta = \theta_{0} \right) & = & 1- \alpha. \tag{4.3} \end{eqnarray}\] This holds for all \(\theta_{0}\) and so, additionally rearranging, \[\begin{eqnarray} \mathbb{P}\left(\left. \overline{X} - \sqrt{\chi^{2}_{1, \alpha}} \frac{\sigma}{\sqrt{n}} < \theta < \overline{X} + \sqrt{\chi^{2}_{1, \alpha}} \frac{\sigma}{\sqrt{n}} \, \right| \, \theta \right) & = & 1- \alpha. \tag{4.4} \end{eqnarray}\] Thus, \(C(X) = (\overline{X} - \sqrt{\chi^{2}_{1, \alpha}} \frac{\sigma}{\sqrt{n}}, \overline{X} + \sqrt{\chi^{2}_{1, \alpha}} \frac{\sigma}{\sqrt{n}})\) is an exact level-\((1-\alpha)\) confidence procedure with \(C(x)\) the corresponding confidence set. Noting that \(\sqrt{\chi^{2}_{1, \alpha}} = z_{\alpha/2}\), where \(z_{\alpha/2}\) is such that \(\mathbb{P}(Z \geq z_{\alpha/2}) = \alpha/2\) for \(Z \sim N(0, 1)\), this confidence set is more familiarly written as \(C(x) = (\overline{x} - z_{\alpha/2} \frac{\sigma}{\sqrt{n}}, \overline{x} + z_{\alpha/2} \frac{\sigma}{\sqrt{n}})\).
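A short simulation confirms both that \(\sqrt{\chi^{2}_{1, \alpha}} = z_{\alpha/2}\) and that the procedure is exact; this is a sketch with illustrative values of \(n\), \(\sigma\), \(\theta\) and \(\alpha\).

```python
import numpy as np
from scipy.stats import chi2, norm

rng = np.random.default_rng(2)
n, sigma, theta, alpha = 25, 2.0, 1.0, 0.05

half_width = np.sqrt(chi2.ppf(1 - alpha, df=1)) * sigma / np.sqrt(n)
# sqrt of the upper-alpha chi-squared(1) point equals the normal quantile z_{alpha/2}
assert np.isclose(np.sqrt(chi2.ppf(1 - alpha, df=1)), norm.ppf(1 - alpha / 2))

# Coverage of C(X) = (Xbar - half_width, Xbar + half_width) at the true theta.
xbar = rng.normal(theta, sigma, size=(200_000, n)).mean(axis=1)
print(np.mean((xbar - half_width < theta) & (theta < xbar + half_width)))  # ~ 0.95
```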
The level-\((1-\alpha)\) confidence procedure defined by equation (4.4) is obtained by inverting the acceptance region, see equation (4.3), of the level \(\alpha\) significance test. This correspondence between acceptance regions of tests and confidence sets is a general property.
Theorem 4.1 (Duality of Acceptance Regions and Confidence Sets) Firstly, for each \(\theta_{0} \in \Theta\), let \(A(\theta_{0})\) be the acceptance region of a test of \(H_{0}: \theta = \theta_{0}\) at significance level \(\alpha\). For each \(x \in \mathcal{X}\), define \(C(x) = \{\theta_{0} \, : \, x \in A(\theta_{0})\}\). Then \(C(X)\) is a level-\((1-\alpha)\) confidence procedure. Secondly, let \(C(X)\) be a level-\((1-\alpha)\) confidence procedure and, for any \(\theta_{0} \in \Theta\), define \(A(\theta_{0}) = \{ x \, : \, \theta_{0} \in C(x)\}\). Then \(A(\theta_{0})\) is the acceptance region of a test of \(H_{0}: \theta = \theta_{0}\) at significance level \(\alpha\).
Proof: Firstly, as we have a level \(\alpha\) test for each \(\theta_{0} \in \Theta\) then \(\mathbb{P}( X \in A(\theta_{0}) \, | \, \theta = \theta_{0}) \geq 1- \alpha\). Since \(\theta_{0}\) is arbitrary we may write \(\theta\) instead of \(\theta_{0}\) and so, for all \(\theta \in \Theta\), \[\begin{eqnarray*} \mathbb{P}(\theta \in C(X) \, | \, \theta) \ = \ \mathbb{P}(X \in A(\theta) \, | \, \theta) \ \geq 1 - \alpha. \end{eqnarray*}\] Hence, from Definition 4.1, \(C(X)\) is a level-\((1-\alpha)\) confidence procedure. Secondly, for a test of \(H_{0}: \theta = \theta_{0}\), the probability of a Type I error (rejecting \(H_{0}\) when it is true) is \[\begin{eqnarray*} \mathbb{P}(X \notin A(\theta_{0}) \, | \, \theta = \theta_{0}) \ = \ \mathbb{P}(\theta_{0} \notin C(X) \, | \, \theta = \theta_{0}) \ \leq \ \alpha \end{eqnarray*}\] since \(C(X)\) is a level-\((1-\alpha)\) confidence procedure. Hence, we have a test at significance level \(\alpha\). \(\Box\)
A possibly easier way to understand the relationship between significance tests and confidence sets is to consider a single set \(\tilde{C} \subset \mathcal{X} \times \Theta\).
- For fixed \(x\), we may define the confidence set as \(C(x) = \{\theta \, : \, (x, \theta) \in \tilde{C}\}\).
- For fixed \(\theta\), we may define the acceptance region as \(A(\theta) = \{x \, : \, (x, \theta) \in \tilde{C}\}\).
Example 4.3 We revisit Example 4.2 and, recalling that \(x = (x_{1}, \ldots, x_{n})\), define the set \[\begin{eqnarray*} \{(x, \theta) \, : \, (x, \theta) \in \tilde{C}\} & = & \left\{(x, \theta) \, : \, -z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \overline{x} - \theta < z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right\}. \end{eqnarray*}\] The confidence set is then \[C(x) = \left\{\theta \, : \, -z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \overline{x} - \theta < z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right\} = \left\{\theta \, : \, \overline{x} -z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \theta < \overline{x} + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right\}\] and acceptance region \[A(\theta) = \left\{x \, : \, -z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \overline{x} - \theta < z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right\} = \left\{x \, : \, \theta -z_{\alpha/2}\frac{\sigma}{\sqrt{n}} < \overline{x} < \theta + z_{\alpha/2}\frac{\sigma}{\sqrt{n}}\right\}.\]
4.3 Good choices of confidence procedures
Section 3.6 made a recommendation about set estimators for \(\theta\), which was that they should be based on level sets of \(f_{X}(x \, | \, \theta)\). This was to satisfy a necessary condition to be admissible under the loss function (3.6). With this in mind, a good choice of confidence procedure would be one that satisfied a level set property.
Definition 4.5 (Level set property, LSP) A confidence procedure \(C\) has the level set property exactly when \[\begin{eqnarray*} C(x) & = & \{\theta \, : \, f_{X}(x \, | \, \theta) > g(x)\} \end{eqnarray*}\] for some \(g : \mathcal{X} \rightarrow \mathbb{R}\).
We now show that we can construct a family of confidence procedures with the LSP. The result has pedagogic value, because it can be used to generate an uncountable number of families of confidence procedures, each with the LSP.
Theorem 4.2 Let \(h\) be any probability density function for \(X\). Then \[\begin{eqnarray} C_{h}(x; \alpha) & := & \{ \theta \in \Theta \, : \, f_{X}(x \, | \, \theta) > \alpha h(x)\} \tag{4.5} \end{eqnarray}\] is a family of confidence procedures, with the LSP.
Proof: First notice that if we let \(\mathcal{X}(\theta) := \{x \in \mathcal{X} \, : \, f_{X}(x \, | \, \theta) > 0\}\) then \[\begin{eqnarray} \mathbb{E}(h(X) / f_{X}(X \, | \, \theta) \, | \, \theta) & = & \int_{x \in \mathcal{X}(\theta)} \frac{ h(x) }{ f_{X}(x \, | \, \theta) }f_{X}(x \, | \, \theta) \, dx \nonumber\\ & = & \int_{x \in \mathcal{X}(\theta)} h(x) \, dx \nonumber\\ & \leq & 1 \tag{4.6} \end{eqnarray}\] because \(h\) is a probability density function. Now, \[\begin{eqnarray} \mathbb{P}(f_{X}(X \, | \,\theta) / h(X) \leq u \, | \, \theta) & = & \mathbb{P}(h(X) / f_{X}(X \, | \, \theta) \geq 1 / u \, | \, \theta) \tag{4.7} \\ & \leq & \frac{\mathbb{E}( h(X)/ f_{X}(X \, | \, \theta) \, | \, \theta) }{ 1 / u } \tag{4.8} \\ & \leq & \frac{ 1 }{ 1 / u } = u \tag{4.9} \end{eqnarray}\] where (4.8) follows from (4.7) by Markov’s inequality: if \(X\) is a nonnegative random variable and \(a > 0\) then \(\mathbb{P}(X \geq a) \leq \mathbb{E}(X)/a\). Equation (4.9) follows from (4.8) by (4.6). Hence, for each \(\alpha\), \[\begin{eqnarray*} \mathbb{P}(\theta \notin C_{h}(X; \alpha) \, | \, \theta) \ = \ \mathbb{P}(f_{X}(X \, | \, \theta) / h(X) \leq \alpha \, | \, \theta) \ \leq \ \alpha, \end{eqnarray*}\] so that \(C_{h}(\,\cdot\,; \alpha)\) has coverage at least \(1 - \alpha\) for every \(\theta\); the LSP holds with \(g(x) = \alpha h(x)\). \(\Box\)
Notice that if we define \(g(x, \theta) := f_{X}(x \, | \, \theta) / h(x)\), which may be \(\infty\), then the proof shows that \(\mathbb{P}(g(X, \theta) \leq u \, | \, \theta) \leq u\). As we will see in Definition 4.7, this means that \(g(X, \theta)\) is super-uniform for each \(\theta\).
Among the interesting choices for \(h\), one possibility is \(h(x) = f_{X}(x \, | \, \theta_{0})\), for some \(\theta_{0} \in \Theta\). Note that with this choice, the confidence set of equation (4.5) always contains \(\theta_{0}\). So we know that we can construct a level-\((1-\alpha)\) confidence procedure whose confidence sets will always contain \(\theta_{0}\). Two statisticians can both construct \(95\%\) confidence sets for \(\theta\) which satisfy the LSP, using different families of confidence procedures. Yet the first statistician may reject the null hypothesis \(H_0 : \theta = \theta_0\) (see Section 3.7) and the second statistician may fail to reject it, for any \(\theta_0 \in \Theta\). This does not fill one with confidence about using confidence procedures for hypothesis tests.
Actually, the situation is not as grim as it seems. Markov’s inequality is very slack, and so the coverage of the family of confidence procedures defined in Theorem 4.2 is likely to be much larger than \((1-\alpha)\), e.g. much larger than \(95\%\).
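Both the construction of Theorem 4.2 and this slackness are easy to see numerically. The sketch below uses an assumed toy model, a single observation \(X \sim N(\theta, 1)\), with \(h\) taken to be the \(N(0, 3^{2})\) density; the estimated coverage at \(\alpha = 0.05\) is far above \(95\%\).

```python
import numpy as np
from scipy.stats import norm

# Toy model (an illustrative assumption): one observation X ~ N(theta, 1),
# with h chosen to be the N(0, 3^2) density.
rng = np.random.default_rng(3)
alpha = 0.05

def in_C_h(theta, x):
    # theta lies in C_h(x; alpha) exactly when f_X(x | theta) > alpha * h(x)
    return norm.pdf(x, loc=theta, scale=1.0) > alpha * norm.pdf(x, loc=0.0, scale=3.0)

# Estimated coverage at a few theta values: well above 1 - alpha = 0.95,
# because Markov's inequality is slack, so the procedure is conservative.
for theta in (0.0, 2.0, 5.0):
    x = rng.normal(theta, 1.0, size=200_000)
    print(theta, round(np.mean(in_C_h(theta, x)), 4))
```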
Note that the diameter of a set in a metric space, such as Euclidean space, is the supremum of the distances between pairs of points in the set.
For any confidence procedure, the diameter of \(C(x)\) can grow rapidly with its coverage. In fact, the relation must be extremely convex when coverage is nearly one, because, in the case where \(\Theta = \mathbb{R}\), the diameter at 100% coverage is unbounded. So an increase in the coverage from, say 95% to 99%, could easily correspond to a doubling or more of the diameter of the confidence procedure.
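Even in the best-behaved case, the exact interval of Example 4.2, the growth accelerates sharply near full coverage and is unbounded in the limit; a quick sketch, taking \(\sigma/\sqrt{n} = 1\) as an illustrative value:

```python
from scipy.stats import norm

# Diameter of the exact normal-mean interval of Example 4.2 as coverage grows;
# the increase per unit of extra coverage accelerates sharply near one.
sigma_over_sqrt_n = 1.0
for coverage in (0.90, 0.95, 0.99, 0.999, 0.99999, 0.9999999):
    alpha = 1.0 - coverage
    diameter = 2.0 * norm.ppf(1.0 - alpha / 2.0) * sigma_over_sqrt_n
    print(f"{coverage:.7f}  {diameter:.2f}")
```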
A more likely outcome in the two statisticians situation is that \(C_{h}(x; 0.05)\) is large for many different choices of \(h\), in which case no one rejects the null hypothesis, which is not a useful outcome for a hypothesis test. But perhaps it is a useful antidote to the current “crisis of reproducibility”, in which far too many null hypotheses are being rejected in published papers.
All in all, it would be much better to use an exact family of confidence procedures which satisfies the LSP, if one existed. And, for perhaps the most popular model in the whole of Statistics, this is the case: the linear model with normal errors.
4.3.1 The linear model
We briefly discuss the linear model and, in what can be viewed as an extension of Example 4.2, consider constructing a confidence procedure using the likelihood ratio. Wood (2017) is a recommended textbook discussion of the whole (generalised) theory.
Note that we typically use \(X\) to denote a generic random variable. Whilst it is not ideal, it is standard notation for linear models to use \(X\) to denote a specified matrix.
Let \(Y = (Y_{1}, \ldots, Y_{n})\) be an \(n\)-vector of observables with \(Y = X\theta + \epsilon\), where \(X\) is a specified \((n \times p)\) matrix of regressors, \(\theta\) is a \(p\)-vector of regression coefficients, and \(\epsilon\) is an \(n\)-vector of residuals, so that the mean vector is \(\mu = X\theta\). Assume that \(\epsilon \sim N_{n}(0, \sigma^{2}I_{n})\), the \(n\)-dimensional multivariate normal distribution, where \(\sigma^{2}\) is known.
We will utilise the following two properties of the multivariate normal distribution.
Theorem 4.3 (Properties of the multivariate normal distribution) Let \(W = (W_{1}, \ldots, W_{k})\) with \(W \sim N_{k}(\mu, \Sigma)\), the \(k\)-dimensional multivariate normal distribution with mean vector \(\mu\) and variance matrix \(\Sigma\). Firstly, if \(Y = AW + c\), where \(A\) is any \((l \times k)\) matrix and \(c\) any \(l\)-dimensional vector, then \(Y \sim N_{l}(A \mu + c, A \Sigma A^{T})\). Secondly, if \(\Sigma > 0\) then \(Y = \Sigma^{-\frac{1}{2}}(W - \mu) \sim N_{k}(0, \mbox{I}_{k})\), where \(\mbox{I}_{k}\) is the \((k \times k)\) identity matrix, and \((W - \mu)^{T} \Sigma^{-1}(W - \mu) = \sum_{i=1}^{k} Y_{i}^{2} \sim \chi^{2}_{k}\).
Proof: See for example, Theorem 3.2.1 and Corollary 3.2.1.1 of Mardia, Kent, and Bibby (1979). \(\Box\)
It is thus immediate from the first property that for the linear model, \(Y \sim N_{n}(\mu, \sigma^{2}I_{n})\) where \(\mu = X\theta\). Now, \[\begin{eqnarray} L_{Y}(\theta; y) & = & \left(2\pi \sigma^{2}\right)^{-\frac{n}{2}} \exp\left\{-\frac{1}{2\sigma^{2}} (y - X\theta)^{T}(y-X\theta)\right\}. \tag{4.10} \end{eqnarray}\] Let \(\hat{\theta} = \hat{\theta}(y) = \left(X^{T}X\right)^{-1}X^{T}y\) then \[\begin{eqnarray} (y - X\theta)^{T}(y-X\theta) & = & (y-X\hat{\theta} + X\hat{\theta} - X\theta)^{T}(y-X\hat{\theta} + X\hat{\theta} - X\theta) \nonumber \\ & = & (y-X\hat{\theta})^{T}(y-X\hat{\theta}) + (X\hat{\theta} - X\theta)^{T}(X\hat{\theta} - X\theta) \nonumber \\ & = & (y-X\hat{\theta})^{T}(y-X\hat{\theta}) + (\hat{\theta} - \theta)^{T}X^{T}X(\hat{\theta} - \theta). \tag{4.11} \end{eqnarray}\] Thus, \((y - X\theta)^{T}(y-X\theta)\) is minimised when \(\theta = \hat{\theta}\) and so, from equation (4.10), \(\hat{\theta} = \left(X^{T}X\right)^{-1}X^{T}y\) is the maximum likelihood estimator of \(\theta\). From equation (4.1), we can calculate the likelihood ratio \[\begin{eqnarray} \lambda(y) & = & \frac{L_{Y}(\theta; y)}{L_{Y}(\hat{\theta}; y)}\nonumber \\ & = & \exp\left\{-\frac{1}{2\sigma^{2}} \left[(y - X\theta)^{T}(y-X\theta) - (y - X\hat{\theta})^{T}(y-X\hat{\theta})\right]\right\} \nonumber\\ & = & \exp\left\{-\frac{1}{2\sigma^{2}}(\hat{\theta} - \theta)^{T}X^{T}X(\hat{\theta} - \theta)\right\}, \tag{4.12} \end{eqnarray}\] where equation (4.12) follows from equation (4.11). Thus, \[\begin{eqnarray*} -2\log \lambda(y) & = & \frac{1}{\sigma^{2}}(\hat{\theta} - \theta)^{T}X^{T}X(\hat{\theta} - \theta). \end{eqnarray*}\] Now, as \(\hat{\theta}(Y) = \left(X^{T}X\right)^{-1}X^{T}Y\) then, from the first property of Theorem 4.3, \[\hat{\theta}(Y) \sim N_{p}\left(\theta, \sigma^{2}\left(X^{T}X\right)^{-1}\right)\] so that, from the second property of Theorem 4.3, \(-2\log \lambda(Y) \sim \chi_{p}^{2}\). Hence, with \(\mathbb{P}(\chi^{2}_{p} \geq \chi^{2}_{p, \alpha}) = \alpha\), \[\begin{eqnarray} C(y; \alpha) & = & \left\{\theta \in \mathbb{R}^{p} \, : \, -2\log\lambda(y) \ = \ -2\log\frac{f_{Y}(y \, | \, \theta, \sigma^{2})}{f_{Y}(y \, | \, \hat{\theta}, \sigma^{2})} < \chi^{2}_{p, \alpha}\right\} \nonumber\\ & = & \left\{\theta \in \mathbb{R}^{p} \, : \, f_{Y}(y \, | \, \theta, \sigma^{2}) > \exp\left(-\frac{\chi^{2}_{p, \alpha}}{2}\right)f_{Y}(y \, | \, \hat{\theta}, \sigma^{2})\right\} \tag{4.13} \end{eqnarray}\] is a family of exact confidence procedures for \(\theta\) which has the LSP.
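A simulation check of the exactness of (4.13) is straightforward; the design matrix, \(\theta\) and \(\sigma\) below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

# Coverage check for the exact region (4.13) under an assumed design.
rng = np.random.default_rng(4)
n, p, sigma, alpha = 50, 3, 1.5, 0.05
X = rng.normal(size=(n, p))
theta = np.array([1.0, -2.0, 0.5])
XtX = X.T @ X
crit = chi2.ppf(1 - alpha, df=p)

reps, hits = 20_000, 0
for _ in range(reps):
    y = X @ theta + rng.normal(0.0, sigma, size=n)
    theta_hat = np.linalg.solve(XtX, X.T @ y)
    # -2 log lambda(y) = (theta_hat - theta)^T X^T X (theta_hat - theta) / sigma^2
    stat = (theta_hat - theta) @ XtX @ (theta_hat - theta) / sigma**2
    hits += stat < crit
print(hits / reps)   # close to 1 - alpha = 0.95: the procedure is exact
```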
4.3.2 Wilks confidence procedures
This outcome where we can find a family of exact confidence procedures with the LSP is more-or-less unique to the regression parameters of the linear model but it is found, approximately, in the large \(n\) behaviour of a much wider class of models. The result can be traced back to Wilks (1938) and, as such, the resultant confidence procedures are often termed Wilks confidence procedures.
Theorem 4.4 (Wilks Theorem) Let \(X = (X_{1}, \ldots, X_{n})\) where each \(X_{i}\) is independent and identically distributed, \(X_{i} \sim f(x_{i} \, | \, \theta)\), where \(f\) is a regular model and the parameter space \(\Theta\) is an open convex subset of \(\mathbb{R}^{p}\) (and invariant to \(n\)). Then, for any \(\theta_{0} \in \Theta\), under \(\theta = \theta_{0}\) the distribution of the statistic \(-2\log \lambda(X)\) for testing \(H_{0}: \theta = \theta_{0}\) converges to a chi-squared distribution with \(p\) degrees of freedom as \(n \rightarrow \infty\).
The definition of “regular model” is quite technical, but a working guideline is that \(f\) must be smooth and differentiable in \(\theta\); in particular, the support must not depend on \(\theta\). Chapter 6 of Cox (2006) provides a summary of this result and others like it, and more details can be found in Chapter 10 of Casella and Berger (2002) or, for the full story, in Vaart (1998). Analogous to equation (4.13), we thus have that if the conditions of Theorem 4.4 are met, \[\begin{eqnarray} C(x; \alpha) & = & \left\{\theta \in \mathbb{R}^{p} \, : \, f_{X}(x \, | \, \theta) > \exp\left(-\frac{\chi^{2}_{p, \alpha}}{2}\right)f_{X}(x \, | \, \hat{\theta})\right\} \tag{4.14} \end{eqnarray}\] is a family of approximately exact confidence procedures which satisfy the LSP. The pertinent question, as always with methods based on asymptotic properties for particular types of model, is whether the approximation is a good one. The crucial concept here is level error. The coverage that we want is at least \((1 - \alpha)\) everywhere, which is termed the “nominal level”. But were we to evaluate a confidence procedure such as (4.14) for a general model (not a linear model) we would find, over all \(\theta \in \Theta\), that the minimum coverage was not \((1 - \alpha)\) but something else; usually something less than \((1 - \alpha)\). This is the “actual level”. The difference is \[\begin{eqnarray*} \mbox{level error} & = & \mbox{nominal level} - \mbox{actual level}. \end{eqnarray*}\] Level error exists because the conditions under which (4.14) provides an exact confidence procedure are not met in practice, outside the linear model. Although it is tempting to ignore level error, experience suggests that it can be large, and that we should attempt to correct for level error if we can. One method for making this correction is bootstrap calibration, described in DiCiccio and Efron (1996).
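As a sketch of how the actual level might be estimated by simulation, consider an assumed example of independent and identically distributed Exponential data with rate \(\theta\), for which the Wilks region (4.14) with \(p = 1\) is only asymptotically exact.

```python
import numpy as np
from scipy.stats import chi2

# Monte Carlo estimate of the actual coverage of the Wilks region (4.14)
# for i.i.d. Exponential(theta) data (rate parameterisation), an assumed example.
rng = np.random.default_rng(5)
alpha, theta, reps = 0.05, 2.0, 100_000
crit = chi2.ppf(1 - alpha, df=1)

for n in (5, 10, 50, 200):
    # xbar is the mean of n Exp(theta) draws: the n-fold sum is Gamma(n, 1/theta).
    xbar = rng.gamma(shape=n, scale=1.0 / theta, size=reps) / n
    theta_hat = 1.0 / xbar                    # maximum likelihood estimate
    w = 2.0 * n * (np.log(theta_hat / theta) - 1.0 + theta / theta_hat)  # -2 log lambda
    print(n, round(np.mean(w < crit), 4))     # the nominal level is 0.95
```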
4.4 Significance procedures and duality
A hypothesis test of \(H_{0}: \theta \in \Theta_{0}\) versus \(H_{1}: \theta \in \Theta_{0}^{c}\), where \(\Theta_{0} \cup \Theta_{0}^{c} = \Theta\), with a significance level of \(5\%\) (or any other specified value) returns one bit of information: either we “accept \(H_{0}\)” or “reject \(H_{0}\)”. We do not know whether the decision was borderline or nearly conclusive; i.e. whether, for rejection, \(\Theta_0\) and \(C(x; 0.05)\) were close or well-separated. Of more interest is the smallest value of \(\alpha\) for which \(C(x; \alpha)\) does not intersect \(\Theta_{0}\). This value is termed the \(p\)-value.
Definition 4.6 (\(p\)-value) A \(p\)-value \(p(X)\) is a statistic satisfying \(p(x) \in [0,1]\) for every \(x \in \mathcal{X}\). Small values of \(p(x)\) support the hypothesis that \(H_{1}\) is true. A \(p\)-value is valid if, for every \(\theta \in \Theta_{0}\) and every \(\alpha \in [0, 1]\), \[\begin{eqnarray} \mathbb{P}(p(X) \leq \alpha \, | \, \theta) & \leq & \alpha. \tag{4.15} \end{eqnarray}\]
If \(p(X)\) is a valid \(p\)-value then a significance test that rejects \(H_{0}\) if and only if \(p(X) \leq \alpha\) is, from (4.15), a test with significance level \(\alpha\). In this section we introduce the idea of significance procedures and derive a duality between a significance procedure at level \(\alpha\) and a confidence procedure at level \(1-\alpha\). We first need some additional concepts. Let \(X\) and \(Y\) be two scalar random variables. Then \(X\) stochastically dominates \(Y\) exactly when \[\begin{eqnarray*} \mathbb{P}(X \leq v) & \leq & \mathbb{P}(Y \leq v) \end{eqnarray*}\] for all \(v \in \mathbb{R}\). Visually, the distribution function for \(X\) is never to the left of the distribution function for \(Y\).
Recollect that the distribution function of \(X\) has the form \(F(x) := \mathbb{P}(X \leq x)\) for \(x \in \mathbb{R}\).
Recall that if \(U \sim \mbox{Unif}(0, 1)\), the standard uniform distribution, then \(\mathbb{P}(U \leq u) = u\) for \(u \in [0, 1]\). With this in mind, we make the following definition.
Definition 4.7 (Super-uniform) The random variable \(X\) is super-uniform exactly when it stochastically dominates a standard uniform random variable. That is \[\begin{eqnarray} \mathbb{P}(X \leq u) & \leq & u \tag{4.16} \end{eqnarray}\] for all \(u \in [0, 1]\).
Example 4.4 From Definition 4.6, we see that for \(\theta \in \Theta_{0}\), the \(p\)-value \(p(X)\) is super-uniform.
We now define a significance procedure which can be viewed as an extension of Definition 4.6. Note the similarities with the definitions of a confidence procedure which are not coincidental.
Definition 4.8 (Significance procedure) Firstly, \(p : \mathcal{X} \rightarrow \mathbb{R}\) is a significance procedure for \(\theta_{0} \in \Theta\) exactly when \(p(X)\) is super-uniform under \(\theta_{0}\). If \(p(X)\) is uniform under \(\theta_{0}\), then \(p\) is an exact significance procedure for \(\theta_{0}\). Secondly, for \(X = x\), \(p(x)\) is a significance level or (observed) \(p\)-value for \(\theta_{0}\) exactly when \(p\) is a significance procedure for \(\theta_{0}\). Thirdly, \(p : \mathcal{X} \times \Theta \rightarrow \mathbb{R}\) is a family of significance procedures exactly when \(p(x; \theta_{0})\) is a significance procedure for \(\theta_{0}\) for every \(\theta_{0} \in \Theta\).
We now show that there is a duality between significance procedures and confidence procedures.
Theorem 4.5 (Duality theorem) Firstly, let \(p\) be a family of significance procedures. Then \[\begin{eqnarray*} C(x; \alpha) & := & \{\theta \in \Theta \, : \, p(x; \theta) > \alpha\} \end{eqnarray*}\] is a nesting family of confidence procedures. Secondly, conversely, let \(C\) be a nesting family of confidence procedures. Then \[\begin{eqnarray*} p(x; \theta_{0}) & := & \inf \{\alpha \, : \, \theta_{0} \notin C(x; \alpha)\} \end{eqnarray*}\] is a family of significance procedures. If either is exact, then the other is exact as well.
Proof: If \(p\) is a family of significance procedures then for any \(\theta \in \Theta\), \[\begin{eqnarray} \mathbb{P}(\theta \in C(X; \alpha) \, | \, \theta) & = & \mathbb{P}(p(X; \theta) > \alpha \, | \,\theta) \nonumber\\ & = & 1 - \mathbb{P}(p(X; \theta) \leq \alpha \, | \,\theta). \tag{4.17} \end{eqnarray}\] Now, as \(p\) is super-uniform for \(\theta\) then \(\mathbb{P}(p(X; \theta) \leq \alpha \, | \,\theta) \leq \alpha\). Thus, from equation (4.17), \[\begin{eqnarray} \mathbb{P}(\theta \in C(X; \alpha) \, | \, \theta) & \geq & 1 - \alpha \tag{4.18} \end{eqnarray}\] so that, from Definition 4.1, \(C(X; \alpha)\) is a level-\((1-\alpha)\) confidence procedure. From Definition 4.3 it is clear that \(C\) is nesting. If \(p\) is exact then the inequality in (4.18) can be replaced by an equality and so \(C\) is also exact. We thus have the first statement. Now, if \(C\) is a nesting family of confidence procedures then (we’re finessing the issue of the boundary of \(C\) by assuming that if \(\alpha^{*} : = \inf \{\alpha \, : \, \theta_{0} \notin C(x; \alpha)\}\) then \(\theta_{0} \notin C(x; \alpha^{*})\)) \[\begin{eqnarray*} \inf \{\alpha \, : \, \theta_{0} \notin C(x; \alpha)\} \leq u & \iff & \theta_{0} \notin C(x; u). \end{eqnarray*}\] Let \(\theta_{0}\) and \(u \in [0, 1]\) be arbitrary. Then, \[\begin{eqnarray*} \mathbb{P}(p(X; \theta_{0}) \leq u \, | \, \theta_{0}) \ = \ \mathbb{P}(\theta_{0} \notin C(X; u) \, | \,\theta_{0}) \leq u \end{eqnarray*}\] as \(C(X; u)\) is a level-\((1-u)\) confidence procedure. Thus, \(p\) is super-uniform. If \(C\) is exact, then the inequality is replaced by an equality, and hence \(p\) is exact as well. \(\Box\)
Theorem 4.5 shows that confidence procedures and significance procedures are two sides of the same coin. If we have a way of constructing families of confidence procedures then we have a way of constructing families of significance procedures, and vice versa. If we have a good way of constructing confidence procedures then (presumably, and in principle) we have a good way of constructing significance procedures. This is helpful because, as Section 4.5 will show, there are an uncountable number of families of significance procedures, and so there are an uncountable number of families of confidence procedures. Naturally, in both these cases, almost all of the possible procedures are useless for our inference. So just being a confidence procedure, or just being a significance procedure, is never enough. We need to know how to make good choices.
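As a concrete sketch of the duality, the following assumes the exact two-sided normal-mean \(p\)-value family from the setting of Example 4.2, and recovers the familiar interval by inverting it over a grid of \(\theta_{0}\) values.

```python
import numpy as np
from scipy.stats import norm

# Inverting a family of p-values on a grid of theta_0 values to obtain a
# confidence set; the model (normal mean, known sigma) and the numbers are
# illustrative assumptions.
n, sigma, alpha = 20, 1.0, 0.05
x = norm.rvs(loc=0.3, scale=sigma, size=n, random_state=6)
xbar = x.mean()

def p_value(theta0):
    # exact two-sided p-value under theta = theta_0
    z = np.sqrt(n) * (xbar - theta0) / sigma
    return 2.0 * norm.sf(np.abs(z))

theta_grid = np.linspace(-1.0, 2.0, 3001)
C = theta_grid[p_value(theta_grid) > alpha]
print(C.min(), C.max())   # agrees with xbar -/+ z_{alpha/2} * sigma / sqrt(n)
```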
4.5 Families of significance procedures
We now consider a very general way to construct a family of significance procedures. We will then show how to use simulation to compute the family.
Theorem 4.6 Let \(t : \mathcal{X} \rightarrow \mathbb{R}\) be a statistic. For each \(x \in \mathcal{X}\) and \(\theta_{0} \in \Theta\) define \[\begin{eqnarray*} p_{t}(x; \theta_{0}) & := & \mathbb{P}(t(X) \geq t(x) \, | \, \theta_{0}). \end{eqnarray*}\] Then \(p_{t}\) is a family of significance procedures. If the distribution function of \(t(X)\) is continuous, then \(p_{t}\) is exact.
Proof: We follow Theorem 8.3.27 of Casella and Berger (2002). Now, \[\begin{eqnarray*} p_{t}(x; \theta_{0}) & = & \mathbb{P}(t(X) \geq t(x) \, | \, \theta_{0}) \\ & = & \mathbb{P}(-t(X) \leq -t(x) \, | \, \theta_{0}). \end{eqnarray*}\] Let \(F\) denote the distribution function of \(Y(X) = -t(X)\) then \[\begin{eqnarray*} p_{t}(x; \theta_{0}) & = & F(-t(x) \, | \, \theta_{0}). \end{eqnarray*}\] If \(t(X)\) is continuous then \(Y(X) = -t(X)\) is continuous and, using the Probability Integral Transform, see Theorem 4.11, \[\begin{eqnarray*} \mathbb{P}(p_{t}(X; \theta_{0}) \leq \alpha \, | \, \theta_{0}) & = & \mathbb{P}(F(Y) \leq \alpha \, | \, \theta_{0}) \\ & = & \mathbb{P}(Y \leq F^{-1}(\alpha) \, | \, \theta_{0}) = F(F^{-1}(\alpha)) = \alpha. \end{eqnarray*}\] Hence, \(p_{t}\) is uniform under \(\theta_{0}\). If \(t(X)\) is not continuous then, via the Probability Integral Transform, \(\mathbb{P}(F(Y) \leq \alpha \, | \, \theta_{0}) \leq \alpha\) and so \(p_{t}(X; \theta_{0})\) is super-uniform under \(\theta_{0}\). \(\Box\)
So there is a family of significance procedures for each possible function \(t : \mathcal{X} \rightarrow \mathbb{R}\). Clearly only a tiny fraction of these can be useful functions, and the rest must be useless. Some, like \(t(x) = c\) for some constant \(c\), are always useless. Others, like \(t(x) = \sin(x)\) might sometimes be a little bit useful, while others, like \(t(x) = \sum_{i} x_{i}\) might be quite useful - but it all depends on the circumstances. Some additional criteria are required to separate out good from poor choices of the test statistic \(t\), when using the construction in Theorem 4.6. The most pertinent criterion is:
- Select a test statistic \(t(X)\) which will tend to be larger for decision-relevant departures from \(\theta_{0}\).
Example 4.5 For the likelihood ratio, \(\lambda(x)\), given by equation (4.1), small observed values of \(\lambda(x)\) support departures from \(\theta_{0}\). Thus, \(t(X) = -2\log \lambda(X)\), is a test statistic for which large values support departures from \(\theta_{0}\).
In the context of Definition 4.6, large values of \(t(X)\) will correspond to small values of the \(p\)-value, supporting the hypothesis that \(H_{1}\) is true. Thus, this criterion ensures that \(p_{t}(X; \theta_{0})\) will tend to be smaller under decision-relevant departures from \(\theta_{0}\); small \(p\)-values are more interesting, precisely because significance procedures are super-uniform under \(\theta_{0}\).
4.5.1 Computing p-values
Only in very special cases will it be possible to find a closed-form expression for \(p_{t}\) from which we can compute the \(p\)-value \(p_{t}(x; \theta_{0})\). Instead, we can use simulation, according to the following result adapted from Besag and Clifford (1989).
Note that if \(X_{0}, X_{1}, \ldots, X_{m}\) are exchangeable then their joint density function satisfies \(f(x_{0}, \ldots, x_{m}) = f(x_{\pi(0)}, \ldots, x_{\pi(m)})\) for all permutations \(\pi\) defined on the set \(\{0, \ldots, m\}\).
Theorem 4.7 For any finite sequence of scalar random variables \(X_{0}, X_{1}, \ldots, X_{m}\), define the rank of \(X_{0}\) in the sequence as \[\begin{eqnarray*} R & := & \sum_{i=1}^{m} \mathbb{I}_{\{X_{i} \leq X_{0}\}}. \end{eqnarray*}\] If \(X_{0}, X_{1}, \ldots, X_{m}\) are exchangeable then \(R\) has a discrete uniform distribution on the integers \(\{0, 1, \ldots, m\}\), and \((R + 1)/(m + 1)\) has a super-uniform distribution.
Proof: By exchangeability, \(X_{0}\) has the same probability of having rank \(r\) as any of the other \(X_{i}\)’s, for any \(r\), and therefore \[\begin{eqnarray} \mathbb{P}(R = r) & = & \frac{1}{m+1} \tag{4.19} \end{eqnarray}\] for \(r \in \{0, 1, \ldots, m\}\) and zero otherwise, proving the first claim. For the second claim, \[\begin{eqnarray*} \mathbb{P}\left(\frac{R+1}{m+1} \leq u\right) & = & \mathbb{P}(R+1 \leq u(m+1)) \\ & = & \mathbb{P}(R +1 \leq \lfloor u(m+1) \rfloor) \end{eqnarray*}\] since \(R\) is an integer and \(\lfloor x \rfloor\) denotes the largest integer no larger than \(x\). Hence, \[\begin{eqnarray} \mathbb{P}\left(\frac{R+1}{m+1} \leq u\right) & = & \sum_{r=0}^{\lfloor u(m+1) \rfloor -1} \mathbb{P}(R = r) \tag{4.20} \\ & = & \sum_{r=0}^{\lfloor u(m+1) \rfloor -1} \frac{1}{m+1} \tag{4.21} \\ & = & \frac{\lfloor u(m+1) \rfloor}{m+1} \leq u, \nonumber \end{eqnarray}\] as required where equation (4.21) follows from (4.20) by (4.19). \(\Box\)
To use this result, fix the test statistic \(t(x)\) and define \(T_{i} = t(X_{i})\) where \(X_{1}, \ldots, X_{m}\) are independent and identically distributed random variables with density \(f(\cdot \, | \, \theta_{0})\). Define \[\begin{eqnarray*} R_{t}(x; \theta_{0}) \ := \ \sum_{i=1}^{m} \mathbb{I}_{\{-T_{i} \leq -t(x)\}} \ = \ \sum_{i=1}^{m} \mathbb{I}_{\{T_{i} \geq t(x)\}}, \end{eqnarray*}\] where \(\theta_{0}\) is an argument to \(R\) because \(\theta_{0}\) needs to be specified in order to simulate \(T_{1}, \ldots, T_{m}\). Then Theorem 4.7 implies that \[\begin{eqnarray*} P_{t}(x; \theta_{0}) & := & \frac{R_{t}(x; \theta_{0}) + 1}{m+1} \end{eqnarray*}\] has a super-uniform distribution under \(X \sim f(\cdot \, | \, \theta_{0})\), because in this case \(t(X), T_{1}, \ldots, T_{m}\) are exchangeable. Furthermore, the Weak Law of Large Numbers (WLLN) implies that \[\begin{eqnarray*} \lim_{m \rightarrow \infty} P_{t}(x; \theta_{0}) & = & \lim_{m \rightarrow \infty} \frac{R_{t}(x; \theta_{0}) + 1}{m+1} \\ & = & \lim_{m \rightarrow \infty} \frac{R_{t}(x; \theta_{0}) }{m} \\ & = & \mathbb{P}(T \geq t(x) \, | \,\theta_{0}) \ = \ p_{t}(x; \theta_{0}). \end{eqnarray*}\] Therefore, not only is \(P_{t}(x; \theta_{0})\) super-uniform under \(\theta_{0}\), so that \(P_{t}\) is a family of significance procedures for every \(m\), but the limiting value of \(P_{t}(x; \theta_{0})\) as \(m\) becomes large is \(p_{t}(x; \theta_{0})\).
In summary, if you can simulate from your model under \(\theta_{0}\) then you can produce a \(p\)-value for any test statistic \(t\), namely \(P_{t}(x; \theta_{0})\), and if you can simulate cheaply, so that the number of simulations \(m\) is large, then \(P_{t}(x; \theta_{0}) \approx p_{t}(x; \theta_{0})\).
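The recipe above fits in a few lines of code. The sketch below assumes a simple model, independent and identically distributed Exponential data with rate \(\theta\), and the test statistic \(t(x) = \sum_i x_i\); both choices are illustrative.

```python
import numpy as np

# Monte Carlo p-value P_t(x; theta_0) of Section 4.5.1 for an assumed model:
# x is a sample of n i.i.d. Exponential(theta) values (rate parameterisation),
# with t(x) = sum(x) and H_0: theta = theta_0.
rng = np.random.default_rng(7)

def monte_carlo_p(x, theta0, t=np.sum, m=999):
    t_obs = t(x)
    t_sim = np.array([t(rng.exponential(scale=1.0 / theta0, size=len(x)))
                      for _ in range(m)])
    rank = np.sum(t_sim >= t_obs)        # R_t(x; theta_0)
    return (rank + 1) / (m + 1)          # super-uniform under theta = theta_0

x = rng.exponential(scale=1.0 / 0.5, size=30)   # data generated with theta = 0.5
print(monte_carlo_p(x, theta0=1.0))   # small: the observed sum is large relative
                                      # to sums simulated under theta_0 = 1
print(monte_carlo_p(x, theta0=0.5))   # typically not small
```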
The less-encouraging news is that this simulation-based approach is not well-adapted to constructing confidence sets. Let \(C_{t}\) be the family of confidence procedures induced by \(p_{t}\) using duality, see Theorem 4.5. We can answer the question “Is \(\theta_{0} \in C_{t}(x; \alpha)\)?” with one set of \(m\) simulations. These simulations give a value \(P_{t}(x; \theta_{0})\) which is either larger or not larger than \(\alpha\). If \(P_{t}(x; \theta_{0}) > \alpha\) then \(\theta_{0} \in C_{t}(x; \alpha)\), and otherwise it is not. Clearly, though, this is not an effective way to enumerate all of the points in \(C_{t}(x; \alpha)\), because we would need to do \(m\) simulations for each point in \(\Theta\).
4.6 Generalisations
So far, confidence procedures and significance procedures have been defined with respect to a point \(\theta_{0} \in \Theta\). Often, though, we require a more general treatment, where a confidence procedure is defined for some \(g : \theta \mapsto \phi\), where \(g\) may not be bijective; or where a significance procedure is defined for some \(\Theta_{0} \subset \Theta\), where \(\Theta_{0}\) may not be a single point. These general treatments are always possible, but the result is often very conservative. As discussed at the end of Section 4.3, conservative procedures are formally correct but they can be practically useless.
4.6.1 Marginalisation of confidence procedures
Suppose that \(g: \theta \mapsto \phi\) is some specified function, and we would like a confidence procedure for \(\phi\). If \(C\) is a level-\((1-\alpha)\) confidence procedure for \(\phi\) then it must have \(\phi\)-coverage of at least \((1-\alpha)\) for all \(\theta \in \Theta\). The most common situation is where \(\Theta \subset \mathbb{R}^{p}\), and \(g\) extracts a single component of \(\theta\): for example, \(\theta = (\mu, \sigma^2)\) and \(g(\theta) = \mu\).
Theorem 4.8 (Confidence Procedure Marginalisation, CPM) Suppose that \(g : \theta \mapsto \phi\) and that \(C\) is a level-\((1-\alpha)\) confidence procedure for \(\theta\). Then \[\begin{eqnarray*} gC & := & \{\phi \, : \, \phi = g(\theta) \mbox{ for some $\theta \in C$}\} \end{eqnarray*}\] is a level-\((1-\alpha)\) confidence procedure for \(\phi\).
Proof: The result follows immediately by noting that \(\theta \in C(x)\) implies that \(\phi \in gC(x)\) for all \(x\), and hence \[\begin{eqnarray*} \mathbb{P}(\theta \in C(X) \, | \,\theta) & \leq & \mathbb{P}(\phi \in gC(X) \, | \, \theta) \end{eqnarray*}\] for all \(\theta \in \Theta\). So if \(C\) has \(\theta\)-coverage of at least \((1-\alpha)\), then \(gC\) has \(\phi\)-coverage of at least \((1-\alpha)\) as well. \(\Box\)
This result shows that we can derive level-\((1-\alpha)\) confidence procedures for functions of \(\theta\) directly from level-\((1-\alpha)\) confidence procedures for \(\theta\). Furthermore, if the confidence procedure for \(\theta\) is easy to enumerate, then the confidence procedure for \(\phi\) is easy to enumerate too - just by transforming each element. But it also shows that the coverage of such derived procedures will typically be more than \((1 - \alpha)\), even if the original confidence procedure is exact: thus \(gC\) is a conservative confidence procedure. As already noted, conservative confidence procedures can often be far larger than they need to be: sometimes too large to be useful.
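The following sketch makes the conservativeness concrete in the known-variance normal-means setting, projecting an exact joint region for \(\theta = (\theta_{1}, \theta_{2})\) onto \(\phi = g(\theta) = \theta_{1}\); the numbers are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2, norm

# Marginalising an exact joint region for theta = (theta_1, theta_2), a ball of
# radius sqrt(chi2_{2,alpha}) * sigma / sqrt(n) around theta_hat, to an interval
# for phi = theta_1; theta_hat, n and sigma are illustrative assumptions.
n, sigma, alpha = 25, 1.0, 0.05
theta_hat = np.array([0.8, -0.4])

half_joint = np.sqrt(chi2.ppf(1 - alpha, df=2)) * sigma / np.sqrt(n)  # projected half-width
half_exact = norm.ppf(1 - alpha / 2) * sigma / np.sqrt(n)             # exact 1-d half-width

print("gC interval:   ", theta_hat[0] - half_joint, theta_hat[0] + half_joint)
print("exact interval:", theta_hat[0] - half_exact, theta_hat[0] + half_exact)
# The projected interval is wider: gC is a conservative procedure for theta_1.
```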
4.6.2 Generalisation of significance procedures
We now give a simple result which extends a family of significance procedures over a set in \(\Theta\).
Theorem 4.9 Let \(\Theta_{0} \subset \Theta\). If \(p\) is a family of significance procedures, then \[\begin{eqnarray*} P(x; \Theta_{0}) & := & \sup_{\theta_{0} \in \Theta_{0}} p(x; \theta_{0}) \end{eqnarray*}\] is super-uniform for all \(\theta \in \Theta_{0}\).
Proof: \(P(x; \Theta_{0}) \leq u\) implies that \(p(x; \theta_{0}) \leq u\) for all \(\theta_{0} \in \Theta_{0}\). Let \(\theta \in \Theta_{0}\) be arbitrary; then, for any \(u \geq 0\), \[\begin{eqnarray*} \mathbb{P}(P(X; \Theta_{0}) \leq u \, | \, \theta) \ \leq \ \mathbb{P}(p(X; \theta) \leq u \, | \, \theta) \ \leq \ u \end{eqnarray*}\] since \(p(X; \theta)\) is super-uniform under \(\theta\). As \(\theta \in \Theta_{0}\) was arbitrary, \(P(X; \Theta_{0})\) is super-uniform for all \(\theta \in \Theta_{0}\). \(\Box\)
As with the marginalisation of confidence procedures, this result shows that we can derive a significance procedure for an arbitrary \(\Theta_{0} \subset \Theta\). The difference, though, is that this is rather impractical, because of the need, in general, to maximise over a possibly unbounded set \(\Theta_{0}\). As a result, this type of \(p\)-value is not much used in practice. It is sometimes replaced by simple approximations. For example, if the parameter is \((v, \theta)\) then a \(p\)-value for \(v_{0}\) could be approximated by plugging in a specific value for \(\theta\), such as the maximum likelihood value, and treating the model as though it were parameterised by \(v\) alone. But this does not give rise to a well-defined significance procedure for \(v_{0}\) on the basis of the original model. Adopting this type of approach is something of an act of desperation, for when Theorem 4.9 is intractable. The difficulty is that you get a number, but you do not know what it signifies.
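For completeness, here is a sketch of the construction in Theorem 4.9, with the supremum approximated over a bounded grid; the one-sided normal-mean setting and the numbers are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

# Composite p-value of Theorem 4.9, approximated by maximising over a grid.
# Assumed setting: X_1, ..., X_n i.i.d. N(theta, 1), Theta_0 = {theta <= 0},
# with the one-sided family p(x; theta_0) = P(Xbar >= xbar | theta_0).
n, xbar = 16, 0.4

def p_family(theta0):
    return norm.sf(np.sqrt(n) * (xbar - theta0))

theta0_grid = np.linspace(-5.0, 0.0, 5001)   # a bounded grid standing in for Theta_0
print(np.max(p_family(theta0_grid)))         # here the sup is attained at theta_0 = 0
```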
4.7 Reflections
4.7.1 On the definitions
The first thing to note is the abundance of families of confidence procedures and significance procedures, most of which are useless. For example, let \(U\) be a standard uniform random quantity. Based on the definition alone, \[\begin{eqnarray*} C(x; \alpha) & = & \left\{\begin{array}{ll} \{0\} & U < \alpha \\ \Theta & U \geq \alpha \end{array}\right. \end{eqnarray*}\] is a perfectly acceptable family of exact confidence procedures, and \[\begin{eqnarray*} p(x; \theta_{0}) & = & U \end{eqnarray*}\] is a perfectly acceptable family of exact significance procedures. They are both useless. You cannot object that these examples are pathological because they contain the auxiliary random quantity \(U\), because the most accessible method for computing \(p\)-values also contains auxiliary random quantities (see Section 4.5.1). You could object that the family of confidence procedures does not have the LSP (Definition 4.5), which is a valid objection if you intend to apply the LSP rigorously. But would you then have to insist that every significance procedure’s dual confidence procedure (see Theorem 4.5) should also have the LSP?
The second thing to note is how often confidence procedures and significance procedures will be conservative. This means that there is some region of the parameter space where the actual coverage of the confidence procedure is more than the nominal coverage of \((1 - \alpha)\). Or where the significance procedure has a super-uniform but not uniform distribution under \(\theta_{0}\). As shown in this chapter:
- A generic method for constructing families of confidence procedures with the LSP (see Theorem 4.2) is always conservative.
- Confidence procedures for non-bijective functions of the parameters are always conservative (see Theorem 4.8).
- Significance procedures based on test statistics where \(t(X)\) is discrete are always conservative (see Theorem 4.6).
- Significance procedures for composite hypotheses are always conservative (see Theorem 4.9).
4.7.2 On the interpretations
It is a very common observation, made repeatedly over the last 50 years (see, for example, Rubin 1984), that clients think more like Bayesians than classicists. Classical statisticians have to wrestle with the issue that their clients will likely misinterpret their results. This is bad enough for confidence sets (see, for example, Morey et al. (2016)) but potentially disastrous for \(p\)-values.
A \(p\)-value \(p(x; \theta_{0})\) refers only to \(\theta_{0}\), making no reference at all to other hypotheses about \(\theta\). But a posterior probability \(\pi(\theta_{0} \, | \, x)\) contrasts \(\theta_{0}\) with other values in \(\Theta\) which \(\theta\) might have taken. The two outcomes can be radically different, as first captured in Lindley’s paradox (Lindley 1957); see also Bartlett (1957). A \(p\)-value can be viewed as measuring the fit of a model, namely that under \(H_{0}\), to the observed data. A large \(p\)-value indicates only that the data are not unusual under the model; it does not imply that the model is correct. For example, there may be many other models, defined by other hypotheses, which exhibit greater consistency with the observed data. Greenland et al. (2016) discuss 25 misinterpretations of \(p\)-values, confidence intervals, and power. Wasserstein and Lazar (2016) is a statement from the American Statistical Association (ASA) on statistical significance and \(p\)-values. The statement gives six principles for the correct use and interpretation of \(p\)-values, some of which, in particular Principles 3 and 4, are applicable more generally and dovetail with Smith (2010)’s view of there being three players in an inference problem: the client, the statistician, and the auditor. We state them here as values that should be at the heart of any work that we do.
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
Practices that reduce data analysis or scientific inference to mechanical “bright-line” rules (such as “\(p < 0.05\)”) for justifying scientific claims or conclusions can lead to erroneous beliefs and poor decision making.
Researchers should bring many contextual factors into play to derive scientific inferences, including the design of a study, the quality of the measurements, the external evidence for the phenomenon under study, and the validity of assumptions that underlie the data analysis.
Proper inference requires full reporting and transparency.
Whenever a researcher chooses what to present based on statistical results, valid interpretation of those results is severely compromised if the reader is not informed of the choice and its basis.
Researchers should disclose the number of hypotheses explored during the study, all data collection decisions, all statistical analyses conducted, and all p-values computed.