Consider Birnbaum’s Theorem, \((\mbox{WIP} \wedge \mbox{WCP} \, ) \leftrightarrow \mbox{SLP}\). In lectures, we showed that \((\mbox{WIP} \wedge \mbox{WCP} \,) \rightarrow \mbox{SLP}\) but not the converse. Complete the proof of the theorem by showing that \(\mbox{SLP} \rightarrow \mbox{WIP}\) and \(\mbox{SLP} \rightarrow \mbox{WCP}\).
Suppose that we have two discrete experiments \(\mathcal{E}_{1} = \{\mathcal{X}_{1}, \Theta, f_{X_{1}}(x_{1} \, | \, \theta)\}\) and \(\mathcal{E}_{2} = \{\mathcal{X}_{2}, \Theta, f_{X_{2}}(x_{2} \, | \, \theta)\}\) and that, for \(x_{1}' \in \mathcal{X}_{1}\) and \(x_{2}' \in \mathcal{X}_{2}\), \[\begin{eqnarray} f_{X_{1}}(x_{1}' \, | \, \theta) \ = \ cf_{X_{2}}(x_{2}' \, | \, \theta) \tag{1} \end{eqnarray}\] for all \(\theta\) where \(c\) is a positive constant not depending upon \(\theta\) (but which may depend on \(x_{1}', x_{2}'\)) and \(f_{X_{1}}(x_{1}' \, | \, \theta) > 0\). We wish to consider estimation of \(\theta\) under a loss function \(L(\theta, d)\) which is strictly convex in \(d\) for each \(\theta\). Thus, for all \(d_{1} \neq d_{2} \in \mathcal{D}\), the decision space, and \(\alpha \in (0, 1)\), \[\begin{eqnarray*} L(\theta, \alpha d_{1} + (1-\alpha)d_{2}) & < & \alpha L(\theta, d_{1}) + (1-\alpha)L(\theta, d_{2}). \end{eqnarray*}\] For the experiment \(\mathcal{E}_{j}\), \(j = 1, 2\), for the observation \(x_{j}\) we will use the decision rule \(\delta_{j}(x_{j})\) as our estimate of \(\theta\) so that \[\begin{eqnarray*} \mbox{Ev}(\mathcal{E}_{j}, x_{j}) & = & \delta_{j}(x_{j}). \end{eqnarray*}\] Suppose that the inference violates the strong likelihood principle so that, whilst equation (1) holds, \(\delta_{1}(x_{1}') \neq \delta_{2}(x_{2}')\).
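As a concrete illustration of equation (1) (not part of the question), one may take the familiar binomial / negative binomial pair: for instance, \(X_{1} \sim \mbox{Bin}(12, \theta)\) with \(x_{1}' = 3\) successes observed, and \(X_{2}\) the number of failures before the third success, with \(x_{2}' = 9\) failures observed. A quick check in R that the two likelihoods are proportional as functions of \(\theta\):

```r
# Illustration: dbinom(3, 12, theta) and dnbinom(9, 3, theta) are proportional
# as functions of theta; here c = choose(12, 3) / choose(11, 9) = 4.
theta <- seq(0.1, 0.9, by = 0.1)
dbinom(3, 12, theta) / dnbinom(9, 3, theta)   # constant in theta, equal to 4
```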
Let \(\mathcal{E}^{*}\) be the mixture of the experiments \(\mathcal{E}_{1}\) and \(\mathcal{E}_{2}\) according to mixture probabilities \(1/2\) and \(1/2\). For the outcome \((j, x_{j})\) the decision rule is \(\delta(j, x_{j})\). If the Weak Conditionality Principle (WCP) applies to \(\mathcal{E}^{*}\) show that \[\begin{eqnarray*} \delta(1, x_{1}') & \neq & \delta(2, x_{2}'). \end{eqnarray*}\]
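Purely as an illustration of what the mixture experiment \(\mathcal{E}^{*}\) means (reusing the hypothetical binomial / negative binomial pair sketched above), a single outcome \((j, x_{j})\) could be simulated in R as follows:

```r
# Simulate one outcome (j, x_j) from E*: first pick the component experiment
# with probability 1/2 each, then observe from that experiment.
theta <- 0.3                                   # illustrative parameter value
j <- sample(1:2, size = 1, prob = c(0.5, 0.5))
x <- if (j == 1) rbinom(1, size = 12, prob = theta) else rnbinom(1, size = 3, prob = theta)
c(j = j, x = x)
```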
An alternative decision rule for \(\mathcal{E}^{*}\) is \[\begin{eqnarray*} \delta^{*}(j, x_{j}) & = & \left\{\begin{array}{ll} \frac{c}{c+1}\delta(1, x_{1}') + \frac{1}{c+1}\delta(2, x_{2}') & \mbox{if $x_{j} = x_{j}'$ for $j = 1, 2$}, \\ \delta(j, x_{j}) & \mbox{otherwise}. \end{array} \right. \end{eqnarray*}\] Show that if the WCP applies to \(\mathcal{E}^{*}\) then \(\delta^{*}\) dominates \(\delta\) so that \(\delta\) is inadmissible. [Hint: First show that \(R(\theta, \delta^{*}) = \frac{1}{2}\mathbb{E}[L(\theta, \delta^{*}(1, X_{1})) \, | \, \theta] + \frac{1}{2}\mathbb{E}[L(\theta, \delta^{*}(2, X_{2})) \, | \, \theta]\).]
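The following is not a proof, but a numerical check of the claimed dominance in the hypothetical binomial / negative binomial setting above, with squared-error loss (one strictly convex choice) and arbitrary estimates at the two special points; since \(\delta^{*}\) and \(\delta\) agree everywhere else, only those points contribute to the risk difference.

```r
# Numerical check (not a proof): R(theta, delta*) - R(theta, delta) < 0 for
# every theta on a grid, using the risk decomposition in the hint.
theta  <- seq(0.05, 0.95, by = 0.05)
delta1 <- 0.2; delta2 <- 0.4                              # hypothetical delta(1, x1') != delta(2, x2')
cc     <- dbinom(3, 12, theta) / dnbinom(9, 3, theta)     # the constant c (= 4 here)
dstar  <- (cc * delta1 + delta2) / (cc + 1)               # the combined estimate used by delta*
loss   <- function(theta, d) (theta - d)^2                # a strictly convex loss
risk_diff <- 0.5 * dbinom(3, 12, theta) * (loss(theta, dstar) - loss(theta, delta1)) +
             0.5 * dnbinom(9, 3, theta) * (loss(theta, dstar) - loss(theta, delta2))
all(risk_diff < 0)                                        # TRUE
```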
Comment on the result of part b.
Suppose we have a hypothesis test of two simple hypotheses \[\begin{eqnarray*} H_{0}: X \sim f_{0} \, \mbox{versus} \, H_{1}: X \sim f_{1} \end{eqnarray*}\] so that if \(H_{i}\) is true then \(X\) has distribution \(f_{i}(x)\). It is proposed to choose between \(H_{0}\) and \(H_{1}\) using the following loss function. \[\begin{eqnarray*} \begin{array}{cc|cc} & & \mbox{Decision} \\ & & H_{0} & H_{1} \\ \hline \mbox{Outcome} & \begin{array}{c} H_{0} \\ H_{1} \end{array} & \begin{array}{c} c_{00} \\ c_{10} \end{array} & \begin{array}{c} c_{01} \\ c_{11} \end{array} \end{array} \end{eqnarray*}\] where \(c_{00} < c_{01}\) and \(c_{11} < c_{10}\). Thus, \(c_{ij} = L(H_{i}, H_{j})\) is the loss when the “true” hypothesis is \(H_{i}\) and the decision \(H_{j}\) is taken. Show that a decision rule \(\delta(x)\) for choosing between \(H_{0}\) and \(H_{1}\) is admissible if and only if \[\begin{eqnarray*} \delta(x) & = & \left\{\begin{array}{cl} H_{0} & \mbox{if } \dfrac{f_{0}(x)}{f_{1}(x)} > c, \\ H_{1} & \mbox{if } \dfrac{f_{0}(x)}{f_{1}(x)} < c, \\ \mbox{either } H_{0} \mbox{ or } H_{1} & \mbox{if } \dfrac{f_{0}(x)}{f_{1}(x)} = c, \end{array}\right. \end{eqnarray*}\] for some critical value \(c > 0\). [Hint: Consider Wald’s Complete Class Theorem and a prior distribution \(\pi = (\pi_{0}, \pi_{1})\) where \(\pi_{i} = \mathbb{P}(H_{i}) > 0\). You may assume that for all \(x \in \mathcal{X}\), \(f_{i}(x) > 0\).]
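For orientation only, the form of decision rule the question asks you to characterise is easy to write down in R for a hypothetical pair of simple hypotheses, say \(f_{0} = N(0, 1)\) and \(f_{1} = N(1, 1)\), with an arbitrarily chosen critical value:

```r
# Illustration of a likelihood-ratio rule (hypothetical densities and cval).
f0   <- function(x) dnorm(x, mean = 0, sd = 1)
f1   <- function(x) dnorm(x, mean = 1, sd = 1)
cval <- 1
delta <- function(x) ifelse(f0(x) / f1(x) > cval, "H0", "H1")   # ties occur with probability 0 here
delta(c(-1, 0, 2))                                              # "H0" "H0" "H1"
```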
Let \(X_{1}, \ldots, X_{n}\) be exchangeable random variables so that, conditional upon a parameter \(\theta\), the \(X_{i}\) are independent. Suppose that \(X_{i} \, | \, \theta \sim N(\theta, \sigma^{2})\) where the variance \(\sigma^{2}\) is known, and that \(\theta \sim N(\mu_{0}, \sigma_{0}^{2})\) where the mean \(\mu_{0}\) and variance \(\sigma_{0}^{2}\) are known. We wish to produce a point estimate \(d\) for \(\theta\), with loss function \[\begin{eqnarray} L(\theta, d) & = & 1 - \exp\left\{-\frac{1}{2}(\theta - d)^{2} \right\}. \tag{2} \end{eqnarray}\]
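Before attempting the later parts it may help to see the shape of this loss; the short sketch below (illustrative only, base R graphics) plots \(L(\theta, d)\) as a function of \(\theta - d\), alongside half the squared error for comparison.

```r
# Plot of the loss (2) as a function of theta - d: it is bounded above by 1,
# and close to (theta - d)^2 / 2 when theta - d is small.
z <- seq(-4, 4, length.out = 200)
plot(z, 1 - exp(-z^2 / 2), type = "l", ylim = c(0, 1.2),
     xlab = expression(theta - d), ylab = "loss")
lines(z, z^2 / 2, lty = 2)
legend("top", legend = c("1 - exp(-z^2/2)", "z^2/2"), lty = 1:2, bty = "n")
```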
Let \(f(\theta)\) denote the probability density function of \(\theta \sim N(\mu_{0}, \sigma_{0}^{2})\). Show that \(\rho(f, d)\), the risk of \(d\) under \(f(\theta)\), can be expressed as \[\begin{eqnarray*} \rho(f, d) & = & 1 - \frac{1}{\sqrt{1+\sigma_{0}^{2}}}\exp\left\{-\frac{1}{2(1+\sigma_{0}^{2})}(d - \mu_{0})^{2} \right\}. \end{eqnarray*}\] [Hint: You may use, without proof, the result that \[\begin{eqnarray*} (\theta - a)^{2} + b(\theta - c)^{2} & = & (1+b)\left(\theta - \frac{a+bc}{1+b}\right)^{2} + \left(\frac{b}{1+b}\right)(a-c)^{2} \end{eqnarray*}\] for any \(a,b,c \in \mathbb{R}\) with \(b \neq -1\).]
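Independently of the derivation, the stated expression can be sanity-checked numerically in R (the values of \(\mu_{0}\), \(\sigma_{0}\) and \(d\) below are arbitrary):

```r
# Numerical check of rho(f, d): compare direct integration of the expected
# loss against the closed-form expression in the question.
mu0 <- 1; sigma0 <- 2; d <- 0.5
rho_numeric <- integrate(function(theta)
  (1 - exp(-0.5 * (theta - d)^2)) * dnorm(theta, mean = mu0, sd = sigma0),
  lower = -Inf, upper = Inf)$value
rho_formula <- 1 - exp(-0.5 * (d - mu0)^2 / (1 + sigma0^2)) / sqrt(1 + sigma0^2)
c(rho_numeric, rho_formula)   # agree to numerical accuracy
```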
Using part a, show that the Bayes rule of an immediate decision is \(d^{*} = \mu_{0}\) and find the corresponding Bayes risk.
Find the Bayes rule and Bayes risk after observing \(x = (x_{1}, \ldots, x_{n})\). Express the Bayes rule as a weighted average of \(d^{*}\) and the maximum likelihood estimate of \(\theta\), \(\overline{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}\), and interpret the weights. [Hint: Consider conjugacy.]
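The hint refers to the standard normal-normal conjugate update from lectures; a small R helper (variable names illustrative) writes out that update, supplying the posterior distribution to which part a can then be applied:

```r
# Normal-normal conjugate update: prior N(mu0, sigma02), observations with
# known variance sigma2 and sample mean xbar from n observations.
posterior_normal <- function(xbar, n, sigma2, mu0, sigma02) {
  sigma12 <- 1 / (1 / sigma02 + n / sigma2)                  # posterior variance
  mu1     <- sigma12 * (mu0 / sigma02 + n * xbar / sigma2)   # posterior mean
  c(mean = mu1, var = sigma12)
}
posterior_normal(xbar = 1.2, n = 10, sigma2 = 1, mu0 = 0, sigma02 = 4)
```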
Suppose now, given data \(y\), the parameter \(\theta\) has the general posterior distribution \(f(\theta \, | \, y)\). We wish to use the loss function \(L(\theta, d)\), as given in equation (2), to find a point estimate \(d\) for \(\theta\). By considering an approximation of \(L(\theta, d)\), or otherwise, what can you say about the corresponding Bayes rule?
Show that if \(p\) is a family of significance procedures then \[\begin{eqnarray*} p(x; \Theta_{0}) & = & \sup_{\theta \in \Theta_{0}} p(x; \theta) \end{eqnarray*}\] is a significance procedure for the null hypothesis \(\Theta_{0} \subset \Theta\); that is, \(p(X; \Theta_{0})\) is super-uniform for every \(\theta \in \Theta_{0}\).
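A simulation sketch may make the claim concrete. Purely as an assumption for illustration, take a single observation \(X \sim N(\theta, 1)\) with \(p(x; \theta) = \mathbb{P}(X \geq x \, | \, \theta)\) and \(\Theta_{0} = \{\theta \leq 0\}\), so that the supremum over \(\Theta_{0}\) is attained at \(\theta = 0\):

```r
# Simulate p(X; Theta_0) under a true theta inside Theta_0 and check that it
# looks super-uniform: P(p <= u) should not exceed u.
set.seed(1)
x <- rnorm(1e5, mean = -0.5)                     # a true theta = -0.5 in Theta_0
p <- pnorm(x, mean = 0, lower.tail = FALSE)      # sup over Theta_0 attained at theta = 0
mean(p <= 0.05)                                  # well below 0.05
```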
Suppose that, given \(\theta\), \(X_{1}, \ldots, X_{n}\) are independent and identically distributed \(N(\theta, 1)\) random variables so that, given \(\theta\), \(\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_{i} \sim N(\theta, 1/n)\).
Consider the test of the hypotheses \[\begin{eqnarray*} H_{0}: \theta = 0 \, \mbox{versus} \, H_{1}: \theta = 1 \end{eqnarray*}\] using the statistic \(\overline{X}\) so that large observed values \(\overline{x}\) support \(H_{1}\). For a given \(n\), the corresponding \(p\)-value is \[\begin{eqnarray*} p_{n}(\overline{x}; 0) & = & \mathbb{P}(\overline{X} \geq \overline{x} \, | \, \theta = 0). \end{eqnarray*}\] We wish to investigate how, for a fixed \(p\)-value, the likelihood ratio for \(H_{0}\) versus \(H_{1}\), \[\begin{eqnarray*} LR(H_{0}, H_{1}) & := & \frac{f(\overline{x} \, | \, \theta = 0)}{f(\overline{x} \, | \, \theta = 1)} \end{eqnarray*}\] changes as \(n\) increases.
Use `R` to create a plot of \(LR(H_{0}, H_{1})\) for each \(n \in \{1, \ldots, 20\}\) where, for each \(n\), \(\overline{x}\) is the value which corresponds to a \(p\)-value of 0.05. [Hint: You may need to utilise the `qnorm` and `dnorm` functions. The look of the plot may be improved by using a log-scale on the axes.]
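A minimal sketch of one possible approach in R, using base graphics (variable names are illustrative):

```r
# For each n, find the value of xbar with p-value 0.05 under H0 and evaluate
# the likelihood ratio LR(H0, H1) at that xbar.
n    <- 1:20
xbar <- qnorm(0.95, mean = 0, sd = 1 / sqrt(n))   # P(Xbar >= xbar | theta = 0) = 0.05
lr   <- dnorm(xbar, mean = 0, sd = 1 / sqrt(n)) /
        dnorm(xbar, mean = 1, sd = 1 / sqrt(n))
plot(n, lr, type = "b", log = "y",
     xlab = "n", ylab = expression(LR(H[0], H[1])))
```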
Consider the test of the hypotheses \[\begin{eqnarray*} H_{0}: \theta = 0 \, \mbox{versus} \, H_{1}: \theta > 0 \end{eqnarray*}\] using once again \(\overline{X}\) as the test statistic.
For the origins of the use of 0.05 see Cowles, M. and C. Davis (1982). On the origins of the .05 level of statistical significance. American Psychologist 37(5), 553-558.