Chapter 2 Principles for Statistical Inference

2.1 Introduction

We wish to consider inferences about a parameter \(\theta\) given a parametric model \[\begin{eqnarray*} \mathcal{E} & = & \{ \mathcal{X}, \Theta, f_{X}(x \, | \, \theta) \}. \end{eqnarray*}\] We assume that the model is true so that only \(\theta \in \Theta\) is unknown. We wish to learn about \(\theta\) from observations \(x\) so that \(\mathcal{E}\) represents a model for this experiment. Our inferences can be described in terms of an algorithm involving both \(\mathcal{E}\) and \(x\). In this chapter, we shall assume that \(\mathcal{X}\) is finite; Basu (1975) argues that

“this contingent and cognitive universe of ours is in reality only finite and, therefore, discrete [infinite and continuous models] are to be looked upon as mere approximations to the finite realities.”

Statistical principles guide the way in which we learn about \(\theta\). These principles are meant to be either self-evident, or logical implications of principles which are self-evident. What is really interesting about Statistics, for both statisticians and philosophers (and real-world decision makers), is that the logical implications of some self-evident principles are not at all self-evident, and have turned out to be inconsistent with prevailing practices. This was a discovery made in the 1960s. Just as interesting, for sociologists (and real-world decision makers), is that the then-prevailing practices have survived the discovery, and continue to be used today.

This chapter is about statistical principles, and their implications for statistical inference. It demonstrates the power of abstract reasoning to shape everyday practice.

2.2 Reasoning about inferences

Statistical inferences can be very varied, as a brief look at the “Results” sections of the papers in an Applied Statistics journal will reveal. In each paper, the authors have decided on a different interpretation of how to represent the “evidence” from their dataset. On the surface, it does not seem possible to construct and reason about statistical principles when the notion of “evidence” is so plastic. It was the inspiration of Allan Birnbaum (1923-1976) (Birnbaum 1962) to see—albeit indistinctly at first—that this issue could be side-stepped. Over the next two decades, his original notion was refined; key papers in this process were Birnbaum (1972), Basu (1975), Dawid (1977), and the book by J. Berger and Wolpert (1988).

The model \(\mathcal{E}\) is accepted as a working hypothesis. How the statistician chooses her statements about the true value \(\theta\) is entirely down to her and her client: as a point or a set in \(\Theta\), as a choice among alternative sets or actions, or maybe as something more complicated, not ruling out visualisations. Dawid (1977) puts this well - his formalism is not excessive, given how crucial this concept is to understand properly. The statistician defines, a priori, a set of possible “inferences about \(\theta\)”, and her task is to choose an element of this set based on \(\mathcal{E}\) and \(x\). Thus the statistician should see herself as a function \(\mbox{Ev}\): a mapping from \((\mathcal{E}, x)\) into a predefined set of “inferences about \(\theta\)”, or \[\begin{eqnarray*} (\mathcal{E}, x) & \stackrel{\mbox{statistician, Ev}}{\longmapsto} & \mbox{Inference about $\theta$.} \end{eqnarray*}\] Thus, \(\mbox{Ev}(\mathcal{E}, x)\) is the inference about \(\theta\) made if \(\mathcal{E}\) is performed and \(X = x\) is observed. For example, \(\mbox{Ev}(\mathcal{E}, x)\) might be the maximum likelihood estimator of \(\theta\) or a 95% confidence interval for \(\theta\). Birnbaum called \(\mathcal{E}\) the “experiment”, \(x\) the “outcome”, and \(\mbox{Ev}\) the “evidence”.
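
To make the idea of \(\mbox{Ev}\) as an algorithm concrete, here is a minimal sketch in Python (the Bernoulli model, the grid, and the likelihood-based interval are illustrative choices of mine, not part of Birnbaum’s formalism): one possible \(\mbox{Ev}\) maps a model and an outcome to a point estimate together with an approximate 95% interval.

```python
import numpy as np

def Ev(model, x):
    """One possible evidence function: map (model, outcome) to an inference.

    Here the inference is a pair (maximum likelihood estimate, approximate
    95% likelihood interval); other choices of Ev are equally legitimate."""
    theta_grid = model["theta_grid"]
    loglik = np.array([model["loglik"](x, t) for t in theta_grid])
    mle = theta_grid[np.argmax(loglik)]
    # crude likelihood interval: theta whose log-likelihood is within
    # 1.92 (half the 95% chi-squared_1 quantile) of the maximum
    keep = theta_grid[loglik >= loglik.max() - 1.92]
    return mle, (keep.min(), keep.max())

# A Bernoulli experiment: X = (X_1, ..., X_n) independent Bernoulli(theta)
bernoulli = {
    "theta_grid": np.linspace(0.001, 0.999, 999),
    "loglik": lambda x, t: np.sum(x * np.log(t) + (1 - x) * np.log(1 - t)),
}

x = np.array([0, 1, 0, 0, 1, 1, 0, 1, 1, 1])  # an observed outcome
print(Ev(bernoulli, x))                        # MLE 0.6 and an interval around it
```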

Birnbaum (1962)’s formalism, of an experiment, an outcome, and an evidence function, helps us to anticipate how we can construct statistical principles. First, there can be different experiments with the same \(\theta\). Second, under some outcomes, we would agree that it is self-evident that these different experiments provide the same evidence about \(\theta\). Thus, we can follow p3 of Basu (1975) and define the equality or equivalence of \(\mbox{Ev}(\mathcal{E}_{1}, x_{1})\) and \(\mbox{Ev}(\mathcal{E}_{2}, x_{2})\) as meaning that

  1. The experiments \(\mathcal{E}_{1}\) and \(\mathcal{E}_{2}\) are related to the same parameter \(\theta\).
  2. “Everything else being equal”, the outcome \(x_{1}\) from \(\mathcal{E}_{1}\) “warrants the same inference” about \(\theta\) as does the outcome \(x_{2}\) from \(\mathcal{E}_{2}\).

As we will show, these self-evident principles imply other principles. These principles all have the same form: under such and such conditions, the evidence about \(\theta\) should be the same. Thus they serve only to rule out evidence functions which, under those conditions, would yield different evidence. They do not tell us how to make an inference, only what to avoid.

2.3 The principle of indifference

We now give our first example of a statistical principle, using the name conferred by Basu (1975).

Definition 2.1 (Weak Indifference Principle, WIP) Let \(\mathcal{E} = \{ \mathcal{X}, \Theta, f_{X}(x \, | \, \theta) \}\). If \(f_{X}(x \, | \, \theta) = f_{X}(x' \, | \, \theta)\) for all \(\theta \in \Theta\) then \(\mbox{Ev}(\mathcal{E}, x) = \mbox{Ev}(\mathcal{E}, x')\).

As Birnbaum (1972) notes, this principle, which he termed mathematical equivalence, asserts that we are indifferent between two outcomes if they differ only in the manner of the labelling of the sample points. For example, if \(X = (X_{1}, \ldots, X_{n})\) where the \(X_{i}\)s are a series of independent Bernoulli trials with parameter \(\theta\) then \(f_{X}(x \, | \, \theta) = f_{X}(x' \, | \, \theta)\) if \(x\) and \(x'\) contain the same number of successes. We will show that the WIP logically follows from the following two principles, which I would argue are self-evident, and for which we use the names conferred by Dawid (1977).

Definition 2.2 (Distribution Principle, DP) If \(\mathcal{E} = \mathcal{E}'\), then \(\mbox{Ev}(\mathcal{E}, x) = \mbox{Ev}(\mathcal{E}', x)\).

As Dawid (1977) on p247 writes “informally, this says that the only aspects of an experiment which are relevant to inference are the sample space and the family of distributions over it.”

Definition 2.3 (Transformation Principle, TP) Let \(\mathcal{E} = \{\mathcal{X}, \Theta, f_{X}(x \, | \,\theta)\}\). For the bijective \(g : \mathcal{X} \to \mathcal{Y}\), let \(\mathcal{E}^g = \{\mathcal{Y}, \Theta, f_{Y}(y \, | \,\theta)\}\), the same experiment as \(\mathcal{E}\) but expressed in terms of \(Y = g(X)\), rather than \(X\). Then \(\mbox{Ev}(\mathcal{E}, x) = \mbox{Ev}(\mathcal{E}^g, g(x))\).

This principle states that inferences should not depend on the way in which the sample space is labelled.

Example 2.1 Recall Example 1.2. Under TP, inferences about \(\theta\) are the same whether we observe \(x = (x_{1}, \ldots, x_{n})\), where the \(X_{i}\) are independent with \(X_{i} \sim \mbox{Gamma}(\alpha, \beta)\), or \(x^{-1} = (1/x_{1}, \ldots, 1/x_{n})\), where the \(X_{i}^{-1}\) are independent with \(X_{i}^{-1} \sim \mbox{Inverse-Gamma}(\alpha, \beta)\).
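
A quick numerical check of Example 2.1 (a sketch in Python assuming scipy is available; the data and the \((\alpha, \beta)\) values are arbitrary): for every \((\alpha, \beta)\), the Gamma log-likelihood based on \(x\) and the Inverse-Gamma log-likelihood based on \(1/x\) differ only by \(2\sum_{i}\log x_{i}\), the log-Jacobian of the transformation, which does not involve \(\theta\). Any inference which depends on the likelihood only through its shape in \(\theta\) is therefore unchanged.

```python
import numpy as np
from scipy.stats import gamma, invgamma

rng = np.random.default_rng(1)
x = gamma.rvs(a=2.0, scale=1.0 / 3.0, size=20, random_state=rng)  # Gamma(alpha=2, rate=3)
y = 1.0 / x                                                       # Inverse-Gamma(2, 3)

def loglik_gamma(alpha, beta, x):
    # shape alpha, rate beta  <=>  scipy's scale = 1/beta
    return gamma.logpdf(x, a=alpha, scale=1.0 / beta).sum()

def loglik_invgamma(alpha, beta, y):
    # shape alpha, scale beta in the usual Inverse-Gamma parametrisation
    return invgamma.logpdf(y, a=alpha, scale=beta).sum()

for alpha, beta in [(1.0, 1.0), (2.0, 3.0), (5.0, 0.5)]:
    diff = loglik_invgamma(alpha, beta, y) - loglik_gamma(alpha, beta, x)
    print(round(diff, 6))              # the same constant for every (alpha, beta) ...
print(round(2 * np.log(x).sum(), 6))   # ... namely the log-Jacobian
```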

We have the following result, see Basu (1975), Dawid (1977).

Theorem 2.1 \((\mbox{DP} \wedge \mbox{TP} \, ) \rightarrow \mbox{WIP}.\)

Proof: Fix \(\mathcal{E}\), and suppose that \(x, x' \in \mathcal{X}\) satisfy \(f_{X}(x \, | \, \theta) = f_{X}(x' \, | \, \theta)\) for all \(\theta \in \Theta\), as in the condition of the WIP. Now consider the transformation \(g : \mathcal{X} \to \mathcal{X}\) which switches \(x\) for \(x'\), but leaves all of the other elements of \(\mathcal{X}\) unchanged. In this case \(\mathcal{E} = \mathcal{E}^g\), since \(x\) and \(x'\) have the same probability under every \(\theta\) and all other sample points are unchanged. Then \[\begin{eqnarray} \mbox{Ev}(\mathcal{E}, x') & = & \mbox{Ev}(\mathcal{E}^g, x') \tag{2.1} \\ & = & \mbox{Ev}(\mathcal{E}^g, g(x)) \tag{2.2}\\ & = & \mbox{Ev}(\mathcal{E}, x), \tag{2.3} \end{eqnarray}\] where equation (2.1) follows by the DP, (2.2) holds because \(g(x) = x'\), and (2.3) follows from (2.2) by the TP. We thus have the WIP. \(\Box\)

Therefore, if I accept the principles DP and TP then I must also accept the WIP. Conversely, if I do not want to accept the WIP then I must reject at least one of the DP and TP. This is the pattern of the next few sections, where either I must accept a principle, or, as a matter of logic, I must reject one of the principles that implies it.
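
As a concrete illustration of the WIP, return to the Bernoulli example: two binary sequences with the same number of successes have identical likelihood functions, so any evidence function satisfying the WIP must report the same inference for both. A minimal numerical check (the sequences are arbitrary):

```python
import numpy as np

def bernoulli_lik(x, theta):
    s = x.sum()
    return theta**s * (1 - theta) ** (len(x) - s)

x1 = np.array([1, 1, 1, 0, 0])  # three successes then two failures
x2 = np.array([0, 1, 0, 1, 1])  # the same number of successes, relabelled

theta = np.linspace(0.01, 0.99, 99)
print(np.allclose(bernoulli_lik(x1, theta), bernoulli_lik(x2, theta)))  # True
```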

2.4 The Likelihood Principle

Suppose we have experiments \(\mathcal{E}_{i} = \{\mathcal{X}_{i}, \Theta, f_{X_{i}}(x_{i} \, | \, \theta)\}\), \(i = 1, 2, \ldots\), where the parameter space \(\Theta\) is the same for each experiment. Let \(p_{1}, p_{2}, \ldots\) be a set of known probabilities so that \(p_{i} \geq 0\) and \(\sum_{i} p_{i} = 1\). The mixture \(\mathcal{E}^{*}\) of the experiments \(\mathcal{E}_{1}, \mathcal{E}_{2}, \ldots\) according to mixture probabilities \(p_{1}, p_{2}, \ldots\) is the two-stage experiment

  1. A random selection of one of the experiments: \(\mathcal{E}_{i}\) is selected with probability \(p_{i}\).
  2. The experiment selected in stage 1. is performed.

Thus, each outcome of the experiment \(\mathcal{E}^{*}\) is a pair \((i, x_{i})\), where \(i = 1, 2, \ldots\) and \(x_{i} \in \mathcal{X}_{i}\), and the family of distributions is \[\begin{eqnarray} f^{*}((i, x_{i}) \, | \, \theta) & = & p_{i}f_{X_{i}}(x_{i} \, | \, \theta). \tag{2.4} \end{eqnarray}\]

A famous example of a mixture experiment is the “two instruments” (see Section 2.3 of Cox and Hinkley (1974)). There are two instruments in a laboratory: one is accurate, the other less so. The accurate one is more in demand, and typically it is busy \(80\%\) of the time; the inaccurate one is usually free. So, a priori, there is a probability of \(p_1 = 0.2\) of getting the accurate instrument, and \(p_2 = 0.8\) of getting the inaccurate one. Once a measurement is made, of course, there is no doubt about which of the two instruments was used. The following principle asserts what must be self-evident to everybody: that inferences should be made according to the instrument actually used, and not according to the a priori uncertainty about which instrument would be used.

Definition 2.4 (Weak Conditionality Principle, WCP) Let \(\mathcal{E}^{*}\) be the mixture of the experiments \(\mathcal{E}_{1}\), \(\mathcal{E}_{2}\) according to mixture probabilities \(p_{1}\), \(p_{2} = 1-p_{1}\). Then \(\mbox{Ev}\left(\mathcal{E}^*, (i, x_i) \right) = \mbox{Ev}(\mathcal{E}_i, x_i)\).

Thus, the WCP states that inferences for \(\theta\) depend only on the experiment performed. As Casella and Berger (2002) on p293 state “the fact that this experiment was performed rather than some other, has not increased, decreased, or changed knowledge of \(\theta\).”
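
To see numerically why the WCP is so compelling in the two-instruments example, consider the following sketch (assuming scipy; the normal measurement model and the standard deviations \(\sigma_{1} = 1\) and \(\sigma_{2} = 3\) are illustrative assumptions of mine, not part of the example as stated). It contrasts the conditional 95% intervals, whose half-widths use the standard deviation of the instrument actually used, with a single half-width calibrated to 95% coverage under the mixture distribution (2.4).

```python
import numpy as np
from scipy.stats import norm
from scipy.optimize import brentq

p = np.array([0.2, 0.8])      # mixture probabilities from the text
sigma = np.array([1.0, 3.0])  # assumed standard deviations: accurate, inaccurate

# Conditional 95% intervals: half-width 1.96 * sigma_i for the instrument used
cond_halfwidth = norm.ppf(0.975) * sigma

# A single half-width c with 95% coverage under the mixture distribution (2.4)
def mixture_coverage(c):
    return np.sum(p * (2 * norm.cdf(c / sigma) - 1))

c = brentq(lambda c: mixture_coverage(c) - 0.95, 0.1, 20.0)

print(cond_halfwidth)               # roughly [1.96, 5.88]
print(c)                            # roughly 5.6: one width for both instruments
print(2 * norm.cdf(c / sigma) - 1)  # its conditional coverage: about [1.00, 0.94]
```

The single mixture-calibrated interval is far too wide when the accurate instrument was used and slightly too short when the inaccurate one was; the WCP says that the inference should instead be the one appropriate to the instrument actually used.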

In Section 1.4.1.1, we motivated the strong likelihood principle, see Definition 1.5. We now reassert this principle.

Definition 2.5 (Strong Likelihood Principle, SLP) Let \(\mathcal{E}_1\) and \(\mathcal{E}_2\) be two experiments which have the same parameter \(\theta\). If \(x_1 \in \mathcal{X}_1\) and \(x_2 \in \mathcal{X}_2\) satisfy \(f_{X_{1}}(x_{1} \, | \, \theta) = c(x_{1}, x_{2})f_{X_{2}}(x_{2} \, | \, \theta)\), that is \[\begin{eqnarray*} L_{X_{1}}(\theta; x_{1}) & = & c(x_{1}, x_{2})L_{X_{2}}(\theta; x_{2}) \end{eqnarray*}\] for some function \(c > 0\) for all \(\theta \in \Theta\) then \({\mbox{Ev}(\mathcal{E}_1, x_1) = \mbox{Ev}(\mathcal{E}_2, x_2)}\).

The SLP thus states that if two likelihood functions for the same parameter have the same shape, then the evidence is the same.

The SLP is self-attributed to George Barnard (1915-2002); see his comment on p308 of the discussion of Birnbaum (1962). But it is alluded to in the statistical writings of R.A. Fisher, almost appearing in its modern form in Fisher (1956).

As we shall discuss in Section 2.8, many classical statistical procedures violate the SLP, and the following result was something of a bombshell when it first emerged in the 1960s. The following form is due to Birnbaum (1972) and Basu (1975). Birnbaum’s original result (Birnbaum 1962) used a stronger condition than WIP and a slightly weaker condition than WCP. Theorem 2.2 is clearer.

Theorem 2.2 (Birnbaum's Theorem) \((\mbox{WIP} \wedge \mbox{WCP} \, ) \leftrightarrow \mbox{SLP}.\)

Proof: Both \(\mbox{SLP} \rightarrow \mbox{WIP}\) and \(\mbox{SLP} \rightarrow \mbox{WCP}\) are straightforward. The trick is to prove \[(\mbox{WIP} \wedge \mbox{WCP} \,) \rightarrow \mbox{SLP}.\] So let \(\mathcal{E}_1\) and \(\mathcal{E}_2\) be two experiments which have the same parameter, and suppose that \(x_1 \in \mathcal{X}_1\) and \(x_2 \in \mathcal{X}_2\) satisfy \(f_{X_{1}}(x_{1} \, | \, \theta) = c(x_{1}, x_{2})f_{X_{2}}(x_{2} \, | \, \theta)\) where the function \(c > 0\). As the value \(c\) is known (because the data have been observed), we can consider the mixture experiment with \(p_{1} = 1/(1+c)\) and \(p_{2} = c/(1+c)\). Then, using equation (2.4), \[\begin{eqnarray} f^{*}((1, x_{1}) \, | \, \theta) & = & \frac{1}{1+c} f_{X_{1}}(x_{1} \, | \, \theta) \nonumber\\ & = & \frac{c}{1+c} f_{X_{2}}(x_{2} \, | \, \theta) \tag{2.5} \\ & = & f^{*}((2, x_{2}) \, | \, \theta) \tag{2.6} \end{eqnarray}\] where (2.5) follows from the proportionality of the likelihoods and (2.6) follows from (2.5) by (2.4). Then the WIP implies that \[\begin{eqnarray*} \mbox{Ev} \left( \mathcal{E}^*, (1, x_1) \right) & = & \mbox{Ev} \left( \mathcal{E}^*, (2, x_2) \right). \end{eqnarray*}\] Finally, applying the WCP to each side we infer that \[\begin{eqnarray*} \mbox{Ev}(\mathcal{E}_1, x_1) & = & \mbox{Ev}(\mathcal{E}_2, x_2) , \end{eqnarray*}\] as required. \(\Box\)

Thus, either I accept the SLP, or I explain which of the two principles, WIP and WCP, I reject. Methods which violate the SLP face exactly this challenge.

2.5 The Sufficiency Principle

In Section 1.4.2.1 we considered the idea of sufficiency. From Definition 1.6, if \(S = s(X)\) is sufficient for \(\theta\) then \[\begin{eqnarray} f_{X}(x \, | \, \theta) & = & f_{X | S}(x \, | \, s, \theta)f_{S}(s \, | \, \theta) \tag{2.7} \end{eqnarray}\] where \(f_{X | S}(x \, | \, s, \theta)\) does not depend upon \(\theta\). Consequently, we consider the experiment \(\mathcal{E}^{S} = \{s(\mathcal{X}), \Theta, f_{S}(s \, | \, \theta)\}\).

Definition 2.6 (Strong Sufficiency Principle, SSP) If \(S = s(X)\) is a sufficient statistic for the experiment \(\mathcal{E} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\}\) then \(\mbox{Ev}(\mathcal{E}, x) = \mbox{Ev}(\mathcal{E}^{S}, s(x))\).

A weaker but more familiar version, which Basu (1975) terms “perhaps a trifle less severe” and which is in keeping with Definition 1.7, is as follows.

Definition 2.7 (Weak Sufficiency Principle, WSP) If \(S = s(X)\) is a sufficient statistic for the experiment \(\mathcal{E} = \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta)\}\) and \(s(x) = s(x')\) then \(\mbox{Ev}(\mathcal{E}, x) = \mbox{Ev}(\mathcal{E}, x')\).

Theorem 2.3 \(\mbox{SLP} \rightarrow \mbox{SSP} \rightarrow \mbox{WSP} \rightarrow \mbox{WIP}\).

Proof: From equation (2.7), \(f_{X}(x \, | \, \theta) = cf_{S}(s \, | \, \theta)\) where \(c = f_{X | S}(x \, | \, s, \theta)\) does not depend upon \(\theta\). Thus, from the SLP, Definition 2.5, \(\mbox{Ev}(\mathcal{E}, x) = \mbox{Ev}(\mathcal{E}^{S}, s(x))\), which is the SSP, Definition 2.6. Note that, from the SSP, \[\begin{eqnarray} \mbox{Ev}(\mathcal{E}, x) & = & \mbox{Ev}(\mathcal{E}^{S}, s(x)) \tag{2.8}\\ & = & \mbox{Ev}(\mathcal{E}^{S}, s(x')) \tag{2.9}\\ & = & \mbox{Ev}(\mathcal{E}, x') \tag{2.10} \end{eqnarray}\] where (2.9) follows from (2.8) as \(s(x) = s(x')\) and (2.10) follows from (2.9) by the SSP. We thus have the WSP, Definition 2.7. Finally, notice that if \(f_{X}(x \, | \, \theta) = f_{X}(x' \, | \, \theta)\) as in the statement of the WIP, Definition 2.1, then the statistic \(S = s(X)\) defined by \(s(x) = s(x') = x'\) and \(s(y) = y\) for all other \(y \in \mathcal{X}\) is sufficient, since the conditional distribution of \(X\) given \(S\) does not depend upon \(\theta\) (given \(S = x'\), \(X\) is equally likely to be \(x\) or \(x'\)). As \(s(x) = s(x')\), the WSP gives \(\mbox{Ev}(\mathcal{E}, x) = \mbox{Ev}(\mathcal{E}, x')\), which is the WIP. \(\Box\)

Finally, we note that if we put together Theorem 2.2 and Theorem 2.3 we get the following corollary.

Corollary 2.1 \((\mbox{WIP} \wedge \mbox{WCP}) \rightarrow \mbox{SSP}\).
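
To close this section, here is a short numerical check of the factorisation (2.7) in the Bernoulli case (a sketch assuming scipy; the outcome is arbitrary). With \(S = \sum_{i} X_{i}\), the ratio \(f_{X}(x \, | \, \theta) / f_{S}(s \, | \, \theta)\) equals \(1/\binom{n}{s}\) for every \(\theta\), so the conditional distribution of \(X\) given \(S\) carries no information about \(\theta\), and the SSP says that inference may be based on \(\mathcal{E}^{S}\) alone.

```python
import numpy as np
from scipy.stats import binom
from scipy.special import comb

x = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 0])  # an arbitrary Bernoulli outcome
n, s = len(x), x.sum()

for theta in [0.2, 0.5, 0.8]:
    f_x = theta**s * (1 - theta) ** (n - s)   # f_X(x | theta)
    f_s = binom.pmf(s, n, theta)              # f_S(s | theta), with S ~ Binomial(n, theta)
    print(round(f_x / f_s, 10))               # 1 / C(n, s), free of theta

print(round(1 / comb(n, s), 10))              # the same constant
```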

2.6 Stopping rules

Suppose that we consider observing a sequence of random variables \(X_{1}, X_{2}, \ldots\) where the number of observations is not fixed in advance but depends on the values seen so far. That is, at time \(j\), the decision to observe \(X_{j+1}\) can be modelled by a probability \(p_j(x_1, \dots, x_j)\). We can assume, resources being finite, that the experiment must stop at a specified time \(m\), if it has not stopped already, hence \(p_m(x_1, \dots, x_m) = 0\). The stopping rule may then be denoted as \(\tau = (p_1, \dots, p_m)\). This gives an experiment \(\mathcal{E}^{\tau}\) with, for \(n = 1, 2, \ldots\), densities \(f_n(x_1, \dots, x_n \, | \, \theta)\), where consistency requires that \[\begin{eqnarray*} f_n(x_1, \dots, x_n \, | \, \theta) & = & \sum_{x_{n+1}} \cdots \sum_{x_m} f_{m}(x_1, \dots, x_n, x_{n+1}, \dots x_m \, | \, \theta). \end{eqnarray*}\] We utilise the following example from p42 of Basu (1975) to motivate the stopping rule principle. Consider four different coin-tossing experiments (with some finite limit on the number of tosses).

  • \(\mathcal{E}_1\): Toss the coin exactly \(10\) times;
  • \(\mathcal{E}_2\): Continue tossing until \(6\) heads appear;
  • \(\mathcal{E}_3\): Continue tossing until \(3\) consecutive heads appear;
  • \(\mathcal{E}_4\): Continue tossing until the accumulated number of heads exceeds that of tails by exactly \(2\).

One could easily adduce more sequential experiments which gave the same outcome. Notice that \(\mathcal{E}_{1}\) corresponds to a binomial model and \(\mathcal{E}_{2}\) to a negative binomial. Suppose that all four experiments have the same outcome \({x = (\mbox{T,H,T,T,H,H,T,H,H,H})}\).
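
For \(\mathcal{E}_{1}\) and \(\mathcal{E}_{2}\) the likelihoods of this outcome are easy to compare directly (a sketch assuming scipy): the outcome has 6 heads and 4 tails with the final toss a head, so the binomial likelihood is \(\binom{10}{6}\theta^{6}(1-\theta)^{4}\) and the negative binomial likelihood is \(\binom{9}{5}\theta^{6}(1-\theta)^{4}\). Their ratio is constant in \(\theta\), and the maximising value is \(0.6\) in both cases.

```python
import numpy as np
from scipy.stats import binom, nbinom

theta = np.linspace(0.05, 0.95, 19)

# E_1: 10 tosses, 6 heads observed
lik1 = binom.pmf(6, 10, theta)
# E_2: toss until 6 heads; 4 tails observed before the 6th head
lik2 = nbinom.pmf(4, 6, theta)

print(np.round(lik1 / lik2, 6))                        # constant ratio C(10,6)/C(9,5) = 5/3
print(theta[np.argmax(lik1)], theta[np.argmax(lik2)])  # the same maximiser, 0.6
```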

In line with Example 1.3, we may feel that the evidence for \(\theta\), the probability of heads, is the same in every case. Once the sequence of heads and tails is known, the intentions of the original experimenter (i.e. the experiment she was doing) are immaterial to inference about the probability of heads, and the simplest experiment \(\mathcal{E}_1\) can be used for inference. We can consider the following principle, which Basu (1975) claims is due to George Barnard (1915-2002).

Definition 2.8 (Stopping Rule Principle, SRP) In a sequential experiment \(\mathcal{E}^\tau\), \(\mbox{Ev} \left( \mathcal{E}^\tau, (x_1, \dots, x_n) \right)\) does not depend on the stopping rule \(\tau\).

The SRP is nothing short of revolutionary, if it is accepted. It implies that the intentions of the experimenter, represented by \(\tau\), are irrelevant for making inferences about \(\theta\), once the observations \((x_1, \dots, x_n)\) are available. Once the data are observed, we can ignore the sampling plan. Thus the statistician could proceed as though the simplest possible stopping rule were in effect, which is \(p_1 = \cdots = p_{n-1} = 1\) and \(p_n = 0\), an experiment with \(n\) fixed in advance. Obviously it would be liberating for the statistician to put aside the experimenter’s intentions (since they may not be known and could be highly subjective), but can the SRP possibly be justified? Indeed it can.

Theorem 2.4 \(\mbox{SLP} \rightarrow \mbox{SRP}\).

Proof: Let \(\tau\) be an arbitrary stopping rule, and consider the outcome \((x_1, \dots, x_n)\), which we will denote as \(x_{1:n}\). We take the first observation with probability one and, for \(j=1, \ldots, n-1\), the \((j+1)\)th observation is taken with probability \(p_{j}(x_{1:j})\), and we stop after the \(n\)th observation with probability \(1 - p_n(x_{1:n})\). Consequently, the probability of this outcome under \(\tau\) is \[\begin{eqnarray*} f_\tau(x_{1:n} \, | \, \theta) & = & f_1(x_1 \, | \, \theta) \left\{ \prod_{j=1}^{n-1} p_j(x_{1:j}) \, f_{j+1}(x_{j+1} \, | \, x_{1:j}, \theta) \right\} \left( 1 - p_n(x_{1:n}) \right) \\ & = & \left\{\prod_{j=1}^{n-1} p_j(x_{1:j}) \right\} \left( 1 - p_n(x_{1:n}) \right) f_1(x_1 \, | \, \theta) \prod_{j=2}^n f_j(x_j \, | \, x_{1:(j-1)}, \theta) \\ & = & \left\{\prod_{j=1}^{n-1} p_j(x_{1:j}) \right\} \left( 1 - p_n(x_{1:n}) \right) f_n(x_{1:n} \, | \, \theta) . \end{eqnarray*}\] Now observe that this equation has the form \[\begin{eqnarray} f_\tau(x_{1:n} \, | \, \theta) & = & c(x_{1:n})f_n(x_{1:n} \, | \, \theta) \tag{2.11} \end{eqnarray}\] where \(c(x_{1:n}) > 0\). Thus the SLP implies that \(\mbox{Ev}(\mathcal{E}^\tau, x_{1:n}) = \mbox{Ev}(\mathcal{E}^n, x_{1:n})\) where \(\mathcal{E}^n = \{ \mathcal{X}^n, \Theta, f_n(x_{1:n} \, | \, \theta) \}\). Since the choice of stopping rule was arbitrary, equation (2.11) holds for all stopping rules, showing that the choice of stopping rule is irrelevant.\(\Box\)

The Stopping Rule Principle has become enshrined in our profession’s collective memory due to this iconic comment, see p76 of Savage et al. (1962), from Leonard Jimmie Savage (1917-1971), one of the great statisticians of the Twentieth Century,

May I digress to say publicly that I learned the stopping rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly, I then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right.

This comment captures the revolutionary and transformative nature of the SRP.

2.7 A stronger form of the WCP

The new concept in this section is “ancillarity”. This has several different definitions in the Statistics literature; the one we use is close to that of Section 2.2 of Cox and Hinkley (1974).

Definition 2.9 (Ancillarity) \(Y\) is ancillary in the experiment \(\mathcal{E} = \{ \mathcal{X} \times \mathcal{Y}, \Theta, f_{X,Y}(x, y \, | \,\theta)\}\) exactly when \(f_{X,Y}\) factorises as \[\begin{eqnarray*} f_{X,Y}(x, y \, | \, \theta) &= & f_Y(y)f_{X | Y}(x \, | \, y, \theta). \end{eqnarray*}\]

In other words, the marginal distribution of \(Y\) is completely specified. Not all families of distributions will factorise in this way, but when they do, there are new possibilities for inference, based around stronger forms of the WCP.

Here is an example, which will be familiar to all statisticians. We have been given a sample \(x = (x_{1}, \ldots, x_{n})\) to evaluate. In fact \(n\) itself is likely to be the outcome of a random variable \(N\), because the process of sampling itself is rather uncertain. However, we seldom concern ourselves with the distribution of \(N\) when we evaluate \(x\); instead we treat \(N\) as known. Equivalently, we treat \(N\) as ancillary and condition on \(N = n\). In this case, we might think that inferences drawn from observing \((n, x)\) should be the same as those for \(x\) conditioned on \(N = n\).

When \(Y\) is ancillary, we can consider the conditional experiment \[\begin{eqnarray*}\label{eq:ancillary} \mathcal{E}^{X \, | \, y} = \{\mathcal{X}, \Theta, f_{X \, | \, Y}(x \, | \, y, \theta)\}. \end{eqnarray*}\] This is an experiment where we condition on \(Y = y\), i.e. treat \(Y\) as known, and treat \(X\) as the only random variable. This is an attractive idea, captured in the following principle.

Definition 2.10 (Strong Conditionality Principle, SCP) If \(Y\) is ancillary in \(\mathcal{E}\), then \(\mbox{Ev} \left( \mathcal{E}, (x, y) \right) = \mbox{Ev}( \mathcal{E}^{X | y}, x)\).

As a second example, consider a regression of \(Y\) on \(X\), which appears to make a distinction between \(Y\), which is random, and \(X\), which is not. This distinction is insupportable, given that the roles of \(Y\) and \(X\) are often interchangeable, and determined by the hypothesis du jour. What is really happening is that \((X, Y)\) is random, but \(X\) is being treated as ancillary for the parameters in \(f_{Y | X}\), so that the parameters of its marginal distribution are merely auxiliary in the analysis. Then the SCP is invoked (implicitly), which justifies modelling \(Y\) conditionally on \(X\), treating \(X\) as known.
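
A minimal sketch of this implicit use of the SCP (the data-generating values are arbitrary choices of mine): the pair \((X, Y)\) is generated jointly, but the regression is fitted conditionally on the observed \(x\), exactly as if \(x\) had been fixed by design.

```python
import numpy as np

rng = np.random.default_rng(0)

# (X, Y) is jointly random ...
x = rng.normal(loc=2.0, scale=1.5, size=100)         # marginal of X: no theta here
y = 1.0 + 0.5 * x + rng.normal(scale=0.8, size=100)  # f_{Y|X}(y | x, theta)

# ... but, invoking the SCP, we condition on X = x and fit the regression
# treating the observed x as known constants.
slope, intercept = np.polyfit(x, y, deg=1)
print(intercept, slope)  # estimates of the parameters of f_{Y|X}
```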

Clearly the SCP implies the WCP, with the experiment indicator \(I \in \{1, 2\}\) being ancillary, since the mixture probabilities \(p_{1}\) and \(p_{2}\) are known. It is almost obvious that the SCP comes for free with the SLP. Another way to put this is that the WIP allows us to “upgrade” the WCP to the SCP.

Theorem 2.5 \(\mbox{SLP} \rightarrow \mbox{SCP}.\)

Proof: Suppose that \(Y\) is ancillary in \(\mathcal{E} = \{ \mathcal{X} \times \mathcal{Y}, \Theta, f_{X,Y}(x, y \, | \, \theta)\}\). Then, for all \(\theta \in \Theta\), \[\begin{eqnarray*} f_{X,Y}(x, y \, | \, \theta) & = & f_Y(y)f_{X | Y}(x \, | \, y, \theta) \\ & = & c(y) f_{X | Y}(x \, | \, y, \theta), \end{eqnarray*}\] where \(c(y) = f_Y(y) > 0\) does not depend upon \(\theta\). Then the SLP implies that \[\begin{eqnarray*} \mbox{Ev} \left( \mathcal{E}, (x, y) \right) = \mbox{Ev}( \mathcal{E}^{X | y}, x) , \end{eqnarray*}\] as required. \(\Box\)

2.8 The Likelihood Principle in practice

Now we should pause for breath, and ask the obvious questions: is the SLP vacuous? Or trivial? In other words, is there any inferential approach which respects it? Or do all inferential approaches respect it? We shall focus on the classical and Bayesian approaches, as outlined in Section 1.5.1 and Section 1.5.2 respectively.

Recall from Definition 1.8 that a Bayesian statistical model is the collection \[\begin{eqnarray*} \mathcal{E}_{B} & = & \{\mathcal{X}, \Theta, f_{X}(x \, | \, \theta), \pi(\theta)\}. \end{eqnarray*}\] The posterior distribution is \[\begin{eqnarray} \pi(\theta \, | \, x) & = & c(x)f_{X}(x \, | \, \theta)\pi(\theta) \tag{2.12} \end{eqnarray}\] where \(c(x)\) is the normalising constant, \[\begin{eqnarray*} c(x) & = & \left\{\int_{\Theta} f_{X}(x \, | \, \theta)\pi(\theta) \, d\theta\right\}^{-1}. \end{eqnarray*}\] From a Bayesian perspective, all knowledge about the parameter \(\theta\) given the data \(x\) is represented by \(\pi(\theta \, | \, x)\) and any inferences made about \(\theta\) are derived from this distribution. If we have two Bayesian models with the same prior distribution, \(\mathcal{E}_{B, 1} = \{\mathcal{X}_{1}, \Theta, f_{X_{1}}(x_{1} \, | \, \theta), \pi(\theta)\}\) and \(\mathcal{E}_{B, 2} = \{\mathcal{X}_{2}, \Theta, f_{X_{2}}(x_{2} \, | \, \theta), \pi(\theta)\}\), and \(f_{X_{1}}(x_{1} \, | \, \theta) = c(x_{1}, x_{2})f_{X_{2}}(x_{2} \, | \, \theta)\), then \[\begin{eqnarray} \pi(\theta \, | \, x_{1}) & = & c(x_{1})f_{X_{1}}(x_{1} \, | \, \theta)\pi(\theta) \nonumber \\ & = & c(x_{1}) c(x_{1}, x_{2})f_{X_{2}}(x_{2} \, | \, \theta) \pi(\theta) \nonumber \\ & = & \pi(\theta \, | \, x_{2}) \tag{2.13} \end{eqnarray}\] so that the posterior distributions are the same, since both are normalised densities in \(\theta\). Consequently, the same inferences are drawn from either model and so the Bayesian approach satisfies the SLP. Notice that this assumes that the prior distribution exists independently of the outcome, that is, the prior does not depend upon the form of the data. In practice, though, this is hard to do. Some methods for making default choices for \(\pi\) depend on \(f_X\), notably Jeffreys priors and reference priors, see for example, Section 5.4 of Bernardo and Smith (2000). These methods violate the SLP.
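
A numerical illustration of (2.13), using the binomial and negative binomial likelihoods from Section 2.6 (a sketch assuming scipy; the flat Beta(1,1) prior is an arbitrary choice of mine): both models lead to the same posterior, here a Beta(7,5), because the likelihoods differ only by a factor which is free of \(\theta\).

```python
import numpy as np
from scipy.stats import binom, nbinom, beta

theta = np.linspace(0.001, 0.999, 999)
prior = beta.pdf(theta, 1, 1)               # flat Beta(1, 1) prior (an assumption)

def posterior(lik):
    unnorm = lik * prior
    return unnorm / (unnorm.sum() * (theta[1] - theta[0]))  # normalise on the grid

post1 = posterior(binom.pmf(6, 10, theta))  # E_1: 6 heads in 10 tosses
post2 = posterior(nbinom.pmf(4, 6, theta))  # E_2: toss until 6 heads; 10 tosses needed

print(np.max(np.abs(post1 - post2)))                  # essentially zero
print(np.max(np.abs(post1 - beta.pdf(theta, 7, 5))))  # small: both match Beta(7, 5) up to grid error
```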

The classical approach, however, violates the SLP. As we noted in Section 1.5.1, algorithms are certified in terms of their sampling distributions, and selected on the basis of their certification. For example, the mean square error of an estimator \(T\), \(MSE(T \, | \, \theta) = Var(T \, | \, \theta) + \mbox{bias}(T \, | \, \theta)^{2}\), depends upon the first and second moments of the distribution of \(T \, | \, \theta\). Consequently, such certifications depend on the whole sample space \(\mathcal{X}\) and not just the observed \(x \in \mathcal{X}\).

Example 2.2 (Example 1.3.5 of Robert (2007)) Suppose that \(X_{1}, X_{2}\) are iid \(N(\theta, 1)\) so that \[\begin{eqnarray*} f(x_{1}, x_{2} \, | \, \theta) \ \propto \ \exp\left\{-(\overline{x} - \theta)^{2}\right\}. \end{eqnarray*}\] Now, consider the alternate model for the same parameter \(\theta\), \[\begin{eqnarray*} g(x_{1}, x_{2} \, | \, \theta) \ = \ \pi^{-\frac{3}{2}}\frac{\exp\left\{-(\overline{x} - \theta)^{2}\right\}}{1 + (x_{1} - x_{2})^{2}}. \end{eqnarray*}\] We thus observe that \(f(x_{1}, x_{2} \, | \, \theta) \propto g(x_{1}, x_{2} \, | \, \theta)\) as a function of \(\theta\). If the SLP is applied, then inference about \(\theta\) should be the same under both models. However, the sampling distribution under \(g\) is quite different from that under \(f\), and so estimators of \(\theta\) will have different classical properties under the two models unless they depend on the data only through \(\overline{x}\). For example, \(g\) has heavier tails than \(f\), and so the respective confidence intervals may differ between the two.
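
Two quick numerical checks of this example (a sketch; the evaluation points are arbitrary): first, that \(g/f\) does not depend on \(\theta\) for fixed data, so the SLP treats the two models identically; second, that \(g\) places far more mass than \(f\) on outcomes with \(x_{1}\) far from \(x_{2}\), which is exactly the feature that a sampling-theory certification is sensitive to.

```python
import numpy as np

def f(x1, x2, theta):
    # joint density of two independent N(theta, 1) observations
    return np.exp(-0.5 * ((x1 - theta) ** 2 + (x2 - theta) ** 2)) / (2 * np.pi)

def g(x1, x2, theta):
    xbar = 0.5 * (x1 + x2)
    return np.pi ** (-1.5) * np.exp(-((xbar - theta) ** 2)) / (1 + (x1 - x2) ** 2)

# 1. For fixed data the ratio g/f is the same for every theta ...
x1, x2 = 1.3, 0.2
print([round(g(x1, x2, t) / f(x1, x2, t), 6) for t in (-1.0, 0.0, 0.5, 2.0)])

# 2. ... but the sampling distributions differ: g has much heavier tails
#    in the direction x1 - x2 (evaluated here at theta = 0)
for x1, x2 in [(0.5, -0.5), (2.0, -2.0), (4.0, -4.0)]:
    print(f"{f(x1, x2, 0.0):.2e}  {g(x1, x2, 0.0):.2e}")
```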

We can extend the idea of this example by showing that if \(\mbox{Ev}(\mathcal{E}, x)\) depends on the value of \(f_{X}(x' \, | \, \theta)\) for some \(x' \neq x\) then we can create an alternative experiment \(\mathcal{E}_{1} = \{\mathcal{X}, \Theta, f_{1}(x \, | \, \theta)\}\) in which \(f_{1}(x \, | \, \theta) = f_{X}(x \, | \, \theta)\) at the observed \(x\), but \(f_{1}\) and \(f_{X}\) differ elsewhere on \(\mathcal{X}\). In particular, we can ensure that \(f_{1}(x' \, | \, \theta) \neq f_{X}(x'\, | \, \theta)\). Then, typically, \(\mbox{Ev}\) will give different evidence in the two experiments even though the likelihoods at the observed \(x\) are identical, and so \(\mbox{Ev}\) does not respect the SLP.

To do this, let \(\tilde{x} \neq x, x'\) and set \[\begin{eqnarray*} f_{1}(x' \, | \, \theta) & = & \alpha f_{X} (x' \, | \, \theta) + \beta f_{X} (\tilde{x} \, | \, \theta) \\ f_{1}(\tilde{x} \, | \, \theta) & = & (1-\alpha) f_{X} (x' \, | \, \theta) + (1-\beta) f_{X} (\tilde{x} \, | \, \theta) \end{eqnarray*}\] with \(f_{1} = f_{X}\) elsewhere, where \(\alpha, \beta \in [0, 1]\). Clearly \(f_{1}(x' \, | \, \theta) + f_{1}(\tilde{x} \, | \, \theta) = f_{X}(x' \, | \, \theta) + f_{X}(\tilde{x} \, | \, \theta)\) and so \(f_{1}\) is a probability distribution. By a suitable choice of \(\alpha\) and \(\beta\), with \((\alpha, \beta) \neq (1, 0)\), we can redistribute the mass to ensure \(f_{1}(x' \, | \, \theta) \neq f_{X}(x'\, | \, \theta)\). Consequently, whilst \(f_{1}(x \, | \, \theta) = f_{X}(x \, | \, \theta)\) for the observed \(x\), we will not, in general, have \(\mbox{Ev}(\mathcal{E}, x) = \mbox{Ev}(\mathcal{E}_{1}, x)\), and so \(\mbox{Ev}\) will violate the SLP.
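
A toy numerical check of this construction (the four-point sample space, the two values of \(\theta\), and \(\alpha = \beta = 0.5\) are arbitrary choices): \(f_{1}\) remains a probability distribution for each \(\theta\), agrees with \(f_{X}\) at the observed \(x\), and differs from \(f_{X}\) at \(x'\).

```python
import numpy as np

# A four-point sample space: column 0 is the observed x, column 1 is x', column 2 is x-tilde
f_X = np.array([[0.1, 0.3, 0.4, 0.2],   # f_X(. | theta_1)
                [0.2, 0.1, 0.6, 0.1]])  # f_X(. | theta_2)

alpha, beta = 0.5, 0.5                  # any alpha, beta in [0, 1] with (alpha, beta) != (1, 0)

f_1 = f_X.copy()
f_1[:, 1] = alpha * f_X[:, 1] + beta * f_X[:, 2]
f_1[:, 2] = (1 - alpha) * f_X[:, 1] + (1 - beta) * f_X[:, 2]

print(f_1.sum(axis=1))         # [1. 1.]  -- still a probability distribution for each theta
print(f_1[:, 0] == f_X[:, 0])  # agrees with f_X at the observed x
print(f_1[:, 1], f_X[:, 1])    # but differs at x'
```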

This illustrates that classical inference typically does not respect the SLP because the sampling distribution of the algorithm depends on values of \(f_{X}\) other than \(L(\theta; x) = f_{X}(x \, | \, \theta)\). The two main difficulties with violating the SLP are:

  1. To reject the SLP is to reject at least one of the WIP and the WCP. Yet both of these principles seem self-evident. Therefore violating the SLP is either illogical or obtuse.
  2. In their everyday practice, statisticians use the SCP (treating some variables as ancillary) and the SRP (ignoring the intentions of the experimenter). Neither of these is self-evident, but both are implied by the SLP. If the SLP is violated, then they both need an alternative justification.

Alternative formal justifications for the SCP and the SRP have not been forthcoming.

2.9 Reflections

The statistician takes delivery of an outcome \(x\). Her standard practice is to assume the truth of a statistical model \(\mathcal{E}\), and then turn \((\mathcal{E}, x)\) into an inference about the true value of the parameter \(\theta\). As remarked several times already (see Chapter 1), this is not the end of her involvement, but it is a key step, which may be repeated several times, under different notions of the outcome and different statistical models. This chapter concerns this key step: how she turns \((\mathcal{E}, x)\) into an inference about \(\theta\).

Whatever inference is required, we assume that the statistician applies an algorithm to \((\mathcal{E}, x)\). In other words, her inference about \(\theta\) is not arbitrary, but transparent and reproducible - this is hardly controversial, because anything else would be non-scientific. Following Birnbaum, the algorithm is denoted \(\mbox{Ev}\). The question now becomes: how does she choose her \(\mbox{Ev}\)?

As discussed in Chapter 1 of Smith (2010), there are three players in an inference problem, although two roles may be taken by the same person. There is the client, who has the problem, the statistician whom the client hires to help solve the problem, and the auditor whom the client hires to check the statistician’s work. The statistician needs to be able to satisfy an auditor who asks about the logic of their approach. This chapter does not explain how to choose \(\mbox{Ev}\); instead it describes some properties that \(\mbox{Ev}\) might have. Some of these properties are self-evident, and to violate them would be hard to justify to an auditor. These properties are the Distribution Principle (Definition 2.2), the Transformation Principle (Definition 2.3), and the Weak Conditionality Principle (Definition 2.4). Other properties are not at all self-evident; the most important of these are the Strong Likelihood Principle (Definition 2.5), the Stopping Rule Principle (Definition 2.8) and the Strong Conditionality Principle (Definition 2.10). These properties, which are not self-evident, would be extremely attractive, were it possible to justify them. And as we have seen, they can all be justified as logical deductions from the properties that are self-evident. This is the essence of Birnbaum’s Theorem (Theorem 2.2).

For over a century, statisticians have been proposing methods for selecting algorithms for \(\mbox{Ev}\), independently of this strand of research concerning the properties that such algorithms ought to have (remember that Birnbaum’s Theorem was published in 1962). Bayesian inference, which turns out to respect the SLP, is compatible with all of the properties given above, but classical inference, which turns out to violate the SLP, is not. The two main consequences of this violation are described in Section 2.8.

Now it is important to be clear about one thing. Ultimately, an inference is a single element in the space of “possible inferences about \(\theta\)”. An inference cannot be evaluated according to whether or not it satisfies the SLP. What is being evaluated in this chapter is the algorithm, the mechanism by which \(\mathcal{E}\) and \(x\) are turned into an inference. It is quite possible that statisticians of quite different persuasions will produce effectively identical inferences from different algorithms. For example, if asked for a set estimate of \(\theta\), a Bayesian statistician might produce a 95% High Density Region, and a classical statistician a 95% confidence set, but they might be effectively the same set. But it is not the inference that is the primary concern of the auditor: it is the justification for the inference, among the uncountable other inferences that might have been made but weren’t. The auditor checks the “why”, before passing the “what” on to the client.

So the auditor will ask: why do you choose algorithm \(\mbox{Ev}\)? The classical statistician will reply, “Because it is a 95% confidence procedure for \(\theta\), and, among the uncountable number of such procedures, this is a good choice [for some reasons that are then given].” The Bayesian statistician will reply, “Because it is a 95% High Posterior Density region for \(\theta\) for prior distribution \(\pi(\theta)\), and among the uncountable number of prior distributions, \(\pi(\theta)\) is a good choice [for some reasons that are then given].” Let’s assume that the reasons are compelling, in both cases. The auditor has a follow-up question for the classicist but not for the Bayesian: “Why are you not concerned about violating the Likelihood Principle?” A well-informed auditor will know the theory of the previous sections, and the consequences of violating the SLP that are given in Section 2.8. For example, violating the SLP is either illogical or obtuse - neither of these properties is desirable in an applied statistician.

This is not an easy question to answer. The classicist may reply “Because it is important to me that I control my error rate over the course of my career”, which is incompatible with the SLP. In other words, the statistician ensures that, by always using a 95% confidence procedure, the true value of \(\theta\) will be inside at least 95% of her confidence sets, over her career. Of course, this answer means that the statistician puts her career error rate before the needs of her current client. I can just about imagine a client demanding “I want a statistician who is right at least 95% of the time.” Personally, though, I would advise a client against this, and favour instead a statistician who is concerned not with her career error rate, but rather with the client’s particular problem.

Bibliography

Basu, D. 1975. “Statistical Information and Likelihood.” Sankhyā 37 (1): 1–71.
Berger, J., and R. Wolpert. 1988. The Likelihood Principle. Second. Hayward CA, USA: Institute of Mathematical Statistics.
Bernardo, J. M., and A. F. M. Smith. 2000. Bayesian Theory. Chichester, UK: John Wiley & Sons Ltd.
Birnbaum, A. 1962. “On the Foundations of Statistical Inference.” Journal of the American Statistical Association 57: 269–306.
———. 1972. “More Concepts of Statistical Evidence.” Journal of the American Statistical Association 67: 858–61.
Casella, G., and R. L. Berger. 2002. Statistical Inference. 2nd ed. Pacific Grove, CA: Duxbury.
Cox, D. R., and D. V. Hinkley. 1974. Theoretical Statistics. London, UK: Chapman & Hall.
Dawid, A. P. 1977. “Conformity of Inference Patterns.” In Recent Developments in Statistics, edited by J. R. Barra et al. Amsterdam: North-Holland Publishing Company.
Fisher, R. A. 1956. Statistical Methods and Scientific Inference. Edinburgh; London: Oliver & Boyd.
Robert, C. P. 2007. The Bayesian Choice: From Decision-Theoretic Foundations to Computational Implementation. New York, USA: Springer.
Savage, L. J. et al. 1962. The Foundations of Statistical Inference. London, UK: Methuen.
Smith, J. Q. 2010. Bayesian Decision Analysis: Principles and Practice. Cambridge, UK: Cambridge University Press.