Chapter 2 Modelling
In our work so far we have made use of the concepts of parameters, likelihoods and prior distributions with little attention paid to their clarity or justification.
Example 2.1 Bernoulli model, different interpretation of parameter.
Consider tossing coins. It is natural to think about the model of independent Bernoulli trials given \(\theta_{c}\) but what exactly is \(\theta_{c}\) and can we justify it? Intuitively we think of \(\theta_{c}\) as representing the “true probability of a head” but what does this mean?
Consider an urn which contains a proportion \(\theta_{u}\) of purple balls and \((1 - \theta_{u})\) of green balls. We sample with replacement from the urn: a ball is drawn, its colour noted, and the ball returned to the urn, which is then shaken prior to the next draw. Given \(\theta_{u}\) we might model the draws of the balls identically to the tossing of coins, but our intuitive understandings of \(\theta_{c}\) and \(\theta_{u}\) are not the same. For example, we can never determine \(\theta_{c}\) but we could physically determine \(\theta_{u}\): we could smash the urn and count the balls. (Other, more pacifistic, approaches are available, e.g. just empty the urn.)
Urn problems are widely used as thought experiments in statistics and probability. For an initial overview see, for example, the Wikipedia article on the urn problem.
How can we reconcile these cases? The answer lies in a judgment of exchangeability, which leads to the key result of this chapter.
2.1 Predictive distribution
Suppose that we want to make predictions about the values of future observations of data. The distribution of future data \(Z\) given observed data \(x\) is the predictive distribution \(f(z | x)\).
Notice that this distribution only depends upon \(z\) and \(x\). If we have a parametric model then \[\begin{eqnarray} f(z \, | \, x) & = & \int_{\theta} f(z \, | \, \theta, x)f(\theta \, | \, x) \, d\theta. \tag{2.1} \end{eqnarray}\] If we judge that \(X\) and \(Z\) are conditionally independent, see (1.4), given \(\theta\) then \(f(z | \theta, x) = f(z | \theta)\) so that (2.1) becomes \[\begin{eqnarray} f(z \, | \, x) & = & \int_{\theta} f(z \, | \, \theta)f(\theta \, | \, x) \, d\theta. \tag{2.2} \end{eqnarray}\]
Example 2.2 Consider a Beta-Binomial analysis, such as tossing coins. We judge that \(X \, | \, \theta \sim Bin(n, \theta)\) with \(\theta \sim Beta(\alpha, \beta)\). Then, from Example 1.1, \(\theta \, | \, x \sim Beta(\alpha + x, \beta + n - x)\). For ease of notation, let \(a = \alpha + x\) and \(b = \beta + n - x\). Now consider a further random variable \(Z\) for which we judge \(Z \, | \, \theta \sim Bin(m, \theta)\) and that \(X\) and \(Z\) are conditionally independent given \(\theta\). So, having tossed a coin \(n\) times, we consider tossing it a further \(m\) times. We seek the predictive distribution of \(Z\) given the observed \(x\). As \(X\) and \(Z\) are conditionally independent given \(\theta\) then, from (2.2), \[\begin{eqnarray} f(z \, | \, x) & = & \int_{\theta} f(z \, | \, \theta)f(\theta \, | \, x) \, d\theta \nonumber \\ & = & \int_{0}^{1} \binom{m}{z} \theta^{z}(1 - \theta)^{m-z} \frac{1}{B(a, b)} \, \theta^{a-1}(1-\theta)^{b-1} \, d\theta \nonumber \\ & = & \binom{m}{z}\frac{1}{B(a, b)} \int_{0}^{1} \theta^{a+z-1}(1-\theta)^{b+m-z-1} \, d\theta \tag{2.3} \\ & = & \binom{m}{z} \frac{B(a+z, b+m-z)}{B(a, b)}. \tag{2.4} \end{eqnarray}\] Notice that \[\begin{eqnarray} \frac{B(a+z, b+m-z)}{B(a, b)} & = & \frac{\Gamma(a+z) \Gamma(b+m-z)}{\Gamma(a+b+m)} \times \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \nonumber \end{eqnarray}\] so that (2.4) can be expressed as \[\begin{eqnarray} f(z \, | \, x) & = & c \binom{m}{z} \Gamma(a+z) \Gamma(b+m-z) \nonumber \end{eqnarray}\] where the constant (i.e. not depending upon \(z\)) \[\begin{eqnarray} c & = & \frac{\Gamma(a+b)}{\Gamma(a+b+m) \Gamma(a) \Gamma(b)}.\nonumber \end{eqnarray}\] \(Z \, | \, x\) is the Binomial-Beta distribution with parameters \(a\), \(b\), \(m\).
Notice that the derivation of this predictive distribution involved, see (2.3), the identification of a kernel of a \(Beta(a+z, b+m-z)\) distribution. This should not have been a surprise. From (1.10) (with \(y =z\) and using the conditional independence of \(X\) and \(Z\) given \(\theta\)), \(f(z \, | \, \theta)f(\theta \, | \, x) \propto f(\theta \, | \, z, x)\) and, from Example 1.1, \(\theta \, | \, z, x \sim Beta(a+z, b+m-z)\).
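As a check on (2.4), the following sketch (illustrative only; the values of \(n\), \(x\), \(m\), \(\alpha\) and \(\beta\) are arbitrary choices, not taken from the notes) evaluates the Binomial-Beta pmf directly and compares it with a Monte Carlo approximation obtained by drawing \(\theta\) from the \(Beta(a, b)\) posterior and then \(Z \, | \, \theta\) from \(Bin(m, \theta)\).

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_binom(m, z):
    # log of the binomial coefficient C(m, z)
    return gammaln(m + 1) - gammaln(z + 1) - gammaln(m - z + 1)

def binomial_beta_pmf(z, m, a, b):
    # predictive pmf (2.4): C(m, z) B(a + z, b + m - z) / B(a, b)
    return np.exp(log_binom(m, z) + betaln(a + z, b + m - z) - betaln(a, b))

# arbitrary illustrative values
alpha, beta, n, x, m = 2.0, 3.0, 10, 6, 5
a, b = alpha + x, beta + n - x

z = np.arange(m + 1)
exact = binomial_beta_pmf(z, m, a, b)

# Monte Carlo: theta | x ~ Beta(a, b), then Z | theta ~ Bin(m, theta)
rng = np.random.default_rng(1)
theta = rng.beta(a, b, size=200_000)
z_sim = rng.binomial(m, theta)
mc = np.bincount(z_sim, minlength=m + 1) / len(z_sim)

print(np.round(exact, 4))
print(np.round(mc, 4))      # agrees with the exact pmf up to Monte Carlo error
print(exact.sum())          # sums to 1
```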
The predictive distribution can be difficult to calculate but predictive summaries are often easily available. In particular we can apply generalisations of the results that for random variables \(X\) and \(Y\) we have \(E(X) = E(E(X \, | \, Y))\) and \(Var(X) = Var(E(X \, | \, Y)) + E(Var(X \, | \, Y))\) (see Question Sheet One Exercise 5) to the predictive distribution.
Lemma 2.1 If \(X\), \(Z\) and \(\theta\) are three random variables with \(X\) and \(Z\) conditionally independent given \(\theta\) then \[\begin{eqnarray} E(Z \, | \, X) & = & E(E(Z \, | \, \theta) \, | \, X), \tag{2.5} \\ Var(Z \, | \, X) & = & Var(E(Z \, | \, \theta) \, | \, X) + E(Var(Z \, | \, \theta) \, | \, X). \tag{2.6} \end{eqnarray}\]
Proof: From Question Sheet One Exercise 5, conditioning everywhere, we have \[\begin{eqnarray} E(Z \, | \, X) & = & E(E(Z \, | \, \theta, X) \, | \, X), \tag{2.7} \\ Var(Z \, | \, X) & = & Var(E(Z \, | \, \theta, X) \, | \, X) + E(Var(Z \, | \, \theta, X) \, | \, X). \tag{2.8} \end{eqnarray}\]
Now as \(X\) and \(Z\) are conditionally independent given \(\theta\) then \(E(Z \, | \, \theta, X) = E(Z \, | \, \theta)\) and \(Var(Z \, | \, \theta, X) = Var(Z \, | \, \theta)\). Substituting these into (2.7) and (2.8) gives (2.5) and (2.6). \(\Box\)
Observe that \(E(Z \, | \, \theta)\) and \(Var(Z \, | \, \theta)\) are computed using \(f(z \, | \, \theta)\) and are both functions of \(\theta\), \(g_{1}(\theta)\) and \(g_{2}(\theta)\) respectively say. We then obtain \(E(g_{1}(\theta) \, | \, X)\), \(Var(g_{1}(\theta) \, | \, X)\) and \(E(g_{2}(\theta) \, | \, X)\) using \(f(\theta \, | \, x)\). These were also the distributions we needed, see (2.2), to compute \(f(z \, | \, x)\). The conditional independence of \(X\) and \(Z\) given \(\theta\) means that any calculation involving \(X\), \(Z\) and \(\theta\) can be performed using calculations between \(X\) and \(\theta\) only and \(Z\) and \(\theta\) only: by exploiting independence structure we can reduce the dimensionality of our problems.
This is the basis behind what is known as local computation and is often exploited in highly complex models, specifically in Bayesian networks.
Example 2.3 From Example 2.2 we consider the predictive expectation of \(Z\) given \(x\). As \(Z \, | \, \theta \sim Bin(m, \theta)\) then \(E(Z \, | \, \theta) = m \theta\) whilst as \(\theta \, | \, x \sim Beta(a, b)\) then \(E(\theta \, | \, X) = \frac{a}{a+b}\). Hence, we have that \[\begin{eqnarray} E(Z \, | \, X) & = & E(E(Z \, | \, \theta) \, | \, X) \nonumber \\ & = & E(m\theta \, | \, X) \nonumber \\ & = & m\frac{a}{a+b} \ = \ m\frac{\alpha + x}{\alpha + \beta + n}. \nonumber \end{eqnarray}\]
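A minimal sketch of the same calculation (with the same arbitrary illustrative values as in the previous sketch): \(E(Z \, | \, x)\) from (2.5), and \(Var(Z \, | \, x)\) from (2.6) using \(Var(E(Z \, | \, \theta) \, | \, x) = m^{2}Var(\theta \, | \, x)\) and \(E(Var(Z \, | \, \theta) \, | \, x) = m E(\theta(1-\theta) \, | \, x)\), checked against simulation.

```python
import numpy as np

# arbitrary illustrative values
alpha, beta, n, x, m = 2.0, 3.0, 10, 6, 5
a, b = alpha + x, beta + n - x

# posterior summaries of theta | x ~ Beta(a, b)
post_mean = a / (a + b)
post_var = a * b / ((a + b) ** 2 * (a + b + 1))
post_theta_1m_theta = a * b / ((a + b) * (a + b + 1))   # E(theta(1 - theta) | x)

# Lemma 2.1: E(Z|x) = E(m theta | x); Var(Z|x) = Var(m theta | x) + E(m theta(1-theta) | x)
pred_mean = m * post_mean
pred_var = m**2 * post_var + m * post_theta_1m_theta

# Monte Carlo check
rng = np.random.default_rng(2)
theta = rng.beta(a, b, size=200_000)
z_sim = rng.binomial(m, theta)
print(pred_mean, z_sim.mean())
print(pred_var, z_sim.var())
```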
The modelling in Examples 2.2 and 2.3 raises an interesting question. \(X\) and \(Z\) are not independent and nor would we expect them to be: if we don’t know \(\theta\) then we expect observing \(x\) to be informative for \(Z\).
e.g. When tossing a coin, if we don’t know whether or not the coin is fair then an initial sequence of \(n\) tosses will be informative for a future sequence of \(m\) tosses.
Consider such a model from a classical perspective (there is an interesting side question as to how to perform prediction in this case, as \(f(z \, | \, x)\) does not explicitly depend upon any parameter \(\theta\)). We would view \(X\) and \(Z\) as comprising a random sample, that is as being independent and identically distributed. As we can see from the prediction of \(Z\) given \(x\), this is a slightly misleading statement: they are only independent conditional on the parameter \(\theta\).
2.2 Exchangeability
The concept of exchangeability, introduced by Bruno de Finetti (1906-1985) in the 1930s, is the basic modelling tool within Bayesian statistics. One of the key differences between the classical and Bayesian schools is that observations the former would treat as independent are treated as exchangeable by the latter.
Definition 2.1 (Finite exchangeability) The random variables \(X_{1}, \ldots, X_{n}\) are judged to be finitely exchangeable if their joint density function satisfies \[\begin{eqnarray} f(x_{1}, \ldots, x_{n}) & = & f(x_{\pi(1)}, \ldots, x_{\pi(n)}) \nonumber \end{eqnarray}\] for all permutations \(\pi\) defined on the set \(\{1, \ldots, n\}\).
Example 2.4 \(X_{1}\) and \(X_{2}\) are finitely exchangeable if \(f(x_{1}, x_{2}) = f(x_{2}, x_{1})\). \(X_{1}\), \(X_{2}\) and \(X_{3}\) are finitely exchangeable if \(f(x_{1}, x_{2}, x_{3}) = f(x_{1}, x_{3}, x_{2}) = f(x_{2}, x_{1}, x_{3}) = f(x_{2}, x_{3}, x_{1}) = f(x_{3}, x_{1}, x_{2}) = f(x_{3}, x_{2}, x_{1})\).
In essence, exchangeability captures the notion that only the values of the observations matter and not the order in which they were obtained. The labels are uninformative. Exchangeability is a stronger statement than identically distributed but weaker than independent and identically distributed. Independent and identically distributed observations are exchangeable as \[\begin{eqnarray} f(x_{1}, \ldots, x_{n}) & = & \prod_{i=1}^{n} f(x_{i}), \nonumber \end{eqnarray}\] which is invariant under permutation of the labels. However, exchangeable observations need not be independent.
Example 2.5 Suppose that \(X_{1}\) and \(X_{2}\) have the joint density function \[\begin{eqnarray} f(x_{1}, x_{2}) & = & \frac{3}{2}(x_{1}^{2} + x_{2}^{2}) \nonumber \end{eqnarray}\] for \(0 < x_{1} < 1\) and \(0 < x_{2} < 1\). Then \(X_{1}\) and \(X_{2}\) are (finitely) exchangeable, since \(f(x_{1}, x_{2}) = f(x_{2}, x_{1})\), but they are not independent: the marginal density of \(X_{1}\) is \(f(x_{1}) = \frac{3}{2}x_{1}^{2} + \frac{1}{2}\), and similarly for \(X_{2}\), so that \(f(x_{1})f(x_{2}) \neq f(x_{1}, x_{2})\).
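A short symbolic check of Example 2.5 (a sketch using sympy): the joint density is symmetric in its arguments, but the product of the marginal densities does not equal the joint density.

```python
import sympy as sp

x1, x2 = sp.symbols('x1 x2', positive=True)

def joint(u, v):
    # joint density of Example 2.5 on (0,1) x (0,1)
    return sp.Rational(3, 2) * (u**2 + v**2)

# exchangeability: f(x1, x2) - f(x2, x1) is identically zero
print(sp.simplify(joint(x1, x2) - joint(x2, x1)))   # 0

# marginal densities on (0, 1)
f1 = sp.integrate(joint(x1, x2), (x2, 0, 1))        # 3*x1**2/2 + 1/2
f2 = sp.integrate(joint(x1, x2), (x1, 0, 1))        # 3*x2**2/2 + 1/2

# not independent: the product of the marginals is not the joint density
print(sp.simplify(joint(x1, x2) - f1 * f2))          # not identically zero
```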
Definition 2.2 (Infinite exchangeability) The infinite sequence of random variables \(X_{1}\), \(X_{2}\), \(\ldots\) are judged to be infinitely exchangeable if every finite subsequence is judged finitely exchangeable.
A natural question is whether every finitely exchangeable sequence can be embedded into, or extended to, an infinitely exchangeable sequence.
Example 2.6 Suppose that \(X_{1}, X_{2}, X_{3}\) are three finitely exchangeable events with \[\begin{eqnarray} P(X_{1} = 0, X_{2} = 1, X_{3} = 1) & = & P(X_{1} = 1, X_{2} = 0, X_{3} = 1) \nonumber \\ & = & P(X_{1} = 1, X_{2} = 1, X_{3} = 0) \ = \ \frac{1}{3} \tag{2.9} \end{eqnarray}\] with all other combinations having probability 0. Is there an \(X_{4}\) such that \(X_{1}, X_{2}, X_{3}, X_{4}\) are finitely exchangeable? If there is then, for example, \[\begin{eqnarray} P(X_{1} = 0, X_{2} = 1, X_{3} = 1, X_{4} = 0) & = & P(X_{1} = 0, X_{2} = 0, X_{3} = 1, X_{4} = 1). \tag{2.10} \end{eqnarray}\] Now, assuming \(X_{1}, X_{2}, X_{3}, X_{4}\) are finitely exchangeable, \[\begin{eqnarray} & & P(X_{1} = 0, X_{2} = 1, X_{3} = 1, X_{4} = 0) \ = \ P(X_{1} = 0, X_{2} = 1, X_{3} = 1) \nonumber \\ & & \hspace{7cm} \ - P(X_{1} = 0, X_{2} = 1, X_{3} = 1, X_{4} = 1) \nonumber \\ & & \hspace{5cm} \ = \ \frac{1}{3} - P(X_{1} = 0, X_{2} = 1, X_{3} = 1, X_{4} = 1) \tag{2.11} \\ & & \hspace{5cm} \ = \ \frac{1}{3} - P(X_{1} = 1, X_{2} = 1, X_{3} = 1, X_{4} = 0) \tag{2.12} \end{eqnarray}\] where (2.11) follows from (2.9) and (2.12) from finite exchangeability. Note that \[\begin{eqnarray} P(X_{1} = 1, X_{2} = 1, X_{3} = 1, X_{4} = 0) & = & P(X_{1} = 1, X_{2} = 1, X_{3} = 1) \nonumber \\ & & - P(X_{1} = 1, X_{2} = 1, X_{3} = 1, X_{4} = 1) \nonumber \\ & \leq & P(X_{1} = 1, X_{2} = 1, X_{3} = 1) \ = \ 0. \tag{2.13} \end{eqnarray}\] Substituting (2.13) into (2.12) gives \[\begin{eqnarray} P(X_{1} = 0, X_{2} = 1, X_{3} = 1, X_{4} = 0) & = & \frac{1}{3}. \tag{2.14} \end{eqnarray}\] However, \[\begin{eqnarray} P(X_{1} = 0, X_{2} = 0, X_{3} = 1, X_{4} = 1) & = & P(X_{1} = 0, X_{2} = 0, X_{3} = 1) \nonumber \\ & & - P(X_{1} = 0, X_{2} = 0, X_{3} = 1, X_{4} = 0) \nonumber \\ & \leq & P(X_{1} = 0, X_{2} = 0, X_{3} = 1) \ = \ 0. \tag{2.15} \end{eqnarray}\] So, from (2.14) and (2.15) we observe that (2.10) does not hold: a contradiction to the assumption of finite exchangeability of \(X_{1}, X_{2}, X_{3}, X_{4}\).
Example 2.6 thus shows that a finitely exchangeable sequence cannot necessarily be embedded even into a larger finitely exchangeable sequence, let alone an infinitely exchangeable one.
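The impossibility in Example 2.6 can also be checked numerically. Under exchangeability of \(X_{1}, \ldots, X_{4}\) the probability of any particular 0-1 sequence depends only on its number of ones, say \(q_{k}\) for \(k\) ones, and marginalising over \(X_{4}\) must reproduce (2.9). The sketch below (illustrative, using scipy's linear programming routine purely as a feasibility checker) confirms that no non-negative \(q_{0}, \ldots, q_{4}\) satisfy these constraints.

```python
import numpy as np
from scipy.optimize import linprog

# q[k] = P(a particular 0-1 sequence of length 4 with k ones), k = 0,...,4.
# Marginalising X4 out, a particular length-3 sequence with j ones has
# probability q[j] + q[j+1]; (2.9) requires this to equal 1/3 for j = 2 and 0 otherwise.
A_eq = np.array([
    [1, 1, 0, 0, 0],   # j = 0
    [0, 1, 1, 0, 0],   # j = 1
    [0, 0, 1, 1, 0],   # j = 2
    [0, 0, 0, 1, 1],   # j = 3
])
b_eq = np.array([0.0, 0.0, 1/3, 0.0])

res = linprog(c=np.zeros(5), A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * 5)   # pure feasibility problem
print(res.success)   # False: no exchangeable extension to X1,...,X4 exists
```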
Note that it is common practice to term a sequence simply as exchangeable and leave it to context whether this means finitely or infinitely exchangeable.
Theorem 2.1 (Representation theorem for 0-1 random variables) Let \(X_{1}, X_{2}, \ldots\) be a sequence of infinitely exchangeable 0-1 random variables (i.e. events). Then the joint distribution of \(X_{1}, \ldots, X_{n}\) has an integral representation of the form \[\begin{eqnarray} f(x_{1}, \ldots, x_{n}) & = & \int_{0}^{1} \left\{\prod_{i=1}^{n} \theta^{x_{i}}(1-\theta)^{1-x_{i}}\right\}f(\theta) \, d\theta \nonumber \end{eqnarray}\] where \(f(\theta)\) is a distribution for \(\theta = \lim_{n \rightarrow \infty} \frac{y_{n}}{n}\), the limiting relative frequency, with \(y_{n} = \sum_{i=1}^{n} x_{i}\).
The interpretation of this theorem, often termed de Finetti’s theorem, is of profound significance. It is as if
Conditional upon a random variable \(\theta\), the \(X_{i}\) are judged to be independent Bernoulli random variables.
\(\theta\) itself is assigned a probability distribution \(f(\theta)\).
\(\theta = \lim_{n \rightarrow \infty} \frac{y_{n}}{n}\) so that \(f(\theta)\) represents beliefs about the limiting value of the mean of the \(X_{i}\)s.
So, conditional upon \(\theta\), \(X_{1}, \ldots, X_{n}\) are a random sample from a Bernoulli distribution with parameter \(\theta\) generating a parametrised joint sampling distribution \[\begin{eqnarray} f(x_{1}, \ldots, x_{n} \, | \, \theta) & = & \prod_{i=1}^{n} f(x_{i} \, | \, \theta) \nonumber \\ & = & \prod_{i=1}^{n} \theta^{x_{i}}(1-\theta)^{1-x_{i}} \nonumber \\ & = & \theta^{y_{n}}(1-\theta)^{n-y_{n}} \nonumber \end{eqnarray}\] where the parameter is assigned a prior distribution \(f(\theta)\).
Theorem 2.1 provides a justification for the Bayesian approach of combining a likelihood and a prior.
Notice that \(\theta\) is provided with a formal definition and the phrase “the true value of \(\theta\)” has a well-defined meaning.
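To illustrate Theorem 2.1, the sketch below (with an arbitrary \(Beta(2, 5)\) choice for \(f(\theta)\), not part of the theorem) generates an exchangeable binary sequence by first drawing \(\theta\) and then independent Bernoulli trials given \(\theta\). The relative frequency \(y_{n}/n\) settles down to the drawn value of \(\theta\), and the joint probability \(B(\alpha + y_{n}, \beta + n - y_{n})/B(\alpha, \beta)\) obtained from the integral representation depends on the data only through \(y_{n}\), so any permutation of the sequence has the same probability.

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(3)
alpha, beta = 2.0, 5.0          # arbitrary illustrative prior f(theta)

theta = rng.beta(alpha, beta)   # draw the "limiting relative frequency"
n = 10_000
x = rng.binomial(1, theta, size=n)

# y_n / n approaches the drawn theta as n grows
for k in (10, 100, 1_000, 10_000):
    print(k, x[:k].mean())
print("theta =", theta)

# joint probability of the first 10 observations under the representation:
# it depends only on the number of ones, so permuting the sequence changes nothing
def joint_prob(seq):
    y = seq.sum()
    m = len(seq)
    return np.exp(betaln(alpha + y, beta + m - y) - betaln(alpha, beta))

seq = x[:10]
print(joint_prob(seq), joint_prob(rng.permutation(seq)))
```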
Theorem 2.2 (General representation theorem - simplified form) If \(X_{1}, X_{2}, \ldots\) is an infinitely exchangeable sequence of random variables then the joint distribution of \(X_{1}, \ldots, X_{n}\) has an integral representation of the form \[\begin{eqnarray} f(x_{1}, \ldots, x_{n}) & = & \int_{\theta} \left\{\prod_{i=1}^{n} f(x_{i} \, | \, \theta)\right\} f(\theta) \, d\theta \nonumber \end{eqnarray}\] where \(\theta\) is the limit as \(n \rightarrow \infty\) of some function of the observations \(x_{1}, \ldots, x_{n}\) and \(f(\theta)\) is a distribution over \(\theta\).
Points to note.
In full generality \(\theta\) is an unknown distribution function (cdf) in the infinite dimensional space of all possible distribution functions so is, in effect, an infinite dimensional parameter.
Typically we assign probability 1 to the event that \(\theta\) lies in a particular parametric family of distributions, obtaining the familiar representation above.
The representation theorem is an existence theorem: it generally does not specify the model \(f(x_{i} \, | \, \theta)\) (Theorem 2.1 provides an example when the model is specified) and it never specifies the prior \(f(\theta)\).
The crux is that infinitely exchangeable random variables (and not just events) may be viewed as being conditionally independent given a parameter \(\theta\), providing a justification for the Bayesian approach.
Attempts to model \(\theta\) using unknown distribution functions are closely related to the vibrant research area of Bayesian nonparametrics and the Dirichlet process.
Example 2.7 An extension of Example 1.5. Let \(X_{1}, \ldots, X_{n}\) be a finite subset of a sequence of infinitely exchangeable random variables (this rather wordy description is often just shortened to “exchangeable”) which are normally distributed with unknown mean \(\theta\) and known variance \(\sigma^{2}\). Thus, from Theorem 2.2, we can assume that \(X_{i} \, | \, \theta \sim N(\theta, \sigma^{2})\) and that, conditional upon \(\theta\), the \(X_{i}\) are independent. Hence, specification of a prior distribution for \(\theta\) will allow us to obtain the joint distribution of \(X_{1}, \ldots, X_{n}\). Suppose that our prior beliefs about \(\theta\) can be expressed by \(\theta \sim N(\mu_{0}, \sigma_{0}^{2})\) for known constants \(\mu_{0}\) and \(\sigma_{0}^{2}\). We compute the posterior distribution of \(\theta\) given \(x = (x_{1}, \ldots, x_{n})\). The likelihood is \[\begin{eqnarray} f(x \, | \, \theta) & = & \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left\{-\frac{1}{2\sigma^{2}} (x_{i} - \theta)^{2}\right\} \nonumber \\ & \propto & \prod_{i=1}^{n} \exp \left\{-\frac{1}{2\sigma^{2}} (\theta^{2} - 2 x_{i}\theta)\right\} \nonumber \\ & = & \exp \left\{-\frac{n}{2\sigma^{2}} (\theta^{2} - 2 \bar{x}\theta)\right\}. \tag{2.16} \end{eqnarray}\] Notice that, when viewed as a function of \(\theta\), (2.16) has the form of a normal kernel (a kernel of the \(N(\bar{x}, \frac{\sigma^{2}}{n})\) distribution). The prior satisfies \[\begin{eqnarray} f(\theta) & \propto & \exp \left\{-\frac{1}{2\sigma_{0}^{2}} (\theta^{2} - 2 \mu_{0}\theta)\right\} \nonumber \end{eqnarray}\] so that the posterior satisfies \[\begin{eqnarray} & & f(\theta \, | \, x) \ \propto \ \exp \left\{-\frac{n}{2\sigma^{2}} (\theta^{2} - 2 \bar{x}\theta)\right\} \exp \left\{-\frac{1}{2\sigma_{0}^{2}} (\theta^{2} - 2 \mu_{0}\theta)\right\} \nonumber \\ & & \ = \ \exp\left\{-\frac{1}{2}\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_{0}^{2}}\right)\left[\theta^{2} - 2\left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_{0}^{2}}\right)^{-1}\left(\frac{n}{\sigma^{2}}\bar{x} + \frac{1}{\sigma_{0}^{2}}\mu_{0}\right)\theta\right]\right\}. \tag{2.17} \end{eqnarray}\] We recognise (2.17) as the kernel of a Normal density so that \(\theta \, | \, x \sim N(\mu_{n}, \sigma_{n}^{2})\) where \[\begin{eqnarray} \frac{1}{\sigma_{n}^{2}} & = & \frac{n}{\sigma^{2}} + \frac{1}{\sigma_{0}^{2}}, \tag{2.18} \\ \mu_{n} & = & \left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_{0}^{2}}\right)^{-1}\left(\frac{n}{\sigma^{2}}\bar{x} + \frac{1}{\sigma_{0}^{2}}\mu_{0}\right). \tag{2.19} \end{eqnarray}\] Notice the similarity of these values with those in Example 1.5, (1.18) and (1.19). The posterior precision, \(Var^{-1}(\theta \, | \, X)\), is the sum of the data precision, \(Var^{-1}(\bar{X} \, | \, \theta)\), and the prior precision, \(Var^{-1}(\theta)\). The posterior mean, \(E(\theta \, | \, X)\), is a weighted average of the data mean, \(\bar{x}\), and the prior mean, \(E(\theta)\), weighted according to the corresponding precisions. Observe that weak prior information is represented by a large prior variance. Letting \(\sigma_{0}^{2} \rightarrow \infty\) we note that \(\mu_{n} \rightarrow \bar{x}\) and \(\sigma_{n}^{2} \rightarrow \frac{\sigma^{2}}{n}\): the familiar classical model.
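A small sketch of the update (2.18)-(2.19), with made-up data and hyperparameters (all values are arbitrary choices), checked against a brute-force posterior computed on a grid.

```python
import numpy as np
from scipy.stats import norm

def normal_posterior(xbar, n, sigma2, mu0, sigma02):
    # posterior precision and mean from (2.18) and (2.19)
    prec_n = n / sigma2 + 1 / sigma02
    mu_n = (n / sigma2 * xbar + 1 / sigma02 * mu0) / prec_n
    return mu_n, 1 / prec_n

# made-up data and hyperparameters
rng = np.random.default_rng(4)
sigma2, mu0, sigma02 = 4.0, 0.0, 10.0
x = rng.normal(1.5, np.sqrt(sigma2), size=20)

mu_n, sigma_n2 = normal_posterior(x.mean(), len(x), sigma2, mu0, sigma02)
print(mu_n, sigma_n2)

# brute-force check: likelihood x prior on a grid, normalised numerically
theta = np.linspace(-5, 5, 2001)
log_post = norm.logpdf(x[:, None], loc=theta, scale=np.sqrt(sigma2)).sum(axis=0) \
           + norm.logpdf(theta, loc=mu0, scale=np.sqrt(sigma02))
post = np.exp(log_post - log_post.max())
post /= np.trapz(post, theta)
print(np.trapz(theta * post, theta))              # approximately mu_n
print(np.trapz((theta - mu_n)**2 * post, theta))  # approximately sigma_n2
```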
2.3 Sufficiency, exponential families and conjugacy
Definition 2.3 (Sufficiency) A statistic \(t(X)\) is said to be sufficient for \(X\) for learning about \(\theta\) if we can write \[\begin{eqnarray} f(x \, | \, \theta) & = & g(t, \theta)h(x) \tag{2.20} \end{eqnarray}\] where \(g(t, \theta)\) depends upon \(t(x)\) and \(\theta\) and \(h(x)\) does not depend upon \(\theta\) but may depend upon \(x\).
Equivalent statements to (2.20) are
\(f(x \, | \, t, \theta)\) does not depend upon \(\theta\) so that \(f(x \, | \, t, \theta) = f(x \, | \, t)\).
\(f(\theta \, | \, x, t)\) does not depend upon \(x\) so that \(f(\theta \, | \, x, t) = f(\theta \, | \, t)\).
Sufficiency represents the notion that given \(t(x)\) nothing further can be learnt about \(\theta\) from additionally observing \(x\): \(\theta\) and \(x\) are conditionally independent given \(t\).
Example 2.8 In Example 2.7, we have, from (2.16) and reinstating the constants of proportionality, \[\begin{eqnarray} f(x \, | \, \theta) & = & \exp \left\{-\frac{n}{2\sigma^{2}} (\theta^{2} - 2 \bar{x}\theta)\right\} \times \left(\frac{1}{\sqrt{2\pi}\sigma}\right)^{n} \exp \left\{-\frac{1}{2\sigma^{2}} \sum_{i=1}^{n} x_{i}^{2}\right\} \nonumber \\ & = & g(\bar{x}, \theta)h(x). \nonumber \end{eqnarray}\] Hence, \(\bar{X}\) is sufficient for \(X = (X_{1}, \ldots, X_{n})\) for learning about \(\theta\).
Definition 2.4 (k-parameter exponential family) A probability density \(f(x \, | \, \theta)\), \(\theta = (\theta_{1}, \ldots, \theta_{k})\), is said to belong to the \(k\)-parameter exponential family if it is of the form \[\begin{eqnarray} f(x \, | \, \theta) & = & Ef_{k}(x \, | \, g, h, u, \phi, \theta) \nonumber \\ & = & \exp \left\{\sum_{j=1}^{k} \phi_{j}(\theta) u_{j}(x) + g(\theta) + h(x) \right\} \tag{2.21} \end{eqnarray}\] where \(\phi(\theta) = (\phi_{1}(\theta), \ldots, \phi_{k}(\theta))\) and \(u(x) = (u_{1}(x), \ldots, u_{k}(x))\). The family is regular if the sample space of \(X\) does not depend upon \(\theta\), otherwise it is non-regular.
Equation (2.21) will prove to be the most useful form for our purposes. In MA40092 you saw it expressed as \(f(x \, | \, \theta) = \tilde{g}(\theta)\tilde{h}(x)\exp \left\{\sum_{j=1}^{k} \phi_{j}(\theta) u_{j}(x)\right\}\) where \(\tilde{g}(\theta) = \exp\{g(\theta)\}\) and \(\tilde{h}(x) = \exp\{h(x)\}\).
An example of a non-regular family is the Uniform distribution, see Question 4 of Question Sheet Three.
Example 2.9 Suppose that \(X \, | \, \theta \sim Bernoulli(\theta)\). Then \[\begin{eqnarray} f(x \, | \, \theta) & = & \theta^{x}(1-\theta)^{1-x} \nonumber \\ & = & \exp\left\{x\log \theta + (1-x)\log (1-\theta)\right\} \nonumber \\ & = & \exp\left\{\left(\log \frac{\theta}{1-\theta}\right)x + \log (1-\theta) \right\} \nonumber \end{eqnarray}\] Hence, this is the \(1\)-parameter regular exponential family with \(\phi_{1}(\theta) = \log \frac{\theta}{1-\theta}\), \(u_{1}(x) = x\), \(g(\theta) = \log (1-\theta)\) and \(h(x) = 0\).
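As a quick numerical check of Example 2.9 (illustrative only), the exponential family form with \(\phi_{1}(\theta) = \log \frac{\theta}{1-\theta}\), \(u_{1}(x) = x\), \(g(\theta) = \log(1-\theta)\) and \(h(x) = 0\) reproduces the Bernoulli pmf for each \(x \in \{0, 1\}\).

```python
import numpy as np

def bernoulli_pmf(x, theta):
    return theta**x * (1 - theta)**(1 - x)

def ef_form(x, theta):
    # exp{ phi_1(theta) u_1(x) + g(theta) + h(x) } with h(x) = 0
    return np.exp(np.log(theta / (1 - theta)) * x + np.log(1 - theta))

for theta in (0.2, 0.5, 0.9):
    for x in (0, 1):
        print(x, theta, bernoulli_pmf(x, theta), ef_form(x, theta))  # identical values
```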
Proposition 2.1 If \(X_{1}, \ldots, X_{n}\) is an exchangeable sequence such that, given a regular \(k\)-parameter exponential family \(Ef_{k}(\cdot \, | \, \cdot)\), \[\begin{eqnarray} f(x_{1}, \ldots, x_{n}) & = & \int_{\theta} \left\{\prod_{i=1}^{n} Ef_{k}(x_{i} \, | \, g, h, u, \phi, \theta)\right\} f(\theta) \, d\theta \nonumber \end{eqnarray}\] then \(t_{n} = t_{n}(X_{1}, \ldots, X_{n}) = [n, \sum_{i=1}^{n} u_{1}(X_{i}), \ldots, \sum_{i=1}^{n} u_{k}(X_{i})]\), \(n = 1, 2, \ldots\) is a sequence of sufficient statistics.
Proof: From the representation, we have \[\begin{eqnarray} f(x \, | \, \theta) & = & \prod_{i=1}^{n} Ef_{k}(x_{i} \, | \, g, h, u, \phi, \theta) \nonumber \\ & = & \prod_{i=1}^{n} \exp \left\{\sum_{j=1}^{k} \phi_{j}(\theta) u_{j}(x_{i}) + g(\theta) + h(x_{i}) \right\} \nonumber \\ & = & \exp \left\{\sum_{j=1}^{k} \phi_{j}(\theta) \left(\sum_{i=1}^{n} u_{j}(x_{i})\right) + ng(\theta) + \sum_{i=1}^{n} h(x_{i}) \right\} \nonumber \\ & = & \exp \left\{\sum_{j=1}^{k} \phi_{j}(\theta) \left(\sum_{i=1}^{n} u_{j}(x_{i})\right) + ng(\theta)\right\} \times \exp \left\{\sum_{i=1}^{n} h(x_{i}) \right\} \nonumber \\ & = & \tilde{g}(t_{n}, \theta)\tilde{h}(x) \nonumber \end{eqnarray}\] where \(\tilde{g}(t_{n}, \theta)\) is a function of \(t_{n}\) and \(\theta\) and \(\tilde{h}(x)\) a function of \(x\). Hence \(t_{n}\) is sufficient for \(X = (X_{1}, \ldots, X_{n})\) for learning about \(\theta\). \(\Box\)
Example 2.10 Let \(X_{1}, \ldots, X_{n}\) be an exchangeable sequence with \(X_{i} \, | \, \theta \sim Bernoulli(\theta)\). Then, from Example 2.9, with \(x = (x_{1}, \ldots, x_{n})\) \[\begin{eqnarray} f(x \, | \, \theta) & = & \prod_{i=1}^{n} \exp\left\{\left(\log \frac{\theta}{1-\theta}\right)x_{i} + \log (1-\theta) \right\}.\nonumber \end{eqnarray}\] For this 1-parameter exponential family we have \(u_{1}(x_{i}) = x_{i}\) and \(\sum_{i=1}^{n} u_{1}(x_{i}) = \sum_{i=1}^{n} x_{i}\) so that, from Proposition 2.1, \(t_{n} = (n , \sum_{i=1}^{n} X_{i})\) is sufficient for \(X_{1}, \ldots, X_{n}\) for learning about \(\theta\).
2.3.1 Exponential families and conjugate priors
If \(f(x \, | \, \theta)\) is a member of a \(k\)-parameter exponential family, it is easy to observe that a conjugate prior can be found in the \((k+1)\)-parameter exponential family. Regarding \(f(x \, | \, \theta)\) as a function of \(\theta\), notice that we can express (2.21) as \[\begin{eqnarray} f(x \, | \, \theta) & = & \exp \left\{\sum_{j=1}^{k} u_{j}(x) \phi_{j}(\theta) + g(\theta) + h(x) \right\} \tag{2.22} \end{eqnarray}\] which can be viewed as an exponential family over \(\theta\). If we take as our prior the following \((k+1)\)-parameter exponential family over \(\theta\) \[\begin{eqnarray} f(\theta) & = & \exp \left\{\sum_{j=1}^{k} a_{j} \phi_{j}(\theta) + dg(\theta) + c(a, d) \right\} \tag{2.23} \end{eqnarray}\] where \(a = (a_{1}, \ldots, a_{k})\) and \(c(a, d)\) is a normalising constant so that \[\begin{eqnarray} c(a, d) & = & - \log \int_{\theta} \exp \left\{\sum_{j=1}^{k} a_{j} \phi_{j}(\theta) + dg(\theta) \right\} \, d\theta, \tag{2.24} \end{eqnarray}\] then, from (2.22) and (2.23), our posterior satisfies \[\begin{eqnarray} f(\theta \, | \, x) & \propto & \exp \left\{\sum_{j=1}^{k} u_{j}(x) \phi_{j}(\theta) + g(\theta) + h(x) \right\}\exp \left\{\sum_{j=1}^{k} a_{j} \phi_{j}(\theta) + dg(\theta) + c(a, d) \right\} \nonumber \\ & \propto & \exp \left\{\sum_{j=1}^{k} [a_{j} + u_{j}(x)]\phi_{j}(\theta) + (d+1)g(\theta)\right\}. \tag{2.25} \end{eqnarray}\] Notice that, up to constants of proportionality, (2.25) has the same form as (2.23). Indeed if we let \(\tilde{a}_{j} = a_{j} + u_{j}(x)\) and \(\tilde{d} = d + 1\) then we can express the posterior distribution as \[\begin{eqnarray} f(\theta \, | \, x) & = & \exp \left\{\sum_{j=1}^{k} \tilde{a}_{j} \phi_{j}(\theta) + \tilde{d}g(\theta) + c(\tilde{a}, \tilde{d}) \right\} \tag{2.26} \end{eqnarray}\] where \(\tilde{a} = (\tilde{a}_{1}, \ldots, \tilde{a}_{k})\) and \(c(\tilde{a}, \tilde{d})\) is a normalising constant, equivalent to (2.24) but with \(\tilde{a}_{j}\) for \(a_{j}\) and \(\tilde{d}\) for \(d\). Thus, we have that the \((k+1)\)-parameter exponential family is a conjugate family with respect to the \(k\)-parameter exponential family likelihood. In this case, we talk about the natural conjugate prior.
It should be clear, using a similar approach to Proposition 2.1, that this result extends easily to the case when \(X_{1}, \ldots, X_{n}\) is exchangeable with each \(X_{i} \, | \, \theta\) a member of the \(k\)-parameter exponential family.
Example 2.11 We find the natural conjugate prior for \(X \, | \, \theta \sim Bernoulli(\theta)\). From Example 2.9 we have that \[\begin{eqnarray} f(x \, | \, \theta) & = & \exp\left\{\left(x\log \frac{\theta}{1-\theta}\right) + \log (1-\theta) \right\} \nonumber \end{eqnarray}\] so that we take a prior of the form \[\begin{eqnarray} f(\theta) & \propto & \exp\left\{\left(a\log \frac{\theta}{1-\theta}\right) + d\log (1-\theta) \right\} \nonumber \\ & = & \left(\frac{\theta}{1-\theta}\right)^{a}(1 - \theta)^{d} \nonumber \\ & = & \theta^{a}(1-\theta)^{d-a} \nonumber \end{eqnarray}\] which is a kernel of a Beta distribution. To obtain the familiar parametrisation we take \(a = a(\alpha, \beta) = \alpha - 1\) and \(d = d(\alpha, \beta) = \beta + \alpha -2\). The likelihood has one parameter, \(\theta\), so the natural conjugate prior has two parameters, \(\alpha\) and \(\beta\).
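A sketch of the general update \(\tilde{a}_{j} = a_{j} + u_{j}(x)\), \(\tilde{d} = d + 1\) from (2.25)-(2.26) for the Bernoulli case of Example 2.11 (the hyperparameter values are arbitrary), confirming that it reproduces the usual Beta update \(\alpha \rightarrow \alpha + x\), \(\beta \rightarrow \beta + 1 - x\) for a single observation.

```python
def beta_to_natural(alpha, beta):
    # hyperparameters of the natural conjugate form (2.23) for the Bernoulli likelihood
    return alpha - 1, alpha + beta - 2          # (a, d)

def natural_to_beta(a, d):
    # invert a = alpha - 1, d = alpha + beta - 2
    return a + 1, d - a + 1                     # (alpha, beta)

def update(a, d, x):
    # conjugate update: a -> a + u_1(x) = a + x, d -> d + 1
    return a + x, d + 1

alpha, beta = 2.0, 3.0                           # arbitrary prior hyperparameters
a, d = beta_to_natural(alpha, beta)
for x in (0, 1):
    print(x, natural_to_beta(*update(a, d, x)))  # gives (alpha + x, beta + 1 - x)
```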
We term the \((k+1)\)-parameter exponential family the natural conjugate prior to the \(k\)-parameter exponential family likelihood because the \((k+1+s)\)-parameter exponential family \[\begin{eqnarray} f(\theta) & = & \exp \left\{\sum_{j=1}^{k} a_{j} \phi_{j}(\theta) + dg(\theta) + \sum_{r=1}^{s} e_{r}E_{r}(\theta) + c(a, d, e) \right\}, \nonumber \end{eqnarray}\] where \(e = (e_{1}, \ldots, e_{s})\) and \(c(a, d, e)\) is the normalising constant, is also conjugate to the \(k\)-parameter exponential family likelihood.
From Example 2.11, we noted that the natural conjugate to the Bernoulli likelihood has two parameters \(\alpha\) and \(\beta\), the different possible values of these parameters indexing the specific member of the Beta family chosen as the prior distribution. In order to distinguish the parameters indexing the family of prior distributions from the parameters \(\theta\) about which we wish to make inference we term the former hyperparameters. Thus, in the Bernoulli case, \(\alpha\) and \(\beta\) are the hyperparameters. The general \((k+1+s)\)-parameter exponential family conjugate prior has \(k+1+s\) hyperparameters.
Conjugate priors are useful for a number of reasons. Firstly, they ease the inferential process following the observation of data in that the posterior is straightforward to calculate: we only need to update the hyperparameters.
In the majority of cases, from Proposition 2.1 and (2.25), this will involve simple adjustments using the sufficient statistics of the likelihood.
Secondly, they ease the burden of prior specification: specifying the prior distribution reduces to specifying the hyperparameters. This can be done by specifying either the hyperparameters directly or a series of distributional summaries from which they can be inferred. For example, we can infer the two hyperparameters of a Beta prior by specifying the mean and the variance of the prior.
A further example of this is given in Question 2 of Question Sheet Three.
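As a sketch of the mean-and-variance route just described (the target summaries below are arbitrary illustrative values), the hyperparameters of a Beta prior can be recovered from a specified prior mean \(m\) and variance \(v\) by solving \(m = \alpha/(\alpha+\beta)\) and \(v = \alpha\beta/\{(\alpha+\beta)^{2}(\alpha+\beta+1)\}\).

```python
def beta_from_mean_var(m, v):
    # solve m = alpha/(alpha+beta), v = alpha*beta/((alpha+beta)^2 (alpha+beta+1));
    # requires v < m(1 - m)
    s = m * (1 - m) / v - 1      # alpha + beta
    return m * s, (1 - m) * s

alpha, beta = beta_from_mean_var(0.3, 0.01)   # arbitrary target mean and variance
print(alpha, beta)                            # 6.0, 14.0

# sanity check of the implied mean and variance
mean = alpha / (alpha + beta)
var = alpha * beta / ((alpha + beta)**2 * (alpha + beta + 1))
print(mean, var)                              # 0.3, 0.01
```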
Notice that in order for a conjugate family to exist, the likelihood \(\prod_{i=1}^{n} f(x_{i} \, | \, \theta)\) must involve only a finite number of different functions of \(x = (x_{1}, \ldots, x_{n})\) for \(n\) arbitrarily large. Thus, the likelihood must contain a finite number of sufficient statistics, which implies (given regularity conditions) that the likelihood is a member of a regular exponential family, see the Pitman-Koopman-Darmois theorem. Thus, subject to these regularity conditions, only regular exponential families exhibit conjugacy.
An example that breaks these regularity conditions is the non-regular exponential family likelihood given by \(U(0, \theta)\). From Question 4 of Question Sheet Three we find that this has a conjugate prior given by the Pareto distribution.
2.4 Noninformative prior distributions
The posterior distribution combines the information provided by the data with the prior information. In many situations, the available prior information may be too vague to be formalised as a probability distribution or too subjective to be used in public decision making.
There has been a desire for prior distributions that are guaranteed to play a minimal part in the posterior distribution. Such priors are sometimes called ‘reference priors’ and the prior density is said to be ‘flat’ or ‘noninformative’. The argument proposed is to “let the data speak for themselves”.
Example 2.12 Suppose that \(X \, | \, \theta \sim Bin(n, \theta)\) so that \[\begin{eqnarray} f(x \, | \, \theta) & = & \binom{n}{x} \theta^{x}(1- \theta)^{n-x}. \nonumber \end{eqnarray}\] The natural conjugate prior is the \(Beta(\alpha, \beta)\) distribution. If the hyperparameters \(\alpha\) and \(\beta\) are chosen so that \(\alpha = \beta = 1\) then the prior is \(Unif(0, 1)\) so that \[\begin{eqnarray} f(\theta) & = & 1, \ \ 0 \leq \theta \leq 1.\nonumber \end{eqnarray}\] The prior suggests that we judge \(\theta\) to be equally likely to be anywhere between \(0\) and \(1\) and can be viewed as a judgement of ignorance. The use of this uniform prior distribution is often referred to as Bayes’ postulate. The posterior is \[\begin{eqnarray} f(\theta \, | \, x) & \propto & f(x \, | \, \theta)f(\theta) \nonumber \\ & = & f(x \, | \, \theta). \nonumber \end{eqnarray}\] The posterior density is thus \[\begin{eqnarray} f(\theta \, | \, x) & = & \frac{f(x \, | \, \theta)}{\int_{\theta} f(x \, | \, \theta) \, d\theta} \nonumber \end{eqnarray}\] which is the scaled likelihood. In this case, we have \(\theta \, | \, x \sim Beta(x+1, n-x+1)\).
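A brief numerical sketch (with arbitrary \(n\) and \(x\)) comparing the scaled likelihood under the uniform prior with the \(Beta(x+1, n-x+1)\) density.

```python
import numpy as np
from scipy.stats import beta, binom

n, x = 12, 9                      # arbitrary illustrative data
theta = np.linspace(0.001, 0.999, 999)

# scaled likelihood: f(x | theta) normalised over theta
lik = binom.pmf(x, n, theta)
scaled_lik = lik / np.trapz(lik, theta)

# posterior under the uniform (Beta(1, 1)) prior
post = beta.pdf(theta, x + 1, n - x + 1)

print(np.max(np.abs(scaled_lik - post)))   # small: numerical integration error only
```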
The basic idea is to represent ignorance using uniform prior distributions. However, if the possible values of \(\theta\) do not lie in an interval with finite endpoints then no such proper distribution (i.e. one that integrates to unity) exists. Nevertheless, this may not be a problem if the scaled likelihood has a finite integral over \(\theta\).
Definition 2.5 (Improper prior) The specification \(f(\theta) = f^{*}(\theta)\) is said to be an improper prior if \[\begin{eqnarray} \int_{\theta} f^{*}(\theta) \, d\theta & = & \infty.\nonumber \end{eqnarray}\]
Example 2.13 Recall Example 2.7. For an exchangeable collection \(X = (X_{1}, \ldots, X_{n})\) let \(X_{i} \, | \, \theta \sim N(\theta, \sigma^{2})\) where \(\sigma^{2}\) is known and \(\theta \sim N(\mu_{0}, \sigma_{0}^{2})\) for known constants \(\mu_{0}\) and \(\sigma_{0}^{2}\). The posterior is \(\theta \, | \, x \sim N(\mu_{n}, \sigma_{n}^{2})\) where \[\begin{eqnarray} \frac{1}{\sigma_{n}^{2}} & = & \frac{n}{\sigma^{2}} + \frac{1}{\sigma_{0}^{2}}, \nonumber \\ \mu_{n} & = & \left(\frac{n}{\sigma^{2}} + \frac{1}{\sigma_{0}^{2}}\right)^{-1}\left(\frac{n}{\sigma^{2}}\bar{x} + \frac{1}{\sigma_{0}^{2}}\mu_{0}\right). \nonumber \end{eqnarray}\] Recall that, in Example 2.7, as \(\sigma_{0}^{2} \rightarrow \infty\) then \(\mu_{n} \rightarrow \bar{x}\) and \(\sigma_{n}^{2} \rightarrow \frac{\sigma^{2}}{n}\), which matches the scaled likelihood. As \(\sigma_{0}^{2} \rightarrow \infty\) the prior density \(f(\theta)\) becomes flatter and flatter, which we could view as becoming more and more uniform. Notice that we could obtain \(N(\bar{x}, \frac{\sigma^{2}}{n})\) as our posterior distribution using the improper prior \(f(\theta) \propto 1\).
2.4.1 Jeffreys’ prior
Suppose we take \(f(\theta) \propto 1\) to represent ignorance about a parameter \(\theta\). If we consider the parameter \(\lambda = \frac{1}{\theta}\) then, by the change of variables formula, \(f(\lambda) \propto \frac{1}{\lambda^{2}}\) which is not uniform. Although we attempt to express ignorance about \(\theta\) we do not have ignorance about \(\frac{1}{\theta}\)!
Can we calculate prior distributions that are invariant to the choice of parameterisation?
The answer is yes if we use Jeffreys’ prior, \[\begin{eqnarray} f(\theta) & \propto & |I(\theta)|^{\frac{1}{2}} \tag{2.27} \end{eqnarray}\] where \(I(\theta)\) is the Fisher information matrix and \(| \cdot |\) represents the determinant of a matrix.
Jeffreys’ prior is named after Harold Jeffreys (1891-1989).
Recall that if \(\theta = (\theta_{1}, \ldots, \theta_{k})\) is a \(k\)-dimensional parameter then \(I(\theta)\) is the \(k \times k\) matrix with \((i, j)\)th entry \[\begin{eqnarray} (I(\theta))_{ij} & = & E\left.\left\{\frac{\partial}{\partial \theta_{i}} \log f(x \, | \, \theta) \frac{\partial}{\partial \theta_{j}} \log f(x \, | \, \theta) \, \right| \, \theta \right\} \nonumber \\ & = & - E\left.\left\{\frac{\partial^{2}}{\partial \theta_{i}\partial \theta_{j}} \log f(x \, | \, \theta) \, \right| \, \theta \right\} \nonumber \end{eqnarray}\]
Example 2.14 For an exchangeable collection \(X = (X_{1}, \ldots, X_{n})\) let \(X_{i} \, | \, \theta \sim N(\theta, \sigma^{2})\) where \(\sigma^{2}\) is known. We find the Jeffreys prior for \(\theta\). \[\begin{eqnarray} f(x \, | \, \theta) & = & \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi} \sigma} \exp \left\{-\frac{1}{2\sigma^{2}} (x_{i} - \theta)^{2}\right\} \nonumber\\ & = & (2\pi \sigma^{2})^{-\frac{n}{2}} \exp \left\{-\frac{1}{2\sigma^{2}} \sum_{i=1}^{n} (x_{i} - \theta)^{2}\right\}.\nonumber \end{eqnarray}\] Hence, as \(\theta\) is univariate, \[\begin{eqnarray} \frac{\partial^{2}}{\partial \theta^{2}} \log f(x \, | \, \theta) & = & \frac{\partial}{\partial \theta}\left\{\frac{\partial}{\partial \theta}\log f(x \, | \, \theta) \right\} \nonumber \\ & = & \frac{\partial}{\partial \theta}\left\{ \frac{\partial}{\partial \theta} \left(-\frac{n}{2} \log (2\pi \sigma^{2}) - \frac{1}{2\sigma^{2}} \sum_{i=1}^{n} (x_{i} - \theta)^{2} \right)\right\} \nonumber \\ & = & \frac{\partial}{\partial \theta}\left\{\frac{1}{\sigma^{2}} \sum_{i=1}^{n} (x_{i} - \theta) \right\} \nonumber\\ & = & -\frac{n}{\sigma^{2}} \nonumber \end{eqnarray}\] so that the Fisher information is \[\begin{eqnarray} I(\theta) & = & -E\left(-\frac{n}{\sigma^{2}} \, | \, \theta \right) \nonumber \\ & = & \frac{n}{\sigma^{2}}. \tag{2.28} \end{eqnarray}\] Substituting (2.28) into (2.27) we see that, in this case, the Jeffreys prior is \[\begin{eqnarray} f(\theta) \ \propto \ \sqrt{\frac{n}{\sigma^{2}}} \ \propto \ 1 \nonumber \end{eqnarray}\] which is the improper prior we suggested in Example 2.13.
Recall how to transform the distribution of random variables. If \(X = (X_{1}, \ldots, X_{p})\) is a \(p\)-dimensional random variable with density function \(f_{X}(x)\) and \(Y = (Y_{1}, \ldots, Y_{p})\) is a \(p\)-dimensional random variable with density function \(f_{Y}(y)\) then, if \(Y = g(X)\), \[\begin{eqnarray} f_{Y}(y) & = & |\det J(x, y)| f_{X}(g^{-1}(y)) \nonumber \end{eqnarray}\] where \(J(x, y)\) is the \(p \times p\) Jacobian matrix with \((i, j)\)th element \[\begin{eqnarray} (J(x, y))_{ij} & = & \frac{\partial x_{i}}{\partial y_{j}} \nonumber \end{eqnarray}\] where we write \(x_{i} = h_{i}(y)\) and \(y_{j} = g_{j}(x)\), \(\det J(x, y)\) denotes the determinant of the Jacobian and \(| \cdot |\) the modulus.
\(\det J(x, y)\) is often called the Jacobian determinant or even just the Jacobian.
Example 2.15 (Not given in lectures but just to remind you how to transform variables) Consider \(X = (X_{1}, X_{2})\) denoting Cartesian coordinates and \(Y = (Y_{1}, Y_{2})\) denoting polar co-ordinates. We have that \(Y = (\sqrt{X_{1}^{2} + X_{2}^{2}}, \tan^{-1} (\frac{X_{2}}{X_{1}}))\). We wish to find the density \(f_{Y}(y)\) by transforming the density \(f_{X}(x)\). We have \(y_{1} = g_{1}(x) = \sqrt{x_{1}^{2} + x_{2}^{2}}\), \(y_{2} = g_{2}(x) = \tan^{-1}(\frac{x_{2}}{x_{1}})\). The inverse transformation is \[\begin{eqnarray} x_{1} \ = \ h_{1}(y) & = & y_{1}\cos y_{2}, \nonumber\\ x_{2} \ = \ h_{2}(y) & = & y_{1}\sin y_{2}. \nonumber \end{eqnarray}\] Thus, the Jacobian \[\begin{eqnarray} J(x, y) \ = \ \left(\begin{array}{ll} \frac{\partial x_{1}}{\partial y_{1}} & \frac{\partial x_{1}}{\partial y_{2}} \nonumber \\ \frac{\partial x_{2}}{\partial y_{1}} & \frac{\partial x_{2}}{\partial y_{2}} \end{array} \right) & = & \left(\begin{array}{ll} \cos y_{2} & -y_{1} \sin y_{2} \\ \sin y_{2} & y_{1} \cos y_{2} \end{array} \right) \nonumber \end{eqnarray}\] with \(\det J(x, y) = y_{1}\). Hence, \[\begin{eqnarray} f_{Y}(y) \ = \ |\det J(x, y)| f_{X}(g^{-1}(y)) & = & y_{1} f_{X}(y_{1}\cos y_{2}, y_{1}\sin y_{2}). \nonumber \end{eqnarray}\]
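The Jacobian calculation in Example 2.15 can be reproduced symbolically; a small sketch using sympy.

```python
import sympy as sp

y1, y2 = sp.symbols('y1 y2', positive=True)

# inverse transformation x = h(y): Cartesian coordinates from polar coordinates
x1 = y1 * sp.cos(y2)
x2 = y1 * sp.sin(y2)

# Jacobian matrix of partial derivatives dx_i / dy_j
J = sp.Matrix([[sp.diff(x1, y1), sp.diff(x1, y2)],
               [sp.diff(x2, y1), sp.diff(x2, y2)]])
print(sp.simplify(J.det()))   # y1, as in the example
```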
Let’s consider transformations of Jeffreys’ prior. For simplicity of exposition we’ll consider the univariate case. For a univariate parameter \(\theta\), Jeffreys’ prior \(f_{\theta}(\theta) \propto \sqrt{I(\theta)}\). Consider a univariate transformation \(\phi = g(\theta)\) (e.g. \(\phi = \log \theta\), \(\phi = \frac{1}{\theta}\), \(\ldots\)). There are two possible ways to obtain Jeffreys’ prior for \(\phi\).
Obtain Jeffreys’ prior for \(\theta\) and transform \(f_{\theta}(\theta)\) to \(f_{\phi}(\phi)\).
Reparametrise the model in terms of \(\phi\) from the outset and obtain \(f_{\phi}(\phi)\) directly as Jeffreys’ prior for \(\phi\).
We’ll show the two approaches are identical. Transforming \(f_{\theta}(\theta)\) to \(f_{\phi}(\phi)\) we have \[\begin{eqnarray} f_{\phi}(\phi) \ = \ \left|\frac{\partial \theta}{\partial \phi}\right| f_{\theta}(g^{-1}(\phi)) & \propto & \left|\frac{\partial \theta}{\partial \phi}\right| \sqrt{I(\theta)}. \tag{2.29} \end{eqnarray}\] (e.g. \(\phi = \log \theta\) gives \(\theta = e^{\phi}\), \(\frac{\partial \theta}{\partial \phi} = e^{\phi}\) so that \(f_{\phi}(\phi) = e^{\phi}f_{\theta}(e^{\phi})\); \(\phi = \frac{1}{\theta}\) gives \(\theta = \frac{1}{\phi}\), \(\frac{\partial \theta}{\partial \phi} = - \frac{1}{\phi^{2}}\) so that \(f_{\phi}(\phi) = \frac{1}{\phi^{2}}f_{\theta}(\frac{1}{\phi})\).) We now consider finding Jeffreys’ prior for \(\phi\) directly. We have \[\begin{eqnarray} I(\phi) & = & E\left.\left\{\left(\frac{\partial}{\partial \phi} \log f(x \, | \, \phi)\right)^{2} \, \right| \phi \right\} \nonumber \\ & = & E\left.\left\{\left(\frac{\partial \theta}{\partial \phi}\frac{\partial}{\partial \theta} \log f(x \, | \, \theta)\right)^{2} \, \right| \theta \right\} \nonumber \\ & = & \left(\frac{\partial \theta}{\partial \phi}\right)^{2} E\left.\left\{\left(\frac{\partial}{\partial \theta} \log f(x \, | \, \theta)\right)^{2} \, \right| \theta \right\} \nonumber \\ & = & \left(\frac{\partial \theta}{\partial \phi}\right)^{2} I(\theta). \tag{2.30} \end{eqnarray}\] Hence, Jeffreys’ prior for \(\phi\) is \[\begin{eqnarray} f_{\phi}(\phi) \ \propto \ \sqrt{I(\phi)} & = & \left|\frac{\partial \theta}{\partial \phi}\right| \sqrt{I(\theta)} \tag{2.31} \end{eqnarray}\] so that (2.29) and (2.31) are identical.
The multivariate case is similar, only, for \(\phi\) and \(\theta\) multivariate, (2.30) becomes \(I(\phi) = J(\theta, \phi) I(\theta) J(\theta, \phi)^{T}\) so that \(|I(\phi)| = |J(\theta, \phi)|^{2}|I(\theta)|\) and the result follows.
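A numerical sketch of this invariance (the Bernoulli model and the log-odds transformation \(\phi = \log\{\theta/(1-\theta)\}\) are illustrative choices, not taken from the notes): the Fisher informations are estimated by Monte Carlo from the squared score, and the two routes (2.29) and (2.31) give the same unnormalised prior value.

```python
import numpy as np

rng = np.random.default_rng(5)

# Bernoulli(theta) model, phi = log(theta / (1 - theta)), so theta = e^phi / (1 + e^phi)
theta = 0.3
dtheta_dphi = theta * (1 - theta)        # derivative of the inverse transformation

def score_theta(x):
    # d/dtheta log f(x | theta) for the Bernoulli likelihood
    return x / theta - (1 - x) / (1 - theta)

x = rng.binomial(1, theta, size=500_000)

# Monte Carlo estimates of the Fisher informations
I_theta = np.mean(score_theta(x) ** 2)
I_phi = np.mean((dtheta_dphi * score_theta(x)) ** 2)   # chain rule, as in (2.30)

# route (2.29): transform Jeffreys' prior for theta to the phi scale
route_229 = abs(dtheta_dphi) * np.sqrt(I_theta)
# route (2.31): Jeffreys' prior computed directly on the phi scale
route_231 = np.sqrt(I_phi)

print(route_229, route_231)   # agree up to Monte Carlo error
```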
Jeffreys’ prior has the property of invariance: whatever scale we choose to measure the unknown parameter, the same prior results when the scale is transformed. To quote Jeffreys:
any arbitrariness in the choice of parameters could make no difference to the results.
— Jeffreys, H. (1961). Theory of Probability, 3rd ed. Oxford: Oxford University Press.
2.4.2 Some final remarks about noninformative priors
A number of objections can be made to noninformative priors. One major objection to Jeffreys’ prior is that it depends upon the form of the data, whereas the prior should only reflect the prior information and not be influenced by what data are to be collected. For example, in Question 4 of Question Sheet Four we show that Jeffreys’ priors for the binomial and negative binomial likelihoods are different, which leads to a violation of the likelihood principle. The likelihood principle states that the likelihood contains all the information provided by the data \(x\), so that two likelihoods contain the same information if they are proportional.
The adoption of the likelihood principle is controversial and has caused much debate. Classical statistics violates the likelihood principle but Bayesian statistics (using proper prior distributions) does not.
The use of improper priors may appear to be a convenient tool for minimising the role of the prior distribution. However, the posterior is not then guaranteed to be proper. In particular, with a flat improper prior the posterior will be improper if the likelihood, viewed as a function of \(\theta\), does not have a non-zero finite integral over \(\theta\).
Example 2.16 Consider \(X \, | \, \theta \sim Bin(n, \theta)\) and the improper prior \(f(\theta) \propto \theta^{-1}(1 - \theta)^{-1}\). In this case, \(\theta\) is often said to have the improper \(Beta(0, 0)\) density. The posterior is \[\begin{eqnarray} f(\theta \, | \, x) & \propto & \theta^{x} (1- \theta)^{n-x} \times \theta^{-1}(1 - \theta)^{-1} \ = \ \theta^{x - 1}(1 - \theta)^{n-x-1} \nonumber \end{eqnarray}\] which is the kernel of a \(Beta(x, n-x)\) density so that, provided \(0 < x < n\), \(\theta \, | \, x \sim Beta(x, n-x)\). Notice that, in this case, we have \[\begin{eqnarray} E(\theta \, | \, x) & = & \frac{x}{n} \nonumber \end{eqnarray}\] so that the posterior mean is equal to the maximum likelihood estimate, which provides a motivation for the use of this improper prior. However, if \(x = 0\) or \(x = n\) then the posterior is improper.
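To see the impropriety numerically when \(x = 0\) (an illustrative sketch with an arbitrary \(n\)): the unnormalised posterior kernel \(\theta^{-1}(1-\theta)^{n-1}\) has an integral over \((\epsilon, 1)\) that grows without bound as \(\epsilon \rightarrow 0\).

```python
import numpy as np
from scipy.integrate import quad

n, x = 10, 0   # no successes observed

def kernel(theta):
    # unnormalised posterior kernel theta^(x-1) (1-theta)^(n-x-1) with x = 0
    return theta**(x - 1) * (1 - theta)**(n - x - 1)

for eps in (1e-2, 1e-3, 1e-4, 1e-5):
    val, _ = quad(kernel, eps, 1)
    print(eps, val)   # grows roughly like -log(eps): the integral diverges as eps -> 0
```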