Solution Sheet Two

Question 1

Let \(X_{1}, \ldots, X_{n}\) be conditionally independent given \(\lambda\) so that \(f(x \, | \, \lambda) = \prod_{i=1}^{n} f(x_{i} \, | \, \lambda)\) where \(x = (x_{1}, \ldots, x_{n})\). Suppose that \(\lambda \sim Gamma(\alpha, \beta)\) and \(X_{i} \, | \, \lambda \sim Exp(\lambda)\) where \(\lambda\) represents the rate so that \(E(X_{i} \, | \, \lambda) = \lambda^{-1}\).

  (a) Show that \(\lambda \, | \, x \sim Gamma(\alpha +n, \beta + n \bar{x})\).

As \(X_{i} \, | \, \lambda \sim Exp(\lambda)\) then, with \(x = (x_{1}, \ldots, x_{n})\), \[\begin{eqnarray*} f(x \, | \, \lambda) & = & \lambda^{n} e^{-n\bar{x}\lambda} \end{eqnarray*}\] which, when viewed as a function of \(\lambda\), is a kernel of a Gamma distribution. We expect that if the prior is a general member of the Gamma family we will have conjugacy. We now confirm this using \(\lambda \sim Gamma(\alpha, \beta)\). The posterior is \[\begin{eqnarray*} f(\lambda \, | \, x) & \propto & \lambda^{n} e^{-n\bar{x}\lambda} \times \lambda^{\alpha - 1}e^{-\beta \lambda} \\ & = & \lambda^{\alpha + n - 1}e^{-(\beta + n\bar{x})\lambda} \end{eqnarray*}\] which is a kernel of a \(Gamma(\alpha +n, \beta + n \bar{x})\) density, that is \(\lambda \, | \, x \sim Gamma(\alpha +n, \beta + n \bar{x})\).
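
As a sanity check, the conjugate update can be verified numerically. The sketch below (Python with NumPy/SciPy; the values of \(\alpha\), \(\beta\) and the simulated exponential data are purely illustrative and not from the question) normalises the kernel \(\lambda^{n} e^{-n\bar{x}\lambda} \times \lambda^{\alpha-1}e^{-\beta\lambda}\) on a grid and compares it with the stated \(Gamma(\alpha+n, \beta+n\bar{x})\) density.

```python
import numpy as np
from scipy import stats

# Illustrative prior and simulated data (alpha, beta and the data-generating rate are assumptions)
alpha, beta, n = 2.0, 5.0, 10
rng = np.random.default_rng(1)
x = rng.exponential(scale=1 / 0.4, size=n)   # Exp data with rate 0.4
xbar = x.mean()

# Unnormalised posterior kernel (likelihood times prior) on a grid, normalised numerically
grid = np.linspace(1e-6, 3.0, 4000)
kernel = grid**n * np.exp(-n * xbar * grid) * stats.gamma.pdf(grid, a=alpha, scale=1 / beta)
kernel /= kernel.sum() * (grid[1] - grid[0])

# Closed-form posterior Gamma(alpha + n, beta + n * xbar); scipy parameterises by scale = 1 / rate
closed = stats.gamma.pdf(grid, a=alpha + n, scale=1 / (beta + n * xbar))
print(np.max(np.abs(kernel - closed)))   # should be near zero
```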

  (b) Show that the posterior mean for the failure rate \(\lambda\) can be written as a weighted average of the prior mean of \(\lambda\) and the maximum likelihood estimate, \(\bar{x}^{-1}\), of \(\lambda\).

\[\begin{eqnarray*} E(\lambda \, | \, X) & = & \frac{\alpha + n}{\beta + n \bar{x}} \\ & = & \frac{\beta\left(\frac{\alpha}{\beta}\right) + n \bar{x} \left(\frac{1}{\bar{x}}\right)}{\beta + n \bar{x}} \ = \ c \frac{\alpha}{\beta} + (1 - c)\frac{1}{\bar{x}} \end{eqnarray*}\] where \(c = \frac{\beta}{\beta + n \bar{x}}\). Hence \(E(\lambda \, | \, X)\) is a weighted average of the prior mean, \(E(\lambda) = \frac{\alpha}{\beta}\), and the maximum likelihood estimate, \(\bar{x}^{-1}\), of \(\lambda\).
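
The identity is easy to check numerically; the snippet below (with illustrative values of \(\alpha\), \(\beta\), \(n\) and \(\bar{x}\)) confirms that the two expressions for \(E(\lambda \, | \, X)\) coincide.

```python
# Check E(lambda | x) = c * (alpha / beta) + (1 - c) / xbar with c = beta / (beta + n * xbar); illustrative values
alpha, beta, n, xbar = 2.0, 5.0, 10, 2.5
c = beta / (beta + n * xbar)
print((alpha + n) / (beta + n * xbar), c * alpha / beta + (1 - c) / xbar)   # identical
```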

A water company is interested in the failure rate of water pipes. They ask two groups of engineers about their prior beliefs about the failure rate. The first group believe the mean failure rate is \(\frac{1}{8}\) with coefficient of variation \(\frac{1}{\sqrt{11}}\), whilst the second group believe the mean is \(\frac{1}{11}\) with coefficient of variation \(\frac{1}{2}\).
[Note: The coefficient of variation is the standard deviation divided by the mean.]
Let \(X_{i}\) be the time until water pipe \(i\) fails and assume that the \(X_{i}\) follow the exponential likelihood model described above. A sample of five pipes is taken and the following times to failure were observed: \(8.2, \ 9.2, \ 11.2, \ 9.8, \ 10.1.\)

  (c) Find the appropriate members of the Gamma families the prior statements of the two groups of engineers represent. In each case find the posterior mean and variance. Approximating the posterior by \(N(E(\lambda \, | \, x), Var(\lambda \, | \, x))\), where \(x = (x_{1}, \ldots, x_{5})\), estimate, in each case, the probability that the failure rate is less than 0.1.

For a \(Gamma(\alpha, \beta)\) distribution, the mean is \(\frac{\alpha}{\beta}\) whilst the coefficient of variation is \(\sqrt{\frac{\alpha}{\beta^{2}}} \times \frac{\beta}{\alpha} = \frac{1}{\sqrt{\alpha}}\).

Suppose that the first group of engineers choose the prior \(\lambda \sim Gamma(\alpha_{1}, \beta_{1})\) then \(\frac{\alpha_{1}}{\beta_{1}} = \frac{1}{8}\) and \(\frac{1}{\sqrt{\alpha_{1}}} = \frac{1}{\sqrt{11}}\). Hence, \(\alpha_{1} = 11\) and \(\beta_{1} = 88\).

If the second group of engineers choose the prior \(\lambda \sim Gamma(\alpha_{2}, \beta_{2})\) then \(\frac{\alpha_{2}}{\beta_{2}} = \frac{1}{11}\) and \(\frac{1}{\sqrt{\alpha_{2}}} = \frac{1}{2}\). Hence, \(\alpha_{2} = 4\) and \(\beta_{2} = 44\).

We observe \(n = 5\) and \(\bar{x} = 9.7\) so that the first group of engineers have a posterior given by \(Gamma(11 + 5, 88 + 5(9.7)) = Gamma(16, 136.5)\) whilst the second group of engineers have a posterior given by \(Gamma(4 + 5, 44 + 5(9.7)) = Gamma(9, 92.5)\). We summarise the results in the following table.

\[\begin{eqnarray*} \begin{array}{ccccc} \mbox{Prior} & \mbox{Posterior} & E(\lambda \, | \, X) & Var(\lambda \, | \, X) & P(\lambda < 0.1 \, | \, X) \\ \hline Gamma(11, 88) & Gamma(16, 136.5) & 0.11722 & 0.00086 & 0.2776 \\ Gamma(4, 44) & Gamma(9, 92.5) & 0.09730 & 0.00105 & 0.5319\end{array} \end{eqnarray*}\]
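
These figures can be reproduced with a short script. The sketch below (Python with SciPy) recomputes the posterior parameters, means, variances and normal-approximation probabilities, and also reports the exact Gamma probabilities for comparison.

```python
import numpy as np
from scipy import stats

x = np.array([8.2, 9.2, 11.2, 9.8, 10.1])
n, xbar = len(x), x.mean()

for a0, b0 in [(11, 88), (4, 44)]:                  # the two groups' Gamma priors
    a1, b1 = a0 + n, b0 + n * xbar                  # posterior Gamma(a1, b1)
    mean, var = a1 / b1, a1 / b1**2
    p_norm = stats.norm.cdf(0.1, loc=mean, scale=np.sqrt(var))   # normal approximation
    p_exact = stats.gamma.cdf(0.1, a=a1, scale=1 / b1)           # exact Gamma probability, for comparison
    print(f"Gamma({a0},{b0}) -> Gamma({a1},{b1:.1f}): mean={mean:.5f}, var={var:.5f}, "
          f"P(normal)={p_norm:.4f}, P(exact)={p_exact:.4f}")
```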

  (d) How do you expect any differences between the engineers to be reconciled as more data becomes available?

The maximum likelihood estimate of \(\lambda\) is \(\frac{1}{9.7} = 0.10309\) which is closest to the second group of engineers who had a higher initial variance, corresponding to more uncertainty, for \(\lambda\) than the first group. Thus, the posterior for the second group is closer to the likelihood than for the first. We can also observe this by calculating the corresponding weights of prior mean and maximum likelihood estimate in the posterior means as per part (b). For the first group, we have \(c = \frac{\beta_{1}}{\beta_{1} + n\bar{x}} = \frac{88}{136.5} = 0.64469\) so that the posterior mean attaches a heavier weight to the prior mean whereas for the second group of engineers the weight is \(c = \frac{\beta_{2}}{\beta_{2} + n\bar{x}} = \frac{44}{92.5} = 0.47568\) so that a heavier weight is attached to the maximum likelihood estimate.

  As more and more data are collected, the influence of the prior in the posterior will decrease and the data will take over. This can be seen clearly in the posterior mean calculation in part (b). Assuming \(\bar{x}\) stabilises then, as \(n \rightarrow \infty\), \(c \rightarrow 0\) and the posterior mean will tend towards the maximum likelihood estimate.
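
A quick illustration of this limit (holding \(\bar{x}\) fixed at 9.7, which is of course an assumption about future data) shows the weight \(c\) on the prior mean shrinking for both groups as \(n\) grows.

```python
# The weight c = beta / (beta + n * xbar) on the prior mean shrinks as n grows (xbar held at 9.7)
xbar = 9.7
for n in [5, 20, 100, 1000]:
    print(n, 88 / (88 + n * xbar), 44 / (44 + n * xbar))
```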

Question 2

Let \(x\) be the number of successes in \(n\) independent Bernoulli trials, each one having unknown probability \(\theta\) of success. It is judged that \(\theta\) may be modelled by a \(Unif(0, 1)\) distribution so \[\begin{eqnarray*} f(\theta) & = & 1, \ \ \ \ \ 0 < \theta < 1. \end{eqnarray*}\] An extra trial, \(z\), is performed, independent of the first \(n\) given \(\theta\), but with probability \(\frac{\theta}{2}\) of success. The full data is thus \((x, z)\) where \(z = 1\) if the extra trial is a success and \(0\) otherwise.

  (a) Show that \[\begin{eqnarray*} f(\theta \, | \, x, z=0) & = & c\{\theta^{\alpha - 1}(1-\theta)^{\beta - 1} + \theta^{\alpha -1}(1 - \theta)^{\beta}\} \end{eqnarray*}\] where \(\alpha = x+1\), \(\beta = n-x+1\) and \(c = \frac{1}{B(\alpha, \beta) + B(\alpha, \beta+1)}\).

As \(X \, | \, \theta \sim Bin(n, \theta)\) then \[\begin{eqnarray} f(x \, | \, \theta) & = & \binom{n}{x} \theta^{x}(1- \theta)^{n-x} \tag{1} \end{eqnarray}\] whilst as \(Z \, | \, \theta \sim Bernoulli(\frac{\theta}{2})\) then \[\begin{eqnarray} f(z=0 \, | \, \theta) \ = \ P(Z = 0 \, | \, \theta) & = & 1 - \frac{\theta}{2} \nonumber \\ & = & \frac{1}{2}(2 - \theta) \ = \ \frac{1}{2}\{1+(1-\theta)\}. \tag{2} \end{eqnarray}\] As \(X\) and \(Z\) are conditionally independent given \(\theta\), from (1) and (2), we have \[\begin{eqnarray} f(x, z=0 \, | \, \theta) & = & f(x \, | \, \theta)f(z=0 \, | \, \theta) \nonumber \\ & = & \binom{n}{x} \theta^{x}(1- \theta)^{n-x} \times \frac{1}{2}\{1+(1-\theta)\}. \tag{3} \end{eqnarray}\] As \(f(\theta) = 1\) then, using (3), the posterior is \[\begin{eqnarray*} f(\theta \, | \, x, z=0) & \propto & f(x, z=0 \, | \, \theta)f(\theta) \nonumber \\ & \propto & \theta^{x}(1- \theta)^{n-x} + \theta^{x}(1- \theta)^{n-x+1}. \end{eqnarray*}\] Letting \(\alpha = x+1\), \(\beta = n-x+1\) we have \[\begin{eqnarray} f(\theta \, | \, x, z=0) & = & c \{\theta^{\alpha-1}(1- \theta)^{\beta-1} + \theta^{\alpha-1}(1- \theta)^{\beta}\} \tag{4} \end{eqnarray}\] where \[\begin{eqnarray} c^{-1} & = & \int_{0}^{1} \{\theta^{\alpha-1}(1- \theta)^{\beta-1} + \theta^{\alpha-1}(1- \theta)^{\beta}\} \, d\theta \nonumber \\ & = & \int_{0}^{1} \theta^{\alpha-1}(1- \theta)^{\beta-1} \, d\theta + \int_{0}^{1} \theta^{\alpha-1}(1- \theta)^{\beta} \, d\theta \nonumber \\ & = & B(\alpha, \beta) + B(\alpha, \beta+1). \tag{5} \end{eqnarray}\]

  (b) Hence show that \[\begin{eqnarray*} E(\theta \, | \, X, Z = 0) & = & \frac{(x+1)(2n-x+4)}{(n+3)(2n-x+3)}. \end{eqnarray*}\] [Hint: Show that \(c = \frac{\alpha + \beta}{B(\alpha, \beta)(\alpha + 2\beta)}\) and work with \(\alpha\) and \(\beta\).]

From (4) we have \[\begin{eqnarray} E(\theta \, | \, X, Z = 0) & = & \int_{0}^{1} \theta \times c \{\theta^{\alpha-1}(1- \theta)^{\beta-1} + \theta^{\alpha-1}(1- \theta)^{\beta}\} \, d\theta \nonumber \\ & = & c\left\{\int_{0}^{1} \theta^{\alpha}(1- \theta)^{\beta-1} \, d\theta + \int_{0}^{1} \theta^{\alpha}(1- \theta)^{\beta} \, d\theta \right\} \nonumber \\ & = & c\{B(\alpha + 1,\beta) + B(\alpha+1, \beta+1)\}. \tag{6} \end{eqnarray}\] Note that \(B(\alpha, \beta+1) = \frac{\Gamma(\alpha)\Gamma(\beta +1)}{\Gamma(\alpha + \beta + 1)} = \frac{\beta}{\alpha + \beta} B(\alpha, \beta)\) so that, from (5), \[\begin{eqnarray} c \ = \ \frac{1}{B(\alpha, \beta) + B(\alpha, \beta+1)} & = & \frac{1}{B(\alpha, \beta) + \frac{\beta}{\alpha + \beta} B(\alpha, \beta)} \nonumber \\ & = & \frac{\alpha + \beta}{B(\alpha, \beta)(\alpha + 2\beta)}. \tag{7} \end{eqnarray}\] Now, \(B(\alpha + 1,\beta) = \frac{\alpha}{\alpha + \beta} B(\alpha, \beta)\) and \(B(\alpha+1, \beta+1) = \frac{\alpha \beta}{(\alpha + \beta + 1)(\alpha + \beta)}B(\alpha, \beta)\) so that \[\begin{eqnarray} B(\alpha + 1,\beta) + B(\alpha+1, \beta+1) & = & \left\{\frac{\alpha}{\alpha + \beta} + \frac{\alpha \beta}{(\alpha + \beta + 1)(\alpha + \beta)}\right\}B(\alpha, \beta) \nonumber \\ & = & \frac{\alpha(\alpha + 2\beta + 1)}{(\alpha + \beta + 1)(\alpha + \beta)}B(\alpha, \beta). \tag{8} \end{eqnarray}\] Substituting (8) and (7) into (6) gives \[\begin{eqnarray} E(\theta \, | \, X, Z = 0) \ = \ \frac{\alpha(\alpha + 2\beta + 1)}{(\alpha + \beta + 1)(\alpha + 2\beta)} \ = \ \frac{(x+1)(2n-x+4)}{(n+3)(2n-x+3)} \nonumber \end{eqnarray}\] as \(\alpha = x+1\) and \(\beta = n - x + 1\).
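
A numerical check of both the hint and the final expression is straightforward. The sketch below (with illustrative values of \(n\) and \(x\)) integrates \(\theta\) against the unnormalised posterior and compares the result with the closed form, and also checks the hint's expression for \(c\).

```python
from scipy.integrate import quad
from scipy.special import beta as B

# Illustrative n and x (assumptions, not from the question); we condition on z = 0
n, x = 12, 4
a, b = x + 1, n - x + 1            # alpha and beta as defined in part (a)

def joint(theta):
    # f(x, z = 0 | theta) * f(theta) with the flat prior; constants free of theta are dropped
    return theta**x * (1 - theta)**(n - x) * (1 - theta / 2)

norm, _ = quad(joint, 0, 1)
num, _ = quad(lambda t: t * joint(t), 0, 1)
numerical_mean = num / norm

closed_form = (x + 1) * (2 * n - x + 4) / ((n + 3) * (2 * n - x + 3))
print(numerical_mean, closed_form)          # should agree

# The hint's form of c: c^{-1} = B(a, b) + B(a, b + 1) = B(a, b)(a + 2b)/(a + b)
c_hint = (a + b) / (B(a, b) * (a + 2 * b))
print(1 / c_hint, 2 * norm)                 # the dropped factor 1/2 in joint() explains the factor 2
```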

  (c) Show that, for all \(x\), \(E(\theta \, | \, X, Z = 0)\) is less than \(E(\theta \, | \, X, Z = 1)\).

In this case \(f(z = 1 \, | \, \theta) = \frac{\theta}{2}\) so that \[\begin{eqnarray} f(x, z=1 \, | \, \theta) & = & \binom{n}{x} \theta^{x}(1- \theta)^{n-x} \times \frac{1}{2}\theta. \tag{9} \end{eqnarray}\] Now, as \(f(\theta) \propto 1\), \(f(\theta \, | \, x, z=1) \propto f(x, z=1 \, | \, \theta)\) which, from viewing (9) as a function of \(\theta\), we observe as a kernel of a \(Beta(x+2, n-x+1)\) distribution, so that \(\theta \, | \, x, z=1 \sim Beta(x+2, n-x+1)\). Hence, \(E(\theta \, | \, X, Z = 1) = \frac{x+2}{n+3}\). Now, from part (b), \[\begin{eqnarray*} E(\theta \, | \, X, Z = 0) \ = \ \frac{(x+1)(2n-x+4)}{(n+3)(2n-x+3)} \ = \ \frac{x+1}{n+3}\left(1 + \frac{1}{2n-x+3}\right). \end{eqnarray*}\] Hence \[\begin{eqnarray*} E(\theta \, | \, X, Z = 0) \ < \ E(\theta \, | \, X, Z = 1) & \Leftrightarrow & \frac{x+1}{n+3}\left(1 + \frac{1}{2n-x+3}\right) \ < \ \frac{x+2}{n+3} \\ & \Leftrightarrow & \frac{x+1}{2n-x+3} \ < \ 1 \\ & \Leftrightarrow & x \ < \ n+1 \end{eqnarray*}\] which is true as \(x \in \{0, 1, \ldots, n\}\).
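
The inequality can also be confirmed by brute force over \(x \in \{0, 1, \ldots, n\}\) for a few values of \(n\); a minimal check follows.

```python
# Brute-force check that E(theta | x, z=0) < E(theta | x, z=1) for all x in {0, ..., n}
for n in [1, 2, 5, 20, 100]:
    for x in range(n + 1):
        e0 = (x + 1) * (2 * n - x + 4) / ((n + 3) * (2 * n - x + 3))
        e1 = (x + 2) / (n + 3)
        assert e0 < e1, (n, x)
print("inequality holds for every (n, x) checked")
```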

Question 3

Let \(X_{1}, \ldots, X_{n}\) be conditionally independent given \(\theta\), so \(f(x \, | \, \theta) = \prod_{i=1}^{n} f(x_{i} \, | \, \theta)\) where \(x = (x_{1}, \ldots, x_{n})\), with each \(X_{i} \, | \, \theta \sim N(\mu, \theta)\) where \(\mu\) is known.

  (a) Let \(s(x) = \sum_{i=1}^{n} (x_{i} - \mu)^{2}\). Show that we can write \[\begin{eqnarray*} f(x \, | \, \theta) & = & g(s, \theta)h(x) \end{eqnarray*}\] where \(g(s, \theta)\) depends upon \(s(x)\) and \(\theta\) and \(h(x)\) does not depend upon \(\theta\) but may depend upon \(x\). The equation shows that \(s(X) = \sum_{i=1}^{n} (X_{i} - \mu)^{2}\) is sufficient for \(X_{1}, \ldots, X_{n}\) for learning about \(\theta\).

\[\begin{eqnarray*} f(x \, | \, \theta) & = & \prod_{i=1}^{n} \frac{1}{\sqrt{2 \pi \theta}} \exp \left\{ - \frac{1}{2\theta}(x_{i} - \mu)^{2} \right\} \\ & = & (2 \pi \theta)^{-\frac{n}{2}} \exp \left\{ - \frac{1}{2\theta} \sum_{i=1}^{n} (x_{i} - \mu)^{2} \right\} \\ & = & (2 \pi \theta)^{-\frac{n}{2}} \exp \left\{ - \frac{1}{2\theta} s(x) \right\} \\ & = & g(s, \theta)h(x) \end{eqnarray*}\] where \(g(s, \theta) = \theta^{-\frac{n}{2}}\exp \left\{ - \frac{1}{2\theta} s(x) \right\}\) and \(h(x) = (2 \pi)^{-\frac{n}{2}}\).

  (b) An inverse-gamma distribution with known parameters \(\alpha, \beta > 0\) is judged to be the prior distribution for \(\theta\). So, \[\begin{eqnarray*} f(\theta) & = & \frac{\beta^{\alpha}}{\Gamma(\alpha)}\theta^{-(\alpha+1)}e^{-\beta/\theta}, \ \ \ \ \ \theta > 0. \end{eqnarray*}\]

    (i) Show that the distribution of the precision \(\tau = \frac{1}{\theta}\) is \(Gamma(\alpha, \beta)\).

Let \(Y \sim Inv-Gamma(\alpha, \beta)\) so that \(f_{Y}(y) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}y^{-(\alpha+1)}e^{-\beta/y}\), \(y > 0\) and consider the distribution of \(Z = \frac{1}{Y}\). The density function of \(Z\), \(f_{Z}(z)\), is \[\begin{eqnarray*} f_{Z}(z) \ = \ \frac{\partial}{\partial z} P(Z \leq z) & = & \frac{\partial}{\partial z} P\left(\frac{1}{Y} \leq z \right) \\ & = & \frac{\partial}{\partial z} P\left(Y \geq \frac{1}{z} \right) \\ & = & - \frac{\partial}{\partial z} P\left(Y \leq \frac{1}{z} \right) \\ & = & - \frac{\partial y}{\partial z} f_{Y}(\frac{1}{z}) \end{eqnarray*}\] where the last equation follows using the chain rule. Now, as \(y = \frac{1}{z}\), we have \[\begin{eqnarray*} f_{Z}(z) \ = \ - \left(-\frac{1}{z^{2}}\right) f_{Y}(\frac{1}{z}) & = & \frac{1}{z^{2}} \frac{\beta^{\alpha}}{\Gamma(\alpha)}\left(\frac{1}{z}\right)^{-(\alpha+1)}e^{-\beta/(1/z)} \\ & = & \frac{\beta^{\alpha}}{\Gamma(\alpha)} z^{\alpha - 1}e^{-\beta z} \end{eqnarray*}\] which is the density function of the \(Gamma(\alpha, \beta)\) distribution. Hence, \(Z \sim Gamma(\alpha, \beta)\). It is immediate that if \(\theta \sim Inv-Gamma(\alpha, \beta)\) then \(\tau \sim Gamma(\alpha, \beta)\).
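
A Monte Carlo check of this transformation (with illustrative \(\alpha\) and \(\beta\)) compares draws of \(1/Y\) with the \(Gamma(\alpha, \beta)\) distribution via a Kolmogorov-Smirnov statistic.

```python
import numpy as np
from scipy import stats

# Illustrative alpha and beta (assumptions); draw Y ~ Inv-Gamma(alpha, beta) and look at Z = 1/Y
alpha, beta = 3.0, 2.0
rng = np.random.default_rng(0)
y = stats.invgamma.rvs(alpha, scale=beta, size=200_000, random_state=rng)
z = 1 / y

# Kolmogorov-Smirnov distance between the empirical distribution of 1/Y and Gamma(alpha, rate beta)
ks = stats.kstest(z, stats.gamma(a=alpha, scale=1 / beta).cdf)
print(ks.statistic)   # should be very small
```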

  (c) Find the posterior distribution of \(\theta\) given \(x = (x_{1}, \ldots, x_{n})\).

\[\begin{eqnarray*} f(\theta \, | \, x) & \propto & f(x \, | \, \theta) f(\theta) \\ & = & (2 \pi \theta)^{-\frac{n}{2}} \exp \left\{ - \frac{1}{2\theta} s(x) \right\} \times \frac{\beta^{\alpha}}{\Gamma(\alpha)}\theta^{-(\alpha+1)}e^{-\beta/\theta} \\ & \propto & \theta^{-\frac{n}{2}}\exp \left\{ - \frac{1}{2\theta} s(x) \right\} \times \theta^{-(\alpha+1)}e^{-\beta/\theta} \\ & = & \theta^{-\left(\alpha + \frac{n}{2} + 1 \right)} \exp\left\{-\frac{\beta + \frac{s(x)}{2}}{\theta}\right\} \end{eqnarray*}\] which is a kernel of an \(Inv-Gamma(\alpha + \frac{n}{2}, \beta + \frac{s(x)}{2})\). Hence, we have the posterior \(\theta \, | \, x \sim Inv-Gamma(\alpha + \frac{n}{2}, \beta + \frac{s(x)}{2})\).
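
As in Question 1, the conjugate update can be checked on a grid; the sketch below uses illustrative values of \(\alpha\), \(\beta\), \(\mu\) and simulated normal data (all assumptions).

```python
import numpy as np
from scipy import stats

# Grid check of the conjugate update; alpha, beta, mu and the simulated data are illustrative
alpha, beta, mu = 3.0, 4.0, 0.0
rng = np.random.default_rng(3)
x = rng.normal(mu, 1.3, size=15)
n, s = len(x), np.sum((x - mu) ** 2)

grid = np.linspace(0.05, 15.0, 6000)
kernel = grid ** (-n / 2) * np.exp(-s / (2 * grid)) * stats.invgamma.pdf(grid, alpha, scale=beta)
kernel /= kernel.sum() * (grid[1] - grid[0])

closed = stats.invgamma.pdf(grid, alpha + n / 2, scale=beta + s / 2)
print(np.max(np.abs(kernel - closed)))   # close to zero, up to grid truncation error
```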

  (d) Show that the posterior mean for \(\theta\) can be written as a weighted average of the prior mean of \(\theta\) and the maximum likelihood estimate, \(s(x)/n\), of \(\theta\).

As \(\theta \, | \, x \sim Inv-Gamma(\alpha + \frac{n}{2}, \beta + \frac{s(x)}{2})\) we have that
\[\begin{eqnarray*} E(\theta \, | \, X) & = & \frac{\beta + \frac{s(x)}{2}}{\alpha + \frac{n}{2} - 1} \\ & = & \frac{(\alpha-1) \frac{\beta}{\alpha-1} + \frac{n}{2}\frac{s(x)}{n}}{\alpha + \frac{n}{2} - 1} \ = \ c\left(\frac{\beta}{\alpha-1}\right)+(1-c)\left(\frac{s(x)}{n}\right) \end{eqnarray*}\] where \(c = \frac{\alpha-1}{\alpha + \frac{n}{2}-1}\).

Note that we need \(\alpha + \frac{n}{2} > 1\) for \(E(\theta \, | \, X)\) to be finite; for any \(\alpha > 0\), taking \(n \geq 2\) ensures this. For the decomposition as a weighted average we implicitly assume \(\alpha > 1\) so that \(E(\theta)\) is finite.

Hence \(E(\theta \, | \, X)\) is a weighted average of the prior mean, \(E(\theta) = \frac{\beta}{\alpha-1}\), and the maximum likelihood estimate, \(\frac{s(x)}{n}\), of \(\theta\).
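
The decomposition can be verified numerically; the snippet below (with illustrative values of \(\alpha\), \(\beta\), \(\mu\) and simulated data) compares the inverse-gamma posterior mean with the weighted average above.

```python
import numpy as np
from scipy import stats

# Check of the weighted-average form of E(theta | x); alpha, beta, mu and the data are illustrative
alpha, beta, mu = 3.0, 4.0, 1.5
rng = np.random.default_rng(2)
x = rng.normal(mu, np.sqrt(2.0), size=20)
n, s = len(x), np.sum((x - mu) ** 2)

post_mean = stats.invgamma(alpha + n / 2, scale=beta + s / 2).mean()   # (beta + s/2) / (alpha + n/2 - 1)
c = (alpha - 1) / (alpha + n / 2 - 1)
weighted = c * beta / (alpha - 1) + (1 - c) * s / n
print(post_mean, weighted)   # identical up to floating point
```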

Question 4

Suppose that \(X_{1}, \ldots, X_{n}\) are identically distributed discrete random variables taking \(k\) possible values with probabilities \(\theta_{1}, \ldots, \theta_{k}\). Inference is required about \(\theta = (\theta_{1}, \ldots, \theta_{k})\) where \(\sum_{j=1}^{k} \theta_{j} = 1\).

  (a) Assuming that the \(X_{i}\)s are independent given \(\theta\), explain why \[\begin{eqnarray} f(x \, | \, \theta) & \propto & \prod_{j=1}^{k} \theta_{j}^{n_{j}} \tag{10} \end{eqnarray}\] where \(x = (x_{1}, \ldots, x_{n})\) and \(n_{j}\) is the number of \(x_{i}\)s observed to take the \(j\)th possible value.

This is just a generalisation of Bernoulli trials. Let \(I_{\{x_{i}, j\}}\) denote the indicator function which is equal to one if \(x_{i}\) is in the \(j\)th class and zero otherwise. Then \[\begin{eqnarray*} P(X_{i} = x_{i} \, | \, \theta) & = & \prod_{j=1}^{k} \theta_{j}^{I_{\{x_{i}, j\}}} \end{eqnarray*}\] and, by the conditional independence, \[\begin{eqnarray*} f(x \, | \, \theta) \ = \ \prod_{i=1}^{n} P(X_{i} = x_{i} \, | \, \theta) \ = \ \prod_{i=1}^{n} \prod_{j=1}^{k} \theta_{j}^{I_{\{x_{i}, j\}}} \end{eqnarray*}\] which gives (10) as we thus get a contribution of \(\theta_{j}\) for each \(x_{i}\) which lands in the \(j\)th class and there are a total of \(n_{j}\) of these, \(j = 1, \ldots, k\). Notice that if we are only told the \(n_{j}\)s rather than which specific \(x_{i}\) contributed to each \(n_{j}\) then the likelihood is \[\begin{eqnarray*} f(n_{1}, \ldots, n_{k} \, | \, \theta) & = & \frac{n!}{\prod_{j=1}^{k} n_{j}!} \prod_{j=1}^{k} \theta_{j}^{n_{j}} \end{eqnarray*}\] which is the Multinomial distribution.

  (b) Suppose that the prior for \(\theta\) is Dirichlet distributed with known parameters \(a = (a_{1}, \ldots, a_{k})\) so \[\begin{eqnarray*} f(\theta) & = & \frac{1}{B(a)} \prod_{j=1}^{k} \theta_{j}^{a_{j}-1} \end{eqnarray*}\] where \(B(a) = B(a_{1}, \ldots, a_{k}) = \frac{\prod_{j=1}^{k} \Gamma(a_{j})}{\Gamma(\sum_{j=1}^{k} a_{j})}\). Show that the posterior for \(\theta\) given \(x\) is Dirichlet with parameters \(a + n = (a_{1} + n_{1}, \ldots, a_{k} + n_{k})\).

\[\begin{eqnarray*} f(\theta \, | \, x) & \propto & f( x \, | \, \theta) f(\theta) \\ & \propto & \left\{\prod_{j=1}^{k} \theta_{j}^{n_{j}}\right\} \times \left\{\prod_{j=1}^{k} \theta_{j}^{a_{j}-1}\right\} \\ & = & \prod_{j=1}^{k} \theta_{j}^{a_{j}+n_{j}-1} \end{eqnarray*}\] which is a kernel of the Dirichlet with parameter \(a + n = (a_{1} + n_{1}, \ldots, a_{k} + n_{k})\) so that the posterior \(\theta \, | \, x\) is Dirichlet with parameter \(a + n\).
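
In practice the update is a one-liner; the sketch below (with illustrative prior parameters and counts, not from the question) forms the posterior Dirichlet and reports its mean.

```python
import numpy as np
from scipy import stats

# Conjugate Dirichlet update; the prior parameters and counts are illustrative
a = np.array([2.0, 1.0, 3.0])             # prior Dirichlet parameters a_j
counts = np.array([10, 4, 6])             # observed class counts n_j
posterior = stats.dirichlet(a + counts)   # theta | x ~ Dirichlet(a + n)
print(posterior.mean())                   # (a_j + n_j) / sum_j (a_j + n_j)
```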

  The Multinomial distribution is the multivariate generalisation of the Binomial (\(k = 2\) for the Multinomial gives the Binomial) and the Dirichlet the multivariate generalisation of the Beta (\(k = 2\) for the Dirichlet gives the Beta). It is straightforward to obtain the moments of the Dirichlet distribution. For example, \[\begin{eqnarray} E\left(\prod_{j=1}^{k} \theta_{j}^{m_{j}}\right) & = & \int_{\theta} \prod_{j=1}^{k} \theta_{j}^{m_{j}} \times \frac{1}{B(a)} \prod_{j=1}^{k} \theta_{j}^{a_{j}-1} \, d\theta \tag{11} \\ & = & \frac{1}{B(a)} \int_{\theta} \prod_{j=1}^{k} \theta_{j}^{a_{j}+m_{j}-1} \, d\theta \tag{12} \\ & = & \frac{B(a+m)}{B(a)} \tag{13} \end{eqnarray}\] where \(a + m = (a_{1} + m_{1}, \ldots, a_{k} + m_{k})\). Notice that the integral in (11) is \(k\)-dimensional (as \(\theta\) is \(k\)-dimensional) and (13) follows as the integral in (12) is a kernel of the Dirichlet distribution with parameter \(a+m\). In particular, taking \(m_{j} = 1\) and \(m_{j'} = 0\) for \(j' \neq j\) we have \[\begin{eqnarray*} E(\theta_{j}) \ = \ \frac{B(a+m)}{B(a)} & = & \frac{\left\{\prod_{j' \neq j}^{k} \Gamma (a_{j'})\right\}\Gamma(a_{j}+1)}{\Gamma((\sum_{j=1}^{k} a_{j})+1)} \times \frac{\Gamma(\sum_{j=1}^{k} a_{j})}{\prod_{j =1}^{k} \Gamma (a_{j})} \\ & = & \frac{\Gamma(a_{j} + 1)}{\Gamma((\sum_{j=1}^{k} a_{j})+1)} \times \frac{\Gamma(\sum_{j=1}^{k} a_{j})}{\Gamma (a_{j})} \\ & = & \frac{a_{j}}{\sum_{j=1}^{k} a_{j}}. \end{eqnarray*}\] As \(\theta \, | \, x\) is Dirichlet with parameter \(a + n\) we have \[\begin{eqnarray*} E(\theta_{j} \, | \, x) & = & \frac{a_{j} + n_{j}}{\sum_{j=1}^{k} a_{j} + \sum_{j=1}^{k} n_{j}} \\ & = & \frac{\tilde{a}}{\tilde{a}+\tilde{n}}\left(\frac{a_{j}}{\tilde{a}}\right) + \frac{\tilde{n}}{\tilde{a}+\tilde{n}}\left(\frac{n_{j}}{\tilde{n}}\right) \end{eqnarray*}\] where \(\tilde{a} = \sum_{j=1}^{k} a_{j}\) and \(\tilde{n} = \sum_{j=1}^{k} n_{j}\). The posterior mean for \(\theta_{j}\) is a weighted average of its prior mean, \(\frac{a_{j}}{\tilde{a}}\), and the classical maximum likelihood estimate of \(\theta_{j}\), \(\frac{n_{j}}{\tilde{n}}\). Notice that \(\tilde{a}\) controls the weight of the prior mean in the posterior mean and is often said to represent the prior strength.
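
Using the same illustrative parameters and counts as above, the weighted-average form of the posterior mean can be checked directly.

```python
import numpy as np

# Weighted-average form of E(theta_j | x), with the same illustrative a and counts as above
a = np.array([2.0, 1.0, 3.0])
counts = np.array([10, 4, 6])
a_tot, n_tot = a.sum(), counts.sum()

post_mean = (a + counts) / (a_tot + n_tot)
weighted = (a_tot / (a_tot + n_tot)) * (a / a_tot) + (n_tot / (a_tot + n_tot)) * (counts / n_tot)
print(np.allclose(post_mean, weighted))   # True
```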