## Solution Sheet Eight (Additional Questions)

### Question 1

Let $$X_{1}, \ldots, X_{n}$$ be exchangeable so that the $$X_{i}$$ are conditionally independent given a parameter $$\theta$$. Suppose that $$X_{i} \, | \, \theta \sim \mbox{Inv-gamma}(\alpha, \theta)$$, where $$\alpha$$ is known, and we judge that $$\theta \sim \mbox{Gamma}(\alpha_{0}, \beta_{0})$$, where $$\alpha_{0}$$ and $$\beta_{0}$$ are known.

1. Show that $$\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})$$ where $$\alpha_{n} = \alpha_{0} + n \alpha$$, $$\beta_{n} = \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}$$, and $$x = (x_{1}, \ldots, x_{n})$$.

As $$X_{i} \, | \, \theta \sim \mbox{Inv-gamma}(\alpha, \theta)$$ then $\begin{eqnarray*} f(x \, | \, \theta) & = & \prod_{i=1}^{n} \frac{\theta^{\alpha}}{\Gamma(\alpha)} x_{i}^{-(\alpha+1)} \exp \left(-\frac{\theta}{x_{i}}\right) \\ & \propto & \theta^{n\alpha} \exp \left(-\theta \sum_{i=1}^{n}\frac{1}{x_{i}}\right). \end{eqnarray*}$ Similarly, as $$\theta \sim \mbox{Gamma}(\alpha_{0}, \beta_{0})$$, $\begin{eqnarray*} f(\theta) & \propto & \theta^{\alpha_{0}-1}\exp \left(-\theta \beta_{0}\right). \end{eqnarray*}$ Hence, $\begin{eqnarray*} f(\theta \, | \, x) & \propto & f(x \, | \, \theta)f(\theta) \\ & \propto & \theta^{(\alpha_{0}+n\alpha)-1}\exp \left\{-\theta\left(\beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}\right)\right\} \end{eqnarray*}$ which we recognise as the kernel of a $$\mbox{Gamma}(\alpha_{0} + n \alpha, \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}})$$ density. Thus, $$\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})$$ as required.
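The conjugate update can be checked numerically. Under illustrative hyperparameter choices (the values of $$\alpha$$, $$\alpha_{0}$$, $$\beta_{0}$$, $$\theta$$ and the simulated data below are assumptions, not part of the question), the unnormalised log posterior should differ from the $$\mbox{Gamma}(\alpha_{n}, \beta_{n})$$ log density by a constant:

```python
import math
import numpy as np

# Illustrative (assumed) values: alpha is known, theta ~ Gamma(alpha0, beta0).
alpha, alpha0, beta0, theta_true, n = 2.0, 3.0, 1.5, 2.0, 10
rng = np.random.default_rng(1)
# If Y ~ Gamma(alpha, rate = theta) then 1/Y ~ Inv-gamma(alpha, theta).
x = 1.0 / rng.gamma(alpha, 1.0 / theta_true, size=n)

alpha_n = alpha0 + n * alpha
beta_n = beta0 + np.sum(1.0 / x)

def log_unnormalised_posterior(theta):
    # log f(x | theta) + log f(theta), dropping additive terms free of theta
    return (n * alpha + alpha0 - 1.0) * np.log(theta) \
        - theta * (beta0 + np.sum(1.0 / x))

def log_gamma_pdf(theta, a, b):
    # log density of Gamma(a, b) in the shape-rate parameterisation
    return a * math.log(b) - math.lgamma(a) + (a - 1.0) * np.log(theta) - b * theta

grid = np.array([0.5, 1.0, 2.0, 4.0])
diff = log_unnormalised_posterior(grid) - log_gamma_pdf(grid, alpha_n, beta_n)
# diff is constant over the grid: the posterior kernel is Gamma(alpha_n, beta_n)
```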

2. We wish to use the Metropolis-Hastings algorithm to sample from the posterior distribution $$\theta \, | \, x$$ using a normal distribution with mean $$\theta$$ and chosen variance $$\sigma^{2}$$ as the symmetric proposal distribution.

1. Suppose that, at time $$t$$, the proposed value $$\theta^{*} \leq 0$$. Briefly explain why the corresponding acceptance probability is zero for such a $$\theta^{*}$$ and thus that the sequence of values generated by the algorithm is never less than zero.

As the proposal distribution is symmetric, the Metropolis-Hastings algorithm reduces to the Metropolis algorithm with acceptance probability $\begin{eqnarray*} \alpha(\theta^{(t-1)}, \theta^{*}) = \min \left(1, \frac{f(\theta^{*} \, | \, x)}{f(\theta^{(t-1)} \, | \, x)}\right) \end{eqnarray*}$ where $$\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})$$. Now, assuming $$\theta^{(t-1)} > 0$$, if $$\theta^{*} \leq 0$$ then $$f(\theta^{*} \, | \, x) = 0$$ and $$\alpha(\theta^{(t-1)}, \theta^{*}) = 0$$ so the move to $$\theta^{*} \leq 0$$ is rejected and $$\theta^{(t)} = \theta^{(t-1)}$$. So, provided $$\theta^{(0)} > 0$$, the sequence of values generated by the algorithm is never less than zero.

2. Describe how the Metropolis-Hastings algorithm works for this example, giving the acceptance probability in its simplest form.

The algorithm is performed as follows.

1. Choose a starting point $$\theta^{(0)}$$ for which $$f(\theta^{(0)} \, | \, x) > 0$$. This will hold for any $$\theta^{(0)} > 0$$ as $$\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})$$.

2. At time $$t$$

• Sample $$\theta^{*} \sim N(\theta^{(t-1)}, \sigma^{2})$$.

• Calculate the acceptance probability $\begin{eqnarray*} \alpha(\theta^{(t-1)}, \theta^{*}) & = & \min \left(1, \frac{f(\theta^{*} \, | \, x)}{f(\theta^{(t-1)} \, | \, x)}\right) \\ & = & \left\{\begin{array}{cl} \min\left(1, \frac{(\theta^{*})^{\alpha_{n}-1}\exp(-\beta_{n}\theta^{*})}{(\theta^{(t-1)})^{\alpha_{n}-1}\exp(-\beta_{n}\theta^{(t-1)})}\right) & \theta^{*} > 0 \\ 0 & \theta^{*} \leq 0 \end{array}\right.\\ & = & \left\{\begin{array}{cl} \min\left(1, \left(\frac{\theta^{*}}{\theta^{(t-1)}}\right)^{\alpha_{n} - 1} \exp\left\{-\beta_{n}\left(\theta^{*} - \theta^{(t-1)}\right)\right\}\right) & \theta^{*} > 0 \\ 0 & \theta^{*} \leq 0 \end{array}\right. \end{eqnarray*}$ where $$\alpha_{n} = \alpha_{0} + n \alpha$$, $$\beta_{n} = \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}$$.

• Generate $$U \sim U(0, 1)$$.

• If $$U \leq \alpha(\theta^{(t-1)}, \theta^{*})$$ accept the move, $$\theta^{(t)} = \theta^{*}$$. Otherwise reject the move, $$\theta^{(t)}= \theta^{(t-1)}$$.

3. Repeat step 2.

The algorithm will produce a Markov Chain with stationary distribution $$f(\theta \, | \, x)$$. After a sufficiently long time, $$b$$ say, to allow for convergence, the values $$\{\theta^{(t)}\}$$ for $$t > b$$ may be considered as a sample from $$f(\theta \, | \, x)$$, that is a sample from $$\mbox{Gamma}(\alpha_{n}, \beta_{n})$$, where the samples $$\{\theta^{(t)} \}$$ for $$t \leq b$$ are the “burn-in”.
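The algorithm above can be sketched in Python; this is a minimal sketch, with the hyperparameter values in the usage note below assumed purely for illustration:

```python
import numpy as np

def metropolis_gamma(alpha_n, beta_n, sigma, n_iter, theta0=1.0, seed=0):
    """Random-walk Metropolis targeting Gamma(alpha_n, beta_n).
    sigma is the proposal standard deviation (sigma^2 the proposal variance)."""
    rng = np.random.default_rng(seed)
    theta = theta0                       # any theta0 > 0 is a valid start
    draws = np.empty(n_iter)
    for t in range(n_iter):
        theta_star = rng.normal(theta, sigma)
        if theta_star > 0:
            # log of the simplified acceptance ratio derived above
            log_ratio = (alpha_n - 1.0) * np.log(theta_star / theta) \
                - beta_n * (theta_star - theta)
            accept = np.log(rng.uniform()) <= min(0.0, log_ratio)
        else:
            accept = False               # proposals <= 0 have zero posterior density
        if accept:
            theta = theta_star
        draws[t] = theta
    return draws
```

For instance, with $$\alpha_{n} = 5$$ and $$\beta_{n} = 2$$ the post-burn-in sample mean should be close to $$\alpha_{n}/\beta_{n} = 2.5$$.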

### Question 2

Suppose that $$X \, | \, \theta \sim N(\theta, \sigma^{2})$$ and $$Y \, | \, \theta, \delta \sim N(\theta - \delta, \sigma^{2})$$, where $$\sigma^{2}$$ is a known constant and $$X$$ and $$Y$$ are conditionally independent given $$\theta$$ and $$\delta$$. It is judged that the improper noninformative joint prior distribution $$f(\theta, \delta) \propto 1$$ is appropriate.

1. Show that the joint posterior distribution of $$\theta$$ and $$\delta$$ given $$x$$ and $$y$$ is bivariate normal with mean vector $$\mu$$ and variance matrix $$\Sigma$$ where $\begin{eqnarray*} \mu \ = \ \left(\begin{array}{c} E(\theta \, | \, X, Y) \\ E(\delta \, | \, X, Y) \end{array} \right) \ = \ \left(\begin{array}{c} x \\ x-y \end{array} \right); \ \ \Sigma \ = \ \left(\begin{array}{cc} \sigma^{2} & \sigma^{2} \\ \sigma^{2} & 2\sigma^{2} \end{array} \right). \end{eqnarray*}$

As $$X \, | \, \theta \sim N(\theta, \sigma^{2})$$ then $\begin{eqnarray*} f(x \, | \, \theta) & \propto & \exp\left\{-\frac{1}{2\sigma^{2}} (\theta^{2} - 2x\theta)\right\}. \end{eqnarray*}$ As $$Y \, | \, \theta, \delta \sim N(\theta - \delta, \sigma^{2})$$ then $\begin{eqnarray*} f(y \, | \, \theta, \delta) & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[(\theta-\delta)^{2} - 2y(\theta - \delta)]\right\}. \end{eqnarray*}$ Now, by conditional independence, $\begin{eqnarray*} f(x, y \, | \, \theta, \delta) & = & f(x \, | \, \theta)f(y \, | \, \theta, \delta) \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[\theta^{2} - 2x\theta + (\theta-\delta)^{2} - 2y(\theta - \delta)]\right\}. \end{eqnarray*}$ So, $\begin{eqnarray*} f(\theta, \delta \, | \, x, y) & \propto & f(x, y \, | \, \theta, \delta)f(\theta, \delta) \\ & \propto & f(x, y \, | \, \theta, \delta) \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[\theta^{2} - 2x\theta + (\theta-\delta)^{2} - 2y(\theta - \delta)]\right\} \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} -2(x+y)\theta - 2\theta\delta + \delta^{2} + 2y\delta]\right\}. \end{eqnarray*}$ The exponential argument is quadratic in $$\theta$$, $$\delta$$ so $$\theta, \delta \, | \, x, y$$ is bivariate normal. We can find the exact normal by completing the square or, in this case, using the information given in the question. 
We have, $\begin{eqnarray*} \Sigma^{-1} & = & \frac{1}{\sigma^{2}}\left(\begin{array}{cc} 2 & -1 \\ -1 & 1 \end{array} \right) \end{eqnarray*}$ so that the density of the $$N_{2}(\mu, \Sigma)$$ is $\begin{eqnarray*} \exp\left\{-\frac{1}{2} \left(\theta - x \ \ \delta - (x-y)\right) \frac{1}{\sigma^{2}}\left(\begin{array}{cc} 2 & -1 \\ -1 & 1 \end{array} \right)\left(\begin{array}{c} \theta - x \\ \delta - (x-y) \end{array}\right) \right\} & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(2x-(x-y))\theta - 2\theta \delta + \delta^{2} - 2((x-y)-x)\delta]\right\} \\ & = & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(x+y)\theta - 2\theta \delta + \delta^{2} + 2y\delta]\right\} \\ & \propto & f(\theta, \delta \, | \, x, y). \end{eqnarray*}$
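The claimed inverse can be verified numerically, for an arbitrary illustrative value of $$\sigma^{2}$$:

```python
import numpy as np

sigma2 = 1.7  # arbitrary positive value; the check holds for any sigma2 > 0
Sigma = sigma2 * np.array([[1.0, 1.0],
                           [1.0, 2.0]])
Sigma_inv = (1.0 / sigma2) * np.array([[2.0, -1.0],
                                       [-1.0, 1.0]])
check = Sigma @ Sigma_inv  # should be the 2 x 2 identity matrix
```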

2. Describe how the Gibbs sampler may be used to sample from the posterior distribution $$\theta, \delta \, | \, x , y$$, deriving all required conditional distributions.

The Gibbs sampler requires the conditional distributions $$\theta \, | \, \delta, x, y$$ and $$\delta \, | \, \theta, x, y$$. Now, with proportionality with respect to $$\theta$$, $\begin{eqnarray*} f(\theta \, | \, \delta, x, y) & \propto & f(\theta, \delta \, | \, x, y) \\ & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(x+y+\delta)\theta]\right\} \\ & = & \exp\left\{-\frac{1}{2}\left(\frac{2}{\sigma^{2}}\right)\left[\theta^{2} - 2\left(\frac{x+y+\delta}{2}\right)\theta\right]\right\} \end{eqnarray*}$ which is a kernel of a normal density. Thus, $$\theta \, | \, \delta, x, y \sim N(\frac{x+y+\delta}{2}, \frac{\sigma^{2}}{2})$$. Now, with proportionality with respect to $$\delta$$, $\begin{eqnarray*} f(\delta \, | \, \theta, x, y) & \propto & f(\theta, \delta \, | \, x, y) \\ & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[\delta^{2} - 2(\theta-y)\delta]\right\} \end{eqnarray*}$ which is a kernel of a normal density. Thus, $$\delta \, | \, \theta, x, y \sim N(\theta - y, \sigma^{2})$$. The Gibbs sampler algorithm is

1. Choose a starting value $$(\theta^{(0)}, \delta^{(0)})$$ for which $$f(\theta^{(0)}, \delta^{(0)} \, | \, x, y) > 0$$.
2. At iteration $$t$$ generate the new values $$(\theta^{(t)}, \delta^{(t)})$$ as follows:
• draw $$\theta^{(t)}$$ from $$N(\frac{x+y+\delta^{(t-1)}}{2}, \frac{\sigma^{2}}{2})$$, the distribution of $$\theta^{(t)} \, | \, \delta^{(t-1)}, x, y$$.

• draw $$\delta^{(t)}$$ from $$N(\theta^{(t)}-y, \sigma^{2})$$, the distribution of $$\delta^{(t)} \, | \, \theta^{(t)}, x, y$$.

3. Repeat step 2.

The algorithm produces a Markov Chain with stationary distribution $$f(\theta, \delta \, | \, x, y)$$. After a sufficiently long time to allow for convergence, the values $$\{\theta^{(t)}, \delta^{(t)}\}$$ for $$t > b$$ may be considered as a sample from $$f(\theta, \delta \, | \, x, y)$$ where the samples $$\{\theta^{(t)}, \delta^{(t)}\}$$ for $$t \leq b$$ are the “burn-in”.
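The two conditional draws translate directly into code; a minimal sketch:

```python
import numpy as np

def gibbs_theta_delta(x, y, sigma2, n_iter, seed=0):
    """Gibbs sampler for (theta, delta) | x, y using the conditionals derived above."""
    rng = np.random.default_rng(seed)
    theta, delta = 0.0, 0.0  # any starting point has positive posterior density
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        # theta | delta, x, y ~ N((x + y + delta)/2, sigma2/2)
        theta = rng.normal((x + y + delta) / 2.0, np.sqrt(sigma2 / 2.0))
        # delta | theta, x, y ~ N(theta - y, sigma2)
        delta = rng.normal(theta - y, np.sqrt(sigma2))
        out[t] = theta, delta
    return out
```

After burn-in, the sample means approach $$E(\theta \, | \, x, y) = x$$ and $$E(\delta \, | \, x, y) = x - y$$.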

3. Suppose that $$x=2$$, $$y=1$$ and $$\sigma^{2} = 1$$. Sketch the contours of the joint posterior distribution. Starting from the origin, add to your sketch the first four steps of a typical Gibbs sampler path.

For $$x=2$$, $$y=1$$ and $$\sigma^{2} = 1$$ we shall use the Gibbs sampler to sample from $$N_{2}(\mu, \Sigma)$$ where $$\mu = \left(\begin{array}{c} 2 \\ 1 \end{array}\right)$$ and $$\Sigma = \left(\begin{array}{cc} 1 & 1 \\ 1 & 2 \end{array}\right)$$. We draw $$\theta^{(t)}$$ from $$N(\frac{3+\delta^{(t-1)}}{2}, \frac{1}{2})$$ and $$\delta^{(t)}$$ from $$N(\theta^{(t)}-1, 1)$$. Suppose that, starting from the origin $$(0,0)$$, we sample $$\theta^{(1)} = 2.216715$$, $$\delta^{(1)} = 1.475738$$, $$\theta^{(2)} = 2.086383$$ and $$\delta^{(2)} = 1.617760$$. The path of the trajectory of these points is shown in the following figure.

*Figure: A trajectory path of the Gibbs sampler sampling from $$N_{2}(\mu, \Sigma)$$.*
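The characteristic staircase shape of the path (horizontal moves update $$\theta$$ with $$\delta$$ fixed; vertical moves update $$\delta$$ with $$\theta$$ fixed) can be generated as follows. The seed is arbitrary, so this sketch reproduces the shape of a typical path rather than the particular values quoted above:

```python
import numpy as np

rng = np.random.default_rng(42)       # arbitrary seed, for illustration only
theta, delta = 0.0, 0.0               # start from the origin
path = [(theta, delta)]
for _ in range(2):                    # two sweeps = four component updates
    theta = rng.normal((3.0 + delta) / 2.0, np.sqrt(0.5))
    path.append((theta, delta))       # horizontal move: theta changes, delta fixed
    delta = rng.normal(theta - 1.0, 1.0)
    path.append((theta, delta))       # vertical move: delta changes, theta fixed
```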

4. Suppose, instead, that we consider sampling from the posterior distribution using the Metropolis-Hastings algorithm where the proposal distribution is the bivariate normal with mean vector $$\tilde{\mu}^{(t-1)} = (\theta^{(t-1)}, \delta^{(t-1)})^{T}$$ and known variance matrix $$\tilde{\Sigma}$$. Explain the Metropolis-Hastings algorithm for this case, explicitly stating the acceptance probability.

For the proposal we have $\begin{eqnarray*} q(\theta^{*}, \delta^{*} \, | \, \theta^{(t-1)}, \delta^{(t-1)}) & \propto & \exp\left\{-\frac{1}{2} \left(\theta^{*} - \theta^{(t-1)} \ \ \delta^{*} - \delta^{(t-1)}\right) \tilde{\Sigma}^{-1}\left(\begin{array}{c} \theta^{*} - \theta^{(t-1)} \\ \delta^{*} - \delta^{(t-1)} \end{array}\right) \right\} \end{eqnarray*}$ so that $$q(\theta^{*}, \delta^{*} \, | \, \theta^{(t-1)}, \delta^{(t-1)}) = q(\theta^{(t-1)}, \delta^{(t-1)} \, | \, \theta^{*}, \delta^{*})$$. The proposal distribution is symmetric and so the Metropolis-Hastings algorithm reduces to the Metropolis algorithm. The algorithm is performed as follows.

1. Choose a starting point $$(\theta^{(0)}, \delta^{(0)})$$ for which $$f(\theta^{(0)}, \delta^{(0)} \, | \, x, y) > 0$$.
2. At time $$t$$
• Sample $$(\theta^{*}, \delta^{*}) \sim N_{2}(\tilde{\mu}^{(t-1)}, \tilde{\Sigma})$$.

• Calculate the acceptance probability $\begin{eqnarray*} \alpha((\theta^{(t-1)}, \delta^{(t-1)}), (\theta^{*}, \delta^{*})) & = & \min \left(1, \frac{f(\theta^{*}, \delta^{*} \, | \, x, y)}{f(\theta^{(t-1)}, \delta^{(t-1)} \, | \, x, y)}\right) \\ & = & \min \left(1, \frac{\exp\left\{-\frac{1}{2} (\gamma^{*} - \mu)^{T} \Sigma^{-1} (\gamma^{*} - \mu)\right\}}{\exp\left\{-\frac{1}{2} (\gamma^{(t-1)} - \mu)^{T} \Sigma^{-1} (\gamma^{(t-1)} - \mu)\right\}}\right) \end{eqnarray*}$ where $$\gamma^{*} = \left(\begin{array}{c} \theta^{*} \\ \delta^{*} \end{array}\right)$$ and $$\gamma^{(t-1)} = \left(\begin{array}{c} \theta^{(t-1)} \\ \delta^{(t-1)} \end{array}\right)$$.

• Generate $$U \sim U(0, 1)$$.

• If $$U \leq \alpha((\theta^{(t-1)}, \delta^{(t-1)}), (\theta^{*}, \delta^{*}))$$ accept the move, $$(\theta^{(t)}, \delta^{(t)}) = (\theta^{*}, \delta^{*})$$. Otherwise reject the move, $$(\theta^{(t)}, \delta^{(t)}) = (\theta^{(t-1)}, \delta^{(t-1)})$$.

3. Repeat step 2.

The algorithm produces a Markov Chain with stationary distribution $$f(\theta, \delta \, | \, x, y)$$. After a sufficiently long time to allow for convergence, the values $$\{\theta^{(t)}, \delta^{(t)}\}$$ for $$t > b$$ may be considered as a sample from $$f(\theta, \delta \, | \, x, y)$$ where the samples $$\{\theta^{(t)}, \delta^{(t)}\}$$ for $$t \leq b$$ are the “burn-in”.
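A minimal sketch of this sampler, taking the identity matrix as an assumed illustrative choice of $$\tilde{\Sigma}$$:

```python
import numpy as np

def metropolis_bivariate(x, y, sigma2, prop_cov, n_iter, seed=0):
    """Random-walk Metropolis targeting the N2(mu, Sigma) posterior of (theta, delta).
    prop_cov is the proposal variance matrix (Sigma-tilde in the question)."""
    mu = np.array([x, x - y])
    Sigma_inv = (1.0 / sigma2) * np.array([[2.0, -1.0], [-1.0, 1.0]])
    chol = np.linalg.cholesky(prop_cov)
    rng = np.random.default_rng(seed)

    def log_post(g):
        # log posterior kernel of gamma = (theta, delta)
        d = g - mu
        return -0.5 * d @ Sigma_inv @ d

    g = mu.copy()  # start at the posterior mode for convenience
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        g_star = g + chol @ rng.standard_normal(2)
        # log U <= 0 always, so min(1, ratio) is handled implicitly
        if np.log(rng.uniform()) <= log_post(g_star) - log_post(g):
            g = g_star
        out[t] = g
    return out
```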

### Question 3

Let $$X_{1}, \ldots, X_{n}$$ be exchangeable so that the $$X_{i}$$ are conditionally independent given a parameter $$\theta = (\mu, \lambda)$$. Suppose that $$X_{i} \, | \, \theta \sim N(\mu, 1/\lambda)$$ so that $$\mu$$ is the mean and $$\lambda$$ the precision of the distribution. Suppose that we judge that $$\mu$$ and $$\lambda$$ are independent with $$\mu \sim N(\mu_{0}, 1/\tau)$$, where $$\mu_{0}$$ and $$\tau$$ are known, and $$\lambda \sim \mbox{Gamma}(\alpha, \beta)$$, where $$\alpha$$ and $$\beta$$ are known.

1. Show that the posterior density $$f(\mu, \lambda \, | \, x)$$, where $$x = (x_{1}, \ldots, x_{n})$$, can be expressed as $\begin{eqnarray*} f(\mu, \lambda \, | \, x) & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2} - \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu - \beta \lambda\right\}. \end{eqnarray*}$

As $$X_{i} \, | \, \theta \sim N(\mu, 1/\lambda)$$ and $$\theta = (\mu, \lambda)$$ then $\begin{eqnarray*} f(x \, | \, \mu, \lambda) & \propto & \prod_{i=1}^{n} \lambda^{\frac{1}{2}} \exp\left\{-\frac{\lambda}{2}(x_{i}-\mu)^{2}\right\} \\ & = & \lambda^{\frac{n}{2}} \exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2}\right\}. \end{eqnarray*}$ $$\mu \sim N(\mu_{0}, 1/\tau)$$ so that $\begin{eqnarray*} f(\mu) & \propto & \exp\left\{- \frac{\tau}{2}(\mu - \mu_{0})^{2}\right\} \\ & \propto & \exp\left\{- \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu\right\}. \end{eqnarray*}$ Finally, as $$\lambda \sim Gamma(\alpha, \beta)$$ then $\begin{eqnarray*} f(\lambda) & \propto & \lambda^{\alpha -1} \exp\left\{-\beta \lambda\right\}. \end{eqnarray*}$ Hence, as $$\mu$$ and $$\lambda$$ are independent, $\begin{eqnarray*} f(\mu, \lambda \, | \, x) & \propto & f(x \, | \, \mu, \lambda)f(\mu)f(\lambda) \\ & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2} - \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu - \beta \lambda\right\}. \end{eqnarray*}$

2. Hence show that $\begin{eqnarray*} \lambda \, | \, \mu, x \sim \mbox{Gamma}\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_{i} - \mu)^{2}\right). \end{eqnarray*}$

We have $\begin{eqnarray*} f(\lambda \, | \, \mu, x) & = & \frac{f(\lambda, \mu \, | \, x)}{f(\mu \, | \, x)} \\ & \propto & f(\lambda, \mu \, | \, x) \end{eqnarray*}$ where the proportionality is with respect to $$\lambda$$. So, from 2.(a) we have $\begin{eqnarray*} f(\lambda \, | \, \mu, x) & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2} - \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu - \beta \lambda\right\} \\ & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\lambda \left(\beta + \frac{1}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2}\right)\right\} \end{eqnarray*}$ which is a kernel of a $$Gamma\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_{i} - \mu)^{2}\right)$$.

3. Given that $$\mu \, | \, \lambda, x \sim N(\frac{\tau\mu_{0} + n\lambda \bar{x}}{\tau + n \lambda}, \frac{1}{\tau + n \lambda})$$, where $$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}$$, describe how the Gibbs sampler may be used to sample from the posterior distribution $$\mu, \lambda \, | \, x$$. Give a sensible estimate of $$Var(\lambda \, | \, x)$$.

The Gibbs sampler requires the conditional distributions $$\lambda \, | \, \mu, x$$ and $$\mu \, | \, \lambda, x$$ which we have. The algorithm is

1. Choose a starting value $$(\mu^{(0)}, \lambda^{(0)})$$ for which $$f(\mu^{(0)}, \lambda^{(0)} \, | \, x) > 0$$.

2. At iteration $$t$$ generate the new values $$(\mu^{(t)}, \lambda^{(t)})$$ as follows:

• draw $$\mu^{(t)}$$ from $$N(\frac{\tau\mu_{0} + n\lambda^{(t-1)} \bar{x}}{\tau + n \lambda^{(t-1)}}, \frac{1}{\tau + n \lambda^{(t-1)}})$$, the distribution of $$\mu^{(t)} \, | \, \lambda^{(t-1)}, x$$.
• draw $$\lambda^{(t)}$$ from $$Gamma\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_{i} - \mu^{(t)})^{2}\right)$$, the distribution of $$\lambda^{(t)} \, | \, \mu^{(t)}, x$$.
3. Repeat step 2.

The algorithm will produce a Markov Chain with stationary distribution $$f(\mu, \lambda \, | \, x)$$. After a sufficiently long time to allow for convergence, the values $$\{\mu^{(t)}, \lambda^{(t)}\}$$ for $$t > b$$ may be considered as a sample from $$f(\mu, \lambda \, | \, x)$$ where the samples $$\{\mu^{(t)}, \lambda^{(t)}\}$$ for $$t \leq b$$ are the “burn-in”.

Ergodic averages converge to the desired expectations under the target distribution so that, for example, $\begin{eqnarray*} \frac{1}{N} \sum_{t=1}^{N} g(\lambda^{(t)}) & \rightarrow & E(g(\lambda) \, | \, X) \end{eqnarray*}$ as $$N \rightarrow \infty$$. As this includes the observations prior to convergence, we will do better to estimate only using the observations after the “burn-in”. Noting that $$Var(\lambda \, | \, X)$$ corresponds to $$g(\lambda) = (\lambda - E(\lambda \, | \, X))^{2}$$ then any sensible sample variance will suffice. For example, we could estimate $$Var(\lambda \, | \, X) = E(\lambda^{2} \, | \, X) - E^{2}(\lambda \, | \, X)$$ by $\begin{eqnarray*} \frac{1}{N-b} \sum_{t=b+1}^{N} (\lambda^{(t)} - \bar{\lambda})^{2} & = & \left(\frac{1}{N-b} \sum_{t=b+1}^{N} (\lambda^{(t)})^{2}\right) -\bar{\lambda}^{2} \end{eqnarray*}$ where our sample is $$\{\mu^{(t)}, \lambda^{(t)} : t = b+1, \ldots, N\}$$ and $$\bar{\lambda} = \frac{1}{N-b} \sum_{t=b+1}^{N} \lambda^{(t)}$$, the sample mean.
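The sampler and the variance estimate can be sketched as follows; the data and hyperparameter values are assumptions chosen purely for illustration:

```python
import numpy as np

def gibbs_mu_lambda(x, mu0, tau, alpha, beta, n_iter, seed=0):
    """Gibbs sampler for (mu, lambda) | x using the two conditionals above."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x)
    n, xbar = len(x), x.mean()
    lam = 1.0                                  # any positive start suffices
    out = np.empty((n_iter, 2))
    for t in range(n_iter):
        # mu | lambda, x ~ N((tau*mu0 + n*lambda*xbar)/(tau + n*lambda), 1/(tau + n*lambda))
        prec = tau + n * lam
        mu = rng.normal((tau * mu0 + n * lam * xbar) / prec, np.sqrt(1.0 / prec))
        # lambda | mu, x ~ Gamma(alpha + n/2, beta + sum((x - mu)^2)/2); numpy uses scale = 1/rate
        lam = rng.gamma(alpha + n / 2.0, 1.0 / (beta + 0.5 * np.sum((x - mu) ** 2)))
        out[t] = mu, lam
    return out

# Hypothetical data and hyperparameters, for illustration only
rng = np.random.default_rng(7)
data = rng.normal(5.0, 2.0, size=50)
draws = gibbs_mu_lambda(data, mu0=0.0, tau=0.01, alpha=2.0, beta=2.0, n_iter=20000)
b = 2000                                       # burn-in
lam_post = draws[b:, 1]
var_lambda_hat = lam_post.var()                # estimate of Var(lambda | x)
```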

### Question 4

Consider a Poisson hierarchical model. At the first stage we have observations $$s_{j}$$ which are Poisson with mean $$t_{j}\lambda_{j}$$ for $$j = 1, \ldots, p$$ where each $$t_{j}$$ is known. We assume that the $$\lambda_{j}$$ are independent and identically distributed with $$\mbox{Gamma}(\alpha, \beta)$$ prior distributions. The parameter $$\alpha$$ is known but $$\beta$$ is unknown and is given a $$\mbox{Gamma}(\gamma, \delta)$$ distribution where $$\gamma$$ and $$\delta$$ are known. The $$s_{j}$$ are assumed to be conditionally independent given the unknown parameters.

1. Find, up to a constant of integration, the joint posterior distribution of the unknown parameters given $$s = (s_{1}, \ldots, s_{p})$$.

Let $$\lambda = (\lambda_{1}, \ldots, \lambda_{p})$$ then the unknown parameters are $$\theta = (\lambda_{1}, \ldots, \lambda_{p}, \beta) = (\lambda, \beta)$$. The prior is $\begin{eqnarray*} f(\lambda, \beta) & = & f(\lambda \, | \, \beta)f(\beta) \\ & = & \left\{\prod_{j=1}^{p} f(\lambda_{j} \, | \, \beta)\right\}f(\beta) \\ & = & \left\{\prod_{j=1}^{p} \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda_{j}^{\alpha-1}e^{-\beta\lambda_{j}}\right\} \times \frac{\delta^{\gamma}}{\Gamma(\gamma)} \beta^{\gamma-1}e^{-\delta \beta} \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{\alpha-1}\right)\beta^{(p\alpha+\gamma)-1}\exp\left\{-\beta\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right)\right\}. \end{eqnarray*}$ The likelihood is $\begin{eqnarray*} f(s \, | \, \lambda, \beta) & = & \prod_{j=1}^{p} \frac{(t_{j}\lambda_{j})^{s_{j}}e^{-t_{j}\lambda_{j}}}{s_{j}!} \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{s_{j}}\right) \exp\left(-\sum_{j=1}^{p}t_{j}\lambda_{j} \right). \end{eqnarray*}$ The posterior is thus $\begin{eqnarray*} f(\lambda, \beta \, | \, s) & \propto & f(s \, | \, \lambda, \beta)f(\lambda, \beta) \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{\alpha+s_{j}-1}\right)\beta^{(p\alpha+\gamma)-1}\exp\left\{-\beta\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right) -\sum_{j=1}^{p}t_{j}\lambda_{j} \right\}. \end{eqnarray*}$

2. Describe how the Gibbs sampler may be used to sample from the posterior distribution, deriving all required conditional distributions.

For each $$j$$, let $$\lambda_{-j} = \lambda \setminus \lambda_{j}$$. For the Gibbs sampler we need to find the distributions of $$\lambda_{j} \, | \, \lambda_{-j}, \beta, s$$, $$j = 1, \ldots, p$$, and $$\beta \, | \, \lambda, s$$. Now, $\begin{eqnarray*} f(\lambda_{j} \, | \, \lambda_{-j}, \beta, s) & = & \frac{f(\lambda, \beta \, | \, s)}{f(\lambda_{-j}, \beta \, | \, s)} \\ & \propto & f(\lambda, \beta \, | \, s) \ \ \ \ (\mbox{with proportionality in } \lambda_{j}) \\ & \propto & \lambda_{j}^{\alpha+s_{j}-1} \exp\left\{-(\beta+t_{j})\lambda_{j}\right\} \end{eqnarray*}$ which is a kernel of a $$Gamma(\alpha+s_{j}, \beta+t_{j})$$ distribution. Hence, $$\lambda_{j} \, | \, \lambda_{-j}, \beta, s \sim Gamma(\alpha+s_{j}, \beta+t_{j})$$. Similarly, $\begin{eqnarray*} f(\beta \, | \, \lambda, s) & = & \frac{f(\lambda, \beta \, | \, s)}{f(\lambda \, | \, s)} \\ & \propto & f(\lambda, \beta \, | \, s) \ \ \ \ (\mbox{with proportionality in } \beta) \\ & \propto & \beta^{(p\alpha+\gamma)-1}\exp\left\{-\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right)\beta\right\} \end{eqnarray*}$ which is a kernel of a $$Gamma(p\alpha+\gamma,\delta+\sum_{j=1}^{p}\lambda_{j})$$ distribution. So, $$\beta \, | \, \lambda, s \sim Gamma(p\alpha+\gamma,\delta+\sum_{j=1}^{p}\lambda_{j})$$.

The Gibbs sampler is

1. Choose an arbitrary starting point for which $$f(\lambda^{(0)}, \beta^{(0)} \, | \, s) > 0$$. As $$\lambda_{j} > 0$$ for $$j=1, \ldots, p$$ and $$\beta > 0$$ then any collection of $$p+1$$ positive numbers will suffice.

2. At time $$t$$

• Obtain $$\lambda_{1}^{(t)}$$ by sampling from the distribution of $$\lambda_{1} \, | \, \lambda_{-1}^{(t-1)}, \beta^{(t-1)}, s$$ which is $$Gamma(\alpha + s_{1}, \beta^{(t-1)} + t_{1})$$.

• $$\vdots$$

• Obtain $$\lambda_{j}^{(t)}$$ by sampling from the distribution of $$\lambda_{j} \, | \, \lambda_{1}^{(t)}, \ldots, \lambda_{j-1}^{(t)}, \lambda_{j+1}^{(t-1)}, \ldots, \lambda_{p}^{(t-1)}, \beta^{(t-1)}, s$$ which is $$Gamma(\alpha + s_{j}, \beta^{(t-1)} + t_{j})$$.

• $$\vdots$$

• Obtain $$\lambda_{p}^{(t)}$$ by sampling from the distribution of $$\lambda_{p} \, | \, \lambda_{-p}^{(t)}, \beta^{(t-1)}, s$$ which is $$Gamma(\alpha + s_{p}, \beta^{(t-1)} + t_{p})$$.

• Obtain $$\beta^{(t)}$$ by sampling from the distribution of $$\beta \, | \, \lambda^{(t)}, s$$, the $$Gamma(p\alpha + \gamma, \delta+\sum_{j=1}^{p} \lambda_{j}^{(t)})$$ distribution.

3. Repeat step 2.

The algorithm is run until convergence is reached. After convergence, the resulting values $$(\lambda^{(t)}, \beta^{(t)})$$ are realisations from the posterior $$\lambda, \beta \, | \, s$$.
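A minimal sketch of this sampler; since the $$\lambda_{j}$$ are conditionally independent given $$\beta$$, the $$p$$ draws in step 2 can be made in one vectorised call:

```python
import numpy as np

def gibbs_poisson_gamma(s, t, alpha, gamma_, delta, n_iter, seed=0):
    """Gibbs sampler for (lambda_1, ..., lambda_p, beta) | s.
    numpy's gamma generator takes shape and scale = 1/rate."""
    rng = np.random.default_rng(seed)
    s, t = np.asarray(s, float), np.asarray(t, float)
    p = len(s)
    beta = 1.0                      # any positive starting value suffices
    lam_out = np.empty((n_iter, p))
    beta_out = np.empty(n_iter)
    for it in range(n_iter):
        # lambda_j | beta, s ~ Gamma(alpha + s_j, beta + t_j), independently over j
        lam = rng.gamma(alpha + s, 1.0 / (beta + t))
        # beta | lambda, s ~ Gamma(p*alpha + gamma, delta + sum_j lambda_j)
        beta = rng.gamma(p * alpha + gamma_, 1.0 / (delta + lam.sum()))
        lam_out[it], beta_out[it] = lam, beta
    return lam_out, beta_out
```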

3. Let $$\{\lambda_{1}^{(t)}, \ldots, \lambda_{p}^{(t)}, \beta^{(t)}; t = 1, \ldots, N\}$$, with $$N$$ large, be a realisation of the Gibbs sampler described above. Give sensible estimates of $$E(\lambda_{j} \, | \, s)$$, $$Var(\beta \, | \, s)$$ and $$E(\lambda_{j} \, | \, a \leq \beta \leq b, s)$$ where $$0 < a < b$$ are given constants.

We first remove the burn-in samples. Suppose we determine the burn-in time $$t_{0}$$ (writing $$t_{0}$$ rather than $$b$$ to avoid a clash with the given constant $$b$$), where we assume that the Markov chain has sufficiently converged for $$t > t_{0}$$. (We assume that $$t_{0} < N$$.) Then we discard the observations $$\{\lambda^{(t)}, \beta^{(t)}; t \leq t_{0}\}$$ and use the remaining sample $$\{\lambda^{(t)}, \beta^{(t)}; t = t_{0}+1, \ldots, N\}$$ to estimate posterior summaries. Here $$\lambda^{(t)} = (\lambda_{1}^{(t)}, \ldots, \lambda_{p}^{(t)})$$.

The posterior mean of $$\lambda_{j} \, | \, s$$ can be estimated by the sample mean of $$\{\lambda_{j}^{(t)}, t = t_{0}+1, \ldots, N\}$$ so that our estimate of $$E(\lambda_{j} \, | \, s)$$ is $$\frac{1}{N-t_{0}} \sum_{t = t_{0}+1}^{N} \lambda_{j}^{(t)}$$.

The posterior variance of $$\beta \, | \, s$$ can be estimated by the sample variance of $$\{\beta^{(t)}, t = t_{0}+1, \ldots, N\}$$ so that our estimate of $$Var(\beta \, | \, s)$$ is $$\frac{1}{N-t_{0}} \sum_{t = t_{0}+1}^{N} (\beta^{(t)}-\bar{\beta})^{2}$$ $$= (\frac{1}{N-t_{0}} \sum_{t = t_{0}+1}^{N} (\beta^{(t)})^{2}) - \bar{\beta}^{2}$$ where $$\bar{\beta} = \frac{1}{N-t_{0}} \sum_{t = t_{0}+1}^{N} \beta^{(t)}$$.

For estimating $$E(\lambda_{j} \, | \, a \leq \beta \leq b, s)$$ we not only discard the burn-in observations but also those elements $$\lambda_{j}^{(t)}$$ for which $$\beta^{(t)} < a$$ or $$\beta^{(t)} > b$$ for each $$t = t_{0}+1, \ldots, N$$. So, letting $$\mathbb{I}_{[a \leq \beta^{(t)} \leq b]}$$ denote the indicator function which is one if $$a \leq \beta^{(t)} \leq b$$ and zero otherwise, the number of observations after the burn-in with $$a \leq \beta^{(t)} \leq b$$ is $$M = \sum_{t = t_{0}+1}^{N} \mathbb{I}_{[a \leq \beta^{(t)} \leq b]}$$ and our estimate of $$E(\lambda_{j} \, | \, a \leq \beta \leq b, s)$$ is $$\frac{1}{M} \sum_{t=t_{0}+1}^{N} \lambda_{j}^{(t)}\mathbb{I}_{[a \leq \beta^{(t)} \leq b]}$$.
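These estimates are straightforward to compute from the sampler output; a minimal sketch, with illustrative function and argument names:

```python
import numpy as np

def posterior_summaries(lam, beta, burn, a, b, j=0):
    """Estimates of E(lambda_j | s), Var(beta | s) and E(lambda_j | a <= beta <= b, s)
    from Gibbs output lam (N x p) and beta (length N), discarding the first
    `burn` draws as burn-in."""
    lam_j, beta_kept = lam[burn:, j], beta[burn:]
    e_lam = lam_j.mean()
    var_beta = beta_kept.var()                    # sample variance (1/(N - burn) version)
    keep = (a <= beta_kept) & (beta_kept <= b)    # indicator of a <= beta^(t) <= b
    e_lam_cond = lam_j[keep].mean() if keep.any() else np.nan
    return e_lam, var_beta, e_lam_cond
```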

### Question 5

Show that the Gibbs sampler for sampling from a distribution $$\pi(\theta)$$ where $$\theta = (\theta_{1}, \ldots, \theta_{d})$$ can be viewed as a special case of the Metropolis-Hastings algorithm where each iteration $$t$$ consists of $$d$$ Metropolis-Hastings steps each with an acceptance probability of 1.

For the Gibbs sampler, we update each $$\theta_{p}$$ one at a time by obtaining $$\theta^{(t)}_{p}$$ from the conditional distribution $$\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})$$. Thus, we move from $$\theta_{p}^{(t-1)}$$ to $$\theta_{p}^{(t)}$$ leaving the remaining $$d-1$$ parameters fixed.

We now demonstrate that this update step in the Gibbs sampler is equivalent to a single Metropolis-Hastings step with acceptance probability 1. Let $$\theta^{(t-1)}_{[p]} = (\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t-1)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})$$ and $$\theta^{(t)}_{[p]} = (\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})$$. Consider the move from $$\theta^{(t-1)}_{[p]}$$ to $$\theta^{(t)}_{[p]}$$. As $$\theta_{p}^{(t)}$$ is drawn from $$\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})$$ then we shall use this for the proposal distribution in the Metropolis-Hastings algorithm so that $\begin{eqnarray} q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]}) & = & \pi(\theta_{p}^{(t)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}). \tag{1} \end{eqnarray}$ If, instead, we consider the move from $$\theta^{(t)}_{[p]}$$ to $$\theta^{(t-1)}_{[p]}$$ then, using Gibbs sampling, $$\theta^{(t-1)}_{[p]}$$ is obtained by drawing $$\theta_{p}^{(t-1)}$$ from $$\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})$$. Thus, we set $\begin{eqnarray} q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]}) & = & \pi(\theta_{p}^{(t-1)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}). \tag{2} \end{eqnarray}$ Now, $\begin{eqnarray} \pi(\theta^{(t-1)}_{[p]} ) \ = \ \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t-1)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) & = & \pi(\theta_{p}^{(t-1)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{3} \\ & = & q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]}) \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{4} \end{eqnarray}$ where (4) follows by substituting (2) into (3). In an identical fashion, $\begin{eqnarray} \pi(\theta^{(t)}_{[p]}) \ = \ \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) & = & \pi(\theta_{p}^{(t)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{5} \\ & = & q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]}) \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{6} \end{eqnarray}$ where (6) follows by substituting (1) into (5). Dividing (6) by (4) gives $\begin{eqnarray*} \frac{\pi(\theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})} & = & \frac{q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})}{q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})} \end{eqnarray*}$ so that $\begin{eqnarray} \frac{\pi(\theta^{(t)}_{[p]})q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})} & = & 1. \tag{7} \end{eqnarray}$ Thus, a Metropolis-Hastings algorithm for sampling from $$\pi(\theta)$$ for a proposed move from $$\theta^{(t-1)}_{[p]}$$ to $$\theta^{(t)}_{[p]}$$ using the proposal distribution given by (1) has an acceptance probability of $\begin{eqnarray} \alpha(\theta^{(t-1)}_{[p]}, \theta^{(t)}_{[p]}) & = & \min \left(1, \frac{\pi(\theta^{(t)}_{[p]})q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})} \right) \tag{8} \\ & = & 1 \tag{9} \end{eqnarray}$ where (9) follows from (8) using (7).
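The identity (7) can be checked numerically on a toy discrete target, where the full conditional is available exactly; the grid probabilities below are arbitrary illustrative values:

```python
import numpy as np

# Toy joint distribution pi(theta1, theta2) on a 2 x 3 grid (rows index theta1).
pi = np.array([[0.05, 0.10, 0.15],
               [0.20, 0.25, 0.25]])

def q(theta1_new, theta2):
    """Gibbs proposal: the full conditional pi(theta1 | theta2)."""
    return pi[theta1_new, theta2] / pi[:, theta2].sum()

def mh_ratio(theta1, theta1_star, theta2):
    """Metropolis-Hastings ratio for the move theta1 -> theta1_star at fixed theta2;
    with the full conditional as proposal this is identically 1."""
    return (pi[theta1_star, theta2] * q(theta1, theta2)) / \
           (pi[theta1, theta2] * q(theta1_star, theta2))
```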