Solution Sheet Eight

Question 1

Let \(X_{1}, \ldots, X_{n}\) be exchangeable so that the \(X_{i}\) are conditionally independent given a parameter \(\theta\). Suppose that \(X_{i} \, | \, \theta \sim \mbox{Inv-gamma}(\alpha, \theta)\), where \(\alpha\) is known, and we judge that \(\theta \sim \mbox{Gamma}(\alpha_{0}, \beta_{0})\), where \(\alpha_{0}\) and \(\beta_{0}\) are known.

  1. Show that \(\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})\) where \(\alpha_{n} = \alpha_{0} + n \alpha\), \(\beta_{n} = \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}\), and \(x = (x_{1}, \ldots, x_{n})\).

As \(X_{i} \, | \, \theta \sim \mbox{Inv-gamma}(\alpha, \theta)\) then \[\begin{eqnarray*} f(x \, | \, \theta) & = & \prod_{i=1}^{n} \frac{\theta^{\alpha}}{\Gamma(\alpha)} x_{i}^{-(\alpha+1)} \exp \left(-\frac{\theta}{x_{i}}\right) \\ & \propto & \theta^{n\alpha} \exp \left(-\theta \sum_{i=1}^{n}\frac{1}{x_{i}}\right). \end{eqnarray*}\] Similarly, as \(\theta \sim \mbox{Gamma}(\alpha_{0}, \beta_{0})\), \[\begin{eqnarray*} f(\theta) & \propto & \theta^{\alpha_{0}-1}\exp \left(-\theta \beta_{0}\right). \end{eqnarray*}\] Hence, \[\begin{eqnarray*} f(\theta \, | \, x) & \propto & f(x \, | \, \theta)f(\theta) \\ & \propto & \theta^{(\alpha_{0}+n\alpha)-1}\exp \left\{-\theta\left(\beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}\right)\right\} \end{eqnarray*}\] which we recognise as a kernel of a \(\mbox{Gamma}(\alpha_{0} + n \alpha, \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}})\). Thus, \(\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})\) as required.
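As a quick sanity check, the conjugate update can be computed directly. A minimal Python sketch, with purely illustrative values for \(\alpha\), \(\alpha_{0}\), \(\beta_{0}\) and simulated data (none of these values come from the question):

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative (hypothetical) values for the known quantities.
alpha, alpha0, beta0 = 3.0, 2.0, 1.0
theta_true, n = 4.0, 50

# X_i | theta ~ Inv-gamma(alpha, theta): if Y ~ Gamma(alpha, rate theta) then 1/Y has this law.
x = 1.0 / rng.gamma(alpha, scale=1.0 / theta_true, size=n)

# Conjugate update: theta | x ~ Gamma(alpha_n, beta_n).
alpha_n = alpha0 + n * alpha
beta_n = beta0 + np.sum(1.0 / x)
print(alpha_n, beta_n, alpha_n / beta_n)  # posterior shape, rate and mean
```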

  2. We wish to use the Metropolis-Hastings algorithm to sample from the posterior distribution \(\theta \, | \, x\) using a normal distribution with mean \(\theta\) and chosen variance \(\sigma^{2}\) as the symmetric proposal distribution.

    a. Suppose that, at time \(t\), the proposed value \(\theta^{*} \leq 0\). Briefly explain why the corresponding acceptance probability is zero for such a \(\theta^{*}\) and thus that the values generated by the algorithm are never less than zero.

As the proposal distribution is symmetric, the Metropolis-Hastings algorithm reduces to the Metropolis algorithm with acceptance probability \[\begin{eqnarray*} \alpha(\theta^{(t-1)}, \theta^{*}) = \min \left(1, \frac{f(\theta^{*} \, | \, x)}{f(\theta^{(t-1)} \, | \, x)}\right) \end{eqnarray*}\] where \(\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})\). Now, assuming \(\theta^{(t-1)} > 0\), if \(\theta^{*} \leq 0\) then \(f(\theta^{*} \, | \, x) = 0\) and \(\alpha(\theta^{(t-1)}, \theta^{*}) = 0\), so the move to \(\theta^{*} \leq 0\) is rejected and \(\theta^{(t)} = \theta^{(t-1)}\). So, provided \(\theta^{(0)} > 0\), the values generated by the algorithm are never less than zero.

    b. Describe how the Metropolis-Hastings algorithm works for this example, giving the acceptance probability in its simplest form.

The algorithm is performed as follows.

  1. Choose a starting point \(\theta^{(0)}\) for which \(f(\theta^{(0)} \, | \, x) > 0\). This will hold for any \(\theta^{(0)} > 0\) as \(\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})\).

  2. At time \(t\)

  • Sample \(\theta^{*} \sim N(\theta^{(t-1)}, \sigma^{2})\).

  • Calculate the acceptance probability \[\begin{eqnarray*} \alpha(\theta^{(t-1)}, \theta^{*}) & = & \min \left(1, \frac{f(\theta^{*} \, | \, x)}{f(\theta^{(t-1)} \, | \, x)}\right) \\ & = & \left\{\begin{array}{cl} \min\left(1, \frac{(\theta^{*})^{\alpha_{n}-1}\exp(-\beta_{n}\theta^{*})}{(\theta^{(t-1)})^{\alpha_{n}-1}\exp(-\beta_{n}\theta^{(t-1)})}\right) & \theta^{*} > 0 \\ 0 & \theta^{*} \leq 0 \end{array}\right.\\ & = & \left\{\begin{array}{cl} \min\left(1, \left(\frac{\theta^{*}}{\theta^{(t-1)}}\right)^{\alpha_{n} - 1} \exp\left\{-\beta_{n}\left(\theta^{*} - \theta^{(t-1)}\right)\right\}\right) & \theta^{*} > 0 \\ 0 & \theta^{*} \leq 0 \end{array}\right. \end{eqnarray*}\] where \(\alpha_{n} = \alpha_{0} + n \alpha\), \(\beta_{n} = \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}\).

  • Generate \(U \sim U(0, 1)\).

  • If \(U \leq \alpha(\theta^{(t-1)}, \theta^{*})\) accept the move, \(\theta^{(t)} = \theta^{*}\). Otherwise reject the move, \(\theta^{(t)}= \theta^{(t-1)}\).

  3. Repeat step 2.

The algorithm will produce a Markov Chain with stationary distribution \(f(\theta \, | \, x)\). After a sufficiently long time, \(b\) say, to allow for convergence, the values \(\{\theta^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\theta \, | \, x)\), that is a sample from \(\mbox{Gamma}(\alpha_{n}, \beta_{n})\), where the samples \(\{\theta^{(t)} \}\) for \(t \leq b\) are the “burn-in”.
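As an illustration, here is a minimal Python sketch of this random-walk Metropolis scheme, working on the log scale for numerical stability; the values of \(\alpha_{n}\), \(\beta_{n}\) and \(\sigma\) below are placeholders rather than quantities from the question.

```python
import numpy as np

rng = np.random.default_rng(0)

alpha_n, beta_n = 152.0, 13.7   # placeholder posterior parameters
sigma = 0.5                     # chosen proposal standard deviation
N, burn_in = 10_000, 1_000

def log_target(theta):
    """Log kernel of the Gamma(alpha_n, beta_n) posterior (requires theta > 0)."""
    return (alpha_n - 1) * np.log(theta) - beta_n * theta

theta = np.empty(N + 1)
theta[0] = alpha_n / beta_n     # any starting value with positive density
for t in range(1, N + 1):
    proposal = rng.normal(theta[t - 1], sigma)
    # A proposal theta* <= 0 has zero posterior density, so it is always rejected (part a).
    if proposal > 0 and np.log(rng.uniform()) <= log_target(proposal) - log_target(theta[t - 1]):
        theta[t] = proposal          # accept the move
    else:
        theta[t] = theta[t - 1]      # reject: keep the current value

sample = theta[burn_in + 1:]         # discard the burn-in
print(sample.mean(), alpha_n / beta_n)  # compare with the exact posterior mean
```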

Question 2

Suppose that \(X \, | \, \theta \sim N(\theta, \sigma^{2})\) and \(Y \, | \, \theta, \delta \sim N(\theta - \delta, \sigma^{2})\), where \(\sigma^{2}\) is a known constant and \(X\) and \(Y\) are conditionally independent given \(\theta\) and \(\delta\). It is judged that the improper noninformative joint prior distribution \(f(\theta, \delta) \propto 1\) is appropriate.

  1. Show that the joint posterior distribution of \(\theta\) and \(\delta\) given \(x\) and \(y\) is bivariate normal with mean vector \(\mu\) and variance matrix \(\Sigma\) where \[\begin{eqnarray*} \mu \ = \ \left(\begin{array}{c} E(\theta \, | \, X, Y) \\ E(\delta \, | \, X, Y) \end{array} \right) \ = \ \left(\begin{array}{c} x \\ x-y \end{array} \right); \ \ \Sigma \ = \ \left(\begin{array}{cc} \sigma^{2} & \sigma^{2} \\ \sigma^{2} & 2\sigma^{2} \end{array} \right). \end{eqnarray*}\]

As \(X \, | \, \theta \sim N(\theta, \sigma^{2})\) then \[\begin{eqnarray*} f(x \, | \, \theta) & \propto & \exp\left\{-\frac{1}{2\sigma^{2}} (\theta^{2} - 2x\theta)\right\}. \end{eqnarray*}\] As \(Y \, | \, \theta, \delta \sim N(\theta - \delta, \sigma^{2})\) then \[\begin{eqnarray*} f(y \, | \, \theta, \delta) & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[(\theta-\delta)^{2} - 2y(\theta - \delta)]\right\}. \end{eqnarray*}\] Now, by conditional independence, \[\begin{eqnarray*} f(x, y \, | \, \theta, \delta) & = & f(x \, | \, \theta)f(y \, | \, \theta, \delta) \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[\theta^{2} - 2x\theta + (\theta-\delta)^{2} - 2y(\theta - \delta)]\right\}. \end{eqnarray*}\] So, \[\begin{eqnarray*} f(\theta, \delta \, | \, x, y) & \propto & f(x, y \, | \, \theta, \delta)f(\theta, \delta) \\ & \propto & f(x, y \, | \, \theta, \delta) \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[\theta^{2} - 2x\theta + (\theta-\delta)^{2} - 2y(\theta - \delta)]\right\} \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} -2(x+y)\theta - 2\theta\delta + \delta^{2} + 2y\delta]\right\}. \end{eqnarray*}\] The exponential argument is quadratic in \(\theta\), \(\delta\) so \(\theta, \delta \, | \, x, y\) is bivariate normal. We can find the exact normal by completing the square or, in this case, using the information given in the question. We have, \[\begin{eqnarray*} \Sigma^{-1} & = & \frac{1}{\sigma^{2}}\left(\begin{array}{cc} 2 & -1 \\ -1 & 1 \end{array} \right) \end{eqnarray*}\] so that the density of the \(N_{2}(\mu, \Sigma)\) is \[\begin{eqnarray*} \exp\left\{-\frac{1}{2} \left(\theta - x \ \ \delta - (x-y)\right) \frac{1}{\sigma^{2}}\left(\begin{array}{cc} 2 & -1 \\ -1 & 1 \end{array} \right)\left(\begin{array}{c} \theta - x \\ \delta - (x-y) \end{array}\right) \right\} & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(2x-(x-y))\theta - 2\theta \delta + \delta^{2} - 2((x-y)-x)\delta]\right\} \\ & = & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(x+y)\theta - 2\theta \delta + \delta^{2} + 2y\delta]\right\} \\ & \propto & f(\theta, \delta \, | \, x, y). \end{eqnarray*}\]

  2. Describe how the Gibbs sampler may be used to sample from the posterior distribution \(\theta, \delta \, | \, x , y\), deriving all required conditional distributions.

The Gibbs sampler requires the conditional distributions \(\theta \, | \, \delta, x, y\) and \(\delta \, | \, \theta, x, y\). Now, with proportionality with respect to \(\theta\), \[\begin{eqnarray*} f(\theta \, | \, \delta, x, y) & \propto & f(\theta, \delta \, | \, x, y) \\ & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(x+y+\delta)\theta]\right\} \\ & = & \exp\left\{-\frac{1}{2}\left(\frac{2}{\sigma^{2}}\right)\left[\theta^{2} - 2\left(\frac{x+y+\delta}{2}\right)\theta\right]\right\} \end{eqnarray*}\] which is a kernel of a normal density. Thus, \(\theta \, | \, \delta, x, y \sim N(\frac{x+y+\delta}{2}, \frac{\sigma^{2}}{2})\). Now, with proportionality with respect to \(\delta\), \[\begin{eqnarray*} f(\delta \, | \, \theta, x, y) & \propto & f(\theta, \delta \, | \, x, y) \\ & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[\delta^{2} - 2(\theta-y)\delta]\right\} \end{eqnarray*}\] which is a kernel of a normal density. Thus, \(\delta \, | \, \theta, x, y \sim N(\theta - y, \sigma^{2})\). The Gibbs sampler algorithm is

  1. Choose a starting value \((\theta^{(0)}, \delta^{(0)})\) for which \(f(\theta^{(0)}, \delta^{(0)} \, | \, x, y) > 0\).
  2. At iteration \(t\) generate the new values \((\theta^{(t)}, \delta^{(t)})\) as follows:
  • draw \(\theta^{(t)}\) from \(N(\frac{x+y+\delta^{(t-1)}}{2}, \frac{\sigma^{2}}{2})\), the distribution of \(\theta^{(t)} \, | \, \delta^{(t-1)}, x, y\).

  • draw \(\delta^{(t)}\) from \(N(\theta^{(t)}-y, \sigma^{2})\), the distribution of \(\delta^{(t)} \, | \, \theta^{(t)}, x, y\).

  3. Repeat step 2.

The algorithm produces a Markov Chain with stationary distribution \(f(\theta, \delta \, | \, x, y)\). After a sufficiently long time to allow for convergence, the values \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\theta, \delta \, | \, x, y)\) where the samples \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t \leq b\) are the “burn-in”.
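A minimal Python sketch of this two-block Gibbs sampler; the values of \(x\), \(y\) and \(\sigma^{2}\) below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)

x, y, sigma2 = 2.0, 1.0, 1.0     # placeholder data and known variance
N, burn_in = 10_000, 1_000

theta = np.empty(N + 1)
delta = np.empty(N + 1)
theta[0], delta[0] = 0.0, 0.0    # any starting point has positive density

for t in range(1, N + 1):
    # theta | delta, x, y ~ N((x + y + delta)/2, sigma^2/2)
    theta[t] = rng.normal((x + y + delta[t - 1]) / 2, np.sqrt(sigma2 / 2))
    # delta | theta, x, y ~ N(theta - y, sigma^2)
    delta[t] = rng.normal(theta[t] - y, np.sqrt(sigma2))

sample = np.column_stack([theta, delta])[burn_in + 1:]
print(sample.mean(axis=0))       # should be close to (x, x - y)
```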

  3. Suppose that \(x=2\), \(y=1\) and \(\sigma^{2} = 1\). Sketch the contours of the joint posterior distribution. Starting from the origin, add to your sketch the first four steps of a typical Gibbs sampler path.

For \(x=2\), \(y=1\) and \(\sigma^{2} = 1\) we shall use the Gibbs sampler to sample from \(N_{2}(\mu, \Sigma)\) where \(\mu = \left(\begin{array}{c} 2 \\ 1 \end{array}\right)\) and \(\Sigma = \left(\begin{array}{cc} 1 & 1 \\ 1 & 2 \end{array}\right)\). We draw \(\theta^{(t)}\) from \(N(\frac{3+\delta^{(t-1)}}{2}, \frac{1}{2})\) and \(\delta^{(t)}\) from \(N(\theta^{(t)}-1, 1)\). Suppose that we start from \((1,1)\) and sample \(\theta^{(1)} = 2.216715\), \(\delta^{(1)} = 1.475738\), \(\theta^{(2)} = 2.086383\) and \(\delta^{(2)} = 1.617760\). The path of the trajectory of these points is shown in the following figure.


A trajectory path of the Gibbs sampler sampling from \(N_{2}(\mu, \Sigma)\).
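The figure can be reproduced with a short matplotlib sketch (an illustrative reconstruction, not the original plotting code): each conditional draw moves parallel to one axis, giving the characteristic staircase path. The starting point and step values simply echo those quoted above.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import multivariate_normal

mu = np.array([2.0, 1.0])
Sigma = np.array([[1.0, 1.0], [1.0, 2.0]])

# Contours of the N_2(mu, Sigma) posterior density.
grid = np.linspace(-1, 5, 200)
T, D = np.meshgrid(grid, grid)
density = multivariate_normal(mu, Sigma).pdf(np.dstack([T, D]))
plt.contour(T, D, density, levels=8)

# Staircase path of the quoted Gibbs steps from (1, 1):
# theta is updated first (horizontal move), then delta (vertical move).
thetas = [1.0, 2.216715, 2.086383]
deltas = [1.0, 1.475738, 1.617760]
path_t, path_d = [thetas[0]], [deltas[0]]
for k in range(1, len(thetas)):
    path_t += [thetas[k], thetas[k]]
    path_d += [deltas[k - 1], deltas[k]]
plt.plot(path_t, path_d, marker="o")
plt.xlabel(r"$\theta$")
plt.ylabel(r"$\delta$")
plt.show()
```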

  4. Suppose, instead, that we consider sampling from the posterior distribution using the Metropolis-Hastings algorithm where the proposal distribution is the bivariate normal with mean vector \(\tilde{\mu}^{(t-1)} = (\theta^{(t-1)}, \delta^{(t-1)})^{T}\) and known variance matrix \(\tilde{\Sigma}\). Explain the Metropolis-Hastings algorithm for this case, explicitly stating the acceptance probability.

For the proposal we have \[\begin{eqnarray*} q(\theta^{*}, \delta^{*} \, | \, \theta^{(t-1)}, \delta^{(t-1)}) & \propto & \exp\left\{-\frac{1}{2} \left(\theta^{*} - \theta^{(t-1)} \ \ \delta^{*} - \delta^{(t-1)}\right) \tilde{\Sigma}^{-1}\left(\begin{array}{c} \theta^{*} - \theta^{(t-1)} \\ \delta^{*} - \delta^{(t-1)} \end{array}\right) \right\} \end{eqnarray*}\] so that \(q(\theta^{*}, \delta^{*} \, | \, \theta^{(t-1)}, \delta^{(t-1)}) = q(\theta^{(t-1)}, \delta^{(t-1)} \, | \, \theta^{*}, \delta^{*})\). The proposal distribution is symmetric and so the Metropolis-Hastings algorithm reduces to the Metropolis algorithm. The algorithm is performed as follows.

  1. Choose a starting point \((\theta^{(0)}, \delta^{(0)})\) for which \(f(\theta^{(0)}, \delta^{(0)} \, | \, x, y) > 0\).
  2. At time \(t\)
  • Sample \((\theta^{*}, \delta^{*}) \sim N_{2}(\tilde{\mu}^{(t-1)}, \tilde{\Sigma})\).

  • Calculate the acceptance probability \[\begin{eqnarray*} \alpha((\theta^{(t-1)}, \delta^{(t-1)}), (\theta^{*}, \delta^{*})) & = & \min \left(1, \frac{f(\theta^{*}, \delta^{*} \, | \, x, y)}{f(\theta^{(t-1)}, \delta^{(t-1)} \, | \, x, y)}\right) \\ & = & \min \left(1, \frac{\exp\left\{-\frac{1}{2} (\gamma^{*} - \mu)^{T} \Sigma^{-1} (\gamma^{*} - \mu)\right\}}{\exp\left\{-\frac{1}{2} (\gamma^{(t-1)} - \mu)^{T} \Sigma^{-1} (\gamma^{(t-1)} - \mu)\right\}}\right) \end{eqnarray*}\] where \(\gamma^{*} = \left(\begin{array}{c} \theta^{*} \\ \delta^{*} \end{array}\right)\) and \(\gamma^{(t-1)} = \left(\begin{array}{c} \theta^{(t-1)} \\ \delta^{(t-1)} \end{array}\right)\).

  • Generate \(U \sim U(0, 1)\).

  • If \(U \leq \alpha((\theta^{(t-1)}, \delta^{(t-1)}), (\theta^{*}, \delta^{*}))\) accept the move, \((\theta^{(t)}, \delta^{(t)}) = (\theta^{*}, \delta^{*})\). Otherwise reject the move, \((\theta^{(t)}, \delta^{(t)}) = (\theta^{(t-1)}, \delta^{(t-1)})\).

  3. Repeat step 2.

The algorithm produces a Markov Chain with stationary distribution \(f(\theta, \delta \, | \, x, y)\). After a sufficiently long time to allow for convergence, the values \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\theta, \delta \, | \, x, y)\) where the samples \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t \leq b\) are the “burn-in”.
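A minimal Python sketch of this random-walk Metropolis scheme on \((\theta, \delta)\); the data values and the proposal variance matrix \(\tilde{\Sigma}\) below are placeholder choices.

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)

x, y, sigma2 = 2.0, 1.0, 1.0                       # placeholder data
mu = np.array([x, x - y])
Sigma = sigma2 * np.array([[1.0, 1.0], [1.0, 2.0]])
target = multivariate_normal(mu, Sigma)            # posterior of (theta, delta)
Sigma_tilde = 0.5 * np.eye(2)                      # chosen proposal variance matrix
N, burn_in = 10_000, 1_000

chain = np.empty((N + 1, 2))
chain[0] = [0.0, 0.0]                              # starting point
for t in range(1, N + 1):
    proposal = rng.multivariate_normal(chain[t - 1], Sigma_tilde)
    log_ratio = target.logpdf(proposal) - target.logpdf(chain[t - 1])
    # Accept with probability min(1, exp(log_ratio)).
    chain[t] = proposal if np.log(rng.uniform()) <= log_ratio else chain[t - 1]

print(chain[burn_in + 1:].mean(axis=0))            # should be close to mu
```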

Question 3

Let \(X_{1}, \ldots, X_{n}\) be exchangeable so that the \(X_{i}\) are conditionally independent given a parameter \(\theta = (\mu, \lambda)\). Suppose that \(X_{i} \, | \, \theta \sim N(\mu, 1/\lambda)\) so that \(\mu\) is the mean and \(\lambda\) the precision of the distribution. Suppose that we judge that \(\mu\) and \(\lambda\) are independent with \(\mu \sim N(\mu_{0}, 1/\tau)\), where \(\mu_{0}\) and \(\tau\) are known, and \(\lambda \sim \mbox{Gamma}(\alpha, \beta)\), where \(\alpha\) and \(\beta\) are known.

  1. Show that the posterior density \(f(\mu, \lambda \, | \, x)\), where \(x = (x_{1}, \ldots, x_{n})\), can be expressed as \[\begin{eqnarray*} f(\mu, \lambda \, | \, x) & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2} - \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu - \beta \lambda\right\}. \end{eqnarray*}\]

As \(X_{i} \, | \, \theta \sim N(\mu, 1/\lambda)\) and \(\theta = (\mu, \lambda)\) then \[\begin{eqnarray*} f(x \, | \, \mu, \lambda) & \propto & \prod_{i=1}^{n} \lambda^{\frac{1}{2}} \exp\left\{-\frac{\lambda}{2}(x_{i}-\mu)^{2}\right\} \\ & = & \lambda^{\frac{n}{2}} \exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2}\right\}. \end{eqnarray*}\] \(\mu \sim N(\mu_{0}, 1/\tau)\) so that \[\begin{eqnarray*} f(\mu) & \propto & \exp\left\{- \frac{\tau}{2}(\mu - \mu_{0})^{2}\right\} \\ & \propto & \exp\left\{- \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu\right\}. \end{eqnarray*}\] Finally, as \(\lambda \sim Gamma(\alpha, \beta)\) then \[\begin{eqnarray*} f(\lambda) & \propto & \lambda^{\alpha -1} \exp\left\{-\beta \lambda\right\}. \end{eqnarray*}\] Hence, as \(\mu\) and \(\lambda\) are independent, \[\begin{eqnarray*} f(\mu, \lambda \, | \, x) & \propto & f(x \, | \, \mu, \lambda)f(\mu)f(\lambda) \\ & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2} - \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu - \beta \lambda\right\}. \end{eqnarray*}\]

  2. Hence show that \[\begin{eqnarray*} \lambda \, | \, \mu, x \sim \mbox{Gamma}\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_{i} - \mu)^{2}\right). \end{eqnarray*}\]

We have \[\begin{eqnarray*} f(\lambda \, | \, \mu, x) & = & \frac{f(\lambda, \mu \, | \, x)}{f(\mu \, | \, x)} \\ & \propto & f(\lambda, \mu \, | \, x) \end{eqnarray*}\] where the proportionality is with respect to \(\lambda\). So, from part 1 we have \[\begin{eqnarray*} f(\lambda \, | \, \mu, x) & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2} - \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu - \beta \lambda\right\} \\ & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\lambda \left(\beta + \frac{1}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2}\right)\right\} \end{eqnarray*}\] which is a kernel of a \(Gamma\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_{i} - \mu)^{2}\right)\).

  3. Given that \(\mu \, | \, \lambda, x \sim N(\frac{\tau\mu_{0} + n\lambda \bar{x}}{\tau + n \lambda}, \frac{1}{\tau + n \lambda})\), where \(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}\), describe how the Gibbs sampler may be used to sample from the posterior distribution \(\mu, \lambda \, | \, x\). Give a sensible estimate of \(Var(\lambda \, | \, x)\).

The Gibbs sampler requires the conditional distributions \(\lambda \, | \, \mu, x\) and \(\mu \, | \, \lambda, x\) which we have. The algorithm is

  1. Choose a starting value \((\mu^{(0)}, \lambda^{(0)})\) for which \(f(\mu^{(0)}, \lambda^{(0)} \, | \, x) > 0\).

  2. At iteration \(t\) generate the new values \((\mu^{(t)}, \lambda^{(t)})\) as follows:

  • draw \(\mu^{(t)}\) from \(N(\frac{\tau\mu_{0} + n\lambda^{(t-1)} \bar{x}}{\tau + n \lambda^{(t-1)}}, \frac{1}{\tau + n \lambda^{(t-1)}})\), the distribution of \(\mu^{(t)} \, | \, \lambda^{(t-1)}, x\).
  • draw \(\lambda^{(t)}\) from \(Gamma\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_{i} - \mu^{(t)})^{2}\right)\), the distribution of \(\lambda^{(t)} \, | \, \mu^{(t)}, x\).
  3. Repeat step 2.

The algorithm will produce a Markov Chain with stationary distribution \(f(\mu, \lambda \, | \, x)\). After a sufficiently long time to allow for convergence, the values \(\{\mu^{(t)}, \lambda^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\mu, \lambda \, | \, x)\) where the samples \(\{\mu^{(t)}, \lambda^{(t)}\}\) for \(t \leq b\) are the “burn-in”.

Ergodic averages converge to the desired expectations under the target distribution so that, for example, \[\begin{eqnarray*} \frac{1}{N} \sum_{t=1}^{N} g(\lambda^{(t)}) & \rightarrow & E(g(\lambda) \, | \, X) \end{eqnarray*}\] as \(N \rightarrow \infty\). As this includes the observations prior to convergence, we will do better to estimate only using the observations after the “burn-in”. Noting that \(Var(\lambda \, | \, X)\) corresponds to \(g(\lambda) = (\lambda - E(\lambda \, | \, X))^{2}\) then any sensible sample variance will suffice. For example, we could estimate \(Var(\lambda \, | \, X) = E(\lambda^{2} \, | \, X) - E^{2}(\lambda \, | \, X)\) by \[\begin{eqnarray*} \frac{1}{N-b} \sum_{t=b+1}^{N} (\lambda^{(t)} - \bar{\lambda})^{2} & = & \left(\frac{1}{N-b} \sum_{t=b+1}^{N} (\lambda^{(t)})^{2}\right) -\bar{\lambda}^{2} \end{eqnarray*}\] where our sample is \(\{\mu^{(t)}, \lambda^{(t)} : t = b+1, \ldots, N\}\) and \(\bar{\lambda} = \frac{1}{N-b} \sum_{t=b+1}^{N} \lambda^{(t)}\), the sample mean.
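A minimal Python sketch of this Gibbs sampler and the variance estimate, with placeholder hyperparameters and simulated data; note that numpy parametrises the gamma distribution by shape and scale, so the rate parameter must be inverted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder hyperparameters and simulated data.
mu0, tau, alpha, beta = 0.0, 1.0, 2.0, 2.0
n, mu_true, lam_true = 100, 1.5, 4.0
x = rng.normal(mu_true, 1.0 / np.sqrt(lam_true), size=n)
xbar = x.mean()

N, b = 10_000, 1_000
mu = np.empty(N + 1)
lam = np.empty(N + 1)
mu[0], lam[0] = 0.0, 1.0                 # any starting point with lam > 0

for t in range(1, N + 1):
    # mu | lam, x ~ N((tau*mu0 + n*lam*xbar)/(tau + n*lam), 1/(tau + n*lam))
    prec = tau + n * lam[t - 1]
    mu[t] = rng.normal((tau * mu0 + n * lam[t - 1] * xbar) / prec, np.sqrt(1.0 / prec))
    # lam | mu, x ~ Gamma(alpha + n/2, beta + sum((x - mu)^2)/2)  (rate parametrisation)
    rate = beta + 0.5 * np.sum((x - mu[t]) ** 2)
    lam[t] = rng.gamma(alpha + n / 2, scale=1.0 / rate)

lam_sample = lam[b + 1:]                 # discard the burn-in
print(lam_sample.var())                  # 1/(N-b) sample variance: estimate of Var(lambda | x)
```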

Question 4

Consider a Poisson hierarchical model. At the first stage we have observations \(s_{j}\) which are Poisson with mean \(t_{j}\lambda_{j}\) for \(j = 1, \ldots, p\) where each \(t_{j}\) is known. We assume that the \(\lambda_{j}\) are independent and identically distributed with \(\mbox{Gamma}(\alpha, \beta)\) prior distributions. The parameter \(\alpha\) is known but \(\beta\) is unknown and is given a \(\mbox{Gamma}(\gamma, \delta)\) distribution where \(\gamma\) and \(\delta\) are known. The \(s_{j}\) are assumed to be conditionally independent given the unknown parameters.

  1. Find, up to a constant of integration, the joint posterior distribution of the unknown parameters given \(s = (s_{1}, \ldots, s_{p})\).

Let \(\lambda = (\lambda_{1}, \ldots, \lambda_{p})\) then the unknown parameters are \(\theta = (\lambda_{1}, \ldots, \lambda_{p}, \beta) = (\lambda, \beta)\). The prior is \[\begin{eqnarray*} f(\lambda, \beta) & = & f(\lambda \, | \, \beta)f(\beta) \\ & = & \left\{\prod_{j=1}^{p} f(\lambda_{j} \, | \, \beta)\right\}f(\beta) \\ & = & \left\{\prod_{j=1}^{p} \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda_{j}^{\alpha-1}e^{-\beta\lambda_{j}}\right\} \times \frac{\delta^{\gamma}}{\Gamma(\gamma)} \beta^{\gamma-1}e^{-\delta \beta} \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{\alpha-1}\right)\beta^{(p\alpha+\gamma)-1}\exp\left\{-\beta\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right)\right\}. \end{eqnarray*}\] The likelihood is \[\begin{eqnarray*} f(s \, | \, \lambda, \beta) & = & \prod_{j=1}^{p} \frac{(t_{j}\lambda_{j})^{s_{j}}e^{-t_{j}\lambda_{j}}}{s_{j}!} \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{s_{j}}\right) \exp\left(-\sum_{j=1}^{p}t_{j}\lambda_{j} \right). \end{eqnarray*}\] The posterior is thus \[\begin{eqnarray*} f(\lambda, \beta \, | \, s) & \propto & f(s \, | \, \lambda, \beta)f(\lambda, \beta) \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{\alpha+s_{j}-1}\right)\beta^{(p\alpha+\gamma)-1}\exp\left\{-\beta\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right) -\sum_{j=1}^{p}t_{j}\lambda_{j} \right\}. \end{eqnarray*}\]

  2. Describe how the Gibbs sampler may be used to sample from the posterior distribution, deriving all required conditional distributions.

For each \(j\), let \(\lambda_{-j} = \lambda \setminus \lambda_{j}\). For the Gibbs sampler we need to find the distributions of \(\lambda_{j} \, | \, \lambda_{-j}, \beta, s\), \(j = 1, \ldots, p\), and \(\beta \, | \, \lambda, s\). Now, \[\begin{eqnarray*} f(\lambda_{j} \, | \, \lambda_{-j}, \beta, s) & = & \frac{f(\lambda, \beta \, | \, s)}{f(\lambda_{-j}, \beta \, | \, s)} \\ & \propto & f(\lambda, \beta \, | \, s) \ \ \ \ \mbox{(taken with respect to $\lambda_{j}$)} \\ & = & \lambda_{j}^{\alpha+s_{j}-1} \exp\left\{-(\beta+t_{j})\lambda_{j}\right\} \end{eqnarray*}\] which is a kernel of a \(Gamma(\alpha+s_{j}, \beta+t_{j})\) distribution. Hence, \(\lambda_{j} \, | \, \lambda_{-j}, \beta, s \sim Gamma(\alpha+s_{j}, \beta+t_{j})\). \[\begin{eqnarray*} f(\beta \, | \, \lambda, s) & = & \frac{f(\lambda, \beta \, | \, s)}{f(\lambda \, | \, s)} \\ & \propto & f(\lambda, \beta \, | \, s) \ \ \ \ \mbox{(taken with respect to $\beta$)} \\ & = & \beta^{(p\alpha+\gamma)-1}\exp\left\{-\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right)\beta\right\} \end{eqnarray*}\] which is a kernel of a \(Gamma(p\alpha+\gamma,\delta+\sum_{j=1}^{p}\lambda_{j})\) distribution. So, \(\beta \, | \, \lambda, s \sim Gamma(p\alpha+\gamma,\delta+\sum_{j=1}^{p}\lambda_{j})\).

  The Gibbs sampler is

  1. Choose an arbitrary starting point for which \(f(\lambda^{(0)}, \beta^{(0)} \, | \, s) > 0\). As \(\lambda_{j} > 0\) for \(j=1, \ldots, p\) and \(\beta > 0\), any collection of \(p+1\) positive numbers will suffice.

  2. At time \(t\)

  • Obtain \(\lambda_{1}^{(t)}\) by sampling from the distribution of \(\lambda_{1} \, | \, \lambda_{-1}^{(t-1)}, \beta^{(t-1)}, s\) which is \(Gamma(\alpha + s_{1}, \beta^{(t-1)} + t_{1})\).

  • \(\vdots\)

  • Obtain \(\lambda_{j}^{(t)}\) by sampling from the distribution of \(\lambda_{j} \, | \, \lambda_{1}^{(t)}, \ldots, \lambda_{j-1}^{(t)}, \lambda_{j+1}^{(t-1)}\), \(\ldots, \lambda_{p}^{(t-1)}, \beta^{(t-1)}, s\) which is \(Gamma(\alpha + s_{j}, \beta^{(t-1)} + t_{j})\).

  • \(\vdots\)

  • Obtain \(\lambda_{p}^{(t)}\) by sampling from the distribution of \(\lambda_{p} \, | \, \lambda_{-p}^{(t)}, \beta^{(t-1)}, s\) which is \(Gamma(\alpha + s_{p}, \beta^{(t-1)} + t_{p})\).

  • Obtain \(\beta^{(t)}\) by sampling from the distribution of \(\beta \, | \, \lambda^{(t)}, s\) which is \(Gamma(p\alpha + \gamma, \delta+\sum_{j=1}^{p} \lambda_{j}^{(t)})\).

  3. Repeat step 2.

The algorithm is run until convergence is reached. Once convergence has been reached, the resulting values \((\lambda^{(t)}, \beta^{(t)})\) are realisations from the posterior \(\lambda, \beta \, | \, s\).
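A minimal Python sketch of this Gibbs sampler, with placeholder values for \(p\), \(\alpha\), \(\gamma\), \(\delta\), the \(t_{j}\) and simulated counts \(s_{j}\); since the full conditional of each \(\lambda_{j}\) does not depend on \(\lambda_{-j}\), the \(p\) updates of step 2 can be drawn in a single vectorised call.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder hyperparameters, exposure times and simulated counts.
p, alpha, gamma, delta = 10, 2.0, 1.0, 1.0
t_j = rng.uniform(1.0, 5.0, size=p)
lam_true = rng.gamma(alpha, scale=1.0, size=p)   # illustrative rates used only to create data
s = rng.poisson(t_j * lam_true)

N, B = 10_000, 1_000
lam = np.empty((N + 1, p))
beta = np.empty(N + 1)
lam[0], beta[0] = 1.0, 1.0                       # any positive starting values

for t in range(1, N + 1):
    # lambda_j | beta, s ~ Gamma(alpha + s_j, beta + t_j), rate parametrisation;
    # independent across j given beta, so drawn in one vectorised call.
    lam[t] = rng.gamma(alpha + s, scale=1.0 / (beta[t - 1] + t_j))
    # beta | lambda, s ~ Gamma(p*alpha + gamma, delta + sum(lambda_j))
    beta[t] = rng.gamma(p * alpha + gamma, scale=1.0 / (delta + lam[t].sum()))
```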

  3. Let \(\{\lambda_{1}^{(t)}, \ldots, \lambda_{p}^{(t)}, \beta^{(t)}; t = 1, \ldots, N\}\), with \(N\) large, be a realisation of the Gibbs sampler described above. Give sensible estimates of \(E(\lambda_{j} \, | \, s)\), \(Var(\beta \, | \, s)\) and \(E(\lambda_{j} \, | \, a \leq \beta \leq b, s)\) where \(0 < a < b\) are given constants.

We first remove the burn-in samples. Suppose we choose a burn-in time \(B\), with \(B < N\), such that the Markov chain is judged to have converged for \(t > B\). Then we discard the observations \(\{\lambda^{(t)}, \beta^{(t)}; t \leq B\}\) and use the remaining sample \(\{\lambda^{(t)}, \beta^{(t)}; t = B+1, \ldots, N\}\) to estimate posterior summaries. Here \(\lambda^{(t)} = (\lambda_{1}^{(t)}, \ldots, \lambda_{p}^{(t)})\).

The posterior mean of \(\lambda_{j} \, | \, s\) can be estimated by the sample mean of \(\{\lambda_{j}^{(t)}, t = B+1, \ldots, N\}\) so that our estimate of \(E(\lambda_{j} \, | \, s)\) is \(\frac{1}{N-B} \sum_{t = B+1}^{N} \lambda_{j}^{(t)}\).

The posterior variance of \(\beta \, | \, s\) can be estimated by the sample variance of \(\{\beta^{(t)}, t = B+1, \ldots, N\}\) so that our estimate of \(Var(\beta \, | \, s)\) is \(\frac{1}{N-B} \sum_{t = B+1}^{N} (\beta^{(t)}-\bar{\beta})^{2}\) \(= (\frac{1}{N-B} \sum_{t = B+1}^{N} (\beta^{(t)})^{2}) - \bar{\beta}^{2}\) where \(\bar{\beta} = \frac{1}{N-B} \sum_{t = B+1}^{N} \beta^{(t)}\).

For estimating \(E(\lambda_{j} \, | \, a \leq \beta \leq b, s)\) we not only discard the burn-in observations but also those elements \(\lambda_{j}^{(t)}\) for which \(\beta^{(t)} < a\) or \(\beta^{(t)} > b\) for each \(t = B+1, \ldots, N\). So, letting \(\mathbb{I}_{[a \leq \beta^{(t)} \leq b]}\) denote the indicator function which is one if \(a \leq \beta^{(t)} \leq b\), the number of observations after the burn-in with \(a \leq \beta^{(t)} \leq b\) is \(M = \sum_{t = B+1}^{N} \mathbb{I}_{[a \leq \beta^{(t)} \leq b]}\) and our estimate of \(E(\lambda_{j} \, | \, a \leq \beta \leq b, s)\) is \(\frac{1}{M} \sum_{t=B+1}^{N} \lambda_{j}^{(t)}\mathbb{I}_{[a \leq \beta^{(t)} \leq b]}\).
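Assuming Gibbs output stored as arrays `lam` (of shape \((N+1) \times p\), with row 0 the starting value) and `beta` (of length \(N+1\)), such as those produced by the sketch above, the three estimates can be computed as follows; the function name and arguments are illustrative only.

```python
import numpy as np

def posterior_summaries(lam, beta, B, j, a, b):
    """Estimates from Gibbs output: lam has shape (N+1, p), beta has shape (N+1,)."""
    lam_post = lam[B + 1:]                        # discard the burn-in
    beta_post = beta[B + 1:]

    E_lam_j = lam_post[:, j].mean()               # estimate of E(lambda_j | s)
    Var_beta = beta_post.var()                    # estimate of Var(beta | s), 1/(N-B) version

    keep = (beta_post >= a) & (beta_post <= b)    # indicator of a <= beta^(t) <= b
    E_lam_j_cond = lam_post[keep, j].mean()       # estimate of E(lambda_j | a <= beta <= b, s)
    return E_lam_j, Var_beta, E_lam_j_cond
```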

Question 5

Show that the Gibbs sampler for sampling from a distribution \(\pi(\theta)\) where \(\theta = (\theta_{1}, \ldots, \theta_{d})\) can be viewed as a special case of the Metropolis-Hastings algorithm where each iteration \(t\) consists of \(d\) Metropolis-Hastings steps each with an acceptance probability of 1.

For the Gibbs sampler, we update each \(\theta_{p}\) one at a time by obtaining \(\theta^{(t)}_{p}\) from the conditional distribution \(\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\). Thus, we move from \(\theta_{p}^{(t-1)}\) to \(\theta_{p}^{(t)}\) leaving the remaining \(d-1\) parameters fixed.

We now demonstrate that this update step in the Gibbs sampler is equivalent to a single Metropolis-Hastings step with acceptance probability 1. Let \(\theta^{(t-1)}_{[p]} = (\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)},\) \(\theta_{p}^{(t-1)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\) and \(\theta^{(t)}_{[p]} = (\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\). Consider the move from \(\theta^{(t-1)}_{[p]}\) to \(\theta^{(t)}_{[p]}\). As \(\theta^{(t)}_{[p]}\) is obtained by drawing \(\theta_{p}^{(t)}\) from \(\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\) then we shall use this for the proposal distribution in the Metropolis-Hastings algorithm so that \[\begin{eqnarray} q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]}) & = & \pi(\theta_{p}^{(t)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}). \tag{1} \end{eqnarray}\] If, instead, we consider the move from \(\theta^{(t)}_{[p]}\) to \(\theta^{(t-1)}_{[p]}\) then, using Gibbs sampling, \(\theta^{(t-1)}_{[p]}\) is obtained by drawing \(\theta_{p}^{(t-1)}\) from \(\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\). Thus, we set \[\begin{eqnarray} q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]}) & = & \pi(\theta_{p}^{(t-1)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}). \tag{2} \end{eqnarray}\] Now, \[\begin{eqnarray} \pi(\theta^{(t-1)}_{[p]} ) \ = \ \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t-1)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) & = & \pi(\theta_{p}^{(t-1)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{3} \\ & = & q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]}) \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{4} \end{eqnarray}\] where (4) follows by substituting (2) into (3). In an identical fashion, \[\begin{eqnarray} \pi(\theta^{(t)}_{[p]}) \ = \ \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) & = & \pi(\theta_{p}^{(t)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{5} \\ & = & q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]}) \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{6} \end{eqnarray}\] where (6) follows by substituting (1) into (5). Dividing (6) by (4) gives \[\begin{eqnarray*} \frac{\pi(\theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})} & = & \frac{q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})}{q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})} \end{eqnarray*}\] so that \[\begin{eqnarray} \frac{\pi(\theta^{(t)}_{[p]})q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})} & = & 1. \tag{7} \end{eqnarray}\] Thus, a Metropolis-Hastings algorithm for sampling from \(\pi(\theta)\), with a proposed move from \(\theta^{(t-1)}_{[p]}\) to \(\theta^{(t)}_{[p]}\) using the proposal distribution given by (1), has acceptance probability \[\begin{eqnarray} \alpha(\theta^{(t-1)}_{[p]}, \theta^{(t)}_{[p]}) & = & \min \left(1, \frac{\pi(\theta^{(t)}_{[p]})q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})} \right) \tag{8} \\ & = & 1 \tag{9} \end{eqnarray}\] where (9) follows from (8) using (7).
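The result can also be checked numerically: for any target with tractable full conditionals, the Metropolis-Hastings ratio of a Gibbs update equals 1. A minimal Python sketch using a bivariate normal target (an illustrative choice, not part of the question):

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)

# Illustrative bivariate normal target pi(theta_1, theta_2).
mu = np.array([1.0, -2.0])
S = np.array([[2.0, 0.8], [0.8, 1.0]])
pi = multivariate_normal(mu, S)

def cond1(theta2):
    """Full conditional of theta_1 given theta_2 = theta2 under the target."""
    mean = mu[0] + S[0, 1] / S[1, 1] * (theta2 - mu[1])
    var = S[0, 0] - S[0, 1] ** 2 / S[1, 1]
    return norm(mean, np.sqrt(var))

# Current state, and a Gibbs-style proposal updating only the first component.
theta_old = np.array([0.3, -1.5])
q = cond1(theta_old[1])                      # proposal = full conditional given theta_2
theta_new = np.array([rng.normal(q.mean(), q.std()), theta_old[1]])

# Metropolis-Hastings ratio pi(new) q(old | new) / (pi(old) q(new | old)).
ratio = (pi.pdf(theta_new) * q.pdf(theta_old[0])) / (pi.pdf(theta_old) * q.pdf(theta_new[0]))
print(ratio)                                 # 1.0 up to floating-point rounding
```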