Solution Sheet Eight (Additional Questions)

Question 1

Let \(X_{1}, \ldots, X_{n}\) be exchangeable so that the \(X_{i}\) are conditionally independent given a parameter \(\theta\). Suppose that \(X_{i} \, | \, \theta \sim \mbox{Inv-gamma}(\alpha, \theta)\), where \(\alpha\) is known, and we judge that \(\theta \sim \mbox{Gamma}(\alpha_{0}, \beta_{0})\), where \(\alpha_{0}\) and \(\beta_{0}\) are known.

  1. Show that \(\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})\) where \(\alpha_{n} = \alpha_{0} + n \alpha\), \(\beta_{n} = \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}\), and \(x = (x_{1}, \ldots, x_{n})\).

As \(X_{i} \, | \, \theta \sim \mbox{Inv-gamma}(\alpha, \theta)\) then \[\begin{eqnarray*} f(x \, | \, \theta) & = & \prod_{i=1}^{n} \frac{\theta^{\alpha}}{\Gamma(\alpha)} x_{i}^{-(\alpha+1)} \exp \left(-\frac{\theta}{x_{i}}\right) \\ & \propto & \theta^{n\alpha} \exp \left(-\theta \sum_{i=1}^{n}\frac{1}{x_{i}}\right). \end{eqnarray*}\] Similarly, as \(\theta \sim \mbox{Gamma}(\alpha_{0}, \beta_{0})\), \[\begin{eqnarray*} f(\theta) & \propto & \theta^{\alpha_{0}-1}\exp \left(-\theta \beta_{0}\right). \end{eqnarray*}\] Hence, \[\begin{eqnarray*} f(\theta \, | \, x) & \propto & f(x \, | \, \theta)f(\theta) \\ & \propto & \theta^{(\alpha_{0}+n\alpha)-1}\exp \left\{-\theta\left(\beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}\right)\right\} \end{eqnarray*}\] which we recognise as a kernel of a \(\mbox{Gamma}(\alpha_{0} + n \alpha, \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}})\) density. Thus, \(\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})\) as required.

  2. We wish to use the Metropolis-Hastings algorithm to sample from the posterior distribution \(\theta \, | \, x\) using a normal distribution with mean \(\theta\) and chosen variance \(\sigma^{2}\) as the symmetric proposal distribution.

    1. Suppose that, at time \(t\), the proposed value \(\theta^{*} \leq 0\). Briefly explain why the corresponding acceptance probability is zero for such a \(\theta^{*}\) and thus that the sequence of values generated by the algorithm are never less than zero.

As the proposal distribution is symmetric then the Metropolis-Hastings algorithm reduces to the Metropolis algorithm with acceptance probability \[\begin{eqnarray*} \alpha(\theta^{(t-1)}, \theta^{*}) = \min \left(1, \frac{f(\theta^{*} \, | \, x)}{f(\theta^{(t-1)} \, | \, x)}\right) \end{eqnarray*}\] where \(\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})\). Now, assuming \(\theta^{(t-1)} > 0\), if \(\theta^{*} \leq 0\) then \(f(\theta^{*} \, | \, x) = 0\) and \(\alpha(\theta^{(t-1)}, \theta^{*}) = 0\) so the move to \(\theta^{*} \leq 0\) is rejected and \(\theta^{(t)} = \theta^{(t-1)}\). So, provided \(\theta^{(0)} > 0\), the sequence of values generated by the algorithm is never less than zero.

    2. Describe how the Metropolis-Hastings algorithm works for this example, giving the acceptance probability in its simplest form.

The algorithm is performed as follows.

  1. Choose a starting point \(\theta^{(0)}\) for which \(f(\theta^{(0)} \, | \, x) > 0\). This will hold for any \(\theta^{(0)} > 0\) as \(\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})\).

  2. At time \(t\)

  • Sample \(\theta^{*} \sim N(\theta^{(t-1)}, \sigma^{2})\).

  • Calculate the acceptance probability \[\begin{eqnarray*} \alpha(\theta^{(t-1)}, \theta^{*}) & = & \min \left(1, \frac{f(\theta^{*} \, | \, x)}{f(\theta^{(t-1)} \, | \, x)}\right) \\ & = & \left\{\begin{array}{cl} \min\left(1, \frac{(\theta^{*})^{\alpha_{n}-1}\exp(-\beta_{n}\theta^{*})}{(\theta^{(t-1)})^{\alpha_{n}-1}\exp(-\beta_{n}\theta^{(t-1)})}\right) & \theta^{*} > 0 \\ 0 & \theta^{*} \leq 0 \end{array}\right.\\ & = & \left\{\begin{array}{cl} \min\left(1, \left(\frac{\theta^{*}}{\theta^{(t-1)}}\right)^{\alpha_{n} - 1} \exp\left\{-\beta_{n}\left(\theta^{*} - \theta^{(t-1)}\right)\right\}\right) & \theta^{*} > 0 \\ 0 & \theta^{*} \leq 0 \end{array}\right. \end{eqnarray*}\] where \(\alpha_{n} = \alpha_{0} + n \alpha\), \(\beta_{n} = \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}\).

  • Generate \(U \sim U(0, 1)\).

  • If \(U \leq \alpha(\theta^{(t-1)}, \theta^{*})\) accept the move, \(\theta^{(t)} = \theta^{*}\). Otherwise reject the move, \(\theta^{(t)}= \theta^{(t-1)}\).

  3. Repeat step 2.

The algorithm will produce a Markov Chain with stationary distribution \(f(\theta \, | \, x)\). After a sufficiently long time, \(b\) say, to allow for convergence, the values \(\{\theta^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\theta \, | \, x)\), that is a sample from \(\mbox{Gamma}(\alpha_{n}, \beta_{n})\), where the samples \(\{\theta^{(t)} \}\) for \(t \leq b\) are the “burn-in”.
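The steps above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the constants \(\alpha\), \(\alpha_{0}\), \(\beta_{0}\), the data \(x\), the proposal standard deviation \(\sigma\) and the burn-in length are all hypothetical choices made for the example, and the acceptance step is carried out on the log scale to avoid underflow of the Gamma kernel.

```python
# A minimal sketch of the Metropolis algorithm above; alpha, alpha0, beta0,
# the data x and the proposal standard deviation sigma are illustrative
# assumptions, not values from the question.
import math
import random

random.seed(1)

alpha, alpha0, beta0 = 2.0, 3.0, 1.0     # assumed known constants
x = [0.5, 1.2, 0.8, 2.1, 0.9]            # hypothetical observations
n = len(x)

alpha_n = alpha0 + n * alpha
beta_n = beta0 + sum(1.0 / xi for xi in x)

def log_post(theta):
    # Log of the Gamma(alpha_n, beta_n) kernel; -inf enforces theta > 0.
    if theta <= 0.0:
        return -math.inf
    return (alpha_n - 1.0) * math.log(theta) - beta_n * theta

def metropolis(n_iter, theta0=1.0, sigma=0.5):
    theta, chain = theta0, []
    for _ in range(n_iter):
        theta_star = random.gauss(theta, sigma)   # symmetric normal proposal
        # Accept with probability min(1, f(theta* | x) / f(theta | x)),
        # which is zero whenever theta* <= 0.
        if math.log(random.random()) < log_post(theta_star) - log_post(theta):
            theta = theta_star
        chain.append(theta)
    return chain

chain = metropolis(5000)
b = 1000                                  # burn-in
sample = chain[b:]
print(sum(sample) / len(sample))          # compare with alpha_n / beta_n
```

Since proposals \(\theta^{*} \leq 0\) are always rejected, every value stored in `chain` is strictly positive, matching part (a).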

Question 2

Suppose that \(X \, | \, \theta \sim N(\theta, \sigma^{2})\) and \(Y \, | \, \theta, \delta \sim N(\theta - \delta, \sigma^{2})\), where \(\sigma^{2}\) is a known constant and \(X\) and \(Y\) are conditionally independent given \(\theta\) and \(\delta\). It is judged that the improper noninformative joint prior distribution \(f(\theta, \delta) \propto 1\) is appropriate.

  1. Show that the joint posterior distribution of \(\theta\) and \(\delta\) given \(x\) and \(y\) is bivariate normal with mean vector \(\mu\) and variance matrix \(\Sigma\) where \[\begin{eqnarray*} \mu \ = \ \left(\begin{array}{c} E(\theta \, | \, X, Y) \\ E(\delta \, | \, X, Y) \end{array} \right) \ = \ \left(\begin{array}{c} x \\ x-y \end{array} \right); \ \ \Sigma \ = \ \left(\begin{array}{cc} \sigma^{2} & \sigma^{2} \\ \sigma^{2} & 2\sigma^{2} \end{array} \right). \end{eqnarray*}\]

As \(X \, | \, \theta \sim N(\theta, \sigma^{2})\) then \[\begin{eqnarray*} f(x \, | \, \theta) & \propto & \exp\left\{-\frac{1}{2\sigma^{2}} (\theta^{2} - 2x\theta)\right\}. \end{eqnarray*}\] As \(Y \, | \, \theta, \delta \sim N(\theta - \delta, \sigma^{2})\) then \[\begin{eqnarray*} f(y \, | \, \theta, \delta) & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[(\theta-\delta)^{2} - 2y(\theta - \delta)]\right\}. \end{eqnarray*}\] Now, by conditional independence, \[\begin{eqnarray*} f(x, y \, | \, \theta, \delta) & = & f(x \, | \, \theta)f(y \, | \, \theta, \delta) \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[\theta^{2} - 2x\theta + (\theta-\delta)^{2} - 2y(\theta - \delta)]\right\}. \end{eqnarray*}\] So, \[\begin{eqnarray*} f(\theta, \delta \, | \, x, y) & \propto & f(x, y \, | \, \theta, \delta)f(\theta, \delta) \\ & \propto & f(x, y \, | \, \theta, \delta) \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[\theta^{2} - 2x\theta + (\theta-\delta)^{2} - 2y(\theta - \delta)]\right\} \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} -2(x+y)\theta - 2\theta\delta + \delta^{2} + 2y\delta]\right\}. \end{eqnarray*}\] The exponential argument is quadratic in \(\theta\), \(\delta\) so \(\theta, \delta \, | \, x, y\) is bivariate normal. We can find the exact normal by completing the square or, in this case, using the information given in the question. 
We have, \[\begin{eqnarray*} \Sigma^{-1} & = & \frac{1}{\sigma^{2}}\left(\begin{array}{cc} 2 & -1 \\ -1 & 1 \end{array} \right) \end{eqnarray*}\] so that the density of the \(N_{2}(\mu, \Sigma)\) is \[\begin{eqnarray*} \exp\left\{-\frac{1}{2} \left(\theta - x \ \ \delta - (x-y)\right) \frac{1}{\sigma^{2}}\left(\begin{array}{cc} 2 & -1 \\ -1 & 1 \end{array} \right)\left(\begin{array}{c} \theta - x \\ \delta - (x-y) \end{array}\right) \right\} & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(2x-(x-y))\theta - 2\theta \delta + \delta^{2} - 2((x-y)-x)\delta]\right\} \\ & = & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(x+y)\theta - 2\theta \delta + \delta^{2} + 2y\delta]\right\} \\ & \propto & f(\theta, \delta \, | \, x, y). \end{eqnarray*}\]

  2. Describe how the Gibbs sampler may be used to sample from the posterior distribution \(\theta, \delta \, | \, x , y\), deriving all required conditional distributions.

The Gibbs sampler requires the conditional distributions \(\theta \, | \, \delta, x, y\) and \(\delta \, | \, \theta, x, y\). Now, with proportionality with respect to \(\theta\), \[\begin{eqnarray*} f(\theta \, | \, \delta, x, y) & \propto & f(\theta, \delta \, | \, x, y) \\ & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(x+y+\delta)\theta]\right\} \\ & = & \exp\left\{-\frac{1}{2}\left(\frac{2}{\sigma^{2}}\right)\left[\theta^{2} - 2\left(\frac{x+y+\delta}{2}\right)\theta\right]\right\} \end{eqnarray*}\] which is a kernel of a normal density. Thus, \(\theta \, | \, \delta, x, y \sim N(\frac{x+y+\delta}{2}, \frac{\sigma^{2}}{2})\). Now, with proportionality with respect to \(\delta\), \[\begin{eqnarray*} f(\delta \, | \, \theta, x, y) & \propto & f(\theta, \delta \, | \, x, y) \\ & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[\delta^{2} - 2(\theta-y)\delta]\right\} \end{eqnarray*}\] which is a kernel of a normal density. Thus, \(\delta \, | \, \theta, x, y \sim N(\theta - y, \sigma^{2})\). The Gibbs sampler algorithm is

  1. Choose a starting value \((\theta^{(0)}, \delta^{(0)})\) for which \(f(\theta^{(0)}, \delta^{(0)} \, | \, x, y) > 0\).
  2. At iteration \(t\) generate the new values \((\theta^{(t)}, \delta^{(t)})\) as follows:
  • draw \(\theta^{(t)}\) from \(N(\frac{x+y+\delta^{(t-1)}}{2}, \frac{\sigma^{2}}{2})\), the distribution of \(\theta^{(t)} \, | \, \delta^{(t-1)}, x, y\).

  • draw \(\delta^{(t)}\) from \(N(\theta^{(t)}-y, \sigma^{2})\), the distribution of \(\delta^{(t)} \, | \, \theta^{(t)}, x, y\).

  3. Repeat step 2.

The algorithm produces a Markov Chain with stationary distribution \(f(\theta, \delta \, | \, x, y)\). After a sufficiently long time to allow for convergence, the values \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\theta, \delta \, | \, x, y)\) where the samples \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t \leq b\) are the “burn-in”.
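The two-stage update above can be sketched in Python. The values of \(x\), \(y\) and \(\sigma^{2}\), the chain length and the burn-in are hypothetical choices for illustration only.

```python
# A sketch of the Gibbs sampler above; x, y and sigma2 are illustrative.
import random

random.seed(2)

x, y, sigma2 = 2.0, 1.0, 1.0

def gibbs(n_iter, theta0=0.0, delta0=0.0):
    theta, delta, chain = theta0, delta0, []
    for _ in range(n_iter):
        # theta | delta, x, y ~ N((x + y + delta)/2, sigma2/2)
        theta = random.gauss((x + y + delta) / 2.0, (sigma2 / 2.0) ** 0.5)
        # delta | theta, x, y ~ N(theta - y, sigma2)
        delta = random.gauss(theta - y, sigma2 ** 0.5)
        chain.append((theta, delta))
    return chain

chain = gibbs(10000)
post = chain[2000:]                 # discard the burn-in
mean_theta = sum(t for t, _ in post) / len(post)
mean_delta = sum(d for _, d in post) / len(post)
print(mean_theta, mean_delta)       # close to E(theta | x, y) = x
                                    # and E(delta | x, y) = x - y
```

Note that each iteration draws \(\delta^{(t)}\) conditional on the freshly updated \(\theta^{(t)}\), not on \(\theta^{(t-1)}\), exactly as in step 2 above.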

  3. Suppose that \(x=2\), \(y=1\) and \(\sigma^{2} = 1\). Sketch the contours of the joint posterior distribution. Starting from the origin, add to your sketch the first four steps of a typical Gibbs sampler path.

For \(x=2\), \(y=1\) and \(\sigma^{2} = 1\) we shall use the Gibbs sampler to sample from \(N_{2}(\mu, \Sigma)\) where \(\mu = \left(\begin{array}{c} 2 \\ 1 \end{array}\right)\) and \(\Sigma = \left(\begin{array}{cc} 1 & 1 \\ 1 & 2 \end{array}\right)\). We draw \(\theta^{(t)}\) from \(N(\frac{3+\delta^{(t-1)}}{2}, \frac{1}{2})\) and \(\delta^{(t)}\) from \(N(\theta^{(t)}-1, 1)\). Suppose that we start from \((1,1)\) and sample \(\theta^{(1)} = 2.216715\), \(\delta^{(1)} = 1.475738\), \(\theta^{(2)} = 2.086383\) and \(\delta^{(2)} = 1.617760\). The trajectory of these points is shown in the following figure.


A trajectory path of the Gibbs sampler sampling from \(N_{2}(\mu, \Sigma)\).

  4. Suppose, instead, that we consider sampling from the posterior distribution using the Metropolis-Hastings algorithm where the proposal distribution is the bivariate normal with mean vector \(\tilde{\mu}^{(t-1)} = (\theta^{(t-1)}, \delta^{(t-1)})^{T}\) and known variance matrix \(\tilde{\Sigma}\). Explain the Metropolis-Hastings algorithm for this case, explicitly stating the acceptance probability.

For the proposal we have \[\begin{eqnarray*} q(\theta^{*}, \delta^{*} \, | \, \theta^{(t-1)}, \delta^{(t-1)}) & \propto & \exp\left\{-\frac{1}{2} \left(\theta^{*} - \theta^{(t-1)} \ \ \delta^{*} - \delta^{(t-1)}\right) \tilde{\Sigma}^{-1}\left(\begin{array}{c} \theta^{*} - \theta^{(t-1)} \\ \delta^{*} - \delta^{(t-1)} \end{array}\right) \right\} \end{eqnarray*}\] so that \(q(\theta^{*}, \delta^{*} \, | \, \theta^{(t-1)}, \delta^{(t-1)}) = q(\theta^{(t-1)}, \delta^{(t-1)} \, | \, \theta^{*}, \delta^{*})\). The proposal distribution is symmetric and so the Metropolis-Hastings algorithm reduces to the Metropolis algorithm. The algorithm is performed as follows.

  1. Choose a starting point \((\theta^{(0)}, \delta^{(0)})\) for which \(f(\theta^{(0)}, \delta^{(0)} \, | \, x, y) > 0\).
  2. At time \(t\)
  • Sample \((\theta^{*}, \delta^{*}) \sim N_{2}(\tilde{\mu}^{(t-1)}, \tilde{\Sigma})\).

  • Calculate the acceptance probability \[\begin{eqnarray*} \alpha((\theta^{(t-1)}, \delta^{(t-1)}), (\theta^{*}, \delta^{*})) & = & \min \left(1, \frac{f(\theta^{*}, \delta^{*} \, | \, x, y)}{f(\theta^{(t-1)}, \delta^{(t-1)} \, | \, x, y)}\right) \\ & = & \min \left(1, \frac{\exp\left\{-\frac{1}{2} (\gamma^{*} - \mu)^{T} \Sigma^{-1} (\gamma^{*} - \mu)\right\}}{\exp\left\{-\frac{1}{2} (\gamma^{(t-1)} - \mu)^{T} \Sigma^{-1} (\gamma^{(t-1)} - \mu)\right\}}\right) \end{eqnarray*}\] where \(\gamma^{*} = \left(\begin{array}{c} \theta^{*} \\ \delta^{*} \end{array}\right)\) and \(\gamma^{(t-1)} = \left(\begin{array}{c} \theta^{(t-1)} \\ \delta^{(t-1)} \end{array}\right)\).

  • Generate \(U \sim U(0, 1)\).

  • If \(U \leq \alpha((\theta^{(t-1)}, \delta^{(t-1)}), (\theta^{*}, \delta^{*}))\) accept the move, \((\theta^{(t)}, \delta^{(t)}) = (\theta^{*}, \delta^{*})\). Otherwise reject the move, \((\theta^{(t)}, \delta^{(t)}) = (\theta^{(t-1)}, \delta^{(t-1)})\).

  3. Repeat step 2.

The algorithm produces a Markov Chain with stationary distribution \(f(\theta, \delta \, | \, x, y)\). After a sufficiently long time to allow for convergence, the values \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\theta, \delta \, | \, x, y)\) where the samples \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t \leq b\) are the “burn-in”.
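A Python sketch of this Metropolis scheme follows, using the values \(x = 2\), \(y = 1\), \(\sigma^{2} = 1\) from part (c). As a simplifying assumption for the example, the proposal covariance \(\tilde{\Sigma}\) is taken to be diagonal, \(\tau^{2} I\), so the two components can be proposed with independent normal draws; any other fixed \(\tilde{\Sigma}\) would work equally well.

```python
# A sketch of the Metropolis algorithm above for x = 2, y = 1, sigma2 = 1;
# the proposal covariance tilde_Sigma = tau^2 * I is an illustrative
# assumption, not part of the question.
import math
import random

random.seed(3)

x, y, sigma2 = 2.0, 1.0, 1.0
mu_theta, mu_delta = x, x - y            # posterior mean (x, x - y)

def log_post(theta, delta):
    # Log kernel of N2(mu, Sigma), using
    # Sigma^{-1} = (1/sigma2) [[2, -1], [-1, 1]].
    a, b = theta - mu_theta, delta - mu_delta
    return -(2.0 * a * a - 2.0 * a * b + b * b) / (2.0 * sigma2)

def metropolis(n_iter, theta0=0.0, delta0=0.0, tau=1.0):
    theta, delta, chain = theta0, delta0, []
    for _ in range(n_iter):
        theta_s = random.gauss(theta, tau)    # proposal centred at the
        delta_s = random.gauss(delta, tau)    # current state
        if math.log(random.random()) < log_post(theta_s, delta_s) - log_post(theta, delta):
            theta, delta = theta_s, delta_s
        chain.append((theta, delta))
    return chain

chain = metropolis(20000)
post = chain[5000:]                      # discard the burn-in
mean_theta = sum(t for t, _ in post) / len(post)
mean_delta = sum(d for _, d in post) / len(post)
print(mean_theta, mean_delta)            # close to x = 2 and x - y = 1
```

Unlike the Gibbs sampler, this chain may repeat states when a proposal is rejected, but it targets the same stationary distribution.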

Question 3

Let \(X_{1}, \ldots, X_{n}\) be exchangeable so that the \(X_{i}\) are conditionally independent given a parameter \(\theta = (\mu, \lambda)\). Suppose that \(X_{i} \, | \, \theta \sim N(\mu, 1/\lambda)\) so that \(\mu\) is the mean and \(\lambda\) the precision of the distribution. Suppose that we judge that \(\mu\) and \(\lambda\) are independent with \(\mu \sim N(\mu_{0}, 1/\tau)\), where \(\mu_{0}\) and \(\tau\) are known, and \(\lambda \sim \mbox{Gamma}(\alpha, \beta)\), where \(\alpha\) and \(\beta\) are known.

  1. Show that the posterior density \(f(\mu, \lambda \, | \, x)\), where \(x = (x_{1}, \ldots, x_{n})\), can be expressed as \[\begin{eqnarray*} f(\mu, \lambda \, | \, x) & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2} - \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu - \beta \lambda\right\}. \end{eqnarray*}\]

As \(X_{i} \, | \, \theta \sim N(\mu, 1/\lambda)\) and \(\theta = (\mu, \lambda)\) then \[\begin{eqnarray*} f(x \, | \, \mu, \lambda) & \propto & \prod_{i=1}^{n} \lambda^{\frac{1}{2}} \exp\left\{-\frac{\lambda}{2}(x_{i}-\mu)^{2}\right\} \\ & = & \lambda^{\frac{n}{2}} \exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2}\right\}. \end{eqnarray*}\] \(\mu \sim N(\mu_{0}, 1/\tau)\) so that \[\begin{eqnarray*} f(\mu) & \propto & \exp\left\{- \frac{\tau}{2}(\mu - \mu_{0})^{2}\right\} \\ & \propto & \exp\left\{- \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu\right\}. \end{eqnarray*}\] Finally, as \(\lambda \sim Gamma(\alpha, \beta)\) then \[\begin{eqnarray*} f(\lambda) & \propto & \lambda^{\alpha -1} \exp\left\{-\beta \lambda\right\}. \end{eqnarray*}\] Hence, as \(\mu\) and \(\lambda\) are independent, \[\begin{eqnarray*} f(\mu, \lambda \, | \, x) & \propto & f(x \, | \, \mu, \lambda)f(\mu)f(\lambda) \\ & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2} - \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu - \beta \lambda\right\}. \end{eqnarray*}\]

  2. Hence show that \[\begin{eqnarray*} \lambda \, | \, \mu, x \sim \mbox{Gamma}\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_{i} - \mu)^{2}\right). \end{eqnarray*}\]

We have \[\begin{eqnarray*} f(\lambda \, | \, \mu, x) & = & \frac{f(\lambda, \mu \, | \, x)}{f(\mu \, | \, x)} \\ & \propto & f(\lambda, \mu \, | \, x) \end{eqnarray*}\] where the proportionality is with respect to \(\lambda\). So, from part 1 we have \[\begin{eqnarray*} f(\lambda \, | \, \mu, x) & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2} - \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu - \beta \lambda\right\} \\ & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\lambda \left(\beta + \frac{1}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2}\right)\right\} \end{eqnarray*}\] which is a kernel of a \(Gamma\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_{i} - \mu)^{2}\right)\) density.

  3. Given that \(\mu \, | \, \lambda, x \sim N(\frac{\tau\mu_{0} + n\lambda \bar{x}}{\tau + n \lambda}, \frac{1}{\tau + n \lambda})\), where \(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}\), describe how the Gibbs sampler may be used to sample from the posterior distribution \(\mu, \lambda \, | \, x\). Give a sensible estimate of \(Var(\lambda \, | \, x)\).

The Gibbs sampler requires the conditional distributions \(\lambda \, | \, \mu, x\) and \(\mu \, | \, \lambda, x\) which we have. The algorithm is

  1. Choose a starting value \((\mu^{(0)}, \lambda^{(0)})\) for which \(f(\mu^{(0)}, \lambda^{(0)} \, | \, x) > 0\).

  2. At iteration \(t\) generate the new values \((\mu^{(t)}, \lambda^{(t)})\) as follows:

  • draw \(\mu^{(t)}\) from \(N(\frac{\tau\mu_{0} + n\lambda^{(t-1)} \bar{x}}{\tau + n \lambda^{(t-1)}}, \frac{1}{\tau + n \lambda^{(t-1)}})\), the distribution of \(\mu^{(t)} \, | \, \lambda^{(t-1)}, x\).
  • draw \(\lambda^{(t)}\) from \(Gamma\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_{i} - \mu^{(t)})^{2}\right)\), the distribution of \(\lambda^{(t)} \, | \, \mu^{(t)}, x\).
  3. Repeat step 2.

The algorithm will produce a Markov Chain with stationary distribution \(f(\mu, \lambda \, | \, x)\). After a sufficiently long time to allow for convergence, the values \(\{\mu^{(t)}, \lambda^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\mu, \lambda \, | \, x)\) where the samples \(\{\mu^{(t)}, \lambda^{(t)}\}\) for \(t \leq b\) are the “burn-in”.

Ergodic averages converge to the desired expectations under the target distribution so that, for example, \[\begin{eqnarray*} \frac{1}{N} \sum_{t=1}^{N} g(\lambda^{(t)}) & \rightarrow & E(g(\lambda) \, | \, X) \end{eqnarray*}\] as \(N \rightarrow \infty\). As this includes the observations prior to convergence, we will do better to estimate only using the observations after the “burn-in”. Noting that \(Var(\lambda \, | \, X)\) corresponds to \(g(\lambda) = (\lambda - E(\lambda \, | \, X))^{2}\) then any sensible sample variance will suffice. For example, we could estimate \(Var(\lambda \, | \, X) = E(\lambda^{2} \, | \, X) - E^{2}(\lambda \, | \, X)\) by \[\begin{eqnarray*} \frac{1}{N-b} \sum_{t=b+1}^{N} (\lambda^{(t)} - \bar{\lambda})^{2} & = & \left(\frac{1}{N-b} \sum_{t=b+1}^{N} (\lambda^{(t)})^{2}\right) -\bar{\lambda}^{2} \end{eqnarray*}\] where our sample is \(\{\mu^{(t)}, \lambda^{(t)} : t = b+1, \ldots, N\}\) and \(\bar{\lambda} = \frac{1}{N-b} \sum_{t=b+1}^{N} \lambda^{(t)}\), the sample mean.
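The sampler and the variance estimate above can be sketched in Python. The hyperparameters \(\mu_{0}\), \(\tau\), \(\alpha\), \(\beta\), the data \(x\) and the burn-in length are hypothetical choices for the example.

```python
# A sketch of the Gibbs sampler above together with the variance estimate;
# the hyperparameters mu0, tau, alpha, beta and the data x are hypothetical.
import random

random.seed(4)

mu0, tau, alpha, beta = 0.0, 1.0, 2.0, 2.0
x = [1.1, 0.4, 1.9, 0.7, 1.3, 0.2, 1.6, 0.9]
n, xbar = len(x), sum(x) / len(x)

def gibbs(n_iter, lam0=1.0):
    mu, lam, chain = 0.0, lam0, []
    for _ in range(n_iter):
        # mu | lambda, x ~ N((tau*mu0 + n*lam*xbar)/(tau + n*lam), 1/(tau + n*lam))
        prec = tau + n * lam
        mu = random.gauss((tau * mu0 + n * lam * xbar) / prec, prec ** -0.5)
        # lambda | mu, x ~ Gamma(alpha + n/2, beta + (1/2) sum (x_i - mu)^2);
        # random.gammavariate takes the shape and the *scale* = 1/rate.
        rate = beta + 0.5 * sum((xi - mu) ** 2 for xi in x)
        lam = random.gammavariate(alpha + n / 2.0, 1.0 / rate)
        chain.append((mu, lam))
    return chain

N, b = 10000, 2000                   # chain length and burn-in
chain = gibbs(N)
lams = [lam for _, lam in chain[b:]]
lam_bar = sum(lams) / len(lams)      # estimate of E(lambda | x)
var_hat = sum((l - lam_bar) ** 2 for l in lams) / len(lams)
print(lam_bar, var_hat)              # var_hat estimates Var(lambda | x)
```

The final two lines implement the post-burn-in sample variance \(\frac{1}{N-b}\sum_{t=b+1}^{N}(\lambda^{(t)} - \bar{\lambda})^{2}\) described above.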

Question 4

Consider a Poisson hierarchical model. At the first stage we have observations \(s_{j}\) which are Poisson with mean \(t_{j}\lambda_{j}\) for \(j = 1, \ldots, p\) where each \(t_{j}\) is known. We assume that the \(\lambda_{j}\) are independent and identically distributed with \(\mbox{Gamma}(\alpha, \beta)\) prior distributions. The parameter \(\alpha\) is known but \(\beta\) is unknown and is given a \(\mbox{Gamma}(\gamma, \delta)\) distribution where \(\gamma\) and \(\delta\) are known. The \(s_{j}\) are assumed to be conditionally independent given the unknown parameters.

  1. Find, up to a constant of integration, the joint posterior distribution of the unknown parameters given \(s = (s_{1}, \ldots, s_{p})\).

Let \(\lambda = (\lambda_{1}, \ldots, \lambda_{p})\) then the unknown parameters are \(\theta = (\lambda_{1}, \ldots, \lambda_{p}, \beta) = (\lambda, \beta)\). The prior is \[\begin{eqnarray*} f(\lambda, \beta) & = & f(\lambda \, | \, \beta)f(\beta) \\ & = & \left\{\prod_{j=1}^{p} f(\lambda_{j} \, | \, \beta)\right\}f(\beta) \\ & = & \left\{\prod_{j=1}^{p} \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda_{j}^{\alpha-1}e^{-\beta\lambda_{j}}\right\} \times \frac{\delta^{\gamma}}{\Gamma(\gamma)} \beta^{\gamma-1}e^{-\delta \beta} \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{\alpha-1}\right)\beta^{(p\alpha+\gamma)-1}\exp\left\{-\beta\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right)\right\}. \end{eqnarray*}\] The likelihood is \[\begin{eqnarray*} f(s \, | \, \lambda, \beta) & = & \prod_{j=1}^{p} \frac{(t_{j}\lambda_{j})^{s_{j}}e^{-t_{j}\lambda_{j}}}{s_{j}!} \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{s_{j}}\right) \exp\left(-\sum_{j=1}^{p}t_{j}\lambda_{j} \right). \end{eqnarray*}\] The posterior is thus \[\begin{eqnarray*} f(\lambda, \beta \, | \, s) & \propto & f(s \, | \, \lambda, \beta)f(\lambda, \beta) \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{\alpha+s_{j}-1}\right)\beta^{(p\alpha+\gamma)-1}\exp\left\{-\beta\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right) -\sum_{j=1}^{p}t_{j}\lambda_{j} \right\}. \end{eqnarray*}\]

  2. Describe how the Gibbs sampler may be used to sample from the posterior distribution, deriving all required conditional distributions.

For each \(j\), let \(\lambda_{-j} = \lambda \setminus \lambda_{j}\). For the Gibbs sampler we need to find the distributions of \(\lambda_{j} \, | \, \lambda_{-j}, \beta, s\), \(j = 1, \ldots, p\), and \(\beta \, | \, \lambda, s\). Now, \[\begin{eqnarray*} f(\lambda_{j} \, | \, \lambda_{-j}, \beta, s) & = & \frac{f(\lambda, \beta \, | \, s)}{f(\lambda_{-j}, \beta \, | \, s)} \\ & \propto & f(\lambda, \beta \, | \, s) \ \ \ \ \mbox{(taken with respect to $\lambda_{j}$)} \\ & \propto & \lambda_{j}^{\alpha+s_{j}-1} \exp\left\{-(\beta+t_{j})\lambda_{j}\right\} \end{eqnarray*}\] which is a kernel of a \(Gamma(\alpha+s_{j}, \beta+t_{j})\) distribution. Hence, \(\lambda_{j} \, | \, \lambda_{-j}, \beta, s \sim Gamma(\alpha+s_{j}, \beta+t_{j})\). Similarly, \[\begin{eqnarray*} f(\beta \, | \, \lambda, s) & = & \frac{f(\lambda, \beta \, | \, s)}{f(\lambda \, | \, s)} \\ & \propto & f(\lambda, \beta \, | \, s) \ \ \ \ \mbox{(taken with respect to $\beta$)} \\ & \propto & \beta^{(p\alpha+\gamma)-1}\exp\left\{-\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right)\beta\right\} \end{eqnarray*}\] which is a kernel of a \(Gamma(p\alpha+\gamma,\delta+\sum_{j=1}^{p}\lambda_{j})\) distribution. So, \(\beta \, | \, \lambda, s \sim Gamma(p\alpha+\gamma,\delta+\sum_{j=1}^{p}\lambda_{j})\).

  The Gibbs sampler is

  1. Choose an arbitrary starting point for which \(f(\lambda^{(0)}, \beta^{(0)} \, | \, s) > 0\). As \(\lambda_{j} > 0\) for \(j=1, \ldots, p\) and \(\beta > 0\), any collection of \(p+1\) positive numbers will suffice.

  2. At time \(t\)

  • Obtain \(\lambda_{1}^{(t)}\) by sampling from the distribution of \(\lambda_{1} \, | \, \lambda_{-1}^{(t-1)}, \beta^{(t-1)}, s\) which is \(Gamma(\alpha + s_{1}, \beta^{(t-1)} + t_{1})\).

  • \(\vdots\)

  • Obtain \(\lambda_{j}^{(t)}\) by sampling from the distribution of \(\lambda_{j} \, | \, \lambda_{1}^{(t)}, \ldots, \lambda_{j-1}^{(t)}, \lambda_{j+1}^{(t-1)}\), \(\ldots, \lambda_{p}^{(t-1)}, \beta^{(t-1)}, s\) which is \(Gamma(\alpha + s_{j}, \beta^{(t-1)} + t_{j})\).

  • \(\vdots\)

  • Obtain \(\lambda_{p}^{(t)}\) by sampling from the distribution of \(\lambda_{p} \, | \, \lambda_{-p}^{(t)}, \beta^{(t-1)}, s\) which is \(Gamma(\alpha + s_{p}, \beta^{(t-1)} + t_{p})\).

  • Obtain \(\beta^{(t)}\) by sampling from the distribution of \(\beta \, | \, \lambda^{(t)}, s\), which is \(Gamma(p\alpha + \gamma, \delta+\sum_{j=1}^{p} \lambda_{j}^{(t)})\).

  3. Repeat step 2.

The algorithm is run until convergence is reached. After convergence, each resulting value \((\lambda^{(t)}, \beta^{(t)})\) is a realisation from the posterior \(\lambda, \beta \, | \, s\).
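The full cycle of \(p\) Gamma updates followed by the \(\beta\) update can be sketched in Python. The constants \(\alpha\), \(\gamma\), \(\delta\) and the data \((s_{j}, t_{j})\) below are hypothetical values chosen only for the example.

```python
# A sketch of the Gibbs sampler above; alpha, gamma, delta and the data
# (s_j, t_j) are hypothetical values chosen only for the example.
import random

random.seed(5)

alpha, gamma_, delta = 2.0, 1.0, 1.0   # gamma_ avoids the name "gamma"
s = [3, 7, 2, 9]                       # hypothetical Poisson counts
t_known = [10.0, 20.0, 5.0, 30.0]      # the known t_j
p = len(s)

def gibbs(n_iter, beta0=1.0):
    beta, chain = beta0, []
    for _ in range(n_iter):
        # lambda_j | lambda_{-j}, beta, s ~ Gamma(alpha + s_j, beta + t_j);
        # random.gammavariate takes the shape and the *scale* = 1/rate.
        lam = [random.gammavariate(alpha + s[j], 1.0 / (beta + t_known[j]))
               for j in range(p)]
        # beta | lambda, s ~ Gamma(p*alpha + gamma, delta + sum_j lambda_j)
        beta = random.gammavariate(p * alpha + gamma_, 1.0 / (delta + sum(lam)))
        chain.append((lam, beta))
    return chain

chain = gibbs(5000)
post = chain[1000:]                    # discard the burn-in
betas = [beta for _, beta in post]
print(sum(betas) / len(betas))         # estimate of E(beta | s)
```

Because each full conditional is a Gamma distribution, each of the \(p+1\) updates within an iteration is a single direct draw with acceptance probability 1.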

  3. Let \(\{\lambda_{1}^{(t)}, \ldots, \lambda_{p}^{(t)}, \beta^{(t)}; t = 1, \ldots, N\}\), with \(N\) large, be a realisation of the Gibbs sampler described above. Give sensible estimates of \(E(\lambda_{j} \, | \, s)\), \(Var(\beta \, | \, s)\) and \(E(\lambda_{j} \, | \, a \leq \beta \leq b, s)\) where \(0 < a < b\) are given constants.

We first remove the burn-in samples. Suppose we determine the burn-in time \(B\) (written \(B\) here to avoid confusion with the given constant \(b\)), where we assume that the Markov chain has sufficiently converged for \(t > B\). (We assume that \(B < N\).) Then we discard the observations \(\{\lambda^{(t)}, \beta^{(t)}; t \leq B\}\) and use the remaining sample \(\{\lambda^{(t)}, \beta^{(t)}; t = B+1, \ldots, N\}\) to estimate posterior summaries. Here \(\lambda^{(t)} = (\lambda_{1}^{(t)}, \ldots, \lambda_{p}^{(t)})\).

The posterior mean of \(\lambda_{j} \, | \, s\) can be estimated by the sample mean of \(\{\lambda_{j}^{(t)}, t = B+1, \ldots, N\}\) so that our estimate of \(E(\lambda_{j} \, | \, s)\) is \(\frac{1}{N-B} \sum_{t = B+1}^{N} \lambda_{j}^{(t)}\).

The posterior variance of \(\beta \, | \, s\) can be estimated by the sample variance of \(\{\beta^{(t)}, t = B+1, \ldots, N\}\) so that our estimate of \(Var(\beta \, | \, s)\) is \(\frac{1}{N-B} \sum_{t = B+1}^{N} (\beta^{(t)}-\bar{\beta})^{2} = \left(\frac{1}{N-B} \sum_{t = B+1}^{N} (\beta^{(t)})^{2}\right) - \bar{\beta}^{2}\) where \(\bar{\beta} = \frac{1}{N-B} \sum_{t = B+1}^{N} \beta^{(t)}\).

For estimating \(E(\lambda_{j} \, | \, a \leq \beta \leq b, s)\) we not only discard the burn-in observations but also those \(\lambda_{j}^{(t)}\) for which \(\beta^{(t)} < a\) or \(\beta^{(t)} > b\). So, letting \(\mathbb{I}_{[a \leq \beta^{(t)} \leq b]}\) denote the indicator function which is one if \(a \leq \beta^{(t)} \leq b\) and zero otherwise, the number of observations after the burn-in with \(a \leq \beta^{(t)} \leq b\) is \(M = \sum_{t = B+1}^{N} \mathbb{I}_{[a \leq \beta^{(t)} \leq b]}\) and our estimate of \(E(\lambda_{j} \, | \, a \leq \beta \leq b, s)\) is \(\frac{1}{M} \sum_{t=B+1}^{N} \lambda_{j}^{(t)}\mathbb{I}_{[a \leq \beta^{(t)} \leq b]}\).
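The three estimators can be sketched as follows. The chain here is a toy one simulated purely for illustration; in practice `lam_chain` and `beta_chain` would be one component's output from the Gibbs sampler, and all of the numerical constants are assumptions.

```python
# A sketch of the three estimators above, applied to a toy chain simulated
# here purely for illustration (not real Gibbs output).
import random

random.seed(6)

N, B = 1000, 200                     # chain length and burn-in
a, b = 1.0, 2.0                      # the given interval constants
beta_chain = [random.gammavariate(3.0, 0.5) for _ in range(N)]
lam_chain = [random.gammavariate(2.0, 1.0 / (bt + 1.0)) for bt in beta_chain]

# E(lambda_j | s): sample mean after the burn-in.
lam_mean = sum(lam_chain[B:]) / (N - B)

# Var(beta | s): sample variance after the burn-in.
beta_mean = sum(beta_chain[B:]) / (N - B)
beta_var = sum((bt - beta_mean) ** 2 for bt in beta_chain[B:]) / (N - B)

# E(lambda_j | a <= beta <= b, s): keep only draws with beta in [a, b].
kept = [l for l, bt in zip(lam_chain[B:], beta_chain[B:]) if a <= bt <= b]
M = len(kept)
cond_mean = sum(kept) / M
print(lam_mean, beta_var, M, cond_mean)
```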

Question 5

Show that the Gibbs sampler for sampling from a distribution \(\pi(\theta)\) where \(\theta = (\theta_{1}, \ldots, \theta_{d})\) can be viewed as a special case of the Metropolis-Hastings algorithm where each iteration \(t\) consists of \(d\) Metropolis-Hastings steps each with an acceptance probability of 1.

For the Gibbs sampler, we update each \(\theta_{p}\) one at a time by obtaining \(\theta^{(t)}_{p}\) from the conditional distribution \(\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\). Thus, we move from \(\theta_{p}^{(t-1)}\) to \(\theta_{p}^{(t)}\) leaving the remaining \(d-1\) parameters fixed.

We now demonstrate that this update step in the Gibbs sampler is equivalent to a single Metropolis-Hastings step with acceptance probability 1. Let \(\theta^{(t-1)}_{[p]} = (\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)},\) \(\theta_{p}^{(t-1)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\) and \(\theta^{(t)}_{[p]} = (\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\). Consider the move from \(\theta^{(t-1)}_{[p]}\) to \(\theta^{(t)}_{[p]}\). As \(\theta^{(t)}_{[p]}\) is drawn from \(\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\) then we shall use this for the proposal distribution in the Metropolis-Hastings algorithm so that \[\begin{eqnarray} q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]}) & = & \pi(\theta_{p}^{(t)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}). \tag{1} \end{eqnarray}\] If, instead, we consider the move from \(\theta^{(t)}_{[p]}\) to \(\theta^{(t-1)}_{[p]}\) then, using Gibbs sampling, \(\theta^{(t-1)}_{[p]}\) is obtained by drawing \(\theta_{p}^{(t-1)}\) from \(\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\). Thus, we set \[\begin{eqnarray} q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]}) & = & \pi(\theta_{p}^{(t-1)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}). 
\tag{2} \end{eqnarray}\] Now, \[\begin{eqnarray} \pi(\theta^{(t-1)}_{[p]} ) \ = \ \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t-1)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) & = & \pi(\theta_{p}^{(t-1)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{3} \\ & = & q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]}) \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{4} \end{eqnarray}\] where (4) follows by substituting (2) into (3). In an identical fashion, \[\begin{eqnarray} \pi(\theta^{(t)}_{[p]}) \ = \ \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) & = & \pi(\theta_{p}^{(t)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{5} \\ & = & q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]}) \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{6} \end{eqnarray}\] where (6) follows by substituting (1) into (5). Dividing (6) by (4) gives \[\begin{eqnarray*} \frac{\pi(\theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})} & = & \frac{q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})}{q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})} \end{eqnarray*}\] so that \[\begin{eqnarray} \frac{\pi(\theta^{(t)}_{[p]})q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})} & = & 1. 
\tag{7} \end{eqnarray}\] Thus, a Metropolis-Hastings algorithm for sampling from \(\pi(\theta)\) for a proposed move from \(\theta^{(t-1)}_{[p]}\) to \(\theta^{(t)}_{[p]}\) using the proposal distribution given by (1) has an acceptance ratio of \[\begin{eqnarray} \alpha(\theta^{(t-1)}_{[p]}, \theta^{(t)}_{[p]}) & = & \min \left(1, \frac{\pi(\theta^{(t)}_{[p]})q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})} \right) \tag{8} \\ & = & 1 \tag{9} \end{eqnarray}\] where (9) follows from (8) using (7).