Let \(X_{1}, \ldots, X_{n}\) be exchangeable so that the \(X_{i}\) are conditionally independent given a parameter \(\theta\). Suppose that \(X_{i} \, | \, \theta \sim \mbox{Inv-gamma}(\alpha, \theta)\), where \(\alpha\) is known, and we judge that \(\theta \sim \mbox{Gamma}(\alpha_{0}, \beta_{0})\), where \(\alpha_{0}\) and \(\beta_{0}\) are known.
As \(X_{i} \, | \, \theta \sim \mbox{Inv-gamma}(\alpha, \theta)\) then \[\begin{eqnarray*} f(x \, | \, \theta) & = & \prod_{i=1}^{n} \frac{\theta^{\alpha}}{\Gamma(\alpha)} x_{i}^{-(\alpha+1)} \exp \left(-\frac{\theta}{x_{i}}\right) \\ & \propto & \theta^{n\alpha} \exp \left(-\theta \sum_{i=1}^{n}\frac{1}{x_{i}}\right). \end{eqnarray*}\] Similarly, as \(\theta \sim \mbox{Gamma}(\alpha_{0}, \beta_{0})\), \[\begin{eqnarray*} f(\theta) & \propto & \theta^{\alpha_{0}-1}\exp \left(-\theta \beta_{0}\right). \end{eqnarray*}\] Hence, \[\begin{eqnarray*} f(\theta \, | \, x) & \propto & f(x \, | \, \theta)f(\theta) \\ & \propto & \theta^{(\alpha_{0}+n\alpha)-1}\exp \left\{-\theta\left(\beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}\right)\right\} \end{eqnarray*}\] which we recognise as a kernel of a \(\mbox{Gamma}(\alpha_{0} + n \alpha, \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}})\) density. Thus, \(\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})\), where \(\alpha_{n} = \alpha_{0} + n\alpha\) and \(\beta_{n} = \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}\), as required.
We wish to use the Metropolis-Hastings algorithm to sample from the posterior distribution \(\theta \, | \, x\) using a normal distribution with mean \(\theta\) and chosen variance \(\sigma^{2}\) as the symmetric proposal distribution.
As the proposal distribution is symmetric then the Metropolis-Hastings algorithm reduces to the Metropolis algorithm with acceptance probability \[\begin{eqnarray*} \alpha(\theta^{(t-1)}, \theta^{*}) = \min \left(1, \frac{f(\theta^{*} \, | \, x)}{f(\theta^{(t-1)} \, | \, x)}\right) \end{eqnarray*}\] where \(\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})\). Now, assuming \(\theta^{(t-1)} > 0\), if \(\theta^{*} \leq 0\) then \(f(\theta^{*} \, | \, x) = 0\) and \(\alpha(\theta^{(t-1)}, \theta^{*}) = 0\) so the move to \(\theta^{*} \leq 0\) is rejected and \(\theta^{(t)} = \theta^{(t-1)}\). So, provided \(\theta^{(0)} > 0\), the sequence of values generated by the algorithm remains strictly positive.
The algorithm is performed as follows.
Choose a starting point \(\theta^{(0)}\) for which \(f(\theta^{(0)} \, | \, x) > 0\). This will hold for any \(\theta^{(0)} > 0\) as \(\theta \, | \, x \sim \mbox{Gamma}(\alpha_{n}, \beta_{n})\).
At time \(t\)
Sample \(\theta^{*} \sim N(\theta^{(t-1)}, \sigma^{2})\).
Calculate the acceptance probability \[\begin{eqnarray*} \alpha(\theta^{(t-1)}, \theta^{*}) & = & \min \left(1, \frac{f(\theta^{*} \, | \, x)}{f(\theta^{(t-1)} \, | \, x)}\right) \\ & = & \left\{\begin{array}{cl} \min\left(1, \frac{(\theta^{*})^{\alpha_{n}-1}\exp(-\beta_{n}\theta^{*})}{(\theta^{(t-1)})^{\alpha_{n}-1}\exp(-\beta_{n}\theta^{(t-1)})}\right) & \theta^{*} > 0 \\ 0 & \theta^{*} \leq 0 \end{array}\right.\\ & = & \left\{\begin{array}{cl} \min\left(1, \left(\frac{\theta^{*}}{\theta^{(t-1)}}\right)^{\alpha_{n} - 1} \exp\left\{-\beta_{n}\left(\theta^{*} - \theta^{(t-1)}\right)\right\}\right) & \theta^{*} > 0 \\ 0 & \theta^{*} \leq 0 \end{array}\right. \end{eqnarray*}\] where \(\alpha_{n} = \alpha_{0} + n \alpha\), \(\beta_{n} = \beta_{0} + \sum_{i=1}^{n} \frac{1}{x_{i}}\).
Generate \(U \sim U(0, 1)\).
If \(U \leq \alpha(\theta^{(t-1)}, \theta^{*})\) accept the move, \(\theta^{(t)} = \theta^{*}\). Otherwise reject the move, \(\theta^{(t)}= \theta^{(t-1)}\).
The algorithm will produce a Markov Chain with stationary distribution \(f(\theta \, | \, x)\). After a sufficiently long time, \(b\) say, to allow for convergence, the values \(\{\theta^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\theta \, | \, x)\), that is a sample from \(\mbox{Gamma}(\alpha_{n}, \beta_{n})\), where the samples \(\{\theta^{(t)} \}\) for \(t \leq b\) are the “burn-in”.
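For illustration, a minimal sketch of this random-walk Metropolis sampler in Python (using numpy) is given below. The data \(x\), the hyperparameters, the proposal standard deviation, the seed and the chain length are arbitrary choices for illustration rather than values from the question.

```python
import numpy as np

def log_post(theta, alpha_n, beta_n):
    """Log kernel of the Gamma(alpha_n, beta_n) posterior; -inf outside the support."""
    if theta <= 0:
        return -np.inf
    return (alpha_n - 1.0) * np.log(theta) - beta_n * theta

def metropolis_gamma(x, alpha, alpha0, beta0, sigma, n_iter, theta0, rng):
    """Random-walk Metropolis sampler targeting theta | x ~ Gamma(alpha_n, beta_n)."""
    alpha_n = alpha0 + len(x) * alpha             # posterior shape
    beta_n = beta0 + np.sum(1.0 / x)              # posterior rate
    chain = np.empty(n_iter + 1)
    chain[0] = theta0
    for t in range(1, n_iter + 1):
        prop = rng.normal(chain[t - 1], sigma)    # symmetric normal proposal
        log_acc = log_post(prop, alpha_n, beta_n) - log_post(chain[t - 1], alpha_n, beta_n)
        if rng.uniform() < np.exp(min(0.0, log_acc)):   # accept with probability min(1, ratio)
            chain[t] = prop
        else:
            chain[t] = chain[t - 1]
    return chain

rng = np.random.default_rng(1)
theta_true, alpha = 2.0, 2.0
x = 1.0 / rng.gamma(alpha, 1.0 / theta_true, size=20)   # illustrative Inv-gamma(alpha, theta_true) data
chain = metropolis_gamma(x, alpha=alpha, alpha0=1.0, beta0=1.0,
                         sigma=0.5, n_iter=10000, theta0=1.0, rng=rng)
burn_in = 1000
print(chain[burn_in:].mean())   # compare with the exact posterior mean alpha_n / beta_n
```

The proposed value is rejected automatically whenever it is non-positive, since the log posterior kernel is then \(-\infty\), mirroring the argument above.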
Suppose that \(X \, | \, \theta \sim N(\theta, \sigma^{2})\) and \(Y \, | \, \theta, \delta \sim N(\theta - \delta, \sigma^{2})\), where \(\sigma^{2}\) is a known constant and \(X\) and \(Y\) are conditionally independent given \(\theta\) and \(\delta\). It is judged that the improper noninformative joint prior distribution \(f(\theta, \delta) \propto 1\) is appropriate.
As \(X \, | \, \theta \sim N(\theta, \sigma^{2})\) then \[\begin{eqnarray*} f(x \, | \, \theta) & \propto & \exp\left\{-\frac{1}{2\sigma^{2}} (\theta^{2} - 2x\theta)\right\}. \end{eqnarray*}\] As \(Y \, | \, \theta, \delta \sim N(\theta - \delta, \sigma^{2})\) then \[\begin{eqnarray*} f(y \, | \, \theta, \delta) & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[(\theta-\delta)^{2} - 2y(\theta - \delta)]\right\}. \end{eqnarray*}\] Now, by conditional independence, \[\begin{eqnarray*} f(x, y \, | \, \theta, \delta) & = & f(x \, | \, \theta)f(y \, | \, \theta, \delta) \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[\theta^{2} - 2x\theta + (\theta-\delta)^{2} - 2y(\theta - \delta)]\right\}. \end{eqnarray*}\] So, \[\begin{eqnarray*} f(\theta, \delta \, | \, x, y) & \propto & f(x, y \, | \, \theta, \delta)f(\theta, \delta) \\ & \propto & f(x, y \, | \, \theta, \delta) \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[\theta^{2} - 2x\theta + (\theta-\delta)^{2} - 2y(\theta - \delta)]\right\} \\ & \propto & \exp \left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} -2(x+y)\theta - 2\theta\delta + \delta^{2} + 2y\delta]\right\}. \end{eqnarray*}\] The exponential argument is quadratic in \(\theta\), \(\delta\) so \(\theta, \delta \, | \, x, y\) is bivariate normal. We can find the exact normal by completing the square or, in this case, using the information given in the question. We have, \[\begin{eqnarray*} \Sigma^{-1} & = & \frac{1}{\sigma^{2}}\left(\begin{array}{cc} 2 & -1 \\ -1 & 1 \end{array} \right) \end{eqnarray*}\] so that the density of the \(N_{2}(\mu, \Sigma)\) is \[\begin{eqnarray*} \exp\left\{-\frac{1}{2} \left(\theta - x \ \ \delta - (x-y)\right) \frac{1}{\sigma^{2}}\left(\begin{array}{cc} 2 & -1 \\ -1 & 1 \end{array} \right)\left(\begin{array}{c} \theta - x \\ \delta - (x-y) \end{array}\right) \right\} & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(2x-(x-y))\theta - 2\theta \delta + \delta^{2} - 2((x-y)-x)\delta]\right\} \\ & = & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(x+y)\theta - 2\theta \delta + \delta^{2} + 2y\delta]\right\} \\ & \propto & f(\theta, \delta \, | \, x, y). \end{eqnarray*}\]
The Gibbs sampler requires the conditional distributions \(\theta \, | \, \delta, x, y\) and \(\delta \, | \, \theta, x, y\). Now, with proportionality with respect to \(\theta\), \[\begin{eqnarray*} f(\theta \, | \, \delta, x, y) & \propto & f(\theta, \delta \, | \, x, y) \\ & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[2\theta^{2} - 2(x+y+\delta)\theta]\right\} \\ & = & \exp\left\{-\frac{1}{2}\left(\frac{2}{\sigma^{2}}\right)\left[\theta^{2} - 2\left(\frac{x+y+\delta}{2}\right)\theta\right]\right\} \end{eqnarray*}\] which is a kernel of a normal density. Thus, \(\theta \, | \, \delta, x, y \sim N(\frac{x+y+\delta}{2}, \frac{\sigma^{2}}{2})\). Now, with proportionality with respect to \(\delta\), \[\begin{eqnarray*} f(\delta \, | \, \theta, x, y) & \propto & f(\theta, \delta \, | \, x, y) \\ & \propto & \exp\left\{-\frac{1}{2\sigma^{2}}[\delta^{2} - 2(\theta-y)\delta]\right\} \end{eqnarray*}\] which is a kernel of a normal density. Thus, \(\delta \, | \, \theta, x, y \sim N(\theta - y, \sigma^{2})\). The Gibbs sampler algorithm is
draw \(\theta^{(t)}\) from \(N(\frac{x+y+\delta^{(t-1)}}{2}, \frac{\sigma^{2}}{2})\), the distribution of \(\theta^{(t)} \, | \, \delta^{(t-1)}, x, y\).
draw \(\delta^{(t)}\) from \(N(\theta^{(t)}-y, \sigma^{2})\), the distribution of \(\delta^{(t)} \, | \, \theta^{(t)}, x, y\).
The algorithm produces a Markov Chain with stationary distribution \(f(\theta, \delta \, | \, x, y)\). After a sufficiently long time to allow for convergence, the values \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\theta, \delta \, | \, x, y)\) where the samples \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t \leq b\) are the “burn-in”.
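For illustration, a minimal sketch of this Gibbs sampler in Python (using numpy) is given below; the observed values, starting point, seed and chain length in the driver call are arbitrary illustrative choices.

```python
import numpy as np

def gibbs_theta_delta(x, y, sigma2, n_iter, theta0, delta0, rng):
    """Gibbs sampler for (theta, delta) | x, y under the flat prior f(theta, delta) constant."""
    theta, delta = theta0, delta0
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # theta | delta, x, y ~ N((x + y + delta)/2, sigma2/2)
        theta = rng.normal((x + y + delta) / 2.0, np.sqrt(sigma2 / 2.0))
        # delta | theta, x, y ~ N(theta - y, sigma2)
        delta = rng.normal(theta - y, np.sqrt(sigma2))
        draws[t] = theta, delta
    return draws

# Illustrative run only: x, y, sigma2 and the chain settings are arbitrary choices.
rng = np.random.default_rng(0)
draws = gibbs_theta_delta(x=0.5, y=-0.3, sigma2=2.0, n_iter=10000,
                          theta0=0.0, delta0=0.0, rng=rng)
burn_in = 1000
print(draws[burn_in:].mean(axis=0))   # should be close to (x, x - y) = (0.5, 0.8)
print(np.cov(draws[burn_in:].T))      # should be close to sigma2 * [[1, 1], [1, 2]]
```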
For \(x=2\), \(y=1\) and \(\sigma^{2} = 1\) we shall use the Gibbs sampler to sample from \(N_{2}(\mu, \Sigma)\) where \(\mu = \left(\begin{array}{c} 2 \\ 1 \end{array}\right)\) and \(\Sigma = \left(\begin{array}{cc} 1 & 1 \\ 1 & 2 \end{array}\right)\). We draw \(\theta^{(t)}\) from \(N(\frac{3+\delta^{(t-1)}}{2}, \frac{1}{2})\) and \(\delta^{(t)}\) from \(N(\theta^{(t)}-1, 1)\). Suppose that we start from \((1,1)\) and sample \(\theta^{(1)} = 2.216715\), \(\delta^{(1)} = 1.475738\), \(\theta^{(2)} = 2.086383\) and \(\delta^{(2)} = 1.617760\). The trajectory of these points is shown in the following figure.
For the proposal we have \[\begin{eqnarray*} q(\theta^{*}, \delta^{*} \, | \, \theta^{(t-1)}, \delta^{(t-1)}) & \propto & \exp\left\{-\frac{1}{2} \left(\theta^{*} - \theta^{(t-1)} \ \ \delta^{*} - \delta^{(t-1)}\right) \tilde{\Sigma}^{-1}\left(\begin{array}{c} \theta^{*} - \theta^{(t-1)} \\ \delta^{*} - \delta^{(t-1)} \end{array}\right) \right\} \end{eqnarray*}\] so that \(q(\theta^{*}, \delta^{*} \, | \, \theta^{(t-1)}, \delta^{(t-1)}) = q(\theta^{(t-1)}, \delta^{(t-1)} \, | \, \theta^{*}, \delta^{*})\). The proposal distribution is symmetric and so the Metropolis-Hastings algorithm reduces to the Metropolis algorithm. The algorithm is performed as follows.
Sample \((\theta^{*}, \delta^{*}) \sim N_{2}(\tilde{\mu}^{(t-1)}, \tilde{\Sigma})\), where \(\tilde{\mu}^{(t-1)} = \left(\begin{array}{c} \theta^{(t-1)} \\ \delta^{(t-1)} \end{array}\right)\).
Calculate the acceptance probability \[\begin{eqnarray*} \alpha((\theta^{(t-1)}, \delta^{(t-1)}), (\theta^{*}, \delta^{*})) & = & \min \left(1, \frac{f(\theta^{*}, \delta^{*} \, | \, x, y)}{f(\theta^{(t-1)}, \delta^{(t-1)} \, | \, x, y)}\right) \\ & = & \min \left(1, \frac{\exp\left\{-\frac{1}{2} (\gamma^{*} - \mu)^{T} \Sigma^{-1} (\gamma^{*} - \mu)\right\}}{\exp\left\{-\frac{1}{2} (\gamma^{(t-1)} - \mu)^{T} \Sigma^{-1} (\gamma^{(t-1)} - \mu)\right\}}\right) \end{eqnarray*}\] where \(\gamma^{*} = \left(\begin{array}{c} \theta^{*} \\ \delta^{*} \end{array}\right)\) and \(\gamma^{(t-1)} = \left(\begin{array}{c} \theta^{(t-1)} \\ \delta^{(t-1)} \end{array}\right)\).
Generate \(U \sim U(0, 1)\).
If \(U \leq \alpha((\theta^{(t-1)}, \delta^{(t-1)}), (\theta^{*}, \delta^{*}))\) accept the move, \((\theta^{(t)}, \delta^{(t)}) = (\theta^{*}, \delta^{*})\). Otherwise reject the move, \((\theta^{(t)}, \delta^{(t)}) = (\theta^{(t-1)}, \delta^{(t-1)})\).
The algorithm produces a Markov Chain with stationary distribution \(f(\theta, \delta \, | \, x, y)\). After a sufficiently long time to allow for convergence, the values \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\theta, \delta \, | \, x, y)\) where the samples \(\{\theta^{(t)}, \delta^{(t)}\}\) for \(t \leq b\) are the “burn-in”.
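A minimal Python sketch of this Metropolis algorithm for the case \(x = 2\), \(y = 1\), \(\sigma^{2} = 1\) considered above is given below (numpy only); the proposal covariance \(\tilde{\Sigma}\), seed, starting point and chain length are arbitrary illustrative choices.

```python
import numpy as np

mu = np.array([2.0, 1.0])                         # posterior mean (x, x - y) for x = 2, y = 1
Sigma_inv = np.array([[2.0, -1.0], [-1.0, 1.0]])  # Sigma^{-1} when sigma2 = 1
Sigma_tilde = 0.5 * np.eye(2)                     # proposal covariance (arbitrary choice)

def log_target(g):
    """Log kernel of the N2(mu, Sigma) posterior."""
    d = g - mu
    return -0.5 * d @ Sigma_inv @ d

rng = np.random.default_rng(0)
n_iter, burn_in = 20000, 2000
chain = np.empty((n_iter + 1, 2))
chain[0] = np.array([1.0, 1.0])                   # starting point
accepted = 0
for t in range(1, n_iter + 1):
    prop = rng.multivariate_normal(chain[t - 1], Sigma_tilde)   # symmetric proposal
    log_acc = log_target(prop) - log_target(chain[t - 1])
    if rng.uniform() < np.exp(min(0.0, log_acc)):
        chain[t] = prop
        accepted += 1
    else:
        chain[t] = chain[t - 1]
print(accepted / n_iter)                          # empirical acceptance rate
print(chain[burn_in:].mean(axis=0))               # compare with mu = (2, 1)
```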
Let \(X_{1}, \ldots, X_{n}\) be exchangeable so that the \(X_{i}\) are conditionally independent given a parameter \(\theta = (\mu, \lambda)\). Suppose that \(X_{i} \, | \, \theta \sim N(\mu, 1/\lambda)\) so that \(\mu\) is the mean and \(\lambda\) the precision of the distribution. Suppose that we judge that \(\mu\) and \(\lambda\) are independent with \(\mu \sim N(\mu_{0}, 1/\tau)\), where \(\mu_{0}\) and \(\tau\) are known, and \(\lambda \sim \mbox{Gamma}(\alpha, \beta)\), where \(\alpha\) and \(\beta\) are known.
As \(X_{i} \, | \, \theta \sim N(\mu, 1/\lambda)\) and \(\theta = (\mu, \lambda)\) then \[\begin{eqnarray*} f(x \, | \, \mu, \lambda) & \propto & \prod_{i=1}^{n} \lambda^{\frac{1}{2}} \exp\left\{-\frac{\lambda}{2}(x_{i}-\mu)^{2}\right\} \\ & = & \lambda^{\frac{n}{2}} \exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2}\right\}. \end{eqnarray*}\] \(\mu \sim N(\mu_{0}, 1/\tau)\) so that \[\begin{eqnarray*} f(\mu) & \propto & \exp\left\{- \frac{\tau}{2}(\mu - \mu_{0})^{2}\right\} \\ & \propto & \exp\left\{- \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu\right\}. \end{eqnarray*}\] Finally, as \(\lambda \sim Gamma(\alpha, \beta)\) then \[\begin{eqnarray*} f(\lambda) & \propto & \lambda^{\alpha -1} \exp\left\{-\beta \lambda\right\}. \end{eqnarray*}\] Hence, as \(\mu\) and \(\lambda\) are independent, \[\begin{eqnarray*} f(\mu, \lambda \, | \, x) & \propto & f(x \, | \, \mu, \lambda)f(\mu)f(\lambda) \\ & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2} - \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu - \beta \lambda\right\}. \end{eqnarray*}\]
We have \[\begin{eqnarray*} f(\lambda \, | \, \mu, x) & = & \frac{f(\lambda, \mu \, | \, x)}{f(\mu \, | \, x)} \\ & \propto & f(\lambda, \mu \, | \, x) \end{eqnarray*}\] where the proportionality is with respect to \(\lambda\). So, from 2.(a) we have \[\begin{eqnarray*} f(\lambda \, | \, \mu, x) & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\frac{\lambda}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2} - \frac{\tau}{2}\mu^{2} + \tau \mu_{0} \mu - \beta \lambda\right\} \\ & \propto & \lambda^{\alpha + \frac{n}{2} -1}\exp\left\{-\lambda \left(\beta + \frac{1}{2}\sum_{i=1}^{n}(x_{i}-\mu)^{2}\right)\right\} \end{eqnarray*}\] which is a kernel of a \(Gamma\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2} \sum_{i=1}^{n} (x_{i} - \mu)^{2}\right)\).
The Gibbs sampler requires the conditional distributions \(\lambda \, | \, \mu, x\) and \(\mu \, | \, \lambda, x\) which we have. The algorithm is
Choose a starting value \((\mu^{(0)}, \lambda^{(0)})\) for which \(f(\mu^{(0)}, \lambda^{(0)} \, | \, x) > 0\).
At iteration \(t\) generate the new values \((\mu^{(t)}, \lambda^{(t)})\) as follows:
draw \(\lambda^{(t)}\) from \(\mbox{Gamma}\left(\alpha + \frac{n}{2}, \beta + \frac{1}{2}\sum_{i=1}^{n}(x_{i}-\mu^{(t-1)})^{2}\right)\), the distribution of \(\lambda \, | \, \mu^{(t-1)}, x\);
draw \(\mu^{(t)}\) from the distribution of \(\mu \, | \, \lambda^{(t)}, x\).
The algorithm will produce a Markov Chain with stationary distribution \(f(\mu, \lambda \, | \, x)\). After a sufficiently long time to allow for convergence, the values \(\{\mu^{(t)}, \lambda^{(t)}\}\) for \(t > b\) may be considered as a sample from \(f(\mu, \lambda \, | \, x)\) where the samples \(\{\mu^{(t)}, \lambda^{(t)}\}\) for \(t \leq b\) are the “burn-in”.
Ergodic averages converge to the desired expectations under the target distribution so that, for example, \[\begin{eqnarray*} \frac{1}{N} \sum_{t=1}^{N} g(\lambda^{(t)}) & \rightarrow & E(g(\lambda) \, | \, X) \end{eqnarray*}\] as \(N \rightarrow \infty\). As this includes the observations prior to convergence, we will do better to estimate only using the observations after the “burn-in”. Noting that \(Var(\lambda \, | \, X)\) corresponds to \(g(\lambda) = (\lambda - E(\lambda \, | \, X))^{2}\) then any sensible sample variance will suffice. For example, we could estimate \(Var(\lambda \, | \, X) = E(\lambda^{2} \, | \, X) - E^{2}(\lambda \, | \, X)\) by \[\begin{eqnarray*} \frac{1}{N-b} \sum_{t=b+1}^{N} (\lambda^{(t)} - \bar{\lambda})^{2} & = & \left(\frac{1}{N-b} \sum_{t=b+1}^{N} (\lambda^{(t)})^{2}\right) -\bar{\lambda}^{2} \end{eqnarray*}\] where our sample is \(\{\mu^{(t)}, \lambda^{(t)} : t = b+1, \ldots, N\}\) and \(\bar{\lambda} = \frac{1}{N-b} \sum_{t=b+1}^{N} \lambda^{(t)}\), the sample mean.
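A minimal Python sketch of this Gibbs sampler and of the variance estimate is given below (numpy only). The data, hyperparameters, seed, chain length and burn-in are illustrative choices, and the full conditional of \(\mu\) used in the code is the standard normal conditional \(N\left(\frac{\tau\mu_{0} + \lambda\sum_{i}x_{i}}{\tau + n\lambda}, \frac{1}{\tau + n\lambda}\right)\) obtained by completing the square in \(\mu\).

```python
import numpy as np

def gibbs_mu_lambda(x, mu0, tau, alpha, beta, n_iter, rng):
    """Gibbs sampler for (mu, lambda) | x in the normal model with unknown mean and precision."""
    n, sx = len(x), np.sum(x)
    mu, lam = np.mean(x), 1.0                    # starting values with positive posterior density
    draws = np.empty((n_iter, 2))
    for t in range(n_iter):
        # lambda | mu, x ~ Gamma(alpha + n/2, beta + 0.5 * sum((x - mu)^2))  (shape, rate)
        rate = beta + 0.5 * np.sum((x - mu) ** 2)
        lam = rng.gamma(alpha + n / 2.0, 1.0 / rate)      # numpy's gamma takes scale = 1/rate
        # mu | lambda, x ~ N((tau*mu0 + lam*sum(x)) / (tau + n*lam), 1 / (tau + n*lam))
        prec = tau + n * lam
        mu = rng.normal((tau * mu0 + lam * sx) / prec, np.sqrt(1.0 / prec))
        draws[t] = mu, lam
    return draws

rng = np.random.default_rng(0)
x = rng.normal(5.0, 2.0, size=50)                # illustrative data only
draws = gibbs_mu_lambda(x, mu0=0.0, tau=0.01, alpha=1.0, beta=1.0, n_iter=20000, rng=rng)
b = 2000                                         # burn-in length
lam_post = draws[b:, 1]
# Estimate of Var(lambda | x) as in the text: E(lambda^2 | x) - E(lambda | x)^2
print(np.mean(lam_post ** 2) - np.mean(lam_post) ** 2)
```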
Consider a Poisson hierarchical model. At the first stage we have observations \(s_{j}\) which are Poisson with mean \(t_{j}\lambda_{j}\) for \(j = 1, \ldots, p\) where each \(t_{j}\) is known. We assume that the \(\lambda_{j}\) are independent and identically distributed with \(\mbox{Gamma}(\alpha, \beta)\) prior distributions. The parameter \(\alpha\) is known but \(\beta\) is unknown and is given a \(\mbox{Gamma}(\gamma, \delta)\) distribution where \(\gamma\) and \(\delta\) are known. The \(s_{j}\) are assumed to be conditionally independent given the unknown parameters.
Let \(\lambda = (\lambda_{1}, \ldots, \lambda_{p})\) then the unknown parameters are \(\theta = (\lambda_{1}, \ldots, \lambda_{p}, \beta) = (\lambda, \beta)\). The prior is \[\begin{eqnarray*} f(\lambda, \beta) & = & f(\lambda \, | \, \beta)f(\beta) \\ & = & \left\{\prod_{j=1}^{p} f(\lambda_{j} \, | \, \beta)\right\}f(\beta) \\ & = & \left\{\prod_{j=1}^{p} \frac{\beta^{\alpha}}{\Gamma(\alpha)} \lambda_{j}^{\alpha-1}e^{-\beta\lambda_{j}}\right\} \times \frac{\delta^{\gamma}}{\Gamma(\gamma)} \beta^{\gamma-1}e^{-\delta \beta} \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{\alpha-1}\right)\beta^{(p\alpha+\gamma)-1}\exp\left\{-\beta\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right)\right\}. \end{eqnarray*}\] The likelihood is \[\begin{eqnarray*} f(s \, | \, \lambda, \beta) & = & \prod_{j=1}^{p} \frac{(t_{j}\lambda_{j})^{s_{j}}e^{-t_{j}\lambda_{j}}}{s_{j}!} \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{s_{j}}\right) \exp\left(-\sum_{j=1}^{p}t_{j}\lambda_{j} \right). \end{eqnarray*}\] The posterior is thus \[\begin{eqnarray*} f(\lambda, \beta \, | \, s) & \propto & f(s \, | \, \lambda, \beta)f(\lambda, \beta) \\ & \propto & \left(\prod_{j=1}^{p} \lambda_{j}^{\alpha+s_{j}-1}\right)\beta^{(p\alpha+\gamma)-1}\exp\left\{-\beta\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right) -\sum_{j=1}^{p}t_{j}\lambda_{j} \right\}. \end{eqnarray*}\]
For each \(j\), let \(\lambda_{-j} = \lambda \setminus \lambda_{j}\). For the Gibbs sampler we need to find the distributions of \(\lambda_{j} \, | \, \lambda_{-j}, \beta, s\), \(j = 1, \ldots, p\), and \(\beta \, | \, \lambda, s\). Now, \[\begin{eqnarray*} f(\lambda_{j} \, | \, \lambda_{-j}, \beta, s) & = & \frac{f(\lambda, \beta \, | \, s)}{f(\lambda_{-j}, \beta \, | \, s)} \\ & \propto & f(\lambda, \beta \, | \, s) \ \ \ \ \mbox{(taken with respect to $\lambda_{j}$)} \\ & = & \lambda_{j}^{\alpha+s_{j}-1} \exp\left\{-(\beta+t_{j})\lambda_{j}\right\} \end{eqnarray*}\] which is a kernel of a \(Gamma(\alpha+s_{j}, \beta+t_{j})\) distribution. Hence, \(\lambda_{j} \, | \, \lambda_{-j}, \beta, s \sim Gamma(\alpha+s_{j}, \beta+t_{j})\). \[\begin{eqnarray*} f(\beta \, | \, \lambda, s) & = & \frac{f(\lambda, \beta \, | \, s)}{f(\lambda \, | \, s)} \\ & \propto & f(\lambda, \beta \, | \, s) \ \ \ \ \mbox{(taken with respect to $\beta$)} \\ & = & \beta^{(p\alpha+\gamma)-1}\exp\left\{-\left(\delta+\sum_{j=1}^{p}\lambda_{j}\right)\beta\right\} \end{eqnarray*}\] which is a kernel of a \(Gamma(p\alpha+\gamma,\delta+\sum_{j=1}^{p}\lambda_{j})\) distribution. So, \(\beta \, | \, \lambda, s \sim Gamma(p\alpha+\gamma,\delta+\sum_{j=1}^{p}\lambda_{j})\).
The Gibbs sampler is
Choose an arbitrary starting point for which \(f(\lambda^{(0)}, \beta^{(0)} \, | \, s) > 0\). As \(\lambda_{j} > 0\) for \(j=1, \ldots, p\) and \(\beta > 0\), any collection of \(p+1\) positive numbers will suffice.
At time \(t\)
Obtain \(\lambda_{1}^{(t)}\) by sampling from the distribution of \(\lambda_{1} \, | \, \lambda_{-1}^{(t-1)}, \beta^{(t-1)}, s\) which is \(Gamma(\alpha + s_{1}, \beta^{(t-1)} + t_{1})\).
\(\vdots\)
Obtain \(\lambda_{j}^{(t)}\) by sampling from the distribution of \(\lambda_{j} \, | \, \lambda_{1}^{(t)}, \ldots, \lambda_{j-1}^{(t)}, \lambda_{j+1}^{(t-1)}\), \(\ldots, \lambda_{p}^{(t-1)}, \beta^{(t-1)}, s\) which is \(Gamma(\alpha + s_{j}, \beta^{(t-1)} + t_{j})\).
\(\vdots\)
Obtain \(\lambda_{p}^{(t)}\) by sampling from the distribution of \(\lambda_{p} \, | \, \lambda_{-p}^{(t)}, \beta^{(t-1)}, s\) which is \(Gamma(\alpha + s_{p}, \beta^{(t-1)} + t_{p})\).
Obtain \(\beta^{(t)}\) by sampling from the distribution of \(\beta \, | \, \lambda^{(t)}, s\), which is \(Gamma(p\alpha + \gamma, \delta+\sum_{j=1}^{p} \lambda_{j}^{(t)})\).
The algorithm is run until convergence is reached. Once convergence has been reached, the resulting values \((\lambda^{(t)}, \beta^{(t)})\) are realisations from the posterior distribution of \(\lambda, \beta \, | \, s\).
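For illustration, a minimal sketch of this Gibbs sampler in Python (numpy only) is given below; the counts \(s_{j}\), the known \(t_{j}\), the hyperparameters, the seed and the chain length are arbitrary illustrative choices rather than values from the question.

```python
import numpy as np

def gibbs_poisson_gamma(s, t_exp, alpha, gamma, delta, n_iter, rng):
    """Gibbs sampler for (lambda_1, ..., lambda_p, beta) | s in the Poisson hierarchical model."""
    p = len(s)
    lam = np.ones(p)                              # any positive starting values will do
    beta = 1.0
    lam_draws = np.empty((n_iter, p))
    beta_draws = np.empty(n_iter)
    for it in range(n_iter):
        # lambda_j | rest, s ~ Gamma(alpha + s_j, beta + t_j), updated one component at a time
        for j in range(p):
            lam[j] = rng.gamma(alpha + s[j], 1.0 / (beta + t_exp[j]))   # scale = 1/rate
        # beta | lambda, s ~ Gamma(p*alpha + gamma, delta + sum(lambda))
        beta = rng.gamma(p * alpha + gamma, 1.0 / (delta + lam.sum()))
        lam_draws[it] = lam
        beta_draws[it] = beta
    return lam_draws, beta_draws

rng = np.random.default_rng(0)
s = np.array([3, 7, 2, 9, 5])                     # illustrative counts s_j
t_exp = np.array([1.2, 2.5, 0.8, 3.0, 1.5])       # illustrative known t_j
lam_draws, beta_draws = gibbs_poisson_gamma(s, t_exp, alpha=2.0, gamma=1.0,
                                            delta=1.0, n_iter=20000, rng=rng)
```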
We first remove the burn-in samples. Suppose we determine the burn-in time \(B\), where we assume that the Markov chain has sufficiently converged for \(t > B\). (We assume that \(B < N\).) Then we discard observations \(\{\lambda^{(t)}, \beta^{(t)}; t \leq B\}\) and use the remaining sample \(\{\lambda^{(t)}, \beta^{(t)}; t = B+1, \ldots, N\}\) to estimate posterior summaries. Here \(\lambda^{(t)} = (\lambda_{1}^{(t)}, \ldots, \lambda_{p}^{(t)})\).
The posterior mean of \(\lambda_{j} \, | \, s\) can be estimated by the sample mean of \(\{\lambda_{j}^{(t)}, t = B+1, \ldots, N\}\) so that our estimate of \(E(\lambda_{j} \, | \, s)\) is \(\frac{1}{N-B} \sum_{t = B+1}^{N} \lambda_{j}^{(t)}\).
The posterior variance of \(\beta \, | \, s\) can be estimated by the sample variance of \(\{\beta^{(t)}, t = B+1, \ldots, N\}\) so that our estimate of \(Var(\beta \, | \, s)\) is \(\frac{1}{N-B} \sum_{t = B+1}^{N} (\beta^{(t)}-\bar{\beta})^{2}\) \(= (\frac{1}{N-B} \sum_{t = B+1}^{N} (\beta^{(t)})^{2}) - \bar{\beta}^{2}\) where \(\bar{\beta} = \frac{1}{N-B} \sum_{t = B+1}^{N} \beta^{(t)}\).
For estimating \(E(\lambda_{j} \, | \, a \leq \beta \leq b, s)\) we not only discard the burn-in observations but also those elements \(\lambda_{j}^{(t)}\) for which \(\beta^{(t)} < a\) or \(\beta^{(t)} > b\) for each \(t = B+1, \ldots, N\). So, letting \(\mathbb{I}_{[a \leq \beta^{(t)} \leq b]}\) denote the indicator function which is one if \(a \leq \beta^{(t)} \leq b\), the number of observations after the burn-in with \(a \leq \beta^{(t)} \leq b\) is \(M = \sum_{t = B+1}^{N} \mathbb{I}_{[a \leq \beta^{(t)} \leq b]}\) and our estimate of \(E(\lambda_{j} \, | \, a \leq \beta \leq b, s)\) is \(\frac{1}{M} \sum_{t=B+1}^{N} \lambda_{j}^{(t)}\mathbb{I}_{[a \leq \beta^{(t)} \leq b]}\).
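Continuing the sketch above, the three estimates can be computed from the stored arrays `lam_draws` and `beta_draws` as follows; the burn-in \(B\) and the interval endpoints \(a\), \(b\) are arbitrary illustrative values.

```python
# Posterior summaries from the sketch above; B, a and b are illustrative choices.
import numpy as np

B, a, b = 2000, 0.5, 1.5                 # burn-in length and interval endpoints for beta
lam_post = lam_draws[B:]                 # lambda draws after burn-in (one row per iteration)
beta_post = beta_draws[B:]               # beta draws after burn-in

lam_means = lam_post.mean(axis=0)        # estimates of E(lambda_j | s), j = 1, ..., p
beta_var = np.mean(beta_post ** 2) - np.mean(beta_post) ** 2   # estimate of Var(beta | s)

keep = (beta_post >= a) & (beta_post <= b)     # indicator of a <= beta^(t) <= b
lam_cond_means = lam_post[keep].mean(axis=0)   # estimates of E(lambda_j | a <= beta <= b, s)

print(lam_means, beta_var, lam_cond_means)
```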
Show that the Gibbs sampler for sampling from a distribution \(\pi(\theta)\) where \(\theta = (\theta_{1}, \ldots, \theta_{d})\) can be viewed as a special case of the Metropolis-Hastings algorithm where each iteration \(t\) consists of \(d\) Metropolis-Hastings steps each with an acceptance probability of 1.
For the Gibbs sampler, we update each \(\theta_{p}\) one at a time by obtaining \(\theta^{(t)}_{p}\) from the conditional distribution \(\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\). Thus, we move from \(\theta_{p}^{(t-1)}\) to \(\theta_{p}^{(t)}\) leaving the remaining \(d-1\) parameters fixed.
We now demonstrate that this update step in the Gibbs sampler is equivalent to a single Metropolis-Hastings step with acceptance probability 1. Let \(\theta^{(t-1)}_{[p]} = (\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t-1)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\) and \(\theta^{(t)}_{[p]} = (\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\). Consider the move from \(\theta^{(t-1)}_{[p]}\) to \(\theta^{(t)}_{[p]}\). As \(\theta_{p}^{(t)}\) is drawn from \(\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\) then we shall use this for the proposal distribution in the Metropolis-Hastings algorithm so that \[\begin{eqnarray} q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]}) & = & \pi(\theta_{p}^{(t)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}). \tag{1} \end{eqnarray}\] If, instead, we consider the move from \(\theta^{(t)}_{[p]}\) to \(\theta^{(t-1)}_{[p]}\) then, using Gibbs sampling, \(\theta^{(t-1)}_{[p]}\) is obtained by drawing \(\theta_{p}^{(t-1)}\) from \(\pi(\theta_{p} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\). Thus, we set \[\begin{eqnarray} q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]}) & = & \pi(\theta_{p}^{(t-1)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}). \tag{2} \end{eqnarray}\] Now, \[\begin{eqnarray} \pi(\theta^{(t-1)}_{[p]} ) \ = \ \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t-1)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) & = & \pi(\theta_{p}^{(t-1)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{3} \\ & = & q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]}) \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{4} \end{eqnarray}\] where (4) follows by substituting (2) into (3). In an identical fashion, \[\begin{eqnarray} \pi(\theta^{(t)}_{[p]}) \ = \ \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) & = & \pi(\theta_{p}^{(t)} \, | \, \theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)})\pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{5} \\ & = & q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]}) \pi(\theta_{1}^{(t)}, \ldots, \theta_{p-1}^{(t)}, \theta_{p+1}^{(t-1)}, \ldots, \theta_{d}^{(t-1)}) \tag{6} \end{eqnarray}\] where (6) follows by substituting (1) into (5). Dividing (6) by (4) gives \[\begin{eqnarray*} \frac{\pi(\theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})} & = & \frac{q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})}{q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})} \end{eqnarray*}\] so that \[\begin{eqnarray} \frac{\pi(\theta^{(t)}_{[p]})q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})} & = & 1. 
\tag{7} \end{eqnarray}\] Thus, a Metropolis-Hastings algorithm for sampling from \(\pi(\theta)\) for a proposed move from \(\theta^{(t-1)}_{[p]}\) to \(\theta^{(t)}_{[p]}\) using the proposal distribution given by (1) has an acceptance probability of \[\begin{eqnarray} \alpha(\theta^{(t-1)}_{[p]}, \theta^{(t)}_{[p]}) & = & \min \left(1, \frac{\pi(\theta^{(t)}_{[p]})q(\theta^{(t-1)}_{[p]} \, | \, \theta^{(t)}_{[p]})}{\pi(\theta^{(t-1)}_{[p]})q(\theta^{(t)}_{[p]} \, | \, \theta^{(t-1)}_{[p]})} \right) \tag{8} \\ & = & 1 \tag{9} \end{eqnarray}\] where (9) follows from (8) using (7).
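As a numerical sanity check of (7)–(9), the following sketch (Python, using numpy and scipy) computes the Metropolis-Hastings acceptance ratio for a single Gibbs update of the first coordinate of a bivariate normal target and confirms that it equals 1 up to rounding error; the target mean, covariance, starting point and seed are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Target pi(theta) = N2(mu, Sigma); Gibbs update of theta_1 given theta_2.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8], [0.8, 1.0]])
rng = np.random.default_rng(0)

theta_old = np.array([0.3, 0.7])
# Full conditional of theta_1 | theta_2 for a bivariate normal target
cond_mean = mu[0] + Sigma[0, 1] / Sigma[1, 1] * (theta_old[1] - mu[1])
cond_var = Sigma[0, 0] - Sigma[0, 1] ** 2 / Sigma[1, 1]

theta_new = theta_old.copy()
theta_new[0] = rng.normal(cond_mean, np.sqrt(cond_var))   # Gibbs draw used as the MH proposal

pi = multivariate_normal(mu, Sigma).pdf
q = norm(cond_mean, np.sqrt(cond_var)).pdf                # proposal density for theta_1
ratio = (pi(theta_new) * q(theta_old[0])) / (pi(theta_old) * q(theta_new[0]))
print(ratio)   # equals 1 up to floating-point error, so the move is always accepted
```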