Consider the integral \[\begin{eqnarray*} I & = & \int_{0}^{10} \exp(-2|x-5|) \, dx. \end{eqnarray*}\]
Note that \[\begin{eqnarray*} I \ = \ \int_{0}^{10} \exp(-2|x-5|) \, dx & = & \int_{0}^{10} 10\exp(-2|x-5|) \frac{1}{10} \, dx \\ & = & E_{f}(10\exp(-2|X-5|)). \end{eqnarray*}\] We can then estimate \(I\) using Monte Carlo integration as follows:
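Sample \(x_{1}, \ldots, x_{n}\) independently from the \(U(0, 10)\) distribution and estimate \(I\) by \[\begin{eqnarray} \hat{I} & = & \frac{1}{n} \sum_{i=1}^{n} 10\exp(-2|x_{i}-5|). \tag{1} \end{eqnarray}\]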
For importance sampling, we want to write \(I\) as an expectation calculated with respect to the \(N(5, 1)\) distribution which has support on \((-\infty, \infty)\). If we let \(\mathbb{I}_{\{0 \leq x \leq 10\}}\) be the indicator function that \(0 \leq x \leq 10\) then \[\begin{eqnarray*} I & = & \int_{0}^{10} \exp(-2|x-5|) \, dx \ = \ \int_{-\infty}^{\infty} \exp(-2|x-5|)\mathbb{I}_{\{0 \leq x \leq 10\}} \, dx \\ & = & \int_{-\infty}^{\infty} \frac{10\exp(-2|x-5|)\frac{1}{10}\mathbb{I}_{\{0 \leq x \leq 10\}}}{\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(x-5)^{2}\right\}}\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(x-5)^{2}\right\}\, dx \\ & = & \int_{-\infty}^{\infty} \frac{g(x)f(x)}{q(x)} q(x) \, dx \ = \ E_{q}\left(\frac{g(X)f(X)}{q(X)}\right) \end{eqnarray*}\] where \(g(x) = 10\exp(-2|x-5|)\), \(f(x) = \frac{1}{10}\mathbb{I}_{\{0 \leq x \leq 10\}}\) (the density of the \(U(0, 10)\) distribution) and \(q(x) = \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(x-5)^{2}\right\}\) (the density of the \(N(5, 1)\) distribution). We can then estimate \(I\) using importance sampling as follows:
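Sample \(x_{1}, \ldots, x_{n}\) independently from the \(N(5, 1)\) distribution and estimate \(I\) by \[\begin{eqnarray} \tilde{I} & = & \frac{1}{n} \sum_{i=1}^{n} \frac{g(x_{i})f(x_{i})}{q(x_{i})}. \tag{2} \end{eqnarray}\]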
We can write a function monte0 in R to estimate \(I\) by \(\hat{I}\), as given by equation (1).
# Calculates m replications of the Monte Carlo
# integration in part (a) with n = n1
monte0 <- function(n1, m){
  vect <- numeric(m)
  for (i in 1:m)
  {
    x <- runif(n1, 0, 10)
    vect[i] <- sum(10*exp(-2*abs(x - 5)))/n1
  }
  return(vect)
}
Similarly, the function import0 estimates \(I\) by \(\tilde{I}\), as given by equation (2). Notice that \(\mbox{dunif(x, 0, 10)} = \frac{1}{10}\mathbb{I}_{\{0 \leq x \leq 10\}}\).
# Calculates m replications of the importance
# sampling in part (b) with n = n1
import0 <- function(n1, m){
  vect <- numeric(m)
  for (i in 1:m)
  {
    x <- rnorm(n1, 5, 1)
    weight <- dunif(x, 0, 10)/dnorm(x, 5, 1)
    vect[i] <- sum(10*exp(-2*abs(x - 5))*weight)/n1
  }
  return(vect)
}
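As an illustration of how the entries of the table below may be reproduced (up to Monte Carlo error, since the sampled values differ between runs), the \(n = 100\) rows correspond to calls of the form
sample <- monte0(100, 1000)
mean(sample)
var(sample)
sample <- import0(100, 1000)
mean(sample)
var(sample)
with the first argument changed to 500, 1000 or 5000 for the remaining rows.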
We explore the two methods for \(n = 100, 500, 1000\), and \(5000\), in each case taking 1000 replications. The results are shown in the following table; note that the true value of \(I\) is \(1-\exp(-10) = 0.9999546\). \[\begin{eqnarray*} \begin{array}{|c|c|c|c|} \hline \mbox{Method} & n & \begin{array}{c}\mbox{Mean of } \\ \mbox{1000 samples}\end{array} & \begin{array}{c}\mbox{Variance of } \\ \mbox{1000 samples}\end{array} \\ \hline \mbox{Monte} & 100 & 1.013746 & 0.04300273 \\ \mbox{Import} & 100 & 0.9997384 & 0.003624815 \\ \mbox{Monte} & 500 & 0.999168 & 0.008210746 \\ \mbox{Import} & 500 & 1.00074 & 0.0007107777 \\ \mbox{Monte} & 1000 & 1.00279 & 0.004010403 \\ \mbox{Import} & 1000 & 1.000122 & 0.0003694998 \\ \mbox{Monte} & 5000 & 1.000884 & 0.000773445 \\ \mbox{Import} & 5000 & 0.9997536 & 0.00007040271 \\ \hline \end{array} \end{eqnarray*}\] We recall from lectures that both methods give unbiased estimates of \(I\), so we compare their performance by considering the respective variances. From the table, we observe that the sample variance for the Monte Carlo integration is about ten times that of the importance sampling method.
We thus see that the importance sampling method is preferable to the Monte Carlo integration. The following figure demonstrates why this is the case.
The function \(g(x) = 10\exp(-2|x-5|)\) has a peak at \(x = 5\) and decays quickly elsewhere in the interval \([0, 10]\), so that when we sample from \(U(0, 10)\) many of the sampled values fall where \(g(x)\) is very close to zero. We are likely to increase the precision of our estimate if we sample from a distribution which attaches more weight, or importance, to the region where \(g(x)\) is not close to zero. As the figure shows, the \(N(5, 1)\) density does this, leading to the reduced variance in the importance sampling scheme.
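A minimal sketch of how such a figure might be produced in R (the styling choices here are illustrative and not taken from the original):
# Compare the integrand exp(-2|x - 5|) with the N(5, 1) density on [0, 10]
curve(exp(-2*abs(x - 5)), from = 0, to = 10, xlab = "x", ylab = "")
curve(dnorm(x, 5, 1), from = 0, to = 10, add = TRUE, lty = 2)
legend("topright", legend = c("exp(-2|x - 5|)", "N(5, 1) density"), lty = c(1, 2))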
Suppose that \(f(\theta \, | \, x) = cg(\theta)\) where \(g(\theta) \propto f(x \, | \, \theta)f(\theta)\) and \(c\) is the normalising constant. For a function \(h(\theta)\) we wish to find \(E(h(\theta) \, | \, X) = E_{f}(h(\theta))\) where \[\begin{eqnarray*} E_{f}(h(\theta)) & = & \int_{-\infty}^{\infty} h(\theta)f(\theta \, | \, x) \, d\theta. \end{eqnarray*}\] Suppose that we sample \(N\) values \(\theta_{1}, \ldots, \theta_{N}\) independently from a distribution with probability density function \(q(\theta)\) (you may assume that the support of \(q(\theta)\) is the same as that of \(f(\theta \, | \, x)\)).
Note that, since \(f(\theta \, | \, x) = cg(\theta)\), \[\begin{eqnarray*} E_{f}(h(\theta)) & = & c\int_{-\infty}^{\infty} h(\theta)g(\theta) \, d\theta \\ & = & c\int_{-\infty}^{\infty} \frac{h(\theta)g(\theta)}{q(\theta)}q(\theta) \, d\theta \ = \ c E_{q}\left(\frac{h(\theta)g(\theta)}{q(\theta)}\right). \end{eqnarray*}\] We compute \(\frac{h(\theta_{i})g(\theta_{i})}{q(\theta_{i})}\) for each \(i = 1, \ldots, N\) and estimate \(E_{q}\left(\frac{h(\theta)g(\theta)}{q(\theta)}\right)\) by \[\begin{eqnarray} \hat{I} & = & \frac{1}{N} \sum_{i=1}^{N} \frac{h(\theta_{i})g(\theta_{i})}{q(\theta_{i})} \tag{3} \end{eqnarray}\] and, if \(c\) is known, estimate \(E_{f}(h(\theta))\) by \(c\hat{I}\).
We have that \[\begin{eqnarray*} 1 \ = \ \int_{-\infty}^{\infty} f(\theta \, | \, x) \, d\theta \ = \ \int_{-\infty}^{\infty} cg(\theta) \, d\theta \end{eqnarray*}\] so that \[\begin{eqnarray*} c^{-1} & = & \int_{-\infty}^{\infty} g(\theta) \, d\theta \\ & = & \int_{-\infty}^{\infty} \frac{g(\theta)}{q(\theta)} q(\theta) \, d\theta \ = \ E_{q}\left(\frac{g(\theta)}{q(\theta)}\right). \end{eqnarray*}\] Thus, we compute \(\frac{g(\theta_{i})}{q(\theta_{i})}\) for each \(i = 1, \ldots, N\) and estimate \(c^{-1} = E_{q}\left(\frac{g(\theta)}{q(\theta)}\right)\) by \[\begin{eqnarray} \tilde{I} & = & \frac{1}{N} \sum_{i=1}^{N} \frac{g(\theta_{i})}{q(\theta_{i})}. \tag{4} \end{eqnarray}\]
By the Law of Large Numbers, \(\hat{I}\), as given by equation (3), almost surely converges to \(E_{q}\left(\frac{h(\theta)g(\theta)}{q(\theta)}\right)\) whilst \(\tilde{I}\), as given by equation (4), almost surely converges to \(c^{-1}\) so that a sensible estimate of \(E_{f}(h(\theta))\) is \[\begin{eqnarray*} \check{I} \ = \ \frac{\hat{I}}{\tilde{I}} & = & \frac{\sum_{i=1}^{N} h(\theta_{i})w_{i}}{\sum_{i=1}^{N} w_{i}} \end{eqnarray*}\] where \(w_{i} = g(\theta_{i})/q(\theta_{i})\). Estimating \(E_{f}(h(\theta))\) by \(\check{I}\) is often called self-normalised importance sampling and provides a method of estimating posterior expectations without needing to know the normalising constant \(c\). It can be shown that (provided that \(q(\theta) > 0\) whenever \(g(\theta) > 0\)) \(\check{I}\) converges to \(E_{f}(h(\theta))\).
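As an illustration (this example is not part of the original question), here is a minimal sketch of self-normalised importance sampling in R, assuming the model \(x \, | \, \theta \sim N(\theta, 1)\) with a Cauchy\((0, 1)\) prior, so that \(g(\theta) = f(x \, | \, \theta)f(\theta)\) is known only up to the normalising constant \(c\), and taking \(q(\theta)\) to be the \(N(x, 1)\) density:
# Self-normalised importance sampling estimate of E(theta | x)
# Assumed example: x | theta ~ N(theta, 1), theta ~ Cauchy(0, 1), q = N(x, 1)
selfnorm <- function(x, N){
  theta <- rnorm(N, x, 1)
  w <- dnorm(x, theta, 1)*dcauchy(theta)/dnorm(theta, x, 1)  # w_i = g(theta_i)/q(theta_i)
  sum(theta*w)/sum(w)
}
selfnorm(2, 10000)  # estimates the posterior mean of theta when x = 2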
Let \(X \sim N(0, 1)\).
Note that \[\begin{eqnarray*} P(X > a) & = & \int_{a}^{\infty} f(x) \, dx \\ & = & \int_{-\infty}^{\infty} \mathbb{I}_{\{x > a\}}f(x) \, dx \\ & = & E_{f}(\mathbb{I}_{\{X > a\}}) \end{eqnarray*}\] where \(\mathbb{I}_{\{X > a\}}\) is the indicator function of the event \(\{X > a\}\) and \(E_{f}(\cdot)\) denotes expectation with respect to the distribution \(f\). We thus may use Monte Carlo integration to estimate \(P(X > a)\) as follows:
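Sample \(x_{1}, \ldots, x_{n}\) independently from the \(N(0, 1)\) distribution and estimate \(P(X > a)\) by \(\frac{1}{n}\sum_{i=1}^{n} \mathbb{I}_{\{x_{i} > a\}}\), the proportion of the sampled values which exceed \(a\).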
We can write a function monte in R to estimate \(P(X > a)\).
# Calculates m replications of the Monte Carlo
# integration in part (a) with a = y and n = n1
monte <- function(y, n1, m){
  vect <- numeric(m)
  for (i in 1:m)
  {
    x <- rnorm(n1)
    total <- length(x[x > y])
    vect[i] <- total/n1
  }
  return(vect)
}
We consider sampling \(n =1000\) observations to estimate \(P(X > a)\) for \(a = 0, 1.96, 3\) and \(4.5\). We’ll repeat each experiment 1000 times to explore the sampling properties of the estimate and also calculate the number of samples which contained observations greater than \(a\). For example, for \(a = 4.5\) we compute
> sample<-monte(4.5,1000,1000)
> mean(sample)
[1] 5e-06
> var(sample)
[1] 4.97998e-09
> length(sample[sample > 0])
[1] 5
We summarise our results. \[\begin{eqnarray*} \begin{array}{|c|c|c|c|c|} \hline a & P(X > a) & \begin{array}{c}\mbox{Mean of } \\ \mbox{1000 samples}\end{array} & \begin{array}{c}\mbox{Variance of } \\ \mbox{1000 samples}\end{array} & \begin{array}{c}\mbox{No of non-} \\ \mbox{zero samples}\end{array} \\ \hline 0 & 0.5 & 0.499914 & 0.000245 & 1000 \\ 1.96 & 0.0249979 & 0.025027 & 2.437865 \times 10^{-5} & 1000 \\ 3 & 0.001349898 & 0.001395 & 1.394369 \times 10^{-6} & 754 \\ 4.5 & 3.397673 \times 10^{-6} & 5 \times 10^{-6} & 4.97998 \times 10^{-9} & 5 \\ \hline \end{array} \end{eqnarray*}\] As \(a\) increases, \(P(X > a)\) decreases and we are interested in estimating the probability of an increasingly rare event. If we let \(t = \sum_{i=1}^{n} \mathbb{I}_{\{x_{i} > a\}}\) then \(t\) denotes the number of the \(n\) observations that were greater than \(a\), and our estimate of \(P(X > a)\) is \(t/n\). For large \(a\), we will need to take \(n\) to be extremely large to even observe a non-zero \(t\), let alone obtain a stable estimate. For example, in our 1000 samples with \(n = 1000\) and \(a = 4.5\), only five of the samples contained an observation greater than 4.5. An alternative approach to estimating \(P(X > a)\), which does not require \(n\) to be extremely large, is to use importance sampling, where we sample from a distribution that puts more mass on the event \(\{X > a\}\). Two possible approaches are considered in parts (c) and (d).
Let \(q(x)\) denote the probability density function of the \(N(\mu, 1)\). Then, as \(q(x) > 0\) for all \(x \in (-\infty, \infty)\), we have \[\begin{eqnarray*} E_{f}(\mathbb{I}_{\{X > a\}}) \ = \ \int_{-\infty}^{\infty} \mathbb{I}_{\{x > a\}}f(x) \, dx & = & \int_{-\infty}^{\infty} \left(\frac{\mathbb{I}_{\{x > a\}}f(x)}{q(x)}\right)q(x) \, dx \\ & =& E_{q}\left(\frac{\mathbb{I}_{\{X > a\}}f(X)}{q(X)}\right) \end{eqnarray*}\] where \[\begin{eqnarray*} \frac{f(X)}{q(X)} \ = \ \frac{\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}X^{2}\right\}}{\frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}(X-\mu)^{2}\right\}} \ = \ \exp\left\{\frac{\mu^{2}}{2}- \mu X\right\}. \end{eqnarray*}\] Thus, \[\begin{eqnarray*} E_{f}(\mathbb{I}_{\{X > a\}}) & = & E_{q}\left(\mathbb{I}_{\{X > a\}}\exp\left\{\frac{\mu^{2}}{2}- \mu X\right\}\right). \end{eqnarray*}\] We thus use importance sampling to estimate \(P(X > a) = E_{f}(\mathbb{I}_{\{X > a\}})\) as follows:
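Sample \(x_{1}, \ldots, x_{n}\) independently from the \(N(\mu, 1)\) distribution and estimate \(P(X > a)\) by \[\begin{eqnarray*} \frac{1}{n} \sum_{i=1}^{n} \mathbb{I}_{\{x_{i} > a\}}\exp\left\{\frac{\mu^{2}}{2} - \mu x_{i}\right\}. \end{eqnarray*}\]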
We write a function import in R to implement this algorithm.
# Calculates m replications of the importance sampling
# in part (c) with a = y1, mu = y2 and n = n1
import <- function(y1, y2, n1, m)
{
  vect <- numeric(m)
  for (i in 1:m)
  {
    x <- rnorm(n1, y2)
    weight <- dnorm(x)/dnorm(x, y2)
    vect[i] <- sum(weight[x > y1])/n1
  }
  return(vect)
}
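For example, the \(n = 100\) importance sampling row of the first table below corresponds (up to Monte Carlo error) to calls of the form
sample <- import(3, 4, 100, 1000)
mean(sample)
var(sample)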
Suppose we consider \(a = 3\) and \(\mu = 4\) so that \(P(X > 3) = 0.001349898\). We compare the importance sampling method with direct Monte Carlo integration for various choices of \(n\). \[\begin{eqnarray*} \begin{array}{|c|c|c|c|c|} \hline \mbox{Method} & n & \begin{array}{c}\mbox{Mean of } \\ \mbox{1000 samples}\end{array} & \begin{array}{c}\mbox{Variance of } \\ \mbox{1000 samples}\end{array} & \begin{array}{c}\mbox{No of non-} \\ \mbox{zero samples}\end{array} \\ \hline \mbox{Monte} & 100 & 0.00119 & 1.16956 \times 10^{-5} & 113 \\ \mbox{Import} & 100 & 0.001349189 & 9.391652 \times 10^{-8} & 1000 \\ \mbox{Monte} & 500 & 0.001232 & 2.308484 \times 10^{-6} & 472 \\ \mbox{Import} & 500 & 0.001349657 & 1.949415 \times 10^{-8} & 1000 \\ \mbox{Monte} & 1000 & 0.001352 & 1.373469 \times 10^{-6} & 733 \\ \mbox{Import} & 1000 & 0.001348711 & 9.348038 \times 10^{-9} & 1000 \\ \hline \end{array} \end{eqnarray*}\] We observe that, even for \(n = 100\) observations, the importance sampling method performs well and the variance of the estimate is much reduced when compared to direct Monte Carlo integration. We repeat the process with \(a = 4.5\) and \(\mu = 4.5\) so that \(P(X > 4.5) = 3.397673 \times 10^{-6}\). \[\begin{eqnarray*} \begin{array}{|c|c|c|c|c|} \hline \mbox{Method} & n & \begin{array}{c}\mbox{Mean of } \\ \mbox{1000 samples}\end{array} & \begin{array}{c}\mbox{Variance of } \\ \mbox{1000 samples}\end{array} & \begin{array}{c}\mbox{No of non-} \\ \mbox{zero samples}\end{array} \\ \hline \mbox{Monte} & 100 & & & 0 \\ \mbox{Import} & 100 & 3.42554 \times 10^{-6} & 6.096552 \times 10^{-13} & 1000 \\ \mbox{Monte} & 500 & & & 0 \\ \mbox{Import} & 500 & 3.395555 \times 10^{-6} & 1.202796 \times 10^{-13} & 1000 \\ \mbox{Monte} & 1000 & 2 \times 10^{-6} & 1.997998 \times 10^{-9} & 2 \\ \mbox{Import} & 1000 & 3.397125 \times 10^{-6} & 5.644111 \times 10^{-14} & 1000 \\ \hline \end{array} \end{eqnarray*}\] Notice that even for values of \(n\) at which Monte Carlo fails (no non-zero estimates were observed, so the corresponding cells are left blank), the importance sampling method works well. When Monte Carlo can be used, the variance of the importance sampling estimate is vastly reduced for the same \(n\).
In this part, the idea is to remove the need for the indicator function when estimating \(P(X > a)\) by sampling from a distribution, \(q(x)\), whose support is \((a, \infty)\), so that \[\begin{eqnarray*} P(X > a) \ = \ \int_{a}^{\infty} f(x) \, dx \ = \ \int_{a}^{\infty} \frac{f(x)}{q(x)} q(x) \, dx \ = \ E_{q}\left(\frac{f(X)}{q(X)}\right). \end{eqnarray*}\] For \(a = 4.5\), taking \(q(x) = \exp\{-(x-4.5)\}\) for \(x > 4.5\) (the \(Exp(1)\) density shifted to have support \((4.5, \infty)\)), we have that, for \(X > 4.5\), \[\begin{eqnarray*} \frac{f(X)}{q(X)} \ = \ \frac{\frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2} X^{2}\right\}}{\exp\{-(X-4.5)\}} \ = \ \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2} X^{2} + X - 4.5\right\} \end{eqnarray*}\] so that our importance sampling method is as follows:
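Sample \(e_{1}, \ldots, e_{n}\) independently from the \(Exp(1)\) distribution and set \(x_{i} = 4.5 + e_{i}\) so that each \(x_{i}\) is a draw from \(q(x)\). Estimate \(P(X > 4.5)\) by \[\begin{eqnarray*} \hat{I} & = & \frac{1}{n} \sum_{i=1}^{n} \frac{f(x_{i})}{q(x_{i})} \ = \ \frac{1}{n} \sum_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\exp\left\{-\frac{1}{2}x_{i}^{2} + x_{i} - 4.5\right\}. \end{eqnarray*}\]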
We write a function import1 in R to implement this algorithm.
This example is inspired by Example 3.5 of Robert, C. P. and Casella G. (2010). Introducing Monte Carlo Methods with R, Springer, New York.
# Calculates the importance sampling in part (d) for
# P(X > y) and q(x) the Exp(1) truncated at y
# Plots the cumulative estimate of P(X > y) with n = n1
import1 <- function(y, n1){
  x <- rexp(n1) + y
  weight <- dnorm(x)/dexp(x - y)
  plot(cumsum(weight)/(1:n1), type = "l",
       ylab = "Cumulative sum of estimate", xlab = "n")
  abline(a = pnorm(-y), b = 0, col = "red")
}
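For example, a plot of this kind can be produced by a call such as import1(4.5, 10000); the choice n1 = 10000 here is illustrative rather than the value used for the following figure.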
To demonstrate the convergence of the estimate, in the following figure we plot the estimate \(\hat{I}\) computed from the first \(n\) sampled values against \(n\).
The above figure shows that the estimate is converging to the true value and that this happens quickly, certainly when compared with the direct Monte Carlo integration approach of part (a) whose limitations were noted in part (b).
Let \(E_{f}(g(X))\) denote the expectation of \(g(X)\) with respect to the probability density function \(f(x)\) so that \[\begin{eqnarray*} E_{f}(g(X)) & = & \int_{X} g(x)f(x) \, dx. \end{eqnarray*}\] Suppose that we sample \(N\) independent observations from a distribution \(q(x)\) and use the method of importance sampling to estimate \(E_{f}(g(X))\). Our estimator is thus \[\begin{eqnarray*} \hat{I} & = & \frac{1}{N} \sum_{i=1}^{N} \frac{g(X_{i})f(X_{i})}{q(X_{i})}. \end{eqnarray*}\] Prove that the choice of \(q(x)\) which minimises \(Var_{q}(\hat{I})\), the variance of \(\hat{I}\) calculated with respect to \(q(x)\), is \[\begin{eqnarray*} q^{*}(x) & = & \frac{|g(x)|f(x)}{\int_{X} |g(z)|f(z) \, dz}. \end{eqnarray*}\]
Let \(Var_{q}(\cdot)\) denote variance calculated with respect to \(q(x)\). Then, as \(X_{1}, \ldots, X_{N}\) are independent observations from \(q(x)\), \[\begin{eqnarray*} Var_{q}(\hat{I}) \ = \ \frac{1}{N^{2}} \sum_{i=1}^{N} Var_{q}\left(\frac{g(X_{i})f(X_{i})}{q(X_{i})}\right) \ = \ \frac{1}{N} Var_{q}\left(\frac{g(X)f(X)}{q(X)}\right). \end{eqnarray*}\] Hence, the choice of \(q(x)\) which minimises the variance of \(\hat{I}\) is equivalent to that which minimises \[\begin{eqnarray*} Var_{q}\left(\frac{g(X)f(X)}{q(X)}\right) & = & E_{q}\left(\frac{g^{2}(X)f^{2}(X)}{q^{2}(X)}\right) - E_{q}^{2}\left(\frac{g(X)f(X)}{q(X)}\right) \nonumber \\ & = & E_{q}\left(\frac{g^{2}(X)f^{2}(X)}{q^{2}(X)}\right) - E_{f}^{2}\left(g(X)\right). \nonumber \end{eqnarray*}\] Thus, as \(E_{f}^{2}\left(g(X)\right)\) does not depend upon the choice of \(q(x)\), then we need only consider finding a \(q(x)\) minimising \(E_{q}\left(\frac{g^{2}(X)f^{2}(X)}{q^{2}(X)}\right)\). Now, for any \(q(x)\), as \(Var_{q}\left(\frac{|g(X)|f(X)}{q(X)}\right) \geq 0\), \[\begin{eqnarray} E_{q}\left(\frac{g^{2}(X)f^{2}(X)}{q^{2}(X)}\right) & \geq & E_{q}^{2}\left(\frac{|g(X)|f(X)}{q(X)}\right) \ = \ \left(\int_{X} |g(x)|f(x) \, dx\right)^{2}. \tag{5} \end{eqnarray}\] Notice that the lower bound does not depend upon the choice of \(q(x)\). Thus, if there exists a \(q(x)\) which achieves this minimum bound then it must minimise the variance of \(\hat{I}\).
If \(q^{*}(x) = \frac{|g(x)|f(x)}{\int_{X} |g(z)|f(z) \, dz}\) then \[\begin{eqnarray*} \frac{|g(X)|f(X)}{q^{*}(X)} & = & \int_{X} |g(z)|f(z) \, dz \end{eqnarray*}\] so that, as \(\int_{X} |g(z)|f(z) \, dz\) is a constant, \[\begin{eqnarray*} E_{q^{*}}\left(\frac{g^{2}(X)f^{2}(X)}{(q^{*}(X))^{2}}\right) \ = \ E_{q^{*}}\left( \left(\int_{X} |g(z)|f(z) \, dz \right)^{2}\right) \ = \ \left(\int_{X} |g(z)|f(z) \, dz\right)^{2} \end{eqnarray*}\] and the lower bound in (5) is achieved, showing that \(q^{*}(x)\) minimises the variance of \(\hat{I}\).
Notice that this is really a theoretical rather than a practical result. For example, if \(g(x) > 0\) then \(q^{*}(x) = (g(x)f(x))/E_{f}(g(X))\) and so requires \(E_{f}(g(X))\), which is the very integral we are trying to evaluate. Similarly, in the earlier problem of estimating \(P(X > a)\) we have \(g(x) = \mathbb{I}_{\{x > a\}}\), so that \(q^{*}(x) = f(x)\mathbb{I}_{\{x > a\}}/P(X > a)\), the \(N(0, 1)\) density truncated to \((a, \infty)\); this motivates sampling from a distribution supported only on \((a, \infty)\), as in part (d) of that problem.