2 November 2016

Goals

  • We have a target probability distribution \[\theta \sim \pi(\theta), \, \theta \in \Theta\]

  • We want some summaries of this distribution \[ E_{\pi}g(\theta) = \int_{\Theta} g(\theta) \pi(\theta) d\theta \]

  • In many cases \(\pi(\theta)\) is the posterior distribution from some Bayesian model: \[ \pi(\theta) = p(\theta|y) \propto p(y|\theta) p(\theta)\]

Monte Carlo (according to a statistician)

  • If we can generate independent samples
    \[\theta_i \sim_{iid} \pi, \, i = 1, \dots, n\]
  • Then under fairly general conditions (essentially, \(\sigma^2_g = \mathrm{Var}_{\pi}\, g(\theta) < \infty\)) \[\sqrt n \left(\frac{1}{n} \sum_{i=1}^n g(\theta_i) - E_{\pi}g(\theta) \right) \rightarrow_d N(0, \sigma^2_g) \]

  • We can get a similar result even if we can only sample \(\theta_i\) via a Markov chain with stationary distribution \(\pi(\theta)\)
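
A minimal sketch of the plain Monte Carlo estimate, with an illustrative target and summary of my own choosing (\(\pi = N(0,1)\), \(g(\theta) = \theta^2\), so \(E_{\pi}g(\theta) = 1\)):

```python
import numpy as np

# Illustrative choices (not from the slides): pi = N(0, 1), g(theta) = theta^2,
# so the true value of E_pi g(theta) is 1.
rng = np.random.default_rng(0)
n = 100_000
theta = rng.standard_normal(n)            # theta_i ~iid pi
g = theta ** 2

estimate = g.mean()                        # (1/n) sum_i g(theta_i)
mc_error = g.std(ddof=1) / np.sqrt(n)      # CLT standard error, sigma_g / sqrt(n)
print(f"estimate = {estimate:.3f} +/- {mc_error:.3f}")
```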

Markov chains

A sequence \(\theta_1, \theta_2,\dots\) of random elements of \(\Theta\) is a Markov chain if \[p(\theta_{t+1}| \theta_1,\dots, \theta_t) =p(\theta_{t+1}| \theta_t)\]

  • \(p(\theta_{t+1}= y| \theta_t = x) = P_{xy}\) (finite state space) or \(p(\theta_{t+1}\in B| \theta_t = x)\) (general state space) is called the transition distribution/kernel
  • Assume \(p(\theta_{t+1}| \theta_t)\) is independent of \(t\): "stationary transition probabilities" or "time-independent transition probabilities"
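
A minimal sketch of a finite-state chain: simulate from a transition matrix \(P\) (an illustrative 3-state example of my own) and compare the empirical distribution of visited states with the stationary distribution:

```python
import numpy as np

# Illustrative 3-state transition matrix: P[x, y] = p(theta_{t+1} = y | theta_t = x)
P = np.array([[0.5, 0.4, 0.1],
              [0.2, 0.6, 0.2],
              [0.1, 0.4, 0.5]])

rng = np.random.default_rng(0)
n_steps = 100_000
state = 0
visits = np.zeros(3)
for _ in range(n_steps):
    state = rng.choice(3, p=P[state])   # draw the next state from row `state` of P
    visits[state] += 1

print("empirical distribution :", visits / n_steps)

# Stationary distribution: the left eigenvector of P with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmax(np.real(eigvals))])
print("stationary distribution:", pi / pi.sum())
```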

Some desirable properties

Want to construct a Markov chain with stationary distribution \(\pi\): \[p(\theta_t) = \pi(\theta_t) \quad \text{for all } t\]

  • Detailed balance (checked numerically in the sketch after this list) \[\pi(x) P_{xy} = \pi(y) P_{yx}, \, \forall x,y \in \Theta\]

  • Irreducibility: with probability > 0, can get from any state to any other state in a finite number of moves

  • Aperiodicity: return times to a particular state are not restricted to multiples of some fixed period
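
A minimal numerical check, on an illustrative 2-state chain of my own choosing, that detailed balance implies \(\pi\) is stationary (\(\pi P = \pi\)):

```python
import numpy as np

# Illustrative 2-state example (not from the slides)
pi = np.array([0.25, 0.75])
P = np.array([[0.7, 0.3],
              [0.1, 0.9]])   # P[x, y] = p(theta_{t+1} = y | theta_t = x)

# Detailed balance: pi(x) P[x, y] == pi(y) P[y, x] for all x, y
flow = pi[:, None] * P
print("detailed balance holds:", np.allclose(flow, flow.T))

# Summing pi(x) P[x, y] over x then gives pi(y), i.e. pi is stationary
print("pi P == pi            :", np.allclose(pi @ P, pi))
```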

Example

(Gaussian) Random Walk Metropolis-Hastings

\[\theta^\star \sim q(\cdot \,|\, \theta_t) = N(\theta_t, s)\]

Set \(\theta_{t+1} = \theta^\star\) ('accept proposed move') with probability \[\min\left\{ 1 , \frac{\pi(\theta^\star) q(\theta_t| \theta^\star)}{\pi(\theta_t) q(\theta^\star|\theta_t)}\right\}\] Otherwise \(\theta_{t+1} = \theta_t\)
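
A minimal sketch of this sampler, assuming the target can be evaluated only up to a constant and that s denotes the proposal standard deviation (function and argument names are my own):

```python
import numpy as np

def rw_metropolis(log_target, theta0, n_samples, s=1.0, seed=0):
    """Gaussian random-walk Metropolis-Hastings for a 1-D target.

    `log_target` is log pi up to an additive constant. Because the Gaussian
    proposal is symmetric, q(theta_t | theta*) = q(theta* | theta_t) and the
    acceptance probability reduces to min(1, pi(theta*) / pi(theta_t)).
    """
    rng = np.random.default_rng(seed)
    theta = float(theta0)
    samples = np.empty(n_samples)
    accepted = 0
    for t in range(n_samples):
        theta_star = theta + s * rng.standard_normal()   # proposal ~ N(theta_t, s^2)
        log_alpha = log_target(theta_star) - log_target(theta)
        if np.log(rng.uniform()) < log_alpha:            # accept with prob min(1, alpha)
            theta = theta_star
            accepted += 1
        samples[t] = theta
    return samples, accepted / n_samples
```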

Simple 1-D model

\[Y_i \sim N(\theta, 10), \quad \text{prior}(\theta) = N(0,100)\] \[\min\left\{ 1 , \frac{\pi(\theta^\star) {\color{green}q(\theta_t| \theta^\star)}}{\pi(\theta_t) {\color{red} q(\theta^\star|\theta_t)}}\right\}\]
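
Assuming 10 and 100 denote variances, the model is conjugate, so the exact posterior is available as a check on the MCMC output: \[\theta \,|\, y_{1:n} \sim N\!\left(\frac{\sum_{i=1}^n y_i/10}{\,n/10 + 1/100\,},\; \frac{1}{\,n/10 + 1/100\,}\right)\]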

Convergence 1-D example

RW-MH with s = 0.1 and s = 1:
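
A minimal comparison sketch, reusing the rw_metropolis function above on the simple 1-D model (treating 10 and 100 as variances, with simulated data of my own):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=2.0, scale=np.sqrt(10), size=20)   # simulated data, true theta = 2

def log_post(theta):
    # log p(theta | y) up to a constant: N(theta, 10) likelihood, N(0, 100) prior
    return -0.5 * np.sum((y - theta) ** 2) / 10 - 0.5 * theta ** 2 / 100

for s in (0.1, 1.0):
    chain, acc_rate = rw_metropolis(log_post, theta0=10.0, n_samples=5000, s=s)
    print(f"s = {s}: acceptance rate = {acc_rate:.2f}, "
          f"posterior mean estimate = {chain[1000:].mean():.2f}")

# Small s: nearly every proposal is accepted, but the steps are tiny, so the chain
# creeps away from theta_1 = 10 and is highly autocorrelated.
# Larger s: fewer acceptances, but the chain explores the posterior much faster.
```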

Convergence

  • Want to quickly reduce the influence of our initial state \(\theta_1\)

  • Reduce auto-correlation and move around \(\Theta\) efficiently

  • Mixing: time until multiple chains with different initial states look statistically indistinguishable (i.e., all have converged to the stationary distribution)

  • Burn-in: early iterations that are discarded because of obvious dependence on \(\theta_1\)
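
A minimal diagnostic sketch (illustrative, not from the slides): discard a burn-in period and look at the lag-\(k\) autocorrelation of the remaining draws:

```python
import numpy as np

def autocorrelation(chain, max_lag=20):
    """Sample autocorrelation of a 1-D chain at lags 0..max_lag."""
    x = np.asarray(chain, dtype=float)
    x = x - x.mean()
    var = np.mean(x ** 2)
    return np.array([np.mean(x[:len(x) - k] * x[k:]) / var
                     for k in range(max_lag + 1)])

# Example usage with a chain from the previous sketch:
# burn_in = 1000
# kept = chain[burn_in:]                     # drop iterations still tied to theta_1
# print(autocorrelation(kept, max_lag=10))   # slow decay => poor mixing
```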

Problems with RW-MH

  • Need to balance the size of jumps (controlled by s) against achieving a good acceptance rate (the average probability of moving to \(\theta^\star\))

  • Difficult to accept 'long distance' proposals if \(\Theta\) is high dimensional
  • Good to use a proposal that adapts to the local features of \(\pi\) in these settings, e.g., Hamiltonian Monte Carlo
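
A minimal Hamiltonian Monte Carlo sketch with a leapfrog integrator and identity mass matrix (the function names, step size, and trajectory length are illustrative choices, not from the slides):

```python
import numpy as np

def hmc(log_target, grad_log_target, theta0, n_samples, eps=0.1, n_leapfrog=20, seed=0):
    """Basic HMC: propose by simulating Hamiltonian dynamics, then Metropolis-correct."""
    rng = np.random.default_rng(seed)
    theta = np.atleast_1d(np.asarray(theta0, dtype=float)).copy()
    samples = np.empty((n_samples, theta.size))
    accepted = 0
    for t in range(n_samples):
        p = rng.standard_normal(theta.size)          # momentum ~ N(0, I)
        theta_new, p_new = theta.copy(), p.copy()
        # Leapfrog integration of the Hamiltonian dynamics
        p_new += 0.5 * eps * grad_log_target(theta_new)
        for _ in range(n_leapfrog - 1):
            theta_new += eps * p_new
            p_new += eps * grad_log_target(theta_new)
        theta_new += eps * p_new
        p_new += 0.5 * eps * grad_log_target(theta_new)
        # Accept/reject using the Hamiltonian H = -log pi(theta) + 0.5 * |p|^2
        h_current = -log_target(theta) + 0.5 * p @ p
        h_proposed = -log_target(theta_new) + 0.5 * p_new @ p_new
        if np.log(rng.uniform()) < h_current - h_proposed:
            theta = theta_new
            accepted += 1
        samples[t] = theta
    return samples, accepted / n_samples

# Example usage on a 2-D standard normal target (illustrative):
# draws, rate = hmc(lambda th: -0.5 * th @ th, lambda th: -th,
#                   theta0=np.zeros(2), n_samples=2000, eps=0.2, n_leapfrog=15)
```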

Reparameterisations/Transformations?

  • McMC is not an intrinsic method: it operates on a particular parametrisation of the problem

  • The target is \(\pi(\theta)\); in many cases \(\theta\) is only a label for another object (e.g. a probability distribution)

  • We can work with any invertible \(\eta=T(\theta)\), obtain samples of \(\eta\) (from the transformed target, which includes a Jacobian term), and then "go back" via \(\theta=T^{-1}(\eta)\) (see the sketch after this list)

  • McMC by itself is invariant to reparametrisation, but practical efficiency (convergence, proposal choice) might depend on the parametrisation (e.g. centering in hierarchical models)

  • Reparametrisation is part of the "repertoire" of McMC in statistics
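
A minimal sketch of such a transformation, for a hypothetical positive parameter sampled on the log scale (the Gamma target is an arbitrary illustrative choice, and rw_metropolis refers to the sketch above):

```python
import numpy as np

# Work with eta = log(theta) so the chain moves on the whole real line.
# The transformed target needs the Jacobian |d theta / d eta| = exp(eta):
#   log pi_eta(eta) = log pi_theta(exp(eta)) + eta

def log_target_theta(theta):
    # Unnormalised Gamma(shape=2, rate=1) log-density, defined for theta > 0
    return np.log(theta) - theta

def log_target_eta(eta):
    theta = np.exp(eta)
    return log_target_theta(theta) + eta   # "+ eta" is the log-Jacobian term

# Run random-walk MH on eta (starting at eta = 0, i.e. theta = 1),
# then map the samples back to theta:
# etas, _ = rw_metropolis(log_target_eta, theta0=0.0, n_samples=5000, s=1.0)
# thetas = np.exp(etas)
```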

Geometry

  • If \(\theta\) labels other objects, then samples of \(\theta\) are samples of those objects, not just samples of numbers

  • Step behaviour should not be measured only by length and direction in terms of \(\theta\)

  • Beneficial to measure step behaviour via some geometry of the objects

  • HMC is an example in this direction

  • Clever reparametrisations are another option; geometry might suggest good parametrisations (e.g. geodesics)

  • Example: the distance between N(0,1) and N(1,1) is larger than that between N(0,2) and N(1,2) (hyperbolic geometry)
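
One way to make this precise (my assumption about the intended geometry) is the Fisher information metric on the normal family \(N(\mu, \sigma^2)\), \[ds^2 = \frac{d\mu^2 + 2\,d\sigma^2}{\sigma^2},\] which is, up to a rescaling of \(\sigma\), the Poincaré half-plane (hyperbolic) metric. A move of a given size in \(\mu\) has length proportional to \(1/\sigma\), so it is "longer" when \(\sigma\) is small, matching the example above.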