Introduction

Consider a problem where we wish to make inferences about a parameter \(\theta\) given data \(x\). In a classical setting the data is treated as if it were random, even after it has been observed, and the parameter is viewed as a fixed unknown constant. Consequently, no probability distribution can be attached to the parameter. Conversely, in a Bayesian approach parameters, having not been observed, are treated as random and thus possess a probability distribution, whilst the data, having been observed, is treated as fixed.

Example 0.1 Suppose that we perform \(n\) independent Bernoulli trials in which we observe \(x\), the number of times an event occurs. We are interested in making inferences about \(\theta\), the probability of the event occurring in a single trial. Let us consider the classical approach to this problem.

Prior to observing the data, the probability of observing \(x\) was \[\begin{eqnarray} P(X = x \, | \, \theta) & = & \binom{n}{x} \theta^{x}(1-\theta)^{n-x}. \tag{0.1} \end{eqnarray}\] This is a function of the (future) \(x\), assuming that \(\theta\) is known. If we know \(x\) but don’t know \(\theta\), we can instead treat (0.1) as a function of \(\theta\), \(L(\theta)\), the likelihood function. We then choose the value of \(\theta\) which maximises this likelihood. The maximum likelihood estimate is \(\frac{x}{n}\), with corresponding estimator \(\frac{X}{n}\).
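As a quick numerical check of this closed form, here is a minimal sketch (not part of the notes, using illustrative values \(n = 10\), \(x = 3\)) that evaluates the likelihood (0.1) on a grid of \(\theta\) values and confirms that the maximiser agrees with \(\frac{x}{n}\).

```python
import numpy as np
from scipy.stats import binom

# Illustrative values (not from the notes): n trials, x observed successes.
n, x = 10, 3

# Evaluate the binomial likelihood L(theta) = C(n, x) theta^x (1 - theta)^(n - x)
# on a fine grid of theta values and locate the maximiser.
theta = np.linspace(1e-6, 1 - 1e-6, 100_001)
likelihood = binom.pmf(x, n, theta)

print(theta[np.argmax(likelihood)])  # approximately 0.3
print(x / n)                         # closed-form MLE x/n = 0.3
```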

In the general case, the classical approach uses an estimate \(T(x)\) for \(\theta\). Justifications for the estimate depend upon the properties of the corresponding estimator \(T(X)\) (bias, consistency, and so on), assessed using its sampling distribution (given \(\theta\)). That is, we treat the data as being random even though it is known! Such an approach can lead to nonsensical answers.

Recall the invariance property of maximum likelihood estimators: if \(T(X)\) is the maximum likelihood estimator of \(\theta\), then the maximum likelihood estimator of \(g(\theta)\), a function of \(\theta\), is \(g(T(X))\).

Example 0.2 Suppose in the Bernoulli trials of Example 0.1 we wish to estimate \(\theta^{2}\). By the invariance property, the maximum likelihood estimator is \(\left(\frac{X}{n}\right)^{2}\). However, this is a biased estimator since \[\begin{eqnarray} E(X^{2} \, | \, \theta) & = & Var(X \, | \, \theta) + E^{2}(X \, | \, \theta)\nonumber \\ & = & n\theta(1-\theta) + n^{2}\theta^{2} \nonumber \\ & = & n\theta + n(n-1)\theta^{2}. \tag{0.2} \end{eqnarray}\] Dividing by \(n^{2}\), \(E\{\left(\frac{X}{n}\right)^{2} \, | \, \theta\} = \frac{\theta(1-\theta)}{n} + \theta^{2} \neq \theta^{2}\). Noting that \(E(X \, | \, \theta) = n\theta\), (0.2) may be rearranged as \[\begin{eqnarray} E(X^{2} \, | \, \theta) - E(X \, | \, \theta) \ = \ n(n-1)\theta^{2} & \Rightarrow & E\{X(X-1) \, | \, \theta \} \ = \ n(n-1)\theta^{2}. \nonumber \end{eqnarray}\] Thus, \(\frac{X(X-1)}{n(n-1)}\) is an unbiased estimator of \(\theta^{2}\). Suppose we observe \(x = 1\). Then our estimate of \(\theta^{2}\) is 0: we estimate a chance as zero even though the event has occurred!
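A small simulation (again a sketch rather than part of the notes, with illustrative values \(n = 10\), \(\theta = 0.3\)) makes the contrast concrete: the invariance-based estimator \(\left(\frac{X}{n}\right)^{2}\) overshoots \(\theta^{2}\) on average by \(\frac{\theta(1-\theta)}{n}\), while \(\frac{X(X-1)}{n(n-1)}\) is unbiased yet returns 0 whenever \(x \leq 1\).

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative values (not from the notes).
n, theta, reps = 10, 0.3, 200_000

x = rng.binomial(n, theta, size=reps)

mle_sq = (x / n) ** 2                   # MLE of theta^2 via invariance
unbiased = x * (x - 1) / (n * (n - 1))  # unbiased estimator of theta^2

print(theta**2)          # true value 0.09
print(mle_sq.mean())     # about 0.111: biased upwards by theta*(1 - theta)/n = 0.021
print(unbiased.mean())   # about 0.09: unbiased, but equals 0 whenever x <= 1
```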

Example 0.3 Now let’s consider two different experiments, both modelled as Bernoulli trials.

  1. Toss a coin \(n\) times and observe \(x\) heads. Parameter \(\theta_{c}\) represents the probability of tossing a head on a single trial.

  2. Toss a drawing pin \(n\) times and observe that the pin lands facing upwards on \(x\) occasions. Parameter \(\theta_{p}\) represents the probability of the pin landing facing upwards on a single trial.

The maximum likelihood estimates of \(\theta_{c}\) and \(\theta_{p}\) are identical (both are \(\frac{x}{n}\)) and the corresponding estimators share the same properties. Is this sensible?

I, and perhaps you too, have lots of experience of tossing coins, and these are well known to have propensities close to \(\frac{1}{2}\). Thus, even before I toss the coin on these \(n\) occasions, I have some knowledge about \(\theta_{c}\). (At the very least I can say something about where I think it will be and how confident I am in this location, which could be viewed as specifying a mean and a variance for \(\theta_{c}\).) By contrast, I have little knowledge of the tossing propensities of drawing pins: I don’t really know much about \(\theta_{p}\). Shouldn’t I take these differences into account somehow? The classical approach provides no scope for this, as \(\theta_{c}\) and \(\theta_{p}\) are both unknown constants. In a Bayesian analysis (see Example 1.2) we can reflect our knowledge about \(\theta_{c}\) and \(\theta_{p}\) by specifying probability distributions for them.
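One simple way to make this concrete (anticipating Example 1.2; the Beta distributions below are purely illustrative choices, not prescribed by the notes) is to give \(\theta_{c}\) a prior tightly concentrated near \(\frac{1}{2}\) and \(\theta_{p}\) a much more diffuse prior.

```python
from scipy.stats import beta

# Purely illustrative prior choices (not from the notes).
coin_prior = beta(50, 50)  # theta_c: confident it is close to 1/2
pin_prior = beta(1, 1)     # theta_p: uniform, reflecting little knowledge

for name, prior in [("coin", coin_prior), ("pin", pin_prior)]:
    print(name, round(prior.mean(), 3), round(prior.std(), 3))
# coin 0.5 0.05   -> small prior standard deviation: tightly located near 1/2
# pin  0.5 0.289  -> large prior standard deviation: very uncertain
```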

Let’s think back to maximum likelihood estimation. We ask

“what value of \(\theta\) makes the data most likely to occur?”

Isn’t this the wrong way around? What we are really interested in is

“what value of \(\theta\) is most likely given the data?”

In a classical analysis this question makes no sense. However, it can be answered in a Bayesian context. We specify a prior distribution \(f(\theta)\) for \(\theta\) and combine this with the likelihood \(f(x \, | \, \theta)\) to obtain the posterior distribution for \(\theta\) given \(x\) using Bayes’ theorem, \[\begin{eqnarray} f(\theta \, | \, x) & = & \frac{f(x \, | \, \theta)f(\theta)}{\int f(x \, | \, \theta)f(\theta) \, d\theta} \nonumber \\ & \propto & f(x \, | \, \theta)f(\theta). \nonumber \end{eqnarray}\] Bayesian analysis is concerned with the distributions of \(\theta\) and how they are changed in the light of new information (typically data). The distribution \(f(x \, | \, \theta)\) is irrelevant to the Bayesian after \(X\) has been observed, yet to the classicist it is the only distribution they can work with.
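As a concrete sketch (illustrative values again, with a Beta(2, 2) prior chosen only for demonstration), the posterior can be computed numerically on a grid directly from this proportionality and checked against the exact Beta posterior that the conjugate prior yields.

```python
import numpy as np
from scipy.stats import binom, beta

# Illustrative Bernoulli-trials data (not from the notes).
n, x = 10, 3

# Grid of theta values, a Beta(2, 2) prior density f(theta) chosen purely for
# illustration, and the likelihood f(x | theta) from (0.1) at the observed x.
theta = np.linspace(0.0, 1.0, 1001)
d_theta = theta[1] - theta[0]
prior = beta(2, 2).pdf(theta)
likelihood = binom.pmf(x, n, theta)

# Bayes' theorem: posterior proportional to likelihood times prior; the
# normalising integral is approximated by a simple Riemann sum.
unnormalised = likelihood * prior
posterior = unnormalised / (unnormalised.sum() * d_theta)

# Check the grid posterior mean against the exact conjugate result,
# a Beta(2 + x, 2 + n - x) distribution with mean (2 + x) / (4 + n) = 5/14.
print((theta * posterior).sum() * d_theta)
print(beta(2 + x, 2 + n - x).mean())
```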