2.2 Probability Background
Probability space, \(\sigma\)-algebra, random variable, conditional expectation, Markov property.
In this section, we give a brief summary of the main background we will need from previous courses in probability. We will not go through these definitions in detail in the lectures, but you may find this useful to read through. The material here is not directly examinable (you will not be asked, for example, to recall the definition of conditional expectation in an exam), but you will be expected to be able to use the results/ideas here in the manner that we use them elsewhere in the notes. Everything in this section can be found in either MA20224, MA20225 or MA30125. (In the latter case, we will briefly recap the ideas).
Let’s recall the key mathematical ingredients that we need to discuss probability.
A sample space, \(\Omega\), which is a set of points \(\omega \in \Omega\), which we think of as the possible outcomes of an experiment. When we run our experiment, one of these \(\omega\) gets picked at random according to some probability law \(\mathbb{P}\).
Example 8.
Flip a coin twice (\(C_2\)): \(\Omega_{C_2} = \{ H,T\}^2 = \{HH,HT,TH,TT\}\), e.g. \(\omega = HT\).
Flip a coin infinitely many times (\(C_{\infty}\)): \(\Omega_{C_\infty} = \{H,T\}^\mathbb{N}\), so \(\omega \in \Omega_{C_{\infty}}\) is a sequence, \(\omega = (\omega_1, \omega_2, \dots)\) with \(\omega_n \in \{H,T\} \quad \forall n \in \mathbb{N}\).
The second ingredient is an event. An event, \(F\) is a subset of \(\Omega\), \(F \subset \Omega\). The event \(F\) occurs if and only if (iff) the chosen \(\omega\) is an element of \(F\).
Example 9. Infinite coin flips: \(G_p\) is the event that the proportion of heads converges to \(p\): \[G_p = \left\{ \omega = (\omega_1, \omega_2, \omega_3, \dots) \in \Omega_{C_{\infty}} \Big| \frac{1}{n} \sum_{k = 1}^{n} \boldsymbol{1}_{H}(\omega_k) \to p \mbox{ as } n \to \infty\right\},\] where we have introduced the indicator function of an event \(Y \subset \Omega\), defined by: \[\boldsymbol{1}_{Y}(\omega) = \begin{cases} 1 & \text{if } \omega \in Y \\ 0 & \text{if } \omega \not \in Y \end{cases}.\]
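The event \(G_p\) is exactly the event appearing in the strong law of large numbers: for i.i.d. flips with head probability \(p\), the chosen \(\omega\) lies in \(G_p\) with probability one. A quick computational sketch (in Python, not part of the course material; the choice \(p = 0.3\) and the variable names are ours):

```python
import random

random.seed(0)
p = 0.3              # hypothetical probability of heads, our choice for illustration
n = 100_000

heads = 0
for k in range(1, n + 1):
    heads += 1 if random.random() < p else 0

# The running proportion (1/n) * sum of the indicators 1_H(omega_k);
# by the strong law of large numbers it should be close to p for large n.
proportion = heads / n
print(proportion)
```

With a fixed seed the run is reproducible, and the proportion lands within a few standard errors of \(p\).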
Collection of events \(\mathcal{F}\): this is the collection of all the events of interest. If \(\Omega\) is a finite or countable set, usually \(\mathcal{F}= \mathcal{P}(\Omega)\), the set of all subsets of \(\Omega\). If \(|\Omega| = N\), \(|\mathcal{P}(\Omega)| = 2^N\).
Example 10. Coin flipping, \(C_2\), then \[\mathcal{F}= \mathcal{P}(\Omega_{C_2}) = \{ \emptyset, \{HH\}, \{HT\}, \dots, \{HT,TH,TT\}, \Omega\}\] and \(|\mathcal{F}| = 2^4 = 16\).
Remember that, if \(\Omega\) is countable, we can take \(\mathcal{F}\) to be the power set of \(\Omega\), however this leads to problems if \(\Omega\) is uncountable. Fortunately, even for uncountable \(\Omega\) we can still choose \(\mathcal{F}\) to be big enough that it contains all events of interest, and so that \(\mathcal{F}\) is a \(\sigma\)-algebra: \(\mathcal{F}\subset \mathcal{P}(\Omega)\) is a \(\sigma\)-algebra if:
\(\emptyset \in \mathcal{F}\);
\(F \in \mathcal{F}\implies F^C\in \mathcal{F}\);
if \(F_n \in \mathcal{F}\) for all \(n \in \mathbb{N}\), then \(\cup_{n \in \mathbb{N}} F_n \in \mathcal{F}\).
NB: \(\mathcal{F}\) is closed under complements and countable unions. Since \(\emptyset \in \mathcal{F}\) and \(\Omega = \emptyset^C\), then \(\Omega \in \mathcal{F}\). Also \(\cap_{n \in \mathbb{N}} F_n = \left( \cup_{n \in \mathbb{N}} F_n^C\right)^C\), so countable intersections of events in \(\mathcal{F}\) are also in \(\mathcal{F}\).
If \(A\subset \mathcal{P}(\Omega)\) is a collection of subsets of \(\Omega\), then we define \(\sigma(A)\), the \(\sigma\)-algebra generated by \(A\), to be the smallest \(\sigma\)-algebra on \(\Omega\) that contains \(A\). Note that since the intersection of two \(\sigma\)-algebras is a (smaller) \(\sigma\)-algebra, this can be alternatively defined as the intersection of all \(\sigma\)-algebras containing \(A\). It is also easy to check that \(\mathcal{P}(\Omega)\) is a \(\sigma\)-algebra, so there is at least one \(\sigma\)-algebra containing \(A\).
If \(E \subset \Omega\) is a single event, then \(\sigma(E) = \{\emptyset, E, E^C, \Omega\}\). (Check that this is a \(\sigma\)-algebra, and that every \(\sigma\)-algebra containing \(E\) must contain all these elements.)
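For a finite \(\Omega\) the check suggested above can be done by brute force: repeatedly close a collection of sets under complements and unions until nothing new appears. A sketch (in Python, not part of the course; the function name is ours):

```python
def generate_sigma_algebra(omega, generators):
    """Smallest collection of subsets of a *finite* omega that contains the
    generators and is closed under complement and union, i.e. sigma(generators)."""
    omega = frozenset(omega)
    current = {frozenset(), omega} | {frozenset(g) for g in generators}
    while True:
        new = set(current)
        new |= {omega - a for a in current}                  # close under complement
        new |= {a | b for a in current for b in current}     # close under union
        if new == current:                                   # nothing new: done
            return current
        current = new

# sigma(E) for a single event E should be {emptyset, E, E^C, Omega}
omega = {"HH", "HT", "TH", "TT"}
E = {"HH", "HT"}                      # head on the first flip
sig = generate_sigma_algebra(omega, [E])
print(sorted(len(s) for s in sig))    # set sizes 0, 2, 2, 4
```

Since \(\Omega\) is finite, closure under pairwise unions already gives closure under countable unions, so the loop terminates with the full \(\sigma\)-algebra.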
Example 11. Infinite coin flipping, \(C_{\infty}\): again, \(\Omega_{C_\infty}\) is uncountable, but if we take \(F_n = \{ \omega \in \Omega_{C_{\infty}} | \omega_n = H\}\), i.e. head on \(n^{\text{th}}\) flip, then \(\mathcal{F}= \sigma(F_n: n \in \mathbb{N})\) contains all events we need.
Now we can discuss probability! Given sample space \(\Omega\) and \(\sigma\)-algebra \(\mathcal{F}\) on \(\Omega\), the map \(\mathbb{P}: \mathcal{F}\to [0,1]\) is a probability measure on \((\Omega, \mathcal{F})\) if:
\(\mathbb{P}(\emptyset) = 0\);
\(\mathbb{P}(\Omega) = 1\);
(Countable additivity) If \((F_n: n \in \mathbb{N})\) is a sequence of disjoint (i.e. \(F_i \cap F_j = \emptyset\) for all \(i \neq j\)) events in \(\mathcal{F}\), then \(\mathbb{P}(\cup_{n \in \mathbb{N}} F_n) = \sum_{n \in \mathbb{N}} \mathbb{P}(F_n)\).
Then we call \((\Omega, \mathcal{F}, \mathbb{P})\) a probability triple, or a probability space, and we interpret \(\mathbb{P}(F)\) as the probability of the event \(F\) happening — that is, the probability that the selected \(\omega\) is in \(F\).
Lemma 12 (Properties of \(\mathbb{P}\)). Let \((\Omega, \mathcal{F}, \mathbb{P})\) be a probability triple. Then:
If \(A \in \mathcal{F}\) then \(\mathbb{P}(A^C) = 1-\mathbb{P}(A)\);
If \(A,B \in \mathcal{F}\) and \(A \subset B\) then \(\mathbb{P}(A) \le \mathbb{P}(B)\);
(Finite Additivity) If \(F_1, F_2, \dots, F_n\in \mathcal{F}\) are disjoint then \(\mathbb{P}(\cup_{j =1}^n F_j) = \sum_{j = 1}^n \mathbb{P}(F_j)\);
(Boole’s Inequality) For any \(F_1, F_2, \dots, F_n\in \mathcal{F}\), \(\mathbb{P}(\cup_{j =1}^n F_j) \le \sum_{j = 1}^n \mathbb{P}(F_j)\);
(Continuity of \(\mathbb{P}\))
Suppose \(F_1 \subset F_2 \subset F_3 \subset \cdots\) and \(F = \cup_{n \in \mathbb{N}} F_n\), then \(\mathbb{P}(F_n) \uparrow \mathbb{P}(F)\) (i.e. \(\mathbb{P}(F_n)\) is an increasing sequence, with limit \(\mathbb{P}(F)\)).
Suppose \(F_1 \supset F_2 \supset F_3 \supset \cdots\) and \(F = \cap_{n \in \mathbb{N}} F_n\), then \(\mathbb{P}(F_n) \downarrow \mathbb{P}(F)\).
When \(\Omega\) is countable, we can define a probability measure simply by assigning a probability \(\mathbb{P}(\omega)\) to each \(\omega \in \Omega\): provided \(\mathbb{P}(\omega) \in [0,1]\) and \(\sum_{\omega \in \Omega} \mathbb{P}(\omega) = 1\), we can define \(\mathbb{P}(F) = \sum_{\omega \in F} \mathbb{P}(\omega)\), and this satisfies the definition of a probability measure. It’s not so easy when \(\Omega\) is uncountable, but consider \(\Omega = [0,1]\) and define the Borel \(\sigma\)-algebra on \([0,1]\) to be \(\sigma(G)\) where \(G = \{[0,x]: x \in [0,1]\}\). Then:
Theorem 13 (Lebesgue and Borel). There exists a unique probability measure \(\mathbb{P}\) on \(([0,1],\mathcal{B}([0,1]))\) such that \(\mathbb{P}([0,x]) = x\) for all \(x \in [0,1]\).
A random variable (r.v.) \(X\) is a function \(X:\Omega \to \mathbb{R}\) such that \(\{X \le x\} := \{\omega \in \Omega: X(\omega) \le x\} \in \mathcal{F}\) for all \(x \in \mathbb{R}\).
NB: Since \(\mathcal{B}(\mathbb{R}) = \sigma((-\infty,x], x \in \mathbb{R})\) and \(\mathcal{F}\) is a \(\sigma\)-algebra, \(X^{-1}(A) := \{\omega \in \Omega: X(\omega) \in A\} \in \mathcal{F}\) for any \(A \in \mathcal{B}(\mathbb{R})\). Hence for any \(A \in \mathcal{B}(\mathbb{R})\), \(\mathbb{P}(X \in A)\) makes sense!
When this happens, we say that \(X\) is a \(\mathcal{F}\)-measurable function from \(\Omega\) to \(\mathbb{R}\). All the functions we meet in this course will be suitably measurable.
The distribution function \(F_X\) of a random variable \(X\) on \((\Omega, \mathcal{F}, \mathbb{P})\) is the map \(F_X:\mathbb{R}\to [0,1]\) where \(F_X(x) = \mathbb{P}(X \le x) = \mathbb{P}(\{\omega: X(\omega) \le x\})\).
If \(X\) takes values in a countable set \(X(\Omega) := \{X(\omega): \omega \in \Omega\}\), we say that \(X\) is a discrete r.v., and the probability mass function (p.m.f.) is the function \(p_X(x) := \mathbb{P}(X=x) = \mathbb{P}(\{\omega: X(\omega) = x\})\).
A r.v. is called continuous if there exists a non-negative function \(f_X(x)\) on \(\mathbb{R}\) such that \(\mathbb{P}(a < X \le b) = F_X(b)-F_X(a) = \int_a^b f_X(x) \, \mathrm{d}x\), in which case \(f_X(x)\) is the probability density function (p.d.f.) of \(X\).
Let \(X\) be a r.v. with \(0 \le X \le \infty\). If \(X\) is discrete, the expectation of \(X\) is: \[\mathbb{E}X = \sum_{x \in X(\Omega)} x \, \mathbb{P}(X = x) = \sum_{x \in X(\Omega)} x \, p_X(x) \le \infty.\] Note: \(\mathbb{E}\boldsymbol{1}_{A} = 0 \times \mathbb{P}(\boldsymbol{1}_A = 0) + 1 \times \mathbb{P}(\boldsymbol{1}_A = 1) = \mathbb{P}(\boldsymbol{1}_A = 1) = \mathbb{P}(A)\), since \(\{\boldsymbol{1}_A(\omega) = 1\} = \{\omega \in A\} = A\).
If \(X\) is a continuous random variable, then we define \[\mathbb{E}X = \int_{x \in \mathbb{R}} x f_X(x) \, \mathrm{d}x.\]
More generally, we can define the expectation of a r.v. \(X\) with \(0 \le X \le \infty\) by: \[\mathbb{E}X = \sup\{\mathbb{E}Y | Y \text{ is discrete and } Y \le X\}\] where \(Y\le X\) means \(Y(\omega) \le X(\omega)\) for all \(\omega \in \Omega\).
For a r.v. \(X\) on \(\mathbb{R}\), write \(X^+(\omega) = \max\{0,X(\omega)\}\) and \(X^-(\omega) = \max\{0,-X(\omega)\}\). Then \(X^+\) and \(X^-\) are both non-negative, and \(X = X^+-X^-\) and \(|X| = X^+ + X^-\). If \(\mathbb{E}X^+ < \infty\) and \(\mathbb{E}X^-<\infty\), so \(\mathbb{E}|X| = \mathbb{E}X^+ + \mathbb{E}X^- < \infty\), we say \(X\) is integrable, write \(X \in \mathcal{L}^1\), and define \(\mathbb{E}X = \mathbb{E}X^+ - \mathbb{E}X^-\).
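A small numerical check of the decomposition \(X = X^+ - X^-\) for a discrete r.v. (in Python, not part of the course; the values and probabilities are our own toy example):

```python
# Toy discrete distribution: the values X takes, and their probabilities.
values = [-2, -1, 1, 3]
probs  = [0.1, 0.2, 0.3, 0.4]

e_plus  = sum(max(x, 0)  * p for x, p in zip(values, probs))   # E[X+]
e_minus = sum(max(-x, 0) * p for x, p in zip(values, probs))   # E[X-]
e_abs   = sum(abs(x)     * p for x, p in zip(values, probs))   # E|X|
e_x     = sum(x          * p for x, p in zip(values, probs))   # E[X]

# Both identities from the definition: E|X| = E[X+] + E[X-],  E[X] = E[X+] - E[X-].
print(e_plus, e_minus, e_abs, e_x)
```

Here \(\mathbb{E}X^+ = 1.5\), \(\mathbb{E}X^- = 0.4\), so \(X \in \mathcal{L}^1\) with \(\mathbb{E}X = 1.1\).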
Lemma 14 (Properties of Expectation).
\(|\mathbb{E}\left[ X \right]| \le \mathbb{E}|X|\). (e.g. if \(X\) is \(1\) or \(-1\) with equal probability, \(\mathbb{E}|X| = 1 \neq 0 = | \mathbb{E}\left[ X \right]|\).)
If \(X,Y \in \mathcal{L}^1\) and \(\lambda, \mu \in \mathbb{R}\) then \(\lambda X + \mu Y \in \mathcal{L}^1\) and \(\mathbb{E}\left[ \lambda X + \mu Y \right] = \lambda \mathbb{E}X + \mu \mathbb{E}Y\).
If \(X\) is a discrete r.v. and \(h:X(\Omega) \to \mathbb{R}\) then \(\mathbb{E} h(X) = \sum_{x \in X(\Omega)} h(x) \mathbb{P}(X=x)\), and \(h(X)\) is integrable iff \(\sum_{x \in X(\Omega)} |h(x)| \mathbb{P}(X=x) < \infty\).
If \(X\) is a continuous r.v. with p.d.f. \(f_X\) and \(h\) is a piecewise continuous function, \(\mathbb{E}\left[ h(X) \right] = \int_{\mathbb{R}} h(x) f_X(x) \, \mathrm{d}x\) where \(h(X)\) is integrable iff \(\int_{\mathbb{R}} |h(x)| f_X(x) \, \mathrm{d}x< \infty\).
(Monotone Convergence Theorem) If \((X_n)\) is a sequence of non-negative r.v.’s such that \(0 \le X_n \uparrow X\) then \(\mathbb{E}X_n \uparrow \mathbb{E}X \le \infty\).
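A sketch of Monotone Convergence for the truncations \(X_n = \min(X, n)\), with \(X\) geometric, \(\mathbb{P}(X = k) = 2^{-k}\) for \(k \ge 1\) (our own choice of example, not from the notes). We use the tail-sum formula \(\mathbb{E}Z = \sum_{k \ge 1} \mathbb{P}(Z \ge k)\) for non-negative integer-valued \(Z\), noting \(\mathbb{P}(\min(X,n) \ge k) = 2^{1-k}\) for \(k \le n\):

```python
def e_truncated(n):
    """E[min(X, n)] for X geometric with P(X = k) = 2**-k, k = 1, 2, ...
    computed via the tail-sum formula; equals 2 - 2**(1 - n)."""
    return sum(0.5 ** (k - 1) for k in range(1, n + 1))

# 0 <= X_n increases to X, and E[X_n] increases to E[X] = 2.
approximations = [e_truncated(n) for n in range(1, 11)]
print(approximations)
```

The sequence \(1, 1.5, 1.75, \dots\) is increasing with limit \(\mathbb{E}X = 2\), as the theorem guarantees.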
Given \((\Omega, \mathcal{F}, \mathbb{P})\) and \(A,B \in \mathcal{F}\) with \(\mathbb{P}(B)>0\), the conditional probability of \(A\) given \(B\) is \[\mathbb{P}(A|B) = \frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}.\] In addition, \(\mathbb{P}(\cdot|B): \mathcal{F}\to [0,1]\) is a probability measure on \((\Omega,\mathcal{F})\).
Independence: Events \(A,B \in \mathcal{F}\) are independent if \(\mathbb{P}(A\cap B) = \mathbb{P}(A) \mathbb{P}(B)\). Then (provided \(0 < \mathbb{P}(B) < 1\), so that the conditional probabilities are defined) \(\mathbb{P}(A|B) = \mathbb{P}(A|B^C) = \mathbb{P}(A)\). So whether or not \(B\) happens does not affect the probability of \(A\) happening.
More generally, a finite collection of events, \(F_1, \dots, F_n\) are independent if, for every \(k =2, \dots, n\) and \(i_1 < i_2 < \cdots < i_k\) we have: \[\mathbb{P}(F_{i_1} \cap F_{i_2} \cap \cdots \cap F_{i_k}) = \mathbb{P}(F_{i_1}) \mathbb{P}(F_{i_2}) \cdots \mathbb{P}(F_{i_k}).\]
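The conditions for every \(k = 2, \dots, n\) are all needed: pairwise independence does not imply independence of the whole collection. A standard counterexample (two fair coin flips), checked exactly in Python (not part of the course material):

```python
from fractions import Fraction

# Sample space: two fair coin flips, each outcome with probability 1/4.
omega = ["HH", "HT", "TH", "TT"]
P = lambda event: Fraction(sum(1 for w in omega if w in event), 4)

A = {"HH", "HT"}   # head on the first flip
B = {"HH", "TH"}   # head on the second flip
C = {"HH", "TT"}   # the two flips agree

# Every pair is independent ...
print(P(A & B) == P(A) * P(B), P(A & C) == P(A) * P(C), P(B & C) == P(B) * P(C))
# ... but the triple is not: P(A n B n C) = 1/4, while the product is 1/8.
print(P(A & B & C), P(A) * P(B) * P(C))
```

Using exact rational arithmetic avoids any floating-point ambiguity in the comparison.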
Random variables, \(X_1, X_2, X_3, \dots\) in a finite or infinite sequence are independent if, whenever \(i_1, i_2, \dots, i_r\) are distinct integers, and \(x_{i_1}, x_{i_2}, \dots, x_{i_r} \in \mathbb{R}\), then \[\mathbb{P}(X_{i_1} \le x_{i_1}, X_{i_2} \le x_{i_2}, \dots, X_{i_r} \le x_{i_r}) = \mathbb{P}(X_{i_1} \le x_{i_1})\mathbb{P}( X_{i_2} \le x_{i_2})\cdots\mathbb{P}(X_{i_r} \le x_{i_r})\] where we write \(\mathbb{P}(A,B)\) as shorthand for \(\mathbb{P}(A \cap B)\). We sometimes use the notation \(X \perp Y\) to mean \(X\) and \(Y\) are independent.
If \(X_1, X_2, X_3, \dots\) are independent r.v.’s, and \(f_1, f_2, \dots\) are ‘nice’ functions on \(\mathbb{R}\) then \(f_1(X_1), f_2(X_2), \dots\) are also independent, and in fact, if \(g\) is a ‘nice’ function of \(n\) variables then \[g(X_1, X_2,\dots, X_n), X_{n+1}, X_{n+2}, \dots\] are also independent. (Here, ‘nice’ relates to the Borel \(\sigma\)-algebra, and every function we consider will be ‘nice’).
Lemma 15 (Independence means multiply). Suppose \(X \perp Y\). Then
If \(X,Y \ge 0\) then \(\mathbb{E}\left[ XY \right] = \mathbb{E}\left[ X \right] \mathbb{E}\left[ Y \right]\).
If \(X,Y \in \mathcal{L}^1\), then \(XY \in \mathcal{L}^1\) and \(\mathbb{E}\left[ XY \right] = \mathbb{E}\left[ X \right] \mathbb{E}\left[ Y \right]\).
But: \(\mathbb{E}\left[ XY \right] = \mathbb{E}\left[ X \right] \mathbb{E}\left[ Y \right] \not\implies X \perp Y\).
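A quick exact check of the last point, with \(X\) uniform on \(\{-1,0,1\}\) and \(Y = \boldsymbol{1}_{X = 0}\) (our own choice of counterexample): here \(XY = 0\) always, so \(\mathbb{E}[XY] = 0 = \mathbb{E}X\,\mathbb{E}Y\), yet \(X\) and \(Y\) are clearly not independent.

```python
from fractions import Fraction

# X uniform on {-1, 0, 1}; Y = 1 if X = 0, else 0.  Each (x, y) pair below
# occurs with probability 1/3.
outcomes = [(-1, 0), (0, 1), (1, 0)]
third = Fraction(1, 3)

e_x  = sum(third * x     for x, y in outcomes)
e_y  = sum(third * y     for x, y in outcomes)
e_xy = sum(third * x * y for x, y in outcomes)

print(e_xy == e_x * e_y)       # E[XY] = E[X]E[Y] (both are 0)
# ... yet P(X = 0, Y = 1) = 1/3 while P(X = 0) P(Y = 1) = 1/9:
p_joint = third                # only the pair (0, 1) has X = 0 and Y = 1
print(p_joint, third * third)
```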
Definition 16. For an integrable r.v. \(X\) and an event \(A \in \mathcal{F}\) with \(\mathbb{P}(A)>0\), we define the conditional expectation of \(X\) given \(A\) to be: \[\mathbb{E}\left[ X | A \right] = \frac{\mathbb{E}\left[ X;A \right] }{\mathbb{P}(A)} = \frac{\mathbb{E}\left[ X \boldsymbol{1}_{A} \right]}{\mathbb{P}(A)}.\] In particular, if \(X\) takes values in a countable set, so \(X(\Omega) = \{ x_1, x_2, \dots \}\), then \[\mathbb{E}\left[ X | A \right] = \sum_j x_j \mathbb{P}(X = x_j | A).\]
Definition 17. Suppose \(X\) and \(Y\) are random variables.
If \(Y\) is a discrete r.v., so \(Y(\Omega) = \{y_1, y_2, \dots\} = \{y_i: i \in I\}\), where either \(I = \{1,2,\dots,n\}\) or \(I = \mathbb{N}\), depending on whether \(|Y(\Omega)| < \infty\) or \(|Y(\Omega)| = \infty\), and \(\mathbb{P}(Y = y_i) > 0\) for all \(i \in I\), and \(\sum_{i \in I} \mathbb{P}(Y = y_i) = 1\), then we define the Conditional Expectation of \(X\) given \(Y\) to be the random variable: \[\mathbb{E}\left[ X|Y \right] = \sum_{i \in I} \boldsymbol{1}_{Y = y_i} \mathbb{E}\left[ X | Y = y_i \right].\] Recall that \(\boldsymbol{1}_{A} = \boldsymbol{1}_{A}(\omega)\) is a r.v. which takes the value \(1\) if \(\omega \in A\), and 0 otherwise. Since the events \(\{Y = y_i\}\) are disjoint, only one may happen, and therefore at most one term in the sum is not zero.
Note that if we define the function \[\psi(y) = \mathbb{E}\left[ X|Y = y \right]\] then \(\mathbb{E}\left[ X|Y \right] = \psi(Y)\).
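On the two-flip space \(\Omega_{C_2}\), here is a direct computation of \(\psi(y) = \mathbb{E}[X|Y=y]\) via \(\mathbb{E}[X|A] = \mathbb{E}[X\boldsymbol{1}_A]/\mathbb{P}(A)\), taking \(X\) to be the total number of heads and \(Y\) the first flip, with a fair coin (the specific choice of \(X\) and \(Y\) is ours). Not part of the course, just an illustration in Python:

```python
from fractions import Fraction
from itertools import product

# Omega_{C_2} with a fair coin: each outcome has probability 1/4.
omega = list(product("HT", repeat=2))
quarter = Fraction(1, 4)

X = lambda w: sum(1 for flip in w if flip == "H")   # total number of heads
Y = lambda w: w[0]                                  # the first flip

def psi(y):
    """E[X | Y = y] = E[X 1_{Y=y}] / P(Y = y), by direct enumeration."""
    num = sum(quarter * X(w) for w in omega if Y(w) == y)
    den = sum(quarter        for w in omega if Y(w) == y)
    return num / den

print(psi("H"), psi("T"))   # 3/2 and 1/2
```

So \(\mathbb{E}[X|Y] = \psi(Y)\) takes the value \(\frac32\) if the first flip is a head and \(\frac12\) otherwise.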
More generally, if \(Y_0, Y_1, \dots, Y_n\) are discrete random variables, with \(Y_k\) taking values in a countable set \(\{y_i : i \in I_k\}\), we can define the conditional expectation \[\begin{align} & \mathbb{E}\left[ X|Y_0, Y_1, \dots, Y_n \right] \\ &\quad = \sum_{i_0 \in I_0, i_1 \in I_1, \dots, i_n \in I_n} \boldsymbol{1}_{Y_0 = y_{i_0}, Y_1 = y_{i_1}, \dots, Y_n = y_{i_n}} \mathbb{E}\left[ X | Y_0 = y_{i_0}, Y_1 = y_{i_1}, \dots, Y_n = y_{i_n} \right],\end{align}\] and it follows that \(\mathbb{E}\left[ X|Y_0, Y_1, \dots, Y_n \right] = \psi(Y_0, Y_1, \dots, Y_n)\) if we define: \[\psi(y_0, y_1, \dots, y_n) = \mathbb{E}\left[ X|Y_0 = y_0, Y_1 = y_1, \dots, Y_n = y_n \right].\]
Alternatively, we can define \((\mathcal{F}_{n})_{n \ge 0}\) to be the filtration of the sequence of random variables \((Y_i)_{i \ge 0}\). That is, \(\mathcal{F}_n\) is the collection of events depending only on \(Y_0, Y_1, \dots, Y_n\). We usually think of \(Y_k\) as the value of something observed at time \(k\), and then \(\mathcal{F}_n\) is the ‘state of knowledge’ or ‘history’ up to time \(n\).
Given a filtration, we can then define the conditional expectation of \(X\) ‘given the history up to time \(n\)’ as \[\begin{align} & \mathbb{E}\left[ X|\mathcal{F}_n \right] = \mathbb{E}\left[ X|Y_0, Y_1, \dots, Y_n \right] \\ &\quad = \sum_{i_0 \in I_0, i_1 \in I_1, \dots, i_n \in I_n} \boldsymbol{1}_{Y_0 = y_{i_0}, Y_1 = y_{i_1}, \dots, Y_n = y_{i_n}} \mathbb{E}\left[ X | Y_0 = y_{i_0}, Y_1 = y_{i_1}, \dots, Y_n = y_{i_n} \right].\end{align}\] We will also frequently use the shorthand: \(\mathbb{E}\left[ X|\mathcal{F}_n \right] = \mathbb{E}_{n} \left[ X \right]\).
If \(X\) and \(Y\) are both continuous random variables with joint density \(f_{X,Y}(x,y)\), and marginal densities \(f_X(x)\) and \(f_Y(y)\) respectively, then the conditional p.d.f. of \(X\) given \(Y = y\) is: \[f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_{Y}(y)}.\] If we define the function \[\psi(y) = \mathbb{E}\left[ X|Y=y \right] = \int_{x\in \mathbb{R}} x \, f_{X|Y}(x|y) \, \mathrm{d}x,\] then the conditional expectation of \(X\) given \(Y\) is \(\psi(Y)\).
Similarly, if \(Y_0, Y_1, \dots, Y_n\) are continuous random variables, with \(X, Y_0, \dots, Y_n\) having joint density \(f_{X,Y_0, \dots, Y_n}(x,y_0,\dots,y_n)\) and \[f_{Y_0,\dots, Y_n}(y_0,\dots,y_n) = \int_{x \in \mathbb{R}} f_{X,Y_0, \dots, Y_n}(x,y_0,\dots,y_n) \, \mathrm{d}x,\] then we can define \[f_{X|Y_0, \dots, Y_n}(x|y_0, \dots, y_n) = \frac{f_{X,Y_0, \dots, Y_n}(x,y_0,\dots,y_n)}{f_{Y_0,\dots, Y_n}(y_0,\dots,y_n)},\] and \[\psi(y_0,y_1,\dots,y_n) = \int_{x\in \mathbb{R}} x \, f_{X|Y_0, Y_1, \dots, Y_n}(x|y_0, y_1, \dots, y_n) \, \mathrm{d}x,\] and we define the conditional expectation by: \[\mathbb{E}\left[ X|Y_0, Y_1, \dots, Y_n \right] = \psi(Y_0, \dots, Y_n).\]
Lemma 18 (Properties of Conditional Expectation). The following properties hold for all the forms of conditional expectation above, if \(X,Z,Y_0, \dots, Y_n\) are r.v.’s:
(Taking out what is known): for a ‘nice’ function \(h\) \[\mathbb{E}\left[ h(Y_0,Y_1, \dots, Y_n) X|Y_0, \dots, Y_n \right] = h(Y_0,Y_1, \dots, Y_n) \mathbb{E}\left[ X|Y_0, \dots, Y_n \right];\]
(Tower Property): if \(m \le n\), \[\mathbb{E}\left[ \mathbb{E}\left[ X|Y_0, \dots, Y_n \right]|Y_0, \dots, Y_m \right] = \mathbb{E}\left[ X|Y_0, \dots, Y_m \right],\] and in particular, if \(m=0\), then \[\mathbb{E}\left[ \mathbb{E}\left[ X|Y_0, \dots, Y_n \right] \right] = \mathbb{E}\left[ X \right];\]
(Combining (i) and (ii)): for a ‘nice’ function \(h\) \[\mathbb{E}\left[ h(Y_0,Y_1, \dots, Y_n) X \right] = \mathbb{E}\left[ h(Y_0,Y_1, \dots, Y_n) \mathbb{E}\left[ X|Y_0, \dots, Y_n \right] \right];\]
(Independence): if \(X\) is independent of \(Y_0, \dots, Y_n\) \[\mathbb{E}\left[ X|Y_0, \dots, Y_n \right] = \mathbb{E}\left[ X \right];\]
(Linearity): for \(\alpha, \beta \in \mathbb{R}\) \[\mathbb{E}\left[ \alpha X + \beta Z | Y_0, \dots, Y_n \right] = \alpha \mathbb{E}\left[ X| Y_0, \dots, Y_n \right] + \beta \mathbb{E}\left[ Z | Y_0, \dots, Y_n \right].\]
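A small exact check of the Tower Property with \(m = 0\) (our own toy example, in Python, not part of the course): two fair \(\pm 1\) flips \(Y_1, Y_2\), and \(X = (Y_1 + Y_2)^2\).

```python
from fractions import Fraction
from itertools import product

quarter = Fraction(1, 4)
outcomes = list(product([1, -1], repeat=2))   # (Y1, Y2), each with probability 1/4

X = lambda y1, y2: (y1 + y2) ** 2

def inner(y1):
    """The inner conditional expectation E[X | Y1 = y1], by enumeration."""
    num = sum(quarter * X(a, b) for a, b in outcomes if a == y1)
    den = sum(quarter           for a, b in outcomes if a == y1)
    return num / den

# Tower Property (m = 0): E[ E[X | Y1] ] = E[X].
lhs = sum(quarter * inner(a) for a, b in outcomes)
rhs = sum(quarter * X(a, b)  for a, b in outcomes)
print(lhs, rhs)
```

Here \(\mathbb{E}[X|Y_1] = 2\) identically (given either value of \(Y_1\), \(X\) is \(4\) or \(0\) with equal probability), and indeed \(\mathbb{E}X = 2\).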
Example 19. Consider random variables on the coin-flipping probability space \(\Omega_{C_4} = \{HHHH, HHHT, HHTH, \dots, TTTT\}\), so that \(|\Omega_{C_4}| = 2^4 = 16\), with \(\mathcal{F}= \mathcal{P}(\Omega_{C_4}) = \{ \emptyset, \{HHHH\}, \{HHHT\}, \dots, \Omega\}\), the power set of \(\Omega_{C_4}\), so that \(|\mathcal{F}| = 2^{16} = 65536\).
Let \(Y_n\) be the r.v. which is \(1\) if the \(n^{\text{th}}\) flip is a head, and \(-1\) otherwise, for \(n=1,2,3,4\). Let \[\begin{align} X_n & = \# \text{heads on first $n$ flips} - \# \text{tails on first $n$ flips}\\ & = \sum_{i = 1}^n Y_i,\end{align}\] and set also \(X_0 = 0\).
If we assume that \(Y_i\) are independent and identically distributed (i.i.d.) with \(\mathbb{P}(Y_i = 1) = p = 1-\mathbb{P}(Y_i = -1)\), then \(X_n\) is a simple random walk. We can use conditional expectations to compute certain expectations:
Find \(\mathbb{E}\left[ X_4| Y_1, Y_2, Y_3 \right]\).
Find \(\mathbb{E}\left[ X_4|X_1, X_2, X_3 \right]\) and \(\mathbb{E}\left[ X_4|X_3 \right]\).
Find \(\mathbb{E}\left[ (X_4)^2 | X_1, X_2, X_3 \right]\).
Suppose that \(p = \frac{1}{2}\). Find \(\mathbb{E}\left[ (X_4)^2 \right]\).
Solution:
\[\begin{align} \mathbb{E}\left[ X_4| Y_1, Y_2, Y_3 \right]=&\mathbb{E}\left[ Y_1+Y_2+Y_3+Y_4| Y_1, Y_2, Y_3 \right]\\ =&\mathbb{E}\left[ Y_1| Y_1, Y_2, Y_3 \right]+\mathbb{E}\left[ Y_2| Y_1, Y_2, Y_3 \right]+\mathbb{E}\left[ Y_3| Y_1, Y_2, Y_3 \right]+\mathbb{E}\left[ Y_4| Y_1, Y_2, Y_3 \right]\\ =&Y_1+Y_2+Y_3+\mathbb{E}\left[ Y_4 \right]\\ =&X_3+p\cdot 1+(1-p)\cdot(-1)\\ =&X_3+2p-1\end{align}\]
\[\begin{align} \mathbb{E}\left[ X_4| X_1, X_2, X_3 \right]=&\mathbb{E}\left[ X_3+Y_4| X_1, X_2, X_3 \right]\\ =&X_3+\mathbb{E}\left[ Y_4| X_1, X_2, X_3 \right]\\ =&X_3+\mathbb{E}\left[ Y_4 \right]\\ =&X_3+2p-1\end{align}\]
\[\begin{align} \mathbb{E}\left[ X_4| X_3 \right]=&\mathbb{E}\left[ \mathbb{E}\left[ X_4| X_1, X_2, X_3 \right]|X_3 \right]\\ =&\mathbb{E}\left[ X_3+2p-1|X_3 \right]\\ =& X_3+2p-1\end{align}\]
\[\begin{align} \mathbb{E}\left[ (X_4)^2| X_1, X_2, X_3 \right]=&\mathbb{E}\left[ (X_3+Y_4)^2| X_1, X_2, X_3 \right]\\ =&\mathbb{E}\left[ (X_3)^2+2X_3Y_4+(Y_4)^2| X_1, X_2, X_3 \right]\\ =&(X_3)^2+2X_3\mathbb{E}\left[ Y_4 \right]+\mathbb{E}\left[ (Y_4)^2 \right]\\ =&(X_3)^2+2X_3(2p-1)+p\cdot 1^2+(1-p)\cdot(-1)^2\\ =&(X_3)^2+2(2p-1)X_3+1.\end{align}\] Now suppose \(p = \frac{1}{2}\), so that \(2p-1 = 0\) and \(\mathbb{E}\left[ (X_4)^2| X_1, X_2, X_3 \right] = (X_3)^2+1\). Similarly, \(\mathbb{E}\left[ (X_k)^2| X_1, \dots, X_{k-1} \right]=(X_{k-1})^2+1\) for each \(k\). By the Tower Property: \[\begin{align} \mathbb{E}\left[ (X_4)^2 \right]=&\mathbb{E}\left[ \mathbb{E}\left[ (X_4)^2| X_1, X_2, X_3 \right] \right]\\ =&\mathbb{E}\left[ (X_3)^2+1 \right]\\ =&\mathbb{E}\left[ \mathbb{E}\left[ (X_3)^2| X_1, X_2 \right] \right]+1\\ =&\mathbb{E}\left[ (X_2)^2 \right]+2\\ &\vdots\\ =&\mathbb{E}\left[ (X_0)^2 \right]+4\\ =&4\end{align}\]
\(\square\)
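The \(p = \frac{1}{2}\) answers in Example 19 can be checked by exact enumeration of the 16 outcomes. A sketch in Python (not part of the course material):

```python
from fractions import Fraction
from itertools import product

weight = Fraction(1, 16)                       # p = 1/2: all outcomes equally likely
outcomes = list(product([1, -1], repeat=4))    # (Y1, Y2, Y3, Y4)

def X(ys, n):
    """X_n = Y_1 + ... + Y_n."""
    return sum(ys[:n])

# E[(X_4)^2] should be 4.
e_x4_sq = sum(weight * X(ys, 4) ** 2 for ys in outcomes)
print(e_x4_sq)

# E[X_4 | X_3 = x] should be x  (= X_3 + 2p - 1 with p = 1/2), for each value of X_3.
for x in (-3, -1, 1, 3):
    num = sum(weight * X(ys, 4) for ys in outcomes if X(ys, 3) == x)
    den = sum(weight            for ys in outcomes if X(ys, 3) == x)
    assert num / den == x
```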
Definition 20. A stochastic process (in discrete time) is a sequence of random variables indexed by time. E.g. in the previous example, both \(X_0, X_1, X_2, X_3, X_4\) and \(Y_1, Y_2, Y_3, Y_4\) were stochastic processes.
A stochastic process \(X_0, X_1, \dots, X_N\) is a Markov process if, for each \(n=0,1,2,\dots, N-1\) and for any function \(f(x)\), there exists a function \(g(x)\) (depending on \(n\) and \(f\)) such that: \[\begin{align} \mathbb{E}\left[ f(X_{n+1}) |X_0, X_1, \dots, X_{n} \right] = g(X_n). \tag{2.1}\end{align}\] That is, all the information about the future of the process that is contained in its history \(X_0, \dots, X_n\) is contained in its current value \(X_n\).
Note that, when (2.1) holds, then \(g(X_n) = \mathbb{E}\left[ f(X_{n+1})|X_n \right]\).
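For the simple random walk of Example 19, (2.1) holds with \(g(x) = p\,f(x+1) + (1-p)\,f(x-1)\), since \(Y_{n+1}\) is independent of the history, so each history with \(X_n = x\) gives the same conditional expectation. A numerical sketch (in Python, not part of the course; the choices \(p = 0.3\) and \(f(x) = x^2\) are ours), checking \(g(X_3) = \mathbb{E}[f(X_4)|X_3]\):

```python
from itertools import product

p = 0.3                          # our choice of p, for illustration
f = lambda x: x * x              # our choice of test function f

# Enumerate all length-4 flip sequences with their probabilities.
paths = []
for flips in product([1, -1], repeat=4):
    prob = 1.0
    for y in flips:
        prob *= p if y == 1 else 1 - p
    xs = [sum(flips[:k]) for k in range(5)]      # X_0, X_1, ..., X_4
    paths.append((prob, xs))

g = lambda x: p * f(x + 1) + (1 - p) * f(x - 1)  # the claimed g in (2.1), n = 3

def cond_exp(x):
    """E[f(X_4) | X_3 = x], by direct enumeration."""
    num = sum(pr * f(xs[4]) for pr, xs in paths if xs[3] == x)
    den = sum(pr            for pr, xs in paths if xs[3] == x)
    return num / den

for x in (-3, -1, 1, 3):
    print(x, cond_exp(x), g(x))    # the two columns agree
```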