6. Random vectors#
Essentially, an \(n\)-dimensional random vector is an \(n\)-tuple of random variables. These are the objects of study in the present chapter. We will discuss higher-dimensional generalizations of many of the gadgets that we studied in the context of random variables back in Chapter 3, including probability measures induced by random vectors, known as joint distributions. These latter distributions will be used to generalize to random variables the notions of conditional probability and independence that we first studied back in Chapter 2, which are some of the most important concepts in all probability theory. A careful study of this chapter is absolutely critical for the rest of the book!
6.1. Motivation#
To introduce random vectors, let’s return to the housing dataset that we studied in a previous programming assignment, which contains data on \(2{,}930\) houses in Ames, Iowa. We are interested in two particular features in the dataset, area and selling price:
Show code cell source
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline
import seaborn as sns
import warnings
warnings.filterwarnings("ignore")
plt.style.use('../aux-files/custom_style_light.mplstyle')
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
url = 'https://raw.githubusercontent.com/jmyers7/stats-book-materials/main/data/data-3-1.csv'
df = pd.read_csv(url, usecols=['area', 'price'])
df.describe()
| | area | price |
|---|---|---|
| count | 2930.000000 | 2930.000000 |
| mean | 1499.690444 | 180.796060 |
| std | 505.508887 | 79.886692 |
| min | 334.000000 | 12.789000 |
| 25% | 1126.000000 | 129.500000 |
| 50% | 1442.000000 | 160.000000 |
| 75% | 1742.750000 | 213.500000 |
| max | 5642.000000 | 755.000000 |
The areas are measured in square feet, while the prices are measured in thousands of US dollars. We label the area observations as \(x\)’s and the price observations as \(y\)’s, so that our dataset consists of two lists of \(m=2{,}930\) numbers:

\[ x_1,x_2,\ldots,x_m \quad \text{and} \quad y_1,y_2,\ldots,y_m. \]
We conceptualize these as observed values corresponding to two random variables \(X\) and \(Y\).
We may plot histograms (and KDEs) for the empirical distributions of the datasets:
Show code cell source
_, axes = plt.subplots(ncols=2, figsize=(10, 4))
sns.histplot(data=df, x='area', ax=axes[0], ec='black', stat='density', kde=True)
axes[0].set_ylabel('density')
sns.histplot(data=df, x='price', ax=axes[1], ec='black', stat='density', kde=True)
axes[1].set_ylabel('density')
plt.tight_layout()
Whatever information we might glean from these histograms, the information is about the two random variables \(X\) and \(Y\) in isolation from each other. But because we expect that the size of a house and its selling price might be (strongly) related, it might be more informative to study \(X\) and \(Y\) in tandem with each other. One way to do this is via the scatter plots that we produced in a previous programming assignment:
Show code cell source
sns.scatterplot(data=df, x='area', y='price')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
This plot confirms exactly what we expected! Since the pattern of points generally follows a positively sloped line, we may conclude that as the size \(X\) of a house increases, its selling price \(Y\) tends to increase as well.
To reiterate:
We would not have been able to discover this relation between \(X\) and \(Y\) had we studied these two variables in isolation from each other. This suggests that, given any pair of random variables \(X\) and \(Y\), we might obtain valuable information if we study them together as a single object.
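Before we formalize anything, it is worth quantifying the visual impression from the scatter plot. The following quick check (my own addition, assuming the DataFrame `df` loaded above is still in memory) computes the sample correlation between the two columns; a value near \(+1\) indicates a strong positive linear association between size and selling price.

```python
# sample (Pearson) correlation between area and price, using the DataFrame
# `df` loaded earlier in this section; values near +1 indicate a strong
# positive linear relationship
print(df['area'].corr(df['price']))
```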
6.2. \(2\)-dimensional random vectors#
So, what do you get when you combine two random variables into a single object? Here’s the answer:
Let \(S\) be a probability space. A \(2\)-dimensional random vector is a function

\[ \mathbf{X} = (X_1, X_2) : S \to \mathbb{R}^2. \]

Thus, we may write \(\mathbf{X}(s) = (X_1(s), X_2(s))\) for each sample point \(s\in S\), where

\[ X_1 : S \to \mathbb{R} \quad \text{and} \quad X_2 : S \to \mathbb{R} \]
are random variables. When we do so, the random variables \(X_1\) and \(X_2\) are called the components of the random vector \(\mathbf{X}\).
So, a \(2\)-dimensional random vector is nothing but a pair of random variables. That’s it. We will often write a random vector simply as a pair \((X_1,X_2)\), where \(X_1\) and \(X_2\) are the component random variables. In the example in the previous section, we have the random vector \((X,Y)\) consisting of the size of a house and its selling price; notice here that \(X\) does not stand for the random vector, but rather its first component.
So what’s the big deal? Simply putting two random variables together into a random vector doesn’t seem like it would lead to anything useful. But do you remember that every random variable \(X\) induces a probability distribution \(P_X\) on \(\mathbb{R}\)? (If not, look here.) As I will now show you, every random vector \((X,Y)\) does the same, inducing a probability measure denoted \(P_{XY}\), but this time the measure lives on the plane \(\mathbb{R}^2\):
Here is the official definition. It will be worth comparing this to Definition 3.2.
Let \((X,Y):S \to \mathbb{R}^2\) be a \(2\)-dimensional random vector on a probability space \(S\) with probability measure \(P\). We define the probability measure of \((X,Y)\), denoted \(P_{XY}\), via the formula

(6.1)#\[ P_{XY}(C) = P \left( \{ s\in S : (X(s),Y(s)) \in C \} \right), \]
for all events \(C\subset \mathbb{R}^2\). The probability measure \(P_{XY}\) is also called the joint distribution or the bivariate distribution of \(X\) and \(Y\).
For a given event \(C\subset \mathbb{R}^2\), notice that the set

(6.2)#\[ \{ s\in S : (X(s),Y(s)) \in C \} \]
inside the probability measure on the right-hand side of (6.1) consists exactly of those sample points \(s\in S\) that land in \(C\) under the action of the random vector \((X,Y)\); I would visualize this as:
Then, the probability \(P_{XY}(C)\) is (by definition!) equal to the probability of the set (6.2) as measured by the original measure \(P\).
There is an alternate notation for \(P_{XY}(C)\) that you’ll see, similar to the alternate notation introduced back in Section 3.2 for \(P_X\). Indeed, instead of writing \(P_{XY}(C)\), we will often write

(6.3)#\[ P((X,Y)\in C). \]

If \(C\) happens to be a product event

\[ C = A \times B, \]

where \(A,B\subset \mathbb{R}\), then we will write

(6.4)#\[ P(X\in A, \ Y\in B) \]
in place of \(P_{XY}(C)\). Notice that the expressions in (6.3) and (6.4) are technically abuses of notation.
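To make this notation concrete, here is a small simulation sketch (my own illustration, not part of the worksheet) that estimates a probability of the form \(P(X\in A, \ Y\in B)\) by Monte Carlo. A bivariate normal distribution from SciPy stands in for the joint distribution \(P_{XY}\), and the event is the product event \(A\times B\) with \(A = [0,1]\) and \(B = [0,2]\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# a stand-in 2-dimensional random vector (X, Y) with a known joint distribution
joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, 0.5], [0.5, 1]])

# draw many observations of (X, Y)
samples = joint.rvs(size=100_000, random_state=rng)
X, Y = samples[:, 0], samples[:, 1]

# estimate P(X in A, Y in B) for the product event A = [0, 1], B = [0, 2] by
# counting the fraction of sampled points that land in A x B
estimate = np.mean((0 <= X) & (X <= 1) & (0 <= Y) & (Y <= 2))
print(f'estimate of P(0 <= X <= 1, 0 <= Y <= 2): {estimate:.4f}')
```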
Problem Prompt
Do problem 1 on the worksheet.
Before continuing, it might be worth briefly reviewing the definitions of discrete and bivariate continuous probability distributions (see Section 1.6 for the first, and Section 1.11 for the second). The first types of distributions were defined in terms of the existence of probability mass functions, while the second were defined in terms of the existence of probability density functions. We use these definitions in:
Let \((X,Y)\) be a \(2\)-dimensional random vector.
We shall say \((X,Y)\) is discrete, or that \(X\) and \(Y\) are jointly discrete, if the joint probability distribution \(P_{XY}\) is discrete. In other words, we require that there exists a joint probability mass function \(p(x,y)\) such that
\[ P((X,Y)\in C) = \sum_{(x,y)\in C} p(x,y) \]for all events \(C \subset \mathbb{R}^2\).
We shall say \((X,Y)\) is continuous, or that \(X\) and \(Y\) are jointly continuous, if the joint probability distribution \(P_{XY}\) is continuous. In other words, we require that there exists a joint probability density function \(f(x,y)\) such that
\[ P\left( (X,Y)\in C \right) = \iint_C f(x,y) \ \text{d}x \text{d}y \]for all events \(C\subset \mathbb{R}^2\).
So, the component random variables \(X\) and \(Y\) of a random vector \((X,Y)\) may be discrete or continuous, and the random vector \((X,Y)\) itself may be discrete or continuous. What are the relations between these properties? Answer:
Let \((X,Y)\) be a \(2\)-dimensional random vector.
The random vector \((X,Y)\) is discrete if and only if both \(X\) and \(Y\) are discrete.
If \((X,Y)\) is continuous, then \(X\) and \(Y\) are both continuous. However, it does not necessarily follow that if both \(X\) and \(Y\) are continuous, then so too is \((X,Y)\).
Part of this theorem will be proved below in Theorem 6.3.
The reason that individual continuity of \(X\) and \(Y\) does not imply continuity of \((X,Y)\) can be explained by observing that if \((X,Y)\) were continuous, then the probability that \((X,Y)\) takes values on any given line in \(\mathbb{R}^2\) is \(0\). Indeed, this is because probabilities of bivariate continuous distributions are volumes under density surfaces, and there is no volume above a line in the plane. But if \(X\) is continuous and \(X=Y\), then the probability that \((X,Y)\) takes values on the line \(x=y\) is exactly \(1\). Thus, the random vector \((X,Y)\) cannot be continuous!
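A quick simulation (again, my own illustration) makes this counterexample tangible: take \(X\) standard normal and define \(Y=X\). Every observed pair \((x,y)\) lands exactly on the line \(y=x\), so the joint distribution puts probability \(1\) on a set to which any density surface would assign zero volume.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# X is continuous (standard normal), and Y is defined to be equal to X
X = rng.normal(size=100_000)
Y = X.copy()

# the event {X = Y} has empirical probability 1, even though a jointly
# continuous (X, Y) would assign probability 0 to any line in the plane
print('estimated P(X = Y):', np.mean(X == Y))
```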
Problem Prompt
Do problems 2-4 on the worksheet.
6.3. Bivariate distribution functions#
We now generalize the cumulative distribution functions from Section 3.4 to \(2\)-dimensional random vectors.
Let \((X,Y)\) be a \(2\)-dimensional random vector. The distribution function of \((X,Y)\) is the function \(F:\mathbb{R}^2 \to \mathbb{R}\) defined by

\[ F(x,y) = P(X \leq x, \ Y \leq y). \]
In particular:
If \((X,Y)\) is discrete with probability mass function \(p(x,y)\), then
\[ F(x,y) = \sum_{x^\star\leq x, \ y^\star \leq y} p(x^\star, y^\star). \]If \((X,Y)\) is continuous with probability density function \(f(x,y)\), then
\[ F(x,y) = \int_{-\infty}^y \int_{-\infty}^x f(x^\star, y^\star) \ \text{d}x^\star \text{d} y^\star. \]
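To see the continuous case of this definition in action, here is a short sketch (my own, using SciPy's built-in bivariate normal as the joint distribution) that evaluates \(F(x,y) = P(X\leq x, \ Y\leq y)\) both with SciPy's `cdf` method and with a brute-force simulation, so the two numbers can be compared.

```python
import numpy as np
from scipy import stats

# a stand-in continuous random vector (X, Y): independent standard normals
joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, 0], [0, 1]])

# F(x, y) = P(X <= x, Y <= y), evaluated at (x, y) = (0.5, 1.0)
x, y = 0.5, 1.0
F_scipy = joint.cdf([x, y])

# Monte Carlo check of the same value
rng = np.random.default_rng(seed=1)
samples = joint.rvs(size=200_000, random_state=rng)
F_mc = np.mean((samples[:, 0] <= x) & (samples[:, 1] <= y))

print(f'F({x}, {y}) via scipy: {F_scipy:.4f}, via simulation: {F_mc:.4f}')
```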
Problem Prompt
Do problem 5 on the worksheet.
6.4. Marginal distributions#
I would visualize a \(2\)-dimensional random vector \((X,Y)\) along with its component random variables \(X\) and \(Y\) as follows:
Here, the two maps labeled “proj” are what we mathematicians call the universal projection maps; the first one, on the left, “projects” \(\mathbb{R}^2\) onto the \(x\)-axis, and is given simply by chopping off the \(y\)-coordinate:

\[ \text{proj} : \mathbb{R}^2 \to \mathbb{R}, \quad (x,y) \mapsto x. \]

The second projection, on the right in the diagram, “projects” \(\mathbb{R}^2\) onto the \(y\)-axis by chopping off the \(x\)-coordinate:

\[ \text{proj} : \mathbb{R}^2 \to \mathbb{R}, \quad (x,y) \mapsto y. \]
Notice that the diagram “commutes,” in the sense that the action of \(X\) coincides with the action of the composite map \(\text{proj}\circ (X,Y)\). Thus, if you begin at the sample space \(S\) and proceed to \(\mathbb{R}\) along \(X\), you’ll get the same result as first going along \((X,Y)\) to \(\mathbb{R}^2\), and then going along the projection arrow to \(\mathbb{R}\). The same observations hold for \(Y\).
In this situation, we have four(!) probability measures in the mix. We have the original measure \(P\) on the sample space \(S\), the joint measure \(P_{XY}\) on the plane \(\mathbb{R}^2\), as well as the two measures \(P_X\) and \(P_Y\) on the line \(\mathbb{R}\). My goal is to convince you in this section that these probability measures are all tightly linked to each other.
Let’s focus on the link between \(P_{XY}\) and \(P_X\). Let’s suppose that we have an event \(A\subset \mathbb{R}\) along with the product event \(A \times \mathbb{R} \subset \mathbb{R}^2\). I would visualize this as:
Notice that the product set \(A\times \mathbb{R}\) consists exactly of those ordered pairs \((x,y)\) that land in \(A\) under the projection map. Now, consider the two sets

\[ \{ s\in S : X(s) \in A \} \quad \text{and} \quad \{ s\in S : (X(s),Y(s)) \in A \times \mathbb{R} \}, \]
consisting, respectively, of those sample points \(s\in S\) that land in \(A\) under \(X\) and those sample points that land in \(A \times \mathbb{R}\) under \((X,Y)\). Take a moment to convince yourself that these are just two different descriptions of the same set! Therefore, we may conclude that
\[ P(X\in A) = P\left( \{ s\in S : X(s) \in A \} \right) = P\left( \{ s\in S : (X(s),Y(s)) \in A \times \mathbb{R} \} \right) = P_{XY}(A \times \mathbb{R}), \]

where the first equality follows from the definition of \(P_X\) while the last equality follows from the definition of \(P_{XY}\). This argument essentially amounts to a proof of the following crucial result:
Let \((X,Y)\) be a \(2\)-dimensional random vector with induced probability measure \(P_{XY}\). Then the measures \(P_X\) and \(P_Y\) may be obtained via the formulas

\[ P(X\in A) = P(X\in A, \ Y\in \mathbb{R}) \quad \text{and} \quad P(Y\in B) = P(X\in \mathbb{R}, \ Y\in B) \]
for all events \(A,B\subset \mathbb{R}\). In particular:
If \((X,Y)\) is discrete with probability mass function \(p(x,y)\), then
\[ P(X\in A) = \sum_{x\in A} \sum_{y\in \mathbb{R}} p(x,y) \quad \text{and} \quad P(Y\in B) = \sum_{y\in B} \sum_{x\in \mathbb{R}} p(x,y). \]If \((X,Y)\) is continuous with probability density function \(f(x,y)\), then
\[ P(X\in A) = \int_A \int_{-\infty}^\infty f(x,y) \ \text{d}y \text{d}x \]and
\[ P(Y\in B) = \int_B \int_{-\infty}^\infty f(x,y) \ \text{d}x \text{d}y. \]
In this scenario, the distributions \(P_X\) and \(P_Y\) have special names:
Let \((X,Y)\) be a \(2\)-dimensional random vector. Then the distributions \(P_X\) and \(P_Y\) are called the marginal distributions of \((X,Y)\).
So, just to emphasize:
Marginal distributions are nothing new—the only thing that is new is the terminology.
An immediate consequence of Theorem 6.2 is a description of the marginal probability mass and density functions. We will use this result often:
Let \((X,Y)\) be a \(2\)-dimensional random vector.
If \((X,Y)\) is discrete with probability mass function \(p(x,y)\), then both \(X\) and \(Y\) are discrete with probability mass functions given by
\[ p_X(x) = \sum_{y\in \mathbb{R}} p(x,y) \quad \text{and} \quad p_Y(y) = \sum_{x\in \mathbb{R}}p(x,y). \]If \((X,Y)\) is continuous with probability density function \(f(x,y)\), then both \(X\) and \(Y\) are continuous with probability density functions given by
(6.5)#\[ f_X(x) = \int_{-\infty}^\infty f(x,y) \ \text{d}y \quad \text{and} \quad f_Y(y) = \int_{-\infty}^\infty f(x,y) \ \text{d} x. \]
Here’s how I remember these formulas:
Tip
To obtain the marginal mass \(p_X(x)\) from the joint mass \(p(x,y)\), we “sum out” the dependence of \(p(x,y)\) on \(y\). Likewise for obtaining \(p_Y(y)\) from \(p(x,y)\).
To obtain the marginal density \(f_X(x)\) from the joint density \(f(x,y)\), we “integrate out” the dependence of \(f(x,y)\) on \(y\). Likewise for obtaining \(f_Y(y)\) from \(f(x,y)\).
In the continuous case, you should visualize the formulas (6.5) as integrations over cross-sections of the density surfaces. For example, the following picture is a visualization of the formula for the marginal density function \(f_Y(y)\) evaluated at \(y=5\):
In the picture, we imagine slicing the graph of the density \(f(x,y)\) with the vertical plane \(y=5\). (We must imagine this plane extends off infinitely far in each direction.) Where the plane and the density surface intersect, a curve will be traced out on the plane. The area beneath this cross-sectional curve is exactly the value

\[ f_Y(5) = \int_{-\infty}^\infty f(x,5) \ \text{d}x. \]
This is the visualization for continuous distributions—what might the analogous visualization look like for discrete distributions? Can you sketch the figure on your own?
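We can also check the continuous formula in (6.5) numerically. The sketch below (my own, not from the worksheet) integrates out \(y\) from the joint density of a correlated bivariate normal and compares the result with the known marginal density of \(X\), which is standard normal for this particular joint distribution.

```python
import numpy as np
from scipy import stats, integrate

# a stand-in jointly continuous (X, Y): bivariate normal with correlation 0.7
joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, 0.7], [0.7, 1]])

def marginal_f_X(x):
    # "integrate out" y from the joint density f(x, y) over a wide range
    integrand = lambda y: joint.pdf([x, y])
    value, _ = integrate.quad(integrand, -10, 10)
    return value

# compare with the known marginal: X is standard normal for this joint distribution
for x in [-1.0, 0.0, 2.0]:
    print(f'x = {x:+.1f}: integrated f_X(x) = {marginal_f_X(x):.5f}, '
          f'N(0,1) density = {stats.norm.pdf(x):.5f}')
```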
Problem Prompt
Do problems 6 and 7 on the worksheet.
6.5. Bivariate empirical distributions#
Almost everything that we learned in the previous chapter on data and empirical distributions may be applied in the bivariate setting. Essentially, you just need to change all random variables to random vectors, and put either a “joint” or “bivariate” in front of everything. :)
To illustrate, let’s return to the Ames housing data explored in the first section. There, we had two observed random samples

\[ x_1,x_2,\ldots,x_m \quad \text{and} \quad y_1,y_2,\ldots,y_m \]

of the sizes and the selling prices of \(m=2{,}930\) houses. In the precise language of the previous chapter, we would say that these datasets are observations from the IID random samples

\[ X_1,X_2,\ldots,X_m \quad \text{and} \quad Y_1,Y_2,\ldots,Y_m. \]
Make sure you remember the difference between a random sample and an observed random sample!
However, the \(i\)-th size \(x_i\) and the \(i\)-th price \(y_i\) naturally go together, since they are both referring to the \(i\)-th house. Therefore, we might instead consider our two observed random samples as a single observed random sample

\[ (x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m). \]
But what is this new observed random sample an observation of? Answer:
Let \((X_1,Y_1), (X_2,Y_2),\ldots,(X_m,Y_m)\) be a sequence of \(2\)-dimensional random vectors, all defined on the same probability space.
The random vectors are called a bivariate random sample if they are independent and identically distributed (IID).
Provided that the sequence is a bivariate random sample, an observed bivariate random sample, or a bivariate dataset, is a sequence of pairs of real numbers

\[ (x_1,y_1), (x_2,y_2), \ldots, (x_m,y_m), \]
where \((x_i,y_i)\) is an observation of \((X_i,Y_i)\).
To say that \((x_i,y_i)\) is an observed value of the random vector \((X_i,Y_i)\) simply means that it is in the range of the random vector (as a function with codomain \(\mathbb{R}^2\)). As I remarked in the previous chapter, we haven’t officially defined independence yet—that will come in Definition 6.11 below.
Thus, it might be more natural to say that our housing data constitutes an observed bivariate random sample, rather than just two individual observed univariate random samples. These types of random samples—along with their higher-dimensional cousins called multivariate random samples—are quite common in prediction tasks where we aim to predict the \(y_i\)’s based on the \(x_i\)’s. For example, this is the entire gist of supervised machine learning.
Adapting the definition of empirical distributions of univariate datasets from the previous chapter is also easy:
Let \((x_1,y_1),(x_2,y_2),\ldots,(x_m,y_m)\) be an observed bivariate random sample, i.e., a bivariate dataset. The empirical distribution of the dataset is the discrete probability measure on \(\mathbb{R}^2\) with joint probability mass function

\[ p(x,y) = \frac{\text{number of data points } (x_i,y_i) \text{ that are equal to } (x,y)}{m}. \]
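For a genuinely discrete bivariate dataset, the empirical joint mass function is just a table of relative frequencies, and the empirical marginals come from summing the rows or columns of that table, exactly as in Theorem 6.3. Here is a tiny sketch with a made-up dataset (the numbers are arbitrary, chosen only for illustration):

```python
import pandas as pd

# a small, made-up discrete bivariate dataset with m = 8 observations
toy = pd.DataFrame({'x': [0, 0, 1, 1, 1, 2, 2, 0],
                    'y': [1, 2, 1, 1, 2, 2, 1, 1]})

# empirical joint pmf p(x, y): relative frequency of each observed pair (x, y)
joint_pmf = pd.crosstab(toy['x'], toy['y'], normalize=True)
print(joint_pmf)

# empirical marginal pmfs, obtained by "summing out" the other variable
print(joint_pmf.sum(axis=1))   # p_X(x): sum over y
print(joint_pmf.sum(axis=0))   # p_Y(y): sum over x
```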
We saw in the first section that we may visualize bivariate empirical distributions using scatter plots. Here’s a variation on a scatter plot, which places the marginal empirical distributions in the (where else?) margins of the plot!
Show code cell source
sns.jointplot(data=df, x='area', y='price', marginal_kws={'kde' : True, 'ec': 'black'})
plt.gcf().set_size_inches(w=5, h=5)
plt.tight_layout()
Along the top of the figure we see a histogram (with KDE) of the empirical distribution of the \(x_i\)’s, and along the side we see a histogram (with KDE) of the empirical distribution of the \(y_i\)’s.
This type of figure makes it very clear how marginal distributions are obtained from joint distributions. For example, take a look at:
Show code cell source
df_slice = df[(145 <= df['price']) & (df['price'] <= 155)]
g = sns.JointGrid()
sns.scatterplot(data=df, x='area', y='price', ax=g.ax_joint, alpha=0.1)
sns.scatterplot(data=df_slice, x='area', y='price', ax=g.ax_joint)
ax1 = sns.histplot(data=df, x='area', ax=g.ax_marg_x, ec='black')
ax2 = sns.histplot(data=df, y='price', ax=g.ax_marg_y, ec='black')
for bar in ax2.patches:
bar.set_facecolor('w')
ax2.patches[11].set_facecolor('#FD46FC')
g.set_axis_labels(xlabel='area', ylabel='price')
plt.gcf().set_size_inches(w=5, h=5)
plt.tight_layout()
The height of the highlighted histogram bar on the right is the value \(p_Y(150)\), where \(p_Y\) is the empirical marginal mass function of the price variable \(Y\). Remember, this value is obtained through the formula

\[ p_Y(150) = \sum_{x\in \mathbb{R}} p_{XY}(x, 150), \]
where \(p_{XY}\) is the empirical joint mass function. We visualize this formula as summing the joint mass function \(p_{XY}(x,150)\) along the (highlighted) horizontal slice of the scatter plot where \(y=150\).
What about bivariate versions of KDEs and histograms? Answer:
Show code cell source
_, axes = plt.subplots(ncols=2, figsize=(9, 4), sharey=True, sharex=True)
sns.kdeplot(data=df, x='area', y='price', ax=axes[0])
sns.histplot(data=df, x='area', y='price', cbar=True, ax=axes[1], cbar_kws={'label': 'density'}, stat='density')
axes[0].set_title('bivariate kde')
axes[1].set_title('bivariate histogram')
plt.tight_layout()
The density surface of a bivariate empirical distribution is not actually a continuous surface, but if it were, you could imagine the curves in the KDE on the left as its contours. In other words, these are the curves over which the density surface has constant height. It appears that the density surface has either a global minimum or a global maximum near \((1000,125)\), but we can’t tell which from the KDE alone because the contours are not labeled.
On the right-hand side of the figure above, we have a bivariate version of a histogram. While a histogram for a univariate dataset is obtained by subdividing the line \(\mathbb{R}\) into bins, for a bivariate dataset the plane \(\mathbb{R}^2\) is subdivided into rectangular bins. Then, over each of these rectangular bins we would place a \(3\)-dimensional “bar” whose height is equal (or proportional) to the number of data points that fall in the bin; thus, a histogram for bivariate data should really live in three dimensions. However, the histogram above shows only the bins in the plane \(\mathbb{R}^2\), and it displays the heights of the “bars” by color, with darker shades of blue indicating a larger number of data points are contained in the bin. It is evident from this diagram that the global extreme point identified in the KDE is, in fact, a global maximum.
6.6. Conditional distributions#
Given two events \(A\) and \(B\) in a probability space, we learned previously that the conditional probability of \(A\) given \(B\) is defined via the formula

\[ P(A|B) = \frac{P(A\cap B)}{P(B)}, \]
provided that \(P(B) \neq 0\). The probability on the left is the probability that \(A\) occurs, given that you already know the event \(B\) has occurred. One may view \(P(-|B)\) (the “\(-\)” means “blank”) as a probability measure with sample space \(B\) and where all events are of the form \(A\cap B\). It is worth repeating this, in slightly simpler language:
Passing from plain probabilities to conditional probabilities has the effect of shrinking the sample space to the event that you are “conditioning on.”
Let’s see how this might work with the probability measures induced by random variables.
To get a feel for what we’re going for, let’s return to our housing data
and its bivariate empirical distribution that we studied in the previous section. Suppose that we are interested in studying the (empirical) distribution of sizes \(x\) of houses with fixed sale price \(y=150\). If we set \(B = \{150\}\), then this means we want to shrink the range of the \(y\)’s down from all of \(\mathbb{R}\) to the simple event \(B\). The slice of data points with \(y=150\) are highlighted in the following scatter plot:
Show code cell source
sns.scatterplot(data=df, x='area', y='price', alpha=0.1)
sns.scatterplot(data=df_slice, x='area', y='price')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
Then, after cutting down the range of \(y\)’s to lie in \(B=\{150\}\), we wonder what the distribution over the sizes \(x\) looks like. Answer:
Show code cell source
g = sns.JointGrid()
scatter = sns.scatterplot(data=df, x='area', y='price', ax=g.ax_joint, alpha=0.1)
sns.scatterplot(data=df_slice, x='area', y='price', ax=g.ax_joint)
sns.histplot(data=df_slice, x='area', ax=g.ax_marg_x, ec='black', color='#FD46FC', kde=True)
g.ax_marg_y.remove()
g.set_axis_labels(xlabel='area', ylabel='price')
plt.gcf().set_size_inches(w=5, h=5)
plt.tight_layout()
The histogram along the top of the figure shows the empirical distribution of the \(x\)’s belonging to data points \((x,y)\) with \(y=150\). If we remember that our original random variables in the first section were \(X\) and \(Y\), then this empirical distribution is an approximation to the conditional distribution of \(X\) given \(Y=150\). (The precise definition is below.) So, the histogram along the top of the scatter plot displays an empirical conditional distribution.
Alright. We’re ready for the definitions. At this level, it turns out that the easiest way to define conditional distributions is via mass and density functions:
Let \(X\) and \(Y\) be random variables.
Suppose \((X,Y)\) is discrete, so that both \(X\) and \(Y\) are discrete as well. The conditional probability mass function of \(X\) given \(Y\) is the function
\[ p_{X|Y}(x|y) = \frac{p_{XY}(x,y)}{p_Y(y)}, \]defined for all those \(y\) such that \(p_Y(y)\neq 0\).
Suppose \((X,Y)\) is continuous, so that both \(X\) and \(Y\) are continuous as well. The conditional probability density function of \(X\) given \(Y\) is the function
\[ f_{X|Y}(x|y) = \frac{f_{XY}(x,y)}{f_Y(y)}, \]defined for all those \(y\) such that \(f_Y(y)\neq 0\).
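In the discrete case, these definitions are easy to carry out in code: divide each column of the joint mass table by the corresponding marginal. The following sketch (my own, with a hypothetical joint mass function) does exactly that and confirms that each conditional mass function sums to \(1\), which is the content of the theorem stated below.

```python
import pandas as pd

# a hypothetical joint pmf p_{XY}(x, y) with x in {0, 1, 2} and y in {0, 1}
p_joint = pd.DataFrame([[0.10, 0.20],
                        [0.30, 0.15],
                        [0.05, 0.20]],
                       index=pd.Index([0, 1, 2], name='x'),
                       columns=pd.Index([0, 1], name='y'))

# marginal pmf p_Y(y): "sum out" x
p_Y = p_joint.sum(axis=0)

# conditional pmf p_{X|Y}(x|y) = p_{XY}(x, y) / p_Y(y), one column per value of y
p_X_given_Y = p_joint / p_Y
print(p_X_given_Y)

# each column sums to 1, so each p_{X|Y}(-|y) is a genuine mass function in x
print(p_X_given_Y.sum(axis=0))
```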
Let’s get some practice:
Problem Prompt
Do problems 8 and 9 on the worksheet.
Just calling \(p_{X|Y}(x|y)\) a probability mass function does not make it so, and similarly for \(f_{X|Y}(x|y)\). So, in what sense do these define probability measures?
Let \(X\) and \(Y\) be random variables.
In the case that \((X,Y)\) is discrete, for fixed \(y\) with \(p_Y(y)\neq 0\), the function \(p_{X|Y}(x|y)\) is a probability mass function in the variable \(x\). In particular, we have
(6.6)#\[ P(X\in A|Y=y) = \sum_{x\in A} p_{X|Y}(x|y), \]for all events \(A\subset \mathbb{R}\).
In the case that \((X,Y)\) is continuous, for fixed \(y\) with \(f_Y(y)\neq 0\), the function \(f_{X|Y}(x|y)\) is a probability density function in the variable \(x\).
Let me show you how the discrete case works, leaving you to adapt the arguments to the continuous case on your own. First, note that \(p_{X|Y}(x|y)\) is nonnegative, as all PMFs must be. Thus, we need only check that it sums to \(1\) over all \(x\):

\[ \sum_{x\in \mathbb{R}} p_{X|Y}(x|y) = \sum_{x\in \mathbb{R}} \frac{p_{XY}(x,y)}{p_Y(y)} = \frac{p_Y(y)}{p_Y(y)} = 1, \]
where I used Theorem 6.2 in the second equality. The same type of argument will prove (6.6), which I will also let you do on your own.
Warning
Notice the absence of an analogous equation to (6.6) for conditional density functions! This is because the left-hand side of (6.6) is equal to the ratio

\[ \frac{P(X\in A, \ Y=y)}{P(Y=y)}. \]
But both the numerator and denominator of this fraction are \(0\) in the case that \(Y\) is continuous! So what probability does \(f_{X|Y}(x|y)\) compute?
Answering this question precisely is hard—this is part of what I was alluding to in my margin note above. But here’s the rough idea: Suppose that \(\epsilon\) is a small, positive real number. Then the conditional probability

\[ P(X\in A \mid y \leq Y \leq y+\epsilon) \]

at least has a chance to be well-defined, since the denominator

\[ P(y \leq Y \leq y+\epsilon) = \int_y^{y+\epsilon} f_Y(y^\star) \ \text{d}y^\star \]

can be nonzero. But we also have

\[ P(y \leq Y \leq y+\epsilon) \approx \epsilon f_Y(y) \quad \text{and} \quad P(X\in A, \ y \leq Y \leq y+\epsilon) \approx \epsilon \int_A f_{XY}(x,y) \ \text{d}x, \]

and so substituting these last two expressions into the conditional probability gives

\[ P(X\in A \mid y \leq Y \leq y+\epsilon) \approx \frac{\epsilon \int_A f_{XY}(x,y) \ \text{d}x}{\epsilon f_Y(y)} = \int_A f_{X|Y}(x|y) \ \text{d}x. \]
The interpretation is this: For fixed \(y\), integrating the conditional density \(f_{X|Y}(x|y)\) over \(x\in A\) yields the probability that \(X\in A\), given that \(Y\) is in an “infinitesimal” neighborhood of \(y\). (This “infinitesimal” neighborhood is represented by \([y,y+\epsilon]\), when \(\epsilon\) is really small.)
In spite of this warning, we shall still imagine that the conditional density \(f_{X|Y}(x|y)\) is the density of the conditional probability \(P(X\in A | Y=y)\), though technically the latter is undefined according to the standard definition of conditional probability. You will see this in:
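Here is a simulation sketch of the warning's interpretation (my own illustration, using the standard fact that for a bivariate normal with zero means, unit variances, and correlation \(\rho\), the conditional distribution of \(X\) given \(Y=y\) is \(\mathcal{N}(\rho y, \ 1-\rho^2)\)). We estimate \(P(X\in A \mid y \leq Y \leq y+\epsilon)\) for a small \(\epsilon\) by brute force and compare it with the integral of the conditional density over \(A\).

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=3)
rho = 0.7
joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])

# draw many observations of (X, Y)
samples = joint.rvs(size=1_000_000, random_state=rng)
X, Y = samples[:, 0], samples[:, 1]

# condition on Y lying in a small window [y, y + eps], then estimate P(0 <= X <= 1 | ...)
y, eps = 0.5, 0.05
in_window = (y <= Y) & (Y <= y + eps)
mc_estimate = np.mean((0 <= X[in_window]) & (X[in_window] <= 1))

# for this joint distribution, X | Y = y is N(rho * y, 1 - rho^2), so the
# integral of f_{X|Y}(x|y) over A = [0, 1] can be computed exactly
cond = stats.norm(loc=rho * y, scale=np.sqrt(1 - rho ** 2))
exact_integral = cond.cdf(1) - cond.cdf(0)

print(f'simulated conditional probability: {mc_estimate:.4f}')
print(f'integral of conditional density:   {exact_integral:.4f}')
```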
Problem Prompt
Do problems 10 and 11 on the worksheet.
6.7. The Law of Total Probability and Bayes’ Theorem for random variables#
Back in Section 2.7, we studied the Law of Total Probability and Bayes’ Theorem for arbitrary probability measures. In this section, we adapt these results to the probability measures induced by random variables.
(The Law of Total Probability (for random variables))
Let \(X\) and \(Y\) be random variables.

If \(X\) and \(Y\) are jointly discrete, then

(6.7)#\[ p_X(x) = \sum_{y\in \mathbb{R}} p_{X|Y}(x|y) p_Y(y) \]

for all \(x\in \mathbb{R}\).

If \(X\) and \(Y\) are jointly continuous, then

(6.8)#\[ f_X(x) = \int_{-\infty}^\infty f_{X|Y}(x|y) f_Y(y) \ \text{d}y \]

for all \(x\in \mathbb{R}\).
Let me show you how to prove (6.8); I will leave the other equality (6.7) for you to do on your own. But this is just an easy computation:

\[ \int_{-\infty}^\infty f_{X|Y}(x|y) f_Y(y) \ \text{d}y = \int_{-\infty}^\infty f_{XY}(x,y) \ \text{d}y = f_X(x). \]
We used the definition of the conditional density in moving from the first integral to the second, while the second equality follows from Theorem 6.3.
(Bayes’ Theorem (for random variables))
Let \(X\) and \(Y\) be random variables.

If \(X\) and \(Y\) are jointly discrete, then

(6.9)#\[ p_{X|Y}(x|y) = \frac{p_{Y|X}(y|x) p_X(x)}{p_Y(y)} \]

for all \(x,y\) for which the mass functions are defined.

If \(X\) and \(Y\) are jointly continuous, then

(6.10)#\[ f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x) f_X(x)}{f_Y(y)} \]

for all \(x,y\) for which the densities are defined.
The proofs of these two equations follow immediately from the definitions of conditional mass and density functions. In applications, you will often see Bayes’ Theorem combined with the Law of Total Probability, the latter allowing one to compute the denominators in (6.9) and (6.10). For example, in the continuous case, we have

\[ f_{X|Y}(x|y) = \frac{f_{Y|X}(y|x) f_X(x)}{\int_{-\infty}^\infty f_{Y|X}(y|x^\star) f_X(x^\star) \ \text{d}x^\star} \]
for all \(x,y\) for which the densities are defined. The advantage gained by writing the denominator like this is that one only needs information about the conditional density \(f_{Y|X}(y|x)\) and the marginal density \(f_X(x)\) in order to compute the other conditional density \(f_{X|Y}(x|y)\).
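As a numerical illustration of this combination (my own sketch, with a made-up model: \(X \sim \mathcal{N}(0,1)\) and, given \(X=x\), \(Y \sim \mathcal{N}(x, 0.5^2)\)), the code below computes \(f_{X|Y}(x|y)\) on a grid by evaluating the numerator \(f_{Y|X}(y|x)f_X(x)\) and obtaining the denominator \(f_Y(y)\) from the Law of Total Probability. For this particular model the answer is known in closed form (normal-normal conjugacy), which gives a convenient check.

```python
import numpy as np
from scipy import stats, integrate

# made-up model: X ~ N(0, 1), and given X = x, Y ~ N(x, 0.5^2)
def f_X(x):
    return stats.norm(loc=0, scale=1).pdf(x)

def f_Y_given_X(y, x):
    return stats.norm(loc=x, scale=0.5).pdf(y)

y_obs = 1.2                          # the observed value of Y
x_grid = np.linspace(-5, 5, 2001)    # grid over which to evaluate densities

# numerator of Bayes' Theorem, and the Law-of-Total-Probability denominator f_Y(y)
numerator = f_Y_given_X(y_obs, x_grid) * f_X(x_grid)
denominator = integrate.trapezoid(numerator, x_grid)
f_X_given_Y = numerator / denominator

# sanity checks: the result integrates to 1 and matches the known posterior,
# which for this model is N(0.8 * y_obs, variance 0.2)
print('integral of f_{X|Y}(x|y):', integrate.trapezoid(f_X_given_Y, x_grid))
exact = stats.norm(loc=0.8 * y_obs, scale=np.sqrt(0.2)).pdf(x_grid)
print('max abs deviation from exact:', np.max(np.abs(f_X_given_Y - exact)))
```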
Note
Notice that the random vector \((X,Y)\) was required to be either discrete or continuous in both the Law of Total Probability and Bayes’ Theorem. Actually, there are versions of many of the definitions and results in this chapter for “mixed” joint distributions in which one of \(X\) or \(Y\) is continuous and the other is discrete (as long as an analog of a mass or density function exists). I won’t state these more general results here because it would be incredibly awkward and tedious to cover all possible cases using our limited theory and language. This is a situation where the machinery of measure theory is needed.
In any case, you’ll see an example of one of these “mixed” distributions in the following Problem Prompt in which I introduce you to the canonical example in Bayesian statistics. (See also Section 6.10 below.)
Problem Prompt
Do problem 12 on the worksheet.
6.8. Random vectors in arbitrary dimensions#
Up till now in this chapter, we have studied pairs of random variables \(X\) and \(Y\), or what is the same thing, \(2\)-dimensional random vectors \((X,Y)\). But there’s an obvious generalization of these considerations to higher dimensions:
Let \(S\) be a probability space and \(n\geq 1\) an integer. An \(n\)-dimensional random vector is a function

\[ \mathbf{X} = (X_1,X_2,\ldots,X_n) : S \to \mathbb{R}^n. \]

Thus, we may write

\[ \mathbf{X}(s) = (X_1(s), X_2(s), \ldots, X_n(s)) \]
for each sample point \(s\in S\). When we do so, the functions \(X_1,X_2,\ldots,X_n\) are ordinary random variables that are called the components of the random vector \(\mathbf{X}\).
Random vectors in dimensions \(>2\) induce joint probability distributions, just like their \(2\)-dimensional relatives:
Let \((X_1,X_2,\ldots,X_n):S \to \mathbb{R}^n\) be an \(n\)-dimensional random vector on a probability space \(S\) with probability measure \(P\). We define the probability measure of the random vector, denoted \(P_{X_1X_2\cdots X_n}\), via the formula

(6.11)#\[ P_{X_1X_2\cdots X_n}(C) = P \left( \{ s\in S : (X_1(s),X_2(s),\ldots,X_n(s)) \in C \} \right) \]
for all events \(C\subset \mathbb{R}^n\). The probability measure \(P_{X_1X_2\cdots X_n}\) is also called the joint distribution of the component random variables \(X_1,X_2,\ldots,X_n\).
The equation (6.11) is the precise definition of the joint distribution for any event \(C\) in \(\mathbb{R}^n\). If \(C\) happens to be a product event

\[ C = A_1 \times A_2 \times \cdots \times A_n \]

for some events \(A_1,A_2,\ldots,A_n\subset \mathbb{R}\), then we shall always write

(6.12)#\[ P(X_1\in A_1, \ X_2\in A_2, \ldots, X_n \in A_n) \]
in place of \(P_{X_1X_2\cdots X_n}(C)\). Again, this expression (6.12) is technically an abuse of notation.
Almost all the definitions and results that we considered above for \(2\)-dimensional random vectors have obvious generalizations to higher-dimensional random vectors. This includes higher-dimensional marginal and conditional distributions, as well as Laws of Total Probability and Bayes’ Theorems. Provided that you understand the \(2\)-dimensional situation well, I am confident that the higher-dimensional case should pose no problem. Therefore, we will content ourselves with working through a few example problems, in place of an exhaustive account of all the definitions and theorems.
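For instance, the Monte Carlo recipe used earlier for \(2\)-dimensional random vectors carries over verbatim to higher dimensions. The sketch below (my own, with an arbitrary \(3\)-dimensional normal distribution) estimates a joint probability over a product event and a marginal probability for a single component.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=5)

# a stand-in 3-dimensional random vector (X1, X2, X3)
joint = stats.multivariate_normal(mean=[0, 0, 0],
                                  cov=[[1.0, 0.3, 0.0],
                                       [0.3, 1.0, 0.5],
                                       [0.0, 0.5, 1.0]])
samples = joint.rvs(size=200_000, random_state=rng)

# estimate a joint probability over a product event A1 x A2 x A3
event = (samples[:, 0] <= 0) & (samples[:, 1] <= 0) & (np.abs(samples[:, 2]) <= 1)
print('estimate of P(X1 <= 0, X2 <= 0, |X3| <= 1):', np.mean(event))

# a marginal probability constrains only one component
print('estimate of P(X2 <= 0):', np.mean(samples[:, 1] <= 0))
```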
Problem Prompt
Do problems 13 and 14 on the worksheet.
6.9. Independence#
Because of its central role in the definitions of random samples and datasets (see Definition 5.1 and Definition 6.6), independence is one of the most important concepts in all probability and statistics. We already studied a form of independence back in Section 2.5, where we saw that two events \(A\) and \(B\) in a probability space are independent if

(6.13)#\[ P(A\cap B) = P(A) P(B). \]

As long as \(P(B)\neq 0\) (if not, then both sides of this last equation are \(0\)), independence is the same as

(6.14)#\[ P(A|B) = P(A). \]
This latter equation is telling us that \(A\) and \(B\) are independent provided that the conditional probability of \(A\), given \(B\), is just the plain probability of \(A\). In other words, if \(A\) and \(B\) are independent, then whether \(B\) has occurred has no impact on the probability of \(A\) occurring.
Our mission in this section is to adapt these definitions to the probability measures induced by random variables and vectors. The key step is to replace the left-hand side of (6.13) with a joint probability distribution. We make this replacement in the next definition:
Let \(\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_m\) be random vectors, all defined on the same probability space, but possibly of different dimensions. Then these random vectors are said to be independent if

\[ P(\mathbf{X}_1\in C_1, \ \mathbf{X}_2\in C_2, \ldots, \mathbf{X}_m\in C_m) = P(\mathbf{X}_1\in C_1) P(\mathbf{X}_2\in C_2) \cdots P(\mathbf{X}_m\in C_m) \]
for all events \(C_1,C_2,\ldots,C_m\). If the vectors are not independent, they are called dependent.
Notice that no conditions are placed on the random vectors \(\mathbf{X}_1,\ldots,\mathbf{X}_m\) in this definition, such as assuming they are discrete or continuous. However, provided that mass or density functions exist, then convenient criteria for independence may be obtained in terms of these functions. For simplicity, we will only state these criteria in the case of a sequence of random variables, leaving you with the task of generalizing to sequences of random vectors.
(Mass/Density Criteria for Independence)
Let \(X_1,X_2,\ldots,X_m\) be random variables.
Suppose that the random variables are jointly discrete. Then they are independent if and only if
\[ p_{X_1X_2\cdots X_m}(x_1,x_2,\ldots,x_m) = p_{X_1}(x_1)p_{X_2}(x_2) \cdots p_{X_m}(x_m) \]for all \(x_1,x_2,\ldots,x_m \in \mathbb{R}\).
Suppose that the random variables are jointly continuous. Then they are independent if and only if
(6.15)#\[ f_{X_1X_2\cdots X_m}(x_1,x_2,\ldots,x_m) = f_{X_1}(x_1)f_{X_2}(x_2) \cdots f_{X_m}(x_m) \]for all \(x_1,x_2,\ldots,x_m \in \mathbb{R}\).
Let’s outline a quick proof of one direction of the equivalence in the case that there are only two jointly continuous random variables \(X\) and \(Y\). To begin, notice that we always have equalities

\[ P(X\in A, \ Y\in B) = \int_A \int_B f_{XY}(x,y) \ \text{d}y \text{d}x \quad \text{and} \quad P(X\in A) P(Y\in B) = \int_A \int_B f_X(x) f_Y(y) \ \text{d}y \text{d}x \]

for all events \(A,B\subset \mathbb{R}\), no matter if \(X\) and \(Y\) are independent or not. But if the joint density factors into the product of the marginals, then this shows that

\[ P(X\in A, \ Y\in B) = P(X\in A) P(Y\in B), \]
which proves \(X\) and \(Y\) are independent.
Conversely, if \(X\) and \(Y\) are independent, then we have

(6.16)#\[ P(X\in A, \ Y\in B) = \int_A \int_B f_X(x) f_Y(y) \ \text{d}y \text{d}x \]

for all events \(A,B\subset \mathbb{R}\). From this we would be tempted to conclude that the joint density factors into the marginals, but there’s actually a mathematical subtlety hiding. Indeed, to prove that the product \(f_X(x)f_Y(y)\) serves as the joint density, we would need to show that

(6.17)#\[ P\left( (X,Y)\in C \right) = \iint_C f_X(x) f_Y(y) \ \text{d}x \text{d}y \]
for all events \(C\subset \bbr^2\); but (6.16) only establishes this equality for product events of the form \(C = A \times B\). Therefore, in order to obtain the desired factorization, one needs to show additionally that (6.16) implies (6.17) holds for all events \(C\). This is true, but the techniques required are beyond the scope of this book.
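The factorization criterion is easy to probe numerically. In the sketch below (my own illustration), the joint density of a bivariate normal is compared with the product of its marginal densities at a single point: when the correlation parameter is \(0\) the two numbers agree (for jointly normal variables, zero correlation coincides with independence), and when the correlation is \(0.7\) they do not.

```python
from scipy import stats

x, y = 0.8, -0.4   # an arbitrary point at which to compare densities

for rho in [0.0, 0.7]:
    joint = stats.multivariate_normal(mean=[0, 0], cov=[[1, rho], [rho, 1]])
    joint_density = joint.pdf([x, y])
    # both marginals are standard normal regardless of rho
    product_of_marginals = stats.norm.pdf(x) * stats.norm.pdf(y)
    print(f'rho = {rho}: joint density = {joint_density:.5f}, '
          f'product of marginals = {product_of_marginals:.5f}')
```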
It turns out that there is also a characterization of independence in terms of factorizations of joint cumulative distribution functions. This characterization is actually taken as the definition of independence in some references (e.g., [WMS14]).
The next important theorem shows that transformations of independent random vectors remain independent:
(Invariance of Independence)
Let \(\mathbf{X}_1,\mathbf{X}_2,\ldots,\mathbf{X}_m\) be independent random vectors and let \(g_1,\ldots,g_m\) be vector-valued functions for which the transformed random vectors

(6.18)#\[ g_1(\mathbf{X}_1), g_2(\mathbf{X}_2), \ldots, g_m(\mathbf{X}_m) \]
are all defined. Then the random vectors (6.18) are independent.
The proof is easy. Letting \(C_1,\ldots,C_m\) be events, note that

\[ P\left( g_1(\mathbf{X}_1)\in C_1, \ldots, g_m(\mathbf{X}_m)\in C_m \right) = P\left( \mathbf{X}_1\in g_1^{-1}(C_1), \ldots, \mathbf{X}_m\in g_m^{-1}(C_m) \right) = \prod_{i=1}^m P\left( \mathbf{X}_i\in g_i^{-1}(C_i) \right) = \prod_{i=1}^m P\left( g_i(\mathbf{X}_i)\in C_i \right), \]

where the middle equality follows from independence of \(\mathbf{X}_1,\ldots,\mathbf{X}_m\).
An immediate corollary of this theorem is the following result that shows independence of the components of random vectors follows from independence of the random vectors themselves.
(Independence of Components)
Suppose \(\bX_1,\bX_2,\ldots,\bX_m\) are independent random vectors. Then all sequences

(6.19)#\[ X_{1j_1}, X_{2j_2}, \ldots, X_{mj_m} \]

of component random variables are independent, where \(X_{ij}\) is the \(j\)-th component random variable of \(\bX_i\).

To see how this is a corollary of Theorem 6.8, notice that \(X_{ij}\) is the transformation of \(\bX_i\) under the canonical projection map

\[ \text{proj}_j : \mathbb{R}^n \to \mathbb{R}, \quad (x_1,x_2,\ldots,x_n) \mapsto x_j, \]

where \(n\) is the dimension of \(\bX_i\). Thus, the sequence of random variables (6.19) is the same as the sequence

\[ g_1(\bX_1), g_2(\bX_2), \ldots, g_m(\bX_m), \]
where each \(g_i\) is the canonical projection just defined.
Before moving on to the worksheet problems, we state a “random variable” version of the equation (6.14) describing independent events in terms of conditional probabilities.
(Conditional Criteria for Independence)
Let \(X\) and \(Y\) be two random variables.
Suppose \(X\) and \(Y\) are jointly discrete. Then they are independent if and only if
\[ p_{X|Y}(x|y) = p_X(x) \]for all \(x\in \mathbb{R}\) and all \(y\in \mathbb{R}\) such that \(p_Y(y)>0\).
Suppose \(X\) and \(Y\) are jointly continuous. Then they are independent if and only if
\[ f_{X|Y}(x|y) = f_X(x) \]for all \(x\in \mathbb{R}\) and all \(y\in \mathbb{R}\) such that \(f_Y(y)>0\).
Now:
Problem Prompt
Do problems 15-18 on the worksheet.
6.10. Case study: an untrustworthy friend#
My goal in this section is to step through an extended example that illustrates how independence is often used in computations involving real data. The scenario is taken from Bayesian statistics, and is similar to problem 12 on the worksheet. The point is not for you to learn the techniques and philosophy of Bayesian statistics in depth—that would require an entire course of its own. Instead, view this section as an enjoyable excursion into more advanced techniques that you might choose to study later.
The situation we find ourselves in is this:
The Canonical Bayesian Coin-Flipping Scenario
Our friend suggests that we play a game, betting money on whether a coin flip lands heads or tails. If the coin lands heads, our friend wins; if the coin lands tails, we win.
However, our friend has proven to be untrustworthy in the past. We suspect that the coin might be unfair, with a probability of \(\theta = 0.75\) of landing heads. So, before we play the game, we collect data and flip the coin ten times and count the number of heads. Depending on our results, how might we alter our prior estimate of \(\theta = 0.75\) for the probability of landing heads?
You might initially wonder why we need any statistical theory at all. Indeed, we might base our assessment of whether the coin favors us or our friend entirely on the proportion of heads that we see in our ten flips: If we see six or more heads, then we might believe \(\theta>0.5\) and that the coin favors our friend, while if we obtain four or less, then \(\theta < 0.5\) and the coin favors us.
But the crucial point is that our friend has an untrustworthy track record. Since we believe at the outset that \(\theta\) is somewhere around \(0.75\), seeing four heads is not enough to “offset” this large value and tug it down below \(0.5\). So, the situation is a bit more complex than it might appear at first glance.
The goal is to find a systematic and quantitative method for updating our assessment of the likely values of \(\theta\) based on the data. The keys to the computations will be Bayes’ Theorem (as you might have guessed) and independence.
Let’s begin with our prior estimate \(\theta = 0.75\) for the probability of landing heads. We might go a step further than this single (point) estimate and actually cook up an entire probability distribution for what we believe are the most likely values of \(\theta\). Perhaps we suppose that \(\theta\) is an observation of a \(\mathcal{B}eta(6,2)\) random variable:
Show code cell source
import scipy as sp
Theta = sp.stats.beta(a=6, b=2)
theta = np.linspace(0, 1, 150)
plt.plot(theta, Theta.pdf(theta))
plt.xlabel(r'$\theta$')
plt.ylabel('probability density')
plt.suptitle(r'$\theta \sim \mathcal{B}eta(6,2)$')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
Notice that this distribution favors values of \(\theta\) toward \(1\), which is consistent with our friend’s untrustworthy track record. Also, if \(\theta\) really does come from a \(\mathcal{B}eta(6,2)\) random variable, then as we learned back in Theorem 4.12, its expected value is indeed \(6/(6+2) = 0.75\). Suppose we write

\[ f(\theta) \]
for the density function of \(\theta\) drawn from the \(\mathcal{B}eta(6,2)\) distribution.
Now, suppose that we flip the coin ten times. It is natural to realize the resulting dataset as an observation from an IID random sample

\[ X_1,X_2,\ldots,X_{10} \]

with \(X_i \sim \mathcal{B}er(\theta)\) for each \(i\) and a value \(x_i=1\) indicating that a head was obtained on the \(i\)-th flip. It is also natural to assume that the flips are independent of each other (conditioned on \(\theta\)). Therefore, the joint mass function of the random sample (conditioned on \(\theta\)) factors as

(6.20)#\[ p(x_1,x_2,\ldots,x_{10}|\theta) = \prod_{i=1}^{10} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^x (1-\theta)^{10-x}, \]
where \(x = x_1+x_2+\cdots +x_{10}\) is the total number of heads in the dataset.
We now use Bayes’ Theorem to update our prior \(\mathcal{B}eta(6,2)\) distribution for \(\theta\) to a new distribution with density \(f(\theta|x_1,x_2,\ldots,x_{10})\). Assuming that the values \(x_1,x_2,\ldots,x_{10}\) are fixed, observed values, here are the computations:

\[ f(\theta|x_1,x_2,\ldots,x_{10}) = \frac{p(x_1,x_2,\ldots,x_{10}|\theta) f(\theta)}{\int_0^1 p(x_1,x_2,\ldots,x_{10}|\theta^\star) f(\theta^\star) \ \text{d}\theta^\star} \propto \theta^x (1-\theta)^{10-x} \cdot \theta^{6-1} (1-\theta)^{2-1} = \theta^{(x+6)-1} (1-\theta)^{(12-x)-1}. \]
Thus, the “updated” distribution must be of the form \(\mathcal{B}eta(x+6,12-x)\), where \(x\) is the number of heads in the dataset with \(0\leq x \leq 10\). Let’s plot these eleven distributions:
Show code cell source
blues = sns.color_palette('blend:#cce8ff,#486afb', n_colors=11)
for x in range(10, -1, -1):
Theta_posterior = sp.stats.beta(a=x + 6, b=12 - x)
plt.plot(theta, Theta_posterior.pdf(theta), label=rf'$x={x}$', color=blues.as_hex()[x])
plt.xlabel(r'$\theta$')
plt.ylabel('probability density')
plt.legend()
plt.tight_layout()
Here’s what you should notice: For larger values of \(x\), the distributions are shifted toward higher values of \(\theta\), which reflects the fact that many heads suggests \(\theta\) is close to \(1\). In the other direction, smaller values of \(x\) shift the distributions toward \(0\).
In particular, for \(x=3\) we obtain an “updated” distribution of \(\mathcal{B}eta(9,9)\), which has the symmetric density curve in the figure. The expected value for this distribution is exactly \(0.5\), so we would need to see three heads in our dataset to conclude that the coin has a good chance of being fair (at least according to the “posterior mean”). But if we see four or more heads, then our computations still support the conclusion that our friend is untrustworthy since the distributions in this range have expected values \(>0.5\). In the other direction, if we see two or fewer heads, then we might believe that the coin favors us.
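A two-line computation backs up these claims about the posterior means (assuming, as derived above, that the updated distribution is \(\mathcal{B}eta(x+6, 12-x)\) when \(x\) heads are observed): the mean of a \(\mathcal{B}eta(\alpha,\beta)\) distribution is \(\alpha/(\alpha+\beta)\), which here is \((x+6)/18\).

```python
# posterior mean of a Beta(x + 6, 12 - x) distribution for each possible head count x;
# the Beta(alpha, beta) mean is alpha / (alpha + beta) = (x + 6) / 18 here
for x in range(11):
    print(f'x = {x:2d} heads  ->  posterior mean of theta = {(x + 6) / 18:.3f}')
```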
Note
It is worth addressing this point again: Why do we need to see three heads to conclude that the coin might be fair, and not five heads?
Remember, we believe that our friend is untrustworthy. If we believed that \(\theta=0.75\), then seeing five heads is not enough evidence to convince us that \(\theta\) is, in reality, near \(0.5\). We would need to see even fewer heads to overcome (or offset) the prior estimate of \(\theta=0.75\).
Actually, my description here is slightly at odds with a strict interpretation of Bayesian philosophy, since a true Bayesian would never assume that the parameter \(\theta\) has a fixed value at the outset, only that it has a prior probability distribution.
Notice that independence of the coin flips was absolutely crucial for the computations to go through. Without assuming independence, we would not have been able to factor the joint mass function (conditioned on \(\theta\)) as in (6.20), which would have made our application of Bayes’ Theorem much more difficult. This joint mass function is actually known as the likelihood function \(\mathcal{L}(\theta)\) of the parameter \(\theta\), and we will see these same computations appear again when we study maximum likelihood estimation (MLE).