7. More probability theory#
Introduction to be written later… Beware that this is a first draft of the chapter. To do:
Rewrite section on moment generating functions.
7.1. Expectations and joint distributions#
To motivate the considerations in this section, suppose that we have two random variables $X$ and $Y$, along with a real-valued function $g(x,y)$ of two variables, and that we form the new random variable $g(X,Y)$. In any case, provided that this "transformed" random variable $g(X,Y)$ has an expectation, we could (in principle) compute that expectation directly from the definition, using the distribution of $g(X,Y)$ itself.
However, this formula is quite inconvenient to use in practice, due to the necessity of the density of $g(X,Y)$, which may be difficult to obtain. The bivariate version of the Law of the Unconscious Statistician avoids this problem by expressing the expectation directly in terms of the joint distribution of $X$ and $Y$:
Theorem 7.1 (Bivariate Law of the Unconscious Statistician (LotUS))
Let $X$ and $Y$ be jointly distributed random variables and let $g:\mathbb{R}^2\to\mathbb{R}$ be a function.

1. If $X$ and $Y$ are jointly discrete with mass function $p(x,y)$, then

   $$E\big(g(X,Y)\big) = \sum_{x,y} g(x,y)\, p(x,y).$$

2. If $X$ and $Y$ are jointly continuous with density function $f(x,y)$, then

   $$E\big(g(X,Y)\big) = \int_{-\infty}^\infty\int_{-\infty}^\infty g(x,y)\, f(x,y)\, dx\, dy.$$
Though the argument in the discrete case is very similar to the one given for the univariate version of Theorem 3.1, we will not give it here. See if you can work it out on your own. You can imagine that the univariate and bivariate LotUS's are special cases of a general multivariate LotUS that computes expectations of random variables of the form $g(X_1,\ldots,X_n)$, where $g:\mathbb{R}^n\to\mathbb{R}$ is a function and $X_1,\ldots,X_n$ are jointly distributed random variables.
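As a sanity check, here is a small numerical illustration of the bivariate LotUS in the jointly discrete case; the joint mass function below is made up purely for illustration, and we compute $E(XY)$ both by the LotUS sum and by brute-force simulation.

import numpy as np

# a hypothetical joint mass function p(x, y) on a small support (rows: x, columns: y)
support_x = np.array([0, 1, 2])
support_y = np.array([0, 1])
p = np.array([[0.10, 0.20],
              [0.30, 0.15],
              [0.05, 0.20]])

def g(x, y):
    return x * y   # the transformation g(x, y) = xy

# LotUS: sum g(x, y) * p(x, y) over the support
lotus = sum(g(x, y) * p[i, j]
            for i, x in enumerate(support_x)
            for j, y in enumerate(support_y))

# Monte Carlo: draw (x, y) pairs directly from the joint distribution and average g
rng = np.random.default_rng(42)
idx = rng.choice(p.size, size=100_000, p=p.flatten())
xs, ys = support_x[idx // 2], support_y[idx % 2]
mc = g(xs, ys).mean()

print(lotus, mc)   # the two numbers should nearly agree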
Our first application of the bivariate LotUS is to show that the expectation operator is multiplicative on independent random variables:
Theorem 7.2 (Independence and expectations)
If $X$ and $Y$ are independent random variables, then $E(XY) = E(X)E(Y)$.
The proof is a simple computation using the LotUS; it appears in your homework for this chapter.
Our second application of the bivariate LotUS is to tie up a loose end from Section 3.9 and prove the full-strength version of “linearity of expectation.”
Theorem 7.3 (Linearity of Expectations)
Let $X$ and $Y$ be jointly distributed random variables and let $c\in\mathbb{R}$ be a constant. Then

$$E(X + Y) = E(X) + E(Y) \tag{7.1}$$

and

$$E(cX) = c\,E(X). \tag{7.2}$$
Proof. The proof of the second equation (7.2) was already handled back in the proof of Theorem 3.3. For the proof of the first equation (7.1) (in the continuous case), we apply the bivariate LotUS:

$$E(X+Y) = \int_{-\infty}^\infty\int_{-\infty}^\infty (x+y)\,f(x,y)\,dx\,dy = \int_{-\infty}^\infty x\, f_X(x)\,dx + \int_{-\infty}^\infty y\, f_Y(y)\,dy = E(X) + E(Y),$$

where in the second equality we split the integral into two pieces and integrated out one of the variables in each piece to obtain the marginal densities. Q.E.D.
Let’s finish off the section by working through an example problem:
Problem Prompt
Do problem 1 on the worksheet.
7.2. Expectations and conditional distributions#
In the previous section, we learned how to use joint distributions in computations of expectations. In this section, we define new types of expectations using conditional distributions.
Definition 7.1
Let $X$ and $Y$ be jointly distributed random variables, and let $x$ be a fixed observed value of $X$.

1. If $X$ and $Y$ are jointly discrete with conditional mass function $p(y\mid x)$, then the conditional expected value of $Y$ given $X=x$ is the sum

   $$E(Y\mid X=x) = \sum_{y} y\, p(y\mid x).$$

2. If $X$ and $Y$ are jointly continuous with conditional density function $f(y\mid x)$, then the conditional expected value of $Y$ given $X=x$ is the integral

   $$E(Y\mid X=x) = \int_{-\infty}^{\infty} y\, f(y\mid x)\, dy.$$
In both cases, notice that conditional expected values depend on the particular choice of observed value $x$. We therefore obtain a function

$$x \mapsto E(Y\mid X=x),$$

where $x$ ranges over the possible observed values of $X$.
Let’s take a look at a practice problem before proceeding.
Problem Prompt
Do problem 2 on the worksheet.
The input $x$ to this function is an observation of a random variable, and thus this function is also "random." To make this precise, let's suppose for simplicity that $E(Y\mid X=x)$ is defined for every $x\in\mathbb{R}$. Then we may define a new random variable, denoted $E(Y\mid X)$,
obtained as the composite

$$E(Y\mid X) = g(X), \qquad \text{where } g(x) = E(Y\mid X=x).$$
Problem Prompt
Do problem 3 on the worksheet.
Now, since $E(Y\mid X)$ is a random variable in its own right, it has its own expectation $E\big(E(Y\mid X)\big)$, which we could in principle compute directly from the definition of expectation.
However, this formula is of little practical value, since we would need to compute the density of the random variable $E(Y\mid X)$ itself.
Fortunately, if you push through the computations beginning with (7.4), you will find that the "iterated" expectation reduces to the expectation of $Y$ alone:
Theorem 7.4 (The Law of Total Expectation)
Let $X$ and $Y$ be jointly distributed random variables. Then

$$E(Y) = E\big(E(Y\mid X)\big).$$
Before we begin the proof, observe that there is a problem with the interpretation of the integral (7.4) since it extends over all $x\in\mathbb{R}$, even though the conditional expectation $E(Y\mid X=x)$ is only defined for those $x$ with $f_X(x) > 0$. Since the set where $f_X(x)=0$ contributes nothing to the integral, this causes no real harm.
Proof. Let’s consider the case that the variables are jointly continuous. Beginning from (7.4), we compute
Besides the LotUS in the first line, notice that in going from the third line to the fourth, we integrated out the dependence on
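Here is a quick simulation-based sanity check of the Law of Total Expectation, using a hypothetical hierarchical model of my own choosing: draw $X\sim N(1, 2^2)$ and then $Y$ given $X=x$ from $N(x,1)$, so that $E(Y\mid X=x)=x$ and the theorem predicts $E(Y) = E(E(Y\mid X)) = E(X) = 1$.

import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# hypothetical hierarchical model: X ~ N(1, 2^2), then Y | X = x ~ N(x, 1)
x = rng.normal(loc=1, scale=2, size=n)
y = rng.normal(loc=x, scale=1)

print(y.mean())   # direct Monte Carlo estimate of E(Y); should be near 1
print(x.mean())   # estimate of E(E(Y|X)) = E(X), since E(Y|X=x) = x here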
Problem Prompt
Do problem 4 on the worksheet.
7.3. Density transformations#
It is very often the case that one needs to compute the density of a transformed continuous random variable. Actually, we saw such a situation already in Theorem 4.7 where we computed the distribution of an affine transformation of a normal random variable. We proved that theorem by explicitly computing the density of the transformed random variable. The main result in this section gives us a direct formula:
Theorem 7.5 (Density Transformation Theorem)
Let $X$ be a continuous random variable with density $f_X(x)$ and let $g:\mathbb{R}\to\mathbb{R}$ be a function with a differentiable inverse $g^{-1}$. Then the random variable $Y = g(X)$ is continuous, with density given by

$$f_Y(y) = f_X\big(g^{-1}(y)\big)\left|\frac{\mathrm{d}}{\mathrm{d}y}g^{-1}(y)\right|$$

for all $y$ in the support of $Y$ (and $f_Y(y) = 0$ otherwise).
We will not prove the theorem, as it will end up being a special case of the generalized transformation theorem given in Theorem 7.6 below. But we should also say that there are simpler proofs of the theorem that do not use the heavy machinery that the proof of Theorem 7.6 relies upon; see, for example, Section 4.7 in [DBC21].
We are assuming that the support
Observe that the equality (7.5) is “one half” of what it means for
for all
Note that we are explicitly assuming that the inverse $g^{-1}$ exists and is differentiable.
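To see the theorem in action numerically (a sketch with a standard example of my own choosing, not one from the worksheet), take $X\sim N(0,1)$ and $Y = g(X) = e^X$, so that $g^{-1}(y) = \log y$ and the theorem gives $f_Y(y) = f_X(\log y)/y$. The code below compares this density against a histogram of simulated values of $Y$:

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = rng.normal(size=50_000)
y = np.exp(x)                                  # Y = g(X) = e^X, so g^{-1}(y) = log(y)

grid = np.linspace(0.01, 8, 300)
f_Y = sp.stats.norm.pdf(np.log(grid)) / grid   # f_X(g^{-1}(y)) * |d/dy g^{-1}(y)|

plt.hist(y, bins=200, range=(0, 8), density=True, alpha=0.4)
plt.plot(grid, f_Y)
plt.xlabel('$y$')
plt.ylabel('density')
plt.tight_layout()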
Let’s do some examples:
Problem Prompt
Do problems 5 and 6 on the worksheet.
We now begin moving toward the generalization of The Density Transformation Theorem to higher dimensions. The statement of this generalization uses the following object:
Definition 7.2
Given a vector-valued function

$$g:\mathbb{R}^n\to\mathbb{R}^m, \qquad g(\mathbf{x}) = \big(g_1(\mathbf{x}),\ldots,g_m(\mathbf{x})\big),$$

its gradient matrix at $\mathbf{x}\in\mathbb{R}^n$ is the $n\times m$ matrix

$$\nabla g(\mathbf{x}) = \left[\frac{\partial g_j}{\partial x_i}(\mathbf{x})\right]_{i,j},$$

provided that the partial derivatives exist at $\mathbf{x}$.
Note that the gradient matrix is the transpose of the Jacobian matrix of $g$ at $\mathbf{x}$.
The Jacobian matrix is the matrix representation (with respect to the standard bases) of the derivative of $g$ at $\mathbf{x}$.
The gradient matrix is used to define the affine tangent space approximation of $g$ at a point $\mathbf{x}_0$:

$$g(\mathbf{x}) \approx g(\mathbf{x}_0) + \nabla g(\mathbf{x}_0)^\top(\mathbf{x}-\mathbf{x}_0).$$
Notice the similarity to the affine tangent line approximation studied in single-variable calculus.
With gradient matrices in hand, we now state the generalization of Theorem 7.5:
Theorem 7.6 (Multivariate Density Transformation Theorem)
Let $\mathbf{X}$ be an $n$-dimensional continuous random vector with density $f_{\mathbf{X}}(\mathbf{x})$ and let $g:\mathbb{R}^n\to\mathbb{R}^n$ be a function with a differentiable inverse $g^{-1}$. Then the random vector $\mathbf{Y} = g(\mathbf{X})$ is continuous, with density given by

$$f_{\mathbf{Y}}(\mathbf{y}) = f_{\mathbf{X}}\big(g^{-1}(\mathbf{y})\big)\,\big|\det\nabla g^{-1}(\mathbf{y})\big|$$

for all $\mathbf{y}$ in the support of $\mathbf{Y}$ (and $f_{\mathbf{Y}}(\mathbf{y}) = 0$ otherwise).
The following proof uses mathematics that is likely unfamiliar. Look through it if you like, but feel free to skip it as well. I've included it really only out of principle, because I could not find a satisfactory proof in the standard references on my bookshelf.
Proof. Letting
where the final equality follows from the Change-of-Variables Theorem for Multiple Integrals; see Theorem 3-13 in [Spi65]. If we then define
Since a probability measure defined on the Borel algebra of
Problem Prompt
Do problem 7 in the worksheet.
7.4. Moment generating functions#
We begin with:
Definition 7.3
Let $X$ be a random variable and let $k\geq 1$ be an integer.

1. The $k$-th moment of $X$ is the expectation $E(X^k)$.

2. The $k$-th central moment of $X$ is the expectation $E\big((X-\mu)^k\big)$, where $\mu = E(X)$.
Take care to notice that I am not claiming that all of these moments exist for all random variables. Notice also that the first moment of $X$ is just its mean $\mu = E(X)$, while its second central moment is its variance $\sigma^2 = V(X)$. You might think of the list of moments $E(X), E(X^2), E(X^3),\ldots$ as playing a role analogous to the list of derivatives $f'(a), f''(a), f'''(a),\ldots$ of a function at a point.
Actually, this analogy with derivatives can be carried further, so let's leave the world of probability theory for a moment and return to calculus. Indeed, as you well know, if a function $f(x)$ has derivatives of all orders at a point $x=a$, then we may form its Taylor series centered at $a$:

$$\sum_{k=0}^{\infty} \frac{f^{(k)}(a)}{k!}(x-a)^k. \tag{7.6}$$

You also learned that this series may, or may not, converge to the original function $f(x)$ on an open interval around the center.
On the other hand, there exist functions for which the Taylor series (7.6) exists and converges everywhere, but does not converge to the function on any open interval around $x=a$.
Theorem 7.7 (Taylor Series Uniqueness Theorem)
Suppose that $f(x)$ and $g(x)$ are functions, both of which are equal to their Taylor series (centered at $x=a$) on open intervals containing $a$. Then the following statements are equivalent:

1. $f(x) = g(x)$ for all $x$ in an open interval containing $a$.

2. $f^{(k)}(a) = g^{(k)}(a)$ for all $k\geq 0$.

3. The Taylor series for $f$ and $g$ (centered at $x=a$) are equal coefficient-wise.
What this Uniqueness Theorem tells us is that complete knowledge of all the values $f^{(k)}(a)$, for $k\geq 0$, is enough to pin down the function $f(x)$ itself near $x=a$ (at least for functions equal to their Taylor series).
Now, hold this lesson in your mind for just a bit as we return to probability theory. We are about to see something very similar occur in the context of moments of random variables. The gadget that is going to play the role for random variables analogous to Taylor series is defined in:
Definition 7.4
Let $X$ be a random variable. The moment generating function of $X$ is the function

$$\psi(t) = E\big(e^{tX}\big).$$

We shall say the moment generating function exists if $\psi(t)$ is finite for all $t$ in an open interval containing $t=0$.
The reason that the function $\psi(t)$ is called the moment generating function is explained by the following result:
Theorem 7.8 (Derivatives of Moment Generating Functions)
Let $X$ be a random variable whose moment generating function $\psi(t)$ exists. Then

$$\psi^{(k)}(0) = E(X^k)$$

for all integers $k\geq 1$.
Thus, the moments of $X$ may be "generated" by repeatedly differentiating its moment generating function and evaluating the derivatives at $t=0$.
Proof. Let’s prove the theorem in the case that
Notice that we used the LotUS in the first line. Q.E.D.
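As a quick illustration of Theorem 7.8, here is a sketch using SymPy (a library not used elsewhere in these notes) and a standard example of my own choosing: for an exponential variable with rate $\lambda$, the moment generating function is $\psi(t) = \lambda/(\lambda - t)$ for $t < \lambda$, and differentiating at $t=0$ reproduces the moments $E(X^k) = k!/\lambda^k$.

import sympy as sym

t, lam = sym.symbols('t lam', positive=True)
psi = lam / (lam - t)                        # MGF of an exponential variable with rate lam

for k in range(1, 4):
    moment = sym.diff(psi, t, k).subs(t, 0)  # psi^{(k)}(0)
    print(k, sym.simplify(moment))           # prints k! / lam**k, i.e., E(X^k)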
Problem Prompt
Do problem 8 on the worksheet.
Now, the true power of moment generating functions comes from the following extremely important and useful theorem, which may be seen as an analog of the Taylor Series Uniqueness Theorem stated above as Theorem 7.7. It essentially says that: If you know all the moments, then you know the distribution.
Theorem 7.9 (Moment Generating Function Uniqueness Theorem)
Suppose $X$ and $Y$ are random variables whose moment generating functions $\psi_X(t)$ and $\psi_Y(t)$ exist. Then the following statements are equivalent:

1. The distributions of $X$ and $Y$ are equal, i.e., $F_X(x) = F_Y(x)$ for all $x$.

2. The moment generating functions $\psi_X(t)$ and $\psi_Y(t)$ are equal for all $t$ in an open interval containing $t=0$.
Proof. Here is a very brief sketch of the proof: The implication (1) $\Rightarrow$ (2) is the easy direction, since (in the continuous case) the LotUS gives

$$\psi_X(t) = \int_{-\infty}^\infty e^{tx} f_X(x)\,dx$$

for all $t$,

and similarly

$$\psi_Y(t) = \int_{-\infty}^\infty e^{ty} f_Y(y)\,dy,$$

so equal distributions yield equal moment generating functions. Then, the hard part of the proof is showing that (2) $\Rightarrow$ (1), i.e., that equality of the moment generating functions near $t=0$ forces equality of the distributions.
To illustrate how this theorem may be used, we will establish the fundamental fact that sums of independent normal random variables are normal; this is stated officially in the second part of the following theorem.
Theorem 7.10 (Moment generating functions of normal variables)
1. If $X\sim N(\mu,\sigma^2)$, then its moment generating function is given by

   $$\psi(t) = \exp\!\left(\mu t + \frac{\sigma^2 t^2}{2}\right)$$

   for all $t\in\mathbb{R}$.

2. Let $X_1,\ldots,X_n$ be independent random variables such that $X_i \sim N(\mu_i,\sigma_i^2)$ for each $i$, and let $a_1,\ldots,a_n,b$ be scalars. Then the affine linear combination

   $$Y = a_1X_1 + \cdots + a_nX_n + b$$

   is normal with mean $a_1\mu_1+\cdots+a_n\mu_n+b$ and variance $a_1^2\sigma_1^2+\cdots+a_n^2\sigma_n^2$.
Proof. We will only prove the second part; for the proof of the first (which isn't hard), see Theorem 5.6.2 in [DS14]. So, we begin our computations:

$$\begin{aligned}
\psi_Y(t) &= E\Big(\exp\big(t(a_1X_1+\cdots+a_nX_n+b)\big)\Big) \\
&= e^{tb}\, E\big(e^{ta_1X_1}\big)\cdots E\big(e^{ta_nX_n}\big) \\
&= e^{tb}\prod_{i=1}^n \exp\!\left(a_i\mu_i t + \frac{a_i^2\sigma_i^2 t^2}{2}\right) \\
&= \exp\!\left(\Big(b + \sum_{i=1}^n a_i\mu_i\Big)t + \frac{\big(\sum_{i=1}^n a_i^2\sigma_i^2\big)t^2}{2}\right).
\end{aligned}$$

The second line follows from the first by independence, Theorem 6.8, and Theorem 7.2, while we obtain the third line from the first part of this theorem and Theorem 4.7. But notice that the expression on the last line is exactly the moment generating function of an $N\big(a_1\mu_1+\cdots+a_n\mu_n+b,\ a_1^2\sigma_1^2+\cdots+a_n^2\sigma_n^2\big)$ random variable, so the claim follows from the Moment Generating Function Uniqueness Theorem. Q.E.D.
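The second part of the theorem is also easy to check by simulation. In the sketch below (with arbitrary made-up parameters), we form $Y = 2X_1 - X_2 + 3$ with $X_1\sim N(1, 2^2)$ and $X_2\sim N(-1,1)$ independent, so the theorem predicts $Y \sim N(6, 17)$.

import numpy as np
import scipy as sp

rng = np.random.default_rng(42)
n = 500_000

x1 = rng.normal(loc=1, scale=2, size=n)    # X_1 ~ N(1, 2^2)
x2 = rng.normal(loc=-1, scale=1, size=n)   # X_2 ~ N(-1, 1), independent of X_1
y = 2 * x1 - x2 + 3                        # affine linear combination

print(y.mean(), y.var())                   # should be close to 6 and 17
qs = [0.1, 0.5, 0.9]
print(np.quantile(y, qs))                               # empirical quantiles
print(sp.stats.norm.ppf(qs, loc=6, scale=np.sqrt(17)))  # quantiles of N(6, 17)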
Before turning to the worksheet to finish this section, it is worth extracting one of the steps used in the previous proof and putting it in its own theorem:
Theorem 7.11 (Moment generating functions of sums of independent variables)
Suppose that $X_1,\ldots,X_n$ are independent random variables whose moment generating functions all exist. Then

$$\psi_{X_1+\cdots+X_n}(t) = \psi_{X_1}(t)\cdots\psi_{X_n}(t)$$

for all $t$ in an open interval containing $t=0$.
Now, let’s do an example:
Problem Prompt
Do problem 9 on the worksheet.
7.5. Dependent random variables, covariance, and correlation#
In this section, we begin our study of dependent random variables, which are just random variables that are not independent. This study will continue through Section 9.3 in the next chapter, and then culminate in Chapter 11 where we learn how to use “networks” of dependent random variables to model real-world data.
While the definition of dependence of random variables is purely negative (two random variables are dependent precisely when they are not independent), intuitively, dependent random variables should share "information" in some sense: learning the value of one should tell us something about the value of the other.
The most straightforward method to guarantee a transfer of "information" between the variables is to link them deterministically via a function, say by setting $Y = g(X)$ for some function $g:\mathbb{R}\to\mathbb{R}$. The following theorem confirms that such functionally linked random variables are indeed dependent (under a mild assumption made precise in the proof):
Theorem 7.12 (Functional dependence $\Rightarrow$ dependence)
Let $X$ be a random variable and let $Y = g(X)$ for some function $g$. Then, under the mild assumption described in the proof below, $X$ and $Y$ are dependent.
Proof. In order to prove this, we need to make the (mild) assumption that there is an event
(This doesn’t always have to be true. For example, what happens if
On the other hand, we have
and so
by (7.7). But then
which proves
What does a pair of functionally dependent random variables look like? For an example, let's suppose that $X \sim N(1, 0.5^2)$ and that $Y = g(X)$, where $g(x) = x(x-1)(x-2)$.
Then, let's simulate a draw of 1000 samples from $X$ and apply $g$
to obtain the associated observed values of $Y$, plotted against the $x$'s:
import numpy as np
import scipy as sp
import seaborn as sns
import pandas as pd
import matplotlib.pyplot as plt
import math
import matplotlib_inline.backend_inline
import warnings
plt.style.use('../aux-files/custom_style_light.mplstyle')
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
warnings.filterwarnings('ignore')
blue = '#486AFB'
magenta = '#FD46FC'
def h(x):
return x * (x - 1) * (x - 2)
np.random.seed(42)
x = sp.stats.norm.rvs(loc=1, scale=0.5, size=1000)
y = h(x)
sns.scatterplot(x=x, y=y)
plt.xlabel('$x$')
plt.ylabel('$y=g(x)$')
plt.ylim(-1.5, 1.5)
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
The plot looks exactly like we would expect: a bunch of points lying on the graph of the function $g$. Let's now add a small amount of random "noise" to the $y$-values and re-plot:
epsilon = sp.stats.norm.rvs(scale=0.15, size=1000)
grid = np.linspace(-0.5, 3)
_, ax = plt.subplots(ncols=2, figsize=(7, 3), sharey=True)
sns.scatterplot(x=x, y=y + epsilon, ax=ax[0])
ax[0].set_ylim(-1.5, 1.5)
ax[0].set_xlabel('$x$')
ax[0].set_ylabel('$y=g(x) + $noise')
sns.scatterplot(x=x, y=y + epsilon, alpha=0.2, ax=ax[1])
ax[1].plot(grid, h(grid), color='#FD46FC')
ax[1].set_xlabel('$x$')
plt.tight_layout()
The "noisy" functional relationship is drawn in the left-hand plot, while on the right-hand plot I have superimposed the graph of the function $g$ for comparison.
The goal in this chapter is to study “noisy” linear dependencies between random variables; relationships that look like these:
grid = np.linspace(-2.5, 2.5)
epsilon = sp.stats.norm.rvs(scale=0.3, size=500)
m = [1, 0, -1]
x = sp.stats.norm.rvs(size=500)
_, ax = plt.subplots(ncols=3, figsize=(10, 3), sharey=True, sharex=True)
for i, m in enumerate(m):
y = m * x + epsilon
sns.scatterplot(x=x, y=y, ax=ax[i], alpha=0.3)
ax[i].plot(grid, m * grid, color='#FD46FC')
ax[i].set_xlim(-3, 3)
ax[i].set_ylim(-3, 3)
ax[i].set_xlabel('$x$')
ax[i].set_ylabel('$y$')
plt.tight_layout()
Our goal in this section is to uncover ways to quantify or measure the strength of these types of “noisy” linear dependencies between random variables. We will discover that there are two such measures, called covariance and correlation. An alternate measure of more general dependence, called mutual information, will be studied in the next chapter in Section 9.3.
The definition of covariance is based on the following pair of basic observations:
Let $(x, y)$ be an observation of a two-dimensional random vector $(X, Y)$.

1. If the observed values of $(X,Y)$ cluster along a line of positive slope, then $x$ and $y$ tend to be large (small) at the same time.

2. If the observed values of $(X,Y)$ cluster along a line of negative slope, then a large (small) value $x$ tends to be paired with a small (large) value $y$.
The visualizations that go along with these observations are:
In order to extract something useful from these observations, it is convenient to center the dataset by subtracting off the means:

$$(x, y) \mapsto (x - \mu_X,\ y - \mu_Y).$$

Notice that the centered $x$-values have mean zero, and similarly for the centered $y$-values, so the centered dataset clusters around the origin. The effect of centering on a simulated dataset is shown below:
_, axes = plt.subplots(ncols=2, figsize=(8, 3), sharex=True, sharey=True)
np.random.seed(42)
m = 0.5
x = sp.stats.norm.rvs(loc=2, scale=2, size=500)
y = m * x + epsilon + 2
mean = (x.mean(), y.mean())
sns.scatterplot(x=x, y=y, ax=axes[0], alpha=0.2)
axes[0].scatter(mean[0], mean[1], color='magenta', zorder=3, s=100)
axes[0].set_xlim(-6, 6)
axes[0].set_ylim(-6, 6)
axes[0].set_title('raw data')
axes[0].set_xlabel('$x$')
axes[0].set_ylabel('$y$')
sns.scatterplot(x=x - mean[0], y=y - mean[1], ax=axes[1], alpha=0.2)
axes[1].scatter(0, 0, color=magenta, zorder=3, s=100)
axes[1].set_xlim(-6, 6)
axes[1].set_ylim(-6, 6)
axes[1].set_title('centered data')
axes[1].set_xlabel('$x$')
axes[1].set_ylabel('$y$')
plt.tight_layout()
You can see that the dataset has not changed its shape—it has only shifted so that its mean (represented by the magenta dot) is at the origin
The reason that we center the data is that it allows us to conveniently rephrase our observations above in terms of signs:
Let $(x, y)$ be an observation of a centered two-dimensional random vector $(X, Y)$ (so that $\mu_X = \mu_Y = 0$).

1. If the observed values of $(X,Y)$ cluster along a line of positive slope, then $x$ and $y$ tend to have the same sign, i.e., $xy > 0$.

2. If the observed values of $(X,Y)$ cluster along a line of negative slope, then $x$ and $y$ tend to have opposite signs, i.e., $xy < 0$.
The new visualization is:
Essentially, the next definition takes the average value of the product of the centered variables:
Definition 7.5
Let $X$ and $Y$ be two random variables. The covariance of $X$ and $Y$, denoted $\sigma_{XY}$ or $\operatorname{Cov}(X,Y)$, is defined via the equation

$$\sigma_{XY} = E\big((X-\mu_X)(Y-\mu_Y)\big).$$
Notice that the covariance of a random variable $X$ with itself is exactly its variance: $\sigma_{XX} = E\big((X-\mu_X)^2\big) = V(X)$.
Before we look at an example, it will be convenient to state and prove the following generalization of Theorem 3.5:
Theorem 7.13 (Shortcut Formula for Covariance)
Let $X$ and $Y$ be two random variables. Then

$$\sigma_{XY} = E(XY) - \mu_X\mu_Y = E(XY) - E(X)E(Y).$$
Proof. The proof is a triviality, given all the properties that we already know about expectations:

$$\sigma_{XY} = E\big((X-\mu_X)(Y-\mu_Y)\big) = E(XY) - \mu_Y E(X) - \mu_X E(Y) + \mu_X\mu_Y = E(XY) - \mu_X\mu_Y. \quad \text{Q.E.D.}$$
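Here is a quick numerical confirmation of the shortcut formula, using a made-up pair of dependent variables: we compare $E(XY) - E(X)E(Y)$, estimated from a sample, against NumPy's built-in covariance.

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(size=100_000)
y = x + rng.normal(scale=0.5, size=100_000)           # a noisy linear relationship

shortcut = np.mean(x * y) - np.mean(x) * np.mean(y)   # E(XY) - E(X)E(Y)
builtin = np.cov(x, y, ddof=0)[0, 1]                  # NumPy's covariance

print(shortcut, builtin)   # essentially identical, both near Cov(X, Y) = 1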
Armed with this formula, let’s do an example problem:
Problem Prompt
Do problem 10 on the worksheet.
A pair of very useful properties of covariance are listed in the following:
Theorem 7.14 (Covariance: symmetry and bilinearity)

1. Symmetry. If $X$ and $Y$ are random variables, then $\operatorname{Cov}(X,Y) = \operatorname{Cov}(Y,X)$.

2. Bilinearity. Let $X_1,\ldots,X_m$ and $Y_1,\ldots,Y_n$ be sequences of random variables, and $a_1,\ldots,a_m$ and $b_1,\ldots,b_n$ sequences of real numbers. Then:

   $$\operatorname{Cov}\!\left(\sum_{i=1}^m a_iX_i,\ \sum_{j=1}^n b_jY_j\right) = \sum_{i=1}^m\sum_{j=1}^n a_ib_j\operatorname{Cov}(X_i,Y_j). \tag{7.8}$$
I suggest that you attempt to prove this theorem on your own. A special case appears in your homework for this chapter.
Bilinearity of covariance allows us to generalize Theorem 3.6 on the variance of an affine transformation of a random variable:
Theorem 7.15 (Variance of a linear combination)
Let $X_1,\ldots,X_n$ be random variables and $a_1,\ldots,a_n$ real numbers. Then

$$V\!\left(\sum_{i=1}^n a_iX_i\right) = \sum_{i=1}^n a_i^2 V(X_i) + 2\sum_{i<j} a_ia_j\operatorname{Cov}(X_i,X_j).$$
Proof. The proof is an application of bilinearity of covariance:

$$V\!\left(\sum_{i=1}^n a_iX_i\right) = \operatorname{Cov}\!\left(\sum_{i=1}^n a_iX_i,\ \sum_{j=1}^n a_jX_j\right) = \sum_{i=1}^n\sum_{j=1}^n a_ia_j\operatorname{Cov}(X_i,X_j) = \sum_{i=1}^n a_i^2V(X_i) + 2\sum_{i<j}a_ia_j\operatorname{Cov}(X_i,X_j). \quad\text{Q.E.D.}$$
In particular, we see that if $X_1,\ldots,X_n$ are independent, then all the "cross" covariance terms vanish and the variance of the linear combination is simply $V(a_1X_1+\cdots+a_nX_n) = a_1^2V(X_1)+\cdots+a_n^2V(X_n)$.
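Here is a quick simulation check of the variance formula, using a made-up pair of correlated variables and the linear combination $3X_1 - 2X_2$: the direct sample variance should agree with $9V(X_1) + 4V(X_2) - 12\operatorname{Cov}(X_1,X_2)$ computed from the same sample.

import numpy as np

rng = np.random.default_rng(42)
n = 200_000

x1 = rng.normal(size=n)
x2 = 0.5 * x1 + rng.normal(scale=0.8, size=n)   # correlated with x1

lhs = np.var(3 * x1 - 2 * x2)                   # direct sample variance of 3 X_1 - 2 X_2
c = np.cov(x1, x2, ddof=0)
rhs = 9 * c[0, 0] + 4 * c[1, 1] - 12 * c[0, 1]  # a_1^2 V + a_2^2 V + 2 a_1 a_2 Cov

print(lhs, rhs)   # agree up to floating-point error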
We now turn toward the other measure of linear dependence, called correlation. To motivate this latter measure, we note that while the signs of covariances are significant, their precise numerical values may be less so, if we are attempting to use them to measure the strength of a linear dependence. One reason for this is that covariances are sensitive to the scales on which the variables are measured. For an example, consider the following two plots:
_, axes = plt.subplots(ncols=2, figsize=(8, 3))
np.random.seed(42)
m = 0.5
x = sp.stats.norm.rvs(loc=0, scale=2, size=500)
y = m * x + epsilon
mean = (x.mean(), y.mean())
sns.scatterplot(x=x, y=y, ax=axes[0], alpha=0.5)
axes[0].set_xlim(-5, 5)
axes[0].set_ylim(-5, 5)
axes[0].set_title('raw data')
axes[0].set_xlabel('$x$')
axes[0].set_ylabel('$y$')
sns.scatterplot(x=10 * x, y=10 * y, ax=axes[1], alpha=0.5)
axes[1].set_xlim(-50, 50)
axes[1].set_ylim(-50, 50)
axes[1].set_title('scaled data')
axes[1].set_xlabel('$x$')
axes[1].set_ylabel('$y$')
plt.tight_layout()
The only difference between the two datasets is the axis scales, but the linear relationship between the variables is exactly as strong in both. However, scaling both variables by a factor of $10$ multiplies the covariance (by bilinearity) by a factor of $10\times 10 = 100$.
Thus, if we use the numerical value of the covariance to indicate the strength of a linear relationship, then we should conclude that the data on the right-hand side is one hundred times more "linearly correlated" than the data on the left. But this is nonsense!
The remedy is to define a “normalized” measure of linear dependence:
Definition 7.6
Let $X$ and $Y$ be two random variables with nonzero standard deviations. The correlation of $X$ and $Y$, denoted $\rho_{XY}$ or $\operatorname{Corr}(X,Y)$, is defined via the equation

$$\rho_{XY} = \frac{\sigma_{XY}}{\sigma_X\,\sigma_Y}.$$

The correlation may be regarded as a "normalized" covariance; its basic properties are collected in the following theorem.
Theorem 7.16 (Properties of correlation)
Let $X$ and $Y$ be random variables with nonzero standard deviations. Then:

1. Symmetry. We have $\rho_{XY} = \rho_{YX}$.

2. Scale invariance. If $a$ is a nonzero real number, then

   $$\rho_{aX,\,Y} = \begin{cases} \rho_{XY} & \text{if } a>0, \\ -\rho_{XY} & \text{if } a<0. \end{cases}$$

3. Normalization. We have $-1 \leq \rho_{XY} \leq 1$.
Proof. The symmetry property of correlation follows from the same property of covariance in Theorem 7.14. Scale invariance follows from bilinearity of covariance, as well as the equality $\sigma_{aX} = |a|\,\sigma_X$.
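Scale invariance is also easy to see numerically; in the quick sketch below (with simulated data of my own choosing), multiplying both variables by $10$ multiplies the covariance by $100$ but leaves the correlation unchanged.

import numpy as np

rng = np.random.default_rng(42)
x = rng.normal(scale=2, size=100_000)
y = 0.5 * x + rng.normal(scale=0.3, size=100_000)

for scale in [1, 10]:
    u, v = scale * x, scale * y
    cov = np.cov(u, v, ddof=0)[0, 1]
    corr = np.corrcoef(u, v)[0, 1]
    print(f'scale={scale}: covariance={cov:.3f}, correlation={corr:.3f}')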
Remember, covariance and correlation were cooked up to measure linear dependencies between random variables. We wonder, then, what is the correlation between two random variables that are perfectly linearly dependent? Answer:
Theorem 7.17 (Correlation of linearly dependent random variables)
Let $X$ be a random variable with nonzero variance and let $Y = aX + b$, where $a,b\in\mathbb{R}$ and $a\neq 0$. Then

$$\rho_{XY} = \begin{cases} 1 & \text{if } a>0, \\ -1 & \text{if } a<0. \end{cases}$$
Proof. The proof is a simple computation, similar to the proof of scale invariance from above:

$$\rho_{XY} = \frac{\operatorname{Cov}(X,\ aX+b)}{\sigma_X\,\sigma_{aX+b}} = \frac{a\,V(X)}{\sigma_X\cdot|a|\,\sigma_X} = \frac{a}{|a|}. \quad\text{Q.E.D.}$$
We give a name to two random variables whose correlation is zero:

Definition 7.7

Two random variables $X$ and $Y$ are said to be uncorrelated if $\rho_{XY} = 0$ (equivalently, if $\sigma_{XY} = 0$).
Let’s take a look at an example before continuing:
Problem Prompt
Do problem 11 in the worksheet.
You should think of independence as a strong form of non-correlation, and hence correlation is also a strong form of dependence. This is the content of the first part of the following result:
Theorem 7.18 (Dependence and correlation)
Let $X$ and $Y$ be random variables.

1. If $X$ and $Y$ are independent, then they are uncorrelated.

2. If $X$ and $Y$ are correlated, then they are dependent.

3. However, there exist dependent $X$ and $Y$ that are uncorrelated.
Proof. The proof of the first statement is a simple application of Theorem 7.2 and the Shortcut Formula for Covariance in Theorem 7.13. Indeed, we have $E(XY) = E(X)E(Y)$ by independence,
and then $\sigma_{XY} = E(XY) - E(X)E(Y) = 0$, so $\rho_{XY} = 0$ as well.
For the second statement, take two continuous random variables
By symmetry, we have
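To see the third statement in action numerically, here is one standard example (possibly different from the one used in the proof above): take $X$ uniform on $(-1,1)$ and $Y = X^2$. The two variables are certainly dependent, since $Y$ is a function of $X$, yet their covariance $E(X^3) - E(X)E(X^2)$ vanishes by symmetry.

import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(-1, 1, size=500_000)
y = x ** 2                              # functionally (hence strongly) dependent on x

print(np.cov(x, y, ddof=0)[0, 1])       # near 0: the variables are uncorrelated
print(np.corrcoef(x, y)[0, 1])          # likewise near 0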
We have only considered a pair of random variables so far. But covariance and correlation generalize in a straightforward way to $n$-dimensional random vectors, in the form of matrices:
Definition 7.8
Let $\mathbf{X} = (X_1,\ldots,X_n)$ be an $n$-dimensional random vector.

1. We define the covariance matrix of $\mathbf{X}$ to be the $n\times n$ matrix

   $$\boldsymbol{\Sigma} = [\sigma_{ij}],$$

   where $\sigma_{ij} = \operatorname{Cov}(X_i,X_j)$.

2. We define the correlation matrix of $\mathbf{X}$ to be the $n\times n$ matrix

   $$[\rho_{ij}],$$

   where $\rho_{ij} = \operatorname{Corr}(X_i,X_j)$.
Covariance matrices will prove especially important in the next section.
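In practice, covariance and correlation matrices are usually estimated from data; NumPy's np.cov and np.corrcoef do exactly this. Here is a small sketch using a simulated three-dimensional random vector of my own choosing:

import numpy as np

rng = np.random.default_rng(42)
n = 100_000

# a simulated 3-dimensional random vector with dependent components
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)
x3 = rng.normal(size=n)                      # independent of the first two
data = np.stack([x1, x2, x3])                # shape (3, n): one row per component

print(np.cov(data))        # estimated covariance matrix (3 x 3, symmetric)
print(np.corrcoef(data))   # estimated correlation matrix (ones on the diagonal)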
Notice that both the covariance matrix and correlation matrix of a random vector are symmetric, meaning that each is equal to its own transpose. In fact, they satisfy a stronger condition, called positive semidefiniteness, which we now define for arbitrary square matrices.
Definition 7.9
Let $\mathbf{A}$ be a symmetric $n\times n$ matrix.

1. If $\mathbf{x}^\top \mathbf{A}\,\mathbf{x} \geq 0$ for all $\mathbf{x}\in\mathbb{R}^n$, then $\mathbf{A}$ is called positive semidefinite. If the only vector $\mathbf{x}$ for which equality holds is $\mathbf{x}=\mathbf{0}$, then $\mathbf{A}$ is called positive definite.

2. If $\mathbf{x}^\top \mathbf{A}\,\mathbf{x} \leq 0$ for all $\mathbf{x}\in\mathbb{R}^n$, then $\mathbf{A}$ is called negative semidefinite. If the only vector $\mathbf{x}$ for which equality holds is $\mathbf{x}=\mathbf{0}$, then $\mathbf{A}$ is called negative definite.
It will be convenient to have alternate characterizations of definite and semidefinite matrices:
Theorem 7.19
Let $\mathbf{A}$ be a symmetric $n\times n$ matrix.

1. The matrix $\mathbf{A}$ is positive semidefinite (definite) if and only if all its eigenvalues are nonnegative (positive).

2. The matrix $\mathbf{A}$ is positive semidefinite if and only if there is a positive semidefinite matrix $\mathbf{B}$ such that $\mathbf{A} = \mathbf{B}^2$. Moreover, the matrix $\mathbf{B}$ is the unique positive semidefinite matrix with this property.

The matrix $\mathbf{B}$ in the second statement is called the square root of $\mathbf{A}$, denoted $\mathbf{A}^{1/2}$.
Proof. Implicit in both characterizations are the claims that the eigenvalues of
where
Now, since
The columns of
where
The eigenvectors of
The characterizations in the first statement of the theorem follow immediately.
Turning toward the second statement, suppose that
Notice that the square roots are real, since the eigenvalues are (real) nonnegative. You may easily check that
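The construction of the square root used in this argument is easy to carry out numerically: diagonalize with np.linalg.eigh, take square roots of the (nonnegative) eigenvalues, and reassemble. A minimal sketch, using an arbitrary positive definite matrix of my own choosing:

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])                # an arbitrary positive definite matrix

eigvals, Q = np.linalg.eigh(A)            # spectral decomposition: A = Q diag(eigvals) Q^T
B = Q @ np.diag(np.sqrt(eigvals)) @ Q.T   # the positive semidefinite square root of A

print(B)
print(B @ B)                              # recovers A up to floating-point error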
We are now ready to prove the main result regarding covariance and correlation matrices:
Theorem 7.20 (Covariance and correlation matrices are positive semidefinite)
Let $\mathbf{X}$ be an $n$-dimensional random vector. Then its covariance matrix and its correlation matrix are both positive semidefinite.
Proof. We will only prove that the correlation matrix is positive semidefinite; the argument for the covariance matrix is essentially the same. Given any vector $\mathbf{a} = (a_1,\ldots,a_n)^\top\in\mathbb{R}^n$, note that $\rho_{ij} = \operatorname{Cov}(X_i/\sigma_i,\ X_j/\sigma_j)$, and so bilinearity of covariance yields

$$\mathbf{a}^\top [\rho_{ij}]\,\mathbf{a} = \sum_{i,j} a_ia_j\operatorname{Cov}\!\left(\frac{X_i}{\sigma_i},\ \frac{X_j}{\sigma_j}\right) = V\!\left(\sum_{i=1}^n \frac{a_i}{\sigma_i}X_i\right) \geq 0.$$

This establishes that the correlation matrix is positive semidefinite. Q.E.D.
7.6. Multivariate normal distributions#
The goal in this section is simple: Generalize the univariate normal distributions from Section 4.6 to higher dimensions. We needed to wait until the current chapter to do this because our generalization will require the machinery of covariance matrices that we developed at the end of the previous section. The ultimate effect will be that the familiar “bell curves” of univariate normal densities will turn into “bell (hyper)surfaces.” For example, in two dimensions, the density surface of a bivariate normal random vector might look something like this:

The isoprobability contours of this density surface (i.e., curves of constant probability) look like:
rho = 0.0
sigma1 = math.sqrt(2)
sigma2 = 1
Sigma = np.array([[sigma1 ** 2, rho * sigma1 * sigma2], [rho * sigma1 * sigma2, sigma2 ** 2]])
mu = np.array([1, 1])
norm = sp.stats.multivariate_normal(mean=mu, cov=Sigma)
x, y = np.mgrid[-2:4:0.1, -2:4:0.1]
grid = np.dstack((x, y))
z = norm.pdf(grid)
contour = plt.contour(x, y, z, colors=blue)
plt.clabel(contour, inline=True, fontsize=8)
plt.scatter(1, 1)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.gcf().set_size_inches(w=4, h=3)
plt.tight_layout()
Notice that these contours are concentric ellipses centered at the point $(1,1)$, whose principal axes are parallel to the coordinate axes.
Actually, if we consider only this special case (i.e., when the principal axes of the isoprobability surfaces are parallel with the coordinate axes), we do not need the full strength of the machinery of covariance matrices. Studying this case will also help us gain insight into the general formula for the density of a multivariate normal distribution. So, let's begin with a sequence of independent normal random variables $X_1,\ldots,X_n$ with $X_i\sim N(\mu_i,\sigma_i^2)$ for each $i$; by independence, their joint density is the product of the individual normal densities:
or
As you may easily check by creating your own plots, this formula (in dimension $n=2$) only ever produces density surfaces whose isoprobability contours are ellipses with principal axes parallel to the coordinate axes. It cannot, for example, produce a density surface with "tilted" elliptical contours like this one:

Indeed, the contours of this surface are given by:
rho = 0.5
sigma1 = 1
sigma2 = 2
Sigma = np.array([[sigma1 ** 2, rho * sigma1 * sigma2], [rho * sigma1 * sigma2, sigma2 ** 2]])
mu = np.array([1, 1])
norm = sp.stats.multivariate_normal(mean=mu, cov=Sigma)
x, y = np.mgrid[-3:5:0.1, -3:5:0.1]
grid = np.dstack((x, y))
z = norm.pdf(grid)
contour = plt.contour(x, y, z, colors=blue)
plt.clabel(contour, inline=True, fontsize=8)
plt.scatter(1, 1)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.gcf().set_size_inches(w=4, h=3)
plt.tight_layout()
The key to uncovering the formula for the density in the general case begins by returning to the formula (7.11) and rewriting it using vector and matrix notation. Indeed, notice that if we set

$$\mathbf{x} = (x_1,\ldots,x_n)^\top, \qquad \boldsymbol{\mu} = (\mu_1,\ldots,\mu_n)^\top,$$

and let

$$\boldsymbol{\Sigma} = \operatorname{diag}(\sigma_1^2,\ldots,\sigma_n^2),$$

then (7.11) may be rewritten as

$$f(\mathbf{x}) = \frac{1}{(2\pi)^{n/2}\det(\boldsymbol{\Sigma})^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right),$$

where $\boldsymbol{\Sigma}$ is, in this special (independent) case, a diagonal matrix. The general definition simply allows $\boldsymbol{\Sigma}$ to be an arbitrary positive definite matrix:
Definition 7.10
Let $\boldsymbol{\mu}\in\mathbb{R}^n$ be a vector and let $\boldsymbol{\Sigma}$ be an $n\times n$ positive definite matrix. A continuous $n$-dimensional random vector $\mathbf{X}$ is said to have a multivariate normal distribution with parameters $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$, written

$$\mathbf{X} \sim N_n(\boldsymbol{\mu},\boldsymbol{\Sigma}),$$

if its probability density function is given by

$$f(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{n/2}\det(\boldsymbol{\Sigma})^{1/2}} \exp\!\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) \tag{7.12}$$

with support $\mathbb{R}^n$.
Since $\boldsymbol{\Sigma}$ is positive definite, it is invertible and $\det(\boldsymbol{\Sigma}) > 0$, so every expression appearing in the formula (7.12) makes sense.
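To double-check the formula (7.12) numerically, we can code it up by hand and compare against SciPy's built-in implementation; this is only a quick sketch, reusing the parameters $\boldsymbol{\mu} = (1,1)$ and $\boldsymbol{\Sigma} = \operatorname{diag}(2,1)$ from the first contour plot above.

import numpy as np
import scipy as sp

mu = np.array([1.0, 1.0])
Sigma = np.array([[2.0, 0.0],
                  [0.0, 1.0]])

def mvn_pdf(x, mu, Sigma):
    # evaluate the density (7.12) by hand
    n = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.inv(Sigma) @ diff
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** n * np.linalg.det(Sigma))

x = np.array([0.5, 2.0])
print(mvn_pdf(x, mu, Sigma))
print(sp.stats.multivariate_normal(mean=mu, cov=Sigma).pdf(x))   # should match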
To help understand the shape of the density (hyper)surfaces created by (7.12), it helps to ignore the normalizing constant and write

$$f(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma}) \propto \exp\!\left(-\tfrac{1}{2}\, d(\mathbf{x},\boldsymbol{\mu})^2\right),$$

where

$$d(\mathbf{x},\boldsymbol{\mu}) = \sqrt{(\mathbf{x}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})}$$

is the so-called Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$. Note that since $\boldsymbol{\Sigma}^{-1}$ is positive definite, the quantity under the square root is nonnegative, and it vanishes exactly when $\mathbf{x}=\boldsymbol{\mu}$. The isoprobability contours of the density are therefore the level sets

$$\{\mathbf{x}\in\mathbb{R}^n : d(\mathbf{x},\boldsymbol{\mu}) = c\},$$

where $c\geq 0$ is a constant. In other words, they are "spheres" of radius $c$ centered at $\boldsymbol{\mu}$, where now "distance" means Mahalanobis distance rather than the usual Euclidean distance. Since the quadratic form $(\mathbf{x}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu})$ is positive definite in $\mathbf{x}-\boldsymbol{\mu}$, its level sets are ellipsoids,
and hence the square root $d(\mathbf{x},\boldsymbol{\mu})$ has the same level sets; in Euclidean terms,
these "spheres" are ellipses centered at $\boldsymbol{\mu}$. For the running example above, the level curves of the Mahalanobis distance look like this:
def mahalanobis(x, mean, cov):
return np.sum(np.matmul(x - mean, np.linalg.inv(cov)) * (x - mean), axis=-1)
x, y = np.mgrid[-5:7:0.1, -5:7:0.1]
grid = np.dstack((x, y))
z = mahalanobis(x=grid, mean=mu, cov=Sigma)
contour = plt.contour(x, y, z, colors=blue, levels=15)
plt.clabel(contour, inline=True, fontsize=8)
plt.scatter(1, 1)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.gcf().set_size_inches(w=4, h=3)
plt.tight_layout()
Let’s collect our observations in the following theorem, along with a few additional facts:
Theorem 7.21 (Isoprobability contours of normal vectors)
Suppose that $\mathbf{X} \sim N_n(\boldsymbol{\mu},\boldsymbol{\Sigma})$.

1. The isoprobability contours of the density function $f(\mathbf{x};\boldsymbol{\mu},\boldsymbol{\Sigma})$ are concentric ellipsoids centered at $\boldsymbol{\mu}$, defined by equations

   $$(\mathbf{x}-\boldsymbol{\mu})^\top\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}) = c^2 \tag{7.18}$$

   for fixed $c > 0$.

2. For $c > 0$, the principal axes of the ellipsoid defined by (7.18) point along the eigenvectors of the matrix $\boldsymbol{\Sigma}$. The half-lengths of the principal axes are given by $c\sqrt{\lambda_1},\ldots,c\sqrt{\lambda_n}$, where $\lambda_1,\ldots,\lambda_n$ are the corresponding eigenvalues of $\boldsymbol{\Sigma}$.

3. In particular, if $\boldsymbol{\Sigma}$ is a (positive) multiple of the identity matrix, then the isoprobability contours are concentric spheres centered at $\boldsymbol{\mu}$.
We have already proved the first statement; for the second, see Section 4.4 in [HardleS19], for example.
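The second statement is easy to check numerically. Here is a quick sketch using the $\rho=0.5$, $\sigma_1=1$, $\sigma_2=2$ covariance matrix from the contour plot above: the eigenvectors of $\boldsymbol{\Sigma}$ give the directions of the principal axes of the contour ellipses, and $c\sqrt{\lambda_i}$ gives their half-lengths.

import numpy as np

rho, sigma1, sigma2 = 0.5, 1, 2
Sigma = np.array([[sigma1 ** 2, rho * sigma1 * sigma2],
                  [rho * sigma1 * sigma2, sigma2 ** 2]])

eigvals, eigvecs = np.linalg.eigh(Sigma)

c = 1.0                         # the contour (x - mu)^T Sigma^{-1} (x - mu) = c^2
print(eigvecs)                  # columns: directions of the principal axes
print(c * np.sqrt(eigvals))     # half-lengths of the principal axes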
It will be useful to generalize a pair of important univariate results from Section 4.6 to the case of multivariate normal distributions. The first result is a generalization of Theorem 4.7 that states (invertible) affine transformations of normal random variables are still normal. The same is true for normal random vectors:
Theorem 7.22 (Affine transformations of normal vectors)
Let $\mathbf{X} \sim N_n(\boldsymbol{\mu},\boldsymbol{\Sigma})$, let $\mathbf{A}$ be an invertible $n\times n$ matrix, and let $\mathbf{b}\in\mathbb{R}^n$. Then

$$\mathbf{A}\mathbf{X} + \mathbf{b} \sim N_n\big(\mathbf{A}\boldsymbol{\mu} + \mathbf{b},\ \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top\big).$$
Proof. The proof goes through Theorem 7.6. In the language and notation of that theorem, we have
and so
for all
where we recognize the expression on the second line as the density of an $N_n\big(\mathbf{A}\boldsymbol{\mu}+\mathbf{b},\ \mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top\big)$ distribution. Q.E.D.
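We can also confirm the theorem by simulation; the sketch below uses an arbitrary invertible matrix $\mathbf{A}$ and shift vector $\mathbf{b}$ of my own choosing, and checks that the sample mean and covariance of $\mathbf{A}\mathbf{X}+\mathbf{b}$ match $\mathbf{A}\boldsymbol{\mu}+\mathbf{b}$ and $\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^\top$.

import numpy as np
import scipy as sp

rng = np.random.default_rng(42)

mu = np.array([1.0, 1.0])
Sigma = np.array([[1.0, 1.0],
                  [1.0, 4.0]])
A = np.array([[2.0, 0.0],
              [1.0, -1.0]])                 # an arbitrary invertible matrix
b = np.array([0.0, 3.0])

X = sp.stats.multivariate_normal(mean=mu, cov=Sigma).rvs(size=200_000, random_state=rng)
Y = X @ A.T + b                             # apply the affine transformation row-wise

print(Y.mean(axis=0), A @ mu + b)           # sample mean vs. A mu + b
print(np.cov(Y.T))                          # sample covariance ...
print(A @ Sigma @ A.T)                      # ... vs. A Sigma A^T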
The second result is a generalization of Corollary 4.1, which states that we may perform an invertible affine transformation of a normal random variable to obtain a standard normal one. Again, the same is true for normal random vectors. To state this generalized result formally, we require the concept of the square root of a positive definite matrix that we introduced in Theorem 7.19.
Corollary 7.1 (Standardization of normal vectors)
If $\mathbf{X} \sim N_n(\boldsymbol{\mu},\boldsymbol{\Sigma})$, then

$$\boldsymbol{\Sigma}^{-1/2}(\mathbf{X}-\boldsymbol{\mu}) \sim N_n(\mathbf{0},\mathbf{I}),$$

where $\boldsymbol{\Sigma}^{-1/2}$ is the inverse of the square root of $\boldsymbol{\Sigma}$ and $\mathbf{I}$ is the $n\times n$ identity matrix.
The affine transformation $\mathbf{x}\mapsto\boldsymbol{\Sigma}^{-1/2}(\mathbf{x}-\boldsymbol{\mu})$ plays the same standardizing role for normal random vectors that the transformation $x\mapsto(x-\mu)/\sigma$ plays for univariate normal variables. One easy consequence is the following result on the components of a normal random vector:
Theorem 7.23 (Components of normal vectors)
Let $\mathbf{X} = (X_1,\ldots,X_n) \sim N_n(\boldsymbol{\mu},\boldsymbol{\Sigma})$, where $\boldsymbol{\mu} = (\mu_1,\ldots,\mu_n)$ and $\boldsymbol{\Sigma} = [\sigma_{ij}]$.
Then each component random variable $X_i$ is (univariate) normal; in fact, $X_i \sim N(\mu_i, \sigma_{ii})$ for each $i = 1,\ldots,n$.
Proof. By Corollary 7.1, the random vector
But
where
However, we have
since
In fact, an even stronger result than Theorem 7.23 is true: Not only are the individual univariate components of a normal random vector themselves normal, but any linear combination of these components is also normal. Even more surprising, this fact turns out to give a complete characterization of normal random vectors! This is the content of:
Theorem 7.24 (The Linear-Combination Criterion for Normal Random Vectors)
Let $\mathbf{X} = (X_1,\ldots,X_n)$ be an $n$-dimensional random vector. Then $\mathbf{X}$ is a normal random vector if and only if every linear combination

$$a_1X_1 + \cdots + a_nX_n, \qquad a_1,\ldots,a_n\in\mathbb{R},$$

is a normal random variable. In this case, the parameter $\boldsymbol{\mu}$ is the mean vector of $\mathbf{X}$ and the parameter $\boldsymbol{\Sigma}$ is its covariance matrix.
We will not prove this, since the clearest proof (that I know of) makes use of characteristic functions. (See, for example, Chapter 5 in [Gut09].)
At least in two dimensions, using the fact that the parameter $\boldsymbol{\Sigma}$ is the covariance matrix of the random vector, we may conveniently parametrize bivariate normal distributions on $(X,Y)$ by the means $\mu_X,\mu_Y$, the standard deviations $\sigma_X,\sigma_Y$, and the correlation $\rho = \rho_{XY}$,
where the covariance matrix is

$$\boldsymbol{\Sigma} = \begin{bmatrix}\sigma_X^2 & \rho\sigma_X\sigma_Y \\ \rho\sigma_X\sigma_Y & \sigma_Y^2\end{bmatrix}.$$

The plots below show isoprobability contours of bivariate normal densities (all with mean vector $(1,1)$) for several choices of $(\rho,\sigma_X,\sigma_Y)$:
def covar_matrix(rho, sigma1, sigma2):
return np.array([[sigma1 ** 2, rho * sigma1 * sigma2], [rho * sigma1 * sigma2, sigma2 ** 2]])
parameters = [[0.5, 1, 2], [0.5, 1, 4], [-0.5, 1, 2], [0, 3, 2]] # rho, sigma1, sigma2
x, y = np.mgrid[-2:4:0.1, -2:4:0.1]
grid = np.dstack((x, y))
_, axes = plt.subplots(ncols=2, nrows=2, figsize=(8, 6), sharex=True, sharey=True)
for parameter, axis in zip(parameters, axes.flatten()):
Sigma = covar_matrix(*parameter)
rho, sigma1, sigma2 = parameter
norm = sp.stats.multivariate_normal(mean=np.array([1, 1]), cov=Sigma)
z = norm.pdf(grid)
axis.contour(x, y, z, colors=blue)
axis.scatter(1, 1)
axis.set_xlabel('$x$')
axis.set_ylabel('$y$')
axis.set_title(f'$(\\rho,\\sigma_X,\\sigma_Y)=({rho},{sigma1},{sigma2})$')
plt.tight_layout()
In the programming assignment for this chapter, you will actually implement this procedure. For example, I used these methods to simulate a sample of size 1000 drawn from a bivariate normal distribution; the simulated points are plotted below, along with the isoprobability contours of the underlying density:
rho = 0.5
sigma1 = 1
sigma2 = 2
Sigma = np.array([[sigma1 ** 2, rho * sigma1 * sigma2], [rho * sigma1 * sigma2, sigma2 ** 2]])
mu = np.array([1, 1])
norm = sp.stats.multivariate_normal(mean=mu, cov=Sigma)
np.random.seed(42)
sample = norm.rvs(size=1000)
x, y = np.mgrid[-5:7:0.1, -5:7:0.1]
grid = np.dstack((x, y))
z = norm.pdf(grid)
sns.scatterplot(x=sample[:, 0], y=sample[:, 1], alpha=0.5)
plt.contour(x, y, z, colors=magenta, alpha=1)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.gcf().set_size_inches(w=5, h=4)
plt.tight_layout()
Let’s finish off this long chapter with an example problem:
Problem Prompt
Do problem 12 on the worksheet.