10. Optimization
Lest the reader wonder why an entire chapter on optimization is included in a textbook on probability and statistics, we suggest that they recall the discussion in the expository chapter Chapter 8. There, we described the rationale: our overarching goal over the current sequence of chapters is to train probabilistic models on data. To do this, we minimize the KL divergence between the proposed model distribution and the empirical distribution of the data, and such minimization problems are exactly the business of optimization.
Many of the optimization problems we will encounter do not have closed-form solutions, and so we will need to study methods for approximation. All those studied in this chapter are versions of an iterative method called gradient descent. As a warm-up, we will study a simple single-variable version of this method in Section 10.1 before proceeding to the full multi-variable version in Section 10.3. Sandwiched between these two sections is Section 10.2, where we review the tools from elementary differential geometry that allow the generalization from one variable to many. Then, we finish with Section 10.4, where we study a particular form of gradient descent, called stochastic gradient descent, that is specifically tailored for the objective functions encountered in training probabilistic models. An appendix is included at the end of the chapter, containing some basic theorems on convex functions and their (rather fussy) proofs.
As hinted in the first paragraph, the inclusion of gradient-based optimization algorithms and their applications to parameter estimation is what distinguishes this book from a traditional book on mathematical statistics. This material is often included in texts on machine learning, but it is not in any text on statistics (that I know of). However, we are just barely scratching the surface of optimization and machine learning. If you are new to these fields and want to learn more, I suggest beginning with the fifth chapter of [GBC16] for a quick overview. After this, you can move on to [HR22], before tackling the massive, encyclopedic texts [Mur22] and [Mur23].
10.1. Gradient descent in one variable
In this section, we describe the single-variable version of the gradient descent algorithm to help motivate the general algorithm in arbitrary dimensions. To begin, consider the optimization problem of locating the minimum values of the polynomial function

$$
J(\theta) = \theta^4 - 6\theta^3 + 11\theta^2 - 7\theta + 4.
$$

This function is called the objective function of the optimization problem. Its graph is displayed in the following figure:
Show code cell source
import torch
from torch.distributions.multivariate_normal import MultivariateNormal
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib_inline.backend_inline
from math_stats_ml.gd import GD, SGD, plot_gd
plt.style.use('../aux-files/custom_style_light.mplstyle')
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
blue = '#486AFB'
magenta = '#FD46FC'
def J(theta):
    return (theta ** 4) - 6 * (theta ** 3) + 11 * (theta ** 2) - 7 * theta + 4
grid = torch.linspace(start=-0.5, end=3.5, steps=300)
plt.plot(grid, J(grid))
plt.xlabel('$\\theta$')
plt.ylabel('$J(\\theta)$')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
From the graph, we see that the objective function has two local minimums, located at approximately $\theta = 0.5$ and $\theta = 2.7$, with the latter being the global minimum.
It will be convenient to introduce the following terminology for our optimization problems:
Definition 10.1
Let $J: \mathbb{R} \to \mathbb{R}$ be a function. A point $\theta^\star$ is a local minimizer of $J$ if

$$
J(\theta^\star) \leq J(\theta)
$$

for all $\theta$ in some neighborhood of $\theta^\star$; if this inequality holds for all $\theta \in \mathbb{R}$, then $\theta^\star$ is called a global minimizer. Local and global maximizers are defined by reversing the inequality, and a point that is either a local (global) minimizer or maximizer is called a local (global) extremizer.

Using this terminology, we would say that $\theta \approx 0.5$ is a local minimizer of our objective function $J$, while $\theta \approx 2.7$ is a global minimizer.
Let’s see how the single-variable version of the gradient descent algorithm would solve our optimization problem. This algorithm depends on an initial guess for a minimizer, as well as two parameters called the learning rate and the number of gradient steps. We will state the algorithm first, and then walk through some intuition for why it works:
Algorithm 10.1 (Single-variable gradient descent)
Input: A differentiable objective function $J: \mathbb{R} \to \mathbb{R}$, an initial guess $\theta := \theta_0$ for a local minimizer, a learning rate $\alpha > 0$, and the number $N$ of gradient steps.

Output: An approximation to a local minimizer $\theta^\star$.

For $t$ from $1$ to $N$, do: $\theta := \theta - \alpha J'(\theta)$.

Return $\theta$.
Though this simple description of the algorithm outputs just a single approximation to a local minimizer, in practice it is often convenient to output the entire sequence of $\theta$'s produced in the for loop:

$$
\theta_0, \theta_1, \ldots, \theta_N.
$$

Then, as we will see below, by tracking the associated objective values

$$
J(\theta_0), J(\theta_1), \ldots, J(\theta_N),
$$

one may monitor convergence of the algorithm.
The assignment

$$
\theta := \theta - \alpha J'(\theta)
$$

in the for loop is called the update rule. This form is convenient for implementation in code, but for theoretical analysis it is often clearer to rewrite the rule as a recurrence relation:

$$
\theta_t = \theta_{t-1} - \alpha J'(\theta_{t-1})
$$

for all $t \geq 1$, where $\theta_t$ denotes the value of $\theta$ after $t$ gradient steps.
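To make the recurrence concrete, here is a minimal sketch of the algorithm in plain Python. (The function name gradient_descent and its signature are our own, for illustration only; the figures in this chapter are generated with the fuller GD function imported above.)

def gradient_descent(J_prime, theta, alpha, num_steps):
    """A minimal sketch of single-variable gradient descent, returning all iterates."""
    thetas = [theta]
    for _ in range(num_steps):
        theta = theta - alpha * J_prime(theta)  # the update rule
        thetas.append(theta)
    return thetas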
To understand the intuition behind the algorithm, consider the two cases that the derivative $J'(\theta_0)$ at an initial guess $\theta_0$ is either negative or positive:
Show code cell source
def J_prime(theta):
    return 4 * (theta ** 3) - 18 * (theta ** 2) + 22 * theta - 7
plt.plot(grid, J(grid), color=blue)
plt.plot(grid, J_prime(-0.4) * (grid + 0.4) + J(-0.4), color=magenta, zorder=10)
plt.scatter(-0.4, J(-0.4), color=magenta, s=100, zorder=15)
plt.scatter(-0.4, 0, color=magenta, s=100, zorder=20)
plt.plot([-0.4, -0.4], [J(-0.4), 0], color=magenta, linestyle='--')
plt.plot(grid, J(grid))
plt.plot(grid, J_prime(3.3) * (grid - 3.3) + J(3.3))
plt.scatter(3.3, J(3.3), color=magenta, s=100, zorder=10)
plt.scatter(3.3, 0, color=magenta, s=100, zorder=10)
plt.plot([3.3, 3.3], [J(3.3), 0], color=magenta, linestyle='--')
plt.text(-0.6, 0.6, '$\\theta_0$', ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))
plt.text(0.15, J(-0.4), "$J'(\\theta_0)<0$", ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))
plt.text(3.5, 0.6, '$\\theta_0$', ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))
plt.text(2.8, J(3.3), "$J'(\\theta_0)>0$", ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))
plt.xlim(-0.8, 3.7)
plt.ylim(-0.3, 12.2)
plt.xlabel('$\\theta$')
plt.ylabel('$J(\\theta)$')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
In this plot, we’ve drawn the tangent lines to the graph of $J(\theta)$ at two initial points, $\theta_0 = -0.4$ and $\theta_0 = 3.3$. When $J'(\theta_0) < 0$, the update rule moves $\theta$ in the positive direction, since the step $-\alpha J'(\theta_0)$ is positive; when $J'(\theta_0) > 0$, it moves $\theta$ in the negative direction. But notice that the nearest minimizer to $\theta_0 = -0.4$ lies to the right, while the nearest minimizer to $\theta_0 = 3.3$ lies to the left. So, in both cases, the algorithm steps toward a minimizer, provided that the step $-\alpha J'(\theta_0)$ is not too large (in magnitude), causing the new value of $\theta$ to overshoot.
From these considerations, we conclude the following:
Observation 10.1
1. The negative derivative $-J'(\theta)$ always "points downhill."
2. When the gradient descent algorithm works, it locates a minimizer by following the negative derivative "downhill."

The sense in which the negative derivative "points downhill" is made precise by our observation above: it is positive if the objective function is decreasing at $\theta$ (so the algorithm steps to the right, downhill) and negative if the objective function is increasing at $\theta$ (so the algorithm steps to the left, again downhill).
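For example, using the implementation of the derivative in the code cell above, we may check the signs at the two initial points in the figure:

print(J_prime(-0.4))  # about -18.94 < 0: the negative derivative points right, downhill
print(J_prime(3.3))   # about 13.33 > 0: the negative derivative points left, downhill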
Let’s run the gradient descent algorithm four times, with various settings of the parameters:
Show code cell source
gd_parameters = {'init_parameters': [torch.tensor([-0.5]),
torch.tensor([3.45]),
torch.tensor([-0.5]),
torch.tensor([3.45])],
'num_steps': [8, 7, 5, 5],
'lr': [1e-2, 1e-2, 1e-1, 2e-1]}
_, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 5))
grid = torch.linspace(start=-0.5, end=3.5, steps=300)
for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(**gd_parameters_slice, J=J)
    alpha = gd_parameters_slice['lr']
    N = gd_parameters_slice['num_steps']
    thetas = gd_output.parameters['theta']
    axis.plot(grid, J(grid))
    axis.step(x=thetas, y=gd_output.per_step_objectives, where='post', color=magenta, zorder=2)
    axis.scatter(x=thetas, y=gd_output.per_step_objectives, s=30, color=magenta, zorder=2)
    axis.scatter(x=thetas[0], y=gd_output.per_step_objectives[0], s=100, color=magenta, zorder=2)
    axis.set_xlabel('$\\theta$')
    axis.set_ylabel('$J(\\theta)$')
    axis.set_title(f'$\\alpha={alpha}$, $N={N}$')
plt.tight_layout()
In all four plots, the large magenta dot represents the initial point $(\theta_0, J(\theta_0))$, while the smaller dots represent the points

$$
(\theta_t, J(\theta_t)), \quad t = 1, 2, \ldots, N,
$$

where $N$ is the number of gradient steps.
Problem Prompt
Do problem 1 on the worksheet.
It is possible for the gradient descent algorithm to diverge, especially if the learning rate is too large. For example, suppose that we set the learning rate to $\alpha = 0.2$, begin at $\theta_0 = 3.5$, and take $N = 3$ gradient steps:
Show code cell source
gd_output = GD(J=J, init_parameters=torch.tensor([3.5]), lr=2e-1, num_steps=3)
thetas = gd_output.parameters['theta']
grid = torch.linspace(start=-55, end=50, steps=300)
plt.plot(grid, J(grid))
plt.step(x=thetas, y=gd_output.per_step_objectives, where='post', color=magenta, zorder=2)
plt.scatter(x=thetas, y=gd_output.per_step_objectives, s=30, color=magenta, zorder=2)
plt.scatter(x=thetas[0], y=gd_output.per_step_objectives[0], s=100, color=magenta, zorder=2)
plt.xlabel('$\\theta$')
plt.ylabel('$J(\\theta)$')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
We see already that the iterates have fled the region containing the minimizers, and the objective values are growing rapidly: the algorithm is diverging.
Problem Prompt
Do problems 2 and 3 on the worksheet.
Of course, one can often prevent divergence by simply using a smaller learning rate, but sometimes a large initial learning rate is desirable to help the algorithm quickly find the neighborhood of a minimizer. So, what we desire is a scheme to shrink the learning rate from large values to smaller ones as the algorithm runs. This scheme is called a learning rate schedule, and it is implemented by adding an extra decay rate parameter to the gradient descent algorithm:
Algorithm 10.2 (Single-variable gradient descent with learning rate decay)
Input: A differentiable objective function $J: \mathbb{R} \to \mathbb{R}$, an initial guess $\theta := \theta_0$ for a local minimizer, a learning rate $\alpha > 0$, a decay rate $\beta \in [0, 1)$, and the number $N$ of gradient steps.

Output: An approximation to a local minimizer $\theta^\star$.

For $t$ from $1$ to $N$, do: $\theta := \theta - \alpha(1 - \beta)^t J'(\theta)$.

Return $\theta$.
The new parameter $\beta$ is called the decay rate. Notice that the effective learning rate at step $t$ is $\alpha(1 - \beta)^t$, which shrinks toward $0$ as the algorithm runs,

provided that $\beta > 0$; taking $\beta = 0$ recovers the original Algorithm 10.1.
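Under this decay scheme, the only change to our earlier sketch is in the update line. (Again, this simplified gradient_descent_decay is ours, and we are assuming the exponential schedule described above; the exact schedule used by the imported GD function is determined by its implementation.)

def gradient_descent_decay(J_prime, theta, alpha, beta, num_steps):
    """A minimal sketch of single-variable gradient descent with learning rate decay."""
    thetas = [theta]
    for t in range(1, num_steps + 1):
        theta = theta - alpha * (1 - beta) ** t * J_prime(theta)  # decayed step
        thetas.append(theta)
    return thetas

Let's now run the algorithm with the same learning rate $\alpha = 0.2$ and initial guess $\theta_0 = 3.5$ that produced the divergence above, but with decay rate $\beta = 0.1$: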
Show code cell source
gd_output = GD(J=J, init_parameters=torch.tensor([3.5]), lr=2e-1, num_steps=8, decay_rate=0.1)
thetas = gd_output.parameters['theta']
grid = torch.linspace(start=-0.5, end=3.5, steps=300)
plt.plot(grid, J(grid))
plt.step(x=thetas, y=gd_output.per_step_objectives, where='post', color=magenta, zorder=2)
plt.scatter(x=thetas, y=gd_output.per_step_objectives, s=30, color=magenta, zorder=2)
plt.scatter(x=thetas[0], y=gd_output.per_step_objectives[0], s=100, color=magenta, zorder=2)
plt.xlabel('$\\theta$')
plt.ylabel('$J(\\theta)$')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
We have carried out $N = 8$ gradient steps with the decayed learning rate. In contrast to the divergent run above, the iterates now settle down near the global minimizer at $\theta \approx 2.7$.
Problem Prompt
Do problem 4 on the worksheet.
The learning rate $\alpha$, decay rate $\beta$, and number of gradient steps $N$ are often chosen by trial and error, guided by plots of the objective values $J(\theta_t)$ against the gradient steps $t$. For example, for a run with $\theta_0 = -0.5$, $\alpha = 0.01$, $\beta = 0.1$, and $N = 15$,

we would plot the following:
Show code cell source
gd_output = GD(J=J, init_parameters=torch.tensor([-0.5]), lr=1e-2, num_steps=15, decay_rate=0.1)
plot_gd(gd_output, h=3, plot_title=False, parameter_title=False, ylabel='objective', per_step_alpha=1)
plt.tight_layout()
One may use this plot to decide on the total number $N$ of gradient steps: once the curve of objective values flattens out, additional gradient steps yield little further decrease in the objective, and the algorithm may be halted.
10.2. Differential geometry
If $\theta^\star$ is a local extremizer of a differentiable function $J: \mathbb{R} \to \mathbb{R}$, then the derivative must vanish there: $J'(\theta^\star) = 0$. Points at which the derivative vanishes are called stationary points.

The name arises from the observation that small (first-order infinitesimal) perturbations of $\theta$ away from a stationary point leave the value of $J$ unchanged, to first order; the function is "stationary" at such points.
Our goal in this section is twofold: First, we briefly recall how these curvature considerations help us identify extremizers in the single-variable case—the relevant tools are the first and second derivatives. Then, we generalize these derivatives to higher dimensions to obtain gradient vectors and Hessian matrices. We indicate how local curvature in higher dimensions may be computed using these new tools and, in particular, how we may use them to identify extremizers.
So, let’s begin with the familiar routine from single-variable calculus called the Second Derivative Test. Given a point $\theta^\star$ and a twice-differentiable function $J: \mathbb{R} \to \mathbb{R}$:

1. If $J'(\theta^\star) = 0$ and $J''(\theta^\star) > 0$, then $\theta^\star$ is a local minimizer.
2. If $J'(\theta^\star) = 0$ and $J''(\theta^\star) < 0$, then $\theta^\star$ is a local maximizer.
Provided that the second derivative of $J$ not only exists but is also continuous, its sign describes the local curvature of the graph: positive second derivative means the graph is convex (it "opens upward"), while negative means it is concave (it "opens downward"):
Show code cell source
def f(x):
    return x ** 2 + 4

def g(x):
    return -x ** 2 + 4
functions = [f, g]
grid = np.linspace(-2, 2)
_, axes = plt.subplots(ncols=2, figsize=(8, 3), sharey=True, sharex=True)
for i, (function, axis) in enumerate(zip(functions, axes)):
    axis.plot(grid, function(grid))
    axis.scatter(0, 4, s=50, color=magenta, zorder=3)
    if i == 0:
        axis.text(0, 3, "$J '(0) = 0$, $J ''(0)>0$", ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))
        axis.set_title("convex $\\Rightarrow$ minimizer")
    else:
        axis.text(0, 5, "$J '(0) = 0$, $J ''(0)<0$", ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))
        axis.set_title("concave $\\Rightarrow$ maximizer")
plt.tight_layout()
To see why second derivatives encode local curvature, let’s suppose $J''(\theta_0) > 0$ at some point $\theta_0$. By continuity, $J''$ remains positive in a neighborhood of $\theta_0$, and so the first derivative $J'$ is increasing near $\theta_0$. Thus, the slopes of the tangent lines increase as we move from left to right, which is exactly what it means for the graph to "open upward"; the concave case $J''(\theta_0) < 0$ is similar.
We already met the notion of concavity back in Section 9.2, where it was fundamental in our proof of Gibbs' inequality in Theorem 9.3. As shown in the figure above, a function is concave if its graph lies below its secant lines, while it is convex if its graph lies above. Both these shapes have implications for the search for extremizers—in particular, stationary points of concave and convex functions are always global extremizers. The proof of this claim, along with equivalent characterizations of concavity and convexity in terms of tangent lines and tangent (hyper)planes, are given in the two main theorems in the appendix. (The claims are easily believable, while the proofs are annoyingly fussy. At least glance at the statements of the theorems to convince yourself that your intuition is on point, but do not feel compelled to go through the proofs line by line.)
How might we generalize the Second Derivative Test to higher dimensions? To help gain insight into the answer, let’s first add only one additional dimension, going from a function of a single variable to a twice-differentiable function of two variables, $J: \mathbb{R}^2 \to \mathbb{R}$.
Actually, to obtain the best results relating curvature to second derivatives, we shall assume that the second-order partial derivatives are continuous, though this isn’t strictly needed for some definitions and results below. Functions with continuous first- and second-order partial derivatives are said to be of class $C^2$.
For example, let’s suppose that the graph of $J$ is the surface displayed in the following figure:

At any given point on this surface (like the one above the black dot) there is not a single slope and curvature, but rather infinitely many slopes and curvatures, one for each of the different directions in which one may step in the plane $\mathbb{R}^2$:

Taking advantage of the very special circumstance that our graph is embedded as a surface in $\mathbb{R}^3$, we may visualize these directional slopes and curvatures by slicing the surface with vertical planes through the point:

The intersections of these vertical planes and the surface yield curves called sections—for the planes displayed in the plot above, the sections are a trio of downward opening parabolas. The slopes and curvatures on the surface in the three directions are then the slopes and curvatures of these sectional curves.
To obtain these slopes and curvatures, let’s suppose that $\theta \in \mathbb{R}^2$ is the point of interest and $v \in \mathbb{R}^2$ is a unit vector indicating a direction. Then the parametrized path

$$
t \mapsto \theta + tv
$$

traces out the line in the plane $\mathbb{R}^2$ through $\theta$ in the direction of $v$, and the composite mapping

$$
t \mapsto J(\theta + tv)
$$

is exactly the sectional curve on the surface. Notice that this mapping is a real-valued function of a single real variable $t$, and so it has first and second derivatives in the usual sense of single-variable calculus. This suggests:
Definition 10.2
Let $J: \mathbb{R}^n \to \mathbb{R}$ be of class $C^2$, let $\theta \in \mathbb{R}^n$ be a point, and let $v \in \mathbb{R}^n$ be a unit vector. We define the directional first derivative of $J$ at $\theta$ in the direction $v$ to be

$$
J_v'(\theta) = \frac{d}{dt} \bigg|_{t=0} J(\theta + tv),
$$

while we define the directional second derivative to be

$$
J_v''(\theta) = \frac{d^2}{dt^2} \bigg|_{t=0} J(\theta + tv).
$$

In this context, the vector $v$ is called the directional vector.
Problem Prompt
Do problem 5 on the worksheet.
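Since the directional derivatives are ordinary derivatives of the single-variable function $t \mapsto J(\theta + tv)$, they are easy to approximate numerically with finite differences. Here is a quick sketch; the example function, point, direction, and step size below are all our own choices, made purely for illustration:

def f(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2  # an example C^2 function

theta = np.array([1.0, 1.0])
v = np.array([1.0, 0.0])  # a unit directional vector
h = 1e-4

g = lambda t: f(theta + t * v)  # the sectional curve through theta in the direction v
first = (g(h) - g(-h)) / (2 * h)             # directional first derivative: about 2.0
second = (g(h) - 2 * g(0) + g(-h)) / h ** 2  # directional second derivative: about 2.0
print(first, second)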
The familiar relations between these directional derivatives and partial derivatives pass through the gadgets defined in the following box. The first is familiar to us from our course in multi-variable calculus:
Definition 10.3
Let $J: \mathbb{R}^n \to \mathbb{R}$ be of class $C^2$ and let $\theta \in \mathbb{R}^n$. We define the gradient vector to be

$$
\nabla J(\theta) = \begin{bmatrix} \dfrac{\partial J}{\partial \theta_1}(\theta) \\ \vdots \\ \dfrac{\partial J}{\partial \theta_n}(\theta) \end{bmatrix} \in \mathbb{R}^n,
$$

while we define the Hessian matrix to be

$$
\operatorname{Hess} J(\theta) = \left[ \frac{\partial^2 J}{\partial \theta_i \partial \theta_j}(\theta) \right]_{i,j=1}^n \in \mathbb{R}^{n \times n}.
$$

Note that since $J$ is of class $C^2$, the mixed second-order partial derivatives are equal (by Clairaut's Theorem), and hence the Hessian matrix is symmetric.
We have already met the symbol "$\nabla$" before, applied to vector-valued functions to produce gradient matrices of first-order partial derivatives. In fact, if we write a vector-valued function in terms of its component functions, then its gradient matrix is exactly the matrix whose rows are the gradient vectors of the components. Of course, the gradient $\nabla J: \mathbb{R}^n \to \mathbb{R}^n$ of a scalar-valued function $J$ is itself a vector-valued function, and so it has its own gradient matrix. This yields nothing new:
Theorem 10.1 (Hessian matrices are gradient matrices of gradient vectors)
Let $J: \mathbb{R}^n \to \mathbb{R}$ be of class $C^2$. Then

$$
\operatorname{Hess} J(\theta) = \nabla(\nabla J)(\theta),
$$

where on the right-hand side the gradient vector $\nabla J$ is viewed as a vector-valued function $\mathbb{R}^n \to \mathbb{R}^n$.

Proof. By definition, the $(i, j)$-th entry of the gradient matrix $\nabla(\nabla J)(\theta)$ is the $j$-th partial derivative of the $i$-th component function of $\nabla J$, and so

$$
\left[ \nabla(\nabla J)(\theta) \right]_{ij} = \frac{\partial}{\partial \theta_j}\left(\frac{\partial J}{\partial \theta_i}\right)(\theta) = \frac{\partial^2 J}{\partial \theta_j \partial \theta_i}(\theta),
$$

which is exactly the $(i, j)$-th entry of the Hessian matrix (recall that the mixed partials are equal). Q.E.D.
The following important theorem expresses the relations between the first and second directional derivatives and the gradient vector and Hessian matrix.
Theorem 10.2 (Slopes, curvatures, and partial derivatives)
Let $J: \mathbb{R}^n \to \mathbb{R}$ be of class $C^2$, let $\theta \in \mathbb{R}^n$ be a point, and let $v \in \mathbb{R}^n$ be a unit vector. Then:

1. We have $J_v'(\theta) = v^T \nabla J(\theta)$.
2. We have $J_v''(\theta) = v^T \big(\operatorname{Hess} J(\theta)\big) v$.
So, the directional first derivative is obtained through an inner product with the gradient vector, while the directional second derivative is the quadratic form induced by the Hessian matrix.
Proof. The proofs are simple exercises using the multi-variable Chain Rule. Indeed, note that

$$
\frac{d}{dt} J(\theta + tv) = \nabla J(\theta + tv)^T v.
$$

Plugging in $t = 0$ yields the first statement. Differentiating once more gives

$$
\frac{d^2}{dt^2} J(\theta + tv) = \frac{d}{dt} \left[ v^T \nabla J(\theta + tv) \right] = v^T \big(\operatorname{Hess} J(\theta + tv)\big) v,
$$

where we used Theorem 10.1 to identify the gradient matrix of $\nabla J$ with the Hessian matrix. Plugging in $t = 0$ yields the second statement. Q.E.D.
Problem Prompt
Do problem 6 on the worksheet.
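Theorem 10.2 is also easy to check numerically using automatic differentiation. The sketch below is our own illustration on the same made-up quadratic as before: it computes the gradient and Hessian with PyTorch and evaluates the inner product and quadratic form, which match the finite-difference values computed earlier:

from torch.autograd.functional import hessian

def f(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2  # the same example C^2 function

theta = torch.tensor([1.0, 1.0], requires_grad=True)
v = torch.tensor([1.0, 0.0])  # a unit directional vector

grad = torch.autograd.grad(f(theta), theta)[0]  # gradient vector: (2, 20)
H = hessian(f, theta)                           # Hessian matrix: diag(2, 20)

print(torch.dot(v, grad))  # directional first derivative: tensor(2.)
print(v @ H @ v)           # directional second derivative: tensor(2.)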
When the directional vector $v$ is a standard basis vector $e_i$, the directional derivatives reduce to the familiar partial derivatives:

$$
J_{e_i}'(\theta) = \frac{\partial J}{\partial \theta_i}(\theta) \quad \text{and} \quad J_{e_i}''(\theta) = \frac{\partial^2 J}{\partial \theta_i^2}(\theta).
$$
For our purposes, the most important properties of the gradient vector $\nabla J(\theta)$ are these: it points in the direction of maximum rate of change, its negative points in the direction of minimum rate of change, and it is orthogonal to the level curves of $J$, i.e., the sets of points $\theta$ along which $J(\theta) = c$

for fixed $c \in \mathbb{R}$. To illustrate the last property, consider the contour plot of the function $f(x, y) = -4x^2 - y^2 + 15$ shown below:
Show code cell source
def f(x, y):
    return -4 * x ** 2 - y ** 2 + 15

grid = np.linspace(0, 2)
x, y = np.mgrid[-4:4:0.1, -4:4:0.1]
z = f(x, y)

def tangent_line(x):
    return (8 / 3.5) * (x - 1) - 1.75
plt.contour(x, y, z, levels=10, colors=blue, linestyles='solid')
plt.plot(grid, tangent_line(grid), color=magenta)
plt.arrow(1, -1.75, -8 / 10, 3.5 / 10, head_width=0.2, head_length=0.3, fc=magenta, ec=magenta, linewidth=2)
plt.gcf().set_size_inches(4, 4)
plt.ylim(-4, 4)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.tight_layout()
The vector in the figure is the gradient vector, and the magenta line is the tangent line to the contour that passes through the point at the tail of the gradient. Note that the gradient is orthogonal to this tangent line, and therefore also orthogonal to the contour.
In general, it should be intuitively clear from the equation

$$
J_v'(\theta) = v^T \nabla J(\theta)
$$

in Theorem 10.2 that the gradient is orthogonal to level surfaces. Indeed, if $v$ is a direction tangent to the level surface through $\theta$, then the rate of change of $J$ in the direction $v$ must vanish, since $J$ is (by definition) constant along its level surfaces. Thus $0 = J_v'(\theta) = v^T \nabla J(\theta)$, and so the gradient is orthogonal to all tangent vectors of the level surface.
Let’s return to the first two properties of the gradient vector mentioned above, that it points in the direction of maximum rate of change and its negative points in the direction of minimum rate of change. In informal treatments, these claims are often justified by an appeal to the "angle" $\phi$ between the gradient vector and the directional vector $v$. Taking this angle for granted, we have

$$
J_v'(\theta) = v^T \nabla J(\theta) = |\nabla J(\theta)| \cos{\phi}
$$

from Theorem 10.2. Since $-1 \leq \cos{\phi} \leq 1$, the directional derivative is maximized when $\phi = 0$ (when $v$ points in the same direction as the gradient) and minimized when $\phi = \pi$ (when $v$ points in the opposite direction). But the "angle" $\phi$ between vectors in dimensions $n > 3$ is itself defined via this very equation, so the informal argument is circular (more on this below). An honest proof instead goes through the Cauchy-Schwarz inequality:
Theorem 10.3 (Properties of gradient vectors)
Let $J: \mathbb{R}^n \to \mathbb{R}$ be of class $C^2$ and let $\theta \in \mathbb{R}^n$. Then:

1. The gradient vector $\nabla J(\theta)$ points in the direction of maximum rate of change.
2. The negative gradient vector $-\nabla J(\theta)$ points in the direction of minimum rate of change.
3. The gradient vector $\nabla J(\theta)$ is orthogonal to level surfaces.
Proof. We already offered a "proof" of the third statement above, so we need only prove the first two. For this, we recall the Cauchy-Schwarz inequality: for any two vectors $u, w \in \mathbb{R}^n$,

$$
|u^T w| \leq |u||w|, \tag{10.7}
$$

with equality if and only if $u$ and $w$ are linearly dependent.

The goal is then to identify the unit vectors $v$ that maximize the directional derivative $J_v'(\theta) = v^T \nabla J(\theta)$. By (10.7), we have the upper bound

$$
J_v'(\theta) = v^T \nabla J(\theta) \leq |v||\nabla J(\theta)| = |\nabla J(\theta)|.
$$

But if the derivative achieves the upper bound, then by the criterion for equality in the Cauchy-Schwarz inequality (10.7), there must be a nonzero scalar $c$ such that

$$
v = c \nabla J(\theta).
$$

But then

$$
1 = |v| = |c||\nabla J(\theta)|,
$$

and so $c = 1 / |\nabla J(\theta)|$ (the scalar must be positive, since the derivative is positive at the maximum). Hence the unique maximizing direction is $v = \nabla J(\theta) / |\nabla J(\theta)|$, the direction of the gradient itself, and the maximum rate of change is $|\nabla J(\theta)|$. The second statement is proved similarly, with the minimizing direction being $v = -\nabla J(\theta) / |\nabla J(\theta)|$. Q.E.D.
Problem Prompt
Do problem 7 on the worksheet.
Now, let’s return to the "angle" $\phi$ between two nonzero vectors $u, w \in \mathbb{R}^n$. By the Cauchy-Schwarz inequality (10.7), the ratio

$$
\frac{u^T w}{|u||w|}
$$

lies in the interval $[-1, 1]$. We then define the angle $\phi$ between $u$ and $w$ to be the unique number $\phi \in [0, \pi]$ such that

$$
\cos{\phi} = \frac{u^T w}{|u||w|}.
$$

Thus, the fact that the angle $\phi$ is even well-defined rests on the Cauchy-Schwarz inequality, which is why an appeal to angles cannot substitute for the proof given above.
Observe that the part in Theorem 10.3 about the negative gradient vector “pointing downhill” is the higher-dimensional version of the observation in Observation 10.1 regarding the negative derivative of a single-variable function. This property will be key to the general multi-variable gradient descent algorithm that we will discuss in Section 10.3 below.
Let’s now turn toward extremizers of a multi-variable function $J: \mathbb{R}^n \to \mathbb{R}$ of class $C^2$. As in the single-variable case, vanishing of the gradient,

$$
\nabla J(\theta^\star) = 0,
$$

is a necessary condition for $\theta^\star$ to be a local extremizer; such points are again called stationary points. Does the Second Derivative Test carry over to higher dimensions, with the Hessian matrix playing the role of the second derivative as the carrier of local curvature information?

The answer is yes!
Theorem 10.4 (Second Derivative Test)
Let $J: \mathbb{R}^n \to \mathbb{R}$ be of class $C^2$ and let $\theta^\star \in \mathbb{R}^n$ be a stationary point.

1. If the Hessian matrix $\operatorname{Hess} J(\theta^\star)$ is positive definite, then $\theta^\star$ is a local minimizer.
2. If the Hessian matrix $\operatorname{Hess} J(\theta^\star)$ is negative definite, then $\theta^\star$ is a local maximizer.
For a proof of this result, see Theorem 13.10 in [Apo74]. Note also that if the Hessian matrix is either positive semidefinite or negative semidefinite everywhere, then every stationary point is a global extremizer; see Theorem 10.8 in the appendix.
Problem Prompt
Do problem 8 on the worksheet.
With infinitely many local (directional) curvatures at a point on the graph of a function $J: \mathbb{R}^n \to \mathbb{R}$, how do we get a handle on them all? It turns out that the extreme curvatures are exactly the extreme eigenvalues of the Hessian matrix, and they occur in the directions of the corresponding eigenvectors:
Theorem 10.5 (Eigenvalues, eigenvectors, and local curvature)
Let $J: \mathbb{R}^n \to \mathbb{R}$ be of class $C^2$, let $\theta \in \mathbb{R}^n$, and let $\lambda_{\text{max}}$ and $\lambda_{\text{min}}$ be the largest and smallest eigenvalues of the Hessian matrix $\operatorname{Hess} J(\theta)$.

Then:

1. The directional curvature $J_v''(\theta)$ is maximized exactly when the unit vector $v$ lies in the eigenspace of $\lambda_{\text{max}}$, in which case $J_v''(\theta) = \lambda_{\text{max}}$.
2. The directional curvature $J_v''(\theta)$ is minimized exactly when the unit vector $v$ lies in the eigenspace of $\lambda_{\text{min}}$, in which case $J_v''(\theta) = \lambda_{\text{min}}$.
Proof. Let $e_1, \ldots, e_n$ be an orthonormal basis of eigenvectors of the (symmetric) Hessian matrix, with

$$
\big(\operatorname{Hess} J(\theta)\big) e_i = \lambda_i e_i
$$

for each $i$, where the eigenvalues are ordered $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$, so that $\lambda_{\text{max}} = \lambda_1$ and $\lambda_{\text{min}} = \lambda_n$. Given a unit vector $v$, write

$$
v = c_1 e_1 + \cdots + c_n e_n
$$

and note that

$$
c_1^2 + \cdots + c_n^2 = 1. \tag{10.9}
$$

But then

$$
J_v''(\theta) = v^T \big(\operatorname{Hess} J(\theta)\big) v = \lambda_1 c_1^2 + \cdots + \lambda_n c_n^2,
$$

where the first equality follows from Theorem 10.2. By (10.9), this last expression is a weighted average of the eigenvalues, and hence

$$
\lambda_n \leq J_v''(\theta) \leq \lambda_1.
$$

The upper bound is achieved exactly when $c_i = 0$ for every $i$ with $\lambda_i < \lambda_1$, i.e., exactly when $v$ lies in the eigenspace of $\lambda_{\text{max}}$, in which case $J_v''(\theta) = \lambda_{\text{max}}$. The claim about the minimum is proved similarly. Q.E.D.
Problem Prompt
Do problem 9 on the worksheet.
If the Hessian matrix is positive definite, then its extreme eigenvalues are exactly the extreme local (directional) curvatures. The ratio of the largest curvature to the smallest should then convey the “variance” or the “range” of these curvatures. This ratio has a name:
Definition 10.4
Let $A$ be an $n \times n$ square matrix.

1. The spectrum of $A$, denoted $\sigma(A)$, is the set of eigenvalues of $A$.
2. The spectral radius of $A$, denoted $\rho(A)$, is given by

$$
\rho(A) = \max_{\lambda \in \sigma(A)} |\lambda|.
$$

3. If $A$ is positive definite, the condition number of $A$, denoted $\kappa(A)$, is the ratio

$$
\kappa(A) = \frac{\lambda_{\text{max}}}{\lambda_{\text{min}}}
$$

of the largest eigenvalue of $A$ to the smallest.
This definition of condition number applies only in the case that $A$ is positive definite, so that its eigenvalues are real and positive. There is a more general definition that applies to arbitrary matrices, but we will not need it.
Problem Prompt
Do problem 10 on the worksheet.
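For instance, here is a quick computation of the spectrum, spectral radius, and condition number of a (made-up) positive definite matrix using NumPy:

A = np.array([[22.0, 0.0],
              [0.0, 220.0]])  # positive definite: diagonal with positive entries

spectrum = np.linalg.eigvalsh(A)               # eigenvalues in ascending order: [22., 220.]
spectral_radius = np.abs(spectrum).max()       # 220.0
condition_number = spectrum[-1] / spectrum[0]  # 220 / 22 = 10.0
print(spectrum, spectral_radius, condition_number)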
Intuitively, when the condition number of a positive definite Hessian matrix is large (in which case the Hessian matrix is called ill-conditioned), the curvatures vary widely as we look in all different directions; conversely, when the condition number is near $1$, the directional curvatures are all nearly equal, and the graph curves in roughly the same way in every direction.
10.3. Gradient descent in multiple variables
With the gradient vector taking the place of the derivative, it is easy to generalize the single-variable gradient descent algorithm from Algorithm 10.2 to multiple variables:
Algorithm 10.3 (Multi-variable gradient descent with learning rate decay)
Input: A differentiable function $J: \mathbb{R}^n \to \mathbb{R}$, an initial guess $\theta := \theta_0 \in \mathbb{R}^n$ for a local minimizer, a learning rate $\alpha > 0$, a decay rate $\beta \in [0, 1)$, and the number $N$ of gradient steps.

Output: An approximation to a local minimizer $\theta^\star$.

For $t$ from $1$ to $N$, do: $\theta := \theta - \alpha(1 - \beta)^t \nabla J(\theta)$.

Return $\theta$.
Just like the single-variable version, in practice it is convenient to track the entire sequence of iterates $\theta_0, \theta_1, \ldots, \theta_N$ produced by the algorithm, along with the objective values $J(\theta_t)$, in order to monitor convergence.
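A minimal multi-variable version of our earlier sketch simply swaps the derivative for the gradient, which we may compute with PyTorch's autograd so that only $J$ itself needs to be supplied. (As before, this simplified function is ours, with the exponential decay schedule assumed; the chapter's figures use the imported GD implementation.)

def multivariable_gradient_descent(J, theta, alpha, beta, num_steps):
    """A minimal sketch of multi-variable gradient descent with decay."""
    thetas = [theta.clone()]
    for t in range(1, num_steps + 1):
        theta = theta.detach().requires_grad_(True)
        grad = torch.autograd.grad(J(theta), theta)[0]  # gradient vector at theta
        with torch.no_grad():
            theta = theta - alpha * (1 - beta) ** t * grad  # the update rule
        thetas.append(theta.clone())
    return thetas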
For an example, let’s consider the polynomial objective function

$$
J(\theta_1, \theta_2) = (\theta_1^2 + 10\theta_2^2)\big((\theta_1 - 1)^2 + 10(\theta_2 - 1)^2\big)
$$

in two dimensions. Its graph looks like

while its contour plot is
Show code cell source
def J(theta):
    theta1 = theta[0]
    theta2 = theta[1]
    return (theta1 ** 2 + 10 * theta2 ** 2) * ((theta1 - 1) ** 2 + 10 * (theta2 - 1) ** 2)
x, y = np.mgrid[-0.50:1.5:0.01, -0.3:1.3:0.01]
grid = np.dstack((x, y))
z = np.apply_along_axis(J, axis=-1, arr=grid)
plt.contour(x, y, z, levels=range(11), colors=blue)
plt.xlabel('$\\theta_1$')
plt.ylabel('$\\theta_2$')
plt.gcf().set_size_inches(w=5, h=4)
plt.tight_layout()
The function has two (global) minimizers at

$$
\theta^\star = (0, 0) \quad \text{and} \quad \theta^\star = (1, 1),
$$

where $J$ vanishes, as well as a "saddle point" at $(0.5, 0.5)$, where the gradient vanishes but which is neither a local minimizer nor a local maximizer. Let's run the gradient descent algorithm four times, beginning from four different initial guesses:
Show code cell source
gd_parameters = {'init_parameters': [torch.tensor([0.25, 0.9]),
torch.tensor([0.25, 1]),
torch.tensor([0.75, 1.2]),
torch.tensor([0.5, 0.49])]}
alpha = 1e-2
beta = 0
N = 20
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9))
for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(J=J,
                   lr=alpha,
                   num_steps=N,
                   decay_rate=beta,
                   **gd_parameters_slice)
    thetas = gd_output.parameters['theta']
    axis.contour(x, y, z, levels=range(11), colors=blue, alpha=0.5)
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(x=thetas[0, 0], y=thetas[0, 1], s=100, color=magenta, zorder=2)
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
    axis.set_title(f'$\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
fig.suptitle('first runs of gradient descent')
plt.tight_layout()
The large magenta dots in the plots indicate the initial guesses $\theta_0$, while the smaller dots indicate the subsequent iterates $\theta_t$. Notice that the runs tend to oscillate back and forth across the contours as they approach the minimizers.

Let's carry the runs out further, to $N = 125$ gradient steps, and plot the objective values against the gradient steps:
Show code cell source
gd_parameters = {'init_parameters': [torch.tensor([0.25, 0.9]),
torch.tensor([0.25, 1]),
torch.tensor([0.75, 1.2]),
torch.tensor([0.5, 0.49])]}
alpha = 1e-2
beta = 0
N = 125
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 6), sharey=True)
for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(J=J,
                   lr=alpha,
                   num_steps=N,
                   decay_rate=beta,
                   **gd_parameters_slice)
    plot_gd(gd_output=gd_output, ylabel='objective', ax=axis, plot_title=False, per_step_alpha=1)
fig.suptitle('first runs of gradient descent')
plt.tight_layout()
The oscillatory nature of the runs is even more clear. In particular, we now see that even the first run, which appeared to be converging, begins to oscillate after about 40 gradient steps. Let's take a closer look at the trace of this first run, and also plot the components $\theta_1$ and $\theta_2$ of the iterates against the number of gradient steps:
Show code cell source
theta0 = torch.tensor([0.25, 0.9])
alpha = 1e-2
beta = 0
N = 125
gd_output = GD(J=J, init_parameters=theta0, lr=alpha, num_steps=N, decay_rate=beta)
thetas = gd_output.parameters['theta']
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))
axes[0].contour(x, y, z, levels=range(11), colors=blue, alpha=0.5)
axes[0].plot(thetas[:, 0], thetas[:, 1], color=magenta)
axes[0].scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
axes[0].scatter(x=thetas[0, 0], y=thetas[0, 1], s=100, color=magenta, zorder=2)
axes[0].set_xlabel('$\\theta_1$')
axes[0].set_ylabel('$\\theta_2$')
axes[0].set_xlim(0.2, 1.2)
axes[0].set_ylim(0.8, 1.15)
axes[1].plot(range(len(gd_output.per_step_objectives)), thetas[:, 0], label='$\\theta_1$')
axes[1].plot(range(len(gd_output.per_step_objectives)), thetas[:, 1], alpha=0.3, label='$\\theta_2$')
axes[1].set_xlabel('gradient steps')
axes[1].set_ylim(0.8, 1.1)
axes[1].legend()
fig.suptitle(f'$\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
plt.tight_layout()
Both plots make clear that the oscillations occur mostly along the semi-minor axes of the elliptical contours, which coincides with the direction of largest curvature. From the previous section, we know that local curvatures are encoded in the Hessian matrix. This suggests that studying the Hessian matrix might lead to insights into the convergence properties of gradient descent. But first, to prepare for this general study, let’s do:
Problem Prompt
Do problem 11 on the worksheet.
To begin the theoretical study of convergence of gradient descent, let's start more generally with any function $J: \mathbb{R}^n \to \mathbb{R}$ of class $C^2$, a stationary point $\theta^\star$, and the degree-$2$ Taylor polynomial of $J$ based at $\theta^\star$:

$$
J(\theta) \approx J(\theta^\star) + \frac{1}{2}(\theta - \theta^\star)^T H (\theta - \theta^\star), \qquad H = \operatorname{Hess} J(\theta^\star)
$$

(the first-order term is absent since the gradient vanishes at $\theta^\star$).

If we believe that the Taylor polynomial accurately reflects the local geometry of the graph of $J$ near $\theta^\star$, we may as well study the behavior of gradient descent on the Taylor polynomial itself. Its gradient is

$$
\nabla J(\theta) = H(\theta - \theta^\star),
$$

where we use the symmetry of $H$.

Assuming that the decay rate is $\beta = 0$, the update rule becomes

$$
\theta_t = \theta_{t-1} - \alpha H (\theta_{t-1} - \theta^\star)
$$

for all $t \geq 1$. Subtracting $\theta^\star$ from both sides turns this into

$$
\theta_t - \theta^\star = (I - \alpha H)(\theta_{t-1} - \theta^\star),
$$

which leads us to the closed form

$$
\theta_t - \theta^\star = (I - \alpha H)^t (\theta_0 - \theta^\star)
$$

for all $t \geq 1$, where $I$ is the identity matrix.
Now, let’s suppose that the Hessian matrix $H$ is positive definite, with eigenvalues

$$
\lambda_{\text{max}} = \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n = \lambda_{\text{min}} > 0.
$$

Then the eigenvalues of the matrix $I - \alpha H$ are the numbers $1 - \alpha \lambda_i$. The eigenvectors $e_1, \ldots, e_n$ of $H$ may be chosen to form an orthonormal basis of $\mathbb{R}^n$, so that we may write

$$
\theta_0 - \theta^\star = c_1 e_1 + \cdots + c_n e_n
$$

for some scalars $c_1, \ldots, c_n$. The closed form above then becomes

$$
\theta_t - \theta^\star = c_1 (1 - \alpha \lambda_1)^t e_1 + \cdots + c_n (1 - \alpha \lambda_n)^t e_n \tag{10.13}
$$

for all $t \geq 1$.
For the particular run of the algorithm shown in the last plot, the iterates approach the minimizer $\theta^\star = (1, 1)$, where the Hessian matrix works out to be

$$
H = \begin{bmatrix} 22 & 0 \\ 0 & 220 \end{bmatrix},
$$

with eigenvalues $\lambda_1 = 220$ and $\lambda_2 = 22$ and eigenvectors along the coordinate axes, so that (10.13) is

$$
\theta_t - \theta^\star = c_1 (1 - 220\alpha)^t e_1 + c_2 (1 - 22\alpha)^t e_2.
$$

Since $\alpha = 0.01$, the factor $1 - 220\alpha = -1.2$ is negative, and its powers alternate in sign; this explains the oscillations in the direction of largest curvature that we observed in the plots. To guarantee convergence, we should choose the learning rate $\alpha$ so that all the factors $1 - \alpha \lambda_i$ are small in magnitude.

To do this, we apply the triangle inequality to the right-hand side of (10.13) to obtain the upper bound

$$
|\theta_t - \theta^\star| \leq \rho(I - \alpha H)^t \big( |c_1| + \cdots + |c_n| \big)
$$

for all $t \geq 1$, where

$$
\rho(I - \alpha H) = \max_{1 \leq i \leq n} |1 - \alpha \lambda_i|
$$

is the spectral radius of the matrix $I - \alpha H$. So, the goal is to choose $\alpha$ to minimize this spectral radius. A little algebra shows that the minimum occurs at

$$
\alpha = \frac{2}{\lambda_{\text{min}} + \lambda_{\text{max}}},
$$

in which case we get

$$
\rho(I - \alpha H) = \frac{\lambda_{\text{max}} - \lambda_{\text{min}}}{\lambda_{\text{max}} + \lambda_{\text{min}}} = \frac{\kappa - 1}{\kappa + 1},
$$

where $\kappa = \kappa(H)$ is the condition number. Then, with this choice of $\alpha$, the bound above becomes

$$
|\theta_t - \theta^\star| \leq \left( \frac{\kappa - 1}{\kappa + 1} \right)^t \big( |c_1| + \cdots + |c_n| \big),
$$

where the right-hand side decreases to $0$ exponentially fast as $t \to \infty$, since $(\kappa - 1)/(\kappa + 1) < 1$.

This shows that the fastest rates of convergence guaranteed by our arguments are obtained for those objective functions that have Hessian matrices at local minimizers with condition numbers near $1$.
This argument easily adapts to the case that the objective function is exactly quadratic, in which case the Taylor polynomial is not merely an approximation and we obtain an honest theorem:
Theorem 10.6 (Curvature and gradient descent)
Let $J: \mathbb{R}^n \to \mathbb{R}$ be a quadratic objective function of the form

$$
J(\theta) = c + \frac{1}{2}(\theta - \theta^\star)^T H (\theta - \theta^\star),
$$

where $H$ is a (symmetric) positive definite matrix with condition number $\kappa$, and suppose gradient descent (with decay rate $\beta = 0$) is run with the learning rate

$$
\alpha = \frac{2}{\lambda_{\text{min}} + \lambda_{\text{max}}}.
$$

Then the sequence of iterates $\theta_0, \theta_1, \theta_2, \ldots$

converges to the global minimizer $\theta^\star$ exponentially fast, with

$$
|\theta_t - \theta^\star| \leq \left( \frac{\kappa - 1}{\kappa + 1} \right)^t \big( |c_1| + \cdots + |c_n| \big)
$$

for each $t \geq 1$, where the $c_i$ are the components of $\theta_0 - \theta^\star$ with respect to an orthonormal basis of eigenvectors of $H$.
The ratio $(\kappa - 1)/(\kappa + 1)$ controls the rate of convergence: it is near $0$ when $\kappa$ is near $1$, yielding fast convergence, and it increases toward $1$ as $\kappa \to \infty$, so that ill-conditioned Hessian matrices yield slow (guaranteed) convergence.
Of course, in order to obtain the exponentially quick convergence guaranteed by the theorem for a non-quadratic objective, one needs to place the initial guess $\theta_0$ "sufficiently close" to the minimizer that the quadratic Taylor approximation is accurate, and one needs knowledge of the eigenvalues of the Hessian matrix in order to select the optimal learning rate.
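We can watch the guarantee of Theorem 10.6 in action on a small made-up quadratic with Hessian $\operatorname{diag}(1, 10)$, so that $\kappa = 10$ and the guaranteed rate is $(\kappa - 1)/(\kappa + 1) = 9/11 \approx 0.82$ (the following quick check is our own illustration):

H = np.diag([1.0, 10.0])  # Hessian of the quadratic J(theta) = 0.5 * theta^T H theta
lam_min, lam_max = 1.0, 10.0
alpha = 2 / (lam_min + lam_max)  # the optimal learning rate from the theorem
rate = (lam_max - lam_min) / (lam_max + lam_min)  # = 9/11

theta = np.array([1.0, 1.0])  # theta_0; the minimizer is the origin
for t in range(1, 6):
    theta = theta - alpha * H @ theta  # one gradient step on the quadratic
    print(np.linalg.norm(theta), rate ** t * np.sqrt(2))  # the two columns agree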
For our polynomial objective $J$ above, the Hessian matrix at the minimizer $\theta^\star = (1, 1)$ is $\operatorname{diag}(22, 220)$, so the optimal learning rate according to the theory is

$$
\alpha = \frac{2}{22 + 220} \approx 0.008.
$$

Let's re-run the algorithm with this new, smaller learning rate:
Show code cell source
gd_parameters = {'init_parameters': [torch.tensor([0.25, 0.9]),
torch.tensor([0.25, 1]),
torch.tensor([0.75, 1.2]),
torch.tensor([0.5, 0.49])]}
alpha = 8e-3
beta = 0
N = 40
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9))
for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(J=J,
                   lr=alpha,
                   num_steps=N,
                   decay_rate=beta,
                   **gd_parameters_slice)
    thetas = gd_output.parameters['theta']
    axis.contour(x, y, z, levels=range(11), colors=blue, alpha=0.5)
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(x=thetas[0, 0], y=thetas[0, 1], s=100, color=magenta, zorder=2)
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
    axis.set_title(f'$\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
fig.suptitle('second runs of gradient descent with smaller learning rates')
plt.tight_layout()
Just like magic, the undesirable oscillations have vanished. We take a closer look at the third run of the algorithm, in the bottom left of the figure:
Show code cell source
theta0 = torch.tensor([0.75, 1.2])
alpha = 8e-3
beta = 0
N = 125
gd_output = GD(J=J, init_parameters=theta0, lr=alpha, num_steps=N, decay_rate=beta)
thetas = gd_output.parameters['theta']
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))
axes[0].contour(x, y, z, levels=range(11), colors=blue, alpha=0.5)
axes[0].plot(thetas[:, 0], thetas[:, 1], color=magenta)
axes[0].scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
axes[0].scatter(x=thetas[0, 0], y=thetas[0, 1], s=100, color=magenta, zorder=2)
axes[0].set_xlabel('$\\theta_1$')
axes[0].set_ylabel('$\\theta_2$')
axes[0].set_xlim(0.2, 1.2)
axes[0].set_ylim(0.6, 1.25)
axes[1].plot(range(len(gd_output.per_step_objectives)), thetas[:, 0], label='$\\theta_1$')
axes[1].plot(range(len(gd_output.per_step_objectives)), thetas[:, 1], alpha=0.5, label='$\\theta_2$')
axes[1].set_xlabel('gradient steps')
axes[1].set_ylim(0.9, 1.1)
axes[1].legend()
fig.suptitle(f'$\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
plt.tight_layout()
We may also dampen the oscillations and keep the original (relatively large) learning rate $\alpha = 0.01$ by adding a slight learning rate decay at $\beta = 0.05$:
Show code cell source
gd_parameters = {'init_parameters': [torch.tensor([0.25, 0.9]),
torch.tensor([0.25, 1]),
torch.tensor([0.75, 1.2]),
torch.tensor([0.5, 0.49])]}
alpha = 1e-2
N = 40
beta = 0.05
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9))
for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(J=J,
                   lr=alpha,
                   num_steps=N,
                   decay_rate=beta,
                   **gd_parameters_slice)
    thetas = gd_output.parameters['theta']
    axis.contour(x, y, z, levels=range(11), colors=blue, alpha=0.5)
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(x=thetas[0, 0], y=thetas[0, 1], s=100, color=magenta, zorder=2)
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
    axis.set_title(f'$\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
fig.suptitle('third runs of gradient descent with learning rate decay')
plt.tight_layout()
Here are the values of the objective function for these last runs with learning rate decay, plotted against the number of gradient steps:
Show code cell source
gd_parameters = {'init_parameters': [torch.tensor([0.25, 0.9]),
torch.tensor([0.25, 1]),
torch.tensor([0.75, 1.2]),
torch.tensor([0.5, 0.49])]}
alpha = 1e-2
N = 40
beta = 0.05
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 6), sharey=True)
for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(J=J,
                   lr=alpha,
                   num_steps=N,
                   decay_rate=beta,
                   **gd_parameters_slice)
    plot_gd(gd_output=gd_output, ylabel='objective', ax=axis, plot_title=False, per_step_alpha=1)
fig.suptitle('third runs of gradient descent with learning rate decay')
plt.tight_layout()
Notice the initial "overshoot" in the plot in the bottom left, causing the objective function to increase before eventually decreasing toward its minimum value—a product of the relatively large initial learning rate.
Of course, an objective function encountered in the wild will rarely be as simple as our toy polynomial, and it need not be well approximated by a quadratic except very near a minimizer; nevertheless, the lessons about curvature, condition numbers, and learning rates carry over and guide the choice of parameters in practice.
10.4. Stochastic gradient descent
Many of the objective functions that we will see in Chapter 12 have a special form making them amenable to optimization via a variation of the gradient descent algorithm called stochastic gradient descent. These special objective functions are called stochastic objective functions, which look like

$$
J(\theta) = \frac{1}{m} \sum_{i=1}^m L(\theta, x_i), \tag{10.17}
$$

where $L(\theta, x)$ is a differentiable loss function and $x_1, \ldots, x_m$ is an observed dataset,

so that the objective value $J(\theta)$ is the mean loss over the dataset.

By linearity of the gradient operation, we have

$$
\nabla J(\theta) = \frac{1}{m} \sum_{i=1}^m \nabla L(\theta, x_i), \tag{10.18}
$$

where we write $\nabla$ for the gradient with respect to the parameter vector $\theta$.
Definition 10.5
The batch gradient descent algorithm is the gradient descent algorithm applied to a stochastic objective function of the form (10.17).
Let’s take a look at a simple example. Suppose that we define the loss function

$$
L(\theta, x) = \frac{1}{2} |x - \theta|^2,
$$

where $\theta, x \in \mathbb{R}^2$, and for the dataset we generate $m = 1024$ points drawn from a bivariate standard normal distribution:
Show code cell source
torch.manual_seed(42)
X = MultivariateNormal(loc=torch.zeros(2), covariance_matrix=torch.eye(2)).sample(sample_shape=(1024,))
sns.scatterplot(x=X[:, 0], y=X[:, 1])
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
Before running the algorithm, let’s do:
Problem Prompt
Do problem 12 on the worksheet.
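It is also worth noting that, for this particular loss, the minimizer of the objective can be found in closed form: setting the gradient $\nabla J(\theta) = \frac{1}{m}\sum_{i=1}^m (\theta - x_i)$ equal to zero gives $\theta^\star = \bar{x}$, the sample mean. A quick computation (our own check) shows the value toward which the runs below should converge:

theta_star = X.mean(dim=0)  # the exact minimizer: the sample mean of the dataset
print(theta_star)           # close to (0, 0), since the data are standard normal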
Now, two runs of the batch gradient descent algorithm produce the following plots of the objective function versus gradient steps:
Show code cell source
# define the loss function
def L(theta, x):
    return 0.5 * torch.linalg.norm(x - theta, dim=1) ** 2

# define the objective function
def J(theta):
    return L(theta, X).mean()
gd_parameters = {'num_steps': [30, 100],
'lr': [1e-1, 3e-2]}
theta0 = torch.tensor([1.5, 1.5])
beta = 0
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(7, 3), sharey=True)
for i, axis in enumerate(axes):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(**gd_parameters_slice, J=J, decay_rate=beta, init_parameters=theta0)
    plot_gd(gd_output=gd_output, ylabel='objective', ax=axis, plot_title=False, per_step_alpha=1)
fig.suptitle('batch gradient descent')
plt.tight_layout()
If we track the parameters
Show code cell source
x, y = np.mgrid[-2:2:0.05, -2:2:0.05]
grid = np.dstack((x, y))
z = np.apply_along_axis(J, axis=-1, arr=grid)
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(9, 4.5), sharey=True)
for i, axis in enumerate(axes):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(**gd_parameters_slice, J=J, decay_rate=beta, init_parameters=theta0)
    thetas = gd_output.parameters['theta']
    alpha = gd_parameters_slice['lr']
    N = gd_parameters_slice['num_steps']
    axis.contour(x, y, z, colors=blue, alpha=0.5, levels=np.arange(0, 10, 0.5))
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(thetas[0, 0], thetas[0, 1], s=100, color=magenta, zorder=2)
    axis.set_title(f'$\\alpha={alpha}$, $\\beta={beta}$, gradient steps$={N}$')
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
fig.suptitle('batch gradient descent')
plt.tight_layout()
In both cases, notice that the algorithm is nicely converging toward the minimizer at $\theta^\star = \bar{x} \approx (0, 0)$, the sample mean of the dataset.
One of the drawbacks of the batch algorithm is that it needs the entire dataset in order to take just a single gradient step. This isn't an issue for our small toy dataset of size $m = 1024$, but for the massive datasets encountered in real-world machine learning, recomputing the full-dataset gradient (10.18) at every step may be prohibitively expensive.
One method for dealing with this bottleneck is to use mini-batches of the data to compute gradient steps. To do so, we begin by randomly partitioning the dataset into subsets

$$
B_1, B_2, \ldots, B_p, \tag{10.20}
$$

called mini-batches, each of a fixed size $k$ (except possibly the last one, if $k$ does not evenly divide $m$).
We would then expect from (10.17) and (10.18) that

$$
\frac{1}{|B_j|} \sum_{x \in B_j} L(\theta, x) \approx J(\theta)
$$

and

$$
\frac{1}{|B_j|} \sum_{x \in B_j} \nabla L(\theta, x) \approx \nabla J(\theta) \tag{10.21}
$$

for each $j = 1, 2, \ldots, p$, at least when the mini-batch size $k$ is not too small.
The mini-batch version of the gradient descent algorithm loops over the mini-batches (10.20) and computes gradient steps using the approximation in (10.21). A single loop through all the mini-batches, covering the entire dataset, is called an epoch. This new version of the algorithm takes the number of epochs $N$ as a parameter, rather than the total number of gradient steps:
Algorithm 10.4 (Stochastic gradient descent with learning rate decay)
Input: A dataset $x_1, \ldots, x_m$, a differentiable loss function $L(\theta, x)$, an initial guess $\theta := \theta_0$ for a minimizer, a learning rate $\alpha > 0$, a decay rate $\beta \in [0, 1)$, a mini-batch size $k$, and the number $N$ of epochs.

Output: An approximation to a minimizer $\theta^\star$.

$s := 0$

For $t$ from $1$ to $N$, do:

Randomly partition the dataset into mini-batches $B_1, B_2, \ldots, B_p$ of size $k$.

For each mini-batch $B_j$, do:

$$
\theta := \theta - \alpha(1 - \beta)^s \frac{1}{|B_j|} \sum_{x \in B_j} \nabla L(\theta, x), \qquad s := s + 1.
$$

Return $\theta$.
Notice the auxiliary variable $s$, which counts the total number of gradient steps taken across all epochs; it is this count (rather than the epoch number) that drives the learning rate decay.
As in the versions of gradient descent explored above, in practice it is convenient to code the algorithm so that it returns the entire sequence of iterates produced in the loops, not just the final one, along with an associated sequence of objective values, one per gradient step. However, rather than have the algorithm output the exact objective values $J(\theta)$—which would require a pass through the entire dataset at every step—it is cheaper to output the approximate (per-step) objective values

$$
\frac{1}{|B_j|} \sum_{x \in B_j} L(\theta, x),
$$

where $B_j$ is the current mini-batch. As we will see below, the mean of these per-step values over an epoch then gives a useful estimate of the trend of the true objective.
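Here is a minimal sketch of the mini-batch loop. (The simplified sgd function below is ours, again with the exponential decay schedule assumed; the figures that follow use the fuller SGD function imported at the top of the chapter.)

def sgd(L, X, theta, alpha, beta, batch_size, num_epochs):
    """A minimal sketch of stochastic gradient descent over a dataset X."""
    m = X.shape[0]
    s = 0  # counts gradient steps across all epochs
    for _ in range(num_epochs):
        for idx in torch.randperm(m).split(batch_size):  # random mini-batches
            theta = theta.detach().requires_grad_(True)
            batch_objective = L(theta, X[idx]).mean()  # approximates J(theta)
            grad = torch.autograd.grad(batch_objective, theta)[0]
            with torch.no_grad():
                theta = theta - alpha * (1 - beta) ** s * grad
            s += 1
    return theta.detach()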
But first, we note that it is possible to select a mini-batch size of $k = 1$, so that each gradient step is computed from a single data point; this extreme case is sometimes what is meant by stochastic gradient descent in the narrow sense. Let's run the algorithm with $k = 1$ and various settings of the learning rate and number of steps:
Show code cell source
gd_parameters = {'lr': [1e-1, 1e-1, 3e-2, 3e-2],
'max_steps': [40, 100, 80, 160]}
beta = 0
theta0 = torch.tensor([1.5, 1.5])
k = 1
N = 1
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 6), sharey=True)
for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = SGD(**gd_parameters_slice,
                    L=L,
                    X=X,
                    init_parameters=theta0,
                    batch_size=k,
                    decay_rate=beta,
                    num_epochs=N,
                    random_state=42)
    plot_gd(gd_output, plot_title=False, ax=axis, show_epoch=False, per_step_alpha=1, legend=False, ylabel='objective')
fig.suptitle(f'stochastic gradient descent with $k=1$')
plt.tight_layout()
Note that we have halted the runs early, before the algorithm has had a chance to make it through even one epoch. The plots are very noisy, though a slight downward trend in objective values is detectable. The trace of the algorithm through parameter space is shown in:
Show code cell source
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9))
for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = SGD(**gd_parameters_slice,
                    L=L,
                    X=X,
                    init_parameters=theta0,
                    batch_size=k,
                    decay_rate=beta,
                    num_epochs=N,
                    random_state=42)
    alpha = gd_parameters_slice['lr']
    max_steps = gd_parameters_slice['max_steps']
    thetas = gd_output.parameters['theta']
    axis.contour(x, y, z, levels=np.arange(0, 10, 0.5), colors=blue, alpha=0.5)
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(thetas[0, 0], thetas[0, 1], s=100, color=magenta, zorder=2)
    axis.set_title(f'$k={k}$, $\\alpha={alpha}$, $\\beta={beta}$, $N={N}$,\n gradient steps$={max_steps}$')
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
fig.suptitle(f'stochastic gradient descent with $k=1$')
plt.tight_layout()
The traces are also very noisy, especially in the first row with the larger learning rate $\alpha = 0.1$. Let's try larger mini-batch sizes and see whether that tames the noise:
Show code cell source
gd_parameters = {'num_epochs': [1, 1, 1, 3],
'batch_size': [2, 8, 32, 128],
'max_steps': [60, 30, 24, 24]}
alpha = 1e-1
beta = 0
theta0 = torch.tensor([1.5, 1.5])
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 6), sharey=True)
for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = SGD(**gd_parameters_slice,
                    L=L,
                    X=X,
                    lr=alpha,
                    init_parameters=theta0,
                    decay_rate=beta,
                    random_state=42)
    k = gd_parameters_slice['batch_size']
    N = gd_parameters_slice['num_epochs']
    plot_gd(gd_output, plot_title=False, ax=axis, show_epoch=False, per_step_alpha=1, legend=False, ylabel='objective', per_step_label='objective per step')
    axis.scatter(gd_output.epoch_step_nums, gd_output.per_step_objectives[gd_output.epoch_step_nums], color=magenta, s=50, zorder=2, label='epoch')
    axis.set_xlabel('gradient steps')
    axis.set_ylabel('objective')
    axis.set_title(f'$k={k}$, $\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
    axis.legend()
fig.suptitle('mini-batch gradient descent')
plt.tight_layout()
The runs with batch sizes $k = 2$, $8$, and $32$ were halted before completing a single epoch, while the run with batch size $k = 128$ was allowed to complete three full epochs (the large magenta dots mark the completions of epochs). Notice that the noise in the objective values diminishes as the batch size grows. The traces through parameter space are shown in:
Show code cell source
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9))
for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = SGD(**gd_parameters_slice,
                    L=L,
                    X=X,
                    lr=alpha,
                    decay_rate=beta,
                    init_parameters=theta0,
                    random_state=42)
    k = gd_parameters_slice['batch_size']
    max_steps = gd_parameters_slice['max_steps']
    N = gd_parameters_slice['num_epochs']
    thetas = gd_output.parameters['theta']
    axis.contour(x, y, z, levels=np.arange(0, 10, 0.5), colors=blue, alpha=0.5)
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(thetas[0, 0], thetas[0, 1], s=100, color=magenta, zorder=2)
    axis.set_title(f'$k={k}$, $\\alpha={alpha}$, $\\beta={beta}$, $N={N}$,\n gradient steps$={max_steps}$')
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
fig.suptitle('mini-batch gradient descent')
plt.tight_layout()
As we mentioned above, it may be helpful to track the mean objective values per epoch to help see the general trend of the objective through the noise. To illustrate this point, let’s suppose that we continue our runs of mini-batch gradient descent so that they all complete six epochs. If we plot both the objective per gradient step and the mean objective per epoch, we get the following plots:
Show code cell source
gd_parameters = {'batch_size': [2, 8, 32, 128]}
N = 6
alpha = 1e-1
beta = 0
theta0 = torch.tensor([1.5, 1.5])
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 6), sharey=True)
for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = SGD(**gd_parameters_slice,
                    L=L,
                    X=X,
                    lr=alpha,
                    init_parameters=theta0,
                    num_epochs=N,
                    decay_rate=beta,
                    random_state=42)
    plot_gd(gd_output,
            plot_title=False,
            ax=axis,
            per_step_alpha=0.25,
            legend=True,
            ylabel='objective',
            per_epoch_color=magenta,
            per_step_label='objective per step',
            per_epoch_label='mean objective per epoch')
fig.suptitle('mini-batch gradient descent')
plt.tight_layout()
10.5. Appendix: convex functions
We begin with the definition of a convex function. Notice that it does not require that the function have any special properties, like differentiability.
Definition 10.6
We shall say a function $J: \mathbb{R}^n \to \mathbb{R}$ is convex provided that

$$
J\big((1 - t)a + tb\big) \leq (1 - t)J(a) + tJ(b)
$$

for all $a, b \in \mathbb{R}^n$ and all $t \in [0, 1]$. If "$\leq$" may be replaced with "$<$" whenever $a \neq b$ and $t \in (0, 1)$, then the function is called strictly convex; the function is called (strictly) concave if $-J$ is (strictly) convex.
Essentially, the definition says that a function is convex provided that its graph "lies below its secant lines." If the function is twice-differentiable, then there are equivalent characterizations of convexity in terms of calculus:
Theorem 10.7 (Main theorem on convex functions (single-variable version))
Let $J: \mathbb{R} \to \mathbb{R}$ be a twice-differentiable function. The following statements are equivalent:

1. The function $J$ is convex.
2. For all numbers $a < c < b$, we have

$$
\frac{J(c) - J(a)}{c - a} \leq \frac{J(b) - J(c)}{b - c}.
$$

3. The graph of $J$ lies above its tangent lines, i.e., for all $a, \theta \in \mathbb{R}$, we have

$$
J(\theta) \geq J(a) + J'(a)(\theta - a).
$$

4. The second derivative is nonnegative everywhere, i.e., $J''(\theta) \geq 0$ for all $\theta \in \mathbb{R}$.

Moreover, if $\theta^\star$ is a stationary point of a convex function $J$, then $\theta^\star$ is a global minimizer.
Proof. We shall prove the statements as (1) $\Rightarrow$ (2) $\Rightarrow$ (3) $\Rightarrow$ (4) $\Rightarrow$ (1), followed by the final claim about stationary points.

(1) $\Rightarrow$ (2): Suppose $a < c < b$, and write $c = (1 - t)a + tb$ where $t = (c - a)/(b - a) \in (0, 1)$. Provided $J$ is convex, we have

$$
J(c) \leq (1 - t)J(a) + tJ(b),
$$

and so, subtracting $J(a)$ from both sides,

$$
J(c) - J(a) \leq t\big(J(b) - J(a)\big).
$$

Taking $t = (c - a)/(b - a)$ and dividing through by $c - a$ gives

$$
\frac{J(c) - J(a)}{c - a} \leq \frac{J(b) - J(a)}{b - a},
$$

and an entirely symmetric manipulation (subtracting $J(b)$ instead) gives

$$
\frac{J(b) - J(a)}{b - a} \leq \frac{J(b) - J(c)}{b - c}.
$$

Chaining the two inequalities together yields (2), as desired.

(2) $\Rightarrow$ (3): Fix $a$ and $\theta$, and suppose first that $a < \theta$. For any $c$ with $a < c < \theta$, statement (2) applied to the triple $a < c < \theta$ gives

$$
\frac{J(c) - J(a)}{c - a} \leq \frac{J(\theta) - J(c)}{\theta - c}.
$$

Letting $c \to a^+$, the left-hand side tends to $J'(a)$, while the right-hand side tends to $\big(J(\theta) - J(a)\big)/(\theta - a)$. But then

$$
J'(a) \leq \frac{J(\theta) - J(a)}{\theta - a}.
$$

We may rearrange this inequality to obtain

$$
J(\theta) \geq J(a) + J'(a)(\theta - a),
$$

which is what we wanted to show; the case $\theta < a$ is handled similarly (the two sign flips cancel).

(3) $\Rightarrow$ (4): For any $a < b$, statement (3) applied twice (once at $a$ and once at $b$) gives both

$$
J(b) \geq J(a) + J'(a)(b - a) \quad \text{and} \quad J(a) \geq J(b) + J'(b)(a - b).
$$

Adding the two inequalities and simplifying yields $J'(a)(b - a) \leq J'(b)(b - a)$, and so $J'(a) \leq J'(b)$. Thus $J'$ is nondecreasing, which forces $J''(\theta) \geq 0$ for all $\theta$.

(4) $\Rightarrow$ (1): Suppose $J'' \geq 0$ everywhere, so that $J'$ is nondecreasing. Let $a < b$, let $t \in (0, 1)$, and set $c = (1 - t)a + tb$. By the Mean Value Theorem, there are points $\xi_1 \in (a, c)$ and $\xi_2 \in (c, b)$ such that

$$
\frac{J(c) - J(a)}{c - a} = J'(\xi_1) \leq J'(\xi_2) = \frac{J(b) - J(c)}{b - c}.
$$

Since $c - a = t(b - a)$ and $b - c = (1 - t)(b - a)$, clearing denominators and rearranging yields

$$
J(c) \leq (1 - t)J(a) + tJ(b),
$$

which establishes (1).

Finally, if $J$ is convex and $J'(\theta^\star) = 0$, then by (3) we have

$$
J(\theta) \geq J(\theta^\star) + J'(\theta^\star)(\theta - \theta^\star) = J(\theta^\star)
$$

for all $\theta$, and so $\theta^\star$ is a global minimizer. Q.E.D.
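As a quick numerical sanity check of the equivalences in Theorem 10.7, we may test statements (1) and (3) at random points for the convex function $f(x) = e^x$ (the following snippet is our own illustration):

rng = np.random.default_rng(0)
f = np.exp        # a convex function...
f_prime = np.exp  # ...which is its own derivative

for _ in range(5):
    a, b = rng.normal(size=2)
    t = rng.uniform()
    c = (1 - t) * a + t * b
    print(f(c) <= (1 - t) * f(a) + t * f(b),    # statement (1): graph below secants
          f(b) >= f(a) + f_prime(a) * (b - a))  # statement (3): graph above tangents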
Our goal now is to generalize the main theorem to convex functions of multiple variables. To do this, we will use the theorem for functions of a single variable, along with the following device: Suppose $J: \mathbb{R}^n \to \mathbb{R}$ is a function of class $C^2$, and for points $a, b \in \mathbb{R}^n$ define the auxiliary function

$$
g(t) = J\big(a + t(b - a)\big), \quad t \in \mathbb{R}.
$$

Then $g$ is a twice-differentiable function of the single variable $t$, and $J$ is convex if and only if every such $g$ is convex,

while the same Chain Rule computation as in the proof of Theorem 10.2 gives

$$
g'(t) = (b - a)^T \nabla J\big(a + t(b - a)\big) \quad \text{and} \quad g''(t) = (b - a)^T \operatorname{Hess} J\big(a + t(b - a)\big)(b - a).
$$
Theorem 10.8 (Main theorem on convex functions (multi-variable version))
Let $J: \mathbb{R}^n \to \mathbb{R}$ be a function of class $C^2$. The following statements are equivalent:

1. The function $J$ is convex.
2. The graph of $J$ lies above its tangent (hyper)planes, i.e., for all $a, \theta \in \mathbb{R}^n$, we have

$$
J(\theta) \geq J(a) + \nabla J(a)^T (\theta - a).
$$

3. The Hessian matrix $\operatorname{Hess} J(\theta)$ is positive semidefinite for all $\theta \in \mathbb{R}^n$.

Moreover, if $\theta^\star$ is a stationary point of a convex function $J$, then $\theta^\star$ is a global minimizer.
Proof. Throughout the proof we fix points $a, b \in \mathbb{R}^n$ and work with the auxiliary function

$$
g(t) = J\big(a + t(b - a)\big).
$$

Notice that this is exactly the kind of function used in Section 10.2 in the definitions of the directional first and second derivatives; in particular, its derivatives are given by the formulas displayed above.

(1) $\Rightarrow$ (2): If $J$ is convex, then so is each auxiliary function $g$, and so by Theorem 10.7 the graph of $g$ lies above its tangent lines. In particular, taking $b = \theta$,

$$
J(\theta) = g(1) \geq g(0) + g'(0) = J(a) + \nabla J(a)^T (\theta - a),
$$

where the second equality follows from Definition 10.2 and Theorem 10.2.

(2) $\Rightarrow$ (3): Fix $\theta \in \mathbb{R}^n$ and a vector $v$, and consider the auxiliary function $g(t) = J(\theta + tv)$, obtained by taking $a = \theta$ and $b = \theta + v$. Since the graph of $J$ lies above its tangent planes by hypothesis, the graph of $g$ lies above its tangent lines, and so by Theorem 10.7 we have $g''(t) \geq 0$ for all $t$. In particular,

$$
0 \leq g''(0) = v^T \operatorname{Hess} J(\theta)\, v,
$$

which shows that the Hessian matrix $\operatorname{Hess} J(\theta)$ is positive semidefinite.

(3) $\Rightarrow$ (1): Since the Hessian matrices are positive semidefinite everywhere, we have

$$
g''(t) = (b - a)^T \operatorname{Hess} J\big(a + t(b - a)\big)(b - a) \geq 0
$$

for all $t$, by Theorem 10.2 and positive semidefiniteness of the Hessian matrix. Hence each auxiliary function $g$ is convex by Theorem 10.7, and so for all $t \in [0, 1]$,

$$
J\big((1 - t)a + tb\big) = g(t) \leq (1 - t)g(0) + t g(1) = (1 - t)J(a) + tJ(b),
$$

which shows that $J$ is convex.

Finally, if $J$ is convex and $\nabla J(\theta^\star) = 0$, then statement (2) gives $J(\theta) \geq J(\theta^\star)$ for all $\theta$, and so $\theta^\star$ is a global minimizer. Q.E.D.