11. Optimization#

Lest the reader wonder why an entire chapter on optimization is included in a textbook on probability and statistics, we suggest that they recall the discussion in the expository chapter, Chapter 9. There, we described the rationale: Our overarching goal in the current sequence of chapters is to train probabilistic models on data. To do this, we minimize the KL divergence between the proposed model distribution \(p(x,y;\theta)\) and the empirical distribution \(\hat{p}(x,y)\) of the dataset, and thereby turn “learning” into an optimization problem:

../_images/prob-distance.svg

 

Many of the optimization problems we will encounter do not have closed-form solutions, and so we will need to study methods for approximation. All those studied in this chapter are versions of an iterative method called gradient descent. For a warm-up, we will study a simple single-variable version of this method in Section 11.1 before proceeding to the full multi-variable version in Section 11.3. Sandwiched between these two sections is Section 11.2, where we review the tools from elementary differential geometry that allow the generalization from one variable to many. Then, we finish with Section 11.4, where we study a particular form of gradient descent, called stochastic gradient descent, that is specifically tailored for the objective functions encountered in training probabilistic models. An appendix is included at the end of the chapter, containing some basic theorems on convex functions and their (rather fussy) proofs.

As hinted in the first paragraph, the inclusion of gradient-based optimization algorithms and their applications to parameter estimation is what distinguishes this book from a traditional book on mathematical statistics. This material is often included in texts on machine learning, but it is not in any text on statistics (that I know of). However, we are just barely scratching the surface of optimization and machine learning. If you are new to these fields and want to learn more, I suggest beginning with the fifth chapter of [GBC16] for a quick overview. After this, you can move on to [HR22], before tackling the massive, encyclopedic texts [Mur22] and [Mur23].

11.1. Gradient descent in one variable#

In this section, we describe the single-variable version of the gradient descent algorithm to help motivate the general algorithm in arbitrary dimensions. To begin, consider the optimization problem of locating the minimum values of the polynomial function

\[ J(\theta) = \theta^4 - 6\theta^3 + 11\theta^2 - 7\theta + 4. \]

This function is called the objective function of the optimization problem. Its graph is displayed in the figure below:

Hide code cell source
import torch
from torch.distributions.multivariate_normal import MultivariateNormal
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
import matplotlib_inline.backend_inline
from math_stats_ml.gd import GD, SGD, plot_gd
plt.style.use('../aux-files/custom_style_light.mplstyle')
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
blue = '#486AFB'
magenta = '#FD46FC'

def J(theta):
    return (theta ** 4) - 6 * (theta ** 3) + 11 * (theta ** 2) - 7 * theta + 4

grid = torch.linspace(start=-0.5, end=3.5, steps=300)
plt.plot(grid, J(grid))
plt.xlabel('$\\theta$')
plt.ylabel('$J(\\theta)$')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
../_images/ff624546fd468218fde71ba3a4d57c873926c6ac53ef04c0e72c6e68a09329b2.svg

From the graph, we see that the objective function has two local minima, at approximately \(\theta=0.5\) and \(\theta=2.7\), with minimum values \(J(0.5)\) and \(J(2.7)\).

It will be convenient to introduce the following terminology for our optimization problems:

Definition 11.1

Let \(J: \bbr^n \to \bbr\) be a function. A vector \(\btheta^\star\) is a local minimizer of \(J(\btheta)\) provided that

\[ J(\btheta^\star) \leq J(\btheta) \]

for all \(\btheta\) in a neighborhood of \(\btheta^\star\); if this inequality holds for all \(\btheta\), then \(\btheta^\star\) is called a global minimizer of \(J(\btheta)\). If we flip the inequality in the other direction, then we obtain the definitions of local and global maximizers. Collectively, local and global minimizers and maximizers of \(J(\btheta)\) are called extremizers, and the values \(J(\btheta^\star)\) of the function where \(\btheta^\star\) is an extremizer are called extrema or extreme values.

Using this terminology, we would say that \(0.5\) is (approximately) a local minimizer of our polynomial objective function \(J(\theta)\), while \(2.7\) is (approximately) a global minimizer.

Let’s see how the single-variable version of the gradient descent algorithm would solve our optimization problem. This algorithm depends on an initial guess for a minimizer, as well as two parameters called the learning rate and the number of gradient steps. We will state the algorithm first, and then walk through some intuition for why it works:

Algorithm 11.1 (Single-variable gradient descent)

Input: A differentiable objective function \(J:\mathbb{R}\to \mathbb{R}\), an initial guess \(\theta_0\in \mathbb{R}\) for a local minimizer \(\theta^\star\), a learning rate \(\alpha>0\), and the number \(N\) of gradient steps.

Output: An approximation to a local minimizer \(\theta^\star\).


   \(\theta := \theta_0\)
   For \(t\) from \(0\) to \(N-1\), do:
           \(\theta := \theta - \alpha J'(\theta)\)
   Return \(\theta\)

Though this simple description of the algorithm outputs just a single approximation to a local minimizer, in practice it is often convenient to output the entire sequence of \(\theta\)’s produced as the algorithm iterates through the for loop:

\[ \theta_0,\theta_1,\ldots,\theta_t,\ldots,\theta_{N}. \]

Then, as we will see below, by tracking the associated objective values

\[ J(\theta_0), J(\theta_1),\ldots, J(\theta_t), \ldots, J(\theta_N), \]

one may monitor convergence of the algorithm.
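To make the procedure concrete, here is a minimal sketch of Algorithm 11.1 in plain Python that records the entire sequence of \(\theta\)'s and objective values. It is only an illustration, not the GD helper imported at the top of this chapter, and the names J and J_prime are assumed to be user-supplied functions computing the objective and its derivative:

def gradient_descent(J, J_prime, theta0, alpha, num_steps):
    # a minimal sketch of single-variable gradient descent (Algorithm 11.1)
    theta = theta0
    thetas, objectives = [theta], [J(theta)]
    for _ in range(num_steps):
        theta = theta - alpha * J_prime(theta)    # gradient step
        thetas.append(theta)
        objectives.append(J(theta))               # track J(theta) to monitor convergence
    return thetas, objectives

# example usage with the polynomial objective above
J = lambda theta: theta ** 4 - 6 * theta ** 3 + 11 * theta ** 2 - 7 * theta + 4
J_prime = lambda theta: 4 * theta ** 3 - 18 * theta ** 2 + 22 * theta - 7
thetas, objectives = gradient_descent(J, J_prime, theta0=3.45, alpha=0.01, num_steps=8)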

The assignment

(11.1)#\[ \theta := \theta - \alpha J'(\theta) \]

in the for loop is called the update rule. This form is convenient for implementation in code, but for theoretical analysis, it is often convenient to rewrite the rule as a recurrence relation in the form

\[ \theta_{t+1} = \theta_{t} - \alpha J'(\theta_{t}), \]

for all \(t\geq 0\). We say that the new parameter \(\theta_{t+1}\) is obtained by taking a gradient step from \(\theta_{t}\). The first update occurs when \(t=0\), yielding

\[ \theta_1 = \theta_{0} - \alpha J'(\theta_{0}). \]

To understand the intuition behind the algorithm, consider the two cases that the derivative \(J'(\theta_0)\) is positive or negative:

Hide code cell source
def J_prime(theta):
    return 4 * (theta ** 3) - 18 * (theta ** 2) + 22 * theta - 7

plt.plot(grid, J(grid), color=blue)
plt.plot(grid, J_prime(-0.4) * (grid + 0.4) + J(-0.4), color=magenta, zorder=10)
plt.scatter(-0.4, J(-0.4), color=magenta, s=100, zorder=15)
plt.scatter(-0.4, 0, color=magenta, s=100, zorder=20)
plt.plot([-0.4, -0.4], [J(-0.4), 0], color=magenta, linestyle='--')

plt.plot(grid, J_prime(3.3) * (grid - 3.3) + J(3.3), color=magenta, zorder=10)
plt.scatter(3.3, J(3.3), color=magenta, s=100, zorder=10)
plt.scatter(3.3, 0, color=magenta, s=100, zorder=10)
plt.plot([3.3, 3.3], [J(3.3), 0], color=magenta, linestyle='--')

plt.text(-0.6, 0.6, '$\\theta_0$', ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))
plt.text(0.15, J(-0.4), "$J'(\\theta_0)<0$", ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))
plt.text(3.5, 0.6, '$\\theta_0$', ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))
plt.text(2.8, J(3.3), "$J'(\\theta_0)>0$", ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))

plt.xlim(-0.8, 3.7)
plt.ylim(-0.3, 12.2)
plt.xlabel('$\\theta$')
plt.ylabel('$J(\\theta)$')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
../_images/07a70d4bd36c3fe10783d03cc904af4ab7450a52be4d1d37257c084db705456d.svg

In this plot, we’ve drawn the tangent lines to the graph of \(J(\theta)\) at two initial values \(\theta_0=-0.4\) and \(\theta_0=3.3\). Since the derivatives are the slopes of these tangent lines, the sign of the derivative is negative when \(\theta_0=-0.4\) and positive when \(\theta_0 = 3.3\). In the first case, we have

(11.2)#\[\theta_1 = -0.4 - \alpha J'(-0.4) > -0.4\]

since \(\alpha>0\), while in the second case we have

(11.3)#\[\theta_1 = 3.3 - \alpha J'(3.3) < 3.3.\]

But notice that the nearest minimizer to \(\theta_0 = -0.4\) is \(\theta^\star \approx 0.5\), and so the new \(\theta_1\) computed according to (11.2) should be closer to \(\theta^\star\) than the initial guess \(\theta_0\), provided that the (scaled) negative derivative

(11.4)#\[-\alpha J'(\theta_0)\]

is not so large (in magnitude) that the new \(\theta_1\) “overshoots” the minimizer \(\theta^\star\). Similarly, the nearest minimizer to \(\theta_0 = 3.3\) is \(\theta^\star \approx 2.7\), so the new \(\theta_1\) computed according to (11.3) should be closer to \(\theta^\star\) than \(\theta_0\), again provided that the (scaled) negative derivative (11.4) is not too large in magnitude.

From these considerations, we conclude the following:

Observation 11.1

  • The negative derivative \(-J'(\theta)\) always “points downhill.”

  • When the gradient descent algorithm works, it locates a minimizer by following the negative derivative “downhill.”

The sense in which the negative derivative “points downhill” is made precise by our observation that it is positive if the point \((\theta_0,J(\theta_0))\) sits on a decreasing portion of the graph of \(J(\theta)\), and it is negative if \((\theta_0,J(\theta_0))\) is on an increasing portion of the graph. The role of the learning rate \(\alpha\) is to scale down the magnitude of the negative derivative so that the gradient step in the update rule does not cause \(\theta_1\) to “overshoot” a nearby minimizer.

Let’s run the gradient descent algorithm four times, with various settings of the parameters:

Hide code cell source
gd_parameters = {'init_parameters': [torch.tensor([-0.5]),
                                     torch.tensor([3.45]),
                                     torch.tensor([-0.5]),
                                     torch.tensor([3.45])],
                 'num_steps': [8, 7, 5, 5],
                 'lr': [1e-2, 1e-2, 1e-1, 2e-1]}

_, axes = plt.subplots(nrows=2, ncols=2, figsize=(8, 5))
grid = torch.linspace(start=-0.5, end=3.5, steps=300)

for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(**gd_parameters_slice, J=J)
    
    alpha = gd_parameters_slice['lr']
    N = gd_parameters_slice['num_steps']
    thetas = gd_output.parameters['theta']
    
    axis.plot(grid, J(grid))
    axis.step(x=thetas, y=gd_output.per_step_objectives, where='post', color=magenta, zorder=2)
    axis.scatter(x=thetas, y=gd_output.per_step_objectives, s=30, color=magenta, zorder=2)
    axis.scatter(x=thetas[0], y=gd_output.per_step_objectives[0], s=100, color=magenta, zorder=2)
    axis.set_xlabel('$\\theta$')
    axis.set_ylabel('$J(\\theta)$')
    axis.set_title(f'$\\alpha={alpha}$, $N={N}$')
plt.tight_layout()
../_images/70a8c95a20e43c542d58be83f6499b089d3b249df8583c326476db8f08596c99.svg

In all four plots, the large magenta dot represents the initial point \((\theta_0,J(\theta_0))\), while the smaller dots represent the remaining \(N\) points

\[ (\theta_1,J(\theta_1)), (\theta_2, J(\theta_2)),\ldots, (\theta_N,J(\theta_N)), \]

where \(N\) is the number of gradient steps. In the first row of the figure, the algorithm appears to be converging in both cases to the nearest minimizer to the initial guesses. In the second row, the learning rate is (relatively) large, causing the first gradient steps to “overshoot” the nearest minimizers to the initial guesses. However, the algorithm still appears to converge in both cases.

Problem Prompt

Do problem 1 on the worksheet.

It is possible for the gradient descent algorithm to diverge, especially if the learning rate is too large. For example, suppose that we set the learning rate to \(\alpha = 0.2\) and use \(\theta_0 = 3.5\) as our initial guess. Then three steps of gradient descent produce the following:

Hide code cell source
gd_output = GD(J=J, init_parameters=torch.tensor([3.5]), lr=2e-1, num_steps=3)
thetas = gd_output.parameters['theta']

grid = torch.linspace(start=-55, end=50, steps=300)
plt.plot(grid, J(grid))
plt.step(x=thetas, y=gd_output.per_step_objectives, where='post', color=magenta, zorder=2)
plt.scatter(x=thetas, y=gd_output.per_step_objectives, s=30, color=magenta, zorder=2)
plt.scatter(x=thetas[0], y=gd_output.per_step_objectives[0], s=100, color=magenta, zorder=2)

plt.xlabel('$\\theta$')
plt.ylabel('$J(\\theta)$')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
../_images/3d0d9e16b76db84294c402b68dc6db5ab45d3b4331e8a8b574f203768508b81a.svg

We see already that \(J(\theta_3) \approx 10^7\); in fact, we have \(J(\theta_t) \to \infty\) as \(t\to\infty\) for these particular parameters.

Problem Prompt

Do problems 2 and 3 on the worksheet.

Of course, one can often prevent divergence by simply using a smaller learning rate, but sometimes a large initial learning rate is desirable to help the algorithm quickly find the neighborhood of a minimizer. So, what we desire is a scheme to shrink the learning rate from large values to smaller ones as the algorithm runs. This scheme is called a learning rate schedule, and it is implemented by adding an extra decay rate parameter to the gradient descent algorithm:

Algorithm 11.2 (Single-variable gradient descent with learning rate decay)

Input: A differentiable objective function \(J:\mathbb{R}\to \mathbb{R}\), an initial guess \(\theta_0\in \mathbb{R}\) for a local minimizer \(\theta^\star\), a learning rate \(\alpha>0\), a decay rate \(\beta \in [0, 1)\), and the number \(N\) of gradient steps.

Output: An approximation to a local minimizer \(\theta^\star\).


   \(\theta := \theta_0\)
   For \(t\) from \(0\) to \(N-1\), do:
           \(\theta := \theta - \alpha (1-\beta)^{t+1} J'(\theta)\)
   Return \(\theta\)
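For concreteness, here is a minimal sketch of Algorithm 11.2; as before, this is only an illustration (not the GD helper used in the code cells), and J_prime is assumed to be a user-supplied derivative function:

def gradient_descent_decay(J_prime, theta0, alpha, beta, num_steps):
    # a minimal sketch of single-variable gradient descent with learning rate decay
    theta = theta0
    for t in range(num_steps):
        decayed_lr = alpha * (1 - beta) ** (t + 1)    # the learning rate shrinks as t grows
        theta = theta - decayed_lr * J_prime(theta)
    return theta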

The new parameter \(\beta\), called the decay rate, shrinks the learning rate as

\[ \alpha (1-\beta) > \alpha (1-\beta)^2 >\cdots > \alpha (1-\beta)^t > \cdots > \alpha (1-\beta)^N, \]

provided that \(\beta> 0\). Setting \(\beta=0\) results in no change in the learning rate. In our diverging example above, setting the decay rate to \(\beta=0.1\) results in:

Hide code cell source
gd_output = GD(J=J, init_parameters=torch.tensor([3.5]), lr=2e-1, num_steps=8, decay_rate=0.1)
thetas = gd_output.parameters['theta']

grid = torch.linspace(start=-0.5, end=3.5, steps=300)
plt.plot(grid, J(grid))
plt.step(x=thetas, y=gd_output.per_step_objectives, where='post', color=magenta, zorder=2)
plt.scatter(x=thetas, y=gd_output.per_step_objectives, s=30, color=magenta, zorder=2)
plt.scatter(x=thetas[0], y=gd_output.per_step_objectives[0], s=100, color=magenta, zorder=2)

plt.xlabel('$\\theta$')
plt.ylabel('$J(\\theta)$')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
../_images/d82d4f56c4ce31cd345efefd4ef9533a8bf9b6724522adebc7971a7bb9ca9464.svg

We have carried out \(N=8\) gradient steps, and it appears that the algorithm has successfully located the minimizer \(\theta^\star \approx 2.7\).

Problem Prompt

Do problem 4 on the worksheet.

The learning rate \(\alpha\) and decay rate \(\beta\) are often chosen by experimentation, and they may be tuned by the analyst by closely monitoring the values of the objective function \(J(\theta)\) as the algorithm runs. This is easy in the single-variable case, since one can plot the graph of \(J(\theta)\). In the multi-variable case, however, the graph of \(J(\theta)\) may live in many more dimensions than we can visualize, so the analyst might track the values of the objective function against the number of gradient steps. For example, with our polynomial objective function \(J(\theta)\) from above and

\[ \theta_0 = -0.5, \quad \alpha = 0.01, \quad \beta = 0.1, \]

we would plot the following:

Hide code cell source
gd_output = GD(J=J, init_parameters=torch.tensor([-0.5]), lr=1e-2, num_steps=15, decay_rate=0.1)
plot_gd(gd_output, h=3, plot_title=False, parameter_title=False, ylabel='objective', per_step_alpha=1)
plt.tight_layout()
../_images/4d8c99f1270ac1b2d8aa1cf5e54a24aa111001a5037828556e350ffe7b8d8733.svg

One may use this plot to decide on the total number \(N\) of gradient steps; simply choose \(N\) large enough to reach a point where the plot “plateaus” or “levels out,” indicating that the algorithm is converging on a minimizer. Alternatively, the analyst may build an automatic stopping condition into the algorithm that halts when the difference between successive objective values is smaller in magnitude than some chosen threshold, say

\[ |J(\theta_t) - J(\theta_{t-1}) | < \epsilon, \]

where \(\epsilon>0\) is a small number.
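Here is a sketch of how such a stopping condition might be added to the basic loop; the names epsilon and max_steps are illustrative choices and not part of the GD helper used elsewhere in this chapter:

def gradient_descent_early_stop(J, J_prime, theta0, alpha, epsilon, max_steps):
    # halt when successive objective values differ by less than epsilon
    theta = theta0
    previous_objective = J(theta)
    for _ in range(max_steps):
        theta = theta - alpha * J_prime(theta)
        current_objective = J(theta)
        if abs(current_objective - previous_objective) < epsilon:
            break                                  # the objective has "plateaued"
        previous_objective = current_objective
    return theta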

11.2. Differential geometry#

If \(\theta^\star\) is an extremizer of a differentiable function \(J:\bbr \to \bbr\), then \(\theta^\star\) must be a stationary point in the sense that

(11.5)#\[ J'(\theta^\star)=0. \]

The name arises from the observation that small (first-order infinitesimal) perturbations of \(\theta^\star\) do not change the value \(J(\theta^\star)\), i.e., the value \(J(\theta^\star)\) remains stationary under small perturbations. In certain very favorable situations, we may be able to solve the stationarity equation (11.5) for \(\theta^\star\) to obtain a formula in closed form. In this case, the iterative gradient descent algorithm is not needed. But \(\theta^\star\) being a solution to the stationarity equation (11.5) is only a necessary condition for it to be an extremizer—sufficient conditions may be obtained by considering the local curvature of the graph of \(J\) near \(\theta^\star\).

Our goal in this section is twofold: First, we briefly recall how these curvature considerations help us identify extremizers in the single-variable case—the relevant tools are the first and second derivatives. Then, we generalize these derivatives to higher dimensions to obtain gradient vectors and Hessian matrices. We indicate how local curvature in higher dimensions may be computed using these new tools and, in particular, how we may use them to identify extremizers.

So, let’s begin with the familiar routine from single-variable calculus called the Second Derivative Test. Given a point \(\theta^\star\) and a twice-differentiable function \(J:\bbr \to \bbr\), the test splits into two cases:

  1. If \(J'(\theta^\star) = 0\) and \(J''(\theta^\star) > 0\), then \(\theta^\star\) is a local minimizer.

  2. If \(J'(\theta^\star) = 0\) and \(J''(\theta^\star) < 0\), then \(\theta^\star\) is a local maximizer.

Provided that the second derivative of \(J\) not only exists but is also continuous, the intuition for the Second Derivative Test may be explained via local curvature. Indeed, in the first case, positivity of the second derivative means that the graph of \(J\) is convex near \(\theta^\star\), while in the second case negativity of the second derivative means that the graph is concave. The nature of the curvature helps us distinguish between minimizers and maximizers:

Hide code cell source
def f(x):
    return x ** 2 + 4

def g(x):
    return -x ** 2 + 4

functions = [f, g]
grid = np.linspace(-2, 2)
_, axes = plt.subplots(ncols=2, figsize=(8, 3), sharey=True, sharex=True)

for i, (function, axis) in enumerate(zip(functions, axes)):
    axis.plot(grid, function(grid))
    axis.scatter(0, 4, s=50, color=magenta, zorder=3)
    if i == 0:
        axis.text(0, 3, "$J '(0) = 0$, $J ''(0)>0$", ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))
        axis.set_title("convex $\\Rightarrow$ minimizer")
    else:
        axis.text(0, 5, "$J '(0) = 0$, $J ''(0)<0$", ha='center', va='center', bbox=dict(facecolor='white', edgecolor=None))
        axis.set_title("concave $\\Rightarrow$ maximizer")

plt.tight_layout()
../_images/f2b9127df85cf6a6eda9563eed22c250f36294900807e2e815ba46e139f24cac.svg

To see why second derivatives encode local curvature, let’s suppose \(J''(\theta^\star) >0\) at some point \(\theta^\star\). Then continuity of \(J''\) means that there is a number \(\epsilon>0\) such that \(J''(\theta)>0\) for all \(\theta\) in the open interval \(I = (\theta^\star - \epsilon, \theta^\star + \epsilon)\) centered at \(\theta^\star\). But the second derivative is the first derivative of the first derivative, and thus positivity of \(J''\) over \(I\) means that \(J'\) is increasing over \(I\). Since the first derivative \(J'\) measures the slope of the graph of \(J\), this must mean that the slopes increase as we move from left to right across \(I\). Thus, \(J\) is convex near \(\theta^\star\).

We already met the notion of concavity back in Section 10.2 where it was fundamental in our proof of Gibbs’ inequality in Theorem 10.3. As shown in the figure above, a function is concave if it lies below its secant lines, while it is convex if it lies above. Both these shapes have implications for the search for extremizers—in particular, stationary points of concave and convex functions are always global extremizers. The proof of this claim, along with equivalent characterizations of concavity and convexity in terms of tangent lines and tangent (hyper)planes, is given in the two main theorems in the appendix. (The claims are easily believable, while the proofs are annoyingly fussy. At least glance at the statements of the theorems to convince yourself that your intuition is on point, but do not feel compelled to go through the proofs line by line.)

How might we generalize the Second Derivative Test to higher dimensions? To help gain insight into the answer, let’s first add only one additional dimension, going from a function of a single variable to a twice-differentiable function of two variables:

\[ J:\bbr^2 \to \bbr, \quad \btheta \mapsto J(\btheta). \]

Actually, to obtain the best results relating curvature to second derivatives, we shall assume that the second-order partial derivatives are continuous, though this isn’t strictly needed for some definitions and results below. Functions with continuous first- and second-order partial derivatives are said to be of class \(C^2\) in the mathematical literature.

For example, let’s suppose that the graph of \(J\) is an upside down paraboloid:

../_images/paraboloid-plot.png

 

At any given point on this surface (like the one above the black dot) there is not a single slope and curvature, but rather infinitely many slopes and curvatures in all the different directions that one may step in the plane \(\bbr^2\). These different directions may be represented as directional vectors in the plane; here are three examples:

../_images/directional-plot.png

 

Taking advantage of the very special circumstance that our graph is embedded as a surface in \(\bbr^3\), we may visualize the slopes and curvatures in these three directions by first intersecting the graph with three vertical planes:

../_images/sectional-plot.png

 

The intersections of these vertical planes and the surface yield curves called sections—for the planes displayed in the plot above, the sections are a trio of downward opening parabolas. The slopes and curvatures on the surface in the three directions are then the slopes and curvatures of these sectional curves.

To obtain these slopes and curvatures, let’s suppose that \(\bv\) is one of the three directional vectors in the plane, with its tail hooked to the point \(\btheta\) represented by the black dot. As we let \(t\in \bbr\) vary, the vector sum

\[ t \bv + \btheta \]

traces out the line in the plane \(\bbr^2\) through \(\btheta\) and in the direction of \(\bv\). It follows that the mapping

\[ t\mapsto J(t\bv + \btheta) \]

is exactly the sectional curve on the surface. Notice that this mapping is a real-valued function of a single real variable \(t\), and thus it has first and second derivatives in the ordinary sense from single-variable calculus. These considerations motivate the following definition, which applies to functions of \(n\) variables (not just two).

Definition 11.2

Let \(J : \bbr^n \to \bbr\) be a function of class \(C^2\), \(\btheta\in \bbr^n\) a point, and \(\bv \in \bbr^n\) a vector. We define the directional first derivative of \(J\) at \(\btheta\) in the direction \(\bv\) to be

\[ J_\bv'(\btheta) \def \frac{\text{d}}{\text{d}t} \bigg|_{t=0} J(t\bv + \btheta), \]

while we define the directional second derivative to be

\[ J_\bv''(\btheta) \def \frac{\text{d}^2}{\text{d}t^2} \bigg|_{t=0} J(t\bv + \btheta). \]

In this context, the vector \(\bv\) is called the directional vector.
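Directional derivatives may also be computed numerically by differentiating the single-variable function \(t\mapsto J(t\bv + \btheta)\). Here is a small sketch using PyTorch's autograd, with an illustrative two-variable function that is not one of the examples from the text:

import torch

def J(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2      # illustrative example

theta = torch.tensor([1.0, 2.0])
v = torch.tensor([1.0, 1.0])                       # directional vector

t = torch.tensor(0.0, requires_grad=True)
value = J(theta + t * v)                           # the sectional curve t -> J(t*v + theta)
first, = torch.autograd.grad(value, t, create_graph=True)
second, = torch.autograd.grad(first, t)
print(first.item())     # directional first derivative J'_v(theta); here 42.0
print(second.item())    # directional second derivative J''_v(theta); here 22.0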

Problem Prompt

Do problem 5 on the worksheet.

The relations between these directional derivatives and partial derivatives pass through the gadgets defined in the following box; the first of these is familiar from multi-variable calculus:

Definition 11.3

Let \(J : \bbr^n \to \bbr\) be a function of class \(C^2\) and \(\btheta\in \bbr^n\) a point. We define the gradient vector to be

\[\begin{split} \nabla J(\btheta) \def \begin{bmatrix} \displaystyle \frac{\partial J}{\partial \theta_i}(\btheta) \end{bmatrix} = \begin{bmatrix} \displaystyle\frac{\partial J}{\partial \theta_1}(\btheta) \\ \vdots \\ \displaystyle \frac{\partial J}{\partial \theta_n}(\btheta) \end{bmatrix} \in \bbr^n, \end{split}\]

while we define the Hessian matrix to be

\[\begin{split} \nabla^2 J(\btheta) \def \begin{bmatrix} \displaystyle \frac{\partial^2 J}{\partial \theta_i \partial \theta_j}(\btheta) \end{bmatrix} = \begin{bmatrix} \displaystyle\frac{\partial^2 J}{\partial \theta_1^2}(\btheta) & \cdots &\displaystyle \frac{\partial^2 J}{\partial \theta_1 \partial \theta_n}(\btheta) \\ \vdots & \ddots & \vdots \\ \displaystyle\frac{\partial^2 J}{\partial \theta_n \partial \theta_1} (\btheta) & \cdots & \displaystyle\frac{\partial^2 J}{\partial \theta_n^2}(\btheta) \end{bmatrix} \in \bbr^{n\times n}. \end{split}\]

Note that since \(J\) is of class \(C^2\), the Hessian matrix is symmetric.
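In code, the gradient vector and Hessian matrix may be obtained by automatic differentiation. A short sketch using torch.autograd.functional (available in recent versions of PyTorch), with the same illustrative function as above:

import torch
from torch.autograd.functional import jacobian, hessian

def J(theta):
    return theta[0] ** 2 + 10 * theta[1] ** 2      # illustrative example

theta = torch.tensor([1.0, 2.0])
print(jacobian(J, theta))    # gradient vector of the scalar function J; here (2, 40)
print(hessian(J, theta))     # Hessian matrix; here diag(2, 20), symmetric as expected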

We have already met the symbol “\(\nabla\)” back in Definition 8.2 where we defined the gradient matrix of a vector-valued function

\[ \bf: \bbr^n \to \bbr^m, \quad \bx \mapsto \bf(\bx). \]

In fact, if we write

\[ \bf(\bx)^\intercal = (f_1(\bx),\ldots,f_m(\bx)), \]

then the gradient matrix of \(\bf\) is exactly the matrix whose columns are the gradient vectors of the component functions:

\[\begin{split} \nabla \bf(\bx) = \begin{bmatrix} \uparrow & \cdots & \uparrow \\ \nabla f_1(\bx) & \cdots & \nabla f_m(\bx) \\ \downarrow & \cdots & \downarrow \end{bmatrix}. \end{split}\]

Of course, a function \(J:\bbr^n \to \bbr\) may be considered a vector-valued function when we view the target space \(\bbr\) as a \(1\)-dimensional vector space, and then clearly there is no inconsistency in the notation since the gradient vector of \(J\) is the same as its gradient matrix (at \(\btheta\)). This dual usage of the symbol \(\nabla\) then helps explain the notation \(\nabla^2\) for the Hessian matrix; precisely, we have:

Theorem 11.1 (Hessian matrices are gradient matrices of gradient vectors)

Let \(J : \bbr^n \to \bbr\) be a function of class \(C^2\) and \(\btheta\in \bbr^n\) a point. Then we have the equality of matrices

\[ \nabla^2 J(\btheta) = \nabla\left(\nabla J \right)(\btheta), \]

where \(\nabla J\) on the right-hand side stands for the vector-valued function

\[ \nabla J: \bbr^n \to \bbr^n, \quad \btheta \mapsto \nabla J(\btheta). \]

Proof. By definition, we have

\[\begin{split} \nabla J(\btheta) = \begin{bmatrix} \displaystyle\frac{\partial J}{\partial \theta_1}(\btheta) \\ \vdots \\ \displaystyle \frac{\partial J}{\partial \theta_n}(\btheta) \end{bmatrix}, \end{split}\]

and so

\[\begin{split} \nabla\left(\nabla J \right)(\btheta) = \begin{bmatrix} \uparrow & \cdots & \uparrow \\ \nabla \left( \displaystyle\frac{\partial J}{\partial \theta_1}\right)(\btheta) & \cdots & \nabla \left( \displaystyle\frac{\partial J}{\partial \theta_n}\right)(\btheta) \\ \downarrow & \cdots & \downarrow \end{bmatrix} = \nabla^2 J(\btheta). \end{split}\]

The following important theorem expresses the relations between the first and second directional derivatives and the gradient vector and Hessian matrix.

Theorem 11.2 (Slopes, curvatures, and partial derivatives)

Let \(J:\bbr^n \to \bbr\) be a function of class \(C^2\), \(\btheta \in \bbr^n\) a point, and \(\bv \in \bbr^n\) a directional vector.

  1. We have

    \[ J_{\bv}'(\btheta) = \bv^\intercal \nabla J(\btheta). \]
  2. We have

    \[ J_{\bv}''(\btheta) = \bv^\intercal \nabla^2 J(\btheta) \bv. \]

So, the directional first derivative is obtained through an inner product with the gradient vector, while the directional second derivative is the quadratic form induced by the Hessian matrix.

Proof. The proofs are simple exercises using the multi-variable Chain Rule. Indeed, note that

\[ \frac{\text{d}}{\text{d}t} J(t \bv + \btheta) = \sum_{i=1}^n v_i \frac{\partial J}{\partial \theta_i} (t\bv + \btheta). \]

Plugging in \(t=0\) to both sides of this last equality then yields (1.). On the other hand, differentiating both sides of the equation with respect to \(t\) (and using the Chain Rule a second time) gives

\[ \frac{\text{d}^2}{\text{d}t^2} J(t \bv + \btheta) = \sum_{i=1}^n v_i \frac{\text{d}}{\text{d}t}\frac{\partial J}{\partial \theta_i} (t\bv + \btheta) = \sum_{i,j=1}^n v_i v_j \frac{\partial^2 J}{\partial \theta_i \partial \theta_j}(t\bv + \btheta). \]

Plugging in \(t=0\) to both ends yields (2.). Q.E.D.

Problem Prompt

Do problem 6 on the worksheet.

When the directional vector \(\bv\) is a unit vector, the value of the directional first derivative \(J'_\bv(\btheta)\) is interpreted as the (instantaneous) rate of change of \(J\) at \(\btheta\) in the direction indicated by \(\bv\). Likewise, if \(\bv\) is a unit vector, then the value of the directional second derivative \(J''_\bv(\btheta)\) is interpreted as the local curvature of \(J\) at \(\btheta\) along the line indicated by \(\bv\).

For our purposes, the most important properties of the gradient vector \(\nabla J(\btheta)\) are (1) that it points in the direction of maximum rate of change, (2) its negative points in the direction of minimum rate of change, and (3) it is orthogonal to the level surfaces of \(J\), otherwise called contours. By definition, such a surface is the \((n-1)\)-dimensional set of solutions \(\btheta\in \bbr^n\) to an equation

\[ J(\btheta) = c \]

for fixed \(c\in \bbr\). In the case that \(n=2\), these level surfaces are actually level curves; for our upside down paraboloid above, they are the blue ellipses in the following:

Hide code cell source
def f(x, y):
    return -4 * x ** 2 - y ** 2 + 15

grid = np.linspace(0, 2)
x, y = np.mgrid[-4:4:0.1, -4:4:0.1]
z = f(x, y)

def tangent_line(x):
    return (8 / 3.5) * (x - 1) - 1.75

plt.contour(x, y, z, levels=10, colors=blue, linestyles='solid')
plt.plot(grid, tangent_line(grid), color=magenta)
plt.arrow(1, -1.75, -8 / 10, 3.5 / 10, head_width=0.2, head_length=0.3, fc=magenta, ec=magenta, linewidth=2)
plt.gcf().set_size_inches(4, 4)
plt.ylim(-4, 4)
plt.xlabel('$x$')
plt.ylabel('$y$')
plt.tight_layout()
../_images/a326937df3ff5e08def8fe4f2070c88195eacb504ff371c7eb0eb6e4ac6737d1.svg

The vector in the figure is the gradient vector, and the magenta line is the tangent line to the contour that passes through the point at the tail of the gradient. Note that the gradient is orthogonal to this tangent line, and therefore also orthogonal to the contour.

In general, it should be intuitively clear from the equation

\[ J_{\bv}'(\btheta) = \bv^\intercal \nabla J(\btheta) \]

in Theorem 11.2 that the gradient is orthogonal to level surfaces. Indeed, if \(\bv\) is a tangent vector to the level surface passing through \(\btheta\), then \(J\) should not change (at least up to first order) as we step in the direction of \(\bv\) since (by definition) the function \(J\) is constant along its level surfaces. Thus, we have \(J_\bv'(\btheta)=0\), and so \(\bv^\intercal \nabla J(\btheta) =0\). This shows the gradient vector is indeed orthogonal to the level surface passing through \(\btheta\). (For a more rigorous argument, see the proposition on page 23 of [GP10].)

Let’s return to the first two properties of the gradient vector mentioned above, that it points in the direction of maximum rate of change and its negative points in the direction of minimum rate of change. In informal treatments, these claims are often justified by an appeal to the “angle” \(\phi\in [0,\pi]\) between two vectors \(\bu\) and \(\bv\), which allegedly fits into an equation

(11.6)#\[ \bv^\intercal \bu = |\bu||\bv| \cos{\phi}. \]

Taking \(\bu = \nabla J(\btheta)\) and supposing \(\bv\) has unit length, we get

\[ J_\bv'(\btheta) = \bv^\intercal \nabla J(\btheta) = |\nabla J(\btheta) | \cos{\phi} \]

from Theorem 11.2. Since \(\cos{\phi}\) is maximized and minimized over \([0,\pi]\) when \(\phi=0\) and \(\phi=\pi\), respectively, it then follows that the gradient points in the direction of maximum rate of change, while its negative points in the direction of minimum rate of change. However, this argument does not address what is meant by the “angle” \(\phi\) between the vectors—certainly in two and three dimensions we have some idea of what this angle might be, but what about in 1000 dimensions?

But the “angle” \(\phi\) is just a distraction. It is much cleaner logically to work directly with the Cauchy-Schwarz inequality, which is the true reason that the gradient has these extremizing properties. An immediate corollary of this inequality says that an inner product of one vector against a unit vector is maximized (minimized) when the unit vector points in the same (opposite) direction as the first vector. In the following, we give a proof of the extremizing properties of the gradient vector using the Cauchy-Schwarz inequality, and then show afterwards how it may be used to give a rigorous definition of the angle \(\phi\).

Theorem 11.3 (Properties of gradient vectors)

Let \(J:\bbr^n \to \bbr\) be a function of class \(C^2\), \(\btheta \in \bbr^n\) a point, and suppose the gradient vector \(\nabla J(\btheta)\) is nonzero.

  1. The gradient vector \(\nabla J(\btheta)\) points in the direction of maximum rate of change.

  2. The negative gradient vector \(-\nabla J(\btheta)\) points in the direction of minimum rate of change.

  3. The gradient vector \(\nabla J(\btheta)\) is orthogonal to level surfaces.

Proof. We already offered a “proof” of the third statement above, so we need only prove the first two. For this, we recall that for any two vectors \(\bu\) and \(\bv\), the Cauchy-Schwarz inequality states that

(11.7)#\[ |\bu^\intercal \bv| \leq |\bu | |\bv|, \]

with equality if and only if \(\bu\) and \(\bv\) are parallel (i.e., one is a scalar multiple of the other). In particular, if we take \(\bu = \nabla J(\btheta)\), let \(\bv\) be a unit vector, and use Theorem 11.2, we get

\[ -|\nabla J(\btheta )| \leq J_\bv'(\btheta) = \bv^ \intercal \nabla J(\btheta) \leq |\nabla J(\btheta )|. \]

The goal is then to identify unit vectors \(\bv\) that extremize the derivative \(J_\bv'(\btheta)\) in these bounds.

But if the derivative achieves the upper bound, then by the criterion for equality in the Cauchy-Schwarz inequality (11.7), there must be a nonzero scalar \(\alpha\) such that

\[ \bv = \alpha \nabla J(\btheta). \]

But then

\[ \alpha |\nabla J(\btheta)|^2 = \bv^\intercal \nabla J(\btheta) = |\nabla J(\btheta)|, \]

and so \(\alpha = 1/ |\nabla J(\btheta)|\). Hence, the derivative \(J_\bv'(\btheta)\) achieves its maximum value exactly when \(\bv\) is the normalized gradient vector. It is just as easy to show that the derivative achieves its minimum value when \(\bv\) is the negative of the normalized gradient vector. Q.E.D.

Problem Prompt

Do problem 7 on the worksheet.

Now, let’s return to the “angle” \(\phi\) between two nonzero vectors \(\bu\) and \(\bv\) in \(\bbr^n\). From the Cauchy-Schwarz inequality (11.7), it follows that

\[ -1 \leq \frac{\bv^\intercal \bu}{|\bu||\bv|} \leq 1. \]

We then define the angle \(\phi\) between the two vectors to be the unique number \(\phi \in [0,\pi]\) such that

\[ \cos{\phi} = \frac{\bv^\intercal \bu}{|\bu||\bv|}. \]

Thus, the fact that the angle \(\phi\) even exists is a consequence of the Cauchy-Schwarz inequality.
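As a quick illustration in a dimension where visual intuition fails, the definition applies verbatim; the two vectors below are arbitrary illustrative choices:

import torch

u = torch.tensor([1.0, 2.0, 0.0, -1.0, 3.0])       # two vectors in R^5
v = torch.tensor([0.5, -1.0, 2.0, 0.0, 1.0])

cos_phi = (v @ u) / (u.norm() * v.norm())          # lies in [-1, 1] by Cauchy-Schwarz
phi = torch.acos(cos_phi)                          # the unique angle in [0, pi]
print(cos_phi.item(), phi.item())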

Observe that the part in Theorem 11.3 about the negative gradient vector “pointing downhill” is the higher-dimensional version of Observation 11.1 regarding the negative derivative of a single-variable function. This property will be key to the general multi-variable gradient descent algorithm that we will discuss in Section 11.3 below.

Let’s now turn toward extremizers of a multi-variable function \(J:\bbr^n \to \bbr\) of class \(C^2\) in arbitrary dimension \(n\). As you may recall from multi-variable calculus, the stationarity equation

\[ \nabla J(\btheta^\star) = 0 \]

is a necessary condition for \(\btheta^\star\) to be an extremizer of \(J\). As in the single-variable case, one might hope to classify stationary points (i.e., solutions \(\btheta^\star\) to the stationarity equation) as minimizers and maximizers based on the local curvature of \(J\). For if \(J\) is convex (concave) in all directions at \(\btheta^\star\), then intuition suggests that \(\btheta^\star\) should be a local minimizer (maximizer). But from Theorem 11.2, the local directional curvatures at \(\btheta^\star\) are measured by the quadratic form

\[ \bv^\intercal \nabla^2 J(\btheta^\star) \bv \]

as \(\bv\) cycles through all nonzero vectors in \(\bbr^n\). Thus, if these numbers are always positive (negative), then we would expect the stationary point \(\btheta^\star\) is a local minimizer (maximizer). However, to say that these numbers are either always positive or negative means exactly that the Hessian matrix is positive definite or negative definite, in the language of Definition 8.9. So, the question becomes: If we know that the Hessian matrix is positive (negative) definite at a stationary point \(\btheta^\star\), is \(\btheta^\star\) necessarily a local minimizer (maximizer)?

The answer is yes!

Theorem 11.4 (Second Derivative Test)

Let \(J:\bbr^n \to \bbr\) be a function of class \(C^2\) and \(\btheta^\star \in \bbr^n\) a stationary point.

  1. If the Hessian matrix \(\nabla^2 J(\btheta^\star)\) is positive definite, then \(\btheta^\star\) is a local minimizer.

  2. If the Hessian matrix \(\nabla^2 J(\btheta^\star)\) is negative definite, then \(\btheta^\star\) is a local maximizer.

For a proof of this result, see Theorem 13.10 in [Apo74]. Note also that if the Hessian matrix is either positive semidefinite or negative semidefinite everywhere, then every stationary point is a global extremizer; see Theorem 11.8 in the appendix.

Problem Prompt

Do problem 8 on the worksheet.

With infinitely many local (directional) curvatures at a point on the graph of a function \(J:\bbr^n\to \bbr\), it will be convenient to obtain a single number that attempts to summarize the complexity of the local curvature. The first step toward obtaining such a summary is given in the following:

Theorem 11.5 (Eigenvalues, eigenvectors, and local curvature)

Let \(J:\bbr^n \to \bbr\) be a function of class \(C^2\) and \(\btheta \in \bbr^n\) a point with positive definite Hessian matrix \(\nabla^2 J(\btheta)\). Suppose we linearly order the eigenvalues of the Hessian matrix as

(11.8)#\[ 0 < \lambda_1 \leq \lambda_2 \leq \cdots \leq \lambda_n. \]

Then:

  1. The directional curvature \(J''_\bv(\btheta)\) is maximized exactly when \(\bv\) lies in the eigenspace of \(\lambda_n\), in which case \(J''_\bv(\btheta) = \lambda_n\).

  2. The directional curvature \(J''_\bv(\btheta)\) is minimized exactly when \(\bv\) lies in the eigenspace of \(\lambda_1\), in which case \(J''_\bv(\btheta) = \lambda_1\).

Proof. Let \(\be_1,\ldots,\be_n\) be the associated orthonormal basis of eigenvectors with

\[ \nabla^2 J(\btheta) \be_i = \lambda_i\be_i \]

for each \(i\). Given a unit vector \(\bv\), there are unique scalars \(\alpha_1,\ldots,\alpha_n\) such that

\[ \bv = \alpha_1 \be_1 + \cdots + \alpha_n \be_n \]

and

(11.9)#\[ \alpha_1^2 + \cdots + \alpha_n^2 =1. \]

But then

\[ J''_\bv(\btheta) = \bv^\intercal \nabla^2 J(\btheta) \bv = \sum_{i,j=1}^n \alpha_i\alpha_j \be_i^\intercal \nabla^2 J(\btheta) \be_j = \sum_{i,j=1}^n \alpha_i\alpha_j\lambda_j \be_i^\intercal \be_j = \sum_{i=1}^n \alpha_i^2 \lambda_i, \]

where the first equality follows from Theorem 11.2. Using (11.9), eliminate \(\alpha_n^2\) from the last sum in favor of the other \(\alpha\)’s to get

\[ J''_\bv(\btheta) = \sum_{i=1}^{n-1}(\lambda_i - \lambda_n)\alpha_i^2 + \lambda_n. \]

Letting \(m\) be the smallest integer such that

\[ \lambda_{m-1} < \lambda_m = \lambda_{m+1} = \cdots = \lambda _n, \]

we have

\[ J''_\bv(\btheta) = \sum_{i=1}^{m-1}(\lambda_i - \lambda_n)\alpha_i^2 + \lambda_n. \]

(If \(m=1\), then we interpret this expression as \(J''_\bv(\btheta) = \lambda_n\).) But \(\lambda_i -\lambda_n < 0\) for each \(i=1,\ldots,m-1\), and so \(J''_\bv(\btheta)\) is clearly maximized when

\[ \alpha_1 = \cdots = \alpha_{m-1} = 0, \]

which implies that \(\bv\) lies in the eigenspace of \(\lambda_n\). This establishes the claim in the first statement, and the one in the second follows from the same type of argument with the obvious changes. Q.E.D.
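A quick numerical check of the theorem, using an illustrative positive definite matrix in place of a Hessian:

import torch

A = torch.tensor([[22.0, 0.0],
                  [0.0, 220.0]])                   # illustrative positive definite matrix

eigvals, eigvecs = torch.linalg.eigh(A)            # ascending eigenvalues, orthonormal eigenvectors
v = torch.randn(2)
v = v / v.norm()                                   # a random unit direction
print(eigvals[0].item(), '<=', (v @ A @ v).item(), '<=', eigvals[1].item())
print((eigvecs[:, 1] @ A @ eigvecs[:, 1]).item())  # maximizing direction recovers the top eigenvalue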

Problem Prompt

Do problem 9 on the worksheet.

If the Hessian matrix is positive definite, then its extreme eigenvalues are exactly the extreme local (directional) curvatures. The ratio of the largest curvature to the smallest should then convey the “variance” or the “range” of these curvatures. This ratio has a name:

Definition 11.4

Let \(\bA\) be an \(n\times n\) square matrix.

  1. The spectrum of \(\bA\), denoted \(\sigma(\bA)\), is the set of eigenvalues of \(\bA\).

  2. The spectral radius of \(\bA\), denoted \(\rho(\bA)\), is given by

    \[ \rho(\bA) \def \max_{\lambda \in \sigma(\bA)} |\lambda|. \]
  3. If \(\bA\) is positive definite, the condition number of \(\bA\), denoted \(\kappa(\bA)\), is the ratio

    \[ \kappa(\bA) \def \frac{\lambda_\text{max}}{\lambda_\text{min}} \]

    of the largest eigenvalue of \(\bA\) to the smallest.

This definition of condition number applies only in the case that \(\bA\) is positive definite and hence all its eigenvalues are positive. In the general case, the definition needs to be altered; see here, for example.
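These quantities are easy to compute numerically for a symmetric matrix; a brief sketch with an illustrative positive definite matrix:

import torch

A = torch.tensor([[2.0, 1.0],
                  [1.0, 3.0]])                     # illustrative positive definite matrix

spectrum = torch.linalg.eigvalsh(A)                # eigenvalues in ascending order
spectral_radius = spectrum.abs().max()             # rho(A)
condition_number = spectrum.max() / spectrum.min() # kappa(A), valid since A is positive definite
print(spectrum, spectral_radius.item(), condition_number.item())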

Problem Prompt

Do problem 10 on the worksheet.

Intuitively, when the condition number of a positive definite Hessian matrix is large (in which case the Hessian matrix is called ill-conditioned), the curvatures vary widely as we look in all different directions; conversely, when the condition number is near \(1\), the directional curvatures are all nearly the same. As we will see in the next section, an ill-conditioned Hessian matrix inflates an important upper bound on the distance to the minimizer after \(t\) gradient steps. In other words, ill-conditioned Hessian matrices may signal slow convergence of gradient descent.

11.3. Gradient descent in multiple variables#

With the gradient vector taking the place of the derivative, it is easy to generalize the single-variable gradient descent algorithm from Algorithm 11.2 to multiple variables:

Algorithm 11.3 (Multi-variable gradient descent with learning rate decay)

Input: A differentiable function \(J:\mathbb{R}^n\to \mathbb{R}\), an initial guess \(\btheta_0\in \mathbb{R}^n\) for a local minimizer \(\btheta^\star\), a learning rate \(\alpha>0\), a decay rate \(\beta \in [0, 1)\), and the number \(N\) of gradient steps.

Output: An approximation to a local minimizer \(\btheta^\star\).


   \(\btheta := \btheta_0\)
   For \(t\) from \(0\) to \(N-1\), do:
           \(\btheta := \btheta - \alpha(1-\beta)^{t+1} \nabla J(\btheta)\)
   Return \(\btheta\).

Just like the single-variable version, in practice it is convenient to track the entire sequence of \(\btheta\)’s produced by the algorithm:

\[ \btheta_0,\btheta_1, \ldots,\btheta_t,\ldots,\btheta_N. \]
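As with the single-variable sketches earlier in the chapter, the update is a short loop once the gradient is available; here is a minimal sketch that uses PyTorch's autograd to supply \(\nabla J(\btheta)\) (again, this is only an illustration, not the GD helper imported above):

import torch

def gradient_descent_multi(J, theta0, alpha, beta, num_steps):
    # a minimal sketch of Algorithm 11.3 with the gradient computed by autograd
    theta = theta0.clone()
    thetas = [theta.clone()]
    for t in range(num_steps):
        theta.requires_grad_(True)
        J(theta).backward()                        # populates theta.grad with the gradient
        with torch.no_grad():
            theta = theta - alpha * (1 - beta) ** (t + 1) * theta.grad
        thetas.append(theta.clone())
    return thetas

# example usage with an illustrative quadratic bowl
J_bowl = lambda theta: theta[0] ** 2 + 10 * theta[1] ** 2
path = gradient_descent_multi(J_bowl, torch.tensor([1.0, 1.0]), alpha=0.05, beta=0.0, num_steps=25)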

For an example, let’s consider the polynomial objective function

(11.10)#\[ J:\bbr^2 \to \bbr, \quad J(\btheta) = J(\theta_1,\theta_2) = (\theta_1^2 + 10 \theta_2^2)\big((\theta_1-1)^2 + 10(\theta_2-1)^2 \big) \]

in two dimensions. Its graph looks like

../_images/objective-plot.png

 

while its contour plot is

Hide code cell source
def J(theta):
   theta1 = theta[0]
   theta2 = theta[1]
   return (theta1 ** 2 + 10 * theta2 ** 2) * ((theta1 - 1) ** 2 + 10 * (theta2 - 1) ** 2)

x, y = np.mgrid[-0.50:1.5:0.01, -0.3:1.3:0.01]
grid = np.dstack((x, y))
z = np.apply_along_axis(J, axis=-1, arr=grid)

plt.contour(x, y, z, levels=range(11), colors=blue)
plt.xlabel('$\\theta_1$')
plt.ylabel('$\\theta_2$')
plt.gcf().set_size_inches(w=5, h=4)
plt.tight_layout()
../_images/0449ada59b9f0b98edba447e8b126d9fcb0f05493251bdc2d39930e422eabde8.svg

The function has two minimizers at

\[ \btheta^\star = (0, 0), (1,1), \]

as well as a “saddle point” at \((0.5, 0.5)\) where the gradient \(\nabla J(\btheta)\) vanishes. Let’s run the gradient descent algorithm four times, beginning with no learning rate decay, and track the approximations \(\btheta_t\) in \(\mathbb{R}^2\) plotted over the contours of \(J(\btheta)\):

Hide code cell source
gd_parameters = {'init_parameters': [torch.tensor([0.25, 0.9]),
                                     torch.tensor([0.25, 1]),
                                     torch.tensor([0.75, 1.2]),
                                     torch.tensor([0.5, 0.49])]}
alpha = 1e-2
beta = 0
N = 20

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9))

for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(J=J,
                   lr=alpha,
                   num_steps=N,
                   decay_rate=beta,
                   **gd_parameters_slice)
    thetas = gd_output.parameters['theta']
        
    axis.contour(x, y, z, levels=range(11), colors=blue, alpha=0.5)
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(x=thetas[0, 0], y=thetas[0, 1], s=100, color=magenta, zorder=2)
    
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
    axis.set_title(f'$\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
fig.suptitle('first runs of gradient descent')
plt.tight_layout()
../_images/1ef80bbc1870cf8f9030121e65b5d898ea71d182fb4220251f2da4f06a0fc142.svg

The large magenta dots in the plots indicate the initial guesses \(\btheta_0\), while the smaller dots indicate the approximations \(\btheta_t\) for \(t>0\). The algorithm appears to be converging nicely to the minimizer \(\btheta^\star = (1,1)\) in the upper-left plot, while in the other three plots, the algorithm finds a neighborhood of a minimizer, but then oscillates back and forth and never appears to settle down. This is due jointly to the elliptical (non-circular) shape of the contours and poorly chosen learning rates.

Let’s carry the runs out further, to \(N=125\) gradient steps, and plot the objective values versus gradient steps:

Hide code cell source
gd_parameters = {'init_parameters': [torch.tensor([0.25, 0.9]),
                                     torch.tensor([0.25, 1]),
                                     torch.tensor([0.75, 1.2]),
                                     torch.tensor([0.5, 0.49])]}
alpha = 1e-2
beta = 0
N = 125

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 6), sharey=True)

for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(J=J,
                   lr=alpha,
                   num_steps=N,
                   decay_rate=beta,
                   **gd_parameters_slice)
    plot_gd(gd_output=gd_output, ylabel='objective', ax=axis, plot_title=False, per_step_alpha=1)
fig.suptitle('first runs of gradient descent')
plt.tight_layout()
../_images/d189c20be11ef809ca37df75ce00e92c7196108f46732a19423ed901dc8bd95d.svg

The oscillatory nature of the runs is even more clear. In particular, we now see that even the first run, which appeared to be converging, begins to oscillate after about 40 gradient steps. Let’s take a closer look at the trace of this first run, and also plot the components \(\theta_1\) and \(\theta_2\) versus gradient steps:

Hide code cell source
theta0 = torch.tensor([0.25, 0.9])
alpha = 1e-2
beta = 0
N = 125
gd_output = GD(J=J, init_parameters=theta0, lr=alpha, num_steps=N, decay_rate=beta)
thetas = gd_output.parameters['theta']

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))

axes[0].contour(x, y, z, levels=range(11), colors=blue, alpha=0.5)
axes[0].plot(thetas[:, 0], thetas[:, 1], color=magenta)
axes[0].scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
axes[0].scatter(x=thetas[0, 0], y=thetas[0, 1], s=100, color=magenta, zorder=2)
axes[0].set_xlabel('$\\theta_1$')
axes[0].set_ylabel('$\\theta_2$')
axes[0].set_xlim(0.2, 1.2)
axes[0].set_ylim(0.8, 1.15)

axes[1].plot(range(len(gd_output.per_step_objectives)), thetas[:, 0], label='$\\theta_1$')
axes[1].plot(range(len(gd_output.per_step_objectives)), thetas[:, 1], alpha=0.3, label='$\\theta_2$')
axes[1].set_xlabel('gradient steps')
axes[1].set_ylim(0.8, 1.1)
axes[1].legend()

fig.suptitle(f'$\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
plt.tight_layout()

../_images/4d0552c702a651e42d33358c3f258b160ce83b2a591e67bb6eb7410322c6bd9c.svg

Both plots make clear that the oscillations occur mostly along the semi-minor axes of the elliptical contours, which coincides with the direction of largest curvature. From the previous section, we know that local curvatures are encoded in the Hessian matrix. This suggests that studying the Hessian matrix might lead to insights into the convergence properties of gradient descent. But first, to prepare for this general study, let’s do:

Problem Prompt

Do problem 11 on the worksheet.

To begin the theoretical study of convergence of gradient descent, let’s start more generally with any function \(J:\bbr^2 \to \bbr\) of class \(C^2\) and \(\btheta^\star\) a point. We then take a degree-\(2\) Taylor polynomial approximation centered at \(\btheta^\star\):

\[ J(\btheta) \approx J(\btheta^\star) + (\btheta - \btheta^\star)^\intercal \nabla J(\btheta^\star) + \frac{1}{2} (\btheta - \btheta^\star)^\intercal \nabla^2 J(\btheta^\star) (\btheta - \btheta^\star). \]

If we believe that the Taylor polynomial accurately reflects the local geometry of the graph of \(J\) near \(\btheta^\star\) to within whatever degree of approximation we require, then we may as well replace \(J\) with its Taylor polynomial, and thereby assume that \(J\) is a degree-\(2\) (inhomogeneous) polynomial:

\[ J(\btheta) = \frac{1}{2}\btheta^\intercal \bH \btheta + \bb^\intercal \btheta + c, \]

where \(\bH \in \bbr^{2\times 2}\) is a symmetric matrix, \(\bb\in \bbr^2\) is a vector, and \(c\in \bbr\) is a scalar. As you may easily compute (see the homework), the gradient vector and Hessian matrix are given by

\[ \nabla J(\btheta) = \bH \btheta + \bb \quad \text{and} \quad \nabla^2 J(\btheta) = \bH. \]
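A quick autograd check of these formulas for an illustrative choice of \(\bH\), \(\bb\), and \(c\) (the specific numbers are arbitrary):

import torch
from torch.autograd.functional import jacobian, hessian

H = torch.tensor([[3.0, 1.0],
                  [1.0, 2.0]])                     # a symmetric matrix
b = torch.tensor([1.0, -1.0])
c = 0.5

def J_quad(theta):
    return 0.5 * theta @ H @ theta + b @ theta + c

theta = torch.tensor([0.7, -0.2])
print(torch.allclose(jacobian(J_quad, theta), H @ theta + b))   # gradient is H theta + b
print(torch.allclose(hessian(J_quad, theta), H))                # Hessian is H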

Assuming that the decay rate is \(\beta=0\) while the learning rate \(\alpha\) is arbitrary, the update rule in the algorithm is given by

\[ \btheta_{t+1} = \btheta_t - \alpha(\bH\btheta_t + \bb), \]

for all \(t\geq 0\). If \(\btheta^\star\) is a stationary point (like a local minimizer), then \(\bH\btheta^\star + \bb = \nabla J(\btheta^\star) = 0\), and we may rewrite the update rule as

\[ \btheta_{t+1} - \btheta^\star = (\bI - \alpha \bH)(\btheta_t - \btheta^\star), \]

which leads us to the closed form

(11.11)#\[ \btheta_t - \btheta^\star = (\bI - \alpha \bH)^t (\btheta_0 - \btheta^\star), \]

for all \(t\geq 1\).

Now, let’s suppose that \(\btheta^\star\) is a local minimizer of \(J\) with positive definite Hessian matrix \(\bH\). We are in two dimensions, so \(\bH\) has only two eigenvalues:

\[ 0 < \lambda_1 \leq \lambda_2. \]

Then the eigenvalues of the matrix \(\bI - \alpha \bH\) are \(1 - \alpha \lambda_1\) and \(1-\alpha \lambda_2\), and they may be linearly ordered as

(11.12)#\[ 1 - \alpha \lambda_2 \leq 1 - \alpha \lambda_1. \]

The eigenvectors \(\be_1,\be_2\) of \(\bI - \alpha\bH\) are the same as those of \(\bH\) (which may be chosen to form an orthonormal basis of \(\bbr^2\)), so if we have

\[ \btheta_0 - \btheta^\star = \gamma_1 \be_1 + \gamma_2 \be_2 \]

for some scalars \(\gamma_1,\gamma_2\in \bbr\), then from (11.11) we get

(11.13)#\[ \btheta_t - \btheta^\star = \gamma_1 (1-\alpha \lambda_1)^t \be_1 + \gamma_2 (1-\alpha \lambda_2)^t \be_2 \]

for all \(t\geq 1\).

For the particular run of the algorithm shown in the last plot, we have

\[ \be_1^\intercal = (1,0), \quad \be_2^\intercal = (0,1), \quad \alpha = 0.01, \quad \lambda_1 = 22, \quad \lambda_2 = 220, \]

so that (11.13) is

\[ \btheta_t - \btheta^\star = \gamma_1 (0.78)^t \be_1 + \gamma_2 (-1.2)^t \be_2. \]

Since \(|0.78|< 1\) while \(|-1.2| >1\), this last equation shows why the algorithm diverges along the direction \(\be_2\) of maximum curvature, while converging along the direction \(\be_1\) of minimum curvature. The trick, then, is to choose the learning rate \(\alpha\) so that both eigenvalues in (11.12) have magnitude less than \(1\).

To do this, we use the orthonormality of \(\be_1\) and \(\be_2\) and apply the Pythagorean theorem to the right-hand side of (11.13) to obtain the upper bound

(11.14)#\[ |\btheta_t - \btheta^\star| \leq \rho\left(\bI - \alpha \bH\right)^t |\btheta_0 - \btheta^\star |, \]

for all \(t\geq 1\), where

\[ \rho\left(\bI - \alpha \bH\right) = \max \{ |1-\alpha \lambda_1|, |1-\alpha \lambda_2|\} \]

is the spectral radius of the matrix \(\bI - \alpha \bH\). To guarantee fastest convergence, we want to choose the learning rate \(\alpha\) that minimizes the spectral radius. But an easy computation shows that this optimal learning rate is given by

\[ \alpha = \frac{2}{\lambda_1 + \lambda_2}, \]

in which case we get

\[ \rho\left( \bI - \alpha \bH \right) = |1-\alpha \lambda_1| = |1-\alpha \lambda_2|. \]

Then, with this choice of \(\alpha\), we have

(11.15)#\[ |\btheta_t - \btheta^\star| \leq \left(1 - \alpha \lambda_1\right)^t |\btheta_0 - \btheta^\star | = \left( 1 - \frac{2}{1 + \kappa(\bH)} \right)^t |\btheta_0 - \btheta^\star |, \]

where \(\kappa(\bH)\) is the condition number of the Hessian matrix \(\bH\).

This shows that the fastest rates of convergence guaranteed by our arguments are obtained for those objective functions that have Hessian matrices at local minimizers with condition numbers near \(1\). In the extreme case that \(\kappa(\bH)=1\), notice that the bound shows \(\btheta_1 = \btheta^\star\), i.e., the algorithm lands on the local minimizer \(\btheta^\star\) in just one gradient step. The inequality (11.15) is tight, by which we mean that there are worst-case initial guesses \(\btheta_0\) for which equality holds. This is the case, for example, if the difference \(\btheta_0 - \btheta^\star\) is equal to the eigenvector corresponding to \(\lambda_1\).

This argument easily adapts to the case that the objective function \(J:\bbr^n \to \bbr\) is defined on a Euclidean space of arbitrary dimension. We present the general result in a theorem:

Theorem 11.6 (Curvature and gradient descent)

Let \(J:\bbr^n \to \bbr\) be a function of class \(C^2\) and \(\btheta^\star\) a local minimizer with positive definite Hessian matrix \(\bH = \nabla^2 J(\btheta^\star)\), whose minimum and maximum eigenvalues are \(\lambda_\text{min}\) and \(\lambda_\text{max}\). For initial guesses \(\btheta_0\) sufficiently near \(\btheta^\star\) to allow a degree-\(2\) Taylor polynomial approximation, the gradient descent algorithm with learning rate

\[ \alpha = \frac{2}{\lambda_\text{min} + \lambda_\text{max}} \quad \text{and} \quad \beta=0 \]

converges to \(\btheta^\star\) exponentially fast, with

\[ |\btheta_t - \btheta^\star| \leq \left( 1 - \frac{2}{1 + \kappa(\bH)} \right)^t |\btheta_0 - \btheta^\star | \]

for each \(t\geq 1\). Here, \(\kappa(\bH)\) is the condition number of \(\bH\).

The ratio \(2/(1 + \kappa(\bH))\) is sometimes called the rate of convergence. In the special case that \(n=2\), so that the graph of our function \(J:\bbr^2 \to \bbr\) is a surface in \(\bbr^3\), the value \((\lambda_\text{min} + \lambda_\text{max})/2\) is significant in differential geometry, called the mean curvature of the surface. Thus, in this case and using this language, the optimal value for the learning rate is given by the reciprocal mean curvature of the graph at the local minimum. Larger mean curvatures require smaller learning rates, while smaller mean curvatures allow larger learning rates.

Of course, in order to obtain the exponentially quick convergence guaranteed by the theorem, one needs to place the initial guess \(\btheta_0\) “sufficiently close” to the minimizer. But this requires the analyst to already have some sense of where the minimizer is likely to be located, which limits the theorem's usefulness in practice.

For our polynomial objective \(J\) given in (11.10) above, we compute the spectra of the Hessian matrices at the minimizers \((0,0)\) and \((1,1)\) to be \(\{22,220\}\) in both cases. Thus, if we choose learning rate \(\alpha = 2/(22+220) \approx 0.008\) and re-run the gradient descent algorithm with the same initial guesses to test Theorem 11.6, we get the new plots:

Hide code cell source
gd_parameters = {'init_parameters': [torch.tensor([0.25, 0.9]),
                                     torch.tensor([0.25, 1]),
                                     torch.tensor([0.75, 1.2]),
                                     torch.tensor([0.5, 0.49])]}
alpha = 8e-3
beta = 0
N = 40

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9))

for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(J=J,
                   lr=alpha,
                   num_steps=N,
                   decay_rate=beta,
                   **gd_parameters_slice)
    thetas = gd_output.parameters['theta']
    
    axis.contour(x, y, z, levels=range(11), colors=blue, alpha=0.5)
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(x=thetas[0, 0], y=thetas[0, 1], s=100, color=magenta, zorder=2)
    
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
    axis.set_title(f'$\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
fig.suptitle('second runs of gradient descent with smaller learning rates')
plt.tight_layout()
../_images/b06e33a133663a960e586cf32d08f3fd5060762a56cc7b90970f934ef623b2f6.svg

Just like magic, the undesirable oscillations have vanished. We take a closer look at the third run of the algorithm, in the bottom left of the figure:

Hide code cell source
theta0 = torch.tensor([0.75, 1.2])
alpha = 8e-3
beta = 0
N = 125
gd_output = GD(J=J, init_parameters=theta0, lr=alpha, num_steps=N, decay_rate=beta)
thetas = gd_output.parameters['theta']

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(9, 4))

axes[0].contour(x, y, z, levels=range(11), colors=blue, alpha=0.5)
axes[0].plot(thetas[:, 0], thetas[:, 1], color=magenta)
axes[0].scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
axes[0].scatter(x=thetas[0, 0], y=thetas[0, 1], s=100, color=magenta, zorder=2)
axes[0].set_xlabel('$\\theta_1$')
axes[0].set_ylabel('$\\theta_2$')
axes[0].set_xlim(0.2, 1.2)
axes[0].set_ylim(0.6, 1.25)

axes[1].plot(range(len(gd_output.per_step_objectives)), thetas[:, 0], label='$\\theta_1$')
axes[1].plot(range(len(gd_output.per_step_objectives)), thetas[:, 1], alpha=0.5, label='$\\theta_2$')
axes[1].set_xlabel('gradient steps')
axes[1].set_ylim(0.9, 1.1)
axes[1].legend()

fig.suptitle(f'$\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
plt.tight_layout()
../_images/c95908dc26e6108d2a7d873ccecf3d8d0c88a4113dff1157238cf36c074cbfe9.svg

We may also dampen the oscillations and keep the original (relatively large) learning rates by adding a slight learning rate decay at \(\beta = 0.05\):

Hide code cell source
gd_parameters = {'init_parameters': [torch.tensor([0.25, 0.9]),
                                     torch.tensor([0.25, 1]),
                                     torch.tensor([0.75, 1.2]),
                                     torch.tensor([0.5, 0.49])]}
alpha = 1e-2
N = 40
beta = 0.05

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9))

for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(J=J,
                   lr=alpha,
                   num_steps=N,
                   decay_rate=beta,
                   **gd_parameters_slice)
    thetas = gd_output.parameters['theta']

    axis.contour(x, y, z, levels=range(11), colors=blue, alpha=0.5)
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(x=thetas[0, 0], y=thetas[0, 1], s=100, color=magenta, zorder=2)
    
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
    axis.set_title(f'$\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
fig.suptitle('third runs of gradient descent with learning rate decay')
plt.tight_layout()
../_images/2c954ba6f5adc6fc0136296da500c412bc77a59191713d3ab64575bc232e4cd1.svg

Here are the values of the objective function for these last runs with learning rate decay, plotted against the number of gradient steps:

Hide code cell source
gd_parameters = {'init_parameters': [torch.tensor([0.25, 0.9]),
                                     torch.tensor([0.25, 1]),
                                     torch.tensor([0.75, 1.2]),
                                     torch.tensor([0.5, 0.49])]}
alpha = 1e-2
N = 40
beta = 0.05

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 6), sharey=True)

for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(J=J,
                   lr=alpha,
                   num_steps=N,
                   decay_rate=beta,
                   **gd_parameters_slice)
    plot_gd(gd_output=gd_output, ylabel='objective', ax=axis, plot_title=False, per_step_alpha=1)
fig.suptitle('third runs of gradient descent with learning rate decay')
plt.tight_layout()
../_images/1ee082df545489a47e81501f9eb45d7c52669c2980b279bb33ebe333cd7f63ae.svg

Notice the initial “overshoot” in the plot in the bottom left, causing the objective function \(J(\btheta)\) to increase after the first gradient step. Recall also that the initial value \(\btheta_0\) in the bottom right plot is near the saddle point \((0.5,0.5)\), causing \(\nabla J(\btheta_0) \approx 0\). This accounts for the small initial changes in the objective function \(J(\btheta)\) indicated by the (nearly) horizontal stretch early in the run of the algorithm.

Of course, an objective function \(J:\mathbb{R}^2 \to \mathbb{R}\) defined on a \(2\)-dimensional input space is still not a realistic example of the objective functions encountered in the real world. In two dimensions, we have the ability to trace the algorithm’s progress through \(\mathbb{R}^2\) on contour plots, as we did multiple times above. In higher dimensions we lose this valuable visual aid. But no matter the input dimension, we may always plot the objective values against the number of gradient steps as a diagnostic plot for convergence.

11.4. Stochastic gradient descent#

Many of the objective functions that we will see in Chapter 13 have a special form making them amenable to optimization via a variation of the gradient descent algorithm called stochastic gradient descent. These special objective functions are called stochastic objective functions, which look like

(11.16)#\[J(\btheta) = E_{\bx \sim p(\bx)}\big[ L(\btheta, \bx) \big] = \sum_{\bx\in \mathbb{R}^n} L(\btheta,\bx)p(\bx),\]

where \(p(\bx)\) is a probability mass function, \(\btheta \in \mathbb{R}^k\) is a parameter vector, and \(L:\mathbb{R}^{n+k} \to \mathbb{R}\) is a differentiable function called the loss function. Very often, the mass function will be an empirical mass function of an observed multivariate dataset

\[ \bx_1,\bx_2,\ldots,\bx_m \in \mathbb{R}^n, \]

so that

(11.17)#\[ J(\btheta) = \frac{1}{m} \sum_{i=1}^m L\big(\btheta, \bx_i\big). \]

By linearity of the gradient operation, we have

(11.18)#\[ \nabla_\btheta J(\btheta) = \frac{1}{m} \sum_{i=1}^m \nabla_\btheta L\big(\btheta, \bx_i\big) \]

where we write \(\nabla_\btheta\) instead of just \(\nabla\) to emphasize that the gradient of the loss function \(L\) is computed with respect to the parameter vector \(\btheta\). In this context, the gradient descent algorithm applied to the objective function (11.17) is given a new name:

Definition 11.5

The batch gradient descent algorithm is the gradient descent algorithm applied to a stochastic objective function of the form (11.17).
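Under the hood, a single step of batch gradient descent just evaluates the mean loss (11.17) over the whole dataset and differentiates it with respect to \(\btheta\), as in (11.18). Here is a minimal sketch of one such step in plain PyTorch; it is not the `GD` routine from `math_stats_ml.gd` used in the code cells below, and it assumes the loss `L` is vectorized over the rows of the data matrix `X`:

```python
import torch

def batch_gd_step(L, X, theta, lr):
    """One step of batch gradient descent on the objective (11.17).

    Assumes X is an (m, n) tensor of data points and that L(theta, X) returns
    the length-m tensor of per-example losses."""
    theta = theta.detach().clone().requires_grad_(True)
    objective = L(theta, X).mean()       # J(theta) = (1/m) sum_i L(theta, x_i)
    objective.backward()                 # autograd computes the gradient (11.18)
    with torch.no_grad():
        return theta - lr * theta.grad   # gradient step with learning rate lr
```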

Let’s take a look at a simple example. Suppose that we define the loss function

(11.19)#\[L: \bbr^4 \to \bbr, \quad L(\btheta, \bx) = \frac{1}{2}|\bx - \btheta|^2,\]

where \(\bx,\btheta\in \bbr^2\). We create a bivariate dataset by drawing a random sample of size \(1{,}024\) from a \(\mathcal{N}_2(\boldsymbol0,I)\) distribution. A scatter plot of the dataset looks like this:

Hide code cell source
torch.manual_seed(42)
X = MultivariateNormal(loc=torch.zeros(2), covariance_matrix=torch.eye(2)).sample(sample_shape=(1024,))

sns.scatterplot(x=X[:, 0], y=X[:, 1])
plt.xlabel('$x_1$')
plt.ylabel('$x_2$')
plt.gcf().set_size_inches(w=5, h=3)
plt.tight_layout()
../_images/6e115cc158337701848de605526369b907a07f696a89b9a2c4ac4c1991688666.svg

Before running the algorithm, let’s do:

Problem Prompt

Do problem 12 on the worksheet.

Now, two runs of the batch gradient descent algorithm produce the following plots of the objective function versus gradient steps:

Hide code cell source
# define the loss function
def L(theta, x):
    return 0.5 * torch.linalg.norm(x - theta, dim=1) ** 2

# define the objective function
def J(theta):
    return L(theta, X).mean()

gd_parameters = {'num_steps': [30, 100],
                 'lr': [1e-1, 3e-2]}
theta0 = torch.tensor([1.5, 1.5])
beta = 0

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(7, 3), sharey=True)

for i, axis in enumerate(axes):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(**gd_parameters_slice, J=J, decay_rate=beta, init_parameters=theta0)
    plot_gd(gd_output=gd_output, ylabel='objective', ax=axis, plot_title=False, per_step_alpha=1)
fig.suptitle('batch gradient descent')
plt.tight_layout()
../_images/1235d7be5f3912be4eeb7917386629c56f1cdee2762b9412b65c9947f2f30474.svg

If we track the parameters \(\btheta = (\theta_1,\theta_2)\) during the runs, we get the following:

Hide code cell source
x, y = np.mgrid[-2:2:0.05, -2:2:0.05]
grid = np.dstack((x, y))
z = np.apply_along_axis(J, axis=-1, arr=grid)

fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(9, 4.5), sharey=True)

for i, axis in enumerate(axes):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = GD(**gd_parameters_slice, J=J, decay_rate=beta, init_parameters=theta0)
    thetas = gd_output.parameters['theta']
    
    alpha = gd_parameters_slice['lr']
    N = gd_parameters_slice['num_steps']
    
    axis.contour(x, y, z, colors=blue, alpha=0.5, levels=np.arange(0, 10, 0.5))
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(thetas[0, 0], thetas[0, 1], s=100, color=magenta, zorder=2)
    
    axis.set_title(f'$\\alpha={alpha}$, $\\beta={beta}$, gradient steps$={N}$')
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
fig.suptitle('batch gradient descent')
plt.tight_layout()
../_images/da465f714f35e930278445fa1a0b3f7f3d263a686ef67858515e5bfef6cab163.svg

In both cases, notice that the algorithm is nicely converging toward the minimizer at \(\btheta^\star \approx (0,0)\).

One of the drawbacks of the batch algorithm is that it needs the entire dataset in order to take just a single gradient step. This isn’t an issue for our small toy dataset of size \(m=1{,}024\), but for the large datasets that you may encounter in the real world, this can be a serious hindrance to fast convergence.

One method for dealing with this bottleneck is to use mini-batches of the data to compute gradient steps. To do so, we begin by randomly partitioning the dataset into subsets \(B_1,B_2,\ldots,B_p\) called mini-batches:

(11.20)#\[B_1 \cup B_2 \cup \cdots \cup B_p = \{\bx_1,\bx_2,\ldots,\bx_m\}.\]

We would then expect from (11.17) and (11.18) that

\[ J(\btheta) \approx \frac{1}{|B_j|} \sum_{\bx \in B_j} L\big(\btheta, \bx\big) \]

and

(11.21)#\[ \nabla J(\btheta) \approx \frac{1}{|B_j|} \sum_{\bx \in B_j} \nabla_\btheta L\big(\btheta, \bx\big), \]

for each \(j=1,2,\ldots,p\), where \(|B_j|\) is the cardinality (or size) of the \(j\)-th mini-batch. Very often, the mini-batch sizes are chosen to be equal to a common value \(k\), except (possibly) for one to compensate for the fact that \(m\) may not be evenly divisible by \(k\). For example, if \(m=100\) and \(k=30\), then we would have four mini-batches, three of size \(30\) and the fourth of size \(10\).
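In code, such a random partition is easy to produce by shuffling the row indices and splitting the shuffled dataset into chunks of size \(k\). Here is a minimal sketch (the `SGD` routine used in the code cells below handles this internally):

```python
import torch

def make_minibatches(X, k):
    """Randomly partition the rows of X into mini-batches of size k.

    The last mini-batch may be smaller when the number of rows is not evenly
    divisible by k."""
    perm = torch.randperm(X.shape[0])    # random shuffle of the row indices
    return torch.split(X[perm], k)       # tuple of mini-batches

# e.g., 100 data points with k=30 gives batches of sizes [30, 30, 30, 10]
batches = make_minibatches(torch.randn(100, 2), k=30)
print([len(B) for B in batches])
```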

The mini-batch version of the gradient descent algorithm loops over the mini-batches (11.20) and computes gradient steps using the approximation in (11.21). A single loop through all the mini-batches, covering the entire dataset, is called an epoch. This new version of the algorithm takes the number of epochs \(N\) as a parameter.

Algorithm 11.4 (Stochastic gradient descent with learning rate decay)

Input: A dataset \(\bx_1,\bx_2,\ldots,\bx_m\in \mathbb{R}^n\), a differentiable loss function \(L:\bbr^{n+k} \to \bbr\), an initial guess \(\btheta_0\in \mathbb{R}^k\) for a minimizer \(\btheta^\star\) of the stochastic objective function (11.17), a learning rate \(\alpha>0\), a decay rate \(\beta \in [0, 1)\), and the number \(N\) of epochs.

Output: An approximation to a minimizer \(\btheta^\star\).


   \(\btheta := \btheta_0\)
   \(s := 0\)
   For \(t\) from \(0\) to \(N-1\), do:
           Randomly partition the dataset into mini-batches \(B_1,B_2,\ldots,B_p\)
           For each mini-batch \(B_j\), do:
                   \(\btheta := \btheta - \displaystyle \alpha(1-\beta)^{s+1} \frac{1}{|B_j|} \sum_{\bx \in B_j} \nabla_\btheta L\big(\btheta, \bx\big)\)
                   \(s := s+1\)
   Return \(\btheta\)

Notice the auxiliary variable \(s\), which counts the number of gradient steps and is used to decay the learning rate. As we mentioned above, very often the mini-batches are chosen to be of equal size \(k\), except (possibly) for one mini-batch to compensate for the fact that \(m\) may not be divisible by \(k\). In this case, we pass in the mini-batch size \(k\) as an additional parameter.
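For concreteness, here is a minimal PyTorch sketch of Algorithm 11.4. It is only an illustration of the pseudocode, not the `SGD` routine from `math_stats_ml.gd` used in the code cells below, and it assumes (as before) that the loss `L` is vectorized over the rows of a mini-batch:

```python
import torch

def sgd(L, X, init_parameters, lr, decay_rate=0.0, batch_size=1, num_epochs=1):
    """Minimal sketch of Algorithm 11.4 (stochastic gradient descent).

    Returns the entire trajectory of parameter vectors to help monitor
    convergence."""
    theta = init_parameters.detach().clone()
    thetas = [theta.clone()]
    s = 0                                             # gradient-step counter
    for _ in range(num_epochs):
        perm = torch.randperm(X.shape[0])             # re-partition every epoch
        for B in torch.split(X[perm], batch_size):
            theta = theta.detach().requires_grad_(True)
            loss = L(theta, B).mean()                 # mini-batch estimate of J
            loss.backward()
            with torch.no_grad():
                theta = theta - lr * (1 - decay_rate) ** (s + 1) * theta.grad
            s += 1
            thetas.append(theta.clone())
    return torch.stack(thetas)
```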

As in the versions of gradient descent explored above, in practice it is convenient to code the algorithm so that it returns the entire sequence of \(\btheta\)’s produced, to help monitor convergence. If there are \(p\) mini-batches and \(N\) epochs, then the output of the algorithm may be partitioned like this:

\[\begin{split} \begin{array}{c|c} \text{epoch number} & \text{output} \\ \hline 0 & \btheta_0 \\ 1 & \btheta_1, \btheta_2,\ldots, \btheta_p \\ 2 & \btheta_{p+1}, \btheta_{p+2},\ldots, \btheta_{2p} \\ \vdots & \vdots \\ N & \btheta_{(N-1)p+1}, \btheta_{(N-1)p+2},\ldots, \btheta_{Np} \end{array} \end{split}\]

This leads to the following sequences of objective values:

\[\begin{split} \begin{array}{c|c} \text{epoch number} & \text{objectives} \\ \hline 0 & J(\btheta_0) \\ 1 & J(\btheta_1), J(\btheta_2),\ldots, J(\btheta_p) \\ 2 & J(\btheta_{p+1}), J(\btheta_{p+2}),\ldots, J(\btheta_{2p}) \\ \vdots & \vdots \\ N & J(\btheta_{(N-1)p+1}), J(\btheta_{(N-1)p+2}),\ldots, J(\btheta_{Np}) \end{array} \end{split}\]

However, rather than have the algorithm output the exact objective values \(J(\btheta)\), we will have it output the approximate objective values obtained as realizations of the right-hand side of

\[ J(\btheta) \approx \frac{1}{|B|} \sum_{\bx \in B} L(\btheta, \bx), \]

where \(B\) is a mini-batch of data. In order to monitor convergence, we may plot these per-step (approximate) objective values versus gradient steps, as we have done in the previous two sections. But, depending on the mini-batch size, these plots may be quite noisy, so sometimes to get a better sense of the trend it is convenient to also track the mean (approximate) objective values per epoch. We will see examples below.
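Computing the per-epoch means from the recorded per-step values is straightforward. A minimal sketch, assuming the per-step values are stored in a one-dimensional tensor and that every epoch consists of exactly \(p\) gradient steps:

```python
import torch

def per_epoch_means(per_step_objectives, p):
    """Mean of the recorded per-step objective values within each epoch,
    assuming exactly p gradient steps per epoch."""
    return per_step_objectives.reshape(-1, p).mean(dim=1)
```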

But first, we note that it is possible to select a mini-batch size of \(k=1\), so that the algorithm computes a gradient step per data point. (Some references refer to this algorithm as stochastic gradient descent.) In our example (11.19) from above, a batch size of \(k=1\) yields the following plots of objective values versus gradient steps:

Hide code cell source
gd_parameters = {'lr': [1e-1, 1e-1, 3e-2, 3e-2],
                 'max_steps': [40, 100, 80, 160]}
beta = 0
theta0 = torch.tensor([1.5, 1.5])
k = 1
N = 1

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 6), sharey=True)

for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = SGD(**gd_parameters_slice,
                    L=L,
                    X=X,
                    init_parameters=theta0,
                    batch_size=k,
                    decay_rate=beta,
                    num_epochs=N,
                    random_state=42)
    plot_gd(gd_output, plot_title=False, ax=axis, show_epoch=False, per_step_alpha=1, legend=False, ylabel='objective')
fig.suptitle(f'stochastic gradient descent with $k=1$')
plt.tight_layout()
../_images/ba0f143f8fe02347e42ca66774c8a0ccca3a56834aeda1ecd3bbc9706d314f29.svg

Note that we have halted the runs early, before the algorithm has had a chance to make it through even one epoch. The plots are very noisy, though a slight downward trend in objective values is detectable. The trace of the algorithm through parameter space is shown in:

Hide code cell source
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9))

for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = SGD(**gd_parameters_slice,
                    L=L,
                    X=X,
                    init_parameters=theta0,
                    batch_size=k,
                    decay_rate=beta,
                    num_epochs=N,
                    random_state=42)
    
    alpha = gd_parameters_slice['lr']
    max_steps = gd_parameters_slice['max_steps']
    thetas = gd_output.parameters['theta']
    
    axis.contour(x, y, z, levels=np.arange(0, 10, 0.5), colors=blue, alpha=0.5)
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(thetas[0, 0], thetas[0, 1], s=100, color=magenta, zorder=2)
    
    axis.set_title(f'$k={k}$, $\\alpha={alpha}$, $\\beta={beta}$, $N={N}$,\n gradient steps$={max_steps}$')
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
fig.suptitle(f'stochastic gradient descent with $k=1$')
plt.tight_layout()
../_images/e392fb97d6ed460b724ceb2356fb833b2b076f6221f9f13af465cf4e3cbc4236.svg

The traces are also very noisy, especially in the first row with the large learning rate \(\alpha=0.1\). Nevertheless, it is clear that the algorithm has found the neighborhood of the minimizer at \((0,0)\). We might try to tame the noise in these plots by increasing the decay rate, but it turns out that increasing the batch size is the better way to do this. So, let’s try four larger batch sizes:

Hide code cell source
gd_parameters = {'num_epochs': [1, 1, 1, 3],
                 'batch_size': [2, 8, 32, 128],
                 'max_steps': [60, 30, 24, 24]}
alpha = 1e-1
beta = 0
theta0 = torch.tensor([1.5, 1.5])

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 6), sharey=True)

for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = SGD(**gd_parameters_slice,
                    L=L,
                    X=X,
                    lr=alpha,
                    init_parameters=theta0,
                    decay_rate=beta,
                    random_state=42)
    
    k = gd_parameters_slice['batch_size']
    N = gd_parameters_slice['num_epochs']
    
    plot_gd(gd_output, plot_title=False, ax=axis, show_epoch=False, per_step_alpha=1, legend=False, ylabel='objective', per_step_label='objective per step')
    #axis.plot(range(len(gd_output.per_step_objectives)), gd_output.per_step_objectives)
    axis.scatter(gd_output.epoch_step_nums, gd_output.per_step_objectives[gd_output.epoch_step_nums], color=magenta, s=50, zorder=2, label='epoch')
    axis.set_xlabel('gradient steps')
    axis.set_ylabel('objective')
    axis.set_title(f'$k={k}$, $\\alpha={alpha}$, $\\beta={beta}$, $N={N}$')
    axis.legend()
fig.suptitle('mini-batch gradient descent')
plt.tight_layout()
../_images/94f12919748442b97b6fea8f11e673e8082c0eb90b9f33989eb969a3ac8ffbf7.svg

The runs with batch sizes \(k=2, 8, 32\) halt before completing one epoch, while the run with batch size \(k=128\) completes three full epochs. Notice that the noise in the plots progressively decreases as the batch size increases. The traces through parameter space look like:

Hide code cell source
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 9))

for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = SGD(**gd_parameters_slice,
                    L=L,
                    X=X,
                    lr=alpha,
                    decay_rate=beta,
                    init_parameters=theta0,
                    random_state=42)
    
    k = gd_parameters_slice['batch_size']
    max_steps = gd_parameters_slice['max_steps']
    N = gd_parameters_slice['num_epochs']
    thetas = gd_output.parameters['theta']
    
    axis.contour(x, y, z, levels=np.arange(0, 10, 0.5), colors=blue, alpha=0.5)
    axis.plot(thetas[:, 0], thetas[:, 1], color=magenta)
    axis.scatter(thetas[:, 0], thetas[:, 1], s=30, color=magenta, zorder=2)
    axis.scatter(thetas[0, 0], thetas[0, 1], s=100, color=magenta, zorder=2)
    
    axis.set_title(f'$k={k}$, $\\alpha={alpha}$, $\\beta={beta}$, $N={N}$,\n gradient steps$={max_steps}$')
    axis.set_xlabel('$\\theta_1$')
    axis.set_ylabel('$\\theta_2$')
fig.suptitle('mini-batch gradient descent')
plt.tight_layout()
../_images/bcfcd918c6c01b5240e3bfe4301b40d2c8620b0653dd62ca7a44c8df0d80cd1d.svg

As we mentioned above, it may be helpful to track the mean objective values per epoch to help see the general trend of the objective through the noise. To illustrate this point, let’s suppose that we continue our runs of mini-batch gradient descent so that they all complete six epochs. If we plot both the objective per gradient step and the mean objective per epoch, we get the following plots:

Hide code cell source
gd_parameters = {'batch_size': [2, 8, 32, 128]}
N = 6
alpha = 1e-1
beta = 0
theta0 = torch.tensor([1.5, 1.5])

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(9, 6), sharey=True)

for i, axis in enumerate(axes.flatten()):
    gd_parameters_slice = {key: value[i] for key, value in gd_parameters.items()}
    gd_output = SGD(**gd_parameters_slice,
                    L=L,
                    X=X,
                    lr=alpha,
                    init_parameters=theta0,
                    num_epochs=N,
                    decay_rate=beta,
                    random_state=42)
    plot_gd(gd_output,
             plot_title=False,
             ax=axis,
             per_step_alpha=0.25,
             legend=True,
             ylabel='objective',
             per_epoch_color=magenta,
             per_step_label='objective per step',
             per_epoch_label='mean objective per epoch')
fig.suptitle('mini-batch gradient descent')
plt.tight_layout()
../_images/acfdeb41ed1ad71deb1962f4862c82d87e4e2d477b628790d0f4b59d813034cd.svg

11.5. Appendix: convex functions#

We begin with the definition of a convex function. Notice that it does not require that the function have any special properties, like differentiability.

Definition 11.6

We shall say a function \(J:\bbr^n \to \bbr\) is convex provided that

\[ J\big((1-t) \ba + t\bb\big) \leq (1-t) J(\ba) + t J(\bb) \]

for all \(\ba,\bb\in \bbr^n\) and all \(t\in [0,1]\). If “\(\leq\)” is replaced with “\(<\)”, then the function is called strictly convex; if the inequalities are reversed, we obtain the definitions of concave and strictly concave.
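For example, the function \(J(\theta) = \theta^2\) is convex according to this definition, since for all \(a,b\in \bbr\) and \(t\in [0,1]\) a direct expansion gives

\[ (1-t)a^2 + tb^2 - \big( (1-t)a + tb \big)^2 = t(1-t)(a-b)^2 \geq 0. \]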

Essentially, the definition says that a function is convex provided that its graph “lies below its secant lines.” If the function is twice-differentiable, then there are equivalent characterizations of convexity in terms of calculus:

Theorem 11.7 (Main theorem on convex functions (single-variable version))

Let \(J: \bbr \to \bbr\) be a twice-differentiable function. The following statements are equivalent:

  1. The function \(J\) is convex.

  2. For all numbers \(a,b,c\) with \(a<b<c\), we have

    \[ \frac{J(b)-J(a)}{b-a} \leq \frac{J(c)-J(b)}{c-b}. \]
  3. The graph of \(J\) lies above its tangent lines, i.e., for all \(\theta,a\in \bbr\), we have

    \[ J(\theta) \geq J'(a)(\theta-a) + J(a). \]
  4. The second derivative is nonnegative everywhere, i.e., \(J''(\theta)\geq 0\) for all \(\theta\in \bbr\).

Moreover, if \(\theta^\star\) is a stationary point of \(J\) and \(J\) is convex, then \(\theta^\star\) is a global minimizer.

Proof. We shall prove the statements as (1) \(\Rightarrow\) (3) \(\Rightarrow\) (2) \(\Rightarrow\) (1), and then (3) \(\Rightarrow\) (4) \(\Rightarrow\) (2).

(1) \(\Rightarrow\) (3): For each \(\theta,a\in \bbr\), we have

(11.22)#\[ J'(a)(\theta-a) = \lim_{t\to 0} \frac{J\big( a+t(\theta-a) \big) - J(a)}{t}. \]

Provided \(t\in [0,1]\), we have

\[ J\big( a+t(\theta-a) \big) = J \big( (1-t)a + t\theta\big) \leq (1-t)J(a) + tJ(\theta) \]

and so

\[ J(\theta) \geq \frac{J\big( a+t(\theta-a) \big) - J(a)}{t} + J(a). \]

Taking \(t\to 0^+\) and using (11.22) yields

\[ J(\theta) \geq J'(a) (\theta-a) + J(a), \]

as desired.

(3) \(\Rightarrow\) (2): Letting \(a<b<c\), by hypothesis we have

\[ J(a) \geq J'(b)(a-b) + J(b) \quad \text{and} \quad J(c) \geq J'(b)(c-b) + J(b). \]

Then the inequalities

\[ \frac{J(b)-J(a)}{b-a} \leq J'(b) \leq \frac{J(c)-J(b)}{c-b} \]

follow immediately, establishing (2).

(2) \(\Rightarrow\) (1): Suppose that the inequality

(11.23)#\[ \frac{J(b)-J(a)}{b-a} \leq \frac{J(c)-J(b)}{c-b} \]

holds for all \(a<b<c\). Fixing \(a\) and \(c\) with \(a<c\), let \(t\in (0,1)\) and set

\[ b = (1-t)a + tc. \]

Then we have both

(11.24)#\[ t = \frac{b-a}{c-a}, \]

and

(11.25)#\[ 1-t = \frac{c-b}{c-a}. \]

But then

\[ \frac{J(b)-J(a)}{t} \leq \frac{J(c)-J(b)}{1-t} \]

by (11.23), (11.24), and (11.25). We may rearrange this inequality to obtain

\[ J\big((1-t)a+tc \big) =J(b) \leq (1-t)J(a) + tJ(c), \]

which is what we wanted to show.

(3) \(\Rightarrow\) (4): Since \(J''\) is the derivative of \(J'\), to prove that \(J''(\theta)\geq 0\) for all \(\theta\) it will suffice to show that \(J'\) is an increasing function. For this, let \(a,b\in \bbr\) be given with \(a< b\). By hypothesis, we have both

\[ J(b) \geq J'(a)(b-a) + J(a) \quad \text{and} \quad J(a) \geq J'(b)(a-b) + J(b). \]

But then

\[ J'(a)(b-a) + J(a) \leq J(b) \leq -J'(b)(a-b) + J(a) \]

and so \(J'(a) \leq J'(b)\).

(4) \(\Rightarrow\) (2): Let \(a<b<c\). The Mean Value Theorem is then used twice: once to conclude that \(J'\) is an increasing function, and then to show that there are numbers \(u\in (a,b)\) and \(v\in (b,c)\) such that

\[ J'(u) = \frac{J(b)-J(a)}{b-a} \quad \text{and} \quad J'(v) = \frac{J(c)-J(b)}{c-b}. \]

But then

\[ \frac{J(b)-J(a)}{b-a} = J'(u) \leq J'(v) = \frac{J(c)-J(b)}{c-b} \]

since \(u<v\) and \(J'\) is increasing. Finally, the last statement of the theorem follows from (3): if \(\theta^\star\) is a stationary point of a convex function \(J\), then \(J(\theta) \geq J'(\theta^\star)(\theta - \theta^\star) + J(\theta^\star) = J(\theta^\star)\) for all \(\theta\), so \(\theta^\star\) is a global minimizer. Q.E.D.

Our goal now is to generalize the main theorem to convex functions of multiple variables. To do this, we will use the theorem for functions of a single variable, along with the following device: Suppose \(g:\bbr \to \bbr\) is a twice-differentiable function, \(s,t\) are fixed distinct real numbers, and we define

(11.26)#\[ h:\bbr \to \bbr, \quad h(r) = g\big(r(s-t)+t\big). \]

Then

(11.27)#\[ g(s) \geq g'(t) (s-t) + g(t) \quad \Leftrightarrow \quad h(1) \geq h'(0) + h(0) \]

while

(11.28)#\[ g''(t) \geq 0 \quad \Leftrightarrow \quad h''(0) \geq 0. \]

Both equivalences follow from the chain rule, which gives \(h'(0) = (s-t)g'(t)\) and \(h''(0) = (s-t)^2 g''(t)\), together with the assumption \(s\neq t\).

Theorem 11.8 (Main theorem on convex functions (multi-variable version))

Let \(J: \bbr^n \to \bbr\) be a function of class \(C^2\). The following statements are equivalent:

  1. The function \(J\) is convex.

  2. The graph of \(J\) lies above its tangent planes, i.e., for all \(\btheta,\bv\in \bbr^n\), we have

    \[ J(\bv + \btheta) \geq \bv^\intercal \nabla J(\btheta) + J(\btheta). \]
  3. The Hessian matrix \(\nabla^2 J(\btheta)\) is positive semidefinite for all \(\btheta\in \bbr^n\).

Moreover, if \(\btheta^\star\) is a stationary point of \(J\) and \(J\) is convex, then \(\btheta^\star\) is a global minimizer.

Proof. Throughout the proof we fix \(\btheta,\bv\in \bbr^n\) and we consider the function

(11.29)#\[ g: \bbr \to \bbr, \quad g(t) \def J\big(t \bv + \btheta \big). \]

Notice that this is exactly the function used in Section 11.2 in the definitions of the directional first and second derivatives.

(1) \(\Rightarrow\) (2): Supposing that \(J\) is convex, the function \(g\) is easily seen to be convex. Thus,

\[ J(\bv + \btheta) = g(1) \geq g'(0) + g(0) = \bv^\intercal \nabla J(\btheta) + J(\btheta), \]

where the inequality follows from part (3) of Theorem 11.7 (since \(g\) is convex) and the second equality follows from Definition 11.2 and Theorem 11.2.

(2) \(\Rightarrow\) (3): For real numbers \(s,t\in \bbr\), we shall prove

(11.30)#\[ g(s) \geq g'(t)(s-t) + g(t) \]

through (11.27) using the auxiliary function \(h\) defined in terms of \(g\) via (11.26). But note that

\[ h(1) = J(s\bv + \btheta) \quad \text{and} \quad h(0) = J(t\bv + \btheta) \]

while

\[ h'(0) = (s-t) \bv^\intercal \nabla J(t\bv + \btheta) \]

by Theorem 11.2. But by hypothesis, we have

\[ J(s\bv + \btheta) \geq (s-t) \bv^\intercal \nabla J(t\bv + \btheta) + J(t\bv + \btheta), \]

from which (11.30) follows. So, since the graph of \(g\) lies above its tangent lines, we know from Theorem 11.7 that \(g''(t)\geq 0\) for all \(t\) and, in particular, that \(g''(0)\geq 0\). Moreover, from Theorem 11.2 we get the equality in

\[ \bv^\intercal \nabla^2 J(\btheta) \bv = g''(0) \geq 0, \]

which shows that the Hessian matrix \(\nabla^2 J(\btheta)\) is positive semidefinite since \(\bv\) was chosen arbitrarily.

(3) \(\Rightarrow\) (1): We need to prove that

(11.31)#\[ J(t\bv + \btheta) \leq (1-t) J(\btheta) + t J (\bv + \btheta) \]

for all \(t\in [0,1]\). But note that this is the same inequality as

\[ g(t) \leq (1-t) g(0) + tg(1), \]

so it will suffice to show that \(g\) is convex. To do this, we shall show \(g''(t)\geq 0\) for all \(t\) through (11.28) and the auxiliary function \(h\). Indeed, we have

\[ h''(0) = (s-t)^2\bv^\intercal \nabla^2 J(t\bv + \btheta)\bv \geq 0 \]

from Theorem 11.2 and positive semidefiniteness of the Hessian matrix \(\nabla^2 J(t\bv + \btheta)\).