Table of Contents

  • Mind Map
  • CONTENTS
    • Overflow and Underflow
    • Poor Conditioning
    • Gradient-Based Optimization
      • Beyond the Gradient: Jacobian and Hessian Matrices
    • Constrained Optimization
    • Example: Linear Least Squares

Mind Map

CONTENTS

Overflow and Underflow

  • The fundamental difficulty in performing continuous math on a digital computer is that we need to represent infinitely many real numbers with a finite number of bit patterns.

  • Rounding error is problematic, especially when it compounds across many operations, and can cause algorithms that work in theory to fail in practice if they are not designed to minimize the accumulation of rounding error.

  • One form of rounding error that is particularly devastating is underflow. Underflow occurs when numbers near zero are rounded to zero. For example, we usually want to avoid division by zero.

  • Another highly damaging form of numerical error is overflow. Overflow occurs when numbers with large magnitude are approximated as $\infty$ or $-\infty$. Further arithmetic will usually change these infinite values into not-a-number values. One example of a function that must be stabilized against underflow and overflow is the softmax function. The softmax function is often used to predict the probabilities associated with a multinoulli distribution. The softmax function is defined to be
    $$\operatorname{softmax}(\boldsymbol{x})_{i}=\frac{\exp(x_{i})}{\sum_{j=1}^{n}\exp(x_{j})}$$
    Consider what happens when all of the $x_{i}$ are equal to some constant $c$. Analytically, we can see that all of the outputs should be equal to $\frac{1}{n}$. Numerically, this may not occur when $c$ has large magnitude.

  • If $c$ is very negative, then $\exp(c)$ will underflow. This means the denominator of the softmax will become 0, so the final result is undefined.
    When $c$ is very large and positive, $\exp(c)$ will overflow, again resulting in the expression as a whole being undefined.

  • Both of these difficulties can be resolved by instead evaluating $\operatorname{softmax}(\boldsymbol{z})$ where $\boldsymbol{z}=\boldsymbol{x}-\max_{i} x_{i}$. Simple algebra shows that the value of the softmax function is not changed analytically by adding or subtracting a scalar from the input vector. Subtracting $\max_{i} x_{i}$ results in the largest argument to exp being 0, which rules out the possibility of overflow. Likewise, at least one term in the denominator has a value of 1, which rules out the possibility of underflow in the denominator leading to a division by zero. There is still one small problem. Underflow in the numerator can still cause the expression as a whole to evaluate to zero. This means that if we implement $\log\operatorname{softmax}(\boldsymbol{x})$ by first running the softmax subroutine then passing the result to the log function, we could erroneously obtain $-\infty$. Instead, we must implement a separate function that calculates log softmax in a numerically stable way. The log softmax function can be stabilized using the same trick as we used to stabilize the softmax function.
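  • As a rough illustration of this trick (not from the book; the function names are just illustrative), here is a minimal NumPy sketch of a stabilized softmax and log softmax:

    ```python
    import numpy as np

    def stable_softmax(x):
        # Subtract the maximum so the largest argument to exp is 0, ruling out
        # overflow; at least one denominator term is exp(0) = 1, ruling out
        # underflow in the denominator.
        z = x - np.max(x)
        e = np.exp(z)
        return e / np.sum(e)

    def stable_log_softmax(x):
        # Compute log softmax directly instead of log(softmax(x)), so underflow
        # in the numerator cannot produce log(0) = -inf.
        z = x - np.max(x)
        return z - np.log(np.sum(np.exp(z)))

    x = np.array([1e4, 1e4, 1e4])    # a naive softmax would overflow here
    print(stable_softmax(x))         # -> [1/3, 1/3, 1/3]
    print(stable_log_softmax(x))     # -> [log(1/3)] * 3
    ```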

  • For the most part, we do not explicitly detail all of the numerical considerations involved in implementing the various algorithms described in this book. Developers of low-level libraries should keep numerical issues in mind when implementing deep learning algorithms. Most readers of this book can simply rely on low-level libraries that provide stable implementations. In some cases, it is possible to implement a new algorithm and have the new implementation automatically stabilized.

Poor Conditioning

  • Conditioning refers to how rapidly a function changes with respect to small changes in its inputs. Functions that change rapidly when their inputs are perturbed slightly can be problematic for scientific computation because rounding errors in the inputs can result in large changes in the output.
  • Consider the function $f(\boldsymbol{x})=\boldsymbol{A}^{-1}\boldsymbol{x}$. When $\boldsymbol{A} \in \mathbb{R}^{n \times n}$ has an eigenvalue decomposition, its condition number is
    $$\max_{i,j}\left|\frac{\lambda_{i}}{\lambda_{j}}\right|$$
    This is the ratio of the magnitude of the largest and smallest eigenvalue. When this number is large, matrix inversion is particularly sensitive to error in the input. (A small numerical sketch follows at the end of this list.)
  • This sensitivity is an intrinsic property of the matrix itself, not the result of rounding error during matrix inversion. Poorly conditioned matrices amplify pre-existing errors when we multiply by the true matrix inverse. In practice, the error will be compounded further by numerical errors in the inversion process itself.
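  • As a rough numerical illustration (not from the book; the matrices and helper name are assumed for the example), the eigenvalue ratio above can be computed directly, and a poorly conditioned matrix visibly amplifies a small input perturbation when we multiply by its inverse:

    ```python
    import numpy as np

    def eig_condition_number(A):
        # Ratio of largest to smallest eigenvalue magnitude, max_{i,j} |lambda_i / lambda_j|.
        lam = np.linalg.eigvals(A)
        return np.max(np.abs(lam)) / np.min(np.abs(lam))

    A_good = np.diag([1.0, 2.0])    # condition number 2
    A_bad = np.diag([1.0, 1e-8])    # condition number 1e8

    print(eig_condition_number(A_good), eig_condition_number(A_bad))

    # The same small perturbation of the input is amplified far more by A_bad^{-1}.
    x = np.array([1.0, 1.0])
    dx = np.array([0.0, 1e-6])
    for A in (A_good, A_bad):
        err = np.linalg.inv(A) @ (x + dx) - np.linalg.inv(A) @ x
        print(np.linalg.norm(err))
    ```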

Gradient-Based Optimization

  • Most deep learning algorithms involve optimization of some sort. Optimization refers to the task of either minimizing or maximizing some function $f(\boldsymbol{x})$ by altering $\boldsymbol{x}$. We usually phrase most optimization problems in terms of minimizing $f(\boldsymbol{x})$. Maximization may be accomplished via a minimization algorithm by minimizing $-f(\boldsymbol{x})$.

  • The function we want to minimize or maximize is called the objective function or criterion. When we are minimizing it, we may also call it the cost function, loss function, or error function. In this book, we use these terms interchangeably, though some machine learning publications assign special meaning to some of these terms.

  • We often denote the value that minimizes or maximizes a function with a superscript $*$. For example, we might say $\boldsymbol{x}^{*}=\arg\min f(\boldsymbol{x})$.

  • Suppose we have a function $y=f(x)$, where both $x$ and $y$ are real numbers. The derivative of this function is denoted as $f'(x)$ or as $\frac{dy}{dx}$. The derivative $f'(x)$ gives the slope of $f(x)$ at the point $x$. In other words, it specifies how to scale a small change in the input in order to obtain the corresponding change in the output: $f(x+\epsilon) \approx f(x)+\epsilon f'(x)$.

  • The derivative is therefore useful for minimizing a function because it tells us how to change $x$ in order to make a small improvement in $y$. For example, we know that $f(x-\epsilon\,\operatorname{sign}(f'(x)))$ is less than $f(x)$ for small enough $\epsilon$. We can thus reduce $f(x)$ by moving $x$ in small steps with the opposite sign of the derivative. This technique is called gradient descent.

  • When $f'(x)=0$, the derivative provides no information about which direction to move. Points where $f'(x)=0$ are known as critical points or stationary points. A local minimum is a point where $f(x)$ is lower than at all neighboring points, so it is no longer possible to decrease $f(x)$ by making infinitesimal steps. A local maximum is a point where $f(x)$ is higher than at all neighboring points, so it is not possible to increase $f(x)$ by making infinitesimal steps. Some critical points are neither maxima nor minima. These are known as saddle points.

  • A point that obtains the absolute lowest value of $f(x)$ is a global minimum. It is possible for there to be only one global minimum or multiple global minima of the function. It is also possible for there to be local minima that are not globally optimal. In the context of deep learning, we optimize functions that may have many local minima that are not optimal, and many saddle points surrounded by very flat regions. All of this makes optimization very difficult, especially when the input to the function is multidimensional. We therefore usually settle for finding a value of $f$ that is very low, but not necessarily minimal in any formal sense.

  • The directional derivative in direction $\boldsymbol{u}$ (a unit vector) is the slope of the function $f$ in direction $\boldsymbol{u}$. In other words, the directional derivative is the derivative of the function $f(\boldsymbol{x}+\alpha\boldsymbol{u})$ with respect to $\alpha$, evaluated at $\alpha=0$. Using the chain rule, we can see that $\frac{\partial}{\partial\alpha} f(\boldsymbol{x}+\alpha\boldsymbol{u})$ evaluates to $\boldsymbol{u}^{\top}\nabla_{\boldsymbol{x}} f(\boldsymbol{x})$ when $\alpha=0$.

  • To minimize $f$, we would like to find the direction in which $f$ decreases the fastest. We can do this using the directional derivative:
    $$\min_{\boldsymbol{u}, \boldsymbol{u}^{\top}\boldsymbol{u}=1} \boldsymbol{u}^{\top}\nabla_{\boldsymbol{x}} f(\boldsymbol{x}) = \min_{\boldsymbol{u}, \boldsymbol{u}^{\top}\boldsymbol{u}=1} \|\boldsymbol{u}\|_{2}\,\|\nabla_{\boldsymbol{x}} f(\boldsymbol{x})\|_{2}\cos\theta$$
    where $\theta$ is the angle between $\boldsymbol{u}$ and the gradient. Substituting in $\|\boldsymbol{u}\|_{2}=1$ and ignoring factors that do not depend on $\boldsymbol{u}$, this simplifies to $\min_{\boldsymbol{u}}\cos\theta$. This is minimized when $\boldsymbol{u}$ points in the opposite direction as the gradient. In other words, the gradient points directly uphill, and the negative gradient points directly downhill. We can decrease $f$ by moving in the direction of the negative gradient. This is known as the method of steepest descent or gradient descent.
    Steepest descent proposes a new point
    $$\boldsymbol{x}' = \boldsymbol{x} - \epsilon\nabla_{\boldsymbol{x}} f(\boldsymbol{x})$$
    where $\epsilon$ is the learning rate, a positive scalar determining the size of the step. We can choose $\epsilon$ in several different ways. A popular approach is to set $\epsilon$ to a small constant. Sometimes, we can solve for the step size that makes the directional derivative vanish. Another approach is to evaluate $f(\boldsymbol{x}-\epsilon\nabla_{\boldsymbol{x}} f(\boldsymbol{x}))$ for several values of $\epsilon$ and choose the one that results in the smallest objective function value. This last strategy is called a line search.
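  • A minimal sketch of these update rules (assumed toy objective $f(\boldsymbol{x})=\boldsymbol{x}^{\top}\boldsymbol{x}$, not from the book), showing both a fixed learning rate and a crude line search over a few candidate step sizes:

    ```python
    import numpy as np

    def f(x):
        return float(x @ x)          # toy objective f(x) = x^T x

    def grad_f(x):
        return 2.0 * x               # its gradient

    def gradient_descent(x, epsilon=0.1, steps=100, line_search=False):
        for _ in range(steps):
            g = grad_f(x)
            if line_search:
                # Evaluate f(x - eps * g) for several eps and keep the best one.
                candidates = [1e-3, 1e-2, 1e-1, 1.0]
                epsilon = min(candidates, key=lambda e: f(x - e * g))
            x = x - epsilon * g
        return x

    x0 = np.array([3.0, -4.0])
    print(gradient_descent(x0))                      # close to [0, 0]
    print(gradient_descent(x0, line_search=True))    # also close to [0, 0]
    ```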

  • Steepest descent converges when every element of the gradient is zero (or, in practice, very close to zero). In some cases, we may be able to avoid running this iterative algorithm and just jump directly to the critical point by solving the equation $\nabla_{\boldsymbol{x}} f(\boldsymbol{x})=0$ for $\boldsymbol{x}$.

  • Although gradient descent is limited to optimization in continuous spaces, the general concept of repeatedly making a small move (that is approximately the best small move) towards better configurations can be generalized to discrete spaces. Ascending an objective function of discrete parameters is called hill climbing.

Beyond the Gradient: Jacobian and Hessian Matrices

  • The matrix containing all of the partial derivatives of a function whose input and output are both vectors is known as a Jacobian matrix.

  • Specifically, if we have a function $f: \mathbb{R}^{m} \rightarrow \mathbb{R}^{n}$, then the Jacobian matrix $\boldsymbol{J} \in \mathbb{R}^{n \times m}$ of $f$ is defined such that $J_{i,j}=\frac{\partial}{\partial x_{j}} f(\boldsymbol{x})_{i}$.
    We are also sometimes interested in a derivative of a derivative. This is known as a second derivative. For example, for a function $f: \mathbb{R}^{n} \rightarrow \mathbb{R}$, the derivative with respect to $x_{i}$ of the derivative of $f$ with respect to $x_{j}$ is denoted as $\frac{\partial^{2}}{\partial x_{i}\partial x_{j}} f$. In a single dimension, we can denote $\frac{d^{2}}{dx^{2}} f$ by $f''(x)$. The second derivative tells us how the first derivative will change as we vary the input. This is important because it tells us whether a gradient step will cause as much of an improvement as we would expect based on the gradient alone. We can think of the second derivative as measuring curvature.
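  • These definitions can be checked numerically with finite differences. The sketch below (assumed example functions, not from the book) approximates the Jacobian of a vector-valued $f$ and the second derivative of a scalar function:

    ```python
    import numpy as np

    def jacobian_fd(f, x, h=1e-6):
        # Finite-difference Jacobian: J[i, j] approximates d f_i / d x_j.
        fx = f(x)
        J = np.zeros((fx.size, x.size))
        for j in range(x.size):
            e = np.zeros_like(x)
            e[j] = h
            J[:, j] = (f(x + e) - fx) / h
        return J

    f = lambda x: np.array([x[0] * x[1], np.sin(x[0])])
    x = np.array([1.0, 2.0])
    print(jacobian_fd(f, x))      # approx [[2, 1], [cos(1), 0]]

    # Second derivative of g(t) = t^3 at t = 1 via a central difference.
    g = lambda t: t ** 3
    t, h = 1.0, 1e-4
    print((g(t + h) - 2 * g(t) + g(t - h)) / h ** 2)   # approx 6
    ```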

  • When our function has multiple input dimensions, there are many second derivatives. These derivatives can be collected together into a matrix called the Hessian matrix. The Hessian matrix $\boldsymbol{H}(f)(\boldsymbol{x})$ is defined such that
    $$\boldsymbol{H}(f)(\boldsymbol{x})_{i,j}=\frac{\partial^{2}}{\partial x_{i}\partial x_{j}} f(\boldsymbol{x})$$
    Equivalently, the Hessian is the Jacobian of the gradient. Anywhere that the second partial derivatives are continuous, the differential operators are commutative, i.e. their order can be swapped:
    $$\frac{\partial^{2}}{\partial x_{i}\partial x_{j}} f(\boldsymbol{x})=\frac{\partial^{2}}{\partial x_{j}\partial x_{i}} f(\boldsymbol{x})$$
    This implies that $H_{i,j}=H_{j,i}$, so the Hessian matrix is symmetric at such points. Most of the functions we encounter in the context of deep learning have a symmetric Hessian almost everywhere.

  • Because the Hessian matrix is real and symmetric, we can decompose it into a set of real eigenvalues and an orthogonal basis of eigenvectors. The second derivative in a specific direction represented by a unit vector $\boldsymbol{d}$ is given by $\boldsymbol{d}^{\top}\boldsymbol{H}\boldsymbol{d}$. When $\boldsymbol{d}$ is an eigenvector of $\boldsymbol{H}$, the second derivative in that direction is given by the corresponding eigenvalue. The (directional) second derivative tells us how well we can expect a gradient descent step to perform. We can make a second-order Taylor series approximation to the function $f(\boldsymbol{x})$ around the current point $\boldsymbol{x}^{(0)}$:
    $$f(\boldsymbol{x}) \approx f(\boldsymbol{x}^{(0)})+(\boldsymbol{x}-\boldsymbol{x}^{(0)})^{\top}\boldsymbol{g}+\frac{1}{2}(\boldsymbol{x}-\boldsymbol{x}^{(0)})^{\top}\boldsymbol{H}(\boldsymbol{x}-\boldsymbol{x}^{(0)})$$
    where $\boldsymbol{g}$ is the gradient and $\boldsymbol{H}$ is the Hessian at $\boldsymbol{x}^{(0)}$. If we use a learning rate of $\epsilon$, then the new point $\boldsymbol{x}$ will be given by $\boldsymbol{x}^{(0)}-\epsilon\boldsymbol{g}$. Substituting this into our approximation, we obtain
    $$f(\boldsymbol{x}^{(0)}-\epsilon\boldsymbol{g}) \approx f(\boldsymbol{x}^{(0)})-\epsilon\boldsymbol{g}^{\top}\boldsymbol{g}+\frac{1}{2}\epsilon^{2}\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}$$

  • There are three terms here: the original value of the function, the expected improvement due to the slope of the function, and the correction we must apply to account for the curvature of the function. When this last term is too large, the gradient descent step can actually move uphill. When $\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}$ is zero or negative, the Taylor series approximation predicts that increasing $\epsilon$ forever will decrease $f$ forever. In practice, the Taylor series is unlikely to remain accurate for large $\epsilon$, so one must resort to more heuristic choices of $\epsilon$ in this case. When $\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}$ is positive, solving for the optimal step size that decreases the Taylor series approximation of the function the most yields
    $$\epsilon^{*}=\frac{\boldsymbol{g}^{\top}\boldsymbol{g}}{\boldsymbol{g}^{\top}\boldsymbol{H}\boldsymbol{g}}$$
    In the worst case, when $\boldsymbol{g}$ aligns with the eigenvector of $\boldsymbol{H}$ corresponding to the maximal eigenvalue $\lambda_{\max}$, this optimal step size is given by $\frac{1}{\lambda_{\max}}$. To the extent that the function we minimize can be approximated well by a quadratic function, the eigenvalues of the Hessian thus determine the scale of the learning rate.
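  • A small numerical sketch of this step-size formula on an assumed toy quadratic $f(\boldsymbol{x})=\frac{1}{2}\boldsymbol{x}^{\top}\boldsymbol{H}\boldsymbol{x}$ (values chosen only for illustration):

    ```python
    import numpy as np

    H = np.diag([1.0, 100.0])     # poorly conditioned curvature
    x = np.array([1.0, 1.0])
    g = H @ x                     # gradient of 1/2 x^T H x

    # Optimal step along the negative gradient under the quadratic model.
    eps_star = (g @ g) / (g @ H @ g)
    print(eps_star)               # between 1/lambda_max = 0.01 and 1/lambda_min = 1

    x_new = x - eps_star * g
    print(0.5 * x_new @ H @ x_new, "<", 0.5 * x @ H @ x)   # the step decreases f
    ```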

  • The second derivative can be used to determine whether a critical point is a local maximum, a local minimum, or a saddle point. Recall that at a critical point, $f'(x)=0$. When the second derivative $f''(x)>0$, the first derivative $f'(x)$ increases as we move to the right and decreases as we move to the left. This means $f'(x-\epsilon)<0$ and $f'(x+\epsilon)>0$ for small enough $\epsilon$. In other words, as we move right, the slope begins to point uphill to the right, and as we move left, the slope begins to point uphill to the left. Thus, when $f'(x)=0$ and $f''(x)>0$, we can conclude that $x$ is a local minimum. Similarly, when $f'(x)=0$ and $f''(x)<0$, we can conclude that $x$ is a local maximum. This is known as the second derivative test. Unfortunately, when $f''(x)=0$, the test is inconclusive. In this case $x$ may be a saddle point, or part of a flat region.

  • In multiple dimensions, we need to examine all of the second derivatives of the function. Using the eigendecomposition of the Hessian matrix, we can generalize the second derivative test to multiple dimensions. At a critical point, where $\nabla_{\boldsymbol{x}} f(\boldsymbol{x})=0$, we can examine the eigenvalues of the Hessian to determine whether the critical point is a local maximum, a local minimum, or a saddle point. When the Hessian is positive definite (all its eigenvalues are positive), the point is a local minimum. This can be seen by observing that the directional second derivative in any direction must be positive, and making reference to the univariate second derivative test. Likewise, when the Hessian is negative definite (all its eigenvalues are negative), the point is a local maximum. In multiple dimensions, it is actually possible to find positive evidence of saddle points in some cases. When at least one eigenvalue is positive and at least one eigenvalue is negative, we know that $\boldsymbol{x}$ is a local maximum on one cross section of $f$ but a local minimum on another cross section. Finally, the multidimensional second derivative test can be inconclusive, just like the univariate version. The test is inconclusive whenever all of the nonzero eigenvalues have the same sign but at least one eigenvalue is zero. This is because the univariate second derivative test is inconclusive in the cross section corresponding to the zero eigenvalue.
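  • An illustrative sketch of this multidimensional second derivative test (the helper name is assumed), classifying a critical point by the signs of the Hessian's eigenvalues:

    ```python
    import numpy as np

    def classify_critical_point(H, tol=1e-10):
        lam = np.linalg.eigvalsh(H)          # H is assumed symmetric
        if np.all(lam > tol):
            return "local minimum"
        if np.all(lam < -tol):
            return "local maximum"
        if np.any(lam > tol) and np.any(lam < -tol):
            return "saddle point"
        return "inconclusive"                # some eigenvalues are (near) zero

    print(classify_critical_point(np.diag([2.0, 3.0])))      # local minimum
    print(classify_critical_point(np.diag([-2.0, -3.0])))    # local maximum
    print(classify_critical_point(np.diag([2.0, -3.0])))     # saddle point
    print(classify_critical_point(np.diag([2.0, 0.0])))      # inconclusive
    ```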

  • In multiple dimensions, there is a different second derivative for each direction at a single point. The condition number of the Hessian at this point measures how much the second derivatives differ from each other. When the Hessian has a poor condition number, gradient descent performs poorly.
    This is because in one direction, the derivative increases rapidly, while in another direction, it increases slowly. Gradient descent is unaware of this change in the derivative so it does not know that it needs to explore preferentially in the direction where the derivative remains negative for longer.
    It also makes it difficult to choose a good step size. The step size must be small enough to avoid overshooting the minimum and going uphill in directions with strong positive curvature. This usually means that the step size is too small to make significant progress in other directions with less curvature.

  • This issue can be resolved by using information from the Hessian matrix to guide the search. The simplest method for doing so is known as Newton’s method. Newton’s method is based on using a second-order Taylor series expansion to approximate $f(\boldsymbol{x})$ near some point $\boldsymbol{x}^{(0)}$:
    $$f(\boldsymbol{x}) \approx f(\boldsymbol{x}^{(0)})+(\boldsymbol{x}-\boldsymbol{x}^{(0)})^{\top}\nabla_{\boldsymbol{x}} f(\boldsymbol{x}^{(0)})+\frac{1}{2}(\boldsymbol{x}-\boldsymbol{x}^{(0)})^{\top}\boldsymbol{H}(f)(\boldsymbol{x}^{(0)})(\boldsymbol{x}-\boldsymbol{x}^{(0)})$$
    If we then solve for the critical point of this function, we obtain:
    $$\boldsymbol{x}^{*}=\boldsymbol{x}^{(0)}-\boldsymbol{H}(f)(\boldsymbol{x}^{(0)})^{-1}\nabla_{\boldsymbol{x}} f(\boldsymbol{x}^{(0)})$$
    When $f$ is a positive definite quadratic function, Newton’s method consists of applying the equation above once to jump to the minimum of the function directly.
    When $f$ is not truly quadratic but can be locally approximated as a positive definite quadratic, Newton’s method consists of applying the equation above multiple times.
    Iteratively updating the approximation and jumping to the minimum of the approximation can reach the critical point much faster than gradient descent would. This is a useful property near a local minimum, but it can be a harmful property near a saddle point.
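  • A minimal sketch of a Newton step on an assumed positive definite quadratic (values chosen only for illustration); a single step lands exactly on the minimizer:

    ```python
    import numpy as np

    # f(x) = 1/2 x^T A x - b^T x, with gradient A x - b and constant Hessian A.
    A = np.array([[3.0, 1.0],
                  [1.0, 2.0]])
    b = np.array([1.0, -1.0])

    x = np.zeros(2)
    x = x - np.linalg.solve(A, A @ x - b)     # one Newton step: x - H^{-1} grad

    print(x)                                  # equals the exact minimizer A^{-1} b
    print(np.linalg.solve(A, b))
    ```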

  • Newton’s method is only appropriate when the nearby critical point is a minimum (all the eigenvalues of the Hessian are positive), whereas gradient descent is not attracted to saddle points unless the gradient points toward them.

  • Optimization algorithms that use only the gradient, such as gradient descent, are called first-order optimization algorithms.
    Optimization algorithms that also use the Hessian matrix, such as Newton’s method, are called second-order optimization algorithms.

  • In the context of deep learning, we sometimes gain some guarantees by restricting ourselves to functions that are either Lipschitz continuous or have Lipschitz continuous derivatives. A Lipschitz continuous function is a function $f$ whose rate of change is bounded by a Lipschitz constant $\mathcal{L}$:
    $$\forall \boldsymbol{x}, \forall \boldsymbol{y}, \quad |f(\boldsymbol{x})-f(\boldsymbol{y})| \leq \mathcal{L}\,\|\boldsymbol{x}-\boldsymbol{y}\|_{2}$$
    This property is useful because it allows us to quantify our assumption that a small change in the input made by an algorithm such as gradient descent will have a small change in the output. Lipschitz continuity is also a fairly weak constraint, and many optimization problems in deep learning can be made Lipschitz continuous with relatively minor modifications.

  • Perhaps the most successful field of specialized optimization is convex optimization.
    Convex optimization algorithms are able to provide many more guarantees by making stronger restrictions. Convex optimization algorithms are applicable only to convex functions—functions for which the Hessian is positive semidefinite everywhere. Such functions are well-behaved because they lack saddle points and all of their local minima are necessarily global minima. However, most problems in deep learning are difficult to express in terms of convex optimization. Convex optimization is used only as a subroutine of some deep learning algorithms. Ideas from the analysis of convex optimization algorithms can be useful for proving the convergence of deep learning algorithms. However, in general, the importance of convex optimization is greatly diminished in the context of deep learning.

Constrained Optimization

  • Sometimes we wish to find the maximal or minimal value of $f(\boldsymbol{x})$ for values of $\boldsymbol{x}$ in some set $\mathbb{S}$. This is known as constrained optimization. Points $\boldsymbol{x}$ that lie within the set $\mathbb{S}$ are called feasible points in constrained optimization terminology.

    We often wish to find a solution that is small in some sense. A common approach in such situations is to impose a norm constraint, such as $\|\boldsymbol{x}\| \leq 1$.
    One simple approach to constrained optimization is simply to modify gradient descent, taking the constraint into account.
    If we use a small constant step size $\epsilon$, we can make gradient descent steps, then project the result back into $\mathbb{S}$.
    If we use a line search, we can search only over step sizes $\epsilon$ that yield new $\boldsymbol{x}$ points that are feasible, or we can project each point on the line back into the constraint region.
    When possible, this method can be made more efficient by projecting the gradient into the tangent space of the feasible region before taking the step or beginning the line search.
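  • A minimal projected gradient descent sketch (assumed toy problem, not from the book): minimize $\|\boldsymbol{x}-\boldsymbol{c}\|_{2}^{2}$ subject to $\|\boldsymbol{x}\|_{2}\leq 1$ by taking gradient steps and projecting back onto the unit ball:

    ```python
    import numpy as np

    c = np.array([2.0, 2.0])      # the unconstrained minimum lies outside the ball

    def project_unit_ball(x):
        n = np.linalg.norm(x)
        return x if n <= 1.0 else x / n

    x = np.zeros(2)
    for _ in range(1000):
        grad = 2.0 * (x - c)                       # gradient of ||x - c||^2
        x = project_unit_ball(x - 0.01 * grad)     # step, then project back into S

    print(x)                      # approx c / ||c|| = [0.707, 0.707]
    ```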

  • A more sophisticated approach is to design a different, unconstrained optimization problem whose solution can be converted into a solution to the original, constrained optimization problem.
    For example, if we want to minimize $f(\boldsymbol{x})$ for $\boldsymbol{x} \in \mathbb{R}^{2}$ with $\boldsymbol{x}$ constrained to have exactly unit $L^{2}$ norm, we can instead minimize $g(\theta)=f([\cos\theta, \sin\theta]^{\top})$ with respect to $\theta$, then return $[\cos\theta, \sin\theta]$ as the solution to the original problem. This approach requires creativity; the transformation between optimization problems must be designed specifically for each case we encounter. (In plain terms, this is just a trigonometric substitution.)
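  • A sketch of this reparameterization on the same assumed toy objective as above, minimizing over $\theta$ with a crude grid search (any one-dimensional optimizer would do):

    ```python
    import numpy as np

    c = np.array([2.0, 2.0])
    f = lambda x: np.sum((x - c) ** 2)
    g = lambda theta: f(np.array([np.cos(theta), np.sin(theta)]))

    thetas = np.linspace(0.0, 2.0 * np.pi, 10000)
    theta_star = thetas[np.argmin([g(t) for t in thetas])]

    # Map the unconstrained solution back to a unit-norm point.
    print(np.array([np.cos(theta_star), np.sin(theta_star)]))   # approx [0.707, 0.707]
    ```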

  • The Karush-Kuhn-Tucker (KKT) approach provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function.

  • To define the Lagrangian, we first need to describe $\mathbb{S}$ in terms of equations and inequalities. We want a description of $\mathbb{S}$ in terms of $m$ functions $g^{(i)}$ and $n$ functions $h^{(j)}$ so that $\mathbb{S}=\{\boldsymbol{x} \mid \forall i, g^{(i)}(\boldsymbol{x})=0 \text{ and } \forall j, h^{(j)}(\boldsymbol{x}) \leq 0\}$. The equations involving $g^{(i)}$ are called the equality constraints and the inequalities involving $h^{(j)}$ are called inequality constraints.

  • We introduce new variables $\lambda_{i}$ and $\alpha_{j}$ for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as
    $$L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})=f(\boldsymbol{x})+\sum_{i} \lambda_{i} g^{(i)}(\boldsymbol{x})+\sum_{j} \alpha_{j} h^{(j)}(\boldsymbol{x})$$
    We can now solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. Observe that, so long as at least one feasible point exists and $f(\boldsymbol{x})$ is not permitted to have value $\infty$, then
    $$\min_{\boldsymbol{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$$
    has the same optimal objective function value and set of optimal points $\boldsymbol{x}$ as
    $$\min_{\boldsymbol{x} \in \mathbb{S}} f(\boldsymbol{x})$$
    This follows because any time the constraints are satisfied,
    $$\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})=f(\boldsymbol{x})$$
    while any time a constraint is violated,
    $$\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})=\infty$$
    These properties guarantee that no infeasible point can be optimal, and that the optimum within the feasible points is unchanged.

  • To perform constrained maximization, we can construct the generalized Lagrange function of $-f(\boldsymbol{x})$, which leads to this optimization problem:
    $$\min_{\boldsymbol{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0} -f(\boldsymbol{x})+\sum_{i} \lambda_{i} g^{(i)}(\boldsymbol{x})+\sum_{j} \alpha_{j} h^{(j)}(\boldsymbol{x})$$
    We may also convert this to a problem with maximization in the outer loop:
    $$\max_{\boldsymbol{x}} \min_{\boldsymbol{\lambda}} \min_{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0} f(\boldsymbol{x})+\sum_{i} \lambda_{i} g^{(i)}(\boldsymbol{x})-\sum_{j} \alpha_{j} h^{(j)}(\boldsymbol{x})$$
    The sign of the term for the equality constraints does not matter; we may define it with addition or subtraction as we wish, because the optimization is free to choose any sign for each $\lambda_{i}$.

  • The inequality constraints are particularly interesting. We say that a constraint $h^{(i)}(\boldsymbol{x})$ is active if $h^{(i)}(\boldsymbol{x}^{*})=0$. If a constraint is not active, then the solution to the problem found using that constraint would remain at least a local solution if that constraint were removed.
    It is possible that an inactive constraint excludes other solutions.
    For example, a convex problem with an entire region of globally optimal points (a wide, flat region of equal cost) could have a subset of this region eliminated by constraints,
    or a non-convex problem could have better local stationary points excluded by a constraint that is inactive at convergence.
    However, the point found at convergence remains a stationary point whether or not the inactive constraints are included.

    Because an inactive $h^{(i)}$ has negative value, the solution to $\min_{\boldsymbol{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha} \geq 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$ will have $\alpha_{i}=0$. We can thus observe that at the solution, $\boldsymbol{\alpha} \odot \boldsymbol{h}(\boldsymbol{x})=\mathbf{0}$.
    In other words, for all $i$, we know that at least one of the constraints $\alpha_{i} \geq 0$ and $h^{(i)}(\boldsymbol{x}) \leq 0$ must be active at the solution. To gain some intuition for this idea, we can say that either the solution is on the boundary imposed by the inequality and we must use its KKT multiplier to influence the solution for $\boldsymbol{x}$, or the inequality has no influence on the solution and we represent this by zeroing out its KKT multiplier.

  • A simple set of properties describe the optimal points of constrained optimization problems. These properties are called the Karush-Kuhn-Tucker (KKT) conditions. They are necessary conditions, but not always sufficient conditions, for a point to be optimal. The conditions are:
    – The gradient of the generalized Lagrangian is zero.
    – All constraints on both $\boldsymbol{x}$ and the KKT multipliers are satisfied.
    – The inequality constraints exhibit “complementary slackness”: $\boldsymbol{\alpha} \odot \boldsymbol{h}(\boldsymbol{x})=\mathbf{0}$.

Example: Linear Least Squares

  • Suppose we want to find the value of $\boldsymbol{x}$ that minimizes
    $$f(\boldsymbol{x})=\frac{1}{2}\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}\|_{2}^{2}$$
    There are specialized linear algebra algorithms that can solve this problem efficiently. However, we can also explore how to solve it using gradient-based optimization as a simple example of how these techniques work. First, we need to obtain the gradient:
    $$\nabla_{\boldsymbol{x}} f(\boldsymbol{x})=\boldsymbol{A}^{\top}(\boldsymbol{A}\boldsymbol{x}-\boldsymbol{b})=\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{x}-\boldsymbol{A}^{\top}\boldsymbol{b}$$
    We can then follow this gradient downhill, taking small steps. See Algorithm 4.1 below for details.
    Algorithm 4.1: An algorithm to minimize $f(\boldsymbol{x})=\frac{1}{2}\|\boldsymbol{A}\boldsymbol{x}-\boldsymbol{b}\|_{2}^{2}$ with respect to $\boldsymbol{x}$ using gradient descent, starting from an arbitrary value of $\boldsymbol{x}$.
    Set the step size ($\epsilon$) and tolerance ($\delta$) to small, positive numbers.
    while $\|\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{x}-\boldsymbol{A}^{\top}\boldsymbol{b}\|_{2}>\delta$ do
        $\boldsymbol{x} \leftarrow \boldsymbol{x}-\epsilon(\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{x}-\boldsymbol{A}^{\top}\boldsymbol{b})$
    end while
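  • A direct NumPy sketch of Algorithm 4.1 (the matrix $\boldsymbol{A}$, vector $\boldsymbol{b}$, step size, and tolerance are arbitrary illustrative values):

    ```python
    import numpy as np

    A = np.array([[2.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
    b = np.array([1.0, 2.0, 3.0])

    epsilon, delta = 0.05, 1e-8          # step size and tolerance
    x = np.zeros(2)                      # arbitrary starting point

    # Follow the gradient A^T A x - A^T b downhill until it is nearly zero.
    while np.linalg.norm(A.T @ A @ x - A.T @ b) > delta:
        x = x - epsilon * (A.T @ A @ x - A.T @ b)

    print(x)
    print(np.linalg.lstsq(A, b, rcond=None)[0])   # reference least squares solution
    ```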

  • One can also solve this problem using Newton’s method. In this case, because the true function is quadratic, the quadratic approximation employed by Newton’s method is exact, and the algorithm converges to the global minimum in a single step.

    Now suppose we wish to minimize the same function, but subject to the constraint $\boldsymbol{x}^{\top}\boldsymbol{x} \leq 1$. To do so, we introduce the Lagrangian
    $$L(\boldsymbol{x}, \lambda)=f(\boldsymbol{x})+\lambda(\boldsymbol{x}^{\top}\boldsymbol{x}-1)$$
    We can now solve the problem
    $$\min_{\boldsymbol{x}} \max_{\lambda, \lambda \geq 0} L(\boldsymbol{x}, \lambda)$$
    The smallest-norm solution to the unconstrained least squares problem may be found using the Moore-Penrose pseudoinverse: $\boldsymbol{x}=\boldsymbol{A}^{+}\boldsymbol{b}$. If this point is feasible, then it is the solution to the constrained problem.
    Otherwise, we must find a solution where the constraint is active. By differentiating the Lagrangian with respect to $\boldsymbol{x}$, we obtain the equation
    $$\boldsymbol{A}^{\top}\boldsymbol{A}\boldsymbol{x}-\boldsymbol{A}^{\top}\boldsymbol{b}+2\lambda\boldsymbol{x}=0$$
    This tells us that the solution will take the form
    $$\boldsymbol{x}=(\boldsymbol{A}^{\top}\boldsymbol{A}+2\lambda\boldsymbol{I})^{-1}\boldsymbol{A}^{\top}\boldsymbol{b}$$
    The magnitude of $\lambda$ must be chosen such that the result obeys the constraint. We can find this value by performing gradient ascent on $\lambda$. To do so, observe
    $$\frac{\partial}{\partial\lambda} L(\boldsymbol{x}, \lambda)=\boldsymbol{x}^{\top}\boldsymbol{x}-1$$
    When the norm of $\boldsymbol{x}$ exceeds 1, this derivative is positive, so to follow the derivative uphill and increase the Lagrangian with respect to $\lambda$, we increase $\lambda$. Because the coefficient on the $\boldsymbol{x}^{\top}\boldsymbol{x}$ penalty has increased, solving the linear equation for $\boldsymbol{x}$ will now yield a solution with smaller norm. The process of solving the linear equation and adjusting $\lambda$ continues until $\boldsymbol{x}$ has the correct norm and the derivative on $\lambda$ is 0.
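  • A rough sketch of this procedure (the $\boldsymbol{A}$, $\boldsymbol{b}$, and ascent step size are assumed for illustration): solve the linear system for $\boldsymbol{x}$, take a gradient ascent step on $\lambda$, and repeat:

    ```python
    import numpy as np

    A = np.array([[2.0, 0.0],
                  [0.0, 1.0],
                  [1.0, 1.0]])
    b = np.array([1.0, 2.0, 3.0])

    x = np.linalg.pinv(A) @ b            # Moore-Penrose pseudoinverse solution
    if x @ x > 1.0:                      # infeasible, so the constraint must be active
        lam, lr = 0.0, 0.1
        for _ in range(10000):
            # Solve (A^T A + 2 lambda I) x = A^T b for the current lambda.
            x = np.linalg.solve(A.T @ A + 2.0 * lam * np.eye(2), A.T @ b)
            # Gradient ascent on lambda: dL/dlambda = x^T x - 1, clipped at 0.
            lam = max(0.0, lam + lr * (x @ x - 1.0))

    print(x, x @ x)                      # the norm converges to 1 in this example
    ```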
