Optimization (An overview of gradient descent optimization algorithms)

Table of contents

Gradient descent variants
Momentum
Nesterov accelerated gradient(NAG)
Adagrad
Adadelta
RMSprop
Adam (Adaptive Moment Estimation)
AdaMax
Nadam
AMSGrad
Visualization of algorithms

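When we optimize a model, we run the network to get predictions, compute the loss, and backpropagate to obtain the gradient $g_k$ (the update moves along $-g_k$, the descent direction). Given a step size (also called the learning rate, $\alpha_k$), the optimizer then updates the parameters. For plain gradient descent (GD): $W_{k+1} = W_k - \alpha_k g_k$.
The key points in that sentence are choosing the loss, choosing the optimization algorithm, and computing the gradient. In code the gradient comes straight from backpropagation in the framework's low-level implementation, but it is worth remembering that GD uses only first-order gradients, whereas Newton's method uses second-order derivatives and quasi-Newton methods approximate them. How the learning rate is set differs from optimizer to optimizer. This post focuses on the choice of optimization algorithm.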
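To make the update rule $W_{k+1} = W_k - \alpha_k g_k$ concrete, here is a minimal sketch in plain Python. The toy loss $f(w) = \sum_i w_i^2$ and the names `gd_step` and `grad_quadratic` are assumptions chosen for illustration, not part of any framework.

```python
def gd_step(w, grad, lr):
    """One gradient descent update: W_{k+1} = W_k - alpha_k * g_k."""
    return [wi - lr * gi for wi, gi in zip(w, grad)]


def grad_quadratic(w):
    # Gradient of the toy loss f(w) = sum(w_i^2); assumed only for this sketch.
    return [2.0 * wi for wi in w]


w = [3.0, -2.0]
for _ in range(100):
    w = gd_step(w, grad_quadratic(w), lr=0.1)
# Each step multiplies every coordinate by 0.8, so w shrinks toward the minimum at 0.
```

In a real training loop the gradient would come from backpropagation rather than from a hand-written function.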

Deep learning optimizers have evolved roughly along the line SGD -> SGDM -> NAG -> AdaGrad -> AdaDelta -> RMSprop -> Adam -> Nadam -> AMSGrad.

Gradient descent variants

There are three variants of gradient descent, which differ in how much data we use to compute the gradient of the objective function.
BGD (batch gradient descent): the entire training dataset

SGD (stochastic gradient descent): one training example

Mini-batch gradient descent: part of the training examples
Note that the term SGD is usually also used when mini-batches are employed.
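The three variants differ only in how many examples feed each gradient estimate, which the following sketch tries to show. It assumes a one-parameter model $y = wx$ with a squared-error loss on noiseless toy data; `batch_gradient` and `train` are hypothetical helper names.

```python
import random


def batch_gradient(w, batch):
    # Mean gradient of 0.5 * (w * x - y)^2 over the examples in the batch.
    return sum((w * x - y) * x for x, y in batch) / len(batch)


def train(data, batch_size, lr=0.05, epochs=200, seed=0):
    rng = random.Random(seed)
    w = 0.0
    for _ in range(epochs):
        samples = data[:]
        rng.shuffle(samples)
        # batch_size == len(data) -> BGD, 1 -> SGD, anything between -> mini-batch
        for start in range(0, len(samples), batch_size):
            batch = samples[start:start + batch_size]
            w -= lr * batch_gradient(w, batch)
    return w


data = [(float(x), 3.0 * x) for x in range(1, 5)]  # noiseless toy data, y = 3x
w_bgd = train(data, batch_size=len(data))
w_sgd = train(data, batch_size=1)
w_mini = train(data, batch_size=2)
```

On this noiseless problem all three settings recover a slope close to 3. With noisy data, the smaller the batch, the noisier the path tends to be.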

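Shared drawbacks (challenges):
1. They are sensitive to the learning rate: too small and training is very slow, too large and the objective function can easily diverge.
2. The same learning rate is applied to every parameter. This is especially inconvenient for sparse data, where we would rather take smaller steps for frequently occurring features and larger steps for rare ones.
3. Gradient descent essentially searches for stationary points (points where the derivative of the objective with respect to the parameters is zero), and these come in three kinds: maxima, minima, and saddle points. High-dimensional non-convex loss surfaces contain a large number of saddle points, so gradient descent easily gets stuck near one and can stay there for a long time.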
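The saddle-point problem can be reproduced on a two-variable toy function. The sketch below assumes $f(x, y) = x^2 - y^2$, which has a saddle at the origin. The function and the names `grad_saddle` and `run_gd` are chosen purely for illustration.

```python
def grad_saddle(x, y):
    # Gradient of f(x, y) = x^2 - y^2, which has a saddle point at (0, 0).
    return 2.0 * x, -2.0 * y


def run_gd(x, y, lr=0.1, steps=200):
    for _ in range(steps):
        gx, gy = grad_saddle(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


# Starting exactly on the line y = 0, GD converges to the saddle and never leaves.
on_ridge = run_gd(1.0, 0.0)

# A tiny offset in y is amplified by only 1.2x per step, so after 50 steps the
# iterate is still very close to the saddle; it escapes only much later.
near_ridge_50 = run_gd(1.0, 1e-8, steps=50)
near_ridge_200 = run_gd(1.0, 1e-8, steps=200)
```

The gradient near the saddle is almost zero in every direction, which is why plain GD lingers there for many iterations.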

Momentum

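Gradient descent with momentum (SGDM)
Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations. It does this by adding a fraction $\gamma$ of the previous update vector to the current one:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta), \qquad \theta = \theta - v_t$$

where $\gamma$ is usually set to 0.9 or a similar value.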

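On a ravine-shaped loss surface, learning slows down in the vertical direction, which reduces oscillation, and speeds up in the horizontal direction, which accelerates convergence. The momentum term increases for dimensions whose gradients point in the same direction and reduces updates for dimensions whose gradients change direction. As a result, we gain faster convergence and reduced oscillation.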
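A minimal sketch of the momentum update, compared against plain GD on a ravine-like quadratic. The toy loss $0.5(x^2 + 25y^2)$, the learning rate of 0.03, and the helper names are assumptions made for this example.

```python
def grad_ravine(theta):
    # Gradient of the toy loss 0.5 * (x^2 + 25 * y^2): steep in y, shallow in x.
    x, y = theta
    return [x, 25.0 * y]


def loss_ravine(theta):
    x, y = theta
    return 0.5 * (x * x + 25.0 * y * y)


def momentum_step(theta, v, grad, lr, gamma=0.9):
    # v_t = gamma * v_{t-1} + lr * g_t ; theta = theta - v_t
    v = [gamma * vi + lr * gi for vi, gi in zip(v, grad)]
    theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta, v


lr, steps = 0.03, 200
theta_gd = [1.0, 1.0]
theta_m, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(steps):
    g = grad_ravine(theta_gd)
    theta_gd = [ti - lr * gi for ti, gi in zip(theta_gd, g)]
    theta_m, v = momentum_step(theta_m, v, grad_ravine(theta_m), lr)
```

With the same learning rate, the momentum run ends with a lower loss on this problem because the velocity keeps building up along the shallow $x$ direction.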

Nesterov accelerated gradient(NAG)

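We can now effectively look ahead by calculating the gradient not w.r.t. our current parameters $\theta$ but w.r.t. the approximate future position of our parameters, $\theta - \gamma v_{t-1}$, where $v_{t-1}$ is the previous momentum update vector:

$$v_t = \gamma v_{t-1} + \eta \nabla_\theta J(\theta - \gamma v_{t-1}), \qquad \theta = \theta - v_t$$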

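In the figure, the blue vectors at the top show Momentum: it first computes the current gradient (small blue vector) and then takes a big jump in the direction of the updated accumulated gradient (big blue vector).
The bottom part shows NAG: it first takes a big jump in the direction of the previously accumulated gradient (brown vector), then measures the gradient there (red vector) and makes a correction (green vector). This anticipatory update prevents us from going too fast and makes the optimizer more responsive.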
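The look-ahead idea can be sketched as follows. The only difference from classical momentum is where the gradient is evaluated. `nag_step` takes a gradient function rather than a gradient value, and the toy quadratic is an assumption for illustration.

```python
def grad_ravine(theta):
    # Gradient of the toy loss 0.5 * (x^2 + 25 * y^2), assumed for this sketch.
    x, y = theta
    return [x, 25.0 * y]


def nag_step(theta, v, grad_fn, lr, gamma=0.9):
    # 1. Jump ahead along the previously accumulated update (the brown vector).
    lookahead = [ti - gamma * vi for ti, vi in zip(theta, v)]
    # 2. Measure the gradient at the look-ahead position (the red vector).
    g = grad_fn(lookahead)
    # 3. Combine both into the actual update (the green correction).
    v = [gamma * vi + lr * gi for vi, gi in zip(v, g)]
    theta = [ti - vi for ti, vi in zip(theta, v)]
    return theta, v


theta, v = [1.0, 1.0], [0.0, 0.0]
for _ in range(200):
    theta, v = nag_step(theta, v, grad_ravine, lr=0.03)
```

Because the gradient is taken after the jump, the correction already knows where momentum is about to carry the parameters.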

Adagrad

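So far we have used the same learning rate to update every parameter, even if that rate is decayed over time. Adagrad instead adapts the learning rate per parameter: it performs larger updates for infrequent parameters and smaller updates for frequent ones, which makes it well suited to sparse data.
At step $t$, compute the gradient of the objective function with respect to parameter $\theta_i$:
$g_{t,i} = \nabla_{\theta_t} J(\theta_{t,i})$
Plain SGD would then update each parameter $\theta_i$ as:
$\theta_{t+1,i} = \theta_{t,i} - \eta \cdot g_{t,i}$
Adagrad instead modifies the general learning rate $\eta$ based on the past gradients of $\theta_i$:
$\theta_{t+1,i} = \theta_{t,i} - \frac{\eta}{\sqrt{G_{t,ii}+\epsilon}} \cdot g_{t,i}$
or, in vectorized form, $\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t+\epsilon}} \odot g_t$.
Here $G_t$ is a diagonal matrix whose diagonal element $G_{t,ii}$ is the sum of the squares of the gradients with respect to $\theta_i$ up to step $t$. This largely removes the need to tune the learning rate by hand.
Drawback: every squared gradient added to $G_t$ is non-negative, so the accumulated sum keeps growing during training. The effective learning rate therefore shrinks until it becomes infinitesimally small, at which point the model can no longer learn anything new.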
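The per-parameter scaling can be illustrated with a sketch in which one parameter receives a gradient at every step and another only at every tenth step. The gradient pattern and the name `adagrad_step` are assumptions for this example. Only the diagonal of $G_t$ is stored, as a plain list.

```python
import math


def adagrad_step(theta, G, grad, lr=0.01, eps=1e-8):
    # Accumulate squared gradients per parameter (the diagonal of G_t).
    G = [Gi + gi * gi for Gi, gi in zip(G, grad)]
    # Scale the shared learning rate by 1 / sqrt(G_{t,ii} + eps).
    theta = [ti - lr / math.sqrt(Gi + eps) * gi
             for ti, Gi, gi in zip(theta, G, grad)]
    return theta, G


theta, G = [0.0, 0.0], [0.0, 0.0]
for t in range(100):
    # Parameter 0 is "frequent"; parameter 1 is "rare" (active every 10th step).
    grad = [1.0, 1.0 if t % 10 == 0 else 0.0]
    theta, G = adagrad_step(theta, G, grad)

effective_lr = [0.01 / math.sqrt(Gi + 1e-8) for Gi in G]
```

After 100 steps the rare parameter still has a noticeably larger effective learning rate than the frequent one, and both rates can only keep shrinking as $G$ grows.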

Adadelta

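Adadelta replaces the diagonal matrix $G_t$ with the decaying average over past squared gradients $E[g^2]_t$. That is, the Adagrad update
$\Delta\theta_t = -\frac{\eta}{\sqrt{G_t+\epsilon}} \odot g_t$
is replaced by
$\Delta\theta_t = -\frac{\eta}{\sqrt{E[g^2]_t+\epsilon}} \odot g_t$, with

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2$$

where $\gamma$ takes a value similar to the momentum term, around 0.9.
The units in this update (as in SGD, Momentum, and Adagrad) do not match: the update should have the same hypothetical units as the parameter. To fix this, we define a second decaying average, this time over squared parameter updates instead of squared gradients:

$$E[\Delta\theta^2]_t = \gamma E[\Delta\theta^2]_{t-1} + (1-\gamma) \Delta\theta_t^2$$

This raises another problem: the quantity above is unknown at step $t$, because it depends on $\Delta\theta_t$ itself. We therefore approximate it with the value from the previous step, which also removes $\eta$ from the update:
$\Delta\theta_t = -\frac{\sqrt{E[\Delta\theta^2]_{t-1}+\epsilon}}{\sqrt{E[g^2]_t+\epsilon}} \odot g_t$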
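A minimal sketch of one Adadelta step, following the two decaying averages described above. The value `eps=1e-6` and the toy objective $x^2$ are assumptions made for the example. Note that the function takes no learning rate.

```python
import math


def adadelta_step(theta, Eg2, Edx2, grad, gamma=0.9, eps=1e-6):
    new_theta, new_Eg2, new_Edx2 = [], [], []
    for ti, eg, ed, gi in zip(theta, Eg2, Edx2, grad):
        # Decaying average of squared gradients, E[g^2]_t.
        eg = gamma * eg + (1 - gamma) * gi * gi
        # Update uses E[dtheta^2]_{t-1} in the numerator; there is no eta.
        delta = -math.sqrt(ed + eps) / math.sqrt(eg + eps) * gi
        # Decaying average of squared updates, E[dtheta^2]_t.
        ed = gamma * ed + (1 - gamma) * delta * delta
        new_theta.append(ti + delta)
        new_Eg2.append(eg)
        new_Edx2.append(ed)
    return new_theta, new_Eg2, new_Edx2


theta, Eg2, Edx2 = [1.0], [0.0], [0.0]
for _ in range(10):
    grad = [2.0 * theta[0]]  # gradient of the toy objective x^2
    theta, Eg2, Edx2 = adadelta_step(theta, Eg2, Edx2, grad)
```

The first steps are small because both averages start at zero. The step size then adapts as the average of squared updates builds up.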

RMSprop

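Like Adadelta, RMSprop was designed to fix Adagrad's radically diminishing learning rate. In fact, RMSprop is identical to the first update vector of Adadelta:

$$E[g^2]_t = \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^2]_t+\epsilon}} \odot g_t$$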
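The effect of swapping the growing sum for a decaying average can be checked with a short sketch. The constant gradient of 1 and the defaults `lr=0.001`, `gamma=0.9`, `eps=1e-8` are assumptions for illustration.

```python
import math


def rmsprop_step(theta, Eg2, grad, lr=0.001, gamma=0.9, eps=1e-8):
    # Decaying average of squared gradients instead of Adagrad's running sum.
    Eg2 = [gamma * e + (1 - gamma) * gi * gi for e, gi in zip(Eg2, grad)]
    theta = [ti - lr / math.sqrt(e + eps) * gi
             for ti, e, gi in zip(theta, Eg2, grad)]
    return theta, Eg2


theta, Eg2 = [0.0], [0.0]
for _ in range(200):
    theta, Eg2 = rmsprop_step(theta, Eg2, [1.0])  # constant toy gradient

effective_lr = 0.001 / math.sqrt(Eg2[0] + 1e-8)
```

With a constant gradient the decaying average settles near $g^2$, so the effective learning rate stays close to the nominal `lr` instead of decaying toward zero as it would under Adagrad.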

Adam (Adaptive Moment Estimation)

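Adam = RMSprop + Momentum + bias correction
In addition to storing an exponentially decaying average of past squared gradients $v_t$, like Adadelta and RMSprop, Adam also keeps an exponentially decaying average of past gradients $m_t$, similar to momentum. In other words, it combines the first moment (the mean of the gradients) with the second moment (the uncentered variance of the gradients):

$$m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t^2$$

Because $m_t$ and $v_t$ are initialized to zero, they are biased toward zero, so bias-corrected estimates are used to counteract this:

$$\hat{m}_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^t}$$

This finally gives the update rule:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon} \hat{m}_t$$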
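The three ingredients (first moment, second moment, bias correction) fit in a few lines. This is a sketch under the commonly quoted defaults `lr=0.001`, `beta1=0.9`, `beta2=0.999`, `eps=1e-8`. The step counter `t` is assumed to start at 1.

```python
import math


def adam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Decaying averages of the gradient (m) and of the squared gradient (v).
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Bias correction: both averages start at 0 and are too small early on.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    theta = [ti - lr / (math.sqrt(vh) + eps) * mh
             for ti, mh, vh in zip(theta, m_hat, v_hat)]
    return theta, m, v


# First step (t = 1) from theta = 1 with a toy gradient of 4.
theta1, m1, v1 = adam_step([1.0], [0.0], [0.0], [4.0], t=1)
```

Thanks to the bias correction, the very first step has a size of about `lr` regardless of the gradient's scale. Without it, the step here would be roughly three times larger.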

AdaMax

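We use $u_t$ to denote the infinity-norm-constrained $v_t$:

$$u_t = \max(\beta_2 \cdot u_{t-1}, |g_t|)$$

Replacing $\sqrt{\hat{v}_t}+\epsilon$ in the Adam update with $u_t$ gives:

$$\theta_{t+1} = \theta_t - \frac{\eta}{u_t} \hat{m}_t$$

Good default values are $\eta = 0.002$, $\beta_1 = 0.9$, and $\beta_2 = 0.999$.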
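A sketch of the AdaMax step using the defaults quoted above. It assumes the first gradient is nonzero, since $u_t$ appears in the denominator without an $\epsilon$. A practical implementation would likely guard against division by zero.

```python
def adamax_step(theta, m, u, grad, t, lr=0.002, beta1=0.9, beta2=0.999):
    # First moment, exactly as in Adam.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    # Infinity-norm accumulator: a decayed running maximum of |g|.
    u = [max(beta2 * ui, abs(gi)) for ui, gi in zip(u, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    # Assumes u_i > 0, i.e. the parameter has seen a nonzero gradient.
    theta = [ti - lr / ui * mh for ti, ui, mh in zip(theta, u, m_hat)]
    return theta, m, u


# First step (t = 1) from theta = 1 with a toy gradient of 4.
theta1, m1, u1 = adamax_step([1.0], [0.0], [0.0], [4.0], t=1)
```

Because $u_t$ is a maximum rather than an average that starts at zero, it needs no bias correction.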

Nadam

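Nadam = Adam + NAG
Start from the NAG update, written with the momentum vector $m_t$:

$$g_t = \nabla_{\theta_t} J(\theta_t - \gamma m_{t-1}), \qquad m_t = \gamma m_{t-1} + \eta g_t, \qquad \theta_{t+1} = \theta_t - m_t$$

and modify it to use the current momentum vector $m_t$ to look ahead:

$$g_t = \nabla_{\theta_t} J(\theta_t), \qquad m_t = \gamma m_{t-1} + \eta g_t, \qquad \theta_{t+1} = \theta_t - (\gamma m_t + \eta g_t)$$

Finally, the expanded Adam update

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon} \left( \beta_1 \hat{m}_{t-1} + \frac{(1-\beta_1) g_t}{1-\beta_1^t} \right)$$

becomes the Nadam update:

$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t}+\epsilon} \left( \beta_1 \hat{m}_t + \frac{(1-\beta_1) g_t}{1-\beta_1^t} \right)$$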
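The following sketch shows one Nadam step in the simplified form where the bias-corrected current momentum $\hat{m}_t$ is used for the look-ahead. The default `lr=0.001` is an assumption, and published implementations may schedule $\beta_1$ differently.

```python
import math


def nadam_step(theta, m, v, grad, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # Moment estimates and bias correction, exactly as in Adam.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    # Nesterov look-ahead: blend the current m_hat with the current gradient.
    look = [beta1 * mh + (1 - beta1) * gi / (1 - beta1 ** t)
            for mh, gi in zip(m_hat, grad)]
    theta = [ti - lr / (math.sqrt(vh) + eps) * li
             for ti, vh, li in zip(theta, v_hat, look)]
    return theta, m, v


# First step (t = 1) from theta = 1 with a toy gradient of 4.
theta1, m1, v1 = nadam_step([1.0], [0.0], [0.0], [4.0], t=1)
```

Compared with the Adam step, only the term multiplying the learning rate changes. The moment updates are identical.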

AMSGrad

Visualization of algorithms

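The contour plot of the loss surface shows how the different algorithms learn. They all start from the same point but take different paths to the minimum. Adagrad, Adadelta, and RMSprop head in the right direction from the very beginning and converge quickly. SGD finds the right direction but converges very slowly. SGD-M and NAG both go off course at first but eventually correct themselves, with SGD-M overshooting further than NAG because of its larger inertia.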

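The next figure shows how the algorithms behave at a saddle point. Here SGD, SGD-M, and NAG are all badly affected by the saddle point, although the latter two eventually escape it, while Adagrad, RMSprop, and Adadelta quickly find the right direction.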

As we can see, the adaptive learning-rate methods, i.e. Adagrad, Adadelta, RMSprop, and Adam, are most suitable and provide the best convergence for these scenarios.