[Translated from: Gradient Descent With Nesterov Momentum From Scratch]

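[Note: I am a big fan of Jason Brownlee, PhD's articles, so in my spare time I translate some of them and work through the examples. This post is a record of that work, and I hope it helps anyone who needs it.]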



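In some cases, the acceleration from momentum can cause the search to miss or overshoot the minimum at the bottom of a basin or valley. Nesterov momentum is an extension of momentum that calculates the decaying moving average of the gradients at projected positions in the search space, rather than at the actual positions themselves.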


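In this tutorial, you will discover how to develop the gradient descent optimization algorithm with Nesterov Momentum from scratch. After completing this tutorial, you will know: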



This tutorial is divided into three parts; they are:







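x(t+1) = x(t) - step_size * f'(x(t))

If the step size is too small, movement in the search space will be small and the search will take a long time. If the step size is too large, the search may bounce around the search space and skip over the optimum. Now that we are familiar with the gradient descent optimization algorithm, let's take a look at Nesterov momentum.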
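To make the effect of the step size concrete, here is a minimal sketch of the update rule above on a one-dimensional bowl, f(x) = x^2. The function names, the starting point of 1.0, and the three step sizes are illustrative assumptions, not values from the tutorial.

```python
# one-dimensional bowl and its derivative (assumed toy problem)
def objective(x):
    return x ** 2.0

def derivative(x):
    return 2.0 * x

def gradient_descent(x0, step_size, n_iter):
    """Apply x(t+1) = x(t) - step_size * f'(x(t)) for n_iter steps."""
    x = x0
    for _ in range(n_iter):
        x = x - step_size * derivative(x)
    return x

# illustrative step sizes: too small, reasonable, too large
for step_size in (0.001, 0.1, 1.1):
    x_final = gradient_descent(1.0, step_size, 20)
    print('step_size=%.3f -> x=%.5f, f(x)=%.5f' % (step_size, x_final, objective(x_final)))
```

With these assumed values, the smallest step barely moves away from the start, the middle one gets close to the minimum, and the largest overshoots on every step and ends up further away than where it began.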


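Nesterov momentum is an extension of the gradient descent optimization algorithm. The approach was described by (and named for) Yurii Nesterov in his 1983 paper titled "A Method For Solving The Convex Programming Problem With Convergence Rate O(1/k^2)." Ilya Sutskever et al. are responsible for popularizing the application of Nesterov momentum to the training of neural networks with stochastic gradient descent, as described in their 2013 paper "On The Importance Of Initialization And Momentum In Deep Learning." They referred to the approach as "Nesterov's Accelerated Gradient," or NAG for short. Nesterov momentum is just like more traditional momentum, except that the update is performed using the partial derivative at the projected update rather than the derivative at the current variable value.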

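With traditional momentum, the last update (the last change) to the variable is added to the variable, scaled by a "momentum" hyperparameter that controls how much of the last change to add, e.g. 0.9 for 90%. It is easier to think about this update in two steps: first calculate the change in the variable using the partial derivative, then calculate the new value of the variable.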

change(t+1) = (momentum * change(t)) - (step_size * f'(x(t)))
x(t+1) = x(t) + change(t+1)
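The two-step momentum update above can be sketched directly in code. This is a minimal illustration on the assumed one-dimensional function f(x) = x^2; the helper name momentum_step, the starting point, and the hyperparameter values are hypothetical choices for the example.

```python
# derivative of the assumed toy objective f(x) = x^2
def derivative(x):
    return 2.0 * x

def momentum_step(x, change, step_size, momentum):
    # step 1: change(t+1) = (momentum * change(t)) - (step_size * f'(x(t)))
    new_change = (momentum * change) - (step_size * derivative(x))
    # step 2: x(t+1) = x(t) + change(t+1)
    return x + new_change, new_change

# start at x=1.0 with no previous change (assumed values)
x, change = 1.0, 0.0
for t in range(3):
    x, change = momentum_step(x, change, step_size=0.1, momentum=0.9)
    print('t=%d change=%.5f x=%.5f' % (t, change, x))
```

Note how the size of the change grows from one iteration to the next: each step carries over 90% of the previous change in addition to the new gradient step.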



Nesterov Momentum is easiest to think about in terms of the following four steps:



projection(t+1) = x(t) + (momentum * change(t))


gradient(t+1) = f'(projection(t+1))


change(t+1) = (momentum * change(t)) - (step_size * gradient(t+1))


x(t+1) = x(t) + change(t+1)
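The four steps above map one-to-one onto code. The sketch below applies them to the assumed one-dimensional function f(x) = x^2; the helper name nesterov_step, the starting point, and the hyperparameter values are hypothetical and chosen only for illustration.

```python
# derivative of the assumed toy objective f(x) = x^2
def derivative(x):
    return 2.0 * x

def nesterov_step(x, change, step_size, momentum):
    # 1. project the position using the previous change
    projection = x + (momentum * change)
    # 2. evaluate the gradient at the projected position
    gradient = derivative(projection)
    # 3. compute the new change from the projected gradient
    new_change = (momentum * change) - (step_size * gradient)
    # 4. move the actual position by the new change
    return x + new_change, new_change

# start at x=1.0 with no previous change (assumed values)
x, change = 1.0, 0.0
for t in range(3):
    x, change = nesterov_step(x, change, step_size=0.1, momentum=0.9)
    print('t=%d change=%.5f x=%.5f' % (t, change, x))
```

The only difference from plain momentum is where the gradient is evaluated: at the projected position rather than at the current one.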

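More generally, in the field of convex optimization, Nesterov Momentum is known to improve the rate of convergence of the optimization algorithm (e.g. reduce the number of iterations required to find the solution).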
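As a rough illustration of the convergence claim, the sketch below runs classical momentum and Nesterov momentum side by side on the assumed one-dimensional function f(x) = x^2 and compares how far each is from the minimum after the same number of iterations. The run() helper, the residual measure |x| + |change|, and the hyperparameter values are assumptions made for this example; the result holds for this particular setup and is not a general guarantee.

```python
# derivative of the assumed toy objective f(x) = x^2
def derivative(x):
    return 2.0 * x

def run(n_iter, step_size, momentum, use_nesterov, x0=1.0):
    x, change = x0, 0.0
    for _ in range(n_iter):
        # Nesterov evaluates the gradient at the projected point,
        # classical momentum evaluates it at the current point
        point = x + momentum * change if use_nesterov else x
        change = (momentum * change) - (step_size * derivative(point))
        x = x + change
    # distance from the minimum plus remaining velocity
    return abs(x) + abs(change)

residual_momentum = run(100, 0.1, 0.9, use_nesterov=False)
residual_nesterov = run(100, 0.1, 0.9, use_nesterov=True)
print('momentum residual: %.3e' % residual_momentum)
print('nesterov residual: %.3e' % residual_nesterov)
```

With a high momentum value such as 0.9, classical momentum oscillates around the minimum for longer, while the look-ahead gradient damps the oscillation sooner.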

Now that we are familiar with the Nesterov Momentum algorithm, let's explore how we might implement it and evaluate its performance.





# objective function
def objective(x, y):
    return x**2.0 + y**2.0


# 3d plot of the test function
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
r_min, r_max = -1.0, 1.0
# sample input range uniformly at 0.1 increments
xaxis = arange(r_min, r_max, 0.1)
yaxis = arange(r_min, r_max, 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a surface plot with the jet color scheme
figure = pyplot.figure()
# add_subplot is used because recent Matplotlib releases no longer accept figure.gca(projection='3d')
axis = figure.add_subplot(projection='3d')
axis.plot_surface(x, y, results, cmap='jet')
# show the plot
pyplot.show()

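Running the example creates a three-dimensional surface plot of the objective function. We can see the familiar bowl shape with the global minimum at f(0, 0) = 0.

We can also create a two-dimensional plot of the function. This will be helpful later when we want to plot the progress of the search. The example below creates a contour plot of the objective function.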

# contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# show the plot
pyplot.show()

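Running the example creates a two-dimensional contour plot of the objective function. We can see the bowl shape compressed into contours shown with a color gradient. We will use this plot to draw the specific points explored during the search.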



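We can apply gradient descent with Nesterov momentum to the test problem. First, we need a function that calculates the derivative of the objective function. The derivative of x^2 is x * 2 in each dimension, and the derivative() function below implements this.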

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])
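A quick way to gain confidence in a hand-written derivative is to compare it against a central finite-difference estimate. The sketch below does this for the test function; the numerical_gradient() helper, the step h, and the test point are assumptions for illustration, and derivative() returns a plain list here only to keep the sketch free of dependencies.

```python
# objective function from the tutorial
def objective(x, y):
    return x ** 2.0 + y ** 2.0

# analytic derivative (plain list instead of a NumPy array in this sketch)
def derivative(x, y):
    return [x * 2.0, y * 2.0]

# central finite-difference estimate of the gradient (hypothetical helper)
def numerical_gradient(x, y, h=1e-6):
    dx = (objective(x + h, y) - objective(x - h, y)) / (2.0 * h)
    dy = (objective(x, y + h) - objective(x, y - h)) / (2.0 * h)
    return [dx, dy]

# arbitrary test point inside the bounds
point = (0.3, -0.7)
analytic = derivative(*point)
numeric = numerical_gradient(*point)
max_error = max(abs(a - n) for a, n in zip(analytic, numeric))
print('analytic=%s numeric=%s max_error=%.2e' % (analytic, numeric, max_error))
```

If the two gradients disagree by more than a small tolerance, the analytic derivative is the first place to look for a bug.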




# generate an initial point
solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])


# calculate the projected solution
projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
# calculate the gradient for the projection
gradient = derivative(projected[0], projected[1])


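First, the change in each variable is calculated using the partial derivative and learning rate, together with the momentum applied to the variable's last change. This change is stored for the next iteration of the algorithm. The change is then used to calculate the new value of the variable.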

# build a solution one variable at a time
new_solution = list()
for i in range(solution.shape[0]):
    # calculate the change
    change[i] = (momentum * change[i]) - step_size * gradient[i]
    # calculate the new position in this variable
    value = solution[i] + change[i]
    # store this variable
    new_solution.append(value)


# evaluate candidate point
solution = asarray(new_solution)
solution_eval = objective(solution[0], solution[1])
# report progress
print('>%d f(%s) = %.5f' % (it, solution, solution_eval))

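And that's it. We can tie all of this together into a function named nesterov() that takes the names of the objective function and the derivative function, an array with the bounds of the domain, and hyperparameter values for the total number of algorithm iterations, the learning rate, and the momentum, and returns the final solution and its evaluation. The complete function is listed below.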

# gradient descent algorithm with nesterov momentum
def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of changes made to each variable
    change = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate the projected solution
        projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
        # calculate the gradient for the projection
        gradient = derivative(projected[0], projected[1])
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the change
            change[i] = (momentum * change[i]) - step_size * gradient[i]
            # calculate the new position in this variable
            value = solution[i] + change[i]
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

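Note that we have intentionally used lists and an imperative coding style instead of vectorized operations for readability. Feel free to adapt the implementation to a vectorized version with NumPy arrays for better performance.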
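One possible vectorized adaptation is sketched below. It keeps the same update order as the list-based function but operates on whole NumPy arrays at once. The name nesterov_vectorized, the omission of per-iteration progress printing, and the seed and hyperparameter values in the usage lines are assumptions for this sketch.

```python
from numpy import asarray
from numpy import zeros
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x ** 2.0 + y ** 2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# vectorized gradient descent with nesterov momentum (hypothetical variant)
def nesterov_vectorized(objective, derivative, bounds, n_iter, step_size, momentum):
    # generate an initial point inside the bounds
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # changes made to each variable, held as one array
    change = zeros(bounds.shape[0])
    for _ in range(n_iter):
        # project all variables at once
        projected = solution + momentum * change
        # gradient at the projected position
        gradient = derivative(projected[0], projected[1])
        # update the change and the position for all variables at once
        change = momentum * change - step_size * gradient
        solution = solution + change
    return solution, objective(solution[0], solution[1])

# assumed seed and hyperparameters for the demonstration
seed(1)
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
best, score = nesterov_vectorized(objective, derivative, bounds, 30, 0.1, 0.3)
print('f(%s) = %f' % (best, score))
```

For a two-variable problem the speed difference is negligible; the array form mainly pays off when the number of variables is large.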


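In this case, we will use 30 iterations of the algorithm with a learning rate of 0.1 and a momentum of 0.3. These hyperparameter values were found after a little trial and error.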
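The kind of trial and error mentioned above can be sketched as a small grid over candidate values. The example below is only an illustration: the fixed starting point, the candidate grids, and the helper name nesterov_final_score are assumptions, and the search is written with plain lists so that it has no dependencies.

```python
# objective function
def objective(x, y):
    return x ** 2.0 + y ** 2.0

# derivative of objective function (plain list in this sketch)
def derivative(x, y):
    return [x * 2.0, y * 2.0]

# run nesterov momentum from a fixed start and return the final objective value
def nesterov_final_score(start, n_iter, step_size, momentum):
    solution = list(start)
    change = [0.0 for _ in solution]
    for _ in range(n_iter):
        projected = [s + momentum * c for s, c in zip(solution, change)]
        gradient = derivative(projected[0], projected[1])
        change = [momentum * c - step_size * g for c, g in zip(change, gradient)]
        solution = [s + c for s, c in zip(solution, change)]
    return objective(solution[0], solution[1])

# assumed fixed starting point so that configurations are directly comparable
start = (0.5, -0.5)
results = {}
for step_size in (0.01, 0.1, 0.3):
    for momentum in (0.0, 0.3, 0.9):
        results[(step_size, momentum)] = nesterov_final_score(start, 30, step_size, momentum)

# report configurations from best to worst final score
for (step_size, momentum), score in sorted(results.items(), key=lambda item: item[1]):
    print('step_size=%.2f momentum=%.1f -> f = %.3e' % (step_size, momentum, score))
```

On a simple bowl like this one, many configurations work well; the ranking is likely to change on harder objective functions.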

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 30
# define the step size
step_size = 0.1
# define momentum
momentum = 0.3
# perform the gradient descent search with nesterov momentum
best, score = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)
print('f(%s) = %f' % (best, score))

Tying all of this together, the complete example of gradient descent optimization with Nesterov Momentum is listed below.

# gradient descent optimization with nesterov momentum for a two-dimensional test function
from numpy import asarray
from numpy.random import rand
from numpy.random import seed

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with nesterov momentum
def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of changes made to each variable
    change = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate the projected solution
        projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
        # calculate the gradient for the projection
        gradient = derivative(projected[0], projected[1])
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the change
            change[i] = (momentum * change[i]) - step_size * gradient[i]
            # calculate the new position in this variable
            value = solution[i] + change[i]
            # store this variable
            new_solution.append(value)
        # evaluate candidate point
        solution = asarray(new_solution)
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return [solution, solution_eval]

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 30
# define the step size
step_size = 0.1
# define momentum
momentum = 0.3
# perform the gradient descent search with nesterov momentum
best, score = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)
print('f(%s) = %f' % (best, score))

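Running the example applies the optimization algorithm with Nesterov Momentum to our test problem and reports the performance of the search at each iteration of the algorithm.

Note: your results may vary given the stochastic nature of the algorithm or evaluation procedure, or differences in numerical precision. Consider running the example a few times and comparing the average outcome.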
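One way to follow that advice is to repeat the search from several random starting points and average the final scores. The sketch below is a minimal illustration: it uses Python's standard random module rather than numpy.random to stay dependency-free, and the seeds 0 to 4 and the helper name nesterov_final_score are assumptions.

```python
from random import Random

# objective function
def objective(x, y):
    return x ** 2.0 + y ** 2.0

# derivative of objective function (plain list in this sketch)
def derivative(x, y):
    return [x * 2.0, y * 2.0]

# run nesterov momentum from a given start and return the final objective value
def nesterov_final_score(start, n_iter, step_size, momentum):
    solution = list(start)
    change = [0.0 for _ in solution]
    for _ in range(n_iter):
        projected = [s + momentum * c for s, c in zip(solution, change)]
        gradient = derivative(projected[0], projected[1])
        change = [momentum * c - step_size * g for c, g in zip(change, gradient)]
        solution = [s + c for s, c in zip(solution, change)]
    return objective(solution[0], solution[1])

# repeat the search with different seeds (assumed values) and collect the scores
scores = []
for run_seed in range(5):
    rng = Random(run_seed)
    start = [rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)]
    scores.append(nesterov_final_score(start, 30, 0.1, 0.3))

mean_score = sum(scores) / len(scores)
print('scores: %s' % ['%.3e' % s for s in scores])
print('mean final score: %.3e' % mean_score)
```

Because only the starting point is random in this algorithm, the spread across runs reflects how sensitive the final result is to where the search begins.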


>0 f([-0.13276479 0.35251919]) = 0.14190
>1 f([-0.09824595 0.2608642 ]) = 0.07770
>2 f([-0.07031223 0.18669416]) = 0.03980
>3 f([-0.0495457 0.13155452]) = 0.01976
>4 f([-0.03465259 0.0920101 ]) = 0.00967
>5 f([-0.02414772 0.06411742]) = 0.00469
>6 f([-0.01679701 0.04459969]) = 0.00227
>7 f([-0.01167344 0.0309955 ]) = 0.00110
>8 f([-0.00810909 0.02153139]) = 0.00053
>9 f([-0.00563183 0.01495373]) = 0.00026
>10 f([-0.00391092 0.01038434]) = 0.00012
>11 f([-0.00271572 0.00721082]) = 0.00006
>12 f([-0.00188573 0.00500701]) = 0.00003
>13 f([-0.00130938 0.0034767 ]) = 0.00001
>14 f([-0.00090918 0.00241408]) = 0.00001
>15 f([-0.0006313 0.00167624]) = 0.00000
>16 f([-0.00043835 0.00116391]) = 0.00000
>17 f([-0.00030437 0.00080817]) = 0.00000
>18 f([-0.00021134 0.00056116]) = 0.00000
>19 f([-0.00014675 0.00038964]) = 0.00000
>20 f([-0.00010189 0.00027055]) = 0.00000
>21 f([-7.07505806e-05 1.87858067e-04]) = 0.00000
>22 f([-4.91260884e-05 1.30440372e-04]) = 0.00000
>23 f([-3.41109926e-05 9.05720503e-05]) = 0.00000
>24 f([-2.36851711e-05 6.28892431e-05]) = 0.00000
>25 f([-1.64459397e-05 4.36675208e-05]) = 0.00000
>26 f([-1.14193362e-05 3.03208033e-05]) = 0.00000
>27 f([-7.92908415e-06 2.10534304e-05]) = 0.00000
>28 f([-5.50560682e-06 1.46185748e-05]) = 0.00000
>29 f([-3.82285090e-06 1.01504945e-05]) = 0.00000
f([-3.82285090e-06 1.01504945e-05]) = 0.000000



# gradient descent algorithm with nesterov momentum
def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of changes made to each variable
    change = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate the projected solution
        projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
        # calculate the gradient for the projection
        gradient = derivative(projected[0], projected[1])
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the change
            change[i] = (momentum * change[i]) - step_size * gradient[i]
            # calculate the new position in this variable
            value = solution[i] + change[i]
            # store this variable
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions


# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.01
# define momentum
momentum = 0.8
# perform the gradient descent search with nesterov momentum
solutions = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)


# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')


# plot the solutions as white dots joined by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')


# example of plotting the nesterov momentum search on a contour plot of the test function
from numpy import asarray
from numpy import arange
from numpy.random import rand
from numpy.random import seed
from numpy import meshgrid
from matplotlib import pyplot

# objective function
def objective(x, y):
    return x**2.0 + y**2.0

# derivative of objective function
def derivative(x, y):
    return asarray([x * 2.0, y * 2.0])

# gradient descent algorithm with nesterov momentum
def nesterov(objective, derivative, bounds, n_iter, step_size, momentum):
    # track all solutions
    solutions = list()
    # generate an initial point
    solution = bounds[:, 0] + rand(len(bounds)) * (bounds[:, 1] - bounds[:, 0])
    # list of changes made to each variable
    change = [0.0 for _ in range(bounds.shape[0])]
    # run the gradient descent
    for it in range(n_iter):
        # calculate the projected solution
        projected = [solution[i] + momentum * change[i] for i in range(solution.shape[0])]
        # calculate the gradient for the projection
        gradient = derivative(projected[0], projected[1])
        # build a solution one variable at a time
        new_solution = list()
        for i in range(solution.shape[0]):
            # calculate the change
            change[i] = (momentum * change[i]) - step_size * gradient[i]
            # calculate the new position in this variable
            value = solution[i] + change[i]
            # store this variable
            new_solution.append(value)
        # store the new solution
        solution = asarray(new_solution)
        solutions.append(solution)
        # evaluate candidate point
        solution_eval = objective(solution[0], solution[1])
        # report progress
        print('>%d f(%s) = %.5f' % (it, solution, solution_eval))
    return solutions

# seed the pseudo random number generator
seed(1)
# define range for input
bounds = asarray([[-1.0, 1.0], [-1.0, 1.0]])
# define the total iterations
n_iter = 50
# define the step size
step_size = 0.01
# define momentum
momentum = 0.8
# perform the gradient descent search with nesterov momentum
solutions = nesterov(objective, derivative, bounds, n_iter, step_size, momentum)
# sample input range uniformly at 0.1 increments
xaxis = arange(bounds[0,0], bounds[0,1], 0.1)
yaxis = arange(bounds[1,0], bounds[1,1], 0.1)
# create a mesh from the axis
x, y = meshgrid(xaxis, yaxis)
# compute targets
results = objective(x, y)
# create a filled contour plot with 50 levels and jet color scheme
pyplot.contourf(x, y, results, levels=50, cmap='jet')
# plot the solutions as white dots joined by a line
solutions = asarray(solutions)
pyplot.plot(solutions[:, 0], solutions[:, 1], '.-', color='w')
# show the plot
pyplot.show()



