Contents

Overview
Fancier Optimization
- SGD + Momentum
- - Nesterov Momentum
- AdaGrad
- RMSProp
- Adam
- Hyperparameter Tuning
- Model Ensembles
Regularization
- Common Regularization Methods
- Dropout
- The Idea Behind Regularization
- Other Regularization Methods
Transfer Learning

Overview

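This lecture covers three topics:

Fancier optimization
Regularization
Transfer learning
The first topic is the main focus, although most of it already appears in scattered form in Andrew Ng's and Hung-yi Lee's courses.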

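Fancier Optimization

The lecture mentions many optimization methods, and I list them briefly here. A few of them are not covered in Andrew Ng's course, and the treatment in this lecture is fairly cursory.
Problems with SGD: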
What if loss changes quickly in one direction and slowly in another?
What does gradient descent do?
Very slow progress along shallow dimension, jitter along steep direction

What if the loss function has a local minimum or saddle point?
Zero gradient, gradient descent gets stuck.
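
To make the first problem concrete, here is a minimal sketch in plain Python with no dependencies. The toy loss f(x, y) = 0.5 * (x^2 + 50 * y^2), the learning rate, and the function names are assumptions chosen for illustration, not taken from the lecture. Full-gradient descent stands in for SGD, since the conditioning problem is the same.

```python
def grad(x, y):
    # Gradient of the assumed toy loss f(x, y) = 0.5 * (x**2 + 50 * y**2):
    # shallow along x, steep along y.
    return x, 50.0 * y


def gd_path(x, y, lr=0.035, steps=20):
    # Record every iterate so the behaviour along each axis can be inspected.
    path = [(x, y)]
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
        path.append((x, y))
    return path


path = gd_path(1.0, 1.0)
# The y coordinate flips sign on every step (jitter along the steep direction),
# while the x coordinate shrinks by only 3.5% per step (slow progress).
```

With these assumed settings the steep coordinate overshoots the minimum on every step, and the shallow coordinate has covered only about half the distance after 20 steps. Lowering the learning rate removes the jitter but makes the shallow direction slower still.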

SGD + Momentum

Build up “velocity” as a running mean of gradients
Rho gives “friction”; typically rho=0.9 or 0.99
The variable names here differ from those used in Andrew Ng's course.
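
A minimal sketch of the update, assuming the same illustrative toy loss f(x, y) = 0.5 * (x^2 + 50 * y^2). The function names, learning rate, and step count are hypothetical. The velocity accumulates past gradients and rho plays the role of friction.

```python
def grad(x, y):
    # Gradient of the assumed toy loss f(x, y) = 0.5 * (x**2 + 50 * y**2).
    return x, 50.0 * y


def sgd(x, y, lr=0.01, steps=100):
    # Plain gradient descent, kept only as a baseline for comparison.
    for _ in range(steps):
        gx, gy = grad(x, y)
        x, y = x - lr * gx, y - lr * gy
    return x, y


def sgd_momentum(x, y, lr=0.01, rho=0.9, steps=100):
    vx, vy = 0.0, 0.0  # velocity starts at rest
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Velocity is a decaying running sum of past gradients; rho acts as friction.
        vx = rho * vx + gx
        vy = rho * vy + gy
        # Step along the velocity instead of the raw gradient.
        x, y = x - lr * vx, y - lr * vy
    return x, y


plain = sgd(1.0, 1.0)
with_momentum = sgd_momentum(1.0, 1.0)
```

On this toy problem the shallow x coordinate is still far from the minimum after 100 plain steps, whereas the momentum run brings both coordinates close to zero with the same learning rate.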

Nesterov Momentum

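I did not fully follow this method, so I am just noting it down for now.

The lecture points out that the algorithm is awkward to use as stated (the gradient is needed at a look-ahead point rather than at the point where the loss is evaluated), but a change of variables resolves this.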
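
The change of variables can be checked numerically. The sketch below assumes the common formulation v <- rho * v - lr * grad(x + rho * v), x <- x + v, and substitutes x_tilde = x + rho * v so that the gradient is taken at the stored variable itself. The 1-D toy loss and all names are assumptions for illustration.

```python
def grad(x):
    # Gradient of the assumed 1-D toy loss f(x) = 2 * x**2.
    return 4.0 * x


def nesterov_lookahead(x, lr=0.05, rho=0.9, steps=30):
    # Original form: the gradient is evaluated at the look-ahead point x + rho * v.
    v = 0.0
    for _ in range(steps):
        v = rho * v - lr * grad(x + rho * v)
        x = x + v
    return x, v


def nesterov_substituted(x_tilde, lr=0.05, rho=0.9, steps=30):
    # Substituted form: track x_tilde = x + rho * v, so the gradient is
    # evaluated at the same point that is stored.
    v = 0.0
    for _ in range(steps):
        v_old = v
        v = rho * v_old - lr * grad(x_tilde)
        x_tilde = x_tilde - rho * v_old + (1.0 + rho) * v
    return x_tilde, v


x, v = nesterov_lookahead(1.0)
# Velocity starts at zero, so x_tilde starts at the same value as x.
x_tilde, v_sub = nesterov_substituted(1.0)
```

Both loops produce the same velocity, and the substituted variable equals x + rho * v from the original loop up to rounding, which is why the rewritten form can be dropped into code that evaluates the loss and the gradient at one point.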

AdaGrad

Q: What happens with AdaGrad?
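Suppose there are two coordinate axes, one with large gradients and one with small gradients. Along the small-gradient axis, AdaGrad divides the step by a small number (the square root of the accumulated squared gradients), which speeds up progress in that direction. Along the large-gradient axis it divides by a large number, which damps progress there.
Q2: What happens to the step size over long time?
The step size shrinks steadily over time, because the accumulated sum only grows.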
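
Both answers can be seen in a minimal sketch, again assuming the illustrative toy loss f(x, y) = 0.5 * (x^2 + 50 * y^2). The function names and hyperparameter values are hypothetical. The sketch records the effective per-coordinate learning rate at every step.

```python
import math


def grad(x, y):
    # Gradient of the assumed toy loss f(x, y) = 0.5 * (x**2 + 50 * y**2).
    return x, 50.0 * y


def adagrad(x, y, lr=0.5, eps=1e-7, steps=50):
    sx, sy = 0.0, 0.0  # running sums of squared gradients, one per coordinate
    scales = []
    for _ in range(steps):
        gx, gy = grad(x, y)
        sx += gx * gx
        sy += gy * gy
        # Effective learning rate: large where gradients have been small,
        # small where gradients have been large.
        scale_x = lr / (math.sqrt(sx) + eps)
        scale_y = lr / (math.sqrt(sy) + eps)
        x -= scale_x * gx
        y -= scale_y * gy
        scales.append((scale_x, scale_y))
    return x, y, scales


x, y, scales = adagrad(1.0, 1.0)
```

The shallow x axis gets the larger effective learning rate, and because the sums never decrease, the effective learning rate on each axis can only shrink as training goes on.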

RMSProp

(Figure legend) Black: SGD; blue: SGD + Momentum; red: RMSProp
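
A minimal sketch of RMSProp, assuming the usual formulation in which AdaGrad's running sum of squared gradients is replaced by an exponentially decaying average. The toy loss f(x, y) = 0.5 * (x^2 + 50 * y^2), the function names, and the hyperparameter values are assumptions for illustration.

```python
import math


def grad(x, y):
    # Gradient of the assumed toy loss f(x, y) = 0.5 * (x**2 + 50 * y**2).
    return x, 50.0 * y


def rmsprop(x, y, lr=0.01, decay=0.9, eps=1e-7, steps=1):
    sx, sy = 0.0, 0.0  # decaying averages of squared gradients
    for _ in range(steps):
        gx, gy = grad(x, y)
        # Old squared gradients leak away, so the denominator does not grow without bound.
        sx = decay * sx + (1.0 - decay) * gx * gx
        sy = decay * sy + (1.0 - decay) * gy * gy
        x -= lr * gx / (math.sqrt(sx) + eps)
        y -= lr * gy / (math.sqrt(sy) + eps)
    return x, y


first = rmsprop(1.0, 1.0, steps=1)
later = rmsprop(1.0, 1.0, steps=500)
```

On the first step both coordinates move by the same amount even though their gradients differ by a factor of 50, which shows the per-coordinate normalization. After many steps both coordinates sit in a small neighbourhood of the minimum.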

Adam

Bias correction for the fact that first and second moment estimates start at zero
Adam with beta1 = 0.9, beta2 = 0.999, and learning_rate = 1e-3 or 5e-4 is a great starting point for many models!
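
The effect of the bias correction shows up on the very first step. Below is a minimal sketch of Adam using the defaults quoted above. The toy loss f(x, y) = 0.5 * (x^2 + 50 * y^2), the function names, and the `correct_bias` switch are assumptions added for illustration.

```python
import math


def grad(x, y):
    # Gradient of the assumed toy loss f(x, y) = 0.5 * (x**2 + 50 * y**2).
    return x, 50.0 * y


def adam(x, y, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8, steps=1, correct_bias=True):
    params = [x, y]
    m = [0.0, 0.0]  # first moment: decaying average of gradients
    v = [0.0, 0.0]  # second moment: decaying average of squared gradients
    for t in range(1, steps + 1):
        g = grad(params[0], params[1])
        for i in range(2):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] * g[i]
            if correct_bias:
                # Both moments start at zero, so early estimates are too small;
                # dividing by (1 - beta**t) undoes that.
                m_hat = m[i] / (1.0 - beta1 ** t)
                v_hat = v[i] / (1.0 - beta2 ** t)
            else:
                m_hat, v_hat = m[i], v[i]
            params[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return params[0], params[1]


corrected = adam(1.0, 1.0)
uncorrected = adam(1.0, 1.0, correct_bias=False)
```

With the correction, the first step has size close to the learning rate in every coordinate. Without it, the second moment is underestimated more strongly than the first, and the first step comes out roughly three times larger.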

Hyperparameter Tuning

SGD, SGD+Momentum, Adagrad, RMSProp, Adam all have learning rate as a hyperparameter.
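The learning rate is the single most important hyperparameter!
The lecture then covers quasi-Newton methods, but only very briefly. For details, see Li Hang's "Statistical Learning Methods" (《统计学习方法》).
It then mentions the BFGS algorithm.

Quasi-Newton methods (BFGS most popular): instead of inverting the Hessian (O(n^3)), approximate the inverse Hessian with rank 1 updates over time (O(n^2) each).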
L-BFGS (Limited memory BFGS): Does not form/store the full inverse Hessian.
L-BFGS usually works very well in full batch, deterministic mode
i.e. if you have a single, deterministic f(x) then L-BFGS will probably work very nicely
L-BFGS does not transfer very well to mini-batch setting.
Gives bad results. Adapting L-BFGS to large-scale, stochastic setting is an active area of research.
Recommendations:
Adam is a good default choice in most cases
If you can afford to do full batch updates then try out L-BFGS (and don’t forget to disable all sources of noise)

Model Ensembles

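Hung-yi Lee's course explains this topic clearly, but the material here is more cutting-edge, for example building the ensemble from snapshots taken during training.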
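
One simple way to combine models, whether independently trained networks or snapshots of a single run, is to average their predicted class probabilities. The sketch below assumes that combination rule. The probability values and function names are made up for illustration.

```python
def ensemble_average(prob_lists):
    # prob_lists[k][c] is the probability that model k assigns to class c.
    n_models = len(prob_lists)
    n_classes = len(prob_lists[0])
    return [sum(p[c] for p in prob_lists) / n_models for c in range(n_classes)]


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


# Hypothetical outputs of three snapshots for one input with three classes.
snapshot_probs = [
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.3, 0.5, 0.2],
]
averaged = ensemble_average(snapshot_probs)
prediction = argmax(averaged)
```

Here the first snapshot on its own would pick class 0, but the averaged prediction is class 1, which the other two snapshots favour.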

Regularization

Common Regularization Methods

Dropout

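The most common regularization method for neural networks, bar none.
In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is common.
In fully connected layers, dropout randomly sets some hidden-layer neurons to zero. It can also be used in the convolutional layers of a CNN, where entire feature maps are set to zero instead.
Why does dropout work?
The first explanation matches the "training with your hands tied" analogy from Hung-yi Lee's course. The other explanation is:
Dropout is training a large ensemble of models (that share parameters).
Each binary mask is one model
It can be viewed as training different subsets of the network and then combining them as an ensemble.
On why activations are multiplied by the keep probability p (or divided by it) when dropout is used:

The multiplication or division can be moved into the training phase, where it benefits from GPU parallelism, so that the test phase is left unchanged.
Dropout makes training take longer, but the resulting model is more robust.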
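
Moving the scaling into training is usually called inverted dropout. Below is a minimal sketch in plain Python, assuming a keep probability p_keep and activations stored as a flat list. The function names and the seeded generator are choices made for illustration.

```python
import random


def dropout_train(activations, p_keep=0.5, rng=random):
    # Training: drop each unit with probability 1 - p_keep and divide the
    # survivors by p_keep, so the expected activation is unchanged.
    out = []
    for a in activations:
        keep = rng.random() < p_keep
        out.append(a / p_keep if keep else 0.0)
    return out


def dropout_test(activations):
    # Testing: no mask and no rescaling, because the scaling was done in training.
    return list(activations)


rng = random.Random(0)  # seeded so the run is repeatable
hidden = [1.0] * 10000
train_out = dropout_train(hidden, p_keep=0.5, rng=rng)
test_out = dropout_test(hidden)
```

Every training activation is either 0 or twice its original value, and the average over many units stays close to the original activation, which is what lets the test-time pass use the network as is.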

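The Idea Behind Regularization

The lecture gives a general recipe for regularization: inject some randomness during training to keep the model from overfitting the training data, then remove that randomness at test time, so that the model generalizes better.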
Training: Add random noise
Testing: Marginalize over the noise

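Other Regularization Methods

Methods that follow this idea include batch normalization (BN) and data augmentation.
There is also a dropout-like algorithm, DropConnect, which zeroes weights rather than activations.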
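
The difference from dropout is only where the random mask is applied. The sketch below shows a training-time forward pass of one fully connected layer with a mask on individual weights. It is a simplified illustration under assumed names and shapes, and it leaves out any rescaling and the test-time procedure.

```python
import random


def dropconnect_forward(weights, inputs, p_keep=0.5, rng=random):
    # weights[j][i] connects input i to output unit j.
    outputs = []
    for row in weights:
        total = 0.0
        for w, x in zip(row, inputs):
            # Each individual weight is kept with probability p_keep;
            # a dropped weight contributes nothing to the sum.
            if rng.random() < p_keep:
                total += w * x
        outputs.append(total)
    return outputs


weights = [[1.0, 2.0], [3.0, 4.0]]
inputs = [1.0, 1.0]
dense = dropconnect_forward(weights, inputs, p_keep=1.0)   # nothing dropped
silent = dropconnect_forward(weights, inputs, p_keep=0.0)  # everything dropped
```

With p_keep set to 1.0 the layer reduces to an ordinary matrix-vector product, and with p_keep set to 0.0 every connection is removed, so the two extremes bracket the behaviour at intermediate keep probabilities.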

Fractional Max Pooling

Stochastic Depth

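Are these regularization methods used together?
Usually you start with BN alone. If overfitting shows up, add other regularization methods. You would not start with several methods at once.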

Transfer Learning

(Figure: the familiar table...)

Summary:
Takeaway for your projects and beyond:
Have some dataset of interest but it has < ~1M images?

Find a very large dataset that has similar data, train a big ConvNet there
Transfer learn to your dataset
Deep learning frameworks provide a “Model Zoo” of pretrained models so you don’t need to train your own
Caffe: https://github.com/BVLC/caffe/wiki/Model-Zoo
TensorFlow: https://github.com/tensorflow/models
PyTorch: https://github.com/pytorch/vision


2017 CS231n Notes 7: Training Neural Networks (Part 2)