[Placeholder] Equalized learning rate implementation

Preface

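I have recently been reading the GPEN code. Its Generator borrows from StyleGAN2 and includes quite a few tricks. This post opens a placeholder on "Equalized Learning Rate" (ELR), to be filled in later.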

ELR background

ELR originates from PG-GAN and has been carried over through the StyleGAN series. Its purpose is to stabilize training.

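As for the concrete implementation (it is applied in linear layers and conv2d layers), in short:

1) When initializing the layer weights, no fancy initialization scheme is used; the weights are simply drawn at random from N(0, 1).
2) During training, the layer weights are rescaled at runtime. The scaling constant c is the per-layer constant from Kaiming (He) initialization, computed from the layer's fan_in; its exact value varies with the model. On every forward pass the layer's weights are scaled by c (i.e. self.scale in implementation 1; see "per-layer normalization constant from He's initializer" in Section 4.1 of the PG-GAN paper).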
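
The two steps above can be sketched in a few lines. This is a minimal, standard-library-only illustration and not the GPEN or StyleGAN2 code: the class name `EqualizedLinear`, the `gain` parameter and the choice `gain=2.0` are assumptions made for the example (real implementations choose different gains), and only `self.scale` mirrors a name mentioned in the text.

```python
import math
import random


class EqualizedLinear:
    """Hypothetical linear layer with a runtime weight scale (ELR sketch)."""

    def __init__(self, in_features, out_features, gain=2.0, seed=0):
        rng = random.Random(seed)
        # Step 1: plain N(0, 1) initialization, with no fan-dependent scaling.
        self.weight = [[rng.gauss(0.0, 1.0) for _ in range(in_features)]
                       for _ in range(out_features)]
        self.bias = [0.0] * out_features
        # Step 2: per-layer constant from He's initializer.
        # Assumption: c = sqrt(gain / fan_in); the gain differs between models.
        self.scale = math.sqrt(gain / in_features)

    def forward(self, x):
        # The stored weights stay N(0, 1); c is applied on every forward pass.
        return [self.scale * sum(w * xi for w, xi in zip(row, x)) + b
                for row, b in zip(self.weight, self.bias)]


layer = EqualizedLinear(in_features=512, out_features=256)
rng = random.Random(1)
x = [rng.gauss(0.0, 1.0) for _ in range(512)]
y = layer.forward(x)
print(len(y), layer.scale)  # 256 0.0625
```

The optimizer only ever sees the unscaled N(0, 1) weights, while the effective weights `self.scale * weight` have the standard deviation that He initialization would have given them.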

Motivation for ELR

Again, quoting the PG-GAN paper:

The benefit of doing this dynamically instead of during initialization is somewhat subtle, and relates to the scale-invariance in commonly used adaptive stochastic gradient descent methods such as RMSProp (Tieleman & Hinton, 2012) and Adam (Kingma & Ba, 2015). These methods normalize a gradient update by its estimated standard deviation, thus making the update independent of the scale of the parameter. As a result, if some parameters have a larger dynamic range than others, they will take longer to adjust. This is a scenario modern initializers cause, and thus it is possible that a learning rate is both too large and too small at the same time. Our approach ensures that the dynamic range, and thus the learning speed, is the same for all weights. A similar reasoning was independently used by van Laarhoven (2017).

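In other words, ELR affects how optimizers such as Adam and RMSProp update the parameters. What Adam and RMSProp have in common is that the step size takes into account a second-moment estimate of the gradient (i.e. the gradient's uncentered variance). Dividing by the square root of this estimate roughly brings every parameter's step to the same scale, whatever the magnitude of its gradient. If the value ranges of different parameters (the "dynamic range" in the quote above) differ widely, parameters with a large range need many more updates to adjust, while the same step may be too large for parameters with a small range. ELR normalizes the parameters themselves so that all of them have a similar range; when Adam or RMSProp is then used, every parameter takes a similar step relative to its range, so the learning speed is the same for all weights.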
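
The scale-invariance described above can be checked numerically. The sketch below is a hand-written scalar version of the standard Adam update, not code from any of the papers or repositories mentioned here; the function name `adam_steps`, the constant-gradient setup and the two weight magnitudes are assumptions chosen for illustration.

```python
import math


def adam_steps(grad, n_steps=3, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam update sizes for one parameter that sees a constant gradient."""
    m = v = 0.0
    steps = []
    for t in range(1, n_steps + 1):
        m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
        v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
        m_hat = m / (1 - beta1 ** t)                # bias correction
        v_hat = v / (1 - beta2 ** t)
        steps.append(lr * m_hat / (math.sqrt(v_hat) + eps))
    return steps


small = adam_steps(grad=1e-3)
large = adam_steps(grad=1e3)
print(small, large)  # both stay close to lr = 0.001

# The same absolute step is a very different relative change for
# weights of different typical magnitude (illustrative values).
for typical_weight in (0.0625, 1.0):
    print(typical_weight, small[0] / typical_weight)
```

Although the two gradients differ by six orders of magnitude, the step is about `lr` in both cases. A weight initialized with a small standard deviation therefore changes proportionally faster than one initialized with a large standard deviation, which is the imbalance ELR removes by keeping all stored weights at N(0, 1).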

How ELR works

To understand how ELR works, first revisit Kaiming He's weight initialization method:

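Weight Initialization in Neural Networks: A Journey From the Basics to Kaiming (very accessible)
Derivation of Kaiming initialization (once the idea is clear, deepen the understanding through the mathematical derivation)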

That is, with Kaiming He initialization:

During forward propagation, the variance of each layer's convolution output is 1
During backpropagation