1. The predecessor of GroupNormalization - BatchNormalization

Batch Normalization ("BN" for short) is a special layer in neural networks. If the batch_size is n, then during the forward pass every node in a layer produces n outputs (one per sample in the batch); BN normalizes these n outputs of each node before passing them on. The computation is as follows:
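Concretely, for a mini-batch B = {x_1, ..., x_m} of the outputs of one node (m being the batch_size n above), the standard computation from the Ioffe & Szegedy paper is:

\mu_{\mathcal{B}} = \frac{1}{m}\sum_{i=1}^{m} x_i
\sigma_{\mathcal{B}}^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_{\mathcal{B}}\right)^2
\hat{x}_i = \frac{x_i - \mu_{\mathcal{B}}}{\sqrt{\sigma_{\mathcal{B}}^2 + \epsilon}}
y_i = \gamma\,\hat{x}_i + \beta

where \gamma and \beta are learned scale and offset parameters and \epsilon is a small constant for numerical stability.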

The Keras implementation is shown below (this is library source code, so feel free to skip it):

class BatchNormalization(Layer):
    """Batch normalization layer (Ioffe and Szegedy, 2014).

    Normalize the activations of the previous layer at each batch,
    i.e. applies a transformation that maintains the mean activation
    close to 0 and the activation standard deviation close to 1.

    # Arguments
        axis: Integer, the axis that should be normalized
            (typically the features axis).
            For instance, after a `Conv2D` layer with
            `data_format="channels_first"`,
            set `axis=1` in `BatchNormalization`.
        momentum: Momentum for the moving mean and the moving variance.
        epsilon: Small float added to variance to avoid dividing by zero.
        center: If True, add offset of `beta` to normalized tensor.
            If False, `beta` is ignored.
        scale: If True, multiply by `gamma`.
            If False, `gamma` is not used.
            When the next layer is linear (also e.g. `nn.relu`),
            this can be disabled since the scaling
            will be done by the next layer.
        beta_initializer: Initializer for the beta weight.
        gamma_initializer: Initializer for the gamma weight.
        moving_mean_initializer: Initializer for the moving mean.
        moving_variance_initializer: Initializer for the moving variance.
        beta_regularizer: Optional regularizer for the beta weight.
        gamma_regularizer: Optional regularizer for the gamma weight.
        beta_constraint: Optional constraint for the beta weight.
        gamma_constraint: Optional constraint for the gamma weight.

    # Input shape
        Arbitrary. Use the keyword argument `input_shape`
        (tuple of integers, does not include the samples axis)
        when using this layer as the first layer in a model.

    # Output shape
        Same shape as input.

    # References
        - [Batch Normalization: Accelerating Deep Network Training by
           Reducing Internal Covariate Shift](https://arxiv.org/abs/1502.03167)
    """

    @interfaces.legacy_batchnorm_support
    def __init__(self,
                 axis=-1,
                 momentum=0.99,
                 epsilon=1e-3,
                 center=True,
                 scale=True,
                 beta_initializer='zeros',
                 gamma_initializer='ones',
                 moving_mean_initializer='zeros',
                 moving_variance_initializer='ones',
                 beta_regularizer=None,
                 gamma_regularizer=None,
                 beta_constraint=None,
                 gamma_constraint=None,
                 **kwargs):
        super(BatchNormalization, self).__init__(**kwargs)
        self.supports_masking = True
        self.axis = axis
        self.momentum = momentum
        self.epsilon = epsilon
        self.center = center
        self.scale = scale
        self.beta_initializer = initializers.get(beta_initializer)
        self.gamma_initializer = initializers.get(gamma_initializer)
        self.moving_mean_initializer = initializers.get(moving_mean_initializer)
        self.moving_variance_initializer = (
            initializers.get(moving_variance_initializer))
        self.beta_regularizer = regularizers.get(beta_regularizer)
        self.gamma_regularizer = regularizers.get(gamma_regularizer)
        self.beta_constraint = constraints.get(beta_constraint)
        self.gamma_constraint = constraints.get(gamma_constraint)

    def build(self, input_shape):
        dim = input_shape[self.axis]
        if dim is None:
            raise ValueError('Axis ' + str(self.axis) + ' of '
                             'input tensor should have a defined dimension '
                             'but the layer received an input with shape ' +
                             str(input_shape) + '.')
        self.input_spec = InputSpec(ndim=len(input_shape),
                                    axes={self.axis: dim})
        shape = (dim,)

        if self.scale:
            self.gamma = self.add_weight(shape=shape,
                                         name='gamma',
                                         initializer=self.gamma_initializer,
                                         regularizer=self.gamma_regularizer,
                                         constraint=self.gamma_constraint)
        else:
            self.gamma = None
        if self.center:
            self.beta = self.add_weight(shape=shape,
                                        name='beta',
                                        initializer=self.beta_initializer,
                                        regularizer=self.beta_regularizer,
                                        constraint=self.beta_constraint)
        else:
            self.beta = None
        self.moving_mean = self.add_weight(
            shape=shape,
            name='moving_mean',
            initializer=self.moving_mean_initializer,
            trainable=False)
        self.moving_variance = self.add_weight(
            shape=shape,
            name='moving_variance',
            initializer=self.moving_variance_initializer,
            trainable=False)
        self.built = True

    def call(self, inputs, training=None):
        input_shape = K.int_shape(inputs)
        # Prepare broadcasting shape.
        ndim = len(input_shape)
        reduction_axes = list(range(len(input_shape)))
        del reduction_axes[self.axis]
        broadcast_shape = [1] * len(input_shape)
        broadcast_shape[self.axis] = input_shape[self.axis]

        # Determines whether broadcasting is needed.
        needs_broadcasting = (sorted(reduction_axes) != list(range(ndim))[:-1])

        def normalize_inference():
            if needs_broadcasting:
                # In this case we must explicitly broadcast all parameters.
                broadcast_moving_mean = K.reshape(self.moving_mean,
                                                  broadcast_shape)
                broadcast_moving_variance = K.reshape(self.moving_variance,
                                                      broadcast_shape)
                if self.center:
                    broadcast_beta = K.reshape(self.beta, broadcast_shape)
                else:
                    broadcast_beta = None
                if self.scale:
                    broadcast_gamma = K.reshape(self.gamma, broadcast_shape)
                else:
                    broadcast_gamma = None
                return K.batch_normalization(
                    inputs,
                    broadcast_moving_mean,
                    broadcast_moving_variance,
                    broadcast_beta,
                    broadcast_gamma,
                    axis=self.axis,
                    epsilon=self.epsilon)
            else:
                return K.batch_normalization(
                    inputs,
                    self.moving_mean,
                    self.moving_variance,
                    self.beta,
                    self.gamma,
                    axis=self.axis,
                    epsilon=self.epsilon)

        # If the learning phase is *static* and set to inference:
        if training in {0, False}:
            return normalize_inference()

        # If the learning is either dynamic, or set to training:
        normed_training, mean, variance = K.normalize_batch_in_training(
            inputs, self.gamma, self.beta, reduction_axes,
            epsilon=self.epsilon)

        if K.backend() != 'cntk':
            sample_size = K.prod([K.shape(inputs)[axis]
                                  for axis in reduction_axes])
            sample_size = K.cast(sample_size, dtype=K.dtype(inputs))

            # sample variance - unbiased estimator of population variance
            variance *= sample_size / (sample_size - (1.0 + self.epsilon))

        self.add_update([K.moving_average_update(self.moving_mean,
                                                 mean,
                                                 self.momentum),
                         K.moving_average_update(self.moving_variance,
                                                 variance,
                                                 self.momentum)],
                        inputs)

        # Pick the normalized form corresponding to the training phase.
        return K.in_train_phase(normed_training,
                                normalize_inference,
                                training=training)

    def get_config(self):
        config = {
            'axis': self.axis,
            'momentum': self.momentum,
            'epsilon': self.epsilon,
            'center': self.center,
            'scale': self.scale,
            'beta_initializer': initializers.serialize(self.beta_initializer),
            'gamma_initializer': initializers.serialize(self.gamma_initializer),
            'moving_mean_initializer':
                initializers.serialize(self.moving_mean_initializer),
            'moving_variance_initializer':
                initializers.serialize(self.moving_variance_initializer),
            'beta_regularizer': regularizers.serialize(self.beta_regularizer),
            'gamma_regularizer': regularizers.serialize(self.gamma_regularizer),
            'beta_constraint': constraints.serialize(self.beta_constraint),
            'gamma_constraint': constraints.serialize(self.gamma_constraint)
        }
        base_config = super(BatchNormalization, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

    def compute_output_shape(self, input_shape):
        return input_shape
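A minimal usage sketch of this layer in a hypothetical model (assuming the standalone keras package whose source is quoted above); BatchNormalization is typically placed between a convolution and its activation:

from keras.models import Sequential
from keras.layers import Conv2D, Activation, BatchNormalization, GlobalAveragePooling2D, Dense

model = Sequential([
    Conv2D(32, (3, 3), padding='same', input_shape=(32, 32, 3)),
    BatchNormalization(axis=-1),   # normalize over the channel axis for channels_last data
    Activation('relu'),
    GlobalAveragePooling2D(),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()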

In practice, BN layers work quite well inside the convolutional blocks of deep networks:

  1. They keep the input distribution of each layer relatively stable, which speeds up training.
  2. They make the model less sensitive to the network's parameters, simplifying hyperparameter tuning and making training more stable.
  3. They allow the network to use saturating activation functions, alleviating the vanishing-gradient problem.
  4. They provide a certain regularization effect.

However, BN has a notable drawback: it requires a sufficiently large batch_size. A small batch_size makes the batch statistics inaccurate, which greatly increases the model's error rate. In everyday practice, though, increasing the batch_size runs into GPU memory limits, so the two needs conflict. As the batch_size-vs-error chart below shows, once the batch_size drops from 16 to 8, BN's error starts to rise sharply.
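The root cause is easy to reproduce: with few samples per batch, the batch mean and variance are themselves noisy estimates of the true statistics. A tiny illustrative NumPy sketch (made-up data, not from the paper):

import numpy as np

rng = np.random.default_rng(0)
population = rng.normal(loc=0.0, scale=1.0, size=100000)   # activations of one node

for batch_size in (2, 8, 16, 64):
    # draw many batches and look at how much the batch mean fluctuates
    batches = rng.choice(population, size=(1000, batch_size))
    batch_means = batches.mean(axis=1)
    print(batch_size, batch_means.std())   # spread shrinks roughly as 1/sqrt(batch_size)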

2. The young successor to BatchNormalization - GroupNormalization

So, is there a newer scheme that overcomes, or at least alleviates, BatchNormalization's shortcomings? After all, in practice not everyone has enough GPU memory to run with large batches. GroupNormalization was born for exactly this reason. It was proposed by Kaiming He's team in March 2018 (calling it a young successor seems fair - it is young, after all), and its main contribution is removing BN's weakness of large training error at small batch_size.
From the figure above we can also clearly see that GroupNormalization's error is almost identical across all batch sizes. So how exactly does its implementation differ from BN?

The two in the middle, LayerNormalization and InstanceNormalization, are two other derivatives of BatchNormalization, but this post is about GroupNormalization, so I will not compare them here; if you are interested, please refer to their papers (or I may update this post later to cover them):

BatchNormalization: essentially normalizes along the batch dimension, computing the mean over N×H×W.
GroupNormalization: essentially splits the channel dimension into groups and normalizes within each group, computing the mean over (C//G)×H×W.
Here, C and N denote the number of channels and the batch_size, W and H are the width and height of the feature map, and the feature-map tensor is written as [N, W, H, C].
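To make the two reductions concrete, here is a small NumPy sketch (shapes are hypothetical) of which axes each method averages over for a channels_last feature map:

import numpy as np

N, H, W, C, G = 4, 8, 8, 32, 8            # made-up shapes; C must be divisible by G
x = np.random.randn(N, H, W, C).astype(np.float32)

# BatchNormalization: one mean/var per channel, reduced over N x H x W
bn_mean = x.mean(axis=(0, 1, 2))           # shape (C,)

# GroupNormalization: one mean/var per (sample, group), reduced over (C // G) x H x W
xg = x.reshape(N, H, W, G, C // G)
gn_mean = xg.mean(axis=(1, 2, 4))          # shape (N, G)

print(bn_mean.shape, gn_mean.shape)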

A PyTorch implementation of GroupNormalization:

import numpy as np
import torch
import torch.nn as nn


class GroupNorm(nn.Module):
    def __init__(self, num_features, num_groups=32, eps=1e-5):
        super(GroupNorm, self).__init__()
        # per-channel affine parameters (gamma and beta), broadcast over N, H, W
        self.weight = nn.Parameter(torch.ones(1, num_features, 1, 1))
        self.bias = nn.Parameter(torch.zeros(1, num_features, 1, 1))
        self.num_groups = num_groups
        self.eps = eps

    def forward(self, x):
        N, C, H, W = x.size()
        G = self.num_groups
        assert C % G == 0

        # collapse each group of C // G channels (and all spatial positions) into one row
        x = x.view(N, G, -1)
        mean = x.mean(-1, keepdim=True)
        var = x.var(-1, keepdim=True)

        # normalize within each group, then restore the original shape
        x = (x - mean) / (var + self.eps).sqrt()
        x = x.view(N, C, H, W)
        return x * self.weight + self.bias
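A quick sanity check of the class above against PyTorch's built-in torch.nn.GroupNorm (a sketch with made-up shapes; expect a small but nonzero difference, because x.var() above defaults to the unbiased estimator while nn.GroupNorm uses the biased one):

x = torch.randn(2, 64, 16, 16)             # [N, C, H, W], hypothetical shape

gn_custom = GroupNorm(num_features=64, num_groups=32)
gn_builtin = nn.GroupNorm(num_groups=32, num_channels=64, eps=1e-5)

with torch.no_grad():
    diff = (gn_custom(x) - gn_builtin(x)).abs().max()
print(diff)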

The TensorFlow implementation of GroupNormalization (the code given in the paper):

import tensorflow as tf


def GroupNorm(x, gamma, beta, G, eps=1e-5):
    # x: input features with shape [N, C, H, W]
    # gamma, beta: scale and offset, with shape [1, C, 1, 1]
    # G: number of groups for GN
    N, C, H, W = x.shape
    x = tf.reshape(x, [N, G, C // G, H, W])
    # TF 1.x API: `keep_dims` was renamed `keepdims` in TF 2.x
    mean, var = tf.nn.moments(x, [2, 3, 4], keep_dims=True)
    x = (x - mean) / tf.sqrt(var + eps)
    x = tf.reshape(x, [N, C, H, W])
    return x * gamma + beta

A Keras implementation of GroupNormalization:

from keras.engine import Layer, InputSpec
from keras import initializers
from keras import regularizers
from keras import constraints
from keras import backend as K


class GroupNormalization(Layer):
    """Group normalization layer

    Group Normalization divides the channels into groups and computes
    within each group the mean and variance for normalization.
    GN's computation is independent of batch sizes, and its accuracy
    is stable in a wide range of batch sizes.

    # Arguments
        groups: Integer, the number of groups for Group Normalization.
        axis: Integer, the axis that should be normalized
            (typically the features axis).
            For instance, after a `Conv2D` layer with
            `data_format="channels_first"`,
            set `axis=1` in `GroupNormalization`.
        epsilon: Small float added to variance to avoid dividing by zero.
        center: If True, add offset of `beta` to normalized tensor.
            If False, `beta` is ignored.
        scale: If True, multiply by `gamma`.
            If False, `gamma` is not used.
            When the next layer is linear (also e.g. `nn.relu`),
            this can be disabled since the scaling
            will be done by the next layer.
        beta_initializer: Initializer for the beta weight.
        gamma_initializer: Initializer for the gamma weight.
        beta_regularizer: Optional regularizer for the beta weight.
        gamma_regularizer: Optional regularizer for the gamma weight.
        beta_constraint: Optional constraint for the beta weight.
        gamma_constraint: Optional constraint for the gamma weight.

    # Input shape
        Arbitrary. Use the keyword argument `input_shape`
        (tuple of integers, does not include the samples axis)
        when using this layer as the first layer in a model.

    # Output shape
        Same shape as input.

    # References
        - [Group Normalization](https://arxiv.org/abs/1803.08494)
    """

    def __init__(self,
                 groups=16,
                 axis=-1,
                 epsilon=1e-5,
                 center=True,
                 scale=True,
                 beta_initializer='zeros',
                 gamma_initializer='ones',
                 beta_regularizer=None,
                 gamma_regularizer=None,
                 beta_constraint=None,
                 gamma_constraint=None,
                 **kwargs):
        super(GroupNormalization, self).__init__(**kwargs)
        self.supports_masking = True
        self.groups = groups
        self.axis = axis
        self.epsilon = epsilon
        self.center = center
        self.scale = scale
        self.beta_initializer = initializers.get(beta_initializer)
        self.gamma_initializer = initializers.get(gamma_initializer)
        self.beta_regularizer = regularizers.get(beta_regularizer)
        self.gamma_regularizer = regularizers.get(gamma_regularizer)
        self.beta_constraint = constraints.get(beta_constraint)
        self.gamma_constraint = constraints.get(gamma_constraint)

    def build(self, input_shape):
        dim = input_shape[self.axis]

        if dim is None:
            raise ValueError('Axis ' + str(self.axis) + ' of '
                             'input tensor should have a defined dimension '
                             'but the layer received an input with shape ' +
                             str(input_shape) + '.')

        if dim < self.groups:
            raise ValueError('Number of groups (' + str(self.groups) + ') cannot be '
                             'more than the number of channels (' +
                             str(dim) + ').')

        if dim % self.groups != 0:
            raise ValueError('Number of channels (' + str(dim) + ') must be a '
                             'multiple of the number of groups (' +
                             str(self.groups) + ').')

        self.input_spec = InputSpec(ndim=len(input_shape),
                                    axes={self.axis: dim})
        shape = (dim,)

        if self.scale:
            self.gamma = self.add_weight(shape=shape, name='gamma',
                                         initializer=self.gamma_initializer,
                                         regularizer=self.gamma_regularizer,
                                         constraint=self.gamma_constraint)
        else:
            self.gamma = None
        if self.center:
            self.beta = self.add_weight(shape=shape, name='beta',
                                        initializer=self.beta_initializer,
                                        regularizer=self.beta_regularizer,
                                        constraint=self.beta_constraint)
        else:
            self.beta = None
        self.built = True

    def call(self, inputs, **kwargs):
        input_shape = K.int_shape(inputs)
        tensor_input_shape = K.shape(inputs)

        # Prepare broadcasting shape.
        reduction_axes = list(range(len(input_shape)))
        del reduction_axes[self.axis]
        broadcast_shape = [1] * len(input_shape)
        broadcast_shape[self.axis] = input_shape[self.axis] // self.groups
        broadcast_shape.insert(1, self.groups)

        reshape_group_shape = K.shape(inputs)
        group_axes = [reshape_group_shape[i] for i in range(len(input_shape))]
        group_axes[self.axis] = input_shape[self.axis] // self.groups
        group_axes.insert(1, self.groups)

        # reshape inputs to new group shape
        group_shape = [group_axes[0], self.groups] + group_axes[2:]
        group_shape = K.stack(group_shape)
        inputs = K.reshape(inputs, group_shape)

        group_reduction_axes = list(range(len(group_axes)))
        group_reduction_axes = group_reduction_axes[2:]

        mean = K.mean(inputs, axis=group_reduction_axes, keepdims=True)
        variance = K.var(inputs, axis=group_reduction_axes, keepdims=True)

        inputs = (inputs - mean) / (K.sqrt(variance + self.epsilon))

        # prepare broadcast shape
        inputs = K.reshape(inputs, group_shape)
        outputs = inputs

        # In this case we must explicitly broadcast all parameters.
        if self.scale:
            broadcast_gamma = K.reshape(self.gamma, broadcast_shape)
            outputs = outputs * broadcast_gamma

        if self.center:
            broadcast_beta = K.reshape(self.beta, broadcast_shape)
            outputs = outputs + broadcast_beta

        outputs = K.reshape(outputs, tensor_input_shape)

        return outputs

    def get_config(self):
        config = {
            'groups': self.groups,
            'axis': self.axis,
            'epsilon': self.epsilon,
            'center': self.center,
            'scale': self.scale,
            'beta_initializer': initializers.serialize(self.beta_initializer),
            'gamma_initializer': initializers.serialize(self.gamma_initializer),
            'beta_regularizer': regularizers.serialize(self.beta_regularizer),
            'gamma_regularizer': regularizers.serialize(self.gamma_regularizer),
            'beta_constraint': constraints.serialize(self.beta_constraint),
            'gamma_constraint': constraints.serialize(self.gamma_constraint)
        }
        base_config = super(GroupNormalization, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

    def compute_output_shape(self, input_shape):
        return input_shape
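A minimal usage sketch of the custom layer above in a hypothetical model (the channel count of the preceding layer must be divisible by groups):

from keras.models import Sequential
from keras.layers import Conv2D, Activation, GlobalAveragePooling2D, Dense

model = Sequential([
    Conv2D(64, (3, 3), padding='same', input_shape=(32, 32, 3)),
    GroupNormalization(groups=16, axis=-1),   # 64 channels split into 16 groups of 4
    Activation('relu'),
    GlobalAveragePooling2D(),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy')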

For GroupNormalization's concrete experimental results, see the paper: https://arxiv.org/pdf/1803.08494.pdf
