Implementing the X3D Network with MindSpore

X3D is a paper on video action classification published at CVPR 2020.

Original paper link

Executable notebook example

Code

Official implementation

MindSpore implementation

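Algorithm principles

X3D is inspired by feature selection methods in machine learning. It uses a simple stepwise network expansion approach: starting from an X2D image classification model, it expands along width, depth, frame rate, number of frames, and resolution, moving from the 2D spatial domain to the 3D spatio-temporal domain. Each step expands only one axis, weighs computation against accuracy, and keeps the best expansion.

The authors compare this with the history of image classification networks, which have explored depth, resolution, channel width, and other axes, whereas video classification models have mostly just extended image models along the time axis. This leads them to the following questions about which axes are worth improving.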

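What is the best temporal sampling strategy for a 3D network? Is a long input duration with sparser sampling better than dense sampling over a short duration? (sampling frame rate)

Is a finer spatial resolution needed? Existing work uses low resolution for efficiency. Is there a maximum spatial resolution beyond which performance saturates? (spatial resolution)

Is a higher frame rate with a "thinner" model better, or a lower frame rate with a "wider" model? In other words, which of SlowFast's Slow and Fast pathways is the better design, or is there a better design somewhere between the two? (frame rate and width)

When increasing network width, is it better to increase the global width or the bottleneck width? (width, borrowing from the inverted bottleneck structure)

As the network gets deeper, should the spatio-temporal resolution of the input also grow so that the receptive field stays large enough, or should a different axis be expanded? (depth and spatio-temporal resolution)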

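The overall X3D architecture is shown above. Kernel dimensions are written as {T×S², C}, for temporal size, spatial size, and channel count. X3D expands X2D along six axes; for X2D, the expansion factor on each of these six axes is 1.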

Expansion axes:

1. X-Fast: frame sampling interval

2. X-Temporal: number of sampled frames

3. X-Spatial: spatial resolution

4. X-Depth: network depth

5. X-Width: network width

6. X-Bottleneck: bottleneck width

Forward expansion

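Forward expansion grows the network step by step, one axis at a time, toward a given complexity.

Two quantities are defined first. J(X) measures how good the current expansion factors X are: the higher the score, the better the factors. It corresponds to model accuracy. C(X) measures complexity and corresponds to the number of floating-point operations the network requires. The goal is to find the expansion factors that maximize J(X) subject to a given complexity C(X) = c.

While searching for the best expansion factors, each step expands a single axis and leaves the others unchanged; the best expansion at each step is kept, and the next step starts from it. The model starts as X2D, which has some initial computational cost. Given a target complexity, the model reaches that target by changing one factor at a time. The size of each change is also fixed in advance: every step roughly doubles the complexity of the current model, so the expansion is progressive. The method can be seen as a special form of coordinate descent. The operations that double the complexity along each axis are as follows: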

1. X-Fast：

2. X-Temporal：

3. X-Spatial：

4. X-Depth：

5. X-Width：

6. X-Bottleneck：
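The greedy, one-axis-at-a-time search described above can be sketched in a few lines. This is only a toy illustration: the accuracy proxy `toy_accuracy` and the per-axis gains in `TOY_GAIN` are invented for the example (in the paper every candidate is actually trained and evaluated), and each expansion is assumed to double the cost exactly.

```python
import math

AXES = ["fast", "temporal", "spatial", "depth", "width", "bottleneck"]

# Hypothetical per-axis gains, used only by the toy accuracy proxy below.
TOY_GAIN = {"fast": 1.0, "temporal": 1.5, "spatial": 1.4,
            "depth": 1.2, "width": 0.9, "bottleneck": 2.0}


def toy_accuracy(steps):
    """Stand-in for J(X): diminishing returns per expansion of each axis."""
    return sum(TOY_GAIN[a] * math.log2(1 + n) for a, n in steps.items())


def complexity(steps, base=1.0):
    """Stand-in for C(X): every expansion step doubles the cost."""
    return base * 2 ** sum(steps.values())


def forward_expansion(target, base=1.0):
    """Expand one axis per step until the target complexity is reached."""
    steps = {a: 0 for a in AXES}          # X2D: nothing expanded yet
    history = []
    while complexity(steps, base) < target:
        # Try each axis on its own and keep the candidate with the best J.
        best_axis = max(
            AXES, key=lambda a: toy_accuracy({**steps, a: steps[a] + 1}))
        steps[best_axis] += 1
        history.append(best_axis)
    return steps, history


steps, history = forward_expansion(target=16.0)
print(history)
print(complexity(steps))
```

Because every candidate at a given step has the same cost, comparing `toy_accuracy` alone is enough to pick the winner, which mirrors maximizing J(X) at a fixed C(X).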

Backward contraction

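Backward contraction steps back and shrinks the model when the target complexity has been exceeded.

Because forward expansion only produces models at discrete steps, a forward step may overshoot the target complexity. In that case the authors apply a backward contraction step to meet the target. The contraction is implemented as a simple reduction of the last expansion so that the result matches the target. For example, if the last step doubled the frame rate, the contraction reduces that increase to a factor below 2 so that the model roughly matches the target complexity.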
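As a small worked example of the contraction step, the sketch below assumes that cost scales linearly with the factor of the axis expanded last (reasonable for frame count, not necessarily for every axis). The function name and the numbers are hypothetical.

```python
def contract_last_step(cost_before, target, full_factor=2.0):
    """Shrink the last expansion so the cost lands on the target.

    Assumes cost is proportional to the expansion factor of the last axis.
    Returns (factor actually applied, resulting cost).
    """
    cost_after = cost_before * full_factor
    if cost_after <= target:
        # No overshoot: keep the full expansion step.
        return full_factor, cost_after
    # Overshoot: use a smaller factor that matches the target.
    factor = target / cost_before
    return factor, cost_before * factor


# The last step would take the cost from 4.0 to 8.0, but the target is 6.0.
factor, cost = contract_last_step(cost_before=4.0, target=6.0)
print(factor, cost)
```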

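Progressive expansion

Expanding any single axis improves accuracy, which confirms the initial idea.

The first axis to be expanded is not the temporal axis but the bottleneck width. This is consistent with the inverted residual structure of MobileNetV2. The authors suggest that the reason may be that these layers use lightweight channel-wise convolutions, so expanding this axis first is the most economical choice. Accuracy also varies widely across axes: expanding the bottleneck width reaches 55.0%, whereas expanding depth reaches only 51.3%.

The second step expands the number of frames (since the model starts with a single frame, expanding the frame sampling interval and expanding the number of frames are equivalent at this point). This is the axis one would intuitively expect to be expanded first, because it provides more temporal information.

The third step expands the spatial resolution, and the fourth expands depth. Next come temporal resolution (frame rate) and input length (frame interval and number of frames), followed by two spatial resolution expansions. The tenth step expands depth again, which matches intuition: expanding depth enlarges the receptive field of the filters.

Notably, although the model starts out with a small width, it does not begin to expand the global width until the eleventh step. This makes X3D resemble the Fast pathway of SlowFast (large spatio-temporal resolution but small width). The last two expansion steps, which are not shown in the figure, are frame interval and depth.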

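Results

API usage

The Kinetics-400 dataset is loaded through the Kinetic400 class, which is built on VideoDataset. VideoShortEdgeResize resizes each video according to its short edge, VideoRandomCrop takes a random crop of the resized video, VideoRandomHorizontalFlip flips the video horizontally with a given probability, VideoRescale rescales the pixel values, VideoReOrder permutes the dimensions, and VideoNormalize normalizes the result.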
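To make the order of these steps concrete, here is a NumPy stand-in for the same pipeline. It does not use the Kinetic400 or Video* classes themselves; the helper functions, the nearest-neighbour resize, and all parameter values (short edge 256, crop 224, the mean and std, the 1/255 rescale) are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def short_edge_resize(video, size):
    """Nearest-neighbour resize so the shorter spatial edge equals `size`."""
    _, h, w, _ = video.shape                     # layout: (T, H, W, C)
    scale = size / min(h, w)
    new_h = size if h <= w else round(h * scale)
    new_w = size if w < h else round(w * scale)
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return video[:, rows][:, :, cols]


def random_crop(video, size):
    """Cut the same random size x size window out of every frame."""
    _, h, w, _ = video.shape
    top = int(rng.integers(0, h - size + 1))
    left = int(rng.integers(0, w - size + 1))
    return video[:, top:top + size, left:left + size]


def random_horizontal_flip(video, prob=0.5):
    """Mirror the width axis with probability `prob`."""
    return video[:, :, ::-1] if rng.random() < prob else video


def preprocess(video, short_edge=256, crop=224,
               mean=(0.45, 0.45, 0.45), std=(0.225, 0.225, 0.225)):
    x = short_edge_resize(video, short_edge)
    x = random_crop(x, crop)
    x = random_horizontal_flip(x)
    x = x.astype(np.float32) / 255.0             # rescale pixels to [0, 1]
    x = x.transpose(3, 0, 1, 2)                  # (T, H, W, C) -> (C, T, H, W)
    mean = np.asarray(mean, np.float32).reshape(-1, 1, 1, 1)
    std = np.asarray(std, np.float32).reshape(-1, 1, 1, 1)
    return (x - mean) / std                      # per-channel normalization


# A random 16-frame clip of 240 x 320 RGB frames.
clip = rng.integers(0, 256, size=(16, 240, 320, 3), dtype=np.uint8)
out = preprocess(clip)
print(out.shape)
```

The reordering step matters because the model expects channel-first clips of shape (N, C, D, H, W) once a batch dimension is added.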

    import math
    from typing import Type

    from mindspore import nn, ops

    # BlockX3D, ResNetX3D and X3DHead come from the repository
    # (they are shown later in this article).


    class x3d(nn.Cell):
        """
        x3d architecture.
        Christoph Feichtenhofer.
        "X3D: Expanding Architectures for Efficient Video Recognition."
        https://arxiv.org/abs/2004.04730
        """

        def __init__(self,
                     block: Type[BlockX3D],
                     depth_factor: float,
                     num_frames: int,
                     train_crop_size: int,
                     num_classes: int,
                     dropout_rate: float,
                     bottleneck_factor: float = 2.25,
                     eval_with_clips: bool = False):
            super(x3d, self).__init__()

            block_basis = [1, 2, 5, 3]
            stage_channels = (24, 48, 96, 192)
            stage_strides = ((1, 2, 2), (1, 2, 2), (1, 2, 2), (1, 2, 2))
            drop_rates = (0.2, 0.3, 0.4, 0.5)

            # Number of blocks per stage, scaled by the depth factor.
            layer_nums = []
            for item in block_basis:
                nums = int(math.ceil(item * depth_factor))
                layer_nums.append(nums)

            spat_sz = int(math.ceil(train_crop_size / 32.0))
            pool_size = [num_frames, spat_sz, spat_sz]
            input_channel = int(math.ceil(192 * bottleneck_factor))

            self.num_frames = num_frames
            self.eval_with_clips = eval_with_clips
            self.softmax = nn.Softmax()
            self.transpose = ops.Transpose()

            self.backbone = ResNetX3D(block=block, layer_nums=layer_nums, stage_channels=stage_channels,
                                      stage_strides=stage_strides, drop_rates=drop_rates)
            self.head = X3DHead(pool_size=pool_size, input_channel=input_channel, out_channel=2048,
                                num_classes=num_classes, dropout_rate=dropout_rate)

        def construct(self, x):
            if not self.eval_with_clips:
                x = self.backbone(x)
                x = self.head(x)
            else:
                # use for 10-clip eval
                b, c, n, h, w = x.shape
                if n > self.num_frames:
                    # Split the long input into clips of num_frames frames each.
                    x = x.reshape(b, c, -1, self.num_frames, h, w)
                    x = self.transpose(x, (2, 0, 1, 3, 4, 5))
                    x = x.reshape(-1, c, self.num_frames, h, w)
                x = self.backbone(x)
                x = self.head(x)
                if n > self.num_frames:
                    # Average the per-clip scores; the class count 400 is hard-coded here.
                    x = self.softmax(x)
                    x = x.reshape(-1, b, 400)
                    x = x.mean(axis=0, keep_dims=False)
            return x
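The multi-clip branch of `construct` cuts a long input into consecutive clips of `num_frames` frames, scores each clip, and averages the softmax scores. The pure-Python sketch below reproduces only that bookkeeping on frame indices and made-up logits; the helper names and the three-class example are hypothetical.

```python
import math


def split_into_clips(num_total_frames, num_frames):
    """Consecutive chunks of frame indices, as the reshape in construct yields."""
    assert num_total_frames % num_frames == 0
    return [list(range(start, start + num_frames))
            for start in range(0, num_total_frames, num_frames)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def average_clip_scores(per_clip_logits):
    """Softmax each clip's logits, then average the scores across clips."""
    probs = [softmax(logits) for logits in per_clip_logits]
    num_clips = len(probs)
    num_classes = len(probs[0])
    return [sum(p[i] for p in probs) / num_clips for i in range(num_classes)]


clips = split_into_clips(num_total_frames=12, num_frames=4)
print(clips)

# Made-up logits for three clips and three classes.
scores = average_clip_scores([[2.0, 0.0, 0.0],
                              [0.0, 1.0, 0.0],
                              [3.0, 0.0, 1.0]])
predicted = max(range(len(scores)), key=scores.__getitem__)
print(predicted)
```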

X3D comes in several variants; the functions x3d_m, x3d_s, x3d_xs, and x3d_l build the different models. An X3D model consists of two main parts, ResNetX3D and X3DHead.

    @ClassFactory.register(ModuleType.MODEL)
    def x3d_m(num_classes: int = 400,
              dropout_rate: float = 0.5,
              depth_factor: float = 2.2,
              num_frames: int = 16,
              train_crop_size: int = 224,
              eval_with_clips: bool = False,
              ):
        """
        X3D middle model.

        Christoph Feichtenhofer. "X3D: Expanding Architectures for Efficient Video Recognition."
        https://arxiv.org/abs/2004.04730

        Args:
            num_classes (int): the channel dimensions of the output.
            dropout_rate (float): dropout rate. If equal to 0.0, perform no dropout.
            depth_factor (float): Depth expansion factor.
            num_frames (int): The number of frames of the input clip.
            train_crop_size (int): The spatial crop size for training.

        Inputs:
            - **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

        Outputs:
            Tensor of shape :math:`(N, CLASSES_{out})`

        Supported Platforms:
            ``GPU``

        Examples:
            >>> import numpy as np
            >>> from mindspore import Tensor
            >>> from mindvision.msvideo.models import x3d_m
            >>>
            >>> network = x3d_m()
            >>> input_x = Tensor(np.random.randn(1, 3, 16, 224, 224).astype(np.float32))
            >>> out = network(input_x)
            >>> print(out.shape)
            (1, 400)

        About x3d: Expanding Architectures for Efficient Video Recognition.

        .. code-block::

            @inproceedings{x3d2020,
                Author    = {Christoph Feichtenhofer},
                Title     = {{X3D}: Progressive Network Expansion for Efficient Video Recognition},
                Booktitle = {{CVPR}},
                Year      = {2020}}
        """
        return x3d(BlockX3D, depth_factor, num_frames, train_crop_size,
                   num_classes, dropout_rate, eval_with_clips=eval_with_clips)


    @ClassFactory.register(ModuleType.MODEL)
    def x3d_s(num_classes: int = 400,
              dropout_rate: float = 0.5,
              depth_factor: float = 2.2,
              num_frames: int = 13,
              train_crop_size: int = 160,
              eval_with_clips: bool = False,
              ):
        """
        X3D small model.
        """
        return x3d(BlockX3D, depth_factor, num_frames, train_crop_size,
                   num_classes, dropout_rate, eval_with_clips=eval_with_clips)


    @ClassFactory.register(ModuleType.MODEL)
    def x3d_xs(num_classes: int = 400,
               dropout_rate: float = 0.5,
               depth_factor: float = 2.2,
               num_frames: int = 4,
               train_crop_size: int = 160,
               eval_with_clips: bool = False,
               ):
        """
        X3D x-small model.
        """
        return x3d(BlockX3D, depth_factor, num_frames, train_crop_size,
                   num_classes, dropout_rate, eval_with_clips=eval_with_clips)


    @ClassFactory.register(ModuleType.MODEL)
    def x3d_l(num_classes: int = 400,
              dropout_rate: float = 0.5,
              depth_factor: float = 5.0,
              num_frames: int = 16,
              train_crop_size: int = 312,
              eval_with_clips: bool = False,
              ):
        """
        X3D large model.
        """
        return x3d(BlockX3D, depth_factor, num_frames, train_crop_size,
                   num_classes, dropout_rate, eval_with_clips=eval_with_clips)

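ResNetX3D inherits from ResNet3D and modifies it. Its first module consists of two 3D convolution layers together with batchnorm and ReLU. The first 3D convolution operates on the spatial dimensions, with 3 input channels, 24 output channels, a kernel size of (1, 3, 3), and a stride of (1, 2, 2). The second 3D convolution operates on the temporal dimension, with 24 input and output channels and a kernel size of (5, 1, 1).

The remaining modules of ResNetX3D are four ResStages, each containing a different number of ResBlocks. A ResBlock consists mainly of a downsampling module and a Transform module. The downsampling module reduces the H and W of the input. The Transform module contains several conv modules that change the number of channels, and it introduces the SE channel attention mechanism and the Swish nonlinear activation function.

The number of ResBlocks is determined by the model depth, so it differs from model to model. Taking X3D-M as an example, the four ResStages contain 3, 5, 11, and 7 ResBlocks respectively. In the first ResStage, the input and output channels are both 24 and the intermediate channels are 54, repeated 3 times. In the second ResStage, the input channels are 24, the output channels are 48, and the intermediate channels are 108, repeated 5 times. In the third ResStage, the input channels are 48, the output channels are 96, and the intermediate channels are 216, repeated 11 times. In the last ResStage, the input channels are 96, the output channels are 192, and the intermediate channels are 432, repeated 7 times.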
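The block counts and channel widths quoted above follow directly from the depth factor and the bottleneck factor used in the code in this article. The sketch below recomputes them in plain Python; `stage_layout` is a hypothetical helper, not a function from the repository.

```python
import math

BLOCK_BASIS = [1, 2, 5, 3]            # base number of blocks per stage
STAGE_CHANNELS = (24, 48, 96, 192)    # output channels per stage


def stage_layout(depth_factor, bottleneck_factor=2.25):
    """Per-stage block count and channel widths for a given depth factor."""
    layout = []
    in_ch = STAGE_CHANNELS[0]         # the stem outputs 24 channels
    for basis, out_ch in zip(BLOCK_BASIS, STAGE_CHANNELS):
        blocks = int(math.ceil(basis * depth_factor))
        mid_ch = int(out_ch * bottleneck_factor)
        # BlockX3D adds an SE module when (block_idx + 1) % 2 is non-zero,
        # that is, for block indices 0, 2, 4, ...
        se_blocks = [i for i in range(blocks) if (i + 1) % 2]
        layout.append({"in": in_ch, "mid": mid_ch, "out": out_ch,
                       "blocks": blocks, "se_blocks": se_blocks})
        in_ch = out_ch
    return layout


x3d_m_layout = stage_layout(depth_factor=2.2)   # X3D-M, X3D-S and X3D-XS
x3d_l_layout = stage_layout(depth_factor=5.0)   # X3D-L
for stage in x3d_m_layout:
    print(stage)
```

With a depth factor of 2.2 this gives the 3, 5, 11, 7 blocks and the 54, 108, 216, 432 intermediate channels described for X3D-M.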

    import math
    from typing import Optional, Tuple

    from mindspore import nn

    # ResidualBlock3D, ResNet3D, Inflate3D, Unit3D, Swish, SqueezeExcite3D
    # and drop_path come from the repository.


    class BlockX3D(ResidualBlock3D):
        """
        BlockX3D 3d building block for X3D.

        Args:
            in_channel (int): Input channel.
            out_channel (int): Output channel.
            conv12(nn.Cell, optional): Block that constructs first two conv layers.
                It can be `Inflate3D`, `Conv2Plus1D` or other custom blocks, this
                block should construct a layer where the name of output feature channel
                size is `mid_channel` for the third conv layers. Default: Inflate3D.
            inflate (int): Whether to inflate kernel.
            spatial_stride (int): Spatial stride in the conv3d layer. Default: 1.
            down_sample (nn.Module | None): DownSample layer. Default: None.
            block_idx (int): the id of the block.
            se_ratio (float | None): The reduction ratio of squeeze and excitation
                unit. If set as None, it means not using SE unit. Default: None.
            use_swish (bool): Whether to use swish as the activation function
                before and after the 3x3x3 conv. Default: True.
            drop_connect_rate (float): dropout rate. If equal to 0.0, perform no dropout.
            bottleneck_factor (float): Bottleneck expansion factor for the 3x3x3 conv.
        """
        expansion: int = 1

        def __init__(self,
                     in_channel,
                     out_channel,
                     conv12: Optional[nn.Cell] = Inflate3D,
                     inflate: int = 2,
                     norm: Optional[nn.Cell] = None,
                     down_sample: Optional[nn.Cell] = None,
                     block_idx: int = 0,
                     se_ratio: float = 0.0625,
                     use_swish: bool = True,
                     drop_connect_rate: float = 0.0,
                     bottleneck_factor: float = 2.25,
                     **kwargs):
            super(BlockX3D, self).__init__(in_channel=in_channel,
                                           out_channel=out_channel,
                                           mid_channel=int(out_channel * bottleneck_factor),
                                           conv12=conv12,
                                           norm=norm,
                                           down_sample=down_sample,
                                           inflate=inflate,
                                           **kwargs)
            self.in_channel = in_channel
            self.out_channel = out_channel
            self.se_ratio = se_ratio
            self.use_swish = use_swish
            self._drop_connect_rate = drop_connect_rate
            if self.use_swish:
                self.swish = Swish()

            # SE is attached to every other block (block_idx 0, 2, 4, ...).
            self.se_module = None
            if self.se_ratio > 0.0 and (block_idx + 1) % 2:
                self.se_module = SqueezeExcite3D(self.conv12.mid_channel, self.se_ratio)

            self.conv3 = Unit3D(in_channels=self.conv12.mid_channel,
                                out_channels=self.out_channel,
                                kernel_size=1,
                                stride=1,
                                padding=0,
                                norm=nn.BatchNorm3d,
                                activation=None)
            self.conv3.transform_final_bn = True

        def construct(self, x):
            """Defines the computation performed at every call."""
            identity = x

            out = self.conv12(x)
            if self.se_module is not None:
                out = self.se_module(out)
            if self.use_swish:
                out = self.swish(out)

            out = self.conv3(out)

            if self.training and self._drop_connect_rate > 0.0:
                out = drop_path(out, self._drop_connect_rate)

            if self.down_sample:
                identity = self.down_sample(x)

            out = out + identity
            out = self.relu(out)

            return out


    class ResNetX3D(ResNet3D):
        """
        X3D backbone definition.

        Args:
            block (Optional[nn.Cell]): The block for network.
            layer_nums (list): The numbers of block in different layers.
            stage_channels (Tuple[int]): Output channel for every res stage.
            stage_strides (Tuple[Tuple[int]]): Stride size for ResNet3D convolutional layer.
            drop_rates (list): list of the drop rate in different blocks. The basic rate at which blocks
                are dropped, linearly increases from input to output blocks.
            down_sample (Optional[nn.Cell]): Residual block in every resblock, it can transfer the input
                feature into the same channel of output. Default: Unit3D.
            bottleneck_factor (float): Bottleneck expansion factor for the 3x3x3 conv.
            fc_init_std (float): The std to initialize the fc layer(s).

        Inputs:
            - **x** (Tensor) - Tensor of shape :math:`(N, C_{in}, D_{in}, H_{in}, W_{in})`.

        Returns:
            Tensor, output tensor.

        Supported Platforms:
            ``GPU``

        Examples:
            >>> net = ResNetX3D(BlockX3D, [3, 5, 11, 7], (24, 48, 96, 192), ((1, 2, 2),(1, 2, 2),
            >>>             (1, 2, 2),(1, 2, 2)), [0.2, 0.3, 0.4, 0.5], Unit3D)
        """

        def __init__(self,
                     block: Optional[nn.Cell],
                     layer_nums: Tuple[int],
                     stage_channels: Tuple[int],
                     stage_strides: Tuple[Tuple[int]],
                     drop_rates: Tuple[float],
                     down_sample: Optional[nn.Cell] = Unit3D,
                     bottleneck_factor: float = 2.25):
            super(ResNetX3D, self).__init__(block=block,
                                            layer_nums=layer_nums,
                                            stage_channels=stage_channels,
                                            stage_strides=stage_strides,
                                            down_sample=down_sample)
            self.in_channels = stage_channels[0]
            self.base_channels = 24
            # Stem: a spatial (1, 3, 3) conv followed by a temporal (5, 1, 1) conv.
            self.conv1 = nn.SequentialCell([Unit3D(3,
                                                   self.base_channels,
                                                   kernel_size=(1, 3, 3),
                                                   stride=(1, 2, 2),
                                                   norm=None,
                                                   activation=None),
                                            Unit3D(self.base_channels,
                                                   self.base_channels,
                                                   kernel_size=(5, 1, 1),
                                                   stride=(1, 1, 1))])
            self.layer1 = self._make_layer(block,
                                           stage_channels[0],
                                           layer_nums[0],
                                           stride=tuple(stage_strides[0]),
                                           inflate=2,
                                           drop_connect_rate=drop_rates[0],
                                           block_idx=list(range(layer_nums[0])))
            self.layer2 = self._make_layer(block,
                                           stage_channels[1],
                                           layer_nums[1],
                                           stride=tuple(stage_strides[1]),
                                           inflate=2,
                                           drop_connect_rate=drop_rates[1],
                                           block_idx=list(range(layer_nums[1])))
            self.layer3 = self._make_layer(block,
                                           stage_channels[2],
                                           layer_nums[2],
                                           stride=tuple(stage_strides[2]),
                                           inflate=2,
                                           drop_connect_rate=drop_rates[2],
                                           block_idx=list(range(layer_nums[2])))
            self.layer4 = self._make_layer(block,
                                           stage_channels[3],
                                           layer_nums[3],
                                           stride=tuple(stage_strides[3]),
                                           inflate=2,
                                           drop_connect_rate=drop_rates[3],
                                           block_idx=list(range(layer_nums[3])))
            self.conv5 = Unit3D(stage_channels[-1],
                                int(math.ceil(stage_channels[-1] * bottleneck_factor)),
                                kernel_size=1,
                                stride=1,
                                padding=0)
            self._init_weights()

        def construct(self, x):
            x = self.conv1(x)
            x = self.layer1(x)
            x = self.layer2(x)
            x = self.layer3(x)
            x = self.layer4(x)
            x = self.conv5(x)
            return x

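X3DHead is a head for action classification, built from a 3D average pooling layer, a 3D convolution layer, ReLU, and a linear layer. It first transforms the input features into a 2048-dimensional feature vector, and the linear layer then maps that vector to the number of classes.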

    from mindspore import nn, ops

    # AvgPool3D and DropoutDense come from the repository.


    class X3DHead(nn.Cell):
        """
        x3d head architecture.

        Args:
            input_channel (int): The number of input channel.
            out_channel (int): The number of inner channel. Default: 2048.
            num_classes (int): Number of classes. Default: 400.
            dropout_rate (float): Dropout keeping rate, between [0, 1]. Default: 0.5.

        Returns:
            Tensor

        Examples:
            >>> head = X3DHead(input_channel=432, out_channel=2048, num_classes=400, dropout_rate=0.5)
        """

        def __init__(self,
                     pool_size,
                     input_channel,
                     out_channel=2048,
                     num_classes=400,
                     dropout_rate=0.5,
                     ):
            super(X3DHead, self).__init__()
            self.avg_pool = AvgPool3D(pool_size)
            # A 1x1x1 convolution that acts as a per-position linear layer.
            self.lin_5 = nn.Conv3d(input_channel,
                                   out_channel,
                                   kernel_size=(1, 1, 1),
                                   stride=1,
                                   padding=0,
                                   has_bias=False)
            self.lin_5_relu = nn.ReLU()

            self.dense = DropoutDense(input_channel=out_channel,
                                      out_channel=num_classes,
                                      has_bias=True,
                                      keep_prob=dropout_rate)
            self.softmax = nn.Softmax(4)
            self.transpose = ops.Transpose()

        def construct(self, x):
            x = self.avg_pool(x)

            x = self.lin_5(x)
            x = self.lin_5_relu(x)

            # (N, C, T, H, W) -> (N, T, H, W, C).
            x = self.transpose(x, (0, 2, 3, 4, 1))
            x = self.dense(x)
            if not self.training:
                x = self.softmax(x)
                x = x.mean([1, 2, 3])
            x = x.view(x.shape[0], -1)
            return x
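To see how the backbone and head fit together, the following pure-Python sketch traces tensor shapes through the network. It rests on two assumptions that are not stated in the article: each stride-2 layer maps a spatial size v to ceil(v / 2), and the temporal length is preserved everywhere before the pooling layer. The function `x3d_shape_trace` is hypothetical.

```python
import math


def x3d_shape_trace(num_frames, crop_size, num_classes=400,
                    bottleneck_factor=2.25, batch=1):
    """Return (layer name, output shape) pairs for one forward pass."""
    def halve(v):
        # Assumption: a stride-2 layer maps size v to ceil(v / 2).
        return math.ceil(v / 2)

    t, h, w = num_frames, crop_size, crop_size
    trace = [("input", (batch, 3, t, h, w))]

    h, w = halve(h), halve(w)                      # conv1, stride (1, 2, 2)
    trace.append(("conv1", (batch, 24, t, h, w)))

    for i, ch in enumerate((24, 48, 96, 192), start=1):
        h, w = halve(h), halve(w)                  # each stage, stride (1, 2, 2)
        trace.append((f"layer{i}", (batch, ch, t, h, w)))

    c5 = int(math.ceil(192 * bottleneck_factor))   # 432 for the default factor
    trace.append(("conv5", (batch, c5, t, h, w)))

    # The head pools over a window of [num_frames, spat_sz, spat_sz].
    spat_sz = int(math.ceil(crop_size / 32.0))
    assert (num_frames, spat_sz, spat_sz) == (t, h, w), \
        "the pooling window is expected to cover the whole feature map"
    trace.append(("avg_pool", (batch, c5, 1, 1, 1)))
    trace.append(("lin_5", (batch, 2048, 1, 1, 1)))
    trace.append(("transpose", (batch, 1, 1, 1, 2048)))
    trace.append(("dense", (batch, 1, 1, 1, num_classes)))
    trace.append(("output", (batch, num_classes)))
    return trace


trace_m = x3d_shape_trace(num_frames=16, crop_size=224)   # x3d_m defaults
trace_l = x3d_shape_trace(num_frames=16, crop_size=312)   # x3d_l defaults
for name, shape in trace_m:
    print(f"{name:10s} {shape}")
```

Under these assumptions the five stride-2 steps reduce the spatial size by a factor of 32, which is why the pooling window uses `train_crop_size / 32`.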

The full code is available in the repository.
