Abstract

In this paper, we propose a simple yet effective semantics-guided neural network (SGN) for skeleton-based action recognition.

We explicitly introduce the high-level semantics of joints (joint type and frame index) into the network to enhance the feature representation capability.

In addition, we exploit the relationship of joints hierarchically through two modules, i.e., a joint-level module for modeling the correlations of joints in the same frame and a frame-level module for modeling the dependencies of frames by taking the joints in the same frame as a whole.

Note: the two modules thus cover relationships within a frame and across frames, respectively.

Introduction

A skeleton is a type of well-structured data in which each joint of the human body is identified by a joint type, a frame index, and a 3D position.

Most skeleton-based approaches organize the coordinates of joints into a 2D map and resize the map to a size (e.g., 224×224) suitable for the input of a CNN (e.g., ResNet50). Its rows and columns correspond to the different joint types and frame indexes, respectively.
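As a rough illustration of this 2D-map organization, the sketch below arranges a skeleton sequence as a joints-by-frames map and resizes it. The function name `skeleton_to_map`, the min-max normalization, and the nearest-neighbor resizing are assumptions made for illustration; the approaches referred to above may preprocess the data differently.

```python
import numpy as np

def skeleton_to_map(seq, size=224):
    """Arrange a (T, J, 3) skeleton sequence as a size x size x 3 map."""
    # Rows index joint types, columns index frames, channels hold x, y, z.
    img = seq.transpose(1, 0, 2)
    # Min-max normalize to [0, 1] (assumed preprocessing).
    lo, hi = img.min(), img.max()
    img = (img - lo) / (hi - lo + 1e-8)
    # Nearest-neighbor resize (assumed; real pipelines may interpolate).
    rows = np.arange(size) * img.shape[0] // size
    cols = np.arange(size) * img.shape[1] // size
    return img[rows][:, cols]

rng = np.random.default_rng(0)
sequence = rng.standard_normal((20, 25, 3))  # 20 frames, 25 joints (illustrative)
image = skeleton_to_map(sequence)
print(image.shape)  # (224, 224, 3)
```

Note that the joint type and frame index survive only implicitly, as row and column positions of the map.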

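In these methods, long-term dependencies and semantic information are expected to be captured by the large receptive fields of deep networks. This appears to be brutal and typically results in high model complexity.

Long-term dependency: when a sequence model (a language model, for example) has to use information from much earlier time steps to produce its final prediction, it must establish a dependency on information that lies far back in the sequence. That is a long-term dependency.

Receptive field: the size of the region of the input image onto which a point of a CNN layer's output feature map is mapped. Put more simply, one point on the feature map corresponds to a region of the input image, so the receptive field expresses how much of the original image a neuron at a given position in the network can see.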
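To make the receptive-field idea concrete, here is a small sketch that computes the receptive field of a stack of convolution layers along one axis. The helper name `receptive_field` and the example layer stacks are hypothetical; the recurrence used (each layer adds `(kernel - 1)` times the accumulated stride) is the standard one for plain, undilated convolutions.

```python
def receptive_field(layers):
    """Receptive field of stacked conv layers along one axis.

    layers: list of (kernel_size, stride) pairs, ordered input to output.
    """
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf

# Three 3-wide convolutions with stride 1 see 7 input positions.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
# Striding makes the receptive field grow faster with depth.
print(receptive_field([(3, 2), (3, 2), (3, 1)]))  # 15
```

This is why capturing long-range structure purely through receptive fields pushes such models toward many layers.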

We propose a semantics-guided neural network (SGN) which explicitly exploits the semantics and dynamics for highly efficient skeleton-based action recognition.

Note: here the semantics are the joint type and frame index, and the dynamics come from the 3D coordinates.

Framework of the proposed end-to-end Semantics-Guided Neural Network (SGN). It consists of a joint-level module and a frame-level module. In DR, we learn the dynamics representation of a joint by fusing the position and velocity information of a joint. Two types of semantics, i.e., joint type and frame index, are incorporated into the joint-level module and the frame-level module, respectively. To model the dependencies of joints in the joint-level module, we use three GCN layers. To model the dependencies of frames, we use two CNN layers.

For better joint-level correlation modeling, besides the dynamics, we incorporate the semantics of joint type (e.g., ‘head’ and ‘hip’) into the GCN layers, which enables content-adaptive graph construction and effective message passing among joints within each frame.

For better frame-level correlation modeling, we incorporate the semantics of the temporal frame index into the network.

We perform a Spatial MaxPooling (SMP) operation over all the features of the joints within the same frame to obtain the frame-level feature representation.

Combined with the embedded frame index information, two temporal convolutional neural network (CNN) layers are used to learn feature representations for classification.

We summarize our three main contributions as follows:

We propose to explicitly explore the joint semantics (frame index and joint type) for efficient skeleton-based action recognition. Previous works overlook the importance of semantics and rely on deep networks with high complexity for action recognition.
We present a semantics-guided neural network (SGN) to exploit the spatial and temporal correlations at joint-level and frame-level hierarchically.
We develop a lightweight strong baseline, which is more powerful than most previous methods. We hope the strong baseline will be helpful for the study of skeleton-based action recognition.

Related Work

Recurrent Neural Network based.

Recurrent neural networks, such as LSTM and GRU, are often used to model the temporal dynamics of a skeleton sequence.

The 3D coordinates of all joints in a frame are concatenated in some order to be the input vector of a time slot. These methods do not explicitly tell the network which dimensions belong to which joint. Some other RNN-based works tend to design special structures in the RNN to make it aware of the spatial structural information.
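The concatenated input described above can be sketched as follows. The sizes (20 frames, 25 joints) are illustrative assumptions, not values taken from the text.

```python
import numpy as np

T, J = 20, 25  # frames and joints (illustrative sizes)
rng = np.random.default_rng(0)
skeleton = rng.standard_normal((T, J, 3))

# One input vector per time step: all joint coordinates in a fixed order.
rnn_inputs = skeleton.reshape(T, J * 3)
print(rnn_inputs.shape)  # (20, 75)

# Joint k occupies dimensions 3k .. 3k+2, but only by convention:
# nothing in the vector itself names the joint.
k = 4
joint_k = rnn_inputs[:, 3 * k:3 * k + 3]
```

The joint identity is encoded only by position in the vector, which is the implicitness the passage points out.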

Convolutional Neural Network based.

Graph Convolutional Network based.

Semantics-Guided Neural Networks (SGN)

For a skeleton sequence, we identify a joint by its semantics (joint type and frame index) and represent it together with its dynamics (position/3D coordinates and velocity).

Dynamics Representation (DR)

Note: DR is implemented with two fully connected layers and ReLU activations.
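A minimal NumPy sketch of such a dynamics representation follows. Several details are assumptions made for illustration and are not stated in these notes: velocity is taken as the difference between consecutive frames (zero for the first frame), position and velocity each get their own two-layer embedding, the two embeddings are fused by addition, and the helper names and layer widths are hypothetical.

```python
import numpy as np

def fc_relu(x, w, b):
    """One fully connected layer followed by ReLU."""
    return np.maximum(x @ w + b, 0.0)

def embed(x, params):
    """Two fully connected layers with ReLU activations."""
    (w1, b1), (w2, b2) = params
    return fc_relu(fc_relu(x, w1, b1), w2, b2)

def init(rng, d_in, d_hidden, d_out):
    """Small random weights and zero biases for a two-layer embedding."""
    return [(rng.standard_normal((d_in, d_hidden)) * 0.1, np.zeros(d_hidden)),
            (rng.standard_normal((d_hidden, d_out)) * 0.1, np.zeros(d_out))]

def dynamics_representation(pos, pos_params, vel_params):
    """pos: (T, J, 3) joint positions -> (T, J, C1) dynamics features."""
    vel = np.zeros_like(pos)
    vel[1:] = pos[1:] - pos[:-1]  # assumed: frame difference, zero at t = 0
    # Assumed fusion: sum of the embedded position and embedded velocity.
    return embed(pos, pos_params) + embed(vel, vel_params)

rng = np.random.default_rng(0)
T, J, C1 = 20, 25, 64  # illustrative sizes
pos = rng.standard_normal((T, J, 3))
z = dynamics_representation(pos, init(rng, 3, 64, C1), init(rng, 3, 64, C1))
print(z.shape)  # (20, 25, 64)
```

The same weights are applied to every joint in every frame, so the embedding itself carries no information about which joint or frame it came from; that is what the semantics are added for.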

Joint-level Module

We design a joint-level module to exploit the correlations of joints in the same frame.

We adopt graph convolutional networks (GCN) to explore the correlations for the structural skeleton data.

Some previous GCN-based approaches take the joints as nodes, and they either pre-define the graph connections (edges) based on prior knowledge or learn a content-adaptive graph. We also learn a content-adaptive graph, but in contrast we incorporate the semantics of joint type into the GCN layers for more effective learning.

We enhance the power of GCN layers by making full use of the semantics from two aspects.

First, we use the semantics of joint type and the dynamics to learn the graph connections among the nodes (different joints) within a frame. The joint type information is helpful for learning a suitable adjacency matrix (i.e., relations between joints in terms of connecting weights). Take two source joints, foot and hand, and a target joint, head, as an example: intuitively, the connection weight from foot to head should be different from the weight from hand to head even when the dynamics of foot and hand are the same, because the joint types differ.

Second, as part of the information of a joint, the semantics of joint type takes part in the message passing process in the GCN layers.

Frame-level Module

We design a frame-level module to exploit the correlations across frames. To make the network know the order of frames, we incorporate the semantics of frame index to enhance the representation capability of a frame.

To merge the information of all joints in a frame, we apply one spatial MaxPooling (SMP) layer to aggregate them across the joints.

Two CNN layers are applied. The first CNN layer is a temporal convolution layer to model the dependencies of frames.

The second CNN layer, with a kernel size of 1, is used to enhance the representation capability of the learned features by mapping them to a high-dimensional space.

After the two CNN layers, we apply a temporal MaxPooling layer to aggregate the information of all frames and obtain the sequence-level feature representation of C4 dimensions.

This is then followed by a fully connected layer with Softmax to perform the classification.

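Network Structure Analysis

DR represents each joint by its 3D coordinates and velocity: p is the position, v is the velocity, and p~ is the embedded position.

Concatenation: the joint type (first one-hot encoded, then passed through two FC layers, giving an output of size T*J*C1) is concatenated with the joint dynamics (T*J*C1). The Z in the paper is the output of this concatenation step (C in the figure).

θ and φ are two transforms; their outputs are multiplied, giving an output of size T*J*J.

Each J*J slice can then be viewed as an adjacency matrix. Softmax normalizes each row, and the normalized adjacency matrix is G.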
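The concatenation and adjacency steps can be sketched in NumPy as follows. This is an illustration under assumptions, not the authors' code: θ and φ are modelled as single linear maps, biases are omitted, the channel width `C2` and all sizes are made up, and random tensors stand in for the learned features.

```python
import numpy as np

def two_fc(x, w1, w2):
    """Two fully connected layers with ReLU (biases omitted for brevity)."""
    return np.maximum(np.maximum(x @ w1, 0.0) @ w2, 0.0)

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
T, J, C1, C2 = 4, 25, 8, 16  # illustrative sizes

dynamics = rng.standard_normal((T, J, C1))  # stand-in for the DR output
# One-hot joint types are the rows of the J x J identity matrix.
joint_type = two_fc(np.eye(J),
                    rng.standard_normal((J, C1)),
                    rng.standard_normal((C1, C1)))  # (J, C1)

# Concatenation: every frame receives the same joint-type embedding.
z = np.concatenate(
    [dynamics, np.broadcast_to(joint_type, (T, J, C1))], axis=-1)

# theta and phi: two transforms of z (assumed to be linear maps here).
theta = z @ rng.standard_normal((2 * C1, C2))  # (T, J, C2)
phi = z @ rng.standard_normal((2 * C1, C2))    # (T, J, C2)

scores = theta @ phi.transpose(0, 2, 1)  # (T, J, J), one matrix per frame
g = softmax(scores, axis=-1)             # row-normalized adjacency G
print(z.shape, g.shape)  # (4, 25, 16) (4, 25, 25)
```

Because the joint-type embedding is part of `z`, two joints with identical dynamics still produce different rows and columns of G.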

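The three GCN layers in the figure are connected so as to give a residual effect (similar to ResNet).

The form that the GCN formula takes in this paper, with W (J*C3), is related to the general GCN formula as follows.

GCN formula: H(l+1) = σ(D~^(-1/2) A~ D~^(-1/2) H(l) W(l))

Here A~ is the adjacency matrix plus the identity matrix I (the self-connection, so that a node's own information is not lost during message passing); the first three factors, D~^(-1/2) A~ D~^(-1/2), perform symmetric normalization, where D~ is the degree matrix of A~; H is the input of the layer and W is the weight matrix.

Writing a single weight matrix W for both terms, the formula in the paper can be rearranged as Z` = GZW + ZW = (G+I)ZW, which corresponds to the GCN formula.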
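The rearrangement can be checked numerically with a short sketch. The function name `gcn_layer` and the sizes are hypothetical; the function accepts two weight matrices so that the aggregation term and the self term could in principle differ, and passing the same matrix twice reproduces the single-W form used in the note.

```python
import numpy as np

def gcn_layer(g, z, w_y, w_z):
    """Z' = G Z W_y + Z W_z for g: (T, J, J) and z: (T, J, C_in)."""
    return g @ z @ w_y + z @ w_z

rng = np.random.default_rng(0)
T, J, C_in, C3 = 4, 25, 16, 32  # illustrative sizes
g = rng.random((T, J, J))
g /= g.sum(axis=-1, keepdims=True)  # row-normalized, like a softmax output
z = rng.standard_normal((T, J, C_in))
w = rng.standard_normal((C_in, C3))

z_out = gcn_layer(g, z, w, w)  # same W for both terms
same = np.allclose(z_out, (g + np.eye(J)) @ z @ w)
print(z_out.shape, same)  # (4, 25, 32) True
```

The `Z W` term plays the role of the self-connection I in the general formula, and also acts as the residual path.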

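The final output Z` has size T*J*C3. At this point every frame of Z` has fused the joint type and dynamics of each joint, together with the result of passing features among the joints.

The frame index is also one-hot encoded and passed through two FC layers, giving an embedding of dimension C3*1.

Sum: the joint feature learned above and the frame-index embedding are added to represent joint k at frame t.

Over the whole T*J*C3 tensor there are T*J joints in total, and each one has f~ (the frame-index information) added to it; as before, Z denotes all joints.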
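A small sketch of this Sum step, under assumptions: biases are omitted, the hidden width of 64 and all sizes are illustrative, and a random tensor stands in for the joint-level output.

```python
import numpy as np

def two_fc(x, w1, w2):
    """Two fully connected layers with ReLU (biases omitted for brevity)."""
    return np.maximum(np.maximum(x @ w1, 0.0) @ w2, 0.0)

rng = np.random.default_rng(0)
T, J, C3 = 20, 25, 32  # illustrative sizes
z_prime = rng.standard_normal((T, J, C3))  # stand-in for the joint-level output

# One-hot frame indexes are the rows of the T x T identity matrix.
frame_embed = two_fc(np.eye(T),
                     rng.standard_normal((T, 64)),
                     rng.standard_normal((64, C3)))  # (T, C3)

# Every joint of frame t receives the same frame-index embedding.
z_sum = z_prime + frame_embed[:, None, :]
print(z_sum.shape)  # (20, 25, 32)
```

Broadcasting over the joint axis is what makes all J joints of a frame share one embedding.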

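SMP (spatial MaxPooling, a downsampling step) merges the information of all joints in the same frame; the output is T*1*C3.

The first CNN layer is a temporal convolution that models the dependencies between frames.

The second CNN layer uses a 1*1 kernel (which raises or lowers the number of channels without changing the spatial size) to raise the dimension to C4, enhancing the representation capability of the learned features.

TMP (temporal MaxPooling) aggregates all frames into one; the output is 1*1*C4.

FCL: a fully connected layer followed by Softmax performs the classification.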
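The frame-level steps can be put together in one NumPy sketch. It is a simplified illustration under assumptions: the first convolution uses a kernel size of 3 with zero padding (the notes do not give the size), ReLU follows each convolution, weights are random, and the sizes and the number of classes are made up.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1D vector."""
    e = np.exp(x - x.max())
    return e / e.sum()

def temporal_conv(x, w, b):
    """1D convolution over time with zero padding, followed by ReLU.

    x: (T, C_in), w: (K, C_in, C_out) with odd K, b: (C_out,).
    """
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.stack([sum(xp[t + i] @ w[i] for i in range(k))
                    for t in range(x.shape[0])])
    return np.maximum(out + b, 0.0)

def frame_level(z, w1, b1, w2, b2, w_fc, b_fc):
    """z: (T, J, C3) joint features -> class probabilities."""
    frames = z.max(axis=1)             # SMP: (T, J, C3) -> (T, C3)
    h = temporal_conv(frames, w1, b1)  # temporal convolution: (T, C3)
    h = temporal_conv(h, w2, b2)       # kernel size 1: (T, C3) -> (T, C4)
    seq = h.max(axis=0)                # TMP: (T, C4) -> (C4,)
    return softmax(seq @ w_fc + b_fc)  # FC layer + Softmax

rng = np.random.default_rng(0)
T, J, C3, C4, num_classes = 20, 25, 32, 64, 10  # illustrative sizes
z = rng.standard_normal((T, J, C3))
probs = frame_level(
    z,
    rng.standard_normal((3, C3, C3)) * 0.1, np.zeros(C3),
    rng.standard_normal((1, C3, C4)) * 0.1, np.zeros(C4),
    rng.standard_normal((C4, num_classes)) * 0.1, np.zeros(num_classes))
print(probs.shape)  # (10,)
```

The two max-pooling steps remove the joint axis and then the time axis, so the classifier sees a single C4-dimensional vector per sequence.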


SGN：CVPR20-Semantics-Guided Neural Networks for Efficient Skeleton-Based Human Action Recognition