
In this paper, we propose a simple yet effective semantics-guided neural network (SGN) for skeleton-based action recognition.


We explicitly introduce the high level semantics of joints (joint type and frame index) into the network to enhance the feature representation capability.


In addition, we exploit the relationship of joints hierarchically through two modules, i.e., a joint-level module for modeling the correlations of joints in the same frame and a frame-level module for modeling the dependencies of frames by taking the joints in the same frame as a whole.



A skeleton sequence is well-structured data: each joint of the human body is identified by a joint type, a frame index, and a 3D position.


Most skeleton-based approaches organize the joint coordinates into a 2D map and resize it to a size (e.g., 224×224) suitable as input to a CNN (e.g., ResNet50). The rows and columns of the map correspond to joint types and frame indexes, respectively.
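The 2D-map layout described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function names `skeleton_to_map` and `resize_nearest` are hypothetical, and nearest-neighbour interpolation is an assumed choice, since the notes do not say how the map is resized.

```python
def skeleton_to_map(sequence):
    """Arrange a skeleton sequence as a joints-by-frames map.

    sequence[t][j] is the (x, y, z) position of joint j in frame t.
    grid[j][t] holds the same triple, so rows index joint types,
    columns index frames, and x, y, z act as the channels.
    """
    num_frames = len(sequence)
    num_joints = len(sequence[0])
    return [[sequence[t][j] for t in range(num_frames)] for j in range(num_joints)]


def resize_nearest(grid, height, width):
    """Nearest-neighbour resize (an assumed choice) to height x width."""
    rows, cols = len(grid), len(grid[0])
    return [[grid[r * rows // height][c * cols // width] for c in range(width)]
            for r in range(height)]


# Toy sequence: 2 frames, 3 joints.
seq = [[(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)],
       [(0.0, 1.0, 0.0), (1.0, 1.0, 0.0), (2.0, 1.0, 0.0)]]
grid = skeleton_to_map(seq)
image = resize_nearest(grid, 224, 224)
print(len(grid), len(grid[0]))    # 3 2
print(len(image), len(image[0]))  # 224 224
```

After this step the joint type and frame index survive only as pixel positions, which is why such methods depend on the CNN to recover them.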


In these methods, long-term dependencies and semantic information are expected to be captured by the large receptive fields of deep networks. This brute-force approach typically results in high model complexity.



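Receptive field: the size of the region of the input image that a pixel on the feature map output by a given layer of a convolutional neural network maps to. Put more simply, one point on the feature map corresponds to a region of the input image. It describes how much of the original image the neurons at different positions inside the network can see.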
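To make the receptive-field idea concrete, here is a minimal sketch of the standard recurrence for a stack of convolution layers. The function name `receptive_field` and the (kernel size, stride) layer description are assumptions made for illustration; padding and dilation are ignored.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) after each layer.

    layers is a list of (kernel_size, stride) pairs, applied in order.
    """
    size, jump = 1, 1  # an input pixel sees itself; neighbouring outputs are 1 pixel apart
    sizes = []
    for kernel, stride in layers:
        size += (kernel - 1) * jump  # each extra kernel tap widens the view by `jump` pixels
        jump *= stride               # striding spreads neighbouring outputs further apart
        sizes.append(size)
    return sizes


# Three 3x3 convolutions with stride 1: the view grows by 2 pixels per layer.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # [3, 5, 7]
# A stride-2 first layer makes every later layer widen the view twice as fast.
print(receptive_field([(3, 2), (3, 1), (3, 1)]))  # [3, 7, 11]
```

The slow growth per layer is the reason a network needs considerable depth before one output position covers a whole 224×224 map.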

We propose a semantics-guided neural network (SGN) which explicitly exploits the semantics and dynamics for highly efficient skeleton-based action recognition.


Framework of the proposed end-to-end Semantics-Guided Neural Network (SGN). It consists of a joint-level module and a frame-level module. In the dynamics representation (DR) block, we learn the dynamics representation of a joint by fusing its position and velocity information. Two types of semantics, i.e., joint type and frame index, are incorporated into the joint-level module and the frame-level module, respectively. To model the dependencies of joints in the joint-level module, we use three GCN layers. To model the dependencies of frames, we use two CNN layers.


For better joint-level correlation modeling, besides the dynamics, we incorporate the semantics of joint type (e.g., 'head' and 'hip') into the GCN layers, which enables content-adaptive graph construction and effective message passing among joints within each frame.


For better frame-level correlation modeling, we incorporate the semantics of the temporal frame index into the network.


We perform a Spatial MaxPooling (SMP) operation over all the features of the joints within the same frame to obtain a frame-level feature representation.


Combined with the embedded frame index information, two temporal convolutional neural network (CNN) layers are used to learn feature representations for classification.


We summarize our three main contributions as follows:

  1. We propose to explicitly explore the joint semantics (frame index and joint type) for efficient skeleton-based action recognition. Previous works overlook the importance of semantics and rely on deep networks with high complexity for action recognition.

  2. We present a semantics-guided neural network (SGN) to exploit the spatial and temporal correlations at joint-level and frame-level hierarchically.

  3. We develop a lightweight strong baseline, which is more powerful than most previous methods. We hope the strong baseline will be helpful for the study of skeleton-based action recognition.



Related Work

Recurrent Neural Network based.

Recurrent neural networks, such as LSTM and GRU, are often used to model the temporal dynamics of skeleton sequences.


The 3D coordinates of all joints in a frame are concatenated in some order to form the input vector of a time slot. Such methods do not explicitly tell the network which dimensions belong to which joint. Some other RNN-based works design special structures in the RNN to make it aware of the spatial structural information.


Convolutional Neural Network based.

Graph Convolutional Network based.

Semantics-Guided Neural Networks (SGN)

For a skeleton sequence, we identify a joint by its semantics (joint type and frame index) and represent it together with its dynamics (position/3D coordinates and velocity).


Dynamics Representation (DR)
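As a rough sketch of how position and velocity could be fused into one dynamics representation per joint: velocity is taken as the difference between consecutive frames, each stream goes through its own embedding, and the two embeddings are summed. The single linear-plus-ReLU embedding, the summation, the zero velocity for the first frame, and all names (`velocities`, `embed`, `dynamics_representation`, `w_pos`, `w_vel`) are assumptions for illustration rather than details stated in these notes.

```python
def velocities(positions):
    """Temporal difference of joint positions.

    positions[t][j] is an (x, y, z) tuple. The first frame gets zero
    velocity (an assumption, since it has no predecessor).
    """
    zero = [(0.0, 0.0, 0.0) for _ in positions[0]]
    vels = [zero]
    for prev, cur in zip(positions, positions[1:]):
        vels.append([tuple(c - p for p, c in zip(pj, cj)) for pj, cj in zip(prev, cur)])
    return vels


def embed(vec, weights):
    """Linear map followed by ReLU; weights has one row per output dimension."""
    return [max(0.0, sum(w * x for w, x in zip(row, vec))) for row in weights]


def dynamics_representation(positions, w_pos, w_vel):
    """Fuse the position and velocity embeddings of every joint by summation."""
    vels = velocities(positions)
    return [[[a + b for a, b in zip(embed(p, w_pos), embed(v, w_vel))]
             for p, v in zip(frame_p, frame_v)]
            for frame_p, frame_v in zip(positions, vels)]


# Two frames, one joint, 2-dimensional embeddings with hand-picked weights.
positions = [[(1.0, 2.0, 3.0)], [(2.0, 2.0, 5.0)]]
w_pos = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
w_vel = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
print(dynamics_representation(positions, w_pos, w_vel))
# [[[1.0, 2.0]], [[3.0, 4.0]]]
```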


Joint-level Module

We design a joint-level module to exploit the correlations of joints in the same frame.


We adopt graph convolutional networks (GCN) to explore the correlations for the structural skeleton data.


Some previous GCN-based approaches take the joints as nodes and either pre-define the graph connections (edges) based on prior knowledge or learn a content-adaptive graph. We also learn a content-adaptive graph, but unlike those approaches we incorporate the semantics of joint type into the GCN layers for more effective learning.


We enhance the power of GCN layers by making full use of the semantics from two aspects.


First, we use the semantics of joint type and the dynamics to learn the graph connections among the nodes (different joints) within a frame. The joint type information is helpful for learning a suitable adjacency matrix (i.e., the relations between joints in terms of connection weights). Take two source joints, foot and hand, and a target joint, head, as an example: intuitively, the connection weight from foot to head should differ from the weight from hand to head even when the dynamics of foot and hand are the same.


Second, as part of the information of a joint, the semantics of joint type takes part in the message passing process in the GCN layers.
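The two roles of joint-type semantics can be illustrated with a small sketch of one graph layer over the joints of a single frame. It assumes a one-hot joint-type code concatenated to the dynamics features, pairwise scores from two linear projections, and a row-wise softmax to obtain the graph; the names `joint_level_layer`, `w_theta`, `w_phi`, and `w_out` are hypothetical, and the weights are hand-picked rather than learned.

```python
import math


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def one_hot(num_joints):
    return [[1.0 if i == j else 0.0 for j in range(num_joints)] for i in range(num_joints)]


def joint_level_layer(dynamics, w_theta, w_phi, w_out):
    """One graph layer over the joints of a single frame.

    dynamics: J x C dynamics features, one row per joint.
    w_theta, w_phi: (C + J) x D projections used to score joint pairs.
    w_out: (C + J) x C_out weights applied after message passing.
    """
    num_joints = len(dynamics)
    # Append the joint-type code so both the graph and the messages can depend on it.
    z = [feat + code for feat, code in zip(dynamics, one_hot(num_joints))]
    theta, phi = matmul(z, w_theta), matmul(z, w_phi)
    scores = matmul(theta, [list(col) for col in zip(*phi)])  # J x J pairwise scores
    graph = softmax_rows(scores)                              # each row sums to 1
    return graph, matmul(matmul(graph, z), w_out)


# Joints in order: head, hand, foot. Hand and foot share identical dynamics.
dynamics = [[1.0], [0.5], [0.5]]
w_theta = [[1.0], [0.0], [0.0], [0.0]]
w_phi = [[0.0], [0.0], [1.0], [-1.0]]   # only the joint-type entries matter here
w_out = [[1.0], [0.0], [0.0], [0.0]]
graph, out = joint_level_layer(dynamics, w_theta, w_phi, w_out)
print(graph[0][1] > graph[0][2])  # True
```

Because hand and foot differ only in their joint-type code, the two weights in the head row can differ only through that code, which is the point of the foot/hand/head example.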


Frame-level Module

We design a frame-level module to exploit the correlations across frames. To make the network aware of the frame order, we incorporate the semantics of frame index to enhance the representation capability of a frame.


To merge the information of all joints in a frame, we apply one spatial MaxPooling layer to aggregate them across the joints.


Two CNN layers are applied. The first CNN layer is a temporal convolution layer to model the dependencies of frames.


The second CNN layer, with a kernel size of 1, enhances the representation capability of the learned features by mapping them to a high-dimensional space.


After the two CNN layers, we apply a temporal MaxPooling layer to aggregate the information of all frames and obtain the sequence level feature representation of C4 dimensions.


This is then followed by a fully connected layer with Softmax to perform the classification.
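The frame-level steps above (spatial max pooling, two temporal CNN layers, temporal max pooling, fully connected layer with Softmax) can be strung together as follows. This is a minimal sketch: the kernel size of 3 for the first layer, the ReLU activations, adding the frame-index embedding by summation, and every name (`frame_level_module`, `frame_embed`, `conv1`, `conv2`, `fc`) are assumptions, and the weights are arbitrary toy values.

```python
import math


def spatial_max_pool(features):
    """features[t][j][c] -> pooled[t][c]: max over the joints of each frame."""
    return [[max(channel) for channel in zip(*frame)] for frame in features]


def temporal_conv(frames, kernels, pad):
    """1D convolution along time with ReLU. kernels[o][k][c] weights channel c at tap k."""
    width = len(kernels[0])
    zeros = [0.0] * len(frames[0])
    padded = [zeros] * pad + frames + [zeros] * pad
    out = []
    for t in range(len(padded) - width + 1):
        window = padded[t:t + width]
        out.append([max(0.0, sum(w * x for taps, frame in zip(kernel, window)
                                 for w, x in zip(taps, frame)))
                    for kernel in kernels])
    return out


def temporal_max_pool(frames):
    """frames[t][c] -> one vector: max over all frames per channel."""
    return [max(channel) for channel in zip(*frames)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def frame_level_module(features, frame_embed, conv1, conv2, fc):
    pooled = spatial_max_pool(features)
    # Add the frame-index embedding so each frame carries its position in time.
    pooled = [[a + b for a, b in zip(f, e)] for f, e in zip(pooled, frame_embed)]
    hidden = temporal_conv(pooled, conv1, pad=1)  # kernel size 3 over neighbouring frames
    hidden = temporal_conv(hidden, conv2, pad=0)  # kernel size 1, maps to more channels
    sequence_feature = temporal_max_pool(hidden)
    logits = [sum(w * x for w, x in zip(row, sequence_feature)) for row in fc]
    return softmax(logits)


# 3 frames, 2 joints, 2 channels; 2 -> 2 -> 3 channels; 2 action classes.
features = [[[1.0, 0.0], [0.0, 2.0]],
            [[0.0, 1.0], [3.0, 0.0]],
            [[2.0, 2.0], [1.0, 1.0]]]
frame_embed = [[0.0, 0.0], [0.1, 0.1], [0.2, 0.2]]
conv1 = [[[0.5, 0.0], [1.0, 0.0], [0.5, 0.0]],
         [[0.0, 0.5], [0.0, 1.0], [0.0, 0.5]]]
conv2 = [[[1.0, 0.0]], [[0.0, 1.0]], [[1.0, 1.0]]]
fc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
probs = frame_level_module(features, frame_embed, conv1, conv2, fc)
print(len(probs))  # 2
```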












Rearranging gives Z' = GZW + ZW = (G + I)ZW, which corresponds to the GCN formula.
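The identity GZW + ZW = (G + I)ZW can be checked numerically on a small example. The matrices below are arbitrary values chosen for illustration, and the helpers `matmul` and `add` are hypothetical names.

```python
def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


G = [[0.5, 0.5], [0.25, 0.75]]   # graph weights between 2 joints
Z = [[1.0, 2.0], [3.0, 4.0]]     # one feature row per joint
W = [[1.0, 0.0], [1.0, 1.0]]     # shared transformation weights
I = [[1.0, 0.0], [0.0, 1.0]]     # identity: each joint keeps its own feature

lhs = add(matmul(matmul(G, Z), W), matmul(Z, W))  # GZW + ZW
rhs = matmul(matmul(add(G, I), Z), W)             # (G + I)ZW
print(lhs)         # [[8.0, 5.0], [13.0, 7.5]]
print(lhs == rhs)  # True
```

The residual term ZW is therefore equivalent to adding a self-connection to every joint in the graph, provided both terms share the same W.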










