Paper Reading <2>: Dynamic Routing Between Capsules

  • Abstract
  • 1 Introduction
  • 2 How the vector inputs and outputs of a capsule are computed
  • 3 Margin loss for digit existence
  • 4 CapsNet architecture
    • 4.1 Reconstruction as a regularization method
  • 5 Capsules on MNIST
    • 5.1 What the individual dimensions of a capsule represent
    • 5.2 Robustness to Affine Transformations
  • 6 Segmenting highly overlapping digits
  • 7 Other datasets
  • 8 Discussion and previous work
  • A How many routing iterations to use?
  • Reference
  • Code


Abstract

A capsule is a group of neurons whose activity vector represents the instantiation parameters of a specific type of entity such as an object or an object part. We use the length of the activity vector to represent the probability that the entity exists and its orientation to represent the instantiation parameters. Active capsules at one level make predictions, via transformation matrices, for the instantiation parameters of higher-level capsules. When multiple predictions agree, a higher-level capsule becomes active. We show that a discriminatively trained, multi-layer capsule system achieves state-of-the-art performance on MNIST and is considerably better than a convolutional net at recognizing highly overlapping digits. To achieve these results we use an iterative routing-by-agreement mechanism: a lower-level capsule prefers to send its output to higher-level capsules whose activity vectors have a big scalar product with the prediction coming from the lower-level capsule.


1 Introduction


An obvious way to represent existence is by using a separate logistic unit whose output is the probability that the entity exists. In this paper we explore an interesting alternative which is to use the overall length of the vector of instantiation parameters to represent the existence of the entity and to force the orientation of the vector to represent the properties of the entity.


For each possible parent, the capsule computes a “prediction vector” by multiplying its own output by a weight matrix. This type of “routing-by-agreement” should be far more effective than the very primitive form of routing implemented by max-pooling, which allows neurons in one layer to ignore all but the most active feature detector in a local pool in the layer below. We demonstrate that our dynamic routing mechanism is an effective way to implement the “explaining away” that is needed for segmenting highly overlapping objects.


Convolutional neural networks (CNNs) use translated replicas of learned feature detectors. This allows them to translate knowledge about good weight values acquired at one position in an image to other positions. This has proven extremely helpful in image interpretation. Even though we are replacing the scalar-output feature detectors of CNNs with vector-output capsules and max-pooling with routing-by-agreement, we would still like to replicate learned knowledge across space. To achieve this, we make all but the last layer of capsules be convolutional. As with CNNs, we make higher-level capsules cover larger regions of the image. Unlike max-pooling however, we do not throw away information about the precise position of the entity within the region. For low level capsules, location information is “place-coded” by which capsule is active. As we ascend the hierarchy, more and more of the positional information is “rate-coded” in the real-valued components of the output vector of a capsule. This shift from place-coding to rate-coding combined with the fact that higher-level capsules represent more complex entities with more degrees of freedom suggests that the dimensionality of capsules should increase as we ascend the hierarchy.


2 How the vector inputs and outputs of a capsule are computed

The aim of this paper is not to explore this whole space but simply to show that one fairly straightforward implementation works well and that dynamic routing helps.

We want the length of the output vector of a capsule to represent the probability that the entity represented by the capsule is present in the current input. We therefore use a non-linear “squashing” function to ensure that short vectors get shrunk to almost zero length and long vectors get shrunk to a length slightly below 1. We leave it to discriminative learning to make good use of this non-linearity.


$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2} \, \frac{s_j}{\|s_j\|} \qquad (1)$$

where $v_j$ is the vector output of capsule $j$ and $s_j$ is its total input.
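
The squashing behaviour described above can be checked numerically. The following is a minimal sketch in plain Python, assuming the squashing form from the paper, v = (|s|^2 / (1 + |s|^2)) * s / |s|. The names `squash` and `length` and the small `eps` guard against division by zero are illustrative choices, not part of the paper.

```python
import math


def length(v):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def squash(s, eps=1e-9):
    """Rescale s so its length lies in [0, 1) while keeping its direction.

    eps is a hypothetical numerical guard for the zero vector.
    """
    sq_norm = sum(x * x for x in s)
    norm = math.sqrt(sq_norm)
    scale = sq_norm / (1.0 + sq_norm) / (norm + eps)
    return [scale * x for x in s]


short_vec = squash([0.01, 0.0])   # a short input vector
long_vec = squash([30.0, 40.0])   # a long input vector (length 50)

print(length(short_vec))  # shrunk to almost zero
print(length(long_vec))   # slightly below 1
```

Short inputs are shrunk roughly quadratically, while long inputs saturate just under length 1, so the output length can be read as a probability.
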
For all but the first layer of capsules, the total input to a capsule $s_j$ is a weighted sum over all “prediction vectors” $\hat{u}_{j|i}$ from the capsules in the layer below, where each prediction vector is produced by multiplying the output $u_i$ of a capsule in the layer below by a weight matrix $W_{ij}$:

$$s_j = \sum_i c_{ij} \, \hat{u}_{j|i}, \qquad \hat{u}_{j|i} = W_{ij} u_i \qquad (2)$$
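
To make the two steps concrete (prediction vectors, then the weighted sum), here is a small sketch for a single parent capsule j. The helper names `matvec` and `total_input`, the toy dimensions, and the fixed coupling values are all illustrative assumptions. In the actual model the coupling coefficients come from the routing process.

```python
def matvec(W, u):
    """Multiply matrix W (list of rows) by vector u."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]


def total_input(c_j, u_hat_j):
    """s_j = sum_i c_ij * u_hat_{j|i} for one parent capsule j."""
    dim = len(u_hat_j[0])
    return [sum(c * u_hat[d] for c, u_hat in zip(c_j, u_hat_j))
            for d in range(dim)]


# Two lower-level capsules with 2-D outputs u_i (toy values).
u = [[1.0, 0.0], [0.0, 2.0]]

# One 3x2 weight matrix W_ij per lower-level capsule i (toy values).
W = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[0.5, 0.0], [0.0, 0.5], [0.0, 0.0]],
]

# Prediction vectors u_hat_{j|i} = W_ij u_i, here 3-D.
u_hat = [matvec(W[i], u[i]) for i in range(len(u))]

# Coupling coefficients c_ij for this parent, fixed by hand for illustration.
c = [0.25, 0.75]
s = total_input(c, u_hat)
print(u_hat)  # [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]
print(s)      # [0.25, 0.75, 0.25]
```
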

where the $c_{ij}$ are coupling coefficients that are determined by the iterative dynamic routing process. The coupling coefficients between capsule $i$ and all the capsules in the layer above sum to 1 and are determined by a “routing softmax” whose initial logits $b_{ij}$ are the log prior probabilities that capsule $i$ should be coupled to capsule $j$:

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_k \exp(b_{ik})} \qquad (3)$$
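
A short sketch of the routing softmax for a single lower-level capsule i. The function name `routing_softmax` and the max-subtraction trick (a standard way to avoid overflow) are assumptions added for illustration.

```python
import math


def routing_softmax(b_i):
    """Turn the logits b_ij of one lower-level capsule i into coupling
    coefficients c_ij that are positive and sum to 1 over the parents j."""
    m = max(b_i)  # subtract the max for numerical stability
    exps = [math.exp(b - m) for b in b_i]
    total = sum(exps)
    return [e / total for e in exps]


# With all logits equal to zero, every parent gets the same share.
c_uniform = routing_softmax([0.0] * 10)

# A larger logit for parent 0 pulls most of the output toward it.
c_skewed = routing_softmax([2.0, 0.0, 0.0])

print(c_uniform[0])
print(c_skewed)
```

Because the softmax is taken over the parents of one lower-level capsule, raising the coupling to one parent necessarily lowers the coupling to the others.
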

The log priors can be learned discriminatively at the same time as all the other weights. They depend on the location and type of the two capsules but not on the current input image. The initial coupling coefficients are then iteratively refined by measuring the agreement between the current output $v_j$ of each capsule $j$ in the layer above and the prediction $\hat{u}_{j|i}$ made by capsule $i$.

The agreement is simply the scalar product $a_{ij} = v_j \cdot \hat{u}_{j|i}$. This agreement is treated as if it was a log likelihood and is added to the initial logit, $b_{ij}$, before computing the new values for all the coupling coefficients linking capsule $i$ to higher-level capsules.
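
Putting the pieces together, the routing loop can be sketched as follows. This is a plain-Python illustration under stated assumptions, not the authors' implementation. The names `route`, `squash`, `softmax` and `dot` are hypothetical, the logits start at zero, the default of three iterations is an assumption of this sketch, and the prediction vectors are toy values chosen so that all three lower-level capsules agree on parent 0 but disagree on parent 1.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def length(v):
    return math.sqrt(dot(v, v))


def squash(s, eps=1e-9):
    sq_norm = dot(s, s)
    scale = sq_norm / (1.0 + sq_norm) / (math.sqrt(sq_norm) + eps)
    return [scale * x for x in s]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(u_hat, num_iters=3):
    """u_hat[i][j] is the prediction of lower capsule i for parent j."""
    n_in, n_out, dim = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_out for _ in range(n_in)]  # routing logits b_ij
    for _ in range(num_iters):
        c = [softmax(b_i) for b_i in b]  # coupling coefficients c_ij
        v = []
        for j in range(n_out):
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_in))
                   for d in range(dim)]
            v.append(squash(s_j))
        for i in range(n_in):
            for j in range(n_out):
                b[i][j] += dot(u_hat[i][j], v[j])  # agreement a_ij
    return v, c


u_hat = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
]
v, c = route(u_hat)
print([length(v_j) for v_j in v])  # parent 0 ends up with the longer vector
print(c)                           # couplings shift toward parent 0
```

In this toy case the agreeing predictions reinforce each other, so parent 0 becomes more active with each iteration while the conflicting predictions for parent 1 largely cancel.
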

In convolutional capsule layers, each capsule outputs a local grid of vectors to each type of capsule in the layer above using different transformation matrices for each member of the grid as well as for each type of capsule.


3 Margin loss for digit existence

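We use the length of the instantiation vector to represent the probability that a capsule's entity exists. We would like the top-level capsule for digit class $k$ to have a long instantiation vector if and only if that digit is present in the image. To allow for multiple digits, we use a separate margin loss, $L_k$, for each digit capsule $k$:

$$L_k = T_k \max(0, m^+ - \|v_k\|)^2 + \lambda (1 - T_k) \max(0, \|v_k\| - m^-)^2 \qquad (4)$$

where $T_k = 1$ if and only if a digit of class $k$ is present, $m^+ = 0.9$ and $m^- = 0.1$. The $\lambda$ down-weighting of the loss for absent digit classes stops the initial learning from shrinking the lengths of the activity vectors of all the digit capsules. We use $\lambda = 0.5$. The total loss is simply the sum of the losses of all digit capsules.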
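
A minimal sketch of this margin loss, using the constants stated above (m+ = 0.9, m- = 0.1, down-weighting 0.5). The function name `margin_loss`, its argument names, and the example capsule lengths are illustrative assumptions.

```python
def margin_loss(lengths, targets, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Sum of per-class margin losses.

    lengths[k] is the length of the output vector of digit capsule k,
    targets[k] is 1 if a digit of class k is present and 0 otherwise.
    """
    total = 0.0
    for norm, t in zip(lengths, targets):
        # Penalize a present class whose capsule is shorter than m_plus.
        present = t * max(0.0, m_plus - norm) ** 2
        # Penalize (down-weighted) an absent class longer than m_minus.
        absent = lam * (1 - t) * max(0.0, norm - m_minus) ** 2
        total += present + absent
    return total


# Class 0 is present and confidently detected, class 1 is correctly short,
# class 2 is absent but its capsule is too long (0.3 > 0.1).
loss = margin_loss([0.95, 0.05, 0.3], [1, 0, 0])
print(loss)  # about 0.5 * 0.2**2 = 0.02

perfect = margin_loss([0.95, 0.05], [1, 0])
print(perfect)  # 0.0
```

Because each class has its own term, several targets can be 1 at once, which is what allows the same loss to be used for overlapping digits.
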

4 CapsNet architecture


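Figure 1: A simple CapsNet with 3 layers. This model gives results comparable to deep convolutional networks (such as Chang and Chen [2015]). The length of the activity vector of each capsule in the DigitCaps layer indicates the presence of an instance of each class and is used to calculate the classification loss. $W_{ij}$ is a weight matrix between each $u_i$, $i \in (1, 32 \times 6 \times 6)$, in PrimaryCapsules and $v_j$, $j \in (1, 10)$.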



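We have routing only between two consecutive capsule layers (i.e. PrimaryCapsules and DigitCaps). Since the Conv1 output is 1D, there is no orientation in its space to agree on. Therefore, no routing is used between Conv1 and PrimaryCapsules. All the routing logits ($b_{ij}$) are initialized to zero. Therefore, initially a capsule output ($u_i$) is sent to all parent capsules ($v_0 \ldots v_9$) with equal probability ($c_{ij}$). Our implementation is in TensorFlow (Abadi et al. [2016]) and we use the Adam optimizer (Kingma and Ba [2014]) with its TensorFlow default parameters, including the exponentially decaying learning rate, to minimize the sum of the margin losses in Eq. 4.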
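
The capsule counts in the figure caption can be reproduced with a little arithmetic. The sketch below assumes the layer hyperparameters given in the original paper rather than in these notes: a 28x28 MNIST input, a 9x9 stride-1 Conv1, a 9x9 stride-2 PrimaryCapsules layer with 32 channels of 8-D capsules, and ten 16-D DigitCaps, all without padding. The helper `conv_out` is a hypothetical name.

```python
def conv_out(size, kernel, stride):
    """Output side length of a convolution without padding."""
    return (size - kernel) // stride + 1


conv1_side = conv_out(28, 9, 1)            # Conv1 feature map side
primary_side = conv_out(conv1_side, 9, 2)  # PrimaryCapsules grid side

num_primary = 32 * primary_side * primary_side  # lower-level capsules u_i
num_digit = 10                                  # parent capsules v_0..v_9

# With all routing logits b_ij at zero, the softmax is uniform.
initial_c = 1.0 / num_digit

# One 8x16 transformation matrix W_ij per (i, j) pair.
num_routing_weights = num_primary * num_digit * 8 * 16

print(conv1_side, primary_side)  # 20 6
print(num_primary)               # 1152
print(initial_c)                 # 0.1
print(num_routing_weights)       # 1474560
```
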

4.1 Reconstruction as a regularization method


5 Capsules on MNIST



5.1 What the individual dimensions of a capsule represent

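Since we are passing the encoding of only one digit and zeroing out the other digits, the dimensions of a digit capsule should learn to span the space of variations in the way digits of that class are instantiated. These variations include stroke thickness, skew and width.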
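
The masking step mentioned here can be sketched as follows. The function name `mask_digit_caps` and the toy 2-D capsule values are illustrative assumptions. Only the idea of keeping one capsule and zeroing the rest comes from the text.

```python
def mask_digit_caps(digit_caps, target):
    """Keep the output vector of the target class, zero out all other
    capsules, and flatten the result into one input vector for a decoder."""
    flat = []
    for k, v in enumerate(digit_caps):
        flat.extend(v if k == target else [0.0] * len(v))
    return flat


# Three toy digit capsules with 2-D outputs; class 1 is the target.
digit_caps = [[0.1, 0.2], [0.7, 0.6], [0.0, 0.1]]
decoder_input = mask_digit_caps(digit_caps, target=1)
print(decoder_input)  # [0.0, 0.0, 0.7, 0.6, 0.0, 0.0]
```

Since the decoder only ever sees the surviving capsule's components, those components are the only place where variation within the class can be stored.
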

5.2 Robustness to Affine Transformations


6 Segmenting highly overlapping digits




7 Other datasets

One drawback of Capsules which it shares with generative models is that it likes to account for everything in the image so it does better when it can model the clutter than when it just uses an additional “orphan” category in the dynamic routing. In CIFAR-10, the backgrounds are much too varied to model in a reasonable sized net which helps to account for the poorer performance.

8 Discussion and previous work

Capsules convert pixel intensities into vectors of instantiation parameters of recognized fragments and then apply transformation matrices to the fragments to predict the instantiation parameters of larger fragments. Transformation matrices that learn to encode the intrinsic spatial relationship between a part and a whole constitute viewpoint-invariant knowledge that automatically generalizes to novel viewpoints. Hinton et al. [2011] proposed transforming autoencoders to generate the instantiation parameters of the PrimaryCapsule layer and their system required transformation matrices to be supplied externally. We propose a complete system that also answers “how larger and more complex visual entities can be recognized by using agreements of the poses predicted by active, lower-level capsules”.


Capsules make a very strong representational assumption: At each location in the image, there is at most one instance of the type of entity that a capsule represents. This assumption, which was motivated by the perceptual phenomenon called “crowding” (Pelli et al. [2004]), eliminates the binding problem (Hinton [1981a]) and allows a capsule to use a distributed representation (its activity vector) to encode the instantiation parameters of the entity of that type at a given location. This distributed representation is exponentially more efficient than encoding the instantiation parameters by activating a point on a high-dimensional grid and with the right distributed representation, capsules can then take full advantage of the fact that spatial relationships can be modelled by matrix multiplies.


Capsules use neural activities that vary as viewpoint varies rather than trying to eliminate viewpoint variation from the activities. This gives them an advantage over “normalization” methods like spatial transformer networks (Jaderberg et al. [2015]): They can deal with multiple different affine transformations of different objects or object parts at the same time.


A How many routing iterations to use?



Reference

[1]: Sabour S, Frosst N, Hinton G E. Dynamic routing between capsules. Advances in Neural Information Processing Systems. Long Beach, USA: MIT Press, 2017. 3856-3866
[2]: Dynamic Routing Between Capsules (Chinese translation)
[3]: How should we view Hinton's paper “Dynamic Routing Between Capsules”?


Code

[1]: naturomics/CapsNet-Tensorflow

