Visualizing and Understanding Convolutional Networks


Matthew D Zeiler, Rob Fergus
Dept. of Computer Science, Courant Institute, New York University


Abstract

Large Convolutional Network models have recently demonstrated impressive classification performance on the ImageNet benchmark (Krizhevsky et al., 2012). However there is no clear understanding of why they perform so well, or how they might be improved. In this paper we explore both issues. We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al on the ImageNet classification benchmark. We also perform an ablation study to discover the performance contribution from different model layers. We show our ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

Matthew D. Zeiler analyzes and tunes the network through deconvolution-based visualization.
AlexNet - Alex Krizhevsky - ImageNet Classification with Deep Convolutional Neural Networks

By visualizing the features in the hidden layers of a convolutional network, the paper analyzes how to build better network architectures, describes how each layer contributes to overall classification performance, retrains the softmax classifier at the end of the model, and evaluates how well that model classifies other datasets.

ablation study
An ablation study typically refers to removing some “feature” of the model or algorithm, and seeing how that affects performance.

Examples:

  • An LSTM has several interacting components: the input, output and forget gates plus the candidate cell update. We might ask: are all of them necessary? What if I remove one? Indeed, lots of experimentation has gone into LSTM variants, the GRU being a notable example (which is simpler).
  • If certain tricks are used to get an algorithm to work, it’s useful to know whether the algorithm is robust to removing these tricks. For example, DeepMind’s original DQN paper reports using (1) only periodically updating the reference network and (2) using a replay buffer rather than updating online. It’s very useful for the research community to know that both these tricks are necessary, in order to build on top of these results.
  • If an algorithm is a modification of a previous work, and has multiple differences, researchers want to know what the key difference is.
  • Simpler is better (inductive prior towards simpler model classes). If you can get the same performance with two models, prefer the simpler one.

By Occam's razor, if a simple method and a complex method achieve the same result, the simpler method is better and more reliable.
An ablation study removes certain features or modules of a model or algorithm to observe how they affect performance; it is essentially a model-simplification test.
An ablation study is an experiment designed to test whether a structural component proposed in a model is actually effective: to determine whether the component helps the final result, compare the results of the network with the component against the network without it.


1. Introduction

Since their introduction by (LeCun et al., 1989) in the early 1990’s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection. In the last year, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks. (Ciresan et al., 2012) demonstrate state-of-the-art performance on NORB and CIFAR-10 datasets. Most notably, (Krizhevsky et al., 2012) show record beating performance on the ImageNet 2012 classification benchmark, with their convnet model achieving an error rate of 16.4%, compared to the 2nd place result of 26.1%. Several factors are responsible for this renewed interest in convnet models: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical and (iii) better model regularization strategies, such as Dropout (Hinton et al., 2012).

Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance. From a scientific standpoint, this is deeply unsatisfactory. Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by (Zeiler et al., 2011), to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.

Using these tools, we start with the architecture of (Krizhevsky et al., 2012) and explore different architectures, discovering ones that outperform their results on ImageNet. We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by (Hinton et al., 2006) and others (Bengio et al., 2007; Vincent et al., 2008). The generalization ability of convnet features is also explored in concurrent work by (Donahue et al., 2013).

1.1. Related Work

Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers this is not the case, and there are limited methods for interpreting activity. (Erhan et al., 2009) find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit’s activation. This requires a careful initialization and does not give any information about the unit’s invariances. Motivated by the latter’s short-coming, (Le et al., 2010) (extending an idea by (Berkes & Wiskott, 2006)) show how the Hessian of a given unit may be computed numerically around the optimal response, giving some insight into invariances. The problem is that for higher layers, the invariances are extremely complex so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map. (Donahue et al., 2013) show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.

In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function, or scalar field.

2. Approach

We use standard fully supervised convnet models throughout the paper, as defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012). These models map a color 2D input image $x_i$, via a series of layers, to a probability vector $\hat{y}_i$ over the $C$ different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function ($\mathrm{relu}(x) = \max(x, 0)$); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see (Krizhevsky et al., 2012) and (Jarrett et al., 2009). The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.

We train these models using a large set of $N$ labeled images $\{x, y\}$, where label $y_i$ is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare $\hat{y}_i$ and $y_i$. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent. Full details of training are given in Section 3.
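
A minimal sketch of this supervised training step in PyTorch (the `model` and data `loader` here are hypothetical stand-ins; a layer stack matching Fig. 3 is sketched after its caption below): a cross-entropy loss compares the predicted class scores with the true label, and SGD updates the parameters from the back-propagated gradients.

```python
import torch
import torch.nn as nn

def train_one_epoch(model, loader, lr=1e-2, momentum=0.9, device="cpu"):
    """One supervised pass: cross-entropy loss + stochastic gradient descent."""
    criterion = nn.CrossEntropyLoss()               # compares predicted scores with the true class y_i
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=momentum)
    model.train()
    for images, labels in loader:                   # images: (B, 3, 224, 224), labels: (B,)
        images, labels = images.to(device), labels.to(device)
        logits = model(images)                      # class scores; the softmax is folded into the loss
        loss = criterion(logits, labels)
        optimizer.zero_grad()
        loss.backward()                             # back-propagate d(loss)/d(parameters)
        optimizer.step()                            # SGD parameter update
```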

2.1. Visualization with a Deconvnet

Understanding the operation of a convnet requires interpreting the feature activity in intermediate layers. We present a novel way to map these activities back to the input pixel space, showing what input pattern originally caused a given activation in the feature maps. We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al., 2011). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features does the opposite. In (Zeiler et al., 2011), deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.

The deconvolutional network was originally proposed for unsupervised learning in Adaptive deconvolutional networks for mid and high level feature learning. The deconvolution process in this paper has no learning capability of its own; it is only used to visualize an already trained convolutional network model, with no training involved.
A deconvnet can be viewed as the inverse of a convnet: its kernels are the transposes of the convnet's kernels, and it maps output features back to the input signal.
An input image is passed through the convnet, and each layer produces its features; the activations other than the one being examined are then zeroed, and the feature maps of the examined convnet layer are fed as input to the corresponding deconvnet layer.


To examine a convnet, a deconvnet is attached to each of its layers, as illustrated in Fig. 1 (top), providing a continuous path back to image pixels. To start, an input image is presented to the convnet and features computed throughout the layers. To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation. This is then repeated until input pixel space is reached.

Deconvolution-based visualization takes the feature maps obtained at each layer as input and runs them back through the deconvnet, in order to display what each layer has extracted. For example, attach a deconvnet after the conv5 feature maps of AlexNet and apply unpooling, un-rectification and deconvolution in turn; this maps a 13x13 feature map (conv5 is 13x13) back until an image the same size as the original input (227x227) is obtained.


Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus. See Fig. 1(bottom) for an illustration of the procedure.

The locations of the maxima are recorded during pooling; during unpooling the maximum is placed back at the recorded location and all other entries are set to zero. This is only an approximation, because during pooling the values at the non-maximum locations were generally not zero either.


Stacked What-Where Auto-encoders - pooling-unpooling

In the figure above, the left side shows the pooling process and the right side shows unpooling. With a 3x3 pooling block, max pooling yields a single output neuron, here with activation 9; pooling is thus a downsampling step that turns a 3x3 block into 1x1. Unpooling is the exact opposite: an upsampling step that reverses pooling. To expand a single neuron back into 3x3 neurons, we rely on the coordinates of the maximum recorded during pooling: during max pooling we keep not only the maximum value but also its location, and during unpooling the value is written back at exactly that location while all other activations are set to zero.
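
A minimal sketch of switch-based unpooling with PyTorch's `MaxPool2d(return_indices=True)` / `MaxUnpool2d` pair; the returned indices play the role of the paper's switch variables (the sizes here are illustrative only).

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=3, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=3, stride=2)

x = torch.randn(1, 96, 55, 55)                 # feature maps entering the pooling layer
pooled, switches = pool(x)                     # switches record where each maximum came from

recon_from_above = pooled                      # stand-in for the reconstruction from the layer above
unpooled = unpool(recon_from_above, switches, output_size=x.size())
# Each reconstructed value is placed back at its recorded max location and every other
# position is zero: an approximate inverse of max pooling that preserves the structure
# of the stimulus.
```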


Adaptive deconvolutional networks for mid and high level feature learning


Rectification: The convnet uses relu non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.


The relu is used to ensure that the output activations of every layer are positive; for the reverse pass we likewise need each layer's reconstructed feature maps to be positive, so this un-rectification step simply applies a relu as well.


Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.

In the deconvnet, the rectified maps are convolved with the transpose of the forward convolution kernel matrix to recover the image.
Inverting the whole network introduces error, so the reconstructed images are not perfectly continuous.
Whenever the output of a network is itself a full image, deconvolutional (transposed-convolution) layers are needed, for example in semantic segmentation, deblurring, visualization and image inpainting.
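
A minimal sketch of the filtering step, assuming an example forward layer `conv`: `F.conv_transpose2d` applies the transposed version of the same learned filters to the rectified maps, which amounts to convolving with each filter flipped vertically and horizontally.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

conv = nn.Conv2d(96, 256, kernel_size=5, stride=2, padding=2)   # an example forward layer
x_below = torch.randn(1, 96, 55, 55)                             # feature maps of the layer beneath
y = torch.relu(conv(x_below))                                    # forward pass: filter then rectify

rectified = torch.relu(y)                    # stand-in for the rectified reconstruction from above
recon_below = F.conv_transpose2d(rectified, conv.weight, stride=2, padding=2)
# recon_below has shape (1, 96, 55, 55): the same learned filters, applied in transposed
# form, map the reconstruction back to the space of the layer beneath.
```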


Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution toward the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.


Figure 1. Top: A deconvnet layer (left) attached to a convnet layer (right). The deconvnet will reconstruct an approximate version of the convnet features from the layer beneath. Bottom: An illustration of the unpooling operation in the deconvnet, using switches which record the location of the local max in each pooling region (colored zones) during pooling in the convnet.

3. Training Details

We now describe the large convnet model that will be visualized in Section 4. The architecture, shown in Fig. 3, is similar to that used by (Krizhevsky et al., 2012) for ImageNet classification. One difference is that the sparse connections used in Krizhevsky’s layers 3, 4, 5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model. Other important differences relating to layers 1 and 2 were made following inspection of the visualizations in Fig. 6, as described in Section 4.1.

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of $10^{-2}$, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to $10^{-2}$ and biases are set to 0.
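
A minimal sketch of this preprocessing, assuming a PIL input image and a precomputed per-pixel mean image over the training set: resize the smallest side to 256, take the central 256x256 crop, subtract the mean, then form the 10 sub-crops (4 corners + center, each with and without a horizontal flip).

```python
import torch
import torchvision.transforms.functional as TF

def ten_crop_preprocess(pil_image, per_pixel_mean):
    """per_pixel_mean: (3, 256, 256) tensor, the mean image over all training images (assumed precomputed)."""
    img = TF.resize(pil_image, 256)                          # smallest dimension -> 256
    img = TF.center_crop(img, 256)                           # central 256x256 region
    img = TF.to_tensor(img) * 255.0 - per_pixel_mean         # subtract the per-pixel mean
    crops, size = [], 224
    offsets = [(0, 0), (0, 256 - size), (256 - size, 0), (256 - size, 256 - size), (16, 16)]
    for top, left in offsets:                                # 4 corners + center
        c = img[:, top:top + size, left:left + size]
        crops.extend([c, torch.flip(c, dims=[2])])           # without / with horizontal flip
    return torch.stack(crops)                                # (10, 3, 224, 224)
```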

Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of $10^{-1}$ to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128, 128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).
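
A minimal sketch of this filter renormalization for PyTorch conv layers: after a parameter update, any filter whose RMS value exceeds the fixed radius of 0.1 is rescaled back to that radius.

```python
import torch
import torch.nn as nn

@torch.no_grad()
def renormalize_filters(model, radius=1e-1):
    """Rescale every conv filter whose RMS value exceeds `radius` back to `radius`."""
    for module in model.modules():
        if isinstance(module, nn.Conv2d):
            w = module.weight                                          # (out_ch, in_ch, kH, kW)
            rms = w.pow(2).mean(dim=(1, 2, 3), keepdim=True).sqrt()    # per-filter RMS
            scale = torch.where(rms > radius, radius / rms, torch.ones_like(rms))
            w.mul_(scale)                                              # in-place rescale, per filter
```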

4. Convnet Visualization

Using the model described in Section 3, we now use the deconvnet to visualize the feature activations on the ImageNet validation set.

Feature Visualization: Fig. 2 shows feature visualizations from our model once training is complete. However, instead of showing the single strongest activation for a given feature map, we show the top 9 activations. Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than visualizations as the latter solely focus on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.



Figure 2. Visualization of features in a fully trained model. For layers 2-5 we show the top 9 activations in a random subset of feature maps across the validation data, projected down to pixel space using our deconvolutional network approach. Our reconstructions are not samples from the model: they are reconstructed patterns from the validation set that cause high activations in a given feature map. For each feature map we also show the corresponding image patches. Note: (i) the strong grouping within each feature map, (ii) greater invariance at higher layers and (iii) exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1, cols 1). Best viewed in electronic form.


The higher the layer, the stronger the invariance, i.e. the more discriminative the extracted features.
Figure 2 shows the features extracted by each hidden layer of the fully trained model. A particular set of input structures (obtained by reconstruction) excites a fixed output feature of the convnet. The right side of Figure 2 shows the corresponding input patches; compared with the reconstructed features they vary far more, since the reconstructions contain only the discriminative texture structures.



Figure 3. Architecture of our 8 layer convnet model. A 224 by 224 crop of an image (with 3 color planes) is presented as the input. This is convolved with 96 different 1st layer filters (red), each of size 7 by 7, using a stride of 2 in both x and y. The resulting feature maps are then: (i) passed through a rectified linear function (not shown), (ii) pooled (max within 3x3 regions, using stride 2) and (iii) contrast normalized across feature maps to give 96 different 55 by 55 element feature maps. Similar operations are repeated in layers 2, 3, 4, 5. The last two layers are fully connected, taking features from the top convolutional layer as input in vector form (6 * 6 * 256 = 9216 dimensions). The final layer is a C-way softmax function, C being the number of classes. All filters and feature maps are square in shape.
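
A sketch of the Fig. 3 architecture in PyTorch. Layer 1 (96 filters of 7x7, stride 2, pooled 3x3/2 and contrast-normalized to 55x55 maps) and the final 6x6x256 = 9216-dimensional convolutional output follow the caption; the filter counts and kernel sizes of layers 2-5 and the padding choices are taken from the published ZFNet model and should be read as assumptions here.

```python
import torch.nn as nn

class ZFNetSketch(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1),     # layer 1: 96 filters, 7x7, stride 2
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),          # 3x3 max pool, stride 2 -> 55x55
            nn.LocalResponseNorm(size=5),                              # contrast normalization across maps
            nn.Conv2d(96, 256, kernel_size=5, stride=2),               # layer 2 (sizes assumed)
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
            nn.LocalResponseNorm(size=5),
            nn.Conv2d(256, 384, kernel_size=3, padding=1),             # layer 3
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 384, kernel_size=3, padding=1),             # layer 4
            nn.ReLU(inplace=True),
            nn.Conv2d(384, 256, kernel_size=3, padding=1),             # layer 5
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=3, stride=2),                     # -> 256 maps of 6x6
        )
        self.classifier = nn.Sequential(
            nn.Dropout(0.5),
            nn.Linear(256 * 6 * 6, 4096),     # layer 6: fully connected on the 9216-dim vector
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(4096, 4096),            # layer 7
            nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),     # layer 8: C-way output (softmax applied in the loss)
        )

    def forward(self, x):                     # x: (B, 3, 224, 224)
        x = self.features(x)
        x = x.flatten(1)                      # -> (B, 9216)
        return self.classifier(x)
```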

The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2, C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1, C1); bird’s legs (R4, C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1, C1) and dogs (R4).

Feature Evolution during Training: Fig. 4 visualizes the progression during training of the strongest activation (across all training examples) within a given feature map projected back to pixel space. Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

After a certain number of iterations the lower-layer features become stable, but the higher-layer features need many more iterations before they converge.



Figure 4. Evolution of a randomly chosen subset of model features through training. Each layer’s features are displayed in a different block. Within each block, we show a randomly chosen subset of features at epochs [1,2,5,10,20,30,40,64]. The visualization shows the strongest activation (across all training examples) for a given feature map, projected down to pixel space using our deconvnet approach. Color contrast is artificially enhanced and the figure is best viewed in electronic form.


Each full block above is the feature map of one layer of the network, and each row contains 8 small images showing the features at epochs 1, 2, 5, 10, 20, 30, 40 and 64. The low-layer features change little during training and converge easily, whereas the high-layer features change a great deal. The evolution of the high layer conv5 shows little change in the early iterations but large changes around iterations 40-50; when training a network we should therefore make sure it has fully converged before judging the results.


Feature Invariance: Fig. 5 shows 5 sample images being translated, rotated and scaled by varying degrees while looking at the changes in the feature vectors from the top and bottom layers of the model, relative to the untransformed feature. Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, being quasi-linear for translation & scaling. The network output is stable to translations and scalings. In general, the output is not invariant to rotation, except for objects with rotational symmetry (e.g. entertainment center).

Translation, rotation and scaling of the image affect the shallow-layer outputs strongly but the deep-layer outputs much less; the network is not invariant to rotation unless the object is highly symmetric.
At shallow layers, even small changes to the training image change the output features; the higher the layer, the smaller the effect of such changes. Slight translations and scalings do not change the accuracy, but the convnet is not robust to rotation (for objects with good symmetry the accuracy oscillates with a fixed period as the object rotates).
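
A minimal sketch of the Fig. 5-style measurement, assuming `feature_fn` is a truncated copy of the network that returns layer-1 or layer-7 activations: it computes the Euclidean distance between the feature vectors of the original and the transformed image.

```python
import torch
import torchvision.transforms.functional as TF

def feature_shift(feature_fn, image, transformed):
    """Euclidean distance between feature vectors of an original and a transformed image."""
    with torch.no_grad():
        f0 = feature_fn(image.unsqueeze(0)).flatten()
        f1 = feature_fn(transformed.unsqueeze(0)).flatten()
    return torch.dist(f0, f1).item()

# Example usage (layer1_features / layer7_features are assumed feature extractors):
# shifted = TF.affine(image, angle=0.0, translate=[0, dy], scale=1.0, shear=[0.0])
# d1 = feature_shift(layer1_features, image, shifted)   # changes sharply even for small dy
# d7 = feature_shift(layer7_features, image, shifted)   # much smaller, roughly linear in dy
```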



Figure 5. Analysis of vertical translation, scale, and rotation invariance within the model (rows a-c respectively). Col 1: 5 example images undergoing the transformations. Col 2 & 3: Euclidean distance between feature vectors from the original and transformed images in layers 1 and 7 respectively. Col 4: the probability of the true label for each image, as the image is transformed.


4.1. Architecture Selection

While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al. ’s architecture (Fig. 6(b) & (d)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6 (c) & (e). More importantly, it also improves the classification performance as shown in Section 5.1.


Figure 6. (a): 1st layer features without feature scale clipping. Note that one feature dominates. (b): 1st layer features from (Krizhevsky et al., 2012). (c): Our 1st layer features. The smaller stride (2 vs 4) and filter size (7x7 vs 11x11) results in more distinctive features and fewer “dead” features. (d): Visualizations of 2nd layer features from (Krizhevsky et al., 2012). (e): Visualizations of our 2nd layer features. These are cleaner, with no aliasing artifacts that are visible in (d).

4.2. Occlusion Sensitivity

With image classification approaches, a natural question is if the model is truly identifying the location of the object in the image, or just using the surrounding context. Fig. 7 attempts to answer this question by systematically occluding different portions of the input image with a grey square, and monitoring the output of the classifier. The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded. Fig. 7 also shows visualizations from the strongest feature map of the top convolution layer, in addition to activity in this map (summed over spatial locations) as a function of occluder position. When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.

To find out which parts of the image actually drive the classification, different regions are occluded and the final classification result is inspected. The gray square is the occluder; when it covers a critical region, classification performance drops sharply. This method can therefore be used to locate the regions that are critical for classification.
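
A minimal sketch of the occlusion experiment, assuming a trained `model` that returns class logits: a gray square is slid over the image and the probability of the true class is recorded at each occluder position.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def occlusion_map(model, image, true_class, patch=50, stride=8, gray=0.5):
    """image: (3, H, W). Returns a grid of P(true class) as the gray square moves over the image."""
    _, H, W = image.shape
    rows = (H - patch) // stride + 1
    cols = (W - patch) // stride + 1
    heatmap = torch.zeros(rows, cols)
    for r in range(rows):
        for c in range(cols):
            occluded = image.clone()
            y, x = r * stride, c * stride
            occluded[:, y:y + patch, x:x + patch] = gray           # place the gray occluder
            probs = F.softmax(model(occluded.unsqueeze(0)), dim=1)
            heatmap[r, c] = probs[0, true_class]                   # prob of the correct class
    return heatmap   # low values mark the regions the classifier really depends on
```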



Figure 7. Three test examples where we systematically cover up different portions of the scene with a gray square (1st column) and see how the top (layer 5) feature maps ((b) & (c)) and classifier output ((d) & (e)) changes. (b): for each position of the gray scale, we record the total activation in one layer 5 feature map (the one with the strongest response in the unoccluded image). (c): a visualization of this feature map projected down into the input image (black square), along with visualizations of this map from other images. The first row example shows the strongest feature to be the dog’s face. When this is covered-up the activity in the feature map decreases (blue area in (b)). (d): a map of correct class probability, as a function of the position of the gray square. E.g. when the dog’s face is obscured, the probability for “Pomeranian” drops significantly. (e): the most probable label as a function of occluder position. E.g. in the 1st row, for most locations it is “Pomeranian”, but if the dog’s face is obscured but not the ball, then it predicts “tennis ball”. In the 2nd example, text on the car is the strongest feature in layer 5, but the classifier is most sensitive to the wheel. The 3rd example contains multiple objects. The strongest feature in layer 5 picks out the faces, but the classifier is sensitive to the dog (blue region in (d)), since it uses multiple feature maps.


4.3. Correspondence Analysis


Deep models differ from many existing recognition approaches in that there is no explicit mechanism for establishing correspondence between specific object parts in different images (e.g. faces have a particular spatial configuration of the eyes and nose). However, an intriguing possibility is that deep models might be implicitly computing them. To explore this, we take 5 randomly drawn dog images with frontal pose and systematically mask out the same part of the face in each image (e.g. all left eyes, see Fig. 8). For each image $i$, we then compute: $\epsilon_i^l = x_i^l - \tilde{x}_i^l$, where $x_i^l$ and $\tilde{x}_i^l$ are the feature vectors at layer $l$ for the original and occluded images respectively. We then measure the consistency of this difference vector $\epsilon$ between all related image pairs $(i, j)$: $\Delta_l = \sum_{i,j=1, i \neq j}^{5} H(\mathrm{sign}(\epsilon_i^l), \mathrm{sign}(\epsilon_j^l))$, where $H$ is Hamming distance. A lower value indicates greater consistency in the change resulting from the masking operation, hence tighter correspondence between the same object parts in different images (i.e. blocking the left eye changes the feature representation in a consistent way). In Table 1 we compare the $\Delta$ score for three parts of the face (left eye, right eye and nose) to random parts of the object, using features from layer $l = 5$ and $l = 7$. The lower score for these parts, relative to random object regions, for the layer 5 features show the model does establish some degree of correspondence.


Likewise, occluding the eyes or the nose produces feature changes that are more consistent across images than random occlusions, showing that corresponding parts share an implicit consistency.
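
A minimal sketch of the correspondence measure, assuming hypothetical helpers `features(img, layer)` (returns the layer-l feature vector) and `mask_part(img, part)` (applies the occlusion): Δ_l sums the Hamming distances between sign(ε_i^l) and sign(ε_j^l) over all image pairs.

```python
import torch

def delta_l(images, features, mask_part, part, layer):
    """Consistency of the feature change caused by masking the same part in every image."""
    eps_signs = []
    for img in images:                                    # e.g. 5 frontal dog images
        x = features(img, layer)                          # x_i^l: features of the original image
        x_tilde = features(mask_part(img, part), layer)   # x~_i^l: features of the occluded image
        eps_signs.append(torch.sign(x - x_tilde))         # sign(eps_i^l)
    total = 0.0
    for i in range(len(eps_signs)):
        for j in range(len(eps_signs)):
            if i != j:
                total += (eps_signs[i] != eps_signs[j]).float().sum().item()  # Hamming distance
    return total   # lower => more consistent change => tighter part correspondence
```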



Figure 8. Images used for correspondence experiments. Col 1: Original image. Col 2, 3, 4: Occlusion of the right eye, left eye, and nose respectively. Other columns show examples of random occlusions.



Table 1. Measure of correspondence for different object parts in 5 different dog images. The lower scores for the eyes and nose (compared to random object parts) show the model implicitly establishing some form of correspondence of parts at layer 5 in the model. At layer 7, the scores are more similar, perhaps due to upper layers trying to discriminate between the different breeds of dog.

5. Experiments

5.1. ImageNet 2012

This dataset consists of 1.3M/50k/100k training/validation/test examples, spread over 1000 categories. Table 2 shows our results on this dataset.


Table 2. ImageNet 2012 classification error rates. The ∗ indicates models that were trained on both ImageNet 2011 and 2012 training sets.

Using the exact architecture specified in (Krizhevsky et al., 2012), we attempt to replicate their result on the validation set. We achieve an error rate within 0.1% of their reported value on the ImageNet 2012 validation set.

Next we analyze the performance of our model with the architectural changes outlined in Section 4.1 (7×7 filters in layer 1 and stride 2 convolutions in layers 1 & 2). This model, shown in Fig. 3, significantly outperforms the architecture of (Krizhevsky et al., 2012), beating their single model result by 1.7% (test top-5). When we combine multiple models, we obtain a test error of 14.8%, the best published performance on this dataset¹ (despite only using the 2012 training set). We note that this error is almost half that of the top non-convnet entry in the ImageNet 2012 classification challenge, which obtained 26.2% error (Gunji et al., 2012).

¹ This performance has been surpassed in the recent Imagenet 2013 competition (http://www.image-net.org/).


Varying ImageNet Model Sizes: In Table 3, we first explore the architecture of (Krizhevsky et al., 2012) by adjusting the size of layers, or removing them entirely. In each case, the model is trained from scratch with the revised architecture. Removing the fully connected layers (6, 7) only gives a slight increase in error. This is surprising, given that they contain the majority of model parameters. Removing two of the middle convolutional layers also makes a relatively small difference to the error rate. However, removing both the middle convolution layers and the fully connected layers yields a model with only 4 layers whose performance is dramatically worse. This would suggest that the overall depth of the model is important for obtaining good performance. In Table 3, we modify our model, shown in Fig. 3. Changing the size of the fully connected layers makes little difference to performance (same for model of (Krizhevsky et al., 2012)). However, increasing the size of the middle convolution layers does give a useful gain in performance. But increasing these, while also enlarging the fully connected layers, results in over-fitting.

The overall depth of the network is important for model accuracy.
The fully connected layers can be trimmed somewhat without much harm.


5.2. Feature Generalization

The experiments above show the importance of the convolutional part of our ImageNet model in obtaining state-of-the-art performance. This is supported by the visualizations of Fig. 2 which show the complex invariances learned in the convolutional layers. We now explore the ability of these feature extraction layers to generalize to other datasets, namely Caltech-101 (Feifei et al., 2006), Caltech-256 (Griffin et al., 2006) and PASCAL VOC 2012. To do this, we keep layers 1-7 of our ImageNet-trained model fixed and train a new softmax classifier on top (for the appropriate number of classes) using the training images of the new dataset. Since the softmax contains relatively few parameters, it can be trained quickly from a relatively small number of examples, as is the case for certain datasets.
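
A minimal sketch of this transfer setup, assuming the `ZFNetSketch` model above has been pretrained on ImageNet: layers 1-7 are kept fixed and only a fresh softmax classifier for the new dataset's classes is trained.

```python
import torch
import torch.nn as nn

def retrain_softmax_on_top(pretrained, num_new_classes, lr=1e-2, momentum=0.9):
    """Freeze layers 1-7 and replace/train only the final classifier layer."""
    for p in pretrained.parameters():
        p.requires_grad = False                               # freeze all pretrained layers
    in_features = pretrained.classifier[-1].in_features       # 4096, the layer-7 output size
    pretrained.classifier[-1] = nn.Linear(in_features, num_new_classes)  # new C-way softmax layer
    optimizer = torch.optim.SGD(pretrained.classifier[-1].parameters(), lr=lr, momentum=momentum)
    return pretrained, optimizer                              # train with the usual cross-entropy loop
```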

The classifiers used by our model (a softmax) and other approaches (typically a linear SVM) are of similar complexity, thus the experiments compare our feature representation, learned from ImageNet, with the hand-crafted features used by other methods. It is important to note that both our feature representation and the hand-crafted features are designed using images beyond the Caltech and PASCAL training sets. For example, the hyper-parameters in HOG descriptors were determined through systematic experiments on a pedestrian dataset (Dalal & Triggs, 2005). We also try a second strategy of training a model from scratch, i.e. resetting layers 1-7 to random values and train them, as well as the softmax, on the training images of the dataset.

One complication is that some of the Caltech datasets have some images that are also in the ImageNet training data. Using normalized correlation, we identified these few “overlap” images² and removed them from our Imagenet training set and then retrained our Imagenet models, so avoiding the possibility of train/test contamination.

² For Caltech-101, we found 44 images in common (out of 9,144 total images), with a maximum overlap of 10 for any given class. For Caltech-256, we found 243 images in common (out of 30,607 total images), with a maximum overlap of 18 for any given class.



Table 3. ImageNet 2012 classification error rates with various architectural changes to the model of (Krizhevsky et al., 2012) and our model (see Fig. 3).

Caltech-101: We follow the procedure of (Fei-fei et al., 2006) and randomly select 15 or 30 images per class for training and test on up to 50 images per class reporting the average of the per-class accuracies in Table 4, using 5 train/test folds. Training took 17 minutes for 30 images/class. The pre-trained model beats the best reported result for 30 images/class from (Bo et al., 2013) by 2.2%. The convnet model trained from scratch however does terribly, only achieving 46.5%.


Table 4. Caltech-101 classification accuracy for our convnet models, against two leading alternate approaches.

Caltech-256: We follow the procedure of (Griffin et al., 2006), selecting 15, 30, 45, or 60 training images per class, reporting the average of the per-class accuracies in Table 5. Our ImageNet-pretrained model beats the current state-of-the-art results obtained by Bo et al. (Bo et al., 2013) by a significant margin: 74.2% vs 55.2% for 60 training images/class. However, as with Caltech-101, the model trained from scratch does poorly. In Fig. 9, we explore the “one-shot learning” (Fei-fei et al., 2006) regime. With our pre-trained model, just 6 Caltech-256 training images are needed to beat the leading method using 10 times as many images. This shows the power of the ImageNet feature extractor.

A convnet model without pre-training performs very poorly.



Figure 9. Caltech-256 classification performance as the number of training images per class is varied. Using only 6 training examples per class with our pre-trained feature extractor, we surpass best reported result by (Bo et al., 2013).

Table 5. Caltech 256 classification accuracies.

PASCAL 2012: We used the standard training and validation images to train a 20-way softmax on top of the ImageNet-pretrained convnet. This is not ideal, as PASCAL images can contain multiple objects and our model just provides a single exclusive prediction for each image. Table 6 shows the results on the test set. The PASCAL and ImageNet images are quite different in nature, the former being full scenes unlike the latter. This may explain our mean performance being 3.2% lower than the leading (Yan et al., 2012) result, however we do beat them on 5 classes, sometimes by large margins.


Table 6. PASCAL 2012 classification results, comparing our Imagenet-pretrained convnet against the leading two methods ([A]= (Sande et al., 2012) and [B] = (Yan et al., 2012))

5.3. Feature Analysis

We explore how discriminative the features in each layer of our Imagenet-pretrained model are. We do this by varying the number of layers retained from the ImageNet model and placing either a linear SVM or softmax classifier on top. Table 7 shows results on Caltech-101 and Caltech-256. For both datasets, a steady improvement can be seen as we ascend the model, with best results being obtained by using all layers. This supports the premise that as the feature hierarchies become deeper, they learn increasingly powerful features.

Deeper feature hierarchies do indeed achieve better results than shallow ones.
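
A minimal sketch of the Table 7 protocol, assuming a hypothetical `extract(img, n_layers)` that returns the activations after the first n layers of the ImageNet-pretrained model: a linear SVM (scikit-learn's `LinearSVC`) is trained on the fixed features from each depth.

```python
import numpy as np
from sklearn.svm import LinearSVC

def accuracy_per_depth(train_imgs, train_labels, test_imgs, test_labels, extract, depths):
    """Train a linear SVM on features retained up to each depth and report test accuracy."""
    results = {}
    for n_layers in depths:                                           # e.g. [1, 2, 3, 4, 5, 6, 7]
        X_train = np.stack([extract(img, n_layers).ravel() for img in train_imgs])
        X_test = np.stack([extract(img, n_layers).ravel() for img in test_imgs])
        clf = LinearSVC().fit(X_train, train_labels)                  # linear SVM on fixed features
        results[n_layers] = clf.score(X_test, test_labels)
    return results   # accuracy generally rises as more layers are retained
```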



Table 7. Analysis of the discriminative information contained in each layer of feature maps within our ImageNet-pretrained convnet. We train either a linear SVM or softmax on features from different layers (as indicated in brackets) from the convnet. Higher layers generally produce more discriminative features.


The more layers are retained, the better the classification results.


6. Discussion

We explored large convolutional neural network models, trained for image classification, in a number of ways. First, we presented a novel way to visualize the activity within the model. This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers. We also showed how these visualizations can be used to debug problems with the model to obtain better results, for example improving on Krizhevsky et al.'s (Krizhevsky et al., 2012) impressive ImageNet 2012 result. We then demonstrated through a series of occlusion experiments that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context. An ablation study on the model revealed that having a minimum depth to the network, rather than any individual section, is vital to the model's performance.

Finally, we showed how the ImageNet trained model can generalize well to other datasets. For Caltech-101 and Caltech-256, the datasets are similar enough that we can beat the best reported results, in the latter case by a significant margin. This result brings into question the utility of benchmarks with small (i.e. < 10^4) training sets. Our convnet model generalized less well to the PASCAL data, perhaps suffering from dataset bias (Torralba & Efros, 2011), although it was still within 3.2% of the best reported result, despite no tuning for the task. For example, our performance might improve if a different loss function was used that permitted multiple objects per image. This would naturally enable the networks to tackle object detection as well.

Acknowledgments

The authors are very grateful for support by NSF grant IIS-1116923, Microsoft Research and a Sloan Fellowship.


References

Visualizing Features from a Convolutional Neural Network
http://kvfrans.com/visualizing-features-from-a-convolutional-neural-network/

feature-visualization
https://github.com/kvfrans/feature-visualization

Matthew Zeiler
https://www.matthewzeiler.com/

CNN_visualization
https://github.com/guruucsd/CNN_visualization

Adaptive deconvolutional networks for mid and high level feature learning
Stacked What-Where Auto-encoders
ImageNet Classification with Deep Convolutional Neural Networks

WORDBOOK

Convolutional Networks,convnets
Deconvolutional Network,deconvnet

KEY POINTS

We introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier.

Several factors are responsible for this renewed interest in convnet models: (i) the availability of much larger training sets, with millions of labeled examples; (ii) powerful GPU implementations, making the training of very large models practical and (iii) better model regularization strategies, such as Dropout (Hinton et al., 2012).

Despite this encouraging progress, there is still little insight into the internal operation and behavior of these complex models, or how they achieve such good performance.

Without clear understanding of how and why they work, the development of better models is reduced to trial-and-error. In this paper we introduce a visualization technique that reveals the input stimuli that excite individual feature maps at any layer in the model. It also allows us to observe the evolution of features during training and to diagnose potential problems with the model. The visualization technique we propose uses a multi-layered Deconvolutional Network (deconvnet), as proposed by (Zeiler et al., 2011), to project the feature activations back to the input pixel space. We also perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.

We then explore the generalization ability of the model to other datasets, just retraining the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by (Hinton et al., 2006) and others (Bengio et al., 2007; Vincent et al., 2008).

The problem is that for higher layers, the invariances are extremely complex so are poorly captured by a simple quadratic approximation. Our approach, by contrast, provides a non-parametric view of invariance, showing which patterns from the training set activate the feature map.

Our visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.

We use standard fully supervised convnet models throughout the paper, as defined by (LeCun et al., 1989) and (Krizhevsky et al., 2012). These models map a color 2D input image $x_i$, via a series of layers, to a probability vector $\hat{y}_i$ over the $C$ different classes. Each layer consists of (i) convolution of the previous layer output (or, in the case of the 1st layer, the input image) with a set of learned filters; (ii) passing the responses through a rectified linear function ($\mathrm{relu}(x) = \max(x, 0)$); (iii) [optionally] max pooling over local neighborhoods and (iv) [optionally] a local contrast operation that normalizes the responses across feature maps. For more details of these operations, see (Krizhevsky et al., 2012) and (Jarrett et al., 2009). The top few layers of the network are conventional fully-connected networks and the final layer is a softmax classifier. Fig. 3 shows the model used in many of our experiments.

We train these models using a large set of $N$ labeled images $\{x, y\}$, where label $y_i$ is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare $\hat{y}_i$ and $y_i$. The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent.

We perform this mapping with a Deconvolutional Network (deconvnet) (Zeiler et al., 2011). A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse, so instead of mapping pixels to features it does the opposite. In (Zeiler et al., 2011), deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet. This layer-by-layer inversion is repeated until input pixel space is reached.

To examine a given convnet activation, we set all other activations in the layer to zero and pass the feature maps as input to the attached deconvnet layer. Then we successively (i) unpool, (ii) rectify and (iii) filter to reconstruct the activity in the layer beneath that gave rise to the chosen activation.

Unpooling: In the convnet, the max pooling operation is non-invertible, however we can obtain an approximate inverse by recording the locations of the maxima within each pooling region in a set of switch variables. In the deconvnet, the unpooling operation uses these switches to place the reconstructions from the layer above into appropriate locations, preserving the structure of the stimulus.

Rectification: The convnet uses relu non-linearities, which rectify the feature maps thus ensuring the feature maps are always positive. To obtain valid feature reconstructions at each layer (which also should be positive), we pass the reconstructed signal through a relu non-linearity.

Filtering: The convnet uses learned filters to convolve the feature maps from the previous layer. To invert this, the deconvnet uses transposed versions of the same filters, but applied to the rectified maps, not the output of the layer beneath. In practice this means flipping each filter vertically and horizontally.
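Combining the three operations above, a single deconvnet step could be sketched as follows; `switches` are assumed to be the indices returned by `max_pool2d(..., return_indices=True)` during the forward pass, and `filters` the forward layer's learned weights (`conv_transpose2d` applies their transposed, i.e. flipped, version).

```python
import torch.nn.functional as F

def deconv_step(activity, switches, filters, pool_size=3, pool_stride=2):
    """Approximately invert one convnet layer down to the layer beneath."""
    # (i) unpool: place reconstructions at the recorded max locations
    x = F.max_unpool2d(activity, switches, kernel_size=pool_size, stride=pool_stride)
    # (ii) rectify: keep the reconstruction positive
    x = F.relu(x)
    # (iii) filter: apply transposed (flipped) versions of the learned filters
    x = F.conv_transpose2d(x, filters, padding=filters.shape[-1] // 2)
    return x
```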

Projecting down from higher layers uses the switch settings generated by the max pooling in the convnet on the way up. As these switch settings are peculiar to a given input image, the reconstruction obtained from a single activation thus resembles a small piece of the original input image, with structures weighted according to their contribution to the feature activation. Since the model is trained discriminatively, they implicitly show which parts of the input image are discriminative. Note that these projections are not samples from the model, since there is no generative process involved.

One difference is that the sparse connections used in Krizhevsky’s layers 3, 4, 5 (due to the model being split across 2 GPUs) are replaced with dense connections in our model.

The model was trained on the ImageNet 2012 training set (1.3 million images, spread over 1000 different classes). Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256x256 region, subtracting the per-pixel mean (across all images) and then using 10 different sub-crops of size 224x224 (corners + center with(out) horizontal flips). Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of $10^{-2}$, in conjunction with a momentum term of 0.9. We anneal the learning rate throughout training manually when the validation error plateaus. Dropout (Hinton et al., 2012) is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to $10^{-2}$ and biases are set to 0.
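A rough preprocessing sketch of the resize / center-crop / mean-subtraction / 10-crop procedure described above; `mean_image` is assumed to be the per-pixel mean over the training set, and the exact crop offsets are illustrative.

```python
import numpy as np
from PIL import Image

def ten_crops(path, mean_image, resize=256, crop=224):
    """Resize the shortest side to 256, center-crop 256x256, subtract the
    per-pixel mean, then return 10 sub-crops of 224x224 (4 corners + center,
    each with its horizontal flip)."""
    img = Image.open(path).convert('RGB')
    w, h = img.size
    scale = resize / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)))
    w, h = img.size
    left, top = (w - resize) // 2, (h - resize) // 2
    arr = np.asarray(img.crop((left, top, left + resize, top + resize)), dtype=np.float32)
    arr -= mean_image                                    # per-pixel mean subtraction
    d = resize - crop
    offsets = [(0, 0), (0, d), (d, 0), (d, d), (d // 2, d // 2)]
    crops = [arr[y:y + crop, x:x + crop] for y, x in offsets]
    return crops + [c[:, ::-1] for c in crops]           # add horizontal flips
```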

Visualization of the first layer filters during training reveals that a few of them dominate, as shown in Fig. 6(a). To combat this, we renormalize each filter in the convolutional layers whose RMS value exceeds a fixed radius of $10^{-1}$ to this fixed radius. This is crucial, especially in the first layer of the model, where the input images are roughly in the [-128,128] range. As in (Krizhevsky et al., 2012), we produce multiple different crops and flips of each training example to boost training set size. We stopped training after 70 epochs, which took around 12 days on a single GTX580 GPU, using an implementation based on (Krizhevsky et al., 2012).
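The RMS renormalization can be sketched as a small in-place operation on a convolutional layer's weight tensor (filters x channels x height x width); this is an assumed implementation, applied after each parameter update.

```python
import torch

def renormalize_filters(weight, radius=1e-1):
    """Rescale any filter whose RMS value exceeds the fixed radius back to it."""
    with torch.no_grad():
        rms = weight.pow(2).mean(dim=(1, 2, 3), keepdim=True).sqrt()  # per-filter RMS
        scale = torch.where(rms > radius, radius / rms, torch.ones_like(rms))
        weight.mul_(scale)
```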

Projecting each separately down to pixel space reveals the different structures that excite a given feature map, hence showing its invariance to input deformations. Alongside these visualizations we show the corresponding image patches. These have greater variation than the visualizations, as the latter focus solely on the discriminant structure within each patch. For example, in layer 5, row 1, col 2, the patches appear to have little in common, but the visualizations reveal that this particular feature map focuses on the grass in the background, not the foreground objects.

Our reconstructions are not samples from the model: they are reconstructed patterns from the validation set that cause high activations in a given feature map. For each feature map we also show the corresponding image patches. Note: (i) the strong grouping within each feature map, (ii) greater invariance at higher layers and (iii) exaggeration of discriminative parts of the image, e.g. eyes and noses of dogs (layer 4, row 1, col 1).

The projections from each layer show the hierarchical nature of the features in the network. Layer 2 responds to corners and other edge/color conjunctions. Layer 3 has more complex invariances, capturing similar textures (e.g. mesh patterns (Row 1, Col 1); text (R2, C4)). Layer 4 shows significant variation, but is more class-specific: dog faces (R1, C1); bird’s legs (R4, C2). Layer 5 shows entire objects with significant pose variation, e.g. keyboards (R1, C1) and dogs (R4).

Sudden jumps in appearance result from a change in the image from which the strongest activation originates. The lower layers of the model can be seen to converge within a few epochs. However, the upper layers only develop after a considerable number of epochs (40-50), demonstrating the need to let the models train until fully converged.

Small transformations have a dramatic effect in the first layer of the model, but a lesser impact at the top feature layer, being quasi-linear for translation & scaling. The network output is stable to translations and scalings. In general, the output is not invariant to rotation, except for objects with rotational symmetry (e.g. entertainment center).

While visualization of a trained model gives insight into its operation, it can also assist with selecting good architectures in the first place. By visualizing the first and second layers of Krizhevsky et al.'s architecture (Fig. 6(b) & (d)), various problems are apparent. The first layer filters are a mix of extremely high and low frequency information, with little coverage of the mid frequencies. Additionally, the 2nd layer visualization shows aliasing artifacts caused by the large stride 4 used in the 1st layer convolutions. To remedy these problems, we (i) reduced the 1st layer filter size from 11x11 to 7x7 and (ii) made the stride of the convolution 2, rather than 4. This new architecture retains much more information in the 1st and 2nd layer features, as shown in Fig. 6(c) & (e).
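In PyTorch terms, the two changes to layer 1 amount to something like the following (the 96-filter count follows (Krizhevsky et al., 2012); padding is omitted for brevity):

```python
import torch.nn as nn

# Layer 1 as in (Krizhevsky et al., 2012): 11x11 filters applied with stride 4.
layer1_original = nn.Conv2d(3, 96, kernel_size=11, stride=4)

# Layer 1 after the changes above: 7x7 filters with stride 2, retaining more
# mid-frequency information and reducing aliasing in the 2nd layer features.
layer1_modified = nn.Conv2d(3, 96, kernel_size=7, stride=2)
```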

The examples clearly show the model is localizing the objects within the scene, as the probability of the correct class drops significantly when the object is occluded.

When the occluder covers the image region that appears in the visualization, we see a strong drop in activity in the feature map. This shows that the visualization genuinely corresponds to the image structure that stimulates that feature map, hence validating the other visualizations shown in Fig. 4 and Fig. 2.
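A sketch of the occlusion sensitivity experiment, assuming a trained `model` that maps a single image tensor to class logits; the patch size, stride and fill value are placeholders for the grey occluder used in the paper.

```python
import torch

def occlusion_map(model, image, true_class, patch=50, stride=8, fill=0.0):
    """Slide an occluder over the image and record the probability of the true
    class at each position; low values mark regions the classifier relies on."""
    _, h, w = image.shape
    heat = torch.zeros((h - patch) // stride + 1, (w - patch) // stride + 1)
    for i, y in enumerate(range(0, h - patch + 1, stride)):
        for j, x in enumerate(range(0, w - patch + 1, stride)):
            occluded = image.clone()
            occluded[:, y:y + patch, x:x + patch] = fill   # grey square occluder
            probs = torch.softmax(model(occluded.unsqueeze(0)), dim=1)
            heat[i, j] = probs[0, true_class]
    return heat
```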

This dataset consists of 1.3M/50k/100k training/validation/test examples, spread over 1000 categories.

We achieve an error rate within 0.1% of their reported value on the ImageNet 2012 validation set.

We note that this error is almost half that of the top non-convnet entry in the ImageNet 2012 classification challenge, which obtained 26.2% error (Gunji et al., 2012).

However, removing both the middle convolution layers and the fully connected layers yields a model with only 4 layers whose performance is dramatically worse. This would suggest that the overall depth of the model is important for obtaining good performance. In Table 3, we modify our model, shown in Fig. 3. Changing the size of the fully connected layers makes little difference to performance (same for the model of (Krizhevsky et al., 2012)). However, increasing the size of the middle convolution layers does give a useful gain in performance. But increasing these, while also enlarging the fully connected layers, results in over-fitting.

To do this, we keep layers 1-7 of our ImageNet-trained model fixed and train a new softmax classifier on top (for the appropriate number of classes) using the training images of the new dataset. Since the softmax contains relatively few parameters, it can be trained quickly from a relatively small number of examples, as is the case for certain datasets.
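A transfer-learning sketch under these assumptions: `feature_extractor` stands in for the fixed layers 1-7 of the ImageNet-trained model (outputting 4096-dimensional layer-7 features), and only the new softmax classifier is fitted.

```python
import torch
import torch.nn as nn

def retrain_softmax(feature_extractor, train_loader, num_classes, feat_dim=4096,
                    epochs=10, lr=1e-2):
    """Keep the pretrained layers fixed; train only a new softmax classifier."""
    for p in feature_extractor.parameters():
        p.requires_grad = False                     # layers 1-7 stay fixed
    classifier = nn.Linear(feat_dim, num_classes)   # softmax = linear + cross-entropy
    optimizer = torch.optim.SGD(classifier.parameters(), lr=lr, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in train_loader:
            with torch.no_grad():
                feats = feature_extractor(x)        # fixed layer-7 features
            loss = criterion(classifier(feats), y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return classifier
```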

The classifiers used by our model (a softmax) and other approaches (typically a linear SVM) are of similar complexity, thus the experiments compare our feature representation, learned from ImageNet, with the hand-crafted features used by other methods. It is important to note that both our feature representation and the hand-crafted features are designed using images beyond the Caltech and PASCAL training sets.

One complication is that some of the Caltech datasets have some images that are also in the ImageNet training data. Using normalized correlation, we identified these few “overlap” images and removed them from our Imagenet training set and then retrained our Imagenet models, so avoiding the possibility of train/test contamination.

The PASCAL and ImageNet images are quite different in nature, the former being full scenes unlike the latter.

This supports the premise that as the feature hierarchies become deeper, they learn increasingly powerful features.

We train either a linear SVM or softmax on features from different layers of the convnet (as indicated in brackets). Higher layers generally produce more discriminative features.
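For the linear-SVM variant, a minimal scikit-learn sketch, assuming feature matrices extracted from a chosen convnet layer have already been computed:

```python
from sklearn.svm import LinearSVC

def svm_on_features(train_feats, train_labels, test_feats, test_labels):
    """Fit a linear SVM on fixed convnet-layer features; report test accuracy."""
    clf = LinearSVC(C=1.0)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)
```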

This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers.

For example, our performance might improve if a different loss function was used that permitted multiple objects per image. This would naturally enable the networks to tackle object detection as well.
