Abstract

We present YOLO, a unified pipeline for object detection.

Prior work on object detection repurposes classifiers to perform detection.

Instead, we frame object detection as a regression problem to spatially separated bounding boxes and associated class probabilities.

A single neural network predicts bounding boxes and class probabilities directly from full images in one evaluation.

Since the whole detection pipeline is a single network, it can be optimized end-to-end directly on detection performance.

Our unified architecture is also extremely fast; YOLO processes images in real-time at 45 frames per second, hundreds to thousands of times faster than existing detection systems.

Our system uses global image context to detect and localize objects, making it less prone to background errors than top detection systems like R-CNN.

By itself, YOLO detects objects at unprecedented speeds with moderate accuracy.

When combined with state-of-the-art detectors, YOLO boosts performance by 2-3 percentage points of mAP.

1 Introduction

Humans glance at an image and instantly know what objects are in the image, where they are, and how they interact.

The human visual system is fast and accurate, allowing us to perform complex tasks like driving or grocery shopping with little conscious thought.

Fast, accurate algorithms for object detection would allow computers to drive cars in any weather without specialized sensors, enable assistive devices to convey real-time scene information to human users, and unlock the potential for general purpose, responsive robotic systems.

Convolutional neural networks (CNNs) achieve impressive performance on classification tasks at real-time speeds [13].

Yet top object detection systems like R-CNN take seconds to process individual images and hallucinate objects in background noise.

We believe these shortcomings result from how these systems approach object detection.

Current detection systems repurpose classifiers to perform detection.

To detect an object these systems take a classifier for that object and evaluate it at various locations and scales in a test image.

Systems like deformable parts models (DPM) use a sliding window approach where the classifier is run at evenly spaced locations over the entire image [7].

More recent approaches like R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes.

After classification, post-processing is used to refine the bounding box, eliminate duplicate detections, and rescore the box based on other objects in the scene [9].

These region proposal techniques typically generate a few thousand potential boxes per image.

Selective Search, the most common region proposal method, takes 1-2 seconds per image to generate these boxes [26].

The classifier then takes additional time to evaluate the proposals.

The best performing systems require 2-40 seconds per image and even those optimized for speed do not achieve real-time performance.

Additionally, even a highly accurate classifier will produce false positives when faced with so many proposals.

When viewed out of context, small sections of background can resemble actual objects, causing detection errors.

Finally, these detection pipelines rely on independent techniques at every stage that cannot be optimized jointly.

A typical pipeline uses Selective Search for region proposals, a convolutional network for feature extraction, a collection of one-versus-all SVMs for classification, non-maximal suppression to reduce duplicates, and a linear model to adjust the final bounding box coordinates.

Selective Search tries to maximize recall while the SVMs optimize for single class accuracy and the linear model learns from localization error.

Our system is refreshingly simple, see Figure 1.

Figure 1: The YOLO Detection System.
Processing images with YOLO is simple and straightforward.
Our system (1) resizes the input image to 448 × 448, (2) runs a single convolutional network on the image, and (3) thresholds the resulting detections by the model’s confidence.

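As an illustrative sketch of those three steps (not the released implementation), the pipeline can be written as a single function; the `network` callable, the 7 × 7 × 24 output layout, and the 0.2 confidence threshold are assumptions based on the descriptions in Section 2.

```python
import numpy as np

def detect(image, network, conf_threshold=0.2):
    """Sketch of the YOLO pipeline: resize, single forward pass, threshold."""
    # (1) Resize the input image to 448 x 448 (placeholder nearest-neighbor resize).
    h, w = image.shape[:2]
    ys = np.arange(448) * h // 448
    xs = np.arange(448) * w // 448
    resized = image[ys][:, xs]

    # (2) Run a single convolutional network on the image.
    # `network` is assumed to map a 448x448x3 array to a 7x7x24 prediction tensor.
    predictions = network(resized)

    # (3) Threshold the resulting detections by the model's confidence.
    detections = []
    for cell in predictions.reshape(-1, predictions.shape[-1]):
        class_probs, box = cell[:20], cell[20:24]
        score = class_probs.max()
        if score >= conf_threshold:
            detections.append((box, int(class_probs.argmax()), float(score)))
    return detections
```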
A single convolutional network simultaneously predicts multiple bounding boxes and class probabilities for those boxes.

We train our network on full images and directly optimize detection performance.

Context matters in object detection.

Our network uses global image features to predict detections which drastically reduces its errors from background detections.

At test time, a single network evaluation of the full image produces detections of multiple objects in multiple categories without any pre or post-processing.

Our training and testing code is open source and available online at [redacted]. A variety of pre-trained models are also available to download.

2 Unified Detection

We unify the separate components of object detection into a single neural network.

Using our system, you only look once (YOLO) at an image to predict what objects are present and where they are.

Our network uses features from the entire image to predict each bounding box.

It also predicts all bounding boxes for an image simultaneously.

This means our network reasons globally about the full image and all the objects in the image.

The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision.

2.1 Design

Our system divides the input image into a 7 × 7 grid.

If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Each grid cell predicts a bounding box and class probabilities associated with that bounding box, see Figure 2.

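As a small sketch of this assignment rule (our own illustration, not code from the paper), an object's center in normalized image coordinates maps to exactly one of the 49 cells:

```python
def responsible_cell(x_center, y_center, grid_size=7):
    """Return the (row, col) of the grid cell whose region contains the object center.

    x_center, y_center are normalized to [0, 1) relative to the image width/height.
    """
    col = min(int(x_center * grid_size), grid_size - 1)
    row = min(int(y_center * grid_size), grid_size - 1)
    return row, col

# Example: an object centered at (0.52, 0.13) falls into row 0, column 3.
assert responsible_cell(0.52, 0.13) == (0, 3)
```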
Figure 2: The Model.
Our system models detection as a regression problem to a 7 × 7 × 24 tensor.
This tensor encodes bounding boxes and class probabilities for all objects in the image.

We implement this model as a convolutional neural network and evaluate it on the PASCAL VOC detection dataset [6].

The initial convolutional layers of the network extract features from the image while the fully connected layers predict the output probabilities and coordinates.

Our network architecture is inspired by the GoogLeNet model for image classification [25].

Our network has 24 convolutional layers followed by 2 fully connected layers.

However, instead of the inception modules used by GoogLeNet we simply use 1 × 1 reduction layers followed by 3 × 3 convolutional layers, similar to Lin et al [17].

We also replace maxpooling layers with strided convolutions.

The full network is shown in Figure 3.

Figure 3: The Architecture.
Our detection network has 24 convolutional layers followed by 2 fully connected layers.
The network uses strided convolutional layers to downsample the feature space instead of maxpooling layers.
Alternating 1 × 1 convolutional layers reduce the feature space from preceding layers.
We pretrain the convolutional layers on the ImageNet classification task at half the resolution (224×224 input image) and then double the resolution for detection.

The final output of our network is a 7 × 7 grid of predictions.

Each grid cell predicts 20 conditional class probabilities and 4 bounding box coordinates.

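A sketch of how that 7 × 7 × 24 output could be split per cell follows; the ordering of the 20 class probabilities and 4 coordinates within the last dimension is an assumption for illustration.

```python
import numpy as np

def split_predictions(output):
    """Split a 7x7x24 prediction tensor into class probabilities and box coordinates."""
    assert output.shape == (7, 7, 24)
    class_probs = output[..., :20]   # 20 conditional class probabilities per cell
    boxes = output[..., 20:]         # 4 bounding box coordinates per cell (x, y, w, h)
    return class_probs, boxes

class_probs, boxes = split_predictions(np.zeros((7, 7, 24)))
print(class_probs.shape, boxes.shape)  # (7, 7, 20) (7, 7, 4)
```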
2.2 Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset [22].

For pretraining we use the first 20 convolutional layers from Figure 3 followed by a maxpooling layer and two fully connected layers.

We train this network for approximately a week and achieve a top-5 accuracy of 86% on the ImageNet 2012 validation set.

We then adapt the model to perform detection.

Ren et al. show that adding both convolutional and connected layers to pretrained networks can benefit performance [21].

Following their example, we add four convolutional layers and two fully connected layers with randomly initialized weights.

Detection often requires fine-grained visual information so we increase the input resolution of the network from 224 × 224 to 448 × 448.

Our final layer predicts both class probabilities and bounding box coordinates.

We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1.

We parameterize the bounding box x and y coordinates to be offsets of a particular grid cell location so they are also bounded between 0 and 1.

We use a logistic activation function to reflect these constraints on the final layer.

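The parameterization above can be sketched as follows, assuming the ground truth box is given in absolute pixels for a 448 × 448 image (the helper name and argument layout are ours):

```python
def encode_box(x, y, w, h, image_size=448, grid_size=7):
    """Encode a ground-truth box as the regression targets described above."""
    # Width and height are normalized by the image dimensions, so they fall in [0, 1].
    w_t = w / image_size
    h_t = h / image_size

    # x and y become offsets within the responsible grid cell, also in [0, 1].
    cell_size = image_size / grid_size
    col = int(x / cell_size)
    row = int(y / cell_size)
    x_t = x / cell_size - col
    y_t = y / cell_size - row
    return row, col, (x_t, y_t, w_t, h_t)
```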
All other layers use the following leaky rectified linear activation:

$$\varphi(x)=\begin{cases}1.1x, & \text{if } x>0\\ 0.1x, & \text{otherwise}\end{cases}$$

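In code, this activation is a single elementwise operation; note the unusual 1.1 slope for positive inputs is taken directly from the formula as printed in this draft.

```python
import numpy as np

def leaky_activation(x):
    """Leaky rectified linear activation used by all layers except the last."""
    return np.where(x > 0, 1.1 * x, 0.1 * x)
```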
We optimize for sum-squared error in the output of our model.

We use sum-squared error because it is easy to optimize; however, it does not perfectly align with our goal of maximizing average precision.

It weights localization error equally with classification error, which may not be ideal.

To remedy this, we use a scaling factor λ to adjust the weight given to error from coordinate predictions versus error from class probabilities.

In our final model we use the scaling factor λ = 4.

Sum-squared error also equally weights errors in large boxes and small boxes.

Our error metric should reflect that small deviations in large boxes matter less than in small boxes.

To partially address this we predict the square root of the bounding box width and height instead of the width and height directly.

If cell i predicts class probabilities $\hat{p}_i(\text{airplane})$, $\hat{p}_i(\text{bicycle})$, … and the bounding box $\hat{x}_i, \hat{y}_i, \hat{w}_i, \hat{h}_i$, then our full loss function for an example is:

$$\sum_{i=0}^{48}\left(\lambda\,\mathbb{I}^{\text{obj}}_{i}\left((x_i-\hat{x}_i)^2+(y_i-\hat{y}_i)^2+(\sqrt{w_i}-\sqrt{\hat{w}_i})^2+(\sqrt{h_i}-\sqrt{\hat{h}_i})^2\right)+\sum_{c\in\text{classes}}\left(p_i(c)-\hat{p}_i(c)\right)^2\right)$$

where $\mathbb{I}^{\text{obj}}_{i}$ encodes whether any object appears in cell i.

Note that if there is no object in a cell we do not consider any loss from the bounding box coordinates predicted by that cell.

In this case, there is no ground truth bounding box so we only penalize the class probabilities associated with that region.

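A NumPy sketch of this loss for one training example is given below; the per-cell array layout is an assumption consistent with the earlier encoding, not the original training code.

```python
import numpy as np

def yolo_loss(pred_boxes, pred_probs, true_boxes, true_probs, obj_mask, lam=4.0):
    """Sum-squared error over 49 cells: coordinate terms gated by the object indicator.

    pred_boxes, true_boxes: (49, 4) arrays of (x, y, w, h) targets in [0, 1]
    pred_probs, true_probs: (49, 20) class probability arrays
    obj_mask: (49,) array, 1 if any object appears in the cell, else 0
    """
    xy_err = np.sum((true_boxes[:, :2] - pred_boxes[:, :2]) ** 2, axis=1)
    wh_err = np.sum((np.sqrt(true_boxes[:, 2:]) - np.sqrt(pred_boxes[:, 2:])) ** 2, axis=1)
    coord_loss = lam * obj_mask * (xy_err + wh_err)

    # Class probabilities are penalized in every cell, with or without an object.
    class_loss = np.sum((true_probs - pred_probs) ** 2, axis=1)
    return float(np.sum(coord_loss + class_loss))
```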
We train the network for about 120 epochs on the training and validation data sets from PASCAL VOC 2007 and 2012 as well as the test set from 2007, a total of 21k images.

Throughout training we use a batch size of 64, a momentum of 0.9 and a decay of 0.0005.

We use two learning rates during training: $10^{-2}$ and $10^{-3}$.

Training diverges if we use the higher learning rate, $10^{-2}$, from the start.

We use the lower rate, $10^{-3}$, for one epoch so that the randomly initialized weights in the final layers can settle to reasonable values.

Then we train with the following learning rate schedule: $10^{-2}$ for 80 epochs, and $10^{-3}$ for 40 epochs.

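As a sketch, this schedule can be written as a simple lookup by epoch; the epoch boundaries below come directly from the text.

```python
def learning_rate(epoch):
    """Learning rate schedule: one warm-up epoch at 1e-3, then 1e-2 for 80 epochs, 1e-3 for 40."""
    if epoch < 1:
        return 1e-3   # let randomly initialized final layers settle
    elif epoch < 81:
        return 1e-2
    else:
        return 1e-3

print([learning_rate(e) for e in (0, 1, 80, 81, 120)])
# [0.001, 0.01, 0.01, 0.001, 0.001]
```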
To avoid overfitting we use dropout and extensive data augmentation.

A dropout layer with rate = .5 after the first connected layer prevents co-adaptation between layers [14].

For data augmentation we introduce random scaling and translations of up to 10% of the original image size.

We also randomly adjust the exposure and saturation of the image by up to a factor of 2.

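A rough sketch of sampling such augmentation parameters is shown below; the paper only specifies the ranges, so the sampling itself is an illustration.

```python
import random

def sample_augmentation():
    """Sample random jitter within the ranges described above."""
    scale = 1.0 + random.uniform(-0.1, 0.1)     # scaling up to 10% of image size
    shift_x = random.uniform(-0.1, 0.1)         # translation up to 10% of image size
    shift_y = random.uniform(-0.1, 0.1)
    exposure = random.uniform(0.5, 2.0)         # exposure adjusted by up to a factor of 2
    saturation = random.uniform(0.5, 2.0)       # saturation adjusted by up to a factor of 2
    return scale, (shift_x, shift_y), exposure, saturation
```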
2.2.1 Parameterizing Class Probabilities

2.2.1类概率参数化

Each grid cell predicts class probabilities for that area of the image.
每个网格单元预测图像中该区域的类概率。

There are 49 cells with a possible 20 classes each yielding 980 predicted probabilities per image.
共有49个单元,每个单元可能有20个类别,每幅图像产生980个预测概率。

Most of these probabilities will be zero since only a few objects appear in any given image.
由于在任何给定的图像中只有少数物体出现,所以大多数概率都为零。

Left unchecked, this imbalance pushes all of the probabilities to zero, leading to divergence during training.
如果不加以控制,这种不平衡会将所有的概率推到零,导致训练过程中出现分歧。

To overcome this, we add an extra variable to each grid location, the probability that any object exists in that location regardless of class.
为了克服这个问题,我们在每个网格位置添加一个额外的变量,即任何对象存在于该位置的概率,而不考虑类。

Thus instead of 20 class probabilities we have 1 “objectness” probability, Pr(Object), and 20 conditional probabilities: Pr(Airplane|Object), Pr(Bicycle|Object), etc.
因此,我们有1个“目标”概率、Pr(Object)和20个条件概率:Pr(Airplane|Object)、Pr(Bicycle|Object)等,而不是20个类概率。

To get the unconditional probability for an object class at a given location we simply multiply the “objectness” probability by the conditional class probability:
要获得给定位置上对象类的无条件概率,我们只需将“objectness”概率乘以条件类概率:

Pr(Dog)=Pr(Object)∗PR(Dog∣Object)Pr(Dog)=Pr(Object)*PR(Dog|Object) Pr(Dog)=Pr(Object)∗PR(Dog∣Object)

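In code, recovering the unconditional class probabilities from the extra “objectness” variable is a single elementwise multiplication; the array layout below is assumed, as before.

```python
import numpy as np

def unconditional_probs(objectness, conditional_probs):
    """Multiply Pr(Object) into the 20 conditional class probabilities of each cell.

    objectness: (49,) array of Pr(Object) per grid cell
    conditional_probs: (49, 20) array of Pr(class | Object) per cell
    """
    return objectness[:, None] * conditional_probs   # (49, 20) array of Pr(class)
```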
We can optimize these probabilities independently or jointly using a novel “detection layer” in our convolutional network.

During the initial stages of training we optimize them independently to improve model stability.

We update the “objectness” probabilities at every location; however, we only update the conditional probabilities at locations that actually contain an object.

This means there are far fewer probabilities getting pushed towards zero.

During later stages of training we optimize the unconditioned probabilities by performing the required multiplications in the network and calculating error based on the result.

2.2.2 Predicting IOU

Like most detection systems, our network has trouble precisely localizing small objects.

While it may correctly predict that an object is present in an area of the image, if it does not predict a precise enough bounding box the detection is counted as a false positive.

We want YOLO to have some notion of uncertainty in its probability predictions.

Instead of predicting hard 0 or 1 probabilities, we can scale the target class probabilities by the IOU of the predicted bounding box with the ground truth box for a region.

When YOLO predicts good bounding boxes it is also encouraged to predict high class probabilities.

For poor bounding box predictions it learns to predict lower confidence probabilities.

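A sketch of how such IOU-scaled targets could be built is shown below; the box format and helper names are ours, for illustration only.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def scaled_class_target(predicted_box, truth_box, class_target=1.0):
    """Scale the target class probability by the IOU of the predicted and truth boxes."""
    return class_target * iou(predicted_box, truth_box)
```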
We do not train to predict IOU from the beginning, only during the second stage of training.

It is not necessary for good performance but it does boost our mean average precision by 3-4%.

2.3 Inference

Just like in training, predicting detections for a test image only requires one network evaluation.

The network predicts 49 bounding boxes per image and class probabilities for each box. YOLO is extremely fast at test time since it only requires a single network evaluation, unlike classifier-based methods.

The grid design enforces spatial diversity in the bounding box predictions.

Often it is clear which grid cell an object falls into and the network only predicts one box for each object.

However, some large objects or objects near the border of multiple cells can be well localized by multiple cells.

Non-maximal suppression can be used to fix these multiple detections.

While not critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2-3% in mAP.

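A minimal sketch of greedy non-maximal suppression over the 49 predicted boxes follows; the overlap threshold is illustrative and `iou` is the helper sketched in Section 2.2.2.

```python
def non_max_suppression(detections, iou_threshold=0.5):
    """Greedy NMS: keep the highest scoring box, drop overlapping lower-scoring boxes.

    detections: list of (box, class_id, score) with box as (x1, y1, x2, y2)
    """
    detections = sorted(detections, key=lambda d: d[2], reverse=True)
    kept = []
    for box, class_id, score in detections:
        if all(iou(box, k[0]) < iou_threshold for k in kept if k[1] == class_id):
            kept.append((box, class_id, score))
    return kept
```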
2.4 Limitations of YOLO

YOLO imposes strong spatial constraints on bounding box predictions since each grid cell only predicts one box.

This spatial constraint limits the number of nearby objects that our model can predict.

If two objects fall into the same cell our model can only predict one of them.

Our model struggles with small objects that appear in groups, such as flocks of birds.

Since our model learns to predict bounding boxes from data, it struggles to generalize to objects in new or unusual aspect ratios or configurations.

Our model also uses relatively coarse features for predicting bounding boxes since our architecture has multiple downsampling layers from the input image.

Finally, while we train on a loss function that approximates detection performance, our loss function treats errors the same in small bounding boxes versus large bounding boxes.

A small error in a large box is generally benign but a small error in a small box has a much greater effect on IOU.

Our main source of error is incorrect localizations.

3 Comparison to Other Detection Systems

Object detection is a core problem in computer vision.

Detection pipelines generally start by extracting a set of robust features from input images (Haar [19], SIFT [18], HOG [2], convolutional features [3]).

Then, classifiers [27, 16, 9, 7] or localizers [1, 23] are used to identify objects in the feature space.

These classifiers or localizers are run either in sliding window fashion over the whole image or on some subset of regions in the image [26, 11, 28].

We compare the YOLO detection system to several top detection frameworks, highlighting key similarities and differences.

Deformable parts models.

Deformable parts models (DPM) use a sliding window approach to object detection [7].

DPM uses a disjoint pipeline to extract static features, classify regions, predict bounding boxes for high scoring regions, etc.

Our system replaces all of these disparate parts with a single convolutional neural network.

The network performs feature extraction, bounding box prediction, non-maximal suppression, and contextual reasoning all concurrently.

Instead of static features, the network trains the features in-line and optimizes them for the detection task.

Our unified architecture leads to a faster, more accurate model than DPM.

Table 1: PASCAL VOC 2012 Leaderboard.
YOLO compared with the full comp4 (outside data allowed) public leaderboard as of June 6th, 2015.
Mean average precision and per-class average precision are shown for a variety of detection methods.
YOLO is the top detection method that is not based on the R-CNN detection framework.
Fast R-CNN + YOLO is the second highest scoring method, with a 2% boost over Fast R-CNN.

R-CNN.

R-CNN and its variants use region proposals instead of sliding windows to find objects in images.

These systems use region proposal methods like Selective Search [26] to generate potential bounding boxes in an image.

Instead of scanning through every possible window in the image, the classifier only has to score a small subset of potential regions.

Good region proposal methods maintain high recall despite greatly limiting the search space.

This performance comes at a cost.

Selective Search, even in “fast mode”, takes about 2 seconds to propose regions for an image.

R-CNN shares many design aspects with DPM.

After region proposal, R-CNN uses the same multistage pipeline of feature extraction (using CNNs instead of HOG), SVM scoring, non-maximal suppression, and bounding box prediction using a linear model [9].

YOLO shares some similarities with R-CNN.

Each grid cell proposes a potential bounding box and then scores that bounding box using convolutional features.

However, our system puts spatial constraints on the grid cell proposals which helps mitigate multiple detections of the same object.

Our system also proposes far fewer bounding boxes, only 49 per image compared to about 2000 from Selective Search.

Finally, our system combines these individual components into a single, jointly optimized model.

Deep MultiBox.

Unlike R-CNN, Szegedy et al. train a convolutional neural network to predict regions of interest [5] instead of using Selective Search.

MultiBox can also perform single object detection by replacing the confidence prediction with a single class prediction.

However, MultiBox cannot perform general object detection and is still just a piece in a larger detection pipeline, requiring further image patch classification.

Both YOLO and MultiBox use a convolutional network to predict bounding boxes in an image but YOLO is a complete detection pipeline.

OverFeat.

Sermanet et al. train a convolutional neural network to perform localization and adapt that localizer to perform detection [23].

OverFeat efficiently performs sliding window detection but it is still a disjoint system.

OverFeat optimizes for localization, not detection performance.

Like DPM, the localizer only sees local information when making a prediction.

OverFeat cannot reason about global context and thus requires significant post-processing to produce coherent detections.

MultiGrasp.

Our work is similar in design to work on grasp detection by Redmon et al [20].

Our grid approach to bounding box prediction is based on the MultiGrasp system for regression to grasps.

4 Experiments

We present detection results for the PASCAL VOC 2012 dataset and compare our mean average precision (mAP) and runtime to other top detection methods.

We also perform error analysis on the VOC 2007 dataset.

We compare our results to Fast R-CNN, one of the highest performing versions of R-CNN [10].

We use publicly available runs of Fast R-CNN available on GitHub.

Finally we show that a combination of our method with Fast R-CNN gives a significant performance boost.

4.1 VOC 2012 Results

On the VOC 2012 test set we achieve 54.5 mAP.

This is lower than the current state of the art, closer to R-CNN based methods that use AlexNet, see Table 1.

Our system struggles with small objects compared to its closest competitors.

On categories like bottle, sheep, and tv/monitor YOLO scores 8-10 percentage points lower than R-CNN or Feature Edit.

However, on other categories like cat and horse YOLO achieves significantly higher performance.

We further investigate the source of these performance disparities in Section 4.3.

4.2 Speed

At test time YOLO processes images at 45 frames per second on an Nvidia Titan X GPU.

It is considerably faster than classifier-based methods with similar mAP.

Normal R-CNN using AlexNet or the small VGG network takes 400-500x longer to process images.

The recently proposed Fast R-CNN shares convolutional features between the bounding boxes but still relies on Selective Search for bounding box proposals, which accounts for the bulk of its processing time.

YOLO is still around 100x faster than Fast R-CNN.

Table 2 shows a full comparison between multiple R-CNN and Fast R-CNN variants and YOLO.

Table 2: Prediction Timing.
mAP and timing information for R-CNN, Fast R-CNN, and YOLO on the VOC 2007 test set.
Timing information is given both as frames per second and the time each method takes to process the full 4952 image set.
The final column shows the relative speed of YOLO compared to that method.

4.3 VOC 2007 Error Analysis

An object detector must have high recall for objects in the test set to obtain high performance.

Our model imposes spatial constraints on bounding box predictions which limits recall on small objects that are close together.

We examine how detrimental this is in practice by calculating our highest possible recall assuming perfect coordinate prediction.

Under this assumption, our model can achieve a 93.1% recall for objects in the VOC 2007 test set.

This is lower than Selective Search (98.0% [26]) but still relatively high.

Using the methodology and tools of Hoiem et al. [15] we analyze our performance on the VOC 2007 test set.

We compare YOLO to Fast R-CNN using VGG-16, one of the highest performing object detectors.

Figure 4 compares the frequency of localization and background errors between Fast R-CNN and YOLO.

Figure 4: Error Analysis: Fast R-CNN vs. YOLO
These charts show the percentage of localization and background errors in the top N detections for various categories (N = # objects in that category).

A detection is considered a localization error if it overlaps a true positive in the image but by less than the required 50% IOU.

A detection is a background error if the box does not overlap any objects of any class in the image.

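These two definitions translate directly into a small categorization rule; this is a sketch using the `iou` helper from Section 2.2.2, not the analysis tooling itself.

```python
def error_type(detection_box, ground_truth_boxes):
    """Classify a detection following the definitions above."""
    best_overlap = max((iou(detection_box, gt) for gt in ground_truth_boxes), default=0.0)
    if best_overlap >= 0.5:
        return "correct"          # would count as a true positive, not an error
    if best_overlap > 0.0:
        return "localization"     # overlaps an object, but below the required 50% IOU
    return "background"           # overlaps no object of any class
```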
Fast R-CNN makes around the same number of localization errors and background errors.

Over all 20 classes, 13.6% of its top N detections are localization errors and 8.6% are background errors.

YOLO makes far more localization errors but relatively few background errors.

Averaged across all classes, 24.7% of its top N detections are localization errors and a mere 4.3% are background errors.

This is about twice the number of localization errors but half the number of background detections.

YOLO uses global context to evaluate detections while R-CNN only sees small portions of the image.

Many of the background detections made by R-CNN are obviously not objects when shown in context.

YOLO and R-CNN are good at different parts of object detection.

Since their main sources of error are orthogonal, combining them should produce a model that is better across the board.

4.4 Combining Fast R-CNN and YOLO

YOLO makes far fewer background mistakes than Fast R-CNN.

By using YOLO to eliminate background detections from Fast R-CNN we get a significant boost in performance.

For every bounding box that Fast R-CNN predicts we check to see if YOLO predicts a similar box.

If it does, we give that prediction a boost based on the probability predicted by YOLO and the overlap between the two boxes.

This reorders the detections to favor those predicted by both systems.

Since we still use Fast R-CNN’s bounding box predictions we do not introduce any localization error.

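A sketch of this rescoring rule is given below; the exact boost formula is not specified in the text, so the weighting used here is only illustrative, and `iou` is the helper from Section 2.2.2.

```python
def rescore_fast_rcnn(fast_rcnn_dets, yolo_dets, iou_threshold=0.5):
    """Boost Fast R-CNN detections that YOLO also predicts; keep Fast R-CNN boxes."""
    rescored = []
    for box, class_id, score in fast_rcnn_dets:
        boost = 0.0
        for y_box, y_class, y_score in yolo_dets:
            overlap = iou(box, y_box)
            if y_class == class_id and overlap >= iou_threshold:
                # Illustrative boost: weight YOLO's probability by the overlap.
                boost = max(boost, y_score * overlap)
        rescored.append((box, class_id, score + boost))
    # Reorder detections to favor boxes predicted by both systems.
    return sorted(rescored, key=lambda d: d[2], reverse=True)
```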
Thus we take advantage of the best aspects of both systems.

The best Fast R-CNN model achieves a mAP of 71.8% on the VOC 2007 test set.

When combined with YOLO, its mAP increases by 2.9% to 74.7%.

We also tried combining the top Fast R-CNN model with several other versions of Fast R-CNN.

Those ensembles produced small increases in mAP between .3 and .6%, see Table 3 for details.

Table 3: Model combination experiments on VOC 2007.
We examine the effect of combining various models with the best version of Fast R-CNN.
The model’s base mAP is listed as well as its mAP when combined with the top model on VOC 2007.
Other versions of Fast R-CNN provide only a small marginal benefit, while combining with YOLO results in a significant performance boost.

Thus, the benefit from combining Fast R-CNN with YOLO is unique, not a general property of combining models in this way.

Using this combination strategy we achieve a significant boost on the VOC 2012 and 2010 test sets as well, around 2%.

The combined Fast R-CNN + YOLO model is currently the second highest performing model on the VOC 2012 leaderboard.

5 Conclusion

We introduce YOLO, a unified pipeline for object detection.

Our model is simple to construct and can be trained directly on full images.

Unlike classifier-based approaches, YOLO is trained on a loss function that directly corresponds to detection performance and every piece of the pipeline can be trained jointly.

The network reasons about the entire image during inference, which makes it less likely to predict background false positives than sliding window or proposal region detectors.

Moreover, it predicts detections with only a single network evaluation, making it extremely fast.

YOLO achieves impressive performance on standard benchmarks considering it is 2-3 orders of magnitude faster than existing detection methods.

It can also be used to rescore the bounding boxes produced by state-of-the-art methods like R-CNN for a significant boost in performance.
