Towards Open World Object Detection 开放世界的目标检测

Paper: https://arxiv.org/abs/2103.02603

Code: https://github.com/JosephKJ/OWOD (CVPR 2021 Oral)

Key idea: cluster in the latent space, and classify features that do not belong to any of the currently labelled classes as unknown.

Model architecture:

① Contrastive clustering

  1. A feature vector from an intermediate layer of the two-stage detection network is used as the latent representation. Each class has a cluster prototype (centre), and a loss function encourages the feature of each instance to be close to its own class prototype and far from the prototypes of the other classes.
  2. Minimising this loss yields the desired effect: in the latent space, the latent vectors of RoIs of the same class lie close together while those of different classes lie far apart. Consequently, when an RoI's latent vector is far from all known class clusters, the RoI belongs to the unknown class, i.e., an unlabelled category.

Here pi is the mean of the latent vectors of class i. Because the network keeps being trained, the latent vector extracted for each sample changes during training, so the mean changes and pi must change with it. A concrete update rule is therefore needed: every I iterations, compute each class's mean from the feature vectors (the latent vectors mentioned above) of that class's samples seen in the most recent Q iterations, and take a weighted (smoothed) average with the old prototype to obtain the new cluster centre for each class. At the very beginning there are not yet enough samples to compute the prototypes, so the clustering loss is treated as 0 and is only computed after the first I iterations.
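A minimal sketch of this smoothed prototype update, assuming a per-class list of recently stored features (the names feature_store and momentum are illustrative, not taken from the official code):

import torch

def update_prototypes(prototypes, feature_store, momentum=0.99):
    # prototypes: dict {class_id: d-dim tensor or None}
    # feature_store: dict {class_id: list of the most recent d-dim feature tensors}
    for c, feats in feature_store.items():
        if not feats:
            continue
        new_proto = torch.stack(feats).mean(dim=0)   # mean of the recent features of class c
        if prototypes.get(c) is None:
            prototypes[c] = new_proto                # first estimate
        else:                                        # smooth towards the new estimate
            prototypes[c] = momentum * prototypes[c] + (1 - momentum) * new_proto
    return prototypes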

② Auto-labelling unknowns with the RPN

The training above needs unknown samples, which are obtained from the RPN. Among the proposals that do not overlap with any labelled ground-truth box, the top-N background proposals with the highest objectness scores produced by the RPN are passed on as proposals for unknown objects.

③ Energy-based classification head

After training, at inference time, given an input feature vector f, the model uses an energy-based model to obtain the predicted class.

Here g denotes the outputs (logits) of the network's fully connected layer; an ordinary network would simply feed g into a softmax for classification. ORE instead uses g to build an energy E: each input f is mapped to a scalar energy value, and because known and unknown classes are well separated in the latent space, the energies produced by known-class features and by unknown-class features are clearly distinguishable.

Inference therefore proceeds in two steps: first use the energy value to decide whether the instance is known; if it is known, pick the class whose output score is highest.
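A rough sketch of this two-step decision, assuming the classification-head logits are available (the energy_is_known test is illustrative; the paper fits Weibull distributions to the energies, see Sec. 4.3):

import torch
import torch.nn.functional as F

def predict(logits, energy_is_known, temperature=1.0):
    # logits: (num_known_classes,) classification logits g(f) for a single RoI
    energy = -temperature * torch.logsumexp(logits / temperature, dim=0)  # free energy (Eqn. 4)
    if not energy_is_known(energy):                      # step 1: known vs. unknown
        return "unknown"
    return int(torch.argmax(F.softmax(logits, dim=0)))   # step 2: highest-scoring known class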


摘要

Humans have a natural instinct to identify unknown object instances in their environments. The intrinsic curiosity about these unknown instances aids in learning about them, when the corresponding knowledge is eventually available. This motivates us to propose a novel computer vision problem called: ‘Open World Object Detection’, where a model is tasked to: 1) identify objects that have not been introduced to it as ‘unknown’, without explicit supervision to do so, and 2) incrementally learn these identified unknown categories without forgetting previously learned classes, when the corresponding labels are progressively received. We formulate the problem, introduce a strong evaluation protocol and provide a novel solution, which we call ORE: Open World Object Detector, based on contrastive clustering and energy based unknown identification. Our experimental evaluation and ablation studies analyse the efficacy of ORE in achieving Open World objectives. As an interesting by-product, we find that identifying and characterizing unknown instances helps to reduce confusion in an incremental object detection setting, where we achieve state-of-the-art performance, with no extra methodological effort. We hope that our work will attract further research into this newly identified, yet crucial research direction.

人类有识别环境中未知物体实例的本能。当相应的知识最终可用时,对这些未知实例的好奇心有助于了解它们。这促使我们提出了一个新的计算机视觉问题,称为“开放世界目标检测”,模型的任务是:1)在没有明确监督的情况下,将尚未引入给它的物体识别为“未知”物体;2)当逐渐接收到相应的标签时,增量地学习这些识别出的未知类别而不忘记先前学习的类。我们形式化了该问题,引入了一个强有力的评估协议,并提出了一种新的解决方案ORE:开放世界目标检测器,它基于对比聚类和基于能量的未知识别。我们的实验评估和消融研究分析了ORE在实现开放世界目标方面的功效。作为一个有趣的副产品,我们发现识别和刻画未知实例有助于减少增量目标检测设置中的混淆,在这种情况下,我们无需额外的方法改动即可实现最先进的性能。我们希望我们的工作能吸引对这个新提出但至关重要的研究方向的进一步研究。

I  介绍

Deep learning has accelerated progress in Object Detection research [13, 53, 18, 30, 51], where a model is tasked to identify and localise objects in an image. All existing approaches work under a strong assumption that all the classes that are to be detected would be available at the training phase. Two challenging scenarios arise when we relax this assumption: 1) A test image might contain objects from unknown classes, which should be classified as unknown. 2) As and when information (labels) about such identified unknowns becomes available, the model should be able to incrementally learn the new class. Research in developmental psychology [40, 35] finds that the ability to identify what one doesn’t know is key in captivating curiosity. Such a curiosity fuels the desire to learn new things [8, 15]. This motivates us to propose a new problem where a model should be able to identify instances of unknown objects as unknown and subsequently learn to recognise them when training data progressively arrives, in a unified way. We call this problem setting Open World Object Detection.

深度学习加速了目标检测研究的进展[13, 53, 18, 30, 51],模型的任务是识别和定位图像中的目标。所有现有的方法都是在一个强假设下工作的,即所有要检测的类在训练阶段都是可用的。当我们放宽这一假设时,出现了两个具有挑战性的场景:1)测试图像可能包含来自未知类的目标,这些目标应该被分类为未知。2)当有关这些已识别未知项的信息(标签)可用时,模型应该能够增量地学习新类。发展心理学的研究[40, 35]发现,辨别自己不知道的东西的能力是激发好奇心的关键。这种好奇心激发了人们学习新事物的欲望[8, 15]。这促使我们提出一个新的问题,即模型应该能够将未知目标的实例识别为未知,然后在训练数据逐渐到达时以统一的方式学习识别它们。我们把这个问题设置称为开放世界目标检测。

The number of classes that are annotated in standard vision datasets like Pascal VOC [9] and MS-COCO [31] is very low (20 and 80 respectively) when compared to the infinite number of classes that are present in the open world. Recognising an unknown as an unknown requires strong generalization. Scheirer et al. [56] formalise this as the Open Set classification problem. Subsequently, various methodologies (using 1-vs-rest SVMs and deep learning models) have been formulated to address this challenging setting. Bendale et al. [2] extend Open Set to an Open World classification setting by additionally updating the image classifier to recognise the identified new unknown classes. Interestingly, as seen in Fig. 1, Open World object detection is unexplored, owing to the difficulty of the problem setting.

与开放世界中存在的无限数量的类相比,标准视觉数据集(如Pascal VOC[9]和MS-COCO[31])中标注的类的数量非常少(分别为20和80)。将未知目标识别为未知需要很强的泛化能力。Scheirer等人[56]将其形式化为开集分类问题。此后,人们提出了各种方法(使用1-vs-rest支持向量机和深度学习模型)来解决这一具有挑战性的设置。Bendale等人[2]通过额外更新图像分类器来识别新发现的未知类别,将开放集扩展到开放世界分类设置。有趣的是,如图1所示,由于问题设置的难度,开放世界目标检测尚未被探索。

Figure 1: Open World Object Detection is a novel problem that has not been formally defined and addressed so far. Though related to the Open Set and Open World classification, Open World Object Detection offers its own unique challenges, which, when addressed, improve the practicality of object detectors. 图1:开放世界目标检测是一个到目前为止还没有被正式定义和解决的新问题。虽然与开放集和开放世界分类相关,但开放世界目标检测有其独特的挑战,解决这些挑战可以提高目标检测器的实用性。

The advances in Open Set and Open World image classification cannot be trivially adapted to Open Set and Open World object detection, because of a fundamental difference in the problem setting: The object detector is trained to detect unknown objects as background. Instances of many unknown classes would have been already introduced to the object detector along with known objects. As they are not labelled, these unknown instances would be explicitly learned as background, while training the detection model. Dhamija et al. [7] finds that even with this extra training signal, the state-of-the-art object detectors results in false positive detections, where the unknown objects end up being classified as one of the known classes, often with very high probability. Miller et al. [42] proposes to use dropout sampling to get an estimate of the uncertainty of the object detection prediction. This is the only peer-reviewed research work in the open set object detection literature. Our proposed Open World Object Detection goes a step further to incrementally learn the new classes, once they are detected as unknown and an oracle provides labels for the objects of interest among all the unknowns. To the best of our knowledge this has not been tried in the literature. Fig. 1 shows a taxonomy of existing research work in this space

开放集和开放世界图像分类的进展不能简单地适用于开放集和开放世界的目标检测,因为问题设置有一个根本的区别:目标检测器被训练成将未知目标检测为背景。许多未知类的实例已经与已知目标一起呈现给了目标检测器。由于没有标注,这些未知实例在训练检测模型时会被显式地学习为背景。Dhamija等人[7]发现,即使有了这个额外的训练信号,最先进的目标检测器也会产生假阳性检测,其中未知目标最终被归类为已知类别之一,且通常置信度非常高。Miller等人[42]建议使用dropout采样来估计目标检测预测的不确定性。这是开集目标检测文献中唯一一项经过同行评议的研究工作。我们提出的开放世界目标检测更进一步:一旦新类被检测为未知,并且由oracle(标注者)为所有未知目标中感兴趣的目标提供标签,就可以增量地学习它们。据我们所知,这在文献中还没有被尝试过。图1显示了该领域现有研究工作的分类。

The Open World Object Detection setting is much more natural than the existing closed-world, static-learning setting. The world is diverse and dynamic in the number, type and configurations of novel classes. It would be naive to assume that all the classes to expect at inference are seen during training. Practical deployments of detection systems in robotics, self-driving cars, plant phenotyping, healthcare and surveillance cannot afford to have complete knowledge on what classes to expect at inference time, while being trained in-house. The most natural and realistic behavior that one can expect from an object detection algorithm deployed in such settings would be to confidently predict an unknown object as unknown, and known objects into the corresponding classes. As and when more information about the identified unknown classes becomes available, the system should be able to incorporate them into its existing knowledge base. This would define a smart object detection system, and ours is an effort towards achieving this goal.

与现有的封闭世界静态学习设置相比,开放世界目标检测设置更加自然。世界在新类别的数量、类型和构成上是多样化和动态的。假设推理时会遇到的所有类别都在训练期间见过,这是不现实的。在机器人、自动驾驶汽车、植物表型鉴定、医疗保健和监控等领域的实际部署中,检测系统在内部训练时无法完全掌握推理时会遇到哪些类别。在这样的环境中部署的目标检测算法最自然、最现实的行为是:自信地将未知目标预测为未知,并将已知目标划分为相应的类别。当有关已识别未知类的更多信息可用时,系统应该能够将它们融入现有的知识库。这将定义一个智能目标检测系统,而我们的工作正是朝着这个目标的努力。

The key contributions of our work are:
• We introduce a novel problem setting, Open World Object Detection, which models the real-world more closely.
• We develop a novel methodology, called ORE, based on contrastive clustering, an unknown-aware proposal network and energy based unknown identification to address the challenges of open world detection.
• We introduce a comprehensive experimental setting, which helps to measure the open world characteristics of an object detector, and benchmark ORE on it against competitive baseline methods.
• As an interesting by-product, the proposed methodology achieves state-of-the-art performance on Incremental Object Detection, even though not primarily designed for it

  • 我们引入了一种新的问题设置,即开放世界目标检测,它可以更紧密地模拟现实世界。
  • 我们开发了一种新的方法,称为ORE,基于对比聚类、未知感知建议网络和基于能量的未知识别来应对开放世界检测的挑战。
  • 我们引入了一个全面的实验环境,有助于测量目标探测器的开放世界特性,并将ORE与竞争性基线方法进行比较。
  • 作为一个有趣的副产品,所提出的方法在增量目标检测方面实现了最先进的性能,尽管主要不是为其设计的

II  相关工作

Open Set Classification: The open set setting considers knowledge acquired through training set to be incomplete, thus new unknown classes can be encountered during testing. Scheirer et al. [57] developed open set classifiers in a one-vs-rest setting to balance the performance and the risk of labeling a sample far from the known training examples (termed as open space risk). Follow up works [22, 58] extended the open set framework to multi-class classifier setting with probabilistic models to account for the fading away classifier confidences in case of unknown classes.

开集分类:开集设置认为通过训练集获得的知识是不完整的,因此在测试过程中会遇到新的未知类。Scheirer等人[57]在一对多(one-vs-rest)设置中开发了开集分类器,以平衡性能与将远离已知训练样本的样本标记为已知的风险(称为开放空间风险)。后续工作[22, 58]将开集框架扩展到多类分类器设置,并采用概率模型来解释未知类情况下分类器置信度的衰减。

Bendale and Boult [3] identified unknowns in the feature space of deep networks and used a Weibull distribution to estimate the set risk (called OpenMax classifier). A generative version of OpenMax was proposed in [12] by synthesizing novel class images. Liu et al. [34] considered a long-tailed recognition setting where majority, minority and unknown classes coexist. They developed a metric learning framework to identify unseen classes as unknown. In similar spirit, several dedicated approaches focus on detecting out-of-distribution samples [29] or novelties [47]. Recently, self-supervised learning [45] and unsupervised learning with reconstruction [64] have been explored for open set recognition. However, while these works can recognize unknown instances, they cannot dynamically update themselves in an incremental fashion over multiple training episodes. Further, our energy based unknown detection approach has not been explored before.

BendaleBoult[3]在深度网络的特征空间中识别未知目标,并使用Weibull分布来估计集合风险(称为OpenMax分类器)。[12]通过合成新的类图像,提出了OpenMax的生成版本。Liu等人[34]考虑了一种长尾识别环境,其中多数类、少数类和未知类共存。他们开发了一个度量学习框架,将看不见的类识别为未知类。本着类似的精神,有几种专门的方法旨在检测分布外的样本[29]或新类别[47]。最近,自监督学习[45]和带重构的无监督学习[64]被探索用于开集识别。然而,虽然这些工作可以识别未知的实例,但它们不能在多个训练集上以增量方式动态更新自己。此外,我们基于能量的未知检测方法还没有被探索过。

Open World Classification: [2] first proposed the open world setting for image recognition. Instead of a static classifier trained on a fixed set of classes, they proposed a more flexible setting where knowns and unknowns both coexist. The model can recognize both types of objects and adaptively improve itself when new labels for unknown are provided. Their approach extends Nearest Class Mean classifier to operate in an open world setting by re-calibrating the class probabilities to balance open space risk. [46] studies open world face identity learning while [63] proposed to use an exemplar set of seen classes to match them against a new sample, and rejects it in case of a low match with all previously known classes. However, they don’t test on image classification benchmarks and study product classification in e-commerce applications.

开放世界分类:[2]首先提出了图像识别的开放世界设置。他们提出了一种更灵活的设置,即已知和未知类同时存在,而不是在一组固定的类上训练静态分类器。该模型能同时识别这两种类型的目标,并在为未知目标提供新的标签时自适应地进行改进。他们的方法通过重新校准类概率来平衡开放空间风险,从而扩展了最近类均值(Nearest Class Mean)分类器,使其在开放世界环境中运行。[46]研究了开放世界的人脸身份学习,而[63]则提出使用一组已见类的范例来与新样本进行匹配,如果与所有先前已知类的匹配度都很低,则将其拒绝(判为未知)。然而,他们没有在图像分类基准上进行测试,而是研究电子商务应用中的产品分类。

Open Set Detection: Dhamija et al. [7] formally studied the impact of open set setting on popular object detectors. They noticed that the state of the art object detectors often classify unknown classes with high confidence to seen classes. This is despite the fact that the detectors are explicitly trained with a background class [54, 13, 32] and/or apply one-vs-rest classifiers to model each class [14, 30]. A dedicated body of work [42, 41, 16] focuses on developing measures of (spatial and semantic) uncertainty in object detectors to reject unknown classes. E.g., [42, 41] uses Monte Carlo Dropout [11] sampling in a SSD detector to obtain uncertainty estimates. These methods, however, cannot incrementally adapt their knowledge in a dynamic world.

开集检测:Dhamija等人[7]正式研究了开集设置对流行目标检测器的影响。他们注意到,最先进的目标检测器通常会以很高的置信度将未知类分类为已见过的类。尽管这些检测器使用背景类[54, 13, 32]进行了显式训练,和/或应用one-vs-rest分类器对每个类进行建模[14, 30]。一系列专门的工作[42, 41, 16]专注于开发目标检测器中(空间和语义)不确定性的度量,以拒绝未知类。例如,[42, 41]在SSD检测器中使用蒙特卡罗Dropout[11]采样来获得不确定性估计。然而,这些方法不能在一个动态的世界中增量地调整它们的知识。

III  开放世界目标检测

Let us formalise the definition of Open World Object Detection in this section. At any time t, we consider the set of known object classes as Kt = {1, 2, ..., C} ⊂ N+, where N+ denotes the set of positive integers. In order to realistically model the dynamics of the real world, we also assume that there exists a set of unknown classes U = {C+1, ...}, which may be encountered during inference. The known object classes Kt are assumed to be labeled in the dataset Dt = {Xt, Yt}, where X and Y denote the input images and labels respectively. The input image set comprises M training images, Xt = {I1, ..., IM}, and the associated object labels for each image form the label set Yt = {Y1, ..., YM}. Each Yi = {y1, y2, ..., yK} encodes a set of K object instances with their class labels and locations, i.e., yk = [lk, xk, yk, wk, hk], where lk ∈ Kt and xk, yk, wk, hk denote the bounding box center coordinates, width and height respectively.

在本节中正式定义开放世界目标检测。在任意时刻t,我们将已知的目标类集记为Kt = {1, 2, ..., C} ⊂ N+,其中N+表示正整数集。为了真实地模拟现实世界的动态,我们还假设存在一组未知类U = {C+1, ...},它们在推理过程中可能会遇到。假设已知目标类Kt在数据集Dt = {Xt, Yt}中被标记,其中X和Y分别表示输入图像和标签。输入图像集由M张训练图像组成,Xt = {I1, ..., IM},每张图像对应的目标标签构成标签集Yt = {Y1, ..., YM}。每个Yi = {y1, y2, ..., yK}编码一组K个目标实例及其类别标签和位置,即yk = [lk, xk, yk, wk, hk],其中lk ∈ Kt,xk, yk, wk, hk分别表示边界框的中心坐标、宽度和高度。

The Open World Object Detection setting considers an object detection model MC that is trained to detect all the previously encountered C object classes. Importantly, the model MC is able to identify a test instance belonging to any of the known C classes, and can also recognize a new or unseen class instance by classifying it as an unknown, denoted by a label zero (0). The unknown set of instances Ut can then be forwarded to a human user who can identify n new classes of interest (among a potentially large number of unknowns) and provide their training examples. The learner incrementally adds n new classes and updates itself to produce an updated model MC+n without retraining from scratch on the whole dataset. The known class set is also updated Kt+1 = Kt + {C + 1; ...; C + n}. This cycle continues over the life of the object detector, where it adaptively updates itself with new knowledge. The problem setting is illustrated in the top row of Fig. 2.

Openworld目标检测设置考虑一个目标检测模型MC,它被训练来检测所有以前遇到的C目标类。重要的是,MC模型能够识别属于任何已知C类的测试实例,并且还可以通过将新的或看不见的类实例分类为未知的(用标签0表示)来识别它。然后,可以将未知实例集Ut转发给人类用户,该用户可以识别n个新的感兴趣的类(可能有大量未知目标),并提供它们的训练示例。学习器增量地添加n个新类并更新自己,以生成更新的模型MC+n,而无需从头开始对整个数据集进行再训练。已知的类集也会更新为Kt+1=Kt+{C+1;...;C+n}。这个循环会在目标探测器的整个生命周期中持续,在这个生命周期中,它会用新的知识自适应地更新自身。问题设置在图2的顶部表示。

Figure 2: Approach Overview: Top row: At each incremental learning step, the model identifies unknown objects (denoted by ‘?’), which are progressively labelled (as blue circles) and added to the existing knowledge base (green circles). Bottom row: Our open world object detection model identifies potential unknown objects using an energy-based classification head and the unknown-aware RPN. Further, we perform contrastive learning in the feature space to learn discriminative clusters and can flexibly add new classes in a continual manner without forgetting the previous classes. 图2:方法概述:顶行:在每个增量学习步骤中,模型识别未知目标(用‘?’表示),这些目标被逐步标记(蓝色圆圈)并添加到现有知识库(绿色圆圈)中。底行:我们的开放世界目标检测模型使用基于能量的分类头和未知感知的RPN来识别潜在的未知目标。此外,我们在特征空间中进行对比学习以学习判别性聚类,并可以灵活地以持续的方式添加新类而不会忘记以前的类。

IV ORE: Open World Object Detector 开放世界目标检测器

A successful approach for Open World Object Detection should be able to identify unknown instances without explicit supervision and defy forgetting of earlier instances when labels of these identified novel instances are presented to the model for knowledge upgradation (without retraining from scratch). We propose a solution, ORE which addresses both these challenges in a unified manner.

一个成功的开放世界目标检测方法应该能够在没有明确监督的情况下识别未知实例,并且当这些被识别出的新实例的标签被提供给模型进行知识升级(无需从头开始再训练)时,能够克服对早期实例的遗忘。我们提出了一个解决方案ORE,它以统一的方式解决这两个挑战。
Neural networks are universal function approximators [21], which learn a mapping between an input and the output through a series of hidden layers. The latent representation learned in these hidden layers directly controls how each function is realised. We hypothesise that learning clear discrimination between classes in the latent space of object detectors could have two fold effect. First, it helps the model to identify how the feature representation of an unknown instance is different from the other known instances, which helps identify an unknown instance as a novelty. Second, it facilitates learning feature representations for the new class instances without overlapping with the previous classes in the latent space, which helps towards incrementally learning without forgetting. The key component that helps us realise this is our proposed contrastive clustering in the latent space, which we elaborate in Sec. 4.1.

神经网络是通用函数逼近器[21],它通过一系列隐藏层学习输入和输出之间的映射。在这些隐藏层中学习的潜在表示直接控制每个功能的实现方式。我们假设,在目标检测器的潜在空间中学习类间的清晰区分可能产生双重效果。首先,它帮助模型识别未知实例的特征表示与其他已知实例的区别,从而有助于将未知实例识别为新实例。第二,它有助于学习新类实例的特征表示,而不与潜在空间中的先前类别重叠,从而有助于不遗忘的增量学习。帮助我们实现这一点的关键组件是我们提出的潜在空间对比聚类,我们将在第4.1节中详细阐述。

To optimally cluster the unknowns using contrastive clustering, we need to have supervision on what an unknown instance is. It is infeasible to manually annotate even a small subset of the potentially infinite set of unknown classes. To counter this, we propose an auto-labelling mechanism based on the Region Proposal Network [53] to pseudo-label unknown instances, as explained in Sec. 4.2. The inherent separation of auto-labelled unknown instances in the latent space helps our energy based classification head to differentiate between the known and unknown instances. As elucidated in Sec. 4.3, we find that Helmholtz free energy is higher for unknown instances.

为了使用对比聚类对未知目标进行最佳聚类,我们需要对什么是未知实例有监督信息。即使只是手动标注潜在无限的未知类集合中的一小部分也是不可行的。为了解决这个问题,我们提出了一种基于区域建议网络[53]的自动标记机制来伪标记未知实例,如第4.2节所述。潜在空间中自动标记的未知实例的固有分离有助于我们基于能量的分类头区分已知和未知实例。如第4.3节所述,我们发现未知实例的亥姆霍兹自由能更高。

Fig. 2 shows the high-level architectural overview of ORE. We choose Faster R-CNN [53] as the base detector as Dhamija et al. [7] has found that it has better open set performance when compared against one-stage RetinaNet detector [30] and objectness based YOLO detector [51]. Faster R-CNN [53] is a two stage object detector. In the first stage, a class-agnostic Region Proposal Network (RPN) proposes potential regions which might have an object from the feature maps coming from a shared backbone network. The second stage classifies and adjusts the bounding box coordinates of each of the proposed region. The features that are generated by the residual block in the Region of Interest (RoI) head are contrastively clustered. The RPN and the classification head is adapted to auto-label and identify unknowns respectively. We explain each of these coherent constituent components, in the following subsections:

2显示了ORE的高级架构概述。我们选择Faster R-CNN[53]作为基本检测器,因为Dhamija等人[7]发现,与单级RetinaNet检测器[30]和基于对象的YOLO检测器[51]相比,它具有更好的开集性能。Faster R-CNN[53]是一个两级目标探测器。在第一阶段中,类无关区域建议网络(RPN)提出可能具有来自共享主干网络的特征映射的目标的潜在区域。第二阶段对每个区域的边界框坐标进行分类和调整。对感兴趣区域(RoI)模块其他部分生成的特征进行对比聚类。RPN和分类头分别用于自动标注和识别未知量。我们将在以下小节中解释这些连贯的组成部分:

4.1. Contrastive Clustering对比聚类

Class separation in the latent space would be an ideal characteristic for an Open World methodology to identify unknowns. A natural way to enforce this would be to model it as a contrastive clustering problem, where instances of same class would be forced to remain close-by, while instances of dissimilar class would be pushed far apart.

潜在空间中的类分离是开放世界方法识别未知的理想特征。一种自然的方法是将其建模为一个对比聚类问题,在这个问题中,同一类的实例将被迫保持在附近,而不同类的实例将被推得很远。

For each known class i ∈ Kt, we maintain a prototype vector pi. Let fc ∈ Rd be a feature vector that is generated by an intermediate layer of the object detector, for an object of class c. We define the contrastive loss as follows:

对于每个已知类i ∈ Kt,我们维护一个原型向量pi。设fc ∈ Rd为目标检测器中间层为c类目标生成的特征向量。我们将对比损失定义如下:
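The loss itself (Eqn. 1) did not survive extraction in this excerpt; given the distance D, margin Δ and prototypes pi defined around it (and the original paper), it should read approximately:

ℓcont(fc) = Σ_{i=0..C} ℓ(fc, pi),  where  ℓ(fc, pi) = D(fc, pi) if i = c, and max(0, Δ − D(fc, pi)) otherwise.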

where D is any distance function and Δ defines how close a similar and dissimilar item can be. Minimizing this loss would ensure the desired class separation in the latent space.

其中D是任意距离函数,Δ定义了相似项与不相似项所允许的接近程度。最小化这种损失将确保在潜在空间中实现所需的类分离。
Mean of feature vectors corresponding to each class is used to create the set of class prototypes: P = {p0, ..., pC}. Maintaining each prototype vector is a crucial component of ORE. As the whole network is trained end-to-end, the class prototypes should also gradually evolve, as the constituent features change gradually (as stochastic gradient descent updates weights by a small step in each iteration). We maintain a fixed-length queue qi per class for storing the corresponding features. A feature store Fstore = {q0, ..., qC} stores the class specific features in the corresponding queues. This is a scalable approach for keeping track of how the feature vectors evolve with training, as the number of feature vectors that are stored is bounded by C × Q, where Q is the maximum size of the queue.

每个类对应的特征向量的平均值被用来创建类原型集:P = {p0, ..., pC}。维护每个原型向量是ORE的一个关键组成部分。随着整个网络的端到端训练,类原型也应该随着组成特征的逐渐变化而逐渐演化(因为随机梯度下降在每次迭代中以一小步更新权重)。我们为每个类维护一个固定长度的队列qi,用于存储相应的特征。特征存储Fstore = {q0, ..., qC}将各个类的特征存储在对应的队列中。这是一种可扩展的方法,用于跟踪特征向量如何随训练而演化,因为存储的特征向量的数量以C × Q为上界,其中Q是队列的最大长度。
Algorithm 1 provides an overview on how class prototypes are managed while computing the clustering loss. We start computing the loss only after a certain number of burnin iterations (Ib) are completed. This allows the initial feature embeddings to mature themselves to encode class information. Since then, we compute the clustering loss using Eqn. 1. After every Ip iterations, a set of new class prototypes Pnew is computed (line 8). Then the existing prototypes P are updated by weighing P and Pnew with a momentum parameter η. This allows the class prototypes to evolve gradually keeping track of previous context. The computed clustering loss is added to the standard detection loss and back-propagated to learn the network end-to-end.

算法1概述了在计算集群损失时如何管理类原型。只有在完成一定数量的burnin迭代(Ib)之后,我们才开始计算损耗。这使得初始的特征映射能够逐渐准确,从而对类信息进行编码。从那时起,我们使用公式1计算聚类损失。在每个Ip迭代之后,计算一组新的类原型Pnew(第8行)。然后用动量参数ηPPnew进行加权,更新现有的原型P。这允许类原型逐渐演化,并跟踪以前的上下文信息。将计算出的聚类损失加入到标准检测损失中,并进行反向传播,实现对网络的端到端学习。
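A compact sketch of this procedure (Algorithm 1), written from the description above; class names, default hyper-parameters and the cdist (Euclidean) distance are illustrative rather than taken from the official repository:

import torch

class FeatureStore:
    """Per-class fixed-length queues of RoI features (the Fstore of Sec. 4.1)."""
    def __init__(self, num_classes, queue_size=20):
        self.queues = [[] for _ in range(num_classes)]
        self.queue_size = queue_size

    def add(self, features, labels):
        # features: (N, d) float tensor, labels: (N,) long tensor of class ids in [0, C]
        for f, c in zip(features, labels):
            q = self.queues[int(c)]
            q.append(f.detach())
            if len(q) > self.queue_size:          # keep only the most recent Q features
                q.pop(0)

    def class_means(self, dim, device):
        return [torch.stack(q).mean(0) if q else torch.zeros(dim, device=device)
                for q in self.queues]

def clustering_loss_step(it, feats, labels, store, prototypes,
                         burn_in=1000, update_every=3000, eta=0.99, margin=10.0):
    """One training iteration of Algorithm 1: store features, periodically refresh the
    prototypes with momentum eta, and return the contrastive clustering loss (Eqn. 1)."""
    store.add(feats, labels)
    if it < burn_in:                               # loss stays 0 until the embeddings mature
        return feats.new_zeros(()), prototypes
    new_protos = torch.stack(store.class_means(feats.size(1), feats.device))
    if prototypes is None:
        prototypes = new_protos
    elif it % update_every == 0:                   # smooth momentum update of the prototypes
        prototypes = eta * prototypes + (1 - eta) * new_protos
    dists = torch.cdist(feats, prototypes)         # (N, C+1) distances to all prototypes
    same = torch.zeros_like(dists).scatter_(1, labels.view(-1, 1), 1.0)
    loss = (same * dists + (1 - same) * (margin - dists).clamp(min=0)).sum(dim=1).mean()
    return loss, prototypes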

4.2. Auto-labelling Unknowns with RPN使用RPN自动标记未知

While computing the clustering loss with Eqn. 1, we contrast the input feature vector fc against prototype vectors, which include a prototype for unknown objects too (c ∈ {0, 1, ..., C}, where 0 refers to the unknown class). This would require unknown object instances to be labelled with the unknown ground truth class, which is not practically feasible owing to the arduous task of re-annotating all instances of each image in already annotated large-scale datasets.

在计算聚类损失时(公式1),我们将输入特征向量fc与原型向量进行对比,原型向量中也包括未知目标的原型(c ∈ {0, 1, ..., C},其中0表示未知类)。这将要求未知目标实例带有未知类别的真值标注,而这实际上是不可行的,因为在已经标注好的大规模数据集中重新标注每张图像的所有实例是非常繁重的任务。

As a surrogate, we propose to automatically label some of the objects in the image as a potential unknown object. For this, we rely on the fact that Region Proposal Network (RPN) is class agnostic. Given an input image, the RPN generates a set of bounding box predictions for foreground and background instances, along with the corresponding objectness scores. We label those proposals that have high objectness score, but do not overlap with a ground-truth object as a potential unknown object. Simply put, we select the top-k background region proposals, sorted by its objectness scores, as unknown objects. This seemingly simple heuristic achieves good performance as demonstrated in Sec. 5.

作为替代,我们提出自动将图像中的一些目标标记为潜在的未知目标。为此,我们利用区域建议网络(RPN)与类别无关这一事实。给定一张输入图像,RPN为前景和背景实例生成一组边界框预测,以及相应的目标性(objectness)得分。我们将那些目标性得分很高、但不与任何真值目标框重叠的建议框标记为潜在的未知目标。简单地说,我们选择按目标性得分排序的top-k背景区域建议框作为未知目标。如第5节所示,这个看似简单的启发式方法可以获得很好的性能。
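A sketch of this auto-labelling heuristic (tensor shapes and the torchvision box_iou call are assumptions about how one might wire it up, not the official implementation):

import torch
from torchvision.ops import box_iou

def auto_label_unknowns(proposals, objectness, gt_boxes, top_k=1):
    # proposals: (N, 4) RPN boxes, objectness: (N,) scores, gt_boxes: (M, 4) labelled GT boxes
    if len(gt_boxes) > 0:
        overlaps = box_iou(proposals, gt_boxes).max(dim=1).values
    else:
        overlaps = torch.zeros(len(proposals))
    background = overlaps <= 0.0                   # proposals not overlapping any ground truth
    scores = objectness.clone()
    scores[~background] = float("-inf")
    k = min(top_k, int(background.sum()))
    if k == 0:
        return torch.empty(0, dtype=torch.long)
    return scores.topk(k).indices                  # indices of pseudo-labelled unknown proposals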

4.3. Energy Based Unknown Identifier 基于能量的未知目标识别

Given the features (f ∈ F) in the latent space F and their corresponding labels l ∈ L, we seek to learn an energy function E(F, L). Our formulation is based on Energy Based Models (EBMs) [26] that learn a function E(·) to estimate the compatibility between observed variables F and a possible set of output variables L using a single output scalar, i.e., E(f): Rd → R. The intrinsic capability of EBMs to assign low energy values to in-distribution data and vice-versa motivates us to use an energy measure to characterize whether a sample is from an unknown class.

给定隐空间F中的特征(f ∈ F)及其相应的标签(l ∈ L),我们寻求学习一个能量函数E(F, L)。我们的公式基于基于能量的模型(EBMs)[26],该模型学习一个函数E(·),使用单个输出标量来估计观测变量F和可能的输出变量集L之间的兼容性,即E(f): Rd → R。EBMs向分布内数据分配低能量值(反之亦然)的内在能力,促使我们使用能量度量来表征样本是否来自未知类别。
Specifically, we use the Helmholtz free energy formulation where energies for all values in L are combined

具体来说,我们使用亥姆霍兹自由能公式,其中L中所有值的能量都是组合的
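The free-energy equation (Eqn. 2) did not survive extraction here; based on the surrounding text and the original paper, it should read approximately:

E(f) = −T · log ∫_{l'} exp(−E(f, l') / T)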

where T is the temperature parameter. There exists a simple relation between the network outputs after the softmax layer and the Gibbs distribution of class specific energy values [33]. This can be formulated as

其中T是温度参数。softmax层之后的网络输出与类别特定能量值的Gibbs分布之间存在简单的关系[33]。这可以表述为
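The referenced relation (Eqn. 3) is also missing from this excerpt; it is the standard Gibbs/softmax correspondence, roughly:

p(l|f) = exp(gl(f) / T) / Σ_{i=1..C} exp(gi(f) / T)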

where p(l|f) is the probability density for a label l, gl(f) is the lth classification logit of the classification head g(:). Using this correspondence, we define free energy of our classification models in terms of their logits as follows:

其中pl|f)是标签l的概率密度glf)是分类头g(:)的第l次分类回归。利用这种对应关系,我们用logit定义分类模型的自由能,如下所示:

The above equation provides us a natural way to transform the classification head of the standard Faster R-CNN [53] to an energy function. Due to the clear separation that we enforce in the latent space with the contrastive clustering, we see a clear separation in the energy level of the known class data-points and unknown data-points as illustrated in Fig. 3. In light of this trend, we model the energy distribution of the known and unknown energy values ξkn(f) and ξunk(f), with a set of shifted Weibull distributions. These distributions were found to fit the energy data of a held out validation set very well, when compared to Gamma, Exponential and Normal distributions. The learned distributions can be used to label a prediction as unknown if ξkn(f) < ξunk(f).

上面的公式为我们提供了一种自然的方法,将标准Faster R-CNN[53]的分类头转换为能量函数。由于我们使用对比聚类在潜在空间中实现了清晰的分离,我们看到如图3所示,已知类数据点和未知数据点的能量值有明显的区分。根据这一趋势,我们用一组平移的(shifted)Weibull分布来建模已知和未知能量值ξkn(f)和ξunk(f)的分布。与Gamma分布、指数分布和正态分布相比,这些分布能很好地拟合留出验证集上的能量数据。如果ξkn(f) < ξunk(f),则学习到的分布可用于将预测标记为未知。
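A sketch of how this decision rule could be implemented with scipy (the paper itself uses the Reliability library; the use of scipy.stats.weibull_min here is an assumption for illustration):

import numpy as np
from scipy import stats

def fit_energy_distributions(known_energies, unknown_energies):
    # fit shifted (3-parameter) Weibull distributions to energies from a held-out validation set
    known_params = stats.weibull_min.fit(np.asarray(known_energies))
    unknown_params = stats.weibull_min.fit(np.asarray(unknown_energies))
    return known_params, unknown_params

def is_unknown(energy, known_params, unknown_params):
    # label a prediction as unknown when xi_kn(f) < xi_unk(f)
    xi_kn = stats.weibull_min.pdf(energy, *known_params)
    xi_unk = stats.weibull_min.pdf(energy, *unknown_params)
    return xi_kn < xi_unk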

Figure 3: The energy values of the known and unknown datapoints exhibit clear separation as seen above. We fit a Weibull distribution on each of them and use these for identifying unseen known and unknown samples, as explained in Sec. 4.3.     图3:如上图所示,已知和未知数据点的能量值显示出明显的分离。我们分别对它们拟合Weibull分布,并用这些分布来识别未见的已知样本和未知样本,如第4.3节所述。

4.4. Alleviating Forgetting 减缓遗忘

After the identification of unknowns, an important requisite for an open world detector is to be able to learn new classes, when the labeled examples of some of the unknown classes of interest are provided. Importantly, the training data for the previous tasks will not be present at this stage since retraining from scratch is not a feasible solution. Training with only the new class instances will lead to catastrophic forgetting [39, 10] of the previous classes. We note that a number of involved approaches have been developed to alleviate such forgetting, including methods based on parameter regularization [1, 23, 28, 65], exemplar replay [5, 50, 36, 4], dynamically expanding networks [38, 59, 55] and meta-learning [49, 24].

在识别出未知目标之后,开放世界探测器的一个重要的必要条件是,当提供一些感兴趣的未知类的标记示例时能够学习新的类。重要的是,由于从头开始的再训练不是一个可行的解决方案,因此在此阶段将不提供以前任务的训练数据。仅使用新的类实例进行训练将导致灾难性地忘记以前的类。我们注意到,已经开发了许多相关的方法来缓解这种遗忘,包括基于参数正则化的方法[1232865],范例重放[550364],动态扩展网络[385955]和元学习[4924]

We build on the recent insights from [48, 25, 61] which compare the importance of example replay against other more complex solutions. Specifically, Prabhu et al. [48] retrospects the progress made by the complex continual learning methodologies and show that a greedy exemplar selection strategy for replay in incremental learning consistently outperforms the state-of-the-art methods by a large margin. Knoblauch et al. [25] develops a theoretical justification for the unwarranted power of replay methods. They prove that an optimal continual learner solves an NP-hard problem and requires infinite memory. The effectiveness of storing few examples and replaying has been found effective in the related few-shot object detection setting by Wang et al. [61]. These motivates us to use a relatively simple methodology for ORE to mitigate forgetting i.e., we store a balanced set of exemplars and finetune the model after each incremental step on these. At each point, we ensure that a minimum of Nex instances for each class are present in the exemplar set.

我们基于[48, 25, 61]的最新见解,这些工作将样本回放的重要性与其他更复杂的解决方案进行了比较。具体而言,Prabhu等人[48]回顾了复杂的持续学习方法所取得的进展,并表明增量学习中用于回放的贪婪样本选择策略大幅优于最先进的方法。Knoblauch等人[25]为重放方法出人意料的强大性能提供了理论依据。他们证明了一个最优的持续学习者需要解决一个NP难问题,并且需要无限的内存。Wang等人[61]在相关的少样本目标检测设置中发现了存储少量示例并进行重放的有效性。这促使我们在ORE中使用相对简单的方法来减轻遗忘:我们存储一组类别平衡的范例,并在每个增量步骤之后在这些范例上对模型进行微调。在每个阶段,我们确保范例集中每个类别至少有Nex个实例。
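A naive, image-level sketch of such balanced exemplar selection (the exact selection rule of the authors is not specified in this excerpt, so the data format and strategy below are assumptions):

import random
from collections import defaultdict

def build_exemplar_set(dataset, n_ex=50, seed=0):
    # dataset: iterable of (image_id, list_of_class_ids_present_in_the_image)
    random.seed(seed)
    per_class = defaultdict(list)
    for image_id, labels in dataset:
        for c in set(labels):
            per_class[c].append(image_id)
    exemplars = set()
    for c, ids in per_class.items():
        random.shuffle(ids)
        exemplars.update(ids[:n_ex])   # roughly n_ex instances of class c (image-level approximation)
    return exemplars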

V 实验与结果

We propose a comprehensive evaluation protocol to study the performance of an open world detector to identify unknowns, detect known classes and progressively learn new classes when labels are provided for some unknowns.

我们提出了一个综合评估协议来研究开放世界检测器在为某些未知项提供标签时识别未知项、检测已知类和逐步学习新类的性能。

5.1. Open World Evaluation Protocol 开放世界评估协议

Data split: We group classes into a set of tasks T = {T1, ..., Tt, ...}. All the classes of a specific task will be introduced to the system at a point of time t. While learning Tt, all the classes of {Tτ : τ ≤ t} will be treated as known and {Tτ : τ > t} would be treated as unknown. For a concrete instantiation of this protocol, we consider classes from Pascal VOC [9] and MS-COCO [31]. We group all VOC classes and data as the first task T1. The remaining 60 classes of MS-COCO [31] are grouped into three successive tasks with semantic drifts (see Tab. 1). All images which correspond to the above split from the Pascal VOC and MS-COCO train-sets form the training data. For evaluation, we use the Pascal VOC test split and MS-COCO val split. 1k images from the training data of each task are kept aside for validation. Data splits and codes can be found at https://github.com/JosephKJ/OWOD.

数据分割:我们把类分成一组任务T = {T1, ..., Tt, ...}。特定任务的所有类将在时间点t被引入系统。在学习Tt时,{Tτ : τ ≤ t}中的所有类将被视为已知,而{Tτ : τ > t}中的类将被视为未知。对于该协议的一个具体实例,我们考虑来自Pascal VOC[9]和MS-COCO[31]的类。我们将所有VOC类和数据作为第一个任务T1。MS-COCO[31]剩下的60个类被分为三个连续的、具有语义漂移的任务(见表1)。Pascal VOC和MS-COCO训练集中与上述划分对应的所有图像构成训练数据。对于评估,我们使用Pascal VOC测试集和MS-COCO验证集。从每个任务的训练数据中留出1k张图像用于验证。数据划分和代码见 https://github.com/JosephKJ/OWOD。

Table 1: The table shows task composition in the proposed Open World evaluation protocol. The semantics of each task and the number of images and instances (objects) across splits are shown.    表1:该表说明了提出的开放世界评估协议中的任务组成。 显示了每个任务的语义以及跨拆分的图像和实例(对象)的数量。

Evaluation metrics: Since an unknown object easily gets confused as a known object, we use the Wilderness Impact (WI) metric [7] to explicitly characterise this behavior.

评估指标:由于未知目标很容易与已知目标混淆,因此我们使用Wilderness Impact(WI)指标[7]来明确描述这种行为
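The WI equation itself is missing from this excerpt; from the definitions in the next paragraph (and Dhamija et al. [7]), it reads approximately:

Wilderness Impact (WI) = PK / PK∪U − 1,

computed at a fixed recall level R.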

where PK refers to the precision of the model when evaluated on known classes and PK∪U is the precision when evaluated on known and unknown classes, measured at a recall level R (0.8 in all experiments). Ideally, WI should be less as the precision must not drop when unknown objects are added to the test set. Besides WI, we also use Absolute Open-Set Error (A-OSE) [42] to report the number count of unknown objects that get wrongly classified as any of the known class. Both WI and A-OSE implicitly measure how effective the model is in handling unknown objects.

其中PK是指在已知类上评估时模型的精度,PK∪U是在已知和未知类上评估时的精度,在召回水平R(所有实验中为0.8)下测量。理想情况下,WI应该更小,因为当未知目标被添加到测试集时,精度不能下降。除了WI之外,我们还使用绝对开集误差(A-OSE)[42]来反映错误分类为已知类的未知目标的数量。WIA-OSE都隐式地度量模型在处理未知目标方面的有效性。

In order to quantify incremental learning capability of the model in the presence of new labeled classes, we measure the mean Average Precision (mAP) at IoU threshold of 0.5 (consistent with the existing literature [60, 44]).

为了量化模型在存在新标记类的情况下的增量学习能力,我们测量IoU阈值为0.5时的平均精度均值(mAP)(与现有文献[60, 44]一致)。

5.2. Implementation Details   实施细节

ORE re-purposes the standard Faster R-CNN [53] object detector with a ResNet-50 [19] backbone. To handle a variable number of classes in the classification head, following incremental classification methods [49, 24, 5, 36], we assume a bound on the maximum number of classes to expect, and modify the loss to take into account only the classes of interest. This is done by setting the classification logits of the unseen classes to a large negative value (v), thus making their contribution to the softmax negligible (e−v → 0).

ORE使用Faster R-CNN[53]目标检测器和ResNet-50[19]主干。为了处理分类头中可变数量的类,遵循增量分类方法[4924536],我们假设预期的最大类数有界,并修改损失函数以仅考虑感兴趣的类。这是通过将不可见类的分类logit设置为较大的负值(v)来实现的,从而使它们对softmax的贡献可以忽略不计(e-v→0)。
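A minimal sketch of masking the logits of classes that have not yet been introduced (the constant -1e10 is illustrative):

import torch

def mask_unseen_logits(logits, num_seen, neg_value=-1e10):
    # logits: (N, max_classes); classes >= num_seen have not been introduced yet.
    # Pushing their logits to a large negative value makes exp(logit) ~ 0 in the softmax.
    masked = logits.clone()
    masked[:, num_seen:] = neg_value
    return masked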

The 2048-dim feature vector which comes from the last residual block in the RoI Head is used for contrastive clustering. The contrastive loss (defined in Eqn. 1) is added to the standard Faster R-CNN classification and localization losses and jointly optimised for. While learning a task Ti, only the classes that are part of Ti will be labelled. While testing Ti, all the classes that were previously introduced are labelled along with classes in Ti, and all classes of future tasks will be labelled ‘unknown’. For the exemplar replay, we empirically choose Nex = 50. We do a sensitivity analysis on the size of the exemplar memory in Sec. 6. Further implementation details are provided in supplementary.

利用RoI头中最后一个残差块输出的2048维特征向量进行对比聚类。对比损失(定义见公式1)被添加到标准Faster R-CNN的分类和定位损失中,并进行联合优化。在学习任务Ti时,只有属于Ti的类才会被标注。在测试Ti时,之前引入的所有类都会与Ti中的类一起被标注,而所有未来任务的类都会被标注为“未知”。对于范例回放,我们根据经验选择Nex = 50。我们在第6节对范例内存的大小进行了敏感性分析。补充材料中提供了进一步的实现细节。

5.3. Open World Object Detection Results开放世界目标检测结果

Table 2 shows how ORE compares against Faster RCNN on the proposed open world evaluation protocol. An ‘Oracle’ detector has access to all known and unknown labels at any point, and serves as a reference. After learning each task, WI and A-OSE metrics are used to quantify how unknown instances are confused with any of the known classes. We see that ORE has significantly lower WI and AOSE scores, owing to an explicit modeling of the unknown. When unknown classes are progressively labelled in Task 2, we see that the performance of the baseline detector on the known set of classes (quantified via mAP) significantly deteriorates from 56.16% to 4.076%. The proposed balanced finetuning is able to restore the previous class performance to a respectable level (51.09%) at the cost of increased WI and A-OSE, whereas ORE is able to achieve both goals: detect known classes and reduce the effect of unknown comprehensively. Similar trend is seen when Task 3 classes are added. WI and A-OSE scores cannot be measured for Task 4 because of the absence of any unknown ground-truths. We report qualitative results in Fig. 4 and supplementary section, along with failure case analysis. We conduct extensive sensitivity analysis in Sec. 6 and supplementary section.

2显示了在开放世界评估协议上,ORE与Faster RCNN的比较。“Oracle”检测器可以随时访问所有已知和未知的标签,并作为参考。在学习每个任务之后,WIA-OSE度量用于量化未知实例与任何已知类的混淆程度。我们发现ORE的WIA-OSE分数明显较低,这是由于对未知目标的显式建模。当在任务2中逐步标记未知类时,我们发现基线检测器在已知类集合(通过mAP量化)上的性能从56.16%显著下降到4.076%。提出的平衡微调方法能够以增加WIa-OSE为代价,将前一类的性能恢复到一个可观的水平(51.09%),而ORE能够同时实现两个目标:检测已知类和降低未知类的影响。类似的趋势也出现在任务3类中。由于缺乏任何未知的基本数据,因此无法测量任务4WIA-OSE分数。我们在图4和补充部分中报告了定性结果,以及失效案例分析。我们在第6节和补充部分进行了广泛的敏感性分析。

5.4. Incremental Object Detection Results   增量目标检测结果

We find an interesting consequence of the ability of ORE to distinctly model unknown objects: it performs favorably well on the incremental object detection (iOD) task against the state-of-the-art (Tab. 3). This is because, ORE reduces the confusion of an unknown object being classified as a known object, which lets the detector incrementally learn the true foreground objects. We use the standard protocol [60, 44] used in the iOD domain to evaluate ORE, where group of classes (10, 5 and the last class) from Pascal VOC 2007 [9] are incrementally learned by a detector trained on the remaining set of classes. Remarkably, ORE is used as it is, without any change to the methodology introduced in Sec. 4. We ablate contrastive clustering (CC) and energy based unknown identification (EBUI) to find that it results in reduced performance than standard ORE.

我们发现ORE对未知物体进行清晰建模的能力带来了一个有趣的结果:它在增量目标检测(iOD)任务上的表现优于最先进的方法(表3)。这是因为ORE减少了未知目标被分类为已知目标的混淆,这使得检测器可以增量地学习真实的前景目标。我们使用iOD领域的标准协议[60, 44]来评估ORE,其中Pascal VOC 2007[9]中的一组类(10类、5类和最后1类)由在其余类上训练的检测器增量学习。值得注意的是,ORE是按原样使用的,对第4节中介绍的方法没有任何改变。我们对对比聚类(CC)和基于能量的未知识别(EBUI)进行消融,发现去掉它们会导致性能低于标准ORE。

VI  讨论与分析

6.1 Ablating ORE Components: To study the contribution of each of the components in ORE, we design careful ablation experiments (Tab. 4). We consider the setting where Task 1 is introduced to the model. The auto-labelling methodology (referred to as ALU), combined with energy based unknown identification (EBUI) performs better together (row 5) than using either of them separately (row 3 and 4). Adding contrastive clustering (CC) to this configuration, gives the best performance in handling unknown (row 7), measured in terms of WI and A-OSE. There is no severe performance drop in known classes detection (mAP metric) as a side effect of unknown identification. In row 6, we see that EBUI is a critical component whose absence increases WI and A-OSE scores. Thus, each component in ORE has a critical role to play for unknown identification.

6.1 消融研究:为了研究ORE中各组件的贡献,我们设计了细致的消融实验(表4)。我们考虑将任务1引入模型的设置。自动标记方法(称为ALU)与基于能量的未知识别(EBUI)相结合(第5行)比单独使用其中任何一种方法(第3行和第4行)效果更好。将对比聚类(CC)添加到这个配置中,可以在处理未知目标(第7行)时获得以WI和A-OSE衡量的最佳性能。作为未知识别的副作用,已知类检测(mAP指标)并没有出现严重的性能下降。在第6行中,我们看到EBUI是一个关键组件,它的缺失会增加WI和A-OSE得分。因此,ORE的每个组件都对未知目标的识别起着至关重要的作用。

6.2 Sensitivity Analysis on Exemplar Memory Size: Our balanced finetuning strategy requires storing exemplar images with at least Nex instances per class. We vary Nex while learning Task 2 and report the results in Table 5. We find that balanced finetuning is very effective in improving the accuracy of previously known class, even with just having minimum 10 instances per class. However, we find that increasing Nex to large values does-not help and at the same time adversely affect how unknowns are handled (evident from WI and A-OSE scores). Hence, by validation, we set Nex to 50 in all our experiments, which is a sweet spot that balances performance on known and unknown classes.

6.2 对示例内存大小的敏感性分析:我们的平衡微调策略要求存储每个类至少有Nex实例的示例图像。我们在学习任务2时改变Nex,并在表5中报告结果。我们发现平衡微调在提高以前已知类的准确性方面非常有效,即使每个类只有至少10个实例。然而,我们发现,将Nex增加到大值并不会有帮助,同时也会对处理未知事件的方式产生不利影响(从WIA-OSE得分中可以看出)。因此,通过验证,我们在所有的实验中将Nex设置为50,这是平衡已知和未知类性能的一个好方法。

6.3 Comparison with an Open Set Detector: The mAP values of the detector when it is evaluated on closed set data (trained and tested on Pascal VOC 2007) and open set data (test set contains equal number of unknown images from MS-COCO) helps to measure how the detector handles unknown instances. Ideally, there should not be a performance drop. We compare ORE against the recent open set detector proposed by Miller et al. [42]. We find from Tab. 6 that drop in performance of ORE is much lower than [42] owing to the effective modelling of the unknown instances.

6.3  与开放集检测器的比较:在封闭集数据(在Pascal VOC 2007上训练和测试)和开放集数据(测试集包含来自MS-COCO的相同数量的未知图像)上评估检测器时,检测器的mAP值有助于测量检测器如何处理未知实例。理想情况下,不应出现性能下降。我们将OREMiller等人[42]最近提出的开集检测器进行了比较。由表6可知,由于对未知实例的有效建模,ORE性能的下降远低于[42]

6.4 Time and Storage Expense: The training and inference of ORE takes an additional 0.1349 sec/iter and 0.009 sec/iter than standard Faster R-CNN. The storage expense for maintaining FStore is negligible, and the exemplar memory (for Nex = 50) takes approximately 34 MB.

6.4  时间和存储费用:与标准Faster R-CNN相比,ORE的训练和推理分别多花费0.1349秒/iter和0.009秒/iter。维护FStore的存储开销可以忽略不计,范例内存(Nex = 50时)大约需要34 MB。

6.5 Clustering loss and t-SNE [37] visualization: We visualise the quality of clusters that are formed while training with the contrastive clustering loss (Eqn. 1) for Task 1. We see nicely formed clusters in Fig. 5 (a). Each number in the legend correspond to the 20 classes introduced in Task 1. Label 20 denotes unknown class. Importantly, we see that the unknown instances also gets clustered, which reinforces the quality of the auto-labelled unknowns used in contrastive clustering. In Fig. 5 (b), we plot the contrastive clustering loss against training iterations, where we see a gradual decrease, indicative of good convergence.

6.5  聚类损失和t-SNE可视化[37]:我们在任务1的对比聚类损失(等式1)训练时,观察了形成的聚类的质量。我们在图5(a)中看到了聚类良好的聚类。 图例中的每个数字对应于任务1中引入的20个类别。标签20表示未知类别。 重要的是,我们看到未知实例也被聚类了,这增强了对比聚类中使用的自动标记的未知数的质量。 在图5(b)中,我们绘制了针对训练迭代的对比聚类损失,其中我们看到了逐渐减小的趋势,这表明收敛性良好。

VII  结论

The vibrant object detection community has pushed the performance benchmarks on standard datasets by a large margin. The closed-set nature of these datasets and evaluation protocols, hampers further progress. We introduce Open World Object Detection, where the object detector is able to label an unknown object as unknown and gradually learn the unknown as the model gets exposed to new labels. Our key novelties include an energy-based classifier for unknown detection and a contrastive clustering approach for open world learning. We hope that our work will kindle further research along this important and open direction.

充满活力的目标检测社区已将标准数据集上的性能基准大幅提高。这些数据集和评估协议的封闭集性质阻碍了进一步的进展。我们引入了开放世界目标检测,其中目标检测器能够将未知目标标记为未知,并随着模型接触到新的标签而逐渐学习这些未知目标。我们的主要创新点包括用于未知检测的基于能量的分类器和用于开放世界学习的对比聚类方法。我们希望我们的工作能够推动沿着这一重要而开放的方向的进一步研究。

补充材料

A.Varying the Queue Size of FStore  A.改变FStore的队列大小

In Sec. 4.1, we explain how class specific queues qi are used to store the feature vectors, which are used to compute the class prototypes. A hyper-parameter Q controls the size of each qi. Here we vary Q, while learning Task 1, and report the results in Tab. 7. We observe relatively similar performance, across experiments with different Q values. This can be attributed to the fact that after a prototype is defined, it gets periodically updated with newly observed features, thus effectively evolving itself. Hence, the actual number of features used to compute those prototypes (P and Pnew) is not very significant. We use Q = 20 for all the experiments

在4.1我们解释了如何使用类特定的队列qi来存储用于计算类原型的特征向量。超参数Q控制每个qi的大小。在这里,我们在学习任务1的同时改变Q,并在表7中报告结果。在不同Q值的实验中,我们观察到了相对相似的性能。这可以归因于这样一个事实:在定义了一个原型之后,它会周期性地用新观察到的特征进行更新,从而有效地自我进化。因此,用于计算这些原型(PPnew)的特征的实际数量不是很重要。我们用Q=20做所有的实验。

B. Sensitivity Analysis on η η敏感性分析

The momentum parameter η controls how rapidly the class prototypes are updated, as elaborated in Algorithm 1. Larger values of η imply smaller effect of the newly computed prototypes on the current class prototypes. We find from Tab. 8 that performance improves when prototypes are updated slowly (larger values of η). This result is intuitive, as slowly changing the cluster centers helps stabilize contrastive learning

动量参数η控制类原型的更新速度,如算法1中所详述。较大的η表示新计算的原型对当前类原型的影响较小。从表8可以看出,当原型更新缓慢(η的值较大)时,性能会提高。这个结果很直观,因为缓慢改变聚类中心有助于稳定对比学习。

C. Varying the Margin (Δ) in Lcont  在Lcont中改变边距(Δ)

The margin parameter ∆ in the contrastive clustering loss Lcont (Eqn. 1) defines the minimum distance that an input feature vector should keep from dissimilar class prototypes in the latent space. As we see in Tab. 9, increasing the margin while learning the first task increases the performance on the known classes and improves how unknown classes are handled. This would imply that larger separation in the latent space is beneficial for ORE.

对比聚类损失Lcont(公式1)中的边距参数Δ定义了输入特征向量与潜在空间中不同类原型之间应保持的最小距离。如表9所示,在学习第一个任务时增大边距,可以提高已知类的性能以及对未知类的处理效果。这意味着潜在空间中更大的分隔对ORE是有利的。

D. Varying the Temperature (T) in Eqn. 4  公式4改变温度

We fixed the temperature parameter (T) in Eqn. 4 to 1 in all the experiments. Softening the energies a bit more to T = 2, gives slight improvement in unknown detection, however increasing it further hurts as evident from Tab. 10.

我们在所有实验中将公式4中的温度参数(T)固定为1。将温度提高到T = 2使能量更加平滑,可以使未知检测略有改善,但从表10可以明显看出,进一步增大T反而有害。

E. More Details on Contrastive Clustering   对比聚类的更多详细信息

The motivation for using contrastive clustering to ensure separation in the latent space is two-fold: 1) it enables the model to cluster unknowns separately from known instances, thus boosting unknown identification; 2) it ensures instances of each class are well-separated from other classes, alleviating the forgetting issue.

使用对比聚类以确保在潜在空间中分离的动机有两个:1)它使模型能够将未知类与已知实例分开进行聚类,从而增强了未知性的识别; 2)确保每个类的实例与其他类完全分开,从而减轻了遗忘问题。

The 2048-dim feature vector that comes out from residual blocks of RoI head (Fig 6) is contrastively clustered. The contrastive loss is added to the Faster R-CNN loss and the entire network is trained end-to-end. Thus all parts of the network before and including the residual block in the RoI head in the Faster R-CNN pipeline will get updated with the gradients from the contrastive clustering loss.

从RoI头的残差块(图6)中得到的2048维特征向量被用于对比聚类。对比损失被添加到Faster R-CNN损失中,整个网络进行端到端训练。因此,Faster R-CNN流程中RoI头残差块之前(包括残差块在内)的网络所有部分都将使用来自对比聚类损失的梯度进行更新。

F. Further Implementation Details 更多实施细节

We complete the discussion related to the implementation details that we had in Sec. 5.2 here. We ran our experiments on a server with 8 Nvidia V100 GPUs with an effective batch size of 8. We use SGD with a learning rate of 0.01. Each task is learned for 8 epochs (∼ 50k iterations). The queue size of the feature store is set to 20. We initiate clustering after 1k iterations and update the cluster prototypes after each 3k iterations with a momentum parameter of 0.99. Euclidean distance is used as the distance function D in Eqn. 1. The margin (∆) is set as 10. For auto-labelling the unknowns in the RPN, we pick the top-1 background proposal, sorted by its objectness score. The temperature parameter in the energy based classification head is set to 1. The code is implemented in PyTorch [43] using Detectron 2 [62]. The Reliability library [52] was used for modelling the energy distributions. We release all our code publicly to foster reproducible research: https://github.com/JosephKJ/OWOD.

5.2节完成了与第二节中有关实现细节的讨论。我们在装有8个Nvidia V100 GPU的服务器上进行了实验,有效批处理大小为8。我们使用SGD,学习率为0.01。每个任务学习8个时期(约5万次迭代)。要素存储的队列大小设置为20。我们在1k次迭代后启动聚类,并在每3k次迭代后更新动量参数为0.99的聚类原型。欧几里得距离用作方程式1中的距离函数D。边距(∆)设置为10。对于自动标记RPN中的未知数,我们选择top-1背景建议框,并按其客观性得分进行排序。基于能量的分类头中的温度参数设置为1。该代码在PyTorch [43]中使用Detectron 2 [62]实现。可靠性库[52]用于对能量分布进行建模。我们公开发布了所有代码以促进可重复的研究:https://github.com/JosephKJ/OWOD。

G. Related Work on Incremental Object Detection 增量物体检测的相关工作

The class-incremental object detection (iOD) setting considers classes to be observed incrementally over time and that the learner must adapt without retraining on old classes from scratch. The prevalent approaches [60, 27, 17, 6] use knowledge distillation [20] as a regularization measure to avoid forgetting old class information while training on new classes. Specifically, Shmelkov et al. [60] repurpose Fast R-CNN for incremental learning by distilling classification and regression outputs from a previous stage model. Beside distilling model outputs, Chen et al. [6] and Li et al. [27] also distilled the intermediate network features. Hao et al. [17] builds on Faster R-CNN and uses a student-teacher framework for RPN adaptation. Recently, Peng et al. [44] introduces an adaptive distillation technique into Faster R-CNN. Their methodology is the current state-of-the-art in iOD. These methods cannot however work in an Open World environment, which is the focus of this work, and are unable to identify unknown objects.

类增量目标检测(iOD)设置认为类将随着时间的推移而逐步递增,并且学习者必须进行调整,而不必从头开始对旧类进行重新训练。普遍的方法[60,27,17,6]使用知识提炼[20]作为一种正则化措施,以避免在训练新类时忘记旧类信息。具体来说,Shmelkov等[60]通过提炼前阶段模型的分类和回归输出,将Fast R-CNN重新用于增量学习。除了提炼模型输出,Chen等人 [6]和李等[27]也提炼了中间网络的特征。郝等[17]建立在Faster R-CNN之上,并使用学生-教师框架来适应RPN。最近,Peng等[44]将自适应蒸馏技术引入Faster R-CNN。他们的方法是当前iOD的最新技术。但是,这些方法无法在开放世界环境(这是本文的重点)中工作,并且无法识别未知目标。

H. Using Softmax based Unknown Identifier使用基于Softmax的未知标识符

We modified the unknown identification criteria to max(softmax(logits)) < t. For t = {0.3, 0.5, 0.7}: A-OSE, WI and mAP (mean and std-dev) are 11815 ± 352.13, 0.0436 ± 0.009 and 55.22 ± 0.02. This is inferior to ORE.

我们将未知识别标准修改为max(softmax(logits))<t。 对于t = {0.3,0.5,0.7}:A-OSE,WI和mAP(平均值和标准差)分别为11815±352.13、0.0436±0.009和55.22±0.02。 这不如ORE。

I. Qualitative Results  定性结果

We show qualitative results of ORE in Fig. 8 through Fig. 13. We see that ORE is able to identify a variety of unknown instances and incrementally learn them, using the proposed contrastive clustering and energy-based unknown identification methodology. Sub-figure (a) in all these images shows the identified unknown instances along with the the other instances known to the detector. The corresponding sub-figure (b), shows the detections from the same detector after the new classes are incrementally added.

我们在图8到图13中显示了ORE的定性结果。我们看到ORE使用所提出的对比聚类和基于能量的未知识别方法,能够识别各种未知实例并逐步学习它们。 所有这些图像中的子图(a)显示了已识别的未知实例以及检测器已知的其他实例。 相应的子图(b)显示了在逐渐增加新类别后来自同一检测器的检测结果。

J. Discussion Regarding Failure Cases关于失败案例的讨论

Occlusions and crowding of objects are cases where our method tends to get confused (external-storage, Walkman and bag not detected as unknown in Figs. 11, 13). Difficult viewpoints (such as backside) also lead to some misclassifications (giraffe→horse in Figs. 4, 12). We have also noticed that detecting small unknown objects co-occurring with larger known objects is hard. As ORE is the first effort in this direction, we hope these identified shortcomings would be the basis of further research.

目标的遮挡和拥挤是我们的方法容易混淆的情况(图11、13中,移动硬盘、随身听和书包未被检测为未知目标)。困难的视角(例如背面)也会导致一些错误分类(图4、12中的长颈鹿被误判为马)。我们还注意到,检测与较大已知目标同时出现的小型未知目标很困难。由于ORE是朝这个方向迈出的第一步,我们希望这些已发现的缺点能成为进一步研究的基础。
