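Table of contents

0. Paper at a glance
- 0.1 Paper information
- 0.2 Overview
  - 0.2.1 What the paper studies
  - 0.2.2 Evaluation
1.Abstract
- 1.1 Sentence-by-sentence translation
- 1.2 Summary
2.INTRODUCTION
- 2.1 Sentence-by-sentence translation
  - Paragraph 1 (limitations of other indoor positioning technologies)
  - Paragraph 2 (positions from image retrieval are not very accurate; geometric matching works better)
  - Paragraph 3 (contributions of this paper)
  - Paragraph 4 (structure of this paper)
3. Related Work
- 3.1 Sentence-by-sentence translation
  - Paragraph 1 (overview)
  - Paragraph 2 (breakdown)
  - Paragraph 3 (structure-based localization methods)
  - Paragraph 4 (image-based localization methods)
  - Paragraph 5 (learning-based localization methods)
  - Paragraph 6 (learning-based localization methods; this paper uses a CNN)
  - Paragraph 7 (this paper's image retrieval strategy)
3. System Overview and Methods 系统概述及方法
- 3.1 Sentence-by-sentence translation
  - Paragraph 1 (overview of this section)
  - 3.1. System Architecture 系统架构
    - Paragraph 1 (system architecture of the paper's image-based localization)
  - 3.2. Data Preparation 数据准备
    - Paragraph 1 (database structure)
    - Paragraph 2 (selecting images for feature matching)
    - Paragraph 3 (image set design)
  - 3.3. CNN-Based Image Retrieval 基于cnn的图像检索
    - 3.3.1. Deep Convolutional Neural Networks (DCNNs) 深度卷积神经网络(DCNNs)
      - Paragraph 1 (structure of VGG16)
      - Paragraph 2 (introducing networks pre-trained on ImageNet)
    - 3.3.2. Deep Features Extracted by CNNs cnn提取的深度特征
      - Paragraph 1 (extracting deep features from the CNN)
      - Paragraph 2 (extracting features from images)
    - 3.3.3. Image Retrieval Using Deep Features 基于深度特征的图像检索
  - 3.4. Pose Estimation 姿态估计
    - 3.4.1. Feature Detection and Matching 特征检测与匹配
    - 3.4.2. Motion from Image Feature Correspondences 从图像特征对应的运动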

Follow-up
(Long posts make the editor very laggy, so my notes on this paper are split into two parts.)

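0. Paper at a glance

0.1 Paper information

Title: Indoor Visual Positioning Aided by CNN-Based Image Retrieval: Training-Free, 3D Modeling-Free
Source: a paper listed on Papers with Code
Code link

0.2 Overview

0.2.1 What the paper studies

0.2.2 Evaluation

What I gained from this paper:
1. The method is good and worth borrowing: it uses image retrieval for indoor localization.
2. The experimental design is fairly complete.
Rating: ⭐⭐⭐⭐⭐
Worth rereading: √ (the code is open source, and the topic matches the direction I want to research)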

1.Abstract

1.1 Sentence-by-sentence translation

Indoor localization is one of the fundamentals of location-based services (LBS) such as seamless indoor and outdoor navigation, location-based precision marketing, spatial cognition of robotics, etc.
室内定位是基于位置的服务(LBS)的基础之一，如无缝室内外导航、基于位置的精准营销、机器人空间认知等。

Visual features take up a dominant part of the information that helps human and robotics understand the environment, and many visual localization systems have been proposed.
视觉特征占据了人类和机器人理解环境信息的主要部分，许多视觉定位系统已经被提出。

However, the problem of indoor visual localization has not been well settled due to the tough trade-off of accuracy and cost.
然而，由于精度和成本的权衡，室内视觉定位问题一直没有得到很好的解决。

To better address this problem, a localization method based on image retrieval is proposed in this paper, which mainly consists of two parts.
为了更好地解决这一问题，本文提出了一种基于图像检索的定位方法，该方法主要由两部分组成。

The first one is CNN-based image retrieval phase, CNN features extracted by pre-trained deep convolutional neural networks (DCNNs) from images are utilized to compare the similarity, and the output of this part are the matched images of the target image.
首先是基于CNN的图像检索阶段，利用预训练好的深度卷积神经网络(deep convolutional neural networks, DCNNs)从图像中提取的CNN特征进行相似性比较，该部分的输出为目标图像的匹配图像。

The second one is pose estimation phase that computes accurate localization result.
第二阶段是姿态估计阶段，计算精确的定位结果。

Owing to the robust CNN feature extractor, our scheme is applicable to complex indoor environments and easily transplanted to outdoor environments.
由于具有鲁棒的CNN特征提取器，我们的方案适用于复杂的室内环境，并且易于移植到室外环境。

The pose estimation scheme was inspired by monocular visual odometer, therefore, only RGB images and poses of reference images are needed for accurate image geo-localization.
姿态估计方案受到单目视觉里程计的启发，因此仅需要RGB图像和参考图像的姿态即可实现精确的图像地理定位。

Furthermore, our method attempts to use lightweight datum to present the scene.
此外，我们的方法尝试使用轻量级的数据来呈现场景。

To evaluate the performance, experiments are conducted, and the result demonstrates that our scheme can efficiently result in high location accuracy as well as orientation estimation.
为了评价该方法的性能，进行了实验，结果表明该方法可以有效地获得较高的定位精度和方向估计。

Currently the positioning accuracy and usability enhanced compared with similar solutions.
目前与同类解决方案相比，定位精度和可用性均有提高。

Furthermore, our idea has a good application foreground, because the algorithms of data acquisition and pose estimation are compatible with the current state of data expansion.
此外，我们的想法具有良好的应用前景，因为数据采集和姿态估计算法与当前数据扩展的状态相适应

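1.2 Summary

The paper uses image retrieval for indoor localization. The first stage retrieves similar images with a neural network; the second stage estimates the pose to obtain the location. Only images are needed for localization, which is the novel point.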

2.INTRODUCTION

2.1 Sentence-by-sentence translation

Paragraph 1 (limitations of other indoor positioning technologies)

The increasing demand of location-based services (LBS) in recent years inspires the desire for accurate position information.
近年来，基于位置服务(LBS)的需求不断增长，激发了人们对准确位置信息的渴望。

The most common way for positioning with cell phone and other mobile platforms is GNSS (Global Navigation Satellite System).
手机和其他移动平台最常用的定位方式是全球卫星导航系统(GNSS)。

However, in most of the time, GNSS is only available for the outdoor environment.
然而，在大多数情况下，GNSS仅适用于室外环境。

When it comes to indoor environment, GNSS signals are mostly blocked by obstacles.
在室内环境中，GNSS信号大多被障碍物阻挡。

In recent years, a number of alternative technologies have been proposed for indoor positioning.
近年来，人们提出了许多室内定位的替代技术。

Most of indoor positioning methods are focused on fingerprinting-based localization algorithms which are infrastructure-free [1–3].
大多数室内定位方法都集中在基于指纹的定位算法上，这些算法不需要基础设施[1-3]。

In these methods, Wi-Fi received signal strengths (RSS) or magnetic field strengths (MFS) are collected and will be compared with data in a fingerprinting database during positioning period.
在这些方法中，收集Wi-Fi接收信号强度(RSS)或磁场强度(MFS)，并在定位期间与指纹数据库中的数据进行比较。

This fingerprinting-based system is easy to establish and can achieve fine localization performance in the short term.
该系统易于建立，可在短期内实现良好的定位性能。

Nevertheless, signal patterns will change over time due to the environment changes, which makes it hard to maintain the positioning performance.
然而，信号模式会因环境变化而随时间改变，这就很难保持定位性能。

Additionally, construction of fingerprint database is time-consuming and labor-intensive.
此外，指纹数据库的构建耗时耗力。

To overcome the defects of this scheme, many alternatives have been proposed, including Optical [4,5], RFID (Radio Frequency Identification) [6], Bluetooth Beacons [7], ZigBee [8,9], Pseudo Satellite [10,11],etc.
为了克服该方案的缺陷，人们提出了许多替代方案，包括光学[4,5]，RFID(射频识别)[6]，蓝牙信标[7]，ZigBee[8,9]，伪卫星[10,11]等。

Whereas the accuracy is not enough in intricate indoor environment, and these solutions may need artificial setting and additional infrastructures which may bring unbearable costs.
然而，在复杂的室内环境中，精度是不够的，这些解决方案可能需要人工设置和额外的基础设施，这可能会带来难以承受的成本。

Paragraph 2 (positions from image retrieval are not very accurate; geometric matching works better)

There are also some previous attempts at indoor visual positioning.
之前也有一些室内视觉定位的尝试。

Recognition-based image geo-localization methods are quite similar with the problem of image classification in computer vision, in which global or region features are used for image matching [12–15].
基于识别的图像地理定位方法与计算机视觉中的图像分类问题非常相似，都是利用全局或区域特征进行图像匹配[12-15]。

In image classification issue, similar images are labeled as the same category.
在图像分类问题中，相似的图像被标记为同一类别。

Regarding the visual localization problem, relative images are identified as sharing similar geo-location information.
对于视觉定位问题，相关图像被认为具有相似的地理位置信息。

As for recognition-based method, location of the target image is estimated by retrieving related images or scene classification [16–19].
基于识别的方法是通过检索相关图像或场景分类来估计目标图像的位置[16-19]。

Recognition-based methods apply an image retrieval strategy or a scene classification strategy at first,subsequently the location of the query image is estimated based on the localization information of the associate retrieved images or the classification labels.
基于识别的方法首先采用图像检索策略或场景分类策略，然后根据关联检索图像的定位信息或分类标签估计查询图像的位置。

However, the mentioned methods above generally provide a rather coarse estimation of location, which hardly satisfies the need of accurate LBS.
然而，上述方法通常提供的位置估计相当粗糙，难以满足精确LBS的需要。

Geometric matching-based methods represent the scenes by geo-referenced 3D models, and then, estimate the pose of query image by directly matching 2D image features to 3D models or by matching 3D image features to 3D models when depth information is available.
基于几何匹配的方法通过地理参考的三维模型来表示场景，然后通过直接将2D图像特征与3D模型进行匹配，或者在有深度信息的情况下将3D图像特征与3D模型进行匹配来估计查询图像的姿态。

These approaches typically come with estimation of 6 degrees of freedom (DoF) camera parameters.
这些方法通常需要估计6个自由度(DoF)相机参数。

However, geometric matching-based methods still have many challenges, which can be concluded as follows: (1) Superior difficulty in constructing high fidelity RGB-D scene models as well as employing 2D-to-3D matching for textured 3D models scheme; and (2) as for non-RGB point-only models scheme, the problem of geometric alignment between the query images and 3D point models can be hard to settle.
然而，基于几何匹配的方法仍存在诸多挑战，主要表现在:(1)在构建高保真度RGB-D场景模型和纹理三维模型采用2d -3D匹配方案方面存在较大难度;(2)对于非rgb纯点模型方案，查询图像与三维点模型的几何对齐问题难以解决。

Paragraph 3 (contributions of this paper)

To overcome the limitations of recognition-based methods and geometric matching-based methods, a combination of these two strategies has been proposed in the devised scheme.
为了克服基于识别方法和基于几何匹配方法的局限性，在设计方案中提出了两种策略的结合。

In this paper, we demonstrate an image-based indoor localization scheme which is capable of not only achieving sub-meter level positioning accuracy but also determining orientations.
在本文中，我们展示了一种基于图像的室内定位方案，该方案不仅能够达到亚米级的定位精度，而且能够确定方向。

At the same time, the proposed scheme merely uses RGB images in the course of the online localization period and a server is used to host the image database for the computing operation.
同时，该方案在在线定位过程中仅使用RGB图像，并使用服务器托管图像数据库进行计算操作。

The main contributions of this paper can be concluded as follows:
本文的主要贡献如下:

(1) Inspired by the visual spatial cognition ability of human, an image-based visual positioning scheme is proposed. The target image is matched with database images to get the most similar image for localization computing.
借鉴人类视觉空间认知能力，提出了一种基于图像的视觉定位方案。将目标图像与数据库图像进行匹配，得到最相似的图像进行定位计算。

(2) Our visual localization algorithm is 3D-modeling-free. Compared with visual localization methods that combine with image retrieval and image pose estimation from regional 3D reconstruction, regional 3D modeling is unrequired in our scheme since we recover camera pose from two sets of 2D-to-2D matches.
我们的视觉定位算法不需要3d建模。与基于区域三维重建的图像检索和图像姿态估计相结合的视觉定位方法相比，我们的方案不需要区域三维建模，因为我们从两组2d到2d匹配中恢复相机姿态。

(3) Our spatial model is training-free for different scenarios. Owing to pre-trained deep learning models are stable and can be used as powerful feature extractors, we apply deep convolutional neural network (DCNN) pre-trained on ImageNet to extract features to represent images, thus we need not train a unique model for a specific scene.
我们的空间模型对于不同的场景是不需要训练的。由于预训练的深度学习模型是稳定的，可以作为强大的特征提取器，我们使用在ImageNet上预训练的深度卷积神经网络(deep convolutional neural network, DCNN)来提取特征来表示图像，因此我们不需要针对特定场景训练一个唯一的模型。

(4) For localization purpose, we use a lighter model to represent the scene. CNN features extracted from images of database can represent the scene in image retrieval phase. Compared with CNN learning-based visual localization methods that require a large number of images during model training, much fewer images are required to represent the same scene in our scheme.
为了定位，我们使用一个更轻的模型来表示场景。从数据库图像中提取的CNN特征可以代表图像检索阶段的场景。与基于CNN学习的视觉定位方法在模型训练过程中需要大量的图像相比，我们的方案需要更少的图像来表示相同的场景。

Paragraph 4 (structure of this paper)

The paper proceeds as follows: Section 2 provides a brief overview of related work.
本文的主要内容如下:第二节简要概述了相关工作。

The system architecture and methods are described in detail in Section 3.
第3节详细描述了系统架构和方法。

Experiments and performance evaluations are presented in Section 4. Sections 5 and 6 are discussion and suggestions for future work.
实验和性能评估在第4节中提出。第5节和第6节是对今后工作的讨论和建议。

3. Related Work

3.1 Sentence-by-sentence translation

Paragraph 1 (overview)

The work presented in this paper relates to many fields, such as visual localization, image retrieval, and visual pose estimation.
本文所做的工作涉及到视觉定位、图像检索和视觉姿态估计等多个领域

Paragraph 2 (breakdown)

At present, visual localization systems can be roughly divided into three categories.
目前，视觉定位系统大致可以分为三类。

Paragraph 3 (structure-based localization methods)

Structure-based localization methods are the most common visual localization methods that utilize local features to estimate 2D-to-3D matches between features in a query image and points in 3D models, or employ 3D-to-3D matches between RGB-D images and 3D models.
基于结构的定位方法是最常见的视觉定位方法，它利用局部特征来估计查询图像中的特征与3D模型中的点之间的2d到3D匹配，或者使用RGB-D图像与3D模型之间的3D到3D匹配。

Then camera pose will be estimated from the correspondence.
然后根据对应关系估计相机姿态。

Similarly, Torsten et al. [20] compared 2D image-based localization with 3D structure-based localization, and they drew a conclusion that purely 2D-based methods achieve the lowest localization and 3D-based methods offer more precise pose estimation with more complex model construction and maintenance.
同样，Torsten等[20]将基于2D图像的定位与基于3D结构的定位进行了比较，得出结论:单纯基于2D的方法定位效果最低，而基于3D的方法姿态估计更精确，但模型构建和维护更复杂。

They proposed a combination of 2D-based methods with local structure-from-motion (SfM) reconstruction which has both a simple database construction procedure and accurate pose estimation.
他们提出了将基于2d的方法与局部运动结构(SfM)重建相结合的方法，该方法既具有简单的数据库构建过程，又具有准确的姿态估计。

However, the drawback of their method is significantly longer run-time during the location process.
然而，该方法的缺点是在定位过程中运行时间明显较长。

Paragraph 4 (image-based localization methods)

Image-based localization methods were pushed by massive repositories of public geo-labeled images.
基于图像的定位方法是由大量公共地理标记图像库推动的。

These methods employ an image retrieval-based strategy [16–19], which match the query image with images from the database.
这些方法采用基于图像检索的策略[16-19]，将查询图像与数据库中的图像进行匹配。

Afterward the location of the query image is computed based on the pose information of the retrieved reference images [21–23].
然后根据检索到的参考图像的位姿信息计算查询图像的位置[21-23]。

Owing to the prosperity of social network and street view photos, quantity of images with geo-tags has emerged which can be used for reference to these data-driven image-based localization methods.
随着社交网络和街景照片的蓬勃发展，大量带有地理标签的图像应运而生，这些数据驱动的基于图像的定位方法可供参考。

Image retrieval is a visual search task that searches and retrieves images from a large database of digital images, which is commonly used in many image-based localization methods.
图像检索是一种从庞大的数字图像数据库中搜索和检索图像的视觉搜索任务，在许多基于图像的定位方法中都很常用。

Conventional methods retrieve images based on local descriptor matching and reorder with elaborate spatial verification [24–26].
传统方法基于局部描述符匹配和精细空间验证的重新排序来检索图像[24-26]。

Content based image retrieval search for images relies on visual content such as edges, colors, textures, and shape [27].
基于内容的图像检索搜索依赖于视觉内容，如边缘、颜色、纹理和形状[27]。

Recent works leverage deep convolution neural networks for image retrieval, the majority of them use a pre-trained network as local feature extractor.
最近的研究利用深度卷积神经网络进行图像检索，其中大多数使用预训练的网络作为局部特征提取器。

Moreover, some work even can address the problem of geometric invariance of CNN features [28,29], and to accurately represent images of different sizes and aspects ratios [30,31].
此外，一些工作甚至可以解决CNN特征的几何不变性问题[28,29]，并准确地表示不同尺寸和宽高比的图像[30,31]。

Paragraph 5 (learning-based localization methods)

Learning-based localization methods emerged in the past few years, which benefited from the dramatic progress made in a variety of computer vision tasks.
基于学习的定位方法是在过去几年出现的，它得益于各种计算机视觉任务的巨大进步。

By training models from given images with pose information, scenes can be represented by these learned models.
通过从给定的图像中训练具有姿态信息的模型，可以用这些学习到的模型来表示场景。

These learning-based localization methods either predict matches for pose estimation [32–35] or directly regress the camera pose such as PoseNet [36], PoseNet2 [37], and VlocNet [38].
这些基于学习的定位方法要么预测姿势估计的匹配[32-35]，要么直接回归相机姿势，如PoseNet[36]、PoseNet2[37]和VlocNet[38]。

PoseNet was the first approach to use DCNNs to solve the metric localization problem, and then Bayesian CNN implementation was utilized to address the pose uncertainty [39].
PoseNet是第一个使用DCNNs解决度量定位问题的方法，然后使用贝叶斯CNN实现来解决位姿不确定性[39]。

After that, architectures such as long-short term memory (LSTM) [40–42] and symmetric encoder-decoder [43] were utilized to facilitate the performance of DCNNs.
之后，长短期记忆(LSTM)[40-42]和对称编码器-解码器[43]等架构被用于提高DCNNs的性能。

Paragraph 6 (learning-based localization methods; this paper uses a CNN)

Moreover, many localization methods [44–47] adopt a from-rough-to-precise idea.
此外，许多定位方法[44-47]采用了从粗糙到精确的思路。

For example, Reference [44] utilized scene recognition to locate in scene-level area, and then employed a multi-sensor fusion approach to give a specific location.
例如[44]，利用场景识别在场景级区域进行定位，然后采用多传感器融合方法给出具体位置。

Similarly, the purely visual-based methods have also been proposed by researchers.
同样，研究人员也提出了纯粹基于视觉的方法。

Reference [45] casts the localization as an alignment problem of the edges of the query image to a 3D model consisting of line segments.
文献[45]将定位问题转化为查询图像边缘与由线段组成的3D模型之间的对齐问题。

In Reference [46], recognition-based periods are utilized to give coarse localization and then matching can be employed in rather small region.
文献[46]利用基于识别的周期进行粗定位，然后在较小的区域内进行匹配。

Whereas, in their work, the accuracy and robustness are not sufficient for pervasive use,for the reason that their SIFT-based images retrieval is not stable for the complexity and diversity of indoor environments.
然而，在他们的工作中，由于基于sift的图像检索对室内环境的复杂性和多样性不稳定，其准确性和鲁棒性不足以普遍使用。

To solve this problem, the proposed method adopts a robust CNN-based images retrieval scheme which can fully satisfy the requirement of image retrieval, which is efficient for indoor scenes.
为了解决这一问题，本文提出的方法采用了一种基于cnn的鲁棒图像检索方案，该方案能够完全满足图像检索的要求，对于室内场景来说是高效的。

Moreover, 3D model is unnecessary in our strategy.
此外，3D模型在我们的策略中是不必要的

Paragraph 7 (this paper's image retrieval strategy)

Compared with previous schemes, this paper combines image retrieval-based strategy with feature-based pose estimation period.
与以往的方案相比，本文将基于图像检索的策略与基于特征的姿态估计周期相结合。

During the image retrieval period, we utilize a network pre-trained on ImageNet as feature extractor.
在图像检索阶段，我们使用在ImageNet上预训练的网络作为特征提取器。

CNNs learn suitable feature representations for localization in indoor environments, and experiment shows that the performance of this strategy is sufficient to retrieve spatial adjacent images.
cnn在室内环境中学习适合定位的特征表示，实验表明，该策略的性能足以检索空间相邻图像。

Pose of the target image is estimated based on a selected geo-tagged image, this algorithm was inspired by similar procedure in monocular visual odometer which uses the images of nearby frames as well as the estimated pose of the first frame.
该算法的灵感来自于单目视觉里程计的类似过程，即利用目标图像附近的图像和第一帧的估计姿态来估计目标图像的姿态。

Due to the procession of 3D modeling is complicated, we utilize a strategy that represents local scenario by two contiguous images and succeeding computed the query image’s pose from one of the reference images.
由于三维建模过程复杂，我们采用了一种策略，即用两幅连续图像表示局部场景，然后从其中一幅参考图像计算查询图像的姿态。

However, the performance of pose estimation is highly related to the similarity between the query images and the reference images. In other words, well-behaved image retrieval paves the way of valid precise pose estimation
然而，姿态估计的性能与查询图像和参考图像之间的相似度高度相关。换句话说，行为良好的图像检索为有效精确的姿态估计铺平了道路

3. System Overview and Methods 系统概述及方法

3.1 Sentence-by-sentence translation

Paragraph 1 (overview of this section)

In this section, firstly, we describe the proposed method at a high level.
在本节中，首先，我们在高层次上描述了所提出的方法。

Then, key modules and important algorithms are described in detail, including data preparation, CNN-based image retrieval and pose estimation.
然后详细介绍了数据准备、基于cnn的图像检索和姿态估计等关键模块和重要算法。

3.1. System Architecture 系统架构

Paragraph 1 (system architecture of the paper's image-based localization)

We demonstrate a single RGB image based localization system which is not only capable of reaching sub-meter localization accuracy but also estimating orientation.
我们演示了一种基于单一RGB图像的定位系统，该系统不仅能够达到亚米级的定位精度，而且能够估计方向。

The proposed system consists of three components, as shown in Figure 1:
建议的系统由三个部分组成，如图1所示:

(1) Data preparation, shown in Figure 1a: We collected RGB images from target scenarios, then extracted CNN features from all RGB images through pre-trained CNN models. All of the work was done in offline period.
(1)数据准备，如图1a所示:我们从目标场景中收集RGB图像，然后通过预训练的CNN模型从所有RGB图像中提取CNN特征。所有工作都是在离线期间完成的。

(2) Image retrieval, shown in Figure 1b: We loaded all of the CNN features of images in database, and ranked them according to their similarity to the CNN features extracted from captured image, and then output a set of images with top similarity.
(2)图像检索，如图1b所示:我们加载数据库中所有图像的CNN特征，按其与拍摄图像所提取CNN特征的相似度进行排序，然后输出相似度最高的一组图像。

(3) Pose estimation, shown in Figure 1c: We carried out image retrieval to the query image and got two of the most similar images as well as their poses. Then, feature points were extracted from the query image and retrieved images.
(3)姿态估计，如图1c所示:我们对查询图像进行图像检索，得到两幅最相似的图像及其姿态。然后，从查询图像和检索图像中提取特征点。

We employed 2D-to-2D correspondence to feature points extracted from two retrieved images to compute the scale in monocular vision setting, and then applied the same procedure to feature points from the query image, and the matching image to compute the pose of the query image.
在单目视觉条件下，我们对两幅检索图像中提取的特征点进行2d - 2d对应计算尺度，然后对查询图像中的特征点和匹配图像中的特征点进行相同的处理，计算查询图像的位姿。
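A rough sketch of how the metric scale could be recovered from the two retrieved reference images. This is my own illustration, not the paper's code: the function names are hypothetical, and both the idea of dividing the known baseline by the up-to-scale translation and the median-of-depth-ratios step are assumptions about how the "compute the scale" step might be done.

```python
import math
from statistics import median


def metric_scale_from_references(center_a, center_b, t_rel):
    """Metres per unit of the up-to-scale translation between two reference images.

    center_a, center_b: known camera positions (x, y, z) taken from the database.
    t_rel: relative translation recovered from 2D-to-2D matches (scale unknown).
    """
    baseline_m = math.dist(center_a, center_b)  # true distance between the references
    return baseline_m / math.hypot(*t_rel)


def rescale_query_translation(depths_metric, depths_unit, t_query):
    """Bring the query's up-to-scale translation into metres (assumed procedure).

    depths_metric: depths of shared points in the metric reference reconstruction.
    depths_unit: depths of the same points in the query/reference reconstruction,
                 measured in the same reference camera frame.
    The median ratio is used so that a few bad matches do not dominate.
    """
    ratio = median(m / u for m, u in zip(depths_metric, depths_unit))
    return [ratio * c for c in t_query]


# Toy numbers: the references are 1 m apart, the recovered translation has length 0.5.
scale = metric_scale_from_references((0.0, 0.0, 0.0), (0.6, 0.0, 0.8), (0.0, 0.0, 0.5))
t_metric = rescale_query_translation([2.0, 4.0, 6.0], [1.0, 2.0, 3.0], [0.0, 0.0, 1.0])
```

The key point is that a single camera only gives translation up to an unknown factor, so the known positions of the two database images are what anchor the result in metres.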

3.2. Data Preparation 数据准备

Paragraph 1 (database structure)

In this part, structure of the database is described.
在这一部分中，描述了数据库的结构。

The input of the proposed system is an RGB image which is captured either by a cellphone camera or other mobile platforms.
该系统的输入是RGB图像，可由手机相机或其他移动平台捕获。

In the database, the absolute 3D spatial coordinates (x, y, z) and quaternion (qx, qy, qz, qw) of all images are known with respect to a given local coordinate system. In addition, CNN features of each image are also included in image database.
在数据库中，所有图像的绝对三维空间坐标(x, y, z)和四元数(qx, qy, qz, qw)相对于给定的局部坐标系是已知的。此外，每张图像的CNN特征也被包含在图像数据库中。
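Since every database image stores its orientation as a quaternion (qx, qy, qz, qw), here is a minimal sketch of turning that into a rotation matrix. It assumes the common Hamilton convention with the scalar part last, matching the ordering in the text; the paper excerpt does not say which convention it uses, and the function name is my own.

```python
import math


def quaternion_to_rotation(qx, qy, qz, qw):
    """Convert a quaternion (qx, qy, qz, qw) to a 3x3 rotation matrix (row-major lists)."""
    n = math.sqrt(qx * qx + qy * qy + qz * qz + qw * qw)
    x, y, z, w = qx / n, qy / n, qz / n, qw / n  # normalise to guard against drift
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w), 2 * (x * z + y * w)],
        [2 * (x * y + z * w), 1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w), 2 * (y * z + x * w), 1 - 2 * (x * x + y * y)],
    ]


# A 90-degree rotation about the z axis.
half = math.radians(90) / 2
R = quaternion_to_rotation(0.0, 0.0, math.sin(half), math.cos(half))
```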

Paragraph 2 (selecting images for feature matching)

Each image can locally represent the scene it belongs to, and image set contains the information of the scene.
每张图像都可以局部地表示它所属的场景，图像集包含了该场景的信息。

In the proposed method, two of the most similar images are applied to compute the scale of monocular vision during pose estimation period, therefore, adjacent images should have enough common area for feature matching.
在姿态估计过程中，采用两幅最相似的图像来计算单目视觉的尺度，因此相邻图像应具有足够的公共面积进行特征匹配。

The more well-selected images to represent the scene, the better performance of the retrieval and pose estimation result would be.
选取的代表场景的图像越多，检索和姿态估计结果的性能越好。

Besides, too many images result in increasing of the cost of data acquisition and computing time.
此外，过多的图像会导致数据采集成本和计算时间的增加。

We design the image set as follows.
我们设计的图像集如下。

Paragraph 3 (image set design)

As shown in Table 1, the database S of this experiment contains n different scenes as S = {S1, S2, ..., Sn}.
如表1所示，本实验的数据库S包含n个不同的场景，S = {S1, S2, ..., Sn}。

Table 1: Composition of the database

For each scene Si, we need to get a set of images I = {Iij} with associated pose information P = {Pij}, and their respective CNN features C = {Cij} to create a global representation of this scene, where Pij = (xij, yij, zij, qxij, qyij, qzij, qwij) is the position and pose data of image Iij.
对于每一个场景Si，我们需要得到一组图像I = {Iij}及其对应的姿态信息P = {Pij}，以及它们各自的CNN特征C = {Cij}，以创建该场景的全局表示，其中Pij = (xij, yij, zij, qxij, qyij, qzij, qwij)是图像Iij的位置和姿态数据。
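The database layout described above can be sketched as a small data structure. The class and field names, the file paths, and the toy values are all hypothetical; only the content of each record (image, position, quaternion, CNN feature, grouped by scene) comes from the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ReferenceImage:
    """One database image I_ij with its pose P_ij and CNN feature C_ij."""
    path: str
    position: Tuple[float, float, float]           # (x, y, z) in the local frame
    quaternion: Tuple[float, float, float, float]  # (qx, qy, qz, qw)
    feature: List[float]                           # fixed-length CNN descriptor


@dataclass
class Scene:
    """One scene S_i holding its set of reference images."""
    name: str
    images: List[ReferenceImage] = field(default_factory=list)


def build_database(records) -> Dict[str, Scene]:
    """Group (scene_name, ReferenceImage) pairs into S = {S_1, ..., S_n}."""
    database: Dict[str, Scene] = {}
    for scene_name, image in records:
        database.setdefault(scene_name, Scene(scene_name)).images.append(image)
    return database


records = [
    ("corridor", ReferenceImage("corridor/000.jpg", (0.0, 0.0, 1.5), (0.0, 0.0, 0.0, 1.0), [0.1, 0.9])),
    ("corridor", ReferenceImage("corridor/001.jpg", (0.5, 0.0, 1.5), (0.0, 0.0, 0.0, 1.0), [0.2, 0.8])),
    ("office", ReferenceImage("office/000.jpg", (4.0, 2.0, 1.5), (0.0, 0.0, 0.7071, 0.7071), [0.9, 0.1])),
]
database = build_database(records)
```

Keeping the feature next to the pose means the retrieval step and the pose estimation step can both read from the same record.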

3.3. CNN-Based Image Retrieval 基于cnn的图像检索

In this section, fundamentals of a deep convolutional neural network are described, as well as a pre-trained CNN model for deep feature extraction in following experiment.
在本节中，描述了深度卷积神经网络的基础知识，以及在接下来的实验中用于深度特征提取的预训练CNN模型。

3.3.1. Deep Convolutional Neural Networks (DCNNs) 深度卷积神经网络(DCNNs)

Paragraph 1 (structure of VGG16)

As illustrated in Figure 2, the configuration of CNNs used in our proposed scheme is similar to VGG16 which achieved great performance in the large-scale image recognition tasks such as ILSVRC classification and localization.
如图2所示，我们提出的方案中使用的CNN配置与VGG16相似，VGG16在ILSVRC分类和定位等大规模图像识别任务中取得了很好的性能。

Figure 2: Structure of VGG16.

VGG-Nets apply the same principles as normal CNNs, and the key characteristic of this kind of method is increasing depth using an architecture with very small (3 × 3) convolution filters.
VGG-Nets应用与普通cnn相同的原理，这种方法的关键特征是使用非常小(3 × 3)卷积滤波器的架构来增加深度。
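Why stacking small 3 × 3 filters is attractive can be checked with a bit of arithmetic. The sketch below is my own illustration (the function name and the choice of 512 channels are assumptions): it compares three stacked 3 × 3 convolutions with a single 7 × 7 convolution, both stride 1 with equal input and output channels and no bias terms.

```python
def stacked_conv_stats(kernel, layers, channels):
    """Receptive field and weight count of `layers` stride-1 convolutions (no biases)."""
    receptive_field = 1 + layers * (kernel - 1)
    weights = layers * kernel * kernel * channels * channels
    return receptive_field, weights


three_small = stacked_conv_stats(kernel=3, layers=3, channels=512)
one_large = stacked_conv_stats(kernel=7, layers=1, channels=512)
```

Both cover a 7 × 7 receptive field, but the stacked version needs 27·C² weights instead of 49·C² and gets two extra non-linearities in between.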

proposed six kinds of VGG-Nets, number of their layers varied from 16 to 24.
提出了6种VGG-Nets，其层数从16层到24层不等。

In our proposed scheme, we use a 16-layer VGG-Net named VGG16, this network consists of thirteen convolutional layers (block1_conv1, block1_conv2, block2_conv1, block2_conv2, block3_conv1, block3_conv2, block3_conv3, block4_conv1, block4_conv2, block4_conv3, block5_conv1, block5_conv2, block5_conv3), five max pooling layers (block1_pool, block2_pool, block3_pool, block4_pool, block5_pool), three fully connected layers (fc1, fc2, fc3) and a soft-max layer.
在我们提出的方案中，我们使用了一个名为 VGG16 的 16 层 VGG 网络，该网络由 13 个卷积层（block1_conv1、block1_conv2、block2_conv1、block2_conv2、block3_conv1、block3_conv2、block3_conv3、block4_conv1、block4_conv2、block4_conv3、block5_conv1、block5_conv2、block5_conv3）、五个最大池化层（block1_pool、block2_pool、block3_pool、block4_pool、block5_pool）、三个全连接层（fc1、fc2、fc3）和一个 softmax 层组成。
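To make the layer list concrete, here is a small sketch that walks through the convolutional part of VGG16 and tracks the feature-map size. The 224 × 224 input, the padding of 1 for each 3 × 3 convolution, and the 2 × 2 stride-2 pooling are the standard VGG16 settings and are assumed here rather than taken from the excerpt; the layer names follow the ones listed above.

```python
# (number of 3x3 conv layers, output channels) for the five VGG16 blocks
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg16_feature_shapes(height=224, width=224):
    """List (layer_name, height, width, channels) for the convolutional part."""
    shapes = []
    for block, (n_convs, channels) in enumerate(VGG16_BLOCKS, start=1):
        for conv in range(1, n_convs + 1):
            # 3x3 convolution with padding 1 keeps the spatial size
            shapes.append((f"block{block}_conv{conv}", height, width, channels))
        height, width = height // 2, width // 2  # 2x2 max pooling, stride 2
        shapes.append((f"block{block}_pool", height, width, channels))
    return shapes


shapes = vgg16_feature_shapes()
n_conv = sum("conv" in name for name, *_ in shapes)
n_pool = sum("pool" in name for name, *_ in shapes)
```

Under these assumptions the last pooling layer outputs a 7 × 7 map with 512 channels, and the counts come out as thirteen convolutional and five pooling layers, as in the text.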

Paragraph 2 (introducing networks pre-trained on ImageNet)

It is hard to train a valid DCNN model only by data we collected since deep learning needs a mass of training data.
由于深度学习需要大量的训练数据，仅凭我们收集的数据很难训练出有效的DCNN模型。

In the proposed scheme, we use CNN for image feature extraction and apply the extracted features in a retrieval task, and then get the most similar images related to the query image.
在该方案中，我们使用CNN进行图像特征提取，并将提取的特征应用于检索任务中，然后得到与查询图像相关的最相似的图像。

In view of the representation power of CNNs, pre-trained networks based on ImageNet can be used in our feature extraction period.
鉴于cnn的表示能力，我们可以在特征提取阶段使用基于ImageNet的预训练网络。

3.3.2. Deep Features Extracted by CNNs cnn提取的深度特征

Paragraph 1 (extracting deep features from the CNN)

As shown in Figure 1a,b, both the query images and images in database are processed by CNN model
如图1a、b所示，查询图像和数据库图像都经过CNN模型处理

From previous work [14], we know that deeper layers represent higher level of semantic information from the visualization of feature maps.
从之前的工作[14]中，我们知道更深的层表示来自特征图可视化的更高层次的语义信息。

In our experiment, deep features extracted from CNN can better represent the image, therefore competitive accuracy of image retrieval can be achieved.
在我们的实验中，从CNN中提取的深度特征可以更好地代表图像，因此可以达到具有竞争力的图像检索精度。

Paragraph 2 (extracting features from images)

Convolution layers (including responding ReLU and max pooling) are used to extract features from input images, and these features are robust to scale and translation.
卷积层(包括响应ReLU和最大池化)用于从输入图像中提取特征，这些特征对缩放和平移具有鲁棒性。

Subsequently image features are aggregated into a compact feature vector of fixed length.
随后将图像特征聚合成一个紧凑的固定长度的特征向量。
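One common way to aggregate feature maps into a fixed-length vector is global max pooling, sketched below in plain Python. The excerpt does not say which aggregation the paper uses, so the pooling operator, the L2 normalisation, and the function names are assumptions. With 512 channels in the last convolutional block this would give a 512-dimensional vector regardless of image size.

```python
import math


def global_max_pool(feature_maps):
    """Collapse C feature maps (each an H x W list of lists) into a C-dim vector."""
    return [max(max(row) for row in channel) for channel in feature_maps]


def l2_normalize(vector):
    """Scale a vector to unit length so that distances are comparable."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm > 0 else list(vector)


# Two channels on a 2x2 grid and two channels on a 3x3 grid:
small = [[[0.0, 3.0], [1.0, 2.0]], [[4.0, 0.0], [0.0, 1.0]]]
large = [[[0.0, 3.0, 1.0], [1.0, 2.0, 0.0], [0.0, 0.0, 0.0]],
         [[4.0, 0.0, 0.0], [0.0, 1.0, 0.0], [2.0, 0.0, 0.0]]]
descriptor = l2_normalize(global_max_pool(small))
```

The vector length depends only on the number of channels, which is why different spatial sizes still yield descriptors that can be compared directly.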

As is shown in Figure 3, we visualize the first 16 matrices of each layer.
如图3所示，我们可视化了每层的前16个矩阵。

We scale layer maps to the same size when visualizing them, but their sizes as well as depths are different among layers as labeled at the left of the figure.
在可视化时，我们将图层映射缩放到相同的大小，但是它们的大小和深度在图层之间是不同的，如图左侧所示。

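Figure 3: Visualization of the convolutional layers. The first 16 matrices of each layer are visualized; empty matrices correspond to the parts dropped out in the CNN. To show the features in the layers more clearly, an emerald-green colormap is used, so the layer maps look green.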

3.3.3. Image Retrieval Using Deep Features 基于深度特征的图像检索

Image features are aggregated into a vector of fixed length after feature extraction period.
经过特征提取期后，将图像特征聚合成一个固定长度的向量。

If we apply the same CNN model to extract features to the same size images, we will get the same fixed length of feature vectors, as shown in Figure 4.
如果我们对相同大小的图像使用相同的CNN模型提取特征，我们将得到相同的固定长度的特征向量，如图4所示。

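Figure 4: Visualization of image feature vectors. (a) Vector of the query image (512 dimensions); (b) vector of the retrieved image with the highest score; (c) vector of an unrelated image from the same scene; (d) vector of an image from a different scene.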
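With fixed-length vectors, retrieval reduces to ranking database images by their distance to the query vector. The sketch below assumes Euclidean distance and toy 3-dimensional features; the paper's exact distance measure is not visible in this excerpt, and the function name and image ids are hypothetical. The default of k = 2 mirrors the "two of the most similar images" used for pose estimation.

```python
import math


def retrieve_top_k(query_feature, database_features, k=2):
    """Return the ids of the k database entries closest to the query, nearest first.

    database_features: dict mapping image id -> feature vector (same length as the query).
    """
    ranked = sorted(
        database_features.items(),
        key=lambda item: math.dist(query_feature, item[1]),
    )
    return [image_id for image_id, _ in ranked[:k]]


features = {
    "corridor_000": [0.9, 0.1, 0.0],
    "corridor_001": [0.8, 0.2, 0.1],
    "office_000": [0.0, 0.1, 0.9],
}
matches = retrieve_top_k([0.85, 0.15, 0.0], features, k=2)
```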

When comparing two images, we calculate the distance between image feature vector of retrieved image (
