
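  • 0. Paper at a glance
    • 0.1 Article information
    • 0.2 Overview
      • 0.2.1 What the paper studies
      • 0.2.2 Evaluation
  • 1. Abstract
    • 1.1 Sentence-by-sentence translation
    • 1.2 Summary
  • 2. Introduction
    • 2.1 Sentence-by-sentence translation
      • Paragraph 1 (limitations of other indoor positioning technologies)
      • Paragraph 2 (image retrieval alone gives imprecise positions; geometric matching does better)
      • Paragraph 3 (contributions of the paper)
      • Paragraph 4 (structure of the paper)
  • 3. Related Work
    • 3.1 Sentence-by-sentence translation
      • Paragraph 1 (overview)
      • Paragraph 2 (breakdown)
      • Paragraph 3 (structure-based localization methods)
      • Paragraph 4 (image-based localization methods)
      • Paragraph 5 (learning-based localization methods)
      • Paragraph 6 (learning-based localization methods; this paper uses a CNN)
      • Paragraph 7 (the image retrieval strategy of this paper)
  • 3. System Overview and Methods
    • 3.1 Sentence-by-sentence translation
      • Paragraph 1 (overview of this section)
      • 3.1. System Architecture
        • Paragraph 1 (system architecture of the image-based localization)
      • 3.2. Data Preparation
        • Paragraph 1 (database structure)
        • Paragraph 2 (selecting images for feature matching)
        • Paragraph 3 (image set design)
      • 3.3. CNN-Based Image Retrieval
        • 3.3.1. Deep Convolutional Neural Networks (DCNNs)
          • Paragraph 1 (introduces the structure of VGG16)
          • Paragraph 2 (motivates the ImageNet pre-trained network)
        • 3.3.2. Deep Features Extracted by CNNs
          • Paragraph 1 (extracting deep features from the CNN)
          • Paragraph 2 (extracting features from images)
        • 3.3.3. Image Retrieval Using Deep Features
      • 3.4. Pose Estimation
        • 3.4.1. Feature Detection and Matching
        • 3.4.2. Motion from Image Feature Correspondences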



0.1 Article information

Paper: Indoor Visual Positioning Aided by CNN-Based Image Retrieval: Training-Free


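0.2 Overview

0.2.1 What the paper studies

0.2.2 Evaluation

Worth rereading: √ (open-source code is available, and the paper matches the direction I want to research)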


1.1 Sentence-by-sentence translation

Indoor localization is one of the fundamentals of location-based services (LBS) such as seamless indoor and outdoor navigation, location-based precision marketing, spatial cognition of robotics, etc.

Visual features take up a dominant part of the information that helps human and robotics understand the environment, and many visual localization systems have been proposed.

However, the problem of indoor visual localization has not been well settled due to the tough trade-off of accuracy and cost.

To better address this problem, a localization method based on image retrieval is proposed in this paper, which mainly consists of two parts.

The first one is the CNN-based image retrieval phase: CNN features extracted from images by pre-trained deep convolutional neural networks (DCNNs) are utilized to compare similarity, and the output of this part is the set of images matched to the target image.

The second one is pose estimation phase that computes accurate localization result.

Owing to the robust CNN feature extractor, our scheme is applicable to complex indoor environments and easily transplanted to outdoor environments.

The pose estimation scheme was inspired by monocular visual odometer, therefore, only RGB images and poses of reference images are needed for accurate image geo-localization.

Furthermore, our method attempts to use lightweight datum to present the scene.

To evaluate the performance, experiments are conducted, and the result demonstrates that our scheme can efficiently result in high location accuracy as well as orientation estimation.

Currently, the positioning accuracy and usability are enhanced compared with similar solutions.

Furthermore, our idea has a good application foreground, because the algorithms of data acquisition and pose estimation are compatible with the current state of data expansion.

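1.2 Summary

  • The paper uses image retrieval for indoor localization: the first stage retrieves images with a neural network, and the second stage estimates the pose to obtain the location (only images are needed for localization, which is the novel point).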


2.1 Sentence-by-sentence translation


The increasing demand of location-based services (LBS) in recent years inspires the desire for accurate position information.

The most common way for positioning with cell phone and other mobile platforms is GNSS (Global Navigation Satellite System).

However, most of the time, GNSS is only available in outdoor environments.

When it comes to indoor environment, GNSS signals are mostly blocked by obstacles.

In recent years, a number of alternative technologies have been proposed for indoor positioning.

Most indoor positioning methods focus on fingerprinting-based localization algorithms, which are infrastructure-free [1–3].

In these methods, Wi-Fi received signal strengths (RSS) or magnetic field strengths (MFS) are collected and will be compared with data in a fingerprinting database during positioning period.
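The fingerprint comparison described above can be illustrated with a minimal nearest-neighbour sketch. It is not taken from the paper or from references [1–3]; the database layout, the RSS values, and the function name `locate_by_fingerprint` are hypothetical and chosen only for illustration.

```python
import math

def locate_by_fingerprint(query_rss, fingerprint_db, k=3):
    """Estimate a 2D position as the mean of the k closest stored fingerprints."""
    scored = []
    for position, rss in fingerprint_db:
        # Euclidean distance in signal space between the query and a stored fingerprint
        dist = math.sqrt(sum((q - r) ** 2 for q, r in zip(query_rss, rss)))
        scored.append((dist, position))
    scored.sort(key=lambda item: item[0])
    nearest = [pos for _, pos in scored[:k]]
    x = sum(p[0] for p in nearest) / len(nearest)
    y = sum(p[1] for p in nearest) / len(nearest)
    return (x, y)

# Hypothetical database: (position in metres, RSS in dBm from three access points)
fingerprint_db = [
    ((0.0, 0.0), [-40, -70, -80]),
    ((5.0, 0.0), [-60, -50, -75]),
    ((0.0, 5.0), [-65, -72, -45]),
    ((5.0, 5.0), [-75, -55, -50]),
]

estimate = locate_by_fingerprint([-42, -69, -79], fingerprint_db, k=1)
print(estimate)
```

Because the stored signal patterns drift as the environment changes, a database like this has to be re-surveyed regularly, which is the maintenance cost the following sentences point out.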

This fingerprinting-based system is easy to establish and can achieve fine localization performance in the short term.

Nevertheless, signal patterns will change over time due to the environment changes, which makes it hard to maintain the positioning performance.

Additionally, construction of fingerprint database is time-consuming and labor-intensive.

To overcome the defects of this scheme, many alternatives have been proposed, including Optical [4,5], RFID (Radio Frequency Identification) [6], Bluetooth Beacons [7], ZigBee [8,9], Pseudo Satellite [10,11], etc.

However, their accuracy is not sufficient in intricate indoor environments, and these solutions may need manual setup and additional infrastructure, which may bring unbearable costs.


There are also some previous attempts at indoor visual positioning.

Recognition-based image geo-localization methods are quite similar to the problem of image classification in computer vision, in which global or region features are used for image matching [12–15].

In image classification, similar images are labeled as the same category.

Regarding the visual localization problem, relative images are identified as sharing similar geo-location information.

As for recognition-based method, location of the target image is estimated by retrieving related images or scene classification [16–19].

Recognition-based methods first apply an image retrieval strategy or a scene classification strategy; subsequently, the location of the query image is estimated from the localization information of the retrieved images or from the classification labels.

However, the mentioned methods above generally provide a rather coarse estimation of location, which hardly satisfies the need of accurate LBS.

Geometric matching-based methods represent the scenes by geo-referenced 3D models, and then, estimate the pose of query image by directly matching 2D image features to 3D models or by matching 3D image features to 3D models when depth information is available.

These approaches typically come with estimation of 6 degrees of freedom (DoF) camera parameters.

However, geometric matching-based methods still have many challenges, which can be summarized as follows: (1) superior difficulty in constructing high-fidelity RGB-D scene models as well as in employing 2D-to-3D matching for the textured 3D model scheme; and (2) for the non-RGB point-only model scheme, the problem of geometric alignment between the query images and 3D point models can be hard to settle.


To overcome the limitations of recognition-based methods and geometric matching-based methods, a combination of these two strategies has been proposed in the devised scheme.

In this paper, we demonstrate an image-based indoor localization scheme which is capable of not only achieving sub-meter level positioning accuracy but also determining orientations.

At the same time, the proposed scheme merely uses RGB images in the course of the online localization period and a server is used to host the image database for the computing operation.

The main contributions of this paper can be concluded as follows:

(1) Inspired by the visual spatial cognition ability of human, an image-based visual positioning scheme is proposed. The target image is matched with database images to get the most similar image for localization computing.

(2) Our visual localization algorithm is 3D-modeling-free. Compared with visual localization methods that combine with image retrieval and image pose estimation from regional 3D reconstruction, regional 3D modeling is unrequired in our scheme since we recover camera pose from two sets of 2D-to-2D matches.

(3) Our spatial model is training-free for different scenarios. Because pre-trained deep learning models are stable and can be used as powerful feature extractors, we apply a deep convolutional neural network (DCNN) pre-trained on ImageNet to extract features that represent images, so we need not train a unique model for a specific scene.

(4) For localization purpose, we use a lighter model to represent the scene. CNN features extracted from images of database can represent the scene in image retrieval phase. Compared with CNN learning-based visual localization methods that require a large number of images during model training, much fewer images are required to represent the same scene in our scheme.
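Contribution (2) relies on recovering camera pose from 2D-to-2D matches. As a reminder of the standard two-view geometry behind that idea (textbook epipolar geometry, not a formula quoted from the paper; the symbols $x_1$, $x_2$, $E$, $R$, $t$ are introduced here for illustration), matched points in normalized image coordinates satisfy:

```latex
% Epipolar constraint for a calibrated camera pair.
% x_1, x_2: homogeneous normalized coordinates of one matched point in image 1 and image 2
% R, t: rotation and translation from camera 1 to camera 2
\begin{aligned}
  x_2^{\top} E \, x_1 &= 0, \\
  E &= [t]_{\times} R, \qquad
  [t]_{\times} =
  \begin{pmatrix}
     0   & -t_z &  t_y \\
     t_z &  0   & -t_x \\
    -t_y &  t_x &  0
  \end{pmatrix}.
\end{aligned}
```

Decomposing $E$ yields $R$ and the direction of $t$ only; the length of $t$ is not observable from a single image pair, which is why a monocular method needs an extra source of metric scale.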


The paper proceeds as follows: Section 2 provides a brief overview of related work.

The system architecture and methods are described in detail in Section 3.

Experiments and performance evaluations are presented in Section 4. Sections 5 and 6 are discussion and suggestions for future work.

3. Related Work

3.1 Sentence-by-sentence translation


The work presented in this paper relates to many fields, such as visual localization, image retrieval, and visual pose estimation.


At present, visual localization systems can be roughly divided into three categories.


Structure-based localization methods are the most common visual localization methods that utilize local features to estimate 2D-to-3D matches between features in a query image and points in 3D models, or employ 3D-to-3D matches between RGB-D images and 3D models.

Then camera pose will be estimated from the correspondence.

Similarly, Torsten et al. [20] compared 2D image-based localization with 3D structure-based localization, and concluded that purely 2D-based methods achieve the lowest localization accuracy, while 3D-based methods offer more precise pose estimation at the cost of more complex model construction and maintenance.

They proposed a combination of 2D-based methods with local structure-from-motion (SfM) reconstruction which has both a simple database construction procedure and accurate pose estimation.

However, the drawback of their method is significantly longer run-time during the location process.


Image-based localization methods were pushed by massive repositories of public geo-labeled images.

These methods employ an image retrieval-based strategy [16–19], which match the query image with images from the database.

Afterward the location of the query image is computed based on the pose information of the retrieved reference images [21–23].

Owing to the prosperity of social network and street view photos, quantity of images with geo-tags has emerged which can be used for reference to these data-driven image-based localization methods.

Image retrieval is a visual search task that searches and retrieves images from a large database of digital images, which is commonly used in many image-based localization methods.

Conventional methods retrieve images based on local descriptor matching and reorder with elaborate spatial verification [24–26].

Content-based image retrieval searches for images by relying on visual content such as edges, colors, textures, and shapes [27].

Recent works leverage deep convolutional neural networks for image retrieval; the majority of them use a pre-trained network as a local feature extractor.

Moreover, some works even address the problem of geometric invariance of CNN features [28,29], and accurately represent images of different sizes and aspect ratios [30,31].


Learning-based localization methods emerged in the past few years, which benefited from the dramatic progress made in a variety of computer vision tasks.

By training models from given images with pose information, scenes can be represented by these learned models.

These learning-based localization methods either predict matches for pose estimation [32–35] or directly regress the camera pose such as PoseNet [36], PoseNet2 [37], and VlocNet [38].

PoseNet was the first approach to use DCNNs to solve the metric localization problem, and then Bayesian CNN implementation was utilized to address the pose uncertainty [39].

After that, architectures such as long-short term memory (LSTM) [40–42] and symmetric encoder-decoder [43] were utilized to facilitate the performance of DCNNs.


Moreover, many localization methods [44–47] adopt a from-rough-to-precise idea.

For example, Reference [44] utilized scene recognition to locate the scene-level area, and then employed a multi-sensor fusion approach to give a specific location.

Similarly, the purely visual-based methods have also been proposed by researchers.

Reference [45] casts the localization as an alignment problem of the edges of the query image to a 3D model consisting of line segments.

In Reference [46], recognition-based periods are utilized to give coarse localization and then matching can be employed in rather small region.

However, in their work, the accuracy and robustness are not sufficient for pervasive use, because their SIFT-based image retrieval is not stable given the complexity and diversity of indoor environments.

To solve this problem, the proposed method adopts a robust CNN-based image retrieval scheme which fully satisfies the requirements of image retrieval and is efficient for indoor scenes.

Moreover, 3D model is unnecessary in our strategy.


Compared with previous schemes, this paper combines image retrieval-based strategy with feature-based pose estimation period.

During the image retrieval period, we utilize a network pre-trained on ImageNet as feature extractor.

CNNs learn suitable feature representations for localization in indoor environments, and experiment shows that the performance of this strategy is sufficient to retrieve spatial adjacent images.

Pose of the target image is estimated based on a selected geo-tagged image, this algorithm was inspired by similar procedure in monocular visual odometer which uses the images of nearby frames as well as the estimated pose of the first frame.

Because the process of 3D modeling is complicated, we utilize a strategy that represents the local scene by two contiguous images and subsequently computes the query image's pose from one of the reference images.

However, the performance of pose estimation is highly related to the similarity between the query images and the reference images. In other words, well-behaved image retrieval paves the way for valid and precise pose estimation.
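The idea of deriving the query pose from a geo-tagged reference image can be sketched as a simple pose composition. This is an illustrative sketch and not the paper's implementation: it assumes the reference rotation maps camera coordinates to world coordinates, and that the relative motion (rotation plus unit direction) is expressed in the reference camera frame with a metric `scale` already known. All function and parameter names are hypothetical.

```python
def mat_vec(m, v):
    """Multiply a 3x3 matrix by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def mat_mul(a, b):
    """Multiply two 3x3 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

def compose_query_pose(ref_position, ref_rotation, rel_rotation, rel_direction, scale):
    """Chain a reference pose with a relative motion to get the query pose.

    ref_rotation: camera-to-world rotation of the reference image (assumed convention)
    rel_rotation, rel_direction: orientation and unit-length position of the query
        camera expressed in the reference camera frame (assumed convention)
    scale: metric length of the relative translation
    """
    offset_world = mat_vec(ref_rotation, [scale * c for c in rel_direction])
    position = [p + o for p, o in zip(ref_position, offset_world)]
    rotation = mat_mul(ref_rotation, rel_rotation)
    return position, rotation

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
position, rotation = compose_query_pose([2, 3, 1], identity, identity, [1, 0, 0], 0.5)
print(position)
```

Other libraries use the inverse (world-to-camera) convention, in which case the composition order changes; the point here is only that the query pose is chained from a known reference pose.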

3. System Overview and Methods

3.1 Sentence-by-sentence translation


In this section, firstly, we describe the proposed method at a high level.

Then, key modules and important algorithms are described in detail, including data preparation, CNN-based image retrieval, and pose estimation.

3.1. System Architecture


We demonstrate a single RGB image based localization system which is not only capable of reaching sub-meter localization accuracy but also estimating orientation.

The proposed system consists of three components, as shown in Figure 1:


(1) Data preparation, shown in Figure 1a: We collected RGB images from target scenarios, then extracted CNN features from all RGB images through pre-trained CNN models. All of the work was done in offline period.

(2) Image retrieval, shown in Figure 1b: We loaded all of the CNN features of images in the database, ranked them according to their similarity to the CNN features extracted from the captured image, and then output a set of images with top similarity.

(3) Pose estimation, shown in Figure 1c: We carried out image retrieval on the query image and got the two most similar images as well as their poses. Then, feature points were extracted from the query image and the retrieved images.

We employed 2D-to-2D correspondences between feature points extracted from the two retrieved images to compute the scale in the monocular vision setting, and then applied the same procedure to feature points from the query image and the matching image to compute the pose of the query image.
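The ranking in step (2) can be sketched as follows. This is a minimal illustration and not the paper's code: cosine similarity is an assumed measure (this part of the notes does not state which one the paper uses), and the image names and 4-dimensional toy vectors are hypothetical stand-ins for real CNN features.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two feature vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def retrieve_top_k(query_feature, database_features, k=2):
    """Return the names of the k database images most similar to the query."""
    ranked = sorted(
        database_features.items(),
        key=lambda item: cosine_similarity(query_feature, item[1]),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]

# Hypothetical pre-computed features (offline stage), keyed by image name
database_features = {
    "img_a": [1.0, 0.0, 0.0, 0.0],
    "img_b": [0.9, 0.1, 0.0, 0.0],
    "img_c": [0.0, 1.0, 0.0, 0.0],
}

top_two = retrieve_top_k([1.0, 0.05, 0.0, 0.0], database_features, k=2)
print(top_two)
```

Returning two images instead of one matches the pipeline above, where the pair of retrieved images is later used to fix the monocular scale.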

3.2. Data Preparation


In this part, structure of the database is described.

The input of the proposed system is an RGB image captured by a cellphone camera or another mobile platform.

In the database, the absolute 3D spatial coordinates (x, y, z) and quaternion (qx, qy, qz, qw) of all images are known with respect to a given local coordinate system. In addition, the CNN features of each image are also included in the image database.
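Since each database image stores its orientation as a quaternion (qx, qy, qz, qw), a pose computation typically converts it to a rotation matrix first. The sketch below uses the standard unit-quaternion formula; it is general background rather than code from the paper, and the function name is hypothetical.

```python
def quaternion_to_rotation(qx, qy, qz, qw):
    """Convert a quaternion (qx, qy, qz, qw) to a 3x3 rotation matrix."""
    # Normalize first so that slightly non-unit inputs still give a valid rotation
    n = (qx * qx + qy * qy + qz * qz + qw * qw) ** 0.5
    qx, qy, qz, qw = qx / n, qy / n, qz / n, qw / n
    return [
        [1 - 2 * (qy * qy + qz * qz), 2 * (qx * qy - qz * qw), 2 * (qx * qz + qy * qw)],
        [2 * (qx * qy + qz * qw), 1 - 2 * (qx * qx + qz * qz), 2 * (qy * qz - qx * qw)],
        [2 * (qx * qz - qy * qw), 2 * (qy * qz + qx * qw), 1 - 2 * (qx * qx + qy * qy)],
    ]

# The identity quaternion corresponds to "no rotation"
rotation = quaternion_to_rotation(0.0, 0.0, 0.0, 1.0)
print(rotation)
```

Whether the stored quaternion describes a camera-to-world or a world-to-camera rotation is not stated in this part of the notes, so that convention has to be checked against the dataset.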


Each image can locally represent the scene it belongs to, and image set contains the information of the scene.

In the proposed method, two of the most similar images are applied to compute the scale of monocular vision during pose estimation period, therefore, adjacent images should have enough common area for feature matching.
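Why two reference images are enough to fix the monocular scale can be seen from a small sketch: two-view geometry gives the translation between the two retrieved images only as a direction, while their database poses give the true distance between them. The code is an illustrative assumption, not the paper's procedure, and `recover_scale` and its arguments are hypothetical names.

```python
import math

def recover_scale(ref_position_a, ref_position_b, estimated_translation_ab):
    """Metric scale = known baseline / length of the up-to-scale translation."""
    baseline = math.dist(ref_position_a, ref_position_b)  # metres, from database poses
    norm = math.sqrt(sum(c * c for c in estimated_translation_ab))
    return baseline / norm

# Two reference images 2 m apart; two-view geometry returned a unit-length translation
scale = recover_scale((0.0, 0.0, 0.0), (1.2, 1.6, 0.0), [0.6, 0.8, 0.0])
print(scale)
```

How the paper transfers this scale to the query image is covered in its pose estimation section, which lies outside this part of the notes.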

The more well-selected images are used to represent the scene, the better the retrieval and pose estimation results will be.

However, too many images increase the cost of data acquisition and the computing time.

We design the image set as follows.


As shown in Table 1, the database $S$ of this experiment contains $n$ different scenes, $S = \{S_1, S_2, \ldots, S_n\}$.


For each scene $S_i$, we need to get a set of images $I = \{I_{ij}\}$ with associated pose information $P = \{P_{ij}\}$, and their respective CNN features $C = \{C_{ij}\}$ to create a global representation of this scene, where $P_{ij} = (x_{ij}, y_{ij}, z_{ij}, qx_{ij}, qy_{ij}, qz_{ij}, qw_{ij})$ is the position and pose data of image $I_{ij}$.
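The database layout described above (scenes, each holding images with a pose and a CNN feature) could be represented as in the following sketch. The class and field names are hypothetical and only mirror the quantities named in the text; the paper's actual storage format is not given here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class ImageRecord:
    """One database image I_ij with its pose P_ij and CNN feature C_ij."""
    image_path: str
    position: Tuple[float, float, float]           # (x, y, z) in the local frame
    quaternion: Tuple[float, float, float, float]  # (qx, qy, qz, qw)
    cnn_feature: List[float]                       # fixed-length descriptor

@dataclass
class Scene:
    """One scene S_i: a named collection of image records."""
    name: str
    images: List[ImageRecord] = field(default_factory=list)

# The database S maps scene names to scenes
database: Dict[str, Scene] = {}

corridor = Scene(name="corridor")
corridor.images.append(
    ImageRecord(
        image_path="corridor/0001.png",  # hypothetical file name
        position=(1.0, 2.0, 1.5),
        quaternion=(0.0, 0.0, 0.0, 1.0),
        cnn_feature=[0.0] * 512,
    )
)
database[corridor.name] = corridor
print(len(database["corridor"].images))
```

Only the pose and the feature vector are needed at query time for retrieval, which is what keeps the scene representation light compared with a 3D model.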

3.3. CNN-Based Image Retrieval

In this section, fundamentals of a deep convolutional neural network are described, as well as a pre-trained CNN model used for deep feature extraction in the following experiment.

3.3.1. Deep Convolutional Neural Networks (DCNNs)

Paragraph 1 (introduces the structure of VGG16)

As illustrated in Figure 2, the configuration of CNNs used in our proposed scheme is similar to VGG16 which achieved great performance in the large-scale image recognition tasks such as ILSVRC classification and localization.


VGG-Nets apply the same principles as normal CNNs, and the key characteristic of this kind of method is increasing depth using an architecture with very small (3 × 3) convolution filters.

Simonyan and Zisserman proposed six kinds of VGG-Nets, whose number of layers varied from 16 to 24.

In our proposed scheme, we use a 16-layer VGG-Net named VGG16. This network consists of thirteen convolutional layers (block1_conv1, block1_conv2, block2_conv1, block2_conv2, block3_conv1, block3_conv2, block3_conv3, block4_conv1, block4_conv2, block4_conv3, block5_conv1, block5_conv2, block5_conv3), five max pooling layers (block1_pool, block2_pool, block3_pool, block4_pool, block5_pool), three fully connected layers (fc1, fc2, fc3) and a soft-max layer.
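To make the layer list concrete, the sketch below traces the feature-map shape through the thirteen convolutional and five pooling layers. It uses the standard VGG16 configuration (3 × 3 convolutions that keep the spatial size, 2 × 2 max pooling that halves it); the 224 × 224 × 3 input size is an assumption, since the notes do not state the input size used in the paper.

```python
# (block name, number of conv layers, number of filters) for standard VGG16
VGG16_CONV_BLOCKS = [
    ("block1", 2, 64),
    ("block2", 2, 128),
    ("block3", 3, 256),
    ("block4", 3, 512),
    ("block5", 3, 512),
]

def trace_vgg16(height=224, width=224, in_channels=3):
    """Return per-layer output shapes and the conv parameter count."""
    shapes = []
    params = 0
    channels = in_channels
    for block, n_convs, filters in VGG16_CONV_BLOCKS:
        for i in range(1, n_convs + 1):
            # 3x3 kernel weights plus one bias per filter; spatial size is unchanged
            params += 3 * 3 * channels * filters + filters
            channels = filters
            shapes.append((f"{block}_conv{i}", (height, width, channels)))
        # 2x2 max pooling with stride 2 halves height and width
        height //= 2
        width //= 2
        shapes.append((f"{block}_pool", (height, width, channels)))
    return shapes, params

shapes, conv_params = trace_vgg16()
for name, shape in shapes:
    print(name, shape)
print("convolutional parameters:", conv_params)
```

The last pooling layer ends with 512 channels, which is the depth that a 512-dimensional image descriptor would come from.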


It is hard to train a valid DCNN model only by data we collected since deep learning needs a mass of training data.

In the proposed scheme, we use CNN for image feature extraction and apply the extracted features in a retrieval task, and then get the most similar images related to the query image.

In view of the representation power of CNNs, pre-trained networks based on ImageNet can be used in our feature extraction period.

3.3.2. Deep Features Extracted by CNNs


As shown in Figure 1a,b, both the query images and the images in the database are processed by the CNN model.

From previous work [14], we know from the visualization of feature maps that deeper layers represent a higher level of semantic information.

In our experiment, deep features extracted from CNN can better represent the image, therefore competitive accuracy of image retrieval can be achieved.


Convolution layers (including the corresponding ReLU and max pooling) are used to extract features from input images, and these features are robust to scale and translation.

Subsequently image features are aggregated into a compact feature vector of fixed length.

As is shown in Figure 3, we visualize the first 16 matrices of each layer.

We scale layer maps to the same size when visualizing them, but their sizes as well as depths are different among layers as labeled at the left of the figure.


3.3.3. Image Retrieval Using Deep Features

Image features are aggregated into a vector of fixed length after feature extraction period.

If we apply the same CNN model to extract features from images of the same size, we will get feature vectors of the same fixed length, as shown in Figure 4.

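Figure 4. Visualization of image feature vectors. (a) Vector of the query image (512 dimensions); (b) vector of the retrieved image with the highest score; (c) vector of an unrelated image from the same scene; (d) vector of an image from a different scene.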
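One common way to turn a convolutional feature map into a fixed-length vector such as the 512-dimensional one in Figure 4 is global pooling over the spatial positions. The sketch below uses global max pooling as an assumption; this part of the notes does not say which aggregation the paper applies, and the function name is hypothetical.

```python
def global_max_pool(feature_map):
    """Collapse a feature map indexed [row][column][channel] to one value per channel."""
    channels = len(feature_map[0][0])
    return [
        max(pixel[c] for row in feature_map for pixel in row)
        for c in range(channels)
    ]

# Toy 2x2 feature map with 3 channels (a real VGG16 block5 output has 512 channels)
feature_map = [
    [[0.1, 0.5, 0.0], [0.4, 0.2, 0.3]],
    [[0.9, 0.1, 0.2], [0.0, 0.7, 0.6]],
]

vector = global_max_pool(feature_map)
print(vector)
```

With this kind of pooling the vector length depends only on the number of channels, so images processed by the same network always yield descriptors of the same length.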

When comparing two images, we calculate the distance between image feature vector of retrieved image (
