Cross-view scene image localization with Triplet Network integrating NetVLAD and Fully Connected Layers

Xue Zhaohui (薛朝辉); Zhou Yiyang (周逸飏); Qiang Yonggang (强永刚); Liu Yifeng (刘弋锋); Lin Hui (林晖)

2021, Vol. 25, Issue (5): 1095-1107

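Cite this article:

Xue Zhaohui, Zhou Yiyang, Qiang Yonggang, Liu Yifeng, Lin Hui. 2021. Cross-view scene image localization with Triplet Network integrating NetVLAD and Fully Connected Layers. National Remote Sensing Bulletin (遥感学报), 25(5): 1095-1107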

DOI:

10.11834/jrs.20210188

Received:

2020-06-11

Revised:

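Xue Zhaohui¹, Zhou Yiyang¹, Qiang Yonggang², Liu Yifeng³, Lin Hui³

1. School of Earth Sciences and Engineering, Hohai University, Nanjing 211100; 2. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026; 3. National Engineering Laboratory for Public Safety Risk Perception and Control by Big Data, China Academy of Electronics and Information Technology, Beijing 100041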

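Abstract:

Geolocating scene images is important in fields such as outdoor positioning, target search, and military reconnaissance. To address cross-view scene image matching and localization between street-view and bird's-eye-view imagery, this paper proposes a Triplet Network localization method (Tri-NetVLAD) that fuses NetVLAD (Net Vector of Locally Aggregated Descriptors), a trainable locally aggregated descriptor vector, with a fully connected layer. The Triplet Network consists of three convolutional neural network (CNN) branches and processes three images at once; it performs image retrieval and matching by increasing the distance between unmatched image pairs and decreasing the distance between matched pairs. Fusing NetVLAD with the fully connected layer strengthens the correlation between features. The local convolutional features extracted by the CNN are passed through the NetVLAD layer and the fully connected layer to obtain a global descriptor and a feature vector, respectively, and the two are then fused. This strengthens the correlation between local features while preserving the differences between them, which improves the model's localization accuracy. The DBL loss (Distance-based layer loss) is also improved: an added parameter λ strengthens the function's ability to discriminate hard samples, which improves the model's convergence speed and stability as well as its localization accuracy. Experiments on the public Vo and Hays dataset (United States) show that Tri-NetVLAD achieves higher localization accuracy than existing methods such as MCVPlaces, Triplet eDBL-Net, and CVM-Net, exceeding 63% on the test set.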
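The fusion step described in the abstract can be illustrated with a minimal NumPy sketch of one branch. It is not the authors' implementation: the parameter names (`centers`, `assign_w`, `assign_b`, `fc_w`, `fc_b`) are hypothetical, and fusing by concatenation is an assumption, since the abstract only says the two outputs are combined.

```python
import numpy as np

def netvlad(local_feats, centers, assign_w, assign_b):
    """Aggregate N local D-dim descriptors into one K*D global descriptor."""
    # Soft-assign every local descriptor to the K clusters (softmax over clusters).
    logits = local_feats @ assign_w + assign_b                 # (N, K)
    logits -= logits.max(axis=1, keepdims=True)                # numerical stability
    soft = np.exp(logits)
    soft /= soft.sum(axis=1, keepdims=True)
    # Sum the assignment-weighted residuals to each cluster center.
    residuals = local_feats[:, None, :] - centers[None, :, :]  # (N, K, D)
    vlad = (soft[:, :, None] * residuals).sum(axis=0)          # (K, D)
    # Normalize each cluster row, then the flattened descriptor.
    vlad /= np.linalg.norm(vlad, axis=1, keepdims=True) + 1e-12
    vlad = vlad.ravel()
    return vlad / (np.linalg.norm(vlad) + 1e-12)

def fused_embedding(feature_map, params):
    """Embed one (H, W, D) CNN feature map; `params` is a hypothetical dict."""
    h, w, d = feature_map.shape
    local_feats = feature_map.reshape(h * w, d)
    global_desc = netvlad(local_feats, params["centers"],
                          params["assign_w"], params["assign_b"])
    # Fully connected layer applied to the flattened feature map.
    fc_vec = feature_map.ravel() @ params["fc_w"] + params["fc_b"]
    fc_vec /= np.linalg.norm(fc_vec) + 1e-12
    # Assumption: the two vectors are fused by concatenation.
    return np.concatenate([global_desc, fc_vec])
```

In a triplet setup the same function would be applied to the query image, the matching image, and the non-matching image, and the three embeddings compared by Euclidean distance.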

Keywords:

cross-view scene image matching and localization; Triplet Network; NetVLAD; CNN (Convolutional Neural Networks)

Cross-view scene image localization with Triplet Network integrating NetVLAD and Fully Connected Layers

Abstract:

Cross-view scene image matching and positioning has a wide range of applications, including target search, crime fighting, and geolocation. With the development of deep learning, neural networks have come to play an important role in this problem. For matching and positioning between street-view and bird's-eye images, however, existing neural network models converge slowly and capture only weak correlations between features. This paper proposes a triplet network model (Tri-NetVLAD) that combines NetVLAD with a fully connected layer, together with an improved DBL loss (ADBL loss). The proposed method improves both the convergence speed and stability of the network and the overall positioning accuracy of the model.

The Tri-NetVLAD model extracts local features from the three input images through a triplet network and feeds them to the fully connected layer and the NetVLAD layer to obtain a feature vector and a global feature descriptor, respectively. The global feature descriptor captures the relative distribution of the features; incorporating the feature vector on top of it preserves the differences between features and improves the positioning accuracy of the model. By introducing an additional parameter, ADBL loss improves the model's ability to discriminate difficult samples and, with it, the positioning accuracy.

Tri-NetVLAD is compared with several existing methods (MCVPlaces, Triplet eDBL-Net, and CVM-Net) and loss functions (contrastive loss, triplet loss, and DBL loss). On the Vo and Hays dataset (United States), it achieves the highest positioning accuracy, 63.5%, which indicates that a triplet network combining NetVLAD and fully connected layers, trained with ADBL loss, effectively improves positioning accuracy.

Compared with existing methods, the proposed Tri-NetVLAD has the following advantages.
(1) The triplet network increases the Euclidean distance between unmatched images while reducing the Euclidean distance between matched images.
(2) NetVLAD aggregates the local features extracted by the CNN into a global feature descriptor that captures the distribution relationship between features.
(3) The fully connected layer contributes a feature vector that is fused with the global feature descriptor, so the final feature vector both represents the distribution relationship between features and retains the differences between them.
(4) The improved loss function, ADBL loss, accelerates convergence and improves the overall positioning accuracy.
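Points (1) and (4) can be illustrated with a small sketch of a distance-based triplet loss with a scaling parameter. The abstract does not give the ADBL formula, so the form below, log(1 + exp(λ(d_pos − d_neg))), is an assumed stand-in in the spirit of a soft-margin distance-based loss; the function name and the default `lam` are hypothetical.

```python
import numpy as np

def scaled_soft_margin_triplet_loss(anchor, positive, negative, lam=1.0):
    """Mean of log(1 + exp(lam * (d_pos - d_neg))) over a batch of triplets.

    Assumed form for illustration only; not the paper's ADBL definition.
    """
    # Euclidean distances from each anchor to its matched / unmatched image.
    d_pos = np.linalg.norm(anchor - positive, axis=-1)
    d_neg = np.linalg.norm(anchor - negative, axis=-1)
    # logaddexp(0, x) evaluates log(1 + exp(x)) without overflow.
    return np.logaddexp(0.0, lam * (d_pos - d_neg)).mean()
```

Minimizing this quantity pushes d_pos below d_neg, that is, matched pairs closer together and unmatched pairs farther apart. Under this assumed form, a larger `lam` raises the penalty on triplets that violate the ordering, which is one way a single parameter could put more weight on hard samples.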

Key Words:

cross-view scene image matching and geolocation; Triplet Network; NetVLAD; CNN
