DOI: 10.11834/jrs.20254358

Received: 2024-08-19
Revised: 2025-04-23
Multimodal Object Detection Method Using Adaptive Fusion of Infrared and Visible Features
YU Zhirui, YIN Zhanpeng, WANG Junyu, ZHOU Liang, YE Yuanxin
Southwest Jiaotong University
Abstract:

Visible-light object detection suffers in complex environments such as occlusion and weak illumination, where the richness of visible features degrades and detection accuracy drops. To address this problem, this paper introduces infrared imagery to compensate for the deficiencies of visible imagery and proposes a multimodal object detection method with adaptive fusion of infrared and visible features. The method adopts the YOLOv8 detection framework as the base network to extract multi-scale feature information. On this basis, exploiting the fact that visible images carry richer texture while infrared images exhibit more salient edge contours, a cross-modal hybrid attention module is constructed to exchange and recombine information weights across modalities, so that the advantageous features of each modality are exploited under different illumination conditions. Then, using the relationship between the richness of visible-modality features and ambient illumination intensity, a dynamic visible-infrared weight allocation module driven by ambient illumination is designed; the resulting weights are fed into the multimodal feature fusion module as a reference to guide adaptive fusion, realizing object detection based on multimodal feature fusion. Experiments on the public street-scene dataset M3FD and the aerial vehicle dataset DroneVehicle show that the proposed method achieves higher detection accuracy than existing single-modal and multimodal object detection algorithms.
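The abstract describes the cross-modal hybrid attention module only at a functional level. The following is a minimal PyTorch sketch of one plausible realization, assuming a CBAM-style design in which channel and spatial attention weights are computed jointly from the two streams and exchanged between them; the class name, reduction ratio, and kernel size are illustrative assumptions, not the authors' published design.

import torch
import torch.nn as nn


class CrossModalHybridAttention(nn.Module):
    """Hypothetical CHAM-style block: channel and spatial attention
    computed jointly from the visible (RGB) and infrared (IR) streams."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        # Channel attention: one MLP over the concatenated channel statistics
        # of both modalities, producing a weight per channel per modality.
        hidden = max(2 * channels // reduction, 4)
        self.channel_mlp = nn.Sequential(
            nn.Linear(2 * channels, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2 * channels),
        )
        # Spatial attention: CBAM-style 7x7 conv over the mean/max maps of
        # both modalities, producing one spatial weight map per modality.
        self.spatial_conv = nn.Conv2d(4, 2, kernel_size=7, padding=3)

    def forward(self, rgb: torch.Tensor, ir: torch.Tensor):
        b, c, _, _ = rgb.shape
        # Cross-modal channel attention: weights for each stream are
        # conditioned on the statistics of both streams.
        stats = torch.cat([rgb.mean(dim=(2, 3)), ir.mean(dim=(2, 3))], dim=1)
        ch_w = torch.sigmoid(self.channel_mlp(stats))            # (B, 2C)
        rgb = rgb * ch_w[:, :c].view(b, c, 1, 1)
        ir = ir * ch_w[:, c:].view(b, c, 1, 1)
        # Cross-modal spatial attention over pooled maps of both streams.
        maps = torch.cat([rgb.mean(1, keepdim=True), rgb.amax(1, keepdim=True),
                          ir.mean(1, keepdim=True), ir.amax(1, keepdim=True)], 1)
        sp_w = torch.sigmoid(self.spatial_conv(maps))            # (B, 2, H, W)
        return rgb * sp_w[:, 0:1], ir * sp_w[:, 1:2]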

Multimodal Object Detection Method Using Adaptive Fusion of Infrared and Visible Features
Abstract:

Objective: Object detection has mainly relied on visible imagery. Although visible images render the details and texture of targets well, when a target is blurred or occluded, or the light is too strong or too weak, a visible-light sensor alone cannot capture sufficient target information, and detection performance suffers. Infrared sensors, in contrast, are strongly resistant to environmental interference, insensitive to illumination, and able to reflect target temperature, compensating for the weaknesses of visible imaging. This paper therefore integrates infrared features into visible-light object detection.

Methods: This paper proposes a multimodal object detection method that adaptively fuses infrared and visible features. The method uses the YOLOv8 detection framework as the base network to extract multi-scale feature information. On this basis, a new Cross-modal Hybrid Attention Module (CHAM) is constructed. This module extracts the complementary features of the visible and infrared images and jointly performs channel and spatial attention, exchanging and recombining cross-modal information weights and thereby improving the perception of complementary features across modalities. In addition, a visible-infrared adaptive feature fusion scheme indexed by ambient illumination intensity is constructed: an Illumination Awareness Module (IAM) evaluates, from the light intensity of the visible image, how rich the target features contained in that image are, and feeds this estimate into the Cross-modal Adaptive Fusion Module (CAFM) to guide the fusion process. This addresses the inability of conventional fusion methods to adapt dynamically to the characteristics of multimodal data, and realizes object detection based on multimodal feature fusion.

Results: To demonstrate the effectiveness of the proposed MAF-YOLO model, comparative experiments are conducted against several classic object detection models and current state-of-the-art methods. The classic models are Faster R-CNN and YOLOv8; since both are single-modality detectors, they are evaluated separately on the visible and infrared modalities. The state-of-the-art methods include CFT, the first application of the Transformer to multispectral object detection; TarDAL, which uses a generative adversarial network to generate fused images; and SuperYOLO, which builds super-resolution reconstruction on YOLOv5 to improve multimodal detection accuracy. Comparative experiments on the M3FD street-scene dataset and the DroneVehicle aerial vehicle dataset test the robustness of the method under different illumination conditions and scenes, and show that the proposed method achieves higher detection accuracy than these single-modal and multimodal detectors.
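The IAM and CAFM above are likewise described only by their roles. A minimal sketch of illumination-guided fusion follows, assuming the IAM is a small CNN that regresses a scalar visible-light reliability weight from a downsampled RGB frame and that fusion is a convex combination of the two feature maps; both the network and the fusion rule are assumptions rather than the published modules.

import torch
import torch.nn as nn
import torch.nn.functional as F


class IlluminationAwareFusion(nn.Module):
    """Hypothetical IAM + CAFM pair: estimate scene brightness from the RGB
    frame, then fuse features as w * F_rgb + (1 - w) * F_ir."""

    def __init__(self):
        super().__init__()
        self.iam = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, 1), nn.Sigmoid(),   # w near 1: bright, trust RGB
        )

    def forward(self, rgb_image: torch.Tensor,
                f_rgb: torch.Tensor, f_ir: torch.Tensor) -> torch.Tensor:
        # Illumination is estimated once per frame from a small thumbnail.
        thumb = F.interpolate(rgb_image, size=(64, 64), mode="bilinear",
                              align_corners=False)
        w = self.iam(thumb).view(-1, 1, 1, 1)   # (B, 1, 1, 1)
        # Convex combination: dark scenes shift weight toward infrared.
        return w * f_rgb + (1.0 - w) * f_ir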
Conclusion: Building on YOLOv8, this paper proposes MAF-YOLO, an object detection network based on visible-infrared multimodal feature fusion. A new cross-modal hybrid attention mechanism makes full use of the complementary characteristics of visible and infrared information, and an illumination awareness module and an adaptive fusion module perform mid-level fusion of the dual-stream features, extracting the complementary features of the two modalities for detection. Comparative experiments against a variety of existing detection models on the DroneVehicle and M3FD datasets show that MAF-YOLO achieves good detection performance and robustness in complex environments, demonstrating that the proposed method effectively alleviates the deficiency of visible-modality target features in such environments and realizes accurate detection by fusing infrared and visible multimodal features.
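For illustration, the two hypothetical modules sketched above could be wired into the multi-scale, mid-level fusion the conclusion describes as follows; the pyramid widths and strides are assumed from common YOLOv8 conventions (P3/P4/P5 on a 640x640 input), not taken from the paper.

import torch

# Assumed P3/P4/P5 channel widths and strides for a 640x640 input.
levels, strides = [128, 256, 512], [8, 16, 32]
chams = [CrossModalHybridAttention(c) for c in levels]
fusers = [IlluminationAwareFusion() for _ in levels]

rgb_img = torch.randn(2, 3, 640, 640)
rgb_feats = [torch.randn(2, c, 640 // s, 640 // s) for c, s in zip(levels, strides)]
ir_feats = [torch.randn(2, c, 640 // s, 640 // s) for c, s in zip(levels, strides)]

# Attention exchange, then illumination-weighted fusion, per pyramid level;
# each fused map would replace the single-modal feature fed to the YOLOv8 neck.
fused = [fuse(rgb_img, *cham(f_r, f_i))
         for cham, fuse, f_r, f_i in zip(chams, fusers, rgb_feats, ir_feats)]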
