Spatial and frequency domain feature decoupling for infrared and visible image fusion

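  • Summary: Most existing infrared and visible image fusion algorithms perform convolution in the spatial domain to carry out feature extraction, feature fusion, and image reconstruction. Constrained by the local modeling nature of convolutional neural networks, these methods fail to consider the global contextual information of images, which limits the robustness of the fusion algorithm. To address this problem, and inspired by the global modeling property that images obey in the frequency domain under the spectral convolution theorem, an infrared and visible image fusion algorithm based on spatial and frequency domain feature decoupling is proposed. It decouples the high- and low-frequency representations of the source images in the spatial domain and in the frequency domain, then fuses them through complementary interaction to improve robustness. The proposed method consists of three parts: a frequency domain decoupling branch, a spatial domain decoupling branch, and a multi-spectral convolutional attention fusion module. First, the frequency domain decoupling branch uses a frequency mask to decouple the high- and low-frequency representations of the source images in the Fourier domain, obtaining their global contextual information. Then, the spatial domain decoupling branch, which comprises two parallel modules (a reversible neural network module and a lightweight Transformer module), decouples the high- and low-frequency representations of the source images in the spatial domain, obtaining their local contextual information. Finally, a multi-spectral convolutional attention fusion module is proposed to achieve complementary interactive fusion of the high- and low-frequency representations, so that the fused image retains more salient infrared information and visible texture detail. Qualitative and quantitative experiments on the MSRS, TNO, and RoadScene datasets show that the proposed method achieves excellent performance. Compared with the DATFusion fusion method proposed in 2023, it improves objective evaluation metrics including information entropy, average gradient, and VIF by 13.3%, 46.6%, and 10.3%, respectively.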
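Two of the objective metrics named above, information entropy and average gradient, have widely used textbook definitions, and a minimal sketch of them may help make the reported numbers concrete. The function names `information_entropy` and `average_gradient` are hypothetical, the input is assumed to be an 8-bit grayscale image, and the forward-difference form of the average gradient is one common variant; the paper's own evaluation code may differ in details such as normalization or border handling.

```python
import numpy as np


def information_entropy(img_u8: np.ndarray) -> float:
    """Shannon entropy (in bits) of the gray-level histogram of an 8-bit image."""
    # Count how often each of the 256 gray levels occurs.
    hist = np.bincount(img_u8.ravel(), minlength=256).astype(np.float64)
    p = hist / hist.sum()
    # Empty bins contribute nothing (0 * log 0 is taken as 0).
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())


def average_gradient(img: np.ndarray) -> float:
    """Mean of sqrt((dx^2 + dy^2) / 2) over forward differences (assumed variant)."""
    img = img.astype(np.float64)
    # Horizontal and vertical forward differences on the common (H-1, W-1) grid.
    dx = img[:-1, 1:] - img[:-1, :-1]
    dy = img[1:, :-1] - img[:-1, :-1]
    return float(np.mean(np.sqrt((dx ** 2 + dy ** 2) / 2.0)))
```

Under these definitions, a higher entropy indicates a richer gray-level distribution and a higher average gradient indicates sharper local detail, which is why both are commonly reported for fused images.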

     

    Abstract:
    Objective Infrared and visible image fusion is one of the primary branches of image fusion. It merges infrared and visible images captured by different sensors into a single fused image that carries the complementary information and features of both. However, existing fusion methods have three shortcomings. First, their feature fusion strategies rest on weighting or convolution and do not explicitly consider the contextual relationship between modalities. Second, because convolutional neural networks model only local neighborhoods, these methods cannot effectively capture the global contextual information shared by infrared and visible images. Third, most fusion methods still perform feature extraction, fusion, and reconstruction only in the spatial domain and do not explore modeling global contextual information in the frequency domain. To address these limitations, this paper proposes an infrared and visible image fusion algorithm based on spatial and frequency domain feature decoupling.
    Methods This paper proposes an end-to-end encoder-decoder network for image fusion, built from three key modules that decouple and fuse the high- and low-frequency information of the source images (Fig.1). First, the frequency domain decoupling branch uses a circular frequency mask of radius r to separate the high- and low-frequency representations of the source images in the Fourier domain, capturing their global contextual information. Second, the spatial domain decoupling branch contains two parallel modules, a reversible neural network module and a lightweight Transformer module. The two modules work independently to decouple the high- and low-frequency representations of the source images in the spatial domain, capturing their local contextual information. Finally, once the high- and low-frequency representations have been decoupled, a multi-spectral convolutional attention fusion network combines them. It uses multi-spectral channel attention and spatial attention to capture the detailed information of each modality and to model the contextual relationship between the high- and low-frequency representations, achieving their complementary interactive fusion.
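The circular frequency mask of radius r can be illustrated with a short NumPy sketch. This is not the paper's network code: it assumes the mask is a hard binary disk applied directly to a single-channel image (the paper may apply it to learned feature maps), that r is measured in frequency bins from the zero-frequency component, and the function name `frequency_decouple` is hypothetical.

```python
import numpy as np


def frequency_decouple(image: np.ndarray, r: float):
    """Split a 2-D image into low- and high-frequency parts with a circular mask.

    Assumption: r is a radius in frequency bins around the DC component.
    """
    h, w = image.shape
    # Centered spectrum: fftshift moves the zero-frequency bin to (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    # Distance of every frequency bin from the centered DC bin.
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    low_mask = dist <= r          # inside the circle: low frequencies
    high_mask = ~low_mask         # outside the circle: high frequencies
    # Back to the spatial domain; the masks are symmetric, so the imaginary
    # part is only round-off and can be dropped.
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * low_mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spectrum * high_mask)).real
    return low, high
```

Because the two masks are complementary, the low- and high-frequency parts sum back to the input image. Each output pixel depends on every input pixel through the Fourier transform, which is the sense in which frequency-domain filtering captures global context.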
    Results and Discussions The proposed method is compared with RFN-Nest, SwinFusion, SDNet, U2Fusion, DenseFuse, FusionGAN, and DATFusion. On the TNO dataset it ranks first on 5 metrics, second on 2, and third on 1 (Tab.1). On the RoadScene dataset it achieves the best result on 5 metrics and the second-best on 2 (Tab.2). On the MSRS dataset it achieves the best result on seven metrics and the second-best on the remaining one, mutual information (MI) (Tab.3). Subjective visual comparisons on the TNO, RoadScene, and MSRS datasets are shown in Fig.2-4, respectively.
    Conclusions This paper proposes a network framework based on decoupling spatial and frequency domain features, trained end to end to achieve efficient fusion of infrared and visible images. In the frequency domain branch, frequency masks decouple the high- and low-frequency representations of the source images in the Fourier domain. Meanwhile, the spatial domain branch uses two parallel modules, a reversible neural network module and a lightweight Transformer module, to decouple the high- and low-frequency representations of the source images in the spatial domain. A multi-spectral convolutional attention fusion module then achieves complementary interactive fusion of these representations. Comparative and ablation experiments on the TNO, RoadScene, and MSRS datasets show that the proposed algorithm is highly competitive with recent deep learning-based fusion methods in both qualitative and quantitative evaluations, and that the fused images retain more salient infrared thermal radiation information and visible texture detail.

     
