Abstract:
Objective Infrared and visible image fusion is one of the primary branches of image fusion. It merges images captured in different scenes and by different sensors into a single fused image that retains the complementary information and features of both the infrared and the visible image. However, existing fusion methods have three shortcomings. First, their feature fusion strategies rely mainly on weighting or convolution and do not explicitly consider the contextual relationship between modalities. Second, because convolutional neural networks model only local neighborhoods, these methods cannot effectively capture the global contextual information shared by infrared and visible images. Third, most fusion methods still perform feature extraction, fusion, and reconstruction only in the spatial domain, leaving the modeling of global contextual information in the frequency domain largely unexplored. To address these limitations, this paper proposes an infrared and visible image fusion algorithm based on spatial and frequency domain feature decoupling.
Methods This paper proposes an end-to-end encoder-decoder network for image fusion, composed of three key modules that decouple and fuse the high- and low-frequency information of the source images (Fig.1). First, the frequency domain decoupling branch uses a circular frequency mask of radius r to separate the high- and low-frequency representations of the source images in the Fourier domain, thereby capturing their global contextual information. Second, the spatial domain decoupling branch contains two parallel reversible neural network modules and a lightweight Transformer module. These modules work independently to decouple the high- and low-frequency representations of the source images in the spatial domain and to capture their local contextual information. Finally, once the high- and low-frequency representations have been decoupled, a multi-spectral convolutional attention fusion network combines them. This network uses multi-spectral channel attention and spatial attention to capture the detailed information of each modality and to model the contextual relationship between the high- and low-frequency representations, achieving complementary, interactive fusion of the two.
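The frequency domain decoupling step can be illustrated with a minimal NumPy sketch. It assumes the circular mask is applied to the centered 2-D Fourier spectrum of a single-channel image and that everything outside the circle is treated as high-frequency content; the function names `circular_mask` and `frequency_decouple` are hypothetical, and the network described here may apply the mask differently (for example, to learned feature maps rather than raw pixels).

```python
import numpy as np

def circular_mask(shape, r):
    """Boolean mask, True inside a circle of radius r centered on the shifted spectrum."""
    h, w = shape
    yy, xx = np.ogrid[:h, :w]
    cy, cx = h // 2, w // 2  # location of the zero frequency after fftshift
    return (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2

def frequency_decouple(img, r):
    """Split a 2-D array into low- and high-frequency parts (illustrative only)."""
    spec = np.fft.fftshift(np.fft.fft2(img))  # move the zero frequency to the center
    mask = circular_mask(img.shape, r)
    # Inside the circle: low frequencies. Outside (assumed complement): high frequencies.
    low = np.fft.ifft2(np.fft.ifftshift(spec * mask)).real
    high = np.fft.ifft2(np.fft.ifftshift(spec * ~mask)).real
    return low, high

rng = np.random.default_rng(0)
img = rng.random((32, 32))
low, high = frequency_decouple(img, r=8)
```

Because the two masks are complementary, `low + high` reproduces the input up to floating-point error, so in this sketch the radius r only controls how the content is divided between the two parts.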
Results and Discussions The proposed method is compared with RFN-Nest, SwinFusion, SDNet, U2Fusion, DenseFuse, FusionGAN, and DATFusion. On the TNO dataset, it ranks first on five metrics, second on two, and third on one (Tab.1). On the RoadScene dataset, it achieves the best result on five metrics and the second-best on two (Tab.2). On the MSRS dataset, it is second-best in mutual information (MI) and best on the other seven metrics (Tab.3). Qualitative (subjective) comparisons on the TNO, RoadScene, and MSRS datasets are shown in Fig.2-4, respectively.
Conclusions This paper proposes a network framework based on the decoupling of spatial and frequency domain features, trained end to end to fuse infrared and visible images. In the frequency domain branch, frequency masks decouple the high- and low-frequency representations of the source images in the Fourier domain. In the spatial domain branch, two parallel reversible neural network modules and lightweight Transformer blocks decouple the high- and low-frequency representations of the source images in the spatial domain. A multi-spectral convolutional attention fusion module then achieves complementary, interactive fusion of the high- and low-frequency representations. Comparative and ablation experiments on the TNO, RoadScene, and MSRS datasets show that the proposed algorithm is highly competitive with recent deep learning-based fusion methods in both qualitative and quantitative evaluations. The fused images retain more of the salient infrared thermal radiation information and the visible-light texture detail.