Abstract:
Objective By exploiting the complementarity between infrared and visible light images, infrared and visible light image fusion integrates images captured by different sensors of the same scene into a single fused image that is information-rich, reliable, and task-oriented, providing a comprehensive description of the scene. The fused image retains both the thermal radiation targets of the infrared image and the detailed texture of the visible light image. However, existing deep-learning-based fusion methods rely on convolutional neural networks as their basic framework, e.g., the encoder in autoencoder-based methods and the generator and discriminator in generative adversarial networks, all of which stack large numbers of convolutional layers to process the input features. Because of its limited kernel size and receptive field, traditional convolution captures only local features, such as the local edges of thermal radiation target regions in infrared images, and poorly preserves global features, including the rich background texture of visible light images and the contours of objects and the surrounding environment. This one-sidedness of feature extraction leads to blurred background details and insufficiently prominent thermal radiation targets in the fused image. A multimodal fusion method that can extract both global and local features is therefore urgently needed to remedy these deficiencies.
Methods A triple-input multimodal image fusion algorithm based on mixed difference convolution and an efficient visual Transformer network is proposed. Its core innovations are as follows. First, a difference image, obtained by pixel-wise subtraction, is introduced at the input end to highlight the discrepancies between the source images, forming a triple-input network architecture that enhances the discriminability of image features. Second, a mixed difference convolution (MDconv) is designed; this variant of traditional convolution incorporates edge detection operators and exploits pixel differences to strengthen the feature learning capability of the convolution operation. Third, a dual-branch encoder is adopted, combining a convolutional neural network branch with densely connected mixed difference convolutions and an efficient Transformer branch, so that the local details and the global background of the image are captured jointly. Finally, a multi-dimensional coordinate collaborative attention fusion strategy is applied in the fusion layer to effectively integrate the deep features of the encoded multimodal images.
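To make the triple-input construction and the mixed difference convolution concrete, the following PyTorch sketch shows one possible formulation: the third input is assumed to be the absolute pixel-wise difference of the two registered source images, and MDconv is sketched as a blend of a vanilla convolution with a central-difference term controlled by a mixing weight theta. The class names, the value of theta, and the exact difference operator are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedDifferenceConv2d(nn.Module):
    """Sketch of a mixed difference convolution (MDconv), assumed to blend a
    vanilla convolution with a central-difference term: with theta = 1 the layer
    computes sum_p w(p) * (x(p) - x(center)), i.e. it responds to pixel
    differences (gradients); with theta = 0 it reduces to ordinary convolution."""
    def __init__(self, in_ch, out_ch, kernel_size=3, theta=0.7):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size, padding=kernel_size // 2)
        self.theta = theta  # illustrative mixing weight between the two terms

    def forward(self, x):
        out_vanilla = self.conv(x)
        # The sum of each kernel's weights, applied as a 1x1 convolution, gives
        # the "centre pixel times kernel sum" term of a central-difference conv.
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)
        out_center = F.conv2d(x, kernel_sum)
        return out_vanilla - self.theta * out_center


def build_triple_input(ir, vis):
    """Form the third input by pixel-wise subtraction of the registered sources,
    highlighting regions where the two modalities differ (assumed absolute form)."""
    diff = torch.abs(ir - vis)
    return ir, vis, diff
```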
Results and Discussions To verify the fusion performance of the proposed method, seven representative infrared and visible light image fusion algorithms were selected for comparison, and subjective and objective evaluations were conducted on the TNO and RoadScene test sets. In the subjective evaluation (Fig.16), the CBF method relies on cross bilateral filtering, whose weight computation causes a significant loss of information; it lacks the ability to extract multi-source information, so the background texture is coarse, the thermal radiation targets are not salient, and the overall fusion effect is poor. The ADF method reconstructs images through the Karhunen-Loève (K-L) transform, and its maximum-pixel selection discards edge pixels, so the background texture contrast is low and the fused image is biased toward the distribution of the infrared image. The Transformer-based MFST and SwinFusion results retain the thermal radiation targets to some extent but contain noise and artifacts, reducing target clarity. The DenseFuse and NestFuse methods are both based on simple autoencoders; the background texture and thermal radiation targets are preserved, but the overall contrast is low and the background appears flattened. The MTDfusion result is unnatural overall and contains noticeable noise. The proposed method achieves the best fusion effect, highlighting both the thermal radiation targets of the infrared image and the background details of the visible light image, without introducing noise or artifacts. In the objective evaluation (Tab.1), the proposed method achieves the best values on the four indicators MI, VIF, SD, and QAB/F, and the second-best value on SF. On the TNO dataset, these four indicators improve by 63.25%, 18.86%, 5.21%, and 5.41%, respectively; on the RoadScene dataset, they improve by 46.44%, 12.65%, 10.03%, and 3.26%, respectively. The subjective and objective evaluations are consistent, confirming the effectiveness of the proposed method. In addition, ablation experiments (Tab.2) quantitatively compare the complete network with four variants: one in which the basic convolutional layers are replaced by ordinary 3×3 convolutions, one without the Transformer branch, one whose CNN branch has no dense connections, and one with only two inputs (without the difference image). The complete network achieves the best objective results, demonstrating that each proposed component contributes to fusion quality.
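For reference, the objective indicators reported in Tab.1 follow their standard definitions; the NumPy sketch below computes three of them (SD, SF, and MI) under those common definitions and is not the authors' evaluation code. For fusion evaluation, MI is typically reported as MI(IR, F) + MI(VIS, F).

```python
import numpy as np

def standard_deviation(fused):
    """SD: contrast of the fused image, the standard deviation of its intensities."""
    f = fused.astype(np.float64)
    return np.sqrt(np.mean((f - f.mean()) ** 2))

def spatial_frequency(fused):
    """SF: overall gradient activity from row- and column-wise pixel differences."""
    f = fused.astype(np.float64)
    rf = np.sqrt(np.mean((f[:, 1:] - f[:, :-1]) ** 2))  # row frequency
    cf = np.sqrt(np.mean((f[1:, :] - f[:-1, :]) ** 2))  # column frequency
    return np.sqrt(rf ** 2 + cf ** 2)

def mutual_information(src, fused, bins=256):
    """MI between one source image and the fused image via the joint histogram."""
    hist, _, _ = np.histogram2d(src.ravel(), fused.ravel(), bins=bins)
    pxy = hist / hist.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of the source image
    py = pxy.sum(axis=0, keepdims=True)   # marginal of the fused image
    nz = pxy > 0
    return float(np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz])))
```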
Conclusions A triple-input multimodal image fusion algorithm based on mixed difference convolution and an efficient visual Transformer network is proposed. First, the infrared, visible light, and difference images are fed into a dual-branch encoder that combines densely connected mixed difference convolutions with an efficient Transformer. Second, a fusion strategy based on multi-dimensional coordinate collaborative attention is designed, which assigns weights according to the importance of the feature maps of the three inputs, effectively retaining and integrating the deep features. Finally, the deep features are passed to the decoder to complete the fusion. The mixed difference convolution injects rich prior gradient information into ordinary convolution, enhancing the feature extraction capability of the convolutional layers; the efficient Transformer exploits long-range dependencies to integrate the global features of the infrared and visible light images; and the difference image in the triple-input network feeds the salient discrepancy information into the network, improving the texture details of the fused image at the source and better matching human visual perception. Comparative experiments with state-of-the-art infrared and visible light image fusion algorithms show that, subjectively, the proposed method enriches the background texture details of the fused image and makes the thermal radiation targets more prominent, in line with human visual perception; objectively, the evaluation indicators also improve markedly over the compared methods. The ablation experiments on the network structure demonstrate the effectiveness of each proposed module. In addition, the method can be applied in fields such as security monitoring and multispectral cameras.
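The internal structure of the multi-dimensional coordinate collaborative attention module is not spelled out in this abstract; the sketch below assumes a coordinate-attention-style weighting (pooling features along height and width to obtain direction-aware channel weights) and only illustrates how such attention maps could be used to weight and merge the deep features of the three encoded inputs. The module names and the summation-based merge are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Coordinate-attention-style weighting (assumed form): pool features along
    height and width separately, then derive direction-aware channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                      # (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)  # (n, c, w, 1)
        y = self.act(self.conv1(torch.cat([x_h, x_w], dim=2)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (n, c, h, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * a_h * a_w


class AttentionFusion(nn.Module):
    """Merge the deep features of the infrared, visible light, and difference
    inputs by weighting each with its attention map and summing (assumed merge)."""
    def __init__(self, channels):
        super().__init__()
        self.att = nn.ModuleList([CoordinateAttention(channels) for _ in range(3)])

    def forward(self, feat_ir, feat_vis, feat_diff):
        feats = [feat_ir, feat_vis, feat_diff]
        return sum(att(f) for att, f in zip(self.att, feats))
```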