Triple multi-modal image fusion algorithm based on mixed difference convolution and efficient vision Transformer network

  • Abstract: An innovative triple multi-modal infrared and visible image fusion algorithm is proposed to address the limitations of traditional convolution operations in capturing global features and modelling long-range dependencies. The core innovations of the algorithm are as follows. First, a difference image is introduced at the input end: pixel-value subtraction highlights the discrepancies between the source images, and a triple-input network architecture is constructed to enhance the discriminability of image features. Second, a mixed difference convolution (MDconv) is designed, a variant of traditional convolution that incorporates edge-detection operators and exploits pixel differences to strengthen the feature-learning capability of the convolution operation. Furthermore, a dual-branch encoder is adopted, combining a convolutional neural network branch with densely connected mixed difference convolutions and an Efficient Vision Transformer (EfficientViT) branch, which extract the local details and the global background of the images respectively, achieving comprehensive capture of both local and global features. Finally, a multi-dimensional coordinate collaborative attention fusion strategy is employed in the fusion layer to effectively integrate the multi-modal image features output by the encoder. Qualitative and quantitative experiments on public datasets show that the fused infrared and visible images produced by the proposed algorithm have clear background texture details and more salient thermal radiation targets, achieve the best values on the four objective metrics MI, VIF, SD and QAB/F, and the second-best value on SF. Ablation experiments also confirm the effectiveness of each proposed module.
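The difference image referred to above is produced by pixel-wise subtraction of the registered source images. As an illustration only (the abstract does not state whether a signed or absolute difference is used, and the channel-wise stacking below is an assumption; in the actual network the three images may be processed by separate streams), a minimal PyTorch sketch of assembling such a triple input could look as follows:

```python
import torch

def build_triple_input(ir: torch.Tensor, vis: torch.Tensor) -> torch.Tensor:
    """Assemble a triple input (infrared, visible, difference image).

    ir, vis: registered single-channel images of shape (B, 1, H, W), values in [0, 1].
    The difference image highlights where the two modalities disagree; taking the
    absolute pixel-wise difference is an assumption made for illustration.
    """
    diff = torch.abs(ir - vis)                 # pixel-value subtraction -> difference image
    return torch.cat([ir, vis, diff], dim=1)   # stacked here for compactness: (B, 3, H, W)

# Example with two 256x256 source images
ir = torch.rand(1, 1, 256, 256)
vis = torch.rand(1, 1, 256, 256)
triple = build_triple_input(ir, vis)           # torch.Size([1, 3, 256, 256])
```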

     

    Abstract:
    Objective By exploiting the complementarity between infrared and visible images, infrared and visible image fusion integrates images captured by different sensors of the same scene into a single fused image that is information-rich, reliable and task-oriented, providing a comprehensive description of the scene. The fused image retains both the thermal radiation targets of the infrared image and the detailed texture information of the visible image. However, existing deep learning-based fusion methods generally take convolutional neural networks as their basic framework, for example the encoder in autoencoder-based methods and the generator and discriminator in generative adversarial networks, all of which process the input image features with large stacks of convolutional layers. Owing to the limited size and receptive field of the convolution kernel, traditional convolution operations have restricted feature-extraction capability: they focus only on local features, such as the local edges of thermal radiation target regions in infrared images, and cannot adequately preserve global features, including the rich textured background of visible images and the contours of objects and of the surrounding environment. This one-sided feature extraction leads to blurred background details and insufficiently prominent thermal radiation targets in the fused image. Therefore, a multi-modal fusion method that can extract both global and local features is urgently needed to remedy these deficiencies.
    Methods A triple multi-modal image fusion algorithm based on mixed difference convolution and an efficient vision Transformer network is proposed. The core innovations are as follows. First, a difference image is introduced at the input end: pixel-value subtraction highlights the differences between the source images, and a triple-input network architecture is constructed to enhance the discriminability of image features. Second, a mixed difference convolution (MDconv) is designed, a variant of traditional convolution that incorporates edge-detection operators and exploits pixel differences to strengthen the feature-learning capability of the convolution operation. Furthermore, a dual-branch encoder is adopted, combining a convolutional neural network branch with densely connected mixed difference convolutions and an efficient vision Transformer (EfficientViT) branch, which extract the local details and the global background of the images respectively, achieving comprehensive capture of local and global features. Finally, a multi-dimensional coordinate collaborative attention fusion strategy is employed in the fusion layer to effectively integrate the deep features of the encoded multi-modal images and realise deep feature fusion.
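The exact formulation of MDconv is not given in this extended abstract. As a hedged sketch of the general idea, the module below mixes a vanilla convolution with a central-difference term, so that the learned kernel also responds to pixel differences in the way an edge-detection operator does; the blending factor theta and this particular difference form are assumptions, not the paper's definition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedDiffConv2d(nn.Module):
    """Sketch of a mixed difference convolution (MDconv).

    Output = (1 - theta) * vanilla convolution + theta * central-difference convolution,
    i.e. the same learned kernel is also applied to pixel differences
    (x_neighbour - x_centre), injecting gradient/edge priors into the operation.
    """
    def __init__(self, in_ch, out_ch, kernel_size=3, theta=0.5):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size,
                              padding=kernel_size // 2, bias=False)
        self.theta = theta

    def forward(self, x):
        out_vanilla = self.conv(x)
        # Central-difference part: conv(x) minus (sum of kernel weights) * centre pixel
        kernel_sum = self.conv.weight.sum(dim=(2, 3), keepdim=True)  # (out_ch, in_ch, 1, 1)
        out_centre = F.conv2d(x, kernel_sum)
        out_diff = out_vanilla - out_centre
        return (1 - self.theta) * out_vanilla + self.theta * out_diff

# Example: apply MDconv to a triple-input feature map
layer = MixedDiffConv2d(3, 16)
y = layer(torch.rand(1, 3, 256, 256))   # torch.Size([1, 16, 256, 256])
```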
    Results and Discussions To verify the fusion performance of the proposed method, seven representative infrared and visible image fusion algorithms were selected for comparison, and subjective and objective evaluations were conducted on the TNO and RoadScene test sets. In terms of subjective evaluation (Fig.16), the CBF method relies on cross bilateral filtering, which requires computing weight eigenvalues and causes a considerable loss of information; it lacks the ability to extract multi-source information, so the background texture is rough, the thermal radiation targets are not salient, and the overall fusion effect is poor. The ADF method reconstructs images through the K-L transform, and its selection of the maximum pixel value omits edge pixels, so the overall background texture contrast is low and the fused image is biased towards the intensity distribution of the infrared image. The Transformer-based MFST and SwinFusion results retain the thermal radiation target information to some extent, but contain noise and artifacts, so the target clarity is low. DenseFuse and NestFuse are both based on simple autoencoders; the background texture and thermal radiation targets are preserved, but the overall contrast is low and the background information is somewhat flattened. The MTDfusion result looks unnatural overall and contains some noise. The proposed method achieves the best fusion effect, highlighting both the thermal radiation targets of the infrared image and the background details of the visible image, without introducing noise or artifacts. In terms of objective evaluation (Tab.1), compared with the other methods, the proposed method achieves the best values on the four metrics MI, VIF, SD and QAB/F and the second-best value on SF. On the TNO dataset the four metrics improve by 63.25%, 18.86%, 5.21% and 5.41%, respectively; on the RoadScene dataset they improve by 46.44%, 12.65%, 10.03% and 3.26%, respectively. In summary, the subjective and objective evaluations are consistent and together demonstrate the effectiveness of the proposed method. In addition, ablation experiments were conducted on the complete network and four variants (Tab.2): a network in which the basic convolutional layers are replaced with ordinary 3×3 convolutions; a network without the Transformer branch; a CNN branch without dense connections; and a two-input structure. The complete network achieves the best objective evaluation results, confirming that each proposed component contributes to the improvement of fusion quality.
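For reference, two of the objective metrics quoted above have simple closed-form definitions: SD is the standard deviation of the fused image's intensities, and SF (spatial frequency) is the root of the summed squares of the row- and column-wise gradient energies. A small NumPy sketch is given below; MI, VIF and QAB/F also require the source images and more involved estimators, so they are omitted:

```python
import numpy as np

def sd(fused: np.ndarray) -> float:
    """Standard deviation (SD): measures the contrast of the fused image."""
    return float(fused.std())

def spatial_frequency(fused: np.ndarray) -> float:
    """Spatial frequency SF = sqrt(RF^2 + CF^2), where RF/CF are the RMS values
    of horizontal/vertical first-order pixel differences."""
    f = fused.astype(np.float64)
    rf = np.sqrt(np.mean((f[:, 1:] - f[:, :-1]) ** 2))  # row frequency
    cf = np.sqrt(np.mean((f[1:, :] - f[:-1, :]) ** 2))  # column frequency
    return float(np.sqrt(rf ** 2 + cf ** 2))

# Example on a random 8-bit image standing in for a fused result
img = np.random.randint(0, 256, (256, 256)).astype(np.float64)
print(sd(img), spatial_frequency(img))
```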
    Conclusions A triple multi-modal image fusion algorithm based on mixed difference convolution and an efficient vision Transformer network is proposed. First, the infrared, visible and difference images are fed into a dual-branch encoder that combines densely connected mixed difference convolutions with an efficient vision Transformer. Second, a fusion strategy based on multi-dimensional coordinate collaborative attention is designed, which assigns weights according to the importance of the feature maps of the three input images, effectively retaining and integrating deep features. Finally, the deep features are passed to the decoder to complete the fusion. The mixed difference convolution injects rich prior gradient information into ordinary convolution operations, enhancing the feature-extraction capability of the convolutional layers; the efficient vision Transformer exploits long-range dependencies to integrate the global features of the infrared and visible images; and the difference image in the triple-input network feeds the salient differences between the sources into the network, improving the texture details of the fused image at the source and making it more consistent with human visual perception. Comparative experiments with state-of-the-art infrared and visible image fusion algorithms show that, in subjective terms, the proposed method enriches the background texture details of the fused image and makes the thermal radiation targets more prominent, in agreement with human visual perception, while the objective evaluation metrics are also significantly improved over those of state-of-the-art fusion methods. The ablation experiments on the network structure demonstrate the effectiveness of each proposed module. The method can also be applied in fields such as security monitoring and multispectral imaging.
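The multi-dimensional coordinate collaborative attention module is only described qualitatively here. As a rough analogue (not the paper's actual module), the sketch below aggregates the encoder features of the three inputs and re-weights them with attention maps obtained from height- and width-wise pooling, in the style of standard coordinate attention; the simple additive aggregation is likewise an assumption:

```python
import torch
import torch.nn as nn

class CoordAttentionFusion(nn.Module):
    """Rough analogue of a coordinate-attention-style fusion layer.

    Encoder features of the infrared, visible and difference images are aggregated
    and re-weighted with attention maps computed from height- and width-wise average
    pooling (standard coordinate attention); the paper's multi-dimensional coordinate
    collaborative attention is assumed to follow a broadly similar weighting principle.
    """
    def __init__(self, channels, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 4)
        self.conv1 = nn.Conv2d(channels, mid, 1)
        self.act = nn.ReLU(inplace=True)
        self.conv_h = nn.Conv2d(mid, channels, 1)
        self.conv_w = nn.Conv2d(mid, channels, 1)

    def forward(self, feat_ir, feat_vis, feat_diff):
        x = feat_ir + feat_vis + feat_diff                         # simple aggregation (assumption)
        b, c, h, w = x.shape
        pool_h = x.mean(dim=3, keepdim=True)                       # (B, C, H, 1)
        pool_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # (B, C, W, 1)
        y = self.act(self.conv1(torch.cat([pool_h, pool_w], dim=2)))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                      # (B, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (B, C, 1, W)
        return x * a_h * a_w                                       # re-weighted fused features

# Example with three 64-channel encoder feature maps
f = lambda: torch.rand(1, 64, 64, 64)
fusion = CoordAttentionFusion(64)
out = fusion(f(), f(), f())   # torch.Size([1, 64, 64, 64])
```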

     
