Citation: GUO Yong, SHEN Haiyun, CHEN Jianyu, YUAN Jiemin. An RGBT progressive fusion visual tracking with time-domain updated templates[J]. Infrared and Laser Engineering. DOI: 10.3788/IRLA20240260


An RGBT progressive fusion visual tracking with time-domain updated templates

  • Abstract: Because most RGBT tracking algorithms rely on simple feature extraction, fusion, and matching schemes, the tracked target is easily lost under deformation, occlusion, and low resolution. To address these problems, an RGBT progressive fusion object tracking algorithm with time-domain template updating, SiamDPF, is proposed. First, in the feature extraction stage, the last two layers of the AlexNet backbone of both modalities are improved with dilated convolution and a Transformer to strengthen the feature representation of low-resolution targets. Second, a progressive fusion module combining cross-attention and a gating mechanism is proposed to progressively and interactively fuse the shallow and deep features of the two modalities, making the fusion of modality information more complete. Finally, so that the tracker can exploit temporal context to better handle deforming targets, cross-attention is used to interactively update the online template features with the target features of the previous frame. Experimental results on the GTOT and RGBT234 benchmark datasets show that, compared with other algorithms, SiamDPF is more robust when facing target deformation, occlusion, and low resolution.
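As a concrete illustration of the backbone modification summarized above, the following is a minimal PyTorch sketch of a multi-scale dilated attention block that combines parallel dilated convolutions with Transformer-style self-attention; the channel sizes, dilation rates, and head count are illustrative assumptions rather than the exact configuration used by SiamDPF.

```python
# Minimal sketch: parallel dilated convolutions followed by Transformer-style
# self-attention, of the kind the abstract applies to the last AlexNet layers.
# All sizes below (channels, dilation rates, heads) are assumptions.
import torch
import torch.nn as nn


class MultiScaleDilatedAttention(nn.Module):
    def __init__(self, in_ch=256, out_ch=256, dilations=(1, 2, 3), heads=4):
        super().__init__()
        # Parallel dilated branches enlarge the receptive field at several
        # scales, which helps represent low-resolution targets.
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d) for d in dilations]
        )
        self.merge = nn.Conv2d(out_ch * len(dilations), out_ch, 1)
        # Spatial self-attention over the merged feature map.
        self.attn = nn.MultiheadAttention(out_ch, heads, batch_first=True)
        self.norm = nn.LayerNorm(out_ch)

    def forward(self, x):
        x = self.merge(torch.cat([b(x) for b in self.branches], dim=1))
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        attended, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + attended)            # residual + layer norm
        return tokens.transpose(1, 2).reshape(b, c, h, w)


if __name__ == "__main__":
    feat = torch.randn(1, 256, 18, 18)                   # deep AlexNet-like feature map
    print(MultiScaleDilatedAttention()(feat).shape)      # torch.Size([1, 256, 18, 18])
```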


    Abstract:
    Objective  In recent years, object tracking has been widely applied in intelligent transportation, drone aerial photography, robot vision, and other fields. However, most trackers rely only on RGB information, and because the RGB modality provides limited cues, it is difficult to maintain robust tracking under low light, haze, illumination change, and similar conditions. With the continuous development of thermal infrared sensor technology, researchers have exploited the strong penetration ability of thermal infrared imaging and its insensitivity to illumination change, fusing thermal infrared and RGB information to improve tracking performance in such environments. Deep learning has developed rapidly, and many deep-learning-based RGBT tracking algorithms have emerged, but their feature extraction, fusion, and matching schemes remain simple, so the target is lost when facing deformation, occlusion, and low resolution. It is therefore necessary to design a robust RGBT tracking algorithm that handles these tracking problems.
    Methods  This paper proposes an RGBT progressive fusion visual tracking algorithm with a time-domain updated template, SiamDPF, which uses SiamFC++ as the baseline network (Fig. 1). First, dilated convolution and a Transformer are used to improve the convolutions of the last two layers of AlexNet, yielding a multi-scale dilated attention module (Fig. 2). Second, a cross-modal progressive fusion module (Fig. 3) is proposed by combining cross-attention with a gating mechanism. Then, a correlation module with template updating (Fig. 4) is proposed, in which the target features of the previous frame interactively update the online template features. Finally, experiments on the GTOT and RGBT234 benchmark datasets show that SiamDPF tracks more robustly than the compared algorithms.
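    To make the fusion step concrete, the sketch below shows, under assumed shapes and names, how cross-attention and a gating mechanism might be combined: each modality attends to the other, and a learned gate weights how much of the exchanged information is kept. It is an illustration, not the exact design of Fig. 3; in a progressive scheme, such a block would be applied at both the shallow and deep stages.

```python
# Hedged sketch of cross-attention + gating fusion between RGB and thermal
# infrared (TIR) features; module names, shapes, and the gating form are
# illustrative assumptions.
import torch
import torch.nn as nn


class CrossModalGatedFusion(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.rgb_from_tir = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.tir_from_rgb = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gates squash to (0, 1) and weight the attended features per channel.
        self.gate_rgb = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.gate_tir = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_rgb, f_tir):
        # f_rgb, f_tir: (B, H*W, C) token sequences from the two modalities.
        rgb_att, _ = self.rgb_from_tir(f_rgb, f_tir, f_tir)   # RGB queries TIR
        tir_att, _ = self.tir_from_rgb(f_tir, f_rgb, f_rgb)   # TIR queries RGB
        g_rgb = self.gate_rgb(torch.cat([f_rgb, rgb_att], dim=-1))
        g_tir = self.gate_tir(torch.cat([f_tir, tir_att], dim=-1))
        # Residual update keeps each modality's own features dominant.
        return f_rgb + g_rgb * rgb_att, f_tir + g_tir * tir_att


if __name__ == "__main__":
    rgb = torch.randn(1, 18 * 18, 256)
    tir = torch.randn(1, 18 * 18, 256)
    out_rgb, out_tir = CrossModalGatedFusion()(rgb, tir)
    print(out_rgb.shape, out_tir.shape)   # both torch.Size([1, 324, 256])
```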
    Results and Discussions  With the above design, the success rate (SR), precision rate (PR), and frames per second (FPS) are used as evaluation indicators of tracking performance. In the evaluation experiments, the success rate and precision of the proposed algorithm and the compared algorithms are evaluated on the GTOT and RGBT234 datasets, in particular against current Siamese trackers of the same family (Table 2) and across the challenge attributes (Fig. 7), which fully verifies the superior tracking performance of the proposed algorithm under target deformation, occlusion, low resolution, and low-light scenes. In the ablation experiments, the modules designed in this paper are evaluated in terms of parameter count, real-time performance (Table 4), and PR/SR (Table 3); the results show that the network has only 18.73×10^6 parameters and runs at 68 FPS, which verifies the effectiveness of each module and shows that the running speed meets the real-time requirement. The qualitative experiments show intuitively that the proposed algorithm maintains robust tracking on three video sequences involving target occlusion, target deformation, and low resolution, further verifying its effectiveness.
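    For reference, the sketch below computes PR and SR in the conventional way for these benchmarks: PR is the fraction of frames whose center-location error is below a pixel threshold (commonly 5 px on GTOT, whose targets are small, and 20 px on RGBT234), and SR is the area under the success curve over overlap thresholds. It reflects the standard evaluation protocol rather than code released with the paper.

```python
# Conventional PR/SR computation for RGBT tracking benchmarks (assumed
# protocol, not the paper's released evaluation code).
import numpy as np


def center_error(pred, gt):
    # Boxes are (x, y, w, h); return the Euclidean distance between centers.
    cp = pred[:, :2] + pred[:, 2:] / 2
    cg = gt[:, :2] + gt[:, 2:] / 2
    return np.linalg.norm(cp - cg, axis=1)


def iou(pred, gt):
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-9)


def pr_sr(pred, gt, px_thresh=20):
    # px_thresh: 20 for RGBT234, 5 for GTOT (commonly used values).
    pr = float((center_error(pred, gt) <= px_thresh).mean())
    thresholds = np.linspace(0, 1, 21)
    success = [(iou(pred, gt) > t).mean() for t in thresholds]
    sr = float(np.mean(success))   # mean over uniform thresholds ~ area under curve
    return pr, sr


if __name__ == "__main__":
    pred = np.array([[10, 10, 20, 20]], dtype=float)
    gt = np.array([[12, 11, 20, 20]], dtype=float)
    print(pr_sr(pred, gt))
```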
    Conclusions  An RGBT progressive fusion object tracking algorithm with a time-domain updated template is proposed to address the tracking problems of deformation, occlusion, and low resolution. On the SiamFC++ baseline, the convolutions of the last two backbone layers are improved by combining dilated convolution and a Transformer to enhance the representation of target features. The progressive fusion module then gradually exchanges and fuses the shallow and deep features of the two modalities, improving the efficiency of modality fusion. Finally, a template-updated cross-correlation module is used to obtain a reliable target response map. Quantitative analysis, ablation experiments, and qualitative analysis are carried out on the GTOT and RGBT234 datasets. The PR/SR of the algorithm reaches 0.916/0.735 on GTOT and 0.819/0.575 on RGBT234, which verifies the superiority of its tracking performance. The ablation experiments verify the effectiveness of the designed modules and the speed of inference. Overall, compared with related algorithms, the proposed algorithm is more robust in dealing with tracking problems such as deformation, occlusion, and low resolution.
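    The template-updated cross-correlation mentioned above can be pictured with the following sketch: the online template attends to the previous-frame target features through cross-attention, and the refreshed template then serves as the kernel of a depth-wise cross-correlation with the search-region features to produce the response map. Shapes and module names are illustrative assumptions, not the exact module of Fig. 4.

```python
# Hedged sketch: cross-attention template update followed by depth-wise
# cross-correlation; all sizes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemplateUpdateCorrelation(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.update = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, template, prev_target, search):
        # template, prev_target: (B, C, Hz, Wz); search: (B, C, Hx, Wx)
        b, c, hz, wz = template.shape
        t = template.flatten(2).transpose(1, 2)          # (B, Hz*Wz, C)
        p = prev_target.flatten(2).transpose(1, 2)
        upd, _ = self.update(t, p, p)                    # template attends to last frame
        t = self.norm(t + upd).transpose(1, 2).reshape(b, c, hz, wz)
        # Depth-wise cross-correlation: the updated template is the kernel.
        resp = F.conv2d(search.reshape(1, -1, *search.shape[-2:]),
                        t.reshape(-1, 1, hz, wz), groups=b * c)
        return resp.reshape(b, c, *resp.shape[-2:])


if __name__ == "__main__":
    z = torch.randn(1, 256, 6, 6)        # online template features
    zp = torch.randn(1, 256, 6, 6)       # previous-frame target features
    x = torch.randn(1, 256, 22, 22)      # search-region features
    print(TemplateUpdateCorrelation()(z, zp, x).shape)   # torch.Size([1, 256, 17, 17])
```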
