Objective In recent years, object tracking has been widely applied in intelligent transportation, drone aerial photography, robot vision, and other fields. However, most object tracking algorithms rely on RGB information alone, and because the RGB modality provides limited information, it is difficult to maintain tracking robustness in low-light, hazy, occluded, or illumination-changing environments. With the continuous development of thermal infrared sensor technology, researchers have noted that thermal infrared imaging has strong penetration ability and is little affected by illumination changes, and have therefore fused thermal infrared and RGB information to improve tracking performance in such environments. With the rapid development of deep learning, many deep-learning-based RGBT tracking algorithms have emerged, but their feature extraction, fusion, and matching methods remain simple, so the target is easily lost under deformation, occlusion, or low resolution. It is therefore necessary to design a robust RGBT tracking algorithm to handle tracking problems such as deformation, occlusion, and low resolution.
Methods This paper proposes SiamDPF, an RGBT progressive-fusion visual tracking algorithm with a temporally updated template, which uses the SiamFC++ algorithm as its baseline network (Fig.1). First, the algorithm combines dilated convolution and a Transformer to improve the last two convolutional layers of AlexNet, forming a multi-scale dilated attention module (Fig.2). Second, a cross-modal progressive fusion module (Fig.3) is proposed by combining cross-attention with a gating mechanism. Then, a correlation operation with a template update module (Fig.4) is proposed, which uses the target template features from the previous frame to interactively update the online template features. Finally, experiments on the GTOT and RGBT234 benchmark datasets show that the SiamDPF algorithm tracks more robustly than the compared algorithms.
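The two core operations described above, cross-modal gated fusion and temporal template updating, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation: the function names `cross_attention`, `gated_fusion`, and `update_template`, the sigmoid gate over the summed attended features, and the linear-interpolation update rule with rate `alpha` are all illustrative choices, and the real modules operate on convolutional feature maps rather than flat token matrices.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(q_feat, kv_feat):
    # q_feat, kv_feat: (N, C) token features from the two modalities;
    # queries from one modality attend over keys/values of the other
    scale = np.sqrt(q_feat.shape[-1])
    attn = softmax(q_feat @ kv_feat.T / scale)
    return attn @ kv_feat

def gated_fusion(rgb, tir):
    # cross-attend each modality to the other, then blend with a
    # sigmoid gate (an assumed form of the gating mechanism)
    rgb_enh = cross_attention(rgb, tir)
    tir_enh = cross_attention(tir, rgb)
    gate = 1.0 / (1.0 + np.exp(-(rgb_enh + tir_enh)))
    return gate * rgb_enh + (1.0 - gate) * tir_enh

def update_template(online_tpl, prev_frame_tpl, alpha=0.1):
    # temporal update: mix the previous frame's target template
    # features into the online template (alpha is hypothetical)
    return (1.0 - alpha) * online_tpl + alpha * prev_frame_tpl
```

In this sketch the fused feature keeps the input shape, so it can feed the subsequent correlation operation unchanged, and setting `alpha=0` in `update_template` degenerates to a fixed template, which is the usual ablation baseline.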
Results and Discussions The success rate (SR), precision rate (PR), and frames per second (FPS) are used as the evaluation indicators of tracking performance. The proposed algorithm and comparison algorithms are evaluated for success rate and precision on the GTOT and RGBT234 datasets; in particular, comparisons with current Siamese trackers of the same family (Tab.2) and across challenge attributes (Fig.7) fully verify the superior tracking performance of the proposed algorithm under target deformation, occlusion, low resolution, and low-light scenes. In the ablation experiments, the designed modules are evaluated in terms of parameter count and real-time performance (Tab.4) as well as PR/SR (Tab.3); the results show that the network has only 18.73×10^6 parameters and runs at 68 FPS, which verifies the effectiveness of each module designed in this paper and confirms that the running speed meets real-time requirements. The qualitative experiments intuitively show that the proposed algorithm remains robust on three kinds of video sequences involving target occlusion, target deformation, and low resolution, further verifying its effectiveness.
Conclusions An RGBT progressive-fusion object tracking algorithm with a temporally updated template is proposed to address tracking under deformation, occlusion, and low resolution. On the SiamFC++ baseline, the last two convolutional layers of the backbone network are improved by combining dilated convolution and a Transformer to enhance the representation of target features. Second, the progressive fusion module gradually exchanges shallow and deep features between the two modalities, improving the efficiency of modal fusion. Finally, the template-update cross-correlation module is used to obtain a reliable target response map. Quantitative analysis, ablation experiments, and qualitative analysis are carried out on the GTOT and RGBT234 datasets. The algorithm reaches a PR/SR of 0.916/0.735 on GTOT and 0.819/0.575 on RGBT234, which verifies the superiority of its tracking performance. The ablation experiments verify the effectiveness of the designed modules and the speed of inference. Overall, the experiments show that, compared with related algorithms, the proposed algorithm is more robust in dealing with tracking problems such as deformation, occlusion, and low resolution.