Abstract:
Objective Infrared images have low resolution and limited target texture detail, so learning a strongly discriminative feature representation is crucial. Infrared target tracking currently lacks large-scale training datasets. The largest infrared tracking training dataset among existing benchmarks, LSOTB-TIR, contains 650 000 trainable video frames. It partially alleviates the shortage of labeled infrared data, but it is still much smaller than visible-light tracking datasets such as LaSOT, GOT-10k, and TrackingNet, which contain 2.8 million, 1.4 million, and 14 million trainable video frames, respectively. As a result, most existing deep learning-based infrared tracking methods are pre-trained on large-scale visible-light data and then fine-tuned on small-scale infrared data. However, full fine-tuning becomes prohibitively expensive for a Transformer tracker with a large number of parameters, which makes it difficult for researchers and users with limited resources to explore and apply large-scale models.
Methods To address this issue, this paper proposes an adaptive infrared target tracking algorithm that is efficient in parameters, memory, and time. First, the self-attention mechanism of the Transformer performs joint feature extraction and relationship modeling on the template and search-region images, yielding feature representations that are more closely associated with the target. Second, a low-rank adaptive matrix is employed in a side network to decouple the trainable parameters from the backbone network, which reduces the number of parameters that must be trained and updated. Finally, a lightweight spatial feature enhancement module is designed to improve the ability of the features to discriminate between targets and backgrounds.
Results and Discussions The proposed method achieves superior performance while requiring significantly fewer trainable parameters, less memory, and less training time than full fine-tuning (Tab.2). Specifically, its trainable parameters, memory, and training time amount to only 0.04%, 39.6%, and 66.2%, respectively, of those required by full fine-tuning. Comparative and ablation experiments on three standard infrared tracking datasets, namely LSOTB-TIR120, LSOTB-TIR100, and PTB-TIR, confirm the effectiveness of the proposed method. On the LSOTB-TIR120 dataset, the proposed method achieves a success rate of 73.7%, an accuracy of 86.0%, and a normalized accuracy of 78.5% (Fig.5). On the LSOTB-TIR100 dataset, the success rate is 71.6%, the accuracy is 83.9%, and the normalized accuracy is 76.1% (Tab.1). On the PTB-TIR dataset, the success rate is 69.0% and the accuracy is 84.9% (Fig.6), demonstrating state-of-the-art tracking performance.
Conclusions In response to the high cost of fully fine-tuning Transformer trackers on infrared tracking datasets, this paper presents a transfer algorithm for infrared target tracking that is efficient in parameters, memory, and time. Built upon a Transformer-based tracker, the proposed algorithm uses the self-attention mechanism of the Transformer to perform feature extraction and relationship modeling on the template and search images simultaneously, which helps the model extract target-relevant information. Additionally, a low-rank side network (LSN) is designed to decouple the trainable parameters from the large Transformer backbone network, significantly improving training efficiency. Furthermore, a spatial feature enhancement (SFE) module spatially augments the information collected by the LSN, improving the model's ability to discriminate targets from complex backgrounds. Experimental results on three datasets, LSOTB-TIR120, LSOTB-TIR100, and PTB-TIR, demonstrate the superiority of the proposed algorithm over other methods. To keep training efficient, the LSN is composed of multiple low-rank linear matrices; however, this design does not address the Transformer tracking model's limited ability to capture local spatial information. Future work will incorporate a spatial prior information module to improve the model's ability to discriminate local spatial information.
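As a rough illustration of why a side network built from low-rank matrices needs so few trainable parameters, the following minimal sketch compares one low-rank layer with a full dense layer of the same width. It is not the paper's implementation: the class name `LowRankSideLayer`, the feature width of 768, the rank of 4, and the zero initialization of the up-projection are all assumptions chosen for illustration, and the resulting ratio is not meant to reproduce the 0.04% figure reported above.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class LowRankSideLayer:
    """Hypothetical side-network layer: out = up @ (down @ frozen_feature).

    The backbone feature is treated as a fixed input, so only `down` and
    `up` would be trained. Names and initialization are assumptions.
    """

    def __init__(self, dim, rank, seed=0):
        rng = random.Random(seed)
        # Down-projection (rank x dim): small random values.
        self.down = [[rng.uniform(-0.02, 0.02) for _ in range(dim)]
                     for _ in range(rank)]
        # Up-projection (dim x rank): zeros, so the layer starts as a no-op
        # (an assumed convention borrowed from low-rank adaptation).
        self.up = [[0.0] * rank for _ in range(dim)]

    def num_trainable(self):
        """Count the entries of both low-rank factors."""
        return (len(self.down) * len(self.down[0])
                + len(self.up) * len(self.up[0]))

    def forward(self, frozen_feature):
        """Project down to `rank` dimensions, then back up to `dim`."""
        return matvec(self.up, matvec(self.down, frozen_feature))


dim, rank = 768, 4                    # assumed sizes, not from the paper
layer = LowRankSideLayer(dim, rank)
full_params = dim * dim               # a dense dim x dim layer
ratio = layer.num_trainable() / full_params
output = layer.forward([1.0] * dim)   # stand-in for a frozen backbone feature

print(f"low-rank: {layer.num_trainable()}, dense: {full_params}, "
      f"ratio: {ratio:.4f}")
```

With these assumed sizes, the two low-rank factors hold 2 × 768 × 4 entries against 768 × 768 for the dense layer. Because gradients in such a design only need to flow through the small side layers rather than the backbone, memory and time can also drop, though the actual savings depend on the full architecture.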