DAI Yuhang, LIU Qiao, YUAN Di, FAN Nana, LIU Yunpeng. Low-rank adaptive fine-tuning for infrared target tracking[J]. Infrared and Laser Engineering, 2024, 53(8): 20240199. DOI: 10.3788/IRLA20240199

Low-rank adaptive fine-tuning for infrared target tracking

  • Abstract: Owing to the lack of large-scale infrared tracking training datasets, most existing infrared tracking methods take a model pre-trained on large-scale visible-light data and fully fine-tune it on small-scale infrared data. However, as the parameter scale of pre-trained models grows rapidly, the memory and time costs of full fine-tuning rise sharply, which limits low-resource users in studying and applying large models. To address this problem, a parameter-, memory-, and time-efficient adaptive infrared target tracking algorithm is proposed. First, the self-attention mechanism of the Transformer performs joint feature extraction and relation modeling on the template and search-region images, yielding feature representations more strongly correlated with the target. Second, a side network of low-rank adaptive matrices decouples the trainable parameters from the backbone network, reducing the number of parameters that must be trained and updated. Finally, a lightweight spatial feature enhancement module is designed to strengthen the features' ability to discriminate the target from the background. The trainable parameters, memory, and training time of the proposed method amount to only 0.04%, 39.6%, and 66.2% of those of full fine-tuning, respectively, yet its performance surpasses full fine-tuning. Comparative experiments and ablation studies on three standard infrared tracking datasets, LSOTB-TIR120, LSOTB-TIR100, and PTB-TIR, demonstrate the effectiveness of the proposed method. It achieves a success rate of 73.7%, a precision of 86.0%, and a normalized precision of 78.5% on LSOTB-TIR120; a success rate of 71.6%, a precision of 83.9%, and a normalized precision of 76.1% on LSOTB-TIR100; and a success rate of 69.0% and a precision of 84.9% on PTB-TIR, all state-of-the-art tracking performance.

     

    Abstract:
    Objective Since infrared images suffer from limitations such as low resolution and scarce target texture detail, learning a strongly discriminative feature representation is crucial. The field of infrared target tracking currently lacks large-scale training datasets. The largest infrared tracking training dataset among the tracking benchmarks is LSOTB-TIR, which consists of 650,000 trainable video frames. Although this dataset partially alleviates the shortage of labeled infrared data, it is still far smaller than visible-light tracking datasets such as LaSOT, GOT-10k, and TrackingNet, which contain 2.8 million, 1.4 million, and 14 million trainable video frames, respectively. As a result, most existing deep-learning-based infrared target tracking methods follow a common recipe: pre-train on large-scale visible-light data, then fine-tune on small-scale infrared data. However, this full fine-tuning becomes prohibitively expensive when training a Transformer tracker with a large number of parameters, which limits researchers and users with constrained resources in exploring and applying large models.
    Methods To address this issue, this paper proposes an adaptive infrared target tracking algorithm that is efficient in parameters, memory, and time. First, joint feature extraction and relation modeling are performed on the template and search-region images via the self-attention mechanism of the Transformer, yielding feature representations that are more strongly associated with the target. Second, a side network built from low-rank adaptive matrices decouples the trainable parameters from the backbone network, reducing the number of parameters that must be trained and updated. Finally, a lightweight spatial feature enhancement module is designed to improve the features' ability to discriminate between target and background.
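The low-rank decoupling step above can be sketched as follows. This is a minimal PyTorch illustration of the general low-rank adaptation idea (a frozen backbone weight bypassed by two small trainable matrices), not the paper's exact side-network design; the class name, rank, and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class LowRankAdapter(nn.Module):
    """Low-rank bypass around a frozen linear layer:
    y = frozen(x) + up(down(x)), where only down/up are trainable."""
    def __init__(self, dim: int, rank: int = 4):
        super().__init__()
        self.frozen = nn.Linear(dim, dim, bias=False)
        self.frozen.weight.requires_grad = False      # backbone stays frozen
        self.down = nn.Linear(dim, rank, bias=False)  # dim -> rank
        self.up = nn.Linear(rank, dim, bias=False)    # rank -> dim
        nn.init.zeros_(self.up.weight)  # bypass starts at zero: no disturbance

    def forward(self, x):
        return self.frozen(x) + self.up(self.down(x))

layer = LowRankAdapter(dim=768, rank=4)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
frozen = sum(p.numel() for p in layer.parameters() if not p.requires_grad)
# Only 2*dim*rank parameters train, versus dim*dim frozen in the backbone.
```

With dim=768 and rank=4, the trainable factors hold 6,144 parameters against 589,824 frozen ones, which is the kind of reduction that lets the paper report training only 0.04% of the parameters of full fine-tuning.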
    Results and Discussions The proposed method achieves superior performance while requiring significantly fewer trainable parameters and much less memory and time than the full fine-tuning method (Tab.2). Specifically, the trainable parameters, memory, and training time of the proposed method account for only 0.04%, 39.6%, and 66.2% of full fine-tuning, respectively. Comparative experiments and ablation studies on three standard infrared tracking datasets, namely LSOTB-TIR120, LSOTB-TIR100, and PTB-TIR, confirm the effectiveness of the proposed method. On LSOTB-TIR120, it achieves a success rate of 73.7%, a precision of 86.0%, and a normalized precision of 78.5% (Fig.5). On LSOTB-TIR100, the success rate is 71.6%, the precision is 83.9%, and the normalized precision is 76.1% (Tab.1). On PTB-TIR, the success rate is 69.0% and the precision is 84.9% (Fig.6), demonstrating state-of-the-art tracking performance.
    Conclusions In response to the high cost of fully fine-tuning Transformer trackers on infrared tracking datasets, this paper presents a parameter-, memory-, and time-efficient transfer algorithm for infrared target tracking. Built upon a Transformer-based tracker, the proposed algorithm leverages the self-attention mechanism of the Transformer to simultaneously perform feature extraction and relation modeling on template and search images, helping the model extract target-relevant information. A low-rank side network (LSN) is designed to decouple the trainable parameters from the large Transformer backbone, significantly improving training efficiency. Furthermore, a spatial feature enhancement (SFE) module is introduced to strengthen the model's ability to discriminate targets from complex backgrounds by spatially augmenting the information collected by the LSN. Experimental results on the LSOTB-TIR120, LSOTB-TIR100, and PTB-TIR datasets demonstrate the superiority of the proposed algorithm over other methods. To further improve training efficiency, the LSN is composed of multiple low-rank linear matrices, but this does not address the Transformer tracking model's limited ability to capture local spatial information. Future work will incorporate a spatial prior information module to improve the model's discrimination of local spatial information.
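As a rough illustration of how a lightweight module can add local spatial context to Transformer token features, the sketch below applies a residual depthwise convolution over the token grid. The actual SFE design in the paper may differ; the class name, kernel size, and dimensions here are assumptions.

```python
import torch
import torch.nn as nn

class SpatialFeatureEnhance(nn.Module):
    """Cheap spatial enhancement for token features: reshape tokens to a
    feature map, mix each channel locally with a depthwise 3x3 conv, and
    add the result back as a residual."""
    def __init__(self, dim: int):
        super().__init__()
        # groups=dim makes the conv depthwise: dim*3*3 weights in total
        self.dw = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, tokens, h, w):
        # tokens: (B, h*w, dim) -> (B, dim, h, w) -> enhance -> back
        b, n, c = tokens.shape
        x = tokens.transpose(1, 2).reshape(b, c, h, w)
        x = x + self.dw(x)                  # residual local spatial mixing
        return x.flatten(2).transpose(1, 2)

sfe = SpatialFeatureEnhance(dim=32)
out = sfe(torch.randn(2, 16 * 16, 32), 16, 16)  # shape preserved: (2, 256, 32)
```

The depthwise design keeps the parameter count linear in the channel dimension, consistent with the paper's goal of keeping the add-on modules lightweight relative to the frozen backbone.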

     

