本节在OTB100[3]、VOT2018[77]、GOT-10k[78]和LaSOT[79]四个公开数据集上对上述40余种跟踪算法进行全面评估。首先介绍各数据集及相应的性能评价方法,然后对实验结果进行对比与分析。所有测试结果均来自原论文或官方源码。
(1) OTB100
Wu等人[3]于2015年提出的OTB100是目前最常用的跟踪数据集之一。该数据集包含100个完全标注的视频序列,涉及目标跟踪的11种挑战属性:光照变化、尺度变化、遮挡、形变、运动模糊、快速运动、平面内旋转、平面外旋转、出视野、背景干扰和低分辨率。OTB的评价指标为距离精度(Distance Precision)和重叠成功率(Overlap Success),测试时采用一次通过评估(One-Pass Evaluation, OPE),即仅用第一帧的真值初始化跟踪器,随后完整跑完序列且不作任何干预。
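下面给出按上述定义计算OPE距离精度与成功率AUC的一个最小示意(假设边框采用[x, y, w, h]格式,20像素阈值与21个重叠阈值为OTB常用设置;函数与变量命名均为示例):

```python
import numpy as np

def iou_xywh(pred, gt):
    """逐帧计算预测框与标注框的IoU,边框格式假设为[x, y, w, h]。"""
    x1 = np.maximum(pred[:, 0], gt[:, 0])
    y1 = np.maximum(pred[:, 1], gt[:, 1])
    x2 = np.minimum(pred[:, 0] + pred[:, 2], gt[:, 0] + gt[:, 2])
    y2 = np.minimum(pred[:, 1] + pred[:, 3], gt[:, 1] + gt[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    union = pred[:, 2] * pred[:, 3] + gt[:, 2] * gt[:, 3] - inter
    return inter / np.maximum(union, 1e-12)

def ope_metrics(pred, gt):
    """一次通过评估:返回20像素阈值下的距离精度和成功率曲线的AUC。"""
    cp = pred[:, :2] + pred[:, 2:] / 2            # 预测框中心
    cg = gt[:, :2] + gt[:, 2:] / 2                # 标注框中心
    dist = np.linalg.norm(cp - cg, axis=1)
    precision_20 = float(np.mean(dist <= 20))     # 常用20像素阈值
    overlaps = iou_xywh(pred, gt)
    thresholds = np.linspace(0, 1, 21)            # 21个重叠阈值
    success = [np.mean(overlaps > t) for t in thresholds]
    return precision_20, float(np.mean(success))  # AUC即成功率曲线均值
```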
(2) VOT2018
VOT2018[77]数据集包含60个旋转框标注的序列,逐帧标注了遮挡、光照变化、运动变化、尺度变化、相机运动和未标注(empty)6种属性。VOT采用重启机制:当预测框与标注框的重叠率降为0时,判定跟踪失败并重新初始化跟踪器。VOT2018的评价指标为精确性(Accuracy)、鲁棒性(Robustness)和期望平均重叠率(Expected Average Overlap, EAO)。
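下面给出该重启式评估中统计精确性与失败次数的一个简化示意(实际VOT工具包在失败后延迟数帧才重新初始化,并对重启后的burn-in帧不计精确性,EAO还需在序列长度区间上取期望;此处的burnin参数与流程均为近似假设):

```python
import numpy as np

def vot_accuracy_failures(ious, burnin=10):
    """简化的VOT式统计:重叠率为0记一次失败并重启,
    重启后的burnin帧不计入精确性;鲁棒性通常由失败次数换算得到。"""
    failures, kept, skip = 0, [], 0
    for v in ious:                 # ious为逐帧重叠率序列
        if skip > 0:               # 跳过重启后的burn-in帧
            skip -= 1
            continue
        if v == 0:                 # 重叠率为0 → 失败并重新初始化
            failures += 1
            skip = burnin
        else:
            kept.append(v)
    accuracy = float(np.mean(kept)) if kept else 0.0
    return accuracy, failures
```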
(3) GOT-10k
GOT-10k[78]是一个通用的大规模目标跟踪数据集,包含超过10000个视频序列、563个目标类别和超过150万个标注框,尽可能多地涵盖具有挑战性的现实场景。GOT-10k训练集与测试集的目标类别互不重叠,以考察模型对未见类别的泛化能力。评价指标为平均重叠率(Average Overlap, AO)和成功率(Success Rate, SR)。
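AO与SR都可以直接由逐帧IoU序列得到,下面给出一个最小示意(阈值0.50与0.75对应表1中的SR0.50与SR0.75):

```python
import numpy as np

def got10k_metrics(ious):
    """由逐帧IoU计算GOT-10k的AO、SR0.50与SR0.75。"""
    ious = np.asarray(ious, dtype=float)
    ao = float(ious.mean())             # 平均重叠率
    sr50 = float((ious > 0.50).mean())  # IoU阈值0.50下的成功率
    sr75 = float((ious > 0.75).mean())  # IoU阈值0.75下的成功率
    return ao, sr50, sr75
```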
(4) LaSOT
LaSOT[79]包含1400个视频序列和超过350万帧手工标注图片,是目前最大的密集标注单目标跟踪数据集。该数据集包含70个类别,每个类别20个序列,每个序列平均2512帧,偏重长时跟踪任务且难度相对较大。LaSOT划分280个序列用于测试,评价方式与OTB类似,并增加了归一化精度(Normalized Precision)指标,以消除目标尺寸和图像分辨率对中心误差的影响。
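归一化精度将中心点误差按标注框的宽高归一化后再统计精度曲线,下面给出一个最小示意(阈值区间取0到0.5为常见设置,边框格式仍假设为[x, y, w, h],具体归一化细节以LaSOT官方工具包为准):

```python
import numpy as np

def normalized_precision(pred, gt, max_thr=0.5, steps=51):
    """LaSOT风格的归一化精度:中心误差按标注框宽高归一化后,
    对一段阈值区间上的精度曲线取均值。"""
    cp = pred[:, :2] + pred[:, 2:] / 2
    cg = gt[:, :2] + gt[:, 2:] / 2
    # 将(dx, dy)分别除以标注框的(w, h),消除尺度与分辨率影响
    nd = np.linalg.norm((cp - cg) / np.maximum(gt[:, 2:], 1e-12), axis=1)
    thresholds = np.linspace(0, max_thr, steps)
    return float(np.mean([np.mean(nd <= t) for t in thresholds]))
```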
表1展示了所有跟踪算法的定量比较结果。对于OTB100和LaSOT,按成功率(AUC)取前五名:OTB100上依次为RPT、DROL、CGACD、SiamRCNN、SiamCAR;LaSOT上依次为SiamRCNN、PrDiMP、TACT、FCOT、DiMP。按精度(PR)排名,OTB100的前五名是RPT、DROL、SiamDW、CGACD、Ocean;LaSOT的前五名是SiamRCNN、PrDiMP、FCOT、TACT、Ocean。从结果可以发现,对于LaSOT这类较长的视频序列,排名靠前的算法大多依赖两阶段结构和模型更新:两阶段结构在鲁棒性和判别性之间的平衡能有效应对长时跟踪中的干扰物和模型漂移,而判别式的更新方法也能及时适应目标和场景的各类变化。
表 1 跟踪算法在OTB100、LaSOT、GOT-10k和VOT2018上的性能对比
Table 1. Performance comparison of siamese tracking methods on OTB100, LaSOT, GOT-10k and VOT2018
| Method | TYPE-A | TYPE-S | OTB100 AUC | OTB100 PR | LaSOT AUC | LaSOT NPR | GOT-10k AO | GOT-10k SR0.50 | GOT-10k SR0.75 | VOT2018 A | VOT2018 R | VOT2018 EAO |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SiamRPN [29] | T | 1 | 0.637 | 0.851 | 0.457 | - | - | - | - | - | - | - |
| DaSiamRPN [35] | T | 1 | 0.658 | 0.880 | 0.415 | 0.496 | - | - | - | 0.590 | 0.276 | 0.383 |
| SiamRPN++ [38] | T | 1 | 0.696 | 0.915 | 0.496 | 0.569 | 0.518 | 0.618 | 0.325 | 0.600 | 0.234 | 0.414 |
| SiamDW [37] | T | 1 | 0.674 | 0.923 | 0.384 | 0.476 | 0.416 | - | - | - | - | 0.270 |
| SiamMask [31] | T | 1 | - | - | - | - | 0.514 | 0.587 | 0.366 | 0.610 | 0.276 | 0.380 |
| SiamMan [33] | T | 1 | 0.705 | 0.919 | - | - | - | - | - | 0.605 | 0.183 | 0.462 |
| THOR [54] | T | 1 | 0.648 | 0.791 | - | - | 0.447 | 0.538 | 0.204 | 0.582 | 0.234 | 0.416 |
| DROL [58] | T | 1 | 0.715 | 0.934 | 0.537 | 0.624 | - | - | - | 0.616 | - | 0.481 |
| SiamAttn [55] | T | 1 | 0.712 | 0.926 | 0.560 | 0.648 | - | - | - | 0.636 | 0.160 | 0.470 |
| AFAT [56] | T | 1 | 0.663 | 0.874 | 0.492 | 0.574 | - | - | - | 0.605 | 0.239 | 0.419 |
| UpdateNet [57] | T | 1 | - | - | 0.475 | 0.560 | - | - | - | - | - | 0.393 |
| SiamFC++ [41] | F | 1 | 0.683 | 0.896 | 0.544 | 0.623 | 0.595 | 0.695 | 0.479 | 0.587 | 0.183 | 0.426 |
| AFSN [49] | F | 1 | 0.675 | 0.868 | - | - | - | - | - | 0.589 | 0.204 | 0.398 |
| SATIN [48] | F | 1 | 0.641 | 0.844 | - | - | - | - | - | - | - | - |
| SiamBAN [43] | F | 1 | 0.696 | 0.910 | 0.514 | 0.598 | - | - | - | 0.597 | 0.178 | 0.452 |
| SiamCAR [40] | F | 1 | 0.697 | 0.910 | - | - | 0.569 | 0.670 | 0.415 | - | - | - |
| CGACD [50] | F | 1 | 0.713 | 0.922 | 0.518 | 0.626 | - | - | - | 0.615 | 0.173 | 0.449 |
| FCAF [80] | F | 1 | 0.649 | 0.860 | - | - | - | - | - | - | - | - |
| FCOT [45] | F | 1 | 0.693 | 0.913 | 0.569 | 0.678 | 0.640 | 0.763 | 0.517 | 0.600 | 0.108 | 0.508 |
| PGNet [34] | F | 1 | 0.691 | 0.892 | 0.531 | 0.605 | - | - | - | 0.618 | 0.192 | 0.447 |
| Ocean [19] | F | 1 | 0.684 | 0.920 | 0.560 | - | 0.611 | 0.721 | 0.473 | 0.592 | 0.117 | 0.489 |
| Ocean+ [44] | F | 1 | - | - | - | - | - | - | - | - | - | - |
| RPT [52] | F | - | 0.715 | 0.936 | - | - | 0.624 | 0.730 | 0.504 | 0.629 | 0.103 | 0.510 |
| AlphaRef [51] | - | 1 | - | - | 0.589 | 0.649 | - | - | - | 0.633 | 0.136 | 0.476 |
| SiamKPN [64] | F | 2 | 0.712 | 0.927 | 0.498 | - | 0.529 | 0.606 | 0.362 | 0.606 | 0.192 | 0.440 |
| SPLT [61] | T | 2 | - | - | 0.426 | 0.494 | - | - | - | - | - | - |
| CRPN [63] | T | 2 | 0.663 | - | 0.455 | 0.542 | - | - | - | - | - | - |
| SPM [59] | T | 2 | 0.687 | 0.889 | 0.485 | - | 0.513 | 0.593 | 0.359 | 0.580 | 0.300 | 0.338 |
| TACT [67] | T | 2 | - | - | 0.575 | 0.660 | 0.578 | 0.665 | 0.477 | - | - | - |
| SiamRCNN [69] | T | 2 | 0.701 | 0.891 | 0.648 | 0.722 | 0.649 | 0.728 | 0.597 | 0.609 | 0.220 | 0.408 |
| GlobalT [68] | T | 2 | - | - | 0.521 | 0.599 | - | - | - | - | - | - |
| LTAO [70] | T | 2 | - | - | - | - | - | - | - | - | - | - |
| ATOM [72] | others | - | 0.667 | 0.879 | 0.514 | 0.576 | 0.556 | 0.635 | 0.402 | 0.590 | 0.204 | 0.401 |
| DiMP [65] | others | - | 0.686 | 0.899 | 0.569 | 0.648 | 0.611 | 0.717 | 0.492 | 0.597 | 0.153 | 0.440 |
| PrDiMP [66] | others | - | 0.696 | 0.897 | 0.598 | - | 0.634 | 0.738 | 0.543 | 0.618 | 0.165 | 0.442 |
| SSD-MAML [71] | others | - | 0.620 | - | - | - | - | - | - | - | - | - |
| FRCNN-MAML [71] | others | - | 0.647 | - | - | - | - | - | - | - | - | - |
| FCOS-MAML [21] | others | - | 0.704 | 0.905 | 0.523 | - | - | - | - | 0.635 | 0.220 | 0.392 |
| Retina-MAML [21] | others | - | 0.712 | 0.926 | 0.480 | - | - | - | - | 0.604 | 0.159 | 0.452 |

Note: Bold fonts in the original table are ranked top-3. '-' means the corresponding result is not given in the original literature. 'TYPE' is the classification basis delineated in this paper, where 'A' indicates the anchor setting (anchor-based 'T'/anchor-free 'F'), 'S' indicates the stage number (one-stage '1'/two-stage '2'), and 'others' indicates other classes.

对于VOT2018,精度(A)领先的是SiamAttn、Alpha-Refine、RPT、PGNet、DROL;鲁棒性(R)领先的是RPT、FCOT、Ocean、Alpha-Refine、DiMP;EAO领先的则是RPT、FCOT、Ocean、DROL、Alpha-Refine。VOT2018的重启机制使得鲁棒性指标的波动范围很大(所有方法中精度的最大差距为0.056,而鲁棒性的差距达0.197)。领先的方法大多采用灵活的无锚框结构,它们对IOU较小的预测框有更强的矫正能力,从而避免跟踪失败导致的重启。
对于GOT-10 k,平均重叠率(AO)领先的是SiamRCNN, FCOT, PrDiMP, RPT, DiMP;IOU阈值为0.5的成功率(SR0.50)排名为FCOT, PrDiMP, RPT, SiamRCNN, DiMP;IOU阈值为0.75的成功率(SR0.75)排名为SiamRCNN, PrDiMP, FCOT, RPT, DiMP。不难看出,对边框预测做了特殊处理(如两阶段预测、不确定性预测、在线优化、关键点表示等)的方法在SR0.75上效果普遍较好。
综合上述方法描述以及实验分析,按照文中的分类方式总结了不同检测技术用于孪生目标跟踪算法的优缺点(见表2),并据此归纳出融合检测技术的孪生目标跟踪算法的六条设计经验:(1)检测网络的预测头部结构可以提升目标状态估计的精度;(2)无锚框结构相比有锚框结构对目标形变具有更强的适应性;(3)两阶段结构面对复杂干扰场景具有更强的判别能力,而单阶段结构的速度更快;(4)将时序信息融入检测框架能更好地处理目标和场景的变化;(5)对状态估计质量单独进行评估可以进一步提升预测目标框的精度;(6)检测器具有直接转变为跟踪器的潜力。这些经验可以为后续研究者设计跟踪算法提供一定的指导,表2之后给出了体现其中经验(2)与(5)的一个最小结构示意。
表 2 不同检测技术用于孪生目标跟踪算法的优缺点对比
Table 2. Comparison of advantages and disadvantages of siamese trackers with different detection techniques
| Taxonomy | Category | Advantage | Limitation |
| --- | --- | --- | --- |
| State estimation | Anchor-based | First introducing the RPN detection technique; discarding multi-scale search and predicting bounding boxes with arbitrary aspect ratios | Relying on prior knowledge; incapable of rectifying weak predictions |
| State estimation | Anchor-free | Fewer parameters and faster speed; correcting weak predictions caused by deformation and fast motion | Requiring additional constraints (such as location quality) due to the lack of prior knowledge |
| Stage number | One-stage | Fast speed; easy to add additional modules (e.g. model update) | Weak discriminability against semantic interference |
| Stage number | Two-stage | Better balance of robustness and discriminability | Complex structure and slow speed |
| Others | IoUNet-based prediction | More accurate evaluation of location quality | - |
| Others | Detector transformed into tracker | Narrowing the difference between detection and tracking with a common pattern to solve both problems | - |
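下面给出体现经验(2)与(5)的一个最小结构示意:无锚框预测头在分类和回归分支之外增加一个位置质量分支,推理时用质量得分修正分类得分。该示例仅为作者归纳经验的示范性写法,并不对应文中任何具体算法,模块结构与参数命名均为假设:

```python
import torch
import torch.nn as nn

class AnchorFreeHead(nn.Module):
    """无锚框预测头示意:分类、边框回归与位置质量三个并行分支。"""
    def __init__(self, in_ch=256):
        super().__init__()
        self.cls = nn.Conv2d(in_ch, 1, 3, padding=1)      # 前景/背景分类
        self.reg = nn.Conv2d(in_ch, 4, 3, padding=1)      # 各位置回归(l, t, r, b)四个距离
        self.quality = nn.Conv2d(in_ch, 1, 3, padding=1)  # 位置质量(如中心度)

    def forward(self, corr_feat):
        cls_logits = self.cls(corr_feat)
        distances = self.reg(corr_feat).exp()             # 指数变换保证回归距离为正
        quality = self.quality(corr_feat)
        # 经验(5):用质量得分修正分类得分,再据此选取最终预测框
        score = cls_logits.sigmoid() * quality.sigmoid()
        return score, distances

# 用法示意:对模板与搜索区域的相关特征图直接做密集预测
head = AnchorFreeHead(in_ch=256)
score, distances = head(torch.randn(1, 256, 25, 25))
```

由于每个位置直接回归四个边界距离,这种结构不依赖锚框先验,对目标形变引起的长宽比变化更易适应,即经验(2)所述的优势。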
A survey of siamese network tracking algorithms integrating detection technology
摘要:近年来,基于孪生网络的方法在视觉目标跟踪中取得了巨大的进步,但是这类方法在处理跟踪中的目标状态估计以及复杂场景干扰方面仍存在较大的提升空间。随着深度学习在目标检测领域取得的成功,越来越多的研究将其成果用于指导目标跟踪技术的发展。文中对融合检测技术的孪生目标跟踪算法进行了综述:首先介绍检测和跟踪的联系与区别,并分析检测技术对改进基于孪生网络的跟踪算法的可行性;然后阐述在不同检测框架指导下的孪生目标跟踪算法,并使用OTB100、VOT2018、GOT-10k和LaSOT公开数据集对各类算法进行对比和分析;最后对全文进行总结,并对目标跟踪的未来发展方向进行展望。

Abstract: In recent years, siamese tracking networks have achieved promising performance in visual tracking. However, there is still large room for improvement in the challenges of target state estimation and complex aberrances for siamese trackers. With the success of deep learning in object detection, more and more object detection technologies are used to guide object tracking. This survey reviews the siamese tracking algorithms integrating detection technologies. Firstly, we introduce the relation and difference between detection and tracking, and analyze the feasibility of improving siamese tracking algorithms with detection technologies. Then, we elaborate the existing siamese trackers based on different detection frameworks. Furthermore, we conduct extensive experiments to compare and analyze the representative methods on the popular OTB100, VOT2018, GOT-10k, and LaSOT benchmarks. Finally, we summarize our manuscript and prospect the future trends of visual tracking.
Key words: object tracking / deep learning / siamese network / object detection
[1] Laurense V A, Goh J Y, Gerdes J C. Path-tracking for autonomous vehicles at the limit of friction[C]//Proceedings of the American Control Conference, 2017: 5586-5591.
[2] Wang Y H, Chai H W, Yang D Y. Improved KCF real-time target tracking algorithm [J]. Journal of Huazhong University of Science and Technology, 2020, 48(1): 5. (in Chinese)
[3] Wu Y, Lim J, Yang M H. Object tracking benchmark [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(9): 1834-1848. doi: 10.1109/TPAMI.2014.2388226
[4] Li P, Wang D, Wang L, et al. Deep visual tracking: Review and experimental comparison [J]. Pattern Recognition, 2018, 76: 323-338. doi: 10.1016/j.patcog.2017.11.007
[5] Bolme D S, Beveridge J R, Draper B A, et al. Visual object tracking using adaptive correlation filters[C]//Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2010: 2544-2550.
[6] Henriques J F, Caseiro R, Martins P, et al. High-speed tracking with kernelized correlation filters [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2015, 37(3): 583-596. doi: 10.1109/TPAMI.2014.2345390
[7] Danelljan M, Hager G, Khan F S, et al. Discriminative scale space tracking [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(8): 1561-1575. doi: 10.1109/TPAMI.2016.2609928
[8] Dalal N, Triggs B. Histograms of oriented gradients for human detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005: 886-893.
[9] Van De Weijer J, Schmid C, Verbeek J. Learning color names from real-world images[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[10] Ma C, Huang J B, Yang X, et al. Hierarchical convolutional features for visual tracking[C]//Proceedings of the IEEE International Conference on Computer Vision, 2015: 3074-3082.
[11] Danelljan M, Robinson A, Shahbaz Khan F, et al. Beyond correlation filters: Learning continuous convolution operators for visual tracking[C]//European Conference on Computer Vision, 2016: 472-488.
[12] Luo H B, Xu L Y, Hui B, et al. Status and prospect of target tracking based on deep learning [J]. Infrared and Laser Engineering, 2017, 46(5): 0502002. (in Chinese) doi: 10.3788/IRLA201746.0502002
[13] Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-252. doi: 10.1007/s11263-015-0816-y
[14] Bertinetto L, Valmadre J, Henriques J F, et al. Fully-convolutional siamese networks for object tracking[C]//Proceedings of the European Conference on Computer Vision, 2016: 850-865.
[15] Valmadre J, Bertinetto L, Henriques J, et al. End-to-end representation learning for correlation filter based tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2805-2813.
[16] Dai K, Wang Y, Yan X. Long-term object tracking based on siamese network[C]//IEEE International Conference on Image Processing (ICIP), 2017: 3640-3644.
[17] Chopra S, Hadsell R, LeCun Y. Learning a similarity metric discriminatively, with application to face verification[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2005: 539-546.
[18] Li B, Wu W, Wang Q, et al. SiamRPN++: Evolution of siamese visual tracking with very deep networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 4282-4291.
[19] Zhang Z, Peng H, Fu J, et al. Ocean: Object-aware anchor-free tracking[C]//Proceedings of the European Conference on Computer Vision, 2020, 12366: 771-787.
[20] Yan B, Peng H, Wu K, et al. LightTrack: Finding lightweight neural networks for object tracking via one-shot architecture search[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021: 15180-15189.
[21] Wang G, Luo C, Sun X, et al. Tracking by instance detection: A meta-learning approach[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 6287-6296.
[22] Zou Z, Shi Z, Guo Y, et al. Object detection in 20 years: A survey[DB/OL]. (2019-05-16)[2022-01-13]. https://doi.org/10.48550/arXiv.1905.05055.
[23] Chen Y F, Wu Y, Zhang W. Survey of target tracking algorithm based on siamese network structure [J]. Computer Engineering and Applications, 2020, 56(6): 10-18. (in Chinese) doi: 10.3778/j.issn.1002-8331.1911-0127
[24] Kristan M, Lukežič A, Drbohlav O, et al. The Eighth Visual Object Tracking VOT2020 Challenge Results[M]. Switzerland: Springer, 2020.
[25] He A, Luo C, Tian X, et al. A twofold siamese network for real-time object tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 4834-4843.
[26] Wang Q, Teng Z, Xing J, et al. Learning attentions: Residual attentional siamese network for high performance online visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 4854-4863.
[27] Dong X, Shen J. Triplet Loss in Siamese Network for Object Tracking[M]. Switzerland: Springer, 2018: 472-488.
[28] Cui Z J, An J S, Cui T S. Siamese networks tracking algorithm integrating channel-interconnection-spatial attention [J]. Infrared and Laser Engineering, 2021, 50(3): 20200148. (in Chinese) doi: 10.3788/IRLA20200148
[29] Li B, Yan J, Wu W, et al. High performance visual tracking with siamese region proposal network[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 8971-8980.
[30] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031
[31] Wang Q, Zhang L, Bertinetto L, et al. Fast online object tracking and segmentation: A unifying approach[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 1328-1338.
[32] Chen B X, Tsotsos J K. Fast visual object tracking with rotated bounding boxes[DB/OL]. (2019-09-12)[2022-01-13]. https://doi.org/10.48550/arXiv.1907.03892.
[33] Zhou W, Wen L, Zhang L, et al. SiamMan: Siamese motion-aware network for visual tracking[DB/OL]. (2020-01-18)[2022-01-13]. https://doi.org/10.48550/arXiv.1912.05515.
[34] Liao B, Wang C, Wang Y, et al. PG-Net: Pixel to global matching network for visual tracking[C]//European Conference on Computer Vision, 2020: 429-444.
[35] Zhu Z, Wang Q, Li B, et al. Distractor-aware siamese networks for visual object tracking[C]//Proceedings of the European Conference on Computer Vision, 2018: 101-117.
[36] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778.
[37] Zhang Z, Peng H. Deeper and wider siamese networks for real-time visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 4586-4595.
[38] Li B, Wu W, Wang Q, et al. SiamRPN++: Evolution of siamese visual tracking with very deep networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 4282-4291.
[39] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 936-944.
[40] Guo D, Wang J, Cui Y, et al. SiamCAR: Siamese fully convolutional classification and regression for visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 6268-6276.
[41] Xu Y, Wang Z, Li Z, et al. SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020: 12549-12556.
[42] Tian Z, Shen C, Chen H, et al. FCOS: Fully convolutional one-stage object detection[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 9627-9636.
[43] Chen Z, Zhong B, Li G, et al. Siamese box adaptive network for visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 6667-6676.
[44] Zhang Z, Liu Y, Li B, et al. Toward accurate pixelwise object tracking via attention retrieval [J]. IEEE Transactions on Image Processing, 2021, 30: 8553-8566. doi: 10.1109/TIP.2021.3117077
[45] Cui Y, Jiang C, Wang L, et al. Fully convolutional online tracking[DB/OL]. (2021-09-26)[2022-01-13]. https://doi.org/10.48550/arXiv.2004.07109.
[46] Zhou X, Wang D, Krähenbühl P. Objects as points[DB/OL]. (2019-04-29)[2022-01-13]. https://doi.org/10.48550/arXiv.1904.07850.
[47] Law H, Deng J. CornerNet: Detecting objects as paired keypoints[C]//Proceedings of the European Conference on Computer Vision, 2018: 765-781.
[48] Gao P, Yuan R, Wang F, et al. Siamese attentional keypoint network for high performance visual tracking [J]. Knowledge-Based Systems, 2020, 193: 105448. doi: 10.1016/j.knosys.2019.105448
[49] Peng S, Wang K, Yu Y, et al. Accurate anchor free tracking[DB/OL]. (2020-06-13)[2022-01-13]. https://doi.org/10.48550/arXiv.2006.07560.
[50] Du F, Liu P, Zhao W, et al. Correlation-guided attention for corner detection based visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 6835-6844.
[51] Yan B, Zhang X, Wang D, et al. Alpha-Refine: Boosting tracking performance by precise bounding box estimation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021: 5289-5298.
[52] Ma Z, Wang L, Zhang H, et al. RPT: Learning point set representation for siamese visual tracking[C]//European Conference on Computer Vision, 2020: 653-665.
[53] Yang Z, Liu S, Hu H, et al. RepPoints: Point set representation for object detection[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 9657-9666.
[54] Sauer A, Aljalbout E, Haddadin S. Tracking holistic object representations[DB/OL]. (2019-08-06)[2022-01-13]. https://doi.org/10.48550/arXiv.1907.12920.
[55] Yu Y, Xiong Y, Huang W, et al. Deformable siamese attention networks for visual object tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 6727-6736.
[56] Xu T, Feng Z H, Wu X J, et al. AFAT: Adaptive failure-aware tracker for robust visual object tracking[DB/OL]. (2020-05-27)[2022-01-13]. https://doi.org/10.48550/arXiv.2005.13708.
[57] Zhang L, Gonzalez-Garcia A, van de Weijer J, et al. Learning the model update for siamese trackers[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 4009-4018.
[58] Zhou J, Wang P, Sun H. Discriminative and robust online learning for siamese visual tracking[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 13017-13024.
[59] Wang G, Luo C, Xiong Z, et al. SPM-Tracker: Series-parallel matching for real-time visual object tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 3643-3652.
[60] Sung F, Yang Y, Zhang L, et al. Learning to compare: Relation network for few-shot learning[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018: 1199-1208.
[61] Yan B, Zhao H, Wang D, et al. "Skimming-perusal" tracking: A framework for real-time and robust long-term tracking[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 2385-2393.
[62] Zhang H W, Li X X, Zhu B, et al. Two-stage object tracking method based on siamese neural network [J]. Infrared and Laser Engineering, 2021, 50(9): 20200491. (in Chinese) doi: 10.3788/IRLA20200491
[63] Fan H, Ling H. Siamese cascaded region proposal networks for real-time visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 7952-7961.
[64] Li Q, Qin Z, Zhang W, et al. Siamese keypoint prediction network for visual object tracking[DB/OL]. (2020-06-07)[2022-01-13]. https://doi.org/10.48550/arXiv.2006.04078.
[65] Bhat G, Danelljan M, Van Gool L, et al. Learning discriminative model prediction for tracking[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 6181-6190.
[66] Danelljan M, Van Gool L, Timofte R. Probabilistic regression for visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 7181-7190.
[67] Choi J, Kwon J, Lee K M. Visual Tracking by TridentAlign and Context Embedding[M]. Switzerland: Springer, 2020: 504-520.
[68] Huang L, Zhao X, Huang K. GlobalTrack: A simple and strong baseline for long-term tracking[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2020, 34(7): 11037-11044.
[69] Voigtlaender P, Luiten J, Torr P H S, et al. Siam R-CNN: Visual tracking by re-detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 6577-6587.
[70] Dave A, Tokmakov P, Schmid C, et al. Learning to track any object[DB/OL]. (2019-10-25)[2022-01-13]. https://doi.org/10.48550/arXiv.1910.11844.
[71] Huang L, Zhao X, Huang K. Bridging the gap between detection and tracking: A unified approach[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 3998-4008.
[72] Danelljan M, Bhat G, Khan F S, et al. ATOM: Accurate tracking by overlap maximization[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 4655-4664.
[73] Jiang B, Luo R, Mao J, et al. Acquisition of localization confidence for accurate object detection[C]//Proceedings of the European Conference on Computer Vision, 2018.
[74] Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks[C]//International Conference on Machine Learning, 2017: 1126-1135.
[75] Antoniou A, Edwards H, Storkey A. How to train your MAML[C]//International Conference on Learning Representations, 2019.
[76] Li Z, Zhou F, Chen F, et al. Meta-SGD: Learning to learn quickly for few-shot learning[DB/OL]. (2017-09-28)[2022-01-13]. https://doi.org/10.48550/arXiv.1707.09835.
[77] Kristan M, Leonardis A, Matas J, et al. The Sixth Visual Object Tracking VOT2018 Challenge Results[M]. Switzerland: Springer, 2018: 3-53.
[78] Huang L, Zhao X, Huang K. GOT-10k: A large high-diversity benchmark for generic object tracking in the wild [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2019, 43(5): 1562-1577.
[79] Fan H, Lin L, Yang F, et al. LaSOT: A high-quality benchmark for large-scale single object tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 5374-5383.
[80] Han G, Du H, Liu J, et al. Fully conventional anchor-free siamese networks for object tracking [J]. IEEE Access, 2019, 7: 123934-123943. doi: 10.1109/ACCESS.2019.2937998
[81] Danelljan M, Van Gool L, Timofte R. Probabilistic regression for visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020: 7181-7190.
[82] Choi J, Chun D, Kim H, et al. Gaussian YOLOv3: An accurate and fast object detector using localization uncertainty for autonomous driving[C]//Proceedings of the IEEE International Conference on Computer Vision, 2019: 502-511.
[83] He Y, Zhu C, Wang J, et al. Bounding box regression with uncertainty for accurate object detection[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019: 2888-2897.
[84] Zhu B, Wang J, Jiang Z, et al. AutoAssign: Differentiable label assignment for dense object detection[DB/OL]. (2020-11-25)[2022-01-13]. https://doi.org/10.48550/arXiv.2007.03496.
[85] Li X, Wang W, Wu L, et al. Generalized focal loss: Learning qualified and distributed bounding boxes for dense object detection[C]//Advances in Neural Information Processing Systems, 2020.
[86] Oksuz K, Cam B C, Kalkan S, et al. Imbalance problems in object detection: A review [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 43(10): 3388-3415.
[87] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017: 5998-6008.
[88] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[C]//International Conference on Learning Representations, 2021.
[89] Chen X, Yan B, Zhu J, et al. Transformer tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021: 8126-8135.
[90] Yan B, Peng H, Fu J, et al. Learning spatio-temporal transformer for visual tracking[C]//Proceedings of the IEEE International Conference on Computer Vision, 2021: 10448-10457.
[91] Wang N, Zhou W, Wang J, et al. Transformer meets tracker: Exploiting temporal context for robust visual tracking[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021: 1571-1580.
[92] Lin L, Fan H, Xu Y, et al. SwinTrack: A simple and strong baseline for transformer tracking[DB/OL]. (2021-12-08)[2022-01-13]. https://doi.org/10.48550/arXiv.2112.00995.