Volume 51 Issue 4
May 2022

Sun Peng, Yu Yue, Chen Jiaxin, Qin Hanlin. Highly dynamic aerial polymorphic target detection method based on deep spatial-temporal feature fusion (Invited) [J]. Infrared and Laser Engineering, 2022, 51(4): 20220167. doi: 10.3788/IRLA20220167

Highly dynamic aerial polymorphic target detection method based on deep spatial-temporal feature fusion (Invited)

doi: 10.3788/IRLA20220167
School of Optoelectronic Engineering, Xidian University, Xi'an 710071, China
Funds: National Natural Science Foundation of China (62174128)
  • Received Date: 2022-03-10
  • Rev Recd Date: 2022-04-07
  • Publish Date: 2022-05-06
  • Abstract: Aiming at the problem of reliable detection and accurate recognition of highly dynamic aerial targets in complex backgrounds by infrared detectors carried on hypersonic vehicles, an aerial polymorphic target detection method based on deep spatial-temporal feature fusion was proposed. A weighted bidirectional cyclic feature pyramid structure was designed to extract the static features of polymorphic targets, and switchable atrous convolution was introduced to enlarge the receptive field and reduce spatial information loss. For the extraction of temporal motion features, in order to suppress complex background noise and concentrate the corner information in the moving region, a feature point matching method was used to generate a mask image, the optical flow was then calculated, and a sparse optical flow feature map was constructed from the results. Finally, the temporal features contained in multiple consecutive frames were extracted by 3D convolution to generate a 3D temporal motion feature map. By concatenating the static image features and the temporal motion features along the channel dimension, deep spatial-temporal fusion was realized. Extensive comparative experiments showed that this method significantly reduces the probability of false recognition in complex backgrounds and reaches a target detection accuracy of 89.87% with high real-time performance, which meets the needs of intelligent detection and recognition of infrared targets under highly dynamic conditions.
  • [1] Jiang Taixiang, Huang Tingzhu, Zhao Xile, et al. Multi-dimensional imaging data recovery via minimizing the partial sum of tubal nuclear norm [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.
    [2] Zhang Landan, Peng Lingbing, Zhang Tianfang, et al. Infrared small target detection via non-convex rank approximation minimization joint l2,1 norm [J]. Remote Sensing, 2018, 10(11): 1821. doi: 10.3390/rs10111821
    [3] Hadhoud M M, Thomas D W. The two-dimensional adaptive LMS (TDLMS) algorithm [J]. IEEE Transactions on Circuits and Systems, 1988, 35(5): 485-494. doi: 10.1109/31.1775
    [4] Bai Xiangzhi, Zhou Fugen. Analysis of new top-hat transformation and the application for infrared dim small target detection [J]. Pattern Recognition, 2010, 43(6): 2145-2156. doi: 10.1016/j.patcog.2009.12.023
    [5] Zhao Lu, Xiong Sen. Target recognition based on multi-view infrared images [J]. Infrared and Laser Engineering, 2021, 50(11): 20210206. (in Chinese)
    [6] Tang Peng, Liu Yi, Wei Hongguang, et al. Automatic recognition algorithm of digital instrument reading in offshore booster station based on Mask-RCNN [J]. Infrared and Laser Engineering, 2021, 50(S2): 20211057. (in Chinese)
    [7] Beery S, Wu G, Rathod V, et al. Context R-CNN: Long term temporal context for per-camera object detection [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020: 13075-13085.
    [8] Li Jingyu, Yang Jing, Kong Bin, et al. Multi-scale vehicle and pedestrian detection algorithm based on attention mechanism [J]. Optics and Precision Engineering, 2021, 29(6): 1448-1458. (in Chinese) doi: 10.37188/OPE.20212906.1448
    [9] Simonyan K, Zisserman A. Two-stream convolutional networks for action recognition in videos [J]. Advances in Neural Information Processing Systems, 2014, 27: 568-576.
    [10] Zhang Hongying, An Zheng. Human action recognition based on improved two-stream spatiotemporal network [J]. Optics and Precision Engineering, 2021, 29(2): 420-429. (in Chinese) doi: 10.37188/OPE.20212902.0420
    [11] Donahue J, Hendricks L A, Guadarrama S, et al. Long-term recurrent convolutional networks for visual recognition and description [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 2625-2634.
    [12] Ji S, Xu W, Yang M, et al. 3D convolutional neural networks for human action recognition [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2012, 35(1): 221-231.
    [13] Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017: 6299-6308.
    [14] Wu Haibin, Wei Xiying, Liu Meihong, et al. Improved YOLOv4 for dangerous goods detection in X-ray inspection combined with atrous convolution and transfer learning [J]. Chinese Optics, 2021, 14(6): 1417-1425. (in Chinese) doi: 10.37188/CO.2021-0078
    [15] Zhang Ruiyan, Jiang Xiujie, An Junshe, et al. Design of global-contextual detection model for optical remote sensing targets [J]. Chinese Optics, 2020, 13(6): 1302-1313. (in Chinese) doi: 10.37188/CO.2020-0057
    [16] Shi J, Tomasi C. Good features to track [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1994: 593-600.
    [17] Karpathy A, Toderici G, Shetty S, et al. Large-scale video classification with convolutional neural networks [C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014: 1725-1732.
    [18] Wang L, Xiong Y, Wang Z, et al. Temporal segment networks: Towards good practices for deep action recognition [C]//Proceedings of the European Conference on Computer Vision (ECCV), 2016: 20-36.
    [19] Zolfaghari M, Singh K, Brox T. ECO: Efficient convolutional network for online video understanding [C]//Proceedings of the European Conference on Computer Vision (ECCV), 2018: 695-712.
    [20] Huang Zhen, Xue Dixiu, Shen Xu, et al. 3D local convolutional neural networks for gait recognition [C]//Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2021: 14920-14929.
    [21] Huang Ziyuan, Zhang Shiwei, Pan Liang, et al. TAda! Temporally-adaptive convolutions for video understanding [C]//International Conference on Learning Representations (ICLR), 2022.


    • Traditional aerial infrared target detection methods [1-4] mainly improve the contrast between target and background through filtering, noise suppression, and target enhancement, and then detect the target by threshold segmentation. Deep learning methods [5-8] use deep convolutional networks to extract spatial features from the input infrared image and predict the category and position of aerial targets. In the field of hypersonic vehicle guidance and control, however, complex battlefield backgrounds and highly dynamic conditions prevent detection methods that rely only on spatial feature information from effectively recognizing and tracking fast-moving aerial targets. Fusing multi-dimensional deep spatial-temporal features, with a spatial network processing static information and a temporal network processing dynamic information, can greatly improve the recognition accuracy of highly dynamic aerial targets and is an effective way to enhance the guidance and combat performance of hypersonic vehicles.

      To fuse temporal information with spatial information, video feature extraction architectures based on the two-stream method, 3D convolution (conv3D), and Long Short-Term Memory (LSTM) networks have been proposed successively. Simonyan et al. [9] first proposed the two-stream method, which processes two input streams separately, a spatial stream of RGB images and a temporal motion stream of dense optical flow maps, and then fuses the results to accomplish action recognition. Zhang et al. [10] addressed the strong temporal dependence of two-stream networks with a human action recognition algorithm based on an improved two-stream spatial-temporal network, which enhances the feature representation capability of the network and improves the recognition of temporally dominated actions. Donahue et al. [11] combined 2D convolution with an LSTM structure: 2D convolution first extracts features from the images to obtain a sequence of visual features, which is then fed directly into the LSTM to further mine contextual information. Ji et al. [12] applied 3D convolution to video analysis, but used it only in shallow layers and relied on hand-crafted extraction of grayscale, gradient, and other information, so the method cannot process video in real time. I3D [13] (Inflated 3D ConvNet) introduced 3D convolution into the two-stream framework and uses dense optical flow to match every point in the image, forming an optical flow field; it achieves good video target detection accuracy, but its computational cost is excessive.

      In this paper, key point matching is adopted so that sparse optical flow replaces dense optical flow, and 3D convolution extracts the dynamic features of the video. At the same time, a bidirectional cyclic feature pyramid structure [15] incorporating switchable atrous convolution [14] (Switchable Atrous Convolution, SAC) is designed to extract static spatial image features, which are fused with the temporal dynamic features to achieve real-time, high-accuracy detection of highly dynamic aerial polymorphic targets.

    • In this paper, a spatial-temporal feature fusion network is designed following the classical two-stream approach from action recognition, realizing intelligent detection and recognition of highly dynamic infrared aerial polymorphic targets in complex backgrounds. The main body of the network divides into two "streams", a static image feature stream and a temporal optical flow stream; the overall network structure is shown in Fig.1.

      Figure 1.  Deep spatial-temporal feature fusion detection network

      The static polymorphic features of the infrared image are extracted with the weighted bidirectional cyclic feature pyramid as the main framework, in which switchable atrous convolution replaces ordinary convolution to enlarge the receptive field while reducing spatial information loss. For the temporal feature stream, a mask of the moving region is first generated by feature point matching to concentrate the target features; the optical flow is then computed with the pyramidal LK (Lucas-Kanade) structure [15], and a two-dimensional LK sparse optical flow feature map is constructed from the results; finally, 3D convolution extracts the temporal features contained in multiple consecutive frames to generate a three-dimensional temporal motion feature map. Single-channel convolution is applied separately to the static spatial features and the 3D temporal motion features, and the two are concatenated along the channel dimension, fusing information across feature map channels to obtain the spatial-temporal fusion feature map and complete the detection and recognition of highly dynamic aerial infrared targets.
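      As an illustration of the fusion step just described, below is a minimal PyTorch sketch, assuming the per-stream "single-channel convolution" is a 1×1 projection; the channel counts (c_spatial, c_temporal, c_out) are hypothetical, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SpatialTemporalFusion(nn.Module):
    """Fuse a static spatial feature map with a temporal motion feature map:
    per-stream 1x1 convolution, then concatenation along the channel dim."""
    def __init__(self, c_spatial=256, c_temporal=256, c_out=128):
        super().__init__()
        self.proj_s = nn.Conv2d(c_spatial, c_out, kernel_size=1)
        self.proj_t = nn.Conv2d(c_temporal, c_out, kernel_size=1)

    def forward(self, f_spatial, f_temporal):
        # Both inputs: (N, C, H, W) feature maps of equal spatial size
        s = self.proj_s(f_spatial)
        t = self.proj_t(f_temporal)
        # Channel-dimension concatenation realizes the spatial-temporal fusion
        return torch.cat([s, t], dim=1)
```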

    • During the detection of aerial targets by a hypersonic vehicle, the scale of the target can change drastically within a short time, so extracting and predicting multi-scale, polymorphic target features is crucial. A weighted bidirectional cyclic feature pyramid structure is therefore designed to predict the polymorphic features of aerial targets, with switchable atrous convolution replacing ordinary convolution to enlarge the receptive field while reducing spatial information loss. The weighted bidirectional cyclic feature pyramid structure is shown in Fig.2: the backbone on the left uses atrous convolution to generate feature maps at different scales, and the bidirectional feature pyramid module on the right fuses multi-level features.

      Figure 2.  Weighted bidirectional cyclic feature pyramid network

      To improve model efficiency, several optimizations are made to the original feature pyramid (a sketch of the resulting weighted fusion follows this list):

      (1) Nodes with only one input edge are removed. Such nodes perform no feature fusion and contribute little to the network, so deleting them simplifies the network and reduces computational complexity;

      (2) An extra connection is added between the input and output nodes at the same level, fusing more features without increasing computational cost;

      (3) A bidirectional feature fusion path is constructed; the two paths form one group that is repeated several times to fuse more high-level features.
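      The weighted fusion at each remaining pyramid node can be realized with the fast normalized fusion used in BiFPN-style designs; the sketch below is a minimal PyTorch version under that assumption, with one learnable non-negative weight per input edge.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Fast normalized weighted fusion of feature maps at one pyramid node
    (a sketch; the paper's exact weighting scheme may differ)."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))  # one weight per edge
        self.eps = eps

    def forward(self, inputs):
        # inputs: list of feature maps with identical shapes
        w = F.relu(self.w)            # keep the fusion weights non-negative
        w = w / (w.sum() + self.eps)  # fast normalization instead of softmax
        return sum(wi * x for wi, x in zip(w, inputs))
```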

      Meanwhile, to obtain multi-scale receptive fields effectively without increasing computational complexity, an atrous convolution module is introduced into the backbone. Each module consists of two global context modules and one atrous convolution component, combining an atrous convolution with dilation rate r = 3 and a regular convolution with r = 1, so that pixels skipped by the atrous kernel can be filled in by the standard convolution. The overall structure of the atrous convolution module is shown in Fig.3.

      The convolution operation in the switchable atrous convolution module can be written as y_out = Conv(x, w, r), where x is the input feature map, w is the weight (kept consistent with the weight values used in the feature pyramid structure), and r is the dilation rate. The complete SAC component is then S(x)·Conv(x, w, 1) + (1 − S(x))·Conv(x, w + Δw, r), where the switch function S(x) consists of a 5×5 average pooling and a 1×1 convolution layer, the dilation rate r defaults to 3, and the convolutions Conv(x, w, 1) and Conv(x, w + Δw, r) share the weight w through a locking mechanism. Converting an ordinary convolution into one with a larger dilation rate would otherwise lose weights, so an additional trainable weight Δw is added for the convolution layer with dilation rate r.

      Figure 3.  Switchable atrous convolution module
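      The SAC component maps directly to code. The following PyTorch sketch implements S(x)·Conv(x, w, 1) + (1 − S(x))·Conv(x, w + Δw, r) with a shared 3×3 weight w and an extra trainable Δw; the sigmoid on the switch output, which keeps S(x) in [0, 1], is an assumption not stated in the text.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchableAtrousConv(nn.Module):
    """Sketch of switchable atrous convolution (SAC):
    S(x)*Conv(x, w, 1) + (1 - S(x))*Conv(x, w + dw, r)."""
    def __init__(self, channels, r=3):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(channels, channels, 3, 3))
        nn.init.kaiming_normal_(self.weight)
        # Extra trainable weight added for the dilated branch (delta w)
        self.dw = nn.Parameter(torch.zeros_like(self.weight))
        self.r = r
        # Switch S(x): 5x5 average pooling followed by a 1x1 convolution
        self.switch = nn.Sequential(
            nn.AvgPool2d(5, stride=1, padding=2),
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),  # assumption: squash the switch into [0, 1]
        )

    def forward(self, x):
        s = self.switch(x)
        # Branch 1: ordinary convolution (dilation 1), shared weight w
        y1 = F.conv2d(x, self.weight, padding=1, dilation=1)
        # Branch 2: atrous convolution, dilation r, weight w + dw (locked sharing)
        y2 = F.conv2d(x, self.weight + self.dw, padding=self.r, dilation=self.r)
        return s * y1 + (1 - s) * y2
```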

    • Traditional dense optical flow based on point-by-point matching is computationally expensive and can hardly meet the strict real-time requirements of hypersonic vehicle guidance and control. For this reason, feature point matching is used to extract the corner information of aerial targets, suppress static background noise, and concentrate the features in the moving region, generating a mask image. The pyramidal LK method then computes the optical flow of the highly dynamic target, a two-dimensional sparse optical flow feature map is constructed from the results, and finally 3D convolution extracts the temporal features.

    • Target corners lie at the intersections of target edges, where a small movement in any direction causes large changes in gradient direction and magnitude; matching feature corners can therefore effectively replace pixel-by-pixel dense optical flow while reducing computation. In highly dynamic aerial target detection scenes, however, the complex background drowns the true target corners among a large number of false ones, lowering the accuracy of corner extraction. Building on the Shi-Tomasi corner detector [16], this paper screens out moving objects with an image mask so that the feature corners converge on the moving region, avoiding a large amount of wasted computation and improving the efficiency of corner extraction.

      To obtain the image mask, the inter-frame difference of two consecutive frames I_{t−1} and I_t of the video sequence is first computed:

      $$D_t(x,y) = \left| I_t(x,y) - I_{t-1}(x,y) \right|$$

      where (x, y) are the pixel coordinates in the image and D_t(x, y) is the inter-frame difference map.

      To simplify subsequent computation and, as far as possible, retain the aerial moving-target region while filtering out non-moving (noise) regions, a relatively large threshold λ is set to binarize the difference map:

      $$B_t(x,y) = \begin{cases} 1, & D_t(x,y) > \lambda \\ 0, & \text{otherwise} \end{cases}$$

      where B_t(x, y) is the binarized inter-frame difference map. Since λ is set large, to avoid masking out true targets, the local maximum of B_t(x, y) is taken:

      $$M_t(x,y) = \max_{|i| \le k,\; |j| \le k} B_t(x+i,\, y+j)$$

      where M_t is the final image mask and the filter kernel size is (2k+1)×(2k+1).
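      In OpenCV terms, the three steps above (absolute frame difference, binarization with threshold λ, local maximum over a (2k+1)×(2k+1) window) can be sketched as follows; the local maximum is realized as a grayscale dilation, and the values of lam and k are illustrative, not the paper's tuned parameters.

```python
import cv2
import numpy as np

def motion_mask(prev_gray, gray, lam=40, k=7):
    """Build the moving-region mask M_t from two consecutive grayscale frames."""
    diff = cv2.absdiff(gray, prev_gray)                           # D_t
    _, binary = cv2.threshold(diff, lam, 255, cv2.THRESH_BINARY)  # B_t
    kernel = np.ones((2 * k + 1, 2 * k + 1), np.uint8)
    return cv2.dilate(binary, kernel)                             # M_t (local max)

def corners_in_moving_region(gray, mask):
    """Shi-Tomasi corners restricted to the moving region via the mask."""
    return cv2.goodFeaturesToTrack(gray, maxCorners=200, qualityLevel=0.01,
                                   minDistance=7, mask=mask)
```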

    • Optical flow is widely used in consecutive-frame target detection; the core idea is, for a pixel I(x, y) at time t, to find its displacement in each direction at time t + 1. In highly dynamic scenes the temporal continuity assumption of the LK optical flow method no longer holds, so this paper applies the pyramidal LK structure to the optical flow of highly dynamic moving targets, as shown in Fig.4.

      When pixels move at high speed, the image is decomposed into pyramid levels, each level scaled to half the size of the one below, with the high-resolution image at the bottom of the pyramid and the low-resolution image at the top. The main purpose of the downscaling is to reduce pixel displacement so that the temporal continuity assumption of the LK method is satisfied. The algorithm first computes on the top level and passes the result down as the initial value for the next level, which computes its optical flow and the affine transformation matrix between the two frames on that basis; the flow and affine matrix are passed down level by level until the bottom, original-resolution level is reached. Through this top-down iterative computation, the optical flow of high-speed moving targets can be solved, and a two-dimensional sparse optical flow feature map is constructed from the results.

      Figure 4.  Pyramid LK optical flow
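      A minimal sketch of this step with OpenCV's pyramidal LK tracker: the masked Shi-Tomasi corners from the previous frame are tracked into the current frame, and the resulting displacement vectors are scattered into a two-channel sparse optical flow feature map. The window size and pyramid depth are illustrative choices.

```python
import cv2
import numpy as np

def sparse_flow_map(prev_gray, gray, corners):
    """Track corners with pyramidal LK and build a 2-channel sparse flow map."""
    h, w = gray.shape
    flow_map = np.zeros((2, h, w), np.float32)   # (dx, dy) at tracked pixels
    if corners is None:
        return flow_map
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_gray, gray, corners, None,
        winSize=(21, 21), maxLevel=3)            # 3 pyramid levels above the base
    for p0, p1, ok in zip(corners.reshape(-1, 2),
                          next_pts.reshape(-1, 2), status.ravel()):
        if ok:
            x, y = int(p0[0]), int(p0[1])
            flow_map[:, y, x] = p1 - p0          # sparse displacement vector
    return flow_map
```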

    • On the basis of the two-dimensional optical flow feature map, the 3D convolution module shown in Fig.5 performs convolution on the feature maps to extract the dynamic temporal features of the target.

      Figure 5.  3D convolution module

      Each convolution module consists of a 3D convolution operator, a batch normalization (BN) layer, and an ELU (Exponential Linear Unit) activation function. During training, the BN layer normalizes each channel of the feature maps in every batch to zero mean and unit standard deviation; the ELU activation drives the mean activation of the neurons toward zero and is more robust to noise, which benefits the extraction of highly dynamic target features.
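      A single module of this kind can be sketched in PyTorch as below; only the Conv3d → BatchNorm3d → ELU composition follows the text, while the channel counts and kernel size are illustrative.

```python
import torch.nn as nn

class Conv3DBlock(nn.Module):
    """One temporal feature extraction module: Conv3d -> BN -> ELU."""
    def __init__(self, c_in, c_out, kernel=(3, 3, 3)):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv3d(c_in, c_out, kernel,
                      padding=tuple(k // 2 for k in kernel)),
            nn.BatchNorm3d(c_out),  # zero-mean, unit-std per channel per batch
            nn.ELU(inplace=True),   # mean activation pushed toward zero
        )

    def forward(self, x):
        # x: (N, C, T, H, W) stack of consecutive sparse optical flow maps
        return self.block(x)
```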

    • When targets are detected by the infrared detector carried on a hypersonic vehicle, the large relative velocity causes the target to undergo a large displacement within a short time, along with obvious changes in size and shape. Since laboratory conditions do not permit real-time detection of aerial infrared targets from an actual hypersonic vehicle, this paper built a sequence of 1500 consecutive frames, each of size 640×512, of an infrared unmanned aerial vehicle (UAV) moving at constant speed, and selected images at multiple frame intervals containing multi-scale, polymorphic targets as the test set, with backgrounds including buildings, trees, and clouds to simulate aerial target detection in complex backgrounds. To verify the effectiveness of the proposed method, three groups of consecutive-frame test images containing interference from buildings, birds (point noise), and clouds were selected for comparative experiments against C3D [17], TSN [18], ECO [19], 3D LocalCNN [20], and TAda [21].

      Algorithm performance is evaluated by recognition accuracy, real-time performance, and computing resources, as shown in Tab.1. Recognition accuracy is Accuracy = (TP + TN)/(TP + TN + FP + FN), i.e., the proportion of correctly predicted positive and negative samples among all samples; the real-time metric FPS (frames per second) is the number of frames the network can process per second; the resource occupied at runtime, Run memory, is measured in GB (gigabytes).

      Method               Accuracy    Speed/FPS    Run memory/GB
      C3D [17]             82.31%      25.9         2.32
      TSN [18]             85.73%      23.3         3.58
      ECO [19]             86.57%      27.6         3.14
      3D LocalCNN [20]     85.78%      21.6         2.79
      TAda [21]            87.41%      29.1         4.01
      Proposed method      89.87%      27.0         2.19

      Table 1.  Comparison of detection performance of different algorithms on self-built dataset

      Combining the recognition results on the consecutive frames with Tab.1, the four methods TSN, ECO, 3D LocalCNN, and TAda recognize the UAV target fairly well but produce many false detections on aerial point noise and cloud backgrounds, giving a very high false alarm rate; C3D suppresses background noise well but cannot track the target across consecutive frames in real time, drops frames, and has low recognition accuracy. The proposed target recognition method based on deep spatial-temporal feature fusion effectively suppresses noise in complex backgrounds and greatly reduces the false alarm rate; while remaining real-time, its recognition accuracy reaches 89.87%, outperforming the existing recognition algorithms based on spatial-temporal feature fusion.

      To verify the advantages of the proposed deep learning based method over traditional methods, four traditional methods, PSTNN [1], NRAM [2], TDLMS [3], and Top-hat [4], were tested on the three groups of consecutive-frame images in Fig.6; the experimental results are shown in Fig.7.

      Figure 6.  Comparison of target recognition results of UAV in three consecutive frames

      Figure 7.  Comparison of UAV target recognition results by traditional methods

      From the UAV recognition results of the traditional methods, PSTNN produces few false detections but filters out only high-temperature parts such as the engine and rotors, cannot detect the target as a whole, and performs poorly when the target overlaps the background; NRAM likewise cannot detect the whole UAV target and degrades when the background contains many high-temperature objects; TDLMS extracts the moving target with fairly high accuracy but leaves an obvious motion trail that affects recognition; Top-hat filters out the target accurately but produces many false detections and an excessively high false alarm rate.

      The above analysis of the spatial-temporal fusion methods and the traditional methods demonstrates the effectiveness of the proposed method in hypersonic vehicle guidance scenarios, meeting the need for intelligent detection and recognition of infrared targets under highly dynamic conditions.

    • Addressing the need for intelligent detection and recognition of highly dynamic aerial infrared targets in complex backgrounds, this paper designed and implemented a highly dynamic aerial polymorphic target detection method based on deep spatial-temporal feature fusion. Analysis of the recognition results of typical spatial-temporal fusion algorithms and of traditional methods based on noise suppression and target enhancement shows that the proposed method effectively suppresses complex background noise and extracts highly dynamic target features, reaching a detection accuracy of 89.87% on aerial infrared targets with fast detection speed.
