-
Our network consists of two parts: the feature extraction network and the metric network, as shown in Fig.1. The feature extraction network extracts features from the infrared and visible images, and the metric network measures the similarity between those features.
Figure 1. Infrared-visible image deep matching network. The black lines with arrows indicate the data flow. The blue lines represent shortcut connections through the reshape layers. The figure illustrates the infrared-visible image patch matching process
The feature extraction network extracts the distinguishing features of visible and infrared images. Infrared and visible images are fed into two VGG16[16] branches, which constitute a Siamese network. To be compatible with the three channels of the visible image, the single-channel infrared image is replicated into three channels for the other branch's input. The weights are shared between the two branches. A single VGG16 branch contains five blocks and two FC layers; each block consists of two or three convolution layers, an activation layer, and a pooling layer. In a single branch, we retain the original FC6 and FC7 layers of VGG16, for two reasons. First, the FC7 layer produces a 1×4096-dimensional feature rather than the 7×7×512 feature map from the Conv5 block, which significantly reduces the parameters and computation in the metric network. Second, we find that a branch with the FC layers performs better in training than one without them.
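As an illustration only (this is our sketch, not the authors' released code), one shared branch could be written as follows in PyTorch; the class name `VGG16Branch`, the use of torchvision's pretrained VGG16 (a recent torchvision is assumed), and the 224×224 patch size are our assumptions:

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG16Branch(nn.Module):
    """One branch of the Siamese extractor: the five VGG16 conv blocks
    plus the original FC6/FC7 layers, producing a 1x4096 feature."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.IMAGENET1K_V1)
        self.features = vgg.features   # Conv1-Conv5 blocks
        self.avgpool = vgg.avgpool     # fixes the 7x7 spatial size
        # Keep FC6, ReLU, Dropout, FC7, ReLU; drop the 1000-way classifier.
        self.fc67 = nn.Sequential(*list(vgg.classifier.children())[:5])

    def forward(self, x):
        # Replicate a single-channel infrared patch to three channels so it
        # is compatible with the RGB input expected by VGG16.
        if x.size(1) == 1:
            x = x.repeat(1, 3, 1, 1)
        x = self.features(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)        # 7x7x512 -> 25088
        return self.fc67(x)            # 1x4096 semantic feature

branch = VGG16Branch()
vis = torch.randn(8, 3, 224, 224)      # visible patches
ir = torch.randn(8, 1, 224, 224)       # infrared patches (single channel)
f_vis, f_ir = branch(vis), branch(ir)  # same module => shared weights
```

Because the same module instance processes both inputs, the two branches share weights exactly as described above.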
Although the imaging principles of infrared and visible sensors differ, the same target is very similar in its semantic features. Therefore, the branches share network weights in our design. We believe that deep convolutional networks have a strong feature representation capacity and can extract the features common to infrared and visible images. Multi-branch networks trained with the contrastive loss or the triplet loss conventionally share weights; the shared weights map high-level features into the same feature space for distance comparison.
The metric network is composed of two FC layers with the softmax loss as the objective function. It estimates the probability that the visible patch and the infrared patch match. Ideally, the prediction is 1 if they match and 0 if they do not.
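A minimal sketch of such a metric head is shown below; the hidden width of 512 and the input dimensionality (here, just the two concatenated FC7 features) are our assumptions for illustration, since the paper does not fix them at this point:

```python
import torch
import torch.nn as nn

class MetricNetwork(nn.Module):
    """Two FC layers; a softmax over the two logits yields the
    match probability (ideally 1 for a match, 0 otherwise)."""
    def __init__(self, in_dim=2 * 4096, hidden=512):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden),
            nn.ReLU(inplace=True),
            nn.Linear(hidden, 2),      # two classes: match / non-match
        )

    def forward(self, f_vis, f_ir):
        return self.mlp(torch.cat([f_vis, f_ir], dim=1))

metric = MetricNetwork()
f_vis, f_ir = torch.randn(8, 4096), torch.randn(8, 4096)
logits = metric(f_vis, f_ir)
# The "softmax loss" is cross-entropy over the two logits:
labels = torch.ones(8, dtype=torch.long)   # 1 = matched pair
loss = nn.CrossEntropyLoss()(logits, labels)
prob_match = logits.softmax(dim=1)[:, 1]   # predicted match probability
```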
-
Compared with visible images, infrared images have no color and less texture information, and their edges are usually blurred. However, objects still retain rough outlines and region information in infrared images, and these outlines and shapes are features common to visible and infrared images. We therefore believe that spatial information is essential for matching infrared images, and that integrating the spatial features with the semantic features can enhance the feature representation.
On the other hand, the hierarchical structure of a deep neural network makes it natural to extract features at multiple scales. Features from the low-level layers resemble those extracted by hand-crafted descriptors such as SIFT and SURF. As the layers deepen, the feature maps focus less on imaging differences, and semantic features gradually emerge in the high-level layers. In our network, the multi-scale features are fed into the metric network, so it can exploit more comprehensive information when making similarity decisions. Each block of our network connects directly to the input of the metric network, which preserves more multi-scale spatial information for the similarity comparison, as shown in Fig.2.
Figure 2. Multi-scale spatial feature integration in a single branch. The output feature map of each block is shortcut-connected to the concatenation layer. The output of the concatenation layer is one input of the metric network
Two problems need to be solved in multi-scale spatial feature extraction. First, the shortcut feature should keep the original size of each block's feature map, so as to preserve spatial information. Second, the dimensionality of the shortcut feature should not be too high after it is reshaped into a vector; excessive dimensionality would lead to a huge number of parameters and heavy computation in the metric network.
We adopt the 1×1 convolution, widely used in GoogLeNet[17], to solve both problems. The multi-channel feature maps are compressed into a single-channel feature map, which preserves the spatial information while avoiding excessively high dimensionality. To connect features of different dimensions, each map is converted by a reshape layer into a vector of length N×N, where N is the size of the corresponding feature map. All multi-scale feature maps, together with the semantic feature from FC7, are concatenated as the input of the metric network. The metric network receives such concatenated features from both the infrared branch and the visible branch.
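The shortcut-and-concatenate wiring for one branch can be sketched as follows; the module name `MultiScaleShortcut` and the per-block channel counts (those of VGG16's block outputs) are our assumptions:

```python
import torch
import torch.nn as nn

class MultiScaleShortcut(nn.Module):
    """Compress each block's C-channel feature map to one channel with a
    1x1 convolution, flatten it to an NxN-length vector, and concatenate
    all scales with the FC7 semantic feature."""
    def __init__(self, block_channels=(64, 128, 256, 512, 512)):
        super().__init__()
        self.squeeze = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=1) for c in block_channels]
        )

    def forward(self, block_maps, fc7_feat):
        # block_maps: list of five feature maps, one per VGG block output.
        vectors = [
            conv(fmap).flatten(1)        # (B, 1, N, N) -> (B, N*N) "reshape"
            for conv, fmap in zip(self.squeeze, block_maps)
        ]
        vectors.append(fc7_feat)         # append the 1x4096 semantic feature
        return torch.cat(vectors, dim=1) # one input vector for the metric network
```

For 224×224 patches the five flattened vectors have lengths 112², 56², 28², 14², and 7², so the 1×1 compression keeps the concatenated input to the metric network tractable.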
-
As shown in Fig.3, the feature extraction network consists of two branches that are identical in structure and share weights. A visible patch and an infrared patch make up an image pair. The contrastive loss was first proposed for dimensionality reduction[18]; here, it is used as the objective function to train the two branches.
Figure 3. (a) Feature extraction network architecture with the contrastive loss; (b) Input data for the feature extraction network with the contrastive loss. The visible patches are in the first row and the infrared patches in the second row. The positive samples are in the odd columns, the negative ones in the even columns
The contrastive loss is shown in Eq.(1).
$$l(x_1, x_2) = \begin{cases} d(f(x_1), f(x_2)), & p_1 = p_2 \\ \max(0, \mathrm{margin} - d(f(x_1), f(x_2))), & p_1 \ne p_2 \end{cases}$$ (1)

where $d(f(x_1), f(x_2))$ is the Euclidean distance between the two sample features, $p_1$ is the label of the input visible patch, and $p_2$ is the label of the input infrared patch. $p_1 = p_2$ denotes a matched patch pair, and $p_1 \ne p_2$ an unmatched pair. The margin in Eq.(1) is a threshold: the minimum distance by which the features of unmatched patches should be separated. In our experiments, the margin is set to 1.
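A direct implementation of Eq.(1) is straightforward; the sketch below is ours, assuming `same_label` is 1 for a matched pair and 0 otherwise:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(f1, f2, same_label, margin=1.0):
    """Eq.(1): pull matched pairs together; push unmatched pairs
    at least `margin` apart in Euclidean distance."""
    d = F.pairwise_distance(f1, f2)                            # d(f(x1), f(x2))
    pos = same_label * d                                       # p1 == p2 branch
    neg = (1.0 - same_label) * torch.clamp(margin - d, min=0)  # p1 != p2 branch
    return (pos + neg).mean()
```
-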
As shown in Fig.4(a), the network consists of three branches that are identical in structure and share weights. A visible patch (the anchor sample), an infrared patch (the positive sample), and another infrared patch (the negative sample) form a triplet, and we input one triplet at a time to train the feature extraction network. The triplet loss was first used for face recognition[19]; here, it is used as the objective function to train the three branches.
Figure 4. (a) Feature extraction network architecture with the triplet loss; (b) Input data for the feature extraction network with the triplet loss. The anchor patches are in the first row, the positive patches in the second row, and the negative patches in the third row. Each column is one input triplet
The triplet loss is shown in Eq.(2).

$$\max(d(f(x_a), f(x_p)) - d(f(x_a), f(x_n)) + \mathrm{margin},\ 0)$$ (2)

The input data consist of the anchor sample ($x_a$), the positive sample ($x_p$), and the negative sample ($x_n$). $d(f(x_a), f(x_p))$ is the Euclidean distance between the anchor and the positive sample, and $d(f(x_a), f(x_n))$ is the Euclidean distance between the anchor and the negative sample. Minimizing this function drives the anchor-positive distance below the anchor-negative distance. The anchor is selected at random from the sample set; the positive sample belongs to the same class as the anchor, while the negative sample belongs to a different class.

$$d(f(x_a), f(x_p)) + \mathrm{margin} \leqslant d(f(x_a), f(x_n))$$ (3)

Eq.(3) shows that a margin must separate $d(f(x_a), f(x_p))$ from $d(f(x_a), f(x_n))$ to distinguish positive from negative samples. Unlike the contrastive loss, the triplet loss compares the distances to the positive and the negative samples within a single forward and backward propagation pass. Moreover, in multi-source patch matching the imaging difference for the same object is larger than in visible-only matching, so a larger margin achieves better performance in our experiments; the margin is set to 3.
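Correspondingly, a minimal sketch of Eq.(2) (ours, not the authors' code) is:

```python
import torch
import torch.nn.functional as F

def triplet_loss(f_a, f_p, f_n, margin=3.0):
    """Eq.(2): the anchor-positive distance must be at least `margin`
    smaller than the anchor-negative distance, as in Eq.(3)."""
    d_ap = F.pairwise_distance(f_a, f_p)   # d(f(x_a), f(x_p))
    d_an = F.pairwise_distance(f_a, f_n)   # d(f(x_a), f(x_n))
    return torch.clamp(d_ap - d_an + margin, min=0).mean()
```

PyTorch's built-in `torch.nn.TripletMarginLoss(margin=3.0)` computes the same quantity and could be used instead.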
-
Abstract: Infrared-visible image patch matching is widely used in many applications, such as vision-based navigation and target recognition. Because infrared and visible sensors have different imaging principles, infrared-visible patch matching is especially challenging. Deep learning has achieved state-of-the-art performance in patch-based image matching, but it has mainly focused on visible image patches and rarely on infrared-visible pairs. An infrared-visible image patch matching network (InViNet) based on convolutional neural networks (CNNs) was proposed. It consisted of two parts: feature extraction and feature matching, and it focused more on the image content itself than on the imaging differences between infrared and visible images. In feature extraction, the contrastive loss and the triplet loss maximized the inter-class feature distance and reduced the intra-class distance, so that the infrared-visible features used for matching were more distinguishable. In addition, multi-scale spatial features provided region and shape information of infrared-visible images, and the integration of low-level and high-level features in InViNet enhanced the feature representation and facilitated subsequent patch matching. With these improvements, the accuracy of InViNet increased by 9.8% compared with state-of-the-art image matching networks.
-
Figure 6. ROC curves for various methods. The numbers in the legend are FPR95 values. In the legend, "F" means the network is fine-tuned from VGG16, "C" means the contrastive loss is used in the feature extraction network, "T" means the triplet loss is used in the feature extraction network, and "S" means shortcut connections are used
Figure 9. Matching performance on the test data set. In the legend, the symbols "F", "C", "T", and "S" have the same meanings as in Fig.6
-
[1] Yang Weiping, Shen Zhenkang. Matching technique and its application in aided inertial navigation [J]. Infrared and Laser Engineering, 2007, 36(S2): 15-17. (in Chinese) doi: 10.3969/j.issn.1007-2276.2007.z2.003
[2] Li Hongguang, Ding Wenrui, Cao Xianbin, et al. Image registration and fusion of visible and infrared integrated camera for medium-altitude unmanned aerial vehicle remote sensing [J]. Remote Sensing, 2017, 9(5): 441. doi: 10.3390/rs9050441
[3] Wang Ning, Zhou Ming, Du Qinglei. A method for infrared visible image fusion and target recognition [J]. Journal of Air Force Early Warning Academy, 2019, 33(5): 328-332.
[4] Mao Yuanhong, He Zhanzhuang, Ma Zhong. Infrared target classification with reconstruction transfer learning [J]. Journal of University of Electronic Science and Technology of China, 2020, 49(4): 609-614. (in Chinese)
[5] Lowe D G. Distinctive image features from scale-invariant keypoints [J]. International Journal of Computer Vision, 2004, 60(2): 91-110. doi: 10.1023/B:VISI.0000029664.99615.94
[6] Bay H, Tuytelaars T, Gool L V. SURF: Speeded up robust features[C]//European Conference on Computer Vision, 2006, 3951: 404-417.
[7] Rublee E, Rabaud V, Konolige K, et al. ORB: An efficient alternative to SIFT or SURF[C]//International Conference on Computer Vision, 2011: 2564-2571.
[8] Sima A A, Buckley S J. Optimizing SIFT for matching of short wave infrared and visible wavelength images [J]. Remote Sensing, 2013, 5(5): 2037-2056. doi: 10.3390/rs5052037
[9] Li D M, Zhang J L. A improved infrared and visible images matching based on SURF [J]. Applied Mechanics and Materials, 2013, 325-326: 1637-1640. doi: 10.4028/www.scientific.net/AMM.325-326.1637
[10] Chao Zhiguo, Wu Bo. Approach on scene matching based on histograms of oriented gradients [J]. Infrared and Laser Engineering, 2012, 41(2): 513-516. (in Chinese) doi: 10.3969/j.issn.1007-2276.2012.02.044
[11] Cao Zhiguo, Yan Ruicheng, Song Jie. Approach on fuzzy shape context matching between infrared images and visible images [J]. Infrared and Laser Engineering, 2008, 37(12): 1095-1100. (in Chinese)
[12] Jiao Anbo, Shao Liyun, Li Chenxi, et al. Automatic target recognition algorithm based on affine invariant feature of line grouping [J]. Infrared and Laser Engineering, 2019, 48(S2): S226003. (in Chinese) doi: 10.3788/IRLA201948.S226003
[13] Han X, Leung T, Jia Y, et al. MatchNet: Unifying feature and metric learning for patch-based matching[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 3279-3286.
[14] Zagoruyko S, Komodakis N. Learning to compare image patches via convolutional neural networks[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 4353-4361.
[15] Hanif M S. Patch match networks: Improved two-channel and Siamese networks for image patch matching [J]. Pattern Recognition Letters, 2019, 120: 54-61. doi: 10.1016/j.patrec.2019.01.005
[16] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[C]//International Conference on Learning Representations (ICLR), 2015.
[17] Szegedy C, Liu W, Jia Y, et al. Going deeper with convolutions[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 1-9.
[18] Hadsell R, Chopra S, LeCun Y. Dimensionality reduction by learning an invariant mapping[C]//IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), 2006, 2: 1735-1742.
[19] Schroff F, Kalenichenko D, Philbin J. FaceNet: A unified embedding for face recognition and clustering[C]//IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015: 815-823.
[20] Glorot X, Bengio Y. Understanding the difficulty of training deep feedforward neural networks[C]//Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010: 249-256.
[21] Van der Maaten L, Hinton G. Visualizing data using t-SNE [J]. Journal of Machine Learning Research, 2008, 9(86): 2579-2605.