Abstract:
Objective In the field of computer vision, cameras and LiDAR have complementary strengths. Cameras provide dense perception and RGB information and thus capture rich semantics, while LiDAR offers accurate ranging and therefore more precise spatial information. Exploiting these complementary strengths is key to improving 3D object recognition. Single-modal LiDAR point cloud recognition networks, whether point-based or voxel-based, suffer either from long processing times or from the information loss caused by point cloud voxelization. Existing multi-modal networks that fuse images rely heavily on the point cloud input, yet they neither reduce the information loss from voxelization nor fully exploit the high-dimensional semantic information provided by images, leaving the complementarity between point clouds and images underused. To address these issues, this paper improves the feature generation network and the multi-modal fusion strategy, and proposes a point-level multi-modal data augmentation strategy to further enhance model performance.
Methods The multi-modal network framework uses independent image and point cloud branches to extract multi-scale features and fuses them at the feature level (Fig.1). The image branch uses a depth estimation fusion network that combines dense image semantic features with ground-truth-supervised depth features (Fig.2), compensating for the disorder and sparsity of point clouds. In the point cloud branch, the voxel feature extraction method is improved (Fig.3): instead of relying solely on voxel centroid features, vector features, standard deviation features, and extremum features are fused, as sketched below. A dynamic feature fusion module (Fig.4) then strengthens the network's ability to extract key features and obtain global features more effectively. In addition, a point-level multi-modal fusion data augmentation strategy is proposed, which increases sample diversity and alleviates class imbalance to a certain extent, further improving model performance.
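As an illustration of the statistical voxel features described above, the following minimal NumPy sketch computes per-voxel centroid (vector), standard deviation, and extremum features. The function name, fixed voxel size, and simple concatenation are illustrative assumptions; in the paper these features feed the learned fusion module rather than being concatenated directly.

```python
import numpy as np

def voxel_statistics(points, voxel_size=0.2):
    """Per-voxel vector (centroid), standard-deviation, and extremum features.

    points: (N, 3) array of x, y, z coordinates in metres.
    Returns: dict mapping voxel grid index (tuple) -> 12-dim feature vector.
    """
    coords = np.floor(points / voxel_size).astype(np.int64)
    # Sort so that points falling in the same voxel become contiguous.
    order = np.lexsort(coords.T)
    coords, points = coords[order], points[order]
    # Split the sorted arrays at every change of voxel index.
    splits = np.flatnonzero(np.any(np.diff(coords, axis=0) != 0, axis=1)) + 1
    feats = {}
    for c, p in zip(np.split(coords, splits), np.split(points, splits)):
        mean = p.mean(axis=0)                      # centroid / vector feature
        std = p.std(axis=0)                        # intra-voxel spread
        vmin, vmax = p.min(axis=0), p.max(axis=0)  # extremum features
        feats[tuple(c[0])] = np.concatenate([mean, std, vmin, vmax])
    return feats
```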
Results and Discussions Experiments are conducted on Pandaset, an open-source dataset for L5 autonomous driving, with IoU as the evaluation metric for semantic segmentation. The proposed point-level multi-modal fusion data augmentation strategy is first visualized on Pandaset; compared with previous methods, it produces more visually plausible and realistic augmented samples (Fig.5-6). Comparative experiments against mainstream single-modal (point cloud) and multi-modal (image-point cloud fusion) 3D semantic segmentation algorithms show that the proposed algorithm improves performance on most labels and on mIoU (Tab.1), with the gains being more pronounced on distant and small targets, demonstrating its effectiveness. Ablation experiments further verify the contribution of each proposed module to model performance (Tab.2). Additional comparative experiments on data augmentation show that the proposed point-level augmentation strategy also outperforms previous augmentation methods on object detection tasks (Tab.3).
Conclusions This paper improves the image and point cloud feature extraction networks and designs a multi-modal framework that fuses images and point clouds, combining the dense perception of images with the true 3D perception of point clouds to achieve information complementarity. The resulting fusion framework improves 3D object recognition performance, with the gains being most significant for small samples and small targets. Comparative and ablation experiments on the open-source Pandaset dataset demonstrate the effectiveness of the proposed algorithm.