{"title":"MosViT:基于激光雷达点云的移动物体分割视觉变换器","authors":"Chunyun Ma, Xiaojun Shi, Yingxin Wang, Shuai Song, Zhen Pan, Jiaxiang Hu","doi":"10.1088/1361-6501/ad6626","DOIUrl":null,"url":null,"abstract":"\n Moving object segmentation is fundamental for various downstream tasks in robotics and autonomous driving, providing crucial information for them. Effectively extracting spatial-temporal information from consecutive frames and addressing the scarcity of dataset is paramount for learning-based 3D LiDAR Moving Object Segmentation (LIDAR-MOS). In this work, we propose a novel deep neural network based on Vision Transformers (ViTs) to tackle this problem. We first validate the feasibility of Transformer networks for this task, offering an alternative to CNNs. Specifically, we utilize a dual-branch structure based on range-image data to extract spatial-temporal information from consecutive frames and fuse it using a motion-guided attention mechanism. Furthermore, we employ the ViT as the backbone, keeping its architecture unchanged from what is used for RGB images. This enables us to leverage pre-trained models from RGB images to improve results, addressing the issue of limited LIDAR point cloud data, which is cheaper compared to acquiring and annotating point cloud data. We validate the effectiveness of our approach on the LIDAR-MOS benchmark of SemanticKitti and achieve comparable results to methods that use CNNs on range image data. The source code and trained models are available at https://github.com/mafangniu/MOSViT.git.","PeriodicalId":510602,"journal":{"name":"Measurement Science and Technology","volume":"40 9","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"MosViT: Towards Vision Transformers for moving object segmentation based on Lidar point cloud\",\"authors\":\"Chunyun Ma, Xiaojun Shi, Yingxin Wang, Shuai Song, Zhen Pan, Jiaxiang Hu\",\"doi\":\"10.1088/1361-6501/ad6626\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"\\n Moving object segmentation is fundamental for various downstream tasks in robotics and autonomous driving, providing crucial information for them. Effectively extracting spatial-temporal information from consecutive frames and addressing the scarcity of dataset is paramount for learning-based 3D LiDAR Moving Object Segmentation (LIDAR-MOS). In this work, we propose a novel deep neural network based on Vision Transformers (ViTs) to tackle this problem. We first validate the feasibility of Transformer networks for this task, offering an alternative to CNNs. Specifically, we utilize a dual-branch structure based on range-image data to extract spatial-temporal information from consecutive frames and fuse it using a motion-guided attention mechanism. Furthermore, we employ the ViT as the backbone, keeping its architecture unchanged from what is used for RGB images. This enables us to leverage pre-trained models from RGB images to improve results, addressing the issue of limited LIDAR point cloud data, which is cheaper compared to acquiring and annotating point cloud data. We validate the effectiveness of our approach on the LIDAR-MOS benchmark of SemanticKitti and achieve comparable results to methods that use CNNs on range image data. 
The source code and trained models are available at https://github.com/mafangniu/MOSViT.git.\",\"PeriodicalId\":510602,\"journal\":{\"name\":\"Measurement Science and Technology\",\"volume\":\"40 9\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-22\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Measurement Science and Technology\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1088/1361-6501/ad6626\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Measurement Science and Technology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1088/1361-6501/ad6626","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
MosViT: Towards Vision Transformers for moving object segmentation based on Lidar point cloud
Moving object segmentation provides crucial information for a variety of downstream tasks in robotics and autonomous driving. For learning-based 3D LiDAR moving object segmentation (LiDAR-MOS), it is paramount to effectively extract spatial-temporal information from consecutive frames and to address the scarcity of annotated data. In this work, we propose a novel deep neural network based on Vision Transformers (ViTs) to tackle this problem. We first validate the feasibility of Transformer networks for this task, offering an alternative to CNNs. Specifically, we utilize a dual-branch structure operating on range-image data to extract spatial-temporal information from consecutive frames and fuse it using a motion-guided attention mechanism. Furthermore, we employ a ViT backbone whose architecture is unchanged from that used for RGB images. This allows us to leverage models pre-trained on RGB images to improve results, which is cheaper than acquiring and annotating additional LiDAR point cloud data and thus mitigates the issue of limited labeled point clouds. We validate the effectiveness of our approach on the LiDAR-MOS benchmark of SemanticKITTI and achieve results comparable to CNN-based methods that operate on range-image data. The source code and trained models are available at https://github.com/mafangniu/MOSViT.git.
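For readers unfamiliar with motion-guided attention, the sketch below illustrates one common way such a fusion can be wired: features from a motion branch (e.g. derived from residual range images) produce a channel-wise gate that re-weights the appearance branch. This is a minimal, hypothetical PyTorch sketch for illustration only; it is not the authors' released implementation, and the module name, channel sizes, and gating design are assumptions.

```python
# Hypothetical sketch of a motion-guided attention fusion between two
# range-image feature branches. Not the authors' code; shapes and the
# squeeze-and-excitation-style gate are assumptions made for illustration.
import torch
import torch.nn as nn


class MotionGuidedAttention(nn.Module):
    """Gate appearance features with channel weights derived from motion features."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                        # squeeze spatial dims to 1x1
            nn.Conv2d(channels, channels // reduction, 1),  # bottleneck
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),  # restore channel count
            nn.Sigmoid(),                                   # per-channel gate in [0, 1]
        )

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # appearance, motion: (B, C, H, W) feature maps from the two branches
        attn = self.gate(motion)                 # (B, C, 1, 1) motion-derived weights
        return appearance * attn + appearance    # residual gating keeps static context


if __name__ == "__main__":
    # Toy shapes, e.g. a 64 x 2048 range image downsampled by a factor of 16.
    app = torch.randn(2, 256, 4, 128)
    mot = torch.randn(2, 256, 4, 128)
    fused = MotionGuidedAttention(256)(app, mot)
    print(fused.shape)  # torch.Size([2, 256, 4, 128])
```

The residual connection in the gating step is a design choice in this sketch: it lets motion cues amplify moving-object features without suppressing static context entirely.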