{"title":"Spatial-Temporal Multimodal End-to-End Autonomous Driving","authors":"Lei Yang, Weimin Lei","doi":"10.1109/ICCCS57501.2023.10151280","DOIUrl":null,"url":null,"abstract":"Autonomous driving requires precise perception of the surrounding environment, and considering the complementarity of sensor data, we propose an end-to-end model of spatial-temporal multimodal fusion using an attention mechanism. Our model uses a fusion of camera and light detection and ranging (LiDAR), which works as follows: (i) The spatial network performs spatial feature learning using images from the range view (RV) representation of LiDAR and red, blue, and green (RGB) images as input, followed by a parallel ResNet18 network for feature extraction and fusion through an attention mechanism; (ii) the temporal network performs the learning of the temporal dimension of spatial features, and uses the current spatial features from the spatial network and the historical spatial features to do attention learning, which enhances the features relevant to the autonomous driving task; (iii) finally, the model uses spatial-temporal features to select a different prediction branch by navigation instructions to perform regression of waypoints. Our model was trained and tested in the CARLA simulator, and experiments showed that it enabled to complete autonomous driving tasks in complex environments, achieving a success rate of 85%, especially with many dynamic objects.","PeriodicalId":266168,"journal":{"name":"2023 8th International Conference on Computer and Communication Systems (ICCCS)","volume":"32 S2","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 8th International Conference on Computer and Communication Systems (ICCCS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCCS57501.2023.10151280","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Autonomous driving requires precise perception of the surrounding environment. Considering the complementarity of sensor data, we propose an end-to-end model for spatial-temporal multimodal fusion using an attention mechanism. Our model fuses camera and light detection and ranging (LiDAR) data, and works as follows: (i) the spatial network performs spatial feature learning, taking the range view (RV) representation of the LiDAR point cloud and red, green, and blue (RGB) camera images as input; a pair of parallel ResNet18 networks extracts features, which are then fused through an attention mechanism; (ii) the temporal network learns the temporal dimension of the spatial features, applying attention over the current spatial features from the spatial network and historical spatial features to enhance the features relevant to the autonomous driving task; (iii) finally, the model uses the spatial-temporal features to select a prediction branch according to the navigation instruction and regresses the waypoints. Our model was trained and tested in the CARLA simulator, and experiments showed that it could complete autonomous driving tasks in complex environments, especially those with many dynamic objects, achieving a success rate of 85%.
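To make the described pipeline more concrete, the following is a minimal PyTorch-style sketch of the three stages the abstract outlines: parallel ResNet18 encoders for the RGB and LiDAR range-view inputs, attention-based spatial fusion, attention over historical features, and navigation-conditioned waypoint regression. This is not the authors' implementation; all module choices, dimensions, input channel counts, and the number of navigation branches are illustrative assumptions.

```python
# Hypothetical sketch of the spatial-temporal multimodal fusion model described
# in the abstract. Not the authors' code: dimensions, attention configuration,
# and branch count are assumptions for illustration only.
import torch
import torch.nn as nn
import torchvision.models as models


class SpatialTemporalFusion(nn.Module):
    def __init__(self, feat_dim=512, num_branches=4, num_waypoints=4):
        super().__init__()
        # (i) Parallel ResNet18 encoders: one for the RGB camera image, one for
        # the LiDAR range-view image (assumed 3-channel here for simplicity).
        self.rgb_encoder = nn.Sequential(*list(models.resnet18(weights=None).children())[:-1])
        self.rv_encoder = nn.Sequential(*list(models.resnet18(weights=None).children())[:-1])
        # Attention-based fusion of the two spatial feature vectors.
        self.spatial_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        # (ii) Attention over current + historical spatial features (temporal network).
        self.temporal_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        # (iii) One waypoint-regression head per navigation instruction.
        self.branches = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, num_waypoints * 2))
            for _ in range(num_branches)
        )

    def forward(self, rgb, range_view, history, command):
        # rgb, range_view: (B, 3, H, W); history: (B, T, feat_dim);
        # command: (B,) long tensor indexing the navigation branch.
        f_rgb = self.rgb_encoder(rgb).flatten(1).unsqueeze(1)        # (B, 1, feat_dim)
        f_rv = self.rv_encoder(range_view).flatten(1).unsqueeze(1)   # (B, 1, feat_dim)
        fused, _ = self.spatial_attn(f_rgb, f_rv, f_rv)              # camera attends to LiDAR features
        seq = torch.cat([history, fused], dim=1)                     # append current to past features
        temporal, _ = self.temporal_attn(fused, seq, seq)            # current attends to history
        feat = temporal.squeeze(1)                                   # (B, feat_dim)
        # Run every branch, then pick the one matching the navigation command.
        all_wp = torch.stack([branch(feat) for branch in self.branches], dim=1)
        idx = command.view(-1, 1, 1).expand(-1, 1, all_wp.size(-1))
        wp = all_wp.gather(1, idx).squeeze(1)
        return wp.view(wp.size(0), -1, 2)                            # (B, num_waypoints, 2)
```

In this sketch the navigation command simply selects one of several regression heads at the output, which is one common way to implement command-conditioned waypoint prediction; the paper's actual branch design may differ.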