Bowen Zhao , Hongdou He , Hang Xu , Peng Shi , Xiaobing Hao , Guoyan Huang
Digital Signal Processing, Volume 156, Article 104769. Published 2024-09-11. DOI: 10.1016/j.dsp.2024.104769
Impact factor: 2.9. JCR: Q2 (Engineering, Electrical & Electronic). Source: https://www.sciencedirect.com/science/article/pii/S1051200424003944
RTIA-Mono: Real-time lightweight self-supervised monocular depth estimation with global-local information aggregation
Self-supervised monocular depth estimation has attracted significant attention in computer vision, especially for applications such as autonomous driving and robotics. Recently, CNNs and Transformers have achieved great success in this task. However, existing research focuses primarily on improving estimation accuracy, and the resulting increase in model complexity makes deployment on edge computing devices difficult. Shallow CNNs help keep a network lightweight but suffer from limited receptive fields, which hinders the fusion of local geometric features with global semantic information. To address these issues, we propose RTIA-Mono, an efficient, real-time, lightweight self-supervised architecture for monocular depth estimation. First, we design a cross-stage feature fusion structure that promotes feature aggregation and fusion across stages. Second, in each stage, we propose a Global Local Information Aggregation (GLIA) module that combines the advantages of CNNs and Transformers to aggregate local and global features. Additionally, we introduce a Directional Feature Enhancement (DFE) module that supplements spatial structure information to mitigate the spatial information lost through downsampling. With this design, the proposed approach outperforms state-of-the-art methods on the KITTI benchmark with the fewest parameters and achieves a good balance among accuracy, complexity, and inference speed. Furthermore, RTIA-Mono demonstrates excellent generalization on other datasets.
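The abstract does not specify how the GLIA module is built, so the following is only a toy sketch of the general idea of combining a convolution-style local branch with an attention-style global branch. It is not the authors' implementation: the function names, the 1-D scalar features, the smoothing kernel, and the fixed blend weight `alpha` are all assumptions made for illustration.

```python
import math

def local_branch(x, kernel=(0.25, 0.5, 0.25)):
    """Convolution-style mixing: each position sees only its neighbours."""
    n, half = len(x), len(kernel) // 2
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(i + j - half, 0), n - 1)  # replicate-pad the borders
            acc += w * x[idx]
        out.append(acc)
    return out

def global_branch(x):
    """Attention-style mixing: each position attends to every position."""
    out = []
    for q in x:
        scores = [q * key for key in x]              # dot-product similarity
        m = max(scores)                              # subtract max for stability
        weights = [math.exp(s - m) for s in scores]  # unnormalised softmax
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / total)
    return out

def aggregate(x, alpha=0.5):
    """Hypothetical fusion: a fixed convex blend of the two branches."""
    loc, glo = local_branch(x), global_branch(x)
    return [alpha * l + (1.0 - alpha) * g for l, g in zip(loc, glo)]

# A single spike in the middle of an otherwise flat 1-D feature map.
features = [0.0, 0.0, 1.0, 0.0, 0.0]
fused = aggregate(features)
```

In this toy input, the local branch leaves the border positions at zero because the spike lies outside their three-tap window, while the global branch gives every position some share of it. That limited reach is the receptive-field problem the abstract attributes to shallow CNNs.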
About the journal:
Digital Signal Processing: A Review Journal is one of the oldest and most established journals in the field of signal processing, yet it aims to be the most innovative. The Journal invites top-quality research articles at the frontiers of research in all aspects of signal processing. Our objective is to provide a platform for the publication of ground-breaking research in signal processing with both academic and industrial appeal.
The journal has a special emphasis on statistical signal processing methodology such as Bayesian signal processing, and encourages articles on emerging applications of signal processing such as:
• big data
• machine learning
• internet of things
• information security
• systems biology and computational biology
• financial time series analysis
• autonomous vehicles
• quantum computing
• neuromorphic engineering
• human-computer interaction and intelligent user interfaces
• environmental signal processing
• geophysical signal processing including seismic signal processing
• chemioinformatics and bioinformatics
• audio, visual and performance arts
• disaster management and prevention
• renewable energy