{"title":"基于事件立体深度估计的深度线索融合","authors":"Dipon Kumar Ghosh, Yong Ju Jung","doi":"10.1016/j.inffus.2024.102891","DOIUrl":null,"url":null,"abstract":"Inspired by the biological retina, event cameras utilize dynamic vision sensors to capture pixel intensity changes asynchronously. Event cameras offer numerous advantages, such as high dynamic range, high temporal resolution, less motion blur, and low power consumption. These features make event cameras particularly well-suited for depth estimation, especially in challenging scenarios involving rapid motion and high dynamic range imaging conditions. The human visual system perceives the scene depth by combining multiple depth cues such as monocular pictorial depth, stereo depth, and motion parallax. However, most existing algorithms of the event-based depth estimation utilize only single depth cue such as either stereo depth or monocular depth. While it is feasible to estimate depth from a single cue, estimating dense disparity in challenging scenarios and lightning conditions remains a challenging problem. Following this, we conduct extensive experiments to explore various methods for the depth cue fusion. Inspired by the experiment results, in this study, we propose a fusion architecture that systematically incorporates multiple depth cues for the event-based stereo depth estimation. To this end, we propose a depth cue fusion (DCF) network to fuse multiple depth cues by utilizing a novel fusion method called SpadeFormer. The proposed SpadeFormer is a full y context-aware fusion mechanism, which incorporates two modulation techniques (i.e., spatially adaptive denormalization (Spade) and cross-attention) for the depth cue fusion in a transformer block. The adaptive denormalization modulates both input features by adjusting the global statistics of features in a cross manner, and the modulated features are further fused by the cross-attention technique. Experiments conducted on a real-world dataset show that our method reduces the one-pixel error rate by at least 47.63% (3.708 for the best existing method vs. 1.942 for ours) and the mean absolute error by 40.07% (0.302 for the best existing method vs. 0.181 for ours). The results reveal that the depth cue fusion method outperforms the state-of-the-art methods by significant margins and produces better disparity maps.","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"44 1","pages":""},"PeriodicalIF":14.7000,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Depth cue fusion for event-based stereo depth estimation\",\"authors\":\"Dipon Kumar Ghosh, Yong Ju Jung\",\"doi\":\"10.1016/j.inffus.2024.102891\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Inspired by the biological retina, event cameras utilize dynamic vision sensors to capture pixel intensity changes asynchronously. Event cameras offer numerous advantages, such as high dynamic range, high temporal resolution, less motion blur, and low power consumption. These features make event cameras particularly well-suited for depth estimation, especially in challenging scenarios involving rapid motion and high dynamic range imaging conditions. The human visual system perceives the scene depth by combining multiple depth cues such as monocular pictorial depth, stereo depth, and motion parallax. 
However, most existing algorithms of the event-based depth estimation utilize only single depth cue such as either stereo depth or monocular depth. While it is feasible to estimate depth from a single cue, estimating dense disparity in challenging scenarios and lightning conditions remains a challenging problem. Following this, we conduct extensive experiments to explore various methods for the depth cue fusion. Inspired by the experiment results, in this study, we propose a fusion architecture that systematically incorporates multiple depth cues for the event-based stereo depth estimation. To this end, we propose a depth cue fusion (DCF) network to fuse multiple depth cues by utilizing a novel fusion method called SpadeFormer. The proposed SpadeFormer is a full y context-aware fusion mechanism, which incorporates two modulation techniques (i.e., spatially adaptive denormalization (Spade) and cross-attention) for the depth cue fusion in a transformer block. The adaptive denormalization modulates both input features by adjusting the global statistics of features in a cross manner, and the modulated features are further fused by the cross-attention technique. Experiments conducted on a real-world dataset show that our method reduces the one-pixel error rate by at least 47.63% (3.708 for the best existing method vs. 1.942 for ours) and the mean absolute error by 40.07% (0.302 for the best existing method vs. 0.181 for ours). The results reveal that the depth cue fusion method outperforms the state-of-the-art methods by significant margins and produces better disparity maps.\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"44 1\",\"pages\":\"\"},\"PeriodicalIF\":14.7000,\"publicationDate\":\"2024-12-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1016/j.inffus.2024.102891\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1016/j.inffus.2024.102891","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Depth cue fusion for event-based stereo depth estimation
Inspired by the biological retina, event cameras utilize dynamic vision sensors to capture pixel intensity changes asynchronously. Event cameras offer numerous advantages, such as high dynamic range, high temporal resolution, reduced motion blur, and low power consumption. These features make event cameras particularly well-suited for depth estimation, especially in challenging scenarios involving rapid motion and high-dynamic-range imaging conditions. The human visual system perceives scene depth by combining multiple depth cues, such as monocular pictorial depth, stereo depth, and motion parallax. However, most existing event-based depth estimation algorithms utilize only a single depth cue, such as either stereo depth or monocular depth. While it is feasible to estimate depth from a single cue, estimating dense disparity under difficult scenes and lighting conditions remains a challenging problem. Accordingly, we conduct extensive experiments to explore various methods for depth cue fusion. Inspired by the experimental results, in this study we propose a fusion architecture that systematically incorporates multiple depth cues for event-based stereo depth estimation. To this end, we propose a depth cue fusion (DCF) network that fuses multiple depth cues by utilizing a novel fusion method called SpadeFormer. The proposed SpadeFormer is a fully context-aware fusion mechanism that incorporates two modulation techniques, namely spatially adaptive denormalization (Spade) and cross-attention, for depth cue fusion in a transformer block. The adaptive denormalization modulates both input features by adjusting their global statistics in a cross-wise manner, and the modulated features are further fused by the cross-attention technique. Experiments conducted on a real-world dataset show that our method reduces the one-pixel error rate by at least 47.63% (3.708 for the best existing method vs. 1.942 for ours) and the mean absolute error by 40.07% (0.302 for the best existing method vs. 0.181 for ours). The results reveal that the depth cue fusion method outperforms the state-of-the-art methods by significant margins and produces better disparity maps.
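To make the described fusion mechanism more concrete, below is a minimal PyTorch sketch of a Spade-plus-cross-attention fusion block of the kind the abstract outlines. It is an illustrative sketch only: the class names, layer choices, head count, and residual wiring (SpadeModulation, SpadeFormerBlock, num_heads, and so on) are assumptions made for this example and do not reproduce the authors' implementation.

# Hypothetical sketch of a SpadeFormer-style fusion block (not the authors' code).
import torch
import torch.nn as nn


class SpadeModulation(nn.Module):
    """Spatially adaptive denormalization: normalize x, then re-scale and
    re-shift it with gamma/beta maps predicted from a conditioning feature."""
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.gamma = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.beta = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x, cond):
        return self.norm(x) * (1 + self.gamma(cond)) + self.beta(cond)


class SpadeFormerBlock(nn.Module):
    """Fuses two depth-cue feature maps (e.g., stereo and monocular cues) by
    cross-modulating them with Spade and then mixing them via cross-attention."""
    def __init__(self, channels, num_heads=4):
        super().__init__()
        self.spade_a = SpadeModulation(channels)   # modulate cue A using cue B
        self.spade_b = SpadeModulation(channels)   # modulate cue B using cue A
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, feat_a, feat_b):
        # Cross-wise modulation: each cue's statistics are adjusted by the other cue.
        mod_a = self.spade_a(feat_a, feat_b)
        mod_b = self.spade_b(feat_b, feat_a)
        b, c, h, w = mod_a.shape
        # Flatten spatial dimensions into token sequences for attention.
        q = mod_a.flatten(2).transpose(1, 2)   # queries from cue A
        kv = mod_b.flatten(2).transpose(1, 2)  # keys/values from cue B
        fused, _ = self.attn(q, kv, kv)
        fused = fused.transpose(1, 2).reshape(b, c, h, w)
        return self.proj(fused) + feat_a       # residual connection (assumed)


if __name__ == "__main__":
    stereo_feat = torch.randn(1, 64, 32, 32)   # stereo-cue features
    mono_feat = torch.randn(1, 64, 32, 32)     # monocular-cue features
    block = SpadeFormerBlock(channels=64)
    print(block(stereo_feat, mono_feat).shape)  # torch.Size([1, 64, 32, 32])

In this sketch, each cue first renormalizes the other's features (the cross-wise statistics adjustment mentioned in the abstract), and cross-attention then mixes the two modulated token sequences; a full DCF network would presumably stack several such blocks inside a stereo-matching backbone.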
Journal introduction:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, with a focus on architectures, algorithms, and applications. Papers dealing with fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.