Zhichao Fu, Anran Wu, Shuwen Yang, Tianlong Ma, Liang He
{"title":"CAFNet: Context aligned fusion for depth completion","authors":"Zhichao Fu, Anran Wu, Shuwen Yang, Tianlong Ma, Liang He","doi":"10.1016/j.cviu.2024.104158","DOIUrl":null,"url":null,"abstract":"<div><p>Depth completion aims at reconstructing a dense depth from sparse depth input, frequently using color images as guidance. The sparse depth map lacks sufficient contexts for reconstructing focal contexts such as the shape of objects. The RGB images contain redundant contexts including details useless for reconstruction, which reduces the efficiency of focal context extraction. The unaligned contextual information from these two modalities poses a challenge to focal context extraction and further fusion, as well as the accuracy of depth completion. To optimize the utilization of multimodal contextual information, we explore a novel framework: Context Aligned Fusion Network (CAFNet). CAFNet comprises two stages: the context-aligned stage and the full-scale stage. In the context-aligned stage, CAFNet downsamples input RGB-D pairs to the scale, at which multimodal contextual information is adequately aligned for feature extraction in two encoders and fusion in CF modules. In the full-scale stage, feature maps with fused multimodal context from the previous stage are upsampled to the original scale and subsequentially fused with full-scale depth features by the GF module utilizing a dynamic masked fusion strategy. Ultimately, accurate dense depth maps are reconstructed, leveraging the GF module’s resultant features. Experiments conducted on indoor and outdoor benchmark datasets show that the CAFNet produces results comparable to state-of-the-art methods while effectively reducing computational costs.</p></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"249 ","pages":"Article 104158"},"PeriodicalIF":4.3000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Vision and Image Understanding","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S107731422400239X","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0
Abstract
Depth completion aims at reconstructing a dense depth from sparse depth input, frequently using color images as guidance. The sparse depth map lacks sufficient contexts for reconstructing focal contexts such as the shape of objects. The RGB images contain redundant contexts including details useless for reconstruction, which reduces the efficiency of focal context extraction. The unaligned contextual information from these two modalities poses a challenge to focal context extraction and further fusion, as well as the accuracy of depth completion. To optimize the utilization of multimodal contextual information, we explore a novel framework: Context Aligned Fusion Network (CAFNet). CAFNet comprises two stages: the context-aligned stage and the full-scale stage. In the context-aligned stage, CAFNet downsamples input RGB-D pairs to the scale, at which multimodal contextual information is adequately aligned for feature extraction in two encoders and fusion in CF modules. In the full-scale stage, feature maps with fused multimodal context from the previous stage are upsampled to the original scale and subsequentially fused with full-scale depth features by the GF module utilizing a dynamic masked fusion strategy. Ultimately, accurate dense depth maps are reconstructed, leveraging the GF module’s resultant features. Experiments conducted on indoor and outdoor benchmark datasets show that the CAFNet produces results comparable to state-of-the-art methods while effectively reducing computational costs.
期刊介绍:
The central focus of this journal is the computer analysis of pictorial information. Computer Vision and Image Understanding publishes papers covering all aspects of image analysis from the low-level, iconic processes of early vision to the high-level, symbolic processes of recognition and interpretation. A wide range of topics in the image understanding area is covered, including papers offering insights that differ from predominant views.
Research Areas Include:
• Theory
• Early vision
• Data structures and representations
• Shape
• Range
• Motion
• Matching and recognition
• Architecture and languages
• Vision systems