Triple fusion and feature pyramid decoder for RGB-D semantic segmentation

Authors: Bin Ge, Xu Zhu, Zihan Tang, Chenxing Xia, Yiming Lu, Zhuang Chen
DOI: 10.1007/s00530-024-01459-w
Published: 2024-09-16 (Journal Article)

Abstract: Current RGB-D semantic segmentation networks treat depth as an extra modality and merge RGB and depth features by equal-weighted concatenation or other simple fusion strategies, which limits the effective use of cross-modal information. To address the failure of existing networks to fully exploit RGB and depth features, we propose an RGB-D semantic segmentation network based on triple fusion and feature pyramid decoding. The network achieves bidirectional interaction and fusion of RGB and depth features through the proposed three-stage cross-modal fusion module (TCFM), which uses cross-modal cross-attention to inject information from each modality into the other, and fuses the resulting RGB and depth features with a channel-adaptive weighted fusion module. In addition, a lightweight feature pyramid decoder effectively fuses the multi-scale features produced by the encoder. Experiments on the NYU Depth V2 and SUN RGB-D datasets demonstrate that the proposed cross-modal feature fusion network segments complex scenes efficiently.
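The abstract does not spell out the internals of the TCFM's cross-modal cross-attention, but the general mechanism it names is standard: queries from one modality attend to keys and values from the other, so depth information is injected into the RGB stream (and vice versa for the reverse direction). The following NumPy sketch illustrates that mechanism only; the shapes, single-head form, and projection matrices are illustrative assumptions, not the paper's actual module.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(query_feats, context_feats, w_q, w_k, w_v):
    """Single-head cross-attention: tokens of one modality (queries)
    attend to tokens of the other modality (keys/values).

    query_feats:   (N, d_in) e.g. flattened RGB feature tokens
    context_feats: (M, d_in) e.g. flattened depth feature tokens
    Returns:       (N, d) context-enriched query features
    """
    q = query_feats @ w_q                      # (N, d)
    k = context_feats @ w_k                    # (M, d)
    v = context_feats @ w_v                    # (M, d)
    scores = q @ k.T / np.sqrt(k.shape[-1])    # (N, M) scaled dot products
    attn = softmax(scores, axis=-1)            # each query's weights sum to 1
    return attn @ v                            # depth info mixed into RGB stream

# Toy demo with assumed sizes: 6 RGB tokens attend to 8 depth tokens.
rng = np.random.default_rng(0)
N, M, d_in, d = 6, 8, 16, 4
rgb_tokens = rng.standard_normal((N, d_in))
depth_tokens = rng.standard_normal((M, d_in))
w_q, w_k, w_v = (rng.standard_normal((d_in, d)) for _ in range(3))
out = cross_modal_attention(rgb_tokens, depth_tokens, w_q, w_k, w_v)
```

A bidirectional variant, as the abstract implies, would run this twice with the roles of RGB and depth swapped.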
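The channel-adaptive weighted fusion module is likewise only named in the abstract. A minimal sketch of the general idea, per-channel weights derived from global statistics of each modality and used to blend the two feature maps, is shown below; the pooling choice and softmax gate are stand-ins for whatever learned weighting the paper actually uses.

```python
import numpy as np

def channel_adaptive_fusion(rgb, depth):
    """Blend two (C, H, W) feature maps with per-channel adaptive weights.

    Illustrative assumption: global average pooling yields one descriptor
    per channel per modality, and a softmax across the two modalities turns
    them into convex blending weights (a learned MLP would replace this).
    """
    s_rgb = rgb.mean(axis=(1, 2))                    # (C,) channel descriptors
    s_depth = depth.mean(axis=(1, 2))                # (C,)
    logits = np.stack([s_rgb, s_depth])              # (2, C)
    w = np.exp(logits - logits.max(axis=0))          # stable softmax over modalities
    w = w / w.sum(axis=0)                            # w[0] + w[1] == 1 per channel
    return w[0][:, None, None] * rgb + w[1][:, None, None] * depth

# Toy demo with assumed sizes: 8 channels, 4x4 spatial grid.
rng = np.random.default_rng(1)
rgb_feat = rng.standard_normal((8, 4, 4))
depth_feat = rng.standard_normal((8, 4, 4))
fused = channel_adaptive_fusion(rgb_feat, depth_feat)
```

Because the weights form a convex combination per channel, channels where one modality carries a stronger response contribute more to the fused map, which is the behavior an adaptive (rather than equal-weighted) fusion is meant to provide.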