Salient Object Detection in RGB-D Videos

Ao Mou;Yukang Lu;Jiahao He;Dingyao Min;Keren Fu;Qijun Zhao
IEEE Transactions on Image Processing, vol. 33, pp. 6660-6675. Published 2024-11-20. DOI: 10.1109/TIP.2024.3498326
https://ieeexplore.ieee.org/document/10759589/

Abstract

Given the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, salient object detection (SOD) in RGB-D videos is a promising and evolving research direction. Despite this potential, SOD in RGB-D videos remains somewhat under-explored, with RGB-D SOD and video SOD (VSOD) traditionally studied in isolation. To explore this emerging field, this paper makes two primary contributions: a dataset and a model. First, we construct RDVS, a new RGB-D VSOD dataset with realistic depth, characterized by diverse scenes and rigorous frame-by-frame annotations. We validate the dataset through comprehensive attribute and object-oriented analyses, and provide training and testing splits. Second, we introduce DCTNet+, a three-stream network tailored for RGB-D VSOD that emphasizes the RGB modality and treats depth and optical flow as auxiliary modalities. For effective feature enhancement, refinement, and fusion toward precise final prediction, we propose two modules: the multi-modal attention module (MAM) and the refinement fusion module (RFM). To enhance interaction and fusion within the RFM, we design a universal interaction module (UIM), and we integrate holistic multi-modal attentive paths (HMAPs) to refine multi-modal low-level features before they reach the RFMs. Comprehensive experiments on pseudo RGB-D video datasets and on our proposed RDVS show that DCTNet+ outperforms 19 VSOD models and 14 RGB-D SOD models. Additionally, ablation experiments on both pseudo and realistic RGB-D video datasets demonstrate the advantages of the individual modules as well as the necessity of introducing realistic depth into VSOD. Our code, together with the RDVS dataset, will be available at https://github.com/kerenfu/RDVS/.