Salient Object Detection in RGB-D Videos

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society (IF 13.7). Pub Date: 2024-11-20. DOI: 10.1109/TIP.2024.3498326

Ao Mou, Yukang Lu, Jiahao He, Dingyao Min, Keren Fu, Qijun Zhao

IEEE Transactions on Image Processing, vol. 33, pp. 6660-6675, 2024. Journal Article. https://ieeexplore.ieee.org/document/10759589/

Citations: 0

Abstract

Given the widespread adoption of depth-sensing acquisition devices, RGB-D videos and related data/media have gained considerable traction in various aspects of daily life. Consequently, salient object detection (SOD) in RGB-D videos is a highly promising and evolving research avenue. Despite this potential, SOD in RGB-D videos remains under-explored, because RGB-D SOD and video SOD (VSOD) have traditionally been studied in isolation. To explore this emerging field, this paper makes two primary contributions: a dataset and a model. First, we construct RDVS, a new RGB-D VSOD dataset with realistic depth, characterized by diverse scenes and rigorous frame-by-frame annotations. We validate the dataset through comprehensive attribute and object-oriented analyses, and provide training and testing splits. Second, we introduce DCTNet+, a three-stream network tailored for RGB-D VSOD that emphasizes the RGB modality and treats depth and optical flow as auxiliary modalities. To achieve effective feature enhancement, refinement, and fusion for precise final prediction, we propose two modules: the multi-modal attention module (MAM) and the refinement fusion module (RFM). To enhance interaction and fusion within the RFM, we design a universal interaction module (UIM), and we then integrate holistic multi-modal attentive paths (HMAPs) that refine multi-modal low-level features before they reach the RFMs. Comprehensive experiments, conducted on pseudo RGB-D video datasets alongside our proposed RDVS, show that DCTNet+ outperforms 19 VSOD models and 14 RGB-D SOD models. Additionally, ablation experiments on both pseudo and realistic RGB-D video datasets demonstrate the advantages of the individual modules as well as the necessity of introducing realistic depth into VSOD. Our code, together with the RDVS dataset, will be available at https://github.com/kerenfu/RDVS/.
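To make the idea of a primary RGB stream supported by auxiliary depth and optical-flow streams concrete, the following is a minimal NumPy sketch. It is not the MAM, RFM, UIM, or HMAP design from the paper, whose internals the abstract does not specify. The function name `auxiliary_gated_fusion`, the channel-averaged spatial attention, and the residual gating formula are all assumptions chosen for illustration only.

```python
import numpy as np


def sigmoid(x):
    """Element-wise logistic function."""
    return 1.0 / (1.0 + np.exp(-x))


def auxiliary_gated_fusion(rgb, depth, flow):
    """Rescale RGB features with attention maps from two auxiliary streams.

    Hypothetical illustration only; NOT the paper's MAM or RFM.
    All inputs are feature maps of shape (C, H, W).
    """
    if not (rgb.shape == depth.shape == flow.shape):
        raise ValueError("feature maps must share one shape")
    # Spatial attention: average each auxiliary map over channels,
    # then squash the result into (0, 1).
    depth_att = sigmoid(depth.mean(axis=0, keepdims=True))  # (1, H, W)
    flow_att = sigmoid(flow.mean(axis=0, keepdims=True))    # (1, H, W)
    # Residual gating keeps RGB as the dominant signal: the auxiliary
    # streams only rescale RGB features and never replace them.
    return rgb * (1.0 + depth_att + flow_att)


# Usage with random stand-in features (8 channels, 4x4 spatial grid).
rng = np.random.default_rng(0)
rgb, depth, flow = (rng.standard_normal((8, 4, 4)) for _ in range(3))
fused = auxiliary_gated_fusion(rgb, depth, flow)
print(fused.shape)  # (8, 4, 4)
```

In this sketch the auxiliary modalities act only as multiplicative gates on the RGB features, which is one simple way to express "RGB first, depth and flow as helpers"; a trained network would learn the attention with convolutional layers rather than a fixed channel average.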



