{"title":"Efficient RGB-D Co-Salient Object Detection via Modality-Aware Prompting","authors":"Zhangping Tu;Xiaohong Qian;Wujie Zhou","doi":"10.1109/TASE.2025.3543586","DOIUrl":null,"url":null,"abstract":"RGB-D co-salient object detection (Co-SOD) aims to identify and segment co-occurring salient objects in a set of correlated images and depth maps. Most existing RGB-D Co-SOD methods fully fine-tune the dual-stream encoder-decoder architecture and fuse the RGB and depth features using a complex feature fusion strategy, which is expensive to train owing to the large number of parameters that need to be updated during the feature extraction and fusion process. In addition, current methods do not pay sufficient attention to differentiate co- salient information from non-co- salient information effectively. This interfering information affects the localization of co-salient targets. Therefore, this study proposes a simple and effective modality-aware prompting network (MAPNet) for efficient RGB-D Co-SOD. MAPNet mainly performs RGB-D Co-SOD through two approaches, namely modal fusion and consensus feature extraction, using a multimodal prompt generator (MPG) and consensus feature extraction module (CFEM), respectively. Specifically, the MPG module guides the depth features in the fine-tuned backbone network from the RGB features obtained in the frozen backbone network for fusion in hyperbolic spaces to generate multilevel modal cues that are subsequently injected into the fine-tuned backbone network for efficient modal fusion. The CFEM uses RGB features to generate an image salient prior, combines the salient prior with the highest level of fusion features to obtain the central point, and uses the salient features closer to the central point as the consensus features of the image group. In addition, contrast loss is introduced to separate the synergistic and non-synergistic salient features to obtain pure co-salient features. The trained MAPNet delivered state-of-the-art performance on three benchmark datasets (RGB-D CoSal1k, RGB-D CoSal150, and RGB-D CoSeg183), with the structure-measure improved by 2.1% on the RGB-D CoSeg183 dataset. The codes are available at <uri>https://github.com/trumpetor/MAPNet</uri>. Note to Practitioners—This study presents a straightforward and effective modality-aware prompting network (MAPNet) designed for efficient RGB-D Co-SOD. Initially, the MPG module of MAPNet enables the effective integration of RGB and depth modalities through prompt learning. Subsequently, the CFEM employs the pixel group centroid proxy and top-k selection mechanism to extract high-level integrated features and salient prior consensus features, which serve as coordinated saliency features for image groups. 
Finally, the coordinated obtained salient and integrated features are input into the decoder to generate predictions.","PeriodicalId":51060,"journal":{"name":"IEEE Transactions on Automation Science and Engineering","volume":"22 ","pages":"12911-12921"},"PeriodicalIF":6.4000,"publicationDate":"2025-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Automation Science and Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10892262/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"AUTOMATION & CONTROL SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
RGB-D co-salient object detection (Co-SOD) aims to identify and segment co-occurring salient objects in a set of correlated images and depth maps. Most existing RGB-D Co-SOD methods fully fine-tune a dual-stream encoder-decoder architecture and fuse the RGB and depth features with a complex feature-fusion strategy, which is expensive to train owing to the large number of parameters updated during feature extraction and fusion. In addition, current methods do not effectively differentiate co-salient information from non-co-salient information, and this interfering information degrades the localization of co-salient targets. This study therefore proposes a simple and effective modality-aware prompting network (MAPNet) for efficient RGB-D Co-SOD. MAPNet performs RGB-D Co-SOD through two mechanisms, modal fusion and consensus feature extraction, realized by a multimodal prompt generator (MPG) and a consensus feature extraction module (CFEM), respectively. Specifically, the MPG uses the RGB features extracted by a frozen backbone to guide the depth features in a fine-tuned backbone, fusing the two in hyperbolic space to generate multilevel modal cues that are then injected into the fine-tuned backbone for efficient modal fusion. The CFEM uses the RGB features to generate an image saliency prior, combines this prior with the highest-level fused features to obtain a central point, and takes the salient features closest to the central point as the consensus features of the image group. In addition, a contrastive loss is introduced to separate co-salient from non-co-salient features and obtain pure co-salient features. The trained MAPNet delivered state-of-the-art performance on three benchmark datasets (RGB-D CoSal1k, RGB-D CoSal150, and RGB-D CoSeg183), improving the structure-measure by 2.1% on RGB-D CoSeg183. The code is available at https://github.com/trumpetor/MAPNet.

Note to Practitioners—This study presents a straightforward and effective modality-aware prompting network (MAPNet) designed for efficient RGB-D Co-SOD. First, the MPG module of MAPNet integrates the RGB and depth modalities through prompt learning. Then, the CFEM employs a pixel-group centroid proxy and a top-k selection mechanism to extract, from the high-level fused features and the saliency prior, the consensus features that serve as the co-salient features of the image group. Finally, the obtained co-salient and fused features are fed into the decoder to generate predictions.
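To make the centroid-proxy and top-k mechanism described above concrete, here is a minimal PyTorch sketch. The abstract does not give CFEM's actual formulation, so the function name `consensus_features`, the tensor shapes, the saliency-weighted centroid, and the cosine-similarity metric are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def consensus_features(fused: torch.Tensor, prior: torch.Tensor, k: int = 64) -> torch.Tensor:
    """Select group-consensus features via a centroid proxy and top-k selection.

    fused: (B, C, H, W) highest-level fused RGB-D features of an image group.
    prior: (B, 1, H, W) saliency prior in [0, 1] derived from the RGB features.
    Returns (k, C): the k salient pixel features closest to the group centroid.
    """
    B, C, H, W = fused.shape
    # Flatten the whole image group into one pool of pixel features: (B*H*W, C).
    feats = fused.permute(0, 2, 3, 1).reshape(-1, C)
    weights = prior.reshape(-1)  # per-pixel saliency weights, (B*H*W,)

    # Centroid proxy: saliency-weighted mean feature over the whole group.
    centroid = (feats * weights.unsqueeze(1)).sum(dim=0) / (weights.sum() + 1e-6)

    # Score each pixel feature by closeness to the centroid, gated by saliency
    # so that non-salient background pixels cannot enter the consensus set.
    sim = F.cosine_similarity(feats, centroid.unsqueeze(0), dim=1) * weights

    # Keep the top-k most centroid-like salient features as the consensus set.
    idx = sim.topk(k).indices
    return feats[idx]


# Usage with dummy data: a group of 5 images with 256-channel features.
group = torch.randn(5, 256, 12, 12)
prior = torch.sigmoid(torch.randn(5, 1, 12, 12))
cons = consensus_features(group, prior, k=32)  # (32, 256)
```

In MAPNet, such consensus features would then guide the decoder for each image in the group, with the contrastive loss mentioned in the abstract pushing co-salient and non-co-salient features apart; those stages are omitted from this sketch.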
Journal Introduction
The IEEE Transactions on Automation Science and Engineering (T-ASE) publishes fundamental papers on Automation, emphasizing scientific results that advance efficiency, quality, productivity, and reliability. T-ASE encourages interdisciplinary approaches from computer science, control systems, electrical engineering, mathematics, mechanical engineering, operations research, and other fields. T-ASE welcomes results relevant to industries such as agriculture, biotechnology, healthcare, home automation, maintenance, manufacturing, pharmaceuticals, retail, security, service, supply chains, and transportation. T-ASE addresses a research community willing to integrate knowledge across disciplines and industries. For this purpose, each paper includes a Note to Practitioners that summarizes how its results can be applied or how they might be extended to apply in practice.