SAMNet: Adapting segment anything model for accurate light field salient object detection
Xingzheng Wang, Jianbin Wu, Shaoyong Wu, Jiahui Li
Image and Vision Computing, Volume 154, Article 105403, February 2025
DOI: 10.1016/j.imavis.2024.105403
Citations: 0
Abstract
Light field salient object detection (LF SOD) is an important task that aims to segment visually salient objects from their surroundings. However, existing methods still struggle to achieve accurate detection, especially in complex scenes. Recently, the segment anything model (SAM) has excelled in various vision tasks thanks to its strong object segmentation ability and generalization capability, making it well suited to the LF SOD challenge. In this paper, we adapt SAM for accurate LF SOD. Specifically, we propose a network named SAMNet with two adaptation designs. First, to enhance the perception of salient objects, we design a task-oriented multi-scale convolution adapter (MSCA) integrated into SAM's image encoder. All image-encoder parameters except those of the MSCA are frozen to balance detection accuracy and computational cost. Second, to effectively exploit the rich scene information in LF data, we design a data-oriented cross-modal fusion module (CMFM) that fuses SAM features from different modalities. Comprehensive experiments on four benchmark datasets demonstrate the effectiveness of SAMNet over current state-of-the-art methods; in particular, SAMNet achieves the highest F-measures of 0.945, 0.819, 0.868, and 0.898 on the four datasets, respectively. To the best of our knowledge, this is the first work that adapts a vision foundation model to LF SOD.
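To make the adaptation recipe concrete, below is a minimal PyTorch sketch of the two ideas the abstract describes: a bottleneck adapter with multi-scale depth-wise convolutions trained inside an otherwise frozen image encoder, and a cross-attention block that fuses features from two LF modalities. All class names, kernel sizes, and the `"adapter"` parameter-naming convention are illustrative assumptions; the paper's actual MSCA and CMFM designs may differ.

```python
import torch
import torch.nn as nn


class MultiScaleConvAdapter(nn.Module):
    """Hypothetical multi-scale convolution adapter, in the spirit of the
    paper's MSCA: a lightweight bottleneck with depth-wise convolutions at
    several kernel sizes, added residually to frozen encoder features."""

    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.convs = nn.ModuleList([
            nn.Conv2d(bottleneck, bottleneck, k, padding=k // 2, groups=bottleneck)
            for k in (1, 3, 5)  # assumed kernel sizes for multi-scale context
        ])
        self.up = nn.Linear(bottleneck, dim)
        self.act = nn.GELU()

    def forward(self, tokens: torch.Tensor, hw: tuple) -> torch.Tensor:
        # tokens: (B, N, C) patch tokens from the image encoder; hw = (H, W) grid.
        b, n, c = tokens.shape
        h, w = hw
        x = self.act(self.down(tokens))              # (B, N, bottleneck)
        x = x.transpose(1, 2).reshape(b, -1, h, w)   # to a spatial feature map
        x = sum(conv(x) for conv in self.convs)      # aggregate multiple scales
        x = x.flatten(2).transpose(1, 2)             # back to (B, N, bottleneck)
        return tokens + self.up(x)                   # residual adaptation


class CrossModalFusion(nn.Module):
    """Hypothetical stand-in for the CMFM: cross-attention that lets features
    of one modality (e.g., the all-in-focus image) attend to another
    (e.g., focal-stack or depth features extracted by SAM's encoder)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, rgb_feat: torch.Tensor, lf_feat: torch.Tensor) -> torch.Tensor:
        fused, _ = self.attn(query=rgb_feat, key=lf_feat, value=lf_feat)
        return self.norm(rgb_feat + fused)


def freeze_encoder_except_adapters(encoder: nn.Module) -> None:
    """Freeze every encoder parameter whose name does not mark it as part of
    an adapter, so only the lightweight adapters receive gradients."""
    for name, param in encoder.named_parameters():
        param.requires_grad = "adapter" in name
```

A typical parameter-efficient setup would register one `MultiScaleConvAdapter` per encoder block (so its parameters carry "adapter" in their names) and call `freeze_encoder_except_adapters(encoder)` before optimization, which matches the accuracy/compute trade-off the abstract describes.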
Journal introduction:
Image and Vision Computing's primary aim is to provide an effective medium for exchanging the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real-world scenes. It seeks to foster a deeper understanding of the discipline by encouraging the quantitative comparison and performance evaluation of proposed methodology. Coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.