{"title":"弥合时空特征差距,实现视频突出物体检测","authors":"","doi":"10.1016/j.knosys.2024.112505","DOIUrl":null,"url":null,"abstract":"<div><div>The mutual transfer of spatiotemporal features is the main challenge for the two-stream video salient object detection. Current methods adopt the spatiotemporal feature interaction to achieve it. However, these methods still have two issues: modal feature gap and layer feature gap. To address these, we propose a Bridging Spatiotemporal feature Gap Network (BSGNet) with a global correspondence interaction and gate filtering (GCGF) module, a global-local distribution consistency (GLDC) module, and a modality-layer feature fusion framework (MLFF). Compared with previous works, BSGNet not only explores more effective interaction by GCGF, but also bridges modality and layer feature gaps by GLDC and MLFF. Firstly, GCGF achieves the spatiotemporal feature interaction by modeling intra-modal and inter-modal global correspondences. Besides, GCGF employs a gate mechanism to control the proportion of message transfer between appearance and motion information, which characterizes the contribution provided by spatiotemporal features. Secondly, at both global and local levels, GLDC pushes the spatiotemporal feature distribution between same scenes, and pulls the spatiotemporal feature distribution between different scenes. This can enhance the distribution consistency to align spatiotemporal features and bridge modal feature gap. Finally, MLFF designs an inter-modal and inter-layer feature fusion framework to bridge the layer feature gap brought by the different modalities and different receptive fields. 
Extensive experiments on five benchmarks reveal that our BSGNet outperforms state-of-the-arts.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":null,"pages":null},"PeriodicalIF":7.2000,"publicationDate":"2024-09-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Bridging spatiotemporal feature gap for video salient object detection\",\"authors\":\"\",\"doi\":\"10.1016/j.knosys.2024.112505\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The mutual transfer of spatiotemporal features is the main challenge for the two-stream video salient object detection. Current methods adopt the spatiotemporal feature interaction to achieve it. However, these methods still have two issues: modal feature gap and layer feature gap. To address these, we propose a Bridging Spatiotemporal feature Gap Network (BSGNet) with a global correspondence interaction and gate filtering (GCGF) module, a global-local distribution consistency (GLDC) module, and a modality-layer feature fusion framework (MLFF). Compared with previous works, BSGNet not only explores more effective interaction by GCGF, but also bridges modality and layer feature gaps by GLDC and MLFF. Firstly, GCGF achieves the spatiotemporal feature interaction by modeling intra-modal and inter-modal global correspondences. Besides, GCGF employs a gate mechanism to control the proportion of message transfer between appearance and motion information, which characterizes the contribution provided by spatiotemporal features. Secondly, at both global and local levels, GLDC pushes the spatiotemporal feature distribution between same scenes, and pulls the spatiotemporal feature distribution between different scenes. This can enhance the distribution consistency to align spatiotemporal features and bridge modal feature gap. 
Finally, MLFF designs an inter-modal and inter-layer feature fusion framework to bridge the layer feature gap brought by the different modalities and different receptive fields. Extensive experiments on five benchmarks reveal that our BSGNet outperforms state-of-the-arts.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":7.2000,\"publicationDate\":\"2024-09-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0950705124011390\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0950705124011390","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
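The abstract only sketches GCGF's gate mechanism at a high level. As a minimal NumPy illustration of one plausible form, a learned sigmoid gate could mix appearance and motion features per channel; the function name, weight shapes, and mixing rule below are assumptions for illustration, not the paper's actual formulation:

```python
import numpy as np

def gated_fusion(appearance, motion, w_gate, b_gate):
    """Hypothetical sketch of a GCGF-style gate: a sigmoid gate decides,
    per feature channel, what proportion of the appearance stream versus
    the motion stream is transferred into the fused representation.
    The exact formulation is not given in the abstract."""
    # The gate is conditioned on both streams jointly.
    x = np.concatenate([appearance, motion], axis=-1)
    gate = 1.0 / (1.0 + np.exp(-(x @ w_gate + b_gate)))  # sigmoid, in (0, 1)
    # Convex combination: gate controls the message-transfer proportion.
    return gate * appearance + (1.0 - gate) * motion
```

Because the gate output lies in (0, 1), each fused value is an elementwise convex combination of the two streams, so neither modality can be entirely discarded unless the gate saturates.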
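The push-pull behavior attributed to GLDC resembles a contrastive consistency objective. A minimal sketch under that assumption (the paper's actual loss is not specified in the abstract, and the function and margin below are hypothetical):

```python
import numpy as np

def distribution_consistency_loss(spatial, temporal, same_scene, margin=1.0):
    """Hypothetical sketch of a GLDC-style objective: for features from the
    same scene, the spatial-temporal distance is minimized (pull together);
    for different scenes, a hinge pushes the distance beyond a margin."""
    d = np.linalg.norm(spatial - temporal)
    if same_scene:
        return d ** 2                     # pull aligned distributions together
    return max(0.0, margin - d) ** 2      # push mismatched ones apart
```

In this form, same-scene pairs are penalized for any separation, while different-scene pairs incur no penalty once their distance exceeds the margin, which is the standard way such a consistency term avoids collapsing all features to one point.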
Journal introduction:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems built with knowledge-based and other artificial-intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.