Less Can Be More: Sound Source Localization With a Classification Model

2022 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), Pub Date: 2022-01-01, DOI: 10.1109/WACV51458.2022.00065

Arda Senocak, H. Ryu, Junsik Kim, In-So Kweon


Citations: 14

Abstract

In this paper, we tackle sound localization as a natural outcome of the audio-visual video classification problem. Unlike existing sound localization approaches, we do not use any explicit sub-modules or training mechanisms; instead, we apply simple cross-modal attention on top of the representations learned with a classification loss. Our key contribution is to show that a simple audio-visual classification model can localize sound sources accurately and perform on par with state-of-the-art methods, demonstrating that "less is more" indeed. Furthermore, we propose potential applications that can be built on our model. First, we introduce informative moment selection, which improves localization learning in existing approaches compared to using the middle frame. Then, we introduce a pseudo bounding box generation procedure that can significantly boost the performance of existing methods in semi-supervised settings, or be used for large-scale automatic annotation of any video dataset with minimal effort.
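The abstract does not spell out how the cross-modal attention or the pseudo bounding boxes are computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the attention map is the cosine similarity between a single audio embedding and each spatial location of a visual feature map, and that a pseudo box is the tight box around locations whose normalized attention exceeds a threshold. The names `attention_map` and `pseudo_box`, the min-max normalization, and the `0.5` threshold are all hypothetical.

```python
import numpy as np


def attention_map(visual, audio, eps=1e-8):
    """Cosine similarity between an audio vector (C,) and every spatial
    location of a visual feature map (C, H, W), rescaled to [0, 1].
    Assumption: this stands in for the paper's cross-modal attention."""
    v = visual / (np.linalg.norm(visual, axis=0, keepdims=True) + eps)
    a = audio / (np.linalg.norm(audio) + eps)
    sim = np.einsum("chw,c->hw", v, a)  # (H, W) similarity scores
    lo, hi = sim.min(), sim.max()
    return (sim - lo) / (hi - lo + eps)  # min-max normalization


def pseudo_box(att, threshold=0.5):
    """Tight box (x_min, y_min, x_max, y_max) around all locations whose
    attention is at least `threshold`; None if no location qualifies.
    Assumption: the threshold value is illustrative only."""
    ys, xs = np.nonzero(att >= threshold)
    if ys.size == 0:
        return None
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())


# Toy input: a 4x4 feature map with 3 channels. The central 2x2 region
# points in the same direction as the audio embedding; the rest does not.
visual = np.zeros((3, 4, 4))
visual[1] = 1.0
visual[:, 1:3, 1:3] = np.array([1.0, 0.0, 0.0])[:, None, None]
audio = np.array([1.0, 0.0, 0.0])

att = attention_map(visual, audio)
box = pseudo_box(att)
print(box)  # (1, 1, 2, 2)
```

In this toy case the box coordinates are in feature-map cells; a real pipeline would presumably upsample the attention map to image resolution before thresholding.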
