Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization

arXiv - CS - Computer Vision and Pattern Recognition | Pub Date: 2024-09-12 | DOI: arxiv-2409.07967

Ling Xing, Hongyu Qu, Rui Yan, Xiangbo Shu, Jinhui Tang

Abstract

Dense-localization Audio-Visual Events (DAVE) aims to identify the time boundaries and corresponding categories of events that can be heard and seen concurrently in an untrimmed video. Existing methods typically encode audio and visual representations separately, without any explicit cross-modal alignment constraint, and then adopt dense cross-modal attention to integrate multimodal information for DAVE. As a result, these methods inevitably aggregate irrelevant noise and events, especially in complex and long videos, leading to imprecise detection. In this paper, we present LOCO, a Locality-aware cross-modal Correspondence learning framework for DAVE. The core idea is to exploit the local temporal continuity of audio-visual events, which serves as an informative yet free supervision signal that guides the filtering of irrelevant information and encourages the extraction of complementary multimodal information during both the unimodal and cross-modal learning stages. i) Specifically, LOCO applies Locality-aware Correspondence Correction (LCC) to unimodal features by leveraging cross-modal local-correlated properties, without any extra annotations. This forces the unimodal encoders to highlight similar semantics shared by audio and visual features. ii) To better aggregate such audio and visual features, we further customize a Cross-modal Dynamic Perception (CDP) layer in the cross-modal feature pyramid to understand local temporal patterns of audio-visual events by imposing local consistency within multimodal features in a data-driven manner. By incorporating LCC and CDP, LOCO provides solid performance gains and outperforms existing methods for DAVE. The source code will be released.
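
To illustrate the contrast the abstract draws between dense cross-modal attention and locality-aware aggregation, the following is a minimal sketch under stated assumptions. It is not the LOCO implementation, and it reproduces neither LCC nor CDP, whose details are not given here. The function name `cross_modal_attention`, the `window` parameter, and the toy feature values are all hypothetical, and a fixed temporal window stands in for the data-driven locality the abstract describes.

```python
import math


def cross_modal_attention(audio, visual, window=None):
    """Attend from each audio time step to visual time steps.

    Hypothetical helper for illustration only (not from the paper).
    audio, visual: equal-length lists of feature vectors, one per time step.
    window: None means every visual step is visible (dense attention);
            an integer w restricts audio step i to visual steps j
            with |i - j| <= w (a fixed local window, an assumption here).
    Returns (weights, fused); weights[i][j] is the attention that
    audio step i places on visual step j.
    """
    dim = len(audio[0])
    weights, fused = [], []
    for i, query in enumerate(audio):
        # Visual steps that this audio step is allowed to look at.
        visible = [j for j in range(len(visual))
                   if window is None or abs(i - j) <= window]
        # Scaled dot-product scores over the visible steps only.
        scores = [sum(q * k for q, k in zip(query, visual[j])) / math.sqrt(dim)
                  for j in visible]
        # Numerically stable softmax over the visible scores.
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        row = [0.0] * len(visual)
        for j, e in zip(visible, exps):
            row[j] = e / total
        weights.append(row)
        # Attention-weighted sum of the visual features.
        fused.append([sum(row[j] * visual[j][d] for j in range(len(visual)))
                      for d in range(dim)])
    return weights, fused


# Toy sequences: six time steps, two-dimensional features (made-up values).
audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
visual = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]

dense_w, _ = cross_modal_attention(audio, visual)
local_w, _ = cross_modal_attention(audio, visual, window=1)

print([round(w, 3) for w in dense_w[0]])  # weight spread over all six steps
print([round(w, 3) for w in local_w[0]])  # weight confined to steps 0 and 1
```

With dense attention, the first audio step places nonzero weight on the temporally distant steps 4 and 5, which in a long video could belong to an unrelated event. With the window, that weight is zero by construction. A learned, data-driven notion of locality, as the abstract describes for CDP, would replace the fixed `window` used in this sketch.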
