Locality-aware Cross-modal Correspondence Learning for Dense Audio-Visual Events Localization
Ling Xing, Hongyu Qu, Rui Yan, Xiangbo Shu, Jinhui Tang
arXiv - CS - Computer Vision and Pattern Recognition, 2024-09-12
DOI: arxiv-2409.07967 (https://doi.org/arxiv-2409.07967)

Abstract
Dense-localization Audio-Visual Events (DAVE) aims to identify time
boundaries and corresponding categories for events that can be heard and seen
concurrently in an untrimmed video. Existing methods typically encode audio and
visual representations separately, without any explicit cross-modal alignment
constraint, and then apply dense cross-modal attention to integrate multimodal
information for DAVE. As a result, they inevitably aggregate irrelevant noise
and events, especially in complex and long videos, leading to imprecise
detection. In this paper, we present LOCO, a Locality-aware cross-modal
Correspondence learning framework for DAVE. The core idea is to exploit the
local temporal continuity of audio-visual events, which serves as an informative
yet free supervision signal that guides the filtering of irrelevant information
and encourages the extraction of complementary multimodal information during
both the uni-modal and cross-modal learning stages. Specifically, (i) LOCO
applies Locality-aware Correspondence Correction (LCC) to uni-modal features by
leveraging cross-modal locally correlated properties, without any extra
annotations. This forces the uni-modal encoders to highlight the semantics
shared by audio and visual features. (ii) To better aggregate these audio and
visual features, we further customize a Cross-modal Dynamic Perception (CDP)
layer in the cross-modal feature pyramid, which captures the local temporal
patterns of audio-visual events by imposing local consistency within multimodal
features in a data-driven manner. By incorporating LCC and CDP, LOCO provides solid
performance gains and outperforms existing methods for DAVE. The source code
will be released.
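
The contrast drawn above between dense cross-modal attention and a
locality-restricted alternative can be illustrated with a minimal sketch. This
is not LOCO's LCC or CDP, whose details the abstract does not give. The function
name `cross_modal_attention`, the fixed `window` parameter, and the toy features
are all assumptions made for illustration. In particular, the abstract describes
CDP as data-driven, whereas the window here is a hand-set constant.

```python
import math


def softmax(scores):
    """Numerically stable softmax; entries equal to -inf receive weight 0."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_modal_attention(visual, audio, window=None):
    """Attend from each visual segment to the audio segments.

    visual, audio: lists of T feature vectors, one per temporal segment
                   (both lists are assumed to have the same length T).
    window: None gives dense attention over all T audio segments; an int w
            (hypothetical parameter) restricts visual segment t to audio
            segments in [t - w, t + w].
    Returns (fused, weights); weights[t][s] is the attention weight from
    visual segment t to audio segment s.
    """
    scale = math.sqrt(len(visual[0]))
    fused, weights = [], []
    for t, v in enumerate(visual):
        scores = []
        for s, a in enumerate(audio):
            if window is not None and abs(t - s) > window:
                scores.append(float("-inf"))  # masked: outside the local window
            else:
                scores.append(dot(v, a) / scale)
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of audio features for this visual segment.
        fused.append([sum(w[s] * audio[s][d] for s in range(len(audio)))
                      for d in range(len(audio[0]))])
    return fused, weights


# Toy input: 6 segments with 2-D features. Audio segment 5 is an unrelated
# sound whose feature happens to resemble visual segment 0.
visual = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0]]
audio = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [0.0, 1.0], [3.0, 0.0]]

_, dense = cross_modal_attention(visual, audio)
_, local = cross_modal_attention(visual, audio, window=1)
```

With dense attention, visual segment 0 places its largest weight on the distant,
unrelated audio segment 5. With `window=1`, that segment is masked out and the
weight is exactly zero, so only temporally adjacent audio contributes.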