Embedding-based pair generation for contrastive representation learning in audio-visual surveillance data.

Frontiers in Robotics and AI (IF 3, Q2, Robotics). Pub Date: 2025-01-13. eCollection Date: 2024-01-01. DOI: 10.3389/frobt.2024.1490718

Wei-Cheng Wang, Sander De Coninck, Sam Leroux, Pieter Simoens

Journal Article. Volume 11, article 1490718. Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11769797/pdf/

Citations: 0

Abstract

Smart cities deploy sensors such as microphones and RGB cameras to collect data that can improve the safety and comfort of citizens. Because data annotation is expensive, self-supervised methods such as contrastive learning are used to learn audio-visual representations for downstream tasks. Focusing on surveillance data, we investigate two common limitations of audio-visual contrastive learning: false negatives and the minimal sufficient information bottleneck. Irregular yet frequently recurring events can produce a considerable number of false-negative pairs and disrupt the model's training. To tackle this challenge, we propose a novel method for generating contrastive pairs based on the distance between embeddings of different modalities, rather than relying solely on temporal cues. The resulting semantically synchronized pairs can then be used, together with a new loss function for multiple positives, to ease the minimal sufficient information bottleneck. We experimentally validate our approach on real-world data and show how the learnt representations can be used for different downstream tasks, including audio-visual event localization, anomaly detection, and event search. Our approach achieves performance similar to that of state-of-the-art modality- and task-specific approaches.
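The abstract gives no implementation details, so the following is only a rough sketch of the general idea, not the authors' method. It assumes cosine similarity as the embedding distance, a fixed hypothetical `threshold` for declaring a cross-modal pair positive, and a generic loss that averages the log-likelihood over all positives of each anchor (in the style of supervised contrastive losses). All function names, the `temperature` value, and the toy embeddings are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm > 0 else 0.0

def similarity_matrix(audio, video):
    """sim[i][j] = cosine similarity of audio clip i and video clip j."""
    return [[cosine(a, v) for v in video] for a in audio]

def positive_mask(sim, threshold=0.8):
    """Assumed rule: the temporally aligned pair (i == j) is always positive;
    any other pair is positive when its similarity reaches the threshold."""
    return [[(i == j) or (s >= threshold) for j, s in enumerate(row)]
            for i, row in enumerate(sim)]

def multi_positive_loss(sim, mask, temperature=0.1):
    """Generic multi-positive contrastive loss (audio-to-video direction):
    for each anchor, average the negative log-softmax over its positives."""
    total = 0.0
    for row, row_mask in zip(sim, mask):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        pos = [l - log_denom for l, is_pos in zip(logits, row_mask) if is_pos]
        total += -sum(pos) / len(pos)
    return total / len(sim)

# Toy data: clips 0 and 2 show the same recurring event at different times.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.1]]
video = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]

sim = similarity_matrix(audio, video)
mask = positive_mask(sim)
loss = multi_positive_loss(sim, mask)
```

With purely temporal pairing, the pairs (0, 2) and (2, 0) would be treated as negatives even though they depict the same event; in this sketch the similarity threshold marks them as positives instead. In practice the embeddings would come from trained audio and video encoders, and the threshold would need tuning on the data.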

Source journal

Frontiers in Robotics and AI (Robotics)

CiteScore: 6.50

Self-citation rate: 5.90%

Articles published: 355

Review time: 14 weeks

Journal description: Frontiers in Robotics and AI publishes rigorously peer-reviewed research covering all theory and applications of robotics, technology, and artificial intelligence, from biomedical to space robotics.