基于频率动态卷积和局部注意的环境声事件检测

IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS Information (Switzerland) Pub Date : 2023-09-30 DOI:10.3390/info14100534

Grigorios-Aris Cheimariotis, Nikolaos Mitianoudis

{"title":"基于频率动态卷积和局部注意的环境声事件检测","authors":"Grigorios-Aris Cheimariotis, Nikolaos Mitianoudis","doi":"10.3390/info14100534","DOIUrl":null,"url":null,"abstract":"This work describes a methodology for sound event detection in domestic environments. Efficient solutions in this task can support the autonomous living of the elderly. The methodology deals with the “Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE)” 2023, and more specifically with Task 4a “Sound event detection of domestic activities”. This task involves the detection of 10 common events in domestic environments in 10 s sound clips. The events may have arbitrary duration in the 10 s clip. The main components of the methodology are data augmentation on mel-spectrograms that represent the sound clips, feature extraction by passing spectrograms through a frequency-dynamic convolution network with an extra attention module in sequence with each convolution, concatenation of these features with BEATs embeddings, and use of BiGRU for sequence modeling. Also, a mean teacher model is employed for leveraging unlabeled data. This research focuses on the effect of data augmentation techniques, of the feature extraction models, and on self-supervised learning. The main contribution is the proposed feature extraction model, which uses weighted attention on frequency in each convolution, combined in sequence with a local attention module adopted by computer vision. The proposed system features promising and robust performance.","PeriodicalId":38479,"journal":{"name":"Information (Switzerland)","volume":null,"pages":null},"PeriodicalIF":2.4000,"publicationDate":"2023-09-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Sound Event Detection in Domestic Environment Using Frequency-Dynamic Convolution and Local Attention\",\"authors\":\"Grigorios-Aris Cheimariotis, Nikolaos Mitianoudis\",\"doi\":\"10.3390/info14100534\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This work describes a methodology for sound event detection in domestic environments. Efficient solutions in this task can support the autonomous living of the elderly. The methodology deals with the “Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE)” 2023, and more specifically with Task 4a “Sound event detection of domestic activities”. This task involves the detection of 10 common events in domestic environments in 10 s sound clips. The events may have arbitrary duration in the 10 s clip. The main components of the methodology are data augmentation on mel-spectrograms that represent the sound clips, feature extraction by passing spectrograms through a frequency-dynamic convolution network with an extra attention module in sequence with each convolution, concatenation of these features with BEATs embeddings, and use of BiGRU for sequence modeling. Also, a mean teacher model is employed for leveraging unlabeled data. This research focuses on the effect of data augmentation techniques, of the feature extraction models, and on self-supervised learning. The main contribution is the proposed feature extraction model, which uses weighted attention on frequency in each convolution, combined in sequence with a local attention module adopted by computer vision. The proposed system features promising and robust performance.\",\"PeriodicalId\":38479,\"journal\":{\"name\":\"Information (Switzerland)\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2023-09-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information (Switzerland)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3390/info14100534\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information (Switzerland)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3390/info14100534","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

这项工作描述了一种在家庭环境中检测声音事件的方法。这项任务的有效解决方案可以支持老年人的自主生活。该方法处理2023年的“声学场景和事件的检测和分类挑战(DCASE)”，更具体地说，是任务4a“家庭活动的声音事件检测”。这项任务包括在10秒的声音片段中检测10个家庭环境中常见的事件。事件可以在10秒剪辑中具有任意的持续时间。该方法的主要组成部分是对代表声音片段的mel-谱图进行数据增强，通过频率动态卷积网络传递谱图(每个卷积都有一个额外的注意模块)来提取特征，将这些特征与BEATs嵌入连接起来，并使用BiGRU进行序列建模。此外，平均教师模型被用于利用未标记的数据。本研究的重点是数据增强技术、特征提取模型和自监督学习的效果。本文的主要贡献是提出的特征提取模型，该模型在每个卷积中对频率进行加权关注，并依次与计算机视觉采用的局部关注模块相结合。该系统具有良好的鲁棒性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Sound Event Detection in Domestic Environment Using Frequency-Dynamic Convolution and Local Attention

This work describes a methodology for sound event detection in domestic environments. Efficient solutions in this task can support the autonomous living of the elderly. The methodology deals with the “Challenge on Detection and Classification of Acoustic Scenes and Events (DCASE)” 2023, and more specifically with Task 4a “Sound event detection of domestic activities”. This task involves the detection of 10 common events in domestic environments in 10 s sound clips. The events may have arbitrary duration in the 10 s clip. The main components of the methodology are data augmentation on mel-spectrograms that represent the sound clips, feature extraction by passing spectrograms through a frequency-dynamic convolution network with an extra attention module in sequence with each convolution, concatenation of these features with BEATs embeddings, and use of BiGRU for sequence modeling. Also, a mean teacher model is employed for leveraging unlabeled data. This research focuses on the effect of data augmentation techniques, of the feature extraction models, and on self-supervised learning. The main contribution is the proposed feature extraction model, which uses weighted attention on frequency in each convolution, combined in sequence with a local attention module adopted by computer vision. The proposed system features promising and robust performance.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊