AudioMask: Robust Sound Event Detection Using Mask R-CNN and Frame-Level Classifier

Alireza Nasiri, Yuxin Cui, Zhonghao Liu, Jing Jin, Yong Zhao, Jianjun Hu
{"title":"AudioMask: Robust Sound Event Detection Using Mask R-CNN and Frame-Level Classifier","authors":"Alireza Nasiri, Yuxin Cui, Zhonghao Liu, Jing Jin, Yong Zhao, Jianjun Hu","doi":"10.1109/ICTAI.2019.00074","DOIUrl":null,"url":null,"abstract":"Deep learning methods have recently made significant contributions to sound event detection. These methods either use a block-level approach to distinguish parts of audio containing the event, or analyze the small frames of the audio separately. In this paper, we introduce a new method, AudioMask, for rare sound event detection by combining these two approaches. AudioMask first applies Mask R-CNN, a state-of-the-art algorithm for detecting objects in images, to the log mel-spectrogram of the audio files. Mask R-CNN detects audio segments that might contain the target event by generating bounding boxes around them in time-frequency domain. Then we use a frame-based audio event classifier trained independently from Mask R-CNN, to analyze each individual frame in the candidate segments proposed by Mask R-CNN. A post-processing step combines the outputs of the Mask R-CNN and the frame-level classifier to identify the true events. By evaluating AudioMask over the data sets from 2017 Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge Task 2, We show that our algorithm performs better than the baseline models by 13.3% in the average F-score and achieves better results compared to the other non-ensemble methods in the challenge.","PeriodicalId":346657,"journal":{"name":"2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI)","volume":"38 8 Pt 1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 IEEE 31st International Conference on Tools with Artificial Intelligence (ICTAI)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICTAI.2019.00074","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Deep learning methods have recently made significant contributions to sound event detection. These methods either use a block-level approach to distinguish the parts of audio containing the event, or analyze small frames of the audio separately. In this paper, we introduce a new method, AudioMask, for rare sound event detection that combines these two approaches. AudioMask first applies Mask R-CNN, a state-of-the-art algorithm for detecting objects in images, to the log mel-spectrogram of the audio files. Mask R-CNN detects audio segments that might contain the target event by generating bounding boxes around them in the time-frequency domain. Then we use a frame-based audio event classifier, trained independently of Mask R-CNN, to analyze each individual frame in the candidate segments proposed by Mask R-CNN. A post-processing step combines the outputs of Mask R-CNN and the frame-level classifier to identify the true events. By evaluating AudioMask on the data sets from the 2017 Detection and Classification of Acoustic Scenes and Events (DCASE) Challenge Task 2, we show that our algorithm outperforms the baseline models by 13.3% in average F-score and achieves better results than the other non-ensemble methods in the challenge.
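The abstract describes the pipeline only at a high level. As an illustration of how the pieces could fit together, the sketch below computes a log mel-spectrogram with librosa and then combines segment proposals with frame-level scores: frames are kept only if they fall inside a proposed segment and their frame-level event probability exceeds a threshold, and surviving frames are merged back into (onset, offset) events. This is not the authors' code; the `proposals` list (standing in for Mask R-CNN output), the `frame_probs` array (standing in for the frame classifier's output), and the 0.5 threshold are hypothetical placeholders, and the paper's actual post-processing rules may differ.

```python
# Illustrative sketch only -- not the AudioMask implementation.
# Assumes librosa for feature extraction; `proposals` and `frame_probs`
# are hypothetical stand-ins for the Mask R-CNN and frame-classifier outputs.
import numpy as np
import librosa


def log_mel_spectrogram(path, sr=44100, n_mels=64, hop_length=512):
    """Log mel-spectrogram: the time-frequency representation both models consume."""
    y, sr = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels,
                                         hop_length=hop_length)
    return librosa.power_to_db(mel), sr, hop_length


def combine(proposals, frame_probs, hop_length, sr, threshold=0.5):
    """Keep frames that lie inside a proposed segment AND whose frame-level
    probability exceeds the threshold, then merge consecutive kept frames
    into (onset, offset) events in seconds."""
    n_frames = len(frame_probs)
    keep = np.zeros(n_frames, dtype=bool)
    for start_s, end_s in proposals:  # proposals given in seconds
        start_f = int(start_s * sr / hop_length)
        end_f = min(int(end_s * sr / hop_length) + 1, n_frames)
        keep[start_f:end_f] = frame_probs[start_f:end_f] > threshold

    # Merge runs of kept frames into events.
    events, onset = [], None
    for i, kept in enumerate(keep):
        if kept and onset is None:
            onset = i
        elif not kept and onset is not None:
            events.append((onset * hop_length / sr, i * hop_length / sr))
            onset = None
    if onset is not None:
        events.append((onset * hop_length / sr, n_frames * hop_length / sr))
    return events
```

For example, `combine([(1.2, 3.4)], probs, 512, 44100)` would return only the sub-intervals of the 1.2 s to 3.4 s proposal where the frame classifier is confident, which mirrors the idea of letting the frame-level model refine the block-level proposals.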