Pub Date: 2024-08-15 DOI: 10.1109/TASLP.2024.3444490
Sara Barahona;Diego de Benito-Gorrón;Doroteo T. Toledano;Daniel Ramos
Over the last few years, Conformer-based systems have achieved state-of-the-art results in most audio processing tasks that employ Deep Learning techniques. However, in sound event detection (SED), the Conformer has scarcely been used since it won the DCASE Challenge 2020 Task 4. In previous research, we found that Conformer-based systems achieved higher performance in sound event classification than other frequently employed architectures, such as Convolutional Recurrent Neural Networks (CRNNs). Given that the second scenario proposed for the Polyphonic Sound Detection Score (PSDS2) focuses on avoiding confusion between classes, in this paper we propose optimizing a Conformer-based system to maximize performance in this scenario. To this end, we performed hyperparameter tuning and incorporated the recently proposed Frequency Dynamic Convolutions (FDY) to enhance its classification properties. We also employed our previously proposed multi-resolution approach, not only to enhance performance but also to gain a deeper understanding of the Conformer architecture for SED, analyzing its advantages and disadvantages and finding possible solutions to the latter. Finally, we explored the integration of embeddings from the pre-trained model BEATs, an iterative framework for learning Bidirectional Encoder representation from Audio Transformers. Concatenating these embeddings into the input of the Conformer blocks further improved the results, achieving a PSDS2 value of 0.813 and considerably outperforming SED systems based on CRNNs.
Enhancing Conformer-Based Sound Event Detection Using Frequency Dynamic Convolutions and BEATs Audio Embeddings. IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 3896-3907, 2024.
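The abstract above describes concatenating pre-trained BEATs embeddings into the input of the Conformer blocks. The sketch below is only an illustration of that kind of feature fusion, not the authors' implementation: the function name `align_and_concat`, the nearest-neighbour time alignment, and all array shapes are assumptions chosen for the example, and random arrays stand in for real features.

```python
import numpy as np


def align_and_concat(frame_feats, pretrained_emb):
    """Fuse two feature sequences along the feature axis.

    frame_feats:    (T, D1) frame-level features (hypothetical front-end output).
    pretrained_emb: (T2, D2) embeddings at a different frame rate
                    (a stand-in for BEATs-style embeddings).
    Returns an array of shape (T, D1 + D2).
    """
    n_frames = frame_feats.shape[0]
    n_emb = pretrained_emb.shape[0]
    # Assumption: align the two time axes by nearest-neighbour index mapping.
    idx = np.minimum((np.arange(n_frames) * n_emb) // n_frames, n_emb - 1)
    aligned = pretrained_emb[idx]
    # Concatenate per frame so each time step carries both representations.
    return np.concatenate([frame_feats, aligned], axis=-1)


# Illustrative shapes only; they are not taken from the paper.
frame_feats = np.random.default_rng(0).normal(size=(156, 128))
pretrained_emb = np.random.default_rng(1).normal(size=(496, 768))
fused = align_and_concat(frame_feats, pretrained_emb)
print(fused.shape)
```

In a setup like this, the fused sequence would then be projected to the model dimension before entering the first encoder block; how the paper handles alignment and projection is not stated in the abstract.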
Pub Date: 2024-08-15 DOI: 10.1109/TASLP.2024.3444486
Vincent Lostanlen;Aurora Cramer;Justin Salamon;Andrew Farnsworth;Benjamin M. Van Doren;Steve Kelling;Juan Pablo Bello
Sound event classification has the potential to advance our understanding of bird migration. Although it has long been known that migratory species have a vocal signature of their own, previous work on automatic flight call classification has been limited in robustness and scope: e.g., covering few recording sites, short acquisition segments, and simplified biological taxonomies. In this paper, we present BirdVoxDetect (BVD), the first full-fledged solution to bird migration monitoring from acoustic sensor network data. As open-source software, BVD integrates an original pipeline of three machine learning modules. The first module is a random forest classifier of sensor faults, trained with human-in-the-loop active learning. The second module is a deep convolutional neural network for sound event detection with per-channel energy normalization (PCEN). The third module is a multitask convolutional neural network which predicts the family, genus, and species of flight calls from passerines (Passeriformes)