Hypernetworks for Sound Event Detection: a Proof-of-Concept
Shubhr Singh, Huy Phan, Emmanouil Benetos
2022 30th European Signal Processing Conference (EUSIPCO), published 2022-08-29
DOI: 10.23919/eusipco55093.2022.9909716 (https://doi.org/10.23919/eusipco55093.2022.9909716)
Citation count: 0
Abstract
Polyphonic sound event detection (SED) involves predicting the sound events present in an audio recording along with their onset and offset times. Recently, deep neural networks, specifically convolutional recurrent neural networks (CRNNs), have achieved impressive results on this task. The convolutional part of the architecture extracts translation-invariant features from the input, and the recurrent part learns the underlying temporal relationships between audio frames. Recent studies have shown that the weight-sharing paradigm of recurrent networks can be a hindrance for certain kinds of time-series data, specifically where there is a temporal conditional shift, i.e. the conditional distribution of a label changes over time. This raises a relevant question: does a similar phenomenon occur in polyphonic sound events because the polyphony level varies along the temporal axis? In this work, we explore this question and investigate whether relaxed weight sharing improves the performance of a CRNN for polyphonic SED. We propose using hypernetworks to relax weight sharing in the recurrent part and show that the CRNN's performance improves by ≈ 3% across two datasets, paving the way for further exploration of temporal conditional shift in polyphonic SED.
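The abstract does not specify how the hypernetwork is wired into the recurrent part, so the following is only an illustrative sketch of the general idea, not the authors' architecture. It assumes a simple weight-scaling scheme: a small auxiliary RNN produces a per-frame scale vector that modulates the main RNN's recurrent contribution, so the effective recurrent weights differ from frame to frame. The class name `HyperRNN` and all parameter names and sizes are hypothetical.

```python
import math
import random


def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    """Small random matrix; the initialisation scale is an arbitrary choice."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class HyperRNN:
    """Toy RNN whose recurrent term is rescaled per frame by a hypernetwork.

    Hypothetical illustration only: not the model evaluated in the paper.
    """

    def __init__(self, n_in, n_hidden, n_hyper, seed=0):
        rng = random.Random(seed)
        self.n_hidden, self.n_hyper = n_hidden, n_hyper
        # Main RNN weights: shared across all frames, as in a standard RNN.
        self.W_x = rand_matrix(n_hidden, n_in, rng)
        self.W_h = rand_matrix(n_hidden, n_hidden, rng)
        # Hypernetwork (a smaller RNN) that sees the input and the main state.
        self.U_x = rand_matrix(n_hyper, n_in, rng)
        self.U_h = rand_matrix(n_hyper, n_hidden, rng)
        self.U_z = rand_matrix(n_hyper, n_hyper, rng)
        # Projection from the hypernetwork state to one scale per hidden unit.
        self.W_d = rand_matrix(n_hidden, n_hyper, rng)

    def forward(self, xs):
        """Run over a sequence of frames; return hidden states and scales."""
        h = [0.0] * self.n_hidden
        z = [0.0] * self.n_hyper
        states, scales = [], []
        for x in xs:
            # Update the hypernetwork state from the frame and previous states.
            pre = zip(matvec(self.U_x, x), matvec(self.U_h, h), matvec(self.U_z, z))
            z = [math.tanh(a + b + c) for a, b, c in pre]
            # Frame-specific scales; d == 1 everywhere recovers a plain RNN.
            d = [1.0 + v for v in matvec(self.W_d, z)]
            # Effective recurrent weights are diag(d) @ W_h for this frame.
            rec, inp = matvec(self.W_h, h), matvec(self.W_x, x)
            h = [math.tanh(di * ri + ii) for di, ri, ii in zip(d, rec, inp)]
            states.append(h)
            scales.append(d)
        return states, scales


# Usage: five frames of four (made-up) features each.
model = HyperRNN(n_in=4, n_hidden=6, n_hyper=3, seed=1)
frames = [[0.1 * (t + 1) * (k + 1) for k in range(4)] for t in range(5)]
states, scales = model.forward(frames)
print(scales[0])
print(scales[1])
```

In this sketch the main weights `W_h` are still shared, but the scale vector `d` changes with each frame, which is one simple way to relax weight sharing. Setting `W_d` to zeros makes every scale equal to 1 and reduces the model to an ordinary RNN.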