{"title":"SoundDoA:从声音原始波形中学习声源到达方向和语义","authors":"Yuhang He, A. Markham","doi":"10.21437/interspeech.2022-378","DOIUrl":null,"url":null,"abstract":"A fundamental task for an agent to understand an environment acoustically is to detect sound source location (like direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase difference; lastly, although the number of microphone sensors are sparse, recorded sound waveform is temporally dense due to the high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic feature such as GCC-PHAT and Mel-spectrograms so as to benefit from the success of mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA , that is capable of learning sound source DoA and semantics directly from sound raw waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA relevant features into a compact representation. A backbone network consisting of two identical sub-networks with layerwise communication strategy is then proposed to further learn semantic label and DoA both separately and jointly. Finally, a permutation invariant multi-track head is added to regress DoA and classify semantic label. Extensive experimental results on DCASE 2020 sound event detection and localization dataset (SELD) demonstrate the superiority of SoundDoA , when comparing with other existing methods.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2408-2412"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms\",\"authors\":\"Yuhang He, A. Markham\",\"doi\":\"10.21437/interspeech.2022-378\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A fundamental task for an agent to understand an environment acoustically is to detect sound source location (like direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase difference; lastly, although the number of microphone sensors are sparse, recorded sound waveform is temporally dense due to the high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic feature such as GCC-PHAT and Mel-spectrograms so as to benefit from the success of mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA , that is capable of learning sound source DoA and semantics directly from sound raw waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA relevant features into a compact representation. A backbone network consisting of two identical sub-networks with layerwise communication strategy is then proposed to further learn semantic label and DoA both separately and jointly. Finally, a permutation invariant multi-track head is added to regress DoA and classify semantic label. Extensive experimental results on DCASE 2020 sound event detection and localization dataset (SELD) demonstrate the superiority of SoundDoA , when comparing with other existing methods.\",\"PeriodicalId\":73500,\"journal\":{\"name\":\"Interspeech\",\"volume\":\"1 1\",\"pages\":\"2408-2412\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Interspeech\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21437/interspeech.2022-378\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Interspeech","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21437/interspeech.2022-378","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
SoundDoA: Learn Sound Source Direction of Arrival and Semantics from Sound Raw Waveforms
A fundamental task for an agent to understand an environment acoustically is to detect sound source location (like direction of arrival (DoA)) and semantic label. It is a challenging task: firstly, sound sources overlap in time, frequency and space; secondly, while semantics are largely conveyed through time-frequency energy (amplitude) contours, DoA is encoded in inter-channel phase difference; lastly, although the number of microphone sensors are sparse, recorded sound waveform is temporally dense due to the high sampling rates. Existing methods for predicting DoA mostly depend on pre-extracted 2D acoustic feature such as GCC-PHAT and Mel-spectrograms so as to benefit from the success of mature 2D image based deep neural networks. We instead propose a novel end-to-end trainable framework, named SoundDoA , that is capable of learning sound source DoA and semantics directly from sound raw waveforms. We first use a learnable front-end filter bank to dynamically encode sound source semantics and DoA relevant features into a compact representation. A backbone network consisting of two identical sub-networks with layerwise communication strategy is then proposed to further learn semantic label and DoA both separately and jointly. Finally, a permutation invariant multi-track head is added to regress DoA and classify semantic label. Extensive experimental results on DCASE 2020 sound event detection and localization dataset (SELD) demonstrate the superiority of SoundDoA , when comparing with other existing methods.