Signal processing and machine learning for speech and audio in acoustic sensor networks
Walter Kellermann (Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany), Rainer Martin (Ruhr-Universität Bochum, Bochum, Germany), Nobutaka Ono (Tokyo Metropolitan University, Hino-shi, Japan)
EURASIP Journal on Audio, Speech, and Music Processing 2023, 54 (2023). Published 17 December 2023. https://doi.org/10.1186/s13636-023-00322-6
Corresponding author: Walter Kellermann. Open Access under a Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).
Nowadays, we are surrounded by a plethora of recording devices, including mobile phones, laptops, tablets, smartwatches, and camcorders. However, conventional multichannel signal processing methods usually cannot be applied to jointly process the signals recorded by multiple distributed devices, because these methods require synchronous recording. Thus, commercially available microphone array processing is currently limited to single devices on which all microphones are mounted. Fully exploiting the spatial diversity offered by multiple audio devices without requiring wired networking is a major challenge, and its potential practical and commercial benefits have prompted significant research efforts over the past decade.
Wireless acoustic sensor networks (WASNs) have become a new paradigm of acoustic sensing that overcomes the limitations of individual devices. By providing wireless communication between microphone nodes and by addressing the new challenges of asynchronous channels, unknown microphone positions, and distributed computing, a WASN allows many recording devices to be distributed in space. The nodes can cover a wider area and together form an extended microphone array. This promises to significantly improve the performance of various audio tasks such as speech enhancement, speech recognition, diarization, scene analysis, and anomalous acoustic event detection.
For this special issue, six papers were accepted, all of which address the above-mentioned fundamental challenges of using WASNs. First, the question of which sensors should be used for a specific signal processing task or for the extraction of a target source is addressed by the papers of Guenther et al. and Kindt et al. Given a set of sensors, a method for synchronizing them at the waveform level in dynamic scenarios is presented by Chinaev et al., and a localization method using both sensor signals and higher-level environmental information is discussed by Grinstein et al. Finally, robust speaker counting and source separation are addressed by Hsu and Bai, and the task of removing a specific interference from a single sensor signal is tackled by Kawamura et al.
The paper ‘Microphone utility estimation in acoustic sensor networks using single-channel signal features’ by Guenther et al. proposes a method to assess the utility of individual sensors of a WASN for coherence-based signal processing, e.g., beamforming or blind source separation, by using appropriate single-channel signal features as proxies for waveforms. This avoids the need to transmit waveforms in order to identify suitable sensors for a synchronized cluster, and the required amount of transmitted data can be reduced by several orders of magnitude. It is shown that both estimation-theoretic processing of single-channel features and deep learning-based identification of such features lead to measures of coherence in the feature space that reflect the suitability of distributed sensors for coherent processing.
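As a rough illustration of comparing sensors through compact single-channel features instead of waveforms, the sketch below reduces each signal to one value per frame and correlates the resulting feature sequences. The choice of frame-wise log energy as the feature and Pearson correlation as the similarity measure is an assumption made here for illustration; it is a stand-in, not the features or estimators used by Guenther et al. The function names and the synthetic test signal are likewise hypothetical.

```python
import math


def frame_log_energy(signal, frame_len):
    """Reduce a waveform to one log-energy value per non-overlapping frame."""
    feats = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        frame = signal[start:start + frame_len]
        energy = sum(x * x for x in frame) / frame_len
        feats.append(math.log10(energy + 1e-12))  # small floor avoids log(0)
    return feats


def pearson(a, b):
    """Pearson correlation of two equally long feature sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


# Hypothetical example: an amplitude-modulated tone observed at two nodes,
# the second one attenuated by a factor of two.
source = [(1 + 0.8 * math.sin(2 * math.pi * n / 4000)) * math.sin(2 * math.pi * 0.05 * n)
          for n in range(16000)]
node_a = source
node_b = [0.5 * x for x in source]

feat_a = frame_log_energy(node_a, frame_len=400)
feat_b = frame_log_energy(node_b, frame_len=400)
print(len(source), "samples reduced to", len(feat_a), "feature values per node")
print("feature-domain similarity:", pearson(feat_a, feat_b))
```

In this toy setup each node transmits one number per 400-sample frame, so the reduction in transmitted data equals the frame length. Larger savings would require coarser or more compact features.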
In the paper ‘Robustness of ad hoc microphone clustering using speaker embeddings: Evaluation under realistic and challenging scenarios’, Kindt et al. study the robustness of speaker embeddings learned from multiple microphone signals as a feature for identifying useful clusters for extracting a speech signal. Several key aspects are considered: the dependency on the distance metrics used for clustering; the observation interval required for establishing robust clusters, which determines the stationarity requirements for the acoustic scenario; and the performance for increasingly challenging acoustic scenarios and multiple speakers. For evaluation, a source separation task in realistic noisy and reverberant environments is investigated using several separation techniques applied to the resulting clusters. The proposed speaker embeddings are also compared to established MFCC-based features with respect to multiple state-of-the-art criteria for signal enhancement.
The paper ‘Online distributed waveform-synchronization for acoustic sensor networks with dynamic topology’ by Chinaev et al. is dedicated to network-wide sample-level synchronization. It relies on a previously published acoustic waveform-based method for estimating and compensating the sampling rate offset between pairs of nodes. Assuming that the WASN is organized as a directed minimum spanning tree (MST), the proposed network-wide synchronization scheme propagates from a root node over the entire network. Additionally, a network protocol is proposed that ensures synchronization even if the network topology changes, e.g., because of node failure, broken transmission links, or newly appearing nodes. The efficacy of the method is demonstrated in a simulated dynamic acoustic scenario in an apartment with several rooms.
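The idea of propagating synchronization from a root node over a spanning tree can be sketched as follows. This is a minimal illustration under stated assumptions, not the protocol of Chinaev et al.: the link costs, the node names, and the pairwise sampling rate offsets (in parts per million, child relative to parent) are hypothetical, and the pairwise offsets are taken as already estimated. Prim's algorithm builds the tree, and the per-link offsets are then composed along the path from the root.

```python
import heapq


def minimum_spanning_tree(links, root):
    """Prim's algorithm. Returns {child: parent}, oriented away from the root."""
    adjacency = {}
    for (a, b), cost in links.items():
        adjacency.setdefault(a, []).append((cost, b))
        adjacency.setdefault(b, []).append((cost, a))
    parent = {}
    visited = {root}
    frontier = [(cost, root, nb) for cost, nb in adjacency[root]]
    heapq.heapify(frontier)
    while frontier:
        cost, src, dst = heapq.heappop(frontier)
        if dst in visited:
            continue
        visited.add(dst)
        parent[dst] = src
        for next_cost, nb in adjacency[dst]:
            if nb not in visited:
                heapq.heappush(frontier, (next_cost, dst, nb))
    return parent


def offsets_relative_to_root(parent, pairwise_sro_ppm, root):
    """Compose per-link offsets (child relative to parent) down the tree."""
    total = {root: 0.0}

    def resolve(node):
        if node not in total:
            up = resolve(parent[node])
            link = pairwise_sro_ppm[(parent[node], node)]
            # Rate ratios multiply; for small offsets this is close to a sum.
            total[node] = ((1 + up * 1e-6) * (1 + link * 1e-6) - 1) * 1e6
        return total[node]

    for node in parent:
        resolve(node)
    return total


# Hypothetical four-node network with symmetric link costs.
links = {("A", "B"): 1.0, ("B", "C"): 2.0, ("A", "C"): 5.0, ("C", "D"): 1.5}
tree = minimum_spanning_tree(links, root="A")

# Hypothetical pairwise offsets in ppm, keyed by (parent, child).
pairwise = {("A", "B"): 20.0, ("B", "C"): -5.0, ("C", "D"): 10.0}
print(tree)
print(offsets_relative_to_root(tree, pairwise, root="A"))
```

If a node or link fails, the tree can be recomputed from the remaining links and the offsets composed again. A dedicated protocol such as the one proposed in the paper would handle such changes online rather than by recomputation.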
In their paper ‘Dual input neural networks for positional sound source localization’, Grinstein et al. combine multiple microphone signals from a distributed microphone array with information describing the acoustic properties of the scene for improved sound source localization. This information includes the positions of the microphones, the room size, and the reverberation time. They present a Dual Input Neural Network (DI-NN) as a straightforward and efficient technique for constructing a neural network capable of processing two distinct data types. The DI-NN is tested in different scenarios and compared to alternative models such as a traditional least-squares method and a convolutional recurrent neural network. Although the proposed DI-NN is not retrained for each new scenario, the authors’ results demonstrate its superiority, with a substantial reduction in localization errors on synthetic data and on a data set with real recordings.
The paper ‘Learning-based robust speaker counting and separation with the aid of spatial coherence’ by Hsu and Bai tackles speaker counting and speaker separation in noisy and reverberant environments. The authors combine traditional and learning-based methods to enhance these tasks and to achieve robustness to unseen room impulse responses (RIRs) and array configurations. They formulate a three-stage approach that entails the computation of a spatial coherence matrix (SCM) based on whitened relative transfer functions (wRTFs) as a spatial signature of directional sources. They evaluate the SCM and local coherence functions to detect the activity of the target speaker. Then, the eigenvalues of the SCM and the maximum similarity of inter-frame global activity distributions between two speakers are fed into a network for speaker counting (SCnet). To extract each independent speaker signal, a global and local activity-driven network (GLADnet) is employed. The authors demonstrate the benefits of the proposed approach on a data set of real meeting recordings.
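To make the idea of a spatial signature more concrete, the following Python sketch computes phase-only (whitened) relative transfer functions for a single frequency bin and a frame-by-frame coherence matrix from them. It is a strongly simplified illustration under assumed definitions: the function names, the single-bin restriction, and the exact normalization are hypothetical, and the eigenvalue analysis and the networks used by Hsu and Bai are not reproduced.

```python
import cmath


def whitened_rtf(mic_spectra, ref=0):
    """Phase-only relative transfer function for one time-frequency bin.

    mic_spectra holds one complex STFT value per microphone; the magnitude
    is discarded so that only inter-microphone phase differences remain.
    """
    reference = mic_spectra[ref]
    rtf = []
    for value in mic_spectra:
        ratio = value * reference.conjugate()
        magnitude = abs(ratio)
        rtf.append(ratio / magnitude if magnitude > 0 else 0j)
    return rtf


def spatial_coherence_matrix(frames):
    """Frame-by-frame coherence from the whitened signatures.

    frames is a list of per-frame microphone spectra (one frequency bin).
    Entries near 1 indicate frames dominated by the same source direction.
    """
    signatures = [whitened_rtf(frame) for frame in frames]
    num_mics = len(frames[0])
    return [
        [
            sum(a * b.conjugate() for a, b in zip(sig_i, sig_j)).real / num_mics
            for sig_j in signatures
        ]
        for sig_i in signatures
    ]
```

Frames that share a propagation path yield a coherence close to one regardless of the source signal itself, which is why such a matrix can serve as a cue for speaker activity.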
The last paper, ‘Acoustic object canceller: removing a known signal from monaural recording using blind synchronization’ by Kawamura et al., addresses the problem of removing undesired interference from an individual microphone signal when a reference signal for the interference is available. The authors propose a method that treats the interference as an acoustic object whose signal is linearly filtered before arriving at the receiving microphone. Assuming that the signals of the acoustic object and the microphone exhibit different sampling rates, the signals are first synchronized. The frequency response of the propagation path from the object to the microphone is then determined via maximum-likelihood estimation using the majorization-minimization algorithm. Various statistical models for the desired signal, which should be preserved, are investigated and evaluated.
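The cancellation step can be sketched for a single frequency bin as follows. The sketch assumes that the reference and the recording are already synchronized and swaps in a plain least-squares estimate of the frequency response, which coincides with the maximum-likelihood estimate only under a stationary Gaussian model of the desired signal. It is not the majorization-minimization procedure of Kawamura et al., and the function names are hypothetical.

```python
def estimate_frequency_response(mixture_bins, reference_bins):
    """Least-squares estimate of the path response in one frequency bin.

    mixture_bins and reference_bins are complex STFT values over time
    frames for the recording and the known interference, respectively.
    """
    numerator = sum(
        x * s.conjugate() for x, s in zip(mixture_bins, reference_bins)
    )
    denominator = sum(abs(s) ** 2 for s in reference_bins)
    return numerator / denominator if denominator > 0 else 0j


def cancel_reference(mixture_bins, reference_bins):
    """Subtract the filtered reference and return the residual signal."""
    response = estimate_frequency_response(mixture_bins, reference_bins)
    return [x - response * s for x, s in zip(mixture_bins, reference_bins)]
```

The estimate is exact only when the desired signal is uncorrelated with the reference over the observed frames; heavier-tailed models of the desired signal would call for an iterative scheme instead.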
We would like to thank all authors for their excellent contributions to this special issue and hope that this collection will be a useful resource for research in WASNs in the years to come.
Corresponding author
Correspondence to Walter Kellermann.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
Cite this article
Kellermann, W., Martin, R. & Ono, N. Signal processing and machine learning for speech and audio in acoustic sensor networks. J AUDIO SPEECH MUSIC PROC. 2023, 54 (2023). https://doi.org/10.1186/s13636-023-00322-6
DOI: https://doi.org/10.1186/s13636-023-00322-6
About the journal:
The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.