Scalable-Complexity Steered Response Power Based on Low-Rank and Sparse Interpolation

IF 5.2 2区计算机科学 Q1 ACOUSTICS IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-11-11 DOI:10.1109/TASLP.2024.3496317

Thomas Dietzen;Enzo De Sena;Toon van Waterschoot

{"title":"Scalable-Complexity Steered Response Power Based on Low-Rank and Sparse Interpolation","authors":"Thomas Dietzen;Enzo De Sena;Toon van Waterschoot","doi":"10.1109/TASLP.2024.3496317","DOIUrl":null,"url":null,"abstract":"The steered response power (SRP) is a popular approach to compute a map of the acoustic scene, typically used for acoustic source localization. The SRP map is obtained as the frequency-weighted output power of a beamformer steered towards a grid of candidate locations. Due to the exhaustive search over a fine grid at all frequency bins, conventional frequency domain-based SRP (conv. FD-SRP) results in a high computational complexity. Time domain-based SRP (conv. TD-SRP) implementations reduce computational complexity at the cost of accuracy using the inverse fast Fourier transform (iFFT). In this paper, to enable a more favourable complexity-performance trade-off as compared to conv. FD-SRP and conv. TD-SRP, we consider the problem of constructing a fine SRP map over the entire search space at scalable computational cost. We propose two approaches to this problem. Expressing the conv. FD-SRP map as a matrix transform of frequency-domain GCCs, we decompose the SRP matrix into a sampling matrix and an interpolation matrix. While sampling can be implemented by the iFFT, we propose to use optimal low-rank or sparse approximations of the interpolation matrix for complexity reduction. The proposed approaches, refered to as sampling + low-rank interpolation-based SRP (SLRI-SRP) and sampling + sparse interpolation-based SRP (SSPI-SRP), are evaluated in various localization scenarios with speech as source signals and compared to the state-of-the-art. The results indicate that SSPI-SRP performs better if large array apertures are used, while SLRI-SRP performs better at small array apertures or a large number of microphones. In comparison to conv. FD-SRP, two to three orders of magnitude of complexity reduction can achieved, often times enabling a more favourable complexity-performance trade-off as compared to conv. TD-SRP. A MATLAB implementation is available online.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"5024-5039"},"PeriodicalIF":5.2000,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10750284/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

The steered response power (SRP) is a popular approach to compute a map of the acoustic scene, typically used for acoustic source localization. The SRP map is obtained as the frequency-weighted output power of a beamformer steered towards a grid of candidate locations. Due to the exhaustive search over a fine grid at all frequency bins, conventional frequency domain-based SRP (conv. FD-SRP) results in a high computational complexity. Time domain-based SRP (conv. TD-SRP) implementations reduce computational complexity at the cost of accuracy using the inverse fast Fourier transform (iFFT). In this paper, to enable a more favourable complexity-performance trade-off as compared to conv. FD-SRP and conv. TD-SRP, we consider the problem of constructing a fine SRP map over the entire search space at scalable computational cost. We propose two approaches to this problem. Expressing the conv. FD-SRP map as a matrix transform of frequency-domain GCCs, we decompose the SRP matrix into a sampling matrix and an interpolation matrix. While sampling can be implemented by the iFFT, we propose to use optimal low-rank or sparse approximations of the interpolation matrix for complexity reduction. The proposed approaches, refered to as sampling + low-rank interpolation-based SRP (SLRI-SRP) and sampling + sparse interpolation-based SRP (SSPI-SRP), are evaluated in various localization scenarios with speech as source signals and compared to the state-of-the-art. The results indicate that SSPI-SRP performs better if large array apertures are used, while SLRI-SRP performs better at small array apertures or a large number of microphones. In comparison to conv. FD-SRP, two to three orders of magnitude of complexity reduction can achieved, often times enabling a more favourable complexity-performance trade-off as compared to conv. TD-SRP. A MATLAB implementation is available online.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

基于低库和稀疏插值的可扩展复杂度转向响应功率

转向响应功率（SRP）是计算声学场景地图的一种流行方法，通常用于声源定位。SRP 地图是波束成形器转向候选位置网格的频率加权输出功率。由于要在所有频带的细网格上进行穷举搜索，传统的基于频域的 SRP（conv. FD-SRP）计算复杂度很高。基于时域的 SRP（conv. TD-SRP）实现方法利用反快速傅立叶变换（iFFT）降低了计算复杂度，但却牺牲了精度。与 conv.FD-SRP 和 conv.TD-SRP 相比，我们考虑的问题是如何以可扩展的计算成本在整个搜索空间构建精细的 SRP 地图。我们针对这一问题提出了两种方法。将 conv.FD-SRP 地图表示为频域 GCC 的矩阵变换，我们将 SRP 矩阵分解为采样矩阵和插值矩阵。采样可以通过 iFFT 实现，而我们建议使用插值矩阵的最优低阶或稀疏近似来降低复杂性。我们在以语音为源信号的各种定位场景中评估了所提出的方法，分别称为基于采样 + 低秩插值的 SRP（SLRI-SRP）和基于采样 + 稀疏插值的 SRP（SSPI-SRP），并与最先进的方法进行了比较。结果表明，如果使用大的阵列孔径，SSPI-SRP 的性能更好，而 SLRI-SRP 在使用小的阵列孔径或大量麦克风时性能更好。与 conv.FD-SRP相比，复杂度可降低两到三个数量级，因此，在复杂度-性能权衡方面，TD-SRP往往更胜一筹。MATLAB 实现可在线获取。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

11.30

自引率

11.10%

发文量

217

期刊介绍： The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.

期刊最新文献

List of Reviewers IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation Online Neural Speaker Diarization With Target Speaker Tracking Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach