Exploring the power of pure attention mechanisms in blind room parameter estimation

IF 1.7 3区计算机科学 Q2 ACOUSTICS Eurasip Journal on Audio Speech and Music Processing Pub Date : 2024-04-24 DOI:10.1186/s13636-024-00344-8

Chunxi Wang, Maoshen Jia, Meiran Li, Changchun Bao, Wenyu Jin

{"title":"Exploring the power of pure attention mechanisms in blind room parameter estimation","authors":"Chunxi Wang, Maoshen Jia, Meiran Li, Changchun Bao, Wenyu Jin","doi":"10.1186/s13636-024-00344-8","DOIUrl":null,"url":null,"abstract":"Dynamic parameterization of acoustic environments has drawn widespread attention in the field of audio processing. Precise representation of local room acoustic characteristics is crucial when designing audio filters for various audio rendering applications. Key parameters in this context include reverberation time (RT $$_{60}$$ ) and geometric room volume. In recent years, neural networks have been extensively applied in the task of blind room parameter estimation. However, there remains a question of whether pure attention mechanisms can achieve superior performance in this task. To address this issue, this study employs blind room parameter estimation based on monaural noisy speech signals. Various model architectures are investigated, including a proposed attention-based model. This model is a convolution-free Audio Spectrogram Transformer, utilizing patch splitting, attention mechanisms, and cross-modality transfer learning from a pretrained Vision Transformer. Experimental results suggest that the proposed attention mechanism-based model, relying purely on attention mechanisms without using convolution, exhibits significantly improved performance across various room parameter estimation tasks, especially with the help of dedicated pretraining and data augmentation schemes. Additionally, the model demonstrates more advantageous adaptability and robustness when handling variable-length audio inputs compared to existing methods.","PeriodicalId":49202,"journal":{"name":"Eurasip Journal on Audio Speech and Music Processing","volume":"10 1","pages":""},"PeriodicalIF":1.7000,"publicationDate":"2024-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Eurasip Journal on Audio Speech and Music Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1186/s13636-024-00344-8","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

Dynamic parameterization of acoustic environments has drawn widespread attention in the field of audio processing. Precise representation of local room acoustic characteristics is crucial when designing audio filters for various audio rendering applications. Key parameters in this context include reverberation time (RT $$_{60}$$ ) and geometric room volume. In recent years, neural networks have been extensively applied in the task of blind room parameter estimation. However, there remains a question of whether pure attention mechanisms can achieve superior performance in this task. To address this issue, this study employs blind room parameter estimation based on monaural noisy speech signals. Various model architectures are investigated, including a proposed attention-based model. This model is a convolution-free Audio Spectrogram Transformer, utilizing patch splitting, attention mechanisms, and cross-modality transfer learning from a pretrained Vision Transformer. Experimental results suggest that the proposed attention mechanism-based model, relying purely on attention mechanisms without using convolution, exhibits significantly improved performance across various room parameter estimation tasks, especially with the help of dedicated pretraining and data augmentation schemes. Additionally, the model demonstrates more advantageous adaptability and robustness when handling variable-length audio inputs compared to existing methods.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

探索纯注意力机制在盲室参数估计中的威力

声学环境的动态参数化已引起音频处理领域的广泛关注。在为各种音频渲染应用设计音频滤波器时，精确呈现房间的局部声学特性至关重要。其中的关键参数包括混响时间（RT $$_{60}$ ）和房间几何容积。近年来，神经网络已被广泛应用于盲室参数估计任务中。然而，在这项任务中，纯粹的注意力机制是否能取得优异的性能仍是一个问题。为了解决这个问题，本研究采用了基于单耳噪声语音信号的盲室参数估计。研究了各种模型架构，包括一个基于注意力的模型。该模型是一个无卷积的音频频谱图变换器，利用了补丁分割、注意力机制和来自预训练视觉变换器的跨模态迁移学习。实验结果表明，所提出的基于注意力机制的模型纯粹依靠注意力机制而不使用卷积，在各种房间参数估计任务中表现出显著的性能提升，尤其是在专用预训练和数据增强方案的帮助下。此外，与现有方法相比，该模型在处理变长音频输入时表现出更强的适应性和鲁棒性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Eurasip Journal on Audio Speech and Music Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

4.10

自引率

4.20%

发文量

审稿时长

12 months

期刊介绍： The aim of “EURASIP Journal on Audio, Speech, and Music Processing” is to bring together researchers, scientists and engineers working on the theory and applications of the processing of various audio signals, with a specific focus on speech and music. EURASIP Journal on Audio, Speech, and Music Processing will be an interdisciplinary journal for the dissemination of all basic and applied aspects of speech communication and audio processes.