Self-Supervised Learning of Spatial Acoustic Representation With Cross-Channel Signal Reconstruction and Multi-Channel Conformer

IF 4.1 2区计算机科学 Q1 ACOUSTICS IEEE/ACM Transactions on Audio, Speech, and Language Processing Pub Date : 2024-09-10 DOI:10.1109/TASLP.2024.3458811

Bing Yang;Xiaofei Li

{"title":"Self-Supervised Learning of Spatial Acoustic Representation With Cross-Channel Signal Reconstruction and Multi-Channel Conformer","authors":"Bing Yang;Xiaofei Li","doi":"10.1109/TASLP.2024.3458811","DOIUrl":null,"url":null,"abstract":"Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem due to the mismatch between simulated and real-world acoustic characteristics and the deficiency of annotated real-world data. To this end, this work proposes a self-supervised method that takes full advantage of unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, i.e. cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask partial signals of one channel and ask the model to reconstruct them, which makes it possible to learn spatial acoustic information from unmasked signals and extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. By fine-tuning the pre-trained spatial encoder with a small annotated dataset, this encoder can be used to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder model architecture, which is suitable for both the pretext and downstream tasks. It is carefully designed to be able to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results of five acoustic parameter estimation tasks on both simulated and real-world data show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4211-4225"},"PeriodicalIF":4.1000,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10675425/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}

引用次数: 0

Abstract

Supervised learning methods have shown effectiveness in estimating spatial acoustic parameters such as time difference of arrival, direct-to-reverberant ratio and reverberation time. However, they still suffer from the simulation-to-reality generalization problem due to the mismatch between simulated and real-world acoustic characteristics and the deficiency of annotated real-world data. To this end, this work proposes a self-supervised method that takes full advantage of unlabeled data for spatial acoustic parameter estimation. First, a new pretext task, i.e. cross-channel signal reconstruction (CCSR), is designed to learn a universal spatial acoustic representation from unlabeled multi-channel microphone signals. We mask partial signals of one channel and ask the model to reconstruct them, which makes it possible to learn spatial acoustic information from unmasked signals and extract source information from the other microphone channel. An encoder-decoder structure is used to disentangle the two kinds of information. By fine-tuning the pre-trained spatial encoder with a small annotated dataset, this encoder can be used to estimate spatial acoustic parameters. Second, a novel multi-channel audio Conformer (MC-Conformer) is adopted as the encoder model architecture, which is suitable for both the pretext and downstream tasks. It is carefully designed to be able to capture the local and global characteristics of spatial acoustics exhibited in the time-frequency domain. Experimental results of five acoustic parameter estimation tasks on both simulated and real-world data show the effectiveness of the proposed method. To the best of our knowledge, this is the first self-supervised learning method in the field of spatial acoustic representation learning and multi-channel audio signal processing.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

利用跨信道信号重构和多信道适形器进行空间声学表征的自我监督学习

有监督的学习方法在估计空间声学参数（如到达时间差、直接混响比和混响时间）方面显示出了有效性。然而，由于模拟声学特征与真实世界声学特征之间的不匹配以及注释真实世界数据的不足，这些方法仍然存在模拟到现实的泛化问题。为此，本研究提出了一种自监督方法，充分利用无标注数据进行空间声学参数估计。首先，我们设计了一个新的前置任务，即跨信道信号重建（CCSR），以从未标明的多信道麦克风信号中学习通用的空间声学表示。我们屏蔽一个信道的部分信号，并要求模型对其进行重建，这样就可以从未获屏蔽的信号中学习空间声学信息，并从另一个麦克风信道中提取声源信息。编码器-解码器结构用于分离这两种信息。通过使用小型注释数据集对预先训练的空间编码器进行微调，该编码器可用于估算空间声学参数。其次，编码器模型结构采用了新颖的多通道音频变换器（MC-Conformer），它既适用于前置任务，也适用于下游任务。它经过精心设计，能够捕捉时频域空间声学的局部和全局特征。在模拟和真实世界数据上进行的五项声学参数估计任务的实验结果表明了所提方法的有效性。据我们所知，这是空间声学表示学习和多通道音频信号处理领域的第一个自监督学习方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IEEE/ACM Transactions on Audio, Speech, and Language Processing ACOUSTICS-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

11.30

自引率

11.10%

发文量

217

期刊介绍： The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.

期刊最新文献

List of Reviewers IPDnet: A Universal Direct-Path IPD Estimation Network for Sound Source Localization MO-Transformer: Extract High-Level Relationship Between Words for Neural Machine Translation Online Neural Speaker Diarization With Target Speaker Tracking Blind Audio Bandwidth Extension: A Diffusion-Based Zero-Shot Approach