IEEE/ACM Transactions on Audio, Speech, and Language Processing: Latest Articles (Page 5)

Unsupervised Speech Enhancement Using Optimal Transport and Speech Presence Probability

IF 4.1 | Zone 2, Computer Science | Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-02 DOI: 10.1109/TASLP.2024.3473318

Wenbin Jiang;Kai Yu;Fei Wen

Speech enhancement models based on deep learning are typically trained in a supervised manner, requiring a substantial amount of paired noisy-to-clean speech data for training. However, synthetically generated training data can only capture a limited range of realistic environments, and it is often challenging or even impractical to gather real-world pairs of noisy and ground-truth clean speech. To overcome this limitation, we propose an unsupervised learning approach for speech enhancement that eliminates the need for paired noisy-to-clean training data. Specifically, our method utilizes the optimal transport criterion to train the speech enhancement model in an unsupervised manner. It employs a fidelity loss based on noisy speech and a distribution divergence loss to minimize the difference between the distribution of the model's output and that of unpaired clean speech. Further, we use the speech presence probability as an additional optimization objective and incorporate the short-time Fourier transform (STFT) domain loss as an extra term for the unsupervised learning loss. We also apply the multi-resolution STFT loss as the validation loss to enhance the stability of the training process and improve the algorithm's performance. Experimental results on the VCTK + DEMAND benchmark demonstrate that the proposed method achieves competitive performance compared to the supervised methods. Furthermore, the speech recognition results on the CHiME4 benchmark show the superiority of the proposed method over its supervised counterpart.
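The distribution divergence term can be illustrated with the simplest optimal transport case. The sketch below is not the authors' loss: it computes the Wasserstein-1 distance between two equal-size sets of 1-D samples, where the optimal plan simply pairs sorted values. The function name `wasserstein_1d` and the toy numbers are hypothetical.

```python
def wasserstein_1d(xs, ys):
    """Wasserstein-1 distance between two equal-size 1-D empirical distributions."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sample sets of equal size")
    # In 1-D the optimal transport plan matches the i-th smallest x
    # to the i-th smallest y, so sorting both sides is sufficient.
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


# Toy "enhanced output" and unpaired "clean" feature values (hypothetical).
enhanced = [0.9, 0.1, 0.5, 0.3]
clean = [0.2, 0.4, 0.6, 1.0]
print(wasserstein_1d(enhanced, clean))
```

Because the two sample sets are only compared as distributions, no sample in `enhanced` needs a matching clean counterpart, which is the property an unpaired training setup relies on.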


Volume 32, pp. 4445-4455

Citations: 0

Graph-Based Cross-Granularity Message Passing on Knowledge-Intensive Text

IF 4.1 | Zone 2, Computer Science | Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-10-02 DOI: 10.1109/TASLP.2024.3473308

Chenwei Yan;Xiangling Fu;Xinxin You;Ji Wu;Xien Liu

In knowledge-intensive fields such as medicine, the text often contains numerous professional terms, specific text fragments, and multidimensional information. However, most existing text representation methods ignore this specialized knowledge and instead adopt methods similar to those used in the general domain. In this paper, we focus on developing a learning module to enhance the representation ability of knowledge-intensive text by leveraging a graph-based cross-granularity message passing mechanism. To this end, we propose a novel learning framework, the Multi-Granularity Graph Neural Network (MG-GNN), to integrate fine-grained and coarse-grained knowledge at the character, word, and phrase levels. The MG-GNN performs learning in two stages: 1) inter-granularity learning and 2) intra-granularity learning. During inter-granularity learning, semantic knowledge is extracted from character, word, and phrase granularity graphs, whereas intra-granularity learning focuses on fusing knowledge across different granularity graphs to achieve comprehensive message integration. To enhance the fusion performance, we propose a context-based gating mechanism to guide cross-graph propagation learning. Furthermore, we apply MG-GNN to address two important medical applications. Experimental results demonstrate that our proposed MG-GNN model significantly enhances the performance in both diagnosis prediction and medical named entity recognition tasks.


Volume 32, pp. 4409-4419

Citations: 0

Cross-Utterance Conditioned VAE for Speech Generation

IF 4.1 | Zone 2, Computer Science | Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-30 DOI: 10.1109/TASLP.2024.3453598

Yang Li;Cheng Yu;Guangzhi Sun;Weiqin Zu;Zheng Tian;Ying Wen;Wei Pan;Chao Zhang;Jun Wang;Yang Yang;Fanglei Sun

Speech synthesis systems powered by neural networks hold promise for multimedia production, but frequently face issues with producing expressive speech and seamless editing. In response, we present the Cross-Utterance Conditioned Variational Autoencoder speech synthesis (CUC-VAE S2) framework to enhance prosody and ensure natural speech generation. This framework leverages the powerful representational capabilities of pre-trained language models and the re-expression abilities of variational autoencoders (VAEs). The core component of the CUC-VAE S2 framework is the cross-utterance CVAE, which extracts acoustic, speaker, and textual features from surrounding sentences to generate context-sensitive prosodic features, more accurately emulating human prosody generation. We further propose two practical algorithms tailored for distinct speech synthesis applications: CUC-VAE TTS for text-to-speech and CUC-VAE SE for speech editing. The CUC-VAE TTS is a direct application of the framework, designed to generate audio with contextual prosody derived from surrounding texts. On the other hand, the CUC-VAE SE algorithm leverages real mel spectrogram sampling conditioned on contextual information, producing audio that closely mirrors real sound and thereby facilitating flexible speech editing based on text such as deletion, insertion, and replacement. Experimental results on the LibriTTS datasets demonstrate that our proposed models significantly enhance speech synthesis and editing, producing more natural and expressive speech.


Volume 32, pp. 4263-4276

Citations: 0

Cross Domain Optimization for Speech Enhancement: Parallel or Cascade?

IF 4.1 | Zone 2, Computer Science | Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-26 DOI: 10.1109/TASLP.2024.3468026

Liang Wan;Hongqing Liu;Liming Shi;Yi Zhou;Lu Gan

This paper introduces five novel deep-learning architectures for speech enhancement. Existing methods typically use time-domain, time-frequency representations, or a hybrid approach. Recognizing the unique contributions of each domain to feature extraction and model design, this study investigates the integration of waveform and complex spectrogram models through cross-domain fusion to enhance speech feature learning and noise reduction, thereby improving speech quality. We examine both cascading and parallel configurations of waveform and complex spectrogram models to assess their effectiveness in speech enhancement. Additionally, we employ an orthogonal projection-based error decomposition technique and manage the inputs of individual sub-models to analyze factors affecting speech quality. The network is trained by optimizing three specific loss functions applied across all sub-models. Our experiments, using the DNS Challenge (ICASSP 2021) dataset, reveal that the proposed models surpass existing benchmarks in speech enhancement, offering superior speech quality and intelligibility. These results highlight the efficacy of our cross-domain fusion strategy.
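The orthogonal projection idea can be sketched in a few lines. This is an illustrative assumption about the general technique, not the paper's exact decomposition: an enhanced signal is split into the part that lies along the clean reference and a residual that is orthogonal to it. The names `project_decompose`, `clean`, and `estimate` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def project_decompose(estimate, clean):
    """Split `estimate` into a component along `clean` and an orthogonal residual."""
    energy = dot(clean, clean)
    if energy == 0.0:
        raise ValueError("clean reference must have non-zero energy")
    gain = dot(estimate, clean) / energy          # least-squares scaling factor
    target = [gain * c for c in clean]            # projection onto the clean signal
    residual = [e - t for e, t in zip(estimate, target)]  # everything else
    return target, residual


# Toy waveforms (hypothetical values).
clean = [1.0, -1.0, 0.5, 0.0]
estimate = [0.8, -0.9, 0.7, 0.2]
target, residual = project_decompose(estimate, clean)
print(dot(residual, clean))  # close to zero: the residual is orthogonal to the reference
```

Comparing the energy of `target` with the energy of `residual` gives a simple way to ask how much of the output is scaled speech and how much is error, which is the kind of question an error decomposition is meant to answer.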


Volume 32, pp. 4328-4341

Citations: 0

Sound Field Estimation Based on Physics-Constrained Kernel Interpolation Adapted to Environment

IF 4.1 | Zone 2, Computer Science | Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-25 DOI: 10.1109/TASLP.2024.3467951

Juliano G. C. Ribeiro;Shoichi Koyama;Ryosuke Horiuchi;Hiroshi Saruwatari

A sound field estimation method based on kernel interpolation with an adaptive kernel function is proposed. The kernel-interpolation-based sound field estimation methods enable physics-constrained interpolation from pressure measurements of distributed microphones with a linear estimator, which constrains interpolation functions to satisfy the Helmholtz equation. However, a fixed kernel function would not be capable of adapting to the acoustic environment in which the measurement is performed, limiting their applicability. To make the kernel function adaptive, we represent it with a sum of directed and residual trainable kernel functions. The directed kernel is defined by a weight function composed of a superposition of exponential functions to capture highly directional components. The weight function for the residual kernel is represented by neural networks to capture unpredictable spatial patterns of the residual components. Experimental results using simulated and real data indicate that the proposed method outperforms the current kernel-interpolation-based methods and a method based on physics-informed neural networks.
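For orientation, the fixed-kernel baseline that the abstract contrasts against can be sketched as kernel ridge regression. The sketch assumes the uniform-weight kernel sin(kr)/(kr), which satisfies the Helmholtz equation; the adaptive directed and residual kernels proposed in the paper are not implemented here. The microphone layout, wavenumber, regularization value, and all function names are hypothetical.

```python
import cmath
import math


def sinc_kernel(r1, r2, k):
    """sin(x)/x with x = k * ||r1 - r2||; as a function of r1 it solves the Helmholtz equation."""
    x = k * math.dist(r1, r2)
    return 1.0 if x < 1e-12 else math.sin(x) / x


def solve(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting (small systems only)."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0j] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(mics, pressures, k, reg=1e-6):
    """Weights alpha = (K + reg * I)^-1 p of the linear estimator."""
    gram = [[sinc_kernel(a, b, k) + (reg if i == j else 0.0)
             for j, b in enumerate(mics)] for i, a in enumerate(mics)]
    return solve(gram, pressures)


def estimate(point, mics, alpha, k):
    """Interpolated pressure at `point` as a weighted sum of kernel functions."""
    return sum(w * sinc_kernel(point, m, k) for w, m in zip(alpha, mics))


k = 10.0  # wavenumber in rad/m (hypothetical)
mics = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (0.0, 0.5, 0.0), (0.0, 0.0, 0.5)]
# Simulated measurements: a unit-amplitude plane wave travelling along +x.
pressures = [cmath.exp(-1j * k * m[0]) for m in mics]
alpha = fit(mics, pressures, k)
print(abs(estimate((0.25, 0.25, 0.0), mics, alpha, k)))
```

Every estimate is a linear combination of kernel functions that each satisfy the Helmholtz equation, so the interpolated field does too. Making the kernel trainable, as the abstract describes, changes `sinc_kernel` while leaving the linear estimator in `fit` and `estimate` in place.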


Volume 32, pp. 4369-4383. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10693558

Citations: 0

An Investigation of Time-Frequency Representation Discriminators for High-Fidelity Vocoders

IF 4.1 | Zone 2, Computer Science | Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-25 DOI: 10.1109/TASLP.2024.3468005

Yicheng Gu;Xueyao Zhang;Liumeng Xue;Haizhou Li;Zhizheng Wu

Generative Adversarial Network (GAN) based vocoders are superior in both inference speed and synthesis quality when reconstructing an audible waveform from an acoustic representation. This study focuses on improving the discriminator for GAN-based vocoders. Most existing Time-Frequency Representation (TFR)-based discriminators are rooted in the Short-Time Fourier Transform (STFT), which has a constant Time-Frequency (TF) resolution, linearly scaled center frequencies, and a fixed decomposition basis, making it poorly suited to signals such as singing voices that require dynamic attention to different frequency bands and different time intervals. Motivated by this, we propose a Multi-Scale Sub-Band Constant-Q Transform (MS-SB-CQT) discriminator and a Multi-Scale Temporal-Compressed Continuous Wavelet Transform (MS-TC-CWT) discriminator. Both the CQT and the CWT have a dynamic TF resolution for different frequency bands. Of the two, the CQT models pitch information better, whereas the CWT models short-time transients better. Experiments conducted on both speech and singing voices confirm the effectiveness of our proposed discriminators. Moreover, the STFT, CQT, and CWT-based discriminators can be used jointly for better performance. The proposed discriminators can boost the synthesis quality of various state-of-the-art GAN-based vocoders, including HiFi-GAN, BigVGAN, and APNet.
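The contrast between linearly scaled STFT bins and the frequency-dependent resolution of the CQT can be made concrete with a short calculation. This is a minimal sketch of the standard bin-frequency formulas, not the discriminators from the paper; the sample rate, FFT size, lowest frequency, and bin counts are hypothetical.

```python
def stft_center_freqs(sample_rate, n_fft):
    """Linearly spaced bin centers; every bin has the same spacing sample_rate / n_fft."""
    return [k * sample_rate / n_fft for k in range(n_fft // 2 + 1)]


def cqt_center_freqs(f_min, n_bins, bins_per_octave):
    """Geometrically spaced bin centers: f_k = f_min * 2 ** (k / bins_per_octave)."""
    return [f_min * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def cqt_quality_factor(bins_per_octave):
    """Q = f_k / (f_{k+1} - f_k), identical for every bin."""
    return 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)


stft = stft_center_freqs(sample_rate=24000, n_fft=1024)
cqt = cqt_center_freqs(f_min=32.7, n_bins=48, bins_per_octave=12)
q = cqt_quality_factor(12)
# CQT bandwidth grows with center frequency, unlike the fixed STFT spacing.
bandwidths = [f / q for f in cqt]
print(stft[1] - stft[0], bandwidths[0], bandwidths[-1])
```

With 12 bins per octave, each CQT bin corresponds to one semitone, which is one way to see why a constant-Q analysis lines up well with pitch.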


Volume 32, pp. 4569-4579

Citations: 0

Three-Dimensional Room Transfer Function Parameterization Based on Multiple Concentric Planar Circular Arrays

IF 4.1 | Zone 2, Computer Science | Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-25 DOI: 10.1109/TASLP.2024.3468025

Lu Li;Maoshen Jia;Changchun Bao

This study proposes a three-dimensional room transfer function (RTF) parameterization method based on multiple concentric planar circular arrays, which exhibits robustness to variations in the positions of both the receiver and source. According to the harmonic solution to the wave equation, the RTFs between two spherical regions (sound source and receiver) in a room can be expressed as a weighted sum of spherical harmonics, whose weight coefficients serve as the RTF parameters, which can be estimated by placing multiple concentric planar circular arrays composed of monopole-source pairs (MSPs) and multiple concentric planar circular arrays composed of omnidirectional-microphone pairs (OMPs) in respective source and receiver regions. We use MSP arrays to generate required outgoing soundfields originating from a source region. We derive a method to use OMP arrays to estimate RTF parameters that are concealed within the captured soundfield, which can be employed to reconstruct the RTF from any point in the source region to any point in the receiver region. The accuracy of the RTF parameterization method is validated through simulation testing.


Volume 32, pp. 4384-4398

Citations: 0

On the Quantization of Neural Models for Speaker Verification

IF 4.1 | Zone 2, Computer Science | Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-20 DOI: 10.1109/TASLP.2024.3463430

Vishal Kumar;Vinayak Abrol;Mathew Magamai Doss

This paper addresses the sub-optimality of current post-training quantization (PTQ) and quantization-aware training (QAT) methods for state-of-the-art speaker verification (SV) models featuring intricate architectural elements such as channel aggregation and squeeze excitation modules. To address these limitations, we propose 1) a data-independent PTQ technique employing iterative low-precision calibration on pre-trained models; and 2) a data-dependent QAT method designed to reduce the performance gap between full-precision and integer models. Our QAT involves two progressive stages where FP-32 weights are initially transformed into FP-8, adapting precision based on the gradient norm, followed by the learning of quantizer parameters (scale and zero-point) for INT8 conversion. Experimental validation underscores the ingenuity of our method in model quantization, demonstrating reduced floating-point operations and INT8 inference time, all while maintaining performance on par with full-precision models.
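The quantizer parameters mentioned above (scale and zero-point) can be illustrated with standard affine INT8 quantization. In this sketch the parameters come from a simple min/max calibration range rather than being learned as in the paper's QAT stage; the function names and weight values are hypothetical.

```python
def quant_params(x_min, x_max, q_min=-128, q_max=127):
    """Derive scale and zero-point from a calibration range (min/max, not learned)."""
    x_min, x_max = min(x_min, 0.0), max(x_max, 0.0)  # keep 0.0 exactly representable
    scale = (x_max - x_min) / (q_max - q_min) or 1.0  # fall back to 1.0 for an all-zero range
    zero_point = round(q_min - x_min / scale)
    return scale, max(q_min, min(q_max, zero_point))


def quantize(values, scale, zero_point, q_min=-128, q_max=127):
    """Map floats to integer codes: q = clamp(round(x / scale) + zero_point)."""
    return [max(q_min, min(q_max, round(v / scale) + zero_point)) for v in values]


def dequantize(codes, scale, zero_point):
    """Map integer codes back to floats: x ~ (q - zero_point) * scale."""
    return [(c - zero_point) * scale for c in codes]


weights = [-0.5, -0.1, 0.0, 0.3, 1.5]  # hypothetical FP32 weights
scale, zero_point = quant_params(min(weights), max(weights))
codes = quantize(weights, scale, zero_point)
restored = dequantize(codes, scale, zero_point)
print(codes, max(abs(w - r) for w, r in zip(weights, restored)))
```

As long as no value is clamped, the round-trip error stays within half a quantization step (`scale / 2`), so a well-chosen scale and zero-point directly bound the accuracy loss of the integer model.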


Citations: 0

Bayesian Parameter-Efficient Fine-Tuning for Overcoming Catastrophic Forgetting

IF 4.1 Zone 2 (Computer Science) Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-18 DOI: 10.1109/TASLP.2024.3463395

Haolin Chen;Philip N. Garner

We are motivated primarily by the adaptation of text-to-speech synthesis models; however, we argue that more generic parameter-efficient fine-tuning (PEFT) is an appropriate framework for such adaptation. Nevertheless, catastrophic forgetting remains an issue with PEFT, damaging the pre-trained model's inherent capabilities. We demonstrate that existing Bayesian learning techniques can be applied to PEFT to prevent catastrophic forgetting as long as the parameter shift of the fine-tuned layers can be calculated differentiably. In a principled series of experiments on language modeling and speech synthesis tasks, we utilize established Laplace approximations, including diagonal and Kronecker-factored approaches, to regularize PEFT with low-rank adaptation (LoRA) and compare their performance in pre-training knowledge preservation. Our results demonstrate that catastrophic forgetting can be overcome by our methods without degrading the fine-tuning performance, and that the Kronecker-factored approximation preserves the pre-training knowledge better than the diagonal one.
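
The idea of regularizing the LoRA parameter shift with a diagonal Laplace approximation can be sketched as a quadratic penalty on the shift, weighted by a diagonal curvature estimate. This is a generic illustration under stated assumptions, not the paper's implementation: the penalty form 0.5 * lam * sum(F * delta^2), the names `lora_delta` and `diag_laplace_penalty`, the weight `lam`, and the `fisher_diag` values are all hypothetical.

```python
def lora_delta(B, A):
    """Parameter shift of a LoRA layer: delta_W = B @ A.

    B has shape (d_out x r) and A has shape (r x d_in), as nested lists.
    """
    r = len(A)
    return [
        [sum(B[i][k] * A[k][j] for k in range(r)) for j in range(len(A[0]))]
        for i in range(len(B))
    ]


def diag_laplace_penalty(delta, fisher_diag, lam=1.0):
    """Quadratic penalty 0.5 * lam * sum_ij F_ij * delta_ij**2.

    fisher_diag holds one (assumed, precomputed) curvature value per weight
    of the pre-trained layer; large values mark weights that should not move.
    """
    return 0.5 * lam * sum(
        f * d * d
        for row_d, row_f in zip(delta, fisher_diag)
        for d, f in zip(row_d, row_f)
    )
```

In training, such a penalty would be added to the fine-tuning loss. Because the shift is a differentiable function of `A` and `B`, the condition stated in the abstract (a differentiably computable parameter shift) is what makes this kind of regularizer applicable to LoRA.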

Volume : 32 Pages : 4253-4262 Open access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10683983

Citations: 0

NeuroHeed: Neuro-Steered Speaker Extraction Using EEG Signals

IF 4.1 Zone 2 (Computer Science) Q1 ACOUSTICS

IEEE/ACM Transactions on Audio, Speech, and Language Processing

Pub Date : 2024-09-18 DOI: 10.1109/TASLP.2024.3463498

Zexu Pan;Marvin Borsdorf;Siqi Cai;Tanja Schultz;Haizhou Li

Humans possess the remarkable ability to selectively attend to a single speaker amidst competing voices and background noise, known as selective auditory attention. Recent studies in auditory neuroscience indicate a strong correlation between the attended speech signal and the neuronal activity it elicits in the brain. In this work, we study such brain activity measured using affordable and non-intrusive electroencephalography (EEG) devices. We present NeuroHeed, a speaker extraction model that leverages the listener's synchronized EEG signals to extract the attended speech signal in a cocktail party scenario, in which the extraction process is conditioned on a neuronal attractor encoded from the EEG signal. We propose both an offline and an online NeuroHeed, with the latter designed for real-time inference. In the online NeuroHeed, we additionally propose an autoregressive speaker encoder, which accumulates past extracted speech signals for self-enrollment of the attended speaker information into an auditory attractor that retains the attentional momentum over time. Online NeuroHeed extracts the current window of the speech signals with guidance from both attractors. Experimental results on the two-speaker scenario of the KUL dataset demonstrate that NeuroHeed effectively extracts brain-attended speech signals, with an average scale-invariant signal-to-distortion ratio improvement (SI-SDRi) of 14.3 dB and extraction accuracy of 90.8% in offline settings, and an SI-SDRi of 11.2 dB and extraction accuracy of 85.1% in online settings.
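
For readers unfamiliar with the SI-SDRi metric reported above, the following is a minimal sketch of how it is commonly computed: the SI-SDR of the extracted signal minus the SI-SDR of the unprocessed mixture, both measured against the clean target. The function names and the `eps` stabilizer are assumptions, mean removal (which many implementations apply first) is omitted for brevity, and this is not the authors' evaluation code.

```python
import math


def si_sdr(estimate, target, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio in dB.

    Both inputs are equal-length sequences of samples.
    """
    dot = sum(e * t for e, t in zip(estimate, target))
    target_energy = sum(t * t for t in target)
    # Optimal scaling of the target, which makes the metric scale-invariant.
    alpha = dot / (target_energy + eps)
    projection = [alpha * t for t in target]
    residual = [e - p for e, p in zip(estimate, projection)]
    signal_energy = sum(p * p for p in projection)
    residual_energy = sum(r * r for r in residual)
    return 10.0 * math.log10((signal_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, target):
    """SI-SDRi: gain of the extracted signal over the unprocessed mixture."""
    return si_sdr(estimate, target) - si_sdr(mixture, target)
```

Because the target is rescaled before the residual is measured, multiplying the estimate by any nonzero gain leaves the score essentially unchanged, which is why the metric suits extraction systems whose output level is arbitrary.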

Volume : 32 Pages : 4456-4470 Open access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10683957

Citations: 0