
Latest Publications in IEEE/ACM Transactions on Audio, Speech, and Language Processing

An Effective Hierarchical Graph Attention Network Modeling Approach for Pronunciation Assessment
IF 4.1 | CAS Tier 2 (Computer Science) | JCR Q1 (Acoustics) | Pub Date: 2024-08-26 | DOI: 10.1109/TASLP.2024.3449111
Bi-Cheng Yan;Berlin Chen
Automatic pronunciation assessment (APA) seeks to quantify second language (L2) learners' pronunciation proficiency in a target language by providing fine-grained feedback with multiple aspect scores (e.g., accuracy, fluency, and completeness) at various linguistic levels (i.e., phone, word, and utterance). Most existing efforts follow a parallel modeling framework, which takes a sequence of phone-level pronunciation feature embeddings of a learner's utterance as input and then predicts multiple aspect scores across various linguistic levels. However, these approaches neither take the hierarchy of linguistic units into account nor explicitly consider the relatedness among the pronunciation aspects. In light of this, we put forward an effective modeling approach for APA, termed HierGAT, which is grounded on a hierarchical graph attention network. Our approach models the input utterance hierarchically as a heterogeneous graph that contains linguistic nodes at various levels of granularity. On top of a carefully designed hierarchical graph message-passing mechanism, intricate interdependencies within and across different linguistic levels are encapsulated, and the language hierarchy of an utterance is factored in as well. Furthermore, we design a novel aspect attention module to encode relatedness among aspects. To our knowledge, we are the first to introduce multiple types of linguistic nodes into graph-based neural networks for APA and to perform a comprehensive qualitative analysis of their merits. A series of experiments conducted on the speechocean762 benchmark dataset demonstrates the feasibility and effectiveness of our approach relative to several competitive baselines.
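For intuition, one step of the hierarchical message passing can be pictured as attention-pooling phone-node embeddings into their parent word node, and word nodes into the utterance node. The PyTorch sketch below is a minimal single-head illustration of that idea, not the authors' implementation; the module name, dimensions, and residual update are assumptions.

```python
import torch
import torch.nn as nn

class LevelAttentionPool(nn.Module):
    """Attention-pool child-node embeddings (e.g., phones) into their
    parent node (e.g., a word): a hypothetical stand-in for one step of
    hierarchical graph message passing."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, parent, children):
        # parent: (dim,), children: (num_children, dim)
        pairs = torch.cat([parent.expand_as(children), children], dim=-1)
        alpha = torch.softmax(self.score(pairs).squeeze(-1), dim=0)
        return parent + (alpha.unsqueeze(-1) * children).sum(dim=0)

dim = 32
pool = LevelAttentionPool(dim)
phones = torch.randn(4, dim)             # four phone nodes of one word
word = pool(torch.zeros(dim), phones)    # phone -> word message passing
utt = pool(torch.zeros(dim), word.unsqueeze(0))  # word -> utterance
```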
{"title":"An Effective Hierarchical Graph Attention Network Modeling Approach for Pronunciation Assessment","authors":"Bi-Cheng Yan;Berlin Chen","doi":"10.1109/TASLP.2024.3449111","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3449111","url":null,"abstract":"Automatic pronunciation assessment (APA) manages to quantify second language (L2) learners’ pronunciation proficiency in a target language by providing fine-grained feedback with multiple aspect scores (e.g., accuracy, fluency, and completeness) at various linguistic levels (i.e., phone, word, and utterance). Most of the existing efforts commonly follow a parallel modeling framework, which takes a sequence of phone-level pronunciation feature embeddings of a learner's utterance as input and then predicts multiple aspect scores across various linguistic levels. However, these approaches neither take the hierarchy of linguistic units into account nor consider the relatedness among the pronunciation aspects in an explicit manner. In light of this, we put forward an effective modeling approach for APA, termed HierGAT, which is grounded on a hierarchical graph attention network. Our approach facilitates hierarchical modeling of the input utterance as a heterogeneous graph that contains linguistic nodes at various levels of granularity. On top of the tactfully designed hierarchical graph message passing mechanism, intricate interdependencies within and across different linguistic levels are encapsulated and the language hierarchy of an utterance is factored in as well. Furthermore, we also design a novel aspect attention module to encode relatedness among aspects. To our knowledge, we are the first to introduce multiple types of linguistic nodes into graph-based neural networks for APA and perform a comprehensive qualitative analysis to investigate their merits. A series of experiments conducted on the speechocean762 benchmark dataset suggests the feasibility and effectiveness of our approach in relation to several competitive baselines.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3974-3985"},"PeriodicalIF":4.1,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142159830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
IEEE Signal Processing Society Information
IF 4.1 | CAS Tier 2 (Computer Science) | JCR Q1 (Acoustics) | Pub Date: 2024-08-26 | DOI: 10.1109/TASLP.2023.3328752
{"title":"IEEE Signal Processing Society Information","authors":"","doi":"10.1109/TASLP.2023.3328752","DOIUrl":"https://doi.org/10.1109/TASLP.2023.3328752","url":null,"abstract":"","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"C2-C2"},"PeriodicalIF":4.1,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10646371","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142077615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
PE-Wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS
IF 4.1 | CAS Tier 2 (Computer Science) | JCR Q1 (Acoustics) | Pub Date: 2024-08-23 | DOI: 10.1109/TASLP.2024.3449148
Zhao-Ci Liu;Liping Chen;Ya-Jun Hu;Zhen-Hua Ling;Jia Pan
This paper investigates leveraging large-scale untranscribed speech data to enhance the prosody modelling capability of text-to-speech (TTS) models. Building on the self-supervised speech model wav2vec 2.0, we propose Prosody-Enhanced wav2vec (PE-wav2vec), which introduces prosody learning. Specifically, prosody learning is achieved by applying supervision from the linear predictive coding (LPC) residual signals to the initial Transformer blocks in the wav2vec 2.0 architecture. The embedding vectors extracted with the initial Transformer blocks of the PE-wav2vec model are utilised as prosodic representations for the corresponding frames in a speech utterance. To apply the PE-wav2vec representations in TTS, an acoustic model named Speech Synthesis model conditioned on Self-Supervisedly Learned Prosodic Representations (S4LPR) is designed on the basis of FastSpeech 2. The experimental results demonstrate that the proposed PE-wav2vec model provides richer prosody descriptions of speech than the vanilla wav2vec 2.0 model. Furthermore, compared with baseline models, the S4LPR model using PE-wav2vec representations effectively improves the subjective naturalness and reduces the objective distortions of synthetic speech.
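For intuition, the LPC residual used as the supervision signal is the part of the waveform that short-term linear prediction cannot explain, which retains excitation/prosodic structure while discarding the spectral envelope. Below is a minimal frame-wise sketch; the frame size, hop, and LPC order are assumptions, not the paper's settings.

```python
import numpy as np
import librosa
from scipy.signal import lfilter

def lpc_residual_frames(y, order=16, frame_len=400, hop=320):
    """Frame-wise LPC residual of a waveform: the prediction error left
    after inverse-filtering each frame with its own LPC envelope."""
    frames = []
    for start in range(0, len(y) - frame_len + 1, hop):
        frame = y[start:start + frame_len]
        a = librosa.lpc(frame, order=order)   # inverse filter, a[0] == 1
        e = lfilter(a, [1.0], frame)          # prediction-error (residual) signal
        frames.append(e.astype(np.float32))
    return np.stack(frames)                   # (num_frames, frame_len)

y = np.random.randn(16000).astype(np.float32)  # stand-in for 1 s of speech
residual = lpc_residual_frames(y)
```

Such residual frames could then supervise, e.g., a small regression head on the early Transformer blocks via an MSE loss, which is the role the abstract describes for them.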
{"title":"PE-Wav2vec: A Prosody-Enhanced Speech Model for Self-Supervised Prosody Learning in TTS","authors":"Zhao-Ci Liu;Liping Chen;Ya-Jun Hu;Zhen-Hua Ling;Jia Pan","doi":"10.1109/TASLP.2024.3449148","DOIUrl":"10.1109/TASLP.2024.3449148","url":null,"abstract":"This paper investigates leveraging large-scale untranscribed speech data to enhance the prosody modelling capability of \u0000<italic>text-to-speech</i>\u0000 (TTS) models. On the basis of the self-supervised speech model wav2vec 2.0, \u0000<italic>Prosody-Enhanced wav2vec</i>\u0000 (PE-wav2vec) is proposed by introducing prosody learning. Specifically, prosody learning is achieved by applying supervision from the \u0000<italic>linear predictive coding</i>\u0000 (LPC) residual signals on the initial Transformer blocks in the wav2vec 2.0 architecture. The embedding vectors extracted with the initial Transformer blocks of the PE-wav2vec model are utilised as prosodic representations for the corresponding frames in a speech utterance. To apply the PE-wav2vec representations in TTS, an acoustic model named \u0000<italic>Speech Synthesis model conditioned on Self-Supervisedly Learned Prosodic Representations</i>\u0000 (S4LPR) is designed on the basis of FastSpeech 2. The experimental results demonstrate that the proposed PE-wav2vec model can provide richer prosody descriptions of speech than the vanilla wav2vec 2.0 model can. Furthermore, the S4LPR model using PE-wav2vec representations can effectively improve the subjective naturalness and reduce the objective distortions of synthetic speech compared with baseline models.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4199-4210"},"PeriodicalIF":4.1,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210051","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Spatial Analysis and Synthesis Methods: Subjective and Objective Evaluations Using Various Microphone Arrays in the Auralization of a Critical Listening Room
IF 4.1 | CAS Tier 2 (Computer Science) | JCR Q1 (Acoustics) | Pub Date: 2024-08-23 | DOI: 10.1109/TASLP.2024.3449037
Alan Pawlak;Hyunkook Lee;Aki Mäkivirta;Thomas Lund
Parametric sound field reproduction methods, such as the Spatial Decomposition Method (SDM) and Higher-Order Spatial Impulse Response Rendering (HO-SIRR), are widely used for the analysis and auralization of sound fields. This paper studies the performance of various sound field reproduction methods in the context of the auralization of a critical listening room, focusing on fixed head orientations. The influence of the following factors on perceived spatial and timbral fidelity is considered: the rendering framework, the direction-of-arrival (DOA) estimation method, the microphone array structure, and the use of a dedicated center reference microphone with SDM. Listening tests compare the synthesized sound fields to a reference binaural rendering condition, all for static head positions. Several acoustic parameters are measured to gain insight into objective differences between methods. All systems were distinguishable from the reference in perceptual tests. A high-quality pressure microphone improves the SDM framework's timbral fidelity, and its spatial fidelity in certain scenarios. Additionally, SDM and HO-SIRR show similarities in spatial fidelity. Performance variation between SDM configurations is influenced by the DOA estimation method and microphone array construction. The binaural SDM (BSDM) presentations display temporal artifacts that impact sound quality.
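As one concrete example of the DOA-estimation step such parametric methods depend on, a direction per time sample can be read off a first-order (B-format) room impulse response via the pseudo-intensity vector. The numpy sketch below assumes ideal W/X/Y/Z channels and illustrates only this analysis step, not a complete SDM or HO-SIRR pipeline.

```python
import numpy as np

def pseudo_intensity_doa(W, X, Y, Z, eps=1e-12):
    """Per-sample unit direction vectors from B-format impulse-response
    channels, via the instantaneous pseudo-intensity vector W * [X, Y, Z].
    Note: sign conventions vary; some formulations negate the vector so it
    points toward the source rather than along the propagation direction."""
    I = np.stack([W * X, W * Y, W * Z], axis=-1)      # (num_samples, 3)
    norm = np.linalg.norm(I, axis=-1, keepdims=True)
    return I / np.maximum(norm, eps)

rng = np.random.default_rng(0)
W, X, Y, Z = rng.standard_normal((4, 4800))   # stand-in 0.1 s IR at 48 kHz
doa = pseudo_intensity_doa(W, X, Y, Z)        # one unit vector per sample
```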
{"title":"Spatial Analysis and Synthesis Methods: Subjective and Objective Evaluations Using Various Microphone Arrays in the Auralization of a Critical Listening Room","authors":"Alan Pawlak;Hyunkook Lee;Aki Mäkivirta;Thomas Lund","doi":"10.1109/TASLP.2024.3449037","DOIUrl":"10.1109/TASLP.2024.3449037","url":null,"abstract":"Parametric sound field reproduction methods, such as the Spatial Decomposition Method (SDM) and Higher-Order Spatial Impulse Response Rendering (HO-SIRR), are widely used for the analysis and auralization of sound fields. This paper studies the performance of various sound field reproduction methods in the context of the auralization of a critical listening room, focusing on fixed head orientations. The influence on the perceived spatial and timbral fidelity of the following factors is considered: the rendering framework, direction of arrival (DOA) estimation method, microphone array structure, and use of a dedicated center reference microphone with SDM. Listening tests compare the synthesized sound fields to a reference binaural rendering condition, all for static head positions. Several acoustic parameters are measured to gain insights into objective differences between methods. All systems were distinguishable from the reference in perceptual tests. A high-quality pressure microphone improves the SDM framework's timbral fidelity, and spatial fidelity in certain scenarios. Additionally, SDM and HO-SIRR show similarities in spatial fidelity. Performance variation between SDM configurations is influenced by the DOA estimation method and microphone array construction. The binaural SDM (BSDM) presentations display temporal artifacts impacting sound quality.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3986-4001"},"PeriodicalIF":4.1,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10645201","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142226689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Zero-Shot Cross-Lingual Named Entity Recognition via Progressive Multi-Teacher Distillation
IF 4.1 | CAS Tier 2 (Computer Science) | JCR Q1 (Acoustics) | Pub Date: 2024-08-23 | DOI: 10.1109/TASLP.2024.3449029
Zhuoran Li;Chunming Hu;Richong Zhang;Junfan Chen;Xiaohui Guo
Cross-lingual learning aims to transfer knowledge from one natural language to another. Zero-shot cross-lingual named entity recognition (NER) trains an NER model on source languages and identifies named entities in other languages. Existing knowledge distillation-based models, which operate in a teacher-student manner, leverage unlabeled samples from the target languages and show their superiority in this setting. However, the valuable similarity information between tokens in the target language is ignored, and a teacher model trained solely on the source language generates low-quality pseudo-labels. Both facts impact the performance of cross-lingual NER. To improve the reliability of the teacher model, in this study we first introduce an extra, simple binary-classification teacher model, trained by similarity learning, to measure whether two inputs are from the same class. We note that this binary classification auxiliary task is easier, and the two teachers simultaneously supervise the student model for better performance. Furthermore, given such a stronger student model, we propose a progressive knowledge distillation framework that extensively fine-tunes the teacher model on the target-language pseudo-labels generated by the student model. Empirical studies on three datasets across seven different languages show that our model outperforms state-of-the-art methods.
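A hedged sketch of how the two supervision signals could combine: a temperature-scaled KL term distills the main teacher's soft token labels, while a binary cross-entropy term matches the auxiliary teacher's "same class?" judgment on token pairs. The loss weighting, temperature, and tensor shapes are assumptions, not the paper's values.

```python
import torch
import torch.nn.functional as F

def multi_teacher_loss(student_logits, ner_teacher_logits,
                       sim_student_logits, sim_teacher_probs,
                       temperature=2.0, beta=0.5):
    """student_logits / ner_teacher_logits: (num_tokens, num_labels);
    sim_*: (num_pairs,) scores for whether two tokens share a class."""
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(ner_teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                      # soft-label distillation term
    sim = F.binary_cross_entropy_with_logits(  # auxiliary similarity term
        sim_student_logits, sim_teacher_probs)
    return kd + beta * sim

loss = multi_teacher_loss(torch.randn(8, 9), torch.randn(8, 9),
                          torch.randn(5), torch.rand(5))
```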
{"title":"Zero-Shot Cross-Lingual Named Entity Recognition via Progressive Multi-Teacher Distillation","authors":"Zhuoran Li;Chunming Hu;Richong Zhang;Junfan Chen;Xiaohui Guo","doi":"10.1109/TASLP.2024.3449029","DOIUrl":"10.1109/TASLP.2024.3449029","url":null,"abstract":"Cross-lingual learning aims to transfer knowledge from one natural language to another. Zero-shot cross-lingual named entity recognition (NER) tasks are to train an NER model on source languages and to identify named entities in other languages. Existing knowledge distillation-based models in a teacher-student manner leverage the unlabeled samples from the target languages and show their superiority in this setting. However, the valuable similarity information between tokens in the target language is ignored. And the teacher model trained solely on the source language generates low-quality pseudo-labels. These two facts impact the performance of cross-lingual NER. To improve the reliability of the teacher model, in this study, we first introduce one extra simple binary classification teacher model by similarity learning to measure if the inputs are from the same class. We note that this binary classification auxiliary task is easier, and the two teachers simultaneously supervise the student model for better performance. Furthermore, given such a stronger student model, we propose a progressive knowledge distillation framework that extensively fine-tunes the teacher model on the target-language pseudo-labels generated by the student model. Empirical studies on three datasets across seven different languages show that our presented model outperforms state-of-the-art methods.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4617-4630"},"PeriodicalIF":4.1,"publicationDate":"2024-08-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142210050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
The VoxCeleb Speaker Recognition Challenge: A Retrospective
IF 4.1 | CAS Tier 2 (Computer Science) | JCR Q1 (Acoustics) | Pub Date: 2024-08-20 | DOI: 10.1109/TASLP.2024.3444456
Jaesung Huh;Joon Son Chung;Arsha Nagrani;Andrew Brown;Jee-weon Jung;Daniel Garcia-Romero;Andrew Zisserman
The VoxCeleb Speaker Recognition Challenges (VoxSRC) were a series of challenges and workshops that ran annually from 2019 to 2023. The challenges primarily evaluated the tasks of speaker recognition and diarisation under various settings including: closed and open training data; as well as supervised, self-supervised, and semi-supervised training for domain adaptation. The challenges also provided publicly available training and evaluation datasets for each task and setting, with new test sets released each year. In this paper, we provide a review of these challenges that covers: what they explored; the methods developed by the challenge participants and how these evolved; and also the current state of the field for speaker verification and diarisation. We chart the progress in performance over the five installments of the challenge on a common evaluation dataset and provide a detailed analysis of how each year's special focus affected participants' performance. This paper is aimed both at researchers who want an overview of the speaker recognition and diarisation field, and also at challenge organisers who want to benefit from the successes and avoid the mistakes of the VoxSRC challenges. We end with a discussion of the current strengths of the field and open challenges.
{"title":"The VoxCeleb Speaker Recognition Challenge: A Retrospective","authors":"Jaesung Huh;Joon Son Chung;Arsha Nagrani;Andrew Brown;Jee-weon Jung;Daniel Garcia-Romero;Andrew Zisserman","doi":"10.1109/TASLP.2024.3444456","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3444456","url":null,"abstract":"The VoxCeleb Speaker Recognition Challenges (VoxSRC) were a series of challenges and workshops that ran annually from 2019 to 2023. The challenges primarily evaluated the tasks of speaker recognition and diarisation under various settings including: closed and open training data; as well as supervised, self-supervised, and semi-supervised training for domain adaptation. The challenges also provided publicly available training and evaluation datasets for each task and setting, with new test sets released each year. In this paper, we provide a review of these challenges that covers: what they explored; the methods developed by the challenge participants and how these evolved; and also the current state of the field for speaker verification and diarisation. We chart the progress in performance over the five installments of the challenge on a common evaluation dataset and provide a detailed analysis of how each year's special focus affected participants' performance. This paper is aimed both at researchers who want an overview of the speaker recognition and diarisation field, and also at challenge organisers who want to benefit from the successes and avoid the mistakes of the VoxSRC challenges. We end with a discussion of the current strengths of the field and open challenges.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3850-3866"},"PeriodicalIF":4.1,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142084491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Speech Separation With Pretrained Frontend to Minimize Domain Mismatch
IF 4.1 | CAS Tier 2 (Computer Science) | JCR Q1 (Acoustics) | Pub Date: 2024-08-20 | DOI: 10.1109/TASLP.2024.3446242
Wupeng Wang;Zexu Pan;Xinke Li;Shuai Wang;Haizhou Li
Speech separation seeks to separate individual speech signals from a speech mixture. Most separation models are trained on synthetic data because target reference signals are unavailable in real-world cocktail-party scenarios. As a result, there exists a domain gap between real and synthetic data when deploying speech separation models in real-world applications. In this paper, we propose a self-supervised domain-invariant pretrained (DIP) frontend that is exposed to mixture data without the need for target reference speech. The DIP frontend utilizes a Siamese network with two innovative pretext tasks, mixture predictive coding (MPC) and mixture invariant coding (MIC), to capture shared contextual cues between real and synthetic unlabeled mixtures. Subsequently, we freeze the DIP frontend as a feature extractor when training the downstream speech separation models on synthetic data. By pretraining the DIP frontend with the contextual cues, we expect the speech separation skills learned from synthetic data to transfer effectively to real data. To benefit from the DIP frontend, we introduce a novel separation pipeline to align the feature resolution of the separation models. We evaluate the speech separation quality on standard benchmarks and real-world datasets. The results confirm the superiority of our DIP frontend over existing speech separation models. This study underscores the potential of large-scale pretraining to enhance the quality and intelligibility of speech separation in real-world applications.
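The deployment pattern described here — freeze the pretrained frontend and train only the downstream separator on synthetic mixtures — can be sketched in a few lines of PyTorch. The module definitions below are stand-ins; the layer shapes, names, and sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

# Stand-ins for the pretrained DIP frontend and the downstream separator.
frontend = nn.Sequential(nn.Conv1d(1, 64, kernel_size=16, stride=8), nn.ReLU())
separator = nn.Conv1d(64, 2 * 64, kernel_size=1)  # e.g., masks for 2 speakers

frontend.eval()
for p in frontend.parameters():
    p.requires_grad = False          # frontend stays frozen during training

mixture = torch.randn(1, 1, 16000)   # one second of a 16 kHz mixture
with torch.no_grad():
    feats = frontend(mixture)        # domain-invariant mixture features
masks = separator(feats)             # only separator weights receive gradients
```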
{"title":"Speech Separation With Pretrained Frontend to Minimize Domain Mismatch","authors":"Wupeng Wang;Zexu Pan;Xinke Li;Shuai Wang;Haizhou Li","doi":"10.1109/TASLP.2024.3446242","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3446242","url":null,"abstract":"Speech separation seeks to separate individual speech signals from a speech mixture. Typically, most separation models are trained on synthetic data due to the unavailability of target reference in real-world cocktail party scenarios. As a result, there exists a domain gap between real and synthetic data when deploying speech separation models in real-world applications. In this paper, we propose a self-supervised domain-invariant pretrained (DIP) frontend that is exposed to mixture data without the need for target reference speech. The DIP frontend utilizes a Siamese network with two innovative pretext tasks, mixture predictive coding (MPC) and mixture invariant coding (MIC), to capture shared contextual cues between real and synthetic unlabeled mixtures. Subsequently, we freeze the DIP frontend as a feature extractor when training the downstream speech separation models on synthetic data. By pretraining the DIP frontend with the contextual cues, we expect that the speech separation skills learned from synthetic data can be effectively transferred to real data. To benefit from the DIP frontend, we introduce a novel separation pipeline to align the feature resolution of the separation models. We evaluate the speech separation quality on standard benchmarks and real-world datasets. The results confirm the superiority of our DIP frontend over existing speech separation models. This study underscores the potential of large-scale pretraining to enhance the quality and intelligibility of speech separation in real-world applications.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"4184-4198"},"PeriodicalIF":4.1,"publicationDate":"2024-08-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142316488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Integrating Data Priors to Weighted Prediction Error for Speech Dereverberation
IF 4.1 | CAS Tier 2 (Computer Science) | JCR Q1 (Acoustics) | Pub Date: 2024-08-19 | DOI: 10.1109/TASLP.2024.3440003
Ziye Yang;Wenxing Yang;Kai Xie;Jie Chen
Speech dereverberation aims to alleviate the detrimental effects of late-reverberant components. While the weighted prediction error (WPE) method has shown superior dereverberation performance, there is still room for improvement in performance and robustness in complex and noisy environments. Recent research has highlighted the effectiveness of integrating physics-based and data-driven methods, enhancing the performance of various signal processing tasks while maintaining interpretability. Motivated by these advancements, this paper presents a novel dereverberation framework for the single-source case, which incorporates data-driven methods for capturing speech priors within the WPE framework. The plug-and-play (PnP) framework, specifically the regularization-by-denoising (RED) strategy, is utilized to incorporate speech prior information learned from data into the iterations that solve the optimization problem. Experimental results validate the effectiveness of the proposed approach.
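For reference, classic single-channel WPE alternates between estimating the desired signal's power and solving a weighted least-squares filter per frequency band; the RED-based prior proposed here plugs into those iterations. Below is a minimal numpy sketch of the vanilla iteration only (no data prior); the tap count, prediction delay, and iteration count are assumptions.

```python
import numpy as np

def wpe_single_channel(X, taps=10, delay=3, iters=3, eps=1e-8):
    """Vanilla single-channel WPE on a complex STFT X of shape
    (freq, time). Returns the dereverberated STFT."""
    F_, T = X.shape
    D = X.copy()
    for _ in range(iters):
        lam = np.maximum(np.abs(D) ** 2, eps)        # current PSD estimate
        for f in range(F_):
            # Row t of A holds the delayed taps x[t-delay], ..., x[t-delay-taps+1]
            A = np.zeros((T, taps), dtype=complex)
            for k in range(taps):
                s = delay + k
                A[s:, k] = X[f, :T - s]
            w = 1.0 / lam[f]                          # per-frame weights
            R = (A.conj().T * w) @ A                  # weighted covariance
            r = (A.conj().T * w) @ X[f]               # weighted correlation
            g = np.linalg.solve(R + eps * np.eye(taps), r)
            D[f] = X[f] - A @ g                       # subtract predicted late reverb
    return D
```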
{"title":"Integrating Data Priors to Weighted Prediction Error for Speech Dereverberation","authors":"Ziye Yang;Wenxing Yang;Kai Xie;Jie Chen","doi":"10.1109/TASLP.2024.3440003","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3440003","url":null,"abstract":"Speech dereverberation aims to alleviate the detrimental effects of late-reverberant components. While the weighted prediction error (WPE) method has shown superior performance in dereverberation, there is still room for further improvement in terms of performance and robustness in complex and noisy environments. Recent research has highlighted the effectiveness of integrating physics-based and data-driven methods, enhancing the performance of various signal processing tasks while maintaining interpretability. Motivated by these advancements, this paper presents a novel dereverberation framework for the single-source case, which incorporates data-driven methods for capturing speech priors within the WPE framework. The plug-and-play (PnP) framework, specifically the regularization by denoising (RED) strategy, is utilized to incorporate speech prior information learnt from data during the optimization problem solving iterations. Experimental results validate the effectiveness of the proposed approach.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3908-3923"},"PeriodicalIF":4.1,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142143640","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
USDnet: Unsupervised Speech Dereverberation via Neural Forward Filtering
IF 4.1 | CAS Tier 2 (Computer Science) | JCR Q1 (Acoustics) | Pub Date: 2024-08-16 | DOI: 10.1109/TASLP.2024.3445120
Zhong-Qiu Wang
In reverberant conditions with a single speaker, each far-field microphone records a reverberant version of the same speaker signal at a different location. In over-determined conditions, where there are multiple microphones but only one speaker, each recorded mixture signal can be leveraged as a constraint to narrow down the solutions to target anechoic speech and thereby reduce reverberation. Equipped with this insight, we propose USDnet, a novel deep neural network (DNN) approach for unsupervised speech dereverberation (USD). At each training step, we first feed an input mixture to USDnet to produce an estimate for target speech, and then linearly filter the DNN estimate to approximate the multi-microphone mixture so that the constraint can be satisfied at each microphone, thereby regularizing the DNN estimate to approximate target anechoic speech. The linear filter can be estimated based on the mixture and DNN estimate via neural forward filtering algorithms such as forward convolutive prediction. We show that this novel methodology can promote unsupervised dereverberation of single-source reverberant speech.
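The key constraint — linearly filter the DNN estimate so that it explains the observed mixture — can be written as a per-frequency least-squares fit. The PyTorch sketch below shows this for one microphone; the tap count and plain (unweighted) least squares are simplifying assumptions relative to the paper's forward convolutive prediction.

```python
import torch

def forward_filter_loss(est, mix, taps=20, eps=1e-8):
    """est, mix: complex STFTs of shape (freq, time). Fits a linear filter
    mapping the DNN estimate to the mixture per frequency band, then
    penalizes the remaining mismatch at the microphone."""
    F_, T = est.shape
    loss = 0.0
    for f in range(F_):
        A = est.new_zeros(T, taps)          # delayed copies of the estimate
        for k in range(taps):
            A[k:, k] = est[f, :T - k]
        R = A.conj().T @ A + eps * torch.eye(taps, dtype=A.dtype)
        h = torch.linalg.solve(R, A.conj().T @ mix[f])  # least-squares filter
        loss = loss + torch.mean(torch.abs(mix[f] - A @ h) ** 2)
    return loss / F_

est = torch.randn(257, 100, dtype=torch.cfloat, requires_grad=True)
mix = torch.randn(257, 100, dtype=torch.cfloat)
forward_filter_loss(est, mix).backward()    # gradients flow to the estimate
```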
{"title":"USDnet: Unsupervised Speech Dereverberation via Neural Forward Filtering","authors":"Zhong-Qiu Wang","doi":"10.1109/TASLP.2024.3445120","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3445120","url":null,"abstract":"In reverberant conditions with a single speaker, each far-field microphone records a reverberant version of the same speaker signal at a different location. In over-determined conditions, where there are multiple microphones but only one speaker, each recorded mixture signal can be leveraged as a constraint to narrow down the solutions to target anechoic speech and thereby reduce reverberation. Equipped with this insight, we propose USDnet, a novel deep neural network (DNN) approach for unsupervised speech dereverberation (USD). At each training step, we first feed an input mixture to USDnet to produce an estimate for target speech, and then linearly filter the DNN estimate to approximate the multi-microphone mixture so that the constraint can be satisfied at each microphone, thereby regularizing the DNN estimate to approximate target anechoic speech. The linear filter can be estimated based on the mixture and DNN estimate via neural forward filtering algorithms such as forward convolutive prediction. We show that this novel methodology can promote unsupervised dereverberation of single-source reverberant speech.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3882-3895"},"PeriodicalIF":4.1,"publicationDate":"2024-08-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142123022","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
On the Generalization Ability of Complex-Valued Variational U-Networks for Single-Channel Speech Enhancement
IF 4.1 | CAS Tier 2 (Computer Science) | JCR Q1 (Acoustics) | Pub Date: 2024-08-15 | DOI: 10.1109/TASLP.2024.3444492
Eike J. Nustede;Jörn Anemüller
The ability to generalize to different environments is important for audio denoising systems in real-world scenarios. Single-channel signals in particular require efficient noise filtering that does not degrade speech intelligibility. Our previous work has shown that a probabilistic latent-space model combined with a U-Network architecture increases performance and generalization ability to some extent. Here, we further evaluate magnitude-only as well as complex-valued U-Network models on two different datasets and in a train-test mismatch scenario. The adaptability of the models is evaluated by introducing a curve-based score similar to area-under-the-curve metrics. The proposed probabilistic latent-space models outperform their ablated variants in most conditions, as well as well-known comparison methods, while the increase in network size is negligible. Improvements of up to 0.97 dB SI-SDR in matched and 2.72 dB SI-SDR in mismatched conditions are observed, with highest total SI-SDR scores of 20.21 dB and 18.71 dB, respectively. The proposed stability score aligns well with the observed performance behaviour, further validating the probabilistic latent-space model.
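For readers who want the headline metric in concrete terms: SI-SDR projects the estimate onto the reference to find the optimally scaled target, then measures the residual energy. A minimal numpy sketch of the standard definition:

```python
import numpy as np

def si_sdr(est, ref, eps=1e-12):
    """Scale-invariant signal-to-distortion ratio (dB) between a
    time-domain estimate and reference."""
    est = est - est.mean()
    ref = ref - ref.mean()
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)  # optimal scaling
    target = alpha * ref
    noise = est - target
    return 10.0 * np.log10((np.dot(target, target) + eps)
                           / (np.dot(noise, noise) + eps))

rng = np.random.default_rng(0)
ref = rng.standard_normal(16000)
print(si_sdr(ref + 0.1 * rng.standard_normal(16000), ref))  # ~20 dB
```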
{"title":"On the Generalization Ability of Complex-Valued Variational U-Networks for Single-Channel Speech Enhancement","authors":"Eike J. Nustede;Jörn Anemüller","doi":"10.1109/TASLP.2024.3444492","DOIUrl":"https://doi.org/10.1109/TASLP.2024.3444492","url":null,"abstract":"The ability to generalize well to different environments is of importance for audio de-noising systems in real-world scenarios. Especially single-channel signals require efficient noise filtering without impacting speech intelligibility negatively. Our previous work has shown that a probabilistic latent space model combined with a U-Network architecture increases performance and generalization ability to some extent. Here, we further evaluate magnitude-only, as well as complex-valued U-Network models, on two different datasets, and in a train-test mismatch scenario. Adaptability of models is evaluated by introducing a curve-based score similar to area-under-the-curve metrics. The proposed probabilistic latent space models outperform their ablated variants in most conditions, as well as well-known comparison methods, while increases in network size are negligible. Improvements of up to 0.97 dB SI-SDR in matched, and 2.72 dB SI-SDR in mismatched conditions are observed, with highest total SI-SDR scores of 20.21 dB and 18.71 dB, respectively. The proposed stability-score aligns well with observed performance behaviour, further validating the probabilistic latent space model.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3838-3849"},"PeriodicalIF":4.1,"publicationDate":"2024-08-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10637717","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142084492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0