
IEEE/ACM Transactions on Audio, Speech, and Language Processing: Latest Publications

Textless Unit-to-Unit Training for Many-to-Many Multilingual Speech-to-Speech Translation
IF 4.1 Tier 2 (Computer Science) Q1 ACOUSTICS Pub Date: 2024-08-15 DOI: 10.1109/TASLP.2024.3444470
Minsu Kim;Jeongsoo Choi;Dahun Kim;Yong Man Ro
This paper proposes a textless training method for many-to-many multilingual speech-to-speech translation that can also benefit the transfer of pre-trained knowledge to text-based systems, text-to-speech synthesis and text-to-speech translation. To this end, we represent multilingual speech with speech units, which are discretized representations of speech features derived from a self-supervised speech model. By treating the speech units as pseudo-text, we can focus on the linguistic content of the speech, which can be easily associated with both the speech and text modalities at the phonetic level. By setting both the inputs and outputs of our learning problem as speech units, we propose to train an encoder-decoder model in a many-to-many spoken language translation setting, namely Unit-to-Unit Translation (UTUT). Specifically, the encoder is conditioned on the source language token to correctly understand the input spoken language, while the decoder is conditioned on the target language token to generate the translated speech in the target language. Therefore, during training, the model can build knowledge of how languages are comprehended and how they relate to one another. Since speech units can be easily obtained from both audio and text by quantization and phonemization, respectively, the trained model can be easily transferred to text-related tasks, even though it is trained in a textless manner. We demonstrate that the proposed UTUT model can be effectively utilized not only for Speech-to-Speech Translation (S2ST) but also for multilingual Text-to-Speech Synthesis (T2S) and Text-to-Speech Translation (T2ST), requiring only minimal fine-tuning on text inputs. By conducting comprehensive experiments encompassing various languages, we validate the efficacy of the proposed method across diverse multilingual tasks. Moreover, thanks to the many-to-many language training, we show that UTUT can also perform translation for novel language pairs that never appear as pairs during training, which has not been well explored in the previous literature.
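To make the conditioning scheme concrete, below is a minimal sketch of a unit-to-unit encoder-decoder in which a source-language token is prepended to the encoder input and a target-language token starts the decoder input. The vocabulary sizes, module layout, and shared language-token embedding are illustrative assumptions, not the authors' implementation.

```python
# Illustrative unit-to-unit translation sketch (hypothetical sizes and layout).
import torch
import torch.nn as nn

NUM_UNITS = 1000          # discrete speech-unit vocabulary (assumed)
NUM_LANGS = 20            # language tokens appended to the same table (assumed)
VOCAB = NUM_UNITS + NUM_LANGS

class UnitToUnitTranslator(nn.Module):
    def __init__(self, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=4,
                                          num_encoder_layers=3,
                                          num_decoder_layers=3,
                                          batch_first=True)
        self.out = nn.Linear(d_model, VOCAB)

    def forward(self, src_units, src_lang, tgt_units, tgt_lang):
        # Encoder conditioned on the source-language token, decoder on the
        # target-language token, both by simple prepending.
        src = torch.cat([src_lang.unsqueeze(1), src_units], dim=1)
        tgt = torch.cat([tgt_lang.unsqueeze(1), tgt_units], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
        h = self.transformer(self.embed(src), self.embed(tgt), tgt_mask=mask)
        return self.out(h)                      # next-unit logits

# Toy usage: 2 utterances, 50 source units, 40 target units.
model = UnitToUnitTranslator()
src = torch.randint(0, NUM_UNITS, (2, 50))
tgt = torch.randint(0, NUM_UNITS, (2, 40))
src_lang = torch.tensor([NUM_UNITS + 0, NUM_UNITS + 3])    # e.g. "en", "es"
tgt_lang = torch.tensor([NUM_UNITS + 3, NUM_UNITS + 0])
logits = model(src, src_lang, tgt, tgt_lang)
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                                   tgt.reshape(-1))
```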
Citations: 0
Coarse-to-Fine Target Speaker Extraction Based on Contextual Information Exploitation
IF 4.1 Tier 2 (Computer Science) Q1 ACOUSTICS Pub Date: 2024-08-08 DOI: 10.1109/TASLP.2024.3440638
Xue Yang;Changchun Bao;Xianhong Chen
To address the cocktail party problem, target speaker extraction (TSE) has received increasing attention recently. Typically, TSE is explored in two scenarios. The first is a specific scenario, where the target speaker is present and the signal received by the microphone contains at least two speakers. The second is a universal scenario, where the target speaker may be present or absent and the received signal may contain one or multiple speakers. Numerous TSE studies utilize the target speaker's embedding to guide the extraction. However, solely utilizing this embedding may not fully leverage the contextual information within the enrollment. To address this limitation, a novel approach that directly exploits the contextual information in the time-frequency (T-F) domain was proposed. This paper improves that approach by integrating our previously proposed coarse-to-fine framework. For the specific scenario, an interaction block is employed to facilitate direct interaction between the T-F representations of the enrollment and the received signal. This direct interaction yields a consistent representation of the enrollment that serves as guidance for the coarse extraction. Afterwards, the T-F representation of the coarsely extracted signal is utilized to guide the refining extraction, and the residual representation obtained during the refining extraction increases the extraction precision. In addition, this paper explores an undisturbed universal scenario in which noise and reverberation are not considered, and a two-level decision-making scheme is devised to generalize our proposed method to this scenario. The proposed method achieves high performance and is proven effective in both scenarios.
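As a rough illustration of the interaction-block idea, the sketch below lets the mixture's T-F frames attend directly to the enrollment's T-F frames, produces a coarse masked estimate, and then refines it conditioned on that estimate. The module sizes and exact block layout are assumptions made for illustration, not the paper's architecture.

```python
# Sketch of enrollment/mixture interaction followed by coarse-to-fine masking.
import torch
import torch.nn as nn

class CoarseToFineTSE(nn.Module):
    def __init__(self, freq_bins=257, d_model=128):
        super().__init__()
        self.proj_mix = nn.Linear(freq_bins, d_model)
        self.proj_enr = nn.Linear(freq_bins, d_model)
        # Interaction block: mixture frames attend to enrollment frames.
        self.interact = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.coarse_mask = nn.Linear(d_model, freq_bins)
        # Refinement conditioned on the coarse estimate.
        self.refine = nn.GRU(2 * freq_bins, d_model, batch_first=True)
        self.fine_mask = nn.Linear(d_model, freq_bins)

    def forward(self, mix_mag, enr_mag):
        # mix_mag, enr_mag: (batch, frames, freq_bins) magnitude spectrograms
        q = self.proj_mix(mix_mag)
        kv = self.proj_enr(enr_mag)
        ctx, _ = self.interact(q, kv, kv)              # contextual interaction
        coarse = torch.sigmoid(self.coarse_mask(ctx)) * mix_mag
        h, _ = self.refine(torch.cat([mix_mag, coarse], dim=-1))
        fine = torch.sigmoid(self.fine_mask(h)) * mix_mag
        return coarse, fine

tse = CoarseToFineTSE()
mix = torch.rand(2, 100, 257)      # toy mixture spectrogram
enr = torch.rand(2, 80, 257)       # toy enrollment spectrogram
coarse, fine = tse(mix, enr)
```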
Citations: 0
Theoretical Analysis of Maclaurin Expansion Based Linear Differential Microphone Arrays and Improved Solutions
IF 4.1 Tier 2 (Computer Science) Q1 ACOUSTICS Pub Date: 2024-08-07 DOI: 10.1109/TASLP.2024.3439994
Jinfu Wang;Feiran Yang;Xiaoqing Hu;Jun Yang
Linear differential microphone arrays (LDMAs) are becoming popular due to their potentially high directional gain and frequency-invariant beampattern. By increasing the number of microphones, the Maclaurin expansion-based LDMAs address the inherently poor robustness problem of the conventional LDMA at low frequencies. However, this method encounters severe beampattern distortion and the deep nulls problem in the white noise gain (WNG) and the directivity factor (DF) at high frequencies as the number of microphones increases. In this paper, we reveal that the severe beampattern distortion is attributed to the deviation term of the synthesized beampattern while the deep nulls problem in the WNG and the DF is attributed to the violation of the distortionless constraint in the desired direction. We then propose two new design methods to avoid the degraded performance of LDMAs. Compared to the Maclaurin series expansion-based method, the first method additionally imposes the distortionless constraint in the desired direction, and the deep nulls problem in the WNG and the DF can be avoided. The second method explicitly requires the response of the higher order spatial directivity pattern in the deviation term to be zero, and thus the beampattern distortion can be avoided. By choosing the frequency-wise parameter that determines the number of the considered higher order spatial directivity patterns, the second method enables a good trade-off between the WNG and the beampattern distortion. Simulations exemplify the superiority of the proposed method against existing methods in terms of the robustness and the beampattern distortion.
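The two robustness measures discussed in this abstract, white noise gain (WNG) and the directivity factor (DF), can be checked numerically for any set of beamformer weights. The snippet below uses their standard definitions on an arbitrary 6-microphone, 1 cm-spacing linear array with delay-and-sum weights as a reference point; none of the numbers or designs come from the paper.

```python
# Numerical check of WNG and DF for a linear array beamformer (illustrative).
import numpy as np

def steering_vector(f, mic_pos, theta, c=343.0):
    # Far-field plane wave arriving from angle theta (endfire = 0).
    delays = mic_pos * np.cos(theta) / c
    return np.exp(-2j * np.pi * f * delays)

def wng(w, d):
    # WNG = |w^H d|^2 / (w^H w)
    return np.abs(w.conj() @ d) ** 2 / np.real(w.conj() @ w)

def directivity_factor(w, d, f, mic_pos, c=343.0):
    # Spherically isotropic (diffuse) noise coherence: sinc of the
    # inter-microphone travel time (np.sinc(x) = sin(pi x) / (pi x)).
    dist = np.abs(mic_pos[:, None] - mic_pos[None, :])
    gamma = np.sinc(2 * f * dist / c)
    return np.abs(w.conj() @ d) ** 2 / np.real(w.conj() @ gamma @ w)

M = 6                                  # number of microphones
mic_pos = np.arange(M) * 0.01          # 1 cm spacing
f = 500.0                              # frequency in Hz
d = steering_vector(f, mic_pos, theta=0.0)

# Delay-and-sum weights as a reference (maximal WNG, modest DF).
w = d / M
print("WNG:", 10 * np.log10(wng(w, d)), "dB")
print("DF :", 10 * np.log10(directivity_factor(w, d, f, mic_pos)), "dB")
```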
Citations: 0
End-to-End Neural Speaker Diarization With Non-Autoregressive Attractors
IF 4.1 Tier 2 (Computer Science) Q1 ACOUSTICS Pub Date: 2024-08-07 DOI: 10.1109/TASLP.2024.3439993
Magdalena Rybicka;Jesús Villalba;Thomas Thebaud;Najim Dehak;Konrad Kowalczyk
Despite many recent developments in speaker diarization, making diarization robust and effective in real-life scenarios remains a challenge and an active area of research. Well-established clustering-based methods show good performance and desirable qualities. However, such systems are built from several independent, separately optimized modules, which may cause sub-optimal performance. End-to-end neural speaker diarization (EEND) systems are considered the next stepping stone in pursuing high-performance diarization. Nevertheless, this approach also suffers from limitations, such as dealing with long recordings and scenarios with a large (more than four) or unknown number of speakers in the recording. The appearance of EEND with encoder-decoder-based attractors (EEND-EDA) enabled us to deal with recordings that contain a flexible number of speakers thanks to an LSTM-based EDA module. A competitive alternative to the referenced EEND-EDA baseline is EEND with non-autoregressive attractor (EEND-NAA) estimation, proposed recently by the authors of this article. The NAA back-end incorporates k-means clustering as part of the attractor estimation and an attractor refinement module based on a Transformer decoder. However, in our previous work on EEND-NAA, we assumed a known number of speakers, and the experimental evaluation was limited to 2-speaker recordings only. In this article, we describe our recent EEND-NAA approach in detail and propose further improvements to the EEND-NAA architecture, introducing three novel variants of the NAA back-end that can handle recordings containing speech from a variable and unknown number of speakers. The conducted experiments include simulated mixtures generated using the Switchboard and NIST SRE datasets and real-life recordings from the CALLHOME and DIHARD II datasets. In the experimental evaluation, the proposed systems achieve up to 51% relative improvement in the simulated scenario and up to 15% on real recordings over the baseline EEND-EDA.
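A stripped-down view of the non-autoregressive attractor idea is sketched below: frame-level embeddings are clustered with k-means to obtain initial attractors, and per-frame speaker activities follow from frame-attractor similarities. The Transformer-decoder refinement stage is omitted, and the embedding size, threshold, and similarity rule are illustrative assumptions rather than the paper's configuration.

```python
# Toy non-autoregressive attractor estimation for diarization.
import numpy as np
from sklearn.cluster import KMeans

def naa_diarize(frame_emb, num_speakers, threshold=0.5):
    """frame_emb: (T, D) frame embeddings from a diarization encoder."""
    # 1) Initial attractors as k-means cluster centroids.
    km = KMeans(n_clusters=num_speakers, n_init=10).fit(frame_emb)
    attractors = km.cluster_centers_                 # (S, D)
    # 2) Frame-vs-attractor dot products -> speaker activity posteriors.
    logits = frame_emb @ attractors.T                # (T, S)
    posteriors = 1.0 / (1.0 + np.exp(-logits))
    # 3) Binarize: overlapping speech means several columns active at once.
    return (posteriors > threshold).astype(int), attractors

rng = np.random.default_rng(0)
emb = rng.normal(size=(200, 32))                     # toy encoder output
labels, attractors = naa_diarize(emb, num_speakers=3)
print(labels.shape, attractors.shape)                # (200, 3) (3, 32)
```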
Citations: 0
Towards Lightweight Speaker Verification via Adaptive Neural Network Quantization
IF 4.1 Tier 2 (Computer Science) Q1 ACOUSTICS Pub Date: 2024-08-05 DOI: 10.1109/TASLP.2024.3437237
Bei Liu;Haoyu Wang;Yanmin Qian
Modern speaker verification (SV) systems typically demand expensive storage and computing resources, thereby hindering their deployment on mobile devices. In this paper, we explore adaptive neural network quantization for lightweight speaker verification. Firstly, we propose a novel adaptive uniform precision quantization method which enables the dynamic generation of quantization centroids customized for each network layer based on k-means clustering. By applying it to pre-trained SV systems, we obtain a series of quantized variants with different bit widths. To enhance low-bit quantized models, a mixed precision quantization algorithm along with a multi-stage fine-tuning (MSFT) strategy is further introduced. This approach assigns varying bit widths to different network layers. Once the bit combinations are determined, MSFT progressively quantizes and fine-tunes the network in a specific order. Finally, we design two distinct binary quantization schemes to mitigate the performance degradation of 1-bit quantized models: the static and adaptive quantizers. Experiments on VoxCeleb demonstrate that lossless 4-bit uniform precision quantization is achieved on both ResNets and DF-ResNets, yielding a promising compression ratio of approximately 8x. Moreover, compared to the uniform precision approach, mixed precision quantization not only obtains additional performance improvements with a similar model size but also offers the flexibility to generate a bit combination for any desired model size. In addition, our suggested 1-bit quantization schemes remarkably boost the performance of binarized models. Finally, a thorough comparison with existing lightweight SV systems reveals that our proposed models outperform all previous methods by a large margin across various model size ranges.
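The core of the adaptive uniform precision idea, replacing each layer's weights with the nearest of 2^bits k-means centroids learned for that layer, can be sketched in a few lines. This is a generic illustration only; the paper's multi-stage fine-tuning, mixed precision assignment, and binary quantizers are not reproduced.

```python
# k-means-based per-layer weight quantization (illustrative sketch).
import numpy as np
from sklearn.cluster import KMeans

def kmeans_quantize(weights, bits=4):
    flat = weights.reshape(-1, 1)
    km = KMeans(n_clusters=2 ** bits, n_init=4).fit(flat)
    codebook = km.cluster_centers_.ravel()           # layer-specific centroids
    codes = km.labels_.astype(np.uint8)              # bits-per-weight indices
    dequant = codebook[codes].reshape(weights.shape)
    return codes, codebook, dequant

rng = np.random.default_rng(0)
w = rng.normal(scale=0.05, size=(256, 128))          # toy layer weights
codes, codebook, w_q = kmeans_quantize(w, bits=4)
err = np.mean((w - w_q) ** 2)
print(f"codebook size: {codebook.size}, reconstruction MSE: {err:.2e}")
```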
Citations: 0
SpeechPrompt: Prompting Speech Language Models for Speech Processing Tasks
IF 4.1 Tier 2 (Computer Science) Q1 ACOUSTICS Pub Date: 2024-08-02 DOI: 10.1109/TASLP.2024.3436618
Kai-Wei Chang;Haibin Wu;Yu-Kai Wang;Yuan-Kuei Wu;Hua Shen;Wei-Cheng Tseng;Iu-Thing Kang;Shang-Wen Li;Hung-Yi Lee
Prompting has become a practical method for utilizing pre-trained language models (LMs). This approach offers several advantages. It allows an LM to adapt to new tasks with minimal training and parameter updates, thus achieving efficiency in both storage and computation. Additionally, prompting modifies only the LM's inputs and harnesses the generative capabilities of language models to address various downstream tasks in a unified manner. This significantly reduces the need for human labor in designing task-specific models. These advantages become even more evident as the number of tasks served by the LM scales up. Motivated by the strengths of prompting, we are the first to explore the potential of prompting speech LMs in the domain of speech processing. Recently, there has been a growing interest in converting speech into discrete units for language modeling. Our pioneering research demonstrates that these quantized speech units are highly versatile within our unified prompting framework. Not only can they serve as class labels, but they also contain rich phonetic information that can be re-synthesized back into speech signals for speech generation tasks. Specifically, we reformulate speech processing tasks into speech-to-unit generation tasks. As a result, we can seamlessly integrate tasks such as speech classification, sequence generation, and speech generation within a single, unified prompting framework. The experimental results show that the prompting method can achieve competitive performance compared to a strong fine-tuning method based on self-supervised learning models with a similar number of trainable parameters. The prompting method also shows promising results in the few-shot setting. Moreover, as more advanced speech LMs come onto the stage, the proposed prompting framework shows great potential.
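A minimal sketch of unit-based prompting is given below: a frozen stand-in speech LM operates over discrete units, and only a short sequence of prompt vectors is trained, with all tasks cast as unit generation. The stand-in transformer, vocabulary size, prompt length, and the omission of a verbalizer are assumptions for illustration, not the SpeechPrompt implementation.

```python
# Prompting a frozen unit-based speech LM: only prompt vectors are trainable.
import torch
import torch.nn as nn

class PromptedUnitLM(nn.Module):
    def __init__(self, vocab=1000, d_model=256, prompt_len=10):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, 4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=4)   # stand-in LM
        self.head = nn.Linear(d_model, vocab)
        # Trainable task prompt; the backbone stays frozen.
        self.prompt = nn.Parameter(torch.randn(prompt_len, d_model) * 0.02)
        for p in [*self.embed.parameters(), *self.lm.parameters(),
                  *self.head.parameters()]:
            p.requires_grad = False

    def forward(self, units):
        x = self.embed(units)                                  # (B, T, D)
        p = self.prompt.unsqueeze(0).expand(x.size(0), -1, -1)
        h = self.lm(torch.cat([p, x], dim=1))
        return self.head(h[:, p.size(1):])                     # unit logits

model = PromptedUnitLM()
units = torch.randint(0, 1000, (2, 60))                        # toy input units
logits = model(units)
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(logits.shape, "trainable params:", trainable)            # prompt only
```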
Citations: 0
Artist Similarity Based on Heterogeneous Graph Neural Networks
IF 4.1 Tier 2 (Computer Science) Q1 ACOUSTICS Pub Date: 2024-08-02 DOI: 10.1109/TASLP.2024.3437170
Angelo Cesar Mendes da Silva;Diego Furtado Silva;Ricardo Marcondes Marcacini
Music streaming platforms rely on recommending similar artists to maintain user engagement, with artists benefiting from these suggestions to boost their popularity. Another important feature is music information retrieval, allowing users to explore new content. In both scenarios, performance depends on how to compute the similarity between musical content. This is a challenging process since musical data is inherently multimodal, containing textual and audio data. We propose a novel graph-based artist representation that integrates audio, lyrics features, and artist relations. Thus, a multimodal representation on a heterogeneous graph is proposed, along with a network regularization process followed by a GNN model to aggregate multimodal information into a more robust unified representation. The proposed method explores this final multimodal representation for the task of artist similarity as a link prediction problem. Our method introduces a new importance matrix to emphasize related artists in this multimodal space. We compare our approach with other strong baselines based on combining input features, importance matrix construction, and GNN models. Experimental results highlight the superiority of multimodal representation through the transfer learning process and the value of the importance matrix in enhancing GNN models for artist similarity.
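As a toy illustration of artist similarity framed as link prediction, the snippet below fuses audio and lyrics features for each artist node, performs one step of neighborhood averaging in place of a learned GNN layer, and scores candidate pairs with a dot product. The graph, feature sizes, and scoring rule are illustrative assumptions only, not the paper's heterogeneous-graph model.

```python
# Artist similarity as link prediction on a multimodal node graph (toy sketch).
import numpy as np

rng = np.random.default_rng(0)
n_artists = 6
audio_feat = rng.normal(size=(n_artists, 8))      # e.g. audio embeddings
lyrics_feat = rng.normal(size=(n_artists, 8))     # e.g. lyrics embeddings
x = np.concatenate([audio_feat, lyrics_feat], axis=1)

# Known "similar artist" edges (symmetric adjacency with self-loops).
edges = [(0, 1), (1, 2), (3, 4), (4, 5)]
adj = np.eye(n_artists)
for i, j in edges:
    adj[i, j] = adj[j, i] = 1.0

# One propagation step: average each node with its neighbors, then apply a
# random linear map standing in for learned GNN weights.
deg = adj.sum(axis=1, keepdims=True)
h = (adj / deg) @ x @ rng.normal(scale=0.1, size=(x.shape[1], 16))

def link_score(a, b):
    # Higher dot product -> more likely to be "similar artists".
    return float(h[a] @ h[b])

print("score(0,2) =", link_score(0, 2))   # connected via artist 1
print("score(0,5) =", link_score(0, 5))   # different component
```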
Citations: 0
Room Acoustic Rendering Networks With Control of Scattering and Early Reflections
IF 4.1 Tier 2 (Computer Science) Q1 ACOUSTICS Pub Date: 2024-08-02 DOI: 10.1109/TASLP.2024.3436702
Matteo Scerbo;Lauri Savioja;Enzo De Sena
Room acoustic synthesis can be used in virtual reality (VR), augmented reality (AR) and gaming applications to enhance listeners' sense of immersion, realism and externalisation. A common approach is to use geometrical acoustics (GA) models to compute impulse responses at interactive speed, and fast convolution methods to apply said responses in real time. Alternatively, delay-network-based models are capable of modeling certain aspects of room acoustics, but with a significantly lower computational cost. In order to bridge the gap between these classes of models, recent work introduced delay network designs that approximate Acoustic Radiance Transfer (ART), a geometrical acoustics (GA) model that simulates the transfer of acoustic energy between discrete surface patches in an environment. This paper presents two key extensions of such designs. The first extension involves a new physically-based and stability-preserving design of the feedback matrices, enabling more accurate control of scattering and, more generally, of late reverberation properties. The second extension allows an arbitrary number of early reflections to be modeled with high accuracy, meaning the network can be scaled at will to trade computational cost against early reverberation precision. The proposed extensions are compared to the baseline ART-approximating delay network as well as two reference GA models. The evaluation is based on objective measures of perceptually-relevant features, including frequency-dependent reverberation times, echo density build-up, and early decay time. Results show how the proposed extensions result in a significant improvement over the baseline model, especially for the case of non-convex geometries or the case of unevenly distributed wall absorption, both scenarios of broad practical interest.
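The delay-network family that this work extends can be illustrated with a bare-bones feedback delay network: a handful of delay lines coupled through an energy-preserving feedback matrix scaled below unity so the loop decays. The delay lengths, gain, and random orthogonal matrix below are arbitrary choices; the paper's physically based, stability-preserving matrix design and early-reflection control are not reproduced.

```python
# Minimal feedback delay network (FDN) reverberator, illustrative only.
import numpy as np

def fdn_render(x, delays=(149, 211, 263, 293), g=0.97):
    n = len(delays)
    # Random orthogonal feedback matrix (energy-preserving), scaled by g < 1
    # so that the recirculating energy decays and the network stays stable.
    q, _ = np.linalg.qr(np.random.default_rng(0).normal(size=(n, n)))
    A = g * q
    bufs = [np.zeros(d) for d in delays]
    idx = [0] * n
    y = np.zeros(len(x))
    for t in range(len(x)):
        outs = np.array([bufs[i][idx[i]] for i in range(n)])
        y[t] = outs.sum()
        feedback = A @ outs + x[t]            # inject the input into every line
        for i in range(n):
            bufs[i][idx[i]] = feedback[i]
            idx[i] = (idx[i] + 1) % delays[i]
    return y

impulse = np.zeros(16000)                     # 1 s at 16 kHz
impulse[0] = 1.0
rir = fdn_render(impulse)                     # synthetic reverberant tail
print("energy, first vs. last 100 ms:",
      np.sum(rir[:1600] ** 2), np.sum(rir[-1600:] ** 2))
```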
Citations: 0
A Two-Stage Audio-Visual Fusion Piano Transcription Model Based on the Attention Mechanism
IF 4.1 Tier 2 (Computer Science) Q1 ACOUSTICS Pub Date: 2024-07-30 DOI: 10.1109/TASLP.2024.3426303
Yuqing Li;Xianke Wang;Ruimin Wu;Wei Xu;Wenqing Cheng
Piano transcription is a significant problem in the field of music information retrieval, aiming to obtain symbolic representations of music from captured audio or visual signals. Previous research has mainly focused on single-modal transcription methods using either audio or visual information, yet only a small number of studies are based on audio-visual fusion. To leverage the complementary advantages of both modalities and achieve higher transcription accuracy, we propose a two-stage audio-visual fusion piano transcription model based on the attention mechanism, utilizing both audio and visual information from the piano performance. In the first stage, we propose an audio model and a visual model. The audio model utilizes frequency-domain sparse attention to capture harmonic relationships in the frequency domain, while the visual model includes both CNN and Transformer branches to merge local and global features at different resolutions. In the second stage, we employ cross-attention to learn the correlations between different modalities and the temporal relationships of the sequences. Experimental results on the OMAPS2 dataset show that our model achieves an F1-score of 98.60%, a significant improvement compared with single-modal transcription models.
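For the second-stage fusion, a minimal sketch is shown below: time-aligned audio and visual feature sequences attend to each other with cross-attention before frame-wise key activations are predicted. The feature dimensions, the assumption of aligned frame counts, and the block layout are illustrative, not the paper's model.

```python
# Cross-attention fusion of audio and visual feature streams (toy sketch).
import torch
import torch.nn as nn

class AVFusionTranscriber(nn.Module):
    def __init__(self, d_audio=128, d_video=128, d_model=128, n_keys=88):
        super().__init__()
        self.pa = nn.Linear(d_audio, d_model)
        self.pv = nn.Linear(d_video, d_model)
        # Audio frames query the visual stream, and vice versa.
        self.a2v = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.v2a = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.head = nn.Linear(2 * d_model, n_keys)

    def forward(self, audio_feat, video_feat):
        # Assumes both streams are resampled to the same number of frames.
        a, v = self.pa(audio_feat), self.pv(video_feat)
        a_ctx, _ = self.a2v(a, v, v)             # audio attends to video
        v_ctx, _ = self.v2a(v, a, a)             # video attends to audio
        fused = torch.cat([a + a_ctx, v + v_ctx], dim=-1)
        return torch.sigmoid(self.head(fused))   # per-frame key activations

model = AVFusionTranscriber()
audio = torch.rand(2, 200, 128)   # first-stage audio features (toy)
video = torch.rand(2, 200, 128)   # first-stage visual features (toy)
activations = model(audio, video)
print(activations.shape)          # (2, 200, 88)
```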
Citations: 0
Improving Mispronunciation Detection Using Speech Reconstruction
IF 4.1 Tier 2 (Computer Science) Q1 ACOUSTICS Pub Date: 2024-07-29 DOI: 10.1109/TASLP.2024.3434497
Anurag Das;Ricardo Gutierrez-Osuna
Training related machine learning tasks simultaneously can lead to improved performance on both tasks. Text-to-speech (TTS) and mispronunciation detection and diagnosis (MDD) both operate on phonetic information, and we wanted to examine whether a boost in MDD performance can be obtained by training the two tasks jointly. We propose a network that reconstructs speech from the phones produced by the MDD system and computes a speech reconstruction loss. We hypothesize that the phones produced by the MDD system will be closer to the ground truth if the reconstructed speech sounds closer to the original speech. To test this, we first extract wav2vec features from a pre-trained model and feed them to the MDD system along with the text input. The MDD system predicts the target annotated phones and then synthesizes speech from the predicted phones. The system is therefore trained by computing both a speech reconstruction loss and an MDD loss. Comparing the proposed system against an identical system without speech reconstruction and against another state-of-the-art baseline, we found that the proposed system achieves higher mispronunciation detection and diagnosis (MDD) scores. On a set of sentences unseen during training, the proposed system also achieves higher MDD scores, which suggests that reconstructing the speech signal from the predicted phones helps the system generalize to new test sentences. We also tested whether the system can generate accented speech when the input phones contain mispronunciations. Results from our perceptual experiments show that speech generated from phones containing mispronunciations sounds more accented and less intelligible than speech generated from phones without any mispronunciations, which suggests that the system can identify differences in phones and generate the desired speech signal.
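The joint objective described here can be sketched schematically: a stand-in MDD head predicts phone posteriors from wav2vec features, a stand-in synthesizer maps those posteriors back to a mel spectrogram, and the training loss sums a phone classification term and a reconstruction term. All module internals and the L1 spectrogram loss are placeholders rather than the paper's exact setup.

```python
# Joint MDD + speech-reconstruction training signal (illustrative sketch).
import torch
import torch.nn as nn

class MDDWithReconstruction(nn.Module):
    def __init__(self, d_feat=768, n_phones=45, d_mel=80):
        super().__init__()
        self.mdd = nn.Linear(d_feat, n_phones)                   # stand-in MDD head
        self.synth = nn.GRU(n_phones, d_mel, batch_first=True)   # stand-in synthesizer

    def forward(self, wav2vec_feat):
        phone_logits = self.mdd(wav2vec_feat)                    # (B, T, n_phones)
        recon_mel, _ = self.synth(phone_logits.softmax(-1))      # speech from phones
        return phone_logits, recon_mel

model = MDDWithReconstruction()
feat = torch.rand(2, 120, 768)                 # wav2vec features (toy)
phone_targets = torch.randint(0, 45, (2, 120)) # annotated phones (toy)
ref_mel = torch.rand(2, 120, 80)               # ground-truth mel frames (toy)

logits, recon = model(feat)
mdd_loss = nn.functional.cross_entropy(logits.transpose(1, 2), phone_targets)
recon_loss = nn.functional.l1_loss(recon, ref_mel)
total = mdd_loss + recon_loss                  # joint training objective
print(float(mdd_loss), float(recon_loss), float(total))
```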
Citations: 0