
Latest Interspeech Publications

Zero-Shot Foreign Accent Conversion without a Native Reference
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10664
Waris Quamer, Anurag Das, John M. Levis, E. Chukharev-Hudilainen, R. Gutierrez-Osuna
Previous approaches for foreign accent conversion (FAC) either need a reference utterance from a native speaker (L1) during synthesis, or are dedicated one-to-one systems that must be trained separately for each non-native (L2) speaker. To address both issues, we propose a new FAC system that can transform L2 speech directly from previously unseen speakers. The system consists of two independent modules: a translator and a synthesizer, which operate on bottleneck features derived from phonetic posteriorgrams. The translator is trained to map bottleneck features in L2 utterances into those from a parallel L1 utterance. The synthesizer is a many-to-many system that maps input bottleneck features into the corresponding Mel-spectrograms, conditioned on an embedding from the L2 speaker. During inference, both modules operate in sequence to take an unseen L2 utterance and generate a native-accented Mel-spectrogram. Perceptual experiments show that our system achieves a large reduction (67%) in non-native accentedness compared to a state-of-the-art reference-free system (28.9%) that builds a dedicated model for each L2 speaker. Moreover, 80% of the listeners rated the synthesized utterances to have the same voice identity as the L2 speaker.
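A rough illustration of the two-module pipeline described above is sketched below in PyTorch: a translator maps L2 bottleneck features to native-like ones, and a synthesizer turns them into a Mel-spectrogram conditioned on a speaker embedding. The feature dimensions, layer choices, and variable names are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the translator + synthesizer pipeline (assumed sizes).
import torch
import torch.nn as nn

class Translator(nn.Module):
    """Maps L2 bottleneck features to native-like (L1) bottleneck features."""
    def __init__(self, bn_dim=256, hidden=512):
        super().__init__()
        self.rnn = nn.GRU(bn_dim, hidden, num_layers=2, batch_first=True,
                          bidirectional=True)
        self.proj = nn.Linear(2 * hidden, bn_dim)

    def forward(self, bn_l2):                  # (B, T, bn_dim)
        h, _ = self.rnn(bn_l2)
        return self.proj(h)                    # (B, T, bn_dim)

class Synthesizer(nn.Module):
    """Maps bottleneck features to a Mel-spectrogram, conditioned on a speaker embedding."""
    def __init__(self, bn_dim=256, spk_dim=192, hidden=512, n_mels=80):
        super().__init__()
        self.rnn = nn.GRU(bn_dim + spk_dim, hidden, num_layers=2, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, bn, spk_emb):            # bn: (B, T, bn_dim), spk_emb: (B, spk_dim)
        spk = spk_emb.unsqueeze(1).expand(-1, bn.size(1), -1)
        h, _ = self.rnn(torch.cat([bn, spk], dim=-1))
        return self.proj(h)                    # (B, T, n_mels)

# Inference: unseen L2 utterance -> native-accented Mel in the L2 speaker's voice.
translator, synthesizer = Translator(), Synthesizer()
bn_l2 = torch.randn(1, 120, 256)               # bottleneck features from an ASR front-end
spk_emb = torch.randn(1, 192)                  # embedding of the (unseen) L2 speaker
mel = synthesizer(translator(bn_l2), spk_emb)  # (1, 120, 80)
```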
{"title":"Zero-Shot Foreign Accent Conversion without a Native Reference","authors":"Waris Quamer, Anurag Das, John M. Levis, E. Chukharev-Hudilainen, R. Gutierrez-Osuna","doi":"10.21437/interspeech.2022-10664","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10664","url":null,"abstract":"Previous approaches for foreign accent conversion (FAC) ei-ther need a reference utterance from a native speaker (L1) during synthesis, or are dedicated one-to-one systems that must be trained separately for each non-native (L2) speaker. To address both issues, we propose a new FAC system that can transform L2 speech directly from previously unseen speakers. The system consists of two independent modules: a translator and a synthesizer, which operate on bottleneck features derived from phonetic posteriorgrams. The translator is trained to map bottleneck features in L2 utterances into those from a parallel L1 utterance. The synthesizer is a many-to-many system that maps input bottleneck features into the corresponding Mel-spectrograms, conditioned on an embedding from the L2 speaker. During inference, both modules operate in sequence to take an unseen L2 utterance and generate a native-accented Mel-spectrogram. Perceptual experiments show that our system achieves a large reduction (67%) in non-native accentedness compared to a state-of-the-art reference-free system (28.9%) that builds a dedicated model for each L2 speaker. Moreover, 80% of the listeners rated the synthesized utterances to have the same voice identity as the L2 speaker.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4920-4924"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43009987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Incremental learning for RNN-Transducer based speech recognition models
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10795
Deepak Baby, Pasquale D’Alterio, Valentin Mendelev
This paper investigates an incremental learning framework for a real-world voice assistant employing an RNN-Transducer based automatic speech recognition (ASR) model. Such a model needs to be regularly updated to keep up with the changing distribution of customer requests. We demonstrate that a simple fine-tuning approach with a combination of old and new training data can be used to incrementally update the model, spending only several hours of training time and without any degradation on old data. This paper explores multiple rounds of incremental updates on the ASR model with monthly training data. Results show that the proposed approach achieves 5-6% relative WER improvement over models trained from scratch on the monthly evaluation datasets. In addition, we explore whether it is possible to improve recognition of specific new words. We simulate multiple rounds of incremental updates with a handful of training utterances per word (both real and synthetic) and show that recognition of the new words improves dramatically, but with a minor degradation on general data. Finally, we demonstrate that the observed degradation on general data can be mitigated by interleaving monthly updates with updates targeting specific words.
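The fine-tuning recipe described above can be pictured with a short sketch: each update draws batches from a pool that mixes old and new training data. The datasets, model, and loss below are placeholders, since the paper targets an RNN-Transducer ASR model rather than the toy classifier used here, and the mixing ratio is an assumption.

```python
# Minimal sketch of incremental fine-tuning on a mix of old and new data.
import torch
from torch.utils.data import ConcatDataset, DataLoader, TensorDataset

old_data = TensorDataset(torch.randn(800, 40), torch.randint(0, 10, (800,)))  # stand-in for earlier months
new_data = TensorDataset(torch.randn(200, 40), torch.randint(0, 10, (200,)))  # stand-in for the new month

model = torch.nn.Linear(40, 10)               # stand-in for the RNN-T model
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(ConcatDataset([old_data, new_data]), batch_size=32, shuffle=True)

for epoch in range(2):                        # a few passes of fine-tuning, not training from scratch
    for x, y in loader:
        loss = torch.nn.functional.cross_entropy(model(x), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
```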
{"title":"Incremental learning for RNN-Transducer based speech recognition models","authors":"Deepak Baby, Pasquale D’Alterio, Valentin Mendelev","doi":"10.21437/interspeech.2022-10795","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10795","url":null,"abstract":"This paper investigates an incremental learning framework for a real-world voice assistant employing RNN-Transducer based automatic speech recognition (ASR) model. Such a model needs to be regularly updated to keep up with changing distribution of customer requests. We demonstrate that a simple fine-tuning approach with a combination of old and new training data can be used to incrementally update the model spending only several hours of training time and without any degradation on old data. This paper explores multiple rounds of incremental updates on the ASR model with monthly training data. Results show that the proposed approach achieves 5-6% relative WER improvement over the models trained from scratch on the monthly evaluation datasets. In addition, we explore if it is pos-sible to improve recognition of specific new words. We simulate multiple rounds of incremental updates with handful of training utterances per word (both real and synthetic) and show that the recognition of the new words improves dramatically but with a minor degradation on general data. Finally, we demonstrate that the observed degradation on general data can be mitigated by interleaving monthly updates with updates targeting specific words.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"71-75"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47633462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech Classification
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-402
Parvaneh Janbakhshi, I. Kodrasi
Speech representations that are robust to pathology-unrelated cues, such as speaker identity information, have been shown to be advantageous for automatic dysarthric speech classification. A recently proposed technique to learn speaker identity-invariant representations for dysarthric speech classification is based on adversarial training. However, adversarial training can be challenging, unstable, and sensitive to training parameters. To avoid adversarial training, in this paper we propose to learn speaker identity-invariant representations by exploiting a feature separation framework that relies on mutual information minimization. Experimental results on a database of neurotypical and dysarthric speech show that the proposed adversarial-free framework successfully learns speaker identity-invariant representations. Further, such representations yield a dysarthric speech classification performance similar to that of representations obtained using adversarial training, while the training procedure is more stable and less sensitive to training parameters.
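A minimal sketch of the adversarial-free separation idea is shown below: two encoders produce a pathology code and a speaker code, the classifier uses only the former, and an independence penalty discourages shared information between the two. The paper minimizes a mutual-information estimate; the cross-covariance penalty here is a simpler stand-in for that estimator, and all layer sizes are assumptions.

```python
# Sketch of feature separation without an adversary (independence via a penalty term).
import torch
import torch.nn as nn

enc_patho = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 64))  # pathology branch
enc_spk = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 64))    # speaker branch
clf = nn.Linear(64, 2)                                                          # dysarthric vs. neurotypical

def cross_cov_penalty(a, b):
    """Penalizes statistical dependence between the two codes (stand-in for an MI estimate)."""
    a = a - a.mean(0, keepdim=True)
    b = b - b.mean(0, keepdim=True)
    return (a.T @ b / a.size(0)).pow(2).mean()

x = torch.randn(16, 80)                        # utterance-level acoustic features (placeholder)
y = torch.randint(0, 2, (16,))
z_patho, z_spk = enc_patho(x), enc_spk(x)
loss = nn.functional.cross_entropy(clf(z_patho), y) + 0.1 * cross_cov_penalty(z_patho, z_spk)
loss.backward()
```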
{"title":"Adversarial-Free Speaker Identity-Invariant Representation Learning for Automatic Dysarthric Speech Classification","authors":"Parvaneh Janbakhshi, I. Kodrasi","doi":"10.21437/interspeech.2022-402","DOIUrl":"https://doi.org/10.21437/interspeech.2022-402","url":null,"abstract":"Speech representations which are robust to pathology-unrelated cues such as speaker identity information have been shown to be advantageous for automatic dysarthric speech classification. A recently proposed technique to learn speaker identity-invariant representations for dysarthric speech classification is based on adversarial training. However, adversarial training can be challenging, unstable, and sensitive to training parameters. To avoid adversarial training, in this paper we propose to learn speaker-identity invariant representations exploiting a feature separation framework relying on mutual information minimization. Experimental results on a database of neurotypical and dysarthric speech show that the proposed adversarial-free framework successfully learns speaker identity-invariant representations. Further, it is shown that such representations result in a similar dysarthric speech classification performance as the representations obtained using adversarial training, while the training procedure is more stable and less sensitive to training parameters.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"2138-2142"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48272141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Knowledge distillation for In-memory keyword spotting model
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-633
Zeyang Song, Qi Liu, Qu Yang, Haizhou Li
We study a light-weight implementation of keyword spotting (KWS) for voice command and control that can be implemented on an in-memory computing (IMC) unit with the same accuracy at a lower computational cost than state-of-the-art methods. KWS is expected to be always-on for mobile devices with limited resources. IMC represents one of the solutions. However, it only supports multiplication-accumulation and Boolean operations. We note that common feature extraction methods, such as MFCC and SincConv, are not supported by IMC because they depend on expensive logarithm computation. On the other hand, some neural network solutions to KWS involve a large number of parameters that are not feasible for mobile devices. In this work, we propose a knowledge distillation technique to replace complex speech front-ends such as MFCC or SincConv with a light-weight encoder without performance loss. Experiments show that the proposed model outperforms KWS models with MFCC and SincConv front-ends in terms of accuracy and computational cost.
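The distillation step can be sketched as follows: a fixed MFCC front-end acts as the teacher and a small convolutional encoder learns to reproduce its outputs, avoiding the logarithm computation that IMC does not support. The teacher settings, encoder shape, and frame alignment below are assumptions, not the authors' exact setup.

```python
# Sketch of distilling an MFCC front-end into a lightweight convolutional encoder.
import torch
import torch.nn as nn
import torchaudio

teacher = torchaudio.transforms.MFCC(sample_rate=16000, n_mfcc=40)  # log-based, not IMC-friendly
student = nn.Sequential(                                            # small conv encoder (assumed shape)
    nn.Conv1d(1, 64, kernel_size=400, stride=200, padding=200),
    nn.ReLU(),
    nn.Conv1d(64, 40, kernel_size=3, padding=1),
)

wav = torch.randn(8, 1, 16000)                     # a batch of 1-second utterances
with torch.no_grad():
    target = teacher(wav.squeeze(1))               # (8, 40, T_teacher)
pred = student(wav)                                # (8, 40, T_student)
T = min(pred.size(-1), target.size(-1))            # align frame counts before the loss
loss = nn.functional.mse_loss(pred[..., :T], target[..., :T])
loss.backward()
```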
{"title":"Knowledge distillation for In-memory keyword spotting model","authors":"Zeyang Song, Qi Liu, Qu Yang, Haizhou Li","doi":"10.21437/interspeech.2022-633","DOIUrl":"https://doi.org/10.21437/interspeech.2022-633","url":null,"abstract":"We study a light-weight implementation of keyword spotting (KWS) for voice command and control, that can be implemented on an in-memory computing (IMC) unit with same accuracy at a lower computational cost than the state-of-the-art methods. KWS is expected to be always-on for mobile devices with limited resources. IMC represents one of the solutions. However, it only supports multiplication-accumulation and Boolean operations. We note that common feature extraction methods, such as MFCC and SincConv, are not supported by IMC as they depend on expensive logarithm computing. On the other hand, some neural network solutions to KWS involve a large number of parameters that are not feasible for mobile devices. In this work, we propose a knowledge distillation technique to replace the complex speech frontend like MFCC or SincConv with a light-weight encoder without performance loss. Experiments show that the proposed model outperforms the KWS model with MFCC and SincConv front-end in terms of accuracy and computational cost.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4128-4132"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48292021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Adversarial and Sequential Training for Cross-lingual Prosody Transfer TTS
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-865
Min-Kyung Kim, Joon‐Hyuk Chang
This study presents a method for improving the performance of a text-to-speech (TTS) model by using three global speech-style representations: language, speaker, and prosody. This makes it possible to synthesize different languages and prosody in a speaker's voice, regardless of the speaker's own language and prosody. To construct the embedding of each representation conditioned in the TTS model such that it is independent of the other representations, we propose an adversarial training method for the general architecture of TTS models. Furthermore, we introduce a sequential training method that includes rehearsal-based continual learning to train on complex and small amounts of data without forgetting previously learned information. The experimental results show that the proposed method can generate good-quality speech and yield high similarity for speakers and prosody, even for representations that the speaker in the dataset does not contain.
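One common way to realize the kind of adversarial disentanglement the abstract refers to is a gradient-reversal layer, sketched below: an auxiliary classifier tries to recover speaker identity from the prosody embedding, while reversed gradients push the encoder to remove that information. Gradient reversal is shown here for illustration only; the dimensions and speaker count are assumptions, and the paper's adversarial objective may be implemented differently.

```python
# Sketch of adversarial disentanglement via gradient reversal.
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        return x.clone()
    @staticmethod
    def backward(ctx, grad):
        return -grad                              # flip gradients on the way back

prosody_enc = nn.Sequential(nn.Linear(80, 128), nn.ReLU(), nn.Linear(128, 32))
spk_adv = nn.Linear(32, 10)                       # adversary: guess which of 10 speakers

mel = torch.randn(16, 80)                         # reference features per utterance (placeholder)
spk_id = torch.randint(0, 10, (16,))
z_prosody = prosody_enc(mel)
adv_loss = nn.functional.cross_entropy(spk_adv(GradReverse.apply(z_prosody)), spk_id)
adv_loss.backward()                               # adversary improves; encoder strips speaker cues
```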
{"title":"Adversarial and Sequential Training for Cross-lingual Prosody Transfer TTS","authors":"Min-Kyung Kim, Joon‐Hyuk Chang","doi":"10.21437/interspeech.2022-865","DOIUrl":"https://doi.org/10.21437/interspeech.2022-865","url":null,"abstract":"This study presents a method for improving the performance of the text-to-speech (TTS) model by using three global speech-style representations: language, speaker, and prosody. Synthesizing different languages and prosody in the speaker’s voice regardless of their own language and prosody is possi-ble. To construct the embedding of each representation conditioned in the TTS model such that it is independent of the other representations, we propose an adversarial training method for the general architecture of TTS models. Furthermore, we introduce a sequential training method that includes rehearsal-based continual learning to train complex and small amounts of data without forgetting previously learned information. The experimental results show that the proposed method can generate good-quality speech and yield high similarity for speakers and prosody, even for representations that the speaker in the dataset does not contain.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4556-4560"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"46991331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Phonetic Analysis of Self-supervised Representations of English Speech
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10884
Dan Wells, Hao Tang, Korin Richmond
We present an analysis of discrete units discovered via self-supervised representation learning on English speech. We focus on units produced by a pre-trained HuBERT model due to its wide adoption in ASR, speech synthesis, and many other tasks. Whereas previous work has evaluated the quality of such quantization models in aggregate over all phones for a given language, we break our analysis down into broad phonetic classes, taking into account specific aspects of their articulation when considering their alignment to discrete units. We find that these units correspond to sub-phonetic events, and that fine dynamics such as the distinct closure and release portions of plosives tend to be represented by sequences of discrete units. Our work provides a reference for the phonetic properties of discrete units discovered by HuBERT, facilitating analyses of many speech applications based on this model.
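The kind of alignment analysis described above can be sketched as a simple tally: given frame-level discrete unit IDs and time-aligned phone labels, count how each unit distributes over broad phonetic classes. The inputs below are synthetic placeholders standing in for HuBERT+k-means units and a forced alignment, and the tiny phone inventory is an assumption.

```python
# Sketch of tallying discrete units against broad phonetic classes.
import numpy as np

BROAD = {"p": "plosive", "t": "plosive", "s": "fricative", "ae": "vowel", "iy": "vowel"}

units = np.random.randint(0, 100, size=5000)        # one discrete unit per frame (placeholder)
phones = np.random.choice(list(BROAD), size=5000)   # aligned phone label per frame (placeholder)

counts = {}                                          # (unit, broad class) -> frame count
for u, p in zip(units, phones):
    counts[(u, BROAD[p])] = counts.get((u, BROAD[p]), 0) + 1

# For each unit, report the broad class it most often aligns with and how exclusive it is.
classes = set(BROAD.values())
for u in range(5):
    row = {c: counts.get((u, c), 0) for c in classes}
    total = sum(row.values()) or 1
    best = max(row, key=row.get)
    print(f"unit {u:3d}: mostly {best:9s} ({row[best] / total:.0%} of its frames)")
```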
{"title":"Phonetic Analysis of Self-supervised Representations of English Speech","authors":"Dan Wells, Hao Tang, Korin Richmond","doi":"10.21437/interspeech.2022-10884","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10884","url":null,"abstract":"We present an analysis of discrete units discovered via self-supervised representation learning on English speech. We focus on units produced by a pre-trained HuBERT model due to its wide adoption in ASR, speech synthesis, and many other tasks. Whereas previous work has evaluated the quality of such quantization models in aggregate over all phones for a given language, we break our analysis down into broad phonetic classes, taking into account specific aspects of their articulation when consid-ering their alignment to discrete units. We find that these units correspond to sub-phonetic events, and that fine dynamics such as the distinct closure and release portions of plosives tend to be represented by sequences of discrete units. Our work provides a reference for the phonetic properties of discrete units discovered by HuBERT, facilitating analyses of many speech applications based on this model.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3583-3587"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47141667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
W2V2-Light: A Lightweight Version of Wav2vec 2.0 for Automatic Speech Recognition
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10339
Dong-Hyun Kim, Jaehwan Lee, J. Mo, Joon‐Hyuk Chang
Wav2vec 2.0 (W2V2) has shown remarkable speech recognition performance by pre-training only with unlabeled data and fine-tuning with a small amount of labeled data. However, the practical application of W2V2 is hindered by hardware memory limitations, as it contains 317 million parameters. To address this issue, we propose W2V2-Light, a lightweight version of W2V2. We introduce two simple sharing methods to reduce the memory consumption as well as the computational costs of W2V2. Compared to W2V2, our model has 91% fewer parameters and a speedup of 1.31 times, with minor degradation in downstream task performance. Moreover, by quantifying the stability of representations, we provide an empirical insight into why our model is capable of maintaining competitive performance despite the significant reduction in memory.
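The abstract does not spell out the two sharing schemes, so the sketch below shows one generic way to shrink a Transformer stack: reuse a single encoder layer across all depths (cross-layer weight sharing). It illustrates the general memory-saving idea only and should not be read as the actual W2V2-Light design; all dimensions are assumptions.

```python
# Sketch of cross-layer weight sharing versus an unshared Transformer stack.
import torch
import torch.nn as nn

class SharedEncoder(nn.Module):
    def __init__(self, dim=768, heads=12, depth=12):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                                batch_first=True)   # one set of weights
        self.depth = depth

    def forward(self, x):
        for _ in range(self.depth):                                 # the same layer applied 12 times
            x = self.layer(x)
        return x

shared = SharedEncoder()
baseline = nn.TransformerEncoder(nn.TransformerEncoderLayer(768, 12, batch_first=True), 12)
count = lambda m: sum(p.numel() for p in m.parameters())
print(f"shared: {count(shared) / 1e6:.1f}M params vs. unshared: {count(baseline) / 1e6:.1f}M")
out = shared(torch.randn(2, 50, 768))                               # (2, 50, 768)
```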
{"title":"W2V2-Light: A Lightweight Version of Wav2vec 2.0 for Automatic Speech Recognition","authors":"Dong-Hyun Kim, Jaehwan Lee, J. Mo, Joon‐Hyuk Chang","doi":"10.21437/interspeech.2022-10339","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10339","url":null,"abstract":"Wav2vec 2.0 (W2V2) has shown remarkable speech recognition performance by pre-training only with unlabeled data and fine-tuning with a small amount of labeled data. However, the practical application of W2V2 is hindered by hardware memory limitations, as it contains 317 million parameters. To ad-dress this issue, we propose W2V2-Light, a lightweight version of W2V2. We introduce two simple sharing methods to reduce the memory consumption as well as the computational costs of W2V2. Compared to W2V2, our model has 91% lesser parameters and a speedup of 1.31 times with minor degradation in downstream task performance. Moreover, by quantifying the stability of representations, we provide an empirical insight into why our model is capable of maintaining competitive performance despite the significant reduction in memory","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"3038-3042"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47360779","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
Autoencoder-Based Tongue Shape Estimation During Continuous Speech
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10272
Vinicius Ribeiro, Y. Laprie
Vocal tract shape estimation is a necessary step for articulatory speech synthesis. However, the literature on the topic is scarce, and most current methods fail to satisfy many physical constraints related to speech production. This study proposes an alternative approach to the task to solve specific issues faced in previous work, especially those related to critical articulators. We present an autoencoder-based method for tongue shape estimation during continuous speech. An autoencoder is trained to learn the data's encoding and serves as an auxiliary network for the principal one, which maps phonemes to the shapes. Instead of predicting the exact points in the target curve, the neural network learns how to predict the curve's main components, i.e., the autoencoder's representation. We show how this approach allows imposing critical articulators' constraints, controlling the tongue shape through the latent space, and generating a smooth output without relying on any postprocessing method.
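The two-stage idea can be sketched as follows: an autoencoder first learns a compact code for tongue contours, and the phoneme-to-shape network then predicts that code instead of raw contour points, with the decoder turning codes back into smooth contours. Contour size, latent size, phoneme inventory, and training details are assumptions for illustration.

```python
# Sketch of predicting an autoencoder's latent code instead of raw tongue contour points.
import torch
import torch.nn as nn

N_POINTS, LATENT, N_PHONES = 50, 8, 40

encoder = nn.Sequential(nn.Linear(2 * N_POINTS, 64), nn.ReLU(), nn.Linear(64, LATENT))
decoder = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, 2 * N_POINTS))

# Stage 1: train the autoencoder on measured contours (one step shown).
contours = torch.randn(32, 2 * N_POINTS)               # (x, y) contour points, flattened
recon_loss = nn.functional.mse_loss(decoder(encoder(contours)), contours)
recon_loss.backward()

# Stage 2: the principal network predicts the latent code from phoneme identity.
predictor = nn.Sequential(nn.Embedding(N_PHONES, 64), nn.ReLU(), nn.Linear(64, LATENT))
phones = torch.randint(0, N_PHONES, (32,))
with torch.no_grad():
    target_code = encoder(contours)                    # codes of the measured shapes
code_loss = nn.functional.mse_loss(predictor(phones), target_code)
code_loss.backward()

predicted_contour = decoder(predictor(phones))         # (32, 100): decoded tongue shapes
```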
{"title":"Autoencoder-Based Tongue Shape Estimation During Continuous Speech","authors":"Vinicius Ribeiro, Y. Laprie","doi":"10.21437/interspeech.2022-10272","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10272","url":null,"abstract":"Vocal tract shape estimation is a necessary step for articulatory speech synthesis. However, the literature on the topic is scarce, and most current methods lack adequacy to many physical constraints related to speech production. This study proposes an alternative approach to the task to solve specific issues faced in the previous work, especially those related to critical ar-ticulators. We present an autoencoder-based method for tongue shape estimation during continuous speech. An autoencoder is trained to learn the data’s encoding and serves as an auxiliary network for the principal one, which maps phonemes to the shapes. Instead of predicting the exact points in the target curve, the neural network learns how to predict the curve’s main components, i.e., the autoencoder’s representation. We show how this approach allows imposing critical articulators’ constraints, controlling the tongue shape through the latent space, and generating a smooth output without relying on any postprocessing method.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"86-90"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44213806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-477
Yifan Sun, Qinlong Huang, Xihong Wu
Acoustic and articulatory variability across speakers has always limited the generalization performance of acoustic-to-articulatory inversion (AAI) methods. Speaker-independent AAI (SI-AAI) methods generally focus on the transformation of acoustic features, but rarely consider direct matching in the articulatory space. Unsupervised AAI methods have the potential for better generalization but typically use a fixed morphological setting of a physical articulatory synthesizer even for different speakers, which may cause non-negligible articulatory compensation. In this paper, we propose to jointly estimate articulatory movements and vocal tract anatomy during the inversion of speech. An unsupervised AAI framework is employed, where the estimated vocal tract anatomy is used to set the configuration of a physical articulatory synthesizer, which in turn is driven by the estimated articulation movements to imitate a given speech signal. Experiments show that the estimation of vocal tract anatomy can bring both acoustic and articulatory benefits. Acoustically, the reconstruction quality is higher; articulatorily, the estimated articulatory movement trajectories better match the measured ones. Moreover, the estimated anatomy parameters show clear clusterings by speaker, indicating successful decoupling of speaker characteristics and linguistic content.
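The joint estimation can be pictured as an analysis-by-synthesis loop, sketched below. The real system drives a physical articulatory synthesizer; here a small neural network stands in for it so the whole loop stays differentiable, and every dimension and hyperparameter is an assumption.

```python
# Sketch of jointly fitting articulation trajectories and anatomy parameters to an utterance.
import torch
import torch.nn as nn

N_ART, N_ANAT, N_MEL, T = 10, 6, 80, 120

synth_proxy = nn.Sequential(nn.Linear(N_ART + N_ANAT, 128), nn.Tanh(),
                            nn.Linear(128, N_MEL))            # stand-in for the articulatory synthesizer

target_mel = torch.randn(T, N_MEL)                            # the speech to be imitated (placeholder)
articulation = torch.zeros(T, N_ART, requires_grad=True)      # per-frame articulator trajectories
anatomy = torch.zeros(1, N_ANAT, requires_grad=True)          # per-speaker vocal tract anatomy

opt = torch.optim.Adam([articulation, anatomy], lr=1e-2)
for step in range(200):                                       # fit both jointly to the utterance
    pred = synth_proxy(torch.cat([articulation, anatomy.expand(T, -1)], dim=-1))
    loss = nn.functional.mse_loss(pred, target_mel)
    opt.zero_grad()
    loss.backward()
    opt.step()
```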
{"title":"Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy","authors":"Yifan Sun, Qinlong Huang, Xihong Wu","doi":"10.21437/interspeech.2022-477","DOIUrl":"https://doi.org/10.21437/interspeech.2022-477","url":null,"abstract":"Acoustic and articulatory variability across speakers has al-ways limited the generalization performance of acoustic-to-articulatory inversion (AAI) methods. Speaker-independent AAI (SI-AAI) methods generally focus on the transformation of acoustic features, but rarely consider the direct matching in the articulatory space. Unsupervised AAI methods have the potential of better generalization ability but typically use a fixed mor-phological setting of a physical articulatory synthesizer even for different speakers, which may cause nonnegligible articulatory compensation. In this paper, we propose to jointly estimate articulatory movements and vocal tract anatomy during the inversion of speech. An unsupervised AAI framework is employed, where estimated vocal tract anatomy is used to set the configuration of a physical articulatory synthesizer, which in turn is driven by estimated articulation movements to imitate a given speech. Experiments show that the estimation of vocal tract anatomy can bring both acoustic and articulatory benefits. Acoustically, the reconstruction quality is higher; articulatorily, the estimated articulatory movement trajectories better match the measured ones. Moreover, the estimated anatomy parameters show clear clusterings by speakers, indicating successful decoupling of speaker characteristics and linguistic content.","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4656-4660"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44404742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Combining Simple but Novel Data Augmentation Methods for Improving Conformer ASR
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10835
Ronit Damania, Christopher Homan, Emily Tucker Prud'hommeaux
{"title":"Combining Simple but Novel Data Augmentation Methods for Improving Conformer ASR","authors":"Ronit Damania, Christopher Homan, Emily Tucker Prud'hommeaux","doi":"10.21437/interspeech.2022-10835","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10835","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4890-4894"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44483829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0