
Latest Interspeech Publications

Adversarial and Sequential Training for Cross-lingual Prosody Transfer TTS
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-865
Min-Kyung Kim, Joon‐Hyuk Chang
This study presents a method for improving the performance of the text-to-speech (TTS) model by using three global speech-style representations: language, speaker, and prosody. Synthesizing different languages and prosody in the speaker’s voice regardless of their own language and prosody is possible. To construct the embedding of each representation conditioned in the TTS model such that it is independent of the other representations, we propose an adversarial training method for the general architecture of TTS models. Furthermore, we introduce a sequential training method that includes rehearsal-based continual learning to train complex and small amounts of data without forgetting previously learned information. The experimental results show that the proposed method can generate good-quality speech and yield high similarity for speakers and prosody, even for representations that the speaker in the dataset does not contain.
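The abstract does not spell out how the adversarial objective is realised; one common way to push an embedding to discard an unwanted attribute (for example, making a prosody embedding carry no speaker information) is an auxiliary classifier trained through a gradient reversal layer. The PyTorch sketch below illustrates that generic mechanism only; the layer sizes, classifier head, and usage at the end are illustrative assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; reverses (and scales) gradients on backward."""
    @staticmethod
    def forward(ctx, x, lambd):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

class AdversarialClassifier(nn.Module):
    """Predicts an unwanted attribute (e.g. speaker id) from an embedding.
    Because it sits behind a gradient reversal layer, minimising its loss
    pushes the embedding to discard that attribute."""
    def __init__(self, emb_dim, n_classes, lambd=1.0):
        super().__init__()
        self.lambd = lambd
        self.net = nn.Sequential(nn.Linear(emb_dim, 256), nn.ReLU(),
                                 nn.Linear(256, n_classes))

    def forward(self, emb):
        return self.net(GradReverse.apply(emb, self.lambd))

# Hypothetical usage: encourage a prosody embedding to be speaker-independent.
prosody_emb = torch.randn(8, 128, requires_grad=True)  # stand-in for a prosody encoder output
speaker_ids = torch.randint(0, 10, (8,))
adv = AdversarialClassifier(emb_dim=128, n_classes=10)
loss_adv = nn.functional.cross_entropy(adv(prosody_emb), speaker_ids)
loss_adv.backward()  # gradients reaching prosody_emb are reversed
print(prosody_emb.grad.shape)
```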
Citations: 1
Phonetic Analysis of Self-supervised Representations of English Speech
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10884
Dan Wells, Hao Tang, Korin Richmond
We present an analysis of discrete units discovered via self-supervised representation learning on English speech. We focus on units produced by a pre-trained HuBERT model due to its wide adoption in ASR, speech synthesis, and many other tasks. Whereas previous work has evaluated the quality of such quantization models in aggregate over all phones for a given language, we break our analysis down into broad phonetic classes, taking into account specific aspects of their articulation when considering their alignment to discrete units. We find that these units correspond to sub-phonetic events, and that fine dynamics such as the distinct closure and release portions of plosives tend to be represented by sequences of discrete units. Our work provides a reference for the phonetic properties of discrete units discovered by HuBERT, facilitating analyses of many speech applications based on this model.
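HuBERT itself does not emit discrete units; a widely used recipe (following the original HuBERT setup) extracts features from an intermediate transformer layer and quantizes them with k-means. The sketch below follows that recipe with torchaudio's pre-trained HuBERT Base; the layer index, cluster count, and file paths are illustrative assumptions, not the exact configuration analysed in the paper.

```python
import torch
import torchaudio
from sklearn.cluster import KMeans

# Pre-trained HuBERT Base (12 transformer layers, 20 ms frame hop at 16 kHz).
bundle = torchaudio.pipelines.HUBERT_BASE
model = bundle.get_model().eval()

def layer_features(wav_path, layer=6):
    """Frame-level features (T, D) from one transformer layer."""
    wav, sr = torchaudio.load(wav_path)
    wav = wav.mean(0, keepdim=True)                       # force mono
    wav = torchaudio.functional.resample(wav, sr, bundle.sample_rate)
    with torch.no_grad():
        feats, _ = model.extract_features(wav)            # list of (1, T, D), one per layer
    return feats[layer - 1].squeeze(0)

# Fit k-means on features pooled over a (here tiny, hypothetical) corpus, then map
# each frame to its nearest centroid id -- that id sequence is the "discrete units".
files = ["utt1.wav", "utt2.wav"]                          # hypothetical paths
pooled = torch.cat([layer_features(f) for f in files]).numpy()
kmeans = KMeans(n_clusters=100, n_init=10).fit(pooled)
units = kmeans.predict(layer_features("utt1.wav").numpy())
print(units[:20])   # e.g. runs of the same unit within a single phone
```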
Citations: 8
W2V2-Light: A Lightweight Version of Wav2vec 2.0 for Automatic Speech Recognition
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10339
Dong-Hyun Kim, Jaehwan Lee, J. Mo, Joon‐Hyuk Chang
Wav2vec 2.0 (W2V2) has shown remarkable speech recognition performance by pre-training only with unlabeled data and fine-tuning with a small amount of labeled data. However, the practical application of W2V2 is hindered by hardware memory limitations, as it contains 317 million parameters. To address this issue, we propose W2V2-Light, a lightweight version of W2V2. We introduce two simple sharing methods to reduce the memory consumption as well as the computational costs of W2V2. Compared to W2V2, our model has 91% fewer parameters and a speedup of 1.31 times with minor degradation in downstream task performance. Moreover, by quantifying the stability of representations, we provide an empirical insight into why our model is capable of maintaining competitive performance despite the significant reduction in memory.
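The abstract does not detail the two sharing methods; a generic way to shrink a transformer stack is cross-layer weight sharing (reusing one layer's parameters across the full depth, as in ALBERT). The sketch below shows that idea and its effect on parameter count; it illustrates the general technique, not the actual W2V2-Light design.

```python
import torch
import torch.nn as nn

class SharedLayerEncoder(nn.Module):
    """Transformer encoder that reuses a small set of layer weights across depth.
    With n_unique=1 and depth=12, the stack stores 1/12 of the usual layer parameters."""
    def __init__(self, d_model=768, nhead=12, depth=12, n_unique=1):
        super().__init__()
        self.depth = depth
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            for _ in range(n_unique)
        )

    def forward(self, x):
        for i in range(self.depth):
            x = self.layers[i % len(self.layers)](x)   # the same weights are reused
        return x

shared, full = SharedLayerEncoder(n_unique=1), SharedLayerEncoder(n_unique=12)
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(shared) / count(full))             # ~0.083, i.e. roughly a 12x reduction in layer weights
print(shared(torch.randn(2, 50, 768)).shape)   # torch.Size([2, 50, 768])
```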
Citations: 4
Autoencoder-Based Tongue Shape Estimation During Continuous Speech
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10272
Vinicius Ribeiro, Y. Laprie
Vocal tract shape estimation is a necessary step for articulatory speech synthesis. However, the literature on the topic is scarce, and most current methods lack adequacy to many physical constraints related to speech production. This study proposes an alternative approach to the task to solve specific issues faced in the previous work, especially those related to critical articulators. We present an autoencoder-based method for tongue shape estimation during continuous speech. An autoencoder is trained to learn the data’s encoding and serves as an auxiliary network for the principal one, which maps phonemes to the shapes. Instead of predicting the exact points in the target curve, the neural network learns how to predict the curve’s main components, i.e., the autoencoder’s representation. We show how this approach allows imposing critical articulators’ constraints, controlling the tongue shape through the latent space, and generating a smooth output without relying on any postprocessing method.
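As a rough illustration of the two-network setup described above (an autoencoder that learns a compact code for tongue contours, plus a principal network mapping phoneme sequences to that code), the following sketch uses hypothetical dimensions (50 contour points, an 8-dimensional latent) and stand-in data; the paper's losses, articulatory constraints, and exact architecture are not reproduced.

```python
import torch
import torch.nn as nn

N_POINTS, LATENT = 50, 8        # hypothetical: 50 (x, y) contour points, 8-dim code

class ContourAutoencoder(nn.Module):
    """Compresses a tongue contour into a small code and decodes it back."""
    def __init__(self):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(2 * N_POINTS, 64), nn.ReLU(), nn.Linear(64, LATENT))
        self.dec = nn.Sequential(nn.Linear(LATENT, 64), nn.ReLU(), nn.Linear(64, 2 * N_POINTS))

    def forward(self, contour):                  # (B, 2 * N_POINTS)
        z = self.enc(contour)
        return self.dec(z), z

class PhonemeToLatent(nn.Module):
    """Principal network: predicts the autoencoder's code from phonemes instead of
    predicting raw contour points directly."""
    def __init__(self, n_phonemes=40):
        super().__init__()
        self.emb = nn.Embedding(n_phonemes, 32)
        self.rnn = nn.GRU(32, 64, batch_first=True)
        self.out = nn.Linear(64, LATENT)

    def forward(self, phoneme_ids):              # (B, T) integer ids
        h, _ = self.rnn(self.emb(phoneme_ids))
        return self.out(h)                       # (B, T, LATENT)

# Step 1: fit the autoencoder on measured contours; step 2: train PhonemeToLatent to
# match the encoder's codes (or the decoded contours), keeping the decoder frozen.
ae, p2l = ContourAutoencoder(), PhonemeToLatent()
contours = torch.randn(4, 2 * N_POINTS)          # stand-in for real contour data
recon, z = ae(contours)
recon_loss = nn.functional.mse_loss(recon, contours)
pred_codes = p2l(torch.randint(0, 40, (4, 10)))  # one code per phoneme frame
print(recon.shape, pred_codes.shape)
```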
Citations: 2
Unsupervised Acoustic-to-Articulatory Inversion with Variable Vocal Tract Anatomy
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-477
Yifan Sun, Qinlong Huang, Xihong Wu
Acoustic and articulatory variability across speakers has always limited the generalization performance of acoustic-to-articulatory inversion (AAI) methods. Speaker-independent AAI (SI-AAI) methods generally focus on the transformation of acoustic features, but rarely consider the direct matching in the articulatory space. Unsupervised AAI methods have the potential of better generalization ability but typically use a fixed morphological setting of a physical articulatory synthesizer even for different speakers, which may cause nonnegligible articulatory compensation. In this paper, we propose to jointly estimate articulatory movements and vocal tract anatomy during the inversion of speech. An unsupervised AAI framework is employed, where estimated vocal tract anatomy is used to set the configuration of a physical articulatory synthesizer, which in turn is driven by estimated articulation movements to imitate a given speech. Experiments show that the estimation of vocal tract anatomy can bring both acoustic and articulatory benefits. Acoustically, the reconstruction quality is higher; articulatorily, the estimated articulatory movement trajectories better match the measured ones. Moreover, the estimated anatomy parameters show clear clusterings by speakers, indicating successful decoupling of speaker characteristics and linguistic content.
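The abstract describes an analysis-by-synthesis loop in which a per-frame articulatory trajectory and a per-speaker anatomy vector jointly drive a synthesizer to imitate the given speech. The sketch below only reproduces that optimization pattern: the physical articulatory synthesizer is replaced by a small stand-in MLP so the code runs, and the dimensions, loss terms, and smoothness penalty are assumptions rather than the paper's actual choices.

```python
import torch
import torch.nn as nn

A, K, N_MELS, T = 6, 4, 80, 120   # assumed sizes: articulation dims, anatomy dims, mel bins, frames

# Stand-in for a differentiable articulatory synthesizer: articulation + anatomy -> spectrogram frame.
synth = nn.Sequential(nn.Linear(A + K, 128), nn.Tanh(), nn.Linear(128, N_MELS))

target_spec = torch.randn(T, N_MELS)            # stand-in for the utterance to be imitated

# Jointly optimised variables: a time-varying trajectory and a single anatomy vector
# that stays constant over the utterance (one per speaker).
traj = torch.zeros(T, A, requires_grad=True)
anatomy = torch.zeros(K, requires_grad=True)
opt = torch.optim.Adam([traj, anatomy], lr=1e-2)

for step in range(200):
    opt.zero_grad()
    frames = synth(torch.cat([traj, anatomy.expand(T, K)], dim=-1))
    loss = nn.functional.mse_loss(frames, target_spec)            # acoustic match
    loss = loss + 1e-3 * (traj[1:] - traj[:-1]).pow(2).mean()     # favour smooth articulation
    loss.backward()
    opt.step()

print(loss.item())
```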
Citations: 2
Combining Simple but Novel Data Augmentation Methods for Improving Conformer ASR
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10835
Ronit Damania, Christopher Homan, Emily Tucker Prud'hommeaux
{"title":"Combining Simple but Novel Data Augmentation Methods for Improving Conformer ASR","authors":"Ronit Damania, Christopher Homan, Emily Tucker Prud'hommeaux","doi":"10.21437/interspeech.2022-10835","DOIUrl":"https://doi.org/10.21437/interspeech.2022-10835","url":null,"abstract":"","PeriodicalId":73500,"journal":{"name":"Interspeech","volume":"1 1","pages":"4890-4894"},"PeriodicalIF":0.0,"publicationDate":"2022-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44483829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Streaming model for Acoustic to Articulatory Inversion with transformer networks
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10159
Sathvik Udupa, Aravind Illa, P. Ghosh
Estimating speech articulatory movements from speech acoustics is known as Acoustic to Articulatory Inversion (AAI). Recently, transformer-based AAI models have been shown to achieve state-of-the-art performance. However, in transformer networks, the attention is applied over the whole utterance, thereby needing to obtain the full utterance before the inference, which leads to high latency and is impractical for streaming AAI. To enable streaming during inference, evaluation could be performed on non-overlapping chunks instead of a full utterance. However, due to a mismatch of the attention receptive field during training and evaluation, there could be a drop in AAI performance. To overcome this scenario, in this work we perform experiments with different attention masks and use context from previous predictions during training. Experimental results revealed that using the random start mask attention with the context from previous predictions of the transformer decoder performs better than the baseline results.
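The paper's "random start mask" with context from previous decoder predictions is specific to this work; the underlying mechanism is restricting self-attention to non-overlapping chunks (plus limited left context) through an attention mask so that inference can proceed chunk by chunk. A minimal sketch of such a chunk mask, assuming the boolean-mask convention used by PyTorch's transformer layers:

```python
import torch

def chunk_attention_mask(seq_len, chunk_size, left_context=1):
    """Boolean mask (True = blocked): each frame may attend only to frames in its
    own chunk and in up to `left_context` preceding chunks."""
    chunk_id = torch.arange(seq_len) // chunk_size
    q = chunk_id.unsqueeze(1)        # query chunk index
    k = chunk_id.unsqueeze(0)        # key chunk index
    allowed = (k <= q) & (k >= q - left_context)
    return ~allowed

mask = chunk_attention_mask(seq_len=12, chunk_size=4, left_context=1)
layer = torch.nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
y = layer(torch.randn(2, 12, 64), src_mask=mask)   # attention limited to chunk + left context
print(mask.int())
```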
Citations: 2
Gram Vaani ASR Challenge on spontaneous telephone speech recordings in regional variations of Hindi
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-11371
Anish Bhanushali, Grant Bridgman, Deekshitha G, P. Ghosh, Pratik Kumar, Saurabh Kumar, Adithya Raj Kolladath, Nithya Ravi, Aaditeshwar Seth, Ashish Seth, Abhayjeet Singh, Vrunda N. Sukhadia, Umesh S, Sathvik Udupa, L. D. Prasad
This paper describes the corpus and baseline systems for the Gram Vaani Automatic Speech Recognition (ASR) challenge in regional variations of Hindi. The corpus for this challenge comprises the spontaneous telephone speech recordings collected by a social technology enterprise, Gram Vaani. The regional variations of Hindi together with spontaneity of speech, natural background and transcriptions with variable accuracy due to crowdsourcing make it a unique corpus for ASR on spontaneous telephonic speech. Around 1108 hours of real-world spontaneous speech recordings, including 1000 hours of unlabelled training data, 100 hours of labelled training data, 5 hours of development data and 3 hours of evaluation data, have been released as a part of the challenge. The efficacy of both training and test sets are validated on different ASR systems in both traditional time-delay neural network-hidden Markov model (TDNN-HMM) frameworks and fully-neural end-to-end (E2E) setup. The word error rate (WER) and character error rate (CER) on the eval set for a TDNN model trained on 100 hours of labelled data are 29.7% and 15.1%, respectively, while in the E2E setup, the WER and CER on the eval set for a conformer model trained on 100 hours of data are 32.9% and 19.0%, respectively.
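The WER and CER figures above are the standard edit-distance metrics; for reference, a minimal implementation (independent of the challenge's own scoring scripts) is sketched below.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (single rolling row)."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (r != h))
    return d[-1]

def wer(ref_text, hyp_text):
    """Word error rate: word-level edit distance over the number of reference words."""
    ref, hyp = ref_text.split(), hyp_text.split()
    return edit_distance(ref, hyp) / max(len(ref), 1)

def cer(ref_text, hyp_text):
    """Character error rate: the same computation over characters."""
    return edit_distance(list(ref_text), list(hyp_text)) / max(len(ref_text), 1)

print(wer("the cat sat", "the cat sat down"))   # 0.333... (one insertion over three words)
print(cer("speech", "speach"))                  # 0.166... (one substitution over six characters)
```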
Citations: 4
Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10483
Hang Chen, Jun Du, Yusheng Dai, Chin-Hui Lee, S. Siniscalchi, Shinji Watanabe, O. Scharenborg, Jingdong Chen, Baocai Yin, Jia Pan
In this paper, we present the updated Audio-Visual Speech Recognition (AVSR) corpus of the MISP2021 challenge, a large-scale audio-visual Chinese conversational corpus consisting of 141 hours of audio and video data collected by far/middle/near microphones and far/middle cameras in 34 real-home TV rooms. To our best knowledge, our corpus is the first distant multi-microphone conversational Chinese audio-visual corpus and the first large vocabulary continuous Chinese lip-reading dataset in the adverse home-TV scenario. Moreover, we make a deep analysis of the corpus and conduct a comprehensive ablation study of all audio and video data in the audio-only/video-only/audio-visual systems. Error analysis shows video modality supplements acoustic information degraded by noise to reduce deletion errors and provides discriminative information in overlapping speech to reduce substitution errors. Finally, we also design a set of experiments such as frontend, data augmentation and end-to-end models for providing the direction of potential future work. The corpus and the code are released to promote research not only in the speech area but also in the computer vision area and cross-disciplinary research.
Citations: 12
Data Augmentation for End-to-end Silent Speech Recognition for Laryngectomees
Pub Date : 2022-09-18 DOI: 10.21437/interspeech.2022-10868
Beiming Cao, Kristin J. Teplansky, Nordine Sebkhi, Arpan Bhavsar, O. Inan, Robin A. Samlan, T. Mau, Jun Wang
Silent speech recognition (SSR) predicts textual information from silent articulation, which is an algorithm design in silent speech interfaces (SSIs). SSIs have the potential of recovering the speech ability of individuals who lost their voice but can still articulate (e.g., laryngectomees). Due to the logistic difficulties in articulatory data collection, current SSR studies suffer from a limited amount of data. Data augmentation aims to increase the training data amount by introducing variations into the existing dataset, but has rarely been investigated in SSR for laryngectomees. In this study, we investigated the effectiveness of multiple data augmentation approaches for SSR including consecutive and intermittent time masking, articulatory dimension masking, sinusoidal noise injection and random scaling. Different experimental setups including speaker-dependent, speaker-independent, and speaker-adaptive were used. The SSR models were end-to-end speech recognition models trained with connectionist temporal classification (CTC). Electromagnetic articulography (EMA) datasets collected from multiple healthy speakers and laryngectomees were used. The experimental results have demonstrated that the data augmentation approaches explored performed differently, but generally improved SSR performance. Especially, the consecutive time masking has brought significant improvement on SSR for both healthy speakers and laryngectomees.
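The four augmentation families named above (time masking, articulatory dimension masking, sinusoidal noise injection, and random scaling) can be sketched directly on an EMA-style feature matrix. The functions below are illustrative implementations under assumed parameter ranges, not the paper's exact settings.

```python
import numpy as np

rng = np.random.default_rng(0)

def time_mask(x, max_len=20):
    """Zero out a random consecutive span of frames. x: (T, D) articulatory features."""
    t = int(rng.integers(0, max_len + 1))
    start = int(rng.integers(0, max(x.shape[0] - t, 1)))
    x = x.copy(); x[start:start + t] = 0.0
    return x

def dimension_mask(x, n_dims=2):
    """Zero out a few randomly chosen articulatory dimensions (e.g. sensor channels)."""
    x = x.copy()
    x[:, rng.choice(x.shape[1], size=n_dims, replace=False)] = 0.0
    return x

def sinusoidal_noise(x, amp=0.05, max_freq=5.0):
    """Add a low-frequency sinusoid to every dimension (a slow, drift-like perturbation)."""
    t = np.arange(x.shape[0])[:, None]
    freq = rng.uniform(0.1, max_freq)
    return x + amp * np.sin(2 * np.pi * freq * t / x.shape[0])

def random_scale(x, low=0.9, high=1.1):
    """Multiply the whole trajectory by a random global gain."""
    return x * rng.uniform(low, high)

ema = rng.standard_normal((200, 12))    # stand-in EMA trajectory: 200 frames, 12 dimensions
augmented = random_scale(sinusoidal_noise(dimension_mask(time_mask(ema))))
print(augmented.shape)                  # (200, 12)
```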
Citations: 4