Applying Multi- and Cross-Lingual Stochastic Phone Space Transformations to Non-Native Speech Recognition

IEEE Transactions on Audio Speech and Language Processing Pub Date : 2013-08-01 DOI:10.1109/TASL.2013.2260150

David Imseng, H. Bourlard, J. Dines, Philip N. Garner, M. Magimai.-Doss

{"title":"Applying Multi- and Cross-Lingual Stochastic Phone Space Transformations to Non-Native Speech Recognition","authors":"David Imseng, H. Bourlard, J. Dines, Philip N. Garner, M. Magimai.-Doss","doi":"10.1109/TASL.2013.2260150","DOIUrl":null,"url":null,"abstract":"In the context of hybrid HMM/MLP Automatic Speech Recognition (ASR), this paper describes an investigation into a new type of stochastic phone space transformation, which maps “source” phone (or phone HMM state) posterior probabilities (as obtained at the output of a Multilayer Perceptron/MLP) into “destination” phone (HMM phone state) posterior probabilities. The resulting stochastic matrix transformation can be used within the same language to automatically adapt to different phone formats (e.g., IPA) or across languages. Additionally, as shown here, it can also be applied successfully to non-native speech recognition. In the same spirit as MLLR adaptation, or MLP adaptation, the approach proposed here is directly mapping posterior distributions, and is trained by optimizing on a small amount of adaptation data a Kullback-Leibler based cost function, along a modified version of an iterative EM algorithm. On a non-native English database (HIWIRE), and comparing with multiple setups (monophone and triphone mapping, MLLR adaptation) we show that the resulting posterior mapping yields state-of-the-art results using very limited amounts of adaptation data in mono-, cross- and multi-lingual setups. We also show that “universal” phone posteriors, trained on a large amount of multilingual data, can be transformed to English phone posteriors, resulting in an ASR system that significantly outperforms a system trained on English data only. Finally, we demonstrate that the proposed approach outperforms alternative data-driven, as well as a knowledge-based, mapping techniques.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":"21 1","pages":"1713-1726"},"PeriodicalIF":0.0000,"publicationDate":"2013-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2013.2260150","citationCount":"8","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Audio Speech and Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TASL.2013.2260150","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 8

Abstract

In the context of hybrid HMM/MLP Automatic Speech Recognition (ASR), this paper describes an investigation into a new type of stochastic phone space transformation, which maps “source” phone (or phone HMM state) posterior probabilities (as obtained at the output of a Multilayer Perceptron/MLP) into “destination” phone (HMM phone state) posterior probabilities. The resulting stochastic matrix transformation can be used within the same language to automatically adapt to different phone formats (e.g., IPA) or across languages. Additionally, as shown here, it can also be applied successfully to non-native speech recognition. In the same spirit as MLLR adaptation, or MLP adaptation, the approach proposed here is directly mapping posterior distributions, and is trained by optimizing on a small amount of adaptation data a Kullback-Leibler based cost function, along a modified version of an iterative EM algorithm. On a non-native English database (HIWIRE), and comparing with multiple setups (monophone and triphone mapping, MLLR adaptation) we show that the resulting posterior mapping yields state-of-the-art results using very limited amounts of adaptation data in mono-, cross- and multi-lingual setups. We also show that “universal” phone posteriors, trained on a large amount of multilingual data, can be transformed to English phone posteriors, resulting in an ASR system that significantly outperforms a system trained on English data only. Finally, we demonstrate that the proposed approach outperforms alternative data-driven, as well as a knowledge-based, mapping techniques.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

多语言和跨语言随机电话空间变换在非母语语音识别中的应用

在混合HMM/MLP自动语音识别(ASR)的背景下，研究了一种新的随机电话空间变换，该变换将“源”电话(或电话HMM状态)后验概率(在多层感知机/MLP的输出处获得)映射到“目标”电话(HMM电话状态)后验概率。由此产生的随机矩阵变换可以在同一语言中使用，以自动适应不同的电话格式(例如，IPA)或跨语言。此外，如图所示，它也可以成功地应用于非母语语音识别。与MLLR自适应或MLP自适应的精神相同，本文提出的方法是直接映射后见分布，并通过基于Kullback-Leibler的成本函数对少量自适应数据进行优化，并沿着迭代EM算法的改进版本进行训练。在非母语英语数据库(HIWIRE)上，并与多种设置(单声道和三声道映射，MLLR适应)进行比较，我们表明，在单语言、跨语言和多语言设置中，使用非常有限的适应数据，所得后验映射产生了最先进的结果。我们还表明，在大量多语言数据上训练的“通用”电话后验可以转换为英语电话后验，从而使ASR系统的性能明显优于仅在英语数据上训练的系统。最后，我们证明了所提出的方法优于其他数据驱动的以及基于知识的映射技术。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

IEEE Transactions on Audio Speech and Language Processing 工程技术-工程：电子与电气

自引率

0.00%

发文量

审稿时长

24.0 months

期刊介绍： The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.