Maestro-U: Leveraging Joint Speech-Text Representation Learning for Zero Supervised Speech ASR

Zhehuai Chen, Ankur Bapna, A. Rosenberg, Yu Zhang, B. Ramabhadran, P. Moreno, Nanxin Chen
{"title":"Maestro-U: Leveraging Joint Speech-Text Representation Learning for Zero Supervised Speech ASR","authors":"Zhehuai Chen, Ankur Bapna, A. Rosenberg, Yu Zhang, B. Ramabhadran, P. Moreno, Nanxin Chen","doi":"10.1109/SLT54892.2023.10022791","DOIUrl":null,"url":null,"abstract":"Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model introduced in [1] can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover 102 languages, where transcribed speech is available in 52 of these languages and can be used to improve end-to-end ASR quality on the remaining 50. First, we show that by combining speech representations with byte-level text representations and use of language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8% to 30.8%, a relative reduction of 53%. Second, using a subset of South Asian languages we show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap. Overall, Maestro-U closes the gap to oracle performance by 68.5% relative and reduces the CER of 19 languages below 15%.","PeriodicalId":352002,"journal":{"name":"2022 IEEE Spoken Language Technology Workshop (SLT)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE Spoken Language Technology Workshop (SLT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SLT54892.2023.10022791","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 11

Abstract

Training state-of-the-art Automated Speech Recognition (ASR) models typically requires a substantial amount of transcribed speech. In this work, we demonstrate that a modality-matched joint speech and text model introduced in [1] can be leveraged to train a massively multilingual ASR model without any supervised (manually transcribed) speech for some languages. This paper explores the use of jointly learnt speech and text representations in a massively multilingual, zero supervised speech, real-world setting to expand the set of languages covered by ASR with only unlabeled speech and text in the target languages. Using the FLEURS dataset, we define the task to cover 102 languages, where transcribed speech is available in 52 of these languages and can be used to improve end-to-end ASR quality on the remaining 50. First, we show that by combining speech representations with byte-level text representations and use of language embeddings, we can dramatically reduce the Character Error Rate (CER) on languages with no supervised speech from 64.8% to 30.8%, a relative reduction of 53%. Second, using a subset of South Asian languages we show that Maestro-U can promote knowledge transfer from languages with supervised speech even when there is limited to no graphemic overlap. Overall, Maestro-U closes the gap to oracle performance by 68.5% relative and reduces the CER of 19 languages below 15%.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
利用联合语音-文本表示学习实现零监督语音ASR
训练最先进的自动语音识别(ASR)模型通常需要大量的转录语音。在这项工作中,我们证明了[1]中引入的模态匹配的联合语音和文本模型可以用于训练大规模多语言ASR模型,而无需任何监督(手动转录)语音。本文探讨了在大规模多语言、零监督语音、现实世界环境中使用联合学习的语音和文本表示,以仅使用目标语言中的未标记语音和文本来扩展ASR所涵盖的语言集。使用FLEURS数据集,我们将任务定义为覆盖102种语言,其中52种语言的转录语音可用,可用于提高其余50种语言的端到端ASR质量。首先,我们表明,通过将语音表示与字节级文本表示相结合并使用语言嵌入,我们可以显着降低无监督语音语言的字符错误率(CER),从64.8%降至30.8%,相对降低53%。其次,使用南亚语言的一个子集,我们表明Maestro-U可以促进具有监督语音的语言的知识转移,即使在没有字母重叠的情况下也是如此。总体而言,Maestro-U将与oracle的性能差距缩小了68.5%,并将19种语言的CER降低到15%以下。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Phone-Level Pronunciation Scoring for L1 Using Weighted-Dynamic Time Warping The Clever Hans Effect in Voice Spoofing Detection A Multi-Modal Array of Interpretable Features to Evaluate Language and Speech Patterns in Different Neurological Disorders Unsupervised Domain Adaptation of Neural PLDA Using Segment Pairs for Speaker Verification Learning Accent Representation with Multi-Level VAE Towards Controllable Speech Synthesis
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1