Morphology aware data augmentation with neural language models for online hybrid ASR

Acta Linguistica Academica · IF 0.5 · CAS Tier 3 (Literature) · LANGUAGE & LINGUISTICS · Pub Date: 2022-11-21 · DOI: 10.1556/2062.2022.00582
Balázs Tarján, T. Fegyó, P. Mihajlik

Abstract

Recognition of Hungarian conversational telephone speech is challenging due to the informal style and morphological richness of the language. Neural Network Language Models (NNLMs) can provide a remedy for the high perplexity of the task; however, their high complexity makes them very difficult to apply in the first (single) pass of an online system. Recent studies have shown that a considerable part of the knowledge of NNLMs can be transferred to traditional n-grams by means of data augmentation based on neural text generation. Data augmentation with NNLMs works well for isolating languages; however, we show that it causes a vocabulary explosion in a morphologically rich language. Therefore, we propose a new, morphology-aware neural text augmentation method in which we retokenize the generated text into statistically derived subwords. We compare the performance of word-based and subword-based data augmentation techniques with recurrent and Transformer language models and show that subword-based methods can significantly improve the Word Error Rate (WER) while greatly reducing vocabulary size and memory requirements. Combining subword-based modeling and neural language model-based data augmentation, we were able to achieve an 11% relative WER reduction while preserving the real-time operation of our conversational telephone speech recognition system. Finally, we also demonstrate that subword-based neural text augmentation outperforms the word-based approach not only in terms of overall WER but also in the recognition of Out-of-Vocabulary (OOV) words.
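The key step the abstract describes, retokenizing generated text into statistically derived subwords, is commonly realized with byte-pair-encoding-style merges. The paper does not specify its exact subword algorithm, so the following is only an illustrative sketch of the general idea: learn frequent symbol merges from a word list, then segment new (e.g. NNLM-generated) words into subword units. All function names here are hypothetical.

```python
from collections import Counter

def learn_bpe(corpus_words, num_merges):
    """Learn byte-pair-encoding merges from a list of words.

    Each word is a tuple of symbols ending in an end-of-word marker;
    at every step the most frequent adjacent symbol pair is merged.
    """
    vocab = Counter()
    for word in corpus_words:
        vocab[tuple(word) + ("</w>",)] += 1

    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_vocab[tuple(merged)] += freq
        vocab = new_vocab
    return merges

def segment(word, merges):
    """Retokenize a single word into subwords via the learned merges."""
    symbols = list(word) + ["</w>"]
    for a, b in merges:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
                out.append(a + b)
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        symbols = out
    return symbols
```

Applied to NNLM-generated Hungarian text, segmenting every word this way keeps the lexicon to a closed set of subword units, which is what limits the vocabulary explosion the abstract mentions; production systems typically use a dedicated toolkit rather than a hand-rolled BPE like this.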
Source journal: Acta Linguistica Academica (Arts and Humanities: Literature and Literary Theory)
CiteScore: 1.00 · Self-citation rate: 20.00% · Articles per year: 20
Journal description: Acta Linguistica Academica publishes papers on general linguistics. Papers presenting empirical material must have strong theoretical implications. The scope of the journal is not restricted to the core areas of linguistics; it also covers areas such as socio- and psycholinguistics, neurolinguistics, discourse analysis, the philosophy of language, language typology, and formal semantics. The journal also publishes book and dissertation reviews and advertisements.
Latest articles in this journal:
American linguistics in transition: From post-Bloomfieldian structuralism to generative grammar
Production of Mandarin and Fuzhou lexical tones in six- to seven-year-old Mandarin-Fuzhou bilingual children
No lowering, only paradigms: A paradigm-based account of linking vowels in Hungarian
Strange a construction: The "A egy N" in Hungarian
Another fortis-lenis language: A reanalysis of Old English obstruents