The estimation of powerful language models from small and large corpora

P. Placeway, R. Schwartz, Pascale Fung, L. Nguyen
{"title":"The estimation of powerful language models from small and large corpora","authors":"P. Placeway, R. Schwartz, Pascale Fung, L. Nguyen","doi":"10.1109/ICASSP.1993.319222","DOIUrl":null,"url":null,"abstract":"The authors consider the estimation of powerful statistical language models using a technique that scales from very small to very large amounts of domain-dependent data. They begin with improved modeling of the grammar statistics, based on a combination of the backing-off technique and zero-frequency techniques. These are extended to be more amenable to the particular system considered here. The resulting technique is greatly simplified, more robust, and gives improved recognition performance over either of the previous techniques. The authors also consider the problem of robustness of a model based on a small training corpus by grouping words into obvious semantic classes. This significantly improves the robustness of the resulting statistical grammar. A technique that allows the estimation of a high-order model on modest computation resources is also presented. This makes it possible to run a 4-gram statistical model of a 50-million word corpus on a workstation of only modest capability and cost. Finally, the authors discuss results from applying a 2-gram statistical language model integrated in the HMM (hidden Markov model) search, obtaining a list of the N-best recognition results, and rescoring this list with a higher-order statistical model.<<ETX>>","PeriodicalId":428449,"journal":{"name":"1993 IEEE International Conference on Acoustics, Speech, and Signal Processing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1993-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"105","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"1993 IEEE International Conference on Acoustics, Speech, and Signal Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICASSP.1993.319222","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 105

Abstract

The authors consider the estimation of powerful statistical language models using a technique that scales from very small to very large amounts of domain-dependent data. They begin with improved modeling of the grammar statistics, based on a combination of the backing-off technique and zero-frequency techniques. These are extended to be more amenable to the particular system considered here. The resulting technique is greatly simplified, more robust, and gives improved recognition performance over either of the previous techniques. The authors also consider the problem of robustness of a model based on a small training corpus by grouping words into obvious semantic classes. This significantly improves the robustness of the resulting statistical grammar. A technique that allows the estimation of a high-order model on modest computation resources is also presented. This makes it possible to run a 4-gram statistical model of a 50-million word corpus on a workstation of only modest capability and cost. Finally, the authors discuss results from applying a 2-gram statistical language model integrated in the HMM (hidden Markov model) search, obtaining a list of the N-best recognition results, and rescoring this list with a higher-order statistical model.<>
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
从大小语料库中估计强大的语言模型
作者考虑使用一种技术来估计强大的统计语言模型,这种技术可以从非常小的领域相关数据扩展到非常大的领域相关数据。他们首先改进了语法统计的建模,基于后退技术和零频率技术的组合。它们被扩展为更适合这里所考虑的特定系统。由此产生的技术大大简化,更健壮,并且比之前的任何一种技术都具有更好的识别性能。作者还考虑了一个基于小训练语料库的模型的鲁棒性问题,通过将单词分组到明显的语义类中。这显著提高了生成的统计语法的健壮性。提出了一种在有限的计算资源下对高阶模型进行估计的方法。这使得在一个能力和成本适中的工作站上运行一个5千万字语料库的4克统计模型成为可能。最后,作者讨论了在HMM(隐马尔可夫模型)搜索中应用2克统计语言模型的结果,得到了n个最佳识别结果的列表,并用高阶统计模型对该列表进行了评分
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Modeling, analysis, and compensation of quantization effects in M-band subband codecs Matched field source detection and localization in high noise environments: A novel reduced-rank signal processing approach Enhanced video compression with standardized bit stream syntax Formally correct translation of DSP algorithms specified in an asynchronous applicative language All-thru DSP provision, essential for the modern EE
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1