Unified Likelihood Ratio Estimation for High- to Zero-frequency N-grams

M. Kikuchi, Kento Kawakami, Kazuho Watanabe, Mitsuo Yoshida, Kyoji Umemura
{"title":"Unified Likelihood Ratio Estimation for High- to Zero-frequency N-grams","authors":"M. Kikuchi, Kento Kawakami, Kazuho Watanabe, Mitsuo Yoshida, Kyoji Umemura","doi":"10.1587/transfun.2020EAP1088","DOIUrl":null,"url":null,"abstract":"Likelihood ratios (LRs), which are commonly used for probabilistic data processing, are often estimated based on the frequency counts of individual elements obtained from samples. In natural language processing, an element can be a continuous sequence of N items, called an N -gram, in which each item is a word, letter, etc. In this paper, we attempt to estimate LRs based on N -gram frequency information. A naive estimation approach that uses only N -gram frequencies is sensitive to low-frequency (rare) N -grams and not applicable to zero-frequency (unobserved) N -grams; these are known as the lowand zero-frequency problems, respectively. To address these problems, we propose a method for decomposing N -grams into item units and then applying their frequencies along with the original N -gram frequencies. Our method can obtain the estimates of unobserved N -grams by using the unit frequencies. Although using only unit frequencies ignores dependencies between items, our method takes advantage of the fact that certain items often co-occur in practice and therefore maintains their dependencies by using the relevant N -gram frequencies. We also introduce a regularization to achieve robust estimation for rare N -grams. Our experimental results demonstrate that our method is effective at solving both problems and can effectively control dependencies. key words: Likelihood ratio, the low-frequency problem, the zero-frequency problem, uLSIF.","PeriodicalId":348826,"journal":{"name":"IEICE Trans. Fundam. Electron. Commun. Comput. Sci.","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEICE Trans. Fundam. Electron. Commun. Comput. Sci.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1587/transfun.2020EAP1088","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Likelihood ratios (LRs), which are commonly used for probabilistic data processing, are often estimated based on the frequency counts of individual elements obtained from samples. In natural language processing, an element can be a continuous sequence of N items, called an N-gram, in which each item is a word, letter, etc. In this paper, we attempt to estimate LRs based on N-gram frequency information. A naive estimation approach that uses only N-gram frequencies is sensitive to low-frequency (rare) N-grams and not applicable to zero-frequency (unobserved) N-grams; these are known as the low- and zero-frequency problems, respectively. To address these problems, we propose a method for decomposing N-grams into item units and then applying their frequencies along with the original N-gram frequencies. Our method can obtain estimates of unobserved N-grams by using the unit frequencies. Although using only unit frequencies ignores dependencies between items, our method takes advantage of the fact that certain items often co-occur in practice, and it therefore maintains their dependencies by using the relevant N-gram frequencies. We also introduce a regularization to achieve robust estimation for rare N-grams. Our experimental results demonstrate that our method is effective at solving both problems and can effectively control dependencies.

Key words: likelihood ratio, the low-frequency problem, the zero-frequency problem, uLSIF.
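The sketch below is a minimal Python illustration, not the paper's uLSIF-based estimator: it shows the naive LR estimate obtained from relative N-gram frequencies in two corpora, and a fallback that decomposes a zero-frequency N-gram into item units and multiplies per-unit LRs. The toy corpora, the add-alpha smoothing constant, and the independence assumption in the fallback are illustrative assumptions only; the paper's actual method combines unit and N-gram frequencies with regularization rather than using a hard fallback.

```python
# Illustrative sketch (assumptions noted above), not the authors' uLSIF-based method.
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def naive_lr(ngram, corpus_a, corpus_b, n):
    """Naive LR: ratio of relative N-gram frequencies in the two corpora.
    Returns None when the N-gram is unobserved in corpus_b (zero-frequency problem)."""
    counts_a = Counter(ngrams(corpus_a, n))
    counts_b = Counter(ngrams(corpus_b, n))
    total_a = max(sum(counts_a.values()), 1)
    total_b = max(sum(counts_b.values()), 1)
    if counts_b[ngram] == 0:
        return None  # no estimate is available from N-gram counts alone
    return (counts_a[ngram] / total_a) / (counts_b[ngram] / total_b)

def unit_lr(ngram, corpus_a, corpus_b, alpha=1.0):
    """Fallback: decompose the N-gram into items (units) and multiply per-unit LRs,
    assuming independence between items; alpha is an illustrative add-alpha
    smoothing constant for rare units."""
    counts_a = Counter(corpus_a)
    counts_b = Counter(corpus_b)
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    lr = 1.0
    for unit in ngram:
        p_a = (counts_a[unit] + alpha) / (total_a + alpha * len(counts_a))
        p_b = (counts_b[unit] + alpha) / (total_b + alpha * len(counts_b))
        lr *= p_a / p_b
    return lr

# Toy example: the bigram ("the", "mat") is unobserved in corpus_b,
# so the naive estimate fails and the unit-frequency fallback is used.
corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog sat on the rug".split()
target = ("the", "mat")
lr = naive_lr(target, corpus_a, corpus_b, n=2)
if lr is None:
    lr = unit_lr(target, corpus_a, corpus_b)
print(lr)
```

In this toy setting the fallback ignores the dependency between "the" and "mat"; the paper's contribution is precisely to retain such dependencies by blending N-gram frequencies with the unit frequencies under a regularized estimator.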