Unified Likelihood Ratio Estimation for High- to Zero-frequency N-grams

M. Kikuchi, Kento Kawakami, Kazuho Watanabe, Mitsuo Yoshida, Kyoji Umemura
{"title":"Unified Likelihood Ratio Estimation for High- to Zero-frequency N-grams","authors":"M. Kikuchi, Kento Kawakami, Kazuho Watanabe, Mitsuo Yoshida, Kyoji Umemura","doi":"10.1587/transfun.2020EAP1088","DOIUrl":null,"url":null,"abstract":"Likelihood ratios (LRs), which are commonly used for probabilistic data processing, are often estimated based on the frequency counts of individual elements obtained from samples. In natural language processing, an element can be a continuous sequence of N items, called an N -gram, in which each item is a word, letter, etc. In this paper, we attempt to estimate LRs based on N -gram frequency information. A naive estimation approach that uses only N -gram frequencies is sensitive to low-frequency (rare) N -grams and not applicable to zero-frequency (unobserved) N -grams; these are known as the lowand zero-frequency problems, respectively. To address these problems, we propose a method for decomposing N -grams into item units and then applying their frequencies along with the original N -gram frequencies. Our method can obtain the estimates of unobserved N -grams by using the unit frequencies. Although using only unit frequencies ignores dependencies between items, our method takes advantage of the fact that certain items often co-occur in practice and therefore maintains their dependencies by using the relevant N -gram frequencies. We also introduce a regularization to achieve robust estimation for rare N -grams. Our experimental results demonstrate that our method is effective at solving both problems and can effectively control dependencies. key words: Likelihood ratio, the low-frequency problem, the zero-frequency problem, uLSIF.","PeriodicalId":348826,"journal":{"name":"IEICE Trans. Fundam. Electron. Commun. Comput. Sci.","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEICE Trans. Fundam. Electron. Commun. Comput. Sci.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1587/transfun.2020EAP1088","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Likelihood ratios (LRs), which are commonly used for probabilistic data processing, are often estimated based on the frequency counts of individual elements obtained from samples. In natural language processing, an element can be a continuous sequence of N items, called an N-gram, in which each item is a word, letter, etc. In this paper, we attempt to estimate LRs based on N-gram frequency information. A naive estimation approach that uses only N-gram frequencies is sensitive to low-frequency (rare) N-grams and not applicable to zero-frequency (unobserved) N-grams; these are known as the low- and zero-frequency problems, respectively. To address these problems, we propose a method for decomposing N-grams into item units and then applying their frequencies along with the original N-gram frequencies. Our method can obtain estimates of unobserved N-grams by using the unit frequencies. Although using only unit frequencies ignores dependencies between items, our method takes advantage of the fact that certain items often co-occur in practice, and it therefore maintains their dependencies by using the relevant N-gram frequencies. We also introduce a regularization to achieve robust estimation for rare N-grams. Our experimental results demonstrate that our method is effective at solving both problems and can effectively control dependencies.

Key words: likelihood ratio, the low-frequency problem, the zero-frequency problem, uLSIF.
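The sketch below is a minimal Python illustration, not the paper's uLSIF-based estimator: it shows the naive LR estimate obtained from relative N-gram frequencies in two corpora, and a fallback that decomposes a zero-frequency N-gram into item units and multiplies per-unit LRs. The toy corpora, the add-alpha smoothing constant, and the independence assumption in the fallback are illustrative assumptions only; the paper's actual method combines unit and N-gram frequencies with regularization rather than using a hard fallback.

```python
# Illustrative sketch (assumptions noted above), not the authors' uLSIF-based method.
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def naive_lr(ngram, corpus_a, corpus_b, n):
    """Naive LR: ratio of relative N-gram frequencies in the two corpora.
    Returns None when the N-gram is unobserved in corpus_b (zero-frequency problem)."""
    counts_a = Counter(ngrams(corpus_a, n))
    counts_b = Counter(ngrams(corpus_b, n))
    total_a = max(sum(counts_a.values()), 1)
    total_b = max(sum(counts_b.values()), 1)
    if counts_b[ngram] == 0:
        return None  # no estimate is available from N-gram counts alone
    return (counts_a[ngram] / total_a) / (counts_b[ngram] / total_b)

def unit_lr(ngram, corpus_a, corpus_b, alpha=1.0):
    """Fallback: decompose the N-gram into items (units) and multiply per-unit LRs,
    assuming independence between items; alpha is an illustrative add-alpha
    smoothing constant for rare units."""
    counts_a = Counter(corpus_a)
    counts_b = Counter(corpus_b)
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    lr = 1.0
    for unit in ngram:
        p_a = (counts_a[unit] + alpha) / (total_a + alpha * len(counts_a))
        p_b = (counts_b[unit] + alpha) / (total_b + alpha * len(counts_b))
        lr *= p_a / p_b
    return lr

# Toy example: the bigram ("the", "mat") is unobserved in corpus_b,
# so the naive estimate fails and the unit-frequency fallback is used.
corpus_a = "the cat sat on the mat".split()
corpus_b = "the dog sat on the rug".split()
target = ("the", "mat")
lr = naive_lr(target, corpus_a, corpus_b, n=2)
if lr is None:
    lr = unit_lr(target, corpus_a, corpus_b)
print(lr)
```

In this toy setting the fallback ignores the dependency between "the" and "mat"; the paper's contribution is precisely to retain such dependencies by blending N-gram frequencies with the unit frequencies under a regularized estimator.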