Fast Word Recognition for Noise channel-based Models in Scenarios with Noise Specific Domain Knowledge

Proceedings of the 2017 ACM on Conference on Information and Knowledge Management. Pub Date: 2017-11-06. DOI: 10.1145/3132847.3133028

Marco Cristo, Raíza Hanada, A. Carvalho, Fernando Anglada Lores, M. G. Pimentel


Citations: 2

Abstract

Word recognition is a challenging task faced by many applications, especially in very noisy scenarios. The problem is usually modeled as the transmission of a word through a noisy channel, so the task is to determine which known word of a lexicon corresponds to the received string. To keep the search feasible, only a reduced set of candidate words is selected, usually those that can be transformed into the input string by applying up to k character edit operations. To rank the candidates, the most effective estimates use domain knowledge about noise sources and error distributions, extracted from real usage data. In scenarios with much noise, however, such estimates, and the index strategies they normally require, do not scale well, as they grow exponentially with k and the lexicon size. In this work, we propose very efficient methods for word recognition in very noisy scenarios that support effective edit-based distance algorithms in a Mor-Fraenkel index, searchable using minimal perfect hashing. The methods allow the most promising candidates to be processed early, so that fast pruned searches incur negligible loss in word ranking quality. We also propose a linear heuristic for estimating edit-based distances that takes advantage of information already provided by the index. Our methods achieve precision similar to that of a state-of-the-art approach while being about ten times faster.
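The candidate-selection step described in the abstract can be illustrated with a small sketch of a Mor-Fraenkel style deletion index. This is not the authors' implementation: it assumes a plain Python dictionary in place of minimal perfect hashing, uses uniform-cost Levenshtein distance in place of the noise-specific estimates and the linear heuristic, and all function names (`deletion_variants`, `build_index`, `candidates`, `recognize`) are hypothetical. The idea shown is only that two words within k edits of each other always share a string reachable by deleting at most k characters from each, so looking up the deletion variants of the input yields a superset of the true candidates, which is then verified and ranked.

```python
from collections import defaultdict
from itertools import combinations


def deletion_variants(word, k):
    """Return every string obtained by deleting up to k characters from word."""
    variants = {word}
    for d in range(1, min(k, len(word)) + 1):
        for positions in combinations(range(len(word)), d):
            removed = set(positions)
            variants.add("".join(c for i, c in enumerate(word) if i not in removed))
    return variants


def build_index(lexicon, k):
    """Map each deletion variant to the lexicon words that produce it.

    Assumption: a plain dict stands in for the hashed index of the paper.
    """
    index = defaultdict(set)
    for word in lexicon:
        for variant in deletion_variants(word, k):
            index[variant].add(word)
    return index


def candidates(index, query, k):
    """Collect lexicon words that share a deletion variant with the query.

    The result is a superset of the words within k edits of the query.
    """
    found = set()
    for variant in deletion_variants(query, k):
        found |= index.get(variant, set())
    return found


def edit_distance(a, b):
    """Uniform-cost Levenshtein distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def recognize(index, query, k):
    """Verify candidates with the true distance and rank them, closest first."""
    scored = [(edit_distance(query, w), w) for w in candidates(index, query, k)]
    return sorted((d, w) for d, w in scored if d <= k)


if __name__ == "__main__":
    lexicon = ["hello", "help", "hell", "world"]
    index = build_index(lexicon, k=1)
    print(recognize(index, "helo", k=1))
```

The index stores one entry per deletion variant of every lexicon word, which is why its size grows quickly with k and with the lexicon. In a noise-aware setting, the ranking key in `recognize` would be replaced by a channel-model score estimated from real error data rather than the uniform edit distance assumed here.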
