Marco Cristo, Raíza Hanada, A. Carvalho, Fernando Anglada Lores, M. G. Pimentel
{"title":"具有噪声特定领域知识的场景下基于噪声信道模型的快速单词识别","authors":"Marco Cristo, Raíza Hanada, A. Carvalho, Fernando Anglada Lores, M. G. Pimentel","doi":"10.1145/3132847.3133028","DOIUrl":null,"url":null,"abstract":"Word recognition is a challenging task faced by many applications, specially in very noisy scenarios. This problem is usually seen as the transmission of a word through a noisy-channel, such that it is necessary to determine which known word of a lexicon is the received string. To be feasible, just a reduced set of candidate words are selected. They are usually chosen if they can be transformed into the input string by applying up to k character edit operations. To rank the candidates, the most effective estimates use domain knowledge about noise sources and error distributions, extracted from real use data. In scenarios with much noise, however, such estimates, and the index strategies normally required, do not scale well as they grow exponentially with k and the lexicon size. In this work, we propose very efficient methods for word recognition in very noisy scenarios which support effective edit-based distance algorithms in a Mor-Fraenkel index, searchable using a minimum perfect hashing. The method allows the early processing of most promising candidates, such that fast pruned searches present negligible loss in word ranking quality. We also propose a linear heuristic for estimating edit-based distances which take advantage of information already provided by the index. Our methods achieve precision similar to a state-of-the-art approach, being about ten times faster.","PeriodicalId":20449,"journal":{"name":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","volume":"1 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2017-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Fast Word Recognition for Noise channel-based Models in Scenarios with Noise Specific Domain Knowledge\",\"authors\":\"Marco Cristo, Raíza Hanada, A. Carvalho, Fernando Anglada Lores, M. G. Pimentel\",\"doi\":\"10.1145/3132847.3133028\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Word recognition is a challenging task faced by many applications, specially in very noisy scenarios. This problem is usually seen as the transmission of a word through a noisy-channel, such that it is necessary to determine which known word of a lexicon is the received string. To be feasible, just a reduced set of candidate words are selected. They are usually chosen if they can be transformed into the input string by applying up to k character edit operations. To rank the candidates, the most effective estimates use domain knowledge about noise sources and error distributions, extracted from real use data. In scenarios with much noise, however, such estimates, and the index strategies normally required, do not scale well as they grow exponentially with k and the lexicon size. In this work, we propose very efficient methods for word recognition in very noisy scenarios which support effective edit-based distance algorithms in a Mor-Fraenkel index, searchable using a minimum perfect hashing. The method allows the early processing of most promising candidates, such that fast pruned searches present negligible loss in word ranking quality. We also propose a linear heuristic for estimating edit-based distances which take advantage of information already provided by the index. Our methods achieve precision similar to a state-of-the-art approach, being about ten times faster.\",\"PeriodicalId\":20449,\"journal\":{\"name\":\"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management\",\"volume\":\"1 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-11-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3132847.3133028\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2017 ACM on Conference on Information and Knowledge Management","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3132847.3133028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Fast Word Recognition for Noise channel-based Models in Scenarios with Noise Specific Domain Knowledge
Word recognition is a challenging task faced by many applications, specially in very noisy scenarios. This problem is usually seen as the transmission of a word through a noisy-channel, such that it is necessary to determine which known word of a lexicon is the received string. To be feasible, just a reduced set of candidate words are selected. They are usually chosen if they can be transformed into the input string by applying up to k character edit operations. To rank the candidates, the most effective estimates use domain knowledge about noise sources and error distributions, extracted from real use data. In scenarios with much noise, however, such estimates, and the index strategies normally required, do not scale well as they grow exponentially with k and the lexicon size. In this work, we propose very efficient methods for word recognition in very noisy scenarios which support effective edit-based distance algorithms in a Mor-Fraenkel index, searchable using a minimum perfect hashing. The method allows the early processing of most promising candidates, such that fast pruned searches present negligible loss in word ranking quality. We also propose a linear heuristic for estimating edit-based distances which take advantage of information already provided by the index. Our methods achieve precision similar to a state-of-the-art approach, being about ten times faster.