{"title":"Word discrimination based on bigram co-occurrences","authors":"A. El-Nasan, S. Veeramachaneni, G. Nagy","doi":"10.1109/ICDAR.2001.953773","DOIUrl":null,"url":null,"abstract":"Very few pairs of English words share exactly the same letter bigrams. This linguistic property can be exploited to bring lexical context into the classification stage of a word recognition system. The lexical n-gram matches between every word in a lexicon and a subset of reference words can be precomputed. If a match function can detect matching segments of at least n-gram length from the feature representation of words, then an unknown word can be recognized by determining the subset of reference words having an n-gram match at the feature level with the unknown word. We show that with a reasonable number of reference words, bigrams represent the best compromise between the recall ability of single letters and the precision of trigrams. Our simulations indicate that using a longer reference list can compensate errors in feature extraction. The algorithm is fast enough, even with a slow processor, for human-computer interaction.","PeriodicalId":277816,"journal":{"name":"Proceedings of Sixth International Conference on Document Analysis and Recognition","volume":"29 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2001-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"14","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of Sixth International Conference on Document Analysis and Recognition","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICDAR.2001.953773","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 14
Abstract
Very few pairs of English words share exactly the same letter bigrams. This linguistic property can be exploited to bring lexical context into the classification stage of a word recognition system. The lexical n-gram matches between every word in a lexicon and a subset of reference words can be precomputed. If a match function can detect matching segments of at least n-gram length from the feature representation of words, then an unknown word can be recognized by determining the subset of reference words having an n-gram match at the feature level with the unknown word. We show that with a reasonable number of reference words, bigrams represent the best compromise between the recall ability of single letters and the precision of trigrams. Our simulations indicate that using a longer reference list can compensate errors in feature extraction. The algorithm is fast enough, even with a slow processor, for human-computer interaction.