A Practical q-Gram Index for Text Retrieval Allowing Errors

CLEI Electron. J. Pub Date: 2018-09-25. DOI: 10.19153/cleiej.1.2.3

G. Navarro, Ricardo Baeza-Yates


Citations: 85

Abstract

We propose an indexing technique for approximate text searching that is practical and powerful, and is especially optimized for natural language text. Unlike other indices of this kind, it can retrieve any string that approximately matches the search pattern, not only words. Every text substring of a fixed length q is stored in the index, together with pointers to all the text positions where it appears. The search pattern is partitioned into pieces, which are searched in the index, and every text occurrence of a piece is then verified for a complete match. To reduce space requirements, pointers to blocks can be used instead of exact positions, at the price of higher querying costs. We design an algorithm that optimizes the partition of the pattern into pieces so that the total number of verifications is minimized. This is especially well suited to natural language texts, and makes it possible to know in advance the expected cost of the search and the expected relevance of the query to the user. We show experimentally the building time, space requirements and querying time of our index, finding that it is a practical alternative for text retrieval. Retrieval times are reduced to between 10% and 60% of those of the best on-line algorithm.
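The pipeline described in the abstract (index every q-gram, split the pattern into pieces, look the pieces up, verify the candidates) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes edit distance with at most k errors, splits the pattern into k+1 disjoint pieces of length exactly q (so it needs a pattern of at least (k+1)·q characters), and leaves out both the optimized partition and block addressing. The function names `build_qgram_index`, `match_ends`, and `search` are hypothetical.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every length-q substring of text to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def match_ends(pattern, window, k):
    """Return the end offsets (exclusive) in window of substrings that are
    within edit distance k of pattern, using a column-wise dynamic program
    in which a match may start at any position of the window."""
    m = len(pattern)
    col = list(range(m + 1))  # distances against the empty text prefix
    ends = []
    for e, c in enumerate(window, start=1):
        prev_diag, col[0] = col[0], 0  # row 0 stays 0: free starting point
        for i in range(1, m + 1):
            cur = min(prev_diag + (pattern[i - 1] != c),  # match / substitution
                      col[i] + 1,                         # extra text character
                      col[i - 1] + 1)                     # missing pattern character
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(e)
    return ends


def search(text, index, q, pattern, k):
    """Filter with the q-gram index, then verify each candidate area.

    Assumption of this sketch: k errors can damage at most k of the k+1
    disjoint pieces, so at least one piece occurs unaltered in any match.
    """
    m = len(pattern)
    if m < (k + 1) * q:
        raise ValueError("pattern too short for k+1 disjoint q-grams")
    found, checked = set(), set()
    for j in range(k + 1):
        offset = j * q
        for pos in index.get(pattern[offset:offset + q], ()):
            # Text area that can contain a match aligned with this piece.
            lo = max(0, pos - offset - k)
            hi = min(len(text), pos - offset + m + k)
            if (lo, hi) in checked:
                continue
            checked.add((lo, hi))
            for e in match_ends(pattern, text[lo:hi], k):
                found.add(lo + e)
    return sorted(found)


text = "the quick brown fox jumps over the lazy dog"
index = build_qgram_index(text, 3)
print(search(text, index, 3, "lazy dig", 1))  # end offsets of matches with <= 1 error
```

Only the text areas around exact occurrences of a piece are verified, which is where the saving over scanning the whole text comes from. In this example the misspelled pattern "lazy dig" is found through its intact pieces "laz" and "y d", and the reported end offsets include 43, the end of "lazy dog".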
