A Practical q-Gram Index for Text Retrieval Allowing Errors

G. Navarro, Ricardo Baeza-Yates
Journal: CLEI Electron. J.
DOI: https://doi.org/10.19153/cleiej.1.2.3
Publication date: 2018-09-25
Publication type: Journal Article
Citations: 85

Abstract

We propose an indexing technique for approximate text searching that is practical and powerful, and especially optimized for natural language text. Unlike other indices of this kind, it can retrieve any string that approximately matches the search pattern, not only words. Every text substring of a fixed length q is stored in the index, together with pointers to all the text positions where it appears. The search pattern is partitioned into pieces, which are searched in the index, and all their occurrences in the text are verified for a complete match. To reduce space requirements, pointers to blocks can be used instead of exact positions, at the price of higher querying costs. We design an algorithm that optimizes the partition of the pattern into pieces so that the total number of verifications is minimized. This is especially well suited to natural language texts, and it makes it possible to know in advance the expected cost of the search and the expected relevance of the query to the user. We show experimentally the building time, space requirements and querying time of our index, finding that it is a practical alternative for text retrieval. Retrieval times are reduced to between 10% and 60% of those of the best on-line algorithm.
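
The index-then-verify scheme described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several choices are assumptions made for illustration: the function names (`build_qgram_index`, `split_pattern`, `approx_end_positions`, `search`) are hypothetical, the pattern is cut into k + 1 near-equal pieces (the classical partitioning lemma: if a match has at most k errors, at least one of k + 1 pieces must appear unaltered) rather than using the optimized partition the paper proposes, the index stores exact positions rather than block pointers, and verification uses Sellers' dynamic programming with unit-cost edit distance.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every length-q substring of text to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def split_pattern(pattern, pieces):
    """Cut pattern into `pieces` contiguous near-equal parts as (offset, part) pairs."""
    base, extra = divmod(len(pattern), pieces)
    parts, start = [], 0
    for j in range(pieces):
        length = base + (1 if j < extra else 0)
        parts.append((start, pattern[start:start + length]))
        start += length
    return parts


def approx_end_positions(pattern, window, k):
    """Sellers' DP: offsets in window where a match with at most k edits ends."""
    m = len(pattern)
    col = list(range(m + 1))  # col[i]: edits needed to match pattern[:i]
    ends = []
    for j, c in enumerate(window):
        prev_diag, col[0] = col[0], 0  # a match may start at any text position
        for i in range(1, m + 1):
            cur = min(prev_diag + (pattern[i - 1] != c),  # match or substitute
                      col[i] + 1,                          # extra text character
                      col[i - 1] + 1)                      # missing pattern character
            prev_diag, col[i] = col[i], cur
        if col[m] <= k:
            ends.append(j + 1)
    return ends


def search(text, index, q, pattern, k):
    """Return sorted end positions of matches of pattern in text with <= k edits."""
    m = len(pattern)
    if m // (k + 1) < q:
        # Simplification of this sketch: every piece must hold a full q-gram.
        raise ValueError("pattern too short for k errors with this q")
    windows = set()
    for offset, piece in split_pattern(pattern, k + 1):
        for pos in index.get(piece[:q], ()):
            if text.startswith(piece, pos):  # the whole piece occurs exactly
                lo = max(0, pos - offset - k)
                hi = min(len(text), pos - offset + m + k)
                windows.add((lo, hi))
    matches = set()
    for lo, hi in windows:  # verify each candidate area
        for end in approx_end_positions(pattern, text[lo:hi], k):
            matches.add(lo + end)
    return sorted(matches)


if __name__ == "__main__":
    text = "approximate matching allows errors"
    index = build_qgram_index(text, 3)
    print(search(text, index, 3, "matchng", 1))
```

Each exact occurrence of a piece triggers one verification over a window of at most m + 2k characters, so the total cost depends on how frequent the chosen pieces are in the text. That dependence is what the partition optimization mentioned in the abstract targets; this sketch does not attempt it.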