Comparison of fuzzy search algorithms based on Damerau-Levenshtein automata on large data

Kyrylo Kleshch, Volodymyr Shablii
Journal: Technology audit and production reserves
Publication date: 2023-08-28
DOI: 10.15587/2706-5448.2023.286382 (https://doi.org/10.15587/2706-5448.2023.286382)
Citations: 0

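Abstract

The object of research is fuzzy search algorithms based on Damerau-Levenshtein automata and Levenshtein automata. The paper examines and compares finite-state-machine solutions for quickly finding words and lines within a given edit distance in large text data. Fuzzy search finds significantly more relevant results than standard exact-match search, but such algorithms usually have higher asymptotic complexity and therefore run much longer. Fuzzy text search with the Damerau-Levenshtein distance accounts for the common errors a user may make in a search term: a substituted character, an extra character, a missing character, and transposed characters. To use a finite automaton, one must first construct it for a specific input word and edit distance, and then run the search on it, discarding the words the automaton does not accept. Because constructing the automaton can take a long time, both phases should be taken into account when choosing an algorithm. To speed up one of the automata, SIMD instructions were used, which gave a speedup of 1-10% depending on the number of search words, the length of the search word, and the edit distance. The results can be useful wherever fuzzy search must be performed quickly and efficiently over large volumes of data, for example in search engines or in automatic error correction.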
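The two phases described in the abstract (construct a matcher for one query word and edit distance, then filter candidate words through it) can be illustrated with a minimal Python sketch. This is not the authors' implementation: instead of building an explicit automaton it simulates one by carrying dynamic-programming rows as the state, and it assumes the restricted Damerau-Levenshtein variant (optimal string alignment, adjacent transpositions only). The names `DamerauLevenshteinMatcher`, `step`, `accepts`, and `fuzzy_search` are hypothetical and chosen only for illustration.

```python
from typing import Iterable, List, Tuple

# State = (DP row before the previous character, current DP row, last character read).
State = Tuple[Tuple[int, ...], Tuple[int, ...], str]


class DamerauLevenshteinMatcher:
    """Simulates an automaton for one query word and one distance limit."""

    def __init__(self, query: str, max_dist: int) -> None:
        # Construction phase: fix the query word and the edit-distance limit.
        self.query = query
        self.max_dist = max_dist

    def start(self) -> State:
        # Row 0: distance from the empty candidate prefix to each query prefix.
        row = tuple(range(len(self.query) + 1))
        return (row, row, "")

    def step(self, state: State, ch: str) -> State:
        """Consume one candidate character and return the next state."""
        prev2, prev, last = state
        q = self.query
        cur = [prev[0] + 1]
        for j in range(1, len(q) + 1):
            cost = 0 if q[j - 1] == ch else 1
            best = min(
                prev[j] + 1,         # candidate has an extra character
                cur[j - 1] + 1,      # candidate lacks a query character
                prev[j - 1] + cost,  # match or substitution
            )
            # Adjacent transposition: the last two candidate characters
            # equal the last two query characters in swapped order.
            if last and j > 1 and q[j - 1] == last and q[j - 2] == ch:
                best = min(best, prev2[j - 2] + 1)
            cur.append(best)
        return (prev, tuple(cur), ch)

    def accepts(self, word: str) -> bool:
        """Search phase for one candidate: True if within the distance limit."""
        state = self.start()
        for ch in word:
            state = self.step(state, ch)
            if min(state[1]) > self.max_dist:
                return False  # every entry exceeds the limit, so stop early
        return state[1][-1] <= self.max_dist


def fuzzy_search(query: str, max_dist: int, words: Iterable[str]) -> List[str]:
    """Build the matcher once, then discard words it does not accept."""
    matcher = DamerauLevenshteinMatcher(query, max_dist)
    return [w for w in words if matcher.accepts(w)]
```

For example, `fuzzy_search("search", 1, ["saerch", "church"])` keeps `"saerch"` (one transposition) and discards `"church"`. An explicit automaton would precompute the transitions that this sketch recomputes for every character, which is where the construction cost mentioned in the abstract comes from.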