Development of fuzzy search method for creating an efficient information search system in text data

Kyrylo Kleshch
Journal: Technology audit and production reserves
Published: 2024-02-13
DOI: 10.15587/2706-5448.2024.298425 (https://doi.org/10.15587/2706-5448.2024.298425)
Citations: 0

Abstract

The object of research is the process of efficiently searching for information in a set of textual data. The subject of research is a fuzzy search method that effectively solves the problem of searching for information in such a set. The paper describes the development of a fuzzy search method that consists of 9 consecutive steps and is intended for quickly finding matches in a large set of text data. Based on this method, it is proposed to create a fuzzy search system that finds the most relevant documents in a document set.
The proposed fuzzy search method combines the advantages of algorithms based on deterministic finite automata and of dynamic-programming algorithms for calculating the Damerau-Levenshtein distance. This combination makes it possible to implement the symbol similarity table in an optimal way. As part of the work, an approach for creating a symbol similarity table was proposed, and an example table was built for the symbols of the English alphabet. The table makes it possible to find the degree of similarity between two symbols in constant time and to convert a symbol into its basic counterpart. For document filtering, a metric was developed that evaluates how well text data corresponds to a search phrase; it simultaneously takes into account the number of found and not-found characters and the number of found and not-found words.
The Damerau-Levenshtein algorithm finds the edit distance between two words, taking into account the following types of errors: substitution, insertion, deletion, and transposition of characters. The work proposes a modification of this algorithm that uses the similarity table to estimate the edit distance between two words more accurately.
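The idea of letting a similarity table soften substitution costs can be illustrated with a minimal sketch. It is not the paper's implementation: the table contents, the names `SIMILARITY`, `similarity`, and `weighted_distance`, and the cost rule (substitution costs 1 minus the similarity degree) are all assumptions made for illustration, and the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance is used for brevity.

```python
from typing import Dict, Tuple

# Hypothetical similarity table: pairs of close symbols mapped to a
# similarity degree in [0, 1]. The pairs and values are illustrative only.
SIMILARITY: Dict[Tuple[str, str], float] = {
    ("c", "k"): 0.5,
    ("i", "y"): 0.5,
    ("s", "z"): 0.5,
}


def similarity(a: str, b: str) -> float:
    """Return the similarity of two symbols using constant-time lookups."""
    if a == b:
        return 1.0
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))


def weighted_distance(s: str, t: str) -> float:
    """Restricted Damerau-Levenshtein distance with similarity-aware substitution.

    Insertion, deletion, and transposition cost 1; substitution costs
    1 - similarity(a, b), which is an assumption of this sketch.
    """
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)  # delete all i symbols of s
    for j in range(cols):
        d[0][j] = float(j)  # insert all j symbols of t
    for i in range(1, rows):
        for j in range(1, cols):
            sub_cost = 1.0 - similarity(s[i - 1], t[j - 1])
            d[i][j] = min(
                d[i - 1][j] + 1.0,           # deletion
                d[i][j - 1] + 1.0,           # insertion
                d[i - 1][j - 1] + sub_cost,  # substitution or match
            )
            # Transposition of two adjacent symbols.
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[-1][-1]
```

With this table, `weighted_distance("cat", "kat")` is 0.5 rather than 1, because "c" and "k" are listed as similar, while a transposition such as "form" versus "from" still costs 1.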
The developed method makes it possible to create a fuzzy search system that helps find the desired results faster and increases their relevance by sorting them according to the values of the proposed text data similarity metric.
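Sorting results by a similarity metric can be sketched as follows. The abstract does not give the metric's formula, so the score below is purely hypothetical: it averages the share of query words found and the share of query characters matched. The names `edit_distance`, `relevance`, `rank`, and the `max_errors` threshold are assumptions of this example, and a plain Levenshtein distance stands in for the paper's modified algorithm.

```python
from typing import List, Tuple


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, kept short for the sake of the example."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def relevance(query: str, document: str, max_errors: int = 1) -> float:
    """Hypothetical score in [0, 1] combining found words and found characters."""
    q_words = query.lower().split()
    d_words = document.lower().split()
    if not q_words or not d_words:
        return 0.0
    found_words = 0
    found_chars = 0
    for qw in q_words:
        # Closest document word to this query word.
        best = min(edit_distance(qw, dw) for dw in d_words)
        if best <= max_errors:
            found_words += 1
            found_chars += max(0, len(qw) - best)
    total_chars = sum(len(w) for w in q_words)
    return 0.5 * found_words / len(q_words) + 0.5 * found_chars / total_chars


def rank(query: str, documents: List[str]) -> List[Tuple[float, str]]:
    """Return (score, document) pairs sorted from most to least relevant."""
    scored = [(relevance(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

For the query "fuzzy search", a document containing both words exactly scores 1.0, a document with the misspellings "fuzy serch" scores slightly lower, and an unrelated document scores 0.0, so the three are returned in that order.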