Development of fuzzy search method for creating an efficient information search system in text data

Technology audit and production reserves. Pub Date: 2024-02-13. DOI: 10.15587/2706-5448.2024.298425

Kyrylo Kleshch


Citations: 0



Development of fuzzy search method for creating an efficient information search system in text data

The object of research is the process of effectively searching for information in a set of textual data. The subject of the research is a fuzzy search method that makes it possible to solve this problem effectively. The paper describes the development of a fuzzy search method that consists of 9 consecutive steps and is needed to find matches quickly in a large set of text data. Based on this method, it is proposed to create a fuzzy search system that finds the most relevant documents in a set of documents.

The proposed fuzzy search method combines the advantages of algorithms based on deterministic finite automata with those of dynamic programming algorithms for calculating the Damerau-Levenshtein distance. This combination makes it possible to implement the symbol similarity table in an optimal way. As part of the work, an approach for creating a symbol similarity table was proposed, and an example table was built for symbols of the English alphabet. The table makes it possible to look up the degree of similarity between two symbols in constant time and to convert a symbol into its base counterpart. For document filtering, a metric was developed to evaluate how well text data corresponds to a search phrase; it takes into account both the number of found and not found characters and the number of found and not found words.

The Damerau-Levenshtein algorithm finds the edit distance between two words, taking into account the following types of errors: substitution, insertion, deletion, and transposition of characters. The work proposes a modification of this algorithm that uses the similarity table to estimate the edit distance between two words more accurately.

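The idea of combining a symbol similarity table with the Damerau-Levenshtein distance can be illustrated with a minimal Python sketch. It is not the author's implementation: the table contents (`BASE_FORM`, `SIMILARITY`), the function names, and the rule "substitution cost = 1 - similarity" are all assumptions made for illustration, and the distance shown is the restricted (optimal string alignment) variant of Damerau-Levenshtein.

```python
from typing import Dict, Tuple

# Hypothetical table entries; the paper's actual table is not given in the abstract.
# BASE_FORM maps a symbol to its assumed "base counterpart".
BASE_FORM: Dict[str, str] = {"0": "o", "1": "l", "5": "s"}
# SIMILARITY holds an assumed degree of similarity in [0, 1] for symbol pairs.
SIMILARITY: Dict[Tuple[str, str], float] = {
    ("a", "e"): 0.5,
    ("c", "k"): 0.5,
    ("i", "y"): 0.5,
}


def to_base(ch: str) -> str:
    """Convert a symbol to its base counterpart (dictionary lookup, O(1) on average)."""
    ch = ch.lower()
    return BASE_FORM.get(ch, ch)


def similarity(a: str, b: str) -> float:
    """Degree of similarity between two symbols: 1.0 if equal, 0.0 if unrelated."""
    a, b = to_base(a), to_base(b)
    if a == b:
        return 1.0
    return SIMILARITY.get((a, b), SIMILARITY.get((b, a), 0.0))


def weighted_osa_distance(s: str, t: str) -> float:
    """Restricted Damerau-Levenshtein distance with similarity-weighted substitution."""
    rows, cols = len(s) + 1, len(t) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = float(i)  # delete all i characters of s
    for j in range(cols):
        d[0][j] = float(j)  # insert all j characters of t
    for i in range(1, rows):
        for j in range(1, cols):
            # Assumption: similar symbols are cheaper to substitute.
            sub_cost = 1.0 - similarity(s[i - 1], t[j - 1])
            d[i][j] = min(
                d[i - 1][j] + 1.0,           # deletion
                d[i][j - 1] + 1.0,           # insertion
                d[i - 1][j - 1] + sub_cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if (
                i > 1
                and j > 1
                and to_base(s[i - 1]) == to_base(t[j - 2])
                and to_base(s[i - 2]) == to_base(t[j - 1])
            ):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1.0)
    return d[-1][-1]
```

With these assumed table values, `weighted_osa_distance("kat", "cat")` is lower than `weighted_osa_distance("bat", "cat")`, because `c` and `k` are listed as similar, while a swapped pair of letters such as `serach` versus `search` still counts as a single transposition.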
The developed method makes it possible to build a fuzzy search system that finds the desired results faster and improves the relevance of the results by sorting them according to the values of the proposed text data similarity metric.
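Sorting documents by a metric that counts found and not found characters and words could look roughly like the sketch below. The abstract does not give the metric's formula, so everything here is an assumption: the equal 0.5/0.5 weighting, the `max_errors` threshold, the rule for counting found characters, and the names `relevance` and `rank` are hypothetical. A plain Levenshtein distance stands in for the paper's modified algorithm to keep the example self-contained.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Plain Levenshtein distance (stand-in for the paper's modified algorithm)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def relevance(query: str, document: str, max_errors: int = 1) -> float:
    """Hypothetical score in [0, 1] built from found/not-found words and characters."""
    q_words = query.lower().split()
    d_words = document.lower().split()
    if not q_words:
        return 0.0
    total_chars = sum(len(w) for w in q_words)
    found_words = 0
    found_chars = 0
    for qw in q_words:
        distances = [levenshtein(qw, dw) for dw in d_words]
        # Assumption: a query word is "found" if some document word is
        # within max_errors edits of it.
        if distances and min(distances) <= max_errors:
            found_words += 1
            # Assumption: each edit counts as one not-found character.
            found_chars += max(len(qw) - min(distances), 0)
    word_ratio = found_words / len(q_words)
    char_ratio = found_chars / total_chars
    # Assumption: words and characters contribute equally.
    return 0.5 * word_ratio + 0.5 * char_ratio


def rank(query: str, documents: List[str]) -> List[Tuple[float, str]]:
    """Return (score, document) pairs sorted from most to least relevant."""
    scored = [(relevance(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

For example, `rank("fuzzy search", ["cooking recipes", "fuzzy search system", "fuzzy logic"])` places the document containing both query words first and the unrelated document last.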


Source journal

Technology audit and production reserves

Self-citation rate

0.00%

Articles published

Review time

8 weeks