Improving regular-expression matching on strings using negative factors

Proceedings. ACM-SIGMOD International Conference on Management of Data. Pub Date: 2013-06-22. DOI: 10.1145/2463676.2465289

Xiaochun Yang, Bin Wang, Tao Qiu, Yaoshu Wang, Chen Li

Pages: 361-372

Citations: 10

Abstract

The problem of finding matches of a regular expression (RE) on a string exists in many applications such as text editing, biosequence search, and shell commands. Existing techniques first identify candidates using substrings in the RE, then verify each of them using an automaton. These techniques become inefficient when there are many candidate occurrences that need to be verified. In this paper we propose a novel technique that prunes false negatives by utilizing negative factors, which are substrings that cannot appear in an answer. A main advantage of the technique is that it can be integrated with many existing algorithms to improve their efficiency significantly. We give a full specification of this technique. We develop an efficient algorithm that utilizes negative factors to prune candidates, then improve it by using bit operations to process negative factors in parallel. We show that negative factors, when used together with necessary factors (substrings that must appear in each answer), can achieve much better pruning power. We analyze the large number of negative factors, and develop an algorithm for finding a small number of high-quality negative factors. We conducted a thorough experimental study of this technique on real data sets, including DNA sequences, proteins, and text documents, and show the significant performance improvement when applying the technique in existing algorithms. For instance, it improved the search speed of the popular GNU Grep tool by 11 to 74 times for text documents.
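The interplay of necessary and negative factors described in the abstract can be illustrated with a small Python sketch. This is a toy under stated assumptions, not the authors' algorithm: the names `find_all` and `search_with_factors` are hypothetical, the necessary factors are assumed to be a literal prefix and a literal suffix of every match, and the negative factors are picked by hand rather than derived from the RE. The sketch also does not use the bit-parallel processing mentioned in the abstract.

```python
import re


def find_all(text, sub):
    """Return the start offset of every (possibly overlapping) occurrence of sub."""
    starts, i = [], text.find(sub)
    while i != -1:
        starts.append(i)
        i = text.find(sub, i + 1)
    return starts


def search_with_factors(text, pattern, prefix, suffix, negative_factors):
    """Toy candidate-and-verify search with negative-factor pruning.

    Assumes (hypothetically) that every match starts with `prefix`, ends
    with `suffix`, and never contains any string in `negative_factors`.
    Returns (matches, verified): the (start, end) spans found and the
    number of candidates that actually reached the regex engine.
    """
    regex = re.compile(pattern)
    suffix_starts = find_all(text, suffix)
    # (start, exclusive end) of every negative-factor occurrence.
    neg_spans = [(s, s + len(f)) for f in negative_factors for s in find_all(text, f)]

    matches, verified = [], 0
    for start in find_all(text, prefix):  # each prefix occurrence is a candidate
        # A match may not fully contain a negative factor, so it has to end
        # before the earliest-ending negative occurrence that begins at or
        # after `start`.
        limit = min((e - 1 for s, e in neg_spans if s >= start), default=len(text))
        # Prune: no occurrence of the required suffix fits inside [start, limit).
        if not any(start <= p and p + len(suffix) <= limit for p in suffix_starts):
            continue
        verified += 1
        m = regex.match(text, start, limit)  # verification step
        if m:
            matches.append(m.span())
    return matches, verified


text = "abzzef--abcdef--abcc--abdef"
matches, verified = search_with_factors(
    text, r"ab[cd]*ef", prefix="ab", suffix="ef", negative_factors=["z", "-"]
)
print(matches, verified)  # [(8, 14), (22, 27)] 2
```

On this input the prefix "ab" yields four candidates, and two of them are discarded without running the regex because a negative factor appears before any "ef" could close the match. A real implementation would compute the factors from the RE and scan the text far more efficiently than these nested loops.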


