Which Regular Expression Patterns Are Hard to Match?

2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS). Pub Date: 2015-11-22. DOI: 10.1109/FOCS.2016.56

A. Backurs, P. Indyk


Citations: 89

Abstract

Regular expressions constitute a fundamental notion in formal language theory and are frequently used in computer science to define search patterns. In particular, regular expression matching and membership testing are widely used computational primitives, employed in many programming languages and text processing utilities. A classic algorithm for these problems constructs and simulates a non-deterministic finite automaton corresponding to the expression, resulting in an O(m n) running time (where m is the length of the pattern and n is the length of the text). This running time can be improved slightly (by a polylogarithmic factor), but no significantly faster solutions are known. At the same time, much faster algorithms exist for various special cases of regular expressions, including dictionary matching, wildcard matching, subset matching, the word break problem, etc. In this paper, we show that the complexity of regular expression matching can be characterized based on its depth (when interpreted as a formula). Our results hold for expressions involving concatenation, OR, Kleene star and Kleene plus. For regular expressions of depth two (involving any combination of the above operators), we show the following dichotomy: matching and membership testing can be solved in near-linear time, except for "concatenations of stars", which cannot be solved in strongly sub-quadratic time assuming the Strong Exponential Time Hypothesis (SETH). For regular expressions of depth three the picture is more complex. Nevertheless, we show that all problems can either be solved in strongly sub-quadratic time, or cannot be solved in strongly sub-quadratic time assuming SETH. An intriguing special case of membership testing involves regular expressions of the form "a star of an OR of concatenations", e.g., [a|ab|bc]*. This corresponds to the so-called word break problem, for which a dynamic programming algorithm with a runtime of (roughly) O(n √m) is known. We show that the latter bound is not tight and improve the runtime to O(n m^{0.44...}).
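
As an illustration of the word break problem mentioned in the abstract, the following is a minimal Python sketch of a textbook-style dynamic program. It is not the paper's improved algorithm. The function name `word_break` and its parameters are assumptions made for this example. Words are grouped by length because a dictionary of total length m has only O(√m) distinct word lengths, which is the observation behind the roughly O(n √m) bound; this sketch compares substrings by slicing, so it does not itself reach that bound (constant-time substring hashing would be needed for that).

```python
from collections import defaultdict


def word_break(text: str, words: list[str]) -> bool:
    """Return True if `text` is a concatenation of zero or more dictionary words.

    Illustrative sketch only: a simple dynamic program, not the faster
    algorithm from the paper.
    """
    # Group dictionary words by length. If the total dictionary length is m,
    # there are at most O(sqrt(m)) distinct lengths.
    by_length = defaultdict(set)
    for w in words:
        if w:  # skip empty words; they never advance the position
            by_length[len(w)].add(w)

    n = len(text)
    # reachable[i] is True when text[:i] can be split into dictionary words.
    reachable = [False] * (n + 1)
    reachable[0] = True  # the empty prefix is trivially segmentable

    for i in range(n):
        if not reachable[i]:
            continue
        # Try to extend the segmented prefix text[:i] by one word of each length.
        for length, group in by_length.items():
            if i + length <= n and text[i:i + length] in group:
                reachable[i + length] = True

    return reachable[n]
```

With the dictionary from the abstract's example, `word_break("abbc", ["a", "ab", "bc"])` tests membership of "abbc" in the language of a star over the OR of "a", "ab" and "bc".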
