Which Regular Expression Patterns Are Hard to Match?

A. Backurs, P. Indyk
DOI: 10.1109/FOCS.2016.56 (https://doi.org/10.1109/FOCS.2016.56)
Venue: 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS)
Publication date: 2015-11-22
Citations: 89

Abstract

Regular expressions constitute a fundamental notion in formal language theory and are frequently used in computer science to define search patterns. In particular, regular expression matching and membership testing are widely used computational primitives, employed in many programming languages and text processing utilities. A classic algorithm for these problems constructs and simulates a non-deterministic finite automaton corresponding to the expression, resulting in an O(mn) running time (where m is the length of the pattern and n is the length of the text). This running time can be improved slightly (by a polylogarithmic factor), but no significantly faster solutions are known. At the same time, much faster algorithms exist for various special cases of regular expressions, including dictionary matching, wildcard matching, subset matching, and the word break problem. In this paper, we show that the complexity of regular expression matching can be characterized based on the depth of the expression (when interpreted as a formula). Our results hold for expressions involving concatenation, OR, Kleene star and Kleene plus. For regular expressions of depth two (involving any combination of the above operators), we show the following dichotomy: matching and membership testing can be solved in near-linear time, except for "concatenations of stars", which cannot be solved in strongly sub-quadratic time assuming the Strong Exponential Time Hypothesis (SETH). For regular expressions of depth three the picture is more complex. Nevertheless, we show that all problems can either be solved in strongly sub-quadratic time, or cannot be solved in strongly sub-quadratic time assuming SETH. An intriguing special case of membership testing involves regular expressions of the form "a star of an OR of concatenations", e.g., [a|ab|bc]*. This corresponds to the so-called word break problem, for which a dynamic programming algorithm with a runtime of (roughly) O(n √m) is known. We show that the latter bound is not tight and improve the runtime to O(n m^{0.44...}).
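To make the word break problem concrete, the following is a minimal Python sketch of the textbook dynamic program for membership testing against "a star of an OR of concatenations". It is an illustration only: it is neither the O(n √m) algorithm mentioned in the abstract nor the paper's improved algorithm, and the function name `word_break` and its parameters are assumptions introduced here.

```python
from typing import Iterable


def word_break(text: str, words: Iterable[str]) -> bool:
    """Return True if text is a concatenation of zero or more of the words.

    Equivalent to testing membership of text in (w1|w2|...|wk)*.
    """
    word_set = {w for w in words if w}             # drop empty words
    lengths = sorted({len(w) for w in word_set})   # distinct word lengths
    n = len(text)

    # reachable[i] is True when the prefix text[:i] splits into words.
    reachable = [False] * (n + 1)
    reachable[0] = True  # the empty prefix is matched by zero repetitions

    for i in range(1, n + 1):
        for length in lengths:
            if length > i:
                break  # remaining lengths are even longer
            # A word of this length must end at i, right after a reachable prefix.
            if reachable[i - length] and text[i - length:i] in word_set:
                reachable[i] = True
                break
    return reachable[n]
```

For the abstract's example, `word_break("abbc", ["a", "ab", "bc"])` accepts because "abbc" splits as "ab" followed by "bc", while `word_break("abb", ["a", "ab", "bc"])` rejects. The inner loop runs over distinct word lengths rather than over all words; since words of total length m have at most O(√m) distinct lengths, this is the observation behind the roughly O(n √m) bound, although this sketch pays extra for each substring lookup and so does not itself achieve that bound.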