Udi Boker, Thomas A. Henzinger, Karoliina Lehtinen, Aditya Prakash
An automaton is history-deterministic if its nondeterminism can be resolved on the fly, only using the prefix of the word read so far. This mild form of nondeterminism has attracted particular attention for its applications in synthesis problems. An automaton $A$ is guidable with respect to a class $C$ of automata if it can fairly simulate every automaton in $C$ whose language is contained in that of $A$. In other words, guidable automata are those for which inclusion and simulation coincide, making them particularly interesting for model-checking. We study the connection between these two notions, and specifically the question of when they coincide. For classes of automata on which they do, deciding guidability, an otherwise challenging decision problem, reduces to deciding history-determinism, a problem that is starting to be well-understood for many classes. We provide a selection of sufficient criteria for a class of automata to guarantee the coincidence of the notions, and use them to show that the notions coincide for the most common automata classes, among which are $\omega$-regular automata and many infinite-state automata with safety and reachability acceptance conditions, including vector addition systems with states, one-counter nets, and pushdown, Parikh, and timed automata. We also demonstrate that history-determinism and guidability do not always coincide, for example, for the classes of timed automata with a fixed number of clocks.
History-Determinism vs Fair Simulation. Udi Boker, Thomas A. Henzinger, Karoliina Lehtinen, Aditya Prakash. arXiv - CS - Formal Languages and Automata Theory, 2024-07-11. https://doi.org/arxiv-2407.08620
LMs are often expected to generate strings in some formal language; for example, structured data, API calls, or code snippets. Although LMs can be tuned to improve their adherence to formal syntax, this does not guarantee conformance, especially with smaller LMs suitable for large-scale deployment. In addition, tuning requires significant resources, making it impractical for uncommon or task-specific formats. To prevent downstream parsing errors we would ideally constrain the LM to only produce valid output, but this is severely complicated by tokenization, which is typically both ambiguous and misaligned with the formal grammar. We solve these issues through the application of automata theory, deriving an efficient closed-form solution for the regular languages, a broad class of formal languages with many practical applications, including API calls or schema-guided JSON and YAML. We also discuss pragmatic extensions for coping with the issue of high branching factor. Finally, we extend our techniques to deterministic context-free languages, which similarly admit an efficient closed-form solution. In spite of its flexibility and representative power, our approach only requires access to per-token decoding logits and lowers into simple calculations that are independent of LM size, making it both efficient and easy to apply to almost any LM architecture.
Automata-based constraints for language model decoding. Terry Koo, Frederick Liu, Luheng He. arXiv - CS - Formal Languages and Automata Theory, 2024-07-11. https://doi.org/arxiv-2407.08103
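The idea of constraining decoding with an automaton can be illustrated with a small sketch. It is not the construction from the paper: it simply simulates a character-level DFA over each candidate token and masks the logits of tokens the DFA cannot consume. The DFA (for comma-separated numbers), the vocabulary, and the names `advance`, `mask_logits`, and `constrained_greedy_step` are all hypothetical, and every DFA state is assumed to be able to reach an accepting state.

```python
import math

# Hypothetical character-level DFA for the regular language [0-9]+(,[0-9]+)*
# State 0: a digit is required next. State 1: inside a number (accepting).
DIGITS = "0123456789"
TRANSITIONS = {(0, d): 1 for d in DIGITS}
TRANSITIONS.update({(1, d): 1 for d in DIGITS})
TRANSITIONS[(1, ",")] = 0
ACCEPTING = {1}


def advance(state, text):
    """Run the DFA over text from state; return the new state, or None if stuck."""
    for ch in text:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return None
    return state


def mask_logits(state, vocab, logits):
    """Replace the logit of every token the DFA cannot consume with -inf."""
    return [
        logit if advance(state, token) is not None else -math.inf
        for token, logit in zip(vocab, logits)
    ]


def constrained_greedy_step(state, vocab, logits):
    """Pick the highest-scoring allowed token and return it with the next state."""
    masked = mask_logits(state, vocab, logits)
    best = max(range(len(vocab)), key=lambda i: masked[i])
    if masked[best] == -math.inf:
        raise ValueError("no token is allowed in this state")
    return vocab[best], advance(state, vocab[best])


# Tokens need not align with grammar symbols: "3," spans a number and a separator.
vocab = ["12", ",", "a", "3,", ",,"]
logits = [0.1, 2.0, 3.0, 0.5, 1.0]
print(constrained_greedy_step(0, vocab, logits))
print(constrained_greedy_step(1, vocab, logits))
```

The mask depends only on the automaton state and the vocabulary, not on the model, which is consistent with the abstract's remark that the approach needs only per-token logits. Simulating every token at every step costs time proportional to the vocabulary size, so a practical implementation would precompute the allowed tokens per state.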
Yu Wang, Zhaohui Zhu, Rob van Glabbeek, Jinjin Zhang, Lixing Tan
Takai proposed a method for constructing a maximally permissive supervisor for the similarity control problem (IEEE Transactions on Automatic Control, 66(7):3197-3204, 2021). This paper points out flaws in his results by providing a counterexample. Inspired by Takai's construction, the notion of a (saturated) (G, R)-automaton is introduced and metatheorems concerning (maximally permissive) supervisors for the similarity control problem are provided in terms of this notion. As an application of these metatheorems, the flaws in Takai's work are corrected.
More on Maximally Permissive Similarity Control of Discrete Event Systems. Yu Wang, Zhaohui Zhu, Rob van Glabbeek, Jinjin Zhang, Lixing Tan. arXiv - CS - Formal Languages and Automata Theory, 2024-07-10. https://doi.org/arxiv-2407.08068
We introduce and study a generalized Parikh matrix mapping based on tracking the occurrence counts of special types of subsequences. These matrices retain more information about a word than the original Parikh matrix mapping while preserving the homomorphic property. We build the generalization by first introducing the Parikh factor matrix mapping and then extending it to the Parikh sequence matrix mapping. We establish an interesting connection between the generalized Parikh matrices and the original ones and use it to prove that certain important minors of a Parikh sequence matrix have nonnegative determinant. Finally, we generalize the concept of subword histories and show that each generalized subword history is equivalent to a linear one.
Generalized Parikh Matrices For Tracking Subsequence Occurrences. Szilárd Zsolt Fazekas, Xinhao Huang. arXiv - CS - Formal Languages and Automata Theory, 2024-07-05. https://doi.org/arxiv-2407.04462
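For orientation, the original (non-generalized) Parikh matrix mapping that the abstract builds on can be sketched as follows. Over an ordered alphabet of k letters, each letter is sent to a (k+1)×(k+1) upper unitriangular matrix, and a word is sent to the product of its letters' matrices. The function names are hypothetical, and the generalized mappings of the paper are not reproduced here.

```python
def letter_matrix(index, k):
    """Identity matrix of size k+1 with an extra 1 just above the diagonal in row index."""
    n = k + 1
    matrix = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    matrix[index][index + 1] = 1
    return matrix


def mat_mul(a, b):
    """Multiply two square integer matrices of the same size."""
    n = len(a)
    return [
        [sum(a[i][t] * b[t][j] for t in range(n)) for j in range(n)]
        for i in range(n)
    ]


def parikh_matrix(word, alphabet):
    """Classical Parikh matrix of word over the ordered alphabet."""
    k = len(alphabet)
    result = [[1 if i == j else 0 for j in range(k + 1)] for i in range(k + 1)]
    for ch in word:
        result = mat_mul(result, letter_matrix(alphabet.index(ch), k))
    return result


# For the alphabet a < b, the entries above the diagonal count the
# occurrences of a, b and the scattered subword ab.
print(parikh_matrix("abb", "ab"))  # [[1, 1, 2], [0, 1, 2], [0, 0, 1]]
```

Because the mapping is a product of letter matrices, it is a homomorphism: the matrix of a concatenation is the product of the matrices. It is not injective, though. For example, `abba` and `baab` receive the same matrix, which is the kind of information loss that motivates richer mappings.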
Elias Alevizos, Alexander Artikis, Georgios Paliouras
We present a system for Complex Event Recognition (CER) based on automata. While multiple such systems have been described in the literature, they typically suffer from a lack of clear and denotational semantics, a limitation which often leads to confusion with respect to their expressive power. In order to address this issue, our system is based on an automaton model which is a combination of symbolic and register automata. We extend previous work on these types of automata, in order to construct a formalism with clear semantics and a corresponding automaton model whose properties can be formally investigated. We call such automata Symbolic Register Transducers (SRT). We show that SRT are closed under various operators, but are not in general closed under complement and they are not determinizable. However, they are closed under these operations when a window operator, quintessential in Complex Event Recognition, is used. We show how SRT can be used in CER in order to detect patterns upon streams of events, using our framework that provides declarative and compositional semantics, and that allows for a systematic treatment of such automata. For SRT to work in pattern detection, we allow them to mark events from the input stream as belonging to a complex event or not, hence the name "transducers". We also present an implementation of SRT which can perform CER. We compare our SRT-based CER engine against other state-of-the-art CER systems and show that it is both more expressive and more efficient.
Complex Event Recognition with Symbolic Register Transducers: Extended Technical Report. Elias Alevizos, Alexander Artikis, Georgios Paliouras. arXiv - CS - Formal Languages and Automata Theory, 2024-07-03. https://doi.org/arxiv-2407.02884
Proving mathematical theorems using computer-verifiable formal languages like Lean significantly impacts mathematical reasoning. One approach to formal theorem proving involves generating complete proofs using Large Language Models (LLMs) based on Natural Language (NL) proofs. Similar methods have shown promising results in code generation. However, most modern LLMs exhibit suboptimal performance due to the scarcity of aligned NL and Formal Language (FL) theorem-proving data. This scarcity results in a paucity of methodologies for training LLMs and techniques to fully utilize their capabilities in composing formal proofs. To address the challenges, this paper proposes **TheoremLlama**, an end-to-end framework to train a general-purpose LLM to become a Lean4 expert. This framework encompasses NL-FL aligned dataset generation methods, training approaches for the LLM formal theorem prover, and techniques for LLM Lean4 proof writing. Using the dataset generation method, we provide *Open Bootstrapped Theorems* (OBT), an NL-FL aligned and bootstrapped dataset. A key innovation in this framework is the NL-FL bootstrapping method, where NL proofs are integrated into Lean4 code for training datasets, leveraging the NL reasoning ability of LLMs for formal reasoning. The **TheoremLlama** framework achieves cumulative accuracies of 36.48% and 33.61% on MiniF2F-Valid and Test datasets respectively, surpassing the GPT-4 baseline of 22.95% and 25.41%. We have also open-sourced our model checkpoints and generated dataset, and will soon make all the code publicly available.
TheoremLlama: Transforming General-Purpose LLMs into Lean4 Experts. Ruida Wang, Jipeng Zhang, Yizhen Jia, Rui Pan, Shizhe Diao, Renjie Pi, Tong Zhang. arXiv - CS - Formal Languages and Automata Theory, 2024-07-03. https://doi.org/arxiv-2407.03203
This paper proposes a definition of recognizable transducers over monads and comonads, which bridges two important ongoing efforts in the current research on regularity. The first effort is the study of regular transductions, which extends the notion of regularity from languages to word-to-word functions. The other important effort is generalizing the notion of regular languages from words to arbitrary monads, introduced in arXiv:1502.04898. In this paper, we present a number of examples of transducer classes that fit the proposed framework. In particular, we show that our class generalizes the classes of Mealy machines and rational transductions. We also present examples of recognizable transducers for infinite words and a specific type of trees called terms. The main result of this paper is a theorem stating that the class of recognizable transductions is closed under composition, subject to some coherence axioms between the structure of a monad and the structure of a comonad. Due to its complexity, we formalize the proof of the theorem in the Coq proof assistant. In the proof, we introduce the concepts of a context and a generalized wreath product for Eilenberg-Moore algebras, which could be valuable tools for studying these algebras.
Monads, Comonads, and Transducers. Rafał Stefański. arXiv - CS - Formal Languages and Automata Theory, 2024-07-02. https://doi.org/arxiv-2407.02704
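Closure under composition is easiest to see in the simplest class the abstract mentions, Mealy machines. The sketch below is the textbook product construction for that special case only, not the monad/comonad construction of the paper. It assumes total, deterministic machines whose outputs are single symbols, and the machine encodings and function names are hypothetical.

```python
def run_mealy(delta, start, word):
    """Feed word to a Mealy machine and return the emitted output string."""
    state, out = start, []
    for symbol in word:
        state, emitted = delta[(state, symbol)]
        out.append(emitted)
    return "".join(out)


def compose(delta1, start1, delta2, start2):
    """Build the product machine that applies machine 1 and then machine 2."""
    start = (start1, start2)
    inputs = {symbol for (_, symbol) in delta1}
    delta, todo, seen = {}, [start], {start}
    while todo:
        q1, q2 = todo.pop()
        for symbol in inputs:
            p1, middle = delta1[(q1, symbol)]    # step machine 1
            p2, emitted = delta2[(q2, middle)]   # feed its output to machine 2
            delta[((q1, q2), symbol)] = ((p1, p2), emitted)
            if (p1, p2) not in seen:
                seen.add((p1, p2))
                todo.append((p1, p2))
    return delta, start


# Machine 1 outputs the running parity of the bits read so far.
parity = {(0, "0"): (0, "0"), (0, "1"): (1, "1"),
          (1, "0"): (1, "1"), (1, "1"): (0, "0")}
# Machine 2 flips every bit and has a single state.
negate = {("n", "0"): ("n", "1"), ("n", "1"): ("n", "0")}

both, start = compose(parity, 0, negate, "n")
print(run_mealy(both, start, "1101"))  # 0110
```

The composite machine is again a Mealy machine, with the pairs of states as its state space, which is the finite-word instance of the closure property.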
We consider a class of finite-state three-tape transducers which models the operation of shuffling and splitting words. We present them as automata over the so-called Shuffling Monoid. These automata can be seen as either shufflers or splitters interchangeably. We prove that functionality is decidable for splitters, and we also show that the equivalence between functional splitters is decidable. Moreover, in the deterministic case, the algorithm for equivalence is polynomial in the number of states of the splitter.
On Shuffling and Splitting Automata. Ignacio Mollo Cunningham. arXiv - CS - Formal Languages and Automata Theory, 2024-07-02. https://doi.org/arxiv-2407.02660
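The shuffle relation underlying these transducers relates a pair of words (u, v) to every word w obtained by interleaving them while keeping the letters of each in order. A minimal sketch of checking that relation with memoized recursion is given below. The function name `is_shuffle` is hypothetical, and the code does not model the three-tape automata themselves.

```python
from functools import lru_cache


def is_shuffle(u, v, w):
    """Return True if w interleaves u and v, preserving the order within each."""
    if len(u) + len(v) != len(w):
        return False

    @lru_cache(maxsize=None)
    def go(i, j):
        # i letters of u and j letters of v have been matched against w[:i + j].
        if i == len(u) and j == len(v):
            return True
        k = i + j
        if i < len(u) and u[i] == w[k] and go(i + 1, j):
            return True
        if j < len(v) and v[j] == w[k] and go(i, j + 1):
            return True
        return False

    return go(0, 0)


print(is_shuffle("ab", "cd", "acbd"))  # True
print(is_shuffle("ab", "cd", "abdc"))  # False
```

Read from (u, v) to w the relation shuffles, and read from w to (u, v) it splits, which matches the abstract's remark that the same automaton can be viewed either way.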
We study the question of whether a given regular language of finite trees can be defined in first-order logic. We develop an algebraic approach to address this question and we use it to derive several necessary and sufficient conditions for definability (but unfortunately no condition that is both). The main difference between our results and those from the literature is that our conditions are decidable.
Some Remarks on First-Order Definable Tree Languages. Achim Blumensath. arXiv - CS - Formal Languages and Automata Theory, 2024-07-01. https://doi.org/arxiv-2407.01169
Backreference is a well-known practical extension of regular expressions and most modern programming languages, such as Java, Python, JavaScript and more, support regular expressions with backreferences (rewb) in their standard libraries for string processing. A difficulty of backreference is non-regularity: unlike some other extensions, backreference strictly enhances the expressive power of regular expressions and thus rewbs can describe non-regular (in fact, even non-context-free) languages. In this paper, we investigate the expressive power of rewbs by comparing rewbs to multiple context-free languages (MCFL) and parallel multiple context-free languages (PMCFL). First, we prove that the language class of rewbs is a proper subclass of unary-PMCFLs. The class of unary-PMCFLs coincides with that of EDT0L languages, and our result strictly improves the known upper bound of rewbs. We also show, however, that the language class of rewbs is not contained in that of MCFLs even when restricted to rewbs with only one capturing group and no captured references. Therefore, in general, the parallelism seems essential for rewbs. Backed by these results, we define a novel syntactic condition on rewbs that we call closed-star and observe that it provides an upper bound on the number of times a rewb references the same captured string. The closed-star condition allows dispensing with the parallelism: that is, we prove that the language class of closed-star rewbs falls inside the class of unary-MCFLs, which is equivalent to that of EDT0L systems of finite index. Furthermore, as additional evidence for the robustness of the condition, we show that the language class of closed-star rewbs also falls inside the class of nonerasing stack languages (NESL).
Regular Expressions with Backreferences on Multiple Context-Free Languages, and the Closed-Star Condition. Taisei Nogami, Tachio Terauchi. arXiv - CS - Formal Languages and Automata Theory, 2024-06-27. https://doi.org/arxiv-2406.18918
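The non-regularity mentioned in the abstract is easy to observe with Python's standard `re` module. The sketch below uses a single capturing group and one backreference to recognize the copy language { ww : w ∈ {a, b}* }, a standard example of a language that is not context-free. The pattern and the helper name `is_square` are illustrative choices, not taken from the paper.

```python
import re

# ([ab]*) captures some word w; \1 then demands a second copy of exactly that w.
SQUARE = re.compile(r"([ab]*)\1")


def is_square(word):
    """Return True if word has the form ww for some w over {a, b}."""
    return SQUARE.fullmatch(word) is not None


print(is_square("abab"))  # True
print(is_square("abba"))  # False
```

Without the backreference, `[ab]*[ab]*` would accept every word over {a, b}. It is the equality constraint between the captured string and its later reference that takes the language beyond the regular and context-free classes.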