Nicola Cotumaccio, Giovanna D’Agostino, Alberto Policriti, Nicola Prezza
Journal of the ACM. Published 2023-07-07. DOI: https://dl.acm.org/doi/10.1145/3607471
Co-lexicographically Ordering Automata and Regular Languages - Part I
The states of a finite-state automaton \(\mathcal{N}\) can be identified with collections of words in the prefix closure of the regular language accepted by \(\mathcal{N}\). Words can be ordered, and among the many possible orders a very natural one is the co-lexicographic order, in which words are compared from right to left. It is natural because it suggests a transfer of the order from words to the automaton’s states. This suggestion is, in fact, concrete: a number of papers have proposed and studied automata admitting a total co-lexicographic (co-lex for brevity) ordering of states. This class of ordered automata, known as Wheeler automata, turned out to require just a constant number of bits per transition to be represented and to enable regular expression matching queries in constant time per matched character.
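As a small illustration of the order on words mentioned above, the following sketch sorts a handful of words co-lexicographically by comparing their reversals. The helper names `colex_key` and `colex_sorted` are hypothetical and chosen for this example only; the sketch orders plain words and does not attempt to order the states of an automaton.

```python
def colex_key(word: str) -> str:
    """Sort key for co-lexicographic order: compare words right to left,
    which is the same as comparing the reversed words lexicographically."""
    return word[::-1]


def colex_sorted(words):
    """Return the words sorted in co-lexicographic order."""
    return sorted(words, key=colex_key)


if __name__ == "__main__":
    words = ["ab", "b", "aa", "ba", "a", ""]
    print(colex_sorted(words))
    # prints ['', 'a', 'aa', 'ba', 'b', 'ab']
```

Words ending in `a` come before words ending in `b`, and ties are broken by the preceding characters, so `ba` precedes `b` only if... it does not: `ba` ends in `a` and therefore precedes `b`, while `ab` ends in `b` and comes last.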
Unfortunately, not all automata can be totally ordered as outlined above. In the present work, we lay out a new theory showing that all automata can always be partially ordered, and that an intrinsic measure of their complexity can be defined and effectively determined: the minimum width p of one of their admissible co-lex partial orders, dubbed here the automaton’s co-lex width. We first show that this new measure captures at once the complexity of several seemingly unrelated hard problems on automata. Any NFA of co-lex width p: (i) has an equivalent powerset DFA whose size is exponential in p rather than (as a classic analysis shows) in the NFA’s size; (ii) can be encoded using just Θ(log p) bits per transition; (iii) admits a linear-space data structure solving regular expression matching queries in time proportional to \(p^2\) per matched character. Some consequences of this new parameterization of automata are that PSPACE-hard problems such as NFA equivalence are FPT in p, and that quadratic lower bounds for the regular expression matching problem do not hold for sufficiently small p.
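To make the notion of width concrete, here is a minimal sketch that computes the width of one given finite partial order, that is, the size of its largest antichain. It relies on Dilworth's theorem (width equals the minimum number of chains covering the elements) and finds that cover with a simple augmenting-path bipartite matching. This is not the paper's algorithm: it does not compute the co-lex width of an automaton, which requires minimizing over the admissible co-lex orders. The function name `poset_width` and the input encoding are assumptions made for this example.

```python
def poset_width(n, less_than):
    """Width of a strict partial order on elements 0..n-1.

    less_than: set of pairs (u, v) meaning u < v. It is assumed to be
    transitively closed; otherwise the result may be too large.
    """
    # succ[u] lists every v with u < v (edges of the bipartite graph).
    succ = [[v for v in range(n) if (u, v) in less_than] for u in range(n)]
    # match_right[v] = u means u is the immediate predecessor of v in a chain.
    match_right = [-1] * n

    def try_augment(u, seen):
        """Try to find an augmenting path starting from left vertex u."""
        for v in succ[u]:
            if v in seen:
                continue
            seen.add(v)
            if match_right[v] == -1 or try_augment(match_right[v], seen):
                match_right[v] = u
                return True
        return False

    matching = sum(try_augment(u, set()) for u in range(n))
    # Each matched pair merges two chains, so the minimum chain cover
    # (and hence the width, by Dilworth's theorem) is n - matching.
    return n - matching


if __name__ == "__main__":
    # Diamond order: 0 < 1 < 3 and 0 < 2 < 3, with 1 and 2 incomparable.
    diamond = {(0, 1), (0, 2), (1, 3), (2, 3), (0, 3)}
    print(poset_width(4, diamond))  # prints 2
```

A total order has width 1, which corresponds to the totally ordered (Wheeler) case, while a set of pairwise incomparable elements has width equal to its size.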
Having established that the co-lex width of an automaton is a fundamental complexity measure, we proceed by (i) determining its computational complexity and (ii) extending this notion from automata to regular languages by studying their smallest-width accepting NFAs and DFAs. In this work we focus on the deterministic case and prove that a canonical minimum-width DFA accepting a language \(\mathcal{L}\), dubbed the Hasse automaton \(\mathcal{H}\) of \(\mathcal{L}\), can be exhibited. \(\mathcal{H}\) provides, in a precise sense, the best possible way to (partially) order the states of any DFA accepting \(\mathcal{L}\), as long as we want to maintain an operational link with the (co-lexicographic) order of \(\mathcal{L}\)’s prefixes. Finally, we explore the relationship between two conflicting objectives: minimizing the width and minimizing the number of states of a DFA. In this context, we provide an analogue of the Myhill-Nerode Theorem for co-lexicographically ordered regular languages.
About the journal:
The best indicator of the scope of the journal is provided by the areas covered by its Editorial Board. These areas change from time to time, as the field evolves. The following areas are currently covered by a member of the Editorial Board: Algorithms and Combinatorial Optimization; Algorithms and Data Structures; Algorithms, Combinatorial Optimization, and Games; Artificial Intelligence; Complexity Theory; Computational Biology; Computational Geometry; Computer Graphics and Computer Vision; Computer-Aided Verification; Cryptography and Security; Cyber-Physical, Embedded, and Real-Time Systems; Database Systems and Theory; Distributed Computing; Economics and Computation; Information Theory; Logic and Computation; Logic, Algorithms, and Complexity; Machine Learning and Computational Learning Theory; Networking; Parallel Computing and Architecture; Programming Languages; Quantum Computing; Randomized Algorithms and Probabilistic Analysis of Algorithms; Scientific Computing and High Performance Computing; Software Engineering; Web Algorithms and Data Mining