Pub Date: 2022-02-10 DOI: 10.4230/LIPIcs.CPM.2023.26
T. Gagie
Suppose we are asked to index a text $T[0..n-1]$ such that, given a pattern $P[0..m-1]$, we can quickly report the maximal substrings of $P$ that each occur in $T$ at least $k$ times. We first show how we can add $O(r \log n)$ bits to Rossi et al.'s recent MONI index, where $r$ is the number of runs in the Burrows-Wheeler Transform of $T$, such that it supports such queries in $O(k m \log n)$ time. We then show how, if we are given $k$ at construction time, we can reduce the query time to $O(m \log n)$.
Title: MONI can find k-MEMs. Venue: Annual Symposium on Combinatorial Pattern Matching.
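To make the query in this abstract concrete, the following is a brute-force sketch of what "maximal substrings of $P$ occurring at least $k$ times in $T$" means. It is not the MONI-based method and runs far slower than the stated bounds; the function names `count_occurrences` and `k_mems` are illustrative assumptions, not names from the paper.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of a non-empty pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def k_mems(text, pattern, k):
    """Return half-open intervals (i, j) such that pattern[i:j] occurs at least
    k times in text and cannot be extended to the left or to the right."""
    m = len(pattern)
    ends = []
    for i in range(m):
        j = i
        # Extending a substring never increases its occurrence count,
        # so the first failure marks the longest valid match from i.
        while j < m and count_occurrences(text, pattern[i:j + 1]) >= k:
            j += 1
        ends.append(j)
    result = []
    prev_end = 0
    for i, end in enumerate(ends):
        # Keep a non-empty match only if it is not contained in the match
        # that starts one position earlier (left-maximality).
        if end > i and end > prev_end:
            result.append((i, end))
        prev_end = max(prev_end, end)
    return result
```

For example, with text `abracadabra` and pattern `cabrax`, the only substring reported for $k = 2$ is `abra`, while $k = 1$ additionally reports `ca`.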
Pub Date: 2022-01-16 DOI: 10.4230/LIPIcs.CPM.2022.17
D. Fisman, Joshua Grogin, Oded Margalit, Gera Weiss
We prove that the normalized edit distance proposed in [Marzal and Vidal 1993] is a metric when the costs of all the edit operations are the same. This closes a long-standing gap in the literature, where several authors noted that this distance does not satisfy the triangle inequality in the general case, and that it was not known whether it is satisfied in the uniform case, where all the edit costs are equal. We compare this metric to two normalized metrics proposed as alternatives in the literature, when people thought that Marzal and Vidal's distance is not a metric, and identify key properties that explain why the original distance, now known to also be a metric, is better for some applications. Our examination is from the point of view of formal verification, but the properties and their significance are stated in an application-agnostic way.
Title: The Normalized Edit Distance with Uniform Operation Costs is a Metric. Venue: Annual Symposium on Combinatorial Pattern Matching.
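As an illustration of the quantity discussed in this abstract, the sketch below computes a normalized edit distance as the minimum, over all edit paths, of path weight divided by path length. It assumes unit cost for insertion, deletion and substitution and zero cost for a match, with matches counted in the path length; this is my reading of the uniform-cost setting, and the function name is hypothetical.

```python
from fractions import Fraction


def normalized_edit_distance(x, y):
    """Minimum of weight/length over all edit paths from x to y,
    assuming unit-cost insert/delete/substitute and zero-cost matches."""
    n, m = len(x), len(y)
    if n == 0 and m == 0:
        return Fraction(0)
    inf = float("inf")
    max_len = n + m  # no useful path has more than n + m steps
    # cost[i][j][l]: least weight of a path turning x[:i] into y[:j] in l steps
    cost = [[[inf] * (max_len + 1) for _ in range(m + 1)] for _ in range(n + 1)]
    cost[0][0][0] = 0
    for i in range(n + 1):
        for j in range(m + 1):
            for l in range(1, max_len + 1):
                best = inf
                if i > 0:  # delete x[i-1]
                    best = min(best, cost[i - 1][j][l - 1] + 1)
                if j > 0:  # insert y[j-1]
                    best = min(best, cost[i][j - 1][l - 1] + 1)
                if i > 0 and j > 0:  # match (cost 0) or substitute (cost 1)
                    step = 0 if x[i - 1] == y[j - 1] else 1
                    best = min(best, cost[i - 1][j - 1][l - 1] + step)
                cost[i][j][l] = best
    return min(
        Fraction(cost[n][m][l], l)
        for l in range(1, max_len + 1)
        if cost[n][m][l] != inf
    )
```

Exact fractions are used so that the triangle inequality can be spot-checked on small inputs without floating-point noise.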
Pub Date: 2021-08-17 DOI: 10.4230/LIPIcs.CPM.2022.9
Abhinav Nellore, Rachel A. Ward
Let $\widetilde{\alpha}$ be a length-$L$ cyclic sequence of characters from a size-$K$ alphabet $\mathcal{A}$ such that the number of occurrences of any length-$m$ string on $\mathcal{A}$ as a substring of $\widetilde{\alpha}$ is $\lfloor L / K^m \rfloor$ or $\lceil L / K^m \rceil$. When $L = K^N$ for any positive integer $N$, $\widetilde{\alpha}$ is a de Bruijn sequence of order $N$, and when $L \neq K^N$, $\widetilde{\alpha}$ shares many properties with de Bruijn sequences. We describe an algorithm that outputs some $\widetilde{\alpha}$ for any combination of $K \geq 2$ and $L \geq 1$ in $O(L)$ time using $O(L \log K)$ space. This algorithm extends Lempel's recursive construction of a binary de Bruijn sequence. An implementation written in Python is available at https://github.com/nelloreward/pkl.
Title: Arbitrary-length analogs to de Bruijn sequences. Venue: Annual Symposium on Combinatorial Pattern Matching.
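The defining property in this abstract can be checked directly on small inputs. The sketch below is only a verifier for that property, not the construction algorithm from the paper or its repository; the function name `is_balanced_cyclic` is an assumption made for illustration.

```python
from collections import Counter


def is_balanced_cyclic(seq, alphabet):
    """Check that every length-m string over the alphabet occurs
    floor(L / K^m) or ceil(L / K^m) times in the cyclic sequence seq."""
    L, K = len(seq), len(set(alphabet))
    if L == 0 or K < 2 or any(c not in alphabet for c in seq):
        return False
    m = 1
    while True:
        lo = L // K**m
        hi = -(-L // K**m)  # ceiling division
        extended = seq + seq[:m - 1]  # unroll the cycle for length-m windows
        counts = Counter(extended[i:i + m] for i in range(L))
        if any(c not in (lo, hi) for c in counts.values()):
            return False
        # Strings that never occur have count 0, which is allowed only if lo == 0.
        if lo >= 1 and len(counts) != K**m:
            return False
        # Once K^m >= L and all windows are distinct, longer windows are
        # distinct too, so larger m cannot fail.
        if K**m >= L:
            return True
        m += 1
```

With $K = 2$, the cyclic sequence `0011` passes (it is a de Bruijn sequence of order 2), `001` passes as a length-3 analog, and `0101` fails because `01` occurs twice.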
Pub Date: 2021-04-09 DOI: 10.4230/LIPIcs.CPM.2021.4
Duncan Adamson, Argyrios Deligkas, V. Gusev, I. Potapov
The main result of the paper is the first polynomial-time algorithm for ranking bracelets. The time complexity of the algorithm is O(k^2 n^4), where k is the size of the alphabet and n is the length of the considered bracelets. The key part of the algorithm is to compute the rank of any word with respect to the set of bracelets by finding three other ranks: the rank over all necklaces, the rank over palindromic necklaces, and the rank over enclosing apalindromic necklaces. The last two concepts are introduced in this paper. These three ranks let the algorithm decompose the problem into parts. Additionally, this ranking procedure is used to build a polynomial-time unranking algorithm.
Title: Ranking Bracelets in Polynomial Time. Venue: Annual Symposium on Combinatorial Pattern Matching.
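For intuition about what is being ranked, here is an exponential-time brute-force sketch, useful only for checking tiny cases. It assumes that a bracelet is represented by the lexicographically smallest word among all rotations and reflections, and that the rank of a word is the number of such representatives that are less than or equal to it; both conventions and the function names are assumptions, and the paper's exact definition may differ.

```python
from itertools import product


def canonical_bracelet(word):
    """Smallest word among all rotations of a non-empty word and of its reversal."""
    n = len(word)
    reflected = word[::-1]
    candidates = [word[i:] + word[:i] for i in range(n)]
    candidates += [reflected[i:] + reflected[:i] for i in range(n)]
    return min(candidates)


def bracelet_rank(word, alphabet):
    """Count canonical bracelets of length len(word) that are <= word.
    Enumerates all k^n words, so it is only usable for very small n and k."""
    count = 0
    for letters in product(sorted(alphabet), repeat=len(word)):
        candidate = "".join(letters)
        if candidate == canonical_bracelet(candidate) and candidate <= word:
            count += 1
    return count
```

Over the binary alphabet with $n = 4$, the canonical bracelets are `0000`, `0001`, `0011`, `0101`, `0111` and `1111`, so `0101` has rank 4 under this convention.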
Pub Date: 2021-02-25 DOI: 10.4230/LIPIcs.CPM.2021.22
Sangsoo Park, Sung Gwan Park, Bastien Cazaux, Kunsoo Park, Eric Rivals
The hierarchical overlap graph (HOG) is a graph that encodes overlaps from a given set P of n strings, as the overlap graph does. The best known algorithm constructs the HOG in O(||P|| log n) time and O(||P||) space, where ||P|| is the sum of lengths of the strings in P. In this paper we present a new algorithm to construct the HOG in O(||P||) time and space. Hence, the construction time and space of the HOG are better than those of the overlap graph, which are O(||P|| + n²).
Title: A Linear Time Algorithm for Constructing Hierarchical Overlap Graphs. Venue: Annual Symposium on Combinatorial Pattern Matching.
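The n² term for the overlap graph comes from storing one overlap per ordered pair of strings. The naive sketch below computes those pairwise longest overlaps; it is not the linear-time HOG construction from the paper. It assumes that no string in the set is a substring of another and takes an overlap to be a proper suffix of one string that is a proper prefix of the other; the function names are hypothetical.

```python
def longest_overlap(u, v):
    """Length of the longest proper suffix of u that is a proper prefix of v."""
    for length in range(min(len(u), len(v)) - 1, 0, -1):
        if u[-length:] == v[:length]:
            return length
    return 0


def overlap_graph(strings):
    """Map every ordered pair (i, j) to the longest overlap from strings[i]
    to strings[j]; the result has n * n entries, hence the n^2 term."""
    return {
        (i, j): longest_overlap(u, v)
        for i, u in enumerate(strings)
        for j, v in enumerate(strings)
    }
```

For the set `aab`, `abb`, `bba`, the overlap from `aab` to `abb` is `ab` (length 2) and the overlap from `bba` to `aab` is `a` (length 1).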
Pub Date: 2020-12-01 DOI: 10.4230/LIPIcs.CPM.2017.11
Christian Komusiewicz, Mateus de Oliveira Oliveira, M. Zehavi
In the Maximum-Duo Preservation String Mapping (Max-Duo PSM) problem, the input consists of two related strings A and B of length n and a nonnegative integer k. The objective is to determine whether there exists a mapping m from the set of positions of A to the set of positions of B that maps only to positions with the same character and preserves at least k duos, which are pairs of adjacent positions. We develop a randomized algorithm that solves Max-Duo PSM in $4^k \cdot n^{O(1)}$ time, and a deterministic algorithm that solves this problem in $6.855^k \cdot n^{O(1)}$ time. The previous best known (deterministic) algorithm for this problem has $(8e)^{2k+o(k)} \cdot n^{O(1)}$ running time [Beretta et al. (2016) [1], [2]]. We also show that Max-Duo PSM admits a problem kernel of size $O(k^3)$, improving upon the previous best known problem kernel of size $O(k^6)$.
Title: Revisiting the Parameterized Complexity of Maximum-Duo Preservation String Mapping. Venue: Annual Symposium on Combinatorial Pattern Matching.
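The problem statement can be made concrete with a factorial-time brute force, sketched below for tiny inputs only. It assumes that "related" means B is a permutation of A, that the mapping is a bijection, and that a duo (i, i+1) is preserved when m(i+1) = m(i) + 1. These readings and the function names are assumptions; the code bears no relation to the parameterized algorithms in the paper.

```python
from itertools import permutations


def preserved_duos(mapping):
    """Number of adjacent position pairs (i, i+1) mapped to adjacent pairs."""
    return sum(
        1 for i in range(len(mapping) - 1) if mapping[i + 1] == mapping[i] + 1
    )


def max_duo_brute_force(a, b):
    """Maximum number of preserved duos over all character-preserving bijections."""
    if sorted(a) != sorted(b):
        raise ValueError("the strings must be permutations of each other")
    best = 0
    for mapping in permutations(range(len(b))):
        # Only mappings that send each position to an equal character count.
        if all(a[i] == b[j] for i, j in enumerate(mapping)):
            best = max(best, preserved_duos(mapping))
    return best


def has_mapping(a, b, k):
    """Decision version: is there a mapping preserving at least k duos?"""
    return max_duo_brute_force(a, b) >= k
```

For A = `abab` and B = `baba`, at most two duos can be preserved, so the decision version answers yes for k = 2 and no for k = 3.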
Pub Date: 2020-11-29 DOI: 10.4230/LIPIcs.CPM.2021.24
Joshua Sobel, Noah Bertram, C. Ding, F. Nargesian, D. Gildea
Analyzing patterns in a sequence of events has applications in text analysis, computer programming, and genomics research. In this paper, we consider the all-window-length analysis model, which analyzes a sequence of events with respect to windows of all lengths. We study the exact co-occurrence counting problem for the all-window-length analysis model. Our first algorithm is an offline algorithm that counts all-window-length co-occurrences by performing multiple passes over a sequence and computing single-window-length co-occurrences. This algorithm has time complexity $O(n)$ for each window length, and thus a total time complexity of $O(n^2)$, and space complexity $O(|I|)$ for a sequence of size $n$ and an itemset of size $|I|$. We propose AWLCO, an online algorithm that computes all-window-length co-occurrences in a single pass with expected time complexity $O(n)$ and space complexity $O(\sqrt{n|I|})$. Following this, we generalize our use case to patterns, for which we propose an algorithm that computes all-window-length co-occurrences with expected time complexity $O(n|I|)$ and space complexity $O(\sqrt{n|I|} + e_{\max}|I|)$, where $e_{\max}$ is the length of the largest pattern.
Title: AWLCO: All-Window Length Co-Occurrence. Venue: Annual Symposium on Combinatorial Pattern Matching.
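The offline baseline described in this abstract (one $O(n)$ pass per window length, $O(|I|)$ working space) can be sketched as follows. The sketch assumes that a co-occurrence for window length $w$ means a length-$w$ window containing every item of the itemset at least once; that definition and the function names are assumptions. This is not the online AWLCO algorithm.

```python
def cooccurrences_for_window(seq, itemset, w):
    """Count length-w windows (w >= 1) of seq containing every item of itemset."""
    counts = dict.fromkeys(set(itemset), 0)  # O(|I|) working space
    missing = len(counts)  # items of the itemset absent from the current window
    total = 0
    for pos, event in enumerate(seq):
        if event in counts:  # event enters the window
            if counts[event] == 0:
                missing -= 1
            counts[event] += 1
        if pos >= w:  # event at pos - w leaves the window
            old = seq[pos - w]
            if old in counts:
                counts[old] -= 1
                if counts[old] == 0:
                    missing += 1
        if pos >= w - 1 and missing == 0:
            total += 1
    return total


def all_window_length_cooccurrences(seq, itemset):
    """Run one pass per window length, for O(n^2) total time."""
    return {
        w: cooccurrences_for_window(seq, itemset, w)
        for w in range(1, len(seq) + 1)
    }
```

On the sequence `abcab` with itemset {a, b}, the counts for window lengths 1 to 5 are 0, 2, 3, 2 and 1.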
Pub Date: 2020-11-16 DOI: 10.4230/LIPIcs.CPM.2021.14
R. Dondi, F. Sikora
Longest Run Subsequence is a problem introduced recently in the context of the scaffolding phase of genome assembly (Schrinner et al., WABI 2020). The problem asks for a maximum-length subsequence of a given string that contains at most one run for each symbol (a run is a maximal substring of consecutive identical symbols). The problem has been shown to be NP-hard and to be fixed-parameter tractable when the parameter is the size of the alphabet on which the input string is defined. In this paper we further investigate the complexity of the problem and show that it is fixed-parameter tractable when parameterized by the number of runs in a solution, a smaller parameter. Moreover, we investigate the kernelization complexity of Longest Run Subsequence and prove that it does not admit a polynomial kernel when parameterized by the size of the alphabet or by the number of runs. Finally, we consider the restriction of Longest Run Subsequence in which each symbol has at most two occurrences in the input string, and we show that it is APX-hard.
Title: The Longest Run Subsequence Problem: Further Complexity Results. Venue: Annual Symposium on Combinatorial Pattern Matching.
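A small exact solver helps to see why the alphabet size is a natural parameter: the state below tracks the set of symbols already used, so the number of states is exponential in the alphabet size only. This is a plain dynamic-programming sketch written for illustration, not one of the algorithms from the paper, and the function name is an assumption.

```python
def longest_run_subsequence(s):
    """Length of a longest subsequence of s with at most one run per symbol."""
    bit_of = {c: 1 << i for i, c in enumerate(sorted(set(s)))}
    # State: (bitmask of symbols used so far, symbol of the current run) -> best length
    best = {(0, None): 0}
    for c in s:
        bit = bit_of[c]
        updated = dict(best)  # skipping c keeps every state unchanged
        for (used, last), length in best.items():
            # c may be taken if it extends the current run or starts a new
            # run of a symbol that has not been used before.
            if last == c or not used & bit:
                key = (used | bit, c)
                if updated.get(key, -1) < length + 1:
                    updated[key] = length + 1
        best = updated
    return max(best.values())
```

For the input `abab` the answer is 3 (for example `aab`), and for `aabbaa` it is 4 (for example `aabb` or `aaaa`).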
Pub Date: 2020-07-16 DOI: 10.4230/LIPIcs.CPM.2021.19
Takuya Mieno, S. Pissis, L. Stougie, Michelle Sweering
Let $W$ be a string of length $n$ over an alphabet $\Sigma$, $k$ be a positive integer, and $\mathcal{S}$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\mathrm{ED}}$ such that: (i) no string of $\mathcal{S}$ occurs in $X_{\mathrm{ED}}$; (ii) the order of all other length-$k$ substrings over $\Sigma$ is the same in $W$ and in $X_{\mathrm{ED}}$; and (iii) $X_{\mathrm{ED}}$ has minimal edit distance to $W$. When $W$ represents an individual's data and $\mathcal{S}$ represents a set of confidential patterns, the ETFS problem asks for transforming $W$ to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in $\mathcal{O}(n^2k)$ time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in $\mathcal{O}(n^{2-\delta})$ time, for any $\delta>0$, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an $\mathcal{O}(n^2\log^2k)$-time algorithm to solve ETFS; and (ii) an $\mathcal{O}(n^2\log^2n)$-time algorithm to solve AETFS, a generalization of ETFS in which the elements of $\mathcal{S}$ can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Let us also stress that our algorithms work under edit distance with arbitrary weights at no extra cost. As a bonus, we show how to modify some known techniques, which speed up the standard edit distance computation, to be applied to our problems. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
Title: String Sanitization Under Edit Distance: Improved and Generalized. Venue: Annual Symposium on Combinatorial Pattern Matching.
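Conditions (i) and (ii) of ETFS can be checked mechanically for a candidate output. The sketch below encodes one reading of them: no sensitive pattern occurs in the candidate, and the sequence of non-sensitive length-$k$ substrings of $W$ equals the sequence of length-$k$ substrings of the candidate that use only letters of $\Sigma$. The separator symbol `#` used in the example, the exact reading of condition (ii), and the function name are all assumptions; condition (iii), minimality of the edit distance, is not checked.

```python
def satisfies_conditions(w, x, k, sensitive, alphabet):
    """Check one reading of ETFS conditions (i) and (ii) for a candidate x."""

    def kgrams(s):
        # All length-k substrings of s, in left-to-right order.
        return [s[i:i + k] for i in range(len(s) - k + 1)]

    sensitive = set(sensitive)
    letters = set(alphabet)
    x_grams = kgrams(x)
    # (i) no sensitive pattern occurs in x
    if any(g in sensitive for g in x_grams):
        return False
    # (ii) same sequence of non-sensitive length-k substrings over the alphabet
    kept_w = [g for g in kgrams(w) if g not in sensitive]
    kept_x = [g for g in x_grams if set(g) <= letters]
    return kept_w == kept_x
```

With $W$ = `aabaa`, $k = 2$ and sensitive set {`ba`}, the candidate `aab#aa` passes, while `aabaa` itself fails condition (i) and `aab#a` fails condition (ii).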
Pub Date: 2020-06-09 DOI: 10.4230/LIPIcs.CPM.2020.7
G. Bernardini, Huiping Chen, G. Loukides, N. Pisanti, S. Pissis, L. Stougie, Michelle Sweering
Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual’s data and S represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in O(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in O(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Kunnemann, FOCS 2015], to ETFS.
Title: String Sanitization Under Edit Distance. Venue: Annual Symposium on Combinatorial Pattern Matching.
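For reference, the classic dynamic program for edit distance that this abstract builds on can be sketched as follows. This is the textbook quadratic-time algorithm with optional operation weights, not the modified algorithm for ETFS; the function and parameter names are illustrative.

```python
def edit_distance(a, b, insert_cost=1, delete_cost=1, substitute_cost=1):
    """Classic O(len(a) * len(b)) dynamic program, keeping one row at a time."""
    # prev[j] is the cost of turning the current prefix of a into b[:j].
    prev = [j * insert_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * delete_cost]
        for j in range(1, len(b) + 1):
            step = 0 if a[i - 1] == b[j - 1] else substitute_cost
            cur.append(
                min(
                    prev[j] + delete_cost,   # delete a[i-1]
                    cur[j - 1] + insert_cost,  # insert b[j-1]
                    prev[j - 1] + step,      # match or substitute
                )
            )
        prev = cur
    return prev[-1]
```

With unit costs, `kitten` and `sitting` are at distance 3. Raising the substitution cost above 2 makes the program prefer a deletion followed by an insertion instead.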