Annual Symposium on Combinatorial Pattern Matching: Latest Papers

MONI can find k-MEMs
Pub Date : 2022-02-10 DOI: 10.4230/LIPIcs.CPM.2023.26
T. Gagie
Suppose we are asked to index a text $T[0..n-1]$ such that, given a pattern $P[0..m-1]$, we can quickly report the maximal substrings of $P$ that each occur in $T$ at least $k$ times. We first show how we can add $O(r \log n)$ bits to Rossi et al.'s recent MONI index, where $r$ is the number of runs in the Burrows-Wheeler Transform of $T$, such that it supports such queries in $O(k m \log n)$ time. We then show how, if we are given $k$ at construction time, we can reduce the query time to $O(m \log n)$.
Citations: 3
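The query in the "MONI can find k-MEMs" abstract can be pinned down with a brute-force reference. The sketch below is not the MONI-based method from the paper; it only spells out the definition (a substring of $P$ that occurs at least $k$ times in $T$ and cannot be extended left or right without dropping below $k$ occurrences). The function names `count_occurrences` and `k_mems` are illustrative assumptions.

```python
def count_occurrences(text, s):
    """Count (possibly overlapping) occurrences of s in text."""
    return sum(1 for i in range(len(text) - len(s) + 1) if text.startswith(s, i))


def k_mems(text, pattern, k):
    """Return half-open intervals (i, j) such that pattern[i:j] is a k-MEM.

    Naive reference only: every candidate is counted by scanning the text.
    """
    m = len(pattern)
    result = []
    prev_end = 0  # furthest right end reached from any earlier start
    for i in range(m):
        j = i
        # Extend to the right while the substring still occurs >= k times.
        while j < m and count_occurrences(text, pattern[i:j + 1]) >= k:
            j += 1
        # Non-empty and not contained in a match that started further left.
        if j > i and j > prev_end:
            result.append((i, j))
        prev_end = max(prev_end, j)
    return result


print(k_mems("abracadabra", "abrax", 2))  # [(0, 4)]
```

A brute-force version like this is mainly useful as a test oracle for an index-based implementation on small inputs.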
The Normalized Edit Distance with Uniform Operation Costs is a Metric
Pub Date : 2022-01-16 DOI: 10.4230/LIPIcs.CPM.2022.17
D. Fisman, Joshua Grogin, Oded Margalit, Gera Weiss
We prove that the normalized edit distance proposed in [Marzal and Vidal 1993] is a metric when the costs of all edit operations are the same. This closes a long-standing gap in the literature: several authors noted that this distance does not satisfy the triangle inequality in the general case, and it was not known whether the inequality holds in the uniform case, where all the edit costs are equal. We compare this metric to two normalized metrics that were proposed as alternatives in the literature when Marzal and Vidal's distance was thought not to be a metric, and we identify key properties that explain why the original distance, now known to also be a metric, is better for some applications. Our examination is from the point of view of formal verification, but the properties and their significance are stated in an application-agnostic way.
Citations: 3
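To make the distance in the normalized edit distance abstract concrete, here is a minimal sketch. It assumes the usual reading of Marzal and Vidal's definition: the minimum, over all edit paths, of path cost divided by path length, where matches count toward the length but cost nothing, and insertions, deletions and substitutions all cost 1. The function name and the cubic-time dynamic program are illustrative choices, not taken from the paper.

```python
from fractions import Fraction


def normalized_edit_distance(x, y):
    """min over edit paths of (total cost) / (number of steps), uniform cost 1."""
    m, n = len(x), len(y)
    if m == 0 and n == 0:
        return Fraction(0)
    inf = float("inf")
    max_len = m + n
    # best[i][j][k]: cheapest way to edit x[:i] into y[:j] in exactly k steps.
    best = [[[inf] * (max_len + 1) for _ in range(n + 1)] for _ in range(m + 1)]
    best[0][0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            for k in range(1, i + j + 1):
                cost = inf
                if i > 0:  # delete x[i-1]
                    cost = min(cost, best[i - 1][j][k - 1] + 1)
                if j > 0:  # insert y[j-1]
                    cost = min(cost, best[i][j - 1][k - 1] + 1)
                if i > 0 and j > 0:  # match (cost 0) or substitute (cost 1)
                    step = 0 if x[i - 1] == y[j - 1] else 1
                    cost = min(cost, best[i - 1][j - 1][k - 1] + step)
                best[i][j][k] = cost
    return min(
        Fraction(best[m][n][k], k)
        for k in range(1, max_len + 1)
        if best[m][n][k] != inf
    )


print(normalized_edit_distance("ab", "a"))   # 1/2
print(normalized_edit_distance("ab", "ba"))  # 2/3
```

Exact fractions are used so that small triangle-inequality checks are not affected by floating-point rounding.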
Arbitrary-length analogs to de Bruijn sequences
Pub Date : 2021-08-17 DOI: 10.4230/LIPIcs.CPM.2022.9
Abhinav Nellore, Rachel A. Ward
Let $\widetilde{\alpha}$ be a length-$L$ cyclic sequence of characters from a size-$K$ alphabet $\mathcal{A}$ such that the number of occurrences of any length-$m$ string on $\mathcal{A}$ as a substring of $\widetilde{\alpha}$ is $\lfloor L / K^m \rfloor$ or $\lceil L / K^m \rceil$. When $L = K^N$ for any positive integer $N$, $\widetilde{\alpha}$ is a de Bruijn sequence of order $N$, and when $L \neq K^N$, $\widetilde{\alpha}$ shares many properties with de Bruijn sequences. We describe an algorithm that outputs some $\widetilde{\alpha}$ for any combination of $K \geq 2$ and $L \geq 1$ in $O(L)$ time using $O(L \log K)$ space. This algorithm extends Lempel's recursive construction of a binary de Bruijn sequence. An implementation written in Python is available at https://github.com/nelloreward/pkl.
Citations: 1
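The balance property in the de Bruijn analog abstract can be checked directly on small inputs. The sketch below is a verifier only, not the authors' construction and unrelated to the code in their repository; the function name is an assumption. It relies on one observation: once all $L$ cyclic windows of some length $m$ with $K^m \geq L$ are distinct, windows of every larger length are distinct too, so checking stops there.

```python
from collections import Counter
from itertools import product


def has_balanced_substring_counts(seq, alphabet):
    """Check that every length-m string occurs floor(L/K^m) or ceil(L/K^m)
    times as a cyclic substring of seq, for every m."""
    L, K = len(seq), len(alphabet)
    m = 1
    while True:
        wrapped = seq + seq[:m - 1]  # lets windows run past the end cyclically
        counts = Counter(wrapped[i:i + m] for i in range(L))
        lo = L // K**m
        hi = -(-L // K**m)  # ceiling division
        for letters in product(alphabet, repeat=m):
            if not lo <= counts["".join(letters)] <= hi:
                return False
        if K**m >= L:
            return True  # all windows are distinct from here on
        m += 1


print(has_balanced_substring_counts("0011", "01"))  # True
print(has_balanced_substring_counts("0101", "01"))  # False
```

For $L = 4$ and $K = 2$ the first call is checking an ordinary de Bruijn sequence of order 2; the call with `"001"` in the tests below covers a length that is not a power of $K$.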
Ranking Bracelets in Polynomial Time
Pub Date : 2021-04-09 DOI: 10.4230/LIPIcs.CPM.2021.4
Duncan Adamson, Argyrios Deligkas, V. Gusev, I. Potapov
The main result of the paper is the first polynomial-time algorithm for ranking bracelets. The time-complexity of the algorithm is O(k^2 n^4), where k is the size of the alphabet and n is the length of the considered bracelets. The key part of the algorithm is to compute the rank of any word with respect to the set of bracelets by finding three other ranks: the rank over all necklaces, the rank over palindromic necklaces, and the rank over enclosing apalindromic necklaces. The last two concepts are introduced in this paper. These ranks are key components to our algorithm in order to decompose the problem into parts. Additionally, this ranking procedure is used to build a polynomial-time unranking algorithm.
Citations: 8
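For intuition about what the bracelet-ranking abstract computes, the following exponential-time sketch enumerates all words and ranks by brute force. It is not the polynomial-time algorithm from the paper. It assumes that a bracelet is represented by the lexicographically smallest word among all rotations and reflections, and that the rank of a word is the number of bracelet representatives less than or equal to it; the paper's exact convention may differ. Words are tuples over `range(k)` and must be non-empty.

```python
from itertools import product


def canonical_bracelet(word):
    """Smallest word among all rotations of word and of its reversal."""
    n = len(word)
    rev = word[::-1]
    candidates = [word[i:] + word[:i] for i in range(n)]
    candidates += [rev[i:] + rev[:i] for i in range(n)]
    return min(candidates)


def bracelet_rank(word, k):
    """Number of bracelet representatives of length len(word) over an
    alphabet of size k that are <= word (assumed ranking convention)."""
    word = tuple(word)
    reps = {canonical_bracelet(w) for w in product(range(k), repeat=len(word))}
    return sum(1 for r in reps if r <= word)


print(bracelet_rank((0, 0, 1, 1), 2))  # 3
print(bracelet_rank((1, 1, 1, 1), 2))  # 6
```

Ranking the largest word counts all bracelets, which gives a quick sanity check against known bracelet counts (6 for binary length 4, 13 for binary length 6).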
A Linear Time Algorithm for Constructing Hierarchical Overlap Graphs
Pub Date : 2021-02-25 DOI: 10.4230/LIPIcs.CPM.2021.22
Sangsoo Park, Sung Gwan Park, Bastien Cazaux, Kunsoo Park, Eric Rivals
The hierarchical overlap graph (HOG) is a graph that encodes overlaps from a given set P of n strings, as the overlap graph does. The best known algorithm constructs the HOG in O(||P|| log n) time and O(||P||) space, where ||P|| is the sum of the lengths of the strings in P. In this paper we present a new algorithm that constructs the HOG in O(||P||) time and space. Hence, the construction time and space of the HOG are better than those of the overlap graph, which are O(||P|| + n²).
Citations: 5
Revisiting the Parameterized Complexity of Maximum-Duo Preservation String Mapping
Pub Date : 2020-12-01 DOI: 10.4230/LIPIcs.CPM.2017.11
Christian Komusiewicz, Mateus de Oliveira Oliveira, M. Zehavi
In the Maximum-Duo Preservation String Mapping (Max-Duo PSM) problem, the input consists of two related strings $A$ and $B$ of length $n$ and a nonnegative integer $k$. The objective is to determine whether there exists a mapping $m$ from the set of positions of $A$ to the set of positions of $B$ that maps only to positions with the same character and preserves at least $k$ duos, which are pairs of adjacent positions. We develop a randomized algorithm that solves Max-Duo PSM in $4^k \cdot n^{O(1)}$ time, and a deterministic algorithm that solves this problem in $6.855^k \cdot n^{O(1)}$ time. The previous best known (deterministic) algorithm for this problem has $(8e)^{2k+o(k)} \cdot n^{O(1)}$ running time [Beretta et al. (2016) [1], [2]]. We also show that Max-Duo PSM admits a problem kernel of size $O(k^3)$, improving upon the previous best known problem kernel of size $O(k^6)$.
Citations: 4
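The objective in the Max-Duo PSM abstract is easy to state in code for tiny inputs. The sketch below is a factorial-time brute force, not either of the parameterized algorithms from the paper. It assumes the mapping is a bijection between positions (so $B$ must be a rearrangement of $A$) and that a duo $(i, i+1)$ is preserved when its two positions are mapped to consecutive positions in the same order; the function names are illustrative.

```python
from itertools import permutations


def preserved_duos(mapping):
    """Count adjacent pairs (i, i+1) mapped to adjacent positions (p, p+1)."""
    return sum(
        1 for i in range(len(mapping) - 1) if mapping[i + 1] == mapping[i] + 1
    )


def max_duo(a, b):
    """Largest number of duos any character-preserving bijection can keep."""
    if sorted(a) != sorted(b):
        raise ValueError("b must be a rearrangement of a")
    best = 0
    for perm in permutations(range(len(a))):
        # Only mappings that send each position to an equal character count.
        if all(a[i] == b[perm[i]] for i in range(len(a))):
            best = max(best, preserved_duos(perm))
    return best


print(max_duo("abab", "baba"))  # 2
```

The decision version from the abstract then amounts to comparing the returned value with $k$.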
AWLCO: All-Window Length Co-Occurrence
Pub Date : 2020-11-29 DOI: 10.4230/LIPIcs.CPM.2021.24
Joshua Sobel, Noah Bertram, C. Ding, F. Nargesian, D. Gildea
Analyzing patterns in a sequence of events has applications in text analysis, computer programming, and genomics research. In this paper, we consider the all-window-length analysis model which analyzes a sequence of events with respect to windows of all lengths. We study the exact co-occurrence counting problem for the all-window-length analysis model. Our first algorithm is an offline algorithm that counts all-window-length co-occurrences by performing multiple passes over a sequence and computing single-window-length co-occurrences. This algorithm has the time complexity $O(n)$ for each window length and thus a total complexity of $O(n^2)$ and the space complexity $O(|I|)$ for a sequence of size $n$ and an itemset of size $|I|$. We propose AWLCO, an online algorithm that computes all-window-length co-occurrences in a single pass with the expected time complexity of $O(n)$ and space complexity of $O(\sqrt{n|I|})$. Following this, we generalize our use case to patterns in which we propose an algorithm that computes all-window-length co-occurrence with expected time complexity $O(n|I|)$ and space complexity $O(\sqrt{n|I|} + e_{\max}|I|)$, where $e_{\max}$ is the length of the largest pattern.
Citations: 1
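The offline baseline described in the AWLCO abstract (one $O(n)$ pass per window length, $O(|I|)$ working space) can be sketched as follows. This is not the online AWLCO algorithm. It assumes that the co-occurrence count for a window length $w$ is the number of length-$w$ windows that contain every item of the itemset at least once, and that the itemset is non-empty; the abstract does not spell out that definition, so treat it and the function names as assumptions.

```python
from collections import Counter


def cooccurrences_for_window(seq, items, w):
    """Number of length-w windows of seq containing every item in items."""
    items = set(items)
    counts = Counter()  # occurrences of each tracked item in the current window
    present = 0         # how many distinct tracked items the window contains
    total = 0
    for i, x in enumerate(seq):
        if x in items:
            counts[x] += 1
            if counts[x] == 1:
                present += 1
        if i >= w:  # element seq[i - w] has just left the window
            y = seq[i - w]
            if y in items:
                counts[y] -= 1
                if counts[y] == 0:
                    present -= 1
        if i >= w - 1 and present == len(items):
            total += 1
    return total


def all_window_cooccurrences(seq, items):
    """One pass per window length, O(n^2) time overall."""
    return {w: cooccurrences_for_window(seq, items, w) for w in range(1, len(seq) + 1)}


print(all_window_cooccurrences("abcab", "ab"))
# {1: 0, 2: 2, 3: 3, 4: 2, 5: 1}
```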
The Longest Run Subsequence Problem: Further Complexity Results
Pub Date : 2020-11-16 DOI: 10.4230/LIPIcs.CPM.2021.14
R. Dondi, F. Sikora
Longest Run Subsequence is a problem introduced recently in the context of the scaffolding phase of genome assembly (Schrinner et al., WABI 2020). The problem asks for a maximum length subsequence of a given string that contains at most one run for each symbol (a run is a maximal substring of consecutive identical symbols). The problem has been shown to be NP-hard and to be fixed-parameter tractable when the parameter is the size of the alphabet on which the input string is defined. In this paper we further investigate the complexity of the problem and we show that it is fixed-parameter tractable when it is parameterized by the number of runs in a solution, a smaller parameter. Moreover, we investigate the kernelization complexity of Longest Run Subsequence and we prove that it does not admit a polynomial kernel when parameterized by the size of the alphabet or by the number of runs. Finally, we consider the restriction of Longest Run Subsequence when each symbol has at most two occurrences in the input string and we show that it is APX-hard.
Citations: 2
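The problem in the Longest Run Subsequence abstract has a simple exact solution when the alphabet is small, which illustrates the earlier fixed-parameter result for alphabet size that the abstract mentions. The sketch below is a textbook-style dynamic program over (set of symbols already used, symbol of the current run) and is an illustration only, not the algorithm from this paper or from Schrinner et al.; the function name is an assumption.

```python
def longest_run_subsequence(s):
    """Length of the longest subsequence of s with at most one run per symbol.

    The number of states is at most 2^|alphabet| * |alphabet|.
    """
    # best[(used, last)]: longest valid subsequence seen so far that uses
    # exactly the symbols in `used` and whose final run consists of `last`.
    best = {(frozenset(), None): 0}
    for c in s:
        updates = {}
        for (used, last), length in best.items():
            if last == c:
                key = (used, last)        # extend the current run
            elif c not in used:
                key = (used | {c}, c)     # open the single run allowed for c
            else:
                continue                  # c's run is already closed
            if updates.get(key, -1) < length + 1:
                updates[key] = length + 1
        for key, value in updates.items():
            if best.get(key, -1) < value:
                best[key] = value
    return max(best.values())


print(longest_run_subsequence("abab"))    # 3
print(longest_run_subsequence("abcabc"))  # 4
```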
String Sanitization Under Edit Distance: Improved and Generalized
Pub Date : 2020-07-16 DOI: 10.4230/LIPIcs.CPM.2021.19
Takuya Mieno, S. Pissis, L. Stougie, Michelle Sweering
Let $W$ be a string of length $n$ over an alphabet $\Sigma$, $k$ be a positive integer, and $\mathcal{S}$ be a set of length-$k$ substrings of $W$. The ETFS problem asks us to construct a string $X_{\mathrm{ED}}$ such that: (i) no string of $\mathcal{S}$ occurs in $X_{\mathrm{ED}}$; (ii) the order of all other length-$k$ substrings over $\Sigma$ is the same in $W$ and in $X_{\mathrm{ED}}$; and (iii) $X_{\mathrm{ED}}$ has minimal edit distance to $W$. When $W$ represents an individual's data and $\mathcal{S}$ represents a set of confidential patterns, the ETFS problem asks for transforming $W$ to preserve its privacy and its utility [Bernardini et al., ECML PKDD 2019]. ETFS can be solved in $\mathcal{O}(n^2 k)$ time [Bernardini et al., CPM 2020]. The same paper shows that ETFS cannot be solved in $\mathcal{O}(n^{2-\delta})$ time, for any $\delta > 0$, unless the Strong Exponential Time Hypothesis (SETH) is false. Our main results can be summarized as follows: (i) an $\mathcal{O}(n^2 \log^2 k)$-time algorithm to solve ETFS; and (ii) an $\mathcal{O}(n^2 \log^2 n)$-time algorithm to solve AETFS, a generalization of ETFS in which the elements of $\mathcal{S}$ can have arbitrary lengths. Our algorithms are thus optimal up to polylogarithmic factors, unless SETH fails. Let us also stress that our algorithms work under edit distance with arbitrary weights at no extra cost. As a bonus, we show how to modify some known techniques, which speed up the standard edit distance computation, to be applied to our problems. Beyond string sanitization, our techniques may inspire solutions to other problems related to regular expressions or context-free grammars.
Citations: 1
String Sanitization Under Edit Distance
Pub Date : 2020-06-09 DOI: 10.4230/LIPIcs.CPM.2020.7
G. Bernardini, Huiping Chen, G. Loukides, N. Pisanti, S. Pissis, L. Stougie, Michelle Sweering
Let W be a string of length n over an alphabet Σ, k be a positive integer, and S be a set of length-k substrings of W. The ETFS problem asks us to construct a string X_{ED} such that: (i) no string of S occurs in X_{ED}; (ii) the order of all other length-k substrings over Σ is the same in W and in X_{ED}; and (iii) X_{ED} has minimal edit distance to W. When W represents an individual's data and S represents a set of confidential substrings, algorithms solving ETFS can be applied for utility-preserving string sanitization [Bernardini et al., ECML PKDD 2019]. Our first result here is an algorithm to solve ETFS in O(kn²) time, which improves on the state of the art [Bernardini et al., arXiv 2019] by a factor of |Σ|. Our algorithm is based on a non-trivial modification of the classic dynamic programming algorithm for computing the edit distance between two strings. Notably, we also show that ETFS cannot be solved in O(n^{2-δ}) time, for any δ>0, unless the strong exponential time hypothesis is false. To achieve this, we reduce the edit distance problem, which is known to admit the same conditional lower bound [Bringmann and Kunnemann, FOCS 2015], to ETFS.
Citations: 10
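The "classic dynamic programming algorithm" that the String Sanitization Under Edit Distance abstract builds on is the standard quadratic edit distance table. A minimal sketch of that baseline follows; it is not the ETFS algorithm itself, and the optional weight parameters `ins`, `dele` and `sub` are illustrative names for per-operation costs.

```python
def edit_distance(a, b, ins=1, dele=1, sub=1):
    """Weighted edit distance between a and b, computed row by row.

    O(len(a) * len(b)) time and O(len(b)) space.
    """
    n = len(b)
    prev = [j * ins for j in range(n + 1)]  # editing "" into b[:j]
    for i in range(1, len(a) + 1):
        cur = [i * dele] + [0] * n          # editing a[:i] into ""
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + dele,                                   # delete a[i-1]
                cur[j - 1] + ins,                                 # insert b[j-1]
                prev[j - 1] + (0 if a[i - 1] == b[j - 1] else sub),  # match or substitute
            )
        prev = cur
    return prev[n]


print(edit_distance("kitten", "sitting"))         # 3
print(edit_distance("kitten", "sitting", sub=3))  # 5
```

With `sub=3` a substitution is more expensive than a deletion followed by an insertion, so the second call effectively measures insert/delete distance only.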