Annual Symposium on Combinatorial Pattern Matching: Latest Papers

The Longest Filled Common Subsequence Problem

Annual Symposium on Combinatorial Pattern Matching

Pub Date : 2017-07-01 DOI: 10.4230/LIPIcs.CPM.2017.14

M. Castelli, R. Dondi, G. Mauri, I. Zoppis

Inspired by a recent approach for genome reconstruction from incomplete data, we consider a variant of the longest common subsequence problem for the comparison of two sequences, one of which is incomplete, i.e. it has some missing elements. The new combinatorial problem, called Longest Filled Common Subsequence, given two sequences A and B, and a multiset M of symbols missing in B, asks for a sequence B* obtained by inserting the symbols of M into B so that B* induces a common subsequence with A of maximum length. First, we investigate the computational and approximation complexity of the problem and we show that it is NP-hard and APX-hard when A contains at most two occurrences of each symbol. Then, we give a 3/5-approximation algorithm for the problem. Finally, we present a fixed-parameter algorithm, when the problem is parameterized by the number of symbols inserted in B that "match" symbols of A.
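The problem statement above can be illustrated with a brute-force sketch for tiny inputs. This is not the paper's approximation or fixed-parameter algorithm; it simply tries every filling. The function names are hypothetical, and the sketch assumes all symbols of M are inserted, which loses nothing because inserting extra symbols into B can never shorten a longest common subsequence.

```python
def lcs_length(a, b):
    """Length of a longest common subsequence of a and b (classic row-by-row DP)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def fillings(b, missing):
    """Yield every string obtained by inserting all symbols of `missing` into b.

    Duplicates may be yielded; that is harmless for taking a maximum.
    """
    if not missing:
        yield b
        return
    first, rest = missing[0], missing[1:]
    for pos in range(len(b) + 1):
        yield from fillings(b[:pos] + first + b[pos:], rest)


def lfcs_bruteforce(a, b, missing):
    """Exhaustive search (exponential time): best LCS length over all fillings B*."""
    return max(lcs_length(a, filled) for filled in fillings(b, missing))
```

For example, with A = "abcd", B = "ad" and M = {b, c}, the filling "abcd" matches all of A, so `lfcs_bruteforce("abcd", "ad", "cb")` returns 4.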

Citations: 7

28th Annual Symposium on Combinatorial Pattern Matching, CPM 2017, July 4-6, 2017, Warsaw, Poland

Annual Symposium on Combinatorial Pattern Matching

Pub Date : 2017-07-01 DOI: 10.4230/LIPICS.CPM.2017.0

Juha Kärkkäinen, J. Radoszewski, W. Rytter

Citations: 1

Document Listing on Repetitive Collections with Guaranteed Performance

Annual Symposium on Combinatorial Pattern Matching

Pub Date : 2017-06-28 DOI: 10.4230/LIPIcs.CPM.2017.4

G. Navarro

We consider document listing on string collections, that is, finding in which strings a given pattern appears. In particular, we focus on repetitive collections: a collection of size N over alphabet [1,a] is composed of D copies of a string of size n, and s single-character edits are applied on the copies. We introduce the first document listing index with size O~(n + s), precisely O((n lg a + s lg^2 N) lg D) bits, and with useful worst-case time guarantees: Given a pattern of length m, the index reports the ndoc strings where it appears in time O(m^2 + m lg N (lg D + lg^e N) ndoc), for any constant e > 0.
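To make the query concrete, here is a naive sketch of document listing on a small repetitive collection. It scans every document and is unrelated to the compressed index of the abstract. The helper names are hypothetical, and edits are restricted to single-character substitutions for simplicity.

```python
def make_repetitive_collection(base, copies, edits):
    """Build `copies` copies of `base`, then apply single-character substitutions.

    `edits` is a list of (copy_index, position, new_char) triples.
    """
    docs = [list(base) for _ in range(copies)]
    for copy_index, position, new_char in edits:
        docs[copy_index][position] = new_char
    return ["".join(doc) for doc in docs]


def list_documents(docs, pattern):
    """Return the indices of the documents in which `pattern` occurs (linear scan)."""
    return [d for d, doc in enumerate(docs) if pattern in doc]
```

With three copies of "abracadabra" and one edit that turns the "c" of copy 1 into "x", the pattern "cad" is reported in documents 0 and 2 only.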

Citations: 21

Representing the suffix tree with the CDAWG

Annual Symposium on Combinatorial Pattern Matching

Pub Date : 2017-05-24 DOI: 10.4230/LIPIcs.CPM.2017.7

D. Belazzougui, F. Cunial

Given a string $T$, it is known that its suffix tree can be represented using the compact directed acyclic word graph (CDAWG) with $e_T$ arcs, taking overall $O(e_T+e_{\overline{T}})$ words of space, where $\overline{T}$ is the reverse of $T$, and supporting some key operations in time between $O(1)$ and $O(\log\log n)$ in the worst case. This representation is especially appealing for highly repetitive strings, like collections of similar genomes or of version-controlled documents, in which $e_T$ grows sublinearly in the length of $T$ in practice. In this paper we augment such representation, supporting a number of additional queries in worst-case time between $O(1)$ and $O(\log n)$ in the RAM model, without increasing space complexity asymptotically. Our technique, based on a heavy path decomposition of the suffix tree, enables also a representation of the suffix array, of the inverse suffix array, and of $T$ itself, that takes $O(e_T)$ words of space, and that supports random access in $O(\log n)$ time. Furthermore, we establish a connection between the reversed CDAWG of $T$ and a context-free grammar that produces $T$ and only $T$, which might have independent interest.

Citations: 29

Faster STR-IC-LCS computation via RLE

Annual Symposium on Combinatorial Pattern Matching

Pub Date : 2017-03-15 DOI: 10.4230/LIPIcs.CPM.2017.20

Keita Kuboi, Yuta Fujishige, Shunsuke Inenaga, H. Bannai, M. Takeda

The constrained LCS problem asks one to find a longest common subsequence of two input strings $A$ and $B$ with some constraints. The STR-IC-LCS problem is a variant of the constrained LCS problem, where the solution must include a given constraint string $C$ as a substring. Given two strings $A$ and $B$ of respective lengths $M$ and $N$, and a constraint string $C$ of length at most $\min\{M, N\}$, the best known algorithm for the STR-IC-LCS problem, proposed by Deorowicz (Inf. Process. Lett., 11:423-426, 2012), runs in $O(MN)$ time. In this work, we present an $O(mN + nM)$-time solution to the STR-IC-LCS problem, where $m$ and $n$ denote the sizes of the run-length encodings of $A$ and $B$, respectively. Since $m \leq M$ and $n \leq N$ always hold, our algorithm is always as fast as Deorowicz's algorithm, and is faster when input strings are compressible via RLE.
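The parameters $m$ and $n$ are easy to compute. The sketch below shows run-length encoding and compares the two quantities $MN$ and $mN + nM$ for a pair of strings. It does not implement the STR-IC-LCS algorithm, and `bound_terms` is a hypothetical helper added for illustration.

```python
from itertools import groupby


def run_length_encode(s):
    """Return the run-length encoding of s as a list of (symbol, run_length) pairs."""
    return [(symbol, len(list(run))) for symbol, run in groupby(s)]


def bound_terms(a, b):
    """Return (M*N, m*N + n*M), where m and n are the RLE sizes of a and b."""
    big_m, big_n = len(a), len(b)
    m, n = len(run_length_encode(a)), len(run_length_encode(b))
    return big_m * big_n, m * big_n + n * big_m
```

For A = "aaaa" and B = "bbbbbb", each string is a single run, so the terms are 24 and 10. For strings with no repeated adjacent symbols, $m = M$ and $n = N$, and the second term is twice the first, which is the same asymptotic order.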

Citations: 6

Fast and Simple Jumbled Indexing for Binary Run-Length Encoded Strings

Annual Symposium on Combinatorial Pattern Matching

Pub Date : 2017-02-04 DOI: 10.4230/LIPIcs.CPM.2017.19

L. Cunha, S. Dantas, T. Gagie, Roland Wittler, L. Kowada, J. Stoye

Important papers have appeared recently on the problem of indexing binary strings for jumbled pattern matching, and further lowering the time bounds in terms of the input size would now be a breakthrough with broad implications. We can still make progress on the problem, however, by considering other natural parameters. Badkobeh et al. (IPL, 2013) and Amir et al. (TCS, 2016) gave algorithms that index a binary string in O(n + r^2 log r) time, where n is the length and r is the number of runs, and Giaquinta and Grabowski (IPL, 2013) gave one that runs in O(n + r^2) time. In this paper we propose a new and very simple algorithm that also runs in O(n + r^2) time and can be extended either so that the index returns the position of a match (if there is one), or so that the algorithm uses only O(n) bits of space instead of O(n) words.
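For readers unfamiliar with the index itself, the following is a naive quadratic-time sketch, not the $O(n + r^2)$ algorithm of the abstract. It relies on the well-known interval property of binary strings: sliding a window by one position changes its number of 1s by at most one, so for each window length the achievable counts of 1s form a contiguous range. Storing the minimum and maximum per length is therefore enough to answer existence queries. The function names are hypothetical.

```python
def build_jumbled_index(bits):
    """For each window length, store the min and max number of 1s over all windows.

    `bits` is a string over {"0", "1"}. Naive O(n^2) construction via prefix sums.
    """
    n = len(bits)
    prefix = [0]
    for c in bits:
        prefix.append(prefix[-1] + (c == "1"))
    lo = [0] * (n + 1)
    hi = [0] * (n + 1)
    for length in range(1, n + 1):
        counts = [prefix[i + length] - prefix[i] for i in range(n - length + 1)]
        lo[length], hi[length] = min(counts), max(counts)
    return lo, hi


def has_jumbled_match(index, zeros, ones):
    """Does some substring contain exactly `zeros` 0s and `ones` 1s?"""
    lo, hi = index
    length = zeros + ones
    if length == 0:
        return True  # the empty substring always matches
    if length >= len(lo):
        return False  # pattern is longer than the text
    return lo[length] <= ones <= hi[length]
```

On the string "0110100", windows of length 3 contain either one or two 1s, so a query for one 0 and two 1s succeeds while a query for three 1s fails.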

Citations: 2

From LZ77 to the Run-Length Encoded Burrows-Wheeler Transform, and Back

Annual Symposium on Combinatorial Pattern Matching

Pub Date : 2017-02-04 DOI: 10.4230/LIPIcs.CPM.2017.17

A. Policriti, N. Prezza

The Lempel-Ziv factorization (LZ77) and the Run-Length encoded Burrows-Wheeler Transform (RLBWT) are two important tools in text compression and indexing, their sizes $z$ and $r$ being closely related to the amount of text self-repetitiveness. In this paper we consider the problem of converting the two representations into each other within a working space proportional to the input and the output. Let $n$ be the text length. We show that RLBWT can be converted to LZ77 in $\mathcal{O}(n\log r)$ time and $\mathcal{O}(r)$ words of working space. Conversely, we provide an algorithm to convert LZ77 to RLBWT in $\mathcal{O}\big(n(\log r + \log z)\big)$ time and $\mathcal{O}(r+z)$ words of working space. Note that $r$ and $z$ can be constant if the text is highly repetitive, and our algorithms can operate with (up to) exponentially less space than naive solutions based on full decompression.
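The two size measures can be computed naively on short texts, which helps build intuition for how small $r$ and $z$ become on repetitive input. The sketch below works on the fully decompressed text, which is exactly what the paper's conversions avoid. It assumes a "$" sentinel that sorts before every text symbol, and one particular LZ77 variant (greedy longest previous occurrence, overlaps allowed, a fresh symbol forms its own phrase); other variants give slightly different counts. All function names are hypothetical.

```python
def bwt(text, sentinel="$"):
    """Burrows-Wheeler transform via sorted rotations (quadratic space, tiny inputs only)."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def count_runs(s):
    """Number of maximal runs of equal symbols in s (this is r when s is a BWT)."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def lz77_phrase_count(text):
    """Number of phrases z in a greedy LZ77 parse (assumed variant, see lead-in)."""
    n = len(text)
    i = z = 0
    while i < n:
        longest = 0
        # Longest prefix of text[i:] that also starts at an earlier position j < i.
        for j in range(i):
            k = 0
            while i + k < n and text[j + k] == text[i + k]:
                k += 1
            longest = max(longest, k)
        i += max(longest, 1)  # a symbol never seen before becomes a one-symbol phrase
        z += 1
    return z
```

For "banana" the transform is "annb$aa" with 5 runs, and the parse is b | a | n | ana, giving 4 phrases.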

Citations: 11

A family of approximation algorithms for the maximum duo-preservation string mapping problem

Annual Symposium on Combinatorial Pattern Matching

Pub Date : 2017-02-01 DOI: 10.4230/LIPIcs.CPM.2017.10

Bartłomiej Dudek, Paweł Gawrychowski, Piotr Ostropolski-Nalewaja

In the Maximum Duo-Preservation String Mapping problem we are given two strings and wish to map the letters of the former to the letters of the latter so as to maximise the number of duos. A duo is a pair of consecutive letters that is mapped to a pair of consecutive letters in the same order. This is complementary to the well-studied Minimum Common String Partition problem, where the goal is to partition the former string into blocks that can be permuted and concatenated to obtain the latter string. Maximum Duo-Preservation String Mapping is APX-hard. After a series of improvements, Brubach [WABI 2016] showed a polynomial-time $3.25$-approximation algorithm. Our main contribution is that for any $\epsilon>0$ there exists a polynomial-time $(2+\epsilon)$-approximation algorithm. Similarly to a previous solution by Boria et al. [CPM 2016], our algorithm uses the local search technique. However, this is used only after a certain preliminary greedy procedure, which gives us more structure and makes a more general local search possible. We complement this with a specialised version of the algorithm that achieves $2.67$-approximation in quadratic time.
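The objective can be made concrete with a small sketch that counts the duos preserved by a given mapping and finds the optimum by exhaustive search. This is exponential and only suitable for very short strings; it is not the local-search algorithm of the abstract. The representation of a mapping as a list of target positions and the function names are assumptions made for illustration.

```python
from itertools import permutations


def count_duos(x, y, mapping):
    """Count preserved duos; mapping[i] is the position in y assigned to x[i]."""
    assert len(x) == len(y) == len(mapping)
    assert sorted(mapping) == list(range(len(y))), "mapping must be a bijection"
    assert all(x[i] == y[j] for i, j in enumerate(mapping)), "letters must agree"
    return sum(1 for i in range(len(x) - 1) if mapping[i + 1] == mapping[i] + 1)


def max_duos_bruteforce(x, y):
    """Try every letter-preserving bijection (factorial time, tiny inputs only)."""
    assert len(x) == len(y)
    best = 0
    for mapping in permutations(range(len(y))):
        if all(x[i] == y[j] for i, j in enumerate(mapping)):
            best = max(best, count_duos(x, y, mapping))
    return best
```

For x = "abcab" and y = "ababc", mapping the block "abc" to the end of y and "ab" to its start, i.e. `[2, 3, 4, 0, 1]`, preserves three duos, which is optimal here since four duos would require the strings to be equal.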

Citations: 7

Longest Common Extensions with Recompression

Annual Symposium on Combinatorial Pattern Matching

Pub Date : 2016-11-16 DOI: 10.4230/LIPIcs.CPM.2017.18

T. I.

Given two positions i and j in a string T of length N, a longest common extension (LCE) query asks for the length of the longest common prefix between suffixes beginning at i and j. A compressed LCE data structure stores T in a compressed form while supporting fast LCE queries. In this article we show that the recompression technique is a powerful tool for compressed LCE data structures. We present a new compressed LCE data structure of size O(z lg (N/z)) that supports LCE queries in O(lg N) time, where z is the size of the Lempel-Ziv 77 factorization of T without self-reference. Given T in uncompressed form, we show how to build our data structure in O(N) time and space. Given T in grammar-compressed form, i.e., as a straight-line program of size n generating T, we show how to build our data structure in O(n lg (N/n)) time and O(n + z lg (N/z)) space. Our algorithms are deterministic and always return correct answers.
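The query itself is simple to state in code. The following naive sketch answers an LCE query by direct character comparison on the uncompressed text, taking time proportional to the answer; it serves only as a reference definition against which a compressed structure could be checked. Positions are 0-based here, which is an assumption, and the function name is hypothetical.

```python
def lce_naive(t, i, j):
    """Length of the longest common prefix of the suffixes t[i:] and t[j:]."""
    n = len(t)
    k = 0
    while i + k < n and j + k < n and t[i + k] == t[j + k]:
        k += 1
    return k
```

For T = "abracadabra", the suffixes at positions 0 and 7 share the prefix "abra", so the query returns 4.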

Citations: 30

Computing All Distinct Squares in Linear Time for Integer Alphabets

Annual Symposium on Combinatorial Pattern Matching

Pub Date : 2016-10-11 DOI: 10.4230/LIPIcs.CPM.2017.22

H. Bannai, Shunsuke Inenaga, D. Köppl

Given a string on an integer alphabet, we present an algorithm that computes the set of all distinct squares belonging to this string in time linear in the string length. As an application, we show how to compute the tree topology of the minimal augmented suffix tree in linear time. Aside from that, we elaborate an algorithm computing the longest previous table in a succinct representation using compressed working space.
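A square is a substring of the form ww for a non-empty w. The sketch below enumerates all distinct squares by brute force, which takes far more than linear time and is meant only to pin down what the linear-time algorithm outputs. The function name is hypothetical.

```python
def distinct_squares(s):
    """Return the set of distinct substrings of s of the form w + w, with w non-empty."""
    n = len(s)
    found = set()
    for start in range(n):
        # `half` is the length of w; the square occupies s[start : start + 2 * half].
        for half in range(1, (n - start) // 2 + 1):
            if s[start:start + half] == s[start + half:start + 2 * half]:
                found.add(s[start:start + 2 * half])
    return found
```

For "abaabaa" the distinct squares are "aa", "abaaba" and "baabaa"; the square "aa" occurs twice but is counted once.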

Citations: 21
