
Annual Symposium on Combinatorial Pattern Matching: Latest Papers

Longest substring palindrome after edit
Pub Date : 2018-05-01 DOI: 10.4230/LIPIcs.CPM.2018.12
Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, H. Bannai, M. Takeda
It is known that the length of the longest substring palindromes (LSPals) of a given string $T$ of length $n$ can be computed in $O(n)$ time by Manacher's algorithm [J. ACM '75]. In this paper, we consider the problem of finding the LSPal after the string is edited. We present an algorithm that uses $O(n)$ time and space for preprocessing, and answers the length of the LSPals in $O(\log(\min\{\sigma, \log n\}))$ time after single character substitution, insertion, or deletion, where $\sigma$ denotes the number of distinct characters appearing in $T$. We also propose an algorithm that uses $O(n)$ time and space for preprocessing, and answers the length of the LSPals in $O(\ell + \log n)$ time, after an existing substring in $T$ is replaced by a string of arbitrary length $\ell$.
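As a point of reference for the abstract above, the following is a minimal Python sketch of Manacher's algorithm for the LSPal length, together with a baseline that simply recomputes from scratch after a substitution. The function names `longest_palindrome_length` and `lspal_after_substitution` are illustrative assumptions; the baseline costs $O(n)$ per edit and is not the paper's data structure.

```python
def longest_palindrome_length(t: str) -> int:
    """Length of the longest palindromic substring of t (Manacher, O(n))."""
    if not t:
        return 0
    # Interleave separators so odd- and even-length palindromes are treated alike.
    s = "#" + "#".join(t) + "#"
    n = len(s)
    radius = [0] * n          # radius[i]: palindrome radius around s[i]
    center = right = 0        # rightmost palindrome found so far
    for i in range(n):
        if i < right:
            # Reuse the mirrored position's radius, clipped to the known palindrome.
            radius[i] = min(right - i, radius[2 * center - i])
        a, b = i - radius[i] - 1, i + radius[i] + 1
        while a >= 0 and b < n and s[a] == s[b]:
            radius[i] += 1
            a -= 1
            b += 1
        if i + radius[i] > right:
            center, right = i, i + radius[i]
    # In the interleaved string, the radius equals the palindrome length in t.
    return max(radius)


def lspal_after_substitution(t: str, i: int, c: str) -> int:
    """Baseline only: apply the substitution t[i] := c and recompute in O(n)."""
    return longest_palindrome_length(t[:i] + c + t[i + 1:])
```

A query such as `lspal_after_substitution("abcba", 0, "z")` rebuilds the whole table; the preprocessing described in the abstract is meant to avoid exactly that recomputation.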
Citations: 10
Faster Online Elastic Degenerate String Matching
Pub Date : 2018-05-01 DOI: 10.4230/LIPIcs.CPM.2018.9
Kotaro Aoyama, Yuto Nakashima, I. Tomohiro, Shunsuke Inenaga, H. Bannai, M. Takeda
An Elastic-Degenerate String [Iliopoulos et al., LATA 2017] is a sequence of sets of strings, which was recently proposed as a way to model a set of similar sequences. We give an online algorithm for the Elastic-Degenerate String Matching (EDSM) problem that runs in $O(nm\sqrt{m \log m} + N)$ time and $O(m)$ working space, where $n$ is the number of elastic-degenerate segments of the text, $N$ is the total length of all strings in the text, and $m$ is the length of the pattern. This improves the previous algorithm by Grossi et al. [CPM 2017] that runs in $O(nm^2 + N)$ time.
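To make the problem statement concrete, here is a naive online matcher written in Python. It is a sketch under stated assumptions, not the algorithm of the paper: the name `edsm_naive`, the list-of-lists text representation, and the convention of reporting the indices of segments in which an occurrence ends are all choices made for illustration. It keeps the set of pattern prefix lengths that are still "alive" at each segment boundary and uses plain string comparisons, so it makes no attempt to meet the stated time bound.

```python
def edsm_naive(segments, pattern):
    """Return indices of segments in which some occurrence of `pattern` ends.

    `segments` is a sequence of collections of strings (the empty string is
    allowed); `pattern` is assumed to be non-empty.
    """
    m = len(pattern)
    ends = []
    active = set()  # prefix lengths j (0 < j < m) matched up to the last boundary
    for idx, segment in enumerate(segments):
        new_active = set()
        found = False
        for s in segment:
            # Case 1: the pattern occurs entirely inside one string.
            if pattern in s:
                found = True
            # Case 2: a suffix of s starts a new partial match.
            for k in range(1, min(len(s), m - 1) + 1):
                if s[-k:] == pattern[:k]:
                    new_active.add(k)
            # Case 3: s extends a partial match carried over the boundary.
            for j in active:
                rest = m - j
                if len(s) >= rest:
                    if s[:rest] == pattern[j:]:
                        found = True
                elif pattern[j:j + len(s)] == s:
                    new_active.add(j + len(s))
        if found:
            ends.append(idx)
        active = new_active
    return ends
```

For the text `[["a"], ["b", "", "cc"], ["c"]]` and the pattern `"ac"`, the occurrence through `"a" + "cc"` ends in segment 1 and the one through `"a" + "" + "c"` ends in segment 2.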
Citations: 16
Dualities in Tree Representations
Pub Date : 2018-04-12 DOI: 10.4230/LIPIcs.CPM.2018.18
R. Chikhi, A. Schönhuth
A characterization of the tree $T^*$ such that $\mathrm{BP}(T^*)=\overleftrightarrow{\mathrm{DFUDS}(T)}$, the reversal of $\mathrm{DFUDS}(T)$, is given. An immediate consequence is a rigorous characterization of the tree $\hat{T}$ such that $\mathrm{BP}(\hat{T})=\mathrm{DFUDS}(T)$. In summary, $\mathrm{BP}$ and $\mathrm{DFUDS}$ are unified within an encompassing framework, which might have the potential to imply future simplifications with regard to queries in $\mathrm{BP}$ and/or $\mathrm{DFUDS}$. Immediate benefits displayed here are to identify so far unnoted commonalities in most recent work on the Range Minimum Query problem, and to provide improvements for the Minimum Length Interval Query problem.
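For readers unfamiliar with the two encodings named in the abstract, the Python sketch below computes the usual balanced-parentheses (BP) and depth-first unary degree sequence (DFUDS) strings of a small ordinal tree. The nested-list tree representation and the function names `bp` and `dfuds` are assumptions made for illustration; the sketch shows only the encodings themselves, not the duality construction of the paper.

```python
def bp(tree) -> str:
    """BP encoding: '(' on entering a node, ')' on leaving it (preorder walk).

    A tree is represented as the list of its children's subtrees,
    so a leaf is the empty list.
    """
    return "(" + "".join(bp(child) for child in tree) + ")"


def _dfuds_body(tree) -> str:
    # Each node contributes its degree in unary ('(' per child) followed by ')'.
    return "(" * len(tree) + ")" + "".join(_dfuds_body(child) for child in tree)


def dfuds(tree) -> str:
    """DFUDS encoding with the customary leading '(' that balances the string."""
    return "(" + _dfuds_body(tree)


# Root with two children; the first child has a single leaf child.
example = [[[]], []]
```

On `example`, `bp` yields `((())())` and `dfuds` yields `((()()))`; both have length $2n$ for a tree with $n = 4$ nodes.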
Citations: 0
On Undetected Redundancy in the Burrows-Wheeler Transform
Pub Date : 2018-04-01 DOI: 10.4230/LIPIcs.CPM.2018.3
Uwe Baier
The Burrows-Wheeler-Transform (BWT) is an invertible permutation of a text known to be highly compressible but also useful for sequence analysis, which makes the BWT highly attractive for lossless data compression. In this paper, we present a new technique to reduce the size of a BWT using its combinatorial properties, while keeping it invertible. The technique can be applied to any BWT-based compressor, and, as experiments show, is able to reduce the encoding size by 8-16% on average and up to 33-57% in the best cases (depending on the BWT-compressor used), making BWT-based compressors competitive or even superior to today's best lossless compressors.
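The invertibility mentioned in the abstract can be seen in a textbook construction. The Python sketch below builds the BWT by sorting all rotations and inverts it by repeated sorting; it assumes a sentinel character `$` that does not occur in the text and sorts before every other character, and the function names are illustrative. It is quadratic or worse and says nothing about the size-reduction technique of the paper.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Burrows-Wheeler transform via sorted rotations (teaching version)."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, sentinel: str = "$") -> str:
    """Invert the BWT by rebuilding the sorted rotation matrix column by column."""
    table = [""] * len(last)
    for _ in range(len(last)):
        # Prepend the last column to every row, then sort again.
        table = sorted(ch + row for ch, row in zip(last, table))
    # The original text is the row that ends with the sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

For example, `bwt("banana")` gives `annb$aa`, and `inverse_bwt("annb$aa")` returns `banana`.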
Citations: 10
Clique-Based Lower Bounds for Parsing Tree-Adjoining Grammars
Pub Date : 2018-03-02 DOI: 10.4230/LIPIcs.CPM.2017.12
K. Bringmann, Philip Wellnitz
Tree-adjoining grammars are a generalization of context-free grammars that are well suited to model human languages and are thus popular in computational linguistics. In the tree-adjoining grammar recognition problem, given a grammar $\Gamma$ and a string $s$ of length $n$, the task is to decide whether $s$ can be obtained from $\Gamma$. Rajasekaran and Yooseph's parser (JCSS'98) solves this problem in time $O(n^{2\omega})$, where $\omega < 2.373$ is the matrix multiplication exponent. The best algorithms avoiding fast matrix multiplication take time $O(n^6)$. The first evidence for hardness was given by Satta (J. Comp. Linguist.'94): For a more general parsing problem, any algorithm that avoids fast matrix multiplication and is significantly faster than $O(|\Gamma| n^6)$ in the case of $|\Gamma| = \Theta(n^{12})$ would imply a breakthrough for Boolean matrix multiplication. Following an approach by Abboud et al. (FOCS'15) for context-free grammar recognition, in this paper we resolve many of the disadvantages of the previous lower bound. We show that, even on constant-size grammars, any improvement on Rajasekaran and Yooseph's parser would imply a breakthrough for the $k$-Clique problem. This establishes tree-adjoining grammar parsing as a practically relevant problem with the unusual running time of $n^{2\omega}$, up to lower order factors.
Citations: 10
Linear-Time Algorithm for Long LCF with k Mismatches
Pub Date : 2018-02-18 DOI: 10.4230/LIPIcs.CPM.2018.23
P. Charalampopoulos, M. Crochemore, C. Iliopoulos, T. Kociumaka, S. Pissis, J. Radoszewski, W. Rytter, Tomasz Waleń
In the Longest Common Factor with $k$ Mismatches (LCF$_k$) problem, we are given two strings $X$ and $Y$ of total length $n$, and we are asked to find a pair of maximal-length factors, one of $X$ and the other of $Y$, such that their Hamming distance is at most $k$. Thankachan et al. show that this problem can be solved in $\mathcal{O}(n \log^k n)$ time and $\mathcal{O}(n)$ space for constant $k$. We consider the LCF$_k$($\ell$) problem in which we assume that the sought factors have length at least $\ell$, and the LCF$_k$($\ell$) problem for $\ell=\Omega(\log^{2k+2} n)$, which we call the Long LCF$_k$ problem. We use difference covers to reduce the Long LCF$_k$ problem to a task involving $m=\mathcal{O}(n/\log^{k+1}n)$ synchronized factors. The latter can be solved in $\mathcal{O}(m \log^{k+1}m)$ time, which results in a linear-time algorithm for Long LCF$_k$. In general, our solution to LCF$_k$($\ell$) for arbitrary $\ell$ takes $\mathcal{O}(n + n \log^{k+1} n/\sqrt{\ell})$ time.
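The LCF$_k$ problem defined above has a simple quadratic-time baseline, sketched here in Python. This is not the linear-time method of the paper: it slides a two-pointer window with at most $k$ mismatches along every alignment (diagonal) of the two strings, and the function name `lcf_k` is an assumption made for illustration.

```python
def lcf_k(x: str, y: str, k: int) -> int:
    """Length of the longest pair of equal-length factors of x and y whose
    Hamming distance is at most k. Runs in O(|x| * |y|) time."""
    best = 0
    # shift = (position in x) - (position in y) identifies one diagonal.
    for shift in range(-(len(y) - 1), len(x)):
        i0 = max(0, shift)          # first aligned position in x
        j0 = i0 - shift             # first aligned position in y
        length = min(len(x) - i0, len(y) - j0)
        left = 0
        mismatches = 0
        for right in range(length):
            if x[i0 + right] != y[j0 + right]:
                mismatches += 1
            # Shrink the window until it holds at most k mismatches again.
            while mismatches > k:
                if x[i0 + left] != y[j0 + left]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best
```

With `x = "abcde"` and `y = "xbcdy"`, the answer grows from 3 for $k = 0$ (the common factor `bcd`) to 4 for $k = 1$ and 5 for $k = 2$.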
Citations: 17
Online LZ77 Parsing and Matching Statistics with RLBWTs
Pub Date : 2018-02-16 DOI: 10.4230/LIPIcs.CPM.2018.7
H. Bannai, T. Gagie, I. Tomohiro
Lempel-Ziv 1977 (LZ77) parsing, matching statistics and the Burrows-Wheeler Transform (BWT) are all fundamental elements of stringology. In a series of recent papers, Policriti and Prezza (DCC 2016 and Algorithmica, CPM 2017) showed how we can use an augmented run-length compressed BWT (RLBWT) of the reverse $T^R$ of a text $T$, to compute offline the LZ77 parse of $T$ in $O(n \log r)$ time and $O(r)$ space, where $n$ is the length of $T$ and $r$ is the number of runs in the BWT of $T^R$. In this paper we first extend a well-known technique for updating an unaugmented RLBWT when a character is prepended to a text, to work with Policriti and Prezza's augmented RLBWT. This immediately implies that we can build online the LZ77 parse of $T$ while still using $O(n \log r)$ time and $O(r)$ space; it also seems likely to be of independent interest. Our experiments, using an extension of Ohno, Takabatake, I and Sakamoto's (IWOCA 2017) implementation of updating, show our approach is both time- and space-efficient for repetitive strings. We then show how to augment the RLBWT further (albeit making it static again and increasing its space by a factor proportional to the size of the alphabet) such that later, given another string $S$ and $O(\log \log n)$-time random access to $T$, we can compute the matching statistics of $S$ with respect to $T$ in $O(|S| \log \log n)$ time.
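Two of the notions in the abstract above, the number of runs $r$ of a BWT and matching statistics, can be illustrated with a few lines of Python. This is a naive sketch: the function names are assumptions, matching statistics are taken here to mean, for each position $i$ of $S$, the length of the longest prefix of $S[i..]$ that occurs in $T$, and none of the RLBWT machinery of the paper is used.

```python
from itertools import groupby


def run_length_encode(bwt: str):
    """Collapse a BWT string into (character, run length) pairs; r = len(result)."""
    return [(ch, len(list(group))) for ch, group in groupby(bwt)]


def matching_statistics(t: str, s: str):
    """For each i, the length of the longest prefix of s[i:] occurring in t.

    Naive substring checks only; far slower than the bound quoted above.
    """
    stats = []
    for i in range(len(s)):
        length = 0
        while i + length < len(s) and s[i:i + length + 1] in t:
            length += 1
        stats.append(length)
    return stats
```

For instance, the BWT `annb$aa` of `banana$` has $r = 5$ runs, and the matching statistics of `nab` with respect to `banana` are `[2, 1, 1]`.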
Citations: 13
Text Indexing and Searching in Sublinear Time
Pub Date : 2017-12-01 DOI: 10.4230/LIPIcs.CPM.2020.24
J. Munro, G. Navarro, Yakov Nekrich
We introduce the first index that can be built in $o(n)$ time for a text of length $n$, and also queried in $o(m)$ time for a pattern of length $m$. On a constant-size alphabet, for example, our index uses $O(n\log^{1/2+\varepsilon}n)$ bits, is built in $O(n/\log^{1/2-\varepsilon} n)$ deterministic time, and finds the $\mathrm{occ}$ pattern occurrences in time $O(m/\log n + \sqrt{\log n}\log\log n + \mathrm{occ})$, where $\varepsilon>0$ is an arbitrarily small constant. As a comparison, the most recent classical text index uses $O(n\log n)$ bits, is built in $O(n)$ time, and searches in time $O(m/\log n + \log\log n + \mathrm{occ})$. We build on a novel text sampling based on difference covers, which enjoys properties that allow us to efficiently compute longest common prefixes in constant time. We extend our results to the secondary memory model as well, where we give the first construction in $o(\mathrm{Sort}(n))$ time of a data structure with suffix array functionality, which can search for patterns in almost optimal time, with an additive penalty of $O(\sqrt{\log_{M/B} n}\log\log n)$, where $M$ is the size of main memory available and $B$ is the disk block size.
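The difference covers mentioned in the abstract rest on a small combinatorial fact that is easy to check by hand. The Python sketch below uses the standard definition (a set $D \subseteq \mathbb{Z}_v$ such that every residue modulo $v$ is a difference of two elements of $D$) and shows the property that makes such sets useful for text sampling: any two text positions can be shifted by the same amount $h < v$ so that both land on sampled positions. The function names and the example cover $\{1, 2, 4\}$ modulo 7 are illustrative, and the particular sampling scheme of the paper is not reproduced.

```python
def is_difference_cover(d, v: int) -> bool:
    """True if every residue modulo v is a difference of two elements of d."""
    differences = {(a - b) % v for a in d for b in d}
    return differences == set(range(v))


def common_shift(i: int, j: int, d, v: int):
    """Smallest h in [0, v) with (i + h) mod v and (j + h) mod v both in d.

    Returns None only if d is not a difference cover modulo v.
    """
    members = set(d)
    for h in range(v):
        if (i + h) % v in members and (j + h) % v in members:
            return h
    return None


cover, modulus = {1, 2, 4}, 7
```

Here `common_shift(0, 3, cover, modulus)` is 1, because positions 1 and 4 are both sampled; comparing two suffixes therefore needs at most $v$ character comparisons before a pair of sampled positions is reached.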
Citations: 9
Gapped Pattern Statistics
Pub Date : 2017-07-04 DOI: 10.4230/LIPIcs.CPM.2017.21
Philippe Duchon, C. Nicaud, Carine Pivoteau
We give a probabilistic analysis of parameters related to $\alpha$-gapped repeats and palindromes in random words, under both uniform and memoryless distributions (where letters have different probabilities, but are drawn independently). More precisely, we study the expected number of maximal $\alpha$-gapped patterns, as well as the expected length of the longest $\alpha$-gapped pattern in a random word.
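The abstract does not restate the definition, so the Python sketch below assumes the one commonly used in the literature: an $\alpha$-gapped repeat is a factor $uvu$ with $|uv| \le \alpha |u|$ and non-empty $u$. It enumerates all such repeats by brute force and does not filter for maximality, which the paper's statistics do consider; the function name and the returned triple format are illustrative.

```python
def gapped_repeats(w: str, alpha: float):
    """All (start, arm length, gap length) triples such that w contains u v u
    at `start` with |u| = arm length, |v| = gap length and |uv| <= alpha * |u|.

    Brute force; the definition used is an assumption, see the text above.
    """
    n = len(w)
    found = []
    for start in range(n):
        for arm in range(1, (n - start) // 2 + 1):
            u = w[start:start + arm]
            gap = 0
            # Grow the gap while the right arm fits and the ratio constraint holds.
            while start + 2 * arm + gap <= n and arm + gap <= alpha * arm:
                right = start + arm + gap
                if w[right:right + arm] == u:
                    found.append((start, arm, gap))
                gap += 1
    return found
```

In `abcab` with $\alpha = 2$ the only hit is $u = $ `ab`, $v = $ `c`; with $\alpha = 1$ the gap must be empty, so the function reports squares only.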
Citations: 7
On-Line Pattern Matching on Similar Texts
Pub Date : 2017-07-01 DOI: 10.4230/LIPIcs.CPM.2017.9
R. Grossi, C. Iliopoulos, Chang Liu, N. Pisanti, S. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, Luca Versari
Pattern matching on a set of similar texts has received much attention, especially recently, mainly due to its application in cataloguing human genetic variation. In particular, many different algorithms have been proposed for the off-line version of this problem; that is, constructing a compressed index for a set of similar texts in order to answer pattern matching queries efficiently. However, the on-line, more fundamental, version of this problem is a rather undeveloped topic. Solutions to the on-line version can be beneficial for a number of reasons; for instance, efficient on-line solutions can be used in combination with partial indexes as practical trade-offs. Here we attempt to close this gap by proposing two efficient algorithms for this problem. Notably, one of the algorithms requires time linear in the size of the texts' representation, for short patterns. Furthermore, experimental results confirm our theoretical findings in practical terms.
Citations: 23