Pub Date: 2018-05-01 | DOI: 10.4230/LIPIcs.CPM.2018.12
Mitsuru Funakoshi, Yuto Nakashima, Shunsuke Inenaga, H. Bannai, M. Takeda
It is known that the length of the longest substring palindromes (LSPals) of a given string T of length n can be computed in O(n) time by Manacher's algorithm [J. ACM '75]. In this paper, we consider the problem of finding the LSPal after the string is edited. We present an algorithm that uses O(n) time and space for preprocessing, and reports the length of the LSPals in O(log(min{σ, log n})) time after a single character substitution, insertion, or deletion, where σ denotes the number of distinct characters appearing in T. We also propose an algorithm that uses O(n) time and space for preprocessing, and reports the length of the LSPals in O(ℓ + log n) time after an existing substring in T is replaced by a string of arbitrary length ℓ.
Title: Longest substring palindrome after edit. Published in: Annual Symposium on Combinatorial Pattern Matching.
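As background for the abstract above, the O(n) baseline it cites (Manacher's algorithm) can be sketched as follows. This is a minimal illustration of the static computation only, not the paper's edit-aware data structure; the function name `longest_palindrome_length` is a hypothetical choice made for this sketch.

```python
def longest_palindrome_length(t):
    """Length of the longest palindromic substring of t (Manacher-style, O(n))."""
    # Interleave a separator so odd- and even-length palindromes are handled
    # uniformly. Separators sit at even indices and are only ever compared
    # with other separators, so t may itself contain '#'.
    s = "#" + "#".join(t) + "#"
    n = len(s)
    radius = [0] * n          # radius[i]: palindrome radius around index i of s
    center = right = 0        # rightmost palindrome found so far is s[..right]
    for i in range(n):
        if i < right:
            # Reuse the radius of the mirror position inside the known palindrome.
            radius[i] = min(right - i, radius[2 * center - i])
        lo, hi = i - radius[i] - 1, i + radius[i] + 1
        while lo >= 0 and hi < n and s[lo] == s[hi]:
            radius[i] += 1
            lo -= 1
            hi += 1
        if i + radius[i] > right:
            center, right = i, i + radius[i]
    # A radius in the interleaved string equals a palindrome length in t.
    return max(radius)
```

Recomputing this from scratch after every edit costs O(n) per query, which is the cost the abstract's preprocessing-based algorithms are designed to avoid.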
Pub Date: 2018-05-01 | DOI: 10.4230/LIPIcs.CPM.2018.9
Kotaro Aoyama, Yuto Nakashima, I. Tomohiro, Shunsuke Inenaga, H. Bannai, M. Takeda
An Elastic-Degenerate String [Iliopoulos et al., LATA 2017] is a sequence of sets of strings, which was recently proposed as a way to model a set of similar sequences. We give an online algorithm for the Elastic-Degenerate String Matching (EDSM) problem that runs in $O(nm\sqrt{m \log m} + N)$ time and $O(m)$ working space, where $n$ is the number of elastic-degenerate segments of the text, $N$ is the total length of all strings in the text, and $m$ is the length of the pattern. This improves on the previous algorithm by Grossi et al. [CPM 2017], which runs in $O(nm^2 + N)$ time.
Title: Faster Online Elastic Degenerate String Matching. Published in: Annual Symposium on Combinatorial Pattern Matching.
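To make the EDSM problem statement above concrete, here is a naive online baseline. It is not the algorithm of the paper or of Grossi et al., and it makes no attempt to meet their bounds; it only illustrates the idea of carrying "active" pattern prefixes across segment boundaries. The function name `edsm_end_segments` and the list-of-lists text representation are assumptions of this sketch.

```python
def edsm_end_segments(segments, pattern):
    """Indices of segments in which some occurrence of `pattern` ends.

    `segments` is a list of collections of strings (the elastic-degenerate
    text); an empty string in a segment means the segment may be skipped.
    `pattern` is assumed to be non-empty.
    """
    m = len(pattern)
    # active: prefix lengths p (0 < p < m) such that pattern[:p] matches a
    # suffix of some text spelled by the segments read so far.
    active = set()
    ends = []
    for index, segment in enumerate(segments):
        next_active = set()
        found = False
        for s in segment:
            # Case 1: the whole pattern lies inside one string of the segment.
            if pattern in s:
                found = True
            # Case 2: extend a partial match that started in an earlier segment.
            for p in active:
                rest = pattern[p:]
                if s.startswith(rest):
                    found = True
                elif rest.startswith(s):
                    next_active.add(p + len(s))
            # Case 3: a new partial match starts inside s and reaches its end.
            for p in range(1, min(len(s), m - 1) + 1):
                if s.endswith(pattern[:p]):
                    next_active.add(p)
        if found:
            ends.append(index)
        active = next_active
    return ends
```

For example, with `segments = [["ab"], ["c", "xc", ""], ["de"]]`, the pattern `"bd"` is found ending in segment 2, because the empty string lets the match skip the middle segment.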
Pub Date: 2018-04-12 | DOI: 10.4230/LIPIcs.CPM.2018.18
R. Chikhi, A. Schönhuth
A characterization of the tree $T^*$ such that $\mathrm{BP}(T^*)=\overleftrightarrow{\mathrm{DFUDS}(T)}$, the reversal of $\mathrm{DFUDS}(T)$, is given. An immediate consequence is a rigorous characterization of the tree $\hat{T}$ such that $\mathrm{BP}(\hat{T})=\mathrm{DFUDS}(T)$. In summary, $\mathrm{BP}$ and $\mathrm{DFUDS}$ are unified within an encompassing framework, which might have the potential to imply future simplifications with regard to queries in $\mathrm{BP}$ and/or $\mathrm{DFUDS}$. Immediate benefits displayed here are to identify so far unnoted commonalities in most recent work on the Range Minimum Query problem, and to provide improvements for the Minimum Length Interval Query problem.
Title: Dualities in Tree Representations. Published in: Annual Symposium on Combinatorial Pattern Matching.
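For readers unfamiliar with the two encodings named in the abstract above, the following sketch builds both for a small ordinal tree. It uses the standard textbook definitions (balanced parentheses via depth-first traversal; DFUDS as a leading open parenthesis followed by each node's degree in unary, in preorder) and does not reproduce the paper's duality construction. Representing a tree as a nested list of children is an assumption of this sketch.

```python
def bp(tree):
    """Balanced-parentheses encoding: '(' on entering a node, ')' on leaving."""
    return "(" + "".join(bp(child) for child in tree) + ")"


def dfuds(tree):
    """Depth-first unary degree sequence with the conventional leading '('."""
    def visit(node):
        # Degree in unary ('(' per child), terminated by ')', then the
        # children's own encodings in preorder.
        return "(" * len(node) + ")" + "".join(visit(child) for child in node)
    return "(" + visit(tree)


# A root with two children: a leaf, and an internal node with two leaves.
example = [[], [[], []]]
print(bp(example))     # (()(()()))
print(dfuds(example))  # ((())(()))
```

Both strings have length 2n for a tree with n nodes (here n = 5), which is what makes a correspondence between the two encodings plausible in the first place.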
Pub Date: 2018-04-01 | DOI: 10.4230/LIPIcs.CPM.2018.3
Uwe Baier
The Burrows-Wheeler Transform (BWT) is an invertible permutation of a text known to be highly compressible but also useful for sequence analysis, which makes the BWT highly attractive for lossless data compression. In this paper, we present a new technique to reduce the size of a BWT using its combinatorial properties, while keeping it invertible. The technique can be applied to any BWT-based compressor and, as experiments show, is able to reduce the encoding size by 8-16% on average and up to 33-57% in the best cases (depending on the BWT compressor used), making BWT-based compressors competitive with or even superior to today's best lossless compressors.
Title: On Undetected Redundancy in the Burrows-Wheeler Transform. Published in: Annual Symposium on Combinatorial Pattern Matching.
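The invertibility that the abstract above relies on can be illustrated with a deliberately naive BWT and its inverse. This sketch sorts all rotations (quadratic space) and is unrelated to the paper's size-reduction technique; the function names and the choice of `"\0"` as an end marker are assumptions, and the marker must not occur in the input text.

```python
def bwt(text, sentinel="\0"):
    """Naive BWT: last column of the sorted rotations of text + sentinel."""
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last, sentinel="\0"):
    """Invert the BWT by following the first-to-last column correspondence."""
    n = len(last)
    # A stable sort of positions by character yields the first column:
    # order[j] is the row whose last character is the j-th first-column char.
    order = sorted(range(n), key=lambda i: last[i])
    row = last.index(sentinel)  # the row holding text + sentinel itself
    out = []
    for _ in range(n - 1):
        row = order[row]
        out.append(last[row])
    return "".join(out)
```

For instance, `bwt("banana")` yields `"annb\0aa"`, and `inverse_bwt` recovers `"banana"` from it. The inverse depends on Python's `sorted` being stable, so equal characters keep their relative order.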
Pub Date: 2018-03-02 | DOI: 10.4230/LIPIcs.CPM.2017.12
K. Bringmann, Philip Wellnitz
Tree-adjoining grammars are a generalization of context-free grammars that are well suited to model human languages and are thus popular in computational linguistics. In the tree-adjoining grammar recognition problem, given a grammar $\Gamma$ and a string $s$ of length $n$, the task is to decide whether $s$ can be obtained from $\Gamma$. Rajasekaran and Yooseph's parser (JCSS'98) solves this problem in time $O(n^{2\omega})$, where $\omega < 2.373$ is the matrix multiplication exponent. The best algorithms avoiding fast matrix multiplication take time $O(n^6)$. The first evidence for hardness was given by Satta (J. Comp. Linguist.'94): For a more general parsing problem, any algorithm that avoids fast matrix multiplication and is significantly faster than $O(|\Gamma| n^6)$ in the case of $|\Gamma| = \Theta(n^{12})$ would imply a breakthrough for Boolean matrix multiplication. Following an approach by Abboud et al. (FOCS'15) for context-free grammar recognition, in this paper we resolve many of the disadvantages of the previous lower bound. We show that, even on constant-size grammars, any improvement on Rajasekaran and Yooseph's parser would imply a breakthrough for the $k$-Clique problem. This establishes tree-adjoining grammar parsing as a practically relevant problem with the unusual running time of $n^{2\omega}$, up to lower order factors.
Title: Clique-Based Lower Bounds for Parsing Tree-Adjoining Grammars. Published in: Annual Symposium on Combinatorial Pattern Matching.
Pub Date: 2018-02-18 | DOI: 10.4230/LIPIcs.CPM.2018.23
P. Charalampopoulos, M. Crochemore, C. Iliopoulos, T. Kociumaka, S. Pissis, J. Radoszewski, W. Rytter, Tomasz Waleń
In the Longest Common Factor with $k$ Mismatches (LCF$_k$) problem, we are given two strings $X$ and $Y$ of total length $n$, and we are asked to find a pair of maximal-length factors, one of $X$ and the other of $Y$, such that their Hamming distance is at most $k$. Thankachan et al. show that this problem can be solved in $\mathcal{O}(n \log^k n)$ time and $\mathcal{O}(n)$ space for constant $k$. We consider the LCF$_k$($\ell$) problem in which we assume that the sought factors have length at least $\ell$, and the LCF$_k$($\ell$) problem for $\ell=\Omega(\log^{2k+2} n)$, which we call the Long LCF$_k$ problem. We use difference covers to reduce the Long LCF$_k$ problem to a task involving $m=\mathcal{O}(n/\log^{k+1} n)$ synchronized factors. The latter can be solved in $\mathcal{O}(m \log^{k+1} m)$ time, which results in a linear-time algorithm for Long LCF$_k$. In general, our solution to LCF$_k$($\ell$) for arbitrary $\ell$ takes $\mathcal{O}(n + n \log^{k+1} n/\sqrt{\ell})$ time.
Title: Linear-Time Algorithm for Long LCF with k Mismatches. Published in: Annual Symposium on Combinatorial Pattern Matching.
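The LCF$_k$ problem defined in the abstract above admits a simple quadratic-time baseline, sketched here to make the definition concrete. It slides a window with at most $k$ mismatches along every alignment (diagonal) of the two strings; it is not the difference-cover algorithm of the paper. The function name `lcf_k_mismatches` is an assumption, and the sketch returns only the length of the best pair of factors.

```python
def lcf_k_mismatches(x, y, k):
    """Length of the longest pair of equal-length factors of x and y whose
    Hamming distance is at most k. Runs in O(|x| * |y|) time."""
    best = 0
    # Each shift fixes one alignment: x[i] is compared with y[i - shift].
    for shift in range(-(len(y) - 1), len(x)):
        i = max(0, shift)          # first aligned position in x
        j = i - shift              # first aligned position in y
        length = min(len(x) - i, len(y) - j)
        mismatches = 0
        left = 0                   # window is [left, right] along the diagonal
        for right in range(length):
            if x[i + right] != y[j + right]:
                mismatches += 1
            # Shrink from the left until the window has at most k mismatches.
            while mismatches > k:
                if x[i + left] != y[j + left]:
                    mismatches -= 1
                left += 1
            best = max(best, right - left + 1)
    return best
```

With `x = "abcde"` and `y = "xbcdy"`, the result grows from 3 (`k = 0`, the common factor `bcd`) to 4 (`k = 1`) and 5 (`k = 2`).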
Pub Date: 2018-02-16 | DOI: 10.4230/LIPIcs.CPM.2018.7
H. Bannai, T. Gagie, I. Tomohiro
Lempel-Ziv 1977 (LZ77) parsing, matching statistics and the Burrows-Wheeler Transform (BWT) are all fundamental elements of stringology. In a series of recent papers, Policriti and Prezza (DCC 2016 and Algorithmica, CPM 2017) showed how we can use an augmented run-length compressed BWT (RLBWT) of the reverse $T^R$ of a text $T$ to compute offline the LZ77 parse of $T$ in $O(n \log r)$ time and $O(r)$ space, where $n$ is the length of $T$ and $r$ is the number of runs in the BWT of $T^R$. In this paper we first extend a well-known technique for updating an unaugmented RLBWT when a character is prepended to a text, to work with Policriti and Prezza's augmented RLBWT. This immediately implies that we can build online the LZ77 parse of $T$ while still using $O(n \log r)$ time and $O(r)$ space; it also seems likely to be of independent interest. Our experiments, using an extension of Ohno, Takabatake, I and Sakamoto's (IWOCA 2017) implementation of updating, show our approach is both time- and space-efficient for repetitive strings. We then show how to augment the RLBWT further (albeit making it static again and increasing its space by a factor proportional to the size of the alphabet) such that later, given another string $S$ and $O(\log \log n)$-time random access to $T$, we can compute the matching statistics of $S$ with respect to $T$ in $O(|S| \log \log n)$ time.
Title: Online LZ77 Parsing and Matching Statistics with RLBWTs. Published in: Annual Symposium on Combinatorial Pattern Matching.
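As a reference point for the LZ77 parse discussed in the abstract above, the following brute-force factorization shows what is being computed. It takes cubic time in the worst case and uses no BWT machinery, so it is only a definition-level sketch. Note that several LZ77 variants exist; this one assumes each phrase is either a single fresh character or the longest prefix of the remaining text that also starts at an earlier position (overlaps allowed). The function name `lz77_factorize` is hypothetical.

```python
def lz77_factorize(t):
    """Greedy LZ77-style factorization of t, returned as a list of phrases."""
    phrases = []
    n = len(t)
    i = 0
    while i < n:
        longest = 0
        # Try every earlier starting position j as the source of the phrase.
        for j in range(i):
            length = 0
            # The source may run past i (self-overlap), which is why runs
            # such as "aaaa" compress to two phrases.
            while i + length < n and t[j + length] == t[i + length]:
                length += 1
            longest = max(longest, length)
        step = max(longest, 1)  # a character with no earlier match stands alone
        phrases.append(t[i:i + step])
        i += step
    return phrases
```

For example, `lz77_factorize("abababab")` gives `["a", "b", "ababab"]`: the third phrase copies from position 0 and overlaps itself, which is typical for the repetitive inputs the abstract targets.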
Pub Date: 2017-12-01 | DOI: 10.4230/LIPIcs.CPM.2020.24
J. Munro, G. Navarro, Yakov Nekrich
We introduce the first index that can be built in $o(n)$ time for a text of length $n$, and also queried in $o(m)$ time for a pattern of length $m$. On a constant-size alphabet, for example, our index uses $O(n\log^{1/2+\varepsilon} n)$ bits, is built in $O(n/\log^{1/2-\varepsilon} n)$ deterministic time, and finds the $\mathrm{occ}$ pattern occurrences in time $O(m/\log n + \sqrt{\log n}\log\log n + \mathrm{occ})$, where $\varepsilon>0$ is an arbitrarily small constant. As a comparison, the most recent classical text index uses $O(n\log n)$ bits, is built in $O(n)$ time, and searches in time $O(m/\log n + \log\log n + \mathrm{occ})$. We build on a novel text sampling based on difference covers, which enjoys properties that allow us to compute longest common prefixes in constant time. We extend our results to the secondary memory model as well, where we give the first construction in $o(\mathrm{Sort}(n))$ time of a data structure with suffix array functionality, which can search for patterns in almost optimal time, with an additive penalty of $O(\sqrt{\log_{M/B} n}\log\log n)$, where $M$ is the size of main memory available and $B$ is the disk block size.
Title: Text Indexing and Searching in Sublinear Time. Published in: Annual Symposium on Combinatorial Pattern Matching.
Pub Date: 2017-07-04 | DOI: 10.4230/LIPIcs.CPM.2017.21
Philippe Duchon, C. Nicaud, Carine Pivoteau
We give a probabilistic analysis of parameters related to $\alpha$-gapped repeats and palindromes in random words, under both uniform and memoryless distributions (where letters have different probabilities, but are drawn independently). More precisely, we study the expected number of maximal $\alpha$-gapped patterns, as well as the expected length of the longest $\alpha$-gapped pattern in a random word.
Title: Gapped Pattern Statistics. Published in: Annual Symposium on Combinatorial Pattern Matching.
Pub Date: 2017-07-01 | DOI: 10.4230/LIPIcs.CPM.2017.9
R. Grossi, C. Iliopoulos, Chang Liu, N. Pisanti, S. Pissis, Ahmad Retha, Giovanna Rosone, Fatima Vayani, Luca Versari
Pattern matching on a set of similar texts has received much attention, especially recently, mainly due to its application in cataloguing human genetic variation. In particular, many different algorithms have been proposed for the off-line version of this problem; that is, constructing a compressed index for a set of similar texts in order to answer pattern matching queries efficiently. However, the on-line, more fundamental, version of this problem is a rather undeveloped topic. Solutions to the on-line version can be beneficial for a number of reasons; for instance, efficient on-line solutions can be used in combination with partial indexes as practical trade-offs. Here we attempt to close this gap by proposing two efficient algorithms for this problem. Notably, for short patterns, one of the algorithms requires time linear in the size of the texts' representation. Furthermore, experimental results confirm our theoretical findings in practical terms.
Title: On-Line Pattern Matching on Similar Texts. Published in: Annual Symposium on Combinatorial Pattern Matching.