{"title":"Dictionary Matching in Elastic-Degenerate Texts with Applications in Searching VCF Files On-line","authors":"S. Pissis, Ahmad Retha","doi":"10.4230/LIPIcs.SEA.2018.16","DOIUrl":null,"url":null,"abstract":"An elastic-degenerate string is a sequence of n sets of strings of total length N . It has been introduced to represent multiple sequence alignments of closely-related sequences in a compact form. For a standard pattern of length m, pattern matching in an elastic-degenerate text can be solved on-line in time O(nm2 +N) with pre-processing time and space O(m) (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time O(N · dmw e) with pre-processing time and space O(m·dmw e), where w is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length M . A straightforward generalization of the existing bit-vector algorithm would require time O(N · dMw e) with pre-processing time and space O(M · dMw e), which is prohibitive in practice. We present a new on-line O(N · d M w e)-time algorithm with pre-processing time and space O(M). We present experimental results using both synthetic and real data demonstrating the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for discovery and verification of minimal absent words (MAWs) in the human genome showing that a significant number of previously discovered MAWs are in fact false-positives when a population’s variants are considered. 2012 ACM Subject Classification Theory of computation → Pattern matching","PeriodicalId":9448,"journal":{"name":"Bulletin of the Society of Sea Water Science, Japan","volume":"25 1","pages":"16:1-16:14"},"PeriodicalIF":0.0000,"publicationDate":"2018-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"13","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bulletin of the Society of Sea Water Science, Japan","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.4230/LIPIcs.SEA.2018.16","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 13
Abstract
An elastic-degenerate string is a sequence of n sets of strings of total length N . It has been introduced to represent multiple sequence alignments of closely-related sequences in a compact form. For a standard pattern of length m, pattern matching in an elastic-degenerate text can be solved on-line in time O(nm2 +N) with pre-processing time and space O(m) (Grossi et al., CPM 2017). A fast bit-vector algorithm requiring time O(N · dmw e) with pre-processing time and space O(m·dmw e), where w is the size of the computer word, was also presented. In this paper we consider the same problem for a set of patterns of total length M . A straightforward generalization of the existing bit-vector algorithm would require time O(N · dMw e) with pre-processing time and space O(M · dMw e), which is prohibitive in practice. We present a new on-line O(N · d M w e)-time algorithm with pre-processing time and space O(M). We present experimental results using both synthetic and real data demonstrating the performance of the algorithm. We further demonstrate a real application of our algorithm in a pipeline for discovery and verification of minimal absent words (MAWs) in the human genome showing that a significant number of previously discovered MAWs are in fact false-positives when a population’s variants are considered. 2012 ACM Subject Classification Theory of computation → Pattern matching