Title: Suffix sorting via matching statistics
Authors: Zsuzsanna Lipták, Francesco Masillo, S. Puglisi
Pub Date: 2022-07-03 | DOI: 10.48550/arXiv.2207.00972 | Workshop on Algorithms in Bioinformatics

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experiments with a prototype implementation (a tool we call sacamats) show that on collections of highly similar strings we can construct the suffix array in time competitive with, or faster than, the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.
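To make the first step concrete, here is a minimal brute-force sketch of matching statistics; this is an illustration only, not the paper's compressed, index-based computation (the function name and the quadratic substring search are ours):

```python
def matching_statistics(s: str, ref: str) -> list[int]:
    """ms[i] = length of the longest prefix of s[i:] that occurs as a
    substring of ref.  Uses the fact that ms[i+1] >= ms[i] - 1, so each
    search resumes one character short of the previous match."""
    ms: list[int] = []
    for i in range(len(s)):
        k = max(ms[-1] - 1, 0) if ms else 0   # guaranteed still to match
        while i + k < len(s) and s[i:i + k + 1] in ref:
            k += 1
        ms.append(k)
    return ms
```

For example, `matching_statistics("banana", "ban")` yields `[3, 2, 1, 2, 1, 1]`: the prefix "ban" of "banana" occurs in the reference, "ana" does not but "an" does, and so on.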
Title: Prefix-free parsing for building large tunnelled Wheeler graphs
Authors: Adrián Goga, Andrej Baláz
Pub Date: 2022-06-30 | DOI: 10.4230/LIPIcs.WABI.2022.18 | Workshop on Algorithms in Bioinformatics

We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019). Wheeler graphs (WGs) are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths, called blocks, that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting WG, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process. To find an adequate set of blocks in less time, we propose a new method based on prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal size that overlap by a fixed number of characters. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the PFP of the text is generally much shorter than the original. To speed up the block selection for tunnelling, we apply PFP to obtain the parse and the dictionary of the text, tunnel the WG of the parse using existing heuristics, and subsequently use this tunnelled parse to construct a compact WG of the original text. Compared with constructing a WG from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of WGs as a pangenomic reference for real-world datasets.
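The phrase-splitting idea behind PFP can be sketched as follows. This is a simplified stand-in, not the authors' implementation: real PFP uses a rolling Karp-Rabin hash over a w-character window, whereas here a toy deterministic hash and illustrative parameters `w` and `p` are used.

```python
def prefix_free_parse(text: str, w: int = 2, p: int = 3):
    """Split `text` into overlapping phrases: a phrase ends wherever the
    hash of the current w-character window is 0 mod p, and consecutive
    phrases share those w characters.  Returns (dictionary, parse)."""
    # Deterministic toy stand-in for the Karp-Rabin rolling hash of real PFP.
    h = lambda win: sum(ord(c) * 31 ** k for k, c in enumerate(win))
    text += "$" * w                         # sentinel padding (one common convention)
    cuts = [0]
    for i in range(1, len(text) - w):
        if h(text[i:i + w]) % p == 0:
            cuts.append(i)
    cuts.append(len(text) - w)
    phrases = [text[a:b + w] for a, b in zip(cuts, cuts[1:])]
    dictionary = sorted(set(phrases))       # list of distinct phrases
    parse = [dictionary.index(ph) for ph in phrases]  # phrase ranks
    return dictionary, parse
```

Stitching the phrases back together, dropping the w-character overlap after the first phrase, recovers the padded text; that round-trip is the key invariant of the parse/dictionary representation.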
Title: Phyolin: Identifying a Linear Perfect Phylogeny in Single-Cell DNA Sequencing Data of Tumors
Authors: Leah L. Weber, M. El-Kebir
Pub Date: 2020-08-01 | DOI: 10.4230/LIPIcs.WABI.2020.5 | Workshop on Algorithms in Bioinformatics

Cancer arises from an evolutionary process where somatic mutations occur and eventually give rise to clonal expansions. Modeling this evolutionary process as a phylogeny is useful for treatment decision-making as well as for understanding evolutionary patterns across patients and cancer types. However, cancer phylogeny inference from single-cell DNA sequencing data of tumors is challenging due to limitations of sequencing technology and the complexity of the resulting problem. Therefore, as a first step, value may be obtained simply by correctly classifying the evolutionary process as either linear or branched. The biological implications of these two high-level patterns are different, and understanding which cancer types and which patients follow each of these trajectories could provide useful insight for both clinicians and researchers. Here, we introduce the Linear Perfect Phylogeny Flipping Problem as a means of testing a null model that the tree topology is linear, and show that it is NP-hard. We develop Phyolin and, through both in silico experiments and application to real data, show that it is an accurate, easy-to-use, and reasonably fast method for classifying an evolutionary trajectory as linear or branched.

2012 ACM Subject Classification: Applied computing → Molecular evolution
Title: Near-Linear Time Edit Distance for Indel Channels
Authors: Arun Ganesh, Aaron Sy
Pub Date: 2020-07-06 | DOI: 10.4230/LIPIcs.WABI.2020.17 | Workshop on Algorithms in Bioinformatics

We consider the following model for sampling pairs of strings: $s_1$ is a uniformly random bitstring of length $n$, and $s_2$ is the bitstring arrived at by applying substitutions, insertions, and deletions to each bit of $s_1$ with some probability. We show that the edit distance between $s_1$ and $s_2$ can be computed in $O(n \ln n)$ time with high probability, as long as each bit of $s_1$ has a mutation applied to it with probability at most a small constant. The algorithm is simple and uses only the textbook dynamic programming algorithm as a primitive: it first computes an approximate alignment between the two strings, and then runs the dynamic programming algorithm restricted to entries close to the approximate alignment. The analysis of our algorithm provides theoretical justification for alignment heuristics used in practice, such as BLAST, FASTA, and MAFFT, which also start by computing approximate alignments quickly and then find the best alignment near the approximate alignment. Our main technical contribution is a partitioning of alignments such that the number of subsets in the partition is not too large and every alignment in one subset is worse, with high probability, than an alignment considered by our algorithm. Similar techniques may be of interest in the average-case analysis of other problems commonly solved via dynamic programming.
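The "restrict the DP to entries close to the approximate alignment" step amounts to a banded edit-distance computation. A minimal sketch, assuming a fixed band around the main diagonal rather than the band around a computed alignment that the paper uses:

```python
def banded_edit_distance(s1: str, s2: str, band: int) -> float:
    """Edit-distance DP restricted to cells with |i - j| <= band.
    Exact whenever the true edit distance is at most `band`; runs in
    O(len(s1) * band) time instead of O(len(s1) * len(s2))."""
    n, m = len(s1), len(s2)
    INF = float("inf")
    prev = [j if j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= band:
            cur[0] = i
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cur[j] = min(prev[j] + 1,                             # delete s1[i-1]
                         cur[j - 1] + 1,                          # insert s2[j-1]
                         prev[j - 1] + (s1[i - 1] != s2[j - 1]))  # match/substitute
        prev = cur
    return prev[m]
```

For example, `banded_edit_distance("kitten", "sitting", 3)` returns 3, matching the unrestricted DP, because the optimal alignment never strays more than 3 cells from the diagonal.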
Title: Linear Time Construction of Indexable Founder Block Graphs
Authors: V. Mäkinen, Bastien Cazaux, Massimo Equi, T. Norri, Alexandru I. Tomescu
Pub Date: 2020-05-19 | DOI: 10.4230/LIPIcs.WABI.2020.7 | Workshop on Algorithms in Bioinformatics

We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear-time dynamic programming algorithms have previously been devised to optimize segmentations that induce founder blocks, which can then be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. We give a linear-time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al., SPIRE 2019) and the fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. Experiments on an MSA of SARS-CoV-2 strains are reported. An MSA of size $410 \times 29811$ is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of the node labels are 12 and 34968, respectively. The index on the graph takes only 3% of the size of the MSA.
Title: Weighted Minimum-Length Rearrangement Scenarios
Authors: Pijus Simonaitis, A. Chateau, K. M. Swenson
Pub Date: 2019-09-08 | DOI: 10.4230/LIPIcs.WABI.2019.13 | Workshop on Algorithms in Bioinformatics

We present the first known model of genome rearrangement with an arbitrary real-valued weight function on the rearrangements. It is based on the dominant model for the mathematical and algorithmic study of genome rearrangement, Double Cut and Join (DCJ). Our objective function is the sum or product of the weights of the DCJs in an evolutionary scenario, and the function can be minimized or maximized. If, for example, the likelihood of observing an independent DCJ were estimated based on biological conditions, then this objective function could be the likelihood of observing the independent DCJs together in a scenario. We present an $O(n^4)$-time dynamic programming algorithm solving the Minimum Cost Parsimonious Scenario (MCPS) problem for co-tailed genomes with n genes (or syntenic blocks). Combining this with our previous work on MCPS yields a polynomial-time algorithm for general genomes. The key theoretical contribution is a novel link between the parsimonious DCJ (or 2-break) scenarios and quadrangulations of a regular polygon. To demonstrate that our algorithm is fast enough to treat biological data, we run it on syntenic blocks constructed for Human paired with Chimpanzee, Gibbon, Mouse, and Chicken. We argue that the Human and Gibbon pair is a particularly interesting model for the study of weighted genome rearrangements.
Title: Read Mapping on Genome Variation Graphs
Authors: N. Vaddadi, Rajgopal Srinivasan, N. Sivadasan
Pub Date: 2019-09-06 | DOI: 10.4230/LIPIcs.WABI.2019.7 | Workshop on Algorithms in Bioinformatics

(No abstract available.)
Title: Bounded-Length Smith-Waterman Alignment
Authors: A. Tiskin
Pub Date: 2019-09-06 | DOI: 10.4230/LIPIcs.WABI.2019.16 | Workshop on Algorithms in Bioinformatics

Given a fixed alignment scoring scheme, the bounded length (respectively, bounded total length) Smith–Waterman alignment problem on a pair of strings of lengths m and n asks for the maximum alignment score across all substring pairs such that the first substring's length (respectively, the sum of the two substrings' lengths) is above a given threshold w. The latter problem was introduced by Arslan and Eğecioğlu under the name "local alignment with length threshold". They proposed a dynamic programming algorithm solving the problem in time $O(mn^2)$, and also an approximation algorithm running in time $O(rmn)$, where r is a parameter controlling the accuracy of approximation. We show that both these problems can be solved exactly in time $O(mn)$, assuming a rational scoring scheme; furthermore, this solution can be used to obtain an exact algorithm for the normalised bounded total length Smith–Waterman alignment problem, running in time $O(mn \log n)$. Our algorithms rely on the techniques of fast window-substring alignment and implicit unit-Monge matrix searching, developed previously by the author and others.

2012 ACM Subject Classification: Theory of computation → Pattern matching; Theory of computation → Divide and conquer; Theory of computation → Dynamic programming; Applied computing → Molecular sequence analysis; Applied computing → Bioinformatics
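For context, the textbook Smith-Waterman dynamic program that these bounded-length variants build on can be sketched as follows; the score parameters are illustrative, and this is the unconstrained $O(mn)$ baseline, not the paper's length-thresholded algorithms:

```python
def smith_waterman(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Textbook O(len(a) * len(b)) Smith-Waterman: best local alignment
    score over all substring pairs, with linear gap penalties."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                 # start a fresh local alignment
                         prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # gap in b
                         cur[j - 1] + gap)  # gap in a
            best = max(best, cur[j])
        prev = cur
    return prev and best
```

The length-thresholded problems differ precisely in that the maximum above may only be taken over substring pairs meeting the length bound, which is what breaks the plain recurrence and motivates the heavier machinery.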
Title: TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees
Authors: Sarah A. Christensen, Erin K. Molloy, P. Vachaspati, T. Warnow
Pub Date: 2019-09-01 | DOI: 10.4230/LIPIcs.WABI.2019.4 | Workshop on Algorithms in Bioinformatics

Gene tree correction aims to improve the accuracy of a gene tree by using computational techniques along with a reference tree (and, in some cases, available sequence data). It is an active area of research for gene tree heterogeneity due to duplication and loss (GDL). Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to incomplete lineage sorting (ILS, a common problem in eukaryotic phylogenetics) and horizontal gene transfer (HGT, a common problem in bacterial phylogenetics). We introduce TRACTION, a simple polynomial-time method that provably finds an optimal solution to the RF-Optimal Tree Refinement and Completion Problem, which seeks a refinement and completion of an input tree t with respect to a given binary tree T so as to minimize the Robinson-Foulds (RF) distance. We present the results of an extensive simulation study evaluating TRACTION within gene tree correction pipelines on 68,000 estimated gene trees, using estimated species trees as reference trees. We explore accuracy under conditions with varying levels of gene tree heterogeneity due to ILS and HGT. We show that TRACTION matches or improves the accuracy of well-established methods from the GDL literature under conditions with HGT and ILS, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. TRACTION is available at https://github.com/pranjalv123/TRACTION-RF and the study datasets are available at https://doi.org/10.13012/B2IDB-1747658_V1.
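Since TRACTION optimizes the Robinson-Foulds distance, a minimal sketch of RF as a symmetric difference of clade sets may be useful; this rooted-clade formulation is a simplification (the standard definition, and the paper's setting, uses bipartitions of unrooted trees):

```python
def clades(tree):
    """Non-trivial clades (leaf sets below internal nodes) of a rooted
    tree written as nested tuples of leaf labels."""
    out = set()
    def walk(node):
        if isinstance(node, tuple):
            below = frozenset().union(*(walk(child) for child in node))
            out.add(below)
            return below
        return frozenset([node])            # a leaf
    out.discard(walk(tree))                 # drop the full leaf set, shared by all trees
    return out

def rf_distance(t1, t2) -> int:
    """Robinson-Foulds distance as the size of the symmetric difference
    of the two trees' clade sets."""
    return len(clades(t1) ^ clades(t2))
```

For example, `rf_distance((("a","b"),("c","d")), (("a","c"),("b","d")))` is 4, since neither tree's two internal clades appear in the other.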
Title: Quantified Uncertainty of Flexible Protein-Protein Docking Algorithms
Authors: Nathan L. Clement
Pub Date: 2019-06-24 | DOI: 10.4230/LIPIcs.WABI.2019.3 | Workshop on Algorithms in Bioinformatics

The strength or weakness of an algorithm is ultimately governed by the confidence of its result. When the domain of the problem is large (e.g. traversal of a high-dimensional space), a perfect solution cannot be obtained, so approximations must be made. These approximations often lead to a reported quantity of interest (QOI) which varies between runs, decreasing the confidence of any single run. When the algorithm further computes this final QOI from uncertain or noisy data, the variability (or lack of confidence) of the final QOI increases. Unbounded, these two sources of uncertainty (algorithmic approximations and uncertainty in input data) can result in a reported statistic that has low correlation with ground truth.

In biological applications this is especially relevant, as the search space is generally approximated at least to some degree (e.g. a high percentage of protein structures are invalid or energetically unfavorable), and the explicit conversion from continuous to discrete space for protein representation implies some uncertainty in the input data. This research applies uncertainty quantification techniques to the difficult protein-protein docking problem, first showing the variability that exists in existing software, and then providing a method for computing probabilistic certificates in the form of Chernoff-like bounds. Finally, this paper leverages these probabilistic certificates to accurately bound the uncertainty in docking from two docking algorithms, providing a QOI that is both robust and statistically meaningful.
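The flavour of such probabilistic certificates can be illustrated with a generic Hoeffding bound over repeated runs; this is a stand-in for, not a reproduction of, the paper's Chernoff-like bounds (the function name, parameters, and normalisation are ours):

```python
import math
import random

def hoeffding_certificate(samples, delta=0.05, value_range=1.0):
    """Return (sample_mean, eps) such that, by Hoeffding's inequality,
    the true mean lies within +/- eps of sample_mean with probability
    >= 1 - delta, for i.i.d. samples bounded in an interval of width
    `value_range`."""
    n = len(samples)
    mean = sum(samples) / n
    eps = value_range * math.sqrt(math.log(2 / delta) / (2 * n))
    return mean, eps

# e.g. the QOI from 1000 stochastic docking runs, normalised to [0, 1]
random.seed(0)
scores = [random.random() for _ in range(1000)]
mean, eps = hoeffding_certificate(scores)  # a (1 - delta) confidence radius
```

With 1000 runs and delta = 0.05, the certificate radius eps is about 0.043: shrinking it requires quadratically more runs, which is why tighter, distribution-aware Chernoff-style bounds are attractive.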