Title: Suffix sorting via matching statistics
Authors: Zsuzsanna Lipták, Francesco Masillo, S. Puglisi
Pub Date: 2022-07-03 | DOI: 10.48550/arXiv.2207.00972 | Workshop on Algorithms in Bioinformatics

We introduce a new algorithm for constructing the generalized suffix array of a collection of highly similar strings. As a first step, we construct a compressed representation of the matching statistics of the collection with respect to a reference string. We then use this data structure to distribute suffixes into a partial order, and subsequently to speed up suffix comparisons to complete the generalized suffix array. Our experiments with a prototype implementation (a tool we call sacamats) show that on collections of highly similar strings we can construct the suffix array in time competitive with, or faster than, the fastest available methods. Along the way, we describe a heuristic for fast computation of the matching statistics of two strings, which may be of independent interest.
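To make the first step concrete, here is a minimal brute-force sketch of matching statistics; this is an illustration only, not the paper's compressed, index-based computation (the function name and the quadratic substring search are ours):

```python
def matching_statistics(s: str, ref: str) -> list[int]:
    """ms[i] = length of the longest prefix of s[i:] that occurs as a
    substring of ref.  Uses the fact that ms[i+1] >= ms[i] - 1, so each
    search resumes one character short of the previous match."""
    ms: list[int] = []
    for i in range(len(s)):
        k = max(ms[-1] - 1, 0) if ms else 0   # guaranteed still to match
        while i + k < len(s) and s[i:i + k + 1] in ref:
            k += 1
        ms.append(k)
    return ms
```

For example, `matching_statistics("banana", "ban")` yields `[3, 2, 1, 2, 1, 1]`: the prefix "ban" of "banana" occurs in the reference, "ana" does not but "an" does, and so on.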
Title: Prefix-free parsing for building large tunnelled Wheeler graphs
Authors: Adrián Goga, Andrej Baláz
Pub Date: 2022-06-30 | DOI: 10.4230/LIPIcs.WABI.2022.18 | Workshop on Algorithms in Bioinformatics

We propose a new technique for creating a space-efficient index for large repetitive text collections, such as pangenomic databases containing sequences of many individuals from the same species. We combine two recent techniques from this area: Wheeler graphs (Gagie et al., 2017) and prefix-free parsing (PFP, Boucher et al., 2019). Wheeler graphs (WGs) are a general framework encompassing several indexes based on the Burrows-Wheeler transform (BWT), such as the FM-index. Wheeler graphs admit a succinct representation which can be further compacted by employing the idea of tunnelling, which exploits redundancies in the form of parallel, equally-labelled paths, called blocks, that can be merged into a single path. The problem of finding the optimal set of blocks for tunnelling, i.e. the one that minimizes the size of the resulting WG, is known to be NP-complete and remains the most computationally challenging part of the tunnelling process. To find an adequate set of blocks in less time, we propose a new method based on prefix-free parsing (PFP). The idea of PFP is to divide the input text into phrases of roughly equal size that overlap by a fixed number of characters. The original text is represented by a sequence of phrase ranks (the parse) and a list of all used phrases (the dictionary). In repetitive texts, the PFP of the text is generally much shorter than the original. To speed up the block selection for tunnelling, we apply PFP to obtain the parse and the dictionary of the text, tunnel the WG of the parse using existing heuristics, and subsequently use this tunnelled parse to construct a compact WG of the original text. Compared with constructing a WG from the original text without PFP, our method is much faster and uses less memory on collections of pangenomic sequences. Therefore, our method enables the use of WGs as a pangenomic reference for real-world datasets.
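The phrase-splitting idea behind PFP can be sketched as follows. This is a simplified stand-in, not the authors' implementation: real PFP uses a rolling Karp-Rabin hash over a w-character window, whereas here a toy deterministic hash and illustrative parameters `w` and `p` are used.

```python
def prefix_free_parse(text: str, w: int = 2, p: int = 3):
    """Split `text` into overlapping phrases: a phrase ends wherever the
    hash of the current w-character window is 0 mod p, and consecutive
    phrases share those w characters.  Returns (dictionary, parse)."""
    # Deterministic toy stand-in for the Karp-Rabin rolling hash of real PFP.
    h = lambda win: sum(ord(c) * 31 ** k for k, c in enumerate(win))
    text += "$" * w                         # sentinel padding (one common convention)
    cuts = [0]
    for i in range(1, len(text) - w):
        if h(text[i:i + w]) % p == 0:
            cuts.append(i)
    cuts.append(len(text) - w)
    phrases = [text[a:b + w] for a, b in zip(cuts, cuts[1:])]
    dictionary = sorted(set(phrases))       # list of distinct phrases
    parse = [dictionary.index(ph) for ph in phrases]  # phrase ranks
    return dictionary, parse
```

Stitching the phrases back together, dropping the w-character overlap after the first phrase, recovers the padded text; that round-trip is the key invariant of the parse/dictionary representation.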
Title: Phyolin: Identifying a Linear Perfect Phylogeny in Single-Cell DNA Sequencing Data of Tumors
Authors: Leah L. Weber, M. El-Kebir
Pub Date: 2020-08-01 | DOI: 10.4230/LIPIcs.WABI.2020.5 | Workshop on Algorithms in Bioinformatics

Cancer arises from an evolutionary process where somatic mutations occur and eventually give rise to clonal expansions. Modeling this evolutionary process as a phylogeny is useful for treatment decision-making as well as for understanding evolutionary patterns across patients and cancer types. However, cancer phylogeny inference from single-cell DNA sequencing data of tumors is challenging due to limitations of sequencing technology and the complexity of the resulting problem. Therefore, as a first step, value may be obtained simply by correctly classifying the evolutionary process as either linear or branched. The biological implications of these two high-level patterns are different, and understanding which cancer types and which patients follow each of these trajectories could provide useful insight for both clinicians and researchers. Here, we introduce the Linear Perfect Phylogeny Flipping Problem as a means of testing a null model that the tree topology is linear, and show that it is NP-hard. We develop Phyolin and, through both in silico experiments and application to real data, show that it is an accurate, easy-to-use, and reasonably fast method for classifying an evolutionary trajectory as linear or branched.

2012 ACM Subject Classification: Applied computing → Molecular evolution
Title: Near-Linear Time Edit Distance for Indel Channels
Authors: Arun Ganesh, Aaron Sy
Pub Date: 2020-07-06 | DOI: 10.4230/LIPIcs.WABI.2020.17 | Workshop on Algorithms in Bioinformatics

We consider the following model for sampling pairs of strings: $s_1$ is a uniformly random bitstring of length $n$, and $s_2$ is the bitstring arrived at by applying substitutions, insertions, and deletions to each bit of $s_1$ with some probability. We show that the edit distance between $s_1$ and $s_2$ can be computed in $O(n \ln n)$ time with high probability, as long as each bit of $s_1$ has a mutation applied to it with probability at most a small constant. The algorithm is simple and uses only the textbook dynamic programming algorithm as a primitive: it first computes an approximate alignment between the two strings, and then runs the dynamic programming algorithm restricted to entries close to the approximate alignment. The analysis of our algorithm provides theoretical justification for alignment heuristics used in practice, such as BLAST, FASTA, and MAFFT, which also start by computing approximate alignments quickly and then find the best alignment near the approximate alignment. Our main technical contribution is a partitioning of alignments such that the number of subsets in the partition is not too large and every alignment in one subset is worse, with high probability, than an alignment considered by our algorithm. Similar techniques may be of interest in the average-case analysis of other problems commonly solved via dynamic programming.
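The "restrict the DP to entries close to the approximate alignment" step amounts to a banded edit-distance computation. A minimal sketch, assuming a fixed band around the main diagonal rather than the band around a computed alignment that the paper uses:

```python
def banded_edit_distance(s1: str, s2: str, band: int) -> float:
    """Edit-distance DP restricted to cells with |i - j| <= band.
    Exact whenever the true edit distance is at most `band`; runs in
    O(len(s1) * band) time instead of O(len(s1) * len(s2))."""
    n, m = len(s1), len(s2)
    INF = float("inf")
    prev = [j if j <= band else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        cur = [INF] * (m + 1)
        if i <= band:
            cur[0] = i
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cur[j] = min(prev[j] + 1,                             # delete s1[i-1]
                         cur[j - 1] + 1,                          # insert s2[j-1]
                         prev[j - 1] + (s1[i - 1] != s2[j - 1]))  # match/substitute
        prev = cur
    return prev[m]
```

For example, `banded_edit_distance("kitten", "sitting", 3)` returns 3, matching the unrestricted DP, because the optimal alignment never strays more than 3 cells from the diagonal.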
Title: Linear Time Construction of Indexable Founder Block Graphs
Authors: V. Mäkinen, Bastien Cazaux, Massimo Equi, T. Norri, Alexandru I. Tomescu
Pub Date: 2020-05-19 | DOI: 10.4230/LIPIcs.WABI.2020.7 | Workshop on Algorithms in Bioinformatics

We introduce a compact pangenome representation based on an optimal segmentation concept that aims to reconstruct founder sequences from a multiple sequence alignment (MSA). Such founder sequences have the feature that each row of the MSA is a recombination of the founders. Several linear-time dynamic programming algorithms have previously been devised to optimize segmentations that induce founder blocks, which can then be concatenated into a set of founder sequences. All possible concatenation orders can be expressed as a founder block graph. We observe a key property of such graphs: if the node labels (founder segments) do not repeat in the paths of the graph, such graphs can be indexed for efficient string matching. We call such graphs segment repeat-free founder block graphs. We give a linear-time algorithm to construct a segment repeat-free founder block graph given an MSA. The algorithm combines techniques from the founder segmentation algorithms (Cazaux et al., SPIRE 2019) and the fully-functional bidirectional Burrows-Wheeler index (Belazzougui and Cunial, CPM 2019). We derive a succinct index structure to support queries of arbitrary length in the paths of the graph. Experiments on an MSA of SARS-CoV-2 strains are reported. An MSA of size $410 \times 29811$ is compacted in one minute into a segment repeat-free founder block graph of 3900 nodes and 4440 edges. The maximum length and total length of the node labels are 12 and 34968, respectively. The index on the graph takes only 3% of the size of the MSA.
Title: Weighted Minimum-Length Rearrangement Scenarios
Authors: Pijus Simonaitis, A. Chateau, K. M. Swenson
Pub Date: 2019-09-08 | DOI: 10.4230/LIPIcs.WABI.2019.13 | Workshop on Algorithms in Bioinformatics

We present the first known model of genome rearrangement with an arbitrary real-valued weight function on the rearrangements. It is based on the dominant model for the mathematical and algorithmic study of genome rearrangement, Double Cut and Join (DCJ). Our objective function is the sum or product of the weights of the DCJs in an evolutionary scenario, and the function can be minimized or maximized. If, for example, the likelihood of observing an independent DCJ were estimated based on biological conditions, then this objective function could be the likelihood of observing the independent DCJs together in a scenario. We present an $O(n^4)$-time dynamic programming algorithm solving the Minimum Cost Parsimonious Scenario (MCPS) problem for co-tailed genomes with n genes (or syntenic blocks). Combining this with our previous work on MCPS yields a polynomial-time algorithm for general genomes. The key theoretical contribution is a novel link between the parsimonious DCJ (or 2-break) scenarios and quadrangulations of a regular polygon. To demonstrate that our algorithm is fast enough to treat biological data, we run it on syntenic blocks constructed for Human paired with Chimpanzee, Gibbon, Mouse, and Chicken. We argue that the Human and Gibbon pair is a particularly interesting model for the study of weighted genome rearrangements.
Title: Read Mapping on Genome Variation Graphs
Authors: N. Vaddadi, Rajgopal Srinivasan, N. Sivadasan
Pub Date: 2019-09-06 | DOI: 10.4230/LIPIcs.WABI.2019.7 | Workshop on Algorithms in Bioinformatics

(No abstract available.)
Title: Bounded-Length Smith-Waterman Alignment
Authors: A. Tiskin
Pub Date: 2019-09-06 | DOI: 10.4230/LIPIcs.WABI.2019.16 | Workshop on Algorithms in Bioinformatics

Given a fixed alignment scoring scheme, the bounded length (respectively, bounded total length) Smith–Waterman alignment problem on a pair of strings of lengths m and n asks for the maximum alignment score across all substring pairs such that the first substring's length (respectively, the sum of the two substrings' lengths) is above a given threshold w. The latter problem was introduced by Arslan and Eğecioğlu under the name "local alignment with length threshold". They proposed a dynamic programming algorithm solving the problem in time $O(mn^2)$, and also an approximation algorithm running in time $O(rmn)$, where r is a parameter controlling the accuracy of approximation. We show that both these problems can be solved exactly in time $O(mn)$, assuming a rational scoring scheme; furthermore, this solution can be used to obtain an exact algorithm for the normalised bounded total length Smith–Waterman alignment problem, running in time $O(mn \log n)$. Our algorithms rely on the techniques of fast window-substring alignment and implicit unit-Monge matrix searching, developed previously by the author and others.

2012 ACM Subject Classification: Theory of computation → Pattern matching; Theory of computation → Divide and conquer; Theory of computation → Dynamic programming; Applied computing → Molecular sequence analysis; Applied computing → Bioinformatics
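For context, the textbook Smith-Waterman dynamic program that these bounded-length variants build on can be sketched as follows; the score parameters are illustrative, and this is the unconstrained $O(mn)$ baseline, not the paper's length-thresholded algorithms:

```python
def smith_waterman(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Textbook O(len(a) * len(b)) Smith-Waterman: best local alignment
    score over all substring pairs, with linear gap penalties."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                 # start a fresh local alignment
                         prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # gap in b
                         cur[j - 1] + gap)  # gap in a
            best = max(best, cur[j])
        prev = cur
    return prev and best
```

The length-thresholded problems differ precisely in that the maximum above may only be taken over substring pairs meeting the length bound, which is what breaks the plain recurrence and motivates the heavier machinery.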
Title: TRACTION: Fast Non-Parametric Improvement of Estimated Gene Trees
Authors: Sarah A. Christensen, Erin K. Molloy, P. Vachaspati, T. Warnow
Pub Date: 2019-09-01 | DOI: 10.4230/LIPIcs.WABI.2019.4 | Workshop on Algorithms in Bioinformatics

Gene tree correction aims to improve the accuracy of a gene tree by using computational techniques along with a reference tree (and, in some cases, available sequence data). It is an active area of research for gene tree heterogeneity due to duplication and loss (GDL). Here, we study the problem of gene tree correction where gene tree heterogeneity is instead due to incomplete lineage sorting (ILS, a common problem in eukaryotic phylogenetics) and horizontal gene transfer (HGT, a common problem in bacterial phylogenetics). We introduce TRACTION, a simple polynomial-time method that provably finds an optimal solution to the RF-Optimal Tree Refinement and Completion Problem, which seeks a refinement and completion of an input tree t with respect to a given binary tree T so as to minimize the Robinson-Foulds (RF) distance. We present the results of an extensive simulation study evaluating TRACTION within gene tree correction pipelines on 68,000 estimated gene trees, using estimated species trees as reference trees. We explore accuracy under conditions with varying levels of gene tree heterogeneity due to ILS and HGT. We show that TRACTION matches or improves the accuracy of well-established methods from the GDL literature under conditions with HGT and ILS, and ties for best under the ILS-only conditions. Furthermore, TRACTION ties for fastest on these datasets. TRACTION is available at https://github.com/pranjalv123/TRACTION-RF and the study datasets are available at https://doi.org/10.13012/B2IDB-1747658_V1.
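Since TRACTION optimizes the Robinson-Foulds distance, a minimal sketch of RF as a symmetric difference of clade sets may be useful; this rooted-clade formulation is a simplification (the standard definition, and the paper's setting, uses bipartitions of unrooted trees):

```python
def clades(tree):
    """Non-trivial clades (leaf sets below internal nodes) of a rooted
    tree written as nested tuples of leaf labels."""
    out = set()
    def walk(node):
        if isinstance(node, tuple):
            below = frozenset().union(*(walk(child) for child in node))
            out.add(below)
            return below
        return frozenset([node])            # a leaf
    out.discard(walk(tree))                 # drop the full leaf set, shared by all trees
    return out

def rf_distance(t1, t2) -> int:
    """Robinson-Foulds distance as the size of the symmetric difference
    of the two trees' clade sets."""
    return len(clades(t1) ^ clades(t2))
```

For example, `rf_distance((("a","b"),("c","d")), (("a","c"),("b","d")))` is 4, since neither tree's two internal clades appear in the other.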
Title: Quantified Uncertainty of Flexible Protein-Protein Docking Algorithms
Authors: Nathan L. Clement
Pub Date: 2019-06-24 | DOI: 10.4230/LIPIcs.WABI.2019.3 | Workshop on Algorithms in Bioinformatics

The strength or weakness of an algorithm is ultimately governed by the confidence of its result. When the domain of the problem is large (e.g. traversal of a high-dimensional space), a perfect solution cannot be obtained, so approximations must be made. These approximations often lead to a reported quantity of interest (QOI) which varies between runs, decreasing the confidence of any single run. When the algorithm further computes this final QOI from uncertain or noisy data, the variability (or lack of confidence) of the final QOI increases. Unbounded, these two sources of uncertainty (algorithmic approximations and uncertainty in input data) can result in a reported statistic that has low correlation with ground truth.

In biological applications this is especially relevant, as the search space is generally approximated at least to some degree (e.g. a high percentage of protein structures are invalid or energetically unfavorable), and the explicit conversion from continuous to discrete space for protein representation implies some uncertainty in the input data. This research applies uncertainty quantification techniques to the difficult protein-protein docking problem, first showing the variability that exists in existing software, and then providing a method for computing probabilistic certificates in the form of Chernoff-like bounds. Finally, this paper leverages these probabilistic certificates to accurately bound the uncertainty in docking from two docking algorithms, providing a QOI that is both robust and statistically meaningful.
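The flavour of such probabilistic certificates can be illustrated with a generic Hoeffding bound over repeated runs; this is a stand-in for, not a reproduction of, the paper's Chernoff-like bounds (the function name, parameters, and normalisation are ours):

```python
import math
import random

def hoeffding_certificate(samples, delta=0.05, value_range=1.0):
    """Return (sample_mean, eps) such that, by Hoeffding's inequality,
    the true mean lies within +/- eps of sample_mean with probability
    >= 1 - delta, for i.i.d. samples bounded in an interval of width
    `value_range`."""
    n = len(samples)
    mean = sum(samples) / n
    eps = value_range * math.sqrt(math.log(2 / delta) / (2 * n))
    return mean, eps

# e.g. the QOI from 1000 stochastic docking runs, normalised to [0, 1]
random.seed(0)
scores = [random.random() for _ in range(1000)]
mean, eps = hoeffding_certificate(scores)  # a (1 - delta) confidence radius
```

With 1000 runs and delta = 0.05, the certificate radius eps is about 0.043: shrinking it requires quadratically more runs, which is why tighter, distribution-aware Chernoff-style bounds are attractive.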