Computational systems bioinformatics. Computational Systems Bioinformatics Conference: latest articles

Method for effective virtual screening and scaffold-hopping in chemical compounds.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2007-01-01

Nikil Wale, George Karypis, Ian A Watson

Methods that can screen large databases to retrieve a structurally diverse set of compounds with desirable bioactivity properties are critical in the drug discovery and development process. This paper presents a set of such methods, which are designed to find compounds that are structurally different from a given query compound while retaining its bioactivity properties (scaffold hops). These methods use various indirect ways of measuring the similarity between the query and a compound, taking into account information beyond their structure-based similarities. Two sets of techniques are presented that capture these indirect similarities, one based on automatic relevance feedback and one based on analyzing the similarity network formed by the query and the database compounds. Experimental evaluation shows that many of these methods substantially outperform previously developed approaches, both in identifying structurally diverse active compounds and in identifying active compounds in general.
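The idea of automatic relevance feedback can be illustrated with a minimal sketch. It is not the authors' method: it assumes fingerprints are plain sets of feature ids, uses Tanimoto similarity, and treats the top `k` hits of a first ranking as extra queries. The names `tanimoto`, `rank_with_feedback` and the toy compounds are hypothetical.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprint sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def rank_with_feedback(query, database, k=1):
    """Rank compounds, letting the top-k first-round hits act as extra queries.

    database maps a compound name to its fingerprint (a set of feature ids).
    This is an illustrative pseudo-relevance-feedback scheme, assumed here,
    not the scoring used in the paper.
    """
    # Round 1: direct structural similarity to the query.
    first = sorted(database, key=lambda n: tanimoto(query, database[n]), reverse=True)
    # Round 2: score each compound by its best similarity to the query or a top hit.
    seeds = [query] + [database[n] for n in first[:k]]

    def score(name):
        return max(tanimoto(seed, database[name]) for seed in seeds)

    return sorted(database, key=score, reverse=True)


query = {1, 2, 3, 4}
database = {
    "far": {9, 10},         # shares nothing with the query or its neighbours
    "near": {1, 2, 3, 5},   # structurally close to the query
    "hop": {5, 6, 7, 8},    # unlike the query, but related to "near"
}
ranking = rank_with_feedback(query, database, k=1)
```

In this toy data, "hop" has zero direct similarity to the query and is only pulled ahead of "far" through its overlap with the first-round hit "near", which is the kind of indirect similarity the abstract describes.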

Citations: 0

Modeling species-genes data for efficient phylogenetic inference.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2007-01-01

Wenyuan Li, Ying Liu

In recent years, biclique methods have been proposed to construct phylogenetic trees. One of the key steps of these methods is to find complete sub-matrices (without missing entries) in a species-genes data matrix. To enumerate all complete sub-matrices, (17) described an exact algorithm whose running time is exponential. Furthermore, it generates a large number of complete sub-matrices, many of which may not be used for tree reconstruction. Further investigating and understanding the characteristics of species-genes data may help in discovering complete sub-matrices. Therefore, in this paper, we focus on quantitatively studying the characteristics of species-genes data, which can be used to guide new algorithm design for efficient phylogenetic inference. A mathematical model is constructed to simulate real species-genes data. The results indicate that the sequence-availability probability distributions follow a power law, which leads to the skewness and sparseness of real species-genes data. Moreover, a special structure, called the "ladder structure", is discovered in real species-genes data. This ladder structure is used to identify complete sub-matrices and, more importantly, to reveal overlapping relationships among complete sub-matrices. To discover the distinct ladder structure in real species-genes data, we propose an efficient evolutionary dynamical system, called "generalized replicator dynamics". Two species-genes data sets from green plants are used to illustrate the effectiveness of our model. Empirical study shows that our model is effective and efficient in understanding species-genes data for phylogenetic inference.

Pages : 429-40

Citations: 0

Rule-based human gene normalization in biomedical text with confidence estimation.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2007-01-01

William W Lau, Calvin A Johnson, Kevin G Becker

The ability to identify gene mentions in text and normalize them to the proper unique identifiers is crucial for "down-stream" text mining applications in bioinformatics. We have developed a rule-based algorithm that divides the normalization task into two steps. The first step includes pattern matching for gene symbols and an approximate term searching technique for gene names. Next, the algorithm measures several features based on morphological, statistical, and contextual information to estimate the level of confidence that the correct identifier is selected for a potential mention. Uniqueness, inverse distance, and coverage are three novel features we quantified. The algorithm was evaluated against the BioCreAtIvE datasets. The feature weights were tuned by the Nelder-Mead simplex method. An F-score of 0.7622 and an AUC (area under the recall-precision curve) of 0.7461 were achieved on the test data using the set of weights optimized on the training data.

Citations: 0

Prediction of transcription start sites based on feature selection using AMOSA.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2007-01-01

Xi Wang, Sanghamitra Bandyopadhyay, Zhenyu Xuan, Xiaoyue Zhao, Michael Q Zhang, Xuegong Zhang

To understand the regulation of gene expression, the identification of transcription start sites (TSSs) is a primary and important step. With the aim of improving computational prediction accuracy, we focus on the most challenging task, i.e., identifying TSSs within 50 bp in non-CpG related promoter regions. Due to the diversity of non-CpG related promoters, a large number of features are extracted. Effective feature selection can minimize the noise, improve the prediction accuracy, and reveal biologically meaningful intrinsic properties. In this paper, a newly proposed multi-objective simulated annealing based optimization method, Archive Multi-Objective Simulated Annealing (AMOSA), is integrated with Linear Discriminant Analysis (LDA) to yield a combined feature selection and classification system. This system is found to be comparable to, and often better than, several existing methods in terms of different quantitative performance measures.

Citations: 0

Mining molecular contexts of cancer via in-silico conditioning.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0020

Seungchan Kim, Ina Sen, M. Bittner

A cell maintains its specific status by tightly regulating a set of genes through various regulatory mechanisms. If aberrations force the cell to adjust its regulatory machinery away from the normal state, so as to reliably provide proliferative signals and abrogate normal safeguards, it must achieve a new regulatory state different from the normal one. Due to this tightly coordinated regulation, the expression of genes should show consistent patterns within a cellular context, for example a subtype of tumor, while the behaviour of those genes outside the context should be less consistent. Based on this hypothesis, we propose a method to identify genes whose expression pattern is significantly more consistent within a specific biological context, and also provide an algorithm to identify novel cellular contexts. The method was applied to previously published data sets to find possible novel biological contexts in conjunction with available clinical or drug sensitivity data. The software is currently written in Java and is available upon request from the corresponding author.

Citations: 14

Efficient algorithms for genome-wide tagSNP selection across populations via the linkage disequilibrium criterion.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0011

Lan Liu, Yonghui Wu, S. Lonardi, Tao Jiang

In this paper, we study the tagSNP selection problem on multiple populations using the pairwise r² linkage disequilibrium criterion. We propose a novel combinatorial optimization model for the tagSNP selection problem, called the minimum common tagSNP selection (MCTS) problem, and present efficient solutions for MCTS. Our approach consists of three main steps: (i) partitioning the SNP markers into small disjoint components, (ii) applying some data reduction rules to simplify the problem, and (iii) applying either a fast greedy algorithm or a Lagrangian relaxation algorithm to solve the remaining (general) MCTS. These algorithms also provide lower bounds on tagging (i.e. the minimum number of tagSNPs needed). The lower bounds allow us to evaluate how far our solution is from the optimum. To the best of our knowledge, this is the first time that tagging lower bounds are discussed in the literature. We assess the performance of our algorithms on real HapMap data for genome-wide tagging. The experiments demonstrate that our algorithms run 3 to 4 orders of magnitude faster than existing single-population tagging programs such as FESTA and LD-Select, and than the multiple-population tagging method MultiPop-TagSelect. Our method also greatly reduces the number of required tagSNPs compared to LD-Select on a single population and MultiPop-TagSelect on multiple populations. Moreover, the numbers of tagSNPs selected by our algorithms are almost optimal, since they are very close to the corresponding lower bounds obtained by our method.
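The greedy option in step (iii) can be pictured as a set-cover heuristic. The sketch below is an assumption-laden illustration, not the authors' implementation: it skips the partitioning and data reduction steps, assumes each population is given as a dictionary of pairwise r² values, and assumes a SNP tags another in a population when their r² reaches a threshold (0.8 is a hypothetical default). The name `greedy_common_tags` is made up for this example.

```python
from typing import Dict, List, Set, Tuple

R2 = Dict[Tuple[int, int], float]  # r^2 keyed by unordered SNP pair (i < j)


def greedy_common_tags(n_snps: int, populations: List[R2], threshold: float = 0.8) -> List[int]:
    """Pick SNPs greedily until every (population, SNP) pair is tagged."""

    def covers(i: int) -> Set[Tuple[int, int]]:
        # SNP i tags itself everywhere, and SNP j in population p if r^2 >= threshold.
        out = set()
        for p, r2 in enumerate(populations):
            for j in range(n_snps):
                if i == j or r2.get((min(i, j), max(i, j)), 0.0) >= threshold:
                    out.add((p, j))
        return out

    cover = {i: covers(i) for i in range(n_snps)}
    uncovered = {(p, j) for p in range(len(populations)) for j in range(n_snps)}
    tags: List[int] = []
    while uncovered:
        # Largest number of newly tagged pairs wins; ties go to the smaller index.
        best = max(range(n_snps), key=lambda i: (len(cover[i] & uncovered), -i))
        tags.append(best)
        uncovered -= cover[best]
    return tags


pop_a = {(0, 1): 0.90, (2, 3): 0.95}
pop_b = {(0, 1): 0.85, (2, 3): 0.50}  # SNPs 2 and 3 are weakly linked here
tags = greedy_common_tags(4, [pop_a, pop_b])
```

In the toy input, SNPs 2 and 3 tag each other in the first population but not in the second, so both must be selected; this is the kind of cross-population constraint that distinguishes common tagging from tagging each population separately.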

Pages : 67-78

Citations: 7

Algorithm for peptide sequencing by tandem mass spectrometry based on better preprocessing and anti-symmetric computational model.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0007

K. Ning, H. Leong

Peptide sequencing by tandem mass spectrometry is a very important, interesting, yet challenging problem in proteomics. This problem has been extensively investigated by researchers recently, and peptide sequencing results are becoming more and more accurate. However, many of these algorithms use computational models based on unverified assumptions. We believe that investigating the validity of these assumptions and related problems will lead to improvements in current algorithms. In this paper, we first investigate peptide sequencing without preprocessing the spectrum, and we show that by introducing preprocessing of the spectrum, peptide sequencing can be faster, easier and more accurate. We then investigate one very important problem, the anti-symmetric problem in peptide sequencing, and we show by experiments that models that simply ignore anti-symmetry, or models that remove all anti-symmetric instances, are too simple for the peptide sequencing problem. We propose a new, more realistic model for the anti-symmetric problem. We also propose a novel algorithm that incorporates preprocessing and the new model for the anti-symmetric issue, and experiments show that this algorithm has better performance on the datasets examined.

Pages : 19-30

Citations: 10

Effective labeling of molecular surface points for cavity detection and location of putative binding sites.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0028

M. Bock, C. Garutti, C. Guerra

We present a method for detecting and comparing cavities on protein surfaces that is useful for protein binding site recognition. The method is based on a representation of the protein structures by a collection of spin-images and their associated spin-image profiles. Results of the cavity detection procedure are presented for a large set of non-redundant proteins and compared with SURFNET-ConSurf. Our comparison method is used to find a surface region in one cavity of a protein that is geometrically similar to a surface region in the cavity of another protein. Such a finding would be an indication that the two regions likely bind to the same ligand. Our overall approach for cavity detection and comparison is benchmarked on several pairs of known complexes, obtaining a good coverage of the atoms of the binding sites.

Citations: 18

Enhanced partial order curve comparison over multiple protein folding trajectories.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0031

Hong Sun, H. Ferhatosmanoğlu, M. Ota, Yusu Wang

Understanding how proteins fold is essential to our quest to discover how life works at the molecular level. Current computation power enables researchers to produce a huge amount of folding simulation data. Hence there is a pressing need to be able to interpret these data and identify novel folding features from them. In this paper, we model each folding trajectory as a multi-dimensional curve. We then develop an effective multiple curve comparison (MCC) algorithm, called the enhanced partial order (EPO) algorithm, to extract features from a set of diverse folding trajectories, including both successful and unsuccessful simulation runs. Our EPO algorithm addresses several new challenges that arise when comparing high dimensional curves coming from folding trajectories. A detailed case study of applying our algorithm to a miniprotein Trp-cage(24) demonstrates that our algorithm can detect similarities at a rather low level, and extract biologically meaningful folding events.

Citations: 3

Reconciliation with non-binary species trees.

Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0044

B Vernot, M Stolzer, A Goldman, D Durand

Reconciliation is the process of resolving disagreement between gene and species trees by invoking gene duplications and losses to explain topological incongruence. The resulting inferred duplication histories are a valuable source of information for a broad range of biological applications, including ortholog identification, estimating gene duplication times, and rooting and correcting gene trees. Reconciliation for binary trees is a tractable and well studied problem. However, a striking proportion of species trees are non-binary. For example, 64% of branch points in the NCBI taxonomy have three or more children. When applied to non-binary species trees, current algorithms overestimate the number of duplications because they cannot distinguish between duplication and deep coalescence. We present the first formal algorithm for reconciling binary gene trees with non-binary species trees under a duplication-loss parsimony model. Using a space efficient mapping from gene to species tree, our algorithm infers the minimum number of duplications and losses in O(|V(G)| · (k(S) + h(S))) time, where |V(G)| is the number of nodes in the gene tree, h(S) is the height of the species tree and k(S) is the width of its largest multifurcation. We also present a dynamic programming algorithm for a combined loss model, in which losses in sibling species may be represented as a single loss in the common ancestor. Our algorithms have been implemented in NOTUNG, a robust, production quality tree-fitting program, which provides a graphical user interface for exploratory analysis and also supports automated, high-throughput analysis of large data sets.
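For orientation, the well-studied binary case mentioned in the abstract can be sketched with the standard lowest-common-ancestor (LCA) mapping: each gene-tree node is mapped to the LCA of the species below it, and a node is a duplication when it maps to the same species node as one of its children. This sketch is not the paper's non-binary algorithm and does not count losses; the tree encodings (nested 2-tuples for the gene tree, a child-to-parent dictionary for the species tree) and all function names are assumptions made for illustration.

```python
def ancestors(node, parent):
    """Path from a species node up to the root, inclusive."""
    path = [node]
    while parent[node] is not None:
        node = parent[node]
        path.append(node)
    return path


def lca(a, b, parent):
    """Lowest common ancestor of two species-tree nodes."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    raise ValueError("nodes are not in the same tree")


def count_duplications(gene_tree, parent, species_of):
    """Return (species node the gene-tree root maps to, number of duplications).

    gene_tree: nested 2-tuples whose leaves are gene names (strings).
    parent: species tree as child -> parent, with the root mapped to None.
    species_of: gene name -> species name.
    """
    dups = 0

    def visit(node):
        nonlocal dups
        if isinstance(node, str):          # leaf: map the gene to its species
            return species_of[node]
        left, right = (visit(child) for child in node)
        here = lca(left, right, parent)
        if here in (left, right):          # same mapping as a child: duplication
            dups += 1
        return here

    return visit(gene_tree), dups


# Species tree ((A, B), C) and a gene tree with two copies of the gene in A.
parent = {"A": "AB", "B": "AB", "AB": "ABC", "C": "ABC", "ABC": None}
species_of = {"a1": "A", "b1": "B", "a2": "A", "c1": "C"}
gene_tree = (("a1", "b1"), ("a2", "c1"))
root_species, n_dups = count_duplications(gene_tree, parent, species_of)
```

Here the gene-tree root and its right child both map to the species root, so the root is inferred to be a duplication. With a non-binary species tree this rule over-counts, which is the problem the abstract addresses.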

Pages : 441-52

Citations: 73
