Proceedings. IEEE Computational Systems Bioinformatics Conference: Latest Articles


Consensus genetic maps: a graph theoretic approach.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.26

Benjamin N Jackson, Srinivas Aluru, Patrick S Schnable

A genetic map is an ordering of genetic markers constructed from genetic linkage data for use in linkage studies and experimental design. While traditional methods have focused on constructing maps from a single population study, maps are increasingly generated for multiple lines and populations of the same organism. For example, in crop plants, where the genetic variability is high, researchers have created maps for many populations. In the face of these new data, we address the increasingly important problem of generating a consensus map - an ordering of all markers in the various population studies. In our method, each input map is treated as a partial order on a set of markers. To find the most consistent order shared between maps, we model the partial orders as directed graphs. We create an aggregate by merging the transitive closures of the input graphs and taking the transitive reduction of the result. In this process, cycles may need to be broken to resolve inconsistencies between the inputs. The cycle breaking problem is NP-hard, but the problem size depends upon the scope of the inconsistency between the input graphs, which will be local if the input graphs are from closely related organisms. We present results of running the resulting software on maps generated from seven populations of the crop plant Zea mays.
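The merge step described in this abstract can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: each input map is given as a plain list of marker names in map order (the abstract allows general partial orders), the marker names `m1` to `m4` and all function names are hypothetical, and inconsistencies are only detected, not resolved, since the abstract does not say how cycles are broken.

```python
def transitive_closure(nodes, edges):
    """Return every (a, b) pair such that b is reachable from a."""
    reach = {n: set() for n in nodes}
    for a, b in edges:
        reach[a].add(b)
    # Warshall-style pass: allow each node k in turn as an intermediate stop.
    for k in nodes:
        for a in nodes:
            if k in reach[a]:
                reach[a] |= reach[k]
    return {(a, b) for a in nodes for b in reach[a]}


def transitive_reduction(nodes, closure):
    """For an acyclic closure, drop every edge implied by a two-step path."""
    return {
        (a, b)
        for (a, b) in closure
        if not any((a, m) in closure and (m, b) in closure for m in nodes)
    }


def consensus(maps):
    """Merge the closures of all maps; return (edges, conflicting_markers)."""
    nodes = sorted({marker for order in maps for marker in order})
    merged = set()
    for order in maps:
        merged |= transitive_closure(order, list(zip(order, order[1:])))
    # The union of closures need not be closed itself, so close it again.
    closure = transitive_closure(nodes, merged)
    # A marker that reaches itself lies on a cycle, i.e. the maps disagree.
    conflicts = [n for n in nodes if (n, n) in closure]
    if conflicts:
        return None, conflicts
    return transitive_reduction(nodes, closure), conflicts


if __name__ == "__main__":
    maps = [["m1", "m2", "m4"], ["m1", "m3", "m4"], ["m2", "m3"]]
    print(consensus(maps))
    print(consensus([["a", "b"], ["b", "a"]]))
```

In the first call the three maps agree, and the reduction leaves the single chain m1, m2, m3, m4. In the second call the two maps contradict each other, so both markers are reported as lying on a cycle; this is the situation in which the NP-hard cycle-breaking step would be needed.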



Citations: 20

Islands of tractability for parsimony haplotyping.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.37

Roded Sharan, Bjarni V Halldórsson, Sorin Istrail

We study the parsimony approach to haplotype inference, which calls for finding a set of haplotypes of minimum cardinality that explains an input set of genotypes. We prove that the problem is APX-hard even in very restricted cases. On the positive side, we identify islands of tractability for the problem by focusing on instances with a specific structure of haplotype sharing among the input genotypes. We exploit the structure of those instances to give polynomial and constant-approximation algorithms for the problem. We also show that the general parsimony haplotyping problem is fixed-parameter tractable.
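To make the problem statement concrete, here is a small brute-force sketch in Python. It is not one of the algorithms from the paper: it simply tries haplotype sets of increasing size and therefore takes exponential time. The encoding is an assumption made for the example: a genotype is a string over `0`, `1`, `2`, where `2` marks a heterozygous site, and a haplotype is a string over `0` and `1`.

```python
from itertools import combinations, product


def explains(h1, h2, genotype):
    """True if the haplotype pair (h1, h2) is consistent with the genotype."""
    return all(
        (a == b == site) if site in "01" else (a != b)
        for a, b, site in zip(h1, h2, genotype)
    )


def candidate_haplotypes(genotypes):
    """All haplotypes obtained by resolving each heterozygous site to 0 or 1."""
    candidates = set()
    for g in genotypes:
        options = ["01" if site == "2" else site for site in g]
        candidates.update("".join(choice) for choice in product(*options))
    return sorted(candidates)


def min_haplotype_set(genotypes):
    """Smallest candidate subset in which every genotype has an explaining pair."""
    candidates = candidate_haplotypes(genotypes)
    for size in range(1, len(candidates) + 1):
        for subset in combinations(candidates, size):
            if all(
                any(explains(h1, h2, g) for h1 in subset for h2 in subset)
                for g in genotypes
            ):
                return list(subset)
    return []


if __name__ == "__main__":
    print(min_haplotype_set(["02", "20", "22"]))
```

For the three genotypes in the example, no two haplotypes suffice, and the search returns the three haplotypes 00, 01 and 10: the pair (00, 01) explains `02`, (00, 10) explains `20`, and (01, 10) explains `22`.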


Citations: 0

Choosing SNPs using feature selection.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.22

Tu Minh Phuong, Zhen Lin, Russ B Altman

A major challenge for genome-wide disease association studies is the high cost of genotyping large numbers of single nucleotide polymorphisms (SNPs). The correlations between SNPs, however, make it possible to select a parsimonious set of informative SNPs, known as "tagging" SNPs, able to capture most variation in a population. Considerable research interest has recently focused on the development of methods for finding such SNPs. In this paper, we present an efficient method for finding tagging SNPs. The method does not involve a computation-intensive search for SNP subsets but discards redundant SNPs using a feature selection algorithm. In contrast to most existing methods, the method presented here does not limit itself to using only correlations between SNPs in local groups. By using correlations that occur across different chromosomal regions, the method can reduce the number of globally redundant SNPs. Experimental results show that the number of tagging SNPs selected by our method is smaller than the number selected by block-based methods.
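The idea of discarding redundant SNPs without searching over subsets can be illustrated with a simple greedy filter. This sketch is an assumption-laden stand-in, not the feature selection algorithm from the paper, which the abstract does not specify: it keeps a SNP only if its squared Pearson correlation with every SNP kept so far is below a threshold. The SNP names, the 0/1 allele coding and the threshold of 0.8 are all hypothetical.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column carries no correlation information
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return (sxy * sxy) / (sxx * syy)


def select_tag_snps(snps, threshold=0.8):
    """Keep each SNP unless it is highly correlated with an already kept one.

    `snps` maps a SNP name to its allele column across samples. Every SNP is
    compared with all kept SNPs, regardless of chromosomal position.
    """
    tags = []
    for name, column in snps.items():
        if all(r_squared(column, snps[kept]) < threshold for kept in tags):
            tags.append(name)
    return tags


if __name__ == "__main__":
    snps = {
        "snp_a": [0, 1, 1, 0, 1, 0],
        "snp_b": [0, 1, 1, 0, 1, 0],  # identical to snp_a
        "snp_c": [1, 0, 0, 1, 0, 1],  # complement of snp_a, so r^2 = 1
        "snp_d": [0, 0, 1, 1, 0, 1],
    }
    print(select_tag_snps(snps))
```

Because the comparison is made against all kept SNPs rather than only neighbors within a block, the filter can drop a SNP that is redundant with a distant one, which is the kind of global redundancy the abstract refers to.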


Citations: 89

Robust and accurate cancer classification with gene expression profiling.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.49

Haifeng Li, Keshu Zhang, Tao Jiang

Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable us to diagnose tumors precisely and systematically. However, the classification task in this context is very challenging because of the curse of dimensionality and the small sample size problem. In this paper, we propose a novel method to solve these two problems. Our method is able to map gene expression data into a very low dimensional space and thus meets the recommended samples-to-features-per-class ratio. As a result, it can be used to classify new samples robustly with low and trustworthy (estimated) error rates. The method is based on linear discriminant analysis (LDA). However, the conventional LDA requires that the within-class scatter matrix S(w) be nonsingular. Unfortunately, S(w) is always singular in the case of cancer classification due to the small sample size problem. To overcome this problem, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution to optimize Fisher's criterion. GLDA is mathematically well-founded and coincides with the conventional LDA when S(w) is nonsingular. Unlike the conventional LDA, GLDA does not assume the nonsingularity of S(w), and thus naturally solves the small sample size problem. To accommodate the high dimensionality of scatter matrices, a fast algorithm for GLDA is also developed. Our extensive experiments on seven public cancer datasets show that the method performs well. In particular, on some difficult instances that have very small samples-to-genes-per-class ratios, our method achieves much higher accuracy than widely used classification methods such as support vector machines and random forests.
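The singularity issue can be stated in the standard textbook notation for Fisher's criterion. The symbols below are assumptions for illustration, not taken from the paper: $S_w$ corresponds to S(w) in the abstract, $S_b$ is the between-class scatter matrix, $n$ is the number of samples, $c$ the number of classes, $d$ the number of genes, and $\mu_k$ the mean of class $C_k$.

```latex
% Fisher's criterion for a projection matrix W
J(W) = \frac{\left| W^{T} S_b W \right|}{\left| W^{T} S_w W \right|},
\qquad
S_w = \sum_{k=1}^{c} \sum_{x_i \in C_k} (x_i - \mu_k)(x_i - \mu_k)^{T}

% Conventional LDA solves an eigenproblem that needs the inverse of S_w
S_w^{-1} S_b \, w = \lambda \, w

% S_w is a d x d matrix built from n centered samples, so
\operatorname{rank}(S_w) \le n - c < d
\quad \text{whenever } d > n - c
```

With thousands of genes and at most a few hundred samples, $d$ far exceeds $n - c$, so $S_w$ cannot be inverted and the conventional eigenproblem is not defined. This is the gap that the abstract says GLDA is designed to close.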



Citations: 23

A pivoting algorithm for metabolic networks in the presence of thermodynamic constraints.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.6

R Nigam, S Liang

A linear programming algorithm is presented to constructively compute thermodynamically feasible fluxes and changes in the chemical potentials of reactions for a metabolic network. It is based on two physical laws that all chemical reactions should satisfy: mass conservation and the second law of thermodynamics. As a demonstration, the algorithm has been applied to the core metabolic pathway of E. coli.
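The two physical laws are commonly written as the following constraints. This is the usual formulation from constraint-based modeling, given here as an assumption, since the abstract does not state the exact system the authors solve: $S$ is the stoichiometric matrix (metabolites by reactions), $v$ the flux vector, $\mu$ the vector of metabolite chemical potentials, and $\Delta\mu_j$ the change in chemical potential of reaction $j$.

```latex
% Mass conservation at steady state
S \, v = 0

% Change in chemical potential of each reaction
\Delta\mu = S^{T} \mu

% Second law: a reaction carrying flux must run down its potential gradient
v_j \, \Delta\mu_j < 0 \quad \text{for every reaction } j \text{ with } v_j \neq 0
```

The first condition is linear in $v$ and the second is linear in $\mu$, but the sign condition couples the two unknowns, which is why a plain linear program does not suffice and a constructive procedure is needed to find a pair $(v, \Delta\mu)$ that satisfies all three.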


Citations: 4

On optimizing distance-based similarity search for biological databases.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.42

Rui Mao, Weijia Xu, Smriti Ramakrishnan, Glen Nuckolls, Daniel P Miranker

Similarity search leveraging distance-based index structures is increasingly being used for both multimedia and biological database applications. We consider distance-based indexing for three important biological data types: protein k-mers with the metric PAM model, DNA k-mers with Hamming distance, and peptide fragmentation spectra with a pseudo-metric derived from cosine distance. To date, the primary driver of this research has been multimedia applications, where similarity functions are often Euclidean norms on high dimensional feature vectors. We develop results showing that the character of these biological workloads is different from multimedia workloads. In particular, they are not intrinsically very high dimensional, and thus deserve different optimization heuristics. Based on MVP-trees, we develop a pivot selection heuristic that seeks centers and show that it outperforms the most widely used corner-seeking heuristic. Similarly, we develop a data partitioning approach sensitive to the actual data distribution in lieu of median splits.
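The principle behind pivot-based indexing can be shown with DNA k-mers and Hamming distance. The sketch below is deliberately simpler than an MVP-tree: it stores a flat table of precomputed distances to a few pivots and uses the triangle inequality to skip candidates. The pivots are picked by hand, so neither the center-seeking nor the corner-seeking heuristic from the abstract is implemented, and all names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x != y for x, y in zip(a, b))


def build_index(kmers, pivots):
    """Precompute the distance from every k-mer to every pivot."""
    return {s: [hamming(s, p) for p in pivots] for s in kmers}


def range_query(query, radius, index, pivots):
    """Return (hits, number of full distance computations performed)."""
    query_dists = [hamming(query, p) for p in pivots]
    hits, computed = [], 0
    for s, s_dists in index.items():
        # Triangle inequality: |d(q, p) - d(s, p)| <= d(q, s).
        # If the left side exceeds the radius for any pivot, s cannot match.
        if any(abs(qd - sd) > radius for qd, sd in zip(query_dists, s_dists)):
            continue
        computed += 1
        if hamming(query, s) <= radius:
            hits.append(s)
    return hits, computed


if __name__ == "__main__":
    kmers = ["ACGT", "ACGA", "TTTT", "TGCA", "ACCT"]
    pivots = ["ACGT", "TTTT"]
    index = build_index(kmers, pivots)
    print(range_query("ACGG", 1, index, pivots))
```

In this example two of the five k-mers are pruned without computing their distance to the query. How much pruning is possible depends strongly on which pivots are chosen, which is why pivot selection is a natural target for optimization.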



Citations: 27

TreeRefiner: a tool for refining a multiple alignment on a phylogenetic tree.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.53

Aswath Manohar, Serafim Batzoglou

We present TreeRefiner, a tool for refining multiple alignments of biological sequences. Given a multiple alignment, a phylogenetic tree, and scoring parameters as input, TreeRefiner optimizes the sum-of-pairs function in a restricted three-dimensional space around the alignment. At each internal node of the unrooted tree, the multiple alignment is projected to the sub-alignments corresponding to the three neighboring nodes, and three-dimensional dynamic programming is performed within a user-specified radius r around the original alignment. We test TreeRefiner on simulated sequences aligned by several popular tools, and demonstrate substantial improvements in the percentage of correctly aligned positions.
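The sum-of-pairs objective mentioned above can be computed as follows. This is a minimal sketch of the scoring function only, not of the three-dimensional dynamic programming, and the scoring parameters (match +1, mismatch -1, a flat gap penalty of -2 per residue-gap pair) are assumptions chosen for the example rather than TreeRefiner's defaults.

```python
from itertools import combinations


def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum pairwise column scores over all pairs of aligned sequences."""
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all aligned rows must have the same length")
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue  # two gaps facing each other are not scored
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total


if __name__ == "__main__":
    alignment = ["AC-T", "ACGT", "A-GT"]
    print(sum_of_pairs(alignment))
```

For the three-row example, the two fully matching columns contribute +3 each and the two columns containing a gap contribute -3 each, so the total is 0. A refinement step would accept a local change to the alignment only if it raises this total.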


Citations: 3

Bioinformatic insights from metagenomics through visualization.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.19

Susan L Havre, Bobbie-Jo Webb-Robertson, Anuj Shah, Christian Posse, Banu Gopalan, Fred J Brockman

Cutting-edge biological and bioinformatics research seeks a systems perspective through the analysis of multiple types of high-throughput and other experimental data for the same sample. Systems-level analysis requires the integration and fusion of such data, typically through advanced statistics and mathematics. Visualization is a complementary computational approach that supports integration and analysis of complex data or its derivatives. We present a bioinformatics visualization prototype, Juxter, which depicts categorical information derived from or assigned to these diverse data for the purpose of comparing patterns across categorizations. The visualization allows users to easily discern correlated and anomalous patterns in the data. These patterns, which might not be detected automatically by algorithms, may reveal valuable information leading to insight and discovery. We describe the visualization and interaction capabilities and demonstrate its utility in a new field, metagenomics, which combines molecular biology and genetics to identify and characterize genetic material from multi-species microbial samples.



Citations: 15

An algebraic geometry approach to protein structure determination from NMR data.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.11

Lincong Wang, Ramgopal R Mettu, Bruce Randall Donald

Our paper describes the first provably-efficient algorithm for determining protein structures de novo, solely from experimental data. We show how the global nature of a certain kind of NMR data provides quantifiable complexity-theoretic benefits, allowing us to classify our algorithm as running in polynomial time. While our algorithm uses NMR data as input, it is the first polynomial-time algorithm to compute high-resolution structures de novo using any experimentally-recorded data, from either NMR spectroscopy or X-ray crystallography. Improved algorithms for protein structure determination are needed, because currently the process is expensive and time-consuming. For example, an area of intense research in NMR methodology is automated assignment of nuclear Overhauser effect (NOE) restraints, in which structure determination sits in a tight inner loop (cycle) of assignment/refinement. These algorithms are very time-consuming, and typically require a large cluster. Thus, algorithms for protein structure determination that are known to run in polynomial time and provide guarantees on solution accuracy are likely to have great impact in the long term. Methods stemming from a technique called "distance geometry embedding" do come with provable guarantees, but the NP-hardness of these problem formulations implies that in the worst case these techniques cannot run in polynomial time. We are able to avoid the NP-hardness by (a) some mild assumptions about the protein being studied, (b) the use of residual dipolar couplings (RDCs) instead of a dense network of NOEs, and (c) novel algorithms and proofs that exploit the biophysical geometry of (a) and (b), drawing on a variety of computer science, computational geometry, and computational algebra techniques. In our algorithm, RDC data, which gives global restraints on the orientation of internuclear bond vectors, is used in conjunction with very sparse NOE data to obtain a polynomial-time algorithm for protein structure determination. An implementation of our algorithm has been applied to 6 different real biological NMR data sets recorded for 3 proteins. Our algorithm is combinatorially precise, polynomial-time, and uses much less NMR data to produce results that are as good as or better than previous approaches in terms of accuracy of the computed structure as well as running time. In practice, approaches such as restrained molecular dynamics and simulated annealing, which lack both combinatorial precision and guarantees on running time and solution quality, are commonly used. Our results show that by using a different "slice" of the data, an algorithm that is polynomial time and that has guarantees about solution quality can be obtained. We believe that our techniques can be extended and generalized for other structure-determination problems such as computing side-chain conformations and the structure of nucleic acids from experimental data.
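The sense in which an RDC is a global restraint can be seen from the standard tensor form of the RDC equation, given here as general NMR background rather than as a formula quoted from the paper. The symbols are the conventional ones and are assumptions with respect to this abstract: $r$ is the measured coupling, $D_{\max}$ the dipolar interaction constant, $\mathbf{v}$ the unit internuclear bond vector, and $\mathbf{S}$ the Saupe alignment tensor.

```latex
r = D_{\max} \, \mathbf{v}^{T} \mathbf{S} \, \mathbf{v},
\qquad
\mathbf{S} \in \mathbb{R}^{3 \times 3}, \quad
\mathbf{S} = \mathbf{S}^{T}, \quad
\operatorname{tr}(\mathbf{S}) = 0
```

Every bond vector in the molecule is measured against the same tensor $\mathbf{S}$, which has only five independent parameters. Once $\mathbf{S}$ is known, each RDC becomes a quadratic equation in the components of a unit vector, which is the kind of algebraic structure that exact methods can exploit. An NOE, by contrast, constrains only the distance between one local pair of nuclei.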



Citations: 7

Multi-scale hierarchical structure prediction of helical transmembrane proteins.

Proceedings. IEEE Computational Systems Bioinformatics Conference

Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.41

Zhong Chen, Ying Xu

As the first step toward a multi-scale, hierarchical computational approach for membrane protein structure prediction, the packing of transmembrane helices was modeled at the residue and atomistic levels, respectively. For predictions at the residue level, the helix-helix and helix-lipid interactions were described by a set of knowledge-based energy functions. For predictions at the atomistic level, the CHARMM19 force field was employed. To help the system overcome energy barriers, Wang-Landau sampling was carried out by performing a random walk in the energy and conformational spaces. Native-like structures were predicted at both levels for 2- and 7-helix systems. Interestingly, consistent results were obtained from simulations at the residue and atomistic levels for the same system, strongly suggesting the feasibility of a hierarchical approach for membrane structure prediction.
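Wang-Landau sampling itself is independent of the protein model, so its mechanics can be shown on a toy system. The sketch below assumes a hypothetical system of `n_units` independent two-state units whose energy is the number of units in state 1; it has nothing to do with helix packing or with the energy functions named in the abstract. The flatness criterion, sweep length and stopping value are also arbitrary choices for the example.

```python
import math
import random


def wang_landau(n_units=8, flatness=0.8, ln_f_final=1e-3, seed=0):
    """Estimate ln g(E), the log density of states, for the toy system."""
    rng = random.Random(seed)
    state = [0] * n_units
    energy = 0
    ln_g = [0.0] * (n_units + 1)   # running estimate of ln g(E)
    hist = [0] * (n_units + 1)     # visits per energy level since last reset
    ln_f = 1.0                     # log of the modification factor
    while ln_f > ln_f_final:
        for _ in range(1000):
            i = rng.randrange(n_units)
            new_energy = energy + (1 if state[i] == 0 else -1)
            # Accept with probability min(1, g(E_old) / g(E_new)), which
            # pushes the walk toward energies that were visited less often.
            if math.log(1.0 - rng.random()) < ln_g[energy] - ln_g[new_energy]:
                state[i] ^= 1
                energy = new_energy
            ln_g[energy] += ln_f
            hist[energy] += 1
        # Once the histogram is roughly flat, refine the modification factor.
        if min(hist) >= flatness * sum(hist) / len(hist):
            ln_f /= 2.0
            hist = [0] * len(hist)
    base = ln_g[0]
    return [value - base for value in ln_g]


if __name__ == "__main__":
    estimate = wang_landau()
    for e, value in enumerate(estimate):
        print(e, round(value, 2), round(math.log(math.comb(8, e)), 2))
```

For this toy system the exact answer is known, since the number of states with energy E is the binomial coefficient C(8, E), so the script prints the estimate next to the exact value for comparison. The acceptance rule penalizes energies that have already been visited often, which is what lets the walk cross barriers that would trap an ordinary Metropolis simulation at low temperature.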


Citations: 0
