Proceedings. IEEE Computational Systems Bioinformatics Conference: latest articles

Consensus genetic maps: a graph theoretic approach.
Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.26
Benjamin N Jackson, Srinivas Aluru, Patrick S Schnable

A genetic map is an ordering of genetic markers constructed from genetic linkage data for use in linkage studies and experimental design. While traditional methods have focused on constructing maps from a single population study, maps are increasingly generated for multiple lines and populations of the same organism. For example, in crop plants, where the genetic variability is high, researchers have created maps for many populations. In the face of these new data, we address the increasingly important problem of generating a consensus map: an ordering of all markers in the various population studies. In our method, each input map is treated as a partial order on a set of markers. To find the most consistent order shared between maps, we model the partial orders as directed graphs. We create an aggregate by merging the transitive closures of the input graphs and taking the transitive reduction of the result. In this process, cycles may need to be broken to resolve inconsistencies between the inputs. The cycle breaking problem is NP-hard, but the problem size depends upon the scope of the inconsistency between the input graphs, which will be local if the input graphs are from closely related organisms. We present results of running the resulting software on maps generated from seven populations of the crop plant Zea mays.
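
The merge step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' software: the function names are hypothetical, each input map is assumed to be a simple list of distinct markers in map order (a total order, which is a special case of the partial orders the abstract mentions), and conflicting inputs are only detected, not resolved, since cycle breaking is the NP-hard part.

```python
def transitive_closure(nodes, edges):
    """Return every pair (a, b) such that b is reachable from a."""
    reach = {n: set() for n in nodes}
    for a, b in edges:
        reach[a].add(b)
    # Warshall-style propagation: the intermediate node k is the outer loop.
    for k in nodes:
        for i in nodes:
            if k in reach[i]:
                reach[i] |= reach[k]
    return {(a, b) for a in nodes for b in reach[a]}


def consensus_edges(maps):
    """Merge the closures of all maps, then take the transitive reduction.

    Raises ValueError if the merged graph has a cycle, i.e. the input
    maps disagree about the relative order of some markers.
    """
    nodes = sorted({marker for mp in maps for marker in mp})
    merged = set()
    for mp in maps:
        merged |= transitive_closure(mp, list(zip(mp, mp[1:])))
    closure = transitive_closure(nodes, merged)
    conflicts = sorted(a for a, b in closure if a == b)
    if conflicts:
        raise ValueError(f"order conflict among markers: {conflicts}")
    # In an acyclic closure, (a, b) is redundant if some c lies between them.
    return {
        (a, b)
        for a, b in closure
        if not any((a, c) in closure and (c, b) in closure for c in nodes)
    }


if __name__ == "__main__":
    maps = [["m1", "m2", "m4"], ["m1", "m3", "m4"], ["m2", "m3"]]
    print(sorted(consensus_edges(maps)))
```

In the example, the third map fixes the relative order of m2 and m3, which the first two maps leave open, so the reduction collapses to a single chain. Two maps that list the same markers in opposite order would instead trigger the conflict error.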

Citations: 20
Islands of tractability for parsimony haplotyping.
Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.37
Roded Sharan, Bjarni V Halldórsson, Sorin Istrail

We study the parsimony approach to haplotype inference, which calls for finding a set of haplotypes of minimum cardinality that explains an input set of genotypes. We prove that the problem is APX-hard even in very restricted cases. On the positive side, we identify islands of tractability for the problem by focusing on instances with a specific structure of haplotype sharing among the input genotypes. We exploit the structure of those instances to give polynomial and constant-approximation algorithms for the problem. We also show that the general parsimony haplotyping problem is fixed-parameter tractable.
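
To make the problem statement concrete, here is a small brute-force sketch in Python. It is only an illustration of the definition, not one of the algorithms from the paper, and it runs in exponential time. The encoding is an assumption commonly used for this problem: haplotypes are 0/1 tuples, genotypes are tuples over {0, 1, 2}, and 2 marks a heterozygous site. The function names are hypothetical.

```python
from itertools import combinations, product


def explains(h1, h2, genotype):
    """True if the haplotype pair (h1, h2) is consistent with the genotype."""
    return all(
        (a == b == site) if site in (0, 1) else (a != b)
        for a, b, site in zip(h1, h2, genotype)
    )


def candidates(genotype):
    """All haplotypes that could take part in explaining one genotype."""
    options = [(0, 1) if site == 2 else (site,) for site in genotype]
    return set(product(*options))


def min_haplotype_set(genotypes):
    """Smallest haplotype set explaining every genotype (exhaustive search)."""
    pool = sorted(set().union(*(candidates(g) for g in genotypes)))
    for size in range(1, len(pool) + 1):
        for subset in combinations(pool, size):
            if all(
                any(explains(h1, h2, g) for h1 in subset for h2 in subset)
                for g in genotypes
            ):
                return set(subset)
    return set()


if __name__ == "__main__":
    genotypes = [(0, 2, 1), (2, 2, 1), (0, 0, 1)]
    solution = min_haplotype_set(genotypes)
    print(len(solution), sorted(solution))
```

For the three genotypes in the example, the homozygous genotype forces one haplotype, the first genotype forces a second, and the doubly heterozygous genotype needs a third, so the minimum set has three haplotypes.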

Citations: 0
Choosing SNPs using feature selection.
Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.22
Tu Minh Phuong, Zhen Lin, Russ B Altman

A major challenge for genome-wide disease association studies is the high cost of genotyping large numbers of single nucleotide polymorphisms (SNPs). The correlations between SNPs, however, make it possible to select a parsimonious set of informative SNPs, known as "tagging" SNPs, able to capture most variation in a population. Considerable research interest has recently focused on the development of methods for finding such SNPs. In this paper, we present an efficient method for finding tagging SNPs. The method does not involve a computation-intensive search for SNP subsets but discards redundant SNPs using a feature selection algorithm. In contrast to most existing methods, the method presented here does not limit itself to using only correlations between SNPs in local groups. By using correlations that occur across different chromosomal regions, the method can reduce the number of globally redundant SNPs. Experimental results show that the number of tagging SNPs selected by our method is smaller than the number selected by block-based methods.
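
The idea of discarding redundant SNPs by correlation, without restricting comparisons to local blocks, can be sketched as a simple greedy filter. This is a stand-in for illustration and not the feature selection algorithm used in the paper. The threshold of 0.8, the 0/1 allele coding, and the function names are all assumptions.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equally long allele vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy * sxy / (sxx * syy)


def select_tag_snps(snps, threshold=0.8):
    """Greedy redundancy filter over SNPs (each a list of allele codes).

    A SNP is kept only if its r^2 with every SNP kept so far is below the
    threshold. Comparisons are global: any kept SNP, however far away on
    the chromosome, can make a later SNP redundant.
    """
    kept = []
    for index, snp in enumerate(snps):
        if len(set(snp)) < 2:
            continue  # monomorphic SNPs carry no information
        if all(r_squared(snp, snps[k]) < threshold for k in kept):
            kept.append(index)
    return kept


if __name__ == "__main__":
    snps = [
        [0, 0, 1, 1, 0, 1],
        [0, 0, 1, 1, 0, 1],  # identical to SNP 0
        [1, 1, 0, 0, 1, 0],  # perfectly anti-correlated with SNP 0
        [0, 1, 0, 1, 1, 0],  # weakly correlated with SNP 0
    ]
    print(select_tag_snps(snps))
```

In the example, SNPs 1 and 2 are perfectly predictable from SNP 0 and are dropped, while SNP 3 is kept because its squared correlation with SNP 0 is about 0.11.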

Citations: 89
Robust and accurate cancer classification with gene expression profiling.
Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.49
Haifeng Li, Keshu Zhang, Tao Jiang

Robust and accurate cancer classification is critical in cancer treatment. Gene expression profiling is expected to enable us to diagnose tumors precisely and systematically. However, the classification task in this context is very challenging because of the curse of dimensionality and the small sample size problem. In this paper, we propose a novel method to solve these two problems. Our method is able to map gene expression data into a very low dimensional space and thus meets the recommended samples-to-features per class ratio. As a result, it can be used to classify new samples robustly with low and trustworthy (estimated) error rates. The method is based on linear discriminant analysis (LDA). However, conventional LDA requires that the within-class scatter matrix S(w) be nonsingular. Unfortunately, S(w) is always singular in the case of cancer classification due to the small sample size problem. To overcome this problem, we develop a generalized linear discriminant analysis (GLDA) that is a general, direct, and complete solution to optimize Fisher's criterion. GLDA is mathematically well-founded and coincides with conventional LDA when S(w) is nonsingular. Unlike conventional LDA, GLDA does not assume the nonsingularity of S(w), and thus naturally solves the small sample size problem. To accommodate the high dimensionality of scatter matrices, a fast algorithm for GLDA is also developed. Our extensive experiments on seven public cancer datasets show that the method performs well. In particular, on some difficult instances that have very small samples-to-genes per class ratios, our method achieves much higher accuracies than widely used classification methods such as support vector machines and random forests.

Citations: 23
A pivoting algorithm for metabolic networks in the presence of thermodynamic constraints.
Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.6
R Nigam, S Liang

A linear programming algorithm is presented to constructively compute thermodynamically feasible fluxes and changes in chemical potentials of reactions for a metabolic network. It is based on the physical laws that all chemical reactions should satisfy: mass conservation and the second law of thermodynamics. As a demonstration, the algorithm has been applied to the core metabolic pathway of E. coli.
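
The two constraints named in this abstract can be written down as a small feasibility check. The sketch below is not the pivoting algorithm from the paper; it only verifies a given candidate solution. It assumes a stoichiometric matrix S with metabolites as rows and reactions as columns, steady-state mass balance S v = 0, and the usual sign convention that a reaction carrying nonzero flux must have v_j * dmu_j < 0, where dmu_j is the change in chemical potential across reaction j. The function name and the `internal` argument (the reactions to which the thermodynamic check applies) are hypothetical.

```python
def thermo_feasible(S, v, mu, internal, tol=1e-9):
    """Check mass balance and the second-law sign condition.

    S        : stoichiometric matrix, S[i][j] for metabolite i, reaction j
    v        : flux through each reaction
    mu       : chemical potential of each metabolite
    internal : indices of reactions subject to the thermodynamic check
    """
    n_met, n_rxn = len(S), len(S[0])
    # Mass conservation at steady state: S v = 0 for every metabolite.
    for i in range(n_met):
        if abs(sum(S[i][j] * v[j] for j in range(n_rxn))) > tol:
            return False
    # Second law: flux must run down the chemical potential gradient.
    for j in internal:
        d_mu = sum(S[i][j] * mu[i] for i in range(n_met))
        if abs(v[j]) > tol and v[j] * d_mu >= 0:
            return False
    return True


if __name__ == "__main__":
    # Uptake -> A, A -> B, B -> secretion; only A -> B is internal.
    chain = [[1, -1, 0], [0, 1, -1]]
    print(thermo_feasible(chain, [2, 2, 2], [-1.0, -3.0], internal=[1]))
    # Closed loop A -> B -> C -> A: balanced, but no potentials can drive it.
    loop = [[-1, 0, 1], [1, -1, 0], [0, 1, -1]]
    print(thermo_feasible(loop, [1, 1, 1], [0.0, -1.0, -2.0], internal=[0, 1, 2]))
```

The second example shows why mass balance alone is not enough: the closed loop satisfies S v = 0, but the potential changes around a cycle sum to zero, so at least one reaction would have to run uphill.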

Citations: 4
On optimizing distance-based similarity search for biological databases.
Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.42
Rui Mao, Weijia Xu, Smriti Ramakrishnan, Glen Nuckolls, Daniel P Miranker

Similarity search leveraging distance-based index structures is increasingly being used for both multimedia and biological database applications. We consider distance-based indexing for three important biological data types: protein k-mers with the metric PAM model, DNA k-mers with Hamming distance, and peptide fragmentation spectra with a pseudo-metric derived from cosine distance. To date, the primary driver of this research has been multimedia applications, where similarity functions are often Euclidean norms on high dimensional feature vectors. We develop results showing that the character of these biological workloads is different from multimedia workloads. In particular, they are not intrinsically very high dimensional, and deserve different optimization heuristics. Based on MVP-trees, we develop a center-seeking pivot selection heuristic and show that it outperforms the most widely used corner-seeking heuristic. Similarly, we develop a data partitioning approach sensitive to the actual data distribution in lieu of median splits.
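
Distance-based indexes of this kind rely on the triangle inequality to skip distance computations. The following Python sketch shows that pruning rule for DNA k-mers under Hamming distance. It uses a flat table of pivot distances rather than an MVP-tree, and the pivots are supplied by the caller, so neither the tree structure nor the pivot selection heuristics of the paper are reproduced. The class and method names are hypothetical.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x != y for x, y in zip(a, b))


class PivotIndex:
    """Flat pivot table: stores d(pivot, item) for every pivot and item."""

    def __init__(self, items, pivots):
        self.items = list(items)
        self.pivots = list(pivots)
        self.table = [[hamming(p, x) for p in self.pivots] for x in self.items]

    def range_query(self, query, radius):
        """Return (hits, number of item distances actually computed).

        By the triangle inequality, |d(q, p) - d(p, x)| > radius implies
        d(q, x) > radius, so x can be discarded without computing d(q, x).
        """
        to_pivots = [hamming(p, query) for p in self.pivots]
        hits, computed = [], 0
        for item, row in zip(self.items, self.table):
            if any(abs(dq - dx) > radius for dq, dx in zip(to_pivots, row)):
                continue  # pruned by some pivot
            computed += 1
            if hamming(query, item) <= radius:
                hits.append(item)
        return hits, computed


if __name__ == "__main__":
    kmers = ["ACGT", "ACGA", "TTTT", "TGCA", "ACCT"]
    index = PivotIndex(kmers, pivots=["ACGT"])
    print(index.range_query("ACGG", radius=1))
```

With a single pivot, the query in the example prunes two of the five k-mers before any item distance is computed. How much pruning a pivot achieves depends on how the pivot sits relative to the data, which is the question the pivot selection heuristics in the abstract address.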

Citations: 27
An algebraic geometry approach to protein structure determination from NMR data.
Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.11
Lincong Wang, Ramgopal R Mettu, Bruce Randall Donald

Our paper describes the first provably efficient algorithm for determining protein structures de novo, solely from experimental data. We show how the global nature of a certain kind of NMR data provides quantifiable complexity-theoretic benefits, allowing us to classify our algorithm as running in polynomial time. While our algorithm uses NMR data as input, it is the first polynomial-time algorithm to compute high-resolution structures de novo using any experimentally recorded data, from either NMR spectroscopy or X-ray crystallography. Improved algorithms for protein structure determination are needed because the process is currently expensive and time-consuming. For example, an area of intense research in NMR methodology is automated assignment of nuclear Overhauser effect (NOE) restraints, in which structure determination sits in a tight inner loop (cycle) of assignment/refinement. These algorithms are very time-consuming, and typically require a large cluster. Thus, algorithms for protein structure determination that are known to run in polynomial time and provide guarantees on solution accuracy are likely to have great impact in the long term. Methods stemming from a technique called "distance geometry embedding" do come with provable guarantees, but the NP-hardness of these problem formulations implies that in the worst case these techniques cannot run in polynomial time. We are able to avoid the NP-hardness by (a) some mild assumptions about the protein being studied, (b) the use of residual dipolar couplings (RDCs) instead of a dense network of NOEs, and (c) novel algorithms and proofs that exploit the biophysical geometry of (a) and (b), drawing on a variety of computer science, computational geometry, and computational algebra techniques. In our algorithm, RDC data, which gives global restraints on the orientation of internuclear bond vectors, is used in conjunction with very sparse NOE data to obtain a polynomial-time algorithm for protein structure determination. An implementation of our algorithm has been applied to 6 different real biological NMR data sets recorded for 3 proteins. Our algorithm is combinatorially precise, runs in polynomial time, and uses much less NMR data to produce results that are as good as or better than previous approaches in terms of accuracy of the computed structure as well as running time. In practice, approaches such as restrained molecular dynamics and simulated annealing, which lack both combinatorial precision and guarantees on running time and solution quality, are commonly used. Our results show that by using a different "slice" of the data, an algorithm that runs in polynomial time and that has guarantees about solution quality can be obtained. We believe that our techniques can be extended and generalized for other structure-determination problems such as computing side-chain conformations and the structure of nucleic acids from experimental data.

Citations: 7
Multi-scale hierarchical structure prediction of helical transmembrane proteins.
Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.41
Zhong Chen, Ying Xu

As the first step toward a multi-scale, hierarchical computational approach for membrane protein structure prediction, the packing of transmembrane helices was modeled at the residue and atomistic levels, respectively. For predictions at the residue level, the helix-helix and helix-lipid interactions were described by a set of knowledge-based energy functions. For predictions at the atomistic level, the CHARMM19 force field was employed. To help the system overcome energy barriers, Wang-Landau sampling was carried out by performing a random walk in the energy and conformational spaces. Native-like structures were predicted at both levels for 2- and 7-helix systems. Interestingly, consistent results were obtained from simulations at the residue and atomistic levels for the same system, strongly suggesting the feasibility of a hierarchical approach for membrane structure prediction.
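
Wang-Landau sampling, mentioned in this abstract, can be shown on a toy system. The sketch below is not the helix-packing simulation: it assumes a hypothetical model of independent two-state units whose energy is simply the number of units in state 1, chosen because the exact density of states is a binomial coefficient and the estimate can be checked. The function name, the flatness criterion of 0.8, the batch size of 1000 moves, and the halving schedule for the modification factor are all illustrative choices.

```python
import math
import random


def wang_landau(n_units=8, ln_f_final=1e-3, flatness=0.8, seed=0):
    """Estimate ln g(E) for E = 0..n_units, normalised so ln g(0) = 0."""
    rng = random.Random(seed)
    state = [0] * n_units
    energy = 0
    ln_g = [0.0] * (n_units + 1)   # running estimate of ln(density of states)
    hist = [0] * (n_units + 1)     # visits per energy level in this stage
    ln_f = 1.0                     # modification factor, reduced stage by stage
    while ln_f > ln_f_final:
        for _ in range(1000):
            i = rng.randrange(n_units)
            new_energy = energy + (1 - 2 * state[i])
            # Accept with probability min(1, g(E_old) / g(E_new)), which
            # pushes the walk toward energies visited less often so far.
            if rng.random() < math.exp(min(0.0, ln_g[energy] - ln_g[new_energy])):
                state[i] ^= 1
                energy = new_energy
            ln_g[energy] += ln_f
            hist[energy] += 1
        # Once every level has been visited roughly equally, refine ln_f.
        if min(hist) > flatness * sum(hist) / len(hist):
            hist = [0] * len(hist)
            ln_f /= 2
    return [value - ln_g[0] for value in ln_g]


if __name__ == "__main__":
    estimate = wang_landau()
    for e, value in enumerate(estimate):
        print(e, round(value, 2), round(math.log(math.comb(8, e)), 2))
```

Because the acceptance rule penalises energies that have already been visited often, the walk does not stay trapped in one energy range, which is the property the abstract relies on to cross energy barriers.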

Citations: 0
PSIST: indexing protein structures using suffix trees.
Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.46
Feng Gao, Mohammed J Zaki

Approaches for indexing proteins, and for fast and scalable searching for structures similar to a query structure, have important applications such as protein structure and function prediction, protein classification, and drug discovery. In this paper, we develop a new method for extracting the local feature vectors of protein structures. Each residue is represented by a triangle, and the correlation between a set of residues is described by the distances between Cα atoms and the angles between the normals of the planes in which the triangles lie. The normalized local feature vectors are indexed using a suffix tree. For all query segments, suffix trees can be used effectively to retrieve the maximal matches, which are then chained to obtain alignments with database proteins. Similar proteins are selected by their alignment score against the query. Our results show classification accuracy of up to 97.8% and 99.4% at the superfamily and class levels, respectively, according to the SCOP classification, and show that on average 7.49 out of 10 proteins from the same superfamily are obtained among the top 10 matches. These results are competitive with the best previous methods.

Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 212-22.
Citations: 29
TreeRefiner: a tool for refining a multiple alignment on a phylogenetic tree.
Pub Date : 2005-01-01 DOI: 10.1109/csb.2005.53
Aswath Manohar, Serafim Batzoglou

We present TreeRefiner, a tool for refining multiple alignments of biological sequences. Given a multiple alignment, a phylogenetic tree, and scoring parameters as input, TreeRefiner optimizes the sum-of-pairs function in a restricted three-dimensional space around the alignment. At each internal node of the unrooted tree, the multiple alignment is projected to the sub-alignments corresponding to the three neighboring nodes, and three-dimensional dynamic programming is performed within a user-specified radius r around the original alignment. We test TreeRefiner on simulated sequences aligned by several popular tools, and demonstrate substantial improvements in the percentage of correctly aligned positions.
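
As a rough illustration of the sum-of-pairs objective mentioned above, the sketch below scores a small alignment column by column. The scoring values (`MATCH`, `MISMATCH`, `GAP`) and function names are hypothetical placeholders, not TreeRefiner's actual parameters, and the restricted three-dimensional dynamic programming itself is not shown.

```python
from itertools import combinations

# Hypothetical scoring parameters chosen only for this example.
MATCH, MISMATCH, GAP = 1, -1, -2

def pair_score(a, b):
    """Score one pair of aligned symbols; '-' denotes a gap."""
    if a == "-" and b == "-":
        return 0  # two gaps in the same column contribute nothing
    if a == "-" or b == "-":
        return GAP
    return MATCH if a == b else MISMATCH

def sum_of_pairs(alignment):
    """Sum pairwise scores over every column of an alignment.

    `alignment` is a list of equal-length gapped sequences.
    """
    if len({len(row) for row in alignment}) != 1:
        raise ValueError("all rows must have the same length")
    return sum(
        pair_score(a, b)
        for column in zip(*alignment)
        for a, b in combinations(column, 2)
    )

# Three-way alignment, as at an internal node with three neighbors.
score = sum_of_pairs(["ACGT", "ACGT", "AC-T"])
```

A refinement procedure would compare this score across candidate alignments within the allowed radius and keep the best one.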

Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 111-9.
Citations: 3