Jianyang Michael Zeng, Chittaranjan Tripathy, Pei Zhou, Bruce R Donald
High-throughput structure determination based on solution Nuclear Magnetic Resonance (NMR) spectroscopy plays an important role in structural genomics. One of the main bottlenecks in NMR structure determination is the interpretation of NMR data to obtain a sufficient number of accurate distance restraints by assigning nuclear Overhauser effect (NOE) spectral peaks to pairs of protons. The difficulty in automated NOE assignment mainly lies in the ambiguities arising both from the resonance degeneracy of chemical shifts and from the uncertainty due to experimental errors in NOE peak positions. In this paper we present a novel NOE assignment algorithm, called HAusdorff-based NOE Assignment (HANA), that starts with a high-resolution protein backbone computed using only two residual dipolar couplings (RDCs) per residue [37, 39], employs a Hausdorff-based pattern matching technique to deduce similarity between experimental and back-computed NOE spectra for each rotamer from a statistically diverse library, and drives the selection of optimal position-specific rotamers for filtering ambiguous NOE assignments. Our algorithm runs in time O(tn³ + tn log t), where t is the maximum number of rotamers per residue and n is the size of the protein. Application of our algorithm on biological NMR data for three proteins, namely, human ubiquitin, the zinc finger domain of the human DNA Y-polymerase Eta (pol η) and the human Set2-Rpb1 interacting domain (hSRI) demonstrates that our algorithm overcomes spectral noise to achieve more than 90% assignment accuracy. Additionally, the final structures calculated using our automated NOE assignments have backbone RMSD < 1.7 Å and all-heavy-atom RMSD < 2.5 Å from reference structures that were determined either by X-ray crystallography or traditional NMR approaches. These results show that our NOE assignment algorithm can be successfully applied to protein NMR spectra to obtain high-quality structures.
{"title":"A HAUSDORFF-BASED NOE ASSIGNMENT ALGORITHM USING PROTEIN BACKBONE DETERMINED FROM RESIDUAL DIPOLAR COUPLINGS AND ROTAMER PATTERNS.","authors":"Jianyang Michael Zeng, Chittaranjan Tripathy, Pei Zhou, Bruce R Donald","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>High-throughput structure determination based on solution Nuclear Magnetic Resonance (NMR) spectroscopy plays an important role in structural genomics. One of the main bottlenecks in NMR structure determination is the interpretation of NMR data to obtain a sufficient number of accurate distance restraints by assigning nuclear Overhauser effect (NOE) spectral peaks to pairs of protons. The difficulty in automated NOE assignment mainly lies in the ambiguities arising both from the resonance degeneracy of chemical shifts and from the uncertainty due to experimental errors in NOE peak positions. In this paper we present a novel NOE assignment algorithm, called HAusdorff-based NOE Assignment (HANA), that starts with a high-resolution protein backbone computed using only two residual dipolar couplings (RDCs) per residue37, 39, employs a Hausdorff-based pattern matching technique to deduce similarity between experimental and back-computed NOE spectra for each rotamer from a statistically diverse library, and drives the selection of optimal position-specific rotamers for filtering ambiguous NOE assignments. Our algorithm runs in time O(tn(3) +tn log t), where t is the maximum number of rotamers per residue and n is the size of the protein. Application of our algorithm on biological NMR data for three proteins, namely, human ubiquitin, the zinc finger domain of the human DNA Y-polymerase Eta (pol η) and the human Set2-Rpb1 interacting domain (hSRI) demonstrates that our algorithm overcomes spectral noise to achieve more than 90% assignment accuracy. Additionally, the final structures calculated using our automated NOE assignments have backbone RMSD < 1.7 Å and all-heavy-atom RMSD < 2.5 Å from reference structures that were determined either by X-ray crystallography or traditional NMR approaches. These results show that our NOE assignment algorithm can be successfully applied to protein NMR spectra to obtain high-quality structures.</p>","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"2008 ","pages":"169-181"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2613371/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140102956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The recent explosion in the number of clinical studies involving microarray data calls for novel computational methods for their dissection. Human protein interaction networks are rapidly growing and can assist in the extraction of functional modules from microarray data. We describe a novel methodology for extraction of connected network modules with coherent gene expression patterns that are correlated with a specific clinical parameter. Our approach suits both numerical (e.g., age or tumor size) and logical parameters (e.g., gender or mutation status). We demonstrate the method on a large breast cancer dataset, where we identify biologically-relevant modules related to nine clinical parameters including patient age, tumor size, and metastasis-free survival. Our method is capable of detecting disease-relevant pathways that could not be found using other methods. Our results support some previous hypotheses regarding the molecular pathways underlying diversity of breast tumors and suggest novel ones.
{"title":"Detecting pathways transcriptionally correlated with clinical parameters.","authors":"Igor Ulitsky, Ron Shamir","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>The recent explosion in the number of clinical studies involving microarray data calls for novel computational methods for their dissection. Human protein interaction networks are rapidly growing and can assist in the extraction of functional modules from microarray data. We describe a novel methodology for extraction of connected network modules with coherent gene expression patterns that are correlated with a specific clinical parameter. Our approach suits both numerical (e.g., age or tumor size) and logical parameters (e.g., gender or mutation status). We demonstrate the method on a large breast cancer dataset, where we identify biologically-relevant modules related to nine clinical parameters including patient age, tumor size, and metastasis-free survival. Our method is capable of detecting disease-relevant pathways that could not be found using other methods. Our results support some previous hypotheses regarding the molecular pathways underlying diversity of breast tumors and suggest novel ones.</p>","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 ","pages":"249-58"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"28337727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chunfang Zheng, P Kerr Wall, Jim Leebens-Mack, Victor A Albert, Claude dePamphilis, David Sankoff
We improve on guided genome halving algorithms so that several thousand gene sets, each containing two paralogs in the descendant T of the doubling event and their single ortholog from an undoubled reference genome R, can be analyzed to reconstruct the ancestor A of T at the time of doubling. At the same time, large numbers of defective gene sets, either missing one paralog from T or missing their ortholog in R, may be incorporated into the analysis in a consistent way. We apply this genomic rearrangement distance-based approach to the recently sequenced poplar (Populus trichocarpa) and grapevine (Vitis vinifera) genomes, as T and R respectively.
{"title":"The effect of massive gene loss following whole genome duplication on the algorithmic reconstruction of the ancestral populus diploid.","authors":"Chunfang Zheng, P Kerr Wall, Jim Leebens-Mack, Victor A Albert, Claude dePamphilis, David Sankoff","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>We improve on guided genome halving algorithms so that several thousand gene sets, each containing two paralogs in the descendant T of the doubling event and their single ortholog from an undoubled reference genome R, can be analyzed to reconstruct the ancestor A of T at the time of doubling. At the same time, large numbers of defective gene sets, either missing one paralog from T or missing their ortholog in R, may be incorporated into the analysis in a consistent way. We apply this genomic rearrangement distance-based approach to the recently sequenced poplar (Populus trichocarpa) and grapevine (Vitis vinifera) genomes, as T and R respectively.</p>","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 ","pages":"261-71"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"28337728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2008-01-01. DOI: 10.1142/9781848162648_0018
Yaohang Li, A. Bordner, Yuan Tian, Xiuping Tao, A. Gorin
With some simplifications, computational protein folding can be understood as an optimization problem of a potential energy function over a variable space consisting of all conformations of a given protein molecule. It is well known that realistic energy potentials are very "rough" functions when expressed in the standard variables, and folding trajectories can easily become trapped in multiple local minima. We have integrated our variation of Parallel Tempering optimization into the protein folding program Rosetta in order to improve its capability to overcome energy barriers, and to estimate how such improvement influences the quality of the folded protein domains. Here we report that (1) Parallel Tempering Rosetta (PTR) is significantly better at exploring protein structures than previous implementations of the program; (2) systematic improvements are observed across a large benchmark set in the parameters that are normally followed to estimate robustness of the folding; (3) these improvements are most dramatic in the subset of the shortest domains, where high-quality structures have been obtained for >75% of all tested sequences. Further analysis of the results will improve our understanding of protein conformational space and lead to new improvements in the protein folding methodology, while the current PTR implementation should be very efficient for short (up to approximately 80 a.a.) protein domains and therefore may find practical application in systems biology studies.
{"title":"Extensive exploration of conformational space improves Rosetta results for short protein domains.","authors":"Yaohang Li, A. Bordner, Yuan Tian, Xiuping Tao, A. Gorin","doi":"10.1142/9781848162648_0018","DOIUrl":"https://doi.org/10.1142/9781848162648_0018","url":null,"abstract":"With some simplifications, computational protein folding can be understood as an optimization problem of a potential energy function on a variable space consisting of all conformation for a given protein molecule. It is well known that realistic energy potentials are very \"rough\" functions, when expressed in the standard variables, and the folding trajectories can be easily trapped in multiple local minima. We have integrated our variation of Parallel Tempering optimization into the protein folding program Rosetta in order to improve its capability to overcome energy barriers and estimate how such improvement will influence the quality of the folded protein domains. Here we report that (1) Parallel Tempering Rosetta (PTR) is significantly better in the exploration of protein structures than previous implementations of the program; (2) systematic improvements are observed across a large benchmark set in the parameters that are normally followed to estimate robustness of the folding; (3) these improvements are most dramatic in the subset of the shortest domains, where high-quality structures have been obtained for >75% of all tested sequences. Further analysis of the results will improve our understanding of protein conformational space and lead to new improvements in the protein folding methodology, while the current PTR implementation should be very efficient for short (up to approximately 80 a.a.) protein domains and therefore may find practical application in system biology studies.","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 1","pages":"203-9"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64003563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2008-01-01. DOI: 10.1142/9781848162648_0026
Xin Li, Jing Li
We study the haplotype inference problem from pedigree data under the zero-recombination assumption, which is well supported by real data for tightly linked markers (i.e., single nucleotide polymorphisms (SNPs)) over a relatively large chromosome segment. We solve the problem in a rigorous mathematical manner by formulating genotype constraints as a linear system of inheritance variables. We then utilize disjoint-set structures to encode connectivity information among individuals, to detect constraints from genotypes, and to check consistency of constraints. On a tree pedigree without missing data, our algorithm can output a general solution as well as the number of total specific solutions in nearly linear time O(mn · α(n)), where m is the number of loci, n is the number of individuals and α is the inverse Ackermann function, which is a further improvement over existing methods. We also extend the idea to looped pedigrees and pedigrees with missing data by considering existing (partial) constraints on inheritance variables. The algorithm has been implemented in C++ and will be incorporated into our PedPhase package. Experimental results show that it can correctly identify all 0-recombinant solutions with great efficiency. Comparisons with two other popular algorithms show that the proposed algorithm achieves 10- to 10⁵-fold improvements over a variety of parameter settings. The experimental study also provides empirical evidence on the complexity bounds suggested by theoretical analysis.
{"title":"Efficient haplotype inference from pedigrees with missing data using linear systems with disjoint-set data structures.","authors":"Xin Li, Jing Li","doi":"10.1142/9781848162648_0026","DOIUrl":"https://doi.org/10.1142/9781848162648_0026","url":null,"abstract":"We study the haplotype inference problem from pedigree data under the zero recombination assumption, which is well supported by real data for tightly linked markers (i.e., single nucleotide polymorphisms (SNPs)) over a relatively large chromosome segment. We solve the problem in a rigorous mathematical manner by formulating genotype constraints as a linear system of inheritance variables. We then utilize disjoint-set structures to encode connectivity information among individuals, to detect constraints from genotypes, and to check consistency of constraints. On a tree pedigree without missing data, our algorithm can output a general solution as well as the number of total specific solutions in a nearly linear time O (mn x alpha(n)), where m is the number of loci, n is the number of individuals and alpha is the inverse Ackermann function, which is a further improvement over existing ones. We also extend the idea to looped pedigrees and pedigrees with missing data by considering existing (partial) constraints on inheritance variables. The algorithm has been implemented in C++ and will be incorporated into our PedPhase package. Experimental results show that it can correctly identify all 0-recombinant solutions with great efficiency. Comparisons with other two popular algorithms show that the proposed algorithm achieves 10 to 10(5)-fold improvements over a variety of parameter settings. The experimental study also provides empirical evidences on the complexity bounds suggested by theoretical analysis.","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 1","pages":"297-308"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64003575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Identifying B-cell epitopes plays an important role in vaccine design, immunodiagnostic tests, and antibody production. Therefore, computational tools for reliably predicting B-cell epitopes are highly desirable. We explore two machine learning approaches for predicting flexible length linear B-cell epitopes. The first approach utilizes four sequence kernels for determining a similarity score between any arbitrary pair of variable length sequences. The second approach utilizes four different methods of mapping a variable length sequence into a fixed length feature vector. Based on our empirical comparisons, we propose FBCPred, a novel method for predicting flexible length linear B-cell epitopes using the subsequence kernel. Our results demonstrate that FBCPred significantly outperforms all other classifiers evaluated in this study. An implementation of FBCPred and the datasets used in this study are publicly available through our linear B-cell epitope prediction server, BCPREDS, at: http://ailab.cs.iastate.edu/bcpreds/.
{"title":"Predicting flexible length linear B-cell epitopes.","authors":"Yasser El-Manzalawy, Drena Dobbs, Vasant Honavar","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Identifying B-cell epitopes play an important role in vaccine design, immunodiagnostic tests, and antibody production. Therefore, computational tools for reliably predicting B-cell epitopes are highly desirable. We explore two machine learning approaches for predicting flexible length linear B-cell epitopes. The first approach utilizes four sequence kernels for determining a similarity score between any arbitrary pair of variable length sequences. The second approach utilizes four different methods of mapping a variable length sequence into a fixed length feature vector. Based on our empirical comparisons, we propose FBCPred, a novel method for predicting flexible length linear B-cell epitopes using the subsequence kernel. Our results demonstrate that FBCPred significantly outperforms all other classifiers evaluated in this study. An implementation of FBCPred and the datasets used in this study are publicly available through our linear B-cell epitope prediction server, BCPREDS, at: http://ailab.cs.iastate.edu/bcpreds/.</p>","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 ","pages":"121-32"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3400678/pdf/nihms147917.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"28337238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Detecting non-coding RNAs (ncRNAs) in genomic DNA is an important part of annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high computational cost when used for search. This cost can be reduced by using a filter to exclude sequence that is unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite recent advances, designing an efficient filter that can detect nearly all ncRNA instances while excluding most irrelevant sequences remains challenging. This work proposes a systematic procedure to convert a CM for an ncRNA family to a secondary structure profile (SSP), which augments a conservation profile with secondary structure information but can still be efficiently scanned against long sequences. We use dynamic programming to estimate an SSP's sensitivity and FP rate, yielding an efficient, fully automated filter design algorithm. Our experiments demonstrate that designed SSP filters can achieve significant speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families, including those with and without strong sequence conservation. For highly structured ncRNA families, including secondary structure conservation yields better performance than using primary sequence conservation alone.
{"title":"Designing secondary structure profiles for fast ncRNA identification.","authors":"Yanni Sun, Jeremy Buhler","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Detecting non-coding RNAs (ncRNAs) in genomic DNA is an important part of annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high computational cost when used for search. This cost can be reduced by using a filter to exclude sequence that is unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite recent advances, designing an efficient filter that can detect nearly all ncRNA instances while excluding most irrelevant sequences remains challenging. This work proposes a systematic procedure to convert a CM for an ncRNA family to a secondary structure profile (SSP), which augments a conservation profile with secondary structure information but can still be efficiently scanned against long sequences. We use dynamic programming to estimate an SSP's sensitivity and FP rate, yielding an efficient, fully automated filter design algorithm. Our experiments demonstrate that designed SSP filters can achieve significant speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families, including those with and without strong sequence conservation. For highly structured ncRNA families, including secondary structure conservation yields better performance than using primary sequence conservation alone.</p>","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 ","pages":"145-56"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"28337240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2008-01-01. DOI: 10.1142/9781848162648_0019
Chris Kauffman, H. Rangwala, G. Karypis
In order to improve the prediction of protein-ligand binding sites through homology modeling, we incorporate knowledge of the binding residues into the modeling framework. Residues are identified as binding or nonbinding based on their true labels as well as labels predicted from structure and sequence. The sequence predictions were made using a support vector machine framework which employs a sophisticated window-based kernel. Binding labels are used with a very sensitive sequence alignment method to align the target and template. We search the relevant parameters governing the alignment process for optimal values. Based on our results, homology models of the binding site can be improved if a priori knowledge of the binding residues is available. For target-template pairs with low sequence identity and high structural diversity, our sequence-based prediction method provided sufficient information to realize this improvement.
{"title":"Improving homology models for protein-ligand binding sites.","authors":"Chris Kauffman, H. Rangwala, G. Karypis","doi":"10.1142/9781848162648_0019","DOIUrl":"https://doi.org/10.1142/9781848162648_0019","url":null,"abstract":"In order to improve the prediction of protein-ligand binding sites through homology modeling, we incorporate knowledge of the binding residues into the modeling framework. Residues are identified as binding or nonbinding based on their true labels as well as labels predicted from structure and sequence. The sequence predictions were made using a support vector machine framework which employs a sophisticated window-based kernel. Binding labels are used with a very sensitive sequence alignment method to align the target and template. Relevant parameters governing the alignment process are searched for optimal values. Based on our results, homology models of the binding site can be improved if a priori knowledge of the binding residues is available. For target-template pairs with low sequence identity and high structural diversity our sequence-based prediction method provided sufficient information to realize this improvement.","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 1","pages":"211-22"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64003147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2008-01-01. DOI: 10.1142/9781848162648_0029
Aaron M. Smalter, Jun Huan, G. Lushington
In this paper we introduce a novel graph classification algorithm and demonstrate its efficacy in drug design. In our method, we use graphs to model chemical structures and apply a wavelet analysis of graphs to create features capturing graph local topology. We design a novel graph kernel function that utilizes the created features to build predictive models for chemicals. We call the new graph kernel a graph wavelet-alignment kernel. We have evaluated the efficacy of the wavelet-alignment kernel using a set of chemical structure-activity prediction benchmarks. Our results indicate that the use of the kernel function yields performance profiles comparable to, and sometimes exceeding, those of existing state-of-the-art chemical classification approaches. In addition, our results show that the use of wavelet functions significantly decreases the computational cost of graph kernel computation, yielding a more than 10-fold speedup.
{"title":"Graph wavelet alignment kernels for drug virtual screening.","authors":"Aaron M. Smalter, Jun Huan, G. Lushington","doi":"10.1142/9781848162648_0029","DOIUrl":"https://doi.org/10.1142/9781848162648_0029","url":null,"abstract":"In this paper we introduce a novel graph classification algorithm and demonstrate its efficacy in drug design. In our method, we use graphs to model chemical structures and apply a wavelet analysis of graphs to create features capturing graph local topology. We design a novel graph kernel function to utilize the created feature to build predictive models for chemicals. We call the new graph kernel a graph wavelet-alignment kernel. We have evaluated the efficacy of the wavelet-alignment kernel using a set of chemical structure-activity prediction benchmarks. Our results indicate that the use of the kernel function yields performance profiles comparable to, and sometimes exceeding that of the existing state-of-the-art chemical classification approaches. In addition, our results also show that the use of wavelet functions significantly decreases the computational costs for graph kernel computation with more than 10 fold speed up.","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 1","pages":"327-38"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64003816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}