
Latest publications in Computational systems bioinformatics. Computational Systems Bioinformatics Conference

Designing secondary structure profiles for fast ncRNA identification.
Yanni Sun, Jeremy Buhler

Detecting non-coding RNAs (ncRNAs) in genomic DNA is an important part of annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high computational cost when used for search. This cost can be reduced by using a filter to exclude sequence that is unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite recent advances, designing an efficient filter that can detect nearly all ncRNA instances while excluding most irrelevant sequences remains challenging. This work proposes a systematic procedure to convert a CM for an ncRNA family to a secondary structure profile (SSP), which augments a conservation profile with secondary structure information but can still be efficiently scanned against long sequences. We use dynamic programming to estimate an SSP's sensitivity and false positive (FP) rate, yielding an efficient, fully automated filter design algorithm. Our experiments demonstrate that designed SSP filters can achieve significant speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families, including those with and without strong sequence conservation. For highly structured ncRNA families, including secondary structure conservation yields better performance than using primary sequence conservation alone.
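
To make the filtering idea concrete, the following sketch scans a sequence with a toy profile that combines per-column nucleotide scores with a bonus for complementary bases at columns annotated as a base pair. The profile, pairing bonus, window length, and threshold are all invented for illustration; this is not the SSP construction or scoring used in the paper.

```python
# Toy "secondary structure profile" scan: per-column nucleotide scores plus a bonus
# when two columns annotated as a base pair are complementary in the window.
# Everything here (profile, pairing bonus, threshold) is invented for illustration.

WINDOW = 8

COLUMN_SCORES = [
    {"A": 1.0, "C": -1.0, "G": -0.5, "U": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "U": -0.5},
    {"A": -0.5, "C": -1.0, "G": 1.0, "U": -1.0},
    {"A": -1.0, "C": -0.5, "G": -1.0, "U": 1.0},
    {"A": 1.0, "C": -1.0, "G": -0.5, "U": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "U": -0.5},
    {"A": -0.5, "C": -1.0, "G": 1.0, "U": -1.0},
    {"A": -1.0, "C": -0.5, "G": -1.0, "U": 1.0},
]

# Columns (i, j) modelled as a base pair; a bonus is added when the window has a
# Watson-Crick or wobble pair at those columns.
PAIRED_COLUMNS = [(0, 7), (1, 6)]
PAIR_BONUS = 2.0
COMPLEMENT = {("A", "U"), ("U", "A"), ("C", "G"), ("G", "C"), ("G", "U"), ("U", "G")}


def window_score(window: str) -> float:
    score = sum(COLUMN_SCORES[k].get(base, -2.0) for k, base in enumerate(window))
    for i, j in PAIRED_COLUMNS:
        if (window[i], window[j]) in COMPLEMENT:
            score += PAIR_BONUS
    return score


def scan(sequence: str, threshold: float):
    """Yield (position, score) for windows that pass the filter threshold."""
    for pos in range(len(sequence) - WINDOW + 1):
        s = window_score(sequence[pos:pos + WINDOW])
        if s >= threshold:
            yield pos, s


if __name__ == "__main__":
    hits = list(scan("GGACGUACGUCCAAACGU", threshold=8.0))
    print(hits)  # only the surviving windows would be handed to the full CM search
```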

{"title":"Designing secondary structure profiles for fast ncRNA identification.","authors":"Yanni Sun,&nbsp;Jeremy Buhler","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Detecting non-coding RNAs (ncRNAs) in genomic DNA is an important part of annotation. However, the most widely used tool for modeling ncRNA families, the covariance model (CM), incurs a high computational cost when used for search. This cost can be reduced by using a filter to exclude sequence that is unlikely to contain the ncRNA of interest, applying the CM only where it is likely to match strongly. Despite recent advances, designing an efficient filter that can detect nearly all ncRNA instances while excluding most irrelevant sequences remains challenging. This work proposes a systematic procedure to convert a CM for an ncRNA family to a secondary structure profile (SSP), which augments a conservation profile with secondary structure information but can still be efficiently scanned against long sequences. We use dynamic programming to estimate an SSP's sensitivity and FP rate, yielding an efficient, fully automated filter design algorithm. Our experiments demonstrate that designed SSP filters can achieve significant speedup over unfiltered CM search while maintaining high sensitivity for various ncRNA families, including those with and without strong sequence conservation. For highly structured ncRNA families, including secondary structure conservation yields better performance than using primary sequence conservation alone.</p>","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 ","pages":"145-56"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"28337240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Knowledge representation and data mining for biological imaging.
Wamiq M Ahmed

Biological and pharmaceutical research relies heavily on microscopic imaging of cell populations to understand their structure and function. Much work has been done on automated analysis of biological images, but image analysis tools are generally focused only on extracting quantitative information for validating a particular hypothesis. Images contain much more information than is normally required for testing individual hypotheses. The lack of symbolic knowledge representation schemes for representing semantic image information and the absence of knowledge mining tools are the biggest obstacles to utilizing the full information content of these images. In this paper we first present a graph-based scheme for integrated representation of semantic biological knowledge contained in cellular images acquired in spatial, spectral, and temporal dimensions. We then present a spatio-temporal knowledge mining framework for extracting non-trivial and previously unknown association rules from image data sets. This mechanism can change the role of biological imaging from a tool used to validate hypotheses to one used for automatically generating new hypotheses. Results for an apoptosis screen are also presented.
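
As a rough illustration of the rule-mining step, the sketch below derives simple "if X then Y" association rules from toy per-cell annotation sets using support and confidence thresholds. The cell records, attribute names, and thresholds are invented, and the sketch ignores the spatial and temporal dimensions that the paper's framework handles.

```python
# Toy association-rule mining over per-cell annotations, illustrating the general idea
# of deriving "if X then Y" rules from imaging-derived records. Records, attribute
# names, and thresholds are invented for illustration.

from itertools import combinations

cells = [
    {"nuclear_fragmentation", "membrane_blebbing", "caspase3_positive"},
    {"nuclear_fragmentation", "caspase3_positive"},
    {"membrane_blebbing", "caspase3_positive"},
    {"nuclear_fragmentation", "membrane_blebbing", "caspase3_positive"},
    {"flat_morphology"},
]

MIN_SUPPORT = 0.4
MIN_CONFIDENCE = 0.8


def support(itemset):
    return sum(itemset <= cell for cell in cells) / len(cells)


items = sorted({a for cell in cells for a in cell})
frequent = [frozenset(c) for n in (1, 2)
            for c in combinations(items, n) if support(set(c)) >= MIN_SUPPORT]

# Emit rules antecedent -> consequent with enough confidence.
for itemset in frequent:
    if len(itemset) < 2:
        continue
    for size in range(1, len(itemset)):
        for antecedent in combinations(itemset, size):
            antecedent = frozenset(antecedent)
            consequent = itemset - antecedent
            conf = support(itemset) / support(antecedent)
            if conf >= MIN_CONFIDENCE:
                print(f"{set(antecedent)} -> {set(consequent)} "
                      f"(support={support(itemset):.2f}, confidence={conf:.2f})")
```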

{"title":"Knowledge representation and data mining for biological imaging.","authors":"Wamiq M Ahmed","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Biological and pharmaceutical research relies heavily on microscopically imaging cell populations for understanding their structure and function. Much work has been done on automated analysis of biological images, but image analysis tools are generally focused only on extracting quantitative information for validating a particular hypothesis. Images contain much more information than is normally required for testing individual hypotheses. The lack of symbolic knowledge representation schemes for representing semantic image information and the absence of knowledge mining tools are the biggest obstacles in utilizing the full information content of these images. In this paper we first present a graph-based scheme for integrated representation of semantic biological knowledge contained in cellular images acquired in spatial, spectral, and temporal dimensions. We then present a spatio-temporal knowledge mining framework for extracting non-trivial and previously unknown association rules from image data sets. This mechanism can change the role of biological imaging from a tool used to validate hypotheses to one used for automatically generating new hypotheses. Results for an apoptosis screen are also presented.</p>","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 ","pages":"311-4"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"28336041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Scalable computation of kinship and identity coefficients on large pedigrees.
En Cheng, Brendan Elliott, Z Meral Ozsoyoglu

With the rapidly expanding field of medical genetics and genetic counseling, genealogy information is becoming increasingly abundant. An important computation on pedigree data is the calculation of identity coefficients, which provide a complete description of the degree of relatedness of a pair of individuals. The areas of application of identity coefficients are numerous and diverse, from genetic counseling to disease tracking, and thus, the computation of identity coefficients merits special attention. However, the computation of identity coefficients is not done directly, but rather as the final step after computing a set of generalized kinship coefficients. In this paper, we first propose a novel Path-Counting Formula for calculating generalized kinship coefficients, which is motivated by Wright's path-counting method for computing the inbreeding coefficient for an individual. We then present an efficient and scalable scheme for calculating generalized kinship coefficients on large pedigrees using NodeCodes, a special encoding scheme for expediting the evaluation of queries on pedigree graph structures. We also perform experiments for evaluating the efficiency of our method, and compare it with the performance of the traditional recursive algorithm for three individuals. Experimental results demonstrate that the resulting scheme is more scalable and efficient than the traditional recursive methods for computing generalized kinship coefficients.
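
For background, the sketch below implements the traditional recursive kinship-coefficient computation that the paper uses as its baseline; the path-counting formula and NodeCodes encoding themselves are not reproduced here. The toy pedigree and individual names are invented for illustration.

```python
# Minimal sketch of the classical recursive kinship-coefficient computation (the
# baseline the paper compares against). The toy pedigree is invented for illustration.

from functools import lru_cache

# individual -> (father, mother); founders have (None, None)
PARENTS = {
    "gf": (None, None), "gm": (None, None),
    "f":  ("gf", "gm"), "m":  (None, None),
    "c1": ("f", "m"),   "c2": ("f", "m"),
}


@lru_cache(maxsize=None)
def depth(x):
    """Number of generations separating x from its most distant recorded ancestor."""
    f, m = PARENTS[x]
    if f is None and m is None:
        return 0
    return 1 + max(depth(p) for p in (f, m) if p is not None)


@lru_cache(maxsize=None)
def kinship(a, b):
    """Probability that a random allele from a and one from b are identical by descent."""
    if a is None or b is None:
        return 0.0
    if a == b:
        f, m = PARENTS[a]
        return 0.5 * (1.0 + kinship(f, m))
    # Recurse through the parents of the individual that cannot be an ancestor of the
    # other (the one farther from the founders).
    if depth(a) < depth(b):
        a, b = b, a
    f, m = PARENTS[a]
    return 0.5 * (kinship(f, b) + kinship(m, b))


if __name__ == "__main__":
    print(kinship("c1", "c2"))  # full siblings with non-inbred parents -> 0.25
    print(kinship("f", "c1"))   # parent-offspring -> 0.25
```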

{"title":"Scalable computation of kinship and identity coefficients on large pedigrees.","authors":"En Cheng,&nbsp;Brendan Elliott,&nbsp;Z Meral Ozsoyoglu","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>With the rapidly expanding field of medical genetics and genetic counseling, genealogy information is becoming increasingly abundant. An important computation on pedigree data is the calculation of identity coefficients, which provide a complete description of the degree of relatedness of a pair of individuals. The areas of application of identity coefficients are numerous and diverse, from genetic counseling to disease tracking, and thus, the computation of identity coefficients merits special attention. However, the computation of identity coefficients is not done directly, but rather as the final step after computing a set of generalized kinship coefficients. In this paper, we first propose a novel Path-Counting Formula for calculating generalized kinship coefficients, which is motivated by Wright's path-counting method for computing the inbreeding coefficient for an individual. We then present an efficient and scalable scheme for calculating generalized kinship coefficients on large pedigrees using NodeCodes, a special encoding scheme for expediting the evaluation of queries on pedigree graph structures. We also perform experiments for evaluating the efficiency of our method, and compare it with the performance of the traditional recursive algorithm for three individuals. Experimental results demonstrate that the resulting scheme is more scalable and efficient than the traditional recursive methods for computing generalized kinship coefficients.</p>","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 ","pages":"27-36"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"28336170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Estimating support for protein-protein interaction data with applications to function prediction.
Erliang Zeng, Chris Ding, Giri Narasimhan, Stephen R Holbrook

Almost every cellular process requires the interaction of protein pairs or larger protein complexes. High-throughput protein-protein interaction (PPI) data have been generated using techniques such as yeast two-hybrid screens and mass spectrometry, among others. Such data provide a new perspective for predicting protein functions and constructing protein-protein interaction networks, and many recent algorithms have been developed for this purpose. However, PPI data generated with high-throughput techniques contain a large number of false positives. In this paper, we propose a novel method to evaluate the support for PPI data based on Gene Ontology information. When the semantic similarity between genes is computed from Gene Ontology annotations using Resnik's formula, our results show that the PPI data can be modeled as a mixture model, predicated on the assumption that true protein-protein interactions have higher support than the false positives in the data. Semantic similarity between genes thus serves as a metric of support for PPI data. Building on this metric, we also propose new function prediction approaches, which outperform their conventional counterparts, along with new evaluation methods.
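
The central quantity in the abstract is Resnik's semantic similarity. The sketch below computes it on a toy ontology DAG: the similarity of two terms is the information content, -log p, of their most informative common ancestor. The term names, DAG, and probabilities are invented; a real computation would use the Gene Ontology with genome-wide annotation frequencies, and gene-level similarity would then be aggregated over the genes' annotated terms.

```python
# Minimal sketch of Resnik's semantic similarity on a toy ontology DAG.
# Term names, the DAG, and the probabilities are invented for illustration.

import math

# child -> set of parents (toy DAG rooted at "root")
PARENTS = {
    "root": set(),
    "metabolism": {"root"},
    "transport": {"root"},
    "ion_transport": {"transport"},
    "sugar_metabolism": {"metabolism"},
}

# p(term): fraction of all annotations falling in the term or its descendants (toy values).
P = {"root": 1.0, "metabolism": 0.4, "transport": 0.5,
     "ion_transport": 0.2, "sugar_metabolism": 0.1}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    result, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in result:
                result.add(parent)
                stack.append(parent)
    return result


def resnik(t1, t2):
    """Information content of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(-math.log(P[a]) for a in common)


if __name__ == "__main__":
    print(resnik("ion_transport", "sugar_metabolism"))  # only "root" is shared -> 0.0
    print(resnik("ion_transport", "transport"))         # -log 0.5, about 0.69
```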

{"title":"Estimating support for protein-protein interaction data with applications to function prediction.","authors":"Erliang Zeng,&nbsp;Chris Ding,&nbsp;Giri Narasimhan,&nbsp;Stephen R Holbrook","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Almost every cellular process requires the interactions of pairs or larger complexes of proteins. High throughput protein-protein interaction (PPI) data have been generated using techniques such as the yeast two-hybrid systems, mass spectrometry method, and many more. Such data provide us with a new perspective to predict protein functions and to generate protein-protein interaction networks, and many recent algorithms have been developed for this purpose. However, PPI data generated using high throughput techniques contain a large number of false positives. In this paper, we have proposed a novel method to evaluate the support for PPI data based on gene ontology information. If the semantic similarity between genes is computed using gene ontology information and using Resnik's formula, then our results show that we can model the PPI data as a mixture model predicated on the assumption that true protein-protein interactions will have higher support than the false positives in the data. Thus semantic similarity between genes serves as a metric of support for PPI data. Taking it one step further, new function prediction approaches are also being proposed with the help of the proposed metric of the support for the PPI data. These new function prediction approaches outperform their conventional counterparts. New evaluation methods are also proposed.</p>","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 ","pages":"73-84"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"28336174","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Using relative importance methods to model high-throughput gene perturbation screens.
Ying Jin, Naren Ramakrishnan, Lenwood S Heath, Richard F Helm

With the advent of high-throughput gene perturbation screens (e.g., RNAi assays, genome-wide deletion mutants), modeling the complex relationship between genes and phenotypes has become a paramount problem. One broad class of methods uses 'guilt by association' to impute phenotypes to genes based on the interactions between the given gene and other genes with known phenotypes. But these methods are inadequate for genes that have no cataloged interactions but which nevertheless are known to result in important phenotypes. In this paper, we present an approach that first models relationships between phenotypes using the notion of 'relative importance' and then uses these derived relationships to make phenotype predictions. Besides improved accuracy on S. cerevisiae deletion mutants and C. elegans knock-down datasets, we show how our approach provides insight into relations between phenotypes.
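
As a loose illustration of the overall idea (weight phenotype-phenotype relationships, then use the weights for prediction), the sketch below estimates weights as conditional co-occurrence frequencies in a toy gene-phenotype table and scores candidate phenotypes for a gene from the phenotypes it is known to show. This is a simplified stand-in; the paper's relative-importance estimation is a different, more principled procedure.

```python
# Illustrative stand-in: derive phenotype-to-phenotype weights from co-occurrence and
# use them to rank candidate phenotypes. Data and weighting scheme are invented.

from collections import defaultdict

# gene -> set of observed phenotypes (toy data)
OBSERVED = {
    "g1": {"slow_growth", "heat_sensitive"},
    "g2": {"slow_growth", "heat_sensitive", "small_colony"},
    "g3": {"slow_growth", "small_colony"},
    "g4": {"heat_sensitive"},
}


def phenotype_weights(observed):
    """weight[a][b] ~ P(phenotype b | phenotype a), estimated from co-occurrence."""
    count = defaultdict(int)
    pair = defaultdict(lambda: defaultdict(int))
    for phenos in observed.values():
        for a in phenos:
            count[a] += 1
            for b in phenos:
                if a != b:
                    pair[a][b] += 1
    return {a: {b: pair[a][b] / count[a] for b in pair[a]} for a in count}


def predict(known_phenotypes, weights):
    """Score unobserved phenotypes for a gene from the phenotypes it is known to show."""
    scores = defaultdict(float)
    for a in known_phenotypes:
        for b, w in weights.get(a, {}).items():
            if b not in known_phenotypes:
                scores[b] += w
    return sorted(scores.items(), key=lambda kv: -kv[1])


if __name__ == "__main__":
    weights = phenotype_weights(OBSERVED)
    # A gene known only to be heat sensitive: which other phenotypes are most likely?
    print(predict({"heat_sensitive"}, weights))
```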

{"title":"Using relative importance methods to model high-throughput gene perturbation screens.","authors":"Ying Jin,&nbsp;Naren Ramakrishnan,&nbsp;Lenwood S Heath,&nbsp;Richard F Helm","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>With the advent of high-throughput gene perturbation screens (e.g., RNAi assays, genome-wide deletion mutants), modeling the complex relationship between genes and phenotypes has become a paramount problem. One broad class of methods uses 'guilt by association' methods to impute phenotypes to genes based on the interactions between the given gene and other genes with known phenotypes. But these methods are inadequate for genes that have no cataloged interactions but which nevertheless are known to result in important phenotypes. In this paper, we present an approach to first model relationships between phenotypes using the notion of 'relative importance' and subsequently use these derived relationships to make phenotype predictions. Besides improved accuracy on S. cerevisiae deletion mutants and C. elegans knock-down datasets, we show how our approach sheds insight into relations between phenotypes.</p>","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 ","pages":"225-35"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"28337725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Predicting flexible length linear B-cell epitopes.
Y. El-Manzalawy, D. Dobbs, V. Honavar
The identification of B-cell epitopes plays an important role in vaccine design, immunodiagnostic tests, and antibody production. Therefore, computational tools for reliably predicting B-cell epitopes are highly desirable. We explore two machine learning approaches for predicting flexible length linear B-cell epitopes. The first approach utilizes four sequence kernels for determining a similarity score between any arbitrary pair of variable length sequences. The second approach utilizes four different methods of mapping a variable length sequence into a fixed length feature vector. Based on our empirical comparisons, we propose FBCPred, a novel method for predicting flexible length linear B-cell epitopes using the subsequence kernel. Our results demonstrate that FBCPred significantly outperforms all other classifiers evaluated in this study. An implementation of FBCPred and the datasets used in this study are publicly available through our linear B-cell epitope prediction server, BCPREDS, at: http://ailab.cs.iastate.edu/bcpreds/.
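
As a simplified illustration of the kernel idea, the sketch below compares two variable-length peptides by counting shared contiguous k-mers (a spectrum-style kernel). FBCPred itself uses a subsequence kernel, which also rewards non-contiguous, gapped matches; the peptides and the value of k here are invented.

```python
# Simple k-mer-count ("spectrum"-style) comparison of variable-length peptides, as an
# illustration of the general kernel idea; not the subsequence kernel used by FBCPred.

from collections import Counter


def kmer_counts(seq: str, k: int = 3) -> Counter:
    """Map a variable-length sequence to a sparse k-mer count vector."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s: str, t: str, k: int = 3) -> int:
    """Dot product of the two k-mer count vectors."""
    cs, ct = kmer_counts(s, k), kmer_counts(t, k)
    return sum(cs[kmer] * ct[kmer] for kmer in cs if kmer in ct)


if __name__ == "__main__":
    epitope_like = "GSSGSGSNNSGGS"   # toy peptide sequences
    candidate = "AAGSGSNNSGAAA"
    print(spectrum_kernel(epitope_like, candidate))  # shared 3-mers drive the score
```
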
{"title":"Predicting flexible length linear B-cell epitopes.","authors":"Y. El-Manzalawy, D. Dobbs, V. Honavar","doi":"10.1142/9781848162648_0011","DOIUrl":"https://doi.org/10.1142/9781848162648_0011","url":null,"abstract":"Identifying B-cell epitopes play an important role in vaccine design, immunodiagnostic tests, and antibody production. Therefore, computational tools for reliably predicting B-cell epitopes are highly desirable. We explore two machine learning approaches for predicting flexible length linear B-cell epitopes. The first approach utilizes four sequence kernels for determining a similarity score between any arbitrary pair of variable length sequences. The second approach utilizes four different methods of mapping a variable length sequence into a fixed length feature vector. Based on our empirical comparisons, we propose FBCPred, a novel method for predicting flexible length linear B-cell epitopes using the subsequence kernel. Our results demonstrate that FBCPred significantly outperforms all other classifiers evaluated in this study. An implementation of FBCPred and the datasets used in this study are publicly available through our linear B-cell epitope prediction server, BCPREDS, at: http://ailab.cs.iastate.edu/bcpreds/.","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 1","pages":"121-32"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1142/9781848162648_0011","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64003366","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 16
Matching of structural motifs using hashing on residue labels and geometric filtering for protein function prediction.
Mark Moll, Lydia E Kavraki
There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Our focus is on methods that determine binding site similarity. Although several such methods exist, it still remains a challenging problem to quickly find all functionally-related matches for structural motifs in large data sets with high specificity. In this context, a structural motif is a set of 3D points annotated with physicochemical information that characterize a molecular function. We propose a new method called LabelHash that creates hash tables of n-tuples of residues for a set of targets. Using these hash tables, we can quickly look up partial matches to a motif and expand those matches to complete matches. We show that by applying only very mild geometric constraints we can find statistically significant matches with extremely high specificity in very large data sets and for very general structural motifs. We demonstrate that our method requires a reasonable amount of storage when employing a simple geometric filter and further improves on the specificity of our previous work while maintaining very high sensitivity. Our algorithm is evaluated on 20 homolog classes and a non-redundant version of the Protein Data Bank as our background data set. We use cluster analysis to analyze why certain classes of homologs are more difficult to classify than others. The LabelHash algorithm is implemented on a web server at http://kavrakilab.org/labelhash/.
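
The sketch below illustrates the general flavor of the approach: index residue-label tuples of a target structure in a hash table, look up the label tuple of a query motif to obtain candidate residue sets, and keep only candidates whose pairwise distances roughly match the motif. The toy structure, coordinates, labels, and distance tolerance are invented; this is not the LabelHash implementation.

```python
# Toy sketch of hashing residue-label tuples and applying a mild geometric filter.
# The structure, labels, coordinates, and tolerance are invented for illustration.

from collections import defaultdict
from itertools import permutations
import math

# Target "structure": residue id -> (label, (x, y, z)) with toy values.
TARGET = {
    1: ("HIS", (0.0, 0.0, 0.0)),
    2: ("ASP", (3.0, 0.0, 0.0)),
    3: ("SER", (0.0, 4.0, 0.0)),
    4: ("HIS", (9.0, 9.0, 9.0)),
}


def build_index(target, n=3):
    """Hash table: n-tuple of residue labels -> residue-id tuples carrying those labels."""
    index = defaultdict(list)
    for ids in permutations(target, n):
        labels = tuple(target[i][0] for i in ids)
        index[labels].append(ids)
    return index


def match(motif, target, index, tolerance=1.0):
    """Look up candidates by label, keep those whose pairwise distances fit the motif."""
    labels = tuple(label for label, _ in motif)
    hits = []
    for ids in index.get(labels, []):
        ok = True
        for a in range(len(ids)):
            for b in range(a + 1, len(ids)):
                d_motif = math.dist(motif[a][1], motif[b][1])
                d_target = math.dist(target[ids[a]][1], target[ids[b]][1])
                if abs(d_motif - d_target) > tolerance:
                    ok = False
        if ok:
            hits.append(ids)
    return hits


if __name__ == "__main__":
    # A 3-residue, catalytic-triad-like motif: (label, coordinates), toy values.
    motif = [("HIS", (0.0, 0.0, 0.0)), ("ASP", (3.2, 0.0, 0.0)), ("SER", (0.0, 3.9, 0.0))]
    index = build_index(TARGET, n=3)
    print(match(motif, TARGET, index))  # expect [(1, 2, 3)]
```
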
{"title":"Matching of structural motifs using hashing on residue labels and geometric filtering for protein function prediction.","authors":"Mark Moll, L. Kavraki","doi":"10.1142/9781848162648_0014","DOIUrl":"https://doi.org/10.1142/9781848162648_0014","url":null,"abstract":"There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Our focus is on methods that determine binding site similarity. Although several such methods exist, it still remains a challenging problem to quickly find all functionally-related matches for structural motifs in large data sets with high specificity. In this context, a structural motif is a set of 3D points annotated with physicochemical information that characterize a molecular function. We propose a new method called LabelHash that creates hash tables of n-tuples of residues for a set of targets. Using these hash tables, we can quickly look up partial matches to a motif and expand those matches to complete matches. We show that by applying only very mild geometric constraints we can find statistically significant matches with extremely high specificity in very large data sets and for very general structural motifs. We demonstrate that our method requires a reasonable amount of storage when employing a simple geometric filter and further improves on the specificity of our previous work while maintaining very high sensitivity. Our algorithm is evaluated on 20 homolog classes and a non-redundant version of the Protein Data Bank as our background data set. We use cluster analysis to analyze why certain classes of homologs are more difficult to classify than others. The LabelHash algorithm is implemented on a web server at http://kavrakilab.org/labelhash/.","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 1","pages":"157-68"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64003451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
GaborLocal: peak detection in mass spectrum by Gabor filters and Gaussian local maxima.
Nha Nguyen, Heng Huang, S. Oraintara, An P. N. Vo
Mass spectrometry (MS) is increasingly being used to discover disease-related proteomic patterns. The peak detection step is one of the most important steps in the typical analysis of MS data. Recently, many new algorithms have been proposed to increase the true positive rate of peak detection while keeping the false positive rate low. Most of them follow one of two approaches: denoising or decomposition. Previous studies suggest that decomposing the MS data is the more promising of the two. In this paper, we propose a new method named GaborLocal, which can detect more true peaks at a very low false positive rate. Gaussian local maxima are employed for peak detection because they are robust to noise in signals. Moreover, the maximum rank of peaks is defined, for the first time, to identify peaks instead of using the signal-to-noise ratio, and the Gabor filter is used to decompose the raw MS signal. We apply the proposed method to a real SELDI-TOF spectrum with known polypeptide positions. The experimental results demonstrate that our method outperforms other commonly used methods in terms of the receiver operating characteristic (ROC) curve.
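
The sketch below illustrates only the "Gaussian local maxima" ingredient: smooth a noisy spectrum with a Gaussian and keep local maxima of the smoothed signal. The Gabor-filter decomposition and the rank-based peak selection described in the abstract are not reproduced, and the synthetic spectrum and parameters are invented.

```python
# Minimal sketch: Gaussian smoothing followed by local-maxima detection on a
# synthetic spectrum. Parameters and data are invented for illustration.

import numpy as np
from scipy.ndimage import gaussian_filter1d

rng = np.random.default_rng(0)

# Synthetic "spectrum": two Gaussian peaks plus noise.
x = np.arange(1000)
signal = 5.0 * np.exp(-(x - 300) ** 2 / 50.0) + 3.0 * np.exp(-(x - 700) ** 2 / 80.0)
spectrum = signal + rng.normal(scale=0.3, size=x.size)

smoothed = gaussian_filter1d(spectrum, sigma=5.0)

# Local maxima of the smoothed signal, with a small amplitude cutoff.
is_peak = (smoothed[1:-1] > smoothed[:-2]) & (smoothed[1:-1] > smoothed[2:])
peak_positions = x[1:-1][is_peak & (smoothed[1:-1] > 1.0)]

print(peak_positions)  # expected to lie near 300 and 700
```
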
{"title":"GaborLocal: peak detection in mass spectrum by Gabor filters and Gaussian local maxima.","authors":"Nha Nguyen, Heng Huang, S. Oraintara, An P. N. Vo","doi":"10.1142/9781848162648_0008","DOIUrl":"https://doi.org/10.1142/9781848162648_0008","url":null,"abstract":"Mass Spectrometry (MS) is increasingly being used to discover disease related proteomic patterns. The peak detection step is one of most important steps in the typical analysis of MS data. Recently, many new algorithms have been proposed to increase true position rate with low false position rate in peak detection. Most of them follow two approaches: one is denoising approach and the other one is decomposing approach. In the previous studies, the decomposition of MS data method shows more potential than the first one. In this paper, we propose a new method named GaborLocal which can detect more true peaks with a very low false position rate. The Gaussian local maxima is employed for peak detection, because it is robust to noise in signals. Moreover, the maximum rank of peaks is defined at the first time to identify peaks instead of using the signal-to-noise ratio and the Gabor filter is used to decompose the raw MS signal. We perform the proposed method on the real SELDI-TOF spectrum with known polypeptide positions. The experimental results demonstrate our method outperforms other common used methods in the receiver operating characteristic (ROC) curve.","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"7 1","pages":"85-96"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1142/9781848162648_0008","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64003306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
A Hausdorff-based NOE assignment algorithm using protein backbone determined from residual dipolar couplings and rotamer patterns.
Jianyang Zeng, C. Tripathy, Pei Zhou, B. Donald
High-throughput structure determination based on solution Nuclear Magnetic Resonance (NMR) spectroscopy plays an important role in structural genomics. One of the main bottlenecks in NMR structure determination is the interpretation of NMR data to obtain a sufficient number of accurate distance restraints by assigning nuclear Overhauser effect (NOE) spectral peaks to pairs of protons. The difficulty in automated NOE assignment mainly lies in the ambiguities arising both from the resonance degeneracy of chemical shifts and from the uncertainty due to experimental errors in NOE peak positions. In this paper we present a novel NOE assignment algorithm, called HAusdorff-based NOE Assignment (HANA), that starts with a high-resolution protein backbone computed using only two residual dipolar couplings (RDCs) per residue [37, 39], employs a Hausdorff-based pattern matching technique to deduce similarity between experimental and back-computed NOE spectra for each rotamer from a statistically diverse library, and drives the selection of optimal position-specific rotamers for filtering ambiguous NOE assignments. Our algorithm runs in time O(tn^3 + tn log t), where t is the maximum number of rotamers per residue and n is the size of the protein. Application of our algorithm on biological NMR data for three proteins, namely, human ubiquitin, the zinc finger domain of the human DNA Y-polymerase Eta (pol η) and the human Set2-Rpb1 interacting domain (hSRI) demonstrates that our algorithm overcomes spectral noise to achieve more than 90% assignment accuracy. Additionally, the final structures calculated using our automated NOE assignments have backbone RMSD < 1.7 Å and all-heavy-atom RMSD < 2.5 Å from reference structures that were determined either by X-ray crystallography or traditional NMR approaches. These results show that our NOE assignment algorithm can be successfully applied to protein NMR spectra to obtain high-quality structures.
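
The sketch below shows a plain Hausdorff-style distance between two toy 2-D peak lists, the kind of pattern-matching measure the abstract refers to. The peak values are invented, and this is not the HANA scoring function itself.

```python
# Minimal sketch of a symmetric Hausdorff distance between two 2-D peak lists.
# The peak coordinates are toy (1H, 1H) chemical-shift pairs invented for illustration.

import math


def directed_hausdorff(peaks_a, peaks_b):
    """Max over points in A of the distance to the nearest point in B."""
    return max(min(math.dist(a, b) for b in peaks_b) for a in peaks_a)


def hausdorff(peaks_a, peaks_b):
    return max(directed_hausdorff(peaks_a, peaks_b),
               directed_hausdorff(peaks_b, peaks_a))


if __name__ == "__main__":
    back_computed = [(4.21, 1.35), (3.95, 0.88), (2.10, 1.02)]   # from a candidate rotamer
    experimental = [(4.19, 1.36), (3.97, 0.90), (2.60, 1.50)]    # observed NOE peaks
    print(hausdorff(back_computed, experimental))  # smaller means a better match
```
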
{"title":"A HAUSDORFF-BASED NOE ASSIGNMENT ALGORITHM USING PROTEIN BACKBONE DETERMINED FROM RESIDUAL DIPOLAR COUPLINGS AND ROTAMER PATTERNS.","authors":"Jianyang Zeng, C. Tripathy, Pei Zhou, B. Donald","doi":"10.1142/9781848162648_0015","DOIUrl":"https://doi.org/10.1142/9781848162648_0015","url":null,"abstract":"High-throughput structure determination based on solution Nuclear Magnetic Resonance (NMR) spectroscopy plays an important role in structural genomics. One of the main bottlenecks in NMR structure determination is the interpretation of NMR data to obtain a sufficient number of accurate distance restraints by assigning nuclear Overhauser effect (NOE) spectral peaks to pairs of protons. The difficulty in automated NOE assignment mainly lies in the ambiguities arising both from the resonance degeneracy of chemical shifts and from the uncertainty due to experimental errors in NOE peak positions. In this paper we present a novel NOE assignment algorithm, called HAusdorff-based NOE Assignment (HANA), that starts with a high-resolution protein backbone computed using only two residual dipolar couplings (RDCs) per residue37, 39, employs a Hausdorff-based pattern matching technique to deduce similarity between experimental and back-computed NOE spectra for each rotamer from a statistically diverse library, and drives the selection of optimal position-specific rotamers for filtering ambiguous NOE assignments. Our algorithm runs in time O(tn(3) +tn log t), where t is the maximum number of rotamers per residue and n is the size of the protein. Application of our algorithm on biological NMR data for three proteins, namely, human ubiquitin, the zinc finger domain of the human DNA Y-polymerase Eta (pol η) and the human Set2-Rpb1 interacting domain (hSRI) demonstrates that our algorithm overcomes spectral noise to achieve more than 90% assignment accuracy. Additionally, the final structures calculated using our automated NOE assignments have backbone RMSD < 1.7 Å and all-heavy-atom RMSD < 2.5 Å from reference structures that were determined either by X-ray crystallography or traditional NMR approaches. These results show that our NOE assignment algorithm can be successfully applied to protein NMR spectra to obtain high-quality structures.","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"2008 1","pages":"169-181"},"PeriodicalIF":0.0,"publicationDate":"2008-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64003466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4