Biological and pharmaceutical research relies heavily on microscopic imaging of cell populations to understand their structure and function. Much work has been done on the automated analysis of biological images, but image analysis tools generally focus only on extracting the quantitative information needed to validate a particular hypothesis. Images contain far more information than is normally required to test an individual hypothesis. The lack of symbolic knowledge representation schemes for semantic image information and the absence of knowledge mining tools are the biggest obstacles to utilizing the full information content of these images. In this paper we first present a graph-based scheme for the integrated representation of semantic biological knowledge contained in cellular images acquired across spatial, spectral, and temporal dimensions. We then present a spatio-temporal knowledge mining framework for extracting non-trivial and previously unknown association rules from image data sets. This mechanism can change the role of biological imaging from a tool used to validate hypotheses to one used to automatically generate new hypotheses. Results for an apoptosis screen are also presented.
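The abstract does not spell out the mining step; for readers unfamiliar with association rule mining, the following minimal Python sketch shows the basic support/confidence computation over symbolic attributes that might be extracted per cell and time point. The attribute names, transactions, and thresholds are hypothetical illustrations, not taken from the paper, and the graph-based representation itself is not reproduced here.

```python
from itertools import combinations

# Hypothetical illustration only: each "transaction" lists symbolic attributes
# extracted for one cell at one time point (e.g., marker state, morphology,
# spatial neighborhood). Attribute names are made up for this sketch.
transactions = [
    {"caspase3+", "membrane_blebbing", "neighbor_apoptotic"},
    {"caspase3+", "membrane_blebbing"},
    {"caspase3-", "normal_morphology"},
    {"caspase3+", "neighbor_apoptotic"},
]

def frequent_itemsets(transactions, min_support=0.5, max_size=2):
    """Brute-force frequent itemset search (Apriori-style pruning omitted)."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    frequent = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t) / n
            if support >= min_support:
                frequent[candidate] = support
    return frequent

def association_rules(frequent, min_confidence=0.8):
    """Derive rules A -> B from frequent itemsets with confidence >= threshold."""
    rules = []
    for itemset, support in frequent.items():
        if len(itemset) < 2:
            continue
        for i in range(1, len(itemset)):
            for antecedent in combinations(itemset, i):
                conf = support / frequent[antecedent]
                if conf >= min_confidence:
                    consequent = tuple(x for x in itemset if x not in antecedent)
                    rules.append((antecedent, consequent, support, conf))
    return rules

rules = association_rules(frequent_itemsets(transactions))
for a, c, s, conf in rules:
    print(f"{a} -> {c}  support={s:.2f} confidence={conf:.2f}")
```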
With the rapid expansion of medical genetics and genetic counseling, genealogy information is becoming increasingly abundant. An important computation on pedigree data is the calculation of identity coefficients, which provide a complete description of the degree of relatedness of a pair of individuals. The applications of identity coefficients are numerous and diverse, from genetic counseling to disease tracking, so their computation merits special attention. However, identity coefficients are not computed directly, but rather as the final step after computing a set of generalized kinship coefficients. In this paper, we first propose a novel Path-Counting Formula for calculating generalized kinship coefficients, motivated by Wright's path-counting method for computing the inbreeding coefficient of an individual. We then present an efficient and scalable scheme for calculating generalized kinship coefficients on large pedigrees using NodeCodes, a special encoding scheme for expediting the evaluation of queries on pedigree graph structures. We also perform experiments to evaluate the efficiency of our method and compare it with the traditional recursive algorithm for three individuals. Experimental results demonstrate that the resulting scheme is more scalable and efficient than the traditional recursive methods for computing generalized kinship coefficients.
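For context, here is a minimal Python sketch of the traditional recursive kinship coefficient computation that the paper uses as its baseline. The toy pedigree, the naming convention (parents sort before their children), and the memoization are illustrative assumptions; the paper's Path-Counting Formula and NodeCodes scheme are not reproduced here.

```python
from functools import lru_cache

# Hypothetical pedigree for illustration: child -> (father, mother), None for founders.
parents = {
    "A": (None, None),
    "B": (None, None),
    "C": ("A", "B"),
    "D": ("A", "B"),
    "E": ("C", "D"),
}

@lru_cache(maxsize=None)
def kinship(a, b):
    """Kinship coefficient phi(a, b): probability that alleles sampled at random,
    one from a and one from b, are identical by descent. Assumes individuals are
    named so that parents sort before their children, which lets us always recurse
    on the 'younger' individual."""
    if a is None or b is None:
        return 0.0
    if a == b:
        f, m = parents[a]
        return 0.5 * (1.0 + kinship(f, m))
    # Recurse on the individual that cannot be an ancestor of the other.
    if a < b:
        a, b = b, a
    f, m = parents[a]
    return 0.5 * (kinship(f, b) + kinship(m, b))

print(kinship("C", "D"))   # full sibs of unrelated founders: 0.25
print(kinship("E", "E"))   # 0.5 * (1 + phi(C, D)) = 0.625
```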
Almost every cellular process requires the interaction of pairs or larger complexes of proteins. High-throughput protein-protein interaction (PPI) data have been generated using techniques such as yeast two-hybrid systems and mass spectrometry. Such data provide a new perspective from which to predict protein functions and to generate protein-protein interaction networks, and many recent algorithms have been developed for this purpose. However, PPI data generated using high-throughput techniques contain a large number of false positives. In this paper, we propose a novel method to evaluate the support for PPI data based on gene ontology information. When the semantic similarity between genes is computed from gene ontology information using Resnik's formula, our results show that the PPI data can be modeled as a mixture model predicated on the assumption that true protein-protein interactions will have higher support than the false positives in the data. Semantic similarity between genes thus serves as a metric of support for PPI data. Taking this one step further, we also propose new function prediction approaches that make use of this support metric. These new approaches outperform their conventional counterparts. New evaluation methods are also proposed.
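As background, Resnik's measure scores two terms by the information content of their most informative common ancestor, IC(c) = -log p(c). The minimal Python sketch below uses an assumed toy fragment of the Gene Ontology; the gene-level aggregation shown (best match over annotation pairs) is one common convention and is an assumption here, since the abstract does not specify how term-level scores are combined.

```python
import math

# Toy GO fragment for illustration only: term -> set of ancestor terms (including itself),
# and term -> annotation probability p(term). Real use would derive these from the
# Gene Ontology DAG and an annotation corpus.
ancestors = {
    "GO:root":   {"GO:root"},
    "GO:parent": {"GO:parent", "GO:root"},
    "GO:a":      {"GO:a", "GO:parent", "GO:root"},
    "GO:b":      {"GO:b", "GO:parent", "GO:root"},
}
prob = {"GO:root": 1.0, "GO:parent": 0.2, "GO:a": 0.05, "GO:b": 0.1}

def ic(term):
    """Information content of a GO term."""
    return -math.log(prob[term])

def resnik(t1, t2):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors[t1] & ancestors[t2]
    return max(ic(c) for c in common)

def gene_similarity(terms1, terms2):
    """One common gene-level extension: best match over annotation pairs."""
    return max(resnik(t1, t2) for t1 in terms1 for t2 in terms2)

# A hypothetical interacting pair annotated with GO:a and GO:b shares GO:parent.
print(gene_similarity({"GO:a"}, {"GO:b"}))  # -log(0.2) ~= 1.61
```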
With the advent of high-throughput gene perturbation screens (e.g., RNAi assays, genome-wide deletion mutants), modeling the complex relationship between genes and phenotypes has become a paramount problem. One broad class of methods uses 'guilt by association' to impute phenotypes to genes based on the interactions between a given gene and other genes with known phenotypes. But these methods are inadequate for genes that have no cataloged interactions yet are known to result in important phenotypes. In this paper, we present an approach that first models relationships between phenotypes using the notion of 'relative importance' and subsequently uses these derived relationships to make phenotype predictions. Besides improved accuracy on S. cerevisiae deletion mutants and C. elegans knock-down datasets, we show how our approach sheds light on relations between phenotypes.
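To make the baseline concrete, here is a minimal Python sketch of plain 'guilt by association' scoring by weighted neighbor voting. The network, edge weights, and phenotype labels are hypothetical, and the paper's 'relative importance'-based method for genes without cataloged interactions is not shown.

```python
from collections import defaultdict

# Hypothetical interaction network (gene -> {neighbor: weight}) and phenotype labels.
interactions = {
    "gene1": {"gene2": 0.9, "gene3": 0.4},
    "gene2": {"gene1": 0.9, "gene4": 0.7},
    "gene3": {"gene1": 0.4},
    "gene4": {"gene2": 0.7},
}
known_phenotypes = {
    "gene2": {"slow_growth"},
    "gene3": {"slow_growth", "sterile"},
    "gene4": {"sterile"},
}

def guilt_by_association(gene, interactions, known_phenotypes):
    """Score candidate phenotypes for `gene` by summing the weights of neighbors
    annotated with each phenotype, normalized by total neighbor weight."""
    scores = defaultdict(float)
    total = 0.0
    for neighbor, weight in interactions.get(gene, {}).items():
        total += weight
        for phenotype in known_phenotypes.get(neighbor, set()):
            scores[phenotype] += weight
    if total == 0.0:
        return {}  # no cataloged interactions: exactly the case the paper targets
    return {p: s / total for p, s in scores.items()}

print(guilt_by_association("gene1", interactions, known_phenotypes))
# {'slow_growth': 1.0, 'sterile': ~0.31}
```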
We present two heuristics for speeding up a time series alignment algorithm that is related to dynamic time warping (DTW). In previous work, we developed our multisegment alignment algorithm to answer similarity queries for toxicogenomic time-series data. Our multisegment algorithm returns more accurate alignments than DTW at the cost of time complexity; the multisegment algorithm is O(n^5) whereas DTW is O(n^2). The first heuristic speeds up our algorithm by a constant factor by restricting alignments to a cone shape in alignment space. The second heuristic restricts the alignments considered to those near one returned by a DTW-like method, reducing the time complexity to O(n^3). Importantly, neither heuristic results in a loss of accuracy.
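For intuition about restricting the alignment search space, the sketch below shows classic O(n^2) DTW with an optional Sakoe-Chiba-style band around the diagonal. This is only an analogy for the idea of pruning alignment space, not the paper's multisegment algorithm or its cone-shaped restriction, and the example sequences are made up.

```python
import math

def dtw_banded(x, y, band=None):
    """Classic dynamic time warping over two numeric sequences, optionally
    restricted to a band of width `band` around the diagonal."""
    n, m = len(x), len(y)
    INF = math.inf
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if band is not None and abs(i - j) > band:
                continue  # cell lies outside the allowed region
            d = abs(x[i - 1] - y[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # insertion
                                 cost[i][j - 1],      # deletion
                                 cost[i - 1][j - 1])  # match
    return cost[n][m]

a = [0.0, 0.5, 1.0, 1.5, 1.0]
b = [0.0, 1.0, 1.5, 1.2, 1.0]
print(dtw_banded(a, b))          # unconstrained alignment cost
print(dtw_banded(a, b, band=1))  # restricted search space, same or larger cost
```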
There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Our focus is on methods that determine binding site similarity. Although several such methods exist, it remains a challenging problem to quickly find all functionally related matches for structural motifs in large data sets with high specificity. In this context, a structural motif is a set of 3D points annotated with physicochemical information that characterizes a molecular function. We propose a new method called LabelHash that creates hash tables of n-tuples of residues for a set of targets. Using these hash tables, we can quickly look up partial matches to a motif and expand those matches to complete matches. We show that by applying only very mild geometric constraints we can find statistically significant matches with extremely high specificity in very large data sets and for very general structural motifs. We demonstrate that our method requires a reasonable amount of storage when employing a simple geometric filter and further improves on the specificity of our previous work while maintaining very high sensitivity. Our algorithm is evaluated on 20 homolog classes with a non-redundant version of the Protein Data Bank as the background data set. We use cluster analysis to examine why certain classes of homologs are more difficult to classify than others. The LabelHash algorithm is implemented on a web server at http://kavrakilab.org/labelhash/.
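The indexing idea behind hashing residue n-tuples can be illustrated with the toy Python sketch below. The target structures, residue labels, and lookup are hypothetical simplifications; the geometric constraints and match expansion that LabelHash actually uses are omitted.

```python
from collections import defaultdict
from itertools import combinations

# Toy targets for illustration: structure id -> list of (residue type, position).
targets = {
    "1abc": [("HIS", 57), ("ASP", 102), ("SER", 195), ("GLY", 193)],
    "2xyz": [("HIS", 40), ("SER", 60), ("ALA", 10)],
}

def build_index(targets, n=2):
    """Map each unordered n-tuple of residue types to the targets/positions containing it."""
    index = defaultdict(list)
    for pdb_id, residues in targets.items():
        for combo in combinations(residues, n):
            key = tuple(sorted(name for name, _ in combo))
            index[key].append((pdb_id, tuple(pos for _, pos in combo)))
    return index

index = build_index(targets, n=2)

# Look up partial matches for a catalytic-triad-like motif (HIS, ASP, SER);
# a full method would then try to extend these partial matches to complete ones.
motif = ["HIS", "ASP", "SER"]
for pair in combinations(sorted(motif), 2):
    print(pair, "->", index.get(pair, []))
```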