Biological signals, such as the start of protein translation in eukaryotic mRNA, are stretches of nucleotides recognized by cellular machinery. There are a variety of techniques for modeling and identifying them. Most of these techniques either assume that the base pairs at each position of the signal are independently distributed, or they allow for limited dependencies among different positions. In previous work, we provided a statistical model that generalizes earlier methods and captures all significant high-order dependencies among different base positions. In this paper, we use a set of experimentally verified translation initiation (TI) sites (provided by Amos Bairoch) from eukaryotic sequences to train a range of methods, and then compare these methods. None of the methods is effective in predicting TI sites. We take advantage of the ribosome scanning model (Cigan et al., 1988) to significantly improve the prediction accuracy for full-length mRNAs. The ribosome scanning model suggests scanning from the 5' end of the capped mRNA and initiating translation at the first AUG in good context. This reduces the search space dramatically and accounts for its effectiveness. The success of this approach illustrates how biological ideas can illuminate and help solve challenging problems in computational biology.
生物信号,如真核mRNA中蛋白质翻译的开始,是由细胞机制识别的核苷酸的延伸。有各种各样的技术可以对它们进行建模和识别。大多数这些技术要么假设信号每个位置的碱基对是独立分布的,要么它们允许不同位置之间的有限依赖。在之前的工作中,我们提供了一个统计模型,该模型概括了早期的方法,并捕获了不同碱基位置之间所有重要的高阶依赖关系。在本文中,我们从真核生物序列中使用一组实验验证的翻译起始(TI)位点(由Amos Bairoch提供)来训练一系列方法,然后比较这些方法。没有一种方法能有效预测TI位点。我们利用核糖体扫描模型(Cigan et al., 1988)显著提高了全长mrna的预测精度。核糖体扫描模型表明,在良好的环境下,从封帽mRNA的5'端开始扫描,并在第一个AUG开始翻译。这大大减少了搜索空间,并说明了它的有效性。这种方法的成功说明了生物学思想如何能够阐明和帮助解决计算生物学中具有挑战性的问题。
{"title":"The ribosome scanning model for translation initiation: implications for gene prediction and full-length cDNA detection.","authors":"P Agarwal, V Bafna","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Biological signals, such as the start of protein translation in eukaryotic mRNA, are stretches of nucleotides recognized by cellular machinery. There are a variety of techniques for modeling and identifying them. Most of these techniques either assume that the base pairs at each position of the signal are independently distributed, or they allow for limited dependencies among different positions. In previous work, we provided a statistical model that generalizes earlier methods and captures all significant high-order dependencies among different base positions. In this paper, we use a set of experimentally verified translation initiation (TI) sites (provided by Amos Bairoch) from eukaryotic sequences to train a range of methods, and then compare these methods. None of the methods is effective in predicting TI sites. We take advantage of the ribosome scanning model (Cigan et al., 1988) to significantly improve the prediction accuracy for full-length mRNAs. The ribosome scanning model suggests scanning from the 5' end of the capped mRNA and initiating translation at the first AUG in good context. This reduces the search space dramatically and accounts for its effectiveness. The success of this approach illustrates how biological ideas can illuminate and help solve challenging problems in computational biology.</p>","PeriodicalId":79420,"journal":{"name":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1998-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"20695510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LabFlow is a workflow management system designed for large scale biology research laboratories. It provides a workflow model in which objects flow from task to task under programmatic control. The model supports parallelism, meaning that an object can flow down several paths simultaneously, and sub-workflows which can be invoked subroutine-style from a task. The system allocates tasks to Unix processes to achieve requisite levels of multiprocessing. The system uses the LabBase data management system to store workflow-state and laboratory results. LabFlow provides a Per15 object-oriented framework for defining workflows, and an engine for executing these. The software is freely available.
{"title":"The LabFlow system for workflow management in large scale biology research laboratories.","authors":"N Goodman, S Rozen, L D Stein","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>LabFlow is a workflow management system designed for large scale biology research laboratories. It provides a workflow model in which objects flow from task to task under programmatic control. The model supports parallelism, meaning that an object can flow down several paths simultaneously, and sub-workflows which can be invoked subroutine-style from a task. The system allocates tasks to Unix processes to achieve requisite levels of multiprocessing. The system uses the LabBase data management system to store workflow-state and laboratory results. LabFlow provides a Per15 object-oriented framework for defining workflows, and an engine for executing these. The software is freely available.</p>","PeriodicalId":79420,"journal":{"name":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1998-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"20696075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P G Baker, A Brass, S Bechhofer, C Goble, N Paton, R Stevens
The TAMBIS project aims to provide transparent access to disparate biological databases and analysis tools, enabling users to utilize a wide range of resources with the minimum of effort. A prototype system has been developed that includes a knowledge base of biological terminology (the biological Concept Model), a model of the underlying data sources (the Source Model) and a 'knowledge-driven' user interface. Biological concepts are captured in the knowledge base using a description logic called GRAIL. The Concept Model provides the user with the concepts necessary to construct a wide range of multiple-source queries, and the user interface provides a flexible means of constructing and manipulating those queries. The Source Model provides a description of the underlying sources and mappings between terms used in the sources and terms in the biological Concept Model. The Concept Model and Source Model provide a level of indirection that shields the user from source details, providing a high level of source transparency. Source independent, declarative queries formed from terms in the Concept Model are transformed into a set of source dependent, executable procedures. Query formulation, translation and execution is demonstrated using a working example.
{"title":"TAMBIS--Transparent Access to Multiple Bioinformatics Information Sources.","authors":"P G Baker, A Brass, S Bechhofer, C Goble, N Paton, R Stevens","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>The TAMBIS project aims to provide transparent access to disparate biological databases and analysis tools, enabling users to utilize a wide range of resources with the minimum of effort. A prototype system has been developed that includes a knowledge base of biological terminology (the biological Concept Model), a model of the underlying data sources (the Source Model) and a 'knowledge-driven' user interface. Biological concepts are captured in the knowledge base using a description logic called GRAIL. The Concept Model provides the user with the concepts necessary to construct a wide range of multiple-source queries, and the user interface provides a flexible means of constructing and manipulating those queries. The Source Model provides a description of the underlying sources and mappings between terms used in the sources and terms in the biological Concept Model. The Concept Model and Source Model provide a level of indirection that shields the user from source details, providing a high level of source transparency. Source independent, declarative queries formed from terms in the Concept Model are transformed into a set of source dependent, executable procedures. Query formulation, translation and execution is demonstrated using a working example.</p>","PeriodicalId":79420,"journal":{"name":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1998-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"20695513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work focuses on the inference of evolutionary relationships in protein superfamilies, and the uses of these relationships to identify key positions in the structure, to infer attributes on the basis of evolutionary distance, and to identify potential errors in sequence annotations. Relative entropy, a distance metric from information theory, is used in combination with Dirichlet mixture priors to estimate a phylogenetic tree for a set of proteins. This method infers key structural or functional positions in the molecule, and guides the tree topology to preserve these important positions within subtrees. Minimum-description-length principles are used to determine a cut of the tree into subtrees, to identify the subfamilies in the data. This method is demonstrated on SH2-domain containing proteins, resulting in a new subfamily assignment for Src2-drome and a suggested evolutionary relationship between Nck_human and Drk_drome, Sem5_caeel, Grb2_human and Grb2_chick.
{"title":"Phylogenetic inference in protein superfamilies: analysis of SH2 domains.","authors":"K Sjölander","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>This work focuses on the inference of evolutionary relationships in protein superfamilies, and the uses of these relationships to identify key positions in the structure, to infer attributes on the basis of evolutionary distance, and to identify potential errors in sequence annotations. Relative entropy, a distance metric from information theory, is used in combination with Dirichlet mixture priors to estimate a phylogenetic tree for a set of proteins. This method infers key structural or functional positions in the molecule, and guides the tree topology to preserve these important positions within subtrees. Minimum-description-length principles are used to determine a cut of the tree into subtrees, to identify the subfamilies in the data. This method is demonstrated on SH2-domain containing proteins, resulting in a new subfamily assignment for Src2-drome and a suggested evolutionary relationship between Nck_human and Drk_drome, Sem5_caeel, Grb2_human and Grb2_chick.</p>","PeriodicalId":79420,"journal":{"name":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1998-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"20696546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P Baldi, Y Chauvin, S Brunak, J Gorodkin, A G Pedersen
We study from a computational standpoint several different physical scales associated with structural features of DNA sequences, including dinucleotide scales such as base stacking energy and propeller twist, and trinucleotide scales such as bendability and nucleosome positioning. We show that these scales provide an alternative or complementary compact representation of DNA sequences. As an example we construct a strand invariant representation of DNA sequences. The scales can also be used to analyze and discover new DNA structural patterns, especially in combinations with hidden Markov models (HMMs). The scales are applied to HMMs of human promoter sequences revealing a number of significant differences between regions upstream and downstream of the transcriptional start point. Finally we show, with some qualifications, that such scales are by and large independent, and therefore complement each other.
{"title":"Computational applications of DNA structural scales.","authors":"P Baldi, Y Chauvin, S Brunak, J Gorodkin, A G Pedersen","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>We study from a computational standpoint several different physical scales associated with structural features of DNA sequences, including dinucleotide scales such as base stacking energy and propeller twist, and trinucleotide scales such as bendability and nucleosome positioning. We show that these scales provide an alternative or complementary compact representation of DNA sequences. As an example we construct a strand invariant representation of DNA sequences. The scales can also be used to analyze and discover new DNA structural patterns, especially in combinations with hidden Markov models (HMMs). The scales are applied to HMMs of human promoter sequences revealing a number of significant differences between regions upstream and downstream of the transcriptional start point. Finally we show, with some qualifications, that such scales are by and large independent, and therefore complement each other.</p>","PeriodicalId":79420,"journal":{"name":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1998-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"20695514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A statistical theory of local alignment algorithms with gaps is presented. Both the linear and logarithmic phases, as well as the phase transition separating the two phases, are described in a quantitative way. Markov sequences without mutual correlations are shown to have scale-invariant alignment statistics. Deviations from scale invariance indicate the presence of mutual correlations detectable by alignment algorithms. Conditions are obtained for the optimal detection of a class of mutual sequence correlations.
{"title":"A statistical theory of sequence alignment with gaps.","authors":"D Drasdo, T Hwa, M Lässig","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>A statistical theory of local alignment algorithms with gaps is presented. Both the linear and logarithmic phases, as well as the phase transition separating the two phases, are described in a quantitative way. Markov sequences without mutual correlations are shown to have scale-invariant alignment statistics. Deviations from scale invariance indicate the presence of mutual correlations detectable by alignment algorithms. Conditions are obtained for the optimal detection of a class of mutual sequence correlations.</p>","PeriodicalId":79420,"journal":{"name":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1998-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"20696073","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
N A Kolchanov, M P Ponomarenko, A E Kel, Kondrakhin YuV, A S Frolov, F A Kolpakov, T N Goryachkovsky, O V Kel, E A Ananko, E V Ignatieva, O A Podkolodnaya, V N Babenko, I L Stepanenko, A G Romashchenko, T I Merkulova, D G Vorobiev, S V Lavryushev, Ponomarenko YuV, A V Kochetov, G B Kolesov, V V Solovyev, L Milanesi, N L Podkolodny, E Wingender, T Heinemeyer
GeneExpress system has been designed to integrate description, analysis, and recognition of eukaryotic regulatory sequences. The system includes 5 basic units: (1) GeneNet contains an object-oriented database for accumulation of data on gene networks and signal transduction pathways and a Java-based viewer that allows an exploration and visualization of the GeneNet information; (2) Transcription Regulation combines the database on transcription regulatory regions of eukaryotic genes (TRRD) and TRRD Viewer; (3) Transcription Factor Binding Site Recognition contains a compilation of transcription factor binding sites (TFBSC) and programs for their analysis and recognition; (4) mRNA Translation is designed for analysis of structural and contextual features of mRNA 5'UTRs and prediction of their translation efficiency; and (5) ACTIVITY is the module for analysis and site activity prediction of a given nucleotide sequence. Integration of the databases in the GeneExpress is based on the Sequence Retrieval System (SRS) created in the European Bioinformatics Institute.
{"title":"GeneExpress: a computer system for description, analysis, and recognition of regulatory sequences in eukaryotic genome.","authors":"N A Kolchanov, M P Ponomarenko, A E Kel, Kondrakhin YuV, A S Frolov, F A Kolpakov, T N Goryachkovsky, O V Kel, E A Ananko, E V Ignatieva, O A Podkolodnaya, V N Babenko, I L Stepanenko, A G Romashchenko, T I Merkulova, D G Vorobiev, S V Lavryushev, Ponomarenko YuV, A V Kochetov, G B Kolesov, V V Solovyev, L Milanesi, N L Podkolodny, E Wingender, T Heinemeyer","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>GeneExpress system has been designed to integrate description, analysis, and recognition of eukaryotic regulatory sequences. The system includes 5 basic units: (1) GeneNet contains an object-oriented database for accumulation of data on gene networks and signal transduction pathways and a Java-based viewer that allows an exploration and visualization of the GeneNet information; (2) Transcription Regulation combines the database on transcription regulatory regions of eukaryotic genes (TRRD) and TRRD Viewer; (3) Transcription Factor Binding Site Recognition contains a compilation of transcription factor binding sites (TFBSC) and programs for their analysis and recognition; (4) mRNA Translation is designed for analysis of structural and contextual features of mRNA 5'UTRs and prediction of their translation efficiency; and (5) ACTIVITY is the module for analysis and site activity prediction of a given nucleotide sequence. Integration of the databases in the GeneExpress is based on the Sequence Retrieval System (SRS) created in the European Bioinformatics Institute.</p>","PeriodicalId":79420,"journal":{"name":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1998-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"20696078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we discuss a novel scoring scheme for sequence alignments. The score of an alignment is defined as the sum of so-called weights of aligned segment pairs. A simple modification of the weight function used by the original version of the DIALIGN alignment program turns out to have a crucial advantage: it can be applied to both, global and local alignment problems without the need to specify a threshold parameter.
{"title":"Segment-based scores for pairwise and multiple sequence alignments.","authors":"B Morgenstern, W R Atchley, K Hahn, A Dress","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>In this paper, we discuss a novel scoring scheme for sequence alignments. The score of an alignment is defined as the sum of so-called weights of aligned segment pairs. A simple modification of the weight function used by the original version of the DIALIGN alignment program turns out to have a crucial advantage: it can be applied to both, global and local alignment problems without the need to specify a threshold parameter.</p>","PeriodicalId":79420,"journal":{"name":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1998-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"20696080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Computing three-dimensional structures from sparse experimental constraints requires method for combining heterogeneous sources of information, such as distances, angles, and measures of total volume, shape, and surface. For some types of information, such as distances between atoms, numerous methods are available for computing structures that satisfy the provided constraints. It is more difficult, however, to use information about the degree to which an atom is on the surface or buried as a useful constraint during structure computations. Surface measures have been used as accept/reject criteria for previously computed structures, but this is not an efficient strategy. In this paper, we investigate the efficacy of applying a surface measure in the computation of molecular structure, using a method of probabilistic least square computations which facilitates the introduction of multiple, noisy, heterogeneous data sources. For this purpose, we introduce a simple purely geometrical measure of surface proximity called maximal conic view (MCV). MCV is efficiently computable and differentiable, and is hence well suited to driving a structural optimization method based, in part, on surface data. As an initial validation, we show that MCV correlates well with known measures for total exposed surface area. We use this measure in our experiments to show that information about surface proximity (derived from theory or experiment, for example) can be added to a set of distance measurements to increase significantly the quality of the computed structure. In particular, when 30 to 50 percent of all possible short-range distances are provided, the addition of surface information improves the quality of the computed structure (as measured by RMS fit) by as much as 80 percent. Our results demonstrate that knowledge of which atoms are on the surface and which are buried can be used as a powerful constraint in estimating molecular structure.
{"title":"A surface measure for probabilistic structural computations.","authors":"J P Schmidt, C C Chen, J L Cooper, R B Altman","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Computing three-dimensional structures from sparse experimental constraints requires method for combining heterogeneous sources of information, such as distances, angles, and measures of total volume, shape, and surface. For some types of information, such as distances between atoms, numerous methods are available for computing structures that satisfy the provided constraints. It is more difficult, however, to use information about the degree to which an atom is on the surface or buried as a useful constraint during structure computations. Surface measures have been used as accept/reject criteria for previously computed structures, but this is not an efficient strategy. In this paper, we investigate the efficacy of applying a surface measure in the computation of molecular structure, using a method of probabilistic least square computations which facilitates the introduction of multiple, noisy, heterogeneous data sources. For this purpose, we introduce a simple purely geometrical measure of surface proximity called maximal conic view (MCV). MCV is efficiently computable and differentiable, and is hence well suited to driving a structural optimization method based, in part, on surface data. As an initial validation, we show that MCV correlates well with known measures for total exposed surface area. We use this measure in our experiments to show that information about surface proximity (derived from theory or experiment, for example) can be added to a set of distance measurements to increase significantly the quality of the computed structure. In particular, when 30 to 50 percent of all possible short-range distances are provided, the addition of surface information improves the quality of the computed structure (as measured by RMS fit) by as much as 80 percent. Our results demonstrate that knowledge of which atoms are on the surface and which are buried can be used as a powerful constraint in estimating molecular structure.</p>","PeriodicalId":79420,"journal":{"name":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1998-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"20696544","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Homologous proteins do not necessarily exhibit identical biochemical function. Despite this fact, local or global sequence similarity is widely used as an indication of functional identity. Of the 1327 Enzyme Commission defined functional classes with more than one annotated example in the sequence databases, similarity scores alone are inadequate in 251 (19%) of the cases. We test the hypothesis that conserved domains, as defined in the ProDom database, can be used to discriminate between alternative functions for homologous proteins in these cases. Using machine learning methods, we were able to induce correct discriminators for more than half of these 251 challenging functional classes. These results show that the combination of modular representations of proteins with sequence similarity improves the ability to infer function from sequence over similarity scores alone.
{"title":"Identification of divergent functions in homologous proteins by induction over conserved modules.","authors":"I Shah, L Hunter","doi":"","DOIUrl":"","url":null,"abstract":"<p><p>Homologous proteins do not necessarily exhibit identical biochemical function. Despite this fact, local or global sequence similarity is widely used as an indication of functional identity. Of the 1327 Enzyme Commission defined functional classes with more than one annotated example in the sequence databases, similarity scores alone are inadequate in 251 (19%) of the cases. We test the hypothesis that conserved domains, as defined in the ProDom database, can be used to discriminate between alternative functions for homologous proteins in these cases. Using machine learning methods, we were able to induce correct discriminators for more than half of these 251 challenging functional classes. These results show that the combination of modular representations of proteins with sequence similarity improves the ability to infer function from sequence over similarity scores alone.</p>","PeriodicalId":79420,"journal":{"name":"Proceedings. International Conference on Intelligent Systems for Molecular Biology","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"1998-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"20696545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}