Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706594
Sung-Gon Yi, T. Park
As the scale of microarray experiments increases, it is common to combine various types of microarrays, such as paired and non-paired microarrays, from different laboratories or hospitals. It is therefore important to analyze the microarray data together and derive a combined conclusion after accounting for heterogeneity among data sets. One of the main objectives of a microarray experiment is to identify differentially expressed genes among the experimental groups. We propose a linear mixed-effects (LMe) model for the integrated analysis of heterogeneous microarray data sets. The proposed LMe model is illustrated using data from 133 microarrays collected at three different hospitals. Through simulation studies, we compared the proposed LMe model approach with the meta-analysis and ANOVA model approaches. The LMe model approach was shown to provide higher power than the other approaches.
Title: Integrated analysis of the various types of microarray data using linear-mixed effects models. Venue: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
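One way to read the linear mixed-effects idea in the abstract above is as a per-gene model with a fixed group effect plus random effects for the data set and for the subject, the latter capturing the pairing in paired arrays. The form below is only a generic sketch: all symbols and the exact random-effects structure are assumptions for illustration, not the authors' specification.

```latex
y_{ijk} = \mu + \beta\, x_{ijk} + s_i + u_{ij} + \varepsilon_{ijk},
\qquad
s_i \sim N(0, \sigma_s^2), \quad
u_{ij} \sim N(0, \sigma_u^2), \quad
\varepsilon_{ijk} \sim N(0, \sigma^2)
```

Here $y_{ijk}$ is the (log) expression of one gene on array $k$ of subject $j$ in data set $i$, $x_{ijk}$ is the experimental group indicator, $s_i$ absorbs heterogeneity between data sets, and $u_{ij}$ is shared by arrays from the same subject. Under this reading, a gene is called differentially expressed when the hypothesis $\beta = 0$ is rejected.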
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706534
Ping Zhang, Z. Obradovic
Studies of intrinsically disordered proteins, which lack a stable tertiary structure but still have important biological functions, rely critically on computational methods that predict this property from sequence information. Although a number of fairly successful models for predicting protein disorder were developed over the last decade, the quality of their predictions is limited by the number of available cases of confirmed disorder. To estimate protein disorder from protein sequences more reliably, an iterative algorithm is proposed that integrates the predictions of multiple disorder models without relying on any protein sequences with confirmed disorder annotation. The iterative method alternates between the maximum a posteriori (MAP) estimation of the disorder prediction and the maximum-likelihood (ML) estimation of the quality of the individual disorder predictors. Experiments on data used at the Critical Assessment of Techniques for Protein Structure Prediction (CASP7 and CASP8) have shown the effectiveness of the proposed algorithm.
Title: Unsupervised integration of multiple protein disorder predictors. Venue: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
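The alternation described in the abstract above (estimate the consensus labels, then re-estimate how good each predictor is) can be sketched in a few lines. This is a generic illustration under simplifying assumptions, not the authors' algorithm: predictions are binary per residue, each predictor has a single symmetric accuracy, the prior over labels is uniform, and a smoothed agreement rate stands in for the ML step. The function name `integrate_predictors` is hypothetical.

```python
import math


def integrate_predictors(votes, n_iter=20):
    """Combine binary predictions without any confirmed labels.

    votes: list with one entry per predictor; each entry is a list of
           0/1 calls (1 = disordered), one per residue.
    Returns (consensus_labels, estimated_accuracies).
    """
    n_pred = len(votes)
    n_res = len(votes[0])
    acc = [0.7] * n_pred  # assumed starting accuracy for every predictor

    consensus = [0] * n_res
    for _ in range(n_iter):
        # Label step: log-odds weighted vote, which is the MAP label
        # under a uniform prior and symmetric per-predictor error.
        weights = [math.log(a / (1.0 - a)) for a in acc]
        consensus = []
        for r in range(n_res):
            score = sum(w if votes[p][r] == 1 else -w
                        for p, w in enumerate(weights))
            consensus.append(1 if score > 0 else 0)

        # Quality step: agreement with the current consensus, smoothed
        # (+1 / +2) so the estimate never reaches exactly 0 or 1.
        acc = [
            (sum(v == c for v, c in zip(votes[p], consensus)) + 1) / (n_res + 2)
            for p in range(n_pred)
        ]
    return consensus, acc


# Toy input: two fairly consistent predictors and one noisy predictor.
example_votes = [
    [1, 1, 0, 0, 1, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 1, 1],
    [0, 1, 1, 0, 0, 0, 1, 0],
]
labels, quality = integrate_predictors(example_votes)
```

In this toy run the noisy third predictor ends up with the lowest estimated accuracy and therefore the smallest voting weight, which is the behaviour the unsupervised scheme relies on.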
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706538
S. Le, B. Shapiro
Small regulatory RNAs are highly abundant noncoding RNAs (ncRNAs) found in bacterial genomes. These small regulatory ncRNAs (sRNAs) can regulate the synthesis of proteins by mediating mRNA transcription, translation and stability. They also control the activity of specific proteins by binding to them. In this study, we present a general computational approach for identifying the distinct structure of sRNAs in the Escherichia coli (E. coli) genome using a quantitative measure, Ediff: the energy difference between the optimal structure folded from a sequence segment and its corresponding optimal restrained structure, in which all base pairings formed in the original optimal structure are excluded. Our results indicate that most of the known small ncRNAs in E. coli K12 have very high normalized Ediff scores with high statistical significance. These sRNAs have distinct, well-ordered structures that are both thermodynamically stable and uniquely folded.
Title: Characterization of structural features for small regulatory RNAs in Escherichia coli genomes. Venue: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
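The Ediff measure in the abstract above is a difference of two folding energies, so its arithmetic is easy to sketch once those energies are available from some RNA folding program. The snippet below assumes the energies are supplied as plain numbers in kcal/mol. The abstract does not say how the score is normalized, so the z-score against shuffled-sequence controls shown here is an assumption for illustration, and both function names are hypothetical.

```python
import statistics


def ediff(e_optimal, e_restrained):
    """Energy gap between the restrained and the optimal structure.

    Both inputs are free energies in kcal/mol (more negative = more
    stable). The restrained fold forbids every base pair of the optimal
    fold, so it can never be more stable and the gap is non-negative.
    """
    return e_restrained - e_optimal


def z_score(value, control_values):
    """Assumed normalization: standard score against control Ediff values,
    for example values computed from shuffled versions of the sequence."""
    mean = statistics.mean(control_values)
    spread = statistics.stdev(control_values)
    return (value - mean) / spread


# Hypothetical numbers, not taken from the paper.
gap = ediff(e_optimal=-30.0, e_restrained=-12.0)
normalized = z_score(gap, [4.0, 6.0, 8.0])
```

A large gap means that no alternative fold comes close to the optimal one, which matches the abstract's description of sRNAs as uniquely folded.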
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706582
Zhenyu Wang, V. Palade
We believe the great interpretability of fuzzy models allows fuzzy-based methods to play a very important role in microarray gene expression data analysis, but the advantages offered by fuzzy-based techniques in this application have not yet been fully explored in the literature. In this paper, we construct Multi-Objective Evolutionary Algorithms based Interpretable Fuzzy (MOEAIF) models for microarray gene expression data analysis. Our novel fuzzy models can significantly decrease model complexity and automatically balance the accuracy and interpretability of the models. The experimental studies have shown that relatively simple and small fuzzy rule bases, with satisfactory classification performance, have been successfully found for challenging microarray gene expression datasets.
Title: Multi-objective evolutionary algorithms based Interpretable Fuzzy models for microarray gene expression data analysis. Venue: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706539
Wei Chen, Shaowu Zhang, Yong-mei Cheng, Q. Pan
Protein-RNA interactions are vitally important to a number of fundamental cellular processes, including regulation of gene expression, such as RNA splicing, transport and translation, as well as protein synthesis and ribosome assembly. More detailed information on protein-RNA interactions helps in understanding function annotation and molecular regulatory mechanisms, and knowledge of protein-RNA recognition can also help researchers with site-directed mutagenesis and drug design. In the present work, we propose a computational approach, based on the SVM-KNN algorithm and the evolutionary information of spatially neighbouring residues, for the prediction of protein-RNA interaction sites. The overall success rate obtained by 5-fold cross-validation is 78.00%, which is comparable to or better than that of other existing methods, indicating that our method is very promising for identifying and predicting protein-RNA interaction sites.
Title: Prediction of Protein-RNA interaction site using SVM-KNN algorithm with spatial information. Venue: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706573
Javad Safaei, Ján Manuch, Arvind Gupta, L. Stacho, S. Pelech
In this paper we propose a new algorithm to predict the phosphorylation site specificities of 478 human protein kinases based on the primary structures of the catalytic domains of these enzymes. Existing methods either deduce the specificity of a protein kinase by aligning the amino acid sequences of the phospho-sites targeted by the kinase to generate a consensus sequence, or use machine learning models for recognition. However, for most protein kinases, few if any substrates have been experimentally identified by protein sequencing and mass spectrometry. In this work, we used mutual information from a training set of consensus phospho-site sequences for over 200 protein kinases, together with predicted amino acid interactions between kinases and their substrate phospho-sites, to generate position-specific scoring matrices (PSSMs). The results demonstrate that, using our algorithm, knowledge of the primary amino acid sequence of the catalytic domain of these kinases is sufficient to predict their phosphorylation site specificities and their PSSMs.
Title: Prediction of human protein kinase substrate specificities. Venue: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
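For readers unfamiliar with position-specific scoring matrices, the sketch below shows what a PSSM is and how it scores a candidate phospho-site. It builds the matrix the conventional way, from aligned substrate peptides, which is the approach the abstract above attributes to existing methods; it does not reproduce the paper's prediction of PSSMs from kinase catalytic domains. The uniform background, the pseudocount, the function names and the example peptides are all assumptions made for illustration.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(peptides, pseudocount=1.0):
    """Log-odds PSSM from equal-length peptides aligned on the phospho-site.

    Returns one dict per position mapping amino acid -> log2 odds against
    a uniform background (an assumed, simplistic background model).
    Peptides must contain only the 20 standard amino acid letters.
    """
    length = len(peptides[0])
    background = 1.0 / len(AMINO_ACIDS)
    pssm = []
    for pos in range(length):
        # Start every count at the pseudocount so unseen residues
        # receive a finite (negative) score instead of log(0).
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for pep in peptides:
            counts[pep[pos]] += 1
        total = sum(counts.values())
        pssm.append({aa: math.log2((counts[aa] / total) / background)
                     for aa in AMINO_ACIDS})
    return pssm


def score_peptide(pssm, peptide):
    """Sum the per-position log-odds scores for one candidate peptide."""
    return sum(column[aa] for column, aa in zip(pssm, peptide))


# Made-up peptides with a basic residue pattern before a serine.
training = ["RRASL", "RRGSV", "KRASL"]
matrix = build_pssm(training)
```

A peptide that resembles the training set, such as `RRASL`, receives a higher score from `score_peptide` than an unrelated one such as `DDDDD`.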
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706575
Gene P. K. Wu, Keith C. C. Chan, A. Wong, Bin Wu
When the tissue classes of the samples are given, discovering patterns from gene expression levels is regarded as a classification problem and is solved as a discrete-data problem by discretizing the expression levels of each gene into intervals that maximize the interdependence between that gene and the class labels. However, when class information is unavailable, discovering gene expression patterns becomes difficult. This paper attempts to tackle this important problem. For a gene pool with a large number of genes, we first cluster the genes into smaller groups. In each group, we use the representative gene, the one with the highest interdependence with the others in the group, to drive the discretization of the expression levels of the other genes. Treating intervals as discrete events, association patterns can then be discovered. If the gene groups obtained are crisp clusters, significant patterns that overlap different clusters cannot be found. This paper presents a new method of “fuzzifying” the crisp attribute clusters for that purpose. To evaluate the effectiveness of our approach, we first apply the procedure described above to a synthetic dataset and then to a gene expression dataset with known class labels. The class labels are not used in either analysis, but are used later as the ground truth in a classification problem for assessing the algorithm's effectiveness in fuzzy gene clustering and discretization. The results show the efficacy of the proposed method.
Title: Unsupervised discovery of fuzzy patterns in gene expression data. Venue: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
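The notion of a representative gene, the one with the highest interdependence with the rest of its group, can be illustrated with a small information-theoretic sketch. The abstract above does not define the interdependence measure, so the version below (mutual information divided by joint entropy, computed on already discretized levels) is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(x, y):
    """Mutual information in bits between two equal-length discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2((c / n) / ((px[a] / n) * (py[b] / n)))
        for (a, b), c in pxy.items()
    )


def joint_entropy(x, y):
    """Joint entropy in bits of two equal-length discrete sequences."""
    n = len(x)
    return -sum((c / n) * math.log2(c / n) for c in Counter(zip(x, y)).values())


def interdependence(x, y):
    """Assumed measure: mutual information normalized to the range [0, 1]."""
    h = joint_entropy(x, y)
    return mutual_information(x, y) / h if h > 0 else 0.0


def representative_gene(group):
    """Pick the gene with the largest total interdependence with the others.

    group: dict mapping gene name -> list of discretized expression levels.
    """
    return max(
        group,
        key=lambda g: sum(interdependence(group[g], group[h])
                          for h in group if h != g),
    )


# Toy group: g1 and g2 move together, g3 is unrelated to both.
toy_group = {
    "g1": [0, 0, 1, 1],
    "g2": [0, 0, 1, 1],
    "g3": [0, 1, 0, 1],
}
chosen = representative_gene(toy_group)
```

In the toy group the unrelated gene `g3` is never chosen, because its interdependence with both other genes is zero.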
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706590
Yuji Zhang, J. Xuan, R. Clarke, H. Ressom
The availability of genome-wide biological network data opens up new possibilities to discover novel biomarkers and elucidate complex cancer-related mechanisms at the network level. In this paper, we propose a novel module-based feature selection framework, which integrates biological network information and gene expression data to identify biomarkers not as individual genes but as functional modules. A large-scale analysis of the ensemble feature selection concept is also presented. The method combines features selected from multiple runs on different subsamples of the data to increase the reliability and classification accuracy of the final set of selected features. The results from four breast cancer studies demonstrate that the identified module biomarkers achieve: i) higher classification accuracy in independent validation datasets; ii) better reproducibility than individual gene biomarkers; iii) improved biological interpretability; and iv) enhanced enrichment in cancer-related “disease drivers”.
Title: Module-based biomarker discovery in breast cancer. Venue: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
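The ensemble feature selection idea in the abstract above, combining the features chosen across many subsampled runs, can be sketched as follows. This is a minimal illustration, not the authors' framework: the base selector (ranking by absolute difference of class means), the stratified subsampling, the frequency threshold and all names are assumptions, and the features are generic columns rather than network modules.

```python
import random
from collections import Counter


def top_k_by_mean_difference(samples, labels, k):
    """Assumed base selector: rank features by |mean(class 1) - mean(class 0)|."""
    n_features = len(samples[0])
    scores = []
    for f in range(n_features):
        pos = [s[f] for s, lab in zip(samples, labels) if lab == 1]
        neg = [s[f] for s, lab in zip(samples, labels) if lab == 0]
        scores.append(abs(sum(pos) / len(pos) - sum(neg) / len(neg)))
    return sorted(range(n_features), key=lambda f: scores[f], reverse=True)[:k]


def ensemble_select(samples, labels, k=1, n_runs=50, fraction=0.8,
                    min_frequency=0.6, seed=0):
    """Keep features selected in at least min_frequency of the subsampled runs."""
    rng = random.Random(seed)
    pos_idx = [i for i, lab in enumerate(labels) if lab == 1]
    neg_idx = [i for i, lab in enumerate(labels) if lab == 0]
    counts = Counter()
    for _ in range(n_runs):
        # Stratified subsample so that both classes are always present.
        chosen = (rng.sample(pos_idx, max(1, int(fraction * len(pos_idx))))
                  + rng.sample(neg_idx, max(1, int(fraction * len(neg_idx)))))
        sub_x = [samples[i] for i in chosen]
        sub_y = [labels[i] for i in chosen]
        counts.update(top_k_by_mean_difference(sub_x, sub_y, k))
    return sorted(f for f, c in counts.items() if c / n_runs >= min_frequency)


# Toy data: only feature 0 separates the two classes.
toy_x = [
    [5.0, 0.1, 1.0], [5.2, 0.0, 0.9], [4.9, 0.2, 1.1], [5.1, 0.1, 1.0], [5.0, 0.0, 1.0],
    [0.0, 0.1, 1.0], [0.1, 0.2, 1.1], [-0.1, 0.0, 0.9], [0.0, 0.1, 1.0], [0.2, 0.1, 1.0],
]
toy_y = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
stable_features = ensemble_select(toy_x, toy_y)
```

Features that survive the frequency threshold are those whose selection does not depend on which samples happened to be drawn, which is the reliability argument made in the abstract.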
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706620
M. Ayhan, Ryan G. Benton, Vijay V. Raghavan, Suresh K. Choubey
Alzheimer's disease (AD) is a major cause of dementia. Previous studies have indicated that the use of features derived from Positron Emission Tomography (PET) scans leads to more accurate and earlier diagnosis of AD than the traditional approach to determining dementia ratings, which uses a combination of clinical assessments such as memory tests. In this study, we compare Naïve Bayes (NB), a probabilistic learner, with variations of Support Vector Machines (SVMs), a geometric learner, for the automatic diagnosis of Alzheimer's disease. 3D Stereotactic Surface Projection (3D-SSP) is used to extract features from PET scans. At the most detailed level, the dimensionality of the feature space is very high, with 15964 features. Since classifier performance can degrade in the presence of a large number of features, we evaluate the benefits of a correlation-based feature selection method for finding a small number of highly relevant features.
Title: Exploitation of 3D Stereotactic Surface Projection for automated classification of Alzheimer's disease according to dementia levels. Venue: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
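The abstract above does not say which correlation-based feature selection method was used. If it is the widely used CFS heuristic, which is an assumption here, a candidate feature subset is scored by a merit that rewards correlation with the class and penalizes redundancy among the features. The sketch below shows only that merit formula; the function name and the example numbers are hypothetical.

```python
import math


def cfs_merit(k, mean_feature_class_corr, mean_feature_feature_corr):
    """CFS-style merit of a subset of k features.

    mean_feature_class_corr:   average correlation between each feature
                               in the subset and the class label.
    mean_feature_feature_corr: average pairwise correlation among the
                               features in the subset (redundancy).
    """
    return (k * mean_feature_class_corr) / math.sqrt(
        k + k * (k - 1) * mean_feature_feature_corr
    )


# Four equally relevant features: uncorrelated versus fully redundant.
independent = cfs_merit(4, 0.5, 0.0)
redundant = cfs_merit(4, 0.5, 1.0)
```

With these numbers the uncorrelated subset scores higher than the fully redundant one, which is why such a criterion tends to return a small set of relevant, non-overlapping features from a space of thousands.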
Pub Date : 2010-12-01 DOI: 10.1109/BIBM.2010.5706623
Ying Shen, Shaohong Zhang, H. Wong
Semantic similarity defined on the Gene Ontology (GO) aims to capture the functional relationship between different biological processes, molecular functions, or cellular components. In this paper, a novel method for measuring semantic similarity on GO, namely the Shortest Path (SP) algorithm, is proposed based on both the GO structure information and the terms' properties. The proposed algorithm searches for the shortest path that connects two terms and uses the sum of the weights along that path to compute the semantic similarity of the GO terms. A method for evaluating the nonlinear correlation between two variables is also introduced for validation. Extensive experiments conducted on two public gene expression datasets demonstrate the overall superiority of the SP method over the other state-of-the-art methods evaluated.
Title: A new method for measuring the semantic similarity on gene ontology. Venue: 2010 IEEE International Conference on Bioinformatics and Biomedicine (BIBM).
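The core step in the abstract above, finding the shortest weighted path between two GO terms and turning the summed weights into a similarity, can be sketched with Dijkstra's algorithm. The edge weights, the toy term names and the mapping `1 / (1 + distance)` are assumptions for illustration; the abstract does not state how the weights are derived from term properties or how the path length is converted into a similarity.

```python
import heapq


def shortest_path_length(graph, source, target):
    """Dijkstra's algorithm on a weighted graph given as {node: {neighbor: weight}}.

    Returns the smallest sum of edge weights between source and target,
    or infinity if the two terms are not connected.
    """
    best = {source: 0.0}
    heap = [(0.0, source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == target:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry, a shorter route was already found
        for neighbor, weight in graph.get(node, {}).items():
            candidate = dist + weight
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(heap, (candidate, neighbor))
    return float("inf")


def semantic_similarity(graph, term_a, term_b):
    """Assumed mapping from path length to a similarity in (0, 1]."""
    return 1.0 / (1.0 + shortest_path_length(graph, term_a, term_b))


# Hypothetical mini-ontology; edges are listed in both directions so a
# path may go up to a common ancestor and back down.
toy_graph = {
    "root": {"A": 1.0, "B": 2.0},
    "A": {"root": 1.0, "C": 0.5},
    "B": {"root": 2.0},
    "C": {"A": 0.5},
}
sim_close = semantic_similarity(toy_graph, "C", "A")
sim_far = semantic_similarity(toy_graph, "C", "B")
```

In the toy graph, term `C` is closer to its parent `A` (path weight 0.5) than to `B` (path weight 3.5 via the root), so `sim_close` is larger than `sim_far`.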