In gene network estimation from time series microarray data, dynamic models such as differential equations and dynamic Bayesian networks assume that the network structure is stable across all time points, whereas the real network may change its structure over time, in response to shocks, and so on. If the true network structure underlying the data changes at certain points, fitting the usual dynamic linear models fails to estimate the structure of the gene network, and we cannot extract useful information from the data. To solve this problem, we propose a dynamic linear model with Markov switching for estimating time-dependent gene network structure from time series gene expression data. Using our proposed method, the network structure between genes and its change points are automatically estimated. We demonstrate the effectiveness of the proposed method through the analysis of Saccharomyces cerevisiae cell cycle time series data.
Ryo Yoshida, Seiya Imoto, Tomoyuki Higuchi. "Estimating time-dependent gene networks from time series microarray data by dynamic linear models with Markov switching." Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 289-98. https://doi.org/10.1109/csb.2005.32
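To make the idea of a network whose structure switches over time concrete, the following is a minimal simulation sketch, not the authors' estimation procedure. It assumes a simplified first-order linear model observed directly (no hidden state layer), in which the interaction matrix in force at each time point is chosen by a Markov chain over regimes. The names `simulate_switching_ar`, `change_points`, `stay_prob`, and `noise_sd` are hypothetical and introduced only for illustration.

```python
import random


def simulate_switching_ar(n_steps, matrices, stay_prob, noise_sd=0.1, seed=0):
    """Simulate x_t = A[s_t] x_{t-1} + noise, where the regime s_t is a Markov chain.

    matrices: one square interaction matrix (list of lists) per regime.
    stay_prob: probability that the regime at time t equals the regime at t-1.
    """
    rng = random.Random(seed)
    n_genes = len(matrices[0])
    n_regimes = len(matrices)
    state = 0
    x = [1.0] * n_genes
    states, series = [], []
    for _ in range(n_steps):
        # The regime either persists or jumps uniformly to a different regime.
        if n_regimes > 1 and rng.random() > stay_prob:
            state = rng.choice([r for r in range(n_regimes) if r != state])
        a = matrices[state]
        x = [
            sum(a[i][j] * x[j] for j in range(n_genes)) + rng.gauss(0.0, noise_sd)
            for i in range(n_genes)
        ]
        states.append(state)
        series.append(x)
    return states, series


def change_points(states):
    """Return the time indices at which the regime differs from the previous one."""
    return [t for t in range(1, len(states)) if states[t] != states[t - 1]]
```

In this toy setting the regime sequence is known because it was simulated; in the setting the abstract describes, both the regime sequence and the matrices would have to be inferred from the expression data.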
We developed a machine learning system for determining gene functions from heterogeneous data sources using a Weighted Naive Bayesian Network (WNB). Knowledge of gene functions is crucial for understanding many fundamental biological mechanisms such as regulatory pathways, cell cycles and diseases. Our major goal is to accurately infer the functions of putative genes or ORFs (Open Reading Frames) from existing databases using computational methods. However, this task is intrinsically difficult since the underlying biological processes represent complex interactions of multiple entities. Therefore, many functional links would be missed when only one or two sources of data are used in the prediction. Our hypothesis is that integrating evidence from multiple, complementary sources could significantly improve the prediction accuracy. In this paper, our experimental results not only suggest that this hypothesis is valid, but also provide guidelines for using the WNB system for data collection, training and prediction. The combined training data sets contain information from gene annotations, gene expression, clustering outputs, keyword annotations and sequence homology from public databases. The current system is trained and tested on the genes of the budding yeast Saccharomyces cerevisiae. Our WNB model can also be used to analyze the contribution of each source of information to the prediction performance through the weight training process. The contribution analysis could potentially lead to significant scientific discovery by facilitating the interpretation and understanding of the complex relationships between biological entities.
Xutao Deng, Huimin Geng, Hesham Ali. "Learning yeast gene functions from heterogeneous sources of data using hybrid weighted Bayesian networks." Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 25-34. https://doi.org/10.1109/csb.2005.38
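The weighting idea in the abstract above can be illustrated with a generic weighted naive Bayes scoring rule, in which each evidence source contributes its log-likelihood scaled by a per-source weight. This is a sketch under that assumption only; the abstract does not give the authors' exact formulation or their weight training procedure, and the function name and input layout below are hypothetical.

```python
import math


def weighted_naive_bayes(priors, likelihoods, weights):
    """Combine evidence sources into class posteriors.

    priors: {class: P(class)}
    likelihoods: {source: {class: P(observed evidence from source | class)}}
    weights: {source: non-negative weight}; a weight of 1 gives plain naive
             Bayes and a weight of 0 ignores that source entirely.
    """
    log_scores = {}
    for cls, prior in priors.items():
        score = math.log(prior)
        for source, table in likelihoods.items():
            score += weights[source] * math.log(table[cls])
        log_scores[cls] = score
    # Normalize in a numerically stable way.
    top = max(log_scores.values())
    unnorm = {cls: math.exp(s - top) for cls, s in log_scores.items()}
    total = sum(unnorm.values())
    return {cls: v / total for cls, v in unnorm.items()}
```

Reading the weights after training is what would allow the kind of contribution analysis the abstract mentions: a source whose weight stays near zero adds little to the prediction.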
High-throughput expression profiling and genotyping technologies provide the means to study the genetic determinants of population variation in gene expression. In this paper we present a general statistical framework for the simultaneous analysis of gene expression data and SNP genotype data measured for the same cohort. The framework consists of methods to associate transcripts with SNPs affecting their expression, algorithms to detect subsets of transcripts that share significantly many associations with a subset of SNPs, and methods to visualize the identified relations. We apply our framework to SNP-expression data collected from 49 breast cancer patients. Our results demonstrate an overabundance of transcript-SNP associations in these data, and pinpoint SNPs that are potential master regulators of transcription. We also identify several statistically significant transcript subsets with common putative regulators that fall into well-defined functional categories.
Anya Tsalenko, Roded Sharan, Hege Edvardsen, Vessela Kristensen, Anne-Lise Børresen-Dale, Amir Ben-Dor, Zohar Yakhini. "Analysis of SNP-expression association matrices." Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 135-43. https://doi.org/10.1109/csb.2005.14
Dekel Tsur, Stephen Tanner, Ebrahim Zandi, Vineet Bafna, Pavel A Pevzner
Post-translational modifications (PTMs) are of great biological importance. Most existing approaches perform a restrictive search that can take into account only a few types of PTMs and ignores all others. We describe an unrestrictive PTM search algorithm that searches for all types of PTMs at once in a blind mode, i.e., without knowing which PTMs exist in a sample. Blind PTM identification opens the possibility of studying the extent and frequencies of different types of PTMs, still an open problem in proteomics. Using our new algorithm, we were able to construct a two-dimensional PTM frequency matrix that reflects the number of MS/MS spectra in a sample for each putative PTM type and each amino acid. Application of this approach to a large IKKb dataset resulted in the largest set of PTMs reported for a single MS/MS sample so far. We demonstrate an excellent correlation between high values in the PTM frequency matrix and known PTMs, thus validating our approach. We further argue that the PTM frequency matrix may reveal some still unknown modifications that warrant further experimental validation.
Dekel Tsur, Stephen Tanner, Ebrahim Zandi, Vineet Bafna, Pavel A Pevzner. "Identification of post-translational modifications via blind search of mass-spectra." Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 157-66. https://doi.org/10.1109/csb.2005.34
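The two-dimensional PTM frequency matrix described in the abstract above can be pictured as a table of spectrum counts indexed by putative modification and amino acid. The sketch below assumes a hypothetical input format, a list of `(mass_shift_in_daltons, residue)` pairs with one pair per identified spectrum, and assumes that modification types are distinguished by mass shift rounded to the nearest integer; neither detail is stated in the abstract.

```python
from collections import Counter


def ptm_frequency_matrix(identifications):
    """Count spectra per (rounded mass shift, amino acid) cell.

    identifications: iterable of (mass_shift, residue) pairs, one per spectrum.
    Returns a Counter keyed by (integer mass shift, residue letter).
    """
    counts = Counter()
    for mass_shift, residue in identifications:
        # Assumption: nearby mass shifts are binned by rounding to whole daltons.
        counts[(round(mass_shift), residue)] += 1
    return counts
```

Cells with high counts would then be the candidates to compare against known modifications, for example a shift of about +80 Da on serine, which is consistent with phosphorylation.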
High-throughput methods for detecting protein-protein interactions (PPI) have given researchers an initial global picture of protein interactions on a genomic scale. The usefulness of this understanding is, however, typically compromised by noisy data. Effective ways of integrating and using these non-congruent data sets have received little attention to date. This paper proposes a model to integrate different data sets. We construct this model using our prior knowledge of data set reliability. Based on this model, we propose a topological measurement to select reliable interactions and to quantify the similarity between two proteins' interaction profiles. Our measurement exploits the small-world topological properties of the protein interaction network. In the course of this work, we also discovered some additional properties of the network. We show that our measurement can be used to find reliable interactions with improved performance and to find protein pairs with higher functional homogeneity.
Pengjun Pei, Aidong Zhang. "A topological measurement for weighted protein interaction network." Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 268-78. https://doi.org/10.1109/csb.2005.8
Tarek S Najdi, Chin-Rang Yang, Bruce E Shapiro, G Wesley Hatfield, Eric D Mjolsness
In our effort to elucidate the systems biology of the model organism Escherichia coli, we have developed a mathematical model that simulates the allosteric regulation of the threonine biosynthesis pathway starting from aspartate. To achieve this goal, we used kMech, a Cellerator language extension that describes enzyme mechanisms for the mathematical modeling of metabolic pathways. These mechanisms are converted by Cellerator into ordinary differential equations (ODEs) solvable by Mathematica. In this paper, we describe a more flexible model in Cellerator, which generalizes the Monod, Wyman, Changeux (MWC) model for enzyme allosteric regulation to allow for multiple substrate, activator and inhibitor binding sites. Furthermore, we have developed a model that describes the behavior of the bifunctional allosteric enzyme Aspartate Kinase I-Homoserine Dehydrogenase I (AKI-HDHI). This model predicts the partition of enzyme activities in the steady state, which paves the way for a more generalized prediction of the behavior of bifunctional enzymes.
Tarek S Najdi, Chin-Rang Yang, Bruce E Shapiro, G Wesley Hatfield, Eric D Mjolsness. "Application of a generalized MWC model for the mathematical simulation of metabolic pathways regulated by allosteric enzymes." Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 279-88. https://doi.org/10.1109/csb.2005.15
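For orientation, the classical (ungeneralized) MWC model that the abstract above builds on gives the fractional saturation of an enzyme with n equivalent substrate sites as Y = [α(1+α)^(n-1) + L·c·α(1+cα)^(n-1)] / [(1+α)^n + L(1+cα)^n], where α is the substrate concentration scaled by the R-state dissociation constant, L is the T/R equilibrium constant without ligand, and c is the ratio of R-state to T-state dissociation constants. The sketch below evaluates only this textbook formula in Python; it is not the generalized model of the paper, and it is not Cellerator or kMech syntax.

```python
def mwc_saturation(alpha, n, L, c):
    """Fractional saturation of the classical MWC model.

    alpha: substrate concentration divided by the R-state dissociation constant.
    n: number of equivalent substrate binding sites.
    L: allosteric constant [T0]/[R0] in the absence of ligand.
    c: ratio K_R / K_T of the R-state and T-state dissociation constants.
    """
    r_term = alpha * (1 + alpha) ** (n - 1)
    t_term = L * c * alpha * (1 + c * alpha) ** (n - 1)
    denom = (1 + alpha) ** n + L * (1 + c * alpha) ** n
    return (r_term + t_term) / denom
```

Two limiting cases are a quick sanity check: with L = 0 (no T state) or c = 1 (both states bind equally well) the expression reduces to the hyperbolic α/(1+α), so any sigmoidal behavior comes from a large L combined with a small c.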
In feature selection for cancer classification problems, many existing methods consider genes individually, choosing the top genes with the most significant signal-to-noise statistic or correlation coefficient. However, the information about the class distinction provided by such genes may overlap heavily, since their gene expression patterns are similar. The redundancy of including many genes with similar expression patterns results in highly complex classifiers. According to the principle of Occam's razor, simple models are preferable to complex ones if they produce comparable prediction performance. In this paper, we introduce a new method to learn accurate and low-complexity classifiers from gene expression profiles. In our method, we use mutual information to measure the relation between a set of genes, called a gene vector, and the class attribute of the samples. Gene vectors lie in higher-dimensional spaces than individual genes; therefore, they are more diverse and can contain more information than individual genes. Hence, gene vectors are preferable to individual genes for describing the class distinctions between samples, since they contain more information about the class attribute. We validate our method on three gene expression profiles. Compared with results from the literature and with other well-known classification methods, our method demonstrated better or comparable prediction performance, but with lower-complexity models.
Zheng Yun, Kwoh Chee Keong. "Identifying simple discriminatory gene vectors with an information theory approach." Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 13-24. https://doi.org/10.1109/csb.2005.35
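The claim in the abstract above, that a gene vector can carry more information about the class than any of its genes alone, can be checked on a small example using the standard definition of mutual information, I(X;Y) = Σ p(x,y) log2[p(x,y) / (p(x)p(y))]. The sketch assumes expression values have already been discretized and that each sample's gene vector is a tuple; the function name and the toy data are illustrative and do not come from the paper.

```python
import math
from collections import Counter


def mutual_information(gene_vectors, labels):
    """Empirical mutual information (in bits) between gene vectors and classes.

    gene_vectors: list of tuples of discretized expression values, one per sample.
    labels: list of class labels, one per sample.
    """
    n = len(labels)
    joint = Counter(zip(gene_vectors, labels))
    px = Counter(gene_vectors)
    py = Counter(labels)
    return sum(
        (count / n) * math.log2(count * n / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )
```

On an XOR-like toy data set with four samples, gene vectors `(0,0), (0,1), (1,0), (1,1)` and classes `0, 1, 1, 0`, each gene alone has zero mutual information with the class, while the pair taken together determines the class and yields one full bit.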
Will Sheffler, Eli Upfal, John Sedivy, William Stafford Noble
Perhaps the most common question that a microarray study can ask is, "Between two given biological conditions, which genes exhibit changed expression levels?" Existing methods for answering this question either generate a comparative measure based upon a static model, or take an indirect approach, first estimating absolute expression levels and then comparing the estimated levels to one another. We present a method for detecting changes in gene expression between two samples based on data from Affymetrix GeneChips. Using a library of over 200,000 known cases of differential expression, we create a learned comparative expression measure (LCEM) based on classification of probe-level data patterns as changed or unchanged. LCEM uses perfect match probe data only; mismatch probe values did not prove to be useful in this context. LCEM is particularly powerful in the case of small microarray studies, in which a regression-based method such as RMA cannot generalize, and in detecting small expression changes. At the levels of selectivity that are typical in microarray analysis, LCEM shows a lower false discovery rate than either MAS5 or RMA trained from a single chip. When many chips are available to RMA, LCEM performs better on two of the three data sets, and nearly as well on the third. The MAS5 log ratio statistic performed notably poorly on all data sets.
Will Sheffler, Eli Upfal, John Sedivy, William Stafford Noble. "A learned comparative expression measure for Affymetrix GeneChip DNA microarrays." Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 144-54. https://doi.org/10.1109/csb.2005.5
We present a new computational method for the prediction of orthologous gene groups for microbial genomes based on the prediction of co-occurrences of homologous genes. The method is inspired by the observation that homologous genes are highly likely to be orthologous if their neighboring genes are also homologous. Based on co-occurrences of homologous genes, we have grouped the (predicted) operons of 77 selected sequenced microbial genomes so that operons of the same group are highly likely to be functionally similar or related. We then cluster the homologous genes in the same operon group so that genes of the same cluster are highly likely to be similar in terms of their sequences and functions, i.e., they are predicted to be orthologous genes. By comparing our predicted orthologous gene groups with the COG assignments and NCBI annotations, we conclude that our method shows promise of providing more accurate and specific predictions than the existing methods.
Hongwei Wu, Fenglou Mao, Victor Olman, Ying Xu. "Accurate prediction of orthologous gene groups in microbes." Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 73-9. https://doi.org/10.1109/csb.2005.10
A good number of biclustering algorithms have been proposed for grouping gene expression data. Many of them have adopted matrix norms to define the similarity score of a bicluster. We shall show that almost all matrix metrics can be converted into vector norms while preserving the rank equivalence. Vector norms provide a much more efficient vehicle for biclustering analysis and computation. The advantages are twofold: ease of analysis and savings in computation. Most existing biclustering algorithms have also implicitly assumed the use of univariate (i.e., single-metric) evaluation for identifying biclusters. Such an approach, however, overlooks the fundamental principle that genes, even though they may belong to the same gene group, (1) may be subdivided into different substructures, and (2) may be co-expressed via a diversity of coherence models (a gene may participate in multiple pathways that may or may not be co-active under all conditions). The former leads to the adoption of a multi-substructure analysis, the latter to a multivariate analysis. This paper will show that the proposed multivariate and multi-substructure analysis is very effective in identifying and classifying biologically relevant groups of genes and conditions. For example, it has successfully yielded highly discriminant and accurate classification based on known ribosomal gene groups.
S Y Kung, Man-Wai Mak, Ilias Tagkopoulos. "Multi-metric and multi-substructure biclustering analysis for gene expression data." Proceedings. IEEE Computational Systems Bioinformatics Conference, 2005, pp. 123-34. https://doi.org/10.1109/csb.2005.40