One goal of the structural genomics initiative is the identification of new protein folds. Sequence-based structural homology prediction methods are an important means for prioritizing unknown proteins for structure determination. However, an important challenge remains: two highly dissimilar sequences can have similar folds, so how can we detect this rapidly in the context of structural genomics? High-throughput NMR experiments, coupled with novel algorithms for data analysis, can address this challenge. We report an automated procedure, called HD, for detecting 3D structural homologies from sparse, unassigned protein NMR data. Our method identifies 3D models in a protein structural database whose geometries best fit the unassigned experimental NMR data. HD does not use, and is thus not limited by, sequence homology. The method can also be used to confirm or refute structural predictions made by other techniques such as protein threading or homology modelling. The algorithm runs in O(pn + pn^(5/2) log(cn) + p log p) time, where p is the number of proteins in the database, n is the number of residues in the target protein and c is the maximum edge weight in an integer-weighted bipartite graph. Our experiments on real NMR data from 3 different proteins against a database of 4,500 representative folds demonstrate that the method identifies closely related protein folds, including sub-domains of larger proteins, with as little as 10-30% sequence homology between the target protein (or sub-domain) and the computed model. In particular, we report no false negatives or false positives despite significant percentages of missing experimental data.
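The running-time bound above has the shape of "score each of the p database models with a bipartite matching, then sort the scores". A minimal sketch of that shape follows. It is an illustration only: the abstract does not say how edge weights are derived from NMR data or whether the matching is maximised, so the weight matrices, the names `best_matching_weight` and `rank_models`, and the brute-force matching (exponential, unlike the scaling algorithm implied by the bound) are all assumptions.

```python
from itertools import permutations


def best_matching_weight(weights):
    """Weight of the best perfect matching in a small square weight matrix.

    weights[i][j] is a hypothetical integer weight between observation i
    and model residue j. Brute force over all permutations: fine for a
    toy example, not for real proteins.
    """
    n = len(weights)
    return max(
        sum(weights[i][perm[i]] for i in range(n))
        for perm in permutations(range(n))
    )


def rank_models(models):
    """Return model names sorted from best to worst matching weight."""
    scores = {name: best_matching_weight(w) for name, w in models.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Two hypothetical database models scored against the same observations.
toy_models = {
    "model_a": [[5, 1], [1, 5]],
    "model_b": [[2, 3], [3, 2]],
}
ranking = rank_models(toy_models)
```

The final sort is where the p log p term would come from; the per-model matching dominates the cost.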
{"title":"High-throughput 3D structural homology detection via NMR resonance assignment.","authors":"Christopher James Langmead, Bruce Randall Donald","journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","pages":"278-89"},"publicationDate":"2004-01-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"25831030"}
Pub Date: 2004-01-01 DOI: 10.1109/csb.2004.1332433
Karuturi R Murthy, Liu Jian Hua
Motivation: Several researchers have predicted cell-cycle regulated genes from microarray time-course measurements of mRNA expression levels. The most commonly employed approach is the Fourier transform (FT) method used in conjunction with a set of known cell-cycle regulated genes. In the absence of training data, the Fourier transform method is sensitive to noise, to the additive monotonic component arising from cell population growth, and to deviation from a strictly sinusoidal form of expression. Known cell-cycle regulated genes may not be available for certain organisms, and using them for training may bias the prediction.
Results: In this paper we propose an Improved Fourier Transform (IFT) method that accounts for several factors, such as the monotonic additive component of cell-cycle expression and irregular or partial-cycle sampling of gene expression. The proposed algorithm does not need any known cell-cycle regulated genes for prediction. Apart from removing the need for a training set, it also removes bias towards genes similar to the training set. We evaluated the method on two publicly available datasets: yeast cell-cycle data and HeLa cell-cycle data. On both datasets, the proposed algorithm performed competitively with the supervised Fourier transform method. It outperformed other unsupervised methods such as Partial Least Squares (PLS) and Single Pulse Modeling (SPM). The method is easy to comprehend and implement, and runs faster.
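To make the ingredients named above concrete, here is a minimal sketch of an unsupervised Fourier-style periodicity score. It is not the IFT method itself, whose details the abstract does not give. It assumes the monotonic component can be approximated by a straight line, and it projects onto a sinusoid at explicit sample times so that irregular sampling is tolerated. The function names and the normalisation are choices made for this illustration.

```python
import math


def detrend_linear(times, values):
    """Subtract the least-squares straight line (assumed trend model)."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    cov = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    slope = cov / var_t
    return [v - (mean_v + slope * (t - mean_t)) for t, v in zip(times, values)]


def periodicity_score(times, values, period):
    """Share of detrended signal energy at the given period.

    Uses a direct projection onto cos/sin at the actual sample times,
    so the time points need not be evenly spaced.
    """
    residual = detrend_linear(times, values)
    omega = 2.0 * math.pi / period
    a = sum(r * math.cos(omega * t) for t, r in zip(times, residual))
    b = sum(r * math.sin(omega * t) for t, r in zip(times, residual))
    energy = sum(r * r for r in residual)
    if energy == 0.0:
        return 0.0
    return 2.0 * (a * a + b * b) / (len(times) * energy)


# Hypothetical expression profiles: growth trend plus an oscillation.
times = list(range(24))
cycling_gene = [0.1 * t + math.sin(2 * math.pi * t / 8) for t in times]
other_gene = [0.1 * t + math.sin(2 * math.pi * t / 3) for t in times]
cycling_score = periodicity_score(times, cycling_gene, period=8)
other_score = periodicity_score(times, other_gene, period=8)
```

Genes can then be ranked by score at the assumed cell-cycle period, with no training set of known cell-cycle genes involved.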
{"title":"Improved fourier transform method for unsupervised cell-cycle regulated gene prediction.","authors":"Karuturi R Murthy, Liu Jian Hua","doi":"10.1109/csb.2004.1332433","DOIUrl":"https://doi.org/10.1109/csb.2004.1332433","journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","pages":"194-203"},"publicationDate":"2004-01-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"25829588"}
Pub Date: 2004-01-01 DOI: 10.1109/csb.2004.1332436
Tolga Can, Orhan Camoğlu, Ambuj K Singh, Yuan-Fang Wang
We propose a novel technique for automatically generating the SCOP classification of a protein structure with high accuracy. High accuracy is achieved by combining the decisions of multiple methods using the consensus of a committee (or ensemble) classifier. Our technique is rooted in machine learning, which shows that by judiciously employing component classifiers, an ensemble classifier can be constructed that outperforms its components. We use two sequence-comparison and three structure-comparison tools as component classifiers. Given a protein structure, using the joint hypothesis, we first determine whether the protein belongs to an existing category (family, superfamily, fold) in the SCOP hierarchy. For the proteins that are predicted to be members of existing categories, we compute their family-, superfamily-, and fold-level classifications using the consensus classifier. We show that we can significantly improve the classification accuracy compared to the individual component classifiers. In particular, we achieve error rates that are 3-12 times lower than the individual classifiers' error rates at the family level, 1.5-4.5 times lower at the superfamily level, and 1.1-2.4 times lower at the fold level.
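The idea of a committee decision can be sketched with a plain majority vote. This is an assumption made for illustration: the abstract does not specify how the five component decisions are actually combined, and the `min_votes` threshold and the use of `None` for "no confident assignment" are hypothetical.

```python
from collections import Counter


def consensus_label(predictions, min_votes):
    """Return the label with the most votes, or None if support is weak.

    predictions holds one label per component classifier; None means the
    component declined to assign the protein to an existing category.
    """
    votes = Counter(label for label in predictions if label is not None)
    if not votes:
        return None
    label, count = votes.most_common(1)[0]
    return label if count >= min_votes else None


# Five hypothetical component outputs for one protein (SCOP-style labels).
component_outputs = ["a.1.1", "a.1.1", "b.2.1", "a.1.1", None]
decision = consensus_label(component_outputs, min_votes=3)
```

Returning `None` when agreement is weak loosely mirrors the first step described above, where the protein is first tested for membership in an existing category before any label is assigned.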
{"title":"Automated protein classification using consensus decision.","authors":"Tolga Can, Orhan Camoğlu, Ambuj K Singh, Yuan-Fang Wang","doi":"10.1109/csb.2004.1332436","DOIUrl":"https://doi.org/10.1109/csb.2004.1332436","journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","pages":"224-35"},"publicationDate":"2004-01-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"25829591"}
Pub Date: 2004-01-01 DOI: 10.1109/csb.2004.1332449
Victor Olman, Hanchuan Peng, Zhengchang Su, Ying Xu
We present a novel computer algorithm for mapping biological pathways from one prokaryotic genome to another. The algorithm maps genes in a known pathway to their homologous genes (if any) in a target genome, choosing the mapping that is most consistent with (a) predicted orthologous gene relationships, (b) predicted operon structures, and (c) predicted co-regulation relationships of operons. Mathematically, we have formulated this problem as a constrained minimum spanning tree problem (called a Steiner network problem), and demonstrated through applications that this formulation has the desired property. We have solved this mapping problem using a combinatorial optimization algorithm with guaranteed global optimality. We have implemented this algorithm as a computer program called P-MAP. Our test results on pathway mapping are highly encouraging: we have mapped a number of pathways of H. influenzae, B. subtilis, H. pylori, and M. tuberculosis to E. coli using P-MAP. Because the homologous pathways in E. coli are known, the mapping accuracy could be checked. We have then mapped known E. coli pathways in the EcoCyc database to the newly sequenced organism Synechococcus sp. WH8102, and predicted 158 Synechococcus pathways. Detailed analyses of the predicted pathways indicate that P-MAP's mapping results are consistent with our general knowledge about (local) pathways. We believe that P-MAP will be a useful tool for microbial genome annotation projects and inference of individual microbial pathways.
{"title":"Mapping of microbial pathways through constrained mapping of orthologous genes.","authors":"Victor Olman, Hanchuan Peng, Zhengchang Su, Ying Xu","doi":"10.1109/csb.2004.1332449","DOIUrl":"https://doi.org/10.1109/csb.2004.1332449","journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","pages":"363-70"},"publicationDate":"2004-01-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"25830000"}
Pub Date: 2004-01-01 DOI: 10.1109/csb.2004.1332429
Ryo Yoshida, Tomoyuki Higuchi, Seiya Imoto
When we cluster tissue samples on the basis of genes, the number of observations to be grouped is much smaller than the dimension of the feature vector. In such a case, the applicability of conventional model-based clustering is limited, since the high dimensionality of the feature vector leads to overfitting during the density estimation process. To overcome this difficulty, we attempt a methodological extension of factor analysis. Our approach enables us not only to prevent overfitting, but also to handle the issues of clustering, data compression, and extraction of a set of genes relevant to explaining the group structure. The potential usefulness is demonstrated with an application to the leukemia dataset.
{"title":"A mixed factors model for dimension reduction and extraction of a group structure in gene expression data.","authors":"Ryo Yoshida, Tomoyuki Higuchi, Seiya Imoto","doi":"10.1109/csb.2004.1332429","DOIUrl":"https://doi.org/10.1109/csb.2004.1332429","journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","pages":"161-72"},"publicationDate":"2004-01-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"25830153"}
Reverse engineering of genetic networks generally requires establishing correlative behavior within and between a very large number of genes. This becomes a difficult analytical problem for even a few hundred genes, and the difficulty tends to grow exponentially as more genes are examined. Using a hybrid data analysis method known as Fractal Genomics Modeling (FGM), this problem is reduced to examining correlative behavior within small gene groups that can then be compared and integrated to produce a picture of larger networks using a type of shotgun approach. We have applied FGM to examining genetic networks involved in HIV infection in the brain. These networks have relevance both to processes related to HIV infection and to neurodegenerative disorders. Our preliminary findings have produced conjectures of related pathways and networks as well as new candidates for genetic markers in HIV brain infection. Evidence has also been produced that appears to show the presence of a hierarchical network structure within the genes studied. We will discuss the background and methodology of FGM as well as our recent findings.
{"title":"Fractal genomics modeling: a new approach to genomic analysis and biomarker discovery.","authors":"Sandy Shaw, Paul Shapshak","journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","pages":"9-18"},"publicationDate":"2004-01-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"25837755"}
Pub Date: 2004-01-01 DOI: 10.1109/csb.2004.1332414
Hong Liu, Li Li, Asher Zilberstein, Chang S Hahn
Both inter- and intra-chromosomal segmental duplications are known to have occurred in the human genome during evolution. Few cases of such segments involving functional genes have been reported. While searching for the human orthologs of murine hematopoietic deubiquitinating enzymes (DUBs), we identified four clusters of DUB-like genes on chromosome 4p15 and chromosome 8p22-23 that are over 90% identical to each other at the DNA level. These genes are expressed in a cell type-specific and activation-specific manner, with different clusters possessing potentially distinct expression profiles. Examining the sequences surrounding these gene duplication events, we have identified previously unreported conserved sequence elements, as large as 35 to 74 kb, encircling the gene clusters. Traces of these elements are also found on chromosome 12p13 and chromosome 11q13. The coding and immediate upstream sequences for DUB-like genes, as well as the surrounding conserved elements, are present in the chimpanzee trace database but not in the rodent genome. We hypothesize that the segments containing these DUB clusters and surrounding elements arose relatively recently in evolution through inter- and intra-chromosomal duplicative transpositions, following the divergence of primates and rodents. A genome-wide systematic search for segmental duplications containing duplicated gene clusters has been performed.
{"title":"Segmental duplications containing tandem repeated genes encoding putative deubiquitinating enzymes.","authors":"Hong Liu, Li Li, Asher Zilberstein, Chang S Hahn","doi":"10.1109/csb.2004.1332414","DOIUrl":"https://doi.org/10.1109/csb.2004.1332414","journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","pages":"31-9"},"publicationDate":"2004-01-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"25829770"}
Pub Date: 2004-01-01 DOI: 10.1109/csb.2004.1332419
Michael Molla, Jude Shavlik, Todd Richmond, Steven Smith
Current methods for interpreting oligonucleotide-based SNP-detection microarrays, SNP chips, are based on statistics and require extensive parameter tuning as well as extremely high-resolution images of the chip being processed. We present a method, based on a simple data-classification technique called nearest-neighbors that, on haploid organisms, produces results comparable to the published results of the leading statistical methods and requires very little in the way of parameter tuning. Furthermore, it can interpret SNP chips using lower-resolution scanners of the type more typically used in current microarray experiments. Along with our algorithm, we present the results of a SNP-detection experiment where, when independently applying this algorithm to six identical SARS SNP chips, we correctly identify all 24 SNPs in a particular strain of the SARS virus, with between 6 and 13 false positives across the six experiments.
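The nearest-neighbors technique mentioned above can be sketched generically as follows. The two-dimensional feature vectors, the labels, and the choice of k are hypothetical. The abstract does not say which probe-level features the authors use or how the method tunes itself, so this shows only the basic classifier.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to query.

    train is a list of (feature_vector, label) pairs; distance is
    Euclidean.
    """
    ranked = sorted(train, key=lambda item: math.dist(item[0], query))
    top = Counter(label for _, label in ranked[:k])
    return top.most_common(1)[0][0]


# Hypothetical probe features, e.g. two normalised intensity values.
training_set = [
    ((1.00, 0.10), "reference"),
    ((0.90, 0.20), "reference"),
    ((0.95, 0.15), "reference"),
    ((0.20, 0.90), "snp"),
    ((0.10, 1.00), "snp"),
    ((0.15, 0.80), "snp"),
]
call_a = knn_predict(training_set, (0.20, 0.85))
call_b = knn_predict(training_set, (0.90, 0.10))
```

Because the classifier has only one parameter, k, it needs little tuning, which fits the claim above that the approach requires very little parameter adjustment.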
{"title":"A self-tuning method for one-chip SNP identification.","authors":"Michael Molla, Jude Shavlik, Todd Richmond, Steven Smith","doi":"10.1109/csb.2004.1332419","DOIUrl":"https://doi.org/10.1109/csb.2004.1332419","journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","pages":"69-79"},"publicationDate":"2004-01-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"25829774"}
Pub Date: 2004-01-01 DOI: 10.1109/csb.2004.1332455
Sheng Zhong, Lu Tian, Cheng Li, Kai-Florian Storch, Wing H Wong
The Gene Ontology (GO) resource can be used as a powerful tool to uncover the properties shared among, and specific to, a list of genes produced by high-throughput functional genomics studies, such as microarray studies. In the comparative analysis of several gene lists, researchers may be interested in knowing which GO terms are enriched in one list of genes but relatively depleted in another. Statistical tests such as Fisher's exact test or the chi-square test can be performed to search for such GO terms. However, because multiple GO terms are tested simultaneously, p-values from individual tests do not serve as good indicators for picking GO terms. Furthermore, because these multiple tests are highly correlated, the usual multiple testing procedures that work under an independence assumption are not applicable. In this paper we introduce a procedure, based on the False Discovery Rate (FDR), to treat this correlated multiple testing problem. This procedure calculates a moderately conservative estimator of the q-value for every GO term. We identify the GO terms with q-values that satisfy a desired level as the significant GO terms. This procedure has been implemented in the GoSurfer software. GoSurfer is a Windows-based graphical data mining tool. It is freely available at http://www.gosurfer.org.
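For readers unfamiliar with q-values, the sketch below computes them with the standard Benjamini-Hochberg step-up adjustment. This is deliberately not the procedure described above: Benjamini-Hochberg is justified under independence or positive dependence, which is exactly the assumption the authors say fails for correlated GO terms. It serves only to show what "a q-value per GO term, thresholded at a desired level" looks like. The p-values are made up.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted q-values in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Hypothetical per-term p-values, e.g. from Fisher's exact test.
term_p_values = [0.01, 0.04, 0.03, 0.5]
term_q_values = benjamini_hochberg(term_p_values)
significant = [i for i, q in enumerate(term_q_values) if q <= 0.05]
```

With these inputs only the first term passes a 0.05 threshold, even though three of the raw p-values are below 0.05, which illustrates why individual p-values are poor indicators when many terms are tested at once.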
{"title":"Comparative analysis of gene sets in the Gene Ontology space under the multiple hypothesis testing framework.","authors":"Sheng Zhong, Lu Tian, Cheng Li, Kai-Florian Storch, Wing H Wong","doi":"10.1109/csb.2004.1332455","DOIUrl":"https://doi.org/10.1109/csb.2004.1332455","journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","pages":"425-35"},"publicationDate":"2004-01-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"25830006"}
Pub Date: 2004-01-01 DOI: 10.1109/csb.2004.1332430
Zhaohui Sun, Jingyi Yang, Jitender S Deogun
The recognition of regulatory motifs of co-regulated genes is essential for understanding regulatory mechanisms. However, the automatic extraction of regulatory motifs from a given data set of the upstream non-coding DNA sequences of a family of co-regulated genes is difficult because regulatory motifs are often subtle and inexact. This problem is further complicated by the corruption of the data sets. In this paper, a new approach called Mismatch-allowed Probabilistic Suffix Tree Motif Extraction (MISAE) is proposed. It combines the mismatch-allowed probabilistic suffix tree, a probabilistic model, with local prediction for the extraction of regulatory motifs. The proposed approach is tested on 15 co-regulated gene families and compares favorably with other state-of-the-art approaches. Moreover, MISAE performs well on "corrupted" data sets. It is able to extract the motif from a "corrupted" data set in which fewer than one fourth of the sequences contain the real motif.
{"title":"MISAE: a new approach for regulatory motif extraction.","authors":"Zhaohui Sun, Jingyi Yang, Jitender S Deogun","doi":"10.1109/csb.2004.1332430","DOIUrl":"https://doi.org/10.1109/csb.2004.1332430","journal":{"name":"Proceedings. IEEE Computational Systems Bioinformatics Conference","pages":"173-81"},"publicationDate":"2004-01-01","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"25830154"}