Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0038
C Baral, G Gonzalez, A Gitter, C Teegarden, A Zeigler, G Joshi-Topé
In molecular biology research, looking for information on a particular entity such as a gene or a protein may lead to thousands of articles, making it impossible for a researcher to individually read these articles or even just their abstracts. Thus, there is a need to curate the literature to get various nuggets of knowledge, such as an interaction between two proteins, and store them in a database. However, the body of existing biomedical articles is growing at a very fast rate, making it impossible to curate them manually. An alternative approach of using computers for automatic extraction has problems with accuracy. We propose to leverage the advantages of both techniques, extracting binary relationships between biological entities automatically from the biomedical literature and providing a platform that allows community collaboration in the annotation of the extracted relationships. Thus, the community of researchers that writes and reads the biomedical texts can use the server for searching our database of extracted facts, and as an easy-to-use web platform to annotate facts relevant to them. We presented a preliminary prototype as a proof of concept earlier (1). This paper presents the working implementation available for download at http://www.cbioc.org as a browser plug-in for both Internet Explorer and Firefox. This current version has been available since June of 2006, and has over 160 registered users from around the world. Aside from its use as an annotation tool, data from CBioC has also been used in computational methods with encouraging results.
Title: CBioC: beyond a prototype for collaborative annotation of molecular interactions from the literature.
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pp. 381-4
Peptide sequencing by tandem mass spectrometry is a very important, interesting, yet challenging problem in proteomics. This problem has been extensively investigated recently, and peptide sequencing results are becoming more and more accurate. However, many of these algorithms use computational models based on unverified assumptions. We believe that investigating the validity of these assumptions and related problems will lead to improvements in current algorithms. In this paper, we have first investigated peptide sequencing without preprocessing the spectrum, and we have shown that by introducing preprocessing of the spectrum, peptide sequencing can be faster, easier and more accurate. We have then investigated one very important problem, the anti-symmetric problem in peptide sequencing, and we have shown by experiments that models that simply ignore the anti-symmetric problem, or models that remove all anti-symmetric instances, are too simple for the peptide sequencing problem. We have proposed a new model that handles the anti-symmetric problem in a more realistic way. We have also proposed a novel algorithm which incorporates preprocessing and the new model for the anti-symmetric issue, and experiments show that this algorithm has better performance on the datasets examined.
Title: Algorithm for peptide sequencing by tandem mass spectrometry based on better preprocessing and anti-symmetric computational model.
Authors: Kang Ning, Hon Wai Leong
Pub Date : 2007-01-01
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pp. 19-30
Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0022
Zheng Fu, Tao Jiang
The identification of orthologous genes shared by multiple genomes is critical for both functional and evolutionary studies in comparative genomics. While it is usually done by sequence similarity search and reconciled tree construction in practice, recently a new combinatorial approach and a high-throughput system MSOAR for ortholog identification between closely related genomes based on genome rearrangement and gene duplication have been proposed in (11). MSOAR assumes that orthologous genes correspond to each other in the most parsimonious evolutionary scenario minimizing the number of genome rearrangement and (post-speciation) gene duplication events. However, the parsimony approach used by MSOAR limits it to pairwise genome comparisons. In this paper, we extend MSOAR to multiple (closely related) genomes and propose an ortholog clustering method, called MultiMSOAR, to infer main orthologs in multiple genomes. As a preliminary experiment, we apply MultiMSOAR to rat, mouse and human genomes, and validate our results using gene annotations and gene function classifications in the public databases. We further compare our results to the ortholog clusters predicted by MultiParanoid, which is an extension of the well-known program Inparanoid for pairwise genome comparisons. The comparison reveals that MultiMSOAR gives more detailed and accurate orthology information since it can effectively distinguish main orthologs from inparalogs.
Title: Clustering of main orthologs for multiple genomes.
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pp. 195-201
Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0025
Erliang Zeng, K. Mathee, G. Narasimhan
Understanding gene regulation is a key step to investigating gene functions and their relationships. Many algorithms have been developed to discover transcription factor binding sites (TFBS); they are predominantly located in upstream regions of genes and contribute to transcription regulation if they are bound by a specific transcription factor. However, traditional methods focusing on finding motifs have shortcomings, which can be overcome by using comparative genomics data that is now increasingly available. Traditional methods to score motifs also have their limitations. In this paper, we propose a new algorithm called IEM to refine motifs using comparative genomics data. We show the effectiveness of our techniques with several data sets. Two sets of experiments were performed with comparative genomics data on five strains of P. aeruginosa. One set of experiments was performed with similar data on four species of yeast. The weighted conservation score proposed in this paper is an improvement over existing motif scores.
Title: IEM: an algorithm for iterative enhancement of motifs using comparative genomics data.
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pp. 227-35
Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0016
P. Ghosh, Samik Ghosh, K. Basu, Sajial K Das
The molecular networks regulating basic physiological processes in a cell are generally converted into rate equations assuming the number of biochemical molecules as deterministic variables. At steady state these rate equations give a set of differential equations that are solved using numerical methods. However, the stochastic cellular environment motivates us to propose a mathematical framework for analyzing such biochemical molecular networks. The stochastic simulators that solve a system of differential equations include this stochasticity in the model, but suffer from simulation stiffness and require huge computational overheads. This paper describes a new Markov chain based model to simulate such complex biological systems with reduced computation and memory overheads. The central idea is to transform the continuous domain chemical master equation (CME) based method into a discrete domain of molecular states with corresponding state transition probabilities and times. Our methodology allows the basic optimization schemes devised for the CME and can also be extended to reduce the computational and memory overheads appreciably at the cost of accuracy. The simulation results for the standard Enzyme-Kinetics and Transcriptional Regulatory systems show promising correspondence with the CME based methods and point to the efficacy of our scheme.
Title: A Markov model based analysis of stochastic biochemical systems.
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pp. 121-32
Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0033
Xin Gao, D. Bu, S. Li, Ming Li, Jinbo Xu
Protein inter-residue contacts are of great use for protein structure determination or prediction. Recent CASP events have shown that a few accurately predicted contacts can help improve both computational efficiency and prediction accuracy of the ab initio folding methods. This paper develops an integer linear programming (ILP) method for consensus-based contact prediction. In contrast to the simple "majority voting" method assuming that all the individual servers are equal and independent, our method evaluates their correlations using the maximum likelihood method and constructs some latent independent servers using the principal component analysis technique. Then, we use an integer linear programming model to assign weights to these latent servers in order to maximize the deviation between the correct contacts and incorrect ones; our consensus prediction server is the weighted combination of these latent servers. In addition to the consensus information, our method also uses server-independent correlated mutation (CM) as one of the prediction features. Experimental results demonstrate that our contact prediction server performs better than the "majority voting" method. The accuracy of our method for the top L/5 contacts on CASP7 targets is 73.41%, which is much higher than in previously reported studies. On the 16 free modeling (FM) targets, our method achieves an accuracy of 37.21%.
Title: Consensus contact prediction by linear programming.
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pp. 323-34
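The contrast drawn in the consensus contact prediction abstract, between equal-weight "majority voting" and a weighted combination of servers, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function name `consensus_contacts`, the 0.5 threshold, and the hand-picked weights are hypothetical. The sketch does not model the paper's latent servers, its principal component analysis step, or the ILP that assigns the weights.

```python
def consensus_contacts(predictions, weights=None, threshold=0.5):
    """Combine per-server contact predictions into one consensus set.

    predictions: list of sets of (i, j) residue index pairs, one set per server.
    weights: optional per-server weights (hypothetical values, not ILP output).
             With weights=None every server counts equally (majority voting).
    A contact is kept when its weighted share of votes exceeds `threshold`.
    """
    if weights is None:
        weights = [1.0] * len(predictions)
    total = sum(weights)
    scores = {}
    for contacts, weight in zip(predictions, weights):
        for pair in contacts:
            scores[pair] = scores.get(pair, 0.0) + weight
    return {pair for pair, score in scores.items() if score / total > threshold}


# Three hypothetical servers predicting contacts for the same target.
servers = [{(1, 5), (2, 9)}, {(1, 5)}, {(2, 9), (3, 7)}]
majority = consensus_contacts(servers)
weighted = consensus_contacts(servers, weights=[1.0, 1.0, 3.0])
```

With equal weights a contact needs support from two of the three servers; once the third server is trusted more, its lone prediction `(3, 7)` passes the threshold while `(1, 5)` drops out.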
Pub Date : 2007-01-01 DOI: 10.1142/9781860948732_0032
H. Rangwala, G. Karypis
The effectiveness of comparative modeling approaches for protein structure prediction can be substantially improved by incorporating predicted structural information in the initial sequence-structure alignment. Motivated by the approaches used to align protein structures, this paper focuses on developing machine learning approaches for estimating the RMSD value of a pair of protein fragments. These estimated fragment-level RMSD values can be used to construct the alignment, assess the quality of an alignment, and identify high-quality alignment segments. We present algorithms to solve this fragment-level RMSD prediction problem using a supervised learning framework based on support vector regression and classification that incorporates protein profiles, predicted secondary structure, effective information encoding schemes, and novel second-order pairwise exponential kernel functions. Our comprehensive empirical study shows superior results compared to the profile-to-profile scoring schemes.
Title: fRMSDPred: predicting local RMSD between structural fragments using sequence information.
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pp. 311-22
Mary Ellen Bock, Claudio Garutti, Conettina Guerra
We present a method for detecting and comparing cavities on protein surfaces that is useful for protein binding site recognition. The method is based on a representation of the protein structures by a collection of spin-images and their associated spin-image profiles. Results of the cavity detection procedure are presented for a large set of non-redundant proteins and compared with SURFNET-ConSurf. Our comparison method is used to find a surface region in one cavity of a protein that is geometrically similar to a surface region in the cavity of another protein. Such a finding would be an indication that the two regions likely bind to the same ligand. Our overall approach for cavity detection and comparison is benchmarked on several pairs of known complexes, obtaining a good coverage of the atoms of the binding sites.
Title: Effective labeling of molecular surface points for cavity detection and location of putative binding sites.
Pub Date : 2007-01-01
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pp. 263-74
Statistical relations between genome-wide mRNA transcript levels have been successfully used to infer regulatory relations among the genes; however, the most successful methods have relied on additional data and focused on small sub-networks of genes. Along these lines, we recently demonstrated a model for simultaneously incorporating micro-array expression data with whole genome genotype marker data to identify causal pairwise relationships among genes. In this paper we extend this methodology to the principled construction of networks describing local regulatory modules. Our method is a two-step process: starting with a seed gene of interest, a Markov Blanket over genotype and gene expression observations is inferred according to differential entropy estimation; a Bayes Net is then constructed from the resulting variables with important biological constraints yielding causally correct relationships. We tested our method by simulating a regulatory network within the background of a real data set. We found that 45% of the genes in a regulatory module can be identified and the relations among the genes can be recovered with moderately high accuracy (> 70%). Since sample size is a practical and economic limitation, we considered the impact of increasing the number of samples and found that recovery of true gene-gene relationships only doubled with ten times the number of samples, suggesting that useful networks can be achieved with current experimental designs, but that significant improvements are not expected without major increases in the number of samples. When we applied this method to an actual data set of 111 back-crossed mice we were able to recover local gene regulatory networks supported by the biological literature.
Title: An information theoretic method for reconstructing local regulatory network modules from polymorphic samples.
Authors: Manjunatha Jagalur, David Kulp
Pub Date : 2007-01-01
Journal: Computational systems bioinformatics. Computational Systems Bioinformatics Conference, pp. 133-43
Pub Date : 2007-01-01DOI: 10.1142/9781860948732_0027
Xin Chen, Lingqiong Guo, Zhaocheng Fan, Tao Jiang
Position weight matrices (PWMs) are widely used to depict the DNA binding preferences of transcription factors (TFs) in computational molecular biology and regulatory genomics. Thus, learning an accurate PWM to characterize the binding sites of a specific TF is a fundamental problem that plays an important role in modeling regulatory motifs and discovering the binding targets of TFs. Given a set of binding sites bound by a TF, the learning problem can be formulated as a straightforward maximum likelihood problem, namely, finding a PWM such that the likelihood of the observed binding sites is maximized, and is usually solved by counting the base frequencies at each position of the aligned binding sequences. In this paper, we study the question of accurately learning a PWM from both binding site sequences and gene expression (or ChIP-chip) data. We revise the above maximum likelihood framework by taking into account the given gene expression or ChIP-chip data. More specifically, we attempt to find a PWM such that the likelihood of simultaneously observing both the binding sequences and the associated gene expression (or ChIP-chip) values is maximized, by using the sequence weighting scheme introduced in our recent work. We have incorporated this new approach for estimating PWMs into the popular motif finding program AlignACE. The modified program, called W-AlignACE, is compared with three other programs (AlignACE, MDscan, and MotifRegressor) on a variety of datasets, including simulated data, publicly available mRNA expression data, and ChIP-chip data. These large-scale tests demonstrate that W-AlignACE is an effective tool for discovering TF binding sites from gene expression or ChIP-chip data and, in particular, has the ability to find very weak motifs.
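The counting-based maximum likelihood estimate described above can be sketched as follows. The unweighted case follows the abstract directly; the optional `weights` argument is a hypothetical stand-in for a sequence weighting scheme and is not the scheme used in W-AlignACE, which the abstract does not specify. The function names, the pseudocount parameter, and the example sites are assumptions for illustration.

```python
import math

BASES = "ACGT"


def learn_pwm(sites, weights=None, pseudocount=0.0):
    """Estimate per-position base frequencies from aligned binding sites.

    With weights=None this is the plain maximum likelihood estimate
    (base counts divided by the number of sites). With weights, each
    site contributes its weight instead of 1 (illustrative only).
    """
    if not sites:
        raise ValueError("at least one binding site is required")
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("binding sites must be aligned to equal length")
    if weights is None:
        weights = [1.0] * len(sites)

    pwm = []
    for pos in range(length):
        counts = {base: pseudocount for base in BASES}
        for site, weight in zip(sites, weights):
            counts[site[pos]] += weight
        total = sum(counts.values())
        pwm.append({base: counts[base] / total for base in BASES})
    return pwm


def log_likelihood(pwm, sites):
    """Log-likelihood of the sites under the PWM (positions independent)."""
    return sum(math.log(pwm[pos][base])
               for site in sites
               for pos, base in enumerate(site))


# Made-up aligned sites for illustration.
sites = ["ACGT", "ACGA", "ACGT", "TCGT"]
pwm = learn_pwm(sites)
print(pwm[0])  # frequency of each base at the first position

# Hypothetical weights, e.g. derived from expression or ChIP-chip values.
weighted = learn_pwm(sites, weights=[1.0, 1.0, 1.0, 3.0])
print(weighted[0])
```

A small positive `pseudocount` keeps unseen bases from getting zero probability, which matters when scoring new sequences with `log_likelihood`.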
{"title":"Learning position weight matrices from sequence and expression data.","authors":"Xin Chen, Lingqiong Guo, Zhaocheng Fan, Tao Jiang","doi":"10.1142/9781860948732_0027","DOIUrl":"https://doi.org/10.1142/9781860948732_0027","url":null,"abstract":"Position weight matrices (PWMs) are widely used to depict the DNA binding preferences of transcription factors (TFs) in computational molecular biology and regulatory genomics. Thus, learning an accurate PWM to characterize the binding sites of a specific TF is a fundamental problem that plays an important role in modeling regulatory motifs and discovering the binding targets of TFs. Given a set of binding sites bound by a TF, the learning problem can be formulated as a straightforward maximum likelihood problem, namely, finding a PWM such that the likelihood of the observed binding sites is maximized, and is usually solved by counting the base frequencies at each position of the aligned binding sequences. In this paper, we study the question of accurately learning a PWM from both binding site sequences and gene expression (or ChIP-chip) data. We revise the above maximum likelihood framework by taking into account the given gene expression or ChIP-chip data. More specifically, we attempt to find a PWM such that the likelihood of simultaneously observing both the binding sequences and the associated gene expression (or ChIP-chip) values is maximized, by using the sequence weighting scheme introduced in our recent work. We have incorporated this new approach for estimating PWMs into the popular motif finding program AlignACE. The modified program, called W-AlignACE, is compared with three other programs (AlignACE, MDscan, and MotifRegressor) on a variety of datasets, including simulated data, publicly available mRNA expression data, and ChIP-chip data. 
These large-scale tests demonstrate that W-AlignACE is an effective tool for discovering TF binding sites from gene expression or ChIP-chip data and, in particular, has the ability to find very weak motifs.","PeriodicalId":72665,"journal":{"name":"Computational systems bioinformatics. Computational Systems Bioinformatics Conference","volume":"6 1","pages":"249-60"},"PeriodicalIF":0.0,"publicationDate":"2007-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"64007104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}