Because there is no standard naming convention for genes and gene products, gene symbol disambiguation (GSD) has become a major challenge in mining biomedical literature. Several GSD methods have been proposed that rely on MEDLINE references to genes. Gene databases such as Entrez Gene now provide extensive information about genes, and many biomedical ontologies, such as the UMLS Metathesaurus and Semantic Network, have been developed. These knowledge sources can be used for disambiguation. In this paper we propose a method that relies on information about gene candidates from gene databases, the contexts of gene symbols, and biomedical ontologies. We implement the method and evaluate the implementation on the BioCreAtIvE II data sets.
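As an illustration only, the idea of scoring gene candidates against the context of a symbol might be sketched as follows. The Jaccard token overlap, the function name `rank_candidates`, and the candidate identifiers and descriptions are assumptions made for this sketch; the abstract does not state how the method combines database information, context, and ontologies.

```python
def rank_candidates(context, candidates):
    """Rank candidate genes by token overlap between the text around an
    ambiguous symbol and each candidate's database description.

    `candidates` maps a (hypothetical) gene identifier to a free-text
    description. Jaccard overlap is a stand-in similarity measure.
    """
    context_tokens = set(context.lower().split())

    def jaccard(description):
        desc_tokens = set(description.lower().split())
        union = context_tokens | desc_tokens
        return len(context_tokens & desc_tokens) / len(union) if union else 0.0

    scored = [(gene_id, jaccard(desc)) for gene_id, desc in candidates.items()]
    # Highest-scoring candidate first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical candidates for one ambiguous symbol.
ranking = rank_candidates(
    "the kinase phosphorylates its substrate in the cell cycle",
    {"GENE_A": "cell cycle kinase", "GENE_B": "membrane transport protein"},
)
```

A real system would replace the token overlap with evidence drawn from the database records and ontology concepts, but the ranking structure would stay similar.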
{"title":"Knowledge-based gene symbol disambiguation","authors":"He Tan","doi":"10.1145/1458449.1458466","DOIUrl":"https://doi.org/10.1145/1458449.1458466","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2008-10-30","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"122035338"}
There is an increasing need to complement the information available for the analysis of biological systems in Systems Biology and Genomics projects. This need makes it attractive to integrate information extracted directly from textual sources using Information Extraction and Text Mining approaches. My group has been developing Text Mining approaches and integrating them into large-scale projects together with other experimental and bioinformatics methods. On this occasion I will present the developments related to the characterization of the human mitotic spindle apparatus, carried out in the context of the ENFIN NoE. For these and other applications, it is crucial to have an accurate estimate of the capabilities of current Text Mining systems. The BioCreative II challenge, organized by CNIO, MITRE and NCBI in collaboration with the MINT and INTACT databases (http://biocreative.sourceforge.net, Genome Biology, August 2008 Special Issue), provides such an overview. BioCreative II comprised two tasks: 1) gene name identification and normalization, where many systems achieved a consistent 80% balance of precision and recall; and 2) protein interaction detection, which was divided into four sub-tasks: a) ranking of publications by their relevance to the experimental determination of protein interactions, b) detection of protein interaction partners in text, c) detection of key sentences describing protein interactions, and d) detection of the experimental technique used to determine the interactions. The results were quite good in the categories of publication ranking, detection of experimental methods, and highlighting of relevant sentences, while they pointed to persistent problems in the correct normalization of gene/protein names. Furthermore, BioCreative has channeled the collaboration of several teams into the creation of the first Text Mining meta-server (The BioCreative Meta-server, Leitner et al., Genome Biology 2008 BioCreative special issue).
We are now preparing BioCreative III, with a particular focus on fostering the creation of Text Mining systems that can be integrated into genome analysis pipelines and contribute effectively to the understanding of complex biological systems.
{"title":"Text mining in genomics and systems biology","authors":"A. Valencia","doi":"10.1145/1458449.1458453","DOIUrl":"https://doi.org/10.1145/1458449.1458453","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2008-10-30","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"128948700"}
This paper investigates the application of statistical topic models to extracting and predicting relationships between biological entities, especially protein mentions. Latent Dirichlet Allocation (LDA), a statistical topic model, is promising but has not been investigated for this task. We apply state-of-the-art Collapsed Variational Bayesian inference and Gibbs sampling to estimate the LDA model, and compare them in terms of log-likelihood, classification accuracy and retrieval effectiveness. Our experiments show that Collapsed Variational LDA gives better results than Gibbs sampling, especially in classification accuracy and retrieval effectiveness on the protein-protein relationship prediction task.
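For readers unfamiliar with the Gibbs-sampling baseline mentioned above, a minimal collapsed Gibbs sampler for LDA might look like the sketch below. It is a textbook-style illustration, not the authors' implementation and not the collapsed variational method; the function name `lda_gibbs`, the hyperparameter defaults, and the toy documents are assumptions.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01, n_iter=50, seed=0):
    """Collapsed Gibbs sampling for LDA.

    `docs` is a list of documents, each a list of integer word ids in
    range(vocab_size). Returns the document-topic and topic-word counts.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]
    topic_word = [[0] * vocab_size for _ in range(n_topics)]
    topic_total = [0] * n_topics
    assignments = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(n_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                z = assignments[d][i]
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                # Conditional distribution over topics, up to a constant.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta)
                    / (topic_total[k] + vocab_size * beta)
                    for k in range(n_topics)
                ]
                z = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word


# Toy corpus: two documents over a four-word vocabulary.
doc_topic, topic_word = lda_gibbs([[0, 1, 0, 1], [2, 3, 2, 3, 2]], n_topics=2, vocab_size=4)
```

Collapsed variational inference replaces the sampled assignment of each token with a deterministic update of a distribution over topics, which is the comparison the abstract reports.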
{"title":"Predicting protein-protein relationships from literature using collapsed variational latent dirichlet allocation","authors":"Tatsuya Asou, K. Eguchi","doi":"10.1145/1458449.1458467","DOIUrl":"https://doi.org/10.1145/1458449.1458467","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2008-10-30","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"115193288"}
Classification using microarray gene expression data is an important task in bioinformatics. Because microarray data are characterized by high dimensionality and small sample size, there has recently been a drive to incorporate any available information beyond the expression data into the classification process. As a result, much work has begun on using gene expression data to select biological pathways that are closely related to a clinical outcome of interest, and incorporating this pathway information opens up new avenues for classification. In contrast to previous approaches that consider individual genes as features, we propose a new approach that treats biological pathways as features. Each pathway found to be significantly related to an outcome of interest is treated as a feature and mapped to a feature value. We define several methods for mapping pathways to features, and compare the performance of several classifiers using our feature transformations with that of the same classifiers using individual genes as features, under different feature selection methods.
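One simple way to map a pathway to a feature value is to average the expression of its member genes, as sketched below. The abstract does not say which mappings the authors define, so the mean, the function name `pathway_features`, and the gene and pathway names are assumptions for illustration.

```python
from statistics import mean


def pathway_features(sample, pathways):
    """Map each pathway to one feature value for a single sample.

    `sample` maps gene name -> expression value; `pathways` maps pathway
    name -> list of member genes. The mean is one possible mapping;
    genes missing from the sample are ignored.
    """
    features = {}
    for name, genes in pathways.items():
        values = [sample[g] for g in genes if g in sample]
        # A pathway with no measured genes gets a neutral value of 0.0.
        features[name] = mean(values) if values else 0.0
    return features


# Hypothetical sample and pathway definitions.
features = pathway_features(
    {"g1": 1.0, "g2": 3.0, "g3": 5.0},
    {"p1": ["g1", "g2"], "p2": ["g3", "gX"], "p3": ["gY"]},
)
```

The resulting dictionary has one entry per pathway, so a data set with thousands of genes collapses to as many features as there are selected pathways.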
{"title":"Biological pathways as features for microarray data classification","authors":"Brian Quanz, Meeyoung Park, Jun Huan","doi":"10.1145/1458449.1458455","DOIUrl":"https://doi.org/10.1145/1458449.1458455","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2008-10-30","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"125315768"}
Systems biology aims to understand the behavior of, and interactions between, the various components of the living cell, such as genes, proteins, and metabolites. These complex systems involve a large number of components, and the diversity of relationships between them can be overwhelming; there is therefore a need for analysis methods that incorporate data integration. We present a method for exploring gene regulatory mechanisms that integrates various types of data to help identify important components of gene regulation. First, gene expression data are analyzed to select a set of differentially expressed genes. These genes are then investigated further by combining various types of biological information, such as clustering results, promoter sequences, binding sites, transcription factors and other previously published information about the selected genes. Inspired by Information Fusion research, we also map the functions of the proposed method to the well-known OODA model to facilitate application of this data integration method in other research communities. We have successfully applied the method to genes identified as differentially expressed in human embryonic stem cells at different stages of differentiation towards cardiac cells, and identified 15 novel motifs that may represent important binding sites in the cardiac cell lineage.
{"title":"A data integration method for exploring gene regulatory mechanisms","authors":"Jane Synnergren, B. Olsson, Jonas Gamalielsson","doi":"10.1145/1458449.1458468","DOIUrl":"https://doi.org/10.1145/1458449.1458468","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2008-10-30","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"127610038"}
A. O. Falcão, Daniel Faria, António E. N. Ferreira
Functional prediction/classification of proteins is a central problem in bioinformatics. Alignment methods are a useful approach but have limitations, which have prompted the development and use of machine learning approaches. However, traditional machine learning approaches are unable to exploit sequence data directly, and instead use derived sequence features or kernel functions to obtain a feature space. Because, in theory, all the information necessary to predict a protein's structure and function is contained in its sequence, a methodology that exploits sequence data directly could be advantageous. A novel machine learning methodology for protein classification, inspired by the concept of fragment programs, is presented. This methodology consists of assigning a minimal computer program to each of the 20 amino acids, and then representing a protein as the program that results from sequentially applying the programs of the amino acids that compose its sequence. The basic concepts of the methodology (peptide programs) are discussed and a framework is proposed for their implementation, including the instruction set, virtual machine, evaluation procedures and convergence methods. The methodology is tested on the binary classification of 33,500 enzymes into 182 distinct Enzyme Commission (EC) classes. The average Matthews correlation coefficient of the binary classifiers is 0.75 in training and 0.68 in validation. Overall, the results demonstrate the potential of the proposed methodology and its ability to extract knowledge from sequence data using very few computational resources.
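The Matthews correlation coefficient reported above is computed from the confusion matrix of a binary classifier. The sketch below uses the standard definition; the function name `matthews_corrcoef` and the convention of returning 0.0 when the denominator is zero are choices made for this example, not details taken from the paper.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.

    Ranges from -1 (total disagreement) through 0 (no better than
    chance) to +1 (perfect prediction).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Undefined when a row or column of the matrix is empty;
        # 0.0 is a common convention, assumed here.
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

Because it uses all four cells of the confusion matrix, the coefficient stays informative when one class is much rarer than the other, as is the case for one-versus-rest classifiers over 182 EC classes.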
{"title":"Peptide programs: applying fragment programs to protein classification","authors":"A. O. Falcão, Daniel Faria, António E. N. Ferreira","doi":"10.1145/1458449.1458459","DOIUrl":"https://doi.org/10.1145/1458449.1458459","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2008-10-30","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"127221253"}
Microarray data sets contain expression levels of thousands of genes. The statistical analysis of such data sets is typically performed outside a DBMS, with statistical packages or mathematical libraries. In this work, we focus on analyzing them inside the DBMS. This is a difficult problem because microarray data sets have high dimensionality but few samples. First, because a DBMS limits the maximum number of columns per table, the data set has to be pivoted and transformed before analysis. More importantly, the correlation matrix for tens of thousands of genes has millions of values. While most high-dimensional data sets can be analyzed with the classical PCA method, small but high-dimensional data sets can only be analyzed with Singular Value Decomposition (SVD). We adapt the Householder tridiagonalization and QR factorization numerical methods to solve SVD inside the DBMS. Since these methods require many matrix operations, which are hard to express in SQL, we develop query optimizations and efficient UDFs to obtain good performance. Our techniques achieve processing times comparable to those of R, a well-known statistical package. We show experimentally that our methods scale well with high dimensionality.
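The pivoting step can be illustrated with a small sketch that stores a samples-by-genes matrix as one row per (sample, gene) pair and lets SQL compute the gene-by-gene cross products that underlie a correlation matrix. It uses SQLite through Python's standard `sqlite3` module rather than the DBMS used in the paper; the table name `expr`, its columns, and the function names are assumptions.

```python
import sqlite3


def to_long_rows(matrix, gene_names):
    """Pivot a samples x genes matrix into (sample_id, gene, value) rows."""
    return [
        (i, gene, value)
        for i, row in enumerate(matrix)
        for gene, value in zip(gene_names, row)
    ]


def gene_cross_products(matrix, gene_names):
    """Compute sum over samples of x_a * x_b for every gene pair in SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE expr (sample_id INTEGER, gene TEXT, value REAL)")
    conn.executemany(
        "INSERT INTO expr VALUES (?, ?, ?)", to_long_rows(matrix, gene_names)
    )
    # Self-join on sample_id; a.gene <= b.gene keeps the upper triangle only.
    rows = conn.execute(
        "SELECT a.gene, b.gene, SUM(a.value * b.value) "
        "FROM expr a JOIN expr b ON a.sample_id = b.sample_id "
        "WHERE a.gene <= b.gene "
        "GROUP BY a.gene, b.gene"
    ).fetchall()
    conn.close()
    return {(g1, g2): total for g1, g2, total in rows}


# Two samples, two genes.
products = gene_cross_products([[1.0, 2.0], [3.0, 4.0]], ["g1", "g2"])
```

The long layout avoids the column limit because the number of genes affects only the row count, while the self-join shows why the pairwise matrix grows quadratically with the number of genes.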
{"title":"Microarray data analysis with PCA in a DBMS","authors":"W. Rinsurongkawong, C. Ordonez","doi":"10.1145/1458449.1458456","DOIUrl":"https://doi.org/10.1145/1458449.1458456","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2008-10-30","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"126229565"}
One of the most challenging problems in mining gene expression data is to identify how the expression of a particular gene affects the expression of other genes. To elucidate the relationships between genes, association rule mining (ARM) has been applied to microarray gene expression data. Conventional ARM, however, is limited in its ability to extract temporal dependencies between genes, even though temporal information is indispensable for discovering the underlying regulation mechanisms in biological pathways. In this paper, we therefore propose a novel method, referred to as temporal association rule mining (TARM), which can extract temporal dependencies among related genes. A temporal association rule has the form [gene A ↑, gene B ↓] → (7 min)[gene C], which states that a high expression level of gene A together with significant repression of gene B is followed by significant expression of gene C after 7 minutes. The proposed TARM method is tested on a Saccharomyces cerevisiae cell cycle time-series microarray gene expression data set. In the parameter fitting phase of TARM, the parameter set that extracted the largest number of correct associations in the KEGG cell cycle pathway [threshold = ±0.8, support cutoff = 3 transactions, confidence cutoff = 90%] was chosen for the rule mining phase. Furthermore, TARM achieved a higher precision score (0.38) than a Bayesian network (0.16). With the best parameter set, a number of temporal association rules with five transcriptional time delays (0, 7, 14, 21, 28 minutes) are extracted from the gene expression data of 799 genes previously identified as cell cycle relevant, while a comparatively small number of rules are extracted from randomly shuffled expression data for the same 799 genes. From the extracted temporal association rules, we identify associated genes that play the same role in biological processes within a short transcriptional time delay, as well as temporal dependencies between genes involved in specific biological processes.
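To make the rule form concrete, the sketch below discretizes expression values with the ±0.8 threshold and counts how often a rule such as [gene A ↑, gene B ↓] → (lag)[gene C ↑] holds in a time series. Treating each time point as one transaction, the direction assumed for gene C, the function names, and the toy data are assumptions; the abstract does not give the paper's exact definitions of support and confidence.

```python
def discretize(series, threshold=0.8):
    """Map each value to +1 (up), -1 (down) or 0 (unchanged)."""
    return [1 if v >= threshold else -1 if v <= -threshold else 0 for v in series]


def temporal_rule_stats(expr, antecedent, consequent, lag_steps, threshold=0.8):
    """Support and confidence of a temporal association rule.

    `expr` maps gene -> list of expression values at equally spaced time
    points. `antecedent` is a list of (gene, state) pairs that must hold
    at time t; `consequent` is one (gene, state) pair checked at
    t + lag_steps. With 7-minute sampling, lag_steps=1 means 7 minutes.
    """
    states = {gene: discretize(values, threshold) for gene, values in expr.items()}
    n_points = len(next(iter(states.values())))
    cons_gene, cons_state = consequent
    antecedent_hits = 0
    rule_hits = 0
    for t in range(n_points - lag_steps):
        if all(states[gene][t] == state for gene, state in antecedent):
            antecedent_hits += 1
            if states[cons_gene][t + lag_steps] == cons_state:
                rule_hits += 1
    confidence = rule_hits / antecedent_hits if antecedent_hits else 0.0
    return rule_hits, confidence


# Toy series for three hypothetical genes sampled at four time points.
support, confidence = temporal_rule_stats(
    {
        "A": [1.0, 0.9, 0.1, 1.2],
        "B": [-1.0, -0.9, 0.0, -1.1],
        "C": [0.0, 1.0, 0.9, 0.0],
    },
    antecedent=[("A", 1), ("B", -1)],
    consequent=("C", 1),
    lag_steps=1,
)
```

A rule would then be kept only if its support and confidence meet the chosen cutoffs, for example at least 3 transactions and 90% confidence as in the parameter set reported above.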
{"title":"Identification of temporal association rules from time-series microarray data set: temporal association rules","authors":"Hojung Nam, K. Lee, Doheon Lee","doi":"10.1145/1458449.1458457","DOIUrl":"https://doi.org/10.1145/1458449.1458457","journal":{"name":"Data and Text Mining in Bioinformatics"},"publicationDate":"2008-10-26","publicationTypes":"Journal Article","platform":"Semanticscholar","paperid":"126889796"}