
Data and Text Mining in Bioinformatics: Latest Papers

Knowledge-based gene symbol disambiguation
Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458466
He Tan
Because there is no standard naming convention for genes and gene products, gene symbol disambiguation (GSD) has become a major challenge in mining biomedical literature. Several GSD methods have been proposed based on MEDLINE references to genes. However, gene databases such as Entrez Gene now provide plentiful information about genes, and many biomedical ontologies, such as the UMLS Metathesaurus and Semantic Network, have been developed. These knowledge sources can be used for disambiguation. In this paper we propose a method that relies on information about gene candidates from gene databases, the contexts of gene symbols, and biomedical ontologies. We implement our method and evaluate the implementation using the BioCreAtIvE II data sets.
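The idea of matching a symbol's context against what a gene database says about each candidate can be sketched as follows. This is a minimal illustration, not the paper's method: the Jaccard-overlap score is a stand-in, and the symbol `ABC1`, the candidate IDs, and their descriptive terms are hypothetical toy values rather than Entrez Gene records.

```python
import re

# Hypothetical toy knowledge base: the symbol, candidate IDs, and descriptive
# terms are invented for illustration and do not come from Entrez Gene.
CANDIDATES = {
    "ABC1": {
        "gene:0001": {"kinase", "signaling", "membrane", "phosphorylation"},
        "gene:0002": {"transporter", "lipid", "cholesterol", "efflux"},
    }
}

def tokenize(text):
    """Lowercase the text and return its set of alphabetic tokens."""
    return set(re.findall(r"[a-z]+", text.lower()))

def disambiguate(symbol, context):
    """Return (best candidate ID, score) for a symbol seen in a context.

    The score is the Jaccard overlap between the context tokens and the
    candidate's descriptive terms; (None, 0.0) means nothing matched.
    """
    words = tokenize(context)
    best_id, best_score = None, 0.0
    for gene_id, terms in sorted(CANDIDATES.get(symbol, {}).items()):
        union = words | terms
        score = len(words & terms) / len(union) if union else 0.0
        if score > best_score:
            best_id, best_score = gene_id, score
    return best_id, best_score

print(disambiguate("ABC1", "ABC1 mediates cholesterol efflux and lipid transport"))
```

A fuller system would presumably also map context words to ontology concepts before comparing them, which is where resources such as the UMLS would come in; that step is omitted here.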
Citations: 5
Text mining in genomics and systems biology
Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458453
A. Valencia
There is an increasing need to complement the information available for the analysis of biological systems in Systems Biology and Genomics projects. This need makes it attractive to integrate information extracted directly from textual sources using Information Extraction and Text Mining approaches. My group has been developing Text Mining approaches and integrating them into large-scale projects together with other experimental and bioinformatics methods. On this occasion I will present the developments related to the characterization of the human mitotic spindle apparatus, carried out in the context of the ENFIN NoE. For these and other applications it is crucial to have an accurate estimate of the capabilities of current Text Mining systems. The BioCreative II challenge, organized by CNIO, MITRE and NCBI in collaboration with the MINT and IntAct databases (http://biocreative.sourceforge.net, Genome Biology, August 2008 Special Issue), provides such an overview. BioCreative II comprised two tasks: 1) gene name identification and normalization, where many systems achieved a consistent 80% balanced precision/recall; and 2) protein interaction detection, which was divided into four sub-tasks: a) ranking of publications by their relevance to the experimental determination of protein interactions, b) detection of protein interaction partners in text, c) detection of key sentences describing protein interactions, and d) detection of the experimental technique used to determine the interactions. The results were quite good for publication ranking, detection of experimental methods, and highlighting of relevant sentences, while they pointed to persistent problems in the correct normalization of gene/protein names. Furthermore, BioCreative has channeled the collaboration of several teams into the creation of the first Text Mining meta-server (The BioCreative Meta-server, Leitner et al., Genome Biology 2008 BioCreative special issue). We are now preparing BioCreative III, with a particular focus on fostering the creation of Text Mining systems that can be integrated into genome analysis pipelines and contribute effectively to the understanding of complex biological systems.
Citations: 6
Predicting protein-protein relationships from literature using collapsed variational latent Dirichlet allocation
Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458467
Tatsuya Asou, K. Eguchi
This paper investigates applying statistical topic models to extract and predict relationships between biological entities, especially protein mentions. The statistical topic model Latent Dirichlet Allocation (LDA) is promising; however, it has not been investigated for such a task. In this paper, we apply state-of-the-art Collapsed Variational Bayesian inference and Gibbs Sampling inference to estimate the LDA model, and compare them in terms of log-likelihood, classification accuracy and retrieval effectiveness. We demonstrate through experiments that Collapsed Variational LDA gives better results than Gibbs Sampling, especially in terms of classification accuracy and retrieval effectiveness in the task of protein-protein relationship prediction.
Citations: 7
Biological pathways as features for microarray data classification
Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458455
Brian Quanz, Meeyoung Park, Jun Huan
Classification using microarray gene expression data is an important task in bioinformatics. Because microarray data are characterized by high dimensionality and small sample sizes, there has recently been a drive to incorporate any available information beyond the expression data into the classification process. As a result, much work has begun on using gene expression data to select biological pathways that are closely related to a clinical outcome of interest, and incorporating this pathway information opens up new avenues for classification. In contrast to previous approaches that consider individual genes as features, we propose a new approach that treats biological pathways as features. Each pathway found to be significantly related to an outcome of interest is treated as a feature and is mapped to a feature value. We define several methods for mapping pathways to features and, for different feature selection methods, compare the performance of several classifiers using our feature transformations with that of the same classifiers using individual genes as features.
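One way to picture mapping a pathway to a feature value is to summarize the expression of its member genes. The sketch below uses the mean, which is an assumption chosen for simplicity: the abstract does not say which mappings the authors define, and the gene and pathway names are hypothetical placeholders.

```python
from statistics import mean

def pathway_features(expression, pathways):
    """Map one sample's gene-expression dict to one value per pathway.

    The value is the mean expression of the pathway's member genes that
    were measured; pathways with no measured genes map to 0.0.
    """
    features = {}
    for name, genes in pathways.items():
        values = [expression[g] for g in genes if g in expression]
        features[name] = mean(values) if values else 0.0
    return features

# Hypothetical toy sample and pathway definitions (names are placeholders).
sample = {"g1": 2.0, "g2": 4.0, "g3": -1.0, "g4": 0.5}
pathways = {"pathway_A": ["g1", "g2"], "pathway_B": ["g3", "g4", "g9"]}
print(pathway_features(sample, pathways))
```

The resulting vector has one entry per selected pathway instead of one per gene, so a classifier trained on it works in a much lower-dimensional space.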
Citations: 10
A data integration method for exploring gene regulatory mechanisms
Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458468
Jane Synnergren, B. Olsson, Jonas Gamalielsson
Systems biology aims to understand the behavior of, and the interactions between, the various components of the living cell, such as genes, proteins, and metabolites. A large number of components are involved in these complex systems, and the diversity of relationships between them can be overwhelming; there is therefore a need for analysis methods that incorporate data integration. Here we present a method for exploring gene regulatory mechanisms that integrates various types of data to assist in identifying important components of those mechanisms. By first analyzing gene expression data, a set of differentially expressed genes is selected. These genes are then further investigated by combining various types of biological information, such as clustering results, promoter sequences, binding sites, transcription factors and other previously published information regarding the selected genes. Inspired by Information Fusion research, we also map the functions of the proposed method to the well-known OODA model to facilitate application of this data integration method in other research communities. We have successfully applied the method to genes identified as differentially expressed in human embryonic stem cells at different stages of differentiation towards cardiac cells. We identified 15 novel motifs that may represent important binding sites in the cardiac cell lineage.
Citations: 1
Peptide programs: applying fragment programs to protein classification
Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458459
A. O. Falcão, Daniel Faria, António E. N. Ferreira
Functional prediction/classification of proteins is a central problem in bioinformatics. Alignment methods are a useful approach but have limitations, which have prompted the development and use of machine learning approaches. However, traditional machine learning approaches are unable to exploit sequence data directly, and instead use derived sequence features or kernel functions to obtain a feature space. Because, in theory, all the information necessary to predict a protein's structure and function is contained in its sequence, a methodology that could exploit sequence data directly could be advantageous. A novel machine learning methodology for protein classification, inspired by the concept of fragment programs, is presented. This methodology consists of assigning a minimal computer program to each of the 20 amino acids, and then representing a protein as the program that results from sequentially applying the programs of the amino acids that compose its sequence. The basic concepts of the methodology presented (peptide programs) are discussed and a framework is proposed for their implementation, including instruction set, virtual machine, evaluation procedures and convergence methods. The methodology is tested in the binary classification of 33,500 enzymes into 182 distinct Enzyme Commission (EC) classes. The average Matthews correlation coefficient of the binary classifiers is 0.75 in training and 0.68 in validation. Overall, the results obtained demonstrate the potential of the proposed methodology and its ability to extract knowledge from sequence data using very few computational resources.
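The composition step described above (one small program per amino acid, concatenated in sequence order and executed) can be sketched as below, together with the Matthews correlation coefficient used for evaluation. The two-register machine, its three opcodes, the residue programs, and the decision rule are all hypothetical and are not the instruction set or virtual machine defined in the paper; only the MCC formula is standard.

```python
import math

# Hypothetical instruction set for a two-register machine; invented for
# illustration and not the one defined in the paper.
def run(program, a=0.0, b=0.0):
    """Execute a list of (opcode, argument) pairs and return both registers."""
    for op, arg in program:
        if op == "add_a":
            a += arg
        elif op == "add_b":
            b += arg
        elif op == "swap":
            a, b = b, a
        else:
            raise ValueError(f"unknown instruction: {op}")
    return a, b

def peptide_program(sequence, residue_programs):
    """Concatenate the per-residue programs in sequence order."""
    program = []
    for residue in sequence:
        program.extend(residue_programs[residue])
    return program

def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom

# Toy residue programs for three of the 20 amino acids (placeholder values).
RESIDUE_PROGRAMS = {
    "A": [("add_a", 1.0)],
    "G": [("swap", None)],
    "K": [("add_b", 2.0)],
}

registers = run(peptide_program("AGK", RESIDUE_PROGRAMS))
predicted_positive = registers[1] > registers[0]  # arbitrary decision rule
print(registers, predicted_positive)
```

In a learning setting the per-residue programs would be the parameters being searched for, with a score such as the MCC over a labelled training set guiding the search.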
Citations: 3
Microarray data analysis with PCA in a DBMS
Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458456
W. Rinsurongkawong, C. Ordonez
Microarray data sets contain expression levels of thousands of genes. The statistical analysis of such data sets is typically performed outside a DBMS with statistical packages or mathematical libraries. In this work, we focus on analyzing them inside the DBMS. This is a difficult problem because microarray data sets have high dimensionality but few rows. First, because DBMSs limit the maximum number of columns per table, the data set has to be pivoted and transformed before analysis. More importantly, the correlation matrix on tens of thousands of genes has millions of values. While most high dimensional data sets can be analyzed with the classical PCA method, small but high dimensional data sets can only be analyzed with Singular Value Decomposition (SVD). We adapt the Householder tridiagonalization and QR factorization numerical methods to solve SVD inside the DBMS. Since these mathematical methods require many matrix operations, which are hard to express in SQL, query optimizations and efficient UDFs are developed to get good performance. Our proposed techniques achieve processing times comparable to those of R, a well-known statistical package. We show experimentally that our methods scale well with high dimensionality.
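The pivoting step and the size of the resulting gene-by-gene matrix can be illustrated with a small sketch. It uses Python's built-in `sqlite3` as a stand-in DBMS, a hypothetical table `x(i, j, v)`, and toy values; the query is a generic way to compute a cross-product matrix in SQL and is not claimed to be one of the authors' queries or UDFs.

```python
import sqlite3

# Hypothetical toy data: 2 samples (rows) by 3 genes (columns).
wide = [
    [1.0, 2.0, 3.0],
    [4.0, 5.0, 6.0],
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE x (i INTEGER, j INTEGER, v REAL)")

# Pivot: store one row per (sample i, gene j) so the table width no longer
# depends on the number of genes.
triples = [(i, j, v) for i, row in enumerate(wide) for j, v in enumerate(row)]
conn.executemany("INSERT INTO x (i, j, v) VALUES (?, ?, ?)", triples)

# Cross-product matrix Q = X^T X, one output row per gene pair (d * d rows).
rows = conn.execute(
    """
    SELECT a.j, b.j, SUM(a.v * b.v)
    FROM x AS a JOIN x AS b ON a.i = b.i
    GROUP BY a.j, b.j
    ORDER BY a.j, b.j
    """
).fetchall()
q = {(ja, jb): s for ja, jb, s in rows}
print(len(q), q[(0, 1)])
```

With d genes the result has d * d entries, which is why a matrix over tens of thousands of genes reaches millions of values even when there are only a handful of samples.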
Citations: 4
Identification of temporal association rules from time-series microarray data set: temporal association rules
Pub Date : 2008-10-26 DOI: 10.1145/1458449.1458457
Hojung Nam, K. Lee, Doheon Lee
One of the most challenging problems in mining gene expression data is to identify how the expression of any particular gene affects the expression of other genes. To elucidate the relationships between genes, association rule mining (ARM) has been applied to microarray gene expression data. Conventional ARM, however, is limited in extracting temporal dependencies between genes, even though temporal information is indispensable for discovering the underlying regulation mechanisms in biological pathways. In this paper, therefore, we propose a novel method, referred to as temporal association rule mining (TARM), which can extract temporal dependencies among related genes. A temporal association rule has the form [gene A ↑, gene B↓] → (7 min)[gene C], which states that a high expression level of gene A and significant repression of gene B are followed by significant expression of gene C after 7 minutes. The proposed TARM method is tested on a Saccharomyces cerevisiae cell cycle time-series microarray gene expression data set. In the parameter fitting phase of TARM, the parameter set [threshold = ±0.8, support cutoff = 3 transactions, confidence cutoff = 90%], which extracted the largest number of correct associations in the KEGG cell cycle pathway, was chosen for the rule mining phase. Furthermore, a comparison of the precision scores of TARM (0.38) and a Bayesian network (0.16) showed that TARM is more accurate. With the best parameter set, a number of temporal association rules with five transcriptional time delays (0, 7, 14, 21, 28 minutes) are extracted from the gene expression data of 799 genes pre-identified as relevant to the cell cycle, while a comparatively small number of rules are extracted from randomly shuffled gene expression data of 799 genes. From the extracted temporal association rules, associated genes that play the same role in biological processes within a short transcriptional time delay, and some temporal dependencies between genes with specific biological processes, are identified.
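How a rule of this form might be scored against a time series can be sketched as follows. This is one plausible reading of support and confidence for a delayed rule, not the authors' algorithm: the functions `discretize` and `rule_stats`, the assumption of equally spaced samples (one step standing for 7 minutes), and the toy expression values are all hypothetical.

```python
def discretize(value, threshold=0.8):
    """Map an expression value to 'up', 'down', or None using a +/- threshold."""
    if value >= threshold:
        return "up"
    if value <= -threshold:
        return "down"
    return None

def rule_stats(series, antecedent, consequent, delay_steps):
    """Return (support, confidence) of antecedent -> (delay) consequent.

    series: dict gene -> list of expression values at equally spaced times.
    antecedent: dict gene -> 'up' or 'down'; consequent: (gene, 'up' or 'down').
    Support counts time points where the antecedent holds and the consequent
    holds delay_steps later; confidence divides by the antecedent matches.
    """
    n = len(next(iter(series.values())))
    gene_c, state_c = consequent
    matched = supported = 0
    for t in range(n - delay_steps):
        if all(discretize(series[g][t]) == s for g, s in antecedent.items()):
            matched += 1
            if discretize(series[gene_c][t + delay_steps]) == state_c:
                supported += 1
    confidence = supported / matched if matched else 0.0
    return supported, confidence

# Hypothetical toy series; each step is assumed to represent 7 minutes.
series = {
    "A": [1.0, 0.9, 0.1, 1.2, 0.0],
    "B": [-1.0, -0.9, 0.0, -1.1, 0.3],
    "C": [0.0, 1.0, 0.9, 0.2, 1.5],
}
rule = ({"A": "up", "B": "down"}, ("C", "up"))
print(rule_stats(series, rule[0], rule[1], delay_steps=1))
```

On this toy data the one-step rule meets cutoffs like those quoted in the abstract (support of at least 3, confidence of at least 90%), while the same rule with zero delay does not.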
Citations: 1