Data and Text Mining in Bioinformatics: Latest Publications

A large-scale gene network inference system for systems biology on supercomputing resources
Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651340
Younghoon Kim, Doheon Lee, Yongseong Cho, Sang Joo Lee
Motivation: Although gene expression data have accumulated continuously and meta-analysis approaches have been developed to integrate independent expression profiles into larger datasets, the amount of information is still insufficient to infer large-scale genetic networks. In addition, global optimization such as Bayesian network inference, one of the most representative techniques for genetic network inference, imposes a computational load far beyond the capacity of typical workstations. Results: MONET is a Cytoscape plugin that infers genome-scale networks from gene expression profiles. It alleviates the shortage of information by incorporating pre-existing annotations. The current version of MONET uses thousands of parallel computational cores at the KISTI supercomputing center in Korea to cope with the computational requirements of large-scale genetic network inference. Availability: A Cytoscape plugin is available at http://cytoscape.org and a web service is at http://delsol.kaist.ac.kr/~monet/home
Citations: 0
A graph-based approach for biomedical thesaurus expansion
Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651336
Ikumi Suzuki, Kazuo Hara, M. Shimbo, Yuji Matsumoto
The addition of new terms to biomedical thesauri is important for keeping pace with new research. In the context of a thesaurus expansion task, we investigate the property of Laplacian diffusion kernel matrices that depreciate pivotal vertices having many links to surrounding vertices. We confirm that this property can be seen on the Laplacian matrix of a graph that we construct from the GENIA corpus (a subset of MEDLINE abstracts) and simulate thesaurus expansion by employing either the Laplacian diffusion kernel matrix, or the adjacency matrix (i.e., cosine similarity), to determine the correct position for new biomedical terms being added to the MeSH thesaurus. Whilst results do not show the desired precision, our approach is shown to be complementary to calculation of cosine similarity between thesaurus terms and we recognize directions for future work.
Citations: 0
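The thesaurus-expansion abstract above turns on one property of the Laplacian diffusion kernel: vertices with many links are depreciated. The following is a minimal sketch of that property on a toy graph, not the authors' implementation. It assumes the standard definition K = exp(-beta * L) with L = D - A. The value of `beta`, the truncated Taylor series, and the star graph are all illustrative choices that do not come from the paper.

```python
def laplacian(adj):
    """Return L = D - A for a symmetric 0/1 adjacency matrix with zero diagonal."""
    n = len(adj)
    return [[(sum(adj[i]) if i == j else 0) - adj[i][j] for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def diffusion_kernel(adj, beta=0.5, terms=40):
    """Approximate K = exp(-beta * L) with a truncated Taylor series.

    beta and terms are assumed values, adequate only for small toy graphs.
    """
    n = len(adj)
    m = [[-beta * v for v in row] for row in laplacian(adj)]
    kernel = [[float(i == j) for j in range(n)] for i in range(n)]  # k = 0 term
    term = [row[:] for row in kernel]
    for k in range(1, terms):
        # term now holds m^k / k!
        term = [[v / k for v in row] for row in matmul(term, m)]
        for i in range(n):
            for j in range(n):
                kernel[i][j] += term[i][j]
    return kernel


# Toy star graph: vertex 0 is the hub, vertices 1-4 are leaves.
star = [[0, 1, 1, 1, 1],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0],
        [1, 0, 0, 0, 0]]
K = diffusion_kernel(star)
hub_self, leaf_self = K[0][0], K[1][1]
```

On this graph the hub's diagonal entry `hub_self` comes out smaller than a leaf's `leaf_self`, which is one way to see the depreciation of highly connected vertices. A plain adjacency or cosine comparison has no such damping.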
A web-based comparative visualization system for human endogenous RetroVirus (HERV) on whole genomes
Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651333
Woo-Keun Chung, Hyong-Jun Kim, Hwan-Gue Cho
Human endogenous retroviruses (HERVs) have been suggested to regulate the activity of human genes and to produce proteins under some conditions. It is therefore crucial to examine the physical layout of HERVs relative to genes at whole-genome scale. In this paper we present RetroScope, a new Web-based comparative visualization system for HERVs across four whole primate genomes: human, chimpanzee, orangutan and rhesus monkey. RetroScope makes it possible to find retroelements that lie very close to a specified gene, overlapping an exon, promoter or primer. By using a fast HERV alignment algorithm, the system gives biologists a global understanding by comparing the linear configuration of several HERVs at whole-chromosome scale. Aligning HERVs also reveals the most similar pair of chromosomes with respect to the configuration of HERV elements, which provides another clue for constructing phylogenies based on HERVs. RetroScope is available at http://neobio.cs.pusan.ac.kr/sretroscope/.
Citations: 0
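The RetroScope abstract above describes finding retroelements that overlap an exon or a promoter of a specified gene. That kind of query reduces to an interval-overlap test, sketched below. This is an illustration only, not RetroScope's algorithm. The `Interval` type, the 2000 bp upstream promoter window, the forward-strand assumption, and all coordinates are hypothetical.

```python
from typing import NamedTuple


class Interval(NamedTuple):
    name: str
    start: int  # inclusive, 1-based coordinate (assumed convention)
    end: int    # inclusive


def overlaps(a, b):
    """True when two closed intervals share at least one position."""
    return a.start <= b.end and b.start <= a.end


def promoter_of(gene, upstream=2000):
    """Hypothetical promoter window: `upstream` bp before a forward-strand gene."""
    return Interval(gene.name + "_promoter",
                    max(1, gene.start - upstream), gene.start - 1)


def hervs_near_gene(gene, exons, hervs, upstream=2000):
    """Map each HERV name to the kinds of overlap it has with the gene."""
    promoter = promoter_of(gene, upstream)
    hits = {}
    for h in hervs:
        kinds = []
        if any(overlaps(h, e) for e in exons):
            kinds.append("exon")
        if overlaps(h, promoter):
            kinds.append("promoter")
        if kinds:
            hits[h.name] = kinds
    return hits


# Made-up coordinates on a single chromosome.
gene = Interval("G", 10000, 20000)
exons = [Interval("G_ex1", 10000, 10500), Interval("G_ex2", 15000, 15400)]
hervs = [Interval("H1", 9000, 9800),     # falls in the promoter window
         Interval("H2", 15200, 16000),   # overlaps the second exon
         Interval("H3", 30000, 31000)]   # far from the gene
hits = hervs_near_gene(gene, exons, hervs)
```

A linear scan is enough for a handful of elements. At whole-genome scale a sorted sweep or an interval tree would be the usual choice.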
An outcome discovery system to determine mortality factors in primary care facilities
Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651341
Jeremias Murillo, Min Song
This project assembles a virtual team consisting of personnel from the New Jersey Institute of Technology with expertise in the data mining domain and the Saint Barnabas Health Care System with expertise in the medical domain. We apply proven techniques in data and text mining to the problem of hospital mortality. Methodology in outcomes research using data/text mining has typically used Bayesian networks (including decision trees and rules), regression analysis, or neural networks/support vector machines to analyze a single disease or condition. We propose instead to analyze the entire spectrum of reasons patients are admitted to a hospital, in an effort to discern which chronologies result in good outcomes and which in the worst outcome, so as to identify the characteristics to be avoided throughout the spectrum of reasons for admission.
Citations: 1
Data mining in bioinformatics: challenges and opportunities
Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651320
Xiaohua Hu
In this talk I will discuss some data mining techniques and methods in the bioinformatics domain, along with the main challenges and opportunities. I will cover some of the issues related to biomedical literature mining, bioinformatics data integration, and biological network analysis and simulation. In biomedical literature mining, I will discuss effective information retrieval and large-scale information extraction from the biomedical literature. I will also share my view of the semantic-based approach to data integration in the bioinformatics domain. Finally, I will talk about the various approaches to biological network analysis and simulation.
Citations: 1
Incremental non-gaussian analysis of microarray gene expression data
Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651334
Kam Swee Ng, Hyung-Jeong Yang, Sun-Hee Kim
The microarray is gaining popularity in biomedical research due to its ability to analyze hundreds to thousands of genes simultaneously in one experiment. However, the unique nature of microarray data, with a large number of features but a relatively small number of samples, makes the data challenging to process effectively. The curse of dimensionality makes feature extraction important in analyzing microarray data. Therefore, we propose a novel incremental method to discover the non-Gaussian weight from microarray gene expression data with high efficiency. Our proposed method can discover a small number of compact features from a huge number of genes and still achieve good predictive performance. It integrates non-Gaussianity and an adaptive incremental model in an unsupervised way to extract informative features. The method can also analyze microarray data in which the number of features is much larger than the number of observations, with promising results.
Citations: 0
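The microarray abstract above combines two ideas, a non-Gaussianity measure and incremental processing. The sketch below shows how the two can fit together, using excess kurtosis as the non-Gaussianity measure and a one-pass update of the central moments. Both choices are assumptions made for illustration. The abstract does not say which measure or update rule the authors use, and the expression values are made up.

```python
class RunningKurtosis:
    """One-pass (streaming) central moments up to order four."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # running sum of (x - mean)^2
        self.m3 = 0.0  # running sum of (x - mean)^3
        self.m4 = 0.0  # running sum of (x - mean)^4

    def update(self, x):
        """Fold one new observation into the running moments."""
        n1 = self.n
        self.n += 1
        n = self.n
        delta = x - self.mean
        delta_n = delta / n
        delta_n2 = delta_n * delta_n
        term1 = delta * delta_n * n1
        self.mean += delta_n
        # Higher moments must be updated before the lower ones they depend on.
        self.m4 += (term1 * delta_n2 * (n * n - 3 * n + 3)
                    + 6 * delta_n2 * self.m2 - 4 * delta_n * self.m3)
        self.m3 += term1 * delta_n * (n - 2) - 3 * delta_n * self.m2
        self.m2 += term1

    def excess_kurtosis(self):
        """Zero for a Gaussian; needs at least two distinct values."""
        return self.n * self.m4 / (self.m2 * self.m2) - 3.0


def batch_excess_kurtosis(values):
    """Two-pass reference implementation used to check the streaming version."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    m4 = sum((v - mean) ** 4 for v in values)
    return n * m4 / (m2 * m2) - 3.0


# Hypothetical expression values for one gene, arriving one sample at a time.
samples = [2.1, 0.3, 5.7, 1.2, 9.9, 0.4]
running = RunningKurtosis()
for value in samples:
    running.update(value)
streamed = running.excess_kurtosis()
```

Keeping one `RunningKurtosis` per gene would let genes be ranked by the magnitude of their excess kurtosis as new samples arrive, without storing earlier samples.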
Efficient computation of impact degrees for multiple reactions in metabolic networks with cycles
Pub Date : 2009-10-01 DOI: 10.1145/1651318.1651332
Yang Cong, Takeyuki Tamura, T. Akutsu, W. Ching
Analysis of the robustness of a metabolic network against the deletion of single or multiple reactions is useful for mining important enzymes/genes. For that purpose, the impact degree was proposed by Jiang et al. In this short paper, we extend the impact degree to metabolic networks containing cycles and develop a simple algorithm for its computation. Furthermore, we propose an improved algorithm for computing impact degrees for deletions of multiple reactions. The results of preliminary computational experiments suggest that the improved algorithm is several tens of times faster than the simple algorithm.
Citations: 2
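To make the idea of an impact degree in a network with cycles concrete, the sketch below uses a simplified Boolean model. It is not the algorithm from the paper. The assumptions are these: a reaction is active only if all of its substrates are available; a compound is available if it is a source or is produced by an active reaction; cycles are resolved by taking the greatest fixed point (the maximal consistent assignment); and the count excludes the deleted reactions themselves. The network and all names are hypothetical.

```python
def active_reactions(reactions, sources, deleted=frozenset()):
    """Greatest fixed point of the Boolean model.

    Start with every non-deleted reaction active, then repeatedly drop
    reactions that have a substrate nobody can supply.
    reactions maps a name to a (substrates, products) pair.
    """
    active = set(reactions) - set(deleted)
    while True:
        available = set(sources)
        for r in active:
            available.update(reactions[r][1])
        still_active = {r for r in active if set(reactions[r][0]) <= available}
        if still_active == active:
            return active
        active = still_active


def impact_degree(reactions, sources, deleted):
    """Number of reactions inactivated by the deletion, excluding `deleted`."""
    baseline = active_reactions(reactions, sources)
    after = active_reactions(reactions, sources, deleted)
    return len(baseline - after - set(deleted))


# Acyclic toy network: A -> B -> C -> D, plus a side branch A -> E.
chain = {
    "R1": (["A"], ["B"]),
    "R2": (["B"], ["C"]),
    "R3": (["C"], ["D"]),
    "R4": (["A"], ["E"]),
}
# Adding C -> B closes a cycle between B and C.
cyclic = dict(chain, R5=(["C"], ["B"]))

single = impact_degree(chain, {"A"}, {"R1"})           # R2 and R3 are lost
with_cycle = impact_degree(cyclic, {"A"}, {"R1"})      # the cycle keeps B supplied
double = impact_degree(cyclic, {"A"}, {"R1", "R5"})    # deleting both starves R2, R3
```

Under the greatest-fixed-point assumption a cycle can sustain itself, so deleting `R1` alone has no downstream effect in `cyclic`. Starting from an empty assignment instead (a least fixed point) would inactivate the cycle and give a different count, so the choice of semantics matters for cyclic networks.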
Text mining for pharmacogenomics
Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458451
R. Altman
We are building the Pharmacogenetics & Pharmacogenomics Knowledgebase (PharmGKB, http://www.pharmgkb.org/) with the goal of cataloguing all knowledge about how genetic variation impacts drug response phenotypes. PharmGKB stores primary data (genotype and phenotype data) as well as more distilled knowledge in the form of pathway diagrams, annotated summaries of very important pharmacogenes (VIP genes), and annotated literature. The literature annotation efforts include both manual curation by trained curators and automatic information extraction. In this talk, I will discuss three projects relevant to our efforts in literature curation:
1. The Pharmspresso project is a simple rule-based system for extracting mentions of gene, drug, disease and polymorphism interactions from text. It is based on the Textpresso system developed at Caltech, but adds specific rules about human drugs, genes and phenotypes. The initial version of Pharmspresso had good performance but suffered from false positive extractions, so we have been working to improve the performance while maintaining as much generality as possible. Pharmspresso is available at http://pharmspresso.stanford.edu/
2. The PGxPipeline project builds on the gene-drug-disease associations mined both manually and automatically to do scientific discovery. A critical bottleneck in pharmacogenetics is identifying genes that are likely to be important for modifying drug response. Unless the full details of drug action and metabolism are understood, any of the ~25,000 human genes could be important for understanding action and metabolism. PGxPipeline is built to accept as input a drug and an indication for use (e.g. pain or high cholesterol). It then uses information from the literature as well as information about chemical structure to rank all genes in the human genome by the likelihood that they interact with the drug of interest. In this way, we can prioritize the genes that are most likely to be relevant to the drug. We have found that our ranked lists are useful adjuncts to other independent sources of information, and work best in combination with these.
3. Finally, we have been studying the sites in proteins that bind small molecules (such as drugs) or are important as active sites where the proteins' functions occur. We have clustered these sites based on structural similarity to discover new structural motifs associated with protein function. Very often, we have no knowledge of the function of these newly discovered structural motifs, but the literature often has substantial information about the function of the proteins to which these motifs belong. Our final project, then, is focused on gathering the literature associated with proteins that have a common motif, and determining what words/concepts are likely to describe the common functions of these proteins and therefore the likely significance of these shared structural motifs.
Citations: 0
Mining metastasis related genes by primary-secondary tumor comparisons from large-scale database
Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458458
Sangwoo Kim, Doheon Lee
Metastasis is the most dangerous step in cancer progression and causes more than 90% of cancer deaths. Although many researchers have been working on the biological features and characteristics of metastasis, most of its genetic-level processes remain uncertain. Some studies have succeeded in elucidating metastasis-related genes and pathways and then predicting the prognosis of cancer patients, but it is still an open question whether the resulting genes or pathways contain enough information and whether noise features have been controlled appropriately. To address these problems, we conducted comparisons between primary tumors and secondary metastatic tumors. Noise arising from differences in tissue-specific characteristics between the two types of tumors has been controlled by additional analyses. In this paper, we suggest a new method for identifying genes and pathways that are metastasis-dependent and free of metastasis-independent features.
Citations: 0
The role of syntactic features in protein interaction extraction
Pub Date : 2008-10-30 DOI: 10.1145/1458449.1458463
Timur Fayruzov, M. D. Cock, C. Cornelis, Veronique Hoste
Most approaches for protein interaction mining from biomedical texts use both lexical and syntactic features. However, the individual impact of these two kinds of features on the effectiveness of the mining process has not yet been thoroughly studied. In this paper, we perform such a study on a recently published state-of-the-art support vector machine approach that uses both lexical and syntactic features. To this end, we strip this approach down to an algorithm that uses only a subset of the initial syntactic features. Next, we compare the original and the stripped-down method by evaluating them on 5 benchmark datasets as well as by performing 5 additional cross-dataset experiments. Although the original method exploits a very rich feature set including words, parts of speech, and grammatical relations, it is not significantly better than the stripped-down version; in fact, the former does not even consistently outperform the latter.
Citations: 7
Journal: Data and Text Mining in Bioinformatics