
Data and Text Mining in Bioinformatics: Latest Articles

Pathway-based classification of brain activities for Alzheimer's disease analysis
Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512093
Jongan Lee, Younghoon Kim, Y. Jeong, D. Na, Kwang-H. Lee, Doheon Lee
The advent of resting-state (RS) functional magnetic resonance imaging (fMRI) has made it possible to classify Alzheimer's disease (AD) states based on quantitative activity indices of brain regions. Current connectivity-based classification techniques suffer from limited reproducibility because they require prior knowledge of discriminative brain regions and because AD progression is intrinsically heterogeneous. Similar challenges have already been addressed in the molecular bioinformatics community, where incorporating molecular pathway information into classification has yielded higher, more reproducible accuracy and interpretable markers. We apply a similar strategy to the RS-fMRI-based AD classification problem. After collecting various functional brain pathways from the literature, we quantified which pathways show significantly different activity levels between AD patients and healthy subjects. Moreover, pathways that discriminate between AD patients and healthy subjects may facilitate the interpretation of functional alterations in the course of AD progression.
Citations: 2
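The abstract above does not say how pathway activity is quantified, so the following is only a minimal sketch under stated assumptions: a pathway's activity is taken to be the mean activity index of its member regions, and pathways are ranked by Welch's t statistic between groups. The pathway definitions, region names, and numbers are all hypothetical toy values, not data from the paper.

```python
from math import sqrt
from statistics import mean, variance

# Hypothetical pathway definitions: pathway name -> member brain regions.
PATHWAYS = {
    "default_mode": ["PCC", "mPFC"],
    "visual": ["V1", "V2"],
}

def pathway_activity(subject, regions):
    """Average the activity indices of a pathway's member regions (an assumption)."""
    return mean(subject[r] for r in regions)

def welch_t(a, b):
    """Welch's t statistic for two independent samples."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se

def rank_pathways(patients, controls):
    """Return (pathway, t) pairs sorted by absolute t, largest first."""
    scores = []
    for name, regions in PATHWAYS.items():
        a = [pathway_activity(s, regions) for s in patients]
        b = [pathway_activity(s, regions) for s in controls]
        scores.append((name, welch_t(a, b)))
    return sorted(scores, key=lambda pair: abs(pair[1]), reverse=True)

# Toy, made-up activity indices.
patients = [
    {"PCC": 0.2, "mPFC": 0.3, "V1": 1.0, "V2": 1.1},
    {"PCC": 0.3, "mPFC": 0.2, "V1": 0.9, "V2": 1.0},
    {"PCC": 0.1, "mPFC": 0.3, "V1": 1.1, "V2": 0.9},
]
controls = [
    {"PCC": 0.8, "mPFC": 0.9, "V1": 1.0, "V2": 1.0},
    {"PCC": 0.9, "mPFC": 0.7, "V1": 1.1, "V2": 1.0},
    {"PCC": 0.7, "mPFC": 0.8, "V1": 0.9, "V2": 1.1},
]
ranking = rank_pathways(patients, controls)
```

In this toy data the hypothetical "default_mode" pathway separates the groups far more strongly than "visual", so it ranks first. A real analysis would also need significance testing with multiple-comparison correction.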
Exploring the effectiveness of medical entity recognition for clinical information retrieval
Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512091
J. Cogley, N. Stokes, J. Carthy
The growth of medical and clinical textual datasets has fostered research interest in methods for storing, retrieving, and extracting pertinent data. In recent years, shared tasks and more comprehensive data sharing agreements have spurred further growth in the research area spanning Natural Language Processing (NLP) and Information Retrieval (IR) to aid the world of healthcare. NLP applications such as Medical Entity Recognition (MER) are frequently motivated by the goal of improving IR system performance. In this paper, we investigate the application of MER to a clinical retrieval system in the context of shared tasks in the respective areas. Namely, we aim to add structure to previously unstructured clinical reports and query sets. We evaluate the performance of MER on the query set, highlighting issues in constructing queries in a clinical setting. We then evaluate the performance of structuring queries on a retrieval dataset. We find that while structuring queries improves performance on complex queries that contain many term dependencies, there is a larger issue of linguistic variation in clinical texts that must also be addressed.
Citations: 7
BoDBES: a boosted dictionary-based biomedical entity spotter
Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512098
Min Song, Wook-Shin Han, Hwanjo Yu
To measure the impact of different sources on the performance of entity extraction, we used three data sources: 1) GENIA, 2) MeSH Tree, and 3) UMLS. Performance is also measured by F1. In the performance comparison among three approaches on the dictionary built from GENIA+MeSH, BoDBES is slightly better than SPED on all three datasets, whereas the context-only option shows the worst performance.
Citations: 3
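To illustrate the general idea behind a dictionary-based entity spotter scored by F1, as in the BoDBES abstract above, here is a minimal sketch. It performs plain greedy longest-match lookup and does not attempt the boosting step, which the abstract does not describe. The function names, the mini-dictionary, and the gold annotations are hypothetical.

```python
def spot_entities(tokens, dictionary, max_len=4):
    """Greedy longest-match lookup of token n-grams in a term dictionary.

    Returns (start, end, label) spans over token positions, `end` exclusive.
    """
    spans = []
    i = 0
    while i < len(tokens):
        match = None
        # Try the longest candidate phrase first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in dictionary:
                match = (i, i + n, dictionary[phrase])
                break
        if match:
            spans.append(match)
            i = match[1]
        else:
            i += 1
    return spans

def f1_score(predicted, gold):
    """F1 over exactly matching spans."""
    predicted, gold = set(predicted), set(gold)
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical mini-dictionary; a real one might be built from GENIA, MeSH, or UMLS.
dictionary = {
    "nf-kappa b": "protein",
    "t cells": "cell_type",
    "interleukin 2": "protein",
}
tokens = "NF-kappa B activation in T cells requires CD28".split()
predicted = spot_entities(tokens, dictionary)
# Hypothetical gold standard; "CD28" is missing from the dictionary.
gold = [(0, 2, "protein"), (4, 6, "cell_type"), (7, 8, "protein")]
score = f1_score(predicted, gold)
```

Here both predicted spans are correct but one gold entity is missed, giving precision 1.0, recall 2/3, and an F1 of about 0.8. The missed term shows why dictionary coverage, and therefore the choice of source, matters for this kind of method.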
BSML: bio-synergy modeling language for multi-component and multi-target analysis
Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512097
W. Hwang, Jaejoon Choi, J. Jung, Doheon Lee
Multi-compound drugs are considered the most promising solution for overcoming the limited efficacy and off-target effects of drugs. However, identifying promising compound combinations through experimental tests requires overwhelming costs and a large number of tests. Systems biology-based approaches are regarded as one of the most promising strategies, since predicting drug responses in biological systems is one of the aims of systems biology. We developed the Bio-Synergy Modeling Language (BSML) for modeling biological systems, which are multi-scale systems. BSML contains context information that covers spatial scales, temporal scales, and condition information, such as disease. We applied BSML to generate a type 2 diabetes (T2D) model, which involves malfunctions of numerous organs such as the pancreas, liver, and muscle. We automatically extracted 12,522 T2D-related rules from public databases. We simulated responses to single drugs and drug combinations on the T2D model using Petri nets. The simulation results show candidate T2D drugs and how drug combinations could act on a whole-body scale. We expect our work to provide insight for identifying promising drug combinations and their mechanisms on a whole-body scale.
Citations: 3
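The BSML abstract above mentions simulating drug responses with Petri nets. As a minimal sketch of the basic place/transition firing rule only, the following uses a made-up two-transition net. The places, transitions, and token counts are hypothetical and are not BSML rules or the authors' T2D model.

```python
def enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(p, 0) >= n for p, n in transition["inputs"].items())

def fire(marking, transition):
    """Consume input tokens and produce output tokens, returning a new marking."""
    new = dict(marking)
    for place, n in transition["inputs"].items():
        new[place] -= n
    for place, n in transition["outputs"].items():
        new[place] = new.get(place, 0) + n
    return new

def simulate(marking, transitions, max_steps=100):
    """Fire the first enabled transition repeatedly until none is enabled."""
    for _ in range(max_steps):
        ready = [t for t in transitions if enabled(marking, t)]
        if not ready:
            break
        marking = fire(marking, ready[0])
    return marking

# Hypothetical toy net: a drug binds a receptor, and the complex
# moves one unit of blood glucose into storage, freeing the receptor.
transitions = [
    {"name": "bind",
     "inputs": {"drug": 1, "receptor": 1},
     "outputs": {"complex": 1}},
    {"name": "uptake",
     "inputs": {"complex": 1, "blood_glucose": 1},
     "outputs": {"stored_glucose": 1, "receptor": 1}},
]
initial = {"drug": 2, "receptor": 1, "blood_glucose": 3}
final = simulate(initial, transitions)
```

With two drug tokens, the net fires bind and uptake twice each and then halts with one unit of blood glucose remaining. Simulating a drug combination in the same style would amount to starting from a marking with tokens in several drug places.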
Breast and prostate cancer expression similarity analysis by iterative SVM based ensemble gene selection
Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512099
Darius Coelho, Lee Sael
Epidemiologic and phenotypic evidence indicates that breast and prostate cancers have high pathological similarity. Analyzing pathological similarities between cancers can be beneficial in several respects, such as enabling knowledge transfer between cancer studies. To learn about the similarity between breast and prostate cancer pathology, common genes affected by the two carcinomas are investigated. Gene expression data from RNA-seq experiments, provided through the TCGA consortium, are used for gene selection. Gene selection was performed using an iterative SVM-based ensemble feature selection approach. Iterative SVM-based gene selection allows correlated gene expressions to be considered simultaneously, and the ensemble approach stabilizes the selection. As a result of the analysis, two genes, transglutaminase 4 (TGM4) and complement component 4A (C4A), were selected as commonly altered genes. Direct relationships of the two genes to the two cancers are not confirmed. However, TGM4 is known to be associated with adenocarcinomas and C4A with ovarian cancer. This provides evidence that they may be pathologically important genes for the two cancers.
Citations: 3
Efficient local ligand-binding site search using landmark MDS
Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512092
Sungchul Kim, Lee Sael, Hwanjo Yu
In this work, we propose a new local binding site search system, called Fast Patch-Surfer, which extends the earlier Patch-Surfer system. Patch-Surfer efficiently retrieves the top-k similar proteins based on a new representation of proteins that captures features of their local ligand-binding sites and a newly defined distance function. However, further speedup is needed because, in practical settings, multiple users may access the database simultaneously to compute dissimilarity between proteins. We address this need by exploiting landmark multidimensional scaling (MDS), an efficient version of MDS, which is widely used for representing high-dimensional datasets. According to our results, our method reduces search time by up to 99% and retrieves almost 80% of the exact top-k results.
Citations: 4
Combining dictionaries and ontologies for drug name recognition in biomedical texts
Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512100
Daniel Sánchez-Cisneros, Paloma Martínez, Isabel Segura-Bedmar
Two approaches are commonly used for recognizing drug name entities in biomedical texts: machine learning-based approaches and approaches based on domain-specific resources. In this work we focus on the second by combining (1) a dictionary-based approach that collects terms from different pharmacological data sources such as DrugBank, MeSH, RxNorm, and the ATC index; and (2) an ontology-based approach that maps each text unit of a source text to one or more domain-specific concepts, providing rich semantic knowledge of domain name entities using the MetaMap and Mgrep analyzers. The aim is to take advantage of the best of each resource used. The combined system obtains an F1 measure of 0.667 under exact-matching span evaluation.
Citations: 14
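The abstract above says the dictionary-based and ontology-based outputs are combined but does not give the combination rule. The sketch below shows one plausible rule, stated purely as an assumption: take the union of both span lists and, where spans overlap, keep the longest one. The function name, the sample sentence, and the spans are hypothetical.

```python
def merge_annotations(dictionary_spans, ontology_spans):
    """Union two lists of (start, end, label) character spans.

    Where spans overlap, only the longest is kept (an assumed rule).
    """
    candidates = sorted(
        set(dictionary_spans) | set(ontology_spans),
        key=lambda s: (-(s[1] - s[0]), s[0]),  # longest first, then leftmost
    )
    kept = []
    for span in candidates:
        overlaps = any(span[0] < k[1] and k[0] < span[1] for k in kept)
        if not overlaps:
            kept.append(span)
    return sorted(kept)

text = "calcium carbonate and ibuprofen with aspirin"
# Hypothetical outputs of the two recognizers (end offsets are exclusive).
dictionary_spans = [(0, 7, "drug"), (22, 31, "drug")]
ontology_spans = [(0, 17, "drug"), (37, 44, "drug")]
merged = merge_annotations(dictionary_spans, ontology_spans)
mentions = [text[start:end] for start, end, _ in merged]
```

In this example the shorter dictionary hit "calcium" is dropped in favor of the overlapping ontology hit "calcium carbonate", which matters under exact-matching span evaluation because a partially correct span counts as an error.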
Bayesian variable selection for linear regression in high dimensional microarray data
Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512094
Wellington Cabrera, C. Ordonez, D. S. Matusevich, V. Baladandayuthapani
Variable selection is a fundamental problem in Bayesian statistics whose solution requires exploring a combinatorial search space. We study the solution of variable selection with a well-known MCMC method, which requires thousands of iterations. We present several algorithmic optimizations to accelerate the MCMC method to make it work efficiently inside a database system. Our optimizations include sufficient statistics, variable preselection, hash tables and calling a linear algebra library. We present experiments with very high dimensional microarray data sets to predict cancer survival time. We discuss encouraging findings, identifying specific genes likely to predict the survival time for brain cancer patients. We also show our DBMS-based algorithm is orders of magnitude faster than the R statistical package. Our work shows a DBMS is a promising platform to analyze microarray data.
Citations: 4
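The abstract above lists sufficient statistics among its optimizations without giving formulas. One common formulation, assumed here, is to accumulate the matrix Q = ZᵀZ with z = (1, x₁, …, x_d, y) in a single pass over the data, after which the least-squares fit for any candidate variable subset can be computed from sub-blocks of Q without rescanning the data. The sketch below shows that idea only; it is not the authors' DBMS implementation, and all names and data are hypothetical.

```python
def sufficient_statistics(rows):
    """One pass over rows (x_1..x_d, y): return Q = Z^T Z with z = (1, x_1..x_d, y)."""
    d = len(rows[0]) + 1
    q = [[0.0] * d for _ in range(d)]
    for row in rows:
        z = [1.0] + list(row)
        for i in range(d):
            for j in range(d):
                q[i][j] += z[i] * z[j]
    return q

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [a[i][:] + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_subset(q, subset):
    """Least-squares coefficients (intercept first) for zero-based variable
    indices in `subset`, computed from Q alone."""
    idx = [0] + [v + 1 for v in subset]  # column 0 of z is the intercept
    y = len(q) - 1                       # last column of z is the response
    xtx = [[q[i][j] for j in idx] for i in idx]
    xty = [q[i][y] for i in idx]
    return solve(xtx, xty)

# Toy rows (x1, x2, y) generated from y = 1 + 2 * x1; x2 is irrelevant.
rows = [(0, 1, 1), (1, 0, 3), (2, 1, 5), (3, 0, 7)]
q = sufficient_statistics(rows)
beta_x1 = fit_subset(q, [0])        # model with x1 only
beta_both = fit_subset(q, [0, 1])   # model with x1 and x2
```

Because every candidate model visited by a sampler can be fitted from Q, the cost per iteration depends on the subset size rather than on the number of data rows, which is one plausible reason such statistics help inside a database system.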
Translating a trillion points of data into therapies, diagnostics, and new insights into disease
Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512102
A. Butte
There is an urgent need to translate genome-era discoveries into clinical utility, but the difficulties in making bench-to-bedside translations have been well described. The nascent field of translational bioinformatics may help. Dr. Butte's lab at Stanford builds and applies tools that convert more than a trillion points of molecular, clinical, and epidemiological data - measured by researchers and clinicians over the past decade - into diagnostics, therapeutics, and new insights into disease. Dr. Butte, a bioinformatician and pediatric endocrinologist, will highlight his lab's work on using publicly-available molecular measurements to find new uses for drugs including drug repositioning for inflammatory bowel disease, discovering new treatable inflammatory mechanisms of disease in type 2 diabetes, and the evaluation of patients presenting with whole genomes sequenced.
Citations: 1
Refining health outcomes of interest using formal concept analysis and semantic query expansion
Pub Date : 2013-11-01 DOI: 10.1145/2512089.2512095
Olivier Curé, H. Maurer, N. Shah, P. LePendu
Clinicians and researchers using Electronic Health Records (EHRs) often search for, extract, and analyze groups of patients by defining a Health Outcome of Interest (HOI), which may include a set of diseases, conditions, signs, or symptoms. In our work on pharmacovigilance using clinical notes, for example, we use a method that operates over many (potentially hundreds) of ontologies at once, expands the input query, and increases the search space over clinical text as well as structured data. This method requires specifying an initial set of seed concepts, based on concept unique identifiers from the UMLS Metathesaurus. In some cases, such as for progressive multifocal leukoencephalopathy, the seed query is easy to specify, but in other cases this task can be more subtle and requires manual-intensive work, such as for chronic obstructive pulmonary disease. The challenge in defining an HOI arises because medical and health terminologies are numerous and complex. We have developed a method consisting of a cooperation between Semantic Query Expansion, to leverage the hierarchical structure of ontologies, and Formal Concept Analysis, to organize, reason, and prune discovered concepts in an efficient manner over a large number of ontologies. Together, they assist the user, through a RESTful API and a web-based graphical user interface, in defining their seed query and in refining the expanded search space that it encompasses. In this context, end-user interactions mainly consist in accepting or rejecting system propositions and can be ceased on the user's will. We use this approach for text-mining clinical notes from EHRs, but they are equally applicable for cohort building tools in general. A preliminary evaluation of this work, on the i2b2 Obesity NLP reference set, emphasizes positive results for sensitivity and specificity measures which are slightly improving existing results on this gold standard. 
This experimentation also highlights that our semi-automatic approach provides fast processing times (in the order of milliseconds to few seconds) for the generation of several thousands of potential terms. The most promising aspect of this approach is the discovery of potentially positive results from false negative concepts discovered by our method. In future works, we aim to conduct user driven evaluation of the Web interface, analyze the acceptance/rejection of physicians in several practical scenarios and use active learning over past query refinements to improve future queries.
Citations: 3
Journal
Data and Text Mining in Bioinformatics