Data and Text Mining in Bioinformatics: Latest Articles

Detecting type 2 diabetes causal single nucleotide polymorphism combinations from a genome-wide association study dataset with optimal filtration

Data and Text Mining in Bioinformatics

Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390070

Chiyong Kang, Hyeji Yu, G. Yi

The identification of causal single nucleotide polymorphisms (SNPs) for complex diseases like type 2 diabetes (T2D) is a challenge because of the low statistical power of individual markers from a genome-wide association study (GWAS). SNP combinations have been suggested to compensate for the low statistical power of individual markers, but evaluating SNP combinations from GWAS data incurs high computational complexity. Hence, we aim to detect T2D causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations obtained from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected from the resulting optimal SNP dataset using random forests with variable selection. The selected SNPs and SNP combinations are mapped using multi-dimensional levels of T2D-related information and gene set enrichment analysis (GSEA). A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected, with an error rate of 10.25%. Matching with known disease genes and gene sets revealed the relationships between T2D and SNP combinations. We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
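
The Bonferroni thresholding step mentioned in the abstract above can be illustrated with a minimal sketch. It assumes single-marker association p-values are already available; the function name `bonferroni_filter`, the SNP ids, and the p-values are hypothetical and not taken from the paper, and the paper's p-value range thresholds, LD pruning, and random forest selection are not shown.

```python
def bonferroni_filter(p_values, alpha=0.05):
    """Return SNP ids whose p-value passes the Bonferroni-corrected threshold.

    p_values: dict mapping a SNP id to its single-marker association p-value.
    The threshold is alpha divided by the number of tests performed.
    """
    if not p_values:
        return []
    threshold = alpha / len(p_values)
    return sorted(snp for snp, p in p_values.items() if p <= threshold)


# Hypothetical p-values for four illustrative SNP ids (not from the paper).
example = {"snp_a": 1e-9, "snp_b": 0.004, "snp_c": 0.2, "snp_d": 0.012}

# With four tests the threshold is 0.05 / 4 = 0.0125.
kept = bonferroni_filter(example, alpha=0.05)
```

Varying `alpha` gives a family of candidate SNP sets, which is one way to read "various Bonferroni thresholds"; how the authors compare error rates across those sets is not specified here.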

Citations: 2

Clinical entity recognition using structural support vector machines with rich features

Data and Text Mining in Bioinformatics

Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390073

Buzhou Tang, Yonghui Wu, Min Jiang, Hua Xu

Named entity recognition (NER) is an important task for natural language processing (NLP) of clinical text. Conditional Random Fields (CRFs), a sequential labeling algorithm, and Support Vector Machines (SVMs), which are based on large margin theory, are two typical machine learning algorithms that have been widely applied to NER tasks, including clinical entity recognition. However, Structural Support Vector Machines (SSVMs), an algorithm that combines the advantages of both CRFs and SVMs, has not been investigated for clinical text processing. In this study, we applied the SSVMs algorithm to the Concept Extraction task of the 2010 i2b2 clinical NLP challenge, which was to recognize entities of medical problems, treatments, and tests from hospital discharge summaries. Using the same training (N = 27,837) and test (N = 45,009) sets as in the challenge, our evaluation showed that the SSVMs-based NER system required less training time while achieving better performance than the CRFs-based system for clinical entity recognition when the same features were used. Our study also demonstrated that rich features such as unsupervised word representations improved the performance of clinical entity recognition. When rich features were integrated with SSVMs, our system achieved its highest F-measure of 85.74% on the test set of the 2010 i2b2 NLP challenge, which outperformed the best system reported in the challenge by 0.5%.
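
For readers unfamiliar with the F-measure reported above, the following is a minimal sketch of an entity-level precision, recall, and F-measure computation. It assumes exact matching of `(start, end, type)` spans, which may differ from the challenge's official scoring; the function name and the example spans are hypothetical.

```python
def f_measure(gold, predicted, beta=1.0):
    """Return (precision, recall, F-beta) over sets of entity spans.

    Each span is a (start, end, entity_type) tuple; a prediction counts as
    correct only if it matches a gold span exactly (an assumption here).
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


# Hypothetical gold and predicted spans for one discharge summary.
gold_spans = {(0, 2, "problem"), (5, 6, "test"), (9, 11, "treatment"), (14, 15, "problem")}
pred_spans = {(0, 2, "problem"), (5, 6, "test"), (9, 10, "treatment"), (20, 21, "test")}

precision, recall, f1 = f_measure(gold_spans, pred_spans)
```

In this example the treatment span has a wrong boundary, so under exact matching it counts as both a false positive and a false negative.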

Citations: 58

High precision rule based PPI extraction and per-pair basis performance evaluation

Data and Text Mining in Bioinformatics

Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390082

Junkyu Lee, Seongsoon Kim, Sunwon Lee, Kyubum Lee, Jaewoo Kang

Virtually all current PPI extraction studies focus on improving F-score, aiming to balance the performance on both precision and recall. However, in many realistic scenarios involving large corpora, one can benefit more from an extremely high-precision PPI extraction tool than from a high-recall counterpart. We also argue that the current "per-instance" basis performance evaluation method should be revisited. In order to address these problems, we introduce a new rule-based PPI extraction method equipped with a set of ultra-high precision extraction rules. We also propose a new "per-pair" basis performance metric, which is more pragmatic in practice. The proposed PPI extraction method achieves 95-96% per-pair and 94-97% per-instance precision on the AIMed benchmark corpus.
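
The abstract above does not define the two evaluation bases precisely, so the sketch below shows only one plausible reading. It assumes a "per-instance" evaluation scores every sentence-level mention of an interaction, while a "per-pair" evaluation scores each unique unordered protein pair once. The function names and example data are hypothetical.

```python
def _normalize(instances):
    """Turn (sentence_id, protein_a, protein_b) into order-independent keys."""
    return {(sid, frozenset((a, b))) for sid, a, b in instances}


def per_instance_precision(extracted, gold):
    """Fraction of extracted sentence-level interactions found in the gold set."""
    ext, ref = _normalize(extracted), _normalize(gold)
    return len(ext & ref) / len(ext) if ext else 0.0


def per_pair_precision(extracted, gold):
    """Fraction of unique extracted protein pairs that occur anywhere in gold."""
    ext_pairs = {pair for _, pair in _normalize(extracted)}
    gold_pairs = {pair for _, pair in _normalize(gold)}
    return len(ext_pairs & gold_pairs) / len(ext_pairs) if ext_pairs else 0.0


# Hypothetical annotations: P1-P2 is stated in two sentences, P3-P4 in one.
gold = [("s1", "P1", "P2"), ("s2", "P1", "P2"), ("s3", "P3", "P4")]
extracted = [("s1", "P2", "P1"), ("s4", "P1", "P2"), ("s3", "P3", "P4"), ("s5", "P5", "P6")]

instance_precision = per_instance_precision(extracted, gold)
pair_precision = per_pair_precision(extracted, gold)
```

Under this reading, an extraction of a true pair from a sentence that was not annotated (`s4` here) hurts per-instance precision but not per-pair precision, which is one reason a per-pair metric can be more pragmatic for large corpora.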

Citations: 7

TNMCA: generation and application of network motif based inference models for drug repositioning

Data and Text Mining in Bioinformatics

Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390081

Jaejoon Choi, Kwangmin Kim, Min-Keun Song, Doheon Lee

With the growth of public biomedical data, Undiscovered Public Knowledge (UPK, proposed by Swanson) has become an important research topic in the biological field. Drug repositioning, which infers alternative indications for approved drugs, is one of the best-known UPK tasks. Many researchers have tried to find novel uses for existing drugs, but these earlier approaches were not fully automated, requiring manual adjustment for each task, and could not cover diverse biomedical entities. They were also limited in inference: they could infer only pre-defined cases using a limited set of patterns. In this paper, we propose the Typed Network Motif Comparison Algorithm (TNMCA) to discover novel drug indications using topological patterns of data. Typed network motifs (TNMs) are connected subgraphs of the data that store the types of the data rather than their values. Whereas previous research depends on the ABC model (or extensions of it), TNMCA uses more generalized patterns as its inference models. TNMCA can also infer not only the existence of an interaction but also its type. TNMCA is suited for multi-level biomedical interaction data, as TNMs depend on the different types of entities and relations. We apply TNMCA to a public database, the Comparative Toxicogenomics Database (CTD), to validate our method. The results show that TNMCA infers meaningful indications with higher performance (AUC=0.7469) than the ABC model (AUC=0.7050).
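
The ABC model used as the baseline above can be sketched briefly: if A is linked to B and B is linked to C, an unknown A-C link becomes a candidate hypothesis. The code below illustrates that baseline idea only, not TNMCA itself; the function name, the drug, gene, and disease ids, and the ranking by number of shared intermediates are assumptions made for illustration.

```python
from collections import defaultdict


def abc_candidates(a_to_b, b_to_c, known_a_to_c=()):
    """Propose (a, c, support) triples from A-B and B-C links.

    support is the number of distinct intermediate B terms connecting a and c.
    Pairs already listed in known_a_to_c are skipped.
    """
    known = set(known_a_to_c)
    c_by_b = defaultdict(set)
    for b, c in b_to_c:
        c_by_b[b].add(c)

    intermediates = defaultdict(set)
    for a, b in a_to_b:
        for c in c_by_b.get(b, ()):
            if (a, c) not in known:
                intermediates[(a, c)].add(b)

    ranked = [(a, c, len(bs)) for (a, c), bs in intermediates.items()]
    # Highest support first; ties broken alphabetically for a stable order.
    return sorted(ranked, key=lambda t: (-t[2], t[0], t[1]))


# Hypothetical drug-gene and gene-disease links.
drug_gene = [("drugX", "g1"), ("drugX", "g2"), ("drugY", "g2")]
gene_disease = [("g1", "d1"), ("g2", "d1"), ("g2", "d2")]
known_indications = [("drugY", "d2")]

candidates = abc_candidates(drug_gene, gene_disease, known_indications)
```

This pattern is fixed to a single two-step path and says nothing about the type of the inferred relation, which are the limitations the abstract attributes to ABC-style approaches.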

Citations: 3

Session details: Keynote address

Data and Text Mining in Bioinformatics

Pub Date : 2011-10-24 DOI: 10.1145/3260180

Doheon Lee

Citations: 0

Dynamic concept ontology construction for PubMed queries

Data and Text Mining in Bioinformatics

Pub Date : 2010-10-26 DOI: 10.1145/1871871.1871885

Jinoh Oh, Taehoon Kim, Sun Park, Wook-Shin Han, Hwanjo Yu

Exploring PubMed to find relevant information is challenging and time-consuming, as PubMed typically returns a large list of articles in response to a query. Existing work on improving the search quality of PubMed has focused on helping with PubMed query formulation, clustering the results, or ranking by relevance. This paper proposes a novel system that dynamically constructs a concept ontology based on the search results and visualizes concepts related to the query in the form of an ontology. The concept ontology can make the PubMed search more effective by detecting related concepts and their relations hidden in the documents. The ontology can broaden the user's knowledge by recommending new concepts unexpected by the user, and also serves to narrow down the search results by recommending additional query terms. The ontology is constructed in real time in response to a query and is integrated within our PubMed search engine called RefMED. Our system is accessible at "http://dm.hwanjoyu.org/refmed".

Citations: 0

DrugNerAR: linguistic rule-based anaphora resolver for drug-drug interaction extraction in pharmacological documents

Data and Text Mining in Bioinformatics

Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651324

Isabel Segura-Bedmar, Mario Crespo, César de Pablo-Sánchez, Paloma Martínez

DrugNerAR, a drug anaphora resolution system, is presented to address the problem of co-referring expressions in pharmacological literature. This development is part of a larger and innovative study on automatic drug-drug interaction extraction. In addition, a corpus has been developed in order to analyze the phenomena and evaluate the current approach. The system applies a set of linguistic rules inspired by Centering Theory to the analysis provided by a biomedical syntactic parser. Semantic information provided by the Unified Medical Language System (UMLS) is also integrated in order to improve the recognition and the resolution of nominal drug anaphors. This linguistic rule-based approach shows very promising results for the challenge of accounting for anaphoric expressions in pharmacological texts.

Citations: 9

Mining cancer genes with running-sum statistics

Data and Text Mining in Bioinformatics

Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651326

Inho Park, Kwang-H. Lee, Doheon Lee

In this paper, we propose a new method to detect candidate cancer genes for developing molecular biomarkers or therapeutic targets from cancer microarray datasets. To resolve problems in gene prioritization that result from the molecular heterogeneity of cancers, our proposed method is intended to identify genes that are over- or under-expressed not only in the whole set of cancer samples but also in a subgroup of cancer samples. To this end, we propose the RS score for gene ranking, calculated with a weighted running-sum statistic on the ordered list of expression values of each gene. We apply the proposed method to publicly available prostate cancer microarray datasets, showing that it can identify previously well-known prostate cancer-associated genes such as ERG, HPN, and AMACR at the top of the list of candidate genes. Embedding samples, represented as vectors of the expression values of the top 20 genes, into a two-dimensional space using the commute time embedding shows the distinction between normal samples and cancer samples in the independent test datasets as well as in the training datasets. We further evaluate the proposed method by estimating classification performance on the independent test datasets, where it shows better classification performance than the other cancer outlier profile approaches.

Citations: 2

The challenge of high recall in biomedical systematic search

Data and Text Mining in Bioinformatics

Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651338

Sarvnaz Karimi, J. Zobel, Stefan Pohl, Falk Scholer

Clinical systematic reviews are based on expert, laborious search of well-annotated literature. Boolean search on bibliographic databases, such as MEDLINE, continues to be the preferred discovery method, but the size of these databases, now approaching 20 million records, makes it impossible to fully trust these searching methods. We are investigating the trade-offs between Boolean and ranked retrieval. Our findings show that although Boolean search has limitations, it is not obvious that ranking is superior, and illustrate that a single query cannot be used to resolve an information need. Our experiments show that a combination of less complicated Boolean queries and ranked retrieval outperforms either of them individually, leading to possible time savings over the current process.

Citations: 21

LITSEEK: public health literature search by metadata enhancement with external knowledge bases

Data and Text Mining in Bioinformatics

Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651337

P. Prabhu, S. Navathe, Stephen Tyler, V. Dasigi, N. Narkhede, Balaji Palanisamy

Biomedical literature is an important source of information in any researcher's investigation of genes, risk factors, diseases and drugs. Often the information searched by public health researchers is distributed across multiple disparate sources that may include publications from PubMed, genomic, proteomic and pathway databases, gene expression and clinical resources and biomedical ontologies. The unstructured nature of this information makes it difficult to find the relevant parts manually, and comprehensive knowledge is even more difficult to synthesize automatically. In this paper we report on LITSEEK (LITerature Search by metadata Enhancement with External Knowledgebases), a system we have developed for the benefit of researchers at the Centers for Disease Control (CDC) to enable them to search the HuGE (Human Genome for Epidemiology) database of PubMed articles from a pharmacogenomic perspective. Besides analyzing text using TFIDF ranking and indexing of the important terms, the proposed system incorporates an automatic consultation with PharmGKB - a human-curated knowledge base about drugs, related diseases and genes, as well as with the Gene Ontology, a human-curated, well-accepted ontology. We highlight the main components of our approach and illustrate how the search is enhanced by incorporating additional concepts in terms of genes/drugs/diseases (called metadata for ease of reference) from PharmGKB. Various measurements are reported with respect to the addition of these metadata terms. Preliminary results in terms of precision based on expert user feedback from CDC are encouraging. Further evaluation of the search procedure by actual researchers is under way.
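
The TFIDF weighting mentioned above can be illustrated with a minimal sketch. It uses one common variant (term frequency normalized by document length, multiplied by the natural log of N over document frequency); the exact weighting used in LITSEEK is not stated in the abstract, and the function name and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return one {term: tf * idf} dict per document.

    documents: list of token lists.
    tf  = term count / document length
    idf = ln(number of documents / number of documents containing the term)
    """
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document

    scores = []
    for doc in documents:
        counts = Counter(doc)
        length = len(doc)
        scores.append(
            {t: (c / length) * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
            if length
            else {}
        )
    return scores


# Hypothetical tokenized abstracts.
docs = [["gene", "drug", "gene"], ["drug", "disease"], ["drug", "risk"]]
weights = tfidf_scores(docs)
top_term_doc0 = max(weights[0], key=weights[0].get)
```

A term such as "drug" that appears in every document receives a weight of zero, so indexing favors terms that distinguish one article from the rest.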

Citations: 2
