
Data and Text Mining in Bioinformatics: Latest Publications

Finding associations among SNPs for prostate cancer using collaborative filtering
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390080
Rohit Kugaonkar, A. Gangopadhyay, Y. Yesha, A. Joshi, Y. Yesha, M. Grasso, Mary Brady, N. Rishe
Prostate cancer is the second leading cause of cancer-related deaths among men. Because prostate cancer grows slowly, less aggressive cancers sometimes do not require surgical treatment. Recent debates over prostate-specific antigen (PSA) screening have drawn new attention to prostate cancer. Genome-based screening can potentially help in assessing the risk of developing prostate cancer. Due to the complicated nature of prostate cancer, studying the entire genome is essential to find genomic traits. Because studying all single nucleotide polymorphisms (SNPs) is costly, it is essential to find tag SNPs that can represent other SNPs. Earlier methods for finding tag SNPs from associations between SNPs either use SNP location information or are based on data with very few SNP markers per sample. Our study is based on 2,300 samples with 550,000 SNPs each. We did not use SNP location information or any predefined standard cut-offs to find tag SNPs. Our approach uses collaborative filtering methods to find pairwise associations among SNPs and thus list the top-N tag SNPs. We found 25 tag SNPs that have the highest similarities to other SNPs. In addition, we found 16 more SNPs that are highly correlated with the known high-risk SNPs associated with prostate cancer. We used some of these newly found SNPs with 5 different classification algorithms and observed some improvement in prostate cancer prediction accuracy over using the original known high-risk SNPs.
Citations: 4
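The prostate cancer abstract above describes ranking SNPs by their pairwise similarity to other SNPs and keeping the top N as tag SNPs. A minimal sketch of that idea follows, in the style of item-based collaborative filtering. The 0/1/2 genotype coding, the use of cosine similarity, the mean-similarity score, and the names `cosine` and `top_n_tag_snps` are all assumptions made for illustration; the abstract does not specify them, and this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a constant-zero column carries no similarity signal
    return dot / (norm_u * norm_v)


def top_n_tag_snps(genotypes, n):
    """Return the indices of the n SNPs most similar, on average, to all others.

    genotypes: list of samples; each sample is a list of SNP values.
    Assumed coding (hypothetical): 0, 1, or 2 copies of the minor allele.
    """
    num_snps = len(genotypes[0])
    # Treat each SNP (column) as an "item" described by its values across samples.
    columns = [[row[j] for row in genotypes] for j in range(num_snps)]
    scores = []
    for j in range(num_snps):
        sims = [cosine(columns[j], columns[k]) for k in range(num_snps) if k != j]
        scores.append(sum(sims) / len(sims))
    ranked = sorted(range(num_snps), key=lambda j: scores[j], reverse=True)
    return ranked[:n]


# Toy data: 4 samples x 4 SNPs (made up for the example).
toy = [
    [0, 0, 2, 1],
    [1, 1, 0, 1],
    [2, 2, 0, 1],
    [1, 1, 1, 0],
]
print(top_n_tag_snps(toy, 1))
```

At the scale quoted in the abstract (550,000 SNPs), an all-pairs loop like this one would be impractical; it only shows the ranking idea on toy data.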
Detecting type 2 diabetes causal single nucleotide polymorphism combinations from a genome-wide association study dataset with optimal filtration
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390070
Chiyong Kang, Hyeji Yu, G. Yi
The identification of causal single nucleotide polymorphisms (SNPs) for complex diseases like type 2 diabetes (T2D) is a challenge because of the low statistical power of individual markers from a genome-wide association study (GWAS). SNP combinations have been suggested to compensate for the low statistical power of individual markers, but SNP combinations from GWAS generate high computational complexity. Hence, we aim to detect T2D causal SNP combinations from a GWAS dataset with optimal filtration and to discover the biological meaning of the detected SNP combinations. Optimal filtration can enhance the statistical power of SNP combinations by comparing the error rates of SNP combinations obtained from various Bonferroni thresholds and p-value range-based thresholds combined with linkage disequilibrium (LD) pruning. T2D causal SNP combinations are selected using random forests with variable selection from an optimal SNP dataset. The selected SNPs with SNP combinations are mapped with multi-dimensional levels of T2D-related information and gene set enrichment analysis (GSEA). A T2D causal SNP combination containing 101 SNPs from the Wellcome Trust Case Control Consortium (WTCCC) GWAS dataset is selected, with an error rate of 10.25%. Matching with known disease genes and gene sets revealed the relationships between T2D and SNP combinations. We propose a detection method for complex disease causal SNP combinations from an optimal SNP dataset by using random forests with variable selection. Mapping the biological meanings of detected SNP combinations can help uncover complex disease mechanisms.
Citations: 2
High precision rule based PPI extraction and per-pair basis performance evaluation
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390082
Junkyu Lee, Seongsoon Kim, Sunwon Lee, Kyubum Lee, Jaewoo Kang
Virtually all current PPI extraction studies focus on improving F-score, aiming to balance performance on both precision and recall. However, in many realistic scenarios involving large corpora, one can benefit more from an extremely high-precision PPI extraction tool than from a high-recall counterpart. We also argue that the current "per-instance" basis performance evaluation method should be revisited. To address these problems, we introduce a new rule-based PPI extraction method equipped with a set of ultra-high precision extraction rules. We also propose a new "per-pair" basis performance metric, which is more pragmatic in practice. The proposed PPI extraction method achieves 95-96% per-pair and 94-97% per-instance precision on the AIMed benchmark corpus.
Citations: 7
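The PPI abstract above contrasts "per-instance" and "per-pair" evaluation without defining either. The sketch below shows one plausible reading. Per-instance precision scores every extracted mention separately. Per-pair precision scores each distinct protein pair once, however many sentences mention it. These definitions, the `(sentence_id, protein_a, protein_b)` tuple layout, and both function names are assumptions for illustration rather than the paper's exact metric.

```python
def per_instance_precision(extracted, gold):
    """Fraction of extracted mentions that match a gold mention in the same sentence.

    extracted, gold: lists of (sentence_id, protein_a, protein_b) tuples.
    Pairs are treated as unordered, so (A, B) matches (B, A).
    """
    gold_set = {(s, frozenset((a, b))) for s, a, b in gold}
    predicted = [(s, frozenset((a, b))) for s, a, b in extracted]
    if not predicted:
        return 0.0
    return sum(p in gold_set for p in predicted) / len(predicted)


def per_pair_precision(extracted, gold):
    """Fraction of distinct extracted protein pairs that interact anywhere in gold."""
    gold_pairs = {frozenset((a, b)) for _, a, b in gold}
    predicted_pairs = {frozenset((a, b)) for _, a, b in extracted}
    if not predicted_pairs:
        return 0.0
    return len(predicted_pairs & gold_pairs) / len(predicted_pairs)


# Toy annotations (made up): the pair A-B is a true interaction in sentences 1 and 2.
gold = [(1, "A", "B"), (2, "A", "B"), (3, "C", "D")]
# The extractor also reports A-B in sentence 4, where gold has no such mention.
extracted = [(1, "A", "B"), (2, "B", "A"), (4, "A", "B")]

print(per_instance_precision(extracted, gold))
print(per_pair_precision(extracted, gold))
```

In this toy case the spurious mention in sentence 4 lowers per-instance precision, while per-pair precision is unaffected because the pair A-B is a genuine interaction elsewhere in the corpus.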
TNMCA: generation and application of network motif based inference models for drug repositioning
Pub Date : 2012-10-29 DOI: 10.1145/2390068.2390081
Jaejoon Choi, Kwangmin Kim, Min-Keun Song, Doheon Lee
With the growth of public biomedical data, Undiscovered Public Knowledge (UPK, proposed by Swanson) has become an important research topic in biology. Drug repositioning, which infers alternative indications for approved drugs, is one of the best-known UPK tasks. Many researchers have tried to find novel uses for existing drugs, but these earlier approaches were not fully automated, required manual adjustment for each task, and could not cover diverse biomedical entities. In addition, their inference was limited: they could infer only pre-defined cases using a restricted set of patterns. In this paper, we propose the Typed Network Motif Comparison Algorithm (TNMCA) to discover novel drug indications using topological patterns of data. Typed network motifs (TNMs) are connected sub-graphs of data that store the types of data instead of the values of data. While previous research depends on the ABC model (or extensions of it), TNMCA uses more generalized patterns as its inference models. TNMCA can also infer not only the existence of an interaction but also the type of the interaction. TNMCA is suited to multi-level biomedical interaction data because TNMs depend on the different types of entities and relations. We apply TNMCA to a public database, the Comparative Toxicogenomics Database (CTD), to validate our method. The results show that TNMCA could infer meaningful indications with higher performance (AUC=0.7469) than the ABC model (AUC=0.7050).
Citations: 3
Session details: Keynote address
Pub Date : 2011-10-24 DOI: 10.1145/3260180
Doheon Lee
Citations: 0
Dynamic concept ontology construction for PubMed queries
Pub Date : 2010-10-26 DOI: 10.1145/1871871.1871885
Jinoh Oh, Taehoon Kim, Sun Park, Wook-Shin Han, Hwanjo Yu
Exploring PubMed to find relevant information is challenging and time-consuming, as PubMed typically returns a long list of articles in response to a query. Existing work on improving search quality on PubMed has focused on helping with PubMed query formulation, clustering the results, or ranking by relevance. This paper proposes a novel system that dynamically constructs a concept ontology from the search results and visualizes concepts related to the query in the form of an ontology. The concept ontology can make PubMed search more effective by detecting related concepts and their relations hidden in the documents. The ontology can broaden the user's knowledge by recommending new concepts the user did not expect, and it also helps narrow down the search results by recommending additional query terms. The ontology is constructed in real time in response to a query and is integrated within our PubMed search engine, RefMED. Our system is accessible at "http://dm.hwanjoyu.org/refmed".
Citations: 0
DrugNerAR: linguistic rule-based anaphora resolver for drug-drug interaction extraction in pharmacological documents
Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651324
Isabel Segura-Bedmar, Mario Crespo, César de Pablo-Sánchez, Paloma Martínez
DrugNerAR, a drug anaphora resolution system, is presented to address the problem of co-referring expressions in pharmacological literature. This development is part of a larger and innovative study of automatic drug-drug interaction extraction. In addition, a corpus has been developed in order to analyze the phenomena and evaluate the current approach. The system applies a set of linguistic rules inspired by Centering Theory to the analysis provided by a biomedical syntactic parser. Semantic information provided by the Unified Medical Language System (UMLS) is also integrated in order to improve the recognition and resolution of nominal drug anaphors. This linguistic rule-based approach shows very promising results for the challenge of accounting for anaphoric expressions in pharmacological texts.
Citations: 9
Mining cancer genes with running-sum statistics
Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651326
Inho Park, Kwang-H. Lee, Doheon Lee
In this paper, we propose a new method to detect candidate cancer genes for developing molecular biomarkers or therapeutic targets from cancer microarray datasets. To resolve the problems that the molecular heterogeneity of cancers causes for gene prioritization, our proposed method is intended to identify genes that are over- or under-expressed not only in the whole set of cancer samples but also in a subgroup of cancer samples. To this end, we propose the RS score for gene ranking, calculated with a weighted running-sum statistic on the ordered list of expression values of each gene. We apply the proposed method to publicly available prostate cancer microarray datasets, showing that it can identify previously well-known prostate cancer associated genes such as ERG, HPN, and AMACR at the top of the list of candidate genes. Embedding samples, represented as vectors of the expression values of the top 20 genes, into a two-dimensional space using the commute time embedding shows the distinction between normal samples and cancer samples in the independent test datasets as well as in the training datasets. We further evaluate the proposed method by estimating classification performance on the independent test datasets, where it shows better classification performance than the other cancer outlier profile approaches.
Citations: 2
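The cancer-gene abstract above mentions a weighted running-sum statistic over each gene's ordered expression values but does not give the formula. The sketch below is a generic, GSEA-style running sum offered only to show how such a statistic can reward a gene that is over-expressed in a subgroup of cancer samples. The weighting scheme, the use of the maximum deviation, and the name `running_sum_score` are assumptions and should not be read as the paper's RS score.

```python
def running_sum_score(expression, is_cancer):
    """Maximum of a weighted running sum over samples sorted by expression.

    expression: expression value of one gene in each sample.
    is_cancer:  parallel list of booleans (True = cancer sample).

    Walking from the highest-expressed sample down, each cancer sample adds a
    step proportional to its expression magnitude and each normal sample
    subtracts a fixed step (an assumed, GSEA-like weighting).
    """
    order = sorted(range(len(expression)), key=lambda i: expression[i], reverse=True)
    cancer_total = sum(abs(expression[i]) for i in order if is_cancer[i])
    n_normal = sum(1 for flag in is_cancer if not flag)
    if cancer_total == 0 or n_normal == 0:
        return 0.0
    running = 0.0
    best = 0.0
    for i in order:
        if is_cancer[i]:
            running += abs(expression[i]) / cancer_total
        else:
            running -= 1.0 / n_normal
        best = max(best, running)
    return best


# Toy gene A (made up): strongly over-expressed in two of three cancer samples.
labels = [True, True, True, False, False, False]
gene_a = [9.0, 8.0, 1.0, 1.2, 1.1, 0.9]
# Toy gene B (made up): higher in normal samples than in cancer samples.
gene_b = [1.0, 1.1, 0.9, 5.0, 6.0, 7.0]

print(running_sum_score(gene_a, labels))
print(running_sum_score(gene_b, labels))
```

Gene A scores high because the two outlier cancer samples sit at the top of the ordering before any normal sample appears. Gene B stays near zero because the normal samples come first.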
The challenge of high recall in biomedical systematic search
Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651338
Sarvnaz Karimi, J. Zobel, Stefan Pohl, Falk Scholer
Clinical systematic reviews are based on expert, laborious search of well-annotated literature. Boolean search on bibliographic databases, such as MEDLINE, continues to be the preferred discovery method, but the size of these databases, now approaching 20 million records, makes it impossible to fully trust these searching methods. We are investigating the trade-offs between Boolean and ranked retrieval. Our findings show that although Boolean search has limitations, it is not obvious that ranking is superior, and illustrate that a single query cannot be used to resolve an information need. Our experiments show that a combination of less complicated Boolean queries and ranked retrieval outperforms either of them individually, leading to possible time savings over the current process.
Citations: 21
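The systematic-search abstract above reports that a simple Boolean query combined with ranked retrieval outperforms either alone. One way to combine them, sketched below, is to use the Boolean query as a coarse filter and then order the surviving documents by a TF-IDF-like score. The whitespace tokenizer, the smoothed IDF formula, the function names, and the toy documents are all assumptions for illustration; the abstract does not say how the authors combined the two.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (deliberately simplistic)."""
    return text.lower().split()


def boolean_filter(docs, all_of=(), any_of=()):
    """Keep documents containing every term in all_of and, if given, at least one in any_of."""
    kept = []
    for doc_id, text in docs.items():
        terms = set(tokenize(text))
        has_all = all(t in terms for t in all_of)
        has_any = (not any_of) or any(t in terms for t in any_of)
        if has_all and has_any:
            kept.append(doc_id)
    return kept


def rank(docs, candidates, query_terms):
    """Order candidate documents by a smoothed TF-IDF score over query_terms."""
    n_docs = len(docs)
    doc_freq = Counter()
    for text in docs.values():
        doc_freq.update(set(tokenize(text)))

    def score(doc_id):
        term_freq = Counter(tokenize(docs[doc_id]))
        return sum(
            term_freq[t] * math.log((n_docs + 1) / (doc_freq[t] + 1))
            for t in query_terms
        )

    return sorted(candidates, key=score, reverse=True)


# Toy collection (made up for the example).
docs = {
    "d1": "statin therapy reduces stroke risk in adults",
    "d2": "stroke rehabilitation exercise trial",
    "d3": "statin statin dosage and stroke stroke outcomes trial",
    "d4": "diet and exercise in adults",
}

candidates = boolean_filter(docs, all_of=("stroke",))
ranked = rank(docs, candidates, ("statin", "trial"))
print(ranked)
```

The Boolean step keeps the candidate set reproducible, which systematic reviews need, while the ranking step decides the order in which a reviewer reads the candidates.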
LITSEEK: public health literature search by metadata enhancement with external knowledge bases
Pub Date : 2009-11-06 DOI: 10.1145/1651318.1651337
P. Prabhu, S. Navathe, Stephen Tyler, V. Dasigi, N. Narkhede, Balaji Palanisamy
Biomedical literature is an important source of information in any researcher's investigation of genes, risk factors, diseases and drugs. The information sought by public health researchers is often distributed across multiple disparate sources that may include publications from PubMed; genomic, proteomic and pathway databases; gene expression and clinical resources; and biomedical ontologies. The unstructured nature of this information makes it difficult to find the relevant parts manually, and comprehensive knowledge is even harder to synthesize automatically. In this paper we report on LITSEEK (LITerature Search by metadata Enhancement with External Knowledgebases), a system we have developed for researchers at the Centers for Disease Control (CDC) to enable them to search the HuGE (Human Genome for Epidemiology) database of PubMed articles from a pharmacogenomic perspective. Besides analyzing text using TF-IDF ranking and indexing of the important terms, the proposed system incorporates an automatic consultation with PharmGKB, a human-curated knowledge base about drugs, related diseases and genes, as well as with the Gene Ontology, a human-curated, well-accepted ontology. We highlight the main components of our approach and illustrate how the search is enhanced by incorporating additional concepts in terms of genes, drugs and diseases (called metadata for ease of reference) from PharmGKB. Various measurements are reported with respect to the addition of these metadata terms. Preliminary results in terms of precision, based on expert user feedback from CDC, are encouraging. Further evaluation of the search procedure by actual researchers is under way.
Citations: 2