Semi-automated literature mining to identify putative biomarkers of disease from multiple biofluids.

Journal of Clinical Bioinformatics. Pub Date: 2014-10-23. eCollection Date: 2014-01-01. DOI: 10.1186/2043-9113-4-13

Rick Jordan, Shyam Visweswaran, Vanathi Gopalakrishnan


Citations: 6

Abstract

Background: Computational methods for mining biomedical literature can augment manual keyword searches when looking for disease-specific biomarkers in biofluids. In this work, we develop and apply a semi-automated literature mining method to abstracts obtained from PubMed to discover putative biomarkers of breast and lung cancers in specific biofluids.

Methodology: A positive set of abstracts was defined by the terms 'breast cancer' and 'lung cancer', each in conjunction with 14 separate 'biofluids' (bile, blood, breastmilk, cerebrospinal fluid, mucus, plasma, saliva, semen, serum, synovial fluid, stool, sweat, tears, and urine), while a negative set of abstracts was defined by the terms '(biofluid) NOT breast cancer' or '(biofluid) NOT lung cancer.' More than 5.3 million total abstracts were obtained from PubMed and examined for biomarker-disease-biofluid associations (34,296 positive and 2,653,396 negative for breast cancer; 28,355 positive and 2,595,034 negative for lung cancer). Biological entities such as genes and proteins were tagged using ABNER, and the tagged output was processed using Python scripts to produce a list of putative biomarkers. Z-scores were calculated, ranked, and used to determine the significance of the putative biomarkers found. Manual verification of relevant abstracts was performed to assess our method's performance.
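The query construction and z-score ranking described above can be illustrated with a minimal Python sketch. The abstract does not state which z-score formulation was used, so the two-proportion z-test below is an assumption, as are the function names (`build_queries`, `two_proportion_z`, `rank_entities`), the exact query syntax, and the per-entity counts in the demo, which are invented for illustration. Only the biofluid list and the breast cancer set sizes come from the abstract.

```python
import math

# The 14 biofluids listed in the abstract.
BIOFLUIDS = [
    "bile", "blood", "breastmilk", "cerebrospinal fluid", "mucus", "plasma",
    "saliva", "semen", "serum", "synovial fluid", "stool", "sweat", "tears",
    "urine",
]


def build_queries(disease, biofluid):
    """Return (positive, negative) query strings for one disease/biofluid pair.

    The quoting and operator layout are assumptions, not the authors' exact queries.
    """
    positive = f'"{disease}" AND "{biofluid}"'
    negative = f'"{biofluid}" NOT "{disease}"'
    return positive, negative


def two_proportion_z(pos_hits, pos_total, neg_hits, neg_total):
    """z-score for an entity being over-represented in the positive set.

    Assumed formulation: pooled two-proportion z-test on abstract counts.
    """
    p_pos = pos_hits / pos_total
    p_neg = neg_hits / neg_total
    pooled = (pos_hits + neg_hits) / (pos_total + neg_total)
    se = math.sqrt(pooled * (1 - pooled) * (1 / pos_total + 1 / neg_total))
    if se == 0:
        return 0.0  # entity appears in no abstracts or in all of them
    return (p_pos - p_neg) / se


def rank_entities(counts, pos_total, neg_total):
    """Rank entities by z-score, highest first.

    counts maps entity name -> (abstracts in positive set, abstracts in negative set).
    """
    scored = [
        (entity, two_proportion_z(pos, pos_total, neg, neg_total))
        for entity, (pos, neg) in counts.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Set sizes for breast cancer are from the abstract; entity counts are made up.
    toy_counts = {"HER2": (500, 1000), "albumin": (300, 30000)}
    for entity, z in rank_entities(toy_counts, 34296, 2653396):
        print(f"{entity}: z = {z:.1f}")
```

In this sketch an entity that is common in the negative set as well (the invented "albumin" counts) receives a negative z-score and falls to the bottom of the ranking, which is the behavior one would want from a background-corrected score.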

Results: Biofluid-specific markers were identified from the literature, assigned relevance scores based on frequency of occurrence, and validated using known biomarker lists and/or databases for lung and breast cancer [NCBI's Online Mendelian Inheritance in Man (OMIM), Cancer Gene annotation server for cancer genomics (CAGE), NCBI's Genes & Disease, NCI's Early Detection Research Network (EDRN), and others]. The specificity of each marker for a given biofluid was calculated, and the performance of our semi-automated literature mining method was assessed for breast and lung cancer.
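One way to picture the per-biofluid specificity calculation is the sketch below. The abstract does not define specificity, so this assumes a simple definition: the fraction of a marker's tagged abstracts that fall under a given biofluid. The function name `biofluid_specificity` and the example mentions are hypothetical.

```python
from collections import Counter, defaultdict


def biofluid_specificity(mentions):
    """Compute, per marker, the share of its mentions found in each biofluid.

    mentions is an iterable of (marker, biofluid) pairs, one per tagged abstract.
    Assumed definition: specificity = mentions in that biofluid / all mentions
    of the marker; the paper may define it differently.
    """
    per_marker = defaultdict(Counter)
    for marker, biofluid in mentions:
        per_marker[marker][biofluid] += 1

    result = {}
    for marker, counts in per_marker.items():
        total = sum(counts.values())
        result[marker] = {fluid: n / total for fluid, n in counts.items()}
    return result


if __name__ == "__main__":
    # Invented mentions for illustration only.
    toy_mentions = [("CEA", "serum")] * 3 + [("CEA", "urine")]
    print(biofluid_specificity(toy_mentions))
```

Under this assumed definition, a marker mentioned almost exclusively in one biofluid scores close to 1 for that fluid, while a marker spread evenly across fluids scores low everywhere.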

Conclusions: We developed a semi-automated process for determining a list of putative biomarkers for breast and lung cancer. New knowledge is presented in the form of biomarker lists; ranked, newly discovered biomarker-disease-biofluid relationships; and biomarker specificity across biofluids.


