Multiple sampling schemes and deep learning improve active learning performance in drug-drug interaction information retrieval analysis from the literature.

IF 2, Zone 3 (Engineering and Technology), Q3 MATHEMATICAL & COMPUTATIONAL BIOLOGY, Journal of Biomedical Semantics. Pub Date: 2023-05-30. DOI: 10.1186/s13326-023-00287-7

Weixin Xie, Kunjie Fan, Shijun Zhang, Lang Li

Citations: 0

Abstract

Background: Drug-drug interaction (DDI) information retrieval (IR) from the PubMed literature is an important natural language processing (NLP) task. For the first time, active learning (AL) is studied in DDI IR analysis. DDI IR analysis from PubMed abstracts faces the challenge of a relatively small number of positive DDI samples among an overwhelmingly large number of negative samples. Random negative sampling and positive sampling are purposely designed to improve the efficiency of AL analysis. The consistency of random negative sampling and positive sampling is shown in the paper.

Results: PubMed abstracts are divided into two pools. The screened pool contains all abstracts that pass the DDI keyword query in PubMed, while the unscreened pool includes all other abstracts. At a prespecified recall rate of 0.95, DDI IR analysis precision is evaluated and compared. In the screened pool IR analysis using a support vector machine (SVM), similarity sampling plus uncertainty sampling improves the precision over uncertainty sampling alone, from 0.89 to 0.92. In the unscreened pool IR analysis, the integrated random negative sampling, positive sampling, and similarity sampling improve the precision over uncertainty sampling alone, from 0.72 to 0.81. When the SVM is replaced with a deep learning method, all sampling schemes consistently improve DDI AL analysis in both the screened and unscreened pools. Deep learning significantly improves precision over SVM: 0.96 vs. 0.92 in the screened pool and 0.90 vs. 0.81 in the unscreened pool.
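
The evaluation metric named above, precision at a prespecified recall of 0.95, can be illustrated with a minimal Python sketch. This is a generic illustration, not the authors' evaluation code: the function name `precision_at_recall`, the toy scores, and the tie handling (tied scores are ranked in input order) are all assumptions made for this example.

```python
def precision_at_recall(scores, labels, target_recall=0.95):
    """Precision at the shortest ranked prefix whose recall reaches target_recall.

    scores: classifier scores, higher means more likely a positive (DDI) abstract.
    labels: 1 for a positive abstract, 0 for a negative one.
    Hypothetical helper written for illustration only.
    """
    total_pos = sum(labels)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank abstracts from most to least likely positive.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    for k, (_, label) in enumerate(ranked, start=1):
        true_pos += label
        if true_pos / total_pos >= target_recall:
            # First cutoff that retrieves enough positives: report precision there.
            return true_pos / k
    # Only reachable if target_recall > 1; fall back to precision over everything.
    return true_pos / len(ranked)


# Toy data (made up): three positives among six abstracts.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
labels = [1, 1, 0, 1, 0, 0]
print(precision_at_recall(scores, labels))        # 0.75
print(precision_at_recall(scores, labels, 0.6))   # 1.0
```

In the toy data, reaching 0.95 recall requires retrieving all three positives, which happens at the fourth ranked abstract, so the precision is 3/4.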

Conclusions: By integrating various sampling schemes and deep learning algorithms into AL, DDI IR analysis from the literature is significantly improved. Random negative sampling and positive sampling are highly effective methods for improving AL analysis when the positive and negative samples are extremely imbalanced.
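
For readers unfamiliar with the uncertainty sampling baseline that the Results compare against, the following Python sketch shows the textbook form of the selection step: from the unlabeled pool, pick the items the current classifier is least sure about. The abstract does not describe the authors' implementation, so the function name `uncertainty_sample`, the use of predicted probabilities, and the 0.5 reference point are assumptions. With an SVM, the absolute decision value would typically play the same role.

```python
def uncertainty_sample(probabilities, batch_size):
    """Return indices of the batch_size items whose P(positive) is closest to 0.5.

    probabilities: predicted probability of the positive class for each
    unlabeled abstract. Hypothetical helper written for illustration only.
    """
    # Sort pool indices by distance from the decision boundary (most uncertain first).
    order = sorted(range(len(probabilities)),
                   key=lambda i: abs(probabilities[i] - 0.5))
    return order[:batch_size]


# Toy pool (made up): predicted probabilities for five unlabeled abstracts.
pool_probabilities = [0.02, 0.48, 0.91, 0.55, 0.30]
print(uncertainty_sample(pool_probabilities, 2))  # [1, 3]
```

The selected abstracts would then be sent for manual labeling and added to the training set before the classifier is retrained, which is the general AL loop rather than a detail taken from this paper.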

Source journal

Journal of Biomedical Semantics (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 4.20

Self-citation rate: 5.30%

Review time: 30 weeks

Journal description: Journal of Biomedical Semantics addresses issues of semantic enrichment and semantic processing in the biomedical domain. The scope of the journal covers two main areas: (1) infrastructure for biomedical semantics, focusing on semantic resources and repositories, meta-data management and resource description, knowledge representation and semantic frameworks, the Biomedical Semantic Web, and semantic interoperability; and (2) semantic mining, annotation, and analysis, focusing on approaches and applications of semantic resources, and tools for investigation, reasoning, prediction, and discoveries in biomedicine.