Towards precise PICO extraction from abstracts of randomized controlled trials using a section-specific learning approach.

Bioinformatics (IF 5.4; Zone 3, Biology; JCR Q1, Biochemical Research Methods) Pub Date: 2023-09-05 DOI: 10.1093/bioinformatics/btad542

Yan Hu, Vipina K Keloth, Kalpana Raja, Yong Chen, Hua Xu


Citations: 0

Abstract

Motivation: Automated extraction of participants, intervention, comparison/control, and outcome (PICO) elements from randomized controlled trial (RCT) abstracts is important for evidence synthesis. Previous studies have demonstrated the feasibility of applying natural language processing (NLP) to PICO extraction. However, performance remains suboptimal because of the complexity of PICO information in RCT abstracts and the challenges involved in annotating it.

Results: We propose a two-step NLP pipeline to extract PICO elements from RCT abstracts: (i) sentence classification using a prompt-based learning model and (ii) PICO extraction using a named entity recognition (NER) model. First, the sentences in each abstract were categorized into four sections: background, methods, results, and conclusions. Next, the NER model was applied to extract PICO elements from the sentences in the title and methods sections, which together contain >96% of PICO information. We evaluated the proposed pipeline on three datasets: the EBM-NLPmod dataset (500 RCT abstracts randomly selected from the EBM-NLP corpus and reannotated), a dataset of 150 COVID-19 RCT abstracts, and a dataset of 150 Alzheimer's disease (AD) RCT abstracts. In the end-to-end evaluation, the approach achieved an overall token-level micro F1 score of 0.833 on the EBM-NLPmod dataset, 0.928 on the COVID-19 dataset, and 0.899 on the AD dataset, and an overall entity-level micro F1 score of 0.712 on the EBM-NLPmod dataset, 0.850 on the COVID-19 dataset, and 0.805 on the AD dataset.
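The gap between the token-level and entity-level scores reported above is easier to see with a small worked example. The Python sketch below is illustrative only and is not the authors' implementation: the `Span` representation, the function names, and the exact-match definition of an entity-level hit are assumptions, and `classify_section` is a hypothetical stand-in for the prompt-based sentence classifier.

```python
from typing import Callable, List, Set, Tuple

# Assumed representation: (start token index, end token index exclusive, PICO label),
# with token indices counted over one document.
Span = Tuple[int, int, str]


def select_sentences(title: str, sentences: List[str],
                     classify_section: Callable[[str], str]) -> List[str]:
    """Step (i): keep the title plus every sentence classified as 'methods'.

    classify_section is a hypothetical stand-in for the sentence classifier.
    """
    return [title] + [s for s in sentences if classify_section(s) == "methods"]


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """Micro F1 from pooled true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def entity_level_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """A predicted span counts only if boundaries and label match exactly (assumed)."""
    return f1_from_counts(len(gold & pred), len(pred - gold), len(gold - pred))


def token_level_f1(gold: Set[Span], pred: Set[Span]) -> float:
    """Each labeled token is scored on its own, so partial overlaps earn credit."""
    def to_tokens(spans: Set[Span]) -> Set[Tuple[int, str]]:
        return {(i, label) for start, end, label in spans for i in range(start, end)}

    g, p = to_tokens(gold), to_tokens(pred)
    return f1_from_counts(len(g & p), len(p - g), len(g - p))


# One participant span is predicted one token short; the intervention span is exact.
gold = {(0, 3, "P"), (5, 7, "I")}
pred = {(0, 2, "P"), (5, 7, "I")}
print(entity_level_f1(gold, pred))           # 0.5
print(round(token_level_f1(gold, pred), 3))  # 0.889
```

Under these assumptions a single boundary error costs a whole entity at the entity level but only one token at the token level, which is consistent with the entity-level scores being lower than the token-level scores on all three datasets.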

Availability: Our code and datasets are publicly available at https://github.com/BIDS-Xu-Lab/section_specific_annotation_of_PICO.

Supplementary information: Supplementary data are available at Bioinformatics online.


Source journal

Bioinformatics (Biology: Biochemical Research Methods)

CiteScore: 11.20

Self-citation rate: 5.20%

Articles published: 753

Review time: 2.1 months

Journal description: The leading journal in its field, Bioinformatics publishes the highest quality scientific papers and review articles of interest to academic and industrial researchers. Its main focus is on new developments in genome bioinformatics and computational biology. Two distinct sections within the journal, Discovery Notes and Application Notes, focus on shorter papers; the former reports biologically interesting discoveries made using computational methods, and the latter explores the applications used for experiments.