Semi-supervised learning from small annotated data and large unlabeled data for fine-grained Participants, Intervention, Comparison, and Outcomes entity recognition.

Journal of the American Medical Informatics Association (IF 4.7; Zone 2, Medicine; JCR Q1, Computer Science, Information Systems). Pub Date: 2025-01-17. DOI: 10.1093/jamia/ocae326

Fangyi Chen, Gongbo Zhang, Yilu Fang, Yifan Peng, Chunhua Weng



Abstract

Objective: Extracting PICO elements (Participants, Intervention, Comparison, and Outcomes) from clinical trial literature is essential for clinical evidence retrieval, appraisal, and synthesis. Existing approaches do not distinguish the attributes of PICO entities. This study aims to develop a named entity recognition (NER) model that extracts PICO entities at fine granularity.

Materials and methods: Using a corpus of 2511 abstracts with PICO mentions from 4 public datasets, we developed a semi-supervised method to train an NER model, FinePICO, by combining limited annotated data of PICO entities with abundant unlabeled data. For evaluation, we divided the entire dataset into 2 subsets: a smaller group with annotations and a larger group without annotations. We then established the theoretical lower and upper performance bounds from supervised learning models trained solely on the small, annotated subset and on the entire set with complete annotations, respectively. Finally, we evaluated FinePICO on both the smaller annotated subset and the larger, initially unannotated subset. We measured the performance of FinePICO using precision, recall, and F1.
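The evaluation setup described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract gives neither the split ratio nor the matching criterion, so the 10% labeled fraction, the exact-span matching rule, the `(start, end, label)` span format, and the function names `split_corpus` and `precision_recall_f1` are all assumptions made for illustration. The coarse labels in the toy data stand in for the fine-grained labels, which the abstract does not list.

```python
import random
from typing import Iterable, List, Set, Tuple

# (start_token, end_token, label): a hypothetical span representation.
Span = Tuple[int, int, str]


def split_corpus(doc_ids: Iterable[int], labeled_fraction: float,
                 seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle document ids and split them into a small 'annotated'
    group and a large 'unannotated' group (the fraction is an assumption)."""
    shuffled = list(doc_ids)
    random.Random(seed).shuffle(shuffled)
    n_labeled = int(len(shuffled) * labeled_fraction)
    return shuffled[:n_labeled], shuffled[n_labeled:]


def precision_recall_f1(gold: Set[Span],
                        pred: Set[Span]) -> Tuple[float, float, float]:
    """Entity-level scores, counting a prediction as correct only when
    its boundaries and label both match a gold span exactly (assumed)."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# 2511 abstracts, as in the corpus; 10% labeled is a made-up ratio.
labeled_ids, unlabeled_ids = split_corpus(range(2511), labeled_fraction=0.1)

# Toy gold and predicted spans for a single abstract.
gold = {(0, 2, "Participants"), (5, 7, "Intervention"), (10, 12, "Outcomes")}
pred = {(0, 2, "Participants"), (5, 7, "Intervention"),
        (8, 9, "Comparison"), (13, 14, "Outcomes")}
p, r, f = precision_recall_f1(gold, pred)
```

In this toy case two of the four predicted spans match the three gold spans, so precision is lower than recall. A supervised model trained only on `labeled_ids` would correspond to the lower bound described above, and one trained on all ids with full annotations to the upper bound.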

Results: Our method achieved precision/recall/F1 of 0.567/0.636/0.60, respectively, using a small set of annotated samples, outperforming the baseline model (F1: 0.437) by more than 16 percentage points. The model generalizes to a different PICO framework and to another corpus, consistently outperforming the benchmark across diverse experimental settings (P < .001).
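The reported figures can be checked with a few lines of Python. The sketch below uses only the numbers stated in the results. Reading the margin over the baseline as an absolute difference in F1, rather than a relative improvement, is an interpretation, not something the abstract states explicitly.

```python
# Figures as reported in the abstract.
precision, recall = 0.567, 0.636
baseline_f1 = 0.437

# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

# Two possible readings of the margin over the baseline.
absolute_gain = f1 - baseline_f1           # difference in F1 points
relative_gain = absolute_gain / baseline_f1  # improvement relative to baseline

print(f"F1 = {f1:.3f}")
print(f"absolute gain = {absolute_gain:.3f}")
print(f"relative gain = {relative_gain:.1%}")
```

The harmonic mean of the reported precision and recall comes to roughly 0.60, consistent with the stated F1. The absolute gap to the baseline is a little over 0.16, which matches the reported margin, while the relative improvement would be considerably larger.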

Discussion: We developed FinePICO to recognize fine-grained PICO entities from text and validated its performance across diverse experimental settings, highlighting the feasibility of using semi-supervised learning (SSL) techniques to enhance PICO entity extraction. Future work can focus on optimizing SSL algorithms to improve efficiency and reduce computational costs.

Conclusion: This study contributes a generalizable and effective semi-supervised approach leveraging large unlabeled data together with small, annotated data for fine-grained PICO extraction.



Source journal

Journal of the American Medical Informatics Association (Medicine; Computer Science, Interdisciplinary Applications)

CiteScore: 14.50

Self-citation rate: 7.80%

Articles published: 230

Review time: 3-8 weeks

About the journal: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.