Semi-supervised learning from small annotated data and large unlabeled data for fine-grained Participants, Intervention, Comparison, and Outcomes entity recognition.

IF 4.7 | CAS Zone 2 (Medicine) | JCR Q1 (Computer Science, Information Systems) | Journal of the American Medical Informatics Association | Pub Date: 2025-01-17 | DOI: 10.1093/jamia/ocae326
Fangyi Chen, Gongbo Zhang, Yilu Fang, Yifan Peng, Chunhua Weng
{"title":"Semi-supervised learning from small annotated data and large unlabeled data for fine-grained Participants, Intervention, Comparison, and Outcomes entity recognition.","authors":"Fangyi Chen, Gongbo Zhang, Yilu Fang, Yifan Peng, Chunhua Weng","doi":"10.1093/jamia/ocae326","DOIUrl":null,"url":null,"abstract":"<p><strong>Objective: </strong>Extracting PICO elements-Participants, Intervention, Comparison, and Outcomes-from clinical trial literature is essential for clinical evidence retrieval, appraisal, and synthesis. Existing approaches do not distinguish the attributes of PICO entities. This study aims to develop a named entity recognition (NER) model to extract PICO entities with fine granularities.</p><p><strong>Materials and methods: </strong>Using a corpus of 2511 abstracts with PICO mentions from 4 public datasets, we developed a semi-supervised method to facilitate the training of a NER model, FinePICO, by combining limited annotated data of PICO entities and abundant unlabeled data. For evaluation, we divided the entire dataset into 2 subsets: a smaller group with annotations and a larger group without annotations. We then established the theoretical lower and upper performance bounds based on the performance of supervised learning models trained solely on the small, annotated subset and on the entire set with complete annotations, respectively. Finally, we evaluated FinePICO on both the smaller annotated subset and the larger, initially unannotated subset. We measured the performance of FinePICO using precision, recall, and F1.</p><p><strong>Results: </strong>Our method achieved precision/recall/F1 of 0.567/0.636/0.60, respectively, using a small set of annotated samples, outperforming the baseline model (F1: 0.437) by more than 16%. The model demonstrates generalizability to a different PICO framework and to another corpus, which consistently outperforms the benchmark in diverse experimental settings (P-value < .001).</p><p><strong>Discussion: </strong>We developed FinePICO to recognize fine-grained PICO entities from text and validated its performance across diverse experimental settings, highlighting the feasibility of using semi-supervised learning (SSL) techniques to enhance PICO entities extraction. Future work can focus on optimizing SSL algorithms to improve efficiency and reduce computational costs.</p><p><strong>Conclusion: </strong>This study contributes a generalizable and effective semi-supervised approach leveraging large unlabeled data together with small, annotated data for fine-grained PICO extraction.</p>","PeriodicalId":50016,"journal":{"name":"Journal of the American Medical Informatics Association","volume":" ","pages":""},"PeriodicalIF":4.7000,"publicationDate":"2025-01-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Medical Informatics Association","FirstCategoryId":"91","ListUrlMain":"https://doi.org/10.1093/jamia/ocae326","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Objective: Extracting PICO elements (Participants, Intervention, Comparison, and Outcomes) from clinical trial literature is essential for clinical evidence retrieval, appraisal, and synthesis. Existing approaches do not distinguish the attributes of PICO entities. This study aims to develop a named entity recognition (NER) model to extract PICO entities at a fine granularity.

Materials and methods: Using a corpus of 2511 abstracts with PICO mentions drawn from 4 public datasets, we developed a semi-supervised method to train an NER model, FinePICO, by combining limited annotated PICO entity data with abundant unlabeled data. For evaluation, we divided the entire dataset into 2 subsets: a smaller group with annotations and a larger group without annotations. We then established theoretical lower and upper performance bounds from supervised models trained solely on the small, annotated subset and on the entire set with complete annotations, respectively. Finally, we evaluated FinePICO on both the smaller annotated subset and the larger, initially unannotated subset, measuring its performance with precision, recall, and F1.
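The paper does not include code, but the self-training recipe described above (fit a tagger on the small annotated pool, pseudo-label the unlabeled pool, keep only high-confidence predictions, and retrain) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' FinePICO implementation: the hashing-feature logistic-regression tagger, the fine-grained tag names, the confidence threshold, and the toy tokens are all placeholders.

```python
# Minimal self-training sketch for semi-supervised NER (illustrative only; not the
# authors' FinePICO code). A hashing-feature logistic-regression tagger stands in
# for the real model; the fine-grained tags, threshold, and data are placeholders.
import numpy as np
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = HashingVectorizer(n_features=2**12, analyzer="char_wb", ngram_range=(2, 4))

# Small annotated pool: (token, fine-grained BIO tag) pairs -- toy data only.
labeled_tokens = ["adults", "aged", "65", "metformin", "placebo", "mortality", "rate"]
labeled_tags = ["B-Population", "O", "B-Age", "B-Intervention", "B-Comparator",
                "B-Outcome", "I-Outcome"]

# Large unlabeled pool drawn from un-annotated abstracts.
unlabeled_tokens = ["children", "aspirin", "survival", "randomized", "12", "months"]

X_lab = vectorizer.transform(labeled_tokens)
y_lab = np.array(labeled_tags)
X_unlab = vectorizer.transform(unlabeled_tokens)

clf = LogisticRegression(max_iter=1000)
for _ in range(3):                           # a few self-training rounds
    clf.fit(X_lab, y_lab)                    # 1) fit on the current labeled pool
    if X_unlab.shape[0] == 0:
        break
    proba = clf.predict_proba(X_unlab)       # 2) pseudo-label the unlabeled pool
    conf = proba.max(axis=1)
    pred = clf.classes_[proba.argmax(axis=1)]
    keep = conf >= 0.5                       # 3) keep only confident pseudo-labels
    if not keep.any():
        break
    X_lab = vstack([X_lab, X_unlab[keep]])   # 4) grow the labeled pool and repeat
    y_lab = np.concatenate([y_lab, pred[keep]])
    X_unlab = X_unlab[~keep]

print(f"labeled pool grew to {X_lab.shape[0]} tokens")
```

In a realistic setting the per-token classifier would be a contextual sequence tagger and the stopping rule would be tied to held-out performance, but the loop structure stays the same.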

Results: Our method achieved precision/recall/F1 of 0.567/0.636/0.60, respectively, using a small set of annotated samples, outperforming the baseline model (F1: 0.437) by more than 16 points in absolute F1. The model also generalizes to a different PICO framework and to another corpus, consistently outperforming the benchmark across diverse experimental settings (P < .001).
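For context, the precision, recall, and F1 figures above are the standard NER metrics; a common way to compute them at the entity level (exact span match) from BIO-tagged sequences is sketched below. Whether FinePICO's evaluation used exact or relaxed matching is not specified here, so treat this purely as an illustration; the tag names and toy sequences are placeholders, not the paper's evaluation data or script.

```python
# Illustrative entity-level precision/recall/F1 for BIO-tagged PICO entities.
# Not the paper's evaluation code; tag names and sequences are placeholders.

def spans(tags):
    """Convert one BIO tag sequence into a set of (start, end, type) spans."""
    out, start, etype = set(), None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" flushes the last open span
        if tag.startswith("B-") or tag == "O" or (tag.startswith("I-") and tag[2:] != etype):
            if start is not None:           # close the span that was open
                out.add((start, i, etype))
                start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
        elif tag.startswith("I-") and start is None:
            start, etype = i, tag[2:]       # tolerate a span that opens with I-
    return out

def prf1(gold_seqs, pred_seqs):
    """Micro-averaged entity-level precision, recall, and F1 over all sequences."""
    tp = fp = fn = 0
    for gold, pred in zip(gold_seqs, pred_seqs):
        g, p = spans(gold), spans(pred)
        tp += len(g & p)
        fp += len(p - g)
        fn += len(g - p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

gold = [["B-Population", "I-Population", "O", "B-Outcome"]]
pred = [["B-Population", "I-Population", "O", "B-Intervention"]]
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```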

Discussion: We developed FinePICO to recognize fine-grained PICO entities in text and validated its performance across diverse experimental settings, highlighting the feasibility of using semi-supervised learning (SSL) techniques to enhance PICO entity extraction. Future work can focus on optimizing SSL algorithms to improve efficiency and reduce computational costs.

Conclusion: This study contributes a generalizable and effective semi-supervised approach that leverages large unlabeled data together with a small amount of annotated data for fine-grained PICO extraction.

Source journal
Journal of the American Medical Informatics Association (Medicine - Computer Science: Interdisciplinary Applications)
CiteScore: 14.50
Self-citation rate: 7.80%
Articles published: 230
Review time: 3-8 weeks
About the journal: JAMIA is AMIA's premier peer-reviewed journal for biomedical and health informatics. Covering the full spectrum of activities in the field, JAMIA includes informatics articles in the areas of clinical care, clinical research, translational science, implementation science, imaging, education, consumer health, public health, and policy. JAMIA's articles describe innovative informatics research and systems that help to advance biomedical science and to promote health. Case reports, perspectives, and reviews also help readers stay connected with the most important informatics developments in implementation, policy, and education.