Keyphrase Identification Using Minimal Labeled Data with Hierarchical Contexts and Transfer Learning.

Rohan Goli, Keerthana Komatineni, Shailesh Alluri, Nina Hubig, Hua Min, Yang Gong, Dean F Sittig, Lior Rennert, David Robinson, Paul Biondich, Adam Wright, Christian Nøhr, Timothy Law, Arild Faxvaag, Aneesa Weaver, Ronald Gimbel, Xia Jing
{"title":"Keyphrase Identification Using Minimal Labeled Data with Hierarchical Contexts and Transfer Learning.","authors":"Rohan Goli, Keerthana Komatineni, Shailesh Alluri, Nina Hubig, Hua Min, Yang Gong, Dean F Sittig, Lior Rennert, David Robinson, Paul Biondich, Adam Wright, Christian Nøhr, Timothy Law, Arild Faxvaag, Aneesa Weaver, Ronald Gimbel, Xia Jing","doi":"10.1101/2023.01.26.23285060","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Interoperable clinical decision support system (CDSS) rules provide a pathway to interoperability, a well-recognized challenge in health information technology. Building an ontology facilitates creating interoperable CDSS rules, which can be achieved by identifying the keyphrases (KP) from the existing literature. Ontology construction is traditionally a manual effort by human domain experts, and the newly advanced natural language processing techniques, such as KP identification, can be a critical complementary automatic part of building ontology. However, KP identification requires human expertise, consensus, and contextual understanding for data labeling.</p><p><strong>Methods: </strong>This paper presents a semi-supervised KP identification framework (long short-term memory-based encoders and the conditional random fields -based decoder models, BiLSTM-CRF) using minimal human labeled data based on hierarchical attention (i.e., at word, sentence, and abstract levels) over the documents and domain adaptation. We created synthetic labels for initial training and human-labeled data for fine-tuning. We also tested different options during NLP preprocessing and ML training to optimize the ML pipeline.</p><p><strong>Results: </strong>Our method outperforms the prior neural architectures by learning through synthetic labels for initial training, document-level contextual learning, language modeling, and fine-tuning with limited gold standard label data. After comparison, we found that the BIO encoding schema performed slightly better than Blue, and domain adaptation techniques can improve the quality of synthetic labels. In addition, document-level context, pre-trained LM, and pre-trained WE all contributed to better model performance in our tasks. Add 2 to 4 human-labeled documents for every 100 synthetic labeled documents improves the model performance without exhausting human-labeled documents too quickly.</p><p><strong>Conclusions: </strong>To the best of our knowledge, this is the first functional framework for the CDSS sub-domain to identify KPs, which is trained on limited human labeled data. It contributes to the general natural language processing (NLP) architectures in areas such as clinical NLP, where manual data labeling is challenging, and light-weighted deep learning models play an important role in real-time KP identification as a complementary approach to human experts' effort.</p>","PeriodicalId":18659,"journal":{"name":"medRxiv : the preprint server for health sciences","volume":" ","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ftp.ncbi.nlm.nih.gov/pub/pmc/oa_pdf/b9/97/nihpp-2023.01.26.23285060v2.PMC10246160.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"medRxiv : the preprint server for health sciences","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1101/2023.01.26.23285060","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Interoperable clinical decision support system (CDSS) rules provide a pathway to interoperability, a well-recognized challenge in health information technology. Building an ontology facilitates creating interoperable CDSS rules, which can be achieved by identifying the keyphrases (KP) from the existing literature. Ontology construction is traditionally a manual effort by human domain experts, and the newly advanced natural language processing techniques, such as KP identification, can be a critical complementary automatic part of building ontology. However, KP identification requires human expertise, consensus, and contextual understanding for data labeling.

Methods: This paper presents a semi-supervised KP identification framework (long short-term memory-based encoders and the conditional random fields -based decoder models, BiLSTM-CRF) using minimal human labeled data based on hierarchical attention (i.e., at word, sentence, and abstract levels) over the documents and domain adaptation. We created synthetic labels for initial training and human-labeled data for fine-tuning. We also tested different options during NLP preprocessing and ML training to optimize the ML pipeline.

Results: Our method outperforms the prior neural architectures by learning through synthetic labels for initial training, document-level contextual learning, language modeling, and fine-tuning with limited gold standard label data. After comparison, we found that the BIO encoding schema performed slightly better than Blue, and domain adaptation techniques can improve the quality of synthetic labels. In addition, document-level context, pre-trained LM, and pre-trained WE all contributed to better model performance in our tasks. Add 2 to 4 human-labeled documents for every 100 synthetic labeled documents improves the model performance without exhausting human-labeled documents too quickly.

Conclusions: To the best of our knowledge, this is the first functional framework for the CDSS sub-domain to identify KPs, which is trained on limited human labeled data. It contributes to the general natural language processing (NLP) architectures in areas such as clinical NLP, where manual data labeling is challenging, and light-weighted deep learning models play an important role in real-time KP identification as a complementary approach to human experts' effort.

Abstract Image

Abstract Image

Abstract Image

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
使用具有分层上下文和迁移学习的最小标记数据的关键词识别。
可互操作的临床决策支持系统(CDSS)规则提供了一条实现互操作性的途径,这是卫生信息技术中公认的挑战。构建本体有助于创建可互操作的CDSS规则,这可以通过从现有文献中识别关键短语(KP)来实现。然而,数据标签的KP识别需要人类专业知识、共识和上下文理解。本文旨在提出一个半监督KP识别框架,该框架使用基于对文档的层次关注和领域自适应的最小标记数据。我们的方法通过合成标签进行初始训练、文档级上下文学习、语言建模以及使用有限的金标准标签数据进行微调,从而优于先前的神经架构。据我们所知,这是CDSS子域识别KP的第一个功能框架,它是在有限的标记数据上训练的。它有助于临床NLP等领域的通用自然语言处理(NLP)架构,在这些领域,手动数据标记具有挑战性,而轻量级深度学习模型在实时KP识别中发挥作用,作为人类专家工作的补充方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
After the Infection: A Survey of Pathogens and Non-communicable Human Disease. The Extra-Islet Pancreas Supports Autoimmunity in Human Type 1 Diabetes. Keyphrase Identification Using Minimal Labeled Data with Hierarchical Contexts and Transfer Learning. Advancing Efficacy Prediction for EHR-based Emulated Trials in Repurposing Heart Failure Therapies. Novel autoantibody targets identified in patients with autoimmune hepatitis (AIH) by PhIP-Seq reveals pathogenic insights.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1