Fine-Tuning Large Language Models to Enhance Programmatic Assessment in Graduate Medical Education.

Gregory J Booth, Thomas Hauert, Mike Mynes, John Hodgson, Elizabeth Slama, Ashton Goldman, Jeffrey Moore
The Journal of Education in Perioperative Medicine (JEPM), 26(3):E729. Published 2024-09-30 (eCollection 2024 Jul 1). DOI: 10.46374/VolXXVI_Issue3_Moore. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11441632/pdf/

Abstract

Background: Natural language processing is a collection of techniques designed to empower computer systems to comprehend and/or produce human language. The purpose of this investigation was to train several large language models (LLMs) to explore the tradeoff between model complexity and performance while classifying narrative feedback on trainees into the Accreditation Council for Graduate Medical Education subcompetencies. We hypothesized that classification accuracy would increase with model complexity.

Methods: The authors fine-tuned several transformer-based LLMs (Bidirectional Encoder Representations from Transformers [BERT]-base, BERT-medium, BERT-small, BERT-mini, BERT-tiny, and SciBERT) to predict Accreditation Council for Graduate Medical Education subcompetencies on a curated dataset of 10 218 feedback comments. Performance was compared with the authors' previous work, which trained a FastText model on the same dataset. Performance metrics included F1 score for global model performance and area under the receiver operating characteristic curve for each competency.
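A fine-tuning pipeline of this kind can be sketched with the Hugging Face `transformers` Trainer API. The checkpoint name (`prajjwal1/bert-mini`), label strings, column handling, and hyperparameters below are illustrative assumptions, not the authors' exact configuration:

```python
def make_label_maps(labels):
    """Map subcompetency label strings to contiguous integer ids."""
    uniq = sorted(set(labels))
    label2id = {lab: i for i, lab in enumerate(uniq)}
    id2label = {i: lab for lab, i in label2id.items()}
    return label2id, id2label


def fine_tune(texts, labels, checkpoint="prajjwal1/bert-mini"):
    """Fine-tune a small BERT checkpoint to classify feedback comments.

    Heavy dependencies are imported lazily so the sketch can be read and
    loaded without torch/transformers installed.
    """
    import torch
    from torch.utils.data import Dataset
    from transformers import (AutoModelForSequenceClassification,
                              AutoTokenizer, Trainer, TrainingArguments)

    label2id, id2label = make_label_maps(labels)
    tok = AutoTokenizer.from_pretrained(checkpoint)
    enc = tok(texts, truncation=True, padding=True, return_tensors="pt")

    class FeedbackDataset(Dataset):
        def __len__(self):
            return len(texts)

        def __getitem__(self, i):
            item = {k: v[i] for k, v in enc.items()}
            item["labels"] = torch.tensor(label2id[labels[i]])
            return item

    model = AutoModelForSequenceClassification.from_pretrained(
        checkpoint, num_labels=len(label2id),
        id2label=id2label, label2id=label2id)
    args = TrainingArguments(output_dir="out", num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args,
            train_dataset=FeedbackDataset()).train()
    return model, tok
```

Calling `fine_tune(comments, subcompetency_labels)` on a labeled dataset would yield a classifier whose head has one logit per subcompetency; swapping the checkpoint string is all that distinguishes the BERT-base through BERT-tiny and SciBERT variants compared in the study.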

Results: No model was superior to FastText. Only BERT-tiny performed worse than FastText. The smallest model with performance comparable to FastText, BERT-mini, was 94% smaller. Area under the receiver operating characteristic curve for each competency was similar for BERT-mini and FastText, with the exceptions of Patient Care 7 (Situational Awareness and Crisis Management) and Systems-Based Practice.
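The two metrics named above (a global F1 score plus a per-competency one-vs-rest AUROC) can be sketched with scikit-learn; the class count, simulated probabilities, and random data below are invented for illustration, not the study's data:

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

rng = np.random.default_rng(0)
n_classes = 4                 # stand-in for a subset of ACGME subcompetencies
n = 200
y_true = rng.integers(0, n_classes, size=n)

# Simulated class-probability outputs, nudged toward the true label so the
# classifier is informative; rows are softmax-normalized to sum to 1.
logits = rng.normal(size=(n, n_classes))
logits[np.arange(n), y_true] += 1.5
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
y_pred = probs.argmax(axis=1)

# Global model performance: one F1 score across all classes.
global_f1 = f1_score(y_true, y_pred, average="weighted")

# Per-competency discrimination: one-vs-rest AUROC for each class.
per_class_auc = [
    roc_auc_score((y_true == c).astype(int), probs[:, c])
    for c in range(n_classes)
]

print(f"weighted F1: {global_f1:.3f}")
for c, auc in enumerate(per_class_auc):
    print(f"class {c} AUROC: {auc:.3f}")
```

Comparing two models this way means comparing one global F1 value and one AUROC per subcompetency, which is how the BERT-mini versus FastText differences on Patient Care 7 and Systems-Based Practice would surface.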

Discussion: Transformer-based LLMs were fine-tuned to understand anesthesiology graduate medical education language. Complex LLMs did not outperform FastText. However, equivalent performance was achieved with a model that was 94% smaller, which may allow model deployment on personal devices to enhance speed and data privacy. This work advances our understanding of best practices when integrating LLMs into graduate medical education.
