Special Issue on Informatics Education: ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions.

IF 2.1 | Medicine Tier 2 | Q4 Medical Informatics | Applied Clinical Informatics | Pub Date: 2024-08-29 | DOI: 10.1055/a-2405-0138
Tessa Louise Danehy, Jessica Hecht, Sabrina Kentis, Clyde Schechter, Sunit Jariwala
{"title":"信息学教育特刊:与医学知识问题相比,ChatGPT 在 USMLE 形式的伦理问题上表现更差。","authors":"Tessa Louise Danehy, Jessica Hecht, Sabrina Kentis, Clyde Schechter, Sunit Jariwala","doi":"10.1055/a-2405-0138","DOIUrl":null,"url":null,"abstract":"<p><strong>Objectives: </strong>The main objective of this study is to evaluate the ability of the Large Language Model ChatGPT to accurately answer USMLE board style medical ethics questions compared to medical knowledge based questions. This study has the additional objectives of comparing the overall accuracy of GPT-3.5 to GPT-4 and to assess the variability of responses given by each version.</p><p><strong>Materials and methods: </strong>Using AMBOSS, a third party USMLE Step Exam test prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. We ran 30 trials asking these questions on GPT-3.5 and GPT-4, and recorded the output. A random-effects linear probability regression model evaluated accuracy, and a Shannon entropy calculation evaluated response variation.</p><p><strong>Results: </strong>Both versions of ChatGPT demonstrated a worse performance on medical ethics questions compared to medical knowledge questions. GPT-4 performed 18% points (P < 0.05) worse on medical ethics questions compared to medical knowledge questions and GPT-3.5 performed 7% points (P = 0.41) worse. GPT-4 outperformed GPT-3.5 by 22% points (P < 0.001) on medical ethics and 33% points (P < 0.001) on medical knowledge. GPT-4 also exhibited an overall lower Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55) which indicates lower variability in response.</p><p><strong>Conclusion: </strong>Both versions of ChatGPT performed more poorly on medical ethics questions compared to medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 on overall accuracy and exhibited a significantly lower response variability in answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.</p><p><strong>Key words: </strong>ChatGPT, Large Language Model, Artificial Intelligence, Medical Education, USMLE, Ethics.</p>","PeriodicalId":48956,"journal":{"name":"Applied Clinical Informatics","volume":" ","pages":""},"PeriodicalIF":2.1000,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Special Issue on Informatics Education: ChatGPT Performs Worse on USMLE-Style Ethics Questions Compared to Medical Knowledge Questions.\",\"authors\":\"Tessa Louise Danehy, Jessica Hecht, Sabrina Kentis, Clyde Schechter, Sunit Jariwala\",\"doi\":\"10.1055/a-2405-0138\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Objectives: </strong>The main objective of this study is to evaluate the ability of the Large Language Model ChatGPT to accurately answer USMLE board style medical ethics questions compared to medical knowledge based questions. This study has the additional objectives of comparing the overall accuracy of GPT-3.5 to GPT-4 and to assess the variability of responses given by each version.</p><p><strong>Materials and methods: </strong>Using AMBOSS, a third party USMLE Step Exam test prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions matched on question difficulty for medical students. 
We ran 30 trials asking these questions on GPT-3.5 and GPT-4, and recorded the output. A random-effects linear probability regression model evaluated accuracy, and a Shannon entropy calculation evaluated response variation.</p><p><strong>Results: </strong>Both versions of ChatGPT demonstrated a worse performance on medical ethics questions compared to medical knowledge questions. GPT-4 performed 18% points (P < 0.05) worse on medical ethics questions compared to medical knowledge questions and GPT-3.5 performed 7% points (P = 0.41) worse. GPT-4 outperformed GPT-3.5 by 22% points (P < 0.001) on medical ethics and 33% points (P < 0.001) on medical knowledge. GPT-4 also exhibited an overall lower Shannon entropy for medical ethics and medical knowledge questions (0.21 and 0.11, respectively) than GPT-3.5 (0.59 and 0.55) which indicates lower variability in response.</p><p><strong>Conclusion: </strong>Both versions of ChatGPT performed more poorly on medical ethics questions compared to medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 on overall accuracy and exhibited a significantly lower response variability in answer choices. This underscores the need for ongoing assessment of ChatGPT versions for medical education.</p><p><strong>Key words: </strong>ChatGPT, Large Language Model, Artificial Intelligence, Medical Education, USMLE, Ethics.</p>\",\"PeriodicalId\":48956,\"journal\":{\"name\":\"Applied Clinical Informatics\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":2.1000,\"publicationDate\":\"2024-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Applied Clinical Informatics\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1055/a-2405-0138\",\"RegionNum\":2,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"MEDICAL INFORMATICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Applied Clinical Informatics","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1055/a-2405-0138","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"MEDICAL INFORMATICS","Score":null,"Total":0}
Citations: 0

Abstract


Objectives: The main objective of this study is to evaluate the ability of the large language model ChatGPT to accurately answer USMLE board-style medical ethics questions compared with medical knowledge-based questions. This study has the additional objectives of comparing the overall accuracy of GPT-3.5 with that of GPT-4 and assessing the variability of the responses given by each version.

Materials and methods: Using AMBOSS, a third-party USMLE Step exam test-prep service, we selected one group of 27 medical ethics questions and a second group of 27 medical knowledge questions, matched on question difficulty for medical students. We posed these questions to GPT-3.5 and GPT-4 over 30 trials and recorded each output. Accuracy was evaluated with a random-effects linear probability regression model, and response variation was evaluated with a Shannon entropy calculation.
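The abstract does not include the querying code, but the repeated-trial design is straightforward to sketch. Below is a minimal, illustrative example of how 30 trials per question per model could be collected with the OpenAI Python SDK; the model identifiers, system prompt, and CSV layout are assumptions for illustration, not the authors' actual setup.

```python
# Illustrative sketch only -- model names, prompt wording, and output format are
# assumptions, not the study's actual configuration.
import csv
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODELS = ["gpt-3.5-turbo", "gpt-4"]  # assumed identifiers for GPT-3.5 and GPT-4
N_TRIALS = 30


def ask(model: str, question: str) -> str:
    """Send one multiple-choice question and return the model's raw reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Answer with a single letter (A-E) only."},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()


def collect_responses(questions: dict[str, str], out_path: str = "responses.csv") -> None:
    """Run N_TRIALS independent trials of every question on every model and log them."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["model", "trial", "question_id", "answer"])
        for model in MODELS:
            for trial in range(1, N_TRIALS + 1):
                for qid, text in questions.items():
                    writer.writerow([model, trial, qid, ask(model, text)])
```

The recorded answers can then be scored against the AMBOSS answer key to produce the per-trial correctness data used in the accuracy analysis.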

Results: Both versions of ChatGPT performed worse on medical ethics questions than on medical knowledge questions. GPT-4 scored 18 percentage points lower (P < 0.05) on medical ethics questions than on medical knowledge questions, and GPT-3.5 scored 7 percentage points lower (P = 0.41). GPT-4 outperformed GPT-3.5 by 22 percentage points (P < 0.001) on medical ethics and by 33 percentage points (P < 0.001) on medical knowledge. GPT-4 also exhibited lower overall Shannon entropy than GPT-3.5 for medical ethics and medical knowledge questions (0.21 and 0.11, respectively, versus 0.59 and 0.55), indicating lower response variability.
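As a worked illustration of the entropy metric: Shannon entropy (in bits) is computed over the distribution of answer choices a model gives to a single question across the repeated trials. The abstract does not state how per-question entropies were aggregated, so the example values below only show the scale of the metric, not the study's data.

```python
# Illustrative only: Shannon entropy (bits) of one question's answer distribution
# across repeated trials. The study's aggregation across questions is not
# described in the abstract.
from collections import Counter
from math import log2


def shannon_entropy(answers: list[str]) -> float:
    """Entropy of the empirical answer-choice distribution, in bits."""
    n = len(answers)
    counts = Counter(answers)
    return -sum((c / n) * log2(c / n) for c in counts.values())


# A question answered "B" in 29 of 30 trials and "C" once (highly consistent):
print(round(shannon_entropy(["B"] * 29 + ["C"]), 2))       # 0.21

# A question split 20/10 between two choices (more variable):
print(round(shannon_entropy(["B"] * 20 + ["C"] * 10), 2))  # 0.92
```

Lower entropy therefore corresponds to a model that returns the same answer choice more consistently across repeated trials.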

Conclusion: Both versions of ChatGPT performed more poorly on medical ethics questions than on medical knowledge questions. GPT-4 significantly outperformed GPT-3.5 in overall accuracy and exhibited significantly lower variability in its answer choices. These findings underscore the need for ongoing assessment of ChatGPT versions used in medical education.

Key words: ChatGPT, Large Language Model, Artificial Intelligence, Medical Education, USMLE, Ethics.

Source journal
Applied Clinical Informatics
CiteScore: 4.60
Self-citation rate: 24.10%
Articles published: 132
Journal description: ACI is the third Schattauer journal dealing with biomedical and health informatics. It complements our other journals, Methods of Information in Medicine and the Yearbook of Medical Informatics. With the Yearbook of Medical Informatics serving as the "Milestone" or state-of-the-art journal and Methods of Information in Medicine as the "Science and Research" journal of IMIA, ACI intends to be the "Practical" journal of IMIA.
Latest articles in this journal
Extracting International Classification of Diseases Codes from Clinical Documentation using Large Language Models.
Special Issue Teaching and Training Future Health Informaticians: Managing the transition from tradition to innovation of the Heidelberg/Heilbronn Medical Informatics Master's Program.
Effects of Aligning Residency Note Templates with CMS Evaluation and Management Documentation Requirements.
Multisite implementation of a sexual health survey and clinical decision support to promote adolescent sexually transmitted infection screening.
Optimizing Resident Charge Capture with Disappearing Help Text in Note Templates.