Can ChatGPT-4o really pass medical science exams? A pragmatic analysis using novel questions

Phil Newton, Chris J Summers, Uzman Zaheer, Maira Xiromeriti, Jemima R Stokes, Jaskaran Singh Bhangu, Elis G Roome, Alanna Roberts-Phillips, Darius Mazaheri-Asadi, Cameron D Jones, Stuart Hughes, Dominic Gilbert, Ewan Jones, Keioni Essex, Emily C Rees, Ross Davey, Adrienne A Cox, Jessica A Bassett
medRxiv - Medical Education · Preprint (Journal Article) · Published 2024-07-02 · DOI: 10.1101/2024.06.29.24309595
Citations: 0

Abstract

ChatGPT apparently shows excellent performance on high-level professional exams such as those used in medical assessment and licensing. This has raised concerns that ChatGPT could be used for academic misconduct, especially in unproctored online exams. However, ChatGPT has also shown weaker performance on questions with pictures, and there have been concerns that ChatGPT's performance may be artificially inflated by the public nature of the sample questions tested, meaning they likely formed part of its training materials. This led to suggestions that cheating could be mitigated by using novel questions for every sitting of an exam and making extensive use of picture-based questions. These approaches remain untested. Here we tested the performance of ChatGPT-4o on existing medical licensing exams in the UK and USA, and on novel questions based on those exams. ChatGPT-4o scored 94% on the United Kingdom Medical Licensing Exam Applied Knowledge Test, and 89.9% on the United States Medical Licensing Exam Step 1. Performance was not diminished when the questions were rewritten into novel versions, or on completely novel questions that were not based on any existing questions. ChatGPT did show slightly reduced performance on questions containing images, particularly when the answer options were added to an image as text labels. These data demonstrate that the performance of ChatGPT continues to improve, and that unproctored online exams are an invalid form of assessment of the foundational knowledge needed for higher-order learning.