Performance of Progressive Generations of GPT on an Exam Designed for Certifying Physicians as Certified Clinical Densitometrists

IF 1.7 · CAS Zone 4 (Medicine) · JCR Q4 (Endocrinology & Metabolism) · Journal of Clinical Densitometry · Pub Date: 2024-02-17 · DOI: 10.1016/j.jocd.2024.101480
Dustin Valdez, Arianna Bunnell, Sian Y. Lim, Peter Sadowski, John A. Shepherd
Citations: 0

Abstract


Background: Artificial intelligence (AI) large language models (LLMs) such as ChatGPT have demonstrated the ability to pass standardized exams. These models are not trained for a specific task, but instead trained to predict sequences of text from large corpora of documents sourced from the internet. It has been shown that even models trained on this general task can pass exams in a variety of domain-specific fields, including the United States Medical Licensing Examination. We asked whether large language models would perform as well on a much narrower subdomain test designed for medical specialists. Furthermore, we wanted to better understand how progressive generations of GPT (generative pre-trained transformer) models may be evolving in the completeness and sophistication of their responses even while generational training remains general. In this study, we evaluated the performance of two versions of GPT (GPT-3 and GPT-4) on their ability to pass the certification exam given to physicians who wish to work as osteoporosis specialists and become certified clinical densitometrists (CCDs). The CCD exam is scored on a scale of 150 to 400; a score of 300 is required to pass.

Methods: A 100-question multiple-choice practice exam was obtained from a third-party exam-preparation website that mimics the accredited certification tests given by the ISCD (International Society for Clinical Densitometry). The exam was administered to two versions of GPT, the free version (GPT Playground) and ChatGPT+, which are based on GPT-3 and GPT-4, respectively (OpenAI, San Francisco, CA). The systems were prompted with the exam questions verbatim. If a response was purely textual and did not specify which of the multiple-choice answers to select, the authors matched the text to the closest answer. Each exam was graded, and an estimated ISCD score was provided by the exam website. In addition, each response was evaluated by a CCD-certified rheumatologist and rated for accuracy on a 5-level scale. The two GPT versions were compared in terms of response accuracy and length.
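
For readers who want to reproduce this kind of evaluation programmatically, a minimal sketch follows. Note the assumptions: the study used the GPT Playground and ChatGPT+ web interfaces rather than the API, so the model identifiers, the prompt format, the sample question, and the match_choice helper below are illustrative, not the authors' actual procedure.

# Illustrative sketch only; model names and helpers are assumptions,
# not the authors' protocol. Requires the openai Python SDK (>=1.0).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask(model: str, question: str) -> str:
    """Send one exam question verbatim and return the model's reply."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": question}],
    )
    return response.choices[0].message.content

def match_choice(reply: str, choices: dict) -> str:
    """Stand-in for the authors' manual step: when the reply is free text
    rather than a letter, map it to the closest multiple-choice option."""
    first = reply.strip()[:1].upper()
    if first in choices:            # reply starts with a letter, e.g. "D) ..."
        return first
    reply_words = set(reply.lower().split())
    # otherwise pick the option whose text overlaps the reply the most
    return max(choices, key=lambda c: len(reply_words & set(choices[c].lower().split())))

# usage: grade one (hypothetical) question for each model
question = ("Which T-score threshold defines osteoporosis by WHO criteria? "
            "A) -1.0  B) -1.5  C) -2.0  D) -2.5")
choices = {"A": "-1.0", "B": "-1.5", "C": "-2.0", "D": "-2.5"}
for model in ("gpt-3.5-turbo", "gpt-4"):
    print(model, match_choice(ask(model, question), choices))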

Results: The average response length was 11.6 ± 19 words for GPT-3 and 50.0 ± 43.6 words for GPT-4. GPT-3 answered 62 questions correctly, resulting in a failing ISCD score of 289. GPT-4, however, answered 82 questions correctly, earning a passing score of 342. GPT-3 scored highest on the “Overview of Low Bone Mass and Osteoporosis” category (72% correct), while GPT-4 scored well above 80% accuracy on all categories except “Imaging Technology in Bone Health” (65% correct). Regarding subjective accuracy, GPT-3 answered 23 questions with nonsensical or totally wrong responses, while GPT-4 had no responses in that category.
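
As a worked illustration of how the summary statistics above (mean ± SD response length, percent correct) could be computed from graded transcripts, a small sketch follows; the record layout, field names, and values are hypothetical, not the study's data.

# Hypothetical graded records; values are for illustration only.
from statistics import mean, stdev

graded = [
    {"model": "GPT-4", "category": "Imaging Technology in Bone Health",
     "words": 37, "correct": True},
    {"model": "GPT-4", "category": "Imaging Technology in Bone Health",
     "words": 52, "correct": False},
    # ... one record per exam question per model
]

def summarize(rows, model):
    rows = [r for r in rows if r["model"] == model]
    lengths = [r["words"] for r in rows]
    pct = 100 * sum(r["correct"] for r in rows) / len(rows)
    print(f"{model}: {mean(lengths):.1f} ± {stdev(lengths):.1f} words, "
          f"{pct:.0f}% correct")

summarize(graded, "GPT-4")  # prints "GPT-4: 44.5 ± 10.6 words, 50% correct"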

Conclusion: If this had been an actual certification exam, GPT-4 would now have the CCD suffix after its name, despite having been trained only on general internet knowledge. Clearly, more goes into physician training than can be captured in this exam. However, GPT algorithms may prove to be valuable physician aids in the diagnosis and monitoring of osteoporosis and other diseases.

Source journal
Journal of Clinical Densitometry (Medicine – Endocrinology & Metabolism)
CiteScore: 4.90 · Self-citation rate: 8.00% · Articles per year: 92 · Review time: 90 days
About the journal: The Journal is committed to serving ISCD's mission - the education of heterogeneous physician specialties and technologists who are involved in the clinical assessment of skeletal health. The focus of JCD is bone mass measurement, including epidemiology of bone mass, how drugs and diseases alter bone mass, new techniques and quality assurance in bone mass imaging technologies, and bone mass health/economics. Combining high quality research and review articles with sound, practice-oriented advice, JCD meets the diverse diagnostic and management needs of radiologists, endocrinologists, nephrologists, rheumatologists, gynecologists, family physicians, internists, and technologists whose patients require diagnostic clinical densitometry for therapeutic management.