Can Artificial Intelligence Fool Residency Selection Committees? Analysis of Personal Statements by Real Applicants and Generative AI, a Randomized, Single-Blind Multicenter Study.

JBJS Open Access · IF 2.3 · Q2 (Orthopedics) · Pub Date: 2024-10-24 · eCollection Date: 2024-10-01 · DOI: 10.2106/JBJS.OA.24.00028
Zachary C Lum, Lohitha Guntupalli, Augustine M Saiz, Holly Leshikar, Hai V Le, John P Meehan, Eric G Huish
{"title":"Can Artificial Intelligence Fool Residency Selection Committees? Analysis of Personal Statements by Real Applicants and Generative AI, a Randomized, Single-Blind Multicenter Study.","authors":"Zachary C Lum, Lohitha Guntupalli, Augustine M Saiz, Holly Leshikar, Hai V Le, John P Meehan, Eric G Huish","doi":"10.2106/JBJS.OA.24.00028","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>The potential capabilities of generative artificial intelligence (AI) tools have been relatively unexplored, particularly in the realm of creating personalized statements for medical students applying to residencies. This study aimed to investigate the ability of generative AI, specifically ChatGPT and Google BARD, to generate personal statements and assess whether faculty on residency selection committees could (1) evaluate differences between real and AI statements and (2) determine differences based on 13 defined and specific metrics of a personal statement.</p><p><strong>Methods: </strong>Fifteen real personal statements were used to generate 15 unique and distinct personal statements from ChatGPT and BARD each, resulting in a total of 45 statements. Statements were then randomized, blinded, and presented to a group of faculty reviewers on residency selection committees. Reviewers assessed the statements by 14 metrics including if the personal statement was AI-generated or real. Comparison of all metrics was performed.</p><p><strong>Results: </strong>Faculty correctly identified 88% (79/90) real statements, 90% (81/90) BARD, and 44% (40/90) ChatGPT statements. Accuracy of identifying real and BARD statements was 89%, but this dropped to 74% when including ChatGPT. In addition, the accuracy did not increase as faculty members reviewed more personal statements (area under the curve [AUC] 0.498, p = 0.966). BARD performed poorer than both real and ChatGPT across all metrics (p < 0.001). Comparing real with ChatGPT, there was no difference in most metrics, except for Personal Interests, Reasons for Choosing Residency, Career Goals, Compelling Nature and Originality, and all favoring the real personal statements (p = 0.001, p = 0.002, p < 0.001, p < 0.001, and p < 0.001, respectively).</p><p><strong>Conclusion: </strong>Faculty members accurately identified real and BARD statements, but ChatGPT deceived them 56% of the time. Although AI can craft convincing statements that are sometimes indistinguishable from real ones, replicating the humanistic experience, personal nuances, and individualistic elements found in real personal statements is difficult. 
Residency selection committees might want to prioritize these particular metrics while assessing personal statements, given the growing capabilities of AI in this arena.</p><p><strong>Clinical relevance: </strong>Residency selection committees may want to prioritize certain metrics unique to the human element such as personal interests, reasons for choosing residency, career goals, compelling nature, and originality when evaluating personal statements.</p>","PeriodicalId":36492,"journal":{"name":"JBJS Open Access","volume":"9 4","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11498924/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JBJS Open Access","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2106/JBJS.OA.24.00028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/10/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
引用次数: 0

Abstract

Introduction: The capabilities of generative artificial intelligence (AI) tools remain relatively unexplored, particularly for writing personal statements for medical students applying to residency. This study investigated the ability of generative AI, specifically ChatGPT and Google BARD, to generate personal statements, and assessed whether faculty on residency selection committees could (1) distinguish real statements from AI-generated ones and (2) detect differences across 13 defined and specific metrics of a personal statement.

Methods: Fifteen real personal statements were used to generate 15 unique and distinct personal statements each from ChatGPT and BARD, for a total of 45 statements. The statements were then randomized, blinded, and presented to a group of faculty reviewers on residency selection committees. Reviewers assessed each statement on 14 metrics, including whether the personal statement was AI-generated or real. All metrics were compared across the three groups.
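The assembly, randomization, and blinding of the 45-statement pool can be illustrated with a minimal sketch (hypothetical file names and variables; not the authors' actual workflow):

```python
# Illustrative sketch only: pooling 15 real + 15 ChatGPT + 15 BARD statements,
# shuffling their order, and hiding the source labels from reviewers.
import random

real    = [f"real_{i}.txt"    for i in range(1, 16)]   # 15 real applicant statements
chatgpt = [f"chatgpt_{i}.txt" for i in range(1, 16)]   # 15 ChatGPT-generated statements
bard    = [f"bard_{i}.txt"    for i in range(1, 16)]   # 15 BARD-generated statements

# Pool all 45 statements; the true source is kept only in a hidden answer key.
pool = ([(f, "real") for f in real]
        + [(f, "chatgpt") for f in chatgpt]
        + [(f, "bard") for f in bard])
random.shuffle(pool)                            # randomize presentation order

blinded_packet = [f for f, _ in pool]           # what reviewers see (no labels)
answer_key     = {f: src for f, src in pool}    # retained for later scoring
```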

Results: Faculty correctly identified 88% (79/90) of real statements, 90% (81/90) of BARD statements, and 44% (40/90) of ChatGPT statements. Accuracy in identifying real and BARD statements was 89%, but this dropped to 74% when ChatGPT statements were included. Accuracy did not improve as faculty members reviewed more personal statements (area under the curve [AUC] 0.498, p = 0.966). BARD performed worse than both the real and ChatGPT statements across all metrics (p < 0.001). Comparing real statements with ChatGPT, most metrics showed no difference; the exceptions were Personal Interests, Reasons for Choosing Residency, Career Goals, Compelling Nature, and Originality, all favoring the real personal statements (p = 0.001, p = 0.002, p < 0.001, p < 0.001, and p < 0.001, respectively).
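The pooled accuracies follow directly from the counts reported above, as the worked example below shows (a simple check of the arithmetic, not the authors' analysis code):

```python
# Reproduce the pooled accuracies from the per-source counts in the abstract.
real_correct, real_total       = 79, 90
bard_correct, bard_total       = 81, 90
chatgpt_correct, chatgpt_total = 40, 90

# Real + BARD only: (79 + 81) / 180 ≈ 0.889, reported as 89%
acc_real_bard = (real_correct + bard_correct) / (real_total + bard_total)

# All three sources: (79 + 81 + 40) / 270 ≈ 0.741, reported as 74%
acc_all = ((real_correct + bard_correct + chatgpt_correct)
           / (real_total + bard_total + chatgpt_total))

print(f"Real + BARD accuracy: {acc_real_bard:.1%}")  # 88.9%
print(f"Overall accuracy:     {acc_all:.1%}")        # 74.1%
```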

Conclusion: Faculty members accurately identified real and BARD statements, but ChatGPT deceived them 56% of the time. Although AI can craft convincing statements that are sometimes indistinguishable from real ones, it struggles to replicate the humanistic experience, personal nuances, and individualistic elements found in real personal statements. Given the growing capabilities of AI in this arena, residency selection committees might prioritize these distinguishing metrics when assessing personal statements.

Clinical relevance: Residency selection committees may want to prioritize metrics unique to the human element, such as personal interests, reasons for choosing residency, career goals, compelling nature, and originality, when evaluating personal statements.
