Can Artificial Intelligence Fool Residency Selection Committees? Analysis of Personal Statements by Real Applicants and Generative AI, a Randomized, Single-Blind Multicenter Study.
Zachary C Lum, Lohitha Guntupalli, Augustine M Saiz, Holly Leshikar, Hai V Le, John P Meehan, Eric G Huish
{"title":"Can Artificial Intelligence Fool Residency Selection Committees? Analysis of Personal Statements by Real Applicants and Generative AI, a Randomized, Single-Blind Multicenter Study.","authors":"Zachary C Lum, Lohitha Guntupalli, Augustine M Saiz, Holly Leshikar, Hai V Le, John P Meehan, Eric G Huish","doi":"10.2106/JBJS.OA.24.00028","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>The potential capabilities of generative artificial intelligence (AI) tools have been relatively unexplored, particularly in the realm of creating personalized statements for medical students applying to residencies. This study aimed to investigate the ability of generative AI, specifically ChatGPT and Google BARD, to generate personal statements and assess whether faculty on residency selection committees could (1) evaluate differences between real and AI statements and (2) determine differences based on 13 defined and specific metrics of a personal statement.</p><p><strong>Methods: </strong>Fifteen real personal statements were used to generate 15 unique and distinct personal statements from ChatGPT and BARD each, resulting in a total of 45 statements. Statements were then randomized, blinded, and presented to a group of faculty reviewers on residency selection committees. Reviewers assessed the statements by 14 metrics including if the personal statement was AI-generated or real. Comparison of all metrics was performed.</p><p><strong>Results: </strong>Faculty correctly identified 88% (79/90) real statements, 90% (81/90) BARD, and 44% (40/90) ChatGPT statements. Accuracy of identifying real and BARD statements was 89%, but this dropped to 74% when including ChatGPT. In addition, the accuracy did not increase as faculty members reviewed more personal statements (area under the curve [AUC] 0.498, p = 0.966). BARD performed poorer than both real and ChatGPT across all metrics (p < 0.001). Comparing real with ChatGPT, there was no difference in most metrics, except for Personal Interests, Reasons for Choosing Residency, Career Goals, Compelling Nature and Originality, and all favoring the real personal statements (p = 0.001, p = 0.002, p < 0.001, p < 0.001, and p < 0.001, respectively).</p><p><strong>Conclusion: </strong>Faculty members accurately identified real and BARD statements, but ChatGPT deceived them 56% of the time. Although AI can craft convincing statements that are sometimes indistinguishable from real ones, replicating the humanistic experience, personal nuances, and individualistic elements found in real personal statements is difficult. 
Residency selection committees might want to prioritize these particular metrics while assessing personal statements, given the growing capabilities of AI in this arena.</p><p><strong>Clinical relevance: </strong>Residency selection committees may want to prioritize certain metrics unique to the human element such as personal interests, reasons for choosing residency, career goals, compelling nature, and originality when evaluating personal statements.</p>","PeriodicalId":36492,"journal":{"name":"JBJS Open Access","volume":"9 4","pages":""},"PeriodicalIF":2.3000,"publicationDate":"2024-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11498924/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JBJS Open Access","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2106/JBJS.OA.24.00028","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/10/1 0:00:00","PubModel":"eCollection","JCR":"Q2","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
Abstract
Introduction: The capabilities of generative artificial intelligence (AI) tools remain relatively unexplored, particularly for writing personal statements for medical students applying to residencies. This study aimed to investigate the ability of generative AI, specifically ChatGPT and Google BARD, to generate personal statements and to assess whether faculty on residency selection committees could (1) distinguish real from AI-generated statements and (2) detect differences across 13 defined, specific metrics of a personal statement.
Methods: Fifteen real personal statements were used to generate 15 unique and distinct personal statements each from ChatGPT and BARD, for a total of 45 statements. The statements were then randomized, blinded, and presented to faculty reviewers serving on residency selection committees. Reviewers assessed each statement on 14 metrics, including whether the personal statement was AI-generated or real. All metrics were compared across the three groups.
Results: Faculty correctly identified 88% (79/90) of real statements, 90% (81/90) of BARD statements, and 44% (40/90) of ChatGPT statements. Accuracy in identifying real and BARD statements was 89%, but this dropped to 74% when ChatGPT statements were included. In addition, accuracy did not increase as faculty members reviewed more personal statements (area under the curve [AUC] 0.498, p = 0.966). BARD performed worse than both real and ChatGPT statements across all metrics (p < 0.001). Comparing real with ChatGPT statements, most metrics showed no difference; the exceptions were Personal Interests, Reasons for Choosing Residency, Career Goals, Compelling Nature, and Originality, all favoring the real personal statements (p = 0.001, p = 0.002, p < 0.001, p < 0.001, and p < 0.001, respectively).
Conclusion: Faculty members accurately identified real and BARD statements, but ChatGPT deceived them 56% of the time. Although AI can craft convincing statements that are sometimes indistinguishable from real ones, replicating the humanistic experience, personal nuances, and individualistic elements found in real personal statements is difficult. Residency selection committees might want to prioritize these particular metrics while assessing personal statements, given the growing capabilities of AI in this arena.
Clinical relevance: Residency selection committees may want to prioritize certain metrics unique to the human element such as personal interests, reasons for choosing residency, career goals, compelling nature, and originality when evaluating personal statements.
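As a quick arithmetic check, the pooled accuracies reported in the Results follow directly from the per-source counts. The minimal Python sketch below (illustrative only, not part of the study's analysis; variable names are ours) reproduces the 89%, 74%, and 56% figures:

```python
# Re-deriving the pooled figures from the per-source counts reported in the Results.
# Counts are correct identifications out of 90 reviews per statement source.
correct = {"real": 79, "BARD": 81, "ChatGPT": 40}
reviews_per_source = 90

real_bard_accuracy = (correct["real"] + correct["BARD"]) / (2 * reviews_per_source)
overall_accuracy = sum(correct.values()) / (3 * reviews_per_source)
chatgpt_deception_rate = 1 - correct["ChatGPT"] / reviews_per_source

print(f"Real + BARD accuracy: {real_bard_accuracy:.0%}")      # ~89%
print(f"Overall accuracy:     {overall_accuracy:.0%}")        # ~74%
print(f"ChatGPT deception:    {chatgpt_deception_rate:.0%}")  # ~56%
```

The 56% deception rate is simply the complement of the 44% correct-identification rate for ChatGPT statements.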