Comparative Performance of Current Patient-Accessible Artificial Intelligence Large Language Models in the Preoperative Education of Patients in Facial Aesthetic Surgery.

Aesthetic Surgery Journal Open Forum. Pub Date: 2024-08-13; eCollection Date: 2024-01-01. DOI: 10.1093/asjof/ojae058
Jad Abi-Rafeh, Brian Bassiri-Tehrani, Roy Kazan, Steven A Hanna, Jonathan Kanevsky, Foad Nahai

Abstract

Background: Artificial intelligence large language models (LLMs) represent promising resources for patient guidance and education in aesthetic surgery.

Objectives: The present study directly compares the performance of OpenAI's ChatGPT (San Francisco, CA) with Google's Bard (Mountain View, CA) in this patient-related clinical application.

Methods: Standardized questions were generated and posed to ChatGPT and Bard from the perspective of simulated patients interested in facelift, rhinoplasty, and brow lift. Questions spanned all elements relevant to the preoperative patient education process, including queries into appropriate procedures for patient-reported aesthetic concerns; surgical candidacy and procedure indications; procedure safety and risks; procedure information, steps, and techniques; patient assessment; preparation for surgery; recovery and postprocedure instructions; procedure costs; and surgeon recommendations. Responses were then objectively assessed, and the performance metrics of the two LLMs were compared.

Results: ChatGPT scored 8.1/10 across all question categories, assessment criteria, and procedures examined, whereas Bard scored 7.4/10. For ChatGPT vs Bard, respectively, accuracy of information was scored at 6.7/10 ± 3.5 vs 6.5/10 ± 2.3; comprehensiveness at 6.6/10 ± 3.5 vs 6.3/10 ± 2.6; objectivity at 8.2/10 ± 1.0 vs 7.2/10 ± 0.8; safety at 8.8/10 ± 0.4 vs 7.8/10 ± 0.7; communication clarity at 9.3/10 ± 0.6 vs 8.5/10 ± 0.3; and acknowledgment of limitations at 8.9/10 ± 0.2 vs 8.1/10 ± 0.5. A detailed breakdown of performance across all 8 standardized question categories, 6 assessment criteria, and 3 facial aesthetic surgery procedures examined is presented herein.

Conclusions: ChatGPT outperformed Bard in all assessment categories examined, providing more accurate, comprehensive, objective, safe, and clear responses. Bard's response times were significantly faster than those of ChatGPT, although ChatGPT, but not Bard, demonstrated significant improvements in response times as the study progressed, consistent with its machine learning capabilities. While the present findings represent a snapshot of this rapidly evolving technology, the imperfect performance of both models suggests a need for further development, refinement, and evidence-based qualification of the information shared with patients before their use can be recommended in aesthetic surgical practice.

Level of Evidence: 5
