Daniel Bahir, Morris Hartstein, Ofira Zloto, Cat Burkat, Jimmy Uddin, Shirin Hamed Azzam
{"title":"甲状腺眼病与人工智能:ChatGPT-3.5、chatgpt - 40和Gemini在患者信息传递中的比较研究","authors":"Daniel Bahir, Morris Hartstein, Ofira Zloto, Cat Burkat, Jimmy Uddin, Shirin Hamed Azzam","doi":"10.1097/IOP.0000000000002882","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study aimed to compare the effectiveness of 3 artificial intelligence language models-GPT-3.5, GPT-4o, and Gemini, in delivering patient-centered information about thyroid eye disease (TED). We evaluated their performance based on the accuracy and comprehensiveness of their responses to common patient inquiries regarding TED. The study did not assess the repeatability of artificial intelligence responses, focusing on single-session evaluations per model.</p><p><strong>Methods: </strong>Five experienced oculoplastic surgeons assessed the responses generated by the artificial intelligence models to 12 key questions frequently asked by TED patients. These questions addressed TED pathophysiology, risk factors, clinical presentation, diagnostic testing, and treatment options. Each response was rated for correctness and reliability on a 7-point Likert scale, where 1 indicated incorrect or unreliable information and 7 indicated highly accurate and reliable information. Correctness referred to factual accuracy, while reliability assessed trustworthiness for patient use. The evaluations were anonymized, and the final scores were averaged across the surgeons to facilitate model comparisons.</p><p><strong>Results: </strong>GPT-3.5 emerged as the top performer, achieving an average correctness score of 5.75 and a reliability score of 5.68, excelling in delivering detailed information on complex topics such as TED treatment and surgical interventions. GPT-4o followed with scores of 5.32 for correctness and 5.25 for reliability, generally providing accurate but less detailed information. Gemini trailed with scores of 5.10 for correctness and 4.70 for reliability, often providing sufficient responses for simpler questions but lacking detail in complex areas like second-line immunosuppressive treatments. Statistical analysis using the Friedman test showed significant differences between models (p < 0.05) for key topics, with GPT-3.5 consistently leading.</p><p><strong>Conclusions: </strong>GPT-3.5 was the most effective model for delivering reliable and comprehensive patient information, particularly for complex treatment and surgical topics. GPT-4o provided reliable general information but lacked the necessary depth for specialized topics, while Gemini was suitable for addressing basic patient inquiries but insufficient for detailed medical information. 
This study highlights the role of artificial intelligence in patient education, suggesting that models like GPT-3.5 can be valuable tools for clinicians in enhancing patient understanding of TED.</p>","PeriodicalId":19588,"journal":{"name":"Ophthalmic Plastic and Reconstructive Surgery","volume":" ","pages":""},"PeriodicalIF":1.2000,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Thyroid Eye Disease and Artificial Intelligence: A Comparative Study of ChatGPT-3.5, ChatGPT-4o, and Gemini in Patient Information Delivery.\",\"authors\":\"Daniel Bahir, Morris Hartstein, Ofira Zloto, Cat Burkat, Jimmy Uddin, Shirin Hamed Azzam\",\"doi\":\"10.1097/IOP.0000000000002882\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>This study aimed to compare the effectiveness of 3 artificial intelligence language models-GPT-3.5, GPT-4o, and Gemini, in delivering patient-centered information about thyroid eye disease (TED). We evaluated their performance based on the accuracy and comprehensiveness of their responses to common patient inquiries regarding TED. The study did not assess the repeatability of artificial intelligence responses, focusing on single-session evaluations per model.</p><p><strong>Methods: </strong>Five experienced oculoplastic surgeons assessed the responses generated by the artificial intelligence models to 12 key questions frequently asked by TED patients. These questions addressed TED pathophysiology, risk factors, clinical presentation, diagnostic testing, and treatment options. Each response was rated for correctness and reliability on a 7-point Likert scale, where 1 indicated incorrect or unreliable information and 7 indicated highly accurate and reliable information. Correctness referred to factual accuracy, while reliability assessed trustworthiness for patient use. The evaluations were anonymized, and the final scores were averaged across the surgeons to facilitate model comparisons.</p><p><strong>Results: </strong>GPT-3.5 emerged as the top performer, achieving an average correctness score of 5.75 and a reliability score of 5.68, excelling in delivering detailed information on complex topics such as TED treatment and surgical interventions. GPT-4o followed with scores of 5.32 for correctness and 5.25 for reliability, generally providing accurate but less detailed information. Gemini trailed with scores of 5.10 for correctness and 4.70 for reliability, often providing sufficient responses for simpler questions but lacking detail in complex areas like second-line immunosuppressive treatments. Statistical analysis using the Friedman test showed significant differences between models (p < 0.05) for key topics, with GPT-3.5 consistently leading.</p><p><strong>Conclusions: </strong>GPT-3.5 was the most effective model for delivering reliable and comprehensive patient information, particularly for complex treatment and surgical topics. GPT-4o provided reliable general information but lacked the necessary depth for specialized topics, while Gemini was suitable for addressing basic patient inquiries but insufficient for detailed medical information. 
This study highlights the role of artificial intelligence in patient education, suggesting that models like GPT-3.5 can be valuable tools for clinicians in enhancing patient understanding of TED.</p>\",\"PeriodicalId\":19588,\"journal\":{\"name\":\"Ophthalmic Plastic and Reconstructive Surgery\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.2000,\"publicationDate\":\"2024-12-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Ophthalmic Plastic and Reconstructive Surgery\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1097/IOP.0000000000002882\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"OPHTHALMOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ophthalmic Plastic and Reconstructive Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/IOP.0000000000002882","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Thyroid Eye Disease and Artificial Intelligence: A Comparative Study of ChatGPT-3.5, ChatGPT-4o, and Gemini in Patient Information Delivery.
Purpose: This study aimed to compare the effectiveness of 3 artificial intelligence language models (GPT-3.5, GPT-4o, and Gemini) in delivering patient-centered information about thyroid eye disease (TED). We evaluated their performance based on the accuracy and comprehensiveness of their responses to common patient inquiries regarding TED. The study did not assess the repeatability of artificial intelligence responses, focusing instead on a single evaluation session per model.
Methods: Five experienced oculoplastic surgeons assessed the responses generated by the artificial intelligence models to 12 key questions frequently asked by TED patients. These questions addressed TED pathophysiology, risk factors, clinical presentation, diagnostic testing, and treatment options. Each response was rated for correctness and reliability on a 7-point Likert scale, where 1 indicated incorrect or unreliable information and 7 indicated highly accurate and reliable information. Correctness referred to factual accuracy, while reliability assessed trustworthiness for patient use. The evaluations were anonymized, and the final scores were averaged across the surgeons to facilitate model comparisons.
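To make the rating-and-averaging scheme concrete, here is a minimal Python sketch of how 5 raters' scores on 12 questions might be aggregated per model. The data layout (a 5 x 12 matrix per model) and all values are assumptions for illustration; they do not reproduce the study's ratings.

    import numpy as np

    # Hypothetical ratings: 5 raters x 12 questions on a 1-7 Likert scale.
    # Values are randomly generated placeholders, not the study's data.
    rng = np.random.default_rng(seed=0)
    models = ["GPT-3.5", "GPT-4o", "Gemini"]
    ratings = {m: rng.integers(low=1, high=8, size=(5, 12)) for m in models}

    for model, scores in ratings.items():
        # Average across the 5 raters for each question, then across the
        # 12 questions, mirroring "averaged across the surgeons" above.
        per_question_mean = scores.mean(axis=0)  # shape: (12,)
        overall_mean = per_question_mean.mean()
        print(f"{model}: overall mean rating = {overall_mean:.2f}")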
Results: GPT-3.5 emerged as the top performer, achieving an average correctness score of 5.75 and a reliability score of 5.68, excelling in delivering detailed information on complex topics such as TED treatment and surgical interventions. GPT-4o followed with scores of 5.32 for correctness and 5.25 for reliability, generally providing accurate but less detailed information. Gemini trailed with scores of 5.10 for correctness and 4.70 for reliability, often providing sufficient responses to simpler questions but lacking detail in complex areas such as second-line immunosuppressive treatments. Statistical analysis using the Friedman test showed significant differences among the models (p < 0.05) for key topics, with GPT-3.5 consistently leading.
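For readers unfamiliar with the statistic reported above, the following Python sketch shows how a Friedman test comparing 3 models rated on the same 12 questions could be run with SciPy. The per-question mean scores below are invented for illustration only and do not reproduce the study's results.

    from scipy.stats import friedmanchisquare

    # Illustrative per-question mean ratings (12 questions) per model;
    # these values are made up, not taken from the study.
    gpt35  = [6.0, 5.8, 5.6, 5.9, 5.7, 5.8, 5.5, 5.9, 5.7, 5.8, 5.6, 5.7]
    gpt4o  = [5.4, 5.2, 5.3, 5.5, 5.1, 5.4, 5.2, 5.3, 5.4, 5.2, 5.3, 5.5]
    gemini = [5.2, 5.0, 4.9, 5.3, 5.1, 5.0, 4.8, 5.2, 5.1, 5.0, 4.9, 5.1]

    # Each question acts as a repeated-measures "block" rated under all
    # three model conditions; the test makes no normality assumption,
    # which suits ordinal Likert-scale data.
    stat, p_value = friedmanchisquare(gpt35, gpt4o, gemini)
    print(f"Friedman chi-square = {stat:.2f}, p = {p_value:.4f}")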
Conclusions: GPT-3.5 was the most effective model for delivering reliable and comprehensive patient information, particularly for complex treatment and surgical topics. GPT-4o provided reliable general information but lacked the necessary depth for specialized topics, while Gemini was suitable for addressing basic patient inquiries but insufficient for detailed medical information. This study highlights the role of artificial intelligence in patient education, suggesting that models like GPT-3.5 can be valuable tools for clinicians in enhancing patient understanding of TED.
About the Journal:
Ophthalmic Plastic and Reconstructive Surgery features original articles and reviews on topics such as ptosis, eyelid reconstruction, orbital diagnosis and surgery, lacrimal problems, and eyelid malposition. Update reports on diagnostic techniques, surgical equipment and instrumentation, and medical therapies are included, as well as detailed analyses of recent research findings and their clinical applications.