Daniel Bahir, Morris Hartstein, Ofira Zloto, Cat Burkat, Jimmy Uddin, Shirin Hamed Azzam
{"title":"甲状腺眼病与人工智能:ChatGPT-3.5、chatgpt - 40和Gemini在患者信息传递中的比较研究","authors":"Daniel Bahir, Morris Hartstein, Ofira Zloto, Cat Burkat, Jimmy Uddin, Shirin Hamed Azzam","doi":"10.1097/IOP.0000000000002882","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>This study aimed to compare the effectiveness of 3 artificial intelligence language models-GPT-3.5, GPT-4o, and Gemini, in delivering patient-centered information about thyroid eye disease (TED). We evaluated their performance based on the accuracy and comprehensiveness of their responses to common patient inquiries regarding TED. The study did not assess the repeatability of artificial intelligence responses, focusing on single-session evaluations per model.</p><p><strong>Methods: </strong>Five experienced oculoplastic surgeons assessed the responses generated by the artificial intelligence models to 12 key questions frequently asked by TED patients. These questions addressed TED pathophysiology, risk factors, clinical presentation, diagnostic testing, and treatment options. Each response was rated for correctness and reliability on a 7-point Likert scale, where 1 indicated incorrect or unreliable information and 7 indicated highly accurate and reliable information. Correctness referred to factual accuracy, while reliability assessed trustworthiness for patient use. The evaluations were anonymized, and the final scores were averaged across the surgeons to facilitate model comparisons.</p><p><strong>Results: </strong>GPT-3.5 emerged as the top performer, achieving an average correctness score of 5.75 and a reliability score of 5.68, excelling in delivering detailed information on complex topics such as TED treatment and surgical interventions. GPT-4o followed with scores of 5.32 for correctness and 5.25 for reliability, generally providing accurate but less detailed information. Gemini trailed with scores of 5.10 for correctness and 4.70 for reliability, often providing sufficient responses for simpler questions but lacking detail in complex areas like second-line immunosuppressive treatments. Statistical analysis using the Friedman test showed significant differences between models (p < 0.05) for key topics, with GPT-3.5 consistently leading.</p><p><strong>Conclusions: </strong>GPT-3.5 was the most effective model for delivering reliable and comprehensive patient information, particularly for complex treatment and surgical topics. GPT-4o provided reliable general information but lacked the necessary depth for specialized topics, while Gemini was suitable for addressing basic patient inquiries but insufficient for detailed medical information. 
This study highlights the role of artificial intelligence in patient education, suggesting that models like GPT-3.5 can be valuable tools for clinicians in enhancing patient understanding of TED.</p>","PeriodicalId":19588,"journal":{"name":"Ophthalmic Plastic and Reconstructive Surgery","volume":" ","pages":""},"PeriodicalIF":1.2000,"publicationDate":"2024-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Thyroid Eye Disease and Artificial Intelligence: A Comparative Study of ChatGPT-3.5, ChatGPT-4o, and Gemini in Patient Information Delivery.\",\"authors\":\"Daniel Bahir, Morris Hartstein, Ofira Zloto, Cat Burkat, Jimmy Uddin, Shirin Hamed Azzam\",\"doi\":\"10.1097/IOP.0000000000002882\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>This study aimed to compare the effectiveness of 3 artificial intelligence language models-GPT-3.5, GPT-4o, and Gemini, in delivering patient-centered information about thyroid eye disease (TED). We evaluated their performance based on the accuracy and comprehensiveness of their responses to common patient inquiries regarding TED. The study did not assess the repeatability of artificial intelligence responses, focusing on single-session evaluations per model.</p><p><strong>Methods: </strong>Five experienced oculoplastic surgeons assessed the responses generated by the artificial intelligence models to 12 key questions frequently asked by TED patients. These questions addressed TED pathophysiology, risk factors, clinical presentation, diagnostic testing, and treatment options. Each response was rated for correctness and reliability on a 7-point Likert scale, where 1 indicated incorrect or unreliable information and 7 indicated highly accurate and reliable information. Correctness referred to factual accuracy, while reliability assessed trustworthiness for patient use. The evaluations were anonymized, and the final scores were averaged across the surgeons to facilitate model comparisons.</p><p><strong>Results: </strong>GPT-3.5 emerged as the top performer, achieving an average correctness score of 5.75 and a reliability score of 5.68, excelling in delivering detailed information on complex topics such as TED treatment and surgical interventions. GPT-4o followed with scores of 5.32 for correctness and 5.25 for reliability, generally providing accurate but less detailed information. Gemini trailed with scores of 5.10 for correctness and 4.70 for reliability, often providing sufficient responses for simpler questions but lacking detail in complex areas like second-line immunosuppressive treatments. Statistical analysis using the Friedman test showed significant differences between models (p < 0.05) for key topics, with GPT-3.5 consistently leading.</p><p><strong>Conclusions: </strong>GPT-3.5 was the most effective model for delivering reliable and comprehensive patient information, particularly for complex treatment and surgical topics. GPT-4o provided reliable general information but lacked the necessary depth for specialized topics, while Gemini was suitable for addressing basic patient inquiries but insufficient for detailed medical information. 
This study highlights the role of artificial intelligence in patient education, suggesting that models like GPT-3.5 can be valuable tools for clinicians in enhancing patient understanding of TED.</p>\",\"PeriodicalId\":19588,\"journal\":{\"name\":\"Ophthalmic Plastic and Reconstructive Surgery\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":1.2000,\"publicationDate\":\"2024-12-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Ophthalmic Plastic and Reconstructive Surgery\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1097/IOP.0000000000002882\",\"RegionNum\":4,\"RegionCategory\":\"医学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"OPHTHALMOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Ophthalmic Plastic and Reconstructive Surgery","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/IOP.0000000000002882","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"OPHTHALMOLOGY","Score":null,"Total":0}
Thyroid Eye Disease and Artificial Intelligence: A Comparative Study of ChatGPT-3.5, ChatGPT-4o, and Gemini in Patient Information Delivery.
Purpose: This study aimed to compare the effectiveness of 3 artificial intelligence language models (GPT-3.5, GPT-4o, and Gemini) in delivering patient-centered information about thyroid eye disease (TED). We evaluated their performance based on the accuracy and comprehensiveness of their responses to common patient inquiries regarding TED. The study did not assess the repeatability of artificial intelligence responses, focusing instead on a single evaluation session per model.
Methods: Five experienced oculoplastic surgeons assessed the responses generated by the artificial intelligence models to 12 key questions frequently asked by TED patients. These questions addressed TED pathophysiology, risk factors, clinical presentation, diagnostic testing, and treatment options. Each response was rated for correctness and reliability on a 7-point Likert scale, where 1 indicated incorrect or unreliable information and 7 indicated highly accurate and reliable information. Correctness referred to factual accuracy, while reliability assessed trustworthiness for patient use. The evaluations were anonymized, and the final scores were averaged across the surgeons to facilitate model comparisons.
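To make the rating-and-averaging scheme concrete, here is a minimal Python sketch of how 5 raters' scores on 12 questions might be aggregated per model. The data layout (a 5 x 12 matrix per model) and all values are assumptions for illustration; they do not reproduce the study's ratings.

    import numpy as np

    # Hypothetical ratings: 5 raters x 12 questions on a 1-7 Likert scale.
    # Values are randomly generated placeholders, not the study's data.
    rng = np.random.default_rng(seed=0)
    models = ["GPT-3.5", "GPT-4o", "Gemini"]
    ratings = {m: rng.integers(low=1, high=8, size=(5, 12)) for m in models}

    for model, scores in ratings.items():
        # Average across the 5 raters for each question, then across the
        # 12 questions, mirroring "averaged across the surgeons" above.
        per_question_mean = scores.mean(axis=0)  # shape: (12,)
        overall_mean = per_question_mean.mean()
        print(f"{model}: overall mean rating = {overall_mean:.2f}")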
Results: GPT-3.5 emerged as the top performer, achieving an average correctness score of 5.75 and a reliability score of 5.68, excelling in delivering detailed information on complex topics such as TED treatment and surgical interventions. GPT-4o followed with scores of 5.32 for correctness and 5.25 for reliability, generally providing accurate but less detailed information. Gemini trailed with scores of 5.10 for correctness and 4.70 for reliability, often providing sufficient responses to simpler questions but lacking detail in complex areas such as second-line immunosuppressive treatments. Statistical analysis using the Friedman test showed significant differences among the models (p < 0.05) for key topics, with GPT-3.5 consistently leading.
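For readers unfamiliar with the statistic reported above, the following Python sketch shows how a Friedman test comparing 3 models rated on the same 12 questions could be run with SciPy. The per-question mean scores below are invented for illustration only and do not reproduce the study's results.

    from scipy.stats import friedmanchisquare

    # Illustrative per-question mean ratings (12 questions) per model;
    # these values are made up, not taken from the study.
    gpt35  = [6.0, 5.8, 5.6, 5.9, 5.7, 5.8, 5.5, 5.9, 5.7, 5.8, 5.6, 5.7]
    gpt4o  = [5.4, 5.2, 5.3, 5.5, 5.1, 5.4, 5.2, 5.3, 5.4, 5.2, 5.3, 5.5]
    gemini = [5.2, 5.0, 4.9, 5.3, 5.1, 5.0, 4.8, 5.2, 5.1, 5.0, 4.9, 5.1]

    # Each question acts as a repeated-measures "block" rated under all
    # three model conditions; the test makes no normality assumption,
    # which suits ordinal Likert-scale data.
    stat, p_value = friedmanchisquare(gpt35, gpt4o, gemini)
    print(f"Friedman chi-square = {stat:.2f}, p = {p_value:.4f}")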
Conclusions: GPT-3.5 was the most effective model for delivering reliable and comprehensive patient information, particularly for complex treatment and surgical topics. GPT-4o provided reliable general information but lacked the necessary depth for specialized topics, while Gemini was suitable for addressing basic patient inquiries but insufficient for detailed medical information. This study highlights the role of artificial intelligence in patient education, suggesting that models like GPT-3.5 can be valuable tools for clinicians in enhancing patient understanding of TED.
About the Journal:
Ophthalmic Plastic and Reconstructive Surgery features original articles and reviews on topics such as ptosis, eyelid reconstruction, orbital diagnosis and surgery, lacrimal problems, and eyelid malposition. Update reports on diagnostic techniques, surgical equipment and instrumentation, and medical therapies are included, as well as detailed analyses of recent research findings and their clinical applications.