Camellia Edalat, Nila Kirupaharan, Lauren A Dalvin, Kapil Mishra, Rayna Marshall, Hannah Xu, Jasmine H Francis, Meghan Berkenstock
Retina-The Journal of Retinal and Vitreous Diseases. Journal Article. Published 2024-09-18. DOI: 10.1097/IAE.0000000000004271 (https://doi.org/10.1097/IAE.0000000000004271)
Evaluating Large Language Models on their Accuracy and Completeness: Immune Checkpoint Inhibitors and their Ocular Toxicities.
Purpose: To analyze the accuracy and thoroughness of three large language models (LLMs) in producing information for providers about immune checkpoint inhibitor (ICI) ocular toxicities.
Methods: Eight questions were created about the general definition of checkpoint inhibitors, their mechanism of action, ocular toxicities, and toxicity management. All were entered into ChatGPT 4.0, Bard, and LLaMA. Using a 6-point Likert scale for accuracy and completeness, four ophthalmologists who routinely treat ocular toxicities of immunotherapy agents rated the LLMs' answers. ANOVA was used to test for significant differences among the three LLMs, followed by post-hoc pairwise t-tests. Fleiss kappa values were calculated to quantify interrater variability.
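The one-way ANOVA named in the Methods can be sketched as follows. This is a minimal pure-Python illustration and not the authors' analysis code; the function name `one_way_anova_f` and the example scores are hypothetical and are not data from the study.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA.

    `groups` is a list of lists, one list of numeric ratings per group
    (here, one per LLM). Hypothetical helper, not from the paper.
    """
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    if k < 2 or n_total <= k:
        raise ValueError("need at least 2 groups and more ratings than groups")

    grand_mean = sum(sum(g) for g in groups) / n_total
    means = [sum(g) / len(g) for g in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of individual ratings around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = k - 1
    df_within = n_total - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Made-up Likert scores for three models (illustration only).
scores_a = [5, 4, 5, 4, 5, 4]
scores_b = [5, 5, 4, 4, 5, 5]
scores_c = [4, 4, 5, 4, 4, 5]
f_stat, df_b, df_w = one_way_anova_f([scores_a, scores_b, scores_c])
print(f"F({df_b}, {df_w}) = {f_stat:.3f}")
```

Turning the F statistic into a p-value requires the F distribution, which the Python standard library does not provide; a statistics package such as SciPy (`scipy.stats.f_oneway`) would normally be used for that step.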
Results: ChatGPT responses received mean ratings of 4.59 for accuracy and 4.09 for completeness; Bard responses, 4.59 and 4.19; LLaMA responses, 4.38 and 4.03. The three LLMs did not differ significantly in accuracy (p=0.47) or completeness (p=0.86). Fleiss kappa values indicated poor interrater agreement for both accuracy (-0.03) and completeness (0.01).
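For readers unfamiliar with the statistic, Fleiss kappa compares the observed agreement among a fixed number of raters with the agreement expected by chance. The sketch below is a minimal pure-Python version of the standard formula; the function name `fleiss_kappa` and the ratings are hypothetical and are not the study's data.

```python
import math
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss kappa for `ratings`, a list of items, each a list holding
    one categorical rating per rater (same number of raters per item).
    Hypothetical helper, not from the paper.
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(item) != n_raters for item in ratings):
        raise ValueError("every item needs the same number (>= 2) of ratings")

    counts = [Counter(item) for item in ratings]

    # Observed agreement: share of rater pairs that agree, averaged over items.
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement from the overall share of ratings in each category.
    p_cat = [sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories]
    p_e = sum(p * p for p in p_cat)

    if p_e == 1.0:
        return math.nan  # every rating is in one category; kappa is undefined
    return (p_bar - p_e) / (1.0 - p_e)


likert = range(1, 7)  # 6-point scale

# Four raters, two items, raters split evenly between scores 4 and 5.
split = [[4, 5, 5, 4], [5, 4, 4, 5]]
print(fleiss_kappa(split, likert))

# Four raters in full agreement on each item.
unanimous = [[4, 4, 4, 4], [5, 5, 5, 5]]
print(fleiss_kappa(unanimous, likert))
```

In the first made-up example every rating is a 4 or a 5, yet kappa is negative, because the raters agree no more often than chance would predict given how the scores are distributed. This is a general property of the statistic: a low kappa can coexist with uniformly high ratings.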
Conclusions: All three LLMs provided highly accurate and complete responses to questions about ICI ocular toxicities and their management. Further studies are needed to assess specific ICI agents and the accuracy and completeness of updated versions of LLMs.
About the journal:
RETINA® focuses exclusively on the growing specialty of vitreoretinal disorders. The Journal provides current information on diagnostic and therapeutic techniques. Its highly specialized and informative, peer-reviewed articles are easily applicable to clinical practice.
In addition to regular reports from clinical and basic science investigators, RETINA® publishes special features including periodic review articles on pertinent topics, special articles dealing with surgical and other therapeutic techniques, and abstract cards. Issues are abundantly illustrated in vivid full color.
Published 12 times per year, RETINA® is truly a “must have” publication for anyone connected to this field.