Franziska C S Altorfer, Michael J Kelly, Fedan Avrumova, Varun Rohatgi, Jiaqi Zhu, Christopher M Bono, Darren R Lebl
The Double-Edged Sword of Generative AI: Surpassing an Expert or a Deceptive "False Friend"?
Spine Journal, 2025. DOI: 10.1016/j.spinee.2025.02.010
Abstract
Background context: Generative artificial intelligence (AI), ChatGPT being the most popular example, has been extensively assessed for its capability to respond to medical questions, such as queries about spine treatment approaches or technological advances. However, its output often lacks scientific foundation or fabricates inauthentic references, a phenomenon known as AI hallucination.
Purpose: To assess the scientific basis of generative AI tools by evaluating the authenticity of the references they cite and the alignment of their responses with evidence-based guidelines.
Study design: Comparative study.
Methods: Thirty-three previously published North American Spine Society (NASS) guideline questions were posed as prompts to two freely available generative AI tools (Tools I and II). The responses were scored for correctness against the published NASS guideline responses using a five-point "alignment score." Furthermore, all cited references were evaluated for authenticity, source type, year of publication, and inclusion in the scientific guidelines.
Results: Both tools' responses to guideline questions achieved an overall alignment score of 3.5±1.1, which is considered acceptably equivalent to the guideline. Together, the tools generated 254 references to support their responses, of which 76.0% (n = 193) were authentic and 24.0% (n = 61) were fabricated. The authentic references comprised peer-reviewed scientific research papers (147, 76.2%), guidelines (16, 8.3%), educational websites (9, 4.7%), books (9, 4.7%), a government website (1, 0.5%), insurance websites (6, 3.1%), and newspaper websites (5, 2.6%). Claude referenced significantly more authentic peer-reviewed scientific papers (Claude: n = 111, 91.0%; Gemini: n = 36, 50.7%; p < 0.001). The year of publication across all references ranged from 1988 to 2023, with significantly older references provided by Claude (Claude: 2008±6; Gemini: 2014±6; p < 0.001). Lastly, significantly more of the references provided by Claude were also cited in the published NASS guidelines (Claude: n = 27, 24.3%; Gemini: n = 1, 2.8%; p = 0.04).
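As an illustration, the reported difference in peer-reviewed-paper proportions (111 of 122 authentic Claude references vs. 36 of 71 for Gemini) can be re-checked with a Pearson chi-square test of independence on the 2x2 table. This is only a sketch using counts from the abstract; the study's actual statistical methods and software are not stated there.

```python
# Illustrative re-check of the peer-reviewed-paper proportions reported in
# the abstract. Counts are taken from the Results; this is a sketch, not the
# study's own analysis.

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Claude: 111 of 122 authentic references were peer-reviewed papers (91.0%);
# Gemini: 36 of 71 (50.7%).
chi2 = chi_square_2x2(111, 122 - 111, 36, 71 - 36)
print(round(chi2, 1))
# The statistic is well above the df = 1, alpha = 0.001 critical value of
# 10.83, consistent with the reported p < 0.001.
```

This uncorrected Pearson statistic is adequate here because all expected cell counts are well above 5; with smaller cells, a Yates-corrected or exact test would be preferable.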
Conclusions: Both generative AI tools provided responses with acceptable alignment to NASS evidence-based guideline recommendations and offered supporting references, though nearly a quarter of those references were inauthentic or drawn from non-scientific sources. This deficiency of legitimate scientific references does not meet standards for clinical implementation. Considering this limitation, caution should be exercised when applying the output of generative AI tools in clinical settings.
About the journal:
The Spine Journal, the official journal of the North American Spine Society, is an international and multidisciplinary journal that publishes original, peer-reviewed articles on research and treatment related to the spine and spine care, including basic science and clinical investigations. It is a condition of publication that manuscripts submitted to The Spine Journal have not been published, and will not be simultaneously submitted or published elsewhere. The Spine Journal also publishes major reviews of specific topics by acknowledged authorities, technical notes, teaching editorials, and other special features; Letters to the Editor-in-Chief are encouraged.