The double-edged sword of generative AI: surpassing an expert or a deceptive “false friend”?

Franziska C.S. Altorfer MD, Michael J. Kelly MD, Fedan Avrumova MS, Varun Rohatgi BS, Jiaqi Zhu MS, Christopher M. Bono MD, Darren R. Lebl MD

The Spine Journal, Volume 25, Issue 8, Pages 1635–1643. Published 2025-08-01 (Epub 2025-03-04). DOI: 10.1016/j.spinee.2025.02.010
Citations: 0
Abstract
BACKGROUND CONTEXT
Generative artificial intelligence (AI), of which ChatGPT is the most popular example, has been extensively assessed for its capability to respond to medical questions, such as queries about spine treatment approaches or technological advances. However, its output often lacks scientific foundation, and it may fabricate inauthentic references, a phenomenon known as AI hallucination.
PURPOSE
To develop an understanding of the scientific basis of generative AI tools by assessing the authenticity of their cited references and the alignment of their responses with evidence-based guidelines.
STUDY DESIGN
Comparative study.
METHODS
Thirty-three previously published North American Spine Society (NASS) guideline questions were posed as prompts to 2 freely available generative AI tools (Tools I and II). The responses were scored for correctness compared with the published NASS guideline responses using a 5-point “alignment score.” Furthermore, all cited references were evaluated for authenticity, source type, year of publication, and inclusion in the scientific guidelines.
RESULTS
Both tools’ responses to guideline questions achieved an overall score of 3.5±1.1, which is considered acceptably equivalent to the guidelines. Together, the 2 tools generated 254 references to support their responses, of which 76.0% (n=193) were authentic and 24.0% (n=61) were fabricated. The authentic references comprised peer-reviewed scientific research papers (147, 76.2%), guidelines (16, 8.3%), educational websites (9, 4.7%), books (9, 4.7%), a government website (1, 0.5%), insurance websites (6, 3.1%), and newspaper websites (5, 2.6%). Claude referenced significantly more authentic peer-reviewed scientific papers (Claude: n=111, 91.0%; Gemini: n=36, 50.7%; p<.001). Publication years among all references ranged from 1988 to 2023, with Claude providing significantly older references (Claude: 2008±6; Gemini: 2014±6; p<.001). Lastly, significantly more of the references provided by Claude were also cited in the published NASS guidelines (Claude: n=27, 24.3%; Gemini: n=1, 2.8%; p=.04).
CONCLUSIONS
Both generative AI tools provided responses with acceptable alignment to NASS evidence-based guideline recommendations and offered supporting references, though nearly a quarter of those references were fabricated or drawn from nonscientific sources. This deficiency of legitimate scientific references does not meet the standards required for clinical implementation. Given this limitation, caution should be exercised when applying the output of generative AI tools in clinical practice.
Journal Description:
The Spine Journal, the official journal of the North American Spine Society, is an international and multidisciplinary journal that publishes original, peer-reviewed articles on research and treatment related to the spine and spine care, including basic science and clinical investigations. It is a condition of publication that manuscripts submitted to The Spine Journal have not been published and will not be simultaneously submitted or published elsewhere. The Spine Journal also publishes major reviews of specific topics by acknowledged authorities, technical notes, teaching editorials, and other special features. Letters to the Editor-in-Chief are encouraged.