The double-edged sword of generative AI: surpassing an expert or a deceptive “false friend”?

IF 4.7, Tier 1 (Medicine), Q1 Clinical Neurology. Spine Journal. Pub Date: 2025-08-01. Epub Date: 2025-03-04. DOI: 10.1016/j.spinee.2025.02.010
Franziska C.S. Altorfer MD, Michael J. Kelly MD, Fedan Avrumova MS, Varun Rohatgi BS, Jiaqi Zhu MS, Christopher M. Bono MD, Darren R. Lebl MD

Abstract

BACKGROUND CONTEXT

Generative artificial intelligence (AI), of which ChatGPT is the most popular example, has been extensively assessed for its ability to answer medical questions, such as queries about spine treatment approaches or technological advances. However, its responses often lack a scientific foundation or cite fabricated, inauthentic references, a phenomenon known as AI hallucination.

PURPOSE

To assess the scientific basis of generative AI tools by studying the authenticity and reliability of the references they cite and the alignment of their responses with evidence-based guidelines.

STUDY DESIGN

Comparative study.

METHODS

Thirty-three previously published North American Spine Society (NASS) guideline questions were posed as prompts to 2 freely available generative AI tools (Tools I and II). The responses were scored for correctness against the published NASS guideline recommendations using a 5-point "alignment score." In addition, every cited reference was evaluated for authenticity, source type, year of publication, and inclusion in the published guidelines.
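As a concrete illustration of the evaluation scheme, the sketch below shows one way the per-reference and per-response data could be recorded. The Python layout, class names, and fields are assumptions for illustration only and are not drawn from the study.

```python
# Minimal sketch of how the per-reference and per-response evaluation described
# above could be recorded. The field names and layout are illustrative
# assumptions; they are not taken from the study.
from dataclasses import dataclass
from typing import Optional

@dataclass
class CitedReference:
    tool: str                 # "Tool I" or "Tool II"
    question_id: int          # which of the 33 NASS guideline questions was prompted
    authentic: bool           # could the reference be located and verified?
    source_type: str          # e.g., "peer-reviewed paper", "guideline", "website", "book"
    year: Optional[int]       # publication year, if verifiable
    in_nass_guideline: bool   # also cited in the published NASS guideline?

@dataclass
class ScoredResponse:
    tool: str
    question_id: int
    alignment_score: int      # 1-5, correctness relative to the published NASS answer
```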

RESULTS

Across both tools, responses to the guideline questions achieved an overall score of 3.5±1.1, considered acceptably equivalent to the guidelines. The two tools generated 254 references to support their responses, of which 76.0% (n=193) were authentic and 24.0% (n=61) were fabricated. The authentic references comprised peer-reviewed scientific research papers (147, 76.2%), guidelines (16, 8.3%), educational websites (9, 4.7%), books (9, 4.7%), a government website (1, 0.5%), insurance websites (6, 3.1%), and newspaper websites (5, 2.6%). Claude cited significantly more authentic peer-reviewed scientific papers (Claude: n=111, 91.0%; Gemini: n=36, 50.7%; p<.001). Publication years across all references ranged from 1988 to 2023, with Claude providing significantly older references (Claude: 2008±6; Gemini: 2014±6; p<.001). Finally, significantly more of Claude's references were also cited in the published NASS guidelines (Claude: n=27, 24.3%; Gemini: n=1, 2.8%; p=.04).
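As a rough sanity check on the reported comparison of peer-reviewed citation rates, the sketch below applies a standard chi-square test of proportions. The abstract does not specify which test was actually used, and the per-tool totals of authentic references (122 for Claude, 71 for Gemini) are inferred from the reported counts and percentages rather than stated directly.

```python
# Rough reproducibility check of the reported difference in peer-reviewed
# citation rates. A chi-square test on a 2x2 contingency table is assumed here;
# the abstract does not name the test used. Per-tool totals of authentic
# references are inferred from the reported figures (111/122 = 91.0%, 36/71 = 50.7%).
from scipy.stats import chi2_contingency

table = [
    [111, 122 - 111],  # Claude: peer-reviewed papers vs. other authentic sources
    [36, 71 - 36],     # Gemini: peer-reviewed papers vs. other authentic sources
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.1f}, p = {p:.2g}")  # p falls well below .001, consistent with the abstract
```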

CONCLUSIONS

Both generative AI tools provided responses with acceptable alignment to NASS evidence-based guideline recommendations and offered supporting references, though nearly a quarter of those references were inauthentic or drawn from nonscientific sources. This shortage of legitimate scientific references does not meet the standards required for clinical implementation. Given this limitation, caution should be exercised when applying the output of generative AI tools in clinical practice.