An evaluation of AI generated literature reviews in musculoskeletal radiology

N. Jenko, S. Ariyaratne, L. Jeys, S. Evans, K.P. Iyengar, R. Botchu
{"title":"An evaluation of AI generated literature reviews in musculoskeletal radiology","authors":"N. Jenko ,&nbsp;S. Ariyaratne ,&nbsp;L. Jeys ,&nbsp;S. Evans ,&nbsp;K.P. Iyengar ,&nbsp;R. Botchu","doi":"10.1016/j.surge.2023.12.005","DOIUrl":null,"url":null,"abstract":"<div><h3>Purpose</h3><p>The use of artificial intelligence (AI) tools to aid in summarizing information in medicine and research has recently garnered a huge amount of interest. While tools such as ChatGPT produce convincing and naturally sounding output, the answers are sometimes incorrect. Some of these drawbacks, it is hoped, can be avoided by using programmes trained for a more specific scope. In this study we compared the performance of a new AI tool (<span>the-literature.com</span><svg><path></path></svg>) to the latest version OpenAI's ChatGPT (GPT-4) in summarizing topics that the authors have significantly contributed to.</p></div><div><h3>Methods</h3><p>The AI tools were asked to produce a literature review on 7 topics. These were selected based on the research topics that the authors were intimately familiar with and have contributed to through their own publications. The output produced by the AI tools were graded on a 1–5 Likert scale for accuracy, comprehensiveness, and relevance by two fellowship trained consultant radiologists.</p></div><div><h3>Results</h3><p>The-literature.com produced 3 excellent summaries, 3 very poor summaries not relevant to the prompt, and one summary, which was relevant but did not include all relevant papers. All of the summaries produced by GPT-4 were relevant, but fewer relevant papers were identified. The average Likert rating was for the-literature was 2.88 and 3.86 for GPT-4. There was good agreement between the ratings of both radiologists (ICC = 0.883).</p></div><div><h3>Conclusion</h3><p>Summaries produced by AI in its current state require careful human validation. GPT-4 on average provides higher quality summaries. Neither tool can reliably identify all relevant publications.</p></div>","PeriodicalId":49463,"journal":{"name":"Surgeon-Journal of the Royal Colleges of Surgeons of Edinburgh and Ireland","volume":null,"pages":null},"PeriodicalIF":2.3000,"publicationDate":"2024-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Surgeon-Journal of the Royal Colleges of Surgeons of Edinburgh and Ireland","FirstCategoryId":"3","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1479666X24000088","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"SURGERY","Score":null,"Total":0}
引用次数: 0

Abstract

Purpose

The use of artificial intelligence (AI) tools to aid in summarizing information in medicine and research has recently garnered a huge amount of interest. While tools such as ChatGPT produce convincing and natural-sounding output, the answers are sometimes incorrect. It is hoped that some of these drawbacks can be avoided by using programmes trained for a more specific scope. In this study we compared the performance of a new AI tool (the-literature.com) with the latest version of OpenAI's ChatGPT (GPT-4) in summarizing topics to which the authors have significantly contributed.

Methods

The AI tools were asked to produce a literature review on 7 topics. These were selected as research topics that the authors were intimately familiar with and had contributed to through their own publications. The outputs produced by the AI tools were graded on a 1–5 Likert scale for accuracy, comprehensiveness, and relevance by two fellowship-trained consultant radiologists.

Results

The-literature.com produced 3 excellent summaries, 3 very poor summaries that were not relevant to the prompt, and 1 summary that was relevant but did not include all relevant papers. All of the summaries produced by GPT-4 were relevant, but fewer relevant papers were identified. The average Likert rating was 2.88 for the-literature.com and 3.86 for GPT-4. There was good agreement between the ratings of the two radiologists (ICC = 0.883).
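The inter-rater agreement reported above is an intraclass correlation coefficient (ICC). As a minimal sketch of how such a statistic is computed (the abstract does not state which ICC form was used, and all ratings below are invented for illustration, not the study data), two raters' Likert scores can be analysed in Python with the pingouin library:

import pandas as pd
import pingouin as pg

# Hypothetical 1-5 Likert ratings from two raters across 14 summaries
# (7 topics x 2 tools); the actual study data are not published in the abstract.
ratings = pd.DataFrame({
    "summary": list(range(14)) * 2,
    "rater": ["A"] * 14 + ["B"] * 14,
    "score": [5, 1, 4, 1, 5, 2, 3, 4, 4, 3, 4, 4, 4, 4,
              5, 1, 5, 1, 4, 2, 3, 4, 4, 4, 4, 3, 4, 4],
})

# pingouin reports all six ICC forms; ICC2 (two-way random effects,
# absolute agreement, single rater) is a common choice for this design.
icc = pg.intraclass_corr(data=ratings, targets="summary",
                         raters="rater", ratings="score")
print(icc.set_index("Type").loc["ICC2", ["ICC", "CI95%"]])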

Conclusion

Summaries produced by AI in its current state require careful human validation. On average, GPT-4 provides higher-quality summaries. Neither tool can reliably identify all relevant publications.
