An evaluation of AI generated literature reviews in musculoskeletal radiology

Surgeon-Journal of the Royal Colleges of Surgeons of Edinburgh and Ireland (IF 2.3; Zone 4, Medicine; JCR Q2, Surgery). Pub Date: 2024-01-12. DOI: 10.1016/j.surge.2023.12.005

N. Jenko, S. Ariyaratne, L. Jeys, S. Evans, K.P. Iyengar, R. Botchu

Volume 22, Issue 3, Pages 194-197. Journal Article. Source: https://www.sciencedirect.com/science/article/pii/S1479666X24000088

Citations: 0

Abstract

Purpose

The use of artificial intelligence (AI) tools to aid in summarizing information in medicine and research has recently attracted considerable interest. While tools such as ChatGPT produce convincing and natural-sounding output, the answers are sometimes incorrect. Some of these drawbacks, it is hoped, can be avoided by using programmes trained for a more specific scope. In this study we compared the performance of a new AI tool (the-literature.com) to the latest version of OpenAI's ChatGPT (GPT-4) in summarizing topics to which the authors have significantly contributed.

Methods

The AI tools were asked to produce a literature review on 7 topics. These were selected from research topics that the authors were intimately familiar with and had contributed to through their own publications. The output produced by the AI tools was graded on a 1–5 Likert scale for accuracy, comprehensiveness, and relevance by two fellowship-trained consultant radiologists.

Results

The-literature.com produced 3 excellent summaries, 3 very poor summaries not relevant to the prompt, and one summary that was relevant but did not include all relevant papers. All of the summaries produced by GPT-4 were relevant, but fewer relevant papers were identified. The average Likert rating was 2.88 for the-literature.com and 3.86 for GPT-4. There was good agreement between the ratings of both radiologists (ICC = 0.883).
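The two summary statistics reported above (a mean Likert rating per tool and an intraclass correlation coefficient between the two raters) can be illustrated with a minimal sketch. The abstract does not say which ICC form the authors used, so the function below assumes ICC(2,1) (two-way random effects, absolute agreement, single rater); the function name `icc2_1` and the ratings in the usage example are hypothetical and are not the study's data.

```python
from statistics import mean


def icc2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings: list of rows, one row per rated item, one score per rater.
    Assumption: this ICC form is chosen for illustration only; the
    abstract does not state which form the authors computed.
    """
    n = len(ratings)      # number of rated items (e.g. summaries)
    k = len(ratings[0])   # number of raters
    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares without replication
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_err = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)              # between-items mean square
    msc = ss_cols / (k - 1)              # between-raters mean square
    mse = ss_err / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


# Hypothetical 1-5 Likert scores: 7 summaries (rows) x 2 raters (columns).
example_ratings = [[5, 5], [1, 2], [4, 4], [1, 1], [5, 4], [2, 1], [3, 3]]
mean_rating = mean(x for row in example_ratings for x in row)
agreement = icc2_1(example_ratings)
print(f"mean Likert rating: {mean_rating:.2f}, ICC(2,1): {agreement:.3f}")
```

An ICC near 1 indicates that the raters scored the summaries almost identically, which is the sense in which the reported value of 0.883 is described as good agreement.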

Conclusion

Summaries produced by AI in its current state require careful human validation. On average, GPT-4 provides higher-quality summaries. Neither tool can reliably identify all relevant publications.




Source journal

Surgeon-Journal of the Royal Colleges of Surgeons of Edinburgh and Ireland (Medicine: Surgery)

CiteScore

4.40

Self-citation rate

0.00%

Articles published

158

Review time

6-12 weeks

Journal description: Since its establishment in 2003, The Surgeon has become one of the leading multidisciplinary surgical titles, both in print and online. The Surgeon is published for the worldwide surgical and dental communities. The goal of the Journal is to achieve wider national and international recognition, through a commitment to excellence in original research. In addition, both Colleges see the Journal as an important educational service, and consequently there is a particular focus on post-graduate development. Much of our educational role will continue to be achieved through publishing expanded review articles by leaders in their field. Articles in areas related to surgery and dentistry, such as healthcare management and education, are also welcomed. We aim to educate, entertain, give insight into new surgical techniques and technology, and provide a forum for debate and discussion.