N. Jenko, S. Ariyaratne, L. Jeys, S. Evans, K.P. Iyengar, R. Botchu
Title: An evaluation of AI generated literature reviews in musculoskeletal radiology
Journal: Surgeon-Journal of the Royal Colleges of Surgeons of Edinburgh and Ireland, Volume 22, Issue 3, Pages 194-197
Publication date: 2024-01-12
Publication type: Journal Article
DOI: 10.1016/j.surge.2023.12.005
URL: https://www.sciencedirect.com/science/article/pii/S1479666X24000088
Journal metrics: impact factor 2.3; JCR Q2 (Surgery)
Citation count: 0
Abstract
Purpose
The use of artificial intelligence (AI) tools to aid in summarizing information in medicine and research has recently garnered considerable interest. While tools such as ChatGPT produce convincing and natural-sounding output, the answers are sometimes incorrect. It is hoped that some of these drawbacks can be avoided by using programmes trained for a more specific scope. In this study we compared the performance of a new AI tool (the-literature.com) to the latest version of OpenAI's ChatGPT (GPT-4) in summarizing topics that the authors have significantly contributed to.
Methods
The AI tools were asked to produce a literature review on 7 topics. These were selected from research topics that the authors were intimately familiar with and have contributed to through their own publications. The output produced by the AI tools was graded on a 1–5 Likert scale for accuracy, comprehensiveness, and relevance by two fellowship-trained consultant radiologists.
Results
The-literature.com produced 3 excellent summaries, 3 very poor summaries not relevant to the prompt, and one summary that was relevant but did not include all relevant papers. All of the summaries produced by GPT-4 were relevant, but fewer relevant papers were identified. The average Likert rating was 2.88 for the-literature.com and 3.86 for GPT-4. There was good agreement between the ratings of both radiologists (ICC = 0.883).
Conclusion
Summaries produced by AI in its current state require careful human validation. On average, GPT-4 provides higher-quality summaries. Neither tool can reliably identify all relevant publications.
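The abstract reports a mean Likert rating per tool and an intraclass correlation coefficient (ICC) for agreement between the two raters, but it does not state which ICC form was used or give the individual scores. The sketch below is therefore only an illustration of how such figures can be computed: it assumes the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), and the scores in `hypothetical_scores` are invented for the example and are not the study's data.

```python
from statistics import mean


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` holds one row per rated item; each row has one score per rater.
    This form is an assumption; the abstract does not say which ICC was used.
    """
    n = len(ratings)       # number of rated items (e.g. summaries)
    k = len(ratings[0])    # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need at least 2 items and 2 raters, no missing scores")

    grand = mean(x for row in ratings for x in row)
    row_means = [mean(row) for row in ratings]
    col_means = [mean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares: items (rows), raters (columns), residual.
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    denom = ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n
    if denom == 0:
        raise ValueError("ICC is undefined when all scores are identical")
    return (ms_rows - ms_err) / denom


# Hypothetical 1-5 Likert scores: (rater A, rater B) for 7 summaries.
hypothetical_scores = [(5, 5), (1, 2), (4, 4), (2, 1), (5, 4), (1, 1), (3, 3)]

print("mean rating, rater A:", mean(a for a, _ in hypothetical_scores))
print("mean rating, rater B:", mean(b for _, b in hypothetical_scores))
print("ICC(2,1):", icc_2_1(hypothetical_scores))
```

Other ICC forms (for example consistency rather than absolute agreement, or average-rater rather than single-rater) give different values from the same scores, so a reported ICC is only comparable when the form is known.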
About the journal:
Since its launch in 2003, The Surgeon has established itself as one of the leading multidisciplinary surgical titles, both in print and online. The Surgeon is published for the worldwide surgical and dental communities. The goal of the Journal is to achieve wider national and international recognition through a commitment to excellence in original research. In addition, both Colleges see the Journal as an important educational service, and consequently there is a particular focus on post-graduate development. Much of our educational role will continue to be achieved through publishing expanded review articles by leaders in their field.
Articles in areas related to surgery and dentistry, such as healthcare management and education, are also welcomed. We aim to educate, entertain, give insight into new surgical techniques and technology, and provide a forum for debate and discussion.