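Custom Large Language Models Improve Accuracy: Comparing Retrieval Augmented Generation and Artificial Intelligence Agents to Non-Custom Models for Evidence-Based Medicine.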

Joshua J Woo, Andrew J Yang, Reena J Olsen, Sayyida S Hasan, Danyal H Nawabi, Benedict U Nwachukwu, Riley J Williams, Prem N Ramkumar
Journal: Arthroscopy-The Journal of Arthroscopic and Related Surgery (impact factor 4.4; JCR Q1, Orthopedics). Publication type: Journal Article. Publication date: 2024-11-07. DOI: 10.1016/j.arthro.2024.10.042 (https://doi.org/10.1016/j.arthro.2024.10.042).
Abstract

Purpose: To demonstrate the value of custom methods, namely Retrieval Augmented Generation (RAG)-based Large Language Models (LLMs) and Agentic Augmentation, over standard LLMs in delivering accurate information, using an anterior cruciate ligament (ACL) injury case.

Methods: A set of 100 questions and answers based on the 2022 AAOS ACL guidelines was curated. Closed-source models (OpenAI GPT-4/GPT-3.5 and Anthropic's Claude 3) and open-source models (Llama 3 8b/70b and Mistral 8x7b) were asked the questions in base form and again with the AAOS guidelines embedded into a RAG system. The top-performing models were further augmented with Artificial Intelligence (AI) Agents and re-evaluated. Two fellowship-trained surgeons, blinded to the cohort, evaluated the accuracy of the responses of each cohort. ROUGE and METEOR scores were calculated to assess semantic similarity in the responses.
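The two mechanical steps named in the Methods, retrieving guideline text to ground a prompt and scoring a response against a reference answer, can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not describe the retriever, embedding model, prompts, or scoring implementation, so the bag-of-words retriever, the prompt wording, the placeholder passages, and every function name below are assumptions made for illustration. The scoring function is a simplified ROUGE-1 F1 (unigram overlap, no stemming), not a full ROUGE or METEOR implementation.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=1):
    """Return the k passages most similar to the question (assumed retriever)."""
    q_vec = Counter(tokenize(question))
    ranked = sorted(
        passages,
        key=lambda p: cosine(q_vec, Counter(tokenize(p))),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, passages, k=1):
    """Prepend the retrieved context to the question (hypothetical wording)."""
    context = "\n".join(retrieve(question, passages, k))
    return (
        "Answer using only this guideline context:\n"
        f"{context}\n\nQuestion: {question}"
    )


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: clipped unigram overlap, no stemming."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Placeholder passages, NOT actual AAOS guideline text.
PASSAGES = [
    "placeholder passage about graft choice for reconstruction",
    "placeholder passage about timing of surgery after injury",
    "placeholder passage about bracing during rehabilitation",
]

if __name__ == "__main__":
    print(build_prompt("When should surgery be timed after injury?", PASSAGES))
```

In a real system the bag-of-words vectors would typically be replaced by learned embeddings and a vector index, and the assembled prompt would be sent to the model under evaluation. The surgeon-graded accuracy reported in the study is a separate measure from overlap scores such as this one.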

Results: All non-custom LLMs started below 60% accuracy. Applying RAG improved the accuracy of every model, by an average of 39.7%. The highest-performing model with RAG alone was Meta's open-source Llama 3 70b (94%). The highest-performing model with RAG and AI Agents was OpenAI's GPT-4 (95%).

Conclusion: RAG improved accuracy by an average of 39.7%, with the highest accuracy rate, 94%, achieved by Meta's Llama 3 70b. Incorporating AI agents into a previously RAG-augmented LLM raised the accuracy rate of GPT-4 to 95%. Thus, Agentic and RAG-augmented LLMs can be accurate liaisons of information, supporting our hypothesis.

Clinical relevance: Despite the literature on the use of LLMs in medicine, there has been considerable and appropriate skepticism given their variably accurate response rates. This study establishes the groundwork for identifying whether custom modifications to LLMs using RAG and Agentic augmentation can better deliver accurate information in orthopaedic care. With this knowledge, the medical information that patients commonly seek from popular LLMs, such as ChatGPT, can be standardized and made relevant, to better support shared decision making between surgeon and patient.
