Optimal Large Language Model Characteristics to Balance Accuracy and Energy Use for Sustainable Medical Applications.

Journal: Radiology (IF 12.1; Zone 1, Medicine; JCR Q1, Radiology, Nuclear Medicine & Medical Imaging). Pub Date: 2024-08-01. DOI: 10.1148/radiol.240320
Florence X Doo, Dharmam Savani, Adway Kanhere, Ruth C Carlos, Anupam Joshi, Paul H Yi, Vishwa S Parekh

Abstract

Background: Large language models (LLMs) for medical applications use unknown amounts of energy, which contribute to the overall carbon footprint of the health care system.

Purpose: To investigate the tradeoffs between accuracy and energy use when using different LLM types and sizes for medical applications.

Materials and Methods: This retrospective study evaluated five different billion (B)-parameter sizes of two open-source LLMs (Meta's Llama 2, a general-purpose model, and LMSYS Org's Vicuna 1.5, a specialized fine-tuned model) using chest radiograph reports from the National Library of Medicine's Indiana University Chest X-ray Collection. Reports with missing demographic information and missing or blank files were excluded. Models were run on local compute clusters with visual computing graphic processing units. A single-task prompt explained clinical terminology and instructed each model to confirm the presence or absence of each of the 13 CheXpert disease labels. Energy use (in kilowatt-hours) was measured using an open-source tool. Accuracy was assessed with 13 CheXpert reference standard labels for diagnostic findings on chest radiographs, where overall accuracy was the mean of individual accuracies of all 13 labels. Efficiency ratios (accuracy per kilowatt-hour) were calculated for each model type and size.

Results: A total of 3665 chest radiograph reports were evaluated. The Vicuna 1.5 7B and 13B models had higher efficiency ratios (737.28 and 331.40, respectively) and higher overall labeling accuracy (93.83% [3438.69 of 3665 reports] and 93.65% [3432.38 of 3665 reports], respectively) than that of the Llama 2 models (7B: efficiency ratio of 13.39, accuracy of 7.91% [289.76 of 3665 reports]; 13B: efficiency ratio of 40.90, accuracy of 74.08% [2715.15 of 3665 reports]; 70B: efficiency ratio of 22.30, accuracy of 92.70% [3397.38 of 3665 reports]). Vicuna 1.5 7B had the highest efficiency ratio (737.28 vs 13.39 for Llama 2 7B). The larger Llama 2 70B model used more than seven times the energy of its 7B counterpart (4.16 kWh vs 0.59 kWh) with low overall accuracy, resulting in an efficiency ratio of only 22.30.

Conclusion: Smaller fine-tuned LLMs were more sustainable than larger general-purpose LLMs, using less energy without compromising accuracy, highlighting the importance of LLM selection for medical applications.

© RSNA, 2024. Supplemental material is available for this article.
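The two metrics defined in the abstract (overall accuracy as the mean of the 13 per-label accuracies, and the efficiency ratio as accuracy per kilowatt-hour) can be illustrated with a minimal Python sketch. The function names and the toy per-label values are hypothetical and are not taken from the article. The sketch assumes accuracy is expressed in percentage points, which is what the reported Llama 2 figures suggest. Small differences from the published ratios are expected because the abstract reports energy rounded to two decimals.

```python
from statistics import mean


def overall_accuracy(per_label_accuracy: dict[str, float]) -> float:
    """Overall accuracy as the mean of the individual label accuracies (percent)."""
    return mean(list(per_label_accuracy.values()))


def efficiency_ratio(accuracy_percent: float, energy_kwh: float) -> float:
    """Efficiency ratio: accuracy (percent) per kilowatt-hour of energy used."""
    if energy_kwh <= 0:
        raise ValueError("energy_kwh must be positive")
    return accuracy_percent / energy_kwh


def implied_energy_kwh(accuracy_percent: float, ratio: float) -> float:
    """Back out energy use from accuracy and efficiency ratio (a derived value)."""
    if ratio <= 0:
        raise ValueError("ratio must be positive")
    return accuracy_percent / ratio


# Accuracy and energy values reported in the abstract for the Llama 2 models.
llama2_7b_ratio = efficiency_ratio(7.91, 0.59)    # abstract reports 13.39
llama2_70b_ratio = efficiency_ratio(92.70, 4.16)  # abstract reports 22.30

# How much more energy the 70B model used than the 7B model.
energy_multiple = 4.16 / 0.59  # "more than seven times"

# The abstract does not state energy use for Vicuna 1.5; these figures are
# derived here from the reported accuracy and ratio, not reported values.
vicuna_7b_kwh = implied_energy_kwh(93.83, 737.28)
vicuna_13b_kwh = implied_energy_kwh(93.65, 331.40)

# Toy per-label accuracies (hypothetical, two labels only) to show the averaging.
toy_overall = overall_accuracy({"label_a": 90.0, "label_b": 100.0})

if __name__ == "__main__":
    print(f"Llama 2 7B ratio:  {llama2_7b_ratio:.2f}")
    print(f"Llama 2 70B ratio: {llama2_70b_ratio:.2f}")
    print(f"70B vs 7B energy:  {energy_multiple:.2f}x")
    print(f"Vicuna 1.5 7B implied energy:  {vicuna_7b_kwh:.3f} kWh")
    print(f"Vicuna 1.5 13B implied energy: {vicuna_13b_kwh:.3f} kWh")
    print(f"Toy overall accuracy: {toy_overall:.1f}%")
```

Read this way, the efficiency ratio rewards a model for reaching a given accuracy with less energy, which is why the Llama 2 70B model scores far below Vicuna 1.5 7B despite similar accuracy.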

Source journal: Radiology (Medicine, Nuclear Medicine)
CiteScore: 35.20
Self-citation rate: 3.00%
Articles published: 596
Review time: 3.6 months
Journal description: Published regularly since 1923 by the Radiological Society of North America (RSNA), Radiology has long been recognized as the authoritative reference for the most current, clinically relevant, and highest quality research in the field of radiology. Each month the journal publishes approximately 240 pages of peer-reviewed original research, authoritative reviews, well-balanced commentary on significant articles, and expert opinion on new techniques and technologies. Radiology publishes cutting-edge and impactful imaging research articles in radiology and medical imaging in order to help improve human health.