Optimal Large Language Model Characteristics to Balance Accuracy and Energy Use for Sustainable Medical Applications
Florence X Doo, Dharmam Savani, Adway Kanhere, Ruth C Carlos, Anupam Joshi, Paul H Yi, Vishwa S Parekh
Radiology (journal article), published 2024-08-01
DOI: 10.1148/radiol.240320 (https://doi.org/10.1148/radiol.240320)
PDF (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11366671/pdf/
Abstract
Background: Large language models (LLMs) for medical applications use unknown amounts of energy, which contribute to the overall carbon footprint of the health care system.

Purpose: To investigate the tradeoffs between accuracy and energy use when using different LLM types and sizes for medical applications.

Materials and Methods: This retrospective study evaluated five different billion (B)-parameter sizes of two open-source LLMs (Meta's Llama 2, a general-purpose model, and LMSYS Org's Vicuna 1.5, a specialized fine-tuned model) using chest radiograph reports from the National Library of Medicine's Indiana University Chest X-ray Collection. Reports with missing demographic information and missing or blank files were excluded. Models were run on local compute clusters with visual computing graphics processing units. A single-task prompt explained clinical terminology and instructed each model to confirm the presence or absence of each of the 13 CheXpert disease labels. Energy use (in kilowatt-hours) was measured using an open-source tool. Accuracy was assessed with 13 CheXpert reference standard labels for diagnostic findings on chest radiographs, where overall accuracy was the mean of the individual accuracies of all 13 labels. Efficiency ratios (accuracy per kilowatt-hour) were calculated for each model type and size.

Results: A total of 3665 chest radiograph reports were evaluated. The Vicuna 1.5 7B and 13B models had higher efficiency ratios (737.28 and 331.40, respectively) and higher overall labeling accuracy (93.83% [3438.69 of 3665 reports] and 93.65% [3432.38 of 3665 reports], respectively) than the Llama 2 models (7B: efficiency ratio of 13.39, accuracy of 7.91% [289.76 of 3665 reports]; 13B: efficiency ratio of 40.90, accuracy of 74.08% [2715.15 of 3665 reports]; 70B: efficiency ratio of 22.30, accuracy of 92.70% [3397.38 of 3665 reports]). Vicuna 1.5 7B had the highest efficiency ratio (737.28 vs 13.39 for Llama 2 7B). The larger Llama 2 70B model used more than seven times the energy of its 7B counterpart (4.16 kWh vs 0.59 kWh) with low overall accuracy, resulting in an efficiency ratio of only 22.30.

Conclusion: Smaller fine-tuned LLMs were more sustainable than larger general-purpose LLMs, using less energy without compromising accuracy, highlighting the importance of LLM selection for medical applications.

© RSNA, 2024. Supplemental material is available for this article.
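
The two metrics defined in the Materials and Methods (overall accuracy as the mean of per-label accuracies, and efficiency ratio as accuracy per kilowatt-hour) can be sketched as follows. This is only an illustration, not the authors' code: the function names `overall_accuracy` and `efficiency_ratio` are hypothetical, and the inputs are the rounded figures quoted in the abstract, so the recomputed ratios may differ slightly from the reported ones.

```python
from statistics import mean


def overall_accuracy(per_label_accuracy):
    """Unweighted mean of per-label accuracies (percent), e.g. one value per CheXpert label."""
    return mean(per_label_accuracy)


def efficiency_ratio(accuracy_percent, energy_kwh):
    """Accuracy (in percentage points) divided by energy use (in kWh)."""
    if energy_kwh <= 0:
        raise ValueError("energy_kwh must be positive")
    return accuracy_percent / energy_kwh


# Rounded values quoted in the abstract (only these two models have
# both accuracy and energy use stated explicitly).
reported = {
    "Llama 2 7B": {"accuracy": 7.91, "energy_kwh": 0.59, "ratio": 13.39},
    "Llama 2 70B": {"accuracy": 92.70, "energy_kwh": 4.16, "ratio": 22.30},
}

for name, row in reported.items():
    recomputed = efficiency_ratio(row["accuracy"], row["energy_kwh"])
    print(f"{name}: recomputed {recomputed:.2f}, reported {row['ratio']:.2f}")

# Energy multiple behind the "more than seven times" statement.
energy_multiple = reported["Llama 2 70B"]["energy_kwh"] / reported["Llama 2 7B"]["energy_kwh"]
print(f"70B vs 7B energy use: {energy_multiple:.2f}x")
```

The recomputed ratios land within a few hundredths of the reported values, which is consistent with the published energy figures having been rounded to two decimals before being quoted.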