GPT-Driven Radiology Report Generation with Fine-Tuned Llama 3.

IF 3.7 3区医学 Q2 ENGINEERING, BIOMEDICAL Bioengineering Pub Date : 2024-10-18 DOI:10.3390/bioengineering11101043

Ștefan-Vlad Voinea, Mădălin Mămuleanu, Rossy Vlăduț Teică, Lucian Mihai Florescu, Dan Selișteanu, Ioana Andreea Gheonea

{"title":"GPT-Driven Radiology Report Generation with Fine-Tuned Llama 3.","authors":"Ștefan-Vlad Voinea, Mădălin Mămuleanu, Rossy Vlăduț Teică, Lucian Mihai Florescu, Dan Selișteanu, Ioana Andreea Gheonea","doi":"10.3390/bioengineering11101043","DOIUrl":null,"url":null,"abstract":"<p><p>The integration of deep learning into radiology has the potential to enhance diagnostic processes, yet its acceptance in clinical practice remains limited due to various challenges. This study aimed to develop and evaluate a fine-tuned large language model (LLM), based on Llama 3-8B, to automate the generation of accurate and concise conclusions in magnetic resonance imaging (MRI) and computed tomography (CT) radiology reports, thereby assisting radiologists and improving reporting efficiency. A dataset comprising 15,000 radiology reports was collected from the University of Medicine and Pharmacy of Craiova's Imaging Center, covering a diverse range of MRI and CT examinations made by four experienced radiologists. The Llama 3-8B model was fine-tuned using transfer-learning techniques, incorporating parameter quantization to 4-bit precision and low-rank adaptation (LoRA) with a rank of 16 to optimize computational efficiency on consumer-grade GPUs. The model was trained over five epochs using an NVIDIA RTX 3090 GPU, with intermediary checkpoints saved for monitoring. Performance was evaluated quantitatively using Bidirectional Encoder Representations from Transformers Score (BERTScore), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Bilingual Evaluation Understudy (BLEU), and Metric for Evaluation of Translation with Explicit Ordering (METEOR) metrics on a held-out test set. Additionally, a qualitative assessment was conducted, involving 13 independent radiologists who participated in a Turing-like test and provided ratings for the AI-generated conclusions. The fine-tuned model demonstrated strong quantitative performance, achieving a BERTScore F1 of 0.8054, a ROUGE-1 F1 of 0.4998, a ROUGE-L F1 of 0.4628, and a METEOR score of 0.4282. In the human evaluation, the artificial intelligence (AI)-generated conclusions were preferred over human-written ones in approximately 21.8% of cases, indicating that the model's outputs were competitive with those of experienced radiologists. The average rating of the AI-generated conclusions was 3.65 out of 5, reflecting a generally favorable assessment. Notably, the model maintained its consistency across various types of reports and demonstrated the ability to generalize to unseen data. The fine-tuned Llama 3-8B model effectively generates accurate and coherent conclusions for MRI and CT radiology reports. By automating the conclusion-writing process, this approach can assist radiologists in reducing their workload and enhancing report consistency, potentially addressing some barriers to the adoption of deep learning in clinical practice. The positive evaluations from independent radiologists underscore the model's potential utility. While the model demonstrated strong performance, limitations such as dataset bias, limited sample diversity, a lack of clinical judgment, and the need for large computational resources require further refinement and real-world validation. Future work should explore the integration of such models into clinical workflows, address ethical and legal considerations, and extend this approach to generate complete radiology reports.</p>","PeriodicalId":8874,"journal":{"name":"Bioengineering","volume":"11 10","pages":""},"PeriodicalIF":3.7000,"publicationDate":"2024-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11504957/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Bioengineering","FirstCategoryId":"5","ListUrlMain":"https://doi.org/10.3390/bioengineering11101043","RegionNum":3,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, BIOMEDICAL","Score":null,"Total":0}

引用次数: 0

Abstract

The integration of deep learning into radiology has the potential to enhance diagnostic processes, yet its acceptance in clinical practice remains limited due to various challenges. This study aimed to develop and evaluate a fine-tuned large language model (LLM), based on Llama 3-8B, to automate the generation of accurate and concise conclusions in magnetic resonance imaging (MRI) and computed tomography (CT) radiology reports, thereby assisting radiologists and improving reporting efficiency. A dataset comprising 15,000 radiology reports was collected from the University of Medicine and Pharmacy of Craiova's Imaging Center, covering a diverse range of MRI and CT examinations made by four experienced radiologists. The Llama 3-8B model was fine-tuned using transfer-learning techniques, incorporating parameter quantization to 4-bit precision and low-rank adaptation (LoRA) with a rank of 16 to optimize computational efficiency on consumer-grade GPUs. The model was trained over five epochs using an NVIDIA RTX 3090 GPU, with intermediary checkpoints saved for monitoring. Performance was evaluated quantitatively using Bidirectional Encoder Representations from Transformers Score (BERTScore), Recall-Oriented Understudy for Gisting Evaluation (ROUGE), Bilingual Evaluation Understudy (BLEU), and Metric for Evaluation of Translation with Explicit Ordering (METEOR) metrics on a held-out test set. Additionally, a qualitative assessment was conducted, involving 13 independent radiologists who participated in a Turing-like test and provided ratings for the AI-generated conclusions. The fine-tuned model demonstrated strong quantitative performance, achieving a BERTScore F1 of 0.8054, a ROUGE-1 F1 of 0.4998, a ROUGE-L F1 of 0.4628, and a METEOR score of 0.4282. In the human evaluation, the artificial intelligence (AI)-generated conclusions were preferred over human-written ones in approximately 21.8% of cases, indicating that the model's outputs were competitive with those of experienced radiologists. The average rating of the AI-generated conclusions was 3.65 out of 5, reflecting a generally favorable assessment. Notably, the model maintained its consistency across various types of reports and demonstrated the ability to generalize to unseen data. The fine-tuned Llama 3-8B model effectively generates accurate and coherent conclusions for MRI and CT radiology reports. By automating the conclusion-writing process, this approach can assist radiologists in reducing their workload and enhancing report consistency, potentially addressing some barriers to the adoption of deep learning in clinical practice. The positive evaluations from independent radiologists underscore the model's potential utility. While the model demonstrated strong performance, limitations such as dataset bias, limited sample diversity, a lack of clinical judgment, and the need for large computational resources require further refinement and real-world validation. Future work should explore the integration of such models into clinical workflows, address ethical and legal considerations, and extend this approach to generate complete radiology reports.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

使用微调 Llama 3 生成 GPT 驱动的放射学报告。

将深度学习融入放射学具有增强诊断过程的潜力，但由于各种挑战，其在临床实践中的接受程度仍然有限。本研究旨在开发和评估基于 Llama 3-8B 的微调大语言模型（LLM），以自动生成磁共振成像（MRI）和计算机断层扫描（CT）放射学报告中准确而简洁的结论，从而协助放射科医生并提高报告效率。我们从克拉约瓦医学和药学大学的影像中心收集了一个包含 15,000 份放射学报告的数据集，涵盖了四位经验丰富的放射科医生所做的各种核磁共振成像和 CT 检查。Llama 3-8B 模型采用迁移学习技术进行了微调，将参数量化为 4 位精度，并采用秩为 16 的低秩自适应性 (LoRA)，以优化消费级 GPU 的计算效率。使用英伟达 RTX 3090 GPU 对模型进行了五次历时训练，并保存中间检查点进行监控。在保留的测试集上，使用来自变换器的双向编码器表示得分（BERTScore）、以召回为导向的词组评估研究（ROUGE）、双语评估研究（BLEU）和显式排序翻译评估指标（METEOR）对性能进行了定量评估。此外，还进行了定性评估，13 位独立的放射科医生参与了类似图灵的测试，并对人工智能生成的结论进行了评分。经过微调的模型在定量方面表现出色，BERTScore F1 为 0.8054，ROUGE-1 F1 为 0.4998，ROUGE-L F1 为 0.4628，METEOR 为 0.4282。在人工评估中，人工智能（AI）生成的结论比人工撰写的结论更受青睐的案例约占 21.8%，这表明该模型的输出结果与经验丰富的放射科医生的输出结果具有竞争力。人工智能生成结论的平均评分为 3.65 分（满分 5 分），反映了普遍良好的评价。值得注意的是，该模型在各种类型的报告中都保持了一致性，并展示了对未见过的数据进行归纳的能力。经过微调的 Llama 3-8B 模型能有效地为 MRI 和 CT 放射学报告生成准确、一致的结论。通过实现结论撰写过程的自动化，这种方法可以帮助放射科医生减少工作量并提高报告的一致性，从而有可能解决在临床实践中采用深度学习的一些障碍。独立放射科医生的积极评价强调了该模型的潜在用途。虽然该模型表现出很强的性能，但数据集偏差、样本多样性有限、缺乏临床判断以及需要大量计算资源等局限性还需要进一步完善和实际验证。未来的工作应探索将此类模型整合到临床工作流程中，解决伦理和法律方面的问题，并将这种方法扩展到生成完整的放射学报告。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Bioengineering Chemical Engineering-Bioengineering

CiteScore

4.00

自引率

8.70%

发文量

661

期刊介绍： Aims Bioengineering (ISSN 2306-5354) provides an advanced forum for the science and technology of bioengineering. It publishes original research papers, comprehensive reviews, communications and case reports. Our aim is to encourage scientists to publish their experimental and theoretical results in as much detail as possible. All aspects of bioengineering are welcomed from theoretical concepts to education and applications. There is no restriction on the length of the papers. The full experimental details must be provided so that the results can be reproduced. There are, in addition, four key features of this Journal: ● We are introducing a new concept in scientific and technical publications “The Translational Case Report in Bioengineering”. It is a descriptive explanatory analysis of a transformative or translational event. Understanding that the goal of bioengineering scholarship is to advance towards a transformative or clinical solution to an identified transformative/clinical need, the translational case report is used to explore causation in order to find underlying principles that may guide other similar transformative/translational undertakings. ● Manuscripts regarding research proposals and research ideas will be particularly welcomed. ● Electronic files and software regarding the full details of the calculation and experimental procedure, if unable to be published in a normal way, can be deposited as supplementary material. ● We also accept manuscripts communicating to a broader audience with regard to research projects financed with public funds. Scope ● Bionics and biological cybernetics: implantology; bio–abio interfaces ● Bioelectronics: wearable electronics; implantable electronics; “more than Moore” electronics; bioelectronics devices ● Bioprocess and biosystems engineering and applications: bioprocess design; biocatalysis; bioseparation and bioreactors; bioinformatics; bioenergy; etc. ● Biomolecular, cellular and tissue engineering and applications: tissue engineering; chromosome engineering; embryo engineering; cellular, molecular and synthetic biology; metabolic engineering; bio-nanotechnology; micro/nano technologies; genetic engineering; transgenic technology ● Biomedical engineering and applications: biomechatronics; biomedical electronics; biomechanics; biomaterials; biomimetics; biomedical diagnostics; biomedical therapy; biomedical devices; sensors and circuits; biomedical imaging and medical information systems; implants and regenerative medicine; neurotechnology; clinical engineering; rehabilitation engineering ● Biochemical engineering and applications: metabolic pathway engineering; modeling and simulation ● Translational bioengineering