Large Language Model With Region-Guided Referring and Grounding for CT Report Generation

Zhixuan Chen, Yequan Bie, Haibo Jin, Hao Chen
{"title":"Large Language Model With Region-Guided Referring and Grounding for CT Report Generation","authors":"Zhixuan Chen;Yequan Bie;Haibo Jin;Hao Chen","doi":"10.1109/TMI.2025.3559923","DOIUrl":null,"url":null,"abstract":"Computed tomography (CT) report generation is crucial to assist radiologists in interpreting CT volumes, which can be time-consuming and labor-intensive. Existing methods primarily only consider the global features of the entire volume, making it struggle to focus on specific regions and potentially missing abnormalities. To address this issue, we propose Reg2RG, the first region-guided referring and grounding framework for CT report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we utilize masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve the local high-resolution details with little computational overhead. Then the local features are integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy. It leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model’s referring and grounding capabilities while also improving the report’s interpretability. A large language model (LLM) is further employed as the language decoder to generate reports from integrated visual features, facilitating region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods in terms of both natural language generation and clinical efficacy metrics while preserving promising interpretability. The code is available at <uri>https://github.com/zhi-xuan-chen/Reg2RG</uri>.","PeriodicalId":94033,"journal":{"name":"IEEE transactions on medical imaging","volume":"44 8","pages":"3139-3150"},"PeriodicalIF":0.0000,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on medical imaging","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10963672/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Computed tomography (CT) report generation is crucial for assisting radiologists in interpreting CT volumes, a task that can be time-consuming and labor-intensive. Existing methods primarily consider only the global features of the entire volume, which makes it difficult to focus on specific regions and risks missing abnormalities. To address this issue, we propose Reg2RG, the first region-guided referring and grounding framework for CT report generation, which enhances diagnostic performance by focusing on anatomical regions within the volume. Specifically, we utilize masks from a universal segmentation module to capture local features for each referring region. A local feature decoupling (LFD) strategy is proposed to preserve local high-resolution details with little computational overhead. The local features are then integrated with global features to capture inter-regional relationships within a cohesive context. Moreover, we propose a novel region-report alignment (RRA) training strategy, which leverages the recognition of referring regions to guide the generation of region-specific reports, enhancing the model's referring and grounding capabilities while also improving the report's interpretability. A large language model (LLM) is further employed as the language decoder to generate reports from the integrated visual features, facilitating region-level comprehension. Extensive experiments on two large-scale chest CT-report datasets demonstrate the superiority of our method, which outperforms several state-of-the-art methods on both natural language generation and clinical efficacy metrics while preserving promising interpretability. The code is available at https://github.com/zhi-xuan-chen/Reg2RG.
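
To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of one plausible reading of it: segmentation masks drive per-region (local) feature pooling, the local features are fused with a global feature, and the result is projected into the LLM's token-embedding space. All module names, tensor shapes, and the pooling scheme are illustrative assumptions, not the authors' implementation; see the official repository linked above for the real code.

```python
# Sketch of mask-guided local/global feature routing (illustrative only;
# names, shapes, and pooling are assumptions, not the paper's implementation).
import torch
import torch.nn as nn


class RegionGuidedEncoder(nn.Module):
    """Pools one local feature per referring region via its segmentation
    mask, prepends a global feature, and projects everything into the
    LLM's token-embedding space."""

    def __init__(self, feat_dim: int = 256, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)

    def forward(self, feat: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # feat:  (B, C, D, H, W) feature volume from a 3D visual backbone.
        # masks: (B, R, D, H, W) binary masks from a segmentation module,
        #        one channel per referring anatomical region.
        global_tok = feat.flatten(2).mean(-1)        # (B, C) global average pool
        m = masks.unsqueeze(2)                       # (B, R, 1, D, H, W)
        f = feat.unsqueeze(1)                        # (B, 1, C, D, H, W)
        # Masked average pooling: one C-dim feature per region.
        local = (f * m).flatten(3).sum(-1) / m.flatten(3).sum(-1).clamp(min=1e-6)
        tokens = torch.cat([global_tok.unsqueeze(1), local], dim=1)  # (B, 1+R, C)
        return self.proj(tokens)                     # visual tokens fed to the LLM


# Toy usage with random tensors standing in for real CT features and masks.
enc = RegionGuidedEncoder()
feat = torch.randn(1, 256, 8, 16, 16)
masks = (torch.rand(1, 4, 8, 16, 16) > 0.5).float()
print(enc(feat, masks).shape)  # torch.Size([1, 5, 4096])
```

This sketch only covers the basic feature-routing idea; in the paper, the LFD strategy (preserving high-resolution local detail cheaply) and the RRA training objective (aligning region recognition with region-specific report text) refine this recipe.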