Objectives: Large language models (LLMs) are increasingly used in clinical practice, but their performance can deteriorate when radiology reports are incomplete. We evaluated whether multimodal LLMs (integrating text and images) could enhance the accuracy and interpretability of chest radiography impressions and thereby improve their utility for clinical decision support. Specifically, we assessed the robustness of LLMs in generating accurate impressions from incomplete chest radiography reports and whether multimodal input could mitigate the resulting performance loss.
Methods: We analyzed 300 radiology image-report pairs from the MIMIC-CXR database. Three LLMs (OpenFlamingo, MedFlamingo, and IDEFICS) were tested in text-only and multimodal settings. Chest X-ray impressions were generated from complete text reports and then regenerated after systematically removing 20%, 50%, and 80% of the report text. The effect of adding images was evaluated by supplying the corresponding chest X-rays as additional input, and model performance was compared using three statistical methods. Hallucination rates were quantified.
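To make the ablation procedure concrete, the sketch below drops a fixed fraction of words from a findings section and scores a regenerated impression with ROUGE-L. The removal strategy (random word dropping), the example report text, and the generate_impression placeholder are illustrative assumptions not specified in the abstract; only ROUGE-L, computed here with the rouge-score package, is a metric actually named in the study, and F1RadGraph and F1CheXbert would be computed analogously with their respective tools.

```python
import random
from rouge_score import rouge_scorer  # pip install rouge-score

def ablate_report(findings: str, drop_fraction: float, seed: int = 0) -> str:
    """Remove a fraction of words from a findings section.

    The exact removal strategy used in the study is not stated in the abstract;
    random word dropping is only one plausible implementation.
    """
    rng = random.Random(seed)
    words = findings.split()
    kept = [w for w in words if rng.random() >= drop_fraction]
    return " ".join(kept)

# Hypothetical report snippet; `generate_impression` stands in for a call to
# OpenFlamingo, MedFlamingo, or IDEFICS and is not defined here.
reference_impression = "No acute cardiopulmonary process."
findings = ("The lungs are clear without focal consolidation. "
            "No pleural effusion or pneumothorax. "
            "Cardiomediastinal silhouette is within normal limits.")

scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
for frac in (0.0, 0.2, 0.5, 0.8):
    partial_findings = ablate_report(findings, frac)
    # prediction = generate_impression(partial_findings, image=None)
    prediction = reference_impression  # placeholder so the script runs end to end
    rouge_l = scorer.score(reference_impression, prediction)["rougeL"].fmeasure
    print(f"dropped {frac:.0%} of text -> ROUGE-L {rouge_l:.2f}")
```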
Results: In the text-only setting, OpenFlamingo, MedFlamingo, and IDEFICS demonstrated comparable performance (ROUGE-L: 0.23 vs. 0.21 vs. 0.21; F1RadGraph: 0.20 vs. 0.16 vs. 0.16; F1CheXbert: 0.49 vs. 0.41 vs. 0.41), with OpenFlamingo performing best on complete text (p < 0.001). All models exhibited performance decline with incomplete data. However, multimodal input significantly improved the performance of MedFlamingo and IDEFICS (p < 0.001), equaling or surpassing OpenFlamingo even under incomplete text conditions. Regarding hallucination, MedFlamingo showed a lower false-negative rate in multimodal compared with unimodal use, while false-positive rates were similar.
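The false-positive and false-negative rates reported above can be read as finding-level errors relative to the reference impression. Below is a minimal sketch of that idea, assuming each impression is first mapped to a set of finding labels (for example by an automatic labeler such as CheXbert, a detail not stated in the abstract); the example labels are hypothetical.

```python
def hallucination_rates(predicted: set[str], reference: set[str]) -> tuple[float, float]:
    """Finding-level error rates for one generated impression.

    False positive: a finding asserted by the model but absent from the reference
    (a hallucinated finding). False negative: a reference finding the model omitted.
    """
    false_positives = predicted - reference
    false_negatives = reference - predicted
    fp_rate = len(false_positives) / max(len(predicted), 1)
    fn_rate = len(false_negatives) / max(len(reference), 1)
    return fp_rate, fn_rate

# Hypothetical finding sets extracted from a generated and a reference impression.
pred = {"cardiomegaly", "pleural effusion"}
ref = {"pleural effusion", "atelectasis"}
print(hallucination_rates(pred, ref))  # (0.5, 0.5)
```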
Conclusions: LLMs may produce suboptimal outputs when radiology data are incomplete, but multimodal LLMs enhance reliability and may strengthen clinical decision support.