Comparative Evaluation of Large Language Models for Translating Radiology Reports into Hindi
Amit Gupta, Ashish Rastogi, Hema Malhotra, Krithika Rangarajan
Indian Journal of Radiology and Imaging, volume 35, issue 1, pages 88-96. Publication date: 2024-09-04.
DOI: https://doi.org/10.1055/s-0044-1789618
PMC: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11651845/pdf/
Abstract
Objective: The aim of this study was to compare the performance of four publicly available large language models (LLMs), namely GPT-4o, GPT-4, Gemini, and Claude Opus, in translating radiology reports into simple Hindi.

Materials and Methods: In this retrospective study, 100 computed tomography (CT) scan report impressions were gathered from a tertiary care cancer center. Reference translations of these impressions into simple Hindi were prepared by a bilingual radiology staff member in consultation with a radiologist. Two distinct prompts were used to assess the LLMs' ability to translate these report impressions into simple Hindi. A radiologist assessed the translated reports for instances of misinterpretation, omission, and addition of fictitious information. Translation quality was assessed using Bilingual Evaluation Understudy (BLEU), Metric for Evaluation of Translation with Explicit ORdering (METEOR), Translation Edit Rate (TER), and character F-score (CHRF) scores. Statistical analyses were performed to compare LLM performance across prompts.

Results: Radiologist evaluation of the 800 LLM-generated translated report impressions (100 impressions × 4 models × 2 prompts) found nine instances of misinterpretation and two instances of omission of information. For prompt 1, Gemini outperformed the other models in BLEU (p < 0.001) and METEOR scores (p = 0.001), and was superior to GPT-4o and GPT-4 in TER and CHRF (p < 0.001), but comparable to Claude (p = 0.501 for TER and p = 0.90 for CHRF). For prompt 2, GPT-4o outperformed all other models (p < 0.001) in all metrics. Prompt 2 yielded better BLEU, METEOR, and CHRF scores (p < 0.001), while prompt 1 had a better TER score (p < 0.001).

Conclusion: While each LLM's effectiveness varied with prompt wording, all models demonstrated potential in translating and simplifying radiology report impressions.
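
To make the character F-score (CHRF) named in the methods more concrete, the following is a minimal Python sketch of a simplified character n-gram F-score. It is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not say which tooling was used, the function names `char_ngrams` and `simple_chrf` are hypothetical, the settings (n-gram orders 1 to 6, beta = 2) follow common chrF defaults rather than anything reported in the study, and the Hindi strings are made-up examples, not study data.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing all whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: average n-gram precision and recall, then F-beta.

    Assumption: this mirrors the general idea of chrF only; established
    toolkits differ in details such as smoothing and word n-grams.
    N-grams are built from Unicode code points, so Devanagari combining
    marks are counted as separate characters.
    """
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # this order is longer than one of the strings
        overlap = sum((hyp & ref).values())  # clipped matching n-grams
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    # beta = 2 weights recall more heavily than precision
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# Hypothetical example pair (not taken from the study)
reference = "फेफड़े में गांठ है"
hypothesis = "फेफड़े में एक गांठ है"
print(round(simple_chrf(hypothesis, reference), 3))
```

A score of 1.0 means the candidate matches the reference exactly at the character n-gram level, and 0.0 means no overlap. Because matching is done on characters rather than whole words, this family of metrics tends to be more forgiving of inflectional variation than word-level metrics such as BLEU.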