Automated classification of brain MRI reports using fine-tuned large language models
Jun Kanzawa, Koichiro Yasaka, Nana Fujita, Shin Fujiwara, Osamu Abe
Neuroradiology, 2024, pp. 2177-2183 (Epub 2024-07-12)
DOI: 10.1007/s00234-024-03427-7
Open access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11611921/pdf/
Abstract
Purpose: This study aimed to investigate the efficacy of a fine-tuned large language model (LLM) in classifying brain MRI reports into pretreatment tumor, posttreatment tumor, and nontumor cases.
Methods: This retrospective study included 759, 284, and 164 brain MRI reports for the training, validation, and test datasets, respectively. Radiologists stratified the reports into three groups: nontumor (group 1), posttreatment tumor (group 2), and pretreatment tumor (group 3) cases. A pretrained Bidirectional Encoder Representations from Transformers (BERT) Japanese model was fine-tuned on the training dataset and evaluated on the validation dataset. The model that achieved the highest accuracy on the validation dataset was selected as the final model. Two additional radiologists classified the reports in the test dataset into the three groups, and the model's performance on the test dataset was compared with theirs.
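As a concrete illustration of this kind of workflow, the sketch below fine-tunes a pretrained Japanese BERT checkpoint for three-group report classification with Hugging Face Transformers, selecting the epoch with the best validation accuracy. The checkpoint name (cl-tohoku/bert-base-japanese-whole-word-masking), hyperparameters, column names, and placeholder records are illustrative assumptions and are not taken from the paper.

```python
# Minimal sketch: fine-tune a Japanese BERT for 3-class MRI-report classification.
# Checkpoint, hyperparameters, and data layout are assumptions, not the paper's setup.
import numpy as np
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

checkpoint = "cl-tohoku/bert-base-japanese-whole-word-masking"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=3)

# Placeholder records; in the study these were the 759 training and 284 validation reports.
# label: 0 = nontumor, 1 = posttreatment tumor, 2 = pretreatment tumor
train_records = [{"report_text": "頭部MRI所見の例。", "label": 0}]
val_records = [{"report_text": "頭部MRI所見の例。", "label": 0}]

def tokenize(batch):
    # Truncate long reports to the model's maximum input length.
    return tokenizer(batch["report_text"], truncation=True,
                     padding="max_length", max_length=512)

train_ds = Dataset.from_list(train_records).map(tokenize, batched=True,
                                                remove_columns=["report_text"])
val_ds = Dataset.from_list(val_records).map(tokenize, batched=True,
                                            remove_columns=["report_text"])

def compute_metrics(eval_pred):
    # Validation accuracy, used here to pick the best checkpoint across epochs.
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {"accuracy": float((preds == labels).mean())}

args = TrainingArguments(
    output_dir="bert-mri-reports",
    num_train_epochs=5,                # assumed; tune against the validation set
    per_device_train_batch_size=16,
    eval_strategy="epoch",             # called evaluation_strategy in older releases
    save_strategy="epoch",
    load_best_model_at_end=True,       # keep the epoch with the highest validation accuracy
    metric_for_best_model="accuracy",
)

trainer = Trainer(model=model, args=args, train_dataset=train_ds,
                  eval_dataset=val_ds, compute_metrics=compute_metrics)
trainer.train()
```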
Results: The fine-tuned LLM attained an overall accuracy of 0.970 (95% CI: 0.930-0.990). The model's sensitivity for groups 1/2/3 was 1.000/0.864/0.978, and its specificity for groups 1/2/3 was 0.991/0.993/0.958. No statistically significant differences were found in accuracy, sensitivity, or specificity between the LLM and the human readers (p ≥ 0.371). The LLM completed the classification task approximately 20- to 26-fold faster than the radiologists. The area under the receiver operating characteristic curve (AUC) for discriminating groups 2 and 3 from group 1 was 0.994 (95% CI: 0.982-1.000), and for discriminating group 3 from groups 1 and 2 it was 0.992 (95% CI: 0.982-1.000).
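Binary AUCs of this kind can be derived from the model's per-report class probabilities along the lines of the sketch below. Summing the group 2 and group 3 probabilities to score "tumor versus nontumor" is an assumption about how such an analysis could be set up, and the toy arrays stand in for the real test-set outputs.

```python
# Minimal sketch, assuming the classifier outputs softmax probabilities for groups 1-3.
import numpy as np
from sklearn.metrics import roc_auc_score

# probs: (n_reports, 3) probabilities for groups 1, 2, 3; labels: 0/1/2 true group indices.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.10, 0.70, 0.20],
                  [0.05, 0.15, 0.80]])
labels = np.array([0, 1, 2])

# Tumor (groups 2 and 3) versus nontumor (group 1): score = summed tumor probability.
auc_tumor = roc_auc_score((labels > 0).astype(int), probs[:, 1] + probs[:, 2])

# Pretreatment tumor (group 3) versus groups 1 and 2.
auc_pretreatment = roc_auc_score((labels == 2).astype(int), probs[:, 2])

print(f"AUC, tumor vs nontumor: {auc_tumor:.3f}")
print(f"AUC, pretreatment vs rest: {auc_pretreatment:.3f}")
```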
Conclusion: The fine-tuned LLM demonstrated performance comparable to that of radiologists in classifying brain MRI reports, while requiring substantially less time.
About the journal:
Neuroradiology aims to provide state-of-the-art medical and scientific information in the fields of Neuroradiology, Neurosciences, Neurology, Psychiatry, Neurosurgery, and related medical specialities. As the official journal of the European Society of Neuroradiology, Neuroradiology receives submissions from all parts of the world and publishes peer-reviewed original research, comprehensive reviews, educational papers, opinion papers, and short reports on exceptional clinical observations and new technical developments in the field of Neuroimaging and Neurointervention. The journal has subsections for Diagnostic and Interventional Neuroradiology, Advanced Neuroimaging, Paediatric Neuroradiology, Head-Neck-ENT Radiology, Spine Neuroradiology, and for submissions from Japan. Neuroradiology aims to provide new knowledge about and insights into the function and pathology of the human nervous system that may help to better diagnose and treat nervous system diseases. Neuroradiology is a member of the Committee on Publication Ethics (COPE) and follows the COPE core practices. Neuroradiology prefers articles that are free of bias, self-critical regarding limitations, transparent and clear in describing study participants, methods, and statistics, and short in presenting results. Before peer review, all submissions are automatically checked by iThenticate to assess potential overlap with prior publications.