Edge-Aware Dual-Branch CNN Architecture for Alzheimer's Disease Diagnosis.
Pub Date : 2026-01-27 DOI: 10.1007/s10278-025-01836-5
Man Li, Mei Choo Ang, Musatafa Abbas Abbood Albadr, Jun Kit Chaw, JianBang Liu, Kok Weng Ng, Wei Hong
The rapid development of machine learning (ML) and deep learning (DL) has greatly advanced Alzheimer's disease (AD) diagnosis. However, existing models struggle to capture weak structural features in the marginal regions of brain MRI images, leading to limited diagnostic accuracy. To address this challenge, we introduce a Dual-Branch Convolutional Neural Network (DBCNN) equipped with a Learnable Edge Detection Module designed to jointly learn global semantic representations and fine-grained edge cues within a unified framework. Experimental results on two public datasets demonstrate that DBCNN significantly improves classification accuracy, surpassing 98%. Notably, on the OASIS dataset, it achieved an average accuracy of 99.71%, demonstrating strong generalization and robustness. This high diagnostic performance indicates that the model can assist clinicians in the early detection of Alzheimer's disease, reduce subjectivity in manual image interpretation, and enhance diagnostic consistency. Overall, the proposed approach provides a promising pathway toward intelligent, interpretable, and computationally efficient solutions for MRI-based diagnosis, offering strong potential to support early clinical decision-making.
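The abstract does not include implementation details; below is a minimal PyTorch sketch of what a dual-branch CNN with a learnable edge-detection module could look like. The module names, channel counts, and the Sobel-style initialization are assumptions for illustration, not the authors' design.

```python
# Sketch (assumed design): one branch learns global semantics, the other starts from
# Sobel-initialized but trainable kernels to emphasize edge cues; features are fused.
import torch
import torch.nn as nn

class LearnableEdgeModule(nn.Module):
    def __init__(self, in_ch=1, out_ch=8):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1, bias=False)
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        with torch.no_grad():  # initialize filters with Sobel kernels, keep them learnable
            for i in range(out_ch):
                k = sobel_x if i % 2 == 0 else sobel_y
                self.conv.weight[i] = k.repeat(in_ch, 1, 1) / in_ch

    def forward(self, x):
        return torch.relu(self.conv(x))

class DualBranchCNN(nn.Module):
    def __init__(self, num_classes=4):
        super().__init__()
        self.semantic = nn.Sequential(            # global-semantics branch
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.edge = nn.Sequential(                # fine-grained edge-cue branch
            LearnableEdgeModule(1, 8),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32 + 16, num_classes)

    def forward(self, x):
        fused = torch.cat([self.semantic(x).flatten(1), self.edge(x).flatten(1)], dim=1)
        return self.head(fused)

logits = DualBranchCNN()(torch.randn(2, 1, 128, 128))  # e.g. a 4-class AD staging setup
```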
{"title":"Edge-Aware Dual-Branch CNN Architecture for Alzheimer's Disease Diagnosis.","authors":"Man Li, Mei Choo Ang, Musatafa Abbas Abbood Albadr, Jun Kit Chaw, JianBang Liu, Kok Weng Ng, Wei Hong","doi":"10.1007/s10278-025-01836-5","DOIUrl":"https://doi.org/10.1007/s10278-025-01836-5","url":null,"abstract":"<p><p>The rapid development of machine learning (ML) and deep learning (DL) has greatly advanced Alzheimer's disease (AD) diagnosis. However, existing models struggle to capture weak structural features in the marginal regions of brain MRI images, leading to limited diagnostic accuracy. To address this challenge, we introduce a Dual-Branch Convolutional Neural Network (DBCNN) equipped with a Learnable Edge Detection Module designed to jointly learn global semantic representations and fine-grained edge cues within a unified framework. Experimental results on two public datasets demonstrate that DBCNN significantly improves classification accuracy, surpassing 98%. Notably, on the OASIS dataset, it achieved an average accuracy of 99.71%, demonstrating strong generalization and robustness. This high diagnostic performance indicates that the model can assist clinicians in the early detection of Alzheimer's disease, reduce subjectivity in manual image interpretation, and enhance diagnostic consistency. Overall, the proposed approach provides a promising pathway toward intelligent, interpretable, and computationally efficient solutions for MRI-based diagnosis, offering strong potential to support early clinical decision-making.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146055759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lung Cancer Classification Using Effective Fusion Network Integrating Transformers and Controllable Convolutional Encoders-Decoders.
Pub Date : 2026-01-27 DOI: 10.1007/s10278-025-01830-x
Evgin Goceri
In this work, a new fusion network was developed and applied to lung cancer classification. It incorporates a transformer-based module, a convolutional module with encoders, and another convolutional module with decoders. Each module is strategically placed and extracts features at different scales, enabling the network to capture enriched feature information at both global and local levels. A novel hybrid loss function was also employed to reduce both pixel- and image-based differences while enhancing region-wise consistency. The model's effectiveness was evaluated by classifying lung cancer subtypes from computed tomography scans, a highly challenging task due to factors such as high interclass similarity and the presence of nontumor features. Moreover, recent methods used for lung cancer classification were applied to identical datasets and evaluated using identical metrics to ensure fair comparative assessments. The results demonstrate the superiority of the proposed approach in lung cancer subtype classification, achieving higher accuracy (96.59%), recall (96.68%), precision (96.90%), and F1-score (96.65%) compared to recent methods.
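The hybrid loss is not specified in the abstract; the sketch below shows one way a pixel-level term, an image-level classification term, and a region-wise consistency term could be combined. The concrete terms and weights are assumptions, not the paper's definition.

```python
# Minimal sketch of a hybrid loss in the spirit described above (assumed formulation).
import torch
import torch.nn.functional as F

def hybrid_loss(recon, target_img, logits, labels,
                w_pix=1.0, w_img=1.0, w_region=0.5, patch=16):
    # pixel-based difference: L1 between decoder output and input image
    pixel_term = F.l1_loss(recon, target_img)
    # image-based difference: cross-entropy on the subtype prediction
    image_term = F.cross_entropy(logits, labels)
    # region-wise consistency: match mean intensities over non-overlapping patches
    region_term = F.mse_loss(F.avg_pool2d(recon, patch), F.avg_pool2d(target_img, patch))
    return w_pix * pixel_term + w_img * image_term + w_region * region_term

# usage with dummy tensors
recon = torch.rand(2, 1, 128, 128)
img = torch.rand(2, 1, 128, 128)
logits = torch.randn(2, 3)            # e.g. three lung cancer subtypes
labels = torch.tensor([0, 2])
loss = hybrid_loss(recon, img, logits, labels)
```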
{"title":"Lung Cancer Classification Using Effective Fusion Network Integrating Transformers and Controllable Convolutional Encoders-Decoders.","authors":"Evgin Goceri","doi":"10.1007/s10278-025-01830-x","DOIUrl":"https://doi.org/10.1007/s10278-025-01830-x","url":null,"abstract":"<p><p>In this work, a new fusion network was developed and applied to lung cancer classification. It incorporates a transformer-based module, a convolutional module with encoders, and another convolutional module with decoders. Each module is strategically placed and extracts features at different scales, enabling the network to capture enriched feature information at both global and local levels. A novel hybrid loss function was also employed to reduce both pixel- and image-based differences while enhancing region-wise consistency. The model's effectiveness was evaluated by classifying lung cancer subtypes from computed tomography scans, a highly challenging task due to factors such as high interclass similarity and the presence of nontumor features. Moreover, recent methods used for lung cancer classification were applied to identical datasets and evaluated using identical metrics to ensure fair comparative assessments. The results demonstrate the superiority of the proposed approach in lung cancer subtype classification, achieving higher accuracy (96.59%), recall (96.68%), precision (96.90%), and F1-score (96.65%) compared to recent methods.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2026-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146055896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generating Training Data for Ureter Segmentation Using Dual-Energy CT Two-Material Decomposition.
Pub Date : 2026-01-26 DOI: 10.1007/s10278-026-01847-w
Dae Chul Jung, Jungwook Lee, Seungsoo Lee, Sung Il Jung, Myoung Seok Lee, Min Hoan Moon
This study aimed to evaluate the utility of dual-energy CT (DECT)-based two-material decomposition in facilitating the generation of training data for ureter segmentation. This retrospective two-center study included 180 patients who underwent DECT urography between April and July 2020, including 150 from Institution 1 and 30 from Institution 2. Virtual unenhanced (VUE) images were generated from the late excretory phase (LEP) images using a two-material decomposition technique. Ground truth segmentation masks were created by segmenting contrast-filled ureteral regions on LEP images and were then paired with the corresponding VUE images. These VUE images and their corresponding ground truth masks were used to construct training, validation, and test datasets. A deep learning-based segmentation model was developed using the nnU-Net framework. Its performance was evaluated using the Dice coefficient, precision, and recall. In the internal test dataset, the model achieved excellent performance, with a median Dice coefficient of 0.89 (95% CI 0.88-0.90), precision of 0.90 (95% CI 0.88-0.92), and recall of 0.88 (95% CI 0.86-0.91). In contrast, the external validation dataset yielded limited performance, with a median Dice coefficient of 0.43 (95% CI 0.31-0.61) and recall of 0.28 (95% CI 0.18-0.45), while precision remained high at 0.95 (95% CI 0.93-0.96). There were statistically significant differences in all metrics between the internal and external datasets (P < 0.01). DECT-based two-material decomposition is a feasible method for generating training data for ureter segmentation. Although external validation performance was limited, this approach shows promise for ureter segmentation on non-contrast CT scans.
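For reference, the reported overlap metrics can be computed per case from a predicted and a ground-truth mask roughly as follows. This is an illustrative sketch, not the authors' evaluation code (segmentation itself used the nnU-Net framework).

```python
# Per-case Dice, precision, and recall from binary masks (illustrative only).
import numpy as np

def segmentation_metrics(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8):
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    dice = 2 * tp / (2 * tp + fp + fn + eps)   # overlap between prediction and ground truth
    precision = tp / (tp + fp + eps)           # fraction of predicted ureter voxels that are correct
    recall = tp / (tp + fn + eps)              # fraction of true ureter voxels that are found
    return dice, precision, recall

# per-case values would then be summarized as a median with a confidence interval,
# e.g. dice, prec, rec = segmentation_metrics(model_mask, ground_truth_mask)
```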
{"title":"Generating Training Data for Ureter Segmentation Using Dual-Energy CT Two-Material Decomposition.","authors":"Dae Chul Jung, Jungwook Lee, Seungsoo Lee, Sung Il Jung, Myoung Seok Lee, Min Hoan Moon","doi":"10.1007/s10278-026-01847-w","DOIUrl":"https://doi.org/10.1007/s10278-026-01847-w","url":null,"abstract":"<p><p>This study aimed to evaluate the utility of dual-energy CT (DECT)-based two-material decomposition in facilitating the generation of training data for ureter segmentation. This retrospective two-center study included 180 patients who underwent DECT urography between April and July 2020, including 150 from Institution 1 and 30 from Institution 2. Virtual unenhanced (VUE) images were generated from the late excretory phase (LEP) images using a two-material decomposition technique. Ground truth segmentation masks were created by segmenting contrast-filled ureteral regions on LEP images and were then paired with the corresponding VUE images. These VUE images and their corresponding ground truth masks were used to construct training, validation, and test datasets. A deep learning-based segmentation model was developed using the nnU-Net framework. Its performance was evaluated using the Dice coefficient, precision, and recall. In the internal test dataset, the model achieved excellent performance, with a median Dice coefficient of 0.89 (95% CI 0.88-0.90), precision of 0.90 (95% CI 0.88-0.92), and recall of 0.88 (95% CI 0.86-0.91). In contrast, the external validation dataset yielded limited performance, with a median Dice coefficient of 0.43 (95% CI 0.31-0.61) and recall of 0.28 (95% CI 0.18-0.45), while precision remained high at 0.95 (95% CI 0.93-0.96). There were statistically significant differences in all metrics between the internal and external datasets (P < 0.01). DECT-based two-material decomposition is a feasible method for generating training data for ureter segmentation. Although external validation performance was limited, this approach shows promise for ureter segmentation on non-contrast CT scans.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2026-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146055782","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IPATH: A Large-Scale Pathology Image-Text Dataset from Instagram for Vision-Language Model Training.
Pub Date : 2026-01-23 DOI: 10.1007/s10278-025-01820-z
S Mirhosseini, T Rai, P Diaz-Santana, R La Ragione, N Bacon, K Wells
Recent advancements in artificial intelligence (AI) have revealed important patterns in pathology images imperceptible to human observers that can improve diagnostic accuracy and decision support systems. However, progress has been limited due to the lack of publicly available medical images. To address this scarcity, we explored Instagram as a novel source of pathology images with expert annotations. We curated the IPATH dataset from Instagram, comprising 45,609 pathology image-text pairs rigorously filtered and curated for domain quality using classifiers, large language models, and manual filtering. To demonstrate the value of this dataset, we developed a multimodal AI model called IP-CLIP by fine-tuning a pretrained CLIP model using the IPATH dataset. We evaluated IP-CLIP on seven external histopathology datasets using zero-shot classification and linear probing, where it consistently outperformed the original CLIP model. Furthermore, IP-CLIP matched or exceeded several recent state-of-the-art pathology vision-language models, despite being trained on a substantially smaller dataset. We also assessed image-text alignment on a 5k held-out IPATH subset using image-text retrieval, where IP-CLIP surpassed CLIP and other specialized models. These results demonstrate the effectiveness of the IPATH dataset and highlight the potential of leveraging social media data to develop AI models for medical image classification and enhance diagnostic accuracy.
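As a rough illustration of the zero-shot protocol used in the evaluation, a CLIP-style classifier compares image embeddings against one text-prompt embedding per class. The `encode_text` callable below is a hypothetical stand-in for the fine-tuned IP-CLIP text encoder, not a published API.

```python
# Sketch of CLIP-style zero-shot classification (hypothetical encoder helpers).
import torch
import torch.nn.functional as F

def zero_shot_classify(image_emb: torch.Tensor, class_prompts, encode_text):
    # embed one textual prompt per class, e.g. "a histopathology image of {class}"
    text_emb = torch.stack([encode_text(p) for p in class_prompts])   # (C, D)
    image_emb = F.normalize(image_emb, dim=-1)                        # (N, D)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t()     # cosine similarity, shape (N, C)
    return logits.argmax(dim=-1)          # predicted class index per image

# linear probing, by contrast, freezes the image encoder and fits a linear
# classifier (e.g. logistic regression) on the extracted image embeddings.
```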
{"title":"IPATH: A Large-Scale Pathology Image-Text Dataset from Instagram for Vision-Language Model Training.","authors":"S Mirhosseini, T Rai, P Diaz-Santana, R La Ragione, N Bacon, K Wells","doi":"10.1007/s10278-025-01820-z","DOIUrl":"https://doi.org/10.1007/s10278-025-01820-z","url":null,"abstract":"<p><p>Recent advancements in artificial intelligence (AI) have revealed important patterns in pathology images imperceptible to human observers that can improve diagnostic accuracy and decision support systems. However, progress has been limited due to the lack of publicly available medical images. To address this scarcity, we explore Instagram as a novel source of pathology images with expert annotations. We curated the IPATH dataset from Instagram, comprising 45,609 pathology image-text pairs rigorously filtered and curated for domain quality using classifiers, large language models, and manual filtering. To demonstrate the value of this dataset, we developed a multimodal AI model called IP-CLIP by fine-tuning a pretrained CLIP model using the IPATH dataset. We evaluated IP-CLIP on seven external histopathology datasets using zero shot classification and linear probing, where it consistently outperformed the original CLIP model. Furthermore, IP-CLIP matched or exceeded several recent state-of-the-art pathology vision-language models, despite being trained on a substantially smaller dataset. We also assessed image-text alignment on a 5k held-out IPATH subset using image-text retrieval, where IP-CLIP surpassed CLIP and other specialized models. These results demonstrate the effectiveness of the IPATH dataset and highlight the potential of leveraging social media data to develop AI models for medical image classification and enhance diagnostic accuracy.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146042272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large Language Models in Radiologic Numerical Tasks: A Thorough Evaluation and Error Analysis.
Pub Date : 2026-01-21 DOI: 10.1007/s10278-025-01824-9
Ali Nowroozi, Masha Bondarenko, Adrian Serapio, Tician Schnitzler, Sukhmanjit S Brar, Jae Ho Sohn
The purpose of this study was to investigate the performance of LLMs in radiology numerical tasks and perform a comprehensive error analysis. We defined six tasks: extracting (1) the minimum T-score from a DEXA report, (2) the maximum common bile duct (CBD) diameter from an ultrasound report, and (3) the maximum lung nodule size from a CT report, and judging (1) the presence of a highly hypermetabolic region on a PET report, (2) whether a patient is osteoporotic based on a DEXA report, and (3) whether a patient has a dilated CBD based on an ultrasound report. Reports were extracted from the MIMIC-III and our institution's databases, and the ground truths were extracted manually. The models used were Llama 3.1 8B, DeepSeek-R1 distilled Llama 8B, OpenAI o1-mini, and OpenAI GPT-5-mini. We manually reviewed all incorrect outputs and performed a comprehensive error analysis. In extraction tasks, while Llama showed relatively variable results (ranging from 86% to 98.7%) across tasks, the other models performed consistently well (accuracies > 95%). In judgment tasks, the lowest accuracies of Llama, DeepSeek-distilled Llama, o1-mini, and GPT-5-mini were 62.0%, 91.7%, 91.7%, and 99.0%, respectively, while o1-mini and GPT-5-mini reached 100% accuracy in detecting osteoporosis. We found no mathematical errors in the outputs of o1-mini and GPT-5-mini. An answer-only output format significantly reduced performance in Llama and DeepSeek-distilled Llama but not in o1-mini or GPT-5-mini. To conclude, reinforcement learning (RL) reasoning models perform consistently well in radiology numerical tasks and show no mathematical errors. Simpler non-RL reasoning models may also achieve acceptable performance depending on the task.
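As an illustration of how one such extraction task can be posed and scored, the sketch below uses a hypothetical `call_llm` helper in place of the evaluated models' APIs, parses the returned answer, and compares it against the manually extracted ground truth. The prompt wording and helper names are assumptions, not the study's protocol.

```python
# Sketch of the minimum-T-score extraction task and its scoring (illustrative only).
import re
from typing import Optional

PROMPT = (
    "You are given a DEXA report. Extract the minimum T-score mentioned in the "
    "report. Reply with reasoning followed by 'Answer: <number>'.\n\nReport:\n{report}"
)

def extract_min_tscore(report: str, call_llm) -> Optional[float]:
    reply = call_llm(PROMPT.format(report=report))          # hypothetical chat call
    match = re.search(r"Answer:\s*(-?\d+(?:\.\d+)?)", reply)
    return float(match.group(1)) if match else None

def task_accuracy(reports, ground_truth, call_llm, tol=1e-6):
    # a prediction counts as correct only if it matches the manually extracted value
    hits = 0
    for report, truth in zip(reports, ground_truth):
        pred = extract_min_tscore(report, call_llm)
        hits += int(pred is not None and abs(pred - truth) < tol)
    return hits / len(reports)
```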
{"title":"Large Language Models in Radiologic Numerical Tasks: A Thorough Evaluation and Error Analysis.","authors":"Ali Nowroozi, Masha Bondarenko, Adrian Serapio, Tician Schnitzler, Sukhmanjit S Brar, Jae Ho Sohn","doi":"10.1007/s10278-025-01824-9","DOIUrl":"https://doi.org/10.1007/s10278-025-01824-9","url":null,"abstract":"<p><p>The purpose of this study was to investigate the performance of LLMs in radiology numerical tasks and perform a comprehensive error analysis. We defined six tasks: extracting (1) minimum T-score from DEXA report, (2) maximum common bile duct (CBD) diameter from ultrasound report, and (3) maximum lung nodule size from CT report, and judging (1) presence of a highly hypermetabolic region on a PET report, (2) whether a patient is osteoporotic based on a DEXA report, and (3) whether a patient has a dilated CBD based on an ultrasound report. Reports were extracted from the MIMIC III and our institution's databases, and the ground truths were extracted manually. The models used were Llama 3.1 8b, DeepSeek R1 distilled Llama 8b, OpenAI o1-mini, and OpenAI GPT-5-mini. We manually reviewed all incorrect outputs and performed a comprehensive error analysis. In extraction tasks, while Llama showed relatively variable results (ranging 86%-98.7%) across tasks, other models performed consistently well (accuracies > 95%). In judgment tasks, the lowest accuracies of Llama, DeepSeek distilled Llama, o1-mini, and GPT-5-mini were 62.0%, 91.7%, 91.7%, and 99.0%, respectively, while o1-mini and GPT-5-mini did reach 100% performance in detecting osteoporosis. We found no mathematical errors in the outputs of o1-mini and GPT-5-mini. Answer-only output format significantly reduced performance in Llama and DeepSeek distilled Llama but not in o1-mini or GPT-5-mini. To conclude, reinforcement learning (RL) reasoning models perform consistently well in radiology numerical tasks and show no mathematical errors. Simpler non-RL reasoning models may also achieve acceptable performance depending on the task.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146021266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hemorrhage Segmentation in Fundus Images Using the U-Net 3+ Model: Performance Comparison Across Retinal Regions.
Pub Date : 2026-01-21 DOI: 10.1007/s10278-025-01837-4
Yeong Hun Kang, Young Jae Kim, Kwang Gi Kim
Diabetic retinopathy (DR) is one of the most common complications of diabetes, and timely detection of retinal hemorrhages is essential for preventing vision loss. This study evaluates the U-Net 3+ model for pixel-level hemorrhage segmentation in fundus images and examines its performance across clinically meaningful retinal regions. Model performance was assessed using accuracy, sensitivity, specificity, and Dice score and further analyzed across perivascular and extravascular areas, perifoveal and extrafoveal regions, fovea-centered quadrants, and images stratified by hemorrhage burden. U-Net 3+ achieved strong overall performance, with 99.93% accuracy, 87.03% sensitivity, 99.97% specificity, and an 85.02% Dice score. Higher segmentation accuracy was observed in extravascular regions and within the foveal area, while quadrant-wise performance remained largely consistent. Images with greater hemorrhage burden demonstrated higher segmentation reliability. These findings highlight the importance of region-aware evaluation and suggest that U-Net 3+ can provide clinically meaningful support for automated DR screening. Further validation using larger and multi-center datasets is required to enhance the model's generalizability for real-world clinical deployment.
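The region-stratified analysis can be illustrated by recomputing pixel-level metrics after restricting prediction and ground truth to a region mask (for example, perifoveal versus extrafoveal). The masking scheme below is an assumption for illustration, not the study's code.

```python
# Sketch: pixel-level metrics restricted to a retinal region mask (illustrative).
import numpy as np

def region_metrics(pred, gt, region_mask):
    p = np.logical_and(pred.astype(bool), region_mask)
    g = np.logical_and(gt.astype(bool), region_mask)
    tp = np.logical_and(p, g).sum()
    fp = np.logical_and(p, ~g).sum()
    fn = np.logical_and(~p, g).sum()
    tn = region_mask.sum() - tp - fp - fn          # negatives counted only inside the region
    sensitivity = tp / max(tp + fn, 1)
    specificity = tn / max(tn + fp, 1)
    dice = 2 * tp / max(2 * tp + fp + fn, 1)
    accuracy = (tp + tn) / max(region_mask.sum(), 1)
    return dict(accuracy=accuracy, sensitivity=sensitivity,
                specificity=specificity, dice=dice)

# a circular fovea-centered mask could, for instance, define the perifoveal region:
# yy, xx = np.ogrid[:H, :W]; perifoveal = (yy - cy)**2 + (xx - cx)**2 <= r**2
```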
{"title":"Hemorrhage Segmentation in Fundus Images Using the U-Net 3+ Model: Performance Comparison Across Retinal Regions.","authors":"Yeong Hun Kang, Young Jae Kim, Kwang Gi Kim","doi":"10.1007/s10278-025-01837-4","DOIUrl":"https://doi.org/10.1007/s10278-025-01837-4","url":null,"abstract":"<p><p>Diabetic retinopathy (DR) is one of the most common complications of diabetes, and timely detection of retinal hemorrhages is essential for preventing vision loss. This study evaluates the U-Net3 + model for pixel-level hemorrhage segmentation in fundus images and examines its performance across clinically meaningful retinal regions. Model performance was assessed using accuracy, sensitivity, specificity, and Dice score and further analyzed across perivascular and extravascular areas, perifoveal and extrafoveal regions, fovea-centered quadrants, and images stratified by hemorrhage burden. U-Net3 + achieved strong overall performance, with 99.93% accuracy, 87.03% sensitivity, 99.97% specificity, and an 85.02% Dice score. Higher segmentation accuracy was observed in extravascular regions and within the foveal area, while quadrant-wise performance remained largely consistent. Images with greater hemorrhage burden demonstrated higher segmentation reliability. These findings highlight the importance of region-aware evaluation and suggest that U-Net3 + can provide clinically meaningful support for automated DR screening. Further validation using larger and multi-center datasets is required to enhance the model's generalizability for real-world clinical deployment.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146021222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluation of ChatGPT-4 and Microsoft Copilot for Third-Molar Assessment on Panoramic Radiographs.
Pub Date : 2026-01-21 DOI: 10.1007/s10278-025-01805-y
Thaísa Pinheiro Silva, Maria Fernanda Silva Andrade-Bortoletto, Caio Alencar-Palha, Thaís Santos Cerqueira Ocampo, Christiano Oliveira-Santos, Deborah Queiroz Freitas, Matheus L Oliveira
To assess the performance of two AI chatbot assistants in identifying the presence and classifying the position of third molars on panoramic radiographs. A total of 114 third molars from 100 panoramic radiographs were evaluated consensually by three examiners and independently by two AI chatbot assistants (ChatGPT-4 and Microsoft Copilot). They were asked to provide descriptions regarding the orientation of the third molar's long axis, level of bone inclusion, space between the lower second molar and the mandibular ramus, and proximity of the third molar to the mandibular canal or maxillary sinus. Keywords generated by the AI chatbot assistants were compared to those used by the examiners and scored as 0 (incorrect), 0.5 (partially correct), or 1 (correct). Mean scores and standard deviations were calculated for each parameter and compared using the Wilcoxon test (α = 0.05). Mean scores across the four parameters ranged from 0.08 to 0.30 (SD = 0.42-0.44) for ChatGPT-4 and from 0.25 to 0.31 (SD = 0.42-0.47) for Microsoft Copilot. The only significant difference in performance between the AI chatbots was observed in the space between the lower second molar and ramus, in favor of Microsoft Copilot (p < 0.05). Overall performance scores were 0.22 (SD = 0.42) for ChatGPT-4 and 0.28 (SD = 0.46) for Microsoft Copilot. Furthermore, hallucinations such as classifying absent teeth were also observed. Both ChatGPT-4 and Microsoft Copilot demonstrate generally low performance in accurately identifying and classifying the position of third molars on panoramic radiographs.
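The paired comparison can be reproduced in outline with the Wilcoxon signed-rank test on per-molar scores at α = 0.05. The score arrays below are illustrative values, not the study data.

```python
# Sketch of the paired Wilcoxon comparison of per-molar scores (0 / 0.5 / 1).
import numpy as np
from scipy.stats import wilcoxon

chatgpt_scores = np.array([0, 0.5, 1, 0, 0.5, 0, 1, 0.5])   # illustrative scores
copilot_scores = np.array([0.5, 0.5, 1, 0.5, 1, 0, 1, 1])

stat, p_value = wilcoxon(chatgpt_scores, copilot_scores)
print(f"Wilcoxon statistic={stat:.2f}, p={p_value:.3f}",
      "-> significant difference" if p_value < 0.05 else "-> no significant difference")
```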
{"title":"Evaluation of ChatGPT-4 and Microsoft Copilot for Third-Molar Assessment on Panoramic Radiographs.","authors":"Thaísa Pinheiro Silva, Maria Fernanda Silva Andrade-Bortoletto, Caio Alencar-Palha, Thaís Santos Cerqueira Ocampo, Christiano Oliveira-Santos, Deborah Queiroz Freitas, Matheus L Oliveira","doi":"10.1007/s10278-025-01805-y","DOIUrl":"https://doi.org/10.1007/s10278-025-01805-y","url":null,"abstract":"<p><p>To assess the performance of two AI chatbot assistants in identifying the presence and classifying the position of third molars on panoramic radiographs. A total of 114 third molars from 100 panoramic radiographs were evaluated consensually by three examiners and independently by two AI chatbot assistants (ChatGPT-4 and Microsoft Copilot). They were asked to provide descriptions regarding the orientation of the third molar's long axis, level of bone inclusion, space between the lower second molar and the mandibular ramus, and proximity of the third molar to the mandibular canal or maxillary sinus. Keywords generated by the AI chatbot assistants were compared to those used by the examiners and scored as 0 (incorrect), 0.5 (partially correct), or 1 (correct). Mean scores and standard deviations were calculated for each parameter and compared using the Wilcoxon test (α = 0.05). Mean scores across the four parameters ranged from 0.08 to 0.30 (SD = 0.42-0.44) for ChatGPT-4 and from 0.25 to 0.31 (SD = 0.42-0.47) for Microsoft Copilot. The only significant difference in performance between the AI chatbots was observed in the space between the lower second molar and ramus, in favor of Microsoft Copilot (p < 0.05). Overall performance scores were 0.22 (SD = 0.42) for ChatGPT-4 and 0.28 (SD = 0.46) for Microsoft Copilot. Furthermore, hallucinations such as classifying absent teeth were also observed. Both ChatGPT-4 and Microsoft Copilot demonstrate generally low performance in accurately identifying and classifying the position of third molars on panoramic radiographs.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146013979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison of Three Automated Measurements of Ventricular Volumes of the Brain.
Pub Date : 2026-01-20 DOI: 10.1007/s10278-025-01819-6
Xue Zhang, Yonggang Li, Ping Mu, Xiaotong Yu, Xiu Zhang, Shulin Liu, Baishi Wang, Ning Li, Fu Ren
Accurate segmentation and measurement of ventricular volume are critical for neuroscience research and neurological disease diagnosis. In resource-limited settings, free and open-source automated tools offer accessible solutions. However, the lack of comparative evaluations limits their application. This study aims to identify reliable free tools for automated ventricular volume measurement to support clinical and research utilization. Magnetic resonance imaging (MRI) data from 150 healthy adults were collected with informed consent. Ventricular volumes were segmented using three open-source tools (3D Slicer, FreeSurfer, ITK-SNAP) and compared with manual segmentation as the reference standard. Pearson's correlation coefficient, intraclass correlation coefficient (ICC), and the Bland-Altman analysis were employed to evaluate consistency and reliability. All three automated tools showed significant correlations with manual measurements (P < 0.01). ITK-SNAP had the highest Pearson correlation and ICC values, followed by 3D Slicer, while FreeSurfer had the lowest. All tools demonstrated strong reliability, with ICCs greater than 0.9. The Bland-Altman analysis showed that ITK-SNAP had the closest consistency with manual results, again followed by 3D Slicer, with FreeSurfer performing least consistently. ITK-SNAP demonstrates higher accuracy and reliability for ventricular volumetry compared to 3D Slicer and FreeSurfer. Its open-source nature supports broader implementation in resource-constrained environments, enhancing neuroimaging accessibility for clinical and research applications.
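For reference, the Pearson correlation and the Bland-Altman summary (bias and 95% limits of agreement) between an automated tool and manual volumetry can be computed as sketched below. The volume values are illustrative only, and the ICC is not reproduced here.

```python
# Sketch of the agreement analysis: Pearson correlation plus Bland-Altman bias and LoA.
import numpy as np
from scipy.stats import pearsonr

manual = np.array([22.1, 25.4, 19.8, 30.2, 27.5])     # ventricular volumes, mL (illustrative)
automated = np.array([21.7, 26.0, 19.5, 29.8, 28.1])  # e.g. one tool's output

r, p = pearsonr(manual, automated)

diff = automated - manual
bias = diff.mean()
loa = 1.96 * diff.std(ddof=1)   # 95% limits of agreement: bias ± 1.96 * SD of differences
print(f"Pearson r={r:.3f} (p={p:.4f}); bias={bias:.2f} mL, "
      f"LoA=({bias - loa:.2f}, {bias + loa:.2f}) mL")
```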
{"title":"Comparison of Three Automated Measurements of Ventricular Volumes of the Brain.","authors":"Xue Zhang, Yonggang Li, Ping Mu, Xiaotong Yu, Xiu Zhang, Shulin Liu, Baishi Wang, Ning Li, Fu Ren","doi":"10.1007/s10278-025-01819-6","DOIUrl":"https://doi.org/10.1007/s10278-025-01819-6","url":null,"abstract":"<p><p>Accurate segmentation and measurement of ventricular volume are critical for neuroscience research and neurological disease diagnosis. In resource-limited settings, free and open-source automated tools offer accessible solutions. However, the lack of comparative evaluations limits their application. This study aims to identify reliable free tools for automated ventricular volume measurement to support clinical and research utilization. Magnetic resonance imaging (MRI) data from 150 healthy adults were collected with informed consent. Ventricular volumes were segmented using three open-source tools (3D Slicer, FreeSurfer, ITK-SNAP) and compared with manual segmentation as the reference standard. Pearson's correlation coefficient, intraclass correlation coefficient (ICC), and the Bland-Altman analysis were employed to evaluate consistency and reliability. All three automated tools showed significant correlations with manual measurements (P < 0.01). ITK-SNAP had the highest Pearson correlation and ICC values, followed by 3D Slicer, while FreeSurfer had the lowest. All tools demonstrated strong reliability, with ICCs greater than 0.9. The Bland-Altman analysis showed that ITK-SNAP had the closest consistency with manual results, again followed by 3D Slicer, with FreeSurfer performing least consistently. ITK-SNAP demonstrates higher accuracy and reliability for ventricular volumetry compared to 3D Slicer and FreeSurfer. Its open-source nature supports broader implementation in resource-constrained environments, enhancing neuroimaging accessibility for clinical and research applications.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146013183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting HFA 30-2 Visual Fields with Deep Learning from Multimodal OCT-Fundus Feature Fusion and Structure-Function Discordance Analysis.
Pub Date : 2026-01-20 DOI: 10.1007/s10278-025-01798-8
İlknur Tuncer Fırat, Murat Fırat, Haci Erbali, Taner Tuncer
Glaucoma is a leading cause of irreversible vision loss. During clinical follow-up, visual field (VF) testing (Humphrey Field Analyzer 30-2) assesses functional loss, while optical coherence tomography (OCT) and fundus imaging provide structural information. However, VF measurements can be subjective, show test-retest variability, and sometimes exhibit structure-function discordance (SFD). Therefore, predicting VF values from structural images may support clinical decision-making. The aim of this study was to estimate Humphrey 30-2 measures (mean deviation (MD), pattern standard deviation (PSD), and point-wise threshold sensitivity (TS)) in glaucoma/ocular hypertension (OHT) using a ViT-B/32-based feature-fusion approach on OCT and fundus images, and to examine the effect of SFD via sensitivity analysis. Visual features were extracted from color optic disc photographs, red-free fundus images, retinal nerve fiber layer (RNFL) thickness maps, and circular RNFL plots using Vision Transformer (ViT-B/32)-based models. These features were combined with demographic and clinical data to form a multimodal artificial intelligence model. Global VF indices (MD, PSD) were estimated with probabilistic regression that accounts for uncertainty, and point-wise TS values were predicted using a location-aware network. In a separate analysis, eyes exhibiting SFD were identified and excluded to assess model performance under OCT-VF concordance. Mean absolute errors (MAEs) were 2.26, 1.42, and 2.96 dB for MD, PSD, and mean TS, respectively, and the proportions within ± 2 dB were 59.65%, 75.44%, and 57.90%. After excluding SFD eyes, MAEs decreased to 1.82, 1.30, and 2.12 dB for MD, PSD, and mean TS, respectively, and the proportions within ± 2 dB increased to 66.7%, 76.5%, and 62.7%. These findings indicate that discordance affects performance and that predictions are more reliable in clinically concordant cases. ViT-B/32-based deep feature fusion offers clinically meaningful accuracy for predicting VF metrics from multimodal structural images. SFD was frequently detected among the lowest-performing cases, and this possibility should be considered when interpreting low-performing outputs.
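A probabilistic regression head of the kind described, predicting a mean and a variance per global index (MD, PSD) and trained with a Gaussian negative log-likelihood, could be sketched as follows. The architecture details and dimensions are assumptions, not the authors' implementation.

```python
# Sketch of an uncertainty-aware regression head trained with Gaussian NLL (assumed design).
import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    def __init__(self, in_dim: int, n_targets: int = 2):    # e.g. MD and PSD
        super().__init__()
        self.mean = nn.Linear(in_dim, n_targets)
        self.log_var = nn.Linear(in_dim, n_targets)          # predicted uncertainty

    def forward(self, fused_features):
        mu = self.mean(fused_features)
        var = torch.exp(self.log_var(fused_features))         # keep variance positive
        return mu, var

head = ProbabilisticHead(in_dim=512)
features = torch.randn(8, 512)                # fused ViT-B/32 + clinical features (assumed dim)
target = torch.randn(8, 2)                    # measured MD / PSD in dB
mu, var = head(features)
loss = nn.GaussianNLLLoss()(mu, target, var)  # error penalized relative to predicted variance
```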
{"title":"Predicting HFA 30-2 Visual Fields with Deep Learning from Multimodal OCT-Fundus Feature Fusion and Structure-Function Discordance Analysis.","authors":"İlknur Tuncer Fırat, Murat Fırat, Haci Erbali, Taner Tuncer","doi":"10.1007/s10278-025-01798-8","DOIUrl":"https://doi.org/10.1007/s10278-025-01798-8","url":null,"abstract":"<p><p>Glaucoma is a leading cause of irreversible vision loss. During clinical follow-up, visual field (VF) tests (Humphrey Field Analyzer 30-2) assesses functional loss, while optical coherence tomography (OCT) and fundus imaging provide structural information. However, VF measurement can be subjective, exhibit test-retest variability, and sometimes exhibit structure-function discordance (SFD). Therefore, predicting VF values from structural images may support clinical decision-making. To estimate Humphrey 30-2 measures (mean deviation (MD), pattern standard deviation (PSD), and point-wise threshold sensitivity (TS)) in glaucoma/ocular hypertension (OHT) using a ViT-B/32-based feature-fusion approach on OCT and fundus images, and to examine the effect of SFD via sensitivity analysis. Visual features were extracted from color optic disc photographs, red-free fundus images, retinal nerve fiber layer (RNFL) thickness map, and circular RNFL plots using Vision Transformer (ViT-B/32)-based models. These features were combined with demographic and clinical data to form a multimodal artificial intelligence model. Global VF indices (MD, PSD) were estimated with probabilistic regression that accounts for uncertainty, and point-wise TS values were predicted using a location-aware network. In a separate analysis, eyes exhibiting SFD were identified and excluded to assess model performance under OCT-VF concordance. Mean absolute errors (MAE) were 2.26, 1.42, and 2.96 dB for MD, PSD, and mean TS, respectively, and the proportions within ± 2 dB were 59.65%, 75.44%, and 57.90%. After excluding SFD eyes, MAEs decreased to 1.82, 1.30, and 2.12 dB for MD, PSD, and mean TS, respectively; the proportions within ± 2 dB increased to 66.7%, 76.5% and 62.7%, respectively. These findings indicate that discordance affects performance and that predictions are more reliable in clinically concordant cases. ViT-B/32-based deep feature fusion offers clinically meaningful accuracy for predicting VF metrics from multimodal structural images. SFD was frequently detected among the lowest-performing cases, and this possibility should be considered when interpreting low-performing outputs.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146013956","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multimodal Fusion and Transfer Learning for the Detection of Degenerative Parkinsonisms with Dopamine Transporter SPECT Imaging.
Pub Date : 2026-01-20 DOI: 10.1007/s10278-025-01831-w
Valentin Durand de Gevigney, Nicolas Nicastro, Valentina Garibotto, Jérôme Schmid
Dopamine transporter (DAT) SPECT is a validated biomarker for Parkinson's disease (PD) and related degenerative parkinsonisms. Interpretation relies on visual assessment supported by striatal image features such as striatal binding ratios (SBRs). Deep learning can aid this process but often underuses complementary data, lacks robustness to heterogeneous inputs, and offers limited interpretability. We built an end-to-end multimodal framework that encodes DAT images and scalar data (patient age and striatal image features) using a vision transformer and a multilayer perceptron, respectively. A transformer-based fusion module then combined the encoded representations while accommodating missing inputs. Interpretability was provided through modality-level attention, spatial attention maps, occlusion analysis, and scalar feature saliency. Performance was evaluated on 664 Parkinson's Progression Markers Initiative (PPMI) cases and two local datasets A (N = 228) and B (N = 530) from different devices, including PD, atypical parkinsonisms, and non-degenerative subjects. Transfer learning involved pretraining on two datasets and fine-tuning on the third. On PPMI, the model reached 97.4% AUROC, 95.5% accuracy, 97.0% sensitivity, and 91.9% specificity, matching state-of-the-art performance. Results were similar on dataset B (98.6% AUROC) but lower on dataset A (92.6% AUROC), likely due to its smaller size and reduced image quality. Explainability analyses showed the model focused on clinically relevant striatal regions and identified key scalar features such as putamen SBR and asymmetry. The fusion module also supported stable predictions despite missing data. Our method efficiently combined multimodal data across heterogeneous datasets, even when some modalities were missing. Integrated explainability tools revealed clinically meaningful attention patterns, which is expected to favor its adoption.
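A transformer-style fusion step that tokenizes each modality and substitutes a learned placeholder for missing inputs could look roughly like the sketch below. The dimensions and masking scheme are assumptions for illustration, not the authors' implementation.

```python
# Sketch of a fusion module over modality tokens with missing-input handling (assumed design).
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, dim=256, n_heads=4, n_layers=2, n_modalities=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.missing_token = nn.Parameter(torch.zeros(n_modalities, dim))  # placeholder for absent inputs
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, 2)   # degenerative parkinsonism vs. not

    def forward(self, tokens, present_mask):
        # tokens: (N, M, dim); present_mask: (N, M) bool, False where a modality is missing
        tokens = torch.where(present_mask.unsqueeze(-1), tokens,
                             self.missing_token.expand_as(tokens))
        x = torch.cat([self.cls.expand(tokens.size(0), -1, -1), tokens], dim=1)
        x = self.encoder(x)
        return self.head(x[:, 0])       # classify from the CLS token

fusion = FusionModule()
img_emb = torch.randn(4, 256)           # from the vision transformer
scalar_emb = torch.randn(4, 256)        # from the MLP over age + striatal features
tokens = torch.stack([img_emb, scalar_emb], dim=1)
mask = torch.tensor([[True, True], [True, False], [True, True], [False, True]])
logits = fusion(tokens, mask)
```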
{"title":"Multimodal Fusion and Transfer Learning for the Detection of Degenerative Parkinsonisms with Dopamine Transporter SPECT Imaging.","authors":"Valentin Durand de Gevigney, Nicolas Nicastro, Valentina Garibotto, Jérôme Schmid","doi":"10.1007/s10278-025-01831-w","DOIUrl":"https://doi.org/10.1007/s10278-025-01831-w","url":null,"abstract":"<p><p>Dopamine transporter (DAT) SPECT is a validated biomarker for Parkinson's disease (PD) and related degenerative parkinsonisms. Interpretation relies on visual assessment supported by striatal image features such as striatal binding ratios (SBRs). Deep learning can aid this process but often underuses complementary data, lacks robustness to heterogeneous inputs, and offers limited interpretability. We built an end-to-end multimodal framework that encodes DAT images and scalar data (patient age and striatal image features) using, respectively, a vision transformer and a multilayer perceptron. A transformer-based fusion module then combined the encoded representations, while tackling possible missing inputs. Interpretability was provided through modality-level attention, spatial attention maps, occlusion analysis, and scalar feature saliency. Performance was evaluated on 664 Parkinson's Progression Markers Initiative (PPMI) cases and two local datasets A (N = 228) and B (N = 530) from different devices, including PD, atypical parkinsonisms, and non-degenerative subjects. Transfer learning involved pretraining on two datasets and finetuning on the third. On PPMI, the model reached 97.4% AUROC, 95.5% accuracy, 97.0% sensitivity, and 91.9% specificity, matching state-of-the-art performance. Results were similar on dataset B (98.6% AUROC) but lower on dataset A (92.6% AUROC), likely due to its smaller size and reduced image quality. Explainability analyses showed the model focused on clinically relevant striatal regions and identified key scalar features such as putamen SBR and asymmetry. The fusion module also supported stable predictions despite missing data. Our method efficiently combined multimodal data with heterogeneous datasets and partial multimodal data. Integrated explainability tools showed clinically meaningful attention that is expected to favor its adoption.</p>","PeriodicalId":516858,"journal":{"name":"Journal of imaging informatics in medicine","volume":" ","pages":""},"PeriodicalIF":0.0,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146013941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}