LLM-Based Pose Normalization and Multimodal Fusion for Facial Expression Recognition in Extreme Poses
Pub Date: 2026-01-04, DOI: 10.3390/jimaging12010024
Bohan Chen, Bowen Qu, Yu Zhou, Han Huang, Jianing Guo, Yanning Xian, Longxiang Ma, Jinxuan Yu, Jingyu Chen
Facial expression recognition (FER) technology has progressively matured over time. However, existing FER methods are primarily optimized for frontal face images, and their recognition accuracy degrades significantly when processing profile or large-angle rotated facial images, which hinders the practical deployment of FER systems. To mitigate the interference caused by large pose variations and improve recognition accuracy, we propose an FER method based on profile-to-frontal transformation and multimodal learning. Specifically, we first leverage the visual understanding and generation capabilities of Qwen-Image-Edit to transform profile images to frontal viewpoints, preserving key expression features while standardizing facial poses. Second, we introduce the CLIP model to enhance the semantic representation of expression features through vision-language joint learning. Qualitative and quantitative experiments on the RAF (89.39%), EXPW (67.17%), and AffectNet-7 (62.66%) datasets demonstrate that our method outperforms existing approaches.
{"title":"LLM-Based Pose Normalization and Multimodal Fusion for Facial Expression Recognition in Extreme Poses.","authors":"Bohan Chen, Bowen Qu, Yu Zhou, Han Huang, Jianing Guo, Yanning Xian, Longxiang Ma, Jinxuan Yu, Jingyu Chen","doi":"10.3390/jimaging12010024","DOIUrl":"10.3390/jimaging12010024","url":null,"abstract":"<p><p>Facial expression recognition (FER) technology has progressively matured over time. However, existing FER methods are primarily optimized for frontal face images, and their recognition accuracy significantly degrades when processing profile or large-angle rotated facial images. Consequently, this limitation hinders the practical deployment of FER systems. To mitigate the interference caused by large pose variations and improve recognition accuracy, we propose a FER method based on profile-to-frontal transformation and multimodal learning. Specifically, we first leverage the visual understanding and generation capabilities of Qwen-Image-Edit that transform profile images to frontal viewpoints, preserving key expression features while standardizing facial poses. Second, we introduce the CLIP model to enhance the semantic representation capability of expression features through vision-language joint learning. The qualitative and quantitative experiments on the RAF (89.39%), EXPW (67.17%), and AffectNet-7 (62.66%) datasets demonstrate that our method outperforms the existing approaches.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842521/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparative Evaluation of Vision-Language Models for Detecting and Localizing Dental Lesions from Intraoral Images
Pub Date: 2026-01-03, DOI: 10.3390/jimaging12010022
Maria Jahan, Al Ibne Siam, Lamim Zakir Pronay, Saif Ahmed, Nabeel Mohammed, James Dudley, Taseef Hasan Farook
This study assessed the efficiency of vision-language models in detecting and classifying carious and non-carious lesions from intraoral photographic imaging. A dataset of 172 annotated images was labeled for microcavitation, cavitated lesions, staining, calculus, and non-carious lesions, and the Florence-2, PaLI-Gemma, and YOLOv8 models were trained on it. The dataset was divided into an 80:10:10 split, and model performance was evaluated using mean average precision (mAP), mAP50-95, and class-specific precision and recall. YOLOv8 outperformed the vision-language models, achieving an mAP of 37% with a precision of 42.3% (100% for cavitation detection) and a recall of 31.3%. PaLI-Gemma produced recall values of 13% and 21%. Florence-2 yielded an mAP of 10%, with precision and recall of 51% and 35%, respectively. YOLOv8 achieved the strongest overall performance. Florence-2 and PaLI-Gemma underperformed relative to YOLOv8 despite their potential for multimodal contextual understanding, highlighting the need for larger, more diverse datasets and hybrid architectures to achieve improved performance.
{"title":"Comparative Evaluation of Vision-Language Models for Detecting and Localizing Dental Lesions from Intraoral Images.","authors":"Maria Jahan, Al Ibne Siam, Lamim Zakir Pronay, Saif Ahmed, Nabeel Mohammed, James Dudley, Taseef Hasan Farook","doi":"10.3390/jimaging12010022","DOIUrl":"10.3390/jimaging12010022","url":null,"abstract":"<p><p>To assess the efficiency of vision-language models in detecting and classifying carious and non-carious lesions from intraoral photo imaging. A dataset of 172 annotated images were classified for microcavitation, cavitated lesions, staining, calculus, and non-carious lesions. Florence-2, PaLI-Gemma, and YOLOv8 models were trained on the dataset and model performance. The dataset was divided into 80:10:10 split, and the model performance was evaluated using mean average precision (mAP), mAP50-95, class-specific precision and recall. YOLOv8 outperformed the vision-language models, achieving a mean average precision (mAP) of 37% with a precision of 42.3% (with 100% for cavitation detection) and 31.3% recall. PaLI-Gemma produced a recall of 13% and 21%. Florence-2 yielded a mean average precision of 10% with a precision and recall was 51% and 35%. YOLOv8 achieved the strongest overall performance. Florence-2 and PaLI-Gemma models underperformed relative to YOLOv8 despite the potential for multimodal contextual understanding, highlighting the need for larger, more diverse datasets and hybrid architectures to achieve improved performance.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842643/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054239","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-Temporal Shoreline Monitoring and Analysis in Bangkok Bay, Thailand, Using Remote Sensing and GIS Techniques
Pub Date: 2026-01-01, DOI: 10.3390/jimaging12010021
Yan Wang, Adisorn Sirikham, Jessada Konpang, Chunguang Li
Drastic alterations have been observed in the coastline of Bangkok Bay, Thailand, over the past three decades. Understanding how coastlines change plays a key role in developing strategies for coastal protection and sustainable resource utilization. This study investigates the temporal and spatial changes in the Bangkok Bay coastline from 1989 to 2024 using remote sensing and GIS techniques. The historical rate of coastline change for a typical segment was analyzed using the end point rate (EPR) method, and the underlying causes of these changes were discussed, yielding the variation trend of the total shoreline length and the erosion and sedimentation characteristics of a typical shoreline segment over the past 35 years. An overall increase in coastline length was observed over the 35-year period, from 507.23 km to 571.38 km. The rate of growth transitioned from rapid to slow, with the most significant changes occurring during 1989-1994. The average and maximum erosion rates for the typical shoreline segment were also notably high during 1989-1994, at -21.61 m/a and -55.49 m/a, respectively. The maximum sedimentation rate along the coastline was relatively high from 2014 to 2024, reaching 10.57 m/a. Overall, the entire coastline of the Samut Sakhon-Bangkok-Samut Prakan Provinces underwent net erosion from 1989 to 2024, driven by a confluence of natural and anthropogenic factors.
{"title":"Multi-Temporal Shoreline Monitoring and Analysis in Bangkok Bay, Thailand, Using Remote Sensing and GIS Techniques.","authors":"Yan Wang, Adisorn Sirikham, Jessada Konpang, Chunguang Li","doi":"10.3390/jimaging12010021","DOIUrl":"10.3390/jimaging12010021","url":null,"abstract":"<p><p>Drastic alterations have been observed in the coastline of Bangkok Bay, Thailand, over the past three decades. Understanding how coastlines change plays a key role in developing strategies for coastal protection and sustainable resource utilization. This study investigates the temporal and spatial changes in the Bangkok Bay coastline, Thailand, using remote sensing and GIS techniques from 1989 to 2024. The historical rate of coastline change for a typical segment was analyzed using the EPR method, and the underlying causes of these changes were discussed. Finally, the variation trend of the total shoreline length and the characteristics of erosion and sedimentation for a typical shoreline in Bangkok Bay, Thailand, over the past 35 years were obtained. An overall increase in coastline length was observed in Bangkok Bay, Thailand, over the 35-year period from 1989 to 2024, with a net gain from 507.23 km to 571.38 km. The rate of growth has transitioned from rapid to slow, with the most significant changes occurring during the period 1989-1994. Additionally, the average and maximum erosion rates for the typical shoreline segment were notably high during 1989-1994, with values of -21.61 m/a and -55.49 m/a, respectively. The maximum sedimentation rate along the coastline was relatively high from 2014 to 2024, reaching 10.57 m/a. Overall, the entire coastline of the Samut Sakhon-Bangkok-Samut Prakan Provinces underwent net erosion from 1989 to 2024, driven by a confluence of natural and anthropogenic factors.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842628/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Object Detection on Road: Vehicle's Detection Based on Re-Training Models on NVIDIA-Jetson Platform
Pub Date: 2026-01-01, DOI: 10.3390/jimaging12010020
Sleiter Ramos-Sanchez, Jinmi Lezama, Ricardo Yauri, Joyce Zevallos
The increasing use of artificial intelligence (AI) and deep learning (DL) techniques has driven advances in vehicle classification and detection applications for embedded devices, whose deployment is constrained by computational cost and response time. In urban environments with high traffic congestion, such as the city of Lima, it is important to determine the trade-off between model accuracy, the type of embedded system, and the dataset used. This study followed a methodology adapted from the CRISP-DM approach, which included the acquisition of traffic videos in the city of Lima, their segmentation, and manual labeling. Subsequently, three SSD-based detection models (MobileNetV1-SSD, MobileNetV2-SSD-Lite, and VGG16-SSD) were trained on the NVIDIA Jetson Orin NX 16 GB platform. The results show that the VGG16-SSD model achieved the highest average precision (mAP ≈ 90.7%) at the cost of a longer training time, while the MobileNetV1-SSD (512×512) model achieved comparable performance (mAP ≈ 90.4%) in less time. Additionally, data augmentation through contrast adjustment improved the detection of minority classes such as Tuk-tuk and Motorcycle. Among the evaluated models, MobileNetV1-SSD (512×512) achieved the best balance between accuracy and computational load for implementation in ADAS embedded systems in congested urban environments.
{"title":"Object Detection on Road: Vehicle's Detection Based on Re-Training Models on NVIDIA-Jetson Platform.","authors":"Sleiter Ramos-Sanchez, Jinmi Lezama, Ricardo Yauri, Joyce Zevallos","doi":"10.3390/jimaging12010020","DOIUrl":"10.3390/jimaging12010020","url":null,"abstract":"<p><p>The increasing use of artificial intelligence (AI) and deep learning (DL) techniques has driven advances in vehicle classification and detection applications for embedded devices with deployment constraints due to computational cost and response time. In the case of urban environments with high traffic congestion, such as the city of Lima, it is important to determine the trade-off between model accuracy, type of embedded system, and the dataset used. This study was developed using a methodology adapted from the CRISP-DM approach, which included the acquisition of traffic videos in the city of Lima, their segmentation, and manual labeling. Subsequently, three SSD-based detection models (MobileNetV1-SSD, MobileNetV2-SSD-Lite, and VGG16-SSD) were trained on the NVIDIA Jetson Orin NX 16 GB platform. The results show that the VGG16-SSD model achieved the highest average precision (mAP ≈90.7%), with a longer training time, while the MobileNetV1-SSD (512×512) model achieved comparable performance (mAP ≈90.4%) with a shorter time. Additionally, data augmentation through contrast adjustment improved the detection of minority classes such as Tuk-tuk and Motorcycle. The results indicate that, among the evaluated models, MobileNetV1-SSD (512×512) achieved the best balance between accuracy and computational load for its implementation in ADAS embedded systems in congested urban environments.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842717/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054242","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Double-Gated Mamba Multi-Scale Adaptive Feature Learning Network for Unsupervised Single RGB Image Hyperspectral Image Reconstruction
Pub Date: 2025-12-31, DOI: 10.3390/jimaging12010019
Zhongmin Jiang, Zhen Wang, Wenju Wang, Jifan Zhu
Existing methods for reconstructing hyperspectral images from single RGB images struggle because large numbers of labeled RGB-HSI image pairs are difficult to obtain, and they face issues such as detail loss, insufficient robustness, low reconstruction accuracy, and difficulty in balancing the spatial-spectral trade-off. To address these challenges, a Double-Gated Mamba Multi-Scale Adaptive Feature (DMMAF) learning network model is proposed. DMMAF designs a reflection dot-product adaptive dual-noise-aware feature extraction method, which supplements edge detail information in spectral images and improves robustness. DMMAF also constructs a deformable attention-based global feature extraction method and a double-gated Mamba local feature extraction approach, enhancing the interaction between local and global information during reconstruction and thereby improving image accuracy. Meanwhile, DMMAF introduces a structure-aware smooth loss function, which combines smoothing, curvature, and attention supervision losses to effectively resolve the spatial-spectral resolution balance problem. Experiments on the NTIRE 2020, Harvard, and CAVE datasets demonstrate that the model achieves state-of-the-art unsupervised reconstruction performance compared with existing advanced algorithms. On the NTIRE 2020 dataset, our method attains MRAE, RMSE, and PSNR values of 0.133, 0.040, and 31.314, respectively. On the Harvard dataset, it achieves RMSE and PSNR values of 0.025 and 34.955, respectively, while on the CAVE dataset, it achieves RMSE and PSNR values of 0.041 and 30.983.
{"title":"Double-Gated Mamba Multi-Scale Adaptive Feature Learning Network for Unsupervised Single RGB Image Hyperspectral Image Reconstruction.","authors":"Zhongmin Jiang, Zhen Wang, Wenju Wang, Jifan Zhu","doi":"10.3390/jimaging12010019","DOIUrl":"10.3390/jimaging12010019","url":null,"abstract":"<p><p>Existing methods for reconstructing hyperspectral images from single RGB images struggle to obtain a large number of labeled RGB-HSI paired images. These methods face issues such as detail loss, insufficient robustness, low reconstruction accuracy, and the difficulty of balancing the spatial-spectral trade-off. To address these challenges, a Double-Gated Mamba Multi-Scale Adaptive Feature (DMMAF) learning network model is proposed. DMMAF designs a reflection dot-product adaptive dual-noise-aware feature extraction method, which is used to supplement edge detail information in spectral images and improve robustness. DMMAF also constructs a deformable attention-based global feature extraction method and a double-gated Mamba local feature extraction approach, enhancing the interaction between local and global information during the reconstruction process, thereby improving image accuracy. Meanwhile, DMMAF introduces a structure-aware smooth loss function, which, by combining smoothing, curvature, and attention supervision losses, effectively resolves the spatial-spectral resolution balance problem. This network model is applied to three datasets-NTIRE 2020, Harvard, and CAVE-achieving state-of-the-art unsupervised reconstruction performance compared to existing advanced algorithms. Experiments on the NTIRE 2020, Harvard, and CAVE datasets demonstrate that this model achieves state-of-the-art unsupervised reconstruction performance. On the NTIRE 2020 dataset, our method attains MRAE, RMSE, and PSNR values of 0.133, 0.040, and 31.314, respectively. On the Harvard dataset, it achieves RMSE and PSNR values of 0.025 and 34.955, respectively, while on the CAVE dataset, it achieves RMSE and PSNR values of 0.041 and 30.983, respectively.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2025-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12843217/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054122","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Advancing Medical Decision-Making with AI: A Comprehensive Exploration of the Evolution from Convolutional Neural Networks to Capsule Networks
Pub Date: 2025-12-30, DOI: 10.3390/jimaging12010017
Ichrak Khoulqi, Zakariae El Ouazzani
In this paper, we present a literature review of two deep learning architectures, Convolutional Neural Networks (CNNs) and Capsule Networks (CapsNets), applied to medical image analysis for medical decision support. CNNs have demonstrated their capacity in the medical diagnostic field; however, their reliability decreases under slight spatial variability, which can affect diagnosis, especially since the anatomical structure of the human body differs from one patient to another. In contrast, CapsNets encode not only feature activations but also spatial relationships, improving the reliability and stability of model generalization. We provide a structured comparison by reviewing studies published from 2018 to 2025 across major databases, including IEEE Xplore, ScienceDirect, SpringerLink, and MDPI. The reviewed applications are based on the benchmark datasets BraTS, INbreast, ISIC, and COVIDx. The review compares the core architectural principles, performance, and interpretability of both architectures. We conclude by underlining the complementary roles of these two architectures in medical decision-making and proposing future directions toward hybrid, explainable, and computationally efficient deep learning systems for real clinical environments, thereby helping to detect diseases at an early stage and increase survival rates.
{"title":"Advancing Medical Decision-Making with AI: A Comprehensive Exploration of the Evolution from Convolutional Neural Networks to Capsule Networks.","authors":"Ichrak Khoulqi, Zakariae El Ouazzani","doi":"10.3390/jimaging12010017","DOIUrl":"10.3390/jimaging12010017","url":null,"abstract":"<p><p>In this paper, we propose a literature review regarding two deep learning architectures, namely Convolutional Neural Networks (CNNs) and Capsule Networks (CapsNets), applied to medical images, in order to analyze them to help in medical decision support. CNNs demonstrate their capacity in the medical diagnostic field; however, their reliability decreases when there is slight spatial variability, which can affect diagnosis, especially since the anatomical structure of the human body can differ from one patient to another. In contrast, CapsNets encode not only feature activation but also spatial relationships, hence improving the reliability and stability of model generalization. This paper proposes a structured comparison by reviewing studies published from 2018 to 2025 across major databases, including IEEE Xplore, ScienceDirect, SpringerLink, and MDPI. The applications in the reviewed papers are based on the benchmark datasets BraTS, INbreast, ISIC, and COVIDx. This paper review compares the core architectural principles, performance, and interpretability of both architectures. To conclude the paper, we underline the complementary roles of these two architectures in medical decision-making and propose future directions toward hybrid, explainable, and computationally efficient deep learning systems for real clinical environments, thereby increasing survival rates by helping prevent diseases at an early stage.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2025-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842556/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Revisiting Underwater Image Enhancement for Object Detection: A Unified Quality-Detection Evaluation Framework
Pub Date: 2025-12-30, DOI: 10.3390/jimaging12010018
Ali Awad, Ashraf Saleem, Sidike Paheding, Evan Lucas, Serein Al-Ratrout, Timothy C Havens
Underwater images often suffer from severe color distortion, low contrast, and reduced visibility, motivating the widespread use of image enhancement as a preprocessing step for downstream computer vision tasks. However, recent studies have questioned whether enhancement actually improves object detection performance. In this work, we conduct a comprehensive and rigorous evaluation of nine state-of-the-art enhancement methods and their interactions with modern object detectors. We propose a unified evaluation framework that integrates (1) a distribution-level quality assessment using a composite quality index (Q-index), (2) a fine-grained per-image detection protocol based on COCO-style mAP, and (3) a mixed-set upper-bound analysis that quantifies the theoretical performance achievable through ideal selective enhancement. Our findings reveal that traditional image quality metrics do not reliably predict detection performance, and that dataset-level conclusions often overlook substantial image-level variability. Through per-image evaluation, we identify numerous cases in which enhancement significantly improves detection accuracy (primarily for low-quality inputs) while also demonstrating conditions under which enhancement degrades performance. The mixed-set analysis shows that selective enhancement can yield substantial gains over both original and fully enhanced datasets, establishing a new direction for designing enhancement models optimized for downstream vision tasks. This study provides the most comprehensive evidence to date that underwater image enhancement can be beneficial for object detection when evaluated at the appropriate granularity and guided by informed selection strategies. The data generated and code developed are publicly available.
{"title":"Revisiting Underwater Image Enhancement for Object Detection: A Unified Quality-Detection Evaluation Framework.","authors":"Ali Awad, Ashraf Saleem, Sidike Paheding, Evan Lucas, Serein Al-Ratrout, Timothy C Havens","doi":"10.3390/jimaging12010018","DOIUrl":"10.3390/jimaging12010018","url":null,"abstract":"<p><p>Underwater images often suffer from severe color distortion, low contrast, and reduced visibility, motivating the widespread use of image enhancement as a preprocessing step for downstream computer vision tasks. However, recent studies have questioned whether enhancement actually improves object detection performance. In this work, we conduct a comprehensive and rigorous evaluation of nine state-of-the-art enhancement methods and their interactions with modern object detectors. We propose a unified evaluation framework that integrates (1) a distribution-level quality assessment using a composite quality index (Q-index), (2) a fine-grained per-image detection protocol based on COCO-style mAP, and (3) a mixed-set upper-bound analysis that quantifies the theoretical performance achievable through ideal selective enhancement. Our findings reveal that traditional image quality metrics do not reliably predict detection performance, and that dataset-level conclusions often overlook substantial image-level variability. Through per-image evaluation, we identify numerous cases in which enhancement significantly improves detection accuracy-primarily for low-quality inputs-while also demonstrating conditions under which enhancement degrades performance. The mixed-set analysis shows that selective enhancement can yield substantial gains over both original and fully enhanced datasets, establishing a new direction for designing enhancement models optimized for downstream vision tasks. This study provides the most comprehensive evidence to date that underwater image enhancement can be beneficial for object detection when evaluated at the appropriate granularity and guided by informed selection strategies. The data generated and code developed are publicly available.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2025-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842599/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146053600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FluoNeRF: Fluorescent Novel-View Synthesis Under Novel Light Source Colors and Spectra
Pub Date: 2025-12-29, DOI: 10.3390/jimaging12010016
Lin Shi, Kengo Matsufuji, Michitaka Yoshida, Ryo Kawahara, Takahiro Okabe
Synthesizing photo-realistic images of a scene from arbitrary viewpoints and under arbitrary lighting environments is an important research topic in computer vision and graphics. In this paper, we propose a method for synthesizing photo-realistic images of a scene with fluorescent objects from novel viewpoints and under novel lighting colors and spectra. In general, fluorescent materials absorb light at certain wavelengths and then emit light at longer wavelengths than the absorbed ones, in contrast to reflective materials, which preserve the wavelengths of light. Therefore, we cannot reproduce the colors of fluorescent objects under arbitrary lighting colors by combining conventional view synthesis techniques with white balance adjustment of the RGB channels. Accordingly, we extend novel-view synthesis based on neural radiance fields by incorporating the superposition principle of light; our proposed method captures a sparse set of images of a scene from varying viewpoints and under varying lighting colors or spectra with active lighting systems such as a color display or a multi-spectral light stage and then synthesizes photo-realistic images of the scene without explicitly constructing geometric and photometric models of the scene. We conducted a number of experiments using real images captured with an LCD and confirmed that our method outperforms existing methods. Moreover, we showed that extending our method to more than three primary colors with a light stage enables us to reproduce the colors of fluorescent objects under common light sources.
{"title":"FluoNeRF: Fluorescent Novel-View Synthesis Under Novel Light Source Colors and Spectra.","authors":"Lin Shi, Kengo Matsufuji, Michitaka Yoshida, Ryo Kawahara, Takahiro Okabe","doi":"10.3390/jimaging12010016","DOIUrl":"10.3390/jimaging12010016","url":null,"abstract":"<p><p>Synthesizing photo-realistic images of a scene from arbitrary viewpoints and under arbitrary lighting environments is one of the important research topics in computer vision and graphics. In this paper, we propose a method for synthesizing photo-realistic images of a scene with fluorescent objects from novel viewpoints and under novel lighting colors and spectra. In general, fluorescent materials absorb light with certain wavelengths and then emit light with longer wavelengths than the absorbed ones, in contrast to reflective materials, which preserve wavelengths of light. Therefore, we cannot reproduce the colors of fluorescent objects under arbitrary lighting colors by combining conventional view synthesis techniques with the white balance adjustment of the RGB channels. Accordingly, we extend the novel-view synthesis based on the neural radiance fields by incorporating the superposition principle of light; our proposed method captures a sparse set of images of a scene from varying viewpoints and under varying lighting colors or spectra with active lighting systems such as a color display or a multi-spectral light stage and then synthesizes photo-realistic images of the scene without explicitly modeling its geometric and photometric models. We conducted a number of experiments using real images captured with an LCD and confirmed that our method works better than the existing methods. Moreover, we showed that the extension of our method using more than three primary colors with a light stage enables us to reproduce the colors of fluorescent objects under common light sources.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12843175/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M3-TransUNet: Medical Image Segmentation Based on Spatial Prior Attention and Multi-Scale Gating
Pub Date: 2025-12-29, DOI: 10.3390/jimaging12010015
Zhigao Zeng, Jiale Xiao, Shengqiu Yi, Qiang Liu, Yanhui Zhu
Medical image segmentation presents substantial challenges arising from the diverse scales and morphological complexities of target anatomical structures. Although existing Transformer-based models excel at capturing global dependencies, they encounter critical bottlenecks in multi-scale feature representation, spatial relationship modeling, and cross-layer feature fusion. To address these limitations, we propose the M3-TransUNet architecture, which incorporates three key innovations: (1) MSGA (Multi-Scale Gate Attention) and MSSA (Multi-Scale Selective Attention) modules to enhance multi-scale feature representation; (2) ME-MSA (Manhattan Enhanced Multi-Head Self-Attention) to integrate spatial priors into self-attention computations, thereby overcoming spatial modeling deficiencies; and (3) MKGAG (Multi-kernel Gated Attention Gate) to optimize skip connections by precisely filtering noise and preserving boundary details. Extensive experiments on public datasets, including Synapse, CVC-ClinicDB, and ISIC, demonstrate that M3-TransUNet achieves state-of-the-art performance. Specifically, on the Synapse dataset, our model outperforms recent TransUNet variants such as J-CAPA, improving the average DSC to 82.79% (compared to 82.29%) and significantly reducing the average HD95 from 19.74 mm to 10.21 mm.
{"title":"M<sup>3</sup>-TransUNet: Medical Image Segmentation Based on Spatial Prior Attention and Multi-Scale Gating.","authors":"Zhigao Zeng, Jiale Xiao, Shengqiu Yi, Qiang Liu, Yanhui Zhu","doi":"10.3390/jimaging12010015","DOIUrl":"10.3390/jimaging12010015","url":null,"abstract":"<p><p>Medical image segmentation presents substantial challenges arising from the diverse scales and morphological complexities of target anatomical structures. Although existing Transformer-based models excel at capturing global dependencies, they encounter critical bottlenecks in multi-scale feature representation, spatial relationship modeling, and cross-layer feature fusion. To address these limitations, we propose the M<sup>3</sup>-TransUNet architecture, which incorporates three key innovations: (1) MSGA (Multi-Scale Gate Attention) and MSSA (Multi-Scale Selective Attention) modules to enhance multi-scale feature representation; (2) ME-MSA (Manhattan Enhanced Multi-Head Self-Attention) to integrate spatial priors into self-attention computations, thereby overcoming spatial modeling deficiencies; and (3) MKGAG (Multi-kernel Gated Attention Gate) to optimize skip connections by precisely filtering noise and preserving boundary details. Extensive experiments on public datasets-including Synapse, CVC-ClinicDB, and ISIC-demonstrate that M<sup>3</sup>-TransUNet achieves state-of-the-art performance. Specifically, on the Synapse dataset, our model outperforms recent TransUNet variants such as J-CAPA, improving the average DSC to 82.79% (compared to 82.29%) and significantly reducing the average HD95 from 19.74 mm to 10.21 mm.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12843401/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive Normalization Enhances the Generalization of Deep Learning Model in Chest X-Ray Classification
Pub Date: 2025-12-28, DOI: 10.3390/jimaging12010014
Jatsada Singthongchai, Tanachapong Wangkhamhan
This study presents a controlled benchmarking analysis of min-max scaling, Z-score normalization, and an adaptive preprocessing pipeline that combines percentile-based ROI cropping with histogram standardization. The evaluation was conducted across four public chest X-ray (CXR) datasets and three convolutional neural network architectures under controlled experimental settings. The adaptive pipeline generally improved accuracy, F1-score, and training stability on datasets with relatively stable contrast characteristics while yielding limited gains on MIMIC-CXR due to strong acquisition heterogeneity. Ablation experiments showed that histogram standardization provided the primary performance contribution, with ROI cropping offering complementary benefits, and the full pipeline achieving the best overall performance. The computational overhead of the adaptive preprocessing was minimal (+6.3% training-time cost; 5.2 ms per batch). Friedman-Nemenyi and Wilcoxon signed-rank tests confirmed that the observed improvements were statistically significant across most dataset-model configurations. Overall, adaptive normalization is positioned not as a novel algorithmic contribution, but as a practical preprocessing design choice that can enhance cross-dataset robustness and reliability in chest X-ray classification workflows.
{"title":"Adaptive Normalization Enhances the Generalization of Deep Learning Model in Chest X-Ray Classification.","authors":"Jatsada Singthongchai, Tanachapong Wangkhamhan","doi":"10.3390/jimaging12010014","DOIUrl":"10.3390/jimaging12010014","url":null,"abstract":"<p><p>This study presents a controlled benchmarking analysis of min-max scaling, Z-score normalization, and an adaptive preprocessing pipeline that combines percentile-based ROI cropping with histogram standardization. The evaluation was conducted across four public chest X-ray (CXR) datasets and three convolutional neural network architectures under controlled experimental settings. The adaptive pipeline generally improved accuracy, F1-score, and training stability on datasets with relatively stable contrast characteristics while yielding limited gains on MIMIC-CXR due to strong acquisition heterogeneity. Ablation experiments showed that histogram standardization provided the primary performance contribution, with ROI cropping offering complementary benefits, and the full pipeline achieving the best overall performance. The computational overhead of the adaptive preprocessing was minimal (+6.3% training-time cost; 5.2 ms per batch). Friedman-Nemenyi and Wilcoxon signed-rank tests confirmed that the observed improvements were statistically significant across most dataset-model configurations. Overall, adaptive normalization is positioned not as a novel algorithmic contribution, but as a practical preprocessing design choice that can enhance cross-dataset robustness and reliability in chest X-ray classification workflows.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2025-12-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842669/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}