FF-Mamba-YOLO: An SSM-Based Benchmark for Forest Fire Detection in UAV Remote Sensing Images
Pub Date: 2026-01-13 | DOI: 10.3390/jimaging12010043
Binhua Guo, Dinghui Liu, Zhou Shen, Tiebin Wang
Timely and accurate detection of forest fires through unmanned aerial vehicle (UAV) remote sensing target detection technology is of paramount importance. However, multiscale targets and complex environmental interference in UAV remote sensing images pose significant challenges during detection tasks. To address these obstacles, this paper presents FF-Mamba-YOLO, a novel framework based on the principles of Mamba and YOLO (You Only Look Once) that leverages innovative modules and architectures to overcome these limitations. First, we introduce MFEBlock and MFFBlock based on state space models (SSMs) in the backbone and neck parts of the network, respectively, enabling the model to effectively capture global dependencies. Second, we construct CFEBlock, a module that performs feature enhancement before SSM processing, improving local feature processing capabilities. Furthermore, we propose MGBlock, which adopts a dynamic gating mechanism, enhancing the model's adaptive processing capabilities and robustness. Finally, we enhance the structure of the Path Aggregation Feature Pyramid Network (PAFPN) to improve feature fusion quality and introduce DySample to enhance image resolution without significantly increasing computational costs. Experimental results on our self-constructed forest fire image dataset demonstrate that the model achieves 67.4% mAP@50, 36.3% mAP@50:95, and 64.8% precision, outperforming previous state-of-the-art methods. These results highlight the potential of FF-Mamba-YOLO in forest fire monitoring.
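The abstract's MFEBlock and MFFBlock are built on state space models. As a point of reference only, the sketch below shows the basic discretized linear SSM recurrence that Mamba-style blocks extend; it is our illustration, not the paper's modules, and all shapes and parameter values are assumptions.

```python
# Minimal sketch of a discretized state space model (SSM) recurrence:
#   h_t = A_bar @ h_{t-1} + B_bar * x_t,   y_t = C @ h_t
# Illustrative only -- not FF-Mamba-YOLO's MFEBlock/MFFBlock.
import numpy as np

def ssm_scan(x, A_bar, B_bar, C):
    """Run a 1-D sequence x (length L) through a linear SSM with state size N."""
    N = A_bar.shape[0]
    h = np.zeros(N)
    y = np.empty_like(x)
    for t, x_t in enumerate(x):          # sequential scan over the sequence
        h = A_bar @ h + B_bar * x_t      # state update
        y[t] = C @ h                     # readout
    return y

rng = np.random.default_rng(0)
L, N = 16, 4
A_bar = np.diag(rng.uniform(0.5, 0.95, N))   # stable diagonal transition (assumed)
B_bar = rng.standard_normal(N)
C = rng.standard_normal(N)
print(ssm_scan(rng.standard_normal(L), A_bar, B_bar, C).shape)  # (16,)
```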
{"title":"FF-Mamba-YOLO: An SSM-Based Benchmark for Forest Fire Detection in UAV Remote Sensing Images.","authors":"Binhua Guo, Dinghui Liu, Zhou Shen, Tiebin Wang","doi":"10.3390/jimaging12010043","DOIUrl":"10.3390/jimaging12010043","url":null,"abstract":"<p><p>Timely and accurate detection of forest fires through unmanned aerial vehicle (UAV) remote sensing target detection technology is of paramount importance. However, multiscale targets and complex environmental interference in UAV remote sensing images pose significant challenges during detection tasks. To address these obstacles, this paper presents FF-Mamba-YOLO, a novel framework based on the principles of Mamba and YOLO (You Only Look Once) that leverages innovative modules and architectures to overcome these limitations. Specifically, we introduce MFEBlock and MFFBlock based on state space models (SSMs) in the backbone and neck parts of the network, respectively, enabling the model to effectively capture global dependencies. Second, we construct CFEBlock, a module that performs feature enhancement before SSM processing, improving local feature processing capabilities. Furthermore, we propose MGBlock, which adopts a dynamic gating mechanism, enhancing the model's adaptive processing capabilities and robustness. Finally, we enhance the structure of Path Aggregation Feature Pyramid Network (PAFPN) to improve feature fusion quality and introduce DySample to enhance image resolution without significantly increasing computational costs. Experimental results on our self-constructed forest fire image dataset demonstrate that the model achieves 67.4% mAP@50, 36.3% mAP@50:95, and 64.8% precision, outperforming previous state-of-the-art methods. These results highlight the potential of FF-Mamba-YOLO in forest fire monitoring.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842753/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054225","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GLCN: Graph-Aware Locality-Enhanced Cross-Modality Re-ID Network
Pub Date: 2026-01-13 | DOI: 10.3390/jimaging12010042
Junjie Cao, Yuhang Yu, Rong Rong, Xing Xie
Cross-modality person re-identification faces challenges such as illumination discrepancies, local occlusions, and inconsistent modality structures, leading to misalignment and sensitivity issues. We propose GLCN, a framework that addresses these problems by enhancing representation learning through locality enhancement, cross-modality structural alignment, and intra-modality compactness. Key components include the Locality-Preserved Cross-branch Fusion (LPCF) module, which combines Local-Positional-Channel Gating (LPCG) for local region and positional sensitivity; Cross-branch Context Interpolated Attention (CCIA) for stable cross-branch consistency; and Graph-Enhanced Center Geometry Alignment (GE-CGA), which aligns class-center similarity structures across modalities to preserve category-level relationships. We also introduce Intra-Modal Prototype Discrepancy Mining Loss (IPDM-Loss) to reduce intra-class variance and improve inter-class separation, thereby creating more compact identity structures in both RGB and IR spaces. Extensive experiments on SYSU-MM01, RegDB, and other benchmarks demonstrate the effectiveness of our approach.
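To make the class-center alignment idea concrete, here is a heavily hedged sketch in the spirit of GE-CGA: compute per-identity centers in the RGB and IR embedding spaces and penalize discrepancies between their center-to-center similarity structures. The exact formulation, feature dimensions, and data below are our assumptions, not the authors' implementation.

```python
# Hedged sketch of cross-modality class-center structure alignment (not GLCN's code).
import numpy as np

def class_centers(feats, labels):
    classes = np.unique(labels)
    return np.stack([feats[labels == c].mean(axis=0) for c in classes])

def cosine_sim_matrix(centers):
    normed = centers / np.linalg.norm(centers, axis=1, keepdims=True)
    return normed @ normed.T

def center_structure_loss(rgb_feats, ir_feats, rgb_labels, ir_labels):
    s_rgb = cosine_sim_matrix(class_centers(rgb_feats, rgb_labels))
    s_ir = cosine_sim_matrix(class_centers(ir_feats, ir_labels))
    return np.mean((s_rgb - s_ir) ** 2)   # discrepancy between similarity structures

rng = np.random.default_rng(1)
labels = np.repeat(np.arange(4), 8)       # 4 identities, 8 samples each (synthetic)
print(center_structure_loss(rng.standard_normal((32, 64)),
                            rng.standard_normal((32, 64)), labels, labels))
```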
{"title":"GLCN: Graph-Aware Locality-Enhanced Cross-Modality Re-ID Network.","authors":"Junjie Cao, Yuhang Yu, Rong Rong, Xing Xie","doi":"10.3390/jimaging12010042","DOIUrl":"10.3390/jimaging12010042","url":null,"abstract":"<p><p>Cross-modality person re-identification faces challenges such as illumination discrepancies, local occlusions, and inconsistent modality structures, leading to misalignment and sensitivity issues. We propose GLCN, a framework that addresses these problems by enhancing representation learning through locality enhancement, cross-modality structural alignment, and intra-modality compactness. Key components include the Locality-Preserved Cross-branch Fusion (LPCF) module, which combines Local-Positional-Channel Gating (LPCG) for local region and positional sensitivity; Cross-branch Context Interpolated Attention (CCIA) for stable cross-branch consistency; and Graph-Enhanced Center Geometry Alignment (GE-CGA), which aligns class-center similarity structures across modalities to preserve category-level relationships. We also introduce Intra-Modal Prototype Discrepancy Mining Loss (IPDM-Loss) to reduce intra-class variance and improve inter-class separation, thereby creating more compact identity structures in both RGB and IR spaces. Extensive experiments on SYSU-MM01, RegDB, and other benchmarks demonstrate the effectiveness of our approach.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12843497/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054196","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Calibrated Transformer Fusion for Dual-View Low-Energy CESM Classification
Pub Date: 2026-01-13 | DOI: 10.3390/jimaging12010041
Ahmed A H Alkurdi, Amira Bibo Sallow
Contrast-enhanced spectral mammography (CESM) provides low-energy images acquired in standard craniocaudal (CC) and mediolateral oblique (MLO) views, and clinical interpretation relies on integrating both views. This study proposes a dual-view classification framework that combines deep CNN feature extraction with transformer-based fusion for breast-side classification using low-energy (DM) images from CESM acquisitions (Normal vs. Tumorous; benign and malignant merged). The evaluation was conducted using 5-fold stratified group cross-validation with patient-level grouping to prevent leakage across folds. The final configuration (Model E) integrates dual-backbone feature extraction, transformer fusion, MC-dropout inference for uncertainty estimation, and post hoc logistic calibration. Across the five held-out test folds, Model E achieved a mean accuracy of 96.88% ± 2.39% and a mean F1-score of 97.68% ± 1.66%. The mean ROC-AUC and PR-AUC were 0.9915 ± 0.0098 and 0.9968 ± 0.0029, respectively. Probability quality was supported by a mean Brier score of 0.0236 ± 0.0145 and a mean expected calibration error (ECE) of 0.0334 ± 0.0171. An ablation study (Models A-E) was also reported to quantify the incremental contribution of dual-view input, transformer fusion, and uncertainty calibration. Within the limits of this retrospective single-center setting, these results suggest that dual-view transformer fusion can provide strong discrimination while also producing calibrated probabilities and uncertainty outputs that are relevant for decision support.
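The patient-level grouping is what prevents leakage in the evaluation described above. A minimal sketch with scikit-learn's StratifiedGroupKFold follows; it illustrates the splitting principle only, with synthetic patient IDs and labels, and is not the authors' pipeline.

```python
# Patient-grouped, label-stratified 5-fold split to avoid leakage across folds.
import numpy as np
from sklearn.model_selection import StratifiedGroupKFold

rng = np.random.default_rng(0)
n_images = 200
patient_id = rng.integers(0, 60, size=n_images)      # hypothetical patient IDs
y = rng.integers(0, 2, size=n_images)                 # Normal (0) vs. Tumorous (1)
X = rng.standard_normal((n_images, 16))               # placeholder features

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, test_idx) in enumerate(cv.split(X, y, groups=patient_id)):
    shared = set(patient_id[train_idx]) & set(patient_id[test_idx])
    print(f"fold {fold}: {len(shared)} patients shared between train and test")  # always 0
```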
{"title":"Calibrated Transformer Fusion for Dual-View Low-Energy CESM Classification.","authors":"Ahmed A H Alkurdi, Amira Bibo Sallow","doi":"10.3390/jimaging12010041","DOIUrl":"10.3390/jimaging12010041","url":null,"abstract":"<p><p>Contrast-enhanced spectral mammography (CESM) provides low-energy images acquired in standard craniocaudal (CC) and mediolateral oblique (MLO) views, and clinical interpretation relies on integrating both views. This study proposes a dual-view classification framework that combines deep CNN feature extraction with transformer-based fusion for breast-side classification using low-energy (DM) images from CESM acquisitions (Normal vs. Tumorous; benign and malignant merged). The evaluation was conducted using 5-fold stratified group cross-validation with patient-level grouping to prevent leakage across folds. The final configuration (Model E) integrates dual-backbone feature extraction, transformer fusion, MC-dropout inference for uncertainty estimation, and post hoc logistic calibration. Across the five held-out test folds, Model E achieved a mean accuracy of 96.88% ± 2.39% and a mean F1-score of 97.68% ± 1.66%. The mean ROC-AUC and PR-AUC were 0.9915 ± 0.0098 and 0.9968 ± 0.0029, respectively. Probability quality was supported by a mean Brier score of 0.0236 ± 0.0145 and a mean expected calibration error (ECE) of 0.0334 ± 0.0171. An ablation study (Models A-E) was also reported to quantify the incremental contribution of dual-view input, transformer fusion, and uncertainty calibration. Within the limits of this retrospective single-center setting, these results suggest that dual-view transformer fusion can provide strong discrimination while also producing calibrated probabilities and uncertainty outputs that are relevant for decision support.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842785/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Dual-UNet Diffusion Framework for Personalized Panoramic Generation
Pub Date: 2026-01-11 | DOI: 10.3390/jimaging12010040
Jing Shen, Leigang Huo, Chunlei Huo, Shiming Xiang
While text-to-image and customized generation methods demonstrate strong capabilities in single-image generation, they fall short in supporting immersive applications that require coherent 360° panoramas. Conversely, existing panorama generation models lack customization capabilities. In panoramic scenes, reference objects often appear as minor background elements and may be multiple in number, while reference images across different views exhibit weak correlations. To address these challenges, we propose a diffusion-based framework for customized multi-view image generation. Our approach introduces a decoupled feature injection mechanism within a dual-UNet architecture to handle weakly correlated reference images, effectively integrating spatial information by concurrently feeding both reference images and noise into the denoising branch. A hybrid attention mechanism enables deep fusion of reference features and multi-view representations. Furthermore, a data augmentation strategy facilitates viewpoint-adaptive pose adjustments, and panoramic coordinates are employed to guide multi-view attention. The experimental results demonstrate our model's effectiveness in generating coherent, high-quality customized multi-view images.
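The abstract mentions panoramic coordinates guiding multi-view attention but does not specify the encoding. A common choice for 360° images is the equirectangular pixel-to-(longitude, latitude) mapping sketched below; this is a generic illustration under that assumption, not the paper's method.

```python
# Equirectangular per-pixel angular coordinates for a 360-degree panorama.
import numpy as np

def equirect_coords(height, width):
    """Longitude in [-pi, pi) and latitude in (-pi/2, pi/2) per pixel."""
    u = (np.arange(width) + 0.5) / width      # normalized horizontal position
    v = (np.arange(height) + 0.5) / height    # normalized vertical position
    lon = (u - 0.5) * 2.0 * np.pi
    lat = (0.5 - v) * np.pi
    return np.meshgrid(lon, lat)              # each of shape (height, width)

lon, lat = equirect_coords(256, 512)
print(lon.shape, float(lat.min()), float(lat.max()))
```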
{"title":"A Dual-UNet Diffusion Framework for Personalized Panoramic Generation.","authors":"Jing Shen, Leigang Huo, Chunlei Huo, Shiming Xiang","doi":"10.3390/jimaging12010040","DOIUrl":"10.3390/jimaging12010040","url":null,"abstract":"<p><p>While text-to-image and customized generation methods demonstrate strong capabilities in single-image generation, they fall short in supporting immersive applications that require coherent 360° panoramas. Conversely, existing panorama generation models lack customization capabilities. In panoramic scenes, reference objects often appear as minor background elements and may be multiple in number, while reference images across different views exhibit weak correlations. To address these challenges, we propose a diffusion-based framework for customized multi-view image generation. Our approach introduces a decoupled feature injection mechanism within a dual-UNet architecture to handle weakly correlated reference images, effectively integrating spatial information by concurrently feeding both reference images and noise into the denoising branch. A hybrid attention mechanism enables deep fusion of reference features and multi-view representations. Furthermore, a data augmentation strategy facilitates viewpoint-adaptive pose adjustments, and panoramic coordinates are employed to guide multi-view attention. The experimental results demonstrate our model's effectiveness in generating coherent, high-quality customized multi-view images.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12843003/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146053774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-Supervised Learning of Deep Embeddings for Classification and Identification of Dental Implants
Pub Date: 2026-01-09 | DOI: 10.3390/jimaging12010039
Amani Almalki, Abdulrahman Almalki, Longin Jan Latecki
This study proposes an automated system using deep learning-based object detection to identify implant systems, leveraging recent progress in self-supervised learning, specifically masked image modeling (MIM). We advocate for self-pre-training, emphasizing its advantages when acquiring suitable pre-training data is challenging. The proposed Masked Deep Embedding (MDE) pre-training method, extending the masked autoencoder (MAE) transformer, significantly enhances dental implant detection performance compared to baselines. Specifically, the proposed method achieves a best detection performance of AP = 96.1, outperforming supervised ViT and MAE baselines by up to +2.9 AP. In addition, we address the absence of a comprehensive dataset for implant design, enhancing an existing dataset under dental expert supervision. This augmentation includes annotations for implant design, such as coronal, middle, and apical parts, resulting in a unique Implant Design Dataset (IDD). The contributions encompass employing self-supervised learning for limited dental radiograph data, replacing MAE's patch reconstruction with patch embeddings, achieving substantial performance improvement in implant detection, and expanding possibilities through the labeling of implant design. This study paves the way for AI-driven solutions in implant dentistry, providing valuable tools for dentists and patients facing implant-related challenges.
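Masked image modeling hides a large fraction of image patches and trains the network to reconstruct them (MAE) or, in the MDE variant described above, their embeddings. The snippet below sketches only the standard MAE-style random patch masking step; the grid size and mask ratio are assumptions.

```python
# MAE-style random patch masking: keep a small visible subset, mask the rest.
import numpy as np

def random_patch_mask(num_patches, mask_ratio, rng):
    """Return (visible_idx, masked_idx) index arrays."""
    num_masked = int(num_patches * mask_ratio)
    perm = rng.permutation(num_patches)
    return np.sort(perm[num_masked:]), np.sort(perm[:num_masked])

rng = np.random.default_rng(0)
visible, masked = random_patch_mask(num_patches=196, mask_ratio=0.75, rng=rng)  # 14x14 grid
print(len(visible), len(masked))   # 49 visible patches, 147 masked
```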
{"title":"Self-Supervised Learning of Deep Embeddings for Classification and Identification of Dental Implants.","authors":"Amani Almalki, Abdulrahman Almalki, Longin Jan Latecki","doi":"10.3390/jimaging12010039","DOIUrl":"10.3390/jimaging12010039","url":null,"abstract":"<p><p>This study proposes an automated system using deep learning-based object detection to identify implant systems, leveraging recent progress in self-supervised learning, specifically masked image modeling (MIM). We advocate for self-pre-training, emphasizing that its advantages when acquiring suitable pre-training data is challenging. The proposed Masked Deep Embedding (MDE) pre-training method, extending the masked autoencoder (MAE) transformer, significantly enhances dental implant detection performance compared to baselines. Specifically, the proposed method achieves a best detection performance of AP = 96.1, outperforming supervised ViT and MAE baselines by up to +2.9 AP. In addition, we address the absence of a comprehensive dataset for implant design, enhancing an existing dataset under dental expert supervision. This augmentation includes annotations for implant design, such as coronal, middle, and apical parts, resulting in a unique Implant Design Dataset (IDD). The contributions encompass employing self-supervised learning for limited dental radiograph data, replacing MAE's patch reconstruction with patch embeddings, achieving substantial performance improvement in implant detection, and expanding possibilities through the labeling of implant design. This study paves the way for AI-driven solutions in implant dentistry, providing valuable tools for dentists and patients facing implant-related challenges.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842735/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SCT-Diff: Seamless Contextual Tracking via Diffusion Trajectory
Pub Date: 2026-01-09 | DOI: 10.3390/jimaging12010038
Guohao Nie, Xingmei Wang, Debin Zhang, He Wang
Existing detection-based trackers exploit temporal contexts by updating appearance models or modeling target motion. However, the sequential one-shot integration of temporal priors risks amplifying error accumulation, as frame-level template matching restricts comprehensive spatiotemporal analysis. To address this, we propose SCT-Diff, a video-level framework that holistically estimates target trajectories. Specifically, SCT-Diff processes video clips globally via a diffusion model to incorporate bidirectional spatiotemporal awareness, where reverse diffusion steps progressively refine noisy trajectory proposals into optimal predictions. Crucially, SCT-Diff enables iterative correction of historical trajectory hypotheses by observing future contexts within a sliding time window. This closed-loop feedback from future frames preserves temporal consistency and breaks the error propagation chain under complex appearance variations. For joint modeling of appearance and motion dynamics, we formulate trajectories as unified discrete token sequences. The designed Mamba-based expert decoder bridges visual features with language-formulated trajectories, enabling lightweight yet coherent sequence modeling. Extensive experiments demonstrate SCT-Diff's superior efficiency and performance, achieving 75.4% AO on GOT-10k while maintaining real-time computational efficiency.
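The reported AO (average overlap) on GOT-10k is the mean IoU between predicted and ground-truth boxes over all frames of a trajectory. Below is a small illustration of that metric under the (x, y, w, h) box convention; it is our sketch, not the benchmark's official toolkit.

```python
# Average overlap (AO): mean IoU between predicted and ground-truth boxes per frame.
import numpy as np

def iou(box_a, box_b):
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    ix = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # horizontal intersection
    iy = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # vertical intersection
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def average_overlap(pred_boxes, gt_boxes):
    return float(np.mean([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)]))

gt = [(10, 10, 50, 50), (12, 11, 50, 50)]
pred = [(12, 12, 48, 52), (40, 40, 50, 50)]
print(average_overlap(pred, gt))
```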
{"title":"SCT-Diff: Seamless Contextual Tracking via Diffusion Trajectory.","authors":"Guohao Nie, Xingmei Wang, Debin Zhang, He Wang","doi":"10.3390/jimaging12010038","DOIUrl":"10.3390/jimaging12010038","url":null,"abstract":"<p><p>Existing detection-based trackers exploit temporal contexts by updating appearance models or modeling target motion. However, the sequential one-shot integration of temporal priors risks amplifying error accumulation, as frame-level template matching restricts comprehensive spatiotemporal analysis. To address this, we propose SCT-Diff, a video-level framework that holistically estimates target trajectories. Specifically, SCT-Diff processes video clips globally via a diffusion model to incorporate bidirectional spatiotemporal awareness, where reverse diffusion steps progressively refine noisy trajectory proposals into optimal predictions. Crucially, SCT-Diff enables iterative correction of historical trajectory hypotheses by observing future contexts within a sliding time window. This closed-loop feedback from future frames preserves temporal consistency and breaks the error propagation chain under complex appearance variations. For joint modeling of appearance and motion dynamics, we formulate trajectories as unified discrete token sequences. The designed Mamba-based expert decoder bridges visual features with language-formulated trajectories, enabling lightweight yet coherent sequence modeling. Extensive experiments demonstrate SCT-Diff's superior efficiency and performance, achieving 75.4% AO on GOT-10k while maintaining real-time computational efficiency.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12843046/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Empirical Evaluation of UNet for Segmentation of Applicable Surfaces for Seismic Sensor Installation
Pub Date: 2026-01-08 | DOI: 10.3390/jimaging12010034
Mikhail Uzdiaev, Marina Astapova, Andrey Ronzhin, Aleksandra Figurek
The deployment of wireless seismic nodal systems necessitates the efficient identification of optimal locations for sensor installation, considering factors such as ground stability and the absence of interference. Semantic segmentation of satellite imagery has advanced significantly, yet its application to this specific task remains unexplored. This work presents a baseline empirical evaluation of the U-Net architecture for the semantic segmentation of surfaces applicable for seismic sensor installation. We utilize a novel dataset of Sentinel-2 multispectral images, specifically labeled for this purpose. The study investigates the impact of pretrained encoders (EfficientNetB2, Cross-Stage Partial Darknet53 (CSPDarknet53), and the Multi-Axis Vision Transformer (MaxViT)), different combinations of Sentinel-2 spectral bands (red, green, blue (RGB); RGB plus near-infrared (NIR); the ten bands at 10 and 20 m/pixel spatial resolution; and the full 13 bands), and a technique for improving small object segmentation by modifying the input convolutional layer stride. Experimental results demonstrate that the CSPDarknet53 encoder generally outperforms the others (IoU = 0.534, Precision = 0.716, Recall = 0.635). The combination of RGB and near-infrared bands (10 m/pixel resolution) yielded the most robust performance across most configurations. Reducing the input stride from 2 to 1 proved beneficial for segmenting small linear objects like roads. The findings establish a baseline for this novel task and provide practical insights for optimizing deep learning models in the context of automated seismic nodal network installation planning.
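The reported IoU, Precision, and Recall are standard pixel-wise measures on binary masks. A compact sketch of how they are computed is given below; it is illustrative, not the authors' evaluation code, and the masks are random placeholders.

```python
# Pixel-wise IoU, precision, and recall for binary segmentation masks.
import numpy as np

def segmentation_metrics(pred, target):
    """pred, target: boolean arrays of equal shape (True = applicable surface)."""
    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    iou = tp / (tp + fp + fn) if (tp + fp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return iou, precision, recall

rng = np.random.default_rng(0)
pred = rng.random((64, 64)) > 0.5
target = rng.random((64, 64)) > 0.5
print(segmentation_metrics(pred, target))
```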
{"title":"Empirical Evaluation of UNet for Segmentation of Applicable Surfaces for Seismic Sensor Installation.","authors":"Mikhail Uzdiaev, Marina Astapova, Andrey Ronzhin, Aleksandra Figurek","doi":"10.3390/jimaging12010034","DOIUrl":"10.3390/jimaging12010034","url":null,"abstract":"<p><p>The deployment of wireless seismic nodal systems necessitates the efficient identification of optimal locations for sensor installation, considering factors such as ground stability and the absence of interference. Semantic segmentation of satellite imagery has advanced significantly, and its application to this specific task remains unexplored. This work presents a baseline empirical evaluation of the U-Net architecture for the semantic segmentation of surfaces applicable for seismic sensor installation. We utilize a novel dataset of Sentinel-2 multispectral images, specifically labeled for this purpose. The study investigates the impact of pretrained encoders (EfficientNetB2, Cross-Stage Partial Darknet53-CSPDarknet53, and Multi-Axis Vision Transformer-MAxViT), different combinations of Sentinel-2 spectral bands (Red, Green, Blue (RGB), RGB+Near Infrared (NIR), 10-bands with 10 and 20 m/pix spatial resolution, full 13-band), and a technique for improving small object segmentation by modifying the input convolutional layer stride. Experimental results demonstrate that the CSPDarknet53 encoder generally outperforms the others (IoU = 0.534, Precision = 0.716, Recall = 0.635). The combination of RGB and Near-Infrared bands (10 m/pixel resolution) yielded the most robust performance across most configurations. Reducing the input stride from 2 to 1 proved beneficial for segmenting small linear objects like roads. The findings establish a baseline for this novel task and provide practical insights for optimizing deep learning models in the context of automated seismic nodal network installation planning.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12843034/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Degradation-Aware Multi-Stage Fusion for Underwater Image Enhancement
Pub Date: 2026-01-08 | DOI: 10.3390/jimaging12010037
Lian Xie, Hao Chen, Jin Shu
Underwater images frequently suffer from color casts, low illumination, and blur due to wavelength-dependent absorption and scattering. We present a practical two-stage, modular, and degradation-aware framework designed for real-time enhancement, prioritizing deployability on edge devices. Stage I employs a lightweight CNN to classify inputs into three dominant degradation classes (color cast, low light, blur) with 91.85% accuracy on an EUVP subset. Stage II applies three scene-specific lightweight enhancement pipelines and fuses their outputs using two alternative learnable modules: a global Linear Fusion and a LiteUNetFusion (spatially adaptive weighting with optional residual correction). Compared to the three single-scene optimizers (average PSNR = 19.0 dB; mean UCIQE ≈ 0.597; mean UIQM ≈ 2.07), the Linear Fusion improves PSNR by +2.6 dB on average and yields roughly +20.7% in UCIQE and +21.0% in UIQM, while maintaining low latency (~90 ms per 640 × 480 frame on an Intel i5-13400F (Intel Corporation, Santa Clara, CA, USA)). The LiteUNetFusion further refines results: it raises PSNR by +1.5 dB over the Linear model (23.1 vs. 21.6 dB), brings modest perceptual gains (UCIQE from 0.72 to 0.74, UIQM from 2.5 to 2.8) at a runtime of ≈125 ms per 640 × 480 frame, and better preserves local texture and color consistency in mixed-degradation scenes. We release implementation details for reproducibility and discuss limitations (e.g., occasional blur/noise amplification and domain generalization) together with future directions.
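The "global Linear Fusion" idea can be pictured as a convex combination of the three branch outputs with learnable scalar weights. The sketch below uses fixed softmax-normalized weights for illustration; the actual module and its training procedure belong to the paper and are not shown here.

```python
# Hedged sketch of a global linear fusion of three enhancement branch outputs.
import numpy as np

def linear_fusion(outputs, logits):
    """outputs: list of HxWx3 float images in [0, 1]; logits: per-branch scores."""
    w = np.exp(logits - np.max(logits))
    w = w / w.sum()                                   # softmax -> convex combination
    fused = sum(wi * img for wi, img in zip(w, outputs))
    return np.clip(fused, 0.0, 1.0)

rng = np.random.default_rng(0)
branches = [rng.random((480, 640, 3)) for _ in range(3)]   # color / low-light / deblur outputs
print(linear_fusion(branches, logits=np.array([0.2, 1.0, -0.5])).shape)
```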
{"title":"Degradation-Aware Multi-Stage Fusion for Underwater Image Enhancement.","authors":"Lian Xie, Hao Chen, Jin Shu","doi":"10.3390/jimaging12010037","DOIUrl":"10.3390/jimaging12010037","url":null,"abstract":"<p><p>Underwater images frequently suffer from color casts, low illumination, and blur due to wavelength-dependent absorption and scattering. We present a practical two-stage, modular, and degradation-aware framework designed for real-time enhancement, prioritizing deployability on edge devices. Stage I employs a lightweight CNN to classify inputs into three dominant degradation classes (color cast, low light, blur) with 91.85% accuracy on an EUVP subset. Stage II applies three scene-specific lightweight enhancement pipelines and fuses their outputs using two alternative learnable modules: a global Linear Fusion and a LiteUNetFusion (spatially adaptive weighting with optional residual correction). Compared to the three single-scene optimizers (average PSNR = 19.0 dB; mean UCIQE ≈ 0.597; mean UIQM ≈ 2.07), the Linear Fusion improves PSNR by +2.6 dB on average and yields roughly +20.7% in UCIQE and +21.0% in UIQM, while maintaining low latency (~90 ms per 640 × 480 frame on an Intel i5-13400F (Intel Corporation, Santa Clara, CA, USA). The LiteUNetFusion further refines results: it raises PSNR by +1.5 dB over the Linear model (23.1 vs. 21.6 dB), brings modest perceptual gains (UCIQE from 0.72 to 0.74, UIQM 2.5 to 2.8) at a runtime of ≈125 ms per 640 × 480 frame, and better preserves local texture and color consistency in mixed-degradation scenes. We release implementation details for reproducibility and discuss limitations (e.g., occasional blur/noise amplification and domain generalization) together with future directions.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12843447/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Hierarchical Deep Learning Architecture for Diagnosing Retinal Diseases Using Cross-Modal OCT to Fundus Translation in the Lack of Paired Data
Pub Date: 2026-01-08 | DOI: 10.3390/jimaging12010036
Ekaterina A Lopukhova, Gulnaz M Idrisova, Timur R Mukhamadeev, Grigory S Voronkov, Ruslan V Kutluyarov, Elizaveta P Topolskaya
The paper focuses on automated diagnosis of retinal diseases, particularly Age-related Macular Degeneration (AMD) and diabetic retinopathy (DR), using optical coherence tomography (OCT), while addressing three key challenges: disease comorbidity, severe class imbalance, and the lack of strictly paired OCT and fundus data. We propose a hierarchical modular deep learning system designed for multi-label OCT screening with conditional routing to specialized staging modules. To enable DR staging when fundus images are unavailable, we use cross-modal alignment between OCT and fundus representations. This approach involves training a latent bridge that projects OCT embeddings into the fundus feature space. We enhance clinical reliability through per-class threshold calibration and implement quality control checks for OCT-only DR staging. Experiments demonstrate robust multi-label performance (macro-F1 = 0.989 ± 0.006 after per-class threshold calibration) and reliable calibration (ECE = 2.1 ± 0.4%), and OCT-only DR staging is feasible in 96.1% of cases that meet the quality control criterion.
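Expected calibration error, one of the metrics reported above, bins predictions by confidence and averages the gap between per-bin accuracy and mean confidence. The snippet below is a minimal sketch of the standard definition for the binary case, on synthetic data; it is not the authors' evaluation code.

```python
# Expected calibration error (ECE) for binary probabilistic predictions.
import numpy as np

def expected_calibration_error(probs, labels, n_bins=10):
    probs, labels = np.asarray(probs), np.asarray(labels)
    confidences = np.where(probs >= 0.5, probs, 1.0 - probs)   # confidence of predicted class
    predictions = (probs >= 0.5).astype(int)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            acc = (predictions[in_bin] == labels[in_bin]).mean()
            conf = confidences[in_bin].mean()
            ece += in_bin.mean() * abs(acc - conf)              # bin weight times gap
    return ece

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 500)
probs = np.clip(labels * 0.7 + rng.normal(0.15, 0.2, 500), 0.0, 1.0)
print(expected_calibration_error(probs, labels))
```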
{"title":"A Hierarchical Deep Learning Architecture for Diagnosing Retinal Diseases Using Cross-Modal OCT to Fundus Translation in the Lack of Paired Data.","authors":"Ekaterina A Lopukhova, Gulnaz M Idrisova, Timur R Mukhamadeev, Grigory S Voronkov, Ruslan V Kutluyarov, Elizaveta P Topolskaya","doi":"10.3390/jimaging12010036","DOIUrl":"10.3390/jimaging12010036","url":null,"abstract":"<p><p>The paper focuses on automated diagnosis of retinal diseases, particularly Age-related Macular Degeneration (AMD) and diabetic retinopathy (DR), using optical coherence tomography (OCT), while addressing three key challenges: disease comorbidity, severe class imbalance, and the lack of strictly paired OCT and fundus data. We propose a hierarchical modular deep learning system designed for multi-label OCT screening with conditional routing to specialized staging modules. To enable DR staging when fundus images are unavailable, we use cross-modal alignment between OCT and fundus representations. This approach involves training a latent bridge that projects OCT embeddings into the fundus feature space. We enhance clinical reliability through per-class threshold calibration and implement quality control checks for OCT-only DR staging. Experiments demonstrate robust multi-label performance (macro-F1 =0.989±0.006 after per-class threshold calibration) and reliable calibration (ECE =2.1±0.4%), and OCT-only DR staging is feasible in 96.1% of cases that meet the quality control criterion.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842718/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146053781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Comparison of the Radiomics Features of Normal-Appearing White Matter in Persons with High or Low Perivascular Space Scores
Pub Date: 2026-01-08 | DOI: 10.3390/jimaging12010035
Onural Ozturk, Sibel Balci, Seda Ozturk
The clinical significance of perivascular spaces (PVS) remains controversial. Radiomics refers to the extraction of quantitative features from medical images using pixel-based computational approaches. This study aimed to compare the radiomics features of normal-appearing white matter (NAWM) in patients with low and high PVS scores to reveal microstructural differences that are not visible macroscopically. Adult patients who underwent cranial MRI over a one-month period were retrospectively screened and divided into two groups according to their global PVS score. Radiomics feature extraction from NAWM was performed at the level of the centrum semiovale on FLAIR and ADC images. Radiomics features were selected using Least Absolute Shrinkage and Selection Operator (LASSO) regression during the initial model development phase, and predefined radiomics scores were evaluated for both sequences. A total of 160 patients were included in the study. Radiomics scores derived from normal-appearing white matter demonstrated good discriminative performance for differentiating high vs. low perivascular space (PVS) burden (AUC = 0.853 for FLAIR and AUC = 0.753 for ADC). In age- and scanner-adjusted multivariable models, radiomics scores remained independently associated with high PVS burden. These findings suggest that radiomics analysis of NAWM can capture subtle white matter alterations associated with PVS burden and may serve as a non-invasive biomarker for early detection of microvascular and inflammatory changes.
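LASSO-style selection followed by a linear radiomics score is a common pattern in radiomics pipelines like the one described above. The sketch below uses an L1-penalized logistic regression on synthetic features; the regularization strength, feature counts, and data are placeholders, not the study's.

```python
# LASSO-style feature selection and a linear radiomics score (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.standard_normal((160, 100))                   # 160 patients x 100 radiomics features
y = (X[:, :3].sum(axis=1) + rng.normal(0, 1, 160) > 0).astype(int)   # high vs. low PVS burden

X_std = StandardScaler().fit_transform(X)
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X_std, y)
selected = np.flatnonzero(lasso.coef_[0])             # features with non-zero coefficients
rad_score = X_std[:, selected] @ lasso.coef_[0][selected] + lasso.intercept_[0]
print(len(selected), rad_score.shape)
```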
{"title":"Comparison of the Radiomics Features of Normal-Appearing White Matter in Persons with High or Low Perivascular Space Scores.","authors":"Onural Ozturk, Sibel Balci, Seda Ozturk","doi":"10.3390/jimaging12010035","DOIUrl":"10.3390/jimaging12010035","url":null,"abstract":"<p><p>The clinical significance of perivascular spaces (PVS) remains controversial. Radiomics refers to the extraction of quantitative features from medical images using pixel-based computational approaches. This study aimed to compare the radiomics features of normal-appearing white matter (NAWM) in patients with low and high PVS scores to reveal microstructural differences that are not visible macroscopically. Adult patients who underwent cranial MRI over a one-month period were retrospectively screened and divided into two groups according to their global PVS score. Radiomics feature extraction from NAWM was performed at the level of the centrum semiovale on FLAIR and ADC images. Radiomics features were selected using Least Absolute Shrinkage and Selection Operator (LASSO) regression during the initial model development phase, and predefined radiomics scores were evaluated for both sequences. A total of 160 patients were included in the study. Radiomics scores derived from normal-appearing white matter demonstrated good discriminative performance for differentiating high vs. low perivascular space (PVS) burden (AUC = 0.853 for FLAIR and AUC = 0.753 for ADC). In age- and scanner-adjusted multivariable models, radiomics scores remained independently associated with high PVS burden. These findings suggest that radiomics analysis of NAWM can capture subtle white matter alterations associated with PVS burden and may serve as a non-invasive biomarker for early detection of microvascular and inflammatory changes.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 1","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12842764/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146054248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}