Relationship Between Display Pixel Structure and Gloss Perception
Pub Date: 2026-02-09 | DOI: 10.3390/jimaging12020071
Kosei Aketagawa, Midori Tanaka, Takahiko Horiuchi
The demand for accurate representation of gloss perception, which significantly contributes to the impression and evaluation of objects, is increasing owing to recent advancements in display technology enabling high-definition visual reproduction. This study experimentally analyzes the influence of display pixel structure on gloss perception. In a visual evaluation experiment using natural images, gloss perception was assessed across six types of stimuli: three subpixel arrays (RGB, RGBW, and PenTile RGBG) combined with two pixel-aperture ratios (100% and 50%). The experimental results statistically confirmed that regardless of pixel-aperture ratio, the RGB subpixel array was perceived as exhibiting the strongest gloss. Furthermore, cluster analysis of observers revealed individual differences in the effect of pixel structure on gloss perception. Additionally, gloss classification and image feature analysis suggested that the magnitude of pixel structure influence varies depending on the frequency components contained in the images. Moreover, analysis using a generalized linear mixed model supported the superiority of the RGB subpixel array even when accounting for variability across observers and natural images.
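The mixed-model analysis can be illustrated with a short sketch. Below is a minimal fit in Python with statsmodels, treating observers and images as crossed random effects via variance components; the column names (gloss, array, aperture, observer, image), the data file, and the linear (Gaussian) approximation of the paper's GLMM are all assumptions for illustration.

```python
# Sketch of a mixed-model analysis like the one described above, using
# statsmodels. Column names and the CSV file are hypothetical; the paper's
# exact GLMM family/link is not specified here, so a linear model stands in.
import pandas as pd
import statsmodels.formula.api as smf

data = pd.read_csv("gloss_ratings.csv")  # hypothetical: one row per rating

# Crossed random effects (observer and image) via variance components,
# using the standard single-group construction in statsmodels.
data["all"] = 1
vcf = {"observer": "0 + C(observer)", "image": "0 + C(image)"}

model = smf.mixedlm(
    "gloss ~ C(array) * C(aperture)",  # fixed effects: subpixel array x aperture
    data,
    groups="all",
    vc_formula=vcf,
)
result = model.fit()
print(result.summary())  # a positive RGB contrast would mirror the reported finding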
{"title":"Relationship Between Display Pixel Structure and Gloss Perception.","authors":"Kosei Aketagawa, Midori Tanaka, Takahiko Horiuchi","doi":"10.3390/jimaging12020071","DOIUrl":"10.3390/jimaging12020071","url":null,"abstract":"<p><p>The demand for accurate representation of gloss perception, which significantly contributes to the impression and evaluation of objects, is increasing owing to recent advancements in display technology enabling high-definition visual reproduction. This study experimentally analyzes the influence of display pixel structure on gloss perception. In a visual evaluation experiment using natural images, gloss perception was assessed across six types of stimuli: three subpixel arrays (RGB, RGBW, and PenTile RGBG) combined with two pixel-aperture ratios (100% and 50%). The experimental results statistically confirmed that regardless of pixel-aperture ratio, the RGB subpixel array was perceived as exhibiting the strongest gloss. Furthermore, cluster analysis of observers revealed individual differences in the effect of pixel structure on gloss perception. Additionally, gloss classification and image feature analysis suggested that the magnitude of pixel structure influence varies depending on the frequency components contained in the images. Moreover, analysis using a generalized linear mixed model supported the superiority of the RGB subpixel array even when accounting for variability across observers and natural images.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 2","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12942264/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147291171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Topic-Modeling Guided Semantic Clustering for Enhancing CNN-Based Image Classification Using Scale-Invariant Feature Transform and Block Gabor Filtering
Pub Date: 2026-02-09 | DOI: 10.3390/jimaging12020070
Natthaphong Suthamno, Jessada Tanthanuch
This study proposes a topic-modeling-guided framework that enhances image classification by introducing semantic clustering prior to CNN training. Images are processed through two key-point extraction pipelines, Scale-Invariant Feature Transform (SIFT) with Sobel edge detection and Block Gabor Filtering (BGF), to obtain local feature descriptors. These descriptors are clustered using K-means to build a visual vocabulary, and Bag-of-Words histograms then represent each image as a visual document. Latent Dirichlet Allocation is applied to uncover latent semantic topics, generating coherent image clusters. Cluster-specific CNN models, including AlexNet, GoogLeNet, and several ResNet variants, are trained under identical conditions to identify the most suitable architecture for each cluster. Two topic-guided integration strategies, the Maximum Proportion Topic (MPT) and the Weight Proportion Topic (WPT), are then used to assign test images to the corresponding specialized model. Experimental results show that both the SIFT-based and BGF-based pipelines outperform non-clustered CNN models and a baseline method using Incremental PCA, K-means, Same-Cluster Prediction, and unweighted Ensemble Voting. The SIFT pipeline achieves the highest accuracy of 95.24% with the MPT strategy, while the BGF pipeline achieves 93.76% with the WPT strategy. These findings confirm that the semantic structure introduced through topic modeling substantially improves CNN classification performance.
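As a rough sketch of the clustering stage, the following Python snippet builds a SIFT visual vocabulary with K-means, forms Bag-of-Words histograms, and derives LDA topic proportions supporting both MPT (argmax topic) and WPT (topic weights) assignment. The vocabulary size, topic count, and the omission of the Sobel and BGF preprocessing are assumptions, not the paper's exact settings.

```python
# Minimal sketch of the SIFT -> K-means vocabulary -> Bag-of-Words -> LDA
# clustering stage described above. k and n_topics are placeholders.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

def lda_clusters(gray_images, k=200, n_topics=10):
    sift = cv2.SIFT_create()
    per_image = []
    for img in gray_images:
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None else np.zeros((0, 128), np.float32))

    # Visual vocabulary from all descriptors pooled across images.
    kmeans = KMeans(n_clusters=k, n_init=10).fit(np.vstack(per_image))

    # One BoW histogram ("visual document") per image.
    bow = np.array([
        np.bincount(kmeans.predict(d), minlength=k) if len(d) else np.zeros(k)
        for d in per_image
    ])

    # Latent topics over visual words; MPT assigns each image its top topic.
    lda = LatentDirichletAllocation(n_components=n_topics).fit(bow)
    topic_mix = lda.transform(bow)               # per-image topic proportions (WPT weights)
    return topic_mix.argmax(axis=1), topic_mix   # MPT labels, WPT proportions
```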
{"title":"Topic-Modeling Guided Semantic Clustering for Enhancing CNN-Based Image Classification Using Scale-Invariant Feature Transform and Block Gabor Filtering.","authors":"Natthaphong Suthamno, Jessada Tanthanuch","doi":"10.3390/jimaging12020070","DOIUrl":"10.3390/jimaging12020070","url":null,"abstract":"<p><p>This study proposes a topic-modeling guided framework that enhances image classification by introducing semantic clustering prior to CNN training. Images are processed through two key-point extraction pipelines: Scale-Invariant Feature Transform (SIFT) with Sobel edge detection and Block Gabor Filtering (BGF), to obtain local feature descriptors. These descriptors are clustered using K-means to build a visual vocabulary. Bag of Words histograms then represent each image as a visual document. Latent Dirichlet Allocation is applied to uncover latent semantic topics, generating coherent image clusters. Cluster-specific CNN models, including AlexNet, GoogLeNet, and several ResNet variants, are trained under identical conditions to identify the most suitable architecture for each cluster. Two topic guided integration strategies, the Maximum Proportion Topic (MPT) and the Weight Proportion Topic (WPT), are then used to assign test images to the corresponding specialized model. Experimental results show that both the SIFT-based and BGF-based pipelines outperform non-clustered CNN models and a baseline method using Incremental PCA, K-means, Same-Cluster Prediction, and unweighted Ensemble Voting. The SIFT pipeline achieves the highest accuracy of 95.24% with the MPT strategy, while the BGF pipeline achieves 93.76% with the WPT strategy. These findings confirm that semantic structure introduced through topic modeling substantially improves CNN classification performance.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 2","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-02-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12941444/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147291316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
YOLO11s-UAV: An Advanced Algorithm for Small Object Detection in UAV Aerial Imagery
Pub Date: 2026-02-06 | DOI: 10.3390/jimaging12020069
Qi Mi, Jianshu Chao, Anqi Chen, Kaiyuan Zhang, Jiahua Lai
Unmanned aerial vehicles (UAVs) are now widely used in applications including agriculture, urban traffic management, and search and rescue operations. However, several challenges arise, including small objects that occupy only a few pixels in an image, complex backgrounds in aerial footage, and limited onboard computational resources. To address these issues, this paper proposes an improved UAV-based small object detection algorithm, YOLO11s-UAV, specifically designed for aerial imagery. Firstly, we introduce a novel FPN, called the Content-Aware Reassembly and Interaction Feature Pyramid Network (CARIFPN), which significantly enhances small object feature detection while reducing redundant network structures. Secondly, we apply a new downsampling convolution for small object feature extraction, called Space-to-Depth for Dilation-wise Residual Convolution (S2DResConv), in the model's backbone. This module eliminates the information loss caused by strided convolution or pooling operations and facilitates the capture of multi-scale context. Finally, we integrate a simple, parameter-free attention module (SimAM) with C3k2 to form Flexible SimAM (FlexSimAM), which is applied throughout the entire model. This improved module not only reduces the model's complexity but also enables efficient enhancement of small object features in complex scenarios. Experimental results demonstrate that on the VisDrone-DET2019 dataset, our model improves mAP@0.5 by 7.8% on the validation set (reaching 46.0%) and by 5.9% on the test set (increasing to 37.3%) compared to the baseline YOLO11s, while reducing model parameters by 55.3%. Similarly, it achieves a 7.2% improvement on the TinyPerson dataset and a 3.0% increase on UAVDT-DET. Deployment on the NVIDIA Jetson Orin NX SUPER platform shows that our model achieves 33 FPS, 21.4% lower than YOLO11s, confirming its feasibility for real-time onboard UAV applications.
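The space-to-depth idea behind S2DResConv can be sketched as a PyTorch module that rearranges 2x2 neighborhoods into channels (so downsampling discards no pixels) before channel mixing and a dilated residual branch. The layer sizes and the simplified residual design below are my assumptions, not the paper's exact module.

```python
# Sketch of the space-to-depth downsampling idea: rearrange 2x2 spatial
# neighborhoods into channels (lossless, unlike strided conv or pooling),
# then mix channels. The dilated residual branch is a simplified stand-in
# for the paper's dilation-wise residual design.
import torch
import torch.nn as nn

class S2DDown(nn.Module):
    def __init__(self, c_in, c_out):
        super().__init__()
        self.s2d = nn.PixelUnshuffle(2)           # (B, C, H, W) -> (B, 4C, H/2, W/2)
        self.mix = nn.Conv2d(4 * c_in, c_out, 1)  # channel mixing, no spatial loss
        self.res = nn.Sequential(                 # multi-scale context via dilation
            nn.Conv2d(c_out, c_out, 3, padding=2, dilation=2),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, x):
        y = self.mix(self.s2d(x))
        return y + self.res(y)

x = torch.randn(1, 64, 80, 80)
print(S2DDown(64, 128)(x).shape)  # torch.Size([1, 128, 40, 40])
```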
{"title":"YOLO11s-UAV: An Advanced Algorithm for Small Object Detection in UAV Aerial Imagery.","authors":"Qi Mi, Jianshu Chao, Anqi Chen, Kaiyuan Zhang, Jiahua Lai","doi":"10.3390/jimaging12020069","DOIUrl":"10.3390/jimaging12020069","url":null,"abstract":"<p><p>Unmanned aerial vehicles (UAVs) are now widely used in various applications, including agriculture, urban traffic management, and search and rescue operations. However, several challenges arise, including the small size of objects occupying only a sparse number of pixels in images, complex backgrounds in aerial footage, and limited computational resources onboard. To address these issues, this paper proposes an improved UAV-based small object detection algorithm, YOLO11s-UAV, specifically designed for aerial imagery. Firstly, we introduce a novel FPN, called Content-Aware Reassembly and Interaction Feature Pyramid Network (CARIFPN), which significantly enhances small object feature detection while reducing redundant network structures. Secondly, we apply a new downsampling convolution for small object feature extraction, called Space-to-Depth for Dilation-wise Residual Convolution (S2DResConv), in the model's backbone. This module effectively eliminates information loss caused by strided convolution or pooling operations and facilitates the capture of multi-scale context. Finally, we integrate a simple, parameter-free attention module (SimAM) with C3k2 to form Flexible SimAM (FlexSimAM), which is applied throughout the entire model. This improved module not only reduces the model's complexity but also enables efficient enhancement of small object features in complex scenarios. Experimental results demonstrate that on the VisDrone-DET2019 dataset, our model improves mAP@0.5 by 7.8% on the validation set (reaching 46.0%) and by 5.9% on the test set (increasing to 37.3%) compared to the baseline YOLO11s, while reducing model parameters by 55.3%. Similarly, it achieves a 7.2% improvement on the TinyPerson dataset and a 3.0% increase on UAVDT-DET. Deployment on the NVIDIA Jetson Orin NX SUPER platform shows that our model achieves 33 FPS, which is 21.4% lower than YOLO11s, confirming its feasibility for real-time onboard UAV applications.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 2","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12942582/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147291405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated Radiological Report Generation from Breast Ultrasound Images Using Vision and Language Transformers
Pub Date: 2026-02-06 | DOI: 10.3390/jimaging12020068
Shaheen Khatoon, Azhar Mahmood
Breast ultrasound imaging is widely used for the detection and characterization of breast abnormalities; however, generating detailed and consistent radiological reports remains a labor-intensive and subjective process. Recent advances in deep learning have demonstrated the potential of automated report generation systems to support clinical workflows, yet most existing approaches focus on chest X-ray imaging and rely on convolutional-recurrent architectures with limited capacity to model long-range dependencies and complex clinical semantics. In this work, we propose a multimodal Transformer-based framework for automatic breast ultrasound report generation that integrates visual and textual information through cross-attention mechanisms. The proposed architecture employs a Vision Transformer (ViT) to extract rich spatial and morphological features from ultrasound images. For textual embedding, pretrained language models (BERT, BioBERT, and GPT-2) are implemented in various encoder-decoder configurations to leverage both general linguistic knowledge and domain-specific biomedical semantics. A multimodal Transformer decoder is implemented to autoregressively generate diagnostic reports by jointly attending to visual features and contextualized textual embeddings. We conducted an extensive quantitative evaluation using standard report generation metrics, including BLEU, ROUGE-L, METEOR, and CIDEr, to assess lexical accuracy, semantic alignment, and clinical relevance. Experimental results demonstrate that BioBERT-based models consistently outperform general domain counterparts in clinical specificity, while GPT-2-based decoders improve linguistic fluency.
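A minimal sketch of the decoding step is given below: a PyTorch Transformer decoder whose cross-attention memory is the sequence of ViT patch embeddings, predicting report tokens autoregressively under a causal mask. Dimensions, vocabulary size, and layer counts are placeholders rather than the paper's configuration.

```python
# Minimal sketch of the multimodal decoding step: a Transformer decoder whose
# cross-attention "memory" is the ViT patch-embedding sequence. Dimensions,
# vocab size, and the single forward pass are illustrative placeholders.
import torch
import torch.nn as nn

class ReportDecoder(nn.Module):
    def __init__(self, vocab=30522, d=768, heads=8, layers=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        layer = nn.TransformerDecoderLayer(d, heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, layers)
        self.lm_head = nn.Linear(d, vocab)

    def forward(self, tokens, vit_feats):
        # Causal mask: each position attends only to earlier report tokens.
        t = tokens.size(1)
        mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        h = self.decoder(self.embed(tokens), vit_feats, tgt_mask=mask)
        return self.lm_head(h)  # next-token logits

vit_feats = torch.randn(2, 197, 768)        # e.g., ViT-Base patch embeddings
tokens = torch.randint(0, 30522, (2, 32))   # partial report token ids
print(ReportDecoder()(tokens, vit_feats).shape)  # torch.Size([2, 32, 30522])
```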
{"title":"Automated Radiological Report Generation from Breast Ultrasound Images Using Vision and Language Transformers.","authors":"Shaheen Khatoon, Azhar Mahmood","doi":"10.3390/jimaging12020068","DOIUrl":"10.3390/jimaging12020068","url":null,"abstract":"<p><p>Breast ultrasound imaging is widely used for the detection and characterization of breast abnormalities; however, generating detailed and consistent radiological reports remains a labor-intensive and subjective process. Recent advances in deep learning have demonstrated the potential of automated report generation systems to support clinical workflows, yet most existing approaches focus on chest X-ray imaging and rely on convolutional-recurrent architectures with limited capacity to model long-range dependencies and complex clinical semantics. In this work, we propose a multimodal Transformer-based framework for automatic breast ultrasound report generation that integrates visual and textual information through cross-attention mechanisms. The proposed architecture employs a Vision Transformer (ViT) to extract rich spatial and morphological features from ultrasound images. For textual embedding, pretrained language models (BERT, BioBERT, and GPT-2) are implemented in various encoder-decoder configurations to leverage both general linguistic knowledge and domain-specific biomedical semantics. A multimodal Transformer decoder is implemented to autoregressively generate diagnostic reports by jointly attending to visual features and contextualized textual embeddings. We conducted an extensive quantitative evaluation using standard report generation metrics, including BLEU, ROUGE-L, METEOR, and CIDEr, to assess lexical accuracy, semantic alignment, and clinical relevance. Experimental results demonstrate that BioBERT-based models consistently outperform general domain counterparts in clinical specificity, while GPT-2-based decoders improve linguistic fluency.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 2","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12941839/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147291320","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Predicting Nutritional and Morphological Attributes of Fresh Commercial Opuntia Cladodes Using Machine Learning and Imaging
Pub Date: 2026-02-05 | DOI: 10.3390/jimaging12020067
Juan Arredondo Valdez, Josué Israel García López, Héctor Flores Breceda, Ajay Kumar, Ricardo David Valdez Cepeda, Alejandro Isabel Luna Maldonado
Opuntia ficus-indica L. is a prominent crop in Mexico, requiring advanced non-destructive technologies for the real-time monitoring and quality control of fresh commercial cladodes. The primary research objective of this study was to develop and validate high-precision mathematical models that correlate hyperspectral signatures (400-1000 nm) with the specific nutritional, morphological, and antioxidant attributes of fresh cladodes (cultivar Villanueva) at their peak commercial maturity. By combining hyperspectral imaging (HSI) with machine learning algorithms, including K-Means clustering for image preprocessing and Partial Least Squares Regression (PLSR) for predictive modeling, this study successfully predicted the concentrations of 10 minerals (N, P, K, Ca, Mg, Fe, B, Mn, Zn, and Cu), chlorophylls (a, b, and Total), and antioxidant capacities (ABTS, FRAP, and DPPH). The innovative nature of this work lies in the simultaneous non-destructive quantification of 17 distinct variables from a single scan, achieving coefficients of determination (R²) as high as 0.988 for Phosphorus and Chlorophyll b. The practical applicability of this research provides a viable replacement for time-consuming and destructive laboratory acid digestion, enabling producers to implement automated, high-throughput sorting lines for quality assurance. Furthermore, this study establishes a framework for interdisciplinary collaborations between agricultural engineers, data scientists for algorithm optimization, and food scientists to enhance the functional value chain of Opuntia products.
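The PLSR modeling step can be sketched in a few lines of scikit-learn; the file names, target variable, and number of latent components below are hypothetical placeholders.

```python
# Sketch of the PLSR calibration step: predict a lab-measured attribute
# (e.g., phosphorus) from per-sample cladode reflectance spectra. File names
# and the number of latent components are hypothetical.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import cross_val_score

X = np.load("spectra.npy")      # (n_samples, n_bands), 400-1000 nm reflectance
y = np.load("phosphorus.npy")   # (n_samples,) reference values from lab assay

pls = PLSRegression(n_components=12)
r2 = cross_val_score(pls, X, y, cv=5, scoring="r2")
print(f"cross-validated R^2: {r2.mean():.3f}")

pls.fit(X, y)
y_new = pls.predict(X[:1])      # non-destructive prediction for a new scan
```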
{"title":"Predicting Nutritional and Morphological Attributes of Fresh Commercial <i>Opuntia</i> Cladodes Using Machine Learning and Imaging.","authors":"Juan Arredondo Valdez, Josué Israel García López, Héctor Flores Breceda, Ajay Kumar, Ricardo David Valdez Cepeda, Alejandro Isabel Luna Maldonado","doi":"10.3390/jimaging12020067","DOIUrl":"10.3390/jimaging12020067","url":null,"abstract":"<p><p><i>Opuntia ficus-indica</i> L. is a prominent crop in Mexico, requiring advanced non-destructive technologies for the real-time monitoring and quality control of fresh commercial cladodes. The primary research objective of this study was to develop and validate high-precision mathematical models that correlate hyperspectral signatures (400-1000 nm) with the specific nutritional, morphological, and antioxidant attributes of fresh cladodes (cultivar Villanueva) at their peak commercial maturity. By combining hyperspectral imaging (HSI) with machine learning algorithms, including K-Means clustering for image preprocessing and Partial Least Squares Regression (PLSR) for predictive modeling, this study successfully predicted the concentrations of 10 minerals (N, P, K, Ca, Mg, Fe, B, Mn, Zn, and Cu), chlorophylls (a, b, and Total), and antioxidant capacities (ABTS, FRAP, and DPPH). The innovative nature of this work lies in the simultaneous non-destructive quantification of 17 distinct variables from a single scan, achieving coefficients of determination (R<sup>2</sup>) as high as 0.988 for Phosphorus and Chlorophyll b. The practical applicability of this research provides a viable replacement for time-consuming and destructive laboratory acid digestion, enabling producers to implement automated, high-throughput sorting lines for quality assurance. Furthermore, this study establishes a framework for interdisciplinary collaborations between agricultural engineers, data scientists for algorithm optimization, and food scientists to enhance the functional value chain of <i>Opuntia</i> products.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 2","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12941559/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147291048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ciphertext-Only Attack on Grayscale-Based EtC Image Encryption via Component Separation and Regularized Single-Channel Compatibility
Pub Date: 2026-02-05 | DOI: 10.3390/jimaging12020065
Ruifeng Li, Masaaki Fujiyoshi
Grayscale-based Encryption-then-Compression (EtC) systems transform RGB images into the YCbCr color space, concatenate the components into a single grayscale image, and apply block permutation, block rotation/flipping, and block-wise negative-positive inversion. Because this pipeline separates color components and disrupts inter-channel statistics, existing extended jigsaw puzzle solvers (JPSs) have been regarded as ineffective, and grayscale-based EtC systems have been considered resistant to ciphertext-only visual reconstruction. In this paper, we present a practical ciphertext-only attack against grayscale-based EtC. The proposed attack introduces three key components: (i) Texture-Based Component Classification (TBCC) to distinguish luminance (Y) and chrominance (Cb/Cr) blocks and focus reconstruction on structure-rich regions; (ii) Regularized Single-Channel Edge Compatibility (R-SCEC), which applies Tikhonov regularization to a single-channel variant of the Mahalanobis Gradient Compatibility (MGC) measure to alleviate covariance rank-deficiency while maintaining robustness under inversion and geometric transforms; and (iii) Adaptive Pruning based on the TBCC-reduced search space that skips redundant boundary matching computations to further improve reconstruction efficiency. Experiments show that, in settings where existing extended JPS solvers fail, our method can still recover visually recognizable semantic content, revealing a potential vulnerability in grayscale-based EtC and calling for a re-evaluation of its security.
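The R-SCEC idea can be sketched as follows: score a candidate horizontal pairing of two grayscale blocks by the Mahalanobis distance between the gradients observed across the seam and the gradient statistics inside the left block's edge, with a Tikhonov term stabilizing the covariance. The two-feature gradient model and the regularization weight are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of the R-SCEC idea: score how well block B fits to the right of
# block A via the Mahalanobis distance between seam gradients and A's
# right-edge gradient statistics, with (S + lam*I) guarding against
# rank-deficient covariances. Feature choice and lam are illustrative.
import numpy as np

def r_scec(A, B, lam=1e-3):
    # Per-row gradient features inside A's right edge: 1st and 2nd differences.
    G = np.stack([A[:, -1] - A[:, -2], A[:, -2] - A[:, -3]], axis=1).astype(float)
    mu = G.mean(axis=0)
    S = np.cov(G, rowvar=False) + lam * np.eye(2)   # Tikhonov regularization

    # Gradients actually observed across the seam if B sits right of A.
    seam = np.stack([B[:, 0] - A[:, -1], A[:, -1] - A[:, -2]], axis=1).astype(float)
    d = seam - mu
    return float(np.einsum("ij,jk,ik->", d, np.linalg.inv(S), d))  # lower = more compatible

A = np.random.randint(0, 256, (32, 32))
B = np.random.randint(0, 256, (32, 32))
print(r_scec(A, B))
```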
{"title":"Ciphertext-Only Attack on Grayscale-Based EtC Image Encryption via Component Separation and Regularized Single-Channel Compatibility.","authors":"Ruifeng Li, Masaaki Fujiyoshi","doi":"10.3390/jimaging12020065","DOIUrl":"10.3390/jimaging12020065","url":null,"abstract":"<p><p>Grayscale-based Encryption-then-Compression (EtC) systems transform RGB images into the YCbCr color space, concatenate the components into a single grayscale image, and apply block permutation, block rotation/flipping, and block-wise negative-positive inversion. Because this pipeline separates color components and disrupts inter-channel statistics, existing extended jigsaw puzzle solvers (JPSs) have been regarded as ineffective, and grayscale-based EtC systems have been considered resistant to ciphertext-only visual reconstruction. In this paper, we present a practical ciphertext-only attack against grayscale-based EtC. The proposed attack introduces three key components: (i) Texture-Based Component Classification (TBCC) to distinguish luminance (Y) and chrominance (Cb/Cr) blocks and focus reconstruction on structure-rich regions; (ii) Regularized Single-Channel Edge Compatibility (R-SCEC), which applies Tikhonov regularization to a single-channel variant of the Mahalanobis Gradient Compatibility (MGC) measure to alleviate covariance rank-deficiency while maintaining robustness under inversion and geometric transforms; and (iii) Adaptive Pruning based on the TBCC-reduced search space that skips redundant boundary matching computations to further improve reconstruction efficiency. Experiments show that, in settings where existing extended JPS solvers fail, our method can still recover visually recognizable semantic content, revealing a potential vulnerability in grayscale-based EtC and calling for a re-evaluation of its security.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 2","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12941909/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147291274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Survey of Crop Disease Recognition Methods Based on Spectral and RGB Images
Pub Date: 2026-02-05 | DOI: 10.3390/jimaging12020066
Haoze Zheng, Heran Wang, Hualong Dong, Yurong Qian
Major crops worldwide are affected by various diseases yearly, leading to crop losses in different regions. The primary methods for addressing crop disease losses include manual inspection and chemical control. However, traditional manual inspection methods are time-consuming, labor-intensive, and require specialized knowledge. The preemptive use of chemicals also poses a risk of soil pollution, which may cause irreversible damage. With the advancement of computer hardware, photographic technology, and artificial intelligence, crop disease recognition methods based on spectral and red-green-blue (RGB) images not only recognize diseases without damaging the crops but also offer high accuracy and speed of recognition, essentially solving the problems associated with manual inspection and chemical control. This paper summarizes the research on disease recognition methods based on spectral and RGB images, with the literature spanning from 2020 through early 2025. Unlike previous surveys, this paper reviews recent advances involving emerging paradigms such as State Space Models (e.g., Mamba) and Generative AI in the context of crop disease recognition. In addition, it introduces public datasets and commonly used evaluation metrics for crop disease identification. Finally, the paper discusses potential issues and solutions encountered during research, including the use of diffusion models for data augmentation. Hopefully, this survey will help readers understand the current methods and effectiveness of crop disease detection, inspiring the development of more effective methods to assist farmers in identifying crop diseases.
{"title":"A Survey of Crop Disease Recognition Methods Based on Spectral and RGB Images.","authors":"Haoze Zheng, Heran Wang, Hualong Dong, Yurong Qian","doi":"10.3390/jimaging12020066","DOIUrl":"10.3390/jimaging12020066","url":null,"abstract":"<p><p>Major crops worldwide are affected by various diseases yearly, leading to crop losses in different regions. The primary methods for addressing crop disease losses include manual inspection and chemical control. However, traditional manual inspection methods are time-consuming, labor-intensive, and require specialized knowledge. The preemptive use of chemicals also poses a risk of soil pollution, which may cause irreversible damage. With the advancement of computer hardware, photographic technology, and artificial intelligence, crop disease recognition methods based on spectral and red-green-blue (RGB) images not only recognize diseases without damaging the crops but also offer high accuracy and speed of recognition, essentially solving the problems associated with manual inspection and chemical control. This paper summarizes the research on disease recognition methods based on spectral and RGB images, with the literature spanning from 2020 through early 2025. Unlike previous surveys, this paper reviews recent advances involving emerging paradigms such as State Space Models (e.g., Mamba) and Generative AI in the context of crop disease recognition. In addition, it introduces public datasets and commonly used evaluation metrics for crop disease identification. Finally, the paper discusses potential issues and solutions encountered during research, including the use of diffusion models for data augmentation. Hopefully, this survey will help readers understand the current methods and effectiveness of crop disease detection, inspiring the development of more effective methods to assist farmers in identifying crop diseases.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 2","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12942047/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147291323","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SIFT-SNN for Traffic-Flow Infrastructure Safety: A Real-Time Context-Aware Anomaly Detection Framework
Pub Date: 2026-01-31 | DOI: 10.3390/jimaging12020064
Munish Rathee, Boris Bačić, Maryam Doborjeh
Automated anomaly detection in transportation infrastructure is essential for enhancing safety and reducing the operational costs associated with manual inspection protocols. This study presents an improved neuromorphic vision system, which extends the prior SIFT-SNN (scale-invariant feature transform-spiking neural network) proof-of-concept by incorporating temporal feature aggregation for context-aware and sequence-stable detection. Analysis of classical stitching-based pipelines exposed sensitivity to motion and lighting variations, motivating the proposed temporally smoothed neuromorphic design. SIFT keypoints are encoded into latency-based spike trains and classified using a leaky integrate-and-fire (LIF) spiking neural network implemented in PyTorch. Evaluated across three hardware configurations (an NVIDIA RTX 4060 GPU, an Intel i7 CPU, and a simulated Jetson Nano), the system achieved 92.3% accuracy and a macro F1 score of 91.0% under five-fold cross-validation. Inference latencies were measured at 9.5 ms, 26.1 ms, and ~48.3 ms per frame, respectively. Memory footprints were under 290 MB, and power consumption was estimated to be between 5 and 65 W. The classifier distinguishes between safe, partially dislodged, and fully dislodged barrier pins, which are critical failure modes for the Auckland Harbour Bridge's Movable Concrete Barrier (MCB) system. Temporal smoothing further improves recall for ambiguous cases. By achieving a compact model size (2.9 MB), low-latency inference, and minimal power demands, the proposed framework offers a deployable, interpretable, and energy-efficient alternative to conventional CNN-based inspection tools. Future work will focus on exploring the generalisability and transferability of the work presented, additional input sources, and human-computer interaction paradigms for various deployment infrastructures and advancements.
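A simplified reading of the encoding and classification stages is sketched below: feature strengths are converted to time-to-first-spike (latency) codes and passed through a hand-rolled leaky integrate-and-fire layer in PyTorch. The time window, decay factor, threshold, and three-class readout are placeholder assumptions.

```python
# Sketch of latency coding plus a leaky integrate-and-fire (LIF) layer in
# PyTorch: stronger feature responses spike earlier. T, beta, and the
# threshold are placeholder hyperparameters, not the paper's values.
import torch

def latency_encode(x, T=20):
    # Normalize features to [0, 1]; strong features fire at early time steps.
    x = (x - x.min()) / (x.max() - x.min() + 1e-8)
    t_fire = ((1.0 - x) * (T - 1)).long()               # per-input spike times
    spikes = torch.zeros(T, x.numel())
    spikes[t_fire, torch.arange(x.numel())] = 1.0       # one spike per input
    return spikes                                       # (T, N) spike trains

def lif_forward(spikes, w, beta=0.9, v_th=1.0):
    v = torch.zeros(w.size(0))
    out = []
    for s_t in spikes:                                  # iterate over time steps
        v = beta * v + w @ s_t                          # leaky integration
        fired = (v >= v_th).float()
        v = v * (1.0 - fired)                           # reset after a spike
        out.append(fired)
    return torch.stack(out)                             # (T, n_neurons) spikes

feats = torch.rand(128)           # e.g., aggregated SIFT descriptor strengths
w = torch.randn(3, 128) * 0.1     # 3 classes: safe / partial / dislodged
print(lif_forward(latency_encode(feats), w).sum(0))     # spike counts per class
```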
{"title":"SIFT-SNN for Traffic-Flow Infrastructure Safety: A Real-Time Context-Aware Anomaly Detection Framework.","authors":"Munish Rathee, Boris Bačić, Maryam Doborjeh","doi":"10.3390/jimaging12020064","DOIUrl":"10.3390/jimaging12020064","url":null,"abstract":"<p><p>Automated anomaly detection in transportation infrastructure is essential for enhancing safety and reducing the operational costs associated with manual inspection protocols. This study presents an improved neuromorphic vision system, which extends the prior SIFT-SNN (scale-invariant feature transform-spiking neural network) proof-of-concept by incorporating temporal feature aggregation for context-aware and sequence-stable detection. Analysis of classical stitching-based pipelines exposed sensitivity to motion and lighting variations, motivating the proposed temporally smoothed neuromorphic design. SIFT keypoints are encoded into latency-based spike trains and classified using a leaky integrate-and-fire (LIF) spiking neural network implemented in PyTorch. Evaluated across three hardware configurations-an NVIDIA RTX 4060 GPU, an Intel i7 CPU, and a simulated Jetson Nano-the system achieved 92.3% accuracy and a macro F1 score of 91.0% under five-fold cross-validation. Inference latencies were measured at 9.5 ms, 26.1 ms, and ~48.3 ms per frame, respectively. Memory footprints were under 290 MB, and power consumption was estimated to be between 5 and 65 W. The classifier distinguishes between safe, partially dislodged, and fully dislodged barrier pins, which are critical failure modes for the Auckland Harbour Bridge's Movable Concrete Barrier (MCB) system. Temporal smoothing further improves recall for ambiguous cases. By achieving a compact model size (2.9 MB), low-latency inference, and minimal power demands, the proposed framework offers a deployable, interpretable, and energy-efficient alternative to conventional CNN-based inspection tools. Future work will focus on exploring the generalisability and transferability of the work presented, additional input sources, and human-computer interaction paradigms for various deployment infrastructures and advancements.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 2","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12942226/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147291261","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Cross-Domain Benchmark of Intrinsic and Post Hoc Explainability for 3D Deep Learning Models
Pub Date: 2026-01-30 | DOI: 10.3390/jimaging12020063
Asmita Chakraborty, Gizem Karagoz, Nirvana Meratnia
Deep learning models for three-dimensional (3D) data are increasingly used in domains such as medical imaging, object recognition, and robotics. As their adoption grows, their black-box nature has made the need for explainability ever more pressing. However, the lack of standardized and quantitative benchmarks for explainable artificial intelligence (XAI) in 3D data limits the reliable comparison of explanation quality. In this paper, we present a unified benchmarking framework to evaluate both intrinsic and post hoc XAI methods across three representative 3D datasets: volumetric CT scans (MosMed), voxelized CAD models (ModelNet40), and real-world point clouds (ScanObjectNN). The evaluated methods include Grad-CAM, Integrated Gradients, Saliency, Occlusion, and the intrinsic ResAttNet-3D model. We quantitatively assess explanations using the Correctness (AOPC), Completeness (AUPC), and Compactness metrics, applied consistently across all datasets. Our results show that explanation quality varies significantly across methods and domains: Grad-CAM and intrinsic attention performed best on medical CT scans, while gradient-based methods excelled on voxelized and point-based data. Statistical tests (Kruskal-Wallis and Mann-Whitney U) confirmed significant performance differences between methods. No single approach achieved superior results across all domains, highlighting the importance of multi-metric evaluation. This work provides a reproducible framework for standardized assessment of 3D explainability and comparative insights to guide future XAI method selection.
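As an illustration of the Correctness metric, the sketch below computes an AOPC-style score: voxels are occluded in order of decreasing attributed relevance and the average drop in the target-class probability is recorded. The per-voxel occlusion, zero baseline, and step sizes are simplifying assumptions.

```python
# Sketch of the Correctness (AOPC) metric: occlude input regions in order of
# decreasing attributed relevance and average the resulting drop in class
# probability. Flattened per-voxel occlusion is a simplification; real use
# would perturb patches of a 3D volume.
import torch

def aopc(model, x, attribution, target, steps=20, frac=0.01):
    model.eval()
    with torch.no_grad():
        p0 = torch.softmax(model(x), dim=1)[0, target]
        order = attribution.flatten().argsort(descending=True)  # most relevant first
        k = max(1, int(frac * order.numel()))                   # voxels removed per step

        x_pert, drops = x.clone(), []
        for i in range(steps):
            idx = order[i * k:(i + 1) * k]
            x_pert.view(-1)[idx] = 0.0                          # occlude with baseline 0
            p = torch.softmax(model(x_pert), dim=1)[0, target]
            drops.append((p0 - p).item())
        return sum(drops) / steps   # higher AOPC = more faithful explanation
```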
{"title":"A Cross-Domain Benchmark of Intrinsic and Post Hoc Explainability for 3D Deep Learning Models.","authors":"Asmita Chakraborty, Gizem Karagoz, Nirvana Meratnia","doi":"10.3390/jimaging12020063","DOIUrl":"10.3390/jimaging12020063","url":null,"abstract":"<p><p>Deep learning models for three-dimensional (3D) data are increasingly used in domains such as medical imaging, object recognition, and robotics. At the same time, the use of AI in these domains is increasing, while, due to their black-box nature, the need for explainability has grown significantly. However, the lack of standardized and quantitative benchmarks for explainable artificial intelligence (XAI) in 3D data limits the reliable comparison of explanation quality. In this paper, we present a unified benchmarking framework to evaluate both intrinsic and post hoc XAI methods across three representative 3D datasets: volumetric CT scans (MosMed), voxelized CAD models (ModelNet40), and real-world point clouds (ScanObjectNN). The evaluated methods include Grad-CAM, Integrated Gradients, Saliency, Occlusion, and the intrinsic ResAttNet-3D model. We quantitatively assess explanations using the Correctness (AOPC), Completeness (AUPC), and Compactness metrics, consistently applied across all datasets. Our results show that explanation quality significantly varies across methods and domains, demonstrating that Grad-CAM and intrinsic attention performed best on medical CT scans, while gradient-based methods excelled on voxelized and point-based data. Statistical tests (Kruskal-Wallis and Mann-Whitney U) confirmed significant performance differences between methods. No single approach achieved superior results across all domains, highlighting the importance of multi-metric evaluation. This work provides a reproducible framework for standardized assessment of 3D explainability and comparative insights to guide future XAI method selection.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 2","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12941976/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147291336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AACNN-ViT: Adaptive Attention-Augmented Convolutional and Vision Transformer Fusion for Lung Cancer Detection
Pub Date: 2026-01-30 | DOI: 10.3390/jimaging12020062
Mohammad Ishtiaque Rahman, Amrina Rahman
Lung cancer remains a leading cause of cancer-related mortality. Although reliable multiclass classification of lung lesions from CT imaging is essential for early diagnosis, it remains challenging due to subtle inter-class differences, limited sample sizes, and class imbalance. We propose an Adaptive Attention-Augmented Convolutional Neural Network with Vision Transformer (AACNN-ViT), a hybrid framework that integrates local convolutional representations with global transformer embeddings through an adaptive attention-based fusion module. The CNN branch captures fine-grained spatial patterns, the ViT branch encodes long-range contextual dependencies, and the adaptive fusion mechanism learns to weight cross-representation interactions to improve discriminability. To reduce the impact of imbalance, a hybrid objective that combines focal loss with categorical cross-entropy is incorporated during training. Experiments on the IQ-OTH/NCCD dataset (benign, malignant, and normal) show consistent performance progression in an ablation-style evaluation: CNN-only, ViT-only, CNN-ViT concatenation, and AACNN-ViT. The proposed AACNN-ViT achieved 96.97% accuracy on the validation set with macro-averaged precision/recall/F1 of 0.9588/0.9352/0.9458 and weighted F1 of 0.9693, substantially improving minority-class recognition (Benign recall 0.8333) compared with CNN-ViT (accuracy 89.09%, macro-F1 0.7680). One-vs.-rest ROC analysis further indicates strong separability across all classes (micro-average AUC 0.992). These results suggest that adaptive attention-based fusion offers a robust and clinically relevant approach for computer-aided lung cancer screening and decision support.
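The hybrid objective can be sketched directly in PyTorch: a focal term, which down-weights easy examples, blended with standard categorical cross-entropy. The mixing weight alpha and focusing parameter gamma below are placeholders, not the paper's tuned values.

```python
# Sketch of the hybrid objective: categorical cross-entropy blended with focal
# loss, which down-weights easy examples so minority classes (e.g., Benign)
# contribute more to the gradient. alpha and gamma are placeholder values.
import torch
import torch.nn.functional as F

def hybrid_loss(logits, targets, alpha=0.5, gamma=2.0):
    ce = F.cross_entropy(logits, targets, reduction="none")  # per-sample CE
    pt = torch.exp(-ce)                                      # prob of the true class
    focal = (1.0 - pt) ** gamma * ce                         # focal modulation
    return (alpha * focal + (1.0 - alpha) * ce).mean()

logits = torch.randn(8, 3)                 # benign / malignant / normal
targets = torch.randint(0, 3, (8,))
print(hybrid_loss(logits, targets))
```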
{"title":"AACNN-ViT: Adaptive Attention-Augmented Convolutional and Vision Transformer Fusion for Lung Cancer Detection.","authors":"Mohammad Ishtiaque Rahman, Amrina Rahman","doi":"10.3390/jimaging12020062","DOIUrl":"10.3390/jimaging12020062","url":null,"abstract":"<p><p>Lung cancer remains a leading cause of cancer-related mortality. Although reliable multiclass classification of lung lesions from CT imaging is essential for early diagnosis, it remains challenging due to subtle inter-class differences, limited sample sizes, and class imbalance. We propose an Adaptive Attention-Augmented Convolutional Neural Network with Vision Transformer (AACNN-ViT), a hybrid framework that integrates local convolutional representations with global transformer embeddings through an adaptive attention-based fusion module. The CNN branch captures fine-grained spatial patterns, the ViT branch encodes long-range contextual dependencies, and the adaptive fusion mechanism learns to weight cross-representation interactions to improve discriminability. To reduce the impact of imbalance, a hybrid objective that combines focal loss with categorical cross-entropy is incorporated during training. Experiments on the IQ-OTH/NCCD dataset (benign, malignant, and normal) show consistent performance progression in an ablation-style evaluation: CNN-only, ViT-only, CNN-ViT concatenation, and AACNN-ViT. The proposed AACNN-ViT achieved 96.97% accuracy on the validation set with macro-averaged precision/recall/F1 of 0.9588/0.9352/0.9458 and weighted F1 of 0.9693, substantially improving minority-class recognition (Benign recall 0.8333) compared with CNN-ViT (accuracy 89.09%, macro-F1 0.7680). One-vs.-rest ROC analysis further indicates strong separability across all classes (micro-average AUC 0.992). These results suggest that adaptive attention-based fusion offers a robust and clinically relevant approach for computer-aided lung cancer screening and decision support.</p>","PeriodicalId":37035,"journal":{"name":"Journal of Imaging","volume":"12 2","pages":""},"PeriodicalIF":2.7,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12941408/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147291269","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}