UGLF-Net: A parallel architecture for Underwater Global-Local Feature Fusion Network
Pub Date: 2026-03-01 | Epub Date: 2026-02-12 | DOI: 10.1016/j.cviu.2026.104704
Erkang Chen, Wangen Chen, Zhiwei Shen, Zhihui Li, Zhiqi Lin
Underwater image enhancement is highly challenging due to low contrast, color distortion, and blurring caused by light attenuation and scattering. This paper proposes a novel parallel architecture, the Underwater Global-Local Feature Fusion Network (UGLF-Net), for robust image restoration. UGLF-Net consists of the AMFE module for high-quality global feature extraction, the HMCM module with a state-space model (SSM) for selective local enhancement, and the Swin FAM module for capturing global context. By progressively fusing multi-source features (RGB, grayscale gradients, and reduced-dimension data) in a parallel manner, UGLF-Net achieves effective global-local collaborative modeling. Residual connections and Enhanced ECA modules further improve feature representation and training stability, enabling state-of-the-art (SOTA) performance. Experiments on the LSUI, EUVP, and UIEB datasets show that UGLF-Net outperforms existing methods, including the U-shape Transformer, in PSNR and SSIM. Ablation studies validate the effectiveness of each component, and qualitative results demonstrate superior restoration of vivid colors and fine details. The lightweight design, with a single-layer SSM and window attention, achieves efficient inference (0.009 s per image), making it well suited for real-time enhancement on embedded devices and advancing underwater visual applications.
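As a point of reference for the channel-attention ingredient named above, the sketch below shows a plain ECA (Efficient Channel Attention) block in PyTorch: global average pooling followed by a cheap 1-D convolution across channels. What the paper's "Enhanced" ECA adds is not described in the abstract, so only the published ECA baseline is shown, as an illustrative assumption.

```python
# Minimal ECA sketch (the published baseline, not the paper's "Enhanced" variant).
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, k_size=3):
        super().__init__()
        # 1-D conv across the channel dimension captures local cross-channel interaction.
        self.conv = nn.Conv1d(1, 1, kernel_size=k_size, padding=k_size // 2, bias=False)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x):                          # x: (B, C, H, W)
        y = x.mean(dim=(2, 3))                     # global average pooling -> (B, C)
        y = self.conv(y.unsqueeze(1)).squeeze(1)   # 1-D conv over channels -> (B, C)
        return x * self.sigmoid(y)[..., None, None]  # channel re-weighting

out = ECA()(torch.randn(1, 64, 32, 32))  # same shape as input
```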
{"title":"UGLF-Net: A parallel architecture for Underwater Global-Local Feature Fusion Network","authors":"Erkang Chen, Wangen Chen, Zhiwei Shen, Zhihui Li, Zhiqi Lin","doi":"10.1016/j.cviu.2026.104704","DOIUrl":"10.1016/j.cviu.2026.104704","url":null,"abstract":"<div><div>Underwater image enhancement is highly challenging due to low contrast, color distortion, blurring caused by light attenuation and scattering. This paper proposes a novel parallel architecture, the Underwater Global-Local Feature Fusion Network (UGLF-Net) for robust image restoration. UGLF-Net consists of the AMFE module for high-quality global feature extraction, the HMCM module with SSM for selective local enhancement and the Swin FAM module for capturing global context. By progressively fusing multi-source features (RGB, grayscale gradients and reduced-dimension data) in a parallel manner, UGLF-Net achieves effective global-local collaborative modeling. Residual connections and Enhanced ECA modules further improve feature representation and training stability, enabling state-of-the-art (SOTA) performance. Experiments on LSUI, EUVP and UIEB datasets show that UGLF-Net outperforms existing methods, including the U-shape Transformer, in PSNR and SSIM. Ablation studies validate the effectiveness of each component. Qualitative results demonstrate superior restoration of vivid colors and fine details. The lightweight design with single-layer SSM and window attention achieves efficient inference (0.009s per image), making it well-suited for real-time enhancement on embedded devices and advancing underwater visual applications.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104704"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147422109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Frequency domain-based edge sensing for camouflaged object detection
Pub Date: 2026-03-01 | Epub Date: 2026-02-12 | DOI: 10.1016/j.cviu.2026.104705
Bin Ge, Xiaolong Peng, Chenxing Xia, Hailong Chen
Camouflaged Object Detection (COD), an emerging research direction in computer vision, faces a core challenge: accurately segmenting objects that are naturally or artificially concealed within visually similar backgrounds. In COD tasks, camouflaged objects often exhibit high similarity to their surroundings in texture and color, rendering traditional saliency cues insufficient for reliable target-background discrimination. In contrast, edges serve as structural cues that offer more stable and explicit boundary information, thereby facilitating accurate localization of camouflaged object contours. Motivated by this insight, we propose a Frequency-Guided Edge Encoder (FGEE), which employs a spatial-frequency dual-branch cascaded architecture to enable multi-scale edge modeling and extract more precise, fine-grained edge features. Furthermore, we introduce a Feature Progressive Reinforcement Module (FPRM) that combines reverse attention mechanisms and deformable convolutions to suppress foreground distractions and mine structural representations of camouflaged objects for enhanced feature learning. Additionally, we design an Edge-Driven Hierarchical Feature Aggregator (EDHFA) that dynamically integrates contextual information by detecting discrepancies between dual-branch features, generating initial edge contours, and progressively refining edge representations. Extensive experiments on four widely used COD benchmark datasets demonstrate that the proposed network, FDESNet, which integrates these three modules, surpasses 15 state-of-the-art methods with significant improvements in segmentation performance. The source code is available at https://github.com/Pengxiaolong293/FDESNet.
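For readers unfamiliar with reverse attention, the minimal PyTorch sketch below illustrates the idea FPRM builds on: inverting a coarse prediction so features are re-weighted toward regions not yet claimed as foreground. The deformable-convolution half of FPRM is omitted, and all shapes are illustrative assumptions.

```python
# Reverse attention in its plain published form; FPRM's full design may differ.
import torch

def reverse_attention(feats: torch.Tensor, coarse_logits: torch.Tensor) -> torch.Tensor:
    """feats: (B, C, H, W) current-stage features; coarse_logits: (B, 1, H, W)."""
    reverse_mask = 1.0 - torch.sigmoid(coarse_logits)  # high where object is NOT yet predicted
    return feats * reverse_mask                        # re-focus features on missed regions

refined = reverse_attention(torch.randn(2, 64, 44, 44), torch.randn(2, 1, 44, 44))
```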
{"title":"Frequency domain-based edge sensing for camouflaged object detection","authors":"Bin Ge, Xiaolong Peng, Chenxing Xia, Hailong Chen","doi":"10.1016/j.cviu.2026.104705","DOIUrl":"10.1016/j.cviu.2026.104705","url":null,"abstract":"<div><div>Camouflaged Object Detection (COD), an emerging research direction in computer vision, faces a core challenge: accurately segmenting objects that are naturally or artificially concealed within visually similar backgrounds. In COD tasks, camouflaged objects often exhibit high similarity to their surroundings in terms of texture and color, rendering traditional saliency cues insufficient for reliable target-background discrimination. In contrast, edges serve as structural cues that offer more stable and explicit boundary information, thereby facilitating accurate localization of camouflaged object contours. Motivated by this insight, we propose a Frequency-Guided Edge Encoder (FGEE), which employs a spatial-frequency dual-branch cascaded architecture to enable multi-scale edge modeling and extract more precise and fine-grained edge features. Furthermore, we introduce a Feature Progressive Reinforcement Module (FPRM) that leverages a combination of reverse attention mechanisms and deformable convolutions to suppress foreground distractions and mine structural representations of camouflaged objects for enhanced feature learning. Additionally, we design an Edge-Driven Hierarchical Feature Aggregator (EDHFA) that dynamically integrates contextual information by detecting discrepancies between dual-branch features, generating initial edge contours, and progressively refining edge representations. Extensive experimental results conducted on four widely used COD benchmark datasets demonstrate that the proposed FDESNet surpasses 15 state-of-the-art methods, achieving significant improvements in segmentation performance. The source code is available at <span><span>https://github.com/Pengxiaolong293/FDESNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104705"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ShortNeXt: A novel method for accurate classification of colorectal cancer histopathology images
Prabal Datta Barua, Burak Tasci, Mehmet Baygin, Sengul Dogan, Turker Tuncer, Filippo Molinari, Salvi Massimo, U. Rajendra Acharya
Pub Date: 2026-03-01 | DOI: 10.1016/j.cviu.2026.104672
Cancer, often described as the plague of our age, is a common disease with many subtypes and a high mortality rate, and many researchers have therefore studied its detection and treatment. To contribute to machine learning-based cancer research, we present a new-generation convolutional neural network (CNN) termed ShortNeXt. ShortNeXt is inspired by the ResNet, ConvNeXt, and MobileNet architectures and combines their advantages. The model, which aims to extract robust feature maps using convolution-based residual blocks, is named ShortNeXt because it incorporates more than one shortcut. The architecture has four main stages: (i) input/stem, (ii) ShortNeXt blocks, (iii) downsampling, and (iv) output. Throughout the network, convolution, batch normalization, and the Gaussian Error Linear Unit (GELU) activation are used, which keeps the implementation simple. The stem stage applies a 4 × 4 convolution with stride 4, the "patchify" operation used in ConvNeXt and the Swin Transformer, and a 2 × 2 patchify block is used for downsampling. Each ShortNeXt block employs an inverted bottleneck with both 1 × 1 and 3 × 3 convolution blocks in the expansion phase. Drawing inspiration from MobileNetV2, the output layer increases the number of filters from 768 to 1280 using a pixel-wise convolution, and global average pooling (GAP) then yields a final feature map of length 1280. The classification phase uses fully connected and softmax operators.
For comparative evaluation of ShortNeXt, a publicly available nine-class histopathological image dataset was used. ShortNeXt achieved validation and test accuracies of 97.82% and 97.86%, respectively. These results show that ShortNeXt is an effective deep learning method for histopathological image classification in cancer detection.
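The stem, block, and head structure described above maps fairly directly onto code. The following hedged PyTorch sketch follows the abstract's stated choices (4 × 4 stride-4 stem, 2 × 2 patchify downsampling, inverted bottleneck with 1 × 1 and 3 × 3 expansion convolutions, BN + GELU, 768 → 1280 pixel-wise head, GAP); per-stage widths, block counts, and how the two expansion convolutions are combined are not given in the abstract and are assumptions here.

```python
# Hedged sketch of the ShortNeXt macro-structure; stage widths and block counts
# are illustrative assumptions (only the 768 -> 1280 head is stated).
import torch
import torch.nn as nn

class ShortNeXtBlock(nn.Module):
    """Inverted bottleneck with 1x1 and 3x3 convs in the expansion phase,
    BN + GELU, and a residual shortcut (hence 'ShortNeXt')."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.expand1 = nn.Sequential(nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.GELU())
        self.expand3 = nn.Sequential(nn.Conv2d(dim, hidden, 3, padding=1), nn.BatchNorm2d(hidden), nn.GELU())
        self.project = nn.Sequential(nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim))

    def forward(self, x):
        # Summing the 1x1 and 3x3 expansion paths is one plausible reading of
        # "both 1x1 and 3x3 convolution blocks ... in the expansion phase".
        return x + self.project(self.expand1(x) + self.expand3(x))

class ShortNeXt(nn.Module):
    def __init__(self, num_classes=9, dims=(96, 192, 384, 768)):
        super().__init__()
        # Stem: 4x4 conv, stride 4 ("patchify", as in ConvNeXt / Swin).
        self.stem = nn.Sequential(nn.Conv2d(3, dims[0], 4, stride=4), nn.BatchNorm2d(dims[0]))
        stages = []
        for i, d in enumerate(dims):
            stages.append(ShortNeXtBlock(d))
            if i < len(dims) - 1:  # 2x2 patchify downsampling between stages
                stages.append(nn.Conv2d(d, dims[i + 1], 2, stride=2))
        self.stages = nn.Sequential(*stages)
        # Head: pixel-wise conv 768 -> 1280 (MobileNetV2-style), then GAP + FC.
        self.head_conv = nn.Sequential(nn.Conv2d(dims[-1], 1280, 1), nn.BatchNorm2d(1280), nn.GELU())
        self.classifier = nn.Linear(1280, num_classes)

    def forward(self, x):
        x = self.head_conv(self.stages(self.stem(x)))
        x = x.mean(dim=(2, 3))        # global average pooling -> length-1280 vector
        return self.classifier(x)     # softmax is applied inside CrossEntropyLoss

logits = ShortNeXt()(torch.randn(1, 3, 224, 224))  # -> shape (1, 9)
```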
{"title":"ShortNeXt: A novel method for accurate classification of colorectal cancer histopathology images","authors":"Prabal Datta Barua , Burak Tasci , Mehmet Baygin , Sengul Dogan , Turker Tuncer , Filippo Molinari , Salvi Massimo , U. Rajendra Acharya","doi":"10.1016/j.cviu.2026.104672","DOIUrl":"10.1016/j.cviu.2026.104672","url":null,"abstract":"<div><div>Cancer is a chaotic disease known as the plague of our age and there are many subtypes of the cancer. Cancer is commonly seen disorder and its mortality rate is very high. Therefore, many researchers have worked/studied on the cancer detection and treatment. To contribute cancer studies according to machine learning, we have presented a new generation convolutional neural network (CNN) termed ShortNeXt in this research. The presented ShortNeXt has inspired by ResNet, ConvNeXt and MobileNet architectures to use the advantages these CNNs together. This model, which aims to extract robust feature map using convolution-based residual blocks, is named ShortNeXt because it incorporates more than one shortcut. The ShortNeXt architecture has four main stages and these stages are: (i) an input/stem, (ii) ShortNeXt, (iii) downsampling, and (iv) output. In this CNN architecture, convolution, batch normalization and the Gaussian Error Linear Unit (GELU) activation functions have been utilized. In this aspect, the implementation of the recommended ShortNeXt is simple. The stem stage uses a 4 × 4 sized convolution with stride 4 like ConvNeXt and Swin Transformer and this operation is named patchify operation. Additionally, a 2 × 2 patchify block has been used in the downsampling block. In the ShortNeXt block, an inverted bottleneck has been used, and both 1 × 1 and 3 × 3 convolution blocks are employed in the expansion phase. The output layer has increased the number of filters from 768 to 1280 by using pixel-wise convolution, drawing inspiration from MobileNetV2 and a final feature map with a length of 1280 has been obtained by deploying global average pooling (GAP). In the classification phase, fully connected and softmax operators have been used.</div><div>To get comparative results about to the recommended ShortNeXt, a publicly available histopathological image dataset has been used and this dataset contains nine classes, and the proposed ShortNeXt has achieved 97.82% and 97.86% validation and test accuracy, respectively. The obtained results and findings openly showcases that ShortNeXt is an effective deep learning method for histopathological image classification for cancer detection/classification.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104672"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191772","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
QB-MOTR: A simple query bootstrapping end-to-end multi-object tracking method with transformer
Pub Date: 2026-03-01 | Epub Date: 2026-02-17 | DOI: 10.1016/j.cviu.2026.104682
Zifan Han, Xuchong Zhang, Hang Wang, Hongbin Sun
Tracking-by-query based multi-object tracking (MOT) aims to simplify the complicated and tiresome post-processing of the traditional tracking-by-detection paradigm in an end-to-end manner. However, tracking-by-query methods usually suffer from a conflict between detection and association, caused by the semantic ambiguity between tracking and detection instances in joint training, resulting in unsatisfactory performance compared to tracking-by-detection. Previous tracking-by-query methods typically use an extra detector to decouple the detection and association tasks, but this inevitably introduces complex operations such as additional detectors or manual hyperparameter adjustment. In this paper, we propose a simple end-to-end MOT method, Query Bootstrapping Multi-Object Tracking with TRansformer (QB-MOTR), to alleviate this conflict. Specifically, a Query Bootstrapping module is designed to enhance the semantic features of the tracking queries in order to distinguish detection and tracking instances. This module effectively integrates both positional and specific semantic information into the tracker while maintaining a simple pipeline for the whole network. Tracking performance is evaluated against various MOT networks on multiple datasets. Results demonstrate that QB-MOTR surpasses the baseline method MOTR by about 18.1%. Moreover, its detection and association performance is superior to the state-of-the-art end-to-end method MeMOTR, with a much simpler training and inference pipeline.
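The tracking-by-query paradigm the paper builds on can be summarized in a few lines: track queries carried over from the previous frame are concatenated with learnable detect queries and decoded jointly against the frame's features. The sketch below shows this, with the bootstrapping step represented only as an assumed residual enrichment of the track queries; the paper's actual module design is not specified in the abstract.

```python
# MOTR-style joint decoding of track + detect queries; the "bootstrap" MLP is
# an illustrative assumption standing in for the paper's Query Bootstrapping module.
import torch
import torch.nn as nn

class JointQueryDecoder(nn.Module):
    def __init__(self, dim=256, num_detect_queries=100):
        super().__init__()
        self.detect_queries = nn.Parameter(torch.randn(num_detect_queries, dim))
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=6)
        # Assumed bootstrapping: enrich track queries so they stay semantically
        # distinguishable from fresh detect queries during joint training.
        self.bootstrap = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, img_memory, track_queries):
        # img_memory: (B, HW, dim) encoder features; track_queries: (B, T, dim).
        B = img_memory.size(0)
        track = track_queries + self.bootstrap(track_queries)   # residual enrichment
        detect = self.detect_queries.unsqueeze(0).expand(B, -1, -1)
        queries = torch.cat([track, detect], dim=1)             # joint decoding
        return self.decoder(queries, img_memory)                # (B, T+100, dim)

out = JointQueryDecoder()(torch.randn(2, 196, 256), torch.randn(2, 5, 256))
```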
{"title":"QB-MOTR: A simple query bootstrapping end-to-end multi-object tracking method with transformer","authors":"Zifan Han , Xuchong Zhang , Hang Wang , Hongbin Sun","doi":"10.1016/j.cviu.2026.104682","DOIUrl":"10.1016/j.cviu.2026.104682","url":null,"abstract":"<div><div>Tracking-by-query based multi-object-tracking (MOT) aims to simplify the complicated and tiresome post-processing of the traditional tracking-by-detection paradigm in an end-to-end manner. However, the former method usually suffers from the conflict between detection and association due to the semantic ambiguity between tracking and detection instances in joint training, resulting in unsatisfactory performance compared to the latter method. Previous tracking-by-query methods usually use an extra detector to decouple the detection and association tasks. However, these methods inevitably introduce complex operations like additional detectors or manual hyperparameters adjustment. In this paper, we propose a simple end-to-end MOT method, <strong>Q</strong>uery <strong>B</strong>oostrapping <strong>M</strong>ulti-<strong>O</strong>bject <strong>T</strong>racking with T<strong>R</strong>ansformer (<strong>QB-MOTR</strong>) to alleviate the conflict. Specifically, a Query Boostarpping module is designed to enhance the semantic features of the tracking query in order to distinguish the detection and tracking instances. This module integrates both positional and specific semantic information into the tracker effectively while maintaining the simple pipeline of the whole network. The tracking performance of various MOT networks is evaluated on multiple datasets. Evaluation results demonstrate that QB-MOTR surpasses baseline method MOTR by about 18.1%. Besides, the detection and association performance is superior to the state-of-the-art end-to-end method MeMOTR with much simpler training and inference pipeline.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104682"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147422111","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OVGrasp: Open-Vocabulary Intent Detection for Grasping Assistance using ExoGlove
Pub Date: 2026-03-01 | Epub Date: 2026-02-03 | DOI: 10.1016/j.cviu.2026.104676
Chen Hu, Shan Luo, Letizia Gionfrida
Grasping assistance is essential for restoring autonomy in individuals with motor impairments, particularly in unstructured environments where object categories and user intentions are diverse and unpredictable. We present OVGrasp, a hierarchical control framework for grasp assistance that integrates RGB-D vision, open-vocabulary prompts, and voice commands to enable robust multimodal interaction. To enhance generalisation in open environments, OVGrasp incorporates a vision-language foundation model with an open-vocabulary mechanism, enabling zero-shot detection of previously unseen objects without retraining. A multimodal decision maker further fuses spatial and linguistic cues to infer user intent, such as grasp or release, in situations involving multiple objects. We deploy the complete framework on a custom egocentric-view wearable exoskeleton and conduct systematic evaluations on fifteen objects across three grasp types. Experimental results with ten participants show that OVGrasp achieves a grasping ability score (GAS) of 87.00%, surpassing existing baselines and providing improved kinematic alignment with natural hand movement.
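The open-vocabulary, zero-shot ingredient can be illustrated with a public vision-language checkpoint. The sketch below scores an image crop against free-form text prompts using CLIP via Hugging Face transformers; this is a stand-in assumption, not OVGrasp's actual detector or fusion logic.

```python
# Zero-shot region scoring with a public CLIP checkpoint (assumed stand-in).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def score_region(image: Image.Image, prompts: list[str]) -> dict[str, float]:
    """Return a probability per prompt for one RGB crop (zero-shot, no retraining)."""
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image  # (1, num_prompts)
    probs = logits.softmax(dim=-1).squeeze(0)
    return dict(zip(prompts, probs.tolist()))

# e.g. fuse with a voice command ("grasp the mug") by ranking detected crops:
# score_region(crop, ["a mug", "a bottle", "a remote control"])
```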
{"title":"OVGrasp: Open-Vocabulary Intent Detection for Grasping Assistance using ExoGlove","authors":"Chen Hu , Shan Luo , Letizia Gionfrida","doi":"10.1016/j.cviu.2026.104676","DOIUrl":"10.1016/j.cviu.2026.104676","url":null,"abstract":"<div><div>Grasping assistance is essential for restoring autonomy in individuals with motor impairments, particularly in unstructured environments where object categories and user intentions are diverse and unpredictable. We present <strong>OVGrasp</strong>, a hierarchical control framework for grasp assistance that integrates RGB-D vision, open vocabulary prompts, and voice commands to enable robust multimodal interaction. To enhance generalisation in open environments, OVGrasp incorporates a vision language foundation model with an open vocabulary mechanism, which enables zero-shot detection of previously unseen objects without retraining. A multimodal decision maker further fuses spatial and linguistic cues to infer user intent, such as grasp or release, in situations involving multiple objects. We deploy the complete framework on a custom egocentric view wearable exoskeleton and conduct systematic evaluations on fifteen objects across three grasp types. Experimental results with ten participants show that OVGrasp achieves a grasping ability score (GAS) of 87.00%, surpassing existing baselines and providing improved kinematic alignment with natural hand movement.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104676"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SpectraDiff: Enhancing the fidelity of Infrared Image Translation with object-aware diffusion
Pub Date: 2026-03-01 | Epub Date: 2026-02-26 | DOI: 10.1016/j.cviu.2026.104709
Incheol Park, Youngwan Jin, Nalcakan Yagiz, Hyeongjin Ju, Sanghyeop Yeo, Shiho Kim
Autonomous systems commonly rely on RGB cameras, which are susceptible to failure in low-light and adverse conditions. Infrared (IR) imaging provides a viable alternative by capturing thermal signatures independent of visible illumination. However, its high cost and integration complexities limit widespread adoption. To address these challenges, we introduce SpectraDiff, a diffusion-based framework that synthesizes realistic IR images by fusing RGB inputs with refined semantic segmentation. Through our RGB-Seg Object-Aware (RSOA) module, SpectraDiff learns object-specific IR intensities by leveraging object-aware features. The SpectraDiff architecture, featuring a novel Spectral Attention Block, enforces self-attention among semantically similar pixels while leveraging cross-attention with the original RGB to preserve high-frequency details. Extensive evaluations on FLIR, FMB, MFNet, IDD-AW, and RANUS demonstrate SpectraDiff’s superior performance over existing methods, as measured by both perceptual (FID, LPIPS, DISTS) and fidelity (SSIM, SAM) metrics. Code and pretrained models are available at: https://yonsei-stl.github.io/SpectraDiff/.
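One plausible reading of "self-attention among semantically similar pixels" is attention restricted by the segmentation map, so each query attends only to keys sharing its class label. The sketch below implements that reading with PyTorch's scaled_dot_product_attention; names and shapes are illustrative assumptions, not the paper's Spectral Attention Block.

```python
# Segment-restricted self-attention (assumed reading of the Spectral Attention Block).
import torch
import torch.nn.functional as F

def segment_masked_attention(feats: torch.Tensor, seg: torch.Tensor) -> torch.Tensor:
    """feats: (B, N, C) per-pixel features; seg: (B, N) integer class labels."""
    # Boolean mask: True where query and key pixels share a segmentation label.
    same_class = seg.unsqueeze(2) == seg.unsqueeze(1)   # (B, N, N)
    return F.scaled_dot_product_attention(
        feats, feats, feats,                            # Q = K = V (self-attention)
        attn_mask=same_class,                           # attend only within a segment
    )

B, H, W, C = 1, 16, 16, 64
out = segment_masked_attention(torch.randn(B, H * W, C),
                               torch.randint(0, 5, (B, H * W)))
```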
{"title":"SpectraDiff: Enhancing the fidelity of Infrared Image Translation with object-aware diffusion","authors":"Incheol Park , Youngwan Jin , Nalcakan Yagiz , Hyeongjin Ju , Sanghyeop Yeo , Shiho Kim","doi":"10.1016/j.cviu.2026.104709","DOIUrl":"10.1016/j.cviu.2026.104709","url":null,"abstract":"<div><div>Autonomous systems commonly rely on RGB cameras, which are susceptible to failure in low-light and adverse conditions. Infrared (IR) imaging provides a viable alternative by capturing thermal signatures independent of visible illumination. However, its high cost and integration complexities limit widespread adoption. To address these challenges, we introduce SpectraDiff, a diffusion-based framework that synthesizes realistic IR images by fusing RGB inputs with refined semantic segmentation. Through our RGB-Seg Object-Aware (RSOA) module, SpectraDiff learns object-specific IR intensities by leveraging object-aware features. The SpectraDiff architecture, featuring a novel Spectral Attention Block, enforces self-attention among semantically similar pixels while leveraging cross-attention with the original RGB to preserve high-frequency details. Extensive evaluations on FLIR, FMB, MFNet, IDD-AW, and RANUS demonstrate SpectraDiff’s superior performance over existing methods, as measured by both perceptual (FID, LPIPS, DISTS) and fidelity (SSIM, SAM) metrics. Code and pretrained models are available at: <span><span>https://yonsei-stl.github.io/SpectraDiff/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"266 ","pages":"Article 104709"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147426834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
KASS: Efficient video artifact removal via Kernel-Adaptive Spatiotemporal Synchronization
Liqun Lin, Fawei Tang, Mingxing Wang, Yipeng Liao, Tiesong Zhao
Pub Date: 2026-03-01 | DOI: 10.1016/j.cviu.2026.104649
Video compression is essential for reducing bandwidth and storage demands, but often introduces artifacts that impair visual quality. Current Video Compression Artifact Removal (VCAR) methods face challenges including high computational complexity and unstable enhancement performance. To address these issues, we propose a novel Kernel-Adaptive Spatiotemporal Synchronization (KASS) network. First, a Dual-branch Alignment Module (DAM) enables multi-receptive-field feature alignment for modeling complex motion patterns. Second, an Adaptive Spatial Attention (ASA) block employs multi-branch deformable convolution with varying kernel sizes to locate artifacts. It then restores high-frequency details efficiently through attention-guided reconstruction. Third, a Spatiotemporal Multi-scale Alignment (SMA) block captures global spatiotemporal information and integrates multi-frame features via spatial and channel attention. This design effectively removes artifacts while improving alignment and enhancement stability. Experiments demonstrate that KASS significantly improves artifact removal performance while overcoming key limitations in alignment accuracy, computational burden, and enhancement stability.
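The ASA block's core mechanism, multi-branch deformable convolution with varying kernel sizes, can be sketched with torchvision's DeformConv2d as below. Branch count, kernel sizes, and the fusion rule are illustrative assumptions.

```python
# Multi-kernel deformable-convolution branch (assumed configuration).
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class MultiKernelDeformBranch(nn.Module):
    def __init__(self, channels, kernel_sizes=(3, 5, 7)):
        super().__init__()
        self.offsets, self.deforms = nn.ModuleList(), nn.ModuleList()
        for k in kernel_sizes:
            # Each branch predicts its own (2 * k * k)-channel offset field.
            self.offsets.append(nn.Conv2d(channels, 2 * k * k, 3, padding=1))
            self.deforms.append(DeformConv2d(channels, channels, k, padding=k // 2))
        self.fuse = nn.Conv2d(channels * len(kernel_sizes), channels, 1)

    def forward(self, x):
        # Sample each branch at learned offsets, then fuse the receptive fields.
        outs = [d(x, off(x)) for off, d in zip(self.offsets, self.deforms)]
        return self.fuse(torch.cat(outs, dim=1))

y = MultiKernelDeformBranch(32)(torch.randn(1, 32, 64, 64))  # same spatial size
```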
{"title":"KASS: Efficient video artifact removal via Kernel-Adaptive Spatiotemporal Synchronization","authors":"Liqun Lin, Fawei Tang, Mingxing Wang, Yipeng Liao, Tiesong Zhao","doi":"10.1016/j.cviu.2026.104649","DOIUrl":"10.1016/j.cviu.2026.104649","url":null,"abstract":"<div><div>Video compression is essential for reducing bandwidth and storage demands, but often introduces artifacts that impair visual quality. Current Video Compression Artifact Removal (VCAR) methods face challenges including high computational complexity and unstable enhancement performance. To address these issues, we propose a novel Kernel-Adaptive Spatiotemporal Synchronization (KASS) network. First, a Dual-branch Alignment Module (DAM) enables multi-receptive-field feature alignment for modeling complex motion patterns. Second, an Adaptive Spatial Attention (ASA) block employs multi-branch deformable convolution with varying kernel sizes to locate artifacts. It then restores high-frequency details efficiently through attention-guided reconstruction. Third, a Spatiotemporal Multi-scale Alignment (SMA) block captures global spatiotemporal information and integrates multi-frame features via spatial and channel attention. This design effectively removes artifacts while improving alignment and enhancement stability. Experiments demonstrate that KASS significantly improves artifact removal performance while overcoming key limitations in alignment accuracy, computational burden, and enhancement stability.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104649"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AnomalySD: One-for-all few-shot anomaly detection via pre-trained diffusion models
Pub Date: 2026-03-01 | Epub Date: 2026-02-04 | DOI: 10.1016/j.cviu.2026.104668
Zhenyu Yan, Qingqing Fang, Wenxi Lv, Qinliang Su
Anomaly detection is a critical task in industrial manufacturing, aiming to identify defective parts of products. Most industrial anomaly detection methods assume that sufficient normal data are available for training. This assumption may not hold due to labeling costs or data privacy policies. Additionally, mainstream methods require training bespoke models for different objects, which incurs heavy costs and lacks flexibility in practice. To address these issues, we propose, for the first time, to leverage a pretrained generative model, Stable Diffusion (SD), for the one-for-all few-shot anomaly detection task, in contrast to existing few-shot anomaly detection works that rely heavily on the pre-trained, representation-based CLIP model. To adapt SD to the anomaly detection task, we design hierarchical text descriptions and a foreground-mask mechanism for fine-tuning. At the testing stage, to accurately mask anomalous regions for inpainting, we propose a multi-scale mask strategy and a prototype-guided mask strategy to handle diverse anomalous regions, and hierarchical text prompts further guide the inpainting process during inference. Extensive experiments on the MVTec-AD and VisA datasets demonstrate the superiority of our approach: we achieve anomaly classification and segmentation results of 93.6%/94.8% AUROC on MVTec-AD and 86.1%/96.5% AUROC on VisA under multi-class and one-shot settings. The source code is available at https://github.com/YanZhenyu1999/AnomalySD.git.
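The inpainting step at the heart of this approach can be approximated with an off-the-shelf Stable Diffusion inpainting pipeline: mask suspected regions, repaint them as "normal", and read the pixel-wise difference as an anomaly map. The checkpoint, prompt, and plain differencing below are assumptions; the paper's fine-tuned SD, hierarchical prompts, and multi-scale/prototype mask strategies are not reproduced.

```python
# Inpaint-and-compare anomaly map with a stock SD inpainting checkpoint (assumed).
import numpy as np
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

def anomaly_map(image: Image.Image, mask: Image.Image,
                prompt: str = "a photo of a flawless industrial part") -> np.ndarray:
    """mask is white where suspected-anomalous pixels should be repainted."""
    image, mask = image.resize((512, 512)), mask.resize((512, 512))
    repainted = pipe(prompt=prompt, image=image, mask_image=mask).images[0]
    # Pixel-wise difference between repainted "normal" image and the input.
    diff = np.abs(np.asarray(repainted, np.float32) - np.asarray(image, np.float32))
    return diff.mean(axis=-1)  # (512, 512); large values indicate likely defects
```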
{"title":"AnomalySD: One-for-all few-shot anomaly detection via pre-trained diffusion models","authors":"Zhenyu Yan, Qingqing Fang, Wenxi Lv, Qinliang Su","doi":"10.1016/j.cviu.2026.104668","DOIUrl":"10.1016/j.cviu.2026.104668","url":null,"abstract":"<div><div>Anomaly detection is a critical task in industrial manufacturing, aiming to identify defective parts of products. Most industrial anomaly detection methods assume the availability of sufficient normal data for training. This assumption may not hold true due to the cost of labeling or data privacy policies. Additionally, mainstream methods require training bespoke models for different objects, which incurs heavy costs and lacks flexibility in practice. To address these issues, in this paper, we for the first time propose to leverage the pretrained generative model, Stable Diffusion (SD), to perform the one-for-all few-shot anomaly detection task, in contrast to existing few-shot anomaly detection works that heavily rely on the use of pre-trained representation-based CLIP model. To adapt SD to anomaly detection task, we design different hierarchical text descriptions and the foreground mask mechanism for fine-tuning the SD. At the testing stage, to accurately mask anomalous regions for inpainting, we propose a multi-scale mask strategy and prototype-guided mask strategy to handle diverse anomalous regions. Hierarchical text prompts are also utilized to guide the process of inpainting in the inference stage. Extensive experiments on the MVTec-AD and VisA datasets demonstrate the superiority of our approach. We achieved anomaly classification and segmentation results of 93.6%/94.8% AUROC on the MVTec-AD dataset and 86.1%/96.5% AUROC on the VisA dataset under multi-class and one-shot settings. The source code of our method is available at <span><span>https://github.com/YanZhenyu1999/AnomalySD.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104668"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191770","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhancing workplace safety through assistive computer vision: Real-time hazard recognition using the Workplace Hazards Dataset (WHD)
Pub Date: 2026-03-01 | Epub Date: 2026-02-06 | DOI: 10.1016/j.cviu.2026.104681
Masoud Ayoubi, Mehrdad Arashpour
Assistive computer vision technologies have the potential to significantly enhance workplace safety by enabling early detection of hazards and supporting proactive risk management. However, the development of such systems is constrained by the absence of comprehensive video datasets and clearly defined tasks that capture real-world hazard conditions. This study formulates pre-incident hazard recognition as a distinct assistive-vision problem, focusing on identifying unsafe states that precede incidents rather than the incidents themselves. To address this problem, we propose the Workplace Hazards Dataset (WHD), a balanced and diverse set of real-world videos representing five universal hazard categories in varied workplace settings. Furthermore, we establish a standardized benchmarking framework that evaluates state-of-the-art convolutional and transformer-based video models on both performance and inference-latency metrics to assess real-time feasibility. Experimental results show that the Multiscale Vision Transformer (MViT 16 × 4) achieves the highest accuracy (74.1%) while maintaining efficient inference speed, highlighting the importance of balancing recognition accuracy with processing time. Overall, this work defines a new benchmark task for assistive computer vision and provides the foundation for developing real-time hazard recognition systems that enhance safety and efficiency in high-risk environments.
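The accuracy-versus-latency protocol can be reproduced in outline with torchvision's stock MViT video model, as in the sketch below; the paper's exact MViT 16 × 4 variant, preprocessing, and measurement setup are not specified here, so treat this as an assumed stand-in.

```python
# Inference-latency measurement for a video classifier (assumed stand-in model).
import time
import torch
from torchvision.models.video import mvit_v2_s, MViT_V2_S_Weights

model = mvit_v2_s(weights=MViT_V2_S_Weights.KINETICS400_V1).eval()
clip = torch.randn(1, 3, 16, 224, 224)  # (batch, channels, frames, H, W)

with torch.no_grad():
    for _ in range(3):                   # warm-up runs before timing
        model(clip)
    start = time.perf_counter()
    n = 10
    for _ in range(n):
        model(clip)
    print(f"mean latency: {(time.perf_counter() - start) / n * 1000:.1f} ms/clip")
```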
{"title":"Enhancing workplace safety through assistive computer vision: Real-time hazard recognition using the Workplace Hazards Dataset (WHD)","authors":"Masoud Ayoubi, Mehrdad Arashpour","doi":"10.1016/j.cviu.2026.104681","DOIUrl":"10.1016/j.cviu.2026.104681","url":null,"abstract":"<div><div>Assistive computer vision technologies have the potential to significantly enhance workplace safety by enabling early detection of hazards and supporting proactive risk management. However, the development of such systems is constrained by the absence of comprehensive video datasets and clearly defined tasks that capture real-world hazard conditions. This study formulates pre-incident hazard recognition as a distinct assistive-vision problem, focusing on identifying unsafe states that precede incidents rather than the incidents themselves. To address this problem, we propose the Workplace Hazards Dataset (WHD), a balanced and diverse set of real-world videos representing five universal hazard categories in varied workplace settings. Furthermore, we establish a standardized benchmarking framework that evaluates state-of-the-art convolutional and transformer-based video models on both performance and inference-latency metrics to assess real-time feasibility. Experimental results show that the Multiscale Vision Transformer (MViT 16 × 4) achieves the highest accuracy (74.1%) while maintaining efficient inference speed, highlighting the importance of balancing recognition accuracy with processing time. Overall, this work defines a new benchmark task for assistive computer vision and provides the foundation for developing real-time hazard recognition systems that enhance safety and efficiency in high-risk environments.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104681"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146191774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SRDR: Style recovery and detail replenishment matter for single image dehazing
Pub Date: 2026-03-01 | Epub Date: 2026-02-17 | DOI: 10.1016/j.cviu.2026.104688
Yuehua Li, Songwei Pei, Wenzheng Yang, BingFeng Liu, Shuhuai Wang
Single image dehazing, a representative low-level vision task, is of paramount importance for applications such as object detection and autonomous driving. Most image dehazing methods focus on directly learning the overall difference between hazy and clear image pairs, which makes the learning task excessively challenging and restricts dehazing performance to a certain extent. In this paper, we draw on the notions of image style and content from the style transfer task: the degradation of hazy images typically involves a transition of style and a hiding of details, so the dehazing task can be divided and conquered along these two aspects to reduce the learning difficulty of the network. Based on this insight, we propose an image dehazing network with a specific focus on style recovery and detail replenishment, namely SRDR, which first recovers the style and extracts the details of the hazy image, respectively, and then aggregates the information from both for better dehazing. SRDR consists of three main modules: a Style Recovery Module (SRM), a Detail Replenishment Module (DRM), and a Cross Fusion Module (CFM). The SRM handles style recovery by adapting the pre-trained MAE model, the DRM replenishes details with multiple direction convolutions, and the CFM aggregates the two information streams. Extensive experiments demonstrate that SRDR achieves state-of-the-art performance on numerous mainstream datasets.
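One simple reading of "multiple direction convolutions" is a bank of fixed directional-derivative kernels (horizontal, vertical, and the two diagonals) whose responses are stacked as detail maps. The sketch below implements that reading; the DRM's real kernels and whether they are learned are assumptions.

```python
# Fixed directional-derivative filter bank (assumed reading of the DRM).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectionalDetail(nn.Module):
    """Extracts edge responses along four orientations from a grayscale map."""
    def __init__(self):
        super().__init__()
        h  = torch.tensor([[ 0., 0.,  0.], [-1., 0., 1.], [0., 0., 0.]])  # horizontal
        v  = h.t()                                                        # vertical
        d1 = torch.tensor([[-1., 0.,  0.], [ 0., 0., 0.], [0., 0., 1.]])  # diagonal
        d2 = torch.tensor([[ 0., 0., -1.], [ 0., 0., 0.], [1., 0., 0.]])  # anti-diagonal
        # (4, 1, 3, 3) conv weight: one output channel per direction.
        self.register_buffer("kernels", torch.stack([h, v, d1, d2]).unsqueeze(1))

    def forward(self, x):                             # x: (B, 1, H, W)
        return F.conv2d(x, self.kernels, padding=1)   # (B, 4, H, W) detail maps

details = DirectionalDetail()(torch.randn(2, 1, 128, 128))
```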
{"title":"SRDR: Style recovery and detail replenishment matter for single image dehazing","authors":"Yuehua Li, Songwei Pei, Wenzheng Yang, BingFeng Liu, Shuhuai Wang","doi":"10.1016/j.cviu.2026.104688","DOIUrl":"10.1016/j.cviu.2026.104688","url":null,"abstract":"<div><div>Single image dehazing, as a representative low-level vision task, is of paramount importance in substantial applications such as object detection and autonomous driving. Nowadays, most image dehazing methods focus on directly learning the overall difference between hazy and clear image pairs, thus making the learning task excessively challenging and restricting the performance of image dehazing to a certain extent. In this paper, we draw on considerations about image style and content from the style transfer task, and believe that the degradation of hazy images typically involves the transition of style and the hiding of details, and then the image dehazing task can be divided and conquered according to the two aspects to reduce the learning difficulty of the dehazing network for effective image dehazing. Based on this inspiration, in this paper, we propose an image dehazing network with a specific focus on style recovery and detail replenishment, namely SRDR, which firstly recovers the style and extracts the detail of the hazy image, respectively, and then aggregates the information from style recovery and detail replenishment for better image dehazing. The SRDR mainly consists of three modules: Style Recovery Module (SRM), Detail Replenishment Module (DRM), and Cross Fusion Module (CFM). SRM is responsible for style recovery by adapting the pre-trained MAE model, DRM handles the detail replenishment with multiple direction convolutions, and CFM is an information aggregation module. Extensive experiments demonstrate that SRDR achieves state-of-the-art performance on numerous mainstream datasets.</div></div>","PeriodicalId":50633,"journal":{"name":"Computer Vision and Image Understanding","volume":"265 ","pages":"Article 104688"},"PeriodicalIF":3.5,"publicationDate":"2026-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147422112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}