Computer Vision and Image Understanding: Latest Publications

UGLF-Net: A parallel architecture for Underwater Global-Local Feature Fusion Network
IF 3.5 | Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-02-12 | DOI: 10.1016/j.cviu.2026.104704
Erkang Chen, Wangen Chen, Zhiwei Shen, Zhihui Li, Zhiqi Lin
Underwater image enhancement is highly challenging due to low contrast, color distortion, and blurring caused by light attenuation and scattering. This paper proposes a novel parallel architecture, the Underwater Global-Local Feature Fusion Network (UGLF-Net), for robust image restoration. UGLF-Net consists of the AMFE module for high-quality global feature extraction, the HMCM module with SSM for selective local enhancement, and the Swin FAM module for capturing global context. By progressively fusing multi-source features (RGB, grayscale gradients, and reduced-dimension data) in a parallel manner, UGLF-Net achieves effective global-local collaborative modeling. Residual connections and Enhanced ECA modules further improve feature representation and training stability, enabling state-of-the-art (SOTA) performance. Experiments on the LSUI, EUVP, and UIEB datasets show that UGLF-Net outperforms existing methods, including the U-shape Transformer, in PSNR and SSIM. Ablation studies validate the effectiveness of each component. Qualitative results demonstrate superior restoration of vivid colors and fine details. The lightweight design, with a single-layer SSM and window attention, achieves efficient inference (0.009 s per image), making it well suited for real-time enhancement on embedded devices and advancing underwater visual applications.
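The abstract does not detail the internals of the AMFE, HMCM, or Swin FAM modules, but the role of the Enhanced ECA blocks and of the residual global-local fusion can be illustrated with a standard ECA-style channel attention gating the merge of two parallel branches. A minimal PyTorch sketch follows; the class names, the stand-in branches, and the kernel size are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class ECA(nn.Module):
    """Efficient Channel Attention: global average pool -> 1D conv over channels -> sigmoid gate."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, x):                                     # x: (B, C, H, W)
        w = self.pool(x)                                      # (B, C, 1, 1)
        w = self.conv(w.squeeze(-1).transpose(1, 2))          # (B, 1, C)
        w = torch.sigmoid(w).transpose(1, 2).unsqueeze(-1)    # (B, C, 1, 1)
        return x * w                                          # channel-reweighted features

class GlobalLocalFusion(nn.Module):
    """Fuse a global branch and a local branch with a residual connection and ECA gating."""
    def __init__(self, channels):
        super().__init__()
        self.global_branch = nn.Conv2d(channels, channels, 3, padding=1)  # stand-in for a global module
        self.local_branch = nn.Conv2d(channels, channels, 3, padding=1)   # stand-in for a local module
        self.eca = ECA(channels)
        self.proj = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        fused = self.proj(torch.cat([self.global_branch(x), self.local_branch(x)], dim=1))
        return x + self.eca(fused)                            # residual connection stabilizes training

feats = torch.randn(1, 64, 64, 64)
print(GlobalLocalFusion(64)(feats).shape)                     # torch.Size([1, 64, 64, 64])
```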
Citations: 0
Frequency domain-based edge sensing for camouflaged object detection
IF 3.5 | Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-02-12 | DOI: 10.1016/j.cviu.2026.104705
Bin Ge, Xiaolong Peng, Chenxing Xia, Hailong Chen
Camouflaged Object Detection (COD), an emerging research direction in computer vision, faces a core challenge: accurately segmenting objects that are naturally or artificially concealed within visually similar backgrounds. In COD tasks, camouflaged objects often exhibit high similarity to their surroundings in terms of texture and color, rendering traditional saliency cues insufficient for reliable target-background discrimination. In contrast, edges serve as structural cues that offer more stable and explicit boundary information, thereby facilitating accurate localization of camouflaged object contours. Motivated by this insight, we propose a Frequency-Guided Edge Encoder (FGEE), which employs a spatial-frequency dual-branch cascaded architecture to enable multi-scale edge modeling and extract more precise and fine-grained edge features. Furthermore, we introduce a Feature Progressive Reinforcement Module (FPRM) that leverages a combination of reverse attention mechanisms and deformable convolutions to suppress foreground distractions and mine structural representations of camouflaged objects for enhanced feature learning. Additionally, we design an Edge-Driven Hierarchical Feature Aggregator (EDHFA) that dynamically integrates contextual information by detecting discrepancies between dual-branch features, generating initial edge contours, and progressively refining edge representations. Extensive experiments on four widely used COD benchmark datasets demonstrate that the proposed FDESNet surpasses 15 state-of-the-art methods, achieving significant improvements in segmentation performance. The source code is available at https://github.com/Pengxiaolong293/FDESNet.
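The abstract names reverse attention as one ingredient of the FPRM. The general reverse-attention idea, weighting features by the complement of a coarse prediction so that the network refines regions the coarse map missed, can be sketched as below; the module layout and variable names are illustrative assumptions rather than the paper's code.

```python
import torch
import torch.nn as nn

class ReverseAttention(nn.Module):
    """Weight features by (1 - sigmoid(coarse_pred)) to mine regions the coarse map missed."""
    def __init__(self, channels):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 1, 3, padding=1),
        )

    def forward(self, feats, coarse_pred):
        # coarse_pred: (B, 1, H, W) logits from a previous decoder stage
        reverse_mask = 1.0 - torch.sigmoid(coarse_pred)   # high where the object is NOT yet predicted
        residual = self.refine(feats * reverse_mask)      # learn a correction on suppressed regions
        return coarse_pred + residual                     # refined prediction logits

feats = torch.randn(2, 32, 44, 44)
coarse = torch.randn(2, 1, 44, 44)
print(ReverseAttention(32)(feats, coarse).shape)          # torch.Size([2, 1, 44, 44])
```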
Citations: 0
ShortNeXt: A novel method for accurate classification of colorectal cancer histopathology images
IF 3.5 | Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-02-04 | DOI: 10.1016/j.cviu.2026.104672
Prabal Datta Barua, Burak Tasci, Mehmet Baygin, Sengul Dogan, Turker Tuncer, Filippo Molinari, Salvi Massimo, U. Rajendra Acharya
Cancer, often described as the plague of our age, is a complex disease with many subtypes. It is a common disorder with a very high mortality rate, and many researchers have therefore studied cancer detection and treatment. To contribute to machine-learning-based cancer studies, we present a new-generation convolutional neural network (CNN) termed ShortNeXt. ShortNeXt is inspired by the ResNet, ConvNeXt, and MobileNet architectures and combines their advantages. The model, which extracts robust feature maps using convolution-based residual blocks, is named ShortNeXt because it incorporates more than one shortcut connection. The architecture has four main stages: (i) an input/stem stage, (ii) ShortNeXt blocks, (iii) downsampling, and (iv) an output stage. It uses convolution, batch normalization, and the Gaussian Error Linear Unit (GELU) activation function, so the implementation is simple. The stem stage uses a 4 × 4 convolution with stride 4, as in ConvNeXt and the Swin Transformer; this operation is called a patchify operation, and a 2 × 2 patchify block is used for downsampling. The ShortNeXt block uses an inverted bottleneck, with both 1 × 1 and 3 × 3 convolution blocks in the expansion phase. Drawing inspiration from MobileNetV2, the output layer increases the number of filters from 768 to 1280 with a pixel-wise convolution, and a final feature vector of length 1280 is obtained by global average pooling (GAP). The classification phase uses fully connected and softmax layers.
To evaluate ShortNeXt, a publicly available histopathology image dataset with nine classes was used; the proposed model achieved 97.82% validation accuracy and 97.86% test accuracy. These results show that ShortNeXt is an effective deep learning method for histopathology image classification in cancer detection.
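The abstract specifies the stem, block, downsampling, and head in enough detail to sketch the overall data flow. The PyTorch sketch below follows that description, but the intermediate channel widths, the number of blocks per stage, and the exact shortcut wiring are assumptions; only the 4 × 4 stride-4 stem, the 2 × 2 downsampling, the inverted bottleneck with 1 × 1 and 3 × 3 expansion convolutions, the 768 to 1280 pixel-wise head, GAP, and the fully connected classifier come from the text.

```python
import torch
import torch.nn as nn

class ShortNeXtBlock(nn.Module):
    """Inverted bottleneck with 1x1 and 3x3 convolutions in the expansion phase and two shortcuts."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.expand1 = nn.Sequential(nn.Conv2d(dim, hidden, 1), nn.BatchNorm2d(hidden), nn.GELU())
        self.expand3 = nn.Sequential(nn.Conv2d(dim, hidden, 3, padding=1), nn.BatchNorm2d(hidden), nn.GELU())
        self.project = nn.Sequential(nn.Conv2d(hidden, dim, 1), nn.BatchNorm2d(dim))

    def forward(self, x):
        h = self.expand1(x) + self.expand3(x)   # merge the 1x1 and 3x3 expansion paths
        return x + self.project(h)              # residual shortcut back to the block input

class ShortNeXtSketch(nn.Module):
    def __init__(self, num_classes=9, dims=(96, 192, 384, 768)):
        super().__init__()
        self.stem = nn.Conv2d(3, dims[0], kernel_size=4, stride=4)       # 4x4 patchify stem
        stages = []
        for i, d in enumerate(dims):
            stages.append(ShortNeXtBlock(d))
            if i < len(dims) - 1:                                        # 2x2 patchify downsampling
                stages.append(nn.Conv2d(d, dims[i + 1], kernel_size=2, stride=2))
        self.stages = nn.Sequential(*stages)
        self.head = nn.Sequential(
            nn.Conv2d(dims[-1], 1280, 1),        # pixel-wise convolution: 768 -> 1280, as in MobileNetV2
            nn.AdaptiveAvgPool2d(1),             # global average pooling -> 1280-d feature
            nn.Flatten(),
            nn.Linear(1280, num_classes),        # fully connected layer; softmax applied at inference
        )

    def forward(self, x):
        return self.head(self.stages(self.stem(x)))

logits = ShortNeXtSketch()(torch.randn(1, 3, 224, 224))
print(logits.shape)                              # torch.Size([1, 9])
```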
Citations: 0
QB-MOTR: A simple query bootstrapping end-to-end multi-object tracking method with transformer
IF 3.5 | Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-02-17 | DOI: 10.1016/j.cviu.2026.104682
Zifan Han, Xuchong Zhang, Hang Wang, Hongbin Sun
Tracking-by-query based multi-object tracking (MOT) aims to simplify the complicated and tiresome post-processing of the traditional tracking-by-detection paradigm in an end-to-end manner. However, the former method usually suffers from a conflict between detection and association due to the semantic ambiguity between tracking and detection instances in joint training, resulting in unsatisfactory performance compared to the latter. Previous tracking-by-query methods usually use an extra detector to decouple the detection and association tasks. However, these methods inevitably introduce complex operations such as additional detectors or manual hyperparameter adjustment. In this paper, we propose a simple end-to-end MOT method, Query Bootstrapping Multi-Object Tracking with TRansformer (QB-MOTR), to alleviate the conflict. Specifically, a Query Bootstrapping module is designed to enhance the semantic features of the tracking query in order to distinguish the detection and tracking instances. This module effectively integrates both positional and specific semantic information into the tracker while maintaining the simple pipeline of the whole network. The tracking performance of various MOT networks is evaluated on multiple datasets. Evaluation results demonstrate that QB-MOTR surpasses the baseline method MOTR by about 18.1%. Besides, the detection and association performance is superior to the state-of-the-art end-to-end method MeMOTR, with a much simpler training and inference pipeline.
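MOTR-style trackers propagate one learned query per tracked instance. The abstract describes the Query Bootstrapping module as injecting positional and semantic information into those track queries; a minimal, hypothetical sketch of such a fusion is shown below. The MLP design, the box encoding, and all names are assumptions, not the published module.

```python
import torch
import torch.nn as nn

class QueryBootstrap(nn.Module):
    """Enrich a track query with the previous frame's box position and appearance feature."""
    def __init__(self, dim=256):
        super().__init__()
        self.pos_embed = nn.Linear(4, dim)        # encode normalized (cx, cy, w, h) of the last box
        self.sem_embed = nn.Linear(dim, dim)      # encode the instance's appearance feature
        self.fuse = nn.Sequential(nn.Linear(3 * dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim))

    def forward(self, track_query, last_box, appearance):
        # track_query: (N, dim), last_box: (N, 4), appearance: (N, dim)
        cues = torch.cat([track_query, self.pos_embed(last_box), self.sem_embed(appearance)], dim=-1)
        return track_query + self.fuse(cues)      # bootstrapped query fed to the transformer decoder

q = torch.randn(5, 256)
boxes = torch.rand(5, 4)
app = torch.randn(5, 256)
print(QueryBootstrap()(q, boxes, app).shape)      # torch.Size([5, 256])
```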
Citations: 0
OVGrasp: Open-Vocabulary Intent Detection for Grasping Assistance using ExoGlove
IF 3.5 | Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-02-03 | DOI: 10.1016/j.cviu.2026.104676
Chen Hu, Shan Luo, Letizia Gionfrida
Grasping assistance is essential for restoring autonomy in individuals with motor impairments, particularly in unstructured environments where object categories and user intentions are diverse and unpredictable. We present OVGrasp, a hierarchical control framework for grasp assistance that integrates RGB-D vision, open vocabulary prompts, and voice commands to enable robust multimodal interaction. To enhance generalisation in open environments, OVGrasp incorporates a vision language foundation model with an open vocabulary mechanism, which enables zero-shot detection of previously unseen objects without retraining. A multimodal decision maker further fuses spatial and linguistic cues to infer user intent, such as grasp or release, in situations involving multiple objects. We deploy the complete framework on a custom egocentric view wearable exoskeleton and conduct systematic evaluations on fifteen objects across three grasp types. Experimental results with ten participants show that OVGrasp achieves a grasping ability score (GAS) of 87.00%, surpassing existing baselines and providing improved kinematic alignment with natural hand movement.
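The zero-shot, open-vocabulary mechanism (scoring image content against arbitrary text prompts with a pretrained vision-language model, so that unseen objects need no retraining) can be illustrated with an off-the-shelf CLIP checkpoint. This is a generic sketch, not the OVGrasp pipeline or its actual foundation model; the rank_prompts helper and the prompts are invented for the example.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def rank_prompts(image, prompts):
    """Score an image crop against open-vocabulary prompts; new objects need no retraining."""
    inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        probs = model(**inputs).logits_per_image.softmax(dim=-1)[0]
    return sorted(zip(prompts, probs.tolist()), key=lambda p: p[1], reverse=True)

# Example: decide which object a spoken command most likely refers to before triggering a grasp.
crop = Image.new("RGB", (224, 224))               # placeholder for an RGB crop of a detected object
print(rank_prompts(crop, ["a photo of a mug", "a photo of a bottle", "a photo of scissors"]))
```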
Citations: 0
SpectraDiff: Enhancing the fidelity of Infrared Image Translation with object-aware diffusion
IF 3.5 | Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-02-26 | DOI: 10.1016/j.cviu.2026.104709
Incheol Park, Youngwan Jin, Nalcakan Yagiz, Hyeongjin Ju, Sanghyeop Yeo, Shiho Kim
Autonomous systems commonly rely on RGB cameras, which are susceptible to failure in low-light and adverse conditions. Infrared (IR) imaging provides a viable alternative by capturing thermal signatures independent of visible illumination. However, its high cost and integration complexities limit widespread adoption. To address these challenges, we introduce SpectraDiff, a diffusion-based framework that synthesizes realistic IR images by fusing RGB inputs with refined semantic segmentation. Through our RGB-Seg Object-Aware (RSOA) module, SpectraDiff learns object-specific IR intensities by leveraging object-aware features. The SpectraDiff architecture, featuring a novel Spectral Attention Block, enforces self-attention among semantically similar pixels while leveraging cross-attention with the original RGB to preserve high-frequency details. Extensive evaluations on FLIR, FMB, MFNet, IDD-AW, and RANUS demonstrate SpectraDiff’s superior performance over existing methods, as measured by both perceptual (FID, LPIPS, DISTS) and fidelity (SSIM, SAM) metrics. Code and pretrained models are available at: https://yonsei-stl.github.io/SpectraDiff/.
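The Spectral Attention Block is described as enforcing self-attention among semantically similar pixels. One way to express that constraint is to mask the attention matrix so a pixel attends only to pixels carrying the same segmentation label; the single-head sketch below shows such masking, with shapes and names chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SegmentMaskedSelfAttention(nn.Module):
    """Single-head self-attention in which pixel i may only attend to pixels with the same segment label."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.scale = dim ** -0.5

    def forward(self, feats, labels):
        # feats: (B, N, dim) flattened pixel features; labels: (B, N) integer segment ids
        q, k, v = self.qkv(feats).chunk(3, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.scale               # (B, N, N) attention logits
        same_segment = labels.unsqueeze(2) == labels.unsqueeze(1)   # (B, N, N) boolean mask
        attn = attn.masked_fill(~same_segment, float("-inf"))       # forbid cross-segment attention
        return F.softmax(attn, dim=-1) @ v                          # (B, N, dim)

B, H, W, C = 1, 16, 16, 64
feats = torch.randn(B, H * W, C)
labels = torch.randint(0, 4, (B, H * W))                            # e.g. road / car / person / sky
print(SegmentMaskedSelfAttention(C)(feats, labels).shape)           # torch.Size([1, 256, 64])
```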
Citations: 0
KASS: Efficient video artifact removal via Kernel-Adaptive Spatiotemporal Synchronization
IF 3.5 | Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-02-02 | DOI: 10.1016/j.cviu.2026.104649
Liqun Lin, Fawei Tang, Mingxing Wang, Yipeng Liao, Tiesong Zhao
Video compression is essential for reducing bandwidth and storage demands, but often introduces artifacts that impair visual quality. Current Video Compression Artifact Removal (VCAR) methods face challenges including high computational complexity and unstable enhancement performance. To address these issues, we propose a novel Kernel-Adaptive Spatiotemporal Synchronization (KASS) network. First, a Dual-branch Alignment Module (DAM) enables multi-receptive-field feature alignment for modeling complex motion patterns. Second, an Adaptive Spatial Attention (ASA) block employs multi-branch deformable convolution with varying kernel sizes to locate artifacts. It then restores high-frequency details efficiently through attention-guided reconstruction. Third, a Spatiotemporal Multi-scale Alignment (SMA) block captures global spatiotemporal information and integrates multi-frame features via spatial and channel attention. This design effectively removes artifacts while improving alignment and enhancement stability. Experiments demonstrate that KASS significantly improves artifact removal performance while overcoming key limitations in alignment accuracy, computational burden, and enhancement stability.
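The ASA block is described as multi-branch deformable convolution with varying kernel sizes. torchvision's DeformConv2d takes learned per-position sampling offsets, so a hedged two-branch sketch looks like the following; the branch count, kernel sizes, and fusion are assumptions, not the paper's exact block.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformBranch(nn.Module):
    """One deformable-convolution branch: a plain conv predicts offsets, DeformConv2d samples with them."""
    def __init__(self, channels, kernel_size):
        super().__init__()
        pad = kernel_size // 2
        self.offset = nn.Conv2d(channels, 2 * kernel_size * kernel_size, kernel_size, padding=pad)
        self.deform = DeformConv2d(channels, channels, kernel_size, padding=pad)

    def forward(self, x):
        return self.deform(x, self.offset(x))

class MultiBranchDeform(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.b3 = DeformBranch(channels, 3)          # small receptive field for fine artifacts
        self.b5 = DeformBranch(channels, 5)          # larger receptive field for blocky artifacts
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):
        return x + self.fuse(torch.cat([self.b3(x), self.b5(x)], dim=1))

print(MultiBranchDeform(32)(torch.randn(1, 32, 48, 48)).shape)   # torch.Size([1, 32, 48, 48])
```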
Citations: 0
AnomalySD: One-for-all few-shot anomaly detection via pre-trained diffusion models
IF 3.5 | Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-02-04 | DOI: 10.1016/j.cviu.2026.104668
Zhenyu Yan, Qingqing Fang, Wenxi Lv, Qinliang Su
Anomaly detection is a critical task in industrial manufacturing, aiming to identify defective parts of products. Most industrial anomaly detection methods assume the availability of sufficient normal data for training. This assumption may not hold true due to the cost of labeling or data privacy policies. Additionally, mainstream methods require training bespoke models for different objects, which incurs heavy costs and lacks flexibility in practice. To address these issues, in this paper we for the first time propose to leverage the pretrained generative model Stable Diffusion (SD) to perform the one-for-all few-shot anomaly detection task, in contrast to existing few-shot anomaly detection works that rely heavily on the pre-trained, representation-based CLIP model. To adapt SD to the anomaly detection task, we design different hierarchical text descriptions and a foreground mask mechanism for fine-tuning SD. At the testing stage, to accurately mask anomalous regions for inpainting, we propose a multi-scale mask strategy and a prototype-guided mask strategy to handle diverse anomalous regions. Hierarchical text prompts are also utilized to guide the inpainting process at the inference stage. Extensive experiments on the MVTec-AD and VisA datasets demonstrate the superiority of our approach. We achieved anomaly classification and segmentation results of 93.6%/94.8% AUROC on the MVTec-AD dataset and 86.1%/96.5% AUROC on the VisA dataset under multi-class and one-shot settings. The source code of our method is available at https://github.com/YanZhenyu1999/AnomalySD.git.
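The mask-and-inpaint inference idea (mask suspect regions, let a pretrained Stable Diffusion inpainting model redraw them as normal, then score the discrepancy) can be sketched with the diffusers inpainting pipeline. The prompt, the single mask, and the pixel-difference score below are simplified assumptions; the paper's hierarchical prompts, multi-scale masks, and prototype guidance are more elaborate.

```python
import numpy as np
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

# A GPU is assumed here; drop torch_dtype and .to("cuda") to run (slowly) on CPU.
pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def anomaly_map(image: Image.Image, mask: Image.Image, prompt: str = "a flawless industrial part"):
    """Inpaint the masked region toward a 'normal' appearance and score the pixel-wise difference."""
    image, mask = image.resize((512, 512)), mask.resize((512, 512))
    repaired = pipe(prompt=prompt, image=image, mask_image=mask).images[0]
    diff = np.abs(np.asarray(repaired, dtype=np.float32) - np.asarray(image, dtype=np.float32))
    return diff.mean(axis=-1)    # per-pixel score; large where the model "corrected" the input

# Usage idea: slide masks of several scales over the image and keep the maximum score per pixel.
```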
Citations: 0
Enhancing workplace safety through assistive computer vision: Real-time hazard recognition using the Workplace Hazards Dataset (WHD)
IF 3.5 | Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-02-06 | DOI: 10.1016/j.cviu.2026.104681
Masoud Ayoubi, Mehrdad Arashpour
Assistive computer vision technologies have the potential to significantly enhance workplace safety by enabling early detection of hazards and supporting proactive risk management. However, the development of such systems is constrained by the absence of comprehensive video datasets and clearly defined tasks that capture real-world hazard conditions. This study formulates pre-incident hazard recognition as a distinct assistive-vision problem, focusing on identifying unsafe states that precede incidents rather than the incidents themselves. To address this problem, we propose the Workplace Hazards Dataset (WHD), a balanced and diverse set of real-world videos representing five universal hazard categories in varied workplace settings. Furthermore, we establish a standardized benchmarking framework that evaluates state-of-the-art convolutional and transformer-based video models on both performance and inference-latency metrics to assess real-time feasibility. Experimental results show that the Multiscale Vision Transformer (MViT 16 × 4) achieves the highest accuracy (74.1%) while maintaining efficient inference speed, highlighting the importance of balancing recognition accuracy with processing time. Overall, this work defines a new benchmark task for assistive computer vision and provides the foundation for developing real-time hazard recognition systems that enhance safety and efficiency in high-risk environments.
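Since the benchmark weighs accuracy against inference latency for real-time feasibility, the latency side can be reproduced by timing forward passes over dummy clips. The sketch below uses torchvision's r3d_18 purely as a stand-in video classifier (it does not assume the paper's MViT 16 × 4 configuration), and the clip shape and iteration counts are arbitrary.

```python
import time
import torch
from torchvision.models.video import r3d_18

model = r3d_18(weights=None).eval()                  # stand-in video classifier
clip = torch.randn(1, 3, 16, 224, 224)               # (batch, channels, frames, height, width)

with torch.no_grad():
    for _ in range(3):                               # warm-up iterations
        model(clip)
    runs = 10
    start = time.perf_counter()
    for _ in range(runs):
        model(clip)
    latency_ms = (time.perf_counter() - start) / runs * 1000

print(f"mean latency per clip: {latency_ms:.1f} ms")
```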
Citations: 0
SRDR: Style recovery and detail replenishment matter for single image dehazing
IF 3.5 | Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-03-01 | Epub Date: 2026-02-17 | DOI: 10.1016/j.cviu.2026.104688
Yuehua Li, Songwei Pei, Wenzheng Yang, BingFeng Liu, Shuhuai Wang
Single image dehazing, a representative low-level vision task, is of paramount importance in applications such as object detection and autonomous driving. Most existing dehazing methods focus on directly learning the overall difference between hazy and clear image pairs, which makes the learning task excessively challenging and restricts dehazing performance to a certain extent. In this paper, we draw on the separation of image style and content from the style transfer task: the degradation of hazy images typically involves a shift in style and the hiding of details, so the dehazing task can be divided and conquered along these two aspects to reduce the learning difficulty of the network. Based on this insight, we propose an image dehazing network focused on style recovery and detail replenishment, namely SRDR, which first recovers the style of the hazy image and extracts its details, and then aggregates the information from both paths for better dehazing. SRDR mainly consists of three modules: the Style Recovery Module (SRM), the Detail Replenishment Module (DRM), and the Cross Fusion Module (CFM). SRM handles style recovery by adapting a pre-trained MAE model, DRM replenishes details with convolutions along multiple directions, and CFM is an information aggregation module. Extensive experiments demonstrate that SRDR achieves state-of-the-art performance on numerous mainstream datasets.
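The DRM is said to replenish detail with convolutions along multiple directions. A common way to realize that is with fixed directional difference kernels (horizontal, vertical, and the two diagonals) applied depthwise; the kernels and their fusion below are illustrative assumptions, not the paper's exact DRM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DirectionalDetail(nn.Module):
    """Extract high-frequency detail with fixed difference kernels in four directions."""
    def __init__(self, channels):
        super().__init__()
        k = torch.zeros(4, 1, 3, 3)
        k[0, 0, 1, 0], k[0, 0, 1, 2] = -1.0, 1.0        # horizontal difference
        k[1, 0, 0, 1], k[1, 0, 2, 1] = -1.0, 1.0        # vertical difference
        k[2, 0, 0, 0], k[2, 0, 2, 2] = -1.0, 1.0        # main diagonal
        k[3, 0, 0, 2], k[3, 0, 2, 0] = -1.0, 1.0        # anti-diagonal
        self.register_buffer("kernels", k.repeat(channels, 1, 1, 1))  # 4 directions per input channel
        self.fuse = nn.Conv2d(4 * channels, channels, 1)
        self.channels = channels

    def forward(self, x):
        detail = F.conv2d(x, self.kernels, padding=1, groups=self.channels)  # depthwise directional maps
        return self.fuse(detail)                         # aggregated detail to add back after style recovery

print(DirectionalDetail(16)(torch.randn(1, 16, 32, 32)).shape)   # torch.Size([1, 16, 32, 32])
```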
Citations: 0