
Image and Vision Computing: Latest Publications

A multi-scale U-shaped transformer neural network for low-light image enhancement
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-11-07 | DOI: 10.1016/j.imavis.2025.105801
Ji Soo Shin, Ho Sub Lee
Low-light image enhancement (LLIE) aims to improve the visibility and visual quality of images captured under insufficient lighting conditions, which are typically characterized by low contrast, suppressed textures, and amplified noise. Recent methods often employ a multi-scale enhancement strategy by stacking sub-networks—such as cascaded convolutional blocks or a single-scale transposed self-attention module—to refine contrast from coarse to fine levels. However, these methods struggle to effectively restore natural color appearance and fail to preserve global illumination cues, which limits the generalization capability of the models. In addition, conventional self-attention methods for LLIE operate at a single resolution, making it difficult to effectively fuse multi-scale features and thus constraining their ability to simultaneously capture long-range dependencies and preserve fine structural details. To address these issues, this paper proposes MSTSA-UTNet, a compact U-shaped Transformer architecture that incorporates a newly designed Transformer block based on multi-scale transposed self-attention (MSTSA) with lightweight feed-forward modules, and adopts a multi-scale input, single-scale output (MISO) strategy. The key idea of MSTSA is to enable multi-resolution interaction by simultaneously incorporating original high-resolution features and down-sampled low-resolution features. Furthermore, the proposed feature extraction and fusion framework comprises two core components: a prior-guided shallow feature extraction (PG-SFE) module that preserves low-level spatial cues while incorporating illumination priors to modulate shallow features, and a multi-scale feed-forward network (MSFFN) that performs gated fusion to selectively integrate global context and local detail. This design facilitates improved feature learning for low-light enhancement. Extensive experimental results demonstrate that the proposed MSTSA-UTNet consistently outperforms the recent state-of-the-art multi-scale enhancement method SMNet [37] by up to 0.59 dB in PSNR on the LOL-v1 dataset.
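The abstract gives no equations for MSTSA, so the following PyTorch sketch is only an assumption-laden illustration: transposed self-attention (as in Restormer-style designs) attends over channels rather than pixels, and one plausible multi-scale variant lets the channel attention see tokens from both the original and a down-sampled feature map. The module name, shared projection, and concatenation scheme are illustrative choices, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleTransposedAttention(nn.Module):
    """Channel-wise ('transposed') self-attention: the C x C attention map is
    built from inner products over spatial positions, so appending extra
    tokens from a down-sampled copy of the feature map is cheap."""
    def __init__(self, dim):
        super().__init__()
        self.qkv = nn.Conv2d(dim, dim * 3, kernel_size=1, bias=False)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x, x_low=None):
        b, c, h, w = x.shape
        q, k, v = (t.flatten(2) for t in self.qkv(x).chunk(3, dim=1))
        if x_low is not None:
            # Illustrative multi-scale step: append tokens from the
            # down-sampled map so channel statistics span both resolutions.
            ql, kl, vl = (t.flatten(2) for t in self.qkv(x_low).chunk(3, dim=1))
            q = torch.cat([q, ql], dim=2)
            k = torch.cat([k, kl], dim=2)
            v = torch.cat([v, vl], dim=2)
        q, k = F.normalize(q, dim=2), F.normalize(k, dim=2)
        attn = torch.softmax(q @ k.transpose(1, 2) * self.temperature, dim=-1)
        out = (attn @ v)[..., : h * w].reshape(b, c, h, w)  # keep full-res tokens
        return self.proj(out)

x = torch.randn(1, 32, 64, 64)
print(MultiScaleTransposedAttention(32)(x, F.avg_pool2d(x, 2)).shape)
# -> torch.Size([1, 32, 64, 64])
```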
Citations: 0
MSBC-Segformer: An automatic segmentation model of clinical target volume and organs at risk in CT images for radiotherapy after breast-conserving surgery
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-17 | DOI: 10.1016/j.imavis.2025.105878
Yadi Gao, Qian Sun, Lan Ye, Chengliang Li, Peipei Dang, Min Han
In breast-conserving surgery (BCS) radiotherapy for breast cancer (BC), the clinical target volume (CTV) and organs at risk (OARs) on CT images are mainly delineated manually, layer by layer, by radiation oncologists (ROs), a time-consuming process prone to variability due to differences in clinical experience and inter- and intra-observer variation. To address this, we developed a new automatic delineation model for medical CT images, specifically for computer-assisted medical detection and diagnosis. The CT scans of 100 patients who underwent BCS and radiotherapy were collected. These data were used to create, train, and validate a new deep-learning (DL) model, the MSBC-Segformer (Multi-Scale Boundary-Constrained Segmentation Model Based on Transformer), proposed to automatically segment the CTV and OARs. The Dice Similarity Coefficient (DSC) and 95th-percentile Hausdorff Distance (95HD) were used to evaluate the effectiveness of the proposed model. As a result, the MSBC-Segformer model can provide accurate and efficient delineation of the CTV and OARs for BC patients who underwent radiotherapy after BCS, outperforming both junior doctors and almost all existing CNN models, and reducing the instability of segmentation results due to observer differences, thus significantly enhancing clinical efficiency. Moreover, evaluation by three ROs revealed no significant difference between the model and manual delineation by senior doctors (p>0.98 for CTV and p>0.59 for OARs). The model significantly reduced segmentation time, with an average of only 12.53 s per patient.
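DSC and 95HD are standard segmentation metrics, so a minimal NumPy/SciPy reference sketch may be useful. It scores all foreground pixels by brute force; clinical toolkits typically use boundary voxels and distance transforms, and the `spacing` argument is an illustrative stand-in for CT voxel size.

```python
import numpy as np
from scipy.spatial.distance import cdist

def dice(pred, gt):
    """Dice Similarity Coefficient between two binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def hd95(pred, gt, spacing=1.0):
    """95th-percentile symmetric Hausdorff distance over foreground points.
    Brute force with cdist; real pipelines use boundary voxels and
    distance transforms, and `spacing` stands in for voxel size (mm)."""
    p = np.argwhere(pred) * spacing
    g = np.argwhere(gt) * spacing
    d = cdist(p, g)
    # nearest-neighbor distances in both directions, then the 95th percentile
    return np.percentile(np.hstack([d.min(axis=1), d.min(axis=0)]), 95)

pred = np.zeros((64, 64), dtype=bool); pred[20:40, 20:40] = True
gt = np.zeros((64, 64), dtype=bool); gt[22:42, 22:42] = True
print(f"DSC={dice(pred, gt):.3f}  95HD={hd95(pred, gt):.2f}")
```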
Citations: 0
A computationally efficient framework leveraging auxiliary head features for robust cloth-changing person re-identification
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-11-27 | DOI: 10.1016/j.imavis.2025.105852
Shanna Zhuang, Yining Li, Jiaxin Zhang, Jing Bai, Congcong Li, Zhengyou Wang
Cloth-Changing Person Re-identification (CC-ReID) addresses the critical challenge of matching individuals across surveillance cameras after clothing changes. This task presents unique difficulties due to appearance variations that increase visual ambiguity among persons. While existing approaches predominantly concentrate on body shape analysis, contour sketches, and human parsing techniques, these methods exhibit limitations in robustness while demanding substantial computational resources. To exploit cloth-irrelevant information for the CC-ReID task at low cost, a computationally efficient framework that effectively leverages head features as discriminative auxiliary cues is proposed in this paper. First, a Head Region Cropping and Location (HRCL) module is presented to isolate head regions, enabling rough feature localization. Second, a lightweight Head Attention Network (HANet) is presented that integrates MobileNet architecture with cascaded channel-spatial attention mechanisms, which synergistically capture both global semantic patterns and local discriminative characteristics. This dual-stream design achieves effective feature enhancement without requiring complex auxiliary feature extraction as required by conventional approaches. An optimal Feature-Level Fusion Module (FLFM) that combines learned head representations with conventional body features is presented. Extensive evaluations on the LTCC and PRCC datasets demonstrate significant performance improvements. The proposed method achieves state-of-the-art Rank-1 accuracy of 58.8% and a competitive mAP of 25.4% on the LTCC dataset, while attaining 59.8% Rank-1 accuracy and 57.9% mAP on PRCC. Comprehensive ablation studies further validate the effectiveness of each proposed component.
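The abstract does not detail HANet's attention layers; the sketch below shows a generic CBAM-style cascade (channel attention, then spatial attention) that matches the phrase "cascaded channel-spatial attention" only at a conceptual level. The class name and hyperparameters are hypothetical.

```python
import torch
import torch.nn as nn

class ChannelSpatialAttention(nn.Module):
    """CBAM-style cascade: channel attention first, spatial attention second."""
    def __init__(self, dim, reduction=8):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Conv2d(dim, dim // reduction, 1), nn.ReLU(),
            nn.Conv2d(dim // reduction, dim, 1))
        self.spatial = nn.Conv2d(2, 1, kernel_size=7, padding=3)

    def forward(self, x):
        # Channel attention from pooled global descriptors.
        avg = x.mean(dim=(2, 3), keepdim=True)
        mx = x.amax(dim=(2, 3), keepdim=True)
        x = x * torch.sigmoid(self.mlp(avg) + self.mlp(mx))
        # Spatial attention from per-location channel statistics.
        s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

x = torch.randn(2, 64, 32, 32)
print(ChannelSpatialAttention(64)(x).shape)  # torch.Size([2, 64, 32, 32])
```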
Citations: 0
WAM-Net: Wavelet-Based Adaptive Multi-scale Fusion Network for fine-grained action recognition
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-03 | DOI: 10.1016/j.imavis.2025.105855
Jirui Di, Zhengping Hu, Hehao Zhang, Qiming Zhang, Zhe Sun
Fine-grained actions often lack scene prior information, making strong temporal modeling particularly important. Since these actions primarily rely on subtle and localized motion differences, single-scale features are often insufficient to capture their complexity. In contrast, multi-scale features not only capture fine-grained patterns but also contain rich rhythmic information, which is crucial for modeling temporal dependencies. However, existing methods for processing multi-scale features suffer from two major limitations: they often rely on naive downsampling operations for scale alignment, causing significant structural information loss, and they treat features from different layers equally, without fully exploiting the complementary strengths across hierarchical levels. To address these issues, we propose a novel Wavelet-Based Adaptive Multi-scale Fusion Network (WAM-Net), which consists of three key components: (1) a Wavelet-based Fusion Module (WFM) that achieves feature alignment through wavelet reconstruction, avoiding the structural degradation typically introduced by direct downsampling, (2) an Adaptive Feature Selection Module (AFSM) that dynamically selects and fuses two levels of features based on global information, enabling the network to leverage their complementary advantages, and (3) a Duration Context Encoder (DCE) that extracts temporal duration representations from the overall video length to guide global dependency modeling. Extensive experiments on Diving48, FineGym, and Kinetics-400 demonstrate that our approach consistently outperforms existing state-of-the-art methods.
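As background on why wavelet decomposition can replace naive downsampling for scale alignment, the NumPy sketch below computes a one-level, average-based Haar DWT: the LL band is a half-resolution map aligned with the coarser scale, while the LH/HL/HH bands keep the structure that stride-2 sampling discards. This illustrates the transform only, not WFM's actual fusion rule.

```python
import numpy as np

def haar_dwt2(x):
    """One-level 2-D Haar DWT (average-based variant) of an (H, W) array
    with even H, W. Returns the low-pass band plus three detail bands."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # vertical average
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # vertical difference
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0  # low-low: the aligned coarse scale
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, (lh, hl, hh)

x = np.arange(36, dtype=float).reshape(6, 6)
ll, (lh, hl, hh) = haar_dwt2(x)
naive = x[::2, ::2]                       # stride-2 downsampling, for contrast
print(ll.shape, naive.shape)              # (3, 3) (3, 3)
# ll preserves local averages; lh/hl/hh keep the detail a stride throws away.
```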
Citations: 0
BSA-Dehaze: Multi-Scale Bitemporal Fusion and Size-Aware Decoder for Unsupervised Image Dehazing
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-11-19 | DOI: 10.1016/j.imavis.2025.105819
Wujin Li, Qian Xing, Wei He, Longyuan Guo, Jianhui Wu, Minzhi Zhao, Siyuan Chen
Single-image dehazing plays a critical role in various autonomous vision systems. Early methods relied on hand-crafted optimization techniques, whereas recent approaches leverage deep neural networks trained on synthetic data, owing to the scarcity of real-world paired datasets. However, this often results in domain bias when applied to outdoor scenes. In this paper, we present BSA-Dehaze, an unsupervised single-image dehazing framework that integrates a Multi-Scale Bitemporal Fusion Module (MBFM) and a Size-Aware Decoder (SA-Decoder). The method operates without requiring ground-truth images. Our method reformulates dehazing as a haze-to-clear image translation task. BSA-Dehaze incorporates a novel Encoder-SA-Decoder built with ResNet blocks, designed to better preserve image details and edge sharpness. To enhance feature fusion and training efficiency, we introduce the MBFM. A multi-scale discriminator (MSD) is proposed, along with Hinge Loss and Dynamic Block-wise Contrastive Loss, to improve training stability and emphasize challenging samples. Ablation studies verify the contribution of each component. Experimental results on SOTS outdoor, BeDDE, and a real-world dataset demonstrate that our method surpasses existing approaches in both performance and efficiency, despite being trained on significantly less data.
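Hinge loss for adversarial training has a standard form, sketched below in PyTorch. The paper's multi-scale discriminator and Dynamic Block-wise Contrastive Loss are omitted, so treat this as the generic objective the abstract names rather than BSA-Dehaze's full loss.

```python
import torch
import torch.nn.functional as F

def d_hinge_loss(d_real, d_fake):
    """Discriminator hinge loss: push real scores above +1
    and fake scores below -1."""
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()

def g_hinge_loss(d_fake):
    """Generator hinge loss: raise the discriminator's score
    on generated (dehazed) images."""
    return -d_fake.mean()

d_real, d_fake = torch.randn(8), torch.randn(8)
print(d_hinge_loss(d_real, d_fake).item(), g_hinge_loss(d_fake).item())
```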
Citations: 0
Enhancing Zero-Shot Object-Goal Visual Navigation with target context and appearance awareness
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-18 | DOI: 10.1016/j.imavis.2025.105873
Yu Fu, Lichun Wang, Tong Bie, Shuang Li, Tong Gao, Baocai Yin
The Zero-Shot Object-Goal Visual Navigation (ZSON) task focuses on leveraging the target semantic information to transfer the end-to-end navigation policy learned from seen classes to unseen classes. For most ZSON methods, the target label serves as the primary source of semantic information. However, the learning of navigation policies based on the single semantic clue limits the transferability of navigation policies. Inspired by the phenomenon that humans associate object labels during searching for targets, we propose the Dual Target Awareness Network (DTAN), which expands the label semantic to target context and target appearance, providing more target clues for the navigation policy learning. By using Large Language Model (LLM), DTAN first infers target context and target attribute based on the target label. The target context is encoded and then interacts with the observation to obtain the context-aware feature. The target attribute is used to generate target appearance and then interacts with the observation to obtain the appearance-aware feature. By fusing the two kinds of features, the target-aware feature is obtained and fed into the policy network to make action decisions. Experimental results demonstrate that DTAN outperforms the state-of-the-art ZSON method by 6.9% in Success Rate (SR) and 3.1% in Success weighted by Path Length (SPL) for unseen targets on AI2-THOR simulator. Experiments conducted on RoboTHOR and Habitat (MP3D) simulators further prove the scalability of DTAN to larger-scale and more realistic scenes.
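The abstract states that the context-aware and appearance-aware features are fused into a target-aware feature but does not say how. The sketch below uses a learned per-channel gate purely for illustration; the class name and gating design are assumptions, not DTAN's actual fusion layer.

```python
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    """Fuse two target-cue features with a learned per-channel gate
    (hypothetical stand-in for the paper's unspecified fusion step)."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, f_context, f_appearance):
        g = self.gate(torch.cat([f_context, f_appearance], dim=-1))
        return g * f_context + (1 - g) * f_appearance  # target-aware feature

f_ctx, f_app = torch.randn(4, 256), torch.randn(4, 256)
print(GatedFusion(256)(f_ctx, f_app).shape)  # torch.Size([4, 256])
```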
Citations: 0
MVD-NeRF: Multi-View Deblurring Neural Radiance Fields from Defocused Images
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-17 | DOI: 10.1016/j.imavis.2025.105876
Zhenyu Yin, Xiaohui Wang, Feiqing Zhang, Xiaoqiang Shi, Dan Feng
Neural Radiance Fields (NeRF) have demonstrated exceptional three-dimensional (3D) reconstruction quality by synthesizing novel views from multi-view images. However, NeRF algorithms typically require clear and static images to function effectively, and little attention has been given to suboptimal scenarios involving noise such as reflections and blur. Although blurred images are common in real-world situations, few studies have explored NeRF for handling blur, particularly defocus blur. Correctly simulating the formation of defocus blur is the key to deblurring and helps to accurately synthesize novel views from blurred images. Therefore, this paper proposes Multi-View Deblurring Neural Radiance Fields from Defocused Images (MVD-NeRF), a framework for 3D reconstruction from defocus-blurred images. The framework ensures consistency in 3D geometry and appearance by modeling the formation of defocus blur. MVD-NeRF introduces the Defocus Modeling Approach (DMA), a novel method for simulating defocused scenes. When the view is fixed, DMA assumes that each pixel is rendered by multiple rays emitted from the same light source. Additionally, MVD-NeRF proposes a new Multi-view Panning Algorithm (MPA), which simulates light-source movement through slight shifts of the camera center across different views, thereby generating blur effects similar to those in real photography. Together, DMA and MPA enhance MVD-NeRF’s ability to capture intricate scene details. Our experimental results validate that MVD-NeRF achieves significant improvements in Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). The source code for MVD-NeRF is available at the following URL: https://github.com/luckhui0505/MVD-NeRF.
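DMA's premise that a fixed view renders each pixel with multiple rays echoes the classic thin-lens defocus model, in which ray origins are jittered over an aperture disk while every ray passes through the same in-focus point, and the rendered colors are averaged. The NumPy sketch below implements only that textbook model; the function name and parameters are illustrative, and the paper's DMA/MPA formulation may differ.

```python
import numpy as np

def defocus_rays(origin, direction, focus_dist, aperture, n_rays=16, rng=None):
    """Thin-lens ray jitter: n_rays rays share the point where the central
    ray meets the focal plane, but start from origins sampled on an aperture
    disk. Averaging their colors blurs everything off the focal plane.
    Assumes the view direction is not parallel to the +Y axis."""
    rng = rng or np.random.default_rng(0)
    direction = direction / np.linalg.norm(direction)
    focal_point = origin + focus_dist * direction
    # Sample offsets uniformly on a disk perpendicular to the view direction.
    r = aperture * np.sqrt(rng.random(n_rays))
    theta = 2 * np.pi * rng.random(n_rays)
    u = np.cross(direction, np.array([0.0, 1.0, 0.0]))
    u /= np.linalg.norm(u)
    v = np.cross(direction, u)
    origins = origin + r[:, None] * (np.cos(theta)[:, None] * u
                                     + np.sin(theta)[:, None] * v)
    dirs = focal_point - origins
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    return origins, dirs  # render each ray, then average the colors

o, d = defocus_rays(np.zeros(3), np.array([0.0, 0.0, -1.0]), 4.0, 0.1)
print(o.shape, d.shape)  # (16, 3) (16, 3)
```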
Citations: 0
When Mamba meets CNN: A hybrid architecture for skin lesion segmentation
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-16 | DOI: 10.1016/j.imavis.2025.105880
Yun Xiao, Caijuan Shi, Jinghao Jia, Ao Cai, Yinan Zhang, Meiwen Zhang
As an important means of computer-aided diagnosis and treatment of skin cancer, skin lesion segmentation has recently been studied extensively with various deep models. Because of the limitations of any single deep model, hybrid architectures, especially Mamba–CNN based methods, have become a research hotspot. However, the segmentation accuracy of existing Mamba–CNN methods is still limited, especially for lesions of varying sizes and with blurry boundaries, and the models’ computational complexity remains high. Therefore, to address these issues and improve skin lesion segmentation performance, we propose a new Mamba–CNN based model, named Feature Fusion and Boundary Awareness Visual Mamba (FFBA-VM). Specifically, the designed Multi-scale Hybrid Attention Interaction (MHAI) module enhances multi-scale feature representation with a powerful capability for long-range dependency modeling, capturing rich local and global information. The designed Region Localization and Boundary Enhancement (RLBE) module effectively exploits local information to alleviate inaccurate skin lesion localization and boundary blurring. The Lightweight Visual State Space (LVSS) module is designed to reduce the model’s computational complexity. Extensive experiments are conducted on four datasets, and our model FFBA-VM effectively boosts segmentation accuracy across multiple evaluation metrics. For example, FFBA-VM achieves mIoU and DSC of 80.28% and 89.06% on the ISIC17 dataset, and reaches mIoU and DSC of 80.47% and 89.17% on the ISIC18 dataset. The experimental results indicate that our proposed FFBA-VM outperforms existing state-of-the-art methods, validating its effectiveness and practicality for skin lesion segmentation.
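For background on the "visual state space" ingredient: Mamba-style blocks build on the discrete linear state-space recurrence h_t = A h_{t-1} + B x_t with readout y_t = C h_t, run over flattened image tokens. The NumPy sketch below shows only that bare recurrence with fixed (not input-dependent) parameters, as a conceptual aid rather than the LVSS module.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Minimal discrete state-space scan over a 1-D sequence:
    h_t = A @ h_{t-1} + B * x_t,  y_t = C @ h_t.
    Mamba-style blocks make A, B, C input-dependent and run such scans
    over flattened image patches in several directions."""
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t   # state update
        ys.append(C @ h)      # readout
    return np.array(ys)

A = np.diag([0.9, 0.5])       # stable decay per state dimension
B = np.array([1.0, 1.0])
C = np.array([0.5, 0.5])
y = ssm_scan(np.sin(np.linspace(0, 6, 50)), A, B, C)
print(y.shape)  # (50,)
```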
Citations: 0
LoGA-Attack: Local geometry-aware adversarial attack on 3D point clouds
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-11 | DOI: 10.1016/j.imavis.2025.105871
Jia Yuan, Jun Chen, Chongshou Li, Pedro Alonso, Xinke Li, Tianrui Li
Adversarial attacks on 3D point clouds are increasingly critical for safety-sensitive domains like autonomous driving. Most existing methods ignore local geometric structure, yielding perturbations that harm imperceptibility and geometric consistency. We introduce the local geometry-aware adversarial attack (LoGA-Attack), an approach that exploits topological and geometric cues to craft refined perturbations. A Neighborhood Centrality (NC) score partitions points into contour and flat point sets. Contour points receive gradient-based iterative updates to maximize attack strength, while flat points use an Optimal Neighborhood-based Attack (ONA) that projects gradients onto the most consistent local geometric direction. Experiments on ModelNet40 and ScanObjectNN show higher attack success with lower perceptual distortion, demonstrating superior performance and strong transferability. Our code is available at: https://github.com/yuanjiachn/LoGA-Attack.
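The abstract does not define the NC score. One natural guess, used purely for illustration here, is the distance from each point to the centroid of its k nearest neighbors: contour points sit off-center in their neighborhoods and score high, while flat-region points score low. The formula, the value of k, and the percentile threshold are all assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def neighborhood_centrality(points, k=16):
    """For each point, distance to the centroid of its k nearest neighbors.
    This concrete formula is an illustrative guess at the paper's NC score,
    not its published definition."""
    tree = cKDTree(points)
    _, idx = tree.query(points, k=k + 1)      # +1: query returns the point itself
    centroids = points[idx[:, 1:]].mean(axis=1)
    return np.linalg.norm(points - centroids, axis=1)

pts = np.random.default_rng(0).random((1024, 3))
nc = neighborhood_centrality(pts)
contour_mask = nc > np.percentile(nc, 80)     # top 20% treated as contour points
print(contour_mask.sum(), "contour points /", len(pts))
```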
Citations: 0
CONXA: A CONvnext and CROSS-attention combination network for Semantic Edge Detection
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2026-02-01 | Epub Date: 2025-12-15 | DOI: 10.1016/j.imavis.2025.105867
Gwangsoo Kim, Hyuk-jae Lee, Hyunmin Jung
Semantic Edge Detection (SED) is an advanced edge detection technique that simultaneously detects edges in an image and classifies them according to their semantics. It is expected to be applied in various fields, such as medical imaging, satellite imagery, and smart manufacturing. Although previous research on SED has significantly improved performance, further advancements are still needed. In particular, existing studies have typically focused on specific types of datasets, limiting the broader applicability of SED techniques. Motivated by this, our paper makes three key contributions. First, we propose a novel network for SED, called CONXA. CONXA improves SED accuracy by leveraging the powerful feature extraction of ConvNeXt and the effective feature combination of cross-attention. Second, we introduce a novel loss function, the Inverted Dice (I-Dice) loss, which calculates loss based on a sufficient number of non-edge pixels rather than edge pixels. This helps balance false positives and false negatives, enabling more stable training. Third, unlike previous studies that typically use only one type of dataset, we validate our method using two distinct types of datasets commonly used in SED. Experimental results demonstrate that our approach significantly outperforms existing state-of-the-art (SOTA) methods on datasets that define semantics by edge types, and achieves comparable performance to SOTA methods on datasets where semantics are defined by object boundaries. This indicates that our method can be effectively applied across diverse datasets regardless of the semantic characteristics of edges, contributing to the generalization of SED. Code is available at https://github.com/GSKIM13/CONXA/.
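Read literally, the I-Dice loss appears to be a Dice loss evaluated on the complemented maps, so the abundant non-edge pixels, rather than the sparse edge pixels, drive the overlap term. A PyTorch sketch under that interpretation (not necessarily the paper's exact definition):

```python
import torch

def inverted_dice_loss(pred, target, eps=1e-6):
    """Dice loss on the complements: measures overlap of NON-edge pixels
    (pred, target in [0,1], edge = 1). Because non-edge pixels dominate
    edge maps, this term is driven by the abundant background class,
    which is the balancing effect the abstract describes."""
    p, t = 1.0 - pred, 1.0 - target            # complement: non-edge maps
    inter = (p * t).sum(dim=(-2, -1))
    dice = (2 * inter + eps) / (p.sum(dim=(-2, -1)) + t.sum(dim=(-2, -1)) + eps)
    return (1.0 - dice).mean()

pred = torch.rand(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.95).float()   # sparse synthetic edges
print(inverted_dice_loss(pred, target).item())
```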
Citations: 0