
Image and Vision Computing: Latest Articles

Cross-level fusion network for two-stage polyp segmentation via integrity learning
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-19 | DOI: 10.1016/j.imavis.2025.105883
Junzhuo Liu, Dorit Merhof, Zhixiang Wang
Colorectal cancer is one of the most prevalent and lethal forms of cancer. The automated detection, segmentation and classification of early polyp tissues from endoscopy images of the colorectum have demonstrated impressive potential for improving clinical diagnostic accuracy, avoiding missed detections and reducing the incidence of colorectal cancer in the population. However, most existing studies fail to consider the potential of information fusion between different deep neural network layers and optimization with respect to model complexity, resulting in poor clinical utility. To address these limitations, the concept of integrity learning is introduced, which divides polyp segmentation into two stages for progressive completion, and a cross-level fusion lightweight network, IC-FusionNet, is proposed to accurately segment polyps from endoscopy images. First, the Context Fusion Module (CFM) of the network aggregates the encoder's neighboring branches and current-level information to achieve macro-integrity learning. In the second stage, polyp detail information from shallower layers and high-dimensional semantic information from deeper layers are aggregated to achieve mutual enhancement of complementary information across layers. IC-FusionNet is evaluated on five polyp segmentation benchmark datasets across eight evaluation metrics. It achieves an mDice of 0.908 and 0.925 on the Kvasir and CVC-ClinicDB datasets, respectively, along with an mIoU of 0.851 and 0.973. On three external polyp segmentation test datasets, the model obtains an average mDice of 0.788 and an average mIoU of 0.712. Compared to existing methods, IC-FusionNet achieves superior or near-optimal performance across most evaluation metrics. Moreover, IC-FusionNet contains only 3.84M parameters and 0.76G MACs, representing a reduction of 9.22% in parameter count and 74.15% in computational complexity compared to recent lightweight segmentation networks.
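As a rough illustration of the cross-level fusion idea described above, the following PyTorch sketch fuses a shallower, the current, and a deeper encoder level at a single resolution. The module name, layer choices, and channel sizes are assumptions for illustration only; this is not the authors' CFM implementation.

```python
# Hedged sketch of cross-level feature fusion in the spirit of a Context Fusion
# Module: neighbouring encoder levels are resampled to the current resolution,
# concatenated, and refined. All design choices here are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossLevelFusion(nn.Module):
    """Fuse the current encoder level with its shallower and deeper neighbours."""
    def __init__(self, ch_shallow: int, ch_current: int, ch_deep: int, ch_out: int):
        super().__init__()
        self.proj = nn.Conv2d(ch_shallow + ch_current + ch_deep, ch_out, kernel_size=1)
        self.refine = nn.Sequential(
            nn.Conv2d(ch_out, ch_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(ch_out),
            nn.ReLU(inplace=True),
        )

    def forward(self, f_shallow, f_current, f_deep):
        size = f_current.shape[-2:]
        # Bring neighbouring levels to the current spatial resolution.
        f_shallow = F.interpolate(f_shallow, size=size, mode="bilinear", align_corners=False)
        f_deep = F.interpolate(f_deep, size=size, mode="bilinear", align_corners=False)
        fused = torch.cat([f_shallow, f_current, f_deep], dim=1)
        return self.refine(self.proj(fused))

# Example: three adjacent encoder levels of a lightweight backbone.
x1 = torch.randn(1, 32, 88, 88)   # shallower level
x2 = torch.randn(1, 64, 44, 44)   # current level
x3 = torch.randn(1, 128, 22, 22)  # deeper level
out = CrossLevelFusion(32, 64, 128, 64)(x1, x2, x3)
print(out.shape)  # torch.Size([1, 64, 44, 44])
```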
Citations: 0
Enhancing Zero-Shot Object-Goal Visual Navigation with target context and appearance awareness
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-18 | DOI: 10.1016/j.imavis.2025.105873
Yu Fu, Lichun Wang, Tong Bie, Shuang Li, Tong Gao, Baocai Yin
The Zero-Shot Object-Goal Visual Navigation (ZSON) task focuses on leveraging target semantic information to transfer an end-to-end navigation policy learned from seen classes to unseen classes. For most ZSON methods, the target label serves as the primary source of semantic information. However, learning navigation policies from this single semantic clue limits their transferability. Inspired by the phenomenon that humans associate object labels while searching for targets, we propose the Dual Target Awareness Network (DTAN), which expands the label semantics to target context and target appearance, providing more target clues for navigation policy learning. Using a Large Language Model (LLM), DTAN first infers the target context and target attributes based on the target label. The target context is encoded and then interacts with the observation to obtain the context-aware feature. The target attributes are used to generate the target appearance, which then interacts with the observation to obtain the appearance-aware feature. By fusing the two kinds of features, the target-aware feature is obtained and fed into the policy network to make action decisions. Experimental results demonstrate that DTAN outperforms the state-of-the-art ZSON method by 6.9% in Success Rate (SR) and 3.1% in Success weighted by Path Length (SPL) for unseen targets on the AI2-THOR simulator. Experiments conducted on the RoboTHOR and Habitat (MP3D) simulators further prove the scalability of DTAN to larger-scale and more realistic scenes.
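To make the fusion step concrete, here is a minimal PyTorch sketch that combines a context-aware and an appearance-aware feature into a single target-aware feature before a policy head. The gating scheme, dimensions, and names are illustrative assumptions, not DTAN's actual design.

```python
# Hedged sketch: fusing two target-aware cues with a learned gate. The convex
# combination below is only one plausible fusion choice, assumed for illustration.
import torch
import torch.nn as nn

class DualTargetFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        self.out = nn.Linear(dim, dim)

    def forward(self, context_feat, appearance_feat):
        g = self.gate(torch.cat([context_feat, appearance_feat], dim=-1))
        # Convex combination of the context-aware and appearance-aware features.
        fused = g * context_feat + (1 - g) * appearance_feat
        return self.out(fused)

ctx = torch.randn(4, 256)   # context-aware feature (batch of 4 steps)
app = torch.randn(4, 256)   # appearance-aware feature
target_aware = DualTargetFusion()(ctx, app)
print(target_aware.shape)   # torch.Size([4, 256]) -> fed into the policy network
```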
Citations: 0
MSBC-Segformer: An automatic segmentation model of clinical target volume and organs at risk in CT images for radiotherapy after breast-conserving surgery
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-17 | DOI: 10.1016/j.imavis.2025.105878
Yadi Gao, Qian Sun, Lan Ye, Chengliang Li, Peipei Dang, Min Han
In radiotherapy after breast-conserving surgery (BCS) for breast cancer (BC), the clinical target volume (CTV) and organs at risk (OARs) on CT images are mainly delineated manually, layer by layer, by radiation oncologists (ROs), a time-consuming process prone to variability due to differences in clinical experience and inter- and intra-observer variation. To address this, we developed a new automatic delineation model for medical CT images, specifically for computer-assisted medical detection and diagnosis. CT scans of 100 patients who underwent BCS and radiotherapy were collected. These data were used to create, train, and validate a new deep-learning (DL) model, MSBC-Segformer (Multi-Scale Boundary-Constrained Segmentation Model Based on Transformer), which was proposed to automatically segment the CTV and OARs. The Dice Similarity Coefficient (DSC) and 95th percentile Hausdorff Distance (95HD) were used to evaluate the effectiveness of the proposed model. As a result, the MSBC-Segformer model can provide accurate and efficient delineation of the CTV and OARs for BC patients who underwent radiotherapy after BCS, outperforming both junior doctors and almost all other existing CNN models, and reducing the instability of segmentation results due to observer differences, thus significantly enhancing clinical efficiency. Moreover, evaluation by three ROs revealed no significant difference between the model and manual delineation by senior doctors (p>0.98 for CTV and p>0.59 for OARs). The model significantly reduced segmentation time, with an average of only 12.53 s per patient.
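For reference, the two reported metrics can be computed as below for binary masks. This is a simplified sketch (distances over all foreground pixels rather than extracted surfaces, unit spacing), not the paper's evaluation code.

```python
# Hedged sketch: Dice Similarity Coefficient (DSC) and 95th-percentile Hausdorff
# Distance (95HD) for binary masks, with simplifying assumptions noted above.
import numpy as np
from scipy.spatial.distance import cdist

def dice(pred: np.ndarray, gt: np.ndarray) -> float:
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def hd95(pred: np.ndarray, gt: np.ndarray) -> float:
    # Pairwise distances between all foreground points of both masks (pixel units).
    p = np.argwhere(pred.astype(bool))
    g = np.argwhere(gt.astype(bool))
    d = cdist(p, g)
    forward = d.min(axis=1)   # pred -> gt
    backward = d.min(axis=0)  # gt -> pred
    return max(np.percentile(forward, 95), np.percentile(backward, 95))

pred = np.zeros((64, 64)); pred[20:40, 20:40] = 1
gt = np.zeros((64, 64));   gt[22:42, 22:42] = 1
print(round(dice(pred, gt), 3), round(hd95(pred, gt), 3))
```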
Citations: 0
MVD-NeRF: Multi-View Deblurring Neural Radiance Fields from Defocused Images
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-17 | DOI: 10.1016/j.imavis.2025.105876
Zhenyu Yin, Xiaohui Wang, Feiqing Zhang, Xiaoqiang Shi, Dan Feng
Neural Radiance Fields (NeRF) have demonstrated exceptional three-dimensional (3D) reconstruction quality by synthesizing novel views from multi-view images. However, NeRF algorithms typically require clear and static images to function effectively, and little attention has been given to suboptimal scenarios involving degradations such as reflections and blur. Although blurred images are common in real-world situations, few studies have explored NeRF for handling blur, particularly defocus blur. Correctly simulating the formation of defocus blur is the key to deblurring and helps to accurately synthesize new perspectives from blurred images. Therefore, this paper proposes Multi-View Deblurring Neural Radiance Fields from Defocused Images (MVD-NeRF), a framework for 3D reconstruction from defocus-blurred images. The framework ensures consistency in 3D geometry and appearance by modeling the formation of defocus blur. MVD-NeRF introduces the Defocus Modeling Approach (DMA), a novel method for simulating defocused scenes. When the view is fixed, DMA assumes that a pixel is rendered by multiple rays emitted from the same light source. Additionally, MVD-NeRF proposes a new Multi-view Panning Algorithm (MPA), which simulates light-source movement through slight shifts of the camera center across different views, thereby generating blur effects similar to those in real photography. Together, DMA and MPA enhance MVD-NeRF's ability to capture intricate scene details. Our experimental results validate that MVD-NeRF achieves significant improvements in Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS). The source code for MVD-NeRF is available at the following URL: https://github.com/luckhui0505/MVD-NeRF.
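The following NumPy sketch shows one reading of the multi-ray intuition behind DMA: a defocused pixel is approximated by averaging rays whose origins are jittered over an aperture while all rays converge at the in-focus point. The helper render_ray is a hypothetical placeholder for any per-ray renderer (e.g., a NeRF forward pass); none of this reproduces the authors' implementation.

```python
# Hedged sketch of a thin-lens-style defocus model: average several jittered
# rays per pixel. All parameters and the toy renderer are assumptions.
import numpy as np

def defocused_pixel(render_ray, origin, direction, focus_dist, aperture, n_rays=8, rng=None):
    rng = rng or np.random.default_rng(0)
    focus_point = origin + focus_dist * direction          # point kept in focus
    colors = []
    for _ in range(n_rays):
        # Jitter the ray origin within a square of half-width `aperture`.
        dx, dy = aperture * (rng.random(2) * 2 - 1)
        o = origin + np.array([dx, dy, 0.0])
        d = focus_point - o
        d = d / np.linalg.norm(d)
        colors.append(render_ray(o, d))
    return np.mean(colors, axis=0)                          # averaging yields defocus blur

# Toy renderer: colour depends only on where the ray hits a plane at z = 2.
toy = lambda o, d: np.clip(o + d * ((2.0 - o[2]) / d[2]), 0, 1)
print(defocused_pixel(toy, np.zeros(3), np.array([0.0, 0.0, 1.0]), 1.5, 0.05))
```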
Citations: 0
Hierarchical texture-aware image inpainting via contextual attention and multi-scale fusion
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-16 | DOI: 10.1016/j.imavis.2025.105875
Runing Li, Jiangyan Dai, Qibing Qin, Chengduan Wang, Yugen Yi, Jianzhong Wang
Image inpainting aims to restore missing regions in images with visually coherent and semantically plausible content. Although deep learning methods have achieved significant progress, current approaches still face challenges in handling large-area image inpainting tasks, often producing blurred textures or structurally inconsistent results. These limitations primarily stem from the insufficient exploitation of long-range dependencies and inadequate texture priors. To address these issues, we propose a novel two-stage image inpainting framework that integrates multi-directional texture priors with contextual information. In the first stage, we extract rich texture features from corrupted images using Gabor filters, which simulate human visual perception. These features are then fused to guide a texture inpainting network, where a Multi-Scale Dense Skip Connection (MSDSC) module is introduced to bridge semantic gaps across different feature levels. In the second stage, we design a hierarchical texture-aware guided image completion network that utilizes the repaired textures as auxiliary guidance. Specifically, a contextual attention module is incorporated to capture long-range spatial dependencies and enhance structural consistency. Extensive experiments conducted on three challenging benchmarks, such as CelebA-HQ, Places2, and Paris Street View, demonstrate that our method outperforms existing state-of-the-art approaches in both quantitative metrics and visual quality. The proposed framework significantly improves the realism and coherence of inpainting results, particularly for images with large missing regions or complex textures. The code is available at https://github.com/Runing-Lab/HTA2I.git.
Citations: 0
Efficient 6DoF pose estimation for multi-instance objects from a single image
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-16 | DOI: 10.1016/j.imavis.2025.105882
Wen-Nung Lie, Lee Aing
Estimating 6-degrees-of-freedom poses for multiple objects from a single image, and making this practical in industry, is difficult because several metrics, such as accuracy, speed, and complexity, must be traded off. This study adopts a fast bottom-up approach to estimate poses for multi-instance objects in an image simultaneously. We design a convolutional neural network with simple end-to-end training that outputs 4 feature maps: an error mask, a semantic mask, a center vector map, and a 6D coordinate map (6DCM). Specifically, the 6DCM is capable of providing rear-side 3D object point-cloud information that is originally invisible from the camera's viewpoint. This procedure enriches the shape information about target objects, which can be used to construct each instance's 2D-3D correspondences for pose parameter estimation. Experimental results show that our proposed bottom-up approach is fast, processing a single image containing 7 objects at 25 frames per second with accuracy competitive with other top-down methods.
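Once per-instance 2D-3D correspondences are available, a standard way to recover the pose is a PnP + RANSAC solver, sketched below with synthetic data. This illustrates only the final estimation step and is not necessarily the solver used in the paper.

```python
# Hedged sketch: recover a 6DoF pose from 2D-3D correspondences with OpenCV's
# solvePnPRansac. The correspondences here are synthesised for demonstration;
# the network that would produce them is not reproduced.
import cv2
import numpy as np

rng = np.random.default_rng(0)
object_pts = rng.uniform(-0.1, 0.1, size=(50, 3)).astype(np.float32)   # 3D model points (m)
K = np.array([[600, 0, 320], [0, 600, 240], [0, 0, 1]], dtype=np.float32)  # camera intrinsics

# Ground-truth pose used only to synthesise consistent 2D projections.
rvec_gt = np.array([[0.1], [0.2], [0.05]], dtype=np.float32)
tvec_gt = np.array([[0.0], [0.0], [0.5]], dtype=np.float32)
image_pts, _ = cv2.projectPoints(object_pts, rvec_gt, tvec_gt, K, None)

ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_pts, image_pts, K, None)
print(ok, rvec.ravel(), tvec.ravel())   # should recover rvec_gt / tvec_gt
```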
Citations: 0
HEL-Net: Heterogeneous Ensemble Learning for comprehensive diabetic retinopathy multi-lesion segmentation via Mamba-UNet
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-16 | DOI: 10.1016/j.imavis.2025.105879
Lingyu Wu, Haiying Xia, Shuxiang Song, Yang Lan
Diabetic Retinopathy (DR) is the leading cause of blindness in adults with diabetes. Early automated detection of DR lesions is crucial for preventing vision loss and assisting ophthalmologists in treatment. However, accurately segmenting multiple types of DR lesions poses significant challenges due to their large diversity in size, shape, and location, as well as the conflict in feature modeling between local details and long-range dependencies. To address these issues, we propose a novel Heterogeneous Ensemble Learning Network (HEL-Net) specifically designed for four-lesion segmentation. HEL-Net comprises two ensemble stages: the first stage utilizes Mamba-UNet to generate coarse multi-lesion prediction results, which serve as contextual priors for the second stage, forming a multi-perspective lesion navigation strategy. The second stage employs a heterogeneous structure, integrating specialized networks (Mamba-UNet and U-Net) tailored to different lesion characteristics. Mamba-UNet excels in capturing large lesions by modeling long-range dependencies, while U-Net focuses on small lesions with significant local features. The heterogeneous ensemble framework leverages their complementary strengths to promote comprehensive lesion feature learning. Extensive quantitative and qualitative evaluations on two public datasets (IDRiD and DDR) demonstrate that our HEL-Net achieves competitive performance compared to state-of-the-art methods, achieving an mAUPR of 69.52%, mDice of 67.40%, and mIoU of 51.99% on the IDRiD dataset.
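A minimal sketch of the two-stage wiring described above: stage-1 coarse multi-lesion maps are concatenated with the image as priors for stage-2. The plain convolutional layers are stand-ins for Mamba-UNet and U-Net, and the concatenation scheme is an assumption for illustration.

```python
# Hedged sketch of a two-stage ensemble where coarse predictions act as
# contextual priors for the second stage. The real networks are not reproduced.
import torch
import torch.nn as nn

stage1 = nn.Conv2d(3, 4, kernel_size=3, padding=1)          # stand-in: 4 lesion classes (coarse)
stage2 = nn.Conv2d(3 + 4, 4, kernel_size=3, padding=1)      # stand-in: image + coarse priors

image = torch.randn(1, 3, 256, 256)
coarse = torch.sigmoid(stage1(image))                        # coarse multi-lesion priors
refined = stage2(torch.cat([image, coarse], dim=1))          # second-stage prediction
print(refined.shape)  # torch.Size([1, 4, 256, 256])
```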
Citations: 0
PASS: Peer-agreement based sample selection for training with instance dependent noisy labels
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-16 | DOI: 10.1016/j.imavis.2025.105877
Arpit Garg, Cuong Nguyen, Rafael Felix, Thanh-Toan Do, Gustavo Carneiro
Deep learning encounters significant challenges in the form of noisy-label samples, which can cause the overfitting of trained models. A primary challenge in learning with noisy-label (LNL) techniques is their ability to differentiate between hard samples (clean-label samples near the decision boundary) and instance-dependent noisy (IDN) label samples to allow these samples to be treated differently during training. Existing methodologies to identify IDN samples, including the small-loss hypothesis and feature-based selection, have demonstrated limited efficacy, thus impeding their effectiveness in dealing with real-world label noise. We present Peer-Agreement-based Sample Selection (PASS), a novel approach that utilises three classifiers, where a consensus-driven agreement between two models accurately differentiates between clean and noisy-label IDN samples to train the third model. In contrast to current techniques, PASS is specifically designed to address the complexities of IDN, where noise patterns are correlated with instance features. Our approach seamlessly integrates with existing LNL algorithms to enhance the accuracy of detecting both noisy and clean samples. Comprehensive experiments conducted on simulated benchmarks (CIFAR-100 and Red mini-ImageNet) and real-world datasets (Animal-10N, CIFAR-N, Clothing1M, and mini-WebVision) demonstrated that PASS substantially improved the performance of multiple state-of-the-art methods. This technique achieves superior classification accuracy, particularly in scenarios with high noise levels.
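A minimal sketch of the peer-agreement idea, assuming that "agreement" means both peer classifiers predict the observed label: such samples are treated as clean for training the third model. The exact criterion used in PASS may differ.

```python
# Hedged sketch: select samples as clean when two peer classifiers both agree
# with the given (possibly noisy) label. Thresholds and models are assumptions.
import numpy as np

def peer_agreement_select(logits_a, logits_b, labels):
    """Return indices treated as clean: both peers predict the observed label."""
    pred_a = logits_a.argmax(axis=1)
    pred_b = logits_b.argmax(axis=1)
    agree = (pred_a == labels) & (pred_b == labels)
    return np.where(agree)[0]

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=8)          # observed (noisy) labels
logits_a = rng.normal(size=(8, 10))           # peer model A outputs
logits_b = rng.normal(size=(8, 10))           # peer model B outputs
clean_idx = peer_agreement_select(logits_a, logits_b, labels)
print(clean_idx)   # samples both peers agree on; the rest are treated as noisy
```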
Citations: 0
When Mamba meets CNN: A hybrid architecture for skin lesion segmentation
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-16 | DOI: 10.1016/j.imavis.2025.105880
Yun Xiao, Caijuan Shi, Jinghao Jia, Ao Cai, Yinan Zhang, Meiwen Zhang
As an important means of computer-aided diagnosis and treatment of skin cancer, skin lesion segmentation has recently been studied extensively with different deep models. Because of the limitations of any single deep model, hybrid architectures, especially Mamba–CNN based methods, have become a research hotspot. However, the segmentation accuracy of existing Mamba–CNN methods is still limited, especially for lesions of varying sizes and blurry boundaries, while the models' computational complexity remains high. Therefore, to address these issues and improve skin lesion segmentation performance, we propose a new Mamba–CNN based model, named Feature Fusion and Boundary Awareness Visual Mamba (FFBA-VM). Specifically, the designed Multi-scale Hybrid Attention Interaction (MHAI) module enhances multi-scale feature representation with a powerful capability for long-range dependency modeling, obtaining rich local and global information. The designed Region Localization and Boundary Enhancement (RLBE) module effectively exploits local information to alleviate inaccurate skin lesion localization and boundary blurring. The Lightweight Visual State Space (LVSS) module is designed to reduce the model's computational complexity. Extensive experiments are conducted on four datasets, and our model FFBA-VM effectively boosts segmentation accuracy across multiple evaluation metrics. For example, FFBA-VM achieves mIoU and DSC of 80.28% and 89.06% on the ISIC17 dataset, and reaches mIoU and DSC of 80.47% and 89.17% on the ISIC18 dataset. The experimental results indicate that our proposed FFBA-VM outperforms existing state-of-the-art methods, validating its effectiveness and practicality for skin lesion segmentation.
Citations: 0
CONXA: A CONvnext and CROSS-attention combination network for Semantic Edge Detection
IF 4.2 | CAS Tier 3 (Computer Science) | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-12-15 | DOI: 10.1016/j.imavis.2025.105867
Gwangsoo Kim, Hyuk-jae Lee, Hyunmin Jung
Semantic Edge Detection (SED) is an advanced edge detection technique that simultaneously detects edges in an image and classifies them according to their semantics. It is expected to be applied in various fields, such as medical imaging, satellite imagery, and smart manufacturing. Although previous research on SED has significantly improved performance, further advancements are still needed. In particular, existing studies have typically focused on specific types of datasets, limiting the broader applicability of SED techniques. Motivated by this, our paper makes three key contributions. First, we propose a novel network for SED, called CONXA. CONXA improves SED accuracy by leveraging the powerful feature extraction of ConvNeXt and the effective feature combination of cross-attention. Second, we introduce a novel loss function, the Inverted Dice (I-Dice) loss, which calculates the loss based on a sufficient number of non-edge pixels rather than edge pixels. This helps balance false positives and false negatives, enabling more stable training. Third, unlike previous studies that typically use only one type of dataset, we validate our method using two distinct types of datasets commonly used in SED. Experimental results demonstrate that our approach significantly outperforms existing state-of-the-art (SOTA) methods on datasets that define semantics by edge types, and achieves comparable performance to SOTA methods on datasets where semantics are defined by object boundaries. This indicates that our method can be effectively applied across diverse datasets regardless of the semantic characteristics of edges, contributing to the generalization of SED. Code is available at https://github.com/GSKIM13/CONXA/.
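The sketch below gives one plausible reading of the Inverted Dice loss: a Dice-style overlap computed on the inverted (non-edge) masks, so the abundant background pixels drive the loss rather than the sparse edge pixels. The exact formulation in the paper may differ.

```python
# Hedged sketch: Dice loss on inverted masks, so non-edge pixels dominate the
# overlap term. This is an illustrative interpretation of the abstract only.
import torch

def inverted_dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6):
    """pred: sigmoid probabilities in [0,1]; target: binary edge map. Both (B, 1, H, W)."""
    pred_bg, target_bg = 1.0 - pred, 1.0 - target     # invert: background / non-edge
    inter = (pred_bg * target_bg).sum(dim=(1, 2, 3))
    denom = pred_bg.sum(dim=(1, 2, 3)) + target_bg.sum(dim=(1, 2, 3))
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()

pred = torch.rand(2, 1, 64, 64)
target = (torch.rand(2, 1, 64, 64) > 0.95).float()    # sparse edge labels
print(inverted_dice_loss(pred, target).item())
```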
Citations: 0