
Image and Vision Computing: Latest Articles

Unsupervised Object Localization driven by self-supervised foundation models: A comprehensive review
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-12 | DOI: 10.1016/j.imavis.2025.105807
Sotirios Papadopoulos , Emmanouil Patsiouras , Konstantinos Ioannidis , Stefanos Vrochidis , Ioannis Kompatsiaris , Ioannis Patras
Object localization is a fundamental task in computer vision that traditionally requires labeled datasets for accurate results. Recent progress in self-supervised learning has enabled unsupervised object localization, reducing reliance on manual annotations. Unlike supervised encoders, which depend on annotated training data, self-supervised encoders learn semantic representations directly from large collections of unlabeled images. This makes them the natural foundation for unsupervised object localization, as they capture object-relevant features while eliminating the need for costly manual labels. These encoders produce semantically coherent patch embeddings. Grouping these embeddings reveals sets of patches that correspond to objects in an image. These patch sets can be converted into object masks or bounding boxes, enabling tasks such as single-object discovery, multi-object detection, and instance segmentation. By applying offline mask clustering or using pre-trained vision-language models, unsupervised localization methods can assign semantic labels to discovered objects. This transforms initially class-agnostic objects (objects without class labels) into class-aware ones (objects with class labels), aligning these tasks with their supervised counterparts. This paper provides a structured review of unsupervised object localization methods in both class-agnostic and class-aware settings, whereas previous surveys have focused only on class-agnostic localization. We discuss state-of-the-art object discovery strategies based on self-supervised features and provide a detailed comparison of experimental results across a wide range of tasks, datasets, and evaluation metrics.
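The abstract above describes a generic recipe (self-supervised patch embeddings, grouping, conversion to boxes or masks) rather than any single surveyed algorithm. As a hedged, minimal sketch of that recipe only, the snippet below clusters placeholder patch embeddings with k-means and turns the smaller cluster into a patch-level bounding box; the random tensor stands in for the output of a self-supervised encoder such as DINO, and the "smaller cluster is foreground" heuristic is purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for patch embeddings from a self-supervised ViT encoder
# (e.g. a 14x14 grid of 384-d patch tokens); random values for illustration only.
rng = np.random.default_rng(0)
H, W, D = 14, 14, 384
patch_embeddings = rng.normal(size=(H * W, D))

# Group patches into two clusters; take the smaller cluster as "foreground",
# a crude heuristic that real discovery pipelines refine with seed selection.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(patch_embeddings)
fg_label = np.argmin(np.bincount(labels))
fg_mask = (labels == fg_label).reshape(H, W)

# Convert the foreground patch set into a bounding box in patch coordinates.
ys, xs = np.nonzero(fg_mask)
if ys.size > 0:
    box = (xs.min(), ys.min(), xs.max(), ys.max())  # (x0, y0, x1, y1)
    print("patch-level box:", box)
```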
Citations: 0
CI-TransCNN: A class imbalance hybrid CNN-Transformer Network for facial attribute recognition
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-10 | DOI: 10.1016/j.imavis.2025.105823
Yanfei Liu , Youchang Shi , Yufei Long , Miaosen Xu , Junhua Chen , Yuanqian Li , Hao Wen
Recent facial attribute recognition (FAR) methods often struggle to capture global dependencies and are further challenged by severe class imbalance, large intra-class variations, and high inter-class similarity, ultimately limiting their overall performance. To address these challenges, we propose a network combining CNN and Transformer, termed Class Imbalance Transformer-CNN (CI-TransCNN), for facial attribute recognition, which mainly consists of a TransCNN backbone and a Dual Attention Feature Fusion (DAFF) module. In TransCNN, we incorporate a Structure Self-Attention (StructSA) to improve the utilization of structural patterns in images and propose an Inverted Residual Convolutional GLU (IRC-GLU) to enhance model robustness. This design enables TransCNN to effectively capture multi-level and multi-scale features while integrating both global and local information. DAFF is presented to fuse the features extracted from TransCNN to further improve the feature’s discriminability by using spatial attention and channel attention. Moreover, a Class-Imbalance Binary Cross-Entropy (CIBCE) loss is proposed to improve the model performance on datasets with class imbalance, large intra-class variation, and high inter-class similarity. Experimental results on the CelebA and LFWA datasets show that our method effectively addresses issues such as class imbalance and achieves superior performance compared to existing state-of-the-art CNN- and Transformer-based FAR approaches.
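The abstract names a Class-Imbalance Binary Cross-Entropy (CIBCE) loss without defining it. The sketch below is not that loss; it only illustrates the common underlying idea, assuming hypothetical attribute frequencies: reweighting a multi-label binary cross-entropy so that rare positive attributes contribute more, via PyTorch's pos_weight mechanism.

```python
import torch
import torch.nn as nn

# Hypothetical per-attribute positive rates (e.g. estimated from CelebA labels).
pos_rate = torch.tensor([0.02, 0.40, 0.75, 0.10])      # 4 facial attributes
pos_weight = (1.0 - pos_rate) / pos_rate                # rarer positives get larger weight

# BCEWithLogitsLoss applies the weight to the positive term of each attribute,
# one common way to counter class imbalance in multi-label attribute recognition.
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

logits = torch.randn(8, 4)                              # batch of 8 predictions
targets = torch.randint(0, 2, (8, 4)).float()           # multi-label ground truth
loss = criterion(logits, targets)
print(loss.item())
```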
Citations: 0
Density-aware global–local attention network for point cloud segmentation
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-08 | DOI: 10.1016/j.imavis.2025.105822
Chade Li , Pengju Zhang , Jiaming Zhang , Yihong Wu
3D point cloud segmentation has a wide range of applications in areas such as autonomous driving, augmented reality, virtual reality and digital twins. The point cloud data collected in real scenes often contain small objects and categories with small sample sizes, which are difficult for existing networks to handle. To address this, we propose a point cloud segmentation network that fuses local attention based on density perception with global attention. The core idea is to increase the effective receptive field of each point while reducing the loss of information about small objects in dense areas. Specifically, we assign windows of different sizes to local areas of different densities and compute attention within each window. Furthermore, we consider each local area as an independent token for the global attention of the entire input. A category-response loss is also proposed to balance the processing of different categories and sizes of objects. In particular, we set up an additional fully connected layer in the middle of the network to predict the presence of object categories, and construct a binary cross-entropy loss to respond to the presence of categories in the scene. In experiments, our method achieves competitive results in semantic segmentation and part segmentation tasks on several publicly available datasets. Experiments on point cloud data obtained from complex real-world scenes filled with tiny objects also validate the strong segmentation capability of our method for small objects as well as small-sample categories.
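The category-response loss is described only at a high level above. A hedged reading, with hypothetical shapes and class count, is sketched below: an auxiliary fully connected head pools per-point features, predicts which categories are present in the scene, and is trained with a binary cross-entropy against a multi-hot presence vector built from the per-point labels.

```python
import torch
import torch.nn as nn

B, N, C, K = 2, 4096, 64, 13           # batch, points, feature dim, classes (hypothetical)
point_feats = torch.randn(B, N, C)      # per-point features from the backbone
point_labels = torch.randint(0, K, (B, N))

# Auxiliary head: global pooling over points, then a fully connected layer
# predicting the presence of each category in the scene.
presence_head = nn.Linear(C, K)
presence_logits = presence_head(point_feats.mean(dim=1))            # (B, K)

# Multi-hot target: 1 if at least one point of that category appears in the scene.
presence_target = torch.zeros(B, K)
presence_target.scatter_(1, point_labels, 1.0)

category_response_loss = nn.functional.binary_cross_entropy_with_logits(
    presence_logits, presence_target)
print(category_response_loss.item())
```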
Citations: 0
W-MambaFuse: A wavelet decomposition and adaptive state-space modeling approach for anatomical and functional image fusion
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-08 | DOI: 10.1016/j.imavis.2025.105796
Bowen Zhong , Shijie Li , Xuan Deng , Zheng Li
Anatomical-functional image fusion plays a critical role in a variety of medical and biological applications. Current convolutional neural network-based fusion algorithms are constrained by their limited receptive fields, impeding the effective modeling of long-range dependencies in medical images. While transformer-based architectures possess global modeling capabilities, they face computational challenges due to the quadratic complexity of their self-attention mechanisms. To address these limitations, we propose a network based on wavelet-domain decomposition and an adaptive selectively structured state space model, termed W-MambaFuse, for anatomical and functional image fusion. Specifically, the network first applies a wavelet transform to enlarge the receptive field of the convolutional layers, facilitating the capture of low-frequency structural outlines and high-frequency textural primitives. Furthermore, we develop an adaptive gated fusion module, referred to as CNN-Mamba Gated (MCG), which leverages the dynamic modeling capability of state space models and the local feature extraction strengths of convolutional neural networks. This design facilitates the effective extraction of both intra-modal and inter-modal features, thereby enhancing multimodal image fusion. Experimental results on benchmark datasets demonstrate that W-MambaFuse consistently outperforms pure CNN-based models, transformer-based models, and CNN-transformer hybrid approaches in terms of both visual quality and quantitative evaluations. Our code is publicly available at https://github.com/Bowen-Zhong/W-Mamba.
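The abstract does not state which wavelet the network uses, so the sketch below should be read as a hedged illustration of the decomposition step only: a single-level 2D Haar transform that splits a feature map into one low-frequency band and three high-frequency detail bands.

```python
import torch

def haar_dwt2d(x: torch.Tensor):
    """Single-level 2D Haar wavelet transform.

    x: (B, C, H, W) with even H and W.
    Returns (low, (d1, d2, d3)), each of shape (B, C, H/2, W/2).
    """
    a = x[:, :, 0::2, 0::2]   # top-left of each 2x2 block
    b = x[:, :, 0::2, 1::2]   # top-right
    c = x[:, :, 1::2, 0::2]   # bottom-left
    d = x[:, :, 1::2, 1::2]   # bottom-right
    low = (a + b + c + d) / 2             # approximation band (structural outline)
    d1 = (a - b + c - d) / 2              # detail band
    d2 = (a + b - c - d) / 2              # detail band
    d3 = (a - b - c + d) / 2              # detail band
    return low, (d1, d2, d3)

x = torch.randn(1, 3, 256, 256)           # e.g. an anatomical image tensor
low, highs = haar_dwt2d(x)
print(low.shape, [h.shape for h in highs])
```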
Citations: 0
FaceMINT: A library for gaining insights into biometric face recognition via mechanistic interpretability
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-07 | DOI: 10.1016/j.imavis.2025.105804
Peter Rot , Robert Jutreša , Peter Peer , Vitomir Štruc , Walter Scheirer , Klemen Grm
Deep-learning models, including those used in biometric recognition, have achieved remarkable performance on benchmark datasets as well as real-world recognition tasks. However, a major drawback of these models is their lack of transparency in decision-making. Mechanistic interpretability has emerged as a promising research field intended to help us gain insights into such models, but its application to biometric data remains limited. In this work, we bridge this gap by introducing the FaceMINT library, a publicly available Python library (built on top of PyTorch) that enables biometric researchers to inspect their models through mechanistic interpretability. It provides a plug-and-play solution that allows researchers to seamlessly switch between the analyzed biometric models, evaluate state-of-the-art sparse autoencoders, select from various image parametrizations, and fine-tune hyperparameters. Using the large-scale Glint360K dataset, we demonstrate the usability of FaceMINT by applying its functionality to two state-of-the-art (deep-learning) face recognition models: AdaFace, based on Convolutional Neural Networks (CNN), and SwinFace, based on transformers. The proposed library implements various sparse autoencoders (SAEs), including vanilla SAE, Gated SAE, JumpReLU SAE, and TopK SAE, which have achieved state-of-the-art results in the mechanistic interpretability of large language models. Our study highlights the promise of mechanistic interpretability in the biometric field, providing new avenues for researchers to explore model transparency and refine biometric recognition systems. The library is publicly available at www.gitlab.com/peterrot/facemint.
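FaceMINT's own classes are not reproduced here; the snippet below is a minimal, hedged sketch of one of the SAE variants the abstract lists, a TopK sparse autoencoder, applied to placeholder 512-dimensional face embeddings. The dimensions and sparsity level k are assumptions, not values from the library.

```python
import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder: keep only the k largest latent activations."""
    def __init__(self, d_in: int, d_latent: int, k: int):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_in, d_latent)
        self.decoder = nn.Linear(d_latent, d_in)

    def forward(self, x):
        z = torch.relu(self.encoder(x))
        # Zero out everything except the k largest activations per sample.
        topk = torch.topk(z, self.k, dim=-1)
        z_sparse = torch.zeros_like(z).scatter_(-1, topk.indices, topk.values)
        return self.decoder(z_sparse), z_sparse

# Placeholder 512-d face embeddings (e.g. from a recognition model such as AdaFace).
sae = TopKSAE(d_in=512, d_latent=4096, k=32)
x = torch.randn(16, 512)
recon, z = sae(x)
loss = torch.mean((recon - x) ** 2)      # reconstruction objective
print(loss.item(), int((z[0] != 0).sum().item()))
```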
Citations: 0
Large-small model collaboration for medical visual question answering with task aware mixture of experts and relation knowledge distillation
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-07 | DOI: 10.1016/j.imavis.2025.105820
Qishen Chen , Wenxuan He , Xingyuan Chen , Chen Cheng , Minjie Bian , Huahu Xu
Medical visual question answering aims to support clinical decision-making by answering natural language questions about medical images. In this domain, specialized task-specific small models often surpass large medical vision–language models due to stronger task alignment and domain-specific expertise. However, these small models lack broad domain knowledge necessary for high-level diagnostic reasoning, such as identifying disease causes or recommending treatments. To address these limitations, this paper presents CoMed-TR, a collaborative framework that integrates multiple large medical vision–language models with a task-specific small model to achieve robust and accurate medical visual question answering. The framework introduces a domain-specific vision–language embedding model, Med-Vec, trained on large-scale medical data to produce rich joint image–text representations. A Task-Aware Mixture-of-Experts module dynamically selects and weights heterogeneous experts, including pretrained medical vision–language models and a segmentation-based image encoder, based on question semantics, enabling adaptive expert fusion. In addition, a relation knowledge distillation strategy aligns the fused expert representations with the relational structure learned by Med-Vec, embedding clinically meaningful similarity relationships into the model’s representation space. Experiments on two public benchmarks demonstrate that CoMed-TR achieves significant performance gains, improving overall accuracy by 3.3% on the VQA-RAD dataset and 1.1% on SLAKE, outperforming both prior state-of-the-art models and existing large–small collaboration methods. Notably, these results are achieved with only approximately 25 million trainable parameters, highlighting the framework’s computational efficiency and practical potential for deployment in real-world clinical settings. The codes are available at: https://github.com/shanziSZ/CoMed-TR/.
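The exact CoMed-TR gating design is not given in the abstract. As a generic, hedged illustration of question-conditioned expert fusion, the sketch below maps a question embedding to softmax weights over the feature vectors of several frozen experts; all dimensions, the gating network, and the expert outputs are placeholders.

```python
import torch
import torch.nn as nn

class QuestionGatedFusion(nn.Module):
    """Weights expert features with a softmax gate conditioned on the question."""
    def __init__(self, q_dim: int, feat_dim: int, num_experts: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(q_dim, 128), nn.ReLU(),
                                  nn.Linear(128, num_experts))
        self.feat_dim = feat_dim

    def forward(self, question_emb, expert_feats):
        # question_emb: (B, q_dim); expert_feats: (B, num_experts, feat_dim)
        weights = torch.softmax(self.gate(question_emb), dim=-1)     # (B, num_experts)
        return torch.einsum("be,bed->bd", weights, expert_feats)      # fused (B, feat_dim)

B, E, Q, D = 4, 3, 768, 1024
fusion = QuestionGatedFusion(q_dim=Q, feat_dim=D, num_experts=E)
question_emb = torch.randn(B, Q)            # placeholder question embedding
expert_feats = torch.randn(B, E, D)         # placeholder outputs of 3 frozen experts
fused = fusion(question_emb, expert_feats)
print(fused.shape)                          # torch.Size([4, 1024])
```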
Citations: 0
PACMAN: Rapid identification of keypoint patch-based fiducial marker in occluded environments
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-07 | DOI: 10.1016/j.imavis.2025.105821
Taewook Park , Geunsik Bae , Woojae Shin , Meraj Mammadov , Jaemin Seo , Heejung Shin , Hyondong Oh
Fiducial marker systems are widely used in image-based localization methods due to their high robustness and low computational latency. However, occlusions caused by dynamic environmental factors, such as shadows and unexpected objects, significantly hinder the detection of fiducial markers, as partially visible patterns and significant image degradation often contradict the fundamental assumptions of marker detection systems. To address this challenge, we propose a keypoint-based fiducial marker and a deep-learning-based detector that jointly handle occlusion and image degradation with minimal computational latency. First, we design four distinct keypoint patches that account for occlusions and maintain essential functionalities. A fiducial marker is then constructed by assembling six identical patches under geometric constraints. Second, a widely-used interest point detector network is optimized for the proposed marker design, resulting in robust keypoint detection under various types of image deformation and degradation. A geometric consistency check is subsequently applied to map imperfect keypoint detections to 6D marker poses in the image, effectively rejecting occlusions and potential network failures. Third, neural network quantization and parallel CPU processing are applied to minimize computational latency. Our experimental results demonstrate higher detection rates than other types of single markers in occluded environments, with improvements ranging from 17% to 40%. The proposed system is also evaluated under motion blur, dimming effects, and variations in scale and rotation. Additionally, the efficient computational design enables end-to-end processing at up to 749 FPS on a desktop PC and 138 FPS on an edge device, for 640 × 480 resolution images containing a single marker. Our code is available at: https://github.com/WhiteCri/PACMAN.git.
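The sketch below is not the PACMAN pipeline; it only illustrates the final keypoints-to-pose step that marker systems of this kind typically rely on, recovering a 6D marker pose from detected 2D keypoints with RANSAC PnP in OpenCV. The marker geometry, camera intrinsics, and detections are all made-up placeholders.

```python
import cv2
import numpy as np

# Hypothetical 3D keypoint layout of a planar marker (in meters, z = 0).
object_points = np.array([[0.00, 0.00, 0.0],
                          [0.05, 0.00, 0.0],
                          [0.05, 0.05, 0.0],
                          [0.00, 0.05, 0.0],
                          [0.025, 0.025, 0.0],
                          [0.025, 0.00, 0.0]], dtype=np.float64)

# Placeholder 2D detections from a keypoint network (pixel coordinates).
image_points = np.array([[320.0, 240.0], [400.0, 242.0], [398.0, 322.0],
                         [318.0, 320.0], [359.0, 281.0], [360.0, 241.0]],
                        dtype=np.float64)

camera_matrix = np.array([[600.0, 0.0, 320.0],
                          [0.0, 600.0, 240.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.zeros(5)

# RANSAC PnP tolerates outlier keypoints (e.g. from partially occluded patches).
ok, rvec, tvec, inliers = cv2.solvePnPRansac(object_points, image_points,
                                             camera_matrix, dist_coeffs,
                                             reprojectionError=3.0)
if ok:
    print("rotation (Rodrigues):", rvec.ravel(), "translation:", tvec.ravel())
```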
Citations: 0
A multi-scale U-shaped transformer neural network for low-light image enhancement
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-07 | DOI: 10.1016/j.imavis.2025.105801
Ji Soo Shin, Ho Sub Lee
Low-light image enhancement (LLIE) aims to improve the visibility and visual quality of images captured under insufficient lighting conditions, which are typically characterized by low contrast, suppressed textures, and amplified noise. Recent methods often employ a multi-scale enhancement strategy by stacking sub-networks—such as cascaded convolutional blocks or a single-scale transposed self-attention module—to refine contrast from coarse to fine levels. However, these methods struggle to effectively restore natural color appearance and fail to preserve global illumination cues, which limits the generalization capability of the models. In addition, conventional self-attention methods for LLIE operate at a single resolution, making it difficult to effectively fuse multi-scale features and thus constraining their ability to simultaneously capture long-range dependencies and preserve fine structural details. To address these issues, this paper proposes MSTSA-UTNet, a compact U-shaped Transformer architecture that incorporates a newly designed Transformer block based on multi-scale transposed self-attention (MSTSA) with lightweight feed forward modules, and adopts a multi-scale input, single-scale output (MISO) strategy. The key idea of MSTSA is to enable multi-resolution interaction by simultaneously incorporating original high-resolution features and down-sampled low-resolution features. Furthermore, the proposed feature extraction and fusion framework comprises two core components: a prior-guided shallow feature extraction (PG-SFE) module that preserves low-level spatial cues while incorporating illumination priors to modulate shallow features, and a multi-scale feed forward network (MSFFN) that performs gated fusion to selectively integrate global context and local detail. This design facilitates improved feature learning for low-light enhancement. Extensive experimental results demonstrate that the proposed MSTSA-UTNet consistently outperforms the recent state-of-the-art multi-scale enhancement method SMNet [37] by up to 0.59 dB in PSNR on the LOL-v1 dataset.
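Transposed self-attention, as used in several restoration Transformers, attends across channels rather than spatial positions, which keeps the cost roughly linear in image resolution. The sketch below is a hedged, single-scale, single-head illustration of that general idea and is not the MSTSA module itself.

```python
import torch
import torch.nn as nn

class TransposedSelfAttention(nn.Module):
    """Channel-wise ("transposed") self-attention for a (B, C, H, W) feature map."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)
        self.temperature = nn.Parameter(torch.ones(1))

    def forward(self, x):
        b, c, h, w = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=1)                       # each (B, C, H, W)
        q = torch.nn.functional.normalize(q.flatten(2), dim=-1)        # (B, C, HW)
        k = torch.nn.functional.normalize(k.flatten(2), dim=-1)
        v = v.flatten(2)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.temperature, dim=-1)  # (B, C, C)
        out = (attn @ v).reshape(b, c, h, w)
        return self.proj(out) + x                                      # residual connection

x = torch.randn(1, 32, 64, 64)          # placeholder low-light feature map
print(TransposedSelfAttention(32)(x).shape)
```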
Citations: 0
Semantic-assisted unpaired image dehazing
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-06 | DOI: 10.1016/j.imavis.2025.105818
Yang Yang, Lei Zhang, Ke Pang, Tongtong Chen, Xiaodong Yue
Recently, a series of innovative unpaired image dehazing techniques have been introduced; they have relieved the pressure of collecting paired data, yet these methods typically overlook the integration of semantic information, which is essential for a more comprehensive dehazing process. Our research aims to bridge this gap by proposing a novel method that fully integrates feature information into unpaired image dehazing. Specifically, we propose a semantic information-guided feature enhancement and fusion block, which selectively fuses the refined features guided by the semantic result layer and semantic feature layer based on the uncertainty of semantic information. Besides, our method adopts semantic information to guide the generation of haze in the training process. This approach results in the creation of a more diverse set of hazy images, which in turn enhances the dehazing performance. Furthermore, in terms of the loss function, we introduce a loss term that constrains the semantic information entropy of the dehazing results. This constraint ensures that the dehazed images not only achieve clarity but also retain semantic accuracy and integrity. Extensive experiments validate our superiority over other methods and the effectiveness of our designs. The code is available at .
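The entropy constraint on semantic predictions can be illustrated in hedged form by the small loss term below: the mean per-pixel entropy of a segmentation head's softmax output on the dehazed image, which a training loop could add to its other losses. The class count and tensor shapes are placeholders, not values from the paper.

```python
import torch

def semantic_entropy_loss(seg_logits: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Mean per-pixel entropy of semantic predictions.

    seg_logits: (B, K, H, W) logits from a segmentation head applied
    to the dehazed image. Lower entropy means more confident semantics.
    """
    probs = torch.softmax(seg_logits, dim=1)
    entropy = -(probs * torch.log(probs + eps)).sum(dim=1)   # (B, H, W)
    return entropy.mean()

seg_logits = torch.randn(2, 19, 128, 128)    # e.g. 19 Cityscapes-style classes
loss_entropy = semantic_entropy_loss(seg_logits)
print(loss_entropy.item())
```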
Citations: 0
Fidelity-preserving zero-shot diffusion models for highly ill-posed inverse problems in lensless imaging
IF 4.2 | CAS Tier 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-05 | DOI: 10.1016/j.imavis.2025.105786
Haechang Lee , Dong Ju Mun , Hyunwoo Lee , Kyung Chul Lee , Seongmin Hong , Gwanghyun Kim , Seung Ah Lee , Se Young Chun
Diffusion models have been extensively explored for solving ill-posed inverse problems, achieving remarkable performance. However, their applicability to real-world scenarios, such as lensless imaging, has not been well investigated. Modern lensless imaging offers a compact form factor, low-cost hardware requirements, and intrinsic compressive imaging capabilities, but these advantages come with highly ill-posed inverse problems. In this work, we introduce a training-free zero-shot diffusion model, termed Dilack, for restoring raw images captured by lensless cameras that are degraded by large and complex kernels. Our approach incorporates novel data fidelity terms, referred to as the pseudo-inverse anchor for constraining (PiAC) fidelity loss, to enhance reconstruction quality by addressing the ill-posed nature of challenging inverse problems. Additionally, inspired by locally acting classical regularizers, we propose integrating masked fidelity within the PiAC loss. This scheme enables interaction with globally acting diffusion models while adaptively enforcing spatially and stepwise local fidelity through masks. Our proposed framework effectively mitigates erratic behavior and inherent artifacts in diffusion models when used for highly ill-posed inverse problems, significantly improving the quality of lensless camera raw image restoration, including perceptual aspects. Experimental results on both synthetic and real-world datasets for modern lensless imaging demonstrate that our approach outperforms prior art, including classical methods and existing diffusion-based methods. The code is available at https://github.com/mundongju/Dilack.
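The PiAC fidelity term is only named, not defined, in the abstract. As a hedged toy illustration of pairing a measurement-consistency term with a pseudo-inverse anchor, the snippet below evaluates both on a small dense linear operator; in an actual diffusion sampler the resulting gradient would guide the denoised estimate at each step, and the operator, estimate, and weighting here are placeholders.

```python
import torch

torch.manual_seed(0)
m, n = 64, 256                          # under-determined: 64 measurements of a 256-d signal
A = torch.randn(m, n) / m ** 0.5        # toy forward operator (stand-in for a lensless kernel)
x_true = torch.randn(n)
y = A @ x_true                          # observed measurements
A_pinv = torch.linalg.pinv(A)           # pseudo-inverse, precomputed once

# Stand-in for the sampler's current denoised estimate of the clean image.
x0_hat = (x_true + 0.3 * torch.randn(n)).clone().detach().requires_grad_(True)

# Measurement consistency plus a pseudo-inverse "anchor" that keeps the estimate
# close to the minimum-norm solution; lam is a hypothetical weighting.
lam = 0.1
fidelity = torch.sum((A @ x0_hat - y) ** 2)
anchor = torch.sum((x0_hat - A_pinv @ y) ** 2)
loss = fidelity + lam * anchor
loss.backward()
print(loss.item(), x0_hat.grad.norm().item())   # gradient that would guide the sampler
```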
Citations: 0