
Image and Vision Computing: Latest Publications

Learning like a real student: Black-box domain adaptation with preview, differentiated learning and review
IF 4.2 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-01 | DOI: 10.1016/j.imavis.2025.105806
Qing Tian, Zhiwen Liu, Weihua Ou
Black-box Domain Adaptation (BDA) is a source-free unsupervised domain adaptation method that requires access only to black-box source predictors. This method offers significant security advantages since it does not necessitate access to the source data or specific parameters of the source model. However, adaptation using only noisy source-predicted labels presents considerable challenges due to the limited information available. Existing research primarily focuses on minor improvements at the micro-level, without addressing the macro-level training strategies required for effective black-box domain adaptation. In this article, we propose a novel three-step BDA framework for image classification called PDLR, which emulates the learning strategies of real students, dividing the training process into three stages: Preview, Differentiated Learning, and Review. Initially, during the preview stage, we enable the model to acquire fundamental knowledge and stable features. Subsequently, in the differentiated learning stage, we categorize target samples into easy-adaptable, semi-adaptable, and hard-adaptable subdomains and employ graph contrastive learning to align these samples. Finally, in the review stage, we identify and conduct supplementary learning on classes that are prone to being forgotten. Our method achieves state-of-the-art performance across multiple benchmarks.
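The differentiated-learning stage hinges on splitting target samples into easy-, semi-, and hard-adaptable subdomains. The abstract does not state the criterion, so the sketch below uses a plausible stand-in: the confidence of the noisy black-box predictions, with purely illustrative thresholds (0.8 / 0.4). This is a minimal PyTorch sketch, not the authors' PDLR code.

import torch

def partition_by_confidence(probs, hi=0.8, lo=0.4):
    # probs: (N, C) soft predictions from the black-box source model; thresholds are illustrative.
    conf = probs.max(dim=1).values
    easy = torch.where(conf >= hi)[0]
    semi = torch.where((conf >= lo) & (conf < hi))[0]
    hard = torch.where(conf < lo)[0]
    return easy, semi, hard

if __name__ == "__main__":
    probs = torch.softmax(torch.randn(1000, 31), dim=1)   # hypothetical 31-class target set
    easy, semi, hard = partition_by_confidence(probs)
    print(len(easy), len(semi), len(hard))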
Citations: 0
Automated recognition of humerus anomalies with convolutional neural networks
IF 4.2 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-01 | DOI: 10.1016/j.imavis.2025.105799
Gea Viozzi, Fabio Persia, Daniela D’Auria
Humerus anomalies require rapid and accurate diagnosis to ensure immediate and efficient treatment. In this context, the main goal of this paper is to develop and analyze well-known Convolutional Neural Network models for the automatic recognition of humeral fractures, with the aim of proposing a useful tool for healthcare personnel. Specifically, three distinct architectures were implemented and compared: a three-layer untrained neural network, a network based on the ResNet18 architecture, and one based on the DenseNet121 model, both of which were pre-trained. The performance analysis highlighted a trade-off between accuracy and generalization ability, showing better accuracy for the pre-trained models - in particular, the DenseNet121 model achieved the best accuracy, 85%, across multiple runs - which, however, proved more prone to overfitting than the non-pre-trained model. As a result, this study proposes the integration of deep learning tools into medical practice, laying important foundations for future developments, with the hope of improving the efficiency and accuracy of orthopedic diagnoses.
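As a rough illustration of the compared setups, the PyTorch/torchvision sketch below builds either an ImageNet-pretrained DenseNet121 or ResNet18 with a replaced classification head, or a small untrained convolutional baseline. The baseline's layer sizes and the binary label space are assumptions, not the authors' exact configuration.

import torch.nn as nn
from torchvision import models

def build_model(arch="densenet121", num_classes=2):
    # Transfer learning: ImageNet-pretrained backbone with a new classification head.
    if arch == "densenet121":
        m = models.densenet121(weights=models.DenseNet121_Weights.DEFAULT)
        m.classifier = nn.Linear(m.classifier.in_features, num_classes)
    elif arch == "resnet18":
        m = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        m.fc = nn.Linear(m.fc.in_features, num_classes)
    else:  # small untrained baseline, roughly analogous to the paper's three-layer network
        m = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, num_classes),
        )
    return m

print(build_model("resnet18"))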
Citations: 0
Diffusion model-based imbalanced diabetic retinal image classification
IF 4.2 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-11-01 | DOI: 10.1016/j.imavis.2025.105805
Yu Chen, Liu Yang, Jun Long, TingBo Bao
Since the release of Denoising Diffusion Probabilistic Models by Google in 2020, diffusion models have gradually emerged as a new research focus in generative modeling. However, in the task of diabetic retinopathy image classification, conventional convolutional neural network methods, although capable of achieving high accuracy, generally lack interpretability and thus fail to meet the transparency requirements of clinical diagnosis. To address this issue, a novel denoising diffusion framework named TCG-DiffDRC is proposed for diabetic retinopathy classification. An innovative triple-granularity conditional guidance strategy is introduced, in which three independent branches are fused. The global feature branch employs an improved ResNet-50 architecture with class activation mapping to generate global descriptors and capture macroscopic patterns. The local feature branch integrates multiple regions through a gated attention mechanism to identify local structures. The detail branch leverages an interpretable neural transformer with a multi-head attention mechanism to extract fine-grained lesion features. Furthermore, a dynamic guidance mechanism based on the correctness of an auxiliary classifier is incorporated during the diffusion reconstruction process, while segmentation masks are embedded as a regularization term in the loss function to enhance structural consistency in lesion regions. Experimental results demonstrate that TCG-DiffDRC consistently outperforms state-of-the-art methods across three public datasets, including APTOS2019, Messidor, and IDRiD. On the APTOS2019 dataset, the proposed method achieves an accuracy of 86.7% and a Cohen’s Kappa of 75.8%, with improvements confirmed by statistical significance testing, thereby verifying the reliability of the model.
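The triple-granularity guidance amounts to fusing three branch embeddings into a single conditioning signal for the denoiser. The sketch below shows one straightforward way to do that (concatenate and project); the branch dimensions (2048/512/256), the MLP fusion, and the output width are assumptions - the paper's actual guidance mechanism is more elaborate than this.

import torch
import torch.nn as nn

class TripleConditionFusion(nn.Module):
    # Fuse global / local / detail branch embeddings into one conditioning vector
    # that a diffusion denoiser could consume as guidance.
    def __init__(self, d_global=2048, d_local=512, d_detail=256, d_cond=512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(d_global + d_local + d_detail, d_cond), nn.SiLU(),
            nn.Linear(d_cond, d_cond),
        )

    def forward(self, f_global, f_local, f_detail):
        return self.proj(torch.cat([f_global, f_local, f_detail], dim=-1))

cond = TripleConditionFusion()(torch.randn(4, 2048), torch.randn(4, 512), torch.randn(4, 256))
print(cond.shape)  # torch.Size([4, 512])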
Citations: 0
SPM-CyViT: A self-supervised pre-trained cycle-consistent vision transformer with multi-branch for contrast-enhanced CT synthesis
IF 4.2 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-10-31 | DOI: 10.1016/j.imavis.2025.105802
Hongwei Yang, Wen Zeng, Ke Chen, Zhan Hua, Yan Zhuang, Lin Han, Guoliang Liao, Yiteng Zhang, Hanyu Li, Zhenlin Li, Jiangli Lin
Contrast-enhanced computed tomography (CECT) is crucial for assessing vascular anatomy and pathology. However, the use of iodine contrast medium poses risks, including anaphylactic shock and acute kidney injury. To address this, we propose SPM-CyViT, a self-supervised pre-trained, multi-branch, cycle-consistent vision transformer that synthesizes high-quality virtual CECT from non-contrast CT (NCCT). Its generator employs a parallel encoding approach, combining vision transformer blocks with convolutional downsampling layers. Their encoded outputs are fused through a tailored cross-attention module, producing feature maps with multi-scale complementary properties. Employing masked reconstruction, the ViT global encoder enables self-supervised pre-training on diverse unlabeled CT slices. This overcomes fixed-dataset limitations and significantly improves generalization. Additionally, the model features a multi-branch decoder-discriminator design tailored to specific labels. It incorporates 40 keV monoenergetic enhanced CT (MonoE) as an auxiliary label to optimize contrast-sensitive regions. Results from the dual-center internal test set demonstrate that SPM-CyViT outperforms existing CECT synthesis models across all quantitative metrics. Furthermore, with real CECT as the benchmark, three radiologists awarded SPM-CyViT an average clinical evaluation score of 4.21/5.00 across multiple perspectives. Additionally, SPM-CyViT exhibits robust generalization on the external test set, achieving a mean CNR of 10.96 for synthesized CECT, nearing the 12.38 value of real CECT, collectively underscoring its clinical application potential.
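The core fusion step lets features from one encoder attend to the other's tokens. A minimal cross-attention sketch in PyTorch is given below, with CNN features as queries and ViT tokens as keys/values; the embedding width, head count, and residual/LayerNorm arrangement are assumptions rather than the paper's tailored module.

import torch
import torch.nn as nn

class CrossAttentionFusion(nn.Module):
    # CNN features (queries) attend to ViT tokens (keys/values), then a residual + LayerNorm.
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, cnn_feat, vit_tokens):
        # cnn_feat: (B, C, H, W) -> (B, H*W, C); vit_tokens: (B, N, C)
        b, c, h, w = cnn_feat.shape
        q = cnn_feat.flatten(2).transpose(1, 2)
        fused, _ = self.attn(q, vit_tokens, vit_tokens)
        fused = self.norm(fused + q)
        return fused.transpose(1, 2).reshape(b, c, h, w)

out = CrossAttentionFusion()(torch.randn(2, 256, 16, 16), torch.randn(2, 196, 256))
print(out.shape)  # torch.Size([2, 256, 16, 16])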
Citations: 0
DBaP-net: Deep network for image defogging based on physical properties prior
IF 4.2 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-10-31 | DOI: 10.1016/j.imavis.2025.105803
Bo Yu, Hanting Wei, Chenghong Zhang, Wei Wang
In the realm of computer vision, high-quality image information serves as the foundation for downstream tasks. Nevertheless, elements such as foggy weather, suboptimal lighting circumstances, and atmospheric impurities frequently deteriorate image quality, posing a considerable research challenge in effectively restoring these low-quality images. Existing defogging approaches mainly rely on constraints and physical priors; however, they have demonstrated limited efficacy, especially when dealing with extensive fog-affected areas. To tackle this issue, a deep trainable de-fog network named DBaP-net is proposed in this paper. By leveraging convolutional neural networks, this network integrates diverse filters to extract physical priors from images. Through the construction of a sophisticated deep network architecture, DBaP-net precisely estimates the transmission map and efficiently facilitates the restoration of haze-free images. Additionally, we design a spatial transformation layer customized for physical prior features and adopt a multi-kernel fusion extraction technique to further enhance the model’s feature extraction capabilities and spatial adaptability, thereby laying a solid foundation for subsequent visual tasks. Experimental validation indicates that DBaP-net not only effectively eliminates haze from images but also significantly enhances their overall quality. In both quantitative and qualitative evaluations, DBaP-net surpasses other comparison algorithms in terms of efficiency and usability. As a result, this study offers a novel solution to the image defogging problem within computer vision frameworks, enabling the precise restoration of low-quality images and providing robust support for research endeavors and downstream applications in related fields.
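Assuming DBaP-net follows the standard atmospheric scattering formulation I(x) = J(x)t(x) + A(1 - t(x)), a haze-free image can be recovered once the transmission map t is estimated. The NumPy sketch below only inverts that model; the lower bound t_min = 0.1 and the sample atmospheric light A are illustrative values, and the random transmission map stands in for the network's actual output.

import numpy as np

def recover_scene(I, t, A, t_min=0.1):
    # Atmospheric scattering model: I = J * t + A * (1 - t)  =>  J = (I - A) / max(t, t_min) + A
    t = np.clip(t, t_min, 1.0)[..., None]          # (H, W, 1), broadcast over RGB channels
    J = (I - A) / t + A
    return np.clip(J, 0.0, 1.0)

if __name__ == "__main__":
    I = np.random.rand(480, 640, 3).astype(np.float32)   # hazy image in [0, 1]
    t = np.random.rand(480, 640).astype(np.float32)      # transmission map (network estimate)
    A = np.array([0.9, 0.9, 0.9], dtype=np.float32)      # global atmospheric light (assumed)
    print(recover_scene(I, t, A).shape)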
Citations: 0
DGFMamba: Model fine-tuning based on bidirectional state space for domain generalization semantic segmentation
IF 4.2 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-10-31 | DOI: 10.1016/j.imavis.2025.105800
Yongchao Qiao, Ya’nan Guan, Zhiyou Wang, Jingmin Yang, Wenyuan Yang
Compared with traditional semantic segmentation, Domain Generalization Semantic Segmentation (DGSS) focuses more on improving the generalization of models in unseen domains. Existing methods are mainly based on Transformers and convolutional neural networks, which have limited receptive fields and high complexity. Mamba, as a new state-space model, can solve these problems well. Nevertheless, the problems of hidden states and learning domain-invariant semantic features make it difficult to apply to DGSS. In this paper, we propose a model fine-tuning method named DGFMamba, which introduces Hidden State Fine-tuning Tokens (HSFT) and a Feature-level Bidirectional Selective Scan Module (FBSSM) to improve the feature maps. HSFT, which consists of channel tokens and feature tokens, can perform local forgetting on feature maps. Feature-level embedding allows feature maps to be input to FBSSM with single pixels as vectors. FBSSM obtains contextual information from both forward and reverse directions, with reverse information serving as a complement to forward information. To further reduce the trainable parameters of the model, the parameters of FBSSM and MLP at each layer are shared. DGFMamba achieves promising results in experiments with different settings. This also demonstrates the effectiveness of applying state-space models to model fine-tuning. The average mIoU under the GTAV→Cityscapes+BDD100K+Mapillary setting is 64.4%. The average mIoU under the GTAV+Synthia→Cityscapes+BDD100K+Mapillary setting is 65.8%. It is worth noting that DGFMamba only adds an additional 0.5% of trainable parameters. The code is available at https://github.com/xiaoxia0722/DGFMamba.
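The bidirectional idea is that a reverse scan complements the causal forward scan, so every token sees both directions of context. The sketch below captures only that structure, with a causal depthwise convolution standing in for the selective state-space scan (no actual Mamba kernel is used); the model width, kernel size, and additive fusion are illustrative choices.

import torch
import torch.nn as nn

class BidirectionalScan(nn.Module):
    # Run a causal sequence mixer forward and on the flipped sequence, then fuse,
    # so reverse context complements forward context for every token.
    def __init__(self, d_model=64, kernel=4):
        super().__init__()
        self.fwd = nn.Conv1d(d_model, d_model, kernel, padding=kernel - 1, groups=d_model)
        self.bwd = nn.Conv1d(d_model, d_model, kernel, padding=kernel - 1, groups=d_model)
        self.out = nn.Linear(d_model, d_model)

    def _causal(self, conv, x):                 # x: (B, L, D)
        y = conv(x.transpose(1, 2))[..., : x.shape[1]]   # trim right padding -> causal conv
        return y.transpose(1, 2)

    def forward(self, x):
        f = self._causal(self.fwd, x)
        b = self._causal(self.bwd, x.flip(1)).flip(1)
        return self.out(f + b)

tokens = torch.randn(2, 1024, 64)               # flattened feature map: one pixel per token
print(BidirectionalScan()(tokens).shape)        # torch.Size([2, 1024, 64])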
Citations: 0
Enhanced skeleton-based Group Activity Recognition through spatio-temporal graph convolution with cross-dimensional attention
IF 4.2 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-10-30 | DOI: 10.1016/j.imavis.2025.105784
Dongli Wang, Yongcan Weng, Xiaolin Zhu, Yan Zhou, Zixin Zhang, Richard Irampaye
Group Activity Recognition (GAR) is a pivotal task in video understanding, with broad applications ranging from surveillance to human–computer interaction. Traditional RGB-based methods face challenges such as privacy concerns, environmental sensitivity, and fragmented scene-level semantic understanding. Skeleton-based approaches offer a promising alternative but often suffer from limited exploration of heterogeneous features and the absence of explicit modeling for human-object interactions. In this paper, we introduce a lightweight framework for skeleton-based GAR, leveraging an attention-enhanced spatio-temporal graph convolutional network. Specifically, we first decouple joint and bone features along with their motion patterns, constructing a global human-object relational graph using an attention graph convolution module (AGCM). Additionally, we incorporate a Multi-Scale Temporal Convolution Module (MTC) and a Cross-Dimensional Attention Module (CDAM) to dynamically focus on key spatio-temporal nodes and feature channels. Our method achieves significant improvements in accuracy while maintaining high computational efficiency, making it suitable for real-time applications in privacy-sensitive scenarios. Experiments on the Volleyball and NBA datasets demonstrate that our method achieves competitive performance using only skeleton input, significantly reducing parameters and computational cost compared to mainstream approaches. Our method improves Multi-Class Per-Class Accuracy (MPCA) to 96.1% on the Volleyball dataset and 71.6% on the NBA dataset, offering a lightweight and efficient solution for GAR in privacy-sensitive scenarios.
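Decoupling joints and bones together with their motion patterns is a standard preprocessing step in skeleton-based recognition: a bone is the vector from a joint to its parent, and motion is the frame-to-frame displacement. The sketch below shows this step only; the 17-joint parent table is a hypothetical layout, not the paper's skeleton topology.

import torch

# Parent index per joint for a hypothetical 17-joint skeleton (root points to itself).
PARENTS = [0, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]

def joints_to_bones_and_motion(joints):
    # joints: (B, T, V, C) = batch, frames, joints, coordinates
    bones = joints - joints[:, :, PARENTS, :]             # bone vector = joint - parent joint
    motion = torch.zeros_like(joints)
    motion[:, 1:] = joints[:, 1:] - joints[:, :-1]        # temporal displacement per frame
    return bones, motion

j = torch.randn(2, 64, 17, 3)
b, m = joints_to_bones_and_motion(j)
print(b.shape, m.shape)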
Citations: 0
EmbryoVision AI: An explainable deep learning framework for enhanced blastocyst selection in assisted reproductive technologies
IF 4.2 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-10-30 | DOI: 10.1016/j.imavis.2025.105795
Alessia Auriemma Citarella, Pietro Battistoni, Chiara Coscarelli, Fabiola De Marco, Luigi Di Biasi, Mengyuan Wang
Accurate embryo selection is a key factor in improving implantation success rates in Assisted Reproductive Technologies. This study presents a deep learning framework, EmbryoVision AI, designed to enhance blastocyst assessment using Time-Lapse Imaging and eXplainable AI techniques. A customized convolutional neural network was developed to capture both morphological and temporal dynamics, enabling a precise classification of the embryo. To ensure transparency, Gradient-weighted Class Activation Mapping was integrated, allowing visualization of decision-critical embryonic structures and ensuring clinical alignment. The model demonstrated strong predictive performance across different embryo grades, achieving an accuracy of 91.5% for Grade AA, 88.4% for Grade AB, and 79.3% for Grade BC. The AUC-ROC values were 0.95, 0.90, and 0.81 for Grade AA, AB, and BC, respectively, indicating strong discriminatory capabilities. The findings suggest that AI-driven embryo selection can enhance objectivity, reduce human variability, and improve ART outcomes. However, the results also underscore the need to refine AI models to better handle morphological variability in lower-quality embryos, highlighting the importance of improving generalization and strengthening clinical integration.
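Grad-CAM itself is straightforward to reproduce: pool the gradients of the target class over a chosen convolutional layer, weight the activations, apply a ReLU, and upsample. The sketch below implements that with forward and tensor hooks on a ResNet18 stand-in; the paper's customized CNN and its target layer would be substituted in practice.

import torch
import torch.nn.functional as F
from torchvision import models

def grad_cam(model, target_layer, image, class_idx=None):
    # Minimal Grad-CAM: channel weights from pooled gradients, weighted sum, ReLU, upsample.
    store = {}
    def fwd_hook(module, inp, out):
        store["act"] = out
        out.register_hook(lambda g: store.update(grad=g))
    handle = target_layer.register_forward_hook(fwd_hook)
    logits = model(image)
    cls = int(logits[0].argmax()) if class_idx is None else class_idx
    model.zero_grad()
    logits[0, cls].backward()
    handle.remove()
    weights = store["grad"].mean(dim=(2, 3), keepdim=True)      # global average over H, W
    cam = F.relu((weights * store["act"]).sum(dim=1, keepdim=True))
    cam = F.interpolate(cam, size=image.shape[-2:], mode="bilinear", align_corners=False)
    return (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)

model = models.resnet18(weights=None).eval()                    # stand-in, not the paper's CNN
heatmap = grad_cam(model, model.layer4[-1], torch.randn(1, 3, 224, 224))
print(heatmap.shape)  # torch.Size([1, 1, 224, 224])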
Citations: 0
No reference Point Cloud Quality Assessment via cross-modal learning and contrastive enhancement
IF 4.2 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-10-28 | DOI: 10.1016/j.imavis.2025.105788
Ruiyu Ming, Haibing Yin, Xiaofeng Huang, Weifeng Dong, Hang Lu, Hongkui Wang
Point Cloud Quality Assessment (PCQA) has become an important research area due to the rapid development and widespread application of 3D vision. Point clouds have diverse representation forms, including a point-wise modality and a projection image-wise modality. However, most existing methods inadequately account for cross-modality interactions and for an elaborate depiction of each modality's characteristics, resulting in unsatisfactory PCQA model accuracy. In addition, PCQA datasets are scarce, further limiting the generalization ability of deep learning-based models. This paper proposes a no-reference cross-modal PCQA framework to address these issues by leveraging cross-modal learning and contrastive constraints. Firstly, we render the original point cloud into corresponding multi-view projections and construct enhanced versions of the point cloud. Then, we utilize a modified pre-trained CLIP-transformer-based encoder to extract the point-wise features, and a convolutional network-based encoder to extract the projection image-wise features, fully exploiting the intrinsic characteristics of each modality. Furthermore, a contrastive loss function is adopted for cross-modal training, covering both the point cloud and projection image modalities, maximizing the consistency between multi-modal features to obtain robust feature representations. Finally, a specially designed parallel cross-attention mechanism enhances and integrates multi-modal features, obtaining the final predicted quality score. Experimental results show that our method outperforms state-of-the-art NR-PCQA benchmark methods. Code will be released on https://github.com/NovemberWind7/PCQA.
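The cross-modal contrastive constraint can be written as a symmetric InfoNCE loss over batch-matched (point-wise, image-wise) embedding pairs, as sketched below; the temperature of 0.07 and L2-normalized embeddings are common choices, not necessarily the paper's exact formulation.

import torch
import torch.nn.functional as F

def cross_modal_contrastive_loss(point_emb, image_emb, temperature=0.07):
    # Symmetric InfoNCE: matching (point cloud, projection) pairs are positives,
    # every other pair in the batch is a negative.
    p = F.normalize(point_emb, dim=-1)
    v = F.normalize(image_emb, dim=-1)
    logits = p @ v.t() / temperature              # (B, B) cosine-similarity logits
    targets = torch.arange(p.size(0), device=p.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = cross_modal_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
print(float(loss))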
Citations: 0
Simultaneous acquisition of geometry and material for translucent objects
IF 4.2 | CAS Zone 3, Computer Science | Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-10-24 | DOI: 10.1016/j.imavis.2025.105793
Chenhao Li, Trung Thanh Ngo, Hajime Nagahara
Reconstructing the geometry and material properties of translucent objects from images is a challenging problem due to the complex light propagation of translucent media and the inherent ambiguity of inverse rendering. Therefore, previous works often make the assumption that the objects are opaque or use a simplified model to describe translucent objects, which significantly affects the reconstruction quality and limits the downstream tasks such as relighting or material editing. We present a novel framework that tackles this challenge through a combination of physically grounded and data-driven strategies. At the core of our approach is a hybrid rendering supervision scheme that fuses a differentiable physical renderer with a learned neural renderer to guide reconstruction. To further enhance supervision, we introduce an augmented loss tailored to the neural renderer. Our system takes as input a flash/no-flash image pair, enabling it to disambiguate complex light propagation that happens inside translucent objects. We train our model on a large-scale synthetic dataset of 117 K scenes and evaluate across both synthetic benchmarks and real-world captures. To mitigate the domain gap between synthetic and real data, we contribute a new real-world dataset with ground-truth surface normals and fine-tune our model accordingly. Extensive experiments validate the robustness and accuracy of our method across diverse scenarios.
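Hybrid rendering supervision boils down to penalizing both the physical renderer's and the neural renderer's outputs against the captured target. The sketch below shows only that weighted combination with L1 terms; the weights, the loss form, and how the augmented neural-renderer loss enters are assumptions, not the authors' formulation.

import torch
import torch.nn.functional as F

def hybrid_render_loss(phys_render, neural_render, target, w_phys=1.0, w_neural=1.0):
    # Combine supervision from the differentiable physical renderer and the learned
    # neural renderer, both compared against the captured (flash/no-flash) target.
    return w_phys * F.l1_loss(phys_render, target) + w_neural * F.l1_loss(neural_render, target)

target = torch.rand(1, 3, 256, 256)
loss = hybrid_render_loss(torch.rand(1, 3, 256, 256), torch.rand(1, 3, 256, 256), target)
print(float(loss))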
Citations: 0