Pub Date: 2025-11-01 | DOI: 10.1016/j.imavis.2025.105806
Qing Tian , Zhiwen Liu , Weihua Ou
Black-box Domain Adaptation (BDA) is a source-free unsupervised domain adaptation method that requires access only to black-box source predictors. This method offers significant security advantages since it does not necessitate access to the source data or specific parameters of the source model. However, adaptation using only noisy source-predicted labels presents considerable challenges due to the limited information available. Existing research primarily focuses on minor improvements at the micro-level, without addressing the macro-level training strategies required for effective black-box domain adaptation. In this article, we propose a novel three-step BDA framework for image classification called PDLR, which emulates the learning strategies of real students, dividing the training process into three stages: Preview, Differentiated Learning, and Review. Initially, during the preview stage, we enable the model to acquire fundamental knowledge and stable features. Subsequently, in the differentiated learning stage, we categorize target samples into easy-adaptable, semi-adaptable, and hard-adaptable subdomains and employ graph contrastive learning to align these samples. Finally, in the review stage, we identify and conduct supplementary learning on classes that are prone to being forgotten. Our method achieves state-of-the-art performance across multiple benchmarks.
{"title":"Learning like a real student: Black-box domain adaptation with preview, differentiated learning and review","authors":"Qing Tian , Zhiwen Liu , Weihua Ou","doi":"10.1016/j.imavis.2025.105806","DOIUrl":"10.1016/j.imavis.2025.105806","url":null,"abstract":"<div><div>Black-box Domain Adaptation (BDA) is a source-free unsupervised domain adaptation method that requires access only to black-box source predictors. This method offers significant security advantages since it does not necessitate access to the source data or specific parameters of the source model. However, adaptation using only noisy source-predicted labels presents considerable challenges due to the limited information available. Existing research primarily focuses on minor improvements at the micro-level, without addressing the macro-level training strategies required for effective black-box domain adaptation. In this article, we propose a novel three-step BDA framework for image classification called PDLR, which emulates the learning strategies of real students, dividing the training process into three stages: <strong>P</strong>review, <strong>D</strong>ifferentiated <strong>L</strong>earning, and <strong>R</strong>eview. Initially, during the preview stage, we enable the model to acquire fundamental knowledge and stable features. Subsequently, in the differentiated learning stage, we categorize target samples into easy-adaptable, semi-adaptable, and hard-adaptable subdomains and employ graph contrastive learning to align these samples. Finally, in the review stage, we identify and conduct supplementary learning on classes that are prone to being forgotten. Our method achieves state-of-the-art performance across multiple benchmarks.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105806"},"PeriodicalIF":4.2,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145467698","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | DOI: 10.1016/j.imavis.2025.105799
Gea Viozzi , Fabio Persia , Daniela D’Auria
Humerus anomalies require rapid and accurate diagnosis to ensure immediate and effective treatment. In this context, the main goal of this paper is to develop and analyze well-known Convolutional Neural Network models for the automatic recognition of humeral fractures, with the aim of proposing a useful tool for healthcare personnel. Specifically, three distinct architectures were implemented and compared: a three-layer neural network trained from scratch, a network based on the ResNet18 architecture, and one based on the DenseNet121 model, the latter two pre-trained. The performance analysis highlighted a trade-off between accuracy and generalization ability: the pre-trained models achieved higher accuracy (in particular, DenseNet121 reached 85% accuracy across multiple runs) but proved more prone to overfitting than the non-pre-trained model. This study therefore advocates the integration of deep learning tools into medical practice, laying foundations for future developments and aiming to improve the efficiency and accuracy of orthopedic diagnoses.
{"title":"Automated recognition of humerus anomalies with convolutional neural networks","authors":"Gea Viozzi , Fabio Persia , Daniela D’Auria","doi":"10.1016/j.imavis.2025.105799","DOIUrl":"10.1016/j.imavis.2025.105799","url":null,"abstract":"<div><div>Humerus anomalies are a problem that requires rapid and accurate diagnosis to ensure immediate and efficient treatment. In this context, the main goal of this paper is to develop and analyze well-known Convolutional Neural Network models for the automatic recognition of humeral fractures, with the aim of proposing a useful tool for healthcare personnel. Specifically, three distinct architectures were implemented and compared: a three-layer untrained neural network, a network based on the ResNet18 architecture and one based on the DenseNet121 model, both of which were trained. The performance analysis highlighted a trade-off between accuracy and generalization ability, showing better accuracy in the pre-trained models - in particular, the DenseNet121 model achieved optimal accuracy across multiple runs of 85%. - which however proved more prone to suffer from overfitting compared to the non-pre-trained model. As a result, this study aims to propose the integration of deep learning tools in medical practice, laying important foundations for future developments, with the hope of improving the efficiency and accuracy of orthopedic diagnoses.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105799"},"PeriodicalIF":4.2,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145468941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-01 | DOI: 10.1016/j.imavis.2025.105805
Yu Chen , Liu Yang , Jun Long , TingBo Bao
Since the introduction of Denoising Diffusion Probabilistic Models in 2020, diffusion models have gradually emerged as a new research focus in generative modeling. However, in the task of diabetic retinopathy image classification, conventional convolutional neural network methods, although capable of achieving high accuracy, generally lack interpretability and thus fail to meet the transparency requirements of clinical diagnosis. To address this issue, a novel denoising diffusion framework named TCG-DiffDRC is proposed for diabetic retinopathy classification. An innovative triple-granularity conditional guidance strategy is introduced, in which three independent branches are fused. The global feature branch employs an improved ResNet-50 architecture with class activation mapping to generate global descriptors and capture macroscopic patterns. The local feature branch integrates multiple regions through a gated attention mechanism to identify local structures. The detail branch leverages an interpretable neural transformer with a multi-head attention mechanism to extract fine-grained lesion features. Furthermore, a dynamic guidance mechanism based on the correctness of an auxiliary classifier is incorporated during the diffusion reconstruction process, while segmentation masks are embedded as a regularization term in the loss function to enhance structural consistency in lesion regions. Experimental results demonstrate that TCG-DiffDRC consistently outperforms state-of-the-art methods across three public datasets: APTOS2019, Messidor, and IDRiD. On the APTOS2019 dataset, the proposed method achieves an accuracy of 86.7% and a Cohen’s Kappa of 75.8%, with improvements confirmed by statistical significance testing, thereby verifying the reliability of the model.
{"title":"Diffusion model-based imbalanced diabetic retinal image classification","authors":"Yu Chen , Liu Yang , Jun Long , TingBo Bao","doi":"10.1016/j.imavis.2025.105805","DOIUrl":"10.1016/j.imavis.2025.105805","url":null,"abstract":"<div><div>Since the release of Denoising Diffusion Probabilistic Models by Google in 2020, diffusion models have gradually emerged as a new research focus in generative modeling. However, in the task of diabetic retinopathy image classification, conventional convolutional neural network methods, although capable of achieving high accuracy, generally lack interpretability and thus fail to meet the transparency requirements of clinical diagnosis. To address this issue, a novel denoising diffusion framework named TCG-DiffDRC is proposed for diabetic retinopathy classification. An innovative triple-granularity conditional guidance strategy is introduced, in which three independent branches are fused. The global feature branch employs an improved ResNet-50 architecture with class activation mapping to generate global descriptors and capture macroscopic patterns. The local feature branch integrates multiple regions through a gated attention mechanism to identify local structures. The detail branch leverages an interpretable neural transformer with a multi-head attention mechanism to extract fine-grained lesion features. Furthermore, a dynamic guidance mechanism based on the correctness of an auxiliary classifier is incorporated during the diffusion reconstruction process, while segmentation masks are embedded as a regularization term in the loss function to enhance structural consistency in lesion regions. Experimental results demonstrate that TCG-DiffDRC consistently outperforms state-of-the-art methods across three public datasets, including APTOS2019, Messidor, and IDRiD. On the APTOS2019 dataset, the proposed method achieves an accuracy of 86.7% and a Cohen’s Kappa of 75.8%, with improvements confirmed by statistical significance testing, thereby verifying the reliability of the model.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105805"},"PeriodicalIF":4.2,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145521235","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-31 | DOI: 10.1016/j.imavis.2025.105802
Hongwei Yang , Wen Zeng , Ke Chen , Zhan Hua , Yan Zhuang , Lin Han , Guoliang Liao , Yiteng Zhang , Hanyu Li , Zhenlin Li , Jiangli Lin
Contrast-enhanced computed tomography (CECT) is crucial for assessing vascular anatomy and pathology. However, the use of iodine contrast medium poses risks, including anaphylactic shock and acute kidney injury. To address this, we propose SPM-CyViT, a self-supervised pre-trained, multi-branch, cycle-consistent vision transformer that synthesizes high-quality virtual CECT from non-contrast CT (NCCT). Its generator employs a parallel encoding approach, combining vision transformer blocks with convolutional downsampling layers. Their encoded outputs are fused through a tailored cross-attention module, producing feature maps with multi-scale complementary properties. Employing masked reconstruction, the ViT global encoder enables self-supervised pre-training on diverse unlabeled CT slices. This overcomes fixed-dataset limitations and significantly improves generalization. Additionally, the model features a multi-branch decoder-discriminator design tailored to specific labels. It incorporates 40 keV monoenergetic enhanced CT (MonoE) as an auxiliary label to optimize contrast-sensitive regions. Results from the dual-center internal test set demonstrate that SPM-CyViT outperforms existing CECT synthesis models across all quantitative metrics. Furthermore, based on real CECT as a benchmark, three radiologists awarded SPM-CyViT an average clinical evaluation score of 4.21/5.00 across multiple perspectives. Moreover, SPM-CyViT exhibits robust generalization on the external test set, achieving a mean CNR of 10.96 for synthesized CECT, nearing the 12.38 value of real CECT, collectively underscoring its clinical application potential.
{"title":"SPM-CyViT: A self-supervised pre-trained cycle-consistent vision transformer with multi-branch for contrast-enhanced CT synthesis","authors":"Hongwei Yang , Wen Zeng , Ke Chen , Zhan Hua , Yan Zhuang , Lin Han , Guoliang Liao , Yiteng Zhang , Hanyu Li , Zhenlin Li , Jiangli Lin","doi":"10.1016/j.imavis.2025.105802","DOIUrl":"10.1016/j.imavis.2025.105802","url":null,"abstract":"<div><div>Contrast-enhanced computed tomography (CECT) is crucial for assessing vascular anatomy and pathology. However, the use of iodine contrast medium poses risks, including anaphylactic shock and acute kidney injury. To address this, we propose SPM-CyViT, a self-supervised pre-trained, multi-branch, cycle-consistent vision transformer that synthesizes high-quality virtual CECT from non-contrast CT (NCCT). Its generator employs a parallel encoding approach, combining vision transformer blocks with convolutional downsampling layers. Their encoded outputs are fused through a tailored cross-attention module, producing feature maps with multi-scale complementary properties. Employing masked reconstruction, the ViT global encoder enables self-supervised pre-training on diverse unlabeled CT slices. This overcomes fixed-dataset limitations and significantly improves generalization. Additionally, the model features a multi-branch decoder-discriminator design tailored to specific labels. It incorporates 40 keV monoenergetic enhanced CT (MonoE) as an auxiliary label to optimize contrast-sensitive regions. Results from the dual-center internal test set demonstrate that SPM-CyViT outperforms existing CECT synthesis models across all quantitative metrics. Furthermore, based on real CECT as a benchmark, three radiologists awarded SPM-CyViT an average clinical evaluation score of 4.215.00 across multiple perspectives. Additionally, SPM-CyViT exhibits robust generalization on the external test set, achieving a mean CNR of 10.96 for synthesized CECT, nearing the 12.38 value of real CECT, collectively underscoring its clinical application potential.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105802"},"PeriodicalIF":4.2,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145467697","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-31 | DOI: 10.1016/j.imavis.2025.105803
Bo Yu , Hanting Wei , Chenghong Zhang , Wei Wang
In the realm of computer vision, high-quality image information serves as the foundation for downstream tasks. Nevertheless, elements such as foggy weather, suboptimal lighting circumstances, and atmospheric impurities frequently deteriorate image quality, posing a considerable research challenge in effectively restoring these low-quality images. Existing defogging approaches mainly rely on constraints and physical priors; however, they have demonstrated limited efficacy, especially when dealing with extensive fog-affected areas. To tackle this issue, a deep trainable defogging network named DBaP-net is proposed in this paper. By leveraging convolutional neural networks, this network integrates diverse filters to extract physical priors from images. Through the construction of a sophisticated deep network architecture, DBaP-net precisely estimates the transmission map and efficiently restores haze-free images. Additionally, we design a spatial transformation layer customized for physical prior features and adopt a multi-kernel fusion extraction technique to further enhance the model’s feature extraction capabilities and spatial adaptability, thereby laying a solid foundation for subsequent visual tasks. Experimental validation indicates that DBaP-net not only effectively eliminates haze from images but also significantly enhances their overall quality. In both quantitative and qualitative evaluations, DBaP-net surpasses other comparison algorithms in terms of efficiency and usability. As a result, this study offers a novel solution to the image defogging problem within computer vision frameworks, enabling the precise restoration of low-quality images and providing robust support for research endeavors and downstream applications in related fields.
{"title":"DBaP-net: Deep network for image defogging based on physical properties prior","authors":"Bo Yu , Hanting Wei , Chenghong Zhang , Wei Wang","doi":"10.1016/j.imavis.2025.105803","DOIUrl":"10.1016/j.imavis.2025.105803","url":null,"abstract":"<div><div>In the realm of computer vision, high-quality image information serves as the foundation for downstream tasks. Nevertheless, elements such as foggy weather, suboptimal lighting circumstances, and atmospheric impurities frequently deteriorate image quality, posing a considerable research challenge in effectively restoring these low-quality images. Existing defogging approaches mainly rely on constraints and physical priors; however, they have demonstrated limited efficacy, especially when dealing with extensive fog-affected areas. To tackle this issue, a deep trainable de-fog network named DBaP-net is proposed in this paper. By leveraging convolutional neural networks, this network integrates diverse filters to extract physical priors from images. Through the construction of a sophisticated deep network architecture, DBaP-net precisely estimates the transmission map and efficiently facilitates the restoration of haze-free images. Additionally, we design a spatial transformation layer customized for physical prior features and adopt a multi-kernel fusion extraction technique to further enhance the model’s feature extraction capabilities and spatial adaptability, thereby laying a solid foundation for subsequent visual tasks. Experimental validation indicates that DBaP-net not only effectively eliminates haze from images but also significantly enhances their overall quality. In both quantitative and qualitative evaluations, DBaP-net surpasses other comparison algorithms in terms of efficiency and usability. As a result, this study offers a novel solution to the image defogging problem within computer vision frameworks, enabling the precise restoration of low-quality images and providing robust support for research endeavors and downstream applications in related fields.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105803"},"PeriodicalIF":4.2,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145428873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-31 | DOI: 10.1016/j.imavis.2025.105800
Yongchao Qiao , Ya’nan Guan , Zhiyou Wang , Jingmin Yang , Wenyuan Yang
Compared with traditional semantic segmentation, Domain Generalization Semantic Segmentation (DGSS) focuses more on improving the generalization of models in unseen domains. Existing methods are mainly based on Transformers and convolutional neural networks, which have limited receptive fields and high complexity. Mamba, as a new state-space model, can solve these problems well. Nevertheless, issues with its hidden states and with learning domain-invariant semantic features make it difficult to apply Mamba directly to DGSS. In this paper, we propose a model fine-tuning method named DGFMamba, which introduces Hidden State Fine-tuning Tokens (HSFT) and a Feature-level Bidirectional Selective Scan Module (FBSSM) to improve the feature maps. HSFT, which consists of channel tokens and feature tokens, can perform local forgetting on feature maps. Feature-level embedding allows feature maps to be input to FBSSM with single pixels as vectors. FBSSM obtains contextual information from both forward and reverse directions, with reverse information serving as a complement to forward information. To further reduce the trainable parameters of the model, the parameters of FBSSM and MLP at each layer are shared. DGFMamba achieves promising results in experiments with different settings, which also demonstrates the effectiveness of applying state-space models to model fine-tuning. The average mIoU under the GTAV→Cityscapes+BDD100K+Mapillary setting is 64.4%. The average mIoU under the GTAV+Synthia→Cityscapes+BDD100K+Mapillary setting is 65.8%. It is worth noting that DGFMamba only adds an additional 0.5% of trainable parameters. The code is available at https://github.com/xiaoxia0722/DGFMamba.
{"title":"DGFMamba: Model fine-tuning based on bidirectional state space for domain generalization semantic segmentation","authors":"Yongchao Qiao , Ya’nan Guan , Zhiyou Wang , Jingmin Yang , Wenyuan Yang","doi":"10.1016/j.imavis.2025.105800","DOIUrl":"10.1016/j.imavis.2025.105800","url":null,"abstract":"<div><div>Compared with traditional semantic segmentation, Domain Generalization Semantic Segmentation (DGSS) focuses more on improving the generalization of models in unseen domains. Existing methods are mainly based on Transformers and convolutional neural networks, which have limited receptive fields and high complexity. Mamba, as a new state-space model, can solve these problems well. Nevertheless, the problems of hidden states and learning domain-invariant semantic features make it difficult to apply to DGSS. In this paper, we propose a model fine-tuning method named DGFMamba, which introduces Hidden State Fine-tuning Tokens (HSFT) and Feature-level Bidirectional Selective Scan Module (FBSSM) to improve the feature maps. HSFT, which consists of channel tokens and feature tokens, can perform local forgetting on feature maps. Feature-level embedding allows feature maps to be input to FBSSM with single pixels as vectors. FBSSM obtains contextual information from both forward and reverse directions, with reverse information serving as a complement to forward information. To further reduce the trainable parameters of the model, the parameters of FBSSM and MLP at each layer are shared. DGFMamba achieves promising results in experiments with different settings. This also demonstrates the effectiveness of applying state-space models to model fine-tuning. The average mIoU under the GTAV<span><math><mo>→</mo></math></span>scapes+BDD100K+Mapillary setting is 64.4%. The average mIoU under the GTAV+Synthia<span><math><mo>→</mo></math></span>Cityscapes+BDD100K+Mapillary setting is 65.8%. It is worth noting that DGFMamba only adds an additional 0.5% of trainable parameters. The code is available at <span><span>https://github.com/xiaoxia0722/DGFMamba</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105800"},"PeriodicalIF":4.2,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145579401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-30 | DOI: 10.1016/j.imavis.2025.105784
Dongli Wang , Yongcan Weng , Xiaolin Zhu , Yan Zhou , Zixin Zhang , Richard Irampaye
Group Activity Recognition (GAR) is a pivotal task in video understanding, with broad applications ranging from surveillance to human–computer interaction. Traditional RGB-based methods face challenges such as privacy concerns, environmental sensitivity, and fragmented scene-level semantic understanding. Skeleton-based approaches offer a promising alternative but often suffer from limited exploration of heterogeneous features and the absence of explicit modeling for human-object interactions. In this paper, we introduce a lightweight framework for skeleton-based GAR, leveraging an attention-enhanced spatio-temporal graph convolutional network. Specifically, we first decouple joint and bone features along with their motion patterns, constructing a global human-object relational graph using an attention graph convolution module (AGCM). Additionally, we incorporate a Multi-Scale Temporal Convolution Module (MTC) and a Cross-Dimensional Attention Module (CDAM) to dynamically focus on key spatio-temporal nodes and feature channels. Our method achieves significant improvements in accuracy while maintaining high computational efficiency, making it suitable for real-time applications in privacy-sensitive scenarios. Experiments on the Volleyball and NBA datasets demonstrate that our method achieves competitive performance using only skeleton input, significantly reducing parameters and computational cost compared to mainstream approaches. It improves Mean Per-Class Accuracy (MPCA) to 96.1% on the Volleyball dataset and 71.6% on the NBA dataset, offering a lightweight and efficient solution for GAR in privacy-sensitive scenarios.
{"title":"Enhanced skeleton-based Group Activity Recognition through spatio-temporal graph convolution with cross-dimensional attention","authors":"Dongli Wang , Yongcan Weng , Xiaolin Zhu , Yan Zhou , Zixin Zhang , Richard Irampaye","doi":"10.1016/j.imavis.2025.105784","DOIUrl":"10.1016/j.imavis.2025.105784","url":null,"abstract":"<div><div>Group Activity Recognition is a pivotal task in video understanding, with broad applications ranging from surveillance to human–computer interaction. Traditional RGB-based methods face challenges such as privacy concerns, environmental sensitivity, and fragmented scene-level semantic understanding. Skeleton-based approaches offer a promising alternative but often suffer from limited exploration of heterogeneous features and the absence of explicit modeling for human-object interactions. In this paper, we introduce a lightweight framework for skeleton-based GAR, leveraging an attention-enhanced spatio-temporal graph convolutional network. Specially, we first decouple joint and bone features along with their motion patterns, constructing a global human-object relational graph using an attention graph convolution module (AGCM). Additionally, we incorporate a Multi-Scale Temporal Convolution Module (MTC) and a Cross-Dimensional Attention Module (CDAM) to dynamically focus on key spatio-temporal nodes and feature channels. Our method achieves significant improvements in accuracy while maintaining high computational efficiency, making it suitable for real-time applications in privacy-sensitive scenarios. Experiments on the Volleyball and NBA datasets demonstrate that our method achieves competitive performance using only skeleton input, significantly reducing parameters and computational cost compared to mainstream approaches. Here, our method show an improvement in Multi-Class Per-Class Accuracy (MPCA) to 96.1% on the Volleyball dataset and 71.6% on the NBA dataset, offering a lightweight and efficient solution for GAR in privacy-sensitive scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105784"},"PeriodicalIF":4.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145521219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-30 | DOI: 10.1016/j.imavis.2025.105795
Alessia Auriemma Citarella , Pietro Battistoni , Chiara Coscarelli , Fabiola De Marco , Luigi Di Biasi , Mengyuan Wang
Accurate embryo selection is a key factor in improving implantation success rates in Assisted Reproductive Technologies. This study presents a deep learning framework, EmbryoVision AI, designed to enhance blastocyst assessment using Time-Lapse Imaging and eXplainable AI techniques. A customized convolutional neural network was developed to capture both morphological and temporal dynamics, enabling a precise classification of the embryo. To ensure transparency, Gradient-weighted Class Activation Mapping was integrated, allowing visualization of decision-critical embryonic structures and ensuring clinical alignment. The model demonstrated strong predictive performance across different embryo grades, achieving an accuracy of 91.5% for Grade AA, 88.4% for Grade AB, and 79.3% for Grade BC. The AUC-ROC values were 0.95, 0.90, and 0.81 for Grade AA, AB, and BC, respectively, indicating strong discriminatory capabilities. The findings suggest that AI-driven embryo selection can enhance objectivity, reduce human variability, and improve ART outcomes. However, the results also underscore the need to refine AI models to better handle morphological variability in lower-quality embryos, highlighting the importance of improving generalization and strengthening clinical integration.
{"title":"EmbryoVision AI: An explainable deep learning framework for enhanced blastocyst selection in assisted reproductive technologies","authors":"Alessia Auriemma Citarella , Pietro Battistoni , Chiara Coscarelli , Fabiola De Marco , Luigi Di Biasi , Mengyuan Wang","doi":"10.1016/j.imavis.2025.105795","DOIUrl":"10.1016/j.imavis.2025.105795","url":null,"abstract":"<div><div>Accurate embryo selection is a key factor in improving implantation success rates in Assisted Reproductive Technologies. This study presents a deep learning framework, <em>EmbryoVision AI</em>, designed to enhance blastocyst assessment using Time-Lapse Imaging and eXplainable AI techniques. A customized convolutional neural network was developed to capture both morphological and temporal dynamics, enabling a precise classification of the embryo. To ensure transparency, Gradient-weighted Class Activation Mapping was integrated, allowing visualization of decision-critical embryonic structures and ensuring clinical alignment. The model demonstrated strong predictive performance across different embryo grades, achieving an accuracy of 91.5% for Grade AA, 88.4% for Grade AB, and 79.3% for Grade BC. The AUC-ROC values were 0.95, 0.90, and 0.81 for Grade AA, AB, and BC, respectively, indicating strong discriminatory capabilities. The findings suggest that AI-driven embryo selection can enhance objectivity, reduce human variability, and improve ART outcomes. However, the results also underscore the need to refine AI models to better handle morphological variability in lower-quality embryos, highlighting the importance of improving generalization and strengthening clinical integration.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"165 ","pages":"Article 105795"},"PeriodicalIF":4.2,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145468942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-28 | DOI: 10.1016/j.imavis.2025.105788
Ruiyu Ming, Haibing Yin, Xiaofeng Huang, Weifeng Dong, Hang Lu, Hongkui Wang
Point Cloud Quality Assessment (PCQA) has become an important research area due to the rapid development and widespread application of 3D vision. Point clouds have diverse representation forms, including a point-wise modality and a projection image-wise modality. However, most existing methods neither adequately model cross-modality interactions nor fully exploit the characteristics of each modality, resulting in unsatisfactory PCQA accuracy. In addition, PCQA datasets are scarce, further limiting the generalization ability of deep learning-based models. This paper proposes a no-reference cross-modal PCQA framework to address these issues by leveraging cross-modal learning and contrastive constraints. Firstly, we render the original point cloud into corresponding multi-view projections and construct enhanced versions of the point cloud. Then, we utilize a modified pre-trained CLIP-transformer-based encoder to extract the point-wise features, and a convolutional network-based encoder to extract the projection image-wise features, fully exploiting the intrinsic characteristics of each modality. Furthermore, a contrastive loss function is adopted for cross-modal training, covering both the point cloud and projection image modalities, maximizing the consistency between multi-modal features to obtain robust feature representations. Finally, a specially designed parallel cross-attention mechanism enhances and integrates multi-modal features, obtaining the final predicted quality score. Experimental results show that our method outperforms state-of-the-art NR-PCQA methods on benchmark datasets. Code will be released on https://github.com/NovemberWind7/PCQA.
{"title":"No reference Point Cloud Quality Assessment via cross-modal learning and contrastive enhancement","authors":"Ruiyu Ming, Haibing Yin, Xiaofeng Huang, Weifeng Dong, Hang Lu, Hongkui Wang","doi":"10.1016/j.imavis.2025.105788","DOIUrl":"10.1016/j.imavis.2025.105788","url":null,"abstract":"<div><div>Point Cloud Quality Assessment (PCQA) has become an important research area due to the rapid development and widespread application of 3D vision. Point clouds have diverse representation forms, including point-wise modality and projection image-wise modality. However, most existing methods inadequately account for the cross-modality interactions with elaborate modality characteristics depiction, resulting in unsatisfactory PCQA model accuracy. In addition, PCQA datasets are scarce, further limiting the generalization ability of deep learning-based models. This paper proposes a no-reference cross-modal PCQA framework to address these issues by leveraging cross-modal learning and contrastive constraints. Firstly, we render the original point cloud into corresponding multi-view projections and construct enhanced versions of the point cloud. Then, we utilize a modified pre-trained CLIP-transformer-based encoder to extract the point-wise features, and a convolutional network-based encoder to extract the projection image-wise features, fully maximizing the intrinsic modality characteristics. Furthermore, a contrastive loss function is adopted for cross-modal training, covering both the point cloud and projection image modalities, maximizing the consistency between multi-modal features to obtain robust feature representations. Finally, a specially designed parallel cross-attention mechanism enhances and integrates multi-modal features, obtaining the final predicted quality score. Experimental results show that our method outperforms the state-of-the-art benchmark NR-PCQA method. Code will be released on <span><span>https://github.com/NovemberWind7/PCQA</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105788"},"PeriodicalIF":4.2,"publicationDate":"2025-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145467696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-24 | DOI: 10.1016/j.imavis.2025.105793
Chenhao Li , Trung Thanh Ngo , Hajime Nagahara
Reconstructing the geometry and material properties of translucent objects from images is a challenging problem due to the complex light propagation of translucent media and the inherent ambiguity of inverse rendering. Therefore, previous works often assume that objects are opaque or use a simplified model to describe translucent objects, which significantly affects the reconstruction quality and limits downstream tasks such as relighting or material editing. We present a novel framework that tackles this challenge through a combination of physically grounded and data-driven strategies. At the core of our approach is a hybrid rendering supervision scheme that fuses a differentiable physical renderer with a learned neural renderer to guide reconstruction. To further enhance supervision, we introduce an augmented loss tailored to the neural renderer. Our system takes as input a flash/no-flash image pair, enabling it to disambiguate the complex light propagation that happens inside translucent objects. We train our model on a large-scale synthetic dataset of 117K scenes and evaluate across both synthetic benchmarks and real-world captures. To mitigate the domain gap between synthetic and real data, we contribute a new real-world dataset with ground-truth surface normals and fine-tune our model accordingly. Extensive experiments validate the robustness and accuracy of our method across diverse scenarios.
{"title":"Simultaneous acquisition of geometry and material for translucent objects","authors":"Chenhao Li , Trung Thanh Ngo , Hajime Nagahara","doi":"10.1016/j.imavis.2025.105793","DOIUrl":"10.1016/j.imavis.2025.105793","url":null,"abstract":"<div><div>Reconstructing the geometry and material properties of translucent objects from images is a challenging problem due to the complex light propagation of translucent media and the inherent ambiguity of inverse rendering. Therefore, previous works often make the assumption that the objects are opaque or use a simplified model to describe translucent objects, which significantly affects the reconstruction quality and limits the downstream tasks such as relighting or material editing. We present a novel framework that tackles this challenge through a combination of physically grounded and data-driven strategies. At the core of our approach is a hybrid rendering supervision scheme that fuses a differentiable physical renderer with a learned neural renderer to guide reconstruction. To further enhance supervision, we introduce an augmented loss tailored to the neural renderer. Our system takes as input a flash/no-flash image pair, enabling it to disambiguate complex light propagation that happens inside translucent objects. We train our model on a large-scale synthetic dataset of 117 K scenes and evaluate across both synthetic benchmarks and real-world captures. To mitigate the domain gap between synthetic and real data, we contribute a new real-world dataset with ground-truth surface normals and fine-tune our model accordingly. Extensive experiments validate the robustness and accuracy of our method across diverse scenarios.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"164 ","pages":"Article 105793"},"PeriodicalIF":4.2,"publicationDate":"2025-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145366055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}