International Journal of Computer Vision最新文献_第4页

Towards Boosting Out-of-Distribution Detection from a Spatial Feature Importance Perspective

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2025-02-05 DOI: 10.1007/s11263-025-02347-0

Yao Zhu, Xiu Yan, Chuanlong Xie

In ensuring the reliable and secure operation of models, Out-of-Distribution (OOD) detection has gained widespread attention in recent years. Researchers have proposed various promising detection criteria to construct the rejection region of the model, treating samples falling into this region as out-of-distribution. However, these detection criteria are computed using all dense features of the model (before the pooling layer), overlooking the fact that different features may exhibit varying importance in the decision-making process. To this end, we first propose quantifying the contribution of different spatial positions in the dense features to the decision result with the Shapley Value, thereby obtaining the feature importance map. This spatial-oriented feature attribution method, compared to the classical channel-oriented feature attribution method that linearly combines weights with each activation map, achieves superior visual performance and fidelity in interpreting the decision-making process. Subsequently, we introduce a spatial feature purification method that removes spatial features with low importance in dense features and advocates using these purified features to compute detection criteria. Extensive experiments demonstrate the effectiveness of spatial feature purification in enhancing the performance of various existing detection methods. Notably, spatial feature purification boosts Energy Score and NNGuide on the ImageNet benchmark by 18.39(%) and 26.45(%) in average FPR95, respectively.

{"title":"Towards Boosting Out-of-Distribution Detection from a Spatial Feature Importance Perspective","authors":"Yao Zhu, Xiu Yan, Chuanlong Xie","doi":"10.1007/s11263-025-02347-0","DOIUrl":"https://doi.org/10.1007/s11263-025-02347-0","url":null,"abstract":"In ensuring the reliable and secure operation of models, Out-of-Distribution (OOD) detection has gained widespread attention in recent years. Researchers have proposed various promising detection criteria to construct the rejection region of the model, treating samples falling into this region as out-of-distribution. However, these detection criteria are computed using all dense features of the model (before the pooling layer), overlooking the fact that different features may exhibit varying importance in the decision-making process. To this end, we first propose quantifying the contribution of different spatial positions in the dense features to the decision result with the Shapley Value, thereby obtaining the feature importance map. This spatial-oriented feature attribution method, compared to the classical channel-oriented feature attribution method that linearly combines weights with each activation map, achieves superior visual performance and fidelity in interpreting the decision-making process. Subsequently, we introduce a spatial feature purification method that removes spatial features with low importance in dense features and advocates using these purified features to compute detection criteria. Extensive experiments demonstrate the effectiveness of spatial feature purification in enhancing the performance of various existing detection methods. Notably, spatial feature purification boosts Energy Score and NNGuide on the ImageNet benchmark by 18.39(%) and 26.45(%) in average FPR95, respectively.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"9 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143125021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation 用于语音保全面部表情操纵的对比解耦表征学习和正则化

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2025-02-04 DOI: 10.1007/s11263-025-02358-x

Tianshui Chen, Jianman Lin, Zhijing Yang, Chumei Qing, Yukai Shi, Liang Lin

Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion while preserving the mouth animation of source spoken contents. Thus, emotion and content information existing in reference and source inputs can provide direct and accurate supervision signals for SPFEM models. However, the intrinsic intertwining of these elements during the talking process poses challenges to their effectiveness as supervisory signals. In this work, we propose to learn content and emotion priors as guidance augmented with contrastive learning to learn decoupled content and emotion representation via an innovative Contrastive Decoupled Representation Learning (CDRL) algorithm. Specifically, a Contrastive Content Representation Learning (CCRL) module is designed to learn audio feature, which primarily contains content information, as content priors to guide learning content representation from the source input. Meanwhile, a Contrastive Emotion Representation Learning (CERL) module is proposed to make use of a pre-trained visual-language model to learn emotion prior, which is then used to guide learning emotion representation from the reference input. We further introduce emotion-aware and emotion-augmented contrastive learning to train CCRL and CERL modules, respectively, ensuring learning emotion-independent content representation and content-independent emotion representation. During SPFEM model training, the decoupled content and emotion representations are used to supervise the generation process, ensuring more accurate emotion manipulation together with audio-lip synchronization. Extensive experiments and evaluations on various benchmarks show the effectiveness of the proposed algorithm.

{"title":"Contrastive Decoupled Representation Learning and Regularization for Speech-Preserving Facial Expression Manipulation","authors":"Tianshui Chen, Jianman Lin, Zhijing Yang, Chumei Qing, Yukai Shi, Liang Lin","doi":"10.1007/s11263-025-02358-x","DOIUrl":"https://doi.org/10.1007/s11263-025-02358-x","url":null,"abstract":"Speech-preserving facial expression manipulation (SPFEM) aims to modify a talking head to display a specific reference emotion while preserving the mouth animation of source spoken contents. Thus, emotion and content information existing in reference and source inputs can provide direct and accurate supervision signals for SPFEM models. However, the intrinsic intertwining of these elements during the talking process poses challenges to their effectiveness as supervisory signals. In this work, we propose to learn content and emotion priors as guidance augmented with contrastive learning to learn decoupled content and emotion representation via an innovative Contrastive Decoupled Representation Learning (CDRL) algorithm. Specifically, a Contrastive Content Representation Learning (CCRL) module is designed to learn audio feature, which primarily contains content information, as content priors to guide learning content representation from the source input. Meanwhile, a Contrastive Emotion Representation Learning (CERL) module is proposed to make use of a pre-trained visual-language model to learn emotion prior, which is then used to guide learning emotion representation from the reference input. We further introduce emotion-aware and emotion-augmented contrastive learning to train CCRL and CERL modules, respectively, ensuring learning emotion-independent content representation and content-independent emotion representation. During SPFEM model training, the decoupled content and emotion representations are used to supervise the generation process, ensuring more accurate emotion manipulation together with audio-lip synchronization. Extensive experiments and evaluations on various benchmarks show the effectiveness of the proposed algorithm.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"39 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143083462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

DiffuVolume: Diffusion Model for Volume based Stereo Matching

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2025-02-01 DOI: 10.1007/s11263-025-02362-1

Dian Zheng, Xiao-Ming Wu, Zuhao Liu, Jingke Meng, Wei-Shi Zheng

Stereo matching is a significant part in many computer vision tasks and driving-based applications. Recently cost volume-based methods have achieved great success benefiting from the rich geometry information in paired images. However, the redundancy of cost volume also interferes with the model training and limits the performance. To construct a more precise cost volume, we pioneeringly apply the diffusion model to stereo matching. Our method, termed DiffuVolume, considers the diffusion model as a cost volume filter, which will recurrently remove the redundant information from the cost volume. Two main designs make our method not trivial. Firstly, to make the diffusion model more adaptive to stereo matching, we eschew the traditional manner of directly adding noise into the image but embed the diffusion model into a task-specific module. In this way, we outperform the traditional diffusion stereo matching method by 27(%) EPE improvement and 7 times parameters reduction. Secondly, DiffuVolume can be easily embedded into any volume-based stereo matching network, boosting performance with only a slight increase in parameters (approximately 2(%)). By adding the DiffuVolume into well-performed methods, we outperform all the published methods on Scene Flow, KITTI2012, KITTI2015 benchmarks and zero-shot generalization setting. It is worth mentioning that the proposed model ranks 1st on KITTI 2012 leader board, 2nd on KITTI 2015 leader board since 15, July 2023.

{"title":"DiffuVolume: Diffusion Model for Volume based Stereo Matching","authors":"Dian Zheng, Xiao-Ming Wu, Zuhao Liu, Jingke Meng, Wei-Shi Zheng","doi":"10.1007/s11263-025-02362-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02362-1","url":null,"abstract":"Stereo matching is a significant part in many computer vision tasks and driving-based applications. Recently cost volume-based methods have achieved great success benefiting from the rich geometry information in paired images. However, the redundancy of cost volume also interferes with the model training and limits the performance. To construct a more precise cost volume, we pioneeringly apply the diffusion model to stereo matching. Our method, termed DiffuVolume, considers the diffusion model as a cost volume filter, which will recurrently remove the redundant information from the cost volume. Two main designs make our method not trivial. Firstly, to make the diffusion model more adaptive to stereo matching, we eschew the traditional manner of directly adding noise into the image but embed the diffusion model into a task-specific module. In this way, we outperform the traditional diffusion stereo matching method by 27(%) EPE improvement and 7 times parameters reduction. Secondly, DiffuVolume can be easily embedded into any volume-based stereo matching network, boosting performance with only a slight increase in parameters (approximately 2(%)). By adding the DiffuVolume into well-performed methods, we outperform all the published methods on Scene Flow, KITTI2012, KITTI2015 benchmarks and zero-shot generalization setting. It is worth mentioning that the proposed model ranks 1st on KITTI 2012 leader board, 2nd on KITTI 2015 leader board since 15, July 2023.\u0000","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"17 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143072451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2025-01-31 DOI: 10.1007/s11263-025-02352-3

Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang

Virtual try-on focuses on adjusting the given clothes to fit a specific person seamlessly while avoiding any distortion of the patterns and textures of the garment. However, the clothing identity uncontrollability and training inefficiency of existing diffusion-based methods, which struggle to maintain the identity even with full parameter training, are significant limitations that hinder the widespread applications. In this work, we propose an effective and efficient framework, termed TryOn-Adapter. Specifically, we first decouple clothing identity into fine-grained factors: style for color and category information, texture for high-frequency details, and structure for smooth spatial adaptive transformation. Our approach utilizes a pre-trained exemplar-based diffusion model as the fundamental network, whose parameters are frozen except for the attention layers. We then customize three lightweight modules (Style Preserving, Texture Highlighting, and Structure Adapting) incorporated with fine-tuning techniques to enable precise and efficient identity control. Meanwhile, we introduce the training-free T-RePaint strategy to further enhance clothing identity preservation while maintaining the realistic try-on effect during the inference. Our experiments demonstrate that our approach achieves state-of-the-art performance on two widely-used benchmarks. Additionally, compared with recent full-tuning diffusion-based methods, we only use about half of their tunable parameters during training. The code will be made publicly available at https://github.com/jiazheng-xing/TryOn-Adapter.

{"title":"TryOn-Adapter: Efficient Fine-Grained Clothing Identity Adaptation for High-Fidelity Virtual Try-On","authors":"Jiazheng Xing, Chao Xu, Yijie Qian, Yang Liu, Guang Dai, Baigui Sun, Yong Liu, Jingdong Wang","doi":"10.1007/s11263-025-02352-3","DOIUrl":"https://doi.org/10.1007/s11263-025-02352-3","url":null,"abstract":"Virtual try-on focuses on adjusting the given clothes to fit a specific person seamlessly while avoiding any distortion of the patterns and textures of the garment. However, the clothing identity uncontrollability and training inefficiency of existing diffusion-based methods, which struggle to maintain the identity even with full parameter training, are significant limitations that hinder the widespread applications. In this work, we propose an effective and efficient framework, termed TryOn-Adapter. Specifically, we first decouple clothing identity into fine-grained factors: style for color and category information, texture for high-frequency details, and structure for smooth spatial adaptive transformation. Our approach utilizes a pre-trained exemplar-based diffusion model as the fundamental network, whose parameters are frozen except for the attention layers. We then customize three lightweight modules (Style Preserving, Texture Highlighting, and Structure Adapting) incorporated with fine-tuning techniques to enable precise and efficient identity control. Meanwhile, we introduce the training-free T-RePaint strategy to further enhance clothing identity preservation while maintaining the realistic try-on effect during the inference. Our experiments demonstrate that our approach achieves state-of-the-art performance on two widely-used benchmarks. Additionally, compared with recent full-tuning diffusion-based methods, we only use about half of their tunable parameters during training. The code will be made publicly available at https://github.com/jiazheng-xing/TryOn-Adapter.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"51 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143072450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Self-supervised Shutter Unrolling with Events

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2025-01-29 DOI: 10.1007/s11263-025-02364-z

Mingyuan Lin, Yangguang Wang, Xiang Zhang, Boxin Shi, Wen Yang, Chu He, Gui-song Xia, Lei Yu

Continuous-time Global Shutter Video Recovery (CGVR) faces a substantial challenge in recovering undistorted high frame-rate Global Shutter (GS) videos from distorted Rolling Shutter (RS) images. This problem is severely ill-posed due to the absence of temporal dynamic information within RS intra-frame scanlines and inter-frame exposures, particularly when prior knowledge about camera/object motions is unavailable. Commonly used artificial assumptions on scenes/motions and data-specific characteristics are prone to producing sub-optimal solutions in real-world scenarios. To address this challenge, we propose an event-based CGVR network within a self-supervised learning paradigm, i.e., SelfUnroll, and leverage the extremely high temporal resolution of event cameras to provide accurate inter/intra-frame dynamic information. Specifically, an Event-based Inter/intra-frame Compensator (E-IC) is proposed to predict the per-pixel dynamic between arbitrary time intervals, including the temporal transition and spatial translation. Exploring connections in terms of RS-RS, RS-GS, and GS-RS, we explicitly formulate mutual constraints with the proposed E-IC, resulting in supervisions without ground-truth GS images. Extensive evaluations over synthetic and real datasets demonstrate that the proposed method achieves state-of-the-art methods and shows remarkable performance for event-based RS2GS inversion in real-world scenarios. The dataset and code are available at https://w3un.github.io/selfunroll/.

{"title":"Self-supervised Shutter Unrolling with Events","authors":"Mingyuan Lin, Yangguang Wang, Xiang Zhang, Boxin Shi, Wen Yang, Chu He, Gui-song Xia, Lei Yu","doi":"10.1007/s11263-025-02364-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02364-z","url":null,"abstract":"Continuous-time Global Shutter Video Recovery (CGVR) faces a substantial challenge in recovering undistorted high frame-rate Global Shutter (GS) videos from distorted Rolling Shutter (RS) images. This problem is severely ill-posed due to the absence of temporal dynamic information within RS intra-frame scanlines and inter-frame exposures, particularly when prior knowledge about camera/object motions is unavailable. Commonly used artificial assumptions on scenes/motions and data-specific characteristics are prone to producing sub-optimal solutions in real-world scenarios. To address this challenge, we propose an event-based CGVR network within a self-supervised learning paradigm, i.e., SelfUnroll, and leverage the extremely high temporal resolution of event cameras to provide accurate inter/intra-frame dynamic information. Specifically, an Event-based Inter/intra-frame Compensator (E-IC) is proposed to predict the per-pixel dynamic between arbitrary time intervals, including the temporal transition and spatial translation. Exploring connections in terms of RS-RS, RS-GS, and GS-RS, we explicitly formulate mutual constraints with the proposed E-IC, resulting in supervisions without ground-truth GS images. Extensive evaluations over synthetic and real datasets demonstrate that the proposed method achieves state-of-the-art methods and shows remarkable performance for event-based RS2GS inversion in real-world scenarios. The dataset and code are available at https://w3un.github.io/selfunroll/.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"36 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143055414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Learning with Enriched Inductive Biases for Vision-Language Models

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2025-01-28 DOI: 10.1007/s11263-025-02354-1

Lingxiao Yang, Ru-Yuan Zhang, Qi Chen, Xiaohua Xie

Vision-Language Models, pre-trained on large-scale image-text pairs, serve as strong foundation models for transfer learning across a variety of downstream tasks. For few-shot generalization tasks, i.e., when the model is trained on few-shot samples and then tested on unseen categories or datasets, there is a balance to be struck between generalization and discrimination when tweaking these models. Existing approaches typically rely on one or two strategies during training to learn task-specific knowledge, while preserving as much task-agnostic representation as possible. However, these methods overlook the importance of other useful inductive biases, thereby limiting their generalization capabilities. In this work, we propose a method – Learning with Enriched Inductive Biases (LwEIB) – to explore multiple inductive biases at the text, model, and optimization levels. Specifically, we first propose to enrich the handcrafted text prompt with Large Language Model generated descriptions for each category. To better capture structural cues in both linguistics and vision, we design two new adapters for text and image encoders, respectively. Additionally, we propose a slow-fast optimization method to explore different degrees of adaptation more efficiently, learning task-specific representations while maintaining task-agnostic ones. We empirically validate the effectiveness of LwEIB on three widely used benchmarks. Remarkably, our LwEIB outperforms numerous state-of-the-art methods across all evaluation metrics, demonstrating its efficacy and versatility. Our code is available at https://github.com/ZjjConan/VLM-LwEIB.

{"title":"Learning with Enriched Inductive Biases for Vision-Language Models","authors":"Lingxiao Yang, Ru-Yuan Zhang, Qi Chen, Xiaohua Xie","doi":"10.1007/s11263-025-02354-1","DOIUrl":"https://doi.org/10.1007/s11263-025-02354-1","url":null,"abstract":"Vision-Language Models, pre-trained on large-scale image-text pairs, serve as strong foundation models for transfer learning across a variety of downstream tasks. For few-shot generalization tasks, i.e., when the model is trained on few-shot samples and then tested on unseen categories or datasets, there is a balance to be struck between generalization and discrimination when tweaking these models. Existing approaches typically rely on one or two strategies during training to learn task-specific knowledge, while preserving as much task-agnostic representation as possible. However, these methods overlook the importance of other useful inductive biases, thereby limiting their generalization capabilities. In this work, we propose a method – Learning with Enriched Inductive Biases (LwEIB) – to explore multiple inductive biases at the text, model, and optimization levels. Specifically, we first propose to enrich the handcrafted text prompt with Large Language Model generated descriptions for each category. To better capture structural cues in both linguistics and vision, we design two new adapters for text and image encoders, respectively. Additionally, we propose a slow-fast optimization method to explore different degrees of adaptation more efficiently, learning task-specific representations while maintaining task-agnostic ones. We empirically validate the effectiveness of LwEIB on three widely used benchmarks. Remarkably, our LwEIB outperforms numerous state-of-the-art methods across all evaluation metrics, demonstrating its efficacy and versatility. Our code is available at https://github.com/ZjjConan/VLM-LwEIB.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"59 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143049683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Sample-Cohesive Pose-Aware Contrastive Facial Representation Learning

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2025-01-28 DOI: 10.1007/s11263-025-02348-z

Yuanyuan Liu, Shaoze Feng, Shuyang Liu, Yibing Zhan, Dapeng Tao, Zijing Chen, Zhe Chen

Self-supervised facial representation learning (SFRL) methods, especially contrastive learning (CL) methods, have been increasingly popular due to their ability to perform face understanding without heavily relying on large-scale well-annotated datasets. However, analytically, current CL-based SFRL methods still perform unsatisfactorily in learning facial representations due to their tendency to learn pose-insensitive features, resulting in the loss of some useful pose details. This could be due to the inappropriate positive/negative pair selection within CL. To conquer this challenge, we propose a Pose-disentangled Contrastive Facial Representation Learning (PCFRL) framework to enhance pose awareness for SFRL. We achieve this by explicitly disentangling the pose-aware features from non-pose face-aware features and introducing appropriate sample calibration schemes for better CL with the disentangled features. In PCFRL, we first devise a pose-disentangled decoder with a delicately designed orthogonalizing regulation to perform the disentanglement; therefore, the learning on the pose-aware and non-pose face-aware features would not affect each other. Then, we introduce a false-negative pair calibration module to overcome the issue that the two types of disentangled features may not share the same negative pairs for CL. Our calibration employs a novel neighborhood-cohesive pair alignment method to identify pose and face false-negative pairs, respectively, and further help calibrate them to appropriate positive pairs. Lastly, we devise two calibrated CL losses, namely calibrated pose-aware and face-aware CL losses, for adaptively learning the calibrated pairs more effectively, ultimately enhancing the learning with the disentangled features and providing robust facial representations for various downstream tasks. In the experiments, we perform linear evaluations on four challenging downstream facial tasks with SFRL using our method, including facial expression recognition, face recognition, facial action unit detection, and head pose estimation. Experimental results show that PCFRL outperforms existing state-of-the-art methods by a substantial margin, demonstrating the importance of improving pose awareness for SFRL. Our evaluation code and model will be available at https://github.com/fulaoze/CV/tree/main.

{"title":"Sample-Cohesive Pose-Aware Contrastive Facial Representation Learning","authors":"Yuanyuan Liu, Shaoze Feng, Shuyang Liu, Yibing Zhan, Dapeng Tao, Zijing Chen, Zhe Chen","doi":"10.1007/s11263-025-02348-z","DOIUrl":"https://doi.org/10.1007/s11263-025-02348-z","url":null,"abstract":"Self-supervised facial representation learning (SFRL) methods, especially contrastive learning (CL) methods, have been increasingly popular due to their ability to perform face understanding without heavily relying on large-scale well-annotated datasets. However, analytically, current CL-based SFRL methods still perform unsatisfactorily in learning facial representations due to their tendency to learn pose-insensitive features, resulting in the loss of some useful pose details. This could be due to the inappropriate positive/negative pair selection within CL. To conquer this challenge, we propose a Pose-disentangled Contrastive Facial Representation Learning (PCFRL) framework to enhance pose awareness for SFRL. We achieve this by explicitly disentangling the pose-aware features from non-pose face-aware features and introducing appropriate sample calibration schemes for better CL with the disentangled features. In PCFRL, we first devise a pose-disentangled decoder with a delicately designed orthogonalizing regulation to perform the disentanglement; therefore, the learning on the pose-aware and non-pose face-aware features would not affect each other. Then, we introduce a false-negative pair calibration module to overcome the issue that the two types of disentangled features may not share the same negative pairs for CL. Our calibration employs a novel neighborhood-cohesive pair alignment method to identify pose and face false-negative pairs, respectively, and further help calibrate them to appropriate positive pairs. Lastly, we devise two calibrated CL losses, namely calibrated pose-aware and face-aware CL losses, for adaptively learning the calibrated pairs more effectively, ultimately enhancing the learning with the disentangled features and providing robust facial representations for various downstream tasks. In the experiments, we perform linear evaluations on four challenging downstream facial tasks with SFRL using our method, including facial expression recognition, face recognition, facial action unit detection, and head pose estimation. Experimental results show that PCFRL outperforms existing state-of-the-art methods by a substantial margin, demonstrating the importance of improving pose awareness for SFRL. Our evaluation code and model will be available at https://github.com/fulaoze/CV/tree/main.\u0000","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"14 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143049681","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Image Synthesis Under Limited Data: A Survey and Taxonomy

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2025-01-27 DOI: 10.1007/s11263-025-02357-y

Mengping Yang, Zhe Wang

Deep generative models, which target reproducing the data distribution to produce novel images, have made unprecedented advancements in recent years. However, one critical prerequisite for their tremendous success is the availability of a sufficient number of training samples, which requires massive computation resources. When trained on limited data, generative models tend to suffer from severe performance deterioration due to overfitting and memorization. Accordingly, researchers have devoted considerable attention to develop novel models that are capable of generating plausible and diverse images from limited training data recently. Despite numerous efforts to enhance training stability and synthesis quality in the limited data scenarios, there is a lack of a systematic survey that provides (1) a clear problem definition, challenges, and taxonomy of various tasks; (2) an in-depth analysis on the pros, cons, and limitations of existing literature; and (3) a thorough discussion on the potential applications and future directions in this field. To fill this gap and provide an informative introduction to researchers who are new to this topic, this survey offers a comprehensive review and a novel taxonomy on the development of image synthesis under limited data. In particular, it covers the problem definition, requirements, main solutions, popular benchmarks, and remaining challenges in a comprehensive and all-around manner. We hope this survey can provide an informative overview and a valuable resource for researchers and practitioners. Apart from the relevant references, we aim to constantly maintain a timely up-to-date repository to track the latest advances at awesome-few-shot-generation.

{"title":"Image Synthesis Under Limited Data: A Survey and Taxonomy","authors":"Mengping Yang, Zhe Wang","doi":"10.1007/s11263-025-02357-y","DOIUrl":"https://doi.org/10.1007/s11263-025-02357-y","url":null,"abstract":"Deep generative models, which target reproducing the data distribution to produce novel images, have made unprecedented advancements in recent years. However, one critical prerequisite for their tremendous success is the availability of a sufficient number of training samples, which requires massive computation resources. When trained on limited data, generative models tend to suffer from severe performance deterioration due to overfitting and memorization. Accordingly, researchers have devoted considerable attention to develop novel models that are capable of generating plausible and diverse images from limited training data recently. Despite numerous efforts to enhance training stability and synthesis quality in the limited data scenarios, there is a lack of a systematic survey that provides (1) a clear problem definition, challenges, and taxonomy of various tasks; (2) an in-depth analysis on the pros, cons, and limitations of existing literature; and (3) a thorough discussion on the potential applications and future directions in this field. To fill this gap and provide an informative introduction to researchers who are new to this topic, this survey offers a comprehensive review and a novel taxonomy on the development of image synthesis under limited data. In particular, it covers the problem definition, requirements, main solutions, popular benchmarks, and remaining challenges in a comprehensive and all-around manner. We hope this survey can provide an informative overview and a valuable resource for researchers and practitioners. Apart from the relevant references, we aim to constantly maintain a timely up-to-date repository to track the latest advances at awesome-few-shot-generation.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"62 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143049679","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

Dual-Space Video Person Re-identification

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2025-01-27 DOI: 10.1007/s11263-025-02350-5

Jiaxu Leng, Changjiang Kuang, Shuang Li, Ji Gan, Haosheng Chen, Xinbo Gao

Video person re-identification (VReID) aims to recognize individuals across video sequences. Existing methods primarily use Euclidean space for representation learning but struggle to capture complex hierarchical structures, especially in scenarios with occlusions and background clutter. In contrast, hyperbolic space, with its negatively curved geometry, excels at preserving hierarchical relationships and enhancing discrimination between similar appearances. Inspired by these, we propose Dual-Space Video Person Re-Identification (DS-VReID) to utilize the strength of both Euclidean and hyperbolic geometries, capturing the visual features while also exploring the intrinsic hierarchical relations, thereby enhancing the discriminative capacity of the features. Specifically, we design the Dynamic Prompt Graph Construction (DPGC) module, which uses a pre-trained CLIP model with learnable dynamic prompts to construct 3D graphs that capture subtle changes and dynamic information in video sequences. Building upon this, we introduce the Hyperbolic Disentangled Aggregation (HDA) module, which addresses long-range dependency modeling by decoupling node distances and integrating adjacency matrices, capturing detailed spatial-temporal hierarchical relationships. Extensive experiments on benchmark datasets demonstrate the superiority of DS-VReID over state-of-the-art methods, showcasing its potential in complex VReID scenarios.

{"title":"Dual-Space Video Person Re-identification","authors":"Jiaxu Leng, Changjiang Kuang, Shuang Li, Ji Gan, Haosheng Chen, Xinbo Gao","doi":"10.1007/s11263-025-02350-5","DOIUrl":"https://doi.org/10.1007/s11263-025-02350-5","url":null,"abstract":"Video person re-identification (VReID) aims to recognize individuals across video sequences. Existing methods primarily use Euclidean space for representation learning but struggle to capture complex hierarchical structures, especially in scenarios with occlusions and background clutter. In contrast, hyperbolic space, with its negatively curved geometry, excels at preserving hierarchical relationships and enhancing discrimination between similar appearances. Inspired by these, we propose Dual-Space Video Person Re-Identification (DS-VReID) to utilize the strength of both Euclidean and hyperbolic geometries, capturing the visual features while also exploring the intrinsic hierarchical relations, thereby enhancing the discriminative capacity of the features. Specifically, we design the Dynamic Prompt Graph Construction (DPGC) module, which uses a pre-trained CLIP model with learnable dynamic prompts to construct 3D graphs that capture subtle changes and dynamic information in video sequences. Building upon this, we introduce the Hyperbolic Disentangled Aggregation (HDA) module, which addresses long-range dependency modeling by decoupling node distances and integrating adjacency matrices, capturing detailed spatial-temporal hierarchical relationships. Extensive experiments on benchmark datasets demonstrate the superiority of DS-VReID over state-of-the-art methods, showcasing its potential in complex VReID scenarios.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"8 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143049680","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0

SeaFormer++: Squeeze-Enhanced Axial Transformer for Mobile Visual Recognition

IF 19.5 2区计算机科学 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision

Pub Date : 2025-01-25 DOI: 10.1007/s11263-025-02345-2

Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, Li Zhang

Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which has been overwhelmingly dominated by CNNs, recently has significantly revolutionized. However, the computational cost and memory requirement renders these methods unsuitable on the mobile device. In this paper, we introduce a new method squeeze-enhanced Axial Transformer (SeaFormer) for mobile visual recognition. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and detail enhancement. It can be further used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on the ARM-based mobile devices on the ADE20K, Cityscapes Pascal Context and COCO-Stuff datasets. Critically, we beat both the mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency without bells and whistles. Furthermore, we incorporate a feature upsampling-based multi-resolution distillation technique, further reducing the inference latency of the proposed framework. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to image classification and object detection problems, demonstrating the potential of serving as a versatile mobile-friendly backbone. Our code and models are made publicly available at https://github.com/fudan-zvg/SeaFormer.

{"title":"SeaFormer++: Squeeze-Enhanced Axial Transformer for Mobile Visual Recognition","authors":"Qiang Wan, Zilong Huang, Jiachen Lu, Gang Yu, Li Zhang","doi":"10.1007/s11263-025-02345-2","DOIUrl":"https://doi.org/10.1007/s11263-025-02345-2","url":null,"abstract":"Since the introduction of Vision Transformers, the landscape of many computer vision tasks (e.g., semantic segmentation), which has been overwhelmingly dominated by CNNs, recently has significantly revolutionized. However, the computational cost and memory requirement renders these methods unsuitable on the mobile device. In this paper, we introduce a new method squeeze-enhanced Axial Transformer (SeaFormer) for mobile visual recognition. Specifically, we design a generic attention block characterized by the formulation of squeeze Axial and detail enhancement. It can be further used to create a family of backbone architectures with superior cost-effectiveness. Coupled with a light segmentation head, we achieve the best trade-off between segmentation accuracy and latency on the ARM-based mobile devices on the ADE20K, Cityscapes Pascal Context and COCO-Stuff datasets. Critically, we beat both the mobile-friendly rivals and Transformer-based counterparts with better performance and lower latency without bells and whistles. Furthermore, we incorporate a feature upsampling-based multi-resolution distillation technique, further reducing the inference latency of the proposed framework. Beyond semantic segmentation, we further apply the proposed SeaFormer architecture to image classification and object detection problems, demonstrating the potential of serving as a versatile mobile-friendly backbone. Our code and models are made publicly available at https://github.com/fudan-zvg/SeaFormer.","PeriodicalId":13752,"journal":{"name":"International Journal of Computer Vision","volume":"38 1","pages":""},"PeriodicalIF":19.5,"publicationDate":"2025-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143031203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 0