Outlier-Aware Contrastive Learning
Pub Date : 2026-03-03 DOI: 10.1109/TPAMI.2026.3669598
Jen-Tzung Chien, Kuan Chen
Contrastive learning aims to learn an embedding space with sample discrimination, where similar samples are pulled together while dissimilar samples are pushed apart. However, sampling bias is likely to occur and degrade classification performance when a contrast model is trained with the leakage caused by similar samples that belong to different classes or by dissimilar samples from the same class. Out-of-distribution (OOD) detection provides a meaningful scheme to detect and mask such false negative samples, debiasing an outlier-aware contrastive loss for high-fidelity contrastive learning. Sample debiasing also makes it feasible to reduce the upper bound of the contrastive loss. Moreover, previous OOD detectors were trained from an auxiliary collection of OOD samples, whereas in the real world such prior knowledge of OOD samples is commonly unavailable. This study presents new outlier-aware detection and contrast models that generate and augment samples near the boundary between in-distribution (ID) and OOD data. These synthesized samples are located right outside the ID region, and their Gaussian embeddings sufficiently reflect OOD behaviors. An OOD detector is learned from ID samples and synthesized OOD samples with a learning objective geared towards contrastive OOD detection and a debiased contrast model. Experiments are conducted to illustrate the merit of the proposed outlier-aware contrastive learning.
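As a rough illustration of the masking idea described above, the sketch below shows an InfoNCE-style loss in which negatives flagged by an OOD detector as likely false negatives are dropped from the denominator; the tensor names, threshold, and temperature are illustrative assumptions, not the paper's implementation.

    import torch
    import torch.nn.functional as F

    def outlier_aware_info_nce(z_anchor, z_pos, z_neg, ood_scores, tau=0.1, thresh=0.5):
        """Illustrative debiased InfoNCE: negatives whose OOD-detector score marks them
        as likely false negatives (i.e., samples behaving like ID neighbours of the
        anchor) are masked out of the denominator. Threshold and temperature are
        placeholder values."""
        z_anchor = F.normalize(z_anchor, dim=-1)   # (B, d)
        z_pos    = F.normalize(z_pos, dim=-1)      # (B, d)
        z_neg    = F.normalize(z_neg, dim=-1)      # (B, N, d)
        pos = torch.exp((z_anchor * z_pos).sum(-1) / tau)                    # (B,)
        neg = torch.exp(torch.einsum('bd,bnd->bn', z_anchor, z_neg) / tau)   # (B, N)
        mask = (ood_scores > thresh).float()       # keep only likely "true" negatives
        denom = pos + (mask * neg).sum(-1)
        return -torch.log(pos / denom).mean()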
{"title":"Outlier-Aware Contrastive Learning.","authors":"Jen-Tzung Chien, Kuan Chen","doi":"10.1109/TPAMI.2026.3669598","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3669598","url":null,"abstract":"<p><p>Contrastive learning aims to learn an embedding space with sample discrimination where similar samples attract together while dissimilar samples repulse apart. However, the issue of sampling bias likely happens and degrades the classification performance when a contrast model is trained with the leakage caused by similar samples but from different classes or dissimilar samples from the same class. Out-of-distribution (OOD) detection provides a meaningful scheme to detect and mask those false negative samples for debiasing in an outlier-aware contrastive loss for high-fidelity contrastive learning. Sample debiasing is feasible to reduce the upper bound of contrastive loss. Also, the previous OOD detector was trained from auxiliary collection of OOD samples. In real world, the prior knowledge of OOD samples is commonly unavailable. This study presents new outlier-aware detection and contrast models through generation and augmentation of those samples near the boundary between in-distribution (ID) and OOD. These synthesized samples are located right outside ID, and their Gaussian embeddings sufficiently reflect OOD behaviors. An OOD detector is learned by using ID samples and synthesized OOD samples with the learning objective towards contrastive OOD detection and debiased contrast model. The experiments are conducted to illustrate the merit of the proposed outlier-aware contrastive learning.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147349807","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Near-Perfect Clustering Based on Recursive Binary Splitting Using Max-MMD
Pub Date : 2026-03-03 DOI: 10.1109/TPAMI.2026.3669975
Sourav Chakrabarty, Anirvan Chakraborty, Shyamal K De
We develop novel clustering algorithms for functional data when the number of clusters $K$ is unspecified as well as when it is specified. These algorithms are based on the Maximum Mean Discrepancy (MMD) measure between the empirical distributions associated with two sets of observations. The algorithms recursively use a binary splitting strategy to partition the dataset into two subgroups that are maximally separated in terms of an appropriate weighted MMD measure. When $K$ is unspecified, the proposed clustering algorithm has an additional step that checks whether a group of observations obtained by the binary splitting technique consists of observations from a single population; $K$ is thus learned directly from the data. When $K$ is specified, a modification of the previous algorithm is proposed that adds a step of merging subgroups which are similar in terms of the weighted MMD distance. The theoretical properties of the proposed algorithms are investigated in an oracle scenario that requires knowledge of the empirical distributions of the observations from the different populations involved. In this setting, we prove that the algorithm for unspecified $K$ achieves perfect clustering, while the algorithm for specified $K$ has the perfect order preserving (POP) property. Extensive analyses of real and simulated data, using a variety of models with location as well as scale differences, show near-perfect clustering performance of both algorithms, which improve upon state-of-the-art clustering methods for functional data.
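For reference, the (biased) squared MMD between two samples under a Gaussian kernel, on which the splitting criterion is built, can be computed as follows; the kernel and its bandwidth are illustrative choices, and the weighted variant used by the algorithms is not shown.

    import numpy as np

    def mmd2(X, Y, gamma=1.0):
        """Biased estimate of squared MMD between samples X (n, d) and Y (m, d)
        under the Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2).
        gamma is an illustrative bandwidth choice."""
        def k(A, B):
            d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
            return np.exp(-gamma * d2)
        return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

A recursive split would then search over candidate binary partitions of the data for the one maximizing a weighted version of this quantity.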
{"title":"Near-Perfect Clustering Based on Recursive Binary Splitting Using Max-MMD.","authors":"Sourav Chakrabarty, Anirvan Chakraborty, Shyamal K De","doi":"10.1109/TPAMI.2026.3669975","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3669975","url":null,"abstract":"<p><p>We develop novel clustering algorithms for functional data when the number of clusters $K$ is unspecified and also when it is specified. These algorithms are developed based on the Maximum Mean Discrepancy (MMD) measure between the empirical distributions associated with two sets of observations. The algorithms recursively use a binary splitting strategy to partition the dataset into two subgroups such that they are maximally separated in terms of an appropriate weighted MMD measure. When $K$ is unspecified, the proposed clustering algorithm has an additional step to check whether a group of observations obtained by the binary splitting technique consists of observations from a single population. We also learn $K$ directly from the data using this algorithm. When $K$ is specified, a modification of the previous algorithm is proposed which consists of an additional step of merging subgroups which are similar in terms of the weighted MMD distance. The theoretical properties of the proposed algorithms are investigated in an oracle scenario that requires the knowledge of the empirical distributions of the observations from different populations involved. In this setting, we prove that the algorithm proposed when $K$ is unspecified achieves perfect clustering while the algorithm proposed when $K$ is specified has the perfect order preserving (POP) property. Extensive real and simulated data analyses using a variety of models having location difference as well as scale difference show near-perfect clustering performance of both the algorithms which improve upon the state-of-the-art clustering methods for functional data.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-03-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147349792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Feature Compression for Cloud-Edge Multimodal 3D Object Detection
Pub Date : 2026-03-03 DOI: 10.1109/TPAMI.2026.3669471
Chongzhen Tian, Zhengxin Li, Hui Yuan, Raouf Hamzaoui, Liquan Shen, Sam Kwong
Machine vision systems, which can efficiently manage extensive visual perception tasks, are becoming increasingly popular in industrial production and daily life. Due to the challenge of simultaneously obtaining accurate depth and texture information with a single sensor, multimodal data captured by cameras and LiDAR is commonly used to enhance performance. Additionally, cloud-edge cooperation has emerged as a novel computing approach to improve user experience and ensure data security in machine vision systems. This paper proposes a pioneering solution to address the feature compression problem in multimodal 3D object detection. Given a sparse tensor-based object detection network at the edge device, we introduce two modes to accommodate different application requirements: Transmission-Friendly Feature Compression (T-FFC) and Accuracy-Friendly Feature Compression (A-FFC). In T-FFC mode, only the output of the last layer of the network's backbone is transmitted from the edge device. The received feature is processed at the cloud device through a channel expansion module and two spatial upsampling modules to generate multi-scale features. In A-FFC mode, we expand upon the T-FFC mode by transmitting two additional types of features. These added features enable the cloud device to generate more accurate multi-scale features. Experimental results on the KITTI dataset using the VirConv-L detection network showed that T-FFC was able to compress the features by a factor of 4933 with less than a 3% reduction in detection performance. On the other hand, A-FFC compressed the features by a factor of about 733 with almost no degradation in detection performance. We also designed optional residual extraction and 3D object reconstruction modules to facilitate the reconstruction of detected objects. The reconstructed objects effectively reflected the shape, occlusion, and details of the original objects.
Spatio-temporal Decoupled Knowledge Compensator for Few-Shot Action Recognition
Pub Date : 2026-03-02 DOI: 10.1109/TPAMI.2026.3669254
Hongyu Qu, Xiangbo Shu, Rui Yan, Hailiang Gao, Wenguan Wang, Jinhui Tang
Few-Shot Action Recognition (FSAR) is a challenging task that requires recognizing novel action categories with a few labeled videos. Recent works typically apply semantically coarse category names as auxiliary contexts to guide the learning of discriminative visual features. However, the context provided by action names is too limited to supply sufficient background knowledge for capturing novel spatial and temporal concepts in actions. In this paper, we propose DiST, an innovative Decomposition-incorporation framework for FSAR that makes use of decoupled Spatial and Temporal knowledge provided by large language models to learn expressive multi-granularity prototypes. In the decomposition stage, we decouple vanilla action names into diverse spatio-temporal attribute descriptions (action-related knowledge). Such commonsense knowledge complements semantic contexts from spatial and temporal perspectives. In the incorporation stage, we propose Spatial/Temporal Knowledge Compensators (SKC/TKC) to discover discriminative object-level and frame-level prototypes, respectively. In SKC, object-level prototypes adaptively aggregate important patch tokens under the guidance of spatial knowledge. In TKC, frame-level prototypes utilize temporal attributes to assist inter-frame temporal relation modeling. These learned prototypes thus provide transparency in capturing fine-grained spatial details and diverse temporal patterns. Experimental results show that DiST achieves state-of-the-art results on five standard FSAR datasets.
Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving
Pub Date : 2026-03-02 DOI: 10.1109/TPAMI.2026.3669002
Lucas Nunes, Rodrigo Marcuzzi, Jens Behley, Cyrill Stachniss
Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role in enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data is a bottleneck in these developments. To overcome this data annotation limitation, synthetic simulated data has been used to generate annotated data on demand. There is still, however, a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. Such generative models have recently been applied to the 3D data domain for generating scene-scale data with semantic annotations. Still, these methods rely either on image projection or on decoupled models trained with different resolutions in a coarse-to-fine manner. Such intermediary representations degrade the quality of the generated data because of the errors introduced by these transformations. In this work, we propose a novel approach that generates 3D semantic scene-scale data without relying on any projection or decoupled multi-resolution models, achieving more realistic semantic scene data generation than previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data, together with the real semantic segmentation labels, leads to an improvement in semantic segmentation performance. Our results show the potential of generated scene-scale point clouds to provide additional training data that extends existing datasets and reduces the data annotation effort. Our code is available at https://github.com/PRBonn/3DiSS.
{"title":"Towards Generating Realistic 3D Semantic Training Data for Autonomous Driving.","authors":"Lucas Nunes, Rodrigo Marcuzzi, Jens Behley, Cyrill Stachniss","doi":"10.1109/TPAMI.2026.3669002","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3669002","url":null,"abstract":"<p><p>Semantic scene understanding is crucial for robotics and computer vision applications. In autonomous driving, 3D semantic segmentation plays an important role for enabling safe navigation. Despite significant advances in the field, the complexity of collecting and annotating 3D data is a bottleneck in this developments. To overcome that data annotation limitation, synthetic simulated data has been used to generate annotated data on demand. There is still, however, a domain gap between real and simulated data. More recently, diffusion models have been in the spotlight, enabling close-to-real data synthesis. Those generative models have been recently applied to the 3D data domain for generating scene-scale data with semantic annotations. Still, those methods either rely on image projection or decoupled models trained with different resolutions in a coarse-to-fine manner. Such intermediary representations impact the generated data quality due to errors added in those transformations. In this work, we propose a novel approach able to generate 3D semantic scene-scale data without relying on any projection or decoupled trained multi-resolution models, achieving more realistic semantic scene data generation compared to previous state-of-the-art methods. Besides improving 3D semantic scene-scale data synthesis, we thoroughly evaluate the use of the synthetic scene samples as labeled data to train a semantic segmentation network. In our experiments, we show that using the synthetic annotated data generated by our method as training data together with the real semantic segmentation labels, leads to an improvement in the semantic segmentation model performance. Our results show the potential of generated scene-scale point clouds to generate more training data to extend existing datasets, reducing the data annotation effort. Our code is available at https://github.com/PRBonn/3DiSS.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147346050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Continuous Wasserstein Barycenter Space for Generalized All-in-One Image Restoration
Pub Date : 2026-03-02 DOI: 10.1109/TPAMI.2026.3669121
Xiaole Tang, Xiaoyi He, Jiayi Xu, Xiang Gu, Jian Sun
Despite substantial advances in all-in-one image restoration for addressing diverse degradations within a unified model, existing methods remain vulnerable to out-of-distribution degradations, thereby limiting their generalization in real-world scenarios. To tackle the challenge, this work is motivated by the intuition that multisource degraded feature distributions are induced by different degradation-specific shifts from an underlying degradation-agnostic distribution, and recovering such a shared distribution is thus crucial for achieving generalization across degradations. With this insight, we propose BaryIR, a representation learning framework that aligns multisource degraded features in the Wasserstein barycenter (WB) space, which models a degradation-agnostic distribution by minimizing the average of Wasserstein distances to multisource degraded distributions. We further introduce residual subspaces, whose embeddings are mutually contrasted while remaining orthogonal to the WB embeddings. Consequently, BaryIR explicitly decouples two orthogonal spaces: a WB space that encodes the degradation-agnostic invariant contents shared across degradations, and residual subspaces that adaptively preserve the degradation-specific knowledge. This disentanglement mitigates overfitting to in-distribution degradations and enables adaptive restoration grounded on the degradation-agnostic shared invariance. Extensive experiments demonstrate that BaryIR performs competitively against state-of-the-art all-in-one methods. Notably, BaryIR generalizes well to unseen degradations (e.g., types and levels) and shows remarkable robustness in learning generalized features, even when trained on limited degradation types and evaluated on real-world data with mixed degradations.
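In symbols, the Wasserstein barycenter that the abstract refers to can be written as follows, assuming equal weights over the $K$ degraded source feature distributions $\nu_1,\dots,\nu_K$ and writing $W$ for the Wasserstein distance (e.g., the 2-Wasserstein metric):

    \mu_{\mathrm{WB}} \;=\; \arg\min_{\mu} \;\frac{1}{K}\sum_{k=1}^{K} W\!\left(\mu,\, \nu_{k}\right)

The barycenter space encodes the degradation-agnostic content shared across degradations, while the residual-subspace embeddings are contrasted against each other and kept orthogonal to the barycenter embeddings.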
{"title":"Learning Continuous Wasserstein Barycenter Space for Generalized All-in-One Image Restoration.","authors":"Xiaole Tang, Xiaoyi He, Jiayi Xu, Xiang Gu, Jian Sun","doi":"10.1109/TPAMI.2026.3669121","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3669121","url":null,"abstract":"<p><p>Despite substantial advances in all-in-one image restoration for addressing diverse degradations within a unified model, existing methods remain vulnerable to out-of-distribution degradations, thereby limiting their generalization in real-world scenarios. To tackle the challenge, this work is motivated by the intuition that multisource degraded feature distributions are induced by different degradation-specific shifts from an underlying degradation-agnostic distribution, and recovering such a shared distribution is thus crucial for achieving generalization across degradations. With this insight, we propose BaryIR, a representation learning framework that aligns multisource degraded features in the Wasserstein barycenter (WB) space, which models a degradation-agnostic distribution by minimizing the average of Wasserstein distances to multisource degraded distributions. We further introduce residual subspaces, whose embeddings are mutually contrasted while remaining orthogonal to the WB embeddings. Consequently, BaryIR explicitly decouples two orthogonal spaces: a WB space that encodes the degradation-agnostic invariant contents shared across degradations, and residual subspaces that adaptively preserve the degradation-specific knowledge. This disentanglement mitigates overfitting to in-distribution degradations and enables adaptive restoration grounded on the degradation-agnostic shared invariance. Extensive experiments demonstrate that BaryIR performs competitively against state-of-the-art all-in-one methods. Notably, BaryIR generalizes well to unseen degradations (e.g., types and levels) and shows remarkable robustness in learning generalized features, even when trained on limited degradation types and evaluated on real-world data with mixed degradations.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147345990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Collaborative Feedback Discriminative Propagation for Video Super-Resolution
Pub Date : 2026-02-27 DOI: 10.1109/TPAMI.2026.3668757
Hao Li, Xiang Chen, Jiangxin Dong, Jinhui Tang, Jinshan Pan
The success of existing video super-resolution (VSR) methods stems mainly from exploiting spatial and temporal information, which is usually achieved by temporal propagation with alignment strategies. However, inaccurate alignment often introduces significant artifacts that accumulate during propagation and thus degrade video restoration. Moreover, propagating only features of the same timestep forward or backward cannot handle videos with complex motion or occlusion. To address these issues, we propose a collaborative feedback discriminative (CFD) method that corrects inaccurately aligned features and better models spatial and temporal information for VSR. Specifically, we first develop a discriminative alignment correction (DAC) method to reduce the influence of artifacts caused by inaccurate alignment. Then, we propose a collaborative feedback propagation (CFP) module based on feedback and gating mechanisms to exploit spatial and temporal information of different timestep features from forward and backward propagation simultaneously. Finally, we embed the proposed DAC and CFP into commonly used VSR networks to verify the effectiveness of our method. Experimental results demonstrate that our method improves the performance of existing VSR models while maintaining lower model complexity.
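A minimal sketch of a feedback/gating fusion in the spirit of the CFP module, assuming features propagated in the forward and backward directions are fused through a learned gate so that the less reliable direction contributes less; the channel count and single-gate design are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class GatedFeedbackFusion(nn.Module):
        """Toy gated fusion of forward- and backward-propagated features
        (possibly from different timesteps)."""
        def __init__(self, channels=64):
            super().__init__()
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, 3, padding=1),
                nn.Sigmoid())
        def forward(self, feat_fwd, feat_bwd):
            g = self.gate(torch.cat([feat_fwd, feat_bwd], dim=1))
            return g * feat_fwd + (1.0 - g) * feat_bwd   # per-pixel soft selection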
{"title":"Collaborative Feedback Discriminative Propagation for Video Super-Resolution.","authors":"Hao Li, Xiang Chen, Jiangxin Dong, Jinhui Tang, Jinshan Pan","doi":"10.1109/TPAMI.2026.3668757","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3668757","url":null,"abstract":"<p><p>The key success of existing video super-resolution (VSR) methods stems mainly from exploring spatial and temporal information that is usually achieved by a temporal propagation with alignment strategies. However, inaccurate alignment usually leads to significant artifacts that will be accumulated during propagation and thus affect video restoration. Moreover, only propagating the same timestep features forward or backward does not handle the videos with complex motion or occlusion. To address these issues, we propose a collaborative feedback discriminative (CFD) method to correct inaccurate aligned features and better model spatial and temporal information for VSR. Specifically, we first develop a discriminative alignment correction (DAC) method to reduce the influences of the artifacts caused by inaccurate alignment. Then, we propose a collaborative feedback propagation (CFP) module based on feedback and gating mechanisms to explore spatial and temporal information of different timestep features from forward and backward propagation simultaneously. Finally, we embed the proposed DAC and CFP into commonly used VSR networks to verify the effectiveness of our method. Experimental results demonstrate that our method improves the performance of existing VSR models while maintaining a lower model complexity.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147319254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UDFStudio: A Unified Framework of Datasets, Benchmarks and Generative Models for Unsigned Distance Functions
Pub Date : 2026-02-27 DOI: 10.1109/TPAMI.2026.3668763
Junsheng Zhou, Weiqi Zhang, Baorui Ma, Kanle Shi, Yu-Shen Liu, Zhizhong Han
Unsigned distance functions (UDFs) have emerged as a powerful representation for modeling and reconstructing geometries with open surfaces. However, the development of 3D generative models for UDFs remains largely unexplored, preventing current methods from generating diverse open-surface 3D content. Moreover, mainstream 3D datasets predominantly consist of watertight meshes, revealing a critical challenge: the absence of standardized datasets and benchmarks specifically tailored for open-surface generation and reconstruction. In this paper, we begin by introducing UDiFF, a novel diffusion-based 3D generative model specifically designed for UDFs. UDiFF supports both conditional and unconditional generation of textured 3D shapes with open surfaces. At its core, UDiFF generates UDFs in the spatial-frequency domain using a learnable wavelet transform. Instead of relying on manually selected wavelet transforms, which are labor-intensive and prone to information loss, we introduce a data-driven approach that learns the optimal wavelet transformation from UDF datasets. Beyond UDiFF, we present the UWings dataset, comprising 1,509 high-quality 3D open-surface models of winged creatures. Using UWings, we establish comprehensive benchmarks for evaluating both generative and reconstruction methods based on UDFs.
CLIP-Actor-X: Text-driven 4D Human Avatar Generation via Cross-modal Synthesis-through-Optimization
Pub Date : 2026-02-23 DOI: 10.1109/TPAMI.2026.3665111
Kim Youwang, Taehyun Byun, Kim Ji-Yeon, Sungjoon Choi, Tae-Hyun Oh
We propose CLIP-Actor-X, a text-driven motion generation and neural mesh stylization system for 4D human avatar generation. CLIP-Actor-X generates a detailed 3D human mesh, motion animation, and texture that conform to a text prompt given by a user. The CLIP-Actor-X system mainly consists of two modules. First, for generating realistic human motion, we build a text-driven human motion synthesis module modeled by a retrieval-augmented generative model, powered by a text-to-motion diffusion model. Second, our novel zero-shot neural style optimization module detailizes and texturizes the sampled sequence of a neutral human mesh template, such that the resulting mesh and appearance comply with the input text prompt in a temporally consistent and pose-agnostic manner. In contrast to prior arts that use an artist-designed, non-animatable mesh as input, our output representation is animatable and better aligned between the input text and the generated avatar without additional post-processing, e.g., re-alignment, retargeting, or rigging. We further propose ways to stabilize the optimization process: spatio-temporal view augmentation and visibility-aware embedding attention, which deal with poorly rendered views. We demonstrate that CLIP-Actor-X produces perceptually plausible and human-recognizable human avatars in motion, with detailed geometry and texture, solely from a natural language prompt.
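A hedged sketch of the kind of CLIP-guided optimization signal such a stylization stage typically relies on, assuming the OpenAI CLIP package and a differentiable renderer (not shown) that produces CLIP-normalised views of the posed mesh; the prompt, model choice, and loss form are placeholders rather than the paper's objective.

    import torch
    import clip  # assumes the OpenAI CLIP package (github.com/openai/CLIP)

    device = "cuda" if torch.cuda.is_available() else "cpu"
    model, _ = clip.load("ViT-B/32", device=device)
    with torch.no_grad():
        text_emb = model.encode_text(clip.tokenize(["a 3D human avatar of a firefighter"]).to(device))
        text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

    def clip_style_loss(rendered_views):
        # rendered_views: (V, 3, 224, 224) differentiable renders of the posed mesh,
        # already resized/normalised for CLIP and cast to the model's dtype.
        img_emb = model.encode_image(rendered_views)
        img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
        return (1.0 - img_emb @ text_emb.T).mean()  # minimised w.r.t. displacement/texture parameters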
{"title":"CLIP-Actor-X: Text-driven 4D Human Avatar Generation via Cross-modal Synthesis-through-Optimization.","authors":"Kim Youwang, Taehyun Byun, Kim Ji-Yeon, Sungjoon Choi, Tae-Hyun Oh","doi":"10.1109/TPAMI.2026.3665111","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3665111","url":null,"abstract":"<p><p>We propose CLIP-Actor-X, a text-driven motion generation and neural mesh stylization system for 4D human avatar generation. CLIP-Actor-X generates a detailed 3D human mesh, motion animation, and texture to conform to a given text prompt input from a user. CLIP- Actor-X system mainly consists of two modules. First, for generating realistic human motion, we build a text-driven human motion synthesis module modeled by a retrieval-augmented generative model, powered by a text-to-motion diffusion model. Second, our novel zero-shot neural style optimization module detailizes and texturizes the sampled sequence of a neutral human mesh template, such that the resulting mesh and appearance comply with the input text prompt in a temporally-consistent and pose-agnostic manner. In contrast to the prior arts that use an artist-designed, non-animatable mesh as an input, our output representation is animatable and better aligned between an input text and the generated avatar without additional post-processes, e.g., re-alignment, retargeting, or rigging. We further propose the ways to stabilize the optimization process: spatio-temporal view augmentation and visibility-aware embedding attention, which deals with poorly rendered views. We demonstrate that CLIP-Actor-X produces perceptually plausible and human-recognizable human avatar in motion with detailed geometry and texture solely from a natural language prompt.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147277944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aligning Few-Step Diffusion Models with Dense Reward Difference Learning
Pub Date : 2026-02-23 DOI: 10.1109/TPAMI.2026.3665753
Ziyi Zhang, Li Shen, Sen Zhang, Deheng Ye, Yong Luo, Miaojing Shi, Dongjing Shan, Bo Du, Dacheng Tao
Few-step diffusion models enable efficient high-resolution image synthesis but struggle to align with specific downstream objectives due to limitations of existing reinforcement learning (RL) methods in low-step regimes with limited state spaces and suboptimal sample quality. To address this, we propose Stepwise Diffusion Policy Optimization (SDPO), a novel RL framework tailored for few-step diffusion models. SDPO introduces a dual-state trajectory sampling mechanism, tracking both noisy and predicted clean states at each step to provide dense reward feedback and enable low-variance, mixed-step optimization. For further efficiency, we develop a latent similarity-based dense reward prediction strategy to minimize costly dense reward queries. Leveraging these dense rewards, SDPO optimizes a dense reward difference learning objective that enables more frequent and granular policy updates. Additional refinements, including stepwise advantage estimates, temporal importance weighting, and step-shuffled gradient updates, further enhance long-term dependency, low-step priority, and gradient stability. Our experiments demonstrate that SDPO consistently delivers superior reward-aligned results across diverse few-step settings and tasks.
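One plausible reading of a dense reward-difference objective, sketched below: each denoising step is weighted by the improvement in the dense reward evaluated on its predicted clean image relative to the previous step. This is an illustrative policy-gradient-style surrogate under assumed tensor shapes and a zero baseline at the first step, not SDPO's exact estimator.

    import torch

    def dense_reward_difference_loss(log_probs, rewards):
        """log_probs: (B, T) log-probability of each sampled denoising action
        rewards:   (B, T) dense rewards on the per-step clean-image predictions
        Step t is credited with the reward improvement it produced."""
        prev = torch.cat([torch.zeros_like(rewards[:, :1]), rewards[:, :-1]], dim=1)
        advantage = (rewards - prev).detach()      # per-step reward difference
        return -(advantage * log_probs).mean()     # REINFORCE-style surrogate to minimise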
{"title":"Aligning Few-Step Diffusion Models with Dense Reward Difference Learning.","authors":"Ziyi Zhang, Li Shen, Sen Zhang, Deheng Ye, Yong Luo, Miaojing Shi, Dongjing Shan, Bo Du, Dacheng Tao","doi":"10.1109/TPAMI.2026.3665753","DOIUrl":"https://doi.org/10.1109/TPAMI.2026.3665753","url":null,"abstract":"<p><p>Few-step diffusion models enable efficient high-resolution image synthesis but struggle to align with specific downstream objectives due to limitations of existing reinforcement learning (RL) methods in low-step regimes with limited state spaces and suboptimal sample quality. To address this, we propose Stepwise Diffusion Policy Optimization (SDPO), a novel RL framework tailored for few-step diffusion models. SDPO introduces a dual-state trajectory sampling mechanism, tracking both noisy and predicted clean states at each step to provide dense reward feedback and enable low-variance, mixed-step optimization. For further efficiency, we develop a latent similarity-based dense reward prediction strategy to minimize costly dense reward queries. Leveraging these dense rewards, SDPO optimizes a dense reward difference learning objective that enables more frequent and granular policy updates. Additional refinements, including stepwise advantage estimates, temporal importance weighting, and step-shuffled gradient updates, further enhance long-term dependency, low-step priority, and gradient stability. Our experiments demonstrate that SDPO consistently delivers superior reward-aligned results across diverse few-step settings and tasks.</p>","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"PP ","pages":""},"PeriodicalIF":18.6,"publicationDate":"2026-02-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147277962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}