We present Class-agnostic Repetitive action Counting (CaRaCount), a novel approach to counting repetitive human actions in the wild using time series data from wearable devices. CaRaCount is the first few-shot class-agnostic method, able to count repetitions of any action class given only a short exemplar data sequence containing a few examples from the action class of interest. To develop and evaluate this method, we collect a large-scale time series dataset of repetitive human actions in various contexts, containing smartwatch data from 10 subjects performing 50 different activities. Experiments on this dataset and three other activity counting datasets, namely Crossfit, Recofit, and MM-Fit, show that CaRaCount can count repetitive actions with low error and that it outperforms other baselines and state-of-the-art action counting methods. Finally, with a user experience study, we evaluate the usability of our real-time implementation. Our results highlight the efficiency and effectiveness of our approach when deployed outside laboratory environments.
Title: Class-Agnostic Repetitive Action Counting Using Wearable Devices. Authors: Duc Duy Nguyen, Lam Thanh Nguyen, Yifeng Huang, Cuong Pham, Minh Hoai. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP. Pub Date: 2025-03-05. DOI: 10.1109/TPAMI.2025.3548131
Pub Date: 2025-03-05 DOI: 10.1109/TPAMI.2025.3548516
Alon Harell, Yalda Foroutan, Nilesh Ahuja, Parual Datta, Bhavya Kanzariya, V Srinivasa Somayazulu, Omesh Tickoo, Anderson de Andrade, Ivan V Bajic
Recent years have seen a tremendous growth in both the capability and popularity of automatic machine analysis of media, especially images and video. As a result, a growing need for efficient compression methods optimised for machine vision, rather than human vision, has emerged. To meet this growing demand, significant developments have been made in image and video coding for machines. Unfortunately, while there is a substantial body of knowledge regarding rate-distortion theory for human vision, the same cannot be said of machine analysis. In this paper, we greatly extend the current rate-distortion theory for machines, providing insight into important design considerations of machine-vision codecs. We then utilise this newfound understanding to improve several methods for learned image coding for machines. Our proposed methods achieve state-of-the-art rate-distortion performance on several computer vision tasks - classification, instance and semantic segmentation, and object detection.
Title: Rate-Distortion Theory in Coding for Machines and its Applications. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP.
Visible-thermal small object detection (RGBT SOD) is a significant yet challenging task with a wide range of applications, including video surveillance, traffic monitoring, and search and rescue. However, existing studies mainly focus on either the visible or the thermal modality, while RGBT SOD is rarely explored. Although some RGBT datasets have been developed, their insufficient quantity, limited diversity, unitary application, misaligned images and large target size mean they cannot provide an impartial benchmark to evaluate RGBT SOD algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant objects (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Note that over 81% of objects are smaller than 16×16 pixels, and we provide paired bounding box annotations with tracking IDs to offer an extremely challenging benchmark with wide-ranging applications, such as RGBT image fusion, object detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large objects. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset, extensive evaluations have been conducted with IoU and SAFit metrics, including 32 recent state-of-the-art algorithms that cover four different types (i.e., visible generic detection, visible SOD, thermal SOD and RGBT object detection). The project is available at https://github.com/XinyiYing/RGBT-Tiny.
Title: Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines. Authors: Xinyi Ying, Chao Xiao, Wei An, Ruojing Li, Xu He, Boyang Li, Xu Cao, Zhaoxu Li, Yingqian Wang, Mingyuan Hu, Qingyu Xu, Zaiping Lin, Miao Li, Shilin Zhou, Weidong Sheng, Li Liu. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP. Pub Date: 2025-03-05. DOI: 10.1109/TPAMI.2025.3544621
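The abstract above evaluates detectors with IoU alongside the proposed SAFit measure, which is described as more robust across object sizes. The following is a minimal sketch of why plain IoU is harsh on tiny objects: the same 2-pixel localization error costs an 8×8 box far more overlap than a 64×64 box. The box sizes and the shift are made-up illustration values, and SAFit itself is not implemented here because the abstract does not define it.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when boxes are disjoint.
    inter_w = max(0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def shifted(box, dx):
    """Return the box translated horizontally by dx pixels."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1, x2 + dx, y2)


# Hypothetical ground-truth boxes: one tiny object, one large object.
small = (0, 0, 8, 8)
large = (0, 0, 64, 64)

# The same 2-pixel localization error applied to both.
iou_small = iou(small, shifted(small, 2))   # 48 / 80 = 0.6
iou_large = iou(large, shifted(large, 2))   # 3968 / 4224, roughly 0.94

print(f"8x8 box:   IoU = {iou_small:.3f}")
print(f"64x64 box: IoU = {iou_large:.3f}")
```

Under a typical IoU threshold, the small-box prediction is close to being rejected while the large-box prediction passes comfortably, which is the kind of scale sensitivity a size-adaptive measure is meant to reduce.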
Pub Date: 2025-03-05 DOI: 10.1109/TPAMI.2025.3548620
Degang Chen, Jiayu Liu, Xiaoya Che
Understanding the effect of the hyperparameters of the network structure on the performance of Convolutional Neural Networks (CNNs) remains the most fundamental and urgent issue in deep learning, and in this paper we attempt to address it based on the piecewise linear (PWL) nature of CNNs. Firstly, for a fixed sample, the operations of convolution, ReLU and max pooling in a CNN are represented as a product of multiple matrices in order to obtain an algebraic expression of CNNs; this expression clearly shows that CNNs are PWL functions. Although such a representation has high time complexity, it provides a more convenient and intuitive way to study the mathematical properties of CNNs. Secondly, we develop a tight bound on the number of linear regions and upper bounds on the generalization error of CNNs, both taking into account factors such as the number of layers, the pooling dimension, and the width of the network. These results provide possible guidance for designing and training CNNs.
Title: On the Upper Bounds of Number of Linear Regions and Generalization Error of Deep Convolutional Neural Networks. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP.
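The abstract above states that, for a fixed sample, a network built from linear layers and ReLUs can be written as a product of matrices. The following is a minimal sketch of that idea under simplifying assumptions: it uses a tiny fully connected network with made-up weights rather than a CNN (a convolution is a linear map, so a dense matrix stands in for it), and it omits max pooling. All names (`W1`, `W2`, `relu_mask`, `local_linear_map`) are hypothetical and not taken from the paper.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu_mask(pre_activation):
    """Diagonal 0/1 matrix recording which units are active for this sample."""
    n = len(pre_activation)
    return [[1.0 if (i == j and pre_activation[i] > 0) else 0.0 for j in range(n)]
            for i in range(n)]


# Made-up weights for illustration: 2 inputs -> 3 hidden units -> 2 outputs, no biases.
W1 = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]
W2 = [[1.0, 1.0, -1.0], [0.0, 2.0, 1.0]]


def forward(x):
    """Ordinary forward pass: linear layer, ReLU, linear layer."""
    hidden = [max(0.0, h) for h in matvec(W1, x)]
    return matvec(W2, hidden)


def local_linear_map(x):
    """Matrix A = W2 * D1 * W1, where D1 encodes the ReLU pattern at sample x."""
    d1 = relu_mask(matvec(W1, x))
    return matmul(W2, matmul(d1, W1))


x = [1.0, 0.25]
A = local_linear_map(x)

print(forward(x))     # network output for this sample
print(matvec(A, x))   # same values, obtained from a single matrix product
```

The matrix `A` depends on `x` only through the activation pattern, so every input that switches on the same hidden units shares the same `A`. Each such set of inputs is one linear region, which is why counting regions amounts to counting achievable activation patterns.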
Pub Date: 2025-02-25 DOI: 10.1109/TPAMI.2025.3545966
Jinpeng Chen, Runmin Cong, Yuxuan Luo, Horace Ho Shing Ip, Sam Kwong
Research on class-incremental semantic segmentation (CISS) seeks to enhance semantic segmentation methods by enabling the progressive learning of new classes while preserving knowledge of previously learned ones. A significant yet often neglected challenge in this domain is class imbalance. In CISS, each task focuses on different foreground classes, with the training set for each task exclusively comprising images that contain these currently focused classes. This results in an overrepresentation of these classes within the single-task training set, leading to a classification bias towards them. To address this issue, we propose a novel CISS method named STAR, whose core principle is to reintegrate the missing proportions of previous classes into current single-task training samples by replaying their prototypes. Moreover, we develop a prototype derivation technique that enables the deduction of past-class prototypes, integrating the recognition patterns of the classifiers and the extraction patterns of the feature extractor. With this technique, replay can be accomplished without using any storage to save prototypes. Complementing our method, we devise two loss functions to enforce cross-task feature constraints: the Old-Class Features Maintaining (OCFM) loss and the Similarity-Aware Discriminative (SAD) loss. The OCFM loss is designed to stabilize the feature space of old classes, thus preserving previously acquired knowledge without compromising the ability to learn new classes. The SAD loss aims to enhance feature distinctions between similar old and new class pairs, minimizing potential confusion. Our experiments on two public datasets, Pascal VOC 2012 and ADE20K, demonstrate that STAR achieves state-of-the-art performance.
Title: Replay Without Saving: Prototype Derivation and Distribution Rebalance for Class-Incremental Semantic Segmentation. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. PP.
Multi-source image fusion combines the information coming from multiple images into a single output, thus improving imaging quality. This topic has aroused great interest in the community. Although existing self-attention-based transformer methods can capture spatial and channel similarities, how to integrate information from different sources remains a major challenge. In this paper, we first discuss the mathematical concepts behind the proposed generalized self-attention mechanism, of which the existing self-attentions are considered basic forms. The proposed mechanism employs multilinear algebra to drive the development of a novel fully-connected self-attention (FCSA) method that fully exploits local and non-local domain-specific correlations among multi-source images. Moreover, we propose a multi-source image representation and embed it into the FCSA framework as a non-local prior within an optimization problem. Several different fusion problems are unfolded into the proposed fully-connected transformer fusion network (FC-Former). More specifically, the concept of generalized self-attention can promote the potential development of self-attention. Hence, the FC-Former can be viewed as a network model unifying different fusion tasks. Compared with state-of-the-art methods, the proposed FC-Former method exhibits robust and superior performance, showing its capability of faithfully preserving information.
Title: Fully-Connected Transformer for Multi-Source Image Fusion. Authors: Xiao Wu, Zi-Han Cao, Ting-Zhu Huang, Liang-Jian Deng, Jocelyn Chanussot, Gemine Vivone. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 3, pp. 2071-2088. Pub Date: 2025-02-05. DOI: 10.1109/TPAMI.2024.3523364
Pub Date: 2025-01-24 DOI: 10.1109/TPAMI.2025.3529927
Ao Li, Le Zhang, Yun Liu, Ce Zhu
Transformer-based methods have exhibited remarkable potential in single image super-resolution (SISR) by effectively extracting long-range dependencies. However, most of the current research in this area has prioritized the design of transformer blocks to capture global information, while overlooking the importance of incorporating high-frequency priors, which we believe could be beneficial. In our study, we conducted a series of experiments and found that transformer structures are more adept at capturing low-frequency information, but have limited capacity in constructing high-frequency representations when compared to their convolutional counterparts. Our proposed solution, the cross-refinement adaptive feature modulation transformer (CRAFT), integrates the strengths of both convolutional and transformer structures. It comprises three key components: the high-frequency enhancement residual block (HFERB) for extracting high-frequency information, the shift rectangle window attention block (SRWAB) for capturing global information, and the hybrid fusion block (HFB) for refining the global representation. To tackle the inherent intricacies of transformer structures, we introduce a frequency-guided post-training quantization (PTQ) method aimed at enhancing CRAFT's efficiency. These strategies incorporate adaptive dual clipping and boundary refinement. To further amplify the versatility of our proposed approach, we extend our PTQ strategy to function as a general quantization method for transformer-based SISR techniques. Our experimental findings showcase CRAFT's superiority over current state-of-the-art methods, both in full-precision and quantization scenarios. These results underscore the efficacy and universality of our PTQ strategy.
Title: Exploring Frequency-Inspired Optimization in Transformer for Efficient Single Image Super-Resolution. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 3141-3158.
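The abstract above mentions post-training quantization with "adaptive dual clipping" but does not spell out the procedure. The following is a minimal sketch of the generic idea only: uniform quantization after clipping to a separate lower and upper bound, compared with quantizing over the full value range. The activation values, the 4-bit setting, and the fixed bounds are made-up assumptions; how CRAFT's PTQ method actually chooses its bounds is not shown here.

```python
def quantize(values, lower, upper, bits=4):
    """Clip values to [lower, upper], then round to one of 2**bits uniform levels."""
    levels = 2 ** bits - 1
    step = (upper - lower) / levels
    out = []
    for v in values:
        clipped = min(max(v, lower), upper)
        index = round((clipped - lower) / step)   # integer code in [0, levels]
        out.append(lower + index * step)          # de-quantized value
    return out


def mse(a, b):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


# Made-up activations: mostly small values plus one large outlier.
activations = [-0.9, -0.3, -0.1, 0.0, 0.05, 0.2, 0.4, 0.7, 6.0]
body = activations[:-1]

# Option 1: the range is stretched to cover the outlier, so the step is coarse.
full_range = quantize(activations, min(activations), max(activations))

# Option 2: separate lower and upper clipping bounds hug the bulk of the values.
tight = quantize(activations, -1.0, 1.0)

print("error on bulk values, full range:", mse(body, full_range[:-1]))
print("error on bulk values, clipped:   ", mse(body, tight[:-1]))
print("outlier after clipping:", tight[-1])
```

Clipping trades a large error on the rare outlier for a finer step on the common values. A quantizer that adapts its two bounds is searching for the point where that trade pays off.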
Pub Date: 2025-01-21 DOI: 10.1109/TPAMI.2025.3526930
Yibo Zhou, Bo Li, Hai-Miao Hu, Xiaokang Zhang, Dongping Zhang, Hanzi Wang
In pedestrian attribute recognition (PAR), the loose umbrella term ‘attribute’ ranges from human soft biometrics to worn accessories, and even extends to various subjective body descriptors. The vast coverage of ‘attributes’ implies that, instead of being over-specialized to a limited set of attributes with exclusive characteristics, PAR should be approached from a much more fundamental perspective. To this end, given that most attributes are greatly under-represented in real-world datasets, we simply distill PAR into a visual task of multi-label recognition under significant data imbalance. Accordingly, we introduce feature re-sampled detached learning (FRDL) to decouple label-balanced learning from the curse of attribute co-occurrence. Specifically, FRDL is able to balance the sampling distribution of an attribute without biasing the label priors of other co-occurring attributes. As a complementary method, we also propose gradient-oriented augment translating (GOAT) to alleviate the feature noise and semantics imbalance aggravated in FRDL. Integrated in a highly unified framework, FRDL and GOAT substantially refresh the state-of-the-art performance on various realistic benchmarks, while maintaining a minimal computational budget. Further analytical discussion and experimental evidence corroborate the veracity of our advancement: this is the first work that establishes label-independent and impartial balanced learning for PAR.
Title: Heterogeneous Feature Re-Sampling for Balanced Pedestrian Attribute Recognition. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2706-2722.
Pub Date: 2025-01-21 DOI: 10.1109/TPAMI.2025.3530099
Yong Du, Jiahui Zhan, Xinzhe Li, Junyu Dong, Sheng Chen, Ming-Hsuan Yang, Shengfeng He
In this paper, we propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains under conditions of limited training data and significant visual differences. The main idea behind our approach is leveraging the domain-neutral capabilities of CLIP as a bridging mechanism, while utilizing a separate module to extract abstract, domain-agnostic semantics from the embeddings of both the source and target realms. Fusing these abstract semantics with target-specific semantics results in a transformed embedding within the CLIP space. To bridge the gap between the disparate worlds of CLIP and StyleGAN, we introduce a new non-linear mapper, the CLIP2P mapper. Utilizing CLIP embeddings, this module is tailored to approximate the latent distribution in the StyleGAN's latent space, effectively acting as a connector between these two spaces. The proposed UniTranslator is versatile and capable of performing various tasks, including style mixing, stylization, and translations, even in visually challenging scenarios across different visual domains. Notably, UniTranslator generates high-quality translations that showcase domain relevance, diversity, and improved image quality. UniTranslator surpasses the performance of existing general-purpose models and performs well against specialized models in representative tasks.
Title: One-for-All: Towards Universal Domain Translation With a Single StyleGAN. Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2865-2881.