
IEEE Transactions on Pattern Analysis and Machine Intelligence: Latest Articles

2024 Reviewers List*
Pub Date : 2025-03-06 DOI: 10.1109/TPAMI.2025.3537400
Volume 47, Issue 4, pp. 3183-3199. PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10916530
Citations: 0
Class-Agnostic Repetitive Action Counting Using Wearable Devices.
Pub Date : 2025-03-05 DOI: 10.1109/TPAMI.2025.3548131
Duc Duy Nguyen, Lam Thanh Nguyen, Yifeng Huang, Cuong Pham, Minh Hoai

We present Class-agnostic Repetitive action Counting (CaRaCount), a novel approach to counting repetitive human actions in the wild using time-series data from wearable devices. CaRaCount is the first few-shot class-agnostic method: it can count repetitions of any action class given only a short exemplar data sequence containing a few examples from the action class of interest. To develop and evaluate this method, we collect a large-scale time-series dataset of repetitive human actions in various contexts, containing smartwatch data from 10 subjects performing 50 different activities. Experiments on this dataset and three other activity counting datasets, namely Crossfit, Recofit, and MM-Fit, show that CaRaCount can count repetitive actions with low error and that it outperforms other baselines and state-of-the-art action counting methods. Finally, with a user experience study, we evaluate the usability of our real-time implementation. Our results highlight the efficiency and effectiveness of our approach when deployed outside laboratory environments.

Citations: 0
Rate-Distortion Theory in Coding for Machines and its Applications.
Pub Date : 2025-03-05 DOI: 10.1109/TPAMI.2025.3548516
Alon Harell, Yalda Foroutan, Nilesh Ahuja, Parual Datta, Bhavya Kanzariya, V Srinivasa Somayazulu, Omesh Tickoo, Anderson de Andrade, Ivan V Bajic

Recent years have seen a tremendous growth in both the capability and popularity of automatic machine analysis of media, especially images and video. As a result, a growing need for efficient compression methods optimised for machine vision, rather than human vision, has emerged. To meet this growing demand, significant developments have been made in image and video coding for machines. Unfortunately, while there is a substantial body of knowledge regarding rate-distortion theory for human vision, the same cannot be said of machine analysis. In this paper, we greatly extend the current rate-distortion theory for machines, providing insight into important design considerations of machine-vision codecs. We then utilise this newfound understanding to improve several methods for learned image coding for machines. Our proposed methods achieve state-of-the-art rate-distortion performance on several computer vision tasks - classification, instance and semantic segmentation, and object detection.

Citations: 0
Visible-Thermal Tiny Object Detection: A Benchmark Dataset and Baselines.
Pub Date : 2025-03-05 DOI: 10.1109/TPAMI.2025.3544621
Xinyi Ying, Chao Xiao, Wei An, Ruojing Li, Xu He, Boyang Li, Xu Cao, Zhaoxu Li, Yingqian Wang, Mingyuan Hu, Qingyu Xu, Zaiping Lin, Miao Li, Shilin Zhou, Weidong Sheng, Li Liu

Visible-thermal small object detection (RGBT SOD) is a significant yet challenging task with a wide range of applications, including video surveillance, traffic monitoring, and search and rescue. However, existing studies mainly focus on either the visible or the thermal modality, while RGBT SOD is rarely explored. Although some RGBT datasets have been developed, their insufficient quantity, limited diversity, single-application focus, misaligned images and large target sizes prevent them from providing an impartial benchmark to evaluate RGBT SOD algorithms. In this paper, we build the first large-scale benchmark with high diversity for RGBT SOD (namely RGBT-Tiny), including 115 paired sequences, 93K frames and 1.2M manual annotations. RGBT-Tiny contains abundant objects (7 categories) and high-diversity scenes (8 types that cover different illumination and density variations). Notably, over 81% of objects are smaller than 16×16, and we provide paired bounding box annotations with tracking IDs to offer an extremely challenging benchmark with wide-ranging applications, such as RGBT image fusion, object detection and tracking. In addition, we propose a scale adaptive fitness (SAFit) measure that exhibits high robustness on both small and large objects. The proposed SAFit can provide reasonable performance evaluation and promote detection performance. Based on the proposed RGBT-Tiny dataset, extensive evaluations have been conducted with IoU and SAFit metrics, including 32 recent state-of-the-art algorithms that cover four different types (i.e., visible generic detection, visible SOD, thermal SOD and RGBT object detection). The project is available at https://github.com/XinyiYing/RGBT-Tiny.
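The abstract does not define SAFit, so it is not reproduced here. As background for why a scale-aware measure may be useful, the following minimal Python sketch shows only the standard IoU metric and how the same localization error affects a small and a large box. The box sizes and the 2-unit shift are illustrative assumptions, not values from the paper.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def shift(box, dx, dy):
    """Translate a box by (dx, dy)."""
    x1, y1, x2, y2 = box
    return (x1 + dx, y1 + dy, x2 + dx, y2 + dy)


# Hypothetical ground-truth boxes: one tiny (8x8) and one large (64x64).
small_gt = (0.0, 0.0, 8.0, 8.0)
large_gt = (0.0, 0.0, 64.0, 64.0)

# The same 2-unit localization error applied to both.
small_iou = iou(small_gt, shift(small_gt, 2.0, 2.0))
large_iou = iou(large_gt, shift(large_gt, 2.0, 2.0))
print(f"small box IoU: {small_iou:.3f}, large box IoU: {large_iou:.3f}")
```

With these assumed boxes the small-box IoU is 36/92 (about 0.39) while the large-box IoU is 3844/4348 (about 0.88), so an identical shift would fail a 0.5 IoU threshold for the tiny object and pass it for the large one.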

Citations: 0
On the Upper Bounds of Number of Linear Regions and Generalization Error of Deep Convolutional Neural Networks.
Pub Date : 2025-03-05 DOI: 10.1109/TPAMI.2025.3548620
Degang Chen, Jiayu Liu, Xiaoya Che

Understanding the effect of the hyperparameters of the network structure on the performance of Convolutional Neural Networks (CNNs) remains the most fundamental and urgent issue in deep learning, and in this paper we attempt to address this issue based on the piecewise linear (PWL) function nature of CNNs. Firstly, the operations of convolutions, ReLUs and max pooling in a CNN are represented as the multiplication of multiple matrices for a fixed sample in order to obtain an algebraic expression of CNNs; this expression clearly suggests that CNNs are PWL functions. Although such a representation has high time complexity, it provides a more convenient and intuitive way to study the mathematical properties of CNNs. Secondly, we develop a tight bound on the number of linear regions and upper bounds on the generalization error of CNNs, both taking into account factors such as the number of layers, the dimension of pooling, and the width of the network. These results provide possible guidance for designing and training CNNs.
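The "product of matrices for a fixed sample" idea can be illustrated with a small Python sketch. It is not the paper's construction: it assumes a bias-free fully connected ReLU network as a stand-in for a CNN, and the helper names (`relu_net`, `local_linear_map`) are hypothetical. For a fixed input, each ReLU acts like a 0/1 diagonal matrix, so the whole network collapses to a single matrix that is valid on the linear region containing that input.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b (both lists of rows)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def relu_net(weights, x):
    """Forward pass of a bias-free ReLU network; the last layer is linear."""
    h = x
    for w in weights[:-1]:
        h = [max(0.0, z) for z in matvec(w, h)]
    return matvec(weights[-1], h)


def local_linear_map(weights, x):
    """Return the matrix A such that relu_net(weights, x) == A x.

    A is W_L * D_{L-1} * W_{L-1} * ... * D_1 * W_1, where D_i is the 0/1
    diagonal activation pattern of layer i at the fixed sample x.
    """
    h = x
    a = None
    for w in weights[:-1]:
        z = matvec(w, h)
        mask = [1.0 if v > 0 else 0.0 for v in z]
        # D @ W: zero out the rows that belong to inactive units.
        dw = [[m * e for e in row] for m, row in zip(mask, w)]
        a = dw if a is None else matmul(dw, a)
        h = [max(0.0, v) for v in z]
    return weights[-1] if a is None else matmul(weights[-1], a)


# Hypothetical two-layer network and sample.
weights = [[[1.0, -1.0], [0.5, 2.0]], [[1.0, 1.0]]]
x = [1.0, 2.0]
A = local_linear_map(weights, x)
print(relu_net(weights, x), matvec(A, x))
```

The same matrix `A` also reproduces the network output at nearby inputs that share the activation pattern of `x`, which is what makes the function piecewise linear. Convolution is itself a linear map and max pooling selects one input per window, so a similar per-sample argument is plausible for CNNs, but that extension is not shown here.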

Citations: 0
Replay Without Saving: Prototype Derivation and Distribution Rebalance for Class-Incremental Semantic Segmentation.
Pub Date : 2025-02-25 DOI: 10.1109/TPAMI.2025.3545966
Jinpeng Chen, Runmin Cong, Yuxuan Luo, Horace Ho Shing Ip, Sam Kwong

Research on class-incremental semantic segmentation (CISS) seeks to enhance semantic segmentation methods by enabling the progressive learning of new classes while preserving knowledge of previously learned ones. A significant yet often neglected challenge in this domain is class imbalance. In CISS, each task focuses on different foreground classes, with the training set for each task exclusively comprising images that contain these currently focused classes. This results in an overrepresentation of these classes within the single-task training set, leading to a classification bias towards them. To address this issue, we propose a novel CISS method named STAR, whose core principle is to reintegrate the missing proportions of previous classes into current single-task training samples by replaying their prototypes. Moreover, we develop a prototype derivation technique that enables the deduction of past-class prototypes, integrating the recognition patterns of the classifiers and the extraction patterns of the feature extractor. With this technique, replay can be accomplished without using any storage to save prototypes. Complementing our method, we devise two loss functions to enforce cross-task feature constraints: the Old-Class Features Maintaining (OCFM) loss and the Similarity-Aware Discriminative (SAD) loss. The OCFM loss is designed to stabilize the feature space of old classes, thus preserving previously acquired knowledge without compromising the ability to learn new classes. The SAD loss aims to enhance feature distinctions between similar old and new class pairs, minimizing potential confusion. Our experiments on two public datasets, Pascal VOC 2012 and ADE20K, demonstrate that STAR achieves state-of-the-art performance.

Citations: 0
Fully-Connected Transformer for Multi-Source Image Fusion
Pub Date : 2025-02-05 DOI: 10.1109/TPAMI.2024.3523364
Xiao Wu;Zi-Han Cao;Ting-Zhu Huang;Liang-Jian Deng;Jocelyn Chanussot;Gemine Vivone
Multi-source image fusion combines the information coming from multiple images into a single output, thus improving imaging quality. This topic has aroused great interest in the community. How to integrate information from different sources is still a big challenge, although the existing self-attention-based transformer methods can capture spatial and channel similarities. In this paper, we first discuss the mathematical concepts behind the proposed generalized self-attention mechanism, where the existing self-attentions are considered basic forms. The proposed mechanism employs multilinear algebra to drive the development of a novel fully-connected self-attention (FCSA) method to fully exploit local and non-local domain-specific correlations among multi-source images. Moreover, we propose a multi-source image representation, embedding it into the FCSA framework as a non-local prior within an optimization problem. Several different fusion problems are unfolded into the proposed fully-connected transformer fusion network (FC-Former). More specifically, the concept of generalized self-attention can promote the potential development of self-attention. Hence, the FC-Former can be viewed as a network model unifying different fusion tasks. Compared with state-of-the-art methods, the proposed FC-Former method exhibits robust and superior performance, showing its capability of faithfully preserving information.
Volume 47, Issue 3, pp. 2071-2088
Citations: 0
Exploring Frequency-Inspired Optimization in Transformer for Efficient Single Image Super-Resolution
Pub Date : 2025-01-24 DOI: 10.1109/TPAMI.2025.3529927
Ao Li;Le Zhang;Yun Liu;Ce Zhu
Transformer-based methods have exhibited remarkable potential in single image super-resolution (SISR) by effectively extracting long-range dependencies. However, most of the current research in this area has prioritized the design of transformer blocks to capture global information, while overlooking the importance of incorporating high-frequency priors, which we believe could be beneficial. In our study, we conducted a series of experiments and found that transformer structures are more adept at capturing low-frequency information, but have limited capacity in constructing high-frequency representations when compared to their convolutional counterparts. Our proposed solution, the cross-refinement adaptive feature modulation transformer (CRAFT), integrates the strengths of both convolutional and transformer structures. It comprises three key components: the high-frequency enhancement residual block (HFERB) for extracting high-frequency information, the shift rectangle window attention block (SRWAB) for capturing global information, and the hybrid fusion block (HFB) for refining the global representation. To tackle the inherent intricacies of transformer structures, we introduce a frequency-guided post-training quantization (PTQ) method aimed at enhancing CRAFT's efficiency. These strategies incorporate adaptive dual clipping and boundary refinement. To further amplify the versatility of our proposed approach, we extend our PTQ strategy to function as a general quantization method for transformer-based SISR techniques. Our experimental findings showcase CRAFT's superiority over current state-of-the-art methods, both in full-precision and quantization scenarios. These results underscore the efficacy and universality of our PTQ strategy.
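For readers unfamiliar with post-training quantization, the following Python sketch shows a generic uniform quantizer with separate lower and upper clipping bounds. It is not the paper's frequency-guided PTQ, adaptive dual clipping, or boundary refinement, none of which are specified in the abstract. The function name, the fixed bounds, and the 4-bit setting are all illustrative assumptions.

```python
def quantize_dequantize(values, lo, hi, bits=8):
    """Clip values to [lo, hi], map them to 2**bits uniform levels, and map back.

    Returns the dequantized values, i.e. what a quantized layer would
    effectively compute with. lo and hi are assumed to be chosen beforehand,
    for example from calibration statistics.
    """
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    levels = (1 << bits) - 1          # number of quantization steps
    scale = (hi - lo) / levels        # width of one step
    out = []
    for v in values:
        clipped = min(max(v, lo), hi)             # saturate outliers
        q = round((clipped - lo) / scale)         # integer code in [0, levels]
        out.append(lo + q * scale)                # dequantize
    return out


# Hypothetical activations with two outliers.
activations = [-10.0, -1.0, 0.0, 0.3, 1.0, 10.0]
restored = quantize_dequantize(activations, lo=-1.0, hi=1.0, bits=4)
print(restored)
```

Choosing the clipping range is the central trade-off in this kind of scheme: a tight range saturates outliers but keeps the step size small for typical values, while a wide range does the opposite.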
Volume 47, Issue 4, pp. 3141-3158
Citations: 0
Heterogeneous Feature Re-Sampling for Balanced Pedestrian Attribute Recognition
Pub Date : 2025-01-21 DOI: 10.1109/TPAMI.2025.3526930
Yibo Zhou;Bo Li;Hai-Miao Hu;Xiaokang Zhang;Dongping Zhang;Hanzi Wang
In pedestrian attribute recognition (PAR), the loose umbrella term ‘attribute’ ranges from human soft-biometrics to worn accessories, and even extends to various subjective body descriptors. As a result, the vast coverage of ‘attributes’ implies that, instead of being over-specialized to limited attributes with exclusive characteristics, PAR should be approached from a much more fundamental perspective. To this end, given that most attributes are greatly under-represented in real-world datasets, we simply distill PAR into a visual task of multi-label recognition under significant data imbalance. Accordingly, we introduce feature re-sampled detached learning (FRDL) to decouple label-balanced learning from the curse of attribute co-occurrence. Specifically, FRDL is able to balance the sampling distribution of an attribute without biasing the label prior of other co-occurring attributes. As a complementary method, we also propose gradient-oriented augment translating (GOAT) to alleviate the feature noise and semantics imbalance aggravated in FRDL. Integrated in a highly unified framework, FRDL and GOAT substantially refresh the state-of-the-art performance on various realistic benchmarks, while maintaining a minimal computational budget. Further analytical discussion and experimental evidence corroborate the veracity of our advancement: this is the first work that establishes label-independent and impartial balanced learning for PAR.
Volume 47, Issue 4, pp. 2706-2722
Citations: 0
One-for-All: Towards Universal Domain Translation With a Single StyleGAN
Pub Date : 2025-01-21 DOI: 10.1109/TPAMI.2025.3530099
Yong Du;Jiahui Zhan;Xinzhe Li;Junyu Dong;Sheng Chen;Ming-Hsuan Yang;Shengfeng He
In this paper, we propose a novel translation model, UniTranslator, for transforming representations between visually distinct domains under conditions of limited training data and significant visual differences. The main idea behind our approach is to leverage the domain-neutral capabilities of CLIP as a bridging mechanism, while utilizing a separate module to extract abstract, domain-agnostic semantics from the embeddings of both the source and target realms. Fusing these abstract semantics with target-specific semantics results in a transformed embedding within the CLIP space. To bridge the gap between the disparate worlds of CLIP and StyleGAN, we introduce a new non-linear mapper, the CLIP2P mapper. Utilizing CLIP embeddings, this module is tailored to approximate the latent distribution in StyleGAN's latent space, effectively acting as a connector between these two spaces. The proposed UniTranslator is versatile and capable of performing various tasks, including style mixing, stylization, and translations, even in visually challenging scenarios across different visual domains. Notably, UniTranslator generates high-quality translations that showcase domain relevance, diversity, and improved image quality. UniTranslator surpasses the performance of existing general-purpose models and performs well against specialized models in representative tasks.
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2865-2881. Publication date: 2025-01-21. DOI: 10.1109/TPAMI.2025.3530099
Citations: 0