
Latest Publications: IEEE Transactions on Pattern Analysis and Machine Intelligence

DiffTF++: 3D-Aware Diffusion Transformer for Large-Vocabulary 3D Generation
Pub Date: 2025-01-10 DOI: 10.1109/TPAMI.2025.3528247
Ziang Cao;Fangzhou Hong;Tong Wu;Liang Pan;Ziwei Liu
Generating diverse and high-quality 3D assets automatically poses a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework to address these challenges with a single model. To handle the large diversity and complexity in geometry and texture across categories efficiently, we 1) adopt an improved triplane representation to guarantee efficiency; 2) introduce the 3D-aware transformer to aggregate the generalized 3D knowledge with specialized 3D features; and 3) devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware Diffusion model with TransFormer, DiffTF, we propose a stronger version for 3D generation, i.e., DiffTF++. It boils down to two parts: multi-view reconstruction loss and triplane refinement. Specifically, we utilize the multi-view reconstruction loss to fine-tune the diffusion model and triplane decoder, thereby avoiding the negative influence caused by reconstruction errors and improving texture synthesis. By eliminating the mismatch between the two stages, the generative performance is enhanced, especially in texture. Additionally, a 3D-aware refinement process is introduced to filter out artifacts and refine triplanes, resulting in the generation of more intricate and reasonable details. Extensive experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules and the state-of-the-art 3D object generation performance with large diversity, rich semantics, and high quality.
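The triplane representation at the heart of DiffTF++ factorizes a 3D feature volume into three axis-aligned feature planes, which is what keeps large-vocabulary generation tractable. The abstract does not include code, so the following is only a minimal sketch of how querying such a triplane typically works; the function and variable names are hypothetical, and summing the three plane features is an assumption rather than the authors' exact design.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, points):
    """Query a triplane at 3D points in [-1, 1]^3.

    planes: dict with 'xy', 'xz', 'yz' feature maps, each of shape (C, H, W).
    points: (N, 3) coordinates.
    Each point is projected onto the three planes, features are bilinearly
    interpolated and summed, replacing a dense O(H*W*D) voxel grid with
    three O(H*W) planes.
    """
    feats = 0
    for name, dims in (("xy", (0, 1)), ("xz", (0, 2)), ("yz", (1, 2))):
        grid = points[:, dims].view(1, -1, 1, 2)            # (1, N, 1, 2)
        plane = planes[name].unsqueeze(0)                    # (1, C, H, W)
        sampled = F.grid_sample(plane, grid, align_corners=True)
        feats = feats + sampled.squeeze(0).squeeze(-1).t()   # (N, C)
    return feats

# toy usage: 32-channel planes, 1024 query points
planes = {k: torch.randn(32, 64, 64) for k in ("xy", "xz", "yz")}
points = torch.rand(1024, 3) * 2 - 1
print(sample_triplane(planes, points).shape)  # torch.Size([1024, 32])
```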
{"title":"DiffTF++: 3D-Aware Diffusion Transformer for Large-Vocabulary 3D Generation","authors":"Ziang Cao;Fangzhou Hong;Tong Wu;Liang Pan;Ziwei Liu","doi":"10.1109/TPAMI.2025.3528247","DOIUrl":"10.1109/TPAMI.2025.3528247","url":null,"abstract":"Generating diverse and high-quality 3D assets automatically poses a fundamental yet challenging task in 3D computer vision. Despite extensive efforts in 3D generation, existing optimization-based approaches struggle to produce large-scale 3D assets efficiently. Meanwhile, feed-forward methods often focus on generating only a single category or a few categories, limiting their generalizability. Therefore, we introduce a diffusion-based feed-forward framework to address these challenges <italic>with a single model</i>. To handle the large diversity and complexity in geometry and texture across categories efficiently, we <bold>1</b>) adopt improved triplane to guarantee efficiency; <bold>2</b>) introduce the 3D-aware transformer to aggregate the generalized 3D knowledge with specialized 3D features; and <bold>3</b>) devise the 3D-aware encoder/decoder to enhance the generalized 3D knowledge. Building upon our 3D-aware <bold>Diff</b>usion model with <bold>T</b>rans<bold>F</b>ormer, <bold>DiffTF</b>, we propose a stronger version for 3D generation, i.e., <bold>DiffTF++</b>. It boils down to two parts: multi-view reconstruction loss and triplane refinement. Specifically, we utilize multi-view reconstruction loss to fine-tune the diffusion model and triplane decoder, thereby avoiding the negative influence caused by reconstruction errors and improving texture synthesis. By eliminating the mismatch between the two stages, the generative performance is enhanced, especially in texture. Additionally, a 3D-aware refinement process is introduced to filter out artifacts and refine triplanes, resulting in the generation of more intricate and reasonable details. Extensive experiments on ShapeNet and OmniObject3D convincingly demonstrate the effectiveness of our proposed modules and the state-of-the-art 3D object generation performance with large diversity, rich semantics, and high quality.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"3018-3030"},"PeriodicalIF":0.0,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142961273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
TDGI: Translation-Guided Double-Graph Inference for Document-Level Relation Extraction
Pub Date: 2025-01-10 DOI: 10.1109/TPAMI.2025.3528246
Lingling Zhang;Yujie Zhong;Qinghua Zheng;Jun Liu;Qianying Wang;Jiaxin Wang;Xiaojun Chang
Document-level relation extraction (DocRE) aims to predict the relations of all entity pairs in a document and plays an important role in information extraction. DocRE is more challenging than previous sentence-level relation extraction, as it often requires coreference and logical reasoning across multiple sentences. Graph-based methods are the mainstream solution to this complex reasoning in DocRE. They generally construct heterogeneous graphs with entities, mentions, and sentences as nodes, and co-occurrence and co-reference relations as edges. Their performance is difficult to improve further because the semantics and direction of relations are not jointly considered in the graph inference process. To this end, we propose a novel translation-guided double-graph inference network named TDGI for DocRE. On one hand, TDGI includes two relation semantics-aware and direction-aware reasoning graphs, i.e., a mention graph and an entity graph, to mine relations among long-distance entities more explicitly. Each graph consists of three elements: vectorized nodes, edges, and direction weights. On the other hand, we devise a translation-based graph updating strategy that guides the embeddings of mention/entity nodes, relation edges, and direction weights following a specific translation algebraic structure, thereby enhancing the reasoning ability of TDGI. In the training procedure of TDGI, we minimize the relation multi-classification loss and the triple contrastive loss together to guarantee the model’s stability and robustness. Comprehensive experiments on three widely used datasets show that TDGI achieves outstanding performance compared with state-of-the-art baselines.
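The "translation algebraic structure" referenced above is the TransE-style constraint that a head embedding plus a relation embedding should land near the tail embedding. As a rough illustration of how a triple contrastive loss can be built on that structure, here is a minimal sketch; it is not the authors' TDGI update rule, and the names translation_score and triple_contrastive_loss are hypothetical.

```python
import torch
import torch.nn.functional as F

def translation_score(head, relation, tail):
    """TransE-style score ||h + r - t||_1; a small value means the triple
    fits the translation structure h + r ~= t."""
    return torch.norm(head + relation - tail, p=1, dim=-1)

def triple_contrastive_loss(h, r, t, t_neg, margin=1.0):
    """Margin loss pushing positive triples to score lower than corrupted ones.

    h, r, t, t_neg: (batch, dim) embeddings; t_neg is a negative (corrupted) tail.
    """
    positive = translation_score(h, r, t)
    negative = translation_score(h, r, t_neg)
    return F.relu(margin + positive - negative).mean()

# toy usage
h, r, t, t_neg = (torch.randn(8, 64, requires_grad=True) for _ in range(4))
loss = triple_contrastive_loss(h, r, t, t_neg)
loss.backward()  # gradients flow back into the node/edge embeddings
```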
{"title":"TDGI: Translation-Guided Double-Graph Inference for Document-Level Relation Extraction","authors":"Lingling Zhang;Yujie Zhong;Qinghua Zheng;Jun Liu;Qianying Wang;Jiaxin Wang;Xiaojun Chang","doi":"10.1109/TPAMI.2025.3528246","DOIUrl":"10.1109/TPAMI.2025.3528246","url":null,"abstract":"Document-level relation extraction (DocRE) aims at predicting relations of all entity pairs in one document, which plays an important role in information extraction. DocRE is more challenging than previous sentence-level relation extraction, as it often requires coreference and logical reasoning across multiple sentences. Graph-based methods are the mainstream solution to this complex reasoning in DocRE. They generally construct the heterogeneous graphs with entities, mentions, and sentences as nodes, co-occurrence and co-reference relations as edges. Their performance is difficult to further break through because the semantics and direction of the relation are not jointly considered in graph inference process. To this end, we propose a novel translation-guided double-graph inference network named TDGI for DocRE. On one hand, TDGI includes two relation semantics-aware and direction-aware reasoning graphs, i.e., mention graph and entity graph, to mine relations among long-distance entities more explicitly. Each graph consists of three elements: vectorized nodes, edges, and direction weights. On the other hand, we devise an interesting translation-based graph updating strategy that guides the embeddings of mention/entity nodes, relation edges, and direction weights following the specific translation algebraic structure, thereby to enhance the reasoning skills of TDGI. In the training procedure of TDGI, we minimize the relation multi-classification loss and triple contrastive loss together to guarantee the model’s stability and robustness. Comprehensive experiments on three widely-used datasets show that TDGI achieves outstanding performance comparing with state-of-the-art baselines.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2647-2659"},"PeriodicalIF":0.0,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142961306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Latent Weight Quantization for Integerized Training of Deep Neural Networks
Pub Date: 2025-01-09 DOI: 10.1109/TPAMI.2025.3527498
Wen Fei;Wenrui Dai;Liang Zhang;Luoming Zhang;Chenglin Li;Junni Zou;Hongkai Xiong
Existing methods for integerized training speed up deep learning by using low-bitwidth integerized weights, activations, gradients, and optimizer buffers. However, they overlook the issue of full-precision latent weights, which consume excessive memory to accumulate gradient-based updates for optimizing the integerized weights. In this paper, we propose the first latent weight quantization schema for general integerized training, which minimizes quantization perturbation to the training process via residual quantization with an optimized dual quantizer. We leverage residual quantization to eliminate the correlation between latent weights and integerized weights, thereby suppressing quantization noise. We further propose a dual quantizer with an optimal nonuniform codebook to avoid frozen weights and ensure a training trajectory that is statistically unbiased with respect to the full-precision latent weights. The codebook is optimized to minimize the disturbance to weight updates under importance guidance and is realized with a three-segment polyline approximation for hardware-friendly implementation. Extensive experiments show that the proposed schema allows integerized training with latent weights as low as 4 bits for various architectures, including ResNets, MobileNetV2, and Transformers, and yields negligible performance loss in image classification and text generation. Furthermore, we successfully fine-tune Large Language Models with up to 13 billion parameters on a single GPU using the proposed schema.
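The memory saving comes from storing the latent weight itself in low-bit form while keeping the update trajectory close to full precision via residual quantization. The sketch below illustrates only that two-stage idea; it uses a plain uniform quantizer rather than the paper's optimized nonuniform dual codebook, and the function names are hypothetical.

```python
import torch

def uniform_quantize(x, bits=4):
    """Symmetric uniform quantizer: returns integer codes and the scale."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    codes = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    return codes, scale

def residual_quantize_latent(latent_w, bits=4):
    """Two-stage residual quantization of a latent weight tensor.

    Stage 1 approximates the weight; stage 2 quantizes what stage 1 missed,
    so the stored low-bit state tracks the full-precision latent more closely.
    """
    q1, s1 = uniform_quantize(latent_w, bits)
    residual = latent_w - q1 * s1
    q2, s2 = uniform_quantize(residual, bits)
    dequantized = q1 * s1 + q2 * s2
    return dequantized, (q1, s1, q2, s2)

# toy usage: the residual stage shrinks the quantization error
w = torch.randn(256, 256)
w_hat, low_bit_state = residual_quantize_latent(w, bits=4)
print((w - w_hat).abs().mean())
```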
{"title":"Latent Weight Quantization for Integerized Training of Deep Neural Networks","authors":"Wen Fei;Wenrui Dai;Liang Zhang;Luoming Zhang;Chenglin Li;Junni Zou;Hongkai Xiong","doi":"10.1109/TPAMI.2025.3527498","DOIUrl":"10.1109/TPAMI.2025.3527498","url":null,"abstract":"Existing methods for integerized training speed up deep learning by using low-bitwidth integerized weights, activations, gradients, and optimizer buffers. However, they overlook the issue of full-precision latent weights, which consume excessive memory to accumulate gradient-based updates for optimizing the integerized weights. In this paper, we propose the first latent weight quantization schema for general integerized training, which minimizes quantization perturbation to training process via residual quantization with optimized dual quantizer. We leverage residual quantization to eliminate the correlation between latent weight and integerized weight for suppressing quantization noise. We further propose dual quantizer with optimal nonuniform codebook to avoid frozen weight and ensure statistically unbiased training trajectory as full-precision latent weight. The codebook is optimized to minimize the disturbance on weight update under importance guidance and achieved with a three-segment polyline approximation for hardware-friendly implementation. Extensive experiments show that the proposed schema allows integerized training with lowest 4-bit latent weight for various architectures including ResNets, MobileNetV2, and Transformers, and yields negligible performance loss in image classification and text generation. Furthermore, we successfully fine-tune Large Language Models with up to 13 billion parameters on one single GPU using the proposed schema.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2816-2832"},"PeriodicalIF":0.0,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142940446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Auto-Pairing Positives Through Implicit Relation Circulation for Discriminative Self-Learning
Pub Date: 2025-01-09 DOI: 10.1109/TPAMI.2025.3526802
Bo Pang;Zhenyu Wei;Jingli Lin;Cewu Lu
Contrastive learning, a discriminative self-learning framework, is one of the most popular representation learning methods and has a wide range of application scenarios. Although related techniques have been continuously updated in recent years, designing and seeking positive pairs remains unavoidable. Precisely because explicit positive pairs are required, contrastive learning is restricted in dense, multi-modal, and other scenarios where positive pairs are difficult to obtain. To solve this problem, in this paper, we design an auto-pairing mechanism called Implicit Relation Circulation (IRC) for discriminative self-learning frameworks. Its core idea is to conduct a random walk among multiple feature groups we want to contrast but without an explicit matchup, which we call the complex task (Task C). By linking the head and tail of the random walk into a circulation with a simple task (Task S) containing easily obtained pairs, we can apply cycle consistency as supervision to gradually and automatically learn the desired positive pairs along the random walk over feature groups. We provide several applications of IRC: we can learn 1) effective dense image pixel relations and representations with only image-level pairs; 2) 3D temporal point-level multi-modal point cloud relations and representations; and 3) even image representations with the help of language, without off-the-shelf vision-language pairs. As an easy-to-use plug-and-play mechanism, we evaluate its universality and robustness with multiple self-learning algorithms, tasks, and datasets, achieving stable and significant improvements. As an illustrative example, IRC improves the SOTA performance by about 3.0 mIoU on image semantic segmentation, 1.5 mIoU on 3D segmentation, 1.3 mAP on 3D detection, and an average of 1.2 points in top-1 accuracy on image classification with the help of the auto-learned positive pairs. Importantly, these improvements are achieved with little parameter and computation overhead. We hope IRC can provide the community with new insight into discriminative self-learning.
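IRC's circulation can be pictured as a soft random walk across feature groups that must return to where it started, with cycle consistency as the only supervision. The sketch below shows that mechanism in isolation, assuming cosine-similarity transition matrices and a simple identity target; it is an illustration of the idea rather than the authors' implementation, and the names soft_walk and cycle_consistency_loss are hypothetical.

```python
import torch
import torch.nn.functional as F

def soft_walk(source, target, temperature=0.07):
    """One random-walk step: a row-stochastic transition matrix from the
    source feature group to the target group via softmaxed cosine similarity."""
    s = F.normalize(source, dim=-1)
    t = F.normalize(target, dim=-1)
    return F.softmax(s @ t.t() / temperature, dim=-1)

def cycle_consistency_loss(groups):
    """Walk group0 -> group1 -> ... -> group0 and ask the round trip to be identity.

    groups: list of (n, d) feature tensors, all with the same n.
    """
    hops = groups + [groups[0]]                      # close the circulation
    walk = None
    for src, dst in zip(hops[:-1], hops[1:]):
        step = soft_walk(src, dst)
        walk = step if walk is None else walk @ step
    target = torch.arange(groups[0].size(0), device=groups[0].device)
    return F.nll_loss(torch.log(walk + 1e-8), target)

# toy usage: three feature groups of 16 samples each, no explicit positive pairs
feature_groups = [torch.randn(16, 128) for _ in range(3)]
print(cycle_consistency_loss(feature_groups))
```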
{"title":"Auto-Pairing Positives Through Implicit Relation Circulation for Discriminative Self-Learning","authors":"Bo Pang;Zhenyu Wei;Jingli Lin;Cewu Lu","doi":"10.1109/TPAMI.2025.3526802","DOIUrl":"10.1109/TPAMI.2025.3526802","url":null,"abstract":"Contrastive learning, a discriminative self-learning framework, is one of the most popular representation learning methods which has a wide range of application scenarios. Although relative techniques have been continuously updated in recent years, designing and seeking positive pairs are still inevitable. Just because of the requirement of explicit positive pairs, the utilization of contrastive learning is restricted in dense, multi-modal, and other scenarios where positive pairs are difficult to obtain. To solve this problem, in this paper, we design an auto-pairing mechanism called <bold>I</b>mplicit <bold>R</b>elation <bold>C</b>irculation (<bold>IRC</b>) for discriminative self-learning frameworks. Its core idea is to conduct a random walk among multiple feature groups we want to contrast but without explicit matchup, which we call the complex task (Task C). By linking the head and tail of the random walk to form a circulation with a simple task (task S) containing easy-obtaining pairs, we can apply cycle consistency as supervision guidance to gradually learn the wanted positive pairs among the random walk of feature groups automatically. We provide several amazing applications of IRC: we can learn 1) effective dense image pixel relations and representation with only image-level pairs; 2) 3D temporal point-level multi-modal point cloud relations and representation; and 3) even image representation with the help of language without off-the-shelf vision-language pairs. As an easy-to-use plug-and-play mechanism, we evaluate its universality and robustness with multiple self-learning algorithms, tasks, and datasets, achieving stable and significant improvements. As an illustrative example, IRC improves the SOTA performance by about 3.0 mIoU on image semantic segmentation, 1.5 mIoU on 3D segmentation, 1.3 mAP on 3D detection, and an average of 1.2 top1 accuracy on image classification with the help of the auto-learned positive pairs. Importantly, these improvements are achieved with little parameter and computation overhead. We hope IRC can provide the community with new insight into discriminative self-learning.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2739-2753"},"PeriodicalIF":0.0,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142940448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Foundation Models Defining a New Era in Vision: A Survey and Outlook
Pub Date: 2025-01-09 DOI: 10.1109/TPAMI.2024.3506283
Muhammad Awais;Muzammal Naseer;Salman Khan;Rao Muhammad Anwer;Hisham Cholakkal;Mubarak Shah;Ming-Hsuan Yang;Fahad Shahbaz Khan
Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. Models that learn to bridge the gap between such modalities, coupled with large-scale training data, facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as foundation models. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene, or manipulating the robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundation models, including typical architecture designs to combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns: textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundation models in computer vision, including difficulties in their evaluations and benchmarking, gaps in their real-world understanding, limitations of contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively.
{"title":"Foundation Models Defining a New Era in Vision: A Survey and Outlook","authors":"Muhammad Awais;Muzammal Naseer;Salman Khan;Rao Muhammad Anwer;Hisham Cholakkal;Mubarak Shah;Ming-Hsuan Yang;Fahad Shahbaz Khan","doi":"10.1109/TPAMI.2024.3506283","DOIUrl":"10.1109/TPAMI.2024.3506283","url":null,"abstract":"Vision systems that see and reason about the compositional nature of visual scenes are fundamental to understanding our world. The complex relations between objects and their locations, ambiguities, and variations in the real-world environment can be better described in human language, naturally governed by grammatical rules and other modalities such as audio and depth. The models learned to bridge the gap between such modalities and large-scale training data facilitate contextual reasoning, generalization, and prompt capabilities at test time. These models are referred to as <italic>foundation models</i>. The output of such models can be modified through human-provided prompts without retraining, e.g., segmenting a particular object by providing a bounding box, having interactive dialogues by asking questions about an image or video scene or manipulating the robot's behavior through language instructions. In this survey, we provide a comprehensive review of such emerging foundation models, including typical architecture designs to combine different modalities (vision, text, audio, etc.), training objectives (contrastive, generative), pre-training datasets, fine-tuning mechanisms, and the common prompting patterns; textual, visual, and heterogeneous. We discuss the open challenges and research directions for foundation models in computer vision, including difficulties in their evaluations and benchmarking, gaps in their real-world understanding, limitations of contextual understanding, biases, vulnerability to adversarial attacks, and interpretability issues. We review recent developments in this field, covering a wide range of applications of foundation models systematically and comprehensively.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2245-2264"},"PeriodicalIF":0.0,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142940447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Conditional Diffusion Models for Camouflaged and Salient Object Detection
Pub Date: 2025-01-08 DOI: 10.1109/TPAMI.2025.3527469
Ke Sun;Zhongxi Chen;Xianming Lin;Xiaoshuai Sun;Hong Liu;Rongrong Ji
Camouflaged Object Detection (COD) poses a significant challenge in computer vision and plays a critical role in applications. Existing COD methods often struggle to predict nuanced boundaries accurately and with high confidence. In this work, we introduce CamoDiffusion, a new learning method that employs a conditional diffusion model to generate masks that progressively refine the boundaries of camouflaged objects. In particular, we first design an adaptive transformer conditional network, tailored for integration into a denoising network, which facilitates iterative refinement of the saliency masks. Second, based on classical diffusion model training, we investigate a variance noise schedule and a structure corruption strategy, which aim to enhance the accuracy of our denoising model by effectively handling uncertain input. Third, we introduce a Consensus Time Ensemble technique, which integrates intermediate predictions using a sampling mechanism, thus reducing overconfidence and incorrect predictions. Finally, we conduct extensive experiments on three benchmark datasets, which show that: 1) our method is effective and universal in both camouflaged and salient object detection tasks; 2) CamoDiffusion outperforms existing state-of-the-art methods; and 3) CamoDiffusion offers flexible enhancements, such as an accelerated version based on the VQ-VAE model and a skip approach.
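The Consensus Time Ensemble can be thought of as fusing the mask estimates produced at several reverse-diffusion timesteps instead of trusting the final step alone. The snippet below is only a schematic of that fusion step, assuming per-pixel foreground probabilities; the simple averaging and the function name are simplifications, not the paper's exact sampling mechanism.

```python
import torch

def consensus_time_ensemble(step_predictions, threshold=0.5):
    """Fuse mask predictions collected at several denoising timesteps.

    step_predictions: list of (H, W) tensors of per-pixel foreground
    probabilities. Averaging over timesteps damps overconfident,
    single-step mistakes before the final binarization.
    """
    probs = torch.stack(step_predictions, dim=0).mean(dim=0)  # (H, W) consensus
    return (probs > threshold).float(), probs

# toy usage: three intermediate predictions from the sampling trajectory
predictions = [torch.rand(64, 64) for _ in range(3)]
mask, confidence = consensus_time_ensemble(predictions)
```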
{"title":"Conditional Diffusion Models for Camouflaged and Salient Object Detection","authors":"Ke Sun;Zhongxi Chen;Xianming Lin;Xiaoshuai Sun;Hong Liu;Rongrong Ji","doi":"10.1109/TPAMI.2025.3527469","DOIUrl":"10.1109/TPAMI.2025.3527469","url":null,"abstract":"Camouflaged Object Detection (COD) poses a significant challenge in computer vision, playing a critical role in applications. Existing COD methods often exhibit challenges in accurately predicting nuanced boundaries with high-confidence predictions. In this work, we introduce CamoDiffusion, a new learning method that employs a conditional diffusion model to generate masks that progressively refine the boundaries of camouflaged objects. In particular, we first design an adaptive transformer conditional network, specifically designed for integration into a Denoising Network, which facilitates iterative refinement of the saliency masks. Second, based on the classical diffusion model training, we investigate a variance noise schedule and a structure corruption strategy, which aim to enhance the accuracy of our denoising model by effectively handling uncertain input. Third, we introduce a Consensus Time Ensemble technique, which integrates intermediate predictions using a sampling mechanism, thus reducing overconfidence and incorrect predictions. Finally, we conduct extensive experiments on three benchmark datasets that show that: 1) the efficacy and universality of our method is demonstrated in both camouflaged and salient object detection tasks. 2) compared to existing state-of-the-art methods, CamoDiffusion demonstrates superior performance 3) CamoDiffusion offers flexible enhancements, such as an accelerated version based on the VQ-VAE model and a skip approach.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2833-2848"},"PeriodicalIF":0.0,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142936803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Enhancing Object Detection With Fourier Series
Pub Date: 2025-01-08 DOI: 10.1109/TPAMI.2025.3526990
Jin Liu;Zhongyuan Lu;Yaorong Cen;Hui Hu;Zhenfeng Shao;Yong Hong;Ming Jiang;Miaozhong Xu
Traditional object detection models often lose the detailed outline information of the object. To address this problem, we propose Fourier Series Object Detection (FSD). It encodes the object's closed outline curve as two one-dimensional periodic Fourier series. The Fourier Series Model (FSM) is constructed to regress the Fourier series for each object in the image; thus, during inference, the detailed outline information of each object can be retrieved. We introduce Rolling Optimization Matching for the Fourier loss to ensure that the model's learning process is not affected by the choice of starting point among the labeled contour points, speeding up training. The FSM demonstrates improved feature extraction and descriptive capabilities for non-rectangular or elongated object regions. The model achieves AP50 = 73.3% on the DOTA 1.5 dataset, surpassing the state-of-the-art (SOTA) result of 66.86% by 6.44 percentage points. On the UCAS dataset, the model achieves AP50 = 97.25%, also surpassing the SOTA methods. Furthermore, we introduce the object's Fourier power spectrum to describe outline features and the Fourier vector to indicate its direction. This enhances the scene semantic representation of the object detection model and paves a new pathway for the evolution of object detection methodologies.
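Since the central idea is representing a closed outline as two one-dimensional periodic Fourier series, a tiny numerical example helps make it concrete: sample the contour, keep only the low-frequency coefficients of x(t) and y(t), and invert. This is a generic illustration of that encoding, not the FSD/FSM architecture itself, and the helper names are hypothetical.

```python
import numpy as np

def contour_to_fourier(xs, ys, n_coeffs=16):
    """Encode a closed outline as two truncated 1-D Fourier series,
    one for x(t) and one for y(t) along the curve parameter t."""
    return np.fft.rfft(xs)[:n_coeffs], np.fft.rfft(ys)[:n_coeffs]

def fourier_to_contour(cx, cy, n_points):
    """Reconstruct an approximate closed outline from the truncated series."""
    full = n_points // 2 + 1
    cx_full = np.zeros(full, dtype=complex)
    cy_full = np.zeros(full, dtype=complex)
    cx_full[:len(cx)] = cx          # keep only the low-frequency coefficients
    cy_full[:len(cy)] = cy
    return np.fft.irfft(cx_full, n=n_points), np.fft.irfft(cy_full, n=n_points)

# toy usage: a noisy circle encoded with 8 coefficients per coordinate
t = np.linspace(0, 2 * np.pi, 128, endpoint=False)
xs = np.cos(t) + 0.02 * np.random.randn(128)
ys = np.sin(t) + 0.02 * np.random.randn(128)
cx, cy = contour_to_fourier(xs, ys, n_coeffs=8)
rx, ry = fourier_to_contour(cx, cy, n_points=128)
print(np.mean(np.hypot(rx - xs, ry - ys)))  # small reconstruction error
```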
{"title":"Enhancing Object Detection With Fourier Series","authors":"Jin Liu;Zhongyuan Lu;Yaorong Cen;Hui Hu;Zhenfeng Shao;Yong Hong;Ming Jiang;Miaozhong Xu","doi":"10.1109/TPAMI.2025.3526990","DOIUrl":"10.1109/TPAMI.2025.3526990","url":null,"abstract":"Traditional object detection models often lose the detailed outline information of the object. To address this problem, we propose the Fourier Series Object Detection (FSD). It encodes the object's outline closed curve into two one-dimensional periodic Fourier series. The Fourier Series Model (FSM) is constructed to regress the Fourier series for each object in the image. Thus, during inference, the detailed outline information of each object can be retrieved. We introduce Rolling Optimization Matching for Fourier loss to ensure that the model's learning process is not affected by the sequence of the starting points of the labeled contour points, speeding up the training process. The FSM demonstrates improved feature extraction and descriptive capabilities for non-rectangular or elongated object regions. The model achieves AP50 = 73.3% on the DOTA 1.5 dataset, which surpasses the state-of-the-art (SOTA) method by 6.44% at 66.86%. On the UCAS dataset, the model achieves AP50 = 97.25%, also surpassing the performance indicators of the SOTA methods. Furthermore, we introduce the object's Fourier power spectrum to describe outline features and the Fourier vector to indicate its direction. This enhances the scene semantic representation of the object detection model and paves a new pathway for the evolution of object detection methodologies.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2581-2596"},"PeriodicalIF":0.0,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142936801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Image Quality Assessment: Exploring Joint Degradation Effect of Deep Network Features via Kernel Representation Similarity Analysis
Pub Date: 2025-01-08 DOI: 10.1109/TPAMI.2025.3527004
Xingran Liao;Xuekai Wei;Mingliang Zhou;Hau-San Wong;Sam Kwong
Typically, deep network-based full-reference image quality assessment (FR-IQA) models compare deep features from reference and distorted images pairwise, overlooking correlations among features from the same source. We propose a dual-branch framework to capture the joint degradation effect among deep network features. The first branch uses kernel representation similarity analysis (KRSA), which compares feature self-similarity matrices via the mean absolute error (MAE). The second branch conducts pairwise comparisons via the MAE, and a training-free logarithmic summation of both branches derives the final score. Our approach contributes in three ways. First, integrating the KRSA with pairwise comparisons enhances the model’s perceptual awareness. Second, our approach is adaptable to diverse network architectures. Third, our approach can guide perceptual image enhancement. Extensive experiments on 10 datasets validate our method’s efficacy, demonstrating that perceptual deformation widely exists in diverse IQA scenarios and that measuring the joint degradation effect can discern appealing content deformations.
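A compact way to see the dual-branch design: one branch compares how spatial positions relate to each other within each image (the self-similarity, or kernel representation, matrix), the other compares features position by position, and the two MAE terms are fused by a training-free logarithmic sum. The sketch below follows that description at a schematic level; the cosine kernel, the summation over feature stages, and all names are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def self_similarity(feat):
    """Kernel (self-similarity) matrix of one feature map:
    (C, H, W) -> (HW, HW) cosine similarities between spatial positions."""
    f = F.normalize(feat.flatten(1).t(), dim=-1)   # (HW, C)
    return f @ f.t()

def krsa_branch(ref_feat, dist_feat):
    """MAE between the two images' self-similarity matrices."""
    return (self_similarity(ref_feat) - self_similarity(dist_feat)).abs().mean()

def pairwise_branch(ref_feat, dist_feat):
    """Plain element-wise MAE between reference and distorted features."""
    return (ref_feat - dist_feat).abs().mean()

def quality_score(ref_feats, dist_feats, eps=1e-8):
    """Training-free logarithmic fusion of both branches over feature stages;
    lower means the distorted image is closer to the reference."""
    krsa = sum(krsa_branch(r, d) for r, d in zip(ref_feats, dist_feats))
    pair = sum(pairwise_branch(r, d) for r, d in zip(ref_feats, dist_feats))
    return torch.log(krsa + eps) + torch.log(pair + eps)

# toy usage with two random "deep feature" stages of shape (C, H, W)
ref = [torch.randn(64, 28, 28), torch.randn(128, 14, 14)]
dist = [r + 0.1 * torch.randn_like(r) for r in ref]
print(quality_score(ref, dist))
```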
{"title":"Image Quality Assessment: Exploring Joint Degradation Effect of Deep Network Features via Kernel Representation Similarity Analysis","authors":"Xingran Liao;Xuekai Wei;Mingliang Zhou;Hau-San Wong;Sam Kwong","doi":"10.1109/TPAMI.2025.3527004","DOIUrl":"10.1109/TPAMI.2025.3527004","url":null,"abstract":"Typically, deep network-based full-reference image quality assessment (FR-IQA) models compare deep features from reference and distorted images pairwise, overlooking correlations among features from the same source. We propose a dual-branch framework to capture the joint degradation effect among deep network features. The first branch uses kernel representation similarity analysis (KRSA), which compares feature self-similarity matrices via the mean absolute error (MAE). The second branch conducts pairwise comparisons via the MAE, and a training-free logarithmic summation of both branches derives the final score. Our approach contributes in three ways. First, integrating the KRSA with pairwise comparisons enhances the model’s perceptual awareness. Second, our approach is adaptable to diverse network architectures. Third, our approach can guide perceptual image enhancement. Extensive experiments on 10 datasets validate our method’s efficacy, demonstrating that perceptual deformation widely exists in diverse IQA scenarios and that measuring the joint degradation effect can discern appealing content deformations.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2799-2815"},"PeriodicalIF":0.0,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142936802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Robust Asymmetric Heterogeneous Federated Learning With Corrupted Clients
Pub Date: 2025-01-08 DOI: 10.1109/TPAMI.2025.3527137
Xiuwen Fang;Mang Ye;Bo Du
This paper studies a challenging robust federated learning task with model-heterogeneous and data-corrupted clients, where clients have different local model structures. Data corruption is unavoidable in real-world deployment due to factors such as random noise, compression artifacts, or environmental conditions, and it can drastically cripple the entire federated system. To address these issues, this paper introduces a novel Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework. We propose a Diversity-enhanced supervised Contrastive Learning technique to enhance the resilience and adaptability of local models to various data corruption patterns. Its basic idea is to utilize complex augmented samples obtained by the mixed-data augmentation strategy for supervised contrastive learning, thereby enhancing the ability of the model to learn robust and diverse feature representations. Furthermore, we design an Asymmetric Heterogeneous Federated Learning strategy to resist corrupt feedback from external clients. The strategy allows clients to perform selective one-way learning during the collaborative learning phase, enabling them to refrain from incorporating lower-quality information from less robust or underperforming collaborators. Extensive experimental results demonstrate the effectiveness and robustness of our approach in diverse, challenging federated learning environments.
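The "Diversity-enhanced supervised Contrastive Learning" component combines harder, mixed augmentations with a supervised contrastive objective. Below is a minimal sketch of those two generic ingredients, a standard SupCon-style loss and a mixup-style blend of two augmented views; it is not the RAHFL training code, and the function names and the Beta-mixing choice are assumptions.

```python
import torch
import torch.nn.functional as F

def supcon_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss: samples sharing a label act as positives.

    features: (N, d) embeddings, labels: (N,) integer class labels.
    """
    z = F.normalize(features, dim=-1)
    sim = z @ z.t() / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))            # drop self-pairs
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1).clamp(min=1)
    loss = -log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1) / pos_count
    return loss[pos_mask.any(dim=1)].mean()                    # anchors with positives

def mixed_augmentation(view_a, view_b, alpha=0.4):
    """Blend two augmented views of the same batch into harder positives
    (labels stay unchanged)."""
    lam = torch.distributions.Beta(alpha, alpha).sample()
    return lam * view_a + (1 - lam) * view_b

# toy usage: two augmented views of an 8-image batch with 3 classes
labels = torch.randint(0, 3, (8,))
view_a, view_b = torch.randn(8, 128), torch.randn(8, 128)
embeddings = mixed_augmentation(view_a, view_b)
print(supcon_loss(embeddings, labels))
```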
{"title":"Robust Asymmetric Heterogeneous Federated Learning With Corrupted Clients","authors":"Xiuwen Fang;Mang Ye;Bo Du","doi":"10.1109/TPAMI.2025.3527137","DOIUrl":"10.1109/TPAMI.2025.3527137","url":null,"abstract":"This paper studies a challenging robust federated learning task with model heterogeneous and data corrupted clients, where the clients have different local model structures. Data corruption is unavoidable due to factors, such as random noise, compression artifacts, or environmental conditions in real-world deployment, drastically crippling the entire federated system. To address these issues, this paper introduces a novel Robust Asymmetric Heterogeneous Federated Learning (RAHFL) framework. We propose a Diversity-enhanced supervised Contrastive Learning technique to enhance the resilience and adaptability of local models on various data corruption patterns. Its basic idea is to utilize complex augmented samples obtained by the mixed-data augmentation strategy for supervised contrastive learning, thereby enhancing the ability of the model to learn robust and diverse feature representations. Furthermore, we design an Asymmetric Heterogeneous Federated Learning strategy to resist corrupt feedback from external clients. The strategy allows clients to perform selective one-way learning during collaborative learning phase, enabling clients to refrain from incorporating lower-quality information from less robust or underperforming collaborators. Extensive experimental results demonstrate the effectiveness and robustness of our approach in diverse, challenging federated learning environments.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2693-2705"},"PeriodicalIF":0.0,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142936808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Hybrid-Prediction Integrated Planning for Autonomous Driving
Pub Date: 2025-01-08 DOI: 10.1109/TPAMI.2025.3526936
Haochen Liu;Zhiyu Huang;Wenhui Huang;Haohan Yang;Xiaoyu Mo;Chen Lv
Autonomous driving systems require a comprehensive understanding and accurate prediction of the surrounding environment to facilitate informed decision-making in complex scenarios. Recent advances in learning-based systems have highlighted the importance of integrating prediction and planning. However, this integration poses significant alignment challenges, ranging from consistency between prediction patterns to interaction between future prediction and planning. To address these challenges, we introduce a Hybrid-Prediction integrated Planning (HPP) framework, which operates through three novel modules collaboratively. First, we introduce marginal-conditioned occupancy prediction to align joint occupancy with agent-specific motion forecasting. Our proposed MS-OccFormer module achieves spatial-temporal alignment with motion predictions across multiple granularities. Second, we propose a game-theoretic motion predictor, GTFormer, to model the interactive dynamics among agents based on their joint predictive awareness. Third, hybrid prediction patterns are concurrently integrated into the Ego Planner and optimized by prediction guidance. The HPP framework establishes state-of-the-art performance on the nuScenes dataset, demonstrating superior accuracy and safety in end-to-end configurations. Moreover, HPP's interactive open-loop and closed-loop planning performance is demonstrated on the Waymo Open Motion Dataset (WOMD) and the CARLA benchmark, outperforming existing integrated pipelines by achieving enhanced consistency between prediction and planning.
{"title":"Hybrid-Prediction Integrated Planning for Autonomous Driving","authors":"Haochen Liu;Zhiyu Huang;Wenhui Huang;Haohan Yang;Xiaoyu Mo;Chen Lv","doi":"10.1109/TPAMI.2025.3526936","DOIUrl":"10.1109/TPAMI.2025.3526936","url":null,"abstract":"Autonomous driving systems require a comprehensive understanding and accurate prediction of the surrounding environment to facilitate informed decision-making in complex scenarios. Recent advances in learning-based systems have highlighted the importance of integrating prediction and planning. However, this integration poses significant alignment challenges through consistency between prediction patterns, to interaction between future prediction and planning. To address these challenges, we introduce a Hybrid-Prediction integrated Planning (HPP) framework, which operates through three novel modules collaboratively. First, we introduce marginal-conditioned occupancy prediction to align joint occupancy with agent-specific motion forecasting. Our proposed MS-OccFormer module achieves spatial-temporal alignment with motion predictions across multiple granularities. Second, we propose a game-theoretic motion predictor, GTFormer, to model the interactive dynamics among agents based on their joint predictive awareness. Third, hybrid prediction patterns are concurrently integrated into the Ego Planner and optimized by prediction guidance. The HPP framework establishes state-of-the-art performance on the nuScenes dataset, demonstrating superior accuracy and safety in end-to-end configurations. Moreover, HPP’s interactive open-loop and closed-loop planning performance are demonstrated on the Waymo Open Motion Dataset (WOMD) and CARLA benchmark, outperforming existing integrated pipelines by achieving enhanced consistency between prediction and planning.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 4","pages":"2597-2614"},"PeriodicalIF":0.0,"publicationDate":"2025-01-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142936805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0