Pub Date: 2024-08-16 | DOI: 10.1109/TPAMI.2024.3444912
Mu Hu, Wei Yin, Chi Zhang, Zhipeng Cai, Xiaoxiao Long, Hao Chen, Kaixuan Wang, Gang Yu, Chunhua Shen, Shaojie Shen
We introduce Metric3D v2, a geometric foundation model for zero-shot metric depth and surface normal estimation from a single image, which is crucial for metric 3D recovery. While depth and normal are geometrically related and highly complementary, they present distinct challenges. State-of-the-art (SoTA) monocular depth methods achieve zero-shot generalization by learning affine-invariant depths, which cannot recover real-world metrics. Meanwhile, SoTA normal estimation methods have limited zero-shot performance due to the lack of large-scale labeled data. To tackle these issues, we propose solutions for both metric depth estimation and surface normal estimation. For metric depth estimation, we show that the key to a zero-shot single-view model lies in resolving the metric ambiguity from various camera models and large-scale data training. We propose a canonical camera space transformation module, which explicitly addresses the ambiguity problem and can be effortlessly plugged into existing monocular models. For surface normal estimation, we propose a joint depth-normal optimization module to distill diverse data knowledge from metric depth, enabling normal estimators to learn beyond normal labels. Equipped with these modules, our depth-normal models can be stably trained with over 16 million images from thousands of camera models with annotations of different types, resulting in zero-shot generalization to in-the-wild images with unseen camera settings. Our method currently ranks first on various zero-shot and non-zero-shot benchmarks for metric depth, affine-invariant depth, and surface-normal prediction, as shown in Fig. 1. Notably, we surpass the recent MarigoldDepth and DepthAnything on various depth benchmarks including NYUv2 and KITTI. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. 
The potential benefits extend to downstream tasks, which can be significantly improved by simply plugging in our model. For example, our model relieves the scale-drift issue of monocular SLAM (Fig. 3), leading to high-quality metric-scale dense mapping. These applications highlight the versatility of Metric3D v2 models as geometric foundation models. Our project page is at https://JUGGHM.github.io/Metric3Dv2.
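The canonical camera transformation can be illustrated with a small sketch (an assumption-laden simplification of the idea, not the paper's actual module): rescaling depth by the ratio of a chosen canonical focal length to the image's focal length removes the metric ambiguity across cameras, and the inverse map restores metric depth at inference. The canonical focal value of 1000 px and the label-scaling variant below are illustrative choices.

```python
def to_canonical_depth(depth_m, focal_px, canonical_focal_px=1000.0):
    """Training side: map metric depth into the canonical camera space,
    so images from cameras with different focal lengths become comparable."""
    return depth_m * canonical_focal_px / focal_px

def to_metric_depth(canonical_depth, focal_px, canonical_focal_px=1000.0):
    """Inference side: undo the transform to recover real-world metric depth."""
    return canonical_depth * focal_px / canonical_focal_px

# Round trip: a 5 m point seen through a 500 px focal camera
d_c = to_canonical_depth(5.0, 500.0)   # 10.0 in canonical space
d_m = to_metric_depth(d_c, 500.0)      # back to 5.0 m
```

An equivalent variant rescales the input image instead of the depth labels; either way the network only ever sees one virtual camera.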
Title: Metric3D v2: A Versatile Monocular Geometric Foundation Model for Zero-shot Metric Depth and Surface Normal Estimation
Directly regressing the non-rigid shape and camera pose from individual 2D frames is ill-suited to the Non-Rigid Structure-from-Motion (NRSfM) problem. This frame-by-frame 3D reconstruction pipeline overlooks the inherent spatial-temporal nature of NRSfM, i.e., reconstructing the 3D sequence from the input 2D sequence. In this paper, we propose to solve deep sparse NRSfM from a sequence-to-sequence translation perspective, where the input 2D keypoint sequence is taken as a whole to reconstruct the corresponding 3D keypoint sequence in a self-supervised manner. First, we apply a shape-motion predictor to the input sequence to obtain an initial sequence of shapes and corresponding motions. Then, we propose the Context Layer, which enables the deep learning framework to impose overall constraints on sequences based on the structural characteristics of non-rigid sequences. The Context Layer builds modules that impose self-expressiveness regularity on non-rigid sequences, with multi-head attention (MHA) as the core and temporal encoding alongside it; the two act together to constrain non-rigid sequences within the deep framework. Experimental results across datasets such as Human3.6M, CMU Mocap, and InterHand prove the superiority of our framework. The code will be made publicly available.
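The Context Layer's two ingredients, attention across the whole sequence and temporal encoding, can be sketched in a minimal single-head, weight-free form (a stand-in for the paper's MHA-based module; the feature sizes and function names are illustrative, not the authors' code):

```python
import numpy as np

def temporal_encoding(T, d):
    """Sinusoidal temporal encoding (Transformer-style), shape (T, d)."""
    pos = np.arange(T)[:, None]
    i = np.arange(d)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def self_attention(x):
    """Plain dot-product self-attention across the time axis; x is (T, d).
    Each frame's output is a weighted mix of all frames, which is how the
    sequence is processed as a whole rather than frame by frame."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

T, d = 8, 16
seq = np.random.randn(T, d)                      # stand-in for per-frame keypoint features
ctx = self_attention(seq + temporal_encoding(T, d))
```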
Title: Deep Non-rigid Structure-from-Motion: A Sequence-to-Sequence Translation Perspective
Authors: Hui Deng, Tong Zhang, Yuchao Dai, Jiawei Shi, Yiran Zhong, Hongdong Li
Pub Date: 2024-08-16 | DOI: 10.1109/TPAMI.2024.3443922
Pub Date: 2024-08-16 | DOI: 10.1109/TPAMI.2024.3444910
Tao Xiong, Yachao Li, Mengdao Xing
When the locations of non-zero samples are known, the Moore-Penrose inverse (MPI) can be used for data recovery in compressive sensing (CS). First, the prior from the locations is used to shrink the measurement matrix in CS. The data can then be recovered by applying the MPI of this shrunken matrix. We also prove that the recovery results of the original CS formulation and our MPI-based method are mathematically identical. Based on this finding, a novel sidelobe-reduction method for synthetic aperture radar (SAR) and polarimetric SAR (POLSAR) images is studied. The aim of sidelobe reduction is to recover the samples within the mainlobes and suppress those within the sidelobes. In our study, the prior from spatial variant apodization (SVA) is used to determine the locations of the mainlobes and the sidelobes. With CS, the mainlobe area can be well recovered; samples within the sidelobe areas are recovered using background fusion. Our method is suitable for acquired data of large sizes. The performance of the proposed algorithm is evaluated with acquired spaceborne SAR and airborne POLSAR data. In our experiments, we use 1 m spaceborne SAR data of size 10000 (samples) × 10000 (samples) and 0.3 m POLSAR data of size 10000 (samples) × 26000 (samples) for sidelobe suppression. We also verified that our method does not affect the polarization signatures. The effectiveness of the sidelobe suppression is qualitatively examined, with satisfactory results.
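The core recovery step can be sketched directly in NumPy (a toy illustration of the idea, not the authors' implementation): with the support known, the measurement matrix is shrunk to the supported columns, and the Moore-Penrose inverse of the shrunken matrix recovers the samples exactly whenever the number of measurements exceeds the sparsity.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, k = 64, 24, 5                  # signal length, measurements, sparsity
support = rng.choice(n, size=k, replace=False)
x = np.zeros(n)
x[support] = rng.standard_normal(k)  # k-sparse signal with known support
Phi = rng.standard_normal((m, n))    # CS measurement matrix
y = Phi @ x                          # compressed measurements

# Shrink the measurement matrix to the known support, then apply the
# Moore-Penrose inverse of the shrunken matrix.
Phi_s = Phi[:, support]
x_hat = np.zeros(n)
x_hat[support] = np.linalg.pinv(Phi_s) @ y

print(np.allclose(x_hat, x))  # → True
```

Because the shrunken matrix generically has full column rank when m ≥ k, the pseudoinverse solves the resulting least-squares problem exactly, which is the mathematical equivalence the abstract refers to.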
Title: Quality improvement synthetic aperture radar (SAR) images using compressive sensing (CS) with Moore-Penrose inverse (MPI) and prior from spatial variant apodization (SVA)
Pub Date: 2024-08-15 | DOI: 10.1109/TPAMI.2024.3443916
Zeyang Li, Chuxiong Hu, Yunan Wang, Yujie Yang, Shengbo Eben Li
Reinforcement learning (RL) agents are vulnerable to adversarial disturbances, which can deteriorate task performance or violate safety specifications. Existing methods either address safety requirements under the assumption of no adversary (e.g., safe RL) or only focus on robustness against performance adversaries (e.g., robust RL). Learning one policy that is both safe and robust under any adversary remains a challenging open problem. The difficulty lies in tackling two intertwined aspects in the worst case: feasibility and optimality. Optimality is only valid inside the feasible region (i.e., the robust invariant set), while identifying the maximal feasible region relies on learning the optimal policy. To address this issue, we propose a systematic framework to unify safe RL and robust RL, including the problem formulation, iteration scheme, convergence analysis, and practical algorithm design. The unification is built upon constrained two-player zero-sum Markov games, in which the objective for the protagonist is twofold. For states inside the maximal robust invariant set, the goal is to pursue rewards under the condition of guaranteed safety; for states outside it, the goal is to reduce the extent of constraint violation. A dual policy iteration scheme is proposed, which simultaneously optimizes a task policy and a safety policy. We prove that the iteration scheme converges to the optimal task policy, which maximizes the twofold objective in the worst case, and the optimal safety policy, which stays as far from the safety boundary as possible. The convergence of the safety policy is established by exploiting the monotone contraction property of safety self-consistency operators, and that of the task policy depends on the transformation of safety constraints into state-dependent action spaces. 
By adding two adversarial networks (one for the safety guarantee and the other for task performance), we propose a practical deep RL algorithm for constrained zero-sum Markov games, called dually robust actor-critic (DRAC). Evaluations on safety-critical benchmarks demonstrate that DRAC achieves high performance and persistent safety under all scenarios (no adversary, safety adversary, performance adversary), outperforming all baselines by a large margin.
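The worst-case (minimax) flavor of the zero-sum formulation can be illustrated with a tiny robust value iteration on a toy chain (a generic sketch of minimax dynamic programming, not the paper's DRAC algorithm or its constrained objective; the dynamics, rewards, and discount are invented for the example):

```python
import numpy as np

# Toy zero-sum chain: the protagonist picks a in {0,1,2} (push right),
# the adversary picks w in {0,1} (push left); next state = clip(s+a-w, 0, 2).
# State 2 is the goal (reward 1, absorbing); other states give reward 0.
gamma = 0.5
states, actions, adversary = range(3), range(3), range(2)

def step(s, a, w):
    return 2 if s == 2 else max(0, min(2, s + a - w))

def reward(s):
    return 1.0 if s == 2 else 0.0

# Minimax value iteration: max over the protagonist, min over the adversary.
V = np.zeros(3)
for _ in range(100):
    V = np.array([
        max(min(reward(s) + gamma * V[step(s, a, w)] for w in adversary)
            for a in actions)
        for s in states
    ])
# Fixed point: V = [0.5, 1.0, 2.0] -- the value each state retains
# even against the worst-case disturbance.
```

The paper's scheme layers a safety policy and constraint handling on top of this kind of worst-case backup; the sketch only shows the minimax core.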
Title: Safe Reinforcement Learning with Dual Robustness
Pub Date: 2024-08-15 | DOI: 10.1109/TPAMI.2024.3444002
Chengli Tan, Jiangshe Zhang, Junmin Liu, Yihong Gong
Lookahead is a popular stochastic optimizer that can accelerate the training process of deep neural networks. However, the solutions found by Lookahead often generalize worse than those found by its base optimizers, such as SGD and Adam. To address this issue, we propose Sharpness-Aware Lookahead (SALA), a novel optimizer that aims to identify flat minima that generalize well. SALA divides the training process into two stages. In the first stage, the direction towards flat regions is determined by leveraging a quadratic approximation of the optimization trajectory, without incurring any extra computational overhead. In the second stage, it is instead determined by Sharpness-Aware Minimization (SAM), which is particularly effective in improving generalization at the terminal phase of training. In contrast to Lookahead, SALA retains the benefits of accelerated convergence while also enjoying superior generalization performance compared to the base optimizer. Theoretical analysis of the expected excess risk, as well as empirical results on canonical neural network architectures and datasets, demonstrates the advantages of SALA over Lookahead. It is noteworthy that with approximately 25% more computational overhead than the base optimizer, SALA can achieve the same generalization performance as SAM, which requires twice the training budget of the base optimizer.
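The SAM update used in the second stage can be sketched in a few lines (a generic SAM step applied to a toy quadratic; the learning rate, radius rho, and loss are illustrative, and this is not SALA itself):

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One Sharpness-Aware Minimization step: perturb the weights to the
    approximate worst case within an L2 ball of radius rho, then descend
    using the gradient taken at that perturbed point."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)  # ascent direction
    g_sharp = grad_fn(w + eps)                   # gradient at the worst case
    return w - lr * g_sharp

# Toy quadratic loss L(w) = 0.5 * ||w||^2, whose gradient is w.
grad_fn = lambda w: w
w = np.array([3.0, -4.0])
for _ in range(200):
    w = sam_step(w, grad_fn)
# The iterate settles into a small neighborhood of the minimum at 0.
```

SAM costs two gradient evaluations per step, which is why it needs roughly twice the training budget of the base optimizer; SALA's two-stage design is what brings that overhead down to about 25%.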
Title: Sharpness-aware Lookahead for Accelerating Convergence and Improving Generalization
Sound source localization aims to localize the objects emitting sound in visual scenes. Recent works that obtain impressive results typically rely on contrastive learning. However, the common practice of randomly sampling negatives in prior art can lead to the false-negative issue, where sounds semantically similar to the visual instance are sampled as negatives and incorrectly pushed away from the visual anchor/query. The resulting misalignment of audio and visual features can yield inferior performance. To address this issue, we propose a novel audio-visual learning framework instantiated with two individual learning schemes: self-supervised predictive learning (SSPL) and semantic-aware contrastive learning (SACL). SSPL explores image-audio positive pairs alone to discover semantically coherent similarities between audio and visual features, and a predictive coding module for feature alignment is introduced to facilitate the positive-only learning. In this regard, SSPL acts as a negative-free method that eliminates false negatives. By contrast, SACL is designed to compact visual features and remove false negatives, providing a reliable visual anchor and audio negatives for contrast. Different from SSPL, SACL releases the potential of audio-visual contrastive learning, offering an effective alternative to achieve the same goal. Comprehensive experiments demonstrate the superiority of our approach over the state of the art. Furthermore, we highlight the versatility of the learned representation by extending the approach to audio-visual event classification and object detection tasks. Code and models are available at: https://github.com/zjsong/SACL.
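The false-negative elimination idea behind SACL can be sketched as an InfoNCE-style loss that filters suspect negatives before contrast (the threshold, temperature, and the source of the semantic similarities are assumptions for illustration; the paper's actual scheme differs in detail):

```python
import numpy as np

def contrastive_loss(anchor, positive, negatives, neg_sims,
                     sim_thresh=0.8, tau=0.07):
    """InfoNCE-style loss that drops likely false negatives: any candidate
    whose semantic similarity to the anchor exceeds sim_thresh is excluded
    from the denominator instead of being pushed away.

    anchor, positive: feature vectors; negatives: (N, d) candidate features;
    neg_sims: (N,) semantic similarities from some auxiliary embedding."""
    keep = neg_sims < sim_thresh
    negs = negatives[keep]
    logits = np.concatenate([[anchor @ positive], negs @ anchor]) / tau
    logits -= logits.max()                      # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

With every candidate filtered out, only the positive remains in the denominator and the loss collapses to zero, i.e., the anchor is no longer pushed away from semantically matching audio.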
Title: Enhancing Sound Source Localization via False Negative Elimination
Authors: Zengjie Song, Jiangshe Zhang, Yuxi Wang, Junsong Fan, Zhaoxiang Zhang
Pub Date: 2024-08-15 | DOI: 10.1109/TPAMI.2024.3444029
Pub Date: 2024-08-14 | DOI: 10.1109/TPAMI.2024.3443335
Yifan Zhang, Zhiyu Zhu, Junhui Hou, Dapeng Wu
The Detection Transformer (DETR) has revolutionized the design of CNN-based object detection systems, showcasing impressive performance. However, its potential in the domain of multi-frame 3D object detection remains largely unexplored. In this paper, we present STEMD, a novel end-to-end framework that enhances the DETR-like paradigm for multi-frame 3D object detection by addressing three key aspects specifically tailored to this task. First, to model inter-object spatial interactions and complex temporal dependencies, we introduce a spatial-temporal graph attention network, which represents queries as nodes in a graph and enables effective modeling of object interactions within a social context. Second, to compensate for hard cases missing from the encoder's proposals in the current frame, we incorporate the output of the previous frame to initialize the query input of the decoder. Finally, the network struggles to distinguish the positive query from other highly similar queries that are not the best match; such queries are insufficiently suppressed and turn into redundant prediction boxes. To address this issue, our proposed IoU regularization term encourages similar queries to become distinct during refinement. Through extensive experiments, we demonstrate the effectiveness of our approach in handling challenging scenarios, while incurring only a minor additional computational overhead. The code is publicly available at https://github.com/Eaphan/STEMD.
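The IoU regularization idea can be sketched as an off-diagonal overlap penalty among query boxes (an illustrative 2D axis-aligned form; the paper's exact term and box parameterization may differ):

```python
import numpy as np

def pairwise_iou(boxes):
    """Axis-aligned IoU for boxes given as (x1, y1, x2, y2), shape (N, 4)."""
    x1 = np.maximum(boxes[:, None, 0], boxes[None, :, 0])
    y1 = np.maximum(boxes[:, None, 1], boxes[None, :, 1])
    x2 = np.minimum(boxes[:, None, 2], boxes[None, :, 2])
    y2 = np.minimum(boxes[:, None, 3], boxes[None, :, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area[:, None] + area[None, :] - inter)

def iou_regularizer(boxes):
    """Mean IoU between *different* query boxes (off-diagonal entries).
    Minimizing this pushes near-duplicate queries apart, so they do not
    collapse into redundant predictions for the same object."""
    iou = pairwise_iou(boxes)
    n = len(boxes)
    return (iou.sum() - np.trace(iou)) / max(n * (n - 1), 1)
```

Two identical boxes give a penalty of 1, fully disjoint boxes give 0, so the term only acts on queries that actually overlap.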
Title: Spatial-Temporal Graph Enhanced DETR Towards Multi-Frame 3D Object Detection
Pub Date: 2024-08-14 | DOI: 10.1109/TPAMI.2024.3443141
Ming Jin, Huan Yee Koh, Qingsong Wen, Daniele Zambon, Cesare Alippi, Geoffrey I Webb, Irwin King, Shirui Pan
Time series are the primary data type used to record dynamic system measurements and are generated in great volume by both physical sensors and online processes (virtual sensors). Time series analytics is therefore crucial to unlocking the wealth of information implicit in available data. With recent advancements in graph neural networks (GNNs), there has been a surge in GNN-based approaches for time series analysis. These approaches can explicitly model inter-temporal and inter-variable relationships, which traditional and other deep neural network-based methods struggle to do. In this survey, we provide a comprehensive review of graph neural networks for time series analysis (GNN4TS), encompassing four fundamental dimensions: forecasting, classification, anomaly detection, and imputation. Our aim is to guide designers and practitioners in understanding, building applications with, and advancing research on GNN4TS. We first provide a comprehensive task-oriented taxonomy of GNN4TS. Then, we present and discuss representative research works and introduce mainstream applications of GNN4TS. A comprehensive discussion of potential future research directions completes the survey. This survey, for the first time, brings together a vast array of knowledge on GNN-based time series research, highlighting the foundations, practical applications, and opportunities of graph neural networks for time series analysis.
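The inter-variable modeling that distinguishes GNN4TS methods can be sketched as one message-passing step over the variable graph (a minimal generic graph convolution for intuition, not any specific surveyed method; shapes and the tanh nonlinearity are illustrative):

```python
import numpy as np

def graph_conv_step(X, A, W):
    """One message-passing step over the variable graph.

    X: (N, T) array of N series with T time steps each.
    A: (N, N) adjacency encoding inter-variable relationships.
    W: (T, T) weight matrix mixing temporal features.
    Row-normalizing A (with self-loops) averages each node's neighborhood
    before the temporal mix -- the basic block GNN4TS methods stack and gate."""
    A_hat = A + np.eye(len(A))                    # add self-loops
    D_inv = 1.0 / A_hat.sum(axis=1, keepdims=True)
    return np.tanh(D_inv * (A_hat @ X) @ W)

# Three correlated series on a chain graph 0 -- 1 -- 2
X = np.arange(12.0).reshape(3, 4)
A = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 0.0]])
H = graph_conv_step(X, A, np.eye(4))
```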
A Survey on Graph Neural Networks for Time Series: Forecasting, Classification, Imputation, and Anomaly Detection.
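The core idea the survey highlights, letting each series explicitly exchange information with related series over a variable graph, can be illustrated with a toy sketch. This is not code from any surveyed work; `variable_graph_conv` and its row-normalization choice are assumptions made purely for illustration:

```python
import numpy as np

def variable_graph_conv(x, adj):
    """One graph-convolution step over the variable graph: at each
    time step, every series aggregates the degree-normalized values
    of its neighboring series (inter-variable modeling)."""
    deg = adj.sum(axis=1)
    a_norm = adj / np.maximum(deg, 1e-8)[:, None]  # row-normalized adjacency
    return x @ a_norm.T                            # (T, n) -> (T, n)

# Two fully connected series: each time step swaps in the neighbor's value.
x = np.array([[1.0, 3.0],
              [2.0, 4.0]])
adj = np.array([[0.0, 1.0],
                [1.0, 0.0]])
h = variable_graph_conv(x, adj)  # [[3., 1.], [4., 2.]]
```

In a real GNN4TS model this spatial step would be interleaved with a temporal module (e.g., recurrence or temporal convolution) and learnable weights; the sketch shows only the inter-variable aggregation.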
Pub Date : 2024-08-14DOI: 10.1109/TPAMI.2023.3347082
Feihu Huang, Shangqian Gao, Jian Pei, Heng Huang
Zeroth-order (a.k.a. derivative-free) methods are a class of effective optimization methods for solving complex machine learning problems where gradients of the objective functions are unavailable or computationally prohibitive. Although many zeroth-order methods have been developed recently, these approaches still have two main drawbacks: 1) high function query complexity; 2) poor suitability for problems with complex penalties and constraints. To address these drawbacks, in this paper we propose a class of faster zeroth-order stochastic alternating direction method of multipliers (ADMM) methods (ZO-SPIDER-ADMM) to solve nonconvex finite-sum problems with multiple nonsmooth penalties. Moreover, we prove that the ZO-SPIDER-ADMM methods can achieve a lower function query complexity of [Formula: see text] for finding an ϵ-stationary point, which improves on the existing best nonconvex zeroth-order ADMM methods by a factor of [Formula: see text], where n and d denote the sample size and data dimension, respectively. At the same time, we propose a class of faster zeroth-order online ADMM methods (ZOO-ADMM+) to solve nonconvex online problems with multiple nonsmooth penalties. We also prove that the proposed ZOO-ADMM+ methods achieve a lower function query complexity of [Formula: see text], which improves the existing best result by a factor of [Formula: see text]. Extensive experimental results on structured adversarial attacks against black-box deep neural networks demonstrate the efficiency of our new algorithms.
Nonconvex Zeroth-Order Stochastic ADMM Methods with Lower Function Query Complexity.
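The general zeroth-order principle behind such methods, estimating gradients from function queries alone, can be sketched with a generic two-point random-direction estimator. This is a hedged illustration only; `zo_gradient`, its Gaussian smoothing directions, and the parameter choices are assumptions, not the paper's ZO-SPIDER-ADMM algorithm:

```python
import numpy as np

def zo_gradient(f, x, mu=1e-4, num_dirs=10, seed=None):
    """Two-point zeroth-order gradient estimate of f at x.

    Averages directional finite differences along random Gaussian
    directions; each direction costs exactly two function queries,
    which is why query complexity is the key metric for such methods."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    g = np.zeros(d)
    for _ in range(num_dirs):
        u = rng.standard_normal(d)
        g += (f(x + mu * u) - f(x - mu * u)) / (2 * mu) * u
    return g / num_dirs

# Sanity check on f(x) = ||x||^2, whose true gradient is 2x.
f = lambda x: np.dot(x, x)
x = np.array([1.0, -2.0, 3.0])
g = zo_gradient(f, x, num_dirs=2000, seed=0)  # approximately [2., -4., 6.]
```

Variance-reduced schemes such as SPIDER reuse such estimates across iterations to drive the query count down, which is the lever the paper's complexity bounds improve.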
Pub Date : 2024-08-14DOI: 10.1109/TPAMI.2024.3443110
Chao Chen, Haoyu Geng, Nianzu Yang, Xiaokang Yang, Junchi Yan
Dynamic graphs arise in various real-world applications, and modeling their dynamics in the continuous-time domain is often preferred for its flexibility. This paper aims to design an easy-to-use pipeline (EasyDGL, named in part for its implementation on the DGL toolkit) composed of three modules with both strong fitting ability and interpretability, namely encoding, training, and interpreting: i) a temporal point process (TPP) modulated attention architecture to endow continuous-time resolution with the coupled spatiotemporal dynamics of the graph with edge-addition events; ii) a principled loss composed of a task-agnostic TPP posterior maximization based on observed events, and a task-aware loss with a masking strategy over the dynamic graph, where the tasks include dynamic link prediction, dynamic node classification, and node traffic forecasting; iii) interpretation of the outputs (e.g., representations and predictions) with scalable perturbation-based quantitative analysis in the graph Fourier domain, which comprehensively reflects the behavior of the learned model. Empirical results on public benchmarks show superior performance on time-conditioned predictive tasks; in particular, EasyDGL can effectively quantify the predictive power of the frequency content that a model learns from evolving graph data.
EasyDGL: Encode, Train and Interpret for Continuous-Time Dynamic Graph Learning.
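The graph Fourier domain in which the interpretation module operates can be illustrated with a minimal sketch: project a node signal onto the eigenbasis of the symmetric normalized Laplacian, whose eigenvalues act as graph frequencies. `graph_fourier_transform` is a hypothetical helper for illustration, not EasyDGL's API:

```python
import numpy as np

def graph_fourier_transform(adj, signal):
    """Spectral coefficients of a node signal w.r.t. the symmetric
    normalized graph Laplacian L = I - D^{-1/2} A D^{-1/2}.

    adj: (n, n) symmetric adjacency matrix; signal: (n,) node values.
    Returns (eigenvalues, coefficients); eigenvalues are the graph
    frequencies, coefficients the signal's energy at each frequency."""
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    mask = deg > 0
    d_inv_sqrt[mask] = deg[mask] ** -0.5
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    evals, evecs = np.linalg.eigh(lap)  # frequencies and Fourier basis
    coeffs = evecs.T @ signal           # spectral coefficients
    return evals, coeffs

# 3-node path graph; the eigenbasis is orthonormal, so energy is preserved.
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])
signal = np.array([1.0, 2.0, 3.0])
evals, coeffs = graph_fourier_transform(adj, signal)
```

A perturbation-based analysis in this domain would perturb inputs, recompute such coefficients for the model's outputs, and compare how much predictive power sits in each frequency band.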