
Latest Publications in IEEE Transactions on Pattern Analysis and Machine Intelligence

Efficient Neural Collaborative Search for Pickup and Delivery Problems
Pub Date : 2024-09-03 DOI: 10.1109/TPAMI.2024.3450850
Detian Kong;Yining Ma;Zhiguang Cao;Tianshu Yu;Jianhua Xiao
In this paper, we introduce Neural Collaborative Search (NCS), a novel learning-based framework for efficiently solving pickup and delivery problems (PDPs). NCS pioneers the collaboration between the prevalent neural construction and neural improvement models, establishing a collaborative framework in which an improvement model iteratively refines solutions initiated by a construction model. Our NCS collaboratively trains the two models via reinforcement learning with an effective shared-critic mechanism. In addition, the construction model enhances the improvement model with high-quality initial solutions via curriculum learning, while the improvement model accelerates the convergence of the construction model through imitation learning. Beyond the new framework design, we also propose Neural Neighborhood Search (N2S), an efficient improvement model employed within the NCS framework. N2S exploits a tailored Markov decision process formulation and two customized decoders for removing and then reinserting a pair of pickup-delivery nodes, thereby learning a ruin-repair search process that addresses the precedence constraints in PDPs efficiently. To balance the computation cost between encoders and decoders, N2S streamlines the existing encoder design through a light Synthesis Attention mechanism that allows vanilla self-attention to synthesize various features of a route solution. Moreover, a diversity enhancement scheme is leveraged to further improve performance during N2S inference. Our NCS and N2S are both generic, and extensive experiments on two canonical PDP variants show that they produce state-of-the-art results among existing neural methods. Remarkably, NCS and N2S surpass the well-known LKH3 solver, especially on the more constrained PDP variant.
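To make the ruin-repair idea concrete, here is a minimal, non-neural Python sketch of the N2S-style neighborhood move: remove one pickup-delivery pair, then reinsert it at the best positions that keep the pickup before its delivery. The instance generator, greedy seed route, and acceptance rule are our own illustrative assumptions, not the paper's learned policies.

```python
import random, math

def dist(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def route_cost(route, coords):
    # The vehicle starts and ends at the depot (node 0).
    tour = [0] + route + [0]
    return sum(dist(coords[u], coords[v]) for u, v in zip(tour, tour[1:]))

def ruin_repair_step(route, coords, pairs):
    """Remove one pickup-delivery pair and reinsert it at the best slots.

    Precedence is preserved by only inserting the delivery at or after
    the pickup's new position (j >= i below).
    """
    p, d = random.choice(pairs)                      # ruin: pick a request
    partial = [n for n in route if n not in (p, d)]
    best, best_cost = None, float("inf")
    for i in range(len(partial) + 1):                # repair: enumerate slots
        for j in range(i, len(partial) + 1):
            cand = partial[:i] + [p] + partial[i:j] + [d] + partial[j:]
            c = route_cost(cand, coords)
            if c < best_cost:
                best, best_cost = cand, c
    return best, best_cost

random.seed(0)
coords, pairs = {0: (0.5, 0.5)}, []                  # node 0 is the depot
for k in range(6):                                   # 6 pickup-delivery requests
    p, d = 2 * k + 1, 2 * k + 2
    coords[p] = (random.random(), random.random())   # pickup location
    coords[d] = (random.random(), random.random())   # delivery location
    pairs.append((p, d))

route = [n for pd in pairs for n in pd]              # trivial feasible seed route
cost = route_cost(route, coords)
for _ in range(200):                                 # greedy ruin-repair search
    cand, c = ruin_repair_step(route, coords, pairs)
    if c < cost:
        route, cost = cand, c
print(f"final cost {cost:.3f}, route {route}")
```

In N2S, the pair selection and reinsertion positions are produced by learned decoders rather than the random choice and exhaustive enumeration used above.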
{"title":"Efficient Neural Collaborative Search for Pickup and Delivery Problems","authors":"Detian Kong;Yining Ma;Zhiguang Cao;Tianshu Yu;Jianhua Xiao","doi":"10.1109/TPAMI.2024.3450850","DOIUrl":"10.1109/TPAMI.2024.3450850","url":null,"abstract":"In this paper, we introduce Neural Collaborative Search (NCS), a novel learning-based framework for efficiently solving pickup and delivery problems (PDPs). NCS pioneers the collaboration between the latest prevalent neural construction and neural improvement models, establishing a collaborative framework where an improvement model iteratively refines solutions initiated by a construction model. Our NCS collaboratively trains the two models via reinforcement learning with an effective shared-critic mechanism. In addition, the construction model enhances the improvement model with high-quality initial solutions via curriculum learning, while the improvement model accelerates the convergence of the construction model through imitation learning. Besides the new framework design, we also propose the efficient Neural Neighborhood Search (N2S), an efficient improvement model employed within the NCS framework. N2S exploits a tailored Markov decision process formulation and two customized decoders for removing and then reinserting a pair of pickup-delivery nodes, thereby learning a ruin-repair search process for addressing the precedence constraints in PDPs efficiently. To balance the computation cost between encoders and decoders, N2S streamlines the existing encoder design through a light Synthesis Attention mechanism that allows the vanilla self-attention to synthesize various features regarding a route solution. Moreover, a diversity enhancement scheme is further leveraged to ameliorate the performance during the inference of N2S. Our NCS and N2S are both generic, and extensive experiments on two canonical PDP variants show that they can produce state-of-the-art results among existing neural methods. Remarkably, our NCS and N2S could surpass the well-known LKH3 solver especially on the more constrained PDP variant.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11019-11034"},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142127749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Panoptic-PartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation
Pub Date : 2024-09-03 DOI: 10.1109/TPAMI.2024.3453916
Xiangtai Li;Shilin Xu;Yibo Yang;Haobo Yuan;Guangliang Cheng;Yunhai Tong;Zhouchen Lin;Ming-Hsuan Yang;Dacheng Tao
Panoptic Part Segmentation (PPS) unifies panoptic and part segmentation into one task. Previous works utilize separate approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we first design a meta-architecture that decouples part features from thing/stuff features. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model Panoptic-PartFormer. Second, we propose a new metric, Part-Whole Quality (PWQ), which better measures this task from pixel-region and part-whole perspectives. It also decouples the errors of part segmentation and panoptic segmentation. Third, inspired by Mask2Former and based on our meta-architecture, we propose Panoptic-PartFormer++, designing a new part-whole interaction scheme that uses masked cross attention to further boost part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results. Our models can serve as a strong baseline and aid future research in PPS.
Source code and trained models will be released at https://github.com/lxtGH/Panoptic-PartFormer.
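As a rough illustration of a masked part-whole cross-attention step, the PyTorch sketch below lets part queries attend only to permitted thing/stuff ("whole") queries via a boolean attention mask. Shapes, names, and the masking rule are illustrative assumptions; this is not the released Panoptic-PartFormer++ code.

```python
import torch
import torch.nn as nn

class PartWholeCrossAttention(nn.Module):
    """Part queries attend to thing/stuff ("whole") queries.

    A boolean mask restricts which whole queries each part query may
    interact with, in the spirit of masked cross-attention.
    """
    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, part_q, whole_q, allow):
        # part_q: (B, Np, C), whole_q: (B, Nw, C)
        # allow:  (B, Np, Nw) boolean; True = this pair may interact.
        heads = self.attn.num_heads
        # attn_mask uses True = "block", so invert and repeat per head.
        mask = (~allow).repeat_interleave(heads, dim=0)  # (B*heads, Np, Nw)
        out, _ = self.attn(part_q, whole_q, whole_q, attn_mask=mask)
        return self.norm(part_q + out)                   # residual + norm

B, Np, Nw, C = 2, 16, 8, 256
layer = PartWholeCrossAttention(C)
parts, wholes = torch.randn(B, Np, C), torch.randn(B, Nw, C)
allow = torch.rand(B, Np, Nw) > 0.5
allow[..., 0] = True          # ensure every part sees at least one whole query
print(layer(parts, wholes, allow).shape)   # torch.Size([2, 16, 256])
```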
{"title":"Panoptic-PartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation","authors":"Xiangtai Li;Shilin Xu;Yibo Yang;Haobo Yuan;Guangliang Cheng;Yunhai Tong;Zhouchen Lin;Ming-Hsuan Yang;Dacheng Tao","doi":"10.1109/TPAMI.2024.3453916","DOIUrl":"10.1109/TPAMI.2024.3453916","url":null,"abstract":"Panoptic Part Segmentation (PPS) unifies panoptic and part segmentation into one task. Previous works utilize separate approaches to handle things, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, Panoptic-PartFormer. Moreover, we find the previous metric PartPQ biases to PQ. To handle both issues, we first design a meta-architecture that decouples part features and things/stuff features, respectively. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model as Panoptic-PartFormer. Second, we propose a new metric Part-Whole Quality (PWQ), better to measure this task from pixel-region and part-whole perspectives. It also decouples the errors for part segmentation and panoptic segmentation. Third, inspired by Mask2Former, based on our meta-architecture, we propose Panoptic-PartFormer++ and design a new part-whole cross-attention scheme to boost part segmentation qualities further. We design a new part-whole interaction method using masked cross attention. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with previous Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results. Our models can serve as a strong baseline and aid future research in PPS.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11087-11103"},"PeriodicalIF":0.0,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142127750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Deep Interactive Segmentation of Medical Images: A Systematic Review and Taxonomy
Pub Date : 2024-08-30 DOI: 10.1109/TPAMI.2024.3452629
Zdravko Marinov;Paul F. Jäger;Jan Egger;Jens Kleesiek;Rainer Stiefelhagen
Interactive segmentation is a crucial research area in medical image analysis that aims to boost the efficiency of costly annotations by incorporating human feedback. This feedback takes the form of clicks, scribbles, or masks and allows for iterative refinement of the model output, efficiently guiding the system toward the desired behavior. In recent years, deep learning-based approaches have propelled results to a new level, causing rapid growth in the field, with 121 methods proposed in the medical imaging domain alone. In this review, we provide a structured overview of this emerging field, featuring a comprehensive taxonomy, a systematic review of existing methods, and an in-depth analysis of current practices. Based on these contributions, we discuss the challenges and opportunities in the field. For instance, we find a severe lack of comparison across methods, which needs to be tackled by standardized baselines and benchmarks.
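The click-based feedback loop described above is typically benchmarked by simulating the user: each corrective click is placed deep inside the largest remaining error region. Below is a minimal sketch of that protocol, with a trivial stand-in "model" that paints a disc at each click; the click rule and the disc refinement are our own assumptions for illustration.

```python
import numpy as np
from scipy import ndimage

def next_click(pred: np.ndarray, gt: np.ndarray):
    """Simulate a corrective click inside the largest error region.

    Returns (row, col, is_positive): positive clicks mark missed
    foreground, negative clicks mark false positives.
    """
    error = pred != gt
    if not error.any():
        return None
    labels, n = ndimage.label(error)                 # connected error regions
    sizes = np.bincount(labels.ravel())[1:]          # region sizes
    region = labels == (int(np.argmax(sizes)) + 1)   # largest region
    # Click the error pixel deepest inside the region.
    depth = ndimage.distance_transform_edt(region)
    r, c = np.unravel_index(np.argmax(depth), depth.shape)
    return int(r), int(c), bool(gt[r, c])

gt = np.zeros((64, 64), bool)
gt[16:48, 20:44] = True                              # toy ground-truth mask
pred = np.zeros_like(gt)
for step in range(5):
    click = next_click(pred, gt)
    if click is None:
        break
    r, c, positive = click
    yy, xx = np.ogrid[:64, :64]
    disc = (yy - r) ** 2 + (xx - c) ** 2 <= 8 ** 2   # stand-in "refinement"
    pred = pred | disc if positive else pred & ~disc
    iou = (pred & gt).sum() / (pred | gt).sum()
    print(f"step {step}: {'+' if positive else '-'}({r},{c}) IoU={iou:.2f}")
```

A real interactive model would consume the accumulated clicks as extra input channels and predict a refined mask instead of painting discs.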
{"title":"Deep Interactive Segmentation of Medical Images: A Systematic Review and Taxonomy","authors":"Zdravko Marinov;Paul F. Jäger;Jan Egger;Jens Kleesiek;Rainer Stiefelhagen","doi":"10.1109/TPAMI.2024.3452629","DOIUrl":"10.1109/TPAMI.2024.3452629","url":null,"abstract":"Interactive segmentation is a crucial research area in medical image analysis aiming to boost the efficiency of costly annotations by incorporating human feedback. This feedback takes the form of clicks, scribbles, or masks and allows for iterative refinement of the model output so as to efficiently guide the system towards the desired behavior. In recent years, deep learning-based approaches have propelled results to a new level causing a rapid growth in the field with 121 methods proposed in the medical imaging domain alone. In this review, we provide a structured overview of this emerging field featuring a comprehensive taxonomy, a systematic review of existing methods, and an in-depth analysis of current practices. Based on these contributions, we discuss the challenges and opportunities in the field. For instance, we find that there is a severe lack of comparison across methods which needs to be tackled by standardized baselines and benchmarks.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10998-11018"},"PeriodicalIF":0.0,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10660300","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142101423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Unsupervised Active Visual Search With Monte Carlo Planning Under Uncertain Detections
Pub Date : 2024-08-29 DOI: 10.1109/TPAMI.2024.3451994
Francesco Taioli;Francesco Giuliari;Yiming Wang;Riccardo Berra;Alberto Castellini;Alessio Del Bue;Alessandro Farinelli;Marco Cristani;Francesco Setti
We propose a solution for Active Visual Search of objects in an environment whose 2D floor map is the only known information. Our solution has three key features that make it more plausible and robust to detector failures compared to state-of-the-art methods: i) it is unsupervised, as it does not need any training sessions; ii) during the exploration, a probability distribution on the 2D floor map is updated according to an intuitive mechanism, while an improved belief update increases the effectiveness of the agent's exploration; iii) we incorporate the awareness that an object detector may fail into the aforementioned probability modelling by exploiting the success statistics of a specific detector. Our solution is dubbed POMP-BE-PD (POMCP-based Online Motion Planning with Belief by Exploration and Probabilistic Detection). It uses the current pose of an agent and an RGB-D observation to learn an optimal search policy, exploiting a POMDP solved by a Monte-Carlo planning approach. On the Active Vision Dataset Benchmark, we increase the average success rate over all the environments by a significant 35% while decreasing the average path length by 4% with respect to competing methods. Thus, our results are state-of-the-art, even without any training procedure.
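The belief update with a fallible detector can be sketched as a plain Bayes filter over the floor-map grid. The field-of-view shape, the true-positive/false-positive rates, and the random exploration policy below are illustrative assumptions, not the POMP-BE-PD planner itself.

```python
import numpy as np

def belief_update(belief, observed, detected, tpr=0.8, fpr=0.1):
    """Bayes update of an object-location belief over a 2D floor grid.

    `observed` marks cells inside the current field of view; `detected`
    says whether the (fallible) detector fired. tpr/fpr stand in for the
    detector's success statistics.
    """
    # Likelihood of the observation given "the object is in this cell".
    like = np.where(observed,
                    tpr if detected else 1.0 - tpr,   # object was in view
                    fpr if detected else 1.0 - fpr)   # any firing is a false alarm
    post = like * belief
    return post / post.sum()

rng = np.random.default_rng(1)
H = W = 10
belief = np.full((H, W), 1.0 / (H * W))               # uniform prior
true_pos = (7, 3)                                     # hidden object cell
for _ in range(20):
    r, c = int(rng.integers(H)), int(rng.integers(W)) # random viewpoint
    observed = np.zeros((H, W), bool)
    observed[max(r - 1, 0):r + 2, max(c - 1, 0):c + 2] = True  # 3x3 view
    p_fire = 0.8 if observed[true_pos] else 0.1       # simulate the detector
    detected = rng.random() < p_fire
    belief = belief_update(belief, observed, detected)
print("belief argmax:", np.unravel_index(belief.argmax(), belief.shape),
      "true:", true_pos)
```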
{"title":"Unsupervised Active Visual Search With Monte Carlo Planning Under Uncertain Detections","authors":"Francesco Taioli;Francesco Giuliari;Yiming Wang;Riccardo Berra;Alberto Castellini;Alessio Del Bue;Alessandro Farinelli;Marco Cristani;Francesco Setti","doi":"10.1109/TPAMI.2024.3451994","DOIUrl":"10.1109/TPAMI.2024.3451994","url":null,"abstract":"We propose a solution for Active Visual Search of objects in an environment, whose 2D floor map is the only known information. Our solution has three key features that make it more plausible and robust to detector failures compared to state-of-the-art methods: \u0000<i>i)</i>\u0000 it is unsupervised as it does not need any training sessions. \u0000<i>ii)</i>\u0000 During the exploration, a probability distribution on the 2D floor map is updated according to an intuitive mechanism, while an improved belief update increases the effectiveness of the agent's exploration. \u0000<i>iii)</i>\u0000 We incorporate the awareness that an object detector may fail into the aforementioned probability modelling by exploiting the success statistics of a specific detector. Our solution is dubbed POMP-BE-PD (Pomcp-based Online Motion Planning with Belief by Exploration and Probabilistic Detection). It uses the current pose of an agent and an RGB-D observation to learn an optimal search policy, exploiting a POMDP solved by a Monte-Carlo planning approach. On the Active Vision Dataset Benchmark, we increase the average success rate over all the environments by a significant 35\u0000<inline-formula><tex-math>$%$</tex-math></inline-formula>\u0000 while decreasing the average path length by 4\u0000<inline-formula><tex-math>$%$</tex-math></inline-formula>\u0000 with respect to competing methods. Thus, our results are state-of-the-art, even without any training procedure.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11047-11058"},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10659171","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142101430","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Uni-to-Multi Modal Knowledge Distillation for Bidirectional LiDAR-Camera Semantic Segmentation
Pub Date : 2024-08-29 DOI: 10.1109/TPAMI.2024.3451658
Tianfang Sun;Zhizhong Zhang;Xin Tan;Yong Peng;Yanyun Qu;Yuan Xie
Combining LiDAR points and images for robust semantic segmentation has shown great potential. However, the heterogeneity between the two modalities (e.g., density and field of view) poses challenges in establishing a bijective mapping between each point and pixel. This modality alignment problem introduces new challenges in network design and data processing for cross-modal methods. Specifically, 1) some points are projected outside the image planes and thus have no pixel correspondence; 2) the complexity of maintaining geometric consistency limits the deployment of many data augmentation techniques. To address these challenges, we propose a cross-modal knowledge imputation and transition approach. First, we introduce a bidirectional feature fusion strategy that imputes missing image features and performs cross-modal fusion simultaneously. This allows us to generate reliable predictions even when images are missing. Second, we propose a Uni-to-Multi modal Knowledge Distillation (U2MKD) framework that transfers informative features from a single-modality teacher to a cross-modality student. This overcomes the issue of augmentation misalignment and enables us to train the student effectively. Extensive experiments on the nuScenes, Waymo, and SemanticKITTI datasets demonstrate the effectiveness of our approach. Notably, our method achieves an 8.3 mIoU gain over the LiDAR-only baseline on the nuScenes validation set and achieves state-of-the-art performance on the three datasets.
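A uni-to-multi modal distillation objective of the kind described here can be sketched as a soft-logit KL term plus a feature-imitation term from a frozen single-modality teacher. The loss combination, temperature, and weights below are illustrative assumptions, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def u2m_distill_loss(student_logits, teacher_logits,
                     student_feat, teacher_feat, T=2.0, beta=1.0):
    """Single-modality (e.g., LiDAR-only) teacher supervises a
    cross-modality student via soft logits and intermediate features.

    Shapes: logits (N, num_classes) over points, features (N, C).
    """
    # Temperature-softened KL divergence on class distributions.
    kd = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits.detach() / T, dim=1),
                  reduction="batchmean") * T * T
    # Feature imitation on matched per-point embeddings.
    feat = F.mse_loss(student_feat, teacher_feat.detach())
    return kd + beta * feat

# Usage with random stand-ins for per-point predictions.
N, C, K = 1024, 64, 16
loss = u2m_distill_loss(torch.randn(N, K), torch.randn(N, K),
                        torch.randn(N, C), torch.randn(N, C))
print(float(loss))
```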
{"title":"Uni-to-Multi Modal Knowledge Distillation for Bidirectional LiDAR-Camera Semantic Segmentation","authors":"Tianfang Sun;Zhizhong Zhang;Xin Tan;Yong Peng;Yanyun Qu;Yuan Xie","doi":"10.1109/TPAMI.2024.3451658","DOIUrl":"10.1109/TPAMI.2024.3451658","url":null,"abstract":"Combining LiDAR points and images for robust semantic segmentation has shown great potential. However, the heterogeneity between the two modalities (e.g. the density, the field of view) poses challenges in establishing a bijective mapping between each point and pixel. This modality alignment problem introduces new challenges in network design and data processing for cross-modal methods. Specifically, 1) points that are projected outside the image planes; 2) the complexity of maintaining geometric consistency limits the deployment of many data augmentation techniques. To address these challenges, we propose a cross-modal knowledge imputation and transition approach. First, we introduce a bidirectional feature fusion strategy that imputes missing image features and performs cross-modal fusion simultaneously. This allows us to generate reliable predictions even when images are missing. Second, we propose a Uni-to-Multi modal Knowledge Distillation (U2MKD) framework, leveraging the transfer of informative features from a single-modality teacher to a cross-modality student. This overcomes the issues of augmentation misalignment and enables us to train the student effectively. Extensive experiments on the nuScenes, Waymo, and SemanticKITTI datasets demonstrate the effectiveness of our approach. Notably, our method achieves an 8.3 mIoU gain over the LiDAR-only baseline on the nuScenes validation set and achieves state-of-the-art performance on the three datasets.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11059-11072"},"PeriodicalIF":0.0,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142101431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Topo-Geometric Analysis of Variability in Point Clouds Using Persistence Landscapes
Pub Date : 2024-08-28 DOI: 10.1109/TPAMI.2024.3451328
James Matuk;Sebastian Kurtek;Karthik Bharath
Topological data analysis provides a set of tools to uncover low-dimensional structure in noisy point clouds. Prominent among these tools is persistent homology, which summarizes the birth-death times of homological features using data objects known as persistence diagrams. To better aid statistical analysis, a functional representation of the diagrams, known as persistence landscapes, enables the use of functional data analysis and machine learning tools. Topological and geometric variabilities inherent in point clouds are confounded in both persistence diagrams and landscapes, and it is important to distinguish topological signal from noise to draw reliable conclusions on the structure of the point clouds when using persistent homology. We develop a framework for decomposing variability in persistence diagrams into topological signal and topological noise through alignment of persistence landscapes using an elastic Riemannian metric. Aligned landscapes (amplitude) isolate the topological signal. Reparameterizations used for landscape alignment (phase) are linked to a resolution parameter used to generate persistence diagrams, and capture topological noise in the form of geometric, global scaling, and sampling variabilities. We illustrate the importance of decoupling topological signal and topological noise in persistence diagrams (landscapes) using several simulated examples. We also demonstrate that our approach provides novel insights in two real data studies.
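Persistence landscapes themselves are simple to compute: each (birth, death) pair contributes a tent function min(t - b, d - t)_+, and the k-th landscape is the k-th largest tent value at each grid point. A minimal NumPy sketch follows; the toy diagram and grid are our own.

```python
import numpy as np

def persistence_landscape(diagram, grid, k_max=3):
    """Compute landscapes lambda_1..lambda_k from a persistence diagram.

    diagram: iterable of (birth, death) pairs; grid: 1D array of t values.
    Returns an array of shape (k, len(grid)).
    """
    diagram = np.asarray(diagram, float)             # (n_pairs, 2)
    b, d = diagram[:, 0, None], diagram[:, 1, None]  # column vectors
    # Tent functions, one row per (birth, death) pair.
    tents = np.clip(np.minimum(grid - b, d - grid), 0.0, None)
    tents.sort(axis=0)                               # ascending per column
    tents = tents[::-1]                              # descending: k-th largest
    k = min(k_max, len(diagram))
    return tents[:k]

diag = [(0.0, 1.0), (0.2, 0.9), (0.5, 0.6)]          # toy birth-death pairs
t = np.linspace(0, 1, 11)
L = persistence_landscape(diag, t)
print(np.round(L[0], 2))   # lambda_1: pointwise max of the tents
```

The elastic alignment of such landscapes (the amplitude/phase decomposition) is the paper's contribution and is not shown here.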
{"title":"Topo-Geometric Analysis of Variability in Point Clouds Using Persistence Landscapes","authors":"James Matuk;Sebastian Kurtek;Karthik Bharath","doi":"10.1109/TPAMI.2024.3451328","DOIUrl":"10.1109/TPAMI.2024.3451328","url":null,"abstract":"Topological data analysis provides a set of tools to uncover low-dimensional structure in noisy point clouds. Prominent amongst the tools is persistence homology, which summarizes birth-death times of homological features using data objects known as persistence diagrams. To better aid statistical analysis, a functional representation of the diagrams, known as persistence landscapes, enable use of functional data analysis and machine learning tools. Topological and geometric variabilities inherent in point clouds are confounded in both persistence diagrams and landscapes, and it is important to distinguish topological signal from noise to draw reliable conclusions on the structure of the point clouds when using persistence homology. We develop a framework for decomposing variability in persistence diagrams into topological signal and topological noise through alignment of persistence landscapes using an elastic Riemannian metric. Aligned landscapes (amplitude) isolate the topological signal. Reparameterizations used for landscape alignment (phase) are linked to a resolution parameter used to generate persistence diagrams, and capture topological noise in the form of geometric, global scaling and sampling variabilities. We illustrate the importance of decoupling topological signal and topological noise in persistence diagrams (landscapes) using several simulated examples. We also demonstrate that our approach provides novel insights in two real data studies.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"11035-11046"},"PeriodicalIF":0.0,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142086429","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Playing for 3D Human Recovery
Pub Date : 2024-08-27 DOI: 10.1109/TPAMI.2024.3450537
Zhongang Cai;Mingyuan Zhang;Jiawei Ren;Chen Wei;Daxuan Ren;Zhengyu Lin;Haiyu Zhao;Lei Yang;Chen Change Loy;Ziwei Liu
Image- and video-based 3D human recovery (i.e., pose and shape estimation) has achieved substantial progress. However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity. In this work, we obtain massive human sequences by playing the video game with automatically annotated 3D ground truths. Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine, featuring a highly diverse set of subjects, actions, and scenarios. More importantly, we study the use of game-playing data and obtain five major insights. First, game-playing data is surprisingly effective. A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin. For video-based methods, GTA-Human is even on par with the in-domain training set. Second, we discover that synthetic data provides critical complements to the real data that is typically collected indoors. We highlight that our investigation into the domain gap explains why our simple yet useful data mixture strategies work, offering new insights to the research community. Third, the scale of the dataset matters. The performance boost is closely related to the additional data available. A systematic study on multiple key factors (such as camera angle and body pose) reveals that model performance is sensitive to data density. Fourth, the effectiveness of GTA-Human is also attributed to its rich collection of strong supervision labels (SMPL parameters), which are otherwise expensive to acquire in real datasets. Fifth, the benefits of synthetic data extend to larger models such as deeper convolutional neural networks (CNNs) and Transformers, for which a significant impact is also observed. We hope our work can pave the way for scaling up 3D human recovery to the real world.
Homepage: https://caizhongang.github.io/projects/GTA-Human/.
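A data-mixture strategy of the kind studied here can be sketched with a weighted sampler that draws real and synthetic samples at a fixed ratio regardless of the raw dataset sizes. The 50/50 ratio and the stand-in tensors below are illustrative assumptions, not the paper's tuned mixture.

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader,
                              TensorDataset, WeightedRandomSampler)

# Stand-ins: a small "real" mocap set and a larger "synthetic" game set.
real = TensorDataset(torch.randn(1_000, 10), torch.zeros(1_000))
synth = TensorDataset(torch.randn(9_000, 10), torch.ones(9_000))
mixed = ConcatDataset([real, synth])

# Target a 50/50 real/synthetic mix per batch by weighting each sample
# inversely to its dataset's size.
w_real, w_synth = 0.5 / len(real), 0.5 / len(synth)
weights = torch.tensor([w_real] * len(real) + [w_synth] * len(synth))
sampler = WeightedRandomSampler(weights, num_samples=len(mixed),
                                replacement=True)
loader = DataLoader(mixed, batch_size=256, sampler=sampler)

xb, yb = next(iter(loader))
print("synthetic fraction in batch:", yb.mean().item())   # approx 0.5
```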
{"title":"Playing for 3D Human Recovery","authors":"Zhongang Cai;Mingyuan Zhang;Jiawei Ren;Chen Wei;Daxuan Ren;Zhengyu Lin;Haiyu Zhao;Lei Yang;Chen Change Loy;Ziwei Liu","doi":"10.1109/TPAMI.2024.3450537","DOIUrl":"10.1109/TPAMI.2024.3450537","url":null,"abstract":"Image- and video-based 3D human recovery (i.e., pose and shape estimation) have achieved substantial progress. However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity. In this work, we obtain massive human sequences by playing the video game with automatically annotated 3D ground truths. Specifically, we contribute \u0000<bold>GTA-Human</b>\u0000, a large-scale 3D human dataset generated with the GTA-V game engine, featuring a highly diverse set of subjects, actions, and scenarios. More importantly, we study the use of game-playing data and obtain five major insights. \u0000<bold>First</b>\u0000, game-playing data is surprisingly effective. A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin. For video-based methods, GTA-Human is even on par with the in-domain training set. \u0000<bold>Second</b>\u0000, we discover that synthetic data provides critical complements to the real data that is typically collected indoor. We highlight that our investigation into domain gap provides explanations for our data mixture strategies that are simple yet useful, which offers new insights to the research community. \u0000<bold>Third</b>\u0000, the scale of the dataset matters. The performance boost is closely related to the additional data available. A systematic study on multiple key factors (such as camera angle and body pose) reveals that the model performance is sensitive to data density. \u0000<bold>Fourth</b>\u0000, the effectiveness of GTA-Human is also attributed to the rich collection of strong supervision labels (SMPL parameters), which are otherwise expensive to acquire in real datasets. \u0000<bold>Fifth</b>\u0000, the benefits of synthetic data extend to larger models such as deeper convolutional neural networks (CNNs) and Transformers, for which a significant impact is also observed. We hope our work could pave the way for scaling up 3D human recovery to the real world.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10533-10545"},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142082926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis
Pub Date : 2024-08-27 DOI: 10.1109/TPAMI.2024.3450630
Long Zhuo;Guangcong Wang;Shikai Li;Wayne Wu;Ziwei Liu
Video-to-Video synthesis (Vid2Vid) achieves remarkable performance in generating a photo-realistic video from a sequence of semantic maps, such as segmentation, sketch, and pose. However, this pipeline is heavily limited by high computational cost and long inference latency, mainly attributable to two essential factors: 1) network architecture parameters, and 2) the sequential data stream. Recently, the parameters of image-based generative models have been significantly reduced via more efficient network architectures. Existing methods mainly focus on slimming network architectures but ignore the size of the sequential data stream. Moreover, due to the lack of temporal coherence, image-based compression is not sufficient for the video task. In this paper, we present a spatial-temporal hybrid distillation compression framework, Fast-Vid2Vid++, which focuses on knowledge distillation from the teacher network and compression of the generative model's data stream in both space and time. Fast-Vid2Vid++ makes the first attempt along the time dimension to transfer hierarchical features and temporal coherence knowledge to reduce computational resources and accelerate inference. Specifically, we compress the data stream spatially and reduce the temporal redundancy. We distill the knowledge of the hierarchical features and the final response from the teacher network to the student network in the high-resolution and full-time domains, and transfer the long-term dependencies of the features and video frames to the student model. After the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model can synthesize high-resolution key frames using the low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with slight latency and generates full-length sequences with motion-aware inference (MAI). On standard benchmarks, Fast-Vid2Vid++ achieves a real-time performance of 30–59 FPS and saves 28–35× computational cost on a single V100 GPU.
Code and models are publicly available.
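The motion-compensation step behind MAI can be sketched as backward-warping synthesized key frames by (scaled) optical flow and blending them for an intermediate timestamp. The warp below uses torch.nn.functional.grid_sample; the constant toy flow and the linear blend are illustrative assumptions, not the paper's generator.

```python
import torch
import torch.nn.functional as F

def warp(frame, flow):
    """Backward-warp `frame` (B,C,H,W) by `flow` (B,2,H,W), in pixels.

    output(x) is sampled from frame at x + flow(x); this is the motion-
    compensation step that replaces a full generator pass per frame.
    """
    B, _, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1,2,H,W)
    coords = base + flow                                       # sample coords
    # Normalise to [-1, 1] for grid_sample's (x, y) convention.
    gx = 2 * coords[:, 0] / (W - 1) - 1
    gy = 2 * coords[:, 1] / (H - 1) - 1
    grid = torch.stack((gx, gy), dim=-1)                       # (B,H,W,2)
    return F.grid_sample(frame, grid, align_corners=True)

# An intermediate frame between two key frames via flows scaled by t.
key0, key1 = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
flow_0to1 = torch.zeros(1, 2, 64, 64)
flow_0to1[:, 0] = 2.0                        # toy flow: 2 px to the right
t = 0.5
mid = 0.5 * warp(key0, t * flow_0to1) + 0.5 * warp(key1, (t - 1) * flow_0to1)
print(mid.shape)   # torch.Size([1, 3, 64, 64])
```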
{"title":"Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis","authors":"Long Zhuo;Guangcong Wang;Shikai Li;Wayne Wu;Ziwei Liu","doi":"10.1109/TPAMI.2024.3450630","DOIUrl":"10.1109/TPAMI.2024.3450630","url":null,"abstract":"Video-to-Video synthesis (Vid2Vid) gains remarkable performance in generating a photo-realistic video from a sequence of semantic maps, such as segmentation, sketch and pose. However, this pipeline is heavily limited to high computational cost and long inference latency, mainly attributed to two essential factors: 1) network architecture parameters, 2) sequential data stream. Recently, the parameters of image-based generative models have been significantly reduced via more efficient network architectures. Existing methods mainly focus on slimming network architectures but ignore the size of the sequential data stream. Moreover, due to the lack of temporal coherence, image-based compression is not sufficient for the compression of the video task. In this paper, we present a spatial-temporal hybrid distillation compression framework, \u0000<bold>Fast-Vid2Vid++</b>\u0000, which focuses on knowledge distillation of the teacher network and the data stream of generative models on both space and time. Fast-Vid2Vid++ makes the first attempt at time dimension to transfer hierarchical features and time coherence knowledge to reduce computational resources and accelerate inference. Specifically, we compress the data stream spatially and reduce the temporal redundancy. We distill the knowledge of the hierarchical features and the final response from the teacher network to the student network in high-resolution and full-time domains. We transfer the long-term dependencies of the features and video frames to the student model. After the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model can synthesize high-resolution key-frames using the low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with slight latency and generates full-length sequences with motion-aware inference (MAI). On standard benchmarks, Fast-Vid2Vid++ achieves a real-time performance of 30–59 FPS and saves 28–35× computational cost on a single V100 GPU.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10732-10747"},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142082984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
DeepTensor: Low-Rank Tensor Decomposition With Deep Network Priors
Pub Date : 2024-08-27 DOI: 10.1109/TPAMI.2024.3450575
Vishwanath Saragadam;Randall Balestriero;Ashok Veeraraghavan;Richard G. Baraniuk
DeepTensor is a computationally efficient framework for low-rank decomposition of matrices and tensors using deep generative networks. We decompose a tensor as the product of low-rank tensor factors (e.g., a matrix as the outer product of two vectors), where each low-rank tensor is generated by a deep network (DN) that is trained in a self-supervised manner to minimize the mean-square approximation error. Our key observation is that the implicit regularization inherent in DNs enables them to capture nonlinear signal structures (e.g., manifolds) that are out of the reach of classical linear methods like the singular value decomposition (SVD) and principal components analysis (PCA). Furthermore, in contrast to the SVD and PCA, whose performance deteriorates when the tensor’s entries deviate from additive white Gaussian noise, we demonstrate that the performance of DeepTensor is robust to a wide range of distributions. We validate that DeepTensor is a robust and computationally efficient drop-in replacement for the SVD, PCA, nonnegative matrix factorization (NMF), and similar decompositions by exploring a range of real-world applications, including hyperspectral image denoising, 3D MRI tomography, and image classification. In particular, DeepTensor offers a 6 dB signal-to-noise ratio improvement over standard denoising methods for signal corrupted by Poisson noise and learns to decompose 3D tensors 60 times faster than a single DN equipped with 3D convolutions.
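The core recipe — let small deep networks generate the low-rank factors and fit the noisy measurement by a self-supervised MSE loss — fits in a few lines of PyTorch. Network sizes, the fixed random input codes, and the optimizer settings below are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n, m, rank = 64, 48, 4

# Noisy low-rank target: X = U V^T + noise (noise variance 0.01).
X = torch.randn(n, rank) @ torch.randn(rank, m) + 0.1 * torch.randn(n, m)

def factor_net(rows, r):
    # A small DN maps a fixed random code to one low-rank factor; the
    # network's implicit regularization acts as the prior on that factor.
    net = nn.Sequential(nn.Linear(32, 128), nn.ReLU(),
                        nn.Linear(128, rows * r))
    code = torch.randn(1, 32)
    return net, code

u_net, u_code = factor_net(n, rank)
v_net, v_code = factor_net(m, rank)
opt = torch.optim.Adam(list(u_net.parameters()) + list(v_net.parameters()),
                       lr=1e-3)

for step in range(2000):
    U = u_net(u_code).view(n, rank)
    V = v_net(v_code).view(m, rank)
    loss = ((U @ V.T - X) ** 2).mean()    # self-supervised MSE fit
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final MSE {loss.item():.4f} (noise floor approx 0.01)")
```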
{"title":"DeepTensor: Low-Rank Tensor Decomposition With Deep Network Priors","authors":"Vishwanath Saragadam;Randall Balestriero;Ashok Veeraraghavan;Richard G. Baraniuk","doi":"10.1109/TPAMI.2024.3450575","DOIUrl":"10.1109/TPAMI.2024.3450575","url":null,"abstract":"DeepTensor is a computationally efficient framework for low-rank decomposition of matrices and tensors using deep generative networks. We decompose a tensor as the product of low-rank tensor factors (e.g., a matrix as the outer product of two vectors), where each low-rank tensor is generated by a deep network (DN) that is trained in a \u0000<italic>self-supervised</i>\u0000 manner to minimize the mean-square approximation error. Our key observation is that the implicit regularization inherent in DNs enables them to capture nonlinear signal structures (e.g., manifolds) that are out of the reach of classical linear methods like the singular value decomposition (SVD) and principal components analysis (PCA). Furthermore, in contrast to the SVD and PCA, whose performance deteriorates when the tensor’s entries deviate from additive white Gaussian noise, we demonstrate that the performance of DeepTensor is robust to a wide range of distributions. We validate that DeepTensor is a robust and computationally efficient drop-in replacement for the SVD, PCA, nonnegative matrix factorization (NMF), and similar decompositions by exploring a range of real-world applications, including hyperspectral image denoising, 3D MRI tomography, and image classification. In particular, DeepTensor offers a 6 dB signal-to-noise ratio improvement over standard denoising methods for signal corrupted by Poisson noise and learns to decompose 3D tensors 60 times faster than a single DN equipped with 3D convolutions.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10337-10348"},"PeriodicalIF":0.0,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142082983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Frequency-Aware Feature Fusion for Dense Image Prediction
Pub Date : 2024-08-26 DOI: 10.1109/TPAMI.2024.3449959
Linwei Chen;Ying Fu;Lin Gu;Chenggang Yan;Tatsuya Harada;Gao Huang
Dense image prediction tasks demand features with strong category information and precise spatial boundary details at high resolution. To achieve this, modern hierarchical models often utilize feature fusion, directly adding upsampled coarse features from deep layers and high-resolution features from lower levels. In this paper, we observe rapid variations in fused feature values within objects, resulting in intra-category inconsistency due to disturbed high-frequency features. Additionally, blurred boundaries in fused features lack accurate high frequency, leading to boundary displacement. Building upon these observations, we propose Frequency-Aware Feature Fusion (FreqFusion), integrating an Adaptive Low-Pass Filter (ALPF) generator, an offset generator, and an Adaptive High-Pass Filter (AHPF) generator. The ALPF generator predicts spatially-variant low-pass filters to attenuate high-frequency components within objects, reducing intra-class inconsistency during upsampling. The offset generator refines large inconsistent features and thin boundaries by replacing inconsistent features with more consistent ones through resampling, while the AHPF generator enhances high-frequency detailed boundary information lost during downsampling. Comprehensive visualization and quantitative analysis demonstrate that FreqFusion effectively improves feature consistency and sharpens object boundaries. Extensive experiments across various dense prediction tasks confirm its effectiveness.
Code is publicly available at https://github.com/Linwei-Chen/FreqFusion.
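A spatially-variant low-pass filter of the kind the ALPF generator predicts can be sketched with a 1x1 convolution that outputs a softmax-normalized k×k kernel per location, applied via unfold. Layer names and sizes below are illustrative assumptions, not the released FreqFusion code; the high-frequency part is shown simply as the residual.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveLowPass(nn.Module):
    """Apply a predicted, spatially-variant low-pass kernel to features.

    A 1x1 conv predicts a k*k kernel per location; softmax makes each
    kernel non-negative and sum-to-one, i.e., a local smoothing filter.
    """
    def __init__(self, channels: int, k: int = 3):
        super().__init__()
        self.k = k
        self.pred = nn.Conv2d(channels, k * k, kernel_size=1)

    def forward(self, x):
        B, C, H, W = x.shape
        k = self.k
        kernels = F.softmax(self.pred(x), dim=1)           # (B, k*k, H, W)
        patches = F.unfold(x, k, padding=k // 2)           # (B, C*k*k, H*W)
        patches = patches.view(B, C, k * k, H, W)
        return (patches * kernels.unsqueeze(1)).sum(dim=2) # weighted average

x = torch.randn(2, 16, 32, 32)
smoothed = AdaptiveLowPass(16)(x)          # attenuated high frequencies
residual_highpass = x - smoothed           # complementary high-frequency part
print(smoothed.shape, residual_highpass.shape)
```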
{"title":"Frequency-Aware Feature Fusion for Dense Image Prediction","authors":"Linwei Chen;Ying Fu;Lin Gu;Chenggang Yan;Tatsuya Harada;Gao Huang","doi":"10.1109/TPAMI.2024.3449959","DOIUrl":"10.1109/TPAMI.2024.3449959","url":null,"abstract":"Dense image prediction tasks demand features with strong category information and precise spatial boundary details at high resolution. To achieve this, modern hierarchical models often utilize feature fusion, directly adding upsampled coarse features from deep layers and high-resolution features from lower levels. In this paper, we observe rapid variations in fused feature values within objects, resulting in intra-category inconsistency due to disturbed high-frequency features. Additionally, blurred boundaries in fused features lack accurate high frequency, leading to boundary displacement. Building upon these observations, we propose Frequency-Aware Feature Fusion (FreqFusion), integrating an Adaptive Low-Pass Filter (ALPF) generator, an offset generator, and an Adaptive High-Pass Filter (AHPF) generator. The ALPF generator predicts spatially-variant low-pass filters to attenuate high-frequency components within objects, reducing intra-class inconsistency during upsampling. The offset generator refines large inconsistent features and thin boundaries by replacing inconsistent features with more consistent ones through resampling, while the AHPF generator enhances high-frequency detailed boundary information lost during downsampling. Comprehensive visualization and quantitative analysis demonstrate that FreqFusion effectively improves feature consistency and sharpens object boundaries. Extensive experiments across various dense prediction tasks confirm its effectiveness.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10763-10780"},"PeriodicalIF":0.0,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142074851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0