Latest publications in IEEE Transactions on Pattern Analysis and Machine Intelligence

Diversifying Policies with Non-Markov Dispersion to Expand the Solution Space.
Pub Date : 2024-09-06 DOI: 10.1109/TPAMI.2024.3455257
Bohao Qu, Xiaofeng Cao, Yi Chang, Ivor W Tsang, Yew-Soon Ong

Policy diversity, encompassing the variety of policies an agent can adopt, enhances reinforcement learning (RL) success by fostering more robust, adaptable, and innovative problem-solving in the environment. The environment in which standard RL operates is usually modeled with a Markov Decision Process (MDP) as the theoretical foundation. However, in many real-world scenarios, the rewards depend on an agent's history of states and actions, leading to a non-MDP. Under the premise of policy diffusion initialization, non-MDPs may have an unstructured, expanding solution space due to varying historical information and temporal dependencies. This results in solutions having non-equivalent closed forms in non-MDPs. In this paper, we posit that deriving diverse solutions for non-MDPs requires policies to break through the boundaries of the current solution space through gradual dispersion. The goal is to expand the solution space, thereby obtaining more diverse policies. Specifically, we first model the sequences of states and actions with a transformer-based method to learn policy embeddings for dispersion in the solution space, since the transformer has advantages in handling sequential data and capturing long-range dependencies for non-MDPs. Then, we stack the policy embeddings to construct a dispersion matrix as the policy diversity measure to induce policy dispersion in the solution space and obtain a set of diverse policies. Finally, we prove that if the dispersion matrix is positive definite, the dispersed embeddings can effectively enlarge the disagreements across policies, yielding a diverse expression of the original policy embedding distribution. Experimental results in both non-MDP and MDP environments show that this dispersion scheme can obtain more expressive, diverse policies by expanding the solution space, showing more robust performance than recent learning baselines.
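
As a toy illustration of the dispersion-matrix idea (not the authors' code; the embedding sizes, the Gram-matrix construction, and all names below are assumptions), one might stack policy embeddings into a matrix and test positive definiteness of their Gram matrix as a coarse diversity check:

```python
import numpy as np

def dispersion_matrix(policy_embeddings):
    """Stack policy embeddings (n_policies, d) -- e.g., outputs of a
    transformer encoder over state-action sequences -- into a Gram-style
    dispersion matrix of pairwise inner products."""
    E = np.asarray(policy_embeddings, dtype=float)
    return E @ E.T  # (n, n); entry (i, j) measures alignment of policies i, j

def is_positive_definite(M, tol=1e-8):
    # Strictly positive eigenvalues => no policy embedding is a linear
    # combination of the others, i.e., the set spans distinct directions.
    return bool(np.all(np.linalg.eigvalsh(M) > tol))

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 16))  # four hypothetical policy embeddings
D = dispersion_matrix(embeddings)
print(is_positive_definite(D))         # True almost surely for random vectors
```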

Citations: 0
Integrating Neural Radiance Fields End-to-End for Cognitive Visuomotor Navigation.
Pub Date : 2024-09-06 DOI: 10.1109/TPAMI.2024.3455252
Qiming Liu, Haoran Xin, Zhe Liu, Hesheng Wang

We propose an end-to-end visuomotor navigation framework that leverages Neural Radiance Fields (NeRF) for spatial cognition. To the best of our knowledge, this is the first effort to integrate such an implicit spatial representation with an embodied policy end-to-end for cognitive decision-making. Consequently, our system necessitates neither modularized designs nor transformations into explicit scene representations for downstream control. The NeRF-based memory is constructed online during navigation, without relying on any environmental priors. To enhance the extraction of decision-critical historical insights from the rigid and implicit structure of NeRF, we introduce a spatial information extraction mechanism named Structural Radiance Attention (SRA). SRA empowers the agent to grasp complex scene structures and task objectives, thus paving the way for the development of intelligent behavioral patterns. Our comprehensive testing on image-goal navigation tasks demonstrates that our approach significantly outperforms existing navigation models. We demonstrate that SRA markedly improves the agent's understanding of both the scene and the task by retrieving historical information stored in NeRF memory. The agent also learns exploratory awareness from our pipeline to better adapt to low signal-to-noise memory signals in unknown scenes. We deploy our navigation system on a mobile robot in real-world scenarios, where it exhibits evident cognitive capabilities while ensuring real-time performance.
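
The paper's SRA mechanism is not specified here, but its role, letting a policy query read decision-relevant structure out of implicit memory, can be sketched as a generic cross-attention readout (all shapes and names below are illustrative assumptions, not the actual SRA design):

```python
import numpy as np
from scipy.special import softmax

def cross_attention_readout(query, memory):
    """query: (1, d) policy token; memory: (m, d) features sampled online
    from the NeRF-based spatial memory. Returns a (1, d) summary of the
    scene weighted by relevance to the current decision."""
    d = query.shape[-1]
    scores = query @ memory.T / np.sqrt(d)  # (1, m) scaled dot-product
    weights = softmax(scores, axis=-1)      # where to "look" in the memory
    return weights @ memory

query = np.random.randn(1, 32)
memory = np.random.randn(128, 32)           # 128 hypothetical memory samples
context = cross_attention_readout(query, memory)
print(context.shape)                        # (1, 32)
```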

Citations: 0
Variational Label Enhancement for Instance-Dependent Partial Label Learning.
Pub Date : 2024-09-06 DOI: 10.1109/TPAMI.2024.3455260
Ning Xu, Congyu Qiao, Yuchen Zhao, Xin Geng, Min-Ling Zhang

Partial label learning (PLL) is a form of weakly supervised learning, where each training example is linked to a set of candidate labels, among which only one label is correct. Most existing PLL approaches assume that the incorrect labels in each training example are randomly picked as the candidate labels. However, in practice, this assumption may not hold true, as the candidate labels are often instance-dependent. In this paper, we address the instance-dependent PLL problem and assume that each example is associated with a latent label distribution, in which an incorrect label with a high degree is more likely to be annotated as a candidate label. Motivated by this consideration, we propose two methods, VALEN and MILEN, which train the predictive model by utilizing the latent label distributions recovered by the label enhancement process. Specifically, VALEN recovers the latent label distributions by inferring the variational posterior density, parameterized by an inference model, with the deduced evidence lower bound. MILEN recovers the latent label distribution by adopting a variational approximation to bound the mutual information among the latent label distribution, the observed labels, and the augmented instances. Experiments on benchmark and real-world datasets validate the effectiveness of the proposed methods.
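
The full variational machinery (inference model, evidence lower bound) is beyond a short sketch, but one building block such methods share, confining a predicted label distribution to the candidate set, is easy to illustrate (function and variable names are hypothetical):

```python
import numpy as np

def candidate_constrained_distribution(probs, candidate_mask):
    """probs: (n, c) predictive probabilities; candidate_mask: (n, c) binary,
    1 where a label belongs to the instance's candidate set. Zeroes out
    non-candidates and renormalizes, yielding a per-instance estimate of
    the latent label distribution over candidates."""
    masked = probs * candidate_mask
    return masked / masked.sum(axis=1, keepdims=True)

probs = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3]])
mask = np.array([[1, 1, 0],      # labels 0 and 1 are candidates
                 [0, 1, 1]])     # labels 1 and 2 are candidates
print(candidate_constrained_distribution(probs, mask))
```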

Citations: 0
TagCLIP: Improving Discrimination Ability of Zero-Shot Semantic Segmentation.
Pub Date : 2024-09-04 DOI: 10.1109/TPAMI.2024.3454647
Jingyao Li, Pengguang Chen, Shengju Qian, Shu Liu, Jiaya Jia

Contrastive Language-Image Pre-training (CLIP) has recently shown great promise in pixel-level zero-shot learning tasks. However, existing approaches that utilize CLIP's text and patch embeddings to generate semantic masks often misidentify input pixels from unseen classes, leading to confusion between novel classes and semantically similar ones. In this work, we propose a novel approach, TagCLIP (Trusty-aware guided CLIP), to address this issue. We disentangle the ill-posed optimization problem into two parallel processes: semantic matching, performed individually, and reliability judgment, which improves discrimination ability. Building on the idea of special tokens in language modeling that represent sentence-level embeddings, we introduce a trusty token that enables the model to distinguish novel classes from known ones during prediction. To evaluate our approach, we conduct experiments on two benchmark datasets, PASCAL VOC 2012 and COCO-Stuff 164K. Our results show that TagCLIP improves the Intersection over Union (IoU) of unseen classes by 7.4% and 1.7%, respectively, with negligible overheads.
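
One plausible reading of the trusty-token idea (purely illustrative; the actual combination rule in TagCLIP may differ) is a reliability gate that shifts probability mass between seen- and unseen-class scores per pixel:

```python
import numpy as np

def gate_by_reliability(seen_scores, unseen_scores, trusty_logit):
    """seen_scores: (c_seen,), unseen_scores: (c_unseen,) semantic-matching
    scores for one pixel; trusty_logit: scalar output tied to the trusty
    token. A sigmoid turns it into the probability that the pixel belongs
    to a novel class, which then weights the two branches."""
    p_novel = 1.0 / (1.0 + np.exp(-trusty_logit))
    return np.concatenate([(1.0 - p_novel) * seen_scores,
                           p_novel * unseen_scores])

scores = gate_by_reliability(np.array([0.7, 0.2]), np.array([0.9]), 1.5)
print(scores, scores.argmax())  # a high trusty logit favors the novel class
```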

Citations: 0
Efficient Neural Collaborative Search for Pickup and Delivery Problems.
Pub Date : 2024-09-03 DOI: 10.1109/TPAMI.2024.3450850
Detian Kong, Yining Ma, Zhiguang Cao, Tianshu Yu, Jianhua Xiao

In this paper, we introduce Neural Collaborative Search (NCS), a novel learning-based framework for efficiently solving pickup and delivery problems (PDPs). NCS pioneers the collaboration between the latest prevalent neural construction and neural improvement models, establishing a collaborative framework in which an improvement model iteratively refines solutions initiated by a construction model. Our NCS collaboratively trains the two models via reinforcement learning with an effective shared-critic mechanism. In addition, the construction model enhances the improvement model with high-quality initial solutions via curriculum learning, while the improvement model accelerates the convergence of the construction model through imitation learning. Besides the new framework design, we also propose Neural Neighborhood Search (N2S), an efficient improvement model employed within the NCS framework. N2S exploits a tailored Markov decision process formulation and two customized decoders for removing and then reinserting a pair of pickup-delivery nodes, thereby learning a ruin-repair search process that addresses the precedence constraints in PDPs efficiently. To balance the computation cost between encoders and decoders, N2S streamlines the existing encoder design through a light Synthesis Attention mechanism that allows vanilla self-attention to synthesize various features regarding a route solution. Moreover, a diversity enhancement scheme is further leveraged to improve performance during N2S inference. Our NCS and N2S are both generic, and extensive experiments on two canonical PDP variants show that they produce state-of-the-art results among existing neural methods. Remarkably, our NCS and N2S surpass the well-known LKH3 solver, especially on the more constrained PDP variant.
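
To make the ruin-repair idea concrete, here is a plain, non-neural version of the move that N2S's decoders learn: remove one pickup-delivery pair and greedily reinsert it at the cheapest precedence-feasible positions (the node labels, depot convention, and greedy criterion are illustrative assumptions):

```python
def route_cost(route, dist):
    """Total length of a route given a distance matrix (2-D list)."""
    return sum(dist[a][b] for a, b in zip(route, route[1:]))

def ruin_repair(route, pickup, delivery, dist):
    """Remove a pickup-delivery pair, then reinsert it so the pickup precedes
    the delivery (the PDP precedence constraint), keeping the cheapest route."""
    base = [n for n in route if n not in (pickup, delivery)]
    best, best_cost = None, float("inf")
    for i in range(1, len(base)):          # pickup goes before base[i]
        for j in range(i, len(base)):      # delivery goes at or after the pickup
            cand = base[:i] + [pickup] + base[i:j] + [delivery] + base[j:]
            cost = route_cost(cand, dist)
            if cost < best_cost:
                best, best_cost = cand, cost
    return best, best_cost

# 0 is the depot; nodes 1/2 are one pickup-delivery pair, 3/4 another.
dist = [[0, 2, 4, 3, 5], [2, 0, 1, 4, 2], [4, 1, 0, 2, 3],
        [3, 4, 2, 0, 1], [5, 2, 3, 1, 0]]
route = [0, 3, 1, 2, 4, 0]
print(ruin_repair(route, pickup=1, delivery=2, dist=dist))
```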

Citations: 0
Panoptic-PartFormer++: A Unified and Decoupled View for Panoptic Part Segmentation.
Pub Date : 2024-09-03 DOI: 10.1109/TPAMI.2024.3453916
Xiangtai Li, Shilin Xu, Yibo Yang, Haobo Yuan, Guangliang Cheng, Yunhai Tong, Zhouchen Lin, Ming-Hsuan Yang, Dacheng Tao

Panoptic Part Segmentation (PPS) unifies panoptic and part segmentation into one task. Previous works utilize separate approaches to handle thing, stuff, and part predictions without shared computation and task association. We aim to unify these tasks at the architectural level, designing the first end-to-end unified framework, Panoptic-PartFormer. Moreover, we find that the previous metric, PartPQ, is biased toward PQ. To handle both issues, we first design a meta-architecture that decouples part features from thing/stuff features. We model things, stuff, and parts as object queries and directly learn to optimize all three forms of prediction as a unified mask prediction and classification problem. We term our model Panoptic-PartFormer. Second, we propose a new metric, Part-Whole Quality (PWQ), to better measure this task from pixel-region and part-whole perspectives. It also decouples the errors of part segmentation and panoptic segmentation. Third, inspired by Mask2Former and based on our meta-architecture, we propose Panoptic-PartFormer++, which adds a new part-whole interaction scheme that uses masked cross attention to further boost part segmentation quality. Finally, extensive ablation studies and analysis demonstrate the effectiveness of both Panoptic-PartFormer and Panoptic-PartFormer++. Compared with the original Panoptic-PartFormer, our Panoptic-PartFormer++ achieves 2% PartPQ and 3% PWQ improvements on the Cityscapes PPS dataset and 5% PartPQ on the Pascal Context PPS dataset. On both datasets, Panoptic-PartFormer++ achieves new state-of-the-art results. Our models can serve as a strong baseline and aid future research in PPS. The source code and trained models will be available at https://github.com/lxtGH/Panoptic-PartFormer.
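
Since the part-whole interaction uses masked cross attention, a minimal Mask2Former-style sketch is shown below (the shapes and the -1e9 masking constant are common conventions, not the paper's exact implementation):

```python
import numpy as np
from scipy.special import softmax

def masked_cross_attention(queries, features, masks):
    """queries: (q, d) object queries (things, stuff, or parts);
    features: (n, d) flattened pixel features; masks: (q, n) binary
    foreground predictions from the previous layer. Each query attends
    only inside its own predicted mask region."""
    d = queries.shape[-1]
    scores = queries @ features.T / np.sqrt(d)   # (q, n)
    scores = np.where(masks > 0, scores, -1e9)   # block out-of-mask pixels
    weights = softmax(scores, axis=-1)
    return weights @ features                    # (q, d) refined queries

q = np.random.randn(3, 16)                       # three hypothetical queries
feats = np.random.randn(100, 16)                 # 10x10 feature map, flattened
masks = (np.random.rand(3, 100) > 0.7).astype(float)
print(masked_cross_attention(q, feats, masks).shape)  # (3, 16)
```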

Citations: 0
Topo-Geometric Analysis of Variability in Point Clouds using Persistence Landscapes.
Pub Date : 2024-08-28 DOI: 10.1109/TPAMI.2024.3451328
James Matuk, Sebastian Kurtek, Karthik Bharath

Topological data analysis provides a set of tools to uncover low-dimensional structure in noisy point clouds. Prominent amongst these tools is persistent homology, which summarizes the birth-death times of homological features using data objects known as persistence diagrams. To better aid statistical analysis, functional representations of the diagrams, known as persistence landscapes, enable the use of functional data analysis and machine learning tools. Topological and geometric variabilities inherent in point clouds are confounded in both persistence diagrams and landscapes, and it is important to distinguish topological signal from noise to draw reliable conclusions about the structure of the point clouds when using persistent homology. We develop a framework for decomposing variability in persistence diagrams into topological signal and topological noise through alignment of persistence landscapes using an elastic Riemannian metric. Aligned landscapes (amplitude) isolate the topological signal. Reparameterizations used for landscape alignment (phase) are linked to a resolution parameter used to generate persistence diagrams, and capture topological noise in the form of geometric, global scaling, and sampling variabilities. We illustrate the importance of decoupling topological signal and topological noise in persistence diagrams (landscapes) using several simulated examples. We also demonstrate that our approach provides novel insights in two real data studies.
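
Persistence landscapes have a standard closed form: the k-th landscape at t is the k-th largest of the "tent" values max(0, min(t - b, d - t)) over diagram points (b, d). A small sketch of that definition follows (the elastic alignment itself is beyond this snippet):

```python
import numpy as np

def persistence_landscape(diagram, ts, k_max=3):
    """diagram: iterable of (birth, death) pairs; ts: 1-D grid of t values.
    Returns a (k_max, len(ts)) array whose k-th row is lambda_k(t), the
    k-th largest tent value max(0, min(t - b, d - t)) over the diagram."""
    pts = np.asarray(list(diagram), dtype=float)
    b, d = pts[:, :1], pts[:, 1:2]                       # column vectors
    tents = np.maximum(0.0, np.minimum(ts - b, d - ts))  # (n_pts, len(ts))
    tents = np.sort(tents, axis=0)[::-1]                 # descending per t
    out = np.zeros((k_max, ts.size))
    k = min(k_max, tents.shape[0])
    out[:k] = tents[:k]
    return out

ts = np.linspace(0.0, 4.0, 9)
L = persistence_landscape([(0.0, 2.0), (1.0, 4.0)], ts, k_max=2)
print(np.round(L, 2))  # lambda_1 and lambda_2 evaluated on the grid
```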

Citations: 0
Playing for 3D Human Recovery.
Pub Date : 2024-08-27 DOI: 10.1109/TPAMI.2024.3450537
Zhongang Cai, Mingyuan Zhang, Jiawei Ren, Chen Wei, Daxuan Ren, Zhengyu Lin, Haiyu Zhao, Lei Yang, Chen Change Loy, Ziwei Liu

Image- and video-based 3D human recovery (i.e., pose and shape estimation) has achieved substantial progress. However, due to the prohibitive cost of motion capture, existing datasets are often limited in scale and diversity. In this work, we obtain massive human sequences by playing the video game with automatically annotated 3D ground truths. Specifically, we contribute GTA-Human, a large-scale 3D human dataset generated with the GTA-V game engine, featuring a highly diverse set of subjects, actions, and scenarios. More importantly, we study the use of game-playing data and obtain five major insights. First, game-playing data is surprisingly effective. A simple frame-based baseline trained on GTA-Human outperforms more sophisticated methods by a large margin. For video-based methods, GTA-Human is even on par with the in-domain training set. Second, we discover that synthetic data provides critical complements to the real data that is typically collected indoors. We highlight that our investigation into the domain gap provides explanations for our data mixture strategies, which are simple yet useful, offering new insights to the research community. Third, the scale of the dataset matters. The performance boost is closely related to the additional data available. A systematic study of multiple key factors (such as camera angle and body pose) reveals that model performance is sensitive to data density. Fourth, the effectiveness of GTA-Human is also attributed to its rich collection of strong supervision labels (SMPL parameters), which are otherwise expensive to acquire in real datasets. Fifth, the benefits of synthetic data extend to larger models such as deeper convolutional neural networks (CNNs) and Transformers, for which a significant impact is also observed. We hope our work can pave the way for scaling up 3D human recovery to the real world. Homepage: https://caizhongang.github.io/projects/GTA-Human/.

Citations: 0
Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis.
Pub Date : 2024-08-27 DOI: 10.1109/TPAMI.2024.3450630
Long Zhuo, Guangcong Wang, Shikai Li, Wayne Wu, Ziwei Liu

Video-to-Video synthesis (Vid2Vid) achieves remarkable performance in generating a photo-realistic video from a sequence of semantic maps, such as segmentation, sketch, and pose. However, this pipeline is heavily limited by high computational cost and long inference latency, mainly attributed to two essential factors: 1) network architecture parameters, and 2) the sequential data stream. Recently, the parameters of image-based generative models have been significantly reduced via more efficient network architectures. Existing methods mainly focus on slimming network architectures but ignore the size of the sequential data stream. Moreover, due to the lack of temporal coherence, image-based compression is not sufficient for the video task. In this paper, we present a spatial-temporal hybrid distillation compression framework, Fast-Vid2Vid++, which focuses on knowledge distillation from the teacher network and compression of the generative model's data stream in both space and time. Fast-Vid2Vid++ makes the first attempt along the time dimension to transfer hierarchical features and time-coherence knowledge to reduce computational resources and accelerate inference. Specifically, we compress the data stream spatially and reduce the temporal redundancy. We distill the knowledge of the hierarchical features and the final response from the teacher network to the student network in the high-resolution and full-time domains. We transfer the long-term dependencies of the features and video frames to the student model. After the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model can synthesize high-resolution key-frames using the low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with slight latency and generates full-length sequences with motion-aware inference (MAI). On standard benchmarks, Fast-Vid2Vid++ achieves real-time performance of 30-59 FPS and saves 28-35× in computational cost on a single V100 GPU. Code and models are publicly available.
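
A generic form of the two distillation terms described above, hierarchical-feature matching plus final-response matching, can be sketched as follows (the loss choice and weighting are assumptions; the paper's exact objective may differ):

```python
import numpy as np

def distillation_loss(student_feats, teacher_feats, student_out, teacher_out,
                      response_weight=1.0):
    """student_feats/teacher_feats: lists of same-shaped arrays taken from
    matched layers of the two generators; *_out: final synthesized frames.
    Returns feature distillation + weighted final-response distillation."""
    mse = lambda a, b: float(np.mean((a - b) ** 2))
    feat_loss = sum(mse(s, t) for s, t in zip(student_feats, teacher_feats))
    return feat_loss + response_weight * mse(student_out, teacher_out)

s_feats = [np.random.randn(8, 16), np.random.randn(8, 32)]
t_feats = [f + 0.1 * np.random.randn(*f.shape) for f in s_feats]
s_out, t_out = np.random.randn(3, 64, 64), np.random.randn(3, 64, 64)
print(distillation_loss(s_feats, t_feats, s_out, t_out))
```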

Citations: 0
DeepTensor: Low-Rank Tensor Decomposition with Deep Network Priors.
Pub Date : 2024-08-27 DOI: 10.1109/TPAMI.2024.3450575
Vishwanath Saragadam, Randall Balestriero, Ashok Veeraraghavan, Richard G Baraniuk

DeepTensor is a computationally efficient framework for low-rank decomposition of matrices and tensors using deep generative networks. We decompose a tensor as the product of low-rank tensor factors (e.g., a matrix as the outer product of two vectors), where each low-rank tensor is generated by a deep network (DN) that is trained in a self-supervised manner to minimize the mean-square approximation error. Our key observation is that the implicit regularization inherent in DNs enables them to capture nonlinear signal structures (e.g., manifolds) that are out of the reach of classical linear methods like the singular value decomposition (SVD) and principal components analysis (PCA). Furthermore, in contrast to the SVD and PCA, whose performance deteriorates when the tensor's entries deviate from additive white Gaussian noise, we demonstrate that the performance of DeepTensor is robust to a wide range of distributions. We validate that DeepTensor is a robust and computationally efficient drop-in replacement for the SVD, PCA, nonnegative matrix factorization (NMF), and similar decompositions by exploring a range of real-world applications, including hyperspectral image denoising, 3D MRI tomography, and image classification. In particular, DeepTensor offers a 6 dB signal-to-noise ratio improvement over standard denoising methods for signals corrupted by Poisson noise and learns to decompose 3D tensors 60 times faster than a single DN equipped with 3D convolutions.
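
The core recipe, factor vectors generated by small networks and fitted by minimizing the mean-square reconstruction error, is easy to sketch for the rank-1 matrix case (a minimal PyTorch illustration under assumed sizes and architectures, not the authors' implementation):

```python
import torch

torch.manual_seed(0)
m, n = 32, 24
# A noisy rank-1 target matrix standing in for real measurement data.
A = torch.randn(m, 1) @ torch.randn(1, n) + 0.01 * torch.randn(m, n)

def factor_net(out_dim):
    # Small MLP mapping a fixed latent code to one factor vector.
    return torch.nn.Sequential(torch.nn.Linear(8, 64), torch.nn.ReLU(),
                               torch.nn.Linear(64, out_dim))

f, g = factor_net(m), factor_net(n)
z_u, z_v = torch.randn(8), torch.randn(8)   # fixed inputs (self-supervised)
opt = torch.optim.Adam(list(f.parameters()) + list(g.parameters()), lr=1e-2)

for step in range(500):
    u, v = f(z_u), g(z_v)                   # factors produced by the networks
    loss = ((A - torch.outer(u, v)) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

print(float(loss))  # mean-square approximation error after fitting
```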

Citations: 0