{"title":"InvPT++:用于视觉场景理解的倒金字塔多任务变换器","authors":"Hanrong Ye;Dan Xu","doi":"10.1109/TPAMI.2024.3397031","DOIUrl":null,"url":null,"abstract":"Multi-task scene understanding aims to design models that can simultaneously predict several scene understanding tasks with one versatile model. Previous studies typically process multi-task features in a more local way, and thus cannot effectively learn spatially global and cross-task interactions, which hampers the models’ ability to fully leverage the consistency of various tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among spatial features of different tasks in a global context. Specifically, we first utilize a transformer encoder to capture task-generic features for all tasks. And then, we design a transformer decoder to establish spatial and cross-task interaction globally, and a novel UP-Transformer block is devised to increase the resolutions of multi-task features gradually and establish cross-task interaction at different scales. Furthermore, two types of Cross-Scale Self-Attention modules, \n<italic>i.e.</i>\n, Fusion Attention and Selective Attention, are proposed to efficiently facilitate cross-task interaction across different feature scales. An Encoder Feature Aggregation strategy is further introduced to better model multi-scale information in the decoder. Comprehensive experiments on several 2D/3D multi-task benchmarks clearly demonstrate our proposal's effectiveness, establishing significant state-of-the-art performances.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"7493-7508"},"PeriodicalIF":18.6000,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"InvPT++: Inverted Pyramid Multi-Task Transformer for Visual Scene Understanding\",\"authors\":\"Hanrong Ye;Dan Xu\",\"doi\":\"10.1109/TPAMI.2024.3397031\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multi-task scene understanding aims to design models that can simultaneously predict several scene understanding tasks with one versatile model. Previous studies typically process multi-task features in a more local way, and thus cannot effectively learn spatially global and cross-task interactions, which hampers the models’ ability to fully leverage the consistency of various tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer, capable of modeling cross-task interaction among spatial features of different tasks in a global context. Specifically, we first utilize a transformer encoder to capture task-generic features for all tasks. And then, we design a transformer decoder to establish spatial and cross-task interaction globally, and a novel UP-Transformer block is devised to increase the resolutions of multi-task features gradually and establish cross-task interaction at different scales. Furthermore, two types of Cross-Scale Self-Attention modules, \\n<italic>i.e.</i>\\n, Fusion Attention and Selective Attention, are proposed to efficiently facilitate cross-task interaction across different feature scales. An Encoder Feature Aggregation strategy is further introduced to better model multi-scale information in the decoder. 
Comprehensive experiments on several 2D/3D multi-task benchmarks clearly demonstrate our proposal's effectiveness, establishing significant state-of-the-art performances.\",\"PeriodicalId\":94034,\"journal\":{\"name\":\"IEEE transactions on pattern analysis and machine intelligence\",\"volume\":\"46 12\",\"pages\":\"7493-7508\"},\"PeriodicalIF\":18.6000,\"publicationDate\":\"2024-03-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE transactions on pattern analysis and machine intelligence\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10520818/\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10520818/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Multi-task scene understanding aims to design a single versatile model that can simultaneously predict several scene understanding tasks. Previous studies typically process multi-task features in a local manner and thus cannot effectively learn spatially global or cross-task interactions, which hampers their ability to fully exploit the consistency among tasks in multi-task learning. To tackle this problem, we propose an Inverted Pyramid multi-task Transformer (InvPT++), capable of modeling cross-task interaction among the spatial features of different tasks in a global context. Specifically, we first use a transformer encoder to capture task-generic features for all tasks. We then design a transformer decoder that establishes spatial and cross-task interaction globally, with a novel UP-Transformer block that gradually increases the resolution of the multi-task features and establishes cross-task interaction at different scales. Furthermore, two types of Cross-Scale Self-Attention modules, i.e., Fusion Attention and Selective Attention, are proposed to efficiently facilitate cross-task interaction across different feature scales. An Encoder Feature Aggregation strategy is further introduced to better model multi-scale information in the decoder. Comprehensive experiments on several 2D/3D multi-task benchmarks clearly demonstrate the effectiveness of our proposal, establishing new state-of-the-art performance.