CTOD: Cross-Attentive Task-Alignment for One-Stage Object Detection
Authors: Ruilin Yao; Yi Rong; Qiangqiang Huang; Shengwu Xiong
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 11, pp. 11507-11520. Published 2024-07-03. DOI: 10.1109/TCSVT.2024.3422879
URL: https://ieeexplore.ieee.org/document/10583896/
Existing one-stage object detectors are commonly implemented as multi-task learners that simultaneously solve two sub-tasks: object classification and localization. To this end, detection heads with two independent branches are typically used to extract task-specific image features separately. However, because the parallel branches do not interact, the differing learning objectives of classification and localization lead to spatial misalignment between the predictions of the two tasks. In this work, we propose a novel Cross-attentive Task-aligned Object Detection (CTOD) method that addresses this problem by explicitly promoting prediction consistency between the two tasks. Specifically, we first design a Dual Task Interaction (DTI) module, which uses a task cross-attention mechanism to generate task-interactive embeddings for each branch from the task-specific features. Based on these embeddings, we then propose a Spatial Feature Aggregation (SFA) module that calculates offsets and weights to aggregate information from nearby feature points at each spatial location of the task-specific feature maps. We also generate adjustment parameters from the task-interactive embeddings to align the final predictions of the two tasks, which are obtained from the enhanced task-specific features. Extensive experiments are conducted on the MS-COCO dataset. With ResNeXt-101-64×4d-DCN as the backbone, CTOD achieves 51.8 AP with single-model, single-scale testing, outperforming the recently proposed one-stage detectors ATSS, VFNet, LD and TOOD by 4.1, 1.9, 1.3 and 0.7 AP, respectively. Qualitative analysis further illustrates the effectiveness of CTOD in resolving the task misalignment problem in object detection. Our code is available at https://github.com/Mr-Bigworth/CTOD.
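The task cross-attention idea in the abstract can be illustrated with a minimal sketch. This is not the authors' DTI module: it is plain scaled dot-product cross-attention with no learned projections, and the names `cross_attend`, `cls_feats` and `loc_feats` are hypothetical, introduced only for illustration. It shows how features from one branch could query the other branch to form interaction-aware embeddings.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query_feats, context_feats):
    """Scaled dot-product cross-attention without learned projections.

    Assumption: a real detection head would apply learned query/key/value
    projections and operate on tensors; this toy version uses raw vectors.
    Each query vector is replaced by an attention-weighted average of the
    context vectors.
    """
    dim = len(query_feats[0])
    scale = 1.0 / math.sqrt(dim)
    embeddings = []
    for q in query_feats:
        # Similarity of this query to every context vector.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in context_feats]
        weights = softmax(scores)
        # Weighted sum of context vectors, one output channel at a time.
        embeddings.append(
            [sum(w * v[d] for w, v in zip(weights, context_feats)) for d in range(dim)]
        )
    return embeddings


# Toy task-specific features: 3 spatial locations, 2 channels each (made-up values).
cls_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loc_feats = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]

# Each branch queries the other, giving one embedding per spatial location.
cls_embed = cross_attend(cls_feats, loc_feats)
loc_embed = cross_attend(loc_feats, cls_feats)
```

Because the attention weights are non-negative and sum to one, every output vector is a convex combination of the other branch's features. In this sketch that is the only channel through which the two branches exchange information.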
About the journal:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.