TP-LSM: visual temporal pyramidal time modeling network to multi-label action detection in image-based AI
Authors: Haojie Gao, Peishun Liu, Xiaolong Ma, Zikang Yan, Ningning Ma, Wenqiang Liu, Xuefang Wang, Ruichun Tang
Journal: The Visual Computer. Published: 2024-08-30. DOI: 10.1007/s00371-024-03601-1 (https://doi.org/10.1007/s00371-024-03601-1)
Dense multi-label action detection is a challenging task in the field of visual action analysis: multiple actions occur simultaneously over different time spans, so accurately modeling the short-term and long-term temporal dependencies between actions is crucial for detection. An effective temporal modeling technique is needed to capture these dependencies in videos and to learn long-term and short-term action information efficiently. This paper proposes a new method for dense multi-label temporal action detection based on a temporal pyramid and long short-term temporal modeling, combining a hierarchical structure with a pyramid feature hierarchy. Using the expansion and compression convolution module (SEC) and external attention for temporal modeling, we focus on the temporal relationships of long-term and short-term actions at each stage. We then integrate the hierarchical pyramid features to detect actions accurately at different temporal resolution scales. We evaluated the model on dense multi-label benchmark datasets and achieved mAP of 47.3% and 36.0% on the MultiTHUMOS and TSU datasets, exceeding the current state-of-the-art results by 2.7% and 2.3%, respectively. The code is available at https://github.com/Yoona6371/TP-LSM.
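To make the idea of a temporal pyramid feature hierarchy more concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the average-pooling downsampling, the nearest-neighbour upsampling, and the fusion by simple averaging are all assumptions chosen for illustration, and the SEC module and external attention are not modeled here. It only shows how per-frame features can be viewed at several temporal resolutions and merged back at the finest one.

```python
from typing import List

# Hypothetical representation: T frames, each a list of C feature values.
FrameFeatures = List[List[float]]


def avg_pool_time(x: FrameFeatures, stride: int = 2) -> FrameFeatures:
    """Reduce temporal resolution by averaging non-overlapping windows of frames."""
    pooled = []
    for start in range(0, len(x), stride):
        window = x[start:start + stride]
        # zip(*window) groups the values of each channel across the window.
        pooled.append([sum(col) / len(window) for col in zip(*window)])
    return pooled


def upsample_time(x: FrameFeatures, length: int) -> FrameFeatures:
    """Nearest-neighbour upsampling of a coarse sequence back to `length` frames."""
    src = len(x)
    return [x[min(t * src // length, src - 1)] for t in range(length)]


def build_pyramid(x: FrameFeatures, levels: int = 3) -> List[FrameFeatures]:
    """Level 0 is the input; each further level halves the temporal resolution."""
    pyramid = [x]
    for _ in range(levels - 1):
        pyramid.append(avg_pool_time(pyramid[-1]))
    return pyramid


def fuse_pyramid(pyramid: List[FrameFeatures]) -> FrameFeatures:
    """Upsample every level to the finest resolution and average them per frame."""
    length = len(pyramid[0])
    upsampled = [upsample_time(level, length) for level in pyramid]
    fused = []
    for t in range(length):
        frames = [level[t] for level in upsampled]
        fused.append([sum(vals) / len(frames) for vals in zip(*frames)])
    return fused


if __name__ == "__main__":
    # Toy input: 4 frames with a single feature channel.
    features = [[0.0], [2.0], [4.0], [6.0]]
    pyramid = build_pyramid(features, levels=3)
    print([len(level) for level in pyramid])  # [4, 2, 1]
    print(fuse_pyramid(pyramid))
```

In this toy setup the coarse levels summarize longer time spans while the finest level keeps short-term detail, so the fused per-frame features mix both. A learned model would replace the fixed pooling and averaging with trainable layers.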