{"title":"利用多分支注意力加权进行弱监督时间动作定位","authors":"Mengxue Liu, Wenjing Li, Fangzhen Ge, Xiangjun Gao","doi":"10.1007/s00530-024-01445-2","DOIUrl":null,"url":null,"abstract":"<p>Weakly-supervised temporal action localization aims to train an accurate and robust localization model using only video-level labels. Due to the lack of frame-level temporal annotations, existing weakly-supervised temporal action localization methods typically rely on multiple instance learning mechanisms to localize and classify all action instances in an untrimmed video. However, these methods focus only on the most discriminative regions that contribute to the classification task, neglecting a large number of ambiguous background and context snippets in the video. We believe that these controversial snippets have a significant impact on the localization results. To mitigate this issue, we propose a multi-branch attention weighting network (MAW-Net), which introduces an additional non-action class and integrates a multi-branch attention module to generate action and background attention, respectively. In addition, considering the correlation among context, action, and background, we use the difference of action and background attention to construct context attention. Finally, based on these three types of attention values, we obtain three new class activation sequences that distinguish action, background, and context. This enables our model to effectively remove background and context snippets in the localization results. Extensive experiments were performed on the THUMOS-14 and Activitynet1.3 datasets. The experimental results show that our method is superior to other state-of-the-art methods, and its performance is comparable to those of fully-supervised approaches.</p>","PeriodicalId":3,"journal":{"name":"ACS Applied Electronic Materials","volume":null,"pages":null},"PeriodicalIF":4.3000,"publicationDate":"2024-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Weakly-supervised temporal action localization using multi-branch attention weighting\",\"authors\":\"Mengxue Liu, Wenjing Li, Fangzhen Ge, Xiangjun Gao\",\"doi\":\"10.1007/s00530-024-01445-2\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p>Weakly-supervised temporal action localization aims to train an accurate and robust localization model using only video-level labels. Due to the lack of frame-level temporal annotations, existing weakly-supervised temporal action localization methods typically rely on multiple instance learning mechanisms to localize and classify all action instances in an untrimmed video. However, these methods focus only on the most discriminative regions that contribute to the classification task, neglecting a large number of ambiguous background and context snippets in the video. We believe that these controversial snippets have a significant impact on the localization results. To mitigate this issue, we propose a multi-branch attention weighting network (MAW-Net), which introduces an additional non-action class and integrates a multi-branch attention module to generate action and background attention, respectively. In addition, considering the correlation among context, action, and background, we use the difference of action and background attention to construct context attention. Finally, based on these three types of attention values, we obtain three new class activation sequences that distinguish action, background, and context. This enables our model to effectively remove background and context snippets in the localization results. Extensive experiments were performed on the THUMOS-14 and Activitynet1.3 datasets. The experimental results show that our method is superior to other state-of-the-art methods, and its performance is comparable to those of fully-supervised approaches.</p>\",\"PeriodicalId\":3,\"journal\":{\"name\":\"ACS Applied Electronic Materials\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":4.3000,\"publicationDate\":\"2024-08-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACS Applied Electronic Materials\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.1007/s00530-024-01445-2\",\"RegionNum\":3,\"RegionCategory\":\"材料科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ENGINEERING, ELECTRICAL & ELECTRONIC\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACS Applied Electronic Materials","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s00530-024-01445-2","RegionNum":3,"RegionCategory":"材料科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Weakly-supervised temporal action localization using multi-branch attention weighting
Weakly-supervised temporal action localization aims to train an accurate and robust localization model using only video-level labels. Due to the lack of frame-level temporal annotations, existing weakly-supervised temporal action localization methods typically rely on multiple instance learning mechanisms to localize and classify all action instances in an untrimmed video. However, these methods focus only on the most discriminative regions that contribute to the classification task, neglecting a large number of ambiguous background and context snippets in the video. We believe that these controversial snippets have a significant impact on the localization results. To mitigate this issue, we propose a multi-branch attention weighting network (MAW-Net), which introduces an additional non-action class and integrates a multi-branch attention module to generate action and background attention, respectively. In addition, considering the correlation among context, action, and background, we use the difference of action and background attention to construct context attention. Finally, based on these three types of attention values, we obtain three new class activation sequences that distinguish action, background, and context. This enables our model to effectively remove background and context snippets in the localization results. Extensive experiments were performed on the THUMOS-14 and Activitynet1.3 datasets. The experimental results show that our method is superior to other state-of-the-art methods, and its performance is comparable to those of fully-supervised approaches.