Bullet-Screen-Emoji Attack With Temporal Difference Noise for Video Action Recognition

Yongkang Zhang; Han Zhang; Jun Li; Zhiping Shi; Jian Yang; Kaixin Yang; Shuo Yin; Qiuyan Liang; Xianglong Liu

IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 1, pp. 589-600. Published 2024-09-06. DOI: 10.1109/TCSVT.2024.3455799

Citations: 0
Abstract
Recent studies have shown that video action recognition models are also vulnerable to adversarial samples. However, existing video attack methods usually incur high computational overhead (e.g., they generate adversarial perturbations for all frames by default), and most of them are difficult to realize as printable attacks in the physical world. To address these issues, we devise a novel, efficient, and effective framework for attacking video action recognition: the Bullet-Screen-Emoji Attack with Temporal Difference Noise (BSE), a reinforcement-learning-based black-box attack that fools the model by generating adversarial bullet screens for key frames and scrolling them over the clean video. The agent is optimized to take the optimal actions, i.e., to search for key frames. Moreover, we introduce a simple and effective temporal difference noise that enhances the attack capability of the adversarial bullet screen and accelerates convergence. Most importantly, BSE enables printable physical attacks. Extensive experiments show that the proposed BSE achieves promising attack performance on mainstream datasets (HMDB51, UCF101, and Kinetics-400) and in the physical world with high efficiency.
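The abstract does not spell out how the temporal difference noise is constructed. A minimal sketch of one plausible reading, assuming the noise is derived from differences between adjacent frames and bounded by a small perturbation budget (the function name, `eps` budget, and sign-based scaling are illustrative assumptions, not the paper's formulation):

```python
import numpy as np

def temporal_difference_noise(video, eps=8.0 / 255.0):
    """Hypothetical temporal-difference noise: perturb each frame along the
    direction of frame-to-frame motion. video: (T, H, W, C) floats in [0, 1]."""
    # Frame-to-frame difference captures motion between consecutive frames.
    diff = np.diff(video, axis=0)                     # (T-1, H, W, C)
    # Repeat the last difference so the noise matches the video length.
    diff = np.concatenate([diff, diff[-1:]], axis=0)  # (T, H, W, C)
    # Keep only the sign of the motion, scaled to the budget eps.
    noise = eps * np.sign(diff)
    return np.clip(video + noise, 0.0, 1.0)

# Toy clip: 8 frames of 16x16 RGB noise.
rng = np.random.default_rng(0)
clip = rng.random((8, 16, 16, 3)).astype(np.float32)
adv = temporal_difference_noise(clip)
print(adv.shape)  # (8, 16, 16, 3)
```

In the BSE setting such a noise would be applied only to the bullet-screen region of the selected key frames, not to the whole video as in this toy example.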
Journal introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.