PADVG: A Simple Baseline of Active Protection for Audio-driven Video Generation

ACM Transactions on Multimedia Computing Communications and Applications (IF 5.2; Zone 3, Computer Science; JCR Q1, Computer Science, Information Systems) Pub Date: 2024-01-16 DOI: 10.1145/3638556
Huan Liu, Xiaolong Liu, Zichang Tan, Xiaolong Li, Yao Zhao
Citations: 0

Abstract

Over the past few years, deep generative models have advanced significantly, enabling the synthesis of realistic content while also raising security concerns about illegal misuse. Active protection for generative models has therefore been proposed recently: it aims to generate samples carrying hidden messages for future identification while preserving the original generation performance. However, existing active protection methods are designed specifically for generative adversarial networks (GANs) and are restricted to unconditional image generation. We observe that they achieve limited identification performance and visual quality on audio-driven video generation, which is conditioned on target audio and a source input and must keep context, e.g., identity and movement, consistent across frame sequences. To address this issue, we introduce a simple yet effective active Protection framework for Audio-Driven Video Generation, named PADVG. Specifically, we present a novel frame-shared embedding module in which the messages to hide are first transformed into frame-shared message coefficients. These coefficients are then assembled with the intermediate feature maps of video generators at multiple feature levels to generate the embedded video frames. PADVG further employs two visual consistency losses: i) an intra-frame loss keeps the output visually consistent under different hidden messages; ii) an inter-frame loss preserves visual consistency across different video frames. Moreover, we propose an auxiliary denoising training strategy that perturbs the assembled features with learnable pixel-level noise to improve identification performance and enhance robustness against real-world disturbances. Extensive experiments demonstrate that PADVG can effectively identify the generated videos while achieving high visual quality.
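
The embedding idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the abstract gives no formulas, so the linear bit-to-coefficient map, the channel-wise scaling, the `strength` parameter, and both loss definitions below are assumptions chosen only to make the terms concrete. A single feature level is used, whereas the abstract describes several.

```python
import random

def message_to_coefficients(bits, weights):
    """Map message bits to one coefficient per feature channel.
    Hypothetical linear map; the paper's actual transform is not given here."""
    signs = [2 * b - 1 for b in bits]  # {0, 1} -> {-1, +1}
    return [sum(w * s for w, s in zip(row, signs)) for row in weights]

def embed(frames, coeffs, strength=0.1):
    """Scale each channel of every frame by the SAME coefficients,
    which is what makes the coefficients 'frame-shared' in this sketch."""
    return [[[v * (1.0 + strength * c) for v in channel]
             for channel, c in zip(frame, coeffs)]
            for frame in frames]

def mse(a, b):
    """Mean squared error between two frames (lists of channels)."""
    flat_a = [v for ch in a for v in ch]
    flat_b = [v for ch in b for v in ch]
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)

def intra_frame_loss(video_m1, video_m2):
    """Assumed form: the same frame embedded with two different messages
    should look alike, so penalize their per-frame difference."""
    return sum(mse(f1, f2) for f1, f2 in zip(video_m1, video_m2)) / len(video_m1)

def frame_delta(f0, f1):
    """Element-wise change between two consecutive frames."""
    return [[b - a for a, b in zip(c0, c1)] for c0, c1 in zip(f0, f1)]

def inter_frame_loss(original, embedded):
    """Assumed form: frame-to-frame change in the embedded video should
    match the frame-to-frame change in the original video."""
    steps = len(original) - 1
    return sum(mse(frame_delta(original[t], original[t + 1]),
                   frame_delta(embedded[t], embedded[t + 1]))
               for t in range(steps)) / steps

# Toy data: 3 frames, 4 channels, 5 values per channel, 8-bit messages.
rng = random.Random(0)
num_frames, channels, pixels, msg_len = 3, 4, 5, 8
frames = [[[rng.random() for _ in range(pixels)] for _ in range(channels)]
          for _ in range(num_frames)]
weights = [[rng.uniform(-1, 1) for _ in range(msg_len)] for _ in range(channels)]
bits_a = [1, 0, 1, 1, 0, 0, 1, 0]
bits_b = [0, 1, 1, 0, 1, 0, 0, 1]

video_a = embed(frames, message_to_coefficients(bits_a, weights))
video_b = embed(frames, message_to_coefficients(bits_b, weights))
print("intra-frame loss:", intra_frame_loss(video_a, video_b))
print("inter-frame loss:", inter_frame_loss(frames, video_a))
```

In a real system these quantities would be computed on generator feature maps and rendered frames and minimized jointly with a message-recovery objective; the sketch only shows how one set of coefficients can be reused across all frames and how the two consistency terms differ.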

Source journal
CiteScore: 8.50
Self-citation rate: 5.90%
Articles published: 285
Review time: 7.5 months
Journal description: The ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM) is the flagship publication of the ACM Special Interest Group in Multimedia (SIGMM). It solicits paper submissions on all aspects of multimedia. Papers on single media (for instance, audio, video, animation) and their processing are also welcome. TOMM is a peer-reviewed, archival journal, available in both print and digital form. The journal is published quarterly, with roughly seven 23-page articles in each issue. In addition, all Special Issues are published online-only to ensure timely publication. The transactions consist primarily of research papers. This is an archival journal, and papers are intended to have lasting importance and value over time. In general, papers whose primary focus is on particular multimedia products or the current state of the industry will not be included.
Latest articles from the journal
TA-Detector: A GNN-based Anomaly Detector via Trust Relationship
KF-VTON: Keypoints-Driven Flow Based Virtual Try-On Network
Unified View Empirical Study for Large Pretrained Model on Cross-Domain Few-Shot Learning
Multimodal Fusion for Talking Face Generation Utilizing Speech-related Facial Action Units
Compressed Point Cloud Quality Index by Combining Global Appearance and Local Details