Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis

Long Zhuo, Guangcong Wang, Shikai Li, Wayne Wu, Ziwei Liu
{"title":"Fast-Vid2Vid++: Spatial-Temporal Distillation for Real-Time Video-to-Video Synthesis","authors":"Long Zhuo;Guangcong Wang;Shikai Li;Wayne Wu;Ziwei Liu","doi":"10.1109/TPAMI.2024.3450630","DOIUrl":null,"url":null,"abstract":"Video-to-Video synthesis (Vid2Vid) gains remarkable performance in generating a photo-realistic video from a sequence of semantic maps, such as segmentation, sketch and pose. However, this pipeline is heavily limited to high computational cost and long inference latency, mainly attributed to two essential factors: 1) network architecture parameters, 2) sequential data stream. Recently, the parameters of image-based generative models have been significantly reduced via more efficient network architectures. Existing methods mainly focus on slimming network architectures but ignore the size of the sequential data stream. Moreover, due to the lack of temporal coherence, image-based compression is not sufficient for the compression of the video task. In this paper, we present a spatial-temporal hybrid distillation compression framework, \n<bold>Fast-Vid2Vid++</b>\n, which focuses on knowledge distillation of the teacher network and the data stream of generative models on both space and time. Fast-Vid2Vid++ makes the first attempt at time dimension to transfer hierarchical features and time coherence knowledge to reduce computational resources and accelerate inference. Specifically, we compress the data stream spatially and reduce the temporal redundancy. We distill the knowledge of the hierarchical features and the final response from the teacher network to the student network in high-resolution and full-time domains. We transfer the long-term dependencies of the features and video frames to the student model. After the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model can synthesize high-resolution key-frames using the low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with slight latency and generates full-length sequences with motion-aware inference (MAI). On standard benchmarks, Fast-Vid2Vid++ achieves a real-time performance of 30–59 FPS and saves 28–35× computational cost on a single V100 GPU.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"46 12","pages":"10732-10747"},"PeriodicalIF":0.0000,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10652893/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Video-to-Video synthesis (Vid2Vid) achieves remarkable performance in generating photo-realistic videos from sequences of semantic maps, such as segmentation, sketch, and pose. However, the pipeline is heavily limited by high computational cost and long inference latency, which mainly stem from two factors: 1) the network architecture parameters and 2) the sequential data stream. Recently, the parameter counts of image-based generative models have been significantly reduced through more efficient network architectures, but existing methods focus on slimming the network architecture and ignore the size of the sequential data stream. Moreover, because it lacks temporal coherence, image-based compression is not sufficient for the video task. In this paper, we present a spatial-temporal hybrid distillation compression framework, Fast-Vid2Vid++, which performs knowledge distillation on both the teacher network and the data stream of generative models in space and time. Fast-Vid2Vid++ makes the first attempt along the time dimension to transfer hierarchical features and temporal-coherence knowledge in order to reduce computational resources and accelerate inference. Specifically, we compress the data stream spatially and reduce its temporal redundancy. We distill the hierarchical features and the final response from the teacher network to the student network in the high-resolution and full-time domains, and transfer the long-term dependencies of the features and video frames to the student model. With the proposed spatial-temporal hybrid knowledge distillation (Spatial-Temporal-HKD), our model can synthesize high-resolution key frames from a low-resolution data stream. Finally, Fast-Vid2Vid++ interpolates intermediate frames by motion compensation with slight latency and generates full-length sequences with motion-aware inference (MAI). On standard benchmarks, Fast-Vid2Vid++ achieves real-time performance of 30–59 FPS and saves 28–35× computational cost on a single V100 GPU. Code and models are publicly available.
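The abstract only summarizes the distillation objective at a high level. As a rough orientation, below is a minimal PyTorch-style sketch of what a spatial-temporal hybrid distillation loss of this kind could look like; the function name, tensor layouts, and loss weights are illustrative assumptions rather than the authors' released implementation.

# Illustrative sketch only: a generic spatial-temporal distillation objective in the
# spirit of Spatial-Temporal-HKD. Names, shapes, and weights are assumptions.
import torch
import torch.nn.functional as F

def hybrid_distillation_loss(student_feats, teacher_feats,
                             student_frames, teacher_frames,
                             w_feat=1.0, w_resp=1.0, w_temp=1.0):
    """Combine hierarchical-feature, final-response, and temporal-coherence terms.

    student_feats / teacher_feats: lists of per-layer feature maps,
        each of shape (B, T, C, H, W).
    student_frames / teacher_frames: synthesized key frames, (B, T, 3, H, W).
    """
    # 1) Hierarchical feature distillation: match intermediate activations.
    feat_loss = sum(F.l1_loss(s, t.detach())
                    for s, t in zip(student_feats, teacher_feats))

    # 2) Final-response distillation: match the teacher's output frames.
    resp_loss = F.l1_loss(student_frames, teacher_frames.detach())

    # 3) Temporal-coherence transfer: match frame-to-frame differences so the
    #    student inherits the teacher's temporal behaviour.
    s_diff = student_frames[:, 1:] - student_frames[:, :-1]
    t_diff = teacher_frames[:, 1:] - teacher_frames[:, :-1]
    temp_loss = F.l1_loss(s_diff, t_diff.detach())

    return w_feat * feat_loss + w_resp * resp_loss + w_temp * temp_loss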
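Likewise, motion-aware inference (MAI) is described only in prose: the student synthesizes key frames, and the remaining frames are filled in by motion compensation. The sketch below illustrates that idea under stated assumptions; student_generator, estimate_flow, and warp are hypothetical helpers, not the Fast-Vid2Vid++ API.

# Illustrative sketch only: key-frame generation plus motion-compensated
# interpolation. All helpers and the stride value are assumptions.
import torch

def motion_aware_inference(semantic_maps, student_generator, estimate_flow, warp,
                           keyframe_stride=2):
    """Synthesize only key frames with the student, then fill in the rest
    by motion-compensated interpolation between neighbouring key frames."""
    num_frames = semantic_maps.shape[0]
    key_idx = list(range(0, num_frames, keyframe_stride))

    # 1) Run the compressed student generator on key frames only.
    key_frames = {i: student_generator(semantic_maps[i]) for i in key_idx}

    # 2) Interpolate intermediate frames by motion compensation.
    output = []
    for i in range(num_frames):
        if i in key_frames:
            output.append(key_frames[i])
            continue
        prev_i = (i // keyframe_stride) * keyframe_stride
        next_i = min(prev_i + keyframe_stride, key_idx[-1])
        alpha = (i - prev_i) / max(next_i - prev_i, 1)
        # Estimate motion between surrounding key frames, warp both towards the
        # intermediate time step, and blend.
        flow = estimate_flow(key_frames[prev_i], key_frames[next_i])
        fwd = warp(key_frames[prev_i], alpha * flow)
        bwd = warp(key_frames[next_i], (alpha - 1.0) * flow)
        output.append((1 - alpha) * fwd + alpha * bwd)
    return torch.stack(output)

A key-frame-plus-interpolation schedule like this runs the generator on only a fraction of the frames, which is consistent with the latency savings the abstract reports.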