OSV: One Step is Enough for High-Quality Image to Video Generation

Xiaofeng Mao, Zhengkai Jiang, Fu-Yun Wang, Wenbing Zhu, Jiangning Zhang, Hao Chen, Mingmin Chi, Yabiao Wang
Published 2024-09-17 in arXiv - CS - Computer Vision and Pattern Recognition. DOI: arxiv-2409.11367 (https://doi.org/arxiv-2409.11367)

Abstract

Video diffusion models have shown great potential for generating high-quality videos, making them an increasingly popular research focus. However, their inherently iterative nature leads to substantial computational and time costs. Efforts have been made to accelerate video diffusion by reducing the number of inference steps, through techniques such as consistency distillation and GAN training, but these approaches often fall short in either performance or training stability. In this work, we introduce a two-stage training framework that effectively combines consistency distillation with GAN training to address these challenges. Additionally, we propose a novel video discriminator design that eliminates the need to decode the video latents and improves final performance. Our model can produce high-quality videos in a single step, with the flexibility to perform multi-step refinement for further performance enhancement. Our quantitative evaluation on the OpenWebVid-1M benchmark shows that our model significantly outperforms existing methods. Notably, our 1-step performance (FVD 171.15) exceeds the 8-step performance of the consistency-distillation-based method AnimateLCM (FVD 184.79) and approaches the 25-step performance of the advanced Stable Video Diffusion (FVD 156.94).
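
The one-step generation with optional multi-step refinement described in the abstract resembles the sampling pattern of consistency models in general. The sketch below illustrates that generic pattern only and should not be read as the authors' pipeline: the noise levels EPS and T_MAX, the toy data distribution, and the names toy_consistency_fn and sample are all assumptions made for illustration. A closed-form function for 1-D Gaussian data stands in for a trained video generator.

```python
import numpy as np

# All constants below are illustrative assumptions, not values from the paper.
EPS = 0.002        # smallest noise level
T_MAX = 80.0       # largest noise level (pure-noise starting point)
MU, S = 1.5, 0.5   # toy "data" distribution N(MU, S^2), standing in for video latents


def toy_consistency_fn(x, t):
    """Map a noisy sample at noise level t to a clean sample in one call.

    For Gaussian data noised as x_t = x_0 + t * z, this mapping has a closed
    form. A real system would use a trained network here (hypothetical).
    """
    scale = np.sqrt((S**2 + EPS**2) / (S**2 + t**2))
    return MU + (x - MU) * scale


def sample(f, shape, refine_times=(), rng=None):
    """Generic consistency-style sampling.

    With refine_times empty this is one-step generation. Each entry in
    refine_times (decreasing noise levels) adds one refinement step:
    re-noise the current sample, then map it back to a clean sample.
    """
    rng = np.random.default_rng() if rng is None else rng
    x_T = T_MAX * rng.standard_normal(shape)   # start from pure noise
    x = f(x_T, T_MAX)                          # one-step generation
    for tau in refine_times:
        z = rng.standard_normal(shape)
        x_tau = x + np.sqrt(tau**2 - EPS**2) * z   # re-noise to level tau
        x = f(x_tau, tau)                          # denoise in a single call
    return x


one_step = sample(toy_consistency_fn, (20000,), rng=np.random.default_rng(0))
multi_step = sample(toy_consistency_fn, (20000,), refine_times=(10.0, 1.0),
                    rng=np.random.default_rng(0))
print("one-step   mean/std:", one_step.mean(), one_step.std())
print("multi-step mean/std:", multi_step.mean(), multi_step.std())
```

In this toy setting both variants recover the target distribution almost exactly, because the stand-in function is exact. With a learned generator the one-step output is only approximate, which is where additional refinement steps could plausibly help.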