DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation

Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan
arXiv - CS - Sound, arxiv-2405.20289, published 2024-05-30

Abstract

Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method speeds up generation by 10-20x while also improving control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
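Steps (2) and (3) of the method can be illustrated with a small toy sketch. This is not the authors' implementation: the samplers `one_step_sample` and `multi_step_sample`, the masked squared-error loss, the learning rate, and all constants are hypothetical stand-ins chosen so the example runs with the Python standard library alone. It shows only the general pattern the abstract describes: optimize the initial noise latent against a control loss computed through a cheap one-step sampler, then decode the optimized latent with a multi-step sampler.

```python
import math
import random

def one_step_sample(noise):
    """Toy stand-in for a distilled one-step sampler (assumption, not the paper's model)."""
    return [math.tanh(z) for z in noise]

def multi_step_sample(noise, steps=4):
    """Toy stand-in for multi-step decoding: one-step output plus small detail terms."""
    x = one_step_sample(noise)
    for k in range(1, steps + 1):
        x = [xi + 0.02 / k * math.sin(3.0 * z) for xi, z in zip(x, noise)]
    return x

def control_loss(x, target, mask):
    """Inpainting-style control: squared error on the positions selected by mask."""
    return sum((xi - ti) ** 2 for xi, ti, m in zip(x, target, mask) if m)

def loss_grad_wrt_noise(noise, target, mask):
    """Analytic gradient of control_loss(one_step_sample(noise)) w.r.t. the noise."""
    grad = []
    for z, ti, m in zip(noise, target, mask):
        y = math.tanh(z)
        grad.append(2.0 * (y - ti) * (1.0 - y * y) if m else 0.0)
    return grad

def optimize_noise(noise, target, mask, lr=0.5, iters=200):
    """Gradient descent on the noise latent, using the one-step sampler as surrogate."""
    noise = list(noise)
    for _ in range(iters):
        grad = loss_grad_wrt_noise(noise, target, mask)
        noise = [z - lr * g for z, g in zip(noise, grad)]
    return noise

random.seed(0)
target = [0.5, -0.5, 0.25, -0.25, 0.0, 0.0, 0.0, 0.0]
mask = [True] * 4 + [False] * 4          # control only the first four positions
init_noise = [random.gauss(0.0, 1.0) for _ in range(len(target))]

# Step (2): optimize the noise latent through the cheap one-step surrogate.
opt_noise = optimize_noise(init_noise, target, mask)

# Step (3): decode with the multi-step sampler from the optimized latent.
loss_before = control_loss(multi_step_sample(init_noise), target, mask)
loss_after = control_loss(multi_step_sample(opt_noise), target, mask)
print(f"multi-step control loss: {loss_before:.4f} -> {loss_after:.4f}")
```

In this toy setting the control loss measured on the multi-step output drops even though the optimization only ever saw the one-step sampler, which is the sense in which the one-step task acts as a surrogate. The small residual that remains comes from the mismatch between the two toy samplers; how large that mismatch is for real distilled models is not something this sketch can say.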