Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan
{"title":"DITTO-2:用于音乐生成的蒸馏扩散推理-时间 T 优化技术","authors":"Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan","doi":"arxiv-2405.20289","DOIUrl":null,"url":null,"abstract":"Controllable music generation methods are critical for human-centered\nAI-based music creation, but are currently limited by speed, quality, and\ncontrol design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in\nparticular, offers state-of-the-art results, but is over 10x slower than\nreal-time, limiting practical use. We propose Distilled Diffusion\nInference-Time T -Optimization (or DITTO-2), a new method to speed up\ninference-time optimization-based control and unlock faster-than-real-time\ngeneration for a wide-variety of applications such as music inpainting,\noutpainting, intensity, melody, and musical structure control. Our method works\nby (1) distilling a pre-trained diffusion model for fast sampling via an\nefficient, modified consistency or consistency trajectory distillation process\n(2) performing inference-time optimization using our distilled model with\none-step sampling as an efficient surrogate optimization task and (3) running a\nfinal multi-step sampling generation (decoding) using our estimated noise\nlatents for best-quality, fast, controllable generation. Through thorough\nevaluation, we find our method not only speeds up generation over 10-20x, but\nsimultaneously improves control adherence and generation quality all at once.\nFurthermore, we apply our approach to a new application of maximizing text\nadherence (CLAP score) and show we can convert an unconditional diffusion model\nwithout text inputs into a model that yields state-of-the-art text control.\nSound examples can be found at https://ditto-music.github.io/ditto2/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation\",\"authors\":\"Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan\",\"doi\":\"arxiv-2405.20289\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Controllable music generation methods are critical for human-centered\\nAI-based music creation, but are currently limited by speed, quality, and\\ncontrol design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in\\nparticular, offers state-of-the-art results, but is over 10x slower than\\nreal-time, limiting practical use. We propose Distilled Diffusion\\nInference-Time T -Optimization (or DITTO-2), a new method to speed up\\ninference-time optimization-based control and unlock faster-than-real-time\\ngeneration for a wide-variety of applications such as music inpainting,\\noutpainting, intensity, melody, and musical structure control. Our method works\\nby (1) distilling a pre-trained diffusion model for fast sampling via an\\nefficient, modified consistency or consistency trajectory distillation process\\n(2) performing inference-time optimization using our distilled model with\\none-step sampling as an efficient surrogate optimization task and (3) running a\\nfinal multi-step sampling generation (decoding) using our estimated noise\\nlatents for best-quality, fast, controllable generation. 
Through thorough\\nevaluation, we find our method not only speeds up generation over 10-20x, but\\nsimultaneously improves control adherence and generation quality all at once.\\nFurthermore, we apply our approach to a new application of maximizing text\\nadherence (CLAP score) and show we can convert an unconditional diffusion model\\nwithout text inputs into a model that yields state-of-the-art text control.\\nSound examples can be found at https://ditto-music.github.io/ditto2/.\",\"PeriodicalId\":501178,\"journal\":{\"name\":\"arXiv - CS - Sound\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-05-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Sound\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2405.20289\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.20289","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation
Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by speed, quality, and control design trade-offs. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real-time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (or DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation by 10-20x, but simultaneously improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show that we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
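
For a concrete picture, the three-step pipeline described in the abstract can be sketched roughly as below. This is a minimal illustration only: the model interface (one_step_sample, multi_step_sample), the feature extractor feature_fn, and all hyperparameter values are assumptions made for exposition, not the authors' actual implementation.

# Hedged sketch of the DITTO-2 pipeline from the abstract. The model and
# feature-extractor interfaces used here are illustrative assumptions.
import torch

def ditto2_generate(
    distilled_model,   # step (1): consistency- (or CTM-) distilled diffusion model, assumed interface
    feature_fn,        # differentiable control feature, e.g. chroma, intensity, or a CLAP audio encoder
    target,            # control target: reference melody, intensity curve, or CLAP text embedding
    latent_shape,      # shape of the initial noise latent x_T
    n_opt_steps=50,    # illustrative defaults, not the paper's settings
    lr=1e-2,
    n_decode_steps=4,
):
    # Step (2): optimize the initial noise latent x_T, using cheap one-step
    # sampling from the distilled model as a surrogate optimization task.
    x_T = torch.randn(latent_shape, requires_grad=True)
    opt = torch.optim.Adam([x_T], lr=lr)
    for _ in range(n_opt_steps):
        opt.zero_grad()
        x0_hat = distilled_model.one_step_sample(x_T)              # single sampling step
        loss = torch.nn.functional.mse_loss(feature_fn(x0_hat), target)
        loss.backward()                                            # gradient flows through one step only
        opt.step()

    # Step (3): decode the optimized noise latent with a few sampling steps
    # for best-quality final generation.
    with torch.no_grad():
        return distilled_model.multi_step_sample(x_T.detach(), steps=n_decode_steps)

The intent of step (2) is that backpropagating through a single sampling step keeps the optimization far cheaper than differentiating through a full multi-step sampler, while step (3) recovers quality by decoding the optimized latent with several steps. For the text-adherence application mentioned in the abstract, the loss in this sketch would presumably be replaced by a (negative) CLAP similarity between the generated audio and the text prompt.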