Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language (arXiv:2406.05629, 2024-06-09)
Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sound. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech- and sound-prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech- and sound-prompted semantic segmentation. DenseAV outperforms the previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than half of the parameters. Project Page: https://aka.ms/denseav
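To make the aggregation idea concrete, below is a minimal PyTorch sketch of how dense per-pixel and per-audio-frame features could be compared head by head and pooled into a clip-level score for contrastive training. The head count, pooling choices (max over space, mean over time), and the InfoNCE formulation are illustrative assumptions, not DenseAV's exact operator.

    import torch
    import torch.nn.functional as F

    def clip_similarity(img_feats, aud_feats, num_heads=2):
        """Pool dense similarities into clip-level scores for every pair in a batch.

        img_feats: (B, C, H, W) dense visual features
        aud_feats: (B, C, T)    dense audio features
        """
        B, C, H, W = img_feats.shape
        d = C // num_heads
        v = img_feats.reshape(B, num_heads, d, H * W)        # (B, K, d, HW)
        a = aud_feats.reshape(B, num_heads, d, -1)           # (B, K, d, T)
        # sim[i, j, k, p, t] = <pixel p of clip i, audio frame t of clip j> in head k
        sim = torch.einsum('ikdp,jkdt->ijkpt', v, a)
        # Max over space ("where"), mean over time ("when"), sum over heads.
        return sim.max(dim=3).values.mean(dim=3).sum(dim=2)  # (B, B)

    def infonce(clip_sim, temperature=0.07):
        """Symmetric InfoNCE over the (B, B) clip-similarity matrix."""
        logits = clip_sim / temperature
        targets = torch.arange(logits.shape[0], device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

Because the score is built from dense inner products rather than pooled global embeddings, the same similarity volume that drives training can be inspected at test time to localize which pixels respond to which audio frames.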
{"title":"Separating the \"Chirp\" from the \"Chat\": Self-supervised Visual Grounding of Sound and Language","authors":"Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman","doi":"arxiv-2406.05629","DOIUrl":"https://doi.org/arxiv-2406.05629","url":null,"abstract":"We present DenseAV, a novel dual encoder grounding architecture that learns\u0000high-resolution, semantically meaningful, and audio-visually aligned features\u0000solely through watching videos. We show that DenseAV can discover the\u0000``meaning'' of words and the ``location'' of sounds without explicit\u0000localization supervision. Furthermore, it automatically discovers and\u0000distinguishes between these two types of associations without supervision. We\u0000show that DenseAV's localization abilities arise from a new multi-head feature\u0000aggregation operator that directly compares dense image and audio\u0000representations for contrastive learning. In contrast, many other systems that\u0000learn ``global'' audio and video representations cannot localize words and\u0000sound. Finally, we contribute two new datasets to improve the evaluation of AV\u0000representations through speech and sound prompted semantic segmentation. On\u0000these and other datasets we show DenseAV dramatically outperforms the prior art\u0000on speech and sound prompted semantic segmentation. DenseAV outperforms the\u0000previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than\u0000half of the parameters. Project Page:\u0000href{https://aka.ms/denseav}{https://aka.ms/denseav}","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers (arXiv:2406.05370, 2024-06-08)
Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei
This paper introduces VALL-E 2, the latest advancement in neural codec language models, marking a milestone in zero-shot text-to-speech synthesis (TTS) by achieving human parity for the first time. Building on its predecessor, VALL-E, the new iteration introduces two significant enhancements. Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history; this not only stabilizes decoding but also circumvents the infinite-loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which both boosts inference speed and addresses the challenges of long-sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity; it is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. This work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis. Demos of VALL-E 2 will be posted to https://aka.ms/valle2.
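As a rough illustration of the first enhancement, here is a hedged PyTorch sketch of repetition-aware sampling as described above: draw from the nucleus as usual, but if the drawn token already dominates the recent decoding history, fall back to sampling from the full distribution. The window size and repetition ratio are assumed values, not the paper's settings.

    import torch

    def repetition_aware_sample(probs, history, top_p=0.9, window=10, ratio=0.5):
        """Sketch of repetition-aware sampling; window and ratio are assumed values.

        probs:   (V,) next-token distribution from the codec language model
        history: list of previously decoded token ids
        """
        # Simplified nucleus: keep the top tokens whose cumulative mass stays <= top_p.
        sorted_p, sorted_ids = probs.sort(descending=True)
        keep = sorted_p.cumsum(dim=0) <= top_p
        keep[0] = True                           # always keep the most likely token
        nucleus_p = sorted_p * keep.float()
        nucleus_p = nucleus_p / nucleus_p.sum()
        token = sorted_ids[torch.multinomial(nucleus_p, 1)].item()

        # If the drawn token already dominates the recent history, resample from the
        # full distribution, which breaks repetition loops.
        recent = history[-window:]
        if recent and recent.count(token) / len(recent) >= ratio:
            token = torch.multinomial(probs, 1).item()
        return token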
{"title":"VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers","authors":"Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei","doi":"arxiv-2406.05370","DOIUrl":"https://doi.org/arxiv-2406.05370","url":null,"abstract":"This paper introduces VALL-E 2, the latest advancement in neural codec\u0000language models that marks a milestone in zero-shot text-to-speech synthesis\u0000(TTS), achieving human parity for the first time. Based on its predecessor,\u0000VALL-E, the new iteration introduces two significant enhancements: Repetition\u0000Aware Sampling refines the original nucleus sampling process by accounting for\u0000token repetition in the decoding history. It not only stabilizes the decoding\u0000but also circumvents the infinite loop issue. Grouped Code Modeling organizes\u0000codec codes into groups to effectively shorten the sequence length, which not\u0000only boosts inference speed but also addresses the challenges of long sequence\u0000modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E\u00002 surpasses previous systems in speech robustness, naturalness, and speaker\u0000similarity. It is the first of its kind to reach human parity on these\u0000benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech,\u0000even for sentences that are traditionally challenging due to their complexity\u0000or repetitive phrases. The advantages of this work could contribute to valuable\u0000endeavors, such as generating speech for individuals with aphasia or people\u0000with amyotrophic lateral sclerosis. Demos of VALL-E 2 will be posted to\u0000https://aka.ms/valle2.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models (arXiv:2406.05464, 2024-06-08)
Tzu-Quan Lin, Hung-yi Lee, Hao Tang
Self-supervised speech models have been shown to be useful for various tasks, but their large size limits their use on devices with low computing power and memory. In this work, we explore early exit, an approach for reducing latency by exiting the forward pass of a network early. Most early-exit approaches need a separate early-exit model for each task, with some even requiring fine-tuning of the entire pretrained model. We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss, eliminating the need for multiple rounds of training and fine-tuning. DAISY matches the performance of HuBERT on the MiniSUPERB benchmark, but with much faster inference times. Our analysis of DAISY's adaptivity shows that the model exits early (using fewer layers) on clean data and late (using more layers) on noisy data, dynamically adjusting the computational cost of inference based on the noise level of each sample.
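A minimal sketch of the loss-based early-exit idea follows, assuming per-layer heads that estimate a self-supervised loss proxy from the hidden states and hypothetical per-layer thresholds (the abstract does not specify these details).

    import torch

    def daisy_style_forward(layers, loss_heads, x, thresholds):
        """Loss-based early exit; loss_heads and thresholds are hypothetical.

        layers:      list of transformer blocks
        loss_heads:  per-layer modules estimating a self-supervised loss proxy
                     (e.g., masked-prediction error) from the hidden states
        thresholds:  per-layer exit thresholds
        """
        h = x
        for i, (layer, head) in enumerate(zip(layers, loss_heads)):
            h = layer(h)
            est_loss = head(h).mean()
            if est_loss < thresholds[i]:
                return h, i + 1          # clean input: exit with fewer layers
        return h, len(layers)            # noisy input: run the full stack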
{"title":"DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models","authors":"Tzu-Quan Lin, Hung-yi Lee, Hao Tang","doi":"arxiv-2406.05464","DOIUrl":"https://doi.org/arxiv-2406.05464","url":null,"abstract":"Self-supervised speech models have shown to be useful for various tasks, but\u0000their large size limits the use in devices with low computing power and memory.\u0000In this work, we explore early exit, an approach for reducing latency by\u0000exiting the forward process of a network early. Most approaches of early exit\u0000need a separate early exit model for each task, with some even requiring\u0000fine-tuning of the entire pretrained model. We introduce Data Adaptive\u0000Self-Supervised Early Exit (DAISY), an approach that decides when to exit based\u0000on the self-supervised loss, eliminating the need for multiple round of\u0000training and fine-tuning. DAISY matches the performance of HuBERT on the\u0000MiniSUPERB benchmark, but with much faster inference times. Our analysis on the\u0000adaptivity of DAISY shows that the model exits early (using fewer layers) on\u0000clean data while exits late (using more layers) on noisy data, dynamically\u0000adjusting the computational cost of inference based on the noise level of each\u0000sample.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation (arXiv:2406.05515, 2024-06-08)
Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim
Acoustic context effects, where surrounding changes in pitch, rate, or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases around different pairs of vowels for second-language (L2) speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a data-driven manner, the prosodic profiles that bias their perception. Testing English and French speakers (n=25), we showed that vowel perception is in fact influenced by conflicting effects from the surrounding pitch and speech rate: a congruent proximal effect 0.2 s before the target and a contrastive distal effect up to 1 s before it. We also found that L1 and L2 speakers exhibited strikingly similar prosodic profiles in perception. We provide a novel method to investigate acoustic context effects across stimuli, timescales, and acoustic domains.
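For readers unfamiliar with reverse correlation, a first-order analysis can be as simple as averaging the random context profiles by response; the snippet below is a simplified NumPy sketch of that step, not the authors' full pipeline.

    import numpy as np

    def reverse_correlation_kernel(profiles, responses):
        """First-order reverse-correlation kernel (simplified).

        profiles:  (n_trials, n_timepoints) random pitch or rate profiles applied
                   to the context phrase on each trial
        responses: (n_trials,) bool, True when the listener reported the first
                   vowel of the pair (e.g., /i/ rather than /I/)
        """
        profiles = np.asarray(profiles, dtype=float)
        responses = np.asarray(responses, dtype=bool)
        # Difference of mean context profiles between the two response classes;
        # its shape over time exposes proximal vs. distal biases.
        return profiles[responses].mean(axis=0) - profiles[~responses].mean(axis=0)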
{"title":"Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation","authors":"Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim","doi":"arxiv-2406.05515","DOIUrl":"https://doi.org/arxiv-2406.05515","url":null,"abstract":"Acoustic context effects, where surrounding changes in pitch, rate or timbre\u0000influence the perception of a sound, are well documented in speech perception,\u0000but how they interact with language background remains unclear. Using a\u0000reverse-correlation approach, we systematically varied the pitch and speech\u0000rate in phrases around different pairs of vowels for second language (L2)\u0000speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a\u0000data-driven manner, the prosodic profiles that bias their perception. Testing\u0000English and French speakers (n=25), we showed that vowel perception is in fact\u0000influenced by conflicting effects from the surrounding pitch and speech rate: a\u0000congruent proximal effect 0.2s pre-target and a distal contrastive effect up to\u00001s before; and found that L1 and L2 speakers exhibited strikingly similar\u0000prosodic profiles in perception. We provide a novel method to investigate\u0000acoustic context effects across stimuli, timescales, and acoustic domain.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring the Benefits of Tokenization of Discrete Acoustic Units (arXiv:2406.05547, 2024-06-08)
Avihu Dekel, Raul Fernandez
Tokenization algorithms that merge the units of a base vocabulary into larger, variable-rate units have become standard in natural language processing tasks. This idea, however, has been mostly overlooked when the vocabulary consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based representation that is playing an increasingly important role due to the success of discrete language-modeling techniques. In this paper, we showcase the advantages of tokenization of phonetic units and of DAUs on three prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised speech generation using DAU language modeling. We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed, across all three tasks. We also offer theoretical insights to provide some explanation for the superior performance observed.
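The merging operation is the same byte-pair-encoding-style loop used for text, just applied to integer DAU ids. Here is a toy sketch of that training loop (functionally standard BPE, not the authors' implementation):

    from collections import Counter

    def learn_dau_bpe(sequences, num_merges):
        """Toy BPE over integer DAU ids: repeatedly merge the most frequent pair."""
        merges = []
        next_id = max(u for seq in sequences for u in seq) + 1
        for _ in range(num_merges):
            pairs = Counter((s[i], s[i + 1]) for s in sequences for i in range(len(s) - 1))
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]
            merges.append((best, next_id))
            retok = []
            for s in sequences:
                out, i = [], 0
                while i < len(s):
                    if i + 1 < len(s) and (s[i], s[i + 1]) == best:
                        out.append(next_id)      # replace the pair with a new, larger unit
                        i += 2
                    else:
                        out.append(s[i])
                        i += 1
                retok.append(out)
            sequences = retok
            next_id += 1
        return merges, sequences

Applying the learned merges at inference shortens the DAU sequences the language model must predict, which is where the reported training and inference speed gains come from.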
{"title":"Exploring the Benefits of Tokenization of Discrete Acoustic Units","authors":"Avihu Dekel, Raul Fernandez","doi":"arxiv-2406.05547","DOIUrl":"https://doi.org/arxiv-2406.05547","url":null,"abstract":"Tokenization algorithms that merge the units of a base vocabulary into\u0000larger, variable-rate units have become standard in natural language processing\u0000tasks. This idea, however, has been mostly overlooked when the vocabulary\u0000consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based\u0000representation that is playing an increasingly important role due to the\u0000success of discrete language-modeling techniques. In this paper, we showcase\u0000the advantages of tokenization of phonetic units and of DAUs on three\u0000prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised\u0000speech generation using DAU language modeling. We demonstrate that tokenization\u0000yields significant improvements in terms of performance, as well as training\u0000and inference speed, across all three tasks. We also offer theoretical insights\u0000to provide some explanation for the superior performance observed.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling (arXiv:2406.04321, 2024-06-06)
Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo
In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 190K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The code and datasets will be available at https://github.com/ZeyueT/VidMuse/.
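As an illustration of what long-short-term conditioning might look like, here is a hedged PyTorch sketch that fuses a local window of frame features with a globally pooled video embedding; the module name, window size, and fusion scheme are assumptions rather than the released architecture.

    import torch
    import torch.nn as nn

    class LongShortTermConditioner(nn.Module):
        """Hypothetical fusion of local and global visual cues for a music decoder."""

        def __init__(self, vis_dim, cond_dim, window=30):
            super().__init__()
            self.window = window                       # assumed local window (frames)
            self.local_proj = nn.Linear(vis_dim, cond_dim)
            self.global_proj = nn.Linear(vis_dim, cond_dim)

        def forward(self, frame_feats, t):
            # frame_feats: (T, vis_dim) per-frame features; t: current frame index
            lo = max(0, t - self.window)
            local = self.local_proj(frame_feats[lo:t + 1])           # short-term cue
            global_ctx = self.global_proj(frame_feats.mean(dim=0))   # long-term cue
            # Conditioning sequence a music decoder could cross-attend to.
            return torch.cat([global_ctx.unsqueeze(0), local], dim=0)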
{"title":"VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling","authors":"Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo","doi":"arxiv-2406.04321","DOIUrl":"https://doi.org/arxiv-2406.04321","url":null,"abstract":"In this work, we systematically study music generation conditioned solely on\u0000the video. First, we present a large-scale dataset comprising 190K video-music\u0000pairs, including various genres such as movie trailers, advertisements, and\u0000documentaries. Furthermore, we propose VidMuse, a simple framework for\u0000generating music aligned with video inputs. VidMuse stands out by producing\u0000high-fidelity music that is both acoustically and semantically aligned with the\u0000video. By incorporating local and global visual cues, VidMuse enables the\u0000creation of musically coherent audio tracks that consistently match the video\u0000content through Long-Short-Term modeling. Through extensive experiments,\u0000VidMuse outperforms existing models in terms of audio quality, diversity, and\u0000audio-visual alignment. The code and datasets will be available at\u0000https://github.com/ZeyueT/VidMuse/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Operational Latent Spaces (arXiv:2406.02699, 2024-06-04)
Scott H. Hawley, Austin R. Tackett
We investigate the construction of latent spaces through self-supervised learning to support semantically meaningful operations. Analogous to operational amplifiers, these "operational latent spaces" (OpLaS) not only demonstrate semantic structure such as clustering but also support common transformational operations with inherent semantic meaning. Some operational latent spaces are found to have arisen "unintentionally" in the progress toward some (other) self-supervised learning objective, in which unintended but still useful properties are discovered among the relationships of points in the space. Other spaces may be constructed "intentionally" by developers stipulating certain kinds of clustering or transformations intended to produce the desired structure. We focus on the intentional creation of operational latent spaces via self-supervised learning, including the introduction of rotation operators via a novel "FiLMR" layer, which can be used to enable ring-like symmetries found in some musical constructions.
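The abstract does not define the FiLMR layer, but one plausible reading of a FiLM-with-rotation operator is a conditioning network that predicts per-pair rotation angles; the sketch below assumes that form purely for illustration.

    import torch
    import torch.nn as nn

    class RotationFiLM(nn.Module):
        """Assumed form of a "FiLMR" layer: conditioned 2-D rotations of feature pairs."""

        def __init__(self, feat_dim, cond_dim):
            super().__init__()
            assert feat_dim % 2 == 0
            self.to_angles = nn.Linear(cond_dim, feat_dim // 2)

        def forward(self, z, cond):
            # z: (B, D) latents; cond: (B, cond_dim) conditioning signal
            theta = self.to_angles(cond)              # one angle per 2-D pair
            x, y = z[:, 0::2], z[:, 1::2]
            cos, sin = torch.cos(theta), torch.sin(theta)
            xr = cos * x - sin * y                    # rotate each pair by theta
            yr = sin * x + cos * y
            return torch.stack((xr, yr), dim=-1).reshape(z.shape)

Applying such a layer n times with angles fixed to 2*pi/n returns each pair to its starting point, which is one way a rotation operator could realize the ring-like symmetries mentioned above.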
{"title":"Operational Latent Spaces","authors":"Scott H. Hawley, Austin R. Tackett","doi":"arxiv-2406.02699","DOIUrl":"https://doi.org/arxiv-2406.02699","url":null,"abstract":"We investigate the construction of latent spaces through self-supervised\u0000learning to support semantically meaningful operations. Analogous to\u0000operational amplifiers, these \"operational latent spaces\" (OpLaS) not only\u0000demonstrate semantic structure such as clustering but also support common\u0000transformational operations with inherent semantic meaning. Some operational\u0000latent spaces are found to have arisen \"unintentionally\" in the progress toward\u0000some (other) self-supervised learning objective, in which unintended but still\u0000useful properties are discovered among the relationships of points in the\u0000space. Other spaces may be constructed \"intentionally\" by developers\u0000stipulating certain kinds of clustering or transformations intended to produce\u0000the desired structure. We focus on the intentional creation of operational\u0000latent spaces via self-supervised learning, including the introduction of\u0000rotation operators via a novel \"FiLMR\" layer, which can be used to enable\u0000ring-like symmetries found in some musical constructions.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Searching For Music Mixing Graphs: A Pruning Approach (arXiv:2406.01049, 2024-06-03)
Sungho Lee, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji
Music mixing is compositional: experts combine multiple audio processors to achieve a cohesive mix from dry source tracks. We propose a method to reverse engineer this process from the input and output audio. First, we create a mixing console that applies all available processors to every chain. Then, after the initial console parameter optimization, we alternate between removing redundant processors and fine-tuning. We achieve this through differentiable implementations of both the processors and the pruning. Consequently, we find a sparse mixing graph that achieves nearly identical matching quality to the full mixing console. We apply this procedure to dry-mix pairs from various datasets and collect graphs that can also be used to train neural networks for music mixing applications.
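One common way to make such pruning differentiable is to wrap every processor in a learnable wet/dry gate and drop processors whose gates collapse toward zero; the sketch below assumes this gating formulation, which may differ from the authors' exact parameterization.

    import torch
    import torch.nn as nn

    class GatedProcessor(nn.Module):
        """A differentiable processor wrapped in a learnable wet/dry gate."""

        def __init__(self, processor):
            super().__init__()
            self.processor = processor               # e.g., a differentiable EQ or compressor
            self.gate_logit = nn.Parameter(torch.zeros(1))

        def forward(self, x):
            g = torch.sigmoid(self.gate_logit)
            return g * self.processor(x) + (1 - g) * x   # gate near 0 means "bypass"

    def prune(gated_processors, threshold=0.05):
        """Keep only processors whose gates have not collapsed; fine-tune afterwards."""
        return [p for p in gated_processors
                if torch.sigmoid(p.gate_logit).item() > threshold]

Alternating between optimizing the console parameters (including the gates) and calling a pruning step like this would yield the sparse mixing graph described above.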
{"title":"Searching For Music Mixing Graphs: A Pruning Approach","authors":"Sungho Lee, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji","doi":"arxiv-2406.01049","DOIUrl":"https://doi.org/arxiv-2406.01049","url":null,"abstract":"Music mixing is compositional -- experts combine multiple audio processors to\u0000achieve a cohesive mix from dry source tracks. We propose a method to reverse\u0000engineer this process from the input and output audio. First, we create a\u0000mixing console that applies all available processors to every chain. Then,\u0000after the initial console parameter optimization, we alternate between removing\u0000redundant processors and fine-tuning. We achieve this through differentiable\u0000implementation of both processors and pruning. Consequently, we find a sparse\u0000mixing graph that achieves nearly identical matching quality of the full mixing\u0000console. We apply this procedure to dry-mix pairs from various datasets and\u0000collect graphs that also can be used to train neural networks for music mixing\u0000applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141254791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation (arXiv:2405.20289, 2024-05-30)
Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan
Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by trade-offs among speed, quality, and control design. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation by over 10-20x, but simultaneously improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
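The three-stage recipe can be summarized in a short, hedged PyTorch sketch: optimize the initial noise latents against a differentiable control objective using the cheap one-step distilled model, then decode once with the full multi-step sampler. All function interfaces below are assumptions for illustration, not the released code.

    import torch

    def ditto2_control(one_step_model, multi_step_sampler, loss_fn,
                       latent_shape, steps=50, lr=0.05, device='cuda'):
        """Optimize initial noise with a one-step surrogate, decode with the full sampler.

        one_step_model(z):     distilled model mapping noise latents to output in one step
        multi_step_sampler(z): full sampler, used once at the end for best quality
        loss_fn(x):            differentiable control objective (e.g., a CLAP or melody loss)
        """
        z = torch.randn(latent_shape, device=device, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = loss_fn(one_step_model(z))      # cheap surrogate generation
            loss.backward()
            opt.step()
        with torch.no_grad():
            return multi_step_sampler(z.detach())  # final multi-step decoding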
{"title":"DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation","authors":"Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan","doi":"arxiv-2405.20289","DOIUrl":"https://doi.org/arxiv-2405.20289","url":null,"abstract":"Controllable music generation methods are critical for human-centered\u0000AI-based music creation, but are currently limited by speed, quality, and\u0000control design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in\u0000particular, offers state-of-the-art results, but is over 10x slower than\u0000real-time, limiting practical use. We propose Distilled Diffusion\u0000Inference-Time T -Optimization (or DITTO-2), a new method to speed up\u0000inference-time optimization-based control and unlock faster-than-real-time\u0000generation for a wide-variety of applications such as music inpainting,\u0000outpainting, intensity, melody, and musical structure control. Our method works\u0000by (1) distilling a pre-trained diffusion model for fast sampling via an\u0000efficient, modified consistency or consistency trajectory distillation process\u0000(2) performing inference-time optimization using our distilled model with\u0000one-step sampling as an efficient surrogate optimization task and (3) running a\u0000final multi-step sampling generation (decoding) using our estimated noise\u0000latents for best-quality, fast, controllable generation. Through thorough\u0000evaluation, we find our method not only speeds up generation over 10-20x, but\u0000simultaneously improves control adherence and generation quality all at once.\u0000Furthermore, we apply our approach to a new application of maximizing text\u0000adherence (CLAP score) and show we can convert an unconditional diffusion model\u0000without text inputs into a model that yields state-of-the-art text control.\u0000Sound examples can be found at https://ditto-music.github.io/ditto2/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141187830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLMs Meet Multimodal Generation and Editing: A Survey (arXiv:2405.19334, 2024-05-29)
Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen
With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on understanding. This survey elaborates on multimodal generation across different domains, including image, video, 3D, and audio, where we highlight the notable advancements with milestone works in these fields. Specifically, we exhaustively investigate the key technical components behind methods and multimodal datasets utilized in these studies. Moreover, we dig into tool-augmented multimodal agents that can use existing generative models for human-computer interaction. Lastly, we also comprehensively discuss the advancement in AI safety and investigate emerging applications as well as future prospects. Our work provides a systematic and insightful overview of multimodal generation, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation
{"title":"LLMs Meet Multimodal Generation and Editing: A Survey","authors":"Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen","doi":"arxiv-2405.19334","DOIUrl":"https://doi.org/arxiv-2405.19334","url":null,"abstract":"With the recent advancement in large language models (LLMs), there is a\u0000growing interest in combining LLMs with multimodal learning. Previous surveys\u0000of multimodal large language models (MLLMs) mainly focus on understanding. This\u0000survey elaborates on multimodal generation across different domains, including\u0000image, video, 3D, and audio, where we highlight the notable advancements with\u0000milestone works in these fields. Specifically, we exhaustively investigate the\u0000key technical components behind methods and multimodal datasets utilized in\u0000these studies. Moreover, we dig into tool-augmented multimodal agents that can\u0000use existing generative models for human-computer interaction. Lastly, we also\u0000comprehensively discuss the advancement in AI safety and investigate emerging\u0000applications as well as future prospects. Our work provides a systematic and\u0000insightful overview of multimodal generation, which is expected to advance the\u0000development of Artificial Intelligence for Generative Content (AIGC) and world\u0000models. A curated list of all related papers can be found at\u0000https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141528501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}