Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language (arXiv:2406.05629, 2024-06-09)
Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman
We present DenseAV, a novel dual encoder grounding architecture that learns high-resolution, semantically meaningful, and audio-visually aligned features solely through watching videos. We show that DenseAV can discover the "meaning" of words and the "location" of sounds without explicit localization supervision. Furthermore, it automatically discovers and distinguishes between these two types of associations without supervision. We show that DenseAV's localization abilities arise from a new multi-head feature aggregation operator that directly compares dense image and audio representations for contrastive learning. In contrast, many other systems that learn "global" audio and video representations cannot localize words and sound. Finally, we contribute two new datasets to improve the evaluation of AV representations through speech- and sound-prompted semantic segmentation. On these and other datasets we show DenseAV dramatically outperforms the prior art on speech- and sound-prompted semantic segmentation. DenseAV outperforms the previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than half of the parameters. Project Page: https://aka.ms/denseav
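To make the aggregation idea concrete, below is a minimal PyTorch sketch of how dense per-pixel and per-audio-frame features could be compared head by head and pooled into a clip-level score for contrastive training. The head count, pooling choices (max over space, mean over time), and the InfoNCE formulation are illustrative assumptions, not DenseAV's exact operator.

    import torch
    import torch.nn.functional as F

    def clip_similarity(img_feats, aud_feats, num_heads=2):
        """Pool dense similarities into clip-level scores for every pair in a batch.

        img_feats: (B, C, H, W) dense visual features
        aud_feats: (B, C, T)    dense audio features
        """
        B, C, H, W = img_feats.shape
        d = C // num_heads
        v = img_feats.reshape(B, num_heads, d, H * W)        # (B, K, d, HW)
        a = aud_feats.reshape(B, num_heads, d, -1)           # (B, K, d, T)
        # sim[i, j, k, p, t] = <pixel p of clip i, audio frame t of clip j> in head k
        sim = torch.einsum('ikdp,jkdt->ijkpt', v, a)
        # Max over space ("where"), mean over time ("when"), sum over heads.
        return sim.max(dim=3).values.mean(dim=3).sum(dim=2)  # (B, B)

    def infonce(clip_sim, temperature=0.07):
        """Symmetric InfoNCE over the (B, B) clip-similarity matrix."""
        logits = clip_sim / temperature
        targets = torch.arange(logits.shape[0], device=logits.device)
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

Because the score is built from dense inner products rather than pooled global embeddings, the same similarity volume that drives training can be inspected at test time to localize which pixels respond to which audio frames.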
{"title":"Separating the \"Chirp\" from the \"Chat\": Self-supervised Visual Grounding of Sound and Language","authors":"Mark Hamilton, Andrew Zisserman, John R. Hershey, William T. Freeman","doi":"arxiv-2406.05629","DOIUrl":"https://doi.org/arxiv-2406.05629","url":null,"abstract":"We present DenseAV, a novel dual encoder grounding architecture that learns\u0000high-resolution, semantically meaningful, and audio-visually aligned features\u0000solely through watching videos. We show that DenseAV can discover the\u0000``meaning'' of words and the ``location'' of sounds without explicit\u0000localization supervision. Furthermore, it automatically discovers and\u0000distinguishes between these two types of associations without supervision. We\u0000show that DenseAV's localization abilities arise from a new multi-head feature\u0000aggregation operator that directly compares dense image and audio\u0000representations for contrastive learning. In contrast, many other systems that\u0000learn ``global'' audio and video representations cannot localize words and\u0000sound. Finally, we contribute two new datasets to improve the evaluation of AV\u0000representations through speech and sound prompted semantic segmentation. On\u0000these and other datasets we show DenseAV dramatically outperforms the prior art\u0000on speech and sound prompted semantic segmentation. DenseAV outperforms the\u0000previous state-of-the-art, ImageBind, on cross-modal retrieval using fewer than\u0000half of the parameters. Project Page:\u0000href{https://aka.ms/denseav}{https://aka.ms/denseav}","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers (arXiv:2406.05370, 2024-06-08)
Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei
This paper introduces VALL-E 2, the latest advancement in neural codec language models, marking a milestone in zero-shot text-to-speech synthesis (TTS) by achieving human parity for the first time. Building on its predecessor, VALL-E, the new iteration introduces two significant enhancements. Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history; this not only stabilizes decoding but also circumvents the infinite-loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which both boosts inference speed and addresses the challenges of long-sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity; it is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. This work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis. Demos of VALL-E 2 will be posted to https://aka.ms/valle2.
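As a rough illustration of the first enhancement, here is a hedged PyTorch sketch of repetition-aware sampling as described above: draw from the nucleus as usual, but if the drawn token already dominates the recent decoding history, fall back to sampling from the full distribution. The window size and repetition ratio are assumed values, not the paper's settings.

    import torch

    def repetition_aware_sample(probs, history, top_p=0.9, window=10, ratio=0.5):
        """Sketch of repetition-aware sampling; window and ratio are assumed values.

        probs:   (V,) next-token distribution from the codec language model
        history: list of previously decoded token ids
        """
        # Simplified nucleus: keep the top tokens whose cumulative mass stays <= top_p.
        sorted_p, sorted_ids = probs.sort(descending=True)
        keep = sorted_p.cumsum(dim=0) <= top_p
        keep[0] = True                           # always keep the most likely token
        nucleus_p = sorted_p * keep.float()
        nucleus_p = nucleus_p / nucleus_p.sum()
        token = sorted_ids[torch.multinomial(nucleus_p, 1)].item()

        # If the drawn token already dominates the recent history, resample from the
        # full distribution, which breaks repetition loops.
        recent = history[-window:]
        if recent and recent.count(token) / len(recent) >= ratio:
            token = torch.multinomial(probs, 1).item()
        return token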
{"title":"VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers","authors":"Sanyuan Chen, Shujie Liu, Long Zhou, Yanqing Liu, Xu Tan, Jinyu Li, Sheng Zhao, Yao Qian, Furu Wei","doi":"arxiv-2406.05370","DOIUrl":"https://doi.org/arxiv-2406.05370","url":null,"abstract":"This paper introduces VALL-E 2, the latest advancement in neural codec\u0000language models that marks a milestone in zero-shot text-to-speech synthesis\u0000(TTS), achieving human parity for the first time. Based on its predecessor,\u0000VALL-E, the new iteration introduces two significant enhancements: Repetition\u0000Aware Sampling refines the original nucleus sampling process by accounting for\u0000token repetition in the decoding history. It not only stabilizes the decoding\u0000but also circumvents the infinite loop issue. Grouped Code Modeling organizes\u0000codec codes into groups to effectively shorten the sequence length, which not\u0000only boosts inference speed but also addresses the challenges of long sequence\u0000modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E\u00002 surpasses previous systems in speech robustness, naturalness, and speaker\u0000similarity. It is the first of its kind to reach human parity on these\u0000benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech,\u0000even for sentences that are traditionally challenging due to their complexity\u0000or repetitive phrases. The advantages of this work could contribute to valuable\u0000endeavors, such as generating speech for individuals with aphasia or people\u0000with amyotrophic lateral sclerosis. Demos of VALL-E 2 will be posted to\u0000https://aka.ms/valle2.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models (arXiv:2406.05464, 2024-06-08)
Tzu-Quan Lin, Hung-yi Lee, Hao Tang
Self-supervised speech models have been shown to be useful for various tasks, but their large size limits their use on devices with low computing power and memory. In this work, we explore early exit, an approach for reducing latency by exiting the forward pass of a network early. Most early-exit approaches need a separate early-exit model for each task, with some even requiring fine-tuning of the entire pretrained model. We introduce Data Adaptive Self-Supervised Early Exit (DAISY), an approach that decides when to exit based on the self-supervised loss, eliminating the need for multiple rounds of training and fine-tuning. DAISY matches the performance of HuBERT on the MiniSUPERB benchmark, but with much faster inference times. Our analysis of DAISY's adaptivity shows that the model exits early (using fewer layers) on clean data and late (using more layers) on noisy data, dynamically adjusting the computational cost of inference based on the noise level of each sample.
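A minimal sketch of the loss-based early-exit idea follows, assuming per-layer heads that estimate a self-supervised loss proxy from the hidden states and hypothetical per-layer thresholds (the abstract does not specify these details).

    import torch

    def daisy_style_forward(layers, loss_heads, x, thresholds):
        """Loss-based early exit; loss_heads and thresholds are hypothetical.

        layers:      list of transformer blocks
        loss_heads:  per-layer modules estimating a self-supervised loss proxy
                     (e.g., masked-prediction error) from the hidden states
        thresholds:  per-layer exit thresholds
        """
        h = x
        for i, (layer, head) in enumerate(zip(layers, loss_heads)):
            h = layer(h)
            est_loss = head(h).mean()
            if est_loss < thresholds[i]:
                return h, i + 1          # clean input: exit with fewer layers
        return h, len(layers)            # noisy input: run the full stack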
{"title":"DAISY: Data Adaptive Self-Supervised Early Exit for Speech Representation Models","authors":"Tzu-Quan Lin, Hung-yi Lee, Hao Tang","doi":"arxiv-2406.05464","DOIUrl":"https://doi.org/arxiv-2406.05464","url":null,"abstract":"Self-supervised speech models have shown to be useful for various tasks, but\u0000their large size limits the use in devices with low computing power and memory.\u0000In this work, we explore early exit, an approach for reducing latency by\u0000exiting the forward process of a network early. Most approaches of early exit\u0000need a separate early exit model for each task, with some even requiring\u0000fine-tuning of the entire pretrained model. We introduce Data Adaptive\u0000Self-Supervised Early Exit (DAISY), an approach that decides when to exit based\u0000on the self-supervised loss, eliminating the need for multiple round of\u0000training and fine-tuning. DAISY matches the performance of HuBERT on the\u0000MiniSUPERB benchmark, but with much faster inference times. Our analysis on the\u0000adaptivity of DAISY shows that the model exits early (using fewer layers) on\u0000clean data while exits late (using more layers) on noisy data, dynamically\u0000adjusting the computational cost of inference based on the noise level of each\u0000sample.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation (arXiv:2406.05515, 2024-06-08)
Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim
Acoustic context effects, where surrounding changes in pitch, rate, or timbre influence the perception of a sound, are well documented in speech perception, but how they interact with language background remains unclear. Using a reverse-correlation approach, we systematically varied the pitch and speech rate in phrases around different pairs of vowels for second-language (L2) speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a data-driven manner, the prosodic profiles that bias their perception. Testing English and French speakers (n=25), we showed that vowel perception is in fact influenced by conflicting effects from the surrounding pitch and speech rate: a congruent proximal effect 0.2 s before the target and a contrastive distal effect up to 1 s before it. We also found that L1 and L2 speakers exhibited strikingly similar prosodic profiles in perception. We provide a novel method to investigate acoustic context effects across stimuli, timescales, and acoustic domains.
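For readers unfamiliar with reverse correlation, a first-order analysis can be as simple as averaging the random context profiles by response; the snippet below is a simplified NumPy sketch of that step, not the authors' full pipeline.

    import numpy as np

    def reverse_correlation_kernel(profiles, responses):
        """First-order reverse-correlation kernel (simplified).

        profiles:  (n_trials, n_timepoints) random pitch or rate profiles applied
                   to the context phrase on each trial
        responses: (n_trials,) bool, True when the listener reported the first
                   vowel of the pair (e.g., /i/ rather than /I/)
        """
        profiles = np.asarray(profiles, dtype=float)
        responses = np.asarray(responses, dtype=bool)
        # Difference of mean context profiles between the two response classes;
        # its shape over time exposes proximal vs. distal biases.
        return profiles[responses].mean(axis=0) - profiles[~responses].mean(axis=0)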
{"title":"Mmm whatcha say? Uncovering distal and proximal context effects in first and second-language word perception using psychophysical reverse correlation","authors":"Paige Tuttösí, H. Henny Yeung, Yue Wang, Fenqi Wang, Guillaume Denis, Jean-Julien Aucouturier, Angelica Lim","doi":"arxiv-2406.05515","DOIUrl":"https://doi.org/arxiv-2406.05515","url":null,"abstract":"Acoustic context effects, where surrounding changes in pitch, rate or timbre\u0000influence the perception of a sound, are well documented in speech perception,\u0000but how they interact with language background remains unclear. Using a\u0000reverse-correlation approach, we systematically varied the pitch and speech\u0000rate in phrases around different pairs of vowels for second language (L2)\u0000speakers of English (/i/-/I/) and French (/u/-/y/), thus reconstructing, in a\u0000data-driven manner, the prosodic profiles that bias their perception. Testing\u0000English and French speakers (n=25), we showed that vowel perception is in fact\u0000influenced by conflicting effects from the surrounding pitch and speech rate: a\u0000congruent proximal effect 0.2s pre-target and a distal contrastive effect up to\u00001s before; and found that L1 and L2 speakers exhibited strikingly similar\u0000prosodic profiles in perception. We provide a novel method to investigate\u0000acoustic context effects across stimuli, timescales, and acoustic domain.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring the Benefits of Tokenization of Discrete Acoustic Units (arXiv:2406.05547, 2024-06-08)
Avihu Dekel, Raul Fernandez
Tokenization algorithms that merge the units of a base vocabulary into larger, variable-rate units have become standard in natural language processing tasks. This idea, however, has been mostly overlooked when the vocabulary consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based representation that is playing an increasingly important role due to the success of discrete language-modeling techniques. In this paper, we showcase the advantages of tokenization of phonetic units and of DAUs on three prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised speech generation using DAU language modeling. We demonstrate that tokenization yields significant improvements in terms of performance, as well as training and inference speed, across all three tasks. We also offer theoretical insights to provide some explanation for the superior performance observed.
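The merging operation is the same byte-pair-encoding-style loop used for text, just applied to integer DAU ids. Here is a toy sketch of that training loop (functionally standard BPE, not the authors' implementation):

    from collections import Counter

    def learn_dau_bpe(sequences, num_merges):
        """Toy BPE over integer DAU ids: repeatedly merge the most frequent pair."""
        merges = []
        next_id = max(u for seq in sequences for u in seq) + 1
        for _ in range(num_merges):
            pairs = Counter((s[i], s[i + 1]) for s in sequences for i in range(len(s) - 1))
            if not pairs:
                break
            best = pairs.most_common(1)[0][0]
            merges.append((best, next_id))
            retok = []
            for s in sequences:
                out, i = [], 0
                while i < len(s):
                    if i + 1 < len(s) and (s[i], s[i + 1]) == best:
                        out.append(next_id)      # replace the pair with a new, larger unit
                        i += 2
                    else:
                        out.append(s[i])
                        i += 1
                retok.append(out)
            sequences = retok
            next_id += 1
        return merges, sequences

Applying the learned merges at inference shortens the DAU sequences the language model must predict, which is where the reported training and inference speed gains come from.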
{"title":"Exploring the Benefits of Tokenization of Discrete Acoustic Units","authors":"Avihu Dekel, Raul Fernandez","doi":"arxiv-2406.05547","DOIUrl":"https://doi.org/arxiv-2406.05547","url":null,"abstract":"Tokenization algorithms that merge the units of a base vocabulary into\u0000larger, variable-rate units have become standard in natural language processing\u0000tasks. This idea, however, has been mostly overlooked when the vocabulary\u0000consists of phonemes or Discrete Acoustic Units (DAUs), an audio-based\u0000representation that is playing an increasingly important role due to the\u0000success of discrete language-modeling techniques. In this paper, we showcase\u0000the advantages of tokenization of phonetic units and of DAUs on three\u0000prediction tasks: grapheme-to-phoneme, grapheme-to-DAUs, and unsupervised\u0000speech generation using DAU language modeling. We demonstrate that tokenization\u0000yields significant improvements in terms of performance, as well as training\u0000and inference speed, across all three tasks. We also offer theoretical insights\u0000to provide some explanation for the superior performance observed.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141509126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling (arXiv:2406.04321, 2024-06-06)
Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo
In this work, we systematically study music generation conditioned solely on the video. First, we present a large-scale dataset comprising 190K video-music pairs, including various genres such as movie trailers, advertisements, and documentaries. Furthermore, we propose VidMuse, a simple framework for generating music aligned with video inputs. VidMuse stands out by producing high-fidelity music that is both acoustically and semantically aligned with the video. By incorporating local and global visual cues, VidMuse enables the creation of musically coherent audio tracks that consistently match the video content through Long-Short-Term modeling. Through extensive experiments, VidMuse outperforms existing models in terms of audio quality, diversity, and audio-visual alignment. The code and datasets will be available at https://github.com/ZeyueT/VidMuse/.
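As an illustration of what long-short-term conditioning might look like, here is a hedged PyTorch sketch that fuses a local window of frame features with a globally pooled video embedding; the module name, window size, and fusion scheme are assumptions rather than the released architecture.

    import torch
    import torch.nn as nn

    class LongShortTermConditioner(nn.Module):
        """Hypothetical fusion of local and global visual cues for a music decoder."""

        def __init__(self, vis_dim, cond_dim, window=30):
            super().__init__()
            self.window = window                       # assumed local window (frames)
            self.local_proj = nn.Linear(vis_dim, cond_dim)
            self.global_proj = nn.Linear(vis_dim, cond_dim)

        def forward(self, frame_feats, t):
            # frame_feats: (T, vis_dim) per-frame features; t: current frame index
            lo = max(0, t - self.window)
            local = self.local_proj(frame_feats[lo:t + 1])           # short-term cue
            global_ctx = self.global_proj(frame_feats.mean(dim=0))   # long-term cue
            # Conditioning sequence a music decoder could cross-attend to.
            return torch.cat([global_ctx.unsqueeze(0), local], dim=0)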
{"title":"VidMuse: A Simple Video-to-Music Generation Framework with Long-Short-Term Modeling","authors":"Zeyue Tian, Zhaoyang Liu, Ruibin Yuan, Jiahao Pan, Xiaoqiang Huang, Qifeng Liu, Xu Tan, Qifeng Chen, Wei Xue, Yike Guo","doi":"arxiv-2406.04321","DOIUrl":"https://doi.org/arxiv-2406.04321","url":null,"abstract":"In this work, we systematically study music generation conditioned solely on\u0000the video. First, we present a large-scale dataset comprising 190K video-music\u0000pairs, including various genres such as movie trailers, advertisements, and\u0000documentaries. Furthermore, we propose VidMuse, a simple framework for\u0000generating music aligned with video inputs. VidMuse stands out by producing\u0000high-fidelity music that is both acoustically and semantically aligned with the\u0000video. By incorporating local and global visual cues, VidMuse enables the\u0000creation of musically coherent audio tracks that consistently match the video\u0000content through Long-Short-Term modeling. Through extensive experiments,\u0000VidMuse outperforms existing models in terms of audio quality, diversity, and\u0000audio-visual alignment. The code and datasets will be available at\u0000https://github.com/ZeyueT/VidMuse/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Operational Latent Spaces (arXiv:2406.02699, 2024-06-04)
Scott H. Hawley, Austin R. Tackett
We investigate the construction of latent spaces through self-supervised learning to support semantically meaningful operations. Analogous to operational amplifiers, these "operational latent spaces" (OpLaS) not only demonstrate semantic structure such as clustering but also support common transformational operations with inherent semantic meaning. Some operational latent spaces are found to have arisen "unintentionally" in the progress toward some (other) self-supervised learning objective, in which unintended but still useful properties are discovered among the relationships of points in the space. Other spaces may be constructed "intentionally" by developers stipulating certain kinds of clustering or transformations intended to produce the desired structure. We focus on the intentional creation of operational latent spaces via self-supervised learning, including the introduction of rotation operators via a novel "FiLMR" layer, which can be used to enable ring-like symmetries found in some musical constructions.
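The abstract does not define the FiLMR layer, but one plausible reading of a FiLM-with-rotation operator is a conditioning network that predicts per-pair rotation angles; the sketch below assumes that form purely for illustration.

    import torch
    import torch.nn as nn

    class RotationFiLM(nn.Module):
        """Assumed form of a "FiLMR" layer: conditioned 2-D rotations of feature pairs."""

        def __init__(self, feat_dim, cond_dim):
            super().__init__()
            assert feat_dim % 2 == 0
            self.to_angles = nn.Linear(cond_dim, feat_dim // 2)

        def forward(self, z, cond):
            # z: (B, D) latents; cond: (B, cond_dim) conditioning signal
            theta = self.to_angles(cond)              # one angle per 2-D pair
            x, y = z[:, 0::2], z[:, 1::2]
            cos, sin = torch.cos(theta), torch.sin(theta)
            xr = cos * x - sin * y                    # rotate each pair by theta
            yr = sin * x + cos * y
            return torch.stack((xr, yr), dim=-1).reshape(z.shape)

Applying such a layer n times with angles fixed to 2*pi/n returns each pair to its starting point, which is one way a rotation operator could realize the ring-like symmetries mentioned above.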
{"title":"Operational Latent Spaces","authors":"Scott H. Hawley, Austin R. Tackett","doi":"arxiv-2406.02699","DOIUrl":"https://doi.org/arxiv-2406.02699","url":null,"abstract":"We investigate the construction of latent spaces through self-supervised\u0000learning to support semantically meaningful operations. Analogous to\u0000operational amplifiers, these \"operational latent spaces\" (OpLaS) not only\u0000demonstrate semantic structure such as clustering but also support common\u0000transformational operations with inherent semantic meaning. Some operational\u0000latent spaces are found to have arisen \"unintentionally\" in the progress toward\u0000some (other) self-supervised learning objective, in which unintended but still\u0000useful properties are discovered among the relationships of points in the\u0000space. Other spaces may be constructed \"intentionally\" by developers\u0000stipulating certain kinds of clustering or transformations intended to produce\u0000the desired structure. We focus on the intentional creation of operational\u0000latent spaces via self-supervised learning, including the introduction of\u0000rotation operators via a novel \"FiLMR\" layer, which can be used to enable\u0000ring-like symmetries found in some musical constructions.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141552030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Searching For Music Mixing Graphs: A Pruning Approach (arXiv:2406.01049, 2024-06-03)
Sungho Lee, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji
Music mixing is compositional: experts combine multiple audio processors to achieve a cohesive mix from dry source tracks. We propose a method to reverse engineer this process from the input and output audio. First, we create a mixing console that applies all available processors to every chain. Then, after the initial console parameter optimization, we alternate between removing redundant processors and fine-tuning. We achieve this through differentiable implementations of both the processors and the pruning. Consequently, we find a sparse mixing graph that achieves nearly identical matching quality to the full mixing console. We apply this procedure to dry-mix pairs from various datasets and collect graphs that can also be used to train neural networks for music mixing applications.
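One common way to make such pruning differentiable is to wrap every processor in a learnable wet/dry gate and drop processors whose gates collapse toward zero; the sketch below assumes this gating formulation, which may differ from the authors' exact parameterization.

    import torch
    import torch.nn as nn

    class GatedProcessor(nn.Module):
        """A differentiable processor wrapped in a learnable wet/dry gate."""

        def __init__(self, processor):
            super().__init__()
            self.processor = processor               # e.g., a differentiable EQ or compressor
            self.gate_logit = nn.Parameter(torch.zeros(1))

        def forward(self, x):
            g = torch.sigmoid(self.gate_logit)
            return g * self.processor(x) + (1 - g) * x   # gate near 0 means "bypass"

    def prune(gated_processors, threshold=0.05):
        """Keep only processors whose gates have not collapsed; fine-tune afterwards."""
        return [p for p in gated_processors
                if torch.sigmoid(p.gate_logit).item() > threshold]

Alternating between optimizing the console parameters (including the gates) and calling a pruning step like this would yield the sparse mixing graph described above.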
{"title":"Searching For Music Mixing Graphs: A Pruning Approach","authors":"Sungho Lee, Marco A. Martínez-Ramírez, Wei-Hsiang Liao, Stefan Uhlich, Giorgio Fabbro, Kyogu Lee, Yuki Mitsufuji","doi":"arxiv-2406.01049","DOIUrl":"https://doi.org/arxiv-2406.01049","url":null,"abstract":"Music mixing is compositional -- experts combine multiple audio processors to\u0000achieve a cohesive mix from dry source tracks. We propose a method to reverse\u0000engineer this process from the input and output audio. First, we create a\u0000mixing console that applies all available processors to every chain. Then,\u0000after the initial console parameter optimization, we alternate between removing\u0000redundant processors and fine-tuning. We achieve this through differentiable\u0000implementation of both processors and pruning. Consequently, we find a sparse\u0000mixing graph that achieves nearly identical matching quality of the full mixing\u0000console. We apply this procedure to dry-mix pairs from various datasets and\u0000collect graphs that also can be used to train neural networks for music mixing\u0000applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-06-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141254791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation (arXiv:2405.20289, 2024-05-30)
Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan
Controllable music generation methods are critical for human-centered AI-based music creation, but are currently limited by trade-offs among speed, quality, and control design. Diffusion Inference-Time T-Optimization (DITTO), in particular, offers state-of-the-art results, but is over 10x slower than real time, limiting practical use. We propose Distilled Diffusion Inference-Time T-Optimization (DITTO-2), a new method to speed up inference-time optimization-based control and unlock faster-than-real-time generation for a wide variety of applications such as music inpainting, outpainting, intensity, melody, and musical structure control. Our method works by (1) distilling a pre-trained diffusion model for fast sampling via an efficient, modified consistency or consistency trajectory distillation process, (2) performing inference-time optimization using our distilled model with one-step sampling as an efficient surrogate optimization task, and (3) running a final multi-step sampling generation (decoding) using our estimated noise latents for best-quality, fast, controllable generation. Through thorough evaluation, we find our method not only speeds up generation by over 10-20x, but simultaneously improves control adherence and generation quality. Furthermore, we apply our approach to a new application of maximizing text adherence (CLAP score) and show we can convert an unconditional diffusion model without text inputs into a model that yields state-of-the-art text control. Sound examples can be found at https://ditto-music.github.io/ditto2/.
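The three-stage recipe can be summarized in a short, hedged PyTorch sketch: optimize the initial noise latents against a differentiable control objective using the cheap one-step distilled model, then decode once with the full multi-step sampler. All function interfaces below are assumptions for illustration, not the released code.

    import torch

    def ditto2_control(one_step_model, multi_step_sampler, loss_fn,
                       latent_shape, steps=50, lr=0.05, device='cuda'):
        """Optimize initial noise with a one-step surrogate, decode with the full sampler.

        one_step_model(z):     distilled model mapping noise latents to output in one step
        multi_step_sampler(z): full sampler, used once at the end for best quality
        loss_fn(x):            differentiable control objective (e.g., a CLAP or melody loss)
        """
        z = torch.randn(latent_shape, device=device, requires_grad=True)
        opt = torch.optim.Adam([z], lr=lr)
        for _ in range(steps):
            opt.zero_grad()
            loss = loss_fn(one_step_model(z))      # cheap surrogate generation
            loss.backward()
            opt.step()
        with torch.no_grad():
            return multi_step_sampler(z.detach())  # final multi-step decoding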
{"title":"DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music Generation","authors":"Zachary Novack, Julian McAuley, Taylor Berg-Kirkpatrick, Nicholas Bryan","doi":"arxiv-2405.20289","DOIUrl":"https://doi.org/arxiv-2405.20289","url":null,"abstract":"Controllable music generation methods are critical for human-centered\u0000AI-based music creation, but are currently limited by speed, quality, and\u0000control design trade-offs. Diffusion Inference-Time T-optimization (DITTO), in\u0000particular, offers state-of-the-art results, but is over 10x slower than\u0000real-time, limiting practical use. We propose Distilled Diffusion\u0000Inference-Time T -Optimization (or DITTO-2), a new method to speed up\u0000inference-time optimization-based control and unlock faster-than-real-time\u0000generation for a wide-variety of applications such as music inpainting,\u0000outpainting, intensity, melody, and musical structure control. Our method works\u0000by (1) distilling a pre-trained diffusion model for fast sampling via an\u0000efficient, modified consistency or consistency trajectory distillation process\u0000(2) performing inference-time optimization using our distilled model with\u0000one-step sampling as an efficient surrogate optimization task and (3) running a\u0000final multi-step sampling generation (decoding) using our estimated noise\u0000latents for best-quality, fast, controllable generation. Through thorough\u0000evaluation, we find our method not only speeds up generation over 10-20x, but\u0000simultaneously improves control adherence and generation quality all at once.\u0000Furthermore, we apply our approach to a new application of maximizing text\u0000adherence (CLAP score) and show we can convert an unconditional diffusion model\u0000without text inputs into a model that yields state-of-the-art text control.\u0000Sound examples can be found at https://ditto-music.github.io/ditto2/.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141187830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLMs Meet Multimodal Generation and Editing: A Survey (arXiv:2405.19334, 2024-05-29)
Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen
With the recent advancement in large language models (LLMs), there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal large language models (MLLMs) mainly focus on understanding. This survey elaborates on multimodal generation across different domains, including image, video, 3D, and audio, where we highlight the notable advancements with milestone works in these fields. Specifically, we exhaustively investigate the key technical components behind methods and multimodal datasets utilized in these studies. Moreover, we dig into tool-augmented multimodal agents that can use existing generative models for human-computer interaction. Lastly, we also comprehensively discuss the advancement in AI safety and investigate emerging applications as well as future prospects. Our work provides a systematic and insightful overview of multimodal generation, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation
{"title":"LLMs Meet Multimodal Generation and Editing: A Survey","authors":"Yingqing He, Zhaoyang Liu, Jingye Chen, Zeyue Tian, Hongyu Liu, Xiaowei Chi, Runtao Liu, Ruibin Yuan, Yazhou Xing, Wenhai Wang, Jifeng Dai, Yong Zhang, Wei Xue, Qifeng Liu, Yike Guo, Qifeng Chen","doi":"arxiv-2405.19334","DOIUrl":"https://doi.org/arxiv-2405.19334","url":null,"abstract":"With the recent advancement in large language models (LLMs), there is a\u0000growing interest in combining LLMs with multimodal learning. Previous surveys\u0000of multimodal large language models (MLLMs) mainly focus on understanding. This\u0000survey elaborates on multimodal generation across different domains, including\u0000image, video, 3D, and audio, where we highlight the notable advancements with\u0000milestone works in these fields. Specifically, we exhaustively investigate the\u0000key technical components behind methods and multimodal datasets utilized in\u0000these studies. Moreover, we dig into tool-augmented multimodal agents that can\u0000use existing generative models for human-computer interaction. Lastly, we also\u0000comprehensively discuss the advancement in AI safety and investigate emerging\u0000applications as well as future prospects. Our work provides a systematic and\u0000insightful overview of multimodal generation, which is expected to advance the\u0000development of Artificial Intelligence for Generative Content (AIGC) and world\u0000models. A curated list of all related papers can be found at\u0000https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-05-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141528501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}