
arXiv - CS - Sound: Latest Articles

GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis
Pub Date : 2024-07-15 DOI: arxiv-2407.10471
Weizhi Liu, Yue Li, Dongdong Lin, Hui Tian, Haizhou Li
Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, preemptively regulating the creation and dissemination of synthesized content. Thus, this paper, as a pioneer, proposes the generative robust audio watermarking method (Groot), presenting a paradigm for proactively supervising the synthesized audio and its source diffusion models. In this paradigm, the processes of watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded within the audio can subsequently be retrieved by a lightweight decoder. The experimental results highlight Groot's outstanding performance, particularly in terms of robustness, surpassing that of the leading state-of-the-art methods. Beyond its impressive resilience against individual post-processing attacks, Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%.
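To make the generate-then-verify paradigm above concrete, here is a minimal, hypothetical sketch of the evaluation step: a watermark encoder produces a latent that would condition the (frozen) diffusion model, and a lightweight decoder is scored by bit accuracy on the synthesized audio. The module shapes and names are placeholders, not the authors' architecture.

```python
import torch
import torch.nn as nn

class WatermarkEncoder(nn.Module):
    """Maps binary watermark bits to a latent that would condition synthesis (stand-in)."""
    def __init__(self, n_bits=32, latent_dim=128):
        super().__init__()
        self.proj = nn.Linear(n_bits, latent_dim)

    def forward(self, bits):
        return torch.tanh(self.proj(bits))

class LightweightDecoder(nn.Module):
    """Recovers watermark bits from a synthesized waveform (stand-in)."""
    def __init__(self, audio_len=16000, n_bits=32):
        super().__init__()
        self.head = nn.Linear(audio_len, n_bits)

    def forward(self, audio):
        return torch.sigmoid(self.head(audio))

def bit_accuracy(pred_probs, bits):
    return ((pred_probs > 0.5).float() == bits).float().mean().item()

torch.manual_seed(0)
bits = torch.randint(0, 2, (4, 32)).float()   # the watermark message
encoder, decoder = WatermarkEncoder(), LightweightDecoder()
latent = encoder(bits)                        # would condition the frozen diffusion model (not run here)
audio = torch.randn(4, 16000)                 # placeholder for the synthesized audio
print("bit accuracy:", bit_accuracy(decoder(audio), bits))
```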
Citations: 0
LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis
Pub Date : 2024-07-15 DOI: arxiv-2407.10468
Zhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang
Latent diffusion models have shown promising results in audio generation, making notable advancements over traditional methods. However, their performance, while impressive with short audio clips, faces challenges when extended to longer audio sequences. These challenges are due to the model's self-attention mechanism and training predominantly on 10-second clips, which complicates the extension to longer audio without adaptation. In response to these issues, we introduce a novel approach, LiteFocus, that enhances the inference of existing audio latent diffusion models in long audio synthesis. Observing the attention pattern in self-attention, we employ a dual sparse form for attention calculation, designated as same-frequency focus and cross-frequency compensation, which curtails the attention computation under same-frequency constraints while enhancing audio quality through cross-frequency refillment. LiteFocus demonstrates a substantial 1.99x reduction in inference time with a diffusion-based TTA model when synthesizing 80-second audio clips, while also obtaining improved audio quality.
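The following toy sketch illustrates one plausible reading of the dual sparse attention pattern: tokens laid out on a (time, frequency) grid attend only to same-frequency positions (same-frequency focus) plus same-time positions across frequencies (cross-frequency compensation). The actual token layout and sparsity schedule in LiteFocus may differ; this only shows how such a mask shrinks the attention computation relative to a dense pattern.

```python
import numpy as np

def dual_sparse_mask(n_time, n_freq):
    """mask[i, j] is True if token i may attend to token j.
    Tokens enumerate a (time, freq) grid in row-major order."""
    coords = [(t, f) for t in range(n_time) for f in range(n_freq)]
    n = len(coords)
    mask = np.zeros((n, n), dtype=bool)
    for i, (ti, fi) in enumerate(coords):
        for j, (tj, fj) in enumerate(coords):
            same_freq = fi == fj      # same-frequency focus
            cross_freq = ti == tj     # cross-frequency compensation
            mask[i, j] = same_freq or cross_freq
    return mask

mask = dual_sparse_mask(n_time=8, n_freq=4)
print("kept attention entries:", mask.sum(), "of", mask.size)  # far fewer than dense attention
```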
Citations: 0
Guitar Chord Diagram Suggestion for Western Popular Music
Pub Date : 2024-07-15 DOI: arxiv-2407.14260
Alexandre d'Hooge (LaBRI, SCRIME), Louis Bigo (LaBRI, SCRIME), Ken Déguernel, Nicolas Martin
Chord diagrams are used by guitar players to show where and how to play a chord on the fretboard. They are useful to beginners learning chords or for sharing the hand positions required to play a song. However, the diagrams presented on guitar learning tools are usually selected from an existing database and rarely represent the actual positions used by performers. In this paper, we propose a tool which suggests a chord diagram for a chord label, taking into account the diagram of the previous chord. Based on statistical analysis of the DadaGP and mySongBook datasets, we show that some chord diagrams are over-represented in western popular music and that some chords can be played in more than 20 different ways. We argue that taking context into account can improve the variety and the quality of chord diagram suggestion, and compare this approach with a model taking only the current chord label into account. We show that adding previous context improves the F1-score on this task by up to 27% and reduces the propensity of the model to suggest standard open chords. We also define the notion of texture in the context of chord diagrams and show through a variety of metrics that our model improves texture consistency with the previous diagram.
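As a rough illustration of why the previous diagram matters, the heuristic sketch below ranks candidate diagrams for a chord label by how little the fretting hand moves from the preceding shape. The candidate diagrams and the scoring rule are illustrative assumptions; the paper's model is learned from the DadaGP and mySongBook data rather than hand-coded.

```python
from typing import Dict, List, Optional, Tuple

Diagram = Tuple[Optional[int], ...]  # fret per string, low E to high e; None = string not played

CANDIDATES: Dict[str, List[Diagram]] = {
    "G": [(3, 2, 0, 0, 0, 3), (3, 5, 5, 4, 3, 3)],        # open G, barre G
    "C": [(None, 3, 2, 0, 1, 0), (None, 3, 5, 5, 5, 3)],  # open C, barre C
}

def movement_cost(prev: Diagram, cand: Diagram) -> float:
    """Distance between the average fretted positions of two diagrams."""
    frets_prev = [f for f in prev if f]   # ignore open/unplayed strings
    frets_cand = [f for f in cand if f]
    if not frets_prev or not frets_cand:
        return 0.0
    return abs(sum(frets_prev) / len(frets_prev) - sum(frets_cand) / len(frets_cand))

def suggest(label: str, prev: Diagram) -> Diagram:
    return min(CANDIDATES[label], key=lambda cand: movement_cost(prev, cand))

print(suggest("C", prev=(3, 5, 5, 4, 3, 3)))  # the barre C is closer to a barre G shape
```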
Citations: 0
Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion
Pub Date : 2024-07-15 DOI: arxiv-2407.10373
Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng
Visual acoustic matching (VAM) is pivotal for enhancing the immersive experience, and the task of dereverberation is effective in improving audio intelligibility. Existing methods treat each task independently, overlooking the inherent reciprocity between them. Moreover, these methods depend on paired training data, which is challenging to acquire, impeding the utilization of extensive unpaired data. In this paper, we introduce MVSD, a mutual learning framework based on diffusion models. MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks and overcome data scarcity. Furthermore, we employ diffusion models as the foundational conditional converters to circumvent the training instability and over-smoothing drawbacks of conventional GAN architectures. Specifically, MVSD employs two converters: one for VAM, called the reverberator, and one for dereverberation, called the dereverberator. The dereverberator judges whether the reverberant audio generated by the reverberator sounds as if it were recorded in the conditioning visual scenario, and vice versa. By forming a closed loop, these two converters can generate informative feedback signals to optimize the inverse tasks, even with easily acquired one-way unpaired data. Extensive experiments on two standard benchmarks, i.e., SoundSpaces-Speech and Acoustic AVSpeech, show that our framework can improve the performance of the reverberator and dereverberator and better match specified visual scenarios.
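A heavily simplified sketch of the closed-loop wiring described above: the two converters supervise each other on unpaired data through a cycle-style reconstruction signal. The diffusion sampling procedure, visual encoder, and actual objectives are abstracted into placeholder modules; only the mutual-learning structure is shown.

```python
import torch
import torch.nn as nn

class Converter(nn.Module):
    """Stand-in for a conditional converter (reverberator or dereverberator)."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim * 2, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, audio_feat, scene_feat):
        return self.net(torch.cat([audio_feat, scene_feat], dim=-1))

reverberator, dereverberator = Converter(), Converter()
opt = torch.optim.Adam(
    list(reverberator.parameters()) + list(dereverberator.parameters()), lr=1e-4
)

dry, scene = torch.randn(8, 256), torch.randn(8, 256)  # unpaired dry audio + visual features
wet = reverberator(dry, scene)                          # VAM direction
dry_rec = dereverberator(wet, scene)                    # dereverberation direction
loss = nn.functional.mse_loss(dry_rec, dry)             # feedback signal closing the loop
opt.zero_grad()
loss.backward()
opt.step()
print(float(loss))
```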
Citations: 0
DDFAD: Dataset Distillation Framework for Audio Data
Pub Date : 2024-07-15 DOI: arxiv-2407.10446
Wenbo Jiang, Rui Zhang, Hongwei Li, Xiaoyuan Liu, Haomiao Yang, Shui Yu
Deep neural networks (DNNs) have achieved significant success in numerous applications. The remarkable performance of DNNs is largely attributed to the availability of massive, high-quality training datasets. However, processing such massive training data requires huge computational and storage resources. Dataset distillation is a promising solution to this problem, offering the capability to compress a large dataset into a smaller distilled dataset. The model trained on the distilled dataset can achieve comparable performance to the model trained on the whole dataset. While dataset distillation has been demonstrated in image data, none have explored dataset distillation for audio data. In this work, for the first time, we propose a Dataset Distillation Framework for Audio Data (DDFAD). Specifically, we first propose the Fused Differential MFCC (FD-MFCC) as the extracted features for audio data. After that, the FD-MFCC is distilled through the matching training trajectory distillation method. Finally, we propose an audio signal reconstruction algorithm based on the Griffin-Lim Algorithm to reconstruct the audio signal from the distilled FD-MFCC. Extensive experiments demonstrate the effectiveness of DDFAD on various audio datasets. In addition, we show that DDFAD has promising application prospects in many applications, such as continual learning and neural architecture search.
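One plausible construction of the "fused differential MFCC" feature is to stack base MFCCs with their first- and second-order deltas, as sketched below with standard librosa calls on a synthetic test tone. Whether this matches the paper's exact FD-MFCC definition is an assumption; the distillation and Griffin-Lim reconstruction stages are not shown.

```python
import numpy as np
import librosa

sr = 16000
y = np.sin(2 * np.pi * 440 * np.arange(sr) / sr).astype(np.float32)  # 1-second test tone

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)        # base MFCCs, shape (20, frames)
delta1 = librosa.feature.delta(mfcc, order=1)             # first-order differential
delta2 = librosa.feature.delta(mfcc, order=2)             # second-order differential
fd_mfcc = np.concatenate([mfcc, delta1, delta2], axis=0)  # fused feature, shape (60, frames)
print(fd_mfcc.shape)
```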
Citations: 0
BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features
Pub Date : 2024-07-15 DOI: arxiv-2407.10462
Jing Luo, Xinyu Yang, Dorien Herremans
Controllable music generation promotes the interaction between humans and composition systems by projecting the users' intent onto their desired music. The challenge of introducing controllability is an increasingly important issue in the symbolic music generation field. When building controllable generative popular multi-instrument music systems, two main challenges typically present themselves, namely weak controllability and poor music quality. To address these issues, we first propose spatiotemporal features as powerful and fine-grained controls to enhance the controllability of the generative model. In addition, an efficient music representation called REMI_Track is designed to convert multitrack music into multiple parallel music sequences and shorten the sequence length of each track with Byte Pair Encoding (BPE) techniques. Subsequently, we release BandControlNet, a conditional model based on parallel Transformers, to tackle the multiple music sequences and generate high-quality music samples conditioned on the given spatiotemporal control features. More concretely, the two specially designed modules of BandControlNet, namely structure-enhanced self-attention (SE-SA) and Cross-Track Transformer (CTT), are utilized to strengthen the resulting musical structure and inter-track harmony modeling, respectively. Experimental results on two popular music datasets of different lengths demonstrate that the proposed BandControlNet outperforms other conditional music generation models on most objective metrics in terms of fidelity and inference speed, and shows great robustness in generating long music samples. The subjective evaluations show that BandControlNet trained on short datasets can generate music of comparable quality to state-of-the-art models, while outperforming them significantly when using longer datasets.
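To illustrate how BPE can shorten a per-track token sequence, the generic sketch below repeatedly merges the most frequent adjacent token pair. The token names are invented for the example; REMI_Track's actual vocabulary and merge rules are not reproduced here.

```python
from collections import Counter

def bpe_shorten(tokens, n_merges=2):
    """Greedily merge the most frequent adjacent token pair, n_merges times."""
    for _ in range(n_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (a, b), _count = pairs.most_common(1)[0]
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == a and tokens[i + 1] == b:
                merged.append(a + "+" + b)   # the new merged token
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

seq = ["Pos_0", "Pitch_60", "Dur_4", "Pos_4", "Pitch_60", "Dur_4", "Pos_8", "Pitch_62", "Dur_4"]
print(len(seq), "->", len(bpe_shorten(seq)))  # the sequence gets shorter after the merges
```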
Citations: 0
Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity
Pub Date : 2024-07-15 DOI: arxiv-2407.10387
Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the generated sound onsets should match the visual actions that are aligned with them, otherwise unnatural synchronization artifacts arise. Recent works have explored the progression of conditioning sound generators on still images and then video features, focusing on quality and semantic matching while ignoring synchronization, or by sacrificing some amount of quality to focus on improving synchronization only. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results on one hand, whilst being competitive with the state of the art of non-codec generative audio models. Sample videos and generated audios are available at https://maskvat.github.io .
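The sketch below shows a generic mask-based iterative decoding loop over discrete codec tokens, in the spirit of sequence-to-sequence masked generative models: at each step the most confident masked positions are filled in. The predictor is a random stand-in, and MaskVAT's codec, video conditioning, and unmasking schedule are assumptions not taken from the paper.

```python
import torch

def iterative_unmask(seq_len=16, vocab=64, steps=4, seed=0):
    """Fill a fully masked token sequence over a fixed number of refinement steps."""
    g = torch.Generator().manual_seed(seed)
    tokens = torch.full((seq_len,), -1)                  # -1 marks a masked position
    for step in range(steps):
        masked = (tokens == -1).nonzero(as_tuple=True)[0]
        if masked.numel() == 0:
            break
        logits = torch.randn(masked.numel(), vocab, generator=g)  # stand-in predictor
        probs, preds = logits.softmax(-1).max(-1)
        k = max(1, masked.numel() // (steps - step))     # unmask a growing fraction each step
        keep = probs.topk(k).indices                     # most confident positions first
        tokens[masked[keep]] = preds[keep]
    return tokens

print(iterative_unmask())
```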
Citations: 0
Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification
Pub Date : 2024-07-14 DOI: arxiv-2407.10048
Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie
Trained on 680,000 hours of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from the multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on the VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.
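A minimal sketch of the aggregation idea: hidden states from a few selected encoder layers are combined with learnable softmax weights and pooled over time into a single speaker embedding. The layer-selection criterion, convolutional refinement, and attention pooling used by Whisper-SV are simplified away, and all tensor shapes are illustrative.

```python
import torch
import torch.nn as nn

class LayerAggregator(nn.Module):
    """Fuse top-k layer representations into one speaker embedding (simplified)."""
    def __init__(self, n_layers=3, dim=384, emb_dim=192):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(n_layers))  # learned layer weights
        self.proj = nn.Linear(dim, emb_dim)

    def forward(self, layer_states):                  # (n_layers, batch, time, dim)
        w = self.layer_logits.softmax(0).view(-1, 1, 1, 1)
        fused = (w * layer_states).sum(0)             # weighted sum across layers
        return self.proj(fused.mean(dim=1))           # pool over time -> (batch, emb_dim)

states = torch.randn(3, 2, 100, 384)                  # top-k hidden states from a frozen encoder
emb = LayerAggregator()(states)
print(emb.shape)                                      # torch.Size([2, 192])
```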
Citations: 0
The Interpretation Gap in Text-to-Music Generation Models
Pub Date : 2024-07-14 DOI: arxiv-2407.10328
Yongyi Zang, Yixiao Zhang
Large-scale text-to-music generation models have significantly enhanced music creation capabilities, offering unprecedented creative freedom. However, their ability to collaborate effectively with human musicians remains limited. In this paper, we propose a framework to describe the musical interaction process, which includes expression, interpretation, and execution of controls. Following this framework, we argue that the primary gap between existing text-to-music models and musicians lies in the interpretation stage, where models lack the ability to interpret controls from musicians. We also propose two strategies to address this gap and call on the music information retrieval community to tackle the interpretation challenge to improve human-AI musical collaboration.
Citations: 0
Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks
Pub Date : 2024-07-10 DOI: arxiv-2407.08658
Lucca Emmanuel Pineli Simões, Lucas Brandão Rodrigues, Rafaela Mota Silva, Gustavo Rodrigues da Silva
This paper presents the development and comparative evaluation of three voice command pipelines for controlling a Tello drone, using speech recognition and deep learning techniques. The aim is to enhance human-machine interaction by enabling intuitive voice control of drone actions. The pipelines developed include: (1) a traditional Speech-to-Text (STT) followed by a Large Language Model (LLM) approach, (2) a direct voice-to-function mapping model, and (3) a Siamese neural network-based system. Each pipeline was evaluated based on inference time, accuracy, efficiency, and flexibility. Detailed methodologies, dataset preparation, and evaluation metrics are provided, offering a comprehensive analysis of each pipeline's strengths and applicability across different scenarios.
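As a small illustration of the Siamese-style option among the three pipelines, the sketch below compares an utterance embedding against one prototype embedding per drone command and picks the closest. The encoder, feature dimensions, and command set are placeholders, not the paper's implementation.

```python
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """Stand-in embedding network shared between utterances and command prototypes."""
    def __init__(self, in_dim=40, emb_dim=64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 128), nn.ReLU(), nn.Linear(128, emb_dim))

    def forward(self, x):                 # x: (batch, in_dim) pooled acoustic features
        return nn.functional.normalize(self.net(x), dim=-1)

commands = ["takeoff", "land", "left", "right"]
enc = Encoder()
prototypes = enc(torch.randn(len(commands), 40))   # one reference embedding per command
query = enc(torch.randn(1, 40))                    # the incoming utterance
scores = query @ prototypes.T                      # cosine similarity (unit-norm vectors)
print("predicted command:", commands[scores.argmax().item()])
```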
Citations: 0