Weizhi Liu, Yue Li, Dongdong Lin, Hui Tian, Haizhou Li
Amid the burgeoning development of generative models like diffusion models, the task of differentiating synthesized audio from its natural counterpart grows more daunting. Deepfake detection offers a viable solution to combat this challenge. Yet, this defensive measure unintentionally fuels the continued refinement of generative models. Watermarking emerges as a proactive and sustainable tactic, preemptively regulating the creation and dissemination of synthesized content. This paper therefore pioneers Groot, a generative robust audio watermarking method, presenting a paradigm for proactively supervising synthesized audio and its source diffusion models. In this paradigm, watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded within the audio can subsequently be retrieved by a lightweight decoder. The experimental results highlight Groot's outstanding performance, particularly in terms of robustness, surpassing leading state-of-the-art methods. Beyond its impressive resilience against individual post-processing attacks, Groot exhibits exceptional robustness when facing compound attacks, maintaining an average watermark extraction accuracy of around 95%.
{"title":"GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis","authors":"Weizhi Liu, Yue Li, Dongdong Lin, Hui Tian, Haizhou Li","doi":"arxiv-2407.10471","DOIUrl":"https://doi.org/arxiv-2407.10471","url":null,"abstract":"Amid the burgeoning development of generative models like diffusion models,\u0000the task of differentiating synthesized audio from its natural counterpart\u0000grows more daunting. Deepfake detection offers a viable solution to combat this\u0000challenge. Yet, this defensive measure unintentionally fuels the continued\u0000refinement of generative models. Watermarking emerges as a proactive and\u0000sustainable tactic, preemptively regulating the creation and dissemination of\u0000synthesized content. Thus, this paper, as a pioneer, proposes the generative\u0000robust audio watermarking method (Groot), presenting a paradigm for proactively\u0000supervising the synthesized audio and its source diffusion models. In this\u0000paradigm, the processes of watermark generation and audio synthesis occur\u0000simultaneously, facilitated by parameter-fixed diffusion models equipped with a\u0000dedicated encoder. The watermark embedded within the audio can subsequently be\u0000retrieved by a lightweight decoder. The experimental results highlight Groot's\u0000outstanding performance, particularly in terms of robustness, surpassing that\u0000of the leading state-of-the-art methods. Beyond its impressive resilience\u0000against individual post-processing attacks, Groot exhibits exceptional\u0000robustness when facing compound attacks, maintaining an average watermark\u0000extraction accuracy of around 95%.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"105 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718914","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang
Latent diffusion models have shown promising results in audio generation, making notable advancements over traditional methods. However, their performance, while impressive with short audio clips, faces challenges when extended to longer audio sequences. These challenges are due to the model's self-attention mechanism and to training predominantly on 10-second clips, which complicates the extension to longer audio without adaptation. In response to these issues, we introduce LiteFocus, a novel approach that enhances the inference of existing audio latent diffusion models in long audio synthesis. Observing the attention patterns in self-attention, we employ a dual sparse form of attention calculation, designated as same-frequency focus and cross-frequency compensation, which curtails attention computation under same-frequency constraints while enhancing audio quality through cross-frequency refillment. LiteFocus reduces the inference time of a diffusion-based TTA model by 1.99x when synthesizing 80-second audio clips, while also improving audio quality.
{"title":"LiteFocus: Accelerated Diffusion Inference for Long Audio Synthesis","authors":"Zhenxiong Tan, Xinyin Ma, Gongfan Fang, Xinchao Wang","doi":"arxiv-2407.10468","DOIUrl":"https://doi.org/arxiv-2407.10468","url":null,"abstract":"Latent diffusion models have shown promising results in audio generation,\u0000making notable advancements over traditional methods. However, their\u0000performance, while impressive with short audio clips, faces challenges when\u0000extended to longer audio sequences. These challenges are due to model's\u0000self-attention mechanism and training predominantly on 10-second clips, which\u0000complicates the extension to longer audio without adaptation. In response to\u0000these issues, we introduce a novel approach, LiteFocus that enhances the\u0000inference of existing audio latent diffusion models in long audio synthesis.\u0000Observed the attention pattern in self-attention, we employ a dual sparse form\u0000for attention calculation, designated as same-frequency focus and\u0000cross-frequency compensation, which curtails the attention computation under\u0000same-frequency constraints, while enhancing audio quality through\u0000cross-frequency refillment. LiteFocus demonstrates substantial reduction on\u0000inference time with diffusion-based TTA model by 1.99x in synthesizing\u000080-second audio clips while also obtaining improved audio quality.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"105 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alexandre d'Hooge (LaBRI, SCRIME), Louis Bigo (LaBRI, SCRIME), Ken Déguernel, Nicolas Martin
Chord diagrams are used by guitar players to show where and how to play a chord on the fretboard. They are useful to beginners learning chords or for sharing the hand positions required to play a song. However, the diagrams presented on guitar learning tools are usually selected from an existing database and rarely represent the actual positions used by performers. In this paper, we propose a tool which suggests a chord diagram for a chord label, taking into account the diagram of the previous chord. Based on statistical analysis of the DadaGP and mySongBook datasets, we show that some chord diagrams are over-represented in western popular music and that some chords can be played in more than 20 different ways. We argue that taking context into account can improve the variety and the quality of chord diagram suggestion, and compare this approach with a model taking only the current chord label into account. We show that adding previous context improves the F1-score on this task by up to 27% and reduces the propensity of the model to suggest standard open chords. We also define the notion of texture in the context of chord diagrams and show through a variety of metrics that our model improves texture consistency with the previous diagram.
{"title":"Guitar Chord Diagram Suggestion for Western Popular Music","authors":"Alexandre d'HoogeLaBRI, SCRIME, Louis BigoLaBRI, SCRIME, Ken Déguernel, Nicolas Martin","doi":"arxiv-2407.14260","DOIUrl":"https://doi.org/arxiv-2407.14260","url":null,"abstract":"Chord diagrams are used by guitar players to show where and how to play a\u0000chord on the fretboard. They are useful to beginners learning chords or for\u0000sharing the hand positions required to play a song.However, the diagrams\u0000presented on guitar learning toolsare usually selected from an existing\u0000databaseand rarely represent the actual positions used by performers.In this\u0000paper, we propose a tool which suggests a chord diagram for achord label,taking\u0000into account the diagram of the previous chord.Based on statistical analysis of\u0000the DadaGP and mySongBook datasets, we show that some chord diagrams are\u0000over-represented in western popular musicand that some chords can be played in\u0000more than 20 different ways.We argue that taking context into account can\u0000improve the variety and the quality of chord diagram suggestion, and compare\u0000this approach with a model taking only the current chord label into account.We\u0000show that adding previous context improves the F1-score on this task by up to\u000027% and reduces the propensity of the model to suggest standard open chords.We\u0000also define the notion of texture in the context of chord diagrams andshow\u0000through a variety of metrics that our model improves textureconsistencywith the\u0000previous diagram.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"339 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141745581","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual acoustic matching (VAM) is pivotal for enhancing the immersive experience, and the task of dereverberation is effective in improving audio intelligibility. Existing methods treat each task independently, overlooking the inherent reciprocity between them. Moreover, these methods depend on paired training data, which is challenging to acquire, impeding the utilization of extensive unpaired data. In this paper, we introduce MVSD, a mutual learning framework based on diffusion models. MVSD considers the two tasks symmetrically, exploiting the reciprocal relationship to facilitate learning from inverse tasks and overcome data scarcity. Furthermore, we employ diffusion models as the foundational conditional converters to circumvent the training instability and over-smoothing drawbacks of conventional GAN architectures. Specifically, MVSD employs two converters: one for VAM, called the reverberator, and one for dereverberation, called the dereverberator. The dereverberator judges whether the reverberant audio generated by the reverberator sounds as though it was recorded in the conditioning visual scene, and vice versa. By forming a closed loop, these two converters can generate informative feedback signals to optimize the inverse tasks, even with easily acquired one-way unpaired data. Extensive experiments on two standard benchmarks, i.e., SoundSpaces-Speech and Acoustic AVSpeech, show that our framework can improve the performance of the reverberator and dereverberator and better match specified visual scenarios.
{"title":"Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion","authors":"Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng","doi":"arxiv-2407.10373","DOIUrl":"https://doi.org/arxiv-2407.10373","url":null,"abstract":"Visual acoustic matching (VAM) is pivotal for enhancing the immersive\u0000experience, and the task of dereverberation is effective in improving audio\u0000intelligibility. Existing methods treat each task independently, overlooking\u0000the inherent reciprocity between them. Moreover, these methods depend on paired\u0000training data, which is challenging to acquire, impeding the utilization of\u0000extensive unpaired data. In this paper, we introduce MVSD, a mutual learning\u0000framework based on diffusion models. MVSD considers the two tasks\u0000symmetrically, exploiting the reciprocal relationship to facilitate learning\u0000from inverse tasks and overcome data scarcity. Furthermore, we employ the\u0000diffusion model as foundational conditional converters to circumvent the\u0000training instability and over-smoothing drawbacks of conventional GAN\u0000architectures. Specifically, MVSD employs two converters: one for VAM called\u0000reverberator and one for dereverberation called dereverberator. The\u0000dereverberator judges whether the reverberation audio generated by reverberator\u0000sounds like being in the conditional visual scenario, and vice versa. By\u0000forming a closed loop, these two converters can generate informative feedback\u0000signals to optimize the inverse tasks, even with easily acquired one-way\u0000unpaired data. Extensive experiments on two standard benchmarks, i.e.,\u0000SoundSpaces-Speech and Acoustic AVSpeech, exhibit that our framework can\u0000improve the performance of the reverberator and dereverberator and better match\u0000specified visual scenarios.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"73 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718912","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep neural networks (DNNs) have achieved significant success in numerous applications. The remarkable performance of DNNs is largely attributed to the availability of massive, high-quality training datasets. However, processing such massive training data requires huge computational and storage resources. Dataset distillation is a promising solution to this problem, offering the capability to compress a large dataset into a smaller distilled dataset. A model trained on the distilled dataset can achieve comparable performance to a model trained on the whole dataset. While dataset distillation has been demonstrated on image data, it has not yet been explored for audio data. In this work, for the first time, we propose a Dataset Distillation Framework for Audio Data (DDFAD). Specifically, we first propose the Fused Differential MFCC (FD-MFCC) as the extracted feature for audio data. The FD-MFCC is then distilled through the matching training trajectory distillation method. Finally, we propose an audio signal reconstruction algorithm based on the Griffin-Lim Algorithm to reconstruct the audio signal from the distilled FD-MFCC. Extensive experiments demonstrate the effectiveness of DDFAD on various audio datasets. In addition, we show that DDFAD has promising prospects in many applications, such as continual learning and neural architecture search.
{"title":"DDFAD: Dataset Distillation Framework for Audio Data","authors":"Wenbo Jiang, Rui Zhang, Hongwei Li, Xiaoyuan Liu, Haomiao Yang, Shui Yu","doi":"arxiv-2407.10446","DOIUrl":"https://doi.org/arxiv-2407.10446","url":null,"abstract":"Deep neural networks (DNNs) have achieved significant success in numerous\u0000applications. The remarkable performance of DNNs is largely attributed to the\u0000availability of massive, high-quality training datasets. However, processing\u0000such massive training data requires huge computational and storage resources.\u0000Dataset distillation is a promising solution to this problem, offering the\u0000capability to compress a large dataset into a smaller distilled dataset. The\u0000model trained on the distilled dataset can achieve comparable performance to\u0000the model trained on the whole dataset. While dataset distillation has been demonstrated in image data, none have\u0000explored dataset distillation for audio data. In this work, for the first time,\u0000we propose a Dataset Distillation Framework for Audio Data (DDFAD).\u0000Specifically, we first propose the Fused Differential MFCC (FD-MFCC) as\u0000extracted features for audio data. After that, the FD-MFCC is distilled through\u0000the matching training trajectory distillation method. Finally, we propose an\u0000audio signal reconstruction algorithm based on the Griffin-Lim Algorithm to\u0000reconstruct the audio signal from the distilled FD-MFCC. Extensive experiments\u0000demonstrate the effectiveness of DDFAD on various audio datasets. In addition,\u0000we show that DDFAD has promising application prospects in many applications,\u0000such as continual learning and neural architecture search.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722118","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Controllable music generation promotes the interaction between humans and composition systems by projecting the users' intent onto their desired music. The challenge of introducing controllability is an increasingly important issue in the symbolic music generation field. When building controllable generative popular multi-instrument music systems, two main challenges typically present themselves, namely weak controllability and poor music quality. To address these issues, we first propose spatiotemporal features as powerful and fine-grained controls to enhance the controllability of the generative model. In addition, an efficient music representation called REMI_Track is designed to convert multitrack music into multiple parallel music sequences and to shorten the sequence length of each track with Byte Pair Encoding (BPE) techniques. Subsequently, we release BandControlNet, a conditional model based on parallel Transformers, to tackle the multiple music sequences and generate high-quality music samples conditioned on the given spatiotemporal control features. More concretely, the two specially designed modules of BandControlNet, namely structure-enhanced self-attention (SE-SA) and Cross-Track Transformer (CTT), are utilized to strengthen the resulting musical structure and inter-track harmony modeling, respectively. Experimental results on two popular music datasets of different lengths demonstrate that the proposed BandControlNet outperforms other conditional music generation models on most objective metrics in terms of fidelity and inference speed, and shows great robustness in generating long music samples. Subjective evaluations show that BandControlNet trained on short datasets can generate music of comparable quality to state-of-the-art models, while significantly outperforming them on longer datasets.
{"title":"BandControlNet: Parallel Transformers-based Steerable Popular Music Generation with Fine-Grained Spatiotemporal Features","authors":"Jing Luo, Xinyu Yang, Dorien Herremans","doi":"arxiv-2407.10462","DOIUrl":"https://doi.org/arxiv-2407.10462","url":null,"abstract":"Controllable music generation promotes the interaction between humans and\u0000composition systems by projecting the users' intent on their desired music. The\u0000challenge of introducing controllability is an increasingly important issue in\u0000the symbolic music generation field. When building controllable generative\u0000popular multi-instrument music systems, two main challenges typically present\u0000themselves, namely weak controllability and poor music quality. To address\u0000these issues, we first propose spatiotemporal features as powerful and\u0000fine-grained controls to enhance the controllability of the generative model.\u0000In addition, an efficient music representation called REMI_Track is designed to\u0000convert multitrack music into multiple parallel music sequences and shorten the\u0000sequence length of each track with Byte Pair Encoding (BPE) techniques.\u0000Subsequently, we release BandControlNet, a conditional model based on parallel\u0000Transformers, to tackle the multiple music sequences and generate high-quality\u0000music samples that are conditioned to the given spatiotemporal control\u0000features. More concretely, the two specially designed modules of\u0000BandControlNet, namely structure-enhanced self-attention (SE-SA) and\u0000Cross-Track Transformer (CTT), are utilized to strengthen the resulting musical\u0000structure and inter-track harmony modeling respectively. Experimental results\u0000tested on two popular music datasets of different lengths demonstrate that the\u0000proposed BandControlNet outperforms other conditional music generation models\u0000on most objective metrics in terms of fidelity and inference speed and shows\u0000great robustness in generating long music samples. The subjective evaluations\u0000show BandControlNet trained on short datasets can generate music with\u0000comparable quality to state-of-the-art models, while outperforming them\u0000significantly using longer datasets.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"185 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà
Video-to-audio (V2A) generation leverages visual-only video features to render plausible sounds that match the scene. Importantly, the generated sound onsets should match the visual actions that are aligned with them, otherwise unnatural synchronization artifacts arise. Recent works have progressed from conditioning sound generators on still images to conditioning them on video features, either focusing on quality and semantic matching while ignoring synchronization, or sacrificing some amount of quality to focus on improving synchronization only. In this work, we propose a V2A generative model, named MaskVAT, that interconnects a full-band high-quality general audio codec with a sequence-to-sequence masked generative model. This combination allows modeling high audio quality, semantic matching, and temporal synchronicity at the same time. Our results show that, by combining a high-quality codec with the proper pre-trained audio-visual features and a sequence-to-sequence parallel structure, we are able to yield highly synchronized results while remaining competitive with the state of the art of non-codec generative audio models. Sample videos and generated audios are available at https://maskvat.github.io .
{"title":"Masked Generative Video-to-Audio Transformers with Enhanced Synchronicity","authors":"Santiago Pascual, Chunghsin Yeh, Ioannis Tsiamas, Joan Serrà","doi":"arxiv-2407.10387","DOIUrl":"https://doi.org/arxiv-2407.10387","url":null,"abstract":"Video-to-audio (V2A) generation leverages visual-only video features to\u0000render plausible sounds that match the scene. Importantly, the generated sound\u0000onsets should match the visual actions that are aligned with them, otherwise\u0000unnatural synchronization artifacts arise. Recent works have explored the\u0000progression of conditioning sound generators on still images and then video\u0000features, focusing on quality and semantic matching while ignoring\u0000synchronization, or by sacrificing some amount of quality to focus on improving\u0000synchronization only. In this work, we propose a V2A generative model, named\u0000MaskVAT, that interconnects a full-band high-quality general audio codec with a\u0000sequence-to-sequence masked generative model. This combination allows modeling\u0000both high audio quality, semantic matching, and temporal synchronicity at the\u0000same time. Our results show that, by combining a high-quality codec with the\u0000proper pre-trained audio-visual features and a sequence-to-sequence parallel\u0000structure, we are able to yield highly synchronized results on one hand, whilst\u0000being competitive with the state of the art of non-codec generative audio\u0000models. Sample videos and generated audios are available at\u0000https://maskvat.github.io .","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"64 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie
Trained on 680,000 hours of massive speech data, Whisper is a multitasking, multilingual speech foundation model demonstrating superior performance in automatic speech recognition, translation, and language identification. However, its applicability in speaker verification (SV) tasks remains unexplored, particularly in low-data-resource scenarios where labeled speaker data in specific domains are limited. To fill this gap, we propose a lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV. Given that Whisper is not specifically optimized for SV tasks, we introduce a representation selection module to quantify the speaker-specific characteristics contained in each layer of Whisper and select the top-k layers with prominent discriminative speaker features. To aggregate pivotal speaker-related features while diminishing non-speaker redundancies across the selected top-k distinct layers of Whisper, we design a multi-layer aggregation module in Whisper-SV to integrate multi-layer representations into a singular, compacted representation for SV. In the multi-layer aggregation module, we employ convolutional layers with shortcut connections among different layers to refine speaker characteristics derived from multi-layer representations from Whisper. In addition, an attention aggregation layer is used to reduce non-speaker interference and amplify speaker-specific cues for SV tasks. Finally, a simple classification module is used for speaker classification. Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.
{"title":"Whisper-SV: Adapting Whisper for Low-data-resource Speaker Verification","authors":"Li Zhang, Ning Jiang, Qing Wang, Yue Li, Quan Lu, Lei Xie","doi":"arxiv-2407.10048","DOIUrl":"https://doi.org/arxiv-2407.10048","url":null,"abstract":"Trained on 680,000 hours of massive speech data, Whisper is a multitasking,\u0000multilingual speech foundation model demonstrating superior performance in\u0000automatic speech recognition, translation, and language identification.\u0000However, its applicability in speaker verification (SV) tasks remains\u0000unexplored, particularly in low-data-resource scenarios where labeled speaker\u0000data in specific domains are limited. To fill this gap, we propose a\u0000lightweight adaptor framework to boost SV with Whisper, namely Whisper-SV.\u0000Given that Whisper is not specifically optimized for SV tasks, we introduce a\u0000representation selection module to quantify the speaker-specific\u0000characteristics contained in each layer of Whisper and select the top-k layers\u0000with prominent discriminative speaker features. To aggregate pivotal\u0000speaker-related features while diminishing non-speaker redundancies across the\u0000selected top-k distinct layers of Whisper, we design a multi-layer aggregation\u0000module in Whisper-SV to integrate multi-layer representations into a singular,\u0000compacted representation for SV. In the multi-layer aggregation module, we\u0000employ convolutional layers with shortcut connections among different layers to\u0000refine speaker characteristics derived from multi-layer representations from\u0000Whisper. In addition, an attention aggregation layer is used to reduce\u0000non-speaker interference and amplify speaker-specific cues for SV tasks.\u0000Finally, a simple classification module is used for speaker classification.\u0000Experiments on VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV\u0000achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively,\u0000showing superior performance in low-data-resource SV scenarios.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"69 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141718915","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large-scale text-to-music generation models have significantly enhanced music creation capabilities, offering unprecedented creative freedom. However, their ability to collaborate effectively with human musicians remains limited. In this paper, we propose a framework to describe the musical interaction process, which includes expression, interpretation, and execution of controls. Following this framework, we argue that the primary gap between existing text-to-music models and musicians lies in the interpretation stage, where models lack the ability to interpret controls from musicians. We also propose two strategies to address this gap and call on the music information retrieval community to tackle the interpretation challenge to improve human-AI musical collaboration.
{"title":"The Interpretation Gap in Text-to-Music Generation Models","authors":"Yongyi Zang, Yixiao Zhang","doi":"arxiv-2407.10328","DOIUrl":"https://doi.org/arxiv-2407.10328","url":null,"abstract":"Large-scale text-to-music generation models have significantly enhanced music\u0000creation capabilities, offering unprecedented creative freedom. However, their\u0000ability to collaborate effectively with human musicians remains limited. In\u0000this paper, we propose a framework to describe the musical interaction process,\u0000which includes expression, interpretation, and execution of controls. Following\u0000this framework, we argue that the primary gap between existing text-to-music\u0000models and musicians lies in the interpretation stage, where models lack the\u0000ability to interpret controls from musicians. We also propose two strategies to\u0000address this gap and call on the music information retrieval community to\u0000tackle the interpretation challenge to improve human-AI musical collaboration.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141722119","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lucca Emmanuel Pineli Simões, Lucas Brandão Rodrigues, Rafaela Mota Silva, Gustavo Rodrigues da Silva
This paper presents the development and comparative evaluation of three voice command pipelines for controlling a Tello drone, using speech recognition and deep learning techniques. The aim is to enhance human-machine interaction by enabling intuitive voice control of drone actions. The pipelines developed include: (1) a traditional Speech-to-Text (STT) followed by a Large Language Model (LLM) approach, (2) a direct voice-to-function mapping model, and (3) a Siamese neural network-based system. Each pipeline was evaluated based on inference time, accuracy, efficiency, and flexibility. Detailed methodologies, dataset preparation, and evaluation metrics are provided, offering a comprehensive analysis of each pipeline's strengths and applicability across different scenarios.
{"title":"Evaluating Voice Command Pipelines for Drone Control: From STT and LLM to Direct Classification and Siamese Networks","authors":"Lucca Emmanuel Pineli Simões, Lucas Brandão Rodrigues, Rafaela Mota Silva, Gustavo Rodrigues da Silva","doi":"arxiv-2407.08658","DOIUrl":"https://doi.org/arxiv-2407.08658","url":null,"abstract":"This paper presents the development and comparative evaluation of three voice\u0000command pipelines for controlling a Tello drone, using speech recognition and\u0000deep learning techniques. The aim is to enhance human-machine interaction by\u0000enabling intuitive voice control of drone actions. The pipelines developed\u0000include: (1) a traditional Speech-to-Text (STT) followed by a Large Language\u0000Model (LLM) approach, (2) a direct voice-to-function mapping model, and (3) a\u0000Siamese neural network-based system. Each pipeline was evaluated based on\u0000inference time, accuracy, efficiency, and flexibility. Detailed methodologies,\u0000dataset preparation, and evaluation metrics are provided, offering a\u0000comprehensive analysis of each pipeline's strengths and applicability across\u0000different scenarios.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":"71 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141609941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}