We explore cross-dialect text-to-speech (CD-TTS), the task of synthesizing learned speakers' voices in non-native dialects, especially in pitch-accent languages. CD-TTS is important for developing voice agents that communicate naturally with people across regions. We present a novel TTS model comprising three sub-modules to perform competitively at this task. We first train a backbone TTS model to synthesize dialect speech from text, conditioned on phoneme-level accent latent variables (ALVs) extracted from speech by a reference encoder. Then, we train an ALV predictor to predict ALVs tailored to a target dialect from input text, leveraging our novel multi-dialect phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the effectiveness of our model by comparing it with a baseline derived from conventional dialect TTS methods. The results show that our model improves the dialectal naturalness of synthetic speech in CD-TTS.
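As a rough illustration of the conditioning step (not the authors' implementation; all sizes and names below are hypothetical), conditioning a TTS backbone on phoneme-level ALVs can be sketched as an embedding lookup added to the phoneme embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: phoneme inventory, number of discrete ALV classes,
# and model dimension.
n_phonemes, n_alv_classes, d_model = 40, 4, 8
phoneme_emb = rng.normal(size=(n_phonemes, d_model))
alv_emb = rng.normal(size=(n_alv_classes, d_model))  # one vector per ALV

def condition_on_alvs(phoneme_ids, alv_ids):
    """Add a per-phoneme accent embedding to each phoneme embedding."""
    return phoneme_emb[phoneme_ids] + alv_emb[alv_ids]

# One ALV per phoneme in the input sequence.
seq = condition_on_alvs([3, 17, 5], [0, 2, 1])
```

In this sketch the ALV ids would come from the reference encoder during training and from the ALV predictor at inference.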
"Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT." Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari. arXiv - EE - Audio and Speech Processing, 2024-09-11. https://doi.org/arxiv-2409.07265
Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang
Language-queried audio source separation (LASS) focuses on separating sounds using textual descriptions of the desired sources. Current methods mainly use discriminative approaches, such as time-frequency masking, to separate target sounds and minimize interference from other sources. However, these models face challenges when separating overlapping soundtracks, which may lead to artifacts such as spectral holes or incomplete separation. Rectified flow matching (RFM), a generative model that establishes linear relations between the data and noise distributions, offers superior theoretical properties and simplicity, but has not yet been explored in sound separation. In this work, we introduce FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns linear flow trajectories from noise to target source features within the variational autoencoder (VAE) latent space. During inference, the RFM-generated latent features are reconstructed into a mel-spectrogram via the pre-trained VAE decoder, followed by a pre-trained vocoder to synthesize the waveform. Trained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art models across multiple benchmarks, as evaluated with subjective and objective metrics. Additionally, our results show that FlowSep surpasses a diffusion-based LASS model in both separation quality and inference efficiency, highlighting its strong potential for audio source separation tasks. Code, pre-trained models and demos can be found at: https://audio-agi.github.io/FlowSep_demo/.
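Rectified flow matching itself is easy to sketch: training pairs interpolate linearly between noise and data, the regression target is the constant velocity along that straight path, and sampling integrates the learned velocity field from noise to data. The following is a minimal generic illustration with a known (ideal) velocity field, not FlowSep's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def rfm_training_pair(x0, x1, t):
    """Straight path x_t = (1 - t) x0 + t x1 between noise x0 and data x1;
    the model regresses the constant target velocity v = x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def euler_sample(x0, velocity_fn, n_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

x0 = rng.normal(size=4)            # "noise" sample
x1 = np.array([1.0, 2.0, 3.0, 4.0])  # stand-in for a latent data sample
x_t, v = rfm_training_pair(x0, x1, 0.5)

# With the ideal (constant) velocity field, sampling recovers x1 exactly.
x_hat = euler_sample(x0, lambda x, t: x1 - x0)
```

In FlowSep the data samples would be VAE latent features and the velocity field a text-conditioned network; here both are stand-ins.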
"FlowSep: Language-Queried Sound Separation with Rectified Flow Matching." arXiv - EE - Audio and Speech Processing, 2024-09-11. https://doi.org/arxiv-2409.07614
Xinyuan Qian, Xianghu Yue, Jiadong Wang, Huiping Zhuang, Haizhou Li
Sound Source Localization (SSL) is an enabling technology for applications such as surveillance and robotics. While traditional Signal Processing (SP)-based SSL methods provide analytic solutions under specific signal and noise assumptions, recent Deep Learning (DL)-based methods have significantly outperformed them. However, their success depends on extensive training data and substantial computational resources. Moreover, they often rely on large-scale annotated spatial data and may struggle when adapting to evolving sound classes. To mitigate these challenges, we propose a novel Class Incremental Learning (CIL) approach, termed SSL-CIL, which avoids serious accuracy degradation due to catastrophic forgetting by incrementally updating the DL-based SSL model through a closed-form analytic solution. In particular, data privacy is ensured since the learning process does not revisit any historical data (exemplar-free), which is more suitable for smart home scenarios. Empirical results on the public SSLR dataset demonstrate the superior performance of our proposal, achieving a localization accuracy of 90.9% and surpassing other competitive methods.
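The closed-form, exemplar-free update idea can be illustrated with ridge regression: accumulate each batch's sufficient statistics instead of the raw data, so the incremental solution matches a joint batch solve without ever revisiting history. This is a generic sketch of analytic incremental learning, not the SSL-CIL implementation (dimensions and the regularizer are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 6, 1e-2  # feature dimension and ridge regularizer (hypothetical)

# Accumulate sufficient statistics instead of raw data (exemplar-free):
# G = sum x x^T,  C = sum x y^T;  classifier weights W = (G + lam I)^{-1} C.
G = np.zeros((d, d))
C = np.zeros((d, 0))

def incremental_update(G, C, X, Y):
    """Fold a new batch (possibly introducing new classes) into the stats."""
    if Y.shape[1] > C.shape[1]:  # new classes: widen the target matrix
        C = np.pad(C, ((0, 0), (0, Y.shape[1] - C.shape[1])))
    return G + X.T @ X, C + X.T @ Y

def solve(G, C):
    return np.linalg.solve(G + lam * np.eye(d), C)

# Phase 1 sees 2 classes, phase 2 adds a third.
X1, Y1 = rng.normal(size=(20, d)), np.eye(2)[rng.integers(0, 2, 20)]
X2, Y2 = rng.normal(size=(20, d)), np.eye(3)[rng.integers(0, 3, 20)]
G, C = incremental_update(G, C, X1, Y1)
G, C = incremental_update(G, C, X2, Y2)
W_inc = solve(G, C)

# The incremental solution matches a joint batch solve over all data.
X_all = np.vstack([X1, X2])
Y_all = np.vstack([np.pad(Y1, ((0, 0), (0, 1))), Y2])
W_batch = np.linalg.solve(X_all.T @ X_all + lam * np.eye(d), X_all.T @ Y_all)
```

The equivalence of incremental and batch solutions is what removes the usual trade-off between forgetting and storing exemplars.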
"Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection." arXiv - EE - Audio and Speech Processing, 2024-09-11. https://doi.org/arxiv-2409.07224
Spatial audio formats like Ambisonics are playback device layout-agnostic and well-suited for applications such as teleconferencing and virtual reality. Conventional Ambisonic encoding methods often rely on spherical microphone arrays for efficient sound field capture, which limits their flexibility in practical scenarios. We propose a deep learning (DL)-based approach, leveraging a two-stage network architecture for encoding circular microphone array signals into second-order Ambisonics (SOA) in multi-speaker environments. In addition, we introduce: (i) a novel loss function based on spatial power maps to regularize inter-channel correlations of the Ambisonic signals, and (ii) a channel permutation technique to resolve the ambiguity of encoding vertical information using a horizontal circular array. Evaluation on simulated speech and noise datasets shows that our approach consistently outperforms traditional signal processing (SP) and DL-based methods, providing significantly better timbral and spatial quality and higher source localization accuracy. Binaural audio demos with visualizations are available at https://bridgoon97.github.io/NeuralAmbisonicEncoding/.
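For context, the conventional SP encoding that such DL methods are compared against is analytic: a plane-wave source at a known direction maps into Ambisonic channels with fixed trigonometric gains. A sketch of first-order B-format (FuMa W, X, Y, Z) encoding follows; note the paper targets second-order encoding from a circular array, which this simplified example does not cover:

```python
import numpy as np

def foa_encode(s, azimuth, elevation):
    """Encode a mono signal s into first-order B-format (FuMa W, X, Y, Z)
    for a plane wave arriving from (azimuth, elevation) in radians."""
    w = s / np.sqrt(2.0)                            # omnidirectional
    x = s * np.cos(azimuth) * np.cos(elevation)     # front-back
    y = s * np.sin(azimuth) * np.cos(elevation)     # left-right
    z = s * np.sin(elevation)                       # up-down
    return np.stack([w, x, y, z])

t = np.linspace(0, 1, 100)
s = np.sin(2 * np.pi * 5 * t)
b = foa_encode(s, azimuth=0.0, elevation=0.0)  # source straight ahead
```

A source straight ahead excites only W and X, which is exactly the vertical-information ambiguity a horizontal circular array faces: elevation only enters through the Z-channel gains.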
"Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array." Yue Qiao, Vinay Kothapally, Meng Yu, Dong Yu. arXiv - EE - Audio and Speech Processing, 2024-09-11. https://doi.org/arxiv-2409.06954
In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited regions of the speech so that edited regions can later be detected. In addition, the waveform reconstruction leverages the original unedited speech segments, providing superior recovery compared to the Encodec model. Our approach achieves state-of-the-art performance in the RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing previous methods. Furthermore, SSR-Speech excels in multi-span speech editing and also demonstrates remarkable robustness to background sounds. Source code and demos are released.
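Classifier-free guidance, which SSR-Speech uses to stabilize generation, combines a conditional and an unconditional prediction with a guidance weight. A minimal generic sketch (not SSR-Speech's code; the toy logits are invented):

```python
import numpy as np

def cfg(pred_cond, pred_uncond, w):
    """Classifier-free guidance: w = 1 recovers the conditional prediction,
    w > 1 pushes further in the direction the condition suggests."""
    return pred_uncond + w * (pred_cond - pred_uncond)

# Toy logits for two codec tokens.
pred_cond = np.array([0.8, 0.2])
pred_uncond = np.array([0.5, 0.5])
guided = cfg(pred_cond, pred_uncond, w=2.0)
```

Training such a model typically drops the condition at random so a single network provides both predictions; that detail is omitted here.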
"SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis." Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu. arXiv - EE - Audio and Speech Processing, 2024-09-11. https://doi.org/arxiv-2409.07556
With the rapid advancement of technologies like text-to-speech (TTS) and voice conversion (VC), detecting deepfake voices has become increasingly crucial. However, both academia and industry lack a comprehensive and intuitive benchmark for evaluating detectors. Existing datasets are limited in language diversity and lack many manipulations encountered in real-world production environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate the performance of deepfake voice detectors. To build the dataset, we first collected deepfake voices generated by 19 advanced and widely recognized commercial tools and 15 open-source tools. We then created 38 data variants covering six types of manipulations, constructing the evaluation dataset for deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200 Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12 state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of 13.50%, while all others exceeded 20%. Our findings reveal that these detectors face significant challenges in real-world applications, with dramatically declining performance. In addition, we conducted a user study with more than 300 participants. The results are compared with the performance of the 12 detectors and a multimodal large language model (MLLM), i.e., Qwen2-Audio: different detectors and humans exhibit varying identification capabilities for deepfake voices at different deception levels, while the MLLM demonstrates no detection ability at all. Furthermore, we provide a leaderboard for deepfake voice detection, publicly available at https://voicewukong.github.io.
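The equal error rate reported above is the threshold-sweep operating point where the false-accept and false-reject rates coincide. A reference sketch (assuming higher scores mean "genuine"; the toy scores are invented):

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate: sweep thresholds and return the point where the
    false-accept rate (negative scored as genuine) equals the
    false-reject rate (genuine scored as negative)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_far, best_frr = 1.0, 0.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # negatives accepted
        frr = np.mean(scores[labels == 1] < t)   # positives rejected
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2

# Perfectly separable scores give EER 0; one swapped pair raises it.
eer_clean = compute_eer([0.9, 0.8, 0.7, 0.4, 0.3, 0.1], [1, 1, 1, 0, 0, 0])
eer_noisy = compute_eer([0.9, 0.4, 0.7, 0.8, 0.3, 0.1], [1, 1, 1, 0, 0, 0])
```

Production implementations interpolate between thresholds rather than averaging the nearest pair, but the operating point is the same.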
"VoiceWukong: Benchmarking Deepfake Voice Detection." Ziwei Yan, Yanjie Zhao, Haoyu Wang. arXiv - EE - Audio and Speech Processing, 2024-09-10. https://doi.org/arxiv-2409.06348
It is challenging to accelerate the training process while ensuring both high-quality generated voices and acceptable inference speed. In this paper, we propose a novel neural vocoder called InstructSing, which can converge much faster than other neural vocoders while maintaining good performance by integrating differentiable digital signal processing and adversarial training. It includes one generator and two discriminators. Specifically, the generator incorporates a harmonic-plus-noise (HN) module to produce 8kHz audio as an instructive signal. Subsequently, the HN module is connected with an extended WaveNet by a UNet-based module, which transforms the output of the HN module into a latent variable sequence containing essential periodic and aperiodic information. In addition to the latent sequence, the extended WaveNet also takes the mel-spectrogram as input to generate 48kHz high-fidelity singing voices. In terms of discriminators, we combine a multi-period discriminator, as originally proposed in HiFiGAN, with a multi-resolution multi-band STFT discriminator. Notably, InstructSing achieves comparable voice quality to other neural vocoders but with only one-tenth of the training steps on a machine with four NVIDIA V100 GPUs (demo page: https://wavelandspeech.github.io/instructsing/). We plan to open-source our code and pretrained model once the paper is accepted.
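The signal model underlying a harmonic-plus-noise module is additive: a sum of f0 harmonics (the periodic part) plus a noise component (the aperiodic part). A generic sketch of that decomposition, with hypothetical parameters and none of InstructSing's learned components:

```python
import numpy as np

rng = np.random.default_rng(0)

def harmonic_plus_noise(f0, harm_amps, noise_gain, sr=8000, dur=0.1):
    """Additive harmonic-plus-noise synthesis: harmonics of f0 at the
    given amplitudes, plus scaled white noise for the aperiodic part."""
    t = np.arange(int(sr * dur)) / sr
    harm = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(harm_amps))
    return harm + noise_gain * rng.normal(size=t.size)

# A 220 Hz tone with three decaying harmonics and a small noise floor.
sig = harmonic_plus_noise(f0=220.0, harm_amps=[1.0, 0.5, 0.25],
                          noise_gain=0.01)
```

In a DDSP-style vocoder, the per-harmonic amplitudes and the noise filter would be predicted by the network per frame rather than fixed.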
"InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself." Chang Zeng, Chunhui Wang, Xiaoxiao Miao, Jian Zhao, Zhonglin Jiang, Yong Chen. arXiv - EE - Audio and Speech Processing, 2024-09-10. https://doi.org/arxiv-2409.06330
In real-world applications, it is challenging to build a speaker verification system that is simultaneously robust against common threats, including spoofing attacks, channel mismatch, and domain mismatch. Traditional automatic speaker verification (ASV) systems often tackle these issues separately, leading to suboptimal performance when faced with simultaneous challenges. In this paper, we propose an integrated framework that incorporates pair-wise learning and spoofing attack simulation into the meta-learning paradigm to enhance robustness against these multifaceted threats. This novel approach employs an asymmetric dual-path model and a multi-task learning strategy to handle ASV, anti-spoofing, and spoofing-aware ASV tasks concurrently. A new testing dataset, CNComplex, is introduced to evaluate system performance under these combined threats. Experimental results demonstrate that our integrated model significantly improves performance over traditional ASV systems across various scenarios, showcasing its potential for real-world deployment. Additionally, the proposed framework's ability to generalize across different conditions highlights its robustness and reliability, making it a promising solution for practical ASV applications.
"Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches." Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi. arXiv - EE - Audio and Speech Processing, 2024-09-10. https://doi.org/arxiv-2409.06327
The Audio Question Answering task includes audio event classification, audio captioning, and open-ended reasoning. Recently, Audio Question Answering has garnered attention due to the advent of Large Audio Language Models (LALMs). Current literature focuses on constructing LALMs by integrating audio encoders with text-only Large Language Models through a projection module. While LALMs excel in general audio understanding, they are limited in temporal reasoning, which may hinder their commercial applications and on-device deployment. This paper addresses these challenges and limitations in audio temporal reasoning. First, we introduce a data augmentation technique for generating reliable audio temporal questions and answers using an LLM. Second, we propose a continued finetuning curriculum learning strategy to specialize in temporal reasoning without compromising performance on finetuned tasks. Finally, we develop a reliable and transparent automated metric, assisted by an LLM, to intelligently measure the correlation between LALM responses and ground truth data. We demonstrate the effectiveness of our proposed techniques using SOTA LALMs on public audio benchmark datasets.
"Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models." Arvind Krishna Sridhar, Yinyi Guo, Erik Visser. arXiv - EE - Audio and Speech Processing, 2024-09-10. https://doi.org/arxiv-2409.06223
Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw
The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
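The mixture idea can be sketched as input-dependent gating over a pool of weak encoders whose combined output is added to the base encoder's. A toy linear version with hypothetical sizes, not the MoWE architecture itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_weak = 16, 8, 4  # hypothetical sizes

W_base = rng.normal(size=(d_in, d_out))           # base encoder
W_weak = rng.normal(size=(n_weak, d_in, d_out))   # pool of weak encoders
W_gate = rng.normal(size=(d_in, n_weak))          # router

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mowe_encode(x):
    """Base encoder output plus a gated mixture of weak-encoder outputs.
    The gates depend on the input, so different audio activates
    different weak encoders."""
    gates = softmax(x @ W_gate)
    weak = np.einsum('k,kio,i->o', gates, W_weak, x)
    return x @ W_base + weak, gates

x = rng.normal(size=d_in)
feat, gates = mowe_encode(x)
```

Real implementations would use neural encoders, top-k (sparse) routing, and per-frame gating; the soft dense routing above keeps the sketch short.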
"MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders." arXiv - EE - Audio and Speech Processing, 2024-09-10. https://doi.org/arxiv-2409.06635