We explore cross-dialect text-to-speech (CD-TTS), the task of synthesizing learned speakers' voices in non-native dialects, especially in pitch-accent languages. CD-TTS is important for developing voice agents that communicate naturally with people across regions. We present a novel TTS model comprising three sub-modules to perform competitively at this task. We first train a backbone TTS model to synthesize dialect speech from text, conditioned on phoneme-level accent latent variables (ALVs) extracted from speech by a reference encoder. Then, we train an ALV predictor to predict ALVs tailored to a target dialect from input text, leveraging our novel multi-dialect phoneme-level BERT. We conduct multi-dialect TTS experiments and evaluate the effectiveness of our model by comparing it with a baseline derived from conventional dialect TTS methods. The results show that our model improves the dialectal naturalness of synthetic speech in CD-TTS.
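As a rough illustration of the conditioning step (not the authors' implementation; all sizes and names below are hypothetical), conditioning a TTS backbone on phoneme-level ALVs can be sketched as an embedding lookup added to the phoneme embeddings:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: phoneme inventory, number of discrete ALV classes,
# and model dimension.
n_phonemes, n_alv_classes, d_model = 40, 4, 8
phoneme_emb = rng.normal(size=(n_phonemes, d_model))
alv_emb = rng.normal(size=(n_alv_classes, d_model))  # one vector per ALV

def condition_on_alvs(phoneme_ids, alv_ids):
    """Add a per-phoneme accent embedding to each phoneme embedding."""
    return phoneme_emb[phoneme_ids] + alv_emb[alv_ids]

# One ALV per phoneme in the input sequence.
seq = condition_on_alvs([3, 17, 5], [0, 2, 1])
```

In this sketch the ALV ids would come from the reference encoder during training and from the ALV predictor at inference.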
"Cross-Dialect Text-To-Speech in Pitch-Accent Language Incorporating Multi-Dialect Phoneme-Level BERT." Kazuki Yamauchi, Yuki Saito, Hiroshi Saruwatari. arXiv - EE - Audio and Speech Processing, 2024-09-11. https://doi.org/arxiv-2409.07265
Yi Yuan, Xubo Liu, Haohe Liu, Mark D. Plumbley, Wenwu Wang
Language-queried audio source separation (LASS) focuses on separating sounds using textual descriptions of the desired sources. Current methods mainly use discriminative approaches, such as time-frequency masking, to separate target sounds and minimize interference from other sources. However, these models face challenges when separating overlapping soundtracks, which may lead to artifacts such as spectral holes or incomplete separation. Rectified flow matching (RFM), a generative model that establishes linear relations between the data and noise distributions, offers superior theoretical properties and simplicity, but has not yet been explored in sound separation. In this work, we introduce FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns linear flow trajectories from noise to target source features within the variational autoencoder (VAE) latent space. During inference, the RFM-generated latent features are reconstructed into a mel-spectrogram via the pre-trained VAE decoder, followed by a pre-trained vocoder to synthesize the waveform. Trained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art models across multiple benchmarks, as evaluated with subjective and objective metrics. Additionally, our results show that FlowSep surpasses a diffusion-based LASS model in both separation quality and inference efficiency, highlighting its strong potential for audio source separation tasks. Code, pre-trained models and demos can be found at: https://audio-agi.github.io/FlowSep_demo/.
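Rectified flow matching itself is easy to sketch: training pairs interpolate linearly between noise and data, the regression target is the constant velocity along that straight path, and sampling integrates the learned velocity field from noise to data. The following is a minimal generic illustration with a known (ideal) velocity field, not FlowSep's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)

def rfm_training_pair(x0, x1, t):
    """Straight path x_t = (1 - t) x0 + t x1 between noise x0 and data x1;
    the model regresses the constant target velocity v = x1 - x0."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def euler_sample(x0, velocity_fn, n_steps=10):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with Euler steps."""
    x, dt = x0.copy(), 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * velocity_fn(x, i * dt)
    return x

x0 = rng.normal(size=4)            # "noise" sample
x1 = np.array([1.0, 2.0, 3.0, 4.0])  # stand-in for a latent data sample
x_t, v = rfm_training_pair(x0, x1, 0.5)

# With the ideal (constant) velocity field, sampling recovers x1 exactly.
x_hat = euler_sample(x0, lambda x, t: x1 - x0)
```

In FlowSep the data samples would be VAE latent features and the velocity field a text-conditioned network; here both are stand-ins.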
"FlowSep: Language-Queried Sound Separation with Rectified Flow Matching." arXiv - EE - Audio and Speech Processing, 2024-09-11. https://doi.org/arxiv-2409.07614
Xinyuan Qian, Xianghu Yue, Jiadong Wang, Huiping Zhuang, Haizhou Li
Sound Source Localization (SSL) is an enabling technology for applications such as surveillance and robotics. While traditional Signal Processing (SP)-based SSL methods provide analytic solutions under specific signal and noise assumptions, recent Deep Learning (DL)-based methods have significantly outperformed them. However, their success depends on extensive training data and substantial computational resources. Moreover, they often rely on large-scale annotated spatial data and may struggle when adapting to evolving sound classes. To mitigate these challenges, we propose a novel Class Incremental Learning (CIL) approach, termed SSL-CIL, which avoids serious accuracy degradation due to catastrophic forgetting by incrementally updating the DL-based SSL model through a closed-form analytic solution. In particular, data privacy is ensured since the learning process does not revisit any historical data (exemplar-free), which is more suitable for smart home scenarios. Empirical results on the public SSLR dataset demonstrate the superior performance of our proposal, achieving a localization accuracy of 90.9% and surpassing other competitive methods.
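The closed-form, exemplar-free update idea can be illustrated with ridge regression: accumulate each batch's sufficient statistics instead of the raw data, so the incremental solution matches a joint batch solve without ever revisiting history. This is a generic sketch of analytic incremental learning, not the SSL-CIL implementation (dimensions and the regularizer are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
d, lam = 6, 1e-2  # feature dimension and ridge regularizer (hypothetical)

# Accumulate sufficient statistics instead of raw data (exemplar-free):
# G = sum x x^T,  C = sum x y^T;  classifier weights W = (G + lam I)^{-1} C.
G = np.zeros((d, d))
C = np.zeros((d, 0))

def incremental_update(G, C, X, Y):
    """Fold a new batch (possibly introducing new classes) into the stats."""
    if Y.shape[1] > C.shape[1]:  # new classes: widen the target matrix
        C = np.pad(C, ((0, 0), (0, Y.shape[1] - C.shape[1])))
    return G + X.T @ X, C + X.T @ Y

def solve(G, C):
    return np.linalg.solve(G + lam * np.eye(d), C)

# Phase 1 sees 2 classes, phase 2 adds a third.
X1, Y1 = rng.normal(size=(20, d)), np.eye(2)[rng.integers(0, 2, 20)]
X2, Y2 = rng.normal(size=(20, d)), np.eye(3)[rng.integers(0, 3, 20)]
G, C = incremental_update(G, C, X1, Y1)
G, C = incremental_update(G, C, X2, Y2)
W_inc = solve(G, C)

# The incremental solution matches a joint batch solve over all data.
X_all = np.vstack([X1, X2])
Y_all = np.vstack([np.pad(Y1, ((0, 0), (0, 1))), Y2])
W_batch = np.linalg.solve(X_all.T @ X_all + lam * np.eye(d), X_all.T @ Y_all)
```

The equivalence of incremental and batch solutions is what removes the usual trade-off between forgetting and storing exemplars.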
"Analytic Class Incremental Learning for Sound Source Localization with Privacy Protection." arXiv - EE - Audio and Speech Processing, 2024-09-11. https://doi.org/arxiv-2409.07224
Spatial audio formats like Ambisonics are playback device layout-agnostic and well-suited for applications such as teleconferencing and virtual reality. Conventional Ambisonic encoding methods often rely on spherical microphone arrays for efficient sound field capture, which limits their flexibility in practical scenarios. We propose a deep learning (DL)-based approach, leveraging a two-stage network architecture for encoding circular microphone array signals into second-order Ambisonics (SOA) in multi-speaker environments. In addition, we introduce: (i) a novel loss function based on spatial power maps to regularize inter-channel correlations of the Ambisonic signals, and (ii) a channel permutation technique to resolve the ambiguity of encoding vertical information using a horizontal circular array. Evaluation on simulated speech and noise datasets shows that our approach consistently outperforms traditional signal processing (SP) and DL-based methods, providing significantly better timbral and spatial quality and higher source localization accuracy. Binaural audio demos with visualizations are available at https://bridgoon97.github.io/NeuralAmbisonicEncoding/.
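For context, the conventional SP encoding that such DL methods are compared against is analytic: a plane-wave source at a known direction maps into Ambisonic channels with fixed trigonometric gains. A sketch of first-order B-format (FuMa W, X, Y, Z) encoding follows; note the paper targets second-order encoding from a circular array, which this simplified example does not cover:

```python
import numpy as np

def foa_encode(s, azimuth, elevation):
    """Encode a mono signal s into first-order B-format (FuMa W, X, Y, Z)
    for a plane wave arriving from (azimuth, elevation) in radians."""
    w = s / np.sqrt(2.0)                            # omnidirectional
    x = s * np.cos(azimuth) * np.cos(elevation)     # front-back
    y = s * np.sin(azimuth) * np.cos(elevation)     # left-right
    z = s * np.sin(elevation)                       # up-down
    return np.stack([w, x, y, z])

t = np.linspace(0, 1, 100)
s = np.sin(2 * np.pi * 5 * t)
b = foa_encode(s, azimuth=0.0, elevation=0.0)  # source straight ahead
```

A source straight ahead excites only W and X, which is exactly the vertical-information ambiguity a horizontal circular array faces: elevation only enters through the Z-channel gains.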
"Neural Ambisonic Encoding For Multi-Speaker Scenarios Using A Circular Microphone Array." Yue Qiao, Vinay Kothapally, Meng Yu, Dong Yu. arXiv - EE - Audio and Speech Processing, 2024-09-11. https://doi.org/arxiv-2409.06954
In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited regions of the speech so that edited regions can later be detected. In addition, the waveform reconstruction leverages the original unedited speech segments, providing superior recovery compared to the Encodec model. Our approach achieves state-of-the-art performance in the RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing previous methods. Furthermore, SSR-Speech excels in multi-span speech editing and also demonstrates remarkable robustness to background sounds. Source code and demos are released.
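Classifier-free guidance, which SSR-Speech uses to stabilize generation, combines a conditional and an unconditional prediction with a guidance weight. A minimal generic sketch (not SSR-Speech's code; the toy logits are invented):

```python
import numpy as np

def cfg(pred_cond, pred_uncond, w):
    """Classifier-free guidance: w = 1 recovers the conditional prediction,
    w > 1 pushes further in the direction the condition suggests."""
    return pred_uncond + w * (pred_cond - pred_uncond)

# Toy logits for two codec tokens.
pred_cond = np.array([0.8, 0.2])
pred_uncond = np.array([0.5, 0.5])
guided = cfg(pred_cond, pred_uncond, w=2.0)
```

Training such a model typically drops the condition at random so a single network provides both predictions; that detail is omitted here.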
"SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis." Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu. arXiv - EE - Audio and Speech Processing, 2024-09-11. https://doi.org/arxiv-2409.07556
With the rapid advancement of technologies like text-to-speech (TTS) and voice conversion (VC), detecting deepfake voices has become increasingly crucial. However, both academia and industry lack a comprehensive and intuitive benchmark for evaluating detectors. Existing datasets are limited in language diversity and lack many manipulations encountered in real-world production environments. To fill this gap, we propose VoiceWukong, a benchmark designed to evaluate the performance of deepfake voice detectors. To build the dataset, we first collected deepfake voices generated by 19 advanced and widely recognized commercial tools and 15 open-source tools. We then created 38 data variants covering six types of manipulations, constructing the evaluation dataset for deepfake voice detection. VoiceWukong thus includes 265,200 English and 148,200 Chinese deepfake voice samples. Using VoiceWukong, we evaluated 12 state-of-the-art detectors. AASIST2 achieved the best equal error rate (EER) of 13.50%, while all others exceeded 20%. Our findings reveal that these detectors face significant challenges in real-world applications, with dramatically declining performance. In addition, we conducted a user study with more than 300 participants. The results are compared with the performance of the 12 detectors and a multimodal large language model (MLLM), i.e., Qwen2-Audio: different detectors and humans exhibit varying identification capabilities for deepfake voices at different deception levels, while the MLLM demonstrates no detection ability at all. Furthermore, we provide a leaderboard for deepfake voice detection, publicly available at https://voicewukong.github.io.
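The equal error rate reported above is the threshold-sweep operating point where the false-accept and false-reject rates coincide. A reference sketch (assuming higher scores mean "genuine"; the toy scores are invented):

```python
import numpy as np

def compute_eer(scores, labels):
    """Equal error rate: sweep thresholds and return the point where the
    false-accept rate (negative scored as genuine) equals the
    false-reject rate (genuine scored as negative)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    best_far, best_frr = 1.0, 0.0
    for t in np.sort(np.unique(scores)):
        far = np.mean(scores[labels == 0] >= t)  # negatives accepted
        frr = np.mean(scores[labels == 1] < t)   # positives rejected
        if abs(far - frr) < abs(best_far - best_frr):
            best_far, best_frr = far, frr
    return (best_far + best_frr) / 2

# Perfectly separable scores give EER 0; one swapped pair raises it.
eer_clean = compute_eer([0.9, 0.8, 0.7, 0.4, 0.3, 0.1], [1, 1, 1, 0, 0, 0])
eer_noisy = compute_eer([0.9, 0.4, 0.7, 0.8, 0.3, 0.1], [1, 1, 1, 0, 0, 0])
```

Production implementations interpolate between thresholds rather than averaging the nearest pair, but the operating point is the same.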
"VoiceWukong: Benchmarking Deepfake Voice Detection." Ziwei Yan, Yanjie Zhao, Haoyu Wang. arXiv - EE - Audio and Speech Processing, 2024-09-10. https://doi.org/arxiv-2409.06348
It is challenging to accelerate the training process while ensuring both high-quality generated voices and acceptable inference speed. In this paper, we propose a novel neural vocoder called InstructSing, which can converge much faster than other neural vocoders while maintaining good performance by integrating differentiable digital signal processing and adversarial training. It includes one generator and two discriminators. Specifically, the generator incorporates a harmonic-plus-noise (HN) module to produce 8kHz audio as an instructive signal. Subsequently, the HN module is connected with an extended WaveNet by a UNet-based module, which transforms the output of the HN module into a latent variable sequence containing essential periodic and aperiodic information. In addition to the latent sequence, the extended WaveNet also takes the mel-spectrogram as input to generate 48kHz high-fidelity singing voices. In terms of discriminators, we combine a multi-period discriminator, as originally proposed in HiFiGAN, with a multi-resolution multi-band STFT discriminator. Notably, InstructSing achieves comparable voice quality to other neural vocoders but with only one-tenth of the training steps on a machine with four NVIDIA V100 GPUs (demo page: https://wavelandspeech.github.io/instructsing/). We plan to open-source our code and pretrained model once the paper is accepted.
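The signal model underlying a harmonic-plus-noise module is additive: a sum of f0 harmonics (the periodic part) plus a noise component (the aperiodic part). A generic sketch of that decomposition, with hypothetical parameters and none of InstructSing's learned components:

```python
import numpy as np

rng = np.random.default_rng(0)

def harmonic_plus_noise(f0, harm_amps, noise_gain, sr=8000, dur=0.1):
    """Additive harmonic-plus-noise synthesis: harmonics of f0 at the
    given amplitudes, plus scaled white noise for the aperiodic part."""
    t = np.arange(int(sr * dur)) / sr
    harm = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
               for k, a in enumerate(harm_amps))
    return harm + noise_gain * rng.normal(size=t.size)

# A 220 Hz tone with three decaying harmonics and a small noise floor.
sig = harmonic_plus_noise(f0=220.0, harm_amps=[1.0, 0.5, 0.25],
                          noise_gain=0.01)
```

In a DDSP-style vocoder, the per-harmonic amplitudes and the noise filter would be predicted by the network per frame rather than fixed.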
"InstructSing: High-Fidelity Singing Voice Generation via Instructing Yourself." Chang Zeng, Chunhui Wang, Xiaoxiao Miao, Jian Zhao, Zhonglin Jiang, Yong Chen. arXiv - EE - Audio and Speech Processing, 2024-09-10. https://doi.org/arxiv-2409.06330
In real-world applications, it is challenging to build a speaker verification system that is simultaneously robust against common threats, including spoofing attacks, channel mismatch, and domain mismatch. Traditional automatic speaker verification (ASV) systems often tackle these issues separately, leading to suboptimal performance when faced with simultaneous challenges. In this paper, we propose an integrated framework that incorporates pair-wise learning and spoofing attack simulation into the meta-learning paradigm to enhance robustness against these multifaceted threats. This novel approach employs an asymmetric dual-path model and a multi-task learning strategy to handle ASV, anti-spoofing, and spoofing-aware ASV tasks concurrently. A new testing dataset, CNComplex, is introduced to evaluate system performance under these combined threats. Experimental results demonstrate that our integrated model significantly improves performance over traditional ASV systems across various scenarios, showcasing its potential for real-world deployment. Additionally, the proposed framework's ability to generalize across different conditions highlights its robustness and reliability, making it a promising solution for practical ASV applications.
"Spoofing-Aware Speaker Verification Robust Against Domain and Channel Mismatches." Chang Zeng, Xiaoxiao Miao, Xin Wang, Erica Cooper, Junichi Yamagishi. arXiv - EE - Audio and Speech Processing, 2024-09-10. https://doi.org/arxiv-2409.06327
The Audio Question Answering task includes audio event classification, audio captioning, and open-ended reasoning. Recently, Audio Question Answering has garnered attention due to the advent of Large Audio Language Models (LALMs). Current literature focuses on constructing LALMs by integrating audio encoders with text-only Large Language Models through a projection module. While LALMs excel in general audio understanding, they are limited in temporal reasoning, which may hinder their commercial applications and on-device deployment. This paper addresses these challenges and limitations in audio temporal reasoning. First, we introduce a data augmentation technique for generating reliable audio temporal questions and answers using an LLM. Second, we propose a continued finetuning curriculum learning strategy to specialize in temporal reasoning without compromising performance on finetuned tasks. Finally, we develop a reliable and transparent automated metric, assisted by an LLM, to intelligently measure the correlation between LALM responses and ground truth data. We demonstrate the effectiveness of our proposed techniques using SOTA LALMs on public audio benchmark datasets.
"Enhancing Temporal Understanding in Audio Question Answering for Large Audio Language Models." Arvind Krishna Sridhar, Yinyi Guo, Erik Visser. arXiv - EE - Audio and Speech Processing, 2024-09-10. https://doi.org/arxiv-2409.06223
Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw
The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently finetuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose to incorporate mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
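The mixture idea can be sketched as input-dependent gating over a pool of weak encoders whose combined output is added to the base encoder's. A toy linear version with hypothetical sizes, not the MoWE architecture itself:

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, n_weak = 16, 8, 4  # hypothetical sizes

W_base = rng.normal(size=(d_in, d_out))           # base encoder
W_weak = rng.normal(size=(n_weak, d_in, d_out))   # pool of weak encoders
W_gate = rng.normal(size=(d_in, n_weak))          # router

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mowe_encode(x):
    """Base encoder output plus a gated mixture of weak-encoder outputs.
    The gates depend on the input, so different audio activates
    different weak encoders."""
    gates = softmax(x @ W_gate)
    weak = np.einsum('k,kio,i->o', gates, W_weak, x)
    return x @ W_base + weak, gates

x = rng.normal(size=d_in)
feat, gates = mowe_encode(x)
```

Real implementations would use neural encoders, top-k (sparse) routing, and per-frame gating; the soft dense routing above keeps the sketch short.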
"MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders." arXiv - EE - Audio and Speech Processing, 2024-09-10. https://doi.org/arxiv-2409.06635