Voice conversion (VC) aims to modify the speaker's identity while preserving the linguistic content. VC methods commonly use an encoder-decoder architecture, in which disentangling the speaker's identity from linguistic information is crucial. However, the disentanglement approaches used in these methods are limited because the speaker features depend on the phonetic content of the utterance, which compromises disentanglement. This dependency is amplified in attention-based methods. To address this, we introduce a novel masking mechanism applied to the input before speaker encoding, which masks certain discrete speech units that correspond closely to phoneme classes. Our work aims to reduce the phonetic dependency of speaker features by restricting access to some phonetic information. Furthermore, since our approach operates at the input level, it is applicable to any encoder-decoder based VC framework. Our approach improves disentanglement and conversion performance across multiple VC methods and is particularly effective for the attention-based method, with a 44% relative improvement in objective intelligibility.
Discrete Unit based Masking for Improving Disentanglement in Voice Conversion. Philip H. Lee, Ismail Rasim Ulgen, Berrak Sisman. arXiv - EE - Audio and Speech Processing, 2024-09-17. https://doi.org/arxiv-2409.11560
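The input-level masking described in the voice conversion abstract can be illustrated with a minimal sketch. It assumes each input frame already carries a discrete unit ID (for example from clustering self-supervised features, which is an assumption here) and that the set of units to hide has been chosen elsewhere. The function name, the zero mask value, and the toy data are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Set


def mask_frames_by_unit(
    frames: Sequence[Sequence[float]],
    unit_ids: Sequence[int],
    masked_units: Set[int],
    mask_value: float = 0.0,
) -> List[List[float]]:
    """Replace every frame whose discrete unit is in `masked_units`.

    Hypothetical helper: the masked copy would be fed to the speaker
    encoder, while the content branch keeps the unmasked input.
    """
    if len(frames) != len(unit_ids):
        raise ValueError("frames and unit_ids must have the same length")
    masked: List[List[float]] = []
    for frame, unit in zip(frames, unit_ids):
        if unit in masked_units:
            # Hide this frame from the speaker encoder.
            masked.append([mask_value] * len(frame))
        else:
            masked.append(list(frame))
    return masked


# Toy example: three 2-dimensional frames, with unit 7 selected for masking.
toy_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
toy_units = [7, 3, 7]
speaker_input = mask_frames_by_unit(toy_frames, toy_units, {7})
```

Because the masking only touches the input to the speaker encoder, a sketch like this could sit in front of any encoder-decoder VC model without changing the model itself.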
Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D. Yao, Shi-Xiong Zhang, Sambit Sahu
Preference tuning is a crucial process for aligning deep generative models with human preferences. This survey offers a thorough overview of recent advancements in preference tuning and the integration of human feedback. The paper is organized into three main sections: 1) introduction and preliminaries: an introduction to reinforcement learning frameworks, preference tuning tasks, models, and datasets across various modalities: language, speech, and vision, as well as different policy approaches, 2) in-depth examination of each preference tuning approach: a detailed analysis of the methods used in preference tuning, and 3) applications, discussion, and future directions: an exploration of the applications of preference tuning in downstream tasks, including evaluation methods for different modalities, and an outlook on future research directions. Our objective is to present the latest methodologies in preference tuning and model alignment, enhancing the understanding of this field for researchers and practitioners. We hope to encourage further engagement and innovation in this area.
Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey. Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D. Yao, Shi-Xiong Zhang, Sambit Sahu. arXiv - EE - Audio and Speech Processing, 2024-09-17. https://doi.org/arxiv-2409.11564
Yufeng Yang, Desh Raj, Ju Lin, Niko Moritz, Junteng Jia, Gil Keren, Egor Lakomkin, Yiteng Huang, Jacob Donley, Jay Mahadeokar, Ozlem Kalinli
The growing popularity of multi-channel wearable devices, such as smart glasses, has led to a surge of applications such as targeted speech recognition and enhanced hearing. However, current approaches to solve these tasks use independently trained models, which may not benefit from large amounts of unlabeled data. In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. While prior work on multi-channel speech SSL only evaluated on simulated settings, we curate a suite of real downstream tasks to evaluate our model, namely (i) conversational automatic speech recognition (ASR), (ii) spherical active source localization, and (iii) glasses wearer voice activity detection, which are sourced from the MMCSG and EasyCom datasets. We show that a general-purpose M-BEST-RQ encoder is able to match or surpass supervised models across all tasks. For the conversational ASR task in particular, using only 8 hours of labeled speech, our model outperforms a supervised ASR baseline that is trained on 2000 hours of labeled data, which demonstrates the effectiveness of our approach.
M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses. Yufeng Yang, Desh Raj, Ju Lin, Niko Moritz, Junteng Jia, Gil Keren, Egor Lakomkin, Yiteng Huang, Jacob Donley, Jay Mahadeokar, Ozlem Kalinli. arXiv - EE - Audio and Speech Processing, 2024-09-17. https://doi.org/arxiv-2409.11494
Music-text multimodal systems have enabled new approaches to Music Information Research (MIR) applications such as audio-to-text and text-to-audio retrieval, text-based song generation, and music captioning. Despite the reported success, little effort has been put into evaluating the musical knowledge of Large Language Models (LLMs). In this paper, we demonstrate that LLMs suffer from 1) prompt sensitivity, 2) an inability to model negation (e.g. 'rock song without guitar'), and 3) sensitivity to the presence of specific words. We quantified these properties with a triplet-based accuracy that evaluates the ability to model the relative similarity of labels in a hierarchical ontology. We leveraged the AudioSet ontology to generate triplets consisting of an anchor, a positive (relevant) label, and a negative (less relevant) label for the genre and instrument sub-trees. We evaluated the triplet-based musical knowledge of six general-purpose Transformer-based models. The triplets obtained through this methodology required filtering, as some were difficult to judge and therefore relatively uninformative for evaluation purposes. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.
Evaluation of pretrained language models on music understanding. Yannis Vasilakis, Rachel Bittner, Johan Pauwels. arXiv - EE - Audio and Speech Processing, 2024-09-17. https://doi.org/arxiv-2409.11449
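The triplet-based accuracy in the music understanding abstract can be sketched as follows: a triplet counts as correct when the anchor is closer to the positive label than to the negative one. The abstract does not say which similarity measure is used, so cosine similarity over label embeddings is an assumption here, and the toy embedding table and labels are hypothetical stand-ins for a language model's text encoder.

```python
import math
from typing import Callable, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_u * norm_v)


def triplet_accuracy(
    triplets: List[Tuple[str, str, str]],
    embed: Callable[[str], Sequence[float]],
) -> float:
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    if not triplets:
        raise ValueError("need at least one triplet")
    correct = 0
    for anchor, positive, negative in triplets:
        a = embed(anchor)
        if cosine(a, embed(positive)) > cosine(a, embed(negative)):
            correct += 1
    return correct / len(triplets)


# Hypothetical 2-dimensional embeddings standing in for a real text encoder.
toy_embeddings = {
    "rock": [1.0, 0.0],
    "punk rock": [0.9, 0.1],
    "violin": [0.0, 1.0],
    "cello": [0.1, 0.9],
}
toy_triplets = [
    ("rock", "punk rock", "violin"),   # ranked correctly
    ("violin", "cello", "rock"),       # ranked correctly
    ("rock", "violin", "punk rock"),   # deliberately wrong ordering
]
accuracy = triplet_accuracy(toy_triplets, toy_embeddings.__getitem__)
```

With these toy vectors two of the three triplets are ranked correctly, so the accuracy is 2/3.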
Audio language models can understand audio inputs and perform a range of audio-related tasks based on instructions, such as speech recognition and audio captioning, where the instructions are usually textual prompts. Audio language models are mostly initialized from pre-trained audio encoders and large language models (LLMs). Although these pre-trained components were developed to support multiple languages, audio language models are trained predominantly on English data, which may limit their usability to only English instructions or English speech inputs. First, this paper examines the performance of existing audio language models in an underserved language, using Thai as an example. This paper demonstrates that, despite being built on multilingual backbones, audio language models do not exhibit cross-lingual emergent abilities in low-resource languages. Second, this paper studies data mixture for developing audio language models that are optimized for a target language as well as English. In addition, this paper integrates audio comprehension and speech instruction-following capabilities into a single unified model. Our experiments provide insights into data mixture for enhancing instruction-following capabilities in both a low-resource language and English. Our model, Typhoon-Audio, outperforms existing open-source audio language models by a considerable margin, and it is comparable to the state-of-the-art Gemini-1.5-Pro in both English and Thai.
Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models. Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul. arXiv - EE - Audio and Speech Processing, 2024-09-17. https://doi.org/arxiv-2409.10999
Jaime Garcia-Martinez, David Diaz-Guerra, Archontis Politis, Tuomas Virtanen, Julio J. Carabias-Orti, Pedro Vera-Candeas
Music source separation has progressed significantly in recent years, particularly in isolating vocals, drums, and bass from mixed tracks. These developments owe much to the creation and use of large-scale multitrack datasets dedicated to these specific components. However, the challenge of extracting similar-sounding sources from orchestra recordings has not been extensively explored, largely due to a scarcity of comprehensive and clean (i.e., bleed-free) multitrack datasets. In this paper, we introduce a novel multitrack dataset called SynthSOD, developed using a set of simulation techniques to create a realistic (i.e., using high-quality soundfonts), musically motivated, and heterogeneous training set comprising different dynamics, natural tempo changes, styles, and conditions. Moreover, we demonstrate the application of a widely used baseline music separation model trained on our synthesized dataset, compare it with the well-known EnsembleSet, and evaluate its performance under both synthetic and real-world conditions.
SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation. Jaime Garcia-Martinez, David Diaz-Guerra, Archontis Politis, Tuomas Virtanen, Julio J. Carabias-Orti, Pedro Vera-Candeas. arXiv - EE - Audio and Speech Processing, 2024-09-17. https://doi.org/arxiv-2409.10995
Samee Arif, Aamina Jamal Khan, Mustafa Abbas, Agha Ali Raza, Awais Athar
This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. We analyze the performance of three ASR model families, Whisper, MMS, and Seamless-M4T, using Word Error Rate (WER), along with a detailed examination of the most frequently misrecognized words and of error types including insertions, deletions, and substitutions. Our analysis is conducted on two types of datasets: read speech and conversational speech. Notably, we present the first conversational speech dataset designed for benchmarking Urdu ASR models. We find that seamless-large outperforms other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset. Furthermore, this evaluation highlights the complexities of assessing ASR models for low-resource languages like Urdu using quantitative metrics alone and emphasizes the need for a robust Urdu text normalization system. Our findings contribute valuable insights for developing robust ASR systems for low-resource languages like Urdu.
WER We Stand: Benchmarking Urdu ASR Models. Samee Arif, Aamina Jamal Khan, Mustafa Abbas, Agha Ali Raza, Awais Athar. arXiv - EE - Audio and Speech Processing, 2024-09-17. https://doi.org/arxiv-2409.11252
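Word Error Rate and its breakdown into substitutions, deletions, and insertions, as used in the Urdu ASR abstract, can be computed with a word-level edit distance. The sketch below is a generic textbook formulation, not the authors' evaluation code. It assumes plain whitespace tokenization and applies no text normalization, which the abstract itself flags as an open need for Urdu. The function name and the toy strings are hypothetical.

```python
from typing import Dict, List, Tuple


def wer_counts(reference: str, hypothesis: str) -> Dict[str, float]:
    """Word-level edit distance with substitution/deletion/insertion counts."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # dp[i][j] = (cost, substitutions, deletions, insertions)
    # for aligning ref[:i] with hyp[:j].
    dp: List[List[Tuple[int, int, int, int]]] = [
        [(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)
    ]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)  # delete every reference word
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)  # insert every hypothesis word

    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            c, s, d, n = dp[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (c, s, d, n)          # exact match, no cost
            else:
                diagonal = (c + 1, s + 1, d, n)  # substitution
            c, s, d, n = dp[i - 1][j]
            deletion = (c + 1, s, d + 1, n)
            c, s, d, n = dp[i][j - 1]
            insertion = (c + 1, s, d, n + 1)
            # Ties are broken in favour of the diagonal move.
            dp[i][j] = min(diagonal, deletion, insertion, key=lambda t: t[0])

    cost, subs, dels, ins = dp[len(ref)][len(hyp)]
    return {
        "wer": cost / len(ref),
        "substitutions": subs,
        "deletions": dels,
        "insertions": ins,
    }


# Toy example: one substitution (b -> x) and one insertion (d).
result = wer_counts("a b c", "a x c d")
```

When several alignments share the minimum cost, the split between error types depends on the tie-breaking rule, so per-type counts from different toolkits may differ even when the WER agrees.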
Hongfei Xue, Wei Ren, Xuelong Geng, Kun Wei, Longhao Li, Qijie Shao, Linju Yang, Kai Diao, Lei Xie
Integrating audio encoders with LLMs through connectors has enabled these models to process and comprehend audio modalities, significantly enhancing speech-to-text tasks, including automatic speech recognition (ASR) and automatic speech translation (AST). However, these methods often overlook the critical aspect of language adaptation in multilingual settings, relying instead on multilingual data without adequately addressing language differences. To address this gap, we propose the Ideal-LLM model, which employs dual multilingual encoders to enrich language feature information and uses a language-adapted connector to target the adaptation of each language specifically. By leveraging the complementary strengths of the Whisper and MMS encoders, our approach ensures richer multilingual representations. Additionally, the language-adapted connector enhances modal transformation via a language weight selector tailored for each language. Experimental results demonstrate that Ideal-LLM significantly improves ASR performance, achieving a 32.6% relative reduction in average word error rate compared to the standard speech encoder integrated with LLMs, and yields an average BLEU score of 36.78 on the AST task.
Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text. Hongfei Xue, Wei Ren, Xuelong Geng, Kun Wei, Longhao Li, Qijie Shao, Linju Yang, Kai Diao, Lei Xie. arXiv - EE - Audio and Speech Processing, 2024-09-17. https://doi.org/arxiv-2409.11214
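A relative reduction such as the 32.6% figure in the Ideal-LLM abstract is the drop in error rate divided by the baseline error rate. The abstract gives only the relative figure, so the two WER values in this short sketch are hypothetical numbers chosen purely to show the arithmetic.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of an error metric: (baseline - improved) / baseline."""
    if baseline <= 0.0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical values: a baseline WER of 10.0% falling to 6.74%
# corresponds to a relative reduction of about 32.6%.
example = relative_reduction(10.0, 6.74)
```

The same relative figure says nothing about the absolute error rates, which is why both are usually needed to judge an improvement.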
Audio-text contrastive models have become a powerful approach in music representation learning. Despite their empirical success, however, little is known about the influence of key design choices on the quality of music-text representations learnt through this framework. In this work, we expose these design choices within the constraints of limited data and computation budgets, and establish a more solid understanding of their impact grounded in empirical observations along three axes: the choice of base encoders, the level of curation in training data, and the use of text augmentation. We find that data curation is the single most important factor for music-text contrastive training in resource-constrained scenarios. Motivated by this insight, we introduce two novel techniques, Augmented View Dropout and TextSwap, which increase the diversity and descriptiveness of text inputs seen in training. Through our experiments we demonstrate that these are effective at boosting performance across different pre-training regimes, model architectures, and downstream data distributions, without incurring higher computational costs or requiring additional training data.
Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning. Ilaria Manco, Justin Salamon, Oriol Nieto. arXiv - EE - Audio and Speech Processing, 2024-09-17. https://doi.org/arxiv-2409.11498
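For readers unfamiliar with audio-text contrastive training, the sketch below shows the symmetric cross-entropy objective commonly used in this family of models. The music-text abstract does not state the exact loss, so this form, the temperature value, and the function name are assumptions. The input is a square similarity matrix in which entry (i, j) compares audio clip i with caption j and matched pairs lie on the diagonal.

```python
import math
from typing import List, Sequence


def symmetric_contrastive_loss(
    sim: Sequence[Sequence[float]], temperature: float = 0.07
) -> float:
    """Average of audio-to-text and text-to-audio cross-entropy losses."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")

    def mean_cross_entropy(matrix: Sequence[Sequence[float]]) -> float:
        total = 0.0
        for i, row in enumerate(matrix):
            logits = [x / temperature for x in row]
            peak = max(logits)  # subtract the max for numerical stability
            log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
            total += log_z - logits[i]  # target is the matched (diagonal) pair
        return total / len(matrix)

    transposed: List[List[float]] = [list(col) for col in zip(*sim)]
    return 0.5 * (mean_cross_entropy(sim) + mean_cross_entropy(transposed))


# Toy 2x2 batches: matched pairs scoring high, scoring low, and uninformative.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]])
misaligned = symmetric_contrastive_loss([[0.0, 1.0], [1.0, 0.0]])
uniform = symmetric_contrastive_loss([[0.0, 0.0], [0.0, 0.0]])
```

Under an objective like this, text augmentation changes only which caption is paired with each clip in a batch, which is consistent with the abstract's claim of no added computational cost.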
Latent diffusion models have shown promising results in text-to-audio (T2A) generation tasks, yet previous models have encountered difficulties in generation quality, computational cost, diffusion sampling, and data preparation. In this paper, we introduce EzAudio, a transformer-based T2A diffusion model, to handle these challenges. Our approach includes several key innovations: (1) We build the T2A model on the latent space of a 1D waveform Variational Autoencoder (VAE), avoiding the complexities of handling 2D spectrogram representations and using an additional neural vocoder. (2) We design an optimized diffusion transformer architecture specifically tailored for audio latent representations and diffusion modeling, which enhances convergence speed, training stability, and memory usage, making the training process easier and more efficient. (3) To tackle data scarcity, we adopt a data-efficient training strategy that leverages unlabeled data for learning acoustic dependencies, audio caption data annotated by audio-language models for text-to-audio alignment learning, and human-labeled data for fine-tuning. (4) We introduce a classifier-free guidance (CFG) rescaling method that simplifies EzAudio by achieving strong prompt alignment while preserving great audio quality when using larger CFG scores, eliminating the need to struggle with finding the optimal CFG score to balance this trade-off. EzAudio surpasses existing open-source models in both objective metrics and subjective evaluations, delivering realistic listening experiences while maintaining a streamlined model structure, low training costs, and an easy-to-follow training pipeline. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/.
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer. Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu. arXiv - EE - Audio and Speech Processing, 2024-09-17. https://doi.org/arxiv-2409.10819
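The trade-off that classifier-free guidance (CFG) rescaling targets in the EzAudio abstract can be illustrated with a generic sketch. Standard CFG extrapolates from the unconditional prediction towards the conditional one, and one widely used rescaling matches the standard deviation of the guided prediction to that of the conditional prediction before blending. Whether EzAudio uses exactly this form is an assumption, since the abstract does not give the formula. The function name, the blending factor `phi`, and the toy values are hypothetical.

```python
from statistics import pstdev
from typing import List, Sequence


def cfg_with_rescale(
    cond: Sequence[float],
    uncond: Sequence[float],
    scale: float,
    phi: float = 0.7,
) -> List[float]:
    """Classifier-free guidance followed by std-matching rescaling.

    guided   = uncond + scale * (cond - uncond)
    rescaled = guided * std(cond) / std(guided)
    output   = phi * rescaled + (1 - phi) * guided
    """
    if len(cond) != len(uncond):
        raise ValueError("cond and uncond must have the same length")
    guided = [u + scale * (c - u) for c, u in zip(cond, uncond)]
    std_guided = pstdev(guided)
    if std_guided == 0.0:
        return guided  # nothing to rescale for a constant prediction
    factor = pstdev(cond) / std_guided
    rescaled = [g * factor for g in guided]
    return [phi * r + (1.0 - phi) * g for r, g in zip(rescaled, guided)]


# Toy predictions: with phi = 1.0 the output keeps the conditional
# prediction's standard deviation even at a large guidance scale.
toy_cond = [1.0, -1.0, 0.5, -0.5]
toy_uncond = [0.2, -0.1, 0.0, 0.1]
toy_out = cfg_with_rescale(toy_cond, toy_uncond, scale=5.0, phi=1.0)
```

Keeping the output statistics close to those of the conditional prediction is what lets larger guidance scales be used without the over-amplified output that plain CFG tends to produce.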