arXiv - EE - Audio and Speech Processing: Latest Papers, Page 3

Discrete Unit based Masking for Improving Disentanglement in Voice Conversion

arXiv - EE - Audio and Speech Processing

Pub Date : 2024-09-17 DOI: arxiv-2409.11560

Philip H. Lee, Ismail Rasim Ulgen, Berrak Sisman

Voice conversion (VC) aims to modify the speaker's identity while preserving the linguistic content. Commonly, VC methods use an encoder-decoder architecture, where disentangling the speaker's identity from linguistic information is crucial. However, the disentanglement approaches used in these methods are limited as the speaker features depend on the phonetic content of the utterance, compromising disentanglement. This dependency is amplified with attention-based methods. To address this, we introduce a novel masking mechanism in the input before speaker encoding, masking certain discrete speech units that correspond highly with phoneme classes. Our work aims to reduce the phonetic dependency of speaker features by restricting access to some phonetic information. Furthermore, since our approach is at the input level, it is applicable to any encoder-decoder based VC framework. Our approach improves disentanglement and conversion performance across multiple VC methods, showing significant effectiveness, particularly in an attention-based method, with a 44% relative improvement in objective intelligibility.
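
The input-level masking described in this abstract can be pictured with a minimal sketch. This is not the authors' implementation: it assumes every input frame already carries a discrete unit ID (for example from clustering self-supervised features), and the function name, the `masked_units` set, and the zero fill value are all hypothetical. Frames whose unit is selected for masking are blanked before a (not shown) speaker encoder would see them.

```python
from typing import List, Sequence, Set


def mask_frames_by_unit(
    frames: Sequence[Sequence[float]],
    unit_ids: Sequence[int],
    masked_units: Set[int],
    fill_value: float = 0.0,
) -> List[List[float]]:
    """Replace every frame whose discrete unit is in `masked_units` with a constant frame."""
    if len(frames) != len(unit_ids):
        raise ValueError("frames and unit_ids must have the same length")
    masked = []
    for frame, unit in zip(frames, unit_ids):
        if unit in masked_units:
            # Hide this frame from the speaker encoder (hypothetical choice: constant fill).
            masked.append([fill_value] * len(frame))
        else:
            masked.append(list(frame))
    return masked


# Toy input: four 2-dimensional frames with hypothetical unit IDs.
frames = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]]
unit_ids = [7, 3, 7, 5]
masked_input = mask_frames_by_unit(frames, unit_ids, masked_units={7})
print(masked_input)
```

Because the mask is applied to the input rather than inside the model, the same function could sit in front of any speaker encoder, which mirrors the abstract's claim that the approach is framework-agnostic.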

Citations: 0

Preference Tuning with Human Feedback on Language, Speech, and Vision Tasks: A Survey

arXiv - EE - Audio and Speech Processing

Pub Date : 2024-09-17 DOI: arxiv-2409.11564

Genta Indra Winata, Hanyang Zhao, Anirban Das, Wenpin Tang, David D. Yao, Shi-Xiong Zhang, Sambit Sahu

Preference tuning is a crucial process for aligning deep generative models with human preferences. This survey offers a thorough overview of recent advancements in preference tuning and the integration of human feedback. The paper is organized into three main sections: 1) introduction and preliminaries: an introduction to reinforcement learning frameworks, preference tuning tasks, models, and datasets across various modalities: language, speech, and vision, as well as different policy approaches, 2) in-depth examination of each preference tuning approach: a detailed analysis of the methods used in preference tuning, and 3) applications, discussion, and future directions: an exploration of the applications of preference tuning in downstream tasks, including evaluation methods for different modalities, and an outlook on future research directions. Our objective is to present the latest methodologies in preference tuning and model alignment, enhancing the understanding of this field for researchers and practitioners. We hope to encourage further engagement and innovation in this area.

Citations: 0

M-BEST-RQ: A Multi-Channel Speech Foundation Model for Smart Glasses

arXiv - EE - Audio and Speech Processing

Pub Date : 2024-09-17 DOI: arxiv-2409.11494

Yufeng Yang, Desh Raj, Ju Lin, Niko Moritz, Junteng Jia, Gil Keren, Egor Lakomkin, Yiteng Huang, Jacob Donley, Jay Mahadeokar, Ozlem Kalinli

The growing popularity of multi-channel wearable devices, such as smart glasses, has led to a surge of applications such as targeted speech recognition and enhanced hearing. However, current approaches to solve these tasks use independently trained models, which may not benefit from large amounts of unlabeled data. In this paper, we propose M-BEST-RQ, the first multi-channel speech foundation model for smart glasses, which is designed to leverage large-scale self-supervised learning (SSL) in an array-geometry agnostic approach. While prior work on multi-channel speech SSL only evaluated on simulated settings, we curate a suite of real downstream tasks to evaluate our model, namely (i) conversational automatic speech recognition (ASR), (ii) spherical active source localization, and (iii) glasses wearer voice activity detection, which are sourced from the MMCSG and EasyCom datasets. We show that a general-purpose M-BEST-RQ encoder is able to match or surpass supervised models across all tasks. For the conversational ASR task in particular, using only 8 hours of labeled speech, our model outperforms a supervised ASR baseline that is trained on 2000 hours of labeled data, which demonstrates the effectiveness of our approach.

Citations: 0

Evaluation of pretrained language models on music understanding

arXiv - EE - Audio and Speech Processing

Pub Date : 2024-09-17 DOI: arxiv-2409.11449

Yannis Vasilakis, Rachel Bittner, Johan Pauwels

Music-text multimodal systems have enabled new approaches to Music Information Research (MIR) applications such as audio-to-text and text-to-audio retrieval, text-based song generation, and music captioning. Despite the reported success, little effort has been put into evaluating the musical knowledge of Large Language Models (LLMs). In this paper, we demonstrate that LLMs suffer from 1) prompt sensitivity, 2) inability to model negation (e.g. 'rock song without guitar'), and 3) sensitivity towards the presence of specific words. We quantified these properties as a triplet-based accuracy, evaluating the ability to model the relative similarity of labels in a hierarchical ontology. We leveraged the Audioset ontology to generate triplets consisting of an anchor, a positive (relevant) label, and a negative (less relevant) label for the genre and instruments sub-tree. We evaluated the triplet-based musical knowledge for six general-purpose Transformer-based models. The triplets obtained through this methodology required filtering, as some were difficult to judge and therefore relatively uninformative for evaluation purposes. Despite the relatively high accuracy reported, inconsistencies are evident in all six models, suggesting that off-the-shelf LLMs need adaptation to music before use.
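
The triplet-based accuracy mentioned in this abstract can be illustrated with a small sketch. The abstract does not say how label similarity is scored, so this example assumes cosine similarity between text embeddings; the embedding values, label names, and function names are made up for illustration. A triplet counts as correct when the anchor is closer to the positive label than to the negative one.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def triplet_accuracy(
    embeddings: Dict[str, Sequence[float]],
    triplets: List[Tuple[str, str, str]],
) -> float:
    """Fraction of (anchor, positive, negative) triplets ranked correctly."""
    if not triplets:
        raise ValueError("at least one triplet is required")
    correct = 0
    for anchor, positive, negative in triplets:
        sim_pos = cosine(embeddings[anchor], embeddings[positive])
        sim_neg = cosine(embeddings[anchor], embeddings[negative])
        if sim_pos > sim_neg:
            correct += 1
    return correct / len(triplets)


# Toy 2-D embeddings (hypothetical); "heavy metal" is deliberately placed badly.
embeddings = {
    "rock": [1.0, 0.0],
    "punk rock": [0.9, 0.1],
    "heavy metal": [-0.1, 1.0],
    "opera": [0.0, 1.0],
}
triplets = [
    ("rock", "punk rock", "opera"),    # ranked correctly
    ("rock", "heavy metal", "opera"),  # ranked incorrectly by these toy embeddings
]
accuracy = triplet_accuracy(embeddings, triplets)
print(accuracy)
```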

Citations: 0

Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models

arXiv - EE - Audio and Speech Processing

Pub Date : 2024-09-17 DOI: arxiv-2409.10999

Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul

Audio language models can understand audio inputs and perform a range of audio-related tasks based on instructions, such as speech recognition and audio captioning, where the instructions are usually textual prompts. Audio language models are mostly initialized from pre-trained audio encoders and large language models (LLMs). Although these pre-trained components were developed to support multiple languages, audio language models are trained predominantly on English data, which may limit their usability to only English instructions or English speech inputs. First, this paper examines the performance of existing audio language models in an underserved language using Thai as an example. This paper demonstrates that, despite being built on multilingual backbones, audio language models do not exhibit cross-lingual emergent abilities to low-resource languages. Second, this paper studies data mixture for developing audio language models that are optimized for a target language as well as English. In addition, this paper integrates audio comprehension and speech instruction-following capabilities into a single unified model. Our experiments provide insights into data mixture for enhancing instruction-following capabilities in both a low-resource language and English. Our model, Typhoon-Audio, outperforms existing open-source audio language models by a considerable margin, and it is comparable to state-of-the-art Gemini-1.5-Pro in both English and Thai languages.

Citations: 0

SynthSOD: Developing an Heterogeneous Dataset for Orchestra Music Source Separation

arXiv - EE - Audio and Speech Processing

Pub Date : 2024-09-17 DOI: arxiv-2409.10995

Jaime Garcia-Martinez, David Diaz-Guerra, Archontis Politis, Tuomas Virtanen, Julio J. Carabias-Orti, Pedro Vera-Candeas

Music source separation has progressed significantly in recent years, particularly in isolating vocals, drums, and bass elements from mixed tracks. These developments owe much to the creation and use of large-scale, multitrack datasets dedicated to these specific components. However, the challenge of extracting similarly sounding sources from orchestra recordings has not been extensively explored, largely due to a scarcity of comprehensive and clean (i.e. bleed-free) multitrack datasets. In this paper, we introduce a novel multitrack dataset called SynthSOD, developed using a set of simulation techniques to create a realistic (i.e. using high-quality soundfonts), musically motivated, and heterogeneous training set comprising different dynamics, natural tempo changes, styles, and conditions. Moreover, we demonstrate the application of a widely used baseline music separation model trained on our synthesized dataset with respect to the well-known EnsembleSet, and evaluate its performance under both synthetic and real-world conditions.

Citations: 0

WER We Stand: Benchmarking Urdu ASR Models

arXiv - EE - Audio and Speech Processing

Pub Date : 2024-09-17 DOI: arxiv-2409.11252

Samee Arif, Aamina Jamal Khan, Mustafa Abbas, Agha Ali Raza, Awais Athar

This paper presents a comprehensive evaluation of Urdu Automatic Speech Recognition (ASR) models. We analyze the performance of three ASR model families: Whisper, MMS, and Seamless-M4T using Word Error Rate (WER), along with a detailed examination of the most frequent wrong words and error types including insertions, deletions, and substitutions. Our analysis is conducted using two types of datasets, read speech and conversational speech. Notably, we present the first conversational speech dataset designed for benchmarking Urdu ASR models. We find that seamless-large outperforms other ASR models on the read speech dataset, while whisper-large performs best on the conversational speech dataset. Furthermore, this evaluation highlights the complexities of assessing ASR models for low-resource languages like Urdu using quantitative metrics alone and emphasizes the need for a robust Urdu text normalization system. Our findings contribute valuable insights for developing robust ASR systems for low-resource languages like Urdu.
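
For readers unfamiliar with the metric, here is a minimal sketch of how WER and its breakdown into substitutions, deletions, and insertions can be computed from a word-level edit-distance alignment. It is a generic textbook formulation, not the paper's evaluation code; it splits on whitespace and applies no text normalization, which is exactly the kind of simplification the abstract warns about for Urdu. The function name and the example sentences are hypothetical.

```python
from typing import Dict


def wer_breakdown(reference: str, hypothesis: str) -> Dict[str, float]:
    """Return substitution/deletion/insertion counts and WER for one utterance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # cost[i][j] = minimum number of edits turning ref[:i] into hyp[:j]
    cost = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        cost[i][0] = i
    for j in range(1, len(hyp) + 1):
        cost[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            diagonal = cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            cost[i][j] = min(diagonal, cost[i - 1][j] + 1, cost[i][j - 1] + 1)

    # Walk back through the table to attribute each edit to an error type.
    counts = {"substitutions": 0, "deletions": 0, "insertions": 0}
    i, j = len(ref), len(hyp)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            if ref[i - 1] != hyp[j - 1]:
                counts["substitutions"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            counts["deletions"] += 1
            i -= 1
        else:
            counts["insertions"] += 1
            j -= 1

    total_errors = sum(counts.values())
    return {**counts, "wer": total_errors / len(ref)}


result = wer_breakdown("we present urdu benchmarks", "we presents urdu")
print(result)
```

When several alignments share the same minimum edit count, the split between error types depends on the tie-breaking order of the backtrace, so the per-type counts should be read as one valid alignment rather than the only one.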

Citations: 0

Ideal-LLM: Integrating Dual Encoders and Language-Adapted LLM for Multilingual Speech-to-Text

arXiv - EE - Audio and Speech Processing

Pub Date : 2024-09-17 DOI: arxiv-2409.11214

Hongfei Xue, Wei Ren, Xuelong Geng, Kun Wei, Longhao Li, Qijie Shao, Linju Yang, Kai Diao, Lei Xie

Integrating audio encoders with LLMs through connectors has enabled these models to process and comprehend audio modalities, significantly enhancing speech-to-text tasks, including automatic speech recognition (ASR) and automatic speech translation (AST). However, these methods often overlook the critical aspect of language adaptation in multilingual settings, relying instead on multilingual data without adequately addressing language differences. To address this gap, we propose the Ideal-LLM model, which employs dual multilingual encoders to enrich language feature information and utilizes a language-adapted connector to target the adaptation of each language specifically. By leveraging the complementary strengths of Whisper and MMS encoders, our approach ensures richer multilingual representations. Additionally, the language-adapted connector enhances modal transformation via a language weight selector tailored for each language. Experimental results demonstrate that Ideal-LLM significantly improves ASR performance, achieving a 32.6% relative reduction in average word error rate compared to the standard speech encoder integrated with LLMs, and yields an average BLEU score of 36.78 on the AST task.
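
As a reading aid for the "relative reduction" figure in this abstract, the arithmetic is sketched below. The two WER values are hypothetical numbers chosen only so that the result lands near 32.6%; they are not taken from the paper.

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline` (0.326 means 32.6%)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Hypothetical average WERs in percent, for illustration only.
baseline_wer = 10.0
improved_wer = 6.74
reduction = relative_reduction(baseline_wer, improved_wer)
print(f"{reduction:.1%}")
```

A relative reduction is measured against the baseline error, so the same 32.6% could correspond to very different absolute gains depending on how high the baseline WER is.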

Citations: 0

Augment, Drop & Swap: Improving Diversity in LLM Captions for Efficient Music-Text Representation Learning

arXiv - EE - Audio and Speech Processing

Pub Date : 2024-09-17 DOI: arxiv-2409.11498

Ilaria Manco, Justin Salamon, Oriol Nieto

Audio-text contrastive models have become a powerful approach in music representation learning. Despite their empirical success, however, little is known about the influence of key design choices on the quality of music-text representations learnt through this framework. In this work, we expose these design choices within the constraints of limited data and computation budgets, and establish a more solid understanding of their impact grounded in empirical observations along three axes: the choice of base encoders, the level of curation in training data, and the use of text augmentation. We find that data curation is the single most important factor for music-text contrastive training in resource-constrained scenarios. Motivated by this insight, we introduce two novel techniques, Augmented View Dropout and TextSwap, which increase the diversity and descriptiveness of text inputs seen in training. Through our experiments we demonstrate that these are effective at boosting performance across different pre-training regimes, model architectures, and downstream data distributions, without incurring higher computational costs or requiring additional training data.

Citations: 0

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer

arXiv - EE - Audio and Speech Processing

Pub Date : 2024-09-17 DOI: arxiv-2409.10819

Jiarui Hai, Yong Xu, Hao Zhang, Chenxing Li, Helin Wang, Mounya Elhilali, Dong Yu

Latent diffusion models have shown promising results in text-to-audio (T2A) generation tasks, yet previous models have encountered difficulties in generation quality, computational cost, diffusion sampling, and data preparation. In this paper, we introduce EzAudio, a transformer-based T2A diffusion model, to handle these challenges. Our approach includes several key innovations: (1) We build the T2A model on the latent space of a 1D waveform Variational Autoencoder (VAE), avoiding the complexities of handling 2D spectrogram representations and using an additional neural vocoder. (2) We design an optimized diffusion transformer architecture specifically tailored for audio latent representations and diffusion modeling, which enhances convergence speed, training stability, and memory usage, making the training process easier and more efficient. (3) To tackle data scarcity, we adopt a data-efficient training strategy that leverages unlabeled data for learning acoustic dependencies, audio caption data annotated by audio-language models for text-to-audio alignment learning, and human-labeled data for fine-tuning. (4) We introduce a classifier-free guidance (CFG) rescaling method that simplifies EzAudio by achieving strong prompt alignment while preserving great audio quality when using larger CFG scores, eliminating the need to struggle with finding the optimal CFG score to balance this trade-off. EzAudio surpasses existing open-source models in both objective metrics and subjective evaluations, delivering realistic listening experiences while maintaining a streamlined model structure, low training costs, and an easy-to-follow training pipeline. Code, data, and pre-trained models are released at: https://haidog-yaqub.github.io/EzAudio-Page/.
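
To make item (4) concrete, here is a minimal sketch of one common form of CFG rescaling: the guided prediction is scaled so its standard deviation matches that of the conditional prediction, then blended with the unrescaled result. The abstract does not give EzAudio's exact formula, so the std-matching rule, the function name `cfg_rescaled`, and the default values of `scale` and `rescale` are assumptions for illustration only.

```python
import statistics


def cfg_rescaled(cond, uncond, scale=5.0, rescale=0.75):
    """Classifier-free guidance with std-matching rescale (assumed form).

    cond / uncond: the model's conditional and unconditional predictions
    for one sample, flattened to lists of floats. In a tensor setting the
    statistics would be taken per sample over all non-batch dimensions.
    """
    # Standard CFG: extrapolate from the unconditional output toward
    # (and past) the conditional one.
    guided = [u + scale * (c - u) for c, u in zip(cond, uncond)]

    std_cond = statistics.pstdev(cond)
    std_guided = statistics.pstdev(guided)
    if std_guided == 0.0:
        return guided  # nothing to rescale

    # Large guidance scales inflate the magnitude of the prediction;
    # bring it back to the conditional prediction's scale.
    factor = std_cond / std_guided
    matched = [g * factor for g in guided]

    # Blend: rescale=1.0 is full std matching, rescale=0.0 is plain CFG.
    return [rescale * m + (1.0 - rescale) * g for m, g in zip(matched, guided)]


cond = [1.0, -1.0, 0.5, -0.5]
uncond = [0.2, -0.1, 0.0, 0.1]
out = cfg_rescaled(cond, uncond, scale=5.0, rescale=1.0)
print(abs(statistics.pstdev(out) - statistics.pstdev(cond)) < 1e-9)
```

Under this assumed rule, raising `scale` strengthens prompt adherence while the rescale step keeps the output statistics close to those of the conditional prediction, which is consistent with the trade-off the abstract describes.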

Citations: 0