We introduce ESPnet-EZ, an extension of the open-source speech processing toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ focuses on two major aspects: (i) easy fine-tuning and inference of existing ESPnet models on various tasks and (ii) easy integration with popular deep neural network frameworks such as PyTorch-Lightning, Hugging Face transformers and datasets, and Lhotse. By replacing ESPnet design choices inherited from Kaldi with a Python-only, Bash-free interface, we dramatically reduce the effort required to build, debug, and use a new model. For example, to fine-tune a speech foundation model, ESPnet-EZ requires 2.7x less newly written code and 6.7x less dependent code than ESPnet, while dramatically reducing Bash script dependencies. The codebase of ESPnet-EZ is publicly available.
{"title":"ESPnet-EZ: Python-only ESPnet for Easy Fine-tuning and Integration","authors":"Masao Someki, Kwanghee Choi, Siddhant Arora, William Chen, Samuele Cornell, Jionghao Han, Yifan Peng, Jiatong Shi, Vaibhav Srivastav, Shinji Watanabe","doi":"arxiv-2409.09506","DOIUrl":"https://doi.org/arxiv-2409.09506","url":null,"abstract":"We introduce ESPnet-EZ, an extension of the open-source speech processing\u0000toolkit ESPnet, aimed at quick and easy development of speech models. ESPnet-EZ\u0000focuses on two major aspects: (i) easy fine-tuning and inference of existing\u0000ESPnet models on various tasks and (ii) easy integration with popular deep\u0000neural network frameworks such as PyTorch-Lightning, Hugging Face transformers\u0000and datasets, and Lhotse. By replacing ESPnet design choices inherited from\u0000Kaldi with a Python-only, Bash-free interface, we dramatically reduce the\u0000effort required to build, debug, and use a new model. For example, to fine-tune\u0000a speech foundation model, ESPnet-EZ, compared to ESPnet, reduces the number of\u0000newly written code by 2.7x and the amount of dependent code by 6.7x while\u0000dramatically reducing the Bash script dependencies. The codebase of ESPnet-EZ\u0000is publicly available.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266103","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In accent conversion (also referred to as accented voice conversion), the goal is to convert speech from one accent to another while preserving speaker identity and semantic content. In this study, we formulate a novel method for creating multi-accented speech samples, that is, pairs of differently accented utterances from the same speaker, through text transliteration for training accent conversion systems. We begin by generating transliterated text with Large Language Models (LLMs), which is then fed into multilingual TTS models to synthesize accented English speech. As a reference system, we build a sequence-to-sequence model on the synthetic parallel corpus for accent conversion. We validate the proposed method for both native and non-native English speakers, and subjective and objective evaluations confirm our dataset's effectiveness for accent conversion studies.
{"title":"MacST: Multi-Accent Speech Synthesis via Text Transliteration for Accent Conversion","authors":"Sho Inoue, Shuai Wang, Wanxing Wang, Pengcheng Zhu, Mengxiao Bi, Haizhou Li","doi":"arxiv-2409.09352","DOIUrl":"https://doi.org/arxiv-2409.09352","url":null,"abstract":"In accented voice conversion or accent conversion, we seek to convert the\u0000accent in speech from one another while preserving speaker identity and\u0000semantic content. In this study, we formulate a novel method for creating\u0000multi-accented speech samples, thus pairs of accented speech samples by the\u0000same speaker, through text transliteration for training accent conversion\u0000systems. We begin by generating transliterated text with Large Language Models\u0000(LLMs), which is then fed into multilingual TTS models to synthesize accented\u0000English speech. As a reference system, we built a sequence-to-sequence model on\u0000the synthetic parallel corpus for accent conversion. We validated the proposed\u0000method for both native and non-native English speakers. Subjective and\u0000objective evaluations further validate our dataset's effectiveness in accent\u0000conversion studies.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266156","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream IVA tasks with pre-trained audio models and text models. However, these models are pre-trained independently and usually on tasks different from the target domain, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio requires highly specialized skills, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signals as input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective with a language-audio matching objective to align the audio and the ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that, while conceptually simple, DSCLAP significantly outperforms the baseline models on all metrics, showing great promise for domain-specific IVA applications.
{"title":"DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training","authors":"Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li","doi":"arxiv-2409.09289","DOIUrl":"https://doi.org/arxiv-2409.09289","url":null,"abstract":"Analyzing real-world multimodal signals is an essential and challenging task\u0000for intelligent voice assistants (IVAs). Mainstream approaches have achieved\u0000remarkable performance on various downstream tasks of IVAs with pre-trained\u0000audio models and text models. However, these models are pre-trained\u0000independently and usually on tasks different from target domains, resulting in\u0000sub-optimal modality representations for downstream tasks. Moreover, in many\u0000domains, collecting enough language-audio pairs is extremely hard, and\u0000transcribing raw audio also requires high professional skills, making it\u0000difficult or even infeasible to joint pre-training. To address these\u0000painpoints, we propose DSCLAP, a simple and effective framework that enables\u0000language-audio pre-training with only raw audio signal input. Specifically,\u0000DSCLAP converts raw audio signals into text via an ASR system and combines a\u0000contrastive learning objective and a language-audio matching objective to align\u0000the audio and ASR transcriptions. We pre-train DSCLAP on 12,107 hours of\u0000in-vehicle domain audio. Empirical results on two downstream tasks show that\u0000while conceptually simple, DSCLAP significantly outperforms the baseline models\u0000in all metrics, showing great promise for domain-specific IVAs applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the goal of more natural and human-like interaction with virtual voice assistants, recent research in the field has focused on a full-duplex interaction mode that does not rely on repeated wake-up words. This requires that, in scenes with complex sound sources, the voice assistant classify utterances as device-oriented or non-device-oriented. The dual-encoder structure, jointly modeled from text and speech, has become the paradigm for device-directed speech detection. However, in practice, these models often produce incorrect predictions for unaligned input pairs due to the unavoidable errors of automatic speech recognition (ASR). To address this challenge, we propose M$^{3}$V, a multi-modal multi-view approach for device-directed speech detection that frames the problem as a multi-view learning task, introducing unimodal views and a text-audio alignment view into the network alongside the multi-modal view. Experimental results show that M$^{3}$V significantly outperforms models trained using only single or multiple modalities and, for the first time, surpasses human judgment performance on ASR error data.
{"title":"M$^{3}$V: A multi-modal multi-view approach for Device-Directed Speech Detection","authors":"Anna Wang, Da Liu, Zhiyu Zhang, Shengqiang Liu, Jie Gao, Yali Li","doi":"arxiv-2409.09284","DOIUrl":"https://doi.org/arxiv-2409.09284","url":null,"abstract":"With the goal of more natural and human-like interaction with virtual voice\u0000assistants, recent research in the field has focused on full duplex interaction\u0000mode without relying on repeated wake-up words. This requires that in scenes\u0000with complex sound sources, the voice assistant must classify utterances as\u0000device-oriented or non-device-oriented. The dual-encoder structure, which is\u0000jointly modeled by text and speech, has become the paradigm of device-directed\u0000speech detection. However, in practice, these models often produce incorrect\u0000predictions for unaligned input pairs due to the unavoidable errors of\u0000automatic speech recognition (ASR).To address this challenge, we propose\u0000M$^{3}$V, a multi-modal multi-view approach for device-directed speech\u0000detection, which frames we frame the problem as a multi-view learning task that\u0000introduces unimodal views and a text-audio alignment view in the network\u0000besides the multi-modal. Experimental results show that M$^{3}$V significantly\u0000outperforms models trained using only single or multi-modality and surpasses\u0000human judgment performance on ASR error data for the first time.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Solving the permutation problem is essential for determined blind source separation (BSS). Existing methods, such as independent vector analysis (IVA) and independent low-rank matrix analysis (ILRMA), tackle the permutation problem by modeling the co-occurrence of the frequency components of source signals. One of the remaining challenges in these methods is the block permutation problem, which may lead to poor separation results. In this paper, we propose a simple and effective technique for solving the block permutation problem. The proposed technique splits the entire frequency range into overlapping subbands and sequentially applies a BSS method (e.g., IVA, ILRMA, or any other method) to each subband. Since the splitting reduces the problem size, the BSS method can work effectively within each subband. The permutations between subbands are then aligned by using the separation result in one subband as the initial values for the other subbands. Experimental results showed that the proposed technique remarkably improved the separation performance without increasing the total computational cost.
{"title":"Subband Splitting: Simple, Efficient and Effective Technique for Solving Block Permutation Problem in Determined Blind Source Separation","authors":"Kazuki Matsumoto, Kohei Yatabe","doi":"arxiv-2409.09294","DOIUrl":"https://doi.org/arxiv-2409.09294","url":null,"abstract":"Solving the permutation problem is essential for determined blind source\u0000separation (BSS). Existing methods, such as independent vector analysis (IVA)\u0000and independent low-rank matrix analysis (ILRMA), tackle the permutation\u0000problem by modeling the co-occurrence of the frequency components of source\u0000signals. One of the remaining challenges in these methods is the block\u0000permutation problem, which may lead to poor separation results. In this paper,\u0000we propose a simple and effective technique for solving the block permutation\u0000problem. The proposed technique splits the entire frequencies into overlapping\u0000subbands and sequentially applies a BSS method (e.g., IVA, ILRMA, or any other\u0000method) to each subband. Since the problem size is reduced by the splitting,\u0000the BSS method can effectively work in each subband. Then, the permutations\u0000between the subbands are aligned by using the separation result in one subband\u0000as the initial values for the other subbands. Experimental results showed that\u0000the proposed technique remarkably improved the separation performance without\u0000increasing the total computational cost.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pre-trained deep learning embeddings have consistently shown superior performance over handcrafted acoustic features in speech emotion recognition (SER). However, unlike acoustic features with clear physical meaning, these embeddings lack clear interpretability. Explaining them is crucial for building trust in healthcare and security applications and for advancing the scientific understanding of the acoustic information they encode. This paper proposes a modified probing approach to explain deep learning embeddings in the SER space. We predict interpretable acoustic features (e.g., f0, loudness) from (i) the complete set of embedding dimensions and (ii) a subset of the dimensions identified as most important for predicting each emotion. If the subset of the most important dimensions predicts a given emotion better than all dimensions and also predicts specific acoustic features more accurately, we infer that those acoustic features are important to the embedding model for that task. We conducted experiments using WavLM embeddings and eGeMAPS acoustic features as audio representations, applying our method to the RAVDESS and SAVEE emotional speech datasets. Based on this evaluation, we show that the Energy, Frequency, Spectral, and Temporal categories of acoustic features provide diminishing information to SER in that order, demonstrating the utility of the probing-classifier method for relating embeddings to interpretable acoustic features.
{"title":"Explaining Deep Learning Embeddings for Speech Emotion Recognition by Predicting Interpretable Acoustic Features","authors":"Satvik Dixit, Daniel M. Low, Gasser Elbanna, Fabio Catania, Satrajit S. Ghosh","doi":"arxiv-2409.09511","DOIUrl":"https://doi.org/arxiv-2409.09511","url":null,"abstract":"Pre-trained deep learning embeddings have consistently shown superior\u0000performance over handcrafted acoustic features in speech emotion recognition\u0000(SER). However, unlike acoustic features with clear physical meaning, these\u0000embeddings lack clear interpretability. Explaining these embeddings is crucial\u0000for building trust in healthcare and security applications and advancing the\u0000scientific understanding of the acoustic information that is encoded in them.\u0000This paper proposes a modified probing approach to explain deep learning\u0000embeddings in the SER space. We predict interpretable acoustic features (e.g.,\u0000f0, loudness) from (i) the complete set of embeddings and (ii) a subset of the\u0000embedding dimensions identified as most important for predicting each emotion.\u0000If the subset of the most important dimensions better predicts a given emotion\u0000than all dimensions and also predicts specific acoustic features more\u0000accurately, we infer those acoustic features are important for the embedding\u0000model for the given task. We conducted experiments using the WavLM embeddings\u0000and eGeMAPS acoustic features as audio representations, applying our method to\u0000the RAVDESS and SAVEE emotional speech datasets. Based on this evaluation, we\u0000demonstrate that Energy, Frequency, Spectral, and Temporal categories of\u0000acoustic features provide diminishing information to SER in that order,\u0000demonstrating the utility of the probing classifier method to relate embeddings\u0000to interpretable acoustic features.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In tandem with recent advances in foundation model research, there has been a surge of generative music AI applications within the past few years. As the idea of AI-generated or AI-augmented music becomes more mainstream, many researchers in the music AI community may be wondering what avenues of research are left. With regard to generative music models, we outline the current areas of research with significant room for exploration. First, we pose the question of the foundational representation of these generative models and investigate approaches towards explainability. Next, we discuss the current state of music datasets and their limitations. We then give an overview of different generative models, ways of evaluating them, and their computational constraints and limitations. Subsequently, we highlight applications of these generative models in extensions to multiple modalities and in integration with artists' workflows as well as music education systems. Finally, we survey the potential copyright implications of generative music and discuss strategies for protecting the rights of musicians. While not meant to be exhaustive, our survey calls attention to a variety of research directions enabled by music foundation models.
{"title":"Prevailing Research Areas for Music AI in the Era of Foundation Models","authors":"Megan Wei, Mateusz Modrzejewski, Aswin Sivaraman, Dorien Herremans","doi":"arxiv-2409.09378","DOIUrl":"https://doi.org/arxiv-2409.09378","url":null,"abstract":"In tandem with the recent advancements in foundation model research, there\u0000has been a surge of generative music AI applications within the past few years.\u0000As the idea of AI-generated or AI-augmented music becomes more mainstream, many\u0000researchers in the music AI community may be wondering what avenues of research\u0000are left. With regards to music generative models, we outline the current areas\u0000of research with significant room for exploration. Firstly, we pose the\u0000question of foundational representation of these generative models and\u0000investigate approaches towards explainability. Next, we discuss the current\u0000state of music datasets and their limitations. We then overview different\u0000generative models, forms of evaluating these models, and their computational\u0000constraints/limitations. Subsequently, we highlight applications of these\u0000generative models towards extensions to multiple modalities and integration\u0000with artists' workflow as well as music education systems. Finally, we survey\u0000the potential copyright implications of generative music and discuss strategies\u0000for protecting the rights of musicians. While it is not meant to be exhaustive,\u0000our survey calls to attention a variety of research directions enabled by music\u0000foundation models.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Most existing audio-text retrieval (ATR) approaches rely on a single-level interaction to associate audio and text, limiting their ability to align the two modalities and leading to suboptimal matches. In this work, we present a novel ATR framework that leverages two-stream Transformers in conjunction with a Hierarchical Alignment (THA) module to identify multi-level correspondences between the Transformer blocks of the audio and text streams. Moreover, current ATR methods mainly focus on learning a global-level representation, missing the fine-grained details needed to capture audio events that correspond to textual semantics. To bridge this gap, we introduce a Disentangled Cross-modal Representation (DCR) approach that disentangles high-dimensional features into compact latent factors to grasp fine-grained audio-text semantic correlations. Additionally, we develop a confidence-aware (CA) module to estimate the confidence of each latent factor pair and adaptively aggregate cross-modal latent factors to achieve local semantic alignment. Experiments show that THA effectively boosts ATR performance, with the DCR approach further contributing consistent performance gains.
{"title":"Audio-text Retrieval with Transformer-based Hierarchical Alignment and Disentangled Cross-modal Representation","authors":"Yifei Xin, Zhihong Zhu, Xuxin Cheng, Xusheng Yang, Yuexian Zou","doi":"arxiv-2409.09256","DOIUrl":"https://doi.org/arxiv-2409.09256","url":null,"abstract":"Most existing audio-text retrieval (ATR) approaches typically rely on a\u0000single-level interaction to associate audio and text, limiting their ability to\u0000align different modalities and leading to suboptimal matches. In this work, we\u0000present a novel ATR framework that leverages two-stream Transformers in\u0000conjunction with a Hierarchical Alignment (THA) module to identify multi-level\u0000correspondences of different Transformer blocks between audio and text.\u0000Moreover, current ATR methods mainly focus on learning a global-level\u0000representation, missing out on intricate details to capture audio occurrences\u0000that correspond to textual semantics. To bridge this gap, we introduce a\u0000Disentangled Cross-modal Representation (DCR) approach that disentangles\u0000high-dimensional features into compact latent factors to grasp fine-grained\u0000audio-text semantic correlations. Additionally, we develop a confidence-aware\u0000(CA) module to estimate the confidence of each latent factor pair and\u0000adaptively aggregate cross-modal latent factors to achieve local semantic\u0000alignment. Experiments show that our THA effectively boosts ATR performance,\u0000with the DCR approach further contributing to consistent performance gains.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Autism spectrum disorder (ASD) is a neurodevelopmental condition characterized by challenges in social communication, repetitive behavior, and sensory processing. One important research area in ASD is evaluating children's behavioral changes over time during treatment. The standard protocol for this purpose is the Brief Observation of Social Communication Change (BOSCC), which involves dyadic interactions between a child and clinicians performing a pre-defined set of activities. A fundamental aspect of understanding children's behavior in these interactions is automatic speech understanding, particularly identifying who speaks and when. Conventional approaches in this area rely heavily on speech samples recorded from a spectator perspective, and there is limited research on egocentric speech modeling. In this study, we design an experiment to collect speech samples in BOSCC interviews from an egocentric perspective using wearable sensors, and we explore pre-training on Ego4D speech samples to enhance child-adult speaker classification in dyadic interactions. Our findings highlight the potential of egocentric speech collection and pre-training to improve speaker classification accuracy.
{"title":"Egocentric Speaker Classification in Child-Adult Dyadic Interactions: From Sensing to Computational Modeling","authors":"Tiantian Feng, Anfeng Xu, Xuan Shi, Somer Bishop, Shrikanth Narayanan","doi":"arxiv-2409.09340","DOIUrl":"https://doi.org/arxiv-2409.09340","url":null,"abstract":"Autism spectrum disorder (ASD) is a neurodevelopmental condition\u0000characterized by challenges in social communication, repetitive behavior, and\u0000sensory processing. One important research area in ASD is evaluating children's\u0000behavioral changes over time during treatment. The standard protocol with this\u0000objective is BOSCC, which involves dyadic interactions between a child and\u0000clinicians performing a pre-defined set of activities. A fundamental aspect of\u0000understanding children's behavior in these interactions is automatic speech\u0000understanding, particularly identifying who speaks and when. Conventional\u0000approaches in this area heavily rely on speech samples recorded from a\u0000spectator perspective, and there is limited research on egocentric speech\u0000modeling. In this study, we design an experiment to perform speech sampling in\u0000BOSCC interviews from an egocentric perspective using wearable sensors and\u0000explore pre-training Ego4D speech samples to enhance child-adult speaker\u0000classification in dyadic interactions. Our findings highlight the potential of\u0000egocentric speech collection and pre-training to improve speaker classification\u0000accuracy.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present our system (denoted T05) for the VoiceMOS Challenge (VMC) 2024. Our system was designed for VMC 2024 Track 1, which focused on accurately predicting the naturalness mean opinion score (MOS) of high-quality synthetic speech. In addition to a pretrained self-supervised learning (SSL)-based speech feature extractor, our system incorporates a pretrained image feature extractor to capture differences among synthetic speech samples that are visible in their spectrograms. We first train two MOS predictors separately, one using the SSL-based features and one using the spectrogram-based features. We then fine-tune the two predictors for better MOS prediction using a fusion of the two extracted features. In VMC 2024 Track 1, our T05 system achieved first place in 7 of the 16 evaluation metrics and second place in the remaining 9, with a significant margin over the systems ranked third and below. We also report an ablation study investigating the essential factors of our system.
{"title":"The T05 System for The VoiceMOS Challenge 2024: Transfer Learning from Deep Image Classifier to Naturalness MOS Prediction of High-Quality Synthetic Speech","authors":"Kaito Baba, Wataru Nakata, Yuki Saito, Hiroshi Saruwatari","doi":"arxiv-2409.09305","DOIUrl":"https://doi.org/arxiv-2409.09305","url":null,"abstract":"We present our system (denoted as T05) for the VoiceMOS Challenge (VMC) 2024.\u0000Our system was designed for the VMC 2024 Track 1, which focused on the accurate\u0000prediction of naturalness mean opinion score (MOS) for high-quality synthetic\u0000speech. In addition to a pretrained self-supervised learning (SSL)-based speech\u0000feature extractor, our system incorporates a pretrained image feature extractor\u0000to capture the difference of synthetic speech observed in speech spectrograms.\u0000We first separately train two MOS predictors that use either of an SSL-based or\u0000spectrogram-based feature. Then, we fine-tune the two predictors for better MOS\u0000prediction using the fusion of two extracted features. In the VMC 2024 Track 1,\u0000our T05 system achieved first place in 7 out of 16 evaluation metrics and\u0000second place in the remaining 9 metrics, with a significant difference compared\u0000to those ranked third and below. We also report the results of our ablation\u0000study to investigate essential factors of our system.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}