
arXiv - EE - Audio and Speech Processing: Latest Publications

FakeMusicCaps: a Dataset for Detection and Attribution of Synthetic Music Generated via Text-to-Music Models
Pub Date: 2024-09-16 | arXiv:2409.10684
Luca Comanducci, Paolo Bestagini, Stefano Tubaro
Text-To-Music (TTM) models have recently revolutionized the automatic music generation research field, specifically by reaching superior performance to all previous state-of-the-art models and by lowering the technical proficiency needed to use them. For these reasons, they have readily started to be adopted for commercial uses and music production practices. This widespread diffusion of TTMs raises several concerns regarding copyright violation and rightful attribution, which call for serious consideration by the audio forensics community. In this paper, we tackle the problem of detection and attribution of TTM-generated data. We propose FakeMusicCaps, a dataset that contains several versions of the music-caption pairs dataset MusicCaps re-generated via several state-of-the-art TTM techniques. We evaluate the proposed dataset by performing initial experiments regarding the detection and attribution of TTM-generated audio.
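As a rough illustration of how detection and attribution could be framed on such a dataset, the sketch below treats attribution as closed-set multi-class classification of which TTM system (or real MusicCaps audio) produced a clip, using MFCC statistics and a linear classifier. The file layout, label names, and feature choice are assumptions made for illustration, not the authors' pipeline.

```python
# Minimal sketch: closed-set attribution of generated music as multi-class
# classification. Paths, labels, and features are hypothetical placeholders,
# not the FakeMusicCaps protocol.
import numpy as np
import librosa
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def clip_features(path, sr=16000, n_mfcc=20):
    """Summarize one audio clip as mean/std of its MFCCs."""
    y, sr = librosa.load(path, sr=sr, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# files_and_labels: list of (wav_path, generator_id) pairs, where generator_id
# names the TTM system that produced the clip (or "real" for MusicCaps audio).
def train_attribution(files_and_labels):
    X = np.stack([clip_features(p) for p, _ in files_and_labels])
    y = np.array([label for _, label in files_and_labels])
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    print("attribution accuracy:", clf.score(X_te, y_te))
    return clf
```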
Citations: 0
Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels
Pub Date: 2024-09-16 | arXiv:2409.10791
Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald
Iterative self-training, or iterative pseudo-labeling (IPL) -- using an improved model from the current iteration to provide pseudo-labels for the next iteration -- has proven to be a powerful approach to enhance the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start with representations extracted from very elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. To this end, we show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for unsupervised learning of speaker representations. We also systematically study the impact of other components on the IPL process, including the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model like i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.
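A minimal sketch of the iterative pseudo-labeling loop described above: cluster the current speaker embeddings to obtain pseudo-labels, train a model on those labels, then re-extract embeddings with the improved model and repeat. The embedding hook, classifier, and cluster count are placeholders; the paper's i-vector front end and encoder are not reproduced here.

```python
# Sketch of iterative pseudo-labeling (IPL) for speaker representations.
# extract_embeddings() stands in for the current model (an i-vector extractor
# on the first pass); it is a hypothetical hook, not the paper's code.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neural_network import MLPClassifier

def ipl(utterance_features, extract_embeddings, n_clusters=100, n_iters=3):
    model = None
    for it in range(n_iters):
        # 1) Embed every utterance with the current model (initially i-vectors).
        embs = extract_embeddings(utterance_features, model)            # (N, D)
        # 2) Cluster embeddings; cluster indices become pseudo speaker labels.
        pseudo_labels = KMeans(n_clusters=n_clusters, random_state=it).fit_predict(embs)
        # 3) Train a new model to predict the pseudo-labels.
        model = MLPClassifier(hidden_layer_sizes=(256,), max_iter=200)
        model.fit(embs, pseudo_labels)
        print(f"iteration {it}: trained on {n_clusters} pseudo-speakers")
    return model

# Toy usage with random features; a real system would plug in its own extractor.
rng = np.random.default_rng(0)
toy_feats = rng.standard_normal((500, 64))
def toy_extractor(feats, model):
    # First pass: raw features stand in for i-vectors; later passes reuse the
    # classifier's probability outputs as crude "embeddings" for re-clustering.
    return feats if model is None else model.predict_proba(feats)
final_model = ipl(toy_feats, toy_extractor, n_clusters=10, n_iters=2)
```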
Citations: 0
Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models
Pub Date: 2024-09-16 | arXiv:2409.10788
Li-Wei Chen, Takuya Higuchi, He Bai, Ahmed Hussen Abdelaziz, Alexander Rudnicky, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald, Zakaria Aldeneh
Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech for various downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked input segments from the unmasked context. The choice of prediction targets in this framework can influence performance on downstream tasks. For example, targets that encode prosody are beneficial for speaker-related tasks, while targets that encode phonetics are more suited for content-related tasks. Additionally, prediction targets can vary in the level of detail they encode; targets that encode fine-grained acoustic details are beneficial for denoising tasks, while targets that encode higher-level abstractions are more suited for content-related tasks. Despite the importance of prediction targets, the design choices that affect them have not been thoroughly studied. This work explores the design choices and their impact on downstream task performance. Our results indicate that the commonly used design choices for HuBERT can be suboptimal. We propose novel approaches to create more informative prediction targets and demonstrate their effectiveness through improvements across various downstream tasks.
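To make the masked prediction objective concrete, the sketch below follows the HuBERT-style recipe in spirit: discrete targets come from k-means over frame-level features, a random subset of frames is masked (HuBERT masks contiguous spans; single frames are used here for brevity), and the loss is cross-entropy on the masked positions only. The dimensions, toy encoder, and masking ratio are illustrative assumptions, not the configurations studied in the paper.

```python
# Sketch of a HuBERT-style masked prediction step: predict k-means cluster IDs
# of masked frames from the unmasked context. Shapes and the toy encoder are
# illustrative only.
import torch
import torch.nn as nn

B, T, D, K = 4, 200, 80, 100          # batch, frames, feature dim, num targets
feats = torch.randn(B, T, D)          # stand-in for log-mel frames
targets = torch.randint(0, K, (B, T)) # cluster IDs from an offline k-means step

mask = torch.rand(B, T) < 0.4         # mask roughly 40% of frames
inputs = feats.clone()
inputs[mask] = 0.0                    # replace masked frames with a mask value

encoder = nn.Sequential(nn.Linear(D, 256), nn.ReLU(), nn.Linear(256, 256))
head = nn.Linear(256, K)              # predicts the target distribution per frame

logits = head(encoder(inputs))        # (B, T, K)
loss = nn.functional.cross_entropy(
    logits[mask],                     # only masked positions contribute
    targets[mask],
)
loss.backward()
print("masked-prediction loss:", float(loss))
```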
Citations: 0
Speech as a Biomarker for Disease Detection
Pub Date: 2024-09-16 | arXiv:2409.10230
Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso
Speech is a rich biomarker that encodes substantial information about the health of a speaker, and thus it has been proposed for the detection of numerous diseases, achieving promising results. However, questions remain about what the models trained for the automatic detection of these diseases are actually learning and the basis for their predictions, which can significantly impact patients' lives. This work advocates for an interpretable health model, suitable for detecting several diseases, motivated by the observation that speech-affecting disorders often have overlapping effects on speech signals. A framework is presented that first defines "reference speech" and then leverages this definition for disease detection. Reference speech is characterized through reference intervals, i.e., the typical values of clinically meaningful acoustic and linguistic features derived from a reference population. This novel approach in the field of speech as a biomarker is inspired by the use of reference intervals in clinical laboratory science. Deviations of new speakers from this reference model are quantified and used as input to detect Alzheimer's and Parkinson's disease. The classification strategy explored is based on Neural Additive Models, a type of glass-box neural network, which enables interpretability. The proposed framework for reference speech characterization and disease detection is designed to support the medical community by providing clinically meaningful explanations that can serve as a valuable second opinion.
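The reference-interval idea can be sketched briefly: for each clinically meaningful feature, compute a central interval (here the 2.5th to 97.5th percentile, a common convention in laboratory medicine) from a healthy reference population, then express a new speaker as per-feature deviations from that interval and feed the deviations to a classifier. The feature names, percentile choice, and numbers below are assumptions for illustration; the paper's classifier is a Neural Additive Model rather than the stand-in suggested here.

```python
# Sketch: reference intervals over acoustic/linguistic features and
# per-speaker deviation scores. Feature names and values are hypothetical.
import numpy as np

def reference_intervals(ref_features, low=2.5, high=97.5):
    """ref_features: (n_speakers, n_features) from a healthy reference population."""
    lo = np.percentile(ref_features, low, axis=0)
    hi = np.percentile(ref_features, high, axis=0)
    return lo, hi

def deviation(x, lo, hi):
    """How far each feature of one speaker falls outside its reference interval
    (0 inside the interval, positive outside, scaled by interval width)."""
    width = np.maximum(hi - lo, 1e-8)
    below = np.maximum(lo - x, 0.0) / width
    above = np.maximum(x - hi, 0.0) / width
    return below + above

# Example with made-up numbers: 500 reference speakers, 3 features
# (e.g., speaking rate, pause ratio, F0 variability).
rng = np.random.default_rng(0)
ref = rng.normal(loc=[4.5, 0.2, 25.0], scale=[0.6, 0.05, 5.0], size=(500, 3))
lo, hi = reference_intervals(ref)
new_speaker = np.array([3.1, 0.38, 24.0])     # hypothetical patient
print("deviation vector:", deviation(new_speaker, lo, hi))
# The deviation vector, not the raw features, would be the classifier input.
```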
Citations: 0
Machine listening in a neonatal intensive care unit
Pub Date: 2024-09-16 | arXiv:2409.11439
Modan Tailleur (LS2N, Nantes Univ - ECN, LS2N - équipe SIMS), Vincent Lostanlen (LS2N, LS2N - équipe SIMS, Nantes Univ - ECN), Jean-Philippe Rivière (Nantes Univ, Nantes Univ - UFR FLCE, LS2N, LS2N - équipe PACCE), Pierre Aumond
Oxygenators, alarm devices, and footsteps are some of the most common sound sources in a hospital. Detecting them has scientific value for environmental psychology but comes with challenges of its own: namely, privacy preservation and limited labeled data. In this paper, we address these two challenges via a combination of edge computing and cloud computing. For privacy preservation, we have designed an acoustic sensor which computes third-octave spectrograms on the fly instead of recording audio waveforms. For sample-efficient machine learning, we have repurposed a pretrained audio neural network (PANN) via spectral transcoding and label space adaptation. A small-scale study in a neonatological intensive care unit (NICU) confirms that the time series of detected events align with another modality of measurement: i.e., electronic badges for parents and healthcare professionals. Hence, this paper demonstrates the feasibility of polyphonic machine listening in a hospital ward while guaranteeing privacy by design.
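As a rough idea of the privacy-preserving front end, the sketch below reduces an STFT to third-octave band energies, so only coarse spectral summaries rather than invertible waveforms would leave the sensor. The band definitions follow the usual base-2 third-octave convention; the sample rate, frame size, and frequency range are assumptions, not the sensor's actual specification.

```python
# Sketch: third-octave band energies from an STFT, as a coarse, hard-to-invert
# alternative to storing waveforms. Parameters are illustrative.
import numpy as np
from scipy.signal import stft

def third_octave_spectrogram(x, sr=32000, f_ref=1000.0, n_bands_each_side=10):
    f, t, Z = stft(x, fs=sr, nperseg=4096, noverlap=2048)
    power = np.abs(Z) ** 2                                 # (freq_bins, frames)
    # Center frequencies f_ref * 2^(k/3), spanning roughly 100 Hz to 10 kHz here.
    ks = np.arange(-n_bands_each_side, n_bands_each_side + 1)
    centers = f_ref * 2.0 ** (ks / 3.0)
    bands = []
    for fc in centers:
        lo, hi = fc * 2 ** (-1 / 6), fc * 2 ** (1 / 6)     # third-octave band edges
        idx = (f >= lo) & (f < hi)
        bands.append(power[idx].sum(axis=0) if idx.any() else np.zeros(power.shape[1]))
    return centers, 10 * np.log10(np.stack(bands) + 1e-12)  # dB per band per frame

# Example on synthetic audio: one second of noise at 32 kHz.
x = np.random.default_rng(0).standard_normal(32000)
centers, tob = third_octave_spectrogram(x)
print(tob.shape)  # (n_bands, n_frames)
```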
Citations: 0
Towards Automatic Assessment of Self-Supervised Speech Models using Rank
Pub Date: 2024-09-16 | arXiv:2409.10787
Zakaria Aldeneh, Vimal Thilak, Takuya Higuchi, Barry-John Theobald, Tatiana Likhomanenko
This study explores using embedding rank as an unsupervised evaluation metric for general-purpose speech encoders trained via self-supervised learning (SSL). Traditionally, assessing the performance of these encoders is resource-intensive and requires labeled data from the downstream tasks. Inspired by the vision domain, where embedding rank has shown promise for evaluating image encoders without tuning on labeled downstream data, this work examines its applicability in the speech domain, considering the temporal nature of the signals. The findings indicate rank correlates with downstream performance within encoder layers across various downstream tasks and for in- and out-of-domain scenarios. However, rank does not reliably predict the best-performing layer for specific downstream tasks, as lower-ranked layers can outperform higher-ranked ones. Despite this limitation, the results suggest that embedding rank can be a valuable tool for monitoring training progress in SSL speech models, offering a less resource-demanding alternative to traditional evaluation methods.
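One common way to compute an embedding "rank" in this spirit is the effective-rank estimator used in the vision literature (RankMe): take the singular values of an embedding matrix, normalize them into a distribution, and exponentiate its entropy. The sketch below implements that estimator on placeholder features; whether this exact estimator matches the paper's definition is an assumption.

```python
# Sketch: RankMe-style effective rank of an embedding matrix (N frames x D dims).
# A higher value means the embeddings spread energy over more directions.
import numpy as np

def effective_rank(embeddings, eps=1e-12):
    s = np.linalg.svd(embeddings, compute_uv=False)   # singular values
    p = s / (s.sum() + eps)                           # normalize to a distribution
    entropy = -(p * np.log(p + eps)).sum()
    return float(np.exp(entropy))                     # exp(Shannon entropy)

# Placeholder "layer outputs": 2000 frames of 256-dim features.
rng = np.random.default_rng(0)
full_rank = rng.standard_normal((2000, 256))
low_rank = full_rank @ (rng.standard_normal((256, 16)) @ rng.standard_normal((16, 256)))
print("full-rank features:", effective_rank(full_rank))   # close to 256
print("low-rank features :", effective_rank(low_rank))    # much smaller
```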
Citations: 0
Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers
Pub Date: 2024-09-16 | arXiv:2409.10687
Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa
Emotions are an essential element in verbal communication, so understanding individuals' affect during a human-robot interaction (HRI) becomes imperative. This paper investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models for individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from different human subjects having pseudo-naturalistic conversations with the NAO robot. We then fine-tuned our ViT and BEiT-based models and tested these models on unseen speech samples from the participants. In the results, we show that fine-tuning vision transformers on benchmark datasets and then using either these already fine-tuned models or ensembling ViT/BEiT models gets us the highest classification accuracies per individual when it comes to identifying four primary emotions from their speech: neutral, happy, sad, and angry, as compared to fine-tuning vanilla ViTs or BEiTs.
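A minimal sketch of the general recipe, treating a log-mel spectrogram as an image and fine-tuning a pretrained vision transformer with a new classification head for the four emotions, is shown below. The resizing, normalization, and training-loop details are illustrative assumptions, not the paper's exact pipeline.

```python
# Sketch: spectrogram-as-image fine-tuning of a pretrained ViT for 4-class SER.
import torch
import torch.nn as nn
import torchvision

NUM_EMOTIONS = 4  # neutral, happy, sad, angry

# Downloads ImageNet weights on first use; pass weights=None to skip.
model = torchvision.models.vit_b_16(weights=torchvision.models.ViT_B_16_Weights.IMAGENET1K_V1)
model.heads.head = nn.Linear(model.heads.head.in_features, NUM_EMOTIONS)  # new head

def spectrogram_to_image(mel_db):
    """mel_db: (n_mels, n_frames) log-mel spectrogram -> (3, 224, 224) tensor."""
    x = torch.as_tensor(mel_db, dtype=torch.float32)[None, None]   # (1, 1, H, W)
    x = nn.functional.interpolate(x, size=(224, 224), mode="bilinear", align_corners=False)
    x = (x - x.mean()) / (x.std() + 1e-6)                          # crude normalization
    return x.repeat(1, 3, 1, 1).squeeze(0)                         # replicate to 3 channels

# One illustrative optimization step on a random "spectrogram" batch.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
batch = torch.stack([spectrogram_to_image(torch.randn(128, 300)) for _ in range(2)])
labels = torch.tensor([0, 2])
loss = nn.functional.cross_entropy(model(batch), labels)
loss.backward()
optimizer.step()
print("step loss:", float(loss))
```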
Citations: 0
Self-supervised Speech Models for Word-Level Stuttered Speech Detection
Pub Date: 2024-09-16 | arXiv:2409.10704
Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G. Dimakis, David Harwath
Clinical diagnosis of stuttering requires an assessment by a licensed speech-language pathologist. However, this process is time-consuming and requires clinicians with training and experience in stuttering and fluency disorders. Unfortunately, only a small percentage of speech-language pathologists report being comfortable working with individuals who stutter, which is inadequate to accommodate for the 80 million individuals who stutter worldwide. Developing machine learning models for detecting stuttered speech would enable universal and automated screening for stuttering, enabling speech pathologists to identify and follow up with patients who are most likely to be diagnosed with a stuttering speech disorder. Previous research in this area has predominantly focused on utterance-level detection, which is not sufficient for clinical settings where word-level annotation of stuttering is the norm. In this study, we curated a stuttered speech dataset with word-level annotations and introduced a word-level stuttering speech detection model leveraging self-supervised speech models. Our evaluation demonstrates that our model surpasses previous approaches in word-level stuttering speech detection. Additionally, we conducted an extensive ablation analysis of our method, providing insight into the most important aspects of adapting self-supervised speech models for stuttered speech detection.
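One straightforward way to go from frame-level SSL features to word-level decisions, sketched below, is to mean-pool the frames that fall inside each word's time span (e.g., from a forced alignment) and classify the pooled vector. The frame rate, feature dimension, and classifier head are placeholder assumptions; the paper's specific SSL backbone and adaptation strategy are not reproduced here.

```python
# Sketch: word-level stuttering detection by pooling frame-level SSL features
# over word boundaries. Frame rate, dimensions, and the head are illustrative.
import torch
import torch.nn as nn

FRAME_RATE = 50          # frames per second (typical for SSL speech encoders)
FEAT_DIM = 768

def pool_word_features(frame_feats, word_spans):
    """frame_feats: (T, D) features for one utterance.
    word_spans: list of (start_sec, end_sec) per word."""
    pooled = []
    for start, end in word_spans:
        i = int(start * FRAME_RATE)
        j = max(int(end * FRAME_RATE), i + 1)          # at least one frame per word
        pooled.append(frame_feats[i:j].mean(dim=0))
    return torch.stack(pooled)                          # (n_words, D)

classifier = nn.Sequential(nn.Linear(FEAT_DIM, 256), nn.ReLU(), nn.Linear(256, 2))

# Illustrative forward pass: 4-second utterance, 3 aligned words.
frame_feats = torch.randn(4 * FRAME_RATE, FEAT_DIM)     # stand-in for SSL outputs
word_spans = [(0.2, 0.8), (0.9, 1.7), (1.9, 3.5)]
logits = classifier(pool_word_features(frame_feats, word_spans))
print(logits.shape)   # (3, 2): per-word fluent vs. stuttered scores
```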
Citations: 0
Ultra-Low Latency Speech Enhancement - A Comprehensive Study
Pub Date: 2024-09-16 | arXiv:2409.10358
Haibin Wu, Sebastian Braun
Speech enhancement models should meet very low latency requirements, typically smaller than 5 ms, for hearing assistive devices. While various low-latency techniques have been proposed, a comparison of these methods in a controlled setup using DNNs is still missing. Previous papers have variations in task, training data, scripts, and evaluation settings, which make fair comparison impossible. Moreover, all methods are tested on small, simulated datasets, making it difficult to fairly assess their performance in real-world conditions, which could impact the reliability of scientific findings. To address these issues, we comprehensively investigate various low-latency techniques using consistent training on large-scale data and evaluate with more relevant metrics on real-world data. Specifically, we explore the effectiveness of asymmetric windows, learnable windows, adaptive time domain filterbanks, and the future-frame prediction technique. Additionally, we examine whether increasing the model size can compensate for the reduced window size, as well as the novel Mamba architecture in low-latency environments.
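As a back-of-the-envelope illustration of why asymmetric windows matter for a sub-5 ms budget, the sketch below counts only the synthesis-window duration as algorithmic latency, a simplification that ignores hop alignment and device buffering; the 32 ms / 4 ms pairing is an assumed example, not the paper's configuration.

```python
# Sketch: simplified algorithmic-latency accounting for STFT-based enhancement.
SR = 16000  # Hz

def latency_ms(synthesis_window_samples, sr=SR):
    return 1000.0 * synthesis_window_samples / sr

# Symmetric windows: analysis = synthesis = 32 ms, far above a 5 ms budget.
print("symmetric 32 ms window :", latency_ms(512), "ms")
# Asymmetric windows: a long 32 ms analysis window keeps frequency resolution,
# while a short 4 ms synthesis window means only ~4 ms of output is buffered.
print("asymmetric 32/4 ms pair:", latency_ms(64), "ms")
```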
Citations: 0
Mitigating Sex Bias in Audio Data-driven COPD and COVID-19 Breathing Pattern Detection Models
Pub Date: 2024-09-16 | arXiv:2409.10677
Rachel Pfeifer, Sudip Vhaduri, James Eric Dietz
In the healthcare industry, researchers have been developing machine learning models to automate diagnosing patients with respiratory illnesses based on their breathing patterns. However, these models do not consider the demographic biases, particularly sex bias, that often occur when models are trained with a skewed patient dataset. Hence, it is essential in such an important industry to reduce this bias so that models can make fair diagnoses. In this work, we examine the bias in models used to detect breathing patterns of two major respiratory diseases, i.e., chronic obstructive pulmonary disease (COPD) and COVID-19. Using decision tree models trained with audio recordings of breathing patterns obtained from two open-source datasets consisting of 29 COPD and 680 COVID-19-positive patients, we analyze the effect of sex bias on the models. With a threshold optimizer and two constraints (demographic parity and equalized odds) to mitigate the bias, we witness 81.43% (demographic parity difference) and 71.81% (equalized odds difference) improvements. These findings are statistically significant.
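The post-processing step described above, re-thresholding a trained classifier per group to satisfy a fairness constraint, can be sketched with the fairlearn library's ThresholdOptimizer. The synthetic data and features below are placeholders, and whether this matches the authors' exact implementation is an assumption.

```python
# Sketch: mitigating group bias with post-hoc threshold optimization (fairlearn).
# Data here is synthetic; features, group labels, and sizes are placeholders.
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from fairlearn.postprocessing import ThresholdOptimizer
from fairlearn.metrics import demographic_parity_difference

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 8))                       # stand-in breathing features
sex = rng.integers(0, 2, size=n)                  # sensitive attribute (0/1)
y = (X[:, 0] + 0.5 * sex + rng.normal(scale=0.5, size=n) > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=5, random_state=0).fit(X, y)
baseline = clf.predict(X)

mitigator = ThresholdOptimizer(
    estimator=clf,
    constraints="demographic_parity",             # or "equalized_odds"
    objective="accuracy_score",
    prefit=True,
    predict_method="predict_proba",
)
mitigator.fit(X, y, sensitive_features=sex)
fair_pred = mitigator.predict(X, sensitive_features=sex, random_state=0)

print("DP difference before:", demographic_parity_difference(y, baseline, sensitive_features=sex))
print("DP difference after :", demographic_parity_difference(y, fair_pred, sensitive_features=sex))
```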
Citations: 0