Text-To-Music (TTM) models have recently revolutionized the automatic music generation research field, both by outperforming all previous state-of-the-art models and by lowering the technical proficiency needed to use them. For these reasons, they have quickly been adopted in commercial applications and music production practices. This widespread diffusion of TTMs raises several concerns regarding copyright violation and rightful attribution, which the audio forensics community needs to take seriously. In this paper, we tackle the problem of detection and attribution of TTM-generated data. We propose FakeMusicCaps, a dataset containing several versions of the music-caption pairs dataset MusicCaps, re-generated with state-of-the-art TTM techniques. We evaluate the proposed dataset through initial experiments on the detection and attribution of TTM-generated audio.
{"title":"FakeMusicCaps: a Dataset for Detection and Attribution of Synthetic Music Generated via Text-to-Music Models","authors":"Luca Comanducci, Paolo Bestagini, Stefano Tubaro","doi":"arxiv-2409.10684","DOIUrl":"https://doi.org/arxiv-2409.10684","url":null,"abstract":"Text-To-Music (TTM) models have recently revolutionized the automatic music\u0000generation research field. Specifically, by reaching superior performances to\u0000all previous state-of-the-art models and by lowering the technical proficiency\u0000needed to use them. Due to these reasons, they have readily started to be\u0000adopted for commercial uses and music production practices. This widespread\u0000diffusion of TTMs poses several concerns regarding copyright violation and\u0000rightful attribution, posing the need of serious consideration of them by the\u0000audio forensics community. In this paper, we tackle the problem of detection\u0000and attribution of TTM-generated data. We propose a dataset, FakeMusicCaps that\u0000contains several versions of the music-caption pairs dataset MusicCaps\u0000re-generated via several state-of-the-art TTM techniques. We evaluate the\u0000proposed dataset by performing initial experiments regarding the detection and\u0000attribution of TTM-generated audio.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald
Iterative self-training, or iterative pseudo-labeling (IPL), in which an improved model from the current iteration provides pseudo-labels for the next iteration, has proven to be a powerful approach for enhancing the quality of speaker representations. Recent applications of IPL in unsupervised speaker recognition start from representations extracted with elaborate self-supervised methods (e.g., DINO). However, training such strong self-supervised models is not straightforward (they require hyper-parameter tuning and may not generalize to out-of-domain data) and, moreover, may not be needed at all. To this end, we show that the simple, well-studied, and established i-vector generative model is enough to bootstrap the IPL process for unsupervised learning of speaker representations. We also systematically study the impact of other components of the IPL process, including the initial model, the encoder, augmentations, the number of clusters, and the clustering algorithm. Remarkably, we find that even with a simple and significantly weaker initial model like i-vector, IPL can still achieve speaker verification performance that rivals state-of-the-art methods.
{"title":"Speaker-IPL: Unsupervised Learning of Speaker Characteristics with i-Vector based Pseudo-Labels","authors":"Zakaria Aldeneh, Takuya Higuchi, Jee-weon Jung, Li-Wei Chen, Stephen Shum, Ahmed Hussen Abdelaziz, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald","doi":"arxiv-2409.10791","DOIUrl":"https://doi.org/arxiv-2409.10791","url":null,"abstract":"Iterative self-training, or iterative pseudo-labeling (IPL)--using an\u0000improved model from the current iteration to provide pseudo-labels for the next\u0000iteration--has proven to be a powerful approach to enhance the quality of\u0000speaker representations. Recent applications of IPL in unsupervised speaker\u0000recognition start with representations extracted from very elaborate\u0000self-supervised methods (e.g., DINO). However, training such strong\u0000self-supervised models is not straightforward (they require hyper-parameters\u0000tuning and may not generalize to out-of-domain data) and, moreover, may not be\u0000needed at all. To this end, we show the simple, well-studied, and established\u0000i-vector generative model is enough to bootstrap the IPL process for\u0000unsupervised learning of speaker representations. We also systematically study\u0000the impact of other components on the IPL process, which includes the initial\u0000model, the encoder, augmentations, the number of clusters, and the clustering\u0000algorithm. Remarkably, we find that even with a simple and significantly weaker\u0000initial model like i-vector, IPL can still achieve speaker verification\u0000performance that rivals state-of-the-art methods.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Li-Wei Chen, Takuya Higuchi, He Bai, Ahmed Hussen Abdelaziz, Alexander Rudnicky, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald, Zakaria Aldeneh
Speech foundation models, such as HuBERT and its variants, are pre-trained on large amounts of unlabeled speech for various downstream tasks. These models use a masked prediction objective, where the model learns to predict information about masked input segments from the unmasked context. The choice of prediction targets in this framework can influence performance on downstream tasks. For example, targets that encode prosody are beneficial for speaker-related tasks, while targets that encode phonetics are more suited for content-related tasks. Additionally, prediction targets can vary in the level of detail they encode; targets that encode fine-grained acoustic details are beneficial for denoising tasks, while targets that encode higher-level abstractions are more suited for content-related tasks. Despite the importance of prediction targets, the design choices that affect them have not been thoroughly studied. This work explores the design choices and their impact on downstream task performance. Our results indicate that the commonly used design choices for HuBERT can be suboptimal. We propose novel approaches to create more informative prediction targets and demonstrate their effectiveness through improvements across various downstream tasks.
{"title":"Exploring Prediction Targets in Masked Pre-Training for Speech Foundation Models","authors":"Li-Wei Chen, Takuya Higuchi, He Bai, Ahmed Hussen Abdelaziz, Alexander Rudnicky, Shinji Watanabe, Tatiana Likhomanenko, Barry-John Theobald, Zakaria Aldeneh","doi":"arxiv-2409.10788","DOIUrl":"https://doi.org/arxiv-2409.10788","url":null,"abstract":"Speech foundation models, such as HuBERT and its variants, are pre-trained on\u0000large amounts of unlabeled speech for various downstream tasks. These models\u0000use a masked prediction objective, where the model learns to predict\u0000information about masked input segments from the unmasked context. The choice\u0000of prediction targets in this framework can influence performance on downstream\u0000tasks. For example, targets that encode prosody are beneficial for\u0000speaker-related tasks, while targets that encode phonetics are more suited for\u0000content-related tasks. Additionally, prediction targets can vary in the level\u0000of detail they encode; targets that encode fine-grained acoustic details are\u0000beneficial for denoising tasks, while targets that encode higher-level\u0000abstractions are more suited for content-related tasks. Despite the importance\u0000of prediction targets, the design choices that affect them have not been\u0000thoroughly studied. This work explores the design choices and their impact on\u0000downstream task performance. Our results indicate that the commonly used design\u0000choices for HuBERT can be suboptimal. We propose novel approaches to create\u0000more informative prediction targets and demonstrate their effectiveness through\u0000improvements across various downstream tasks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265622","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso
Speech is a rich biomarker that encodes substantial information about the health of a speaker, and thus it has been proposed for the detection of numerous diseases, achieving promising results. However, questions remain about what the models trained for the automatic detection of these diseases are actually learning and the basis for their predictions, which can significantly impact patients' lives. This work advocates for an interpretable health model, suitable for detecting several diseases, motivated by the observation that speech-affecting disorders often have overlapping effects on speech signals. A framework is presented that first defines "reference speech" and then leverages this definition for disease detection. Reference speech is characterized through reference intervals, i.e., the typical values of clinically meaningful acoustic and linguistic features derived from a reference population. This novel approach in the field of speech as a biomarker is inspired by the use of reference intervals in clinical laboratory science. Deviations of new speakers from this reference model are quantified and used as input to detect Alzheimer's and Parkinson's disease. The classification strategy explored is based on Neural Additive Models, a type of glass-box neural network, which enables interpretability. The proposed framework for reference speech characterization and disease detection is designed to support the medical community by providing clinically meaningful explanations that can serve as a valuable second opinion.
{"title":"Speech as a Biomarker for Disease Detection","authors":"Catarina Botelho, Alberto Abad, Tanja Schultz, Isabel Trancoso","doi":"arxiv-2409.10230","DOIUrl":"https://doi.org/arxiv-2409.10230","url":null,"abstract":"Speech is a rich biomarker that encodes substantial information about the\u0000health of a speaker, and thus it has been proposed for the detection of\u0000numerous diseases, achieving promising results. However, questions remain about\u0000what the models trained for the automatic detection of these diseases are\u0000actually learning and the basis for their predictions, which can significantly\u0000impact patients' lives. This work advocates for an interpretable health model,\u0000suitable for detecting several diseases, motivated by the observation that\u0000speech-affecting disorders often have overlapping effects on speech signals. A\u0000framework is presented that first defines \"reference speech\" and then leverages\u0000this definition for disease detection. Reference speech is characterized\u0000through reference intervals, i.e., the typical values of clinically meaningful\u0000acoustic and linguistic features derived from a reference population. This\u0000novel approach in the field of speech as a biomarker is inspired by the use of\u0000reference intervals in clinical laboratory science. Deviations of new speakers\u0000from this reference model are quantified and used as input to detect\u0000Alzheimer's and Parkinson's disease. The classification strategy explored is\u0000based on Neural Additive Models, a type of glass-box neural network, which\u0000enables interpretability. The proposed framework for reference speech\u0000characterization and disease detection is designed to support the medical\u0000community by providing clinically meaningful explanations that can serve as a\u0000valuable second opinion.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"9 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Oxygenators, alarm devices, and footsteps are some of the most common sound sources in a hospital. Detecting them has scientific value for environmental psychology but comes with challenges of its own: namely, privacy preservation and limited labeled data. In this paper, we address these two challenges via a combination of edge computing and cloud computing. For privacy preservation, we have designed an acoustic sensor which computes third-octave spectrograms on the fly instead of recording audio waveforms. For sample-efficient machine learning, we have repurposed a pretrained audio neural network (PANN) via spectral transcoding and label space adaptation. A small-scale study in a neonatal intensive care unit (NICU) confirms that the time series of detected events align with another modality of measurement: i.e., electronic badges for parents and healthcare professionals. Hence, this paper demonstrates the feasibility of polyphonic machine listening in a hospital ward while guaranteeing privacy by design.
{"title":"Machine listening in a neonatal intensive care unit","authors":"Modan TailleurLS2N, Nantes Univ - ECN, LS2N - équipe SIMS, Vincent LostanlenLS2N, LS2N - équipe SIMS, Nantes Univ - ECN, Jean-Philippe RivièreNantes Univ, Nantes Univ - UFR FLCE, LS2N, LS2N - équipe PACCE, Pierre Aumond","doi":"arxiv-2409.11439","DOIUrl":"https://doi.org/arxiv-2409.11439","url":null,"abstract":"Oxygenators, alarm devices, and footsteps are some of the most common sound\u0000sources in a hospital. Detecting them has scientific value for environmental\u0000psychology but comes with challenges of its own: namely, privacy preservation\u0000and limited labeled data. In this paper, we address these two challenges via a\u0000combination of edge computing and cloud computing. For privacy preservation, we\u0000have designed an acoustic sensor which computes third-octave spectrograms on\u0000the fly instead of recording audio waveforms. For sample-efficient machine\u0000learning, we have repurposed a pretrained audio neural network (PANN) via\u0000spectral transcoding and label space adaptation. A small-scale study in a\u0000neonatological intensive care unit (NICU) confirms that the time series of\u0000detected events align with another modality of measurement: i.e., electronic\u0000badges for parents and healthcare professionals. Hence, this paper demonstrates\u0000the feasibility of polyphonic machine listening in a hospital ward while\u0000guaranteeing privacy by design.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"55 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This study explores using embedding rank as an unsupervised evaluation metric for general-purpose speech encoders trained via self-supervised learning (SSL). Traditionally, assessing the performance of these encoders is resource-intensive and requires labeled data from the downstream tasks. Inspired by the vision domain, where embedding rank has shown promise for evaluating image encoders without tuning on labeled downstream data, this work examines its applicability in the speech domain, considering the temporal nature of the signals. The findings indicate rank correlates with downstream performance within encoder layers across various downstream tasks and for in- and out-of-domain scenarios. However, rank does not reliably predict the best-performing layer for specific downstream tasks, as lower-ranked layers can outperform higher-ranked ones. Despite this limitation, the results suggest that embedding rank can be a valuable tool for monitoring training progress in SSL speech models, offering a less resource-demanding alternative to traditional evaluation methods.
{"title":"Towards Automatic Assessment of Self-Supervised Speech Models using Rank","authors":"Zakaria Aldeneh, Vimal Thilak, Takuya Higuchi, Barry-John Theobald, Tatiana Likhomanenko","doi":"arxiv-2409.10787","DOIUrl":"https://doi.org/arxiv-2409.10787","url":null,"abstract":"This study explores using embedding rank as an unsupervised evaluation metric\u0000for general-purpose speech encoders trained via self-supervised learning (SSL).\u0000Traditionally, assessing the performance of these encoders is\u0000resource-intensive and requires labeled data from the downstream tasks.\u0000Inspired by the vision domain, where embedding rank has shown promise for\u0000evaluating image encoders without tuning on labeled downstream data, this work\u0000examines its applicability in the speech domain, considering the temporal\u0000nature of the signals. The findings indicate rank correlates with downstream\u0000performance within encoder layers across various downstream tasks and for in-\u0000and out-of-domain scenarios. However, rank does not reliably predict the\u0000best-performing layer for specific downstream tasks, as lower-ranked layers can\u0000outperform higher-ranked ones. Despite this limitation, the results suggest\u0000that embedding rank can be a valuable tool for monitoring training progress in\u0000SSL speech models, offering a less resource-demanding alternative to\u0000traditional evaluation methods.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"18 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa
Emotions are an essential element of verbal communication, so understanding individuals' affect during a human-robot interaction (HRI) becomes imperative. This paper investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers) pipelines, to Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models to individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from different human subjects having pseudo-naturalistic conversations with the NAO robot. We then fine-tuned our ViT- and BEiT-based models and tested them on unseen speech samples from the participants. The results show that fine-tuning vision transformers on benchmark datasets, and then using either these fine-tuned models directly or ensembling the ViT/BEiT models, yields the highest per-individual classification accuracy for identifying four primary emotions from speech (neutral, happy, sad, and angry), compared to fine-tuning vanilla ViTs or BEiTs.
{"title":"Personalized Speech Emotion Recognition in Human-Robot Interaction using Vision Transformers","authors":"Ruchik Mishra, Andrew Frye, Madan Mohan Rayguru, Dan O. Popa","doi":"arxiv-2409.10687","DOIUrl":"https://doi.org/arxiv-2409.10687","url":null,"abstract":"Emotions are an essential element in verbal communication, so understanding\u0000individuals' affect during a human-robot interaction (HRI) becomes imperative.\u0000This paper investigates the application of vision transformer models, namely\u0000ViT (Vision Transformers) and BEiT (BERT Pre-Training of Image Transformers)\u0000pipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to\u0000generalize the SER models for individual speech characteristics by fine-tuning\u0000these models on benchmark datasets and exploiting ensemble methods. For this\u0000purpose, we collected audio data from different human subjects having\u0000pseudo-naturalistic conversations with the NAO robot. We then fine-tuned our\u0000ViT and BEiT-based models and tested these models on unseen speech samples from\u0000the participants. In the results, we show that fine-tuning vision transformers\u0000on benchmark datasets and and then using either these already fine-tuned models\u0000or ensembling ViT/BEiT models gets us the highest classification accuracies per\u0000individual when it comes to identifying four primary emotions from their\u0000speech: neutral, happy, sad, and angry, as compared to fine-tuning vanilla-ViTs\u0000or BEiTs.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265626","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G. Dimakis, David Harwath
Clinical diagnosis of stuttering requires an assessment by a licensed speech-language pathologist. However, this process is time-consuming and requires clinicians with training and experience in stuttering and fluency disorders. Unfortunately, only a small percentage of speech-language pathologists report being comfortable working with individuals who stutter, which is inadequate to accommodate the 80 million individuals who stutter worldwide. Developing machine learning models for detecting stuttered speech would enable universal and automated screening for stuttering, allowing speech pathologists to identify and follow up with the patients who are most likely to be diagnosed with a stuttering speech disorder. Previous research in this area has predominantly focused on utterance-level detection, which is not sufficient for clinical settings where word-level annotation of stuttering is the norm. In this study, we curated a stuttered speech dataset with word-level annotations and introduced a word-level stuttering speech detection model leveraging self-supervised speech models. Our evaluation demonstrates that our model surpasses previous approaches in word-level stuttering speech detection. Additionally, we conducted an extensive ablation analysis of our method, providing insight into the most important aspects of adapting self-supervised speech models for stuttered speech detection.
{"title":"Self-supervised Speech Models for Word-Level Stuttered Speech Detection","authors":"Yi-Jen Shih, Zoi Gkalitsiou, Alexandros G. Dimakis, David Harwath","doi":"arxiv-2409.10704","DOIUrl":"https://doi.org/arxiv-2409.10704","url":null,"abstract":"Clinical diagnosis of stuttering requires an assessment by a licensed\u0000speech-language pathologist. However, this process is time-consuming and\u0000requires clinicians with training and experience in stuttering and fluency\u0000disorders. Unfortunately, only a small percentage of speech-language\u0000pathologists report being comfortable working with individuals who stutter,\u0000which is inadequate to accommodate for the 80 million individuals who stutter\u0000worldwide. Developing machine learning models for detecting stuttered speech\u0000would enable universal and automated screening for stuttering, enabling speech\u0000pathologists to identify and follow up with patients who are most likely to be\u0000diagnosed with a stuttering speech disorder. Previous research in this area has\u0000predominantly focused on utterance-level detection, which is not sufficient for\u0000clinical settings where word-level annotation of stuttering is the norm. In\u0000this study, we curated a stuttered speech dataset with word-level annotations\u0000and introduced a word-level stuttering speech detection model leveraging\u0000self-supervised speech models. Our evaluation demonstrates that our model\u0000surpasses previous approaches in word-level stuttering speech detection.\u0000Additionally, we conducted an extensive ablation analysis of our method,\u0000providing insight into the most important aspects of adapting self-supervised\u0000speech models for stuttered speech detection.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269675","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech enhancement models for hearing assistive devices must meet very low latency requirements, typically below 5 ms. While various low-latency techniques have been proposed, a comparison of these methods in a controlled setup using DNNs has been missing. Previous papers differ in task, training data, scripts, and evaluation settings, which makes fair comparison impossible. Moreover, all methods are tested on small, simulated datasets, making it difficult to fairly assess their performance in real-world conditions, which could impact the reliability of scientific findings. To address these issues, we comprehensively investigate various low-latency techniques using consistent training on large-scale data and evaluate them with more relevant metrics on real-world data. Specifically, we explore the effectiveness of asymmetric windows, learnable windows, adaptive time-domain filterbanks, and the future-frame prediction technique. Additionally, we examine whether increasing the model size can compensate for the reduced window size, and we evaluate the novel Mamba architecture in low-latency environments.
{"title":"Ultra-Low Latency Speech Enhancement - A Comprehensive Study","authors":"Haibin Wu, Sebastian Braun","doi":"arxiv-2409.10358","DOIUrl":"https://doi.org/arxiv-2409.10358","url":null,"abstract":"Speech enhancement models should meet very low latency requirements typically\u0000smaller than 5 ms for hearing assistive devices. While various low-latency\u0000techniques have been proposed, comparing these methods in a controlled setup\u0000using DNNs remains blank. Previous papers have variations in task, training\u0000data, scripts, and evaluation settings, which make fair comparison impossible.\u0000Moreover, all methods are tested on small, simulated datasets, making it\u0000difficult to fairly assess their performance in real-world conditions, which\u0000could impact the reliability of scientific findings. To address these issues,\u0000we comprehensively investigate various low-latency techniques using consistent\u0000training on large-scale data and evaluate with more relevant metrics on\u0000real-world data. Specifically, we explore the effectiveness of asymmetric\u0000windows, learnable windows, adaptive time domain filterbanks, and the\u0000future-frame prediction technique. Additionally, we examine whether increasing\u0000the model size can compensate for the reduced window size, as well as the novel\u0000Mamba architecture in low-latency environments.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the healthcare industry, researchers have been developing machine learning models to automate the diagnosis of patients with respiratory illnesses based on their breathing patterns. However, these models do not consider the demographic biases, particularly sex bias, that often arise when models are trained on a skewed patient dataset. Hence, it is essential in such a critical domain to reduce this bias so that models can make fair diagnoses. In this work, we examine the bias in models used to detect the breathing patterns of two major respiratory diseases, chronic obstructive pulmonary disease (COPD) and COVID-19. Using decision tree models trained on audio recordings of breathing patterns from two open-source datasets consisting of 29 COPD and 680 COVID-19-positive patients, we analyze the effect of sex bias on the models. With a threshold optimizer and two constraints (demographic parity and equalized odds) to mitigate the bias, we observe statistically significant improvements of 81.43% in demographic parity difference and 71.81% in equalized odds difference.
{"title":"Mitigating Sex Bias in Audio Data-driven COPD and COVID-19 Breathing Pattern Detection Models","authors":"Rachel Pfeifer, Sudip Vhaduri, James Eric Dietz","doi":"arxiv-2409.10677","DOIUrl":"https://doi.org/arxiv-2409.10677","url":null,"abstract":"In the healthcare industry, researchers have been developing machine learning\u0000models to automate diagnosing patients with respiratory illnesses based on\u0000their breathing patterns. However, these models do not consider the demographic\u0000biases, particularly sex bias, that often occur when models are trained with a\u0000skewed patient dataset. Hence, it is essential in such an important industry to\u0000reduce this bias so that models can make fair diagnoses. In this work, we\u0000examine the bias in models used to detect breathing patterns of two major\u0000respiratory diseases, i.e., chronic obstructive pulmonary disease (COPD) and\u0000COVID-19. Using decision tree models trained with audio recordings of breathing\u0000patterns obtained from two open-source datasets consisting of 29 COPD and 680\u0000COVID-19-positive patients, we analyze the effect of sex bias on the models.\u0000With a threshold optimizer and two constraints (demographic parity and\u0000equalized odds) to mitigate the bias, we witness 81.43% (demographic parity\u0000difference) and 71.81% (equalized odds difference) improvements. These findings\u0000are statistically significant.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}