Manasi Chhibber, Jagabandhu Mishra, Hyejin Shim, Tomi H. Kinnunen
We propose a novel approach for spoofed speech characterization through explainable probabilistic attribute embeddings. In contrast to high-dimensional raw embeddings extracted from a spoofing countermeasure (CM), whose dimensions are not easy to interpret, the probabilistic attributes are designed to gauge the presence or absence of the sub-components that make up a specific spoofing attack. These attributes are then applied to two downstream tasks: spoofing detection and attack attribution. To extend interpretability to the back-end as well, we adopt a decision tree classifier. Our experiments on the ASVspoof2019 dataset with spoof CM embeddings extracted from three models (AASIST, Rawboost-AASIST, SSL-AASIST) suggest that the performance of the attribute embeddings is on par with that of the original raw spoof CM embeddings for both tasks. The best accuracy achieved with the proposed approach is 99.7% for spoofing detection and 99.2% for attack attribution, compared to 99.7% and 94.7%, respectively, using the raw CM embeddings. To analyze the relative contribution of each attribute, we estimate their Shapley values. Attributes related to acoustic feature prediction, waveform generation (vocoder), and speaker modeling are found to be important for spoofing detection, while duration modeling, vocoder, and input type play a role in spoofing attack attribution.
{"title":"An Explainable Probabilistic Attribute Embedding Approach for Spoofed Speech Characterization","authors":"Manasi Chhibber, Jagabandhu Mishra, Hyejin Shim, Tomi H. Kinnunen","doi":"arxiv-2409.11027","DOIUrl":"https://doi.org/arxiv-2409.11027","journal":{"name":"arXiv - EE - Audio and Speech Processing"},"publicationDate":"2024-09-17"}
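The abstract above reports Shapley values for each attribute. As a rough illustration of what a Shapley value measures, the sketch below computes exact Shapley values by enumerating all subsets for a toy value function. This is not the authors' estimation procedure: the attribute names are borrowed from the abstract, while the gains, the interaction bonus, and the helper names `shapley_values` and `toy_value` are made up for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(players, value):
    """Exact Shapley values for a set function `value` defined on subsets of `players`."""
    n = len(players)
    result = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain p.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                # Marginal contribution of p when it joins coalition s.
                total += weight * (value(s | {p}) - value(s))
        result[p] = total
    return result


# Hypothetical per-attribute gains (illustrative numbers, not from the paper).
GAINS = {"vocoder": 0.30, "speaker": 0.15, "duration": 0.05}


def toy_value(subset):
    """Toy 'accuracy gain' of a subset: additive gains plus one interaction term."""
    v = sum(GAINS[a] for a in subset)
    if {"vocoder", "speaker"} <= subset:
        v += 0.10  # made-up bonus when both attributes are present
    return v


phi = shapley_values(list(GAINS), toy_value)
for name, contribution in phi.items():
    print(f"{name}: {contribution:.3f}")
```

In this toy setting the interaction bonus is split evenly between the two attributes that produce it, and the values sum to the gain of the full attribute set. Exact enumeration grows exponentially with the number of attributes, so practical tools usually approximate it.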
Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley
The recent explosion of generative AI-Music systems has raised numerous concerns over data copyright, licensing music from musicians, and the conflict between open-source AI and large prestige companies. Such issues highlight the need for publicly available, copyright-free musical data, of which there is a large shortage, particularly for symbolic music data. To alleviate this issue, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it, to our knowledge, the largest available copyright-free symbolic music dataset. PDMX additionally includes a wealth of both tag and user interaction metadata, allowing us to efficiently analyze the dataset and filter for high-quality user-generated scores. Given the additional metadata afforded by our data collection process, we conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality. Examples can be found at https://pnlong.github.io/PDMX.demo/.
{"title":"PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing","authors":"Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley","doi":"arxiv-2409.10831","DOIUrl":"https://doi.org/arxiv-2409.10831","journal":{"name":"arXiv - EE - Audio and Speech Processing"},"publicationDate":"2024-09-17"}
Hsi-Che Lin, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. However, building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. In this paper, we propose an approach to enhance SER performance in languages with scarce SER resources by leveraging data from high-resource languages. Specifically, we employ expressive Speech-to-Speech translation (S2ST) combined with a novel bootstrapping data selection pipeline to generate labeled data in the target language. Extensive experiments demonstrate that our method is both effective and generalizable across different upstream models and languages. Our results suggest that this approach can facilitate the development of more scalable and robust multilingual SER systems.
{"title":"Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection","authors":"Hsi-Che Lin, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee","doi":"arxiv-2409.10985","DOIUrl":"https://doi.org/arxiv-2409.10985","journal":{"name":"arXiv - EE - Audio and Speech Processing"},"publicationDate":"2024-09-17"}
We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification, where a model must generalize to new classes based on only a few available examples. Extending Prototypical Networks, LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items, rather than one prototype per label. Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music, and is evaluated against existing approaches in the literature. The results demonstrate a significant performance improvement in almost all domains and training setups when using LC-Protonets for multi-label classification. In addition to training a few-shot learning model from scratch, we explore the use of a pre-trained model, obtained via supervised learning, to embed items in the feature space. Fine-tuning improves the generalization ability of all methods, yet LC-Protonets achieve high-level performance even without fine-tuning, in contrast to the comparative approaches. We finally analyze the scalability of the proposed method, providing detailed quantitative metrics from our experiments. The implementation and experimental setup are made publicly available, offering a benchmark for future research.
{"title":"LC-Protonets: Multi-label Few-shot learning for world music audio tagging","authors":"Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos","doi":"arxiv-2409.11264","DOIUrl":"https://doi.org/arxiv-2409.11264","journal":{"name":"arXiv - EE - Audio and Speech Processing"},"publicationDate":"2024-09-17"}
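The LC-Protonets abstract describes building one prototype per label combination, taken from the power set of the labels attached to each support item. A minimal sketch of that idea follows, under stated assumptions: embeddings are plain lists of floats, a prototype is the mean of all support embeddings whose label set contains the combination, and a query takes the labels of its nearest prototype under squared Euclidean distance. The function names and the toy data are hypothetical and are not taken from the authors' released implementation.

```python
from itertools import combinations


def label_combinations(label_set):
    """Yield every non-empty subset of an item's labels as a frozenset."""
    labels = sorted(label_set)
    for k in range(1, len(labels) + 1):
        for combo in combinations(labels, k):
            yield frozenset(combo)


def build_prototypes(support):
    """Map each label combination to the mean embedding of the items that contain it.

    `support` is a list of (embedding, label_set) pairs.
    """
    members = {}
    for embedding, labels in support:
        # An item contributes to every combination in the power set of its labels.
        for combo in label_combinations(labels):
            members.setdefault(combo, []).append(embedding)
    return {
        combo: [sum(dim) / len(embs) for dim in zip(*embs)]
        for combo, embs in members.items()
    }


def predict(prototypes, query):
    """Return the label combination whose prototype is closest to `query`."""
    def sq_dist(proto):
        return sum((a - b) ** 2 for a, b in zip(proto, query))
    return min(prototypes, key=lambda combo: sq_dist(prototypes[combo]))


# Toy 2-D support set (illustrative values only).
support = [
    ([0.0, 0.0], {"flute"}),
    ([2.0, 0.0], {"drums"}),
    ([1.0, 1.0], {"flute", "drums"}),
]
prototypes = build_prototypes(support)
print(sorted(predict(prototypes, [1.0, 1.2])))
print(sorted(predict(prototypes, [0.1, 0.1])))
```

Because a prototype exists for the combination itself, a multi-label prediction falls out of a single nearest-prototype lookup rather than a per-label threshold. The number of prototypes can grow quickly with the number of labels per item, which is presumably why the abstract includes a scalability analysis.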
Tushar Dhyani, Florian Lux, Michele Mancusi, Giorgio Fabbro, Fritz Hohl, Ngoc Thang Vu
Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48 kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it well suited for high-resolution speech restoration.
{"title":"High-Resolution Speech Restoration with Latent Diffusion Model","authors":"Tushar Dhyani, Florian Lux, Michele Mancusi, Giorgio Fabbro, Fritz Hohl, Ngoc Thang Vu","doi":"arxiv-2409.11145","DOIUrl":"https://doi.org/arxiv-2409.11145","journal":{"name":"arXiv - EE - Audio and Speech Processing"},"publicationDate":"2024-09-17"}
Gabriel Bibbó, Thomas Deacon, Arshdeep Singh, Mark D. Plumbley
This paper presents a residential audio dataset to support sound event detection research for smart home applications aimed at promoting wellbeing for older adults. The dataset is constructed by deploying audio recording systems in the homes of 8 participants aged 55-80 years for a 7-day period. Acoustic characteristics are documented through detailed floor plans and construction material information to enable replication of the recording environments for AI model deployment. A novel automated speech removal pipeline is developed, using pre-trained audio neural networks to detect and remove segments containing spoken voice, while preserving segments containing other sound events. The resulting dataset consists of privacy-compliant audio recordings that accurately capture the soundscapes and activities of daily living within residential spaces. The paper details the dataset creation methodology, the speech removal pipeline utilizing cascaded model architectures, and an analysis of the vocal label distribution to validate the speech removal process. This dataset enables the development and benchmarking of sound event detection models tailored specifically for in-home applications.
{"title":"The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection","authors":"Gabriel Bibbó, Thomas Deacon, Arshdeep Singh, Mark D. Plumbley","doi":"arxiv-2409.11262","DOIUrl":"https://doi.org/arxiv-2409.11262","journal":{"name":"arXiv - EE - Audio and Speech Processing"},"publicationDate":"2024-09-17"}
Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks (sets of discrete representations). Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, as supported by the separation results, successfully disentangles different sources in the latent space, thereby enhancing the interpretability of audio codecs and potentially providing finer control over the audio generation process.
{"title":"Learning Source Disentanglement in Neural Audio Codec","authors":"Xiaoyu Bie, Xubo Liu, Gaël Richard","doi":"arxiv-2409.11228","DOIUrl":"https://doi.org/arxiv-2409.11228","journal":{"name":"arXiv - EE - Audio and Speech Processing"},"publicationDate":"2024-09-17"}
Tejes Srivastava, Ju-Chieh Chou, Priyank Shroff, Karen Livescu, Christopher Graziul
Police departments around the world use two-way radio for coordination. These broadcast police communications (BPC) are a unique source of information about everyday police activity and emergency response. Yet BPC are not transcribed, and their naturalistic audio properties make automatic transcription challenging. We collect a corpus of roughly 62,000 manually transcribed radio transmissions (~46 hours of audio) to evaluate the feasibility of automatic speech recognition (ASR) using modern recognition models. We evaluate the performance of off-the-shelf speech recognizers, models fine-tuned on BPC data, and customized end-to-end models. We find that both human and machine transcription are challenging in this domain. Large off-the-shelf ASR models perform poorly, but fine-tuned models can reach the approximate range of human performance. Our results suggest directions for future work, including analysis of short utterances and potential miscommunication in police radio interactions. We make our corpus and data annotation pipeline available to other researchers to enable further research on recognition and analysis of police communication.
{"title":"Speech Recognition for Analysis of Police Radio Communication","authors":"Tejes Srivastava, Ju-Chieh Chou, Priyank Shroff, Karen Livescu, Christopher Graziul","doi":"arxiv-2409.10858","DOIUrl":"https://doi.org/arxiv-2409.10858","journal":{"name":"arXiv - EE - Audio and Speech Processing"},"publicationDate":"2024-09-17"}
Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia
Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, it must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute), and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it's coming from behind). State-of-the-art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2 m) rather than a position described using natural language (e.g., "next to me"). To address these gaps, we present ELSA, a spatially aware audio and text embedding model trained using multimodal contrastive learning. ELSA supports non-spatial audio, spatial audio, and open-vocabulary text captions describing both the spatial and semantic components of sound. To train ELSA: (a) we spatially augment the audio and captions of three open-source audio datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture the semantics of non-spatial audio, and the semantics and spatial attributes of spatial audio, using contrastive learning. ELSA is competitive with the state of the art for both semantic retrieval and 3D source localization. In particular, ELSA improves mean audio-to-text and text-to-audio R@1 by 2.8% over the baseline, and reduces the mean absolute error in 3D source localization by 11.6° relative to the baseline.
{"title":"Learning Spatially-Aware Language and Audio Embedding","authors":"Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia","doi":"arxiv-2409.11369","DOIUrl":"https://doi.org/arxiv-2409.11369","journal":{"name":"arXiv - EE - Audio and Speech Processing"},"publicationDate":"2024-09-17"}
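The ELSA abstract reports a mean of audio-to-text and text-to-audio R@1. A small sketch of how such a retrieval score can be computed from a similarity matrix is shown below. It assumes one caption per audio clip, with the matching pair for clip `i` at column `i`. The similarity values and the helper name `recall_at_1` are made up for illustration, and the paper's evaluation protocol may differ.

```python
def recall_at_1(sim):
    """Fraction of rows whose highest-scoring column is the matching index."""
    hits = 0
    for i, row in enumerate(sim):
        best = max(range(len(row)), key=lambda j: row[j])
        hits += best == i
    return hits / len(sim)


# sim[i][j]: similarity between audio clip i and caption j (illustrative numbers).
sim = [
    [0.9, 0.1, 0.2],
    [0.3, 0.4, 0.8],
    [0.2, 0.1, 0.7],
]

audio_to_text = recall_at_1(sim)
# Transposing the matrix turns caption-to-audio retrieval into the same row-wise check.
text_to_audio = recall_at_1([list(col) for col in zip(*sim)])
mean_r1 = (audio_to_text + text_to_audio) / 2
print(f"A2T R@1={audio_to_text:.3f}  T2A R@1={text_to_audio:.3f}  mean={mean_r1:.3f}")
```

Here clip 1 retrieves the wrong caption and caption 2 retrieves the wrong clip, so both directions score two out of three.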
Francesco Nespoli, Daniel Barreda, Patrick A. Naylor
In recent years, automatic speech recognition (ASR) models have greatly improved transcription performance, both in clean, low-noise acoustic conditions and in reverberant environments. However, all these systems rely on the availability of hundreds of hours of labelled training data in specific acoustic conditions. When such a training dataset is not available, the performance of the system is heavily impacted. This happens, for example, when a specific acoustic environment or a particular population of speakers is under-represented in the training dataset. In this paper, we investigate the effect of accented speech data on an off-the-shelf ASR system. Furthermore, we suggest a strategy based on zero-shot text-to-speech to augment the accented speech corpora. We show that this augmentation method mitigates the loss in performance of the ASR system on accented data, yielding up to 5% word error rate reduction (WERR). Finally, we demonstrate that by combining a modest fraction of real data with synthetically generated data, the ASR system outperforms a model trained exclusively on authentic accented speech, with up to 14% WERR.
{"title":"Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora","authors":"Francesco Nespoli, Daniel Barreda, Patrick A. Naylor","doi":"arxiv-2409.11107","DOIUrl":"https://doi.org/arxiv-2409.11107","url":null,"abstract":"In recent years, automatic speech recognition (ASR) models greatly improved transcription performance both in clean, low noise, acoustic conditions and in reverberant environments. However, all these systems rely on the availability of hundreds of hours of labelled training data in specific acoustic conditions. When such a training dataset is not available, the performance of the system is heavily impacted. For example, this happens when a specific acoustic environment or a particular population of speakers is under-represented in the training dataset. Specifically, in this paper we investigate the effect of accented speech data on an off-the-shelf ASR system. Furthermore, we suggest a strategy based on zero-shot text-to-speech to augment the accented speech corpora. We show that this augmentation method is able to mitigate the loss in performance of the ASR system on accented data up to 5% word error rate reduction (WERR). In conclusion, we demonstrate that by incorporating a modest fraction of real with synthetically generated data, the ASR system exhibits superior performance compared to a model trained exclusively on authentic accented speech with up to 14% WERR.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"27 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}