
Latest articles: arXiv - EE - Audio and Speech Processing

An Explainable Probabilistic Attribute Embedding Approach for Spoofed Speech Characterization
Pub Date: 2024-09-17 · arXiv:2409.11027
Manasi Chhibber, Jagabandhu Mishra, Hyejin Shim, Tomi H. Kinnunen
We propose a novel approach for spoofed speech characterization through explainable probabilistic attribute embeddings. In contrast to high-dimensional raw embeddings extracted from a spoofing countermeasure (CM), whose dimensions are not easy to interpret, the probabilistic attributes are designed to gauge the presence or absence of sub-components that make up a specific spoofing attack. These attributes are then applied to two downstream tasks: spoofing detection and attack attribution. To enforce interpretability also in the back-end, we adopt a decision tree classifier. Our experiments on the ASVspoof2019 dataset, with spoof CM embeddings extracted from three models (AASIST, Rawboost-AASIST, SSL-AASIST), suggest that the performance of the attribute embeddings is on par with the original raw spoof CM embeddings for both tasks. The best accuracy achieved with the proposed approach is 99.7% for spoofing detection and 99.2% for attack attribution, compared to 99.7% and 94.7% using the raw CM embeddings. To analyze the relative contribution of each attribute, we estimate their Shapley values. Attributes related to acoustic feature prediction, waveform generation (vocoder), and speaker modeling are found to be important for spoofing detection, while duration modeling, vocoder, and input type play a role in spoofing attack attribution.
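The Shapley-value attribution described above can be sketched exactly for a small attribute set. The attribute names and the toy "detector accuracy" value function below are illustrative stand-ins, not the paper's actual CM attributes or classifier:

```python
from itertools import combinations
from math import factorial

def shapley_values(players, value):
    """Exact Shapley values: phi_i = sum over coalitions S (not containing i)
    of |S|! (n-|S|-1)! / n! * (value(S + {i}) - value(S))."""
    n = len(players)
    phi = {}
    for i in players:
        others = [p for p in players if p != i]
        total = 0.0
        for r in range(len(others) + 1):
            for S in combinations(others, r):
                w = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                total += w * (value(set(S) | {i}) - value(set(S)))
        phi[i] = total
    return phi

# Toy value function over attribute subsets (invented numbers): vocoder and
# acoustic-feature attributes help, with a small interaction bonus.
def toy_accuracy(attrs):
    acc = 0.5
    if "vocoder" in attrs:
        acc += 0.2
    if "acoustic" in attrs:
        acc += 0.1
    if {"vocoder", "acoustic"} <= attrs:
        acc += 0.05
    return acc

phi = shapley_values(["vocoder", "acoustic", "duration"], toy_accuracy)
```

By the efficiency property, the values sum to the gain of the full attribute set over the empty set, and the interaction bonus is split evenly between the two interacting attributes.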
Citations: 0
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music Processing
Pub Date: 2024-09-17 · arXiv:2409.10831
Phillip Long, Zachary Novack, Taylor Berg-Kirkpatrick, Julian McAuley
The recent explosion of generative AI-Music systems has raised numerous concerns over data copyright, licensing music from musicians, and the conflict between open-source AI and large prestige companies. Such issues highlight the need for publicly available, copyright-free musical data, of which there is a large shortage, particularly for symbolic music data. To alleviate this issue, we present PDMX: a large-scale open-source dataset of over 250K public domain MusicXML scores collected from the score-sharing forum MuseScore, making it the largest available copyright-free symbolic music dataset to our knowledge. PDMX additionally includes a wealth of both tag and user interaction metadata, allowing us to efficiently analyze the dataset and filter for high-quality user-generated scores. Given the additional metadata afforded by our data collection process, we conduct multitrack music generation experiments evaluating how different representative subsets of PDMX lead to different behaviors in downstream models, and how user-rating statistics can be used as an effective measure of data quality. Examples can be found at https://pnlong.github.io/PDMX.demo/.
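Metadata-based quality filtering of the kind described can be sketched minimally as below; the record fields, license tags, and thresholds are assumptions for illustration, not PDMX's actual schema:

```python
# Hypothetical score-metadata records; field names are illustrative.
scores = [
    {"id": "a", "rating": 4.8, "n_ratings": 120, "license": "public-domain"},
    {"id": "b", "rating": 2.1, "n_ratings": 3,   "license": "public-domain"},
    {"id": "c", "rating": 4.9, "n_ratings": 15,  "license": "cc-by"},
]

def high_quality(records, min_rating=4.0, min_votes=10):
    """Keep public-domain scores whose user-rating statistics suggest
    high quality: a good average backed by enough votes."""
    return [r for r in records
            if r["license"] == "public-domain"
            and r["rating"] >= min_rating
            and r["n_ratings"] >= min_votes]

subset = high_quality(scores)
```

Requiring a minimum vote count guards against scores whose average rating rests on one or two opinions.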
Citations: 0
High-Resolution Speech Restoration with Latent Diffusion Model
Pub Date: 2024-09-17 · arXiv:2409.11145
Tushar Dhyani, Florian Lux, Michele Mancusi, Giorgio Fabbro, Fritz Hohl, Ngoc Thang Vu
Traditional speech enhancement methods often oversimplify the task of restoration by focusing on a single type of distortion. Generative models that handle multiple distortions frequently struggle with phone reconstruction and high-frequency harmonics, leading to breathing and gasping artifacts that reduce the intelligibility of reconstructed speech. These models are also computationally demanding, and many solutions are restricted to producing outputs in the wide-band frequency range, which limits their suitability for professional applications. To address these challenges, we propose Hi-ResLDM, a novel generative model based on latent diffusion designed to remove multiple distortions and restore speech recordings to studio quality, sampled at 48 kHz. We benchmark Hi-ResLDM against state-of-the-art methods that leverage GAN and Conditional Flow Matching (CFM) components, demonstrating superior performance in regenerating high-frequency-band details. Hi-ResLDM not only excels in non-intrusive metrics but is also consistently preferred in human evaluation and performs competitively on intrusive evaluations, making it ideal for high-resolution speech restoration.
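Latent diffusion models rest on the standard DDPM-style forward noising step, which corrupts a clean latent toward Gaussian noise according to a schedule. A minimal sketch of that generic step (textbook diffusion math, not Hi-ResLDM's actual architecture; the latent vector and schedule value are toy inputs):

```python
import math

def forward_noise(x0, eps, abar_t):
    """One DDPM-style forward step on a latent vector:
    x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps,
    where abar_t is the cumulative product of the noise schedule at step t."""
    s, n = math.sqrt(abar_t), math.sqrt(1.0 - abar_t)
    return [s * x + n * e for x, e in zip(x0, eps)]

# At abar_t = 0.64 the signal is scaled by 0.8 and the noise by 0.6;
# eps is fixed here instead of sampled, to keep the example deterministic.
xt = forward_noise([1.0, -1.0], [0.5, 0.5], abar_t=0.64)
```

The restoration model is trained to invert this process step by step, conditioned on the degraded recording.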
Citations: 0
Improving Speech Emotion Recognition in Under-Resourced Languages via Speech-to-Speech Translation with Bootstrapping Data Selection
Pub Date: 2024-09-17 · arXiv:2409.10985
Hsi-Che Lin, Yi-Cheng Lin, Huang-Cheng Chou, Hung-yi Lee
Speech Emotion Recognition (SER) is a crucial component in developing general-purpose AI agents capable of natural human-computer interaction. However, building robust multilingual SER systems remains challenging due to the scarcity of labeled data in languages other than English and Chinese. In this paper, we propose an approach to enhance SER performance in languages with scarce SER resources by leveraging data from high-resource languages. Specifically, we employ expressive Speech-to-Speech Translation (S2ST) combined with a novel bootstrapping data selection pipeline to generate labeled data in the target language. Extensive experiments demonstrate that our method is both effective and generalizable across different upstream models and languages. Our results suggest that this approach can facilitate the development of more scalable and robust multilingual SER systems.
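One common shape for a bootstrapping data selection loop is to admit pseudo-labeled translated items whose classifier confidence clears a bar, over several rounds. The sketch below is only that generic shape; the decaying threshold schedule is an invented illustrative rule, not the paper's pipeline:

```python
def bootstrap_select(candidates, score_fn, threshold=0.8, rounds=3):
    """Iteratively select items whose confidence score clears the bar.
    Each round slightly lowers the bar (illustrative rule) so the
    selected pool can grow as training would improve the scorer."""
    selected, pool = [], list(candidates)
    for r in range(rounds):
        bar = threshold - 0.05 * r
        keep = [c for c in pool if score_fn(c) >= bar]
        selected += keep
        pool = [c for c in pool if c not in keep]
    return selected

# Toy candidates scored by identity: confidences stand in for S2ST outputs.
chosen = bootstrap_select([0.9, 0.77, 0.72, 0.5], score_fn=lambda c: c)
```

In the real pipeline the scorer would be a SER model retrained between rounds on the data admitted so far.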
Citations: 0
LC-Protonets: Multi-label Few-shot learning for world music audio tagging
Pub Date: 2024-09-17 · arXiv:2409.11264
Charilaos Papaioannou, Emmanouil Benetos, Alexandros Potamianos
We introduce Label-Combination Prototypical Networks (LC-Protonets) to address the problem of multi-label few-shot classification, where a model must generalize to new classes based on only a few available examples. Extending Prototypical Networks, LC-Protonets generate one prototype per label combination, derived from the power set of labels present in the limited training items, rather than one prototype per label. Our method is applied to automatic audio tagging across diverse music datasets, covering various cultures and including both modern and traditional music, and is evaluated against existing approaches in the literature. The results demonstrate a significant performance improvement in almost all domains and training setups when using LC-Protonets for multi-label classification. In addition to training a few-shot learning model from scratch, we explore the use of a pre-trained model, obtained via supervised learning, to embed items in the feature space. Fine-tuning improves the generalization ability of all methods, yet LC-Protonets achieve high-level performance even without fine-tuning, in contrast to the comparative approaches. We finally analyze the scalability of the proposed method, providing detailed quantitative metrics from our experiments. The implementation and experimental setup are made publicly available, offering a benchmark for future research.
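One plausible reading of the prototype construction can be sketched as follows: enumerate the power set of each support item's label set, and average the embeddings of the items whose labels cover each combination. The tags, embeddings, and the coverage rule are illustrative assumptions, not the paper's exact formulation:

```python
from itertools import combinations

def lc_prototypes(support):
    """support: list of (embedding, label_set) pairs.
    Build one prototype per label combination drawn from each item's
    power set; a prototype is the mean embedding of all support items
    whose label set contains that combination."""
    combos = set()
    for _, labels in support:
        for r in range(1, len(labels) + 1):
            combos.update(combinations(sorted(labels), r))
    protos = {}
    for combo in combos:
        members = [e for e, labels in support if set(combo) <= labels]
        dim = len(members[0])
        protos[combo] = tuple(
            sum(m[d] for m in members) / len(members) for d in range(dim)
        )
    return protos

# Two toy support items with 2-D embeddings and music tags.
protos = lc_prototypes([
    ((1.0, 0.0), {"rock"}),
    ((0.0, 1.0), {"rock", "guitar"}),
])
```

A query is then classified by its nearest prototype, inheriting that prototype's whole label combination, which is what makes the scheme natively multi-label.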
Citations: 0
The Sounds of Home: A Speech-Removed Residential Audio Dataset for Sound Event Detection
Pub Date: 2024-09-17 · arXiv:2409.11262
Gabriel Bibbó, Thomas Deacon, Arshdeep Singh, Mark D. Plumbley
This paper presents a residential audio dataset to support sound event detection research for smart home applications aimed at promoting wellbeing for older adults. The dataset is constructed by deploying audio recording systems in the homes of 8 participants aged 55-80 years for a 7-day period. Acoustic characteristics are documented through detailed floor plans and construction material information to enable replication of the recording environments for AI model deployment. A novel automated speech removal pipeline is developed, using pre-trained audio neural networks to detect and remove segments containing spoken voice, while preserving segments containing other sound events. The resulting dataset consists of privacy-compliant audio recordings that accurately capture the soundscapes and activities of daily living within residential spaces. The paper details the dataset creation methodology, the speech removal pipeline utilizing cascaded model architectures, and an analysis of the vocal label distribution to validate the speech removal process. This dataset enables the development and benchmarking of sound event detection models tailored specifically for in-home applications.
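The final filtering step of such a pipeline can be sketched minimally: given detector-labeled segments, drop the speech ones and keep everything else. The segment format and labels are illustrative, and the detector itself (pre-trained audio neural networks in the paper) is abstracted away:

```python
def strip_speech(segments):
    """segments: list of (start_s, end_s, label) tuples emitted by an
    upstream detector. Remove segments labeled 'speech' while preserving
    all other sound events, as in a speech-removal pipeline's last stage."""
    return [seg for seg in segments if seg[2] != "speech"]

# Toy detector output for a few seconds of home audio.
events = [
    (0.0, 2.5, "speech"),
    (2.5, 4.0, "dog_bark"),
    (4.0, 9.0, "speech"),
    (9.0, 12.0, "kettle"),
]
kept = strip_speech(events)
```

In practice the kept segments index into the waveform, so the released audio contains only non-speech sound events.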
Citations: 0
Learning Source Disentanglement in Neural Audio Codec
Pub Date: 2024-09-17 · arXiv:2409.11228
Xiaoyu Bie, Xubo Liu, Gaël Richard
Neural audio codecs have significantly advanced audio compression by efficiently converting continuous audio signals into discrete tokens. These codecs preserve high-quality sound and enable sophisticated sound generation through generative models trained on these tokens. However, existing neural codec models are typically trained on large, undifferentiated audio datasets, neglecting the essential discrepancies between sound domains like speech, music, and environmental sound effects. This oversight complicates data modeling and poses additional challenges to the controllability of sound generation. To tackle these issues, we introduce the Source-Disentangled Neural Audio Codec (SD-Codec), a novel approach that combines audio coding and source separation. By jointly learning audio resynthesis and separation, SD-Codec explicitly assigns audio signals from different domains to distinct codebooks, sets of discrete representations. Experimental results indicate that SD-Codec not only maintains competitive resynthesis quality but also, supported by the separation results, demonstrates successful disentanglement of different sources in the latent space, thereby enhancing interpretability in audio codecs and providing potentially finer control over the audio generation process.
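The domain-to-codebook assignment can be sketched with plain nearest-neighbour vector quantization over per-domain codebooks. The codebook contents and the two domains below are toy values, not SD-Codec's learned codebooks:

```python
def quantize(vec, codebook):
    """Return the index of the nearest codebook entry (squared L2)."""
    def d2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: d2(vec, codebook[i]))

# Each source domain gets its own codebook, mirroring SD-Codec's explicit
# domain-to-codebook assignment (entries here are invented).
codebooks = {
    "speech": [(0.0, 0.0), (1.0, 1.0)],
    "music":  [(0.0, 1.0), (1.0, 0.0)],
}

def encode(vec, domain):
    """Tokenize a latent vector within its domain's codebook."""
    return domain, quantize(vec, codebooks[domain])

token = encode((0.9, 0.8), "speech")
```

Because tokens carry their domain, a decoder can resynthesize each source separately, which is what makes the latent space disentangled by construction.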
Citations: 0
Speech Recognition for Analysis of Police Radio Communication
Pub Date: 2024-09-17 · arXiv:2409.10858
Tejes Srivastava, Ju-Chieh Chou, Priyank Shroff, Karen Livescu, Christopher Graziul
Police departments around the world use two-way radio for coordination. These broadcast police communications (BPC) are a unique source of information about everyday police activity and emergency response. Yet BPC are not transcribed, and their naturalistic audio properties make automatic transcription challenging. We collect a corpus of roughly 62,000 manually transcribed radio transmissions (~46 hours of audio) to evaluate the feasibility of automatic speech recognition (ASR) using modern recognition models. We evaluate the performance of off-the-shelf speech recognizers, models fine-tuned on BPC data, and customized end-to-end models. We find that both human and machine transcription is challenging in this domain. Large off-the-shelf ASR models perform poorly, but fine-tuned models can reach the approximate range of human performance. Our work suggests directions for future work, including analysis of short utterances and potential miscommunication in police radio interactions. We make our corpus and data annotation pipeline available to other researchers, to enable further research on recognition and analysis of police communication.
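ASR feasibility studies like this are typically scored with word error rate (WER), the word-level edit distance between reference and hypothesis divided by the reference length. A self-contained sketch of the standard computation (the example strings are invented, not drawn from the corpus):

```python
def wer(ref, hyp):
    """Word error rate: Levenshtein distance over word tokens,
    normalized by the number of reference words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between first i ref words and first j hyp words.
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            sub = d[i - 1][j - 1] + (r[i - 1] != h[j - 1])  # substitution (free if equal)
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)  # deletion, insertion
    return d[len(r)][len(h)] / len(r)
```

A single dropped word in a three-word transmission already costs a WER of 1/3, which is why short radio utterances are so punishing for this metric.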
Citations: 0
Learning Spatially-Aware Language and Audio Embedding
Pub Date: 2024-09-17 · arXiv:2409.11369
Bhavika Devnani, Skyler Seto, Zakaria Aldeneh, Alessandro Toso, Elena Menyaylenko, Barry-John Theobald, Jonathan Sheaffer, Miguel Sarabia
Humans can picture a sound scene given an imprecise natural language description. For example, it is easy to imagine an acoustic environment given a phrase like "the lion roar came from right behind me!". For a machine to have the same degree of comprehension, the machine must know what a lion is (semantic attribute), what the concept of "behind" is (spatial attribute), and how these pieces of linguistic information align with the semantic and spatial attributes of the sound (what a roar sounds like when it is coming from behind). State-of-the-art audio foundation models, which learn to map between audio scenes and natural textual descriptions, are trained on non-spatial audio and text pairs, and hence lack spatial awareness. In contrast, sound event localization and detection models are limited to recognizing sounds from a fixed number of classes, and they localize the source to an absolute position (e.g., 0.2 m) rather than a position described using natural language (e.g., "next to me"). To address these gaps, we present ELSA, a spatially aware audio and text embedding model trained using multimodal contrastive learning. ELSA supports non-spatial audio, spatial audio, and open-vocabulary text captions describing both the spatial and semantic components of sound. To train ELSA: (a) we spatially augment the audio and captions of three open-source audio datasets totaling 4,738 hours of audio, and (b) we design an encoder to capture the semantics of non-spatial audio, and the semantics and spatial attributes of spatial audio, using contrastive learning. ELSA is competitive with the state-of-the-art for both semantic retrieval and 3D source localization. In particular, ELSA achieves a mean audio-to-text and text-to-audio R@1 that is 2.8% above the baseline, and outperforms the baseline in 3D source localization by 11.6° in mean absolute error.
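The R@1 retrieval metric reported above can be computed from a query-candidate similarity matrix, with the ground-truth pairing on the diagonal. A minimal sketch with toy similarity scores (not ELSA's actual embeddings):

```python
def recall_at_1(sim):
    """sim[i][j]: similarity of query i (e.g., a caption) to candidate j
    (e.g., an audio clip); the matching pair sits on the diagonal.
    R@1 = fraction of queries whose top-ranked candidate is the match."""
    hits = sum(
        1 for i, row in enumerate(sim)
        if max(range(len(row)), key=row.__getitem__) == i
    )
    return hits / len(sim)

# Three toy queries against three candidates; query 1 retrieves the
# wrong clip, so R@1 is 2/3.
r1 = recall_at_1([
    [0.9, 0.2, 0.1],
    [0.3, 0.1, 0.8],
    [0.0, 0.4, 0.7],
])
```

Contrastive training pushes the diagonal entries of this matrix above the rest, which is exactly what the metric rewards.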
人类可以通过不精确的自然语言描述来想象声音场景。例如，用 "狮子的吼叫声从我身后传来！"这样的短语很容易想象出声音环境。要让机器具有相同程度的理解能力，机器必须知道狮子是什么（语义属性），"后面"是什么概念（空间属性），以及这些语言信息如何与声音的语义和空间属性相一致（当吼声从后面传来时听起来像什么）。学习在音频场景和自然文本描述之间建立映射的最先进音频基础模型是在非空间音频和文本对上训练的，因此缺乏空间感知能力。与此相反，声音事件定位和检测模型仅限于识别固定数量类别的声音，它们将声源定位到绝对位置（如 0.2 米），而不是使用自然语言描述的位置（如 "在我旁边"）。为了弥补这些不足，我们推出了 ELSA，这是一种利用多模态对比学习训练的空间感知音频和文本嵌入模型。ELSA 支持非空间音频、空间音频以及描述声音空间和语义成分的开放词汇文本标题。为了训练 ELSA：(a) 我们对三个开源音频数据集共 4738 小时的音频和字幕进行了空间增强；(b) 我们设计了一个编码器，利用对比学习捕捉非空间音频的语义以及空间音频的语义和空间属性。在语义检索和三维声源定位方面，ELSA 都能与最先进的技术相媲美。特别是，ELSA 的音频到文本和文本到音频平均 R@1 比基线高出 2.8%，在三维声源定位方面的平均绝对误差比基线低 11.6°。
引用次数: 0
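The ELSA abstract above describes training paired audio and text encoders with multimodal contrastive learning. A minimal NumPy sketch of the CLIP-style symmetric contrastive (InfoNCE) objective that such audio-text embedding models typically optimize — the function name, temperature value, and NumPy implementation are illustrative assumptions, not the paper's actual code:

```python
import numpy as np

def symmetric_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """CLIP-style symmetric InfoNCE loss over a batch of paired embeddings.

    audio_emb, text_emb: (batch, dim) arrays; row i of each is a matched pair.
    """
    # L2-normalize so the dot product is cosine similarity
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature      # (batch, batch) similarity matrix
    labels = np.arange(len(logits))     # matched pairs sit on the diagonal

    def cross_entropy(lg, y):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # average the audio-to-text and text-to-audio retrieval directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

Minimizing this loss pulls each audio embedding toward its paired caption and pushes it away from the other captions in the batch, which is what makes the shared space usable for the audio-to-text and text-to-audio R@1 retrieval metrics the abstract reports.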
Zero Shot Text to Speech Augmentation for Automatic Speech Recognition on Low-Resource Accented Speech Corpora 用于低资源带口音语音语料库自动语音识别的零样本文本到语音增强技术
Pub Date : 2024-09-17 DOI: arxiv-2409.11107
Francesco Nespoli, Daniel Barreda, Patrick A. Naylor
In recent years, automatic speech recognition (ASR) models have greatly improved transcription performance both in clean, low-noise acoustic conditions and in reverberant environments. However, all these systems rely on the availability of hundreds of hours of labelled training data in specific acoustic conditions. When such a training dataset is not available, the performance of the system is heavily impacted. For example, this happens when a specific acoustic environment or a particular population of speakers is under-represented in the training dataset. Specifically, in this paper we investigate the effect of accented speech data on an off-the-shelf ASR system. Furthermore, we suggest a strategy based on zero-shot text-to-speech to augment the accented speech corpora. We show that this augmentation method is able to mitigate the loss in performance of the ASR system on accented data, with up to 5% word error rate reduction (WERR). In conclusion, we demonstrate that by incorporating a modest fraction of real with synthetically generated data, the ASR system exhibits superior performance compared to a model trained exclusively on authentic accented speech, with up to 14% WERR.
近年来，自动语音识别（ASR）模型在洁净、低噪声的声学条件下以及混响环境中的转录性能都有了很大提高。但是，所有这些系统都依赖于特定声学条件下数百小时的标注训练数据。如果没有这样的训练数据集，系统的性能就会受到严重影响。例如，当特定的声学环境或特定的说话人群体在训练数据集中的代表性不足时，就会出现这种情况。具体来说，我们在本文中研究了带口音的语音数据对现成 ASR 系统的影响。此外，我们还提出了一种基于零样本文本到语音的策略，以扩充带口音的语音语料库。我们的研究表明，这种增强方法能够减轻 ASR 系统在带口音数据上的性能损失，词错误率最高可相对降低 5%（WERR）。总之，我们证明，通过将少量真实数据与合成数据相结合，ASR 系统的性能优于完全基于真实带口音语音训练的模型，词错误率最高可相对降低 14%。
引用次数: 0
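The abstract above reports its gains as relative word error rate reduction (WERR). A short sketch of how WER (word-level Levenshtein distance normalized by reference length) and relative WERR are conventionally computed — whether the paper uses exactly this relative definition is an assumption on our part:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length,
    computed with a word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

def relative_werr(wer_baseline, wer_augmented):
    """Relative word error rate reduction: the fraction of the baseline
    WER that the augmented system removes."""
    return (wer_baseline - wer_augmented) / wer_baseline
```

Under this definition, a system that lowers WER from 20% to 19% achieves a 5% relative WERR, which is the scale at which the abstract's 5% and 14% figures should be read.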