
arXiv - CS - Sound: Latest Publications

RepAugment: Input-Agnostic Representation-Level Augmentation for Respiratory Sound Classification
Pub Date : 2024-05-05 DOI: arxiv-2405.02996
June-Woo Kim, Miika Toikkanen, Sangmin Bae, Minseok Kim, Ho-Young Jung
Recent advancements in AI have democratized its deployment as a healthcare assistant. While pretrained models from large-scale visual and audio datasets have demonstrably generalized to this task, surprisingly, no studies have explored pretrained speech models, which, as human-originated sounds, intuitively would share closer resemblance to lung sounds. This paper explores the efficacy of pretrained speech models for respiratory sound classification. We find that there is a characterization gap between speech and lung sound samples, and to bridge this gap, data augmentation is essential. However, the most widely used augmentation technique for audio and speech, SpecAugment, requires a 2-dimensional spectrogram format and cannot be applied to models pretrained on speech waveforms. To address this, we propose RepAugment, an input-agnostic representation-level augmentation technique that not only outperforms SpecAugment but is also suitable for respiratory sound classification with waveform-pretrained models. Experimental results show that our approach outperforms SpecAugment, demonstrating a substantial improvement in the accuracy of minority disease classes, reaching up to 7.14%.
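The abstract does not spell out the concrete RepAugment operations, so the following is only a minimal sketch of the general idea: augmenting an encoder's output representations instead of the input spectrogram or waveform. The masking and noise operations, function name, and hyperparameters are assumptions for illustration, not the paper's recipe.

```python
import torch

def representation_augment(reps: torch.Tensor,
                           mask_prob: float = 0.2,
                           noise_std: float = 0.1) -> torch.Tensor:
    """Illustrative representation-level augmentation.

    reps: (batch, time, dim) features from any pretrained encoder
    (spectrogram- or waveform-based), so the augmentation does not
    depend on the input format.
    """
    # Randomly zero out whole feature frames with probability mask_prob
    # (analogous to time masking, but on learned representations
    # instead of spectrograms).
    frame_mask = (torch.rand(reps.shape[:2]) > mask_prob).float().unsqueeze(-1)
    augmented = reps * frame_mask
    # Add small Gaussian noise in representation space.
    return augmented + noise_std * torch.randn_like(reps)

# Usage: features from e.g. a wav2vec2-style waveform encoder
feats = torch.randn(8, 199, 768)           # (batch, frames, hidden dim)
aug_feats = representation_augment(feats)  # fed to the classification head
```

Because the augmentation acts on (batch, time, dim) features from any pretrained encoder, the same code applies to spectrogram-based and waveform-based models, which is the input-agnostic property the title refers to.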
Citations: 0
Steered Response Power for Sound Source Localization: A Tutorial Review
Pub Date : 2024-05-05 DOI: arxiv-2405.02991
Eric Grinstein, Elisa Tengan, Bilgesu Çakmak, Thomas Dietzen, Leonardo Nunes, Toon van Waterschoot, Mike Brookes, Patrick A. Naylor
In the last three decades, the Steered Response Power (SRP) method has been widely used for the task of Sound Source Localization (SSL), due to its satisfactory localization performance on moderately reverberant and noisy scenarios. Many works have analyzed and extended the original SRP method to reduce its computational cost, to allow it to locate multiple sources, or to improve its performance in adverse environments. In this work, we review over 200 papers on the SRP method and its variants, with emphasis on the SRP-PHAT method. We also present eXtensible-SRP, or X-SRP, a generalized and modularized version of the SRP algorithm which allows the reviewed extensions to be implemented. We provide a Python implementation of the algorithm which includes selected extensions from the literature.
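For readers new to the method under review, below is a textbook-style SRP-PHAT sketch: GCC-PHAT cross-correlations of all microphone pairs are summed at the lags implied by each candidate source position. This is the conventional formulation written for clarity, not the X-SRP code released with the paper; the grid, sampling rate, and speed of sound are left to the caller.

```python
import numpy as np

def gcc_phat(x, y, n_fft):
    """GCC-PHAT cross-correlation between two microphone signals."""
    X = np.fft.rfft(x, n=n_fft)
    Y = np.fft.rfft(y, n=n_fft)
    cross = X * np.conj(Y)
    cross /= np.abs(cross) + 1e-12        # PHAT weighting
    return np.fft.irfft(cross, n=n_fft)   # correlation indexed by lag (samples)

def srp_phat_map(signals, mic_pos, grid, fs, c=343.0):
    """Conventional SRP-PHAT power map over candidate source positions.

    signals: (n_mics, n_samples) array, mic_pos: (n_mics, 3) positions in m,
    grid: (n_points, 3) candidate source positions in m.
    """
    n_mics, n_samples = signals.shape
    n_fft = 2 * n_samples
    power = np.zeros(len(grid))
    for i in range(n_mics):
        for j in range(i + 1, n_mics):
            cc = gcc_phat(signals[i], signals[j], n_fft)
            for g, p in enumerate(grid):
                # Expected time-difference of arrival for this pair and point.
                tdoa = (np.linalg.norm(p - mic_pos[i]) -
                        np.linalg.norm(p - mic_pos[j])) / c
                lag = int(round(tdoa * fs)) % n_fft   # wrap negative lags
                power[g] += cc[lag]
    return power
```

The location estimate is the grid point that maximizes the accumulated power; the extensions surveyed in the paper typically change, for example, how this grid is searched or how the pairwise contributions are computed and combined.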
Citations: 0
Quranic Audio Dataset: Crowdsourced and Labeled Recitation from Non-Arabic Speakers
Pub Date : 2024-05-04 DOI: arxiv-2405.02675
Raghad Salameh, Mohamad Al Mdfaa, Nursultan Askarbekuly, Manuel Mazzara
This paper addresses the challenge of learning to recite the Quran for non-Arabic speakers. We explore the possibility of crowdsourcing a carefully annotated Quranic dataset, on top of which AI models can be built to simplify the learning process. In particular, we use the volunteer-based crowdsourcing genre and implement a crowdsourcing API to gather audio assets. We integrated the API into an existing mobile application called NamazApp to collect audio recitations. We developed a crowdsourcing platform called Quran Voice for annotating the gathered audio assets. As a result, we have collected around 7000 Quranic recitations from a pool of 1287 participants across more than 11 non-Arabic countries, and we have annotated 1166 recitations from the dataset in six categories. We have achieved a crowd accuracy of 0.77, an inter-rater agreement of 0.63 between the annotators, and 0.89 between the labels assigned by the algorithm and the expert judgments.
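The abstract reports a crowd accuracy of 0.77, an inter-rater agreement of 0.63, and an agreement of 0.89 between algorithmic labels and expert judgments, without naming the statistic. As a hedged illustration only, the snippet below computes chance-corrected agreement with Cohen's kappa on invented category labels; the paper's six annotation categories and its exact agreement measure may differ.

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-recitation labels from two annotators; the category
# names here are invented and do not match the paper's taxonomy.
annotator_a = ["correct", "minor_error", "correct", "major_error"]
annotator_b = ["correct", "minor_error", "major_error", "major_error"]

# Chance-corrected agreement between two raters.
kappa_ab = cohen_kappa_score(annotator_a, annotator_b)
print(f"inter-rater agreement (Cohen's kappa): {kappa_ab:.2f}")
```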
Citations: 0
Toward end-to-end interpretable convolutional neural networks for waveform signals
Pub Date : 2024-05-03 DOI: arxiv-2405.01815
Linh Vu, Thu Tran, Wern-Han Lim, Raphael Phan
This paper introduces a novel convolutional neural network (CNN) framework tailored for end-to-end audio deep learning models, presenting advancements in efficiency and explainability. By benchmarking experiments on three standard speech emotion recognition datasets with five-fold cross-validation, our framework outperforms Mel spectrogram features by up to seven percent. It can potentially replace the Mel-Frequency Cepstral Coefficients (MFCC) while remaining lightweight. Furthermore, we demonstrate the efficiency and interpretability of the front-end layer using the PhysioNet Heart Sound Database, illustrating its ability to handle and capture intricate long waveform patterns. Our contributions offer a portable solution for building efficient and interpretable models for raw waveform data.
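The abstract argues that a learnable front-end on raw waveforms can replace MFCC features. The sketch below is a generic 1-D convolutional front-end of that kind, with window and hop sizes chosen to mimic the common 25 ms / 10 ms MFCC framing at 16 kHz; it is not the interpretable architecture the paper proposes.

```python
import torch
import torch.nn as nn

class WaveformFrontEnd(nn.Module):
    """Minimal learnable front-end operating on raw waveforms."""

    def __init__(self, n_filters: int = 40, kernel_size: int = 400,
                 stride: int = 160):
        super().__init__()
        # 400-sample kernels with a 160-sample stride correspond to
        # roughly 25 ms windows and a 10 ms hop at 16 kHz.
        self.conv = nn.Conv1d(1, n_filters, kernel_size, stride=stride)
        self.norm = nn.BatchNorm1d(n_filters)

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) -> features: (batch, n_filters, frames)
        x = torch.abs(self.conv(waveform.unsqueeze(1)))  # rectified filter outputs
        return self.norm(torch.log1p(x))                 # compressed, normalized

frontend = WaveformFrontEnd()
feats = frontend(torch.randn(4, 16000))   # 1 s of 16 kHz audio per item
print(feats.shape)                        # torch.Size([4, 40, 98])
```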
Citations: 0
Training-Free Deepfake Voice Recognition by Leveraging Large-Scale Pre-Trained Models
Pub Date : 2024-05-03 DOI: arxiv-2405.02179
Alessandro Pianese, Davide Cozzolino, Giovanni Poggi, Luisa Verdoliva
Generalization is a main issue for current audio deepfake detectors, which struggle to provide reliable results on out-of-distribution data. Given the speed at which more and more accurate synthesis methods are developed, it is very important to design techniques that work well also on data they were not trained for. In this paper we study the potential of large-scale pre-trained models for audio deepfake detection, with special focus on generalization ability. To this end, the detection problem is reformulated in a speaker verification framework and fake audios are exposed by the mismatch between the voice sample under test and the voice of the claimed identity. With this paradigm, no fake speech sample is necessary in training, cutting off any link with the generation method at the root, and ensuring full generalization ability. Features are extracted by general-purpose large pre-trained models, with no need for training or fine-tuning on specific fake detection or speaker verification datasets. At detection time only a limited set of voice fragments of the identity under test is required. Experiments on several datasets widespread in the community show that detectors based on pre-trained models achieve excellent performance and show strong generalization ability, rivaling supervised methods on in-distribution data and largely overcoming them on out-of-distribution data.
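To make the described paradigm concrete, the sketch below shows the speaker-verification style decision on precomputed embeddings: the audio under test is scored against genuine voice fragments of the claimed identity, and a low similarity exposes the sample as fake. The embedding source, dimensionality, and threshold are placeholders; any general-purpose large pre-trained model could supply the embeddings, and no fake data is involved.

```python
import numpy as np

def verify_identity(test_emb: np.ndarray,
                    reference_embs: np.ndarray,
                    threshold: float = 0.5) -> bool:
    """Speaker-verification-style deepfake check on embeddings.

    test_emb: embedding of the audio under test, shape (dim,).
    reference_embs: embeddings of genuine voice fragments of the claimed
    identity, shape (n_refs, dim). The threshold is arbitrary here; in
    practice it would be calibrated on genuine data only.
    """
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

    # Average the similarity of the test sample to each enrolled genuine
    # fragment; a low score means the voice does not match the claimed
    # identity, so the audio is flagged as fake.
    score = np.mean([cosine(test_emb, r) for r in reference_embs])
    return score >= threshold

# Usage with random placeholders standing in for real embeddings:
rng = np.random.default_rng(0)
refs = rng.normal(size=(5, 192))   # 5 genuine fragments of the identity
test = rng.normal(size=192)        # embedding of the audio under test
print("genuine voice" if verify_identity(test, refs) else "flagged as fake")
```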
Citations: 0
GMP-ATL: Gender-augmented Multi-scale Pseudo-label Enhanced Adaptive Transfer Learning for Speech Emotion Recognition via HuBERT
Pub Date : 2024-05-03 DOI: arxiv-2405.02151
Yu Pan, Yuguang Yang, Heng Lu, Lei Ma, Jianjun Zhao
The continuous evolution of pre-trained speech models has greatly advanced Speech Emotion Recognition (SER). However, there is still potential for enhancement in the performance of these methods. In this paper, we present GMP-ATL (Gender-augmented Multi-scale Pseudo-label Adaptive Transfer Learning), a novel HuBERT-based adaptive transfer learning framework for SER. Specifically, GMP-ATL initially employs the pre-trained HuBERT, implementing multi-task learning and multi-scale k-means clustering to acquire frame-level gender-augmented multi-scale pseudo-labels. Then, to fully leverage both obtained frame-level and utterance-level emotion labels, we incorporate model retraining and fine-tuning methods to further optimize GMP-ATL. Experiments on IEMOCAP show that our GMP-ATL achieves superior recognition performance, with a WAR of 80.0% and a UAR of 82.0%, surpassing state-of-the-art unimodal SER methods, while also yielding comparable results with multimodal SER approaches.
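One ingredient named explicitly in the abstract is multi-scale k-means clustering of frame-level features to obtain pseudo-labels. The sketch below shows that step in isolation, assuming HuBERT-style frame features have already been extracted; the cluster sizes are invented, and the gender augmentation and utterance-level labels are outside this snippet.

```python
import numpy as np
from sklearn.cluster import KMeans

def multiscale_pseudo_labels(frame_features: np.ndarray,
                             cluster_sizes=(50, 100, 500)) -> dict:
    """Assign each frame a pseudo-label at several granularities.

    frame_features: (n_frames, dim) hidden states from a pretrained
    encoder such as HuBERT, stacked over the training data. The cluster
    sizes are illustrative, not the scales used in the paper.
    """
    labels = {}
    for k in cluster_sizes:
        km = KMeans(n_clusters=k, n_init=10, random_state=0)
        labels[k] = km.fit_predict(frame_features)  # (n_frames,) ints in [0, k)
    return labels

# Toy usage: 2000 frames of 768-dimensional features
feats = np.random.randn(2000, 768).astype(np.float32)
pseudo = multiscale_pseudo_labels(feats, cluster_sizes=(10, 50))
print({k: v[:5] for k, v in pseudo.items()})
```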
Citations: 0
Can We Identify Unknown Audio Recording Environments in Forensic Scenarios?
Pub Date : 2024-05-03 DOI: arxiv-2405.02119
Denise Moussa, Germans Hirsch, Christian Riess
Audio recordings may provide important evidence in criminal investigations. One such case is the forensic association of the recorded audio to the recording location. For example, a voice message may be the only investigative cue to narrow down the candidate sites for a crime. Up to now, several works provide tools for closed-set recording environment classification under relatively clean recording conditions. However, in forensic investigations, the candidate locations are case-specific. Thus, closed-set tools are not applicable without retraining on a sufficient amount of training samples for each case and respective candidate set. In addition, a forensic tool has to deal with audio material from uncontrolled sources with variable properties and quality. In this work, we therefore attempt a major step towards practical forensic application scenarios. We propose a representation learning framework called EnvId, short for environment identification. EnvId avoids case-specific retraining. Instead, it is the first tool for robust few-shot classification of unseen environment locations. We demonstrate that EnvId can handle forensically challenging material. It provides good quality predictions even under unseen signal degradations, environment characteristics or recording position mismatches. Our code and datasets will be made publicly available upon acceptance.
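The abstract does not detail EnvId's architecture, but few-shot classification of unseen locations without retraining commonly reduces to comparing a query embedding against prototypes built from a handful of reference recordings per candidate site. The sketch below shows that matching step with a nearest-prototype rule; the embeddings are random placeholders standing in for the learned representation.

```python
import numpy as np

def nearest_prototype(query_emb: np.ndarray, support: dict) -> str:
    """Few-shot location matching by nearest class prototype.

    support maps each case-specific candidate location to a small set of
    embeddings, shape (n_shots, dim), produced by a fixed representation
    model. No retraining is needed when new candidate locations appear.
    """
    def l2norm(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

    q = l2norm(query_emb)
    best_loc, best_sim = None, -np.inf
    for loc, embs in support.items():
        proto = l2norm(l2norm(embs).mean(axis=0))  # class prototype
        sim = float(q @ proto)                     # cosine similarity
        if sim > best_sim:
            best_loc, best_sim = loc, sim
    return best_loc

# Toy usage with random embeddings for three candidate sites
rng = np.random.default_rng(1)
support = {f"site_{i}": rng.normal(size=(5, 128)) for i in range(3)}
print(nearest_prototype(rng.normal(size=128), support))
```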
Citations: 0
Joint sentiment analysis of lyrics and audio in music
Pub Date : 2024-05-03 DOI: arxiv-2405.01988
Lea Schaab, Anna Kruspe
Sentiment or mood can express itself on various levels in music. In automatic analysis, the actual audio data is usually analyzed, but the lyrics can also play a crucial role in the perception of moods. We first evaluate various models for sentiment analysis based on lyrics and audio separately. The corresponding approaches already show satisfactory results, but they also exhibit weaknesses, the causes of which we examine in more detail. Furthermore, different approaches to combining the audio and lyrics results are proposed and evaluated. Considering both modalities generally leads to improved performance. We investigate misclassifications and (also intentional) contradictions between audio and lyrics sentiment more closely, and identify possible causes. Finally, we address fundamental problems in this research area, such as high subjectivity, lack of data, and inconsistency in emotion taxonomies.
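As one simple instance of combining the audio and lyrics results, the sketch below applies weighted late fusion of the two classifiers' probability outputs. The mood taxonomy and weights are illustrative only; the paper evaluates several fusion approaches and also analyzes cases where the two modalities contradict each other, sometimes intentionally.

```python
import numpy as np

MOODS = ["happy", "sad", "angry", "relaxed"]  # illustrative taxonomy only

def late_fusion(p_audio: np.ndarray, p_lyrics: np.ndarray,
                w_audio: float = 0.5) -> str:
    """Combine per-modality mood probabilities by weighted averaging.

    p_audio / p_lyrics are probability vectors over MOODS coming from
    two independently trained classifiers.
    """
    fused = w_audio * p_audio + (1.0 - w_audio) * p_lyrics
    return MOODS[int(np.argmax(fused))]

# Example: the audio sounds happy, the lyrics read sad; fusion weighs both.
print(late_fusion(np.array([0.6, 0.2, 0.1, 0.1]),
                  np.array([0.1, 0.7, 0.1, 0.1])))
```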
Citations: 0
Unveiling the Potential of LLM-Based ASR on Chinese Open-Source Datasets
Pub Date : 2024-05-03 DOI: arxiv-2405.02132
Xuelong Geng, Tianyi Xu, Kun Wei, Bingshen Mu, Hongfei Xue, He Wang, Yangze Li, Pengcheng Guo, Yuhang Dai, Longhao Li, Mingchen Shao, Lei Xie
Large Language Models (LLMs) have demonstrated unparalleled effectiveness in various NLP tasks, and integrating LLMs with automatic speech recognition (ASR) is becoming a mainstream paradigm. Building upon this momentum, our research delves into an in-depth examination of this paradigm on a large open-source Chinese dataset. Specifically, our research aims to evaluate the impact of various configurations of speech encoders, LLMs, and projector modules in the context of the speech foundation encoder-LLM ASR paradigm. Furthermore, we introduce a three-stage training approach, expressly developed to enhance the model's ability to align auditory and textual information. The implementation of this approach, alongside the strategic integration of ASR components, enabled us to achieve the SOTA performance on the AISHELL-1, Test_Net, and Test_Meeting test sets. Our analysis presents an empirical foundation for future research in LLM-based ASR systems and offers insights into optimizing performance using Chinese datasets. We will publicly release all scripts used for data preparation, training, inference, and scoring, as well as pre-trained models and training logs to promote reproducible research.
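The speech foundation encoder-LLM paradigm hinges on a projector module that maps encoder frames into the LLM's embedding space. The sketch below is a minimal such projector (frame stacking followed by an MLP); the dimensions, stacking factor, and layer choices are assumptions rather than the configurations evaluated in the paper.

```python
import torch
import torch.nn as nn

class SpeechProjector(nn.Module):
    """Minimal projector bridging a speech encoder and an LLM.

    Frame features from a speech foundation encoder are downsampled by
    stacking adjacent frames, then mapped into the LLM's token embedding
    space, where they would typically be prepended to the text prompt
    embeddings. All sizes here are illustrative.
    """
    def __init__(self, enc_dim=1024, llm_dim=4096, stack=4):
        super().__init__()
        self.stack = stack
        self.proj = nn.Sequential(
            nn.Linear(enc_dim * stack, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, speech_feats: torch.Tensor) -> torch.Tensor:
        # speech_feats: (batch, frames, enc_dim)
        b, t, d = speech_feats.shape
        t = t - t % self.stack                    # drop trailing frames
        x = speech_feats[:, :t].reshape(b, t // self.stack, d * self.stack)
        return self.proj(x)                       # (batch, t/stack, llm_dim)

proj = SpeechProjector()
speech_tokens = proj(torch.randn(2, 250, 1024))   # ~5 s of 50 Hz frames
print(speech_tokens.shape)                        # torch.Size([2, 62, 4096])
```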
Citations: 0
Real-time multichannel deep speech enhancement in hearing aids: Comparing monaural and binaural processing in complex acoustic scenarios
Pub Date : 2024-05-03 DOI: arxiv-2405.01967
Nils L. Westhausen, Hendrik Kayser, Theresa Jansen, Bernd T. Meyer
Deep learning has the potential to enhance speech signals and increase their intelligibility for users of hearing aids. Deep models suited for real-world application should feature a low computational complexity and low processing delay of only a few milliseconds. In this paper, we explore deep speech enhancement that matches these requirements and contrast monaural and binaural processing algorithms in two complex acoustic scenes. Both algorithms are evaluated with objective metrics and in experiments with hearing-impaired listeners performing a speech-in-noise test. Results are compared to two traditional enhancement strategies, i.e., adaptive differential microphone processing and binaural beamforming. While all algorithms perform similarly in diffuse noise, the binaural deep learning approach performs best in the presence of spatial interferers. Through a post-analysis, this can be attributed to improvements at low SNRs and to precise spatial filtering.
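For context on the traditional baselines mentioned, the sketch below implements a fixed first-order delay-and-subtract differential microphone, the non-adaptive core of adaptive differential microphone processing: the rear signal is delayed by the acoustic travel time across the microphone spacing and subtracted, placing a spatial null towards the rear. Spacing, sampling rate, and the toy signals are illustrative; practical hearing-aid systems additionally adapt the null direction and equalize the resulting high-pass response.

```python
import numpy as np

def differential_microphone(front: np.ndarray, rear: np.ndarray,
                            fs: int, spacing_m: float = 0.012,
                            c: float = 343.0) -> np.ndarray:
    """Fixed first-order delay-and-subtract differential microphone."""
    delay = spacing_m / c                          # travel time across the pair
    n = len(front)
    freqs = np.fft.rfftfreq(n, d=1.0 / fs)
    # Apply the (fractional) delay to the rear signal in the frequency domain.
    rear_delayed = np.fft.irfft(np.fft.rfft(rear) *
                                np.exp(-2j * np.pi * freqs * delay), n=n)
    # Subtracting the delayed rear signal nulls sound arriving from behind.
    return front - rear_delayed

fs = 16000
t = np.arange(fs) / fs
front = np.sin(2 * np.pi * 440 * t)   # toy front-microphone signal
rear = np.roll(front, 2)              # rear mic hears the same source later
out = differential_microphone(front, rear, fs)
```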
Citations: 0