Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang
Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate our models' outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect the detection accuracy, necessitating further research.
{"title":"Cross-Domain Audio Deepfake Detection: Dataset and Analysis","authors":"Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang","doi":"arxiv-2404.04904","DOIUrl":"https://doi.org/arxiv-2404.04904","url":null,"abstract":"Audio deepfake detection (ADD) is essential for preventing the misuse of\u0000synthetic voices that may infringe on personal rights and privacy. Recent\u0000zero-shot text-to-speech (TTS) models pose higher risks as they can clone\u0000voices with a single utterance. However, the existing ADD datasets are\u0000outdated, leading to suboptimal generalization of detection models. In this\u0000paper, we construct a new cross-domain ADD dataset comprising over 300 hours of\u0000speech data that is generated by five advanced zero-shot TTS models. To\u0000simulate real-world scenarios, we employ diverse attack methods and audio\u0000prompts from different datasets. Experiments show that, through novel\u0000attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve\u0000equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate\u0000our models' outstanding few-shot ADD ability by fine-tuning with just one\u0000minute of target-domain data. Nonetheless, neural codec compressors greatly\u0000affect the detection accuracy, necessitating further research.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria
Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While TTS architectures that train and test on the same set of speakers have seen significant improvements, out-of-domain speaker performance still faces severe limitations. Domain adaptation to a new set of speakers can be achieved by fine-tuning the whole model for each new domain, which is parameter-inefficient. Adapters offer a parameter-efficient alternative for domain adaptation, but although they are well established in NLP, speech synthesis has so far benefited little from them. In this work, we present HyperTTS, which comprises a small learnable network, a "hypernetwork", that generates the parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and make them dynamic. Extensive evaluations in two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS against baselines in several studies. The promising results on dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.
{"title":"HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks","authors":"Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria","doi":"arxiv-2404.04645","DOIUrl":"https://doi.org/arxiv-2404.04645","url":null,"abstract":"Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal\u0000from the text domain to the speech domain. While developing TTS architectures\u0000that train and test on the same set of speakers has seen significant\u0000improvements, out-of-domain speaker performance still faces enormous\u0000limitations. Domain adaptation on a new set of speakers can be achieved by\u0000fine-tuning the whole model for each new domain, thus making it\u0000parameter-inefficient. This problem can be solved by Adapters that provide a\u0000parameter-efficient alternative to domain adaptation. Although famous in NLP,\u0000speech synthesis has not seen much improvement from Adapters. In this work, we\u0000present HyperTTS, which comprises a small learnable network, \"hypernetwork\",\u0000that generates parameters of the Adapter blocks, allowing us to condition\u0000Adapters on speaker representations and making them dynamic. Extensive\u0000evaluations of two domain adaptation settings demonstrate its effectiveness in\u0000achieving state-of-the-art performance in the parameter-efficient regime. We\u0000also compare different variants of HyperTTS, comparing them with baselines in\u0000different studies. Promising results on the dynamic adaptation of adapter\u0000parameters using hypernetworks open up new avenues for domain-generic\u0000multi-speaker TTS systems. The audio samples and code are available at\u0000https://github.com/declare-lab/HyperTTS.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Suma K V, Deepali Koppad, Preethi Kumar, Neha A Kantikar, Surabhi Ramesh
In recent years, advancements in deep learning techniques have considerably enhanced the efficiency and accuracy of medical diagnostics. In this work, a novel approach using multi-task learning (MTL) for the simultaneous classification of lung sounds and lung diseases is proposed. The proposed model leverages MTL with four different deep learning backbones, namely a 2D CNN, ResNet50, MobileNet and DenseNet, to extract relevant features from the lung sound recordings. The ICBHI 2017 Respiratory Sound Database was employed in the current study. The MTL MobileNet model performed better than the other models considered, with an accuracy of 74% for lung sound analysis and 91% for lung disease classification. Results of the experimentation demonstrate the efficacy of our approach in classifying both lung sounds and lung diseases concurrently. In this study, using the demographic data of the patients from the database, risk level computation for Chronic Obstructive Pulmonary Disease is also carried out. For this computation, three machine learning algorithms, namely Logistic Regression, SVM and Random Forest classifiers, were employed. Among these ML algorithms, the Random Forest classifier had the highest accuracy, at 92%. This work helps considerably reduce the physician's burden of not just diagnosing the pathology but also effectively communicating to the patient the possible causes or outcomes.
{"title":"Multi-Task Learning for Lung sound & Lung disease classification","authors":"Suma K V, Deepali Koppad, Preethi Kumar, Neha A Kantikar, Surabhi Ramesh","doi":"arxiv-2404.03908","DOIUrl":"https://doi.org/arxiv-2404.03908","url":null,"abstract":"In recent years, advancements in deep learning techniques have considerably\u0000enhanced the efficiency and accuracy of medical diagnostics. In this work, a\u0000novel approach using multi-task learning (MTL) for the simultaneous\u0000classification of lung sounds and lung diseases is proposed. Our proposed model\u0000leverages MTL with four different deep learning models such as 2D CNN,\u0000ResNet50, MobileNet and Densenet to extract relevant features from the lung\u0000sound recordings. The ICBHI 2017 Respiratory Sound Database was employed in the\u0000current study. The MTL for MobileNet model performed better than the other\u0000models considered, with an accuracy of74% for lung sound analysis and 91% for\u0000lung diseases classification. Results of the experimentation demonstrate the\u0000efficacy of our approach in classifying both lung sounds and lung diseases\u0000concurrently. In this study,using the demographic data of the patients from the database,\u0000risk level computation for Chronic Obstructive Pulmonary Disease is also\u0000carried out. For this computation, three machine learning algorithms namely\u0000Logistic Regression, SVM and Random Forest classifierswere employed. Among\u0000these ML algorithms, the Random Forest classifier had the highest accuracy of\u000092%.This work helps in considerably reducing the physician's burden of not\u0000just diagnosing the pathology but also effectively communicating to the patient\u0000about the possible causes or outcomes.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural models are one of the most popular approaches for music generation, yet there are no standard large datasets tailored for learning music directly from game data. To address this research gap, we introduce a novel dataset named NES-VMDB, containing 98,940 gameplay videos from 389 NES games, each paired with its original soundtrack in symbolic format (MIDI). NES-VMDB is built upon the Nintendo Entertainment System Music Database (NES-MDB), which encompasses 5,278 music pieces from 397 NES games. Our approach involves collecting long-play videos for 389 games of the original dataset, slicing them into 15-second clips, and extracting the audio from each clip. Subsequently, we apply an audio fingerprinting algorithm (similar to Shazam) to automatically identify the corresponding piece in the NES-MDB dataset. Additionally, we introduce a baseline method based on the Controllable Music Transformer (CMT) to generate NES music conditioned on gameplay clips. We evaluated this approach with objective metrics, and the results showed that the conditional CMT improves musical structural quality when compared to its unconditional counterpart. Moreover, we used a neural classifier to predict the game genre of the generated pieces. Results showed that the CMT generator can learn correlations between gameplay videos and game genres, but further research is needed to achieve human-level performance.
{"title":"The NES Video-Music Database: A Dataset of Symbolic Video Game Music Paired with Gameplay Videos","authors":"Igor Cardoso, Rubens O. Moraes, Lucas N. Ferreira","doi":"arxiv-2404.04420","DOIUrl":"https://doi.org/arxiv-2404.04420","url":null,"abstract":"Neural models are one of the most popular approaches for music generation,\u0000yet there aren't standard large datasets tailored for learning music directly\u0000from game data. To address this research gap, we introduce a novel dataset\u0000named NES-VMDB, containing 98,940 gameplay videos from 389 NES games, each\u0000paired with its original soundtrack in symbolic format (MIDI). NES-VMDB is\u0000built upon the Nintendo Entertainment System Music Database (NES-MDB),\u0000encompassing 5,278 music pieces from 397 NES games. Our approach involves\u0000collecting long-play videos for 389 games of the original dataset, slicing them\u0000into 15-second-long clips, and extracting the audio from each clip.\u0000Subsequently, we apply an audio fingerprinting algorithm (similar to Shazam) to\u0000automatically identify the corresponding piece in the NES-MDB dataset.\u0000Additionally, we introduce a baseline method based on the Controllable Music\u0000Transformer to generate NES music conditioned on gameplay clips. We evaluated\u0000this approach with objective metrics, and the results showed that the\u0000conditional CMT improves musical structural quality when compared to its\u0000unconditional counterpart. Moreover, we used a neural classifier to predict the\u0000game genre of the generated pieces. Results showed that the CMT generator can\u0000learn correlations between gameplay videos and game genres, but further\u0000research has to be conducted to achieve human-level performance.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural speech codecs have recently gained widespread attention in generative speech modeling domains such as voice conversion and text-to-speech synthesis. However, ensuring high-fidelity audio reconstruction by speech codecs under high compression rates remains an open and challenging issue. In this paper, we propose PromptCodec, a novel end-to-end neural speech codec model using disentangled-representation-learning-based feature-aware prompt encoders. By incorporating additional feature representations from the prompt encoders, PromptCodec can distribute the speech information that requires processing and enhance its capabilities. Moreover, a simple yet effective adaptive feature weighted fusion approach is introduced to integrate the features of the different encoders. Meanwhile, we propose a novel disentangled representation learning strategy based on cosine distance to optimize PromptCodec's encoders and ensure their efficiency, thereby further improving the performance of PromptCodec. Experiments on LibriTTS demonstrate that our proposed PromptCodec consistently outperforms state-of-the-art neural speech codec models under all bitrate conditions while achieving impressive performance at low bitrates.
{"title":"PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders","authors":"Yu Pan, Lei Ma, Jianjun Zhao","doi":"arxiv-2404.02702","DOIUrl":"https://doi.org/arxiv-2404.02702","url":null,"abstract":"Neural speech codec has recently gained widespread attention in generative\u0000speech modeling domains, like voice conversion, text-to-speech synthesis, etc.\u0000However, ensuring high-fidelity audio reconstruction of speech codecs under\u0000high compression rates remains an open and challenging issue. In this paper, we\u0000propose PromptCodec, a novel end-to-end neural speech codec model using\u0000disentangled representation learning based feature-aware prompt encoders. By\u0000incorporating additional feature representations from prompt encoders,\u0000PromptCodec can distribute the speech information requiring processing and\u0000enhance its capabilities. Moreover, a simple yet effective adaptive feature\u0000weighted fusion approach is introduced to integrate features of different\u0000encoders. Meanwhile, we propose a novel disentangled representation learning\u0000strategy based on cosine distance to optimize PromptCodec's encoders to ensure\u0000their efficiency, thereby further improving the performance of PromptCodec.\u0000Experiments on LibriTTS demonstrate that our proposed PromptCodec consistently\u0000outperforms state-of-the-art neural speech codec models under all different\u0000bitrate conditions while achieving impressive performance with low bitrates.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140602683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio generation. Although they represent impressive technological progress, the effectiveness of their use in the development of audio applications remains uncertain. This paper aims to investigate these aspects, focusing specifically on the task of classification of environmental sounds. This study analyzes the performance of two different environmental sound classification systems when data generated by text-to-audio models is used for training. Two cases are considered: a) when the training dataset is augmented with data coming from two different text-to-audio models; and b) when the training dataset consists solely of synthetic audio generated by those models. In both cases, the performance of the classification task is tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, whereas performance drops when relying solely on generated audio.
{"title":"Synthesizing Soundscapes: Leveraging Text-to-Audio Models for Environmental Sound Classification","authors":"Francesca Ronchini, Luca Comanducci, Fabio Antonacci","doi":"arxiv-2403.17864","DOIUrl":"https://doi.org/arxiv-2403.17864","url":null,"abstract":"In the past few years, text-to-audio models have emerged as a significant\u0000advancement in automatic audio generation. Although they represent impressive\u0000technological progress, the effectiveness of their use in the development of\u0000audio applications remains uncertain. This paper aims to investigate these\u0000aspects, specifically focusing on the task of classification of environmental\u0000sounds. This study analyzes the performance of two different environmental\u0000classification systems when data generated from text-to-audio models is used\u0000for training. Two cases are considered: a) when the training dataset is\u0000augmented by data coming from two different text-to-audio models; and b) when\u0000the training dataset consists solely of synthetic audio generated. In both\u0000cases, the performance of the classification task is tested on real data.\u0000Results indicate that text-to-audio models are effective for dataset\u0000augmentation, whereas the performance of the models drops when relying on only\u0000generated audio.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Neri, Archontis Politis, Daniel Krause, Marco Carli, Tuomas Virtanen
Distance estimation from audio plays a crucial role in various applications, such as acoustic scene analysis, sound source localization, and room modeling. Most studies center on a classification approach, where distances are discretized into distinct categories, enabling smoother model training and higher accuracy but restricting the precision of the obtained sound source position. Toward this direction, in this paper we propose a novel approach for continuous distance estimation from audio signals using a convolutional recurrent neural network with an attention module. The attention mechanism enables the model to focus on relevant temporal and spectral features, enhancing its ability to capture fine-grained distance-related information. To evaluate the effectiveness of our proposed method, we conduct extensive experiments using audio recordings in controlled environments with three levels of realism (synthetic room impulse responses, measured responses convolved with speech, and real recordings) on four datasets (our synthetic dataset, QMULTIMIT, VoiceHome-2, and STARSS23). Experimental results show that the model achieves an absolute error of 0.11 meters in a noiseless synthetic scenario and an absolute error of about 1.30 meters in the hybrid scenario. In the real scenario, where unpredictable environmental factors and noise are prevalent, the algorithm yields an absolute error of approximately 0.50 meters. For reproducible research purposes we make the model, code, and synthetic datasets available at https://github.com/michaelneri/audio-distance-estimation.
{"title":"Speaker Distance Estimation in Enclosures from Single-Channel Audio","authors":"Michael Neri, Archontis Politis, Daniel Krause, Marco Carli, Tuomas Virtanen","doi":"arxiv-2403.17514","DOIUrl":"https://doi.org/arxiv-2403.17514","url":null,"abstract":"Distance estimation from audio plays a crucial role in various applications,\u0000such as acoustic scene analysis, sound source localization, and room modeling.\u0000Most studies predominantly center on employing a classification approach, where\u0000distances are discretized into distinct categories, enabling smoother model\u0000training and achieving higher accuracy but imposing restrictions on the\u0000precision of the obtained sound source position. Towards this direction, in\u0000this paper we propose a novel approach for continuous distance estimation from\u0000audio signals using a convolutional recurrent neural network with an attention\u0000module. The attention mechanism enables the model to focus on relevant temporal\u0000and spectral features, enhancing its ability to capture fine-grained\u0000distance-related information. To evaluate the effectiveness of our proposed\u0000method, we conduct extensive experiments using audio recordings in controlled\u0000environments with three levels of realism (synthetic room impulse response,\u0000measured response with convolved speech, and real recordings) on four datasets\u0000(our synthetic dataset, QMULTIMIT, VoiceHome-2, and STARSS23). Experimental\u0000results show that the model achieves an absolute error of 0.11 meters in a\u0000noiseless synthetic scenario. Moreover, the results showed an absolute error of\u0000about 1.30 meters in the hybrid scenario. The algorithm's performance in the\u0000real scenario, where unpredictable environmental factors and noise are\u0000prevalent, yields an absolute error of approximately 0.50 meters. For\u0000reproducible research purposes we make model, code, and synthetic datasets\u0000available at https://github.com/michaelneri/audio-distance-estimation.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a novel neural speech phase prediction model that predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and the natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error with an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with the HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra only via neural networks.
{"title":"Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks","authors":"Yang Ai, Zhen-Hua Ling","doi":"arxiv-2403.17378","DOIUrl":"https://doi.org/arxiv-2403.17378","url":null,"abstract":"This paper presents a novel neural speech phase prediction model which\u0000predicts wrapped phase spectra directly from amplitude spectra. The proposed\u0000model is a cascade of a residual convolutional network and a parallel\u0000estimation architecture. The parallel estimation architecture is a core module\u0000for direct wrapped phase prediction. This architecture consists of two parallel\u0000linear convolutional layers and a phase calculation formula, imitating the\u0000process of calculating the phase spectra from the real and imaginary parts of\u0000complex spectra and strictly restricting the predicted phase values to the\u0000principal value interval. To avoid the error expansion issue caused by phase\u0000wrapping, we design anti-wrapping training losses defined between the predicted\u0000wrapped phase spectra and natural ones by activating the instantaneous phase\u0000error, group delay error and instantaneous angular frequency error using an\u0000anti-wrapping function. We mathematically demonstrate that the anti-wrapping\u0000function should possess three properties, namely parity, periodicity and\u0000monotonicity. We also achieve low-latency streamable phase prediction by\u0000combining causal convolutions and knowledge distillation training strategies.\u0000For both analysis-synthesis and specific speech generation tasks, experimental\u0000results show that our proposed neural speech phase prediction model outperforms\u0000the iterative phase estimation algorithms and neural network-based phase\u0000prediction methods in terms of phase prediction precision, efficiency and\u0000robustness. Compared with HiFi-GAN-based waveform reconstruction method, our\u0000proposed model also shows outstanding efficiency advantages while ensuring the\u0000quality of synthesized speech. To the best of our knowledge, we are the first\u0000to directly predict speech phase spectra from amplitude spectra only via neural\u0000networks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech Emotion Recognition (SER) plays a crucial role in advancing human-computer interaction and speech processing capabilities. We introduce a novel deep-learning architecture designed specifically for a functional data model known as the multiple-index functional model. Our key innovation lies in integrating adaptive basis layers and an automated data transformation search within the deep learning framework. Simulations for this new model show good performance. This allows us to extract features tailored for chunk-level SER, based on Mel-Frequency Cepstral Coefficients (MFCCs). We demonstrate the effectiveness of our approach on the benchmark IEMOCAP database, achieving competitive performance compared to existing methods.
{"title":"Deep functional multiple index models with an application to SER","authors":"Matthieu Saumard, Abir El Haj, Thibault Napoleon","doi":"arxiv-2403.17562","DOIUrl":"https://doi.org/arxiv-2403.17562","url":null,"abstract":"Speech Emotion Recognition (SER) plays a crucial role in advancing\u0000human-computer interaction and speech processing capabilities. We introduce a\u0000novel deep-learning architecture designed specifically for the functional data\u0000model known as the multiple-index functional model. Our key innovation lies in\u0000integrating adaptive basis layers and an automated data transformation search\u0000within the deep learning framework. Simulations for this new model show good\u0000performances. This allows us to extract features tailored for chunk-level SER,\u0000based on Mel Frequency Cepstral Coefficients (MFCCs). We demonstrate the\u0000effectiveness of our approach on the benchmark IEMOCAP database, achieving good\u0000performance compared to existing methods.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the ever-rising quality of deep generative models, it is increasingly important to be able to discern whether the audio data at hand have been recorded or synthesized. Although the detection of fake speech signals has been studied extensively, this is not the case for the detection of fake environmental audio. We propose a simple and efficient pipeline for detecting fake environmental sounds based on the CLAP audio embedding. We evaluate this detector using audio data from the 2023 DCASE challenge task on Foley sound synthesis. Our experiments show that fake sounds generated by 44 state-of-the-art synthesizers can be detected with 98% accuracy on average. We show that using an audio embedding learned on environmental audio is beneficial over a standard VGGish one, as it provides a 10% increase in detection performance. Informal listening to incorrectly classified negative examples reveals audible features of fake sounds missed by the detector, such as distortion and implausible background noise.
{"title":"Detection of Deepfake Environmental Audio","authors":"Hafsa Ouajdi, Oussama Hadder, Modan Tailleur, Mathieu Lagrange, Laurie M. Heller","doi":"arxiv-2403.17529","DOIUrl":"https://doi.org/arxiv-2403.17529","url":null,"abstract":"With the ever-rising quality of deep generative models, it is increasingly\u0000important to be able to discern whether the audio data at hand have been\u0000recorded or synthesized. Although the detection of fake speech signals has been\u0000studied extensively, this is not the case for the detection of fake\u0000environmental audio. We propose a simple and efficient pipeline for detecting fake environmental\u0000sounds based on the CLAP audio embedding. We evaluate this detector using audio\u0000data from the 2023 DCASE challenge task on Foley sound synthesis. Our experiments show that fake sounds generated by 44 state-of-the-art\u0000synthesizers can be detected on average with 98% accuracy. We show that using\u0000an audio embedding learned on environmental audio is beneficial over a standard\u0000VGGish one as it provides a 10% increase in detection performance. Informal\u0000listening to Incorrect Negative examples demonstrates audible features of fake\u0000sounds missed by the detector such as distortion and implausible background\u0000noise.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}