Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang
Audio deepfake detection (ADD) is essential for preventing the misuse of synthetic voices that may infringe on personal rights and privacy. Recent zero-shot text-to-speech (TTS) models pose higher risks as they can clone voices with a single utterance. However, the existing ADD datasets are outdated, leading to suboptimal generalization of detection models. In this paper, we construct a new cross-domain ADD dataset comprising over 300 hours of speech data that is generated by five advanced zero-shot TTS models. To simulate real-world scenarios, we employ diverse attack methods and audio prompts from different datasets. Experiments show that, through novel attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate our models' outstanding few-shot ADD ability by fine-tuning with just one minute of target-domain data. Nonetheless, neural codec compressors greatly affect the detection accuracy, necessitating further research.
{"title":"Cross-Domain Audio Deepfake Detection: Dataset and Analysis","authors":"Yuang Li, Min Zhang, Mengxin Ren, Miaomiao Ma, Daimeng Wei, Hao Yang","doi":"arxiv-2404.04904","DOIUrl":"https://doi.org/arxiv-2404.04904","url":null,"abstract":"Audio deepfake detection (ADD) is essential for preventing the misuse of\u0000synthetic voices that may infringe on personal rights and privacy. Recent\u0000zero-shot text-to-speech (TTS) models pose higher risks as they can clone\u0000voices with a single utterance. However, the existing ADD datasets are\u0000outdated, leading to suboptimal generalization of detection models. In this\u0000paper, we construct a new cross-domain ADD dataset comprising over 300 hours of\u0000speech data that is generated by five advanced zero-shot TTS models. To\u0000simulate real-world scenarios, we employ diverse attack methods and audio\u0000prompts from different datasets. Experiments show that, through novel\u0000attack-augmented training, the Wav2Vec2-large and Whisper-medium models achieve\u0000equal error rates of 4.1% and 6.5% respectively. Additionally, we demonstrate\u0000our models' outstanding few-shot ADD ability by fine-tuning with just one\u0000minute of target-domain data. Nonetheless, neural codec compressors greatly\u0000affect the detection accuracy, necessitating further research.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria
Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal from the text domain to the speech domain. While TTS architectures that train and test on the same set of speakers have seen significant improvements, out-of-domain speaker performance still faces severe limitations. Domain adaptation to a new set of speakers can be achieved by fine-tuning the whole model for each new domain, which is parameter-inefficient. Adapters offer a parameter-efficient alternative for domain adaptation, but although they are well established in NLP, speech synthesis has so far benefited little from them. In this work, we present HyperTTS, which comprises a small learnable network, a "hypernetwork", that generates the parameters of the Adapter blocks, allowing us to condition Adapters on speaker representations and make them dynamic. Extensive evaluations in two domain adaptation settings demonstrate its effectiveness in achieving state-of-the-art performance in the parameter-efficient regime. We also compare different variants of HyperTTS against baselines in several studies. The promising results on dynamic adaptation of adapter parameters using hypernetworks open up new avenues for domain-generic multi-speaker TTS systems. The audio samples and code are available at https://github.com/declare-lab/HyperTTS.
{"title":"HyperTTS: Parameter Efficient Adaptation in Text to Speech using Hypernetworks","authors":"Yingting Li, Rishabh Bhardwaj, Ambuj Mehrish, Bo Cheng, Soujanya Poria","doi":"arxiv-2404.04645","DOIUrl":"https://doi.org/arxiv-2404.04645","url":null,"abstract":"Neural speech synthesis, or text-to-speech (TTS), aims to transform a signal\u0000from the text domain to the speech domain. While developing TTS architectures\u0000that train and test on the same set of speakers has seen significant\u0000improvements, out-of-domain speaker performance still faces enormous\u0000limitations. Domain adaptation on a new set of speakers can be achieved by\u0000fine-tuning the whole model for each new domain, thus making it\u0000parameter-inefficient. This problem can be solved by Adapters that provide a\u0000parameter-efficient alternative to domain adaptation. Although famous in NLP,\u0000speech synthesis has not seen much improvement from Adapters. In this work, we\u0000present HyperTTS, which comprises a small learnable network, \"hypernetwork\",\u0000that generates parameters of the Adapter blocks, allowing us to condition\u0000Adapters on speaker representations and making them dynamic. Extensive\u0000evaluations of two domain adaptation settings demonstrate its effectiveness in\u0000achieving state-of-the-art performance in the parameter-efficient regime. We\u0000also compare different variants of HyperTTS, comparing them with baselines in\u0000different studies. Promising results on the dynamic adaptation of adapter\u0000parameters using hypernetworks open up new avenues for domain-generic\u0000multi-speaker TTS systems. The audio samples and code are available at\u0000https://github.com/declare-lab/HyperTTS.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586845","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Suma K V, Deepali Koppad, Preethi Kumar, Neha A Kantikar, Surabhi Ramesh
In recent years, advancements in deep learning techniques have considerably enhanced the efficiency and accuracy of medical diagnostics. In this work, a novel approach using multi-task learning (MTL) for the simultaneous classification of lung sounds and lung diseases is proposed. The proposed model leverages MTL with four different deep learning backbones, namely a 2D CNN, ResNet50, MobileNet and DenseNet, to extract relevant features from the lung sound recordings. The ICBHI 2017 Respiratory Sound Database was employed in the current study. The MTL MobileNet model performed better than the other models considered, with an accuracy of 74% for lung sound analysis and 91% for lung disease classification. Results of the experimentation demonstrate the efficacy of our approach in classifying both lung sounds and lung diseases concurrently. In this study, using the demographic data of the patients from the database, risk level computation for Chronic Obstructive Pulmonary Disease is also carried out. For this computation, three machine learning algorithms, namely Logistic Regression, SVM and Random Forest classifiers, were employed. Among these ML algorithms, the Random Forest classifier had the highest accuracy, at 92%. This work helps considerably reduce the physician's burden of not just diagnosing the pathology but also effectively communicating to the patient the possible causes or outcomes.
{"title":"Multi-Task Learning for Lung sound & Lung disease classification","authors":"Suma K V, Deepali Koppad, Preethi Kumar, Neha A Kantikar, Surabhi Ramesh","doi":"arxiv-2404.03908","DOIUrl":"https://doi.org/arxiv-2404.03908","url":null,"abstract":"In recent years, advancements in deep learning techniques have considerably\u0000enhanced the efficiency and accuracy of medical diagnostics. In this work, a\u0000novel approach using multi-task learning (MTL) for the simultaneous\u0000classification of lung sounds and lung diseases is proposed. Our proposed model\u0000leverages MTL with four different deep learning models such as 2D CNN,\u0000ResNet50, MobileNet and Densenet to extract relevant features from the lung\u0000sound recordings. The ICBHI 2017 Respiratory Sound Database was employed in the\u0000current study. The MTL for MobileNet model performed better than the other\u0000models considered, with an accuracy of74% for lung sound analysis and 91% for\u0000lung diseases classification. Results of the experimentation demonstrate the\u0000efficacy of our approach in classifying both lung sounds and lung diseases\u0000concurrently. In this study,using the demographic data of the patients from the database,\u0000risk level computation for Chronic Obstructive Pulmonary Disease is also\u0000carried out. For this computation, three machine learning algorithms namely\u0000Logistic Regression, SVM and Random Forest classifierswere employed. Among\u0000these ML algorithms, the Random Forest classifier had the highest accuracy of\u000092%.This work helps in considerably reducing the physician's burden of not\u0000just diagnosing the pathology but also effectively communicating to the patient\u0000about the possible causes or outcomes.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural models are one of the most popular approaches for music generation, yet there are no standard large datasets tailored for learning music directly from game data. To address this research gap, we introduce a novel dataset named NES-VMDB, containing 98,940 gameplay videos from 389 NES games, each paired with its original soundtrack in symbolic format (MIDI). NES-VMDB is built upon the Nintendo Entertainment System Music Database (NES-MDB), which encompasses 5,278 music pieces from 397 NES games. Our approach involves collecting long-play videos for 389 games of the original dataset, slicing them into 15-second clips, and extracting the audio from each clip. Subsequently, we apply an audio fingerprinting algorithm (similar to Shazam) to automatically identify the corresponding piece in the NES-MDB dataset. Additionally, we introduce a baseline method based on the Controllable Music Transformer (CMT) to generate NES music conditioned on gameplay clips. We evaluated this approach with objective metrics, and the results showed that the conditional CMT improves musical structural quality when compared to its unconditional counterpart. Moreover, we used a neural classifier to predict the game genre of the generated pieces. Results showed that the CMT generator can learn correlations between gameplay videos and game genres, but further research is needed to achieve human-level performance.
{"title":"The NES Video-Music Database: A Dataset of Symbolic Video Game Music Paired with Gameplay Videos","authors":"Igor Cardoso, Rubens O. Moraes, Lucas N. Ferreira","doi":"arxiv-2404.04420","DOIUrl":"https://doi.org/arxiv-2404.04420","url":null,"abstract":"Neural models are one of the most popular approaches for music generation,\u0000yet there aren't standard large datasets tailored for learning music directly\u0000from game data. To address this research gap, we introduce a novel dataset\u0000named NES-VMDB, containing 98,940 gameplay videos from 389 NES games, each\u0000paired with its original soundtrack in symbolic format (MIDI). NES-VMDB is\u0000built upon the Nintendo Entertainment System Music Database (NES-MDB),\u0000encompassing 5,278 music pieces from 397 NES games. Our approach involves\u0000collecting long-play videos for 389 games of the original dataset, slicing them\u0000into 15-second-long clips, and extracting the audio from each clip.\u0000Subsequently, we apply an audio fingerprinting algorithm (similar to Shazam) to\u0000automatically identify the corresponding piece in the NES-MDB dataset.\u0000Additionally, we introduce a baseline method based on the Controllable Music\u0000Transformer to generate NES music conditioned on gameplay clips. We evaluated\u0000this approach with objective metrics, and the results showed that the\u0000conditional CMT improves musical structural quality when compared to its\u0000unconditional counterpart. Moreover, we used a neural classifier to predict the\u0000game genre of the generated pieces. Results showed that the CMT generator can\u0000learn correlations between gameplay videos and game genres, but further\u0000research has to be conducted to achieve human-level performance.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140586528","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Neural speech codecs have recently gained widespread attention in generative speech modeling domains such as voice conversion and text-to-speech synthesis. However, ensuring high-fidelity audio reconstruction by speech codecs under high compression rates remains an open and challenging issue. In this paper, we propose PromptCodec, a novel end-to-end neural speech codec model using disentangled-representation-learning-based feature-aware prompt encoders. By incorporating additional feature representations from the prompt encoders, PromptCodec can distribute the speech information that requires processing and enhance its capabilities. Moreover, a simple yet effective adaptive feature weighted fusion approach is introduced to integrate the features of the different encoders. Meanwhile, we propose a novel disentangled representation learning strategy based on cosine distance to optimize PromptCodec's encoders and ensure their efficiency, thereby further improving the performance of PromptCodec. Experiments on LibriTTS demonstrate that our proposed PromptCodec consistently outperforms state-of-the-art neural speech codec models under all bitrate conditions while achieving impressive performance at low bitrates.
{"title":"PromptCodec: High-Fidelity Neural Speech Codec using Disentangled Representation Learning based Adaptive Feature-aware Prompt Encoders","authors":"Yu Pan, Lei Ma, Jianjun Zhao","doi":"arxiv-2404.02702","DOIUrl":"https://doi.org/arxiv-2404.02702","url":null,"abstract":"Neural speech codec has recently gained widespread attention in generative\u0000speech modeling domains, like voice conversion, text-to-speech synthesis, etc.\u0000However, ensuring high-fidelity audio reconstruction of speech codecs under\u0000high compression rates remains an open and challenging issue. In this paper, we\u0000propose PromptCodec, a novel end-to-end neural speech codec model using\u0000disentangled representation learning based feature-aware prompt encoders. By\u0000incorporating additional feature representations from prompt encoders,\u0000PromptCodec can distribute the speech information requiring processing and\u0000enhance its capabilities. Moreover, a simple yet effective adaptive feature\u0000weighted fusion approach is introduced to integrate features of different\u0000encoders. Meanwhile, we propose a novel disentangled representation learning\u0000strategy based on cosine distance to optimize PromptCodec's encoders to ensure\u0000their efficiency, thereby further improving the performance of PromptCodec.\u0000Experiments on LibriTTS demonstrate that our proposed PromptCodec consistently\u0000outperforms state-of-the-art neural speech codec models under all different\u0000bitrate conditions while achieving impressive performance with low bitrates.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140602683","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the past few years, text-to-audio models have emerged as a significant advancement in automatic audio generation. Although they represent impressive technological progress, the effectiveness of their use in the development of audio applications remains uncertain. This paper aims to investigate these aspects, focusing specifically on the task of classification of environmental sounds. This study analyzes the performance of two different environmental sound classification systems when data generated by text-to-audio models is used for training. Two cases are considered: a) when the training dataset is augmented with data coming from two different text-to-audio models; and b) when the training dataset consists solely of synthetic audio generated by those models. In both cases, the performance of the classification task is tested on real data. Results indicate that text-to-audio models are effective for dataset augmentation, whereas performance drops when relying solely on generated audio.
{"title":"Synthesizing Soundscapes: Leveraging Text-to-Audio Models for Environmental Sound Classification","authors":"Francesca Ronchini, Luca Comanducci, Fabio Antonacci","doi":"arxiv-2403.17864","DOIUrl":"https://doi.org/arxiv-2403.17864","url":null,"abstract":"In the past few years, text-to-audio models have emerged as a significant\u0000advancement in automatic audio generation. Although they represent impressive\u0000technological progress, the effectiveness of their use in the development of\u0000audio applications remains uncertain. This paper aims to investigate these\u0000aspects, specifically focusing on the task of classification of environmental\u0000sounds. This study analyzes the performance of two different environmental\u0000classification systems when data generated from text-to-audio models is used\u0000for training. Two cases are considered: a) when the training dataset is\u0000augmented by data coming from two different text-to-audio models; and b) when\u0000the training dataset consists solely of synthetic audio generated. In both\u0000cases, the performance of the classification task is tested on real data.\u0000Results indicate that text-to-audio models are effective for dataset\u0000augmentation, whereas the performance of the models drops when relying on only\u0000generated audio.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314178","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Michael Neri, Archontis Politis, Daniel Krause, Marco Carli, Tuomas Virtanen
Distance estimation from audio plays a crucial role in various applications, such as acoustic scene analysis, sound source localization, and room modeling. Most studies center on a classification approach, where distances are discretized into distinct categories, enabling smoother model training and higher accuracy but restricting the precision of the obtained sound source position. Toward this direction, in this paper we propose a novel approach for continuous distance estimation from audio signals using a convolutional recurrent neural network with an attention module. The attention mechanism enables the model to focus on relevant temporal and spectral features, enhancing its ability to capture fine-grained distance-related information. To evaluate the effectiveness of our proposed method, we conduct extensive experiments using audio recordings in controlled environments with three levels of realism (synthetic room impulse responses, measured responses convolved with speech, and real recordings) on four datasets (our synthetic dataset, QMULTIMIT, VoiceHome-2, and STARSS23). Experimental results show that the model achieves an absolute error of 0.11 meters in a noiseless synthetic scenario and an absolute error of about 1.30 meters in the hybrid scenario. In the real scenario, where unpredictable environmental factors and noise are prevalent, the algorithm yields an absolute error of approximately 0.50 meters. For reproducible research purposes we make the model, code, and synthetic datasets available at https://github.com/michaelneri/audio-distance-estimation.
{"title":"Speaker Distance Estimation in Enclosures from Single-Channel Audio","authors":"Michael Neri, Archontis Politis, Daniel Krause, Marco Carli, Tuomas Virtanen","doi":"arxiv-2403.17514","DOIUrl":"https://doi.org/arxiv-2403.17514","url":null,"abstract":"Distance estimation from audio plays a crucial role in various applications,\u0000such as acoustic scene analysis, sound source localization, and room modeling.\u0000Most studies predominantly center on employing a classification approach, where\u0000distances are discretized into distinct categories, enabling smoother model\u0000training and achieving higher accuracy but imposing restrictions on the\u0000precision of the obtained sound source position. Towards this direction, in\u0000this paper we propose a novel approach for continuous distance estimation from\u0000audio signals using a convolutional recurrent neural network with an attention\u0000module. The attention mechanism enables the model to focus on relevant temporal\u0000and spectral features, enhancing its ability to capture fine-grained\u0000distance-related information. To evaluate the effectiveness of our proposed\u0000method, we conduct extensive experiments using audio recordings in controlled\u0000environments with three levels of realism (synthetic room impulse response,\u0000measured response with convolved speech, and real recordings) on four datasets\u0000(our synthetic dataset, QMULTIMIT, VoiceHome-2, and STARSS23). Experimental\u0000results show that the model achieves an absolute error of 0.11 meters in a\u0000noiseless synthetic scenario. Moreover, the results showed an absolute error of\u0000about 1.30 meters in the hybrid scenario. The algorithm's performance in the\u0000real scenario, where unpredictable environmental factors and noise are\u0000prevalent, yields an absolute error of approximately 0.50 meters. For\u0000reproducible research purposes we make model, code, and synthetic datasets\u0000available at https://github.com/michaelneri/audio-distance-estimation.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a novel neural speech phase prediction model that predicts wrapped phase spectra directly from amplitude spectra. The proposed model is a cascade of a residual convolutional network and a parallel estimation architecture. The parallel estimation architecture is the core module for direct wrapped phase prediction. This architecture consists of two parallel linear convolutional layers and a phase calculation formula, imitating the process of calculating the phase spectra from the real and imaginary parts of complex spectra and strictly restricting the predicted phase values to the principal value interval. To avoid the error expansion issue caused by phase wrapping, we design anti-wrapping training losses defined between the predicted wrapped phase spectra and the natural ones by activating the instantaneous phase error, group delay error and instantaneous angular frequency error with an anti-wrapping function. We mathematically demonstrate that the anti-wrapping function should possess three properties, namely parity, periodicity and monotonicity. We also achieve low-latency streamable phase prediction by combining causal convolutions and knowledge distillation training strategies. For both analysis-synthesis and specific speech generation tasks, experimental results show that our proposed neural speech phase prediction model outperforms iterative phase estimation algorithms and neural network-based phase prediction methods in terms of phase prediction precision, efficiency and robustness. Compared with the HiFi-GAN-based waveform reconstruction method, our proposed model also shows outstanding efficiency advantages while ensuring the quality of synthesized speech. To the best of our knowledge, we are the first to directly predict speech phase spectra from amplitude spectra only via neural networks.
{"title":"Low-Latency Neural Speech Phase Prediction based on Parallel Estimation Architecture and Anti-Wrapping Losses for Speech Generation Tasks","authors":"Yang Ai, Zhen-Hua Ling","doi":"arxiv-2403.17378","DOIUrl":"https://doi.org/arxiv-2403.17378","url":null,"abstract":"This paper presents a novel neural speech phase prediction model which\u0000predicts wrapped phase spectra directly from amplitude spectra. The proposed\u0000model is a cascade of a residual convolutional network and a parallel\u0000estimation architecture. The parallel estimation architecture is a core module\u0000for direct wrapped phase prediction. This architecture consists of two parallel\u0000linear convolutional layers and a phase calculation formula, imitating the\u0000process of calculating the phase spectra from the real and imaginary parts of\u0000complex spectra and strictly restricting the predicted phase values to the\u0000principal value interval. To avoid the error expansion issue caused by phase\u0000wrapping, we design anti-wrapping training losses defined between the predicted\u0000wrapped phase spectra and natural ones by activating the instantaneous phase\u0000error, group delay error and instantaneous angular frequency error using an\u0000anti-wrapping function. We mathematically demonstrate that the anti-wrapping\u0000function should possess three properties, namely parity, periodicity and\u0000monotonicity. We also achieve low-latency streamable phase prediction by\u0000combining causal convolutions and knowledge distillation training strategies.\u0000For both analysis-synthesis and specific speech generation tasks, experimental\u0000results show that our proposed neural speech phase prediction model outperforms\u0000the iterative phase estimation algorithms and neural network-based phase\u0000prediction methods in terms of phase prediction precision, efficiency and\u0000robustness. Compared with HiFi-GAN-based waveform reconstruction method, our\u0000proposed model also shows outstanding efficiency advantages while ensuring the\u0000quality of synthesized speech. To the best of our knowledge, we are the first\u0000to directly predict speech phase spectra from amplitude spectra only via neural\u0000networks.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech Emotion Recognition (SER) plays a crucial role in advancing human-computer interaction and speech processing capabilities. We introduce a novel deep-learning architecture designed specifically for a functional data model known as the multiple-index functional model. Our key innovation lies in integrating adaptive basis layers and an automated data transformation search within the deep learning framework. Simulations for this new model show good performance. This allows us to extract features tailored for chunk-level SER, based on Mel-Frequency Cepstral Coefficients (MFCCs). We demonstrate the effectiveness of our approach on the benchmark IEMOCAP database, achieving competitive performance compared to existing methods.
{"title":"Deep functional multiple index models with an application to SER","authors":"Matthieu Saumard, Abir El Haj, Thibault Napoleon","doi":"arxiv-2403.17562","DOIUrl":"https://doi.org/arxiv-2403.17562","url":null,"abstract":"Speech Emotion Recognition (SER) plays a crucial role in advancing\u0000human-computer interaction and speech processing capabilities. We introduce a\u0000novel deep-learning architecture designed specifically for the functional data\u0000model known as the multiple-index functional model. Our key innovation lies in\u0000integrating adaptive basis layers and an automated data transformation search\u0000within the deep learning framework. Simulations for this new model show good\u0000performances. This allows us to extract features tailored for chunk-level SER,\u0000based on Mel Frequency Cepstral Coefficients (MFCCs). We demonstrate the\u0000effectiveness of our approach on the benchmark IEMOCAP database, achieving good\u0000performance compared to existing methods.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the ever-rising quality of deep generative models, it is increasingly important to be able to discern whether the audio data at hand have been recorded or synthesized. Although the detection of fake speech signals has been studied extensively, this is not the case for the detection of fake environmental audio. We propose a simple and efficient pipeline for detecting fake environmental sounds based on the CLAP audio embedding. We evaluate this detector using audio data from the 2023 DCASE challenge task on Foley sound synthesis. Our experiments show that fake sounds generated by 44 state-of-the-art synthesizers can be detected with 98% accuracy on average. We show that using an audio embedding learned on environmental audio is beneficial over a standard VGGish one, as it provides a 10% increase in detection performance. Informal listening to incorrectly classified negative examples reveals audible features of fake sounds missed by the detector, such as distortion and implausible background noise.
{"title":"Detection of Deepfake Environmental Audio","authors":"Hafsa Ouajdi, Oussama Hadder, Modan Tailleur, Mathieu Lagrange, Laurie M. Heller","doi":"arxiv-2403.17529","DOIUrl":"https://doi.org/arxiv-2403.17529","url":null,"abstract":"With the ever-rising quality of deep generative models, it is increasingly\u0000important to be able to discern whether the audio data at hand have been\u0000recorded or synthesized. Although the detection of fake speech signals has been\u0000studied extensively, this is not the case for the detection of fake\u0000environmental audio. We propose a simple and efficient pipeline for detecting fake environmental\u0000sounds based on the CLAP audio embedding. We evaluate this detector using audio\u0000data from the 2023 DCASE challenge task on Foley sound synthesis. Our experiments show that fake sounds generated by 44 state-of-the-art\u0000synthesizers can be detected on average with 98% accuracy. We show that using\u0000an audio embedding learned on environmental audio is beneficial over a standard\u0000VGGish one as it provides a 10% increase in detection performance. Informal\u0000listening to Incorrect Negative examples demonstrates audible features of fake\u0000sounds missed by the detector such as distortion and implausible background\u0000noise.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-03-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140314176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}