Recent advances in speech spoofing necessitate stronger verification mechanisms in neural speech codecs to ensure authenticity. Current methods embed numerical watermarks before compression and extract them from the reconstructed speech for verification, but they face limitations such as separate training processes for the watermark and the codec and insufficient cross-modal information integration, leading to reduced watermark imperceptibility, extraction accuracy, and capacity. To address these issues, we propose WMCodec, the first neural speech codec to jointly train compression-reconstruction and watermark embedding-extraction in an end-to-end manner, optimizing both the imperceptibility and the extractability of the watermark. Furthermore, we design an iterative Attention Imprint Unit (AIU) for deeper feature integration of watermark and speech, reducing the impact of quantization noise on the watermark. Experimental results show that WMCodec outperforms AudioSeal with Encodec in most quality metrics for watermark imperceptibility and consistently exceeds both AudioSeal with Encodec and reinforced TraceableSpeech in watermark extraction accuracy. At a bandwidth of 6 kbps with a watermark capacity of 16 bps, WMCodec maintains over 99% extraction accuracy under common attacks, demonstrating strong robustness.
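The abstract does not spell out the internals of the Attention Imprint Unit, so the following is only a hypothetical sketch of how an attention-based imprint step could fuse a binary watermark with the speech latent before quantization; the dimensions (latent_dim, n_bits), the residual design, and the use of nn.MultiheadAttention are assumptions, not the authors' implementation.

    # Hypothetical sketch (not the paper's AIU): speech latent frames attend to
    # learned embeddings of the watermark bits, and the attended result is added
    # back as a residual before quantization.
    import torch
    import torch.nn as nn

    class AttentionImprint(nn.Module):
        def __init__(self, latent_dim=128, n_bits=16, n_heads=4):
            super().__init__()
            self.bit_emb = nn.Embedding(2 * n_bits, latent_dim)  # one embedding per (bit position, bit value)
            self.attn = nn.MultiheadAttention(latent_dim, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(latent_dim)

        def forward(self, speech_latent, bits):
            # speech_latent: (B, T, latent_dim); bits: (B, n_bits) with values in {0, 1}
            idx = torch.arange(bits.size(1), device=bits.device) * 2 + bits
            wm_tokens = self.bit_emb(idx)                         # (B, n_bits, latent_dim)
            imprint, _ = self.attn(query=speech_latent,           # frames query the watermark tokens
                                   key=wm_tokens, value=wm_tokens)
            return self.norm(speech_latent + imprint)             # residual fusion fed to the quantizer

    latent = torch.randn(2, 100, 128)
    bits = torch.randint(0, 2, (2, 16))
    marked = AttentionImprint()(latent, bits)                     # same shape as the input latent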
{"title":"WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification","authors":"Junzuo Zhou, Jiangyan Yi, Yong Ren, Jianhua Tao, Tao Wang, Chu Yuan Zhang","doi":"arxiv-2409.12121","DOIUrl":"https://doi.org/arxiv-2409.12121","url":null,"abstract":"Recent advances in speech spoofing necessitate stronger verification\u0000mechanisms in neural speech codecs to ensure authenticity. Current methods\u0000embed numerical watermarks before compression and extract them from\u0000reconstructed speech for verification, but face limitations such as separate\u0000training processes for the watermark and codec, and insufficient cross-modal\u0000information integration, leading to reduced watermark imperceptibility,\u0000extraction accuracy, and capacity. To address these issues, we propose WMCodec,\u0000the first neural speech codec to jointly train compression-reconstruction and\u0000watermark embedding-extraction in an end-to-end manner, optimizing both\u0000imperceptibility and extractability of the watermark. Furthermore, We design an\u0000iterative Attention Imprint Unit (AIU) for deeper feature integration of\u0000watermark and speech, reducing the impact of quantization noise on the\u0000watermark. Experimental results show WMCodec outperforms AudioSeal with Encodec\u0000in most quality metrics for watermark imperceptibility and consistently exceeds\u0000both AudioSeal with Encodec and reinforced TraceableSpeech in extraction\u0000accuracy of watermark. At bandwidth of 6 kbps with a watermark capacity of 16\u0000bps, WMCodec maintains over 99% extraction accuracy under common attacks,\u0000demonstrating strong robustness.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"197 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Current audio-visual representation learning can capture rough object categories (e.g., "animals" and "instruments"), but it lacks the ability to recognize fine-grained details, such as specific categories like "dogs" and "flutes" within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss into the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate our method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.
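As a rough illustration of attaching a label-prediction objective to a contrastive audio-visual model, the sketch below combines an InfoNCE-style contrastive loss with a multi-label BCE loss over pseudo object labels. The embedding size, number of object classes, shared prediction head, and the weight lambda_lbl are assumptions; the actual CAV-MAE heads and loss weighting in DETECLAP may differ.

    # Hedged sketch: bidirectional contrastive loss on pooled audio/visual
    # embeddings plus a multi-label object prediction loss from both modalities.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def audio_visual_losses(audio_emb, video_emb, object_labels, label_head,
                            lambda_lbl=0.5, temp=0.07):
        # audio_emb, video_emb: (B, D) pooled embeddings; object_labels: (B, C) multi-hot pseudo labels
        a = F.normalize(audio_emb, dim=-1)
        v = F.normalize(video_emb, dim=-1)
        logits = a @ v.t() / temp                               # (B, B) cross-modal similarities
        targets = torch.arange(a.size(0), device=a.device)      # matching pairs lie on the diagonal
        contrastive = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))
        label_loss = 0.5 * (F.binary_cross_entropy_with_logits(label_head(audio_emb), object_labels) +
                            F.binary_cross_entropy_with_logits(label_head(video_emb), object_labels))
        return contrastive + lambda_lbl * label_loss

    head = nn.Linear(512, 300)                                  # assumed: 512-d embeddings, 300 object classes
    loss = audio_visual_losses(torch.randn(8, 512), torch.randn(8, 512),
                               torch.randint(0, 2, (8, 300)).float(), head)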
{"title":"DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information","authors":"Shota Nakada, Taichi Nishimura, Hokuto Munakata, Masayoshi Kondo, Tatsuya Komatsu","doi":"arxiv-2409.11729","DOIUrl":"https://doi.org/arxiv-2409.11729","url":null,"abstract":"Current audio-visual representation learning can capture rough object\u0000categories (e.g., ``animals'' and ``instruments''), but it lacks the ability to\u0000recognize fine-grained details, such as specific categories like ``dogs'' and\u0000``flutes'' within animals and instruments. To address this issue, we introduce\u0000DETECLAP, a method to enhance audio-visual representation learning with object\u0000information. Our key idea is to introduce an audio-visual label prediction loss\u0000to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its\u0000object awareness. To avoid costly manual annotations, we prepare object labels\u0000from both audio and visual inputs using state-of-the-art language-audio models\u0000and object detectors. We evaluate the method of audio-visual retrieval and\u0000classification using the VGGSound and AudioSet20K datasets. Our method achieves\u0000improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and\u0000visual-to-audio retrieval, respectively, and an improvement in accuracy of\u0000+0.6% for audio-visual classification.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"3 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Thomas Gossard, Julian Schmalzl, Andreas Ziegler, Andreas Zell
While table tennis players primarily rely on visual cues, sound provides valuable information. The sound generated when the ball strikes the racket can assist in predicting the ball's trajectory, especially in determining the spin. While professional players can distinguish spin through these auditory cues, they often go unnoticed by untrained players. In this paper, we demonstrate that different rackets produce distinct sounds, which can be used to identify the racket type. In addition, we show that the sound generated by the racket can indicate whether or not spin was applied to the ball. To achieve this, we created a comprehensive dataset featuring bounce sounds from 10 racket configurations, each applying various spins to the ball. To achieve millisecond-level temporal accuracy, we first detect high-frequency peaks that may correspond to table tennis ball bounces. We then refine these results using a CNN-based classifier that accurately predicts both the type of racket used and whether spin was applied.
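A minimal sketch of the first stage described above (candidate bounce detection from high-frequency peaks) is given below, using scipy for filtering and peak picking; the cutoff frequency, window length, prominence threshold, and sample rate are assumed values, and the CNN refinement stage is omitted.

    # Hedged sketch of candidate bounce detection: high-pass the signal, take a
    # short-time energy envelope, and pick prominent peaks as candidate bounces.
    import numpy as np
    from scipy.signal import butter, sosfiltfilt, find_peaks

    def candidate_bounces(audio, sr=44100, cutoff_hz=5000, win=256, prominence=0.1):
        sos = butter(4, cutoff_hz, btype="highpass", fs=sr, output="sos")
        hp = sosfiltfilt(sos, audio)
        envelope = np.convolve(hp ** 2, np.ones(win) / win, mode="same")   # short-time energy
        envelope = envelope / (envelope.max() + 1e-12)
        peaks, _ = find_peaks(envelope, prominence=prominence, distance=int(0.05 * sr))
        return peaks / sr   # candidate bounce times in seconds, to be verified by the CNN classifier

    times = candidate_bounces(np.random.randn(44100))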
{"title":"Spin Detection Using Racket Bounce Sounds in Table Tennis","authors":"Thomas Gossard, Julian Schmalzl, Andreas Ziegler, Andreas Zell","doi":"arxiv-2409.11760","DOIUrl":"https://doi.org/arxiv-2409.11760","url":null,"abstract":"While table tennis players primarily rely on visual cues, sound provides\u0000valuable information. The sound generated when the ball strikes the racket can\u0000assist in predicting the ball's trajectory, especially in determining the spin.\u0000While professional players can distinguish spin through these auditory cues,\u0000they often go unnoticed by untrained players. In this paper, we demonstrate\u0000that different rackets produce distinct sounds, which can be used to identify\u0000the racket type. In addition, we show that the sound generated by the racket\u0000can indicate whether spin was applied to the ball, or not. To achieve this, we\u0000created a comprehensive dataset featuring bounce sounds from 10 racket\u0000configurations, each applying various spins to the ball. To achieve millisecond\u0000level temporal accuracy, we first detect high frequency peaks that may\u0000correspond to table tennis ball bounces. We then refine these results using a\u0000CNN based classifier that accurately predicts both the type of racket used and\u0000whether spin was applied.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"47 2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We tackle the challenge of uncertainty quantification in the localization of a sound source within adverse acoustic environments. Estimating the position of the source is influenced by various factors such as noise and reverberation, leading to significant uncertainty. Quantifying this uncertainty is essential, particularly when localization outcomes impact critical decision-making processes, such as in robot audition, where the accuracy of location estimates directly influences subsequent actions. Despite this, many localization methods offer only point estimates without quantifying the estimation uncertainty. To address this, we employ conformal prediction (CP), a framework that delivers statistically valid prediction intervals with finite-sample guarantees, independent of the data distribution. However, commonly used Inductive CP (ICP) methods require a substantial amount of labeled data, which can be difficult to obtain in the localization setting. To mitigate this limitation, we incorporate a manifold-based localization method using Gaussian process regression (GPR), with an efficient Transductive CP (TCP) technique specifically designed for GPR. We demonstrate that our method generates statistically valid uncertainty intervals across different acoustic conditions.
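For readers unfamiliar with conformal prediction, the sketch below shows the standard split (inductive) conformal recipe for a scalar regression output: calibrate a quantile of absolute residuals on held-out labeled data, then pad point predictions by that quantile. This is the generic ICP baseline the abstract contrasts with, not the transductive GPR-based variant the paper proposes.

    # Generic split-conformal prediction for a regression-style localization output.
    # Under exchangeability, the interval covers the true value with probability >= 1 - alpha.
    import numpy as np

    def split_conformal_interval(predict, X_cal, y_cal, X_test, alpha=0.1):
        residuals = np.abs(y_cal - predict(X_cal))          # calibration scores
        n = len(residuals)
        k = int(np.ceil((n + 1) * (1 - alpha)))             # finite-sample corrected rank
        q = np.inf if k > n else np.sort(residuals)[k - 1]  # conformal quantile of the scores
        y_hat = predict(X_test)
        return y_hat - q, y_hat + q                          # lower and upper interval bounds

    # usage with a hypothetical direction-of-arrival regressor:
    # lo, hi = split_conformal_interval(model.predict, X_cal, y_cal, X_test, alpha=0.1)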
{"title":"Conformal Prediction for Manifold-based Source Localization with Gaussian Processes","authors":"Vadim Rozenfeld, Bracha Laufer Goldshtein","doi":"arxiv-2409.11804","DOIUrl":"https://doi.org/arxiv-2409.11804","url":null,"abstract":"We tackle the challenge of uncertainty quantification in the localization of\u0000a sound source within adverse acoustic environments. Estimating the position of\u0000the source is influenced by various factors such as noise and reverberation,\u0000leading to significant uncertainty. Quantifying this uncertainty is essential,\u0000particularly when localization outcomes impact critical decision-making\u0000processes, such as in robot audition, where the accuracy of location estimates\u0000directly influences subsequent actions. Despite this, many localization methods\u0000typically offer point estimates without quantifying the estimation uncertainty.\u0000To address this, we employ conformal prediction (CP)-a framework that delivers\u0000statistically valid prediction intervals with finite-sample guarantees,\u0000independent of the data distribution. However, commonly used Inductive CP (ICP)\u0000methods require a substantial amount of labeled data, which can be difficult to\u0000obtain in the localization setting. To mitigate this limitation, we incorporate\u0000a manifold-based localization method using Gaussian process regression (GPR),\u0000with an efficient Transductive CP (TCP) technique specifically designed for\u0000GPR. We demonstrate that our method generates statistically valid uncertainty\u0000intervals across different acoustic conditions.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"26 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li
In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method nearly doubles the training speed and significantly outperforms the baseline models.
{"title":"DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech","authors":"Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li","doi":"arxiv-2409.11835","DOIUrl":"https://doi.org/arxiv-2409.11835","url":null,"abstract":"In recent years, speech diffusion models have advanced rapidly. Alongside the\u0000widely used U-Net architecture, transformer-based models such as the Diffusion\u0000Transformer (DiT) have also gained attention. However, current DiT speech\u0000models treat Mel spectrograms as general images, which overlooks the specific\u0000acoustic properties of speech. To address these limitations, we propose a\u0000method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which\u0000builds on DiT and achieves fast training without compromising accuracy.\u0000Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive\u0000inference approach that aligns more closely with acoustic properties, enhancing\u0000the naturalness of the generated speech. Additionally, we introduce a\u0000fine-grained style temporal modeling method that further improves speaker style\u0000similarity. Experimental results demonstrate that our method increases the\u0000training speed by nearly 2 times and significantly outperforms the baseline\u0000models.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"16 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech enhancement aims to improve speech quality and intelligibility in noisy environments. Recent advancements have concentrated on deep neural networks, particularly employing the Two-Stage (TS) architecture to enhance feature extraction. However, the complexity and size of these models remain significant, which limits their applicability in resource-constrained scenarios. Designing models suitable for edge devices presents its own set of challenges. Narrow lightweight models often encounter performance bottlenecks due to uneven loss landscapes. Additionally, advanced operators such as Transformers or Mamba may lack the practical adaptability and efficiency that convolutional neural networks (CNNs) offer in real-world deployments. To address these challenges, we propose Dense-TSNet, an innovative ultra-lightweight speech enhancement network. Our approach employs a novel Dense Two-Stage (Dense-TS) architecture, which, compared to the classic Two-Stage architecture, ensures more robust refinement of the objective function in the later training stages. This leads to improved final performance, addressing the early convergence limitations of the baseline model. We also introduce the Multi-View Gaze Block (MVGB), which enhances feature extraction by incorporating global, channel, and local perspectives through convolutional neural networks (CNNs). Furthermore, we discuss how the choice of loss function impacts perceptual quality. Dense-TSNet demonstrates promising performance with a compact model size of around 14K parameters, making it particularly well-suited for deployment in resource-constrained environments.
{"title":"Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement","authors":"Zizhen Lin, Yuanle Li, Junyu Wang, Ruili Li","doi":"arxiv-2409.11725","DOIUrl":"https://doi.org/arxiv-2409.11725","url":null,"abstract":"Speech enhancement aims to improve speech quality and intelligibility in\u0000noisy environments. Recent advancements have concentrated on deep neural\u0000networks, particularly employing the Two-Stage (TS) architecture to enhance\u0000feature extraction. However, the complexity and size of these models remain\u0000significant, which limits their applicability in resource-constrained\u0000scenarios. Designing models suitable for edge devices presents its own set of\u0000challenges. Narrow lightweight models often encounter performance bottlenecks\u0000due to uneven loss landscapes. Additionally, advanced operators such as\u0000Transformers or Mamba may lack the practical adaptability and efficiency that\u0000convolutional neural networks (CNNs) offer in real-world deployments. To\u0000address these challenges, we propose Dense-TSNet, an innovative\u0000ultra-lightweight speech enhancement network. Our approach employs a novel\u0000Dense Two-Stage (Dense-TS) architecture, which, compared to the classic\u0000Two-Stage architecture, ensures more robust refinement of the objective\u0000function in the later training stages. This leads to improved final\u0000performance, addressing the early convergence limitations of the baseline\u0000model. We also introduce the Multi-View Gaze Block (MVGB), which enhances\u0000feature extraction by incorporating global, channel, and local perspectives\u0000through convolutional neural networks (CNNs). Furthermore, we discuss how the\u0000choice of loss function impacts perceptual quality. Dense-TSNet demonstrates\u0000promising performance with a compact model size of around 14K parameters,\u0000making it particularly well-suited for deployment in resource-constrained\u0000environments.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"96 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ami Berger, Vladimir Tourbabin, Jacob Donley, Zamir Ben-Hur, Boaz Rafaely
The increasing popularity of spatial audio in applications such as teleconferencing, entertainment, and virtual reality has led to recent developments in binaural reproduction methods. However, only a few of these methods are well-suited for wearable and mobile arrays, which typically consist of a small number of microphones. One such method is binaural signal matching (BSM), which has been shown to produce high-quality binaural signals for wearable arrays. However, BSM may be suboptimal in cases of a high direct-to-reverberant ratio (DRR), as it is based on the diffuse sound field assumption. To overcome this limitation, previous studies incorporated sound-field models other than diffuse, but this approach was not studied comprehensively. This paper extensively investigates two BSM-based methods designed for high-DRR scenarios. The methods incorporate a sound field model composed of direct and reverberant components. The methods are investigated both mathematically and using simulations, and are finally validated by a listening test. The results show that the proposed methods can significantly improve the performance of BSM, in particular in the direction of the source, while presenting only negligible degradation in other directions. Furthermore, when the source direction estimate is inaccurate, the performance of these methods degrades to that of BSM, demonstrating a desirable robustness property.
{"title":"Insights into the Incorporation of Signal Information in Binaural Signal Matching with Wearable Microphone Arrays","authors":"Ami Berger, Vladimir Tourbabin, Jacob Donley, Zamir Ben-Hur, Boaz Rafaely","doi":"arxiv-2409.11731","DOIUrl":"https://doi.org/arxiv-2409.11731","url":null,"abstract":"The increasing popularity of spatial audio in applications such as\u0000teleconferencing, entertainment, and virtual reality has led to the recent\u0000developments of binaural reproduction methods. However, only a few of these\u0000methods are well-suited for wearable and mobile arrays, which typically consist\u0000of a small number of microphones. One such method is binaural signal matching\u0000(BSM), which has been shown to produce high-quality binaural signals for\u0000wearable arrays. However, BSM may be suboptimal in cases of high\u0000direct-to-reverberant ratio (DRR) as it is based on the diffuse sound field\u0000assumption. To overcome this limitation, previous studies incorporated\u0000sound-field models other than diffuse. However, this approach was not studied\u0000comprehensively. This paper extensively investigates two BSM-based methods\u0000designed for high DRR scenarios. The methods incorporate a sound field model\u0000composed of direct and reverberant components.The methods are investigated both\u0000mathematically and using simulations, finally validated by a listening test.\u0000The results show that the proposed methods can significantly improve the\u0000performance of BSM , in particular in the direction of the source, while\u0000presenting only a negligible degradation in other directions. Furthermore, when\u0000source direction estimation is inaccurate, performance of these methods degrade\u0000to equal that of the BSM, presenting a desired robustness quality.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"2 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech synthesis technology has posed a serious threat to speaker verification systems. Currently, the most effective fake audio detection methods utilize pretrained models, and integrating features from various layers of the pretrained model further enhances detection performance. However, most previously proposed fusion methods require fine-tuning the pretrained models, resulting in excessively long training times and hindering model iteration when facing new speech synthesis technology. To address this issue, this paper proposes a feature fusion method based on the Mixture of Experts, which extracts and integrates features relevant to fake audio detection from the layer features, guided by a gating network conditioned on the last-layer feature, while keeping the pretrained model frozen. Experiments conducted on the ASVspoof2019 and ASVspoof2021 datasets demonstrate that the proposed method achieves competitive performance compared to methods that require fine-tuning.
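The fusion described above (a gating network driven by the last-layer feature, with the pretrained model frozen) can be approximated by the hedged sketch below; the number of layers, feature dimension, time pooling, and classifier head are assumptions and may differ from the paper's expert and gate design.

    # Hedged sketch: softmax gating over frozen layer features, with the gate
    # conditioned on the time-pooled last-layer representation.
    import torch
    import torch.nn as nn

    class LayerFusionMoE(nn.Module):
        def __init__(self, n_layers=13, dim=768, n_classes=2):
            super().__init__()
            self.gate = nn.Linear(dim, n_layers)   # gate driven by the last-layer feature
            self.classifier = nn.Sequential(nn.Linear(dim, 128), nn.ReLU(), nn.Linear(128, n_classes))

        def forward(self, hidden_states):
            # hidden_states: list of n_layers tensors, each (B, T, dim), from a frozen wav2vec 2.0-style encoder
            stacked = torch.stack(hidden_states, dim=1)             # (B, L, T, dim)
            gate_in = hidden_states[-1].mean(dim=1)                 # (B, dim) pooled last layer
            weights = torch.softmax(self.gate(gate_in), dim=-1)     # (B, L) mixture weights
            fused = (weights[:, :, None, None] * stacked).sum(1)    # (B, T, dim) weighted layer sum
            return self.classifier(fused.mean(dim=1))               # (B, n_classes) bona fide / spoof logits

    states = [torch.randn(4, 200, 768) for _ in range(13)]
    logits = LayerFusionMoE()(states)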
{"title":"Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0","authors":"Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xiaopeng Wang, Yuankun Xie, Xin Qi, Shuchen Shi, Yi Lu, Yukun Liu, Chenxing Li, Xuefei Liu, Guanjun Li","doi":"arxiv-2409.11909","DOIUrl":"https://doi.org/arxiv-2409.11909","url":null,"abstract":"Speech synthesis technology has posed a serious threat to speaker\u0000verification systems. Currently, the most effective fake audio detection methods utilize pretrained\u0000models, and integrating features from various layers of pretrained model\u0000further enhances detection performance. However, most of the previously proposed fusion methods require fine-tuning\u0000the pretrained models, resulting in excessively long training times and\u0000hindering model iteration when facing new speech synthesis technology. To address this issue, this paper proposes a feature fusion method based on\u0000the Mixture of Experts, which extracts and integrates features relevant to fake\u0000audio detection from layer features, guided by a gating network based on the\u0000last layer feature, while freezing the pretrained model. Experiments conducted on the ASVspoof2019 and ASVspoof2021 datasets\u0000demonstrate that the proposed method achieves competitive performance compared\u0000to those requiring fine-tuning.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"40 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad
Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses, interruptions, and diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
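To make the reported correlation analysis concrete, here is a hedged sketch of computing per-utterance WER with the jiwer package and correlating it with a simple disfluency count; the marker list and the use of Pearson correlation are illustrative assumptions, not the paper's exact protocol.

    # Hedged sketch: per-utterance WER vs. a crude disfluency count, followed by
    # a correlation coefficient over the utterances.
    import jiwer
    from scipy.stats import pearsonr

    DISFLUENCY_MARKERS = {"uh", "um", "erm", "hmm"}   # assumed marker set

    def disfluency_count(text):
        return sum(tok in DISFLUENCY_MARKERS for tok in text.lower().split())

    def wer_vs_disfluency(references, hypotheses):
        wers = [jiwer.wer(ref, hyp) for ref, hyp in zip(references, hypotheses)]
        counts = [disfluency_count(ref) for ref in references]
        return pearsonr(counts, wers)   # (correlation, p-value)

    # r, p = wer_vs_disfluency(reference_transcripts, asr_outputs)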
{"title":"ASR Benchmarking: Need for a More Representative Conversational Dataset","authors":"Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad","doi":"arxiv-2409.12042","DOIUrl":"https://doi.org/arxiv-2409.12042","url":null,"abstract":"Automatic Speech Recognition (ASR) systems have achieved remarkable\u0000performance on widely used benchmarks such as LibriSpeech and Fleurs. However,\u0000these benchmarks do not adequately reflect the complexities of real-world\u0000conversational environments, where speech is often unstructured and contains\u0000disfluencies such as pauses, interruptions, and diverse accents. In this study,\u0000we introduce a multilingual conversational dataset, derived from TalkBank,\u0000consisting of unstructured phone conversation between adults. Our results show\u0000a significant performance drop across various state-of-the-art ASR models when\u0000tested in conversational settings. Furthermore, we observe a correlation\u0000between Word Error Rate and the presence of speech disfluencies, highlighting\u0000the critical need for more realistic, conversational ASR benchmarks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"5 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jin Jie Sean Yeo, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan
In this technical report, we describe the SNTL-NTU team's submission for Task 1 (Data-Efficient Low-Complexity Acoustic Scene Classification) of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. Three systems are introduced to tackle training splits of different sizes. For the small training splits, we explored reducing the complexity of the provided baseline model by reducing the number of base channels. We introduce data augmentation in the form of mixup to increase the diversity of training samples. For the larger training splits, we use FocusNet to provide confusing-class information to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and baseline models trained at the original sampling rate of 44.1 kHz. We use knowledge distillation to distill the ensemble into the baseline student model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile development dataset yielded the highest average testing accuracies of (62.21, 59.82, 56.81, 53.03, 47.97)% on the (100, 50, 25, 10, 5)% splits, respectively, across the three systems.
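Mixup is a standard augmentation; as a reminder of the mechanism mentioned above, the sketch below convex-combines pairs of spectrograms and their one-hot labels with a Beta-distributed coefficient. The alpha value and tensor shapes are assumptions, not the team's settings.

    # Standard mixup on a batch of (spectrogram, one-hot label) pairs: each
    # example is mixed with a randomly permuted partner from the same batch.
    import numpy as np
    import torch

    def mixup_batch(x, y_onehot, alpha=0.3):
        # x: (B, C, F, T) spectrograms; y_onehot: (B, n_classes)
        lam = float(np.random.beta(alpha, alpha))
        perm = torch.randperm(x.size(0))
        x_mixed = lam * x + (1 - lam) * x[perm]
        y_mixed = lam * y_onehot + (1 - lam) * y_onehot[perm]
        return x_mixed, y_mixed

    x = torch.randn(16, 1, 128, 64)
    y = torch.eye(10)[torch.randint(0, 10, (16,))]
    x_aug, y_aug = mixup_batch(x, y)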
{"title":"Data Efficient Acoustic Scene Classification using Teacher-Informed Confusing Class Instruction","authors":"Jin Jie Sean Yeo, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan","doi":"arxiv-2409.11964","DOIUrl":"https://doi.org/arxiv-2409.11964","url":null,"abstract":"In this technical report, we describe the SNTL-NTU team's submission for Task\u00001 Data-Efficient Low-Complexity Acoustic Scene Classification of the detection\u0000and classification of acoustic scenes and events (DCASE) 2024 challenge. Three\u0000systems are introduced to tackle training splits of different sizes. For small\u0000training splits, we explored reducing the complexity of the provided baseline\u0000model by reducing the number of base channels. We introduce data augmentation\u0000in the form of mixup to increase the diversity of training samples. For the\u0000larger training splits, we use FocusNet to provide confusing class information\u0000to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models\u0000and baseline models trained on the original sampling rate of 44.1 kHz. We use\u0000Knowledge Distillation to distill the ensemble model to the baseline student\u0000model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile\u0000development dataset yielded the highest average testing accuracy of (62.21,\u000059.82, 56.81, 53.03, 47.97)% on split (100, 50, 25, 10, 5)% respectively over\u0000the three systems.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"44 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}