
Latest publications from arXiv - EE - Audio and Speech Processing

Conformal Prediction for Manifold-based Source Localization with Gaussian Processes
Pub Date : 2024-09-18 DOI: arxiv-2409.11804
Vadim Rozenfeld, Bracha Laufer Goldshtein
We tackle the challenge of uncertainty quantification in the localization of a sound source within adverse acoustic environments. Estimating the position of the source is influenced by various factors such as noise and reverberation, leading to significant uncertainty. Quantifying this uncertainty is essential, particularly when localization outcomes impact critical decision-making processes, such as in robot audition, where the accuracy of location estimates directly influences subsequent actions. Despite this, many localization methods typically offer point estimates without quantifying the estimation uncertainty. To address this, we employ conformal prediction (CP), a framework that delivers statistically valid prediction intervals with finite-sample guarantees, independent of the data distribution. However, commonly used Inductive CP (ICP) methods require a substantial amount of labeled data, which can be difficult to obtain in the localization setting. To mitigate this limitation, we incorporate a manifold-based localization method using Gaussian process regression (GPR), with an efficient Transductive CP (TCP) technique specifically designed for GPR. We demonstrate that our method generates statistically valid uncertainty intervals across different acoustic conditions.
{"title":"Conformal Prediction for Manifold-based Source Localization with Gaussian Processes","authors":"Vadim Rozenfeld, Bracha Laufer Goldshtein","doi":"arxiv-2409.11804","DOIUrl":"https://doi.org/arxiv-2409.11804","url":null,"abstract":"We tackle the challenge of uncertainty quantification in the localization of\u0000a sound source within adverse acoustic environments. Estimating the position of\u0000the source is influenced by various factors such as noise and reverberation,\u0000leading to significant uncertainty. Quantifying this uncertainty is essential,\u0000particularly when localization outcomes impact critical decision-making\u0000processes, such as in robot audition, where the accuracy of location estimates\u0000directly influences subsequent actions. Despite this, many localization methods\u0000typically offer point estimates without quantifying the estimation uncertainty.\u0000To address this, we employ conformal prediction (CP)-a framework that delivers\u0000statistically valid prediction intervals with finite-sample guarantees,\u0000independent of the data distribution. However, commonly used Inductive CP (ICP)\u0000methods require a substantial amount of labeled data, which can be difficult to\u0000obtain in the localization setting. To mitigate this limitation, we incorporate\u0000a manifold-based localization method using Gaussian process regression (GPR),\u0000with an efficient Transductive CP (TCP) technique specifically designed for\u0000GPR. We demonstrate that our method generates statistically valid uncertainty\u0000intervals across different acoustic conditions.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
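For readers unfamiliar with conformal prediction, below is a minimal sketch of the simpler split (inductive) CP procedure on top of an off-the-shelf Gaussian process regressor; the paper itself develops a transductive variant tailored to GPR, and the synthetic data, feature dimensionality, and miscoverage level here are purely illustrative assumptions.

```python
# Minimal sketch of split (inductive) conformal prediction on top of Gaussian
# process regression, for intuition only. The paper proposes a transductive CP
# variant tailored to GPR; the data, features, and miscoverage level here are
# illustrative assumptions.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

# Toy data: acoustic features -> source azimuth (degrees); purely synthetic.
X = rng.normal(size=(300, 4))
y = 10.0 * X[:, 0] + 5.0 * np.sin(X[:, 1]) + rng.normal(scale=2.0, size=300)

# Split into a proper training set and a calibration set.
X_train, y_train = X[:200], y[:200]
X_cal, y_cal = X[200:], y[200:]

gpr = GaussianProcessRegressor().fit(X_train, y_train)

# Nonconformity score: absolute residual on the calibration set.
scores = np.abs(y_cal - gpr.predict(X_cal))

alpha = 0.1  # target 90% coverage
n = len(scores)
q = np.quantile(scores, np.ceil((n + 1) * (1 - alpha)) / n, method="higher")

# Prediction interval for a new observation.
x_new = rng.normal(size=(1, 4))
y_hat = gpr.predict(x_new)[0]
print(f"90% conformal interval: [{y_hat - q:.2f}, {y_hat + q:.2f}]")
```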
WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification
Pub Date : 2024-09-18 DOI: arxiv-2409.12121
Junzuo Zhou, Jiangyan Yi, Yong Ren, Jianhua Tao, Tao Wang, Chu Yuan Zhang
Recent advances in speech spoofing necessitate stronger verification mechanisms in neural speech codecs to ensure authenticity. Current methods embed numerical watermarks before compression and extract them from reconstructed speech for verification, but face limitations such as separate training processes for the watermark and codec, and insufficient cross-modal information integration, leading to reduced watermark imperceptibility, extraction accuracy, and capacity. To address these issues, we propose WMCodec, the first neural speech codec to jointly train compression-reconstruction and watermark embedding-extraction in an end-to-end manner, optimizing both imperceptibility and extractability of the watermark. Furthermore, we design an iterative Attention Imprint Unit (AIU) for deeper feature integration of watermark and speech, reducing the impact of quantization noise on the watermark. Experimental results show WMCodec outperforms AudioSeal with Encodec in most quality metrics for watermark imperceptibility and consistently exceeds both AudioSeal with Encodec and reinforced TraceableSpeech in extraction accuracy of the watermark. At a bandwidth of 6 kbps with a watermark capacity of 16 bps, WMCodec maintains over 99% extraction accuracy under common attacks, demonstrating strong robustness.
{"title":"WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification","authors":"Junzuo Zhou, Jiangyan Yi, Yong Ren, Jianhua Tao, Tao Wang, Chu Yuan Zhang","doi":"arxiv-2409.12121","DOIUrl":"https://doi.org/arxiv-2409.12121","url":null,"abstract":"Recent advances in speech spoofing necessitate stronger verification\u0000mechanisms in neural speech codecs to ensure authenticity. Current methods\u0000embed numerical watermarks before compression and extract them from\u0000reconstructed speech for verification, but face limitations such as separate\u0000training processes for the watermark and codec, and insufficient cross-modal\u0000information integration, leading to reduced watermark imperceptibility,\u0000extraction accuracy, and capacity. To address these issues, we propose WMCodec,\u0000the first neural speech codec to jointly train compression-reconstruction and\u0000watermark embedding-extraction in an end-to-end manner, optimizing both\u0000imperceptibility and extractability of the watermark. Furthermore, We design an\u0000iterative Attention Imprint Unit (AIU) for deeper feature integration of\u0000watermark and speech, reducing the impact of quantization noise on the\u0000watermark. Experimental results show WMCodec outperforms AudioSeal with Encodec\u0000in most quality metrics for watermark imperceptibility and consistently exceeds\u0000both AudioSeal with Encodec and reinforced TraceableSpeech in extraction\u0000accuracy of watermark. At bandwidth of 6 kbps with a watermark capacity of 16\u0000bps, WMCodec maintains over 99% extraction accuracy under common attacks,\u0000demonstrating strong robustness.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
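To illustrate the end-to-end joint training idea in generic terms, the toy sketch below combines a waveform reconstruction loss with a watermark-bit recovery loss in a single objective; the module shapes, 16-bit payload, and loss weighting are assumptions and do not reflect the actual WMCodec architecture or its Attention Imprint Unit.

```python
# Toy illustration of joint codec + watermark training: one objective that
# couples waveform reconstruction with watermark-bit recovery, so gradients
# flow through both paths. All modules and weights are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyWatermarkCodec(nn.Module):
    def __init__(self, wm_bits: int = 16, dim: int = 64):
        super().__init__()
        self.encoder = nn.Conv1d(1, dim, kernel_size=7, stride=2, padding=3)
        self.wm_embed = nn.Linear(wm_bits, dim)      # inject watermark into the latent
        self.decoder = nn.ConvTranspose1d(dim, 1, kernel_size=8, stride=2, padding=3)
        self.wm_extract = nn.Linear(dim, wm_bits)    # recover bits from the latent

    def forward(self, wav, bits):
        z = self.encoder(wav)                        # (B, dim, T')
        z = z + self.wm_embed(bits).unsqueeze(-1)    # broadcast watermark over time
        recon = self.decoder(z)
        logits = self.wm_extract(z.mean(dim=-1))     # pooled latent -> bit logits
        return recon, logits

model = ToyWatermarkCodec()
wav = torch.randn(2, 1, 16000)
bits = torch.randint(0, 2, (2, 16)).float()

recon, logits = model(wav, bits)
loss = F.l1_loss(recon, wav[..., : recon.shape[-1]]) \
    + F.binary_cross_entropy_with_logits(logits, bits)
loss.backward()  # gradients flow through both the codec and the watermark paths
```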
DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information
Pub Date : 2024-09-18 DOI: arxiv-2409.11729
Shota Nakada, Taichi Nishimura, Hokuto Munakata, Masayoshi Kondo, Tatsuya Komatsu
Current audio-visual representation learning can capture rough object categories (e.g., "animals" and "instruments"), but it lacks the ability to recognize fine-grained details, such as specific categories like "dogs" and "flutes" within animals and instruments. To address this issue, we introduce DETECLAP, a method to enhance audio-visual representation learning with object information. Our key idea is to introduce an audio-visual label prediction loss to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its object awareness. To avoid costly manual annotations, we prepare object labels from both audio and visual inputs using state-of-the-art language-audio models and object detectors. We evaluate the method on audio-visual retrieval and classification using the VGGSound and AudioSet20K datasets. Our method achieves improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and visual-to-audio retrieval, respectively, and an improvement in accuracy of +0.6% for audio-visual classification.
{"title":"DETECLAP: Enhancing Audio-Visual Representation Learning with Object Information","authors":"Shota Nakada, Taichi Nishimura, Hokuto Munakata, Masayoshi Kondo, Tatsuya Komatsu","doi":"arxiv-2409.11729","DOIUrl":"https://doi.org/arxiv-2409.11729","url":null,"abstract":"Current audio-visual representation learning can capture rough object\u0000categories (e.g., ``animals'' and ``instruments''), but it lacks the ability to\u0000recognize fine-grained details, such as specific categories like ``dogs'' and\u0000``flutes'' within animals and instruments. To address this issue, we introduce\u0000DETECLAP, a method to enhance audio-visual representation learning with object\u0000information. Our key idea is to introduce an audio-visual label prediction loss\u0000to the existing Contrastive Audio-Visual Masked AutoEncoder to enhance its\u0000object awareness. To avoid costly manual annotations, we prepare object labels\u0000from both audio and visual inputs using state-of-the-art language-audio models\u0000and object detectors. We evaluate the method of audio-visual retrieval and\u0000classification using the VGGSound and AudioSet20K datasets. Our method achieves\u0000improvements in recall@10 of +1.5% and +1.2% for audio-to-visual and\u0000visual-to-audio retrieval, respectively, and an improvement in accuracy of\u0000+0.6% for audio-visual classification.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
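A rough sketch of the core idea, adding a multi-label object-prediction loss on top of an existing contrastive audio-visual objective, is given below; the embeddings, label vocabulary, and loss weight are placeholder assumptions rather than the DETECLAP implementation.

```python
# Sketch: augment a contrastive audio-visual objective with a multi-label
# object-prediction loss from pooled audio-visual features. The backbone,
# label source, and 0.5 weighting are assumed placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

num_objects = 300                      # assumed size of the object vocabulary
audio_emb = torch.randn(8, 512)        # pooled audio features from a pretrained encoder
visual_emb = torch.randn(8, 512)       # pooled visual features
object_labels = torch.randint(0, 2, (8, num_objects)).float()  # pseudo-labels from detectors

classifier = nn.Linear(1024, num_objects)

# Existing objective (placeholder): symmetric InfoNCE-style contrastive loss.
logits = audio_emb @ visual_emb.t() / 0.07
targets = torch.arange(8)
loss_contrastive = 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Added objective: predict which objects are present from the fused embedding.
fused = torch.cat([audio_emb, visual_emb], dim=-1)
loss_labels = F.binary_cross_entropy_with_logits(classifier(fused), object_labels)

loss = loss_contrastive + 0.5 * loss_labels   # 0.5 is an assumed weight
```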
Spin Detection Using Racket Bounce Sounds in Table Tennis
Pub Date : 2024-09-18 DOI: arxiv-2409.11760
Thomas Gossard, Julian Schmalzl, Andreas Ziegler, Andreas Zell
While table tennis players primarily rely on visual cues, sound provides valuable information. The sound generated when the ball strikes the racket can assist in predicting the ball's trajectory, especially in determining the spin. While professional players can distinguish spin through these auditory cues, they often go unnoticed by untrained players. In this paper, we demonstrate that different rackets produce distinct sounds, which can be used to identify the racket type. In addition, we show that the sound generated by the racket can indicate whether spin was applied to the ball or not. To achieve this, we created a comprehensive dataset featuring bounce sounds from 10 racket configurations, each applying various spins to the ball. To achieve millisecond-level temporal accuracy, we first detect high-frequency peaks that may correspond to table tennis ball bounces. We then refine these results using a CNN-based classifier that accurately predicts both the type of racket used and whether spin was applied.
{"title":"Spin Detection Using Racket Bounce Sounds in Table Tennis","authors":"Thomas Gossard, Julian Schmalzl, Andreas Ziegler, Andreas Zell","doi":"arxiv-2409.11760","DOIUrl":"https://doi.org/arxiv-2409.11760","url":null,"abstract":"While table tennis players primarily rely on visual cues, sound provides\u0000valuable information. The sound generated when the ball strikes the racket can\u0000assist in predicting the ball's trajectory, especially in determining the spin.\u0000While professional players can distinguish spin through these auditory cues,\u0000they often go unnoticed by untrained players. In this paper, we demonstrate\u0000that different rackets produce distinct sounds, which can be used to identify\u0000the racket type. In addition, we show that the sound generated by the racket\u0000can indicate whether spin was applied to the ball, or not. To achieve this, we\u0000created a comprehensive dataset featuring bounce sounds from 10 racket\u0000configurations, each applying various spins to the ball. To achieve millisecond\u0000level temporal accuracy, we first detect high frequency peaks that may\u0000correspond to table tennis ball bounces. We then refine these results using a\u0000CNN based classifier that accurately predicts both the type of racket used and\u0000whether spin was applied.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
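The two-step pipeline (high-frequency peak detection followed by CNN classification) can be sketched roughly as below; the filter cutoff, peak threshold, window length, and network are illustrative assumptions, not the authors' configuration.

```python
# Sketch of the two-step pipeline: locate candidate bounce instants from
# high-frequency energy peaks, then classify a short window around each peak
# with a small CNN. Thresholds, window length, and the CNN are assumptions.
import numpy as np
import torch
import torch.nn as nn
from scipy.signal import butter, sosfilt, find_peaks

sr = 44100
audio = np.random.randn(sr * 2).astype(np.float32)   # stand-in for a recording

# Step 1: emphasize high frequencies and pick sharp energy peaks.
sos = butter(4, 5000, btype="highpass", fs=sr, output="sos")
hf = sosfilt(sos, audio)
envelope = np.abs(hf)
peaks, _ = find_peaks(envelope, height=3.0, distance=int(0.05 * sr))

# Step 2: classify a short window around each candidate bounce.
class BounceCNN(nn.Module):
    def __init__(self, n_rackets=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, 9, stride=4), nn.ReLU(),
            nn.Conv1d(16, 32, 9, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
        )
        self.racket_head = nn.Linear(32, n_rackets)  # which racket configuration
        self.spin_head = nn.Linear(32, 2)            # spin vs. no spin

    def forward(self, x):
        h = self.net(x)
        return self.racket_head(h), self.spin_head(h)

model = BounceCNN()
win = int(0.02 * sr)  # 20 ms window around each detected peak
for p in peaks[:3]:
    segment = torch.from_numpy(audio[p : p + win]).reshape(1, 1, -1)
    if segment.shape[-1] == win:
        racket_logits, spin_logits = model(segment)
```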
DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech
Pub Date : 2024-09-18 DOI: arxiv-2409.11835
Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li
In recent years, speech diffusion models have advanced rapidly. Alongside the widely used U-Net architecture, transformer-based models such as the Diffusion Transformer (DiT) have also gained attention. However, current DiT speech models treat Mel spectrograms as general images, which overlooks the specific acoustic properties of speech. To address these limitations, we propose a method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which builds on DiT and achieves fast training without compromising accuracy. Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive inference approach that aligns more closely with acoustic properties, enhancing the naturalness of the generated speech. Additionally, we introduce a fine-grained style temporal modeling method that further improves speaker style similarity. Experimental results demonstrate that our method increases the training speed by nearly 2 times and significantly outperforms the baseline models.
{"title":"DPI-TTS: Directional Patch Interaction for Fast-Converging and Style Temporal Modeling in Text-to-Speech","authors":"Xin Qi, Ruibo Fu, Zhengqi Wen, Tao Wang, Chunyu Qiang, Jianhua Tao, Chenxing Li, Yi Lu, Shuchen Shi, Zhiyong Wang, Xiaopeng Wang, Yuankun Xie, Yukun Liu, Xuefei Liu, Guanjun Li","doi":"arxiv-2409.11835","DOIUrl":"https://doi.org/arxiv-2409.11835","url":null,"abstract":"In recent years, speech diffusion models have advanced rapidly. Alongside the\u0000widely used U-Net architecture, transformer-based models such as the Diffusion\u0000Transformer (DiT) have also gained attention. However, current DiT speech\u0000models treat Mel spectrograms as general images, which overlooks the specific\u0000acoustic properties of speech. To address these limitations, we propose a\u0000method called Directional Patch Interaction for Text-to-Speech (DPI-TTS), which\u0000builds on DiT and achieves fast training without compromising accuracy.\u0000Notably, DPI-TTS employs a low-to-high frequency, frame-by-frame progressive\u0000inference approach that aligns more closely with acoustic properties, enhancing\u0000the naturalness of the generated speech. Additionally, we introduce a\u0000fine-grained style temporal modeling method that further improves speaker style\u0000similarity. Experimental results demonstrate that our method increases the\u0000training speed by nearly 2 times and significantly outperforms the baseline\u0000models.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265495","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement
Pub Date : 2024-09-18 DOI: arxiv-2409.11725
Zizhen Lin, Yuanle Li, Junyu Wang, Ruili Li
Speech enhancement aims to improve speech quality and intelligibility in noisy environments. Recent advancements have concentrated on deep neural networks, particularly employing the Two-Stage (TS) architecture to enhance feature extraction. However, the complexity and size of these models remain significant, which limits their applicability in resource-constrained scenarios. Designing models suitable for edge devices presents its own set of challenges. Narrow lightweight models often encounter performance bottlenecks due to uneven loss landscapes. Additionally, advanced operators such as Transformers or Mamba may lack the practical adaptability and efficiency that convolutional neural networks (CNNs) offer in real-world deployments. To address these challenges, we propose Dense-TSNet, an innovative ultra-lightweight speech enhancement network. Our approach employs a novel Dense Two-Stage (Dense-TS) architecture, which, compared to the classic Two-Stage architecture, ensures more robust refinement of the objective function in the later training stages. This leads to improved final performance, addressing the early convergence limitations of the baseline model. We also introduce the Multi-View Gaze Block (MVGB), which enhances feature extraction by incorporating global, channel, and local perspectives through convolutional neural networks (CNNs). Furthermore, we discuss how the choice of loss function impacts perceptual quality. Dense-TSNet demonstrates promising performance with a compact model size of around 14K parameters, making it particularly well-suited for deployment in resource-constrained environments.
{"title":"Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement","authors":"Zizhen Lin, Yuanle Li, Junyu Wang, Ruili Li","doi":"arxiv-2409.11725","DOIUrl":"https://doi.org/arxiv-2409.11725","url":null,"abstract":"Speech enhancement aims to improve speech quality and intelligibility in\u0000noisy environments. Recent advancements have concentrated on deep neural\u0000networks, particularly employing the Two-Stage (TS) architecture to enhance\u0000feature extraction. However, the complexity and size of these models remain\u0000significant, which limits their applicability in resource-constrained\u0000scenarios. Designing models suitable for edge devices presents its own set of\u0000challenges. Narrow lightweight models often encounter performance bottlenecks\u0000due to uneven loss landscapes. Additionally, advanced operators such as\u0000Transformers or Mamba may lack the practical adaptability and efficiency that\u0000convolutional neural networks (CNNs) offer in real-world deployments. To\u0000address these challenges, we propose Dense-TSNet, an innovative\u0000ultra-lightweight speech enhancement network. Our approach employs a novel\u0000Dense Two-Stage (Dense-TS) architecture, which, compared to the classic\u0000Two-Stage architecture, ensures more robust refinement of the objective\u0000function in the later training stages. This leads to improved final\u0000performance, addressing the early convergence limitations of the baseline\u0000model. We also introduce the Multi-View Gaze Block (MVGB), which enhances\u0000feature extraction by incorporating global, channel, and local perspectives\u0000through convolutional neural networks (CNNs). Furthermore, we discuss how the\u0000choice of loss function impacts perceptual quality. Dense-TSNet demonstrates\u0000promising performance with a compact model size of around 14K parameters,\u0000making it particularly well-suited for deployment in resource-constrained\u0000environments.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265469","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Insights into the Incorporation of Signal Information in Binaural Signal Matching with Wearable Microphone Arrays
Pub Date : 2024-09-18 DOI: arxiv-2409.11731
Ami Berger, Vladimir Tourbabin, Jacob Donley, Zamir Ben-Hur, Boaz Rafaely
The increasing popularity of spatial audio in applications such as teleconferencing, entertainment, and virtual reality has led to the recent development of binaural reproduction methods. However, only a few of these methods are well-suited for wearable and mobile arrays, which typically consist of a small number of microphones. One such method is binaural signal matching (BSM), which has been shown to produce high-quality binaural signals for wearable arrays. However, BSM may be suboptimal in cases of a high direct-to-reverberant ratio (DRR), as it is based on the diffuse sound field assumption. To overcome this limitation, previous studies incorporated sound-field models other than diffuse. However, this approach was not studied comprehensively. This paper extensively investigates two BSM-based methods designed for high DRR scenarios. The methods incorporate a sound field model composed of direct and reverberant components. The methods are investigated both mathematically and using simulations, and finally validated by a listening test. The results show that the proposed methods can significantly improve the performance of BSM, in particular in the direction of the source, while presenting only a negligible degradation in other directions. Furthermore, when source direction estimation is inaccurate, the performance of these methods degrades to equal that of BSM, presenting a desired robustness quality.
{"title":"Insights into the Incorporation of Signal Information in Binaural Signal Matching with Wearable Microphone Arrays","authors":"Ami Berger, Vladimir Tourbabin, Jacob Donley, Zamir Ben-Hur, Boaz Rafaely","doi":"arxiv-2409.11731","DOIUrl":"https://doi.org/arxiv-2409.11731","url":null,"abstract":"The increasing popularity of spatial audio in applications such as\u0000teleconferencing, entertainment, and virtual reality has led to the recent\u0000developments of binaural reproduction methods. However, only a few of these\u0000methods are well-suited for wearable and mobile arrays, which typically consist\u0000of a small number of microphones. One such method is binaural signal matching\u0000(BSM), which has been shown to produce high-quality binaural signals for\u0000wearable arrays. However, BSM may be suboptimal in cases of high\u0000direct-to-reverberant ratio (DRR) as it is based on the diffuse sound field\u0000assumption. To overcome this limitation, previous studies incorporated\u0000sound-field models other than diffuse. However, this approach was not studied\u0000comprehensively. This paper extensively investigates two BSM-based methods\u0000designed for high DRR scenarios. The methods incorporate a sound field model\u0000composed of direct and reverberant components.The methods are investigated both\u0000mathematically and using simulations, finally validated by a listening test.\u0000The results show that the proposed methods can significantly improve the\u0000performance of BSM , in particular in the direction of the source, while\u0000presenting only a negligible degradation in other directions. Furthermore, when\u0000source direction estimation is inaccurate, performance of these methods degrade\u0000to equal that of the BSM, presenting a desired robustness quality.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265468","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
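As background, binaural signal matching is commonly posed as a regularized least-squares fit of microphone weights to an HRTF over a grid of candidate directions; the numpy sketch below illustrates that step at a single frequency bin with random placeholder transfer functions (the array ATFs, HRTF, grid size, and regularization value are all assumed).

```python
# Minimal illustration of the regularized least-squares step behind binaural
# signal matching at one frequency bin: find microphone weights whose combined
# response approximates the left-ear HRTF over a direction grid. The transfer
# functions here are random placeholders; real BSM uses measured or simulated
# array ATFs and HRTFs per frequency.
import numpy as np

rng = np.random.default_rng(1)
M, Q = 6, 72            # 6 microphones, 72 candidate directions
lam = 1e-3              # Tikhonov regularization

V = rng.normal(size=(M, Q)) + 1j * rng.normal(size=(M, Q))   # array steering vectors
h_left = rng.normal(size=Q) + 1j * rng.normal(size=Q)        # left-ear HRTF per direction

# Solve min_c  sum_q |c^H v_q - h_q|^2 + lam * ||c||^2
c = np.linalg.solve(V @ V.conj().T + lam * np.eye(M), V @ h_left.conj())

# Apply the filter to one STFT frame of the M-channel array signal.
x = rng.normal(size=M) + 1j * rng.normal(size=M)
left_ear_estimate = np.vdot(c, x)   # c^H x

# Matching error over the direction grid (smaller is better).
err = np.linalg.norm(c.conj() @ V - h_left) / np.linalg.norm(h_left)
print(f"relative matching error: {err:.3f}")
```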
Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0
Pub Date : 2024-09-18 DOI: arxiv-2409.11909
Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xiaopeng Wang, Yuankun Xie, Xin Qi, Shuchen Shi, Yi Lu, Yukun Liu, Chenxing Li, Xuefei Liu, Guanjun Li
Speech synthesis technology has posed a serious threat to speaker verification systems. Currently, the most effective fake audio detection methods utilize pretrained models, and integrating features from various layers of the pretrained model further enhances detection performance. However, most of the previously proposed fusion methods require fine-tuning the pretrained models, resulting in excessively long training times and hindering model iteration when facing new speech synthesis technology. To address this issue, this paper proposes a feature fusion method based on the Mixture of Experts, which extracts and integrates features relevant to fake audio detection from layer features, guided by a gating network based on the last-layer feature, while freezing the pretrained model. Experiments conducted on the ASVspoof2019 and ASVspoof2021 datasets demonstrate that the proposed method achieves competitive performance compared to those requiring fine-tuning.
{"title":"Mixture of Experts Fusion for Fake Audio Detection Using Frozen wav2vec 2.0","authors":"Zhiyong Wang, Ruibo Fu, Zhengqi Wen, Jianhua Tao, Xiaopeng Wang, Yuankun Xie, Xin Qi, Shuchen Shi, Yi Lu, Yukun Liu, Chenxing Li, Xuefei Liu, Guanjun Li","doi":"arxiv-2409.11909","DOIUrl":"https://doi.org/arxiv-2409.11909","url":null,"abstract":"Speech synthesis technology has posed a serious threat to speaker\u0000verification systems. Currently, the most effective fake audio detection methods utilize pretrained\u0000models, and integrating features from various layers of pretrained model\u0000further enhances detection performance. However, most of the previously proposed fusion methods require fine-tuning\u0000the pretrained models, resulting in excessively long training times and\u0000hindering model iteration when facing new speech synthesis technology. To address this issue, this paper proposes a feature fusion method based on\u0000the Mixture of Experts, which extracts and integrates features relevant to fake\u0000audio detection from layer features, guided by a gating network based on the\u0000last layer feature, while freezing the pretrained model. Experiments conducted on the ASVspoof2019 and ASVspoof2021 datasets\u0000demonstrate that the proposed method achieves competitive performance compared\u0000to those requiring fine-tuning.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265493","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
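A minimal sketch of the described fusion scheme follows: a gating network driven by the last hidden layer mixes features from all layers of a frozen wav2vec 2.0, and only the fusion module and classifier are trained. The expert design, pooling, and classifier head are assumptions rather than the paper's exact model.

```python
# Sketch of Mixture-of-Experts layer fusion over frozen wav2vec 2.0 features:
# a gate conditioned on the last hidden layer weights per-layer experts before
# a fake-audio classifier. Shapes and expert design are assumptions.
import torch
import torch.nn as nn

num_layers, dim, T = 13, 768, 100   # wav2vec 2.0 base: 12 transformer layers + input
hidden_states = [torch.randn(2, T, dim) for _ in range(num_layers)]  # frozen model outputs

class LayerMoEFusion(nn.Module):
    def __init__(self, num_layers, dim):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])
        self.gate = nn.Linear(dim, num_layers)   # gate driven by the last layer
        self.classifier = nn.Linear(dim, 2)      # bona fide vs. spoof

    def forward(self, hidden_states):
        last = hidden_states[-1].mean(dim=1)                      # (B, dim)
        weights = torch.softmax(self.gate(last), dim=-1)          # (B, num_layers)
        pooled = torch.stack(
            [e(h.mean(dim=1)) for e, h in zip(self.experts, hidden_states)], dim=1
        )                                                         # (B, num_layers, dim)
        fused = (weights.unsqueeze(-1) * pooled).sum(dim=1)       # (B, dim)
        return self.classifier(fused)

logits = LayerMoEFusion(num_layers, dim)(hidden_states)   # (2, 2)
```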
ASR Benchmarking: Need for a More Representative Conversational Dataset
Pub Date : 2024-09-18 DOI: arxiv-2409.12042
Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad
Automatic Speech Recognition (ASR) systems have achieved remarkable performance on widely used benchmarks such as LibriSpeech and Fleurs. However, these benchmarks do not adequately reflect the complexities of real-world conversational environments, where speech is often unstructured and contains disfluencies such as pauses, interruptions, and diverse accents. In this study, we introduce a multilingual conversational dataset, derived from TalkBank, consisting of unstructured phone conversations between adults. Our results show a significant performance drop across various state-of-the-art ASR models when tested in conversational settings. Furthermore, we observe a correlation between Word Error Rate and the presence of speech disfluencies, highlighting the critical need for more realistic, conversational ASR benchmarks.
{"title":"ASR Benchmarking: Need for a More Representative Conversational Dataset","authors":"Gaurav Maheshwari, Dmitry Ivanov, Théo Johannet, Kevin El Haddad","doi":"arxiv-2409.12042","DOIUrl":"https://doi.org/arxiv-2409.12042","url":null,"abstract":"Automatic Speech Recognition (ASR) systems have achieved remarkable\u0000performance on widely used benchmarks such as LibriSpeech and Fleurs. However,\u0000these benchmarks do not adequately reflect the complexities of real-world\u0000conversational environments, where speech is often unstructured and contains\u0000disfluencies such as pauses, interruptions, and diverse accents. In this study,\u0000we introduce a multilingual conversational dataset, derived from TalkBank,\u0000consisting of unstructured phone conversation between adults. Our results show\u0000a significant performance drop across various state-of-the-art ASR models when\u0000tested in conversational settings. Furthermore, we observe a correlation\u0000between Word Error Rate and the presence of speech disfluencies, highlighting\u0000the critical need for more realistic, conversational ASR benchmarks.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
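For reference, the Word Error Rate metric underlying the reported performance drop can be computed with the jiwer package as in the short example below; the reference/hypothesis pair is made up for illustration and shows how disfluencies inflate WER.

```python
# Small example of the metric behind the reported performance drop: word error
# rate on a disfluent, conversational-style hypothesis. The sentences are made
# up for illustration.
import jiwer

reference = "so I was thinking we could meet on Tuesday instead"
hypothesis = "so I was uh thinking we could we could meet Tuesday instead"

print(f"WER: {jiwer.wer(reference, hypothesis):.2f}")
```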
Data Efficient Acoustic Scene Classification using Teacher-Informed Confusing Class Instruction
Pub Date : 2024-09-18 DOI: arxiv-2409.11964
Jin Jie Sean Yeo, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan
In this technical report, we describe the SNTL-NTU team's submission for Task 1 Data-Efficient Low-Complexity Acoustic Scene Classification of the Detection and Classification of Acoustic Scenes and Events (DCASE) 2024 challenge. Three systems are introduced to tackle training splits of different sizes. For small training splits, we explored reducing the complexity of the provided baseline model by reducing the number of base channels. We introduce data augmentation in the form of mixup to increase the diversity of training samples. For the larger training splits, we use FocusNet to provide confusing class information to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models and baseline models trained on the original sampling rate of 44.1 kHz. We use Knowledge Distillation to distill the ensemble model into the baseline student model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile development dataset yielded the highest average testing accuracies of (62.21, 59.82, 56.81, 53.03, 47.97)% on the (100, 50, 25, 10, 5)% splits, respectively, across the three systems.
{"title":"Data Efficient Acoustic Scene Classification using Teacher-Informed Confusing Class Instruction","authors":"Jin Jie Sean Yeo, Ee-Leng Tan, Jisheng Bai, Santi Peksi, Woon-Seng Gan","doi":"arxiv-2409.11964","DOIUrl":"https://doi.org/arxiv-2409.11964","url":null,"abstract":"In this technical report, we describe the SNTL-NTU team's submission for Task\u00001 Data-Efficient Low-Complexity Acoustic Scene Classification of the detection\u0000and classification of acoustic scenes and events (DCASE) 2024 challenge. Three\u0000systems are introduced to tackle training splits of different sizes. For small\u0000training splits, we explored reducing the complexity of the provided baseline\u0000model by reducing the number of base channels. We introduce data augmentation\u0000in the form of mixup to increase the diversity of training samples. For the\u0000larger training splits, we use FocusNet to provide confusing class information\u0000to an ensemble of multiple Patchout faSt Spectrogram Transformer (PaSST) models\u0000and baseline models trained on the original sampling rate of 44.1 kHz. We use\u0000Knowledge Distillation to distill the ensemble model to the baseline student\u0000model. Training the systems on the TAU Urban Acoustic Scene 2022 Mobile\u0000development dataset yielded the highest average testing accuracy of (62.21,\u000059.82, 56.81, 53.03, 47.97)% on split (100, 50, 25, 10, 5)% respectively over\u0000the three systems.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142265492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
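The Knowledge Distillation step mentioned above typically blends hard-label cross-entropy with a temperature-softened KL term against the ensemble teacher's outputs; a generic sketch follows, where the temperature, weighting, and class count are assumed values rather than the team's settings.

```python
# Generic knowledge-distillation objective: the student is trained against a
# blend of the hard labels and the ensemble's temperature-softened outputs.
# Temperature, weighting, and shapes are assumed values.
import torch
import torch.nn.functional as F

num_classes, T, alpha = 10, 2.0, 0.5
student_logits = torch.randn(16, num_classes, requires_grad=True)
teacher_logits = torch.randn(16, num_classes)      # averaged ensemble predictions
labels = torch.randint(0, num_classes, (16,))

loss_hard = F.cross_entropy(student_logits, labels)
loss_soft = F.kl_div(
    F.log_softmax(student_logits / T, dim=-1),
    F.softmax(teacher_logits / T, dim=-1),
    reduction="batchmean",
) * (T * T)                                        # standard temperature scaling

loss = alpha * loss_hard + (1 - alpha) * loss_soft
loss.backward()
```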