WMCodec: End-to-End Neural Speech Codec with Deep Watermarking for Authenticity Verification

arXiv - EE - Audio and Speech Processing. Pub Date: 2024-09-18. DOI: arxiv-2409.12121

Junzuo Zhou, Jiangyan Yi, Yong Ren, Jianhua Tao, Tao Wang, Chu Yuan Zhang

Citations: 0

Abstract

Recent advances in speech spoofing necessitate stronger verification mechanisms in neural speech codecs to ensure authenticity. Current methods embed numerical watermarks before compression and extract them from the reconstructed speech for verification, but they face limitations such as separate training processes for the watermark and the codec, and insufficient cross-modal information integration, which reduce watermark imperceptibility, extraction accuracy, and capacity. To address these issues, we propose WMCodec, the first neural speech codec to jointly train compression-reconstruction and watermark embedding-extraction in an end-to-end manner, optimizing both the imperceptibility and the extractability of the watermark. Furthermore, we design an iterative Attention Imprint Unit (AIU) for deeper feature integration of watermark and speech, reducing the impact of quantization noise on the watermark. Experimental results show that WMCodec outperforms AudioSeal with Encodec in most quality metrics for watermark imperceptibility and consistently exceeds both AudioSeal with Encodec and reinforced TraceableSpeech in watermark extraction accuracy. At a bandwidth of 6 kbps with a watermark capacity of 16 bps, WMCodec maintains over 99% extraction accuracy under common attacks, demonstrating strong robustness.
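The embed, compress, extract pipeline and the bit-level extraction accuracy metric described above can be illustrated with a toy, non-neural stand-in. This sketch is not WMCodec: it swaps the learned embedder, the AIU and the codec for a classic spread-spectrum watermark (one pseudo-random +/-1 carrier per bit, acting as a secret key), a uniform scalar quantizer as a crude model of quantization noise, and a correlation decoder. The 16 kHz sample rate, the white-noise "speech", the `alpha` and `step` values, and every function name are hypothetical choices for illustration; only the 16 bps capacity comes from the abstract.

```python
import numpy as np

SAMPLE_RATE = 16_000   # assumption: 16 kHz mono audio (not stated in the abstract)
BITS_PER_SECOND = 16   # watermark capacity quoted in the abstract


def make_carriers(n_bits: int, seg_len: int, seed: int = 0) -> np.ndarray:
    """Toy key: one pseudo-random +/-1 carrier per watermark bit (spread spectrum)."""
    rng = np.random.default_rng(seed)
    return rng.choice([-1.0, 1.0], size=(n_bits, seg_len))


def embed(speech: np.ndarray, bits, carriers: np.ndarray, alpha: float) -> np.ndarray:
    """Add +alpha*carrier for a 1 bit and -alpha*carrier for a 0 bit, one segment per bit."""
    seg_len = carriers.shape[1]
    marked = speech.astype(float)  # astype returns a copy, so the input is untouched
    for i, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        marked[i * seg_len:(i + 1) * seg_len] += alpha * sign * carriers[i]
    return marked


def quantize(x: np.ndarray, step: float) -> np.ndarray:
    """Uniform scalar quantizer: a crude stand-in for codec quantization noise."""
    return step * np.round(x / step)


def extract(signal: np.ndarray, carriers: np.ndarray) -> np.ndarray:
    """Correlation decoder: a bit is 1 if its segment correlates positively with its carrier."""
    seg_len = carriers.shape[1]
    return np.array([
        int(np.dot(signal[i * seg_len:(i + 1) * seg_len], c) > 0)
        for i, c in enumerate(carriers)
    ])


def bit_accuracy(reference, decoded) -> float:
    """Extraction accuracy: fraction of watermark bits recovered correctly."""
    return float(np.mean(np.asarray(reference) == np.asarray(decoded)))


def demo(seconds: int = 1, alpha: float = 0.02, step: float = 0.05, seed: int = 0) -> float:
    """Embed -> quantize -> extract on white-noise 'speech'; returns bit accuracy."""
    rng = np.random.default_rng(seed)
    seg_len = SAMPLE_RATE // BITS_PER_SECOND   # 1000 samples carry one bit
    n_bits = BITS_PER_SECOND * seconds         # 16 bits per second of audio
    speech = 0.1 * rng.standard_normal(n_bits * seg_len)
    bits = rng.integers(0, 2, size=n_bits)
    carriers = make_carriers(n_bits, seg_len, seed=seed + 1)
    marked = embed(speech, bits, carriers, alpha)
    decoded = extract(quantize(marked, step), carriers)
    return bit_accuracy(bits, decoded)


if __name__ == "__main__":
    print(demo(seconds=3))
```

With these toy settings the watermark term in each correlation (`alpha` times 1000 samples, i.e. 20) is much larger than the host-signal and quantization interference (standard deviation roughly 3.2), so all bits should be recovered. Shrinking `alpha` makes the mark quieter but lets quantization noise and the host signal flip bits, which is the imperceptibility versus extractability trade-off that WMCodec targets by optimizing both jointly.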
