This paper presents a comprehensive study on the development and implementation of a speech prosody extractor that enhances audio security in Automatic Speaker Verification (ASV) systems. Our novel training approach, which operates without exposure to spoofing examples, significantly improves the modeling of essential prosodic elements that deep-fake attacks often overlook. By integrating codec and recording-device embeddings, the prosody extractor neutralizes codec-specific distortions, improving robustness across varied audio transmission channels. Combined with state-of-the-art ASV systems, our prosody extractor reduces the Equal Error Rate (EER) by an average of 49.15% without codecs, 50.53% with the G.711 codec, 44.77% with the G.729 codec, 43.43% with the Vonage channel, 42.05% with ECAPA-TDNN, and 45.17% with TitaNet across diverse datasets, including high-quality commercial deep fakes. This integration markedly improves the detection and mitigation of sophisticated spoofing attempts, especially in compressed or otherwise altered audio. Our methodology also eliminates the dependency on textual data during training, enabling the use of larger and more varied datasets.
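For readers unfamiliar with the metric, the Equal Error Rate reported above is the operating point at which a verification system's false-acceptance rate (FAR) equals its false-rejection rate (FRR). The sketch below is an illustrative implementation, not the paper's evaluation code; the score distributions are synthetic and the function name is our own.

```python
import numpy as np

def equal_error_rate(genuine_scores, impostor_scores):
    """Approximate the EER: sweep thresholds over all observed scores
    and return the error rate where FAR and FRR are closest."""
    thresholds = np.sort(np.concatenate([genuine_scores, impostor_scores]))
    best_gap, eer = np.inf, 1.0
    for t in thresholds:
        far = np.mean(impostor_scores >= t)  # impostors wrongly accepted
        frr = np.mean(genuine_scores < t)    # genuine speakers wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy example: well-separated genuine/impostor scores yield a low EER.
rng = np.random.default_rng(0)
genuine = rng.normal(2.0, 1.0, 1000)    # hypothetical genuine-trial scores
impostor = rng.normal(-2.0, 1.0, 1000)  # hypothetical impostor-trial scores
print(equal_error_rate(genuine, impostor))
```

A relative EER reduction such as the 49.15% figure above compares this quantity between a baseline ASV system and the same system augmented with the prosody extractor.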