{"title":"基于特征分解学习和合成器特征增强的鲁棒人工智能合成语音检测","authors":"Kuiyuan Zhang;Zhongyun Hua;Yushu Zhang;Yifang Guo;Tao Xiang","doi":"10.1109/TIFS.2024.3520001","DOIUrl":null,"url":null,"abstract":"AI-synthesized speech, also known as deepfake speech, has recently raised significant concerns due to the rapid advancement of speech synthesis and speech conversion techniques. Previous works often rely on distinguishing synthesizer artifacts to identify deepfake speech. However, excessive reliance on these specific synthesizer artifacts may result in unsatisfactory performance when addressing speech signals created by unseen synthesizers. In this paper, we propose a robust deepfake speech detection method that employs feature decomposition to learn synthesizer-independent content features as complementary for detection. Specifically, we propose a dual-stream feature decomposition learning strategy that decomposes the learned speech representation using a synthesizer stream and a content stream. The synthesizer stream specializes in learning synthesizer features through supervised training with synthesizer labels. Meanwhile, the content stream focuses on learning synthesizer-independent content features, enabled by a pseudo-labeling-based supervised learning method. This method randomly transforms speech to generate speed and compression labels for training. Additionally, we employ an adversarial learning technique to reduce the synthesizer-related components in the content stream. The final classification is determined by concatenating the synthesizer and content features. To enhance the model’s robustness to different synthesizer characteristics, we further propose a synthesizer feature augmentation strategy that randomly blends the characteristic styles within real and fake audio features and randomly shuffles the synthesizer features with the content features. This strategy effectively enhances the feature diversity and simulates more feature combinations. Experimental results on four deepfake speech benchmark datasets demonstrate that our model achieves state-of-the-art robust detection performance across various evaluation scenarios, including cross-method, cross-dataset, and cross-language evaluations.","PeriodicalId":13492,"journal":{"name":"IEEE Transactions on Information Forensics and Security","volume":"20 ","pages":"871-885"},"PeriodicalIF":6.3000,"publicationDate":"2024-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation\",\"authors\":\"Kuiyuan Zhang;Zhongyun Hua;Yushu Zhang;Yifang Guo;Tao Xiang\",\"doi\":\"10.1109/TIFS.2024.3520001\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"AI-synthesized speech, also known as deepfake speech, has recently raised significant concerns due to the rapid advancement of speech synthesis and speech conversion techniques. Previous works often rely on distinguishing synthesizer artifacts to identify deepfake speech. However, excessive reliance on these specific synthesizer artifacts may result in unsatisfactory performance when addressing speech signals created by unseen synthesizers. In this paper, we propose a robust deepfake speech detection method that employs feature decomposition to learn synthesizer-independent content features as complementary for detection. 
Specifically, we propose a dual-stream feature decomposition learning strategy that decomposes the learned speech representation using a synthesizer stream and a content stream. The synthesizer stream specializes in learning synthesizer features through supervised training with synthesizer labels. Meanwhile, the content stream focuses on learning synthesizer-independent content features, enabled by a pseudo-labeling-based supervised learning method. This method randomly transforms speech to generate speed and compression labels for training. Additionally, we employ an adversarial learning technique to reduce the synthesizer-related components in the content stream. The final classification is determined by concatenating the synthesizer and content features. To enhance the model’s robustness to different synthesizer characteristics, we further propose a synthesizer feature augmentation strategy that randomly blends the characteristic styles within real and fake audio features and randomly shuffles the synthesizer features with the content features. This strategy effectively enhances the feature diversity and simulates more feature combinations. Experimental results on four deepfake speech benchmark datasets demonstrate that our model achieves state-of-the-art robust detection performance across various evaluation scenarios, including cross-method, cross-dataset, and cross-language evaluations.\",\"PeriodicalId\":13492,\"journal\":{\"name\":\"IEEE Transactions on Information Forensics and Security\",\"volume\":\"20 \",\"pages\":\"871-885\"},\"PeriodicalIF\":6.3000,\"publicationDate\":\"2024-12-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Information Forensics and Security\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10806877/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Information Forensics and Security","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10806877/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation
AI-synthesized speech, also known as deepfake speech, has recently raised significant concerns due to the rapid advancement of speech synthesis and voice conversion techniques. Previous works often rely on distinguishing synthesizer artifacts to identify deepfake speech. However, excessive reliance on these specific synthesizer artifacts may result in unsatisfactory performance when addressing speech signals created by unseen synthesizers. In this paper, we propose a robust deepfake speech detection method that employs feature decomposition to learn synthesizer-independent content features as a complement for detection. Specifically, we propose a dual-stream feature decomposition learning strategy that decomposes the learned speech representation using a synthesizer stream and a content stream. The synthesizer stream specializes in learning synthesizer features through supervised training with synthesizer labels. Meanwhile, the content stream focuses on learning synthesizer-independent content features, enabled by a pseudo-labeling-based supervised learning method. This method randomly applies speed and compression transformations to the speech and uses the applied transformations as pseudo-labels for training. Additionally, we employ an adversarial learning technique to reduce the synthesizer-related components in the content stream. The final classification is made on the concatenation of the synthesizer and content features. To enhance the model's robustness to different synthesizer characteristics, we further propose a synthesizer feature augmentation strategy that randomly blends the characteristic styles within real and fake audio features and randomly shuffles the synthesizer features with the content features. This strategy effectively increases feature diversity by simulating a wider range of feature combinations. Experimental results on four deepfake speech benchmark datasets demonstrate that our model achieves state-of-the-art robust detection performance across various evaluation scenarios, including cross-method, cross-dataset, and cross-language evaluations.
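The abstract describes the content stream's pseudo-labeling step only at a high level: speech is randomly transformed, and the applied speed and compression transformations serve as training labels. The sketch below illustrates one plausible reading in PyTorch; the parameter grids (SPEED_FACTORS, COMPRESSION_LEVELS), the function names, and the crude codec stand-in are all assumptions, not the paper's actual implementation.

```python
import random
import torch
import torch.nn.functional as F

# Hypothetical parameter grids; the paper's actual choices are not stated
# in the abstract. The pseudo-label is the index into each grid.
SPEED_FACTORS = [0.9, 1.0, 1.1]
COMPRESSION_LEVELS = ["none", "light", "heavy"]

def _simulate_codec(wav: torch.Tensor, stride: int) -> torch.Tensor:
    # Crude stand-in for a lossy-codec round trip: decimate then linearly
    # upsample, discarding high-frequency detail. A real pipeline would
    # re-encode the audio with an actual mp3/aac encoder instead.
    down = wav[..., ::stride]
    return F.interpolate(down.unsqueeze(0), size=wav.shape[-1],
                         mode="linear", align_corners=False).squeeze(0)

def make_pseudo_labeled_sample(wav: torch.Tensor):
    """wav: (channels, samples). Returns (transformed wav, speed label, compression label)."""
    speed_idx = random.randrange(len(SPEED_FACTORS))
    comp_idx = random.randrange(len(COMPRESSION_LEVELS))

    # Speed change by linear resampling (a simple stand-in for a sox-style
    # "speed" effect, which would also shift pitch).
    new_len = int(wav.shape[-1] / SPEED_FACTORS[speed_idx])
    out = F.interpolate(wav.unsqueeze(0), size=new_len,
                        mode="linear", align_corners=False).squeeze(0)

    if COMPRESSION_LEVELS[comp_idx] == "light":
        out = _simulate_codec(out, stride=2)
    elif COMPRESSION_LEVELS[comp_idx] == "heavy":
        out = _simulate_codec(out, stride=4)

    return out, speed_idx, comp_idx
```

Because these labels depend only on the transformations applied, they give the content stream a supervision signal that is, by construction, independent of which synthesizer produced the audio.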
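The abstract also states that an adversarial learning technique suppresses synthesizer-related components in the content stream, without naming the technique. A gradient reversal layer, the standard trick from domain-adversarial training, is one common way to realize this; the sketch below assumes that choice rather than reproducing the authors' implementation.

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) gradients in the
    backward pass. Placed between the content encoder and an auxiliary
    synthesizer classifier, it trains the encoder to *remove* synthesizer cues."""

    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Gradient w.r.t. x is reversed; lam is a constant, so it gets None.
        return -ctx.lam * grad_output, None

def grad_reverse(x: torch.Tensor, lam: float = 1.0) -> torch.Tensor:
    return GradReverse.apply(x, lam)

# Usage sketch: an auxiliary head tries to predict the synthesizer from
# content features, while the reversed gradients push the encoder to make
# that prediction impossible.
# logits = synthesizer_head(grad_reverse(content_features))
```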
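Likewise, the synthesizer feature augmentation strategy is only summarized: characteristic styles are randomly blended within real and fake audio features, and synthesizer features are randomly re-paired with content features. A MixStyle-like interpolation of channel statistics plus a batch-level shuffle is one minimal reading; the shapes, names, and mixing distribution below are assumptions.

```python
import torch

def blend_styles(feat: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """MixStyle-like blending within a batch of same-label (all-real or
    all-fake) feature maps of shape (B, C, T): each sample is re-normalized
    with channel statistics interpolated toward another random sample's."""
    mu = feat.mean(dim=-1, keepdim=True)              # (B, C, 1)
    sig = feat.std(dim=-1, keepdim=True) + 1e-6
    perm = torch.randperm(feat.shape[0])
    lam = torch.rand(feat.shape[0], 1, 1) * alpha     # random mix weights
    mu_mix = (1 - lam) * mu + lam * mu[perm]
    sig_mix = (1 - lam) * sig + lam * sig[perm]
    return (feat - mu) / sig * sig_mix + mu_mix

def shuffle_feature_pairs(synth: torch.Tensor, content: torch.Tensor) -> torch.Tensor:
    """Re-pair pooled synthesizer embeddings (B, D1) with content embeddings
    (B, D2) across the batch, so the classifier sees synthesizer/content
    combinations that never co-occurred in the training data."""
    perm = torch.randperm(synth.shape[0])
    return torch.cat([synth[perm], content], dim=-1)  # (B, D1 + D2)
```

In practice the shuffle would need to respect the real/fake labels (for example, permuting only within samples of the same class) so that each recombined feature still has a well-defined classification target.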
Journal introduction:
The IEEE Transactions on Information Forensics and Security covers the sciences, technologies, and applications relating to information forensics, information security, biometrics, surveillance, and systems applications that incorporate these features.