{"title":"A Countermeasure Based on CQT Spectrogram for Deepfake Speech Detection","authors":"Pedram Abdzadeh Ziabary, H. Veisi","doi":"10.1109/ICSPIS54653.2021.9729387","DOIUrl":null,"url":null,"abstract":"Nowadays, biometrics like face, voice, fingerprint, and iris are widely used for the identity authentication of individuals. Automatic Speaker Verification (ASV) systems aim to verify the speaker's authenticity, but recent research has shown that they are vulnerable to various types of attacks. A large number of Text-To-Speech (TTS) and Voice Conversion (VC) methods are being used to create the so-called synthetic or deepfake speech. In recent years, numerous works have been proposed to improve the spoofing detection ability to protect ASV systems against these attacks. This work proposes a synthetic speech detection system, which uses the spectrogram of Constant Q Transform (CQT) as its input features. The CQT spectrogram provides a constant Q factor in different frequency regions similar to the human perception system. Also, compared with Short-Term Fourier Transform (STFT), CQT provides higher time resolution at higher frequencies and higher frequency resolution at lower frequencies. Additionally, the CQT spectrogram has brought us low input feature dimensions, which aids with reducing needed computation time. The Constant Q Cepstral Coefficients (CQCC) features, driven from cepstral analysis of the CQT, have been employed in some recent works for voice spoofing detection. However, to the best of our knowledge, ours is the first work using CQT magnitude and power spectrogram directly for voice spoofing detection. We also use a combination of self-attended ResNet and one class learning to provide our model the robustness against unseen attacks. Finally, it is observed that even though using input features with relatively lower dimensions and reducing computation time, we can still obtain EER 3.53% and min t-DCF 0.10 on ASVspoof 2019 Logical Access (LA) dataset, which places our model among the top performers in this field.","PeriodicalId":286966,"journal":{"name":"2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSPIS54653.2021.9729387","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5
Abstract
Nowadays, biometrics such as face, voice, fingerprint, and iris are widely used for identity authentication. Automatic Speaker Verification (ASV) systems aim to verify a speaker's claimed identity, but recent research has shown that they are vulnerable to various types of attacks. A large number of Text-To-Speech (TTS) and Voice Conversion (VC) methods are being used to create so-called synthetic, or deepfake, speech. In recent years, numerous countermeasures have been proposed to improve spoofing detection and protect ASV systems against these attacks. This work proposes a synthetic speech detection system that uses the Constant Q Transform (CQT) spectrogram as its input feature. The CQT spectrogram provides a constant Q factor across frequency regions, similar to the human auditory system. Also, compared with the Short-Time Fourier Transform (STFT), the CQT provides higher time resolution at higher frequencies and higher frequency resolution at lower frequencies. Additionally, the CQT spectrogram yields low-dimensional input features, which helps reduce the required computation time. Constant Q Cepstral Coefficients (CQCC), derived from cepstral analysis of the CQT, have been employed in some recent works on voice spoofing detection. However, to the best of our knowledge, ours is the first work to use the CQT magnitude and power spectrograms directly for voice spoofing detection. We also use a combination of a self-attended ResNet and one-class learning to give our model robustness against unseen attacks. Finally, we observe that, despite using input features of relatively low dimension and thereby reducing computation time, we still obtain an EER of 3.53% and a min t-DCF of 0.10 on the ASVspoof 2019 Logical Access (LA) dataset, which places our model among the top performers in this field.
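As a rough illustration of the feature extraction described above, the sketch below computes CQT magnitude and power spectrograms with librosa. The hop length, number of bins, and bins per octave are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: CQT magnitude and power spectrogram extraction
# (assumed parameters; the paper's settings may differ).
import numpy as np
import librosa

def cqt_spectrograms(wav_path, sr=16000, hop_length=256,
                     n_bins=84, bins_per_octave=12):
    # Load audio at the target sampling rate (ASVspoof audio is 16 kHz).
    y, sr = librosa.load(wav_path, sr=sr)

    # Complex CQT: the Q factor is constant across bins, so frequency
    # resolution is finer at low frequencies and time resolution is
    # finer at high frequencies, unlike the fixed-window STFT.
    C = librosa.cqt(y, sr=sr, hop_length=hop_length,
                    n_bins=n_bins, bins_per_octave=bins_per_octave)

    magnitude = np.abs(C)      # magnitude spectrogram
    power = magnitude ** 2     # power spectrogram

    # Log compression is a common step before feeding a CNN/ResNet.
    log_power = librosa.power_to_db(power, ref=np.max)
    return magnitude, power, log_power
```

With 84 bins, the feature has far fewer frequency channels than a typical high-resolution STFT, which is consistent with the abstract's point about low input dimensionality reducing computation time.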
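The abstract also mentions one-class learning for robustness to unseen attacks. A widely used instance of this idea in spoofing detection is the OC-Softmax loss (Zhang et al., 2021); the PyTorch sketch below implements that formulation under the assumption that it matches the paper's setup, with the margins `m_real`, `m_fake` and scale `alpha` set to common defaults rather than the authors' values.

```python
# Sketch: OC-Softmax one-class loss (a common one-class learning
# objective for spoofing detection; hyperparameters are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCSoftmax(nn.Module):
    def __init__(self, feat_dim=256, m_real=0.9, m_fake=0.2, alpha=20.0):
        super().__init__()
        # Single learnable center for the bona fide (target) class.
        self.center = nn.Parameter(torch.randn(1, feat_dim))
        self.m_real = m_real   # margin pulling bona fide scores up
        self.m_fake = m_fake   # margin pushing spoof scores down
        self.alpha = alpha     # scale factor

    def forward(self, embeddings, labels):
        # Cosine similarity between each embedding and the center.
        w = F.normalize(self.center, dim=1)
        x = F.normalize(embeddings, dim=1)
        scores = (x @ w.t()).squeeze(1)    # shape: (batch,)

        # labels: 1 for bona fide, 0 for spoofed speech.
        margins = torch.where(labels == 1,
                              self.m_real - scores,  # want score > m_real
                              scores - self.m_fake)  # want score < m_fake
        # softplus(z) = log(1 + exp(z)), the OC-Softmax objective.
        loss = F.softplus(self.alpha * margins).mean()
        return loss, scores
```

Because only the bona fide class gets a compact region around the learned center, spoofed utterances from attack types unseen in training still tend to fall outside it, which is the intuition behind the robustness claim.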