{"title":"A Countermeasure Based on CQT Spectrogram for Deepfake Speech Detection","authors":"Pedram Abdzadeh Ziabary, H. Veisi","doi":"10.1109/ICSPIS54653.2021.9729387","DOIUrl":null,"url":null,"abstract":"Nowadays, biometrics like face, voice, fingerprint, and iris are widely used for the identity authentication of individuals. Automatic Speaker Verification (ASV) systems aim to verify the speaker's authenticity, but recent research has shown that they are vulnerable to various types of attacks. A large number of Text-To-Speech (TTS) and Voice Conversion (VC) methods are being used to create the so-called synthetic or deepfake speech. In recent years, numerous works have been proposed to improve the spoofing detection ability to protect ASV systems against these attacks. This work proposes a synthetic speech detection system, which uses the spectrogram of Constant Q Transform (CQT) as its input features. The CQT spectrogram provides a constant Q factor in different frequency regions similar to the human perception system. Also, compared with Short-Term Fourier Transform (STFT), CQT provides higher time resolution at higher frequencies and higher frequency resolution at lower frequencies. Additionally, the CQT spectrogram has brought us low input feature dimensions, which aids with reducing needed computation time. The Constant Q Cepstral Coefficients (CQCC) features, driven from cepstral analysis of the CQT, have been employed in some recent works for voice spoofing detection. However, to the best of our knowledge, ours is the first work using CQT magnitude and power spectrogram directly for voice spoofing detection. We also use a combination of self-attended ResNet and one class learning to provide our model the robustness against unseen attacks. Finally, it is observed that even though using input features with relatively lower dimensions and reducing computation time, we can still obtain EER 3.53% and min t-DCF 0.10 on ASVspoof 2019 Logical Access (LA) dataset, which places our model among the top performers in this field.","PeriodicalId":286966,"journal":{"name":"2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 7th International Conference on Signal Processing and Intelligent Systems (ICSPIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSPIS54653.2021.9729387","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 5
Abstract
Nowadays, biometrics such as face, voice, fingerprint, and iris are widely used for identity authentication. Automatic Speaker Verification (ASV) systems aim to verify a speaker's claimed identity, but recent research has shown that they are vulnerable to various types of attacks. A large number of Text-To-Speech (TTS) and Voice Conversion (VC) methods are being used to create so-called synthetic, or deepfake, speech. In recent years, numerous countermeasures have been proposed to improve spoofing detection and protect ASV systems against these attacks. This work proposes a synthetic speech detection system that uses the Constant Q Transform (CQT) spectrogram as its input feature. The CQT spectrogram provides a constant Q factor across frequency regions, similar to the human auditory system. Also, compared with the Short-Time Fourier Transform (STFT), the CQT provides higher time resolution at higher frequencies and higher frequency resolution at lower frequencies. Additionally, the CQT spectrogram yields low-dimensional input features, which helps reduce the required computation time. Constant Q Cepstral Coefficients (CQCC), derived from cepstral analysis of the CQT, have been employed in some recent works on voice spoofing detection. However, to the best of our knowledge, ours is the first work to use the CQT magnitude and power spectrograms directly for voice spoofing detection. We also use a combination of a self-attended ResNet and one-class learning to give our model robustness against unseen attacks. Finally, we observe that, despite using input features of relatively low dimension and thereby reducing computation time, we still obtain an EER of 3.53% and a min t-DCF of 0.10 on the ASVspoof 2019 Logical Access (LA) dataset, which places our model among the top performers in this field.
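As a rough illustration of the feature extraction described above, the sketch below computes CQT magnitude and power spectrograms with librosa. The hop length, number of bins, and bins per octave are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: CQT magnitude and power spectrogram extraction
# (assumed parameters; the paper's settings may differ).
import numpy as np
import librosa

def cqt_spectrograms(wav_path, sr=16000, hop_length=256,
                     n_bins=84, bins_per_octave=12):
    # Load audio at the target sampling rate (ASVspoof audio is 16 kHz).
    y, sr = librosa.load(wav_path, sr=sr)

    # Complex CQT: the Q factor is constant across bins, so frequency
    # resolution is finer at low frequencies and time resolution is
    # finer at high frequencies, unlike the fixed-window STFT.
    C = librosa.cqt(y, sr=sr, hop_length=hop_length,
                    n_bins=n_bins, bins_per_octave=bins_per_octave)

    magnitude = np.abs(C)      # magnitude spectrogram
    power = magnitude ** 2     # power spectrogram

    # Log compression is a common step before feeding a CNN/ResNet.
    log_power = librosa.power_to_db(power, ref=np.max)
    return magnitude, power, log_power
```

With 84 bins, the feature has far fewer frequency channels than a typical high-resolution STFT, which is consistent with the abstract's point about low input dimensionality reducing computation time.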
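The abstract also mentions one-class learning for robustness to unseen attacks. A widely used instance of this idea in spoofing detection is the OC-Softmax loss (Zhang et al., 2021); the PyTorch sketch below implements that formulation under the assumption that it matches the paper's setup, with the margins `m_real`, `m_fake` and scale `alpha` set to common defaults rather than the authors' values.

```python
# Sketch: OC-Softmax one-class loss (a common one-class learning
# objective for spoofing detection; hyperparameters are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class OCSoftmax(nn.Module):
    def __init__(self, feat_dim=256, m_real=0.9, m_fake=0.2, alpha=20.0):
        super().__init__()
        # Single learnable center for the bona fide (target) class.
        self.center = nn.Parameter(torch.randn(1, feat_dim))
        self.m_real = m_real   # margin pulling bona fide scores up
        self.m_fake = m_fake   # margin pushing spoof scores down
        self.alpha = alpha     # scale factor

    def forward(self, embeddings, labels):
        # Cosine similarity between each embedding and the center.
        w = F.normalize(self.center, dim=1)
        x = F.normalize(embeddings, dim=1)
        scores = (x @ w.t()).squeeze(1)    # shape: (batch,)

        # labels: 1 for bona fide, 0 for spoofed speech.
        margins = torch.where(labels == 1,
                              self.m_real - scores,  # want score > m_real
                              scores - self.m_fake)  # want score < m_fake
        # softplus(z) = log(1 + exp(z)), the OC-Softmax objective.
        loss = F.softplus(self.alpha * margins).mean()
        return loss, scores
```

Because only the bona fide class gets a compact region around the learned center, spoofed utterances from attack types unseen in training still tend to fall outside it, which is the intuition behind the robustness claim.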