{"title":"AFP-Conformer:用于欺骗性语音检测的渐进特征金字塔构形器","authors":"Yida Huang, Qian Shen, Jianfen Ma","doi":"10.1016/j.specom.2024.103149","DOIUrl":null,"url":null,"abstract":"<div><div>The existing spoofing speech detection methods mostly use either convolutional neural networks or Transformer architectures as their backbone, which fail to adequately represent speech features during feature extraction, resulting in poor detection and generalization performance of the models. To solve this limitation, we propose a novel spoofing speech detection method based on the Conformer architecture. This method integrates a convolutional module into the Transformer framework to enhance its capacity for local feature modeling, enabling to extract both local and global information from speech signals simultaneously. Besides, to mitigate the issue of semantic information loss or degradation in traditional feature pyramid networks during feature fusion, we propose a feature fusion method based on the asymptotic feature pyramid network (AFPN) to fuse multi-scale features and improve generalization of detecting unknown attacks. Our experiments conducted on the ASVspoof 2019 LA dataset demonstrate that our proposed method achieved the equal error rate (EER) of 1.61 % and the minimum tandem detection cost function (min t-DCF) of 0.045, effectively improving the detection performance of the model while enhancing its generalization capability against unknown spoofing attacks. In particular, it demonstrates substantial performance improvement in detecting the most challenging A17 attack.</div></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"166 ","pages":"Article 103149"},"PeriodicalIF":2.4000,"publicationDate":"2024-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"AFP-Conformer: Asymptotic feature pyramid conformer for spoofing speech detection\",\"authors\":\"Yida Huang, Qian Shen, Jianfen Ma\",\"doi\":\"10.1016/j.specom.2024.103149\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>The existing spoofing speech detection methods mostly use either convolutional neural networks or Transformer architectures as their backbone, which fail to adequately represent speech features during feature extraction, resulting in poor detection and generalization performance of the models. To solve this limitation, we propose a novel spoofing speech detection method based on the Conformer architecture. This method integrates a convolutional module into the Transformer framework to enhance its capacity for local feature modeling, enabling to extract both local and global information from speech signals simultaneously. Besides, to mitigate the issue of semantic information loss or degradation in traditional feature pyramid networks during feature fusion, we propose a feature fusion method based on the asymptotic feature pyramid network (AFPN) to fuse multi-scale features and improve generalization of detecting unknown attacks. Our experiments conducted on the ASVspoof 2019 LA dataset demonstrate that our proposed method achieved the equal error rate (EER) of 1.61 % and the minimum tandem detection cost function (min t-DCF) of 0.045, effectively improving the detection performance of the model while enhancing its generalization capability against unknown spoofing attacks. In particular, it demonstrates substantial performance improvement in detecting the most challenging A17 attack.</div></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":\"166 \",\"pages\":\"Article 103149\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2024-11-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639324001201\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639324001201","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
AFP-Conformer: Asymptotic feature pyramid conformer for spoofing speech detection
The existing spoofing speech detection methods mostly use either convolutional neural networks or Transformer architectures as their backbone, which fail to adequately represent speech features during feature extraction, resulting in poor detection and generalization performance of the models. To solve this limitation, we propose a novel spoofing speech detection method based on the Conformer architecture. This method integrates a convolutional module into the Transformer framework to enhance its capacity for local feature modeling, enabling to extract both local and global information from speech signals simultaneously. Besides, to mitigate the issue of semantic information loss or degradation in traditional feature pyramid networks during feature fusion, we propose a feature fusion method based on the asymptotic feature pyramid network (AFPN) to fuse multi-scale features and improve generalization of detecting unknown attacks. Our experiments conducted on the ASVspoof 2019 LA dataset demonstrate that our proposed method achieved the equal error rate (EER) of 1.61 % and the minimum tandem detection cost function (min t-DCF) of 0.045, effectively improving the detection performance of the model while enhancing its generalization capability against unknown spoofing attacks. In particular, it demonstrates substantial performance improvement in detecting the most challenging A17 attack.
期刊介绍:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal''s primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.