Title: Improving self-supervised learning model for audio spoofing detection with layer-conditioned embedding fusion
Authors: Souvik Sinha, Spandan Dey, Goutam Saha
DOI: 10.1016/j.csl.2023.101599
Journal: Computer Speech and Language (Q2, Computer Science, Artificial Intelligence; Impact Factor 3.1)
Publication date: 2023-12-18
URL: https://www.sciencedirect.com/science/article/pii/S0885230823001183
Citations: 0
Abstract
The use of voice recognition systems has grown considerably with advances in technology. This has allowed adversaries to falsely claim access to these systems by spoofing the identity of a target speaker. Existing supervised learning (SL)-based countermeasures are yet to provide a complete solution against newly evolving spoofing attacks. To tackle this problem, we explore self-supervised learning (SSL)-based frameworks. First, we implement widely used SSL frameworks with the target of identifying spoofed speech, and report a considerable overall performance improvement over the state-of-the-art SL baseline. We then perform an attack-wise comparative analysis between the SL and SSL frameworks. While SSL performs better in most cases, there are certain attacks on which SL outperforms it. Hence, we hypothesize that there is scope to jointly exploit the information effectively captured by both models for better performance. To do so, we first perform conventional weighted score fusion between the SL and the best-performing SSL models, which reduces the EER, outperforming both the state-of-the-art SL and the best-performing SSL framework. Then, we propose an embedding fusion scheme that minimizes the distance between the embedding distributions of the selected SL and SSL representations. To select the appropriate layers, we perform a comprehensive statistical analysis. The proposed fusion scheme outperforms the score fusion method and shows that SSL performance can be improved by effectively incorporating knowledge learned by the SL framework. The final EER achieved on the ASVspoof 2019 logical access (LA) database is 0.177%, a significant improvement over our baseline. Using the ASVspoof 2021 LA as a blind evaluation dataset, our proposed embedding fusion scheme reduces the EER to 2.666%.
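As a rough illustration of the weighted score fusion and EER evaluation the abstract describes, the sketch below combines per-utterance scores from two hypothetical countermeasure systems and measures the equal error rate. All score values, labels, function names, and the weight grid are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: convex weighted score fusion of an SL and an SSL
# spoofing countermeasure, evaluated by equal error rate (EER).
# Scores, labels, and the weight grid are illustrative, not from the paper.
import numpy as np

def eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: operating point where the false-acceptance rate (spoof scored
    as bona fide) roughly equals the false-rejection rate (bona fide
    rejected), found here by scanning every candidate threshold."""
    best_gap, best_eer = np.inf, 1.0
    for t in np.unique(scores):
        far = float(np.mean(scores[labels == 0] >= t))  # spoof accepted
        frr = float(np.mean(scores[labels == 1] < t))   # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

def fuse(sl_scores: np.ndarray, ssl_scores: np.ndarray, w: float) -> np.ndarray:
    """Convex weighted combination of the two systems' scores."""
    return w * sl_scores + (1.0 - w) * ssl_scores

def best_weight(sl, ssl, labels, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the fusion weight that minimizes EER on a development set."""
    return min(grid, key=lambda w: eer(fuse(sl, ssl, w), labels))
```

In practice the weight would be tuned on a development partition and then applied unchanged to the evaluation set; the paper's embedding-level fusion goes further by aligning internal representations rather than combining output scores.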
Journal overview:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.