Title: Improving self-supervised learning model for audio spoofing detection with layer-conditioned embedding fusion
Authors: Souvik Sinha, Spandan Dey, Goutam Saha
DOI: 10.1016/j.csl.2023.101599
Journal: Computer Speech and Language (Q2, Computer Science, Artificial Intelligence; Impact Factor 3.1)
Publication date: 2023-12-18
URL: https://www.sciencedirect.com/science/article/pii/S0885230823001183
Citations: 0
Abstract
The use of voice recognition systems has grown considerably with advances in technology. This has allowed adversaries to falsely claim access to these systems by spoofing the identity of a target speaker. Existing supervised learning (SL)-based countermeasures are yet to provide a complete solution against newly evolving spoofing attacks. To tackle this problem, we explore self-supervised learning (SSL)-based frameworks. First, we implement widely used SSL frameworks with the target of identifying spoofed speech, and report a considerable overall performance improvement over the state-of-the-art SL baseline. We then perform an attack-wise comparative analysis between the SL and SSL frameworks. While SSL performs better in most cases, there are certain attacks on which SL outperforms it. Hence, we hypothesize that there is scope to jointly exploit the information effectively captured by both models for better performance. To do so, we first perform conventional weighted score fusion between the SL and the best-performing SSL models, which reduces the EER, outperforming both the state-of-the-art SL and the best-performing SSL framework. Then, we propose an embedding fusion scheme that minimizes the distance between the embedding distributions of the selected SL and SSL representations. To select the appropriate layers, we perform a comprehensive statistical analysis. The proposed fusion scheme outperforms the score fusion method and shows that SSL performance can be improved by effectively incorporating knowledge learned by the SL framework. The final EER achieved on the ASVspoof 2019 logical access (LA) database is 0.177%, a significant improvement over our baseline. Using the ASVspoof 2021 LA as a blind evaluation dataset, our proposed embedding fusion scheme reduces the EER to 2.666%.
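As a rough illustration of the weighted score fusion and EER evaluation the abstract describes, the sketch below combines per-utterance scores from two hypothetical countermeasure systems and measures the equal error rate. All score values, labels, function names, and the weight grid are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch: convex weighted score fusion of an SL and an SSL
# spoofing countermeasure, evaluated by equal error rate (EER).
# Scores, labels, and the weight grid are illustrative, not from the paper.
import numpy as np

def eer(scores: np.ndarray, labels: np.ndarray) -> float:
    """EER: operating point where the false-acceptance rate (spoof scored
    as bona fide) roughly equals the false-rejection rate (bona fide
    rejected), found here by scanning every candidate threshold."""
    best_gap, best_eer = np.inf, 1.0
    for t in np.unique(scores):
        far = float(np.mean(scores[labels == 0] >= t))  # spoof accepted
        frr = float(np.mean(scores[labels == 1] < t))   # bona fide rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer

def fuse(sl_scores: np.ndarray, ssl_scores: np.ndarray, w: float) -> np.ndarray:
    """Convex weighted combination of the two systems' scores."""
    return w * sl_scores + (1.0 - w) * ssl_scores

def best_weight(sl, ssl, labels, grid=np.linspace(0.0, 1.0, 21)):
    """Pick the fusion weight that minimizes EER on a development set."""
    return min(grid, key=lambda w: eer(fuse(sl, ssl, w), labels))
```

In practice the weight would be tuned on a development partition and then applied unchanged to the evaluation set; the paper's embedding-level fusion goes further by aligning internal representations rather than combining output scores.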
Journal overview:
Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language.
The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.