利用暴力特征检测假语音的欺骗对策

IF 3.1 3区 计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Computer Speech and Language Pub Date : 2024-10-02 DOI:10.1016/j.csl.2024.101732
Arsalan Rahman Mirza , Abdulbasit K. Al-Talabani
{"title":"利用暴力特征检测假语音的欺骗对策","authors":"Arsalan Rahman Mirza ,&nbsp;Abdulbasit K. Al-Talabani","doi":"10.1016/j.csl.2024.101732","DOIUrl":null,"url":null,"abstract":"<div><div>Due to the progress in deep learning technology, techniques that generate spoofed speech have significantly emerged. Such synthetic speech can be exploited for harmful purposes, like impersonation or disseminating false information. Researchers in the area investigate the useful features for spoof detection. This paper extensively investigates three problems in spoof detection in speech, namely, the imbalanced sample per class, which may negatively affect the performance of any detection models, the effect of the feature early and late fusion, and the analysis of unseen attacks on the model. Regarding the imbalanced issue, we have proposed two approaches (a Synthetic Minority Over Sampling Technique (SMOTE)-based and a Bootstrap-based model). We have used the OpenSMILE toolkit, to extract different feature sets, their results and early and late fusion of them have been investigated. The experiments are evaluated using the ASVspoof 2019 datasets which encompass synthetic, voice-conversion, and replayed speech samples. Additionally, Support Vector Machine (SVM) and Deep Neural Network (DNN) have been adopted in the classification. The outcomes from various test scenarios indicated that neither the imbalanced nature of the dataset nor a specific feature or their fusions outperformed the brute force version of the model as the best Equal Error Rate (EER) achieved by the Imbalance model is 6.67 % and 1.80 % for both Logical Access (LA) and Physical Access (PA) respectively.</div></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":3.1000,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Spoofing countermeasure for fake speech detection using brute force features\",\"authors\":\"Arsalan Rahman Mirza ,&nbsp;Abdulbasit K. Al-Talabani\",\"doi\":\"10.1016/j.csl.2024.101732\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Due to the progress in deep learning technology, techniques that generate spoofed speech have significantly emerged. Such synthetic speech can be exploited for harmful purposes, like impersonation or disseminating false information. Researchers in the area investigate the useful features for spoof detection. This paper extensively investigates three problems in spoof detection in speech, namely, the imbalanced sample per class, which may negatively affect the performance of any detection models, the effect of the feature early and late fusion, and the analysis of unseen attacks on the model. Regarding the imbalanced issue, we have proposed two approaches (a Synthetic Minority Over Sampling Technique (SMOTE)-based and a Bootstrap-based model). We have used the OpenSMILE toolkit, to extract different feature sets, their results and early and late fusion of them have been investigated. The experiments are evaluated using the ASVspoof 2019 datasets which encompass synthetic, voice-conversion, and replayed speech samples. Additionally, Support Vector Machine (SVM) and Deep Neural Network (DNN) have been adopted in the classification. The outcomes from various test scenarios indicated that neither the imbalanced nature of the dataset nor a specific feature or their fusions outperformed the brute force version of the model as the best Equal Error Rate (EER) achieved by the Imbalance model is 6.67 % and 1.80 % for both Logical Access (LA) and Physical Access (PA) respectively.</div></div>\",\"PeriodicalId\":50638,\"journal\":{\"name\":\"Computer Speech and Language\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.1000,\"publicationDate\":\"2024-10-02\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Computer Speech and Language\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0885230824001153\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230824001153","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

摘要

由于深度学习技术的进步,生成欺骗性语音的技术已经大量涌现。这种合成语音可被用于有害目的,如冒名顶替或传播虚假信息。该领域的研究人员正在研究用于欺骗检测的有用特征。本文广泛研究了语音欺骗检测中的三个问题,即每类样本的不平衡(这可能会对任何检测模型的性能产生负面影响)、特征早期和晚期融合的影响以及对模型的未见攻击分析。关于不平衡问题,我们提出了两种方法(基于合成少数群体过度采样技术(SMOTE)的模型和基于 Bootstrap 的模型)。我们使用 OpenSMILE 工具包提取了不同的特征集,并对其结果以及早期和晚期融合进行了研究。实验使用 ASVspoof 2019 数据集进行评估,其中包括合成、语音转换和重放语音样本。此外,分类中还采用了支持向量机(SVM)和深度神经网络(DNN)。各种测试场景的结果表明,无论是数据集的不平衡性,还是特定特征或它们的融合,其性能都优于蛮力版本的模型,因为不平衡模型在逻辑访问(LA)和物理访问(PA)方面实现的最佳等错误率(EER)分别为 6.67 % 和 1.80 %。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Spoofing countermeasure for fake speech detection using brute force features
Due to the progress in deep learning technology, techniques that generate spoofed speech have significantly emerged. Such synthetic speech can be exploited for harmful purposes, like impersonation or disseminating false information. Researchers in the area investigate the useful features for spoof detection. This paper extensively investigates three problems in spoof detection in speech, namely, the imbalanced sample per class, which may negatively affect the performance of any detection models, the effect of the feature early and late fusion, and the analysis of unseen attacks on the model. Regarding the imbalanced issue, we have proposed two approaches (a Synthetic Minority Over Sampling Technique (SMOTE)-based and a Bootstrap-based model). We have used the OpenSMILE toolkit, to extract different feature sets, their results and early and late fusion of them have been investigated. The experiments are evaluated using the ASVspoof 2019 datasets which encompass synthetic, voice-conversion, and replayed speech samples. Additionally, Support Vector Machine (SVM) and Deep Neural Network (DNN) have been adopted in the classification. The outcomes from various test scenarios indicated that neither the imbalanced nature of the dataset nor a specific feature or their fusions outperformed the brute force version of the model as the best Equal Error Rate (EER) achieved by the Imbalance model is 6.67 % and 1.80 % for both Logical Access (LA) and Physical Access (PA) respectively.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Computer Speech and Language
Computer Speech and Language 工程技术-计算机:人工智能
CiteScore
11.30
自引率
4.70%
发文量
80
审稿时长
22.9 weeks
期刊介绍: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.
期刊最新文献
Entity and relationship extraction based on span contribution evaluation and focusing framework Taking relations as known conditions: A tagging based method for relational triple extraction What’s so complex about conversational speech? A comparison of HMM-based and transformer-based ASR architectures Combining replay and LoRA for continual learning in natural language understanding Optimizing pipeline task-oriented dialogue systems using post-processing networks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1