Hear "No Evil", See "Kenansville"*: Efficient and Transferable Black-Box Attacks on Speech Recognition and Voice Identification Systems

H. Abdullah, Muhammad Sajidur Rahman, Washington Garcia, Kevin Warren, Anurag Swarnim Yadav, T. Shrimpton, Patrick Traynor

2021 IEEE Symposium on Security and Privacy (SP), pp. 712-729. May 2021. DOI: 10.1109/SP40001.2021.00009
Citations: 25
Abstract
Automatic speech recognition and voice identification systems are being deployed in a wide array of applications, from providing control mechanisms to devices lacking traditional interfaces, to the automatic transcription of conversations and authentication of users. Many of these applications have significant security and privacy considerations. We develop attacks that force mistranscription and misidentification in state-of-the-art systems, with minimal impact on human comprehension. Processing pipelines for modern systems comprise signal preprocessing and feature extraction steps whose output is fed to a machine-learned model. Prior work has focused on the models, using white-box knowledge to tailor model-specific attacks. We focus on the pipeline stages before the models, which (unlike the models) are quite similar across systems. As such, our attacks are black-box, transferable, can be tuned to require zero queries to the target, and demonstrably achieve mistranscription and misidentification rates as high as 100% by modifying only a few frames of audio. We perform a study via Amazon Mechanical Turk demonstrating that there is no statistically significant difference between human perception of regular and perturbed audio. Our findings suggest that models may learn aspects of speech that are generally not perceived by human subjects, but that are crucial for model accuracy.
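To make the pipeline-stage idea concrete, the sketch below shows one plausible signal-level perturbation in the spirit of what the abstract describes, not the authors' published algorithm: it zeroes low-magnitude DFT bins in a handful of short audio frames, altering what a preprocessing/feature-extraction stage computes while leaving the perceptually dominant frequency content intact. The frame length, number of modified frames, and percentile threshold are illustrative assumptions.

```python
# Illustrative sketch only: a DFT-style perturbation on a few audio frames,
# assumed parameters throughout (frame_len, frames_to_modify, percentile).
import numpy as np

def perturb_frames(audio: np.ndarray, frame_len: int = 512,
                   frames_to_modify: int = 4,
                   percentile: float = 75.0) -> np.ndarray:
    """Zero low-magnitude DFT bins in the first few frames of `audio`."""
    out = audio.astype(np.float64)
    n_frames = len(out) // frame_len
    for i in range(min(frames_to_modify, n_frames)):
        frame = out[i * frame_len:(i + 1) * frame_len]
        spectrum = np.fft.rfft(frame)
        # Keep only bins whose magnitude exceeds the chosen percentile;
        # the discarded low-energy bins contribute little to perception
        # but do change the features a pipeline extracts from the frame.
        threshold = np.percentile(np.abs(spectrum), percentile)
        spectrum[np.abs(spectrum) < threshold] = 0.0
        out[i * frame_len:(i + 1) * frame_len] = np.fft.irfft(spectrum,
                                                              n=frame_len)
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)  # stand-in for 1 s of 16 kHz speech
    adv = perturb_frames(clean)
    print(f"L2 norm of perturbation: {np.linalg.norm(adv - clean):.3f}")
```

Because a perturbation of this kind is computed from the signal alone, with no model gradients or queries, any attack built on it is black-box and transferable by construction, which matches the threat model the abstract describes.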