{"title":"Kirigami","authors":"Sudershan Boovaraghavan, Haozhe Zhou, Mayank Goel, Yuvraj Agarwal","doi":"10.1145/3643502","DOIUrl":null,"url":null,"abstract":"Audio-based human activity recognition (HAR) is very popular because many human activities have unique sound signatures that can be detected using machine learning (ML) approaches. These audio-based ML HAR pipelines often use common featurization techniques, such as extracting various statistical and spectral features by converting time domain signals to the frequency domain (using an FFT) and using them to train ML models. Some of these approaches also claim privacy benefits by preventing the identification of human speech. However, recent deep learning-based automatic speech recognition (ASR) models pose new privacy challenges to these featurization techniques. In this paper, we systematically evaluate various featurization approaches for audio data, assessing their privacy risks through metrics like speech intelligibility (PER and WER) while considering the utility tradeoff in terms of ML-based activity recognition accuracy. Our findings reveal the susceptibility of these approaches to speech content recovery when exposed to recent ASR models, especially under re-tuning or retraining conditions. Notably, fine-tuned ASR models achieved an average Phoneme Error Rate (PER) of 39.99% and Word Error Rate (WER) of 44.43% in speech recognition for these approaches. To overcome these privacy concerns, we propose Kirigami, a lightweight machine learning-based audio speech filter that removes human speech segments reducing the efficacy of ASR models (70.48% PER and 101.40% WER) while also maintaining HAR accuracy (76.0% accuracy). We show that Kirigami can be implemented on common edge microcontrollers with limited computational capabilities and memory, providing a path to deployment on a variety of IoT devices. Finally, we conducted a real-world user study and showed the robustness of Kirigami on a laptop and an ARM Cortex-M4F microcontroller under three different background noises.","PeriodicalId":20553,"journal":{"name":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","volume":null,"pages":null},"PeriodicalIF":3.6000,"publicationDate":"2024-03-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3643502","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Abstract
Audio-based human activity recognition (HAR) is widely used because many human activities have unique sound signatures that can be detected using machine learning (ML) approaches. These audio-based ML HAR pipelines often use common featurization techniques, such as extracting various statistical and spectral features by converting time-domain signals to the frequency domain (using an FFT) and using them to train ML models. Some of these approaches also claim privacy benefits by preventing the identification of human speech. However, recent deep learning-based automatic speech recognition (ASR) models pose new privacy challenges to these featurization techniques. In this paper, we systematically evaluate various featurization approaches for audio data, assessing their privacy risks through speech intelligibility metrics (PER and WER) while considering the utility tradeoff in terms of ML-based activity recognition accuracy. Our findings reveal the susceptibility of these approaches to speech content recovery when exposed to recent ASR models, especially under re-tuning or retraining conditions. Notably, fine-tuned ASR models achieved an average Phoneme Error Rate (PER) of 39.99% and Word Error Rate (WER) of 44.43% in speech recognition for these approaches. To overcome these privacy concerns, we propose Kirigami, a lightweight ML-based audio speech filter that removes human speech segments, reducing the efficacy of ASR models (70.48% PER and 101.40% WER) while maintaining HAR accuracy (76.0%). We show that Kirigami can be implemented on common edge microcontrollers with limited computational capabilities and memory, providing a path to deployment on a variety of IoT devices. Finally, we conducted a real-world user study and showed the robustness of Kirigami on a laptop and an ARM Cortex-M4F microcontroller under three different background noises.
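For illustration, the following is a minimal Python sketch of the kind of FFT-based featurization step the abstract describes: converting a time-domain audio frame to the frequency domain and deriving a few statistical and spectral features for an ML activity classifier. This is not the authors' pipeline; the frame length, sample rate, and feature choices here are illustrative assumptions.

import numpy as np

def featurize_frame(frame: np.ndarray, sample_rate: int = 16000) -> np.ndarray:
    """Return a small feature vector for one time-domain audio frame."""
    spectrum = np.abs(np.fft.rfft(frame))                # magnitude spectrum
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    power = spectrum ** 2
    total = power.sum() + 1e-12                          # guard against divide-by-zero
    centroid = (freqs * power).sum() / total             # spectral centroid
    rolloff = freqs[np.searchsorted(np.cumsum(power), 0.85 * total)]  # 85% rolloff
    return np.array([
        frame.mean(), frame.std(),                       # time-domain statistics
        centroid, rolloff,                               # spectral statistics
        spectrum.max(), spectrum.mean(),
    ])

# Example: featurize 100 ms of synthetic audio at 16 kHz.
rng = np.random.default_rng(0)
frame = rng.standard_normal(1600).astype(np.float32)
print(featurize_frame(frame))

Note also that WER counts substitutions, deletions, and insertions against the number of reference words, so an insertion-heavy ASR transcript can score above 100%; this is why the 101.40% WER reported for Kirigami-filtered audio is a well-formed (and, for privacy, desirable) result.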