基于自注意的卷积GRU增强对抗性语音示例

IF 0.8 Q4 COMPUTER SCIENCE, SOFTWARE ENGINEERING International Journal of Image and Graphics Pub Date : 2023-07-08 DOI:10.1142/s0219467824500530

Chaitanya Jannu, S. Vanambathina

{"title":"基于自注意的卷积GRU增强对抗性语音示例","authors":"Chaitanya Jannu, S. Vanambathina","doi":"10.1142/s0219467824500530","DOIUrl":null,"url":null,"abstract":"Recent research has identified adversarial examples which are the challenges to DNN-based ASR systems. In this paper, we propose a new model based on Convolutional GRU and Self-attention U-Net called [Formula: see text] to improve adversarial speech signals. To represent the correlation between neighboring noisy speech frames, a two-Layer GRU is added in the bottleneck of U-Net and an attention gate is inserted in up-sampling units to increase the adversarial stability. The goal of using GRU is to combine the weights sharing technique with the use of gates to control the flow of data across multiple feature maps. As a result, it outperforms the original 1D convolution used in [Formula: see text]. Especially, the performance of the model is evaluated by explainable speech recognition metrics and its performance is analyzed by the improved adversarial training. We used adversarial audio attacks to perform experiments on automatic speech recognition (ASR). We saw (i) the robustness of ASR models which are based on DNN can be improved using the temporal features grasped by the attention-based GRU network; (ii) through adversarial training, including some additive adversarial data augmentation, we could improve the generalization power of Automatic Speech Recognition models which are based on DNN. The word-error-rate (WER) metric confirmed that the enhancement capabilities are better than the state-of-the-art [Formula: see text]. The reason for this enhancement is the ability of GRU units to extract global information within the feature maps. Based on the conducted experiments, the proposed [Formula: see text] increases the score of Speech Transmission Index (STI), Perceptual Evaluation of Speech Quality (PESQ), and the Short-term Objective Intelligibility (STOI) with adversarial speech examples in speech enhancement.","PeriodicalId":44688,"journal":{"name":"International Journal of Image and Graphics","volume":" ","pages":""},"PeriodicalIF":0.8000,"publicationDate":"2023-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Self-Attention-Based Convolutional GRU for Enhancement of Adversarial Speech Examples\",\"authors\":\"Chaitanya Jannu, S. Vanambathina\",\"doi\":\"10.1142/s0219467824500530\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recent research has identified adversarial examples which are the challenges to DNN-based ASR systems. In this paper, we propose a new model based on Convolutional GRU and Self-attention U-Net called [Formula: see text] to improve adversarial speech signals. To represent the correlation between neighboring noisy speech frames, a two-Layer GRU is added in the bottleneck of U-Net and an attention gate is inserted in up-sampling units to increase the adversarial stability. The goal of using GRU is to combine the weights sharing technique with the use of gates to control the flow of data across multiple feature maps. As a result, it outperforms the original 1D convolution used in [Formula: see text]. Especially, the performance of the model is evaluated by explainable speech recognition metrics and its performance is analyzed by the improved adversarial training. We used adversarial audio attacks to perform experiments on automatic speech recognition (ASR). We saw (i) the robustness of ASR models which are based on DNN can be improved using the temporal features grasped by the attention-based GRU network; (ii) through adversarial training, including some additive adversarial data augmentation, we could improve the generalization power of Automatic Speech Recognition models which are based on DNN. The word-error-rate (WER) metric confirmed that the enhancement capabilities are better than the state-of-the-art [Formula: see text]. The reason for this enhancement is the ability of GRU units to extract global information within the feature maps. Based on the conducted experiments, the proposed [Formula: see text] increases the score of Speech Transmission Index (STI), Perceptual Evaluation of Speech Quality (PESQ), and the Short-term Objective Intelligibility (STOI) with adversarial speech examples in speech enhancement.\",\"PeriodicalId\":44688,\"journal\":{\"name\":\"International Journal of Image and Graphics\",\"volume\":\" \",\"pages\":\"\"},\"PeriodicalIF\":0.8000,\"publicationDate\":\"2023-07-08\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"International Journal of Image and Graphics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1142/s0219467824500530\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, SOFTWARE ENGINEERING\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Journal of Image and Graphics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1142/s0219467824500530","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, SOFTWARE ENGINEERING","Score":null,"Total":0}

引用次数: 0

摘要

最近的研究已经确定了对抗性示例，这些示例是对基于DNN的ASR系统的挑战。在本文中，我们提出了一种基于卷积GRU和自注意U-Net的新模型，称为[公式：见正文]，以改进对抗性语音信号。为了表示相邻噪声语音帧之间的相关性，在U-Net的瓶颈中添加了两层GRU，并在上采样单元中插入了注意门，以提高对抗性稳定性。使用GRU的目标是将权重共享技术与使用门相结合，以控制多个特征图之间的数据流。因此，它优于[公式：见正文]中使用的原始1D卷积。特别是，通过可解释的语音识别指标来评估该模型的性能，并通过改进的对抗性训练来分析其性能。我们使用对抗性音频攻击来进行自动语音识别（ASR）实验。我们看到（i）使用基于注意力的GRU网络所掌握的时间特征，可以提高基于DNN的ASR模型的鲁棒性；（ii）通过对抗性训练，包括一些附加的对抗性数据增强，我们可以提高基于DNN的自动语音识别模型的泛化能力。单词错误率（WER）指标证实了增强能力优于最先进的[公式：见正文]。这种增强的原因是GRU单元能够提取特征图中的全局信息。基于所进行的实验，所提出的[公式：见正文]在语音增强中提高了对抗性语音示例的语音传输指数（STI）、语音质量感知评估（PESQ）和短期目标可理解性（STOI）的得分。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Self-Attention-Based Convolutional GRU for Enhancement of Adversarial Speech Examples

Recent research has identified adversarial examples which are the challenges to DNN-based ASR systems. In this paper, we propose a new model based on Convolutional GRU and Self-attention U-Net called [Formula: see text] to improve adversarial speech signals. To represent the correlation between neighboring noisy speech frames, a two-Layer GRU is added in the bottleneck of U-Net and an attention gate is inserted in up-sampling units to increase the adversarial stability. The goal of using GRU is to combine the weights sharing technique with the use of gates to control the flow of data across multiple feature maps. As a result, it outperforms the original 1D convolution used in [Formula: see text]. Especially, the performance of the model is evaluated by explainable speech recognition metrics and its performance is analyzed by the improved adversarial training. We used adversarial audio attacks to perform experiments on automatic speech recognition (ASR). We saw (i) the robustness of ASR models which are based on DNN can be improved using the temporal features grasped by the attention-based GRU network; (ii) through adversarial training, including some additive adversarial data augmentation, we could improve the generalization power of Automatic Speech Recognition models which are based on DNN. The word-error-rate (WER) metric confirmed that the enhancement capabilities are better than the state-of-the-art [Formula: see text]. The reason for this enhancement is the ability of GRU units to extract global information within the feature maps. Based on the conducted experiments, the proposed [Formula: see text] increases the score of Speech Transmission Index (STI), Perceptual Evaluation of Speech Quality (PESQ), and the Short-term Objective Intelligibility (STOI) with adversarial speech examples in speech enhancement.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

International Journal of Image and Graphics COMPUTER SCIENCE, SOFTWARE ENGINEERING-

CiteScore

2.40

自引率

18.80%

发文量