Spectral Distortion Model for Training Phase-Sensitive Deep-Neural Networks for Far-Field Speech Recognition

2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) · Pub Date: 2018-05-07 · DOI: 10.1109/ICASSP.2018.8462223

Chanwoo Kim, Tara N. Sainath, A. Narayanan, Ananya Misra, R. Nongpiur, M. Bacchiani

Pages: 5729-5733

Citations: 3

Abstract

In this paper, we present an algorithm that introduces phase perturbation into the training database when training phase-sensitive deep neural-network models. Traditional features such as log-mel or cepstral features do not contain any phase-relevant information, whereas features such as raw waveforms or complex spectra do. Phase-sensitive features have the advantage of being able to detect differences in time of arrival across different microphone channels or frequency bands. However, compared to magnitude-based features, phase information is more sensitive to various kinds of distortion, such as variations in microphone characteristics and reverberation. For traditional magnitude-based features, adding noise or reverberation to the training data, often called Multistyle-TRaining (MTR), is widely known to improve robustness. In a similar spirit, we propose an algorithm that introduces spectral distortion to make deep-learning models more robust to phase distortion. We call this approach Spectral-Distortion TRaining (SDTR). In our experiments using a training set of 22 million utterances, with and without MTR, this approach yields relative Word Error Rate (WER) reductions of 3.2% and 8.48%, respectively, on test sets recorded on Google Home.
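The abstract does not specify the distortion model, so the following is only an illustrative sketch of the general idea of spectral-distortion augmentation, not the authors' method. It assumes that each training waveform is multiplied in the frequency domain by a random per-bin gain and phase offset; the function name `spectral_distortion`, the parameters `max_mag_db` and `max_phase_rad`, and the uniform sampling are all hypothetical choices made for this example.

```python
import numpy as np


def spectral_distortion(waveform, max_mag_db=1.0, max_phase_rad=0.4, rng=None):
    """Apply a random per-frequency gain and phase offset to a 1-D waveform.

    Hypothetical illustration only: the uniform distributions and default
    ranges are assumptions, not values taken from the paper.
    """
    rng = np.random.default_rng() if rng is None else rng
    n = len(waveform)

    # One-sided spectrum of the real-valued input signal.
    spectrum = np.fft.rfft(waveform)
    num_bins = spectrum.shape[0]

    # Random magnitude perturbation (in dB) and phase perturbation (in radians).
    mag_db = rng.uniform(-max_mag_db, max_mag_db, size=num_bins)
    phase = rng.uniform(-max_phase_rad, max_phase_rad, size=num_bins)

    # Leave the phase of the DC bin (and the Nyquist bin for even lengths)
    # untouched so the distorted spectrum still maps to a real signal.
    phase[0] = 0.0
    if n % 2 == 0:
        phase[-1] = 0.0

    distortion = 10.0 ** (mag_db / 20.0) * np.exp(1j * phase)
    return np.fft.irfft(spectrum * distortion, n=n)


# Usage: distort each microphone channel independently, so that the
# inter-channel phase relationship is perturbed as well.
rng = np.random.default_rng(0)
two_channel_audio = rng.standard_normal((2, 16000))  # placeholder audio
augmented = np.stack(
    [spectral_distortion(channel, rng=rng) for channel in two_channel_audio]
)
```

Drawing a separate distortion for each channel is one way to mimic microphone-to-microphone variation; whether the paper samples its distortion per channel, per utterance, or per frame is not stated in the abstract.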


