Zoltán Tüske, R. Schlüter, H. Ney
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4859-4863
Published 2018-05-02 · DOI: 10.1109/ICASSP.2018.8461871
Acoustic Modeling of Speech Waveform Based on Multi-Resolution, Neural Network Signal Processing
Recently, several papers have demonstrated that neural networks (NNs) are able to perform feature extraction as part of the acoustic model. Motivated by the Gammatone feature extraction pipeline, in this paper we extend the waveform-based NN model with a second level of time-convolutional elements. The proposed extension generalizes the envelope extraction block and allows the model to learn multi-resolution representations. Automatic speech recognition (ASR) experiments show a significant word error rate reduction over our previous best acoustic model trained directly in the signal domain. Although we use only 250 hours of speech, the data-driven NN-based speech signal processing performs nearly on par with traditional handcrafted feature extractors. In additional experiments, we also test segment-level feature normalization techniques on NN-derived features, which improve the results further. However, porting speech representations derived by a feed-forward NN to an LSTM back-end model indicates much less robustness of the NN front-end compared to the standard feature extractors. Analysis of the weights in the proposed new layer reveals that the NN prefers both multi-resolution and modulation spectrum representations.
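The two-level time-convolutional front-end described in the abstract can be illustrated with a minimal NumPy sketch. This is an assumption-laden toy, not the authors' trained model: random FIR kernels stand in for the learned first-level filterbank, rectification serves as the envelope proxy, and a second time-convolution with several window lengths mimics the generalized, multi-resolution envelope extraction block. All function and parameter names (`waveform_frontend`, `env_lens`, `hop`) are hypothetical.

```python
import numpy as np

def waveform_frontend(wave, n_filters=4, filt_len=64,
                      env_lens=(32, 128), hop=160, seed=0):
    """Sketch of a two-level time-convolutional waveform front-end.

    Level 1: a bank of FIR filters applied to the raw waveform
             (random weights here stand in for trained kernels).
    Level 2: time-convolutional envelope smoothing at several window
             lengths, yielding a multi-resolution representation.
    """
    rng = np.random.default_rng(seed)
    # Stand-in for the learned level-1 kernels (Gammatone-like filterbank).
    filters = rng.standard_normal((n_filters, filt_len))

    feats = []
    for h in filters:
        band = np.convolve(wave, h, mode="same")  # level-1 time convolution
        env = np.abs(band)                        # rectification: envelope proxy
        for L in env_lens:                        # level-2: multiple resolutions
            win = np.hanning(L)
            win /= win.sum()                      # normalized smoothing window
            smooth = np.convolve(env, win, mode="same")
            feats.append(smooth[::hop])           # subsample to frame rate
    # One feature stream per (filter, resolution) pair.
    return np.stack(feats, axis=0)
```

Using two smoothing lengths per band gives the model both a fine and a coarse view of each band's envelope, which is the intuition behind the multi-resolution extension; in the actual paper both convolution levels are learned jointly with the acoustic model rather than fixed as here.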