Acoustic Modeling of Speech Waveform Based on Multi-Resolution, Neural Network Signal Processing

Zoltán Tüske, R. Schlüter, H. Ney
Published in: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4859-4863
Publication date: 2018-05-02
DOI: 10.1109/ICASSP.2018.8461871 (https://doi.org/10.1109/ICASSP.2018.8461871)
Citations: 24

Abstract

Recently, several papers have demonstrated that neural networks (NNs) can perform feature extraction as part of the acoustic model. Motivated by the Gammatone feature extraction pipeline, in this paper we extend the waveform-based NN model with a second level of time-convolutional elements. The proposed extension generalizes the envelope extraction block and allows the model to learn multi-resolution representations. Automatic speech recognition (ASR) experiments show a significant word error rate reduction over our previous best acoustic model trained directly in the signal domain. Although we use only 250 hours of speech, the data-driven, NN-based speech signal processing performs nearly as well as traditional handcrafted feature extractors. In additional experiments, we also test segment-level feature normalization techniques on NN-derived features, which improve the results further. However, porting the speech representations derived by a feed-forward NN to an LSTM back-end model indicates that the NN front-end is much less robust than the standard feature extractors. Analysis of the weights in the proposed new layer reveals that the NN prefers both multi-resolution and modulation spectrum representations.
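To make the structure described in the abstract more concrete, the following is a minimal sketch of a two-level time-convolutional front-end followed by segment-level normalization. It is not the authors' implementation. The function names, the use of an absolute-value rectifier between the two levels, the default stride of 160 samples, and the choice of mean and variance normalization are all assumptions made for illustration, and the filters are passed in as plain lists where a real model would learn them. Plain Python is used only to keep the example self-contained.

```python
import math


def conv1d(signal, kernel, stride=1):
    """Valid-mode 1-D correlation of a sequence with a kernel, with a stride."""
    n = len(signal) - len(kernel) + 1
    return [
        sum(s * k for s, k in zip(signal[t:t + len(kernel)], kernel))
        for t in range(0, max(n, 0), stride)
    ]


def frontend(waveform, level1_filters, level2_filters, stride=160):
    """Hypothetical two-level time-convolutional front-end.

    Level 1 filters the raw waveform (in the role of a band-pass filterbank).
    The output is rectified, then every level-2 filter is applied to every
    rectified channel with a stride, which plays the role of envelope
    extraction. Level-2 filters of different lengths give different temporal
    resolutions, so the returned streams may differ in length.
    """
    streams = []
    for h in level1_filters:
        rectified = [abs(v) for v in conv1d(waveform, h)]
        for g in level2_filters:
            streams.append(conv1d(rectified, g, stride))
    return streams


def segment_normalize(stream):
    """Mean and variance normalization over one segment (assumed variant)."""
    mean = sum(stream) / len(stream)
    var = sum((v - mean) ** 2 for v in stream) / len(stream)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for a constant stream
    return [(v - mean) / std for v in stream]
```

Calling `frontend(waveform, level1_filters, level2_filters)` returns one feature stream per pair of level-1 and level-2 filters, and `segment_normalize` can then be applied to each stream independently. In a trained model both filter sets would be parameters optimized jointly with the acoustic model rather than fixed inputs.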