Essence Vector-Based Query Modeling for Spoken Document Retrieval
Kuan-Yu Chen, Shih-Hung Liu, Berlin Chen, H. Wang
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6274-6278. Published 2018-09-10. DOI: 10.1109/ICASSP.2018.8461687
Spoken document retrieval (SDR) has become an increasingly important application as unprecedented volumes of multimedia data containing speech have become available in our daily lives. To our knowledge, relatively little work has explored unsupervised paragraph embedding methods or investigated their effectiveness on the SDR task. This paper first presents a novel paragraph embedding method, the essence vector (EV) model, which infers a representation for a given paragraph by encapsulating its most representative information while excluding general background information. On top of the EV model, we develop three query language modeling mechanisms to improve retrieval performance. A series of empirical SDR experiments conducted on two benchmark collections demonstrates the efficacy of the proposed framework compared to several strong baseline systems.
{"title":"Essence Vector-Based Query Modeling for Spoken Document Retrieval","authors":"Kuan-Yu Chen, Shih-Hung Liu, Berlin Chen, H. Wang","doi":"10.1109/ICASSP.2018.8461687","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461687","url":null,"abstract":"Spoken document retrieval (SDR) has become a prominently required application since unprecedented volumes of multimedia data along with speech have become available in our daily life. As far as we are aware, there has been relatively less work in launching unsupervised paragraph embedding methods and investigating the effectiveness of these methods on the SDR task. This paper first presents a novel paragraph embedding method, named the essence vector (EV) model, which aims at inferring a representation for a given paragraph by encapsulating the most representative information from the paragraph and excluding the general background information at the same time. On top of the EV model, we develop three query language modeling mechanisms to improve the retrieval performance. A series of empirical SDR experiments conducted on two benchmark collections demonstrate the good efficacy of the proposed framework, compared to several existing strong baseline systems.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"97 1","pages":"6274-6278"},"PeriodicalIF":0.0,"publicationDate":"2018-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84099319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Random Walks with Restarts for Graph-Based Classification: Teleportation Tuning and Sampling Design
Dimitris Berberidis, A. Nikolakopoulos, G. Giannakis
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2811-2815. Published 2018-09-10. DOI: 10.1109/ICASSP.2018.8461548
The present work introduces methods for sampling and inference for semi-supervised classification over the nodes of a graph. The graph may be given or constructed using similarity measures among nodal features. Leveraging the graph for classification builds on the premise that relations among nodes can be modeled via the stationary distributions of a certain class of random walks. The proposed classifier builds on existing scalable random-walk-based methods and improves accuracy and robustness by automatically adjusting a set of parameters to the graph and label distribution at hand. Furthermore, a sampling strategy tailored to random-walk-based classifiers is introduced. Numerical tests on benchmark synthetic and real labeled graphs demonstrate the performance of the proposed sampling and inference methods in terms of classification accuracy.
{"title":"Random Walks with Restarts for Graph-Based Classification: Teleportation Tuning and Sampling Design","authors":"Dimitris Berberidis, A. Nikolakopoulos, G. Giannakis","doi":"10.1109/ICASSP.2018.8461548","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461548","url":null,"abstract":"The present work introduces methods for sampling and inference for the purpose of semi-supervised classification over the nodes of a graph. The graph may be given or constructed using similarity measures among nodal features. Leveraging the graph for classification builds on the premise that relation among nodes can be modeled via stationary distributions of a certain class of random walks. The proposed classifier builds on existing scalable random-walk-based methods and improves accuracy and robustness by automatically adjusting a set of parameters to the graph and label distribution at hand. Furthermore, a sampling strategy tailored to random-walk-based classifiers is introduced. Numerical tests on benchmark synthetic and real labeled graphs demonstrate the performance of the proposed sampling and inference methods in terms of classification accuracy.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"115 1","pages":"2811-2815"},"PeriodicalIF":0.0,"publicationDate":"2018-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89566550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal Stopping Times for Estimating Bernoulli Parameters with Applications to Active Imaging
Safa C. Medin, John Murray-Bruce, Vivek K Goyal
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4429-4433. Published 2018-07-18. DOI: 10.1109/ICASSP.2018.8462676
We address the problem of estimating the parameter of a Bernoulli process. This arises in many applications, including photon-efficient active imaging, where each illumination period is regarded as a single Bernoulli trial. We introduce a framework within which to minimize the mean-squared error (MSE) subject to an upper bound on the mean number of trials. This optimization has several simple and intuitive properties when the Bernoulli parameter has a beta prior. In addition, by exploiting typical spatial correlation using total variation regularization, we extend the framework to a rectangular array of Bernoulli processes representing the pixels in a natural scene. In simulations inspired by realistic active imaging scenarios, we demonstrate a 4.26 dB reduction in MSE due to the adaptive acquisition, averaged over many independent experiments and stable across a factor-of-3.4 variation in trial budget.
{"title":"Optimal Stopping Times for Estimating Bernoulli Parameters with Applications to Active Imaging","authors":"Safa C. Medin, John Murray-Bruce, Vivek K Goyal","doi":"10.1109/ICASSP.2018.8462676","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462676","url":null,"abstract":"We address the problem of estimating the parameter of a Bernoulli process. This arises in many applications, including photon-efficient active imaging where each illumination period is regarded as a single Bernoulli trial. We introduce a framework within which to minimize the mean-squared error (MSE) subject to an upper bound on the mean number of trials. This optimization has several simple and intuitive properties when the Bernoulli parameter has a beta prior. In addition, by exploiting typical spatial correlation using total variation regularization, we extend the developed framework to a rectangular array of Bernoulli processes representing the pixels in a natural scene. In simulations inspired by realistic active imaging scenarios, we demonstrate a 4.26 dB reduction in MSE due to the adaptive acquisition, as an average over many independent experiments and invariant to a factor of 3.4 variation in trial budget.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"27 1","pages":"4429-4433"},"PeriodicalIF":0.0,"publicationDate":"2018-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75733802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Forward Attention in Sequence-to-Sequence Acoustic Modeling for Speech Synthesis
Jing-Xuan Zhang, Zhenhua Ling, Lirong Dai
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4789-4793. Published 2018-07-18. DOI: 10.1109/ICASSP.2018.8462020
This paper proposes a forward attention method for sequence-to-sequence acoustic modeling in speech synthesis. The method is motivated by the monotonic nature of the alignment from phone sequences to acoustic sequences: only alignment paths that satisfy the monotonic condition are considered at each decoder timestep, and the modified attention probabilities are computed recursively using a forward algorithm. A transition agent for forward attention is further proposed, which helps the attention mechanism decide whether to move forward or stay at each decoder timestep. Experimental results show that the proposed forward attention method achieves faster convergence and higher stability than the baseline attention method. Moreover, forward attention with a transition agent also improves the naturalness of synthetic speech and provides effective control over its speed.
{"title":"Forward Attention in Sequence- To-Sequence Acoustic Modeling for Speech Synthesis","authors":"Jing-Xuan Zhang, Zhenhua Ling, Lirong Dai","doi":"10.1109/ICASSP.2018.8462020","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462020","url":null,"abstract":"This paper proposes a forward attention method for the sequence-to-sequence acoustic modeling of speech synthesis. This method is motivated by the nature of the monotonic alignment from phone sequences to acoustic sequences. Only the alignment paths that satisfy the monotonic condition are taken into consideration at each decoder timestep. The modified attention probabilities at each timestep are computed recursively using a forward algorithm. A transition agent for forward attention is further proposed, which helps the attention mechanism to make decisions whether to move forward or stay at each decoder timestep. Experimental results show that the proposed forward attention method achieves faster convergence speed and higher stability than the baseline attention method. Besides, the method of forward attention with transition agent can also help improve the naturalness of synthetic speech and control the speed of synthetic speech effectively.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"37 1","pages":"4789-4793"},"PeriodicalIF":0.0,"publicationDate":"2018-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"74949477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sufficiency Quantification for Seamless Text-Independent Speaker Enrollment
Gokcen Cilingir, Jonathan Huang, Mandar Joshi, Narayan Biswal
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5259-5263. Published 2018-07-13. DOI: 10.1109/ICASSP.2018.8461954
Text-independent speaker recognition (TI-SR) requires a lengthy enrollment process that demands dedicated time from the user to create a reliable model of their voice. Seamless enrollment, in which enrollment happens in the background and requires no dedicated time from the user, is a highly attractive alternative. One of the key problems in a fully automated seamless enrollment process is determining whether a given utterance collection is sufficient for TI-SR; no known metric exists in the literature to quantify sufficiency. This paper introduces a novel metric called the phoneme-richness score. The quality of a sufficiency metric can be assessed via its correlation with TI-SR performance. Our assessment shows that the phoneme-richness score achieves a −0.96 correlation with TI-SR performance (measured in equal error rate), which is highly significant, whereas a naive sufficiency metric such as speech duration achieves only a −0.68 correlation.
{"title":"Sufficiency Quantification for Seamless Text-Independent Speaker Enrollment","authors":"Gokcen Cilingir, Jonathan Huang, Mandar Joshi, Narayan Biswal","doi":"10.1109/ICASSP.2018.8461954","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461954","url":null,"abstract":"Text-independent speaker recognition (TI-SR) requires a lengthy enrollment process that involves asking dedicated time from the user to create a reliable model of their voice. Seamless enrollment is a highly attractive feature which refers to the enrollment process that happens in the background and asks for no dedicated time from the user. One of the key problems in a fully automated seamless enrollment process is to determine the sufficiency of a given utterance collection for the purpose of TI-SR. No known metric exists in the literature to quantify sufficiency. This paper introduces a novel metric called phoneme-richness score. Quality of a sufficiency metric can be assessed via its correlation with the TI-SR performance. Our assessment shows that phoneme-richness score achieves −0.96 correlation with TI-SR performance (measured in equal error rate), which is highly significant, whereas a naive sufficiency metric like speech duration achieves only −0.68 correlation.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"19 1","pages":"5259-5263"},"PeriodicalIF":0.0,"publicationDate":"2018-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79581715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spatial Audio Feature Discovery with Convolutional Neural Networks
Etienne Thuillier, H. Gamper, I. Tashev
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6797-6801. Published 2018-05-30. DOI: 10.1109/ICASSP.2018.8462315
The advent of mixed reality consumer products brings about a pressing need to develop and improve spatial sound rendering techniques for a broad user base. Despite a large body of prior work, the precise nature and importance of various sound localization cues, and how they should be personalized for an individual user to improve localization performance, remain open research problems. Here we propose training a convolutional neural network (CNN) to classify the elevation angle of spatially rendered sounds and employing Layer-wise Relevance Propagation (LRP) on the trained CNN model. LRP provides saliency maps that can be used to identify spectral features used by the network for classification. These maps, in addition to the convolution filters learned by the CNN, are discussed in the context of listening tests reported in the literature. The proposed approach could potentially provide an avenue for future studies on modeling and personalization of head-related transfer functions (HRTFs).
{"title":"Spatial Audio Feature Discovery with Convolutional Neural Networks","authors":"Etienne Thuillier, H. Gamper, I. Tashev","doi":"10.1109/ICASSP.2018.8462315","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462315","url":null,"abstract":"The advent of mixed reality consumer products brings about a pressing need to develop and improve spatial sound rendering techniques for a broad user base. Despite a large body of prior work, the precise nature and importance of various sound localization cues and how they should be personalized for an individual user to improve localization performance is still an open research problem. Here we propose training a convolutional neural network (CNN) to classify the elevation angle of spatially rendered sounds and employing Layer-wise Relevance Propagation (LRP) on the trained CNN model. LRP provides saliency maps that can be used to identify spectral features used by the network for classification. These maps, in addition to the convolution filters learned by the CNN, are discussed in the context of listening tests reported in the literature. The proposed approach could potentially provide an avenue for future studies on modeling and personalization of head-related transfer functions (HRTFs).","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"25 1","pages":"6797-6801"},"PeriodicalIF":0.0,"publicationDate":"2018-05-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75113069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-Paced Mixture of t Distribution Model
Yang Zhang, Qingtao Tang, Li Niu, Tao Dai, Xi Xiao, Shutao Xia
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2796-2800. Published 2018-05-27. DOI: 10.1109/ICASSP.2018.8462323
The Gaussian mixture model (GMM) is a powerful probabilistic model for representing the probability distribution of observations in a population. However, the fit of a Gaussian mixture model can be significantly degraded when the data contain a certain amount of outliers. Although certain variants of the GMM (e.g., the mixture of Laplace distributions, the mixture of $t$ distributions) attempt to handle outliers, none of them can sufficiently mitigate the effect of outliers that lie far from the centroids. To further reduce the effect of outliers, this paper introduces a self-paced learning mechanism into the mixture of $t$ distributions, which leads to the Self-Paced Mixture of $t$ distribution Model (SPTMM). We derive an expectation-maximization-based algorithm to train SPTMM and show that SPTMM is able to screen out the outliers. To demonstrate the effectiveness of SPTMM, we apply the model to density estimation and clustering. The results indicate that SPTMM outperforms other methods, especially on data with outliers.
{"title":"Self -Paced Mixture of T Distribution Model","authors":"Yang Zhang, Qingtao Tang, Li Niu, Tao Dai, Xi Xiao, Shutao Xia","doi":"10.1109/ICASSP.2018.8462323","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462323","url":null,"abstract":"Gaussian mixture model (GMM) is a powerful probabilistic model for representing the probability distribution of observations in the population. However, the fitness of Gaussian mixture model can be significantly degraded when the data contain a certain amount of outliers. Although there are certain variants of GMM (e.g., mixture of Laplace, mixture of $t$ distribution) attempting to handle outliers, none of them can sufficiently mitigate the effect of outliers if the outliers are far from the centroids. Aiming to remove the effect of outliers further, this paper introduces a Self-Paced Learning mechanism into mixture of $t$ distribution, which leads to Self-Paced Mixture of $t$ distribution model (SPTMM). We derive an Expectation-Maximization based algorithm to train SPTMM and show SPTMM is able to screen the outliers. To demonstrate the effectiveness of SPTMM, we apply the model to density estimation and clustering. Finally, the results indicate that SPTMM outperforms other methods, especially on the data with outliers.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"7 1","pages":"2796-2800"},"PeriodicalIF":0.0,"publicationDate":"2018-05-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88350385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Spectral Distortion Model for Training Phase-Sensitive Deep-Neural Networks for Far-Field Speech Recognition
Chanwoo Kim, Tara N. Sainath, A. Narayanan, Ananya Misra, R. Nongpiur, M. Bacchiani
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5729-5733. Published 2018-05-07. DOI: 10.1109/ICASSP.2018.8462223
In this paper, we present an algorithm that introduces phase perturbation into the training database when training phase-sensitive deep neural network models. Traditional features such as log-mel or cepstral features do not carry any phase-relevant information, whereas features such as raw waveforms or complex spectra do. Phase-sensitive features have the advantage of being able to detect differences in time of arrival across different microphone channels or frequency bands. However, compared to magnitude-based features, phase information is more sensitive to various kinds of distortion, such as variations in microphone characteristics, reverberation, and so on. For traditional magnitude-based features, it is widely known that adding noise or reverberation, often called Multistyle TRaining (MTR), improves robustness. In a similar spirit, we propose an algorithm that introduces spectral distortion to make deep learning models more robust to phase distortion. We call this approach Spectral-Distortion TRaining (SDTR). In our experiments using a training set consisting of 22 million utterances with and without MTR, this approach reduces Word Error Rates (WERs) relatively by 3.2% and 8.48%, respectively, on test sets recorded on Google Home.
{"title":"Spectral Distortion Model for Training Phase-Sensitive Deep-Neural Networks for Far-Field Speech Recognition","authors":"Chanwoo Kim, Tara N. Sainath, A. Narayanan, Ananya Misra, R. Nongpiur, M. Bacchiani","doi":"10.1109/ICASSP.2018.8462223","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462223","url":null,"abstract":"In this paper, we present an algorithm which introduces phase-perturbation to the training database when training phase-sensitive deep neural-network models. Traditional features such as log-mel or cepstral features do not have have any phase-relevant information. However features such as raw-waveform or complex spectra features contain phase-relevant information. Phase-sensitive features have the advantage of being able to detect differences in time of arrival across different microphone channels or frequency bands. However, compared to magnitude-based features, phase information is more sensitive to various kinds of distortions such as variations in microphone characteristics, reverberation, and so on. For traditional magnitude-based features, it is widely known that adding noise or reverberation, often called Multistyle-TRaining (MTR), improves robustness. In a similar spirit, we propose an algorithm which introduces spectral distortion to make the deep-learning models more robust to phase-distortion. We call this approach Spectral-Distortion TRaining (SDTR). In our experiments using a training set consisting of 22-million utterances with and without MTR, this approach reduces Word Error Rates (WERs) relatively by 3.2 % and 8.48 % respectively on test sets recorded on Google Home.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"55 1","pages":"5729-5733"},"PeriodicalIF":0.0,"publicationDate":"2018-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79851555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sound Source Separation Using Phase Difference and Reliable Mask Selection
Chanwoo Kim, Anjali Menon, M. Bacchiani, R. Stern
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 5559-5563. Published 2018-05-07. DOI: 10.1109/ICASSP.2018.8462269
We present an algorithm called Reliable Mask Selection-Phase Difference Channel Weighting (RMS-PDCW), which selects the target source masked by a noise source using angle-of-arrival (AoA) information computed from inter-channel phase differences. The RMS-PDCW algorithm selects which masks to apply using information about the localized sound source and the onset detection of speech. We demonstrate that this algorithm yields a relative improvement of 5.3 percent over the baseline acoustic model, which was multistyle-trained on 22 million utterances, on a simulated test set containing real-world and interfering-speaker noise with reverberation times distributed between 0 ms and 900 ms and SNRs from 0 dB up to clean.
{"title":"Sound Source Separation Using Phase Difference and Reliable Mask Selection Selection","authors":"Chanwoo Kim, Anjali Menon, M. Bacchiani, R. Stern","doi":"10.1109/ICASSP.2018.8462269","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8462269","url":null,"abstract":"We present an algorithm called Reliable Mask Selection-Phase Difference Channel Weighting (RMS-PDCW) which selects the target source masked by a noise source using the Angle of Arrival (AoA) information calculated using the phase difference information. The RMS-PDCW algorithm selects masks to apply using the information about the localized sound source and the onset detection of speech. We demonstrate that this algorithm shows relatively 5.3 percent improvement over the baseline acoustic model, which was multistyle-trained using 22 million utterances on the simulated test set consisting of real-world and interfering-speaker noise with reverberation time distribution between 0 ms and 900 ms and SNR distribution between 0 dB up to clean.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"110 1","pages":"5559-5563"},"PeriodicalIF":0.0,"publicationDate":"2018-05-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82487737","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Acoustic Modeling of Speech Waveform Based on Multi-Resolution, Neural Network Signal Processing
Zoltán Tüske, R. Schlüter, H. Ney
2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4859-4863. Published 2018-05-02. DOI: 10.1109/ICASSP.2018.8461871
Recently, several papers have demonstrated that neural networks (NNs) are able to perform feature extraction as part of the acoustic model. Motivated by the Gammatone feature extraction pipeline, in this paper we extend the waveform-based NN model with a second level of time-convolutional elements. The proposed extension generalizes the envelope extraction block and allows the model to learn multi-resolution representations. Automatic speech recognition (ASR) experiments show significant word error rate reductions over our previous best acoustic model trained directly in the signal domain. Although we use only 250 hours of speech, the data-driven NN-based speech signal processing performs nearly as well as traditional handcrafted feature extractors. In additional experiments, we also test segment-level feature normalization techniques on NN-derived features, which improve the results further. However, porting speech representations derived by a feed-forward NN to an LSTM back-end model indicates much less robustness in the NN front end compared to the standard feature extractors. Analysis of the weights in the proposed new layer reveals that the NN prefers both multi-resolution and modulation-spectrum representations.
{"title":"Acoustic Modeling of Speech Waveform Based on Multi-Resolution, Neural Network Signal Processing","authors":"Zoltán Tüske, R. Schlüter, H. Ney","doi":"10.1109/ICASSP.2018.8461871","DOIUrl":"https://doi.org/10.1109/ICASSP.2018.8461871","url":null,"abstract":"Recently, several papers have demonstrated that neural networks (NN) are able to perform the feature extraction as part of the acoustic model. Motivated by the Gammatone feature extraction pipeline, in this paper we extend the waveform based NN model by a second level of time-convolutional element. The proposed extension generalizes the envelope extraction block, and allows the model to learn multi-resolutional representations. Automatic speech recognition (ASR) experiments show significant word error rate reduction over our previous best acoustic model trained in the signal domain directly. Although we use only 250 hours of speech, the data-driven NN based speech signal processing performs nearly equally to traditional handcrafted feature extractors. In additional experiments, we also test segment-level feature normalization techniques on NN derived features, which improve the results further. However, the porting of speech representations derived by a feed-forward NN to a LSTM back-end model indicates much less robustness of the NN front-end compared to the standard feature extractors. Analysis of the weights in the proposed new layer reveals that the NN prefers both multi-resolution and modulation spectrum representations.","PeriodicalId":6638,"journal":{"name":"2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"22 1","pages":"4859-4863"},"PeriodicalIF":0.0,"publicationDate":"2018-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73294824","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}