
Latest publications from the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)

Essence Vector-Based Query Modeling for Spoken Document Retrieval
Kuan-Yu Chen, Shih-Hung Liu, Berlin Chen, H. Wang
Spoken document retrieval (SDR) has become an increasingly important application as unprecedented volumes of multimedia data containing speech have become available in our daily life. As far as we are aware, relatively little work has been done on introducing unsupervised paragraph embedding methods and investigating their effectiveness on the SDR task. This paper first presents a novel paragraph embedding method, named the essence vector (EV) model, which aims at inferring a representation for a given paragraph by encapsulating the most representative information from the paragraph while excluding the general background information. On top of the EV model, we develop three query language modeling mechanisms to improve the retrieval performance. A series of empirical SDR experiments conducted on two benchmark collections demonstrate the good efficacy of the proposed framework, compared to several existing strong baseline systems.
DOI: 10.1109/ICASSP.2018.8461687 · pp. 6274-6278 · Published 2018-09-10
Citations: 2
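The EV model's goal of keeping paragraph-specific content while discarding shared background can be loosely illustrated with a toy vector operation. This is only an intuition sketch with made-up vectors, not the paper's actual training objective:

```python
import numpy as np

# Intuition sketch: remove the component a paragraph vector shares with a
# general "background" direction, keeping the paragraph-specific residue.
def remove_background(v, bg):
    bg = bg / np.linalg.norm(bg)   # unit background direction
    return v - (v @ bg) * bg       # orthogonal projection

v = np.array([3.0, 4.0])           # hypothetical paragraph vector
bg = np.array([1.0, 0.0])          # hypothetical background direction
essence = remove_background(v, bg) # -> [0.0, 4.0], orthogonal to bg
```

The residue is exactly orthogonal to the background direction, which is the rough spirit of "encapsulating representative information while excluding background".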
Random Walks with Restarts for Graph-Based Classification: Teleportation Tuning and Sampling Design
Dimitris Berberidis, A. Nikolakopoulos, G. Giannakis
The present work introduces methods for sampling and inference for the purpose of semi-supervised classification over the nodes of a graph. The graph may be given or constructed using similarity measures among nodal features. Leveraging the graph for classification builds on the premise that relation among nodes can be modeled via stationary distributions of a certain class of random walks. The proposed classifier builds on existing scalable random-walk-based methods and improves accuracy and robustness by automatically adjusting a set of parameters to the graph and label distribution at hand. Furthermore, a sampling strategy tailored to random-walk-based classifiers is introduced. Numerical tests on benchmark synthetic and real labeled graphs demonstrate the performance of the proposed sampling and inference methods in terms of classification accuracy.
DOI: 10.1109/ICASSP.2018.8461548 · pp. 2811-2815 · Published 2018-09-10
Citations: 4
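A minimal sketch of the underlying random-walk-with-restarts machinery on a hypothetical 4-node toy graph; the teleportation probability `alpha` is the parameter the paper adjusts automatically, and here it is simply fixed:

```python
import numpy as np

# Random walk with restarts: stationary distribution of a walk that, with
# probability alpha, teleports back to a labeled seed node.
def rwr_scores(P, seed, alpha=0.15, iters=100):
    n = P.shape[0]
    r = np.full(n, 1.0 / n)        # start uniform
    e = np.zeros(n)
    e[seed] = 1.0                  # restart distribution
    for _ in range(iters):
        r = (1 - alpha) * P @ r + alpha * e
    return r

# Toy path graph 0-1-2-3: node 0 is a class-0 seed, node 3 a class-1 seed.
A = np.array([[0, 1, 0, 0],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [0, 0, 1, 0]], float)
P = A / A.sum(axis=0)                   # column-stochastic transitions
scores_c0 = rwr_scores(P, seed=0)       # walk restarting at class-0 seed
scores_c1 = rwr_scores(P, seed=3)       # walk restarting at class-1 seed
label_node1 = 0 if scores_c0[1] > scores_c1[1] else 1  # classify node 1
```

Node 1 receives more stationary mass from the class-0 walk than from the class-1 walk, so it is labeled class 0, which matches its position in the graph.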
Optimal Stopping Times for Estimating Bernoulli Parameters with Applications to Active Imaging
Safa C. Medin, John Murray-Bruce, Vivek K Goyal
We address the problem of estimating the parameter of a Bernoulli process. This arises in many applications, including photon-efficient active imaging where each illumination period is regarded as a single Bernoulli trial. We introduce a framework within which to minimize the mean-squared error (MSE) subject to an upper bound on the mean number of trials. This optimization has several simple and intuitive properties when the Bernoulli parameter has a beta prior. In addition, by exploiting typical spatial correlation using total variation regularization, we extend the developed framework to a rectangular array of Bernoulli processes representing the pixels in a natural scene. In simulations inspired by realistic active imaging scenarios, we demonstrate a 4.26 dB reduction in MSE due to the adaptive acquisition, as an average over many independent experiments and invariant to a factor of 3.4 variation in trial budget.
DOI: 10.1109/ICASSP.2018.8462676 · pp. 4429-4433 · Published 2018-07-18
Citations: 4
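The Bernoulli-with-beta-prior setup the abstract mentions has a closed-form MMSE estimator, the posterior mean. A minimal sketch (not the paper's stopping-time derivation, just the underlying estimator):

```python
import numpy as np

# With a Beta(a, b) prior on the Bernoulli parameter p, the posterior after
# k successes in n trials is Beta(a + k, b + n - k), so the MMSE estimate
# (posterior mean) is (a + k) / (a + b + n).
def mmse_estimate(k, n, a=1.0, b=1.0):
    return (a + k) / (a + b + n)

rng = np.random.default_rng(0)
p_true = 0.3
trials = rng.random(1000) < p_true              # 1000 simulated Bernoulli trials
est = mmse_estimate(trials.sum(), trials.size)  # close to 0.3
```

With no data at all, the uniform prior gives `mmse_estimate(0, 0) == 0.5`; the stopping-time question in the paper is how many such trials to spend per pixel under a mean-trial budget.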
Forward Attention in Sequence-to-Sequence Acoustic Modeling for Speech Synthesis
Jing-Xuan Zhang, Zhenhua Ling, Lirong Dai
This paper proposes a forward attention method for the sequence-to-sequence acoustic modeling of speech synthesis. This method is motivated by the nature of the monotonic alignment from phone sequences to acoustic sequences. Only the alignment paths that satisfy the monotonic condition are taken into consideration at each decoder timestep. The modified attention probabilities at each timestep are computed recursively using a forward algorithm. A transition agent for forward attention is further proposed, which helps the attention mechanism to make decisions whether to move forward or stay at each decoder timestep. Experimental results show that the proposed forward attention method achieves faster convergence speed and higher stability than the baseline attention method. Besides, the method of forward attention with transition agent can also help improve the naturalness of synthetic speech and control the speed of synthetic speech effectively.
DOI: 10.1109/ICASSP.2018.8462020 · pp. 4789-4793 · Published 2018-07-18
Citations: 79
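The monotonic forward recursion described in the abstract can be sketched as follows. This is my reading of the recursion (at each timestep the alignment either stays at phone n or advances from n-1), not the authors' exact implementation:

```python
import numpy as np

# Forward attention sketch: alpha_t(n) ∝ (alpha_{t-1}(n) + alpha_{t-1}(n-1)) * y_t(n),
# where y_t are the raw decoder attention probabilities over N phones.
def forward_attention(att_probs):
    """att_probs: (T, N) raw attention probabilities; returns the
    monotonicity-constrained probabilities, renormalized per timestep."""
    T, N = att_probs.shape
    alpha = np.zeros((T, N))
    alpha[0] = att_probs[0] / att_probs[0].sum()
    for t in range(1, T):
        shifted = np.concatenate(([0.0], alpha[t - 1, :-1]))  # advance from n-1
        a = (alpha[t - 1] + shifted) * att_probs[t]
        alpha[t] = a / a.sum()
    return alpha

att = np.full((3, 4), 0.25)     # uniform raw attention over 4 phones
alpha = forward_attention(att)  # mass drifts monotonically forward in time
```

Even with uniform raw attention, the recursion shifts probability mass forward at each timestep, which is the monotonic-alignment prior the paper exploits.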
Sufficiency Quantification for Seamless Text-Independent Speaker Enrollment
Gokcen Cilingir, Jonathan Huang, Mandar Joshi, Narayan Biswal
Text-independent speaker recognition (TI-SR) requires a lengthy enrollment process that involves asking dedicated time from the user to create a reliable model of their voice. Seamless enrollment is a highly attractive feature which refers to the enrollment process that happens in the background and asks for no dedicated time from the user. One of the key problems in a fully automated seamless enrollment process is to determine the sufficiency of a given utterance collection for the purpose of TI-SR. No known metric exists in the literature to quantify sufficiency. This paper introduces a novel metric called phoneme-richness score. Quality of a sufficiency metric can be assessed via its correlation with the TI-SR performance. Our assessment shows that phoneme-richness score achieves −0.96 correlation with TI-SR performance (measured in equal error rate), which is highly significant, whereas a naive sufficiency metric like speech duration achieves only −0.68 correlation.
DOI: 10.1109/ICASSP.2018.8461954 · pp. 5259-5263 · Published 2018-07-13
Citations: 0
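The evaluation protocol above, assessing a sufficiency metric by how strongly it correlates with recognition performance, is a one-liner with numpy. The numbers below are made up for illustration; they are not the paper's data:

```python
import numpy as np

# Toy illustration: a good sufficiency metric should be strongly negatively
# correlated with equal error rate (lower EER = better recognition).
metric = np.array([0.2, 0.5, 0.7, 0.9])  # hypothetical phoneme-richness scores
eer    = np.array([9.0, 6.5, 4.0, 2.0])  # hypothetical EER (%) per enrollment set
corr = np.corrcoef(metric, eer)[0, 1]    # strongly negative Pearson correlation
```

In the paper the reported correlation for the phoneme-richness score is -0.96, versus -0.68 for plain speech duration.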
Spatial Audio Feature Discovery with Convolutional Neural Networks
Etienne Thuillier, H. Gamper, I. Tashev
The advent of mixed reality consumer products brings about a pressing need to develop and improve spatial sound rendering techniques for a broad user base. Despite a large body of prior work, the precise nature and importance of various sound localization cues and how they should be personalized for an individual user to improve localization performance is still an open research problem. Here we propose training a convolutional neural network (CNN) to classify the elevation angle of spatially rendered sounds and employing Layer-wise Relevance Propagation (LRP) on the trained CNN model. LRP provides saliency maps that can be used to identify spectral features used by the network for classification. These maps, in addition to the convolution filters learned by the CNN, are discussed in the context of listening tests reported in the literature. The proposed approach could potentially provide an avenue for future studies on modeling and personalization of head-related transfer functions (HRTFs).
DOI: 10.1109/ICASSP.2018.8462315 · pp. 6797-6801 · Published 2018-05-30
Citations: 26
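Layer-wise Relevance Propagation redistributes a network's output score back onto its inputs. A drastically simplified sketch of the "z-rule" for a single linear unit (the paper applies the full layer-wise scheme to a trained CNN; this toy version only shows the conservation idea):

```python
import numpy as np

# Simplified LRP z-rule for one linear unit: each input gets relevance in
# proportion to its contribution x_i * w_i to the pre-activation z, so the
# total relevance is conserved (sum R_i ≈ R_out).
def lrp_linear(x, w, R_out, eps=1e-9):
    z = x @ w
    return (x * w) / (z + eps * np.sign(z)) * R_out

x = np.array([1.0, 2.0])       # hypothetical input activations
w = np.array([0.5, 0.25])      # hypothetical weights
z = x @ w
R = lrp_linear(x, w, R_out=z)  # relevance per input; R.sum() ≈ z
```

Applied layer by layer down to the input spectrum, such relevance scores form the saliency maps the paper uses to identify elevation cues.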
Self-Paced Mixture of t-Distribution Model
Yang Zhang, Qingtao Tang, Li Niu, Tao Dai, Xi Xiao, Shutao Xia
Gaussian mixture model (GMM) is a powerful probabilistic model for representing the probability distribution of observations in the population. However, the fit of a Gaussian mixture model can be significantly degraded when the data contain a certain amount of outliers. Although there are certain variants of GMM (e.g., mixture of Laplace, mixture of t-distribution) attempting to handle outliers, none of them can sufficiently mitigate the effect of outliers if the outliers are far from the centroids. Aiming to remove the effect of outliers further, this paper introduces a Self-Paced Learning mechanism into the mixture of t-distribution, which leads to the Self-Paced Mixture of t-distribution model (SPTMM). We derive an Expectation-Maximization based algorithm to train SPTMM and show SPTMM is able to screen the outliers. To demonstrate the effectiveness of SPTMM, we apply the model to density estimation and clustering. Finally, the results indicate that SPTMM outperforms other methods, especially on data with outliers.
DOI: 10.1109/ICASSP.2018.8462323 · pp. 2796-2800 · Published 2018-05-27
Citations: 4
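The robustness of t-mixtures to outliers comes from the per-point weight computed in the EM E-step. A minimal one-dimensional sketch (standard t-distribution EM, not the paper's self-paced extension):

```python
import numpy as np

# In t-distribution EM, each point receives a latent scale weight
# u = (nu + d) / (nu + delta), where delta is the squared Mahalanobis
# distance and d the dimensionality. Large delta => small weight, so
# far-away outliers barely move the component parameters.
def t_weight(x, mu, sigma2, nu=3.0):
    d = 1                                 # 1-D toy example
    delta = (x - mu) ** 2 / sigma2        # squared Mahalanobis distance
    return (nu + d) / (nu + delta)

w_inlier = t_weight(0.5, mu=0.0, sigma2=1.0)    # near the centroid
w_outlier = t_weight(10.0, mu=0.0, sigma2=1.0)  # far from the centroid
```

The inlier's weight exceeds 1 while the outlier's collapses toward 0; the self-paced mechanism in the paper goes further and can exclude such points from training entirely.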
Spectral Distortion Model for Training Phase-Sensitive Deep-Neural Networks for Far-Field Speech Recognition
Chanwoo Kim, Tara N. Sainath, A. Narayanan, Ananya Misra, R. Nongpiur, M. Bacchiani
In this paper, we present an algorithm which introduces phase-perturbation to the training database when training phase-sensitive deep neural-network models. Traditional features such as log-mel or cepstral features do not have any phase-relevant information. However, features such as raw-waveform or complex spectra features contain phase-relevant information. Phase-sensitive features have the advantage of being able to detect differences in time of arrival across different microphone channels or frequency bands. However, compared to magnitude-based features, phase information is more sensitive to various kinds of distortions such as variations in microphone characteristics, reverberation, and so on. For traditional magnitude-based features, it is widely known that adding noise or reverberation, often called Multistyle-TRaining (MTR), improves robustness. In a similar spirit, we propose an algorithm which introduces spectral distortion to make the deep-learning models more robust to phase-distortion. We call this approach Spectral-Distortion TRaining (SDTR).
DOI: 10.1109/ICASSP.2018.8462223 · pp. 5729-5733 · Published 2018-05-07
Citations: 3
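A hypothetical illustration of the phase-perturbation idea: jitter the phase of a complex spectrum while leaving magnitudes untouched, so magnitude-based features (e.g. log-mel) are unaffected while phase-sensitive features see distortion. The jitter range here is an assumption, not the paper's setting:

```python
import numpy as np

# Multiply each complex bin by a unit-magnitude random phasor: magnitudes
# are preserved exactly, only phases change.
def perturb_phase(spec, max_rad=0.5, rng=None):
    if rng is None:
        rng = np.random.default_rng(0)
    jitter = rng.uniform(-max_rad, max_rad, size=spec.shape)
    return spec * np.exp(1j * jitter)

x = np.sin(2 * np.pi * np.arange(256) / 16.0)  # toy waveform
spec = np.fft.rfft(x)                          # complex spectrum
out = perturb_phase(spec)                      # same magnitudes, new phases
```

This mirrors how MTR adds magnitude-domain distortion: training on phase-distorted examples should make phase-sensitive models tolerant of microphone and reverberation variability.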
Sound Source Separation Using Phase Difference and Reliable Mask Selection
Chanwoo Kim, Anjali Menon, M. Bacchiani, R. Stern
We present an algorithm called Reliable Mask Selection-Phase Difference Channel Weighting (RMS-PDCW), which selects the target source masked by a noise source using the Angle of Arrival (AoA) information calculated from phase differences. The RMS-PDCW algorithm selects which masks to apply using information about the localized sound source and the onset detection of speech. We demonstrate that this algorithm shows a relative 5.3 percent improvement over the baseline acoustic model, which was multistyle-trained using 22 million utterances, on a simulated test set consisting of real-world and interfering-speaker noise with reverberation times distributed between 0 ms and 900 ms and SNRs distributed from 0 dB up to clean.
DOI: 10.1109/ICASSP.2018.8462269 · pp. 5559-5563 · Published 2018-05-07
Citations: 8
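A toy illustration of the phase-difference-to-AoA step (the textbook far-field relation for a narrowband signal and a two-microphone pair, not the RMS-PDCW algorithm itself); the microphone spacing and frequency are assumed values:

```python
import numpy as np

# Far-field narrowband model: the inter-mic phase difference is
# phi = 2*pi*f*d*sin(theta)/c, so theta = arcsin(phi*c / (2*pi*f*d)).
def aoa_from_phase(phi, f, d=0.05, c=343.0):
    """phi: phase difference (rad); f: frequency (Hz);
    d: mic spacing (m); c: speed of sound (m/s). Returns degrees."""
    return np.degrees(np.arcsin(phi * c / (2 * np.pi * f * d)))

f, d, c = 1000.0, 0.05, 343.0
theta_true = 30.0
phi = 2 * np.pi * f * d * np.sin(np.radians(theta_true)) / c  # forward model
theta_est = aoa_from_phase(phi, f)                            # recovers 30 deg
```

Per-bin AoA estimates of this kind are what phase-difference channel-weighting methods threshold to build time-frequency masks for the target direction.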
Acoustic Modeling of Speech Waveform Based on Multi-Resolution, Neural Network Signal Processing
Zoltán Tüske, R. Schlüter, H. Ney
Recently, several papers have demonstrated that neural networks (NN) are able to perform the feature extraction as part of the acoustic model. Motivated by the Gammatone feature extraction pipeline, in this paper we extend the waveform based NN model with a second level of time-convolutional elements. The proposed extension generalizes the envelope extraction block, and allows the model to learn multi-resolutional representations. Automatic speech recognition (ASR) experiments show significant word error rate reduction over our previous best acoustic model trained directly in the signal domain. Although we use only 250 hours of speech, the data-driven NN based speech signal processing performs nearly as well as traditional handcrafted feature extractors. In additional experiments, we also test segment-level feature normalization techniques on NN derived features, which improve the results further. However, porting speech representations derived by a feed-forward NN to a LSTM back-end model indicates much less robustness of the NN front-end compared to the standard feature extractors. Analysis of the weights in the proposed new layer reveals that the NN prefers both multi-resolution and modulation spectrum representations.
DOI: 10.1109/ICASSP.2018.8461871 · pp. 4859-4863 · Published 2018-05-02
Citations: 24
Journal: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)