Multi-channel Itakura Saito Distance Minimization with Deep Neural Network

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Pub Date: 2019-05-12. DOI: 10.1109/ICASSP.2019.8683410

M. Togami

Pages: 536-540

Citations: 12

Abstract

This paper presents a multi-channel speech source separation method based on a deep neural network that jointly optimizes both the time-varying variance of each speech source and the multi-channel spatial covariance matrix, without any iterative optimization. Rather than a loss function that ignores the spatial characteristics of the output signal, the proposed method uses a loss function based on minimizing the multi-channel Itakura-Saito distance (MISD), which does evaluate those characteristics. The MISD-based cost function is computed from the estimated posterior probability density function (PDF) of each speech source under a time-varying Gaussian distribution model. The loss function of the neural network is consistent with the PDF assumed for each speech source in multi-channel speech source separation. The network architecture consists of multiple bidirectional long short-term memory (BLSTM) layers. The BLSTM layers and the subsequent complex-valued signal processing are jointly optimized during training. Experimental results show that network parameters optimized by the proposed MISD minimization yield more accurately separated speech signals than parameters optimized with loss functions that do not evaluate the spatial covariance matrix.
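As a rough illustration of the quantity the abstract refers to, the sketch below computes the commonly used multichannel Itakura-Saito divergence between two Hermitian positive-definite covariance matrices, D(X, Y) = tr(X Y^-1) - log det(X Y^-1) - M, where M is the number of channels. This is an assumption about the general form only: the abstract does not give the paper's exact loss, which is computed from estimated posterior PDFs. The function name `multichannel_is_divergence`, the matrix `spatial_cov`, and the variance values are hypothetical and chosen for illustration.

```python
import numpy as np


def multichannel_is_divergence(x_cov, y_cov):
    """Return tr(X Y^-1) - log det(X Y^-1) - M for Hermitian positive-definite X, Y."""
    m = x_cov.shape[-1]
    # tr(X Y^-1); the trace is real for Hermitian positive-definite inputs.
    trace_term = np.trace(x_cov @ np.linalg.inv(y_cov)).real
    # log det(X Y^-1) = log det X - log det Y, computed stably via slogdet.
    logdet_term = np.linalg.slogdet(x_cov)[1] - np.linalg.slogdet(y_cov)[1]
    return float(trace_term - logdet_term - m)


# Hypothetical example: one time-frequency bin, M = 3 microphones.
rng = np.random.default_rng(0)
a = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))
spatial_cov = a @ a.conj().T + 1e-3 * np.eye(3)  # Hermitian positive definite

v_true, v_est = 2.0, 0.5  # time-varying variances (illustrative values)
d = multichannel_is_divergence(v_true * spatial_cov, v_est * spatial_cov)

# With a shared spatial covariance the divergence reduces to M times the
# scalar Itakura-Saito divergence of the variance ratio.
ratio = v_true / v_est
print(d, 3 * (ratio - np.log(ratio) - 1))
```

Under a time-varying Gaussian model of the form (variance) x (spatial covariance), a mismatch in either factor raises this divergence, which is one way to see why such a loss can evaluate spatial characteristics that a variance-only loss cannot.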


