{"title":"基于混合模型的鲁棒ASR单通道语音-音乐分离","authors":"Cemil Demir, M. Saraçlar, A. Cemgil","doi":"10.1109/TASL.2012.2231072","DOIUrl":null,"url":null,"abstract":"In this study, we describe a mixture model based single-channel speech-music separation method. Given a catalog of background music material, we propose a generative model for the superposed speech and music spectrograms. The background music signal is assumed to be generated by a jingle in the catalog. The background music component is modeled by a scaled conditional mixture model representing the jingle. The speech signal is modeled by a probabilistic model, which is similar to the probabilistic interpretation of Non-negative Matrix Factorization (NMF) model. The parameters of the speech model is estimated in a semi-supervised manner from the mixed signal. The approach is tested with Poisson and complex Gaussian observation models that correspond respectively to Kullback-Leibler (KL) and Itakura-Saito (IS) divergence measures. Our experiments show that the proposed mixture model outperforms a standard NMF method both in speech-music separation and automatic speech recognition (ASR) tasks. These results are further improved using Markovian prior structures for temporal continuity between the jingle frames. Our test results with real data show that our method increases the speech recognition performance.","PeriodicalId":55014,"journal":{"name":"IEEE Transactions on Audio Speech and Language Processing","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2013-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TASL.2012.2231072","citationCount":"22","resultStr":"{\"title\":\"Single-Channel Speech-Music Separation for Robust ASR With Mixture Models\",\"authors\":\"Cemil Demir, M. Saraçlar, A. 
Cemgil\",\"doi\":\"10.1109/TASL.2012.2231072\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this study, we describe a mixture model based single-channel speech-music separation method. Given a catalog of background music material, we propose a generative model for the superposed speech and music spectrograms. The background music signal is assumed to be generated by a jingle in the catalog. The background music component is modeled by a scaled conditional mixture model representing the jingle. The speech signal is modeled by a probabilistic model, which is similar to the probabilistic interpretation of Non-negative Matrix Factorization (NMF) model. The parameters of the speech model is estimated in a semi-supervised manner from the mixed signal. The approach is tested with Poisson and complex Gaussian observation models that correspond respectively to Kullback-Leibler (KL) and Itakura-Saito (IS) divergence measures. Our experiments show that the proposed mixture model outperforms a standard NMF method both in speech-music separation and automatic speech recognition (ASR) tasks. These results are further improved using Markovian prior structures for temporal continuity between the jingle frames. 
Our test results with real data show that our method increases the speech recognition performance.\",\"PeriodicalId\":55014,\"journal\":{\"name\":\"IEEE Transactions on Audio Speech and Language Processing\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://sci-hub-pdf.com/10.1109/TASL.2012.2231072\",\"citationCount\":\"22\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Audio Speech and Language Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/TASL.2012.2231072\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Audio Speech and Language Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TASL.2012.2231072","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Single-Channel Speech-Music Separation for Robust ASR With Mixture Models
In this study, we describe a mixture-model-based single-channel speech-music separation method. Given a catalog of background music material, we propose a generative model for the superposed speech and music spectrograms. The background music signal is assumed to be generated by a jingle in the catalog, and the background music component is modeled by a scaled conditional mixture model representing that jingle. The speech signal is modeled by a probabilistic model similar to the probabilistic interpretation of the Non-negative Matrix Factorization (NMF) model. The parameters of the speech model are estimated in a semi-supervised manner from the mixed signal. The approach is tested with Poisson and complex Gaussian observation models, which correspond to the Kullback-Leibler (KL) and Itakura-Saito (IS) divergence measures, respectively. Our experiments show that the proposed mixture model outperforms a standard NMF method in both speech-music separation and automatic speech recognition (ASR) tasks. These results are further improved by using Markovian prior structures to enforce temporal continuity between the jingle frames. Our test results on real data show that our method improves speech recognition performance.
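The abstract's NMF baseline can be illustrated with a short sketch. This is not the paper's conditional mixture model; it is a minimal semi-supervised NMF with KL-divergence multiplicative updates, the kind of baseline the paper compares against: the music bases are pre-trained on the jingle catalog and kept frozen, the speech bases and all activations are learned from the mixed spectrogram, and the sources are recovered with Wiener-style masks. All function and variable names here are illustrative assumptions, not from the paper.

```python
import numpy as np

def semisupervised_nmf_kl(V, W_music, n_speech=8, n_iter=100, eps=1e-9, seed=0):
    """Separate a nonnegative mixture spectrogram V (freq x time).

    W_music : frozen bases pre-trained on the jingle catalog (freq x k_m).
    Speech bases and all activations are estimated from V itself
    (the "semi-supervised" setting described in the abstract).
    """
    rng = np.random.default_rng(seed)
    F, T = V.shape
    k_m = W_music.shape[1]
    W = np.hstack([W_music, rng.random((F, n_speech)) + eps])  # [music | speech]
    H = rng.random((W.shape[1], T)) + eps                      # activations

    for _ in range(n_iter):
        V_hat = W @ H + eps
        # KL-divergence multiplicative update for all activations
        H *= (W.T @ (V / V_hat)) / (W.T @ np.ones_like(V) + eps)
        V_hat = W @ H + eps
        # Update only the speech bases; music bases stay frozen
        W[:, k_m:] *= ((V / V_hat) @ H[k_m:].T) / (np.ones_like(V) @ H[k_m:].T + eps)

    # Wiener-style masking: redistribute the observed energy between sources
    V_hat = W @ H + eps
    speech = (W[:, k_m:] @ H[k_m:]) / V_hat * V
    music = (W[:, :k_m] @ H[:k_m]) / V_hat * V
    return speech, music
```

Because the two masked estimates are complementary by construction, they sum back (up to numerical precision) to the observed mixture spectrogram; the quality of the split depends entirely on how well the frozen jingle bases explain the music component.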
Journal Introduction:
The IEEE Transactions on Audio, Speech and Language Processing covers the sciences, technologies and applications relating to the analysis, coding, enhancement, recognition and synthesis of audio, music, speech and language. In particular, audio processing also covers auditory modeling, acoustic modeling and source separation. Speech processing also covers speech production and perception, adaptation, lexical modeling and speaker recognition. Language processing also covers spoken language understanding, translation, summarization, mining, general language modeling, as well as spoken dialog systems.