Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition

Interspeech. Pub Date: 2022-09-18. DOI: 10.21437/interspeech.2022-11077

Yoshiaki Bando, T. Aizawa, Katsutoshi Itoyama, K. Nakadai

Pages: 3824-3828

Citations: 4

Weakly-Supervised Neural Full-Rank Spatial Covariance Analysis for a Front-End System of Distant Speech Recognition

This paper presents a weakly-supervised multichannel neural speech separation method for distant speech recognition (DSR) of real conversational speech mixtures. A blind source separation (BSS) method called neural full-rank spatial covariance analysis (FCA) can precisely separate multichannel speech mixtures by using a deep spectral model without any supervision. The neural FCA, however, requires that the number of sound sources is fixed and known in advance. This requirement complicates its utilization for a front-end system of DSR for multispeaker conversations, in which the number of speakers changes dynamically. In this paper, we propose an extension of neural FCA to handle a dynamically changing number of sound sources by taking temporal voice activities of target speakers as auxiliary information. We train a source separation network in a weakly-supervised manner using a dataset of multichannel audio mixtures and their voice activities. Experimental results with the CHiME-6 dataset, whose task is to recognize conversations at dinner parties, show that our method outperformed a conventional BSS-based system in word error rates.
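The abstract does not spell out how voice activities enter the model, so the following is only a toy sketch of one way to picture the idea, not the authors' formulation. It assumes a generic full-rank spatial covariance mixture in which each source contributes its power times a spatial covariance matrix, and a 0/1 voice-activity flag switches that contribution on or off per frame. The function name `mixture_covariance` and the parameters `activity`, `power`, `spatial_cov`, and `noise_floor` are all hypothetical.

```python
# Illustrative only: a toy full-rank spatial covariance mixture for a single
# frequency bin, where each source is gated by a voice-activity flag.
# This is an assumed simplification, not the method described in the paper.

def mixture_covariance(activity, power, spatial_cov, noise_floor=1e-6):
    """Return one mixture covariance matrix (nested lists) per time frame.

    activity[n][t] : 0 or 1, voice activity of source n at frame t
    power[n][t]    : non-negative power of source n at frame t
    spatial_cov[n] : M x M spatial covariance matrix of source n
    noise_floor    : small diagonal term that keeps the matrix invertible
                     when every source is inactive
    """
    n_src = len(activity)
    n_frames = len(activity[0])
    n_mics = len(spatial_cov[0])
    frames = []
    for t in range(n_frames):
        cov = [[noise_floor if i == j else 0.0 for j in range(n_mics)]
               for i in range(n_mics)]
        for n in range(n_src):
            # An inactive source has zero gain and drops out of the sum,
            # so the effective number of sources can change frame by frame.
            gain = activity[n][t] * power[n][t]
            for i in range(n_mics):
                for j in range(n_mics):
                    cov[i][j] += gain * spatial_cov[n][i][j]
        frames.append(cov)
    return frames


# Two sources, two microphones, three frames (made-up numbers).
activity = [[1, 1, 0],   # source 0 is active in frames 0 and 1
            [0, 1, 0]]   # source 1 is active only in frame 1
power = [[2.0, 2.0, 2.0],
         [3.0, 3.0, 3.0]]
spatial_cov = [[[1.0, 0.5], [0.5, 1.0]],
               [[1.0, -0.5], [-0.5, 1.0]]]

covs = mixture_covariance(activity, power, spatial_cov, noise_floor=0.0)
for t, cov in enumerate(covs):
    print(t, cov)
```

With these inputs, frame 0 contains only source 0, frame 1 is the sum of both sources, and frame 2 is silent. The model is built for a fixed maximum number of sources, and the activity flags decide which of them are present at each frame.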
