Coarse-to-fine speech separation method in the time-frequency domain

IF 2.4 | CAS Zone 3 (Computer Science) | JCR Q2 (ACOUSTICS) | Speech Communication | Pub Date: 2023-11-01 | DOI: 10.1016/j.specom.2023.103003
Xue Yang, Changchun Bao, Xianhong Chen
{"title":"时频域粗-细语音分离方法","authors":"Xue Yang,&nbsp;Changchun Bao,&nbsp;Xianhong Chen","doi":"10.1016/j.specom.2023.103003","DOIUrl":null,"url":null,"abstract":"<div><p>Although time-domain speech separation methods have exhibited the outstanding performance in anechoic scenarios, their effectiveness is considerably reduced in the reverberant scenarios. Compared to the time-domain methods, the speech separation methods in time-frequency (T-F) domain mainly concern the structured T-F representations and have shown a great potential recently. In this paper, we propose a coarse-to-fine speech separation method in the T-F domain, which involves two steps: 1) a rough separation conducted in the coarse phase and 2) a precise extraction accomplished in the refining phase. In the coarse phase, the speech signals of all speakers are initially separated in a rough manner, resulting in some level of distortion in the estimated signals. In the refining phase, the T-F representation of each estimated signal acts as a guide to extract the residual T-F representation for the corresponding speaker, which helps to reduce the distortions caused in the coarse phase. Besides, the specially designed networks used for the coarse and refining phases are jointly trained for superior performance. Furthermore, utilizing the recurrent attention with parallel branches (RAPB) block to fully exploit the contextual information contained in the whole T-F features, the proposed model demonstrates competitive performance on clean datasets with a small number of parameters. Additionally, the proposed method shows more robustness and achieves state-of-the-art results on more realistic datasets.</p></div>","PeriodicalId":49485,"journal":{"name":"Speech Communication","volume":"155 ","pages":"Article 103003"},"PeriodicalIF":2.4000,"publicationDate":"2023-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Coarse-to-fine speech separation method in the time-frequency domain\",\"authors\":\"Xue Yang,&nbsp;Changchun Bao,&nbsp;Xianhong Chen\",\"doi\":\"10.1016/j.specom.2023.103003\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Although time-domain speech separation methods have exhibited the outstanding performance in anechoic scenarios, their effectiveness is considerably reduced in the reverberant scenarios. Compared to the time-domain methods, the speech separation methods in time-frequency (T-F) domain mainly concern the structured T-F representations and have shown a great potential recently. In this paper, we propose a coarse-to-fine speech separation method in the T-F domain, which involves two steps: 1) a rough separation conducted in the coarse phase and 2) a precise extraction accomplished in the refining phase. In the coarse phase, the speech signals of all speakers are initially separated in a rough manner, resulting in some level of distortion in the estimated signals. In the refining phase, the T-F representation of each estimated signal acts as a guide to extract the residual T-F representation for the corresponding speaker, which helps to reduce the distortions caused in the coarse phase. Besides, the specially designed networks used for the coarse and refining phases are jointly trained for superior performance. 
Furthermore, utilizing the recurrent attention with parallel branches (RAPB) block to fully exploit the contextual information contained in the whole T-F features, the proposed model demonstrates competitive performance on clean datasets with a small number of parameters. Additionally, the proposed method shows more robustness and achieves state-of-the-art results on more realistic datasets.</p></div>\",\"PeriodicalId\":49485,\"journal\":{\"name\":\"Speech Communication\",\"volume\":\"155 \",\"pages\":\"Article 103003\"},\"PeriodicalIF\":2.4000,\"publicationDate\":\"2023-11-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Speech Communication\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167639323001371\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Speech Communication","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167639323001371","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ACOUSTICS","Score":null,"Total":0}
Citations: 0

Abstract

Although time-domain speech separation methods have exhibited outstanding performance in anechoic scenarios, their effectiveness is considerably reduced in reverberant scenarios. Compared to the time-domain methods, speech separation methods in the time-frequency (T-F) domain mainly concern structured T-F representations and have recently shown great potential. In this paper, we propose a coarse-to-fine speech separation method in the T-F domain, which involves two steps: 1) a rough separation conducted in the coarse phase and 2) a precise extraction accomplished in the refining phase. In the coarse phase, the speech signals of all speakers are initially separated in a rough manner, resulting in some level of distortion in the estimated signals. In the refining phase, the T-F representation of each estimated signal acts as a guide to extract the residual T-F representation for the corresponding speaker, which helps to reduce the distortion introduced in the coarse phase. In addition, the specially designed networks used for the coarse and refining phases are jointly trained for superior performance. Furthermore, by utilizing the recurrent attention with parallel branches (RAPB) block to fully exploit the contextual information contained in the whole T-F features, the proposed model demonstrates competitive performance on clean datasets with a small number of parameters. Additionally, the proposed method is more robust and achieves state-of-the-art results on more realistic datasets.
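The abstract describes the two-phase design only at a high level. As a concrete illustration of the coarse-to-fine idea, the sketch below implements a minimal two-stage pipeline on STFT magnitudes in PyTorch; the module names (CoarseNet, RefineNet), the GRU backbones, and the mask-plus-residual formulation are illustrative assumptions and do not reproduce the authors' RAPB-based architecture.

```python
# Minimal sketch of a coarse-to-fine T-F separation pipeline (illustration only;
# module names, GRU backbones, and shapes are hypothetical and do NOT reproduce
# the RAPB-based networks proposed in the paper).
import torch
import torch.nn as nn


class CoarseNet(nn.Module):
    """Coarse phase: roughly separate all speakers via one T-F mask per speaker."""

    def __init__(self, n_freq: int, n_speakers: int, hidden: int = 256):
        super().__init__()
        self.n_freq, self.n_speakers = n_freq, n_speakers
        self.rnn = nn.GRU(n_freq, hidden, num_layers=2, batch_first=True)
        self.mask = nn.Linear(hidden, n_freq * n_speakers)

    def forward(self, mix_mag):                       # mix_mag: (B, T, F)
        h, _ = self.rnn(mix_mag)
        m = torch.sigmoid(self.mask(h))                # (B, T, F * S)
        m = m.view(m.shape[0], m.shape[1], self.n_speakers, self.n_freq)
        return m * mix_mag.unsqueeze(2)                # coarse estimates: (B, T, S, F)


class RefineNet(nn.Module):
    """Refining phase: each coarse estimate guides recovery of its residual T-F energy."""

    def __init__(self, n_freq: int, hidden: int = 256):
        super().__init__()
        self.rnn = nn.GRU(2 * n_freq, hidden, num_layers=2, batch_first=True)
        self.residual = nn.Linear(hidden, n_freq)

    def forward(self, mix_mag, coarse_est):            # both: (B, T, F)
        h, _ = self.rnn(torch.cat([mix_mag, coarse_est], dim=-1))
        return coarse_est + self.residual(h)           # refined estimate: (B, T, F)


def separate(mix_mag, coarse_net, refine_net):
    """Run both phases; with joint training, one loss is back-propagated through both."""
    coarse = coarse_net(mix_mag)                       # (B, T, S, F)
    refined = [refine_net(mix_mag, coarse[:, :, s]) for s in range(coarse.shape[2])]
    return torch.stack(refined, dim=2)                 # (B, T, S, F)


if __name__ == "__main__":
    mix = torch.rand(2, 100, 257)                      # fake STFT magnitudes: (B, T, F)
    est = separate(mix, CoarseNet(257, 2), RefineNet(257))
    print(est.shape)                                   # torch.Size([2, 100, 2, 257])
```

Joint training of the two phases, as described in the abstract, then simply amounts to back-propagating a single separation loss through separate() so that both networks are updated together.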

Source journal
Speech Communication
Category: Engineering & Technology, Computer Science: Interdisciplinary Applications
CiteScore: 6.80
Self-citation rate: 6.20%
Articles published: 94
Review time: 19.2 weeks
About the journal
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results. The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.
Latest articles in this journal
A new universal camouflage attack algorithm for intelligent speech system
Fixed frequency range empirical wavelet transform based acoustic and entropy features for speech emotion recognition
AFP-Conformer: Asymptotic feature pyramid conformer for spoofing speech detection
A robust temporal map of speech monitoring from planning to articulation
The combined effects of bilingualism and musicianship on listeners' perception of non-native lexical tones