PGSS: Pitch-Guided Speech Separation

Xiang Li, Yiwen Wang, Yifan Sun, Xihong Wu, J. Chen
{"title":"PGSS: Pitch-Guided Speech Separation","authors":"Xiang Li, Yiwen Wang, Yifan Sun, Xihong Wu, J. Chen","doi":"10.1609/aaai.v37i11.26542","DOIUrl":null,"url":null,"abstract":"Monaural speech separation aims to separate concurrent speakers from a single-microphone mixture recording. Inspired by the effect of pitch priming in auditory scene analysis (ASA) mechanisms, a novel pitch-guided speech separation framework is proposed in this work. The prominent advantage of this framework is that both the permutation problem and the unknown speaker number problem existing in general models can be avoided by using pitch contours as the primary means to guide the target speaker. In addition, adversarial training is applied, instead of a traditional time-frequency mask, to improve the perceptual quality of separated speech. Specifically, the proposed framework can be divided into two phases: pitch extraction and speech separation. The former aims to extract pitch contour candidates for each speaker from the mixture, modeling the bottom-up process in ASA mechanisms. Any pitch contour can be selected as the condition in the second phase to separate the corresponding speaker, where a conditional generative adversarial network (CGAN) is applied. The second phase models the effect of pitch priming in ASA. Experiments on the WSJ0-2mix corpus reveal that the proposed approaches can achieve higher pitch extraction accuracy and better separation performance, compared to the baseline models, and have the potential to be applied to SOTA architectures.","PeriodicalId":74506,"journal":{"name":"Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence","volume":"29 1","pages":"13130-13138"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the ... AAAI Conference on Artificial Intelligence. AAAI Conference on Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1609/aaai.v37i11.26542","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Monaural speech separation aims to separate concurrent speakers from a single-microphone mixture recording. Inspired by the effect of pitch priming in auditory scene analysis (ASA), this work proposes a novel pitch-guided speech separation framework. Its prominent advantage is that, by using pitch contours as the primary cue for identifying the target speaker, it avoids both the permutation problem and the unknown-speaker-number problem that affect general separation models. In addition, adversarial training is applied in place of a traditional time-frequency mask to improve the perceptual quality of the separated speech. Specifically, the framework consists of two phases: pitch extraction and speech separation. The first phase extracts pitch contour candidates for each speaker from the mixture, modeling the bottom-up process of ASA. In the second phase, any extracted pitch contour can be selected as the condition for a conditional generative adversarial network (CGAN) that separates the corresponding speaker; this phase models the effect of pitch priming in ASA. Experiments on the WSJ0-2mix corpus show that the proposed approach achieves higher pitch extraction accuracy and better separation performance than the baseline models, and has the potential to be applied to state-of-the-art (SOTA) architectures.
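To make the second phase concrete, below is a minimal PyTorch sketch of a pitch-conditioned generator and discriminator in the spirit of the framework described above. All module names, layer sizes, and the loss terms are illustrative assumptions, not the authors' implementation; the key idea shown is that the pitch contour of the selected speaker is fed as a per-frame condition to both networks, and the generator emits the separated spectrogram directly rather than a time-frequency mask.

```python
import torch
import torch.nn as nn

class PitchConditionedGenerator(nn.Module):
    """Maps (mixture spectrogram, target pitch contour) -> separated spectrogram."""
    def __init__(self, n_freq=257, pitch_dim=1, hidden=256):
        super().__init__()
        # Embed the pitch contour frame by frame and concatenate it with the
        # mixture features, so the network is conditioned on the selected
        # speaker's pitch at every time step.
        self.pitch_embed = nn.Linear(pitch_dim, 32)
        self.rnn = nn.LSTM(n_freq + 32, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_freq)

    def forward(self, mix_spec, pitch):
        # mix_spec: (batch, frames, n_freq); pitch: (batch, frames, 1) in Hz (0 = unvoiced)
        cond = self.pitch_embed(pitch)
        h, _ = self.rnn(torch.cat([mix_spec, cond], dim=-1))
        # Emit the separated magnitude spectrogram directly (no T-F mask).
        return torch.relu(self.out(h))

class Discriminator(nn.Module):
    """Scores (spectrogram, pitch contour) pairs as real (clean) or fake (separated)."""
    def __init__(self, n_freq=257, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(n_freq + 1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, spec, pitch):
        h, _ = self.rnn(torch.cat([spec, pitch], dim=-1))
        return self.out(h.mean(dim=1))  # one real/fake logit per utterance

if __name__ == "__main__":
    G, D = PitchConditionedGenerator(), Discriminator()
    mix = torch.randn(2, 100, 257).abs()    # dummy mixture magnitude spectrograms
    pitch = torch.rand(2, 100, 1) * 300.0   # dummy pitch contours in Hz
    clean = torch.randn(2, 100, 257).abs()  # dummy reference spectrograms

    est = G(mix, pitch)
    # Non-saturating GAN losses plus an L1 reconstruction term -- an assumed
    # objective for illustration; the paper's exact formulation may differ.
    bce = nn.functional.binary_cross_entropy_with_logits
    d_loss = (bce(D(clean, pitch), torch.ones(2, 1)) +
              bce(D(est.detach(), pitch), torch.zeros(2, 1)))
    g_loss = (bce(D(est, pitch), torch.ones(2, 1)) +
              nn.functional.l1_loss(est, clean))
    print(f"d_loss={d_loss.item():.3f}  g_loss={g_loss.item():.3f}")
```

Because the generator is conditioned on a single selected pitch contour, the same network can in principle separate whichever speaker that contour belongs to, which is how the framework sidesteps output permutation and a fixed speaker count.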