Effective Low-Cost Time-Domain Audio Separation Using Globally Attentive Locally Recurrent Networks

2021 IEEE Spoken Language Technology Workshop (SLT). Pub Date: 2021-01-13. DOI: 10.1109/SLT48900.2021.9383464

Max W. Y. Lam, Jun Wang, Dan Su, Dong Yu


Citations: 20

Abstract

Recent research on time-domain audio separation networks (TasNets) has brought great success to speech separation. Nevertheless, conventional TasNets struggle to satisfy the memory and latency constraints of industrial applications. To address this, we design a low-cost, high-performance architecture, the globally attentive locally recurrent (GALR) network. Like the dual-path RNN (DPRNN), we first split a feature sequence into 2D segments and then process the sequence along both the intra- and inter-segment dimensions. Our main innovation is that, on top of features recurrently processed along the intra-segment dimension, GALR applies a self-attention mechanism to the sequence along the inter-segment dimension, which aggregates context-aware information and also enables parallelization. Our experiments suggest that GALR is notably more effective than prior work. On the one hand, with only 1.5M parameters, it achieves comparable separation performance at a much lower cost, with 36.1% less runtime memory and 49.4% fewer computational operations than DPRNN. On the other hand, at a model size comparable to DPRNN, GALR consistently outperforms DPRNN on three datasets, in particular with a substantial 2.4 dB absolute improvement in SI-SNRi on the benchmark WSJ0-2mix task.
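The segment-then-process idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`segment`, `local_rnn`, `global_attention`, `galr_block`), the 50% segment overlap, the plain tanh RNN, the single attention head, and the residual connections are all assumptions chosen for brevity, and details the abstract does not specify (normalization, the exact recurrent cell, any cost-reduction steps) are left out.

```python
import numpy as np

def segment(x, K):
    """Split a (T, D) sequence into 50%-overlapping segments of length K -> (S, K, D)."""
    hop = K // 2
    T, D = x.shape
    # Zero-pad the tail so the segments tile the sequence exactly (assumed scheme).
    pad = (-(T - K)) % hop if T > K else K - T
    x = np.concatenate([x, np.zeros((pad, D))], axis=0)
    S = (x.shape[0] - K) // hop + 1
    return np.stack([x[s * hop : s * hop + K] for s in range(S)])

def local_rnn(seg, Wx, Wh):
    """Plain tanh RNN along the intra-segment axis; segments are handled as a batch."""
    S, K, _ = seg.shape
    h = np.zeros((S, Wh.shape[0]))
    out = np.empty((S, K, Wh.shape[0]))
    for k in range(K):  # sequential only over the short local axis
        h = np.tanh(seg[:, k] @ Wx + h @ Wh)
        out[:, k] = h
    return out

def global_attention(seg, Wq, Wk, Wv):
    """Single-head self-attention along the inter-segment axis of (S, K, D)."""
    x = seg.transpose(1, 0, 2)                       # (K, S, D): one sequence per local position
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1])  # (K, S, S)
    scores = scores - scores.max(axis=-1, keepdims=True)      # numerically stable softmax
    w = np.exp(scores)
    w = w / w.sum(axis=-1, keepdims=True)
    return (w @ v).transpose(1, 0, 2)                # back to (S, K, D)

def galr_block(seg, Wx, Wh, Wq, Wk, Wv):
    """Local recurrence followed by global attention, each with a residual connection."""
    seg = seg + local_rnn(seg, Wx, Wh)
    return seg + global_attention(seg, Wq, Wk, Wv)

rng = np.random.default_rng(0)
features = rng.standard_normal((11, 8))              # T=11 frames, D=8 channels (toy sizes)
segments = segment(features, K=4)                    # (5, 4, 8)
weights = [0.1 * rng.standard_normal((8, 8)) for _ in range(5)]
output = galr_block(segments, *weights)              # same shape as `segments`
```

In this sketch the only sequential loop runs over the K positions inside a segment, while the attention across segments is computed with batched matrix products, which is the sense in which inter-segment attention permits parallelization.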


