MFFN: Multi-level Feature Fusion Network for monaural speech separation
Jianjun Lei, Yun He, Ying Wang
Speech Communication, Volume 171, Article 103229, published 2025-04-01
DOI: 10.1016/j.specom.2025.103229
https://www.sciencedirect.com/science/article/pii/S0167639325000445
Impact factor: 3.0; JCR: Q2 (Acoustics)
Citations: 0
Abstract
Monaural speech separation based on dual-path networks has developed rapidly in recent years, owing to these networks' strong ability to process long feature sequences. However, such methods often use a fixed receptive field during feature learning, which struggles to capture feature information at different scales and thus limits model performance. This paper proposes a novel Multi-level Feature Fusion Network (MFFN) that helps dual-path networks capture multi-scale information for monaural speech separation. The MFFN integrates information at different scales from long sequences using a multi-scale sampling strategy, and employs parallel Squeeze-and-Excitation blocks to extract features along the channel and temporal dimensions. Moreover, we introduce a collaborative attention mechanism to fuse feature information across different levels, further improving the model's representation capability. Finally, we conduct extensive experiments on the noise-free datasets WSJ0-2mix and Libri2mix and on the noisy datasets WHAM! and WHAMR!. The results demonstrate that our MFFN outperforms some current methods without using data augmentation.
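As a rough illustration of two ideas named in the abstract, the Python sketch below downsamples a feature map at several scales and applies Squeeze-and-Excitation gating along the channel and temporal axes in parallel. It is not the authors' implementation. The function names, the average-pooling sampler, the random weights, the reduction ratio, and the summation of the two branches are all assumptions made for illustration, and the collaborative attention fusion is not modeled.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(w, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(a * b for a, b in zip(row, v)) for row in w]


def se_gate(z, w1, w2):
    # Excitation: bottleneck layer with ReLU, then sigmoid gates in (0, 1).
    hidden = [max(0.0, h) for h in matvec(w1, z)]
    return [sigmoid(g) for g in matvec(w2, hidden)]


def channel_se(x, w1, w2):
    # x holds C channels of T frames. Squeeze over time, gate each channel.
    z = [sum(ch) / len(ch) for ch in x]
    g = se_gate(z, w1, w2)
    return [[g[c] * v for v in ch] for c, ch in enumerate(x)]


def temporal_se(x, w1, w2):
    # Squeeze over channels, gate each frame.
    n_ch, n_fr = len(x), len(x[0])
    z = [sum(x[c][t] for c in range(n_ch)) / n_ch for t in range(n_fr)]
    g = se_gate(z, w1, w2)
    return [[g[t] * x[c][t] for t in range(n_fr)] for c in range(n_ch)]


def avg_pool(x, stride):
    # Assumed multi-scale sampler: mean over non-overlapping windows.
    return [
        [sum(ch[i:i + stride]) / len(ch[i:i + stride])
         for i in range(0, len(ch), stride)]
        for ch in x
    ]


def random_matrix(rows, cols, rng):
    # Stand-in for learned weights (hypothetical; real weights are trained).
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def parallel_se(x, rng, reduction=2):
    # Run both SE branches on the same input and sum them (assumed fusion).
    n_ch, n_fr = len(x), len(x[0])
    h_ch, h_fr = max(1, n_ch // reduction), max(1, n_fr // reduction)
    by_channel = channel_se(
        x, random_matrix(h_ch, n_ch, rng), random_matrix(n_ch, h_ch, rng))
    by_time = temporal_se(
        x, random_matrix(h_fr, n_fr, rng), random_matrix(n_fr, h_fr, rng))
    return [[a + b for a, b in zip(rc, rt)]
            for rc, rt in zip(by_channel, by_time)]


rng = random.Random(0)
x = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)]  # 4 x 8 input
scales = {s: parallel_se(avg_pool(x, s), rng) for s in (1, 2, 4)}
shapes = {s: (len(y), len(y[0])) for s, y in scales.items()}
print(shapes)  # {1: (4, 8), 2: (4, 4), 4: (4, 2)}
```

Each scale yields a feature map of a different temporal length, so a real model would need a fusion step to bring the levels back to a common resolution. The temporal branch here ties its weights to a fixed number of frames, which is a simplification of this sketch only.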
About the journal:
Speech Communication is an interdisciplinary journal whose primary objective is to fulfil the need for the rapid dissemination and thorough discussion of basic and applied research results.
The journal's primary objectives are:
• to present a forum for the advancement of human and human-machine speech communication science;
• to stimulate cross-fertilization between different fields of this domain;
• to contribute towards the rapid and wide diffusion of scientifically sound contributions in this domain.