{"title":"Leveraging Joint Spectral and Spatial Learning with MAMBA for Multichannel Speech Enhancement","authors":"Wenze Ren, Haibin Wu, Yi-Cheng Lin, Xuanjun Chen, Rong Chao, Kuo-Hsuan Hung, You-Jin Li, Wen-Yuan Ting, Hsin-Min Wang, Yu Tsao","doi":"arxiv-2409.10376","DOIUrl":null,"url":null,"abstract":"In multichannel speech enhancement, effectively capturing spatial and\nspectral information across different microphones is crucial for noise\nreduction. Traditional methods, such as CNN or LSTM, attempt to model the\ntemporal dynamics of full-band and sub-band spectral and spatial features.\nHowever, these approaches face limitations in fully modeling complex temporal\ndependencies, especially in dynamic acoustic environments. To overcome these\nchallenges, we modify the current advanced model McNet by introducing an\nimproved version of Mamba, a state-space model, and further propose MCMamba.\nMCMamba has been completely reengineered to integrate full-band and narrow-band\nspatial information with sub-band and full-band spectral features, providing a\nmore comprehensive approach to modeling spatial and spectral information. Our\nexperimental results demonstrate that MCMamba significantly improves the\nmodeling of spatial and spectral features in multichannel speech enhancement,\noutperforming McNet and achieving state-of-the-art performance on the CHiME-3\ndataset. Additionally, we find that Mamba performs exceptionally well in\nmodeling spectral information.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10376","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In multichannel speech enhancement, effectively capturing spatial and spectral information across different microphones is crucial for noise reduction. Traditional methods, such as CNNs and LSTMs, attempt to model the temporal dynamics of full-band and sub-band spectral and spatial features. However, these approaches fall short of fully modeling complex temporal dependencies, especially in dynamic acoustic environments. To overcome these challenges, we modify the advanced McNet model by introducing an improved version of Mamba, a state-space model, and propose MCMamba. MCMamba is a complete redesign that integrates full-band and narrow-band spatial information with sub-band and full-band spectral features, providing a more comprehensive approach to modeling spatial and spectral information. Our experimental results demonstrate that MCMamba significantly improves the modeling of spatial and spectral features in multichannel speech enhancement, outperforming McNet and achieving state-of-the-art performance on the CHiME-3 dataset. Additionally, we find that Mamba performs exceptionally well at modeling spectral information.
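
To make the core ideas concrete, the sketch below is a minimal, illustrative PyTorch example, not the authors' implementation: a simplified Mamba-style selective state-space scan, plus one hypothetical way to chain a full-band pass (scanning across frequency) with a sub-band pass (scanning across time, per frequency) over a multichannel complex spectrogram. All class names, dimensions, and tensor layouts here are assumptions made for illustration only.

```python
# Illustrative sketch only (assumed layout, not the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class SelectiveSSM(nn.Module):
    """Simplified selective state-space layer: a diagonal continuous-time SSM
    with input-dependent (selective) step size, B, and C, scanned sequentially
    over the sequence axis."""

    def __init__(self, dim: int, state: int = 16):
        super().__init__()
        self.A_log = nn.Parameter(torch.zeros(dim, state))  # log of -A (diagonal)
        self.proj_dt = nn.Linear(dim, dim)                   # input-dependent step size
        self.proj_B = nn.Linear(dim, state)                  # input-dependent B
        self.proj_C = nn.Linear(dim, state)                  # input-dependent C
        self.D = nn.Parameter(torch.ones(dim))               # residual skip

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, length, dim)
        b, t, d = x.shape
        A = -torch.exp(self.A_log)                           # (d, n), negative real part
        dt = F.softplus(self.proj_dt(x))                     # (b, t, d)
        B = self.proj_B(x)                                   # (b, t, n)
        C = self.proj_C(x)                                   # (b, t, n)
        h = x.new_zeros(b, d, A.shape[1])                    # hidden state (b, d, n)
        ys = []
        for k in range(t):                                   # plain sequential scan
            dA = torch.exp(dt[:, k].unsqueeze(-1) * A)       # discretized A: (b, d, n)
            dB = dt[:, k].unsqueeze(-1) * B[:, k].unsqueeze(1)
            h = dA * h + dB * x[:, k].unsqueeze(-1)
            ys.append((h * C[:, k].unsqueeze(1)).sum(-1))    # (b, d)
        y = torch.stack(ys, dim=1)                           # (b, t, d)
        return y + self.D * x


class FullSubSketch(nn.Module):
    """Hypothetical chaining of a full-band pass (scan along frequency) and a
    sub-band pass (scan along time, per frequency bin) on a multichannel STFT."""

    def __init__(self, channels: int = 6, dim: int = 32):
        super().__init__()
        self.embed = nn.Linear(2 * channels, dim)  # real+imag of each microphone
        self.full_band = SelectiveSSM(dim)         # scanned over frequency
        self.sub_band = SelectiveSSM(dim)          # scanned over time
        self.out = nn.Linear(dim, 2)               # real/imag of enhanced spectrum

    def forward(self, spec: torch.Tensor) -> torch.Tensor:
        # spec: (batch, mics, freq, time), complex-valued STFT
        b, m, f, t = spec.shape
        x = torch.view_as_real(spec).permute(0, 2, 3, 1, 4).reshape(b, f, t, 2 * m)
        x = self.embed(x)                                                # (b, f, t, dim)
        x = self.full_band(x.permute(0, 2, 1, 3).reshape(b * t, f, -1))  # scan over freq
        x = x.reshape(b, t, f, -1).permute(0, 2, 1, 3)                   # (b, f, t, dim)
        x = self.sub_band(x.reshape(b * f, t, -1)).reshape(b, f, t, -1)  # scan over time
        return self.out(x)                                               # (b, f, t, 2)


# Toy usage: 6-mic complex spectrogram, batch of 2, 65 frequency bins, 50 frames.
spec = torch.randn(2, 6, 65, 50, dtype=torch.complex64)
model = FullSubSketch(channels=6, dim=32)
out = model(spec)  # (2, 65, 50, 2)
```

This sketch uses a naive Python loop for the scan; practical Mamba implementations rely on a hardware-efficient parallel scan, and the actual MCMamba architecture, feature ordering, and output target may differ from what is assumed here.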