{"title":"Dense-TSNet:用于超轻量级语音增强的密集连接两级结构","authors":"Zizhen Lin, Yuanle Li, Junyu Wang, Ruili Li","doi":"arxiv-2409.11725","DOIUrl":null,"url":null,"abstract":"Speech enhancement aims to improve speech quality and intelligibility in\nnoisy environments. Recent advancements have concentrated on deep neural\nnetworks, particularly employing the Two-Stage (TS) architecture to enhance\nfeature extraction. However, the complexity and size of these models remain\nsignificant, which limits their applicability in resource-constrained\nscenarios. Designing models suitable for edge devices presents its own set of\nchallenges. Narrow lightweight models often encounter performance bottlenecks\ndue to uneven loss landscapes. Additionally, advanced operators such as\nTransformers or Mamba may lack the practical adaptability and efficiency that\nconvolutional neural networks (CNNs) offer in real-world deployments. To\naddress these challenges, we propose Dense-TSNet, an innovative\nultra-lightweight speech enhancement network. Our approach employs a novel\nDense Two-Stage (Dense-TS) architecture, which, compared to the classic\nTwo-Stage architecture, ensures more robust refinement of the objective\nfunction in the later training stages. This leads to improved final\nperformance, addressing the early convergence limitations of the baseline\nmodel. We also introduce the Multi-View Gaze Block (MVGB), which enhances\nfeature extraction by incorporating global, channel, and local perspectives\nthrough convolutional neural networks (CNNs). Furthermore, we discuss how the\nchoice of loss function impacts perceptual quality. Dense-TSNet demonstrates\npromising performance with a compact model size of around 14K parameters,\nmaking it particularly well-suited for deployment in resource-constrained\nenvironments.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"96 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement\",\"authors\":\"Zizhen Lin, Yuanle Li, Junyu Wang, Ruili Li\",\"doi\":\"arxiv-2409.11725\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Speech enhancement aims to improve speech quality and intelligibility in\\nnoisy environments. Recent advancements have concentrated on deep neural\\nnetworks, particularly employing the Two-Stage (TS) architecture to enhance\\nfeature extraction. However, the complexity and size of these models remain\\nsignificant, which limits their applicability in resource-constrained\\nscenarios. Designing models suitable for edge devices presents its own set of\\nchallenges. Narrow lightweight models often encounter performance bottlenecks\\ndue to uneven loss landscapes. Additionally, advanced operators such as\\nTransformers or Mamba may lack the practical adaptability and efficiency that\\nconvolutional neural networks (CNNs) offer in real-world deployments. To\\naddress these challenges, we propose Dense-TSNet, an innovative\\nultra-lightweight speech enhancement network. Our approach employs a novel\\nDense Two-Stage (Dense-TS) architecture, which, compared to the classic\\nTwo-Stage architecture, ensures more robust refinement of the objective\\nfunction in the later training stages. 
This leads to improved final\\nperformance, addressing the early convergence limitations of the baseline\\nmodel. We also introduce the Multi-View Gaze Block (MVGB), which enhances\\nfeature extraction by incorporating global, channel, and local perspectives\\nthrough convolutional neural networks (CNNs). Furthermore, we discuss how the\\nchoice of loss function impacts perceptual quality. Dense-TSNet demonstrates\\npromising performance with a compact model size of around 14K parameters,\\nmaking it particularly well-suited for deployment in resource-constrained\\nenvironments.\",\"PeriodicalId\":501284,\"journal\":{\"name\":\"arXiv - EE - Audio and Speech Processing\",\"volume\":\"96 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - EE - Audio and Speech Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.11725\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11725","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Dense-TSNet: Dense Connected Two-Stage Structure for Ultra-Lightweight Speech Enhancement
Speech enhancement aims to improve speech quality and intelligibility in
noisy environments. Recent advancements have concentrated on deep neural
networks, particularly employing the Two-Stage (TS) architecture to enhance
feature extraction. However, the complexity and size of these models remain
significant, which limits their applicability in resource-constrained
scenarios. Designing models suitable for edge devices presents its own set of
challenges. Narrow lightweight models often encounter performance bottlenecks
due to uneven loss landscapes. Additionally, advanced operators such as
Transformers or Mamba may lack the practical adaptability and efficiency that
convolutional neural networks (CNNs) offer in real-world deployments. To
address these challenges, we propose Dense-TSNet, an innovative
ultra-lightweight speech enhancement network. Our approach employs a novel
Dense Two-Stage (Dense-TS) architecture, which, compared to the classic
Two-Stage architecture, ensures more robust refinement of the objective
function in the later training stages. This leads to improved final
performance, addressing the early convergence limitations of the baseline
model.
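To make the dense connectivity concrete, here is a minimal PyTorch sketch of the Dense-TS idea: every two-stage block receives the channel-wise concatenation of the input and all earlier block outputs, so late blocks retain direct access to early features while refining the objective. The block internals, channel counts, and depth below are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of a densely connected two-stage stack (illustrative only).
# TwoStageBlock internals and dimensions are assumptions for exposition.
import torch
import torch.nn as nn

class TwoStageBlock(nn.Module):
    """Placeholder two-stage unit: a time-axis conv followed by a
    frequency-axis conv over (batch, channels, time, freq) features."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.time_stage = nn.Conv2d(in_ch, out_ch, kernel_size=(3, 1), padding=(1, 0))
        self.freq_stage = nn.Conv2d(out_ch, out_ch, kernel_size=(1, 3), padding=(0, 1))
        self.act = nn.PReLU()

    def forward(self, x):
        return self.act(self.freq_stage(self.act(self.time_stage(x))))

class DenseTS(nn.Module):
    """DenseNet-style stack: block i consumes the concatenation of the
    input and the outputs of blocks 0..i-1."""
    def __init__(self, ch=16, num_blocks=4):
        super().__init__()
        self.blocks = nn.ModuleList(
            TwoStageBlock(ch * (i + 1), ch) for i in range(num_blocks)
        )

    def forward(self, x):
        feats = [x]
        for block in self.blocks:
            feats.append(block(torch.cat(feats, dim=1)))
        return feats[-1]

# Usage: a (batch, channels, time, freq) spectrogram-like feature map.
# out = DenseTS(ch=16)(torch.randn(1, 16, 100, 64))
```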
We also introduce the Multi-View Gaze Block (MVGB), which enhances feature
extraction by incorporating global, channel, and local perspectives through
convolutional neural networks (CNNs).
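The abstract does not detail the MVGB internals. The sketch below shows one plausible way to combine global, channel, and local views using only convolutions; the branch designs and the residual fusion rule are stated assumptions, not the published block.

```python
# Hedged sketch of a multi-view block: global, channel, and local branches
# fused with a residual connection. Illustrative, not the paper's MVGB.
import torch
import torch.nn as nn

class MultiViewBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        # Global view: squeeze the time-freq map to 1x1 and derive a gate.
        self.global_view = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(ch, ch, 1), nn.Sigmoid()
        )
        # Channel view: 1x1 conv mixes information across channels only.
        self.channel_view = nn.Conv2d(ch, ch, kernel_size=1)
        # Local view: depthwise 3x3 conv captures nearby time-freq context.
        self.local_view = nn.Conv2d(ch, ch, kernel_size=3, padding=1, groups=ch)
        self.fuse = nn.Conv2d(ch, ch, kernel_size=1)
        self.act = nn.PReLU()

    def forward(self, x):
        g = x * self.global_view(x)   # globally informed gating
        c = self.channel_view(x)      # cross-channel mixing
        loc = self.local_view(x)      # local spatial context
        return self.act(self.fuse(g + c + loc)) + x  # residual fusion
```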
Furthermore, we discuss how the choice of loss function impacts perceptual
quality. Dense-TSNet demonstrates promising performance with a compact model
size of around 14K parameters, making it particularly well-suited for
deployment in resource-constrained environments.
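As one concrete example of a loss choice that influences perceptual quality, the sketch below implements a power-law compressed spectral loss, a common option in speech enhancement training. The compression exponent and the magnitude/complex weighting are illustrative assumptions; the abstract does not specify the paper's actual objective.

```python
# Hedged sketch: power-law compressed spectral loss. The exponent p and
# the weight alpha are illustrative, not the paper's training objective.
import torch

def compressed_spectral_loss(est, ref, n_fft=512, hop=128, p=0.3, alpha=0.7):
    """L1 distance between power-law compressed STFTs of the estimated and
    reference waveforms, mixing a magnitude-only term with a complex term."""
    window = torch.hann_window(n_fft, device=est.device)
    E = torch.stft(est, n_fft, hop, window=window, return_complex=True)
    R = torch.stft(ref, n_fft, hop, window=window, return_complex=True)
    mag_e, mag_r = E.abs() ** p, R.abs() ** p
    mag_loss = (mag_e - mag_r).abs().mean()
    # Compress magnitudes but keep the original phase for the complex term.
    cplx_e = torch.polar(mag_e, E.angle())
    cplx_r = torch.polar(mag_r, R.angle())
    cplx_loss = (cplx_e - cplx_r).abs().mean()
    return alpha * mag_loss + (1 - alpha) * cplx_loss
```

Compressing magnitudes before comparison de-emphasizes high-energy regions, which tends to correlate better with perceptual metrics than a plain waveform or linear-magnitude loss.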