Module-Based End-to-End Distant Speech Processing: A case study of far-field automatic speech recognition [Special Issue On Model-Based and Data-Driven Audio Signal Processing]
Authors: Xuankai Chang; Shinji Watanabe; Marc Delcroix; Tsubasa Ochiai; Wangyou Zhang; Yanmin Qian
Journal: IEEE Signal Processing Magazine, vol. 41, no. 6, pp. 39-50 (Journal Article)
Publication date: 2025-01-01
DOI: 10.1109/MSP.2024.3486469
URL: https://ieeexplore.ieee.org/document/10819672/
Impact factor: 9.4; JCR: Q1 (ENGINEERING, ELECTRICAL & ELECTRONIC)
Open access: No
Citation count: 0
Abstract
Distant speech processing is a critical downstream application in speech and audio signal processing. Traditionally, researchers have addressed this challenge by breaking it down into distinct subproblems, encompassing the extraction of clean speech signals from noisy inputs, feature extraction, and transcription. This approach led to the development of modular distant automatic speech recognition (DASR) models, which are often designed as multiple stages in cascade, each corresponding to a specific subproblem. Recently, the surge in the capabilities of deep learning has propelled the popularity of purely end-to-end (E2E) models that employ a single large neural network to tackle an entire DASR task in an extremely data-driven manner. However, an alternative paradigm persists in the form of a modular model design, where we can often leverage speech and signal processing models. Although this approach mirrors the multistage model, it is trained through an E2E process. This article overviews the recent development of DASR systems, focusing on E2E module-based models and showcasing successful downstream applications of model-based and data-driven audio signal processing.
About the journal:
IEEE Signal Processing Magazine is a publication that focuses on signal processing research and applications. It publishes tutorial-style articles, columns, and forums that cover a wide range of topics related to signal processing. The magazine aims to provide the research, educational, and professional communities with the latest technical developments, issues, and events in the field. It serves as the main communication platform for the society, addressing important matters that concern all members.