Pub Date: 2025-06-27 | DOI: 10.1109/JSTSP.2025.3570397
Amir Hussain;Yu Tsao;John H.L. Hansen;Naomi Harte;Shinji Watanabe;Isabel Trancoso;Shixiong Zhang
{"title":"Guest Editorial: IEEE JSTSP Special Issue on Deep Multimodal Speech Enhancement and Separation (DEMSES)","authors":"Amir Hussain;Yu Tsao;John H.L. Hansen;Naomi Harte;Shinji Watanabe;Isabel Trancoso;Shixiong Zhang","doi":"10.1109/JSTSP.2025.3570397","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3570397","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 4","pages":"596-599"},"PeriodicalIF":8.7,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11054321","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144502913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-27 | DOI: 10.1109/JSTSP.2025.3570405
{"title":"IEEE Signal Processing Society Information","authors":"","doi":"10.1109/JSTSP.2025.3570405","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3570405","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 4","pages":"C3-C3"},"PeriodicalIF":8.7,"publicationDate":"2025-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11054323","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144502889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-06-19 | DOI: 10.1109/JSTSP.2025.3579644
Sajjad Nassirpour;Toan-Van Nguyen;Hien Quoc Ngo;Le-Nam Tran;Tharmalingam Ratnarajah;Duy H. N. Nguyen
We study the joint channel estimation and data detection (JED) problem in cell-free massive MIMO (CF-mMIMO) networks, where access points (APs) forward signals to a central processing unit (CPU) over fronthaul links. Due to bandwidth limitations of these links, especially with a growing number of users, efficient processing becomes challenging. To address this, we propose a variational Bayesian (VB) inference-based method for JED that accommodates low-resolution quantized signals from APs. We consider two approaches: quantization-and-estimation (Q-E) and estimation-and-quantization (E-Q). In Q-E, each AP directly quantizes its received signals before forwarding them to the CPU. In E-Q, each AP first estimates channels locally during the pilot phase, then sends quantized versions of both the local channel estimates and received data to the CPU. The final JED process in both Q-E and E-Q is performed at the CPU. We evaluate our proposed approach under perfect fronthaul links (PFL) with unquantized received signals, Q-E, and E-Q, using symbol error rate (SER), channel normalized mean squared error (NMSE), computational complexity, and fronthaul signaling overhead as performance metrics. Our methods are benchmarked against both linear and nonlinear state-of-the-art JED techniques. Numerical results demonstrate that our VB-based approaches consistently outperform the linear baseline by leveraging the nonlinear VB framework. They also surpass existing nonlinear methods due to: i) a fully VB-driven formulation, which performs better than hybrid schemes such as VB combined with expectation maximization; and ii) the stability of our approach under correlated channels, where competing methods may fail to converge or experience performance degradation.
{"title":"Variational Bayesian Channel Estimation and Data Detection for Cell-Free Massive MIMO With Low-Resolution Quantized Fronthaul Links","authors":"Sajjad Nassirpour;Toan-Van Nguyen;Hien Quoc Ngo;Le-Nam Tran;Tharmalingam Ratnarajah;Duy H. N. Nguyen","doi":"10.1109/JSTSP.2025.3579644","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3579644","url":null,"abstract":"We study the joint channel estimation and data detection (JED) problem in cell-free massive MIMO (CF-mMIMO) networks, where access points (APs) forward signals to a central processing unit (CPU) over fronthaul links. Due to bandwidth limitations of these links, especially with a growing number of users, efficient processing becomes challenging. To address this, we propose a variational Bayesian (VB) inference-based method for JED that accommodates low-resolution quantized signals from APs. We consider two approaches: <italic>quantization-and-estimation</i> (Q-E) and <italic>estimation-and-quantization</i> (E-Q). In Q-E, each AP directly quantizes its received signals before forwarding them to the CPU. In E-Q, each AP first estimates channels locally during the pilot phase, then sends quantized versions of both the local channel estimates and received data to the CPU. The final JED process in both Q-E and E-Q is performed at the CPU. We evaluate our proposed approach under perfect fronthaul links (PFL) with unquantized received signals, Q-E, and E-Q, using symbol error rate (SER), channel normalized mean squared error (NMSE), computational complexity, and fronthaul signaling overhead as performance metrics. Our methods are benchmarked against both linear and nonlinear state-of-the-art JED techniques. Numerical results demonstrate that our VB-based approaches consistently outperform the linear baseline by leveraging the nonlinear VB framework. They also surpass existing nonlinear methods due to: <italic>i)</i> a fully VB-driven formulation, which performs better than hybrid schemes such as VB combined with expectation maximization; and <italic>ii)</i> the stability of our approach under correlated channels, where competing methods may fail to converge or experience performance degradation.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 6","pages":"1187-1202"},"PeriodicalIF":13.7,"publicationDate":"2025-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145879960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speech corpora are collections of textual data derived from human verbal output and speech signals that can be processed from a variety of perspectives, including formal or semantic content, to serve analyses of different levels of linguistic organisation (phonemic, morphosyntactic, lexico-semantic and content information, prosody and intonation) and of important phenomena such as speech fluency and errors (non-fluencies). We focus on transcribing speech along with non-fluencies or dysfluencies, the detection of which plays an important role in the diagnosis of primary progressive aphasia; we specifically examine articulation-based dysfluencies in nfvPPA speech. In this work, we propose SSDM 2.0, which is built on top of the current state-of-the-art dysfluency detection system [1] and tackles its shortcomings via the following main contributions: (1) We propose a novel Neural Articulatory Flow for deriving highly scalable, dysfluency-aware speech representations. (2) We develop a full-stack connectionist subsequence aligner to capture all major dysfluency types. (3) We introduce a mispronunciation prompt pipeline and consistency learning into LLMs to enable in-context dysfluency learning. (4) We curate and open-source Libri-Co-Dys (Lian et al., 2024), the largest co-dysfluency corpus to date. (5) We also present SSDM-L, a modular, non-end-to-end, lightweight model designed for clinical deployment. In clinical experiments on pathological speech transcription, we tested SSDM 2.0 on an nfvPPA corpus primarily characterized by articulatory dysfluencies. Overall, SSDM 2.0 outperforms SSDM and all other dysfluency transcription models by a large margin.
{"title":"Automatic Detection of Articulatory-Based Disfluencies in Primary Progressive Aphasia","authors":"Jiachen Lian;Xuanru Zhou;Chenxu Guo;Zongli Ye;Zoe Ezzes;Jet M.J. Vonk;Brittany Morin;David Baquirin;Zachary Miller;Maria Luisa Gorno-Tempini;Gopala Krishna Anumanchipalli","doi":"10.1109/JSTSP.2025.3579972","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3579972","url":null,"abstract":"Speech corpora are collections of textual data derived from human verbal output and speech signals that can be processed from a variety of perspectives, including formal or semantic content, to serve analyses of different levels of linguistic organisation (phonemic, morphosyntactic, lexico-semantic and content information, prosody and intonation) and to serve analyses of important phenomena such as speech fluency and errors (non-fluencies). We focus on transcribing speech along with non-fluencies or dysfluencies, the detection of which plays an important role in the diagnosis of primary progressive aphasia, where we specifically examine articulation-based dysfluencies in nfvPPA speech. In this work, we propose SSDM 2.0, which is built on top of the current state-of-the-art system of dysfluency detection [1] and tackles its shortcomings via four main contributions: (1) We propose a novel <italic>Neural Articulatory Flow</i> for deriving highly scalable, dysfluency-aware speech representations. (2) We develop a <italic>full-stack connectionist subsequence aligner</i> to capture all major dysfluency types. (3) We introduce a mispronunciation prompt pipeline and consistency learning into LLMs to enable in-context dysfluency learning. (4) We curate and open-source <italic>Libri-Co-Dys</i> (Lian et al., 2024), the largest co-dysfluency corpus to date. (5) We also present <italic>SSDM-L</i>, a modular, non-end-to-end, lightweight model designed for clinical deployment. In clinical experiments on pathological speech transcription, we tested SSDM 2.0 using nfvPPA corpus primarily characterized by <italic>articulatory dysfluencies</i>. Overall, SSDM 2.0 outperforms SSDM and all other dysfluency transcription models by a large margin.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 5","pages":"810-826"},"PeriodicalIF":13.7,"publicationDate":"2025-06-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data augmentation methods have been employed to address the deficiencies in dysarthric speech datasets, achieving state-of-the-art (SOTA) results in the Dysarthric Speech Recognition (DSR) task. Current research on Dysarthric Speech Synthesis (DSS), however, fails to focus on the encoding of pathological features in dysarthric speech. Dysarthric speech is characterized by discontinuous pronunciation, uncontrolled volume, a slow speech rate, and excessive nasal sounds. Moreover, compared with typical speech, dysarthric speech contains more non-stationary components generated by explosive pronunciation, hoarseness, and air-flow noise during articulation. We propose a DSS model named the Long-range and Non-stationary Variational Autoencoder (LNVAE). The LNVAE estimates the acoustic parameters of dysarthric speech by encoding the long-range duration dependencies of phonemes in frame-level representations of dysarthric speech. Moreover, the LNVAE employs Gaussian noise perturbation within the latent variables to capture the non-stationary fluctuations in dysarthric speech. The experiments were conducted on speech synthesis and recognition tasks using the CDSD Chinese and UASpeech English corpora. The dysarthric speech synthesized by the LNVAE achieved the best performance across 29 and 28 objective metrics on the Chinese and English datasets, respectively. The synthesized speech also received the highest score from speech rehabilitation experts in the MOS experiments. The Whisper model fine-tuned on the synthesized data achieved the lowest CER on the Chinese CDSD dataset. Moreover, for the UASpeech dataset, we augmented the data by only 0.5 times to fine-tune the DSR model, yet surpassed the current SOTA method, which uses four times more augmentation data, by 4.52%.
{"title":"Long-Range and Non-Stationary Encoding for Dysarthric Speech Data Augmentation","authors":"Daipeng Zhang;Hongcheng Zhang;Wenhuan Lu;Wei Li;Jinghong Wang;Jianguo Wei","doi":"10.1109/JSTSP.2025.3562417","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3562417","url":null,"abstract":"Data augmentation methods have been employed to address the deficiencies in dysarthric speech datasets, achieving state-of-the-art (SOTA) results in the Dysarthric Speech Recognition (DSR) task. Current research on Dysarthric Speech Synthesis (DSS), however, fails to focus on the encoding of pathological features in dysarthric speech. The dysarthric speech is characterized by its discontinuous pronunciation, uncontrolled volume, slow speech, and excessive nasal sounds. Moreover, compared with typical speech, the dysarthric speech contains more non-stationary components generated by the explosive pronunciation, hoarseness, and air-flow noise during the pronunciation. We propose a DSS model named the Long-range and Non-stationary Variational Autoencoder (LNVAE). The LNVAE estimates the acoustic parameters of dysarthric speech by encoding the long-range dependency duration of phonemes in frame-level representations of dysarthric speech. Moreover, the LNVAE employs the Gaussian noise perturbation within the latent variables to capture the non-stationary fluctuations in dysarthric speech. The experiments were conducted on the speech synthesis and recognition tasks using the CDSD Chinese and UASpeech English corpora. The dysarthric speech synthesized by the LNVAE achieved the best performance across 29 and 28 objective metrics in the Chinese and English datasets, respectively. The synthesized speech also received the highest score from speech rehabilitation experts in the MOS experiments. The Whisper model fine-tuned on the synthesized data, achieved the lowest CER on the Chinese CDSD dataset. Moreover, for the UASpeech dataset, we increased the data by 0.5 times to fine-tune the DSR model, yet surpassed the current SOTA method, which uses four times more augmentation data, by 4.52<inline-formula> <tex-math>$%$</tex-math></inline-formula>.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 5","pages":"767-782"},"PeriodicalIF":13.7,"publicationDate":"2025-04-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145659215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-15 | DOI: 10.1109/JSTSP.2025.3560513
Wenxuan Wu;Xueyuan Chen;Shuai Wang;Jiadong Wang;Lingwei Meng;Xixin Wu;Helen Meng;Haizhou Li
Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by primarily relying on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech through contextual information. This limitation results in not only suboptimal performance but also inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called the Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model to assess extraction quality and guide extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of our proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.
{"title":"$C^{2}$AV-TSE: Context and Confidence-Aware Audio Visual Target Speaker Extraction","authors":"Wenxuan Wu;Xueyuan Chen;Shuai Wang;Jiadong Wang;Lingwei Meng;Xixin Wu;Helen Meng;Haizhou Li","doi":"10.1109/JSTSP.2025.3560513","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3560513","url":null,"abstract":"Audio-Visual Target Speaker Extraction (AV-TSE) aims to mimic the human ability to enhance auditory perception using visual cues. Although numerous models have been proposed recently, most of them estimate target signals by primarily relying on local dependencies within acoustic features, underutilizing the human-like capacity to infer unclear parts of speech through contextual information. This limitation results in not only suboptimal performance but also inconsistent extraction quality across the utterance, with some segments exhibiting poor quality or inadequate suppression of interfering speakers. To close this gap, we propose a model-agnostic strategy called the Mask-And-Recover (MAR). It integrates both inter- and intra-modality contextual correlations to enable global inference within extraction modules. Additionally, to better target challenging parts within each sample, we introduce a Fine-grained Confidence Score (FCS) model to assess extraction quality and guide extraction modules to emphasize improvement on low-quality segments. To validate the effectiveness of our proposed model-agnostic training paradigm, six popular AV-TSE backbones were adopted for evaluation on the VoxCeleb2 dataset, demonstrating consistent performance improvements across various metrics.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 4","pages":"646-657"},"PeriodicalIF":8.7,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144502887","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-10 | DOI: 10.1109/JSTSP.2025.3559763
Hang Chen;Chen-Yue Zhang;Qing Wang;Jun Du;Sabato Marco Siniscalchi;Shi-Fu Xiong;Gen-Shun Wan
To advance audio-visual speech enhancement (AVSE) research in low-quality video settings, we introduce the multimodal information-based speech processing-low quality video (MISP-LQV) benchmark, which includes a 120-hour real-world Mandarin audio-visual dataset, two video degradation simulation methods, and benchmark results from several well-known AVSE models. We also propose a novel hybrid pixel and contour network (HPCNet), incorporating a lip reconstruction and distillation (LRD) module and a contour graph convolution (CGConv) layer. Specifically, the LRD module reconstructs high-quality lip frames from low-quality audio-visual data, utilizing knowledge distillation from a teacher model trained on high-quality data. The CGConv layer employs spatio-temporal and semantic-contextual graphs to capture complex relationships among lip landmark points. Extensive experiments on the MISP-LQV benchmark reveal the performance degradation caused by low-quality video across various AVSE models. Notably, including real or simulated low-quality videos in AVSE training enhances robustness to low-quality videos but degrades performance on high-quality videos. The proposed HPCNet demonstrates strong robustness against video quality degradation, which can be attributed to (1) the reconstructed lip frames closely aligning with high-quality frames and (2) the contour features exhibiting consistency across different video quality levels. The generalizability of HPCNet has also been validated through experiments on the 2nd COG-MHEAR AVSE Challenge dataset.
{"title":"HPCNet: Hybrid Pixel and Contour Network for Audio-Visual Speech Enhancement With Low-Quality Video","authors":"Hang Chen;Chen-Yue Zhang;Qing Wang;Jun Du;Sabato Marco Siniscalchi;Shi-Fu Xiong;Gen-Shun Wan","doi":"10.1109/JSTSP.2025.3559763","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3559763","url":null,"abstract":"To advance audio-visual speech enhancement (AVSE) research in low-quality video settings, we introduce the multimodal information-based speech processing-low quality video (MISP-LQV) benchmark, which includes a 120-hour real-world Mandarin audio-visual dataset, two video degradation simulation methods, and benchmark results from several well-known AVSE models. We also propose a novel hybrid pixel and contour network (HPCNet), incorporating a lip reconstruction and distillation (LRD) module and a contour graph convolution (CGConv) layer. Specifically, the LRD module reconstructs high-quality lip frames from low-quality audio-visual data, utilizing knowledge distillation from a teacher model trained on high-quality data. The CGConv layer employs spatio-temporal and semantic-contextual graphs to capture complex relationships among lip landmark points. Extensive experiments on the MISP-LQV benchmark reveal the performance degradation caused by low-quality video across various AVSE models. Notably, including real/simulated low-quality videos in AVSE training enhances its robustness to low-quality videos but degrades the performance of high-quality videos.The proposed HPCNet demonstrates strong robustness against video quality degradation, which can be attributed to (1) the reconstructed lip frames closely aligning with high-quality frames and (2) the contour features exhibiting consistency across different video quality levels. The generalizability of HPCNet also has been validated through experiments on the 2nd COG-MHEAR AVSE Challenge dataset.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 4","pages":"671-684"},"PeriodicalIF":8.7,"publicationDate":"2025-04-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144502911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-07 | DOI: 10.1109/JSTSP.2025.3558653
Qingtian Xu;Jie Zhang;Zhenhua Ling
Brain-assisted speech enhancement (BASE), which utilizes electroencephalogram (EEG) signals as an assistive modality, has shown great potential for extracting the target speaker in multi-talker conditions. This is feasible because the EEG measurements contain the auditory attention of hearing-impaired listeners, which can be leveraged to classify the target identity. Considering that an EEG cap with sparse channels offers multiple benefits and that in practice many electrodes might contribute only marginally, EEG channel selection for BASE is desirable. This problem has been tackled in a subject-invariant manner in the literature, but the resulting BASE performance varies significantly across subjects. In this work, we therefore propose an input-independent subject-adaptive channel selection method for BASE, called subject-adaptive convolutional regularization selection (SA-ConvRS), which enables a personalized informative channel distribution. We also observe an abnormal over-memory phenomenon that allows the model to perform BASE without any brain signals; it often occurs in related fields due to data recording and validation conditions. To remove this effect, we further design a task-based multi-process adversarial training (TMAT) approach that exploits pseudo-EEG inputs. Experimental results on a public dataset show that the proposed SA-ConvRS achieves subject-adaptive channel selection and keeps the BASE performance close to the full-channel upper bound, and that TMAT avoids the over-memory problem without sacrificing the performance of SA-ConvRS.
{"title":"Input-Independent Subject-Adaptive Channel Selection for Brain-Assisted Speech Enhancement","authors":"Qingtian Xu;Jie Zhang;Zhenhua Ling","doi":"10.1109/JSTSP.2025.3558653","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3558653","url":null,"abstract":"Brain-assisted speech enhancement (BASE) that utilizes electroencephalogram (EEG) signals as an assistive modality has shown a great potential for extracting the target speaker in multi-talker conditions. This is feasible as the EEG measurements contain the auditory attention of hearing-impaired listeners that can be leveraged to classify the target identity. Considering that an EEG cap with sparse channels exhibits multiple benefits and in practice many electrodes might contribute marginally, the EEG channel selection for BASE is desired. This problem was tackled in a subject-invariant manner in literature, the resulting BASE performance varies significantly across subjects. In this work, we therefore propose an input-independent subject-adaptive channel selection method for BASE, called subject-adaptive convolutional regularization selection (SA-ConvRS), which enables a personalized informative channel distribution. We observe the abnormal <italic>over memory</i> phenomenon that facilitates the model to perform BASE without any brain signals, which often occurs in related fields due to the data recording and validation conditions. To remove this effect, we further design a task-based multi-process adversarial training (TMAT) approach by exploiting pseudo-EEG inputs. Experimental results on a public dataset show that the proposed SA-ConvRS can achieve subject-adaptive channel selections and keep the BASE performance close to the full-channel upper bound; the TMAT can avoid the over memory problem without sacrificing the performance of SA-ConvRS.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 4","pages":"658-670"},"PeriodicalIF":8.7,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144502947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in Synthetic Aperture Radar (SAR) sensors and innovative advanced imaging techniques have enabled SAR systems to acquire very high-resolution images with wide swaths, large bandwidths, and multiple polarization channels. These improvements in SAR system capabilities also imply a significant increase in SAR data acquisition rates, so efficient and effective compression methods become necessary. The compression of SAR raw data plays a crucial role in addressing the challenges posed by downlink and memory limitations onboard SAR satellites and directly affects the quality of the generated SAR image. Neural data compression techniques using deep models have attracted considerable interest for natural image compression tasks and have demonstrated promising results. In this study, neural data compression is extended into the complex domain to develop a Complex-Valued (CV) autoencoder-based data compression method for SAR raw data. To this end, the fundamentals of data compression and Rate-Distortion (RD) theory are reviewed; well-known data compression methods, namely Block Adaptive Quantization (BAQ) and JPEG2000, are implemented and tested for SAR raw data compression; and a neural data compression method based on CV autoencoders is developed for SAR raw data. Furthermore, since the available Sentinel-1 SAR raw products are already compressed with Flexible Dynamic BAQ (FDBAQ), an adaptation procedure is applied to the decoded SAR raw data to generate SAR raw data with quasi-uniform quantization that resembles the statistics of the uncompressed SAR raw data onboard the satellites.
{"title":"Complex-Valued Autoencoder-Based Neural Data Compression for SAR Raw Data","authors":"Reza Mohammadi Asiyabi;Mihai Datcu;Andrei Anghel;Adrian Focsa;Michele Martone;Paola Rizzoli;Ernesto Imbembo","doi":"10.1109/JSTSP.2025.3558651","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3558651","url":null,"abstract":"Recent advances in Synthetic Aperture Radar (SAR) sensors and innovative advanced imagery techniques have enabled SAR systems to acquire very high-resolution images with wide swaths, large bandwidth and in multiple polarization channels. The improvements of the SAR system capabilities also imply a significant increase in SAR data acquisition rates, such that efficient and effective compression methods become necessary. The compression of SAR raw data plays a crucial role in addressing the challenges posed by downlink and memory limitations onboard the SAR satellites and directly affects the quality of the generated SAR image. Neural data compression techniques using deep models have attracted many interests for natural image compression tasks and demonstrated promising results. In this study, neural data compression is extended into the complex domain to develop a Complex-Valued (CV) autoencoder-based data compression for SAR raw data. To this end, the basic fundamentals of data compression and Rate-Distortion (RD) theory are reviewed, well known data compression methods, Block Adaptive Quantization (BAQ) and JPEG2000 methods, are implemented and tested for SAR raw data compression, and a neural data compression based on CV autoencoders is developed for SAR raw data. Furthermore, since the available Sentinel-1 SAR raw products are already compressed with Flexible Dynamic BAQ (FDBAQ), an adaptation procedure applied to the decoded SAR raw data to generate SAR raw data with quasi-uniform quantization that resemble the statistics of the uncompressed SAR raw data onboard the satellites.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 3","pages":"572-582"},"PeriodicalIF":8.7,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10955162","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144073335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-04-07 | DOI: 10.1109/JSTSP.2025.3558654
Xinyuan Qian;Jiaran Gao;Yaodan Zhang;Qiquan Zhang;Hexin Liu;Leibny Paola Garcia Perera;Haizhou Li
Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on the examination of facial and lip movements, which can be compromised or entirely inaccessible in scenarios where occlusions occur or when the camera view is distant. Meanwhile, contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog barking, our brain has the innate ability to discern and filter out the barking noise. To this end, in this paper, we introduce a novel task, i.e., Scene-aware Audio-Visual Speech Enhancement (SAV-SE). To the best of our knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which ultimately improves the speech enhancement performance. Specifically, we propose the VC-S$^{2}$E method, which incorporates the Conformer and Mamba modules for their complementary strengths. Extensive experiments are conducted on the public MUSIC, AVSpeech and AudioSet datasets, where the results demonstrate the superiority of VC-S$^{2}$E over other competitive methods.
{"title":"SAV-SE: Scene-Aware Audio-Visual Speech Enhancement With Selective State Space Model","authors":"Xinyuan Qian;Jiaran Gao;Yaodan Zhang;Qiquan Zhang;Hexin Liu;Leibny Paola Garcia Perera;Haizhou Li","doi":"10.1109/JSTSP.2025.3558654","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3558654","url":null,"abstract":"Speech enhancement plays an essential role in various applications, and the integration of visual information has been demonstrated to bring substantial advantages. However, the majority of current research concentrates on the examination of facial and lip movements, which can be compromised or entirely inaccessible in scenarios where occlusions occur or when the camera view is distant. Whereas contextual visual cues from the surrounding environment have been overlooked: for example, when we see a dog bark, our brain has the innate ability to discern and filter out the barking noise. To this end, in this paper, we introduce a novel task, i.e. Scene-aware Audio-Visual Speech Enhancement (SAV-SE). To our best knowledge, this is the first proposal to use rich contextual information from synchronized video as auxiliary cues to indicate the type of noise, which eventually improves the speech enhancement performance. Specifically, we propose the VC-S <inline-formula><tex-math>$^{2}$</tex-math></inline-formula> E method, which incorporates the Conformer and Mamba modules for their complementary strengths. Extensive experiments are conducted on public MUSIC, AVSpeech and AudioSet datasets, where the results demonstrate the superiority of VC-S <inline-formula><tex-math>$^{2}$</tex-math></inline-formula> E over other competitive methods.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 4","pages":"623-634"},"PeriodicalIF":8.7,"publicationDate":"2025-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144502891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}