Neuropsychiatric disorders, such as Alzheimer's disease (AD), depression, and autism spectrum disorder (ASD), are characterized by linguistic and acoustic abnormalities, offering potential biomarkers for early detection. Despite the promise of multi-modal approaches, challenges like multi-lingual generalization and the absence of a unified evaluation framework persist. To address these gaps, we propose FEND (Foundation model-based Evaluation of Neuropsychiatric Disorders), a comprehensive multi-modal framework integrating speech and text modalities for detecting AD, depression, and ASD across the lifespan. Leveraging 13 multi-lingual datasets spanning English, Chinese, Greek, French, and Dutch, we systematically evaluate multi-modal fusion performance. Our results show that multi-modal fusion excels in AD and depression detection but underperforms in ASD due to dataset heterogeneity. We also identify modality imbalance as a prevalent issue, where multi-modal fusion fails to surpass the best mono-modal models. Cross-corpus experiments reveal robust performance in task- and language-consistent scenarios but noticeable degradation in multi-lingual and task-heterogeneous settings. By providing extensive benchmarks and a detailed analysis of performance-influencing factors, FEND advances the field of automated, lifespan-inclusive, and multi-lingual neuropsychiatric disorder assessment. We encourage researchers to adopt the FEND framework for fair comparisons and reproducible research.
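The abstract does not detail FEND's fusion strategy, so as a minimal illustration of the speech-text fusion idea it evaluates, the sketch below late-fuses per-modality embeddings by z-normalising and concatenating them (all names, dimensions, and data here are hypothetical, not FEND's actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical utterance-level embeddings from speech and text foundation
# models (dimensions are illustrative only).
n, d_speech, d_text = 8, 4, 3
speech_emb = rng.normal(size=(n, d_speech))
text_emb = rng.normal(size=(n, d_text))

def late_fuse(speech, text):
    """Concatenate per-modality embeddings after z-normalisation,
    so neither modality dominates the fused representation."""
    zs = (speech - speech.mean(axis=0)) / (speech.std(axis=0) + 1e-8)
    zt = (text - text.mean(axis=0)) / (text.std(axis=0) + 1e-8)
    return np.concatenate([zs, zt], axis=1)

fused = late_fuse(speech_emb, text_emb)
print(fused.shape)  # (8, 7)
```

A downstream classifier trained on `fused` can then be compared against classifiers trained on either modality alone, which is the comparison behind the modality-imbalance finding above.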
"Foundation Model-Based Evaluation of Neuropsychiatric Disorders: A Lifespan-Inclusive, Multi-Modal, and Multi-Lingual Study," by Zhongren Dong, Haotian Guo, Weixiang Xu, Huan Zhao, and Zixing Zhang. IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 796–809. Published 2025-10-24. DOI: 10.1109/JSTSP.2025.3622051.
Pub Date: 2025-10-17. DOI: 10.1109/JSTSP.2025.3620716
Kara M. Smith;James R. Williamson;Thomas F. Quatieri
Background: Speech biomarkers have been used to assess motor dysfunction in people with Parkinson’s disease (PD), but speech biomarkers for mild cognitive impairment in PD (PD-MCI) have not been well studied. Objective: To identify speech acoustic features associated with PD-MCI and evaluate the performance of a model that discriminates PD-MCI from participants with normal cognitive status (PD-NC). Methods: We analyzed speech samples from 42 participants with PD, diagnosed as either PD-MCI or PD-NC using the Movement Disorder Society Task Force Tier II criteria as a gold-standard classification of MCI. A reading passage and a picture description task were analyzed for acoustic features, which were used to generate individual Gaussian mixture models (GMMs) and then a final fused model to discriminate PD-MCI from PD-NC participants. Results: The picture description task yielded a larger number of acoustic features highly associated with PD-MCI status than the reading task. Fusing the model outputs from the picture description task resulted in an AUC of 0.82 for discriminating PD-MCI from PD-NC participants. The acoustic features associated with PD-MCI stemmed from multiple speech subsystems. Conclusion: PD-MCI has a distinct speech acoustic signature that may be harnessed to develop better tools to detect and monitor this complication.
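The AUC of 0.82 above comes from fusing the outputs of per-feature-set models. The snippet below shows the generic pattern in miniature: averaging two subsystems' scores and computing a rank-based AUC. The scores and labels are made up for illustration; the paper's actual models are GMMs trained on acoustic features.

```python
import numpy as np

def auc(scores, labels):
    """Rank-based AUC: probability that a positive outranks a negative
    (ties count half)."""
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    wins = (pos[:, None] > neg[None, :]).sum() \
        + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

# Hypothetical per-subsystem scores (e.g. from two acoustic feature sets).
labels = np.array([1, 1, 1, 0, 0, 0])
s1 = np.array([0.9, 0.7, 0.4, 0.6, 0.3, 0.2])
s2 = np.array([0.8, 0.6, 0.7, 0.5, 0.4, 0.1])

fused = 0.5 * (s1 + s2)  # simple score-level fusion
print(round(auc(fused, labels), 3))
```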
"Speech Acoustic Markers Can Detect Mild Cognitive Impairment in Parkinson’s Disease," IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 731–740. DOI: 10.1109/JSTSP.2025.3620716. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11206405
Pub Date: 2025-10-15. DOI: 10.1109/JSTSP.2025.3622049
Jinchao Li;Yuejiao Wang;Junan Li;Jiawen Kang;Bo Zheng;Ka Ho Wong;Brian Kan-Wing Mak;Helene H. Fung;Jean Woo;Man-Wai Mak;Timothy Kwok;Vincent Mok;Xianmin Gong;Xixin Wu;Xunying Liu;Patrick C. M. Wong;Helen Meng
Early detection of neurocognitive disorders (NCDs) is crucial for timely intervention and disease management. Given that language impairments manifest early in NCD progression, visual-stimulated narrative (VSN)-based analysis offers a promising avenue for NCD detection. Current VSN-based NCD detection methods primarily focus on linguistic microstructures (e.g., lexical diversity) that are closely tied to bottom-up, stimulus-driven cognitive processes. While these features illuminate basic language abilities, the higher-order linguistic macrostructures (e.g., topic development) that may reflect top-down, concept-driven cognitive abilities remain underexplored. These macrostructural patterns are crucial for NCD detection, yet challenging to quantify due to their abstract and complex nature. To bridge this gap, we propose two novel macrostructural approaches: (1) a Dynamic Topic Model (DTM) to track topic evolution over time, and (2) a Text-Image Temporal Alignment Network (TITAN) to measure cross-modal consistency between narrative and visual stimuli. Experimental results show the effectiveness of the proposed approaches in NCD detection, with TITAN achieving superior performance across three corpora: ADReSS (F1 = 0.8889), ADReSSo (F1 = 0.8504), and CU-MARVEL-RABBIT (F1 = 0.7238). Feature contribution analysis reveals that macrostructural features (e.g., topic variability, topic change rate, and topic consistency) constitute the most significant contributors to the model's decision pathways, outperforming the investigated microstructural features. These findings underscore the value of macrostructural analysis for understanding linguistic-cognitive interactions associated with NCDs.
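To make the macrostructural feature families concrete, the toy function below computes a topic change rate and a topic variability score over a per-sentence topic sequence. The feature names mirror those listed in the abstract, but the formulas are illustrative sketches, not the DTM/TITAN definitions from the paper.

```python
import numpy as np

def topic_macro_features(topic_seq):
    """Toy macrostructural features over a narrative's topic sequence
    (one topic ID per sentence)."""
    t = np.asarray(topic_seq)
    changes = np.count_nonzero(t[1:] != t[:-1])
    change_rate = changes / (len(t) - 1)        # how often the topic shifts
    variability = len(set(topic_seq)) / len(t)  # distinct topics per sentence
    return change_rate, variability

# Hypothetical sequence: 8 sentences, 3 distinct topics, 3 shifts.
rate, var = topic_macro_features([0, 0, 1, 1, 2, 0, 0, 0])
print(rate, var)
```

In a real system the topic IDs would come from a topic model over the transcript; low variability or erratic change rates could then feed a downstream NCD classifier.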
"Detecting Neurocognitive Disorders Through Analyses of Topic Evolution and Cross-Modal Consistency in Visual-Stimulated Narratives," IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 741–756. DOI: 10.1109/JSTSP.2025.3622049.
Pub Date: 2025-10-10. DOI: 10.1109/JSTSP.2025.3615540
Shijun Liang;Van Hoang Minh Nguyen;Jinghan Jia;Ismail R. Alkhouri;Sijia Liu;Saiprasad Ravishankar
As the popularity of deep learning (DL) in the field of magnetic resonance imaging (MRI) continues to rise, recent research has indicated that DL-based MRI reconstruction models might be excessively sensitive to minor input disturbances, including worst-case or random additive perturbations. This sensitivity often leads to unstable aliased images. This raises the question of how to devise DL techniques for MRI reconstruction that can be robust to these variations. To address this problem, we propose a novel image reconstruction framework, termed Smoothed Unrolling (SMUG), which advances a deep unrolling-based MRI reconstruction model using a randomized smoothing (RS)-based robust learning approach. RS, which improves the tolerance of a model against input noise, has been widely used in the design of adversarial defense approaches for image classification tasks. Yet, we find that the conventional design that applies RS to the entire DL-based MRI model is ineffective. In this paper, we show that SMUG and its variants address the above issue by customizing the RS process based on the unrolling architecture of DL-based MRI reconstruction models. We theoretically analyze the robustness of our method in the presence of perturbations. Compared to vanilla RS and other recent approaches, we show that SMUG improves the robustness of MRI reconstruction with respect to a diverse set of instability sources, including worst-case and random noise perturbations to input measurements, varying measurement sampling rates, and different numbers of unrolling steps.
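SMUG's contribution is to customise where smoothing is applied inside the unrolled architecture; the sketch below shows only the vanilla randomized-smoothing baseline the paper starts from, i.e. averaging the reconstructions of noise-perturbed copies of the measurements. The reconstructor here is an identity stand-in, not an unrolled MRI network.

```python
import numpy as np

rng = np.random.default_rng(1)

def recon(y):
    """Stand-in reconstructor (identity); in SMUG this would be the
    unrolled reconstruction network (or individual unrolled steps)."""
    return y

def smoothed_recon(y, sigma=0.1, n_samples=32):
    """Vanilla randomized smoothing: average reconstructions of
    Gaussian-perturbed copies of the measurements y."""
    outs = [recon(y + sigma * rng.normal(size=y.shape))
            for _ in range(n_samples)]
    return np.mean(outs, axis=0)

y = np.ones(16)          # toy "measurements"
x_hat = smoothed_recon(y)
print(x_hat.shape)       # (16,)
```

The averaging makes the output less sensitive to small input perturbations at the cost of extra forward passes; SMUG's per-step variant aims to keep that robustness without smoothing away reconstruction detail.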
"Robust MRI Reconstruction by Smoothed Unrolling (SMUG)," IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 7, pp. 1558–1573. DOI: 10.1109/JSTSP.2025.3615540.
Pub Date: 2025-10-03. DOI: 10.1109/JSTSP.2025.3617859
Bence Mark Halpern;Thomas B. Tienkamp;Teja Rebernik;Rob J.J.H. van Son;Sebastiaan A.H.J. de Visscher;Max J.H. Witjes;Defne Abur;Tomoki Toda
Reliably evaluating the severity of a speech pathology is crucial in healthcare. However, the current reliance on expert evaluations by speech-language pathologists presents several challenges: while their assessments are highly skilled, they are also subjective, time-consuming, and costly, which can limit the reproducibility of clinical studies and place a strain on healthcare resources. While automated methods exist, they have significant drawbacks. Reference-based approaches require transcriptions or healthy speech samples, restricting them to read speech and limiting their applicability. Existing reference-free methods are also flawed; supervised models often learn spurious shortcuts from data, while handcrafted features are often unreliable and restricted to specific speech tasks. This paper introduces XPPG-PCA (x-vector phonetic posteriorgram principal component analysis), a novel, unsupervised, reference-free method for speech severity evaluation. Using three Dutch oral cancer datasets, we demonstrate that XPPG-PCA performs comparably to, or exceeds, established reference-based methods. Our experiments confirm its robustness against data shortcuts and noise, showing its potential for real-world clinical use. Taken together, our results show that XPPG-PCA provides a robust, generalizable solution for the objective assessment of speech pathology, with the potential to significantly improve the efficiency and reliability of clinical evaluations across a range of disorders. An open-source implementation is available.
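The core mechanism named in the title, PCA over speech representations used as an unsupervised severity proxy, can be sketched in a few lines. Everything below is illustrative: random vectors stand in for the x-vector/PPG-derived embeddings, and without a clinical anchor the sign and scale of the resulting score are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical speaker embeddings: rows are speakers, columns embedding
# dimensions; a subgroup is shifted to mimic atypical speech.
X = rng.normal(size=(20, 6))
X[:5] += 3.0

def pca_severity(X):
    """Unsupervised severity proxy: project mean-centred embeddings onto
    the first principal component (computed via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[0]

scores = pca_severity(X)
print(scores.shape)  # one scalar severity proxy per speaker
```

Because no labels are used, the direction of increasing severity must still be fixed afterwards, e.g. by orienting the axis so that known healthy speakers sit at one end.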
"XPPG-PCA: Reference-Free Automatic Speech Severity Evaluation With Principal Components," IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 783–795. DOI: 10.1109/JSTSP.2025.3617859.
Traditionally, dysarthric speech intelligibility assessment systems have focused on speech as the primary input, utilizing methods such as extraction of relevant speech features, classification models, alignment of Automatic Speech Recognition (ASR) outputs, and comparisons between speech representations of dysarthric and healthy speakers. However, to achieve an automated intelligibility assessment that closely mirrors the auditory-perceptual evaluations conducted by clinicians, a model that captures both the acoustic characteristics of dysarthric speech and the linguistic structure related to word pronunciation is needed. Inspired by the practices of clinicians, this study introduces a novel text-guided dysarthric speech intelligibility assessment framework that leverages custom keyword spotting (DySIA-CKWS). The model evaluates intelligibility by detecting specific keywords and is extensively tested on the UA-Speech database for speaker-wise analysis and across word groups of varying complexity. To ensure robustness, the system’s performance is further validated on the TORGO database, demonstrating its adaptability in cross-database settings. Statistical analysis demonstrates strong alignment between predicted and subjective intelligibility scores, with a Pearson Correlation Coefficient (PCC) of 0.9588 and a Spearman’s Correlation Coefficient (SCC) of 0.9141, achieved using the proposed system on the UA-Speech database. The findings emphasize the importance of word selection and showcase the model’s effectiveness in diagnosing dysarthric speech, offering a significant advancement in intelligibility assessment methodologies.
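The PCC and SCC reported above are the standard metrics for validating predicted against subjective intelligibility scores. The sketch below computes both with plain NumPy (Spearman as Pearson on ranks, ignoring tie correction); the score values are invented for illustration, not the paper's results.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation of two equal-length sequences."""
    a = np.asarray(a, float) - np.mean(a)
    b = np.asarray(b, float) - np.mean(b)
    return (a @ b) / np.sqrt((a @ a) * (b @ b))

def spearman(a, b):
    """Spearman correlation = Pearson on ranks (no tie correction here)."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(a)), rank(np.asarray(b)))

# Hypothetical predicted vs. listener intelligibility scores (percent).
pred = [92, 75, 60, 41, 30]
subj = [95, 70, 65, 38, 25]
print(round(pearson(pred, subj), 3), round(spearman(pred, subj), 3))
```

Reporting both is informative: PCC measures linear agreement in the score values, while SCC only checks that speakers are ordered consistently.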
"Dysarthric Speech Intelligibility Assessment by Custom Keyword Spotting," by Anuprabha M, Krishna Gurugubelli, and Anil Kumar Vuppala. IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 757–766. Published 2025-09-01. DOI: 10.1109/JSTSP.2025.3604709.
Pub Date: 2025-08-01. DOI: 10.1109/JSTSP.2025.3590607
Lina Liu;Dongning Guo
Massive multiple-input multiple-output (MIMO) systems are vital for achieving high spectral efficiencies at mid-band and millimeter wave frequencies. Conventional hybrid MIMO architectures, which use fewer digital chains than antennas, offer a balance between performance, cost, and energy consumption but often prolong channel estimation. This paper proposes a novel architecture that integrates a set of full-dimension digital chains with one-bit analog-to-digital converters (ADCs) to overcome these limitations and provide an alternative trade-off. By assigning one digital chain to each receive antenna, the proposed approach captures energy from all receive antennas and accelerates angle-of-arrival (AoA) estimation and beam computation. Likelihood-based AoA estimation methods are developed to optimize analog beamforming in narrowband and wideband channels, in both single-user and multiuser scenarios. Numerical results, including the equivalent signal-to-noise ratio per bit post-equalization, demonstrate that full-dimension one-bit digital chains significantly improve the efficiency of beamforming.
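The paper's point that one-bit full-dimension observations still carry usable angle information can be demonstrated with a toy uniform linear array. Note the estimator below is a simple matched-filter grid search over sign-quantized I/Q samples, a stand-in for the likelihood-based AoA estimators the paper actually develops; all parameters are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
M = 32                 # ULA elements, half-wavelength spacing
theta_true = 20.0      # degrees

def steering(theta_deg, M):
    """Array manifold vector for a half-wavelength-spaced ULA."""
    k = np.pi * np.sin(np.deg2rad(theta_deg))
    return np.exp(1j * k * np.arange(M))

# One received snapshot, then one-bit ADCs on the I and Q branches.
x = steering(theta_true, M) + 0.1 * (rng.normal(size=M) + 1j * rng.normal(size=M))
x_1bit = np.sign(x.real) + 1j * np.sign(x.imag)

# Coarse AoA estimate: correlate the one-bit snapshot against the array
# manifold on an angle grid.
grid = np.arange(-90.0, 90.5, 0.5)
power = [abs(np.vdot(steering(t, M), x_1bit)) for t in grid]
theta_hat = grid[int(np.argmax(power))]
print(theta_hat)
```

Even with all amplitude information destroyed by the one-bit ADCs, the sign pattern across the full aperture preserves the spatial phase progression, which is what the proposed architecture exploits to accelerate beam computation.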
"Accelerating Multiuser Beamforming With Full-Dimension One-Bit Chains," IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 6, pp. 1203–1217. DOI: 10.1109/JSTSP.2025.3590607.
Pub Date: 2025-07-21. DOI: 10.1109/JSTSP.2025.3591062
Shakeel A. Sheikh;Md. Sahidullah;Ina Kodrasi
Advancements in spoken language technologies for neurodegenerative speech disorders are crucial for meeting both clinical and technological needs. This overview paper is vital for advancing the field, as it presents a comprehensive review of state-of-the-art methods in pathological speech detection, automatic speech recognition, pathological speech intelligibility enhancement, intelligibility and severity assessment, and data augmentation approaches for pathological speech. It also highlights key challenges, such as ensuring robustness, privacy, and interpretability. The paper concludes by exploring promising future directions, including the adoption of multimodal approaches and the integration of large language models to further advance speech technologies for neurodegenerative speech disorders.
"Overview of Automatic Speech Analysis and Technologies for Neurodegenerative Disorders: Diagnosis and Assistive Applications," IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 700–716. DOI: 10.1109/JSTSP.2025.3591062.
Pub Date: 2025-07-16. DOI: 10.1109/JSTSP.2025.3589745
Yiming Fang;Li Chen;Yunfei Chen;Weidong Wang;Changsheng You
Mixed-precision quantization offers superior performance to fixed-precision quantization. It has been widely used in signal processing, communication systems, and machine learning. In mixed-precision quantization, bit allocation is essential. Hence, in this paper, we propose a new bit allocation framework for mixed-precision quantization from a search perspective. First, we formulate a general bit allocation problem for mixed-precision quantization. Then we introduce the penalized particle swarm optimization (PPSO) algorithm to address the integer consumption constraint. To improve efficiency and avoid iterations on infeasible solutions within the PPSO algorithm, a greedy criterion particle swarm optimization (GC-PSO) algorithm is proposed. The corresponding convergence analysis is derived based on dynamical system theory. Furthermore, we apply the above framework to some specific classic fields, i.e., finite impulse response (FIR) filters, receivers, and gradient descent. Numerical examples in each application underscore the superiority of the proposed framework to the existing algorithms.
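A minimal version of the penalized-PSO idea can be shown on a toy bit-allocation problem: minimise a weighted quantization-error sum subject to an integer bit budget, with the budget violation folded into the objective as a penalty (the PPSO mechanism described above). The problem weights, penalty constant, and PSO hyperparameters below are all illustrative, and this sketch omits the paper's greedy-criterion refinement.

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy problem: K signals with sensitivities w; error of b_k-bit
# quantization modelled as w_k * 2^(-2 b_k); total budget B bits.
w = np.array([8.0, 4.0, 2.0, 1.0])
B = 12
K = len(w)

def objective(b):
    return float(np.sum(w * 2.0 ** (-2 * b)))

def penalized(b, mu=10.0):
    """Objective plus a linear penalty on budget violation."""
    return objective(b) + mu * max(0, int(b.sum()) - B)

def pso(n_particles=40, iters=200):
    pos = rng.uniform(1, 8, size=(n_particles, K))   # continuous positions
    vel = np.zeros_like(pos)
    pbest = pos.copy()
    pbest_val = np.array([penalized(np.round(p).astype(int)) for p in pos])
    g = pbest[np.argmin(pbest_val)].copy()           # global best
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, K))
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (g - pos)
        pos = np.clip(pos + vel, 1, 8)
        vals = np.array([penalized(np.round(p).astype(int)) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        g = pbest[np.argmin(pbest_val)].copy()
    return np.round(g).astype(int)

b = pso()
print(b, int(b.sum()), round(objective(b), 5))
```

Rounding particle positions before evaluation handles the integer constraint crudely; the greedy criterion in GC-PSO exists precisely to avoid wasting iterations on infeasible allocations like this.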
{"title":"Mixed-Precision Quantization: Make the Best Use of Bits Where They Matter Most","authors":"Yiming Fang;Li Chen;Yunfei Chen;Weidong Wang;Changsheng You","doi":"10.1109/JSTSP.2025.3589745","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3589745","url":null,"abstract":"Mixed-precision quantization offers superior performance to fixed-precision quantization. It has been widely used in signal processing, communication systems, and machine learning. In mixed-precision quantization, bit allocation is essential. Hence, in this paper, we propose a new bit allocation framework for mixed-precision quantization from a search perspective. First, we formulate a general bit allocation problem for mixed-precision quantization. Then we introduce the penalized particle swarm optimization (PPSO) algorithm to address the integer consumption constraint. To improve efficiency and avoid iterations on infeasible solutions within the PPSO algorithm, a greedy criterion particle swarm optimization (GC-PSO) algorithm is proposed. The corresponding convergence analysis is derived based on dynamical system theory. Furthermore, we apply the above framework to some specific classic fields, i.e., finite impulse response (FIR) filters, receivers, and gradient descent. 
Numerical examples in each application underscore the superiority of the proposed framework to the existing algorithms.","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 6","pages":"1218-1233"},"PeriodicalIF":13.7,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145852508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
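The penalty-based search this entry describes can be illustrated with a minimal sketch: a particle swarm minimizes a quantization-distortion objective over integer bit vectors, with the bit-budget constraint handled by an additive penalty. This is only a toy illustration of the general PPSO idea, not the paper's implementation; the objective, penalty weight, and all hyperparameters below are placeholder assumptions.

```python
import numpy as np

def penalized_pso(objective, n_vars, bit_budget, n_particles=30, n_iters=200,
                  b_min=1, b_max=8, penalty=1e3, seed=0):
    """Minimize objective(b) over integer bit vectors b, with the budget
    constraint sum(b) <= bit_budget enforced by an additive penalty term."""
    rng = np.random.default_rng(seed)
    pos = rng.uniform(b_min, b_max, (n_particles, n_vars))  # continuous particles
    vel = np.zeros_like(pos)

    def penalized(p):
        b = np.round(p).astype(int)  # evaluate on the rounded (integer) allocation
        return objective(b) + penalty * max(0, b.sum() - bit_budget)

    pbest = pos.copy()
    pbest_val = np.array([penalized(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()

    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients
    for _ in range(n_iters):
        r1, r2 = rng.random((2, n_particles, n_vars))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, b_min, b_max)
        vals = np.array([penalized(p) for p in pos])
        improved = vals < pbest_val
        pbest[improved], pbest_val[improved] = pos[improved], vals[improved]
        gbest = pbest[pbest_val.argmin()].copy()

    return np.round(gbest).astype(int)

# Toy objective: per-coefficient distortion ~ sensitivity * 2^(-2b), so more
# sensitive coefficients should receive more bits under a fixed budget.
sens = np.array([4.0, 2.0, 1.0, 0.5])
alloc = penalized_pso(lambda b: float((sens * 2.0 ** (-2 * b)).sum()),
                      n_vars=4, bit_budget=16)
print(alloc, alloc.sum())
```

The penalty term makes infeasible allocations (budget overruns) strictly worse than any feasible one, so the swarm's best-known position settles on an integer allocation within the budget.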
Pub Date : 2025-06-30, DOI: 10.1109/JSTSP.2025.3584195
Resolving Domain Mismatches in Electrolaryngeal Speech Enhancement With Linguistic Intermediates
Lester Phillip Violeta;Wen-Chin Huang;Ding Ma;Ryuichi Yamamoto;Kazuhiro Kobayashi;Tomoki Toda
IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 827-839. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11059307
Abstract: We investigate the use of linguistic intermediates to resolve domain mismatches in the electrolaryngeal (EL) speech enhancement task. We first propose using linguistic encoders to produce bottleneck-feature intermediates within a recognition, alignment, and synthesis framework, which improves performance by removing the timbre mismatch between the pretraining (typical) and fine-tuning (EL) data. We then improve this further by introducing discrete text intermediates, which alleviate temporal mismatches between the source (EL) and target (typical) data and thereby improve prosody modeling. Our findings show that bottleneck-feature intermediates alone already yield more intelligible and natural-sounding synthesized speech, as shown by a significant 16% improvement in character error rate and a 0.83 improvement in naturalness score over the baseline. Moreover, using discrete phoneme-level intermediates further improves the modeling of the temporal structure of typical speech, giving an additional absolute improvement of 1.4% in character error rate and 0.2 in naturalness over the initially proposed system. Finally, we verify these findings on a larger pseudo-EL dataset of 14 speakers and another set of 3 real-world EL speakers, which consistently show that phoneme-level intermediates are the most effective approach in terms of phoneme error rate. We conclude by summarizing the advantages and disadvantages of each proposed technique.
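Character error rate, the primary intelligibility metric reported in this entry, is conventionally computed as the Levenshtein edit distance between hypothesis and reference transcripts, normalized by reference length. A minimal self-contained sketch of that standard metric (not the authors' evaluation code):

```python
def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance (insertions + deletions + substitutions) between
    hypothesis and reference characters, divided by the reference length."""
    ref, hyp = list(reference), list(hypothesis)
    # Row-by-row dynamic-programming edit distance over characters.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution (free on match)
        prev = curr
    return prev[-1] / max(len(ref), 1)

print(character_error_rate("kitten", "sitting"))  # 3 edits / 6 chars = 0.5
```

The same distance computed over phoneme sequences instead of characters gives the phoneme error rate used in the paper's final comparison.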