Pub Date: 2025-10-17 | DOI: 10.1109/JSTSP.2025.3620716
Kara M. Smith;James R. Williamson;Thomas F. Quatieri
Background: Speech biomarkers have been used to assess motor dysfunction in people with Parkinson’s disease (PD), but speech biomarkers for mild cognitive impairment in PD (PD-MCI) have not been well studied. Objective: To identify speech acoustic features associated with PD-MCI and evaluate the performance of a model to discriminate PD-MCI from participants with normal cognitive status (PD-NC). Methods: We analyzed speech samples from 42 participants with PD, diagnosed as either PD-MCI or PD-NC using the Movement Disorder Society Task Force Tier II criteria as a gold-standard classification of MCI. A reading passage and a picture description task were analyzed for acoustic features, which were used to build individual Gaussian mixture model (GMM) classifiers and then a final fused model to discriminate PD-MCI from PD-NC participants. Results: The picture description task yielded a larger number of acoustic features highly associated with PD-MCI status than the reading task. Fusing the model outputs from the picture description task resulted in an AUC of 0.82 for discriminating PD-MCI from PD-NC participants. The acoustic features associated with PD-MCI stemmed from multiple speech subsystems. Conclusion: PD-MCI has a distinct speech acoustic signature that may be harnessed to develop better tools to detect and monitor this complication.
Title: "Speech Acoustic Markers Can Detect Mild Cognitive Impairment in Parkinson’s Disease". IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 731–740. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11206405
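To illustrate the classification pipeline this abstract describes, the sketch below fits one class-conditional Gaussian model per group (a single-component stand-in for the paper's GMMs), scores each participant by log-likelihood ratio, and summarizes discrimination with AUC via the Mann-Whitney statistic. The features are synthetic placeholders, not the paper's acoustic features, and the single-Gaussian simplification is an assumption for brevity.

```python
import numpy as np

def fit_gaussian(X):
    # Per-class maximum-likelihood mean and (lightly regularized) covariance.
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False) + 1e-6 * np.eye(X.shape[1])
    return mu, cov

def log_likelihood(X, mu, cov):
    d = X.shape[1]
    diff = X - mu
    inv = np.linalg.inv(cov)
    _, logdet = np.linalg.slogdet(cov)
    quad = np.einsum("ij,jk,ik->i", diff, inv, diff)
    return -0.5 * (quad + logdet + d * np.log(2 * np.pi))

def auc(pos, neg):
    # AUC as the Mann-Whitney statistic: P(positive score > negative score).
    scores = np.concatenate([pos, neg])
    ranks = scores.argsort().argsort() + 1.0
    n_p, n_n = len(pos), len(neg)
    return (ranks[:n_p].sum() - n_p * (n_p + 1) / 2) / (n_p * n_n)

rng = np.random.default_rng(0)
X_mci = rng.normal(1.0, 1.0, size=(40, 3))   # synthetic stand-in for PD-MCI features
X_nc = rng.normal(0.0, 1.0, size=(40, 3))    # synthetic stand-in for PD-NC features

mu1, c1 = fit_gaussian(X_mci)
mu0, c0 = fit_gaussian(X_nc)
llr = lambda X: log_likelihood(X, mu1, c1) - log_likelihood(X, mu0, c0)
print("AUC:", auc(llr(X_mci), llr(X_nc)))
```

Fusing several such models, as the paper does across feature subsets, typically means averaging or weighting their per-speaker scores before computing the final AUC.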
Pub Date: 2025-10-15 | DOI: 10.1109/JSTSP.2025.3622049
Jinchao Li;Yuejiao Wang;Junan Li;Jiawen Kang;Bo Zheng;Ka Ho Wong;Brian Kan-Wing Mak;Helene H. Fung;Jean Woo;Man-Wai Mak;Timothy Kwok;Vincent Mok;Xianmin Gong;Xixin Wu;Xunying Liu;Patrick C. M. Wong;Helen Meng
Early detection of neurocognitive disorders (NCDs) is crucial for timely intervention and disease management. Given that language impairments manifest early in NCD progression, visual-stimulated narrative (VSN)-based analysis offers a promising avenue for NCD detection. Current VSN-based NCD detection methods primarily focus on linguistic microstructures (e.g., lexical diversity) that are closely tied to bottom-up, stimulus-driven cognitive processes. While these features illuminate basic language abilities, the higher-order linguistic macrostructures (e.g., topic development) that may reflect top-down, concept-driven cognitive abilities remain underexplored. These macrostructural patterns are crucial for NCD detection, yet challenging to quantify due to their abstract and complex nature. To bridge this gap, we propose two novel macrostructural approaches: (1) a Dynamic Topic Model (DTM) to track topic evolution over time, and (2) a Text-Image Temporal Alignment Network (TITAN) to measure cross-modal consistency between narrative and visual stimuli. Experimental results show the effectiveness of the proposed approaches in NCD detection, with TITAN achieving superior performance across three corpora: ADReSS (F1 = 0.8889), ADReSSo (F1 = 0.8504), and CU-MARVEL-RABBIT (F1 = 0.7238). Feature contribution analysis reveals that macrostructural features (e.g., topic variability, topic change rate, and topic consistency) constitute the most significant contributors to the model's decision pathways, outperforming the investigated microstructural features. These findings underscore the value of macrostructural analysis for understanding linguistic-cognitive interactions associated with NCDs.
Title: "Detecting Neurocognitive Disorders Through Analyses of Topic Evolution and Cross-Modal Consistency in Visual-Stimulated Narratives". IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 741–756.
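The macrostructural quantities the feature analysis highlights (topic change rate, topic variability, topic consistency) can be made concrete with a toy sketch over a per-sentence topic-label sequence. These definitions are illustrative stand-ins chosen here for clarity, not the paper's DTM or TITAN formulations.

```python
import itertools
import numpy as np

def macrostructure_features(topic_seq):
    # Toy per-narrative descriptors of a sentence-level topic-label sequence.
    t = np.asarray(topic_seq)
    change_rate = np.count_nonzero(t[1:] != t[:-1]) / max(len(t) - 1, 1)
    _, counts = np.unique(t, return_counts=True)
    p = counts / counts.sum()
    variability = float(-np.sum(p * np.log2(p)))          # entropy over topics
    longest_run = max(len(list(g)) for _, g in itertools.groupby(topic_seq))
    consistency = longest_run / len(t)                    # longest on-topic stretch
    return {"change_rate": change_rate, "variability": variability,
            "consistency": consistency}

coherent = [0, 0, 0, 1, 1, 1, 2, 2]    # orderly topic development
scattered = [0, 2, 1, 0, 2, 1, 0, 2]   # frequent, disorderly switching
print(macrostructure_features(coherent))
print(macrostructure_features(scattered))
```

A disorderly narrative scores a higher change rate and lower consistency than an orderly one, which is the kind of contrast such features feed to a downstream classifier.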
Pub Date: 2025-10-10 | DOI: 10.1109/JSTSP.2025.3615540
Shijun Liang;Van Hoang Minh Nguyen;Jinghan Jia;Ismail R. Alkhouri;Sijia Liu;Saiprasad Ravishankar
As the popularity of deep learning (DL) in the field of magnetic resonance imaging (MRI) continues to rise, recent research has indicated that DL-based MRI reconstruction models might be excessively sensitive to minor input disturbances, including worst-case or random additive perturbations. This sensitivity often leads to unstable aliased images. This raises the question of how to devise DL techniques for MRI reconstruction that can be robust to these variations. To address this problem, we propose a novel image reconstruction framework, termed Smoothed Unrolling (SMUG), which advances a deep unrolling-based MRI reconstruction model using a randomized smoothing (RS)-based robust learning approach. RS, which improves the tolerance of a model against input noise, has been widely used in the design of adversarial defense approaches for image classification tasks. Yet, we find that the conventional design that applies RS to the entire DL-based MRI model is ineffective. In this paper, we show that SMUG and its variants address the above issue by customizing the RS process based on the unrolling architecture of DL-based MRI reconstruction models. We theoretically analyze the robustness of our method in the presence of perturbations. Compared to vanilla RS and other recent approaches, we show that SMUG improves the robustness of MRI reconstruction with respect to a diverse set of instability sources, including worst-case and random noise perturbations to input measurements, varying measurement sampling rates, and different numbers of unrolling steps.
Title: "Robust MRI Reconstruction by Smoothed Unrolling (SMUG)". IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 7, pp. 1558–1573.
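The core randomized-smoothing idea underlying this work is to replace an operator's output with its average output over Gaussian-perturbed copies of the input. The toy below uses a hard sign nonlinearity as a stand-in for an unstable reconstruction operator (it is not SMUG's unrolled network, which applies smoothing inside the unrolling), and shows the smoothed version drifting far less under a small input perturbation.

```python
import numpy as np

def smooth(f, x, sigma=0.3, n=256, seed=0):
    # Randomized-smoothing estimate: average f over Gaussian perturbations of x.
    rng = np.random.default_rng(seed)
    return np.mean([f(x + sigma * rng.standard_normal(x.shape))
                    for _ in range(n)], axis=0)

f = np.sign                               # toy unstable "reconstruction" operator
x = np.linspace(-1.0, 1.0, 5)
eps = 0.05                                # a small input perturbation

drift_plain = np.abs(f(x + eps) - f(x)).max()
# Fixing the seed reuses the same noise draws in both calls (common random
# numbers), so the comparison isolates the effect of the perturbation.
drift_smooth = np.abs(smooth(f, x + eps) - smooth(f, x)).max()
print(drift_plain, drift_smooth)
```

The unsmoothed sign flips by a full unit at the discontinuity, while the smoothed output changes only in proportion to the perturbation, which is the stability property the paper exploits.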
Pub Date: 2025-10-03 | DOI: 10.1109/JSTSP.2025.3617859
Bence Mark Halpern;Thomas B. Tienkamp;Teja Rebernik;Rob J.J.H. van Son;Sebastiaan A.H.J. de Visscher;Max J.H. Witjes;Defne Abur;Tomoki Toda
Reliably evaluating the severity of a speech pathology is crucial in healthcare. However, the current reliance on expert evaluations by speech-language pathologists presents several challenges: while their assessments are highly skilled, they are also subjective, time-consuming, and costly, which can limit the reproducibility of clinical studies and place a strain on healthcare resources. While automated methods exist, they have significant drawbacks. Reference-based approaches require transcriptions or healthy speech samples, restricting them to read speech and limiting their applicability. Existing reference-free methods are also flawed; supervised models often learn spurious shortcuts from data, while handcrafted features are often unreliable and restricted to specific speech tasks. This paper introduces XPPG-PCA (x-vector phonetic posteriorgram principal component analysis), a novel, unsupervised, reference-free method for speech severity evaluation. Using three Dutch oral cancer datasets, we demonstrate that XPPG-PCA performs comparably to, or exceeds, established reference-based methods. Our experiments confirm its robustness against data shortcuts and noise, showing its potential for real-world clinical use. Taken together, our results show that XPPG-PCA provides a robust, generalizable solution for the objective assessment of speech pathology, with the potential to significantly improve the efficiency and reliability of clinical evaluations across a range of disorders. An open-source implementation is available.
Title: "XPPG-PCA: Reference-Free Automatic Speech Severity Evaluation With Principal Components". IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 783–795.
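The PCA step of a reference-free severity score can be sketched in a few lines: pool unlabeled utterance-level features, extract the top principal component, and use the projection as the severity axis. The synthetic features below are placeholders; the paper builds its features from x-vectors and phonetic posteriorgrams, which are not reproduced here.

```python
import numpy as np

def pca_severity(features, n_components=1):
    # Unsupervised severity scores: project each utterance's feature vector
    # onto the top principal component(s) of the pooled feature matrix.
    X = features - features.mean(axis=0)
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    scores = X @ Vt[:n_components].T
    return scores.ravel() if n_components == 1 else scores

rng = np.random.default_rng(1)
# Synthetic features in which one hidden direction encodes increasing severity.
severity_true = np.linspace(0.0, 1.0, 50)
X = np.outer(severity_true, [2.0, -1.0, 0.5]) + 0.05 * rng.standard_normal((50, 3))

scores = pca_severity(X)
# The learned axis should correlate strongly with the hidden severity (up to sign).
r = np.corrcoef(scores, severity_true)[0, 1]
print(abs(r))
```

Because PCA is sign-ambiguous, a deployed score of this kind needs its polarity anchored, e.g. so that higher values mean more severe speech.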
Traditionally, dysarthric speech intelligibility assessment systems have focused on speech as the primary input, utilizing methods such as extraction of relevant speech features, classification models, alignment of Automatic Speech Recognition (ASR) outputs, and comparisons between speech representations of dysarthric and healthy speakers. However, to achieve an automated intelligibility assessment that closely mirrors the auditory-perceptual evaluations conducted by clinicians, a model that captures both the acoustic characteristics of dysarthric speech and the linguistic structure related to word pronunciation is needed. Inspired by the practices of clinicians, this study introduces a novel text-guided dysarthric speech intelligibility assessment framework that leverages custom keyword spotting (DySIA-CKWS). The model evaluates intelligibility by detecting specific keywords and is extensively tested on the UA-Speech database for speaker-wise analysis and across word groups of varying complexity. To ensure robustness, the system’s performance is further validated on the TORGO database, demonstrating its adaptability in cross-database settings. Statistical analysis demonstrates strong alignment between predicted and subjective intelligibility scores, with a Pearson Correlation Coefficient (PCC) of 0.9588 and a Spearman’s Correlation Coefficient (SCC) of 0.9141, achieved using the proposed system on the UA-Speech database.
The findings emphasize the importance of word selection and showcase the model’s effectiveness in diagnosing dysarthric speech, offering a significant advancement in intelligibility assessment methodologies.
Title: "Dysarthric Speech Intelligibility Assessment by Custom Keyword Spotting". Authors: Anuprabha M;Krishna Gurugubelli;Anil Kumar Vuppala. Pub Date: 2025-09-01 | DOI: 10.1109/JSTSP.2025.3604709. IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 757–766.
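The PCC and SCC figures reported above are standard statistics: Pearson correlation on the raw scores, and Spearman correlation as Pearson on rank-transformed scores. A self-contained sketch with hypothetical intelligibility numbers (the data below are not from UA-Speech):

```python
import numpy as np

def pearson(x, y):
    x, y = np.asarray(x, float), np.asarray(y, float)
    xc, yc = x - x.mean(), y - y.mean()
    return (xc @ yc) / np.sqrt((xc @ xc) * (yc @ yc))

def spearman(x, y):
    # Spearman = Pearson correlation of the rank-transformed data (no ties here).
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(np.asarray(x)), rank(np.asarray(y)))

subjective = [95, 80, 62, 40, 28, 15]   # hypothetical listener intelligibility (%)
predicted = [92, 84, 58, 45, 22, 18]    # hypothetical keyword-spotting scores
print(pearson(subjective, predicted), spearman(subjective, predicted))
```

Spearman only cares about the ordering of speakers, which is why it is often reported alongside Pearson for intelligibility scales that may be nonlinear.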
Pub Date: 2025-08-01 | DOI: 10.1109/JSTSP.2025.3590607
Lina Liu;Dongning Guo
Massive multiple-input multiple-output (MIMO) systems are vital for achieving high spectral efficiencies at mid-band and millimeter wave frequencies. Conventional hybrid MIMO architectures, which use fewer digital chains than antennas, offer a balance between performance, cost, and energy consumption but often prolong channel estimation. This paper proposes a novel architecture that integrates a set of full-dimension digital chains with one-bit analog-to-digital converters (ADCs) to overcome these limitations and provide an alternative trade-off. By assigning one digital chain to each receive antenna, the proposed approach captures energy from all receive antennas and accelerates angle-of-arrival (AoA) estimation and beam computation. Likelihood-based AoA estimation methods are developed to optimize analog beamforming in narrowband and wideband channels, in both single-user and multiuser scenarios. Numerical results, including the equivalent signal-to-noise ratio per bit post-equalization, demonstrate that full-dimension one-bit digital chains significantly improve the efficiency of beamforming.
Title: "Accelerating Multiuser Beamforming With Full-Dimension One-Bit Chains". IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 6, pp. 1203–1217.
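A minimal numerical sketch of the one-bit receive chain idea: quantize every antenna's I/Q samples to their signs, then search a grid of steering vectors for the angle of arrival. This uses a simple beamforming-power criterion rather than the paper's likelihood-based estimators, and all array parameters below are illustrative assumptions.

```python
import numpy as np

def steering(theta, n_ant, d=0.5):
    # Array response of a half-wavelength-spaced uniform linear array.
    return np.exp(2j * np.pi * d * np.arange(n_ant) * np.sin(theta))

def one_bit(x):
    # One-bit ADC per I/Q rail: only the signs of real and imaginary parts survive.
    return np.sign(x.real) + 1j * np.sign(x.imag)

rng = np.random.default_rng(2)
n_ant, snapshots, true_deg = 32, 200, 20.0
a = steering(np.deg2rad(true_deg), n_ant)

s = rng.standard_normal(snapshots) + 1j * rng.standard_normal(snapshots)
noise = 0.5 * (rng.standard_normal((snapshots, n_ant))
               + 1j * rng.standard_normal((snapshots, n_ant)))
Y = np.outer(s, a) + noise            # unquantized snapshots (not observed)
Yq = one_bit(Y)                       # what the one-bit receiver actually sees

# Coarse grid search: pick the angle whose steering vector collects the most
# beamformed power from the quantized snapshots.
grid = np.deg2rad(np.linspace(-90.0, 90.0, 721))
A = np.stack([steering(t, n_ant) for t in grid])      # (721, n_ant)
power = np.abs(Yq.conj() @ A.T).sum(axis=0)
est_deg = np.rad2deg(grid[np.argmax(power)])
print(est_deg)
```

Even though each sample keeps only two bits of information (one per rail), the angular structure across the full antenna array survives quantization, which is what makes full-dimension one-bit chains attractive for fast AoA estimation.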
Pub Date: 2025-07-21 | DOI: 10.1109/JSTSP.2025.3591062
Shakeel A. Sheikh;Md. Sahidullah;Ina Kodrasi
Advancements in spoken language technologies for neurodegenerative speech disorders are crucial for meeting both clinical and technological needs. This overview paper is vital for advancing the field, as it presents a comprehensive review of state-of-the-art methods in pathological speech detection, automatic speech recognition, pathological speech intelligibility enhancement, intelligibility and severity assessment, and data augmentation approaches for pathological speech. It also highlights key challenges, such as ensuring robustness, privacy, and interpretability. The paper concludes by exploring promising future directions, including the adoption of multimodal approaches and the integration of large language models to further advance speech technologies for neurodegenerative speech disorders.
Title: "Overview of Automatic Speech Analysis and Technologies for Neurodegenerative Disorders: Diagnosis and Assistive Applications". IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 700–716.
Pub Date: 2025-07-16 | DOI: 10.1109/JSTSP.2025.3589745
Yiming Fang;Li Chen;Yunfei Chen;Weidong Wang;Changsheng You
Mixed-precision quantization offers superior performance to fixed-precision quantization. It has been widely used in signal processing, communication systems, and machine learning. In mixed-precision quantization, bit allocation is essential. Hence, in this paper, we propose a new bit allocation framework for mixed-precision quantization from a search perspective. First, we formulate a general bit allocation problem for mixed-precision quantization. Then we introduce the penalized particle swarm optimization (PPSO) algorithm to address the integer consumption constraint. To improve efficiency and avoid iterations on infeasible solutions within the PPSO algorithm, a greedy criterion particle swarm optimization (GC-PSO) algorithm is proposed. The corresponding convergence analysis is derived based on dynamical system theory. Furthermore, we apply the above framework to several classical applications, namely finite impulse response (FIR) filters, receivers, and gradient descent. Numerical examples in each application underscore the superiority of the proposed framework to the existing algorithms.
Title: "Mixed-Precision Quantization: Make the Best Use of Bits Where They Matter Most". IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 6, pp. 1218–1233.
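The penalized-search idea can be sketched with a toy PSO that allocates integer bit widths under a total-bit budget, adding a penalty term whenever a candidate exceeds the budget. This is a simplified stand-in for the paper's PPSO/GC-PSO; the distortion proxy (high-rate model, halving the error energy per two bits) and all constants are assumptions made for the example.

```python
import numpy as np

def penalized_pso(weights, budget, n_particles=40, iters=200, penalty=10.0, seed=0):
    # Toy penalized PSO: minimize sum_i w_i * 2^(-2*b_i) s.t. sum_i b_i <= budget.
    rng = np.random.default_rng(seed)
    n = len(weights)

    def cost(P):
        b = np.round(np.clip(P, 0, None))                   # integer, nonnegative bits
        distortion = (weights * 2.0 ** (-2 * b)).sum(axis=1)
        overrun = np.clip(b.sum(axis=1) - budget, 0, None)  # bits over the budget
        return distortion + penalty * overrun               # penalty enforces feasibility

    pos = rng.uniform(0, 2 * budget / n, (n_particles, n))
    vel = np.zeros_like(pos)
    pbest, pbest_cost = pos.copy(), cost(pos)
    g = pbest[pbest_cost.argmin()].copy()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (g - pos)
        pos = np.clip(pos + vel, 0, budget)
        c = cost(pos)
        better = c < pbest_cost
        pbest[better], pbest_cost[better] = pos[better], c[better]
        g = pbest[pbest_cost.argmin()].copy()
    return np.round(g).astype(int)

w = np.array([8.0, 4.0, 2.0, 1.0])   # per-coefficient error sensitivities (assumed)
b = penalized_pso(w, budget=12)
print(b, b.sum())
```

As expected from water-filling intuition, a good allocation gives more bits to the high-sensitivity coefficients and fewer to the low-sensitivity ones, and beats spending the budget equally across coefficients.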
Pub Date: 2025-06-30 | DOI: 10.1109/JSTSP.2025.3584195
Lester Phillip Violeta;Wen-Chin Huang;Ding Ma;Ryuichi Yamamoto;Kazuhiro Kobayashi;Tomoki Toda
We investigate the use of linguistic intermediates to resolve domain mismatches in the electrolaryngeal (EL) speech enhancement task. We first propose the use of linguistic encoders to produce bottleneck feature intermediates, and use a recognition, alignment, and synthesis framework, effectively improving performance due to the removal of the timbre mismatches between the pretraining (typical) and fine-tuning (EL) data. We then further improve this by introducing discrete text intermediates, which effectively alleviate temporal mismatches between the source (EL) and target (typical) data to improve prosody modeling. Our findings show that by simply using bottleneck feature intermediates, more intelligible and natural-sounding speech can already be synthesized, as shown by a significant 16% improvement in character error rate and 0.83 improvement in naturalness score compared to the baseline. Moreover, through the use of discrete phoneme-level intermediates, we can further improve the modeling of the temporal structure of typical speech and get another absolute improvement of 1.4% in character error rate and 0.2 in naturalness compared to the initially proposed system. Finally, we also verify these findings on a larger pseudo-EL dataset of 14 speakers and another set of 3 real-world EL speakers, which consistently show that using the phoneme-level intermediates is the most effective approach in terms of phoneme error rate. We conclude the research by summarizing the advantages and disadvantages of each proposed technique.
Title: "Resolving Domain Mismatches in Electrolaryngeal Speech Enhancement With Linguistic Intermediates". IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 5, pp. 827–839. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11059307
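The character error rate (CER) figures reported above follow the standard definition: Levenshtein edit distance (substitutions, insertions, deletions) between reference and hypothesis transcripts, normalized by the reference length. A minimal implementation, independent of any particular recognizer:

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming, one row at a time.
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            sub = prev[j - 1] + (ref[i - 1] != hyp[j - 1])
            cur[j] = min(sub, prev[j] + 1, cur[j - 1] + 1)
        prev = cur
    return prev[n]

def cer(ref, hyp):
    # Character error rate: edits normalized by reference length.
    return edit_distance(ref, hyp) / len(ref)

print(cer("enhancement", "enhancment"))   # one missing character
```

Note that CER can exceed 1.0 when the hypothesis contains many insertions, so "a 16% improvement" refers to an absolute drop in this ratio expressed as a percentage.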
Pub Date: 2025-06-27 | DOI: 10.1109/JSTSP.2025.3570399
Title: "IEEE Signal Processing Society Publication Information". IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 4, pp. C2–C2. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11054319