Pub Date : 2025-03-09 DOI: 10.1109/JSTSP.2025.3568585
Jianhui Lv;Wadii Boulila;Shalli Rani;Huamao Jiang
Communication in healthcare settings is sometimes degraded by ambient noise, which can cause essential information to be misunderstood. We introduce the healthcare audio-visual deep fusion (HAV-DF) model, a method that improves speech comprehension in clinical environments by merging acoustic and visual data. The HAV-DF model makes three key advances. First, it uses a medical video interface that collects nuanced visual signals pertinent to medical communication. Second, it employs a multimodal fusion method that adapts the integration of auditory and visual data to the noise conditions. Third, it employs a loss function that integrates healthcare-specific indicators to improve speech optimization for medical applications. Experimental findings on the MedDialog and MedVidQA datasets illustrate the efficacy of the proposed model under diverse noise conditions. At low SNR (−5 dB), HAV-DF attains a PESQ score of 2.45, a 25% improvement over leading approaches. The model achieves a medical term preservation rate of 93.18% under difficult acoustic conditions, markedly surpassing current methods. These improvements provide more dependable communication across many clinical contexts, from emergency departments to telemedicine consultations.
{"title":"Enhanced Multimodal Speech Processing for Healthcare Applications: A Deep Fusion Approach","authors":"Jianhui Lv;Wadii Boulila;Shalli Rani;Huamao Jiang","doi":"10.1109/JSTSP.2025.3568585","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3568585","url":null,"PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 4","pages":"600-612"},"PeriodicalIF":8.7,"publicationDate":"2025-03-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144501943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Engineering Technology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
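The noise-adaptive fusion mentioned in the HAV-DF abstract can be illustrated with a toy gate. This is only a sketch under stated assumptions, not the HAV-DF fusion module, whose design the abstract does not describe: it assumes an SNR estimate is available and uses a logistic gate to shift weight from audio to visual features as noise increases. The names `audio_weight`, `fuse`, `midpoint_db`, and `slope` are hypothetical.

```python
import math


def audio_weight(snr_db, midpoint_db=0.0, slope=0.5):
    """Hypothetical logistic gate: near 1 for clean audio, near 0 in heavy noise."""
    return 1.0 / (1.0 + math.exp(-slope * (snr_db - midpoint_db)))


def fuse(audio_feat, visual_feat, snr_db):
    """Convex combination of equal-length audio and visual feature vectors."""
    if len(audio_feat) != len(visual_feat):
        raise ValueError("feature vectors must have the same length")
    w = audio_weight(snr_db)  # assumed gate, not taken from the paper
    return [w * a + (1.0 - w) * v for a, v in zip(audio_feat, visual_feat)]


# Usage: the fused vector leans on the visual stream at -5 dB
# and on the audio stream at 20 dB.
for snr in (-5.0, 0.0, 20.0):
    print(snr, fuse([1.0, 1.0], [0.0, 0.0], snr))
```

A learned gate would normally replace the fixed logistic curve; the fixed form is used here only to make the dependence on noise level visible.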
Pub Date : 2025-03-07 DOI: 10.1109/JSTSP.2025.3567838
Vahid Ahmadi Kalkhorani;Cheng Yu;Anurag Kumar;Ke Tan;Buye Xu;DeLiang Wang
Adding visual cues to audio-based speech separation can improve separation performance. This paper introduces AV-CrossNet, an audiovisual (AV) system for speech enhancement, target speaker extraction, and multi-talker speaker separation. AV-CrossNet extends TF-CrossNet, a recently proposed network that performs complex spectral mapping for speech separation by leveraging global attention and positional encoding. To use visual cues effectively, the proposed system incorporates pre-extracted visual embeddings and employs a visual encoder comprising temporal convolutional layers. Audio and visual features are fused in an early fusion layer before being fed to the AV-CrossNet blocks. We evaluate AV-CrossNet on multiple datasets, including LRS, VoxCeleb, TCD-TIMIT, and the COG-MHEAR challenge, in terms of PESQ, STOI, SNR, and SDR. Evaluation results demonstrate that AV-CrossNet advances the state-of-the-art performance in all audiovisual tasks, even on untrained and mismatched datasets.
{"title":"AV-CrossNet: An Audiovisual Complex Spectral Mapping Network for Speech Separation by Leveraging Narrow- and Cross-Band Modeling","authors":"Vahid Ahmadi Kalkhorani;Cheng Yu;Anurag Kumar;Ke Tan;Buye Xu;DeLiang Wang","doi":"10.1109/JSTSP.2025.3567838","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3567838","url":null,"PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 4","pages":"685-694"},"PeriodicalIF":8.7,"publicationDate":"2025-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144501031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Engineering Technology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
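The early fusion step named in the AV-CrossNet abstract can be sketched in a few lines. This is a minimal illustration, not the AV-CrossNet fusion layer: it assumes that visual embeddings arrive at a lower frame rate than the audio features, that they are aligned by simple repetition, and that fusion is a per-frame concatenation. The function names are hypothetical.

```python
def align_visual_to_audio(visual_frames, num_audio_frames):
    """Repeat visual embeddings so there is one per audio frame (floor index mapping)."""
    if not visual_frames:
        raise ValueError("need at least one visual frame")
    n = len(visual_frames)
    return [visual_frames[t * n // num_audio_frames] for t in range(num_audio_frames)]


def early_fuse(audio_frames, visual_frames):
    """Concatenate each audio feature vector with its aligned visual embedding."""
    aligned = align_visual_to_audio(visual_frames, len(audio_frames))
    return [list(a) + list(v) for a, v in zip(audio_frames, aligned)]


# Usage: 4 audio frames (dimension 2) fused with 2 visual frames (dimension 1)
audio = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
visual = [[0.1], [0.2]]
print(early_fuse(audio, visual))
```

Each fused frame has dimension 3 here (2 audio plus 1 visual); a real system would typically follow the concatenation with a learned projection.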
Pub Date : 2025-03-02 DOI: 10.1109/JSTSP.2025.3562641
{"title":"IEEE Signal Processing Society Publication Information","authors":"","doi":"10.1109/JSTSP.2025.3562641","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3562641","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 2","pages":"C2-C2"},"PeriodicalIF":8.7,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10982379","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143900552","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Engineering Technology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-03-02 DOI: 10.1109/JSTSP.2025.3542164
Tsung-Hui Chang;Eduard A. Jorswieck;Erik G. Larsson;Xiao Li;A. Lee Swindlehurst
{"title":"Guest Editorial Distributed Signal Processing for Extremely Large-Scale Antenna Array Systems","authors":"Tsung-Hui Chang;Eduard A. Jorswieck;Erik G. Larsson;Xiao Li;A. Lee Swindlehurst","doi":"10.1109/JSTSP.2025.3542164","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3542164","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 2","pages":"298-303"},"PeriodicalIF":8.7,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10982380","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143900472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Engineering Technology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-03-02 DOI: 10.1109/JSTSP.2025.3562646
{"title":"IEEE Signal Processing Society Information","authors":"","doi":"10.1109/JSTSP.2025.3562646","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3562646","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 2","pages":"C3-C3"},"PeriodicalIF":8.7,"publicationDate":"2025-03-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10982378","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143900540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Engineering Technology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-02-28 DOI: 10.1109/JSTSP.2025.3546083
Hansung Choi;Daewon Seo
The concept of a minimax classifier is well established in statistical decision theory, but implementing it with neural networks remains challenging, particularly when the training data are imbalanced and the minority classes have only a few samples. To address this issue, we propose a novel minimax learning algorithm designed to minimize the risk of the worst-performing classes. Our algorithm iterates through two steps: a minimization step that trains the model based on a selected target prior, and a maximization step that updates the target prior towards the adversarial prior for the trained model. In the minimization step, we introduce a targeted logit-adjustment loss function that efficiently identifies optimal decision boundaries under the target prior. Moreover, based on a new prior-dependent generalization bound that we derive, we theoretically prove that our loss function generalizes better than existing loss functions. In the maximization step, we refine the target prior by shifting it towards the adversarial prior, depending on the worst-performing classes rather than on per-class risk estimates. This maximization method is particularly robust when the number of samples is small. Additionally, to adapt to overparameterized neural networks, we partition the training dataset into two subsets: one for model training in the minimization step and the other for updating the target prior in the maximization step. The proposed algorithm has a provable convergence property, and empirical results indicate that it performs better than or comparably to existing methods.
{"title":"Deep Minimax Classifiers for Imbalanced Datasets With a Small Number of Minority Samples","authors":"Hansung Choi;Daewon Seo","doi":"10.1109/JSTSP.2025.3546083","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3546083","url":null,"PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 3","pages":"491-506"},"PeriodicalIF":8.7,"publicationDate":"2025-02-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144073337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Engineering Technology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
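The two alternating steps in the minimax abstract can be sketched with simple stand-ins. The code below is an illustration under assumptions, not the authors' algorithm. The minimization piece uses the standard logit-adjustment idea (adding the log ratio of training prior to target prior to the logits before cross-entropy) rather than the paper's "targeted" loss, whose exact form the abstract does not give. The maximization piece moves the target prior a fixed step towards the classes with the highest error, using only which classes are worst and not the error magnitudes. The function names and the `step` parameter are hypothetical.

```python
import math


def logit_adjusted_loss(logits, label, train_prior, target_prior):
    """Cross-entropy on logits shifted by log(train_prior / target_prior).

    With this shift, the raw logits are pushed towards the posterior under
    the target prior instead of the training prior (standard logit adjustment).
    """
    adjusted = [z + math.log(p_tr / p_tg)
                for z, p_tr, p_tg in zip(logits, train_prior, target_prior)]
    m = max(adjusted)  # subtract the max for a numerically stable log-sum-exp
    log_norm = m + math.log(sum(math.exp(a - m) for a in adjusted))
    return log_norm - adjusted[label]


def shift_prior_to_worst(prior, class_errors, step=0.1):
    """Move the prior a fixed step towards a uniform mass on the worst classes."""
    worst = max(class_errors)
    worst_idx = {k for k, e in enumerate(class_errors) if e == worst}
    bonus = step / len(worst_idx)
    return [(1.0 - step) * p + (bonus if k in worst_idx else 0.0)
            for k, p in enumerate(prior)]


# Usage: a 90/10 imbalanced problem with a uniform target prior
print(logit_adjusted_loss([0.0, 0.0], 1, [0.9, 0.1], [0.5, 0.5]))
print(shift_prior_to_worst([0.5, 0.5], [0.1, 0.4], step=0.2))
```

In a full training loop, the loss would be minimized on one data split and the per-class errors for the prior update would be measured on the other, mirroring the partition described in the abstract.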
Pub Date : 2025-02-27 DOI: 10.1109/JSTSP.2025.3539494
{"title":"IEEE Signal Processing Society Information","authors":"","doi":"10.1109/JSTSP.2025.3539494","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3539494","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 1","pages":"C3-C3"},"PeriodicalIF":8.7,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10906681","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143521551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Engineering Technology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-02-27 DOI: 10.1109/JSTSP.2025.3539490
{"title":"IEEE Signal Processing Society Publication Information","authors":"","doi":"10.1109/JSTSP.2025.3539490","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3539490","url":null,"abstract":"","PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 1","pages":"C2-C2"},"PeriodicalIF":8.7,"publicationDate":"2025-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10906684","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143512768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Engineering Technology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-02-20 DOI: 10.1109/JSTSP.2025.3544024
Basarbatu Can;Soner Ozgun Pelvan;Huseyin Ozkan
We consider the statistical anomaly detection problem with regard to false alarm rate (or false positive rate, FPR) controllability, nonlinear modeling, and computational efficiency for real-time processing. A decision-theoretic solution can be formulated as Neyman-Pearson (NP) hypothesis testing (binary classification: anomaly/nominal). In this framework, we propose an ensemble NP classifier (Tree OLNP) that is based on a binary partitioning tree. Tree OLNP generates an ensemble of sample space partitions. Each partition corresponds to an online piecewise linear (hence nonlinear) expert classifier formed as a union of online linear NP classifiers (OLNPs). While maintaining precise control over the FPR, Tree OLNP generates its overall prediction as a performance-driven and time-varying weighted combination of the experts. This provides dynamic nonlinear modeling power in the sense that simpler (more powerful) experts receive larger weights early (late) in the data stream, which manages the bias-variance trade-off and mitigates overfitting/underfitting issues. We mathematically prove that, for any stream, Tree OLNP asymptotically performs at least as well as the best expert in terms of NP performance, with a regret diminishing in the order $O(1/\sqrt{t})$ ($t$: data size). Our algorithm is computationally highly efficient since it is online, and its complexity scales linearly with both the data size and the tree depth and twice-logarithmically with the number of experts. We experimentally show that Tree OLNP strongly outperforms state-of-the-art alternative techniques.
{"title":"Online Neyman-Pearson Classification With Hierarchically Represented Models","authors":"Basarbatu Can;Soner Ozgun Pelvan;Huseyin Ozkan","doi":"10.1109/JSTSP.2025.3544024","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3544024","url":null,"PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 3","pages":"478-490"},"PeriodicalIF":8.7,"publicationDate":"2025-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144073107","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Engineering Technology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
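The "performance-driven and time-varying weighted combination of the experts" in the Tree OLNP abstract can be illustrated with a generic exponential-weights (Hedge) update. This is a swapped-in textbook mechanism shown under assumptions, not the Tree OLNP weighting itself, and it omits the tree structure and the FPR constraint entirely. The names `hedge_update`, `combine`, and the learning rate `eta` are hypothetical.

```python
import math


def hedge_update(weights, losses, eta=0.5):
    """Multiply each expert weight by exp(-eta * loss) and renormalize."""
    unnormalized = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def combine(weights, expert_scores):
    """Ensemble score as the weighted sum of the experts' scores."""
    return sum(w * s for w, s in zip(weights, expert_scores))


# Usage: expert 0 (simple) does better early, expert 1 (complex) does better later
weights = [0.5, 0.5]
loss_stream = [[0.0, 1.0]] * 3 + [[1.0, 0.0]] * 6
for losses in loss_stream:
    weights = hedge_update(weights, losses)
    print([round(w, 3) for w in weights])
```

The printed weights first favor the simple expert and then move to the complex one as its accumulated loss becomes smaller, which is the qualitative behavior the abstract describes.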
Pub Date : 2025-02-13 DOI: 10.1109/JSTSP.2025.3536289
Shuaifei Chen;Cheng-Xiang Wang;Junling Li;Chen Huang;Hengtai Chang;Yusong Huang;Jie Huang;Yunfei Chen
Awareness of the channel and its properties is critical for coherent transmission in massive multiple-input multiple-output (M-MIMO) systems because of the large channel dimension in the space domain. In cell-free (CF) systems, the channel dimension increases further because each user is served by multiple access points, which places a significant burden on signal processing. Angle domain transmission and channel maps promise to alleviate this burden: the former reduces channel dimensions in the angle domain, and the latter provide a priori channel information through channel measurements and modeling. In this paper, we propose a channel map-based angle domain multiple access scheme for uplink CF M-MIMO communications. First, we propose an angle domain data reception scheme consisting of receive combining and large-scale fading decoding to maximize spectral efficiency. Then, we derive an initial access criterion that uses the angle domain channel similarity between users, based on which we propose pilot assignment and access point selection schemes for better trade-offs between spectral and energy efficiency. Finally, we construct two channel map-based transmission mechanisms that exploit different levels of channel information, and we propose a tailored data reception scheme with a newly derived spectral efficiency upper bound for quantitative evaluation. Simulation results show that the proposed channel map-based angle domain schemes outperform both their space domain alternatives and schemes that do not use channel maps in terms of spectral and energy efficiency.
{"title":"Channel Map-Based Angle Domain Multiple Access for Cell-Free Massive MIMO Communications","authors":"Shuaifei Chen;Cheng-Xiang Wang;Junling Li;Chen Huang;Hengtai Chang;Yusong Huang;Jie Huang;Yunfei Chen","doi":"10.1109/JSTSP.2025.3536289","DOIUrl":"https://doi.org/10.1109/JSTSP.2025.3536289","url":null,"PeriodicalId":13038,"journal":{"name":"IEEE Journal of Selected Topics in Signal Processing","volume":"19 2","pages":"366-380"},"PeriodicalIF":8.7,"publicationDate":"2025-02-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143900473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"Engineering Technology","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
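The dimension reduction that the channel map abstract attributes to the angle domain can be seen in a small numerical sketch. The example assumes a uniform linear array with half-wavelength spacing and a single propagation path whose angle falls exactly on a DFT grid point; these assumptions, and the names `steering_vector`, `to_angle_domain`, and `dominant_bins`, are illustrative and not taken from the paper.

```python
import cmath
import math


def steering_vector(num_antennas, sin_theta):
    """Array response of a half-wavelength-spaced ULA for a path with the given sin(angle)."""
    return [cmath.exp(1j * math.pi * n * sin_theta) for n in range(num_antennas)]


def to_angle_domain(h):
    """Unitary DFT of the space domain channel vector."""
    n_ant = len(h)
    scale = 1.0 / math.sqrt(n_ant)
    return [scale * sum(h[n] * cmath.exp(-2j * math.pi * m * n / n_ant)
                        for n in range(n_ant))
            for m in range(n_ant)]


def dominant_bins(h_angle, energy_fraction=0.95):
    """Smallest set of angle bins (strongest first) holding the requested energy share."""
    energy = [abs(x) ** 2 for x in h_angle]
    target = energy_fraction * sum(energy)
    kept, acc = [], 0.0
    for m in sorted(range(len(energy)), key=lambda i: energy[i], reverse=True):
        kept.append(m)
        acc += energy[m]
        if acc >= target:
            break
    return sorted(kept)


# Usage: 8 antennas, one path with sin(theta) = 0.5
h_space = steering_vector(8, 0.5)
h_angle = to_angle_domain(h_space)
print(dominant_bins(h_angle))
```

All eight space domain entries have unit magnitude, while in the angle domain the energy sits in a single bin, so only that bin needs to be processed. Off-grid angles and multipath spread energy over several neighboring bins, so the reduction is less extreme in practice.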