Pub Date : 2026-01-28  DOI: 10.1109/OJSP.2026.3657696
Jaeho Park;Yong-Yeon Jo;Jong-Hwan Jang;Jin Yu;Joon-myoung Kwon;Junho Song
Electrocardiograms (ECGs) remain widely archived as paper ECG charts. In the standard 12-lead paper ECG chart layout, each lead shows only a 2.5-second visible segment. Digitized charts are therefore incomplete, leaving most of the 10-second recording invisible and misaligned with the digital standard required by ECG-AI models. Previous work has attempted to recover these invisible segments but has shown markedly lower performance on them than on the visible segments. We propose the Visible Context Propagation (VCP) architecture, an extension of ECGrecover, which leverages the quasi-periodic structure of ECGs and employs cross-attention to propagate contextual information from visible to invisible segments. Our model consistently outperformed ECGrecover, the strongest baseline, reducing RMSE by 32.4% overall, including 12.0% on invisible segments. Beyond recovery accuracy, evaluations on ECG applications demonstrated that recovered ECGs achieved performance comparable to raw ECGs in both diagnostic classification and ECG feature measurement. These results highlight the effectiveness of explicitly modeling the propagation of visible-to-invisible context and establish VCP as a robust solution for recovering incomplete paper-based ECGs, enabling reliable surrogates for clinical and analytical use.
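For readers unfamiliar with the mechanism, the cross-attention referred to above has the standard scaled dot-product form shown below; treating invisible-segment features as queries and visible-segment features as keys and values is our reading of the abstract, not a detail taken from the paper.

% Background sketch only; the Q/K/V assignment to invisible/visible segments is an assumption.
\[
  \operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V ,
\]
% so each invisible-segment position aggregates context from the 2.5-second visible
% portion of the same recording.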
{"title":"VCP: Visible Context Propagation for Electrocardiogram Recovery","authors":"Jaeho Park;Yong-Yeon Jo;Jong-Hwan Jang;Jin Yu;Joon-myoung Kwon;Junho Song","doi":"10.1109/OJSP.2026.3657696","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3657696","url":null,"abstract":"Electrocardiograms (ECGs) remain widely archived as paper ECG charts. In the 12-lead paper ECG chart layout, each lead shows only 2.5-second visible segments. Therefore, digitized charts are incomplete, leaving most of the 10-second recording invisible and misaligned with the digital standard required by ECG-AI models. Previous work has attempted to recover these invisible segments but has shown markedly lower performance than visible segments. We propose the <bold>Visible Context Propagation (VCP)</b> architecture, an extension of ECGrecover, which leverages the quasi-periodic structure of ECGs and employs cross-attention to propagate contextual information from visible to invisible segments. Our model consistently outperformed ECGrecover, the strongest baseline, reducing RMSE by 32.4% overall, including 12.0% on invisible segments. Beyond recovery accuracy, evaluations on ECG applications demonstrated that recovered ECGs achieved performance comparable to raw ECGs in both diagnostic classification and ECG feature measurement. These results highlight the effectiveness of explicitly modeling the propagation of visible-to-invisible context and establish VCP as a robust solution for recovering incomplete paper-based ECGs, enabling reliable surrogates for clinical and analytical use.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"185-194"},"PeriodicalIF":2.7,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11363447","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-28  DOI: 10.1109/OJSP.2026.3659053
Manu Harju;Frederic Font;Annamaria Mesaros
Self-labeling is a method to simultaneously learn representations and classes using unlabeled data. The naive approach to self-labeling leads to a degenerate solution, and the model-generated labels require regularization to serve as useful training targets. In this work, we adapt a self-labeling method using optimal transport to the audio domain using the FSD50K dataset. We analyze the structure of the learned representations and compare the emergent classes with the reference annotations. We compare the learned representations with the ones produced using Bootstrap Your Own Latent for Audio (BYOL-A) across several downstream tasks. Our findings indicate that the method learns to group perceptually similar sounds without supervision. The results show that the method is a viable approach for audio representation learning, and that the learned embeddings are as effective for downstream tasks as the ones obtained with the benchmark method. As an additional outcome, the generated classifications give valuable insight into what the model learns, promoting explainability in feature learning.
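As background on the regularization mentioned above, optimal-transport self-labeling typically enforces approximately balanced class assignments with a Sinkhorn-style projection; a minimal sketch follows, in which the function name, the equipartition constraint, and the hyperparameter values are illustrative assumptions rather than details from the paper.

import numpy as np

def sinkhorn_pseudo_labels(log_probs, n_iters=50, epsilon=0.05):
    """Balanced pseudo-label assignment via Sinkhorn-Knopp iterations.

    log_probs: (N, K) model log-probabilities for N samples and K classes.
    Returns a soft assignment matrix of shape (N, K) whose class marginals are
    pushed toward uniform, preventing the degenerate all-one-class solution.
    """
    Q = np.exp(log_probs / epsilon).T          # (K, N) unnormalized transport plan
    Q /= Q.sum()
    n_classes, n_samples = Q.shape
    for _ in range(n_iters):
        Q /= Q.sum(axis=1, keepdims=True)      # rows (classes) now sum to 1 ...
        Q /= n_classes                         # ... then to 1/K
        Q /= Q.sum(axis=0, keepdims=True)      # columns (samples) now sum to 1 ...
        Q /= n_samples                         # ... then to 1/N
    return (Q * n_samples).T                   # (N, K), each row sums to 1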
{"title":"Self-Labeling Sounds Using Optimal Transport","authors":"Manu Harju;Frederic Font;Annamaria Mesaros","doi":"10.1109/OJSP.2026.3659053","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3659053","url":null,"abstract":"Self-labeling is a method to simultaneously learn representations and classes using unlabeled data. The naive approach to self-labeling leads to a degenerate solution, and the model-generated labels require regularization to serve as useful training targets. In this work, we adapt a self-labeling method using optimal transport to the audio domain using the FSD50K dataset. We analyze the structure of the learned representations and compare the emergent classes with the reference annotations. We compare the learned representations with the ones produced using Bootstrap Your Own Latent for Audio (BYOL-A) across several downstream tasks. Our findings indicate that the method learns to group perceptually similar sounds without supervision. The results show that the method is a viable approach for audio representation learning, and that the learned embeddings are as effective for downstream tasks as the ones obtained with the benchmark method. As an additional outcome, the generated classifications give valuable insight into what the model learns, promoting explainability in feature learning.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"116-124"},"PeriodicalIF":2.7,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11366927","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-26  DOI: 10.1109/OJSP.2026.3657698
Zbyněk Koldovský;Jiří Málek;Martin Vrátný;Tereza Vrbová;Jaroslav Čmejla;Stephen O'Regan
This article revises informed independent vector extraction (iIVE) as a framework for connecting model-based blind source extraction (BSE) with deep learning. We introduce the contrast function for iIVE, which is derived by extending IVE with beamforming-based constraints, enabling an interpretable use of reference signals. We also show that structured mixing models implementing physical knowledge can be integrated, which is demonstrated by two far-field models. With the contrast functions, rapidly converging second-order algorithms are developed, whose performance is first verified through simulations. In the experimental part, we refine iIVE by training models containing unrolled iterations of the developed algorithm. The resulting structures achieve performance comparable to state-of-the-art networks while requiring two orders of magnitude fewer trainable parameters and exhibiting strong generalization to unseen conditions.
{"title":"From Informed Independent Vector Extraction to Hybrid Architectures for Target Source Extraction","authors":"Zbyněk Koldovský;Jiří Málek;Martin Vrátný;Tereza Vrbová;Jaroslav Čmejla;Stephen O'Regan","doi":"10.1109/OJSP.2026.3657698","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3657698","url":null,"abstract":"This article revises informed independent vector extraction (iIVE) as a framework for connecting model-based blind source extraction (BSE) with deep learning. We introduce the contrast function for iIVE, which is derived by extending IVE with beamforming-based constraints, enabling an interpretable use of reference signals. We also show that structured mixing models implementing physical knowledge can be integrated, which is demonstrated by two far-field models. With the contrast functions, rapidly converging second-order algorithms are developed, whose performance is first verified through simulations. In the experimental part, we refine iIVE by training models containing unrolled iterations of the developed algorithm. The resulting structures achieve performance comparable to state-of-the-art networks while requiring two orders of magnitude fewer trainable parameters and exhibiting strong generalization to unseen conditions.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"195-212"},"PeriodicalIF":2.7,"publicationDate":"2026-01-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11363453","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-23  DOI: 10.1109/OJSP.2026.3657297
Luca Resti;Amelia Gully;Michael McLoughlin;Gavin Kearney;Alena Denisova
The objective quantification of Acoustic Scene Complexity (ASC) remains a significant challenge. While existing entropy-based metrics capture spectro-temporal variability, a metric accounting for the spatial distribution of sources has been lacking. We introduce Spherical Acoustic Spatial Entropy (SASE), a novel information-theoretic metric designed for Virtual Acoustic Environments (VAEs). SASE leverages ground-truth spatial data, utilizing an equal-area spherical partition around the listener and weighting source contributions by their perceptual loudness (ITU-R BS.1770-4). We validated SASE through a psychoacoustic experiment (N=21) using a $2 \times 2 \times 2$ factorial design that manipulated masker count, spatial distribution, and motion. SASE was evaluated alongside energy and spectral entropy metrics against subjective ratings of complexity, effort, and spatial spread. Results show that the SASE mean was the most robust predictor of perceived complexity in condition-level ratings ($R^{2}=0.714$, $p=0.008$), outperforming spectral and energy entropy. A random-effects pooling of participant regression coefficients confirmed this relationship at the population level ($R^{2}_{\text{pseudo}}=0.740$, $p<.001$). Furthermore, a model combining the SASE mean with the spectral entropy standard deviation explained 84.4% of the variance in perceived complexity, indicating that spatial and spectro-temporal metrics capture complementary scene dynamics. SASE provides an objective measure of spatial complexity, enhancing existing frameworks for predicting ASC in virtual environments.
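The abstract does not give the metric's definition; one plausible form of a loudness-weighted spatial entropy over an equal-area partition, stated purely as an illustration of the idea rather than the paper's formula, is:

% Illustrative form only (assumption): {S_1, ..., S_K} is an equal-area partition of the
% sphere around the listener, and L_k is the summed perceptual loudness (ITU-R BS.1770-4)
% of the sources whose directions fall in cell S_k.
\[
  p_k = \frac{L_k}{\sum_{j=1}^{K} L_j}, \qquad
  \mathrm{SASE} = -\sum_{k=1}^{K} p_k \log p_k ,
\]
% maximal when loudness is spread evenly over the sphere and zero when one cell holds all of it.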
{"title":"Spherical Acoustic Spatial Entropy: Predicting Acoustic Scene Complexity in Virtual Environments","authors":"Luca Resti;Amelia Gully;Michael McLoughlin;Gavin Kearney;Alena Denisova","doi":"10.1109/OJSP.2026.3657297","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3657297","url":null,"abstract":"The objective quantification of Acoustic Scene Complexity (ASC) remains a significant challenge. While existing entropy-based metrics capture spectro-temporal variability, a metric accounting for the spatial distribution of sources has been lacking. We introduce Spherical Acoustic Spatial Entropy (SASE), a novel information-theoretic metric designed for Virtual Acoustic Environments (VAEs). SASE leverages ground-truth spatial data, utilizing an equal-area spherical partition around the listener and weighting source contributions by their perceptual loudness (ITU-R BS.1770-4). We validated SASE through a psychoacoustic experiment (N=21) using a <inline-formula><tex-math>$2times 2 times 2$</tex-math></inline-formula> factorial design that manipulated masker count, spatial distribution, and motion. SASE was evaluated alongside energy and spectral entropy metrics against subjective ratings of complexity, effort, and spatial spread. Results show that SASE mean was the most robust predictor of perceived complexity in condition-level ratings (<inline-formula><tex-math>$R^{2}=0.714$</tex-math></inline-formula>, <inline-formula><tex-math>$p=0.008$</tex-math></inline-formula>), outperforming spectral and energy entropy. A random-effects pooling of participant regression coefficients confirmed this relationship at the population level (<inline-formula><tex-math>$R^{2}_{text{ {pseudo}}}=0.740$</tex-math></inline-formula>, <inline-formula><tex-math>$p< .001$</tex-math></inline-formula>). Furthermore, a model combining SASE mean with spectral entropy standard deviation explained 84.4% of the variance in perceived complexity, indicating spatial and spectro-temporal metrics capture complementary scene dynamics. SASE provides an objective measure of spatial complexity, enhancing existing frameworks for predicting ASC in virtual environments.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"144-153"},"PeriodicalIF":2.7,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11362972","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-23  DOI: 10.1109/OJSP.2026.3657284
Gokberk Yaylali;Dionysis Kalogerias
Optimal resource allocation in wireless systems remains a fundamental challenge due to the inherent adversities caused by channel fading. Modern wireless applications require efficient allocation schemes that maximize total network utility while ensuring robust and reliable system performance. Although optimal on average, ergodic-optimal policies, commonly realized via stochastic waterfilling schemes, are susceptible to the statistical dispersion of heavy-tailed or highly volatile fading channels, in terms of both instantaneous power policy fluctuations and frequent service outages (due to deep fade events); this violates established power-level and quality-of-service specifications, undermining provider-specific power/energy targets on the one hand and user-perceived system reliability on the other. At the other extreme, short-term-optimal policies, commonly relying on deterministic waterfilling, or maximally averse minimax-optimal policies, strictly satisfy specifications but are computationally demanding and impractical, while also being suboptimal in any long-term regime. To address these challenges, we introduce a distributionally robust formulation of the constrained stochastic resource allocation problem in the classical point-to-point interference-free multi-terminal network by leveraging Conditional Value-at-Risk (CVaR) as a coherent measure of fading and/or fluctuation risk relevant to both transmission power and achievable rate distributions. We derive a closed-form parameterized expression for the CVaR-optimal resource policy, which takes a remarkably simple and interpretable form, along with subgradient-based update schemes for the CVaR quantile levels associated with transmission power and achievable rates. Building on this, we develop a primal-dual double tail waterfilling scheme that iteratively computes globally optimal policies achieving ultra-reliable long-term rate performance with near-short-term characteristics. Extensive numerical experiments corroborate the effectiveness of the proposed approach.
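For context, the risk measure named above has the standard Rockafellar-Uryasev variational form shown below; how the paper applies it to power and rate constraints is not reproduced here.

% Background definition of CVaR at level alpha in (0, 1] for a loss variable X:
\[
  \mathrm{CVaR}_{\alpha}(X)
  = \min_{t \in \mathbb{R}} \left\{ t + \frac{1}{\alpha}\, \mathbb{E}\big[(X - t)_{+}\big] \right\},
  \qquad (x)_{+} = \max(x, 0),
\]
% i.e., the expected loss within the worst alpha-fraction of outcomes.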
{"title":"Distributionally Robust Ultra-Reliable Resource Allocation via Double Tail Waterfilling Under Fading Risk","authors":"Gokberk Yaylali;Dionysis Kalogerias","doi":"10.1109/OJSP.2026.3657284","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3657284","url":null,"abstract":"Optimal resource allocation in wireless systems remains a fundamental challenge due to the inherent adversities caused by channel fading.Modern wireless applications require efficient allocation schemes that maximize total network utility ensuring robust and reliable system performance. Although optimal on average, ergodic-optimal policies, commonly realized via stochastic waterfilling schemes, are susceptible to statistical dispersion of commonly heavy-tailed or highly volatile fading channels, particularly in terms of <italic>both</i> instantaneous power policy fluctuations <italic>and</i> frequent service outages (due to deep fade events), violating established power-level and quality-of-service specifications, essentially sabotaging fulfillment of provider-specific power/energy targets on the one hand, and user-perceived system reliability on the other. At the other extreme, short-term-optimal policies, commonly relying on deterministic waterfilling, or maximally averse minimax-optimal policies, strictly satisfy specifications but are computationally demanding, impractical, while also being suboptimal in any long-term regime. To address these challenges, we introduce a distributionally robust formulation of the constrained stochastic resource allocation problem in the classical point-to-point interference-free multi-terminal network by leveraging Conditional Value-at-Risk (CVaR) as a coherent measure of fading and/or fluctuation risk relevant to both transmission power and achievable rate distributions. We derive a closed-form parameterized expression for the CVaR-optimal resource policy which is of remarkably simple and interpretable form, along with subgradient-based update schemes for the corresponding CVaR quantile levels to both transmission power and achievable rates. Building on this, we develop a primal-dual <italic>double tail waterfilling</i> scheme which iteratively computes <italic>globally optimal policies achieving ultra-reliable long-term rate performance, but with near-short-term characteristics</i>. Extensive numerical experiments corroborate the effectiveness of the proposed approach.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"154-164"},"PeriodicalIF":2.7,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11362910","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-23  DOI: 10.1109/OJSP.2026.3657286
Yichen Jia;Hugo Van hamme
To estimate confidence for end-to-end Automatic Speech Recognition (ASR) systems, recent research has proposed Confidence Estimation Modules that incorporate features from the backbone ASR model. Most existing approaches, however, are architecture-dependent. In this paper, we propose the Score-Rank Confidence Estimation Module (SR-CEM), a lightweight module that leverages beam search information to generate token- and word-level confidence scores. Specifically, SR-CEM constructs features by combining the scores and ranks of tokens within a hypothesis. Experiments show that SR-CEM achieves effective calibration on both in-domain and out-of-domain English data. On the in-domain test set, it attains a Maximum Calibration Error of 4.50% and an Expected Calibration Error of 0.30% at the token level, significantly outperforming softmax confidence (20.04% and 1.75%, respectively). At the word level, SR-CEM achieves 8.17% and 0.35%, compared to 17.91% and 1.67% from softmax confidence. Furthermore, we demonstrate its robustness across hybrid and transducer ASR architectures with different decoding strategies, as well as on Dutch, noisy and conversational speech conditions. Our main finding is that SR-CEM is particularly effective in reducing Maximum Calibration Error, which is critical for reliable downstream use of ASR outputs, while maintaining architecture independence and generality across diverse evaluation conditions.
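The two calibration metrics reported above are standard binned quantities; a minimal implementation is sketched below, with the bin count and equal-width binning as our assumptions.

import numpy as np

def calibration_errors(confidences, correct, n_bins=10):
    """Expected (ECE) and Maximum (MCE) Calibration Error.

    confidences: per-token or per-word confidence scores in [0, 1].
    correct: 1 if the corresponding token/word was recognized correctly, else 0.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece, mce, n = 0.0, 0.0, len(confidences)
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if not in_bin.any():
            continue
        gap = abs(confidences[in_bin].mean() - correct[in_bin].mean())
        ece += in_bin.sum() / n * gap          # bin gap weighted by occupancy
        mce = max(mce, gap)                    # worst-bin gap
    return ece, mce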
{"title":"Leveraging Beam Search Information for Confidence Estimation in E2E ASR","authors":"Yichen Jia;Hugo Van hamme","doi":"10.1109/OJSP.2026.3657286","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3657286","url":null,"abstract":"To estimate confidence for end-to-end Automatic Speech Recognition (ASR) systems, recent research has proposed Confidence Estimation Modules that incorporate features from the backbone ASR model. Most existing approaches, however, are architecture-dependent. In this paper, we propose the Score-Rank Confidence Estimation Module (SR-CEM), a lightweight module that leverages beam search information to generate token- and word-level confidence scores. Specifically, SR-CEM constructs features by combining the scores and ranks of tokens within a hypothesis. Experiments show that SR-CEM achieves effective calibration on both in-domain and out-of-domain English data. On the in-domain test set, it attains a Maximum Calibration Error of 4.50% and an Expected Calibration Error of 0.30% at the token level, significantly outperforming softmax confidence (20.04% and 1.75%, respectively). At the word level, SR-CEM achieves 8.17% and 0.35%, compared to 17.91% and 1.67% from softmax confidence. Furthermore, we demonstrate its robustness across hybrid and transducer ASR architectures with different decoding strategies, as well as on Dutch, noisy and conversational speech conditions. Our main finding is that SR-CEM is particularly effective in reducing Maximum Calibration Error, which is critical for reliable downstream use of ASR outputs, while maintaining architecture independence and generality across diverse evaluation conditions.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"125-133"},"PeriodicalIF":2.7,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11362960","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-20  DOI: 10.1109/OJSP.2026.3656063
Zhengzhe Zhang;Jie Zhang;Haoyin Yan;Hengshuang Liu;Junhua Liu
Passive sonar systems offer the advantages of stealth and low energy consumption, but face highly complex signal conditions. Underwater acoustic signal (UWAS) enhancement for passive sonar systems aims to improve the quality of vessel-radiated signals captured by hydrophones, thereby facilitating subsequent tasks such as target recognition. However, conventional methods may struggle due to the intricate marine environment and weak target signals. In this work, we propose a fully Complex-valued U-Net based Multidimensional Attention Network (CUMA-Net), with all modules operating in the complex domain to jointly exploit magnitude and phase information. CUMA-Net employs a complex-valued encoder-decoder, which captures multiscale features for spectral mapping. To boost representation power and emphasize line spectrum components, we incorporate a complex-valued multidimensional attention module. This module includes a complex-valued time-frequency conformer to model dependencies along the temporal and frequency axes. Complementarily, a complex convolutional block attention module extracts features across spatial and channel dimensions. To guide training under low SNR conditions, we propose a normalized mean squared error loss tailored for spectrogram reconstruction. Results on a public dataset verify that CUMA-Net achieves superior UWAS enhancement performance, while the improved signal quality further benefits vessel classification. Furthermore, we explore the impact of input frequency resolution on both enhancement and classification performance.
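The abstract does not spell the loss out; one common way to write a normalized MSE reconstruction loss for spectrograms, given only as an assumption about its general shape, is:

% Illustrative normalized MSE (assumption; the paper's exact normalization may differ):
% S is the target complex spectrogram and \hat{S} the network estimate.
\[
  \mathcal{L}_{\mathrm{NMSE}}
  = \frac{\sum_{t,f} \big| \hat{S}(t,f) - S(t,f) \big|^{2}}
         {\sum_{t,f} \big| S(t,f) \big|^{2}},
\]
% which keeps the loss on a comparable scale across low- and high-SNR recordings.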
{"title":"A Fully Complex-Valued Underwater Acoustic Signal Enhancement Model for Passive Sonar Systems","authors":"Zhengzhe Zhang;Jie Zhang;Haoyin Yan;Hengshuang Liu;Junhua Liu","doi":"10.1109/OJSP.2026.3656063","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3656063","url":null,"abstract":"Passive sonar systems offer stealth and low-energy consumption advantages while facing highly complex signal conditions. Underwater acoustic signal (UWAS) enhancement for passive sonar systems aims to improve the quality of vessel-radiated signals captured by hydrophones, thereby facilitating subsequent tasks like target recognition. However, conventional methods might struggle due to the intricate marine environment and weak target signals. In this work, we propose a fully Complex-valued U-Net based Multidimensional Attention Network (CUMA-Net), with all modules operating in the complex domain to jointly exploit magnitude and phase information. CUMA-Net employs a complex-valued encoder-decoder, which captures multiscale features for spectral mapping. To boost representation power and emphasize line spectrum components, we incorporate a complex-valued multidimensional attention module. This module includes a complex-valued time-frequency conformer to model dependencies along temporal and frequency axes. Complementarily, a complex convolutional block attention module extracts features across spatial and channel dimensions. To guide training under low SNR conditions, we propose a normalized mean squared error loss tailored for spectrogram reconstruction. Results on a public dataset verify that CUMA-Net achieves superior UWAS enhancement performance, while the improved signal quality further benefits vessel classification. Furthermore, we explore the impact of input frequency resolution on both enhancement and classification performance.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"101-115"},"PeriodicalIF":2.7,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11359482","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-20  DOI: 10.1109/OJSP.2026.3656104
Abdelrahman Seleem;André F. R. Guarda;Nuno M. M. Rodrigues;Fernando Pereira
Neuromorphic vision sensors, commonly referred to as event cameras, generate a massive number of pixel-level events, composed of spatiotemporal and polarity information, thus demanding highly efficient coding solutions. Existing solutions focus on lossless coding of event data, assuming that no distortion is acceptable for the target use cases, mostly computer vision tasks such as classification and recognition. One promising coding approach exploits the similarity between event data and point clouds, both being sets of 3D points, which allows current point cloud coding solutions to be used for event data, typically with a two-point-cloud representation, one per event polarity. This paper proposes a novel lossy Deep Learning-based Joint Event data Coding (DL-JEC) solution, which adopts for the first time a single-point-cloud representation in which the event polarity plays the role of a point cloud attribute, thus enabling the correlation between the geometry/spatiotemporal and polarity event information to be exploited. Moreover, this paper also proposes novel adaptive voxel binarization strategies for DL-JEC, optimized for either quality-oriented or computer vision task-oriented purposes, which allow the performance for the task at hand to be maximized. DL-JEC achieves significant compression performance gains compared with relevant conventional and DL-based state-of-the-art event data coding solutions, notably the MPEG G-PCC and JPEG Pleno PCC standards. Furthermore, it is shown that lossy event data coding can be used, at a significantly reduced rate relative to lossless coding, without compromising the target computer vision task performance, notably event classification, thus changing the current event data coding paradigm.
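To make the single-point-cloud representation concrete, a minimal sketch of mapping an event stream to geometry plus a polarity attribute is given below; the field names and dtypes are assumptions, not the paper's data format.

import numpy as np

def events_to_point_cloud(events):
    """Represent an event stream as one point cloud with polarity as an attribute.

    events: structured array with fields 'x', 'y' (pixel coordinates),
    't' (timestamp), and 'p' (polarity).
    Returns (geometry, attributes): an (N, 3) array of (x, y, t) points and an
    (N, 1) array of polarities, instead of two separate per-polarity clouds.
    """
    geometry = np.stack([events['x'], events['y'], events['t']], axis=1).astype(np.float32)
    attributes = events['p'].astype(np.float32).reshape(-1, 1)
    return geometry, attributes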
{"title":"Deep Learning-Based Event Data Coding: A Joint Spatiotemporal and Polarity Solution","authors":"Abdelrahman Seleem;André F. R. Guarda;Nuno M. M. Rodrigues;Fernando Pereira","doi":"10.1109/OJSP.2026.3656104","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3656104","url":null,"abstract":"Neuromorphic vision sensors, commonly referred to as event cameras, generate a massive number of pixel-level events, composed by spatiotemporal and polarity information, thus demanding highly efficient coding solutions. Existing solutions focus on lossless coding of event data, assuming that no distortion is acceptable for the target use cases, mostly including computer vision tasks such as classification and recognition. One promising coding approach exploits the similarity between event data and point clouds, both being sets of 3D points, thus allowing to use current point cloud coding solutions to code event data, typically adopting a two-point clouds representation, one for each event polarity. This paper proposes a novel lossy Deep Learning-based Joint Event data Coding (DL-JEC) solution, which adopts for the first time a single-point cloud representation, where the event polarity plays the role of a point cloud attribute, thus enabling to exploit the correlation between the geometry/spatiotemporal and polarity event information. Moreover, this paper also proposes novel adaptive voxel binarization strategies which may be used in DL-JEC, optimized for either quality-oriented or computer vision task-oriented purposes which allow to maximize the performance for the task at hand. DL-JEC can achieve significant compression performance gains when compared with relevant conventional and DL-based state-of-the-art event data coding solutions, notably the MPEG G-PCC and JPEG Pleno PCC standards. Furthermore, it is shown that it is possible to use lossy event data coding, with significantly reduced rate regarding lossless coding, without compromising the target computer vision task performance, notably event classification, thus changing the current event data coding paradigm.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"222-237"},"PeriodicalIF":2.7,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11359485","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-20  DOI: 10.1109/OJSP.2026.3656059
Tobias Raichle;Niels Edinger;Bin Yang
Deep learning-based speech enhancement models achieve remarkable performance when test distributions match training conditions, but often degrade when deployed in unpredictable real-world environments with domain shifts. To address this challenge, we present LaDen, the first test-time adaptation method specifically designed for speech enhancement. Our approach leverages powerful pre-trained speech representations to perform latent denoising, approximating clean speech representations through a linear transformation of noisy embeddings. We show that this transformation generalizes well across domains, enabling effective pseudo-labeling for target domains without labeled target data. The resulting pseudo-labels enable effective test-time adaptation of speech enhancement models across diverse acoustic environments. We propose a comprehensive benchmark spanning multiple datasets with various domain shifts, including changes in noise types, speaker characteristics, and languages. Our extensive experiments demonstrate that LaDen consistently outperforms baseline methods across perceptual metrics, particularly for speaker and language domain shifts.
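Below is a minimal sketch of the kind of linear latent-denoising transformation described above, fitted here with ridge-regularized least squares on paired noisy/clean embeddings; the closed-form solver, the bias term, and all names are assumptions rather than the paper's procedure.

import numpy as np

def fit_latent_denoiser(noisy_emb, clean_emb, reg=1e-3):
    """Fit a linear map (with bias) so that transformed noisy embeddings
    approximate clean embeddings from the same pre-trained speech encoder.

    noisy_emb, clean_emb: paired (N, D) embedding matrices.
    Returns W of shape (D + 1, D).
    """
    X = np.hstack([noisy_emb, np.ones((len(noisy_emb), 1))])   # append bias column
    A = X.T @ X + reg * np.eye(X.shape[1])                     # ridge-regularized normal equations
    return np.linalg.solve(A, X.T @ clean_emb)

def denoise_latents(noisy_emb, W):
    """Apply the fitted transformation to new noisy embeddings (e.g., for pseudo-labels)."""
    X = np.hstack([noisy_emb, np.ones((len(noisy_emb), 1))])
    return X @ W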
{"title":"Test-Time Adaptation for Speech Enhancement via Domain Invariant Embedding Transformation","authors":"Tobias Raichle;Niels Edinger;Bin Yang","doi":"10.1109/OJSP.2026.3656059","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3656059","url":null,"abstract":"Deep learning-based speech enhancement models achieve remarkable performance when test distributions match training conditions, but often degrade when deployed in unpredictable real-world environments with domain shifts. To address this challenge, we present laden, the first test-time adaptation method specifically designed for speech enhancement. Our approach leverages powerful pre-trained speech representations to perform latent denoising, approximating clean speech representations through a linear transformation of noisy embeddings. We show that this transformation generalizes well across domains, enabling effective pseudo-labeling for target domains without labeled target data. The resulting pseudo-labels enable effective test-time adaptation of speech enhancement models across diverse acoustic environments. We propose a comprehensive benchmark spanning multiple datasets with various domain shifts, including changes in noise types, speaker characteristics, and languages. Our extensive experiments demonstrate that LaDen consistently outperforms baseline methods across perceptual metrics, particularly for speaker and language domain shifts.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"134-143"},"PeriodicalIF":2.7,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11359505","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2026-01-20  DOI: 10.1109/OJSP.2026.3656057
Florian Hilgemann;Peter Jax
The use of equalization filters to achieve acoustic transparency can improve the sound quality of hearables and hearing aids. Finite impulse response (FIR) filters guarantee stability and offer a listening impression close to the open ear, but their implementation may conflict with the resource constraints typical of hearing devices. Infinite impulse response (IIR) filters are commonly used to meet these constraints, but their design often lacks stability and performance guarantees. Therefore, we consider indirect IIR filter design methods that extend FIR filter designs with an IIR approximation step. To mitigate the performance degradation caused by the IIR approximation, we establish a formal connection between the optimization variable and the IIR approximation error, and propose an approximation-aware design algorithm based on the nuclear norm heuristic. The evaluation considers the design of hear-through filters using real-world measurement data. The proposed approach can reduce the time-domain mean-squared error by up to 6 dB compared to conventional methods, and shows high robustness to between-person variance. Thus, the results offer an improvement in hearing device personalization within practical constraints.
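As background on the heuristic named above: the nuclear norm, the sum of a matrix's singular values, is the standard convex surrogate for rank; in indirect IIR design it is often applied to a Hankel matrix built from the impulse response, since (by Kronecker's theorem for Hankel operators) low Hankel rank corresponds to a response realizable by a low-order IIR filter. The paper's exact regularizer is not reproduced here.

% Background definition only:
\[
  \| M \|_{*} = \sum_{i} \sigma_{i}(M),
\]
% where the sigma_i(M) are the singular values of M.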
{"title":"Design of Acoustic Equalization Filters for Headphones Based on Low-Rank Regularization","authors":"Florian Hilgemann;Peter Jax","doi":"10.1109/OJSP.2026.3656057","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3656057","url":null,"abstract":"The use of equalization filters to achieve acoustic transparency can improve the sound quality of hearables and hearing aids. Finite impulse response (FIR) filters guarantee stability and offer a listening impression close to the open ear, but their implementation may conflict with the resource constraints typical of hearing devices. Infinite impulse response (IIR) filters are commonly used to meet these constraints, but their design often lacks stability and performance guarantees. Therefore, we consider indirect IIR filter design methods that extend FIR filter designs with an IIR approximation step. To mitigate the performance degradation caused by the IIR approximation, we establish a formal connection between optimization variable and IIR approximation error, and propose an approximation-aware design algorithm based on the nuclear norm heuristic. The evaluation considers the design of hear-through filters using real-world measurement data. The proposed approach can reduce the time-domain mean-squared error by up to <inline-formula><tex-math>$text{6},text{dB}$</tex-math></inline-formula> compared to conventional methods, and shows a high robustness against between-person variance. Thus, the results offer an improvement in hearing device personalization within practical constraints.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"173-184"},"PeriodicalIF":2.7,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11359448","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}