Uncertainty-Based Streaming ASR With Evidential Deep Learning
Pub Date: 2026-01-23, DOI: 10.1109/OJSP.2026.3657308, IEEE Open Journal of Signal Processing, vol. 7, pp. 373-381
Hiroaki Sato;Asahi Sakuma;Ryuga Sugano;Tadashi Kumano;Yoshihiko Kawai;Shinji Watanabe;Tetsuji Ogawa
Attention-based encoder-decoder (AED) models achieve high accuracy in offline automatic speech recognition (ASR), but their application to streaming remains challenging due to the lack of mechanisms for regulating token emission. Existing approaches include monotonic attention, forced alignment with external models providing token-level boundaries, and encoder-based emission control methods. However, these methods either require structural modifications, complicate the training pipeline, or show limited accuracy. In addition, local agreement has been proposed as a method enabling streaming without retraining, but it incurs fixed delays corresponding to the input window size and premature commitments. To address these limitations, we propose Evidential Streaming TRAnsformer (ESTRA), a framework that leverages evidential deep learning (EDL) to estimate uncertainty. ESTRA models token probabilities with a Dirichlet distribution and introduces hierarchical and direct Kullback–Leibler divergence losses to ensure uncertainty decreases progressively as more speech is observed. During inference, token emission is controlled by comparing uncertainty against a threshold, suppressing premature outputs without fixed delays. Experiments on the LibriSpeech benchmark show that ESTRA achieves streaming performance comparable to offline AED models, surpasses local agreement in robustness under small input windows, and reduces 50th-percentile latency by avoiding fixed window-size delays, while leaving room for improvement at the 90th percentile. Furthermore, it provides more reliable control of token emission than probability- or entropy-based baselines, demonstrating the effectiveness of uncertainty as an indicator. ESTRA offers a promising approach to streaming ASR, with results supporting the effectiveness of uncertainty-driven token emission.
Spherical Acoustic Spatial Entropy: Predicting Acoustic Scene Complexity in Virtual Environments
Pub Date: 2026-01-23, DOI: 10.1109/OJSP.2026.3657297, IEEE Open Journal of Signal Processing, vol. 7, pp. 144-153
Luca Resti;Amelia Gully;Michael McLoughlin;Gavin Kearney;Alena Denisova
The objective quantification of Acoustic Scene Complexity (ASC) remains a significant challenge. While existing entropy-based metrics capture spectro-temporal variability, a metric accounting for the spatial distribution of sources has been lacking. We introduce Spherical Acoustic Spatial Entropy (SASE), a novel information-theoretic metric designed for Virtual Acoustic Environments (VAEs). SASE leverages ground-truth spatial data, utilizing an equal-area spherical partition around the listener and weighting source contributions by their perceptual loudness (ITU-R BS.1770-4). We validated SASE through a psychoacoustic experiment (N=21) using a $2\times 2\times 2$ factorial design that manipulated masker count, spatial distribution, and motion. SASE was evaluated alongside energy and spectral entropy metrics against subjective ratings of complexity, effort, and spatial spread. Results show that SASE mean was the most robust predictor of perceived complexity in condition-level ratings ($R^{2}=0.714$, $p=0.008$), outperforming spectral and energy entropy. A random-effects pooling of participant regression coefficients confirmed this relationship at the population level ($R^{2}_{\text{pseudo}}=0.740$, $p < .001$). Furthermore, a model combining SASE mean with spectral entropy standard deviation explained 84.4% of the variance in perceived complexity, indicating spatial and spectro-temporal metrics capture complementary scene dynamics. SASE provides an objective measure of spatial complexity, enhancing existing frameworks for predicting ASC in virtual environments.
Sci-Phi: A Large Language Model Spatial Audio Descriptor
Pub Date: 2026-01-23, DOI: 10.1109/OJSP.2026.3657300, IEEE Open Journal of Signal Processing, vol. 7, pp. 276-284
Xilin Jiang;Hannes Gamper;Sebastian Braun
Acoustic scene perception involves describing the type of sounds, their timing, their direction and distance, as well as their loudness and reverberation. While audio language models excel in sound recognition, single-channel input fundamentally limits spatial understanding. This work presents Sci-Phi, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Learning from over 4,000 hours of synthetic first-order Ambisonics recordings with accompanying metadata, Sci-Phi enumerates and describes up to four directional sound sources in one pass, alongside non-directional background sounds and room characteristics. We evaluate the model with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and analyze its robustness across source counts, signal-to-noise ratios, reverberation levels, and challenging mixtures of acoustically, spatially, or temporally similar sources. Notably, Sci-Phi generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment. Demo: https://sci-phi-audio.github.io/demo
{"title":"Sci-Phi: A Large Language Model Spatial Audio Descriptor","authors":"Xilin Jiang;Hannes Gamper;Sebastian Braun","doi":"10.1109/OJSP.2026.3657300","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3657300","url":null,"abstract":"Acoustic scene perception involves describing the type of sounds, their timing, their direction and distance, as well as their loudness and reverberation. While audio language models excel in sound recognition, single-channel input fundamentally limits spatial understanding. This work presents <italic>Sci-Phi</i>, a spatial audio large language model with dual spatial and spectral encoders that estimates a complete parameter set for all sound sources and the surrounding environment. Learning from over 4,000 hours of synthetic first-order Ambisonics recordings including metadata, <italic>Sci-Phi</i> enumerates and describes up to four directional sound sources in one pass, alongside non-directional background sounds and room characteristics. We evaluate the model with a permutation-invariant protocol and 15 metrics covering content, location, timing, loudness, and reverberation, and analyze its robustness across source counts, signal-to-noise ratios, reverberation levels, and challenging mixtures of acoustically, spatially, or temporally similar sources. Notably, <italic>Sci-Phi</i> generalizes to real room impulse responses with only minor performance degradation. Overall, this work establishes the first audio LLM capable of full spatial-scene description, with strong potential for real-world deployment. Demo: <uri>https://sci-phi-audio.github.io/demo</uri>","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"276-284"},"PeriodicalIF":2.7,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11362973","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146223560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Distributionally Robust Ultra-Reliable Resource Allocation via Double Tail Waterfilling Under Fading Risk
Pub Date: 2026-01-23, DOI: 10.1109/OJSP.2026.3657284, IEEE Open Journal of Signal Processing, vol. 7, pp. 154-164
Gokberk Yaylali;Dionysis Kalogerias
Optimal resource allocation in wireless systems remains a fundamental challenge due to the inherent adversities caused by channel fading. Modern wireless applications require efficient allocation schemes that maximize total network utility while ensuring robust and reliable system performance. Although optimal on average, ergodic-optimal policies, commonly realized via stochastic waterfilling schemes, are susceptible to the statistical dispersion of commonly heavy-tailed or highly volatile fading channels, particularly in terms of both instantaneous power policy fluctuations and frequent service outages (due to deep fade events). By violating established power-level and quality-of-service specifications, such policies essentially sabotage fulfillment of provider-specific power/energy targets on the one hand, and user-perceived system reliability on the other. At the other extreme, short-term-optimal policies, commonly relying on deterministic waterfilling, or maximally averse minimax-optimal policies strictly satisfy specifications but are computationally demanding and impractical, while also being suboptimal in any long-term regime. To address these challenges, we introduce a distributionally robust formulation of the constrained stochastic resource allocation problem in the classical point-to-point interference-free multi-terminal network by leveraging Conditional Value-at-Risk (CVaR) as a coherent measure of fading and/or fluctuation risk relevant to both transmission power and achievable rate distributions. We derive a closed-form parameterized expression for the CVaR-optimal resource policy, which is of remarkably simple and interpretable form, along with subgradient-based update schemes for the corresponding CVaR quantile levels of both transmission power and achievable rates. Building on this, we develop a primal-dual double tail waterfilling scheme which iteratively computes globally optimal policies achieving ultra-reliable long-term rate performance, but with near-short-term characteristics. Extensive numerical experiments corroborate the effectiveness of the proposed approach.
{"title":"Distributionally Robust Ultra-Reliable Resource Allocation via Double Tail Waterfilling Under Fading Risk","authors":"Gokberk Yaylali;Dionysis Kalogerias","doi":"10.1109/OJSP.2026.3657284","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3657284","url":null,"abstract":"Optimal resource allocation in wireless systems remains a fundamental challenge due to the inherent adversities caused by channel fading.Modern wireless applications require efficient allocation schemes that maximize total network utility ensuring robust and reliable system performance. Although optimal on average, ergodic-optimal policies, commonly realized via stochastic waterfilling schemes, are susceptible to statistical dispersion of commonly heavy-tailed or highly volatile fading channels, particularly in terms of <italic>both</i> instantaneous power policy fluctuations <italic>and</i> frequent service outages (due to deep fade events), violating established power-level and quality-of-service specifications, essentially sabotaging fulfillment of provider-specific power/energy targets on the one hand, and user-perceived system reliability on the other. At the other extreme, short-term-optimal policies, commonly relying on deterministic waterfilling, or maximally averse minimax-optimal policies, strictly satisfy specifications but are computationally demanding, impractical, while also being suboptimal in any long-term regime. To address these challenges, we introduce a distributionally robust formulation of the constrained stochastic resource allocation problem in the classical point-to-point interference-free multi-terminal network by leveraging Conditional Value-at-Risk (CVaR) as a coherent measure of fading and/or fluctuation risk relevant to both transmission power and achievable rate distributions. We derive a closed-form parameterized expression for the CVaR-optimal resource policy which is of remarkably simple and interpretable form, along with subgradient-based update schemes for the corresponding CVaR quantile levels to both transmission power and achievable rates. Building on this, we develop a primal-dual <italic>double tail waterfilling</i> scheme which iteratively computes <italic>globally optimal policies achieving ultra-reliable long-term rate performance, but with near-short-term characteristics</i>. Extensive numerical experiments corroborate the effectiveness of the proposed approach.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"154-164"},"PeriodicalIF":2.7,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11362910","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175827","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Leveraging Beam Search Information for Confidence Estimation in E2E ASR
Pub Date: 2026-01-23, DOI: 10.1109/OJSP.2026.3657286, IEEE Open Journal of Signal Processing, vol. 7, pp. 125-133
Yichen Jia;Hugo Van hamme
To estimate confidence for end-to-end Automatic Speech Recognition (ASR) systems, recent research has proposed Confidence Estimation Modules that incorporate features from the backbone ASR model. Most existing approaches, however, are architecture-dependent. In this paper, we propose the Score-Rank Confidence Estimation Module (SR-CEM), a lightweight module that leverages beam search information to generate token- and word-level confidence scores. Specifically, SR-CEM constructs features by combining the scores and ranks of tokens within a hypothesis. Experiments show that SR-CEM achieves effective calibration on both in-domain and out-of-domain English data. On the in-domain test set, it attains a Maximum Calibration Error of 4.50% and an Expected Calibration Error of 0.30% at the token level, significantly outperforming softmax confidence (20.04% and 1.75%, respectively). At the word level, SR-CEM achieves 8.17% and 0.35%, compared to 17.91% and 1.67% for softmax confidence. Furthermore, we demonstrate its robustness across hybrid and transducer ASR architectures with different decoding strategies, as well as on Dutch, noisy, and conversational speech conditions. Our main finding is that SR-CEM is particularly effective in reducing Maximum Calibration Error, which is critical for reliable downstream use of ASR outputs, while maintaining architecture independence and generality across diverse evaluation conditions.
{"title":"Leveraging Beam Search Information for Confidence Estimation in E2E ASR","authors":"Yichen Jia;Hugo Van hamme","doi":"10.1109/OJSP.2026.3657286","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3657286","url":null,"abstract":"To estimate confidence for end-to-end Automatic Speech Recognition (ASR) systems, recent research has proposed Confidence Estimation Modules that incorporate features from the backbone ASR model. Most existing approaches, however, are architecture-dependent. In this paper, we propose the Score-Rank Confidence Estimation Module (SR-CEM), a lightweight module that leverages beam search information to generate token- and word-level confidence scores. Specifically, SR-CEM constructs features by combining the scores and ranks of tokens within a hypothesis. Experiments show that SR-CEM achieves effective calibration on both in-domain and out-of-domain English data. On the in-domain test set, it attains a Maximum Calibration Error of 4.50% and an Expected Calibration Error of 0.30% at the token level, significantly outperforming softmax confidence (20.04% and 1.75%, respectively). At the word level, SR-CEM achieves 8.17% and 0.35%, compared to 17.91% and 1.67% from softmax confidence. Furthermore, we demonstrate its robustness across hybrid and transducer ASR architectures with different decoding strategies, as well as on Dutch, noisy and conversational speech conditions. Our main finding is that SR-CEM is particularly effective in reducing Maximum Calibration Error, which is critical for reliable downstream use of ASR outputs, while maintaining architecture independence and generality across diverse evaluation conditions.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"125-133"},"PeriodicalIF":2.7,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11362960","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Fully Complex-Valued Underwater Acoustic Signal Enhancement Model for Passive Sonar Systems
Pub Date: 2026-01-20, DOI: 10.1109/OJSP.2026.3656063, IEEE Open Journal of Signal Processing, vol. 7, pp. 101-115
Zhengzhe Zhang;Jie Zhang;Haoyin Yan;Hengshuang Liu;Junhua Liu
Passive sonar systems offer the advantages of stealth and low energy consumption, but face highly complex signal conditions. Underwater acoustic signal (UWAS) enhancement for passive sonar systems aims to improve the quality of vessel-radiated signals captured by hydrophones, thereby facilitating subsequent tasks such as target recognition. However, conventional methods might struggle with the intricate marine environment and weak target signals. In this work, we propose a fully Complex-valued U-Net based Multidimensional Attention Network (CUMA-Net), with all modules operating in the complex domain to jointly exploit magnitude and phase information. CUMA-Net employs a complex-valued encoder-decoder, which captures multiscale features for spectral mapping. To boost representation power and emphasize line-spectrum components, we incorporate a complex-valued multidimensional attention module. This module includes a complex-valued time-frequency conformer to model dependencies along the temporal and frequency axes. Complementarily, a complex convolutional block attention module extracts features across spatial and channel dimensions. To guide training under low signal-to-noise ratio (SNR) conditions, we propose a normalized mean squared error loss tailored for spectrogram reconstruction. Results on a public dataset verify that CUMA-Net achieves superior UWAS enhancement performance, while the improved signal quality further benefits vessel classification. Furthermore, we explore the impact of input frequency resolution on both enhancement and classification performance.
{"title":"A Fully Complex-Valued Underwater Acoustic Signal Enhancement Model for Passive Sonar Systems","authors":"Zhengzhe Zhang;Jie Zhang;Haoyin Yan;Hengshuang Liu;Junhua Liu","doi":"10.1109/OJSP.2026.3656063","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3656063","url":null,"abstract":"Passive sonar systems offer stealth and low-energy consumption advantages while facing highly complex signal conditions. Underwater acoustic signal (UWAS) enhancement for passive sonar systems aims to improve the quality of vessel-radiated signals captured by hydrophones, thereby facilitating subsequent tasks like target recognition. However, conventional methods might struggle due to the intricate marine environment and weak target signals. In this work, we propose a fully Complex-valued U-Net based Multidimensional Attention Network (CUMA-Net), with all modules operating in the complex domain to jointly exploit magnitude and phase information. CUMA-Net employs a complex-valued encoder-decoder, which captures multiscale features for spectral mapping. To boost representation power and emphasize line spectrum components, we incorporate a complex-valued multidimensional attention module. This module includes a complex-valued time-frequency conformer to model dependencies along temporal and frequency axes. Complementarily, a complex convolutional block attention module extracts features across spatial and channel dimensions. To guide training under low SNR conditions, we propose a normalized mean squared error loss tailored for spectrogram reconstruction. Results on a public dataset verify that CUMA-Net achieves superior UWAS enhancement performance, while the improved signal quality further benefits vessel classification. Furthermore, we explore the impact of input frequency resolution on both enhancement and classification performance.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"101-115"},"PeriodicalIF":2.7,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11359482","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep Learning-Based Event Data Coding: A Joint Spatiotemporal and Polarity Solution
Pub Date: 2026-01-20, DOI: 10.1109/OJSP.2026.3656104, IEEE Open Journal of Signal Processing, vol. 7, pp. 222-237
Abdelrahman Seleem;André F. R. Guarda;Nuno M. M. Rodrigues;Fernando Pereira
Neuromorphic vision sensors, commonly referred to as event cameras, generate a massive number of pixel-level events, composed of spatiotemporal and polarity information, thus demanding highly efficient coding solutions. Existing solutions focus on lossless coding of event data, assuming that no distortion is acceptable for the target use cases, which mostly include computer vision tasks such as classification and recognition. One promising coding approach exploits the similarity between event data and point clouds, both being sets of 3D points, which makes it possible to code event data with current point cloud coding solutions, typically adopting a two-point-cloud representation, one for each event polarity. This paper proposes a novel lossy Deep Learning-based Joint Event data Coding (DL-JEC) solution, which for the first time adopts a single-point-cloud representation in which the event polarity plays the role of a point cloud attribute, thus enabling the exploitation of the correlation between the geometry/spatiotemporal and polarity event information. Moreover, this paper also proposes novel adaptive voxel binarization strategies for DL-JEC, optimized for either quality-oriented or computer vision task-oriented purposes, which maximize the performance for the task at hand. DL-JEC achieves significant compression performance gains compared with relevant conventional and DL-based state-of-the-art event data coding solutions, notably the MPEG G-PCC and JPEG Pleno PCC standards. Furthermore, it is shown that lossy event data coding, at significantly reduced rates relative to lossless coding, is possible without compromising the performance of the target computer vision task, notably event classification, thus changing the current event data coding paradigm.
{"title":"Deep Learning-Based Event Data Coding: A Joint Spatiotemporal and Polarity Solution","authors":"Abdelrahman Seleem;André F. R. Guarda;Nuno M. M. Rodrigues;Fernando Pereira","doi":"10.1109/OJSP.2026.3656104","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3656104","url":null,"abstract":"Neuromorphic vision sensors, commonly referred to as event cameras, generate a massive number of pixel-level events, composed by spatiotemporal and polarity information, thus demanding highly efficient coding solutions. Existing solutions focus on lossless coding of event data, assuming that no distortion is acceptable for the target use cases, mostly including computer vision tasks such as classification and recognition. One promising coding approach exploits the similarity between event data and point clouds, both being sets of 3D points, thus allowing to use current point cloud coding solutions to code event data, typically adopting a two-point clouds representation, one for each event polarity. This paper proposes a novel lossy Deep Learning-based Joint Event data Coding (DL-JEC) solution, which adopts for the first time a single-point cloud representation, where the event polarity plays the role of a point cloud attribute, thus enabling to exploit the correlation between the geometry/spatiotemporal and polarity event information. Moreover, this paper also proposes novel adaptive voxel binarization strategies which may be used in DL-JEC, optimized for either quality-oriented or computer vision task-oriented purposes which allow to maximize the performance for the task at hand. DL-JEC can achieve significant compression performance gains when compared with relevant conventional and DL-based state-of-the-art event data coding solutions, notably the MPEG G-PCC and JPEG Pleno PCC standards. Furthermore, it is shown that it is possible to use lossy event data coding, with significantly reduced rate regarding lossless coding, without compromising the target computer vision task performance, notably event classification, thus changing the current event data coding paradigm.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"222-237"},"PeriodicalIF":2.7,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11359485","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Test-Time Adaptation for Speech Enhancement via Domain Invariant Embedding Transformation
Pub Date: 2026-01-20, DOI: 10.1109/OJSP.2026.3656059, IEEE Open Journal of Signal Processing, vol. 7, pp. 134-143
Tobias Raichle;Niels Edinger;Bin Yang
Deep learning-based speech enhancement models achieve remarkable performance when test distributions match training conditions, but often degrade when deployed in unpredictable real-world environments with domain shifts. To address this challenge, we present LaDen, the first test-time adaptation method specifically designed for speech enhancement. Our approach leverages powerful pre-trained speech representations to perform latent denoising, approximating clean speech representations through a linear transformation of noisy embeddings. We show that this transformation generalizes well across domains, enabling effective pseudo-labeling for target domains without labeled target data. The resulting pseudo-labels then drive test-time adaptation of speech enhancement models across diverse acoustic environments. We propose a comprehensive benchmark spanning multiple datasets with various domain shifts, including changes in noise types, speaker characteristics, and languages. Our extensive experiments demonstrate that LaDen consistently outperforms baseline methods across perceptual metrics, particularly for speaker and language domain shifts.
{"title":"Test-Time Adaptation for Speech Enhancement via Domain Invariant Embedding Transformation","authors":"Tobias Raichle;Niels Edinger;Bin Yang","doi":"10.1109/OJSP.2026.3656059","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3656059","url":null,"abstract":"Deep learning-based speech enhancement models achieve remarkable performance when test distributions match training conditions, but often degrade when deployed in unpredictable real-world environments with domain shifts. To address this challenge, we present laden, the first test-time adaptation method specifically designed for speech enhancement. Our approach leverages powerful pre-trained speech representations to perform latent denoising, approximating clean speech representations through a linear transformation of noisy embeddings. We show that this transformation generalizes well across domains, enabling effective pseudo-labeling for target domains without labeled target data. The resulting pseudo-labels enable effective test-time adaptation of speech enhancement models across diverse acoustic environments. We propose a comprehensive benchmark spanning multiple datasets with various domain shifts, including changes in noise types, speaker characteristics, and languages. Our extensive experiments demonstrate that LaDen consistently outperforms baseline methods across perceptual metrics, particularly for speaker and language domain shifts.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"134-143"},"PeriodicalIF":2.7,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11359505","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design of Acoustic Equalization Filters for Headphones Based on Low-Rank Regularization
Pub Date: 2026-01-20, DOI: 10.1109/OJSP.2026.3656057, IEEE Open Journal of Signal Processing, vol. 7, pp. 173-184
Florian Hilgemann;Peter Jax
The use of equalization filters to achieve acoustic transparency can improve the sound quality of hearables and hearing aids. Finite impulse response (FIR) filters guarantee stability and offer a listening impression close to the open ear, but their implementation may conflict with the resource constraints typical of hearing devices. Infinite impulse response (IIR) filters are commonly used to meet these constraints, but their design often lacks stability and performance guarantees. Therefore, we consider indirect IIR filter design methods that extend FIR filter designs with an IIR approximation step. To mitigate the performance degradation caused by the IIR approximation, we establish a formal connection between the optimization variable and the IIR approximation error, and propose an approximation-aware design algorithm based on the nuclear norm heuristic. The evaluation considers the design of hear-through filters using real-world measurement data. The proposed approach can reduce the time-domain mean squared error by up to $6\,\text{dB}$ compared to conventional methods, and shows high robustness against between-person variance. Thus, the results offer an improvement in hearing device personalization within practical constraints.
{"title":"Design of Acoustic Equalization Filters for Headphones Based on Low-Rank Regularization","authors":"Florian Hilgemann;Peter Jax","doi":"10.1109/OJSP.2026.3656057","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3656057","url":null,"abstract":"The use of equalization filters to achieve acoustic transparency can improve the sound quality of hearables and hearing aids. Finite impulse response (FIR) filters guarantee stability and offer a listening impression close to the open ear, but their implementation may conflict with the resource constraints typical of hearing devices. Infinite impulse response (IIR) filters are commonly used to meet these constraints, but their design often lacks stability and performance guarantees. Therefore, we consider indirect IIR filter design methods that extend FIR filter designs with an IIR approximation step. To mitigate the performance degradation caused by the IIR approximation, we establish a formal connection between optimization variable and IIR approximation error, and propose an approximation-aware design algorithm based on the nuclear norm heuristic. The evaluation considers the design of hear-through filters using real-world measurement data. The proposed approach can reduce the time-domain mean-squared error by up to <inline-formula><tex-math>$text{6},text{dB}$</tex-math></inline-formula> compared to conventional methods, and shows a high robustness against between-person variance. Thus, the results offer an improvement in hearing device personalization within practical constraints.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"173-184"},"PeriodicalIF":2.7,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11359448","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semi-Blind Channel Estimation for Single-Carrier With Frequency-Domain Equalization Systems in the Presence of Colored Noise: A Cyclic Zero Insertion Approach
Pub Date: 2026-01-19, DOI: 10.1109/OJSP.2026.3655219, IEEE Open Journal of Signal Processing, vol. 7, pp. 213-221
Yi-Sheng Chen;Hung-Shuo Chang
In this paper, we propose a semi-blind channel estimation method for single-carrier with frequency-domain equalization (SC-FDE) systems in the presence of colored noise. The method introduces a cyclic zero insertion approach in which a single zero is periodically inserted into the transmitted sequence. This insertion induces a special structure in the autocorrelation matrix of the received signal, enabling effective channel estimation even in environments with colored noise. By extracting the channel product coefficients from the autocorrelation matrix, we construct a Hermitian matrix whose dominant eigenvector corresponds to the channel impulse response. Simulation results demonstrate the effectiveness of the proposed method.
{"title":"Semi-Blind Channel Estimation for Single-Carrier With Frequency-Domain Equalization Systems in the Presence of Colored Noise : A Cyclic Zero Insertion Approach","authors":"Yi-Sheng Chen;Hung-Shuo Chang","doi":"10.1109/OJSP.2026.3655219","DOIUrl":"https://doi.org/10.1109/OJSP.2026.3655219","url":null,"abstract":"In this paper, we propose a semi-blind channel estimation method for single-carrier with frequency-domain equalization (SC-FDE) systems in the presence of colored noise. The method introduces a cyclic zero insertion approach where a single zero is periodically inserted into the transmitted sequence. This way induces a special structure of the autocorrelation matrix of the received signal, enabling effective channel estimation even in environments with colored noise. By extracting the channel product coefficients from the autocorrelation matrix, we construct a Hermitian matrix whose dominant eigenvector corresponds to the channel impulse response. Simulation results demonstrate the effectiveness of the proposed method.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"7 ","pages":"213-221"},"PeriodicalIF":2.7,"publicationDate":"2026-01-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11358533","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}