Pub Date : 2026-02-23; DOI: 10.1109/OJSP.2026.3667079
Waleed Hilal;Alex McCafferty-Leroux;John Yawney;S. Andrew Gadsden
This paper proposes two novel filtering strategies as sub-optimal robust solutions for state estimation in systems affected by non-Gaussian noise, outliers, or modeling uncertainties. The moments-based Kalman filter (MKF) and moments-based innovation filter (MIF) replace the mean squared error criterion with a correntropy-based cost function that incorporates higher-order statistical moments of the innovation sequence. Through Taylor series expansion of the Gaussian kernel, correntropy inherently captures all even-order moments—including variance, kurtosis, and higher-order statistics—providing natural robustness to heavy-tailed and asymmetric noise distributions. An adaptive kernel bandwidth mechanism uses real-time estimates of innovation skewness and kurtosis to automatically balance efficiency and robustness. The MIF augments this framework with variable structure control theory, incorporating a saturation-based gain that bounds corrective action during large disturbances. Both methods employ fixed-point iteration with correntropy-weighted covariance matrices in their predictor-corrector algorithms. Mathematical derivations and stability proofs are provided for both filters. The approaches extend to nonlinear systems through first-order Taylor series linearization, yielding the extended MKF (EMKF) and extended MIF (EMIF). To validate their robustness relative to the conventional Kalman filter, the proposed methods are applied to both linear and nonlinear representations of a simulated electrohydrostatic actuator (EHA) experiencing leakage faults. Computational experiments demonstrate that the MKF and MIF achieve superior estimation accuracy compared to the KF under non-Gaussian conditions, more faithfully representing faulty system behavior.
Title: Robust Kalman Filtering via Correntropy-Based Higher-Order Moment Adaptation and Variable Structure Gains (IEEE Open Journal of Signal Processing, vol. 7, pp. 343–355)
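The correntropy mechanism this abstract describes can be illustrated with a small numeric sketch. This is a generic illustration of the correntropy criterion itself, not the authors' MKF/MIF implementation; the bandwidth value and the innovation model are assumptions for the toy example. The point it shows: the Gaussian kernel's Taylor expansion involves all even-order moments of the error, so heavy-tailed outliers receive nearly zero weight, unlike under the MSE criterion, which weights every sample equally.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy innovation sequence: Gaussian noise plus sparse heavy-tailed outliers.
innov = rng.normal(0.0, 1.0, 500)
innov[::100] += 15.0  # inject outliers at every 100th sample

sigma = 2.0  # kernel bandwidth (assumed value for illustration)

def gaussian_kernel(e, sigma):
    # Taylor expansion exp(-e^2/(2*sigma^2)) = sum_n (-1)^n e^(2n) / ((2*sigma^2)^n n!)
    # shows that correntropy aggregates all even-order moments of e.
    return np.exp(-e**2 / (2.0 * sigma**2))

# Empirical correntropy of the innovation sequence.
correntropy = gaussian_kernel(innov, sigma).mean()

# Correntropy-induced per-sample weights: outliers are down-weighted toward
# zero, whereas the MSE criterion would let them dominate the cost.
weights = gaussian_kernel(innov, sigma)
```

Under the MSE criterion, the injected outliers would contribute roughly 225/1 ≈ 200 times the cost of a nominal sample; under the correntropy weighting they contribute almost nothing, which is the robustness property the abstract invokes.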
Pub Date : 2026-02-20; DOI: 10.1109/OJSP.2026.3666822
Noga Bar;Raja Giryes
Large annotated datasets are crucial for the success of deep learning, but labeling data can be prohibitively expensive in domains such as medical imaging. This work tackles the subset selection problem: selecting a small set of the most informative examples from a large unlabeled pool for annotation. We propose a simple and effective method that combines feature norms, randomization, and orthogonality (via the Gram–Schmidt process) to select diverse and informative samples. Feature norms serve as a proxy for informativeness, while randomization and orthogonalization reduce redundancy and encourage coverage of the feature space. Extensive experiments on image and text benchmarks, including CIFAR-10/100, Tiny ImageNet, ImageNet, OrganAMNIST, and Yelp, show that our method consistently improves subset selection performance, both as a standalone approach and when integrated with existing techniques.
Title: Diverse Subset Selection via Norm-Based Sampling and Orthogonality (IEEE Open Journal of Signal Processing, vol. 7, pp. 333–342)
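The three ingredients named in the abstract (norm as an informativeness proxy, randomization, Gram–Schmidt orthogonalization) can be combined into a toy greedy selector. This is a hedged sketch under assumed details, not the authors' exact algorithm: here the randomization is norm-proportional sampling, and redundancy is removed by projecting all features onto the orthogonal complement of each chosen one.

```python
import numpy as np

def select_subset(features, k, rng=None):
    """Greedy norm-based selection with Gram-Schmidt de-redundancy (sketch)."""
    rng = rng or np.random.default_rng(0)
    X = features.astype(float).copy()
    chosen = []
    for _ in range(k):
        norms = np.linalg.norm(X, axis=1)
        norms[chosen] = 0.0  # never pick the same sample twice
        probs = norms / norms.sum()
        i = rng.choice(len(X), p=probs)  # randomization: norm-proportional draw
        chosen.append(i)
        u = X[i] / np.linalg.norm(X[i])
        # Gram-Schmidt step: remove the chosen direction from every feature,
        # so near-duplicates of sample i lose almost all of their norm.
        X = X - np.outer(X @ u, u)
    return chosen

# Toy pool: two tight clusters. After picking from one cluster, projection
# collapses its members to near-zero norm, so the second pick covers the other.
pts = np.vstack([np.tile([10.0, 0.0], (50, 1)), np.tile([0.0, 8.0], (50, 1))])
pts += 0.01 * np.random.default_rng(1).normal(size=pts.shape)
idx = select_subset(pts, 2)
```

The toy example makes the diversity effect visible: a pure norm criterion would pick two near-identical points from the larger-norm cluster, while the orthogonalization step forces coverage of both clusters.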
Radio frequency (RF) signal-based localization using modern cellular networks has emerged as a promising solution to accurately locate objects in challenging environments. One of the most promising solutions for situations involving obstructed-line-of-sight (OLoS) and multipath propagation is multipath-based simultaneous localization and mapping (MP-SLAM) that employs map features (MFs), such as virtual anchors. This paper presents an extended MP-SLAM method that is augmented with a global map feature (GMF) repository. This repository stores consistent MFs of high quality that are collected during prior traversals. We integrate these GMFs back into the MP-SLAM framework via a probability hypothesis density (PHD) filter, which propagates GMF intensity functions over time. Extensive simulations, together with a challenging real-world experiment using LTE RF signals in a dense urban scenario with severe multipath propagation and inter-cell interference, demonstrate that our framework achieves robust and accurate localization, thereby showcasing its effectiveness in realistic modern cellular networks such as 5G or future 6G networks. It outperforms conventional proprioceptive sensor-based localization and conventional MP-SLAM methods, and achieves reliable localization even under adverse signal conditions.
Title: Robust Localization in Modern Cellular Networks Using Global Map Features
Authors: Junshi Chen;Xuhong Li;Russ Whiton;Erik Leitinger;Fredrik Tufvesson
Pub Date : 2026-02-16; DOI: 10.1109/OJSP.2026.3665385 (IEEE Open Journal of Signal Processing, vol. 7, pp. 356–372)
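The PHD-filter machinery the abstract relies on can be sketched with a minimal 1-D grid example. This is a generic textbook-style PHD predict/update over an intensity function, with all numbers (survival/detection probabilities, clutter level, grid) assumed for illustration; it is not the paper's GMF-repository integration. The integral of the intensity is the expected number of map features, which is the quantity the filter propagates.

```python
import numpy as np

x = np.linspace(0.0, 10.0, 201)            # 1-D state grid
dx = x[1] - x[0]
D = np.exp(-0.5 * ((x - 3.0) / 0.5) ** 2)  # prior intensity: one feature near x=3

p_s, p_d = 0.99, 0.9              # survival and detection probabilities (assumed)
clutter = 0.01                    # clutter intensity kappa(z) (assumed)
birth = 0.001 * np.ones_like(x)   # birth intensity (assumed)

# Predict: surviving mass plus births (static features, so no motion blur here).
D_pred = p_s * D + birth

def likelihood(z, xg, sigma=0.3):
    return np.exp(-0.5 * ((z - xg) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Update with one measurement z = 3.1.
z = 3.1
g = likelihood(z, x)
denom = clutter + np.sum(p_d * g * D_pred) * dx
D_upd = (1.0 - p_d) * D_pred + p_d * g * D_pred / denom

# Expected number of features = integral of the posterior intensity.
n_est = np.sum(D_upd) * dx
```

The update leaves the intensity peaked near the measurement and keeps the expected feature count close to one, which is how a PHD filter maintains belief over a variable number of map features without explicit data association.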
Pub Date : 2026-02-13; DOI: 10.1109/OJSP.2026.3664335
Keitaro Yamashita;Kazuki Naganuma;Shunsuke Ono
This paper proposes a method for vertex-wise aggregation sampling of a broad class of graph signals, designed to attain the best possible recovery under generalized sampling theory. This is achieved by designing the sampling operator through an optimization problem that is inherently non-convex, since best-possible recovery imposes a rank constraint. An existing method for vertex-wise aggregation sampling can control the number of active vertices but cannot incorporate prior knowledge of mandatory or avoided vertices. To address these challenges, we formulate the operator design as a problem that handles both a constraint on the number of active vertices and prior knowledge about specific vertices (mandatory inclusion or exclusion). We transform this constrained problem into a difference-of-convex (DC) optimization problem by using the nuclear norm and a DC penalty for vertex selection. To solve it, we develop a convergent solver based on the general double-proximal gradient DC algorithm. The effectiveness of our method is demonstrated through experiments on various graph signal models, including real-world data, showing superior recovery accuracy compared to existing methods.
Title: Sampling Method for Generalized Graph Signals With Pre-Selected Vertices via DC Optimization (IEEE Open Journal of Signal Processing, vol. 7, pp. 314–323)
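The DC-optimization machinery the abstract invokes can be illustrated on a scalar toy problem (purely illustrative, unrelated to the paper's operator design). Writing f(x) = g(x) - h(x) with both g and h convex, the standard DC algorithm linearizes the concave part -h at the current iterate and minimizes the convex surrogate, which here has a closed form.

```python
# Toy DC algorithm on f(x) = x^4 - x^2, split as g(x) = x^4 (convex) minus
# h(x) = x^2 (convex). Each iteration solves
#   x_{k+1} = argmin_x  x^4 - 2 * x_k * x,
# whose stationarity condition 4 x^3 = 2 x_k gives x = (x_k / 2)^(1/3).
x = 2.0
for _ in range(100):
    x = (x / 2.0) ** (1.0 / 3.0)

# The fixed point satisfies x^3 = x / 2, i.e. x^2 = 1/2, which is exactly the
# positive global minimizer of x^4 - x^2 (where 4x^3 - 2x = 0).
```

Each surrogate minimization is convex and the objective decreases monotonically, which is the convergence property the paper's double-proximal gradient DC solver generalizes to the constrained matrix setting.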
Pub Date : 2026-02-13; DOI: 10.1109/OJSP.2026.3664271
Lenaïg Guého;Henrique Lefundes da Silva;Cyril Plapous;Laurent Bougrain;Patrick Hénaff;Rozenn Nicol
In this paper, the use of non-sinusoidal amplitude-modulated stimuli is assessed for Brain-Computer Interfaces (BCIs) based on Steady-State Auditory Evoked Potentials (SSAEPs). Three different stimuli are compared to the frequently used 1-kHz pure tone: Brownian noise, cicada song and cat's purr. While these alternative sounds are intended to be more pleasant for listeners, they may impact the detectability of the modulation frequency in ElectroEncephaloGraphic (EEG) signals. Stimuli are equalized in loudness using a Head And Torso Simulator (HATS). The experiment is conducted at two loudness levels (50 and 56 phons), with 24 subjects participating in each condition. Hearing capacity is assessed prior to the experiment, using an audiometry test and questionnaires. For each stimulus, detection is performed using 10 different classifiers: linear discriminant analysis, deep learning networks, and Riemannian classifiers including tangent space-based algorithms. The latter consistently outperformed the alternative approaches. The pure tone provides the highest detection accuracy (above 83%), whereas the cicada song achieves only 60%. Classification using the proposed models fails for Brownian noise and the cat's purr, with accuracy at chance level. Additionally, increasing the loudness of the stimuli does not enhance the detectability of the modulation frequency for any stimulus. The amplitude modulation, frequency content and temporal characteristics of the stimuli are further analyzed to explain these differences.
Title: Alternatives to Sine Carrier in Auditory BCI: Exploring Machine Learning Strategies for Assessing Modulation Detectability in EEG (IEEE Open Journal of Signal Processing, vol. 7, pp. 324–332)
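The tangent-space step behind the best-performing classifiers above can be sketched in a few lines. This is the generic Riemannian-geometry recipe for EEG covariance features, not the paper's exact pipeline, and the toy data (2-channel trials differing in one channel's variance) is an assumption: each trial covariance C is mapped to log(C_ref^(-1/2) C C_ref^(-1/2)) and vectorized, after which any linear classifier (e.g. LDA) applies.

```python
import numpy as np

def spd_logm(C):
    """Matrix logarithm of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return (V * np.log(w)) @ V.T

def spd_invsqrtm(C):
    """Inverse matrix square root of a symmetric positive-definite matrix."""
    w, V = np.linalg.eigh(C)
    return (V * (1.0 / np.sqrt(w))) @ V.T

def tangent_features(covs, C_ref):
    """Project SPD covariances to the tangent space at C_ref and vectorize."""
    W = spd_invsqrtm(C_ref)
    feats = []
    for C in covs:
        S = spd_logm(W @ C @ W)          # log-map at the reference point
        feats.append(S[np.triu_indices_from(S)])
    return np.array(feats)

# Toy trials: 2-channel signals, two "classes" with different second-channel
# variance (0.5 vs 2.0), mimicking a detectable vs undetectable SSAEP response.
rng = np.random.default_rng(0)
covs = [np.cov(rng.normal(size=(2, 256)) * np.array([[1.0], [s]]))
        for s in (0.5, 0.5, 2.0, 2.0)]
C_ref = np.mean(covs, axis=0)   # arithmetic mean as a simple reference point
F = tangent_features(covs, C_ref)
```

In the tangent space the trials become ordinary Euclidean vectors, with same-class trials clustering together, which is why plain linear classifiers work well on these features.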
Low-light image enhancement (LLIE) aims to restore the visual quality of poorly illuminated images by recovering fine details and textures while suppressing noise and artifacts. Recently, diffusion models have shown superior generative capabilities for LLIE. However, existing diffusion-based methods condition the denoising process only on low-light images or on features derived from them (e.g., structural or illumination maps). Since the low-light images are severely degraded, this limits the denoising model's ability to restore fine structure and suppress artifacts. In this work, we show that event data captured simultaneously with the low-light images provides complementary high-dynamic-range, high-temporal-resolution structural information that can overcome this limitation. We therefore propose EcDiff-LLIE, a novel event-conditional diffusion framework for LLIE. At its core, we introduce a multimodality denoising network that conditions on both low-light images and concurrent event streams. To effectively fuse the two modalities, we design a cross-modality attention block that bridges their domain differences while also enabling long-range dependency modeling for improved structural preservation. Experiments on the synthetic SDSD and real-world SDE datasets show significant improvements in quantitative evaluation metrics. Evaluation on the high-resolution real-world HUE dataset further demonstrates the generalization ability of the proposed framework.
Title: EcDiff-LLIE: Event-Conditional Diffusion Model for Structure-Preserving Low-Light Image Enhancement
Authors: Ramna Maqsood;Paulo Nunes;Luís Ducla Soares;Caroline Conti
Pub Date : 2026-02-06; DOI: 10.1109/OJSP.2026.3662627 (IEEE Open Journal of Signal Processing, vol. 7, pp. 266–275)
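A cross-modality attention block of the kind described above can be sketched with standard attention primitives. This is a generic fusion layer with hypothetical dimensions, not the paper's exact block: image tokens act as queries and event tokens as keys/values, so event-derived structure guides the image branch, and full attention over all positions supplies the long-range dependency modeling.

```python
import torch
import torch.nn as nn

class CrossModalityAttention(nn.Module):
    """Minimal cross-attention fusion sketch (hypothetical sizes)."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, event_tokens):
        # Query: image tokens; key/value: event tokens. Each image position can
        # attend to every event position, bridging the two modalities.
        fused, _ = self.attn(img_tokens, event_tokens, event_tokens)
        return self.norm(img_tokens + fused)  # residual keeps the image path

block = CrossModalityAttention()
img = torch.randn(2, 196, 64)   # batch x tokens x channels (image features)
ev = torch.randn(2, 196, 64)    # concurrent event-stream features
out = block(img, ev)
```

The residual connection means the block degrades gracefully when event features are uninformative: the image path passes through unchanged up to normalization.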
Pub Date : 2026-01-28; DOI: 10.1109/OJSP.2026.3657696
Jaeho Park;Yong-Yeon Jo;Jong-Hwan Jang;Jin Yu;Joon-myoung Kwon;Junho Song
Electrocardiograms (ECGs) remain widely archived as paper ECG charts. In the standard 12-lead paper ECG layout, each lead shows only a 2.5-second visible segment. Digitized charts are therefore incomplete: most of the 10-second recording is invisible, leaving them misaligned with the digital standard required by ECG-AI models. Previous work has attempted to recover these invisible segments but has shown markedly lower performance on them than on visible segments. We propose the Visible Context Propagation (VCP) architecture, an extension of ECGrecover, which leverages the quasi-periodic structure of ECGs and employs cross-attention to propagate contextual information from visible to invisible segments. Our model consistently outperformed ECGrecover, the strongest baseline, reducing RMSE by 32.4% overall, including 12.0% on invisible segments. Beyond recovery accuracy, evaluations on downstream ECG applications demonstrated that recovered ECGs achieved performance comparable to raw ECGs in both diagnostic classification and ECG feature measurement. These results highlight the effectiveness of explicitly modeling visible-to-invisible context propagation and establish VCP as a robust solution for recovering incomplete paper-based ECGs, enabling reliable surrogates for clinical and analytical use.
Title: VCP: Visible Context Propagation for Electrocardiogram Recovery (IEEE Open Journal of Signal Processing, vol. 7, pp. 185–194)
Pub Date : 2026-01-28; DOI: 10.1109/OJSP.2026.3659053
Manu Harju;Frederic Font;Annamaria Mesaros
Self-labeling is a method to simultaneously learn representations and classes using unlabeled data. The naive approach to self-labeling leads to a degenerate solution, and the model-generated labels require regularization to serve as useful training targets. In this work, we adapt a self-labeling method using optimal transport to the audio domain using the FSD50K dataset. We analyze the structure of the learned representations and compare the emergent classes with the reference annotations. We compare the learned representations with the ones produced using Bootstrap Your Own Latent for Audio (BYOL-A) across several downstream tasks. Our findings indicate that the method learns to group perceptually similar sounds without supervision. The results show that the method is a viable approach for audio representation learning, and that the learned embeddings are as effective for downstream tasks as the ones obtained with the benchmark method. As an additional outcome, the generated classifications give valuable insight into what the model learns, promoting explainability in feature learning.
Title: Self-Labeling Sounds Using Optimal Transport (IEEE Open Journal of Signal Processing, vol. 7, pp. 116–124)
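The optimal-transport regularization mentioned above can be sketched with a Sinkhorn-Knopp iteration. This is a generic sketch in the spirit of optimal-transport self-labeling methods (e.g. SeLa-style balancing), not this paper's exact procedure, and the temperature and iteration count are assumed values: the model's class scores are rescaled so that every sample carries unit label mass and every class receives an equal share, which rules out the degenerate all-one-class solution that naive self-labeling produces.

```python
import numpy as np

def sinkhorn_labels(scores, n_iter=50, eps=0.5):
    """Balanced soft pseudo-labels via Sinkhorn-Knopp (illustrative sketch)."""
    Q = np.exp(scores / eps)   # positive "transport" matrix from class scores
    Q /= Q.sum()
    n, k = Q.shape
    for _ in range(n_iter):
        Q /= Q.sum(axis=1, keepdims=True)   # each sample gets total mass 1/n
        Q /= n
        Q /= Q.sum(axis=0, keepdims=True)   # each class gets total mass 1/k
        Q /= k
    return Q * n   # rescale so each row is a soft label distribution

rng = np.random.default_rng(0)
scores = rng.normal(size=(100, 5))   # toy model outputs: 100 clips, 5 classes
Q = sinkhorn_labels(scores)
```

After the iteration, each row of `Q` is a usable soft training target and every class receives the same total mass (here 100/5 = 20 samples' worth), which is the equipartition constraint that keeps the learned classes from collapsing.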
This article revises informed independent vector extraction (iIVE) as a framework for connecting model-based blind source extraction (BSE) with deep learning. We introduce the contrast function for iIVE, which is derived by extending IVE with beamforming-based constraints, enabling an interpretable use of reference signals. We also show that structured mixing models implementing physical knowledge can be integrated, which is demonstrated by two far-field models. With the contrast functions, rapidly converging second-order algorithms are developed, whose performance is first verified through simulations. In the experimental part, we refine iIVE by training models containing unrolled iterations of the developed algorithm. The resulting structures achieve performance comparable to state-of-the-art networks while requiring two orders of magnitude fewer trainable parameters and exhibiting strong generalization to unseen conditions.
Title: From Informed Independent Vector Extraction to Hybrid Architectures for Target Source Extraction
Authors: Zbyněk Koldovský;Jiří Málek;Martin Vrátný;Tereza Vrbová;Jaroslav Čmejla;Stephen O'Regan
Pub Date : 2026-01-26; DOI: 10.1109/OJSP.2026.3657698 (IEEE Open Journal of Signal Processing, vol. 7, pp. 195–212)
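The beamforming-based constraints the abstract refers to can be illustrated with a minimal MVDR beamformer (a generic sketch of the underlying idea, not the paper's iIVE contrast function; the steering vector and noise level are assumed toy values): w = R^{-1} a / (a^T R^{-1} a) passes the reference direction a with unit gain while minimizing output power from everything else.

```python
import numpy as np

rng = np.random.default_rng(0)

a = np.array([1.0, 0.8, 0.6])      # hypothetical steering/reference vector
s = rng.normal(size=2000)          # target source signal
n = rng.normal(size=(3, 2000))     # sensor noise
X = np.outer(a, s) + 0.1 * n       # 3-sensor mixture observed at the array

R = X @ X.T / X.shape[1]           # sample mixture covariance
w = np.linalg.solve(R, a)          # R^{-1} a
w /= a @ w                         # distortionless constraint: a^T w = 1
y = w @ X                          # extracted source estimate
```

Because the constraint pins the target's gain to one, minimizing output power can only suppress the other components; this interpretable use of a reference signal is what the iIVE contrast function builds into the blind-extraction framework.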
Pub Date : 2026-01-23; DOI: 10.1109/OJSP.2026.3657142
Sajad Shirali-Shahreza;Gerald Penn
One of the main goals of Text-To-Speech systems is to generate natural speech. Therefore, a major evaluation criterion of TTS outputs is their naturalness, usually measured through a Mean Opinion Score (MOS). Naturalness is not a well-defined property, however. This paper decomposes naturalness into eight specific dimensions, based on how judges define the term. We then evaluate the outputs of the systems submitted to the Blizzard 2025 Challenge based on these new dimensions and compare the results with alternative evaluations of naturalness. This includes recent subjective human evaluations that were performed by the 2025 Blizzard Challenge organizers, as well as various automatic MOS methods. Based on this analysis, we propose to use five dimensions in place of a single, primitive notion of naturalness: Clarity, Fluency, Human-vs-Computer, Pronunciation, and Understandability. We propose that these would serve as the basis of a better evaluation framework for advanced TTS systems.
Title: Better Naturalness Evaluation of TTS Systems (IEEE Open Journal of Signal Processing, vol. 7, pp. 296–304)