List of Reviewers
Pub Date : 2025-12-15  DOI: 10.1109/OJSP.2025.3635745  Pages: 1203-1206

Common-Gain Autoencoder Network for Binaural Speech Enhancement
Stefan Thaleiser;Gerald Enzner;Rainer Martin;Aleksej Chinaev
Pub Date : 2025-11-17  DOI: 10.1109/OJSP.2025.3633577  Pages: 1193-1202
Binaural processing is becoming an important feature of high-end commercial headsets and hearing aids. Speech enhancement with binaural output requires adequate treatment of spatial cues in addition to desirable noise reduction and simultaneous speech preservation. Binaural speech enhancement was traditionally approached with model-based statistical signal processing, where the principle of common-gain filtering with identical treatment of left- and right-ear signals has been designed to achieve enhancement constrained by strict binaural cue preservation. However, model-based approaches may also be instructive for the design of modern deep learning architectures. In this article, the common-gain paradigm is therefore embedded into an artificial neural network approach. In order to maintain the desired common-gain property end-to-end, we derive the requirements for compressed feature formation and data normalization. Binaural experiments with moderate-sized artificial neural networks demonstrate the superiority of the proposed common-gain autoencoder network over model-based processing and related unconstrained network architectures for anechoic and reverberant noisy speech, in terms of segmental SNR and the binaural perception-based metrics MBSTOI and better-ear HASQI, as well as in a listening experiment.
{"title":"Common-Gain Autoencoder Network for Binaural Speech Enhancement","authors":"Stefan Thaleiser;Gerald Enzner;Rainer Martin;Aleksej Chinaev","doi":"10.1109/OJSP.2025.3633577","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3633577","url":null,"abstract":"Binaural processing is becoming an important feature of high-end commercial headsets and hearing aids. Speech enhancement with binaural output requires adequate treatment of spatial cues in addition to desirable noise reduction and simultaneous speech preservation. Binaural speech enhancement was traditionally approached with model-based statistical signal processing, where the principle of common-gain filtering with identical treatment of left- and right-ear signals has been designed to achieve enhancement constrained by strict binaural cue preservation. However, model-based approaches may also be instructive for the design of modern deep learning architectures. In this article, the common-gain paradigm is therefore embedded into an artificial neural network approach. In order to maintain the desired common-gain property end-to-end, we derive the requirements for compressed feature formation and data normalization. Binaural experiments with moderate-sized artificial neural networks demonstrate the superiority of the proposed common-gain autoencoder network over model-based processing and related unconstrained network architectures for anechoic and reverberant noisy speech in terms of segmental SNR, binaural perception-based metrics MBSTOI, better-ear HASQI, and a listening experiment.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"1193-1202"},"PeriodicalIF":2.7,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11250640","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145612133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Embracing Cacophony: Explaining and Improving Random Mixing in Music Source Separation
Chang-Bin Jeon;Gordon Wichern;François G. Germain;Jonathan Le Roux
Pub Date : 2025-11-17  DOI: 10.1109/OJSP.2025.3633567  Pages: 1179-1192
In music source separation, a standard data augmentation technique involves creating new training examples by randomly combining instrument stems from different songs. However, these randomly mixed samples lack the natural coherence of real music, as their stems do not share a consistent beat or tonality, often resulting in a cacophony. Despite this apparent distribution shift, random mixing has been widely adopted due to its effectiveness. In this work, we investigate why random mixing improves performance when training a state-of-the-art music source separation model and analyze the factors that cause performance gains to plateau despite the theoretically limitless number of possible combinations. We further explore the impact of beat and tonality mismatches on separation performance. Beyond analyzing random mixing, we introduce ways to further enhance its effectiveness. First, we explore a multi-segment sampling strategy that increases the diversity of training examples by selecting multiple segments for the target source. Second, we incorporate a digital parametric equalizer, a fundamental tool in music production, to maximize the timbral diversity of random mixes. Our experiments demonstrate that a model trained with only 100 songs from the MUSDB18-HQ dataset, combined with our proposed methods, achieves performance competitive with that of a BS-RNN model trained with 1,750 additional songs.
{"title":"Embracing Cacophony: Explaining and Improving Random Mixing in Music Source Separation","authors":"Chang-Bin Jeon;Gordon Wichern;François G. Germain;Jonathan Le Roux","doi":"10.1109/OJSP.2025.3633567","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3633567","url":null,"abstract":"In music source separation, a standard data augmentation technique involves creating new training examples by randomly combining instrument stems from different songs. However, these randomly mixed samples lack the natural coherence of real music, as their stems do not share a consistent beat or tonality, often resulting in a cacophony. Despite this apparent distribution shift, random mixing has been widely adopted due to its effectiveness. In this work, we investigate why random mixing improves performance when training a state-of-the-art music source separation model and analyze the factors that cause performance gains to plateau despite the theoretically limitless number of possible combinations. We further explore the impact of beat and tonality mismatches on separation performance. Beyond analyzing random mixing, we introduce ways to further enhance its effectiveness. First, we explore a multi-segment sampling strategy that increases the diversity of training examples by selecting multiple segments for the target source. Second, we incorporate a digital parametric equalizer, a fundamental tool in music production, to maximize the timbral diversity of random mixes. Our experiments demonstrate that a model trained with only 100 songs from the MUSDB18-HQ dataset, combined with our proposed methods, achieves competitive performance to a BS-RNN model trained with 1,750 additional songs.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"1179-1192"},"PeriodicalIF":2.7,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11250641","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145612049","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

LEMON: Localized Editing With Mesh Optimization and Neural Shaders
Furkan Mert Algan;Umut Yazgan;Driton Salihu;Cem Eteke;Eckehard Steinbach
Pub Date : 2025-10-30  DOI: 10.1109/OJSP.2025.3627123  Pages: 1161-1168
We present LEMON, a mesh editing pipeline that integrates neural deferred shading with localized mesh optimization to enable fast and precise editing of polygonal meshes guided by text prompts. Existing solutions for this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and the rendered views. Our approach starts by identifying the most important vertices in the mesh for editing, using a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. Using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process results in a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline on the DTU dataset, demonstrating that it generates finely edited meshes more rapidly than the current state-of-the-art methods. We include our code and additional results in the supplementary material.
{"title":"LEMON: Localized Editing With Mesh Optimization and Neural Shaders","authors":"Furkan Mert Algan;Umut Yazgan;Driton Salihu;Cem Eteke;Eckehard Steinbach","doi":"10.1109/OJSP.2025.3627123","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3627123","url":null,"abstract":"We present LEMON, a mesh editing pipeline that integrates neural deferred shading with localized mesh optimization to enable fast and precise editing of polygonal meshes guided by text prompts. Existing solutions for this problem tend to focus on a single task, either geometry or novel view synthesis, which often leads to disjointed results between the mesh and view. Our approach starts by identifying the most important vertices in the mesh for editing, using a segmentation model to focus on these key regions. Given multi-view images of an object, we optimize a neural shader and a polygonal mesh while extracting the normal map and the rendered image from each view. Using these outputs as conditioning data, we edit the input images with a text-to-image diffusion model and iteratively update our dataset while deforming the mesh. This process results in a polygonal mesh that is edited according to the given text instruction, preserving the geometric characteristics of the initial mesh while focusing on the most significant areas. We evaluate our pipeline on the DTU dataset, demonstrating that it generates finely-edited meshes more rapidly than the current state-of-the-art methods. We include our code and additional results in the supplementary material.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"1161-1168"},"PeriodicalIF":2.7,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11222920","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510204","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Minimizing the Probability of Error for Decision Making Over Graphs
Ping Hu;Mert Kayaalp;Ali H. Sayed
Pub Date : 2025-10-27  DOI: 10.1109/OJSP.2025.3625863  Pages: 1139-1160
Distributed decision-making over graphs involves a group of agents that collaboratively work toward a common objective. In the social learning framework, the agents are tasked to infer an unknown state from a finite set by using a stream of local observations. The probability of decision errors for each agent asymptotically converges to zero at an exponential rate, characterized by the error exponent, which depends on the combination policy employed by the network. This work addresses the challenge of identifying optimal combination policies to maximize the error exponent for the true state while ensuring the errors for all other states converge to zero as well. We derive an upper bound on the achievable error exponent under the social learning rule, and then establish conditions for the combination policy to reach this upper bound. Moreover, we examine the performance loss scenarios when the combination policy is chosen inappropriately. From a geometric perspective, each combination policy induces a weighted nearest neighbor classifier where the weights correspond to the agents’ Perron centralities. By implementing an optimized combination policy, we enhance the error exponent, leading to improved accuracy and efficiency in the distributed decision-making process.
{"title":"Minimizing the Probability of Error for Decision Making Over Graphs","authors":"Ping Hu;Mert Kayaalp;Ali H. Sayed","doi":"10.1109/OJSP.2025.3625863","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3625863","url":null,"abstract":"Distributed decision-making over graphs involves a group of agents that collaboratively work toward a common objective. In the social learning framework, the agents are tasked to infer an unknown state from a finite set by using a stream of local observations. The probability of decision errors for each agent asymptotically converges to zero at an exponential rate, characterized by the <italic>error exponent</i>, which depends on the combination policy employed by the network. This work addresses the challenge of identifying optimal combination policies to maximize the error exponent for the true state while ensuring the errors for all other states converge to zero as well. We derive an upper bound on the achievable error exponent under the social learning rule, and then establish conditions for the combination policy to reach this upper bound. Moreover, we examine the performance loss scenarios when the combination policy is chosen inappropriately. From a geometric perspective, each combination policy induces a weighted nearest neighbor classifier where the weights correspond to the agents’ Perron centralities. By implementing an optimized combination policy, we enhance the error exponent, leading to improved accuracy and efficiency in the distributed decision-making process.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"1139-1160"},"PeriodicalIF":2.7,"publicationDate":"2025-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11217991","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145510203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Attention Source Device Identification Using Audio Content From Videos and Grad-CAM Explanations
Christos Korgialas;Constantine Kotropoulos
Pub Date : 2025-10-13  DOI: 10.1109/OJSP.2025.3620713  Pages: 1124-1138
An approach to Source Device Identification (SDI) is proposed, leveraging a Residual Network (ResNet) architecture enhanced with the Convolutional Block Attention Module (CBAM). The approach employs log-Mel spectrograms of audio content from videos in the VISION dataset captured by 35 different devices. A content-disjoint evaluation protocol is applied at the recording level to eliminate content bias across splits, supported by fixed-length segmentation and structured patch extraction for input generation. Moreover, Gradient-weighted Class Activation Mapping (Grad-CAM) is exploited to highlight the spectrogram regions that contribute most to the identification process, thus enabling interpretability. Quantitatively, the CBAM ResNet model is compared with existing methods, demonstrating an increased SDI accuracy across scenarios, including flat, indoor, and outdoor environments. A statistical significance test is conducted to assess the SDI accuracies, while an ablation study is performed to analyze the effect of attention mechanisms on the proposed model’s performance. Additional evaluations are performed using the FloreView and POLIPHONE datasets to validate the model’s generalization capabilities across unseen devices via transfer learning, assessing robustness under various conditions.
{"title":"Attention Source Device Identification Using Audio Content From Videos and Grad-CAM Explanations","authors":"Christos Korgialas;Constantine Kotropoulos","doi":"10.1109/OJSP.2025.3620713","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3620713","url":null,"abstract":"An approach to Source Device Identification (SDI) is proposed, leveraging a Residual Network (ResNet) architecture enhanced with the Convolutional Block Attention Module (CBAM). The approach employs log-Mel spectrograms of audio content from videos in the VISION dataset captured by 35 different devices. A content-disjoint evaluation protocol is applied at the recording level to eliminate content bias across splits, supported by fixed-length segmentation and structured patch extraction for input generation. Moreover, Gradient-weighted Class Activation Mapping (Grad-CAM) is exploited to highlight the spectrogram regions that contribute most to the identification process, thus enabling interpretability. Quantitatively, the CBAM ResNet model is compared with existing methods, demonstrating an increased SDI accuracy across scenarios, including flat, indoor, and outdoor environments. A statistical significance test is conducted to assess the SDI accuracies, while an ablation study is performed to analyze the effect of attention mechanisms on the proposed model’s performance. Additional evaluations are performed using the FloreView and POLIPHONE datasets to validate the model’s generalization capabilities across unseen devices via transfer learning, assessing robustness under various conditions.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"1124-1138"},"PeriodicalIF":2.7,"publicationDate":"2025-10-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11202249","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351873","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Anderson Accelerated Operator Splitting Methods for Convex-Nonconvex Regularized Problems
Qiang Heng;Xiaoqian Liu;Eric C. Chi
Pub Date : 2025-10-06  DOI: 10.1109/OJSP.2025.3618583  Pages: 1094-1108
Convex–nonconvex (CNC) regularization is a novel paradigm that employs a nonconvex penalty function while preserving the convexity of the overall objective function. It has found successful applications in signal processing, statistics, and machine learning. Despite its wide applicability, the computation of CNC-regularized problems is still dominated by the forward–backward splitting method, which can be computationally slow in practice and is restricted to handling a single regularizer. To address these limitations, we develop a unified Anderson acceleration framework that encompasses multiple prevalent operator-splitting schemes, thereby enabling the efficient solution of a broad class of CNC-regularized problems with a quadratic data-fidelity term. We establish global convergence of the proposed algorithm to an optimal point and demonstrate its substantial speed-ups across diverse applications.
{"title":"Anderson Accelerated Operator Splitting Methods for Convex-Nonconvex Regularized Problems","authors":"Qiang Heng;Xiaoqian Liu;Eric C. Chi","doi":"10.1109/OJSP.2025.3618583","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3618583","url":null,"abstract":"Convex–nonconvex (CNC) regularization is a novel paradigm that employs a nonconvex penalty function while preserving the convexity of the overall objective function. It has found successful applications in signal processing, statistics, and machine learning. Despite its wide applicability, the computation of CNC-regularized problems is still dominated by the forward–backward splitting method, which can be computationally slow in practice and is restricted to handling a single regularizer. To address these limitations, we develop a unified Anderson acceleration framework that encompasses multiple prevalent operator-splitting schemes, thereby enabling the efficient solution of a broad class of CNC-regularized problems with a quadratic data-fidelity term. We establish global convergence of the proposed algorithm to an optimal point and demonstrate its substantial speed-ups across diverse applications.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"1094-1108"},"PeriodicalIF":2.7,"publicationDate":"2025-10-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11194222","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Parameter-Efficient Multi-Task and Multi-Domain Learning Using Factorized Tensor Networks
Yash Garg;Nebiyou Yismaw;Rakib Hyder;Ashley Prater-Bennette;Amit Roy-Chowdhury;M. Salman Asif
Pub Date : 2025-09-22  DOI: 10.1109/OJSP.2025.3613142  Pages: 1077-1085
Multi-task and multi-domain learning methods seek to learn multiple tasks/domains, jointly or one after another, using a single unified network. The primary challenge and opportunity lie in leveraging shared information across these tasks and domains to enhance the efficiency of the unified network. The efficiency can be in terms of accuracy, storage cost, computation, or sample complexity. In this paper, we introduce a factorized tensor network (FTN) designed to achieve accuracy comparable to that of independent single-task or single-domain networks, while introducing a minimal number of additional parameters. The FTN approach entails incorporating task- or domain-specific low-rank tensor factors into a shared frozen network derived from a source model. This strategy allows for adaptation to numerous target domains and tasks without encountering catastrophic forgetting. Furthermore, FTN requires a significantly smaller number of task-specific parameters compared to existing methods. We performed experiments on widely used multi-domain and multi-task datasets, covering convolution-based architectures with different backbones as well as a transformer-based architecture. Our findings indicate that FTN attains accuracy similar to that of single-task or single-domain methods while using only a fraction of additional parameters per task.
{"title":"Parameter-Efficient Multi-Task and Multi-Domain Learning Using Factorized Tensor Networks","authors":"Yash Garg;Nebiyou Yismaw;Rakib Hyder;Ashley Prater-Bennette;Amit Roy-Chowdhury;M. Salman Asif","doi":"10.1109/OJSP.2025.3613142","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3613142","url":null,"abstract":"Multi-task and multi-domain learning methods seek to learn multiple tasks/domains, jointly or one after another, using a single unified network. The primary challenge and opportunity lie in leveraging shared information across these tasks and domains to enhance the efficiency of the unified network. The efficiency can be in terms of accuracy, storage cost, computation, or sample complexity. In this paper, we introduce a factorized tensor network (FTN) designed to achieve accuracy comparable to that of independent single-task or single-domain networks, while introducing a minimal number of additional parameters. The FTN approach entails incorporating task- or domain-specific low-rank tensor factors into a shared frozen network derived from a source model. This strategy allows for adaptation to numerous target domains and tasks without encountering catastrophic forgetting. Furthermore, FTN requires a significantly smaller number of task-specific parameters compared to existing methods. We performed experiments on widely used multi-domain and multi-task datasets. We show the experiments on convolutional-based architecture with different backbones and on transformer-based architecture. Our findings indicate that FTN attains similar accuracy as single-task or single-domain methods while using only a fraction of additional parameters per task.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"1077-1085"},"PeriodicalIF":2.7,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11175489","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145255920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Spatial Upsampling of Head-Related Impulse Responses via Elevation-Wise Encoder-Decoder Networks
Camilo Arevalo;Julián Villegas
Pub Date : 2025-09-22  DOI: 10.1109/OJSP.2025.3613209  Pages: 1086-1093
A method for performing spatial upsampling of Head-Related Impulse Responses (HRIRs) from sparse measurements is introduced. Based on a supervised elevation-wise encoder-decoder network design, we present two variants: one that performs progressive reconstructions with feed-forward connections from higher to lower elevations, and another that excludes these connections. The variants were evaluated in terms of the errors in interaural time and level differences, as well as the spectral distortion in the ipsilateral and contralateral ears. The additional complexity introduced by the variant with feed-forward connections does not always translate into accuracy gains, making the simpler variant preferable for efficiency. Performance generally improved as the number of available measurements increased. However, accuracy was also found to strongly depend on the spatial distribution of those measurements. Compared to average non-personalized HRIRs, interaural time differences remain similar, while the proposed method achieves higher spectral and level accuracy, highlighting its practical use for HRIR upsampling.
{"title":"Spatial Upsampling of Head-Related Impulse Responses via Elevation-Wise Encoder-Decoder Networks","authors":"Camilo Arevalo;Julián Villegas","doi":"10.1109/OJSP.2025.3613209","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3613209","url":null,"abstract":"A method for performing spatial upsampling of Head-Related Impulse Responses (HRIRs) from sparse measurements is introduced. Based on a supervised elevation-wise encoder-decoder network design, we present two variants: one that performs progressive reconstructions with feed-forward connections from higher to lower elevations, and another that excludes these connections. The variants were evaluated in terms of the errors in interaural time and level differences, as well as the spectral distortion in the ipsilateral and contralateral ears. The additional complexity introduced by the variant with feed-forward connections does not always translate into accuracy gains, making the simpler variant preferable for efficiency. Performance generally improved as the number of available measurements increased. However, accuracy was also found to strongly depend on the spatial distribution of those measurements. Compared to an average non-personalized HRIRs, interaural time differences remain similar, while the proposed method achieves higher spectral and level accuracy, highlighting its practical use for HRIR upsampling.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"1086-1093"},"PeriodicalIF":2.7,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11175513","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145351900","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Spatial Upsampling of Head-Related Transfer Function Using Neural Network Conditioned on Source Position and Frequency
Yuki Ito;Tomohiko Nakamura;Shoichi Koyama;Shuichi Sakamoto;Hiroshi Saruwatari
Pub Date : 2025-09-22  DOI: 10.1109/OJSP.2025.3613132  Pages: 1109-1123
A spatial upsampling method for the head-related transfer function (HRTF) using deep neural networks (DNNs), consisting of an autoencoder conditioned on the source position and frequency, is proposed. On the basis of our finding that the conventional regularized linear regression (RLR)-based upsampling method can be reinterpreted as a linear autoencoder, we designed our network architecture as a nonlinear extension of the RLR-based method, whose key features are encoder and decoder weights that depend on the source positions and latent variables that are independent of them. We also extend this architecture to upsample HRTFs and interaural time differences (ITDs) in a single network, which allows us to efficiently obtain head-related impulse responses (HRIRs). Experimental results on upsampling accuracy and perceptual quality indicated that our proposed method can upsample HRTFs from sparse measurements with sufficient quality.
{"title":"Spatial Upsampling of Head-Related Transfer Function Using Neural Network Conditioned on Source Position and Frequency","authors":"Yuki Ito;Tomohiko Nakamura;Shoichi Koyama;Shuichi Sakamoto;Hiroshi Saruwatari","doi":"10.1109/OJSP.2025.3613132","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3613132","url":null,"abstract":"A spatial upsampling method for the head-related transfer function (HRTF) using deep neural networks (DNNs), consisting of an autoencoder conditioned on the source position and frequency, is proposed. On the basis of our finding that the conventional regularized linear regression (RLR)-based upsampling method can be reinterpreted as a linear autoencoder, we designed our network architecture as a nonlinear extension of the RLR-based method, whose key features are the encoder and decoder weights depending on the source positions and the latent variables independent of the source positions. We also extend this architecture to upsample HRTFs and interaural time differences (ITDs) in a single network, which allows us to efficiently obtain head-related impulse responses (HRIRs). Experimental results on upsampling accuracy and perceptual quality indicated that our proposed method can upsample HRTFs from sparse measurements with sufficient quality.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"1109-1123"},"PeriodicalIF":2.7,"publicationDate":"2025-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11175492","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145405295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}