Pub Date : 2025-06-18 | DOI: 10.1109/OJSP.2025.3580967
An D. Le;Shiwei Jin;Sungbal Seo;You-Suk Bae;Truong Q. Nguyen
This work introduces a universal tunable wavelet unit constructed with a biorthogonal lattice structure to enhance image classification and anomaly detection in convolutional neural networks by reducing information loss during pooling. The unit uses the lattice structure to modify the convolution, pooling, and down-sampling operations. Implemented in an 18-layer residual neural network, it improved accuracy on CIFAR10 (by 2.67%), ImageNet1K (by 1.85%), and the Describable Textures dataset (by 11.81%), showcasing its advantage in detecting detailed features. Similar gains are achieved in 34-layer and 50-layer residual neural networks. For anomaly detection on the MVTec Anomaly Detection and TUKPCB datasets, the proposed method achieved competitive performance and better anomaly localization.
{"title":"Biorthogonal Lattice Tunable Wavelet Units and Their Implementation in Convolutional Neural Networks for Computer Vision Problems","authors":"An D. Le;Shiwei Jin;Sungbal Seo;You-Suk Bae;Truong Q. Nguyen","doi":"10.1109/OJSP.2025.3580967","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3580967","url":null,"abstract":"This work introduces a universal wavelet unit constructed with a biorthogonal lattice structure which is a novel tunable wavelet unit to enhance image classification and anomaly detection in convolutional neural networks by reducing information loss during pooling. The unit employs a biorthogonal lattice structure to modify convolution, pooling, and down-sampling operations. Implemented in residual neural networks with 18 layers, it improved detection accuracy on CIFAR10 (by 2.67% ), ImageNet1K (by 1.85% ), and the Describable Textures dataset (by 11.81% ), showcasing its advantages in detecting detailed features. Similar gains are achieved in the implementations for residual neural networks with 34 layers and 50 layers. For anomaly detection on the MVTec Anomaly Detection and TUKPCB datasets, the proposed method achieved a competitive performance and better anomaly localization.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"768-783"},"PeriodicalIF":2.9,"publicationDate":"2025-06-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11039659","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144634816","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-13 | DOI: 10.1109/OJSP.2025.3579650
Achour Idoughi;Sreelakshmi Sreeharan;Chen Zhang;Joseph Raffoul;Hui Wang;Keigo Hirakawa
In sensor metrology, the noise parameters governing the stochastic nature of photon detectors play a critical role in characterizing the aleatoric uncertainty of computational imaging systems such as indirect time-of-flight cameras, structured light imaging, and division-of-time polarimetric imaging. Standard calibration procedures exist for extracting the noise parameters using calibration targets, but they are inconvenient or impractical for frequent updates. To keep up with noise parameters that are dynamically affected by sensor settings (e.g., exposure and gain) as well as environmental factors (e.g., temperature), we propose an In-Scene Calibration of Poisson Noise Parameters (ISC-PNP) method that does not require calibration targets. The main challenge lies in the heteroskedastic nature of the noise and the confounding influence of scene content. To address this, our method leverages global joint statistics of Poisson sensor data, which can be interpreted as a binomial random variable. We experimentally confirm that the noise parameters extracted by the proposed ISC-PNP and the standard calibration procedure are well matched.
{"title":"In-Scene Calibration of Poisson Noise Parameters for Phase Image Recovery","authors":"Achour Idoughi;Sreelakshmi Sreeharan;Chen Zhang;Joseph Raffoul;Hui Wang;Keigo Hirakawa","doi":"10.1109/OJSP.2025.3579650","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3579650","url":null,"abstract":"In sensor metrology, noise parameters governing the stochastic nature of photon detectors play critical role in characterizing the aleatoric uncertainty of computational imaging systems such as indirect time-of-flight cameras, structured light imaging, and division-of-time polarimetric imaging. Standard calibration procedures exists for extracting the noise parameters using calibration targets, but they are inconvenient or impractical for frequent updates. To keep up with noise parameters that are dynamically affected by sensor settings (e.g. exposure and gain) as well as environmental factors (e.g. temperature), we propose an In-Scene Calibration of Poisson Noise Parameters (ISC-PNP) method that does not require calibration targets. The main challenge lies in the heteroskedastic nature of the noise and the confounding influence of scene content. To address this, our method leverages global joint statistics of Poisson sensor data, which can be interpreted as a binomial random variable. We experimentally confirm that the noise parameters extracted by the proposed ISC-PNP and the standard calibration procedure are well-matched.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"682-690"},"PeriodicalIF":2.9,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11034763","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144511201","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-13 | DOI: 10.1109/OJSP.2025.3579646
Masahiro Yukawa
We present a principled way of deriving a continuous relaxation of a given discontinuous shrinkage operator, based on two fundamental results: proximal inclusion and conversion. Using our results, the discontinuous operator is converted, via double inversion, to a continuous operator; more precisely, the associated “set-valued” operator is converted to a “single-valued” Lipschitz continuous operator. The first illustrative example is the firm shrinkage operator, which can be derived as a continuous relaxation of the hard shrinkage operator. We also derive a new operator as a continuous relaxation of the discontinuous shrinkage operator associated with the so-called reverse ordered weighted $\ell_{1}$ (ROWL) penalty. Numerical examples demonstrate potential advantages of the continuous relaxation.
{"title":"Continuous Relaxation of Discontinuous Shrinkage Operator: Proximal Inclusion and Conversion","authors":"Masahiro Yukawa","doi":"10.1109/OJSP.2025.3579646","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3579646","url":null,"abstract":"We present a principled way of deriving a continuous relaxation of a given discontinuous shrinkage operator, which is based on two fundamental results, proximal inclusion and conversion. Using our results, the discontinuous operator is converted, via double inversion, to a continuous operator; more precisely, the associated “set-valued” operator is converted to a “single-valued” Lipschitz continuous operator. The first illustrative example is the firm shrinkage operator which can be derived as a continuous relaxation of the hard shrinkage operator. We also derive a new operator as a continuous relaxation of the discontinuous shrinkage operator associated with the so-called reverse ordered weighted <inline-formula><tex-math>$ell _{1}$</tex-math></inline-formula> (ROWL) penalty. Numerical examples demonstrate potential advantages of the continuous relaxation.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"753-767"},"PeriodicalIF":2.9,"publicationDate":"2025-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11034740","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144581587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-11 | DOI: 10.1109/OJSP.2025.3578807
Jitendra K. Tugnait
Estimation of the conditional independence graph (CIG) of high-dimensional multivariate Gaussian time series from multi-attribute data is considered. Existing methods for graph estimation for such data are based on single-attribute models where one associates a scalar time series with each node. In multi-attribute graphical models, each node represents a random vector or vector time series. In this paper, we provide a unified theoretical analysis of multi-attribute graph learning for dependent time series using a penalized log-likelihood objective function formulated in the frequency domain via the discrete Fourier transform of the time-domain data. We consider both convex (sparse-group lasso) and non-convex (log-sum and SCAD group penalties) penalty/regularization functions. We establish sufficient conditions in a high-dimensional setting for consistency (convergence of the inverse power spectral density to the true value in the Frobenius norm), local convexity when using non-convex penalties, and graph recovery. We do not impose any incoherence or irrepresentability condition for our convergence results. We also empirically investigate the selection of the tuning parameters based on the Bayesian information criterion, and illustrate our approach using numerical examples with both synthetic and real data.
{"title":"On Conditional Independence Graph Learning From Multi-Attribute Gaussian Dependent Time Series","authors":"Jitendra K. Tugnait","doi":"10.1109/OJSP.2025.3578807","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578807","url":null,"abstract":"Estimation of the conditional independence graph (CIG) of high-dimensional multivariate Gaussian time series from multi-attribute data is considered. Existing methods for graph estimation for such data are based on single-attribute models where one associates a scalar time series with each node. In multi-attribute graphical models, each node represents a random vector or vector time series. In this paper we provide a unified theoretical analysis of multi-attribute graph learning for dependent time series using a penalized log-likelihood objective function formulated in the frequency domain using the discrete Fourier transform of the time-domain data. We consider both convex (sparse-group lasso) and non-convex (log-sum and SCAD group penalties) penalty/regularization functions. We establish sufficient conditions in a high-dimensional setting for consistency (convergence of the inverse power spectral density to true value in the Frobenius norm), local convexity when using non-convex penalties, and graph recovery. We do not impose any incoherence or irrepresentability condition for our convergence results. We also empirically investigate selection of the tuning parameters based on the Bayesian information criterion, and illustrate our approach using numerical examples utilizing both synthetic and real data.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"705-721"},"PeriodicalIF":2.9,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11030300","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144481841","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-11 | DOI: 10.1109/OJSP.2025.3578812
Christopher C. Hulbert;Kathleen E. Wage
Detection and estimation performance depends on the signal-to-interference-plus-noise ratio (SINR) at the output of an array. The Capon beamformer (BF) designed with ensemble statistics achieves the optimum SINR in stationary environments. Adaptive BFs compute their weights using the sample covariance matrix (SCM) obtained from snapshots, i.e., training samples. SINR loss, the ratio of adaptive to optimal SINR, quantifies the number of snapshots required to achieve a desired average level of performance. For adaptive Capon BFs that invert the full SCM, Reed et al. derived the SINR loss distribution, and Miller quantified how the desired signal’s presence in the snapshots degrades that loss. Abraham and Owsley designed dominant mode rejection (DMR) for cases where the number of snapshots is less than or approximately equal to the number of sensors. DMR’s success in snapshot-starved passive sonar scenarios led to its application in other areas such as hyperspectral sensing and medical imaging. DMR forms a modified SCM as a weighted combination of the identity matrix and the dominant eigensubspace containing the loud interferers, thereby eliminating the inverse of the poorly estimated noise subspace. This work leverages recent random matrix theory (RMT) results to develop DMR performance predictions under the assumption that the desired signal is contained in the training data. Using white noise gain and interference suppression predictions, the paper derives a lower bound on DMR’s average SINR loss and confirms its accuracy using Monte Carlo simulations. Moreover, this paper develops a new eigensubspace leakage estimator applicable to broader RMT applications.
{"title":"Random Matrix Theory Predictions of Dominant Mode Rejection SINR Loss due to Signal in the Training Data","authors":"Christopher C. Hulbert;Kathleen E. Wage","doi":"10.1109/OJSP.2025.3578812","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578812","url":null,"abstract":"Detection and estimation performance depends on signal-to-interference-plus-noise ratio (SINR) at the output of an array. The Capon beamformer (BF) designed with ensemble statistics achieves the optimum SINR in stationary environments. Adaptive BFs compute their weights using the sample covariance matrix (SCM) obtained from snapshots, i.e., training samples. SINR loss, the ratio of adaptive to optimal SINR, quantifies the number of snapshots required to achieve a desired average level of performance. For adaptive Capon BFs that invert the full SCM, Reed et al. derived the SINR loss distribution and Miller quantified how the desired signal’s presence in the snapshots degrades that loss. Abraham and Owsley designed dominant mode rejection (DMR) for cases where the number of snapshots is less than or approximately equal to the number of sensors. DMR’s success in snapshot-starved passive sonar scenarios led to its application in other areas such as hyperspectral sensing and medical imaging. DMR forms a modified SCM as a weighted combination of the identity matrix and the dominant eigensubspace containing the loud interferers, thereby eliminating the inverse of the poorly estimated noise subspace. This work leverages recent random matrix theory (RMT) results to develop DMR performance predictions under the assumption that the desired signal is contained in the training data. Using white noise gain and interference suppression predictions, the paper derives a lower bound on DMR’s average SINR loss and confirms its accuracy using Monte Carlo simulations. Moreover, this paper creates a new eigensubspace leakage estimator applicable to broader RMT applications.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"735-752"},"PeriodicalIF":2.9,"publicationDate":"2025-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11030297","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144550496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-10 | DOI: 10.1109/OJSP.2025.3578299
Gerardo Roa-Dabike;Michael A. Akeroyd;Scott Bannister;Jon P. Barker;Trevor J. Cox;Bruno Fazenda;Jennifer Firth;Simone Graetzer;Alinka Greasley;Rebecca R. Vos;William M. Whitmer
Listening to music can be an issue for those with a hearing impairment, and hearing aids are not a universal solution. This paper details the first use of an open challenge methodology to improve the audio quality of music for those with hearing loss through machine learning. The first challenge (CAD1) had 9 participants. The second was a 2024 ICASSP grand challenge (ICASSP24), which attracted 17 entrants. The challenge tasks concerned demixing and remixing pop/rock music to allow a personalized rebalancing of the instruments in the mix, along with amplification to correct for raised hearing thresholds. The software baselines provided for entrants to build upon used two state-of-the-art demix algorithms: Hybrid Demucs and Open-Unmix. Objective evaluation used HAAQI, the Hearing-Aid Audio Quality Index. No entries improved on the best baseline in CAD1. It is suggested that this arose because demixing algorithms are relatively mature, and recent work has shown that access to large (private) datasets is needed to further improve performance. Learning from this, for ICASSP24 the scenario was made more difficult by using loudspeaker reproduction and specifying gains to be applied before remixing. This also made the scenario more useful for listening through hearing aids. Nine entrants scored better than the best ICASSP24 baseline. Most of the entrants used a refined version of Hybrid Demucs and NAL-R amplification. The highest scoring system combined the outputs of several demixing algorithms in an ensemble approach. These challenges are now open benchmarks for future research with freely available software and data.
{"title":"The First Cadenza Challenges: Using Machine Learning Competitions to Improve Music for Listeners With a Hearing Loss","authors":"Gerardo Roa-Dabike;Michael A. Akeroyd;Scott Bannister;Jon P. Barker;Trevor J. Cox;Bruno Fazenda;Jennifer Firth;Simone Graetzer;Alinka Greasley;Rebecca R. Vos;William M. Whitmer","doi":"10.1109/OJSP.2025.3578299","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578299","url":null,"abstract":"Listening to music can be an issue for those with a hearing impairment, and hearing aids are not a universal solution. This paper details the first use of an open challenge methodology to improve the audio quality of music for those with hearing loss through machine learning. The first challenge (CAD1) had 9 participants. The second was a 2024 ICASSP grand challenge (ICASSP24), which attracted 17 entrants. The challenge tasks concerned demixing and remixing pop/rock music to allow a personalized rebalancing of the instruments in the mix, along with amplification to correct for raised hearing thresholds. The software baselines provided for entrants to build upon used two state-of-the-art demix algorithms: Hybrid Demucs and Open-Unmix. Objective evaluation used HAAQI, the Hearing-Aid Audio Quality Index. No entries improved on the best baseline in CAD1. It is suggested that this arose because demixing algorithms are relatively mature, and recent work has shown that access to large (private) datasets is needed to further improve performance. Learning from this, for ICASSP24 the scenario was made more difficult by using loudspeaker reproduction and specifying gains to be applied before remixing. This also made the scenario more useful for listening through hearing aids. Nine entrants scored better than the best ICASSP24 baseline. Most of the entrants used a refined version of Hybrid Demucs and NAL-R amplification. The highest scoring system combined the outputs of several demixing algorithms in an ensemble approach. These challenges are now open benchmarks for future research with freely available software and data.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"722-734"},"PeriodicalIF":2.9,"publicationDate":"2025-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11030066","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144536564","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-09 | DOI: 10.1109/OJSP.2025.3578296
Parthasaarathy Sudarsanam;Irene Martín-Morató;Aapo Hakala;Tuomas Virtanen
This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual contents of video clips. The dataset contains 2061 video clips constituting a total of 28.8 hours. We provide up to 5 captions for the audio, visual, and audio-visual content of each clip, crowdsourced separately. Existing datasets focus on a single modality or do not provide modality-specific captions, limiting the study of how each modality contributes to overall comprehension in multimodal settings. Our dataset addresses this critical gap in multimodal research by offering a resource for studying how audio and visual content are captioned individually, as well as how audio-visual content is captioned in relation to these individual modalities. Crowdsourced audio-visual captions are prone to favor visual content over audio content. To avoid this, we use large language models (LLMs) to generate three balanced audio-visual captions for each clip based on the crowdsourced captions. We present captioning and retrieval experiments to illustrate the effectiveness of modality-specific captions in evaluating model performance. Specifically, we show that the modality-specific captions allow us to quantitatively assess how well a model understands audio and visual information from a given video. Notably, we find that a model trained on the balanced LLM-generated audio-visual captions captures audio information more effectively than a model trained on crowdsourced audio-visual captions. This model achieves a 14% higher Sentence-BERT similarity on crowdsourced audio captions compared to a model trained on crowdsourced audio-visual captions, which are typically more biased towards visual information. We also discuss the possibilities in multimodal representation learning, question answering, developing new video captioning metrics, and generative AI that this dataset unlocks. The dataset is publicly available on Zenodo and Hugging Face.
{"title":"AVCaps: An Audio-Visual Dataset With Modality-Specific Captions","authors":"Parthasaarathy Sudarsanam;Irene Martín-Morató;Aapo Hakala;Tuomas Virtanen","doi":"10.1109/OJSP.2025.3578296","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578296","url":null,"abstract":"This paper introduces AVCaps, an audio-visual dataset that contains separate textual captions for the audio, visual, and audio-visual contents of video clips. The dataset contains 2061 video clips constituting a total of 28.8 hours. We provide up to 5 captions for the audio, visual, and audio-visual content of each clip, crowdsourced separately. Existing datasets focus on a single modality or do not provide modality-specific captions, limiting the study of how each modality contributes to overall comprehension in multimodal settings. Our dataset addresses this critical gap in multimodal research by offering a resource for studying how audio and visual content are captioned individually, as well as how audio-visual content is captioned in relation to these individual modalities. Crowdsourced audio-visual captions are prone to favor visual content over audio content. To avoid this we use large language models (LLMs) to generate three balanced audio-visual captions for each clip based on the crowdsourced captions. We present captioning and retrieval experiments to illustrate the effectiveness of modality-specific captions in evaluating model performance. Specifically, we show that the modality-specific captions allow us to quantitatively assess how well a model understands audio and visual information from a given video. Notably, we find that a model trained on the balanced LLM-generated audio-visual captions captures audio information more effectively compared to a model trained on crowdsourced audio-visual captions. This model achieves a 14% higher Sentence-BERT similarity on crowdsourced audio captions compared to a model trained on crowdsourced audio-visual captions, which are typically more biased towards visual information. We also discuss the possibilities in multimodal representation learning, question answering, developing new video captioning metrics, and generative AI that this dataset unlocks. The dataset is available publicly at Zenodo and Hugging Face.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"691-704"},"PeriodicalIF":2.9,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11029114","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144511206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-09 | DOI: 10.1109/OJSP.2025.3578273
Rakib Ul Haque;Panagiotis Markopoulos
Federated Learning (FL) is a decentralized machine learning (ML) approach where multiple clients collaboratively train a shared model over several update rounds without exchanging local data. As in centralized learning, determining hyperparameters (HPs) such as the learning rate and batch size remains challenging yet critical for model performance. Current adaptive HP-tuning methods are often domain-specific and heavily influenced by initialization. Moreover, model accuracy often improves slowly, requiring many update rounds. This slow improvement is particularly problematic for FL, where each update round incurs high communication costs in addition to computation and energy costs. In this work, we introduce FLAUTO, the first method to perform dynamic HP-tuning simultaneously at both local (client) and global (server) levels. This dual-level adaptation directly addresses critical bottlenecks in FL, including slow convergence, client heterogeneity, and high communication costs, distinguishing it from existing approaches. FLAUTO leverages training loss and relative local model deviation as novel metrics, enabling robust and dynamic hyperparameter adjustments without reliance on initial guesses. By prioritizing high performance in early update rounds, FLAUTO significantly reduces communication and energy overhead, which are key challenges in FL deployments. Comprehensive experimental studies on image classification and object detection tasks demonstrate that FLAUTO consistently outperforms state-of-the-art methods, establishing its efficacy and broad applicability.
{"title":"Federated Learning With Automated Dual-Level Hyperparameter Tuning","authors":"Rakib Ul Haque;Panagiotis Markopoulos","doi":"10.1109/OJSP.2025.3578273","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578273","url":null,"abstract":"Federated Learning (FL) is a decentralized machine learning (ML) approach where multiple clients collaboratively train a shared model over several update rounds without exchanging local data. Similar to centralized learning, determining hyperparameters (HPs) like learning rate and batch size remains challenging yet critical for model performance. Current adaptive HP-tuning methods are often domain-specific and heavily influenced by initialization. Moreover, model accuracy often improves slowly, requiring many update rounds. This slow improvement is particularly problematic for FL, where each update round incurs high communication costs in addition to computation and energy costs. In this work, we introduce FLAUTO, the first method to perform dynamic HP-tuning simultaneously at both local (client) and global (server) levels. This dual-level adaptation directly addresses critical bottlenecks in FL, including slow convergence, client heterogeneity, and high communication costs, distinguishing it from existing approaches. FLAUTO leverages training loss and relative local model deviation as novel metrics, enabling robust and dynamic hyperparameter adjustments without reliance on initial guesses. By prioritizing high performance in early update rounds, FLAUTO significantly reduces communication and energy overhead—key challenges in FL deployments. Comprehensive experimental studies on image classification and object detection tasks demonstrate that FLAUTO consistently outperforms state-of-the-art methods, establishing its efficacy and broad applicability.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"795-802"},"PeriodicalIF":2.9,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11029096","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144634874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-09 | DOI: 10.1109/OJSP.2025.3578300
Jiuxu Chen;Nupur Thakur;Sachin Chhabra;Baoxin Li
Predicting near-future human actions in videos has become a focal point of research, driven by applications such as assistive robotics, collaborative AI services, and surveillance video analysis. However, the challenge lies in deciphering the complex spatial-temporal dynamics inherent in typical video feeds. While existing works excel in constrained settings with fine-grained action ground-truth labels, the general unavailability of such labeling at the frame level poses a significant hurdle. In this paper, we present an innovative solution to anticipate future human actions without relying on any form of supervision. Our approach involves generating pseudo-labels for video frames through the clustering of frame-wise visual features. These pseudo-labels are then input into a temporal sequence modeling module that learns to predict future actions in terms of pseudo-labels. In addition to the action anticipation method, we propose GreedyMapper, an evaluation scheme built on many-to-one mapping that provides a practical solution to a task existing mapping algorithms struggle to address. In comprehensive experiments on demanding real-world cooking datasets, our unsupervised method outperforms weakly supervised approaches by a significant margin on the 50Salads dataset. When applied to the Breakfast dataset, our approach yields strong performance compared to the baselines in an unsupervised setting and delivers results competitive with (weakly) supervised methods under a similar setting.
{"title":"Unsupervised Action Anticipation Through Action Cluster Prediction","authors":"Jiuxu Chen;Nupur Thakur;Sachin Chhabra;Baoxin Li","doi":"10.1109/OJSP.2025.3578300","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3578300","url":null,"abstract":"Predicting near-future human actions in videos has become a focal point of research, driven by applications such as human-helping robotics, collaborative AI services, and surveillance video analysis. However, the inherent challenge lies in deciphering the complex spatial-temporal dynamics inherent in typical video feeds. While existing works excel in constrained settings with fine-grained action ground-truth labels, the general unavailability of such labeling at the frame level poses a significant hurdle. In this paper, we present an innovative solution to anticipate future human actions without relying on any form of supervision. Our approach involves generating pseudo-labels for video frames through the clustering of frame-wise visual features. These pseudo-labels are then input into a temporal sequence modeling module that learns to predict future actions in terms of pseudo-labels. Apart from the action anticipation method, we propose an innovative evaluation scheme GreedyMapper, a unique many-to-one mapping scheme that provides a practical solution to the many-to-one mapping challenge, a task that existing mapping algorithms struggle to address. Through comprehensive experimentation conducted on demanding real-world cooking datasets, our unsupervised method demonstrates superior performance compared to weakly-supervised approaches by a significant margin on the 50Salads dataset. When applied to the Breakfast dataset, our approach yields strong performance compared to the baselines in an unsupervised setting and delivers competitive results to (weakly) supervised methods under a similar setting.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"641-650"},"PeriodicalIF":2.9,"publicationDate":"2025-06-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11029147","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144366940","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-06-06 | DOI: 10.1109/OJSP.2025.3577503
Heedong Do;Namyoon Lee;Angel Lozano
An estimation method is presented for polynomial phase signals, i.e., those adopting the form of a complex exponential whose phase is polynomial in its indices. Transcending the scope of existing techniques, the proposed estimator can handle an arbitrary number of dimensions and an arbitrary set of polynomial degrees along each dimension; the only requirement is that the number of observations per dimension exceeds the highest degree thereon. Embodied by a highly compact sequential algorithm, this estimator is efficient at high signal-to-noise ratios (SNRs), exhibiting a computational complexity that is strictly linear in the number of observations and at most quadratic in the number of polynomial terms. To reinforce the performance at low and medium SNRs, where any phase estimator is bound to be hampered by the inherent ambiguity caused by phase wrappings, suitable functionalities are incorporated and shown to be highly effective.
{"title":"Multidimensional Polynomial Phase Estimation","authors":"Heedong Do;Namyoon Lee;Angel Lozano","doi":"10.1109/OJSP.2025.3577503","DOIUrl":"https://doi.org/10.1109/OJSP.2025.3577503","url":null,"abstract":"An estimation method is presented for polynomial phase signals, i.e., those adopting the form of a complex exponential whose phase is polynomial in its indices. Transcending the scope of existing techniques, the proposed estimator can handle an arbitrary number of dimensions and an arbitrary set of polynomial degrees along each dimension; the only requirement is that the number of observations per dimension exceeds the highest degree thereon. Embodied by a highly compact sequential algorithm, this estimator is efficient at high signal-to-noise ratios (SNRs), exhibiting a computational complexity that is strictly linear in the number of observations and at most quadratic in the number of polynomial terms. To reinforce the performance at low and medium SNRs, where any phase estimator is bound to be hampered by the inherent ambiguity caused by phase wrappings, suitable functionalities are incorporated and shown to be highly effective.","PeriodicalId":73300,"journal":{"name":"IEEE open journal of signal processing","volume":"6 ","pages":"651-681"},"PeriodicalIF":2.9,"publicationDate":"2025-06-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11027552","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144367013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}