Pub Date: 2024-11-12 | DOI: 10.1109/LSP.2024.3496579
Shanyu Dong;Jin Liu;Jianxin Wang
Handwriting images are commonly used to diagnose Parkinson's disease because they are intuitive and easy to acquire. However, existing methods have not explored the potential of fusing different handwriting image sources for diagnosis. To address this issue, this study proposes a hybrid fusion approach that exploits the visual information derived from different handwriting images and handwriting templates, significantly enhancing the performance in diagnosing Parkinson's disease. The proposed method involves several key steps. First, the preprocessed handwriting images undergo pixel-level fusion using a Laplacian transformation. Next, the fused and original images are fed separately into a pre-trained CNN to extract visual features. Finally, feature-level fusion is performed by concatenating the feature vectors taken from the flatten layer, and the fused feature vectors are input into an SVM to obtain the classification results. Our experimental results show that the proposed method achieves excellent performance using only visual features from images, reaching 95.45% accuracy on NewHandPD. Furthermore, the results obtained on our own dataset verify the strong generalizability of the proposed approach.
{"title":"Diagnosis of Parkinson's Disease Based on Hybrid Fusion Approach of Offline Handwriting Images","authors":"Shanyu Dong;Jin Liu;Jianxin Wang","doi":"10.1109/LSP.2024.3496579","DOIUrl":"https://doi.org/10.1109/LSP.2024.3496579","url":null,"abstract":"Handwriting images are commonly used to diagnose Parkinson's disease due to their intuitive nature and easy accessibility. However, existing methods have not explored the potential of the fusion of different handwriting image sources for diagnosis. To address this issue, this study proposes a hybrid fusion approach that makes use of the visual information derived from different handwriting images and handwriting templates, significantly enhancing the performance in diagnosing Parkinson's disease. The proposed method involves several key steps. Initially, different preprocessed handwriting images undergo pixel-level fusion using Laplacian transformation. Subsequently, the fused and original images are fed into a pre-trained CNN separately to extract visual features. Finally, feature-level fusion is performed by concatenating the feature vectors extracted from the flatten layer, and the fused feature vectors are input into SVM to obtain classification results. Our experimental results validate that the proposed method achieves excellent performance by only utilizing visual features from images, with 95.45% accuracy on the NewHandPD. Furthermore, the results obtained on our dataset verify the strong generalizability of the proposed approach.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"31 ","pages":"3179-3183"},"PeriodicalIF":3.2,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142671993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-domain slot filling is a widely explored problem in spoken language understanding (SLU) that requires a model to transfer between domains under data-sparse conditions. Dominant two-step hierarchical models first extract slot entities and then compute the similarity between slot description-based prototypes and the last hidden layer of the slot entity, selecting the closest prototype as the predicted slot type. However, these models use only slot descriptions as prototypes, which limits robustness. Moreover, they pay little attention to the inherent knowledge in the slot entity embedding and therefore suffer from overfitting. In this letter, we propose a Robust Multi-prototypes Aware Integration (RMAI) method for zero-shot cross-domain slot filling. RMAI uses more robust slot entity-based prototypes together with the inherent knowledge in the slot entity embedding to improve classification performance and reduce the risk of overfitting. Furthermore, a multi-prototypes aware integration approach is proposed to effectively combine the proposed slot entity-based prototypes with the slot description-based prototypes. Experimental results on the SNIPS dataset demonstrate the strong performance of RMAI.
{"title":"Robust Multi-Prototypes Aware Integration for Zero-Shot Cross-Domain Slot Filling","authors":"Shaoshen Chen;Peijie Huang;Zhanbiao Zhu;Yexing Zhang;Yuhong Xu","doi":"10.1109/LSP.2024.3495561","DOIUrl":"https://doi.org/10.1109/LSP.2024.3495561","url":null,"abstract":"Cross-domain slot filling is a widely explored problem in spoken language understanding (SLU), which requires the model to transfer between different domains under data sparsity conditions. Dominant two-step hierarchical models first extract slot entities and then calculate the similarity score between slot description-based prototypes and the last hidden layer of the slot entity, selecting the closest prototype as the predicted slot type. However, these models only use slot descriptions as prototypes, which lacks robustness. Moreover, these approaches have less regard for the inherent knowledge in the slot entity embedding to suffer from the issue of overfitting. In this letter, we propose a Robust Multi-prototypes Aware Integration (RMAI) method for zero-shot cross-domain slot filling. In RMAI, more robust slot entity-based prototypes and inherent knowledge in the slot entity embedding are utilized to improve the classification performance and alleviate the risk of overfitting. Furthermore, a multi-prototypes aware integration approach is proposed to effectively integrate both our proposed slot entity-based prototypes and the slot description-based prototypes. Experimental results on the SNIPS dataset demonstrate the well performance of RMAI.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"31 ","pages":"3169-3173"},"PeriodicalIF":3.2,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142672126","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Few-shot adaptation of Generative Adversarial Networks (GANs) under distributional shift is generally achieved via regularized retraining or latent-space adaptation. While the former methods offer fast inference, the latter generate more diverse images. This work aims to achieve the best of both regimes in a principled manner via a Bayesian reformulation of the GAN objective. We highlight a hidden expectation term over GAN parameters that is often overlooked but is critical in few-shot settings. This observation justifies prepending a latent adapter network (LAN) to a pre-trained GAN, and we propose a sampling procedure over the parameters of the LAN (called SoLAD) to compute the usually ignored hidden expectation. SoLAD enables fast generation of high-quality samples from multiple few-shot target domains using a GAN pre-trained on a single source domain.
{"title":"SoLAD: Sampling Over Latent Adapter for Few Shot Generation","authors":"Arnab Kumar Mondal;Piyush Tiwary;Parag Singla;Prathosh A.P.","doi":"10.1109/LSP.2024.3496822","DOIUrl":"https://doi.org/10.1109/LSP.2024.3496822","url":null,"abstract":"Few-shot adaptation of Generative Adversarial Networks (GANs) under distributional shift is generally achieved via regularized retraining or latent space adaptation. While the former methods offer fast inference, the latter generate diverse images. This work aims to solve these issues and achieve the best of both regimes in a principled manner via Bayesian reformulation of the GAN objective. We highlight a hidden expectation term over GAN parameters, that is often overlooked but is critical in few-shot settings. This observation helps us justify prepending a latent adapter network (LAN) before a pre-trained GAN and propose a sampling procedure over the parameters of LAN (called SoLAD) to compute the usually-ignored hidden expectation. SoLAD enables fast generation of quality samples from multiple few-shot target domains using a GAN pre-trained on a single source domain.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"31 ","pages":"3174-3178"},"PeriodicalIF":3.2,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142672064","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-11 | DOI: 10.1109/LSP.2024.3495578
Jaeuk Lee;Yoonsoo Shin;Joon-Hyuk Chang
Most non-autoregressive text-to-speech (TTS) models acquire the target phoneme duration (target duration) from internal or external aligners, transforming the speech-phoneme alignment produced by the aligner into the target duration. Since this transformation is not differentiable, the gradient of the loss function that maximizes the TTS model's likelihood of the speech (e.g., mel spectrogram or waveform) cannot be propagated to the target duration. In other words, the target duration is produced without regard to the TTS model's likelihood of the speech. Hence, we introduce a differentiable duration refinement that produces a learnable target duration which maximizes the likelihood of the speech. The proposed method locates each phoneme boundary by internal division, with the boundary determined so as to improve the performance of the TTS model. Additionally, we propose a duration distribution loss to enhance the performance of the duration predictor. We take JETS, a representative end-to-end TTS model, as our baseline and apply the proposed methods to it. Experimental results show that the proposed method outperforms the baseline in terms of subjective naturalness and character error rate.
{"title":"Differentiable Duration Refinement Using Internal Division for Non-Autoregressive Text-to-Speech","authors":"Jaeuk Lee;Yoonsoo Shin;Joon-Hyuk Chang","doi":"10.1109/LSP.2024.3495578","DOIUrl":"https://doi.org/10.1109/LSP.2024.3495578","url":null,"abstract":"Most non-autoregressive text-to-speech (TTS) models acquire target phoneme duration (target duration) from internal or external aligners. They transform the speech-phoneme alignment produced by the aligner into the target duration. Since this transformation is not differentiable, the gradient of the loss function that maximizes the TTS model's likelihood of speech (e.g., mel spectrogram or waveform) cannot be propagated to the target duration. In other words, the target duration is produced regardless of the TTS model's likelihood of speech. Hence, we introduce a differentiable duration refinement that produces a learnable target duration for maximizing the likelihood of speech. The proposed method uses an internal division to locate the phoneme boundary, which is determined to improve the performance of the TTS model. Additionally, we propose a duration distribution loss to enhance the performance of the duration predictor. Our baseline model is JETS, a representative end-to-end TTS model, and we apply the proposed methods to the baseline model. Experimental results show that the proposed method outperforms the baseline model in terms of subjective naturalness and character error rate.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"31 ","pages":"3154-3158"},"PeriodicalIF":3.2,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142671988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-07 | DOI: 10.1109/LSP.2024.3493799
Zhengyi Liu;Longzhen Wang;Xianyong Fang;Zhengzheng Tu;Linbo Wang
A light field camera can reconstruct 3D scenes from captured multi-focus images that contain rich spatial geometric information, enhancing applications in stereoscopic photography, virtual reality, and robotic vision. In this work, a state-of-the-art salient object detection model for multi-focus light field images, called LFSamba, is introduced; it is built on four main insights: (a) Efficient feature extraction, where SAM is used to extract modality-aware discriminative features; (b) Inter-slice relation modeling, leveraging Mamba to capture long-range dependencies across multiple focal slices and thus extract implicit depth cues; (c) Inter-modal relation modeling, utilizing Mamba to integrate all-focus and multi-focus images, enabling mutual enhancement; (d) Weakly supervised learning capability, developing a scribble annotation dataset from an existing pixel-level mask dataset and establishing the first scribble-supervised baseline for light field salient object detection.
{"title":"LFSamba: Marry SAM With Mamba for Light Field Salient Object Detection","authors":"Zhengyi Liu;Longzhen Wang;Xianyong Fang;Zhengzheng Tu;Linbo Wang","doi":"10.1109/LSP.2024.3493799","DOIUrl":"https://doi.org/10.1109/LSP.2024.3493799","url":null,"abstract":"A light field camera can reconstruct 3D scenes using captured multi-focus images that contain rich spatial geometric information, enhancing applications in stereoscopic photography, virtual reality, and robotic vision. In this work, a state-of-the-art salient object detection model for multi-focus light field images, called LFSamba, is introduced to emphasize four main insights: (a) Efficient feature extraction, where SAM is used to extract modality-aware discriminative features; (b) Inter-slice relation modeling, leveraging Mamba to capture long-range dependencies across multiple focal slices, thus extracting implicit depth cues; (c) Inter-modal relation modeling, utilizing Mamba to integrate all-focus and multi-focus images, enabling mutual enhancement; (d) Weakly supervised learning capability, developing a scribble annotation dataset from an existing pixel-level mask dataset, establishing the first scribble-supervised baseline for light field salient object detection.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"31 ","pages":"3144-3148"},"PeriodicalIF":3.2,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142671986","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-07 | DOI: 10.1109/LSP.2024.3493793
Xiangjie Ding;Zhi Zhao;Ying Yang
For binary offset carrier (BOC) signal tracking, the two-dimensional (2D) tracking method that independently tracks the code and the subcarrier has garnered significant attention. The double estimator (DE) and the double phase estimator (DPE) are prominent approaches. However, the performance of the DE suffers under limited front-end bandwidths and sampling rates, while the DPE, which treats the subcarrier as a sine wave, neglects the side lobes, leading to performance degradation. This letter introduces the Binomial Harmonic Approximation DPE (BH-DPE), which uses two phase-locked loops to track the first and third harmonics of the subcarrier. By applying a weighted combination of correlation values, the BH-DPE effectively reduces the coherent output signal-to-noise ratio (SNR) loss, and it enhances ranging accuracy by combining the delay estimates from both harmonics. Theoretical analysis and simulations show that the BH-DPE outperforms both the DE and the DPE in terms of SNR loss and ranging accuracy under constrained front-end bandwidths and sampling rates, and it approaches the DE while exceeding the DPE under wide front-end bandwidths.
{"title":"Binomial Harmonic Approximation Double-Phase Estimator Tracking for BOC Modulated Signals","authors":"Xiangjie Ding;Zhi Zhao;Ying Yang","doi":"10.1109/LSP.2024.3493793","DOIUrl":"https://doi.org/10.1109/LSP.2024.3493793","url":null,"abstract":"For binary offset carrier (BOC) signal tracking, the Two-Dimensional (2D) tracking method that independently tracks the code and subcarrier has garnered significant attention. The double estimator (DE) and the double phase estimator (DPE) are prominent approaches. However, the performance of the DE suffers under limited front-end bandwidths and sampling rates. The DPE, which treats the subcarrier as a sine wave, neglects side lobes, leading to performance degradation. This letter introduces the Binomial Harmonic Approximation DPE (BH-DPE), which uses two phase lock loops to track the first and third harmonics of the subcarrier. By applying a weighted combination of correlation values, the BH-DPE effectively reduces coherent output signal-to-noise ratio (SNR) loss and enhances ranging accuracy through combined delay estimations from both the harmonics. Theoretical analysis and simulations show that the BH-DPE outperforms both the DE and the DPE in terms of SNR loss and ranging accuracy under constrained front-end bandwidths and sampling rates, and approaches the DE while exceeds the DPE under wide front-end bandwidths.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"31 ","pages":"3139-3143"},"PeriodicalIF":3.2,"publicationDate":"2024-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142672065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-11-07 | DOI: 10.1109/LSP.2024.3493801
Ionut Schiopu;Radu Ciprian Bilcu
The letter proposes an efficient context-adaptive lossless compression method for encoding event frame sequences. A first contribution is the use of a deep ternary tree, indexed by the current pixel position's context, as the context-tree model selector. The arithmetic codec encodes each ternary symbol using the probability distribution of the associated context-tree-leaf model. A second contribution is a novel context design based on several frames, where the context order controls the codec's complexity. A third contribution is a model search procedure that replaces the context-tree prune-and-encode strategy by searching for the closest "mature" context model among lower-order context-tree models. The experimental evaluation shows that the proposed method provides an improved coding performance of 34.34% and a runtime up to $5.18\times$ smaller.
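A sketch of the context-modeling side only, under assumed thresholds: adaptive ternary-symbol counts are kept per context, and when the full-order context has seen too few symbols the model falls back to the nearest lower-order "mature" context, in the spirit of the model search described above. No arithmetic coder is included, and the context definition is a placeholder.

```python
# Illustrative sketch of context-adaptive ternary-symbol modeling with fallback to
# the nearest lower-order "mature" context; thresholds are assumptions.
from collections import defaultdict

SYMBOLS = 3                      # e.g., {no event, positive event, negative event}
MATURITY = 20                    # minimum symbol count before a context model is trusted

counts = defaultdict(lambda: [1] * SYMBOLS)     # Laplace-initialised counts per context

def model_for(context: tuple) -> list:
    """Return the counts of the highest-order suffix of `context` that is mature."""
    for order in range(len(context), -1, -1):
        c = counts[context[len(context) - order:]]
        if sum(c) >= SYMBOLS + MATURITY or order == 0:
            return c
    return counts[()]

def probabilities(context: tuple) -> list:
    """Symbol probabilities that would be fed to the arithmetic coder."""
    c = model_for(context)
    total = sum(c)
    return [x / total for x in c]

def update(context: tuple, symbol: int) -> None:
    """Update the models of all context orders after coding `symbol`."""
    for order in range(len(context) + 1):
        counts[context[order:]][symbol] += 1

# Toy usage over a short ternary stream with a 2-symbol context:
stream = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
ctx = (0, 0)
for s in stream:
    p = probabilities(ctx)                      # probabilities used to encode `s`
    update(ctx, s)
    ctx = (ctx[1], s)
```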