Pub Date: 2025-01-03, DOI: 10.1109/LSP.2024.3525398
Ye Liu;Tianhao Shi;Mingliang Zhai;Jun Liu
In 3D skeleton-based action recognition, the limited availability of supervised data has driven interest in self-supervised learning methods. The reconstruction paradigm based on the masked auto-encoder (MAE) is an effective and mainstream self-supervised learning approach. However, recent studies indicate that MAE models tend to focus on features within a certain frequency range, which may result in the loss of important information. To address this issue, we propose a frequency-decoupled MAE. Specifically, by incorporating a scale-specific frequency feature reconstruction module, we leverage frequency information as a direct and explicit reconstruction target, which augments the MAE's capability to discern and accurately reproduce diverse frequency attributes within the data. Moreover, to address the unstable gradient updates caused by the more complex optimization objective introduced by frequency reconstruction, we introduce a dual-path network combined with an exponential moving average (EMA) parameter-updating strategy to stabilize the training process. Extensive experiments demonstrate the effectiveness of the proposed method.
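As a rough illustration of the two ingredients named above, the sketch below shows a generic EMA teacher update and one plausible explicit frequency target obtained from the temporal FFT of a skeleton sequence; the function names, the momentum value, and the choice of FFT magnitude are assumptions, not details taken from the letter.

```python
import torch

def ema_update(student, teacher, momentum=0.999):
    # EMA parameter update: the teacher path tracks a slow moving
    # average of the student path, which is what typically stabilizes
    # training in a dual-path (student/teacher) network.
    with torch.no_grad():
        for p_s, p_t in zip(student.parameters(), teacher.parameters()):
            p_t.mul_(momentum).add_(p_s, alpha=1.0 - momentum)

def frequency_target(joints):
    # joints: (frames, joints, 3) skeleton sequence. The FFT magnitude
    # along the temporal axis is one plausible explicit frequency-domain
    # reconstruction target for the decoder.
    return torch.fft.rfft(joints, dim=0).abs()
```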
{"title":"Frequency Decoupled Masked Auto-Encoder for Self-Supervised Skeleton-Based Action Recognition","authors":"Ye Liu;Tianhao Shi;Mingliang Zhai;Jun Liu","doi":"10.1109/LSP.2024.3525398","DOIUrl":"https://doi.org/10.1109/LSP.2024.3525398","url":null,"abstract":"In 3D skeleton-based action recognition, the limited availability of supervised data has driven interest in self-supervised learning methods. The reconstruction paradigm using masked auto-encoder (MAE) is an effective and mainstream self-supervised learning approach. However, recent studies indicate that MAE models tend to focus on features within a certain frequency range, which may result in the loss of important information. To address this issue, we propose a frequency decoupled MAE. Specifically, by incorporating a scale-specific frequency feature reconstruction module, we delve into leveraging frequency information as a direct and explicit target for reconstruction, which augments the MAE's capability to discern and accurately reproduce diverse frequency attributes within the data. Moreover, in order to address the issue of unstable gradient updates caused by more complex optimization objectives with frequency reconstruction, we introduce a dual-path network combined with an exponential moving average (EMA) parameter updating strategy to guide the model in stabilizing the training process. We have conducted extensive experiments which have demonstrated the effectiveness of the proposed method.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"546-550"},"PeriodicalIF":3.2,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142975891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-03, DOI: 10.1109/LSP.2024.3524120
Yusen Wang;Xiaohong Qian;Wujie Zhou
Audio–visual segmentation (AVS) is a challenging task that focuses on segmenting sound-producing objects within video frames by leveraging audio signals. Existing convolutional neural networks (CNNs) and Transformer-based methods extract features separately from modality-specific encoders and then use fusion modules to integrate the visual and auditory features. We propose an effective Transformer-prompted network, TPNet, which utilizes prompt learning with a Transformer to guide the CNN in addressing AVS tasks. Specifically, during feature encoding, we incorporate a frequency-based prompt-supplement module to fine-tune and enhance the encoded features through frequency-domain methods. Furthermore, during audio–visual fusion, we integrate a self-supplementing cross-fusion module that uses self-attention, two-dimensional selective scanning, and cross-attention mechanisms to merge and enhance audio–visual features effectively. The prompt features undergo the same processing in cross-modal fusion, further refining the fused features to achieve more accurate segmentation results. Finally, we apply self-knowledge distillation to the network, further enhancing the model performance. Extensive experiments on the AVSBench dataset validate the effectiveness of TPNet.
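The frequency-domain refinement idea can be pictured with a minimal sketch: transform a feature map with a 2-D FFT, rescale the spectrum, and transform back. The function name, the per-channel gain, and the default identity gain are assumptions for illustration only, not the paper's prompt-supplement module.

```python
import torch

def frequency_enhance(feat, gain=None):
    # feat: (B, C, H, W) feature map. A per-channel gain applied in the
    # frequency domain emphasizes or suppresses spectral components
    # before transforming back to the spatial domain.
    spec = torch.fft.rfft2(feat, norm="ortho")
    if gain is None:
        gain = torch.ones(feat.shape[1], 1, 1)  # identity gain by default
    spec = spec * gain
    return torch.fft.irfft2(spec, s=feat.shape[-2:], norm="ortho")
```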
{"title":"Transformer-Prompted Network: Efficient Audio–Visual Segmentation via Transformer and Prompt Learning","authors":"Yusen Wang;Xiaohong Qian;Wujie Zhou","doi":"10.1109/LSP.2024.3524120","DOIUrl":"https://doi.org/10.1109/LSP.2024.3524120","url":null,"abstract":"Audio–visual segmentation (AVS) is a challenging task that focuses on segmenting sound-producing objects within video frames by leveraging audio signals. Existing convolutional neural networks (CNNs) and Transformer-based methods extract features separately from modality-specific encoders and then use fusion modules to integrate the visual and auditory features. We propose an effective Transformer-prompted network, TPNet, which utilizes prompt learning with a Transformer to guide the CNN in addressing AVS tasks. Specifically, during feature encoding, we incorporate a frequency-based prompt-supplement module to fine-tune and enhance the encoded features through frequency-domain methods. Furthermore, during audio–visual fusion, we integrate a self-supplementing cross-fusion module that uses self-attention, two-dimensional selective scanning, and cross-attention mechanisms to merge and enhance audio–visual features effectively. The prompt features undergo the same processing in cross-modal fusion, further refining the fused features to achieve more accurate segmentation results. Finally, we apply self-knowledge distillation to the network, further enhancing the model performance. Extensive experiments on the AVSBench dataset validate the effectiveness of TPNet.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"516-520"},"PeriodicalIF":3.2,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142937893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-03, DOI: 10.1109/LSP.2024.3525406
Qingpeng Liang;Linsong Du;Yanzhi Wu;Zheng Ma
This letter focuses on the optimal sum secure degree of freedom (SDoF) in a two-way wiretap channel (TW-WC), wherein two legitimate full-duplex multiple-antenna nodes cooperate with each other while being wiretapped simultaneously by a multiple-antenna eavesdropper. The goal is to find the optimal sum SDoF pertaining to the secret-key capacity of the TW-WC. First, we analyze the upper and lower bounds of the optimal sum SDoF by establishing their equivalence to the expression of the optimal SDoF corresponding to the secrecy rate for the TW-WC. Subsequently, in scenarios where the legitimate nodes are configured with an equal number of transmit and receive antennas, we show that the upper and lower bounds of the optimal SDoF converge. Furthermore, the findings suggest that a higher SDoF can be achieved than in existing works, thereby heralding an enhancement in secure spectral efficiency.
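For context, the sum SDoF studied here is the standard high-SNR pre-log of the corresponding capacity; a generic statement of that definition (not the letter's specific bounds) is:

```latex
% Generic definition of the sum secure degrees of freedom as the pre-log
% factor of the sum secret-key capacity C_{k,\Sigma}(P) at transmit
% power P; the letter's bounds are not reproduced here.
d_{s,\Sigma} = \lim_{P \to \infty} \frac{C_{k,\Sigma}(P)}{\log_{2} P}
```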
{"title":"Secure Degree of Freedom Bound of Secret-Key Capacity for Two-Way Wiretap Channel","authors":"Qingpeng Liang;Linsong Du;Yanzhi Wu;Zheng Ma","doi":"10.1109/LSP.2024.3525406","DOIUrl":"https://doi.org/10.1109/LSP.2024.3525406","url":null,"abstract":"This letter focuses on the optimal sum secure degree of freedom (SDoF) in a two-way wiretap channel (TW-WC), wherein two legitimate full-duplex multiple-antenna nodes cooperate with each other and are wiretapped by a multiple antenna eavesdropper simultaneously. It aims to find the optimal sum SDoF pertaining to secret-key capacity for the TW-WC. First, we analyze the upper bound and lower bounds of the optimal sum SDoF by establishing their equivalence to the expression of the optimal SDoF corresponding to the secrecy rate for the TW-WC. Subsequently, in scenarios where the legitimate nodes are configured with an equal number of transmit and receive antennas, it is elucidated that the upper and lower bounds of the optimal SDoF converge. Furthermore, the findings suggest that a higher SDoF can be achieved than the existing works, thereby heralding an enhancement in secure spectral efficiency.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"581-585"},"PeriodicalIF":3.2,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-03, DOI: 10.1109/LSP.2024.3524096
Zhiguo Ding
The key idea of hybrid non-orthogonal multiple access (NOMA) is to allow users to use bandwidth resources that they cannot access in orthogonal multiple access (OMA) based legacy networks, while still guaranteeing compatibility with the legacy network. However, in a conventional hybrid NOMA downlink network, some users have access to more bandwidth resources than others, which leads to a potential performance loss. This raises the question: what if all users could access the same amount of bandwidth resources? This letter focuses on a simple two-user scenario and develops analytical and simulation results which reveal that, for this scenario, conventional hybrid NOMA remains an optimal transmission strategy.
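As background for the rate comparisons implied above, the textbook two-user downlink NOMA rates under superposition coding and successive interference cancellation (generic background, not the letter's hybrid-NOMA formulation) read:

```latex
% Two-user downlink NOMA with |h_1|^2 \le |h_2|^2, powers P_1 + P_2 = P,
% and noise power \sigma^2. User 1 treats user 2's signal as noise;
% user 2 removes user 1's signal via SIC before decoding its own.
R_1 = \log_2\!\left(1 + \frac{P_1 |h_1|^2}{P_2 |h_1|^2 + \sigma^2}\right),
\qquad
R_2 = \log_2\!\left(1 + \frac{P_2 |h_2|^2}{\sigma^2}\right)
```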
{"title":"A Study on the Optimality of Downlink Hybrid NOMA","authors":"Zhiguo Ding","doi":"10.1109/LSP.2024.3524096","DOIUrl":"https://doi.org/10.1109/LSP.2024.3524096","url":null,"abstract":"The key idea of hybrid non-orthogonal multiple access (NOMA) is to allow users to use the bandwidth resources to which they cannot have access in orthogonal multiple access (OMA) based legacy networks while still guaranteeing its compatibility with the legacy network. However, in a conventional hybrid NOMA downlink network, some users have access to more bandwidth resources than others, which leads to a potential performance loss. So what if the users can access the same amount of bandwidth resources? This letter focuses on a simple two-user scenario, and develops analytical and simulation results to reveal that for this considered scenario, conventional hybrid NOMA is still an optimal transmission strategy.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"511-515"},"PeriodicalIF":3.2,"publicationDate":"2025-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142937892","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-01-02, DOI: 10.1109/LSP.2024.3525397
Zefei Chen;Yongjie Lin;Jianmin Xu;Kai Lu;Yanfang Shou
Pedestrian detection in crowded scenes is a challenging task. This study presents a straightforward and effective method, called Det RCNN, that detects pedestrians in crowded scenes while also pairing the body and head of each pedestrian. On the one hand, pedestrians' heads have stable shapes and distinct features. On the other hand, heads are usually positioned higher in the image, so even in crowded scenes they are rarely completely occluded. Therefore, this study equips the DETR model with a Head Decoder (HDecoder) parallel to the Decoder. The HDecoder takes the head knowledge generated in the Decoder phase as head queries. Simultaneously, the HDecoder uses a key-query mechanism to search the entire image for the body bounding boxes corresponding to the head queries. Lastly, the proposed method conducts a straightforward IoU (Intersection over Union) matching between the body bounding boxes produced in the Decoder and HDecoder phases. Because the HDecoder resembles the second stage of the Faster RCNN model, this paper terms the method Det RCNN (DETR RCNN). Compared to Deformable DETR, experimental results on the CrowdHuman dataset show that the proposed model increases AP$_{m}$ from 53.02 to 53.87. Furthermore, the mMR$^{-2}$ decreases from 52.46 to 42.32 compared to the existing BFJ.
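The final IoU matching step can be illustrated with a minimal sketch; the greedy strategy and the 0.5 threshold below are assumptions, since the letter only states that a straightforward IoU matching is performed.

```python
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2); returns intersection-over-union.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_body_boxes(decoder_boxes, hdecoder_boxes, thresh=0.5):
    # Greedily pair each Decoder body box with the best-overlapping,
    # not-yet-used HDecoder body box above the IoU threshold.
    matches, used = [], set()
    for i, box_d in enumerate(decoder_boxes):
        best_j, best_v = -1, thresh
        for j, box_h in enumerate(hdecoder_boxes):
            if j not in used and iou(box_d, box_h) > best_v:
                best_j, best_v = j, iou(box_d, box_h)
        if best_j >= 0:
            used.add(best_j)
            matches.append((i, best_j))
    return matches
```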
{"title":"Detecting Pedestrian With Incomplete Head Feature in Crowded Situation Based on Transformer","authors":"Zefei Chen;Yongjie Lin;Jianmin Xu;Kai Lu;Yanfang Shou","doi":"10.1109/LSP.2024.3525397","DOIUrl":"https://doi.org/10.1109/LSP.2024.3525397","url":null,"abstract":"Pedestrian detection in crowded situation is a challenging task. This study presents a straightforward and effective method called Det RCNN to detect pedestrians in crowded situation, while also pairing the body and head of individual pedestrian. On the one hand, pedestrians' heads have their characteristics of stable shape and distinct feature. On the other hand, their heads are usually positioned higher in image, so even in crowded situation, it is difficult to completely cover the pedestrians' heads. Therefore, this study equipped the DETR model with a Head Decoder (HDecoder) parallel to the Decoder. HDecoder takes the head knowledge generated in the Decoder phase as head queries. Simultaneously, the HDecoder uses a key-query mechanism to search the entire image for the body bounding boxes corresponding to the head queries. Lastly, the proposed method conducts a straightforward IOU (Intersection over Union) matching between the body bounding boxes produced in the Decoder and HDecoder phases. This HDecoder resembles the second stage of the Faster RCNN model, hence this paper termed it Det RCNN (DETR RCNN). Compared to Deformable DETR, the experimental results on the CrowdHuman dataset show that the proposed model can increase AP<inline-formula><tex-math>$_{m}$</tex-math></inline-formula> from 53.02 to 53.87. Furthermore, the mMR<inline-formula><tex-math>$^{-2}$</tex-math></inline-formula> decreased from 52.46 to 42.32 compared to the existing BFJ.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"576-580"},"PeriodicalIF":3.2,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993084","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-26, DOI: 10.1109/LSP.2024.3523223
Qinghai Zheng;Yixin Zhuang
As a widely used method in signal processing, Principal Component Analysis (PCA) performs both the compression and the recovery of high-dimensional data by leveraging linear transformations. Regarding the robustness of PCA, how to discriminate between correct samples and outliers is a crucial and challenging issue. In this paper, we present a general model that conducts PCA via a non-decreasing concave regularized minimization, termed PCA-NCRM for short. Unlike most existing PCA methods, which learn the linear transformations by minimizing the recovery errors between the recovered data and the original data in the least-squares sense, our model adopts a monotonically non-decreasing concave function to enhance the model's ability to distinguish correct samples from outliers. Specifically, PCA-NCRM increases the attention paid to samples with smaller recovery errors while diminishing the attention paid to samples with larger recovery errors. The proposed minimization problem can be efficiently addressed by an iterative re-weighting optimization. Experimental results on several datasets show the effectiveness of our model.
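A minimal sketch of the iterative re-weighting idea is given below, assuming centered data and an illustrative concave penalty phi(e) = log(1 + e); the paper only requires a monotonically non-decreasing concave function, so the specific penalty, the weight formula, and the loop length are assumptions.

```python
import numpy as np

def pca_ncrm_sketch(X, k, n_iter=30, eps=1e-6):
    # X: (n_samples, n_features) centered data; k: subspace dimension.
    n, d = X.shape
    w = np.ones(n)
    for _ in range(n_iter):
        # Weighted PCA step: top-k eigenvectors of the weighted covariance.
        C = (X * w[:, None]).T @ X / w.sum()
        U = np.linalg.eigh(C)[1][:, -k:]          # (d, k) basis
        # Squared recovery error per sample under the current basis.
        err = ((X - X @ U @ U.T) ** 2).sum(axis=1)
        # Re-weighting by phi'(err) = 1 / (1 + err): large-error samples
        # (likely outliers) get small weights in the next iteration.
        w = 1.0 / (1.0 + err + eps)
    return U
```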
{"title":"Non-Decreasing Concave Regularized Minimization for Principal Component Analysis","authors":"Qinghai Zheng;Yixin Zhuang","doi":"10.1109/LSP.2024.3523223","DOIUrl":"https://doi.org/10.1109/LSP.2024.3523223","url":null,"abstract":"As a widely used method in signal processing, Principal Component Analysis (PCA) performs both the compression and the recovery of high dimensional data by leveraging the linear transformations. Considering the robustness of PCA, how to discriminate correct samples and outliers in PCA is a crucial and challenging issue. In this paper, we present a general model, which conducts PCA via a non-decreasing concave regularized minimization and is termed PCA-NCRM for short. Different from most existing PCA methods, which learn the linear transformations by minimizing the recovery errors between the recovered data and the original data in the least squared sense, our model adopts the monotonically non-decreasing concave function to enhance the ability of model in distinguishing correct samples and outliers. To be specific, PCA-NCRM enlarges the attention to samples with smaller recovery errors and diminishes the attention to samples with larger recovery errors at the same time. The proposed minimization problem can be efficiently addressed by employing an iterative re-weighting optimization. Experimental results on several datasets show the effectiveness of our model.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"486-490"},"PeriodicalIF":3.2,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142938330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-26, DOI: 10.1109/LSP.2024.3522851
Jiawei Song;Dengpan Ye;Yunming Zhang
Artificial Intelligence Generated Content (AIGC) techniques, represented by text-to-image generation, have enabled the malicious use of deep forgeries, raising concerns about the trustworthiness of multimedia content. Experimental results demonstrate that traditional forgery detection methods adapt poorly to diffusion-model-generated scenarios, while existing diffusion-specific techniques lack robustness against post-processed images. In response, we propose the Trinity Detector, which integrates coarse-grained text features from a Contrastive Language-Image Pretraining (CLIP) encoder with fine-grained artifacts in the pixel domain to achieve semantic-level image detection, significantly enhancing model robustness. To enhance sensitivity to diffusion-generated image features, a Multi-spectral Channel Attention Fusion Unit (MCAF) is designed. It adaptively fuses multiple preset frequency bands, dynamically adjusting the weight of each band, and then integrates the fused frequency-domain information with the spatial co-occurrence of the two modalities. Extensive experiments validate that our Trinity Detector improves transfer detection performance across black-box datasets by an average of 14.3% over previous diffusion detection models and demonstrates superior performance on post-processed image datasets.
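One way to picture the band-weighting behaviour of such a unit is sketched below: the feature spectrum is split into preset bands, a per-band weight is predicted from pooled statistics, and the weighted bands are recombined. The module name, the layer sizes, and the use of `rfft2`/`chunk` are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class MCAFSketch(nn.Module):
    # Illustrative multi-spectral channel-attention fusion over preset
    # frequency-band groups of the feature spectrum.
    def __init__(self, channels, bands=4):
        super().__init__()
        self.bands = bands
        self.weight_mlp = nn.Sequential(
            nn.Linear(channels, channels // 2),
            nn.ReLU(inplace=True),
            nn.Linear(channels // 2, bands),
            nn.Softmax(dim=-1),
        )

    def forward(self, feat):                        # feat: (B, C, H, W)
        spec = torch.fft.rfft2(feat, norm="ortho")  # (B, C, H, W//2+1)
        chunks = torch.chunk(spec, self.bands, dim=1)   # preset channel bands
        w = self.weight_mlp(feat.mean(dim=(2, 3)))      # (B, bands) dynamic weights
        fused = torch.cat(
            [c * w[:, i].view(-1, 1, 1, 1) for i, c in enumerate(chunks)], dim=1
        )
        return torch.fft.irfft2(fused, s=feat.shape[-2:], norm="ortho")
```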
{"title":"Trinity Detector: Text-Assisted and Attention Mechanisms Based Spectral Fusion for Diffusion Generation Image Detection","authors":"Jiawei Song;Dengpan Ye;Yunming Zhang","doi":"10.1109/LSP.2024.3522851","DOIUrl":"https://doi.org/10.1109/LSP.2024.3522851","url":null,"abstract":"Artificial Intelligence Generated Content (AIGC) techniques, represented by text-to-image generation, have led to a malicious use of deep forgeries, raising concerns about the trustworthiness of multimedia content. Experimental results demonstrate that traditional forgery detection methods perform poorly in adapting to diffusion model-generated scenarios, while existing diffusion-specific techniques lack robustness against post-processed images. In response, we propose the Trinity Detector, which integrates coarse-grained text features from a Contrastive Language-Image Pretraining (CLIP) encoder with fine-grained artifacts in the pixel domain to achieve semantic-level image detection, significantly enhancing model robustness. To enhance sensitivity to diffusion-generated image features, a Multi-spectral Channel Attention Fusion Unit (MCAF) is designed. It adaptively fuses multiple preset frequency bands, dynamically adjusting the weight of each band, and then integrates the fused frequency-domain information with the spatial co-occurrence of the two modalities. Extensive experiments validate that our Trinity Detector improves transfer detection performance across black-box datasets by an average of 14.3% compared to previous diffusion detection models and demonstrating superior performance on post-processed image datasets.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"501-505"},"PeriodicalIF":3.2,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142937890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-26, DOI: 10.1109/LSP.2024.3522856
Xianyun Sun;Caiyong Wang;Yunlong Wang;Jianze Wei;Zhenan Sun
While Vision Transformer (ViT)-based methods have significantly improved the performance of various vision tasks in natural scenes, progress in iris recognition remains limited. In addition, the human iris has unique characteristics that are distinct from natural scenes. To remedy this, this paper investigates a dedicated Transformer framework, termed IrisFormer, for iris recognition and attempts to improve accuracy by combining the contextual modeling ability of ViT with iris-specific optimization to learn robust, fine-grained, and discriminative features. Specifically, to achieve rotation invariance in iris recognition, we employ relative position encoding instead of the usual absolute position encoding for each iris image token, and a horizontal pixel-shifting strategy is used during training for data augmentation. Then, to enhance the model's robustness against local distortions such as occlusions and reflections, we randomly mask some tokens during training to force the model to learn representative identity features from only part of the image. Finally, considering that fine-grained features are more discriminative in iris recognition, we retain the entire token sequence for patch-wise feature matching instead of using the standard single classification token. Experiments on three popular datasets demonstrate that the proposed framework achieves competitive performance under both intra- and inter-dataset testing protocols.
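The horizontal pixel-shifting augmentation can be sketched very compactly, assuming the usual normalized (unrolled) iris strip in which the horizontal axis is the angular direction; the function name and shift range are illustrative.

```python
import numpy as np

def horizontal_shift(iris_strip, max_shift=16):
    # iris_strip: (H, W) normalized iris image. Because the unrolled
    # strip is periodic along the angular (horizontal) axis, a circular
    # shift corresponds to an in-plane rotation of the eye, which is the
    # variation that relative position encoding is meant to tolerate.
    shift = np.random.randint(-max_shift, max_shift + 1)
    return np.roll(iris_strip, shift, axis=1)
```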
{"title":"IrisFormer: A Dedicated Transformer Framework for Iris Recognition","authors":"Xianyun Sun;Caiyong Wang;Yunlong Wang;Jianze Wei;Zhenan Sun","doi":"10.1109/LSP.2024.3522856","DOIUrl":"https://doi.org/10.1109/LSP.2024.3522856","url":null,"abstract":"While Vision Transformer (ViT)-based methods have significantly improved the performance of various vision tasks in natural scenes, progress in iris recognition remains limited. In addition, the human iris contains unique characters that are distinct from natural scenes. To remedy this, this paper investigates a dedicated Transformer framework, termed IrisFormer, for iris recognition and attempts to improve the accuracy by combining the contextual modeling ability of ViT and iris-specific optimization to learn robust, fine-grained, and discriminative features. Specifically, to achieve rotation invariance in iris recognition, we employ relative position encoding instead of regular absolute position encoding for each iris image token, and a horizontal pixel-shifting strategy is utilized during training for data augmentation. Then, to enhance the model's robustness against local distortions such as occlusions and reflections, we randomly mask some tokens during training to force the model to learn representative identity features from only part of the image. Finally, considering that fine-grained features are more discriminative in iris recognition, we retain the entire token sequence for patch-wise feature matching instead of using the standard single classification token. Experiments on three popular datasets demonstrate that the proposed framework achieves competitive performance under both intra- and inter-dataset testing protocols.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"431-435"},"PeriodicalIF":3.2,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142925368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-26, DOI: 10.1109/LSP.2024.3523227
Man Wang;Zheng Shi;Yunfei Li;Xianda Wu;Weiqiang Tan
This letter proposes a novel hybrid automatic repeat request with chase combining assisted sparse code multiple access (HARQ-CC-SCMA) scheme. Depending on whether the same superimposed packet is retransmitted, synchronous and asynchronous retransmission modes are considered. Moreover, a factor graph aggregation (FGA) method is used for multi-user detection. Specifically, a large-scale factor graph is constructed by combining all the received superimposed signals, and the message passing algorithm (MPA) is applied to calculate the log-likelihood ratios (LLRs). Monte Carlo simulations are performed to show that FGA surpasses bit-level combining (BLC) and HARQ with incremental redundancy (HARQ-IR) in the synchronous mode. Moreover, FGA performs better than BLC in the high signal-to-noise ratio (SNR) region in the asynchronous mode. However, FGA in the asynchronous mode is worse than BLC at low SNR, because significant error propagation is induced by the presence of failed messages after the maximum allowable number of HARQ rounds.
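For reference, the bit-level-combining baseline mentioned above reduces, in the simplest BPSK-over-AWGN setting, to summing the per-round LLRs of the same packet; the sketch below shows only that generic operation and is not the FGA detector, which instead runs the MPA on an aggregated factor graph.

```python
import numpy as np

def chase_combine_llr(received_rounds, noise_var):
    # received_rounds: list of received vectors, one per HARQ round, all
    # carrying the same BPSK-modulated packet over AWGN. With chase
    # combining, the per-round LLRs (2*y/sigma^2) simply add up.
    llr = np.zeros_like(received_rounds[0], dtype=float)
    for y in received_rounds:
        llr += 2.0 * y / noise_var
    return llr
```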
{"title":"Synchronous and Asynchronous HARQ-CC Assisted SCMA Schemes","authors":"Man Wang;Zheng Shi;Yunfei Li;Xianda Wu;Weiqiang Tan","doi":"10.1109/LSP.2024.3523227","DOIUrl":"https://doi.org/10.1109/LSP.2024.3523227","url":null,"abstract":"This letter proposes a novel hybrid automatic repeat request with chase combining assisted sparse code multiple access (HARQ-CC-SCMA) scheme. Depending on whether the same superimposed packet is retransmitted, synchronous and asynchronous modes are considered for retransmissions. Moreover, a factor graph aggregation (FGA) method is used for multi-user detection. Specifically, a large-scale factor graph is constructed by combining all the received superimposed signals and message passing algorithm (MPA) is applied to calculate log-likelihood ratio (LLR). Monte Carlo simulations are preformed to show that FGA surpasses bit-level combining (BLC) and HARQ with incremental redundancy (HARQ-IR) in synchronous mode. Moreover, FGA performs better than BLC at high signal-to-noise ratio (SNR) region in asynchronous mode. However, FGA in asynchronous mode is worse than BLC at low SNR, because significant error propagation is induced by the presence of failed messages after the maximum allowable HARQ rounds.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"506-510"},"PeriodicalIF":3.2,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142937891","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-26, DOI: 10.1109/LSP.2024.3522850
Zhi Chen;Lijun Liu;Zhen Yu
With their computational efficiency and desirable tracking accuracy, discriminative correlation filters (DCFs) have been widely utilized in UAV tracking, leading to substantial progress. However, in some intricate scenarios (e.g., similar objects or backgrounds, background clutter), DCF-based trackers are prone to generating low-reliability response maps influenced by surrounding response distractors, thereby reducing tracking robustness. Furthermore, the limited computational resources and endurance of UAV platforms require DCF-based trackers to deliver real-time and reliable tracking performance. To address these issues, a dynamic distractor-repressed correlation filter (DDRCF) is proposed. First, a dynamic distractor-repressed regularization is introduced into the DCF framework. Then, a new objective function is formulated to tune the penalty intensity of the distractor-repressed regularization module. Furthermore, a novel response-map variation evaluation mechanism is used to dynamically tune the distractor-repressed regularization coefficient to adapt to omnipresent appearance variations. Extensive experiments on four prevailing UAV benchmarks, i.e., UAV123@10fps, UAVTrack112, DTB70, and UAVDT, validate that the proposed DDRCF tracker is superior to other state-of-the-art trackers. Moreover, the proposed method achieves a tracking speed of 59 FPS on a CPU, meeting the requirements of real-time aerial tracking.
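As background for the response-map terminology, a DCF produces its response by correlating filter and search-patch features, which is an element-wise product in the frequency domain; the variation measure below (change of successive peak values) is only an illustrative stand-in for the letter's evaluation mechanism.

```python
import numpy as np

def dcf_response(filter_f, patch_f):
    # filter_f, patch_f: 2-D FFTs of the learned filter and the search
    # patch features. Correlation in the spatial domain is an
    # element-wise conjugate product in the frequency domain.
    return np.real(np.fft.ifft2(np.conj(filter_f) * patch_f))

def response_variation(prev_resp, cur_resp):
    # Illustrative variation score between consecutive frames' response
    # maps, based on the change of the peak value.
    return abs(cur_resp.max() - prev_resp.max()) / (prev_resp.max() + 1e-9)
```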
{"title":"Learning Dynamic Distractor-Repressed Correlation Filter for Real-Time UAV Tracking","authors":"Zhi Chen;Lijun Liu;Zhen Yu","doi":"10.1109/LSP.2024.3522850","DOIUrl":"https://doi.org/10.1109/LSP.2024.3522850","url":null,"abstract":"With high-efficiency computing advantages and desirable tracking accuracy, discriminative correlation filters (DCFs) have been widely utilized in UAV tracking, leading to substantial progress. However, in some intricate scenarios (e.g., similar objects or backgrounds, background clutter), DCF-based trackers are prone to generating low-reliability response maps influenced by surrounding response distractors, thereby reducing tracking robustness. Furthermore, the limited computational resources and endurance on UAV platforms drive DCF-based trackers to exhibit real-time and reliable tracking performance. To address the aforementioned issues, a dynamic distractor-repressed correlation filter (DDRCF) is proposed. First, a dynamic distractor-repressed regularization is introduced into the DCF framework. Then, a new objective function is formulated to tune the penalty intensity of the distractor-repressed regularization module. Furthermore, a novel response map variation evaluation mechanism is used to dynamically tune the distractor-repressed regularization coefficient to adapt to omnipresent appearance variations. Considerable and exhaustive experiments on four prevailing UAV benchmarks, i.e., UAV123@10fps, UAVTrack112, DTB70 and UAVDT, validate that the proposed DDRCF tracker is superior to other state-of-the-art trackers. Moreover, the proposed method can achieve a tracking speed of 59 FPS on a CPU, meeting the requirements of real-time aerial tracking.","PeriodicalId":13154,"journal":{"name":"IEEE Signal Processing Letters","volume":"32 ","pages":"616-620"},"PeriodicalIF":3.2,"publicationDate":"2024-12-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142993444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}