Harnessing Depth Gradients: A New Framework for Precise RGB-D Instance Segmentation
Renjie Zhou; Qingsong Hu; Meiling Wang
Pub Date: 2025-11-12 | DOI: 10.1109/LSP.2025.3632238 | IEEE Signal Processing Letters, vol. 32, pp. 4429-4433
To address the suboptimal fusion of depth data in RGB-D instance segmentation, we propose a novel framework with two synergistic modules. The Depth Gradient Guidance Module (DGGM) provides fine-grained boundary cues by processing an explicit depth gradient map. Concurrently, the Enhanced Depth-Sensitive Attention Module (E-DSAM) adaptively captures scene context using a lightweight predictor to make its attention mechanism dynamic. Extensive experiments on the NYUv2-48 dataset validate our approach, which achieves 26.8 mAP (a 4.1-point improvement over a strong baseline) and generates qualitatively superior masks. The code is available at https://github.com/TheoBald200814/RGB-D-Instance-Segmentation.

IEMFormer: Internal and External Multi-Fusion Transformer for Indoor RGB-D Semantic Segmentation
Kaidi Hu; Wei Li; Guangwei Gao; Ruigang Yang
Pub Date: 2025-11-11 | DOI: 10.1109/LSP.2025.3631433 | IEEE Signal Processing Letters, vol. 32, pp. 4424-4428
Effectively fusing and complementing RGB and depth modalities while mitigating image noise is a critical challenge in RGB-D semantic segmentation. In this letter, we propose a novel Internal and External Multi-fusion Transformer (IEMFormer) to address this issue. IEMFormer incorporates stage-specific fusion strategies to enhance modal complementarity. For internal fusion, we integrate a fusion unit within the traditional Transformer block, combining matching tokens from both modalities on a pixel-by-pixel basis. For external fusion, the proposed External Adaptive Cross-modal Fusion (EACF) module filters dual-modal features across both spatial and channel dimensions, adaptively weighting complementary channel information and robustly aggregating spatial patterns from both modalities to integrate multimodal information. Additionally, the Global Self-attention Guided Fusion (GSGF) module in the decoder refines the fused features from earlier stages, effectively suppressing noise; it leverages high-level semantic features to guide the refinement and incorporates an active noise suppression mechanism to prevent overfitting to dominant, noisy features. Extensive experiments on the NYUv2 and SUN RGB-D datasets demonstrate that IEMFormer achieves highly competitive performance in accurately understanding indoor scenes.

Towards Greedy Iterative Adversarial Attack With Distortion Maps Against Deep Face Recognition
Peng Gao; Jiu-Ao Zhu; Wen-Hua Qin
Pub Date: 2025-11-11 | DOI: 10.1109/LSP.2025.3631427 | IEEE Signal Processing Letters, vol. 32, pp. 4369-4373
Existing deep learning-based face recognition models are vulnerable to adversarial attacks due to their inherent network fragility. However, current attack methods generate adversarial examples that often suffer from low visual quality and poor transferability. To address these issues, this paper proposes a novel adversarial attack method, G-FRadv, combining greedy iteration with multi-scale distortion maps to enhance both the attack performance and the visual quality of the adversarial examples. Specifically, G-FRadv first fuses images from different scales to obtain multiple distortion maps. These maps are then partitioned, and the disturbance weight map is coupled with the iteratively sorted gradient information. Finally, the adversarial perturbations generated by different distortion maps are fused and applied to the original image. Experimental results show that the proposed G-FRadv method achieves an average attack success rate 11.38% higher than noise-based methods, and 26.53% higher than makeup-based attack methods, while maintaining better visual quality.

RIS-Aided Channel Estimation for Multi-User MIMO mmWave Systems Under Practical Hybrid Architecture With Direct Path
Qiuyuan Chen; Liuchang Zhuo; Taihao Zhang; Cunhua Pan; Hong Ren; Jiangzhou Wang; Ruidong Li; Changhong Wang
Pub Date: 2025-11-11 | DOI: 10.1109/LSP.2025.3631381 | IEEE Signal Processing Letters, vol. 32, pp. 4364-4368
This paper proposes a novel channel estimation protocol for a reconfigurable intelligent surface (RIS) aided multi-user (MU) multi-input multi-output (MIMO) millimeter wave (mmWave) system under the hybrid architecture, where direct channels between the base station (BS) and the user equipment (UE) exist. The protocol consists of two stages that estimate the direct and cascaded channels, respectively. In Stage I, besides the direct channels, the angles of arrival (AoA) and angles of departure (AoD) of the cascaded channels are estimated. Stage II estimates the overall cascaded channels and is divided into two sub-stages: in sub-stage I, the cascaded channel of a typical UE is estimated; in sub-stage II, the cascaded channels of all remaining UEs are estimated. Simulation results demonstrate that the proposed method has lower pilot overhead and achieves higher accuracy than existing benchmark approaches.

An Analysis of 2-D Signals With Fast Varying Instantaneous Frequencies: Extending Complex-Lag Time-Frequency Distribution
Xinhang Zhu; Yicheng Jiang; Zitao Liu; Yong Wang; Yun Zhang; Qinglong Hua
Pub Date: 2025-11-11 | DOI: 10.1109/LSP.2025.3631428 | IEEE Signal Processing Letters, vol. 32, pp. 4384-4388
The radar echo of a target can be modeled as a 2-D signal whose variables are the intra-pulse sampling time (fast-time) and the inter-pulse sampling time (slow-time). The fast-time instantaneous frequency (FIF) and slow-time instantaneous frequency (SIF) of the signal are modulated by the target's slant range. When the target undergoes complex motion, the radar echo becomes a 2-D signal with fast-varying instantaneous frequencies (IFs), whose IF analysis is challenging. To solve this issue, an extending complex-lag time-frequency distribution (ECTD) is introduced. The ECTD is a 3-D distribution for 2-D signals based on the traditional complex-lag time-frequency distribution (CTD), and it inherits the CTD's good performance in handling fast-varying IFs. By introducing complex lags in both the fast-time and slow-time dimensions, the ECTD can accurately estimate the SIF and FIF of a 2-D signal with fast-varying IFs. Finally, a reduced-interference realization of the ECTD, achieved by introducing a frequency-domain filter, is given. Numerical examples validate the effectiveness of the ECTD.

Learning-Based Geometric Tracking Control for Rigid Body Dynamics
Jiawei Tang; Shilei Li; Lisheng Kuang; Ling Shi
Pub Date: 2025-11-11 | DOI: 10.1109/LSP.2025.3631429 | IEEE Signal Processing Letters, vol. 32, pp. 4419-4423
This letter investigates learning-based geometric tracking control for rigid body dynamics without precise system model parameters. Our approach leverages recent advancements in geometric optimal control and data-driven techniques to develop a learning-based tracking solution. By adopting a Lie algebra formulation to transform the tracking dynamics into a vector space, we estimate unknown parameters from data, achieving robust and efficient learning. Compared to existing learning-based methods, our approach ensures geometric consistency and delivers superior tracking accuracy. Simulation results validate the effectiveness of our method.

Contrastive Attention-Based Network for Self-Supervised Point Cloud Completion
Seema Kumari; Preyum Kumar; Srimanta Mandal; Shanmuganathan Raman
Pub Date: 2025-11-11 | DOI: 10.1109/LSP.2025.3631424 | IEEE Signal Processing Letters, vol. 32, pp. 4444-4448
Point cloud completion aims to reconstruct complete 3D shapes from partial observations, often requiring multiple views or complete data for training. In this paper, we propose an attention-driven, self-supervised autoencoder network that completes 3D point clouds from a single partial observation. Multi-head self-attention captures robust contextual relationships, while residual connections in the autoencoder enhance geometric feature learning. In addition, we incorporate a contrastive learning-based loss, which encourages the network to better distinguish structural patterns even in highly incomplete observations. Experimental results on benchmark datasets demonstrate that the proposed approach achieves state-of-the-art performance in single-view point cloud completion.

RF-REN: RGB-Frequency Relation Exploration Network for Micro-Expression Recognition
Jiateng Liu; Hengcan Shi; Yaonan Wang; Yuan Zong
Pub Date: 2025-11-06 | DOI: 10.1109/LSP.2025.3630085 | IEEE Signal Processing Letters, vol. 32, pp. 4439-4443
Micro-expression recognition (MER) has drawn increasing attention in recent years due to its ability to reveal the true feelings people want to hide. The key challenge in MER lies in the subtle motions, which are hard to capture yet crucial for recognition. Existing methods usually address this problem by magnifying all motions across the whole face and temporal sequence. However, micro-expressions (MEs) involve only a few facial areas and several temporal snippets. The all-motion magnification in previous methods cannot precisely capture these local ME motion patterns and can easily cause spatial as well as temporal distortions, which significantly decrease MER accuracy. In this paper, we propose an RGB-Frequency Relation Exploration Network (RF-REN), which enhances the subtle motions in refined local ME cues by exploring spatial and temporal relations in both the RGB and frequency domains. Specifically, we first decompose the ME video into the RGB and frequency domains and conduct temporal division according to different motion stages to cover various ME local patterns. Secondly, we construct an adaptive local-global relation exploration (LGRE) module to explore local relation cues in the spatial appearance and temporal dynamics of both domains. Finally, we propose an RGB-Frequency routing strategy to fuse the RGB and frequency cues, aggregating spatial-temporal local-global information and enhancing subtle motions for MER. Extensive experiments on three databases (CASME II, SAMM, and SMIC) show that the proposed model outperforms other state-of-the-art methods.

ABFE-Net: Attention-Based Feature Enhancement Network for Few-Shot Point Cloud Classification
Kaidi Hu; Mao Ye; Yi Wu; Wei Li; Ruigang Yang
Pub Date: 2025-11-03 | DOI: 10.1109/LSP.2025.3627853 | IEEE Signal Processing Letters, vol. 32, pp. 4414-4418
Few-shot 3D point cloud classification has attracted significant attention due to the challenge of acquiring large-scale labeled data. Existing methods often employ network backbones tailored for fully-supervised learning, which can lead to suboptimal performance in few-shot settings. To tackle these limitations, we propose ABFE-Net, a novel method for point cloud classification with few-shot learning principles. We comprehensively summarize the drawbacks of existing network architectures into four aspects: contextual information loss, channel redundancy, overfitting, and insufficient hidden feature extraction. Accordingly, we design novel modules, such as the Attention-based Dilated Mix-up Module (ADMM) and Attention-based Comprehensive Feature Learning (ACFL), to enhance the network by addressing those issues effectively. Experiments on multiple public datasets demonstrate that ABFE-Net achieves state-of-the-art performance with superior generalization.

ISTD-DLA: Industrial Scene Text Detection Method Based on Dynamic Local-Aware Aggregation Network
Mingdi Hu; Yize Yang; Helin Yu; Bingyi Jing
Pub Date: 2025-10-31 | DOI: 10.1109/LSP.2025.3627114 | IEEE Signal Processing Letters, vol. 32, pp. 4264-4268
Industrial scene text detection is challenging due to cluttered backgrounds, rust occlusions, and arbitrary orientations. We introduce ISTD-DLA, a dynamic local-aware aggregation network for industrial text detection. The framework integrates two synergistic components: (i) a dynamic local-aware feature learner that fuses shape-aware and Bayar convolutions to enrich fine-grained structural cues; and (ii) a local feature aggregation module that forms superpixel-based proposals and uses cross-attention to iteratively exchange context between pixels and superpixels, enabling more precise localization in complex scenes. To maintain efficiency, we prune implicit-mapping submodules at inference, reducing complexity without degrading accuracy. On the MPSC benchmark, ISTD-DLA attains an F-measure of 86.2% at 32 FPS, demonstrating a favorable accuracy–efficiency trade-off and robust practicality for industrial applications.