Pub Date: 2026-03-01. Epub Date: 2026-01-05. DOI: 10.1016/j.image.2026.117478
Feng Chen, Jielong He, Yang Liu, Xiwen Qu
Existing text-based person search methods struggle with complex cross-modal interactions and often fail to capture subtle semantic nuances. To address this, we propose a novel Fine-grained Cross-modal Semantic Alignment (FCSA) framework that enhances accuracy and robustness in text-based person search. FCSA introduces two key components: the Cross-Modal Reconstruction Strategy (CMRS) and the Saliency-Guided Masking Mechanism (SGMM). CMRS facilitates feature alignment by leveraging incomplete visual and textual features, promoting bidirectional reasoning across modalities, and enhancing fine-grained semantic understanding. SGMM further refines performance by dynamically focusing on salient visual patches and critical text tokens, thereby improving discriminative region perception and image–text matching precision. Our approach outperforms existing state-of-the-art methods, achieving mean Average Precision (mAP) scores of 69.72%, 43.78%, and 48.78% on CUHK-PEDES, ICFG-PEDES, and RSTPReid, respectively. Source code is available at https://github.com/flychen321/FCSA.
Title: Text-based person search via fine-grained cross-modal semantic alignment. Signal Processing: Image Communication, Volume 142, Article 117478.
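As a rough illustration of the saliency-guided masking idea described in the abstract, the sketch below scores image patches by their similarity to a pooled text embedding and masks the most salient ones, so a model would have to reconstruct them from cross-modal context. The scoring rule, mask ratio, and tensor shapes are assumptions for illustration, not the FCSA implementation.

import torch
import torch.nn.functional as F

def saliency_guided_mask(patch_tokens, text_feature, mask_ratio=0.3):
    """Mask the image patches most similar to the sentence embedding so a
    model must reconstruct them from the remaining cross-modal context.
    patch_tokens: (B, N, D) visual patch embeddings
    text_feature: (B, D) pooled textual embedding
    """
    B, N, D = patch_tokens.shape
    sim = torch.einsum("bnd,bd->bn",
                       F.normalize(patch_tokens, dim=-1),
                       F.normalize(text_feature, dim=-1))      # patch-text saliency
    k = max(1, int(mask_ratio * N))
    top = sim.topk(k, dim=1).indices                           # most salient patches
    mask = torch.zeros(B, N, dtype=torch.bool)
    mask.scatter_(1, top, torch.ones_like(top, dtype=torch.bool))
    masked = patch_tokens.masked_fill(mask.unsqueeze(-1), 0.0)
    return masked, mask

# toy usage
tokens, text = torch.randn(2, 196, 512), torch.randn(2, 512)
masked_tokens, mask = saliency_guided_mask(tokens, text)
print(mask.sum(dim=1))   # number of masked patches per image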
Pub Date: 2026-03-01. Epub Date: 2025-12-24. DOI: 10.1016/j.image.2025.117468
Mohammad Roueinfar, Mohammad Hossein Kahaei
Noise is a major challenge for high-resolution image reconstruction in Inverse Synthetic Aperture Radar (ISAR) with sparse apertures at low Signal-to-Noise Ratios (SNRs). It is well known that image resolution in the range and azimuth dimensions is governed by the bandwidth of the transmitted signal and the Coherent Processing Interval (CPI), respectively. To reduce the noise effect and thus increase the two-dimensional resolution of Unmanned Aerial Vehicle (UAV) images, we propose the Fast Reweighted Atomic Norm Denoising (FRAND) algorithm, which incorporates weighted atomic norm minimization. To solve the resulting problem efficiently, the Two-Dimensional Alternating Direction Method of Multipliers (2D-ADMM) algorithm is developed to speed up the implementation. Assuming sparse apertures for ISAR images of UAVs, we compare the proposed method with the MUltiple SIgnal Classification (MUSIC), Cadzow, and SL0 methods at different SNRs. Simulation results show the superiority of FRAND at low SNRs in terms of the Mean-Square Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index Measure (SSIM) criteria.
Title: Enhanced ISAR imaging of UAVs: Noise reduction via weighted atomic norm minimization and 2D-ADMM. Signal Processing: Image Communication, Volume 142, Article 117468.
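The FRAND solver itself (2D-ADMM over a reweighted atomic norm) is beyond a short snippet, but the reweighting principle it relies on can be seen in a much simpler setting. The sketch below applies iterative reweighting to plain l1 denoising by soft-thresholding: weights derived from the current estimate suppress small, noise-like coefficients while preserving large ones. This is an analogy for intuition under assumed parameters (lam, eps, iteration count), not the paper's algorithm.

import numpy as np

def reweighted_l1_denoise(y, lam=0.5, eps=1e-2, outer_iters=5):
    """Denoise a sparse signal y = x + noise by repeatedly solving a
    weighted soft-thresholding problem; small coefficients get large
    weights and are suppressed, large ones are preserved."""
    x = y.copy()
    for _ in range(outer_iters):
        w = 1.0 / (np.abs(x) + eps)                            # reweighting step
        thresh = lam * w
        x = np.sign(y) * np.maximum(np.abs(y) - thresh, 0.0)   # weighted prox
    return x

rng = np.random.default_rng(0)
x_true = np.zeros(64)
x_true[[5, 20, 40]] = [3.0, -2.5, 4.0]
y = x_true + 0.3 * rng.standard_normal(64)
x_hat = reweighted_l1_denoise(y)
print(np.round(x_hat[[5, 20, 40]], 2))   # spikes survive, noise is zeroed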
Memory-network-based video object segmentation algorithms store information about the target object in an external memory bank. As segmentation progresses, the size of this memory bank keeps growing, which leads to redundant feature information and degrades the efficiency of the algorithm. In addition, the key-value pairs stored in the memory bank undergo channel dimension reduction through standard convolution, which limits the representational ability of the target object features. To address these issues, this paper proposes a video object segmentation algorithm based on feature compression and attention correction. It constructs a reliable and effective memory bank that ensures efficient storage and updating of target object information, thereby reducing computational complexity and storage consumption. A dual attention mechanism over the spatial and channel dimensions is further proposed to correct feature information and enhance the representational ability of the features. Extensive experiments show that the proposed algorithm is competitive with recent mainstream algorithms.
Title: Video object segmentation based on feature compression and attention correction. Authors: Zhiqiang Hou, Jiale Dong, Chenxu Wang, Sugang Ma, Wangsheng Yu, Yuncheng Wang. DOI: 10.1016/j.image.2025.117456. Signal Processing: Image Communication, Volume 142, Article 117456.
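For readers unfamiliar with dual spatial/channel attention, the following is a minimal sketch of such a correction block in the spirit of what the abstract describes: channel attention reweights feature channels, then spatial attention reweights locations. The layer sizes, reduction ratio, and kernel size are illustrative assumptions, not the authors' architecture.

import torch
import torch.nn as nn

class DualAttention(nn.Module):
    def __init__(self, channels, reduction=8):
        super().__init__()
        # channel attention: squeeze spatial dims, excite channels
        self.channel_mlp = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # spatial attention: squeeze channels, excite locations
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (B, C, H, W)
        x = x * self.channel_mlp(x)             # reweight channels
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.max(dim=1, keepdim=True).values], dim=1)
        x = x * self.spatial_conv(pooled)       # reweight spatial positions
        return x

feat = torch.randn(1, 64, 30, 54)
print(DualAttention(64)(feat).shape)            # torch.Size([1, 64, 30, 54])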
Panoptic driving perception requires robust and efficient context understanding, which entails simultaneous semantic and instance segmentation. This paper proposes U-MobileViT, a lightweight backbone network designed to address this challenge. Our architecture combines the advantages of MobileViT, a family of Transformer-based models with high accuracy and fast processing speed, with the image segmentation structure of the U-Net model, facilitating multiscale feature fusion and accurate localization. U-MobileViT efficiently combines local and global spatial information by utilizing MobileViT blocks with Separable-Attention layers, resulting in a computationally lightweight yet effective architecture, while the U-Net structure enables efficient integration of features from different levels of the hierarchy. This synergistic combination yields rich, context-aware feature maps that are critical for accurate panoptic segmentation. Through extensive experiments on the challenging BDD100K driving dataset, we demonstrate that U-MobileViT achieves state-of-the-art performance in panoptic driving perception, outperforming existing lightweight models in both accuracy and inference speed. Our results demonstrate the potential of U-MobileViT as a robust and efficient backbone for real-time panoptic scene understanding in autonomous driving applications. Code is available at https://github.com/quyongkeomut/UMobileViT.
Title: U-MobileViT: A Lightweight Vision Transformer-based Backbone for Panoptic Driving Segmentation. Authors: Phuoc-Thinh Nguyen, The-Bang Nguyen, Phu Pham, Quang-Thinh Bui. DOI: 10.1016/j.image.2025.117461. Signal Processing: Image Communication, Volume 142, Article 117461.
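The U-Net-style fusion that the abstract credits for multiscale integration boils down to upsampling a deep, semantically rich feature map and combining it with a higher-resolution encoder feature. A minimal sketch follows; the channel counts and the single convolutional projection are illustrative choices, not the U-MobileViT configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipFusion(nn.Module):
    def __init__(self, deep_ch, skip_ch, out_ch):
        super().__init__()
        self.proj = nn.Conv2d(deep_ch + skip_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, deep, skip):
        # upsample the deep feature to the skip connection's resolution, then fuse
        deep = F.interpolate(deep, size=skip.shape[-2:], mode="bilinear",
                             align_corners=False)
        return F.relu(self.proj(torch.cat([deep, skip], dim=1)))

deep = torch.randn(1, 128, 20, 20)   # low-resolution, semantically rich
skip = torch.randn(1, 64, 40, 40)    # high-resolution encoder feature
print(SkipFusion(128, 64, 96)(deep, skip).shape)   # torch.Size([1, 96, 40, 40])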
Pub Date: 2026-03-01. Epub Date: 2026-01-15. DOI: 10.1016/j.image.2026.117484
Haijun Wang, Haoyu Qu, Lihua Qi, Zihao Su
Most advancements in unmanned aerial vehicle (UAV) tracking have focused on daytime scenarios with optimal lighting conditions. However, the unpredictable and complex noise inherent in camera systems significantly impairs the effectiveness of UAV tracking algorithms, particularly in low-light environments. To address this challenge, we introduce a novel U-shaped plug-and-play denoising network that reduces cluttered and intricate real-world noise, thereby enhancing nighttime UAV tracking performance. Specifically, the U-shaped denoising network utilizes a CNN-Transformer block as the encoder, which incorporates hybrid attention to simultaneously capture both local details and global structures. Additionally, to further improve the denoising effect, we design a wavelet-based multi-scale feature fusion block that adaptively combines features from various stages of the encoding process. Finally, we develop a multi-feature collaboration decoder to fully integrate comprehensive features through multi-head transposed cross-attention. Extensive experiments demonstrate that the proposed UHW-former achieves remarkable denoising performance and significantly enhances nighttime UAV tracking.
Title: UHW-former: U-shape hybrid transformer with wavelet-based multi-scale feature fusion for nighttime UAV tracking. Signal Processing: Image Communication, Volume 142, Article 117484.
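As background for the wavelet-based fusion block mentioned above, the sketch below computes a single-level Haar wavelet decomposition of a feature map into low-frequency (LL) and high-frequency (LH, HL, HH) subbands. It shows only the transform; how UHW-former weights and fuses the subbands across encoder stages is not reproduced here.

import torch

def haar_dwt2d(x):
    """x: (B, C, H, W) with even H, W. Returns the LL, LH, HL, HH subbands,
    each of shape (B, C, H//2, W//2)."""
    a = x[..., 0::2, 0::2]   # top-left of each 2x2 block
    b = x[..., 0::2, 1::2]   # top-right
    c = x[..., 1::2, 0::2]   # bottom-left
    d = x[..., 1::2, 1::2]   # bottom-right
    ll = (a + b + c + d) / 2
    lh = (a + b - c - d) / 2
    hl = (a - b + c - d) / 2
    hh = (a - b - c + d) / 2
    return ll, lh, hl, hh

x = torch.randn(1, 32, 64, 64)
ll, lh, hl, hh = haar_dwt2d(x)
print(ll.shape, hh.shape)   # torch.Size([1, 32, 32, 32]) twice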
Pub Date: 2026-03-01. Epub Date: 2026-01-09. DOI: 10.1016/j.image.2026.117482
Chen-Yi Lin, Su-Ho Chiu
The widespread adoption of the Internet has enhanced communication between individuals but has also increased the risk of secret messages being intercepted, drawing public attention to the security of message transmission. Image steganography has been a prominent area of research within the field of secure communication technologies. However, traditional image steganography techniques risk being compromised by steganalysis tools, leading researchers to propose the concept of coverless image steganography. In recent years, numerous coverless image steganography techniques have been developed that effectively resist steganalysis tools. However, these techniques commonly suffer from incomplete mapping of secret messages, rendering them incapable of successfully concealing the information. Furthermore, most existing coverless steganography techniques rely on cryptographic methods to protect auxiliary information, which may raise suspicion and result in interception, thereby preventing the receiver from correctly recovering the secret messages. To address these issues, this study proposes a novel coverless image steganography technique based on ring features and discrete wavelet transform (DWT) sequence mapping. This method generates feature sequences from both the spatial and frequency domains of images and employs an innovative stego image collage mechanism to transmit auxiliary information, thereby reducing the risk of interception. Experimental results demonstrate that the proposed technique significantly enhances the richness of feature sequences and the completeness of message mapping, achieving a 100% success rate on medium- and large-scale image datasets. Moreover, the proposed method exhibits superior robustness even under conditions where existing techniques suffer from low mapping success rates or prolonged mapping times.
Title: Robust coverless image steganography based on ring features and DWT sequence mapping. Signal Processing: Image Communication, Volume 142, Article 117482.
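The core of coverless steganography is mapping rather than embedding: the sender transmits an unmodified image whose intrinsic feature sequence already equals the secret bits. The toy sketch below indexes a set of images by a binary sequence derived from their own content; the block-mean feature used here is a deliberately simplified stand-in for the paper's ring/DWT features, and the database and sequence length are made up for illustration.

import numpy as np

def feature_sequence(img, n_bits=8):
    """Derive n_bits from an image by comparing consecutive block means."""
    h = img.shape[0] // (n_bits + 1)
    means = [img[i * h:(i + 1) * h].mean() for i in range(n_bits + 1)]
    return "".join("1" if means[i + 1] > means[i] else "0" for i in range(n_bits))

rng = np.random.default_rng(1)
database = {i: rng.integers(0, 256, size=(72, 72)).astype(float) for i in range(500)}

# build the mapping: bit pattern -> images that naturally carry it
index = {}
for img_id, img in database.items():
    index.setdefault(feature_sequence(img), []).append(img_id)

secret = "10110010"
carriers = index.get(secret, [])
print(f"{len(carriers)} candidate carrier images for secret {secret}")
# receiver side: recomputing feature_sequence(carrier) recovers the secret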
Pub Date: 2026-03-01. Epub Date: 2026-01-16. DOI: 10.1016/j.image.2026.117483
Francesco Barbato, Matteo Caligiuri, Pietro Zanuttigh
The development of computer vision algorithms for Unmanned Aerial Vehicle (UAV) applications in urban environments relies heavily on the availability of large-scale datasets with accurate annotations. However, collecting and annotating real-world UAV data is extremely challenging and costly. To address this limitation, we present FlyAwareV2, a novel multimodal dataset encompassing both real and synthetic UAV imagery tailored for urban scene understanding tasks. Building upon the recently introduced SynDrone and FlyAware datasets, FlyAwareV2 makes several new key contributions: (1) multimodal data (RGB, depth, semantic labels) across diverse environmental conditions, including varying weather and times of day; (2) depth maps for real samples computed via state-of-the-art monocular depth estimation; (3) benchmarks for RGB and multimodal semantic segmentation on standard architectures; (4) studies on synthetic-to-real domain adaptation to assess the generalization capabilities of models trained on the synthetic data. With its rich set of annotations and environmental diversity, FlyAwareV2 provides a valuable resource for research on UAV-based 3D urban scene understanding. Dataset link: https://medialab.dei.unipd.it/paper_data/FlyAwareV2
Title: FlyAwareV2: A multimodal cross-domain UAV dataset for urban scene understanding. Signal Processing: Image Communication, Volume 142, Article 117483.
Pub Date: 2026-03-01. Epub Date: 2025-12-23. DOI: 10.1016/j.image.2025.117466
Shaheen Raphiahmed Mujawar, Sridhar Iyer
The development of an image captioning system could make the world more accessible to persons who are blind. Recently, researchers have focused on automatically generating textual descriptions of observed images. However, autonomously creating captions for images remains difficult for computer vision and natural language processing. Hence, this article proposes an efficient automatic image captioning framework with an attentional language encoder-decoder enabled by Deep Learning (DL) models. The developed model integrates four main components: the Feature Extractor Encoder Module (FEEM), the Co-ordinated Relationship Learning Module (CRLM), the Attentional Feature Fusion Module (AFFM), and the Language Decoder Module. Region- and semantic-based feature extraction from the image is performed using a Res-Inception and Convolutional Neural Network (CNN) model. Moreover, CRLM is introduced to generate balanced relationship features, and AFFM fuses multiple levels of visual information while selectively focusing on the visual regions associated with each word prediction. An Attentional Model with Residual BiGRU (ARBiGRU) is implemented as the language decoder to generate the correct caption for the input image. The developed model is evaluated on the Flickr8k and Flickr30k datasets using the BLEU, METEOR, CIDEr, and ROUGE-L caption metrics. An ablation study over six cases assesses the effectiveness of the proposed model, and the performance analysis demonstrates that the proposed approach outperforms existing caption generation techniques.
Title: Deep learning model with co-ordinated relationship for image captioning enabled via attentional language encoder-decoder. Signal Processing: Image Communication, Volume 142, Article 117466.
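The attentional decoding described above, stripped to its essentials, computes attention weights over region features at each step, forms a visual context vector, and feeds it together with the previous word into a recurrent cell. The sketch below shows one such step with a plain GRUCell; the dimensions and single-layer cell are placeholder choices, not the ARBiGRU configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentiveDecoderStep(nn.Module):
    def __init__(self, feat_dim=512, hid_dim=512, vocab_size=10000, emb_dim=300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.attn = nn.Linear(feat_dim + hid_dim, 1)
        self.gru = nn.GRUCell(emb_dim + feat_dim, hid_dim)
        self.out = nn.Linear(hid_dim, vocab_size)

    def forward(self, prev_word, hidden, regions):
        # regions: (B, R, feat_dim) region features from the visual encoder
        B, R, _ = regions.shape
        scores = self.attn(torch.cat([regions,
                                      hidden.unsqueeze(1).expand(B, R, -1)], dim=-1))
        alpha = F.softmax(scores, dim=1)              # (B, R, 1) attention weights
        context = (alpha * regions).sum(dim=1)        # attended visual context
        hidden = self.gru(torch.cat([self.embed(prev_word), context], dim=-1), hidden)
        return self.out(hidden), hidden, alpha

step = AttentiveDecoderStep()
logits, h, alpha = step(torch.tensor([2]), torch.zeros(1, 512), torch.randn(1, 36, 512))
print(logits.shape, alpha.shape)    # torch.Size([1, 10000]) torch.Size([1, 36, 1])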
In recent years, diffusion models have achieved remarkable performance in image generation and have been widely applied, and their potential for image enhancement tasks is gradually being explored. However, when applied to underwater scenes, diffusion models designed for general image restoration struggle to achieve their expected performance. This is due to the scattering and absorption of light in underwater environments, which leave underwater images suffering from color distortion, low contrast, and haziness. These issues often co-occur within a single underwater image, making underwater image enhancement more challenging than typical image enhancement tasks. To better adapt diffusion models to underwater image enhancement, this paper proposes an underwater image enhancement method based on a latent diffusion model. The proposed model's latent encoder progressively mitigates adverse degradation factors embedded within the hidden layers while preserving essential image feature information in the latent representation, thus enabling a smoother diffusion process. Additionally, we design a gated fusion network that integrates guiding features at multiple scales, steering the network towards diffusion with superior visual quality restoration. A series of qualitative and quantitative experiments conducted on various real-world underwater image datasets demonstrate that our proposed method outperforms recent state-of-the-art methods in terms of visual effects and generalization capability, proving the effectiveness of applying diffusion models to underwater enhancement tasks.
Title: UW-SDE: Multi-scale prompt feature guided diffusion model for underwater image enhancement. Authors: Jiaxi Li, Junjun Wu, Qinghua Lu, Ningwei Qin, Shuhong Zhou, Weijian Li. DOI: 10.1016/j.image.2026.117486. Signal Processing: Image Communication, Volume 142, Article 117486.
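A gated fusion block of the general kind mentioned in the abstract can be sketched as follows: a learned sigmoid gate decides, per channel and position, how much of the guiding (prompt) feature to inject into the diffusion feature. The convolutional gate and channel count are assumptions; where such a block sits inside UW-SDE is not specified here.

import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, diff_feat, guide_feat):
        g = self.gate(torch.cat([diff_feat, guide_feat], dim=1))  # (B, C, H, W) in [0, 1]
        return g * guide_feat + (1 - g) * diff_feat               # convex blend

fused = GatedFusion(64)(torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32))
print(fused.shape)   # torch.Size([1, 64, 32, 32])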
Pub Date: 2026-03-01. Epub Date: 2025-12-23. DOI: 10.1016/j.image.2025.117465
Yan Chen, Zhongkang Jiang, Jixiang Du, Hongbo Zhang
High-quality fusion of template and search frames is essential for effective visual object tracking. However, mainstream Transformer-based trackers, whether dual-stream or single-stream, often fuse these frames indiscriminately, allowing background noise to disrupt target-specific feature extraction. To address this, we propose LTTrack (learnable token for visual tracking), an adaptive feature fusion method based on a Transformer architecture with an autoregressive encoder–decoder structure. The core innovation is a learnable token in the encoder, which processes three inputs: search tokens, template tokens, and the learnable token. This token is designed to interact with the template, enabling precise fusion and extraction of target-relevant features. Our approach adaptively fuses search and template tokens, and extensive experiments show that LTTrack achieves state-of-the-art performance across six challenging benchmarks.
Title: Learnable token for visual tracking. Signal Processing: Image Communication, Volume 142, Article 117465.
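The learnable-token idea resembles a CLS token: a trainable embedding is concatenated with the template and search tokens before the Transformer encoder so it can aggregate target-relevant information through self-attention. The sketch below illustrates that wiring; the encoder depth, width, and the way outputs are split back apart are placeholder choices, not LTTrack's actual design.

import torch
import torch.nn as nn

class LearnableTokenEncoder(nn.Module):
    def __init__(self, dim=256, depth=4, heads=8):
        super().__init__()
        self.token = nn.Parameter(torch.zeros(1, 1, dim))   # the learnable token
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, template_tokens, search_tokens):
        B = search_tokens.size(0)
        tok = self.token.expand(B, -1, -1)
        x = torch.cat([tok, template_tokens, search_tokens], dim=1)
        x = self.encoder(x)                                  # joint self-attention
        # split the jointly attended sequence back into its three parts
        n_t = template_tokens.size(1)
        return x[:, 0], x[:, 1:1 + n_t], x[:, 1 + n_t:]

enc = LearnableTokenEncoder()
tok, tmpl, srch = enc(torch.randn(2, 64, 256), torch.randn(2, 256, 256))
print(tok.shape, tmpl.shape, srch.shape)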