Natural image stitching using depth maps
Pub Date: 2025-11-25 | DOI: 10.1016/j.image.2025.117438
Tianli Liao, Nan Li
Natural image stitching aims to create a single, natural-looking mosaic from overlapping images that capture the same 3D scene from different viewing positions. Challenges inevitably arise when the scene is non-planar and captured by handheld cameras, since parallax is non-negligible in such cases. In this paper, we propose a novel image stitching method using depth maps, which generates accurately aligned mosaics despite parallax. First, we construct a robust fitting method to filter outliers from the feature matches and estimate the epipolar geometry between the input images. Then, we use the epipolar geometry to establish pixel-to-pixel correspondences between the input images and render the warped images using the proposed optimal warping. In the rendering stage, we introduce several modules to resolve mapping artifacts in the warping results and generate the final mosaic. Experimental results on three challenging datasets demonstrate that the depth maps of the input images enable our method to provide much more accurate alignment in the overlapping region and view-consistent results in the non-overlapping region. We expect our method to keep improving with the rapid progress of monocular depth estimation. The source code is available at https://github.com/tlliao/NIS_depths.
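As a rough illustration of the robust epipolar-geometry step described above (not the authors' code), the sketch below matches SIFT features and rejects outliers with RANSAC while estimating the fundamental matrix using OpenCV; the ratio threshold and RANSAC parameters are arbitrary assumptions.

```python
# Illustrative sketch only: robust feature matching and fundamental-matrix
# estimation, the kind of step the paper's pipeline begins with.
import cv2
import numpy as np

def estimate_epipolar_geometry(img1_gray, img2_gray, ratio=0.75):
    """Match SIFT features, reject outliers with RANSAC, return F and inlier points."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1_gray, None)
    kp2, des2 = sift.detectAndCompute(img2_gray, None)

    # Lowe's ratio test prunes ambiguous matches before robust fitting.
    matcher = cv2.BFMatcher(cv2.NORM_L2)
    knn = matcher.knnMatch(des1, des2, k=2)
    good = [m for m, n in knn if m.distance < ratio * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # RANSAC-based fundamental-matrix estimation filters the remaining outliers.
    F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.999)
    inliers = mask.ravel().astype(bool)
    return F, pts1[inliers], pts2[inliers]
```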
{"title":"Natural image stitching using depth maps","authors":"Tianli Liao , Nan Li","doi":"10.1016/j.image.2025.117438","DOIUrl":"10.1016/j.image.2025.117438","url":null,"abstract":"<div><div>Natural image stitching aims to create a single, natural-looking mosaic from overlapped images that capture the same 3D scene from different viewing positions. Challenges inevitably arise when the scene is non-planar and captured by handheld cameras since parallax is non-negligible in such cases. In this paper, we propose a novel image stitching method using depth maps, which generates accurate alignment mosaics against parallax. Firstly, we construct a robust fitting method to filter out the outliers in feature matches and estimate the epipolar geometry between input images. Then, we utilize epipolar geometry to establish pixel-to-pixel correspondences between the input images and render the warped images using the proposed optimal warping. In the rendering stage, we introduce several modules to solve the mapping artifacts in the warping results and generate the final mosaic. Experimental results on three challenging datasets demonstrate that the depth maps of input images enable our method to provide much more accurate alignment in the overlapping region and view-consistent results in the non-overlapping region. We believe our method will continue to work under the rapid progress of monocular depth estimation. The source code is available at <span><span>https://github.com/tlliao/NIS_depths</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"141 ","pages":"Article 117438"},"PeriodicalIF":2.7,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145625384","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rank-based transformation algorithm for image contrast adjustment
Pub Date: 2025-11-19 | DOI: 10.1016/j.image.2025.117432
Cheng-Hui Chen, Torbjörn E.M. Nordling
Performing proper image contrast adjustment without information loss is an art. Many adjustment methods exist, but their default settings are often inappropriate for the image in question, making contrast adjustment a matter of trial and error. We propose a simple method, rank-based transformation (RBT), for image contrast adjustment that requires no prior knowledge. This makes RBT an ideal first tool to apply to underexposed images. The RBT algorithm normalizes and equalizes all the intensity differences of the image over the full intensity range of the image data type, and thus assigns equal weight to all gradients. Even the state-of-the-art AI tool Cellpose visually benefits from RBT preprocessing. Our comparison of histogram normalization methods demonstrates the ability of RBT to bring out image features.
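A minimal numpy sketch of a rank-based intensity transform in the spirit described above: each distinct intensity is replaced by its rank, spread uniformly over the full range of the output data type. The exact normalization used in the paper may differ; this is only an illustration.

```python
import numpy as np

def rank_based_transform(img, out_dtype=np.uint8):
    """Map each distinct intensity to its rank, spread over the full output range.

    Every intensity step gets equal weight regardless of how far apart the
    original values were, which is the equalizing behavior described above."""
    flat = img.ravel()
    # Unique values sorted ascending; 'inverse' gives each pixel's rank index.
    values, inverse = np.unique(flat, return_inverse=True)
    n_ranks = len(values)
    max_out = np.iinfo(out_dtype).max
    # Spread ranks 0..n_ranks-1 uniformly over [0, max_out].
    lut = np.round(np.arange(n_ranks) / max(n_ranks - 1, 1) * max_out)
    return lut[inverse].reshape(img.shape).astype(out_dtype)

# Example: an underexposed image whose values occupy a narrow band.
dark = np.array([[10, 11, 11], [12, 13, 14]], dtype=np.uint8)
print(rank_based_transform(dark))  # ranks stretched to 0..255
```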
{"title":"Rank-based transformation algorithm for image contrast adjustment","authors":"Cheng-Hui Chen, Torbjörn E.M. Nordling","doi":"10.1016/j.image.2025.117432","DOIUrl":"10.1016/j.image.2025.117432","url":null,"abstract":"<div><div>Performing proper image contrast adjustment without information loss is an art. Many adjustment methods are used. The default settings are often inappropriate for the image in question rendering a contrast adjustment depending on trial and error. We propose a simple method, rank-based transformation (RBT), for image contrast adjustment that requires no prior knowledge. This makes RBT an ideal first tool to apply for underexposed images. The RBT algorithm normalizes and equalizes all the intensity differences of the image over the full intensity range of the image data type, and thus assigns equal weight to all gradients. Even the state-of-the-art AI tool Cellpose visually benefits from RBT preprocessing. Our comparison of histogram normalization methods demonstrates the ability of RBT to bring out image features.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"141 ","pages":"Article 117432"},"PeriodicalIF":2.7,"publicationDate":"2025-11-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145584161","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A comprehensive review of low-light image enhancement methods
Pub Date: 2025-11-17 | DOI: 10.1016/j.image.2025.117435
Ying Zheng, Ling Zhou, Kaijie Jin, Songlin Jin, Zheng Liang, Wenyi Zhao, Weidong Zhang
Low-light image enhancement is a key topic in image processing, aiming to improve visual quality and visibility in poorly illuminated environments while reducing noise and detail loss to extract richer information. Traditional enhancement techniques often require complex mathematical modeling, rigorous derivations, and iterative processes, which limit their practicality. In contrast, deep learning-based approaches can effectively improve image brightness and detail visibility and have gradually become the dominant technology. However, these methods still face challenges such as data distribution limitations, performance instability, and restricted generalization to diverse scenes. To address these issues, researchers have proposed various novel approaches from different perspectives. This paper provides a comprehensive review of low-light image enhancement, covering both the historical development, from early traditional techniques to advanced deep learning models, and recent research progress, including datasets, methodological innovations, and evaluation strategies. Furthermore, it discusses current limitations and outlines future research directions to promote broader application and further development in this field.
{"title":"A comprehensive review of low-light image enhancement methods","authors":"Ying Zheng , Ling Zhou , Kaijie Jin , Songlin Jin , Zheng Liang , Wenyi Zhao , Weidong Zhang","doi":"10.1016/j.image.2025.117435","DOIUrl":"10.1016/j.image.2025.117435","url":null,"abstract":"<div><div>Low-light image enhancement is a key topic in image processing, aiming to improve visual quality and visibility in poorly illuminated environments while reducing noise and detail loss to extract richer information. Traditional enhancement techniques often require complex mathematical modeling, rigorous derivations, and iterative processes, which limit their practicality. In contrast, deep learning-based approaches can effectively improve image brightness and detail visibility and have gradually become the dominant technology. However, these methods still face challenges such as data distribution limitations, performance instability, and restricted generalization to diverse scenes. To address these issues, researchers have proposed various novel approaches from different perspectives. This paper provides a comprehensive review of low-light image enhancement, encompassing both the historical development — from early traditional techniques to advanced deep learning models — and recent research progress, including datasets, methodological innovations, and evaluation strategies.Furthermore, it discusses current limitations and outlines future research directions to promote broader application and further development in this field.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"140 ","pages":"Article 117435"},"PeriodicalIF":2.7,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145568190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DUALF-D: Disentangled dual-hyperprior approach for light field image compression
Pub Date: 2025-11-13 | DOI: 10.1016/j.image.2025.117436
Soheib Takhtardeshir, Roger Olsson, Christine Guillemot, Mårten Sjöström
Light field (LF) imaging captures spatial and angular information, offering a 4D scene representation that enables enhanced visual understanding. However, high dimensionality and redundancy across the spatial and angular domains present major challenges for compression, particularly where storage, transmission bandwidth, or processing latency are constrained. We present a novel Variational Autoencoder (VAE)-based framework that explicitly disentangles spatial and angular features using two parallel latent branches. Each branch is coupled with an independent hyperprior model, allowing more precise distribution estimation for entropy coding and finer rate–distortion control. This dual-hyperprior structure enables the network to adaptively compress spatial and angular information based on their unique statistical characteristics, improving coding efficiency. To further enhance latent feature specialization and promote disentanglement, we introduce a mutual information-based regularization term that minimizes redundancy between the two branches while preserving feature diversity. Unlike prior methods relying on covariance-based penalties prone to collapse, our information-theoretic regularizer provides more stable and interpretable latent separation. Experimental results on publicly available LF datasets demonstrate that our method achieves strong compression performance, yielding an average BD-PSNR gain of 2.91 dB over HEVC and high compression ratios (e.g., 200:1). Additionally, our design enables fast inference, with an end-to-end runtime more than 19x faster than the JPEG Pleno standard, making it well-suited for real-time and bandwidth-sensitive applications. By jointly leveraging disentangled representation learning, dual-hyperprior modeling, and information-theoretic regularization, our approach offers a scalable, effective solution for practical light field image compression.
TransformAR: A light-weight transformer-based metric for Augmented Reality quality assessment
Pub Date: 2025-11-12 | DOI: 10.1016/j.image.2025.117437
Aymen Sekhri, Mohamed-Chaker Larabi, Seyed Ali Amirshahi
As Augmented Reality (AR) technology continues to gain traction in various sectors, ensuring a superior user experience has become an essential challenge for both academic researchers and industry professionals. However, automatically predicting the quality of AR images remains difficult due to several inherent challenges, particularly the visual confusion arising from the overlap of virtual and real-world elements. This paper introduces TransformAR, a novel and efficient transformer-based framework designed to objectively assess the quality of AR images. The proposed model uses pre-trained vision transformers to capture content features from AR images, calculates distance vectors to measure the impact of distortions, and employs cross-attention-based decoders to effectively model the perceptual quality of AR images. Additionally, the training framework uses regularization techniques and a label-smoothing-like method to reduce the risk of overfitting. Through comprehensive experiments, we demonstrate that TransformAR outperforms existing state-of-the-art approaches, offering a more reliable and scalable solution for AR image quality assessment.
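As a loose illustration of distortion "distance" features attending to content features through cross-attention, the toy module below uses PyTorch's MultiheadAttention; the way the distance vectors are formed, the dimensions, and the score head are assumptions, not the TransformAR architecture.

```python
import torch
import torch.nn as nn

class CrossAttentionQualityHead(nn.Module):
    """Toy quality head: distortion features query content features via cross-attention."""
    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.score = nn.Sequential(nn.Linear(dim, dim // 2), nn.GELU(), nn.Linear(dim // 2, 1))

    def forward(self, content_tokens, distorted_tokens):
        # A simple "distance" signal between content tokens and distorted-image tokens.
        distance = distorted_tokens - content_tokens
        fused, _ = self.attn(query=distance, key=content_tokens, value=content_tokens)
        fused = self.norm(fused + distance)
        return self.score(fused.mean(dim=1)).squeeze(-1)  # one quality score per image

# ViT-style token sequences, e.g. from a frozen pre-trained backbone (batch=2, 197 tokens, 768-d).
content = torch.randn(2, 197, 768)
distorted = torch.randn(2, 197, 768)
print(CrossAttentionQualityHead()(content, distorted).shape)  # torch.Size([2])
```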
{"title":"TransformAR: A light-weight transformer-based metric for Augmented Reality quality assessment","authors":"Aymen Sekhri , Mohamed-Chaker Larabi , Seyed Ali Amirshahi","doi":"10.1016/j.image.2025.117437","DOIUrl":"10.1016/j.image.2025.117437","url":null,"abstract":"<div><div>As Augmented Reality (AR) technology continues to gain traction in various sectors, ensuring a superior user experience has become an essential challenge for both academic researchers and industry professionals. However, the task of automatically predicting the quality of AR images remains difficult due to several inherent challenges, particularly the issue of visual confusion arising from the overlap of virtual and real-world elements. This paper introduces transformAR, a novel and efficient transformer-based framework designed to objectively assess the quality of AR images. The proposed model uses pre-trained vision transformers to capture content features from AR images, calculates distance vectors to measure the impact of distortions, and employs cross-attention-based decoders to effectively model the perceptual qualities of the AR images. Additionally, the training framework uses regularization techniques and label smoothing-like method to reduce the risk of overfitting. Through comprehensive experiments, we demonstrate that transformAR outperforms existing state-of-the-art approaches, offering a more reliable and scalable solution for AR image quality assessment.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"140 ","pages":"Article 117437"},"PeriodicalIF":2.7,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep noise-tolerant hashing for remote sensing image retrieval
Pub Date: 2025-11-10 | DOI: 10.1016/j.image.2025.117431
Chunyu Yan, Lei Wang, Qibing Qin, Jiangyan Dai, Wenfeng Zhang
With the explosive growth of remote sensing data, quickly retrieving target images from large-scale archives has become a critical challenge. Hash learning is an attractive solution to this challenge thanks to its low storage cost and high efficiency. In recent years, combining hash learning with deep neural networks such as CNNs and Transformers has produced numerous frameworks with excellent performance. However, in remote sensing image hashing, previous studies do not simultaneously account for the effect of noise in feature extraction and loss optimization, so their retrieval performance degrades considerably under noise interference. To resolve this problem, a Deep Noise-tolerant Hashing (DNtH) framework is proposed to learn sample complexity and noise level and to adaptively reduce the weight of noisy information. Specifically, to extract fine-grained features from data containing irrelevant samples, a noise-aware Transformer is proposed that introduces patch-wise attention and depth-wise convolution. To reduce the interference of noisy labels on remote sensing image retrieval, an adaptive active-passive loss framework is proposed that dynamically adjusts the weights of the active and passive loss terms; the weights are learned by a dynamic weighting network combined with an asymmetric strategy for effective compact representation learning. The ratio of entropy to standard deviation and the probability difference are fed into this weighting network, which is trained jointly with the feature extraction network. Extensive experiments on three publicly available datasets show that the DNtH framework adapts to noisy environments while achieving optimal performance in remote sensing image retrieval. The source code for our DNtH framework is available at https://github.com/QinLab-WFU/DNtH.git.
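For intuition about the active-passive loss family the abstract builds on, here is a minimal PyTorch sketch in the spirit of normalized active-passive losses from the noisy-label literature (normalized cross-entropy as the active term, reverse cross-entropy as the passive term), with a learnable weight pair standing in for the paper's dynamic weighting network; it is not the authors' formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveActivePassiveLoss(nn.Module):
    """Weighted active (normalized CE) + passive (reverse CE) loss with learnable weights,
    so the balance between the two terms can adapt as training proceeds."""
    def __init__(self, num_classes, clip=1e-4):
        super().__init__()
        self.num_classes = num_classes
        self.clip = clip                      # floor for log(0) in reverse CE
        self.raw_w = nn.Parameter(torch.zeros(2))  # softmax -> convex weights

    def forward(self, logits, targets):
        log_p = F.log_softmax(logits, dim=1)
        p = log_p.exp()
        y = F.one_hot(targets, self.num_classes).float()

        # Active term: normalized cross-entropy (CE divided by the sum of CE over all labels).
        nce = (-(y * log_p).sum(1)) / (-(log_p).sum(1))
        # Passive term: reverse cross-entropy with clamped log of the label distribution.
        rce = -(p * torch.log(y.clamp(min=self.clip))).sum(1)

        w = torch.softmax(self.raw_w, dim=0)
        return (w[0] * nce + w[1] * rce).mean()

loss_fn = AdaptiveActivePassiveLoss(num_classes=10)
logits, labels = torch.randn(4, 10), torch.randint(0, 10, (4,))
print(loss_fn(logits, labels))
```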
{"title":"Deep noise-tolerant hashing for remote sensing image retrieval","authors":"Chunyu Yan , Lei Wang , Qibing Qin , Jiangyan Dai , Wenfeng Zhang","doi":"10.1016/j.image.2025.117431","DOIUrl":"10.1016/j.image.2025.117431","url":null,"abstract":"<div><div>Currently, how to quickly retrieve target images from large-scale remote sensing data has emerged as a critical challenge in the context of explosive growth of remote sensing data volume. To deal with this challenge, hash learning becomes an ideal choice with its low storage cost and high efficiency. In recent years, the combination of hash learning with deep neural networks such as CNNs and Transformers has resulted in numerous frameworks demonstrating excellent performance. However, in the field of remote sensing image hashing, previous studies cannot simultaneously consider the effect of noise in feature extraction and loss optimization, so that their retrieval performance is greatly reduced due to noise interference. To resolve the mentioned problem, a Deep Noise-tolerant Hashing (DNtH) framework is proposed to learn the sample complexity and noise level, and adaptively reduce the weight of noisy information. Specifically, to realize the extraction of fine-grained features from information containing irrelevant samples, the noise-aware Transformer is proposed by introducing the patch-wise attention and depth-wise convolution. To reduce the interference of noisy labels on remote sensing image retrieval, an adaptive active-passive loss framework is proposed to dynamically adjust the weights of active passive loss, which learns the weight parameters through a dynamic weighted network while combining with asymmetric strategy for effective compact representation learning. The ratio of entropy to standard deviation and the probability difference are input into the above network and trained with the feature extraction network. Extensive experiments on three publicly available datasets show that the DNtH framework can adapt to noisy environments while achieving optimal performance in remote sensing image retrieval. The source code for the implementation of our DNtH framework is available at <span><span>https://github.com/QinLab-WFU/DNtH.git</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"140 ","pages":"Article 117431"},"PeriodicalIF":2.7,"publicationDate":"2025-11-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MLANet: Multilevel aggregation network for binocular eye-fixation prediction
Pub Date: 2025-11-09 | DOI: 10.1016/j.image.2025.117434
Wujie Zhou, Jiabao Ma, Yulai Zhang, Lu Yu, Weijia Gao, Ting Luo
Saliency prediction is an underexplored but fundamental task in computer vision, especially for binocular images, where it aims to find the most visually distinctive parts of a scene and thus imitate the human visual system. Existing saliency prediction models do not adequately exploit the extracted multilevel features. In this paper, we introduce a novel saliency prediction framework called multilevel aggregation network (MLANet) to explicitly model the multilevel features of binocular images. We split the multilevel features into shallow and deep features using the ResNet-34 backbone, treating the first three and last three stage features as shallow and deep features, respectively, for saliency prediction reconstruction. First, we introduce the aggregation module to integrate adjacent shallow and deep features. Thereafter, we utilize shallow aggregation and multiscale-aware modules to locate objects of different sizes using features from adjacent levels. For better reconstruction, we integrate deep- and shallow-level features with the help of feature guidance and deploy dual-attention modules to select discriminative enhanced features. Experimental results on two benchmark datasets, NCTU (CC of 0.8575, KLDiv of 0.2648, AUC of 0.8856, and NSS of 2.0138) and S3D (CC of 0.7977, KLDiv of 0.2024, AUC of 0.7954, and NSS of 1.2963), for binocular eye-fixation prediction show that the proposed MLANet outperforms the compared methods.
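The reported CC, KLDiv, and NSS values are standard eye-fixation metrics. A minimal numpy sketch of how they are usually computed follows; epsilon values and normalization conventions vary between benchmarks, so treat this only as an illustration.

```python
import numpy as np

def cc(pred, gt):
    """Linear correlation coefficient between predicted and ground-truth saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((p * g).mean())

def kldiv(pred, gt, eps=1e-8):
    """KL divergence of the GT saliency distribution from the prediction (lower is better)."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float((g * np.log(eps + g / (p + eps))).sum())

def nss(pred, fixations):
    """Normalized Scanpath Saliency: mean standardized prediction at fixated pixels."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    return float(p[fixations > 0].mean())

pred = np.random.rand(48, 64)
gt = np.random.rand(48, 64)
fix = (np.random.rand(48, 64) > 0.98).astype(np.uint8)  # sparse binary fixation map
print(cc(pred, gt), kldiv(pred, gt), nss(pred, fix))
```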
{"title":"MLANet: Multilevel aggregation network for binocular eye-fixation prediction","authors":"Wujie Zhou , Jiabao Ma , Yulai Zhang , Lu Yu , Weijia Gao , Ting Luo","doi":"10.1016/j.image.2025.117434","DOIUrl":"10.1016/j.image.2025.117434","url":null,"abstract":"<div><div>Saliency prediction is an underexplored but fundamental task in computer vision, especially for binocular images. For binocular images, saliency prediction aims to find the most visually distinctive parts to imitate the operation of the human visual system. Existing saliency prediction models do not utilize the extracted multilevel features adequately. In this paper, we introduce a novel saliency prediction framework called multilevel aggregation network (MLANet) to explicitly model the multilevel features of binocular images. We split the multilevel features into shallow and deep features using the ResNet-34 backbone. We consider the first three and last three stage features as shallow and deep features, respectively for saliency prediction reconstruction. First, we introduce the aggregation module to integrate adjacent shallow and deep features. Thereafter, we utilize shallow aggregation and multiscale-aware modules to locate objects of different sizes with features from adjacent levels. For better reconstruction, we integrate deep- and shallow-level features with the help of feature guidance and deploy dual-attention modules to select the discriminative enhanced characteristics. Experimental results on two benchmark datasets, NCTU (CC of 0.8575, KLDiv of 0.2648, AUC of 0.8856, and NSS of 2.0138) and S3D (CC of 0.7977, KLDiv of 0.2024, AUC of 0.7954, and NSS of 1.2963), for binocular eye-fixation prediction show that the proposed MLANet outperforms the other compared methods.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"140 ","pages":"Article 117434"},"PeriodicalIF":2.7,"publicationDate":"2025-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145520106","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ad-MNet with FConv: FPGA-enabled advanced MobileNet model with fast convolution accelerator for image resolution and quality enhancement
Pub Date: 2025-11-07 | DOI: 10.1016/j.image.2025.117433
L. Malathi
In recent decades, image processing professionals have placed a great deal of emphasis on single image super-resolution (SISR), which attempts to reconstruct a high-resolution (HR) image from a low-resolution (LR) image. In particular, deep learning-based super-resolution (SR) methods have attracted considerable interest and significantly improved reconstruction performance on synthetic data. However, their application to real-world SR on resource-constrained devices is limited by their huge number of convolutions and parameters, which demand significant computation and memory for SR model training. To tackle these issues, an improved deep learning model is developed using a fast convolution (FConv) accelerator and a hybrid multiplier architecture. Before the quality enhancement procedure, an adaptive Fast Normalized Least Mean Square filtering step is first applied for denoising. Then, the Advanced MobileNet (Ad-MNet) model with the FConv accelerator is proposed to improve image quality and resolution, recovering information lost during image acquisition. In addition, a Hybrid Parallel Adder-based Multiplier (HPAM) is designed to perform the multiplication operations in FConv and speed up the convolution. The proposed accelerator is implemented using MATLAB and Xilinx Verilog tools, and its performance is analyzed across different metrics. The results show that the proposed model achieves 98% accuracy for the image resolution process and 98.9% for the image enhancement process.
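The denoising front end is based on a normalized least mean square adaptive filter. As a rough, software-only illustration of the underlying NLMS recursion (the paper's fast variant and its FPGA mapping are not reproduced here), the sketch below runs an adaptive line-enhancer style filter along one image row; the filter order, step size, and delay are arbitrary assumptions.

```python
import numpy as np

def nlms_denoise_row(noisy, order=8, mu=0.5, delay=1, eps=1e-6):
    """Adaptive line-enhancer style NLMS filter along one image row.

    The delayed noisy signal is the filter input; the correlated (image)
    component is predicted, leaving uncorrelated noise in the error."""
    w = np.zeros(order)
    out = np.copy(noisy).astype(float)
    for n in range(order + delay, len(noisy)):
        x = noisy[n - delay - order + 1: n - delay + 1][::-1]  # tap vector of past samples
        y = w @ x                        # predicted (signal) component
        e = noisy[n] - y                 # prediction error (mostly noise)
        w += mu * e * x / (eps + x @ x)  # normalized LMS weight update
        out[n] = y
    return out

row = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.2 * np.random.randn(256)
print(np.round(nlms_denoise_row(row)[-8:], 3))
```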
{"title":"Ad-MNet with FConv: FPGA-enabled advanced MobileNet model with fast convolution accelerator for image resolution and quality enhancement","authors":"L. Malathi","doi":"10.1016/j.image.2025.117433","DOIUrl":"10.1016/j.image.2025.117433","url":null,"abstract":"<div><div>In recent decades, image processing professionals have focused a great deal of emphasis on single image super-resolution (SISR), which attempts to reconstruct a high-resolution (HR) image from a low-resolution (LR) image. Particularly, deep learning-based super-resolution (SR) methods have attracted a lot of interest and significantly enhanced reconstruction performance on synthetic data. However, their applications are limited to real-world SR with resource-constrained devices because of their huge number of convolutions and parameters. This requires significant computational costs and memory storage for SR model training. To tackle these issues, an improved deep learning model is developed with the use of the fast convolution (FConv) accelerator and hybrid multiplier architecture. Before performing the quality enhancement procedure, the adaptive Fast Normalized Least Mean Square algorithm-based filtering method is initially applied to perform the denoising process. Then, the Advanced MobileNet (Ad-MNet) model with FConv accelerator is proposed to improve the image quality as well as resolution for recovering lost information during image acquisition. Also, a Hybrid Parallel Adder-based Multiplier (HPAM) is designed to perform the multiplication operation in FConv to speed up the convolution operation. The proposed accelerator is executed using MATLAB and Xilinx Verilog tools, and the performance analysis is done for different metrics. The study of the results shows that the accuracy of the proposed model is 98 % for the image resolution process and 98.9 % for the image enhancement process.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"140 ","pages":"Article 117433"},"PeriodicalIF":2.7,"publicationDate":"2025-11-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145568191","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
UIT-OpenViIC: An open-domain benchmark for evaluating image captioning in Vietnamese
Pub Date: 2025-10-31 | DOI: 10.1016/j.image.2025.117430
Doanh C. Bui, Nghia Hieu Nguyen, Khang Nguyen
Image captioning is one of the vision-language tasks that continues to attract interest from the research community worldwide in the 2020s. The MS-COCO Caption benchmark is commonly used to evaluate the performance of advanced captioning models, even though it was introduced in 2015. However, recent captioning models trained on the MS-COCO Caption dataset perform well only on English language patterns; they are less effective at describing contexts specific to Vietnam or generating fluent Vietnamese captions. To support low-resource research communities such as Vietnam's, we introduce a novel Vietnamese image captioning dataset, the Open-domain Vietnamese Image Captioning dataset (UIT-OpenViIC). The dataset includes complex scenes captured in Vietnam, manually annotated by Vietnamese annotators under strict rules and supervision. In this paper, we describe the dataset creation process in detail. Preliminary analysis shows that our dataset is challenging for recent state-of-the-art (SOTA) Transformer-based baselines that perform well on MS-COCO. These modest results indicate that UIT-OpenViIC has room to grow and can serve as a standard Vietnamese benchmark for the research community to evaluate captioning models. Furthermore, we present a CAMO approach that effectively enhances image representation ability through a multi-level encoder output fusion mechanism, which helps improve the quality of generated captions compared with previous captioning models. In our experiments, we show that our dataset is more diverse and challenging than the MS-COCO Caption dataset, as indicated by the significantly lower CIDEr scores on our test set, ranging from 59.52 to 62.47. For the CAMO approach, experiments on UIT-OpenViIC show that, when combined with a baseline captioning model, it improves performance by 0.8970 to 4.9167 CIDEr points.
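As a loose sketch of what a multi-level encoder output fusion mechanism can look like (the actual CAMO design may differ), the toy PyTorch module below combines the outputs of several encoder layers with learned softmax weights before passing them to a caption decoder.

```python
import torch
import torch.nn as nn

class MultiLevelFusion(nn.Module):
    """Fuse the outputs of every encoder layer with learned softmax weights."""
    def __init__(self, num_layers, dim):
        super().__init__()
        self.layer_logits = nn.Parameter(torch.zeros(num_layers))
        self.proj = nn.Linear(dim, dim)

    def forward(self, layer_outputs):
        # layer_outputs: list of (batch, tokens, dim) tensors, one per encoder layer.
        stacked = torch.stack(layer_outputs, dim=0)           # (L, B, T, D)
        weights = torch.softmax(self.layer_logits, dim=0)     # (L,)
        fused = (weights.view(-1, 1, 1, 1) * stacked).sum(0)  # weighted sum over layers
        return self.proj(fused)                               # fed to the caption decoder

outputs = [torch.randn(2, 49, 512) for _ in range(3)]  # e.g. 3 encoder layers, 7x7 grid features
print(MultiLevelFusion(num_layers=3, dim=512)(outputs).shape)  # torch.Size([2, 49, 512])
```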
{"title":"UIT-OpenViIC: An open-domain benchmark for evaluating image captioning in Vietnamese","authors":"Doanh C. Bui , Nghia Hieu Nguyen , Khang Nguyen","doi":"10.1016/j.image.2025.117430","DOIUrl":"10.1016/j.image.2025.117430","url":null,"abstract":"<div><div>Image captioning is one of the vision-language tasks that continues to attract interest from the research community worldwide in the 2020s. The MS-COCO Caption benchmark is commonly used to evaluate the performance of advanced captioning models, even though it was introduced in 2015. However, recent captioning models trained on the MS-COCO Caption dataset perform well only in English language patterns; they do not perform as effectively in describing contexts specific to Vietnam or in generating fluent Vietnamese captions. To contribute to the low-resources research community as in Vietnam, we introduce a novel image captioning dataset in Vietnamese, the <strong>Open</strong>-domain <strong>Vi</strong>etnamese <strong>I</strong>mage <strong>C</strong>aptioning dataset (UIT-OpenViIC). The introduced dataset includes complex scenes captured in Vietnam and manually annotated by Vietnamese under strict rules and supervision. In this paper, we present in more detail the dataset creation process. From preliminary analysis, we show that our dataset is challenging to recent state-of-the-art (SOTA) Transformer-based baselines, which performed well on the MS COCO dataset. Then, the modest results prove that UIT-OpenViIC has room to grow, which can be one of the standard benchmarks in Vietnamese for the research community to evaluate their captioning models. Furthermore, we present a CAMO approach that effectively enhances the image representation ability by a multi-level encoder output fusion mechanism, which helps improve the quality of generated captions compared to previous captioning models. In our experiments, we show that our dataset is more diverse and challenging than the MS-COCO caption dataset, as indicated by the significantly lower CIDEr scores on our testing set, ranging from 59.52 to 62.47 compared to MS-COCO. For the CAMO approach, experiments on UIT-OpenViIC show that when equipped with a captioning baseline model, it can improve performance by 0.8970 to 4.9167 CIDEr.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"140 ","pages":"Article 117430"},"PeriodicalIF":2.7,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145466331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HDRUnet3D: High dynamic range image reconstruction network with residual and illumination maps
Pub Date: 2025-10-30 | DOI: 10.1016/j.image.2025.117428
Fengyuan Wu, Jiuzhe Wei, Qinghong Sheng, Xiang Liu, Bo Wang, Jun Li
In the process of reconstructing high dynamic range (HDR) images from multi-exposed low dynamic range (LDR) images, removing the ghosting artifacts induced by object movement and the misalignment of dynamic scenes is a pivotal challenge. Previous methods attempt to mitigate ghosting artifacts using optical flow registration or motion pixel rejection. However, these approaches are prone to errors due to inaccuracies in motion estimation and the lack of effective information guidance during pixel exclusion. This article proposes a novel attention-guided end-to-end deep neural network named HDRUnet3D, which treats the multi-exposed LDR images as a video stream and utilizes 3D convolutions to extract both temporal and spatial features, enabling the network to adaptively capture the temporal dynamics of moving objects. Moreover, two information-guided feature enhancement modules are proposed: the Residual Map-Guided Attention Module and the Illumination-guided Local Enhanced Module. The former utilizes gamma-corrected residual images to guide the learning of temporal motion semantics, while the latter adaptively enhances local features based on illumination maps. Additionally, Global Asymmetric Semantic Fusion is proposed to integrate multi-scale features, enriching the high-level feature representation. HDRUnet3D achieves state-of-the-art performance on multiple datasets, demonstrating the effectiveness and robustness of the proposed method.
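For intuition about gamma-corrected residual maps of the kind that guide the attention module, the sketch below maps an LDR frame toward the reference exposure via gamma correction and takes the absolute difference, so that large residuals tend to mark motion rather than exposure differences; the gamma value and exposure handling are assumptions, not the paper's exact formulation.

```python
import numpy as np

def gamma_corrected_residual(ldr, ldr_ref, t, t_ref, gamma=2.2):
    """Residual map between an LDR frame and the reference frame.

    Each frame is mapped to the linear domain with gamma correction, scaled to the
    reference exposure time, mapped back, and compared; large residuals then tend
    to mark moving (ghost-prone) regions rather than pure exposure differences."""
    lin = np.power(ldr, gamma) / t                        # approximate scene radiance
    to_ref = np.clip(lin * t_ref, 0.0, 1.0) ** (1.0 / gamma)
    return np.abs(ldr_ref - to_ref)

# Hypothetical static scene rendered at two exposure times (values in [0, 1]).
radiance = np.random.rand(4, 4).astype(np.float32) * 0.4
short = radiance ** (1 / 2.2)                             # exposure time t = 1.0
long_ = np.clip(radiance * 2.0, 0, 1) ** (1 / 2.2)        # exposure time t = 2.0
print(np.round(gamma_corrected_residual(short, long_, t=1.0, t_ref=2.0), 3))  # ~0: no motion
```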
{"title":"HDRUnet3D: High dynamic range image reconstruction network with residual and illumination maps","authors":"Fengyuan Wu , Jiuzhe Wei , Qinghong Sheng , Xiang Liu , Bo Wang , Jun Li","doi":"10.1016/j.image.2025.117428","DOIUrl":"10.1016/j.image.2025.117428","url":null,"abstract":"<div><div>In the process of reconstructing high dynamic range (HDR) images from multi-exposed low dynamic range (LDR) images, removing the ghosting artifacts induced by the movement of objects and the misalignment of dynamic scenes is a pivotal challenge. Previous methods attempt to mitigate ghosting artifacts using optical flow registration or motion pixel rejection. However, these approaches are prone to errors due to inaccuracies in motion estimation and the lack of effective information guidance during pixel exclusion. This article proposes a novel attention-guided end-to-end deep neural network named HDRUnet3d, which treats multi-exposed LDR images as video stream and utilizes 3D convolution to extract both temporal and spatial features, enabling the network to adaptively capture the temporal dynamics of moving objects. Moreover, two information-guided feature enhancement modules is proposed, consisting of the Residual Map-Guided Attention Module and the Illumination-guided Local Enhanced Module. The former utilizes gamma-corrected residual images to guide the learning of temporal motion semantics information, while the latter adaptively enhance local features based on illumination maps. Additionally, Global Asymmetric Semantic Fusion is proposed to integrate multi-scale features, enriching high-level feature representation. HDRUnet3d achieves state-of-the-art performance on different datasets, demonstrating the effectiveness and robustness of the proposed method.</div></div>","PeriodicalId":49521,"journal":{"name":"Signal Processing-Image Communication","volume":"140 ","pages":"Article 117428"},"PeriodicalIF":2.7,"publicationDate":"2025-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145417511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}