Versatile depth estimator based on common relative depth estimation and camera-specific relative-to-metric depth conversion
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104252
A typical monocular depth estimator is trained for a single camera, so its performance drops severely on images taken with different cameras. To address this issue, we propose a versatile depth estimator (VDE), composed of a common relative depth estimator (CRDE) and multiple relative-to-metric converters (R2MCs). The CRDE extracts relative depth information, and each R2MC converts that relative information into metric depth predictions for a specific camera. The proposed VDE can cope with diverse scenes, including both indoor and outdoor scenes, with only a 1.12% parameter increase per camera. Experimental results demonstrate that the VDE supports multiple cameras effectively and efficiently, and that it also achieves state-of-the-art performance in the conventional single-camera scenario.
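The abstract does not describe the R2MC internals, but the overall split — one shared relative-depth network plus a small converter per camera — can be sketched as follows. All module names, layer sizes, and camera identifiers here are illustrative assumptions, not the paper's architecture; the point is only that adding a camera costs a tiny head rather than a full model.

```python
import torch
import torch.nn as nn

class ToyVersatileDepth(nn.Module):
    """Illustrative sketch: a shared relative-depth backbone plus small
    per-camera heads that convert relative depth to metric depth.
    This is NOT the paper's architecture, only the general idea."""

    def __init__(self, camera_ids):
        super().__init__()
        # Shared "CRDE"-like backbone (here just a toy conv stack).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),  # relative depth map
        )
        # One lightweight converter per camera ("R2MC"-like heads).
        self.converters = nn.ModuleDict({
            cam: nn.Conv2d(1, 1, 1)  # per-camera 1x1 conv: scale + shift
            for cam in camera_ids
        })

    def forward(self, image, camera_id):
        relative = self.backbone(image)                # camera-agnostic relative depth
        metric = self.converters[camera_id](relative)  # camera-specific metric depth
        return relative, metric

model = ToyVersatileDepth(["kitti_cam", "nyu_cam"])
x = torch.randn(1, 3, 64, 64)
rel, met = model(x, "nyu_cam")
```

In this toy setup, supporting an extra camera adds only the parameters of one 1x1 convolution, mirroring the paper's claim of a small per-camera parameter increase.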
EM-Gait: Gait recognition using motion excitation and feature embedding self-attention
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104266
Gait recognition, which enables long-distance and contactless identification, is an important biometric technology. Recent gait recognition methods focus on learning the pattern of human movement or appearance during walking and construct corresponding spatio-temporal representations. However, each individual follows their own movement patterns, and simple spatio-temporal features struggle to describe the motion changes of human body parts, especially when confounding variables such as clothing and carried objects are involved, which reduces the distinguishability of the features. To this end, we propose the Embedding and Motion (EM) block and the Fine Feature Extractor (FFE) to capture the motion mode of walking and enhance the differences between local motion patterns. The EM block consists of a Motion Excitation (ME) module, which captures temporal motion changes, and an Embedding Self-attention (ES) module, which enhances the expression of motion patterns. Specifically, without introducing additional parameters, the ME module learns difference information between frames and intervals to obtain a dynamic representation of walking for frame sequences of uncertain length. In contrast, the ES module divides the feature map hierarchically based on element values, blurring the differences between elements to highlight the motion track. Furthermore, we present the FFE, which independently learns spatio-temporal representations of the human body for different horizontal parts of an individual. Benefiting from the EM block and our proposed motion branch, our method combines motion change information in a novel way, significantly improving performance under cross-appearance conditions. On the popular CASIA-B dataset, the proposed EM-Gait outperforms existing single-modal gait recognition methods.
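The ME module is described as learning inter-frame difference information without extra parameters. A rough, parameter-free sketch of that idea — re-weighting per-frame features by the magnitude of their temporal change — might look like the following; this illustrates frame-difference excitation in general, not the paper's exact formulation.

```python
import torch

def motion_excitation(features):
    """Parameter-free sketch of frame-difference motion excitation.
    features: (B, T, C, H, W) per-frame feature maps.
    Returns features re-weighted by the magnitude of temporal change;
    only an illustration of exciting motion-sensitive channels."""
    diff = features[:, 1:] - features[:, :-1]          # (B, T-1, C, H, W) temporal differences
    diff = torch.cat([diff, diff[:, -1:]], dim=1)      # pad so all T frames keep a weight
    attention = torch.sigmoid(diff.mean(dim=(3, 4), keepdim=True))  # channel-wise motion gate
    return features * attention

feats = torch.randn(2, 8, 16, 32, 32)   # 2 sequences, 8 frames, 16 channels
excited = motion_excitation(feats)
print(excited.shape)                    # torch.Size([2, 8, 16, 32, 32])
```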
Fusing structure from motion and simulation-augmented pose regression from optical flow for challenging indoor environments
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104256
The localization of objects is essential in many applications, such as robotics, virtual and augmented reality, and warehouse logistics. Recent advancements in deep learning have enabled localization using monocular cameras. Traditionally, structure from motion (SfM) techniques predict an object’s absolute position from a point cloud, while absolute pose regression (APR) methods use neural networks to understand the environment semantically. However, both approaches face challenges from environmental factors like motion blur, lighting changes, repetitive patterns, and featureless areas. This study addresses these challenges by incorporating additional information and refining absolute pose estimates with relative pose regression (RPR) methods. RPR also struggles with issues like motion blur. To overcome this, we compute the optical flow between consecutive images using the Lucas–Kanade algorithm and use a small recurrent convolutional network to predict relative poses. Combining absolute and relative poses is difficult due to differences between global and local coordinate systems. Current methods use pose graph optimization (PGO) to align these poses. In this work, we propose recurrent fusion networks that better integrate absolute and relative pose predictions, enhancing the accuracy of absolute pose estimates. We evaluate eight different recurrent units and create a simulation environment to pre-train the APR and RPR networks for improved generalization. Additionally, we record a large dataset covering various scenarios in a challenging indoor environment resembling a warehouse with transportation robots. Through hyperparameter searches and experiments, we demonstrate that our recurrent fusion method is more effective than PGO.
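As a concrete illustration of the optical-flow front end the abstract describes, sparse Lucas–Kanade flow between consecutive frames can be computed with OpenCV as sketched below. The paper's exact preprocessing and the input format fed to the recurrent relative-pose network are assumptions here.

```python
import cv2
import numpy as np

def lk_flow(prev_gray, next_gray, max_corners=200):
    """Track corner points between two consecutive grayscale frames
    with pyramidal Lucas-Kanade optical flow (sketch only)."""
    pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=max_corners,
                                  qualityLevel=0.01, minDistance=7)
    if pts is None:
        return np.empty((0, 2)), np.empty((0, 2))
    next_pts, status, _ = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    good = status.ravel() == 1
    return pts[good].reshape(-1, 2), next_pts[good].reshape(-1, 2)

# Example with synthetic frames (in practice, consecutive video frames).
prev = np.random.randint(0, 255, (240, 320), dtype=np.uint8)
nxt = np.roll(prev, 2, axis=1)          # fake horizontal motion
p0, p1 = lk_flow(prev, nxt)
flow_vectors = p1 - p0                  # per-point displacement, e.g. input to an RPR network
```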
Joint multi-scale transformers and pose equivalence constraints for 3D human pose estimation
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104247
Unlike image-based 3D pose estimation, video-based 3D pose estimation gains performance improvements from temporal information. However, these methods still generalize insufficiently across human motion speed, body shape, and camera distance. To address these problems, we propose a novel approach, referred to as joint Spatial–temporal Multi-scale Transformers and Pose Transformation Equivalence Constraints (SMT-PTEC), for 3D human pose estimation from videos. We design a more general spatial–temporal multi-scale feature extraction strategy and introduce optimization constraints that adapt to the diversity of the data to improve the accuracy of pose estimation. Specifically, we first introduce a spatial multi-scale transformer to extract multi-scale pose features and establish a cross-scale information transfer mechanism, which effectively explores the underlying knowledge of human motion. Then, we present a temporal multi-scale transformer to explore multi-scale dependencies between frames, enhance the adaptability of the network to different human motion speeds, and improve estimation accuracy through a context-aware fusion of multi-scale predictions. Moreover, we add pose transformation equivalence constraints by transforming the training samples with horizontal flipping, scaling, and body shape transformations, which effectively overcomes the influence of camera distance and body shape on prediction accuracy. Extensive experimental results demonstrate that our approach achieves superior performance with lower computational complexity than previous state-of-the-art methods. Code is available at https://github.com/JNGao123/SMT-PTEC.
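The abstract mentions equivalence constraints built from horizontal flipping, scaling, and body shape transformations. A minimal sketch of one such constraint — consistency under horizontal flipping — is shown below; the joint index lists and the loss form are hypothetical and only illustrate the equivalence idea, not the paper's constraints.

```python
import torch

# Hypothetical left/right joint index pairs for a 17-joint skeleton (illustrative only).
LEFT = [4, 5, 6, 11, 12, 13]
RIGHT = [1, 2, 3, 14, 15, 16]

def flip_lr(pose):
    """Mirror a (B, J, 2) or (B, J, 3) pose: negate x and swap left/right joints."""
    flipped = pose.clone()
    flipped[..., 0] = -flipped[..., 0]
    flipped[:, LEFT + RIGHT] = flipped[:, RIGHT + LEFT]
    return flipped

def flip_equivalence_loss(model, joints2d):
    """One possible equivalence constraint: the prediction for a flipped input
    should match the flipped prediction for the original input."""
    pred = model(joints2d)                       # (B, J, 3) estimated 3D pose
    pred_from_flipped = model(flip_lr(joints2d))
    return torch.mean((flip_lr(pred) - pred_from_flipped) ** 2)

pose = torch.randn(4, 17, 3)
assert torch.allclose(flip_lr(flip_lr(pose)), pose)   # flipping twice is the identity
```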
Detecting and tracking moving objects in defocus blur scenes
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104259
Object tracking stands as a cornerstone challenge within computer vision, with blurriness analysis representing a burgeoning field of interest. Among the various forms of blur encountered in natural scenes, defocus blur remains significantly underexplored. To bridge this gap, this article introduces the Defocus Blur Video Object Tracking (DBVOT) dataset, specifically crafted to facilitate research in visual object tracking under defocus blur conditions. We conduct a comprehensive performance analysis of 18 state-of-the-art object tracking methods on this unique dataset. Additionally, we propose a selective deblurring framework based on Deblurring Auxiliary Learning Net (DID-Anet), innovatively designed to tackle the complexities of defocus blur. This framework integrates a novel defocus blurriness metric for the smart deblurring of video frames, thereby enhancing the efficacy of tracking methods in defocus blur scenarios. Our extensive experimental evaluations underscore the significant advancements in tracking accuracy achieved by incorporating our proposed framework with leading tracking technologies.
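The abstract does not state which defocus blurriness metric drives the selective deblurring. A common sharpness proxy, the variance of the Laplacian, can serve to sketch the decision logic; the threshold, the metric, and the deblur_fn interface below are assumptions, not DID-Anet's actual components.

```python
import cv2

def blurriness_score(frame_gray):
    """Variance of the Laplacian: a common sharpness proxy
    (the paper's own defocus-blurriness metric is not given here)."""
    return cv2.Laplacian(frame_gray, cv2.CV_64F).var()

def maybe_deblur(frame_gray, deblur_fn, threshold=100.0):
    """Selective deblurring: only run the (expensive) deblurring network
    when the frame looks blurry, then hand the result to the tracker."""
    if blurriness_score(frame_gray) < threshold:   # low variance -> likely blurred
        return deblur_fn(frame_gray)
    return frame_gray
```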
A lightweight target tracking algorithm based on online correction for meta-learning
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104228
Traditional Siamese-network-based object tracking algorithms suffer from high computational complexity, making them difficult to run on embedded devices. Moreover, their success rates decline significantly on long-term tracking tasks. To address these issues, we propose a lightweight long-term object tracking algorithm based on meta-learning, called Meta-Master-based Ghost Fast Tracking (MGTtracker). The algorithm integrates the Ghost mechanism to create a lightweight backbone network, G-ResNet, which extracts target features accurately while operating quickly. We design a tiny adaptive weighted fusion feature pyramid network (TiFPN) to enhance feature information fusion and mitigate interference from similar objects. We introduce a lightweight region regression network, the Ghost Decouple Net (GDNet), for target position prediction. Finally, we propose a meta-learning-based online template correction mechanism, Meta-Master, to overcome error accumulation in long-term tracking and the difficulty of reacquiring targets after they are lost. We evaluate the algorithm on the public datasets OTB100, VOT2020, VOT2018LT, and LaSOT and deploy it on a Jetson Xavier NX for performance testing. Experimental results demonstrate the effectiveness and superiority of the algorithm. Compared with existing classic object tracking algorithms, our approach runs faster, reaching 25 FPS on the NX, and real-time correction enhances its robustness. While comparable in accuracy and EAO, our algorithm outperforms comparable trackers in speed and effectively addresses the problems of accumulated errors and easy target loss during tracking. Code is released at https://github.com/ygh96521/MGTtracker.git.
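The "Ghost mechanism" behind G-ResNet presumably refers to GhostNet-style ghost modules, which generate part of each layer's feature maps with cheap depthwise operations instead of full convolutions. A generic sketch of such a module follows; the channel counts and kernel sizes are illustrative, not the paper's G-ResNet configuration.

```python
import torch
import torch.nn as nn

class GhostModule(nn.Module):
    """Generic GhostNet-style module: a small primary convolution produces
    intrinsic features, and cheap depthwise operations generate the
    remaining ("ghost") features, which are concatenated."""

    def __init__(self, in_ch, out_ch, ratio=2, kernel=1, cheap_kernel=3):
        super().__init__()
        primary_ch = out_ch // ratio
        ghost_ch = out_ch - primary_ch
        self.primary = nn.Sequential(
            nn.Conv2d(in_ch, primary_ch, kernel, padding=kernel // 2, bias=False),
            nn.BatchNorm2d(primary_ch), nn.ReLU(inplace=True),
        )
        self.cheap = nn.Sequential(
            nn.Conv2d(primary_ch, ghost_ch, cheap_kernel, padding=cheap_kernel // 2,
                      groups=primary_ch, bias=False),   # depthwise "cheap" operation
            nn.BatchNorm2d(ghost_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        primary = self.primary(x)
        ghost = self.cheap(primary)
        return torch.cat([primary, ghost], dim=1)

block = GhostModule(64, 128)
y = block(torch.randn(1, 64, 56, 56))   # -> shape (1, 128, 56, 56)
```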
Reversible data hiding for color images based on prediction-error value ordering and adaptive embedding
Pub Date: 2024-07-22 | DOI: 10.1016/j.jvcir.2024.104239
Prediction-error value ordering (PEVO) is an efficient implementation of reversible data hiding (RDH) that is well suited to color images because it exploits inter-channel and intra-channel correlations simultaneously. However, the existing PEVO method falls slightly short in the mapping selection stage: the candidate mappings are selected in advance under conditions inconsistent with the actual embedding, which is not optimal. Therefore, in this paper, a novel RDH method for color images based on PEVO and adaptive embedding is proposed to implement adaptive two-dimensional (2D) modification for PEVO. First, an improved particle swarm optimization (IPSO) algorithm based on PEVO is designed to alleviate the high time complexity caused by parameter determination and to implement adaptive 2D modification for PEVO. Next, to further optimize the mapping used in embedding, an improved adaptive 2D mapping generation strategy is proposed that incorporates the position information of points. In addition, a dynamic payload partition strategy is proposed to improve the embedding performance. Finally, the experimental results show that the PSNR of the image Lena reaches 62.94 dB and that the average PSNR of the proposed method is 1.46 dB higher than that of state-of-the-art methods for an embedding capacity of 20,000 bits.
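PEVO builds on prediction-error expansion: a pixel is predicted from its (value-ordered) neighbors, and expandable prediction errors are modified to carry payload bits reversibly. A textbook one-pixel sketch of that basic mechanism follows; it is not the paper's adaptive 2D mapping or its IPSO-driven parameter selection.

```python
def pee_embed(pixel, predicted, bit):
    """Minimal prediction-error expansion: errors 0 and -1 are expanded to
    carry one bit; all other errors are shifted to keep the map invertible."""
    e = pixel - predicted
    if e in (0, -1):
        e_marked = 2 * e + bit       # expandable error carries the payload bit
    elif e >= 1:
        e_marked = e + 1             # shift positive errors out of the way
    else:                            # e <= -2
        e_marked = e - 1
    return predicted + e_marked

def pee_extract(marked_pixel, predicted):
    """Inverse step: recover the original pixel and the embedded bit (if any)."""
    e_marked = marked_pixel - predicted
    if e_marked in (0, 1):
        return predicted, e_marked               # original error was 0
    if e_marked in (-2, -1):
        return predicted - 1, e_marked + 2       # original error was -1
    if e_marked >= 2:
        return predicted + e_marked - 1, None    # shifted, no bit
    return predicted + e_marked + 1, None

# Round trip: embed a 1 into a pixel whose predictor says 100.
marked = pee_embed(100, 100, 1)      # -> 101
print(pee_extract(marked, 100))      # -> (100, 1)
```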
EMCFN: Edge-based Multi-scale Cross Fusion Network for video frame interpolation
Pub Date: 2024-07-09 | DOI: 10.1016/j.jvcir.2024.104226
Video frame interpolation (VFI) is used to synthesize one or more intermediate frames between two frames in a video sequence to improve the temporal resolution of the video. However, many methods still face challenges when dealing with complex scenes involving high-speed motion, occlusions, and other factors. To address these challenges, we propose an Edge-based Multi-scale Cross Fusion Network (EMCFN) for VFI. We integrate a feature enhancement module (FEM) based on edge information into the U-Net architecture, resulting in richer and more complete feature maps, while also enhancing the preservation of image structure and details. This contributes to generating more accurate and realistic interpolated frames. At the same time, we use a multi-scale cross fusion frame synthesis model (MCFM) composed of three GridNet branches to generate high-quality interpolation frames. We have conducted a series of experiments and the results show that our model exhibits satisfactory performance on different datasets compared with the state-of-the-art methods.
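The FEM is said to be based on edge information; one plausible source of such information is a simple gradient-magnitude edge map, sketched below with a Sobel operator. The actual edge extractor and how the FEM injects it into the U-Net are not specified in this abstract, so this is only an assumed front end.

```python
import cv2
import numpy as np

def edge_map(frame_gray):
    """Normalized Sobel gradient magnitude, one simple form of 'edge
    information' that could guide a feature enhancement module."""
    gx = cv2.Sobel(frame_gray, cv2.CV_32F, 1, 0, ksize=3)
    gy = cv2.Sobel(frame_gray, cv2.CV_32F, 0, 1, ksize=3)
    mag = cv2.magnitude(gx, gy)
    return mag / (mag.max() + 1e-8)          # edge strength in [0, 1]

frame = np.random.randint(0, 255, (128, 128), dtype=np.uint8)
edges = edge_map(frame)                      # could be stacked with features as extra guidance
```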
FISTA acceleration inspired network design for underwater image enhancement
Pub Date: 2024-07-08 | DOI: 10.1016/j.jvcir.2024.104224
Underwater image enhancement, especially color restoration and detail reconstruction, remains a significant challenge. Current models focus on improving accuracy and learning efficiency through neural network design, often neglecting the benefits of traditional optimization algorithms. We propose FAIN-UIE, a novel approach for color and fine-texture recovery in underwater imagery. It leverages insights from the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) to approximate image degradation, increasing the speed at which the network fits. FAIN-UIE integrates a residual degradation module (RDM) and a momentum calculation module (MC) to simulate gradient descent and momentum, and addresses feature fusion losses with the Feature Merge Block (FMB). By integrating multi-scale information and inter-stage pathways, our method effectively maps multi-stage image features, advancing color and fine-texture restoration. Experimental results validate its robust performance, positioning FAIN-UIE as a competitive solution for practical underwater imaging applications.
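For readers unfamiliar with the optimization scheme the network design draws on, a standard FISTA iteration for the LASSO problem is sketched below: a proximal gradient step followed by Nesterov-style momentum on the iterates. This only illustrates the acceleration idea; it is not the paper's underwater restoration model.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (element-wise soft thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def fista_lasso(A, b, lam, n_iter=200):
    """Classic FISTA for  min_x 0.5*||Ax - b||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2          # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    y, t = x.copy(), 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)
        x_next = soft_threshold(y - grad / L, lam / L)     # proximal gradient step
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        y = x_next + ((t - 1.0) / t_next) * (x_next - x)   # momentum on the iterates
        x, t = x_next, t_next
    return x

A = np.random.randn(50, 100)
x_true = np.zeros(100); x_true[:5] = 1.0
b = A @ x_true
x_hat = fista_lasso(A, b, lam=0.1)          # sparse recovery example
```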