Corrigendum to "Heterogeneity constrained color ellipsoid prior image dehazing algorithm" [J. Vis. Commun. Image Represent. 101 (2024) 104177]
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104235
Yuxi Wang , Jing Hu , Rongguo Zhang , Lifang Wang , Rui Zhang , Xiaojun Liu
{"title":"Corrigendum to “Heterogeneity constrained color ellipsoid prior image dehazing algorithm” [J. Vis. Commun. Image Represent. 101 (2024) 104177]","authors":"Yuxi Wang , Jing Hu , Rongguo Zhang , Lifang Wang , Rui Zhang , Xiaojun Liu","doi":"10.1016/j.jvcir.2024.104235","DOIUrl":"10.1016/j.jvcir.2024.104235","url":null,"abstract":"","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"103 ","pages":"Article 104235"},"PeriodicalIF":2.6,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1047320324001913/pdfft?md5=acb08692ca9b1d2f6bd84d46fa591d30&pid=1-s2.0-S1047320324001913-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141694814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hypergraph clustering based multi-label cross-modal retrieval
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104258
Shengtang Guo , Huaxiang Zhang , Li Liu , Dongmei Liu , Xu Lu , Liujian Li
Most existing cross-modal retrieval methods struggle to establish semantic connections between modalities because of their inherent heterogeneity. To establish such connections, align relevant semantic features across modalities, and fully capture important information within each modality, this paper exploits the strength of hypergraphs in representing higher-order relationships and proposes an image-text retrieval method based on hypergraph clustering. Specifically, we construct hypergraphs to capture feature relationships within the image and text modalities, as well as between them. This allows us to effectively model complex relationships between features of different modalities and to explore semantic connectivity within and across modalities. To compensate for potential semantic feature loss during construction of the hypergraph neural network, we design a weight-adaptive coarse- and fine-grained feature fusion module for semantic supplementation. Comprehensive experiments on three common datasets demonstrate the effectiveness of the proposed method.
{"title":"Hypergraph clustering based multi-label cross-modal retrieval","authors":"Shengtang Guo , Huaxiang Zhang , Li Liu , Dongmei Liu , Xu Lu , Liujian Li","doi":"10.1016/j.jvcir.2024.104258","DOIUrl":"10.1016/j.jvcir.2024.104258","url":null,"abstract":"<div><p>Most existing cross-modal retrieval methods face challenges in establishing semantic connections between different modalities due to inherent heterogeneity among them. To establish semantic connections between different modalities and align relevant semantic features across modalities, so as to fully capture important information within the same modality, this paper considers the superiority of hypergraph in representing higher-order relationships, and proposes an image-text retrieval method based on hypergraph clustering. Specifically, we construct hypergraphs to capture feature relationships within image and text modalities, as well as between image and text. This allows us to effectively model complex relationships between features of different modalities and explore the semantic connectivity within and across modalities. To compensate for potential semantic feature loss during the construction of the hypergraph neural network, we design a weight-adaptive coarse and fine-grained feature fusion module for semantic supplementation. Comprehensive experimental results on three common datasets demonstrate the effectiveness of the proposed method.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"103 ","pages":"Article 104258"},"PeriodicalIF":2.6,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141993447","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Non-local feature aggregation quaternion network for single image deraining
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104250
Gonghe Xiong , Shan Gai , Bofan Nie , Feilong Chen , Chengli Sun
Existing deraining methods are based on convolutional neural networks (CNNs) that learn the mapping between rainy and clean images. However, real-valued CNNs process a color image as three independent channels, which fails to fully leverage color information. Additionally, sliding-window-based networks cannot effectively model the non-local characteristics of an image. In this work, we propose a non-local feature aggregation quaternion network (NLAQNet) composed of two concurrent sub-networks: the Quaternion Local Detail Repair Network (QLDRNet) and the Multi-Level Feature Aggregation Network (MLFANet). Within QLDRNet, the Local Detail Repair Block (LDRB) is proposed to repair image background regions that have not been damaged by rain streaks. Within MLFANet, we introduce two specialized blocks, the Non-Local Feature Aggregation Block (NLAB) and the Feature Aggregation Block (Mix), designed to restore rain-streak-damaged image backgrounds. Extensive experiments demonstrate that the proposed network delivers strong performance in both qualitative and quantitative evaluations on existing datasets. The code is available at https://github.com/xionggonghe/NLAQNet.
{"title":"Non-local feature aggregation quaternion network for single image deraining","authors":"Gonghe Xiong , Shan Gai , Bofan Nie , Feilong Chen , Chengli Sun","doi":"10.1016/j.jvcir.2024.104250","DOIUrl":"10.1016/j.jvcir.2024.104250","url":null,"abstract":"<div><p>The existing deraining methods are based on convolutional neural networks (CNN) learning the mapping relationship between rainy and clean images. However, the real-valued CNN processes the color images as three independent channels separately, which fails to fully leverage color information. Additionally, sliding-window-based neural networks cannot effectively model the non-local characteristics of an image. In this work, we proposed a non-local feature aggregation quaternion network (NLAQNet), which is composed of two concurrent sub-networks: the Quaternion Local Detail Repair Network (QLDRNet) and the Multi-Level Feature Aggregation Network (MLFANet). Furthermore, in the subnetwork of QLDRNet, the Local Detail Repair Block (LDRB) is proposed to repair the backdrop of an image that has not been damaged by rain streaks. Finally, within the MLFANet subnetwork, we have introduced two specialized blocks, namely the Non-Local Feature Aggregation Block (NLAB) and the Feature Aggregation Block (Mix), specifically designed to address the restoration of rain-streak-damaged image backgrounds. Extensive experiments demonstrate that the proposed network delivers strong performance in both qualitative and quantitative evaluations on existing datasets. The code is available at <span><span>https://github.com/xionggonghe/NLAQNet</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"103 ","pages":"Article 104250"},"PeriodicalIF":2.6,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141964652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Facial feature point detection under large range of face deformations
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104264
Nora Algaraawi , Tim Morris , Timothy F. Cootes
Facial Feature Point Detection (FFPD) plays a significant role in several face analysis tasks, such as feature extraction and classification. This paper presents a fully automatic FFPD system based on Random Forest Regression Voting in a Constrained Local Model (RFRV-CLM) framework. A global detector finds the approximate positions of the facial region and eye centers, and a sequence of local RFRV-CLMs is then used to locate a detailed set of points around the facial features. Both global and local models use random forest regression to vote for optimal positions. The system is evaluated on facial expression localization using five facial expression databases with differing characteristics, covering age, intensity, the six basic expressions, 22 compound expressions, static and dynamic images, and deliberate and spontaneous expressions. Quantitative evaluation of automatic point localization against manually annotated ground-truth points shows that the proposed approach is encouraging and outperforms alternative techniques tested on the same databases.
{"title":"Facial feature point detection under large range of face deformations","authors":"Nora Algaraawi , Tim Morris , Timothy F. Cootes","doi":"10.1016/j.jvcir.2024.104264","DOIUrl":"10.1016/j.jvcir.2024.104264","url":null,"abstract":"<div><p>Facial Feature Point Detection (FFPD) plays a significant role in several face analysis tasks such as feature extraction and classification. This paper presents a Fully Automatic FFPD system using the application of Random Forest Regression Voting in a Constrained Local Model (RFRV-CLM) framework. A global detector is used to find the approximate positions of the facial region and eye centers. A sequence of local RFRV-CLMs are used to locate a detailed set of points around the facial features. Both global and local models use Random Forest Regression to vote for optimal positions. The system is evaluated in the task of facial expression localization using five different facial expression databases of different characteristics including age, intensity, 6-basic expressions, 22 compound expressions, static and dynamic images, and deliberate and spontaneous expressions. Quantitative results of the evaluation of automatic point localization against manual points (ground truth) demonstrated that the results of the proposed approach are encouraging and outperform the results of alternative techniques tested on the same databases.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"103 ","pages":"Article 104264"},"PeriodicalIF":2.6,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142021580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EM-Gait: Gait recognition using motion excitation and feature embedding self-attention
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104266
Zhengyou Wang , Chengyu Du , Yunpeng Zhang , Jing Bai , Shanna Zhuang
Gait recognition, which enables long-distance and contactless identification, is an important biometric technology. Recent gait recognition methods focus on learning patterns of human movement or appearance during walking and construct the corresponding spatio-temporal representations. However, each individual has their own movement patterns, and simple spatio-temporal features struggle to describe the motion changes of body parts, especially when confounding variables such as clothing and carried items are present, so feature distinguishability is reduced. To this end, we propose the Embedding and Motion (EM) block and the Fine Feature Extractor (FFE) to capture walking motion and enhance the differences between local motion patterns. The EM block consists of a Motion Excitation (ME) module that captures temporal motion changes and an Embedding Self-attention (ES) module that enhances the expression of motion patterns. Specifically, without introducing additional parameters, the ME module learns difference information between frames and intervals to obtain a dynamic representation of walking for frame sequences of arbitrary length. The ES module divides the feature map hierarchically based on element values, blurring differences between elements to highlight the motion track. Furthermore, the FFE independently learns spatio-temporal representations of the human body for different horizontal body parts. Benefiting from the EM block and the proposed motion branch, our method combines motion change information and significantly improves performance under cross-appearance conditions. On the popular CASIA-B dataset, the proposed EM-Gait outperforms existing single-modal gait recognition methods.
{"title":"EM-Gait: Gait recognition using motion excitation and feature embedding self-attention","authors":"Zhengyou Wang , Chengyu Du , Yunpeng Zhang , Jing Bai , Shanna Zhuang","doi":"10.1016/j.jvcir.2024.104266","DOIUrl":"10.1016/j.jvcir.2024.104266","url":null,"abstract":"<div><p>Gait recognition, which can realize long-distance and contactless identification, is an important biometric technology. Recent gait recognition methods focus on learning the pattern of human movement or appearance during walking, and construct the corresponding spatio-temporal representations. However, different individuals have their own laws of movement patterns, simple spatial–temporal features are difficult to describe changes in motion of human parts, especially when confounding variables such as clothing and carrying are included, thus distinguishability of features is reduced. To this end, we propose the Embedding and Motion (EM) block and Fine Feature Extractor (FFE) to capture the motion mode of walking and enhance the difference of local motion rules. The EM block consists of a Motion Excitation (ME) module to capture the changes of temporal motion and an Embedding Self-attention (ES) module to enhance the expression of motion rules. Specifically, without introducing additional parameters, ME module learns the difference information between frames and intervals to obtain the dynamic change representation of walking for frame sequences with uncertain length. By contrast, ES module divides the feature map hierarchically based on element values, blurring the difference of elements to highlight the motion track. Furthermore, we present the FFE, which independently learns the spatio-temporal representations of human body according to different horizontal parts of individuals. Benefiting from EM block and our proposed motion branch, our method innovatively combines motion change information, significantly improving the performance of the model under cross appearance conditions. On the popular dataset CASIA-B, our proposed EM-Gait is better than the existing single-modal gait recognition methods.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"103 ","pages":"Article 104266"},"PeriodicalIF":2.6,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142075777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DDR: A network of image deraining systems for dark environments
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104244
Zhongning Ding , Yun Zhu , Shaoshan Niu , Jianyu Wang , Yan Su
In computer vision, the degradation of image quality under adverse weather conditions remains a significant challenge. To tackle image enhancement and deraining in dark settings, we integrate enhancement and deraining techniques into the DDR (Dark Environment Deraining Network) system. This specialized network is designed to enhance and clarify low-light images compromised by raindrops. DDR employs a divide-and-conquer strategy and an appropriate choice of networks to discern raindrop and background patterns within images, mitigating the noise and blurring induced by raindrops in dark settings and thereby improving visual fidelity. Tests on real-world imagery and the Rain LOL dataset show that the network offers a robust solution for deraining in dark conditions and can advance the performance of computer vision systems under challenging weather. The DDR research provides technical and theoretical support for improving image quality in dark environments.
{"title":"DDR: A network of image deraining systems for dark environments","authors":"Zhongning Ding , Yun Zhu , Shaoshan Niu , Jianyu Wang , Yan Su","doi":"10.1016/j.jvcir.2024.104244","DOIUrl":"10.1016/j.jvcir.2024.104244","url":null,"abstract":"<div><p>In the domain of computer vision, addressing the degradation of image quality under adverse weather conditions remains a significant challenge. To tackle the challenges of image enhancement and deraining in dark settings, we have integrated image enhancement and deraining technologies to develop the DDR (Dark Environment Deraining Network) system. This specialized network is designed to enhance and clarify images in low-light conditions compromised by raindrops. DDR employs a strategic divide-and-conquer approach and an apt network selection to discern patterns of raindrops and background elements within images. It is capable of mitigating noise and blurring induced by raindrops in dark settings, thus enhancing the visual fidelity of images. Through testing on real-world imagery and the Rain LOL dataset, this innovative network offers a robust solution for deraining tasks in dark conditions, inspiring advancements in the performance of computer vision systems under challenging weather scenarios. The research of DDR provides technical and theoretical support for improving image quality in dark environment.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"103 ","pages":"Article 104244"},"PeriodicalIF":2.6,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High-capacity multi-MSB predictive reversible data hiding in encrypted domain for triangular mesh models
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104246
Guoyou Zhang , Xiaoxue Cheng , Fan Yang , Anhong Wang , Xuenan Zhang , Li Liu
Reversible data hiding in the encrypted domain (RDH-ED) is widely used in sensitive fields such as privacy protection and copyright authentication. However, the embedding capacity of existing methods is generally low because model topology is not fully exploited. To improve embedding capacity, this paper proposes a high-capacity multi-MSB predictive reversible data hiding scheme in the encrypted domain (MMPRDH-ED). First, the 3D model is subdivided by a triangular mesh subdivision (TMS) algorithm, and its vertices are divided into a reference set and an embedded set. Then, to make full use of the redundant space of the embedded vertices, Multi-MSB Prediction (MMP) and a Multi-Layer Embedding Strategy (MLES) are used to increase capacity. Finally, stream encryption is used to encrypt the model and the data to ensure security. Experimental results show that, compared with existing methods, the embedding capacity of MMPRDH-ED increases by 53%.
{"title":"High-capacity multi-MSB predictive reversible data hiding in encrypted domain for triangular mesh models","authors":"Guoyou Zhang , Xiaoxue Cheng , Fan Yang , Anhong Wang , Xuenan Zhang , Li Liu","doi":"10.1016/j.jvcir.2024.104246","DOIUrl":"10.1016/j.jvcir.2024.104246","url":null,"abstract":"<div><p>Reversible data hiding in encrypted domain (RDH-ED) is widely used in sensitive fields such as privacy protection and copyright authentication. However, the embedding capacity of existing methods is generally low due to the insufficient use of model topology. In order to improve the embedding capacity, this paper proposes a high-capacity multi-MSB predictive reversible data hiding in encrypted domain (MMPRDH-ED). Firstly, the 3D model is subdivided by triangular mesh subdivision (TMS) algorithm, and its vertices are divided into reference set and embedded set. Then, in order to make full use of the redundant space of embedded vertices, Multi-MSB prediction (MMP) and Multi-layer Embedding Strategy (MLES) are used to improve the capacity. Finally, stream encryption technology is used to encrypt the model and data to ensure data security. The experimental results show that compared with the existing methods, the embedding capacity of MMPRDH-ED is increased by 53 %, which has higher advantages.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"103 ","pages":"Article 104246"},"PeriodicalIF":2.6,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141782637","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Versatile depth estimator based on common relative depth estimation and camera-specific relative-to-metric depth conversion
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104252
Jinyoung Jun , Jae-Han Lee , Chang-Su Kim
A typical monocular depth estimator is trained for a single camera, so its performance drops severely on images taken with different cameras. To address this issue, we propose a versatile depth estimator (VDE), composed of a common relative depth estimator (CRDE) and multiple relative-to-metric converters (R2MCs). The CRDE extracts relative depth information, and each R2MC converts the relative information to predict metric depths for a specific camera. The proposed VDE can cope with diverse scenes, including both indoor and outdoor scenes, with only a 1.12% parameter increase per camera. Experimental results demonstrate that VDE supports multiple cameras effectively and efficiently and also achieves state-of-the-art performance in the conventional single-camera scenario.
{"title":"Versatile depth estimator based on common relative depth estimation and camera-specific relative-to-metric depth conversion","authors":"Jinyoung Jun , Jae-Han Lee , Chang-Su Kim","doi":"10.1016/j.jvcir.2024.104252","DOIUrl":"10.1016/j.jvcir.2024.104252","url":null,"abstract":"<div><p>A typical monocular depth estimator is trained for a single camera, so its performance drops severely on images taken with different cameras. To address this issue, we propose a versatile depth estimator (VDE), composed of a common relative depth estimator (CRDE) and multiple relative-to-metric converters (R2MCs). The CRDE extracts relative depth information, and each R2MC converts the relative information to predict metric depths for a specific camera. The proposed VDE can cope with diverse scenes, including both indoor and outdoor scenes, with only a 1.12% parameter increase per camera. Experimental results demonstrate that VDE supports multiple cameras effectively and efficiently and also achieves state-of-the-art performance in the conventional single-camera scenario.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"103 ","pages":"Article 104252"},"PeriodicalIF":2.6,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142006769","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fusing structure from motion and simulation-augmented pose regression from optical flow for challenging indoor environments
Pub Date: 2024-08-01 | DOI: 10.1016/j.jvcir.2024.104256
Felix Ott , Lucas Heublein , David Rügamer , Bernd Bischl , Christopher Mutschler
The localization of objects is essential in many applications, such as robotics, virtual and augmented reality, and warehouse logistics. Recent advancements in deep learning have enabled localization using monocular cameras. Traditionally, structure from motion (SfM) techniques predict an object’s absolute position from a point cloud, while absolute pose regression (APR) methods use neural networks to understand the environment semantically. However, both approaches face challenges from environmental factors like motion blur, lighting changes, repetitive patterns, and featureless areas. This study addresses these challenges by incorporating additional information and refining absolute pose estimates with relative pose regression (RPR) methods. RPR also struggles with issues like motion blur. To overcome this, we compute the optical flow between consecutive images using the Lucas–Kanade algorithm and use a small recurrent convolutional network to predict relative poses. Combining absolute and relative poses is difficult due to differences between global and local coordinate systems. Current methods use pose graph optimization (PGO) to align these poses. In this work, we propose recurrent fusion networks to better integrate absolute and relative pose predictions, enhancing the accuracy of absolute pose estimates. We evaluate eight different recurrent units and create a simulation environment to pre-train the APR and RPR networks for improved generalization. Additionally, we record a large dataset of various scenarios in a challenging indoor environment resembling a warehouse with transportation robots. Through hyperparameter searches and experiments, we demonstrate that our recurrent fusion method outperforms PGO in effectiveness.
{"title":"Fusing structure from motion and simulation-augmented pose regression from optical flow for challenging indoor environments","authors":"Felix Ott , Lucas Heublein , David Rügamer , Bernd Bischl , Christopher Mutschler","doi":"10.1016/j.jvcir.2024.104256","DOIUrl":"10.1016/j.jvcir.2024.104256","url":null,"abstract":"<div><p>The localization of objects is essential in many applications, such as robotics, virtual and augmented reality, and warehouse logistics. Recent advancements in deep learning have enabled localization using monocular cameras. Traditionally, structure from motion (SfM) techniques predict an object’s absolute position from a point cloud, while absolute pose regression (APR) methods use neural networks to understand the environment semantically. However, both approaches face challenges from environmental factors like motion blur, lighting changes, repetitive patterns, and featureless areas. This study addresses these challenges by incorporating additional information and refining absolute pose estimates with relative pose regression (RPR) methods. RPR also struggles with issues like motion blur. To overcome this, we compute the optical flow between consecutive images using the Lucas–Kanade algorithm and use a small recurrent convolutional network to predict relative poses. Combining absolute and relative poses is difficult due to differences between global and local coordinate systems. Current methods use pose graph optimization (PGO) to align these poses. In this work, we propose recurrent fusion networks to better integrate absolute and relative pose predictions, enhancing the accuracy of absolute pose estimates. We evaluate eight different recurrent units and create a simulation environment to pre-train the APR and RPR networks for improved generalization. Additionally, we record a large dataset of various scenarios in a challenging indoor environment resembling a warehouse with transportation robots. Through hyperparameter searches and experiments, we demonstrate that our recurrent fusion method outperforms PGO in effectiveness.</p></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"103 ","pages":"Article 104256"},"PeriodicalIF":2.6,"publicationDate":"2024-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S1047320324002128/pdfft?md5=f88e7c25e01d5af99626350e7efd4744&pid=1-s2.0-S1047320324002128-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141933775","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}