Human pose estimation from monocular video has long been a focus of research in the human–computer interaction community; the task suffers mainly from depth ambiguity and self-occlusion. While recently proposed learning-based approaches have demonstrated promising performance, they do not fully exploit the complementarity of features. In this paper, the authors propose a novel multi-feature and multi-level fusion network (MMF-Net), which extracts and combines joint features, bone features and trajectory features at multiple levels to estimate 3D human pose. In MMF-Net, a bone length estimation module and a trajectory multi-level fusion module first extract the geometric size information of the human body and multi-level trajectory information of human motion, respectively. A fusion attention-based combination (FABC) module then extracts multi-level topological structure information of the human body and effectively fuses the topological, geometric size and trajectory information. Extensive experiments show that MMF-Net achieves competitive results on the Human3.6M, HumanEva-I and MPI-INF-3DHP datasets.
{"title":"MMF-Net: A novel multi-feature and multi-level fusion network for 3D human pose estimation","authors":"Qianxing Li, Dehui Kong, Jinghua Li, Baocai Yin","doi":"10.1049/cvi2.12336","DOIUrl":"https://doi.org/10.1049/cvi2.12336","url":null,"abstract":"<p>Human pose estimation based on monocular video has always been the focus of research in the human computer interaction community, which suffers mainly from depth ambiguity and self-occlusion challenges. While the recently proposed learning-based approaches have demonstrated promising performance, they do not fully explore the complementarity of features. In this paper, the authors propose a novel multi-feature and multi-level fusion network (MMF-Net), which extracts and combines joint features, bone features and trajectory features at multiple levels to estimate 3D human pose. In MMF-Net, firstly, the bone length estimation module and the trajectory multi-level fusion module are used to extract the geometric size information of the human body and multi-level trajectory information of human motion, respectively. Then, the fusion attention-based combination (FABC) module is used to extract multi-level topological structure information of the human body, and effectively fuse topological structure information, geometric size information and trajectory information. Extensive experiments show that MMF-Net achieves competitive results on Human3.6M, HumanEva-I and MPI-INF-3DHP datasets.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12336","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, many few-shot classification methods have been proposed. However, only a few of them have explored robust classification, which is an important aspect of human visual intelligence. Humans can effortlessly recognise visual patterns, including lines, circles, and even characters, from image data that has been corrupted or degraded. In this paper, the authors investigate a robust classification method that extends the classical paradigm of robust geometric model fitting. The method views an image as a set of points in a low-dimensional space and analyses each image through low-dimensional geometric model fitting. In contrast, the majority of other methods, such as deep learning methods, treat an image as a single point in a high-dimensional space. The authors evaluate the performance of the method using a noisy Omniglot dataset. The experimental results demonstrate that the proposed method is significantly more robust than other methods. The source code and data for this paper are available at https://github.com/pengsuhua/PMF_OMNIGLOT.
{"title":"A robust few-shot classifier with image as set of points","authors":"Suhua Peng, Zongliang Zhang, Xingwang Huang, Zongyue Wang, Shubing Su, Guorong Cai","doi":"10.1049/cvi2.12340","DOIUrl":"https://doi.org/10.1049/cvi2.12340","url":null,"abstract":"<p>In recent years, many few-shot classification methods have been proposed. However, only a few of them have explored robust classification, which is an important aspect of human visual intelligence. Humans can effortlessly recognise visual patterns, including lines, circles, and even characters, from image data that has been corrupted or degraded. In this paper, the authors investigate a robust classification method that extends the classical paradigm of robust geometric model fitting. The method views an image as a set of points in a low-dimensional space and analyses each image through low-dimensional geometric model fitting. In contrast, the majority of other methods, such as deep learning methods, treat an image as a single point in a high-dimensional space. The authors evaluate the performance of the method using a noisy Omniglot dataset. The experimental results demonstrate that the proposed method is significantly more robust than other methods. The source code and data for this paper are available at https://github.com/pengsuhua/PMF_OMNIGLOT.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12340","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the field of 3D human pose estimation (HPE), many deep learning algorithms overlook the topological relationships between 2D keypoints, resulting in imprecise regression of 3D coordinates and a notable decline in estimation performance. To address this limitation, this paper proposes a novel approach to 3D HPE, termed the Spatial Mamba Graph Convolutional Neural Network (GCN) Former (SMGNFormer). The proposed method utilises the Mamba architecture to extract spatial information from 2D keypoints and integrates GCNs with multi-head attention mechanisms to build a relational graph of 2D keypoints across a global receptive field. The outputs are subsequently processed by a Time-Frequency Feature Fusion Transformer to estimate 3D human poses. SMGNFormer demonstrates superior estimation performance on the Human3.6M dataset and real-world video data compared to most Transformer-based algorithms. Moreover, the proposed method achieves a training speed comparable to PoseFormerv2, providing a clear advantage over other methods in its category.
{"title":"SMGNFORMER: Fusion Mamba-graph transformer network for human pose estimation","authors":"Yi Li, Zan Wang, Weiran Niu","doi":"10.1049/cvi2.12339","DOIUrl":"https://doi.org/10.1049/cvi2.12339","url":null,"abstract":"<p>In the field of 3D human pose estimation (HPE), many deep learning algorithms overlook the topological relationships between 2D keypoints, resulting in imprecise regression of 3D coordinates and a notable decline in estimation performance. To address this limitation, this paper proposes a novel approach to 3D HPE, termed the Spatial Mamba Graph Convolutional Neural Network (GCN) Former (SMGNFormer). The proposed method utilises the Mamba architecture to extract spatial information from 2D keypoints and integrates GCNs with multi-head attention mechanisms to build a relational graph of 2D keypoints across a global receptive field. The outputs are subsequently processed by a Time-Frequency Feature Fusion Transformer to estimate 3D human poses. SMGNFormer demonstrates superior estimation performance on the Human3.6M dataset and real-world video data compared to most Transformer-based algorithms. Moreover, the proposed method achieves a training speed comparable to PoseFormerv2, providing a clear advantage over other methods in its category.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12339","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Lane detection is a fundamental problem in autonomous driving, providing vehicles with essential road information. Despite attention from scholars and engineers, LiDAR-based lane detection faces challenges such as unsatisfactory detection accuracy and significant computation overhead. In this paper, the authors propose LLFormer4D to overcome these technical challenges by leveraging the strengths of both Convolutional Neural Network and Transformer architectures. Specifically, the Temporal Feature Fusion module is introduced to enhance accuracy and robustness by integrating features from multi-frame point clouds. In addition, a sparse Transformer decoder based on Lane Key-point Query is designed, which introduces key-point supervision for each lane line to streamline post-processing. The authors conduct experiments and evaluate the proposed method on the K-Lane and nuScenes map datasets. The results demonstrate the effectiveness of the presented method, which achieves second place with an F1 score of 82.39 and a processing speed of 16.03 frames per second on the K-Lane dataset. Furthermore, the algorithm attains the best mAP of 70.66 for lane detection on the nuScenes map dataset.
{"title":"LLFormer4D: LiDAR-based lane detection method by temporal feature fusion and sparse transformer","authors":"Jun Hu, Chaolu Feng, Haoxiang Jie, Zuotao Ning, Xinyi Zuo, Wei Liu, Xiangyu Wei","doi":"10.1049/cvi2.12338","DOIUrl":"https://doi.org/10.1049/cvi2.12338","url":null,"abstract":"<p>Lane detection is a fundamental problem in autonomous driving, which provides vehicles with essential road information. Despite the attention from scholars and engineers, lane detection based on LiDAR meets challenges such as unsatisfactory detection accuracy and significant computation overhead. In this paper, the authors propose LLFormer4D to overcome these technical challenges by leveraging the strengths of both Convolutional Neural Network and Transformer networks. Specifically, the Temporal Feature Fusion module is introduced to enhance accuracy and robustness by integrating features from multi-frame point clouds. In addition, a sparse Transformer decoder based on Lane Key-point Query is designed, which introduces key-point supervision for each lane line to streamline the post-processing. The authors conduct experiments and evaluate the proposed method on the K-Lane and nuScenes map datasets respectively. The results demonstrate the effectiveness of the presented method, achieving second place with an F1 score of 82.39 and a processing speed of 16.03 Frames Per Seconds on the K-Lane dataset. Furthermore, this algorithm attains the best mAP of 70.66 for lane detection on the nuScenes map dataset.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12338","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video prediction is the process of learning necessary information from historical frames to predict future video frames. Learning features from historical frames is a crucial step in this process. However, most current methods learn at essentially a single scale; even when they do learn features at different scales, they cannot fully integrate and utilise them, resulting in unsatisfactory predictions. To address this issue, a hierarchical multi-scale fusion unit (HMSFU) is proposed. With a hierarchical multi-scale architecture, each layer predicts future frames at a different granularity using a different convolutional scale. The abstract features from different layers can be fused, enabling the model not only to capture rich contextual information but also to expand its receptive field, enhance its expressive power, and improve its applicability to complex prediction scenarios. To fully utilise the expanded receptive field, HMSFU incorporates three fusion modules. The first is the single-layer historical attention fusion module, which uses an attention mechanism to fuse features from historical frames into the current frame at each layer. The second is the single-layer spatiotemporal fusion module, which fuses complementary temporal and spatial features at each layer. The third is the multi-layer spatiotemporal fusion module, which fuses spatiotemporal features from different layers. Additionally, the authors not only penalise frame-level error with a mean squared error loss but also introduce the novel use of Kullback–Leibler (KL) divergence to account for inter-frame variations. Experimental results demonstrate that the proposed HMSFU model achieves the best performance on popular video prediction datasets, showcasing its remarkable competitiveness in the field.
{"title":"HMSFU: A hierarchical multi-scale fusion unit for video prediction and beyond","authors":"Hongchang Zhu, Faming Fang","doi":"10.1049/cvi2.12312","DOIUrl":"https://doi.org/10.1049/cvi2.12312","url":null,"abstract":"<p>Video prediction is the process of learning necessary information from historical frames to predict future video frames. Learning features from historical frames is a crucial step in this process. However, most current methods have a relatively single-scale learning approach, even if they learn features at different scales, they cannot fully integrate and utilise them, resulting in unsatisfactory prediction results. To address this issue, a hierarchical multi-scale fusion unit (HMSFU) is proposed. By using a hierarchical multi-scale architecture, each layer predicts future frames at different granularities using different convolutional scales. The abstract features from different layers can be fused, enabling the model not only to capture rich contextual information but also to expand the model's receptive field, enhance its expressive power, and improve its applicability to complex prediction scenarios. To fully utilise the expanded receptive field, HMSFU incorporates three fusion modules. The first module is the single-layer historical attention fusion module, which uses an attention mechanism to fuse the features from historical frames into the current frame at each layer. The second module is the single-layer spatiotemporal fusion module, which fuses complementary temporal and spatial features at each layer. The third module is the multi-layer spatiotemporal fusion module, which fuses spatiotemporal features from different layers. Additionally, the authors not only focus on the frame-level error using mean squared error loss, but also introduce the novel use of Kullback–Leibler (KL) divergence to consider inter-frame variations. Experimental results demonstrate that our proposed HMSFU model achieves the best performance on popular video prediction datasets, showcasing its remarkable competitiveness in the field.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12312","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic segmentation of aerial LiDAR data is a crucial step for the accurate identification of urban objects in various applications pertaining to sustainable urban development. However, this task becomes more complex in urban areas characterised by the coexistence of modern developments and natural vegetation. The unstructured nature of point cloud data, along with data sparsity, irregular point distribution, and the varying sizes of urban objects, presents challenges for point cloud classification. To address these challenges, the development of a robust algorithmic approach encompassing efficient feature sets and a classification model is essential. This study incorporates point-wise features to capture the local spatial context of points in the datasets. Furthermore, an ensemble machine learning model based on extreme gradient boosting (XGBoost), which integrates sequential training of weak learners, is utilised to enhance the model's resilience. To thoroughly investigate the efficacy of the proposed approach, the study uses three distinct datasets from diverse geographical locations, each presenting unique challenges related to class distribution, 3D terrain intricacies, and geographical variation. A Land-cover Diversity Index is introduced to quantify the complexity of landcover in 3D by measuring the degree of class heterogeneity and the frequency of class variation in the dataset. The proposed approach achieved an accuracy of 90% on the regionally complex, higher landcover-diversity Trivandrum Aerial LiDAR Dataset. Furthermore, the results demonstrate overall predictive accuracies of 91% and 87% on data segments from two benchmark datasets, DALES and Vaihingen 3D.
{"title":"Semantic segmentation of urban airborne LiDAR data of varying landcover diversity using XGBoost","authors":"Jayati Vijaywargiya, Anandakumar M. Ramiya","doi":"10.1049/cvi2.12334","DOIUrl":"https://doi.org/10.1049/cvi2.12334","url":null,"abstract":"<p>Semantic segmentation of aerial LiDAR dataset is a crucial step for accurate identification of urban objects for various applications pertaining to sustainable urban development. However, this task becomes more complex in urban areas characterised by the coexistence of modern developments and natural vegetation. The unstructured nature of point cloud data, along with data sparsity, irregular point distribution, and varying sizes of urban objects, presents challenges in point cloud classification. To address these challenges, development of robust algorithmic approach encompassing efficient feature sets and classification model are essential. This study incorporates point-wise features to capture the local spatial context of points in datasets. Furthermore, an ensemble machine learning model based on extreme boosting is utilised, which integrates sequential training for weak learners, to enhance the model’s resilience. To thoroughly investigate the efficacy of the proposed approach, this study utilises three distinct datasets from diverse geographical locations, each presenting unique challenges related to class distribution, 3D terrain intricacies, and geographical variations. The Land-cover Diversity Index is introduced to quantify the complexity of landcover in 3D by measuring the degree of class heterogeneity and the frequency of class variation in the dataset. The proposed approach achieved an accuracy of 90% on the regionally complex, higher landcover diversity dataset, Trivandrum Aerial LiDAR Dataset. Furthermore, the results of the study demonstrate improved overall predictive accuracy of 91% and 87% on data segments from two benchmark datasets, DALES and Vaihingen 3D.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12334","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parkinson's disease (PD) is a neurodegenerative disorder primarily affecting middle-aged and elderly populations. Its insidious onset, high disability rate, long diagnostic cycle, and high diagnostic costs impose a heavy burden on patients and their families. Leveraging artificial intelligence, with its rapid diagnostic speed, high accuracy, and resistance to fatigue, for intelligent assisted diagnosis of PD holds significant promise for alleviating patients' financial stress, shortening diagnostic cycles, and helping patients seize the golden period for early treatment. This paper proposes an Attention and Multi-level Enhancement Fusion Network (AMEF-Net) based on the characteristics of three-dimensional medical imaging and the specific manifestations of PD in medical images. The network focuses on the small lesion areas and structural lesion areas that traditional deep learning models often overlook, applying attention to and processing imaging information at multiple levels. The model achieved a diagnostic accuracy of 98.867%, a precision of 99.830%, a sensitivity of 99.182%, and a specificity of 99.384% on Magnetic Resonance Images from the Parkinson's Progression Markers Initiative dataset. On Diffusion Tensor Images, it achieved a diagnostic accuracy of 99.602%, a precision of 99.930%, a sensitivity of 99.463%, and a specificity of 99.877%. The relevant code is available at https://github.com/EdwardTj/AMEF-NET.
{"title":"AMEF-Net: Towards an attention and multi-level enhancement fusion for medical image classification in Parkinson's aided diagnosis","authors":"Qingyan Ding, Yu Pan, Jianxin Liu, Lianxin Li, Nan Liu, Na Li, Wan Zheng, Xuecheng Dong","doi":"10.1049/cvi2.12324","DOIUrl":"https://doi.org/10.1049/cvi2.12324","url":null,"abstract":"<p>Parkinson's disease (PD) is a neurodegenerative disorder primarily affecting middle-aged and elderly populations. Its insidious onset, high disability rate, long diagnostic cycle, and high diagnostic costs impose a heavy burden on patients and their families. Leveraging artificial intelligence, with its rapid diagnostic speed, high accuracy, and fatigue resistance, to achieve intelligent assisted diagnosis of PD holds significant promise for alleviating patients' financial stress, reducing diagnostic cycles, and helping patients seize the golden period for early treatment. This paper proposes an Attention and Multi-level Enhancement Fusion Network (AMEF-Net) based on the characteristics of three-dimensional medical imaging and the specific manifestations of PD in medical images. The focus is on small lesion areas and structural lesion areas that are often overlooked in traditional deep learning models, achieving multi-level attention and processing of imaging information. The model achieved a diagnostic accuracy of 98.867%, a precision of 99.830%, a sensitivity of 99.182%, and a specificity of 99.384% on Magnetic Resonance Images from the Parkinson's Progression Markers Initiative dataset. On Diffusion Tensor Images, it achieved a diagnostic accuracy of 99.602%, a precision of 99.930%, a sensitivity of 99.463%, and a specificity of 99.877%. The relevant code has been placed in https://github.com/EdwardTj/AMEF-NET.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12324","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D Single Object Tracking (SOT) plays a vital role in autonomous driving and robotics, yet traditional approaches have predominantly focused on pure LiDAR-based point cloud data, often neglecting the benefits of integrating image modalities. To address this gap, we propose a novel Multi-modal Image-LiDAR Tracker (MILT) designed to overcome the limitations of single-modality methods by effectively combining RGB and point cloud data. Our key contribution is a dual-branch architecture that separately extracts geometric features from LiDAR and texture features from images. These features are then fused in a BEV perspective to achieve a comprehensive representation of the tracked object. A significant innovation in our approach is the Image-to-LiDAR Adapter module, which transfers the rich feature representation capabilities of the image modality to the 3D tracking task, and the BEV-Fusion module, which facilitates the interactive fusion of geometry and texture features. By validating MILT on public datasets, we demonstrate substantial performance improvements over traditional methods, effectively showcasing the advantages of our multi-modal fusion strategy. This work advances the state of the art in 3D SOT by integrating complementary information from RGB and LiDAR modalities, resulting in enhanced tracking accuracy and robustness.
{"title":"Unlocking the power of multi-modal fusion in 3D object tracking","authors":"Yue Hu","doi":"10.1049/cvi2.12335","DOIUrl":"https://doi.org/10.1049/cvi2.12335","url":null,"abstract":"<p>3D Single Object Tracking plays a vital role in autonomous driving and robotics, yet traditional approaches have predominantly focused on using pure LiDAR-based point cloud data, often neglecting the benefits of integrating image modalities. To address this gap, we propose a novel Multi-modal Image-LiDAR Tracker (MILT) designed to overcome the limitations of single-modality methods by effectively combining RGB and point cloud data. Our key contribution is a dual-branch architecture that separately extracts geometric features from LiDAR and texture features from images. These features are then fused in a BEV perspective to achieve a comprehensive representation of the tracked object. A significant innovation in our approach is the Image-to-LiDAR Adapter module, which transfers the rich feature representation capabilities of the image modality to the 3D tracking task, and the BEV-Fusion module, which facilitates the interactive fusion of geometry and texture features. By validating MILT on public datasets, we demonstrate substantial performance improvements over traditional methods, effectively showcasing the advantages of our multi-modal fusion strategy. This work advances the state-of-the-art in SOT by integrating complementary information from RGB and LiDAR modalities, resulting in enhanced tracking accuracy and robustness.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12335","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, visual-language models (VLMs) have displayed potent capabilities in the field of computer vision. Their emerging role as the backbone of visual tasks necessitates studying class incremental learning (CIL) within the VLM architecture. However, the pre-training data for many VLMs is proprietary, and during the incremental phase, old task data may also raise privacy issues. Moreover, replay-based methods can introduce new problems, such as class imbalance, the selection of data for replay, and a trade-off between replay cost and performance. Therefore, the authors choose the more challenging rehearsal-free setting. In this paper, the authors study class-incremental tasks based on large pre-trained VLMs such as CLIP. Initially, at the category level, the authors combine traditional optimisation and distillation techniques, utilising both the pre-trained model and models trained in previous incremental stages to jointly guide the training of the new model. This paradigm effectively balances the stability and plasticity of the new model, mitigating catastrophic forgetting. Moreover, utilising the VLM infrastructure, the authors redefine the relationship between instances. This allows fine-grained instance relational information to be gleaned from the a priori knowledge provided during pre-training. The authors supplement this approach with an entropy-balancing method that allows the model to adaptively distribute optimisation weights across training samples. The authors' experimental results validate that their method, within the framework of VLMs, outperforms traditional CIL methods.
{"title":"Category-instance distillation based on visual-language models for rehearsal-free class incremental learning","authors":"Weilong Jin, Zilei Wang, Yixin Zhang","doi":"10.1049/cvi2.12327","DOIUrl":"https://doi.org/10.1049/cvi2.12327","url":null,"abstract":"<p>Recently, visual-language models (VLMs) have displayed potent capabilities in the field of computer vision. Their emerging trend as the backbone of visual tasks necessitates studying class incremental learning (CIL) issues within the VLM architecture. However, the pre-training data for many VLMs is proprietary, and during the incremental phase, old task data may also raise privacy issues. Moreover, replay-based methods can introduce new problems like class imbalance, the selection of data for replay and a trade-off between replay cost and performance. Therefore, the authors choose the more challenging rehearsal-free settings. In this paper, the authors study class-incremental tasks based on the large pre-trained vision-language models like CLIP model. Initially, at the category level, the authors combine traditional optimisation and distillation techniques, utilising both pre-trained models and models trained in previous incremental stages to jointly guide the training of the new model. This paradigm effectively balances the stability and plasticity of the new model, mitigating the issue of catastrophic forgetting. Moreover, utilising the VLM infrastructure, the authors redefine the relationship between instances. This allows us to glean fine-grained instance relational information from the a priori knowledge provided during pre-training. The authors supplement this approach with an entropy-balancing method that allows the model to adaptively distribute optimisation weights across training samples. The authors’ experimental results validate that their method, within the framework of VLMs, outperforms traditional CIL methods.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12327","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Camera pose estimation plays a crucial role in computer vision and is widely used in augmented reality, robotics and autonomous driving. However, previous studies have neglected the presence of outliers in measurements, and even a small percentage of outliers will significantly degrade precision. To deal with outliers, this paper proposes a graduated non-convexity (GNC) method to suppress them in robust camera pose estimation, which serves as the core of GNCPnP. The authors first reformulate the camera pose estimation problem using a non-convex cost, which is less affected by outliers. Then, so that a non-minimal solver can be applied to the reformulated problem, the authors transform it using Black–Rangarajan duality. Finally, to address the dependence of non-convex optimisation on initial values, the GNC method is customised according to the truncated least-squares cost. The results of simulation and real experiments show that GNCPnP can effectively handle the interference of outliers and achieves higher accuracy than existing state-of-the-art algorithms. In particular, the camera pose estimation accuracy of GNCPnP with a low percentage of outliers is almost comparable to that of the state-of-the-art algorithm with no outliers.
{"title":"Outliers rejection for robust camera pose estimation using graduated non-convexity","authors":"Hao Yi, Bo Liu, Bin Zhao, Enhai Liu","doi":"10.1049/cvi2.12330","DOIUrl":"https://doi.org/10.1049/cvi2.12330","url":null,"abstract":"<p>Camera pose estimation plays a crucial role in computer vision, which is widely used in augmented reality, robotics and autonomous driving. However, previous studies have neglected the presence of outliers in measurements, so that even a small percentage of outliers will significantly degrade precision. In order to deal with outliers, this paper proposes using a graduated non-convexity (GNC) method to suppress outliers in robust camera pose estimation, which serves as the core of GNCPnP. The authors first reformulate the camera pose estimation problem using a non-convex cost, which is less affected by outliers. Then, to apply a non-minimum solver to solve the reformulated problem, the authors use the Black-Rangarajan duality theory to transform it. Finally, to address the dependence of non-convex optimisation on initial values, the GNC method was customised according to the truncated least squares cost. The results of simulation and real experiments show that GNCPnP can effectively handle the interference of outliers and achieve higher accuracy compared to existing state-of-the-art algorithms. In particular, the camera pose estimation accuracy of GNCPnP in the case of a low percentage of outliers is almost comparable to that of the state-of-the-art algorithm in the case of no outliers.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12330","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}