
Latest Publications in IET Computer Vision

MMF-Net: A novel multi-feature and multi-level fusion network for 3D human pose estimation
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-01-07 | DOI: 10.1049/cvi2.12336
Qianxing Li, Dehui Kong, Jinghua Li, Baocai Yin

Human pose estimation from monocular video has long been a focus of research in the human-computer interaction community, and it suffers mainly from depth ambiguity and self-occlusion. While recently proposed learning-based approaches have demonstrated promising performance, they do not fully explore the complementarity of features. In this paper, the authors propose a novel multi-feature and multi-level fusion network (MMF-Net), which extracts and combines joint features, bone features and trajectory features at multiple levels to estimate 3D human pose. In MMF-Net, the bone length estimation module and the trajectory multi-level fusion module first extract the geometric size information of the human body and multi-level trajectory information of human motion, respectively. Then, the fusion attention-based combination (FABC) module extracts multi-level topological structure information of the human body and effectively fuses topological structure, geometric size and trajectory information. Extensive experiments show that MMF-Net achieves competitive results on the Human3.6M, HumanEva-I and MPI-INF-3DHP datasets.
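
As a rough illustration of the fusion idea described in the abstract, the sketch below shows an attention-weighted combination of joint, bone and trajectory feature streams in PyTorch. The module name, layer sizes and single-scalar attention scheme are assumptions for illustration only, not the authors' FABC implementation.

```python
# Hypothetical sketch of attention-weighted fusion of joint, bone and
# trajectory features; layer sizes and names are illustrative only.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # one scalar attention score per stream
        self.proj = nn.Linear(dim, dim)

    def forward(self, joint_f, bone_f, traj_f):
        # Each input: (batch, dim). Stack into (batch, streams, dim).
        streams = torch.stack([joint_f, bone_f, traj_f], dim=1)
        weights = torch.softmax(self.score(streams), dim=1)  # (batch, streams, 1)
        fused = (weights * streams).sum(dim=1)               # (batch, dim)
        return self.proj(fused)

if __name__ == "__main__":
    fuse = AttentionFusion(dim=256)
    j, b, t = (torch.randn(8, 256) for _ in range(3))
    print(fuse(j, b, t).shape)  # torch.Size([8, 256])
```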

{"title":"MMF-Net: A novel multi-feature and multi-level fusion network for 3D human pose estimation","authors":"Qianxing Li,&nbsp;Dehui Kong,&nbsp;Jinghua Li,&nbsp;Baocai Yin","doi":"10.1049/cvi2.12336","DOIUrl":"https://doi.org/10.1049/cvi2.12336","url":null,"abstract":"<p>Human pose estimation based on monocular video has always been the focus of research in the human computer interaction community, which suffers mainly from depth ambiguity and self-occlusion challenges. While the recently proposed learning-based approaches have demonstrated promising performance, they do not fully explore the complementarity of features. In this paper, the authors propose a novel multi-feature and multi-level fusion network (MMF-Net), which extracts and combines joint features, bone features and trajectory features at multiple levels to estimate 3D human pose. In MMF-Net, firstly, the bone length estimation module and the trajectory multi-level fusion module are used to extract the geometric size information of the human body and multi-level trajectory information of human motion, respectively. Then, the fusion attention-based combination (FABC) module is used to extract multi-level topological structure information of the human body, and effectively fuse topological structure information, geometric size information and trajectory information. Extensive experiments show that MMF-Net achieves competitive results on Human3.6M, HumanEva-I and MPI-INF-3DHP datasets.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12336","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A robust few-shot classifier with image as set of points
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-01-02 | DOI: 10.1049/cvi2.12340
Suhua Peng, Zongliang Zhang, Xingwang Huang, Zongyue Wang, Shubing Su, Guorong Cai

In recent years, many few-shot classification methods have been proposed. However, only a few of them have explored robust classification, which is an important aspect of human visual intelligence. Humans can effortlessly recognise visual patterns, including lines, circles, and even characters, from image data that has been corrupted or degraded. In this paper, the authors investigate a robust classification method that extends the classical paradigm of robust geometric model fitting. The method views an image as a set of points in a low-dimensional space and analyses each image through low-dimensional geometric model fitting. In contrast, the majority of other methods, such as deep learning methods, treat an image as a single point in a high-dimensional space. The authors evaluate the performance of the method using a noisy Omniglot dataset. The experimental results demonstrate that the proposed method is significantly more robust than other methods. The source code and data for this paper are available at https://github.com/pengsuhua/PMF_OMNIGLOT.
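
As a small illustration of the "image as a set of points" view mentioned above, the sketch below converts a binarised image into a normalised 2D point set suitable for low-dimensional geometric model fitting. The threshold and normalisation are assumptions; the paper's actual preprocessing may differ.

```python
import numpy as np

def image_to_point_set(img: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return the (row, col) coordinates of foreground pixels as an N x 2 array.

    `threshold` is illustrative; the paper's actual preprocessing may differ.
    """
    rows, cols = np.nonzero(img > threshold)
    points = np.stack([rows, cols], axis=1).astype(np.float32)
    # Normalise to [0, 1] so that point sets from images of different
    # sizes live in the same low-dimensional space.
    points /= np.array(img.shape, dtype=np.float32)
    return points

if __name__ == "__main__":
    img = np.zeros((28, 28), dtype=np.float32)
    img[5:10, 5:10] = 1.0  # a small square "character stroke"
    pts = image_to_point_set(img)
    print(pts.shape)  # (25, 2)
```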

{"title":"A robust few-shot classifier with image as set of points","authors":"Suhua Peng,&nbsp;Zongliang Zhang,&nbsp;Xingwang Huang,&nbsp;Zongyue Wang,&nbsp;Shubing Su,&nbsp;Guorong Cai","doi":"10.1049/cvi2.12340","DOIUrl":"https://doi.org/10.1049/cvi2.12340","url":null,"abstract":"<p>In recent years, many few-shot classification methods have been proposed. However, only a few of them have explored robust classification, which is an important aspect of human visual intelligence. Humans can effortlessly recognise visual patterns, including lines, circles, and even characters, from image data that has been corrupted or degraded. In this paper, the authors investigate a robust classification method that extends the classical paradigm of robust geometric model fitting. The method views an image as a set of points in a low-dimensional space and analyses each image through low-dimensional geometric model fitting. In contrast, the majority of other methods, such as deep learning methods, treat an image as a single point in a high-dimensional space. The authors evaluate the performance of the method using a noisy Omniglot dataset. The experimental results demonstrate that the proposed method is significantly more robust than other methods. The source code and data for this paper are available at https://github.com/pengsuhua/PMF_OMNIGLOT.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12340","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SMGNFORMER: Fusion Mamba-graph transformer network for human pose estimation
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-31 | DOI: 10.1049/cvi2.12339
Yi Li, Zan Wang, Weiran Niu

In the field of 3D human pose estimation (HPE), many deep learning algorithms overlook the topological relationships between 2D keypoints, resulting in imprecise regression of 3D coordinates and a notable decline in estimation performance. To address this limitation, this paper proposes a novel approach to 3D HPE, termed the Spatial Mamba Graph Convolutional Neural Network (GCN) Former (SMGNFormer). The proposed method utilises the Mamba architecture to extract spatial information from 2D keypoints and integrates GCNs with multi-head attention mechanisms to build a relational graph of 2D keypoints across a global receptive field. The outputs are subsequently processed by a Time-Frequency Feature Fusion Transformer to estimate 3D human poses. SMGNFormer demonstrates superior estimation performance on the Human3.6M dataset and real-world video data compared to most Transformer-based algorithms. Moreover, the proposed method achieves a training speed comparable to PoseFormerv2, providing a clear advantage over other methods in its category.
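
To make the GCN-plus-attention idea concrete, the following sketch combines a graph convolution over a joint adjacency matrix with multi-head self-attention over all keypoints. It is a hypothetical stand-in: the Mamba branch and the Time-Frequency Feature Fusion Transformer are omitted, and the identity adjacency is a placeholder for a real skeleton graph.

```python
import torch
import torch.nn as nn

class GraphAttentionBlock(nn.Module):
    """Skeleton GCN followed by multi-head self-attention (illustrative only)."""
    def __init__(self, n_joints: int, dim: int = 64, heads: int = 4):
        super().__init__()
        self.gcn_weight = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Placeholder adjacency; replace with the real skeleton graph.
        self.register_buffer("adj", torch.eye(n_joints))

    def forward(self, x):               # x: (batch, n_joints, dim)
        # Graph convolution: aggregate neighbouring joints, then project.
        adj = self.adj / self.adj.sum(dim=-1, keepdim=True)
        x = self.gcn_weight(adj @ x)
        # Global receptive field via self-attention over all joints.
        out, _ = self.attn(x, x, x)
        return out + x

if __name__ == "__main__":
    block = GraphAttentionBlock(n_joints=17)
    feats = torch.randn(2, 17, 64)
    print(block(feats).shape)  # torch.Size([2, 17, 64])
```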

{"title":"SMGNFORMER: Fusion Mamba-graph transformer network for human pose estimation","authors":"Yi Li,&nbsp;Zan Wang,&nbsp;Weiran Niu","doi":"10.1049/cvi2.12339","DOIUrl":"https://doi.org/10.1049/cvi2.12339","url":null,"abstract":"<p>In the field of 3D human pose estimation (HPE), many deep learning algorithms overlook the topological relationships between 2D keypoints, resulting in imprecise regression of 3D coordinates and a notable decline in estimation performance. To address this limitation, this paper proposes a novel approach to 3D HPE, termed the Spatial Mamba Graph Convolutional Neural Network (GCN) Former (SMGNFormer). The proposed method utilises the Mamba architecture to extract spatial information from 2D keypoints and integrates GCNs with multi-head attention mechanisms to build a relational graph of 2D keypoints across a global receptive field. The outputs are subsequently processed by a Time-Frequency Feature Fusion Transformer to estimate 3D human poses. SMGNFormer demonstrates superior estimation performance on the Human3.6M dataset and real-world video data compared to most Transformer-based algorithms. Moreover, the proposed method achieves a training speed comparable to PoseFormerv2, providing a clear advantage over other methods in its category.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12339","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LLFormer4D: LiDAR-based lane detection method by temporal feature fusion and sparse transformer
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-30 | DOI: 10.1049/cvi2.12338
Jun Hu, Chaolu Feng, Haoxiang Jie, Zuotao Ning, Xinyi Zuo, Wei Liu, Xiangyu Wei

Lane detection is a fundamental problem in autonomous driving, as it provides vehicles with essential road information. Despite the attention from scholars and engineers, LiDAR-based lane detection faces challenges such as unsatisfactory detection accuracy and significant computation overhead. In this paper, the authors propose LLFormer4D to overcome these technical challenges by leveraging the strengths of both Convolutional Neural Network and Transformer architectures. Specifically, the Temporal Feature Fusion module is introduced to enhance accuracy and robustness by integrating features from multi-frame point clouds. In addition, a sparse Transformer decoder based on Lane Key-point Query is designed, which introduces key-point supervision for each lane line to streamline post-processing. The authors evaluate the proposed method on the K-Lane and nuScenes map datasets. The results demonstrate the effectiveness of the method, which achieves second place with an F1 score of 82.39 and a processing speed of 16.03 frames per second on the K-Lane dataset. Furthermore, the algorithm attains the best mAP of 70.66 for lane detection on the nuScenes map dataset.
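
The sketch below illustrates one simple way multi-frame point cloud features can be combined in bird's-eye view: concatenate per-sweep BEV feature maps and fuse them with a 1x1 convolution. It is a simplified stand-in for the paper's Temporal Feature Fusion module; channel sizes and the number of frames are assumptions.

```python
import torch
import torch.nn as nn

class TemporalBEVFusion(nn.Module):
    """Concatenate BEV features from several past sweeps and fuse with a 1x1 conv.

    A simplified stand-in for a temporal feature fusion module; channel sizes
    and the number of frames are illustrative, not taken from the paper.
    """
    def __init__(self, channels: int = 64, frames: int = 3):
        super().__init__()
        self.fuse = nn.Conv2d(channels * frames, channels, kernel_size=1)

    def forward(self, bev_frames):      # list of (batch, C, H, W), newest last
        stacked = torch.cat(bev_frames, dim=1)
        return self.fuse(stacked)

if __name__ == "__main__":
    fusion = TemporalBEVFusion()
    frames = [torch.randn(1, 64, 128, 128) for _ in range(3)]
    print(fusion(frames).shape)  # torch.Size([1, 64, 128, 128])
```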

{"title":"LLFormer4D: LiDAR-based lane detection method by temporal feature fusion and sparse transformer","authors":"Jun Hu,&nbsp;Chaolu Feng,&nbsp;Haoxiang Jie,&nbsp;Zuotao Ning,&nbsp;Xinyi Zuo,&nbsp;Wei Liu,&nbsp;Xiangyu Wei","doi":"10.1049/cvi2.12338","DOIUrl":"https://doi.org/10.1049/cvi2.12338","url":null,"abstract":"<p>Lane detection is a fundamental problem in autonomous driving, which provides vehicles with essential road information. Despite the attention from scholars and engineers, lane detection based on LiDAR meets challenges such as unsatisfactory detection accuracy and significant computation overhead. In this paper, the authors propose LLFormer4D to overcome these technical challenges by leveraging the strengths of both Convolutional Neural Network and Transformer networks. Specifically, the Temporal Feature Fusion module is introduced to enhance accuracy and robustness by integrating features from multi-frame point clouds. In addition, a sparse Transformer decoder based on Lane Key-point Query is designed, which introduces key-point supervision for each lane line to streamline the post-processing. The authors conduct experiments and evaluate the proposed method on the K-Lane and nuScenes map datasets respectively. The results demonstrate the effectiveness of the presented method, achieving second place with an F1 score of 82.39 and a processing speed of 16.03 Frames Per Seconds on the K-Lane dataset. Furthermore, this algorithm attains the best mAP of 70.66 for lane detection on the nuScenes map dataset.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12338","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HMSFU: A hierarchical multi-scale fusion unit for video prediction and beyond
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-29 | DOI: 10.1049/cvi2.12312
Hongchang Zhu, Faming Fang

Video prediction is the process of learning the necessary information from historical frames to predict future video frames, and learning features from historical frames is a crucial step in this process. However, most current methods adopt a relatively single-scale learning approach; even when they learn features at different scales, they cannot fully integrate and utilise them, resulting in unsatisfactory predictions. To address this issue, a hierarchical multi-scale fusion unit (HMSFU) is proposed. Using a hierarchical multi-scale architecture, each layer predicts future frames at a different granularity with a different convolutional scale. The abstract features from different layers can be fused, enabling the model not only to capture rich contextual information but also to expand its receptive field, enhance its expressive power, and improve its applicability to complex prediction scenarios. To fully utilise the expanded receptive field, HMSFU incorporates three fusion modules. The first is the single-layer historical attention fusion module, which uses an attention mechanism to fuse features from historical frames into the current frame at each layer. The second is the single-layer spatiotemporal fusion module, which fuses complementary temporal and spatial features at each layer. The third is the multi-layer spatiotemporal fusion module, which fuses spatiotemporal features across layers. Additionally, the authors not only address frame-level error with a mean squared error loss but also introduce Kullback–Leibler (KL) divergence to account for inter-frame variations. Experimental results demonstrate that the proposed HMSFU model achieves the best performance on popular video prediction datasets, showcasing its competitiveness in the field.
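
The sketch below shows one way a per-frame MSE term can be combined with a KL term over inter-frame variation. The particular KL formulation here (a softmax distribution over per-pixel frame differences) is an illustrative guess, not the paper's exact definition, and the weighting factor is arbitrary.

```python
import torch
import torch.nn.functional as F

def prediction_loss(pred, target, kl_weight: float = 0.1):
    """Frame-level MSE plus a KL term on inter-frame variation.

    pred, target: (batch, time, channels, H, W). The KL formulation
    (softmax over per-pixel frame differences) is illustrative only.
    """
    mse = F.mse_loss(pred, target)

    # Inter-frame variation: absolute difference between consecutive frames,
    # flattened and turned into a distribution per sequence.
    def diff_dist(x):
        d = (x[:, 1:] - x[:, :-1]).abs().flatten(start_dim=1)
        return F.softmax(d, dim=1)

    p, q = diff_dist(pred), diff_dist(target)
    kl = F.kl_div(p.log(), q, reduction="batchmean")
    return mse + kl_weight * kl

if __name__ == "__main__":
    pred = torch.rand(2, 5, 1, 32, 32, requires_grad=True)
    target = torch.rand(2, 5, 1, 32, 32)
    print(prediction_loss(pred, target).item())
```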

{"title":"HMSFU: A hierarchical multi-scale fusion unit for video prediction and beyond","authors":"Hongchang Zhu,&nbsp;Faming Fang","doi":"10.1049/cvi2.12312","DOIUrl":"https://doi.org/10.1049/cvi2.12312","url":null,"abstract":"<p>Video prediction is the process of learning necessary information from historical frames to predict future video frames. Learning features from historical frames is a crucial step in this process. However, most current methods have a relatively single-scale learning approach, even if they learn features at different scales, they cannot fully integrate and utilise them, resulting in unsatisfactory prediction results. To address this issue, a hierarchical multi-scale fusion unit (HMSFU) is proposed. By using a hierarchical multi-scale architecture, each layer predicts future frames at different granularities using different convolutional scales. The abstract features from different layers can be fused, enabling the model not only to capture rich contextual information but also to expand the model's receptive field, enhance its expressive power, and improve its applicability to complex prediction scenarios. To fully utilise the expanded receptive field, HMSFU incorporates three fusion modules. The first module is the single-layer historical attention fusion module, which uses an attention mechanism to fuse the features from historical frames into the current frame at each layer. The second module is the single-layer spatiotemporal fusion module, which fuses complementary temporal and spatial features at each layer. The third module is the multi-layer spatiotemporal fusion module, which fuses spatiotemporal features from different layers. Additionally, the authors not only focus on the frame-level error using mean squared error loss, but also introduce the novel use of Kullback–Leibler (KL) divergence to consider inter-frame variations. Experimental results demonstrate that our proposed HMSFU model achieves the best performance on popular video prediction datasets, showcasing its remarkable competitiveness in the field.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12312","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Semantic segmentation of urban airborne LiDAR data of varying landcover diversity using XGBoost
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-27 | DOI: 10.1049/cvi2.12334
Jayati Vijaywargiya, Anandakumar M. Ramiya

Semantic segmentation of aerial LiDAR datasets is a crucial step for the accurate identification of urban objects in various applications pertaining to sustainable urban development. However, this task becomes more complex in urban areas characterised by the coexistence of modern development and natural vegetation. The unstructured nature of point cloud data, along with data sparsity, irregular point distribution, and the varying sizes of urban objects, makes point cloud classification challenging. To address these challenges, the development of a robust algorithmic approach encompassing efficient feature sets and a classification model is essential. This study incorporates point-wise features to capture the local spatial context of points in the datasets. Furthermore, an ensemble machine learning model based on extreme gradient boosting, which sequentially trains weak learners, is utilised to enhance resilience. To thoroughly investigate the efficacy of the proposed approach, this study utilises three distinct datasets from diverse geographical locations, each presenting unique challenges related to class distribution, 3D terrain intricacies, and geographical variation. The Land-cover Diversity Index is introduced to quantify the complexity of landcover in 3D by measuring the degree of class heterogeneity and the frequency of class variation in the dataset. The proposed approach achieved an accuracy of 90% on the regionally complex, higher-landcover-diversity Trivandrum Aerial LiDAR Dataset. Furthermore, the results demonstrate improved overall predictive accuracies of 91% and 87% on data segments from two benchmark datasets, DALES and Vaihingen 3D.
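
The sketch below shows the general shape of training an XGBoost classifier on per-point feature vectors. The synthetic features, feature count, class count and hyper-parameters are placeholders, not the study's actual feature set or configuration.

```python
import numpy as np
from xgboost import XGBClassifier

# Synthetic stand-in for per-point features (e.g. height above ground,
# planarity, intensity, ...). A real pipeline would compute these from the
# LiDAR neighbourhood of each point; the values here are random.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)).astype(np.float32)   # 1000 points, 8 features
y = rng.integers(0, 4, size=1000)                   # 4 land-cover classes

clf = XGBClassifier(
    n_estimators=200,      # illustrative hyper-parameters, not the paper's
    max_depth=6,
    learning_rate=0.1,
    objective="multi:softprob",
)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted class label per point
```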

{"title":"Semantic segmentation of urban airborne LiDAR data of varying landcover diversity using XGBoost","authors":"Jayati Vijaywargiya,&nbsp;Anandakumar M. Ramiya","doi":"10.1049/cvi2.12334","DOIUrl":"https://doi.org/10.1049/cvi2.12334","url":null,"abstract":"<p>Semantic segmentation of aerial LiDAR dataset is a crucial step for accurate identification of urban objects for various applications pertaining to sustainable urban development. However, this task becomes more complex in urban areas characterised by the coexistence of modern developments and natural vegetation. The unstructured nature of point cloud data, along with data sparsity, irregular point distribution, and varying sizes of urban objects, presents challenges in point cloud classification. To address these challenges, development of robust algorithmic approach encompassing efficient feature sets and classification model are essential. This study incorporates point-wise features to capture the local spatial context of points in datasets. Furthermore, an ensemble machine learning model based on extreme boosting is utilised, which integrates sequential training for weak learners, to enhance the model’s resilience. To thoroughly investigate the efficacy of the proposed approach, this study utilises three distinct datasets from diverse geographical locations, each presenting unique challenges related to class distribution, 3D terrain intricacies, and geographical variations. The Land-cover Diversity Index is introduced to quantify the complexity of landcover in 3D by measuring the degree of class heterogeneity and the frequency of class variation in the dataset. The proposed approach achieved an accuracy of 90% on the regionally complex, higher landcover diversity dataset, Trivandrum Aerial LiDAR Dataset. Furthermore, the results of the study demonstrate improved overall predictive accuracy of 91% and 87% on data segments from two benchmark datasets, DALES and Vaihingen 3D.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12334","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AMEF-Net: Towards an attention and multi-level enhancement fusion for medical image classification in Parkinson's aided diagnosis
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-25 | DOI: 10.1049/cvi2.12324
Qingyan Ding, Yu Pan, Jianxin Liu, Lianxin Li, Nan Liu, Na Li, Wan Zheng, Xuecheng Dong

Parkinson's disease (PD) is a neurodegenerative disorder primarily affecting middle-aged and elderly populations. Its insidious onset, high disability rate, long diagnostic cycle, and high diagnostic costs impose a heavy burden on patients and their families. Leveraging artificial intelligence, with its rapid diagnostic speed, high accuracy, and fatigue resistance, to achieve intelligent assisted diagnosis of PD holds significant promise for alleviating patients' financial stress, shortening diagnostic cycles, and helping patients seize the critical window for early treatment. This paper proposes an Attention and Multi-level Enhancement Fusion Network (AMEF-Net) based on the characteristics of three-dimensional medical imaging and the specific manifestations of PD in medical images. The focus is on small lesion areas and structural lesion areas that are often overlooked in traditional deep learning models, achieving multi-level attention over and processing of imaging information. The model achieved a diagnostic accuracy of 98.867%, a precision of 99.830%, a sensitivity of 99.182%, and a specificity of 99.384% on Magnetic Resonance Images from the Parkinson's Progression Markers Initiative dataset. On Diffusion Tensor Images, it achieved a diagnostic accuracy of 99.602%, a precision of 99.930%, a sensitivity of 99.463%, and a specificity of 99.877%. The relevant code is available at https://github.com/EdwardTj/AMEF-NET.
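
As one possible realisation of channel-wise attention over 3D medical volumes, the sketch below shows a squeeze-and-excitation-style block. It is a generic, hypothetical illustration; AMEF-Net's actual attention and multi-level fusion design may differ substantially.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Squeeze-and-excitation style channel attention for 3D volumes.

    Illustrative only; not the AMEF-Net module described in the paper.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (batch, C, D, H, W)
        w = self.pool(x).flatten(1)             # (batch, C)
        w = self.mlp(w).view(x.size(0), -1, 1, 1, 1)
        return x * w                            # re-weight channels

if __name__ == "__main__":
    attn = ChannelAttention3D(channels=32)
    vol = torch.randn(2, 32, 16, 64, 64)
    print(attn(vol).shape)  # torch.Size([2, 32, 16, 64, 64])
```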

{"title":"AMEF-Net: Towards an attention and multi-level enhancement fusion for medical image classification in Parkinson's aided diagnosis","authors":"Qingyan Ding,&nbsp;Yu Pan,&nbsp;Jianxin Liu,&nbsp;Lianxin Li,&nbsp;Nan Liu,&nbsp;Na Li,&nbsp;Wan Zheng,&nbsp;Xuecheng Dong","doi":"10.1049/cvi2.12324","DOIUrl":"https://doi.org/10.1049/cvi2.12324","url":null,"abstract":"<p>Parkinson's disease (PD) is a neurodegenerative disorder primarily affecting middle-aged and elderly populations. Its insidious onset, high disability rate, long diagnostic cycle, and high diagnostic costs impose a heavy burden on patients and their families. Leveraging artificial intelligence, with its rapid diagnostic speed, high accuracy, and fatigue resistance, to achieve intelligent assisted diagnosis of PD holds significant promise for alleviating patients' financial stress, reducing diagnostic cycles, and helping patients seize the golden period for early treatment. This paper proposes an Attention and Multi-level Enhancement Fusion Network (AMEF-Net) based on the characteristics of three-dimensional medical imaging and the specific manifestations of PD in medical images. The focus is on small lesion areas and structural lesion areas that are often overlooked in traditional deep learning models, achieving multi-level attention and processing of imaging information. The model achieved a diagnostic accuracy of 98.867%, a precision of 99.830%, a sensitivity of 99.182%, and a specificity of 99.384% on Magnetic Resonance Images from the Parkinson's Progression Markers Initiative dataset. On Diffusion Tensor Images, it achieved a diagnostic accuracy of 99.602%, a precision of 99.930%, a sensitivity of 99.463%, and a specificity of 99.877%. The relevant code has been placed in https://github.com/EdwardTj/AMEF-NET.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12324","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Unlocking the power of multi-modal fusion in 3D object tracking
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-25 | DOI: 10.1049/cvi2.12335
Yue Hu

3D single object tracking (SOT) plays a vital role in autonomous driving and robotics, yet traditional approaches have predominantly relied on LiDAR-based point cloud data alone, often neglecting the benefits of integrating the image modality. To address this gap, we propose a novel Multi-modal Image-LiDAR Tracker (MILT) designed to overcome the limitations of single-modality methods by effectively combining RGB and point cloud data. Our key contribution is a dual-branch architecture that separately extracts geometric features from LiDAR and texture features from images. These features are then fused from a bird's-eye-view (BEV) perspective to achieve a comprehensive representation of the tracked object. A significant innovation in our approach is the Image-to-LiDAR Adapter module, which transfers the rich feature representation capabilities of the image modality to the 3D tracking task, and the BEV-Fusion module, which facilitates the interactive fusion of geometry and texture features. By validating MILT on public datasets, we demonstrate substantial performance improvements over traditional methods, effectively showcasing the advantages of our multi-modal fusion strategy. This work advances the state of the art in SOT by integrating complementary information from the RGB and LiDAR modalities, resulting in enhanced tracking accuracy and robustness.
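
To illustrate the fusion step, the sketch below concatenates LiDAR and image features that have already been projected into BEV and fuses them with a small convolutional block. The projection itself (the Image-to-LiDAR adapter) is omitted, and the channel sizes are assumptions; this is not the paper's BEV-Fusion module.

```python
import torch
import torch.nn as nn

class BEVFusion(nn.Module):
    """Fuse LiDAR and image features that have both been projected to BEV.

    Only the concatenate-and-convolve fusion step is shown; the projection
    of image features into BEV is assumed to have happened upstream.
    """
    def __init__(self, lidar_ch: int = 128, image_ch: int = 64, out_ch: int = 128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(lidar_ch + image_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, lidar_bev, image_bev):    # both (batch, C, H, W) in BEV
        return self.fuse(torch.cat([lidar_bev, image_bev], dim=1))

if __name__ == "__main__":
    fusion = BEVFusion()
    out = fusion(torch.randn(1, 128, 100, 100), torch.randn(1, 64, 100, 100))
    print(out.shape)  # torch.Size([1, 128, 100, 100])
```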

{"title":"Unlocking the power of multi-modal fusion in 3D object tracking","authors":"Yue Hu","doi":"10.1049/cvi2.12335","DOIUrl":"https://doi.org/10.1049/cvi2.12335","url":null,"abstract":"<p>3D Single Object Tracking plays a vital role in autonomous driving and robotics, yet traditional approaches have predominantly focused on using pure LiDAR-based point cloud data, often neglecting the benefits of integrating image modalities. To address this gap, we propose a novel Multi-modal Image-LiDAR Tracker (MILT) designed to overcome the limitations of single-modality methods by effectively combining RGB and point cloud data. Our key contribution is a dual-branch architecture that separately extracts geometric features from LiDAR and texture features from images. These features are then fused in a BEV perspective to achieve a comprehensive representation of the tracked object. A significant innovation in our approach is the Image-to-LiDAR Adapter module, which transfers the rich feature representation capabilities of the image modality to the 3D tracking task, and the BEV-Fusion module, which facilitates the interactive fusion of geometry and texture features. By validating MILT on public datasets, we demonstrate substantial performance improvements over traditional methods, effectively showcasing the advantages of our multi-modal fusion strategy. This work advances the state-of-the-art in SOT by integrating complementary information from RGB and LiDAR modalities, resulting in enhanced tracking accuracy and robustness.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12335","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Category-instance distillation based on visual-language models for rehearsal-free class incremental learning
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-23 | DOI: 10.1049/cvi2.12327
Weilong Jin, Zilei Wang, Yixin Zhang

Recently, visual-language models (VLMs) have displayed potent capabilities in the field of computer vision. Their emerging role as the backbone of visual tasks necessitates studying class incremental learning (CIL) within the VLM architecture. However, the pre-training data for many VLMs is proprietary, and during the incremental phase, old task data may also raise privacy issues. Moreover, replay-based methods can introduce new problems, such as class imbalance, the selection of data for replay, and a trade-off between replay cost and performance. Therefore, the authors choose the more challenging rehearsal-free setting. In this paper, the authors study class-incremental tasks based on large pre-trained vision-language models such as CLIP. Initially, at the category level, the authors combine traditional optimisation and distillation techniques, utilising both the pre-trained model and models trained in previous incremental stages to jointly guide the training of the new model. This paradigm effectively balances the stability and plasticity of the new model, mitigating catastrophic forgetting. Moreover, utilising the VLM infrastructure, the authors redefine the relationships between instances, which allows fine-grained instance relational information to be gleaned from the a priori knowledge provided during pre-training. The authors supplement this approach with an entropy-balancing method that allows the model to adaptively distribute optimisation weights across training samples. The experimental results validate that, within the VLM framework, the method outperforms traditional CIL methods.
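
The sketch below shows one way a student can be distilled from two teachers (a frozen pre-trained model and the previous-stage model) while weighting samples by predictive entropy. The loss form, temperature, coefficients and weighting scheme are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_teacher_distill_loss(student_logits, pretrain_logits, prev_logits,
                              labels, alpha: float = 0.5, beta: float = 0.5,
                              temperature: float = 2.0):
    """Cross-entropy plus distillation from two teachers, with per-sample
    entropy-based weights. Illustrative only; the paper's exact losses and
    weighting may differ."""
    t = temperature
    ce = F.cross_entropy(student_logits, labels, reduction="none")

    log_p = F.log_softmax(student_logits / t, dim=1)
    kd_pre = F.kl_div(log_p, F.softmax(pretrain_logits / t, dim=1),
                      reduction="none").sum(dim=1)
    kd_prev = F.kl_div(log_p, F.softmax(prev_logits / t, dim=1),
                       reduction="none").sum(dim=1)

    # Entropy balancing: up-weight samples the student is uncertain about.
    with torch.no_grad():
        probs = F.softmax(student_logits, dim=1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
        weights = entropy / entropy.mean().clamp_min(1e-8)

    per_sample = ce + (t * t) * (alpha * kd_pre + beta * kd_prev)
    return (weights * per_sample).mean()

if __name__ == "__main__":
    s, p1, p2 = (torch.randn(4, 10) for _ in range(3))
    y = torch.randint(0, 10, (4,))
    print(dual_teacher_distill_loss(s, p1, p2, y).item())
```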

{"title":"Category-instance distillation based on visual-language models for rehearsal-free class incremental learning","authors":"Weilong Jin,&nbsp;Zilei Wang,&nbsp;Yixin Zhang","doi":"10.1049/cvi2.12327","DOIUrl":"https://doi.org/10.1049/cvi2.12327","url":null,"abstract":"<p>Recently, visual-language models (VLMs) have displayed potent capabilities in the field of computer vision. Their emerging trend as the backbone of visual tasks necessitates studying class incremental learning (CIL) issues within the VLM architecture. However, the pre-training data for many VLMs is proprietary, and during the incremental phase, old task data may also raise privacy issues. Moreover, replay-based methods can introduce new problems like class imbalance, the selection of data for replay and a trade-off between replay cost and performance. Therefore, the authors choose the more challenging rehearsal-free settings. In this paper, the authors study class-incremental tasks based on the large pre-trained vision-language models like CLIP model. Initially, at the category level, the authors combine traditional optimisation and distillation techniques, utilising both pre-trained models and models trained in previous incremental stages to jointly guide the training of the new model. This paradigm effectively balances the stability and plasticity of the new model, mitigating the issue of catastrophic forgetting. Moreover, utilising the VLM infrastructure, the authors redefine the relationship between instances. This allows us to glean fine-grained instance relational information from the a priori knowledge provided during pre-training. The authors supplement this approach with an entropy-balancing method that allows the model to adaptively distribute optimisation weights across training samples. The authors’ experimental results validate that their method, within the framework of VLMs, outperforms traditional CIL methods.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12327","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Outliers rejection for robust camera pose estimation using graduated non-convexity
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-23 | DOI: 10.1049/cvi2.12330
Hao Yi, Bo Liu, Bin Zhao, Enhai Liu

Camera pose estimation plays a crucial role in computer vision and is widely used in augmented reality, robotics and autonomous driving. However, previous studies have neglected the presence of outliers in measurements, so even a small percentage of outliers can significantly degrade precision. To deal with outliers, this paper proposes a graduated non-convexity (GNC) method to suppress outliers in robust camera pose estimation, which serves as the core of GNCPnP. The authors first reformulate the camera pose estimation problem using a non-convex cost that is less affected by outliers. Then, to apply a non-minimal solver to the reformulated problem, the authors transform it using Black–Rangarajan duality. Finally, to address the dependence of non-convex optimisation on initial values, the GNC method is customised according to the truncated least-squares cost. The results of simulated and real experiments show that GNCPnP can effectively handle interference from outliers and achieve higher accuracy than existing state-of-the-art algorithms. In particular, the camera pose estimation accuracy of GNCPnP with a low percentage of outliers is almost comparable to that of the state-of-the-art algorithm with no outliers.
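
To make the graduated non-convexity idea concrete, the sketch below applies the generic GNC recipe with a truncated least-squares cost to a toy robust linear regression problem standing in for the PnP cost. The weight update and the annealing schedule for the surrogate parameter mu follow the commonly used GNC-TLS formulation and are assumptions here, not the paper's GNCPnP solver; the inlier threshold c_bar and the annealing factor are illustrative.

```python
import numpy as np

def gnc_tls_linear_fit(A, b, c_bar=0.1, mu_factor=1.4, iters=50):
    """Robust linear fit via graduated non-convexity with a truncated
    least-squares cost, alternating weighted least squares and weight
    updates while annealing mu. Illustrative sketch, not GNCPnP."""
    n = A.shape[0]
    w = np.ones(n)
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    r2 = (A @ x - b) ** 2
    mu = c_bar**2 / max(2 * r2.max() - c_bar**2, 1e-12)  # start highly convex

    for _ in range(iters):
        # Weighted least-squares step with the current inlier weights.
        W = np.sqrt(w)[:, None]
        x = np.linalg.lstsq(W * A, W[:, 0] * b, rcond=None)[0]
        r2 = (A @ x - b) ** 2
        # TLS weight update for the current surrogate parameter mu.
        upper = (mu + 1) / mu * c_bar**2
        lower = mu / (mu + 1) * c_bar**2
        w = np.clip(c_bar * np.sqrt(mu * (mu + 1)) / np.sqrt(r2 + 1e-12) - mu,
                    0.0, 1.0)
        w[r2 >= upper] = 0.0   # confident outliers
        w[r2 <= lower] = 1.0   # confident inliers
        mu *= mu_factor        # anneal towards the original non-convex cost
    return x, w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 3))
    x_true = np.array([1.0, -2.0, 0.5])
    b = A @ x_true + 0.01 * rng.normal(size=100)
    b[:20] += rng.normal(5.0, 2.0, size=20)          # 20% outliers
    x_est, weights = gnc_tls_linear_fit(A, b)
    print(np.round(x_est, 3), int((weights < 0.5).sum()))
```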

{"title":"Outliers rejection for robust camera pose estimation using graduated non-convexity","authors":"Hao Yi,&nbsp;Bo Liu,&nbsp;Bin Zhao,&nbsp;Enhai Liu","doi":"10.1049/cvi2.12330","DOIUrl":"https://doi.org/10.1049/cvi2.12330","url":null,"abstract":"<p>Camera pose estimation plays a crucial role in computer vision, which is widely used in augmented reality, robotics and autonomous driving. However, previous studies have neglected the presence of outliers in measurements, so that even a small percentage of outliers will significantly degrade precision. In order to deal with outliers, this paper proposes using a graduated non-convexity (GNC) method to suppress outliers in robust camera pose estimation, which serves as the core of GNCPnP. The authors first reformulate the camera pose estimation problem using a non-convex cost, which is less affected by outliers. Then, to apply a non-minimum solver to solve the reformulated problem, the authors use the Black-Rangarajan duality theory to transform it. Finally, to address the dependence of non-convex optimisation on initial values, the GNC method was customised according to the truncated least squares cost. The results of simulation and real experiments show that GNCPnP can effectively handle the interference of outliers and achieve higher accuracy compared to existing state-of-the-art algorithms. In particular, the camera pose estimation accuracy of GNCPnP in the case of a low percentage of outliers is almost comparable to that of the state-of-the-art algorithm in the case of no outliers.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12330","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0