
Latest Publications in IET Computer Vision

MMF-Net: A novel multi-feature and multi-level fusion network for 3D human pose estimation
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-01-07 | DOI: 10.1049/cvi2.12336
Qianxing Li, Dehui Kong, Jinghua Li, Baocai Yin

Human pose estimation from monocular video has long been a focus of research in the human-computer interaction community, and it suffers mainly from depth ambiguity and self-occlusion. While recently proposed learning-based approaches have demonstrated promising performance, they do not fully explore the complementarity of features. In this paper, the authors propose a novel multi-feature and multi-level fusion network (MMF-Net), which extracts and combines joint features, bone features and trajectory features at multiple levels to estimate 3D human pose. In MMF-Net, the bone length estimation module and the trajectory multi-level fusion module first extract the geometric size information of the human body and multi-level trajectory information of human motion, respectively. Then, the fusion attention-based combination (FABC) module extracts multi-level topological structure information of the human body and effectively fuses topological structure, geometric size and trajectory information. Extensive experiments show that MMF-Net achieves competitive results on the Human3.6M, HumanEva-I and MPI-INF-3DHP datasets.
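
As a rough illustration of the fusion idea described in the abstract, the sketch below shows an attention-weighted combination of joint, bone and trajectory feature streams in PyTorch. The module name, layer sizes and single-scalar attention scheme are assumptions for illustration only, not the authors' FABC implementation.

```python
# Hypothetical sketch of attention-weighted fusion of joint, bone and
# trajectory features; layer sizes and names are illustrative only.
import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.score = nn.Linear(dim, 1)   # one scalar attention score per stream
        self.proj = nn.Linear(dim, dim)

    def forward(self, joint_f, bone_f, traj_f):
        # Each input: (batch, dim). Stack into (batch, streams, dim).
        streams = torch.stack([joint_f, bone_f, traj_f], dim=1)
        weights = torch.softmax(self.score(streams), dim=1)  # (batch, streams, 1)
        fused = (weights * streams).sum(dim=1)               # (batch, dim)
        return self.proj(fused)

if __name__ == "__main__":
    fuse = AttentionFusion(dim=256)
    j, b, t = (torch.randn(8, 256) for _ in range(3))
    print(fuse(j, b, t).shape)  # torch.Size([8, 256])
```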

{"title":"MMF-Net: A novel multi-feature and multi-level fusion network for 3D human pose estimation","authors":"Qianxing Li,&nbsp;Dehui Kong,&nbsp;Jinghua Li,&nbsp;Baocai Yin","doi":"10.1049/cvi2.12336","DOIUrl":"https://doi.org/10.1049/cvi2.12336","url":null,"abstract":"<p>Human pose estimation based on monocular video has always been the focus of research in the human computer interaction community, which suffers mainly from depth ambiguity and self-occlusion challenges. While the recently proposed learning-based approaches have demonstrated promising performance, they do not fully explore the complementarity of features. In this paper, the authors propose a novel multi-feature and multi-level fusion network (MMF-Net), which extracts and combines joint features, bone features and trajectory features at multiple levels to estimate 3D human pose. In MMF-Net, firstly, the bone length estimation module and the trajectory multi-level fusion module are used to extract the geometric size information of the human body and multi-level trajectory information of human motion, respectively. Then, the fusion attention-based combination (FABC) module is used to extract multi-level topological structure information of the human body, and effectively fuse topological structure information, geometric size information and trajectory information. Extensive experiments show that MMF-Net achieves competitive results on Human3.6M, HumanEva-I and MPI-INF-3DHP datasets.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12336","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
A robust few-shot classifier with image as set of points
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2025-01-02 | DOI: 10.1049/cvi2.12340
Suhua Peng, Zongliang Zhang, Xingwang Huang, Zongyue Wang, Shubing Su, Guorong Cai

In recent years, many few-shot classification methods have been proposed. However, only a few of them have explored robust classification, which is an important aspect of human visual intelligence. Humans can effortlessly recognise visual patterns, including lines, circles, and even characters, from image data that has been corrupted or degraded. In this paper, the authors investigate a robust classification method that extends the classical paradigm of robust geometric model fitting. The method views an image as a set of points in a low-dimensional space and analyses each image through low-dimensional geometric model fitting. In contrast, the majority of other methods, such as deep learning methods, treat an image as a single point in a high-dimensional space. The authors evaluate the performance of the method using a noisy Omniglot dataset. The experimental results demonstrate that the proposed method is significantly more robust than other methods. The source code and data for this paper are available at https://github.com/pengsuhua/PMF_OMNIGLOT.
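
As a small illustration of the "image as a set of points" view mentioned above, the sketch below converts a binarised image into a normalised 2D point set suitable for low-dimensional geometric model fitting. The threshold and normalisation are assumptions; the paper's actual preprocessing may differ.

```python
import numpy as np

def image_to_point_set(img: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return the (row, col) coordinates of foreground pixels as an N x 2 array.

    `threshold` is illustrative; the paper's actual preprocessing may differ.
    """
    rows, cols = np.nonzero(img > threshold)
    points = np.stack([rows, cols], axis=1).astype(np.float32)
    # Normalise to [0, 1] so that point sets from images of different
    # sizes live in the same low-dimensional space.
    points /= np.array(img.shape, dtype=np.float32)
    return points

if __name__ == "__main__":
    img = np.zeros((28, 28), dtype=np.float32)
    img[5:10, 5:10] = 1.0  # a small square "character stroke"
    pts = image_to_point_set(img)
    print(pts.shape)  # (25, 2)
```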

{"title":"A robust few-shot classifier with image as set of points","authors":"Suhua Peng,&nbsp;Zongliang Zhang,&nbsp;Xingwang Huang,&nbsp;Zongyue Wang,&nbsp;Shubing Su,&nbsp;Guorong Cai","doi":"10.1049/cvi2.12340","DOIUrl":"https://doi.org/10.1049/cvi2.12340","url":null,"abstract":"<p>In recent years, many few-shot classification methods have been proposed. However, only a few of them have explored robust classification, which is an important aspect of human visual intelligence. Humans can effortlessly recognise visual patterns, including lines, circles, and even characters, from image data that has been corrupted or degraded. In this paper, the authors investigate a robust classification method that extends the classical paradigm of robust geometric model fitting. The method views an image as a set of points in a low-dimensional space and analyses each image through low-dimensional geometric model fitting. In contrast, the majority of other methods, such as deep learning methods, treat an image as a single point in a high-dimensional space. The authors evaluate the performance of the method using a noisy Omniglot dataset. The experimental results demonstrate that the proposed method is significantly more robust than other methods. The source code and data for this paper are available at https://github.com/pengsuhua/PMF_OMNIGLOT.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2025-01-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12340","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
SMGNFORMER: Fusion Mamba-graph transformer network for human pose estimation
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-31 | DOI: 10.1049/cvi2.12339
Yi Li, Zan Wang, Weiran Niu

In the field of 3D human pose estimation (HPE), many deep learning algorithms overlook the topological relationships between 2D keypoints, resulting in imprecise regression of 3D coordinates and a notable decline in estimation performance. To address this limitation, this paper proposes a novel approach to 3D HPE, termed the Spatial Mamba Graph Convolutional Neural Network (GCN) Former (SMGNFormer). The proposed method utilises the Mamba architecture to extract spatial information from 2D keypoints and integrates GCNs with multi-head attention mechanisms to build a relational graph of 2D keypoints across a global receptive field. The outputs are subsequently processed by a Time-Frequency Feature Fusion Transformer to estimate 3D human poses. SMGNFormer demonstrates superior estimation performance on the Human3.6M dataset and real-world video data compared to most Transformer-based algorithms. Moreover, the proposed method achieves a training speed comparable to PoseFormerv2, providing a clear advantage over other methods in its category.
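
To make the GCN-plus-attention idea concrete, the following sketch combines a graph convolution over a joint adjacency matrix with multi-head self-attention over all keypoints. It is a hypothetical stand-in: the Mamba branch and the Time-Frequency Feature Fusion Transformer are omitted, and the identity adjacency is a placeholder for a real skeleton graph.

```python
import torch
import torch.nn as nn

class GraphAttentionBlock(nn.Module):
    """Skeleton GCN followed by multi-head self-attention (illustrative only)."""
    def __init__(self, n_joints: int, dim: int = 64, heads: int = 4):
        super().__init__()
        self.gcn_weight = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Placeholder adjacency; replace with the real skeleton graph.
        self.register_buffer("adj", torch.eye(n_joints))

    def forward(self, x):               # x: (batch, n_joints, dim)
        # Graph convolution: aggregate neighbouring joints, then project.
        adj = self.adj / self.adj.sum(dim=-1, keepdim=True)
        x = self.gcn_weight(adj @ x)
        # Global receptive field via self-attention over all joints.
        out, _ = self.attn(x, x, x)
        return out + x

if __name__ == "__main__":
    block = GraphAttentionBlock(n_joints=17)
    feats = torch.randn(2, 17, 64)
    print(block(feats).shape)  # torch.Size([2, 17, 64])
```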

{"title":"SMGNFORMER: Fusion Mamba-graph transformer network for human pose estimation","authors":"Yi Li,&nbsp;Zan Wang,&nbsp;Weiran Niu","doi":"10.1049/cvi2.12339","DOIUrl":"https://doi.org/10.1049/cvi2.12339","url":null,"abstract":"<p>In the field of 3D human pose estimation (HPE), many deep learning algorithms overlook the topological relationships between 2D keypoints, resulting in imprecise regression of 3D coordinates and a notable decline in estimation performance. To address this limitation, this paper proposes a novel approach to 3D HPE, termed the Spatial Mamba Graph Convolutional Neural Network (GCN) Former (SMGNFormer). The proposed method utilises the Mamba architecture to extract spatial information from 2D keypoints and integrates GCNs with multi-head attention mechanisms to build a relational graph of 2D keypoints across a global receptive field. The outputs are subsequently processed by a Time-Frequency Feature Fusion Transformer to estimate 3D human poses. SMGNFormer demonstrates superior estimation performance on the Human3.6M dataset and real-world video data compared to most Transformer-based algorithms. Moreover, the proposed method achieves a training speed comparable to PoseFormerv2, providing a clear advantage over other methods in its category.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12339","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
LLFormer4D: LiDAR-based lane detection method by temporal feature fusion and sparse transformer
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-30 | DOI: 10.1049/cvi2.12338
Jun Hu, Chaolu Feng, Haoxiang Jie, Zuotao Ning, Xinyi Zuo, Wei Liu, Xiangyu Wei

Lane detection is a fundamental problem in autonomous driving, as it provides vehicles with essential road information. Despite the attention from scholars and engineers, LiDAR-based lane detection faces challenges such as unsatisfactory detection accuracy and significant computation overhead. In this paper, the authors propose LLFormer4D to overcome these technical challenges by leveraging the strengths of both Convolutional Neural Network and Transformer architectures. Specifically, the Temporal Feature Fusion module is introduced to enhance accuracy and robustness by integrating features from multi-frame point clouds. In addition, a sparse Transformer decoder based on Lane Key-point Query is designed, which introduces key-point supervision for each lane line to streamline post-processing. The authors evaluate the proposed method on the K-Lane and nuScenes map datasets. The results demonstrate the effectiveness of the method, which achieves second place with an F1 score of 82.39 and a processing speed of 16.03 frames per second on the K-Lane dataset. Furthermore, the algorithm attains the best mAP of 70.66 for lane detection on the nuScenes map dataset.
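
The sketch below illustrates one simple way multi-frame point cloud features can be combined in bird's-eye view: concatenate per-sweep BEV feature maps and fuse them with a 1x1 convolution. It is a simplified stand-in for the paper's Temporal Feature Fusion module; channel sizes and the number of frames are assumptions.

```python
import torch
import torch.nn as nn

class TemporalBEVFusion(nn.Module):
    """Concatenate BEV features from several past sweeps and fuse with a 1x1 conv.

    A simplified stand-in for a temporal feature fusion module; channel sizes
    and the number of frames are illustrative, not taken from the paper.
    """
    def __init__(self, channels: int = 64, frames: int = 3):
        super().__init__()
        self.fuse = nn.Conv2d(channels * frames, channels, kernel_size=1)

    def forward(self, bev_frames):      # list of (batch, C, H, W), newest last
        stacked = torch.cat(bev_frames, dim=1)
        return self.fuse(stacked)

if __name__ == "__main__":
    fusion = TemporalBEVFusion()
    frames = [torch.randn(1, 64, 128, 128) for _ in range(3)]
    print(fusion(frames).shape)  # torch.Size([1, 64, 128, 128])
```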

{"title":"LLFormer4D: LiDAR-based lane detection method by temporal feature fusion and sparse transformer","authors":"Jun Hu,&nbsp;Chaolu Feng,&nbsp;Haoxiang Jie,&nbsp;Zuotao Ning,&nbsp;Xinyi Zuo,&nbsp;Wei Liu,&nbsp;Xiangyu Wei","doi":"10.1049/cvi2.12338","DOIUrl":"https://doi.org/10.1049/cvi2.12338","url":null,"abstract":"<p>Lane detection is a fundamental problem in autonomous driving, which provides vehicles with essential road information. Despite the attention from scholars and engineers, lane detection based on LiDAR meets challenges such as unsatisfactory detection accuracy and significant computation overhead. In this paper, the authors propose LLFormer4D to overcome these technical challenges by leveraging the strengths of both Convolutional Neural Network and Transformer networks. Specifically, the Temporal Feature Fusion module is introduced to enhance accuracy and robustness by integrating features from multi-frame point clouds. In addition, a sparse Transformer decoder based on Lane Key-point Query is designed, which introduces key-point supervision for each lane line to streamline the post-processing. The authors conduct experiments and evaluate the proposed method on the K-Lane and nuScenes map datasets respectively. The results demonstrate the effectiveness of the presented method, achieving second place with an F1 score of 82.39 and a processing speed of 16.03 Frames Per Seconds on the K-Lane dataset. Furthermore, this algorithm attains the best mAP of 70.66 for lane detection on the nuScenes map dataset.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12338","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143362983","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HMSFU: A hierarchical multi-scale fusion unit for video prediction and beyond
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-29 | DOI: 10.1049/cvi2.12312
Hongchang Zhu, Faming Fang

Video prediction is the process of learning the necessary information from historical frames to predict future video frames, and learning features from historical frames is a crucial step in this process. However, most current methods adopt a relatively single-scale learning approach; even when they learn features at different scales, they cannot fully integrate and utilise them, resulting in unsatisfactory predictions. To address this issue, a hierarchical multi-scale fusion unit (HMSFU) is proposed. Using a hierarchical multi-scale architecture, each layer predicts future frames at a different granularity with a different convolutional scale. The abstract features from different layers can be fused, enabling the model not only to capture rich contextual information but also to expand its receptive field, enhance its expressive power, and improve its applicability to complex prediction scenarios. To fully utilise the expanded receptive field, HMSFU incorporates three fusion modules. The first is the single-layer historical attention fusion module, which uses an attention mechanism to fuse features from historical frames into the current frame at each layer. The second is the single-layer spatiotemporal fusion module, which fuses complementary temporal and spatial features at each layer. The third is the multi-layer spatiotemporal fusion module, which fuses spatiotemporal features across layers. Additionally, the authors not only address frame-level error with a mean squared error loss but also introduce Kullback–Leibler (KL) divergence to account for inter-frame variations. Experimental results demonstrate that the proposed HMSFU model achieves the best performance on popular video prediction datasets, showcasing its competitiveness in the field.
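
The sketch below shows one way a per-frame MSE term can be combined with a KL term over inter-frame variation. The particular KL formulation here (a softmax distribution over per-pixel frame differences) is an illustrative guess, not the paper's exact definition, and the weighting factor is arbitrary.

```python
import torch
import torch.nn.functional as F

def prediction_loss(pred, target, kl_weight: float = 0.1):
    """Frame-level MSE plus a KL term on inter-frame variation.

    pred, target: (batch, time, channels, H, W). The KL formulation
    (softmax over per-pixel frame differences) is illustrative only.
    """
    mse = F.mse_loss(pred, target)

    # Inter-frame variation: absolute difference between consecutive frames,
    # flattened and turned into a distribution per sequence.
    def diff_dist(x):
        d = (x[:, 1:] - x[:, :-1]).abs().flatten(start_dim=1)
        return F.softmax(d, dim=1)

    p, q = diff_dist(pred), diff_dist(target)
    kl = F.kl_div(p.log(), q, reduction="batchmean")
    return mse + kl_weight * kl

if __name__ == "__main__":
    pred = torch.rand(2, 5, 1, 32, 32, requires_grad=True)
    target = torch.rand(2, 5, 1, 32, 32)
    print(prediction_loss(pred, target).item())
```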

{"title":"HMSFU: A hierarchical multi-scale fusion unit for video prediction and beyond","authors":"Hongchang Zhu,&nbsp;Faming Fang","doi":"10.1049/cvi2.12312","DOIUrl":"https://doi.org/10.1049/cvi2.12312","url":null,"abstract":"<p>Video prediction is the process of learning necessary information from historical frames to predict future video frames. Learning features from historical frames is a crucial step in this process. However, most current methods have a relatively single-scale learning approach, even if they learn features at different scales, they cannot fully integrate and utilise them, resulting in unsatisfactory prediction results. To address this issue, a hierarchical multi-scale fusion unit (HMSFU) is proposed. By using a hierarchical multi-scale architecture, each layer predicts future frames at different granularities using different convolutional scales. The abstract features from different layers can be fused, enabling the model not only to capture rich contextual information but also to expand the model's receptive field, enhance its expressive power, and improve its applicability to complex prediction scenarios. To fully utilise the expanded receptive field, HMSFU incorporates three fusion modules. The first module is the single-layer historical attention fusion module, which uses an attention mechanism to fuse the features from historical frames into the current frame at each layer. The second module is the single-layer spatiotemporal fusion module, which fuses complementary temporal and spatial features at each layer. The third module is the multi-layer spatiotemporal fusion module, which fuses spatiotemporal features from different layers. Additionally, the authors not only focus on the frame-level error using mean squared error loss, but also introduce the novel use of Kullback–Leibler (KL) divergence to consider inter-frame variations. Experimental results demonstrate that our proposed HMSFU model achieves the best performance on popular video prediction datasets, showcasing its remarkable competitiveness in the field.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12312","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Semantic segmentation of urban airborne LiDAR data of varying landcover diversity using XGBoost
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-27 | DOI: 10.1049/cvi2.12334
Jayati Vijaywargiya, Anandakumar M. Ramiya

Semantic segmentation of aerial LiDAR datasets is a crucial step for the accurate identification of urban objects in various applications pertaining to sustainable urban development. However, this task becomes more complex in urban areas characterised by the coexistence of modern development and natural vegetation. The unstructured nature of point cloud data, along with data sparsity, irregular point distribution, and the varying sizes of urban objects, makes point cloud classification challenging. To address these challenges, the development of a robust algorithmic approach encompassing efficient feature sets and a classification model is essential. This study incorporates point-wise features to capture the local spatial context of points in the datasets. Furthermore, an ensemble machine learning model based on extreme gradient boosting, which sequentially trains weak learners, is utilised to enhance resilience. To thoroughly investigate the efficacy of the proposed approach, this study utilises three distinct datasets from diverse geographical locations, each presenting unique challenges related to class distribution, 3D terrain intricacies, and geographical variation. The Land-cover Diversity Index is introduced to quantify the complexity of landcover in 3D by measuring the degree of class heterogeneity and the frequency of class variation in the dataset. The proposed approach achieved an accuracy of 90% on the regionally complex, higher-landcover-diversity Trivandrum Aerial LiDAR Dataset. Furthermore, the results demonstrate improved overall predictive accuracies of 91% and 87% on data segments from two benchmark datasets, DALES and Vaihingen 3D.
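
The sketch below shows the general shape of training an XGBoost classifier on per-point feature vectors. The synthetic features, feature count, class count and hyper-parameters are placeholders, not the study's actual feature set or configuration.

```python
import numpy as np
from xgboost import XGBClassifier

# Synthetic stand-in for per-point features (e.g. height above ground,
# planarity, intensity, ...). A real pipeline would compute these from the
# LiDAR neighbourhood of each point; the values here are random.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8)).astype(np.float32)   # 1000 points, 8 features
y = rng.integers(0, 4, size=1000)                   # 4 land-cover classes

clf = XGBClassifier(
    n_estimators=200,      # illustrative hyper-parameters, not the paper's
    max_depth=6,
    learning_rate=0.1,
    objective="multi:softprob",
)
clf.fit(X, y)
print(clf.predict(X[:5]))  # predicted class label per point
```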

{"title":"Semantic segmentation of urban airborne LiDAR data of varying landcover diversity using XGBoost","authors":"Jayati Vijaywargiya,&nbsp;Anandakumar M. Ramiya","doi":"10.1049/cvi2.12334","DOIUrl":"https://doi.org/10.1049/cvi2.12334","url":null,"abstract":"<p>Semantic segmentation of aerial LiDAR dataset is a crucial step for accurate identification of urban objects for various applications pertaining to sustainable urban development. However, this task becomes more complex in urban areas characterised by the coexistence of modern developments and natural vegetation. The unstructured nature of point cloud data, along with data sparsity, irregular point distribution, and varying sizes of urban objects, presents challenges in point cloud classification. To address these challenges, development of robust algorithmic approach encompassing efficient feature sets and classification model are essential. This study incorporates point-wise features to capture the local spatial context of points in datasets. Furthermore, an ensemble machine learning model based on extreme boosting is utilised, which integrates sequential training for weak learners, to enhance the model’s resilience. To thoroughly investigate the efficacy of the proposed approach, this study utilises three distinct datasets from diverse geographical locations, each presenting unique challenges related to class distribution, 3D terrain intricacies, and geographical variations. The Land-cover Diversity Index is introduced to quantify the complexity of landcover in 3D by measuring the degree of class heterogeneity and the frequency of class variation in the dataset. The proposed approach achieved an accuracy of 90% on the regionally complex, higher landcover diversity dataset, Trivandrum Aerial LiDAR Dataset. Furthermore, the results of the study demonstrate improved overall predictive accuracy of 91% and 87% on data segments from two benchmark datasets, DALES and Vaihingen 3D.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12334","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
AMEF-Net: Towards an attention and multi-level enhancement fusion for medical image classification in Parkinson's aided diagnosis
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-25 | DOI: 10.1049/cvi2.12324
Qingyan Ding, Yu Pan, Jianxin Liu, Lianxin Li, Nan Liu, Na Li, Wan Zheng, Xuecheng Dong

Parkinson's disease (PD) is a neurodegenerative disorder primarily affecting middle-aged and elderly populations. Its insidious onset, high disability rate, long diagnostic cycle, and high diagnostic costs impose a heavy burden on patients and their families. Leveraging artificial intelligence, with its rapid diagnostic speed, high accuracy, and fatigue resistance, to achieve intelligent assisted diagnosis of PD holds significant promise for alleviating patients' financial stress, shortening diagnostic cycles, and helping patients seize the critical window for early treatment. This paper proposes an Attention and Multi-level Enhancement Fusion Network (AMEF-Net) based on the characteristics of three-dimensional medical imaging and the specific manifestations of PD in medical images. The focus is on small lesion areas and structural lesion areas that are often overlooked in traditional deep learning models, achieving multi-level attention over and processing of imaging information. The model achieved a diagnostic accuracy of 98.867%, a precision of 99.830%, a sensitivity of 99.182%, and a specificity of 99.384% on Magnetic Resonance Images from the Parkinson's Progression Markers Initiative dataset. On Diffusion Tensor Images, it achieved a diagnostic accuracy of 99.602%, a precision of 99.930%, a sensitivity of 99.463%, and a specificity of 99.877%. The relevant code is available at https://github.com/EdwardTj/AMEF-NET.
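
As one possible realisation of channel-wise attention over 3D medical volumes, the sketch below shows a squeeze-and-excitation-style block. It is a generic, hypothetical illustration; AMEF-Net's actual attention and multi-level fusion design may differ substantially.

```python
import torch
import torch.nn as nn

class ChannelAttention3D(nn.Module):
    """Squeeze-and-excitation style channel attention for 3D volumes.

    Illustrative only; not the AMEF-Net module described in the paper.
    """
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool3d(1)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (batch, C, D, H, W)
        w = self.pool(x).flatten(1)             # (batch, C)
        w = self.mlp(w).view(x.size(0), -1, 1, 1, 1)
        return x * w                            # re-weight channels

if __name__ == "__main__":
    attn = ChannelAttention3D(channels=32)
    vol = torch.randn(2, 32, 16, 64, 64)
    print(attn(vol).shape)  # torch.Size([2, 32, 16, 64, 64])
```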

{"title":"AMEF-Net: Towards an attention and multi-level enhancement fusion for medical image classification in Parkinson's aided diagnosis","authors":"Qingyan Ding,&nbsp;Yu Pan,&nbsp;Jianxin Liu,&nbsp;Lianxin Li,&nbsp;Nan Liu,&nbsp;Na Li,&nbsp;Wan Zheng,&nbsp;Xuecheng Dong","doi":"10.1049/cvi2.12324","DOIUrl":"https://doi.org/10.1049/cvi2.12324","url":null,"abstract":"<p>Parkinson's disease (PD) is a neurodegenerative disorder primarily affecting middle-aged and elderly populations. Its insidious onset, high disability rate, long diagnostic cycle, and high diagnostic costs impose a heavy burden on patients and their families. Leveraging artificial intelligence, with its rapid diagnostic speed, high accuracy, and fatigue resistance, to achieve intelligent assisted diagnosis of PD holds significant promise for alleviating patients' financial stress, reducing diagnostic cycles, and helping patients seize the golden period for early treatment. This paper proposes an Attention and Multi-level Enhancement Fusion Network (AMEF-Net) based on the characteristics of three-dimensional medical imaging and the specific manifestations of PD in medical images. The focus is on small lesion areas and structural lesion areas that are often overlooked in traditional deep learning models, achieving multi-level attention and processing of imaging information. The model achieved a diagnostic accuracy of 98.867%, a precision of 99.830%, a sensitivity of 99.182%, and a specificity of 99.384% on Magnetic Resonance Images from the Parkinson's Progression Markers Initiative dataset. On Diffusion Tensor Images, it achieved a diagnostic accuracy of 99.602%, a precision of 99.930%, a sensitivity of 99.463%, and a specificity of 99.877%. The relevant code has been placed in https://github.com/EdwardTj/AMEF-NET.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12324","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424267","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Unlocking the power of multi-modal fusion in 3D object tracking
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-25 | DOI: 10.1049/cvi2.12335
Yue Hu

3D single object tracking (SOT) plays a vital role in autonomous driving and robotics, yet traditional approaches have predominantly relied on LiDAR-based point cloud data alone, often neglecting the benefits of integrating the image modality. To address this gap, we propose a novel Multi-modal Image-LiDAR Tracker (MILT) designed to overcome the limitations of single-modality methods by effectively combining RGB and point cloud data. Our key contribution is a dual-branch architecture that separately extracts geometric features from LiDAR and texture features from images. These features are then fused from a bird's-eye-view (BEV) perspective to achieve a comprehensive representation of the tracked object. A significant innovation in our approach is the Image-to-LiDAR Adapter module, which transfers the rich feature representation capabilities of the image modality to the 3D tracking task, and the BEV-Fusion module, which facilitates the interactive fusion of geometry and texture features. By validating MILT on public datasets, we demonstrate substantial performance improvements over traditional methods, effectively showcasing the advantages of our multi-modal fusion strategy. This work advances the state of the art in SOT by integrating complementary information from the RGB and LiDAR modalities, resulting in enhanced tracking accuracy and robustness.
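
To illustrate the fusion step, the sketch below concatenates LiDAR and image features that have already been projected into BEV and fuses them with a small convolutional block. The projection itself (the Image-to-LiDAR adapter) is omitted, and the channel sizes are assumptions; this is not the paper's BEV-Fusion module.

```python
import torch
import torch.nn as nn

class BEVFusion(nn.Module):
    """Fuse LiDAR and image features that have both been projected to BEV.

    Only the concatenate-and-convolve fusion step is shown; the projection
    of image features into BEV is assumed to have happened upstream.
    """
    def __init__(self, lidar_ch: int = 128, image_ch: int = 64, out_ch: int = 128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Conv2d(lidar_ch + image_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, lidar_bev, image_bev):    # both (batch, C, H, W) in BEV
        return self.fuse(torch.cat([lidar_bev, image_bev], dim=1))

if __name__ == "__main__":
    fusion = BEVFusion()
    out = fusion(torch.randn(1, 128, 100, 100), torch.randn(1, 64, 100, 100))
    print(out.shape)  # torch.Size([1, 128, 100, 100])
```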

{"title":"Unlocking the power of multi-modal fusion in 3D object tracking","authors":"Yue Hu","doi":"10.1049/cvi2.12335","DOIUrl":"https://doi.org/10.1049/cvi2.12335","url":null,"abstract":"<p>3D Single Object Tracking plays a vital role in autonomous driving and robotics, yet traditional approaches have predominantly focused on using pure LiDAR-based point cloud data, often neglecting the benefits of integrating image modalities. To address this gap, we propose a novel Multi-modal Image-LiDAR Tracker (MILT) designed to overcome the limitations of single-modality methods by effectively combining RGB and point cloud data. Our key contribution is a dual-branch architecture that separately extracts geometric features from LiDAR and texture features from images. These features are then fused in a BEV perspective to achieve a comprehensive representation of the tracked object. A significant innovation in our approach is the Image-to-LiDAR Adapter module, which transfers the rich feature representation capabilities of the image modality to the 3D tracking task, and the BEV-Fusion module, which facilitates the interactive fusion of geometry and texture features. By validating MILT on public datasets, we demonstrate substantial performance improvements over traditional methods, effectively showcasing the advantages of our multi-modal fusion strategy. This work advances the state-of-the-art in SOT by integrating complementary information from RGB and LiDAR modalities, resulting in enhanced tracking accuracy and robustness.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12335","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Category-instance distillation based on visual-language models for rehearsal-free class incremental learning
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-23 | DOI: 10.1049/cvi2.12327
Weilong Jin, Zilei Wang, Yixin Zhang

Recently, visual-language models (VLMs) have displayed potent capabilities in the field of computer vision. Their emerging role as the backbone of visual tasks necessitates studying class incremental learning (CIL) within the VLM architecture. However, the pre-training data for many VLMs is proprietary, and during the incremental phase, old task data may also raise privacy issues. Moreover, replay-based methods can introduce new problems, such as class imbalance, the selection of data for replay, and a trade-off between replay cost and performance. Therefore, the authors choose the more challenging rehearsal-free setting. In this paper, the authors study class-incremental tasks based on large pre-trained vision-language models such as CLIP. Initially, at the category level, the authors combine traditional optimisation and distillation techniques, utilising both the pre-trained model and models trained in previous incremental stages to jointly guide the training of the new model. This paradigm effectively balances the stability and plasticity of the new model, mitigating catastrophic forgetting. Moreover, utilising the VLM infrastructure, the authors redefine the relationships between instances, which allows fine-grained instance relational information to be gleaned from the a priori knowledge provided during pre-training. The authors supplement this approach with an entropy-balancing method that allows the model to adaptively distribute optimisation weights across training samples. The experimental results validate that, within the VLM framework, the method outperforms traditional CIL methods.
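
The sketch below shows one way a student can be distilled from two teachers (a frozen pre-trained model and the previous-stage model) while weighting samples by predictive entropy. The loss form, temperature, coefficients and weighting scheme are assumptions for illustration, not the authors' exact formulation.

```python
import torch
import torch.nn.functional as F

def dual_teacher_distill_loss(student_logits, pretrain_logits, prev_logits,
                              labels, alpha: float = 0.5, beta: float = 0.5,
                              temperature: float = 2.0):
    """Cross-entropy plus distillation from two teachers, with per-sample
    entropy-based weights. Illustrative only; the paper's exact losses and
    weighting may differ."""
    t = temperature
    ce = F.cross_entropy(student_logits, labels, reduction="none")

    log_p = F.log_softmax(student_logits / t, dim=1)
    kd_pre = F.kl_div(log_p, F.softmax(pretrain_logits / t, dim=1),
                      reduction="none").sum(dim=1)
    kd_prev = F.kl_div(log_p, F.softmax(prev_logits / t, dim=1),
                       reduction="none").sum(dim=1)

    # Entropy balancing: up-weight samples the student is uncertain about.
    with torch.no_grad():
        probs = F.softmax(student_logits, dim=1)
        entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=1)
        weights = entropy / entropy.mean().clamp_min(1e-8)

    per_sample = ce + (t * t) * (alpha * kd_pre + beta * kd_prev)
    return (weights * per_sample).mean()

if __name__ == "__main__":
    s, p1, p2 = (torch.randn(4, 10) for _ in range(3))
    y = torch.randint(0, 10, (4,))
    print(dual_teacher_distill_loss(s, p1, p2, y).item())
```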

{"title":"Category-instance distillation based on visual-language models for rehearsal-free class incremental learning","authors":"Weilong Jin,&nbsp;Zilei Wang,&nbsp;Yixin Zhang","doi":"10.1049/cvi2.12327","DOIUrl":"https://doi.org/10.1049/cvi2.12327","url":null,"abstract":"<p>Recently, visual-language models (VLMs) have displayed potent capabilities in the field of computer vision. Their emerging trend as the backbone of visual tasks necessitates studying class incremental learning (CIL) issues within the VLM architecture. However, the pre-training data for many VLMs is proprietary, and during the incremental phase, old task data may also raise privacy issues. Moreover, replay-based methods can introduce new problems like class imbalance, the selection of data for replay and a trade-off between replay cost and performance. Therefore, the authors choose the more challenging rehearsal-free settings. In this paper, the authors study class-incremental tasks based on the large pre-trained vision-language models like CLIP model. Initially, at the category level, the authors combine traditional optimisation and distillation techniques, utilising both pre-trained models and models trained in previous incremental stages to jointly guide the training of the new model. This paradigm effectively balances the stability and plasticity of the new model, mitigating the issue of catastrophic forgetting. Moreover, utilising the VLM infrastructure, the authors redefine the relationship between instances. This allows us to glean fine-grained instance relational information from the a priori knowledge provided during pre-training. The authors supplement this approach with an entropy-balancing method that allows the model to adaptively distribute optimisation weights across training samples. The authors’ experimental results validate that their method, within the framework of VLMs, outperforms traditional CIL methods.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12327","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143424295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Outliers rejection for robust camera pose estimation using graduated non-convexity
IF 1.5 | CAS Q4 (Computer Science) | Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE | Pub Date: 2024-12-23 | DOI: 10.1049/cvi2.12330
Hao Yi, Bo Liu, Bin Zhao, Enhai Liu

Camera pose estimation plays a crucial role in computer vision and is widely used in augmented reality, robotics and autonomous driving. However, previous studies have neglected the presence of outliers in measurements, so even a small percentage of outliers can significantly degrade precision. To deal with outliers, this paper proposes a graduated non-convexity (GNC) method to suppress outliers in robust camera pose estimation, which serves as the core of GNCPnP. The authors first reformulate the camera pose estimation problem using a non-convex cost that is less affected by outliers. Then, to apply a non-minimal solver to the reformulated problem, the authors transform it using Black–Rangarajan duality. Finally, to address the dependence of non-convex optimisation on initial values, the GNC method is customised according to the truncated least-squares cost. The results of simulated and real experiments show that GNCPnP can effectively handle interference from outliers and achieve higher accuracy than existing state-of-the-art algorithms. In particular, the camera pose estimation accuracy of GNCPnP with a low percentage of outliers is almost comparable to that of the state-of-the-art algorithm with no outliers.
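
To make the graduated non-convexity idea concrete, the sketch below applies the generic GNC recipe with a truncated least-squares cost to a toy robust linear regression problem standing in for the PnP cost. The weight update and the annealing schedule for the surrogate parameter mu follow the commonly used GNC-TLS formulation and are assumptions here, not the paper's GNCPnP solver; the inlier threshold c_bar and the annealing factor are illustrative.

```python
import numpy as np

def gnc_tls_linear_fit(A, b, c_bar=0.1, mu_factor=1.4, iters=50):
    """Robust linear fit via graduated non-convexity with a truncated
    least-squares cost, alternating weighted least squares and weight
    updates while annealing mu. Illustrative sketch, not GNCPnP."""
    n = A.shape[0]
    w = np.ones(n)
    x = np.linalg.lstsq(A, b, rcond=None)[0]
    r2 = (A @ x - b) ** 2
    mu = c_bar**2 / max(2 * r2.max() - c_bar**2, 1e-12)  # start highly convex

    for _ in range(iters):
        # Weighted least-squares step with the current inlier weights.
        W = np.sqrt(w)[:, None]
        x = np.linalg.lstsq(W * A, W[:, 0] * b, rcond=None)[0]
        r2 = (A @ x - b) ** 2
        # TLS weight update for the current surrogate parameter mu.
        upper = (mu + 1) / mu * c_bar**2
        lower = mu / (mu + 1) * c_bar**2
        w = np.clip(c_bar * np.sqrt(mu * (mu + 1)) / np.sqrt(r2 + 1e-12) - mu,
                    0.0, 1.0)
        w[r2 >= upper] = 0.0   # confident outliers
        w[r2 <= lower] = 1.0   # confident inliers
        mu *= mu_factor        # anneal towards the original non-convex cost
    return x, w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.normal(size=(100, 3))
    x_true = np.array([1.0, -2.0, 0.5])
    b = A @ x_true + 0.01 * rng.normal(size=100)
    b[:20] += rng.normal(5.0, 2.0, size=20)          # 20% outliers
    x_est, weights = gnc_tls_linear_fit(A, b)
    print(np.round(x_est, 3), int((weights < 0.5).sum()))
```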

{"title":"Outliers rejection for robust camera pose estimation using graduated non-convexity","authors":"Hao Yi,&nbsp;Bo Liu,&nbsp;Bin Zhao,&nbsp;Enhai Liu","doi":"10.1049/cvi2.12330","DOIUrl":"https://doi.org/10.1049/cvi2.12330","url":null,"abstract":"<p>Camera pose estimation plays a crucial role in computer vision, which is widely used in augmented reality, robotics and autonomous driving. However, previous studies have neglected the presence of outliers in measurements, so that even a small percentage of outliers will significantly degrade precision. In order to deal with outliers, this paper proposes using a graduated non-convexity (GNC) method to suppress outliers in robust camera pose estimation, which serves as the core of GNCPnP. The authors first reformulate the camera pose estimation problem using a non-convex cost, which is less affected by outliers. Then, to apply a non-minimum solver to solve the reformulated problem, the authors use the Black-Rangarajan duality theory to transform it. Finally, to address the dependence of non-convex optimisation on initial values, the GNC method was customised according to the truncated least squares cost. The results of simulation and real experiments show that GNCPnP can effectively handle the interference of outliers and achieve higher accuracy compared to existing state-of-the-art algorithms. In particular, the camera pose estimation accuracy of GNCPnP in the case of a low percentage of outliers is almost comparable to that of the state-of-the-art algorithm in the case of no outliers.</p>","PeriodicalId":56304,"journal":{"name":"IET Computer Vision","volume":"19 1","pages":""},"PeriodicalIF":1.5,"publicationDate":"2024-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/cvi2.12330","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143363007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0