Resource-aware strategies for real-time multi-person pose estimation
Pub Date: 2025-02-07 | DOI: 10.1016/j.imavis.2025.105441
Mohammed A. Esmail , Jinlei Wang , Yihao Wang , Li Sun , Guoliang Zhu , Guohe Zhang
Deep learning applications for human pose estimation (HPE) must balance accuracy and efficiency, especially on devices with limited resources. Common deep-learning architectures tend to consume a large amount of processing power while yielding only modest accuracy. To address these issues, this work proposes Efficient YoloPose, a new architecture based on You Only Look Once version 8 (YOLOv8)-Pose. Efficient YoloPose replaces traditional convolution and C2f (a faster implementation of the Cross Stage Partial Bottleneck) with lightweight alternatives such as Depthwise Convolution, Ghost Convolution, and the C3Ghost module. This approach greatly reduces inference time, parameter count, and computational complexity. To improve pose estimation further, Efficient YoloPose integrates Squeeze-and-Excitation (SE) attention into the network, focusing the model on the most informative regions of an image during pose estimation. Experimental results show that the proposed model outperforms current models on the COCO and OCHuman datasets. Compared with YOLOv8-Pose, the proposed model lowers the inference time from 1.1 milliseconds (ms) to 0.9 ms, the computational complexity from 9.2 Giga floating-point operations (GFLOPs) to 4.8 GFLOPs, and the parameter count from 3.3 million to 1.3 million. In addition, the model maintains an average precision (AP) score of 78.8 on the COCO dataset. The source code for Efficient YoloPose is publicly available at https://github.com/malareeqi/Efficient-YoloPose.
Image and Vision Computing, vol. 155, Article 105441.
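For orientation, the block below is a minimal PyTorch sketch of the generic building blocks named in the abstract: depthwise convolution, Ghost convolution, and a Squeeze-and-Excitation block. Module names, channel sizes, and activations are illustrative assumptions, not the authors' Efficient YoloPose implementation.

```python
import torch
import torch.nn as nn

class DepthwiseConv(nn.Module):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.dw = nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in, bias=False)
        self.pw = nn.Conv2d(c_in, c_out, 1, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()
    def forward(self, x):
        return self.act(self.bn(self.pw(self.dw(x))))

class GhostConv(nn.Module):
    """Produce half the output channels with a regular conv and the rest with a cheap depthwise op."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_half = c_out // 2
        self.primary = nn.Conv2d(c_in, c_half, 1, bias=False)
        self.cheap = nn.Conv2d(c_half, c_half, 3, padding=1, groups=c_half, bias=False)
    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global pooling plus two FC layers yield per-channel weights."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )
    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)
        return x * w

if __name__ == "__main__":
    x = torch.randn(1, 32, 64, 64)
    out = SEBlock(64)(GhostConv(32, 64)(x))
    print(out.shape)  # torch.Size([1, 64, 64, 64])
```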
A small object detection model for drone images based on multi-attention fusion network
Pub Date: 2025-02-04 | DOI: 10.1016/j.imavis.2025.105436
Jie Hu , Ting Pang , Bo Peng , Yongguo Shi , Tianrui Li
Object detection in aerial images is crucial for applications such as precision agriculture, urban planning, disaster management, and military surveillance, as it enables the automated identification and localization of ground objects in high-altitude imagery. However, the field faces several significant challenges: (1) uneven spatial distribution of objects; (2) numerous small objects and complex backgrounds in high-resolution aerial images; and (3) large variation in object sizes. To address these challenges, this paper proposes a new detection network architecture, named MAFDet, based on the fusion of multiple attention mechanisms. MAFDet comprises three main components: a multi-attention focusing sub-network, a multi-scale Swin transformer backbone, and a detection head. The multi-attention focusing sub-network generates attention maps to identify regions with dense small objects for precise detection. The multi-scale Swin transformer embeds an efficient multi-scale attention module into the Swin transformer block to extract better multi-layer features and mitigate background interference, significantly enhancing the model's feature extraction capability. Finally, the detector processes regions with dense small objects and the global image separately, then fuses the detection results to produce the final output. Experimental results demonstrate that MAFDet outperforms existing methods on the widely used aerial image datasets VisDrone and UAVDT, improving small-object average precision (APs) by 1.21% and 1.98%, respectively.
Image and Vision Computing, vol. 155, Article 105436.
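The crop-and-fuse step described above (detecting on dense-object regions and on the global image, then merging the results) can be illustrated with the hedged sketch below; the detector, the crop proposals, and the use of torchvision's NMS for duplicate removal are assumptions standing in for MAFDet's actual fusion rule.

```python
import torch
from torchvision.ops import nms

def fuse_detections(global_dets, crop_dets_list, crop_offsets, iou_thr=0.5):
    """Merge global-image detections with detections from dense-object crops.

    global_dets: (boxes [N, 4], scores [N]) in full-image coordinates.
    crop_dets_list: list of (boxes, scores) in crop-local coordinates.
    crop_offsets: list of (x0, y0) top-left corners of each crop in the full image.
    """
    boxes = [global_dets[0]]
    scores = [global_dets[1]]
    for (b, s), (x0, y0) in zip(crop_dets_list, crop_offsets):
        # shift crop-local boxes back into full-image coordinates
        shifted = b + torch.tensor([x0, y0, x0, y0], dtype=b.dtype)
        boxes.append(shifted)
        scores.append(s)
    boxes = torch.cat(boxes)
    scores = torch.cat(scores)
    # suppress duplicates detected by both the crop pass and the global pass
    # (for multi-class detectors, torchvision.ops.batched_nms would be used instead)
    keep = nms(boxes, scores, iou_thr)
    return boxes[keep], scores[keep]
```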
Markerless multi-view 3D human pose estimation: A survey
Pub Date: 2025-02-03 | DOI: 10.1016/j.imavis.2025.105437
Ana Filipa Rodrigues Nogueira , Hélder P. Oliveira , Luís F. Teixeira
3D human pose estimation aims to reconstruct the skeletons of all individuals in a scene by detecting several body joints. Accurate and efficient methods are required for many real-world applications, including animation, human–robot interaction, surveillance systems, and sports, among others. However, several obstacles, such as occlusions, arbitrary camera viewpoints, and the scarcity of 3D-labelled data, hamper the models' performance and limit their deployment in real-world scenarios. The growing availability of cameras has led researchers to explore multi-view solutions, which can exploit different perspectives to reconstruct the pose.
Most existing reviews focus mainly on monocular 3D human pose estimation, and a comprehensive survey devoted solely to multi-view approaches has been missing since 2012. The goal of this survey is to fill that gap: to present an overview of methodologies for 3D pose estimation in multi-view settings, to understand the strategies adopted to address the various challenges, and to identify their limitations. The reviewed articles show that most methods are fully supervised approaches based on geometric constraints. Nonetheless, most methods suffer from 2D pose mismatches; incorporating temporal consistency and depth information has been suggested to reduce the impact of this limitation, while working directly with 3D features can bypass the problem entirely, at the expense of higher computational complexity. Models with lower supervision levels were identified as a way to overcome some of the issues related to 3D pose, particularly the scarcity of labelled datasets. No method is yet capable of solving all the challenges associated with reconstructing the 3D pose, and because of the trade-off between complexity and performance, the best method depends on the application scenario. Further research is therefore required to develop an approach that can quickly infer a highly accurate 3D pose at a bearable computational cost. To this end, techniques such as active learning, methods that learn with a low level of supervision, the incorporation of temporal consistency, view selection, estimation of depth information, and multi-modal approaches are strategies worth keeping in mind when developing a new methodology for this task.
Image and Vision Computing, vol. 155, Article 105437.
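As a concrete example of the geometric-constraint approach the survey reviews, the sketch below triangulates a single joint from its 2D detections in several calibrated views using the direct linear transform (DLT); it is a textbook baseline, not a method from any specific surveyed paper.

```python
import numpy as np

def triangulate_joint(projections, points_2d):
    """Triangulate one 3D joint from 2D detections in several calibrated views (DLT).

    projections: list of 3x4 camera projection matrices P_i.
    points_2d: list of (u, v) pixel coordinates of the same joint in each view.
    """
    rows = []
    for P, (u, v) in zip(projections, points_2d):
        # Each view contributes two linear constraints on the homogeneous 3D point X:
        # u * (P[2] @ X) = P[0] @ X   and   v * (P[2] @ X) = P[1] @ X
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)
    _, _, vt = np.linalg.svd(A)          # least-squares solution is the last right singular vector
    X = vt[-1]
    return X[:3] / X[3]                  # de-homogenize to (x, y, z)
```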
Multiangle feature fusion network for style transfer
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105386
Zhenshan Hu, Bin Ge, Chenxing Xia
In recent years, arbitrary style transfer has attracted considerable attention from researchers. Although existing methods achieve good results, the generated images are usually biased towards styles, producing images with artifacts and repetitive patterns. To address these problems, we propose a multi-angle feature fusion network for style transfer (MAFST). MAFST consists of a Multi-Angle Feature Fusion module (MAFF), a Multi-Scale Style Capture module (MSSC), a multi-angle loss, and a content temporal consistency loss. MAFF processes the captured features at both the channel level and the pixel level, and performs feature fusion both locally and globally. MSSC processes the shallow style features and optimizes the generated images. To guide the model to focus on local features, we introduce a multi-angle loss. The content temporal consistency loss extends image style transfer to video style transfer. Extensive experiments demonstrate that MAFST effectively avoids artifacts and repetitive patterns and achieves advanced performance.
Image and Vision Computing, vol. 154, Article 105386.
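The abstract does not detail MAFF, so the sketch below only illustrates the general idea of combining a channel-level gate with a pixel-level (spatial) gate when fusing two feature maps; the module structure, reduction ratio, and fusion rule are assumptions rather than the paper's design.

```python
import torch
import torch.nn as nn

class ChannelPixelFusion(nn.Module):
    """Fuse two feature maps, then re-weight them with a channel gate and a pixel gate."""
    def __init__(self, channels, reduction=8):
        super().__init__()
        # channel-level attention: global pooling followed by a small bottleneck MLP
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # pixel-level attention: a 1x1 conv yields one weight per spatial location
        self.pixel_gate = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())

    def forward(self, content_feat, style_feat):
        fused = content_feat + style_feat
        out = fused * self.channel_gate(fused)   # re-weight channels
        out = out * self.pixel_gate(fused)       # re-weight spatial positions
        return out
```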
Unmasking deepfakes: Eye blink pattern analysis using a hybrid LSTM and MLP-CNN model
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105370
Ruchika Sharma, Rudresh Dwivedi
Recent progress in computer vision has produced robust tools for creating convincing deepfakes. The propagation of fake media can therefore have detrimental effects on social communities, potentially tarnishing the reputation of individuals or groups, manipulating public sentiment, and skewing opinions about the affected entities. Recent research identifies Convolutional Neural Networks (CNNs) as a viable solution for detecting deepfakes. However, existing techniques struggle to accurately capture the differences between frames in the collected media streams. To alleviate these limitations, we propose a new deepfake detection approach based on a hybrid model that combines a Multi-Layer Perceptron Convolutional Neural Network (MLP-CNN) with Long Short-Term Memory (LSTM). The model applies Contrast Limited Adaptive Histogram Equalization (CLAHE) (Musa et al., 2018) to enhance image contrast and then uses the Viola–Jones Algorithm (VJA) (Paul et al., 2018) on the preprocessed image to detect the face. The extracted features, namely improved eye-blinking pattern detection (IEBPD), active shape model (ASM) features, face attributes, and eye attributes, together with the age and gender of the corresponding image, are fed to the hybrid deepfake detection model, which involves two classifiers, MLP-CNN and LSTM. The proposed model is trained with these features to detect deepfake images proficiently. The hybrid model is evaluated on two datasets, the World Leader Dataset (WLDR) and the DeepfakeTIMIT dataset. The experimental results confirm that the proposed hybrid model outperforms existing approaches such as DeepVision, DNN (Deep Neural Network), CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), DeepMaxout, DBN (Deep Belief Network), and Bi-GRU (Bi-Directional Gated Recurrent Unit).
Image and Vision Computing, vol. 154, Article 105370.
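The preprocessing pipeline described above (CLAHE followed by Viola–Jones face detection) maps directly onto standard OpenCV calls; the sketch below shows one plausible arrangement, with the clip limit, tile size, and cascade parameters chosen as illustrative defaults rather than the authors' settings.

```python
import cv2

def preprocess_and_detect(frame_bgr):
    """CLAHE contrast enhancement followed by Viola-Jones face detection (OpenCV Haar cascade)."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    face_cascade = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    )
    faces = face_cascade.detectMultiScale(enhanced, scaleFactor=1.1, minNeighbors=5)
    # Crop the detected face regions for downstream feature extraction (blink patterns, etc.)
    return [enhanced[y:y + h, x:x + w] for (x, y, w, h) in faces]
```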
3D human avatar reconstruction with neural fields: A recent survey
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105341
Meiying Gu , Jiahe Li , Yuchen Wu , Haonan Luo , Jin Zheng , Xiao Bai
3D human avatar reconstruction aims to recover the 3D geometric shape and appearance of the human body from various data inputs, such as images, videos, and depth information, and is a key component of human-oriented 3D vision in the metaverse. With the progress of neural fields for 3D reconstruction in recent years, this research area has seen significant advances in shape accuracy and appearance quality. Meanwhile, substantial efforts on dynamic avatars represented with neural fields have demonstrated their effectiveness. Despite these improvements, challenges remain in in-the-wild and complex environments, detailed shape recovery, and interactivity in real-world applications. In this survey, we present a comprehensive overview of 3D human avatar reconstruction methods based on advanced neural fields. We first introduce the background of 3D human avatar reconstruction and the mainstream neural-field paradigms. Representative research studies are then classified based on their representation and avatar parts, with detailed discussion. We also summarize the commonly used datasets, evaluation metrics, and results in this research area. Finally, we discuss open problems and highlight promising future directions, hoping to inspire novel ideas and promote further research in this area.
Image and Vision Computing, vol. 154, Article 105341.
HSIRMamba: An effective feature learning for hyperspectral image classification using residual Mamba
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105387
Rajat Kumar Arya, Siddhant Jain, Pratik Chattopadhyay, Rajeev Srivastava
Deep learning models have recently demonstrated outstanding results in classifying hyperspectral images (HSI). Among them, the Transformer has received increasing interest owing to its superior ability to model the long-range dependencies of spatial-spectral information in HSI. However, because of its self-attention mechanism, the Transformer has quadratic computational complexity, which makes it heavier than other models and limits its application to HSI processing. Fortunately, the recently developed state-space model Mamba offers excellent computational efficiency while achieving Transformer-like modeling capability. We therefore propose HSIRMamba, an enhanced Mamba-based model that integrates residual operations into the Mamba architecture, combining the strengths of Mamba and residual networks to extract the spectral properties of HSI more effectively. It also includes a parallel dedicated block for spatial analysis using a convolutional neural network. HSIRMamba extracts more accurate features at low computational cost, making it more powerful than Transformer-based models. HSIRMamba was tested on three widely used HSI datasets: Indian Pines, Pavia University, and Houston 2013. The experimental results demonstrate that the proposed method achieves competitive results compared with state-of-the-art methods.
Image and Vision Computing, vol. 154, Article 105387.
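A hedged sketch of the overall pattern, a residual wrapper around a sequence mixer for the spectral branch plus a parallel convolutional branch for spatial context, is given below. The sequence mixer is injected as an argument; in practice it could be a Mamba block (for example from the mamba_ssm package), but any module mapping (B, L, dim) to (B, L, dim) works, and this is not the authors' HSIRMamba code.

```python
import torch
import torch.nn as nn

class ResidualSpectralBlock(nn.Module):
    """Residual connection around a sequence-mixing module (e.g. a Mamba block) for the
    spectral tokens, alongside a small CNN branch for the spatial neighborhood."""
    def __init__(self, dim, seq_mixer: nn.Module):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mixer = seq_mixer                      # spectral branch (tokens = spectral bands/patches)
        self.spatial = nn.Sequential(               # spatial branch on the 2D patch
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.GELU(),
        )

    def forward(self, tokens, patch):
        # tokens: (B, L, dim) spectral sequence; patch: (B, dim, H, W) spatial neighborhood
        tokens = tokens + self.mixer(self.norm(tokens))   # residual over the sequence mixer
        patch = patch + self.spatial(patch)               # residual over the CNN branch
        return tokens, patch

if __name__ == "__main__":
    # nn.Identity() stands in for the sequence mixer, just to check shapes
    block = ResidualSpectralBlock(32, nn.Identity())
    t, p = block(torch.randn(2, 16, 32), torch.randn(2, 32, 9, 9))
    print(t.shape, p.shape)
```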
CPFusion: A multi-focus image fusion method based on closed-loop regularization
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105399
Hao Zhai, Peng Chen, Nannan Luo, Qinyu Li, Ping Yu
The purpose of Multi-Focus Image Fusion (MFIF) is to extract the clear portions from multiple blurry images with complementary features to obtain a fully focused image, which is considered a prerequisite for other advanced visual tasks. With the development of deep learning technologies, significant breakthroughs have been achieved in multi-focus image fusion. However, most existing methods still face challenges related to detail information loss and misjudgment in boundary regions. In this paper, we propose a method called CPFusion for MFIF. On one hand, to fully preserve all detail information from the source images, we utilize an Invertible Neural Network (INN) for feature information transfer. The strong feature retention capability of INN allows for better preservation of the complementary features of the source images. On the other hand, to enhance the network's performance in image fusion, we design a closed-loop structure to guide the fusion process. Specifically, during the training process, the forward operation of the network learns the mapping from source images to fused images and decision maps, while the backward operation simulates the degradation of the focused image back to the source images. The backward operation serves as an additional constraint to guide the network's forward operation. To achieve more natural fusion results, our network simultaneously generates an initial fused image and a decision map, utilizing the decision map to retain the details of the source images, while the initial fused image is employed to improve the visual effects of the decision-map fusion in boundary regions. Extensive experimental results demonstrate that the proposed method achieves excellent results in both subjective visual quality and objective metric assessments.
Image and Vision Computing, vol. 154, Article 105399.
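CPFusion's INN is not specified beyond the abstract, but the standard building block of invertible networks is the affine coupling layer, sketched below in PyTorch; the conditioning network and the tanh-bounded scaling are illustrative choices, not the paper's architecture.

```python
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    """One invertible affine coupling layer: half the channels are transformed using a
    scale/shift predicted from the other half, so the mapping can be inverted exactly."""
    def __init__(self, channels, hidden=64):
        super().__init__()
        half = channels // 2
        self.net = nn.Sequential(
            nn.Conv2d(half, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 2 * half, 3, padding=1),
        )

    def forward(self, x):
        x1, x2 = x.chunk(2, dim=1)
        log_s, t = self.net(x1).chunk(2, dim=1)
        y2 = x2 * torch.exp(torch.tanh(log_s)) + t   # tanh keeps the scales bounded
        return torch.cat([x1, y2], dim=1)

    def inverse(self, y):
        # x1 passes through unchanged, so the same log_s and t can be recomputed from y1
        y1, y2 = y.chunk(2, dim=1)
        log_s, t = self.net(y1).chunk(2, dim=1)
        x2 = (y2 - t) * torch.exp(-torch.tanh(log_s))
        return torch.cat([y1, x2], dim=1)
```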
An Active Transfer Learning framework for image classification based on Maximum Differentiation Classifier
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105401
Peng Zan , Yuerong Wang , Haohao Hu , Wanjun Zhong , Tianyu Han , Jingwei Yue
Deep learning has been extensively adopted across various domains, yielding satisfactory outcomes. However, it relies heavily on extensive labeled datasets, and collecting labels is expensive and time-consuming. We propose a novel framework called Active Transfer Learning (ATL) to address this issue. The ATL framework combines Active Learning (AL) and Transfer Learning (TL). AL queries unlabeled samples with high inconsistency using a Maximum Differentiation Classifier (MDC). The MDC enlarges the discrepancy between the labeled data and their augmentations in order to select and annotate the informative samples. Additionally, we explore the potential of incorporating TL techniques. TL comprises pre-training and fine-tuning: the former learns knowledge from the origin-augmentation domain to pre-train the model, while the latter leverages the acquired knowledge for the downstream tasks. The results indicate that the combination of TL and AL has complementary effects, and the proposed ATL framework outperforms state-of-the-art methods in terms of accuracy, precision, recall, and F1-score.
Image and Vision Computing, vol. 154, Article 105401.
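The query step can be approximated as below: score each unlabeled sample by the disagreement between predictions on the sample and on an augmented view, and send the top-k most inconsistent samples for annotation. This single-model, symmetric-KL variant is an assumption; in the paper the disagreement comes from the two maximally differentiated classifiers of the MDC, and the loader yielding (index, image) pairs is a hypothetical convenience.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def query_by_inconsistency(model, unlabeled_loader, augment, k=100, device="cuda"):
    """Rank unlabeled samples by the symmetric KL divergence between the model's
    predictions on each sample and on its augmented view; return the top-k indices."""
    model.eval()
    scores, indices = [], []
    for idx, x in unlabeled_loader:                  # loader assumed to yield (sample indices, images)
        x = x.to(device)
        p = F.softmax(model(x), dim=1)
        q = F.softmax(model(augment(x)), dim=1)
        kl = (p * (p / q.clamp_min(1e-8)).log()).sum(1) + \
             (q * (q / p.clamp_min(1e-8)).log()).sum(1)
        scores.append(kl.cpu())
        indices.append(idx)
    scores = torch.cat(scores)
    indices = torch.cat(indices)
    topk = scores.topk(k).indices
    return indices[topk]                              # candidate samples to send for labeling
```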
Synthesizing multilevel abstraction ear sketches for enhanced biometric recognition
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2025.105424
David Freire-Obregón , Joao Neves , Žiga Emeršič , Blaž Meden , Modesto Castrillón-Santana , Hugo Proença
Sketch understanding poses unique challenges for general-purpose vision algorithms due to the sparse and semantically ambiguous nature of sketches. This paper introduces a novel approach to biometric recognition that leverages sketch-based representations of ears, a largely unexplored but promising area in biometric research. Specifically, we address the “sketch-2-image” matching problem by synthesizing ear sketches at multiple abstraction levels, achieved through a triplet-loss function adapted to integrate these levels. The abstraction level is determined by the number of strokes used, with fewer strokes reflecting higher abstraction. Our methodology combines sketch representations across abstraction levels to improve robustness and generalizability in matching. Extensive evaluations were conducted on four ear datasets (AMI, AWE, IITDII, and BIPLab) using various pre-trained neural network backbones, showing consistently superior performance over state-of-the-art methods. These results highlight the potential of ear sketch-based recognition, with cross-dataset tests confirming its adaptability to real-world conditions and suggesting applicability beyond ear biometrics.
Image and Vision Computing, vol. 154, Article 105424.
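How the abstraction levels enter the triplet loss is not spelled out in the abstract; the sketch below simply averages a standard triplet margin loss over sketch batches at several abstraction levels, with the photograph as anchor, using PyTorch's built-in TripletMarginLoss. The embedding network and batch conventions are assumptions for illustration.

```python
import torch
import torch.nn as nn

def multilevel_triplet_loss(embed, photo, sketches_by_level, negatives, margin=0.3):
    """Average a triplet margin loss over several sketch abstraction levels.

    embed: embedding network shared by photographs and sketches.
    photo: batch of ear photographs (anchors).
    sketches_by_level: list of sketch batches, one per abstraction level,
                       same identities as `photo` (positives).
    negatives: batch of ear photographs of different identities.
    """
    criterion = nn.TripletMarginLoss(margin=margin)
    anchor = embed(photo)
    neg = embed(negatives)
    losses = [criterion(anchor, embed(sketch), neg) for sketch in sketches_by_level]
    return torch.stack(losses).mean()
```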