Multiangle feature fusion network for style transfer (Image and Vision Computing, Vol. 154, Article 105386)
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105386
Zhenshan Hu, Bin Ge, Chenxing Xia
In recent years, arbitrary style transfer has attracted considerable attention from researchers. Although existing methods achieve good results, the generated images are usually biased towards the style, resulting in artifacts and repetitive patterns. To address these problems, we propose a multi-angle feature fusion network for style transfer (MAFST). MAFST consists of a Multi-Angle Feature Fusion module (MAFF), a Multi-Scale Style Capture module (MSSC), a multi-angle loss, and a content temporal consistency loss. MAFF processes the captured features at both the channel level and the pixel level, and performs feature fusion both locally and globally. MSSC processes the shallow style features and optimizes the generated images. To guide the model to focus on local features, we introduce a multi-angle loss. The content temporal consistency loss extends image style transfer to video style transfer. Extensive experiments demonstrate that MAFST effectively avoids artifacts and repetitive patterns and achieves state-of-the-art performance.
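The abstract does not spell out how channel-level and pixel-level processing are combined, so the following is only a minimal PyTorch sketch of one plausible reading, in which a channel-attention branch (global) and a pixel-attention branch (local) re-weight a naively fused feature map. All module and variable names are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn

class ChannelPixelFusion(nn.Module):
    """Hypothetical fusion block: channel-level (global) and pixel-level (local) attention."""
    def __init__(self, channels: int, reduction: int = 8):
        super().__init__()
        # Channel branch: global average pooling followed by a bottleneck MLP.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )
        # Pixel branch: a 1x1 convolution producing a spatial attention map.
        self.pixel_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, content_feat: torch.Tensor, style_feat: torch.Tensor) -> torch.Tensor:
        fused = content_feat + style_feat          # naive global fusion of the two streams
        fused = fused * self.channel_gate(fused)   # channel-level re-weighting
        fused = fused * self.pixel_gate(fused)     # pixel-level re-weighting
        return fused

# Usage sketch: fused = ChannelPixelFusion(512)(content_features, style_features)
```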
{"title":"Multiangle feature fusion network for style transfer","authors":"Zhenshan Hu, Bin Ge, Chenxing Xia","doi":"10.1016/j.imavis.2024.105386","DOIUrl":"10.1016/j.imavis.2024.105386","url":null,"abstract":"<div><div>In recent years, arbitrary style transfer has gained a lot of attention from researchers. Although existing methods achieve good results, the generated images are usually biased towards styles, resulting in images with artifacts and repetitive patterns. To address the above problems, we propose a multi-angle feature fusion network for style transfer (MAFST). MAFST consists of a Multi-Angle Feature Fusion module (MAFF), a Multi-Scale Style Capture module (MSSC), multi-angle loss, and a content temporal consistency loss. MAFF can process the captured features from channel level and pixel level, and feature fusion is performed both locally and globally. MSSC processes the shallow style features and optimize generated images. To guide the model to focus on local features, we introduce a multi-angle loss. The content temporal consistency loss extends image style transfer to video style transfer. Extensive experiments have demonstrated that our proposed MAFST can effectively avoid images with artifacts and repetitive patterns. MAFST achieves advanced performance.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105386"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unmasking deepfakes: Eye blink pattern analysis using a hybrid LSTM and MLP-CNN model (Image and Vision Computing, Vol. 154, Article 105370)
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105370
Ruchika Sharma, Rudresh Dwivedi
Recent progress in computer vision has produced powerful tools for creating convincing deepfakes. The propagation of such fake media can have detrimental effects on social communities, potentially tarnishing the reputation of individuals or groups, manipulating public sentiment, and skewing opinions about the affected entities. Recent research has identified Convolutional Neural Networks (CNNs) as a viable solution for detecting deepfakes. However, existing techniques struggle to accurately capture the differences between frames in the collected media streams. To alleviate these limitations, we propose a new deepfake detection approach based on a hybrid model that combines a Multi-layer Perceptron Convolutional Neural Network (MLP-CNN) with a Long Short-Term Memory (LSTM) network. The model applies Contrast Limited Adaptive Histogram Equalization (CLAHE) (Musa et al., 2018) to enhance image contrast and then applies the Viola-Jones Algorithm (VJA) (Paul et al., 2018) to the preprocessed image to detect the face. The extracted features, namely Improved Eye Blinking Pattern Detection (IEBPD), Active Shape Model (ASM) features, face attributes, and eye attributes, together with the age and gender of the corresponding image, are fed to the hybrid deepfake detection model, which comprises the two classifiers MLP-CNN and LSTM. The proposed model is trained on these features to detect deepfake images proficiently. The hybrid model was evaluated on two datasets, the World Leader Dataset (WLDR) and the DeepfakeTIMIT dataset. The experimental results confirm that it outperforms existing approaches such as DeepVision, DNN (Deep Neural Network), CNN (Convolutional Neural Network), RNN (Recurrent Neural Network), DeepMaxout, DBN (Deep Belief Network), and Bi-GRU (Bi-Directional Gated Recurrent Unit).
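The preprocessing pipeline described above, CLAHE contrast enhancement followed by Viola-Jones face detection, can be reproduced with standard OpenCV calls; a minimal sketch is shown below. The parameter values are illustrative assumptions, not values from the paper.

```python
import cv2

def preprocess_frame(image_path: str):
    """CLAHE contrast enhancement followed by Viola-Jones face detection (illustrative parameters)."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)

    # Contrast Limited Adaptive Histogram Equalization.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)

    # Viola-Jones face detection using the Haar cascade shipped with OpenCV.
    cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
    detector = cv2.CascadeClassifier(cascade_path)
    faces = detector.detectMultiScale(enhanced, scaleFactor=1.1, minNeighbors=5)

    # Return the cropped face regions; downstream feature extraction operates on these crops.
    return [enhanced[y:y + h, x:x + w] for (x, y, w, h) in faces]
```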
{"title":"Unmasking deepfakes: Eye blink pattern analysis using a hybrid LSTM and MLP-CNN model","authors":"Ruchika Sharma, Rudresh Dwivedi","doi":"10.1016/j.imavis.2024.105370","DOIUrl":"10.1016/j.imavis.2024.105370","url":null,"abstract":"<div><div>Recent progress in the field of computer vision incorporates robust tools for creating convincing deepfakes. Hence, the propagation of fake media may have detrimental effects on social communities, potentially tarnishing the reputation of individuals or groups. Furthermore, this phenomenon may manipulate public sentiments and skew opinions about the affected entities. Recent research determines Convolution Neural Networks (CNNs) as a viable solution for detecting deepfakes within the networks. However, existing techniques struggle to accurately capture the differences between frames in the collected media streams. To alleviate these limitations, our work proposes a new Deepfake detection approach using a hybrid model using the Multi-layer Perceptron Convolution Neural Network (MLP-CNN) model and LSTM (Long Short Term Memory). Our model has utilized Contrast Limited Adaptive Histogram Equalization (CLAHE) (Musa et al., 2018) approach to enhance the contrast of the image and later on applying Viola Jones Algorithm (VJA) (Paul et al., 2018) to the preprocessed image for detecting the face. The extracted features such as Improved eye blinking pattern detection (IEBPD), active shape model (ASM), face attributes, and eye attributes features along with the age and gender of the corresponding image are fed to the hybrid deepfake detection model that involves two classifiers MLP-CNN and LSTM model. The proposed model is trained with these features to detect the deepfake images proficiently. The experimentation demonstrates that our proposed hybrid model has been evaluated on two datasets, i.e. World Leader Dataset (WLDR) and the DeepfakeTIMIT Dataset. From the experimental results, it is affirmed that our proposed hybrid model outperforms existing approaches such as DeepVision, DNN (Deep Neutral Network), CNN (Convolution Neural Network), RNN (Recurrent Neural network), DeepMaxout, DBN (Deep Belief Networks), and Bi-GRU (Bi-Directional Gated Recurrent Unit).</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105370"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3D human avatar reconstruction with neural fields: A recent survey (Image and Vision Computing, Vol. 154, Article 105341)
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105341
Meiying Gu, Jiahe Li, Yuchen Wu, Haonan Luo, Jin Zheng, Xiao Bai
3D human avatar reconstruction aims to recover the 3D geometric shape and appearance of the human body from various data inputs, such as images, videos, and depth information, and is a key component of human-oriented 3D vision in the metaverse. With the progress in neural fields for 3D reconstruction in recent years, significant advancements have been made in this research area in terms of shape accuracy and appearance quality. Meanwhile, substantial efforts on dynamic avatars represented with neural fields have demonstrated their effectiveness. Although significant improvements have been achieved, challenges remain in in-the-wild and complex environments, detailed shape recovery, and interactivity in real-world applications. In this survey, we present a comprehensive overview of 3D human avatar reconstruction methods using advanced neural fields. We start by introducing the background of 3D human avatar reconstruction and the mainstream paradigms based on neural fields. Subsequently, representative research studies are classified based on their representation and avatar parts, with detailed discussion. Moreover, we summarize the commonly used datasets, evaluation metrics, and results in this research area. Finally, we discuss the open problems and highlight promising future directions, hoping to inspire novel ideas and promote further research in this area.
{"title":"3D human avatar reconstruction with neural fields: A recent survey","authors":"Meiying Gu , Jiahe Li , Yuchen Wu , Haonan Luo , Jin Zheng , Xiao Bai","doi":"10.1016/j.imavis.2024.105341","DOIUrl":"10.1016/j.imavis.2024.105341","url":null,"abstract":"<div><div>3D human avatar reconstruction aims to reconstruct the 3D geometric shape and appearance of the human body from various data inputs, such as images, videos, and depth information, acting as a key component in human-oriented 3D vision in the metaverse. With the progress in neural fields for 3D reconstruction in recent years, significant advancements have been made in this research area for shape accuracy and appearance quality. Meanwhile, substantial efforts on dynamic avatars with the representation of neural fields have exhibited their effect. Although significant improvements have been achieved, challenges still exist in in-the-wild and complex environments, detailed shape recovery, and interactivity in real-world applications. In this survey, we present a comprehensive overview of 3D human avatar reconstruction methods using advanced neural fields. We start by introducing the background of 3D human avatar reconstruction and the mainstream paradigms with neural fields. Subsequently, representative research studies are classified based on their representation and avatar partswith detailed discussion. Moreover, we summarize the commonly used available datasets, evaluation metrics, and results in the research area. In the end, we discuss the open problems and highlight the promising future directions, hoping to inspire novel ideas and promote further research in this area.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105341"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HSIRMamba: An effective feature learning for hyperspectral image classification using residual Mamba (Image and Vision Computing, Vol. 154, Article 105387)
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105387
Rajat Kumar Arya, Siddhant Jain, Pratik Chattopadhyay, Rajeev Srivastava
Deep learning models have recently demonstrated outstanding results in classifying hyperspectral images (HSI). Among them, the Transformer has received increasing interest due to its superior ability to model the long-range dependence of spatial-spectral information in HSI. However, because of its self-attention mechanism, the Transformer exhibits quadratic computational complexity, which makes it more computationally demanding than other models and limits its application to HSI processing. Fortunately, the recently developed state space model Mamba offers excellent computational efficiency while achieving Transformer-like modeling capabilities. We therefore propose a novel enhanced Mamba-based model called HSIRMamba that integrates residual operations into the Mamba architecture, combining the strengths of Mamba and residual networks to extract the spectral properties of HSI more effectively. It also includes a concurrent dedicated block for spatial analysis using a convolutional neural network. HSIRMamba extracts more accurate features at low computational cost, making it more powerful than Transformer-based models. HSIRMamba was tested on three widely used HSI datasets: Indian Pines, Pavia University, and Houston 2013. The experimental results demonstrate that the proposed method achieves competitive results compared to state-of-the-art methods.
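A minimal PyTorch sketch of the kind of block the abstract describes, a residual Mamba branch for spectral tokens plus a parallel convolutional branch for spatial analysis, is given below. The Mamba layer itself is passed in as an external module, and the names, tokenization, and fusion-by-addition choice are assumptions rather than details taken from the paper.

```python
import torch
import torch.nn as nn

class ResidualMambaSpectralSpatialBlock(nn.Module):
    """Hypothetical block: residual Mamba branch for spectral tokens plus a parallel CNN spatial branch."""
    def __init__(self, mamba_layer: nn.Module, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mamba = mamba_layer                    # any sequence module mapping (B, L, C) -> (B, L, C)
        self.spatial = nn.Sequential(               # dedicated convolutional branch for spatial context
            nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim),
            nn.BatchNorm2d(dim),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) patch of a hyperspectral cube whose spectral channels are embedded to `dim`.
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)                 # (B, H*W, C) sequence for the Mamba branch
        spectral = self.mamba(self.norm(tokens)) + tokens     # residual connection around Mamba
        spectral = spectral.transpose(1, 2).reshape(b, c, h, w)
        return spectral + self.spatial(x)                     # fuse spectral and spatial branches (assumed: addition)
```

During prototyping, `mamba_layer` could be any drop-in sequence module with the (B, L, C) interface, for example a state space layer from an existing Mamba implementation.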
{"title":"HSIRMamba: An effective feature learning for hyperspectral image classification using residual Mamba","authors":"Rajat Kumar Arya, Siddhant Jain, Pratik Chattopadhyay, Rajeev Srivastava","doi":"10.1016/j.imavis.2024.105387","DOIUrl":"10.1016/j.imavis.2024.105387","url":null,"abstract":"<div><div>Deep learning models have recently demonstrated outstanding results in classifying hyperspectral images (HSI). The Transformer model is among the various deep learning models that have received increasing interest due to its superior ability to simulate the long-term dependence of spatial-spectral information in HSI. Due to its self-attention mechanism, the Transformer exhibits quadratic computational complexity, which makes it heavier than other models and limits its application in the processing of HSI. Fortunately, the newly developed state space model Mamba exhibits excellent computing effectiveness and achieves Transformer-like modeling capabilities. Therefore, we propose a novel enhanced Mamba-based model called HSIRMamba that integrates residual operations into the Mamba architecture by combining the power of Mamba and the residual network to extract the spectral properties of HSI more effectively. It also includes a concurrent dedicated block for spatial analysis using a convolutional neural network. HSIRMamba extracts more accurate features with low computational power, making it more powerful than transformer-based models. HSIRMamba was tested on three majorly used HSI Datasets-Indian Pines, Pavia University, and Houston 2013. The experimental results demonstrate that the proposed method achieves competitive results compared to state-of-the-art methods.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105387"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138385","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CPFusion: A multi-focus image fusion method based on closed-loop regularization (Image and Vision Computing, Vol. 154, Article 105399)
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105399
Hao Zhai, Peng Chen, Nannan Luo, Qinyu Li, Ping Yu
The purpose of Multi-Focus Image Fusion (MFIF) is to extract the clear portions from multiple blurry images with complementary features to obtain a fully focused image, which is considered a prerequisite for other advanced visual tasks. With the development of deep learning technologies, significant breakthroughs have been achieved in multi-focus image fusion. However, most existing methods still face challenges related to detail information loss and misjudgment in boundary regions. In this paper, we propose a method called CPFusion for MFIF. On one hand, to fully preserve all detail information from the source images, we utilize an Invertible Neural Network (INN) for feature information transfer. The strong feature retention capability of INN allows for better preservation of the complementary features of the source images. On the other hand, to enhance the network’s performance in image fusion, we design a closed-loop structure to guide the fusion process. Specifically, during the training process, the forward operation of the network is used to learn the mapping from source images to fused images and decision maps, while the backward operation simulates the degradation of the focused image back to the source images. The backward operation serves as an additional constraint to guide the performance of the network’s forward operation. To achieve more natural fusion results, our network simultaneously generates an initial fused image and a decision map, utilizing the decision map to retain the details of the source images, while the initial fused image is employed to improve the visual effects of the decision map fusion method in boundary regions. Extensive experimental results demonstrate that the proposed method achieves excellent results in both subjective visual quality and objective metric assessments.
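The closed-loop idea described above, a forward pass that produces the fused image and decision map and a backward pass that reconstructs the source images as an extra constraint, can be sketched as a single training step. The network interface (`forward`/`inverse`), loss choices, and weights below are assumptions for illustration, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def closed_loop_step(inn, source_a, source_b, optimizer, lam=1.0):
    """One hypothetical training step with a forward fusion pass and a backward reconstruction constraint.

    `inn` is assumed to expose `forward(a, b) -> (fused, decision_map)` and
    `inverse(fused, decision_map) -> (a_rec, b_rec)`; this interface is illustrative.
    """
    optimizer.zero_grad()

    # Forward operation: source images -> initial fused image and decision map.
    fused, decision = inn.forward(source_a, source_b)

    # Decision-map fusion: keep the sharper source at each pixel according to the decision map.
    decision_fused = decision * source_a + (1.0 - decision) * source_b
    fusion_loss = F.l1_loss(fused, decision_fused)

    # Backward operation: simulate degrading the fused image back to the sources (closed-loop constraint).
    a_rec, b_rec = inn.inverse(fused, decision)
    cycle_loss = F.l1_loss(a_rec, source_a) + F.l1_loss(b_rec, source_b)

    loss = fusion_loss + lam * cycle_loss
    loss.backward()
    optimizer.step()
    return loss.item()
```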
{"title":"CPFusion: A multi-focus image fusion method based on closed-loop regularization","authors":"Hao Zhai, Peng Chen, Nannan Luo, Qinyu Li, Ping Yu","doi":"10.1016/j.imavis.2024.105399","DOIUrl":"10.1016/j.imavis.2024.105399","url":null,"abstract":"<div><div>The purpose of Multi-Focus Image Fusion (MFIF) is to extract the clear portions from multiple blurry images with complementary features to obtain a fully focused image, which is considered a prerequisite for other advanced visual tasks. With the development of deep learning technologies, significant breakthroughs have been achieved in multi-focus image fusion. However, most existing methods still face challenges related to detail information loss and misjudgment in boundary regions. In this paper, we propose a method called CPFusion for MFIF. On one hand, to fully preserve all detail information from the source images, we utilize an Invertible Neural Network (INN) for feature information transfer. The strong feature retention capability of INN allows for better preservation of the complementary features of the source images. On the other hand, to enhance the network’s performance in image fusion, we design a closed-loop structure to guide the fusion process. Specifically, during the training process, the forward operation of the network is used to learn the mapping from source images to fused images and decision maps, while the backward operation simulates the degradation of the focused image back to the source images. The backward operation serves as an additional constraint to guide the performance of the network’s forward operation. To achieve more natural fusion results, our network simultaneously generates an initial fused image and a decision map, utilizing the decision map to retain the details of the source images, while the initial fused image is employed to improve the visual effects of the decision map fusion method in boundary regions. Extensive experimental results demonstrate that the proposed method achieves excellent results in both subjective visual quality and objective metric assessments.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105399"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138490","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Active Transfer Learning framework for image classification based on Maximum Differentiation Classifier (Image and Vision Computing, Vol. 154, Article 105401)
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105401
Peng Zan, Yuerong Wang, Haohao Hu, Wanjun Zhong, Tianyu Han, Jingwei Yue
Deep learning has been extensively adopted across various domains, yielding satisfactory outcomes. However, it relies heavily on large labeled datasets, and collecting labels is expensive and time-consuming. To address this issue, we propose a novel framework called Active Transfer Learning (ATL), which combines Active Learning (AL) and Transfer Learning (TL). AL queries the unlabeled samples with high inconsistency using a Maximum Differentiation Classifier (MDC). The MDC leverages the discrepancy between the labeled data and their augmentations to select and annotate the informative samples. We also explore the potential of incorporating TL techniques. TL comprises pre-training and fine-tuning: the former learns knowledge from the origin-augmentation domain to pre-train the model, while the latter leverages the acquired knowledge for the downstream tasks. The results indicate that the combination of TL and AL exhibits complementary effects, and the proposed ATL framework outperforms state-of-the-art methods in terms of accuracy, precision, recall, and F1-score.
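An inconsistency-based query step, selecting the unlabeled samples whose predictions disagree most strongly between the original and an augmented view, might look roughly as follows. The symmetric KL disagreement measure and the two-view setup are assumptions used to illustrate the general idea, not the paper's exact MDC procedure.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def select_inconsistent_samples(model, unlabeled_images, augment, budget=64):
    """Rank unlabeled samples by prediction inconsistency between the original and an augmented view."""
    model.eval()
    p_orig = F.softmax(model(unlabeled_images), dim=1)
    p_aug = F.softmax(model(augment(unlabeled_images)), dim=1)

    # Symmetric KL divergence as a simple inconsistency score (illustrative choice).
    kl_fwd = (p_orig * (p_orig.clamp_min(1e-8).log() - p_aug.clamp_min(1e-8).log())).sum(dim=1)
    kl_bwd = (p_aug * (p_aug.clamp_min(1e-8).log() - p_orig.clamp_min(1e-8).log())).sum(dim=1)
    inconsistency = kl_fwd + kl_bwd

    # Indices of the `budget` most inconsistent samples, to be sent for annotation.
    return torch.topk(inconsistency, k=min(budget, inconsistency.numel())).indices
```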
{"title":"An Active Transfer Learning framework for image classification based on Maximum Differentiation Classifier","authors":"Peng Zan , Yuerong Wang , Haohao Hu , Wanjun Zhong , Tianyu Han , Jingwei Yue","doi":"10.1016/j.imavis.2024.105401","DOIUrl":"10.1016/j.imavis.2024.105401","url":null,"abstract":"<div><div>Deep learning has been extensively adopted across various domains, yielding satisfactory outcomes. However, it heavily relies on extensive labeled datasets, collecting data labels is expensive and time-consuming. We propose a novel framework called Active Transfer Learning (ATL) to address this issue. The ATL framework consists of Active Learning (AL) and Transfer Learning (TL). AL queries the unlabeled samples with high inconsistency by Maximum Differentiation Classifier (MDC). The MDC pulls the discrepancy between the labeled data and their augmentations to select and annotate the informative samples. Additionally, we also explore the potential of incorporating TL techniques. The TL comprises pre-training and fine-tuning. The former learns knowledge from the origin-augmentation domain to pre-train the model, while the latter leverages the acquired knowledge for the downstream tasks. The results indicate that the combination of TL and AL exhibits complementary effects, while the proposed ATL framework outperforms state-of-the-art methods in terms of accuracy, precision, recall, and F1-score.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105401"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138492","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Synthesizing multilevel abstraction ear sketches for enhanced biometric recognition (Image and Vision Computing, Vol. 154, Article 105424)
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2025.105424
David Freire-Obregón, Joao Neves, Žiga Emeršič, Blaž Meden, Modesto Castrillón-Santana, Hugo Proença
Sketch understanding poses unique challenges for general-purpose vision algorithms due to the sparse and semantically ambiguous nature of sketches. This paper introduces a novel approach to biometric recognition that leverages sketch-based representations of ears, a largely unexplored but promising area in biometric research. Specifically, we address the “sketch-2-image” matching problem by synthesizing ear sketches at multiple abstraction levels, achieved through a triplet-loss function adapted to integrate these levels. The abstraction level is determined by the number of strokes used, with fewer strokes reflecting higher abstraction. Our methodology combines sketch representations across abstraction levels to improve robustness and generalizability in matching. Extensive evaluations were conducted on four ear datasets (AMI, AWE, IITDII, and BIPLab) using various pre-trained neural network backbones, showing consistently superior performance over state-of-the-art methods. These results highlight the potential of ear sketch-based recognition, with cross-dataset tests confirming its adaptability to real-world conditions and suggesting applicability beyond ear biometrics.
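A rough sketch of how a triplet loss can be extended to integrate several sketch abstraction levels is shown below. Averaging the per-level triplet terms, and using the photo as anchor with the same ear's sketches as positives, is one plausible combination and an assumption on our part, not the paper's exact adaptation.

```python
import torch
import torch.nn as nn

triplet = nn.TripletMarginLoss(margin=0.2)

def multilevel_triplet_loss(embed, photo, sketches_by_level, negative_photo):
    """Average a triplet term over sketch abstraction levels.

    `sketches_by_level` holds one sketch tensor per abstraction level (fewer strokes = higher abstraction);
    the photo of the same ear is the anchor, its sketches are positives, another identity's photo is negative.
    """
    anchor = embed(photo)
    negative = embed(negative_photo)
    losses = [triplet(anchor, embed(sketch), negative) for sketch in sketches_by_level]
    return torch.stack(losses).mean()
```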
{"title":"Synthesizing multilevel abstraction ear sketches for enhanced biometric recognition","authors":"David Freire-Obregón , Joao Neves , Žiga Emeršič , Blaž Meden , Modesto Castrillón-Santana , Hugo Proença","doi":"10.1016/j.imavis.2025.105424","DOIUrl":"10.1016/j.imavis.2025.105424","url":null,"abstract":"<div><div>Sketch understanding poses unique challenges for general-purpose vision algorithms due to the sparse and semantically ambiguous nature of sketches. This paper introduces a novel approach to biometric recognition that leverages sketch-based representations of ears, a largely unexplored but promising area in biometric research. Specifically, we address the “<em>sketch-2-image</em>” matching problem by synthesizing ear sketches at multiple abstraction levels, achieved through a triplet-loss function adapted to integrate these levels. The abstraction level is determined by the number of strokes used, with fewer strokes reflecting higher abstraction. Our methodology combines sketch representations across abstraction levels to improve robustness and generalizability in matching. Extensive evaluations were conducted on four ear datasets (AMI, AWE, IITDII, and BIPLab) using various pre-trained neural network backbones, showing consistently superior performance over state-of-the-art methods. These results highlight the potential of ear sketch-based recognition, with cross-dataset tests confirming its adaptability to real-world conditions and suggesting applicability beyond ear biometrics.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105424"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143139147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scene flow estimation from point cloud based on grouped relative self-attention (Image and Vision Computing, Vol. 154, Article 105368)
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105368
Xuezhi Xiang, Xiankun Zhou, Yingxin Wei, Xi Wang, Yulong Qiao
3D scene flow estimation is a fundamental task in computer vision that aims to estimate the 3D motion of point clouds. Point clouds are unordered, and the point density within a local region of the same object is non-uniform. The features extracted by previous methods are not discriminative enough to obtain accurate scene flow. In addition, scene flow may be misestimated when two adjacent point cloud frames exhibit large motions. From our observations, the quality of point cloud feature extraction and the correlations between the two frames directly affect the accuracy of scene flow estimation. Therefore, we propose an improved self-attention structure named Grouped Relative Self-Attention (GRSA) that simultaneously utilizes a grouping operation and an offset-subtraction operation with normalization refinement to process point clouds. Specifically, we embed GRSA into feature extraction and into each stage of flow refinement to obtain lightweight yet efficient self-attention, which extracts discriminative point features and enhances point correlations so that the model is more adaptable to long-distance movements. Furthermore, we use a comprehensive loss function to avoid outliers and obtain robust results. We evaluate our method on the FlyingThings3D and KITTI datasets and achieve superior performance. In particular, our method outperforms all other methods on the FlyingThings3D dataset, where the Outliers metric improves by 16.9%; on the KITTI dataset, the Outliers metric improves by 6.7%.
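The combination of grouping and offset subtraction described above is reminiscent of vector-attention designs used on point clouds; the sketch below illustrates that general idea over pre-gathered k-nearest-neighbor groups. It is only a loose interpretation with hypothetical names, not the GRSA module itself.

```python
import torch
import torch.nn as nn

class GroupedRelativeAttention(nn.Module):
    """Simplified relative self-attention over local point groups (illustrative, not the paper's GRSA)."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.pos_enc = nn.Sequential(nn.Linear(3, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim))
        self.weight_mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim))

    def forward(self, feats, grouped_feats, xyz, grouped_xyz):
        # feats: (B, N, C) center features; grouped_feats: (B, N, K, C) features of K neighbors per center.
        # xyz: (B, N, 3) center coordinates; grouped_xyz: (B, N, K, 3) neighbor coordinates.
        q = self.to_q(feats).unsqueeze(2)                        # (B, N, 1, C)
        k = self.to_k(grouped_feats)                             # (B, N, K, C)
        v = self.to_v(grouped_feats)
        rel_pos = self.pos_enc(grouped_xyz - xyz.unsqueeze(2))   # encoding of relative offsets
        # Offset subtraction: attention weights come from q - k plus the relative position term.
        attn = torch.softmax(self.weight_mlp(q - k + rel_pos), dim=2)
        return (attn * (v + rel_pos)).sum(dim=2)                 # (B, N, C) aggregated features
```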
{"title":"Scene flow estimation from point cloud based on grouped relative self-attention","authors":"Xuezhi Xiang , Xiankun Zhou , Yingxin Wei , Xi Wang , Yulong Qiao","doi":"10.1016/j.imavis.2024.105368","DOIUrl":"10.1016/j.imavis.2024.105368","url":null,"abstract":"<div><div>3D scene flow estimation is a fundamental task in computer vision, which aims to estimate the 3D motions of point clouds. The point cloud is disordered, and the point density in the local area of the same object is non-uniform. The features extracted by previous methods are not discriminative enough to obtain accurate scene flow. Besides, scene flow may be misestimated when two adjacent frames of point clouds have large movements. From our observation, the quality of point cloud feature extraction and the correlations of two-frame point clouds directly affect the accuracy of scene flow estimation. Therefore, we propose an improved self-attention structure named Grouped Relative Self-Attention (GRSA) that simultaneously utilizes the grouping operation and offset subtraction operation with normalization refinement to process point clouds. Specifically, we embed the Grouped Relative Self-Attention (GRSA) into feature extraction and each stage of flow refinement to gain lightweight but efficient self-attention respectively, which can extract discriminative point features and enhance the point correlations to be more adaptable to long-distance movements. Furthermore, we use a comprehensive loss function to avoid outliers and obtain robust results. We evaluate our method on the FlyingThings3D and KITTI datasets and achieve superior performance. In particular, our method outperforms all other methods on the FlyingThings3D dataset, where Outliers achieves a 16.9% improvement. On the KITTI dataset, Outliers also achieves a 6.7% improvement.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105368"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information gap based knowledge distillation for occluded facial expression recognition (Image and Vision Computing, Vol. 154, Article 105365)
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105365
Yan Zhang, Zenghui Li, Duo Shen, Ke Wang, Jia Li, Chenxing Xia
Facial Expression Recognition (FER) under occlusion is a challenging task in computer vision because facial occlusions degrade the visual features of the data. Recently, researchers have introduced region attention techniques to address this problem, which make the model perceive occluded regions of the face and prioritize the most discriminative non-occluded regions. However, in real-world scenarios, facial images are affected by various factors, including hair, masks, and sunglasses, making it difficult to extract high-quality features from these occluded facial images. This inevitably limits the effectiveness of attention mechanisms. In this paper, we observe a correlation between the facial emotion features of the same image with and without occlusion. This correlation helps address the issue of facial occlusions. To this end, we propose an Information Gap based Knowledge Distillation (IGKD) framework to explore this latent relationship. Specifically, our approach feeds non-occluded and masked images into separate teacher and student networks. Because the masked images carry incomplete emotion information, an information gap exists between the teacher and student networks. During training, we aim to minimize this gap so that the student network can learn the relationship. To enhance the teacher’s guidance, we introduce a joint learning strategy in which the teacher conducts knowledge distillation on the student while the teacher itself is being trained. Additionally, we introduce two novel constraints, the knowledge learn loss and the knowledge feedback loss, to supervise and optimize both the teacher and student networks. The reported experimental results show that IGKD outperforms other algorithms on four benchmark datasets. Specifically, our IGKD achieves 87.57% on Occlusion-RAF-DB, 87.33% on Occlusion-FERPlus, 64.86% on Occlusion-AffectNet, and 73.25% on FED-RO, clearly demonstrating its effectiveness and robustness. Source code is released at: .
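One way to picture the teacher-student setup described above is the loss computation below, where the "information gap" is approximated by a divergence between the teacher's predictions on the non-occluded image and the student's predictions on the masked image. The specific losses, temperature, and weights are illustrative assumptions, not the paper's knowledge learn and knowledge feedback formulation.

```python
import torch
import torch.nn.functional as F

def igkd_style_loss(teacher, student, clean_img, masked_img, labels, temperature=4.0, alpha=0.5):
    """Hypothetical distillation loss: close the gap between the clean-image teacher and the masked-image student."""
    with torch.no_grad():
        t_logits = teacher(clean_img)                 # teacher sees the non-occluded image

    s_logits = student(masked_img)                    # student sees the masked/occluded image

    # Gap term: KL divergence between softened teacher and student distributions.
    gap = F.kl_div(
        F.log_softmax(s_logits / temperature, dim=1),
        F.softmax(t_logits / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2

    # Supervised term on the emotion labels.
    ce = F.cross_entropy(s_logits, labels)
    return alpha * gap + (1.0 - alpha) * ce
```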
{"title":"Information gap based knowledge distillation for occluded facial expression recognition","authors":"Yan Zhang , Zenghui Li , Duo Shen , Ke Wang , Jia Li , Chenxing Xia","doi":"10.1016/j.imavis.2024.105365","DOIUrl":"10.1016/j.imavis.2024.105365","url":null,"abstract":"<div><div>Facial Expression Recognition (FER) with occlusion presents a challenging task in computer vision because facial occlusions result in poor visual data features. Recently, the region attention technique has been introduced to address this problem by researchers, which make the model perceive occluded regions of the face and prioritize the most discriminative non-occluded regions. However, in real-world scenarios, facial images are influenced by various factors, including hair, masks and sunglasses, making it difficult to extract high-quality features from these occluded facial images. This inevitably limits the effectiveness of attention mechanisms. In this paper, we observe a correlation in facial emotion features from the same image, both with and without occlusion. This correlation contributes to addressing the issue of facial occlusions. To this end, we propose a Information Gap based Knowledge Distillation (IGKD) to explore the latent relationship. Specifically, our approach involves feeding non-occluded and masked images into separate teacher and student networks. Due to the incomplete emotion information in the masked images, there exists an information gap between the teacher and student networks. During training, we aim to minimize this gap to enable the student network to learn this relationship. To enhance the teacher’s guidance, we introduce a joint learning strategy where the teacher conducts knowledge distillation on the student during the training of the teacher. Additionally, we introduce two novel constraints, called knowledge learn and knowledge feedback loss, to supervise and optimize both the teacher and student networks. The reported experimental results show that IGKD outperforms other algorithms on four benchmark datasets. Specifically, our IGKD achieves 87.57% on Occlusion-RAF-DB, 87.33% on Occlusion-FERPlus, 64.86% on Occlusion-AffectNet, and 73.25% on FED-RO, clearly demonstrating its effectiveness and robustness. Source code is released at: .</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105365"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MD-Mamba: Feature extractor on 3D representation with multi-view depth (Image and Vision Computing, Vol. 154, Article 105396)
Pub Date: 2025-02-01 | DOI: 10.1016/j.imavis.2024.105396
Qihui Li, Zongtan Li, Lianfang Tian, Qiliang Du, Guoyu Lu
3D sensors provide rich depth information and are widely used across various fields, making 3D vision a hot topic of research. Point cloud data, as a crucial type of 3D data, offers precise three-dimensional coordinate information and is extensively utilized in numerous domains, especially in robotics. However, the unordered and unstructured nature of point cloud data poses a significant challenge for feature extraction. Traditional methods have relied on designing complex local feature extractors to achieve feature extraction, but these approaches have reached a performance bottleneck. To address these challenges, this paper introduces MD-Mamba, a novel network that enhances point cloud feature extraction by integrating multi-view depth maps. Our approach leverages multi-modal learning, treating the multi-view depth maps as an additional global feature modality. By fusing these with locally extracted point cloud features, we achieve richer and more distinctive representations. We utilize an innovative feature extraction strategy, performing real projections of point clouds and treating multi-view projections as video streams. This method captures dynamic features across viewpoints using a specially designed Mamba network. Additionally, the incorporation of the Siamese Cluster module optimizes feature spacing, improving class differentiation. Extensive evaluations on ModelNet40, ShapeNetPart, and ScanObjectNN datasets validate the effectiveness of MD-Mamba, setting a new benchmark for multi-modal feature extraction in point cloud analysis.
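Multi-view depth maps of the kind the abstract describes can be obtained by rotating a point cloud and projecting it orthographically onto an image plane; a minimal NumPy sketch of such a projection is given below. The resolution, view angles, and z-buffer convention are illustrative choices, not the paper's rendering setup.

```python
import numpy as np

def render_depth_views(points, num_views=6, resolution=64):
    """Orthographic depth maps of a point cloud from several azimuth angles (illustrative renderer)."""
    points = points - points.mean(axis=0)               # center the cloud
    points = points / np.abs(points).max()              # normalize to [-1, 1]
    depth_maps = []
    for angle in np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False):
        c, s = np.cos(angle), np.sin(angle)
        rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])   # rotate around the y axis
        p = points @ rot.T
        # Map x, y to pixel coordinates; keep the nearest z value per pixel (simple z-buffer).
        u = ((p[:, 0] + 1.0) * 0.5 * (resolution - 1)).astype(int)
        v = ((p[:, 1] + 1.0) * 0.5 * (resolution - 1)).astype(int)
        depth = np.full((resolution, resolution), np.inf)
        for ui, vi, zi in zip(u, v, p[:, 2]):
            depth[vi, ui] = min(depth[vi, ui], zi)
        depth[np.isinf(depth)] = 0.0                     # empty pixels set to background depth
        depth_maps.append(depth)
    return np.stack(depth_maps)                          # (num_views, resolution, resolution)
```

Stacking the views in angular order yields the video-like stream of projections mentioned in the abstract.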
{"title":"MD-Mamba: Feature extractor on 3D representation with multi-view depth","authors":"Qihui Li , Zongtan Li , Lianfang Tian , Qiliang Du , Guoyu Lu","doi":"10.1016/j.imavis.2024.105396","DOIUrl":"10.1016/j.imavis.2024.105396","url":null,"abstract":"<div><div>3D sensors provide rich depth information and are widely used across various fields, making 3D vision a hot topic of research. Point cloud data, as a crucial type of 3D data, offers precise three-dimensional coordinate information and is extensively utilized in numerous domains, especially in robotics. However, the unordered and unstructured nature of point cloud data poses a significant challenge for feature extraction. Traditional methods have relied on designing complex local feature extractors to achieve feature extraction, but these approaches have reached a performance bottleneck. To address these challenges, this paper introduces MD-Mamba, a novel network that enhances point cloud feature extraction by integrating multi-view depth maps. Our approach leverages multi-modal learning, treating the multi-view depth maps as an additional global feature modality. By fusing these with locally extracted point cloud features, we achieve richer and more distinctive representations. We utilize an innovative feature extraction strategy, performing real projections of point clouds and treating multi-view projections as video streams. This method captures dynamic features across viewpoints using a specially designed Mamba network. Additionally, the incorporation of the Siamese Cluster module optimizes feature spacing, improving class differentiation. Extensive evaluations on ModelNet40, ShapeNetPart, and ScanObjectNN datasets validate the effectiveness of MD-Mamba, setting a new benchmark for multi-modal feature extraction in point cloud analysis.</div></div>","PeriodicalId":50374,"journal":{"name":"Image and Vision Computing","volume":"154 ","pages":"Article 105396"},"PeriodicalIF":4.2,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143138386","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}