
Latest Publications: Journal of Visual Communication and Image Representation

Scale-invariant mask-guided vehicle keypoint detection from a monocular image
IF 2.6 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-02-01 | DOI: 10.1016/j.jvcir.2025.104397
Sunpil Kim, Gang-Joon Yoon, Jinjoo Song, Sang Min Yoon
Intelligent vehicle detection and localization are important for autonomous driving systems, particularly for traffic scene understanding. Robust vision-based vehicle localization directly affects the accuracy of self-driving systems but remains challenging to implement reliably due to differences in vehicle sizes, illumination changes, background clutter, and partial occlusion. Bottom-up vehicle detection based on keypoint localization efficiently provides semantic information under partial occlusion and complex poses. However, bottom-up approaches still struggle with robust heatmap estimation for vehicles with scale variations and background ambiguities. This paper addresses the problem of predicting multiple vehicle locations by learning semantic vehicle keypoints with a multi-resolution feature extractor, an offset regression branch, and a heatmap regression branch. The proposed pipeline estimates vehicle keypoints by suppressing similar background features with a mask-guided heatmap regression branch and by emphasizing scale-adaptive heatmap features in the network. Quantitative and qualitative analyses, including ablation tests, verify that the proposed method is universally applicable, unlike previous approaches.
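To make the mask-guided idea concrete, here is a minimal PyTorch sketch (not the authors’ code) of a heatmap head in which a predicted foreground mask gates the keypoint heatmaps so that background responses are damped; the channel counts and the number of keypoints are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MaskGuidedHeatmapHead(nn.Module):
    """Illustrative mask-guided heatmap regression branch (hypothetical sizes)."""
    def __init__(self, in_channels=256, num_keypoints=24):
        super().__init__()
        # Foreground/background branch producing a single sigmoid mask.
        self.mask_branch = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, 1), nn.Sigmoid())
        # Keypoint branch producing one heatmap per keypoint.
        self.heatmap_branch = nn.Sequential(
            nn.Conv2d(in_channels, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, num_keypoints, 1))

    def forward(self, feats):
        mask = self.mask_branch(feats)         # (B, 1, H, W)
        heatmaps = self.heatmap_branch(feats)  # (B, K, H, W)
        # Gating with the mask suppresses background-like activations.
        return heatmaps * mask, mask

feats = torch.randn(2, 256, 64, 64)
heatmaps, mask = MaskGuidedHeatmapHead()(feats)
print(heatmaps.shape, mask.shape)  # (2, 24, 64, 64) and (2, 1, 64, 64)
```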
{"title":"Scale-invariant mask-guided vehicle keypoint detection from a monocular image","authors":"Sunpil Kim ,&nbsp;Gang-Joon Yoon ,&nbsp;Jinjoo Song ,&nbsp;Sang Min Yoon","doi":"10.1016/j.jvcir.2025.104397","DOIUrl":"10.1016/j.jvcir.2025.104397","url":null,"abstract":"<div><div>Intelligent vehicle detection and localization are important for autonomous driving systems, particularly traffic scene understanding. Robust vision-based vehicle localization directly affects the accuracy of self-driving systems but remains challenging to implement reliably due to differences in vehicle sizes, illumination changes, background clutter, and partial occlusion. Bottom-up-based vehicle detection using vehicle keypoint localization efficiently provides semantic information for partial occlusion and complex poses. However, bottom-up-based approaches still struggle to handle robust heatmap estimation from vehicles with scale variations and background ambiguities. This paper addresses the problem of predicting multiple vehicle locations by learning semantic vehicle keypoints using a multi-resolution feature extractor, an offset regression branch, and a heatmap regression branch network. The proposed pipeline estimates the vehicle keypoint by effectively eliminating similar background features using a mask-guided heatmap regression branch and emphasizing scale-adaptive heatmap features in the network. Quantitative and qualitative analyses, including ablation tests, verify that the proposed method is universally applicable, unlike previous approaches.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104397"},"PeriodicalIF":2.6,"publicationDate":"2025-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Depth completion based on multi-scale spatial propagation and tensor decomposition
IF 2.6 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-01-31 | DOI: 10.1016/j.jvcir.2025.104394
Mingming Sun, Tao Li, Qing Liao, Minghui Zhou
Depth completion aims to generate dense depth maps from sparse depth maps. Existing approaches typically apply a spatial propagation module to iteratively refine depth values based on a single-scale initial depth map. In contrast, to overcome the limitations imposed by the convolution kernel size on propagation range, we propose a multi-scale spatial propagation module (MSSPM) that utilizes multi-scale features from the decoder to guide spatial propagation. To further enhance the model’s performance, we introduce a bottleneck feature enhancement module (BFEM) based on tensor decomposition, which can reduce feature redundancy and perform denoising through low-rank feature factorization. We also introduce a cross-layer feature fusion module (Fusion) to efficiently combine low-level encoder features and high-level decoder features. Extensive experiments on the indoor NYUv2 dataset and the outdoor KITTI dataset demonstrate the effectiveness of the proposed method.
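As a rough illustration of affinity-guided spatial propagation (the building block that the MSSPM extends to multiple scales), the PyTorch sketch below updates each depth pixel as a weighted average of its 3x3 neighbours, with weights predicted from guidance features; the module name and channel sizes are assumptions, not the paper’s implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialPropagationStep(nn.Module):
    """One illustrative propagation iteration over a depth map."""
    def __init__(self, guide_channels=64, kernel=3):
        super().__init__()
        self.kernel = kernel
        self.affinity = nn.Conv2d(guide_channels, kernel * kernel, 3, padding=1)

    def forward(self, depth, guide):
        b, _, h, w = depth.shape
        # Per-pixel affinities over the k x k neighbourhood, normalised to sum to 1.
        aff = F.softmax(self.affinity(guide), dim=1)             # (B, k*k, H, W)
        # Gather each pixel's k x k neighbourhood of current depth values.
        neigh = F.unfold(depth, self.kernel, padding=self.kernel // 2)
        neigh = neigh.view(b, self.kernel * self.kernel, h, w)   # (B, k*k, H, W)
        # Weighted average = one spatial-propagation step.
        return (aff * neigh).sum(dim=1, keepdim=True)

depth = torch.rand(1, 1, 32, 32)      # coarse/initial depth estimate
guide = torch.randn(1, 64, 32, 32)    # decoder guidance features
print(SpatialPropagationStep()(depth, guide).shape)  # (1, 1, 32, 32)
```

Running this step on decoder features at several resolutions, rather than a single scale, is the multi-scale extension the abstract describes.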
{"title":"Depth completion based on multi-scale spatial propagation and tensor decomposition","authors":"Mingming Sun,&nbsp;Tao Li,&nbsp;Qing Liao,&nbsp;Minghui Zhou","doi":"10.1016/j.jvcir.2025.104394","DOIUrl":"10.1016/j.jvcir.2025.104394","url":null,"abstract":"<div><div>Depth completion aims to generate dense depth maps from sparse depth maps. Existing approaches typically apply a spatial propagation module to iteratively refine depth values based on a single-scale initial depth map. In contrast, to overcome the limitations imposed by the convolution kernel size on propagation range, we propose a multi-scale spatial propagation module (MSSPM) that utilizes multi-scale features from the decoder to guide spatial propagation. To further enhance the model’s performance, we introduce a bottleneck feature enhancement module (BFEM) based on tensor decomposition, which can reduce feature redundancy and perform denoising through low-rank feature factorization. We also introduce a cross-layer feature fusion module (Fusion) to efficiently combine low-level encoder features and high-level decoder features. Extensive experiments on the indoor NYUv2 dataset and the outdoor KITTI dataset demonstrate the effectiveness of the proposed method.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104394"},"PeriodicalIF":2.6,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143339498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Document forgery detection based on spatial-frequency and multi-scale feature network
IF 2.6 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-01-31 | DOI: 10.1016/j.jvcir.2025.104393
Li Li, Yu Bai, Shanqing Zhang, Mahmoud Emam
Passive image forgery detection is one of the main tasks in digital image forensics. Although forged regions in natural images can be detected and localized with high accuracy by exploiting their diversity and rich detail features, detecting tampered regions in textual document images (photographs) still presents many challenges, including poor detection results and difficulty in identifying the applied forgery type. In this paper, we propose a robust multi-category tampering detection algorithm based on the spatial-frequency (SF) domain and a multi-scale feature fusion network. First, we employ a frequency-domain transform and an SF feature fusion strategy to strengthen the network’s ability to discriminate tampered document textures. Second, we combine HRNet, an attention mechanism, and a multi-supervision module to capture document image features at different scales and improve forgery detection results. Furthermore, we design a multi-category detection head module to detect multiple types of forgeries, which improves the generalization ability of the proposed algorithm. Extensive experiments have been conducted on a dataset constructed from the public StaVer and SCUT-EnsExam datasets. The experimental results show that the proposed algorithm improves the F1 score of document image tampering detection by nearly 5.73%, and it not only localizes the tampering but also accurately identifies the applied tampering type.
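The sketch below (illustrative only; layer sizes and the log-magnitude choice are assumptions) shows one straightforward way to build a spatial-frequency fusion stem: one convolutional stream encodes the document image itself, a second encodes its 2-D FFT magnitude spectrum, and the two are concatenated channel-wise.

```python
import torch
import torch.nn as nn

class SFFusionStem(nn.Module):
    """Hypothetical spatial-frequency (SF) feature fusion stem."""
    def __init__(self, out_channels=64):
        super().__init__()
        self.spatial = nn.Conv2d(3, out_channels // 2, 3, padding=1)
        self.frequency = nn.Conv2d(3, out_channels // 2, 3, padding=1)

    def forward(self, img):
        # Frequency branch: log-magnitude of the per-channel 2-D FFT.
        mag = torch.log1p(torch.fft.fft2(img, norm="ortho").abs())
        # Concatenate spatial and frequency features along the channel axis.
        return torch.cat([self.spatial(img), self.frequency(mag)], dim=1)

doc = torch.rand(2, 3, 128, 128)
print(SFFusionStem()(doc).shape)  # torch.Size([2, 64, 128, 128])
```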
{"title":"Document forgery detection based on spatial-frequency and multi-scale feature network","authors":"Li Li ,&nbsp;Yu Bai ,&nbsp;Shanqing Zhang ,&nbsp;Mahmoud Emam","doi":"10.1016/j.jvcir.2025.104393","DOIUrl":"10.1016/j.jvcir.2025.104393","url":null,"abstract":"<div><div>Passive image forgery detection is one of the main tasks for digital image forensics. Although it is easy to detect and localize forged regions with high accuracies from tampered images through utilizing the diversity and rich detail features of natural images, detecting tampered regions from a tampered textual document image (photographs) still presents many challenges. These challenges include poor detection results and difficulty of identifying the applied forgery type. In this paper, we propose a robust multi-category tampering detection algorithm based on spatial-frequency(SF) domain and multi-scale feature fusion network. First, we employ frequency domain transform and SF feature fusion strategy to strengthen the network’s ability to discriminate tampered document textures. Secondly, we combine HRNet, attention mechanism and a multi-supervision module to capture the features of the document images at different scales and improve forgery detection results. Furthermore, we design a multi-category detection head module to detect multiple types of forgeries that can improve the generalization ability of the proposed algorithm. Extensive experiments on a constructed dataset based on the public StaVer and SCUT-EnsExam datasets have been conducted. The experimental results show that the proposed algorithm improves F1 score of document images tampering detection by nearly 5.73%, and it’s not only able to localize the tampering location, but also accurately identify the applied tampering type.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104393"},"PeriodicalIF":2.6,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174747","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
GMNet: Low overlap point cloud registration based on graph matching
IF 2.6 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-01-31 | DOI: 10.1016/j.jvcir.2025.104400
Lijia Cao, Xueru Wang, Chuandong Guo
Point cloud registration quality relies heavily on accurate point-to-point correspondences. Although many methods have made significant progress in this area, low-overlap point clouds remain challenging because dense point topological structures are often neglected. To address this, we propose the graph matching network (GMNet), which constructs graph features from the dense point features obtained in the first point cloud sampling stage and the geometrically encoded superpoint features. By using intra-graph and cross-graph convolutions in local patches, GMNet extracts deeper global information for robust correspondences. GMNet significantly improves the inlier ratio for low-overlap point cloud registration, demonstrating high accuracy and robustness. Experimental results on public datasets covering objects, indoor scenes, and outdoor scenes validate the effectiveness of GMNet. Furthermore, on the low-overlap 3DLoMatch dataset, our registration recall rate remains stable at 72.6%, with the inlier ratio improving by up to 9.9%.
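A minimal sketch of intra-/cross-graph message passing over superpoint features is given below; it uses standard multi-head attention as the graph operator and assumed feature dimensions, so it should be read as an illustration of the idea rather than GMNet itself.

```python
import torch
import torch.nn as nn

class GraphMatchingLayer(nn.Module):
    """Illustrative intra-graph + cross-graph feature mixing for two point clouds."""
    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.intra = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats_src, feats_tgt):
        # Intra-graph: each cloud attends to its own superpoints.
        src, _ = self.intra(feats_src, feats_src, feats_src)
        tgt, _ = self.intra(feats_tgt, feats_tgt, feats_tgt)
        # Cross-graph: each cloud attends to the other cloud's superpoints.
        src_x, _ = self.cross(src, tgt, tgt)
        tgt_x, _ = self.cross(tgt, src, src)
        return src + src_x, tgt + tgt_x

src = torch.randn(1, 256, 128)   # (batch, superpoints, feature dim)
tgt = torch.randn(1, 300, 128)
out_src, out_tgt = GraphMatchingLayer()(src, tgt)
print(out_src.shape, out_tgt.shape)
```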
{"title":"GMNet: Low overlap point cloud registration based on graph matching","authors":"Lijia Cao ,&nbsp;Xueru Wang ,&nbsp;Chuandong Guo","doi":"10.1016/j.jvcir.2025.104400","DOIUrl":"10.1016/j.jvcir.2025.104400","url":null,"abstract":"<div><div>Point cloud registration quality relies heavily on accurate point-to-point correspondences. Although significant progress has been made in this area by most methods, low-overlap point clouds pose challenges as dense point topological structures are often neglected. To address this, we propose the graph matching network (GMNet), which constructs graph features based on the dense point features obtained from the first point cloud sampling and the superpoints’ features encoded with geometry. By using intra-graph and cross-graph convolutions in local patches, GMNet extracts deeper global information for robust correspondences. The GMNet network significantly improves the inlier ratio for low-overlap point cloud registration, demonstrating high accuracy and robustness. Experimental results on public datasets for objects, indoor, and outdoor scenes validate the effectiveness of GMNet. Furthermore, on the low-overlap 3DLoMatch dataset, our registration recall rate remains stable at 72.6%, with the inlier ratio improving by up to 9.9%.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104400"},"PeriodicalIF":2.6,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174748","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
S2Mix: Style and Semantic Mix for cross-domain 3D model retrieval
IF 2.6 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-01-31 | DOI: 10.1016/j.jvcir.2025.104390
Xinwei Fu, Dan Song, Yue Yang, Yuyi Zhang, Bo Wang
With the development of deep neural networks and image processing technology, cross-domain 3D model retrieval algorithms based on 2D images have attracted much attention, as they exploit visual information from labeled 2D images to assist in processing unlabeled 3D models. Existing unsupervised cross-domain 3D model retrieval algorithms use domain adaptation to narrow the modality gap between 2D images and 3D models. However, these methods overlook the domain-specific style information of 2D images and 3D models, which is crucial for reducing the domain distribution discrepancy. To address this issue, this paper proposes a Style and Semantic Mix (S2Mix) network for cross-domain 3D model retrieval, which fuses style information and semantically consistent features across domains. Specifically, we design a style mix module that operates on shallow feature maps, which are closer to the input data, learning 2D image and 3D model features with an intermediate-domain mixed style to narrow the domain distribution discrepancy. In addition, to improve the semantic prediction accuracy for unlabeled samples, a semantic mix module operates on deep features, fusing features from reliable unlabeled 3D model and 2D image samples with semantic consistency. Our experiments demonstrate the effectiveness of the proposed S2Mix on two commonly used cross-domain 3D model retrieval datasets, MI3DOR-1 and MI3DOR-2.
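For intuition, the snippet below shows a MixStyle-like "style mix" on shallow feature maps: the per-channel mean and standard deviation (the "style") of one domain’s features are blended with those of the other domain, producing features with an intermediate-domain style. The blending weight and tensor shapes are assumptions made for illustration.

```python
import torch

def style_mix(feat_a, feat_b, alpha=0.5, eps=1e-6):
    """Re-style feat_a with channel statistics blended from feat_a and feat_b."""
    mu_a = feat_a.mean(dim=(2, 3), keepdim=True)
    sig_a = feat_a.std(dim=(2, 3), keepdim=True) + eps
    mu_b = feat_b.mean(dim=(2, 3), keepdim=True)
    sig_b = feat_b.std(dim=(2, 3), keepdim=True) + eps
    mixed_mu = alpha * mu_a + (1 - alpha) * mu_b
    mixed_sig = alpha * sig_a + (1 - alpha) * sig_b
    # Normalise A, then apply the mixed (intermediate-domain) style.
    return (feat_a - mu_a) / sig_a * mixed_sig + mixed_mu

img_feat = torch.randn(4, 64, 56, 56)    # shallow features from the 2D-image branch
model_feat = torch.randn(4, 64, 56, 56)  # shallow features from the 3D-model branch
print(style_mix(img_feat, model_feat).shape)  # torch.Size([4, 64, 56, 56])
```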
{"title":"S2Mix: Style and Semantic Mix for cross-domain 3D model retrieval","authors":"Xinwei Fu ,&nbsp;Dan Song ,&nbsp;Yue Yang ,&nbsp;Yuyi Zhang ,&nbsp;Bo Wang","doi":"10.1016/j.jvcir.2025.104390","DOIUrl":"10.1016/j.jvcir.2025.104390","url":null,"abstract":"<div><div>With the development of deep neural networks and image processing technology, cross-domain 3D model retrieval algorithms based on 2D images have attracted much attention, utilizing visual information from labeled 2D images to assist in processing unlabeled 3D models. Existing unsupervised cross-domain 3D model retrieval algorithm use domain adaptation to narrow the modality gap between 2D images and 3D models. However, these methods overlook specific style visual information between different domains of 2D images and 3D models, which is crucial for reducing the domain distribution discrepancy. To address this issue, this paper proposes a Style and Semantic Mix (S<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>Mix) network for cross-domain 3D model retrieval, which fuses style visual information and semantic consistency features between different domains. Specifically, we design a style mix module to perform on shallow feature maps that are closer to the input data, learning 2D image and 3D model features with intermediate domain mixed style to narrow the domain distribution discrepancy. In addition, in order to improve the semantic prediction accuracy of unlabeled samples, a semantic mix module is also designed to operate on deep features, fusing features from reliable unlabeled 3D model and 2D image samples with semantic consistency. Our experiments demonstrate the effectiveness of the proposed S<span><math><msup><mrow></mrow><mrow><mn>2</mn></mrow></msup></math></span>Mix on two commonly-used cross-domain 3D model retrieval datasets MI3DOR-1 and MI3DOR-2.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104390"},"PeriodicalIF":2.6,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
BAO: Background-aware activation map optimization for weakly supervised semantic segmentation without background threshold
IF 2.6 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-01-31 | DOI: 10.1016/j.jvcir.2025.104404
Izumi Fujimori, Masaki Oono, Masami Shishibori
Weakly supervised semantic segmentation (WSSS), which employs only image-level labels, has attracted significant attention due to its low annotation cost. In WSSS, pseudo-labels are derived from class activation maps (CAMs) generated by convolutional neural networks or vision transformers. However, during the generation of pseudo-labels from CAMs, a background threshold is typically used to define background regions. In WSSS scenarios, pixel-level labels are typically unavailable, which makes it challenging to determine an optimal background threshold. This study proposes a method for generating pseudo-labels without a background threshold. The proposed method generates CAMs that activate background regions from CAMs initially based on foreground objects. These background-activated CAMs are then used to generate pseudo-labels. The pseudo-labels are then used to train the model to distinguish between the foreground and background regions in the newly generated activation maps. During inference, the background activation map obtained via training replaces the background threshold. To validate the effectiveness of the proposed method, we conducted experiments using the PASCAL VOC 2012 and MS COCO 2014 datasets. The results demonstrate that the pseudo-labels generated using the proposed method significantly outperform those generated using conventional background thresholds. The code is available at: https://github.com/i2mond/BAO.
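The contrast between the conventional rule and the thresholdless rule can be written in a few lines; the snippet below is a schematic comparison with made-up tensors, in which a learned background activation map simply competes with the foreground CAMs in an argmax instead of a fixed threshold.

```python
import torch

def pseudo_labels_with_threshold(cams, bg_threshold=0.3):
    """Conventional rule: a fixed score acts as the background class."""
    # cams: (C, H, W) foreground class activation maps in [0, 1]
    bg = torch.full_like(cams[:1], bg_threshold)
    return torch.cat([bg, cams], dim=0).argmax(dim=0)   # label 0 = background

def pseudo_labels_with_bg_map(cams, bg_map):
    """Thresholdless rule: a learned background activation map competes instead."""
    # bg_map: (1, H, W) background activation map produced by the model
    return torch.cat([bg_map, cams], dim=0).argmax(dim=0)

cams = torch.rand(20, 64, 64)    # e.g. the 20 PASCAL VOC foreground classes
bg_map = torch.rand(1, 64, 64)   # placeholder for the learned background map
print(pseudo_labels_with_threshold(cams).shape,
      pseudo_labels_with_bg_map(cams, bg_map).shape)
```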
{"title":"BAO: Background-aware activation map optimization for weakly supervised semantic segmentation without background threshold","authors":"Izumi Fujimori ,&nbsp;Masaki Oono ,&nbsp;Masami Shishibori","doi":"10.1016/j.jvcir.2025.104404","DOIUrl":"10.1016/j.jvcir.2025.104404","url":null,"abstract":"<div><div>Weakly supervised semantic segmentation (WSSS), which employs only image-level labels, has attracted significant attention due to its low annotation cost. In WSSS, pseudo-labels are derived from class activation maps (CAMs) generated by convolutional neural networks or vision transformers. However, during the generation of pseudo-labels from CAMs, a background threshold is typically used to define background regions. In WSSS scenarios, pixel-level labels are typically unavailable, which makes it challenging to determine an optimal background threshold. This study proposes a method for generating pseudo-labels without a background threshold. The proposed method generates CAMs that activate background regions from CAMs initially based on foreground objects. These background-activated CAMs are then used to generate pseudo-labels. The pseudo-labels are then used to train the model to distinguish between the foreground and background regions in the newly generated activation maps. During inference, the background activation map obtained via training replaces the background threshold. To validate the effectiveness of the proposed method, we conducted experiments using the PASCAL VOC 2012 and MS COCO 2014 datasets. The results demonstrate that the pseudo-labels generated using the proposed method significantly outperform those generated using conventional background thresholds. The code is available at: <span><span>https://github.com/i2mond/BAO</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104404"},"PeriodicalIF":2.6,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143339471","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Generalization enhancement strategy based on ensemble learning for open domain image manipulation detection
IF 2.6 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-01-31 | DOI: 10.1016/j.jvcir.2025.104396
H. Cheng, L. Niu, Z. Zhang, L. Ye
Image manipulation detection methods play a pivotal role in safeguarding digital image authenticity and integrity by identifying and locating manipulations. Existing methods suffer from limited generalization because existing training datasets can hardly cover the various manipulation modalities found in the open domain. In this paper, we propose a Generalization Enhancement Strategy (GES) based on data augmentation and ensemble learning. Specifically, the GES consists of two modules: an Additive Image Manipulation Data Augmentation (AIM-DA) module and a Mask Confidence Estimate based Ensemble Learning (MCE-EL) module. To take full advantage of the limited number of real and manipulated images, the AIM-DA module enriches data diversity by accumulating manipulation traces generated with different kinds of manipulation methods. The MCE-EL module improves detection accuracy in the open domain by computing and integrating confidence estimates for the output masks of different image manipulation detection models. The proposed GES can be adapted to existing popular image manipulation detection methods. Extensive subjective and objective experimental results show that the detection F1 score can be improved by up to 34.9% and the localization F1 score by up to 11.7%, which validates the effectiveness of our method.
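As a schematic of confidence-weighted mask ensembling (the confidence measure used here, mean distance from 0.5, is an assumption made for illustration and not necessarily the MCE-EL formulation), each detector’s output mask is scored and the masks are fused by a softmax-weighted average:

```python
import torch

def mask_confidence(mask):
    """Score a predicted mask: higher when probabilities are far from 0.5."""
    return (mask - 0.5).abs().mean()

def ensemble_masks(masks):
    """Fuse (H, W) probability masks from several detectors by confidence."""
    weights = torch.softmax(torch.stack([mask_confidence(m) for m in masks]), dim=0)
    return sum(w * m for w, m in zip(weights, masks))

masks = [torch.rand(256, 256) for _ in range(3)]   # outputs of three detectors
fused = ensemble_masks(masks)
print(fused.shape)  # torch.Size([256, 256])
```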
{"title":"Generalization enhancement strategy based on ensemble learning for open domain image manipulation detection","authors":"H. Cheng ,&nbsp;L. Niu ,&nbsp;Z. Zhang ,&nbsp;L. Ye","doi":"10.1016/j.jvcir.2025.104396","DOIUrl":"10.1016/j.jvcir.2025.104396","url":null,"abstract":"<div><div>Image manipulation detection methods play a pivotal role in safeguarding digital image authenticity and integrity by identifying and locating manipulations. Existing image manipulation detection methods suffer from limited generalization, as it is difficult for existing training datasets to cover different manipulation modalities in the open domain. In this paper, we propose a Generalization Enhancement Strategy (GES) based on data augmentation and ensemble learning. Specifically, the GES consists of two modules, namely an Additive Image Manipulation Data Augmentation(AIM-DA) module and a Mask Confidence Estimate based Ensemble Learning (MCE-EL) module. In order to take full advantage of the limited number of real and manipulated images, the AIM-DA module enriches the diversity of the data by generating manipulated traces accumulatively with different kinds of manipulation methods. The MCE-EL module is designed to improve the accuracy of detection in the open domain, which is based on computing and integrating the evaluation of the confidence level of the output masks from different image manipulation detection models. Our proposed GES can be adapted to existing popular image manipulation detection methods. Extensive subjective and objective experimental results show that the detection F1 score can be improved by up to 34.9%, and the localization F1 score can be improved by up to 11.7%, which validates the effectiveness of our method.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104396"},"PeriodicalIF":2.6,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143339496","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Masked facial expression recognition based on temporal overlap module and action unit graph convolutional network
IF 2.6 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-01-31 | DOI: 10.1016/j.jvcir.2025.104398
Zheyuan Zhang, Bingtong Liu, Ju Zhou, Hanpu Wang, Xinyu Liu, Bing Lin, Tong Chen
Facial expressions may not truly reflect people’s genuine emotions, since people often use masked facial expressions (MFEs) to hide them. Recognizing MFEs can help reveal these hidden emotions, which has important practical value in mental health, security, and education. However, MFEs are very complex and under-researched, and existing facial expression recognition algorithms cannot reliably recognize MFEs and the hidden genuine emotions at the same time. To obtain better representations of MFEs, we first adopt the transformer model as the basic framework and design a temporal overlap module to enlarge the temporal receptive field of the tokens, strengthening the capture of muscle movement patterns in MFE sequences. Second, we design a graph convolutional network (GCN) with action unit (AU) intensities as node features and a 3D learnable adjacency matrix based on AU activation states to reduce the irrelevant identity information introduced by the image input. Finally, we propose a novel end-to-end dual-stream network combining the image stream (transformer) with the AU stream (GCN) for automatic recognition of MFEs. Compared with other methods, our approach achieves state-of-the-art results on the core tasks of the Masked Facial Expression Database (MFED).
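A stripped-down sketch of a graph convolution over action units is shown below; it uses a 2D learnable adjacency (the paper describes a 3D adjacency tied to AU activation states) and assumed dimensions, so it only illustrates the general mechanism of the AU stream.

```python
import torch
import torch.nn as nn

class AUGraphConv(nn.Module):
    """Illustrative graph convolution with AU intensities as node features."""
    def __init__(self, num_aus=17, in_dim=1, out_dim=16):
        super().__init__()
        self.adj = nn.Parameter(torch.randn(num_aus, num_aus))  # learnable adjacency
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, au_feats):
        # au_feats: (B, num_aus, in_dim), e.g. per-frame AU intensity values
        adj = torch.softmax(self.adj, dim=-1)         # row-normalised edge weights
        return torch.relu(adj @ self.proj(au_feats))  # aggregate neighbour features

au_intensities = torch.rand(8, 17, 1)   # batch of 8 frames, 17 action units
print(AUGraphConv()(au_intensities).shape)  # torch.Size([8, 17, 16])
```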
{"title":"Masked facial expression recognition based on temporal overlap module and action unit graph convolutional network","authors":"Zheyuan Zhang ,&nbsp;Bingtong Liu ,&nbsp;Ju Zhou ,&nbsp;Hanpu Wang ,&nbsp;Xinyu Liu ,&nbsp;Bing Lin ,&nbsp;Tong Chen","doi":"10.1016/j.jvcir.2025.104398","DOIUrl":"10.1016/j.jvcir.2025.104398","url":null,"abstract":"<div><div>Facial expressions may not truly reflect genuine emotions of people . People often use masked facial expressions (MFEs) to hide their genuine emotions. The recognition of MFEs can help reveal these emotions, which has very important practical value in the field of mental health, security and education. However, MFE is very complex and lacks of research, and the existing facial expression recognition algorithms cannot well recognize the MFEs and the hidden genuine emotions at the same time. To obtain better representations of MFE, we first use the transformer model as the basic framework and design the temporal overlap module to enhance temporal receptive field of the tokens, so as to strengthen the capture of muscle movement patterns in MFE sequences. Secondly, we design a graph convolutional network (GCN) with action unit (AU) intensity as node features and the 3D learnable adjacency matrix based on AU activation state to reduce the irrelevant identity information introduced by image input. Finally, we propose a novel end-to-end dual-stream network combining the image stream (transformer) with the AU stream (GCN) for automatic recognition of MFEs. Compared with other methods, our approach has achieved state-of-the-art results on the core tasks of Masked Facial Expression Database (MFED).</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104398"},"PeriodicalIF":2.6,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174324","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Transformer-guided exposure-aware fusion for single-shot HDR imaging
IF 2.6 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-01-31 | DOI: 10.1016/j.jvcir.2025.104401
An Gia Vien, Chul Lee
Spatially varying exposure (SVE) imaging, also known as single-shot high dynamic range (HDR) imaging, is an effective and practical approach for synthesizing HDR images without the need to handle motion. In this work, we propose a novel single-shot HDR imaging algorithm using transformer-guided exposure-aware fusion to better exploit inter-channel correlations and to capture global and local dependencies by extracting valid information from an SVE image. Specifically, we first extract the initial feature maps by estimating dynamic local filters using local neighboring pixels across color channels. Then, we develop a transformer-based feature extractor that captures both global and local dependencies to extract well-exposed information even in poorly exposed regions. Finally, the proposed algorithm combines only the valid features in the multi-exposed feature maps by learning local and channel weights. Experimental results on both synthetic and captured real datasets demonstrate that the proposed algorithm significantly outperforms state-of-the-art algorithms both quantitatively and qualitatively.
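The "dynamic local filters" step can be pictured as predicting a small filter at every pixel and applying it to that pixel’s neighbourhood; the sketch below does exactly that with an assumed kernel size and channel count, and is not the paper’s network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicLocalFilter(nn.Module):
    """Illustrative per-pixel dynamic filtering of an SVE input."""
    def __init__(self, in_channels=3, kernel=3):
        super().__init__()
        self.kernel = kernel
        self.predict = nn.Conv2d(in_channels, kernel * kernel, 3, padding=1)

    def forward(self, x):
        b, c, h, w = x.shape
        k2 = self.kernel * self.kernel
        filters = F.softmax(self.predict(x), dim=1)              # (B, k*k, H, W)
        patches = F.unfold(x, self.kernel, padding=self.kernel // 2)
        patches = patches.view(b, c, k2, h, w)                   # (B, C, k*k, H, W)
        # Apply each pixel's predicted filter to its own neighbourhood.
        return (patches * filters.unsqueeze(1)).sum(dim=2)       # (B, C, H, W)

sve = torch.rand(1, 3, 64, 64)   # spatially varying exposure (SVE) image
print(DynamicLocalFilter()(sve).shape)  # torch.Size([1, 3, 64, 64])
```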
{"title":"Transformer-guided exposure-aware fusion for single-shot HDR imaging","authors":"An Gia Vien ,&nbsp;Chul Lee","doi":"10.1016/j.jvcir.2025.104401","DOIUrl":"10.1016/j.jvcir.2025.104401","url":null,"abstract":"<div><div>Spatially varying exposure (SVE) imaging, also known as single-shot high dynamic range (HDR) imaging, is an effective and practical approach for synthesizing HDR images without the need for handling motions. In this work, we propose a novel single-shot HDR imaging algorithm using transformer-guided exposure-aware fusion to improve the exploitation of inter-channel correlations and capture global and local dependencies by extracting valid information from an SVE image. Specifically, we first extract the initial feature maps by estimating dynamic local filters using local neighbor pixels across color channels. Then, we develop a transformer-based feature extractor that captures both global and local dependencies to extract well-exposed information even in poorly exposed regions. Finally, the proposed algorithm combines only valid features in multi-exposed feature maps by learning local and channel weights. Experimental results on both synthetic and captured real datasets demonstrate that the proposed algorithm significantly outperforms state-of-the-art algorithms both quantitatively and qualitatively.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104401"},"PeriodicalIF":2.6,"publicationDate":"2025-01-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143174749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Semantic feature refinement of YOLO for human mask detection in dense crowded
IF 2.6 | Zone 4 (Computer Science) | Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS | Pub Date: 2025-01-30 | DOI: 10.1016/j.jvcir.2025.104399
Dan Zhang, Qiong Gao, Zhenyu Chen, Zifan Lin
Due to varying scenes, changes in lighting, crowd density, and the ambiguity or small size of targets, mask detection often suffers from reduced accuracy and recall. To address these challenges, we developed a dataset covering diverse mask categories (CM-D) and designed the YOLO-SFR convolutional network (Semantic Feature Refinement of YOLO). To mitigate the impact of lighting and scene variability on network performance, we introduced the Direct Input Head (DIH), which enhances the backbone’s ability to filter out lighting noise by incorporating backbone features directly into the objective function. To address the distortion of small and blurry targets during forward propagation, we devised the Progressive Multi-Scale Fusion Module (PMFM), which integrates multi-scale features from the backbone to minimize the feature loss associated with small or blurry targets. We also proposed the Shunt Transit Feature Extraction Structure (STFES) to enhance the network’s discriminative capability for dense targets. Extensive experiments on CM-D, which requires less emphasis on high-level features, and MD-3, which demands more sophisticated feature handling, demonstrate that our approach outperforms existing state-of-the-art methods in mask detection. On CM-D, AP50 reaches 0.934 and AP reaches 0.668; on MD-3, AP50 reaches 0.915 and AP reaches 0.635.
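To illustrate the progressive multi-scale fusion idea in isolation (channel sizes and the top-down nearest-neighbour upsampling are assumptions, not the PMFM itself), the sketch below merges deeper, low-resolution backbone maps step by step into the shallow, high-resolution one:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveFusion(nn.Module):
    """Illustrative top-down multi-scale feature fusion."""
    def __init__(self, channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in channels)

    def forward(self, feats):
        # feats: backbone maps ordered shallow (high-res) -> deep (low-res)
        fused = self.laterals[-1](feats[-1])
        for i in range(len(feats) - 2, -1, -1):
            fused = F.interpolate(fused, size=feats[i].shape[-2:], mode="nearest")
            fused = fused + self.laterals[i](feats[i])   # merge with shallower map
        return fused

feats = [torch.randn(1, 256, 80, 80),
         torch.randn(1, 512, 40, 40),
         torch.randn(1, 1024, 20, 20)]
print(ProgressiveFusion()(feats).shape)  # torch.Size([1, 256, 80, 80])
```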
{"title":"Semantic feature refinement of YOLO for human mask detection in dense crowded","authors":"Dan Zhang ,&nbsp;Qiong Gao ,&nbsp;Zhenyu Chen ,&nbsp;Zifan Lin","doi":"10.1016/j.jvcir.2025.104399","DOIUrl":"10.1016/j.jvcir.2025.104399","url":null,"abstract":"<div><div>Due to varying scenes, changes in lighting, crowd density, and the ambiguity or small size of targets, issues often arise in mask detection regarding reduced accuracy and recall rates. To address these challenges, we developed a dataset covering diverse mask categories (CM-D) and designed the YOLO-SFR convolutional network (Semantic Feature Refinement of YOLO). To mitigate the impact of lighting and scene variability on network performance, we introduced the Direct Input Head (DIH). This method enhances the backbone’s ability to filter out light noise by directly incorporating backbone features into the objective function. To address distortion in detecting small and blurry targets during forward propagation, we devised the Progressive Multi-Scale Fusion Module (PMFM). This module integrates multi-scale features from the backbone to minimize feature loss associated with small or blurry targets. We proposed the Shunt Transit Feature Extraction Structure (STFES) to enhance the network’s discriminative capability for dense targets. Extensive experiments on CM-D, which requires less emphasis on high-level features, and MD-3, which demands more sophisticated feature handling, demonstrate that our approach outperforms existing state-of-the-art methods in mask detection. On CM-D, the Ap50 reaches as high as 0.934, and the Ap reaches 0.668. On MD-3, the Ap50 reaches as high as 0.915, and the Ap reaches 0.635.</div></div>","PeriodicalId":54755,"journal":{"name":"Journal of Visual Communication and Image Representation","volume":"107 ","pages":"Article 104399"},"PeriodicalIF":2.6,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143339497","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0