LiteMSNet: a lightweight semantic segmentation network with multi-scale feature extraction for urban streetscape scenes
Lirong Li, Jiang Ding, Hao Cui, Zhiqiang Chen, Guisheng Liao
Pub Date: 2024-07-22 | DOI: 10.1007/s00371-024-03569-y
Semantic segmentation plays a pivotal role in scene understanding for computer vision, but it typically requires a large amount of computation to achieve high performance. To strike a balance between accuracy and complexity, we propose a lightweight semantic segmentation model, termed LiteMSNet (a Lightweight Semantic Segmentation Network with Multi-Scale Feature Extraction for urban streetscape scenes). In this model, we propose a novel Improved Feature Pyramid Network, which embeds a shuffle attention mechanism followed by a stacked Depth-wise Asymmetric Gating Module. Furthermore, a Multi-scale Dilation Pyramid Module is developed to expand the receptive field and capture multi-scale feature information. Finally, the proposed lightweight model integrates two loss mechanisms, the Cross-Entropy and Dice loss functions, which effectively mitigate the issues of data imbalance and gradient saturation. Experimental results on the CamVid dataset demonstrate an mIoU of 70.85% with fewer than 5M parameters and a real-time inference speed of 66.1 FPS, surpassing existing methods documented in the literature. The code for this work will be made available at https://github.com/River-ding/LiteMSNet.
Dual adaptive local semantic alignment for few-shot fine-grained classification
Wei Song, Kaili Yang
Pub Date: 2024-07-22 | DOI: 10.1007/s00371-024-03576-z
Few-shot fine-grained classification (FS-FGC) aims to learn discriminative semantic details (e.g., beaks and wings) from few labeled samples to precisely recognize novel classes. However, existing feature alignment methods mainly use a support set to align the query sample, which may lead to incorrect alignment of local semantics due to interference from background and non-target objects. In addition, these methods do not take into account the discrepancy of semantic information among channels. To address the above issues, we propose an effective dual adaptive local semantic alignment approach, which is composed of the channel semantic alignment module (CSAM) and the spatial semantic alignment module (SSAM). Specifically, CSAM adaptively generates channel weights to highlight discriminative information based on two sub-modules, namely the class-aware attention module (CAM) and the target-aware attention module (TAM). CAM emphasizes the discriminative semantic details of each category in the support set, and TAM enhances the target object region of the query image. Building on this, SSAM promotes effective alignment of semantically relevant local regions through a spatial bidirectional alignment strategy. Combining the two adaptive modules to capture fine-grained semantic contextual information along both the channel and spatial dimensions improves the accuracy and robustness of FS-FGC. Experimental results on three widely used fine-grained classification datasets demonstrate excellent performance with significant competitive advantages over current mainstream methods. Codes are available at: https://github.com/kellyagya/DALSA.
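The adaptive channel weighting idea behind CSAM can be illustrated with a squeeze-and-excitation-style sketch; this is a generic stand-in under an assumed reduction ratio, not the authors' exact module.

```python
import torch
import torch.nn as nn

class ChannelWeighting(nn.Module):
    """Adaptive channel weights (squeeze-and-excitation style) used here only to
    illustrate channel-wise re-weighting; the reduction ratio is an assumption."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                       # global average pool -> (N, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)   # (N, C, 1, 1)
        return x * w                                 # highlight discriminative channels
```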
{"title":"Dual adaptive local semantic alignment for few-shot fine-grained classification","authors":"Wei Song, Kaili Yang","doi":"10.1007/s00371-024-03576-z","DOIUrl":"https://doi.org/10.1007/s00371-024-03576-z","url":null,"abstract":"<p>Few-shot fine-grained classification (FS-FGC) aims to learn discriminative semantic details (e.g., beaks and wings) with few labeled samples to precisely recognize novel classes. However, existing feature alignment methods mainly use a support set to align the query sample, which may lead to incorrect alignment of local semantic due to interference from background and non-target objects. In addition, these methods do not take into account the discrepancy of semantic information among channels. To address the above issues, we propose an effective dual adaptive local semantic alignment approach, which is composed of the channel semantic alignment module (CSAM) and the spatial semantic alignment module (SSAM). Specifically, CSAM adaptively generates channel weights to highlight discriminative information based on two sub-modules, namely the class-aware attention module and the target-aware attention module. CAM emphasizes the discriminative semantic details of each category in the support set and TAM enhances the target object region of the query image. On the basis of this, SSAM promotes effective alignment of semantically relevant local regions through a spatial bidirectional alignment strategy. Combining two adaptive modules to better capture fine-grained semantic contextual information along two dimensions, channel and spatial improves the accuracy and robustness of FS-FGC. Experimental results on three widely used fine-grained classification datasets demonstrate excellent performance that has significant competitive advantages over current mainstream methods. Codes are available at: https://github.com/kellyagya/DALSA.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141742400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
STVDNet: spatio-temporal interactive video de-raining network
Ze Ouyang, Huihuang Zhao, Yudong Zhang, Long Chen
Pub Date: 2024-07-20 | DOI: 10.1007/s00371-024-03565-2
Video de-raining is a problem of significant importance in computer vision, as rain streaks adversely affect the visual quality of images and hinder subsequent vision-related tasks. Existing video de-raining methods still face challenges such as black shadows and loss of details. In this paper, we introduce a novel de-raining framework called STVDNet, which effectively solves the issues of black shadows and detail loss after de-raining. STVDNet utilizes a Spatial Detail Feature Extraction Module based on an auto-encoder to capture the spatial characteristics of the video. Additionally, we introduce an innovative interaction between the extracted spatial features and spatio-temporal features using LSTM to generate initial de-raining results. Finally, we employ 3D and 2D convolutions for the detailed processing of the coarse videos. During training, we utilize three loss functions, among which the SSIM loss is applied to the generated videos to enhance their detail structure and color recovery. Through extensive experiments conducted on three public datasets, we demonstrate the superiority of our proposed method over state-of-the-art approaches. We also provide our code and pre-trained models at https://github.com/O-Y-ZONE/STVDNet.git.
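A minimal sketch of an SSIM-based loss term of the kind mentioned above, using a uniform local window and assuming inputs scaled to [0, 1]; the exact formulation used by STVDNet may differ.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2, window=11):
    """Mean SSIM between two image batches (N, C, H, W) with a uniform window."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, 1, pad)
    mu_y = F.avg_pool2d(y, window, 1, pad)
    sigma_x = F.avg_pool2d(x * x, window, 1, pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, window, 1, pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, window, 1, pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return ssim_map.mean()

def ssim_loss(pred, target):
    # higher SSIM means more similar, so minimize 1 - SSIM
    return 1.0 - ssim(pred, target)
```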
{"title":"STVDNet: spatio-temporal interactive video de-raining network","authors":"Ze Ouyang, Huihuang Zhao, Yudong Zhang, Long Chen","doi":"10.1007/s00371-024-03565-2","DOIUrl":"https://doi.org/10.1007/s00371-024-03565-2","url":null,"abstract":"<p>Video de-raining is of significant importance problem in computer vision as rain streaks adversely affect the visual quality of images and hinder subsequent vision-related tasks. Existing video de-raining methods still face challenges such as black shadows and loss of details. In this paper, we introduced a novel de-raining framework called STVDNet, which effectively solves the issues of black shadows and detail loss after de-raining. STVDNet utilizes a Spatial Detail Feature Extraction Module based on an auto-encoder to capture the spatial characteristics of the video. Additionally, we introduced an innovative interaction between the extracted spatial features and Spatio-Temporal features using LSTM to generate initial de-raining results. Finally, we employed 3D convolution and 2D convolution for the detailed processing of the coarse videos. During the training process, we utilized three loss functions, among which the SSIM loss function was employed to process the generated videos, aiming to enhance their detail structure and color recovery. Through extensive experiments conducted on three public datasets, we demonstrated the superiority of our proposed method over state-of-the-art approaches. We also provide our code and pre-trained models at https://github.com/O-Y-ZONE/STVDNet.git.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141742401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VMAN: visual-modified attention network for multimodal paradigms
Xiaoyu Song, Dezhi Han, Chongqing Chen, Xiang Shen, Huafeng Wu
Pub Date: 2024-07-18 | DOI: 10.1007/s00371-024-03563-4
Due to its excellent dependency modeling and powerful parallel computing capabilities, the Transformer has become the primary research method in vision-language tasks (VLT). However, for multimodal VLT such as VQA and VG, which demand strong dependency modeling and heterogeneous modality comprehension, conventional Transformers struggle with introduced noise, insufficient information interaction, and obtaining more refined visual features during image self-interaction. Therefore, this paper proposes a universal visual-modified attention network (VMAN) to address these problems. Specifically, VMAN optimizes the attention mechanism in the Transformer, introducing a visual-modified attention unit that establishes text-visual correspondence before the self-interaction of image information. The modified unit modulates image features to obtain more refined query features for subsequent interaction, filtering out noise while enhancing dependency modeling and reasoning capabilities. Furthermore, two modification approaches have been designed: a weighted sum-based approach and a cross-attention-based approach. Finally, we conduct extensive experiments on VMAN across five benchmark datasets for two tasks (VQA, VG). The results indicate that VMAN achieves an accuracy of 70.99% on VQA-v2 and a breakthrough 74.41% on RefCOCOg, which involves more complex expressions. The results fully demonstrate the rationality and effectiveness of VMAN. The code is available at https://github.com/79song/VMAN.
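The cross-attention-based modification can be sketched as text-guided refinement of the visual tokens before their self-interaction; the dimensions, normalization, and residual wiring below are illustrative assumptions rather than VMAN's exact design.

```python
import torch
import torch.nn as nn

class CrossModifiedSelfAttention(nn.Module):
    """Illustrative sketch: visual tokens are first modified by attending to text
    tokens (cross-attention), then perform ordinary self-attention."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual, text):      # visual: (N, Lv, D), text: (N, Lt, D)
        # text-guided modification of the visual query features
        mod, _ = self.cross_attn(query=visual, key=text, value=text)
        visual = self.norm1(visual + mod)
        # self-interaction on the refined visual features
        out, _ = self.self_attn(visual, visual, visual)
        return self.norm2(visual + out)
```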
{"title":"Vman: visual-modified attention network for multimodal paradigms","authors":"Xiaoyu Song, Dezhi Han, Chongqing Chen, Xiang Shen, Huafeng Wu","doi":"10.1007/s00371-024-03563-4","DOIUrl":"https://doi.org/10.1007/s00371-024-03563-4","url":null,"abstract":"<p>Due to excellent dependency modeling and powerful parallel computing capabilities, Transformer has become the primary research method in vision-language tasks (VLT). However, for multimodal VLT like VQA and VG, which demand high-dependency modeling and heterogeneous modality comprehension, solving the issues of introducing noise, insufficient information interaction, and obtaining more refined visual features during the image self-interaction of conventional Transformers is challenging. Therefore, this paper proposes a universal visual-modified attention network (VMAN) to address these problems. Specifically, VMAN optimizes the attention mechanism in Transformer, introducing a visual-modified attention unit that establishes text-visual correspondence before the self-interaction of image information. Modulating image features with modified units to obtain more refined query features for subsequent interaction, filtering out noise information while enhancing dependency modeling and reasoning capabilities. Furthermore, two modified approaches have been designed: the weighted sum-based approach and the cross-attention-based approach. Finally, we conduct extensive experiments on VMAN across five benchmark datasets for two tasks (VQA, VG). The results indicate that VMAN achieves an accuracy of 70.99<span>(%)</span> on the VQA-v2 and makes a breakthrough of 74.41<span>(%)</span> on the RefCOCOg which involves more complex expressions. The results fully prove the rationality and effectiveness of VMAN. The code is available at https://github.com/79song/VMAN.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"63 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141742402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep fake detection using an optimal deep learning model with multi head attention-based feature extraction scheme
R. Raja Sekar, T. Dhiliphan Rajkumar, Koteswara Rao Anne
Pub Date: 2024-07-17 | DOI: 10.1007/s00371-024-03567-0
Face forgery, or deep fake, is frequently used to produce fake face images for online pornography, blackmail, and other illegal activities. Researchers have developed several detection approaches based on the traces left by deep forgery to limit the damage caused by deep fake methods, but these approaches achieve limited performance in cross-dataset scenarios. This paper proposes an optimal deep learning approach with an attention-based feature learning scheme to perform deep fake detection (DFD) more accurately. The proposed system comprises five phases: face detection, preprocessing, texture feature extraction, spatial feature extraction, and classification. The face regions are initially detected from the collected data using the Viola–Jones (VJ) algorithm. Then, preprocessing resizes and normalizes the detected face regions to improve their quality for detection purposes. Next, texture features are learned using the Butterfly Optimized Gabor Filter to capture information about the local features of objects in an image. Then, spatial features are extracted using Residual Network-50 with Multi Head Attention (RN50MHA) to represent the data globally. Finally, classification is performed using an Optimal Long Short-Term Memory (OLSTM) network, which classifies the data as fake or real and is optimized using the Enhanced Archimedes Optimization Algorithm. The proposed system is evaluated on four benchmark datasets: FaceForensics++ (FF++), the Deepfake Detection Challenge, Celebrity Deepfake (CDF), and Wild Deepfake. The experimental results show that DFD using OLSTM and RN50MHA achieves higher inter- and intra-dataset detection rates than existing state-of-the-art methods.
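A hedged sketch of attaching multi-head self-attention to ResNet-50 feature maps, in the spirit of the RN50MHA step above; the head count, pooling, and the torchvision backbone (torchvision ≥ 0.13 API) are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ResNet50WithMHA(nn.Module):
    """Self-attention over the spatial tokens of ResNet-50's last feature map,
    returning a pooled global descriptor per face crop."""
    def __init__(self, heads=8):
        super().__init__()
        backbone = resnet50(weights=None)
        # drop the average pool and classifier; output is (N, 2048, h, w)
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.attn = nn.MultiheadAttention(embed_dim=2048, num_heads=heads, batch_first=True)

    def forward(self, x):                          # x: (N, 3, H, W) face crops
        f = self.features(x)                       # (N, 2048, h, w)
        tokens = f.flatten(2).transpose(1, 2)      # (N, h*w, 2048)
        attended, _ = self.attn(tokens, tokens, tokens)
        return attended.mean(dim=1)                # (N, 2048) global feature
```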
{"title":"Deep fake detection using an optimal deep learning model with multi head attention-based feature extraction scheme","authors":"R. Raja Sekar, T. Dhiliphan Rajkumar, Koteswara Rao Anne","doi":"10.1007/s00371-024-03567-0","DOIUrl":"https://doi.org/10.1007/s00371-024-03567-0","url":null,"abstract":"<p>Face forgery, or deep fake, is a frequently used method to produce fake face images, network pornography, blackmail, and other illegal activities. Researchers developed several detection approaches based on the changing traces presented by deep forgery to limit the damage caused by deep fake methods. They obtain limited performance when evaluating cross-datum scenarios. This paper proposes an optimal deep learning approach with an attention-based feature learning scheme to perform DFD more accurately. The proposed system mainly comprises ‘5’ phases: face detection, preprocessing, texture feature extraction, spatial feature extraction, and classification. The face regions are initially detected from the collected data using the Viola–Jones (VJ) algorithm. Then, preprocessing is carried out, which resizes and normalizes the detected face regions to improve their quality for detection purposes. Next, texture features are learned using the Butterfly Optimized Gabor Filter to get information about the local features of objects in an image. Then, the spatial features are extracted using Residual Network-50 with Multi Head Attention (RN50MHA) to represent the data globally. Finally, classification is done using the Optimal Long Short-Term Memory (OLSTM), which classifies the data as fake or real, in which optimization of network is done using Enhanced Archimedes Optimization Algorithm. The proposed system is evaluated on four benchmark datasets such as Face Forensics + + (FF + +), Deepfake Detection Challenge, Celebrity Deepfake (CDF), and Wild Deepfake. The experimental results show that DFD using OLSTM and RN50MHA achieves a higher inter and intra-dataset detection rate than existing state-of-the-art methods.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"76 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning to sculpt neural cityscapes
Jialin Zhu, He Wang, David Hogg, Tom Kelly
Pub Date: 2024-07-12 | DOI: 10.1007/s00371-024-03528-7
We introduce a system that learns to sculpt 3D models of massive urban environments. The majority of humans live their lives in urban environments, and detailed virtual models of such environments are used for applications as diverse as virtual worlds, special effects, and urban planning. Generating such 3D models from exemplars manually is time-consuming, while 3D deep learning approaches have high memory costs. In this paper, we present a technique for training 2D neural networks to repeatedly sculpt a plane into a large-scale 3D urban environment. An initial coarse depth map is created by a GAN model, from which we refine 3D normals and depth using an image translation network regularized by a linear system. The networks are trained on real-world data to allow generative synthesis of meshes at scale. We exploit sculpting from multiple viewpoints to generate a highly detailed, concave, and watertight 3D mesh. We show cityscapes at scales of 100 × 1600 meters with more than 2 million triangles, and demonstrate that our results are objectively and subjectively similar to our exemplars.
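As a simple illustration of sculpting a plane by displacing it with a predicted height/depth map (a generic sketch, not the authors' multi-view refinement pipeline), the following converts an (H, W) height map into a triangle mesh:

```python
import numpy as np

def heightfield_to_mesh(height, cell_size=1.0):
    """Turn an (H, W) height map into vertices and triangles of a displaced plane.
    cell_size is the assumed grid spacing in meters."""
    h, w = height.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    vertices = np.stack([xs * cell_size, ys * cell_size, height], axis=-1).reshape(-1, 3)

    # two triangles per grid cell, indexing vertices row-major as y * w + x
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([tl, bl, tr], axis=-1),
                            np.stack([tr, bl, br], axis=-1)], axis=0)
    return vertices, faces
```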
ACL-SAR: model agnostic adversarial contrastive learning for robust skeleton-based action recognition
Jiaxuan Zhu, Ming Shao, Libo Sun, Siyu Xia
Pub Date: 2024-07-11 | DOI: 10.1007/s00371-024-03548-3
Human skeleton data have recently been widely explored in action recognition and human–computer interfaces, thanks to off-the-shelf motion sensors and cameras. With the widespread use of deep models on human skeleton data, their vulnerability to adversarial attacks has raised increasing security concerns. Although there are several works focusing on attack strategies, fewer efforts are put into defending against adversaries in skeleton-based action recognition, which is nontrivial. In addition, the labels required for adversarial learning are another burden for adversarial training-based defense. This paper proposes a robust, model-agnostic adversarial contrastive learning framework for this task. First, we introduce an adversarial contrastive learning framework for skeleton-based action recognition (ACL-SAR). Second, the nature of cross-view skeleton data enables cross-view adversarial contrastive learning (CV-ACL-SAR) as a further improvement. Third, adversarial attack and defense strategies are investigated, including alternate instance-wise attacks and options in adversarial training. To validate the effectiveness of our method, we conducted extensive experiments on the NTU-RGB+D and HDM05 datasets. The results show that our defense strategies are not only robust to various adversarial attacks but also maintain generalization.
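The core adversarial contrastive idea can be sketched as a standard NT-Xent loss plus an FGSM-style instance-wise attack that perturbs the input to increase that loss; the temperature, step size, and encoder interface below are assumptions, not ACL-SAR's exact settings.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Standard NT-Xent contrastive loss between two batches of embeddings."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                       # (2N, D)
    sim = z @ z.t() / temperature                        # (2N, 2N) cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                # remove self-similarity
    # positives: sample i pairs with i + n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def adversarial_view(encoder, skeletons, augmented, epsilon=0.01):
    """FGSM-style instance-wise attack: perturb the skeleton sequence in the
    direction that increases the contrastive loss (a sketch of the idea only)."""
    x = skeletons.clone().detach().requires_grad_(True)
    loss = nt_xent(encoder(x), encoder(augmented))
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()
```

Training would then minimize the contrastive loss between clean and adversarial views, which is how adversarial training and contrastive learning are combined without requiring labels.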
{"title":"ACL-SAR: model agnostic adversarial contrastive learning for robust skeleton-based action recognition","authors":"Jiaxuan Zhu, Ming Shao, Libo Sun, Siyu Xia","doi":"10.1007/s00371-024-03548-3","DOIUrl":"https://doi.org/10.1007/s00371-024-03548-3","url":null,"abstract":"<p>Human skeleton data have been widely explored in action recognition and the human–computer interface recently, thanks to off-the-shelf motion sensors and cameras. With the widespread usage of deep models on human skeleton data, their vulnerabilities under adversarial attacks have raised increasing security concerns. Although there are several works focusing on attack strategies, fewer efforts are put into defense against adversaries in skeleton-based action recognition, which is nontrivial. In addition, labels required in adversarial learning are another pain in adversarial training-based defense. This paper proposes a robust model agnostic adversarial contrastive learning framework for this task. First, we introduce an adversarial contrastive learning framework for skeleton-based action recognition (ACL-SAR). Second, the nature of cross-view skeleton data enables cross-view adversarial contrastive learning (CV-ACL-SAR) as a further improvement. Third, adversarial attack and defense strategies are investigated, including alternate instance-wise attacks and options in adversarial training. To validate the effectiveness of our method, we conducted extensive experiments on the NTU-RGB+D and HDM05 datasets. The results show that our defense strategies are not only robust to various adversarial attacks but can also maintain generalization.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141610897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AutoCleanDeepFood: auto-cleaning and data balancing transfer learning for regional gastronomy food computing
Nauman Ullah Gilal, Marwa Qaraqe, Jens Schneider, Marco Agus
Pub Date: 2024-07-09 | DOI: 10.1007/s00371-024-03560-7
Food computing has emerged as a promising research field, employing artificial intelligence, deep learning, and data science methodologies to enhance various stages of food production pipelines. To this end, the food computing community has compiled a variety of data sets and developed various deep-learning architectures to perform automatic classification. However, automated food classification presents a significant challenge, particularly for local and regional cuisines, which are often underrepresented in available public-domain data sets. Obtaining high-quality, well-labeled, and well-balanced real-world images is challenging, since manual data curation requires significant human effort and is time-consuming. In contrast, the web is a potentially unlimited source of food data, but tapping into this resource carries a high risk of corrupted and wrongly labeled images. In addition, the uneven distribution among food categories may lead to data imbalance problems. All these issues make it challenging to create clean food data sets from web data. To address this issue, we present AutoCleanDeepFood, a novel end-to-end food computing framework for regional gastronomy that contains the following components: (i) a fully automated pre-processing pipeline for creating custom data sets related to a specific regional gastronomy, (ii) a transfer learning-based training paradigm that filters out noisy labels through loss ranking, incorporating a Russian Roulette probabilistic approach to mitigate data imbalance problems, and (iii) a method for deploying the resulting model on smartphones for real-time inference. We assess the performance of our framework on a real-world noisy public-domain data set, ETH Food-101, and two novel web-collected datasets, MENA-150 and Pizza-Styles. We demonstrate the filtering capabilities of our proposed method through embedding visualization of the feature space using the t-SNE dimension reduction scheme. Our filtering scheme is efficient and effectively improves accuracy in all cases, boosting performance by 0.96, 0.71, and 1.29% on MENA-150, ETH Food-101, and Pizza-Styles, respectively.
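A minimal sketch of loss-ranking label filtering combined with a Russian-roulette-style probabilistic re-balancing step; the keep ratio and the specific balancing rule are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def filter_and_balance(losses, labels, keep_ratio=0.7, rng=None):
    """Select likely-clean, class-balanced samples from a noisy web-collected set.

    losses: per-sample loss from a model pretrained via transfer learning
    labels: per-sample (possibly noisy) class labels
    Returns indices of samples retained for training.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    losses, labels = np.asarray(losses), np.asarray(labels)

    # 1) loss ranking: keep the lowest-loss fraction of samples (likely clean labels)
    n_keep = int(keep_ratio * len(losses))
    clean_idx = np.argsort(losses)[:n_keep]

    # 2) Russian-roulette step: keep a sample of class c with probability
    #    proportional to how rare c is among the surviving samples
    counts = np.bincount(labels[clean_idx])
    per_sample_counts = counts[labels[clean_idx]]
    keep_prob = per_sample_counts.min() / per_sample_counts
    survived = rng.random(len(clean_idx)) < keep_prob
    return clean_idx[survived]
```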
{"title":"Autocleandeepfood: auto-cleaning and data balancing transfer learning for regional gastronomy food computing","authors":"Nauman Ullah Gilal, Marwa Qaraqe, Jens Schneider, Marco Agus","doi":"10.1007/s00371-024-03560-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03560-7","url":null,"abstract":"<p>Food computing has emerged as a promising research field, employing artificial intelligence, deep learning, and data science methodologies to enhance various stages of food production pipelines. To this end, the food computing community has compiled a variety of data sets and developed various deep-learning architectures to perform automatic classification. However, automated food classification presents a significant challenge, particularly when it comes to local and regional cuisines, which are often underrepresented in available public-domain data sets. Nevertheless, obtaining high-quality, well-labeled, and well-balanced real-world labeled images is challenging since manual data curation requires significant human effort and is time-consuming. In contrast, the web has a potentially unlimited source of food data but tapping into this resource has a good chance of corrupted and wrongly labeled images. In addition, the uneven distribution among food categories may lead to data imbalance problems. All these issues make it challenging to create clean data sets for food from web data. To address this issue, we present <i>AutoCleanDeepFood</i>, a novel end-to-end food computing framework for regional gastronomy that contains the following components: (i) a fully automated pre-processing pipeline for custom data sets creation related to specific regional gastronomy, (ii) a transfer learning-based training paradigm to filter out noisy labels through loss ranking, incorporating a Russian Roulette probabilistic approach to mitigate data imbalance problems, and (iii) a method for deploying the resulting model on smartphones for real-time inferences. We assess the performance of our framework on a real-world noisy public domain data set, ETH Food-101, and two novel web-collected datasets, MENA-150 and Pizza-Styles. We demonstrate the filtering capabilities of our proposed method through embedding visualization of the feature space using the t-SNE dimension reduction scheme. Our filtering scheme is efficient and effectively improves accuracy in all cases, boosting performance by 0.96, 0.71, and 1.29% on MENA-150, ETH Food-101, and Pizza-Styles, respectively.\u0000</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141574976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust consistency learning for facial expression recognition under label noise
Yumei Tan, Haiying Xia, Shuxiang Song
Pub Date: 2024-07-05 | DOI: 10.1007/s00371-024-03558-1
Label noise is inevitable in facial expression recognition (FER) datasets, especially for datasets collected by web crawling or crowdsourcing in in-the-wild scenarios, which makes the FER task more challenging. Recent advances tackle label noise by leveraging sample selection or constructing label distributions. However, they rely heavily on labels, which can result in confirmation bias issues. In this paper, we present RCL-Net, a simple yet effective robust consistency learning network, which combats label noise by learning robust representations and robust losses. RCL-Net can efficiently handle facial samples with the noisy labels commonly found in real-world datasets. Specifically, we first use a two-view-based backbone to embed facial images into high- and low-dimensional subspaces and then regularize the geometric structure of the high- and low-dimensional subspaces using an unsupervised dual-consistency learning strategy. Benefiting from this strategy, we obtain robust representations to combat label noise. Further, we impose a robust consistency regularization technique on the predictions of the classifiers to improve the whole network's robustness. Comprehensive evaluations on three popular real-world FER datasets demonstrate that RCL-Net effectively mitigates the impact of label noise and significantly outperforms state-of-the-art noisy-label FER methods. RCL-Net also shows better generalization capability on other tasks such as CIFAR100 and Tiny-ImageNet. Our code and models will be available at https://github.com/myt889/RCL-Net.
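The consistency regularization on classifier predictions can be illustrated with a symmetric KL term between the logits of two branches or views; this is a generic stand-in for such regularization, not RCL-Net's exact loss.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_a, logits_b):
    """Symmetric KL divergence between two classifiers' predicted distributions."""
    p = F.log_softmax(logits_a, dim=1)
    q = F.log_softmax(logits_b, dim=1)
    kl_pq = F.kl_div(q, p, reduction="batchmean", log_target=True)  # KL(P || Q)
    kl_qp = F.kl_div(p, q, reduction="batchmean", log_target=True)  # KL(Q || P)
    return 0.5 * (kl_pq + kl_qp)
```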
A deep dive into enhancing sharing of naturalistic driving data through face deidentification
Surendrabikram Thapa, Abhijit Sarkar
Pub Date: 2024-07-04 | DOI: 10.1007/s00371-024-03552-7
Human factors research in transportation relies on naturalistic driving studies (NDS), which collect real-world data from drivers on actual roads. NDS data offer valuable insights into driving behavior, styles, habits, and safety-critical events. However, these data often contain personally identifiable information (PII), such as driver face videos, which cannot be publicly shared due to privacy concerns. To address this, our paper introduces a comprehensive framework for deidentifying drivers’ face videos that facilitates wide sharing of driver face videos while protecting PII. Leveraging recent advancements in generative adversarial networks (GANs), we explore the efficacy of different face swapping algorithms in preserving essential human factors attributes while anonymizing participants’ identities. Most face swapping algorithms are tested under restricted lighting conditions and indoor settings; no known study has tested them in adverse, natural situations. We conducted extensive experiments using large-scale outdoor NDS data, quantifying errors associated with head, mouth, and eye movements, along with other attributes important for human factors research. Additionally, we performed qualitative assessments of these methods through human evaluators, providing valuable insights into the quality and fidelity of the deidentified videos. We propose the utilization of synthetic faces as substitutes for real faces to enhance generalization. We also created practical guidelines for video deidentification, emphasizing error threshold creation, spot-checking for abrupt metric changes, and mitigation strategies for reidentification risks. Our findings underscore nuanced challenges in balancing data utility and privacy, offering valuable insights into enhancing face video deidentification techniques in NDS scenarios.
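A hedged sketch of the kind of attribute-error quantification and spot-checking described above: per-frame landmark displacement between the original and deidentified videos, with flags for frames whose error exceeds a threshold or jumps abruptly. The landmark format and the threshold value are hypothetical, chosen only for illustration.

```python
import numpy as np

def attribute_error_report(orig_landmarks, deid_landmarks, threshold=5.0):
    """Quantify how well a face-swapped video preserves facial attributes.

    orig_landmarks, deid_landmarks: arrays of shape (frames, points, 2) in pixels,
    e.g. eye, mouth, and head-contour landmarks detected in both videos.
    Returns the per-frame error and the indices of frames flagged for spot-checking.
    """
    orig = np.asarray(orig_landmarks, dtype=float)
    deid = np.asarray(deid_landmarks, dtype=float)
    # mean Euclidean displacement of the landmarks in each frame
    err = np.linalg.norm(orig - deid, axis=-1).mean(axis=-1)        # (frames,)
    # flag frames whose error is large or changes sharply from the previous frame
    jumps = np.abs(np.diff(err, prepend=err[0]))
    flagged = np.where((err > threshold) | (jumps > threshold))[0]
    return err, flagged
```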
{"title":"A deep dive into enhancing sharing of naturalistic driving data through face deidentification","authors":"Surendrabikram Thapa, Abhijit Sarkar","doi":"10.1007/s00371-024-03552-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03552-7","url":null,"abstract":"<p>Human factors research in transportation relies on naturalistic driving studies (NDS) which collect real-world data from drivers on actual roads. NDS data offer valuable insights into driving behavior, styles, habits, and safety-critical events. However, these data often contain personally identifiable information (PII), such as driver face videos, which cannot be publicly shared due to privacy concerns. To address this, our paper introduces a comprehensive framework for deidentifying drivers’ face videos, that can facilitate the wide sharing of driver face videos while protecting PII. Leveraging recent advancements in generative adversarial networks (GANs), we explore the efficacy of different face swapping algorithms in preserving essential human factors attributes while anonymizing participants’ identities. Most face swapping algorithms are tested in restricted lighting conditions and indoor settings, there is no known study that tested them in adverse and natural situations. We conducted extensive experiments using large-scale outdoor NDS data, evaluating the quantification of errors associated with head, mouth, and eye movements, along with other attributes important for human factors research. Additionally, we performed qualitative assessments of these methods through human evaluators providing valuable insights into the quality and fidelity of the deidentified videos. We propose the utilization of synthetic faces as substitutes for real faces to enhance generalization. Additionally, we created practical guidelines for video deidentification, emphasizing error threshold creation, spot-checking for abrupt metric changes, and mitigation strategies for reidentification risks. Our findings underscore nuanced challenges in balancing data utility and privacy, offering valuable insights into enhancing face video deidentification techniques in NDS scenarios.\u0000</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141548421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}