LiteMSNet: a lightweight semantic segmentation network with multi-scale feature extraction for urban streetscape scenes
Lirong Li, Jiang Ding, Hao Cui, Zhiqiang Chen, Guisheng Liao
Pub Date: 2024-07-22 | DOI: 10.1007/s00371-024-03569-y
Semantic segmentation plays a pivotal role in scene understanding for computer vision, but it typically requires a large amount of computation to achieve high performance. To strike a balance between accuracy and complexity, we propose a lightweight semantic segmentation model, termed LiteMSNet (a Lightweight Semantic Segmentation Network with Multi-Scale Feature Extraction for urban streetscape scenes). In this model, we propose a novel Improved Feature Pyramid Network, which embeds a shuffle attention mechanism followed by a stacked Depth-wise Asymmetric Gating Module. Furthermore, a Multi-scale Dilation Pyramid Module is developed to expand the receptive field and capture multi-scale feature information. Finally, the proposed lightweight model integrates two loss mechanisms, the Cross-Entropy and Dice loss functions, which effectively mitigate the issues of data imbalance and gradient saturation. Experimental results on the CamVid dataset demonstrate an mIoU of 70.85% with fewer than 5M parameters and a real-time inference speed of 66.1 FPS, surpassing existing methods documented in the literature. The code for this work will be made available at https://github.com/River-ding/LiteMSNet.
Dual adaptive local semantic alignment for few-shot fine-grained classification
Wei Song, Kaili Yang
Pub Date: 2024-07-22 | DOI: 10.1007/s00371-024-03576-z
Few-shot fine-grained classification (FS-FGC) aims to learn discriminative semantic details (e.g., beaks and wings) from few labeled samples to precisely recognize novel classes. However, existing feature alignment methods mainly use a support set to align the query sample, which may lead to incorrect alignment of local semantics due to interference from background and non-target objects. In addition, these methods do not take into account the discrepancy of semantic information among channels. To address the above issues, we propose an effective dual adaptive local semantic alignment approach, which is composed of the channel semantic alignment module (CSAM) and the spatial semantic alignment module (SSAM). Specifically, CSAM adaptively generates channel weights to highlight discriminative information based on two sub-modules, namely the class-aware attention module (CAM) and the target-aware attention module (TAM). CAM emphasizes the discriminative semantic details of each category in the support set, and TAM enhances the target object region of the query image. Building on this, SSAM promotes effective alignment of semantically relevant local regions through a spatial bidirectional alignment strategy. Combining the two adaptive modules to capture fine-grained semantic contextual information along both the channel and spatial dimensions improves the accuracy and robustness of FS-FGC. Experimental results on three widely used fine-grained classification datasets demonstrate excellent performance with significant competitive advantages over current mainstream methods. Codes are available at: https://github.com/kellyagya/DALSA.
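The adaptive channel weighting idea behind CSAM can be illustrated with a squeeze-and-excitation-style sketch; this is a generic stand-in under an assumed reduction ratio, not the authors' exact module.

```python
import torch
import torch.nn as nn

class ChannelWeighting(nn.Module):
    """Adaptive channel weights (squeeze-and-excitation style) used here only to
    illustrate channel-wise re-weighting; the reduction ratio is an assumption."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                            # x: (N, C, H, W)
        w = x.mean(dim=(2, 3))                       # global average pool -> (N, C)
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)   # (N, C, 1, 1)
        return x * w                                 # highlight discriminative channels
```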
{"title":"Dual adaptive local semantic alignment for few-shot fine-grained classification","authors":"Wei Song, Kaili Yang","doi":"10.1007/s00371-024-03576-z","DOIUrl":"https://doi.org/10.1007/s00371-024-03576-z","url":null,"abstract":"<p>Few-shot fine-grained classification (FS-FGC) aims to learn discriminative semantic details (e.g., beaks and wings) with few labeled samples to precisely recognize novel classes. However, existing feature alignment methods mainly use a support set to align the query sample, which may lead to incorrect alignment of local semantic due to interference from background and non-target objects. In addition, these methods do not take into account the discrepancy of semantic information among channels. To address the above issues, we propose an effective dual adaptive local semantic alignment approach, which is composed of the channel semantic alignment module (CSAM) and the spatial semantic alignment module (SSAM). Specifically, CSAM adaptively generates channel weights to highlight discriminative information based on two sub-modules, namely the class-aware attention module and the target-aware attention module. CAM emphasizes the discriminative semantic details of each category in the support set and TAM enhances the target object region of the query image. On the basis of this, SSAM promotes effective alignment of semantically relevant local regions through a spatial bidirectional alignment strategy. Combining two adaptive modules to better capture fine-grained semantic contextual information along two dimensions, channel and spatial improves the accuracy and robustness of FS-FGC. Experimental results on three widely used fine-grained classification datasets demonstrate excellent performance that has significant competitive advantages over current mainstream methods. Codes are available at: https://github.com/kellyagya/DALSA.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141742400","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
STVDNet: spatio-temporal interactive video de-raining network
Ze Ouyang, Huihuang Zhao, Yudong Zhang, Long Chen
Pub Date: 2024-07-20 | DOI: 10.1007/s00371-024-03565-2
Video de-raining is a problem of significant importance in computer vision, as rain streaks adversely affect the visual quality of images and hinder subsequent vision-related tasks. Existing video de-raining methods still face challenges such as black shadows and loss of details. In this paper, we introduce a novel de-raining framework called STVDNet, which effectively solves the issues of black shadows and detail loss after de-raining. STVDNet utilizes a Spatial Detail Feature Extraction Module based on an auto-encoder to capture the spatial characteristics of the video. Additionally, we introduce an innovative interaction between the extracted spatial features and spatio-temporal features using LSTM to generate initial de-raining results. Finally, we employ 3D and 2D convolutions for the detailed processing of the coarse videos. During training, we utilize three loss functions, among which the SSIM loss is applied to the generated videos to enhance their detail structure and color recovery. Through extensive experiments conducted on three public datasets, we demonstrate the superiority of our proposed method over state-of-the-art approaches. We also provide our code and pre-trained models at https://github.com/O-Y-ZONE/STVDNet.git.
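A minimal sketch of an SSIM-based loss term of the kind mentioned above, using a uniform local window and assuming inputs scaled to [0, 1]; the exact formulation used by STVDNet may differ.

```python
import torch
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2, window=11):
    """Mean SSIM between two image batches (N, C, H, W) with a uniform window."""
    pad = window // 2
    mu_x = F.avg_pool2d(x, window, 1, pad)
    mu_y = F.avg_pool2d(y, window, 1, pad)
    sigma_x = F.avg_pool2d(x * x, window, 1, pad) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, window, 1, pad) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, window, 1, pad) - mu_x * mu_y
    ssim_map = ((2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)) / \
               ((mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2))
    return ssim_map.mean()

def ssim_loss(pred, target):
    # higher SSIM means more similar, so minimize 1 - SSIM
    return 1.0 - ssim(pred, target)
```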
{"title":"STVDNet: spatio-temporal interactive video de-raining network","authors":"Ze Ouyang, Huihuang Zhao, Yudong Zhang, Long Chen","doi":"10.1007/s00371-024-03565-2","DOIUrl":"https://doi.org/10.1007/s00371-024-03565-2","url":null,"abstract":"<p>Video de-raining is of significant importance problem in computer vision as rain streaks adversely affect the visual quality of images and hinder subsequent vision-related tasks. Existing video de-raining methods still face challenges such as black shadows and loss of details. In this paper, we introduced a novel de-raining framework called STVDNet, which effectively solves the issues of black shadows and detail loss after de-raining. STVDNet utilizes a Spatial Detail Feature Extraction Module based on an auto-encoder to capture the spatial characteristics of the video. Additionally, we introduced an innovative interaction between the extracted spatial features and Spatio-Temporal features using LSTM to generate initial de-raining results. Finally, we employed 3D convolution and 2D convolution for the detailed processing of the coarse videos. During the training process, we utilized three loss functions, among which the SSIM loss function was employed to process the generated videos, aiming to enhance their detail structure and color recovery. Through extensive experiments conducted on three public datasets, we demonstrated the superiority of our proposed method over state-of-the-art approaches. We also provide our code and pre-trained models at https://github.com/O-Y-ZONE/STVDNet.git.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141742401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
VMAN: visual-modified attention network for multimodal paradigms
Xiaoyu Song, Dezhi Han, Chongqing Chen, Xiang Shen, Huafeng Wu
Pub Date: 2024-07-18 | DOI: 10.1007/s00371-024-03563-4
Due to its excellent dependency modeling and powerful parallel computing capabilities, the Transformer has become the primary research method in vision-language tasks (VLT). However, for multimodal VLT such as VQA and VG, which demand strong dependency modeling and heterogeneous modality comprehension, conventional Transformers struggle with introduced noise, insufficient information interaction, and obtaining more refined visual features during image self-interaction. Therefore, this paper proposes a universal visual-modified attention network (VMAN) to address these problems. Specifically, VMAN optimizes the attention mechanism in the Transformer, introducing a visual-modified attention unit that establishes text-visual correspondence before the self-interaction of image information. The modified unit modulates image features to obtain more refined query features for subsequent interaction, filtering out noise while enhancing dependency modeling and reasoning capabilities. Furthermore, two modification approaches have been designed: a weighted sum-based approach and a cross-attention-based approach. Finally, we conduct extensive experiments on VMAN across five benchmark datasets for two tasks (VQA, VG). The results indicate that VMAN achieves an accuracy of 70.99% on VQA-v2 and a breakthrough 74.41% on RefCOCOg, which involves more complex expressions. The results fully demonstrate the rationality and effectiveness of VMAN. The code is available at https://github.com/79song/VMAN.
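The cross-attention-based modification can be sketched as text-guided refinement of the visual tokens before their self-interaction; the dimensions, normalization, and residual wiring below are illustrative assumptions rather than VMAN's exact design.

```python
import torch
import torch.nn as nn

class CrossModifiedSelfAttention(nn.Module):
    """Illustrative sketch: visual tokens are first modified by attending to text
    tokens (cross-attention), then perform ordinary self-attention."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, visual, text):      # visual: (N, Lv, D), text: (N, Lt, D)
        # text-guided modification of the visual query features
        mod, _ = self.cross_attn(query=visual, key=text, value=text)
        visual = self.norm1(visual + mod)
        # self-interaction on the refined visual features
        out, _ = self.self_attn(visual, visual, visual)
        return self.norm2(visual + out)
```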
{"title":"Vman: visual-modified attention network for multimodal paradigms","authors":"Xiaoyu Song, Dezhi Han, Chongqing Chen, Xiang Shen, Huafeng Wu","doi":"10.1007/s00371-024-03563-4","DOIUrl":"https://doi.org/10.1007/s00371-024-03563-4","url":null,"abstract":"<p>Due to excellent dependency modeling and powerful parallel computing capabilities, Transformer has become the primary research method in vision-language tasks (VLT). However, for multimodal VLT like VQA and VG, which demand high-dependency modeling and heterogeneous modality comprehension, solving the issues of introducing noise, insufficient information interaction, and obtaining more refined visual features during the image self-interaction of conventional Transformers is challenging. Therefore, this paper proposes a universal visual-modified attention network (VMAN) to address these problems. Specifically, VMAN optimizes the attention mechanism in Transformer, introducing a visual-modified attention unit that establishes text-visual correspondence before the self-interaction of image information. Modulating image features with modified units to obtain more refined query features for subsequent interaction, filtering out noise information while enhancing dependency modeling and reasoning capabilities. Furthermore, two modified approaches have been designed: the weighted sum-based approach and the cross-attention-based approach. Finally, we conduct extensive experiments on VMAN across five benchmark datasets for two tasks (VQA, VG). The results indicate that VMAN achieves an accuracy of 70.99<span>(%)</span> on the VQA-v2 and makes a breakthrough of 74.41<span>(%)</span> on the RefCOCOg which involves more complex expressions. The results fully prove the rationality and effectiveness of VMAN. The code is available at https://github.com/79song/VMAN.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"63 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141742402","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep fake detection using an optimal deep learning model with multi head attention-based feature extraction scheme
R. Raja Sekar, T. Dhiliphan Rajkumar, Koteswara Rao Anne
Pub Date: 2024-07-17 | DOI: 10.1007/s00371-024-03567-0
Face forgery, or deep fake, is frequently used to produce fake face images for online pornography, blackmail, and other illegal activities. Researchers have developed several detection approaches based on the traces left by deep forgery to limit the damage caused by deep fake methods, but these approaches achieve limited performance in cross-dataset scenarios. This paper proposes an optimal deep learning approach with an attention-based feature learning scheme to perform deep fake detection (DFD) more accurately. The proposed system comprises five phases: face detection, preprocessing, texture feature extraction, spatial feature extraction, and classification. The face regions are initially detected from the collected data using the Viola–Jones (VJ) algorithm. Then, preprocessing resizes and normalizes the detected face regions to improve their quality for detection purposes. Next, texture features are learned using the Butterfly Optimized Gabor Filter to capture information about the local features of objects in an image. Then, spatial features are extracted using Residual Network-50 with Multi Head Attention (RN50MHA) to represent the data globally. Finally, classification is performed using an Optimal Long Short-Term Memory (OLSTM) network, which classifies the data as fake or real and is optimized using the Enhanced Archimedes Optimization Algorithm. The proposed system is evaluated on four benchmark datasets: FaceForensics++ (FF++), the Deepfake Detection Challenge, Celebrity Deepfake (CDF), and Wild Deepfake. The experimental results show that DFD using OLSTM and RN50MHA achieves higher inter- and intra-dataset detection rates than existing state-of-the-art methods.
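A hedged sketch of attaching multi-head self-attention to ResNet-50 feature maps, in the spirit of the RN50MHA step above; the head count, pooling, and the torchvision backbone (torchvision ≥ 0.13 API) are assumptions for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class ResNet50WithMHA(nn.Module):
    """Self-attention over the spatial tokens of ResNet-50's last feature map,
    returning a pooled global descriptor per face crop."""
    def __init__(self, heads=8):
        super().__init__()
        backbone = resnet50(weights=None)
        # drop the average pool and classifier; output is (N, 2048, h, w)
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.attn = nn.MultiheadAttention(embed_dim=2048, num_heads=heads, batch_first=True)

    def forward(self, x):                          # x: (N, 3, H, W) face crops
        f = self.features(x)                       # (N, 2048, h, w)
        tokens = f.flatten(2).transpose(1, 2)      # (N, h*w, 2048)
        attended, _ = self.attn(tokens, tokens, tokens)
        return attended.mean(dim=1)                # (N, 2048) global feature
```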
{"title":"Deep fake detection using an optimal deep learning model with multi head attention-based feature extraction scheme","authors":"R. Raja Sekar, T. Dhiliphan Rajkumar, Koteswara Rao Anne","doi":"10.1007/s00371-024-03567-0","DOIUrl":"https://doi.org/10.1007/s00371-024-03567-0","url":null,"abstract":"<p>Face forgery, or deep fake, is a frequently used method to produce fake face images, network pornography, blackmail, and other illegal activities. Researchers developed several detection approaches based on the changing traces presented by deep forgery to limit the damage caused by deep fake methods. They obtain limited performance when evaluating cross-datum scenarios. This paper proposes an optimal deep learning approach with an attention-based feature learning scheme to perform DFD more accurately. The proposed system mainly comprises ‘5’ phases: face detection, preprocessing, texture feature extraction, spatial feature extraction, and classification. The face regions are initially detected from the collected data using the Viola–Jones (VJ) algorithm. Then, preprocessing is carried out, which resizes and normalizes the detected face regions to improve their quality for detection purposes. Next, texture features are learned using the Butterfly Optimized Gabor Filter to get information about the local features of objects in an image. Then, the spatial features are extracted using Residual Network-50 with Multi Head Attention (RN50MHA) to represent the data globally. Finally, classification is done using the Optimal Long Short-Term Memory (OLSTM), which classifies the data as fake or real, in which optimization of network is done using Enhanced Archimedes Optimization Algorithm. The proposed system is evaluated on four benchmark datasets such as Face Forensics + + (FF + +), Deepfake Detection Challenge, Celebrity Deepfake (CDF), and Wild Deepfake. The experimental results show that DFD using OLSTM and RN50MHA achieves a higher inter and intra-dataset detection rate than existing state-of-the-art methods.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"76 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning to sculpt neural cityscapes
Jialin Zhu, He Wang, David Hogg, Tom Kelly
Pub Date: 2024-07-12 | DOI: 10.1007/s00371-024-03528-7
We introduce a system that learns to sculpt 3D models of massive urban environments. The majority of humans live their lives in urban environments, and detailed virtual models of such environments are used for applications as diverse as virtual worlds, special effects, and urban planning. Generating such 3D models from exemplars manually is time-consuming, while 3D deep learning approaches have high memory costs. In this paper, we present a technique for training 2D neural networks to repeatedly sculpt a plane into a large-scale 3D urban environment. An initial coarse depth map is created by a GAN model, from which we refine 3D normals and depth using an image translation network regularized by a linear system. The networks are trained on real-world data to allow generative synthesis of meshes at scale. We exploit sculpting from multiple viewpoints to generate a highly detailed, concave, and watertight 3D mesh. We show cityscapes at scales of 100 × 1600 meters with more than 2 million triangles, and demonstrate that our results are objectively and subjectively similar to our exemplars.
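As a simple illustration of sculpting a plane by displacing it with a predicted height/depth map (a generic sketch, not the authors' multi-view refinement pipeline), the following converts an (H, W) height map into a triangle mesh:

```python
import numpy as np

def heightfield_to_mesh(height, cell_size=1.0):
    """Turn an (H, W) height map into vertices and triangles of a displaced plane.
    cell_size is the assumed grid spacing in meters."""
    h, w = height.shape
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    vertices = np.stack([xs * cell_size, ys * cell_size, height], axis=-1).reshape(-1, 3)

    # two triangles per grid cell, indexing vertices row-major as y * w + x
    idx = np.arange(h * w).reshape(h, w)
    tl, tr = idx[:-1, :-1].ravel(), idx[:-1, 1:].ravel()
    bl, br = idx[1:, :-1].ravel(), idx[1:, 1:].ravel()
    faces = np.concatenate([np.stack([tl, bl, tr], axis=-1),
                            np.stack([tr, bl, br], axis=-1)], axis=0)
    return vertices, faces
```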
ACL-SAR: model agnostic adversarial contrastive learning for robust skeleton-based action recognition
Jiaxuan Zhu, Ming Shao, Libo Sun, Siyu Xia
Pub Date: 2024-07-11 | DOI: 10.1007/s00371-024-03548-3
Human skeleton data have recently been widely explored in action recognition and human–computer interfaces, thanks to off-the-shelf motion sensors and cameras. With the widespread use of deep models on human skeleton data, their vulnerability to adversarial attacks has raised increasing security concerns. Although there are several works focusing on attack strategies, fewer efforts are put into defending against adversaries in skeleton-based action recognition, which is nontrivial. In addition, the labels required for adversarial learning are another burden for adversarial training-based defense. This paper proposes a robust, model-agnostic adversarial contrastive learning framework for this task. First, we introduce an adversarial contrastive learning framework for skeleton-based action recognition (ACL-SAR). Second, the nature of cross-view skeleton data enables cross-view adversarial contrastive learning (CV-ACL-SAR) as a further improvement. Third, adversarial attack and defense strategies are investigated, including alternate instance-wise attacks and options in adversarial training. To validate the effectiveness of our method, we conducted extensive experiments on the NTU-RGB+D and HDM05 datasets. The results show that our defense strategies are not only robust to various adversarial attacks but also maintain generalization.
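The core adversarial contrastive idea can be sketched as a standard NT-Xent loss plus an FGSM-style instance-wise attack that perturbs the input to increase that loss; the temperature, step size, and encoder interface below are assumptions, not ACL-SAR's exact settings.

```python
import torch
import torch.nn.functional as F

def nt_xent(z1, z2, temperature=0.5):
    """Standard NT-Xent contrastive loss between two batches of embeddings."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    z = torch.cat([z1, z2], dim=0)                       # (2N, D)
    sim = z @ z.t() / temperature                        # (2N, 2N) cosine similarities
    n = z1.shape[0]
    mask = torch.eye(2 * n, dtype=torch.bool, device=z.device)
    sim.masked_fill_(mask, float("-inf"))                # remove self-similarity
    # positives: sample i pairs with i + n, and vice versa
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)]).to(z.device)
    return F.cross_entropy(sim, targets)

def adversarial_view(encoder, skeletons, augmented, epsilon=0.01):
    """FGSM-style instance-wise attack: perturb the skeleton sequence in the
    direction that increases the contrastive loss (a sketch of the idea only)."""
    x = skeletons.clone().detach().requires_grad_(True)
    loss = nt_xent(encoder(x), encoder(augmented))
    loss.backward()
    return (x + epsilon * x.grad.sign()).detach()
```

Training would then minimize the contrastive loss between clean and adversarial views, which is how adversarial training and contrastive learning are combined without requiring labels.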
{"title":"ACL-SAR: model agnostic adversarial contrastive learning for robust skeleton-based action recognition","authors":"Jiaxuan Zhu, Ming Shao, Libo Sun, Siyu Xia","doi":"10.1007/s00371-024-03548-3","DOIUrl":"https://doi.org/10.1007/s00371-024-03548-3","url":null,"abstract":"<p>Human skeleton data have been widely explored in action recognition and the human–computer interface recently, thanks to off-the-shelf motion sensors and cameras. With the widespread usage of deep models on human skeleton data, their vulnerabilities under adversarial attacks have raised increasing security concerns. Although there are several works focusing on attack strategies, fewer efforts are put into defense against adversaries in skeleton-based action recognition, which is nontrivial. In addition, labels required in adversarial learning are another pain in adversarial training-based defense. This paper proposes a robust model agnostic adversarial contrastive learning framework for this task. First, we introduce an adversarial contrastive learning framework for skeleton-based action recognition (ACL-SAR). Second, the nature of cross-view skeleton data enables cross-view adversarial contrastive learning (CV-ACL-SAR) as a further improvement. Third, adversarial attack and defense strategies are investigated, including alternate instance-wise attacks and options in adversarial training. To validate the effectiveness of our method, we conducted extensive experiments on the NTU-RGB+D and HDM05 datasets. The results show that our defense strategies are not only robust to various adversarial attacks but can also maintain generalization.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141610897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AutoCleanDeepFood: auto-cleaning and data balancing transfer learning for regional gastronomy food computing
Nauman Ullah Gilal, Marwa Qaraqe, Jens Schneider, Marco Agus
Pub Date: 2024-07-09 | DOI: 10.1007/s00371-024-03560-7
Food computing has emerged as a promising research field, employing artificial intelligence, deep learning, and data science methodologies to enhance various stages of food production pipelines. To this end, the food computing community has compiled a variety of data sets and developed various deep-learning architectures to perform automatic classification. However, automated food classification presents a significant challenge, particularly for local and regional cuisines, which are often underrepresented in available public-domain data sets. Obtaining high-quality, well-labeled, and well-balanced real-world images is challenging, since manual data curation requires significant human effort and is time-consuming. In contrast, the web is a potentially unlimited source of food data, but tapping into this resource carries a high risk of corrupted and wrongly labeled images. In addition, the uneven distribution among food categories may lead to data imbalance problems. All these issues make it challenging to create clean food data sets from web data. To address this issue, we present AutoCleanDeepFood, a novel end-to-end food computing framework for regional gastronomy that contains the following components: (i) a fully automated pre-processing pipeline for creating custom data sets related to a specific regional gastronomy, (ii) a transfer learning-based training paradigm that filters out noisy labels through loss ranking, incorporating a Russian Roulette probabilistic approach to mitigate data imbalance problems, and (iii) a method for deploying the resulting model on smartphones for real-time inference. We assess the performance of our framework on a real-world noisy public-domain data set, ETH Food-101, and two novel web-collected datasets, MENA-150 and Pizza-Styles. We demonstrate the filtering capabilities of our proposed method through embedding visualization of the feature space using the t-SNE dimension reduction scheme. Our filtering scheme is efficient and effectively improves accuracy in all cases, boosting performance by 0.96, 0.71, and 1.29% on MENA-150, ETH Food-101, and Pizza-Styles, respectively.
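A minimal sketch of loss-ranking label filtering combined with a Russian-roulette-style probabilistic re-balancing step; the keep ratio and the specific balancing rule are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def filter_and_balance(losses, labels, keep_ratio=0.7, rng=None):
    """Select likely-clean, class-balanced samples from a noisy web-collected set.

    losses: per-sample loss from a model pretrained via transfer learning
    labels: per-sample (possibly noisy) class labels
    Returns indices of samples retained for training.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    losses, labels = np.asarray(losses), np.asarray(labels)

    # 1) loss ranking: keep the lowest-loss fraction of samples (likely clean labels)
    n_keep = int(keep_ratio * len(losses))
    clean_idx = np.argsort(losses)[:n_keep]

    # 2) Russian-roulette step: keep a sample of class c with probability
    #    proportional to how rare c is among the surviving samples
    counts = np.bincount(labels[clean_idx])
    per_sample_counts = counts[labels[clean_idx]]
    keep_prob = per_sample_counts.min() / per_sample_counts
    survived = rng.random(len(clean_idx)) < keep_prob
    return clean_idx[survived]
```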
{"title":"Autocleandeepfood: auto-cleaning and data balancing transfer learning for regional gastronomy food computing","authors":"Nauman Ullah Gilal, Marwa Qaraqe, Jens Schneider, Marco Agus","doi":"10.1007/s00371-024-03560-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03560-7","url":null,"abstract":"<p>Food computing has emerged as a promising research field, employing artificial intelligence, deep learning, and data science methodologies to enhance various stages of food production pipelines. To this end, the food computing community has compiled a variety of data sets and developed various deep-learning architectures to perform automatic classification. However, automated food classification presents a significant challenge, particularly when it comes to local and regional cuisines, which are often underrepresented in available public-domain data sets. Nevertheless, obtaining high-quality, well-labeled, and well-balanced real-world labeled images is challenging since manual data curation requires significant human effort and is time-consuming. In contrast, the web has a potentially unlimited source of food data but tapping into this resource has a good chance of corrupted and wrongly labeled images. In addition, the uneven distribution among food categories may lead to data imbalance problems. All these issues make it challenging to create clean data sets for food from web data. To address this issue, we present <i>AutoCleanDeepFood</i>, a novel end-to-end food computing framework for regional gastronomy that contains the following components: (i) a fully automated pre-processing pipeline for custom data sets creation related to specific regional gastronomy, (ii) a transfer learning-based training paradigm to filter out noisy labels through loss ranking, incorporating a Russian Roulette probabilistic approach to mitigate data imbalance problems, and (iii) a method for deploying the resulting model on smartphones for real-time inferences. We assess the performance of our framework on a real-world noisy public domain data set, ETH Food-101, and two novel web-collected datasets, MENA-150 and Pizza-Styles. We demonstrate the filtering capabilities of our proposed method through embedding visualization of the feature space using the t-SNE dimension reduction scheme. Our filtering scheme is efficient and effectively improves accuracy in all cases, boosting performance by 0.96, 0.71, and 1.29% on MENA-150, ETH Food-101, and Pizza-Styles, respectively.\u0000</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"17 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141574976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robust consistency learning for facial expression recognition under label noise
Yumei Tan, Haiying Xia, Shuxiang Song
Pub Date: 2024-07-05 | DOI: 10.1007/s00371-024-03558-1
Label noise is inevitable in facial expression recognition (FER) datasets, especially for datasets collected by web crawling or crowdsourcing in in-the-wild scenarios, which makes the FER task more challenging. Recent advances tackle label noise by leveraging sample selection or constructing label distributions. However, they rely heavily on labels, which can result in confirmation bias issues. In this paper, we present RCL-Net, a simple yet effective robust consistency learning network, which combats label noise by learning robust representations and robust losses. RCL-Net can efficiently handle facial samples with the noisy labels commonly found in real-world datasets. Specifically, we first use a two-view-based backbone to embed facial images into high- and low-dimensional subspaces and then regularize the geometric structure of the high- and low-dimensional subspaces using an unsupervised dual-consistency learning strategy. Benefiting from this strategy, we obtain robust representations to combat label noise. Further, we impose a robust consistency regularization technique on the predictions of the classifiers to improve the whole network's robustness. Comprehensive evaluations on three popular real-world FER datasets demonstrate that RCL-Net effectively mitigates the impact of label noise and significantly outperforms state-of-the-art noisy-label FER methods. RCL-Net also shows better generalization capability on other tasks such as CIFAR100 and Tiny-ImageNet. Our code and models will be available at https://github.com/myt889/RCL-Net.
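The consistency regularization on classifier predictions can be illustrated with a symmetric KL term between the logits of two branches or views; this is a generic stand-in for such regularization, not RCL-Net's exact loss.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_a, logits_b):
    """Symmetric KL divergence between two classifiers' predicted distributions."""
    p = F.log_softmax(logits_a, dim=1)
    q = F.log_softmax(logits_b, dim=1)
    kl_pq = F.kl_div(q, p, reduction="batchmean", log_target=True)  # KL(P || Q)
    kl_qp = F.kl_div(p, q, reduction="batchmean", log_target=True)  # KL(Q || P)
    return 0.5 * (kl_pq + kl_qp)
```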
A deep dive into enhancing sharing of naturalistic driving data through face deidentification
Surendrabikram Thapa, Abhijit Sarkar
Pub Date: 2024-07-04 | DOI: 10.1007/s00371-024-03552-7
Human factors research in transportation relies on naturalistic driving studies (NDS), which collect real-world data from drivers on actual roads. NDS data offer valuable insights into driving behavior, styles, habits, and safety-critical events. However, these data often contain personally identifiable information (PII), such as driver face videos, which cannot be publicly shared due to privacy concerns. To address this, our paper introduces a comprehensive framework for deidentifying drivers’ face videos that facilitates wide sharing of driver face videos while protecting PII. Leveraging recent advancements in generative adversarial networks (GANs), we explore the efficacy of different face swapping algorithms in preserving essential human factors attributes while anonymizing participants’ identities. Most face swapping algorithms are tested under restricted lighting conditions and indoor settings; no known study has tested them in adverse, natural situations. We conducted extensive experiments using large-scale outdoor NDS data, quantifying errors associated with head, mouth, and eye movements, along with other attributes important for human factors research. Additionally, we performed qualitative assessments of these methods through human evaluators, providing valuable insights into the quality and fidelity of the deidentified videos. We propose the utilization of synthetic faces as substitutes for real faces to enhance generalization. We also created practical guidelines for video deidentification, emphasizing error threshold creation, spot-checking for abrupt metric changes, and mitigation strategies for reidentification risks. Our findings underscore nuanced challenges in balancing data utility and privacy, offering valuable insights into enhancing face video deidentification techniques in NDS scenarios.
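A hedged sketch of the kind of attribute-error quantification and spot-checking described above: per-frame landmark displacement between the original and deidentified videos, with flags for frames whose error exceeds a threshold or jumps abruptly. The landmark format and the threshold value are hypothetical, chosen only for illustration.

```python
import numpy as np

def attribute_error_report(orig_landmarks, deid_landmarks, threshold=5.0):
    """Quantify how well a face-swapped video preserves facial attributes.

    orig_landmarks, deid_landmarks: arrays of shape (frames, points, 2) in pixels,
    e.g. eye, mouth, and head-contour landmarks detected in both videos.
    Returns the per-frame error and the indices of frames flagged for spot-checking.
    """
    orig = np.asarray(orig_landmarks, dtype=float)
    deid = np.asarray(deid_landmarks, dtype=float)
    # mean Euclidean displacement of the landmarks in each frame
    err = np.linalg.norm(orig - deid, axis=-1).mean(axis=-1)        # (frames,)
    # flag frames whose error is large or changes sharply from the previous frame
    jumps = np.abs(np.diff(err, prepend=err[0]))
    flagged = np.where((err > threshold) | (jumps > threshold))[0]
    return err, flagged
```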
{"title":"A deep dive into enhancing sharing of naturalistic driving data through face deidentification","authors":"Surendrabikram Thapa, Abhijit Sarkar","doi":"10.1007/s00371-024-03552-7","DOIUrl":"https://doi.org/10.1007/s00371-024-03552-7","url":null,"abstract":"<p>Human factors research in transportation relies on naturalistic driving studies (NDS) which collect real-world data from drivers on actual roads. NDS data offer valuable insights into driving behavior, styles, habits, and safety-critical events. However, these data often contain personally identifiable information (PII), such as driver face videos, which cannot be publicly shared due to privacy concerns. To address this, our paper introduces a comprehensive framework for deidentifying drivers’ face videos, that can facilitate the wide sharing of driver face videos while protecting PII. Leveraging recent advancements in generative adversarial networks (GANs), we explore the efficacy of different face swapping algorithms in preserving essential human factors attributes while anonymizing participants’ identities. Most face swapping algorithms are tested in restricted lighting conditions and indoor settings, there is no known study that tested them in adverse and natural situations. We conducted extensive experiments using large-scale outdoor NDS data, evaluating the quantification of errors associated with head, mouth, and eye movements, along with other attributes important for human factors research. Additionally, we performed qualitative assessments of these methods through human evaluators providing valuable insights into the quality and fidelity of the deidentified videos. We propose the utilization of synthetic faces as substitutes for real faces to enhance generalization. Additionally, we created practical guidelines for video deidentification, emphasizing error threshold creation, spot-checking for abrupt metric changes, and mitigation strategies for reidentification risks. Our findings underscore nuanced challenges in balancing data utility and privacy, offering valuable insights into enhancing face video deidentification techniques in NDS scenarios.\u0000</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141548421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}