PIC'22: 4th Person in Context Workshop
Si Liu, Qin Jin, Luoqi Liu, Zongheng Tang, Linli Lin
DOI: 10.1145/3503161.3554766
Understanding humans and their surrounding context is crucial for image and video perception. It benefits many related applications, such as person search, virtual try-on/makeup, and abnormal action detection. In the 4th Person in Context (PIC) workshop, to further promote progress in these areas, we hold three human-centric perception and cognition challenges: Make-up Temporal Video Grounding (MTVG), Make-up Dense Video Caption (MDVC), and Human-centric Spatio-Temporal Video Grounding (HC-STVG). All three challenges focus on understanding human behavior, interactions, and relationships in video sequences, which requires understanding both visual and linguistic information as well as complex multimodal reasoning. The three sub-problems are complementary and collaboratively contribute to a unified human-centric perception and cognition solution.
{"title":"PIC'22: 4th Person in Context Workshop","authors":"Si Liu, Qin Jin, Luoqi Liu, Zongheng Tang, Linli Lin","doi":"10.1145/3503161.3554766","DOIUrl":"https://doi.org/10.1145/3503161.3554766","url":null,"abstract":"Understanding human and the surrounding context is crucial for the perception of the image and video. It benefits many related applications, such as person search, virtual tryon/makeup, abnormal action detection. In the proposed 4th Person in Context (PIC) workshop, to further promote the progress in the above-mentioned areas, we hold three human-centric perception and cognition challenges including Make-up Temporal Video Grounding (MTVG), Make-up Dense Video Caption (MDVC) and Human-centric Spatio-Temporal Video Grounding (HC-STVG). All the human-centric challenges focus on understanding the human behavior, interactions and relationships in video sequences, which requires understanding both visual and linguistic information, as well as complicated multimodal reasoning. The three sub-problems are complementary and collaboratively contribute to a unified human-centric perception and cognition solution.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127159297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dream Painter: An Interactive Art Installation Bridging Audience Interaction, Robotics, and Creative AI
Varvara Guljajeva, M. Sola
DOI: 10.1145/3503161.3549976
Dream Painter is an interactive robotic art installation that turns the audience's spoken dreams into a collective painting. By telling a past dream, a participant guides the interactive robotic system through the latent space of the AI model, resulting in a multicolored line drawing. The artwork consists of several parts: an interaction station, a painting robot, a kinetic and animated mechanism that moves the paper roll when a drawing is finished, and the deep learning model that transforms spoken words into a painting. All these interconnected hardware and software components are arranged into an autonomous and interactive robotic art installation. The main aims of this project are to explore the interactive potential of AI technology and robotics and to trigger discussion of deep learning applications in a wider sense. More precisely, this case study focuses on the translation between different semiotic spaces as a trigger for creativity and as an audience interaction method.
{"title":"Dream Painter: An Interactive Art Installation Bridging Audience Interaction, Robotics, and Creative AI","authors":"Varvara Guljajeva, M. Sola","doi":"10.1145/3503161.3549976","DOIUrl":"https://doi.org/10.1145/3503161.3549976","url":null,"abstract":"Dream Painter is an interactive robotic art installation that turns the audience's spoken dreams into a collective painting. By telling one's past dream, a participant guides the interactive robotic system in the latent space of the AI model that results in a multicolored line drawing. The artwork consists of several parts: an interaction station, a painting robot, a kinetic and animated mechanism that moves the paper roll when a drawing is finished, and the deep learning model that transforms a spoken word into a painting. All these interconnected components of hardware and software are arranged into an autonomous and interactive robotic art installation. The main aims of this project are to explore the interactive potential of AI technology and robotics, and trigger discussion over the deep learning applications in a wider sense. More precisely, this case study is primarily focused on the translation of different semiotic spaces as a trigger for creativity and audience interaction method.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124766503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SD-GAN: Semantic Decomposition for Face Image Synthesis with Discrete Attribute
Kangneng Zhou, Xiaobin Zhu, Daiheng Gao, Kai Lee, Xinjie Li, Xu-Cheng Yin
DOI: 10.1145/3503161.3547791
Manipulating latent codes in generative adversarial networks (GANs) for facial image synthesis has mainly focused on continuous attribute synthesis (e.g., age, pose, and emotion), while discrete attribute synthesis (such as face masks and eyeglasses) has received less attention. Directly applying existing methods to facial discrete attributes may produce inaccurate results. In this work, we propose an innovative framework, dubbed SD-GAN, to tackle challenging facial discrete attribute synthesis via semantic decomposition. Concretely, we explicitly decompose the discrete attribute representation into two components: a semantic prior basis and an offset latent representation. The semantic prior basis provides an initial direction for manipulating the face representation in latent space. The offset latent representation, obtained by a 3D-aware semantic fusion network, adjusts the prior basis. In addition, the fusion network integrates 3D embeddings for better identity preservation and discrete attribute synthesis. The combination of the prior basis and the offset latent representation enables our method to synthesize photo-realistic face images with discrete attributes. Notably, we construct a large and valuable dataset, MEGN (Face Mask and Eyeglasses images crawled from Google and Naver), to remedy the lack of discrete attributes in existing datasets. Extensive qualitative and quantitative experiments demonstrate the state-of-the-art performance of our method. Our code is available at an anonymous website: https://github.com/MontaEllis/SD-GAN.
{"title":"SD-GAN: Semantic Decomposition for Face Image Synthesis with Discrete Attribute","authors":"Kangneng Zhou, Xiaobin Zhu, Daiheng Gao, Kai Lee, Xinjie Li, Xu-Cheng Yin","doi":"10.1145/3503161.3547791","DOIUrl":"https://doi.org/10.1145/3503161.3547791","url":null,"abstract":"Manipulating latent code in generative adversarial networks (GANs) for facial image synthesis mainly focuses on continuous attribute synthesis (e.g., age, pose and emotion), while discrete attribute synthesis (like face mask and eyeglasses) receives less attention. Directly applying existing works to facial discrete attributes may cause inaccurate results. In this work, we propose an innovative framework to tackle challenging facial discrete attribute synthesis via semantic decomposing, dubbed SD-GAN. To be concrete, we explicitly decompose the discrete attribute representation into two components, i.e. the semantic prior basis and offset latent representation. The semantic prior basis shows an initializing direction for manipulating face representation in the latent space. The offset latent presentation obtained by 3D-aware semantic fusion network is proposed to adjust prior basis. In addition, the fusion network integrates 3D embedding for better identity preservation and discrete attribute synthesis. The combination of prior basis and offset latent representation enable our method to synthesize photo-realistic face images with discrete attributes. Notably, we construct a large and valuable dataset MEGN (Face Mask and Eyeglasses images crawled from Google and Naver) for completing the lack of discrete attributes in the existing dataset. Extensive qualitative and quantitative experiments demonstrate the state-of-the-art performance of our method. Our code is available at an anonymous website: https://github.com/MontaEllis/SD-GAN.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124876836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bi-directional Heterogeneous Graph Hashing towards Efficient Outfit Recommendation
Weili Guan, Xuemeng Song, Haoyu Zhang, Meng Liu, C. Yeh, Xiaojun Chang
DOI: 10.1145/3503161.3548020
Personalized outfit recommendation, which aims to recommend outfits to a given user according to his/her preferences, has gained increasing research attention due to its economic value. Nevertheless, most existing methods focus on improving recommendation effectiveness while overlooking recommendation efficiency. Inspired by this, we devise a novel bi-directional heterogeneous graph hashing scheme, called BiHGH, for efficient personalized outfit recommendation. This scheme consists of three key components: heterogeneous graph node initialization, bi-directional sequential graph convolution, and hash code learning. We first unify four types of entities (i.e., users, outfits, items, and attributes) and their relations via a heterogeneous four-partite graph. To perform graph learning, we then devise a bi-directional graph convolution algorithm that sequentially transfers knowledge by repeating upward and downward convolution, dividing the four-partite graph into three subgraphs such that each subgraph involves only two adjacent entity types. We finally adopt the Bayesian personalized ranking (BPR) loss for user preference learning and design a dual similarity-preserving regularization to prevent information loss during hash learning. Extensive experiments on the benchmark dataset demonstrate the superiority of BiHGH.
{"title":"Bi-directional Heterogeneous Graph Hashing towards Efficient Outfit Recommendation","authors":"Weili Guan, Xuemeng Song, Haoyu Zhang, Meng Liu, C. Yeh, Xiaojun Chang","doi":"10.1145/3503161.3548020","DOIUrl":"https://doi.org/10.1145/3503161.3548020","url":null,"abstract":"Personalized outfit recommendation, which aims to recommend the outfits to a given user according to his/her preference, has gained increasing research attention due to its economic value. Nevertheless, the majority of existing methods mainly focus on improving the recommendation effectiveness, while overlooking the recommendation efficiency. Inspired by this, we devise a novel bi-directional heterogeneous graph hashing scheme, called BiHGH, towards efficient personalized outfit recommendation. In particular, this scheme consists of three key components: heterogeneous graph node initialization, bi-directional sequential graph convolution, and hash code learning. We first unify four types of entities (i.e., users, outfits, items, and attributes) and their relations via a heterogeneous four-partite graph. To perform graph learning, we then creatively devise a bi-directional graph convolution algorithm to sequentially transfer knowledge via repeating upwards and downwards convolution, whereby we divide the four-partite graph into three subgraphs and each subgraph only involves two adjacent entity types. We ultimately adopt the bayesian personalized ranking loss for the user preference learning and design the dual similarity preserving regularization to prevent the information loss during hash learning. Extensive experiments on the benchmark dataset demonstrate the superiority of BiHGH.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126063438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mixed Supervision for Instance Learning in Object Detection with Few-shot Annotation
Yi Zhong, Chengyao Wang, Shiyong Li, Zhuyun Zhou, Yaowei Wang, Weishi Zheng
DOI: 10.1145/3503161.3548242
Mixed supervision for object detection (MSOD), which utilizes image-level annotations together with a small amount of instance-level annotations, has emerged as an efficient tool: it alleviates the requirement for a large amount of costly instance-level annotations while providing effective instance supervision that previous methods using only image-level annotations lack. In this work, we introduce mixed supervision instance learning (MSIL), a novel MSOD framework that leverages a handful of instance-level annotations to provide more explicit and implicit supervision. Rather than simply adding instance-level annotations to the detection loss functions, we aim to mine more effective explicit and implicit relations between the two levels of annotation. In particular, we first propose an Instance-Annotation Guided Image Classification strategy that provides explicit guidance from instance-level annotations, using positional relations to force the image classifier to focus on proposals that contain the correct object. Then, to exploit more implicit interaction between the mixed annotations, an instance reproduction strategy guided by the extra instance-level annotations is developed to generate more accurate pseudo ground truth, yielding a more discriminative detector. Finally, a false target instance mining strategy refines the above process by enriching the number and diversity of training instances with position and score information. Our experiments show that the proposed MSIL framework outperforms recent state-of-the-art mixed-supervised detectors by a large margin on both the Pascal VOC 2007 and MS-COCO datasets.
{"title":"Mixed Supervision for Instance Learning in Object Detection with Few-shot Annotation","authors":"Yi Zhong, Chengyao Wang, Shiyong Li, Zhuyun Zhou, Yaowei Wang, Weishi Zheng","doi":"10.1145/3503161.3548242","DOIUrl":"https://doi.org/10.1145/3503161.3548242","url":null,"abstract":"Mixed supervision for object detection (MSOD) that utilizes image-level annotations and a small amount of instance-level annotations has emerged as an efficient tool by alleviating the requirement for a large amount of costly instance-level annotations and providing effective instance supervision on previous methods that only use image-level annotations. In this work, we introduce the mixed supervision instance learning (MSIL), as a novel MSOD framework to leverage a handful of instance-level annotations to provide more explicit and implicit supervision. Rather than just adding instance-level annotations directly on loss functions for detection, we aim to dig out more effective explicit and implicit relations between these two different level annotations. In particular, we firstly propose the Instance-Annotation Guided Image Classification strategy to provide explicit guidance from instance-level annotations by using positional relation to force the image classifier to focus on the proposals which contain the correct object. And then, in order to exploit more implicit interaction between the mixed annotations, an instance reproduction strategy guided by the extra instance-level annotations is developed for generating more accurate pseudo ground truth, achieving a more discriminative detector. Finally, a false target instance mining strategy is used to refine the above processing by enriching the number and diversity of training instances with the position and score information. Our experiments show that the proposed MSIL framework outperforms recent state-of-the-art mixed supervised detectors with a large margin on both the Pascal VOC2007 and the MS-COCO dataset.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125666312","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DTR: An Information Bottleneck Based Regularization Framework for Video Action Recognition
Jiawei Fan, Yu Zhao, Xie Yu, Lihua Ma, Junqi Liu, Fangqiu Yi, Boxun Li
DOI: 10.1145/3503161.3548326
As revealed by the Information Bottleneck principle, an optimal representation should contain the maximum task-relevant information and the minimum task-irrelevant information. In video action recognition, CNN-based approaches have obtained better spatio-temporal representations by modeling temporal context; however, these approaches still suffer from low generalization. In this paper, we propose a moderate, optimization-based approach called Dual-view Temporal Regularization (DTR), grounded in the Information Bottleneck principle, for effective and generalized video representation without sacrificing any model efficiency. On the one hand, we design Dual-view Regularization (DR) to constrain task-irrelevant information, which effectively compresses background and irrelevant motion information. On the other hand, we design Temporal Regularization (TR) to maintain task-relevant information by finding an optimal difference between frames, which benefits the extraction of sufficient motion information. The experimental results demonstrate that: (1) DTR is orthogonal to temporal modeling as well as data augmentation, and it achieves general improvements on both model-based and data-based approaches; (2) DTR is effective across 7 different datasets, especially on motion-centric datasets, i.e., SSv1/SSv2, where DTR obtains 6%/3.8% absolute gains in top-1 accuracy.
{"title":"DTR: An Information Bottleneck Based Regularization Framework for Video Action Recognition","authors":"Jiawei Fan, Yu Zhao, Xie Yu, Lihua Ma, Junqi Liu, Fangqiu Yi, Boxun Li","doi":"10.1145/3503161.3548326","DOIUrl":"https://doi.org/10.1145/3503161.3548326","url":null,"abstract":"An optimal representation should contain the maximum task-relevant information and minimum task-irrelevant information, as revealed from Information Bottleneck Principle. In video action recognition, CNN based approaches have obtained better spatio-temporal representation by modeling temporal context. However, these approaches still suffer low generalization. In this paper, we propose a moderate optimization based approach called Dual-view Temporal Regularization (DTR) based on Information Bottleneck Principle for an effective and generalized video representation without sacrificing any efficiency of the model. On the one hand, we design Dual-view Regularization (DR) to constrain task-irrelevant information, which can effectively compress background and irrelevant motion information. On the other hand, we design Temporal Regularization (TR) to maintain task-relevant information by finding an optimal difference between frames, which benefits extracting sufficient motion information. The experimental results demonstrate: (1) DTR is orthogonal to temporal modeling as well as data augmentation, and it achieves general improvement on both model-based and data-based approaches; (2) DTR is effective among 7 different datasets, especially on motion-centric datasets i.e. SSv1/ SSv2, in which DTR gets 6%/3.8% absolute gains in top-1 accuracy.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114960869","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic Structure Enhanced Contrastive Adversarial Hash Network for Cross-media Representation Learning
M. Liang, Junping Du, Xiaowen Cao, Yang Yu, Kangkang Lu, Zhe Xue, Min Zhang
DOI: 10.1145/3503161.3548391
Deep cross-media hashing provides an efficient cross-media representation learning solution for cross-media search. However, existing methods do not consider both fine-grained semantic features and semantic structures when mining implicit cross-media semantic associations, which leads to weaker semantic discrimination and consistency in the cross-media representations. To tackle this problem, we propose a novel semantic structure enhanced contrastive adversarial hash network for cross-media representation learning (SCAHN). First, to capture more fine-grained cross-media semantic associations, a fine-grained cross-media attention feature learning network is constructed, so that the learned saliency features of different modalities are more conducive to cross-media semantic alignment and fusion. Second, to further improve the learning of implicit cross-media semantic associations, a semantic label association graph is constructed and a graph convolutional network is utilized to mine the implicit semantic structures, thereby guiding the learning of discriminative features for different modalities. Third, a cross-media and intra-media contrastive adversarial representation learning mechanism is proposed to further enhance the semantic discriminativeness of the different modal representations, and a dual-way adversarial learning strategy is developed to maximize cross-media semantic associations, so as to obtain unified cross-media representations with stronger discriminativeness and better preservation of semantic consistency. Extensive experiments on several cross-media benchmark datasets demonstrate that the proposed SCAHN outperforms state-of-the-art methods.
{"title":"Semantic Structure Enhanced Contrastive Adversarial Hash Network for Cross-media Representation Learning","authors":"M. Liang, Junping Du, Xiaowen Cao, Yang Yu, Kangkang Lu, Zhe Xue, Min Zhang","doi":"10.1145/3503161.3548391","DOIUrl":"https://doi.org/10.1145/3503161.3548391","url":null,"abstract":"Deep cross-media hashing technology provides an efficient cross-media representation learning solution for cross-media search. However, the existing methods do not consider both fine-grained semantic features and semantic structures to mine implicit cross-media semantic associations, which leads to weaker semantic discrimination and consistency for cross-media representation. To tackle this problem, we propose a novel semantic structure enhanced contrastive adversarial hash network for cross-media representation learning (SCAHN). Firstly, in order to capture more fine-grained cross-media semantic associations, a fine-grained cross-media attention feature learning network is constructed, thus the learned saliency features of different modalities are more conducive to cross-media semantic alignment and fusion. Secondly, for further improving learning ability of implicit cross-media semantic associations, a semantic label association graph is constructed, and the graph convolutional network is utilized to mine the implicit semantic structures, thus guiding learning of discriminative features of different modalities. Thirdly, a cross-media and intra-media contrastive adversarial representation learning mechanism is proposed to further enhance the semantic discriminativeness of different modal representations, and a dual-way adversarial learning strategy is developed to maximize cross-media semantic associations, so as to obtain cross-media unified representations with stronger discriminativeness and semantic consistency preserving power. Extensive experiments on several cross-media benchmark datasets demonstrate that the proposed SCAHN outperforms the state-of-the-art methods.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115490016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised and Pseudo-Supervised Vision-Language Alignment in Visual Dialog
Feilong Chen, Duzhen Zhang, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu
DOI: 10.1145/3503161.3547776
Visual dialog requires models to give reasonable answers to a series of coherent questions grounded in related visual concepts in images. However, most current work either focuses on attention-based fusion or on pre-training with large-scale image-text pairs, ignoring the critical role of explicit vision-language alignment in visual dialog. To remedy this defect, we propose a novel unsupervised and pseudo-supervised vision-language alignment approach for visual dialog (AlignVD). First, AlignVD uses visual and dialog encoders to represent images and dialogs. It then explicitly aligns visual concepts with textual semantics via unsupervised and pseudo-supervised vision-language alignment (UVLA and PVLA): UVLA utilizes a graph autoencoder, while PVLA uses dialog-guided visual grounding to conduct the alignment. Finally, based on the aligned visual and textual representations, AlignVD produces a reasonable answer to the question via a cross-modal decoder. Extensive experiments on two large-scale visual dialog datasets demonstrate the effectiveness of vision-language alignment, and our proposed AlignVD achieves new state-of-the-art results. In addition, our single model has won first place on the visual dialog challenge leaderboard with an NDCG of 78.70, surpassing the previous best ensemble model by about 1 point.
{"title":"Unsupervised and Pseudo-Supervised Vision-Language Alignment in Visual Dialog","authors":"Feilong Chen, Duzhen Zhang, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu","doi":"10.1145/3503161.3547776","DOIUrl":"https://doi.org/10.1145/3503161.3547776","url":null,"abstract":"Visual dialog requires models to give reasonable answers according to a series of coherent questions and related visual concepts in images. However, most current work either focuses on attention-based fusion or pre-training on large-scale image-text pairs, ignoring the critical role of explicit vision-language alignment in visual dialog. To remedy this defect, we propose a novel unsupervised and pseudo-supervised vision-language alignment approach for visual dialog (AlignVD). Firstly, AlginVD utilizes the visual and dialog encoder to represent images and dialogs. Then, it explicitly aligns visual concepts with textual semantics via unsupervised and pseudo-supervised vision-language alignment (UVLA and PVLA). Specifically, UVLA utilizes a graph autoencoder, while PVLA uses dialog-guided visual grounding to conduct alignment. Finally, based on the aligned visual and textual representations, AlignVD gives a reasonable answer to the question via the cross-modal decoder. Extensive experiments on two large-scale visual dialog datasets have demonstrated the effectiveness of vision-language alignment, and our proposed AlignVD achieves new state-of-the-art results. In addition, our single model has won first place on the visual dialog challenge leaderboard with a NDCG metric of 78.70, surpassing the previous best ensemble model by about 1 point.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122820706","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-Domain and Cross-Modal Knowledge Distillation in Domain Adaptation for 3D Semantic Segmentation
Miaoyu Li, Yachao Zhang, Yuan Xie, Z. Gao, Cuihua Li, Zhizhong Zhang, Yanyun Qu
DOI: 10.1145/3503161.3547990
With the emergence of multi-modal datasets in which LiDAR and camera are synchronized and calibrated, cross-modal Unsupervised Domain Adaptation (UDA) has attracted increasing attention because it reduces the laborious annotation of target-domain samples. To alleviate the distribution gap between source and target domains, existing methods conduct feature alignment via adversarial learning; however, adversarial learning is well known to be highly sensitive to hyperparameters and difficult to train. In this paper, we propose a novel model (Dual-Cross) that integrates Cross-Domain Knowledge Distillation (CDKD) and Cross-Modal Knowledge Distillation (CMKD) to mitigate domain shift. Specifically, we design a multi-modal style transfer to convert source images and point clouds to the target style. With these synthetic samples as input, we introduce a target-aware teacher network to learn knowledge of the target domain. We then apply dual-cross knowledge distillation while the student learns on the source domain. CDKD constrains teacher and student predictions within the same modality to be consistent; it transfers target-aware knowledge from the teacher to the student, making the student more adaptive to the target domain. CMKD generates a hybrid-modal prediction from the teacher predictions and constrains it to be consistent with both the 2D and 3D student predictions, promoting information interaction between the two modalities so that they complement each other. Evaluation results on various domain adaptation settings show that Dual-Cross significantly outperforms both uni-modal and cross-modal state-of-the-art methods.
{"title":"Cross-Domain and Cross-Modal Knowledge Distillation in Domain Adaptation for 3D Semantic Segmentation","authors":"Miaoyu Li, Yachao Zhang, Yuan Xie, Z. Gao, Cuihua Li, Zhizhong Zhang, Yanyun Qu","doi":"10.1145/3503161.3547990","DOIUrl":"https://doi.org/10.1145/3503161.3547990","url":null,"abstract":"With the emergence of multi-modal datasets where LiDAR and camera are synchronized and calibrated, cross-modal Unsupervised Domain Adaptation (UDA) has attracted increasing attention because it reduces the laborious annotation of target domain samples. To alleviate the distribution gap between source and target domains, existing methods conduct feature alignment by using adversarial learning. However, it is well-known to be highly sensitive to hyperparameters and difficult to train. In this paper, we propose a novel model (Dual-Cross) that integrates Cross-Domain Knowledge Distillation (CDKD) and Cross-Modal Knowledge Distillation (CMKD) to mitigate domain shift. Specifically, we design the multi-modal style transfer to convert source image and point cloud to target style. With these synthetic samples as input, we introduce a target-aware teacher network to learn knowledge of the target domain. Then we present dual-cross knowledge distillation when the student is learning on source domain. CDKD constrains teacher and student predictions under same modality to be consistent. It can transfer target-aware knowledge from the teacher to the student, making the student more adaptive to the target domain. CMKD generates hybrid-modal prediction from the teacher predictions and constrains it to be consistent with both 2D and 3D student predictions. It promotes the information interaction between two modalities to make them complement each other. From the evaluation results on various domain adaptation settings, Dual-Cross significantly outperforms both uni-modal and cross-modal state-of-the-art methods.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122823285","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Text's Armor: Optimized Local Adversarial Perturbation Against Scene Text Editing Attacks
Tao Xiang, Hangcheng Liu, Shangwei Guo, Hantao Liu, Tianwei Zhang
DOI: 10.1145/3503161.3548103
Deep neural networks (DNNs) have shown powerful capability in scene text editing (STE). With carefully designed DNNs, one can replace text in a source image with other text while maintaining a realistic look. However, such editing tools also make it convenient for criminals to falsify documents or modify text without authorization. In this paper, we propose to actively defeat text editing attacks by designing invisible "armors" for text in the scene. We turn the adversarial vulnerability of DNN-based STE into a strength and design local perturbations (i.e., "armors") specifically for text using an optimized normalization strategy. Such local perturbations can effectively mislead STE attacks without affecting the perceptibility of the scene background. To strengthen our defense capabilities, we systematically analyze and model STE attacks and provide a precise defense method to defeat attacks at different editing stages. We conduct both subjective and objective experiments to show the superiority of our optimized local adversarial perturbation against state-of-the-art STE attacks. We also evaluate the portrait and landscape transferability of our perturbations.
{"title":"Text's Armor: Optimized Local Adversarial Perturbation Against Scene Text Editing Attacks","authors":"Tao Xiang, Hangcheng Liu, Shangwei Guo, Hantao Liu, Tianwei Zhang","doi":"10.1145/3503161.3548103","DOIUrl":"https://doi.org/10.1145/3503161.3548103","url":null,"abstract":"Deep neural networks (DNNs) have shown their powerful capability in scene text editing (STE). With carefully designed DNNs, one can alter texts in a source image with other ones while maintaining their realistic look. However, such editing tools provide a great convenience for criminals to falsify documents or modify texts without authorization. In this paper, we propose to actively defeat text editing attacks by designing invisible \"armors\" for texts in the scene. We turn the adversarial vulnerability of DNN-based STE into strength and design local perturbations (i.e., \"armors\") specifically for texts using an optimized normalization strategy. Such local perturbations can effectively mislead STE attacks without affecting the perceptibility of scene background. To strengthen our defense capabilities, we systemically analyze and model STE attacks and provide a precise defense method to defeat attacks on different editing stages. We conduct both subjective and objective experiments to show the superior of our optimized local adversarial perturbation against state-of-the-art STE attacks. We also evaluate the portrait and landscape transferability of our perturbations.","PeriodicalId":412792,"journal":{"name":"Proceedings of the 30th ACM International Conference on Multimedia","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114506232","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}