
Proceedings of the 30th ACM International Conference on Multimedia: Latest Publications

PIC'22: 4th Person in Context Workshop
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3554766
Si Liu, Qin Jin, Luoqi Liu, Zongheng Tang, Linli Lin
Understanding humans and their surrounding context is crucial for the perception of images and videos. It benefits many related applications, such as person search, virtual try-on/makeup, and abnormal action detection. In the 4th Person in Context (PIC) workshop, to further promote progress in the above-mentioned areas, we hold three human-centric perception and cognition challenges: Make-up Temporal Video Grounding (MTVG), Make-up Dense Video Caption (MDVC), and Human-centric Spatio-Temporal Video Grounding (HC-STVG). All three challenges focus on understanding human behavior, interactions, and relationships in video sequences, which requires understanding both visual and linguistic information as well as complicated multimodal reasoning. The three sub-problems are complementary and collaboratively contribute to a unified human-centric perception and cognition solution.
Citations: 0
Dream Painter: An Interactive Art Installation Bridging Audience Interaction, Robotics, and Creative AI
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3549976
Varvara Guljajeva, M. Sola
Dream Painter is an interactive robotic art installation that turns the audience's spoken dreams into a collective painting. By telling a past dream, a participant guides the interactive robotic system through the latent space of the AI model, resulting in a multicolored line drawing. The artwork consists of several parts: an interaction station, a painting robot, a kinetic and animated mechanism that moves the paper roll when a drawing is finished, and the deep learning model that transforms the spoken word into a painting. All these interconnected hardware and software components are arranged into an autonomous and interactive robotic art installation. The main aims of this project are to explore the interactive potential of AI technology and robotics and to trigger discussion of deep learning applications in a wider sense. More precisely, this case study focuses primarily on the translation between different semiotic spaces as a trigger for creativity and as an audience interaction method.
Citations: 3
SD-GAN: Semantic Decomposition for Face Image Synthesis with Discrete Attribute
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3547791
Kangneng Zhou, Xiaobin Zhu, Daiheng Gao, Kai Lee, Xinjie Li, Xu-Cheng Yin
Manipulating latent codes in generative adversarial networks (GANs) for facial image synthesis has mainly focused on continuous attribute synthesis (e.g., age, pose, and emotion), while discrete attribute synthesis (such as face masks and eyeglasses) has received less attention. Directly applying existing works to facial discrete attributes may cause inaccurate results. In this work, we propose an innovative framework to tackle challenging facial discrete attribute synthesis via semantic decomposition, dubbed SD-GAN. Concretely, we explicitly decompose the discrete attribute representation into two components, i.e., the semantic prior basis and the offset latent representation. The semantic prior basis provides an initializing direction for manipulating the face representation in the latent space. The offset latent representation, obtained by a 3D-aware semantic fusion network, is proposed to adjust the prior basis. In addition, the fusion network integrates 3D embeddings for better identity preservation and discrete attribute synthesis. The combination of the prior basis and the offset latent representation enables our method to synthesize photo-realistic face images with discrete attributes. Notably, we construct a large and valuable dataset, MEGN (Face Mask and Eyeglasses images crawled from Google and Naver), to compensate for the lack of discrete attributes in existing datasets. Extensive qualitative and quantitative experiments demonstrate the state-of-the-art performance of our method. Our code is available at an anonymous website: https://github.com/MontaEllis/SD-GAN.
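As a rough illustration of the "prior basis plus offset" decomposition described in the abstract, the PyTorch sketch below adds a learnable per-attribute direction and a sample-specific offset (predicted from fused 2D/3D-aware features) to a latent code. Module names, feature shapes, and the simple concatenation-based fusion are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DiscreteAttributeEditor(nn.Module):
    def __init__(self, latent_dim: int = 512, feat_dim: int = 256):
        super().__init__()
        # Semantic prior basis: a learnable direction for one discrete
        # attribute (e.g., face mask or eyeglasses) that initializes the edit.
        self.prior_basis = nn.Parameter(torch.randn(latent_dim))
        # Offset network: predicts a per-sample refinement from fused
        # (here simply concatenated) 2D and 3D-aware features.
        self.offset_net = nn.Sequential(
            nn.Linear(feat_dim * 2, latent_dim), nn.ReLU(),
            nn.Linear(latent_dim, latent_dim),
        )

    def forward(self, w, feat_2d, feat_3d, strength: float = 1.0):
        offset = self.offset_net(torch.cat([feat_2d, feat_3d], dim=-1))
        direction = self.prior_basis.unsqueeze(0) + offset  # prior basis + offset
        return w + strength * direction                     # edited latent code

# Usage with dummy tensors (shapes are assumed, StyleGAN-like 512-d codes):
editor = DiscreteAttributeEditor()
w = torch.randn(4, 512)
f2d, f3d = torch.randn(4, 256), torch.randn(4, 256)
print(editor(w, f2d, f3d).shape)  # torch.Size([4, 512])
```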
Citations: 1
Bi-directional Heterogeneous Graph Hashing towards Efficient Outfit Recommendation
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548020
Weili Guan, Xuemeng Song, Haoyu Zhang, Meng Liu, C. Yeh, Xiaojun Chang
Personalized outfit recommendation, which aims to recommend outfits to a given user according to his/her preferences, has gained increasing research attention due to its economic value. Nevertheless, the majority of existing methods mainly focus on improving recommendation effectiveness while overlooking recommendation efficiency. Inspired by this, we devise a novel bi-directional heterogeneous graph hashing scheme, called BiHGH, towards efficient personalized outfit recommendation. In particular, this scheme consists of three key components: heterogeneous graph node initialization, bi-directional sequential graph convolution, and hash code learning. We first unify four types of entities (i.e., users, outfits, items, and attributes) and their relations via a heterogeneous four-partite graph. To perform graph learning, we then creatively devise a bi-directional graph convolution algorithm that sequentially transfers knowledge by repeating upward and downward convolution, whereby we divide the four-partite graph into three subgraphs and each subgraph involves only two adjacent entity types. We ultimately adopt the Bayesian personalized ranking loss for user preference learning and design a dual similarity preserving regularization to prevent information loss during hash learning. Extensive experiments on the benchmark dataset demonstrate the superiority of BiHGH.
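A minimal sketch of the two training signals named in the abstract: a Bayesian Personalized Ranking (BPR) loss on relaxed user/outfit hash codes, plus a similarity-preserving term that keeps the relaxed codes close to their binarized versions. The tanh relaxation, the inner-product scoring, and the loss weight are generic assumptions, not BiHGH's exact formulation.

```python
import torch
import torch.nn.functional as F

def bpr_loss(user_code, pos_outfit_code, neg_outfit_code):
    # Preference score = inner product of (relaxed) hash codes.
    pos = (user_code * pos_outfit_code).sum(dim=-1)
    neg = (user_code * neg_outfit_code).sum(dim=-1)
    return -F.logsigmoid(pos - neg).mean()

def similarity_preserving_reg(relaxed_codes):
    # Penalize the gap between tanh-relaxed codes and their binary codes,
    # a common surrogate for limiting information loss during hashing.
    binary = torch.sign(relaxed_codes.detach())
    return F.mse_loss(relaxed_codes, binary)

# Dummy example with 16-bit codes for a batch of 8 (user, pos, neg) triplets.
u, p, n = (torch.tanh(torch.randn(8, 16, requires_grad=True)) for _ in range(3))
loss = bpr_loss(u, p, n) + 0.1 * similarity_preserving_reg(u)
print(float(loss))
```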
Citations: 11
Mixed Supervision for Instance Learning in Object Detection with Few-shot Annotation
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548242
Yi Zhong, Chengyao Wang, Shiyong Li, Zhuyun Zhou, Yaowei Wang, Weishi Zheng
Mixed supervision for object detection (MSOD), which utilizes image-level annotations and a small amount of instance-level annotations, has emerged as an efficient tool by alleviating the requirement for a large amount of costly instance-level annotations and providing effective instance supervision over previous methods that only use image-level annotations. In this work, we introduce mixed supervision instance learning (MSIL), a novel MSOD framework that leverages a handful of instance-level annotations to provide more explicit and implicit supervision. Rather than just adding instance-level annotations directly to the detection loss functions, we aim to dig out more effective explicit and implicit relations between these two different levels of annotation. In particular, we first propose the Instance-Annotation Guided Image Classification strategy to provide explicit guidance from instance-level annotations, using positional relations to force the image classifier to focus on the proposals that contain the correct object. Then, to exploit more implicit interaction between the mixed annotations, an instance reproduction strategy guided by the extra instance-level annotations is developed to generate more accurate pseudo ground truth, yielding a more discriminative detector. Finally, a false target instance mining strategy is used to refine the above processing by enriching the number and diversity of training instances with position and score information. Our experiments show that the proposed MSIL framework outperforms recent state-of-the-art mixed supervised detectors by a large margin on both the Pascal VOC2007 and the MS-COCO dataset.
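To make the mixed-supervision idea concrete, the sketch below combines an image-level classification loss over aggregated proposal scores with an instance-level loss applied only to the few proposals that carry ground-truth annotations. Max-pooling aggregation, the loss weighting, and all tensor shapes are illustrative assumptions, not the paper's exact MSIL design.

```python
import torch
import torch.nn.functional as F

def mixed_supervision_loss(proposal_logits, image_labels,
                           annotated_idx, annotated_labels, lam=1.0):
    """
    proposal_logits  : (P, C) class logits for P proposals of one image
    image_labels     : (C,)  multi-hot image-level labels
    annotated_idx    : (K,)  indices of the few instance-annotated proposals
    annotated_labels : (K,)  class indices for those proposals
    """
    # Image-level branch: aggregate proposal evidence (max pooling here).
    image_logits, _ = proposal_logits.max(dim=0)
    img_loss = F.binary_cross_entropy_with_logits(image_logits, image_labels)

    # Instance-level branch: standard cross-entropy on the annotated proposals.
    inst_loss = F.cross_entropy(proposal_logits[annotated_idx], annotated_labels)
    return img_loss + lam * inst_loss

# Dummy example: 100 proposals, 20 classes, 3 annotated instances.
logits = torch.randn(100, 20, requires_grad=True)
img_y = torch.zeros(20); img_y[[2, 5]] = 1.0
loss = mixed_supervision_loss(logits, img_y,
                              torch.tensor([0, 7, 42]), torch.tensor([2, 5, 2]))
print(float(loss))
```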
Citations: 2
DTR: An Information Bottleneck Based Regularization Framework for Video Action Recognition
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548326
Jiawei Fan, Yu Zhao, Xie Yu, Lihua Ma, Junqi Liu, Fangqiu Yi, Boxun Li
An optimal representation should contain the maximum task-relevant information and the minimum task-irrelevant information, as revealed by the Information Bottleneck Principle. In video action recognition, CNN-based approaches have obtained better spatio-temporal representations by modeling temporal context. However, these approaches still suffer from low generalization. In this paper, we propose a moderate optimization based approach called Dual-view Temporal Regularization (DTR), built on the Information Bottleneck Principle, for an effective and generalized video representation without sacrificing any model efficiency. On the one hand, we design Dual-view Regularization (DR) to constrain task-irrelevant information, which effectively compresses background and irrelevant motion information. On the other hand, we design Temporal Regularization (TR) to maintain task-relevant information by finding an optimal difference between frames, which helps extract sufficient motion information. The experimental results demonstrate that: (1) DTR is orthogonal to both temporal modeling and data augmentation, and it achieves general improvements on both model-based and data-based approaches; (2) DTR is effective on 7 different datasets, especially on motion-centric datasets, i.e., SSv1/SSv2, where DTR gains 6%/3.8% absolute improvement in top-1 accuracy.
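The sketch below shows an information-bottleneck-style training objective in the spirit of the abstract: the task loss plus (i) a compression term penalizing feature magnitude (a proxy for discarding task-irrelevant information) and (ii) a temporal term computed on frame-difference features to keep motion information. The concrete forms of both terms are illustrative assumptions, not the paper's DR/TR losses.

```python
import torch
import torch.nn.functional as F

def dtr_style_loss(frame_feats, logits, labels, alpha=1e-3, beta=1e-3):
    """
    frame_feats : (B, T, D) per-frame features from the video backbone
    logits      : (B, num_classes) action predictions
    labels      : (B,) ground-truth action classes
    """
    task_loss = F.cross_entropy(logits, labels)

    # Compression term: keep features compact (minimal task-irrelevant info).
    compress = frame_feats.pow(2).mean()

    # Temporal term: reward informative frame-to-frame differences rather
    # than static, background-dominated features.
    diff = frame_feats[:, 1:] - frame_feats[:, :-1]
    temporal = -diff.abs().mean()

    return task_loss + alpha * compress + beta * temporal

# Dummy example: batch of 2 clips, 8 frames, 256-d features, 10 classes.
feats = torch.randn(2, 8, 256, requires_grad=True)
logits = torch.randn(2, 10, requires_grad=True)
print(float(dtr_style_loss(feats, logits, torch.tensor([3, 7]))))
```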
Citations: 1
Semantic Structure Enhanced Contrastive Adversarial Hash Network for Cross-media Representation Learning
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548391
M. Liang, Junping Du, Xiaowen Cao, Yang Yu, Kangkang Lu, Zhe Xue, Min Zhang
Deep cross-media hashing technology provides an efficient cross-media representation learning solution for cross-media search. However, existing methods do not consider both fine-grained semantic features and semantic structures to mine implicit cross-media semantic associations, which leads to weaker semantic discrimination and consistency in cross-media representations. To tackle this problem, we propose a novel semantic structure enhanced contrastive adversarial hash network for cross-media representation learning (SCAHN). First, to capture more fine-grained cross-media semantic associations, a fine-grained cross-media attention feature learning network is constructed, so that the learned saliency features of different modalities are more conducive to cross-media semantic alignment and fusion. Second, to further improve the learning of implicit cross-media semantic associations, a semantic label association graph is constructed, and a graph convolutional network is utilized to mine the implicit semantic structures, thus guiding the learning of discriminative features of different modalities. Third, a cross-media and intra-media contrastive adversarial representation learning mechanism is proposed to further enhance the semantic discriminativeness of different modal representations, and a dual-way adversarial learning strategy is developed to maximize cross-media semantic associations, so as to obtain unified cross-media representations with stronger discriminativeness and semantic-consistency preserving power. Extensive experiments on several cross-media benchmark datasets demonstrate that the proposed SCAHN outperforms state-of-the-art methods.
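A minimal sketch of a cross-media contrastive objective of the kind the abstract refers to: image and text hash representations of the same sample are pulled together while other pairs in the batch are pushed apart (InfoNCE-style). The symmetric formulation, tanh relaxation, and temperature value are generic assumptions, not SCAHN's exact loss.

```python
import torch
import torch.nn.functional as F

def cross_media_contrastive(img_codes, txt_codes, temperature=0.1):
    img = F.normalize(img_codes, dim=-1)
    txt = F.normalize(txt_codes, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img.size(0), device=img.device)
    # Symmetric: image-to-text and text-to-image directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Dummy example with 32-bit relaxed hash codes for a batch of 8 image-text pairs.
img_h = torch.tanh(torch.randn(8, 32, requires_grad=True))
txt_h = torch.tanh(torch.randn(8, 32, requires_grad=True))
print(float(cross_media_contrastive(img_h, txt_h)))
```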
Citations: 4
Unsupervised and Pseudo-Supervised Vision-Language Alignment in Visual Dialog
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3547776
Feilong Chen, Duzhen Zhang, Xiuyi Chen, Jing Shi, Shuang Xu, Bo Xu
Visual dialog requires models to give reasonable answers according to a series of coherent questions and related visual concepts in images. However, most current work either focuses on attention-based fusion or on pre-training with large-scale image-text pairs, ignoring the critical role of explicit vision-language alignment in visual dialog. To remedy this defect, we propose a novel unsupervised and pseudo-supervised vision-language alignment approach for visual dialog (AlignVD). First, AlignVD utilizes visual and dialog encoders to represent images and dialogs. Then, it explicitly aligns visual concepts with textual semantics via unsupervised and pseudo-supervised vision-language alignment (UVLA and PVLA). Specifically, UVLA utilizes a graph autoencoder, while PVLA uses dialog-guided visual grounding to conduct alignment. Finally, based on the aligned visual and textual representations, AlignVD gives a reasonable answer to the question via the cross-modal decoder. Extensive experiments on two large-scale visual dialog datasets have demonstrated the effectiveness of vision-language alignment, and our proposed AlignVD achieves new state-of-the-art results. In addition, our single model has won first place on the visual dialog challenge leaderboard with an NDCG of 78.70, surpassing the previous best ensemble model by about 1 point.
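As a rough illustration of an explicit vision-language alignment term, the sketch below softly matches each textual concept embedding to its most similar visual region embedding and maximizes the attended similarity. This soft-matching form is an illustrative assumption and is not the UVLA/PVLA objectives described above.

```python
import torch
import torch.nn.functional as F

def alignment_loss(region_feats, concept_feats, temperature=0.07):
    """
    region_feats  : (R, D) visual region embeddings for one image
    concept_feats : (K, D) textual concept embeddings from the dialog
    """
    regions = F.normalize(region_feats, dim=-1)
    concepts = F.normalize(concept_feats, dim=-1)
    sim = concepts @ regions.t() / temperature      # (K, R) similarities
    # Soft attention over regions for each concept, then maximize the
    # attended similarity (higher similarity -> lower loss).
    attended = (sim.softmax(dim=-1) * sim).sum(dim=-1)
    return -attended.mean()

# Dummy example: 36 regions, 5 grounded concepts, 512-d embeddings.
r = torch.randn(36, 512, requires_grad=True)
c = torch.randn(5, 512, requires_grad=True)
print(float(alignment_loss(r, c)))
```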
Citations: 3
Cross-Domain and Cross-Modal Knowledge Distillation in Domain Adaptation for 3D Semantic Segmentation
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3547990
Miaoyu Li, Yachao Zhang, Yuan Xie, Z. Gao, Cuihua Li, Zhizhong Zhang, Yanyun Qu
With the emergence of multi-modal datasets in which LiDAR and camera are synchronized and calibrated, cross-modal Unsupervised Domain Adaptation (UDA) has attracted increasing attention because it reduces the laborious annotation of target domain samples. To alleviate the distribution gap between source and target domains, existing methods conduct feature alignment via adversarial learning. However, adversarial learning is well known to be highly sensitive to hyperparameters and difficult to train. In this paper, we propose a novel model (Dual-Cross) that integrates Cross-Domain Knowledge Distillation (CDKD) and Cross-Modal Knowledge Distillation (CMKD) to mitigate domain shift. Specifically, we design a multi-modal style transfer to convert the source image and point cloud to the target style. With these synthetic samples as input, we introduce a target-aware teacher network to learn knowledge of the target domain. We then perform dual-cross knowledge distillation while the student is learning on the source domain. CDKD constrains teacher and student predictions under the same modality to be consistent. It can transfer target-aware knowledge from the teacher to the student, making the student more adaptive to the target domain. CMKD generates a hybrid-modal prediction from the teacher predictions and constrains it to be consistent with both the 2D and 3D student predictions. It promotes information interaction between the two modalities so that they complement each other. Evaluation results on various domain adaptation settings show that Dual-Cross significantly outperforms both uni-modal and cross-modal state-of-the-art methods.
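The sketch below expresses the two distillation terms named above as KL-divergence consistency losses on per-point segmentation predictions: same-modality teacher/student consistency (CDKD) and a hybrid-modal teacher prediction supervising both student branches (CMKD). The averaging used to form the hybrid prediction and the use of KL divergence are illustrative assumptions, not the paper's exact losses.

```python
import torch
import torch.nn.functional as F

def kl_consistency(student_logits, teacher_logits, T=1.0):
    # Standard soft-target distillation between two sets of predictions.
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)

def dual_cross_distillation(s2d, s3d, t2d, t3d):
    """
    s2d, s3d : student logits from the 2D (image) and 3D (point cloud) branches
    t2d, t3d : target-aware teacher logits for the same points, (N, C) each
    """
    # Cross-domain KD: same-modality teacher/student consistency.
    cdkd = kl_consistency(s2d, t2d) + kl_consistency(s3d, t3d)
    # Cross-modal KD: a hybrid-modal teacher prediction supervises both branches.
    hybrid = 0.5 * (F.softmax(t2d, dim=-1) + F.softmax(t3d, dim=-1))
    cmkd = F.kl_div(F.log_softmax(s2d, dim=-1), hybrid, reduction="batchmean") + \
           F.kl_div(F.log_softmax(s3d, dim=-1), hybrid, reduction="batchmean")
    return cdkd + cmkd

# Dummy example: 1024 points, 10 semantic classes.
logits = [torch.randn(1024, 10, requires_grad=True) for _ in range(4)]
print(float(dual_cross_distillation(*logits)))
```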
Citations: 7
Text's Armor: Optimized Local Adversarial Perturbation Against Scene Text Editing Attacks
Pub Date : 2022-10-10 DOI: 10.1145/3503161.3548103
Tao Xiang, Hangcheng Liu, Shangwei Guo, Hantao Liu, Tianwei Zhang
Deep neural networks (DNNs) have shown powerful capability in scene text editing (STE). With carefully designed DNNs, one can replace texts in a source image with other ones while maintaining a realistic look. However, such editing tools make it very convenient for criminals to falsify documents or modify texts without authorization. In this paper, we propose to actively defeat text editing attacks by designing invisible "armors" for texts in the scene. We turn the adversarial vulnerability of DNN-based STE into a strength and design local perturbations (i.e., "armors") specifically for texts using an optimized normalization strategy. Such local perturbations can effectively mislead STE attacks without affecting the perceptibility of the scene background. To strengthen our defense capabilities, we systematically analyze and model STE attacks and provide a precise defense method to defeat attacks at different editing stages. We conduct both subjective and objective experiments to show the superiority of our optimized local adversarial perturbation against state-of-the-art STE attacks. We also evaluate the portrait and landscape transferability of our perturbations.
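For intuition, the sketch below shows a generic PGD-style optimization of a perturbation restricted to text regions: the perturbation lives only inside a binary text mask, is updated to maximize a loss on the attacked editing model, and is clipped to an L-infinity budget. The attacked model, the loss, and the budget are placeholders; this is not the paper's optimized normalization strategy.

```python
import torch

def local_perturbation(image, text_mask, loss_fn, steps=40, eps=8/255, alpha=2/255):
    """
    image     : (1, 3, H, W) source image in [0, 1]
    text_mask : (1, 1, H, W) binary mask, 1 on text pixels
    loss_fn   : callable mapping a perturbed image to a scalar to maximize
                (e.g., a scene-text-editing model's reconstruction error)
    """
    delta = torch.zeros_like(image, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(torch.clamp(image + delta * text_mask, 0, 1))
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # gradient ascent step
            delta.clamp_(-eps, eps)              # L-infinity budget
            delta.grad.zero_()
    return (delta * text_mask).detach()          # perturbation only on text pixels

# Dummy usage: maximize a toy loss on a random image with a random text mask.
img = torch.rand(1, 3, 64, 256)
mask = (torch.rand(1, 1, 64, 256) > 0.8).float()
armor = local_perturbation(img, mask, lambda x: (x ** 2).mean())
print(armor.abs().max())
```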
Citations: 3