Video summarization, which enables fast browsing of the large amount of emerging video data and saves storage cost, has attracted tremendous attention in machine learning and information retrieval. Among existing efforts, determinantal point processes (DPPs), which select a subset of video frames to represent the whole video, have shown great success in video summarization. However, existing methods perform poorly at generating fixed-size output summaries, especially when video frames arrive in a streaming manner. In this paper, we provide an efficient approach, k-seqLS, which summarizes streaming video data with a fixed size k in the vein of DPPs. k-seqLS fully exploits the sequential nature of video frames by setting a time window so that frames outside the window have no influence on the current frame. Since the logarithm of the DPP probability of each subset of frames is a non-monotone submodular function, local search and greedy techniques with a cardinality constraint are adopted to make k-seqLS fixed-size and efficient, with a theoretical guarantee. Our experiments show that k-seqLS achieves higher summarization quality while maintaining practical running time.
{"title":"Fixed-size video summarization over streaming data via non-monotone submodular maximization","authors":"Ganfeng Lu, Jiping Zheng","doi":"10.1145/3444685.3446285","DOIUrl":"https://doi.org/10.1145/3444685.3446285","url":null,"abstract":"Video summarization which potentially fast browses a large amount of emerging video data as well as saves storage cost has attracted tremendous attentions in machine learning and information retrieval. Among existing efforts, determinantal point processes (DPPs) designed for selecting a subset of video frames to represent the whole video have shown great success in video summarization. However, existing methods have shown poor performance to generate fixed-size output summaries for video data, especially when video frames arrive in streaming manner. In this paper, we provide an efficient approach k-seqLS which summarizes streaming video data with a fixed-size k in vein of DPPs. Our k-seqLS approach can fully exploit the sequential nature of video frames by setting a time window and the frames outside the window have no influence on current video frame. Since the log-style of the DPP probability for each subset of frames is a non-monotone submodular function, local search as well as greedy techniques with cardinality constraints are adopted to make k-seqLS fixed-sized, efficient and with theoretical guarantee. Our experiments show that our proposed k-seqLS exhibits higher performance while maintaining practical running time.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121059537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
To achieve more accurate 2D human pose estimation, we extend the successful encoder-decoder network, the simple baseline network (SBN), in three ways. To reduce the quantization errors caused by the large output stride, two more decoder modules are appended to the end of the simple baseline network to obtain full output resolution. Then, global context blocks (GCBs) are added to the encoder and decoder modules to enhance them with global context features. Furthermore, we propose a novel spatial-attention-based multi-scale feature collection and distribution module (SA-MFCD) to fuse and distribute multi-scale features to boost pose estimation. Experimental results on the MS COCO dataset indicate that our network remarkably improves the accuracy of human pose estimation over SBN; our network using ResNet34 as the backbone even achieves the same accuracy as SBN with ResNet152, and our networks achieve superior results with larger backbone networks.
{"title":"Full-resolution encoder-decoder networks with multi-scale feature fusion for human pose estimation","authors":"Jie Ou, Mingjian Chen, Hong Wu","doi":"10.1145/3444685.3446282","DOIUrl":"https://doi.org/10.1145/3444685.3446282","url":null,"abstract":"To achieve more accurate 2D human pose estimation, we extend the successful encoder-decoder network, simple baseline network (SBN), in three ways. To reduce the quantization errors caused by the large output stride size, two more decoder modules are appended to the end of the simple baseline network to get full output resolution. Then, the global context blocks (GCBs) are added to the encoder and decoder modules to enhance them with global context features. Furthermore, we propose a novel spatial-attention-based multi-scale feature collection and distribution module (SA-MFCD) to fuse and distribute multi-scale features to boost the pose estimation. Experimental results on the MS COCO dataset indicate that our network can remarkably improve the accuracy of human pose estimation over SBN, our network using ResNet34 as the backbone network can even achieve the same accuracy as SBN with ResNet152, and our networks can achieve superior results with big backbone networks.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115592855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaoyi Zhang, Zheng Wang, Xing Xu, Jiwei Wei, Yang Yang
Scene graph generation intends to build a graph-based representation from images, where nodes and edges respectively represent objects and the relationships between them. However, scene graph generation today is heavily limited by imbalanced class prediction. Specifically, most existing work achieves satisfactory performance on simple and frequent relation classes (e.g. on), yet performs poorly on fine-grained and infrequent ones (e.g. walk on, stand on). To tackle this problem, in this paper, we redesign the framework as two branches, a representation learning branch and a classifier learning branch, for a more balanced scene graph generator. For the representation learning branch, we propose the Cross-modal Attention Coordinator (CAC) to gather consistent features from multiple modalities using dynamic attention. For the classifier learning branch, we first transfer knowledge of relation classes from a large-scale corpus, and then leverage a Multi-Relationship classifier via Graph Attention neTworks (MR-GAT) to bridge the gap between frequent and infrequent relations. Comprehensive experimental results on the challenging VG200 dataset indicate the competitiveness and significant superiority of our proposed approach.
{"title":"Scene graph generation via multi-relation classification and cross-modal attention coordinator","authors":"Xiaoyi Zhang, Zheng Wang, Xing Xu, Jiwei Wei, Yang Yang","doi":"10.1145/3444685.3446276","DOIUrl":"https://doi.org/10.1145/3444685.3446276","url":null,"abstract":"Scene graph generation intends to build graph-based representation from images, where nodes and edges respectively represent objects and relationships between them. However, scene graph generation today is heavily limited by imbalanced class prediction. Specifically, most of existing work achieves satisfying performance on simple and frequent relation classes (e.g. on), yet leaving poor performance with fine-grained and infrequent ones (e.g. walk on, stand on). To tackle this problem, in this paper, we redesign the framework as two branches, representation learning branch and classifier learning branch, for a more balanced scene graph generator. Furthermore, for representation learning branch, we propose Cross-modal Attention Coordinator (CAC) to gather consistent features from multi-modal using dynamic attention. For classifier learning branch, we first transfer relation classes' knowledge from large scale corpus, then we leverage Multi-Relationship classifier via Graph Attention neTworks (MR-GAT) to bridge the gap between frequent relations and infrequent ones. The comprehensive experimental results on VG200, a challenge dataset, indicate the competitiveness and the significant superiority of our proposed approach.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114960445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Unsupervised deep hashing is a promising technique for large-scale image retrieval, as it exploits powerful deep neural networks and does not depend on labels. However, unsupervised deep hashing needs to train a large number of deep neural network parameters, which are hard to optimize when no labeled training samples are provided. How to maintain the good scalability of unsupervised hashing while exploiting the advantages of deep neural networks is an interesting but challenging problem. With this motivation, in this paper, we propose a simple but effective Inter-image Relation Graph Neural Network Hashing (IRGNNH) method. Different from existing complex models, we discover the latent inter-image semantic relations without any manual labels and further exploit them to assist the unsupervised deep hashing process. Specifically, we first parse the images to extract the latent semantics involved. Then, a relation graph convolutional network is constructed to model the inter-image semantic relations and visual similarity, generating representation vectors for image relations and contents. Finally, adversarial learning is performed to seamlessly embed the constructed relations into the image hash learning process and improve the discriminative capability of the hash codes. Experiments demonstrate that our method significantly outperforms state-of-the-art unsupervised deep hashing methods in both retrieval accuracy and efficiency.
{"title":"Efficient inter-image relation graph neural network hashing for scalable image retrieval","authors":"Hui Cui, Lei Zhu, Wentao Tan","doi":"10.1145/3444685.3446321","DOIUrl":"https://doi.org/10.1145/3444685.3446321","url":null,"abstract":"Unsupervised deep hashing is a promising technique for large-scale image retrieval, as it equips powerful deep neural networks and has advantage on label independence. However, the unsupervised deep hashing process needs to train a large amount of deep neural network parameters, which is hard to optimize when no labeled training samples are provided. How to maintain the well scalability of unsupervised hashing while exploiting the advantage of deep neural network is an interesting but challenging problem to investigate. With the motivation, in this paper, we propose a simple but effective Inter-image Relation Graph Neural Network Hashing (IRGNNH) method. Different from all existing complex models, we discover the latent inter-image semantic relations without any manual labels and exploit them further to assist the unsupervised deep hashing process. Specifically, we first parse the images to extract latent involved semantics. Then, relation graph convolutional network is constructed to model the inter-image semantic relations and visual similarity, which generates representation vectors for image relations and contents. Finally, adversarial learning is performed to seamlessly embed the constructed relations into the image hash learning process, and improve the discriminative capability of the hash codes. Experiments demonstrate that our method significantly outperforms the state-of-the-art unsupervised deep hashing methods on both retrieval accuracy and efficiency.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129932555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this research, the authors created facial expressions with the minimum elements necessary for recognizing a face. The elements are two eyes and a mouth made from precise circles, which are transformed geometrically into facial expressions through rotation and vertical scaling. The facial expression patterns made by these geometric elements and transformations were composed along three dimensions of visual information suggested by many previous studies: slantedness of the mouth, openness of the face, and slantedness of the eyes. The authors found that these minimal facial expressions can be classified into 10 emotions: happy, angry, sad, disgust, fear, surprised, angry*, fear*, neutral (pleasant) indicating positive emotion, and neutral (unpleasant) indicating negative emotion. The authors also investigate and report cultural differences in impressions of the facial expressions of the above-mentioned simplified face.
{"title":"Cross-cultural design of facial expressions for humanoids: is there cultural difference between Japan and Denmark?","authors":"I. Kanaya, Meina Tawaki, Keiko Yamamoto","doi":"10.1145/3444685.3446294","DOIUrl":"https://doi.org/10.1145/3444685.3446294","url":null,"abstract":"In this research, the authors succeeded in creating facial expressions made with the minimum necessary elements for recognizing a face. The elements are two eyes and a mouth made using precise circles, which are transformed to make facial expressions geometrically, through rotation and vertically scaling transformation. The facial expression patterns made by the geometric elements and transformations were composed employing three dimensions of visual information that had been suggested by many previous researches, slantedness of the mouth, openness of the face, and slantedness of the eyes. The authors found that this minimal facial expressions can be classified into 10 emotions: happy, angry, sad, disgust, fear, surprised, angry*, fear*, neutral (pleasant) indicating positive emotion, and neutral (unpleasant) indicating negative emotion. The authors also investigate and report cultural differences of impressions of facial expressions of above-mentioned simplified face.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"241 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117017056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Video style transfer aims to synthesize a stylized video that has a content structure similar to a content video and is rendered in the style of a style image. Existing video style transfer methods cannot simultaneously achieve high efficiency, arbitrary styles and temporal consistency. In this paper, we propose the first real-time arbitrary video style transfer method using only one model. Specifically, we utilize a three-network architecture consisting of a prediction network, a stylization network and a loss network. The prediction network extracts style parameters from a given style image; the stylization network generates the corresponding stylized video; the loss network trains the prediction and stylization networks with a loss function that includes content loss, style loss and temporal consistency loss. We conduct three experiments and a user study to test the effectiveness of our method. The experimental results show that our method outperforms the state-of-the-art.
{"title":"Real-time arbitrary video style transfer","authors":"Xingyu Liu, Zongxing Ji, Piao Huang, Tongwei Ren","doi":"10.1145/3444685.3446301","DOIUrl":"https://doi.org/10.1145/3444685.3446301","url":null,"abstract":"Video style transfer aims to synthesize a stylized video that has similar content structure with a content video and is rendered in the style of a style image. The existing video style transfer methods cannot simultaneously realize high efficiency, arbitrary style and temporal consistency. In this paper, we propose the first real-time arbitrary video style transfer method with only one model. Specifically, we utilize a three-network architecture consisting of a prediction network, a stylization network and a loss network. Prediction network is used for extracting style parameters from a given style image; Stylization network is for generating the corresponding stylized video; Loss network is for training prediction network and stylization network, whose loss function includes content loss, style loss and temporal consistency loss. We conduct three experiments and a user study to test the effectiveness of our method. The experimental results show that our method outperforms the state-of-the-arts.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129210123","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yi-Bin Cheng, Xipeng Chen, Dongyu Zhang, Liang Lin
With the development of deep learning, skeleton-based action recognition has achieved great progress in recent years. However, most current works focus on extracting more informative spatial representations of the human body, but have not made full use of the temporal dependencies already contained in sequences of human actions. To this end, we propose a novel transformer-based model called Motion-Transformer to sufficiently capture temporal dependencies via self-supervised pre-training on sequences of human actions. In addition, we propose to predict the motion flow of human skeletons to better learn the temporal dependencies in a sequence. The pre-trained model is then fine-tuned on the action recognition task. Experimental results on the large-scale NTU RGB+D dataset show that our model is effective in modeling temporal relations and that the flow prediction pre-training helps expose the inherent dependencies along the time dimension. With this pre-training and fine-tuning paradigm, our final model outperforms previous state-of-the-art methods.
{"title":"Motion-transformer: self-supervised pre-training for skeleton-based action recognition","authors":"Yi-Bin Cheng, Xipeng Chen, Dongyu Zhang, Liang Lin","doi":"10.1145/3444685.3446289","DOIUrl":"https://doi.org/10.1145/3444685.3446289","url":null,"abstract":"With the development of deep learning, skeleton-based action recognition has achieved great progress in recent years. However, most of the current works focus on extracting more informative spatial representations of the human body, but haven't made full use of the temporal dependencies already contained in the sequence of human action. To this end, we propose a novel transformer-based model called Motion-Transformer to sufficiently capture the temporal dependencies via self-supervised pre-training on the sequence of human action. Besides, we propose to predict the motion flow of human skeletons for better learning the temporal dependencies in sequence. The pre-trained model is then fine-tuned on the task of action recognition. Experimental results on the large scale NTU RGB+D dataset shows our model is effective in modeling temporal relation, and the flow prediction pre-training is beneficial to expose the inherent dependencies in time dimensional. With this pre-training and fine-tuning paradigm, our final model outperforms previous state-of-the-art methods.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131402574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The past few years have seen an increase in the number of products that use AR and VR, as well as the emergence of products that combine both categories, i.e. mixed reality (MR). However, current systems are exclusive to a market in the top 1% of the population in most countries due to the expensive and heavy technology they require. This project showcases a smartphone-based mixed reality system through an interior design solution that allows the user to visualise their design choices through the lens of a smartphone. Our system uses image processing algorithms to perceive room dimensions, alongside a GUI that allows a user to create their own blueprints. Navigable 3D models are created from these blueprints, allowing users to view their builds. Users then switch to the mobile application to visualise their ideas in their own homes (MR). This system and proof of concept showcases the potential of MR as a field that can reach a larger portion of the population through a more efficient medium.
{"title":"Synthesized 3D models with smartphone based MR to modify the PreBuilt environment: interior design","authors":"Anish Bhardwaj, N. Chauhan, R. Shah","doi":"10.1145/3444685.3446251","DOIUrl":"https://doi.org/10.1145/3444685.3446251","url":null,"abstract":"The past few years have seen an increase in the number of products that use AR and VR as well as the emergence of products in both these categories i.e. Mixed Reality. However, current systems are exclusive to a market that exists in the top 1% of the population in most countries due to the expensive and heavy technology required by these systems. This project showcases a system in the field of Smartphone Based Mixed Reality through an Interior Design Solution that allows the user to visualise their design choices through the lens of a smartphone. Our system uses Image Processing algorithms to perceive room dimensions alongside a GUI which allows a user to create their own blueprints. Navigable 3D models are created from these blueprints, allowing users to view their builds. Following this, Users switch to the mobile application for the purpose of visualising their ideas in their own homes (MR). This System/POC showcases the potential of MR as a field that can be explored for a larger portion of the population through a more efficient medium.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124845606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bohong Yang, Kai Meng, Hong Lu, Xinyao Nie, Guanhao Huang, Jingjing Luo, Xing Zhu
Pulse localization is the basic task in pulse diagnosis with a robot. More accurate localization can reduce the misdiagnosis caused by different types of pulse. Traditional works usually use a collection surface with a certain area for contact detection and move the collection surface to collect changes in pressure for pulse localization. These methods often require the subjects to place their wrists in a given position. In this paper, we propose a novel pulse localization method that uses an infrared camera as the input sensor and locates the pulse on the wrist with a neural network. This method not only reduces the contact between the machine and the subject and the discomfort of the process, but also reduces the preparation time for the test, which improves detection efficiency. The experiments show that our proposed method can locate the pulse with high accuracy, and we have applied it to a pulse diagnosis robot for pulse data collection.
{"title":"Pulse localization networks with infrared camera","authors":"Bohong Yang, Kai Meng, Hong Lu, Xinyao Nie, Guanhao Huang, Jingjing Luo, Xing Zhu","doi":"10.1145/3444685.3446318","DOIUrl":"https://doi.org/10.1145/3444685.3446318","url":null,"abstract":"Pulse localization is the basic task of the pulse diagnosis with robot. More accurate location can reduce the misdiagnosis caused by different types of pulse. Traditional works usually use a collection surface with a certain area for contact detection, and move the collection surface to collect changes of power for pulse localization. These methods often require the subjects place their wrist in a given position. In this paper, we propose a novel pulse localization method which uses the infrared camera as the input sensor, and locates the pulse on wrist with the neural network. This method can not only reduce the contact between the machine and the subject, reduce the discomfort of the process, but also reduce the preparation time for the test, which can improve the detection efficiency. The experiments show that our proposed method can locate the pulse with high accuracy. And we have applied this method to pulse diagnosis robot for pulse data collection.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122562301","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yangchao Wang, Shiyuan He, Xing Xu, Yang Yang, Jingjing Li, Heng Tao Shen
Cross-modal retrieval aims at enabling flexible retrieval across different modalities. The core of cross-modal retrieval is to learn projections for different modalities and make instances in the learned common subspace comparable to each other. Self-supervised learning automatically creates a supervision signal by transforming the input data and learns semantic features by training to predict the artificial labels. In this paper, we propose a novel method named Self-Supervised Adversarial Learning (SSAL) for cross-modal retrieval, which deploys self-supervised learning and adversarial learning to seek an effective common subspace. A feature projector tries to generate modality-invariant representations in the common subspace that can confuse an adversarial discriminator consisting of two classifiers. One classifier aims to predict the rotation angle from image representations, while the other tries to discriminate between different modalities from the learned embeddings. By confusing the self-supervised adversarial model, the feature projector filters out abundant high-level visual semantics and learns image embeddings that are better aligned with the text modality in the common subspace. Through the joint exploitation of the above, an effective common subspace is learned, in which representations of different modalities are better aligned and the common information of different modalities is well preserved. Comprehensive experimental results on three widely-used benchmark datasets show that the proposed method is superior in cross-modal retrieval and significantly outperforms existing cross-modal retrieval methods.
{"title":"Self-supervised adversarial learning for cross-modal retrieval","authors":"Yangchao Wang, Shiyuan He, Xing Xu, Yang Yang, Jingjing Li, Heng Tao Shen","doi":"10.1145/3444685.3446269","DOIUrl":"https://doi.org/10.1145/3444685.3446269","url":null,"abstract":"Cross-modal retrieval aims at enabling flexible retrieval across different modalities. The core of cross-modal retrieval is to learn projections for different modalities and make instances in the learned common subspace comparable to each other. Self-supervised learning automatically creates a supervision signal by transformation of input data and learns semantic features by training to predict the artificial labels. In this paper, we proposed a novel method named Self-Supervised Adversarial Learning (SSAL) for Cross-Modal Retrieval, which deploys self-supervised learning and adversarial learning to seek an effective common subspace. A feature projector tries to generate modality-invariant representations in the common subspace that can confuse an adversarial discriminator consists of two classifiers. One of the classifiers aims to predict rotation angle from image representations, while the other classifier tries to discriminate between different modalities from the learned embeddings. By confusing the self-supervised adversarial model, feature projector filters out the abundant high-level visual semantics and learns image embeddings that are better aligned with text modality in the common subspace. Through the joint exploitation of the above, an effective common subspace is learned, in which representations of different modlities are aligned better and common information of different modalities is well preserved. Comprehensive experimental results on three widely-used benchmark datasets show that the proposed method is superior in cross-modal retrieval and significantly outperforms the existing cross-modal retrieval methods.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129551921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}