
ACM Multimedia Asia: Latest Publications

Structural Knowledge Organization and Transfer for Class-Incremental Learning
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490598
Yu Liu, Xiaopeng Hong, Xiaoyu Tao, Songlin Dong, Jingang Shi, Yihong Gong
Deep models are vulnerable to catastrophic forgetting when fine-tuned on new data. Popular distillation-based methods usually neglect the relations between data samples and may eventually forget essential structural knowledge. To address these shortcomings, we propose an incremental learning framework based on structural graph knowledge distillation that preserves both the positions of samples and their relations. Firstly, a memory knowledge graph (MKG) is generated to fully characterize the structural knowledge of historical tasks. Secondly, we develop a graph interpolation mechanism to enrich the domain of knowledge and alleviate the inter-class sample imbalance issue. Thirdly, we introduce structural graph knowledge distillation to transfer the knowledge of historical tasks. Comprehensive experiments on three datasets validate the proposed method.
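The abstract describes preserving both sample positions and inter-sample relations during distillation. The sketch below is a minimal PyTorch illustration of such a structural (relation-preserving) distillation loss, assuming paired embeddings of the same memory samples from the frozen old model and the current model; it is an illustrative stand-in, not the authors' exact formulation, which additionally builds a memory knowledge graph and a graph interpolation mechanism.

import torch
import torch.nn.functional as F

def structural_distillation_loss(old_feats: torch.Tensor,
                                 new_feats: torch.Tensor,
                                 alpha: float = 1.0) -> torch.Tensor:
    # old_feats / new_feats: (N, D) embeddings of the same memory samples from
    # the frozen old model and the current model, respectively.
    # Position term: keep each sample close to where the old model placed it.
    pos_loss = F.mse_loss(new_feats, old_feats)
    # Relation term: preserve the pairwise distance structure (the "graph edges").
    old_rel = torch.cdist(old_feats, old_feats, p=2)
    new_rel = torch.cdist(new_feats, new_feats, p=2)
    rel_loss = F.smooth_l1_loss(new_rel, old_rel)
    return pos_loss + alpha * rel_loss

if __name__ == "__main__":
    old = torch.randn(32, 128)
    new = old + 0.05 * torch.randn(32, 128)  # slightly drifted embeddings
    print(structural_distillation_loss(old, new).item())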
Citations: 3
MIRecipe: A Recipe Dataset for Stage-Aware Recognition of Changes in Appearance of Ingredients
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490596
Yixin Zhang, Yoko Yamakata, Keishi Tajima
In this paper, we introduce a new recipe dataset, MIRecipe (Multimedia-Instructional Recipe). It provides both text and image data for every cooking step, whereas conventional recipe datasets contain only final dish images and/or images for only some of the steps. It consists of 26,725 recipes comprising 239,973 steps in total. The recognition of ingredients in images associated with cooking steps poses a new challenge: since ingredients are processed during cooking, the appearance of the same ingredient differs greatly between the beginning and finishing stages of cooking. General object recognition methods, which assume that objects have a constant appearance, do not perform well on such objects. To solve the problem, we propose two stage-aware techniques: stage-wise model learning, which trains a separate model for each stage, and stage-aware curriculum learning, which starts with the training data from the beginning stage and proceeds to the later stages. Experiments on our dataset show that our method achieves higher accuracy than a model trained on all the data without considering the stages. Our dataset is available at our GitHub repository.
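As a rough illustration of the stage-aware curriculum idea (training begins with data from the beginning stage, and later stages are added progressively), the following Python sketch builds such a schedule. The dataset fields and stage indexing are assumptions for illustration, not the authors' actual pipeline.

import random
from typing import List, Tuple

Sample = Tuple[str, int]  # (image_path, stage_index); stage 0 = beginning of the recipe

def curriculum_schedule(samples: List[Sample], num_stages: int, epochs_per_stage: int):
    # Yield (epoch, active_pool) pairs; the training pool grows one stage at a time.
    epoch = 0
    for max_stage in range(num_stages):
        pool = [s for s in samples if s[1] <= max_stage]
        for _ in range(epochs_per_stage):
            random.shuffle(pool)
            yield epoch, pool
            epoch += 1

if __name__ == "__main__":
    data = [(f"step_{i}.jpg", i % 3) for i in range(9)]  # toy 3-stage dataset
    for epoch, pool in curriculum_schedule(data, num_stages=3, epochs_per_stage=1):
        print(epoch, sorted({stage for _, stage in pool}))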
Citations: 1
Convolutional Neural Network-Based Pure Paint Pigment Identification Using Hyperspectral Images
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3495641
Ailin Chen, R. Jesus, M. Vilarigues
This research presents the results of applying deep neural networks to the identification of pure pigments in heritage artwork, namely paintings. Our paper applies an innovative three-branch deep learning model to maximise the correct identification of pure pigments. The proposed model combines feature maps obtained from hyperspectral images through multiple convolutional neural networks with numerical hyperspectral metric data computed with respect to a set of reference reflectances. The results obtained accurately represent the predicted pure pigments, which are confirmed through the use of analytical techniques. The presented model outperformed the compared counterparts and marks an important direction, not only for the use of hyperspectral and pigment data in heritage analysis, but also for the application of deep learning in other fields.
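Below is a minimal PyTorch sketch of the general idea: convolutional features from a hyperspectral patch are fused with a numeric branch over reflectance-metric features before classification. The layer sizes, the two-branch simplification (rather than the paper's three branches), and the late-fusion scheme are illustrative assumptions only.

import torch
import torch.nn as nn

class MultiBranchPigmentNet(nn.Module):
    def __init__(self, bands: int, metric_dim: int, num_pigments: int):
        super().__init__()
        # Convolutional branch over a hyperspectral patch (spectral bands as input channels).
        self.conv_branch = nn.Sequential(
            nn.Conv2d(bands, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())            # -> (N, 32)
        # Numeric branch over hyperspectral metrics w.r.t. reference reflectances.
        self.metric_branch = nn.Sequential(nn.Linear(metric_dim, 32), nn.ReLU())
        self.classifier = nn.Linear(64, num_pigments)

    def forward(self, patch, metrics):
        fused = torch.cat([self.conv_branch(patch), self.metric_branch(metrics)], dim=1)
        return self.classifier(fused)

if __name__ == "__main__":
    net = MultiBranchPigmentNet(bands=31, metric_dim=8, num_pigments=12)
    logits = net(torch.randn(4, 31, 16, 16), torch.randn(4, 8))
    print(logits.shape)  # torch.Size([4, 12])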
Citations: 2
Entity Relation Fusion for Real-Time One-Stage Referring Expression Comprehension
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490592
Hang Yu, Weixin Li, Jiankai Li, Ye Du
Referring Expression Comprehension (REC) is the task of grounding the object referred to by a language expression. Previous one-stage REC methods usually use a single language feature vector to represent the whole query for grounding, and perform no reasoning between different objects despite the rich relation cues contained in the language expression, which degrades their grounding accuracy. Additionally, these methods mostly use feature pyramid networks for multi-scale visual object feature extraction but ground on each feature layer separately, neglecting the connections between objects of different scales. To address these problems, we propose a novel one-stage REC method, the Entity Relation Fusion Network (ERFN), which locates the referred object by relation-guided reasoning over different objects. In ERFN, instead of grounding objects at each layer separately, we propose a Language Guided Multi-Scale Fusion (LGMSF) model that utilizes language to guide the fusion of representations of objects at different scales into one feature map. For modeling connections between different objects, we design a Relation Guided Feature Fusion (RGFF) model that extracts entities in the language expression to enhance the referred entity feature in the visual object feature map, and further extracts relations to guide object feature fusion based on the self-attention mechanism. Experimental results show that our method is competitive with state-of-the-art one-stage and two-stage REC methods, and can also keep inference running in real time.
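The sketch below illustrates one plausible reading of language-guided multi-scale fusion: a sentence embedding produces attention weights over feature maps from different scales, which are resized and combined into a single fused map. The module name, dimensions, and weighting scheme are assumptions, not the published LGMSF design.

import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedFusion(nn.Module):
    def __init__(self, text_dim: int, num_scales: int):
        super().__init__()
        self.scale_logits = nn.Linear(text_dim, num_scales)

    def forward(self, feature_maps, sentence_vec):
        # feature_maps: list of (N, C, Hi, Wi) tensors; sentence_vec: (N, text_dim).
        target_size = feature_maps[0].shape[-2:]
        weights = torch.softmax(self.scale_logits(sentence_vec), dim=1)  # (N, num_scales)
        fused = 0
        for i, fmap in enumerate(feature_maps):
            fmap = F.interpolate(fmap, size=target_size, mode="bilinear", align_corners=False)
            fused = fused + weights[:, i].view(-1, 1, 1, 1) * fmap       # language-weighted sum
        return fused

if __name__ == "__main__":
    maps = [torch.randn(2, 64, 32, 32), torch.randn(2, 64, 16, 16), torch.randn(2, 64, 8, 8)]
    fused = LanguageGuidedFusion(text_dim=256, num_scales=3)(maps, torch.randn(2, 256))
    print(fused.shape)  # torch.Size([2, 64, 32, 32])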
Citations: 0
Flat and Shallow: Understanding Fake Image Detection Models by Architecture Profiling
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490566
Jing-Fen Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei
Digital image manipulations have been heavily abused to spread misinformation. Despite the great efforts dedicated by the research community, prior works are mostly performance-driven, i.e., optimizing performance using standard/heavy networks designed for semantic classification. A thorough understanding of fake image detection models is still missing. This paper studies the essential ingredients of a good fake image detection model by profiling the best-performing architectures. Specifically, we conduct a thorough analysis of a massive number of detection models and observe how performance is affected by different patterns of network structure. Our key findings include: 1) with the same computational budget, flat network structures (e.g., large kernel sizes, wide connections) perform better than commonly used deep networks; 2) operations in shallow layers deserve more computational capacity to trade off performance and computational cost. These findings sketch a general profile for essential fake image detection models, which shows clear differences from profiles for semantic classification. Furthermore, based on our analysis, we propose a new Depth-Separable Search Space (DSS) for fake image detection. Compared to state-of-the-art methods, our model achieves competitive performance while saving more than 50% of the parameters.
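To make the reported profile concrete, here is a hypothetical "flat and shallow" detector in PyTorch: few layers, wide channels, large kernels, with most capacity concentrated early. It is not the paper's searched architecture, only an illustration of the stated design pattern.

import torch
import torch.nn as nn

def flat_shallow_detector(width: int = 96, kernel: int = 7) -> nn.Sequential:
    # Two wide, large-kernel layers instead of a deep, narrow stack.
    return nn.Sequential(
        nn.Conv2d(3, width, kernel, stride=2, padding=kernel // 2), nn.ReLU(),
        nn.Conv2d(width, width, kernel, stride=2, padding=kernel // 2), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        nn.Linear(width, 2))  # real vs. fake logits

if __name__ == "__main__":
    model = flat_shallow_detector()
    print(sum(p.numel() for p in model.parameters()), "parameters")
    print(model(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 2])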
Citations: 0
Generation of Variable-Length Time Series from Text using Dynamic Time Warping-Based Method
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3495644
Ayaka Ideno, Yusuke Mukuta, Tatsuya Harada
This study aims to find a suitable method for generating time-series data, such as video clips or avatar motions, from text describing multiple events. This paper addresses the generation of variable-length time-series data, considering the order and variable duration of the events stated in the text. Although variants of Mean Squared Error (MSE) are a common means of training, they only consider the gap between elements of the ground-truth (GT) data and the generated data at the same time step. Thus, MSE variants are unsuitable for the task at hand, because the loss may not be small for generated and GT data with the same order of events if the timing of the events does not overlap. To solve this problem, we propose a Dynamic Time Warping-Like method for Variable-Length data (DTWL-VL), which determines corresponding elements of the GT and generated data while allowing for time differences between them, and brings them closer. We compared DTWL-VL with an MSE variant and an existing method for time-series data generation that considers the time difference between corresponding parts of the GT and generated data. Since the existing method is aimed at generating fixed-length data, we extended it to generate variable-length time-series data. We conducted experiments using a dataset prepared for this study. Both DTWL-VL and the existing method outperformed the MSE variant. Moreover, although the existing method outperformed DTWL-VL under certain settings, DTWL-VL required a shorter training period.
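For intuition, the following is a plain dynamic-time-warping distance between a generated sequence and a ground-truth sequence of different lengths; it illustrates the kind of alignment-tolerant objective the abstract argues for, but it is not the authors' DTWL-VL loss.

import torch

def dtw_distance(gen: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    # gen: (T1, D), gt: (T2, D). Returns the accumulated alignment cost, so sequences
    # with the same event order but shifted timing are not over-penalised.
    t1, t2 = gen.shape[0], gt.shape[0]
    cost = torch.cdist(gen.unsqueeze(0), gt.unsqueeze(0)).squeeze(0)  # (T1, T2) pairwise costs
    acc = torch.full((t1 + 1, t2 + 1), float("inf"))
    acc[0, 0] = 0.0
    for i in range(1, t1 + 1):
        for j in range(1, t2 + 1):
            step = torch.stack([acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1]])
            acc[i, j] = cost[i - 1, j - 1] + torch.min(step)
    return acc[t1, t2]

if __name__ == "__main__":
    print(dtw_distance(torch.randn(12, 8), torch.randn(20, 8)).item())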
Citations: 0
PLM-IPE: A Pixel-Landmark Mutual Enhanced Framework for Implicit Preference Estimation
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490621
Federico Becattini, Xuemeng Song, C. Baecchi, S. Fang, C. Ferrari, Liqiang Nie, A. del Bimbo
In this paper, we are interested in understanding how customers perceive fashion recommendations, in particular when observing a proposed combination of garments that compose an outfit. Automatically understanding how a suggested item is perceived, without any kind of active engagement, is in fact an essential building block for interactive applications. We propose a pixel-landmark mutual enhanced framework for implicit preference estimation, named PLM-IPE, which is capable of inferring the user's implicit preferences from visual cues, without any active or conscious engagement. PLM-IPE consists of three key modules: a pixel-based estimator, a landmark-based estimator and mutual-learning-based optimization. The first two modules capture the implicit reaction of the user at the pixel level and the landmark level, respectively. The last module serves to transfer knowledge between the two parallel estimators. For evaluation, we collected a real-world dataset, named SentiGarment, which contains 3,345 facial reaction videos paired with suggested outfits and human-labeled reaction scores. Extensive experiments show the superiority of our model over state-of-the-art approaches.
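A hedged sketch of the two-estimator, mutual-learning idea: a pixel-based and a landmark-based regressor each predict a reaction score, and a mutual term pulls their predictions toward each other. The network sizes, the 68-landmark input, and the loss weighting are placeholders, not the published PLM-IPE configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F

pixel_net = nn.Sequential(            # pixel-based estimator over face frames
    nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 1))
landmark_net = nn.Sequential(         # landmark-based estimator over 68 (x, y) points
    nn.Linear(68 * 2, 64), nn.ReLU(), nn.Linear(64, 1))

def mutual_learning_loss(frames, landmarks, scores, beta: float = 0.5):
    p_pred = pixel_net(frames).squeeze(1)
    l_pred = landmark_net(landmarks).squeeze(1)
    supervised = F.mse_loss(p_pred, scores) + F.mse_loss(l_pred, scores)
    # Mutual term: each branch learns from the other's (detached) prediction.
    mutual = F.mse_loss(p_pred, l_pred.detach()) + F.mse_loss(l_pred, p_pred.detach())
    return supervised + beta * mutual

if __name__ == "__main__":
    loss = mutual_learning_loss(torch.randn(4, 3, 64, 64), torch.randn(4, 68 * 2), torch.rand(4))
    print(loss.item())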
Citations: 10
Multi-branch Semantic Learning Network for Text-to-Image Synthesis
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490567
Jiading Ling, Xingcai Wu, Zhenguo Yang, Xudong Mao, Qing Li, Wenyin Liu
In this paper, we propose a multi-branch semantic learning network (MSLN) that generates images from textual descriptions by taking into account global and local textual semantics; it consists of two stages. The first stage generates a coarse-grained image based on the sentence features. In the second stage, a multi-branch fine-grained generation model is constructed to inject the sentence-level and word-level semantics into two coarse-grained images through global and local attention modules, which generate global and local fine-grained image textures, respectively. In particular, we devise a channel fusion module (CFM) to fuse the global and local fine-grained features in the multi-branch fine-grained stage and generate the output image. Extensive experiments conducted on the CUB-200 and Oxford-102 datasets demonstrate the superior performance of the proposed method (e.g., FID is reduced from 16.09 to 14.43 on CUB-200).
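The sketch below shows one simple way a channel fusion module could combine global and local fine-grained feature maps: concatenation, a 1x1 mixing convolution, and channel-wise gating. This is an assumption-level illustration, not the paper's CFM.

import torch
import torch.nn as nn

class ChannelFusion(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)   # mix the two branches
        self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1),
                                  nn.Conv2d(channels, channels, 1), nn.Sigmoid())

    def forward(self, global_feat, local_feat):
        fused = self.mix(torch.cat([global_feat, local_feat], dim=1))
        return fused * self.gate(fused)  # channel-wise re-weighting of the fused map

if __name__ == "__main__":
    cf = ChannelFusion(64)
    out = cf(torch.randn(2, 64, 32, 32), torch.randn(2, 64, 32, 32))
    print(out.shape)  # torch.Size([2, 64, 32, 32])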
Citations: 0
CFCR: A Convolution and Fusion Model for Cross-platform Recommendation
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3495639
Shengze Yu, Xin Wang, Wenwu Zhu
With the emergence of various online platforms, associating different platforms is playing an increasingly important role in many applications. Cross-platform recommendation aims to improve recommendation accuracy by associating information from different platforms. Existing methods do not fully exploit high-order nonlinear connectivity information in the cross-domain recommendation scenario and suffer from the domain-incompatibility problem. In this paper, we propose an end-to-end convolution and fusion model for cross-platform recommendation (CFCR). The proposed CFCR model utilizes Graph Convolutional Networks (GCN) to extract user and item features from graphs on different platforms, and fuses cross-platform information with a Multimodal AutoEncoder (MAE) that learns common latent user features. Therefore, high-order connectivity information is preserved to the greatest extent, and domain-invariant user representations are obtained automatically. Domain-incompatible information is discarded spontaneously to avoid corrupting the cross-platform association. Extensive experiments on a real-world dataset demonstrate the advantages of the proposed CFCR model over existing cross-platform recommendation methods in terms of various evaluation metrics.
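As a rough sketch of fusing two platforms' user features through a shared latent space (the GCN feature extraction is omitted), the autoencoder below encodes each platform's user embedding, ties the latents together, and reconstructs each platform. The dimensions and the simple alignment term are illustrative assumptions, not the published CFCR model.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedLatentAE(nn.Module):
    def __init__(self, dim_a: int, dim_b: int, latent: int = 32):
        super().__init__()
        self.enc_a, self.enc_b = nn.Linear(dim_a, latent), nn.Linear(dim_b, latent)
        self.dec_a, self.dec_b = nn.Linear(latent, dim_a), nn.Linear(latent, dim_b)

    def forward(self, user_a, user_b):
        # user_a / user_b: embeddings of the same users on platform A and platform B.
        za, zb = self.enc_a(user_a), self.enc_b(user_b)
        recon = F.mse_loss(self.dec_a(za), user_a) + F.mse_loss(self.dec_b(zb), user_b)
        align = F.mse_loss(za, zb)             # pull the same user's latents together
        return recon + align, 0.5 * (za + zb)  # loss and a fused, platform-invariant representation

if __name__ == "__main__":
    model = SharedLatentAE(dim_a=64, dim_b=48)
    loss, fused = model(torch.randn(8, 64), torch.randn(8, 48))
    print(loss.item(), fused.shape)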
Citations: 1
A Coarse-to-fine Approach for Fast Super-Resolution with Flexible Magnification
Pub Date : 2021-12-01 DOI: 10.1145/3469877.3490564
Zhichao Fu, Tianlong Ma, Liang Xue, Yingbin Zheng, Hao Ye, Liang He
We perform fast single-image super-resolution with flexible magnification for natural images. A novel coarse-to-fine super-resolution framework is developed in which the magnification factor is factorized into a maximum integer component and the remaining quotient. Specifically, our framework embeds a light-weight upscaling network for super-resolution at the integer scale factor, followed by a fine-grained network that guides interpolation on the feature maps and generates the super-resolved image. Compared with previous flexible-magnification super-resolution approaches, the proposed framework achieves a tradeoff between computational complexity and performance. We conduct experiments with the coarse-to-fine framework on standard benchmarks and demonstrate its superiority over previous approaches in terms of effectiveness and efficiency.
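The factorised-magnification idea can be sketched as follows: an arbitrary scale factor is split into its maximum integer component, handled by a learned upscaler, and the remaining fractional ratio, handled by interpolation. The toy sub-pixel upscaler stands in for the light-weight network and is an assumption, not the paper's architecture.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class IntegerUpscaler(nn.Module):
    # Toy xN upscaler via sub-pixel convolution (stand-in for the light-weight network).
    def __init__(self, n: int, channels: int = 3):
        super().__init__()
        self.body = nn.Conv2d(channels, channels * n * n, 3, padding=1)
        self.shuffle = nn.PixelShuffle(n)

    def forward(self, x):
        return self.shuffle(self.body(x))

def flexible_super_resolve(img: torch.Tensor, scale: float) -> torch.Tensor:
    n = max(1, math.floor(scale))      # maximum integer component
    q = scale / n                      # remaining fractional ratio
    coarse = IntegerUpscaler(n)(img)   # in practice a trained network per integer factor
    out_size = (round(coarse.shape[-2] * q), round(coarse.shape[-1] * q))
    return F.interpolate(coarse, size=out_size, mode="bilinear", align_corners=False)

if __name__ == "__main__":
    out = flexible_super_resolve(torch.randn(1, 3, 48, 48), scale=3.4)
    print(out.shape)  # torch.Size([1, 3, 163, 163])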
Citations: 0