
ACM Multimedia Asia: Latest Publications

Structural Knowledge Organization and Transfer for Class-Incremental Learning
Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490598
Yu Liu, Xiaopeng Hong, Xiaoyu Tao, Songlin Dong, Jingang Shi, Yihong Gong
Deep models are vulnerable to catastrophic forgetting when fine-tuned on new data. Popular distillation-based methods usually neglect the relations between data samples and may eventually forget essential structural knowledge. To address these shortcomings, we propose a structural graph knowledge distillation-based incremental learning framework to preserve both the positions of samples and their relations. Firstly, a memory knowledge graph (MKG) is generated to fully characterize the structural knowledge of historical tasks. Secondly, we develop a graph interpolation mechanism to enrich the domain of knowledge and alleviate the inter-class sample imbalance issue. Thirdly, we introduce structural graph knowledge distillation to transfer the knowledge of historical tasks. Comprehensive experiments on three datasets validate the proposed method.
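To make the relation-preserving idea above concrete, here is a minimal sketch of a structural distillation term that penalizes changes in both sample positions and pairwise similarities between a frozen historical model and the current model. The function name and the cosine-similarity graph are assumptions for illustration; the paper's memory knowledge graph construction and graph interpolation are not reproduced here.

```python
import torch
import torch.nn.functional as F

def structural_distillation_loss(old_feats: torch.Tensor, new_feats: torch.Tensor) -> torch.Tensor:
    # old_feats: embeddings from the frozen model trained on historical tasks, shape (N, D)
    # new_feats: embeddings of the same samples from the current model, shape (N, D)
    position_loss = F.mse_loss(new_feats, old_feats)  # keep each sample near its old position
    old_sim = F.normalize(old_feats, dim=1) @ F.normalize(old_feats, dim=1).t()
    new_sim = F.normalize(new_feats, dim=1) @ F.normalize(new_feats, dim=1).t()
    relation_loss = F.mse_loss(new_sim, old_sim)      # keep the pairwise similarity structure (graph edges)
    return position_loss + relation_loss
```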
Citations: 3
MIRecipe: A Recipe Dataset for Stage-Aware Recognition of Changes in Appearance of Ingredients
Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490596
Yixin Zhang, Yoko Yamakata, Keishi Tajima
In this paper, we introduce a new recipe dataset, MIRecipe (Multimedia-Instructional Recipe). It has both text and image data for every cooking step, whereas conventional recipe datasets only contain final dish images and/or images for only some of the steps. It consists of 26,725 recipes, which include 239,973 steps in total. The recognition of ingredients in images associated with cooking steps poses a new challenge: since ingredients are processed during cooking, the appearance of the same ingredient is very different in the beginning and finishing stages of the cooking. General object recognition methods, which assume the constant appearance of objects, do not perform well for such objects. To solve the problem, we propose two stage-aware techniques: stage-wise model learning, which trains a separate model for each stage, and stage-aware curriculum learning, which starts with the training data from the beginning stage and proceeds to the later stages. Our experiments with our dataset show that our method achieves higher accuracy than a model trained using all the data without considering the stages. Our dataset is available at our GitHub repository.
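The two stage-aware techniques can be pictured with a short, hypothetical sketch: each sample carries a stage index (0 for the beginning of the recipe), curriculum training starts from early stages and gradually adds later ones, while stage-wise learning fits one model per stage. The helper names and the sample layout are assumptions, not the released training code.

```python
from typing import Callable, Dict, List

def stage_aware_curriculum(samples: List[Dict], num_stages: int,
                           train_one_epoch: Callable[[List[Dict]], None]) -> None:
    # Start with beginning-stage images only, then progressively include later stages.
    for stage in range(num_stages):
        visible = [s for s in samples if s["stage"] <= stage]
        train_one_epoch(visible)

def stage_wise_models(samples: List[Dict], num_stages: int,
                      train_model: Callable[[List[Dict]], object]) -> List[object]:
    # Alternative scheme: a separate recognition model for each cooking stage.
    return [train_model([s for s in samples if s["stage"] == stage])
            for stage in range(num_stages)]
```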
Citations: 1
Convolutional Neural Network-Based Pure Paint Pigment Identification Using Hyperspectral Images
Pub Date: 2021-12-01 DOI: 10.1145/3469877.3495641
Ailin Chen, R. Jesus, M. Vilarigues
This research presents the results of applying deep neural networks to the identification of pure pigments in heritage artwork, namely paintings. Our paper applies an innovative three-branch deep learning model to maximise the correct identification of pure pigments. The proposed model combines the feature maps obtained from hyperspectral images through multiple convolutional neural networks with numerical, hyperspectral metric data computed with respect to a set of reference reflectances. The results obtained exhibit an accurate representation of the predicted pure pigments, which is confirmed through the use of analytical techniques. The model presented outperformed the compared counterparts and is deemed an important direction, not only in terms of utilising hyperspectral data and concrete pigment data in heritage analysis, but also for the application of deep learning in other fields.
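As a rough illustration of how a three-branch fusion of CNN features and numerical spectral metrics might look, the sketch below concatenates two CNN feature vectors with a metric vector before classification. All layer sizes, names and the late-fusion design are assumptions made for this example; the authors' actual architecture may differ.

```python
import torch
import torch.nn as nn

class ThreeBranchPigmentClassifier(nn.Module):
    def __init__(self, cnn_dim: int = 256, metric_dim: int = 32, num_pigments: int = 20):
        super().__init__()
        self.branch_a = nn.Sequential(nn.Linear(cnn_dim, 128), nn.ReLU())    # CNN features, branch 1
        self.branch_b = nn.Sequential(nn.Linear(cnn_dim, 128), nn.ReLU())    # CNN features, branch 2
        self.branch_c = nn.Sequential(nn.Linear(metric_dim, 32), nn.ReLU())  # distances to reference reflectances
        self.head = nn.Linear(128 + 128 + 32, num_pigments)

    def forward(self, feats_a, feats_b, metrics):
        fused = torch.cat([self.branch_a(feats_a), self.branch_b(feats_b), self.branch_c(metrics)], dim=1)
        return self.head(fused)  # pigment logits
```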
Citations: 2
Entity Relation Fusion for Real-Time One-Stage Referring Expression Comprehension
Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490592
Hang Yu, Weixin Li, Jiankai Li, Ye Du
Referring Expression Comprehension (REC) is the task of grounding the object referred to by a language expression. Previous one-stage REC methods usually use a single language feature vector to represent the whole query for grounding, and no reasoning between different objects is performed despite the rich relation cues about objects contained in the language expression, which depresses their grounding accuracy. Additionally, these methods mostly use feature pyramid networks for multi-scale visual object feature extraction but ground on different feature layers separately, neglecting the connections between objects with different scales. To address these problems, we propose a novel one-stage REC method, the Entity Relation Fusion Network (ERFN), to locate the referred object by relation-guided reasoning over different objects. In ERFN, instead of grounding objects at each layer separately, we propose a Language Guided Multi-Scale Fusion (LGMSF) model to utilize language to guide the fusion of representations of objects with different scales into one feature map. For modeling connections between different objects, we design a Relation Guided Feature Fusion (RGFF) model that extracts entities in the language expression to enhance the referred entity feature in the visual object feature map, and further extracts relations to guide object feature fusion based on the self-attention mechanism. Experimental results show that our method is competitive with state-of-the-art one-stage and two-stage REC methods, and can also keep inferring in real time.
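The language-guided multi-scale fusion step can be sketched as follows: a sentence embedding produces one weight per pyramid level, and the resized feature maps are blended into a single map. Shapes, module names and the softmax weighting are assumptions for illustration; the relation-guided (RGFF) branch is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LanguageGuidedMultiScaleFusion(nn.Module):
    def __init__(self, lang_dim: int = 512, num_scales: int = 3):
        super().__init__()
        self.scale_weights = nn.Linear(lang_dim, num_scales)  # one weight per pyramid level

    def forward(self, feature_maps, sentence_feat):
        # feature_maps: list of (B, C, Hi, Wi) tensors with the same channel count C.
        target_size = feature_maps[0].shape[-2:]
        resized = [F.interpolate(f, size=target_size, mode="bilinear", align_corners=False)
                   for f in feature_maps]
        weights = torch.softmax(self.scale_weights(sentence_feat), dim=-1)  # (B, num_scales)
        fused = sum(w.view(-1, 1, 1, 1) * f for w, f in zip(weights.unbind(dim=1), resized))
        return fused  # single language-weighted feature map
```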
Citations: 0
Flat and Shallow: Understanding Fake Image Detection Models by Architecture Profiling
Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490566
Jing-Fen Xu, Wei Zhang, Yalong Bai, Qibin Sun, Tao Mei
Digital image manipulations have been heavily abused to spread misinformation. Despite the great efforts dedicated by the research community, prior works are mostly performance-driven, i.e., optimizing performance using standard/heavy networks designed for semantic classification. A thorough understanding of fake image detection models is still missing. This paper studies the essential ingredients of a good fake image detection model by profiling the best-performing architectures. Specifically, we conduct a thorough analysis of a massive number of detection models and observe how the performance is affected by different patterns of network structure. Our key findings include: 1) with the same computational budget, flat network structures (e.g., large kernel sizes, wide connections) perform better than commonly used deep networks; 2) operations in shallow layers deserve more computational capacity to trade off performance and computational cost. These findings sketch a general profile for essential models of fake image detection, which shows clear differences from those for semantic classification. Furthermore, based on our analysis, we propose a new Depth-Separable Search Space (DSS) for fake image detection. Compared to state-of-the-art methods, our model achieves competitive performance while saving more than 50% of the parameters.
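Since the search space name suggests depthwise-separable operations with large, flat kernels, here is a small hypothetical candidate block of that kind; the channel count, kernel size and block layout are guesses for illustration, not the DSS definition from the paper.

```python
import torch.nn as nn

class DepthSeparableBlock(nn.Module):
    # A large-kernel depthwise convolution followed by a pointwise (1x1) convolution.
    def __init__(self, channels: int = 64, kernel_size: int = 7):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=kernel_size // 2, groups=channels)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))
```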
Citations: 0
Improving Hyperspectral Super-Resolution via Heterogeneous Knowledge Distillation
Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490610
Ziqian Liu, Qing Ma, Junjun Jiang, Xianming Liu
Hyperspectral images (HSI) contain rich spectral information, but their spatial resolution is often limited by the imaging system. Super-resolution (SR) reconstruction has become a hot topic, aiming to increase spatial resolution without extra hardware cost. Fusion-based hyperspectral image super-resolution (FHSR) methods use supplementary high-resolution multispectral images (HR-MSI) to recover spatial details, but well co-registered HR-MSI is hard to collect. Recently, single hyperspectral image super-resolution (SHSR) methods based on deep learning have made great progress. However, the lack of HR-MSI input makes it difficult for these SHSR methods to exploit spatial information. To take advantage of both FHSR and SHSR methods, in this paper we propose a new pipeline that treats HR-MSI as privileged information and improves our SHSR model with knowledge distillation. That is, our model uses paired MSI-HSI data for training and only needs LR-HSI as input during inference. Specifically, we combine SHSR and spectral super-resolution (SSR) and design a novel architecture, the Distillation-Oriented Dual-branch Net (DODN), to make the SHSR model fully employ knowledge transferred from the SSR model. Since mainstream SSR models are 2D CNNs and a fully 2D CNN causes spectral disorder in the SHSR task, a new mixed 2D/3D block, called the Distillation-Oriented Dual-branch Block (DODB), is proposed, where the 3D branch extracts spectral-spatial correlation while the 2D branch accepts information from the SSR model through knowledge distillation. The main idea is to distill the knowledge of spatial information from HR-MSI into the SHSR model without changing its network architecture. Extensive experiments on two benchmark datasets, CAVE and NTIRE2020, demonstrate that our proposed DODN outperforms state-of-the-art SHSR methods in terms of both quantitative and qualitative analysis.
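A toy version of a mixed 2D/3D dual-branch block, with the 2D branch exposed so that its features can be matched against a pretrained spectral-SR teacher, might look like the sketch below. Tensor layouts, the simple feature-matching loss and all names are assumptions; the real DODB and its distillation scheme are more elaborate.

```python
import torch.nn as nn
import torch.nn.functional as F

class DualBranchBlock(nn.Module):
    def __init__(self, bands: int = 31):
        super().__init__()
        self.branch3d = nn.Conv3d(1, 1, kernel_size=3, padding=1)          # spectral-spatial correlation
        self.branch2d = nn.Conv2d(bands, bands, kernel_size=3, padding=1)  # receives distilled SSR knowledge

    def forward(self, lr_hsi):
        # lr_hsi: (B, bands, H, W) low-resolution hyperspectral input
        out3d = self.branch3d(lr_hsi.unsqueeze(1)).squeeze(1)
        out2d = self.branch2d(lr_hsi)
        return out3d + out2d, out2d  # expose the 2D features for distillation

def distillation_loss(student_2d_feats, teacher_ssr_feats):
    # Match the student's 2D branch to hidden features of a pretrained SSR teacher.
    return F.mse_loss(student_2d_feats, teacher_ssr_feats)
```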
Citations: 1
Motion = Video - Content: Towards Unsupervised Learning of Motion Representation from Videos
Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490582
Hehe Fan, Mohan S. Kankanhalli
Motion, according to its definition in physics, is the change in position with respect to time, regardless of the specific moving object and background. In this paper, we aim to learn appearance-independent motion representation in an unsupervised manner. The main idea is to separate motion from videos while leaving objects and background as content. Specifically, we design an encoder-decoder model which consists of a content encoder, a motion encoder and a video generator. To train the model, we leverage a one-step cycle-consistency in reconstruction within the same video and a two-step cycle-consistency in generation across different videos as self-supervised signals, and use adversarial training to remove the content representation from the motion representation. We demonstrate that the proposed framework can be used for conditional video generation and fine-grained action recognition.
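One way to read the training signals described above is sketched below: a one-step cycle reconstructs a clip from its own content and motion codes, and a two-step cycle moves one clip's motion onto another clip's content and back. The exact cycle definitions and the adversarial term from the paper may differ; encoders and generator are passed in as callables and all names are illustrative.

```python
import torch.nn.functional as F

def one_step_cycle_loss(content_enc, motion_enc, generator, video):
    # Reconstruct a clip from its own content and motion representations.
    recon = generator(content_enc(video), motion_enc(video))
    return F.l1_loss(recon, video)

def two_step_cycle_loss(content_enc, motion_enc, generator, video_a, video_b):
    # Borrow B's motion for A's content, then use the synthesized clip's motion
    # to rebuild B from its own content; B should be recovered if motion is content-free.
    synthetic = generator(content_enc(video_a), motion_enc(video_b))
    rebuilt_b = generator(content_enc(video_b), motion_enc(synthetic))
    return F.l1_loss(rebuilt_b, video_b)
```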
Citations: 3
PLM-IPE: A Pixel-Landmark Mutual Enhanced Framework for Implicit Preference Estimation
Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490621
Federico Becattini, Xuemeng Song, C. Baecchi, S. Fang, C. Ferrari, Liqiang Nie, A. del Bimbo
In this paper, we are interested in understanding how customers perceive fashion recommendations, in particular when observing a proposed combination of garments to compose an outfit. Automatically understanding how a suggested item is perceived, without any kind of active engagement, is in fact an essential building block for interactive applications. We propose a pixel-landmark mutual enhanced framework for implicit preference estimation, named PLM-IPE, which is capable of inferring the user's implicit preferences from visual cues, without any active or conscious engagement. PLM-IPE consists of three key modules: a pixel-based estimator, a landmark-based estimator, and mutual learning based optimization. The former two modules capture the implicit reaction of the user at the pixel level and landmark level, respectively. The last module serves to transfer knowledge between the two parallel estimators. For evaluation, we collected a real-world dataset, named SentiGarment, which contains 3,345 facial reaction videos paired with suggested outfits and human-labeled reaction scores. Extensive experiments show the superiority of our model over state-of-the-art approaches.
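The mutual learning module can be pictured with a small hypothetical loss in which each estimator fits the reaction score and also mimics its (detached) peer. The MSE-based mimicry term and the function name are assumptions made for this sketch.

```python
import torch.nn.functional as F

def mutual_learning_loss(pixel_pred, landmark_pred, target):
    # Both branches regress the human-labeled reaction score...
    task = F.mse_loss(pixel_pred, target) + F.mse_loss(landmark_pred, target)
    # ...and each branch is pulled toward the other's (detached) prediction.
    mimicry = F.mse_loss(pixel_pred, landmark_pred.detach()) \
            + F.mse_loss(landmark_pred, pixel_pred.detach())
    return task + mimicry
```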
Citations: 10
Multi-branch Semantic Learning Network for Text-to-Image Synthesis
Pub Date: 2021-12-01 DOI: 10.1145/3469877.3490567
Jiading Ling, Xingcai Wu, Zhenguo Yang, Xudong Mao, Qing Li, Wenyin Liu
In this paper, we propose a multi-branch semantic learning network (MSLN) to generate images according to textual descriptions by taking into account global and local textual semantics; it consists of two stages. The first stage generates a coarse-grained image based on the sentence features. In the second stage, a multi-branch fine-grained generation model is constructed to inject the sentence-level and word-level semantics into two coarse-grained images by global and local attention modules, which generate global and local fine-grained image textures, respectively. In particular, we devise a channel fusion module (CFM) to fuse the global and local fine-grained features in the multi-branch fine-grained stage and generate the output image. Extensive experiments conducted on the CUB-200 dataset and Oxford-102 dataset demonstrate the superior performance of the proposed method (e.g., FID is reduced from 16.09 to 14.43 on CUB-200).
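A minimal sketch of a channel fusion module of the kind named above: global and local fine-grained feature maps are concatenated along the channel axis and mixed with a 1x1 convolution. The layer choice and channel counts are assumptions, not the paper's exact CFM.

```python
import torch
import torch.nn as nn

class ChannelFusionModule(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        self.mix = nn.Sequential(nn.Conv2d(2 * channels, channels, kernel_size=1), nn.ReLU())

    def forward(self, global_feat, local_feat):
        # global_feat, local_feat: (B, channels, H, W) fine-grained feature maps
        return self.mix(torch.cat([global_feat, local_feat], dim=1))
```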
Citations: 0
Generation of Variable-Length Time Series from Text using Dynamic Time Warping-Based Method
Pub Date: 2021-12-01 DOI: 10.1145/3469877.3495644
Ayaka Ideno, Yusuke Mukuta, Tatsuya Harada
This study aims to find a suitable method for generating time-series data, such as video clips or avatar motions, from text stating multiple events. This paper addresses the generation of variable-length time-series data considering the order and variable duration of the events stated in the text. Although using a variant of Mean Squared Error (MSE) is a common means of training, only the gap between elements of the ground-truth (GT) data and the generated data at the same time step is considered. Thus, variants of MSE are unsuitable for the task at hand because the loss may not be small for generated and GT data with the same order of events if the time spans of the events do not overlap. To solve the problem, we propose a Dynamic Time Warping-Like method for Variable-Length data (DTWL-VL), which determines the corresponding elements of the GT and the generated data, allowing for the time difference between them, and makes them closer. We compared DTWL-VL, a variant of MSE, and an existing method for time-series data generation which considers the time difference between the corresponding parts of the GT and generated data. Since the existing method is aimed at generating fixed-length data, we extend it to generate variable-length time-series data. We conducted experiments using a dataset prepared for this study. Both DTWL-VL and the existing method outperformed the MSE variant. Moreover, although the existing method outperformed DTWL-VL under certain settings, DTWL-VL required a shorter training period.
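DTWL-VL departs from plain dynamic time warping, but the alignment idea it builds on is the standard DTW recursion, sketched below for two variable-length sequences. This is only the classic algorithm, not the paper's loss; the example shows that sequences with the same event order but different timing still align with low cost.

```python
import numpy as np

def dtw_distance(gen: np.ndarray, gt: np.ndarray) -> float:
    # gen: generated sequence of shape (m, d); gt: ground-truth sequence of shape (n, d)
    m, n = len(gen), len(gt)
    cost = np.full((m + 1, n + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d = np.linalg.norm(gen[i - 1] - gt[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return float(cost[m, n])

# Same event order, different durations: the warped alignment keeps the cost at zero.
print(dtw_distance(np.array([[0.0], [1.0], [2.0]]),
                   np.array([[0.0], [0.0], [1.0], [1.0], [2.0]])))
```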
Citations: 0