
Proceedings of the 26th ACM International Conference on Multimedia: Latest Publications

Comprehensive Distance-Preserving Autoencoders for Cross-Modal Retrieval
Pub Date : 2018-10-15 DOI: 10.1145/3240508.3240607
Yibing Zhan, Jun Yu, Zhou Yu, Rong Zhang, D. Tao, Qi Tian
In this paper, we propose a novel method with comprehensive distance-preserving autoencoders (CDPAE) to address the problem of unsupervised cross-modal retrieval. Previous unsupervised methods rely primarily on pairwise distances of representations extracted from cross-media spaces that co-occur and belong to the same objects. Besides pairwise distances, however, the CDPAE also considers heterogeneous distances of representations extracted from cross-media spaces, as well as homogeneous distances of representations extracted from single-media spaces that belong to different objects. The CDPAE consists of four components. First, denoising autoencoders are used to retain the information in the representations and to reduce the negative influence of redundant noise. Second, a comprehensive distance-preserving common space is proposed to explore the correlations among different representations; it aims to preserve the respective distances between representations within the common space so that they are consistent with the distances in their original media spaces. Third, a novel joint loss function is defined to simultaneously calculate the reconstruction loss of the denoising autoencoders and the correlation loss of the comprehensive distance-preserving common space. Finally, an unsupervised cross-modal similarity measurement is proposed to further improve retrieval performance, carried out by calculating the marginal probability of two media objects with a kNN classifier. The CDPAE is tested on four public datasets with two cross-modal retrieval tasks: "query images by texts" and "query texts by images". Compared with eight state-of-the-art cross-modal retrieval methods, the experimental results demonstrate that the CDPAE outperforms all the unsupervised methods and performs competitively with the supervised methods.
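The three distance types the abstract names can be illustrated with a minimal NumPy sketch (not the authors' code: the equal loss weights, the squared Euclidean distance, and the assumption of one shared batch of co-occurring image/text pairs with equal feature dimensions are all simplifications):

```python
import numpy as np

def pairwise_sq_dists(a, b):
    """Squared Euclidean distances between rows of a and rows of b."""
    return ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)

def distance_preserving_loss(img_common, txt_common, img_orig, txt_orig):
    """Toy correlation loss in the spirit of CDPAE: distances in the common
    space should match the corresponding original-space distances.

    - pairwise: distance between co-occurring (image, text) of the SAME object
    - heterogeneous: image-vs-text distances across DIFFERENT objects
    - homogeneous: image-image and text-text distances within one modality
    """
    d_common = pairwise_sq_dists(img_common, txt_common)
    d_orig = pairwise_sq_dists(img_orig, txt_orig)  # stand-in for original-space distances
    # diagonal entries pair the i-th image with the i-th (co-occurring) text
    pairwise = np.mean((np.diag(d_common) - np.diag(d_orig)) ** 2)
    off = ~np.eye(len(img_common), dtype=bool)      # cross-modal, different objects
    heterogeneous = np.mean((d_common[off] - d_orig[off]) ** 2)
    homogeneous = np.mean(
        (pairwise_sq_dists(img_common, img_common) - pairwise_sq_dists(img_orig, img_orig)) ** 2
    ) + np.mean(
        (pairwise_sq_dists(txt_common, txt_common) - pairwise_sq_dists(txt_orig, txt_orig)) ** 2
    )
    return pairwise + heterogeneous + homogeneous
```

If the common-space embeddings exactly reproduce the original-space geometry, every term vanishes; training would minimize this loss jointly with the denoising reconstruction loss.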
Citations: 25
Fast and Light Manifold CNN based 3D Facial Expression Recognition across Pose Variations
Pub Date : 2018-10-15 DOI: 10.1145/3240508.3240568
Zhixing Chen, Di Huang, Yunhong Wang, Liming Chen
This paper proposes a novel approach to 3D Facial Expression Recognition (FER) based on a Fast and Light Manifold CNN model, namely FLM-CNN. Unlike current manifold CNNs, FLM-CNN adopts a human-vision-inspired pooling structure and a multi-scale encoding strategy to enhance geometry representation, which highlights the shape characteristics of expressions and runs efficiently. Furthermore, a sampling-tree-based preprocessing method is presented; it sharply reduces memory usage when applied to 3D facial surfaces, with little information loss from the original data. More importantly, because manifold CNN features are rotation-invariant, the proposed method is highly robust to pose variations. Extensive experiments are conducted on BU-3DFE, and state-of-the-art results are achieved, indicating its effectiveness.
Citations: 28
Content-Based Video Relevance Prediction with Second-Order Relevance and Attention Modeling
Pub Date : 2018-10-15 DOI: 10.1145/3240508.3266434
Xusong Chen, Rui Zhao, Shengjie Ma, Dong Liu, Zhengjun Zha
This paper describes our proposed method for the Content-Based Video Relevance Prediction (CBVRP) challenge. Our method is based on deep learning: we train a deep network to predict the relevance between two video sequences from their features. We explore the use of second-order relevance both in preparing training data and in extending the deep network. Second-order relevance refers to, e.g., the relevance between x and z when x is relevant to y and y is relevant to z. In our proposed method, we use second-order relevance to increase positive samples and decrease negative samples when preparing training data. We further extend the deep network with an attention module, where the attention mechanism is designed for second-order relevant video sequences. We verify the effectiveness of our method on the validation set of the CBVRP challenge.
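The second-order expansion of training pairs can be sketched as follows (a hypothetical helper, assuming relevance is given as unordered pairs; the challenge pipeline itself is not specified at this level of detail):

```python
def second_order_pairs(relevant):
    """Expand a set of first-order relevant pairs with second-order ones:
    if (x, y) and (y, z) are relevant, treat (x, z) as relevant too.
    `relevant` is an iterable of unordered pairs (tuples)."""
    first = {frozenset(p) for p in relevant}
    # adjacency map: item -> set of first-order relevant neighbours
    neighbours = {}
    for p in first:
        a, b = tuple(p)
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)
    second = set()
    for y, ns in neighbours.items():
        for x in ns:
            for z in ns:
                # x and z share the common neighbour y but were not
                # first-order relevant themselves
                if x != z and frozenset((x, z)) not in first:
                    second.add(frozenset((x, z)))
    return second
```

The resulting pairs could serve as extra positive samples, while pairs absent from both sets are safer candidates for negatives.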
Citations: 6
Examine before You Answer: Multi-task Learning with Adaptive-attentions for Multiple-choice VQA
Pub Date : 2018-10-15 DOI: 10.1145/3240508.3240687
Lianli Gao, Pengpeng Zeng, Jingkuan Song, Xianglong Liu, Heng Tao Shen
Multiple-choice (MC) Visual Question Answering (VQA) is similar to but essentially different from open-ended VQA because the answer options are provided. Most existing works tackle both in a unified pipeline by solving a multi-class problem to infer the best answer from a predefined answer set; the option that matches the best answer is then selected for MC VQA. Nevertheless, this violates human reasoning. Normally, people examine the question, the answer options and the reference image before answering an MC VQA question. Humans either rely on the question and answer options to directly deduce a correct answer if the question is not image-related, or read the question and answer options and then purposefully search for the answer in the reference image. Therefore, we propose a novel approach, namely Multi-task Learning with Adaptive-attention (MTA), to simulate this human reasoning for MC VQA. Specifically, we first fuse the answer options and question features, and then adaptively attend to the visual features to infer the answer. Furthermore, we design our model as a multi-task learning architecture by integrating the open-ended VQA task to further boost the performance of MC VQA. We evaluate our approach on two standard benchmark datasets, VQA and Visual7W; our approach sets new records on both datasets for the MC VQA task, reaching 73.5% and 65.9% average accuracy respectively.
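The examine-then-answer flow (fuse each option with the question, then adaptively attend to visual features) can be illustrated with a toy NumPy sketch; the additive fusion and dot-product scoring below are placeholders for the paper's learned modules:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def answer_mc_vqa(question, options, regions):
    """Toy version of the examine-then-answer flow (illustrative only):
    1) fuse each answer option with the question,
    2) use the fused query to attend over image region features,
    3) score each option against its attended visual summary."""
    scores = []
    for opt in options:
        query = question + opt                  # simple additive fusion
        attn = softmax(regions @ query)         # adaptive attention over regions
        visual = attn @ regions                 # attended visual summary
        scores.append(float(query @ visual))    # option/visual agreement
    return int(np.argmax(scores))               # index of the chosen option
```

An option whose fused query agrees with the image regions it attends to receives the highest score, mirroring the "purposeful search" behavior described above.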
Citations: 24
Self-boosted Gesture Interactive System with ST-Net
Pub Date : 2018-10-15 DOI: 10.1145/3240508.3240530
Zhengzhe Liu, Xiaojuan Qi, Lei Pang
In this paper, we propose a self-boosted intelligent system for joint sign language recognition and automatic education. A novel Spatial-Temporal Net (ST-Net) is designed to exploit the temporal dynamics of localized hands for sign language recognition. Features from ST-Net can be deployed by our education system to detect the failure modes of learners. Moreover, the education system helps collect a vast amount of data for training ST-Net. Our sign language recognition and education systems thus improve each other step by step. On the one hand, benefiting from an accurate recognition system, the education system can detect a learner's failures more precisely. On the other hand, with more training data gathered from the education system, the recognition system becomes more robust and accurate. Experiments on a Hong Kong sign language dataset containing 227 commonly used words validate the effectiveness of our joint recognition and education system.
Citations: 9
Temporal Hierarchical Attention at Category- and Item-Level for Micro-Video Click-Through Prediction
Pub Date : 2018-10-15 DOI: 10.1145/3240508.3240617
Xusong Chen, Dong Liu, Zhengjun Zha, Wen-gang Zhou, Zhiwei Xiong, Yan Li
Micro-video sharing has gained great popularity in recent years, calling for effective recommendation algorithms that help users find the micro-videos they are interested in. Compared with traditional online videos (e.g. on YouTube), micro-videos, contributed by grassroots users and shot on smartphones, are much shorter (tens of seconds) and far more lacking in tags or descriptive text, which makes micro-video recommendation a challenging task. In this paper, we investigate how to model a user's historical behaviors so as to predict the user's click-through on micro-videos. Inspired by recent deep network-based methods, we propose a Temporal Hierarchical Attention at Category- and Item-Level (THACIL) network for user behavior modeling. First, we use temporal windows to capture the short-term dynamics of user interests; second, we leverage a category-level attention mechanism to characterize a user's diverse interests, as well as an item-level attention mechanism for fine-grained profiling of user interests; third, we adopt forward multi-head self-attention to capture long-term correlations within user behaviors. Our proposed THACIL network was tested on MicroVideo-1.7M, a new dataset of 1.7 million micro-videos drawn from real data of a micro-video sharing service in China. Experimental results demonstrate the effectiveness of the proposed method in comparison with state-of-the-art solutions.
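The item-level attention idea can be illustrated with a small sketch (illustrative only; the actual THACIL model uses learned projections, temporal windows, category-level attention and multi-head self-attention rather than raw dot products):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def user_interest_vector(clicked_items, candidate):
    """Item-level attention (toy sketch): weight each historically clicked
    item by its affinity to the candidate micro-video, then pool the
    history into a single interest vector."""
    weights = softmax(clicked_items @ candidate)   # attention over history
    return weights @ clicked_items                 # attended interest vector

def click_score(clicked_items, candidate):
    """Predicted click-through affinity: agreement between the attended
    user interest vector and the candidate's embedding."""
    return float(user_interest_vector(clicked_items, candidate) @ candidate)
```

A candidate that resembles what the user recently clicked receives a higher score than one orthogonal to the click history.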
Citations: 43
Session details: Vision-1 (Machine Learning)
Pub Date : 2018-10-15 DOI: 10.1145/3286920
Jingkuan Song
Citations: 0
A Large-scale RGB-D Database for Arbitrary-view Human Action Recognition
Pub Date : 2018-10-15 DOI: 10.1145/3240508.3240675
Yanli Ji, Feixiang Xu, Yang Yang, Fumin Shen, Heng Tao Shen, Weishi Zheng
Current research mainly focuses on single-view and multi-view human action recognition, which can hardly satisfy the requirements of human-robot interaction (HRI) applications that must recognize actions from arbitrary views. The lack of databases also sets up barriers. In this paper, we collect a new large-scale RGB-D action database for arbitrary-view action analysis, including RGB videos, depth and skeleton sequences. The database includes action samples captured from 8 fixed viewpoints, plus varying-view sequences covering the entire 360° range of view angles. In total, 118 persons were invited to perform 40 action categories, and 25,600 video samples were collected. Our database involves more participants, more viewpoints and a larger number of samples. More importantly, it is the first database containing entire 360° varying-view sequences. The database provides sufficient data for cross-view and arbitrary-view action analysis. In addition, we propose a View-guided Skeleton CNN (VS-CNN) to tackle the problem of arbitrary-view action recognition. Experiment results show that the VS-CNN achieves superior performance.
Citations: 53
JPEG Decompression in the Homomorphic Encryption Domain
Pub Date : 2018-10-15 DOI: 10.1145/3240508.3240672
Xiaojing Ma, Changming Liu, Sixing Cao, Bin B. Zhu
Privacy-preserving processing is desirable for cloud computing, relieving users' concerns about losing control of their uploaded data. This may be fulfilled with homomorphic encryption. Given the wide use of JPEG, it is desirable to enable JPEG decompression in the homomorphic encryption domain. This is a great challenge, since JPEG decoding needs to determine a matched codeword, which then extracts a codeword-dependent number of coefficients. With no access to the information of the encrypted content, a decoder does not know which codeword is matched, and thus cannot tell how many coefficients to extract, let alone compute their values. In this paper, we propose a novel scheme that enables JPEG decompression in the homomorphic encryption domain. The scheme applies a statically controlled iterative procedure to decode one coefficient per iteration. In one iteration, each codeword is compared with the bitstream to compute an encrypted Boolean representing whether the codeword is a match. Each codeword produces an output coefficient and generates a new bitstream by dropping consumed bits as if it were a match. If a codeword is associated with more than one coefficient, the codeword is replaced with the codeword representing the remaining undecoded coefficients for the next decoding iteration. The summation of each codeword's output multiplied by its matching Boolean is the output of the current iteration, which is equivalent to selecting the output of the matched codeword. A side benefit of our statically controlled decoding procedure is that paralleled Single-Instruction Multiple-Data (SIMD) processing is fully supported, wherein multiple plaintexts are encrypted into a single ciphertext, and decoding one ciphertext block corresponds to decoding all corresponding plaintext blocks. SIMD also reduces the total size of the ciphertexts of an image. Experimental results are reported to show the performance of our proposed scheme.
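The "sum of each codeword's output times its match Boolean" selection can be simulated in plaintext (a sketch with a hypothetical prefix-free codebook; in the real scheme the match flags are ciphertexts and the same sums are computed homomorphically, so the decoder never learns which flag was 1):

```python
def oblivious_decode_step(bitstream, codebook):
    """Plaintext simulation of one iteration of the encrypted-domain decoder.
    Every codeword is tried; each contributes (match flag x its coefficient)
    to the output, so the result equals the matched codeword's coefficient
    without ever branching on which codeword matched.
    `codebook` maps prefix-free codeword bitstrings to coefficient values."""
    coeff = 0
    consumed = 0
    for word, value in codebook.items():
        match = int(bitstream.startswith(word))   # an encrypted Boolean in the real scheme
        coeff += match * value                    # sum of (output x match flag)
        consumed += match * len(word)             # bits dropped by the matched codeword
    # plaintext shortcut: the real scheme blends the candidate bitstreams
    # homomorphically instead of slicing in the clear
    return coeff, bitstream[consumed:]
```

With a prefix-free code exactly one flag is 1 per iteration, so the sums reduce to the matched codeword's coefficient and consumed-bit count.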
{"title":"JPEG Decompression in the Homomorphic Encryption Domain","authors":"Xiaojing Ma, Changming Liu, Sixing Cao, Bin B. Zhu","doi":"10.1145/3240508.3240672","DOIUrl":"https://doi.org/10.1145/3240508.3240672","url":null,"abstract":"Privacy-preserving processing is desirable for cloud computing to relieve users' concern of loss of control of their uploaded data. This may be fulfilled with homomorphic encryption. With widely used JPEG, it is desirable to enable JPEG decompression in the homomorphic encryption domain. This is a great challenge since JPEG decoding needs to determine a matched codeword, which then extracts a codeword-dependent number of coefficients. With no access to the information of encrypted content, a decoder does not know which codeword is matched, and thus cannot tell how many coefficients to extract, not to mention to compute their values. In this paper, we propose a novel scheme that enables JPEG decompression in the homomorphic encryption domain. The scheme applies a statically controlled iterative procedure to decode one coefficient per iteration. In one iteration, each codeword is compared with the bitstream to compute an encrypted Boolean that represents if the codeword is a match or not. Each codeword would produce an output coefficient and generate a new bitstream by dropping consumed bits as if it were a match. If a codeword is associated with more than one coefficient, the codeword is replaced with the codeword representing the remaining undecoded coefficients for the next decoding iteration. The summation of each codeword's output multiplied by its matching Boolean is the output of the current iteration. This is equivalent to selecting the output of a matched codeword. 
A side benefit of our statically controlled decoding procedure is that paralleled Single-Instruction Multiple-Data (SIMD) is fully supported, wherein multiple plaintexts are encrypted into a single plaintext, and decoding a ciphertext block corresponds to decoding all corresponding plaintext blocks. SIMD also reduces the total size of ciphertexts of an image. Experimental results are reported to show the performance of our proposed scheme.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132978071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
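The data-oblivious selection the abstract describes — every codeword contributes its candidate output weighted by an encrypted match Boolean, so the decoder never branches on which codeword matched — can be sketched in plaintext. This is a minimal Python simulation under stated assumptions: the toy Huffman table and bitstream are illustrative, not from the paper, and a real implementation would evaluate the same arithmetic over homomorphic ciphertexts.

```python
# Plaintext simulation of one statically controlled decoding iteration.
# In the actual scheme, `is_match` would be an encrypted Boolean and the
# multiply-and-sum below would run under homomorphic encryption.
CODEBOOK = {        # codeword bits -> coefficient value (toy, prefix-free table)
    "0":   1,
    "10":  2,
    "110": 3,
}

def decode_one(bitstream):
    """Return (coefficient, remaining bitstream) without branching on the
    matched codeword: each codeword produces a candidate output, and the
    result is the sum of candidate outputs times their match Booleans."""
    coeff = 0
    remainders = []
    for code, value in CODEBOOK.items():
        is_match = int(bitstream.startswith(code))  # encrypted Boolean in the real scheme
        coeff += is_match * value                   # sum of output * matching Boolean
        # Each codeword drops its consumed bits "as if it were a match".
        remainders.append((is_match, bitstream[len(code):]))
    # Select the matched codeword's remainder via the Booleans (plaintext shortcut).
    new_stream = "".join(rest for b, rest in remainders if b)
    return coeff, new_stream

stream = "110100"
out = []
while stream:
    c, stream = decode_one(stream)
    out.append(c)
print(out)  # codewords 110, 10, 0 -> [3, 2, 1]
```

The prefix-free property of the table guarantees that exactly one Boolean is 1 per iteration, which is what makes the multiply-and-sum equivalent to selecting the matched codeword's output.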
Citations: 2
OSMO
Pub Date : 2018-10-15 DOI: 10.1145/3240508.3240548
Xu Gao, Tingting Jiang
With the demands of intelligent monitoring, multiple object tracking (MOT) in surveillance scenes has become an essential but challenging task. Occlusion is the primary difficulty in surveillance MOT and can be categorized into inter-object occlusion and obstacle occlusion. Many current studies on general MOT focus on the former, but few have addressed the latter. In fact, surveillance videos offer useful prior knowledge because the scene structure is fixed. Hence, we propose two models to deal with these two kinds of occlusion: an attention-based appearance model for inter-object occlusion, and a scene structure model for obstacle occlusion. We also design an obstacle map segmentation method for segmenting obstacles from the surveillance scene. Furthermore, to evaluate our method, we propose four new surveillance datasets containing videos with obstacles. Experimental results show the effectiveness of our two models.
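Because the scene structure is fixed, an obstacle map gives the tracker a simple test for obstacle occlusion. The sketch below is an illustrative assumption, not the paper's actual scene structure model: with a binary obstacle mask, a tracker can check whether a predicted box is mostly covered by an obstacle and keep the track alive instead of terminating it on a missed detection. The function and threshold names are hypothetical.

```python
import numpy as np

def obstacle_occluded(obstacle_map, box, thresh=0.5):
    """Return True if most of the predicted box lies on obstacle pixels.

    obstacle_map: 2D uint8 array, 1 = obstacle, 0 = free space.
    box: (x1, y1, x2, y2) in pixel coordinates.
    """
    x1, y1, x2, y2 = (int(v) for v in box)
    region = obstacle_map[y1:y2, x1:x2]
    if region.size == 0:
        return False
    return bool(region.mean() >= thresh)

# Toy scene: a 100x100 frame with a fixed obstacle (e.g., a pillar).
scene = np.zeros((100, 100), dtype=np.uint8)
scene[40:60, 40:60] = 1

print(obstacle_occluded(scene, (42, 42, 58, 58)))  # True: box is behind the pillar
print(obstacle_occluded(scene, (0, 0, 20, 20)))    # False: box is in free space
```

A tracker would call such a test when a detection goes missing: if the predicted position is occluded by a known obstacle, the track is coasted rather than dropped.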
Citations: 24
Journal
Proceedings of the 26th ACM international conference on Multimedia