Comprehensive Distance-Preserving Autoencoders for Cross-Modal Retrieval (DOI: 10.1145/3240508.3240607)
Yibing Zhan, Jun Yu, Zhou Yu, Rong Zhang, D. Tao, Qi Tian
In this paper, we propose a novel method, comprehensive distance-preserving autoencoders (CDPAE), to address the problem of unsupervised cross-modal retrieval. Previous unsupervised methods rely primarily on pairwise distances between representations that co-occur in different media spaces and belong to the same objects. Besides these pairwise distances, CDPAE also considers heterogeneous distances between representations from different media spaces, as well as homogeneous distances between representations from a single media space, that belong to different objects. CDPAE consists of four components. First, denoising autoencoders are used to retain the information in the representations while reducing the negative influence of redundant noise. Second, a comprehensive distance-preserving common space is proposed to explore the correlations among different representations; it aims to preserve the distances between representations within the common space so that they are consistent with the distances in their original media spaces. Third, a novel joint loss function is defined to simultaneously calculate the reconstruction loss of the denoising autoencoders and the correlation loss of the comprehensive distance-preserving common space. Finally, an unsupervised cross-modal similarity measurement is proposed to further improve retrieval performance; it is computed as the marginal probability of two media objects based on a kNN classifier. CDPAE is tested on four public datasets with two cross-modal retrieval tasks: "query images by texts" and "query texts by images". Compared with eight state-of-the-art cross-modal retrieval methods, the experimental results demonstrate that CDPAE outperforms all the unsupervised methods and performs competitively with the supervised ones.
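The joint loss described above lends itself to a compact formulation. Below is a minimal PyTorch sketch assuming two denoising autoencoders (one per modality) mapping into a shared common space; the function names, the Gaussian noise model, and the cross-modal target distance are illustrative assumptions, not the paper's exact definitions.

```python
import torch
import torch.nn.functional as F

def pdist(x, y):
    """Pairwise Euclidean distances between rows of x and rows of y."""
    return torch.cdist(x, y, p=2)

def cdpae_joint_loss(img, txt, enc_img, dec_img, enc_txt, dec_txt,
                     noise_std=0.1, lambda_corr=1.0):
    """img, txt: (B, D_img), (B, D_txt) co-occurring feature pairs."""
    # Denoising autoencoders: reconstruct clean inputs from noisy ones.
    z_img = enc_img(img + noise_std * torch.randn_like(img))
    z_txt = enc_txt(txt + noise_std * torch.randn_like(txt))
    recon = F.mse_loss(dec_img(z_img), img) + F.mse_loss(dec_txt(z_txt), txt)

    # Correlation loss: distances in the common space should be consistent
    # with distances in the original media spaces.
    d_cross = pdist(z_img, z_txt)          # cross-modal, in common space
    d_orig_img, d_orig_txt = pdist(img, img), pdist(txt, txt)

    # (a) pairwise: co-occurring image/text of the same object (diagonal).
    pairwise = d_cross.diagonal().pow(2).mean()
    # (b) heterogeneous: cross-modal, different objects; the target distance
    #     here (mean of the two monomodal distances) is an assumed proxy.
    hetero = F.mse_loss(d_cross, 0.5 * (d_orig_img + d_orig_txt))
    # (c) homogeneous: same modality, different objects.
    homo = (F.mse_loss(pdist(z_img, z_img), d_orig_img) +
            F.mse_loss(pdist(z_txt, z_txt), d_orig_txt))

    return recon + lambda_corr * (pairwise + hetero + homo)
```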
{"title":"Comprehensive Distance-Preserving Autoencoders for Cross-Modal Retrieval","authors":"Yibing Zhan, Jun Yu, Zhou Yu, Rong Zhang, D. Tao, Qi Tian","doi":"10.1145/3240508.3240607","DOIUrl":"https://doi.org/10.1145/3240508.3240607","url":null,"abstract":"In this paper, we propose a novel method with comprehensive distance-preserving autoencoders (CDPAE) to address the problem of unsupervised cross-modal retrieval. Previous unsupervised methods rely primarily on pairwise distances of representations extracted from cross media spaces that co-occur and belong to the same objects. However, besides pairwise distances, the CDPAE also considers heterogeneous distances of representations extracted from cross media spaces as well as homogeneous distances of representations extracted from single media spaces that belong to different objects. The CDPAE consists of four components. First, denoising autoencoders are used to retain the information from the representations and to reduce the negative influence of redundant noises. Second, a comprehensive distance-preserving common space is proposed to explore the correlations among different representations. This aims to preserve the respective distances between the representations within the common space so that they are consistent with the distances in their original media spaces. Third, a novel joint loss function is defined to simultaneously calculate the reconstruction loss of the denoising autoencoders and the correlation loss of the comprehensive distance-preserving common space. Finally, an unsupervised cross-modal similarity measurement is proposed to further improve the retrieval performance. This is carried out by calculating the marginal probability of two media objects based on a kNN classifier. The CDPAE is tested on four public datasets with two cross-modal retrieval tasks: \"query images by texts\" and \"query texts by images\". Compared with eight state-of-the-art cross-modal retrieval methods, the experimental results demonstrate that the CDPAE outperforms all the unsupervised methods and performs competitively with the supervised methods.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125259918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast and Light Manifold CNN based 3D Facial Expression Recognition across Pose Variations (DOI: 10.1145/3240508.3240568)
Zhixing Chen, Di Huang, Yunhong Wang, Liming Chen
This paper proposes a novel approach to 3D Facial Expression Recognition (FER) based on a Fast and Light Manifold CNN model, namely FLM-CNN. Unlike current manifold CNNs, FLM-CNN adopts a human-vision-inspired pooling structure and a multi-scale encoding strategy to enhance geometry representation, which highlights the shape characteristics of expressions and runs efficiently. Furthermore, a sampling-tree-based preprocessing method is presented that sharply reduces memory consumption when applied to 3D facial surfaces, with little loss of information from the original data. More importantly, because manifold CNN features are rotation-invariant, the proposed method is highly robust to pose variations. Extensive experiments are conducted on BU-3DFE, and state-of-the-art results are achieved, indicating the method's effectiveness.
{"title":"Fast and Light Manifold CNN based 3D Facial Expression Recognition across Pose Variations","authors":"Zhixing Chen, Di Huang, Yunhong Wang, Liming Chen","doi":"10.1145/3240508.3240568","DOIUrl":"https://doi.org/10.1145/3240508.3240568","url":null,"abstract":"This paper proposes a novel approach to 3D Facial Expression Recognition (FER), and it is based on a Fast and Light Manifold CNN model, namely FLM-CNN. Different from current manifold CNNs, FLM-CNN adopts a human vision inspired pooling structure and a multi-scale encoding strategy to enhance geometry representation, which highlights shape characteristics of expressions and runs efficiently. Furthermore, a sampling tree based preprocessing method is presented, and it sharply saves memory when applied to 3D facial surfaces, without much information loss of original data. More importantly, due to the property of manifold CNN features of being rotation-invariant, the proposed method shows a high robustness to pose variations. Extensive experiments are conducted on BU-3DFE, and state-of-the-art results are achieved, indicating its effectiveness.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128121389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Content-Based Video Relevance Prediction with Second-Order Relevance and Attention Modeling (DOI: 10.1145/3240508.3266434)
Xusong Chen, Rui Zhao, Shengjie Ma, Dong Liu, Zhengjun Zha
This paper describes our proposed method for the Content-Based Video Relevance Prediction (CBVRP) challenge. Our method is based on deep learning: we train a deep network to predict the relevance between two video sequences from their features. We explore the use of second-order relevance, both in preparing training data and in extending the deep network. Second-order relevance refers to, for example, the relevance between x and z when x is relevant to y and y is relevant to z. In our proposed method, we use second-order relevance to increase the number of positive samples and decrease the number of negative samples when preparing training data. We further extend the deep network with an attention module, where the attention mechanism is designed for second-order relevant video sequences. We verify the effectiveness of our method on the validation set of the CBVRP challenge.
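The data-augmentation step is straightforward to implement. The sketch below (plain Python, hypothetical function names) expands a set of relevant pairs by transitivity as described above; the expanded pairs join the positives and are excluded from negative sampling.

```python
from itertools import product

def expand_positives(pos_pairs):
    """pos_pairs: set of (a, b) video-id tuples labeled relevant."""
    adj = {}
    for a, b in pos_pairs:                 # treat relevance as undirected
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    second_order = set()
    for y, neighbors in adj.items():       # x - y - z  =>  x - z
        for x, z in product(neighbors, neighbors):
            if x != z and (x, z) not in pos_pairs and (z, x) not in pos_pairs:
                second_order.add((x, z))
    return second_order

# Second-order pairs are added as positives; any sampled negative found in
# this set is discarded as a likely false negative.
pos = {("v1", "v2"), ("v2", "v3")}
print(expand_positives(pos))  # e.g. {('v1', 'v3'), ('v3', 'v1')}
```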
{"title":"Content-Based Video Relevance Prediction with Second-Order Relevance and Attention Modeling","authors":"Xusong Chen, Rui Zhao, Shengjie Ma, Dong Liu, Zhengjun Zha","doi":"10.1145/3240508.3266434","DOIUrl":"https://doi.org/10.1145/3240508.3266434","url":null,"abstract":"This paper describes our proposed method for the Content-Based Video Relevance Prediction (CBVRP) challenge. Our method is based on deep learning, i.e. we train a deep network to predict the relevance between two video sequences from their features. We explore the usage of second-order relevance, both in preparing training data, and in extending the deep network. Second-order relevance refers to e.g. the relevance between x and z if x is relevant to y and y is relevant to z. In our proposed method, we use second-order relevance to increase positive samples and decrease negative samples, when preparing training data. We further extend the deep network with an attention module, where the attention mechanism is designed for second-order relevant video sequences. We verify the effectiveness of our method on the validation set of the CBVRP challenge.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134179477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Examine before You Answer: Multi-task Learning with Adaptive-attentions for Multiple-choice VQA (DOI: 10.1145/3240508.3240687)
Lianli Gao, Pengpeng Zeng, Jingkuan Song, Xianglong Liu, Heng Tao Shen
Multiple-choice (MC) Visual Question Answering (VQA) is similar to, but essentially different from, open-ended VQA because the answer options are provided. Most existing works tackle the two in a unified pipeline, solving a multi-class classification problem to infer the best answer from a predefined answer set and selecting the option that matches it for MC VQA. Nevertheless, this deviates from how humans reason. Normally, people examine the question, the answer options and the reference image before answering. For MC VQA, humans either rely on the question and answer options to directly deduce the correct answer when the question is not image-related, or read the question and answer options and then purposefully search for the answer in the reference image. We therefore propose a novel approach, Multi-task Learning with Adaptive-attention (MTA), to simulate this human reasoning for MC VQA. Specifically, we first fuse the answer options and question features, and then adaptively attend to the visual features to infer the answer. Furthermore, we design our model as a multi-task learning architecture by integrating the open-ended VQA task to further boost the performance of MC VQA. We evaluate our approach on two standard benchmark datasets, VQA and Visual7W; it sets new records on both datasets for the MC VQA task, reaching 73.5% and 65.9% average accuracy respectively.
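As a rough illustration of the fuse-then-attend step, here is a minimal PyTorch sketch. The dimensions, the concatenation-plus-linear fusion, and the scoring head are assumptions made for illustration and may differ from the actual MTA architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FuseThenAttend(nn.Module):
    def __init__(self, q_dim=512, a_dim=512, v_dim=2048, h_dim=512):
        super().__init__()
        self.fuse = nn.Linear(q_dim + a_dim, h_dim)   # fuse question + option
        self.att = nn.Linear(h_dim + v_dim, 1)        # score each image region
        self.score = nn.Linear(h_dim + v_dim, 1)      # score the option

    def forward(self, q, a, v):
        # q: (B, q_dim) question; a: (B, a_dim) one answer option;
        # v: (B, R, v_dim) features of R visual regions.
        f = torch.tanh(self.fuse(torch.cat([q, a], dim=-1)))       # (B, h)
        f_exp = f.unsqueeze(1).expand(-1, v.size(1), -1)           # (B, R, h)
        alpha = F.softmax(self.att(torch.cat([f_exp, v], -1)), 1)  # (B, R, 1)
        v_att = (alpha * v).sum(dim=1)                             # (B, v_dim)
        return self.score(torch.cat([f, v_att], dim=-1))           # (B, 1)

# At inference, the option with the highest score is selected; the open-ended
# VQA head of the multi-task setup (not shown) would share the fused features.
```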
{"title":"Examine before You Answer: Multi-task Learning with Adaptive-attentions for Multiple-choice VQA","authors":"Lianli Gao, Pengpeng Zeng, Jingkuan Song, Xianglong Liu, Heng Tao Shen","doi":"10.1145/3240508.3240687","DOIUrl":"https://doi.org/10.1145/3240508.3240687","url":null,"abstract":"Multiple-choice (MC) Visual Question Answering (VQA) is a similar but essentially different task to open-ended VQA because the answer options are provided. Most of existing works tackle them in a unified pipeline by solving a multi-class problem to infer the best answer from a predefined answer set. The option that matches the best answer is selected for MC VQA. Nevertheless, this violates human thinking logics. Normally, people examine the questions, answer options and the reference image before inferring a MC VQA. For MC VQA, human either rely on the question and answer options to directly deduce a correct answer if the question is not image-related, or read the question and answer options and then purposefully search for answers in a reference image. Therefore, we propose a novel approach, namely Multi-task Learning with Adaptive-attention (MTA), to simulate human logics for MC VQA. Specifically, we first fuse the answer options and question features, and then adaptively attend to the visual features for inferring a MC VQA. Furthermore, we design our model as a multi-task learning architecture by integrating the open-ended VQA task to further boost the performance of MC VQA. We evaluate our approach on two standard benchmark datasets: VQA and Visual7W and our approach sets new records on both datasets for MC VQA task, reaching 73.5% and 65.9% average accuracy respectively.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133105560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Self-boosted Gesture Interactive System with ST-Net (DOI: 10.1145/3240508.3240530)
Zhengzhe Liu, Xiaojuan Qi, Lei Pang
In this paper, we propose a self-boosted intelligent system for joint sign language recognition and automatic education. A novel Spatial-Temporal Net (ST-Net) is designed to exploit the temporal dynamics of localized hands for sign language recognition. Features from ST-Net can be used by the education system to detect learners' failure modes, and the education system in turn helps collect a vast amount of data for training ST-Net. Our sign language recognition and education systems thus improve each other step by step. On the one hand, benefiting from an accurate recognition system, the education system can detect the learner's failures more precisely. On the other hand, with more training data gathered from the education system, the recognition system becomes more robust and accurate. Experiments on a Hong Kong sign language dataset containing 227 commonly used words validate the effectiveness of our joint recognition and education system.
{"title":"Self-boosted Gesture Interactive System with ST-Net","authors":"Zhengzhe Liu, Xiaojuan Qi, Lei Pang","doi":"10.1145/3240508.3240530","DOIUrl":"https://doi.org/10.1145/3240508.3240530","url":null,"abstract":"In this paper, we propose a self-boosted intelligent system for joint sign language recognition and automatic education. A novel Spatial-Temporal Net (ST-Net) is designed to exploit the temporal dynamics of localized hands for sign language recognition. Features from ST-Net can be deployed by our education system to detect failure modes of the learners. Moreover, the education system can help collect a vast amount of data for training ST-Net. Our sign language recognition and education system help improve each other step-by-step.On the one hand, benefited from accurate recognition system, the education system can detect the failure parts of the learner more precisely. On the other hand, with more training data gathered from the education system, the recognition system becomes more robust and accurate. Experiments on Hong Kong sign language dataset containing 227 commonly used words validate the effectiveness of our joint recognition and education system.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125710511","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Temporal Hierarchical Attention at Category- and Item-Level for Micro-Video Click-Through Prediction (DOI: 10.1145/3240508.3240617)
Xusong Chen, Dong Liu, Zhengjun Zha, Wen-gang Zhou, Zhiwei Xiong, Yan Li
Micro-video sharing has gained great popularity in recent years, calling for effective recommendation algorithms that help users find micro-videos of interest. Compared with traditional online videos (e.g. on YouTube), micro-videos, contributed by grassroots users and shot on smartphones, are much shorter (tens of seconds) and more often lack tags or descriptive text, making micro-video recommendation a challenging task. In this paper, we investigate how to model a user's historical behaviors so as to predict the user's click-through of micro-videos. Inspired by recent deep network-based methods, we propose a Temporal Hierarchical Attention at Category- and Item-Level (THACIL) network for user behavior modeling. First, we use temporal windows to capture the short-term dynamics of user interests; second, we leverage a category-level attention mechanism to characterize a user's diverse interests, together with an item-level attention mechanism for fine-grained profiling of those interests; third, we adopt forward multi-head self-attention to capture the long-term correlations within user behaviors. Our proposed THACIL network was tested on MicroVideo-1.7M, a new dataset of 1.7 million micro-videos drawn from real data of a micro-video sharing service in China. Experimental results demonstrate the effectiveness of the proposed method in comparison with state-of-the-art solutions.
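Of the three ingredients, the "forward" (i.e. causal) multi-head self-attention is the most self-contained; a minimal PyTorch sketch follows, with hyperparameters chosen for illustration rather than taken from the paper.

```python
import torch
import torch.nn as nn

def forward_self_attention(item_emb, num_heads=4):
    # item_emb: (T, B, D) sequence of a user's clicked micro-video embeddings.
    T, _, D = item_emb.shape
    mha = nn.MultiheadAttention(embed_dim=D, num_heads=num_heads)
    # Upper-triangular Boolean mask: position t may only attend to positions
    # <= t, so each step summarizes the user's history up to that click.
    causal = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
    out, _ = mha(item_emb, item_emb, item_emb, attn_mask=causal)
    return out  # (T, B, D) long-term interest representation

seq = torch.randn(20, 8, 64)  # 20 clicks, batch of 8 users, 64-dim items
print(forward_self_attention(seq).shape)  # torch.Size([20, 8, 64])
```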
{"title":"Temporal Hierarchical Attention at Category- and Item-Level for Micro-Video Click-Through Prediction","authors":"Xusong Chen, Dong Liu, Zhengjun Zha, Wen-gang Zhou, Zhiwei Xiong, Yan Li","doi":"10.1145/3240508.3240617","DOIUrl":"https://doi.org/10.1145/3240508.3240617","url":null,"abstract":"Micro-video sharing gains great popularity in recent years, which calls for effective recommendation algorithm to help user find their interested micro-videos. Compared with traditional online (e.g. YouTube) videos, micro-videos contributed by grass-root users and taken by smartphones are much shorter (tens of seconds) and more short of tags or descriptive text, making the recommendation of micro-videos a challenging task. In this paper, we investigate how to model user's historical behaviors so as to predict the user's click-through of micro-videos. Inspired by the recent deep network-based methods, we propose a Temporal Hierarchical Attention at Category- and Item-Level (THACIL) network for user behavior modeling. First, we use temporal windows to capture the short-term dynamics of user interests; Second, we leverage a category-level attention mechanism to characterize user's diverse interests, as well as an item-level attention mechanism for fine-grained profiling of user interests; Third, we adopt forward multi-head self-attention to capture the long-term correlation within user behaviors. Our proposed THACIL network was tested on MicroVideo-1.7M, a new dataset of 1.7 million micro-videos, coming from real data of a micro-video sharing service in China. Experimental results demonstrate the effectiveness of the proposed method in comparison with the state-of-the-art solutions.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124902620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Vision-1 (Machine Learning)","authors":"Jingkuan Song","doi":"10.1145/3286920","DOIUrl":"https://doi.org/10.1145/3286920","url":null,"abstract":"","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125008768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Large-scale RGB-D Database for Arbitrary-view Human Action Recognition (DOI: 10.1145/3240508.3240675)
Yanli Ji, Feixiang Xu, Yang Yang, Fumin Shen, Heng Tao Shen, Weishi Zheng
Current research mainly focuses on single-view and multi-view human action recognition, which can hardly satisfy the requirements of human-robot interaction (HRI) applications that must recognize actions from arbitrary views. The lack of databases also sets up barriers. In this paper, we collect a new large-scale RGB-D action database for arbitrary-view action analysis, including RGB videos, depth and skeleton sequences. The database includes action samples captured from 8 fixed viewpoints as well as varying-view sequences covering the entire 360° range of view angles. In total, 118 persons were invited to perform 40 action categories, and 25,600 video samples were collected. Our database involves more participants, more viewpoints and a larger number of samples. More importantly, it is the first database containing varying-view sequences over the entire 360° range. The database provides sufficient data for cross-view and arbitrary-view action analysis. In addition, we propose a View-guided Skeleton CNN (VS-CNN) to tackle the problem of arbitrary-view action recognition. Experimental results show that the VS-CNN achieves superior performance.
{"title":"A Large-scale RGB-D Database for Arbitrary-view Human Action Recognition","authors":"Yanli Ji, Feixiang Xu, Yang Yang, Fumin Shen, Heng Tao Shen, Weishi Zheng","doi":"10.1145/3240508.3240675","DOIUrl":"https://doi.org/10.1145/3240508.3240675","url":null,"abstract":"Current researches mainly focus on single-view and multiview human action recognition, which can hardly satisfy the requirements of human-robot interaction (HRI) applications to recognize actions from arbitrary views. The lack of databases also sets up barriers. In this paper, we newly collect a large-scale RGB-D action database for arbitrary-view action analysis, including RGB videos, depth and skeleton sequences. The database includes action samples captured in 8 fixed viewpoints and varying-view sequences which covers the entire 360 view angles. In total, 118 persons are invited to act 40 action categories, and 25,600 video samples are collected. Our database involves more articipants, more viewpoints and a large number of samples. More importantly, it is the first database containing the entire 360? varying-view sequences. The database provides sufficient data for cross-view and arbitrary-view action analysis. Besides, we propose a View-guided Skeleton CNN (VS-CNN) to tackle the problem of arbitrary-view action recognition. Experiment results show that the VS-CNN achieves superior performance.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125080754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
JPEG Decompression in the Homomorphic Encryption Domain (DOI: 10.1145/3240508.3240672)
Xiaojing Ma, Changming Liu, Sixing Cao, Bin B. Zhu
Privacy-preserving processing is desirable for cloud computing to relieve users' concern about losing control of their uploaded data, and it can be fulfilled with homomorphic encryption. Since JPEG is widely used, it is desirable to enable JPEG decompression in the homomorphic encryption domain. This is a great challenge, since JPEG decoding needs to determine a matched codeword, which then determines a codeword-dependent number of coefficients to extract. With no access to the information of the encrypted content, a decoder does not know which codeword is matched, and thus cannot tell how many coefficients to extract, let alone compute their values. In this paper, we propose a novel scheme that enables JPEG decompression in the homomorphic encryption domain. The scheme applies a statically controlled iterative procedure that decodes one coefficient per iteration. In each iteration, every codeword is compared with the bitstream to compute an encrypted Boolean that represents whether the codeword is a match. Each codeword produces an output coefficient and generates a new bitstream by dropping consumed bits as if it were a match. If a codeword is associated with more than one coefficient, it is replaced with the codeword representing the remaining undecoded coefficients for the next decoding iteration. The summation of each codeword's output multiplied by its matching Boolean is the output of the current iteration, which is equivalent to selecting the output of the matched codeword. A side benefit of our statically controlled decoding procedure is that parallel Single-Instruction Multiple-Data (SIMD) processing is fully supported: multiple plaintext values are packed into a single plaintext, and decoding one ciphertext block corresponds to decoding all the corresponding plaintext blocks. SIMD also reduces the total size of the ciphertexts of an image. Experimental results are reported to show the performance of our proposed scheme.
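The per-iteration selection logic can be illustrated with a plaintext simulation. The toy Huffman table and function below are invented for illustration; the point is the data-independent control flow, in which every codeword is evaluated and a match Boolean (encrypted, in the homomorphic setting) selects exactly one output. Note that in the real homomorphic domain the bitstream shift by `consumed` bits must itself be computed obliviously, which the plaintext slice below glosses over.

```python
CODEBOOK = {              # prefix -> decoded coefficient (toy table)
    "0":   0,
    "10":  1,
    "110": -1,
    "111": 2,
}

def decode_one(bits):
    """Obliviously decode one coefficient from the front of `bits`."""
    coeff, consumed = 0, 0
    for prefix, value in CODEBOOK.items():
        match = int(bits.startswith(prefix))   # an encrypted Boolean in HE
        # Every branch is evaluated; `match` selects exactly one of them,
        # since a prefix-free code has at most one matching codeword.
        coeff += match * value
        consumed += match * len(prefix)
    return coeff, bits[consumed:]

bits = "110100"
for _ in range(3):
    c, bits = decode_one(bits)
    print(c)   # -1, then 1, then 0
```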
{"title":"JPEG Decompression in the Homomorphic Encryption Domain","authors":"Xiaojing Ma, Changming Liu, Sixing Cao, Bin B. Zhu","doi":"10.1145/3240508.3240672","DOIUrl":"https://doi.org/10.1145/3240508.3240672","url":null,"abstract":"Privacy-preserving processing is desirable for cloud computing to relieve users' concern of loss of control of their uploaded data. This may be fulfilled with homomorphic encryption. With widely used JPEG, it is desirable to enable JPEG decompression in the homomorphic encryption domain. This is a great challenge since JPEG decoding needs to determine a matched codeword, which then extracts a codeword-dependent number of coefficients. With no access to the information of encrypted content, a decoder does not know which codeword is matched, and thus cannot tell how many coefficients to extract, not to mention to compute their values. In this paper, we propose a novel scheme that enables JPEG decompression in the homomorphic encryption domain. The scheme applies a statically controlled iterative procedure to decode one coefficient per iteration. In one iteration, each codeword is compared with the bitstream to compute an encrypted Boolean that represents if the codeword is a match or not. Each codeword would produce an output coefficient and generate a new bitstream by dropping consumed bits as if it were a match. If a codeword is associated with more than one coefficient, the codeword is replaced with the codeword representing the remaining undecoded coefficients for the next decoding iteration. The summation of each codeword's output multiplied by its matching Boolean is the output of the current iteration. This is equivalent to selecting the output of a matched codeword. A side benefit of our statically controlled decoding procedure is that paralleled Single-Instruction Multiple-Data (SIMD) is fully supported, wherein multiple plaintexts are encrypted into a single plaintext, and decoding a ciphertext block corresponds to decoding all corresponding plaintext blocks. SIMD also reduces the total size of ciphertexts of an image. Experimental results are reported to show the performance of our proposed scheme.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132978071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
OSMO (DOI: 10.1145/3240508.3240548)
Xu Gao, Tingting Jiang
With the demands of intelligent monitoring, multiple object tracking (MOT) in surveillance scenes has become an essential but challenging task. Occlusion is the primary difficulty in surveillance MOT and can be categorized into inter-object occlusion and obstacle occlusion. Many current studies on general MOT focus on the former, but few have addressed the latter. In fact, surveillance videos carry useful prior knowledge, because the scene structure is fixed. Hence, we propose two models for dealing with these two kinds of occlusion: an attention-based appearance model to handle inter-object occlusion, and a scene structure model to handle obstacle occlusion. We also design an obstacle map segmentation method for segmenting obstacles from the surveillance scene. Furthermore, to evaluate our method, we propose four new surveillance datasets containing videos with obstacles. Experimental results show the effectiveness of our two models.
{"title":"OSMO","authors":"Xu Gao, Tingting Jiang","doi":"10.1145/3240508.3240548","DOIUrl":"https://doi.org/10.1145/3240508.3240548","url":null,"abstract":"With demands of the intelligent monitoring, multiple object tracking (MOT) in surveillance scene has become an essential but challenging task. Occlusion is the primary difficulty in surveillance MOT, which can be categorized into the inter-object occlusion and the obstacle occlusion. Many current studies on general MOT focus on the former occlusion, but few studies have been conducted on the latter one. In fact, there are useful prior knowledge in surveillance videos, because the scene structure is fixed. Hence, we propose two models for dealing with these two kinds of occlusions. The attention-based appearance model is proposed to solve the inter-object occlusion, and the scene structure model is proposed to solve the obstacle occlusion. We also design an obstacle map segmentation method for segmenting obstacles from the surveillance scene. Furthermore, to evaluate our method, we propose four new surveillance datasets that contain videos with obstacles. Experimental results show the effectiveness of our two models.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114139642","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}