Recently, with the rapid progress of techniques in visual analysis, a lot of attention has been paid to affective computing due to its wide potential applications. Traditional affective analysis mainly focuses on single-label image emotion classification. However, a single image may evoke different emotions in different viewers, and even different emotions in the same viewer. Emotion distribution learning has therefore been proposed to capture the underlying emotion distribution of images. Current state-of-the-art works model the distribution with deep convolutional networks equipped with distribution-specific losses. However, these works ignore the correlation among different emotions: some emotions usually co-appear, while others are rarely evoked at the same time. Properly modeling this correlation is important for image emotion distribution learning. Graph convolutional networks have shown strong performance in capturing the underlying relationships in graphs and have been successfully applied to vision problems such as zero-shot image classification. In this paper, we therefore propose to apply graph convolutional networks to emotion distribution learning, termed EmotionGCN, which captures the correlation among emotions. EmotionGCN can make use of correlations either mined from data or taken directly from psychological models such as Mikels' wheel. Extensive experiments are conducted on the FlickrLDL and TwitterLDL datasets, and the results on seven evaluation metrics demonstrate the superiority of the proposed method.
{"title":"Image Emotion Distribution Learning with Graph Convolutional Networks","authors":"Tao He, Xiaoming Jin","doi":"10.1145/3323873.3326593","DOIUrl":"https://doi.org/10.1145/3323873.3326593","url":null,"abstract":"Recently, with the rapid progress of techniques in visual analysis, a lot of attention has been paid to affective computing due to its wide potential applications. Traditional affective analysis mainly focus on single label image emotion classification. But a single image may invoke different emotions for different persons, even for one person. So emotion distribution learning is proposed to capture the underlying emotion distribution for images. Currently, state-of-the-art works model the distribution by deep convolutional networks equipped with distribution specific loss. However, the correlation among different emotions is ignored in these works. Some emotions usually co-appear, while some are hardly invoked at the same time. Properly modeling the correlation is important for image emotion distribution learning. Graph convolutional networks have shown great performance in capturing the underlying relationship in graph, and have been successfully applied in vision problems, such as zero-shot image classification. So, in this paper, we propose to apply graph convolutional networks for emotion distribution learning, termed EmotionGCN, which captures the correlation among emotions. The EmotionGCN can make use of correlation either mined from data, or directly from psychological models, such as Mikels' wheel. Extensive experiments are conducted on the FlickrLDL and TwitterLDL datasets, and the results on seven evaluation metrics demonstrate the superiority of the proposed method.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125887818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nanxing Li, Bei Liu, Zhizhong Han, Yu-Shen Liu, Jianlong Fu
Automatic story generation from a sequence of images, i.e., visual storytelling, has attracted extensive attention. The challenge mainly derives from modeling rich, visually inspired human emotions, which is needed to generate diverse yet realistic stories even from the same sequence of images. Existing works usually adopt sequence-based generative adversarial networks (GANs) that encode deterministic image content (e.g., concepts, attributes) while neglecting probabilistic inference from an image over the emotion space. In this paper, we take one step further toward creating human-level stories by modeling image content together with emotions and generating textual paragraphs via emotion-reinforced adversarial learning. First, we introduce the concept of emotion into visual storytelling. The emotion feature is a representation of the emotional content of the generated story, which enables our model to capture human emotion. Second, stories are generated by a recurrent neural network and further optimized by emotion-reinforced adversarial learning with three critics, which ensure visual relevance, language style, and emotion consistency. Our model can generate stories based not only on emotions produced by our novel emotion generator but also on customized emotions. The introduction of emotion brings more variety and realism to visual storytelling. We evaluate the proposed model on the largest visual storytelling dataset (VIST). Extensive experiments show superior performance over state-of-the-art methods.
{"title":"Emotion Reinforced Visual Storytelling","authors":"Nanxing Li, Bei Liu, Zhizhong Han, Yu-Shen Liu, Jianlong Fu","doi":"10.1145/3323873.3325050","DOIUrl":"https://doi.org/10.1145/3323873.3325050","url":null,"abstract":"Automatic story generation from a sequence of images, i.e., visual storytelling, has attracted extensive attention. The challenges mainly drive from modeling rich visually-inspired human emotions, which results in generating diverse yet realistic stories even from the same sequence of images. Existing works usually adopt sequence-based generative adversarial networks (GAN) by encoding deterministic image content (e.g., concept, attribute), while neglecting probabilistic inference from an image over emotion space. In this paper, we take one step further to create human-level stories by modeling image content with emotions, and generating textual paragraph via emotion reinforced adversarial learning. Firstly, we introduce the concept of emotion engaged in visual storytelling. The emotion feature is a representation of the emotional content of the generated story, which enables our model to capture human emotion. Secondly, stories are generated by recurrent neural network, and further optimized by emotion reinforced adversarial learning with three critics, in which visual relevance, language style, and emotion consistency can be ensured. Our model is able to generate stories based on not only emotions generated by our novel emotion generator, but also customized emotions. The introduction of emotion brings more variety and realistic to visual storytelling. We evaluate the proposed model on the largest visual storytelling dataset (VIST). The superior performance to state-of-the-art methods are shown with extensive experiments.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131173646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple Object Tracking (MOT) has a wide range of applications in surveillance retrieval and autonomous driving. The majority of existing methods focus on extracting features with deep learning and then hand-crafting an optimization over a bipartite graph or network flow. In this paper, we propose an efficient end-to-end model, the Deep Association Network (DAN), which learns from graph-structured training data constructed from the spatio-temporal interactions of objects. DAN combines a Convolutional Neural Network (CNN), a Motion Encoder (ME), and a Graph Neural Network (GNN). The CNNs and Motion Encoders extract appearance features from bounding-box images and motion features from positions, respectively, and the GNN then optimizes the graph structure to associate the same object across frames. In addition, we present a novel end-to-end training strategy for the Deep Association Network. Our experimental results demonstrate that DAN is competitive with state-of-the-art methods on MOT16 and DukeMTMCT without using extra datasets.
{"title":"Deep Association: End-to-end Graph-Based Learning for Multiple Object Tracking with Conv-Graph Neural Network","authors":"Cong Ma, Yuan Li, F. Yang, Ziwei Zhang, Yueqing Zhuang, Huizhu Jia, Xiaodong Xie","doi":"10.1145/3323873.3325010","DOIUrl":"https://doi.org/10.1145/3323873.3325010","url":null,"abstract":"Multiple Object Tracking (MOT) has a wide range of applications in surveillance retrieval and autonomous driving. The majority of existing methods focus on extracting features by deep learning and hand-crafted optimizing bipartite graph or network flow. In this paper, we proposed an efficient end-to-end model, Deep Association Network (DAN), to learn the graph-based training data, which are constructed by spatial-temporal interaction of objects. DAN combines Convolutional Neural Network (CNN), Motion Encoder (ME) and Graph Neural Network (GNN). The CNNs and Motion Encoders extract appearance features from bounding box images and motion features from positions respectively, and then the GNN optimizes graph structure to associate the same object among frames together. In addition, we presented a novel end-to-end training strategy for Deep Association Network. Our experimental results demonstrate the effectiveness of DAN up to the state-of-the-art methods without extra-dataset on MOT16 and DukeMTMCT.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128659584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Motion capture data are digital representations of human movements in the form of 3D trajectories of multiple body joints. To understand the captured motions, similarity-based processing and deep learning have already proved effective, especially in classifying pre-segmented actions. However, in real-world scenarios motion data are typically captured as long continuous sequences, without explicit knowledge of their semantic partitioning. To make such unsegmented data accessible and reusable, as required by many applications, they need to be analyzed, searched, annotated, and mined automatically. However, there is currently no dataset or benchmark for testing and comparing the capabilities of the techniques developed for continuous motion data processing. In this paper, we introduce a new large-scale LSMB19 dataset consisting of two 3D skeleton sequences with a total length of 54.5 hours. We also define a benchmark on two important multimedia retrieval operations: subsequence search and annotation. Additionally, we demonstrate the usability of the benchmark by establishing baseline results for these operations.
{"title":"Benchmarking Search and Annotation in Continuous Human Skeleton Sequences","authors":"J. Sedmidubský, Petr Elias, P. Zezula","doi":"10.1145/3323873.3325013","DOIUrl":"https://doi.org/10.1145/3323873.3325013","url":null,"abstract":"Motion capture data are digital representations of human movements in form of 3D trajectories of multiple body joints. To understand the captured motions, similarity-based processing and deep learning have already proved to be effective, especially in classifying pre-segmented actions. However, in real-world scenarios motion data are typically captured as long continuous sequences, without explicit knowledge of semantic partitioning. To make such unsegmented data accessible and reusable as required by many applications, there is a strong requirement to analyze, search, annotate and mine them automatically. However, there is currently an absence of datasets and benchmarks to test and compare the capabilities of the developed techniques for continuous motion data processing. In this paper, we introduce a new large-scale LSMB19 dataset consisting of two 3D skeleton sequences of a total length of 54.5 hours. We also define a benchmark on two important multimedia retrieval operations: subsequence search and annotation. Additionally, we exemplify the usability of the benchmark by establishing baseline results for these operations.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116947473","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Given an untrimmed video and a description query, temporal moment retrieval aims to localize the temporal segment within the video that best matches the textual query. Existing studies predominantly employ coarse frame-level features as the visual representation, obscuring the specific details that may provide critical cues for localizing the desired moment. We propose SLTA (short for "Spatial and Language-Temporal Attention") to address this missing-detail issue. Specifically, SLTA takes advantage of object-level local features and attends to the most relevant ones (e.g., the local features for "girl" and "cup") via spatial attention. We then encode the sequence of local features over consecutive frames to capture the interactions among these objects (e.g., the interaction "pour" involving the two objects). Meanwhile, language-temporal attention is used to emphasize keywords based on moment context information. The two proposed attention sub-networks therefore recognize the most relevant objects and interactions in the video while simultaneously highlighting the keywords in the query. Extensive experiments on the TACoS, Charades-STA, and DiDeMo datasets demonstrate the effectiveness of our model compared with state-of-the-art methods.
{"title":"Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention","authors":"Bin Jiang, Xin Huang, Chao Yang, Junsong Yuan","doi":"10.1145/3323873.3325019","DOIUrl":"https://doi.org/10.1145/3323873.3325019","url":null,"abstract":"Given an untrimmed video and a description query, temporal moment retrieval aims to localize the temporal segment within the video that best describes the textual query. Existing studies predominantly employ coarse frame-level features as the visual representation, obfuscating the specific details which may provide critical cues for localizing the desired moment. We propose a SLTA (short for \"Spatial and Language-Temporal Attention\") method to address the detail missing issue. Specifically, the SLTA method takes advantage of object-level local features and attends to the most relevant local features (e.g., the local features \"girl\", \"cup\") by spatial attention. Then we encode the sequence of local features on consecutive frames to capture the interaction information among these objects (e.g., the interaction \"pour\" involving these two objects). Meanwhile, a language-temporal attention is utilized to emphasize the keywords based on moment context information. Therefore, our proposed two attention sub-networks can recognize the most relevant objects and interactions in the video, and simultaneously highlight the keywords in the query. Extensive experiments on TACOS, Charades-STA and DiDeMo datasets demonstrate the effectiveness of our model as compared to state-of-the-art methods.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134174068","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
AI as a concept has been around since the 1950s. With recent advancements in machine learning technology and the availability of big data and large-scale computing power, the scene is set for AI to be used in many more systems and applications that will profoundly impact society. Current deep learning based AI systems are mostly black boxes and are often non-explainable. Though they achieve high performance, they are also known to make occasional fatal mistakes. This has limited the application of AI, especially in mission-critical problems such as decision support, command and control, and other life-critical operations. This talk focuses on explainable AI, which holds promise in helping humans better understand and interpret the decisions made by black-box AI models. Current research efforts toward explainable multimedia AI center on two parts of the solution. The first part focuses on better understanding of multimedia content, especially video. This includes dense annotation of video content covering not just object recognition but also relation inference. The relations include both correlation and causality, as well as common-sense knowledge. Dense annotation enables us to raise the level of video representation toward that of language, in the form of relation triplets and relation graphs, and permits in-depth research on flexible description, question answering, and knowledge inference over video content. A large-scale video dataset has been created to support this line of research. The second direction focuses on the development of explainable AI models, which is just beginning. Existing works follow either the intrinsic approach, which designs self-explanatory models, or the post-hoc approach, which constructs a second model to interpret the target model. Both approaches have limitations: trade-offs between interpretability and accuracy, and a lack of guarantees about explanation quality. In addition, there are issues of quality, fairness, robustness, and privacy in model interpretation. In this talk, I present current state-of-the-art approaches to explainable multimedia AI, along with our preliminary research on relation inference in videos and on leveraging prior domain knowledge, information-theoretic principles, and adversarial algorithms to achieve interpretability. I will also discuss future research toward the quality, fairness, and robustness of interpretable AI.
{"title":"Keynote: Towards Explainability in AI and Multimedia Research","authors":"Tat-Seng Chua","doi":"10.1145/3323873.3325058","DOIUrl":"https://doi.org/10.1145/3323873.3325058","url":null,"abstract":"AI as a concept has been around since the 1950's. With the recent advancements in machine learning technology, and the availability of big data and large computing processing power, the scene is set for AI to be used in many more systems and applications which will profoundly impact society. The current deep learning based AI systems are mostly in black box form and are often non-explainable. Though it has high performance, it is also known to make occasional fatal mistakes. This has limited the applications of AI, especially in mission critical problems such as decision support, command and control, and other life-critical operations. This talk focuses on explainable AI, which holds promise in helping humans to better understand and interpret the decisions made by black-box AI models. Current research efforts towards explainable multimedia AI center on two parts of solution. The first part focuses on better understanding of multimedia content, especially video. This includes dense annotation of video content from not just object recognition, but also relation inference. The relation includes both correlation and causality relations, as well as common sense knowledge. The dense annotation enables us to transform the level of representation of video towards that of language, in the form of relation triplets and relation graphs, and permits in-depth research on flexible descriptions, question-answering and knowledge inference of video content. A large scale video dataset has been created to support this line of research. The second direction focuses on the development of explainable AI models, which are just beginning. Existing works focus on either the intrinsic approach, which designs self-explanatory models, or post-hoc approach, which constructs a second model to interpret the target model. Both approaches have limitations on trade-offs between interpretability and accuracy, and the lack of guarantees about the explanation quality. In addition, there are issues of quality, fairness, robustness and privacy in model interpretation. In this talk, I present current state-of-the arts approaches in explainable multimedia AI, along with our preliminary research on relation inference in videos, as well as leveraging prior domain knowledge, information theoretic principles, and adversarial algorithms to achieving interpretability. I will also discuss future research towards quality, fairness and robustness of interpretable AI.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126229417","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jakub Lokoč, Gregor Kovalcík, Tomás Soucek, J. Moravec, Premysl Cech
Known-item search in large video collections still represents a challenging task for current video retrieval systems, which have to rely on both state-of-the-art ranking models and interactive means of retrieval. We present a general overview of the current version of the VIRET tool, an interactive video retrieval system that has successfully participated in several international evaluation campaigns. The system is based on multi-modal search and convenient inspection of results. Based on query logs collected from four users controlling instances of the tool at the Video Browser Showdown 2019, we highlight query modification statistics and a list of successful query formulation strategies. We conclude that the VIRET tool represents a competitive reference interactive system for effective known-item search in one thousand hours of video.
{"title":"VIRET","authors":"Jakub Lokoč, Gregor Kovalcík, Tomás Soucek, J. Moravec, Premysl Cech","doi":"10.1145/3323873.3325034","DOIUrl":"https://doi.org/10.1145/3323873.3325034","url":null,"abstract":"Known-item search in large video collections still represents a challenging task for current video retrieval systems that have to rely both on state-of-the-art ranking models and interactive means of retrieval. We present a general overview of the current version of the VIRET tool, an interactive video retrieval system that successfully participated at several international evaluation campaigns. The system is based on multi-modal search and convenient inspection of results. Based on collected query logs of four users controlling instances of the tool at the Video Browser Showdown 2019, we highlight query modification statistics and a list of successful query formulation strategies. We conclude that the VIRET tool represents a competitive reference interactive system for effective known-item search in one thousand hours of video.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"13 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126014942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A typical supervised image enhancement pipeline minimizes the distance between the enhanced image and a reference image. Pixel-wise and perceptual loss functions can help improve general image quality but are not very effective at improving image aesthetic quality. In this paper, we propose a novel Residual-connected Dilated U-Net (RDU-Net) for improving image aesthetic quality. By using different dilation rates, RDU-Net extracts features at multiple receptive fields and merges information from local to global scales, which is highly desirable in image enhancement. We also propose an encoder-constraint perceptual loss, which teaches the enhancement network to dig out latent aesthetic factors and makes the enhanced image more natural and aesthetically appealing. The proposed approach can alleviate over-enhancement. The experimental results show that the proposed perceptual loss function gives stable backpropagation and that the proposed method outperforms the state of the art.
{"title":"Naturalness Preserved Image Aesthetic Enhancement with Perceptual Encoder Constraint","authors":"Leida Li, Yuzhe Yang, Hancheng Zhu","doi":"10.1145/3323873.3326591","DOIUrl":"https://doi.org/10.1145/3323873.3326591","url":null,"abstract":"Typical supervised image enhancement pipeline is to minimize the distance between the enhanced image and the reference one. Pixel-wise and perceptual-wise loss functions could help to improve the general image quality, however are not very efficient in improving the image aesthetic quality. In this paper, we propose a novel Residual connected Dilated U-Net (RDU-Net) for improving the image aesthetic quality. By using different dilation rates, the RDU-Net can extract multiple receptive-field features and merge the maximum information from local to global, which are highly desired in image enhancement. Also, we propose an encoder constraint perceptual loss, which could teach the enhancement network to dig out the latent aesthetic factors and make the enhanced image more natural and aesthetically appealing. The proposed approach can alleviate the over-enhancement phenomenons. The experimental results show that the proposed perceptual loss function could give a steady back propagation and the proposed method outperforms the state-of-the-arts.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"473 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122583035","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, a novel Deep Semantic Space learning model with an Intra-class Low-rank constraint (DSSIL) is proposed for cross-modal retrieval. It is composed of two subnetworks for modality-specific representation learning, followed by projection layers for common-space mapping. In particular, DSSIL takes semantic consistency into account to fuse the cross-modal data in a high-level common space, and constrains the common representation matrix within each class to be low-rank so that intra-class representations become more closely related. More formally, two regularization terms are devised for these two aspects and incorporated into the objective of DSSIL. To optimize the modality-specific subnetworks and the projection layers simultaneously by direct gradient descent, we approximate the nonconvex low-rank constraint by minimizing a few of the smallest singular values of the intra-class matrix, with theoretical analysis. Extensive experiments conducted on three public datasets demonstrate the superiority of DSSIL for cross-modal retrieval compared with state-of-the-art methods.
{"title":"Deep Semantic Space with Intra-class Low-rank Constraint for Cross-modal Retrieval","authors":"Peipei Kang, Zehang Lin, Zhenguo Yang, Xiaozhao Fang, Qing Li, Wenyin Liu","doi":"10.1145/3323873.3325029","DOIUrl":"https://doi.org/10.1145/3323873.3325029","url":null,"abstract":"In this paper, a novel Deep Semantic Space learning model with Intra-class Low-rank constraint (DSSIL) is proposed for cross-modal retrieval, which is composed of two subnetworks for modality-specific representation learning, followed by projection layers for common space mapping. In particular, DSSIL takes into account semantic consistency to fuse the cross-modal data in a high-level common space, and constrains the common representation matrix within the same class to be low-rank, in order to induce the intra-class representations more relevant. More formally, two regularization terms are devised for the two aspects, which have been incorporated into the objective of DSSIL. To optimize the modality-specific subnetworks and the projection layers simultaneously by exploiting the gradient decent directly, we approximate the nonconvex low-rank constraint by minimizing a few smallest singular values of the intra-class matrix with theoretical analysis. Extensive experiments conducted on three public datasets demonstrate the competitive superiority of DSSIL for cross-modal retrieval compared with the state-of-the-art methods.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"217 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116122878","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper deals with the problem of 3D human tracking in catadioptric images using a particle-filtering framework. While traditional perspective images are well exploited, only a few methods have been developed for catadioptric vision for the human detection or tracking problems. We propose to extend 3D pose estimation from perspective cameras to catadioptric sensors. In this paper, we develop original likelihood functions based, on the one hand, on the geodesic distance in the spherical space SO(3) and, on the other hand, on the mapping between the human silhouette in the images and the projected 3D model. These likelihood functions, combined with a particle filter whose propagation model is adapted to the spherical space, allow accurate 3D human tracking in omnidirectional images. Both visual and quantitative analyses of the experimental results demonstrate the effectiveness of our approach.
{"title":"3D Human Tracking with Catadioptric Omnidirectional Camera","authors":"F. Ababsa, H. Hadj-Abdelkader, Marouane Boui","doi":"10.1145/3323873.3325027","DOIUrl":"https://doi.org/10.1145/3323873.3325027","url":null,"abstract":"This paper deals with the problem of 3D human tracking in catadioptric images using particle-filtering framework. While traditional perspective images are well exploited, only a few methods have been developed for catadioptric vision, for the human detection or tracking problems. We propose to extend the 3D pose estimation in the case of perspective cameras to catadioptric sensors. In this paper, we develop an original likelihood functions based, on the one hand, on the geodetic distance in the spherical space SO3 and, on the other hand, on the mapping between the human silhouette in the images and the projected 3D model. These likelihood functions combined with a particle filter, whose propagation model is adapted to the spherical space, allow accurate 3D human tracking in omnidirectional images. Both visual and quantitative analysis of the experimental results demonstrate the effectiveness of our approach.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134388408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}