
Proceedings of the 2019 on International Conference on Multimedia Retrieval: Latest Publications

Image Emotion Distribution Learning with Graph Convolutional Networks
Pub Date : 2019-06-05 DOI: 10.1145/3323873.3326593
Tao He, Xiaoming Jin
Recently, with the rapid progress of techniques in visual analysis, a lot of attention has been paid to affective computing due to its wide potential applications. Traditional affective analysis mainly focuses on single-label image emotion classification, but a single image may evoke different emotions in different people, or even in the same person. Emotion distribution learning has therefore been proposed to capture the underlying emotion distribution of images. Current state-of-the-art works model the distribution with deep convolutional networks equipped with distribution-specific losses. However, these works ignore the correlation among different emotions: some emotions usually co-appear, while others are hardly ever evoked at the same time. Properly modeling this correlation is important for image emotion distribution learning. Graph convolutional networks have shown strong performance in capturing the underlying relationships in a graph and have been successfully applied to vision problems such as zero-shot image classification. In this paper, we therefore propose to apply graph convolutional networks to emotion distribution learning. The resulting model, termed EmotionGCN, captures the correlation among emotions and can use correlations either mined from data or taken directly from psychological models such as Mikels' wheel. Extensive experiments are conducted on the FlickrLDL and TwitterLDL datasets, and the results on seven evaluation metrics demonstrate the superiority of the proposed method.
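As an illustration of the core idea (not the authors' released code), the sketch below propagates learnable emotion-node embeddings over a correlation graph with a two-layer GCN and fuses them with an image feature to produce an emotion distribution; the eight Mikels categories, the placeholder `EMOTION_ADJ` matrix, the feature dimensions and the dot-product fusion are all assumptions made for this example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative 8x8 correlation matrix over Mikels' emotions
# (amusement, awe, contentment, excitement, anger, disgust, fear, sadness).
# In practice it could be mined from label co-occurrence or taken from the wheel.
EMOTION_ADJ = torch.eye(8) + 0.3 * torch.rand(8, 8)  # placeholder correlations

class GCNLayer(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.weight = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x, adj):
        # Symmetric normalization: D^{-1/2} A D^{-1/2}
        deg = adj.sum(dim=1)
        d_inv_sqrt = torch.diag(deg.pow(-0.5))
        adj_norm = d_inv_sqrt @ adj @ d_inv_sqrt
        return F.relu(self.weight(adj_norm @ x))

class EmotionGCNSketch(nn.Module):
    """Fuse image features with emotion-node embeddings propagated over the graph."""
    def __init__(self, img_dim=2048, emo_dim=300, hidden=512, n_emotions=8):
        super().__init__()
        self.node_embed = nn.Parameter(torch.randn(n_emotions, emo_dim))
        self.gcn1 = GCNLayer(emo_dim, hidden)
        self.gcn2 = GCNLayer(hidden, img_dim)

    def forward(self, img_feat, adj=EMOTION_ADJ):
        # img_feat: (batch, img_dim) from any CNN backbone
        emo_nodes = self.gcn2(self.gcn1(self.node_embed, adj), adj)  # (n_emotions, img_dim)
        logits = img_feat @ emo_nodes.t()                            # (batch, n_emotions)
        return F.softmax(logits, dim=-1)                             # emotion distribution

if __name__ == "__main__":
    model = EmotionGCNSketch()
    print(model(torch.randn(4, 2048)).shape)  # torch.Size([4, 8])
```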
Citations: 22
Emotion Reinforced Visual Storytelling
Pub Date : 2019-06-05 DOI: 10.1145/3323873.3325050
Nanxing Li, Bei Liu, Zhizhong Han, Yu-Shen Liu, Jianlong Fu
Automatic story generation from a sequence of images, i.e., visual storytelling, has attracted extensive attention. The main challenge derives from modeling rich, visually inspired human emotions, which is what allows diverse yet realistic stories to be generated even from the same sequence of images. Existing works usually adopt sequence-based generative adversarial networks (GANs) that encode deterministic image content (e.g., concepts, attributes), while neglecting probabilistic inference from an image over the emotion space. In this paper, we take one step further and create human-level stories by modeling image content together with emotions, generating textual paragraphs via emotion-reinforced adversarial learning. First, we introduce the concept of emotion into visual storytelling. The emotion feature is a representation of the emotional content of the generated story, which enables our model to capture human emotion. Second, stories are generated by a recurrent neural network and further optimized by emotion-reinforced adversarial learning with three critics, which ensure visual relevance, language style, and emotion consistency. Our model is able to generate stories based not only on emotions produced by our novel emotion generator, but also on customized emotions. The introduction of emotion brings more variety and realism to visual storytelling. We evaluate the proposed model on the largest visual storytelling dataset (VIST). Extensive experiments show superior performance over state-of-the-art methods.
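A minimal sketch of how three critic scores could be combined into a single reward for a REINFORCE-style update of a story generator; the critic interfaces, the mixing weights and the constant baseline are hypothetical and not taken from the paper.

```python
import torch

def reinforce_loss(log_probs, rel_score, style_score, emo_score,
                   weights=(0.4, 0.3, 0.3), baseline=0.5):
    """Policy-gradient loss for one sampled story.

    log_probs : (T,) log-probabilities of the sampled story tokens
    *_score   : scalar critic scores in [0, 1] for visual relevance,
                language style and emotion consistency (hypothetical critics)
    weights   : assumed mixing weights for the three critics
    baseline  : assumed constant baseline to reduce gradient variance
    """
    reward = (weights[0] * rel_score
              + weights[1] * style_score
              + weights[2] * emo_score)
    advantage = reward - baseline
    # REINFORCE: push up the log-probability of stories with positive advantage
    return -(advantage * log_probs.sum())

if __name__ == "__main__":
    logits = torch.randn(20, 1000, requires_grad=True)    # generator outputs
    tokens = torch.randint(0, 1000, (20,))                 # sampled story tokens
    log_probs = torch.log_softmax(logits, dim=-1)[torch.arange(20), tokens]
    loss = reinforce_loss(log_probs, rel_score=0.8, style_score=0.6, emo_score=0.7)
    loss.backward()                                        # gradients reach the generator
    print(float(loss))
```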
Citations: 19
Deep Association: End-to-end Graph-Based Learning for Multiple Object Tracking with Conv-Graph Neural Network
Pub Date : 2019-06-05 DOI: 10.1145/3323873.3325010
Cong Ma, Yuan Li, F. Yang, Ziwei Zhang, Yueqing Zhuang, Huizhu Jia, Xiaodong Xie
Multiple Object Tracking (MOT) has a wide range of applications in surveillance retrieval and autonomous driving. The majority of existing methods focus on extracting features with deep learning and on hand-crafted optimization of bipartite graphs or network flows. In this paper, we propose an efficient end-to-end model, the Deep Association Network (DAN), to learn from graph-based training data constructed from the spatial-temporal interactions of objects. DAN combines a Convolutional Neural Network (CNN), a Motion Encoder (ME) and a Graph Neural Network (GNN). The CNNs and motion encoders extract appearance features from bounding-box images and motion features from positions, respectively, and the GNN then optimizes the graph structure to associate the same object across frames. In addition, we present a novel end-to-end training strategy for the Deep Association Network. Our experimental results demonstrate that DAN matches state-of-the-art methods on MOT16 and DukeMTMCT without using extra datasets.
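To make the association step concrete, the sketch below scores every pair of detections between two consecutive frames with a small MLP over concatenated appearance and motion features; it is a simplified stand-in for DAN's graph-based association, and the feature dimensions and network sizes are assumptions.

```python
import torch
import torch.nn as nn

class EdgeScorer(nn.Module):
    """Score how likely two detections belong to the same object.

    Appearance features would come from a CNN over the bounding-box crop and
    motion features from an encoder over box positions; here both are assumed
    to be precomputed and concatenated into one vector per detection.
    """
    def __init__(self, app_dim=512, mot_dim=64, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * (app_dim + mot_dim), hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, feat_t, feat_t1):
        # feat_t: (N, D) detections at frame t, feat_t1: (M, D) at frame t+1
        n, m = feat_t.size(0), feat_t1.size(0)
        pairs = torch.cat(
            [feat_t.unsqueeze(1).expand(n, m, -1),
             feat_t1.unsqueeze(0).expand(n, m, -1)], dim=-1)
        return torch.sigmoid(self.mlp(pairs)).squeeze(-1)  # (N, M) association matrix

if __name__ == "__main__":
    det_t = torch.randn(5, 512 + 64)   # 5 detections at frame t
    det_t1 = torch.randn(6, 512 + 64)  # 6 detections at frame t+1
    print(EdgeScorer()(det_t, det_t1).shape)  # torch.Size([5, 6])
```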
Citations: 29
Benchmarking Search and Annotation in Continuous Human Skeleton Sequences
Pub Date : 2019-06-05 DOI: 10.1145/3323873.3325013
J. Sedmidubský, Petr Elias, P. Zezula
Motion capture data are digital representations of human movements in the form of 3D trajectories of multiple body joints. To understand the captured motions, similarity-based processing and deep learning have already proved effective, especially in classifying pre-segmented actions. However, in real-world scenarios motion data are typically captured as long continuous sequences, without explicit knowledge of semantic partitioning. To make such unsegmented data accessible and reusable, as required by many applications, there is a strong need to analyze, search, annotate and mine them automatically. However, there is currently an absence of datasets and benchmarks to test and compare the capabilities of the techniques developed for continuous motion data processing. In this paper, we introduce a new large-scale LSMB19 dataset consisting of two 3D skeleton sequences with a total length of 54.5 hours. We also define a benchmark on two important multimedia retrieval operations: subsequence search and annotation. Additionally, we illustrate the usability of the benchmark by establishing baseline results for these operations.
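As a rough illustration of the subsequence-search operation such a benchmark evaluates, the following naive baseline slides the query over the long recording and ranks starting frames by mean cosine similarity of per-frame skeleton descriptors; the descriptor dimensionality and the scoring scheme are assumptions, not the benchmark's official baseline.

```python
import numpy as np

def subsequence_search(sequence, query, top_k=5):
    """Naive sliding-window subsequence search.

    sequence : (T, D) per-frame skeleton descriptors of the long recording
    query    : (L, D) descriptors of the query action
    Returns the top_k starting frames ranked by mean cosine similarity.
    """
    T, L = len(sequence), len(query)
    # L2-normalize so that dot products become cosine similarities
    seq = sequence / (np.linalg.norm(sequence, axis=1, keepdims=True) + 1e-8)
    qry = query / (np.linalg.norm(query, axis=1, keepdims=True) + 1e-8)
    scores = [float((seq[s:s + L] * qry).sum(axis=1).mean()) for s in range(T - L + 1)]
    order = np.argsort(scores)[::-1][:top_k]
    return [(int(s), scores[s]) for s in order]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    long_seq = rng.normal(size=(1000, 66))   # e.g. 22 joints x 3 coordinates per frame
    query = long_seq[400:430]                # known-item segment
    print(subsequence_search(long_seq, query)[0])  # best hit should start at frame 400
```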
Citations: 3
Cross-Modal Video Moment Retrieval with Spatial and Language-Temporal Attention
Pub Date : 2019-06-05 DOI: 10.1145/3323873.3325019
Bin Jiang, Xin Huang, Chao Yang, Junsong Yuan
Given an untrimmed video and a description query, temporal moment retrieval aims to localize the temporal segment within the video that best matches the textual query. Existing studies predominantly employ coarse frame-level features as the visual representation, obscuring the specific details that may provide critical cues for localizing the desired moment. We propose SLTA (short for "Spatial and Language-Temporal Attention") to address this missing-detail issue. Specifically, SLTA takes advantage of object-level local features and attends to the most relevant ones (e.g., the local features "girl" and "cup") via spatial attention. We then encode the sequence of local features over consecutive frames to capture the interaction information among these objects (e.g., the interaction "pour" involving the two objects). Meanwhile, a language-temporal attention is used to emphasize keywords based on moment context information. Our two attention sub-networks can therefore recognize the most relevant objects and interactions in the video while simultaneously highlighting the keywords in the query. Extensive experiments on the TACoS, Charades-STA and DiDeMo datasets demonstrate the effectiveness of our model compared to state-of-the-art methods.
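The snippet below sketches the spatial-attention part only: object-level features of one frame are weighted by their relevance to a sentence embedding and pooled into a single frame feature. The additive scoring form and all dimensions are illustrative assumptions rather than the exact SLTA architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Attend to the object-level features most relevant to the query.

    Object features (e.g. from an off-the-shelf detector) and the sentence
    embedding are assumed to be precomputed; dimensions are illustrative.
    """
    def __init__(self, obj_dim=2048, query_dim=512, hidden=512):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim, hidden)
        self.qry_proj = nn.Linear(query_dim, hidden)
        self.score = nn.Linear(hidden, 1)

    def forward(self, obj_feats, query_emb):
        # obj_feats: (K, obj_dim) objects in one frame, query_emb: (query_dim,)
        fused = torch.tanh(self.obj_proj(obj_feats) + self.qry_proj(query_emb))
        alpha = F.softmax(self.score(fused).squeeze(-1), dim=0)  # (K,) attention weights
        return (alpha.unsqueeze(-1) * obj_feats).sum(dim=0)      # attended frame feature

if __name__ == "__main__":
    attn = SpatialAttention()
    frame_feat = attn(torch.randn(10, 2048), torch.randn(512))
    print(frame_feat.shape)  # torch.Size([2048])
```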
Citations: 61
Keynote: Towards Explainability in AI and Multimedia Research
Pub Date : 2019-06-05 DOI: 10.1145/3323873.3325058
Tat-Seng Chua
AI as a concept has been around since the 1950s. With recent advancements in machine learning technology, and the availability of big data and large computing power, the scene is set for AI to be used in many more systems and applications that will profoundly impact society. Current deep-learning-based AI systems are mostly black boxes and are often non-explainable. Though they achieve high performance, they are also known to make occasional fatal mistakes. This has limited the applications of AI, especially in mission-critical problems such as decision support, command and control, and other life-critical operations. This talk focuses on explainable AI, which holds promise in helping humans better understand and interpret the decisions made by black-box AI models. Current research efforts towards explainable multimedia AI center on two parts of the solution. The first part focuses on better understanding of multimedia content, especially video. This includes dense annotation of video content covering not just object recognition but also relation inference. The relations include both correlation and causality, as well as common-sense knowledge. Dense annotation enables us to raise the level of representation of video towards that of language, in the form of relation triplets and relation graphs, and permits in-depth research on flexible description, question answering and knowledge inference over video content. A large-scale video dataset has been created to support this line of research. The second direction focuses on the development of explainable AI models, which is just beginning. Existing works follow either the intrinsic approach, which designs self-explanatory models, or the post-hoc approach, which constructs a second model to interpret the target model. Both approaches have limitations in the trade-off between interpretability and accuracy, and lack guarantees about explanation quality. In addition, there are issues of quality, fairness, robustness and privacy in model interpretation. In this talk, I present current state-of-the-art approaches in explainable multimedia AI, along with our preliminary research on relation inference in videos, as well as on leveraging prior domain knowledge, information-theoretic principles, and adversarial algorithms to achieve interpretability. I will also discuss future research towards the quality, fairness and robustness of interpretable AI.
Citations: 0
VIRET
Pub Date : 2019-06-05 DOI: 10.1145/3323873.3325034
Jakub Lokoč, Gregor Kovalcík, Tomás Soucek, J. Moravec, Premysl Cech
Known-item search in large video collections still represents a challenging task for current video retrieval systems, which have to rely on both state-of-the-art ranking models and interactive means of retrieval. We present a general overview of the current version of the VIRET tool, an interactive video retrieval system that has successfully participated in several international evaluation campaigns. The system is based on multi-modal search and convenient inspection of results. Based on query logs collected from four users controlling instances of the tool at the Video Browser Showdown 2019, we highlight query modification statistics and a list of successful query formulation strategies. We conclude that the VIRET tool represents a competitive reference interactive system for effective known-item search in one thousand hours of video.
{"title":"VIRET","authors":"Jakub Lokoč, Gregor Kovalcík, Tomás Soucek, J. Moravec, Premysl Cech","doi":"10.1145/3323873.3325034","DOIUrl":"https://doi.org/10.1145/3323873.3325034","url":null,"abstract":"Known-item search in large video collections still represents a challenging task for current video retrieval systems that have to rely both on state-of-the-art ranking models and interactive means of retrieval. We present a general overview of the current version of the VIRET tool, an interactive video retrieval system that successfully participated at several international evaluation campaigns. The system is based on multi-modal search and convenient inspection of results. Based on collected query logs of four users controlling instances of the tool at the Video Browser Showdown 2019, we highlight query modification statistics and a list of successful query formulation strategies. We conclude that the VIRET tool represents a competitive reference interactive system for effective known-item search in one thousand hours of video.","PeriodicalId":149041,"journal":{"name":"Proceedings of the 2019 on International Conference on Multimedia Retrieval","volume":"13 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126014942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Naturalness Preserved Image Aesthetic Enhancement with Perceptual Encoder Constraint
Pub Date : 2019-06-05 DOI: 10.1145/3323873.3326591
Leida Li, Yuzhe Yang, Hancheng Zhu
The typical supervised image enhancement pipeline minimizes the distance between the enhanced image and a reference image. Pixel-wise and perceptual loss functions help improve general image quality, but they are not very effective at improving aesthetic quality. In this paper, we propose a novel Residual-connected Dilated U-Net (RDU-Net) for improving image aesthetic quality. By using different dilation rates, RDU-Net can extract features over multiple receptive fields and merge the maximum amount of information from local to global scales, which is highly desirable in image enhancement. We also propose an encoder-constraint perceptual loss, which teaches the enhancement network to uncover latent aesthetic factors and makes the enhanced image more natural and aesthetically appealing. The proposed approach alleviates over-enhancement. Experimental results show that the proposed perceptual loss function provides stable back-propagation and that the proposed method outperforms the state of the art.
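A minimal sketch of a residual block that merges several dilation rates, in the spirit of RDU-Net; the specific rates, channel counts and 1x1 merge convolution are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class MultiDilationBlock(nn.Module):
    """Residual block that merges features from several dilation rates.

    Different dilation rates enlarge the receptive field from local to global
    while the residual connection preserves the input content.
    """
    def __init__(self, channels=64, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3,
                      padding=r, dilation=r) for r in rates])
        self.merge = nn.Conv2d(channels * len(rates), channels, kernel_size=1)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        # Each branch keeps the spatial size because padding equals the dilation rate
        multi = torch.cat([self.act(branch(x)) for branch in self.branches], dim=1)
        return x + self.merge(multi)   # residual connection

if __name__ == "__main__":
    block = MultiDilationBlock()
    print(block(torch.randn(1, 64, 128, 128)).shape)  # torch.Size([1, 64, 128, 128])
```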
Citations: 0
Deep Semantic Space with Intra-class Low-rank Constraint for Cross-modal Retrieval
Pub Date : 2019-06-05 DOI: 10.1145/3323873.3325029
Peipei Kang, Zehang Lin, Zhenguo Yang, Xiaozhao Fang, Qing Li, Wenyin Liu
In this paper, a novel Deep Semantic Space learning model with an Intra-class Low-rank constraint (DSSIL) is proposed for cross-modal retrieval. It is composed of two subnetworks for modality-specific representation learning, followed by projection layers for common-space mapping. In particular, DSSIL takes semantic consistency into account to fuse the cross-modal data in a high-level common space, and constrains the common representation matrix within the same class to be low-rank in order to make the intra-class representations more correlated. More formally, two regularization terms are devised for these two aspects and incorporated into the objective of DSSIL. To optimize the modality-specific subnetworks and the projection layers simultaneously by exploiting gradient descent directly, we approximate the nonconvex low-rank constraint by minimizing a few of the smallest singular values of the intra-class matrix, supported by theoretical analysis. Extensive experiments conducted on three public datasets demonstrate the competitive superiority of DSSIL for cross-modal retrieval compared with state-of-the-art methods.
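The low-rank surrogate can be written directly as a differentiable loss: for each class, compute the singular values of its representation matrix and penalize the k smallest ones. The sketch below assumes a PyTorch setting and an arbitrary k; it only illustrates the constraint, not the full DSSIL objective.

```python
import torch

def intra_class_lowrank_loss(features, labels, k=3):
    """Surrogate for the intra-class low-rank constraint.

    Penalizes the k smallest singular values of each class's representation
    matrix, which pushes same-class representations towards a low-rank
    (i.e. more correlated) subspace. k and the use of svdvals are assumptions.
    """
    loss = features.new_zeros(())
    for c in labels.unique():
        class_feats = features[labels == c]          # (n_c, d) common representations
        if class_feats.size(0) < 2:
            continue
        svals = torch.linalg.svdvals(class_feats)    # singular values, descending order
        loss = loss + svals[-k:].sum()               # k smallest singular values
    return loss

if __name__ == "__main__":
    feats = torch.randn(32, 128, requires_grad=True)
    labels = torch.arange(32) % 4                    # four classes, eight samples each
    intra_class_lowrank_loss(feats, labels).backward()
    print(feats.grad.shape)  # torch.Size([32, 128])
```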
Citations: 8
3D Human Tracking with Catadioptric Omnidirectional Camera
Pub Date : 2019-06-05 DOI: 10.1145/3323873.3325027
F. Ababsa, H. Hadj-Abdelkader, Marouane Boui
This paper deals with the problem of 3D human tracking in catadioptric images using a particle-filtering framework. While traditional perspective images are well exploited, only a few methods have been developed for catadioptric vision for human detection or tracking. We propose to extend 3D pose estimation from perspective cameras to catadioptric sensors. We develop original likelihood functions based, on the one hand, on the geodesic distance in the spherical space SO(3) and, on the other hand, on the mapping between the human silhouette in the images and the projected 3D model. These likelihood functions, combined with a particle filter whose propagation model is adapted to the spherical space, allow accurate 3D human tracking in omnidirectional images. Both visual and quantitative analyses of the experimental results demonstrate the effectiveness of our approach.
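The sketch below illustrates the geodesic part of such a likelihood: points lifted to the unit sphere are compared by great-circle distance, and the squared errors feed a Gaussian particle weight. The lifting from catadioptric pixels to the sphere, the noise scale and the exact weighting scheme are omitted or assumed here.

```python
import numpy as np

def geodesic_distance(p, q):
    """Great-circle (geodesic) distance between two points on the unit sphere.

    Catadioptric image points can be lifted to the unit sphere via the unified
    projection model; the lifting itself is omitted in this sketch.
    """
    p = p / np.linalg.norm(p)
    q = q / np.linalg.norm(q)
    return float(np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)))

def particle_likelihood(projected_pts, observed_pts, sigma=0.05):
    """Gaussian likelihood of one particle from spherical reprojection errors.

    projected_pts / observed_pts: (N, 3) unit-sphere points of the projected
    3D body model and of the extracted silhouette; sigma is an assumed noise scale.
    """
    errors = np.array([geodesic_distance(p, q)
                       for p, q in zip(projected_pts, observed_pts)])
    return float(np.exp(-0.5 * np.sum(errors ** 2) / sigma ** 2))

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    pts = rng.normal(size=(10, 3))
    noisy = pts + 0.01 * rng.normal(size=(10, 3))
    # In a particle filter, these weights would be normalized across all particles.
    print(particle_likelihood(pts, noisy))
```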
Citations: 4