
Proceedings of the 2nd ACM International Conference on Multimedia in Asia: Latest Publications

Hierarchical clustering via mutual learning for unsupervised person re-identification
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446268
Xu Xu, Liyan Zhang, Zhaomeng Huang, Guodong Du
Person re-identification (re-ID) aims to establish identity correspondence across different cameras. State-of-the-art re-ID approaches are mainly clustering-based Unsupervised Domain Adaptation (UDA) methods, which attempt to transfer a model trained on the source domain to the target domain by alternately generating pseudo labels through clustering of target-domain instances and training the network with the generated pseudo labels to perform feature learning. However, these approaches suffer from inevitable label noise introduced by the clustering procedure, which dramatically impacts model training and feature learning on the target domain. To address this issue, we propose an unsupervised Hierarchical Clustering via Mutual Learning (HCML) framework, which jointly optimizes the dual training networks and the clustering procedure to learn more discriminative features from the target domain. Specifically, the proposed HCML framework can effectively update both the hard pseudo labels generated by the clustering process and the soft pseudo labels generated by the training network in an online manner. We jointly adopt the repelled loss, triplet loss, soft identity loss and soft triplet loss to optimize the model. Experimental results on the Market-to-Duke, Duke-to-Market, Market-to-MSMT and Duke-to-MSMT unsupervised domain adaptation tasks demonstrate the superiority of the proposed HCML framework over other state-of-the-art methods.
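To make the interplay between clustering-generated hard pseudo labels and network-generated soft pseudo labels concrete, here is a minimal PyTorch-style sketch of one identity-classification term in such a mutual-learning setup; the function name, blending weight and temperature are illustrative assumptions and do not reproduce the paper's repelled or triplet loss terms.

```python
import torch
import torch.nn.functional as F

def mutual_pseudo_label_loss(logits, cluster_labels, peer_logits,
                             alpha=0.5, temperature=2.0):
    """Blend hard pseudo labels from clustering with soft pseudo labels from a peer network."""
    # Hard pseudo-label term: cross-entropy against the cluster assignments.
    hard_loss = F.cross_entropy(logits, cluster_labels)
    # Soft pseudo-label term: match the peer network's softened predictions.
    soft_targets = F.softmax(peer_logits.detach() / temperature, dim=1)
    log_probs = F.log_softmax(logits / temperature, dim=1)
    soft_loss = F.kl_div(log_probs, soft_targets, reduction="batchmean") * temperature ** 2
    return (1.0 - alpha) * hard_loss + alpha * soft_loss
```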
Citations: 0
Determining image age with rank-consistent ordinal classification and object-centered ensemble
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446326
Shota Ashida, A. Jatowt, A. Doucet, Masatoshi Yoshikawa
A significant number of old photographs, including ones posted online, do not contain information about the date at which they were taken, or this information needs to be verified. Many such pictures are either scanned analog photographs or photographs taken with a digital camera whose date settings were incorrect. Estimating the dates of such pictures is useful for enhancing data quality and consistency, improving information retrieval, and for other related applications. In this study, we propose a novel approach for automatically estimating the shooting dates of photographs based on a rank-consistent ordinal classification method for neural networks. We also introduce an ensemble approach that involves object segmentation. We conclude that enforcing rank consistency in the ordinal classification, as well as combining models trained on segmented objects, improves the results of the age determination task.
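Rank-consistent ordinal classification is commonly realized CORAL-style: K-1 binary "later than period k" tasks share one weight vector and differ only in their biases, which guarantees monotonically ordered probabilities. The sketch below illustrates that general technique under the stated assumption that it matches the authors' classification head; class and function names are hypothetical.

```python
import torch
import torch.nn as nn

class RankConsistentHead(nn.Module):
    """K-1 binary tasks sharing one weight vector, differing only in bias, give consistent ranks."""
    def __init__(self, in_dim: int, num_classes: int):
        super().__init__()
        self.shared = nn.Linear(in_dim, 1, bias=False)             # one shared weight vector
        self.biases = nn.Parameter(torch.zeros(num_classes - 1))   # ordered per-task biases

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.shared(features) + self.biases                 # (batch, K-1) logits

def predict_period(logits: torch.Tensor) -> torch.Tensor:
    """Predicted age period = number of binary tasks answering 'later than period k'."""
    return (torch.sigmoid(logits) > 0.5).sum(dim=1)
```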
Citations: 0
Multi-level expression guided attention network for referring expression comprehension
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446270
Liang Peng, Yang Yang, Xing Xu, Jingjing Li, Xiaofeng Zhu
Referring expression comprehension is the task of identifying a text-related object or region in a given image from a natural language expression. In this task, it is essential to understand the expression sentence from multiple aspects and adapt it to region representations to generate discriminative information. Unfortunately, previous approaches usually focus on the important words or phrases in the expression using self-attention mechanisms, which may cause them to fail to distinguish the target region from others, especially similar regions. To address this problem, we propose a novel model, termed Multi-level Expression Guided Attention network (MEGA-Net). It contains a multi-level visual attention schema guided by expression representations at different levels, i.e., sentence level, word level and phrase level, which allows generating discriminative region features and helps to locate the related regions accurately. In addition, to distinguish between similar regions, we design a two-stage structure: in the first stage we select the top-K candidate regions according to their matching scores, and in the second stage we apply an object comparison attention mechanism to learn the differences between the candidates for matching the target region. We evaluate the proposed approach on three popular benchmark datasets, and the experimental results demonstrate that our model performs favorably against state-of-the-art methods.
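A minimal sketch of one expression-guided attention level follows; the same block would be instantiated with sentence-, word- and phrase-level expression embeddings. Module and variable names are illustrative assumptions, not the paper's exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpressionGuidedAttention(nn.Module):
    """Score each image region against one level of the expression embedding."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, expr: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # expr: (B, D) expression embedding at one level; regions: (B, N, D) region features
        expanded = expr.unsqueeze(1).expand(-1, regions.size(1), -1)
        weights = F.softmax(
            self.score(torch.cat([expanded, regions], dim=-1)).squeeze(-1), dim=1)
        return torch.bmm(weights.unsqueeze(1), regions).squeeze(1)   # (B, D) attended feature
```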
Citations: 2
Relationship graph learning network for visual relationship detection
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446312
Yanan Li, Jun Yu, Yibing Zhan, Zhi Chen
Visual relationship detection aims to predict the relationships between detected object pairs. It is widely believed that the correlations between image components (i.e., objects and the relationships between objects) are significant considerations when predicting objects' relationships. However, most current visual relationship detection methods only exploit the correlations among objects, while the correlations among objects' relationships remain underexplored. This paper proposes a relationship graph learning network (RGLN) that explores the correlations among objects' relationships for visual relationship detection. Specifically, RGLN obtains image objects using an object detector, and every pair of objects then constitutes a relationship proposal. All relationship proposals construct a relationship graph, in which the proposals are treated as nodes. Accordingly, RGLN designs bi-stream graph attention subnetworks to detect relationship proposals: one graph attention subnetwork analyzes correlations among relationships based on visual and spatial information, and the other analyzes correlations based on semantic and spatial information. Besides, RGLN exploits a relationship selection subnetwork to ignore the redundant information of object pairs with no relationships. We conduct extensive experiments on two public datasets, VRD and VG. The experimental results, compared with the state of the art, demonstrate the competitiveness of RGLN.
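Each graph attention subnetwork can be pictured as self-attention over the fully connected graph of relationship proposals. The sketch below uses plain scaled dot-product attention as a stand-in; the names and the exact attention form are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProposalGraphAttention(nn.Module):
    """One message-passing step over relationship-proposal nodes."""
    def __init__(self, dim: int):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (N, D), one feature per relationship proposal (e.g. a visual + spatial stream)
        scores = self.q(nodes) @ self.k(nodes).t() / nodes.size(-1) ** 0.5
        attn = F.softmax(scores, dim=-1)
        return nodes + attn @ self.v(nodes)      # residual update of every proposal
```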
Citations: 2
Table detection and cell segmentation in online handwritten documents with graph attention networks
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446295
Ying-Jian Liu, Heng Zhang, Xiao-Long Yun, Jun-Yu Ye, Cheng-Lin Liu
In this paper, we propose a multi-task learning approach for table detection and cell segmentation in free-form online documents using densely connected graph attention networks. Each online document is regarded as a graph, where nodes represent strokes and edges represent the relationships between strokes. We then propose a graph attention network model to classify nodes and edges simultaneously. According to the node classification results, tables can be detected in each document. By combining the node and edge classification results, the cells in each table can be segmented. To improve information flow in the network and enable efficient reuse of features among layers, dense connectivity among layers is used. Our proposed model has been experimentally validated on the online handwritten document dataset IAMOnDo and achieved encouraging results.
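The dense connectivity can be sketched as each graph layer consuming the concatenation of the input and all earlier layer outputs, so low-level stroke features remain visible at every depth. In this sketch the neighborhood aggregation is simplified to a normalized-adjacency update rather than full attention; all names are illustrative.

```python
import torch
import torch.nn as nn

class DenselyConnectedGraphLayers(nn.Module):
    """Each layer reads the concatenation of the input and all previous layer outputs."""
    def __init__(self, dim: int, num_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(dim * (i + 1), dim) for i in range(num_layers))

    def forward(self, strokes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # strokes: (N, D) stroke-node features; adj: (N, N) row-normalized adjacency
        outputs = [strokes]
        for layer in self.layers:
            dense_input = torch.cat(outputs, dim=-1)
            outputs.append(torch.relu(adj @ layer(dense_input)))
        return torch.cat(outputs, dim=-1)        # features reused from every depth
```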
Citations: 0
Storyboard relational model for group activity recognition
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446255
Boning Li, Xiangbo Shu, Rui Yan
This work concerns how to effectively recognize a group activity performed collectively by multiple persons. As is known, Storyboards (i.e., medium shots, close shots) jointly describe the whole storyline of a movie in a compact way. Likewise, the actors in small subgroups (similar to Storyboards) of a group activity scene contribute substantially to the group activity and develop more compact relationships within their subgroups. Inspired by this, we propose a Storyboard Relational Model (SRM) to address the problem of group activity recognition by splitting and reintegrating the group activity based on small yet compact Storyboards. SRM mainly consists of a Pose-Guided Pruning (PGP) module and a Dual Graph Convolutional Network (Dual-GCN) module. Specifically, PGP is designed to refine a series of Storyboards from the group activity scene by leveraging the attention ranges of individuals. Dual-GCN models the compact relationships among the actors in a Storyboard. Experimental results on two widely used datasets illustrate the effectiveness of the proposed SRM compared with state-of-the-art methods.
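A hedged sketch of the Storyboard-splitting step: actors whose centers fall inside an anchor actor's attention range form one subgroup. Here the attention range is simplified to a fixed radius, whereas PGP derives it from pose; the function name is hypothetical.

```python
import torch

def build_storyboards(centers: torch.Tensor, radius: float):
    """Return, for every anchor actor, the indices of actors inside its attention range."""
    # centers: (N, 2) bounding-box centers of the detected actors
    pairwise = torch.cdist(centers, centers)          # (N, N) Euclidean distances
    return [torch.nonzero(row <= radius).flatten() for row in pairwise]
```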
Citations: 5
Objective object segmentation visual quality evaluation based on pixel-level and region-level characteristics
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446305
Ran Shi, Jian Xiong, T. Qiao
Objective object segmentation visual quality evaluation is an emerging member of the visual quality assessment family. It aims at developing an objective measure, instead of a subjective survey, to evaluate object segmentation quality in agreement with human visual perception. It is an important benchmark for assessing and comparing the performance of object segmentation methods in terms of visual quality. In spite of its essential role, it still lacks sufficient study compared with other visual quality evaluation research. In this paper, we propose a novel full-reference objective measure consisting of a pixel-level sub-measure and a region-level sub-measure. The pixel-level sub-measure assigns proper weights not only to false positive and false negative pixels but also to true positive pixels, according to their certainty degrees. The region-level sub-measure considers the location distribution of false negative errors and the correlations among neighboring pixels. By combining these two sub-measures, our measure can evaluate the similarity of area, shape and object completeness between a segmentation result and its ground truth in terms of human visual perception. To evaluate the performance of the proposed measure, we tested it on an object segmentation subjective visual quality assessment database. The experimental results demonstrate that our measure is robust and matches subjective assessments better than other state-of-the-art objective measures.
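The pixel-level idea can be pictured as an overlap score whose true positive, false positive and false negative counts carry separate weights. In the paper these weights are per-pixel certainty degrees; the NumPy sketch below uses scalar weights for brevity and is only an illustration of the weighting principle.

```python
import numpy as np

def weighted_pixel_measure(pred: np.ndarray, gt: np.ndarray,
                           w_tp: float = 1.0, w_fp: float = 1.0, w_fn: float = 1.0) -> float:
    """Jaccard-style score with separate weights on true/false positives and false negatives."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.logical_and(pred, gt).sum()
    fp = np.logical_and(pred, ~gt).sum()
    fn = np.logical_and(~pred, gt).sum()
    return float(w_tp * tp) / (w_tp * tp + w_fp * fp + w_fn * fn + 1e-8)
```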
Citations: 1
Fusing CAMs-weighted features and temporal information for robust loop closure detection
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446309
Yao Li, S. Zhong, Tongwei Ren, Y. Liu
As a key component of simultaneous localization and mapping (SLAM) systems, loop closure detection (LCD) eliminates accumulated errors by recognizing previously visited places. In recent years, deep learning methods have proved effective for LCD. However, most existing methods do not make good use of the information provided by monocular images, which tends to limit their performance in challenging dynamic scenarios with partial occlusion by moving objects. To this end, we propose a novel workflow that combines multiple kinds of information provided by images. We first introduce semantic information into LCD by developing a local-aware Class Activation Maps (CAMs) weighting method for extracting features, which can reduce the adverse effects of moving objects. Compared with previous methods based on semantic segmentation, our method has the advantage of not requiring additional models or other complex operations. In addition, we propose two effective temporal constraint strategies, which utilize the relationships between image sequences to improve detection performance. Moreover, we propose to use a keypoint matching strategy as the final detector to further reject false positives. Experiments on four publicly available datasets indicate that our approach achieves higher accuracy and better robustness than state-of-the-art methods.
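The CAM-weighting step can be sketched as follows: a class activation map is formed from the last convolutional features and the classifier weights, then used to re-weight those features before pooling them into a global descriptor. Function and argument names are illustrative assumptions, not the paper's exact local-aware formulation.

```python
import torch

def cam_weighted_descriptor(conv_features: torch.Tensor,
                            fc_weights: torch.Tensor,
                            class_idx: int) -> torch.Tensor:
    """Pool conv features after re-weighting them with a class activation map."""
    # conv_features: (C, H, W) last convolutional feature maps
    # fc_weights:    (num_classes, C) weights of the final classification layer
    cam = torch.einsum('c,chw->hw', fc_weights[class_idx], conv_features).clamp(min=0)
    cam = cam / (cam.max() + 1e-8)                      # normalize the activation map
    return (conv_features * cam).flatten(1).sum(dim=1)  # (C,) CAM-weighted global descriptor
```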
Citations: 0
Distilling knowledge in causal inference for unbiased visual question answering
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446256
Yonghua Pan, Zechao Li, Liyan Zhang, Jinhui Tang
Current Visual Question Answering (VQA) models mainly exploit the statistical correlations between answers and questions, and thus fail to capture the relationship between the visual information and the answers. Their performance dramatically decreases when the distribution of the handled data differs from that of the training data. To this end, this paper proposes a novel unbiased VQA model that explores Causal Inference with Knowledge Distillation (CIKD) to reduce the influence of bias. Specifically, a causal graph is first constructed to explore counterfactual causality and infer the causal target based on the causal effect, which reduces the bias from questions and obtains answers without training. Knowledge distillation is then leveraged to transfer the knowledge of the inferred causal target to a conventional VQA model, which enables the proposed method to handle both biased data and standard data. To address the harmful bias introduced by knowledge distillation, ensemble learning is introduced based on the hypothesized reason for the bias. Experiments are conducted to show the performance of the proposed method. The significant improvements over state-of-the-art methods on the VQA-CP v2 dataset validate the contributions of this work.
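One common way to realize such counterfactual debiasing, shown below purely as an assumption rather than the paper's exact formulation, is to subtract the prediction of a question-only branch (the language prior) from the fused prediction at inference time; the resulting indirect effect can then serve as the causal target for distillation.

```python
import torch

def counterfactual_debiased_logits(fused_logits: torch.Tensor,
                                   question_only_logits: torch.Tensor) -> torch.Tensor:
    """Remove the language-prior (question-only) effect from the fused VQA prediction."""
    total_effect = fused_logits            # factual prediction from question + image
    direct_effect = question_only_logits   # counterfactual prediction from the question alone
    return total_effect - direct_effect    # remaining effect attributed to the visual evidence
```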
Citations: 9
Global and local feature alignment for video object detection
Pub Date : 2021-03-07 DOI: 10.1145/3444685.3446263
Haihui Ye, Qiang Qi, Ying Wang, Yang Lu, Hanzi Wang
Extending image-based object detectors to the video domain suffers from considerable inadaptability due to frames deteriorated by motion blur, partial occlusion or unusual poses. The features generated from such deteriorated frames are poorly aligned, which degrades the overall performance of video object detectors. How to capture valuable information locally or globally is important for feature alignment but remains quite challenging. In this paper, we propose a Global and Local Feature Alignment (GLFA) module for video object detection, which distills both global and local information to excavate the deep relationships between features for feature alignment. Specifically, GLFA models the spatial-temporal dependencies across frames by propagating global information, and captures the interactive correspondences within the same frame by aggregating valuable local information. Moreover, we introduce a Self-Adaptive Calibration (SAC) module to strengthen the semantic representation of features and distill valuable local information in a dual local-alignment manner. Experimental results on the ImageNet VID dataset show that the proposed method achieves high performance as well as a good trade-off between real-time speed and competitive accuracy.
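A minimal sketch of the global alignment step: reference-frame features attend over support-frame features with softmax-normalized cosine similarity and are updated residually. The similarity choice and all names are assumptions, not the module's exact design.

```python
import torch
import torch.nn.functional as F

def globally_align(reference: torch.Tensor, supports: torch.Tensor) -> torch.Tensor:
    """Aggregate support-frame features onto the reference frame by feature similarity."""
    # reference: (N, D) reference-frame features; supports: (M, D) support-frame features
    sim = F.normalize(reference, dim=1) @ F.normalize(supports, dim=1).t()   # (N, M)
    weights = F.softmax(sim, dim=1)
    return reference + weights @ supports        # residually aligned reference features
```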
Citations: 0