Person re-identification (re-ID) aims to establish identity correspondence across different cameras. State-of-the-art re-ID approaches are mainly clustering-based Unsupervised Domain Adaptation (UDA) methods, which attempt to transfer a model trained on the source domain to the target domain by alternately generating pseudo labels through clustering target-domain instances and training the network with the generated pseudo labels to perform feature learning. However, these approaches suffer from the inevitable label noise introduced by the clustering procedure, which dramatically impacts model training and feature learning on the target domain. To address this issue, we propose an unsupervised Hierarchical Clustering via Mutual Learning (HCML) framework, which jointly optimizes the dual training network and the clustering procedure to learn more discriminative features from the target domain. Specifically, the proposed HCML framework effectively updates both the hard pseudo labels generated by the clustering process and the soft pseudo labels generated by the training network in an online manner. We jointly adopt the repelled loss, triplet loss, soft identity loss and soft triplet loss to optimize the model. Experimental results on the Market-to-Duke, Duke-to-Market, Market-to-MSMT and Duke-to-MSMT unsupervised domain adaptation tasks demonstrate the superiority of the proposed HCML framework over other state-of-the-art methods.
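As a rough illustration of the mutual-learning idea described above (not the authors' code), the sketch below combines a hard cross-entropy loss on clustering pseudo labels with a soft cross-entropy against a peer network's predictions; the names `logits_a`, `logits_b` and the weight `lambda_soft` are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def mutual_learning_loss(logits_a, logits_b, hard_labels, lambda_soft=0.5):
    """Combine hard pseudo-label CE with a soft CE against the peer network.

    logits_a, logits_b: class logits from the two peer networks, shape (B, C)
    hard_labels: cluster-assigned pseudo labels, shape (B,)
    """
    # Hard identity loss against clustering pseudo labels.
    hard_ce = F.cross_entropy(logits_a, hard_labels)
    # Soft identity loss: network A learns from B's (detached) soft predictions.
    soft_targets = F.softmax(logits_b.detach(), dim=1)
    soft_ce = -(soft_targets * F.log_softmax(logits_a, dim=1)).sum(dim=1).mean()
    return (1 - lambda_soft) * hard_ce + lambda_soft * soft_ce
```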
{"title":"Hierarchical clustering via mutual learning for unsupervised person re-identification","authors":"Xu Xu, Liyan Zhang, Zhaomeng Huang, Guodong Du","doi":"10.1145/3444685.3446268","DOIUrl":"https://doi.org/10.1145/3444685.3446268","url":null,"abstract":"Person re-identification (re-ID) aims to establish identity correspondence across different cameras. State-of-the-art re-ID approaches are mainly clustering-based Unsupervised Domain Adaptation (UDA) methods, which attempt to transfer the model trained on the source domain to target domain, by alternatively generating pseudo labels by clustering target-domain instances and training the network with generated pseudo labels to perform feature learning. However, these approaches suffer from the problem of inevitable label noise caused by the clustering procedure that dramatically impact the model training and feature learning of the target domain. To address this issue, we propose an unsupervised Hierarchical Clustering via Mutual Learning (HCML) framework, which can jointly optimize the dual training network and the clustering procedure to learn more discriminative features from the target domain. Specifically, the proposed HCML framework can effectively update the hard pseudo labels generated by clustering process and soft pseudo label generated by the training network both in on-line manner. We jointly adopt the repelled loss, triplet loss, soft identity loss and soft triplet loss to optimize the model. The experimental results on Market-to-Duke, Duke-to-Market, Market-to-MSMT and Duke-to-MSMT unsupervised domain adaptation tasks have demonstrated the superiority of our proposed HCML framework compared with other state-of-the-art methods.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122002867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shota Ashida, A. Jatowt, A. Doucet, Masatoshi Yoshikawa
A significant number of old photographs, including ones posted online, do not contain information about the date at which they were taken, or this information needs to be verified. Many such pictures are either scanned analog photographs or photographs taken with a digital camera with incorrect settings. Estimating the date of such pictures is useful for enhancing data quality and consistency, improving information retrieval, and for other related applications. In this study, we propose a novel approach for automatically estimating the shooting dates of photographs based on a rank-consistent ordinal classification method for neural networks. We also introduce an ensemble approach that involves object segmentation. We conclude that enforcing rank consistency in the ordinal classification, as well as combining models trained on segmented objects, improves the results of the age determination task.
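For readers unfamiliar with rank-consistent ordinal classification, the following minimal PyTorch sketch shows a CORAL-style head (shared weights, independent biases), which is one standard way to enforce rank consistency; it is an illustrative assumption, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RankConsistentHead(nn.Module):
    """CORAL-style ordinal head: one shared weight vector and K-1 biases,
    which guarantees non-increasing cumulative probabilities (rank consistency)."""
    def __init__(self, in_features, num_ranks):
        super().__init__()
        self.fc = nn.Linear(in_features, 1, bias=False)          # shared weights
        self.biases = nn.Parameter(torch.zeros(num_ranks - 1))   # independent biases

    def forward(self, x):
        # Cumulative logits for P(rank > k), k = 0..K-2.
        return self.fc(x) + self.biases

def predict_rank(cum_logits):
    # Predicted rank = number of thresholds the sample exceeds.
    return (torch.sigmoid(cum_logits) > 0.5).sum(dim=1)
```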
{"title":"Determining image age with rank-consistent ordinal classification and object-centered ensemble","authors":"Shota Ashida, A. Jatowt, A. Doucet, Masatoshi Yoshikawa","doi":"10.1145/3444685.3446326","DOIUrl":"https://doi.org/10.1145/3444685.3446326","url":null,"abstract":"A significant number of old photographs including ones that are posted online do not contain the information of the date at which they were taken, or this information needs to be verified. Many of such pictures are either scanned analog photographs or photographs taken using a digital camera with incorrect settings. Estimating the date of such pictures is useful for enhancing data quality and its consistency, improving information retrieval and for other related applications. In this study, we propose a novel approach for automatic estimation of the shooting dates of photographs based on a rank-consistent ordinal classification method for neural networks. We also introduce an ensemble approach that involves object segmentation. We conclude that assuring the rank consistency in the ordinal classification as well as combining models trained on segmented objects improve the results of the age determination task.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131526715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Liang Peng, Yang Yang, Xing Xu, Jingjing Li, Xiaofeng Zhu
Referring expression comprehension is the task of identifying a text-related object or region in a given image from a natural language expression. In this task, it is essential to understand the expression sentence from multiple aspects and adapt it to region representations in order to generate discriminative information. Unfortunately, previous approaches usually focus on the important words or phrases in the expression using self-attention mechanisms, and may therefore fail to distinguish the target region from others, especially similar regions. To address this problem, we propose a novel model, termed Multi-level Expression Guided Attention network (MEGA-Net). It contains a multi-level visual attention schema guided by expression representations at different levels, i.e., sentence-level, word-level and phrase-level, which allows generating discriminative region features and helps to locate the related regions accurately. In addition, to distinguish between similar regions, we design a two-stage structure: in the first stage, we select the top-K candidate regions according to their matching scores, and in the second stage, we apply an object comparison attention mechanism to learn the differences between the candidates in order to match the target region. We evaluate the proposed approach on three popular benchmark datasets, and the experimental results demonstrate that our model performs favorably against state-of-the-art methods.
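The sketch below illustrates, in simplified form, how an expression embedding at one level (sentence, word or phrase) could guide attention over candidate region features; the function and tensor names are hypothetical and not taken from the paper.

```python
import torch
import torch.nn.functional as F

def expression_guided_attention(region_feats, expr_feat):
    """Weight region features by their similarity to one level of the
    expression embedding (sentence-, word- or phrase-level).

    region_feats: (N, D) features of N candidate regions
    expr_feat:    (D,)  expression embedding at one level
    """
    scores = region_feats @ expr_feat              # (N,) similarity scores
    attn = F.softmax(scores, dim=0)                # attention over regions
    return attn.unsqueeze(1) * region_feats        # expression-attended region features
```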
{"title":"Multi-level expression guided attention network for referring expression comprehension","authors":"Liang Peng, Yang Yang, Xing Xu, Jingjing Li, Xiaofeng Zhu","doi":"10.1145/3444685.3446270","DOIUrl":"https://doi.org/10.1145/3444685.3446270","url":null,"abstract":"Referring expression comprehension is a task of identifying a text-related object or region in a given image by a natural language expression. In this task, it is essential to understand the expression sentence in multi-aspect and adapt it to region representations for generating the discriminative information. Unfortunately, previous approaches usually focus on the important words or phrases in the expression using self-attention mechanisms, which causes that they may fail to distinguish the target region from others, especially the similar regions. To address this problem, we propose a novel model, termed Multi-level Expression Guided Attention network (MEGA-Net). It contains a multi-level visual attention schema guided by the expression representations in different levels, i.e., sentence-level, word-level and phrase-level, which allows generating the discriminative region features and helps to locate the related regions accurately. In addition, to distinguish the similar regions, we design a two-stage structure, where we first select top-K candidate regions according to their matching scores in the first stage, then we apply an object comparison attention mechanism to learn the difference between the candidates for matching the target region. We evaluate the proposed approach on three popular benchmark datasets and the experimental results demonstrate that our model performs against state-of-the-art methods.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128484854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual relationship detection aims to predict the relationships between detected object pairs. It is widely believed that the correlations between image components (i.e., objects and the relationships between objects) are significant considerations when predicting objects' relationships. However, most current visual relationship detection methods exploit only the correlations among objects, while the correlations among objects' relationships remain underexplored. This paper proposes a relationship graph learning network (RGLN) to explore the correlations among objects' relationships for visual relationship detection. Specifically, RGLN obtains image objects using an object detector, and every pair of objects then constitutes a relationship proposal. All relationship proposals construct a relationship graph, in which the proposals are treated as nodes. Accordingly, RGLN designs bi-stream graph attention subnetworks to detect relationship proposals: one graph attention subnetwork analyzes correlations among relationships based on visual and spatial information, and the other analyzes correlations based on semantic and spatial information. Besides, RGLN exploits a relationship selection subnetwork to ignore redundant information from object pairs with no relationships. We conduct extensive experiments on two public datasets, VRD and VG. The experimental results compared with the state-of-the-art demonstrate the competitiveness of RGLN.
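As a simplified, hypothetical sketch of attention over relationship-proposal nodes on a fully connected relationship graph (not the RGLN architecture itself):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleGraphAttention(nn.Module):
    """One attention pass over relationship-proposal nodes, where every
    proposal attends to every other proposal in the relationship graph."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)
        self.attn = nn.Linear(2 * dim, 1)

    def forward(self, nodes):                        # nodes: (N, D) proposal features
        h = self.proj(nodes)
        n = h.size(0)
        # Pairwise concatenation of node features: entry (i, j) is [h_i, h_j].
        pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                           h.unsqueeze(0).expand(n, n, -1)], dim=-1)
        alpha = F.softmax(F.leaky_relu(self.attn(pairs)).squeeze(-1), dim=1)  # (N, N)
        return F.relu(alpha @ h)                     # attention-aggregated node features
```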
{"title":"Relationship graph learning network for visual relationship detection","authors":"Yanan Li, Jun Yu, Yibing Zhan, Zhi Chen","doi":"10.1145/3444685.3446312","DOIUrl":"https://doi.org/10.1145/3444685.3446312","url":null,"abstract":"Visual relationship detection aims to predict the relationships between detected object pairs. It is well believed that the correlations between image components (i.e., objects and relationships between objects) are significant considerations when predicting objects' relationships. However, most current visual relationship detection methods only exploited the correlations among objects, and the correlations among objects' relationships remained underexplored. This paper proposes a relationship graph learning network (RGLN) to explore the correlations among objects' relationships for visual relationship detection. Specifically, RGLN obtains image objects using an object detector, and then, every pair of objects constitutes a relationship proposal. All relationship proposals construct a relationship graph, in which the proposals are treated as nodes. Accordingly, RGLN designs bi-stream graph attention subnetworks to detect relationship proposals, in which one graph attention subnetwork analyzes correlations among relationships based on visual and spatial information, and the other analyzes correlations based on semantic and spatial information. Besides, RGLN exploits a relationship selection subnetwork to ignore redundant information of object pairs with no relationships. We conduct extensive experiments on two public datasets: the VRD and the VG datasets. The experimental results compared with the state-of-the-art demonstrate the competitiveness of RGLN.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133276438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ying-Jian Liu, Heng Zhang, Xiao-Long Yun, Jun-Yu Ye, Cheng-Lin Liu
In this paper, we propose a multi-task learning approach for table detection and cell segmentation with densely connected graph attention networks in free-form online documents. Each online document is regarded as a graph, where nodes represent strokes and edges represent the relationships between strokes. We then propose a graph attention network model to classify nodes and edges simultaneously. According to the node classification results, tables can be detected in each document. By combining the node and edge classification results, the cells in each table can be segmented. To improve information flow in the network and enable efficient reuse of features among layers, dense connectivity among layers is used. Our proposed model has been experimentally validated on the online handwritten document dataset IAMOnDo and achieved encouraging results.
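A minimal sketch of the dense-connectivity pattern among graph layers described above, where each layer receives the concatenation of the input and all previous layers' outputs; the plain adjacency aggregation here is only a stand-in for the paper's graph attention operation.

```python
import torch
import torch.nn as nn

class DenselyConnectedGNN(nn.Module):
    """Dense connectivity among graph layers: each layer consumes the
    concatenation of the input and every previous layer's output."""
    def __init__(self, in_dim, hidden_dim, num_layers):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Linear(in_dim + i * hidden_dim, hidden_dim) for i in range(num_layers)
        ])

    def forward(self, x, adj):
        # x: (N, in_dim) stroke/node features, adj: (N, N) stroke relation matrix
        feats = [x]
        for layer in self.layers:
            h = torch.relu(layer(torch.cat(feats, dim=-1)))
            h = adj @ h                  # stand-in for graph attention aggregation
            feats.append(h)
        return torch.cat(feats, dim=-1)  # features from all layers are reused
```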
{"title":"Table detection and cell segmentation in online handwritten documents with graph attention networks","authors":"Ying-Jian Liu, Heng Zhang, Xiao-Long Yun, Jun-Yu Ye, Cheng-Lin Liu","doi":"10.1145/3444685.3446295","DOIUrl":"https://doi.org/10.1145/3444685.3446295","url":null,"abstract":"In this paper, we propose a multi-task learning approach for table detection and cell segmentation with densely connected graph attention networks in free form online documents. Each online document is regarded as a graph, where nodes represent strokes and edges represent the relationships between strokes. Then we propose a graph attention network model to classify nodes and edges simultaneously. According to node classification results, tables can be detected in each document. By combining node and edge classification resutls, cells in each table can be segmented. To improve information flow in the network and enable efficient reuse of features among layers, dense connectivity among layers is used. Our proposed model has been experimentally validated on an online handwritten document dataset IAMOnDo and achieved encouraging results.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133941461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This work concerns how to effectively recognize a group activity performed collectively by multiple persons. As is well known, storyboards (i.e., medium shots, close shots) jointly describe the whole storyline of a movie in a compact way. Likewise, the actors in small subgroups (similar to Storyboards) of a group activity scene contribute substantially to that group activity and develop more compact relationships with one another within subgroups. Inspired by this, we propose a Storyboard Relational Model (SRM) to address the problem of Group Activity Recognition by splitting and reintegrating the group activity based on the small yet compact Storyboards. SRM mainly consists of a Pose-Guided Pruning (PGP) module and a Dual Graph Convolutional Networks (Dual-GCN) module. Specifically, PGP is designed to refine a series of Storyboards from the group activity scene by leveraging the attention ranges of individuals. Dual-GCN models the compact relationships among actors in a Storyboard. Experimental results on two widely used datasets illustrate the effectiveness of the proposed SRM compared with state-of-the-art methods.
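As a hypothetical illustration of propagating relations among actors within one Storyboard, the snippet below performs a single normalized graph-convolution step; it is a stand-in for intuition, not the Dual-GCN module from the paper.

```python
import torch

def storyboard_gcn_step(actor_feats, adj, weight):
    """One graph-convolution step over actors within a Storyboard.

    actor_feats: (N, D) features of actors in one subgroup
    adj:         (N, N) affinity/relation matrix among those actors
    weight:      (D, D_out) learnable projection
    """
    adj = adj + torch.eye(adj.size(0))                         # add self-loops
    deg_inv_sqrt = adj.sum(dim=1).clamp(min=1e-6).pow(-0.5)
    norm_adj = deg_inv_sqrt.unsqueeze(1) * adj * deg_inv_sqrt.unsqueeze(0)
    return torch.relu(norm_adj @ actor_feats @ weight)         # relation-refined features
```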
{"title":"Storyboard relational model for group activity recognition","authors":"Boning Li, Xiangbo Shu, Rui Yan","doi":"10.1145/3444685.3446255","DOIUrl":"https://doi.org/10.1145/3444685.3446255","url":null,"abstract":"This work concerns how to effectively recognize the group activity performed by multiple persons collectively. As known, Storyboards (i.e., medium shot, close shot) jointly describe the whole storyline of a movie in a compact way. Likewise, the actors in small subgroups (similar to Storyboards) of a group activity scene contribute a lot to such group activity and develop more compact relationships among them within subgroups. Inspired by this, we propose a Storyboard Relational Model (SRM) to address the problem of Group Activity Recognition by splitting and reintegrating the group activity based on the small yet compact Storyboards. SRM mainly consists of a Pose-Guided Pruning (PGP) module and a Dual Graph Convolutional Networks (Dual-GCN) module. Specifically, PGP is designed to refine a series of Storyboards from the group activity scene by leveraging the attention ranges of individuals. Dual-GCN models the compact relationships among actors in a Storyboard. Experimental results on two widely-used datasets illustrate the effectiveness of the proposed SRM compared with the state-of-the-art methods.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113977758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Objective object segmentation visual quality evaluation is an emerging member of the visual quality assessment family. It aims at developing an objective measure, instead of a subjective survey, to evaluate object segmentation quality in agreement with human visual perception. It is an important benchmark for assessing and comparing the performance of object segmentation methods in terms of visual quality. In spite of its essential role, it still lacks sufficient study compared with other visual quality evaluation research. In this paper, we propose a novel full-reference objective measure consisting of a pixel-level sub-measure and a region-level sub-measure. The pixel-level sub-measure assigns proper weights not only to false positive and false negative pixels but also to true positive pixels, according to their certainty degrees. The region-level sub-measure considers the location distribution of the false negative errors and the correlations among neighboring pixels. Thus, by combining these two sub-measures, our measure can evaluate the similarity of area, shape and object completeness between a segmentation result and its ground truth in terms of human visual perception. To evaluate the performance of the proposed measure, we tested it on an object segmentation subjective visual quality assessment database. The experimental results demonstrate that our proposed measure, with good robustness, matches subjective assessments better than other state-of-the-art objective measures.
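The toy function below illustrates the general idea of weighting true positive, false positive and false negative pixels when scoring a segmentation mask; the specific formula and the per-pixel weight arrays are assumptions and do not reproduce the measure defined in the paper.

```python
import numpy as np

def weighted_pixel_measure(pred, gt, w_tp, w_fp, w_fn):
    """Toy pixel-level score with per-pixel certainty weights.

    pred, gt:            binary masks of the same shape (1 = object)
    w_tp, w_fp, w_fn:    per-pixel weight maps of the same shape
    """
    tp = (pred == 1) & (gt == 1)
    fp = (pred == 1) & (gt == 0)
    fn = (pred == 0) & (gt == 1)
    credit = (w_tp * tp).sum()                          # weighted correct object pixels
    penalty = (w_fp * fp).sum() + (w_fn * fn).sum()     # weighted errors
    denom = (w_tp * (gt == 1)).sum() + penalty
    return credit / max(denom, 1e-6)                    # 1.0 for a perfect mask
```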
{"title":"Objective object segmentation visual quality evaluation based on pixel-level and region-level characteristics","authors":"Ran Shi, Jian Xiong, T. Qiao","doi":"10.1145/3444685.3446305","DOIUrl":"https://doi.org/10.1145/3444685.3446305","url":null,"abstract":"Objective object segmentation visual quality evaluation is an emergent member of the visual quality assessment family. It aims at developing an objective measure instead of a subjective survey to evaluate the object segmentation quality in agreement with human visual perception. It is an important benchmark to assess and compare performances of object segmentation methods in terms of the visual quality. In spite of its essential role, it still lacks of sufficient studying compared with other visual quality evaluation researches. In this paper, we propose a novel full-reference objective measure including a pixel-level sub-measure and a region-level sub-measure. For the pixel-level sub-measure, it assigns proper weights to not only false positive pixels and false negative pixels but also true positive pixels according to their certainty degrees. For the region-level sub-measure, it considers location distribution of the false negative errors and correlations among neighboring pixels. Thus, by combining these two sub-measures, our measure can evaluate similarity of area, shape and object completeness between one segmentation result and its ground truth in terms of human visual perception. In order to evaluate the performance of our proposed measure, we tested it on an object segmentation subjective visual quality assessment database. The experimental results demonstrate that our proposed measure with good robustness performs better in matching subjective assessments compared with other state-of-the-art objective measures.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114045525","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
As a key component of a simultaneous localization and mapping (SLAM) system, loop closure detection (LCD) eliminates accumulated errors by recognizing previously visited places. In recent years, deep learning methods have proven effective for LCD. However, most existing methods do not make good use of the information provided by monocular images, which tends to limit their performance in challenging dynamic scenarios with partial occlusion by moving objects. To this end, we propose a novel workflow that is able to combine multiple sources of information provided by images. We first introduce semantic information into LCD by developing a local-aware Class Activation Maps (CAMs) weighting method for extracting features, which can reduce the adverse effects of moving objects. Compared with previous methods based on semantic segmentation, our method has the advantage of not requiring additional models or other complex operations. In addition, we propose two effective temporal constraint strategies, which utilize the relationships of image sequences to improve detection performance. Moreover, we use a keypoint matching strategy as the final detector to further reject false positives. Experiments on four publicly available datasets indicate that our approach achieves higher accuracy and better robustness than state-of-the-art methods.
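A rough sketch of CAM-based weighting of convolutional features before pooling them into an image descriptor, conveying the general idea of emphasizing class-relevant regions; the function below is illustrative and not the authors' pipeline.

```python
import torch
import torch.nn.functional as F

def cam_weighted_descriptor(feat_map, fc_weight, class_idx):
    """Weight a conv feature map by its Class Activation Map before pooling.

    feat_map:  (C, H, W) last-conv features of one frame
    fc_weight: (num_classes, C) classifier weights of the backbone
    class_idx: index of the class used to build the CAM
    """
    cam = torch.einsum('c,chw->hw', fc_weight[class_idx], feat_map)
    cam = torch.relu(cam)
    cam = cam / cam.max().clamp(min=1e-6)            # normalize CAM to [0, 1]
    weighted = feat_map * cam.unsqueeze(0)           # re-weight spatial locations
    desc = weighted.flatten(1).sum(dim=1)            # pool into a global descriptor
    return F.normalize(desc, dim=0)
```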
{"title":"Fusing CAMs-weighted features and temporal information for robust loop closure detection","authors":"Yao Li, S. Zhong, Tongwei Ren, Y. Liu","doi":"10.1145/3444685.3446309","DOIUrl":"https://doi.org/10.1145/3444685.3446309","url":null,"abstract":"As a key component in simultaneous localization and mapping (SLAM) system, loop closure detection (LCD) eliminates the accumulated errors by recognizing previously visited places. In recent years, deep learning methods have been proved effective in LCD. However, most of the existing methods do not make good use of the useful information provided by monocular images, which tends to limit their performance in challenging dynamic scenarios with partial occlusion by moving objects. To this end, we propose a novel workflow, which is able to combine multiple information provided by images. We first introduce semantic information into LCD by developing a local-aware Class Activation Maps (CAMs) weighting method for extracting features, which can reduce the adverse effects of moving objects. Compared with previous methods based on semantic segmentation, our method has the advantage of not requiring additional models or other complex operations. In addition, we propose two effective temporal constraint strategies, which utilize the relationship of image sequences to improve the detection performance. Moreover, we propose to use the keypoint matching strategy as the final detector to further refuse false positives. Experiments on four publicly available datasets indicate that our approach can achieve higher accuracy and better robustness than the state-of-the-art methods.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"2005 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116898427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Current Visual Question Answering (VQA) models mainly explore the statistical correlations between answers and questions, and fail to capture the relationship between the visual information and the answers. Their performance dramatically decreases when the distribution of the handled data differs from that of the training data. Towards this end, this paper proposes a novel unbiased VQA model that explores Causal Inference with Knowledge Distillation (CIKD) to reduce the influence of bias. Specifically, a causal graph is first constructed to explore counterfactual causality and infer the causal target based on the causal effect, which substantially reduces the bias from questions and obtains answers without training. Knowledge distillation is then leveraged to transfer the knowledge of the inferred causal target to a conventional VQA model. This enables the proposed method to handle both biased data and standard data. To address the harmful bias introduced by the knowledge distillation, ensemble learning is introduced based on the hypothesized cause of the bias. Experiments are conducted to show the performance of the proposed method. Significant improvements over state-of-the-art methods on the VQA-CP v2 dataset validate the contributions of this work.
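The standard knowledge-distillation objective below is a hedged stand-in for transferring the inferred causal target to a conventional VQA model; the temperature `T` and mixing weight `alpha` are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Soft KL term against the teacher's (causally inferred) targets plus
    hard cross-entropy against the ground-truth answers."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                    F.softmax(teacher_logits / T, dim=1),
                    reduction='batchmean') * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```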
{"title":"Distilling knowledge in causal inference for unbiased visual question answering","authors":"Yonghua Pan, Zechao Li, Liyan Zhang, Jinhui Tang","doi":"10.1145/3444685.3446256","DOIUrl":"https://doi.org/10.1145/3444685.3446256","url":null,"abstract":"Current Visual Question Answering (VQA) models mainly explore the statistical correlations between answers and questions, which fail to capture the relationship between the visual information and answers. The performance dramatically decreases when the distribution of handled data is different from the training data. Towards this end, this paper proposes a novel unbiased VQA model by exploring the Casual Inference with Knowledge Distillation (CIKD) to reduce the influence of bias. Specifically, the causal graph is first constructed to explore the counterfactual causality and infer the casual target based on the causal effect, which well reduces the bias from questions and obtain answers without training. Then knowledge distillation is leveraged to transfer the knowledge of the inferred casual target to the conventional VQA model. It makes the proposed method enable to handle both the biased data and standard data. To address the problem of the bad bias from the knowledge distillation, the ensemble learning is introduced based on the hypothetical bias reason. Experiments are conducted to show the performance of the proposed method. The significant improvements over the state-of-the-art methods on the VQA-CP v2 dataset well validate the contributions of this work.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114338947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haihui Ye, Qiang Qi, Ying Wang, Yang Lu, Hanzi Wang
Extending image-based object detectors to the video domain suffers from severe inadaptability due to deteriorated frames caused by motion blur, partial occlusion or unusual poses. The features generated for deteriorated frames therefore suffer from misalignment and poor quality, which degrades the overall performance of video object detectors. How to capture valuable information locally and globally is important for feature alignment but remains quite challenging. In this paper, we propose a Global and Local Feature Alignment (abbreviated as GLFA) module for video object detection, which can distill both global and local information to excavate the deep relationships between features for feature alignment. Specifically, GLFA can model the spatial-temporal dependencies across frames by propagating global information and capture the interactive correspondences within the same frame by aggregating valuable local information. Moreover, we further introduce a Self-Adaptive Calibration (SAC) module to strengthen the semantic representation of features and distill valuable local information in a dual local-alignment manner. Experimental results on the ImageNet VID dataset show that the proposed method achieves high performance as well as a good trade-off between real-time speed and competitive accuracy.
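As a simplified, assumed sketch of the global branch of cross-frame feature alignment, the snippet below aggregates support-frame features onto the target frame by similarity-weighted attention:

```python
import torch
import torch.nn.functional as F

def global_feature_alignment(target_feats, support_feats):
    """Propagate global information from support frames onto the target frame.

    target_feats:  (N, D) region/pixel features of the current frame
    support_feats: (M, D) features gathered from other frames in the clip
    """
    sim = target_feats @ support_feats.t() / target_feats.size(1) ** 0.5  # (N, M)
    attn = F.softmax(sim, dim=1)                    # attention over support features
    aligned = attn @ support_feats                  # similarity-weighted aggregation
    return target_feats + aligned                   # residual fusion with the target
```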
{"title":"Global and local feature alignment for video object detection","authors":"Haihui Ye, Qiang Qi, Ying Wang, Yang Lu, Hanzi Wang","doi":"10.1145/3444685.3446263","DOIUrl":"https://doi.org/10.1145/3444685.3446263","url":null,"abstract":"Extending image-based object detectors into video domain suffers from immense inadaptability due to the deteriorated frames caused by motion blur, partial occlusion or strange poses. Therefore, the generated features of deteriorated frames encounter the poor quality of misalignment, which degrades the overall performance of video object detectors. How to capture valuable information locally or globally is of importance to feature alignment but remains quite challenging. In this paper, we propose a Global and Local Feature Alignment (abbreviated as GLFA) module for video object detection, which can distill both global and local information to excavate the deep relationship between features for feature alignment. Specifically, GLFA can model the spatial-temporal dependencies over frames based on propagating global information and capture the interactive correspondences within the same frame based on aggregating valuable local information. Moreover, we further introduce a Self-Adaptive Calibration (SAC) module to strengthen the semantic representation of features and distill valuable local information in a dual local-alignment manner. Experimental results on the ImageNet VID dataset show that the proposed method achieves high performance as well as a good trade-off between real-time speed and competitive accuracy.","PeriodicalId":119278,"journal":{"name":"Proceedings of the 2nd ACM International Conference on Multimedia in Asia","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-03-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131565458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}