Deep-learning-based video models, which achieve remarkable performance on action recognition tasks, have recently been shown to be vulnerable to adversarial samples, even those generated in the black-box setting. However, existing black-box attack methods are impractical against video models in real-world applications because they require a large number of queries. To this end, we propose to boost the efficiency of black-box attacks on video recognition models. Although videos carry rich temporal information, adjacent frames contain redundant spatial information. This motivates us to introduce the adaptive temporal grouping (ATG) method, which groups video frames by the similarity of their features extracted from an ImageNet-pretrained image model. By selecting one key-frame from each group, ATG lets any black-box attack method optimize the adversarial perturbations over key-frames instead of all frames, with the estimated gradient of each key-frame shared with the other frames in its group. To balance the efficiency and precision of the estimated gradients, ATG adaptively adjusts the number of groups according to the magnitude of the current perturbation and the current query count. Through extensive experiments on the HMDB-51 and UCF-101 datasets, we demonstrate that ATG reduces the number of queries by more than 10% for the targeted attack.
{"title":"Adaptive Temporal Grouping for Black-box Adversarial Attacks on Videos","authors":"Zhipeng Wei, Jingjing Chen, Hao Zhang, Linxi Jiang, Yu-Gang Jiang","doi":"10.1145/3512527.3531411","DOIUrl":"https://doi.org/10.1145/3512527.3531411","url":null,"PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116790826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
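The grouping idea in the adaptive temporal grouping (ATG) abstract can be illustrated with a minimal sketch. The abstract does not specify the grouping rule, so everything here is an assumption for illustration: consecutive frames are grouped greedily by cosine similarity to the first frame of the current group, that first frame acts as the key-frame, and the `threshold` parameter and the function names `group_frames` and `share_gradients` are hypothetical. The adaptive adjustment of the group number is not modelled.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def group_frames(features, threshold=0.9):
    """Greedily split consecutive frames into groups (hypothetical rule).

    A frame joins the current group when its feature is similar enough to
    the group's first frame, which serves as the key-frame.
    """
    if not features:
        return []
    groups = [[0]]
    for i in range(1, len(features)):
        key = groups[-1][0]
        if cosine(features[key], features[i]) >= threshold:
            groups[-1].append(i)
        else:
            groups.append([i])
    return groups


def share_gradients(groups, key_gradients, num_frames):
    """Copy each key-frame's estimated gradient to every frame in its group."""
    gradients = [None] * num_frames
    for group, gradient in zip(groups, key_gradients):
        for index in group:
            gradients[index] = gradient
    return gradients


# Toy per-frame features standing in for image-model embeddings.
features = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]]
groups = group_frames(features)  # [[0, 1], [2, 3]]
per_frame = share_gradients(groups, ["g_a", "g_b"], len(features))
```

With two groups, an attack would need gradient estimates for only two key-frames instead of four frames, which is where the query savings would come from under these assumptions.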
Haze classification has gained much attention recently as a cost-effective solution for air quality monitoring. Unlike conventional image classification tasks, it requires the classifier to capture haze patterns of different degrees of severity. Existing efforts typically focus on extracting effective haze features, such as the dark channel and deep features. However, light-haze images are often misclassified because of the diverse background scenes they contain. To address this issue, this paper presents an unsupervised contrastive masking (UCM) algorithm that segments the haze regions without any supervision, and develops a dual-channel, model-agnostic framework, termed magnifier neural network (MagNet), that uses the segmented haze regions to enhance the learning of haze features by conventional deep learning models. Specifically, MagNet employs the haze regions to provide pixel- and feature-level visual information via three strategies, namely Input Augmentation, Network Constraint, and Feature Enhancement, which work as a soft-attention regularizer to alleviate the trade-off between capturing the global scene information and the local information in the haze regions. Experiments were conducted on two datasets in terms of performance comparison, parameter estimation, ablation studies, and case studies. The results verify that UCM can segment the haze regions accurately and rapidly, and that the three proposed strategies of MagNet consistently improve the performance of state-of-the-art deep learning backbones.
{"title":"Unsupervised Contrastive Masking for Visual Haze Classification","authors":"Jingyu Li, Haokai Ma, Xiangxian Li, Zhuang Qi, Lei Meng, Xiangxu Meng","doi":"10.1145/3512527.3531370","DOIUrl":"https://doi.org/10.1145/3512527.3531370","url":null,"PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117135818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-modal person re-identification (ReID) aims to retrieve a person in one modality given a query in a single other modality, as in text-based and sketch-based ReID tasks. However, because these modalities describe a person in different ways, combining them can exploit complementary information and improve identification performance. Therefore, to explore how to consider multi-modal information comprehensively, we propose a novel multi-modal person re-identification task that uses both text and sketch as a descriptive query to retrieve the desired images. Understanding the textual and visual descriptions together when retrieving a person from the database is better aligned with real-world scenarios, a direction that is promising but seldom considered. In addition, based on an existing sketch-based ReID dataset, we construct a new dataset, TriReID, in a semi-automated way to support this challenging task. In particular, we implement an image captioning model under the active learning paradigm to generate sentences suitable for ReID, in which quality scores at three levels are customized. Moreover, we propose a novel framework named Descriptive Fusion Model (DFM) to solve the multi-modal ReID problem. Specifically, we first develop a flexible descriptive embedding function to fuse the text and sketch modalities. The fused descriptive semantic feature is then jointly optimized under the generative adversarial paradigm to mitigate the cross-modal semantic gap. Extensive experiments on the TriReID dataset demonstrate the effectiveness and rationality of our proposed solution.
{"title":"TriReID: Towards Multi-Modal Person Re-Identification via Descriptive Fusion Model","authors":"Yajing Zhai, Yawen Zeng, Da Cao, Shaofei Lu","doi":"10.1145/3512527.3531397","DOIUrl":"https://doi.org/10.1145/3512527.3531397","url":null,"PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"183 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115064776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient and accurate abdominal multi-organ segmentation is key to clinical applications such as computer-aided diagnosis and computer-aided surgery, but the task is extremely challenging due to blurred organ boundaries, complex backgrounds, and varying organ sizes. Although existing methods achieve good segmentation results overall, we found that their performance on small and medium abdominal organs is often unsatisfactory, even though accurately locating and segmenting these organs plays an important role in the diagnosis and screening of clinical diseases. To address this problem, we propose an intra- and inter-scale collaborative learning network (I2-Net) for the abdominal multi-organ segmentation task. First, we design a Feature Complementary Module (FCM) to adaptively complement the local and global features extracted by the CNN and the Transformer. Second, we propose a Feature Aggregation Module (FAM) to aggregate multi-scale semantic information. Finally, we employ a Focus Module (FM) for collaborative learning of intra- and inter-scale features. Extensive experiments on the Synapse dataset show that our method outperforms state-of-the-art approaches and achieves accurate segmentation of multiple abdominal organs, especially small and medium ones.
{"title":"I2-Net: Intra- and Inter-scale Collaborative Learning Network for Abdominal Multi-organ Segmentation","authors":"Chao Suo, Xuanya Li, Donghui Tan, Yuan Zhang, Xieping Gao","doi":"10.1145/3512527.3531420","DOIUrl":"https://doi.org/10.1145/3512527.3531420","url":null,"PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124715923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rapid growth of video data, many video analysis techniques have been developed and have achieved success in recent years. To mitigate the distribution bias of video data across domains, unsupervised video domain adaptation (UVDA) has been proposed and has become an active research topic. Nevertheless, existing UVDA methods need access to source domain data during training, which may lead to privacy policy violations and inefficient transfer. To address this issue, we propose a novel source-free temporal attentive domain adaptation (SFTADA) method for video action recognition under a more challenging UVDA setting in which source domain data is not required for learning on the target domain. In our method, an innovative Temporal Attentive aGgregation (TAG) module combines frame-level features with varying importance weights to generate a video-level representation. Because neither source domain data nor target domain labels are available, including at test time, an MLP-based attention network is trained to approximate the attentive aggregation function based on class centroids. By minimizing frame-level and video-level loss functions, both the temporal and spatial domain shifts in cross-domain video data can be reduced. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our proposed method in solving the challenging source-free UVDA task.
{"title":"Source-free Temporal Attentive Domain Adaptation for Video Action Recognition","authors":"Peipeng Chen, A. J. Ma","doi":"10.1145/3512527.3531392","DOIUrl":"https://doi.org/10.1145/3512527.3531392","url":null,"PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130575818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
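The SFTADA abstract describes combining frame-level features with varying importance weights into a video-level representation. A minimal sketch of that kind of attentive aggregation follows. It assumes the per-frame attention scores are already given, whereas the abstract says they come from a trained MLP-based attention network, which is not modelled here. Softmax normalization and the function names `softmax` and `aggregate` are also assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate(frame_features, scores):
    """Weighted sum of frame-level features, giving one video-level vector."""
    weights = softmax(scores)
    dim = len(frame_features[0])
    return [
        sum(w * feature[d] for w, feature in zip(weights, frame_features))
        for d in range(dim)
    ]


# Equal scores reduce to plain average pooling over frames.
video_feature = aggregate([[1.0, 0.0], [3.0, 2.0]], [0.0, 0.0])  # [2.0, 1.0]
```

Raising the score of one frame shifts the video-level feature toward that frame, which is the sense in which frames receive "varying importance weights".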
Bo Fu, Yuanxin Mao, Shilin Fu, Yonggong Ren, Zhongxuan Luo
Facial Expression Recognition (FER) is a basic and crucial computer vision task that classifies emotional expressions in human face images into categories such as happy, sad, surprised, scared, and angry. Recently, facial expression recognition based on deep learning has made great progress. However, regardless of the weight initialization technique or attention mechanism used, deep-learning-based recognition methods struggle to capture features that are visually insignificant but semantically important. To address this problem, we present a novel Facial Expression Recognition training strategy consisting of two components: Memo Affinity Loss (MAL) and Mask Attention Fine Tuning (MAFT). MAL is a variant of center loss that uses a memory bank strategy together with discriminative centers. MAL widens the distance between different clusters and narrows the distance within each cluster. As a result, the features extracted by the CNN are comprehensive and independent, which yields a more robust model. MAFT is a strategy that temporarily blindfolds the attended parts and forces the model to learn from other important regions of the input image. It is not only an augmentation technique but also a novel fine-tuning approach. To our knowledge, we are the first to apply a mask strategy to the attended parts and to use it for fine-tuning models. Finally, to implement our ideas, we constructed a new network named Architecture Attention ResNet based on ResNet-18. Our methods are conceptually and practically simple, yet achieve superior results on popular public facial expression recognition benchmarks: 88.75% on RAF-DB, 65.17% on AffectNet-7, and 60.72% on AffectNet-8. The code will be open-sourced soon.
{"title":"Blindfold Attention: Novel Mask Strategy for Facial Expression Recognition","authors":"Bo Fu, Yuanxin Mao, Shilin Fu, Yonggong Ren, Zhongxuan Luo","doi":"10.1145/3512527.3531416","DOIUrl":"https://doi.org/10.1145/3512527.3531416","url":null,"PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126479767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
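The Blindfold Attention abstract describes MAL as a variant of center loss. For orientation, here is a sketch of plain center loss only, which penalizes the squared distance between each feature and the center of its class. It is not MAL itself: the memory bank and discriminative-center parts are not described in the abstract and are not modelled. Averaging over the batch and the name `center_loss` are assumptions of this sketch.

```python
def center_loss(features, labels, centers):
    """Half the mean squared distance between features and their class centers.

    features: list of feature vectors, one per sample
    labels:   class label of each sample
    centers:  mapping from class label to that class's center vector
    """
    total = 0.0
    for feature, label in zip(features, labels):
        center = centers[label]
        total += sum((a - b) ** 2 for a, b in zip(feature, center))
    return 0.5 * total / len(features)


features = [[1.0, 0.0], [0.0, 1.0]]
labels = [0, 1]
centers = {0: [1.0, 0.0], 1: [0.0, 0.0]}
loss = center_loss(features, labels, centers)  # 0.25
```

Minimizing this term pulls features toward their own class center, which corresponds to the "narrows the distance within each cluster" half of the description. The inter-cluster widening needs an additional term that this sketch omits.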
Shanchuan Gao, Fankai Zeng, Lu Cheng, Jicong Fan, Mingde Zhao
Clothes image search is a key technique for finding the clothes items most relevant to a query item supplied by the customer. In this work, we propose an anchor-free framework for clothes image search that adds a Re-ID branch for similarity learning and a global mask branch for instance segmentation. The Re-ID branch extracts richer features of the target clothes: we develop a mask pooling layer that aggregates features using the mask of the target clothes as guidance. In this way, the extracted feature covers the whole mask area of the target rather than only its center point. The global mask branch is trained simultaneously with the detection and Re-ID branches, so that the estimated mask of the target clothes can be used during inference to guide feature extraction. Finally, to further improve retrieval performance, we introduce a match loss to fine-tune the Re-ID embedding branch, so that embeddings of the same clothes item are pulled closer together while those of different items are pushed farther apart. Extensive simulations have been conducted, and the results verify the effectiveness of the proposed work.
{"title":"Fashion Image Search via Anchor-Free Detector","authors":"Shanchuan Gao, Fankai Zeng, Lu Cheng, Jicong Fan, Mingde Zhao","doi":"10.1145/3512527.3531355","DOIUrl":"https://doi.org/10.1145/3512527.3531355","url":null,"PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124463694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High dropout rates have become a widespread phenomenon across MOOC platforms. When taking a MOOC, many learners are reluctant to spend time working through every video lecture from the first to the last. Recommending a learning path that reflects a learner's individual needs and skips irrelevant video lectures would help them learn more efficiently. The premise of learning path recommendation is an understanding of the precedence relations between learning resources. In this paper, we propose a novel approach for extracting precedence relations between video lectures in a MOOC. Using the "knowledge depth" of concepts, we accurately extract the core concepts from the video captions. Transformer-based models are then used to discover concept prerequisite relations, which help us identify the precedence relations between video lectures in MOOCs. Experiments show that the proposed method outperforms state-of-the-art methods.
{"title":"Extracting Precedence Relations between Video Lectures in MOOCs","authors":"K. Xiao, Youheng Bai, Yan Zhang","doi":"10.1145/3512527.3531414","DOIUrl":"https://doi.org/10.1145/3512527.3531414","url":null,"PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128874394","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper considers large-scale, efficient vehicle re-identification (Vehicle ReID). Existing works that adopt deep hashing techniques project vehicle images into compact binary codes in the Hamming space. Because Hamming distance is less discriminative, a considerable amount of discriminative information is lost, which degrades retrieval performance. Inspired by recent advances in contrastive learning, we put forward the first product-quantization-based framework for large-scale, efficient vehicle re-identification: Supervised Contrastive Vehicle Quantization (SCVQ). Specifically, we integrate the product quantization process into deep supervised learning by designing a differentiable quantization network. In addition, we propose a novel supervised cross-quantized contrastive quantization (SCQC) loss for similarity-preserving learning, tailored to the asymmetric retrieval used in product quantization. Comprehensive experiments on two public benchmarks demonstrate the superiority of our framework over state-of-the-art methods. Our work is open-sourced at https://github.com/chrisbyd/ContrastiveVehicleQuant
{"title":"Supervised Contrastive Vehicle Quantization for Efficient Vehicle Retrieval","authors":"Yongbiao Chen, Kaicheng Guo, Fangxin Liu, Yusheng Huang, Zhengwei Qi","doi":"10.1145/3512527.3531432","DOIUrl":"https://doi.org/10.1145/3512527.3531432","url":null,"PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"313 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122203993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
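The SCVQ abstract refers to the asymmetric retrieval used in product quantization. The sketch below shows the generic, non-learned form of that idea and is not the SCVQ network or its SCQC loss: database vectors are stored as one codeword index per sub-space, the query stays unquantized, and distances are read from per-sub-space lookup tables. The toy codebooks and the function names are assumptions. In practice codebooks are learned, and the vector length must be divisible by the number of sub-spaces.

```python
def split(vector, num_subspaces):
    """Cut a vector into equal-length sub-vectors, one per sub-space."""
    width = len(vector) // num_subspaces
    return [vector[i * width:(i + 1) * width] for i in range(num_subspaces)]


def squared_distance(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))


def encode(vector, codebooks):
    """Store a database vector as the nearest codeword index per sub-space."""
    code = []
    for sub, codebook in zip(split(vector, len(codebooks)), codebooks):
        distances = [squared_distance(sub, codeword) for codeword in codebook]
        code.append(distances.index(min(distances)))
    return code


def asymmetric_distance(query, code, codebooks):
    """Distance between an unquantized query and a quantized database item."""
    tables = [
        [squared_distance(sub, codeword) for codeword in codebook]
        for sub, codebook in zip(split(query, len(codebooks)), codebooks)
    ]
    return sum(tables[s][index] for s, index in enumerate(code))


# Two sub-spaces with two hand-picked 2-D codewords each (toy values).
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 0.0], [2.0, 2.0]],
]
code = encode([0.9, 1.1, 0.1, -0.1], codebooks)  # [1, 0]
near = asymmetric_distance([1.0, 1.0, 0.0, 0.0], code, codebooks)  # 0.0
far = asymmetric_distance([0.0, 0.0, 0.0, 0.0], code, codebooks)  # 2.0
```

Because the lookup tables depend only on the query, they are built once and reused for every database code, so each comparison costs one table lookup per sub-space while retaining real-valued distances instead of Hamming distances.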
Hongya Wang, Shunxin Dai, Ming Du, Bo Xu, Mingyong Li
Recently, cross-modal hashing has attracted much attention due to its low storage cost and fast query speed. Mean Average Precision (MAP) is the most widely used performance measure for cross-modal hashing. However, we found that MAP scores do not fully reflect the quality of the top-K results in cross-modal retrieval, because MAP neglects multi-label information and overlooks the label semantic hierarchy. In view of this, we propose a new performance measure named Normalized Weighted Discounted Cumulative Gains (NWDCG), which extends Normalized Discounted Cumulative Gains (NDCG) using a co-occurrence probability matrix. To verify the effectiveness of NWDCG, we conduct extensive experiments using three popular cross-modal hashing schemes on two publicly available datasets.
{"title":"Revisiting Performance Measures for Cross-Modal Hashing","authors":"Hongya Wang, Shunxin Dai, Ming Du, Bo Xu, Mingyong Li","doi":"10.1145/3512527.3531363","DOIUrl":"https://doi.org/10.1145/3512527.3531363","url":null,"PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114237438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
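NWDCG in the cross-modal hashing abstract is described as an extension of NDCG, so a short sketch of plain NDCG may help as a baseline. The co-occurrence weighting that defines NWDCG is not specified in the abstract and is not implemented here. The sketch assumes graded relevance values (for example, the number of labels a result shares with the query) and computes the ideal ranking from the returned list only, which is a simplification.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain: relevance at rank r is divided by log2(r + 1)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances, k=None):
    """DCG of the top-k results divided by the DCG of the best possible ordering."""
    top = relevances[:k] if k is not None else relevances
    ideal = sorted(relevances, reverse=True)[:len(top)]
    ideal_dcg = dcg(ideal)
    return dcg(top) / ideal_dcg if ideal_dcg > 0 else 0.0


perfect = ndcg([3, 2, 1])   # 1.0, already in ideal order
reversed_order = ndcg([1, 2, 3])  # below 1.0, best result ranked last
```

Unlike MAP, which treats each result as relevant or not, the graded relevance values let a result sharing three labels with the query count for more than one sharing a single label.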