
Proceedings of the 2022 International Conference on Multimedia Retrieval: Latest Papers

Adaptive Temporal Grouping for Black-box Adversarial Attacks on Videos
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531411
Zhipeng Wei, Jingjing Chen, Hao Zhang, Linxi Jiang, Yu-Gang Jiang
Deep-learning-based video models, which perform remarkably well on action recognition tasks, have recently been shown to be vulnerable to adversarial samples, even those generated in the black-box setting. However, existing black-box attack methods are impractical against video models in real-world applications because they require a large number of queries. To this end, we propose to boost the efficiency of black-box attacks on video recognition models. Although videos carry rich temporal information, adjacent frames contain redundant spatial information. This motivates us to introduce the adaptive temporal grouping (ATG) method, which groups video frames by the similarity of their features, extracted with an ImageNet-pretrained image model. By selecting one key-frame from each group, ATG lets any black-box attack method optimize the adversarial perturbations over key-frames instead of all frames; the estimated gradient of each key-frame is shared with the other frames in its group. To balance the efficiency and precision of the estimated gradients, ATG adaptively adjusts the number of groups according to the magnitude of the current perturbation and the current query count. Through extensive experiments on the HMDB-51 and UCF-101 datasets, we demonstrate that ATG reduces the number of queries by more than 10% for the targeted attack.
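As a rough illustration of the grouping and gradient-sharing idea in this abstract, the Python sketch below splits frames into temporally contiguous groups by cutting at the least similar adjacent pairs, picks the middle frame of each group as its key-frame, and copies each key-frame's gradient to the rest of its group. The cut rule, the middle-frame choice, and the names `group_frames` and `share_gradients` are assumptions made for illustration; the abstract does not specify them, and the adaptive choice of the group number is left out.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def group_frames(features, num_groups):
    """Split frames into contiguous groups (assumed rule: cut where
    adjacent frames are least similar)."""
    n = len(features)
    num_groups = max(1, min(num_groups, n))
    sims = [cosine(features[i], features[i + 1]) for i in range(n - 1)]
    # A cut at index i separates frame i from frame i + 1.
    lowest = sorted(range(n - 1), key=lambda i: sims[i])[:num_groups - 1]
    groups, start = [], 0
    for cut in sorted(lowest):
        groups.append(list(range(start, cut + 1)))
        start = cut + 1
    groups.append(list(range(start, n)))
    return groups


def share_gradients(groups, key_grads, num_frames):
    """Give every frame the gradient estimated for its group's key-frame."""
    per_frame = [None] * num_frames
    for group, grad in zip(groups, key_grads):
        for idx in group:
            per_frame[idx] = grad
    return per_frame


# Toy per-frame features (hypothetical; a real attack would take them
# from a pretrained image model).
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = group_frames(features, 2)
key_frames = [g[len(g) // 2] for g in groups]  # assumed key-frame choice
print(groups, key_frames)
```

With this layout a black-box attack would query the model only to estimate gradients for the key-frames, so the query cost scales with the number of groups rather than the number of frames.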
Citations: 3
Unsupervised Contrastive Masking for Visual Haze Classification
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531370
Jingyu Li, Haokai Ma, Xiangxian Li, Zhuang Qi, Lei Meng, Xiangxu Meng
Haze classification has recently gained much attention as a cost-effective solution for air quality monitoring. Unlike conventional image classification tasks, it requires the classifier to capture haze patterns of different severity. Existing efforts typically focus on extracting effective haze features, such as the dark channel and deep features. However, light-haze images are often misclassified because of diverse background scenes. To address this issue, this paper presents an unsupervised contrastive masking (UCM) algorithm that segments the haze regions without any supervision, and develops a dual-channel, model-agnostic framework, termed magnifier neural network (MagNet), that uses the segmented haze regions to enhance the learning of haze features by conventional deep learning models. Specifically, MagNet uses the haze regions to provide pixel- and feature-level visual information via three strategies, Input Augmentation, Network Constraint, and Feature Enhancement, which work as a soft-attention regularizer to alleviate the trade-off between capturing the global scene information and the local information in the haze regions. Experiments on two datasets, covering performance comparison, parameter estimation, ablation studies, and case studies, verify that UCM segments the haze regions accurately and rapidly, and that the three proposed MagNet strategies consistently improve the performance of state-of-the-art deep learning backbones.
Citations: 5
TriReID: Towards Multi-Modal Person Re-Identification via Descriptive Fusion Model
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531397
Yajing Zhai, Yawen Zeng, Da Cao, Shaofei Lu
Cross-modal person re-identification (ReID) aims to retrieve a person in one modality given a query in another single modality, as in text-based and sketch-based ReID tasks. However, these modalities describe a person in different ways, and combining them can exploit complementary information and improve identification performance. Therefore, to explore how to consider multi-modal information comprehensively, we propose a novel multi-modal person re-identification task that uses both text and sketch as a descriptive query to retrieve the desired images. Interpreting the textual description and the visual description together to retrieve a person from the database is closer to real-world scenarios; this setting is promising but seldom considered. In addition, starting from an existing sketch-based ReID dataset, we construct a new dataset, TriReID, in a semi-automated way to support this challenging task. In particular, we implement an image captioning model under the active learning paradigm to generate sentences suitable for ReID, in which the quality scores of the three levels are customized. Moreover, we propose a novel framework named Descriptive Fusion Model (DFM) to solve the multi-modal ReID problem. Specifically, we first develop a flexible descriptive embedding function to fuse the text and sketch modalities. The fused descriptive semantic feature is then jointly optimized under the generative adversarial paradigm to mitigate the cross-modal semantic gap. Extensive experiments on the TriReID dataset demonstrate the effectiveness and rationality of our proposed solution.
Citations: 3
I2-Net: Intra- and Inter-scale Collaborative Learning Network for Abdominal Multi-organ Segmentation
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531420
Chao Suo, Xuanya Li, Donghui Tan, Yuan Zhang, Xieping Gao
Efficient and accurate abdominal multi-organ segmentation is key to clinical applications such as computer-aided diagnosis and computer-aided surgery, but the task is extremely challenging because of blurred organ boundaries, complex backgrounds, and varying organ sizes. Although existing methods achieve good overall segmentation results, we found that their performance on small and medium abdominal organs is often unsatisfactory, even though accurate localization and segmentation of these organs play an important role in the diagnosis and screening of clinical diseases. To address this problem, we propose an intra- and inter-scale collaborative learning network (I2-Net) for the abdominal multi-organ segmentation task. First, we design a Feature Complementary Module (FCM) to adaptively complement the local and global features extracted by the CNN and the Transformer. Second, we propose a Feature Aggregation Module (FAM) to aggregate multi-scale semantic information. Finally, we employ a Focus Module (FM) for collaborative learning of intra- and inter-scale features. Extensive experiments on the Synapse dataset show that our method outperforms state-of-the-art approaches and achieves accurate segmentation of multiple abdominal organs, especially small and medium organs.
Citations: 2
Source-free Temporal Attentive Domain Adaptation for Video Action Recognition
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531392
Peipeng Chen, A. J. Ma
With the rapid growth of video data, many video analysis techniques have been developed and have achieved success in recent years. To mitigate the distribution bias of video data across domains, unsupervised video domain adaptation (UVDA) has been proposed and has become an active research topic. Nevertheless, existing UVDA methods need access to source domain data during training, which may lead to privacy policy violations and transfer inefficiency. To address this issue, we propose a novel source-free temporal attentive domain adaptation (SFTADA) method for video action recognition under a more challenging UVDA setting in which source domain data is not required for learning the target domain. In our method, an innovative Temporal Attentive aGgregation (TAG) module combines frame-level features with varying importance weights to generate a video-level representation. Because neither source domain data nor target domain label information is available, including during testing, an MLP-based attention network is trained to approximate the attentive aggregation function based on class centroids. By minimizing frame-level and video-level loss functions, both the temporal and spatial domain shifts in cross-domain video data can be reduced. Extensive experiments on four benchmark datasets demonstrate the effectiveness of our proposed method in solving the challenging source-free UVDA task.
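The aggregation step in this abstract, combining frame-level features with varying importance weights into a video-level representation, can be sketched as a softmax-weighted sum. In the minimal Python sketch below, the per-frame scores are plain inputs; in the paper they would come from the MLP-based attention network, which is not reproduced here. The function name `attentive_aggregate` and the use of a softmax to normalize the scores are assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attentive_aggregate(frame_features, scores):
    """Weighted sum of frame features; weights are softmax(scores).

    frame_features: list of T vectors, each of dimension D.
    scores: list of T unnormalized importance scores (assumed given).
    """
    weights = softmax(scores)
    dim = len(frame_features[0])
    video = [
        sum(w * f[d] for w, f in zip(weights, frame_features))
        for d in range(dim)
    ]
    return video, weights


# Equal scores reduce the aggregation to plain average pooling.
video, weights = attentive_aggregate([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
print(video, weights)
```

A frame with a much larger score dominates the video-level vector, which is the behavior an attention-based aggregator relies on to down-weight uninformative frames.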
Citations: 5
Blindfold Attention: Novel Mask Strategy for Facial Expression Recognition
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531416
Bo Fu, Yuanxin Mao, Shilin Fu, Yonggong Ren, Zhongxuan Luo
Facial Expression Recognition (FER) is a basic and crucial computer vision task that classifies emotional expressions in human face images into emotion categories such as happy, sad, surprised, scared, and angry. Recently, facial expression recognition based on deep learning has made great progress. However, whether they rely on weight initialization techniques or attention mechanisms, deep-learning-based face recognition methods find it hard to capture features that are visually insignificant but semantically important. To address this problem, we present a novel Facial Expression Recognition training strategy consisting of two components: Memo Affinity Loss (MAL) and Mask Attention Fine Tuning (MAFT). MAL is a variant of center loss that uses a memory bank strategy as well as discriminative centers. MAL widens the distance between different clusters and narrows the distance within each cluster. As a result, the features extracted by the CNN are comprehensive and independent, which produces a more robust model. MAFT is a strategy that temporarily blindfolds the attended parts and forces the model to learn from other important regions of the input image. It is not only an augmentation technique but also a novel fine-tuning approach. To the best of our knowledge, we are the first to apply a mask strategy to the attended parts and to use this strategy to fine-tune models. Finally, to implement our ideas, we constructed a new network named Architecture Attention ResNet based on ResNet-18. Our methods are conceptually and practically simple, yet achieve superior results on popular public facial expression recognition benchmarks: 88.75% on RAF-DB, 65.17% on AffectNet-7, and 60.72% on AffectNet-8. The code will be open-sourced soon.
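The "blindfold" idea in this abstract, hiding the most-attended parts so the model has to use other regions, might look roughly like the Python sketch below. It assumes an attention map on the same grid as a single-channel image and hides the `k` highest-attention cells by zeroing them. The abstract does not say whether masking is applied to the input or to feature maps, nor how the masked area is chosen, so `blindfold_mask`, `apply_mask`, and the top-`k` rule are hypothetical.

```python
def blindfold_mask(attention, k):
    """Return a 0/1 mask that hides the k cells with the highest attention."""
    rows, cols = len(attention), len(attention[0])
    cells = [(attention[r][c], r, c) for r in range(rows) for c in range(cols)]
    cells.sort(key=lambda t: t[0], reverse=True)
    hidden = {(r, c) for _, r, c in cells[:k]}
    return [
        [0 if (r, c) in hidden else 1 for c in range(cols)]
        for r in range(rows)
    ]


def apply_mask(image, mask):
    """Zero out the hidden cells of a single-channel image."""
    return [
        [pixel * keep for pixel, keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Toy 2x2 example: the top-right cell has the highest attention.
attention = [[0.1, 0.9], [0.3, 0.2]]
image = [[5, 6], [7, 8]]
mask = blindfold_mask(attention, 1)
print(apply_mask(image, mask))
```

During fine-tuning, the masked images would be fed to the model in place of the originals for some steps, so the loss can only be reduced by features outside the hidden region.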
Citations: 1
Fashion Image Search via Anchor-Free Detector
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531355
Shanchuan Gao, Fankai Zeng, Lu Cheng, Jicong Fan, Mingde Zhao
Clothes image search is the key technique for finding the clothes items most relevant to a query given by the customer. In this work, we propose an anchor-free framework for clothes image search that adds a Re-ID branch for similarity learning and a global mask branch for instance segmentation. The Re-ID branch extracts richer features of the target clothes: we develop a mask pooling layer that aggregates features using the mask of the target clothes as guidance. In this way, the extracted feature covers the information within the mask area of the target instead of only its center point. The global mask branch is trained simultaneously with the detection and Re-ID branches, and the estimated mask of the target clothes can be used in the reference procedure to guide feature extraction. Finally, to further improve retrieval performance, we introduce a match loss to fine-tune the Re-ID embedding branch, so that embeddings of the same clothes target are pulled closer together while those of different clothes targets are pushed farther apart. Extensive simulations have been conducted, and the results verify the effectiveness of the proposed work.
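The mask pooling step in this abstract, aggregating features over the mask area rather than reading them at the center point alone, can be illustrated with a masked average. The Python sketch below assumes a binary mask on the same grid as the feature map and uses mean pooling; the paper's actual layer may weight or normalize differently, and the name `mask_pool` is hypothetical.

```python
def mask_pool(feature_map, mask):
    """Average the feature vectors at positions where the mask is set.

    feature_map: H x W x C nested lists.
    mask: H x W nested lists of 0/1 (assumed binary).
    Returns a C-dimensional list; all zeros if the mask is empty.
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for feat, keep in zip(feat_row, mask_row):
            if keep:
                count += 1
                for c in range(channels):
                    total[c] += feat[c]
    if count == 0:
        return total
    return [t / count for t in total]


# Toy 2x2 feature map with 2 channels; the mask selects the diagonal.
feature_map = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]
mask = [[1, 0], [0, 1]]
print(mask_pool(feature_map, mask))
```

Compared with taking the single feature vector at the object center, the pooled vector reflects every location the mask covers, which is the motivation the abstract gives for the layer.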
Citations: 2
Extracting Precedence Relations between Video Lectures in MOOCs
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531414
K. Xiao, Youheng Bai, Yan Zhang
High dropout rates have become a widespread phenomenon on MOOC platforms. When taking a MOOC, many learners are reluctant to spend time working through every video lecture from the first to the last. Recommending a learning path based on each learner's individual needs, and skipping the video lectures in the MOOC that are irrelevant to those needs, would help them learn more efficiently. The premise of learning path recommendation is an understanding of the precedence relations between learning resources. In this paper, we propose a novel approach for extracting precedence relations between video lectures in a MOOC. Based on the "knowledge depth" of concepts, we accurately extract the core concepts from the video captions. Transformer-based models are used to discover concept prerequisite relations, which help us identify the precedence relations between video lectures in MOOCs. Experiments show that the proposed method outperforms state-of-the-art methods.
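To make the last step of this abstract concrete, the Python sketch below derives lecture-level precedence from concept-level prerequisites and then orders the lectures. It assumes the core concepts per lecture and the concept prerequisite pairs are already available (in the paper these come from caption analysis and Transformer-based models, which are not reproduced here). The rule "lecture X precedes lecture Y if some concept in X is a prerequisite of some concept in Y" and the names `lecture_precedence` and `learning_path` are assumptions for illustration.

```python
from graphlib import TopologicalSorter


def lecture_precedence(lecture_concepts, prerequisites):
    """Return pairs (x, y) meaning lecture x should come before lecture y.

    lecture_concepts: dict mapping lecture id -> set of core concepts.
    prerequisites: set of (a, b) pairs, concept a is a prerequisite of b.
    """
    edges = set()
    for x, concepts_x in lecture_concepts.items():
        for y, concepts_y in lecture_concepts.items():
            if x == y:
                continue
            if any((a, b) in prerequisites
                   for a in concepts_x for b in concepts_y):
                edges.add((x, y))
    return edges


def learning_path(lecture_concepts, prerequisites):
    """Order lectures so that every lecture follows its predecessors.

    Raises graphlib.CycleError if the derived relations are cyclic.
    """
    sorter = TopologicalSorter({lec: set() for lec in lecture_concepts})
    for x, y in lecture_precedence(lecture_concepts, prerequisites):
        sorter.add(y, x)  # x is a predecessor of y
    return list(sorter.static_order())


# Hypothetical course with three lectures and two prerequisite pairs.
lectures = {"L1": {"variables"}, "L2": {"loops"}, "L3": {"recursion"}}
prereqs = {("variables", "loops"), ("loops", "recursion")}
print(learning_path(lectures, prereqs))
```

A recommender could restrict `lecture_concepts` to the lectures covering a learner's target concepts and their prerequisites, then present the resulting order as the learning path.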
Citations: 1
Supervised Contrastive Vehicle Quantization for Efficient Vehicle Retrieval
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531432
Yongbiao Chen, Kaicheng Guo, Fangxin Liu, Yusheng Huang, Zhengwei Qi
This paper considers large-scale efficient vehicle re-identification (Vehicle ReID). Existing works that adopt deep hashing techniques project vehicle images into compact binary codes in the Hamming space. Because the Hamming distance is less discriminative, a considerable amount of discriminative information is lost, which degrades retrieval performance. Inspired by recent advances in contrastive learning, we put forward the first product-quantization-based framework for large-scale efficient vehicle re-identification: Supervised Contrastive Vehicle Quantization (SCVQ). Specifically, we integrate the product quantization process into deep supervised learning by designing a differentiable quantization network. In addition, we propose a novel supervised cross-quantized contrastive quantization (SCQC) loss for similarity-preserving learning, which is tailored to the asymmetric retrieval in the product quantization process. Comprehensive experiments on two public benchmarks demonstrate the superiority of our framework over state-of-the-art methods. Our work is open-sourced at https://github.com/chrisbyd/ContrastiveVehicleQuant
Citations: 2
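The SCVQ abstract refers to asymmetric retrieval in product quantization. As a rough illustration of that general idea only (not the authors' network or loss), the sketch below keeps the query unquantized and compares it against database items stored as codeword indices. The function names `pq_encode` and `asymmetric_distances`, the `(m, k, d_sub)` codebook layout, and the squared Euclidean distance are all assumptions made for this example.

```python
import numpy as np


def pq_encode(x, codebooks):
    """Quantize x into m codeword indices, one per sub-vector.

    codebooks has the assumed shape (m, k, d_sub): m sub-spaces,
    k codewords per sub-space, each of dimension d_sub.
    """
    m, k, d_sub = codebooks.shape
    subs = x.reshape(m, d_sub)
    # Squared distance from each sub-vector to every codeword of its codebook.
    dists = ((codebooks - subs[:, None, :]) ** 2).sum(axis=2)  # shape (m, k)
    return dists.argmin(axis=1)


def asymmetric_distances(query, codes, codebooks):
    """Approximate squared distances from a raw query to encoded items.

    The query stays unquantized (the "asymmetric" part); each database
    item is represented only by its row of codeword indices in codes,
    which has shape (n, m).
    """
    m, k, d_sub = codebooks.shape
    q_subs = query.reshape(m, d_sub)
    # Lookup table: distance from each query sub-vector to each codeword.
    table = ((codebooks - q_subs[:, None, :]) ** 2).sum(axis=2)  # shape (m, k)
    # For item i, add up table[j, codes[i, j]] over the m sub-spaces.
    return table[np.arange(m), codes].sum(axis=1)


# Toy setup: 2 sub-spaces, 2 codewords each, 1 dimension per sub-space.
codebooks = np.array([[[0.0], [1.0]],
                      [[0.0], [2.0]]])
item_codes = np.stack([
    pq_encode(np.array([0.9, 1.8]), codebooks),
    pq_encode(np.array([0.1, 0.2]), codebooks),
])
distances = asymmetric_distances(np.array([0.0, 0.0]), item_codes, codebooks)
```

Because the lookup table is built once per query, scoring each database item costs only m table lookups, which is the usual motivation for this retrieval scheme.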
Revisiting Performance Measures for Cross-Modal Hashing
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531363
Hongya Wang, Shunxin Dai, Ming Du, Bo Xu, Mingyong Li
Recently, cross-modal hashing has attracted much attention due to its low storage cost and fast query speed. Mean Average Precision (MAP) is the most widely used performance measure for cross-modal hashing. However, we found that MAP scores do not fully reflect the quality of the top-K results for cross-modal retrieval, because MAP neglects multi-label information and overlooks the label semantic hierarchy. In view of this, we propose a new performance measure named Normalized Weighted Discounted Cumulative Gains (NWDCG) by extending Normalized Discounted Cumulative Gains (NDCG) using a co-occurrence probability matrix. To verify the effectiveness of NWDCG, we conduct extensive experiments using three popular cross-modal hashing schemes over two publicly available datasets.
Citations: 0
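For readers unfamiliar with the baseline that the NWDCG abstract extends, here is a minimal sketch of plain NDCG over a ranked top-K list. It does not implement NWDCG: the abstract does not say how the co-occurrence probability matrix enters the score, so that weighting is left out. The exponential gain `2**rel - 1`, the use of the number of shared labels as the graded relevance, and normalizing against the ideal ordering of the retrieved list alone are assumptions made for this example.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k graded relevances."""
    return sum(
        (2 ** rel - 1) / math.log2(rank + 2)  # rank is 0-based
        for rank, rel in enumerate(relevances[:k])
    )


def ndcg_at_k(relevances, k):
    """DCG divided by the DCG of the same items in ideal order.

    Assumption: the ideal ranking is taken from the retrieved list
    itself rather than from the whole database.
    """
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# Graded relevance, e.g. how many labels each result shares with the query.
best_first = ndcg_at_k([3, 2, 1], k=3)
worst_first = ndcg_at_k([1, 2, 3], k=3)
```

A binary-relevance measure would treat both rankings above as equally good, since every returned item shares at least one label with the query; a graded measure scores the first ordering higher, which matches the abstract's point about multi-label information.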