
Proceedings of the 2022 International Conference on Multimedia Retrieval: Latest Publications

Efficient Linear Attention for Fast and Accurate Keypoint Matching
Pub Date : 2022-04-16 DOI: 10.1145/3512527.3531369
Suwichaya Suwanwimolkul, S. Komorita
Recently, Transformers have provided state-of-the-art performance in sparse matching, which is crucial for realizing high-performance 3D vision applications. Yet, these Transformers lack efficiency due to the quadratic computational complexity of their attention mechanism. To solve this problem, we employ an efficient linear attention mechanism, which reduces the computational complexity to linear. We then propose a new attentional aggregation that achieves high accuracy by aggregating both global and local information from sparse keypoints. To further improve efficiency, we propose joint learning of feature matching and description. Our joint learning enables simpler and faster matching than the Sinkhorn algorithm, which is often used to match the descriptors learned by Transformers. Our method achieves competitive performance with only 0.84M learnable parameters against the larger state-of-the-art models SuperGlue (12M parameters) and SGMNet (30M parameters) on three benchmarks: HPatch, ETH, and Aachen Day-Night.
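The efficiency gain comes from replacing softmax attention with a kernelized linear attention, so keys and values are aggregated once instead of forming an N x N attention matrix. Below is a minimal sketch of generic kernel-based linear attention with an ELU feature map (a common choice); it illustrates the linear complexity but is not the paper's exact formulation or its attentional aggregation.

```python
import torch

def elu_feature_map(x):
    # Positive kernel feature map commonly used in linear attention.
    return torch.nn.functional.elu(x) + 1.0

def linear_attention(q, k, v, eps=1e-6):
    """Linear-complexity attention: roughly O(N * d^2) instead of O(N^2 * d).

    q, k: (batch, n, d); v: (batch, n, dv). Illustrative sketch only.
    """
    q, k = elu_feature_map(q), elu_feature_map(k)
    # Aggregate keys and values once: (batch, d, dv)
    kv = torch.einsum("bnd,bnv->bdv", k, v)
    # Per-query normalizer: (batch, n)
    z = 1.0 / (torch.einsum("bnd,bd->bn", q, k.sum(dim=1)) + eps)
    # Output: (batch, n, dv)
    return torch.einsum("bnd,bdv,bn->bnv", q, kv, z)

# Toy usage with 512 keypoints and 64-dimensional descriptors.
q = torch.randn(2, 512, 64)
k = torch.randn(2, 512, 64)
v = torch.randn(2, 512, 64)
out = linear_attention(q, k, v)  # (2, 512, 64)
```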
Citations: 10
OSCARS: An Outlier-Sensitive Content-Based Radiography Retrieval System
Pub Date : 2022-04-06 DOI: 10.1145/3512527.3531425
Xiaoyuan Guo, Jiali Duan, S. Purkayastha, H. Trivedi, J. Gichoya, I. Banerjee
Improving retrieval relevance on noisy datasets is an emerging need for curating large-scale clean datasets in the medical domain. While existing methods can be applied to class-wise retrieval (aka inter-class), they cannot distinguish the granularity of likeness within the same class (aka intra-class). The problem is exacerbated on external medical datasets, where noisy samples of the same class are treated equally during training. Our goal is to identify both intra- and inter-class similarities for fine-grained retrieval. To achieve this, we propose an Outlier-Sensitive Content-based rAdiography Retrieval System (OSCARS), consisting of two steps. First, we train an outlier detector on a clean internal dataset in an unsupervised manner. Then we use the trained detector to generate anomaly scores on the external dataset, whose distribution is used to bin intra-class variations. Second, we propose a quadruplet (a, p, n_intra, n_inter) sampling strategy, where intra-class negatives n_intra are sampled from bins of the same class other than the bin the anchor a belongs to, while inter-class negatives n_inter are randomly sampled from other classes. We suggest a weighted metric-learning objective to balance intra- and inter-class feature learning. We experimented on two representative public radiography datasets. Experiments show the effectiveness of our approach. The training and evaluation code can be found at https://github.com/XiaoyuanGuo/oscars.
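The quadruplet strategy can be pictured as a margin-based metric-learning objective over the four sampled embeddings. The sketch below is a hedged illustration under assumed margins and weights; the function name and hyperparameters are made up and it is not the exact OSCARS loss.

```python
import torch
import torch.nn.functional as F

def quadruplet_loss(f_a, f_p, f_n_intra, f_n_inter,
                    margin_intra=0.2, margin_inter=0.5,
                    w_intra=1.0, w_inter=1.0):
    """Weighted quadruplet objective (illustrative values, not the paper's).

    f_a, f_p: anchor and positive from the same class and same anomaly-score bin;
    f_n_intra: same class but a different bin; f_n_inter: a different class.
    All tensors have shape (batch, dim).
    """
    d_ap = F.pairwise_distance(f_a, f_p)
    d_an_intra = F.pairwise_distance(f_a, f_n_intra)
    d_an_inter = F.pairwise_distance(f_a, f_n_inter)
    # Push intra-class negatives (different bin) a small margin away,
    # and inter-class negatives a larger margin away.
    loss_intra = F.relu(d_ap - d_an_intra + margin_intra)
    loss_inter = F.relu(d_ap - d_an_inter + margin_inter)
    return (w_intra * loss_intra + w_inter * loss_inter).mean()
```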
Citations: 0
Learning Sample Importance for Cross-Scenario Video Temporal Grounding
Pub Date : 2022-01-08 DOI: 10.1145/3512527.3531403
P. Bao, Yadong Mu
The task of temporal grounding aims to locate a video moment in an untrimmed video given a sentence query. This paper, for the first time, investigates superficial biases that are specific to the temporal grounding task and proposes a novel targeted solution. Most alarmingly, we observe that existing temporal grounding models heavily rely on biases in the visual modality (e.g., a strong preference for frequent concepts or certain temporal intervals). This leads to inferior performance when generalizing the model in a cross-scenario test setting. To this end, we propose a novel method called Debiased Temporal Language Localizer (Debias-TLL) to prevent the model from naively memorizing the biases and to enforce grounding of the query sentence based on the true inter-modal relationship. Debias-TLL simultaneously trains two models. By our design, a large discrepancy between the two models' predictions when judging a sample indicates a higher probability of it being a biased sample. Harnessing this informative discrepancy, we devise a data re-weighting scheme to mitigate the data biases. We evaluate the proposed model on cross-scenario temporal grounding, where the train/test data are heterogeneously sourced. Experiments show the large-margin superiority of the proposed method in comparison with state-of-the-art competitors.
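The re-weighting idea can be made concrete as follows: samples on which the two co-trained models disagree strongly are down-weighted in the training loss. This is a hedged sketch under an assumed exponential weighting with a made-up temperature, not the paper's exact scheme; `pred_a`, `pred_b`, and `per_sample_loss` are hypothetical names.

```python
import torch

def discrepancy_weights(scores_a, scores_b, temperature=1.0):
    """Down-weight samples on which two co-trained models disagree.

    scores_a, scores_b: per-sample predictions (e.g. matching scores) from the
    two models, shape (batch,). A large gap suggests a likely biased sample,
    so it receives a smaller training weight. The exponential form and the
    temperature are illustrative choices.
    """
    gap = (scores_a - scores_b).abs()
    weights = torch.exp(-gap / temperature)
    return weights / (weights.sum() + 1e-8)  # normalize over the batch

# Usage sketch: weight the per-sample grounding loss before reduction.
# loss = (discrepancy_weights(pred_a, pred_b) * per_sample_loss).sum()
```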
Citations: 4
Nearest Neighbor Search with Compact Codes: A Decoder Perspective
Pub Date : 2021-12-17 DOI: 10.1145/3512527.3531408
Kenza Amara, Matthijs Douze, Alexandre Sablayrolles, Herv'e J'egou
Modern approaches for fast retrieval of similar vectors on billion-scale datasets rely on compressed-domain methods such as binary sketches or product quantization. These methods minimize a certain loss, typically the mean squared error or another objective function tailored to the retrieval problem. In this paper, we re-interpret popular methods such as binary hashing and product quantizers as auto-encoders, and point out that they implicitly make suboptimal assumptions about the form of the decoder. We design backward-compatible decoders that improve the reconstruction of vectors from the same codes, which translates into better nearest-neighbor search performance. Our method significantly improves over binary hashing methods and product quantization on popular benchmarks.
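To see the auto-encoder view, consider a plain product quantizer: the encoder assigns each sub-vector to its nearest centroid (the code), and the implicit decoder simply looks up and concatenates those centroids, minimizing reconstruction MSE. The sketch below, built on scikit-learn's KMeans, is a hedged illustration of this baseline codec, not the improved decoders proposed in the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def train_pq(x, num_subvectors=8, k=256):
    """Train one codebook per sub-vector; x: (n, d) with d divisible by num_subvectors."""
    d = x.shape[1]
    ds = d // num_subvectors
    codebooks = []
    for m in range(num_subvectors):
        km = KMeans(n_clusters=k, n_init=4).fit(x[:, m * ds:(m + 1) * ds])
        codebooks.append(km.cluster_centers_)
    return codebooks

def encode(x, codebooks):
    """Encoder: nearest-centroid assignment per sub-vector -> (n, num_subvectors) codes."""
    ds = codebooks[0].shape[1]
    codes = []
    for m, cb in enumerate(codebooks):
        sub = x[:, m * ds:(m + 1) * ds]
        dist = ((sub[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
        codes.append(dist.argmin(1))
    return np.stack(codes, axis=1)  # one byte per sub-vector when k <= 256

def decode(codes, codebooks):
    """The implicit decoder: look up and concatenate the selected centroids."""
    return np.concatenate([cb[codes[:, m]] for m, cb in enumerate(codebooks)], axis=1)
```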
Citations: 2
Constructing Phrase-level Semantic Labels to Form Multi-Grained Supervision for Image-Text Retrieval
Pub Date : 2021-09-12 DOI: 10.1145/3512527.3531368
Zhihao Fan, Zhongyu Wei, Zejun Li, Siyuan Wang, Haijun Shan, Xuanjing Huang, Jianqing Fan
Existing research on image-text retrieval mainly relies on sentence-level supervision to distinguish matched and mismatched sentences for a query image. However, the semantic mismatch between an image and a sentence usually occurs at a finer granularity, i.e., the phrase level. In this paper, we explore introducing additional phrase-level supervision for better identification of mismatched units in the text. In practice, multi-grained semantic labels are automatically constructed for a query image at both the sentence level and the phrase level. We construct text scene graphs for the matched sentences and extract entities and triples as the phrase-level labels. To integrate sentence-level and phrase-level supervision, we propose the Semantic Structure Aware Multimodal Transformer (SSAMT) for multi-modal representation learning. Inside the SSAMT, we utilize different kinds of attention mechanisms to enforce interactions of multi-grained semantic units on both the vision and language sides. For training, we propose multi-scale matching from both global and local perspectives and penalize mismatched phrases. Experimental results on MS-COCO and Flickr30K show the effectiveness of our approach compared to state-of-the-art models.
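One way to picture the multi-grained supervision is a loss that combines a global sentence-level ranking term with a local phrase-level term that explicitly penalizes mismatched phrases. The sketch below is a hedged illustration under assumed tensor shapes and a hinge/BCE formulation; the names and the exact loss form are not the SSAMT implementation.

```python
import torch
import torch.nn.functional as F

def multi_grained_loss(img_feat, sent_feat, phrase_feats, phrase_labels, margin=0.2):
    """Combine sentence-level and phrase-level supervision (illustrative only).

    img_feat: (B, D) global image features; sent_feat: (B, D) sentence features;
    phrase_feats: (B, P, D) phrase features; phrase_labels: (B, P) with 1 for
    matched phrases and 0 for mismatched ones.
    """
    # Global (sentence-level) hinge ranking loss over in-batch negatives.
    sim = F.normalize(img_feat, dim=-1) @ F.normalize(sent_feat, dim=-1).t()
    pos = sim.diag().unsqueeze(1)
    global_loss = F.relu(margin + sim - pos).fill_diagonal_(0).mean()

    # Local (phrase-level) loss: matched phrases should score high,
    # mismatched phrases are explicitly penalized.
    phrase_sim = torch.einsum("bd,bpd->bp",
                              F.normalize(img_feat, dim=-1),
                              F.normalize(phrase_feats, dim=-1))
    local_loss = F.binary_cross_entropy_with_logits(phrase_sim, phrase_labels.float())
    return global_loss + local_loss
```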
Citations: 5
TransHash: Transformer-based Hamming Hashing for Efficient Image Retrieval
Pub Date : 2021-05-05 DOI: 10.1145/3512527.3531405
Yongbiao Chen, Shenmin Zhang, Fangxin Liu, Zhigang Chang, Mang Ye, Zhengwei Qi (Shanghai Jiao Tong University; U. California; W. University)
Deep hashing has gained growing popularity in approximate nearest-neighbor search for large-scale image retrieval. Until now, deep hashing for image retrieval has been dominated by convolutional neural network architectures, e.g., ResNet [22]. In this paper, inspired by recent advancements in vision transformers, we present TransHash, a pure transformer-based framework for deep hashing learning. Concretely, our framework is composed of two major modules: (1) Based on the Vision Transformer (ViT), we design a Siamese Multi-Granular Vision Transformer backbone (MGVT) for image feature extraction. To learn fine-grained features, we introduce dual-stream multi-granular feature learning on top of the transformer to learn discriminative global and local features. (2) Besides, we adopt a Bayesian learning scheme with a dynamically constructed similarity matrix to learn compact binary hash codes. The entire framework is jointly trained in an end-to-end manner. To the best of our knowledge, this is the first work to tackle deep hashing learning without convolutional neural networks (CNNs). We perform comprehensive experiments on three widely studied datasets: CIFAR-10, NUS-WIDE, and ImageNet. The experiments demonstrate our superiority over existing state-of-the-art deep hashing methods. Specifically, we achieve 8.2%, 2.6%, and 12.7% gains in average mAP across different hash bit lengths on the three public datasets, respectively.
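Bayesian learning of binary codes is commonly cast as maximizing the likelihood of a pairwise similarity matrix given inner products of relaxed (tanh) codes, plus a quantization penalty. The sketch below is a hedged, generic version of such a pairwise hashing loss; the dynamically constructed similarity matrix of TransHash is simplified here to same-class indicators, and the weights are illustrative.

```python
import torch
import torch.nn.functional as F

def pairwise_hash_loss(features, labels, quant_weight=0.1):
    """Generic pairwise (Bayesian-style) deep-hashing objective (illustrative).

    features: (batch, bits) real-valued outputs of the hashing head;
    labels: (batch,) class ids used to build a similarity matrix S
    (S_ij = 1 if samples i and j share a class, else 0).
    """
    h = torch.tanh(features)                      # relaxed codes in (-1, 1)
    s = (labels[:, None] == labels[None, :]).float()
    inner = 0.5 * h @ h.t()                       # scaled code inner products
    # softplus(inner) - s * inner == negative log-likelihood of S under a sigmoid model
    likelihood = (F.softplus(inner) - s * inner).mean()
    quantization = (h.abs() - 1.0).pow(2).mean()  # push relaxed codes toward +/-1
    return likelihood + quant_weight * quantization

def to_binary(features):
    # Final +/-1 hash codes used for Hamming-distance retrieval.
    return torch.sign(torch.tanh(features))
```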
Citations: 22
M2TR: Multi-modal Multi-scale Transformers for Deepfake Detection
Pub Date : 2021-04-20 DOI: 10.1145/3512527.3531415
Junke Wang, Zuxuan Wu, Jingjing Chen, Yu-Gang Jiang
The widespread dissemination of Deepfakes demands effective approaches that can detect perceptually convincing forged images. In this paper, we aim to capture the subtle manipulation artifacts at different scales using transformer models. In particular, we introduce a Multi-modal Multi-scale TRansformer (M2TR), which operates on patches of different sizes to detect local inconsistencies in images at different spatial levels. M2TR further learns to detect forgery artifacts in the frequency domain, complementing RGB information through a carefully designed cross-modality fusion block. In addition, to stimulate Deepfake detection research, we introduce a high-quality Deepfake dataset, SR-DF, which consists of 4,000 Deepfake videos generated by state-of-the-art face swapping and facial reenactment methods. We conduct extensive experiments to verify the effectiveness of the proposed method, which outperforms state-of-the-art Deepfake detection methods by clear margins.
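Two ingredients of this kind of detector can be sketched simply: a frequency-domain input (here, a log-scaled FFT magnitude, a common choice) that exposes periodic forgery artifacts, and multi-scale patch extraction feeding separate transformer branches. Both snippets below are hedged illustrations under assumed settings; M2TR's cross-modality fusion block is not reproduced.

```python
import torch
import torch.nn.functional as F

def frequency_features(images):
    """Simple frequency-domain representation for forgery detection.

    images: (batch, 3, H, W). The log-scaled, shifted 2D FFT magnitude
    highlights periodic artifacts that are hard to see in RGB.
    """
    spectrum = torch.fft.fft2(images, dim=(-2, -1))
    return torch.log1p(torch.fft.fftshift(spectrum, dim=(-2, -1)).abs())

def multi_scale_patches(images, patch_sizes=(8, 16, 32)):
    """Split the same image into patches of several sizes.

    Each scale would feed its own transformer branch to catch local
    inconsistencies at a different spatial level. Returns a list of
    (batch, num_patches, 3 * p * p) tensors.
    """
    return [
        F.unfold(images, kernel_size=p, stride=p).transpose(1, 2)
        for p in patch_sizes
    ]
```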
Citations: 110
Pluggable Weakly-Supervised Cross-View Learning for Accurate Vehicle Re-Identification
Pub Date : 2021-03-09 DOI: 10.1145/3512527.3531357
Lu Yang, Hongbang Liu, Jinghao Zhou, Lingqiao Liu, Lei Zhang, Peng Wang, Yanning Zhang
Learning cross-view consistent feature representations is the key to accurate vehicle re-identification (ReID), since the visual appearance of vehicles changes significantly under different viewpoints. To this end, many existing approaches resort to supervised cross-view learning using extensive extra viewpoint annotations, which, however, is difficult to deploy in real applications due to the expensive labelling cost and the continuous viewpoint variation that makes it hard to define discrete viewpoint labels. In this study, we present a pluggable Weakly-supervised Cross-View Learning (WCVL) module for vehicle ReID. By hallucinating cross-view samples as the hardest positive counterparts, i.e., those with small luminance difference but large local feature variance, we can learn consistent feature representations by minimizing the cross-view feature distance based on vehicle IDs only, without using any viewpoint annotation. More importantly, the proposed method can be seamlessly plugged into most existing vehicle ReID baselines for cross-view learning without re-training the baselines. To demonstrate its efficacy, we plug the proposed method into several off-the-shelf baselines and obtain significant performance improvements on four public benchmark datasets, i.e., VeRi-776, VehicleID, VRIC, and VRAI.
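The selection rule can be illustrated as follows: among samples sharing the anchor's vehicle ID, prefer those with a small luminance gap but a large feature distance, treating them as likely cross-view positives. The sketch below is a hedged illustration; the scoring form, thresholds, and names are assumptions, not the WCVL module itself.

```python
import torch

def hardest_cross_view_positive(anchor_idx, feats, luminance, ids, alpha=1.0):
    """Pick the "hardest" same-ID positive as a proxy for a cross-view sample.

    feats: (N, D) embeddings; luminance: (N,) mean image luminance;
    ids: (N,) vehicle ids. Score favors large feature distance (likely a
    viewpoint change) and penalizes large luminance difference. Illustrative.
    """
    same_id = (ids == ids[anchor_idx])
    same_id[anchor_idx] = False
    if not same_id.any():
        return None
    lum_diff = (luminance - luminance[anchor_idx]).abs()
    feat_dist = (feats - feats[anchor_idx]).norm(dim=1)
    score = feat_dist - alpha * lum_diff
    score[~same_id] = float("-inf")
    return int(score.argmax())

# The cross-view loss then simply pulls the anchor toward this positive, e.g.:
# j = hardest_cross_view_positive(i, feats, luminance, ids)
# loss = (feats[i] - feats[j]).norm() if j is not None else 0.0
```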
Citations: 0