
Proceedings of the 2022 International Conference on Multimedia Retrieval: Latest Publications

GIO: A Timbre-informed Approach for Pitch Tracking in Highly Noisy Environments
Pub Date: 2022-06-27 | DOI: 10.1145/3512527.3531393
Xiaoheng Sun, Xia Liang, Qiqi He, Bilei Zhu, Zejun Ma
As one of the fundamental tasks in music and speech signal processing, pitch tracking has been attracting attention for decades. While a human can focus on the voiced pitch even in highly noisy environments, most existing automatic pitch tracking systems show unsatisfactory performance when encountering noise. To mimic the human auditory system, a data-driven model named GIO is proposed in this paper, in which timbre information is introduced to guide pitch tracking. The proposed model takes two inputs: a short audio segment from which to extract pitch, and a timbre embedding derived from the speaker's or singer's voice. In experiments, we use a music artist classification model to extract timbre embedding vectors. A dual-branch structure and a two-step training method are designed to enable the model to predict voice presence. The experimental results show that the proposed model gains a significant improvement in noise robustness and outperforms existing state-of-the-art methods with fewer parameters.
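The abstract gives no layer-level details, so the following is only a rough PyTorch sketch of how a two-input, dual-branch model of this kind could be wired: an audio branch gated by the timbre embedding, with one head for pitch and one for voice presence. The input representation (a mel-spectrogram), all layer sizes, the gating mechanism, and the 360-bin pitch grid are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class TimbreInformedPitchTracker(nn.Module):
    """Sketch of a GIO-style model: an audio branch conditioned on a timbre
    embedding, with one head for pitch and one for voice presence."""

    def __init__(self, n_mels: int = 128, timbre_dim: int = 256,
                 hidden: int = 256, n_pitch_bins: int = 360):
        super().__init__()
        # Audio branch: encode a short mel-spectrogram segment framewise.
        self.audio_encoder = nn.Sequential(
            nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
        )
        # Project the speaker/singer timbre embedding to the same width
        # so it can guide the audio features through a simple gating.
        self.timbre_proj = nn.Linear(timbre_dim, hidden)
        # Dual heads: framewise pitch classification and voicing detection.
        self.pitch_head = nn.Linear(hidden, n_pitch_bins)
        self.voicing_head = nn.Linear(hidden, 1)

    def forward(self, mel: torch.Tensor, timbre: torch.Tensor):
        # mel: (batch, n_mels, frames); timbre: (batch, timbre_dim)
        feats = self.audio_encoder(mel)                         # (B, H, T)
        gate = torch.sigmoid(self.timbre_proj(timbre))          # (B, H)
        feats = feats * gate.unsqueeze(-1)                      # timbre-guided features
        feats = feats.transpose(1, 2)                           # (B, T, H)
        pitch_logits = self.pitch_head(feats)                   # (B, T, n_pitch_bins)
        voicing_logits = self.voicing_head(feats).squeeze(-1)   # (B, T)
        return pitch_logits, voicing_logits

if __name__ == "__main__":
    model = TimbreInformedPitchTracker()
    pitch, voicing = model(torch.randn(2, 128, 50), torch.randn(2, 256))
    print(pitch.shape, voicing.shape)  # torch.Size([2, 50, 360]) torch.Size([2, 50])
```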
Citations: 3
Multiple Biological Granularities Network for Person Re-Identification
Pub Date: 2022-06-27 | DOI: 10.1145/3512527.3531365
Shuyuan Tu, Tianzhen Guan, Li Kuang
The task of person re-identification is to retrieve images of a specific pedestrian from a cross-camera person gallery captured in the wild. Previous approaches commonly concentrate on whole-person images and local pre-defined body parts, which are ineffective under diverse person poses and occlusion. To alleviate this problem, researchers began to add attention mechanisms to their models using local convolutions with limited receptive fields. However, previous attention mechanisms focus on local feature representations, ignoring the exploration of global spatial relation knowledge. Global spatial relation knowledge contains clustering-like topological information, which is helpful for handling diverse person poses and occlusion. In this paper, we propose the Multiple Biological Granularities Network (MBGN) based on Global Spatial Relation Pixel Attention (GSRPA), taking human body structure and global spatial-relation pixel information into account. First, we design an adaptive adjustment algorithm (AABS) based on human body structure, which is complementary to our MBGN. Second, we propose a feature fusion strategy that takes multiple biological granularities into account. Our strategy forces the model to learn diverse person poses by balancing local semantic human body parts and global spatial relations. Third, we propose the attention mechanism GSRPA. GSRPA enhances the weight of spatially related pixels and mines person topological information to overcome the occlusion problem. Extensive evaluations on the popular Market-1501 and CUHK03 datasets demonstrate the superiority of MBGN over state-of-the-art methods.
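The abstract does not spell out the GSRPA formulation, so the sketch below is only a generic non-local-style pixel attention used as an illustrative stand-in: every spatial position is re-weighted by its relations to all other positions of the feature map. The projection sizes and the learnable residual gate are assumptions, not the paper's module.

```python
import torch
import torch.nn as nn

class SpatialRelationPixelAttention(nn.Module):
    """Generic non-local-style pixel attention: each spatial position is
    re-weighted by its relation to every other position in the feature map.
    (Illustrative only; not the exact GSRPA formulation.)"""

    def __init__(self, channels: int, reduced: int = 64):
        super().__init__()
        self.query = nn.Conv2d(channels, reduced, kernel_size=1)
        self.key = nn.Conv2d(channels, reduced, kernel_size=1)
        self.value = nn.Conv2d(channels, channels, kernel_size=1)
        self.gamma = nn.Parameter(torch.zeros(1))  # learnable residual weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        q = self.query(x).flatten(2).transpose(1, 2)   # (B, HW, R)
        k = self.key(x).flatten(2)                     # (B, R, HW)
        v = self.value(x).flatten(2).transpose(1, 2)   # (B, HW, C)
        relation = torch.softmax(q @ k / (q.shape[-1] ** 0.5), dim=-1)  # (B, HW, HW)
        out = (relation @ v).transpose(1, 2).reshape(b, c, h, w)
        # Residual: original features plus their relation-weighted aggregation.
        return x + self.gamma * out

if __name__ == "__main__":
    attn = SpatialRelationPixelAttention(channels=256)
    y = attn(torch.randn(2, 256, 24, 8))  # e.g. a person re-ID feature map
    print(y.shape)  # torch.Size([2, 256, 24, 8])
```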
Citations: 0
UF-VTON: Toward User-Friendly Virtual Try-On Network
Pub Date: 2022-06-27 | DOI: 10.1145/3512527.3531387
Yuan Chang, Tao Peng, R. He, Xinrong Hu, Junping Liu, Zili Zhang, Minghua Jiang
Image-based virtual try-on aims to transfer clothes onto a person while preserving both the person's and the clothes' attributes. However, existing methods for this task require an image of the target clothes alone, which cannot be obtained in most cases. To address this issue, we propose a novel user-friendly virtual try-on network (UF-VTON), which only requires a person image and an image of another person wearing the target clothes to generate a result of the first person wearing those clothes. Specifically, we adopt a knowledge distillation scheme to construct a new triple dataset for supervised learning, propose a new three-step pipeline (coarse synthesis, clothing alignment, and refinement synthesis) for the try-on task, and utilize an end-to-end training strategy to further refine the results. In particular, we design a new synthesis network that includes both CNN blocks and Swin Transformer blocks to capture global and local information and generate highly realistic try-on images. Qualitative and quantitative experiments show that our method achieves state-of-the-art virtual try-on performance.
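As a hedged illustration of the three-step pipeline named above, the skeleton below chains coarse synthesis, clothing alignment, and refinement, with tiny placeholder convolutions standing in for the real sub-networks; the wiring between stages (what each stage consumes) is an assumption, not the authors' design.

```python
import torch
import torch.nn as nn

def stage(in_ch: int, hidden: int = 64) -> nn.Sequential:
    """Placeholder stage: a tiny conv net standing in for a real sub-network."""
    return nn.Sequential(
        nn.Conv2d(in_ch, hidden, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.Conv2d(hidden, 3, kernel_size=3, padding=1),
    )

class ThreeStepTryOn(nn.Module):
    """Sketch of a coarse-synthesis -> clothing-alignment -> refinement pipeline.
    Inputs: the target person image and a reference image of another person
    wearing the desired clothes (both 3-channel tensors)."""

    def __init__(self):
        super().__init__()
        self.coarse = stage(6)    # person + reference -> coarse try-on result
        self.align = stage(6)     # coarse + reference -> aligned clothing
        self.refine = stage(9)    # person + coarse + aligned -> final image

    def forward(self, person: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
        coarse = self.coarse(torch.cat([person, reference], dim=1))
        aligned = self.align(torch.cat([coarse, reference], dim=1))
        return self.refine(torch.cat([person, coarse, aligned], dim=1))

if __name__ == "__main__":
    model = ThreeStepTryOn()
    out = model(torch.randn(1, 3, 256, 192), torch.randn(1, 3, 256, 192))
    print(out.shape)  # torch.Size([1, 3, 256, 192])
```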
Citations: 0
Local Slot Attention for Vision and Language Navigation
Pub Date: 2022-06-17 | DOI: 10.1145/3512527.3531366
Yifeng Zhuang, Qiang Sun, Yanwei Fu, Lifeng Chen, Xiangyang Xue
Vision-and-language navigation (VLN), a frontier study aiming to pave the way for general-purpose robots, has been a hot topic in the computer vision and natural language processing communities. The VLN task requires an agent to navigate to a goal location following natural language instructions in unfamiliar environments. Recently, transformer-based models have gained significant improvements on the VLN task, since the attention mechanism in the transformer architecture can better integrate inter- and intra-modal information of vision and language. However, two problems exist in current transformer-based models. 1) The models process each view independently without taking the integrity of objects into account. 2) During the self-attention operation in the visual modality, views that are spatially distant can be interwoven with each other without explicit restriction. This kind of mixing may introduce extra noise instead of useful information. To address these issues, we propose 1) a slot-attention-based module to incorporate information from segmentation of the same object, and 2) a local attention mask mechanism to limit the visual attention span. The proposed modules can be easily plugged into any VLN architecture, and we use the Recurrent VLN-Bert as our base model. Experiments on the R2R dataset show that our model achieves state-of-the-art results.
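The local attention mask idea can be illustrated independently of the full model: block self-attention between views whose spatial positions are farther apart than a threshold. The position encoding (e.g. heading/elevation coordinates) and the threshold below are assumptions, not the paper's exact choices.

```python
import torch
import torch.nn as nn

def local_attention_mask(positions: torch.Tensor, max_dist: float) -> torch.Tensor:
    """Boolean mask (True = blocked) forbidding attention between views whose
    positions are farther apart than max_dist."""
    dist = torch.cdist(positions, positions)   # (N, N) pairwise distances
    return dist > max_dist

if __name__ == "__main__":
    num_views, dim = 36, 512
    views = torch.randn(1, num_views, dim)      # candidate view features
    # Illustrative view positions, e.g. (heading, elevation) of each panoramic view.
    positions = torch.rand(num_views, 2)
    mask = local_attention_mask(positions, max_dist=0.5)

    attn = nn.MultiheadAttention(embed_dim=dim, num_heads=8, batch_first=True)
    # True entries in attn_mask are excluded from attention, so each view only
    # attends to its spatial neighbourhood instead of the whole panorama.
    out, weights = attn(views, views, views, attn_mask=mask)
    print(out.shape, weights.shape)  # torch.Size([1, 36, 512]) torch.Size([1, 36, 36])
```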
Citations: 1
3D-Augmented Contrastive Knowledge Distillation for Image-based Object Pose Estimation
Pub Date: 2022-06-02 | DOI: 10.1145/3512527.3531359
Zhidan Liu, Zhen Xing, Xiangdong Zhou, Yijiang Chen, G. Zhou
Image-based object pose estimation is appealing because in real applications the shape of an object is often unavailable or not as easy to capture as a photo. Although this is an advantage to some extent, leaving shape information unexplored in a 3D vision learning problem is like "flaws in jade". In this paper, we deal with the problem in a reasonable new setting: 3D shape is exploited in the training process, while testing remains purely image-based. We enhance the performance of image-based methods for category-agnostic object pose estimation by exploiting 3D knowledge learned by a multi-modal method. Specifically, we propose a novel contrastive knowledge distillation framework that effectively transfers 3D-augmented image representations from a multi-modal model to an image-based model. We integrate contrastive learning into the two-stage training procedure of knowledge distillation, which formulates an advanced solution for combining these two approaches on cross-modal tasks. We report state-of-the-art results that surpass existing category-agnostic image-based methods by a large margin (up to +5% improvement on the ObjectNet3D dataset), demonstrating the effectiveness of our method.
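A minimal sketch of a contrastive (InfoNCE-style) distillation loss between an image-only student and a frozen multi-modal teacher is shown below; the temperature, embedding size, and use of in-batch negatives are assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def contrastive_distillation_loss(student_emb: torch.Tensor,
                                  teacher_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: each image-only student embedding should match the
    3D-augmented teacher embedding of the same sample (positive) and be pushed
    away from teacher embeddings of other samples in the batch (negatives)."""
    student = F.normalize(student_emb, dim=-1)
    teacher = F.normalize(teacher_emb, dim=-1).detach()   # teacher is frozen
    logits = student @ teacher.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(student.size(0), device=student.device)
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    s = torch.randn(8, 256, requires_grad=True)   # student (image branch) embeddings
    t = torch.randn(8, 256)                       # teacher (multi-modal) embeddings
    loss = contrastive_distillation_loss(s, t)
    loss.backward()
    print(float(loss))
```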
Citations: 3
Cross-lingual Adaptation for Recipe Retrieval with Mixup
Pub Date: 2022-05-08 | DOI: 10.1145/3512527.3531375
B. Zhu, C. Ngo, Jingjing Chen, W. Chan
Cross-modal recipe retrieval has attracted research attention in recent years, thanks to the availability of large-scale paired data for training. Nevertheless, obtaining adequate recipe-image pairs covering the majority of cuisines for supervised learning is difficult, if not impossible. By transferring knowledge learnt from a data-rich cuisine to a data-scarce cuisine, domain adaptation sheds light on this practical problem. However, existing works assume that recipes in the source and target domains mostly originate from the same cuisine and are written in the same language. This paper studies unsupervised domain adaptation for image-to-recipe retrieval, where recipes in the source and target domains are in different languages. Moreover, only recipes are available for training in the target domain. A novel recipe mixup method is proposed to learn transferable embedding features between the two domains. Specifically, recipe mixup produces mixed recipes to form an intermediate domain by discretely exchanging section(s) between source and target recipes. To bridge the domain gap, a recipe mixup loss is proposed to enforce the intermediate domain to lie on the shortest geodesic path between the source and target domains in the recipe embedding space. Using the Recipe 1M dataset as the source domain (English) and the Vireo-FoodTransfer dataset as the target domain (Chinese), empirical experiments verify the effectiveness of recipe mixup for cross-lingual adaptation in the context of image-to-recipe retrieval.
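The section-swapping step can be sketched in a few lines; the section names, swap probability, and toy recipes below are illustrative assumptions, not the paper's data format.

```python
import random
from typing import Optional

SECTIONS = ("title", "ingredients", "instructions")  # assumed recipe sections

def recipe_mixup(source: dict, target: dict, swap_prob: float = 0.5,
                 rng: Optional[random.Random] = None) -> dict:
    """Build an intermediate-domain recipe by discretely exchanging sections:
    each section is taken from the target-domain recipe with probability
    swap_prob, otherwise kept from the source-domain recipe."""
    rng = rng or random.Random()
    return {sec: (target[sec] if rng.random() < swap_prob else source[sec])
            for sec in SECTIONS}

if __name__ == "__main__":
    src = {"title": "Margherita pizza",
           "ingredients": ["dough", "tomato", "mozzarella", "basil"],
           "instructions": ["Stretch the dough.", "Add toppings.", "Bake."]}
    tgt = {"title": "番茄鸡蛋面",
           "ingredients": ["面条", "番茄", "鸡蛋"],
           "instructions": ["煮面。", "炒番茄鸡蛋。", "拌匀。"]}
    print(recipe_mixup(src, tgt, rng=random.Random(0)))
```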
Citations: 2
Dual-Level Decoupled Transformer for Video Captioning
Pub Date: 2022-05-06 | DOI: 10.1145/3512527.3531380
Yi-Meng Gao, Xinglin Hou, Wei Suo, Mengyang Sun, T. Ge, Yuning Jiang, Peifeng Wang
Video captioning aims to understand the spatio-temporal semantic concepts of a video and generate descriptive sentences. The de facto approach to this task has a text generator learn from motion or appearance features extracted offline by pre-trained vision models. However, these methods may suffer from so-called "couple" drawbacks in both video spatio-temporal representation and sentence generation. For the former, "couple" means learning the spatio-temporal representation in a single model (3D CNN), which causes a disconnection between the task and pre-training domains and makes end-to-end training hard. For the latter, "couple" means treating the generation of visual-semantic and syntax-related words equally. To this end, we present D2, a dual-level decoupled transformer pipeline, to address these drawbacks: (i) for video spatio-temporal representation, we decouple the process into a "first-spatial-then-temporal" paradigm, releasing the potential of dedicated models (e.g., image-text pre-training) to connect pre-training and downstream tasks and making the entire model end-to-end trainable; (ii) for sentence generation, we propose a Syntax-Aware Decoder to dynamically measure the contribution of visual-semantic and syntax-related words. Extensive experiments on three widely used benchmarks (MSVD, MSR-VTT, and VATEX) show the great potential of the proposed D2, which surpasses previous methods by a large margin on the video captioning task.
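The "first-spatial-then-temporal" decoupling can be sketched as a per-frame (spatial) encoding stage followed by a temporal transformer over frame tokens. The feature dimensions and the plain transformer encoder below are assumptions used for illustration, not the authors' architecture.

```python
import torch
import torch.nn as nn

class SpatialThenTemporalEncoder(nn.Module):
    """Sketch of a decoupled video encoder: frames are first encoded
    independently (spatial stage), then a transformer models interactions
    across frames (temporal stage)."""

    def __init__(self, frame_dim: int = 768, model_dim: int = 512,
                 n_heads: int = 8, n_layers: int = 2):
        super().__init__()
        # Stand-in for the projection of per-frame features produced by a
        # pretrained image(-text) encoder run on each frame separately.
        self.spatial_proj = nn.Linear(frame_dim, model_dim)
        layer = nn.TransformerEncoderLayer(d_model=model_dim, nhead=n_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, n_frames, frame_dim), one vector per frame.
        tokens = self.spatial_proj(frame_feats)
        return self.temporal(tokens)   # (batch, n_frames, model_dim)

if __name__ == "__main__":
    enc = SpatialThenTemporalEncoder()
    video_tokens = enc(torch.randn(2, 16, 768))
    print(video_tokens.shape)  # torch.Size([2, 16, 512])
```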
Citations: 4
Relevance-based Margin for Contrastively-trained Video Retrieval Models
Pub Date: 2022-04-27 | DOI: 10.1145/3512527.3531395
Alex Falcon, Swathikiran Sudhakaran, G. Serra, Sergio Escalera, O. Lanz
Video retrieval using natural language queries has attracted increasing interest due to its relevance in real-world applications, from intelligent access to private media galleries to web-scale video search. Learning the cross-similarity of video and text in a joint embedding space is the dominant approach. To do so, a contrastive loss is usually employed because it organizes the embedding space by putting similar items close and dissimilar items far apart. This framework leads to competitive recall rates, as these metrics solely focus on the rank of the ground-truth items. Yet, assessing the quality of the ranking list is of utmost importance when considering intelligent retrieval systems, since multiple items may share similar semantics and hence high relevance. Moreover, the aforementioned framework uses a fixed margin to separate similar and dissimilar items, treating all non-ground-truth items as equally irrelevant. In this paper we propose to use a variable margin: we argue that varying the margin used during training based on how relevant an item is to a given query, i.e. a relevance-based margin, easily improves the quality of the ranking lists measured through nDCG and mAP. We demonstrate the advantages of our technique using different models on EPIC-Kitchens-100 and YouCook2. We show that even if we carefully tuned the fixed margin, our technique (which does not have the margin as a hyper-parameter) would still achieve better performance. Finally, extensive ablation studies and qualitative analysis support the robustness of our approach. Code will be released at https://github.com/aranciokov/RelevanceMargin-ICMR22.
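The core idea translates directly into a ranking loss whose margin shrinks as the sampled "negative" becomes more relevant to the query; the linear scaling and the relevance range below are assumptions about one possible instantiation, not the paper's exact formula.

```python
import torch

def relevance_based_margin_loss(pos_sim: torch.Tensor,
                                neg_sim: torch.Tensor,
                                relevance: torch.Tensor,
                                base_margin: float = 0.2) -> torch.Tensor:
    """Triplet-style ranking loss with a per-pair margin.
    pos_sim: similarity of each query to its ground-truth item, shape (B,).
    neg_sim: similarity of each query to a sampled 'negative', shape (B,).
    relevance: how relevant that negative actually is to the query, in [0, 1].
    A highly relevant negative (relevance near 1) gets a near-zero margin, so
    it is not pushed away as hard as a truly irrelevant one."""
    margin = base_margin * (1.0 - relevance)
    return torch.clamp(margin + neg_sim - pos_sim, min=0.0).mean()

if __name__ == "__main__":
    pos = torch.tensor([0.80, 0.60, 0.70])
    neg = torch.tensor([0.50, 0.55, 0.20])
    rel = torch.tensor([0.90, 0.00, 0.30])   # e.g. semantic overlap with the query
    print(float(relevance_based_margin_loss(pos, neg, rel)))
```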
Citations: 4
Generating Topological Structure of Floorplans from Room Attributes
Pub Date: 2022-04-26 | DOI: 10.1145/3512527.3531384
Yu Yin, Will Hutchcroft, Naji Khosravan, Ivaylo Boyadzhiev, Y. Fu, S. B. Kang
Analysis of indoor spaces requires topological information. In this paper, we propose to extract topological information from room attributes using what we call Iterative and adaptive graph Topology Learning (ITL). ITL progressively predicts multiple relations between rooms; at each iteration, it improves the node embeddings, which in turn facilitates the generation of a better topological graph structure. This notion of iteratively improving node embeddings and the topological graph structure is in the same spirit as [5]. However, while [5] computes the adjacency matrix based on node similarity, we learn the graph metric using a relational decoder to extract room correlations. Experiments on a new, challenging indoor dataset validate our proposed method. Qualitative and quantitative evaluations of layout topology prediction and floorplan generation applications also demonstrate the effectiveness of ITL.
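A hedged sketch of the iterative loop (score all room pairs with a relational decoder, then propagate over the predicted graph to refine the node embeddings) is given below; the GRU update, the pairwise MLP decoder, and the iteration count are assumptions, not the paper's modules.

```python
import torch
import torch.nn as nn

class IterativeTopologyLearner(nn.Module):
    """Sketch of ITL-style alternation: a relational decoder predicts pairwise
    room relations from current node embeddings, and the predicted graph is
    then used to propagate information and refine the embeddings."""

    def __init__(self, attr_dim: int = 32, hidden: int = 64, n_iters: int = 3):
        super().__init__()
        self.encode = nn.Linear(attr_dim, hidden)      # room attributes -> embedding
        self.decoder = nn.Sequential(                  # relational decoder on node pairs
            nn.Linear(2 * hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.update = nn.GRUCell(hidden, hidden)       # embedding refinement step
        self.n_iters = n_iters

    def forward(self, room_attrs: torch.Tensor):
        # room_attrs: (n_rooms, attr_dim)
        h = torch.relu(self.encode(room_attrs))
        n = h.size(0)
        adj = None
        for _ in range(self.n_iters):
            # Relational decoder: score every ordered pair of rooms.
            pairs = torch.cat([h.unsqueeze(1).expand(n, n, -1),
                               h.unsqueeze(0).expand(n, n, -1)], dim=-1)
            adj = torch.sigmoid(self.decoder(pairs)).squeeze(-1)   # (n, n) soft adjacency
            # Message passing over the predicted graph, then refine embeddings.
            messages = adj @ h / max(n, 1)
            h = self.update(messages, h)
        return adj, h

if __name__ == "__main__":
    model = IterativeTopologyLearner()
    adj, emb = model(torch.randn(6, 32))   # six rooms with 32-d attribute vectors
    print(adj.shape, emb.shape)  # torch.Size([6, 6]) torch.Size([6, 64])
```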
Citations: 1
Lesion Localization in OCT by Semi-Supervised Object Detection
Pub Date: 2022-04-24 | DOI: 10.1145/3512527.3531418
Yuehua Wu, Yang Zhou, Jianchun Zhao, Jingyuan Yang, Weihong Yu, You-xin Chen, Xirong Li
Over 300 million people worldwide are affected by various retinal diseases. With noninvasive Optical Coherence Tomography (OCT) scans, a number of abnormal structural changes in the retina, namely retinal lesions, can be identified. Automated lesion localization in OCT is thus important for detecting retinal diseases at an early stage. To overcome the lack of manual annotations for deep supervised learning, this paper presents a first study on utilizing semi-supervised object detection (SSOD) for lesion localization in OCT images. To that end, we develop a taxonomy to provide a unified and structured view of current SSOD methods, and consequently identify key modules in these methods. To evaluate the influence of these modules on the new task, we build OCT-SS, a new dataset consisting of over 1k expert-labeled OCT B-scan images and over 13k unlabeled B-scans. Extensive experiments on OCT-SS identify Unbiased Teacher (UnT) as the best current SSOD method for lesion localization. Moreover, we improve over this strong baseline, with mAP increased from 49.34 to 50.86.
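Unbiased Teacher follows the common teacher-student recipe for SSOD: an EMA teacher pseudo-labels unlabeled scans, and only high-confidence detections supervise the student. The sketch below shows just that outer machinery with placeholder models, toy detections, and an assumed confidence threshold, not a full detector.

```python
import copy
import torch

@torch.no_grad()
def ema_update(teacher: torch.nn.Module, student: torch.nn.Module,
               momentum: float = 0.999) -> None:
    """Exponential-moving-average update of teacher weights from the student."""
    for t_param, s_param in zip(teacher.parameters(), student.parameters()):
        t_param.mul_(momentum).add_(s_param, alpha=1.0 - momentum)

def filter_pseudo_labels(boxes: torch.Tensor, scores: torch.Tensor,
                         threshold: float = 0.7):
    """Keep only teacher detections confident enough to act as pseudo ground truth."""
    keep = scores >= threshold
    return boxes[keep], scores[keep]

if __name__ == "__main__":
    # Placeholder "detectors": real ones would output boxes and class scores.
    student = torch.nn.Linear(16, 4)
    teacher = copy.deepcopy(student)

    # Toy teacher predictions on an unlabeled B-scan: 5 boxes with confidences.
    boxes = torch.rand(5, 4)
    scores = torch.tensor([0.95, 0.42, 0.71, 0.66, 0.88])
    pseudo_boxes, pseudo_scores = filter_pseudo_labels(boxes, scores)
    print(pseudo_boxes.shape)   # torch.Size([3, 4]): boxes above the threshold

    # After each student optimization step, the teacher tracks it via EMA.
    ema_update(teacher, student)
```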
Citations: 1