
Proceedings of the 2022 International Conference on Multimedia Retrieval: Latest Publications

Automatic Visual Recognition of Unexploded Ordnances Using Supervised Deep Learning
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531383
Georgios Begkas, Panagiotis Giannakeris, K. Ioannidis, Georgios Kalpakis, T. Tsikrika, S. Vrochidis, Y. Kompatsiaris
Unexploded Ordnance (UXO) classification is a challenging task which is currently tackled using electromagnetic induction devices that are expensive and may require physical presence in potentially hazardous environments. The limited availability of open UXO data has, until now, impeded the progress of image-based UXO classification, which may offer a safe alternative at a reduced cost. In addition, the existing sporadic efforts focus mainly on small scale experiments using only a subset of common UXO categories. Our work aims to stimulate research interest in image-based UXO classification, with the curation of a novel dataset that consists of over 10000 annotated images from eight major UXO categories. Through extensive experimentation with supervised deep learning we uncover key insights into the challenging aspects of this task. Finally, we set the baseline on our novel benchmark by training state-of-the-art Convolutional Neural Networks and a Vision Transformer that are able to discriminate between highly overlapping UXO categories with 84.33% accuracy.
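As a rough illustration of the supervised baseline the abstract describes, the sketch below fine-tunes a pretrained CNN for eight-way UXO classification. The dataset directory layout, choice of ResNet-50, and all hyperparameters are assumptions for illustration; the paper's actual training setup is not reproduced here.

```python
# Minimal sketch: fine-tuning a pretrained CNN for 8-way UXO classification.
# Hypothetical: the "uxo_dataset/" layout and all hyperparameters.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

NUM_CLASSES = 8  # eight major UXO categories, per the abstract

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assumes an ImageFolder-style layout: uxo_dataset/train/<category>/<image>.jpg
train_set = datasets.ImageFolder("uxo_dataset/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new 8-way head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```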
Citations: 0
Motor Learning based on Presentation of a Tentative Goal
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531413
S. Sun, Yongqing Sun, Mitsuhiro Goto, Shigekuni Kondo, Dan Mikami, Susumu Yamamoto
This paper presents a motor learning method based on the presentation of a personalized target motion, which we call a tentative goal. While many prior studies have focused on helping users correct their motor skill motions, most of them present the reference motion to users regardless of whether that motion is attainable. This makes it difficult for users to appropriately modify their motion toward the reference motion when the difference between the two is too large. This study aims to provide a tentative goal that maximizes performance within a certain amount of motion change. To achieve this, predicting the performance of any motion is necessary. However, it is challenging to estimate the performance of a tentative goal by building a general model because of the large variety of human motion. Therefore, we built an individual model that predicts performance from a small training dataset and implemented it using our proposed data augmentation method. Experiments with basketball free-throw data demonstrate the effectiveness of the proposed method.
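A minimal sketch of the underlying idea, under strong simplifying assumptions: fit an individual performance predictor on a small, jitter-augmented dataset, then select the candidate motion that maximizes predicted performance within a bounded amount of motion change. The feature vectors, the random-forest regressor, the jitter augmentation, and the distance budget are all illustrative stand-ins, not the paper's method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical small individual dataset: motion features -> performance.
X_train = rng.normal(size=(40, 12))        # e.g., joint-angle features
y_train = rng.uniform(0.0, 1.0, size=40)   # e.g., free-throw success rate

# Simple augmentation for the small dataset: jitter each sample.
X_aug = np.concatenate([X_train + rng.normal(scale=0.05, size=X_train.shape)
                        for _ in range(5)])
y_aug = np.tile(y_train, 5)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(np.concatenate([X_train, X_aug]), np.concatenate([y_train, y_aug]))

current = rng.normal(size=12)              # the user's current motion
candidates = current + rng.normal(scale=0.3, size=(500, 12))

budget = 1.0                               # allowed amount of motion change
dists = np.linalg.norm(candidates - current, axis=1)
feasible = candidates[dists <= budget]
tentative_goal = feasible[np.argmax(model.predict(feasible))]
```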
Citations: 0
ICDAR'22: Intelligent Cross-Data Analysis and Retrieval
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531441
Minh-Son Dao, M. Riegler, Duc-Tien Dang-Nguyen, C. Gurrin, Yuta Nakashima, M. Dong
We have recently witnessed the rise of cross-data approaches to multimodal data problems. A cross-modal retrieval system uses a textual query to look for images; the air quality index can be predicted using lifelogging images; congestion can be predicted using weather and tweet data; and daily exercises and meals can help predict sleep quality. These are some examples of this research direction. Although many investigations focusing on multimodal data analytics have been developed, little cross-data (e.g., cross-modal, cross-domain, cross-platform) research has been carried out. In order to promote intelligent cross-data analytics and retrieval research and to bring a smart, sustainable society to human beings, the specific article collection on "Intelligent Cross-Data Analysis and Retrieval" is introduced. This Research Topic welcomes those who come from diverse research domains and disciplines such as well-being, disaster prevention and mitigation, mobility, climate change, tourism, healthcare, and food computing.
Citations: 1
Accelerated Sign Hunter: A Sign-based Black-box Attack via Branch-Prune Strategy and Stabilized Hierarchical Search
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531399
S. Li, Guangji Huang, Xing Xu, Yang Yang, Fumin Shen
We propose the Accelerated Sign Hunter (ASH), a sign-based black-box attack under an l∞ constraint. The proposed method searches for an approximate sign of the gradient of the loss w.r.t. the input image using few queries to the target model, and crafts the adversarial example by updating the input image in this direction. It applies a Branch-Prune Strategy that infers the unknown sign bits from the checked ones to avoid unnecessary queries. It also adopts a Stabilized Hierarchical Search to achieve better performance within a limited query budget. We provide a theoretical proof showing that the Accelerated Sign Hunter halves the queries without dropping the attack success rate (SR) compared with the state-of-the-art sign-based black-box attack. Extensive experiments also demonstrate the superiority of our ASH method over other black-box attacks. In particular, on Inception-v3 for ImageNet, our method achieves an SR of 0.989 with an average of 338.56 queries, 1/4 fewer than the state-of-the-art sign-based attack needs to achieve the same SR. Moreover, our ASH method works out of the box, since there are no hyperparameters that need to be tuned.
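For readers unfamiliar with this attack family, the sketch below shows a plain SignHunter-style sign-flip search, the general scheme ASH builds on; it does not implement the paper's Branch-Prune Strategy or Stabilized Hierarchical Search. The `loss_fn` black box, the epsilon value, and the query budget are illustrative assumptions.

```python
import numpy as np

def sign_attack(x, loss_fn, eps=8 / 255, max_queries=1000):
    """Estimate the gradient sign by flipping progressively smaller blocks,
    keeping a flip only when it increases the black-box loss."""
    n = x.size
    sign = np.ones(n)                          # current sign estimate
    best = loss_fn(x + eps * sign.reshape(x.shape))
    queries, chunk, start = 1, n, 0
    while queries < max_queries:
        end = min(start + chunk, n)
        sign[start:end] *= -1                  # tentatively flip one block
        cand = loss_fn(x + eps * sign.reshape(x.shape))
        queries += 1
        if cand > best:                        # higher loss -> keep the flip
            best = cand
        else:
            sign[start:end] *= -1              # revert
        start = end
        if start >= n:                         # finished a pass: halve blocks
            start, chunk = 0, max(chunk // 2, 1)
    return np.clip(x + eps * sign.reshape(x.shape), 0.0, 1.0)

# Toy usage with a synthetic black-box loss:
x0 = np.random.rand(3, 8, 8)
adv = sign_attack(x0, lambda z: float((z ** 2).sum()))
```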
Citations: 0
Introduction to the Fifth Annual Lifelog Search Challenge, LSC'22
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531439
C. Gurrin, Liting Zhou, G. Healy, Björn þór Jónsson, Duc-Tien Dang-Nguyen, Jakub Lokoč, Minh-Triet Tran, Wolfgang Hürst, Luca Rossetto, Klaus Schöffmann
For the fifth time since 2018, the Lifelog Search Challenge (LSC) facilitated a benchmarking exercise to compare interactive search systems designed for multimodal lifelogs. LSC'22 attracted nine participating research groups who developed interactive lifelog retrieval systems enabling fast and effective access to lifelogs. The systems competed in front of a hybrid audience at the LSC workshop at ACM ICMR'22. This paper presents an introduction to the LSC workshop, the new (larger) dataset used in the competition, and introduces the participating lifelog search systems.
Citations: 29
MFGAN: A Lightweight Fast Multi-task Multi-scale Feature-fusion Model based on GAN
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531410
Lijia Deng, Yu-dong Zhang
Cell segmentation and counting is a time-consuming and important experimental step in traditional biomedical research. Many current counting methods require exact cell locations, yet few cell datasets carry detailed object coordinates; most existing cell datasets provide only the total number of cells and a global segmentation labelling. To make more effective use of existing datasets, we divide the cell counting task into cell number prediction and cell segmentation. This paper proposes a lightweight fast multi-task multi-scale feature-fusion model based on generative adversarial networks (MFGAN). To coordinate the learning of these two tasks, we propose a Combined Hybrid Loss function (CH Loss) and use a conditional GAN to train our network. We propose a Lightweight Fast Multi-task Generator (LFMG) that reduces the number of parameters by 20% compared with U-Net while achieving better performance on cell segmentation. We use multi-scale feature-fusion technology to improve the quality of the reconstructed segmentation images. In addition, we propose a Structure Fusion Discrimination (SFD) module to refine the accuracy of feature details. Our method achieves non-point-based counting, which no longer requires annotating the exact position of each cell in the image during training, and achieves excellent results on both cell counting and cell segmentation.
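To make the multi-task coupling concrete, here is a minimal sketch of a combined loss in the spirit of CH Loss: one term for segmentation, one for count regression, and one adversarial term from a conditional discriminator. The weights, loss choices, and tensor shapes are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def combined_hybrid_loss(pred_mask, true_mask, pred_count, true_count,
                         disc_score_fake, w_seg=1.0, w_cnt=0.1, w_adv=0.01):
    seg = F.binary_cross_entropy_with_logits(pred_mask, true_mask)
    cnt = F.l1_loss(pred_count, true_count)
    # Generator's adversarial term: push the discriminator to call fakes real.
    adv = F.binary_cross_entropy_with_logits(
        disc_score_fake, torch.ones_like(disc_score_fake))
    return w_seg * seg + w_cnt * cnt + w_adv * adv

# Toy shapes: a batch of 4 predicted masks, counts, and discriminator scores.
loss = combined_hybrid_loss(
    pred_mask=torch.randn(4, 1, 64, 64),
    true_mask=torch.randint(0, 2, (4, 1, 64, 64)).float(),
    pred_count=torch.rand(4), true_count=torch.rand(4) * 100,
    disc_score_fake=torch.randn(4, 1))
```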
Citations: 0
Weakly Supervised Fine-grained Recognition based on Combined Learning for Small Data and Coarse Label
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531419
Anqi Hu, Zhengxing Sun, Qian Li
Learning with weak supervision has become one of the research trends in fine-grained image recognition. These methods aim to learn feature representations at a lower cost in manual annotation or expert knowledge. Most existing weakly supervised methods are based on incomplete or inexact annotation, and the limited supervision information makes it difficult for them to perform well. Using these two kinds of annotation for training at the same time can therefore mine more relevance while adding little annotation burden. In this paper, we propose a combined learning framework that uses coarse-grained large data and fine-grained small data for weakly supervised fine-grained recognition. Combined learning contains two significant modules: 1) a discriminant module, which keeps the structure information consistent between coarse and fine labels via attention maps and part sampling, and 2) a cluster division strategy, which mines the detailed differences between fine categories by feature subtraction. Experimental results show that our method outperforms weakly supervised methods and achieves performance close to fully supervised methods on the CUB-200-2011 and Stanford Cars datasets.
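A minimal sketch of the basic training setup the abstract implies: a shared backbone with a coarse head supervised on every sample and a fine head supervised only where fine labels exist. The two-head layout, the -1 sentinel for missing fine labels, and the loss weight are illustrative assumptions; the paper's discriminant module and cluster division strategy are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadNet(nn.Module):
    def __init__(self, n_coarse=10, n_fine=200, dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, dim), nn.ReLU())
        self.coarse_head = nn.Linear(dim, n_coarse)
        self.fine_head = nn.Linear(dim, n_fine)

    def forward(self, x):
        h = self.backbone(x)
        return self.coarse_head(h), self.fine_head(h)

def combined_loss(coarse_logits, fine_logits, coarse_y, fine_y, w_fine=1.0):
    loss = F.cross_entropy(coarse_logits, coarse_y)
    has_fine = fine_y >= 0                   # -1 marks "no fine label"
    if has_fine.any():
        loss = loss + w_fine * F.cross_entropy(
            fine_logits[has_fine], fine_y[has_fine])
    return loss

net = TwoHeadNet()
x = torch.randn(8, 3, 32, 32)
coarse_y = torch.randint(0, 10, (8,))
fine_y = torch.tensor([3, -1, -1, 57, -1, -1, -1, 120])  # sparse fine labels
cl, fl = net(x)
loss = combined_loss(cl, fl, coarse_y, fine_y)
```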
Citations: 2
Disentangled Representations and Hierarchical Refinement of Multi-Granularity Features for Text-to-Image Synthesis
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531389
Pei Dong, L. Wu, Lei Meng, Xiangxu Meng
In this paper, we focus on generating photo-realistic images from given text descriptions. Current methods first generate an initial image and then progressively refine it into a high-resolution one, typically refining all granularity features output from the previous stage indiscriminately. However, the ability to express different granularity features is not consistent across stages, and it is difficult to express precise semantics by further refining poor-quality features generated in a previous stage. Because current methods cannot refine different granularity features independently, it is challenging to clearly express all semantic factors in the generated image, and some features even become worse. To address this issue, we propose Hierarchical Disentangled Representations Generative Adversarial Networks (HDR-GAN), which generate photo-realistic images by explicitly disentangling and individually modeling the semantic factors in the image. HDR-GAN introduces a novel component called the multi-granularity feature disentangled encoder, which represents image information comprehensively by explicitly disentangling multi-granularity features including pose, shape, and texture. Moreover, we develop a novel Multi-granularity Feature Refinement (MFR) scheme containing a Coarse-grained Feature Refinement (CFR) model and a Fine-grained Feature Refinement (FFR) model. CFR utilizes coarse-grained disentangled representations (e.g., pose and shape) to clarify category information, while FFR employs fine-grained disentangled representations (e.g., texture) to reflect instance-level details. Extensive experiments on two well-studied and publicly available datasets (CUB-200 and CLEVR-SV) demonstrate the rationality and superiority of our method.
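The sketch below illustrates only the disentangling idea in its simplest form: separate encoders produce pose, shape, and texture codes that downstream refinement stages can consume independently. The encoder architecture and dimensions are placeholders, not HDR-GAN's actual design.

```python
import torch
import torch.nn as nn

def make_encoder(out_dim):
    # A tiny placeholder encoder: conv -> global pool -> linear code.
    return nn.Sequential(nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(16, out_dim))

pose_enc, shape_enc, tex_enc = (make_encoder(32) for _ in range(3))

x = torch.randn(4, 3, 64, 64)
z = torch.cat([pose_enc(x), shape_enc(x), tex_enc(x)], dim=1)  # (4, 96)
# A generator would map (z, text embedding) to an image; coarse-grained
# refinement would read the pose/shape slices, fine-grained the texture slice.
pose_code, shape_code, tex_code = z[:, :32], z[:, 32:64], z[:, 64:]
```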
Citations: 5
Improving Image Captioning via Enhancing Dual-Side Context Awareness
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531379
Yi-Meng Gao, Ning Wang, Wei Suo, Mengyang Sun, Peifeng Wang
Recent work on visual question answering demonstrates that grid features can work as well as region features on vision-language tasks. Meanwhile, transformer-based models and their variants have shown remarkable performance on image captioning. However, the loss of object-contextual information caused by the single-granularity nature of grid features on the encoder side, as well as the loss of future contextual information due to the left2right decoding paradigm of the transformer decoder, remains unexplored. In this work, we tackle these two problems by enhancing contextual information on both sides: (i) on the encoder side, we propose a Context-Aware Self-Attention module, in which the keys/values are expanded with adjacent rectangle regions, each containing two or more aggregated grid features; this yields grid features of varying granularity that store adequate contextual information for objects of different scales. (ii) On the decoder side, we incorporate a dual-way decoding strategy in which left2right and right2left decoding are conducted simultaneously and interactively, utilizing both past and future contextual information when generating the current word. Combining these two modules with a vanilla transformer, our Context-Aware Transformer (CATNet) achieves a new state of the art on the MSCOCO benchmark.
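A minimal sketch of the key/value expansion described in (i): fine grid tokens attend over themselves plus coarser tokens formed by pooling adjacent grid cells. The 2x2 pooling window, the dimensions, and the single-head attention are illustrative choices, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

B, H, W, d = 2, 8, 8, 64
grid = torch.randn(B, H, W, d)

fine = grid.reshape(B, H * W, d)                     # original grid tokens
# Aggregate each 2x2 rectangle of adjacent cells into one coarser token.
coarse = F.avg_pool2d(grid.permute(0, 3, 1, 2), 2)   # (B, d, H/2, W/2)
coarse = coarse.flatten(2).transpose(1, 2)           # (B, H*W/4, d)

q = fine                                             # queries: fine tokens
kv = torch.cat([fine, coarse], dim=1)                # expanded keys/values
attn = torch.softmax(q @ kv.transpose(1, 2) / d ** 0.5, dim=-1)
out = attn @ kv                                      # context-aware output
```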
Citations: 2
Phrase-level Prediction for Video Temporal Localization
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531382
Sizhe Li, C. Li, Minghang Zheng, Yang Liu
Video temporal localization aims to locate a period that semantically matches a natural language query in a given untrimmed video. We empirically observe that although existing approaches make steady progress on sentence localization, their performance on phrase localization is far from satisfactory. In principle, a phrase should be easier to localize, as fewer combinations of visual concepts need to be considered; this incapability indicates that existing models only capture the sentence annotation bias in the benchmark but lack sufficient understanding of the intrinsic relationship between simple visual and language concepts, calling their generalization and interpretability into question. This paper proposes a unified framework that can handle both sentence- and phrase-level localization, namely the Phrase Level Prediction Net (PLPNet). Specifically, based on the hypothesis that similar phrases tend to focus on similar video cues while dissimilar ones should not, we build a contrastive mechanism to constrain phrase-level localization without requiring fine-grained phrase boundary annotation in training. Moreover, considering the flexibility of sentences and the wide discrepancy among phrases, we propose a clustering-based batch sampler to ensure that contrastive learning can be conducted efficiently. Extensive experiments demonstrate that our method surpasses state-of-the-art methods in phrase-level temporal localization while maintaining high performance on sentence localization and boosting the model's interpretability and generalization capability. Our code is available at https://github.com/sizhelee/PLPNet.
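A minimal sketch of the contrastive hypothesis stated above: phrases judged similar (here, by embedding cosine similarity above a threshold) are pulled toward attending to similar video cues, dissimilar ones pushed apart. The threshold, temperature, and feature tensors are illustrative assumptions; consult the released code for the actual PLPNet mechanism.

```python
import torch
import torch.nn.functional as F

def phrase_contrastive_loss(phrase_emb, cue_feat, thresh=0.7, tau=0.1):
    # phrase_emb: (N, d) phrase embeddings; cue_feat: (N, d) the video-cue
    # features each phrase's localization head focused on.
    p = F.normalize(phrase_emb, dim=1)
    c = F.normalize(cue_feat, dim=1)
    pos_mask = (p @ p.t() > thresh).float()      # similar-phrase pairs
    pos_mask.fill_diagonal_(0)
    logits = c @ c.t() / tau                     # cue-feature similarity
    logits.fill_diagonal_(-1e9)                  # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    denom = pos_mask.sum(1).clamp(min=1)
    return -(pos_mask * log_prob).sum(1).div(denom).mean()

loss = phrase_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```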
Citations: 4