
Proceedings of the 2022 International Conference on Multimedia Retrieval: Latest Publications

Automatic Visual Recognition of Unexploded Ordnances Using Supervised Deep Learning
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531383
Georgios Begkas, Panagiotis Giannakeris, K. Ioannidis, Georgios Kalpakis, T. Tsikrika, S. Vrochidis, Y. Kompatsiaris
Unexploded Ordnance (UXO) classification is a challenging task which is currently tackled using electromagnetic induction devices that are expensive and may require physical presence in potentially hazardous environments. The limited availability of open UXO data has, until now, impeded the progress of image-based UXO classification, which may offer a safe alternative at a reduced cost. In addition, the existing sporadic efforts focus mainly on small scale experiments using only a subset of common UXO categories. Our work aims to stimulate research interest in image-based UXO classification, with the curation of a novel dataset that consists of over 10000 annotated images from eight major UXO categories. Through extensive experimentation with supervised deep learning we uncover key insights into the challenging aspects of this task. Finally, we set the baseline on our novel benchmark by training state-of-the-art Convolutional Neural Networks and a Vision Transformer that are able to discriminate between highly overlapping UXO categories with 84.33% accuracy.
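As a rough illustration of the supervised baseline the abstract describes, the sketch below fine-tunes a pretrained CNN for eight-way UXO classification. The dataset directory layout, choice of ResNet-50, and all hyperparameters are assumptions for illustration; the paper's actual training setup is not reproduced here.

```python
# Minimal sketch: fine-tuning a pretrained CNN for 8-way UXO classification.
# Hypothetical: the "uxo_dataset/" layout and all hyperparameters.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, models, transforms

NUM_CLASSES = 8  # eight major UXO categories, per the abstract

transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

# Assumes an ImageFolder-style layout: uxo_dataset/train/<category>/<image>.jpg
train_set = datasets.ImageFolder("uxo_dataset/train", transform=transform)
train_loader = DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)  # new 8-way head

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for epoch in range(10):
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
```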
Citations: 0
Motor Learning based on Presentation of a Tentative Goal
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531413
S. Sun, Yongqing Sun, Mitsuhiro Goto, Shigekuni Kondo, Dan Mikami, Susumu Yamamoto
This paper presents a motor learning method based on the presentation of a personalized target motion, which we call a tentative goal. While many prior studies have focused on helping users correct their motor skill motions, most of them present the reference motion to users regardless of whether that motion is attainable. This makes it difficult for users to appropriately modify their motion toward the reference motion when the difference between the two is too large. This study aims to provide a tentative goal that maximizes performance within a certain amount of motion change. To achieve this, predicting the performance of any motion is necessary. However, it is challenging to estimate the performance of a tentative goal by building a general model because of the large variety of human motion. Therefore, we built an individual model that predicts performance from a small training dataset and implemented it using our proposed data augmentation method. Experiments with basketball free-throw data demonstrate the effectiveness of the proposed method.
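A minimal sketch of the underlying idea, under strong simplifying assumptions: fit an individual performance predictor on a small, jitter-augmented dataset, then select the candidate motion that maximizes predicted performance within a bounded amount of motion change. The feature vectors, the random-forest regressor, the jitter augmentation, and the distance budget are all illustrative stand-ins, not the paper's method.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Hypothetical small individual dataset: motion features -> performance.
X_train = rng.normal(size=(40, 12))        # e.g., joint-angle features
y_train = rng.uniform(0.0, 1.0, size=40)   # e.g., free-throw success rate

# Simple augmentation for the small dataset: jitter each sample.
X_aug = np.concatenate([X_train + rng.normal(scale=0.05, size=X_train.shape)
                        for _ in range(5)])
y_aug = np.tile(y_train, 5)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(np.concatenate([X_train, X_aug]), np.concatenate([y_train, y_aug]))

current = rng.normal(size=12)              # the user's current motion
candidates = current + rng.normal(scale=0.3, size=(500, 12))

budget = 1.0                               # allowed amount of motion change
dists = np.linalg.norm(candidates - current, axis=1)
feasible = candidates[dists <= budget]
tentative_goal = feasible[np.argmax(model.predict(feasible))]
```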
Citations: 0
ICDAR'22: Intelligent Cross-Data Analysis and Retrieval
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531441
Minh-Son Dao, M. Riegler, Duc-Tien Dang-Nguyen, C. Gurrin, Yuta Nakashima, M. Dong
We have recently witnessed the rise of cross-data approaches to multimodal data problems. A cross-modal retrieval system uses a textual query to look for images; the air quality index can be predicted using lifelogging images; congestion can be predicted using weather and tweet data; and daily exercises and meals can help predict sleep quality. These are some examples of this research direction. Although many investigations focusing on multimodal data analytics have been developed, little cross-data (e.g., cross-modal, cross-domain, cross-platform) research has been carried out. In order to promote intelligent cross-data analytics and retrieval research and to bring a smart, sustainable society to human beings, the specific article collection on "Intelligent Cross-Data Analysis and Retrieval" is introduced. This Research Topic welcomes those who come from diverse research domains and disciplines such as well-being, disaster prevention and mitigation, mobility, climate change, tourism, healthcare, and food computing.
Citations: 1
Accelerated Sign Hunter: A Sign-based Black-box Attack via Branch-Prune Strategy and Stabilized Hierarchical Search
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531399
S. Li, Guangji Huang, Xing Xu, Yang Yang, Fumin Shen
We propose the Accelerated Sign Hunter (ASH), a sign-based black-box attack under an l∞ constraint. The proposed method searches for an approximate sign of the gradient of the loss w.r.t. the input image using few queries to the target model, and crafts the adversarial example by updating the input image in this direction. It applies a Branch-Prune Strategy that infers the unknown sign bits from the checked ones to avoid unnecessary queries. It also adopts a Stabilized Hierarchical Search to achieve better performance within a limited query budget. We provide a theoretical proof showing that the Accelerated Sign Hunter halves the queries without dropping the attack success rate (SR) compared with the state-of-the-art sign-based black-box attack. Extensive experiments also demonstrate the superiority of our ASH method over other black-box attacks. In particular, on Inception-v3 for ImageNet, our method achieves an SR of 0.989 with an average of 338.56 queries, 1/4 fewer than the state-of-the-art sign-based attack needs to achieve the same SR. Moreover, our ASH method works out of the box, since there are no hyperparameters that need to be tuned.
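For readers unfamiliar with this attack family, the sketch below shows a plain SignHunter-style sign-flip search, the general scheme ASH builds on; it does not implement the paper's Branch-Prune Strategy or Stabilized Hierarchical Search. The `loss_fn` black box, the epsilon value, and the query budget are illustrative assumptions.

```python
import numpy as np

def sign_attack(x, loss_fn, eps=8 / 255, max_queries=1000):
    """Estimate the gradient sign by flipping progressively smaller blocks,
    keeping a flip only when it increases the black-box loss."""
    n = x.size
    sign = np.ones(n)                          # current sign estimate
    best = loss_fn(x + eps * sign.reshape(x.shape))
    queries, chunk, start = 1, n, 0
    while queries < max_queries:
        end = min(start + chunk, n)
        sign[start:end] *= -1                  # tentatively flip one block
        cand = loss_fn(x + eps * sign.reshape(x.shape))
        queries += 1
        if cand > best:                        # higher loss -> keep the flip
            best = cand
        else:
            sign[start:end] *= -1              # revert
        start = end
        if start >= n:                         # finished a pass: halve blocks
            start, chunk = 0, max(chunk // 2, 1)
    return np.clip(x + eps * sign.reshape(x.shape), 0.0, 1.0)

# Toy usage with a synthetic black-box loss:
x0 = np.random.rand(3, 8, 8)
adv = sign_attack(x0, lambda z: float((z ** 2).sum()))
```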
Citations: 0
Introduction to the Fifth Annual Lifelog Search Challenge, LSC'22
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531439
C. Gurrin, Liting Zhou, G. Healy, Björn þór Jónsson, Duc-Tien Dang-Nguyen, Jakub Lokoč, Minh-Triet Tran, Wolfgang Hürst, Luca Rossetto, Klaus Schöffmann
For the fifth time since 2018, the Lifelog Search Challenge (LSC) facilitated a benchmarking exercise to compare interactive search systems designed for multimodal lifelogs. LSC'22 attracted nine participating research groups who developed interactive lifelog retrieval systems enabling fast and effective access to lifelogs. The systems competed in front of a hybrid audience at the LSC workshop at ACM ICMR'22. This paper presents an introduction to the LSC workshop, the new (larger) dataset used in the competition, and introduces the participating lifelog search systems.
Citations: 29
MFGAN: A Lightweight Fast Multi-task Multi-scale Feature-fusion Model based on GAN
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531410
Lijia Deng, Yu-dong Zhang
Cell segmentation and counting is a time-consuming and important experimental step in traditional biomedical research. Many current counting methods require exact cell locations, yet few cell datasets carry detailed object coordinates; most existing cell datasets provide only the total number of cells and a global segmentation labelling. To make more effective use of existing datasets, we divide the cell counting task into cell number prediction and cell segmentation. This paper proposes a lightweight fast multi-task multi-scale feature-fusion model based on generative adversarial networks (MFGAN). To coordinate the learning of these two tasks, we propose a Combined Hybrid Loss function (CH Loss) and use a conditional GAN to train our network. We propose a Lightweight Fast Multi-task Generator (LFMG) that reduces the number of parameters by 20% compared with U-Net while achieving better performance on cell segmentation. We use multi-scale feature-fusion technology to improve the quality of the reconstructed segmentation images. In addition, we propose a Structure Fusion Discrimination (SFD) module to refine the accuracy of feature details. Our method achieves non-point-based counting, which no longer requires annotating the exact position of each cell in the image during training, and achieves excellent results on both cell counting and cell segmentation.
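To make the multi-task coupling concrete, here is a minimal sketch of a combined loss in the spirit of CH Loss: one term for segmentation, one for count regression, and one adversarial term from a conditional discriminator. The weights, loss choices, and tensor shapes are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn.functional as F

def combined_hybrid_loss(pred_mask, true_mask, pred_count, true_count,
                         disc_score_fake, w_seg=1.0, w_cnt=0.1, w_adv=0.01):
    seg = F.binary_cross_entropy_with_logits(pred_mask, true_mask)
    cnt = F.l1_loss(pred_count, true_count)
    # Generator's adversarial term: push the discriminator to call fakes real.
    adv = F.binary_cross_entropy_with_logits(
        disc_score_fake, torch.ones_like(disc_score_fake))
    return w_seg * seg + w_cnt * cnt + w_adv * adv

# Toy shapes: a batch of 4 predicted masks, counts, and discriminator scores.
loss = combined_hybrid_loss(
    pred_mask=torch.randn(4, 1, 64, 64),
    true_mask=torch.randint(0, 2, (4, 1, 64, 64)).float(),
    pred_count=torch.rand(4), true_count=torch.rand(4) * 100,
    disc_score_fake=torch.randn(4, 1))
```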
Citations: 0
Weakly Supervised Fine-grained Recognition based on Combined Learning for Small Data and Coarse Label
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531419
Anqi Hu, Zhengxing Sun, Qian Li
Learning with weak supervision has become one of the research trends in fine-grained image recognition. These methods aim to learn feature representations at a lower cost in manual annotation or expert knowledge. Most existing weakly supervised methods are based on incomplete or inexact annotation, and the limited supervision information makes it difficult for them to perform well. Using these two kinds of annotation for training at the same time can therefore mine more relevance while adding little annotation burden. In this paper, we propose a combined learning framework that uses coarse-grained large data and fine-grained small data for weakly supervised fine-grained recognition. Combined learning contains two significant modules: 1) a discriminant module, which keeps the structure information consistent between coarse and fine labels via attention maps and part sampling, and 2) a cluster division strategy, which mines the detailed differences between fine categories by feature subtraction. Experimental results show that our method outperforms weakly supervised methods and achieves performance close to fully supervised methods on the CUB-200-2011 and Stanford Cars datasets.
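A minimal sketch of the basic training setup the abstract implies: a shared backbone with a coarse head supervised on every sample and a fine head supervised only where fine labels exist. The two-head layout, the -1 sentinel for missing fine labels, and the loss weight are illustrative assumptions; the paper's discriminant module and cluster division strategy are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoHeadNet(nn.Module):
    def __init__(self, n_coarse=10, n_fine=200, dim=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * 32 * 32, dim), nn.ReLU())
        self.coarse_head = nn.Linear(dim, n_coarse)
        self.fine_head = nn.Linear(dim, n_fine)

    def forward(self, x):
        h = self.backbone(x)
        return self.coarse_head(h), self.fine_head(h)

def combined_loss(coarse_logits, fine_logits, coarse_y, fine_y, w_fine=1.0):
    loss = F.cross_entropy(coarse_logits, coarse_y)
    has_fine = fine_y >= 0                   # -1 marks "no fine label"
    if has_fine.any():
        loss = loss + w_fine * F.cross_entropy(
            fine_logits[has_fine], fine_y[has_fine])
    return loss

net = TwoHeadNet()
x = torch.randn(8, 3, 32, 32)
coarse_y = torch.randint(0, 10, (8,))
fine_y = torch.tensor([3, -1, -1, 57, -1, -1, -1, 120])  # sparse fine labels
cl, fl = net(x)
loss = combined_loss(cl, fl, coarse_y, fine_y)
```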
Citations: 2
Disentangled Representations and Hierarchical Refinement of Multi-Granularity Features for Text-to-Image Synthesis
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531389
Pei Dong, L. Wu, Lei Meng, Xiangxu Meng
In this paper, we focus on generating photo-realistic images from given text descriptions. Current methods first generate an initial image and then progressively refine it into a high-resolution one, typically refining all granularity features output from the previous stage indiscriminately. However, the ability to express different granularity features is not consistent across stages, and it is difficult to express precise semantics by further refining poor-quality features generated in a previous stage. Because current methods cannot refine different granularity features independently, it is challenging to clearly express all semantic factors in the generated image, and some features even become worse. To address this issue, we propose Hierarchical Disentangled Representations Generative Adversarial Networks (HDR-GAN), which generate photo-realistic images by explicitly disentangling and individually modeling the semantic factors in the image. HDR-GAN introduces a novel component called the multi-granularity feature disentangled encoder, which represents image information comprehensively by explicitly disentangling multi-granularity features including pose, shape, and texture. Moreover, we develop a novel Multi-granularity Feature Refinement (MFR) scheme containing a Coarse-grained Feature Refinement (CFR) model and a Fine-grained Feature Refinement (FFR) model. CFR utilizes coarse-grained disentangled representations (e.g., pose and shape) to clarify category information, while FFR employs fine-grained disentangled representations (e.g., texture) to reflect instance-level details. Extensive experiments on two well-studied and publicly available datasets (CUB-200 and CLEVR-SV) demonstrate the rationality and superiority of our method.
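The sketch below illustrates only the disentangling idea in its simplest form: separate encoders produce pose, shape, and texture codes that downstream refinement stages can consume independently. The encoder architecture and dimensions are placeholders, not HDR-GAN's actual design.

```python
import torch
import torch.nn as nn

def make_encoder(out_dim):
    # A tiny placeholder encoder: conv -> global pool -> linear code.
    return nn.Sequential(nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),
                         nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                         nn.Linear(16, out_dim))

pose_enc, shape_enc, tex_enc = (make_encoder(32) for _ in range(3))

x = torch.randn(4, 3, 64, 64)
z = torch.cat([pose_enc(x), shape_enc(x), tex_enc(x)], dim=1)  # (4, 96)
# A generator would map (z, text embedding) to an image; coarse-grained
# refinement would read the pose/shape slices, fine-grained the texture slice.
pose_code, shape_code, tex_code = z[:, :32], z[:, 32:64], z[:, 64:]
```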
Citations: 5
Improving Image Captioning via Enhancing Dual-Side Context Awareness
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531379
Yi-Meng Gao, Ning Wang, Wei Suo, Mengyang Sun, Peifeng Wang
Recent work on visual question answering demonstrates that grid features can work as well as region features on vision-language tasks. Meanwhile, transformer-based models and their variants have shown remarkable performance on image captioning. However, the loss of object-contextual information caused by the single-granularity nature of grid features on the encoder side, as well as the loss of future contextual information due to the left2right decoding paradigm of the transformer decoder, remains unexplored. In this work, we tackle these two problems by enhancing contextual information on both sides: (i) on the encoder side, we propose a Context-Aware Self-Attention module, in which the keys/values are expanded with adjacent rectangle regions, each containing two or more aggregated grid features; this yields grid features of varying granularity that store adequate contextual information for objects of different scales. (ii) On the decoder side, we incorporate a dual-way decoding strategy in which left2right and right2left decoding are conducted simultaneously and interactively, utilizing both past and future contextual information when generating the current word. Combining these two modules with a vanilla transformer, our Context-Aware Transformer (CATNet) achieves a new state of the art on the MSCOCO benchmark.
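A minimal sketch of the key/value expansion described in (i): fine grid tokens attend over themselves plus coarser tokens formed by pooling adjacent grid cells. The 2x2 pooling window, the dimensions, and the single-head attention are illustrative choices, not the paper's exact module.

```python
import torch
import torch.nn.functional as F

B, H, W, d = 2, 8, 8, 64
grid = torch.randn(B, H, W, d)

fine = grid.reshape(B, H * W, d)                     # original grid tokens
# Aggregate each 2x2 rectangle of adjacent cells into one coarser token.
coarse = F.avg_pool2d(grid.permute(0, 3, 1, 2), 2)   # (B, d, H/2, W/2)
coarse = coarse.flatten(2).transpose(1, 2)           # (B, H*W/4, d)

q = fine                                             # queries: fine tokens
kv = torch.cat([fine, coarse], dim=1)                # expanded keys/values
attn = torch.softmax(q @ kv.transpose(1, 2) / d ** 0.5, dim=-1)
out = attn @ kv                                      # context-aware output
```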
Citations: 2
Phrase-level Prediction for Video Temporal Localization
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531382
Sizhe Li, C. Li, Minghang Zheng, Yang Liu
Video temporal localization aims to locate a period that semantically matches a natural language query in a given untrimmed video. We empirically observe that although existing approaches make steady progress on sentence localization, their performance on phrase localization is far from satisfactory. In principle, a phrase should be easier to localize, as fewer combinations of visual concepts need to be considered; this incapability indicates that existing models only capture the sentence annotation bias in the benchmark but lack sufficient understanding of the intrinsic relationship between simple visual and language concepts, calling their generalization and interpretability into question. This paper proposes a unified framework that can handle both sentence- and phrase-level localization, namely the Phrase Level Prediction Net (PLPNet). Specifically, based on the hypothesis that similar phrases tend to focus on similar video cues while dissimilar ones should not, we build a contrastive mechanism to constrain phrase-level localization without requiring fine-grained phrase boundary annotation in training. Moreover, considering the flexibility of sentences and the wide discrepancy among phrases, we propose a clustering-based batch sampler to ensure that contrastive learning can be conducted efficiently. Extensive experiments demonstrate that our method surpasses state-of-the-art methods in phrase-level temporal localization while maintaining high performance on sentence localization and boosting the model's interpretability and generalization capability. Our code is available at https://github.com/sizhelee/PLPNet.
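A minimal sketch of the contrastive hypothesis stated above: phrases judged similar (here, by embedding cosine similarity above a threshold) are pulled toward attending to similar video cues, dissimilar ones pushed apart. The threshold, temperature, and feature tensors are illustrative assumptions; consult the released code for the actual PLPNet mechanism.

```python
import torch
import torch.nn.functional as F

def phrase_contrastive_loss(phrase_emb, cue_feat, thresh=0.7, tau=0.1):
    # phrase_emb: (N, d) phrase embeddings; cue_feat: (N, d) the video-cue
    # features each phrase's localization head focused on.
    p = F.normalize(phrase_emb, dim=1)
    c = F.normalize(cue_feat, dim=1)
    pos_mask = (p @ p.t() > thresh).float()      # similar-phrase pairs
    pos_mask.fill_diagonal_(0)
    logits = c @ c.t() / tau                     # cue-feature similarity
    logits.fill_diagonal_(-1e9)                  # exclude self-pairs
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    denom = pos_mask.sum(1).clamp(min=1)
    return -(pos_mask * log_prob).sum(1).div(denom).mean()

loss = phrase_contrastive_loss(torch.randn(16, 256), torch.randn(16, 256))
```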
Citations: 4