
Proceedings of the 2022 International Conference on Multimedia Retrieval: Latest Publications

FreqCAM: Frequent Class Activation Map for Weakly Supervised Object Localization
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531349
Runsheng Zhang
Class Activation Map (CAM) is a commonly used solution for weakly supervised tasks. However, most existing CAM-based methods share one crucial problem: they locate only small object parts instead of full object regions. In this paper, we find that the co-occurrence between the feature maps of different channels might provide more clues for object locations. Therefore, we propose a simple yet effective method, called Frequent Class Activation Map (FreqCAM), which exploits element-wise frequency information from the last convolutional layers as an attention filter to generate object regions. FreqCAM filters out background noise and robustly obtains more accurate, fine-grained object localization. Furthermore, our approach is a post-hoc method applied to a trained classification model, and thus can be used to improve the performance of existing methods without modification. Experiments on the standard CUB-200-2011 dataset show that our proposed method achieves a significant increase in localization performance compared to existing state-of-the-art methods, without any architectural changes or re-training.
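The abstract does not give the exact frequency computation, but the core idea of an element-wise frequency filter can be illustrated with a short sketch: count, at each spatial location, how many channels of the last convolutional layer are strongly active, and use the normalized count to re-weight the class activation map. The per-channel threshold rule (act_ratio) and the final min-max normalization below are assumptions, not the paper's formulation.

```python
import numpy as np

def freqcam(feature_maps, cam, act_ratio=0.5):
    # feature_maps: (C, H, W) activations from the last conv layer.
    # cam:          (H, W) class activation map for the target class.
    # act_ratio:    fraction of each channel's max used as its "active" threshold (assumption).
    C = feature_maps.shape[0]
    thresholds = act_ratio * feature_maps.reshape(C, -1).max(axis=1)
    # Element-wise frequency: how many channels fire at each spatial location.
    active = feature_maps > thresholds[:, None, None]
    freq = active.sum(axis=0).astype(np.float32)
    freq /= freq.max() + 1e-8
    refined = cam * freq  # the frequency map acts as an attention filter
    return (refined - refined.min()) / (refined.max() - refined.min() + 1e-8)

# Toy usage with random activations.
fmap = np.random.rand(512, 7, 7).astype(np.float32)
mask = freqcam(fmap, fmap.mean(axis=0))
print(mask.shape)  # (7, 7)
```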
Citations: 2
Teaching a New Dog Old Tricks: Contrastive Random Walks in Videos with Unsupervised Priors
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531376
J. Schutte, P. Mettes
This paper focuses on self-supervised representation learning in videos with guidance from multimodal priors. Where the temporal dimension is commonly used as a supervision proxy for learning frame-level or clip-level representations, a number of works have recently shown how to learn local representations in space and time through cycle-consistency. Given a starting patch, the contrastive goal is to track the patch in subsequent frames, followed by backtracking to the original frame with the starting patch as the goal. While effective for downstream tasks such as segmentation and body joint propagation, affinities between patches need to be learned from scratch. This setup not only requires many videos for self-supervised optimization, it also fails when using smaller patches and more connections between consecutive frames. On the other hand, there are multiple generic cues from multiple modalities that provide valuable information about how patches should propagate in videos, from saliency and optical flow to photometric center biases. To that end, we introduce Guided Contrastive Random Walks. The main idea is to employ well-known multimodal priors to provide fixed prior affinities. We outline a general framework where prior affinities are combined with learned affinities to guide the cycle-consistency objective. Empirically, we show that Guided Contrastive Random Walks result in better spatio-temporal representations for two downstream tasks. More importantly, when using smaller patches and therefore more connections between patches, our approach further improves, while the unguided baseline can no longer learn meaningful representations.
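As a rough illustration of how fixed prior affinities can guide a contrastive random walk, the sketch below blends a learned softmax affinity between patch embeddings of consecutive frames with a given prior transition matrix, walks forward and back through the frames, and penalizes patches that do not return to themselves. The mixing weight, temperature, and the simplified backward transition (transposed and renormalized) are assumptions, not the paper's exact design.

```python
import torch
import torch.nn.functional as F

def guided_cycle_loss(feats, prior_affinities, lam=0.3, tau=0.07):
    # feats:            list of T tensors (N, D), per-patch embeddings.
    # prior_affinities: list of T-1 row-stochastic (N, N) prior transition matrices
    #                   (e.g. derived from optical flow or saliency).
    # lam, tau:         mixing weight and softmax temperature (assumptions).
    N = feats[0].shape[0]
    transitions = []
    for t in range(len(feats) - 1):
        sim = F.normalize(feats[t], dim=1) @ F.normalize(feats[t + 1], dim=1).T
        learned = F.softmax(sim / tau, dim=1)                  # learned transitions
        transitions.append((1 - lam) * learned + lam * prior_affinities[t])
    walk = torch.eye(N)
    for A in transitions:                                      # walk to the last frame
        walk = walk @ A
    for A in reversed(transitions):                            # and backtrack (simplified)
        walk = walk @ A.T
    walk = walk / walk.sum(dim=1, keepdim=True)                # renormalise the round trip
    # Cycle-consistency: every patch should land back on itself.
    return F.nll_loss(torch.log(walk + 1e-8), torch.arange(N))

# Toy usage: 3 frames, 16 patches, 64-d features, uniform priors.
feats = [torch.randn(16, 64, requires_grad=True) for _ in range(3)]
priors = [torch.full((16, 16), 1.0 / 16) for _ in range(2)]
guided_cycle_loss(feats, priors).backward()
```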
Citations: 0
OCR-oriented Master Object for Text Image Captioning
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531431
Wenliang Tang, Zhenzhen Hu, Zijie Song, Richang Hong
Text image captioning aims to understand the scene text in images for image caption generation. The key issue of this challenging task is to understand the relationship between the text OCR tokens and images. In this paper, we propose a novel text image captioning method that purifies the OCR-oriented scene graph with the master object. The master object is the object to which the OCR is attached, and it serves as the semantic relationship bridge between the OCR token and the image. We consider the master object as a proxy to connect OCR tokens and other regions in the image. By exploring the master object for each OCR token, we build the purified scene graph based on the master objects and then enrich the visual embedding with a Graph Convolution Network (GCN). Furthermore, we cluster the OCR tokens and feed in the hierarchical information to provide a richer representation. Experiments on the TextCaps validation and test datasets demonstrate the effectiveness of the proposed method.
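The listing does not spell out how a master object is selected for each OCR token. The sketch below assumes a simple spatial rule (largest box overlap, falling back to nearest center), which illustrates the idea of attaching every OCR token to exactly one object region before building the purified scene graph.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def center_dist(box_a, box_b):
    ca = ((box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2)
    cb = ((box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2)
    return ((ca[0] - cb[0]) ** 2 + (ca[1] - cb[1]) ** 2) ** 0.5

def assign_master_objects(ocr_boxes, object_boxes):
    """For each OCR token box, pick the object it is most likely attached to:
    the object with the largest overlap, falling back to the nearest center."""
    masters = []
    for ocr in ocr_boxes:
        overlaps = [iou(ocr, obj) for obj in object_boxes]
        if max(overlaps) > 0:
            masters.append(max(range(len(object_boxes)), key=lambda i: overlaps[i]))
        else:
            masters.append(min(range(len(object_boxes)),
                               key=lambda i: center_dist(ocr, object_boxes[i])))
    return masters

# Toy usage: one OCR token sitting on the second object.
print(assign_master_objects([(40, 40, 60, 50)], [(0, 0, 30, 30), (35, 35, 80, 80)]))  # [1]
```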
Citations: 5
Real-Time Deepfake System for Live Streaming
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531350
Yifei Fan, Modan Xie, Peihan Wu, Gang Yang
This paper proposes a real-time deepfake framework that assists users in applying deep forgery to live streaming, protecting privacy and adding interest by selecting different reference faces to create a non-existent fake face. Nowadays, because of the demand for live-broadcast functions such as selling goods, playing games, and auctions, the opportunities for anchor exposure are increasing, which leads live streamers to pay more attention to protecting their privacy. Meanwhile, traditional deepfake technology is more likely to infringe on the portrait rights of others, so our framework lets users select different face features for facial tampering to avoid infringement. In our framework, face reenactment is performed effectively through a feature extractor, heatmap transformer, heatmap regression, and face blending. Users can enrich a personal face-feature database by uploading different photos, select the desired picture for tampering on this basis, and finally achieve real-time tampered live broadcasting. Moreover, our framework is a closed-loop self-adaptive system, as it allows users to update the database themselves to extend the face feature data and improve conversion efficiency.
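Purely as a reading aid, here is a skeleton of the per-frame pipeline the abstract names (feature extraction, heatmap prediction, face blending). Every stage is a placeholder stub; the actual trained networks are not described in this listing, so the class and method names are assumptions.

```python
import numpy as np

class RealTimeFaceSwap:
    """Skeleton of the per-frame pipeline; all stages are stubs that a real
    system would replace with trained networks."""

    def __init__(self, reference_faces):
        self.reference_faces = reference_faces  # user-selected reference faces

    def extract_features(self, frame):
        # Placeholder: a real extractor would return identity/expression features.
        return frame.mean(axis=(0, 1))

    def predict_heatmaps(self, features):
        # Placeholder for the heatmap transformer + heatmap regression stages.
        return np.ones((64, 64), dtype=np.float32)

    def blend(self, frame, heatmaps):
        # Placeholder face blending: simply returns the frame unchanged.
        return frame

    def process_frame(self, frame):
        feats = self.extract_features(frame)
        heatmaps = self.predict_heatmaps(feats)
        return self.blend(frame, heatmaps)

# Toy usage on a single random frame.
swapper = RealTimeFaceSwap(reference_faces=[np.zeros((128, 128, 3))])
print(swapper.process_frame(np.random.rand(720, 1280, 3)).shape)
```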
Citations: 1
Person Search by Uncertain Attributes
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531354
Tingting Dong, Jianquan Liu
This paper presents a person search system based on uncertain attributes. Attribute-based person search aims at finding person images that are best matched to a set of attributes specified by a user as a query. The specified query attributes are inherently uncertain due to many factors, such as the difficulty of retrieving characteristics of a target person from memory and environmental variations like light and viewpoint. Also, existing attribute recognition techniques typically extract confidence scores along with attributes. Most state-of-the-art approaches for attribute-based person search ignore the confidence scores or simply use a threshold to filter out attributes with low confidence scores. Moreover, they do not consider the uncertainty of query attributes. In this work, we resolve this uncertainty by enabling users to specify a level of confidence with each query attribute and by considering uncertainty in both query attributes and attributes extracted from person images. We define a novel matching score that measures the degree to which a person matches the query attribute conditions by leveraging the knowledge of probabilistic databases. Furthermore, we propose a novel definition of Critical Point of Confidence and compute it for each query attribute to show the impact of confidence levels on rankings of results. We develop a web-based demonstration system and show its effectiveness using real-world surveillance videos.
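The paper defines its own matching score and Critical Point of Confidence, neither of which is given in this listing. As a purely illustrative stand-in, the sketch below combines the user's per-attribute confidence with the recognizer's extracted confidence so that an uncertain query attribute constrains the ranking less; this reading is an assumption, not the paper's formula.

```python
from typing import Dict

def match_score(query: Dict[str, float], person: Dict[str, float]) -> float:
    # query:  attribute -> user confidence that the target person has it (0..1).
    # person: attribute -> recognizer confidence extracted from the image (0..1).
    # Simple probabilistic reading: a low-confidence query attribute matters less.
    score = 1.0
    for attr, q in query.items():
        p = person.get(attr, 0.0)
        score *= q * p + (1.0 - q)
    return score

# Toy usage: the user is sure about "red_jacket" but unsure about "backpack".
query = {"red_jacket": 0.9, "backpack": 0.4}
gallery = {
    "person_1": {"red_jacket": 0.8, "backpack": 0.1},
    "person_2": {"red_jacket": 0.3, "backpack": 0.9},
}
ranked = sorted(gallery, key=lambda pid: match_score(query, gallery[pid]), reverse=True)
print(ranked)  # person_1 ranks first: the confident attribute dominates
```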
Citations: 0
Parallelism Network with Partial-aware and Cross-correlated Transformer for Vehicle Re-identification
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531412
Guangqi Jiang, Huibing Wang, Jinjia Peng, Xianping Fu
Vehicle re-identification (ReID) aims to identify a specific vehicle in a dataset captured by non-overlapping cameras, which plays a significant role in the development of intelligent transportation systems. Even though CNN-based models achieve impressive performance on the ReID task, the Gaussian distribution of their effective receptive fields limits their ability to capture the long-term dependence between features. Moreover, it is crucial to capture fine-grained features and the relationships between features as much as possible from vehicle images. To address these problems, we propose a partial-aware and cross-correlated transformer model (PCTM), which adopts a parallelism network extracting discriminant features to optimize the feature representation for vehicle ReID. PCTM includes a cross-correlation transformer branch that fuses features extracted by the transformer module and a feature guidance module, which guides the network to capture the long-term dependence of key features. In this way, the feature guidance module promotes the transformer-based features to focus on the vehicle itself and avoids the interference of excessive background in feature extraction. Moreover, PCTM introduces a partial-aware structure in the second branch to explore fine-grained information from vehicle images for capturing local differences between vehicles. Furthermore, we conducted experiments on two vehicle datasets to verify the performance of PCTM.
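The cross-correlation transformer and feature-guidance module cannot be reconstructed from the abstract alone. As a loose illustration of the two-branch idea only (a global branch plus a partial-aware branch pooled over horizontal stripes), here is a simplified sketch with a plain convolutional stem standing in for the backbone; none of the layer choices below come from the paper.

```python
import torch
import torch.nn as nn

class TwoBranchReID(nn.Module):
    """Rough sketch of a two-branch ReID embedding: a global branch plus a
    partial-aware branch that pools horizontal stripes of the feature map."""

    def __init__(self, dim=128, n_parts=4):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, dim, 7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(dim, dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.n_parts = n_parts

    def forward(self, x):
        f = self.backbone(x)                       # (B, dim, H, W)
        global_feat = f.mean(dim=(2, 3))           # global average pooling
        parts = f.chunk(self.n_parts, dim=2)       # horizontal stripes
        part_feats = [p.mean(dim=(2, 3)) for p in parts]
        return torch.cat([global_feat] + part_feats, dim=1)

model = TwoBranchReID()
emb = model(torch.randn(2, 3, 256, 128))
print(emb.shape)  # (2, 128 * 5)
```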
Citations: 1
ViRMA: Virtual Reality Multimedia Analytics
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531352
Aaron Duane, Bjorn Por Jonsson
In this paper we describe the latest iteration of the Virtual Reality Multimedia Analytics (ViRMA) system, a novel approach to multimedia analysis in virtual reality which is supported by the Multi-dimensional Multimedia Model.
Citations: 1
Unseen Food Segmentation
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531426
Yuma Honbu, Keiji Yanai
Food image segmentation is important for detailed analysis of food images, especially for classification of multiple food items and calorie estimation. However, training a semantic segmentation model is costly because it requires a large number of images with pixel-level annotations. In addition, the existence of a myriad of food categories leads to insufficient data in each category. Although several food segmentation datasets such as UEC-FoodPix Complete have been released so far, the number of food categories they cover is still small. In this study, we propose a highly accurate unseen-class segmentation method that uses both zero-shot and few-shot segmentation for arbitrary unseen classes. We make the following contributions: (1) we propose an UnSeen Food Segmentation method (USFoodSeg) that uses the zero-shot model to infer the segmentation mask from the class label words of unseen classes and their images, and uses the few-shot model to refine the segmentation masks; (2) we generate segmentation masks for 156 unseen categories in UEC-Food256, totaling 17,000 images, and 85 categories in the Food-101 dataset, totaling 85,000 images, with an accuracy of over 90%. Our proposed method is able to solve the problem of insufficient food segmentation data.
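One way to picture combining a zero-shot mask with a few-shot refinement is sketched below: the zero-shot foreground probabilities are treated as given (e.g. from a text-driven segmenter), the few-shot cue is a masked-average prototype from a single support image, and the two are fused with a fixed weight and threshold. The fusion rule, weight, and threshold are assumptions, not the paper's pipeline.

```python
import numpy as np

def refine_with_prototype(query_feats, zero_shot_prob, support_feats, support_mask, alpha=0.5):
    # query_feats:    (H, W, D) per-pixel features of the query image.
    # zero_shot_prob: (H, W) foreground probability from a zero-shot model (given).
    # support_feats:  (h, w, D) features of one annotated support image.
    # support_mask:   (h, w) binary mask of the unseen class in the support image.
    # alpha:          mixing weight between the two cues (assumption).
    proto = (support_feats * support_mask[..., None]).sum((0, 1)) / (support_mask.sum() + 1e-8)
    proto /= np.linalg.norm(proto) + 1e-8
    q = query_feats / (np.linalg.norm(query_feats, axis=-1, keepdims=True) + 1e-8)
    few_shot_prob = (q @ proto + 1.0) / 2.0      # map cosine similarity [-1, 1] -> [0, 1]
    fused = alpha * zero_shot_prob + (1 - alpha) * few_shot_prob
    return fused > 0.5

# Toy usage with random tensors.
H = W = h = w = 8
D = 16
mask = refine_with_prototype(
    np.random.rand(H, W, D), np.random.rand(H, W),
    np.random.rand(h, w, D), (np.random.rand(h, w) > 0.5).astype(np.float32))
print(mask.shape)  # (8, 8)
```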
Citations: 3
Introduction to the Fifth Annual Lifelog Search Challenge, LSC'22
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531439
C. Gurrin, Liting Zhou, G. Healy, Björn þór Jónsson, Duc-Tien Dang-Nguyen, Jakub Lokoč, Minh-Triet Tran, Wolfgang Hürst, Luca Rossetto, Klaus Schöffmann
For the fifth time since 2018, the Lifelog Search Challenge (LSC) facilitated a benchmarking exercise to compare interactive search systems designed for multimodal lifelogs. LSC'22 attracted nine participating research groups who developed interactive lifelog retrieval systems enabling fast and effective access to lifelogs. The systems competed in front of a hybrid audience at the LSC workshop at ACM ICMR'22. This paper presents an introduction to the LSC workshop, the new (larger) dataset used in the competition, and introduces the participating lifelog search systems.
Citations: 29
Ingredient-enriched Recipe Generation from Cooking Videos
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531388
Jianlong Wu, Liangming Pan, Jingjing Chen, Yu-Gang Jiang
Cooking video captioning aims to generate the text instructions that describe the cooking procedures presented in a video. Current approaches tend to use large neural models or more robust feature extractors to increase the expressive ability of features, ignoring the strong correlation between consecutive cooking steps in the video. However, it is intuitive that previous cooking steps can provide clues for the next cooking step. In particular, consecutive cooking steps tend to share the same ingredients. Therefore, accurate ingredient recognition can help to introduce more fine-grained information into captioning. To improve the performance of video procedural captioning on cooking videos, this paper proposes a framework that introduces an ingredient recognition module and uses the copy mechanism to fuse the predicted ingredient information into the generated sentence. Moreover, we integrate the visual information of the previous step into the generation of the current step, so that the visual information of the two steps together assists the generation process. Extensive experiments verify the effectiveness of our proposed framework, which achieves promising performance on both the YouCookII and Cooking-COIN datasets.
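The copy mechanism mentioned above can be sketched as mixing the decoder's vocabulary distribution with attention over the predicted ingredient tokens. The sketch below shows only that mixing step; the generation probability p_gen and the attention weights are assumed to come from the decoder, and the exact gating in the paper may differ.

```python
import torch

def copy_distribution(vocab_logits, copy_attn, ingredient_ids, p_gen):
    # vocab_logits:   (V,) decoder logits over the output vocabulary.
    # copy_attn:      (K,) attention over K predicted ingredient tokens (sums to 1).
    # ingredient_ids: (K,) vocabulary ids of those ingredient tokens.
    # p_gen:          probability of generating from the vocabulary instead of copying.
    vocab_dist = torch.softmax(vocab_logits, dim=0)
    copy_dist = torch.zeros_like(vocab_dist).index_add_(0, ingredient_ids, copy_attn)
    return p_gen * vocab_dist + (1 - p_gen) * copy_dist  # still sums to 1

# Toy usage: vocabulary of 10 tokens, ingredient tokens at ids 3 and 7.
dist = copy_distribution(torch.randn(10), torch.tensor([0.7, 0.3]),
                         torch.tensor([3, 7]), p_gen=0.6)
print(float(dist.sum()))  # ~1.0
```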
Citations: 2