Class Activation Map (CAM) is a commonly used solution for weakly supervised tasks. However, most existing CAM-based methods share one crucial problem: they locate only small, discriminative object parts rather than full object regions. In this paper, we find that the co-occurrence between the feature maps of different channels can provide additional clues about object locations. We therefore propose a simple yet effective method, called Frequent Class Activation Map (FreqCAM), which exploits element-wise frequency information from the last convolutional layers as an attention filter to generate object regions. FreqCAM filters out background noise and robustly yields more accurate, fine-grained object localization. Furthermore, our approach is a post-hoc method applied to a trained classification model, and can therefore be used to improve existing methods without modification. Experiments on the standard CUB-200-2011 dataset show that the proposed method achieves a significant increase in localization performance over existing state-of-the-art methods without any architectural changes or re-training.
{"title":"FreqCAM: Frequent Class Activation Map for Weakly Supervised Object Localization","authors":"Runsheng Zhang","doi":"10.1145/3512527.3531349","DOIUrl":"https://doi.org/10.1145/3512527.3531349","url":null,"abstract":"Class Activation Map (CAM) is a commonly used solution for weakly supervised tasks. However, most of the existing CAM-based methods have one crucial problem, that is, only small object parts instead of full object regions can be located. In this paper, we find that the co-occurrence between the feature maps of different channels might provide more clues for object locations. Therefore, we propose a simple yet effective method, called Frequent Class Activation Map (FreqCAM), which exploits element-wise frequency information from the last convolutional layers as an attention filter to generate object regions. Our FreqCAM can filter the background noise and obtain more accurate fine-grained object localization information robustly. Furthermore, our approach is a post-hoc method of a trained classification model, and thus can be used to improve the performance of existing methods without modification. Experiments on the standard dataset CUB-200-2011 show that our proposed method achieves a significant increase in localization performance compared to the original existing state-of-the-art methods without any architectural changes or re-training.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128083953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper focuses on self-supervised representation learning in videos with guidance from multimodal priors. While the temporal dimension is commonly used as a supervision proxy for learning frame-level or clip-level representations, a number of recent works have shown how to learn local representations in space and time through cycle-consistency. Given a starting patch, the contrastive goal is to track the patch through subsequent frames, followed by backtracking to the original frame with the starting patch as the goal. While effective for downstream tasks such as segmentation and body-joint propagation, the affinities between patches need to be learned from scratch. This setup not only requires many videos for self-supervised optimization, it also fails when using smaller patches and more connections between consecutive frames. On the other hand, multiple generic cues from multiple modalities provide valuable information about how patches should propagate in videos, from saliency and optical flow to photometric center biases. To that end, we introduce Guided Contrastive Random Walks. The main idea is to employ well-known multimodal priors to provide fixed prior affinities. We outline a general framework in which prior affinities are combined with learned affinities to guide the cycle-consistency objective. Empirically, we show that Guided Contrastive Random Walks result in better spatio-temporal representations on two downstream tasks. More importantly, when using smaller patches and therefore more connections between patches, our approach improves further, while the unguided baseline can no longer learn meaningful representations.
{"title":"Teaching a New Dog Old Tricks: Contrastive Random Walks in Videos with Unsupervised Priors","authors":"J. Schutte, P. Mettes","doi":"10.1145/3512527.3531376","DOIUrl":"https://doi.org/10.1145/3512527.3531376","url":null,"abstract":"This paper focuses on self-supervised representation learning in videos with guidance from multimodal priors. Where the temporal dimension is commonly used as supervision proxy for learning frame-level or clip-level representations, a number of works have recently shown how to learn local representations in space and time through cycle-consistency. Given a starting patch, the contrastive goal is to track the patch in subsequent frames, followed by a backtracking to the original frame with the starting patch as goal. While effective for down-stream tasks such as segmentation and body joint propagation, affinities between patches need to be learned from scratch. This setup not only requires many videos for self-supervised optimization, it also fails when using smaller patches and more connections between consecutive frames. On the other hand, there are multiple generic cues from multiple modalities that provide valuable information about how patches should propagate in videos, from saliency and optical flow to photometric center biases. To that end, we introduce Guided Contrastive Random Walks. The main idea is to employ well-known multimodal priors to provide fixed prior affinities. We outline a general framework where prior affinities are combined with learned affinities to guide the cycle-consistency objective. Empirically, we show that Guided Contrastive Random Walks result in better spatio-temporal representations for two down-stream tasks. More importantly, when using smaller patches and therefore more connections between patches, our approach further improves, while the unguided baseline can no longer learn meaningful representations.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131001756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Wenliang Tang, Zhenzhen Hu, Zijie Song, Richang Hong
Text image captioning aims to understand the scene text in images for image caption generation. The key issue in this challenging task is to understand the relationship between OCR tokens and images. In this paper, we propose a novel text image captioning method that purifies the OCR-oriented scene graph with the master object. The master object is the object to which the OCR is attached, and it serves as the semantic bridge between the OCR token and the image. We treat the master object as a proxy connecting OCR tokens and other regions in the image. By finding the master object for each OCR token, we build a purified scene graph based on the master objects and then enrich the visual embedding with a Graph Convolutional Network (GCN). Furthermore, we cluster the OCR tokens and feed in the hierarchical information to provide a richer representation. Experiments on the TextCaps validation and test sets demonstrate the effectiveness of the proposed method.
{"title":"OCR-oriented Master Object for Text Image Captioning","authors":"Wenliang Tang, Zhenzhen Hu, Zijie Song, Richang Hong","doi":"10.1145/3512527.3531431","DOIUrl":"https://doi.org/10.1145/3512527.3531431","url":null,"abstract":"Text image captioning aims to understand the scene text in images for image caption generation. The key issue of this challenging task is to understand the relationship between the text OCR tokens and images. In this paper, we propose a novel text image captioning method by purifying the OCR-oriented scene graph with themaster object. The master object is the object to which the OCR is attached, which is the semantic relationship bridge between the OCR token and the image. We consider the master object as a proxy to connect OCR tokens and other regions in the image. By exploring the master object for each OCR token, we build the purified scene graph based on the master objects and then enrich the visual embedding by the Graph Convolution Network (GCN). Furthermore, we cluster the OCR tokens and feed the hierarchical information to provide a richer representation. Experiments on the TextCaps validation and test dataset demonstrate the effectiveness of the proposed method.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122686457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper proposes a real-time deepfake framework that helps users apply deep forgery to live streaming, both to protect privacy and to add interest by selecting different reference faces to create a non-existent fake face. Nowadays, because of the demand for live-broadcast activities such as selling goods, playing games, and auctions, streamers are increasingly exposed, which leads them to pay more attention to privacy protection. Meanwhile, traditional deepfake technology is more likely to infringe on the portrait rights of others, so our framework lets users select different facial features for tampering in order to avoid infringement. In our framework, face reenactment is achieved effectively through a feature extractor, a heatmap transformer, heatmap regression, and face blending. Users can enrich their personal face-feature database by uploading different photos, select the desired picture for tampering on this basis, and finally achieve real-time tampered live broadcasting. Moreover, our framework is a closed-loop, self-adapting system, as it allows users to update the database themselves to extend the face-feature data and improve conversion efficiency.
{"title":"Real-Time Deepfake System for Live Streaming","authors":"Yifei Fan, Modan Xie, Peihan Wu, Gang Yang","doi":"10.1145/3512527.3531350","DOIUrl":"https://doi.org/10.1145/3512527.3531350","url":null,"abstract":"This paper proposes a real-time deepfake framework to assist users use deep forgery to conduct live streaming, further to protect privacy and increase interesting by selecting different reference faces to create a non-existent fake face. Nowadays, because of the demand for live broadcast functions such as selling goods, playing games, and auctions, the opportunities for anchor exposure are increasing, which leads live streamers pay more attention to their privacy protection. Meanwhile, the traditional technology of deepfake is more likely to infring on the portrait rights of others, so our framework supports users to select different face features for facial tampering to avoid infringement. In our framework, through feature extractor, heatmap transformer, heatmap regression and face blending, face reenactment could be confirmed effectively. Users can enrich the personal face feature database by uploading different photos, and then select the desired picture for tampering on this basis, and finally real-time tampering live broadcast is achieved. Moreover, our framework is a closed loop self-adaptation system as it allows users to update the database themselves to extend face feature data and improve conversion efficiency.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114839229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper presents a person search system based on uncertain attributes. Attribute-based person search aims to find the person images that best match a set of attributes specified by a user as a query. The specified query attributes are inherently uncertain due to many factors, such as the difficulty of recalling characteristics of a target person from memory and environmental variations like lighting and viewpoint. In addition, existing attribute recognition techniques typically extract confidence scores along with attributes. Most state-of-the-art approaches for attribute-based person search ignore these confidence scores or simply use a threshold to filter out attributes with low confidence; moreover, they do not consider the uncertainty of query attributes. In this work, we address this uncertainty by enabling users to specify a confidence level with each query attribute and by considering uncertainty in both the query attributes and the attributes extracted from person images. We define a novel matching score that measures how well a person matches the query attribute conditions by leveraging knowledge from probabilistic databases. Furthermore, we propose a novel notion of the Critical Point of Confidence and compute it for each query attribute to show the impact of confidence levels on result rankings. We develop a web-based demonstration system and show its effectiveness on real-world surveillance videos.
{"title":"Person Search by Uncertain Attributes","authors":"Tingting Dong, Jianquan Liu","doi":"10.1145/3512527.3531354","DOIUrl":"https://doi.org/10.1145/3512527.3531354","url":null,"abstract":"This paper presents a person search system by uncertain attributes. Attribute-based person search aims at finding person images that are the best matched with a set of attributes specified by a user as a query. The specified query attributes are inherently uncertain due to many factors such as the difficulty of retrieving characteristics of a target person from brain-memory and environmental variations like light and viewpoint. Also, existing attribute recognition techniques typically extract confidence scores along with attributes. Most of state-of-art approaches for attribute-based person search ignore the confidence scores or simply use a threshold to filter out attributes with low confidence scores. Moreover, they do not consider the uncertainty of query attributes. In this work, we resolve this uncertainty by enabling users to specify a level of confidence with each query attribute and consider uncertainty in both query attributes and attributes extracted from person images. We define a novel matching score to measure the degree of a person matching with query attribute conditions by leveraging the knowledge of probabilistic databases. Furthermore, we propose a novel definition of Critical Point of Confidence and compute it for each query attribute to show the impact of confidence levels on rankings of results. We develop a web-based demonstration system and show its effectiveness using real-world surveillance videos.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130668110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guangqi Jiang, Huibing Wang, Jinjia Peng, Xianping Fu
Vehicle re-identification (ReID) aims to identify a specific vehicle in datasets captured by non-overlapping cameras, and it plays a significant role in the development of intelligent transportation systems. Even though CNN-based models achieve impressive performance on the ReID task, the Gaussian distribution of their effective receptive fields limits their ability to capture long-term dependence between features. Moreover, it is crucial to capture fine-grained features, and the relationships between them, as fully as possible from vehicle images. To address these problems, we propose a partial-aware and cross-correlated transformer model (PCTM), which adopts a parallel network that extracts discriminative features to optimize the feature representation for vehicle ReID. PCTM includes a cross-correlation transformer branch that fuses features extracted by a transformer module and a feature guidance module, which guides the network to capture the long-term dependence of key features. In this way, the feature guidance module encourages the transformer-based features to focus on the vehicle itself and avoids the interference of excessive background in feature extraction. Moreover, PCTM introduces a partial-aware structure in its second branch to explore fine-grained information in vehicle images and capture local differences between vehicles. We conduct experiments on two vehicle datasets to verify the performance of PCTM.
{"title":"Parallelism Network with Partial-aware and Cross-correlated Transformer for Vehicle Re-identification","authors":"Guangqi Jiang, Huibing Wang, Jinjia Peng, Xianping Fu","doi":"10.1145/3512527.3531412","DOIUrl":"https://doi.org/10.1145/3512527.3531412","url":null,"abstract":"Vehicle re-identification (ReID) aims to identify a specific vehicle in the dataset captured by non-overlapping cameras, which plays a great significant role in the development of intelligent transportation systems. Even though CNN-based model achieves impressive performance for the ReID task, its Gaussian distribution of effective receptive fields has limitations in capturing the long-term dependence between features. Moreover, it is crucial to capture fine-grained features and the relationship between features as much as possible from vehicle images. To address those problems, we propose a partial-aware and cross-correlated transformer model (PCTM), which adopts the parallelism network extracting discriminant features to optimize the feature representation for vehicle ReID. PCTM includes a cross-correlation transformer branch that fuses the features extracted based on the transformer module and feature guidance module, which guides the network to capture the long-term dependence of key features. In this way, the feature guidance module promotes the transformer-based features to focus on the vehicle itself and avoid the interference of excessive background for feature extraction. Moreover, PCTM introduced a partial-aware structure in the second branch to explore fine-grained information from vehicle images for capturing local differences from different vehicles. Furthermore, we conducted experiments on 2 vehicle datasets to verify the performance of PCTM.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114598623","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we describe the latest iteration of the Virtual Reality Multimedia Analytics (ViRMA) system, a novel approach to multimedia analysis in virtual reality that is supported by the Multi-dimensional Multimedia Model.
{"title":"ViRMA: Virtual Reality Multimedia Analytics","authors":"Aaron Duane, Bjorn Por Jonsson","doi":"10.1145/3512527.3531352","DOIUrl":"https://doi.org/10.1145/3512527.3531352","url":null,"abstract":"In this paper we describe the latest iteration of the Virtual Reality Multimedia Analytics (ViRMA) system, a novel approach to multimedia analysis in virtual reality which is supported by the Multi-dimensional Multimedia Model.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134573773","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Food image segmentation is important for detailed analysis of food images, especially for the classification of multiple food items and the estimation of calorie amounts. However, training a semantic segmentation model is costly because it requires a large number of images with pixel-level annotations. In addition, the existence of a myriad of food categories causes a problem of insufficient data in each category. Although several food segmentation datasets, such as UEC-FoodPix Complete, have been released so far, the number of annotated food categories remains small. In this study, we propose a method that segments unseen classes with high accuracy by using both zero-shot and few-shot segmentation. We make the following contributions: (1) we propose an UnSeen Food Segmentation method (USFoodSeg) that uses a zero-shot model to infer segmentation masks from the class-label words of unseen classes and their images, and a few-shot model to refine those masks; (2) we generate segmentation masks for 156 unseen categories of UEC-Food256 (17,000 images in total) and 85 categories of the Food-101 dataset (85,000 images in total) with an accuracy of over 90%. Our proposed method is thus able to address the problem of insufficient food segmentation data.
{"title":"Unseen Food Segmentation","authors":"Yuma Honbu, Keiji Yanai","doi":"10.1145/3512527.3531426","DOIUrl":"https://doi.org/10.1145/3512527.3531426","url":null,"abstract":"Food image segmentation is important for detailed analysis on food images, especially for classification of multiple food items and calorie amount estimation. However, there is a costly problem in training a semantic segmentation model because it requires a large number of images with pixel-level annotations. In addition, the existence of a myriad of food categories causes the problem of insufficient data in each category. Although several food segmentation datasets such as the UEC-FoodPix Complete has been released so far, the number of food categories is still limited to a small number. In this study, we propose an unseen class segmentation method with high accuracy by using both zero-shot and few-shot segmentation methods for any unseen classes. we make the following contributions: (1) we propose a UnSeen Food Segmentation method (USFoodSeg) that uses the zero-shot model to infer the segmentation mask from the class label words of unseen classes and those images, and uses the few-shot model to refine the segmentation masks. (2) We generate segmentation masks for 156 categories of the unseen class UEC-Food256, totaling 17,000 images, and 85 categories in the Food-101 dataset, totaling 85,000 images, with an accuracy of over 90%. Our proposed method is able to solve the problem of insufficient food segmentation data.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"226 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132184360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Gurrin, Liting Zhou, G. Healy, Björn þór Jónsson, Duc-Tien Dang-Nguyen, Jakub Lokoč, Minh-Triet Tran, Wolfgang Hürst, Luca Rossetto, Klaus Schöffmann
For the fifth time since 2018, the Lifelog Search Challenge (LSC) facilitated a benchmarking exercise to compare interactive search systems designed for multimodal lifelogs. LSC'22 attracted nine participating research groups who developed interactive lifelog retrieval systems enabling fast and effective access to lifelogs. The systems competed in front of a hybrid audience at the LSC workshop at ACM ICMR'22. This paper introduces the LSC workshop, the new (larger) dataset used in the competition, and the participating lifelog search systems.
{"title":"Introduction to the Fifth Annual Lifelog Search Challenge, LSC'22","authors":"C. Gurrin, Liting Zhou, G. Healy, Björn þór Jónsson, Duc-Tien Dang-Nguyen, Jakub Lokoč, Minh-Triet Tran, Wolfgang Hürst, Luca Rossetto, Klaus Schöffmann","doi":"10.1145/3512527.3531439","DOIUrl":"https://doi.org/10.1145/3512527.3531439","url":null,"abstract":"For the fifth time since 2018, the Lifelog Search Challenge (LSC) facilitated a benchmarking exercise to compare interactive search systems designed for multimodal lifelogs. LSC'22 attracted nine participating research groups who developed interactive lifelog retrieval systems enabling fast and effective access to lifelogs. The systems competed in front of a hybrid audience at the LSC workshop at ACM ICMR'22. This paper presents an introduction to the LSC workshop, the new (larger) dataset used in the competition, and introduces the participating lifelog search systems.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115566625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cooking video captioning aims to generate text instructions that describe the cooking procedure presented in a video. Current approaches tend to use large neural models or more robust feature extractors to increase the expressive ability of features, ignoring the strong correlation between consecutive cooking steps in the video. However, it is intuitive that previous cooking steps provide clues for the next one; in particular, consecutive cooking steps tend to share the same ingredients. Therefore, accurate ingredient recognition can help introduce more fine-grained information into captioning. To improve video procedural captioning for cooking videos, this paper proposes a framework that introduces an ingredient recognition module and uses a copy mechanism to fuse the predicted ingredient information into the generated sentence. Moreover, we integrate the visual information of the previous step into the generation of the current step, so that the visual information of the two steps jointly assists the generation process. Extensive experiments verify the effectiveness of our proposed framework, which achieves promising performance on both the YouCookII and Cooking-COIN datasets.
{"title":"Ingredient-enriched Recipe Generation from Cooking Videos","authors":"Jianlong Wu, Liangming Pan, Jingjing Chen, Yu-Gang Jiang","doi":"10.1145/3512527.3531388","DOIUrl":"https://doi.org/10.1145/3512527.3531388","url":null,"abstract":"Cooking video captioning aims to generate the text instructions that describes the cooking procedures presented in the video. Current approaches tend to use large neural models or use more robust feature extractors to increase the expressive ability of features, ignoring the strong correlation between consecutive cooking steps in the video. However, it is intuitive that previous cooking steps can provide clues for the next cooking step. Specially, consecutive cooking steps tend to share the same ingredients. Therefore, accurate ingredients recognition can help to introduce more fine-grained information in captioning. To improve the performance of video procedural caption in cooking video, this paper proposes a framework that introduces ingredient recognition module which uses the copy mechanism to fuse the predicted ingredient information into the generated sentence. Moreover, we integrate the visual information of the previous step into the generation of the current step, and the visual information of the two steps together assist in the generation process. Extensive experiments verify the effectiveness of our propose framework and it achieves the promising performances on both YouCookII and Cooking-COIN datasets.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"57 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116556911","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}