{"title":"Sentence-based and Noise-robust Cross-modal Retrieval on Cooking Recipes and Food Images","authors":"Zichen Zan, Lin Li, Jianquan Liu, D. Zhou","doi":"10.1145/3372278.3390681","DOIUrl":null,"url":null,"abstract":"In recent years, people are facing with billions of food images, videos and recipes on social medias. An appropriate technology is highly desired to retrieve accurate contents across food images and cooking recipes, like cross-modal retrieval framework. Based on our observations, the order of sequential sentences in recipes and the noises in food images will affect retrieval results. We take into account the sentence-level sequential orders of instructions and ingredients in recipes, and noise portion in food images to propose a new framework for cross-retrieval. In our framework, we propose three new strategies to improve the retrieval accuracy. (1) We encode recipe titles, ingredients, instructions in sentence level, and adopt three attention networks on multi-layer hidden state features separately to capture more semantic information. (2) We apply attention mechanism to select effective features from food images incorporating with recipe embeddings, and adopt an adversarial learning strategy to enhance modality alignment. (3) We design a new triplet loss scheme with an effective sampling strategy to reduce the noise impact on retrieval results. The experimental results show that our framework clearly outperforms the state-of-art methods in terms of median rank and recall rate at top k on the Recipe 1M dataset.","PeriodicalId":158014,"journal":{"name":"Proceedings of the 2020 International Conference on Multimedia Retrieval","volume":"102 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-06-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"16","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2020 International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3372278.3390681","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 16
Abstract
In recent years, people have been faced with billions of food images, videos, and recipes on social media. A suitable technology, such as a cross-modal retrieval framework, is highly desired to retrieve accurate content across food images and cooking recipes. Based on our observations, the order of sequential sentences in recipes and the noise in food images both affect retrieval results. We take into account the sentence-level sequential order of instructions and ingredients in recipes, as well as the noisy portions of food images, and propose a new cross-modal retrieval framework. Within this framework, we introduce three new strategies to improve retrieval accuracy. (1) We encode recipe titles, ingredients, and instructions at the sentence level, and apply three separate attention networks over multi-layer hidden-state features to capture richer semantic information. (2) We apply an attention mechanism, conditioned on recipe embeddings, to select effective features from food images, and adopt an adversarial learning strategy to enhance modality alignment. (3) We design a new triplet loss scheme with an effective sampling strategy to reduce the impact of noise on retrieval results. Experimental results show that our framework clearly outperforms state-of-the-art methods in terms of median rank and recall at top-k on the Recipe1M dataset.
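The abstract does not give the exact formulation of the triplet loss or the sampling strategy, so the following is only a minimal PyTorch sketch of the general technique it names: a bidirectional triplet loss for cross-modal retrieval with in-batch hard-negative sampling. The function name `crossmodal_triplet_loss`, the margin value, and the batch-wise hardest-negative selection are all illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def crossmodal_triplet_loss(img_emb: torch.Tensor,
                            rec_emb: torch.Tensor,
                            margin: float = 0.3) -> torch.Tensor:
    """Bidirectional triplet loss with in-batch hard negatives.

    img_emb, rec_emb: (B, D) L2-normalized embeddings of paired
    food images and recipes; row i of each is a positive pair.
    NOTE: a generic sketch, not the loss scheme from the paper.
    """
    # Cosine similarity matrix; the diagonal holds positive-pair scores.
    sim = img_emb @ rec_emb.t()                              # (B, B)
    pos = sim.diag().unsqueeze(1)                            # (B, 1)

    # Mask the diagonal so positives are never chosen as negatives.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg_sim = sim.masked_fill(mask, float('-inf'))

    # Hardest negative recipe per image, and hardest image per recipe.
    hard_neg_rec = neg_sim.max(dim=1).values.unsqueeze(1)    # image -> recipe
    hard_neg_img = neg_sim.max(dim=0).values.unsqueeze(1)    # recipe -> image

    # Hinge on both retrieval directions.
    loss_i2r = F.relu(margin - pos + hard_neg_rec)
    loss_r2i = F.relu(margin - pos + hard_neg_img)
    return (loss_i2r + loss_r2i).mean()

# Usage with random stand-in embeddings:
img = F.normalize(torch.randn(8, 256), dim=1)
rec = F.normalize(torch.randn(8, 256), dim=1)
print(crossmodal_triplet_loss(img, rec))
```

Mining the hardest in-batch negative is one common way to make a triplet loss "noise-aware" to sampling; the paper's actual strategy for reducing the impact of noisy images may differ.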