Self-supervised visual representation learning on food images

IS&T International Symposium on Electronic Imaging. Pub Date: 2023-01-16. DOI: 10.2352/ei.2023.35.7.image-269

Andrew W. Peng, Jiangpeng He, Fengqing Zhu


Citation count: 0

Abstract

Food image classification is the groundwork for image-based dietary assessment, the process of monitoring what kinds of food, and how much energy, are consumed using captured food or eating-scene images. Existing deep-learning-based methods learn the visual representation for food classification from human annotation of each food image. However, most food images captured in real life come without labels, so training such methods requires manual annotation, which is too costly for real-world deployment. To make use of the vast amount of unlabeled images, many existing works focus on unsupervised or self-supervised learning, which learns the visual representation directly from unlabeled data. However, none of these works focuses on food images, which are more challenging than images of general objects because of their high inter-class similarity and intra-class variance. In this paper, we focus on two items: the comparison of existing models and the development of an effective self-supervised learning model for food image classification. Specifically, we first compare the performance of existing state-of-the-art self-supervised learning models, including SimSiam, SimCLR, SwAV, BYOL, MoCo, and the Rotation Pretext Task, on food images. The experiments are conducted on the Food-101 dataset, which contains 101 classes of food with 1,000 images per class. Next, we analyze the unique features of each model and compare their performance on food images to identify the key factors in each model that help improve accuracy. Finally, we propose a new model for unsupervised visual representation learning on food images for the classification task.
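As a rough illustration of how a self-supervised pretext task obtains training labels without human annotation, the sketch below builds rotation-prediction examples from a single unlabeled image, in the spirit of the Rotation Pretext Task named in the abstract. It is not the authors' implementation: the helper names `rotate90` and `make_rotation_examples` and the 2x2 toy grid are assumptions made for illustration, and a real pipeline would operate on image tensors and feed the pairs to a classifier.

```python
def rotate90(image):
    """Rotate a 2-D grid (a list of rows) by 90 degrees counter-clockwise."""
    return [list(row) for row in zip(*image)][::-1]


def make_rotation_examples(image):
    """Return four (rotated_image, label) pairs for one unlabeled image.

    The label k in {0, 1, 2, 3} means "rotated by k * 90 degrees".
    The labels come from the transformation itself, not from a human.
    """
    examples = []
    current = [list(row) for row in image]  # copy so the input is not modified
    for k in range(4):
        examples.append((current, k))
        current = rotate90(current)
    return examples


# Hypothetical toy "image": a 2x2 grid of pixel values.
toy_image = [[1, 2],
             [3, 4]]
examples = make_rotation_examples(toy_image)
for rotated, label in examples:
    print(label, rotated)
```

A network trained to predict the label from the rotated image has to pick up on the orientation cues in the content, which is one way to learn a visual representation from unlabeled data.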



