Deep attentive multimodal learning for food information enhancement via early-stage heterogeneous fusion

Avantika Saklani, Shailendra Tiwari, H. S. Pannu
Journal: The Visual Computer
Published: 2024-07-03
DOI: 10.1007/s00371-024-03546-5 (https://doi.org/10.1007/s00371-024-03546-5)

Abstract

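Compared with single-modal content, multimodal data can offer a more vivid and effective view of food statistics. Traditional food classification systems, however, focus on a single modality. That limits them as large volumes of data emerge every day, a trend that has recently drawn researchers to this field. Moreover, very few multimodal Indian food datasets are available. Motivated by these observations, we build a novel multimodal food analysis model based on a deep attentive multimodal fusion network (DAMFN) that integrates linguistic and visual information. The model has three stages: functional feature extraction, early-stage fusion, and feature classification. In functional feature extraction, deep features are extracted from each modality. Early-stage fusion then exploits the deep correlation between the modalities. Finally, in the feature classification stage, the fused features are passed to the classification system for the final decision. For the experiments, we also developed a dataset of Indian food images with their associated captions. The proposed approach is additionally evaluated on the large-scale UPMC Food 101 dataset, which has 90,704 instances. The experimental results show that the proposed DAMFN outperforms several state-of-the-art multimodal food classification methods as well as single-modality systems.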
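The three stages named in the abstract (feature extraction, early-stage fusion, classification) can be illustrated with a minimal sketch. The abstract does not specify the attention mechanism, the fusion operator, or the classifier, so everything below is an assumption made for illustration: the function names `early_fusion` and `classify`, the gate vectors, and the choice of attention-weighted concatenation are hypothetical and are not the authors' DAMFN implementation. Feature extraction is replaced by hand-written vectors standing in for deep image and caption features.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a sequence of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def early_fusion(image_feat: Vector, text_feat: Vector,
                 image_gate: Vector, text_gate: Vector) -> Vector:
    """Hypothetical early fusion: score each modality with a gate vector,
    turn the two scores into attention weights, then concatenate the
    weighted feature vectors into one joint representation."""
    w_img, w_txt = softmax([dot(image_feat, image_gate),
                            dot(text_feat, text_gate)])
    return [w_img * x for x in image_feat] + [w_txt * x for x in text_feat]


def classify(fused: Vector, class_weights: List[Vector]) -> Vector:
    """Assumed linear classifier: one weight vector per class,
    followed by a softmax to obtain class probabilities."""
    return softmax([dot(fused, w) for w in class_weights])


if __name__ == "__main__":
    # Stand-ins for stage 1 (deep features from an image and its caption).
    image_feat = [0.2, 0.9, -0.4]
    text_feat = [1.0, 0.1]
    fused = early_fusion(image_feat, text_feat,
                         image_gate=[0.5, 0.5, 0.5], text_gate=[0.3, 0.3])
    # Two illustrative classes over the 5-dimensional fused vector.
    probs = classify(fused, [[1.0, 0.0, 0.0, 1.0, 0.0],
                             [0.0, 1.0, 0.0, 0.0, 1.0]])
    print(len(fused), probs)
```

The point of fusing before classification is that the classifier sees a single joint vector, so it can weigh image and caption evidence together rather than combining two separate predictions afterwards.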

