{"title":"Deep attentive multimodal learning for food information enhancement via early-stage heterogeneous fusion","authors":"Avantika Saklani, Shailendra Tiwari, H. S. Pannu","doi":"10.1007/s00371-024-03546-5","DOIUrl":null,"url":null,"abstract":"<p>In contrast to single-modal content, multimodal data can offer greater insight into food statistics more vividly and effectively. But traditional food classification system focuses on individual modality. It is thus futile as the massive amount of data are emerging on a daily basis which has latterly attracted researchers in this field. Moreover, there are very few available multimodal Indian food datasets. On studying these findings, we build a novel multimodal food analysis model based on deep attentive multimodal fusion network (DAMFN) for lingual and visual integration. The model includes three stages: functional feature extraction, early-stage fusion and feature classification. In functional feature extraction, deep features from the individual modalities are abstracted. Then an early-stage fusion is applied that leverages the deep correlation between the modalities. Lastly, the fused features are provided to the classification system for the final decision in the feature classification phase. We further developed a dataset having Indian food images with their related caption for the experimental purpose. In addition to this, the proposed approach is also evaluated on a large-scale dataset called UPMC Food 101, having 90,704 instances. The experimental results demonstrate that the proposed DAMFN outperforms several state-of-the-art techniques of multimodal food classification methods as well as the individual modality systems.</p>","PeriodicalId":501186,"journal":{"name":"The Visual Computer","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"The Visual Computer","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1007/s00371-024-03546-5","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
In contrast to single-modal content, multimodal data can offer richer and more vivid insight into food information. Traditional food classification systems, however, focus on a single modality, which is increasingly inadequate as massive amounts of multimodal data emerge daily; this gap has recently attracted researchers to the field. Moreover, very few multimodal Indian food datasets are available. Motivated by these observations, we build a novel multimodal food analysis model based on a deep attentive multimodal fusion network (DAMFN) for integrating linguistic and visual information. The model comprises three stages: functional feature extraction, early-stage fusion, and feature classification. In functional feature extraction, deep features are extracted from each individual modality. An early-stage fusion is then applied that leverages the deep correlation between the modalities. Lastly, the fused features are passed to the classifier for the final decision in the feature classification phase. For experimental purposes, we further developed a dataset of Indian food images paired with their related captions. In addition, the proposed approach is evaluated on a large-scale dataset, UPMC Food-101, containing 90,704 instances. The experimental results demonstrate that the proposed DAMFN outperforms several state-of-the-art multimodal food classification methods as well as single-modality systems.
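To make the three-stage pipeline concrete, the sketch below (PyTorch) shows one way an early-stage attentive fusion of image and caption features could be wired up. All layer sizes, the ResNet-18 image backbone, the EmbeddingBag text encoder, and the multi-head attention fusion are illustrative assumptions, not the authors' exact DAMFN configuration.

```python
# Minimal sketch of early-stage multimodal fusion for food classification.
# Assumed components: ResNet-18 image branch, bag-of-embeddings text branch,
# multi-head attention over the two modality vectors, linear classifier.
import torch
import torch.nn as nn
import torchvision.models as models


class EarlyFusionClassifier(nn.Module):
    def __init__(self, vocab_size=10000, text_dim=256, fused_dim=512, num_classes=101):
        super().__init__()
        # Stage 1: functional feature extraction, one branch per modality.
        resnet = models.resnet18(weights=None)
        self.image_encoder = nn.Sequential(*list(resnet.children())[:-1])  # (B, 512, 1, 1)
        self.text_encoder = nn.EmbeddingBag(vocab_size, text_dim)          # caption branch
        self.image_proj = nn.Linear(512, fused_dim)
        self.text_proj = nn.Linear(text_dim, fused_dim)

        # Stage 2: early-stage fusion via attention across the two modality tokens.
        self.attention = nn.MultiheadAttention(embed_dim=fused_dim, num_heads=4, batch_first=True)

        # Stage 3: feature classification on the fused representation.
        self.classifier = nn.Linear(fused_dim, num_classes)

    def forward(self, images, token_ids):
        img = self.image_proj(self.image_encoder(images).flatten(1))  # (B, fused_dim)
        txt = self.text_proj(self.text_encoder(token_ids))            # (B, fused_dim)
        tokens = torch.stack([img, txt], dim=1)                       # (B, 2, fused_dim)
        fused, _ = self.attention(tokens, tokens, tokens)             # cross-modal attention
        return self.classifier(fused.mean(dim=1))                     # (B, num_classes)


# Usage on dummy data:
model = EarlyFusionClassifier()
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, 10000, (4, 20))
logits = model(images, captions)  # shape: (4, 101)
```

The key design point illustrated here is that fusion happens before classification: both modality vectors are projected into a shared space and attended over jointly, so the classifier sees a representation that already encodes cross-modal correlations rather than a late combination of separate predictions.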