{"title":"MSAPVT: a multi-scale attention pyramid vision transformer network for large-scale fruit recognition","authors":"Yao Rao, Chaofeng Li, Feiran Xu, Ya Guo","doi":"10.1007/s11694-024-02874-3","DOIUrl":null,"url":null,"abstract":"<div><p>Efficient and accurate fruit recognition is critical for applications such as automated fruit-picking systems, quality evaluation, and self-checkout services in supermarkets. Existing vision-based methods, primarily leveraging Convolutional Neural Networks (CNNs), often achieve high performance but are hindered by high computational complexity, making real-time deployment on edge devices challenging. Moreover, the diversity and similarity among fruit varieties, along with imbalanced fruit datasets, pose significant obstacles to general-purpose deep learning algorithms. To address these challenges, we propose the Multi-Scale Attention Pyramid Vision Transformer (MSAPVT) alongside an enhanced version of the Fru92 dataset. Our MSAPVT introduces four innovative improvements: attention enhancement, dimension adjustment, multi-scale feature aggregation and loss function improvement. Firstly, the Hybrid Attention Module (HAM) is designed for better refining the multi-level features of the Pyramid Vision Transformer v2 (PVTv2). Secondly, the Dimension Adjustment Layer (DAL) is designed for increasing the weight of the high-level features. Thirdly, the multi-scale feature aggregation strategy is introduced to fuse multi-scale complementary features. Finally, the KL-divergence loss is added for enhancing the difference between multi-scale features. These innovations enable MSAPVT to capture fine-grained details in fruit images, generating highly discriminative representations with slight low model complexity. Our model achieves the best results on the Fru92 and Fru92s datasets, with Top-1 Acc. of 91.40% and 94.29%, and Top-5 Acc. of 98.95% and 99.55%, respectively. In the end, an approachable and efficient fruit classification system based on MSAPVT is devised for potential applications. The improved dataset is available at https://github.com/iamraoyao/MSAPVT-Inference-Demo.</p></div>","PeriodicalId":631,"journal":{"name":"Journal of Food Measurement and Characterization","volume":"18 11","pages":"9233 - 9251"},"PeriodicalIF":2.9000,"publicationDate":"2024-09-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Food Measurement and Characterization","FirstCategoryId":"97","ListUrlMain":"https://link.springer.com/article/10.1007/s11694-024-02874-3","RegionNum":3,"RegionCategory":"农林科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"FOOD SCIENCE & TECHNOLOGY","Score":null,"Total":0}
引用次数: 0
Abstract
Efficient and accurate fruit recognition is critical for applications such as automated fruit-picking systems, quality evaluation, and self-checkout services in supermarkets. Existing vision-based methods, primarily leveraging Convolutional Neural Networks (CNNs), often achieve high performance but are hindered by high computational complexity, making real-time deployment on edge devices challenging. Moreover, the diversity and similarity among fruit varieties, along with imbalanced fruit datasets, pose significant obstacles to general-purpose deep learning algorithms. To address these challenges, we propose the Multi-Scale Attention Pyramid Vision Transformer (MSAPVT) alongside an enhanced version of the Fru92 dataset. Our MSAPVT introduces four innovative improvements: attention enhancement, dimension adjustment, multi-scale feature aggregation and loss function improvement. Firstly, the Hybrid Attention Module (HAM) is designed for better refining the multi-level features of the Pyramid Vision Transformer v2 (PVTv2). Secondly, the Dimension Adjustment Layer (DAL) is designed for increasing the weight of the high-level features. Thirdly, the multi-scale feature aggregation strategy is introduced to fuse multi-scale complementary features. Finally, the KL-divergence loss is added for enhancing the difference between multi-scale features. These innovations enable MSAPVT to capture fine-grained details in fruit images, generating highly discriminative representations with slight low model complexity. Our model achieves the best results on the Fru92 and Fru92s datasets, with Top-1 Acc. of 91.40% and 94.29%, and Top-5 Acc. of 98.95% and 99.55%, respectively. In the end, an approachable and efficient fruit classification system based on MSAPVT is devised for potential applications. The improved dataset is available at https://github.com/iamraoyao/MSAPVT-Inference-Demo.
期刊介绍:
This interdisciplinary journal publishes new measurement results, characteristic properties, differentiating patterns, measurement methods and procedures for such purposes as food process innovation, product development, quality control, and safety assurance.
The journal encompasses all topics related to food property measurement and characterization, including all types of measured properties of food and food materials, features and patterns, measurement principles and techniques, development and evaluation of technologies, novel uses and applications, and industrial implementation of systems and procedures.