TransFGVC: transformer-based fine-grained visual classification

The Visual Computer, Pub Date: 2024-06-28, DOI: 10.1007/s00371-024-03545-6

Longfeng Shen, Bin Hou, Yulei Jian, Xisong Tu, Yingjie Zhang, Lingying Shuai, Fangzhen Ge, Debao Chen

Citations: 0

Abstract

Fine-grained visual classification (FGVC) aims to identify subcategories of objects within the same superclass. The task is challenging because intra-class variance is high and inter-class variance is low. Most recent methods first locate discriminative regions and then train a classification network to capture the subtle differences among them. These methods have two weaknesses. First, the detection network often captures an entire part of the object, and positioning errors occur. Second, they ignore the correlations between the extracted regions. We propose a novel, highly scalable approach called TransFGVC that combines a Swin Transformer with a long short-term memory (LSTM) network to address these problems. The Swin Transformer extracts salient visual tokens through stacked self-attention layers, and the LSTM models them globally, which both locates the discriminative region accurately and introduces global information that is important for FGVC. The proposed method achieves competitive accuracy of 92.7%, 91.4% and 91.5% on the public CUB-200-2011 and NABirds datasets and our Birds-267-2022 dataset, respectively, and its parameter count (Params) and FLOPs are 25% and 27% lower, respectively, than those of the current state-of-the-art (SotA) method HERBS. To promote the development of FGVC, we developed the Birds-267-2022 dataset, which has 267 categories and 12,233 images.
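The abstract describes the pipeline only at a high level: visual tokens from a backbone are modeled as a sequence by an LSTM, and the result is classified. The following is a minimal pure-Python sketch of that general idea, not the authors' implementation. Everything specific in it is an assumption made for illustration: the random vectors standing in for Swin Transformer tokens, the 49-token sequence length, the tiny feature sizes, the use of the last hidden state as the global summary, and all names (`TinyLSTM`, `classify`, `W_cls`). Only the class count of 200 corresponds to a dataset named above (CUB-200-2011).

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    # Multiply a matrix (list of rows) by a vector.
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyLSTM:
    """Single-layer LSTM with untrained random weights.

    Gate order inside the stacked weight matrix: input, forget,
    cell candidate, output.
    """

    def __init__(self, input_dim, hidden_dim, rng):
        self.hidden_dim = hidden_dim
        self.W = rand_matrix(4 * hidden_dim, input_dim + hidden_dim, rng)
        self.b = [0.0] * (4 * hidden_dim)

    def forward(self, tokens):
        H = self.hidden_dim
        h = [0.0] * H  # hidden state
        c = [0.0] * H  # cell state
        for x in tokens:
            # One affine map over the concatenated [token, previous hidden state].
            z = [a + b for a, b in zip(matvec(self.W, x + h), self.b)]
            i = [sigmoid(v) for v in z[0:H]]
            f = [sigmoid(v) for v in z[H:2 * H]]
            g = [math.tanh(v) for v in z[2 * H:3 * H]]
            o = [sigmoid(v) for v in z[3 * H:4 * H]]
            c = [fj * cj + ij * gj for fj, cj, ij, gj in zip(f, c, i, g)]
            h = [oj * math.tanh(cj) for oj, cj in zip(o, c)]
        return h  # assumption: last hidden state serves as the global summary


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, lstm, W_cls):
    # Sequence of tokens -> LSTM summary -> linear layer -> class probabilities.
    return softmax(matvec(W_cls, lstm.forward(tokens)))


rng = random.Random(0)
NUM_TOKENS, TOKEN_DIM, HIDDEN_DIM, NUM_CLASSES = 49, 8, 16, 200  # illustrative sizes

# Stand-in for backbone output: random vectors, NOT real Swin Transformer features.
tokens = [[rng.gauss(0.0, 1.0) for _ in range(TOKEN_DIM)] for _ in range(NUM_TOKENS)]
lstm = TinyLSTM(TOKEN_DIM, HIDDEN_DIM, rng)
W_cls = rand_matrix(NUM_CLASSES, HIDDEN_DIM, rng)

probs = classify(tokens, lstm, W_cls)
print(len(probs), max(probs))
```

Because the weights are random and untrained, the probabilities carry no meaning; the sketch only shows how a token sequence can flow through a recurrent layer into a classifier. A practical version would use a deep-learning framework and real backbone features.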

Source journal

The Visual Computer

Self-citation rate

0.00%

Articles published