Transformer-CNN hybrid network for crowd counting

Journal of Intelligent & Fuzzy Systems, Pub Date: 2024-03-11, DOI: 10.3233/jifs-236370

Jiamao Yu, Ying Yu, Jin Qian, Xing Han, Feng Zhu, Zhiliang Zhu


Citations: 0

Abstract


Efficient feature representation is key to improving crowd counting performance. CNNs and Transformers are the two commonly used feature extraction frameworks in crowd counting. A CNN excels at hierarchically extracting local features to build a multi-scale representation of the image, but it struggles to capture global features. A Transformer, on the other hand, captures global feature representations by using cascaded self-attention to model long-range dependencies, but it often overlooks local detail. Relying solely on either a CNN or a Transformer for crowd counting therefore has limitations. In this paper, we propose TCHNet, a crowd counting model that combines the CNN and Transformer frameworks. The model employs the CMT (CNNs Meet Vision Transformers) backbone network as its Feature Extraction Module (FEM), which hierarchically extracts local and global crowd features through a combination of convolution and self-attention. To obtain more comprehensive local spatial information, an improved Progressive Multi-scale Learning Process (PMLP) is introduced into the FEM, guiding the network to learn at three different granularity levels. The features from these three granularity levels are then fed into the Multi-scale Feature Aggregation Module (MFAM) for fusion. Finally, a Multi-Scale Regression Module (MSRM) processes the fused multi-scale features, yielding crowd features rich in both high-level semantics and low-level detail. Experimental results on five benchmark datasets show that TCHNet achieves highly competitive performance compared with several popular crowd counting methods.
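The contrast the abstract draws between local convolution and global self-attention can be illustrated with a toy example. The sketch below is not TCHNet or the CMT backbone: it is a minimal pure-Python illustration on a 1D signal, and the helper names `conv1d_same` and `self_attention`, the averaging kernel, the sample values, and the identity query/key/value projections are all assumptions made for the example. It perturbs only the last element of the signal and compares how much the output at position 0 changes under each operation.

```python
import math


def conv1d_same(x, kernel):
    """Zero-padded 1D convolution with an odd-length kernel (hypothetical helper)."""
    half = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = i + k - half
            if 0 <= j < len(x):  # positions outside the signal count as zero
                acc += w * x[j]
        out.append(acc)
    return out


def self_attention(x):
    """Single-head self-attention on scalar tokens.

    Assumption for illustration: queries, keys and values are the raw
    token values (identity projections), with no scaling or learned weights.
    """
    out = []
    for q in x:
        scores = [q * k for k in x]                 # similarity to every position
        m = max(scores)                             # subtract max for a stable softmax
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, x)) / total)
    return out


signal = [0.5, 1.0, 0.2, 0.8, 0.3, 0.9, 0.1, 0.4]
perturbed = signal[:-1] + [signal[-1] + 1.0]        # change only the last element

kernel = [1 / 3, 1 / 3, 1 / 3]                      # 3-tap average: local neighbourhood only

# How much does the output at position 0 react to a change at position 7?
conv_delta = abs(conv1d_same(perturbed, kernel)[0] - conv1d_same(signal, kernel)[0])
attn_delta = abs(self_attention(perturbed)[0] - self_attention(signal)[0])

print("convolution change at position 0:", conv_delta)
print("self-attention change at position 0:", attn_delta)
```

In this toy setting the convolution output at position 0 does not react to the distant change because its kernel only covers neighbouring positions, while the attention output does react because every position attends to every other one. This is the complementarity that motivates combining the two in a hybrid backbone.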

