Scale Adaptive Enhance Network for Crowd Counting
Zirui Fan, Jun Ruan
2022 11th International Conference on Educational and Information Technology (ICEIT)
Published: 2022-01-06
DOI: 10.1109/ICEIT54416.2022.9690718
Crowd counting is a fundamental computer vision task and plays a critical role in video structure analysis and potential downstream applications, e.g., accident forecasting and urban traffic analysis. The main challenges of crowd counting lie in the scale variation caused by disorderly distributed "person-camera" distances, as well as interference from complex backgrounds. To address these issues, we propose a scale adaptive enhance network (SAENet) based on the encoder-decoder U-Net architecture. We employ Res2Net as the encoder backbone to extract multi-scale head information and relieve the scale variation problem. The decoder consists of two branches, i.e., an Attention Estimation Network (AENet) that provides attention maps and a Density Estimation Network (DENet) that generates density maps. To fully leverage the complementarity between AENet and DENet, we propose two modules that enhance feature transfer: i) a lightweight, plug-and-play interactive attention module (IA-block) deployed at multiple levels of the decoder to refine the feature maps; ii) a global scale adaptive fusion strategy (GSAFS) that adaptively models diverse scale cues to obtain a weighted density map. Extensive experiments show that the proposed method outperforms existing competitive methods and establishes state-of-the-art results on ShanghaiTech Part A and B, and UCF-QNRF. Our model achieves MAE of 53.56 and 5.95 on ShanghaiTech Part A and B, corresponding to performance improvements of 6.0% and 13.13%, respectively.
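The core fusion idea described above — an attention map from one decoder branch modulating the density map from the other, with the crowd count obtained by summing the result — can be sketched as follows. This is a minimal illustrative example in NumPy, not the paper's implementation; the function name, the sigmoid gating, and the toy maps are assumptions made for clarity.

```python
import numpy as np

def fuse_attention_density(density_map, attention_logits):
    """Illustrative fusion: sigmoid attention weights modulate the
    density map; the crowd count is the sum over the weighted map."""
    attention = 1.0 / (1.0 + np.exp(-attention_logits))  # sigmoid -> (0, 1)
    weighted = density_map * attention                   # suppress background density
    return weighted, float(weighted.sum())

# Toy example: a 4x4 density map whose top-left 2x2 region contains heads.
density = np.zeros((4, 4))
density[:2, :2] = 0.5            # four cells of 0.5 -> true count 2.0
logits = np.full((4, 4), -10.0)  # attention rejects the background...
logits[:2, :2] = 10.0            # ...and is confident only on the head region

weighted, count = fuse_attention_density(density, logits)
print(round(count, 2))  # ~2.0: background is suppressed, head region kept
```

The background cells contribute nothing here because their density is zero and their attention weight is near zero; in a real scene the attention map's role is to zero out spurious density predicted on background texture.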