An improved multi-scale and knowledge distillation method for efficient pedestrian detection in dense scenes

Yanxiang Xu, Mi Wen, Wei He, Hongwei Wang, Yunsheng Xue

Journal of Real-Time Image Processing, published 2024-07-06. DOI: 10.1007/s11554-024-01507-8
Pedestrian detection in densely populated scenes, particularly under occlusion, remains a challenging problem in computer vision. Existing approaches often address missed detections by enhancing model architectures or incorporating attention mechanisms; however, small-scale pedestrians carry fewer features and are easily overfitted to the dataset, so these approaches still struggle to accurately detect pedestrians with small target sizes. To tackle this issue, this research rethinks occluded regions through the lens of small-scale pedestrian detection and proposes YOLO-EPD, a You Only Look Once model for efficient pedestrian detection. First, we find that standard convolution and dilated convolution, each with a single receptive field, do not fit pedestrian targets of different scales well, so we propose the Selective Content Aware Downsampling (SCAD) module, which is integrated into the backbone for enhanced feature extraction. In addition, to address missed detections caused by insufficient feature extraction for small-scale pedestrians, we propose the Crowded Multi-Head Attention (CMHA) module, which makes full use of multi-layer information. Finally, to balance the performance and efficiency of small-object detection, we design Unified Channel-Task Distillation (UCTD) with channel attention and a lightweight head (Lhead) that shares parameters to stay lightweight. Experimental results validate the superiority of YOLO-EPD, which achieves 91.1% Average Precision (AP) on the WiderPerson dataset while reducing parameters and computational overhead by 40%. The experimental findings demonstrate that YOLO-EPD greatly accelerates training convergence and achieves better real-time performance in real-world dense scenes.
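The distillation component described above (UCTD) pairs knowledge distillation with channel attention so the lightweight student head can track the teacher's most informative features. As an illustration only, the following PyTorch-style sketch shows one common way channel-attention weighting can be folded into a feature-distillation loss; the function names (channel_attention_weights, uctd_feature_loss), the softmax-over-channels weighting, and the temperature T are assumptions made for this sketch and are not taken from the paper.

# Hypothetical sketch of channel-attention-weighted feature distillation,
# loosely following the UCTD idea in the abstract. Not the authors' exact method.
import torch
import torch.nn.functional as F

def channel_attention_weights(feat: torch.Tensor, T: float = 2.0) -> torch.Tensor:
    """Per-channel importance from a teacher feature map (B, C, H, W) -> (B, C)."""
    pooled = feat.abs().mean(dim=(2, 3))      # channel-wise mean activation magnitude
    return F.softmax(pooled / T, dim=1)       # soft channel-attention weights

def uctd_feature_loss(student_feat: torch.Tensor,
                      teacher_feat: torch.Tensor,
                      T: float = 2.0) -> torch.Tensor:
    """Channel-weighted MSE between student and teacher feature maps of equal shape."""
    w = channel_attention_weights(teacher_feat, T)                 # (B, C)
    diff = (student_feat - teacher_feat).pow(2).mean(dim=(2, 3))   # per-channel MSE, (B, C)
    return (w * diff).sum(dim=1).mean()       # weight channels, sum, average over batch

# Usage sketch (assumes matched feature maps from a frozen teacher and a student):
# s = student_backbone(x); t = teacher_backbone(x).detach()
# loss = detection_loss + lambda_kd * uctd_feature_loss(s, t)

Under these assumptions, channels that the teacher activates strongly dominate the alignment term, which is one plausible way a distillation scheme can emphasize the features most relevant to small pedestrians while the student stays lightweight.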
About the journal:
Due to rapid advancements in integrated circuit technology, the rich theoretical results that have been developed by the image and video processing research community are now being increasingly applied in practical systems to solve real-world image and video processing problems. Such systems involve constraints placed not only on their size, cost, and power consumption, but also on the timeliness of the image data processed.
Examples of such systems are mobile phones, digital still/video/cell-phone cameras, portable media players, personal digital assistants, high-definition television, video surveillance systems, industrial visual inspection systems, medical imaging devices, vision-guided autonomous robots, spectral imaging systems, and many other real-time embedded systems. In these real-time systems, strict timing requirements demand that results are available within a certain interval of time as imposed by the application.
It is often the case that an image processing algorithm is developed and proven theoretically sound, presumably with a specific application in mind, but its practical applications and the detailed steps, methodology, and trade-off analysis required to achieve its real-time performance are not fully explored, leaving these critical and usually non-trivial issues for those wishing to employ the algorithm in a real-time system.
The Journal of Real-Time Image Processing is intended to bridge the gap between the theory and practice of image processing, serving the greater community of researchers, practicing engineers, and industrial professionals who deal with designing, implementing or utilizing image processing systems which must satisfy real-time design constraints.