Yijing Guo, Fuhang Li, Yi Qiu, Pengyu Xu, Kunhua Li
Pattern Recognition Letters, vol. 193, pp. 8-13. Published 2025-07-01 (Epub 2025-04-12). DOI: 10.1016/j.patrec.2025.03.030
https://www.sciencedirect.com/science/article/pii/S0167865525001199
Pedestrian detection based on vision-language semantics with global adaptive adjustment
Pedestrian detection is a core task in automated driving and intelligent video surveillance systems. Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision (VLPD) greatly improves the detection accuracy of single-stage pedestrian detectors. To maintain inference speed, however, VLPD adopts ResNet-50 as its backbone network, which is a significant limitation for single-stage detectors that must perform category prediction and bounding box regression directly on feature maps. To better exploit the representation capability of CNNs, we propose a novel, simplified architectural unit, the Channel and Spatial Global Pooling Attention Module (GPA), which computes channel and spatial attention maps in parallel and integrates them to adaptively refine the feature maps output by the backbone. Furthermore, we optimize the module structure of the VLPD self-supervised prototype semantic contrast method, significantly enhancing the detector's ability to discriminate and detect pedestrians in complex urban street environments. With only a 0.2 FPS decrease in inference speed, the miss rates on the Heavy Occlusion and Reasonable subsets of the CityPersons dataset are reduced by 2.41% and 0.72%, respectively, achieving state-of-the-art (SOTA) performance for single-stage detectors on this dataset. On the Heavy Occlusion and All subsets of the Caltech dataset, the miss rates are reduced by 2.90% and 0.80%, respectively. Without using additional data, this method can rival the detection accuracy of two-stage detectors.
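The abstract does not specify the layers inside GPA, so the following is only a minimal sketch of the general idea it describes: a channel branch and a spatial branch that both apply global pooling to the same backbone feature map in parallel, with their attention weights then used to refine that map. The function name `parallel_pooling_attention`, the use of average pooling with a plain sigmoid (no learned parameters), and the fusion by multiplication are all assumptions made for illustration, not the authors' design.

```python
import numpy as np


def sigmoid(z):
    """Element-wise logistic function, mapping values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def parallel_pooling_attention(x):
    """Refine a feature map x of shape (C, H, W); hypothetical stand-in for GPA.

    Assumption: both branches are parameter-free here. A real module would
    normally place learned layers between the pooling and the sigmoid.
    """
    # Channel branch: global average pooling over the spatial dimensions
    # yields one descriptor per channel, shape (C, 1, 1).
    channel_weights = sigmoid(x.mean(axis=(1, 2), keepdims=True))

    # Spatial branch: average pooling over the channel dimension
    # yields one descriptor per location, shape (1, H, W).
    spatial_weights = sigmoid(x.mean(axis=0, keepdims=True))

    # Both branches read the same input (parallel rather than sequential);
    # broadcasting applies the two weight maps to every element of x.
    return x * channel_weights * spatial_weights


rng = np.random.default_rng(0)
features = rng.standard_normal((4, 8, 8))  # toy backbone output: C=4, H=W=8
refined = parallel_pooling_attention(features)
```

Because both weight maps lie in (0, 1), the refined map keeps the input shape and never increases the magnitude of any activation; it only rescales each one according to its channel and its location.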
About the journal:
Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition.
Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.