NAS-PED: Neural Architecture Search for Pedestrian Detection

Yi Tang, Min Liu, Baopu Li, Yaonan Wang, Wanli Ouyang
{"title":"NAS-PED: Neural Architecture Search for Pedestrian Detection","authors":"Yi Tang;Min Liu;Baopu Li;Yaonan Wang;Wanli Ouyang","doi":"10.1109/TPAMI.2024.3507918","DOIUrl":null,"url":null,"abstract":"Pedestrian detection currently suffers from two issues in crowded scenes: occlusion and dense boundary prediction, making it still challenging in complex real-world scenarios. In recent years, Convolutional Neural Networks (CNN) and Vision Transformers (ViT) have shown their superiorities in addressing these issues, where ViTs capture global feature dependency to infer occlusion parts and CNNs make accurate dense predictions by local detailed features. Nevertheless, limited by the narrow receptive field, CNNs fail to infer occlusion parts, while ViTs tend to ignore local features that are vital to distinguish different pedestrians in the crowd. Therefore, it is essential to combine the advantages of CNN and ViT for pedestrian detection. However, manually designing a specific CNN and ViT hybrid network requires enormous time and resources for trial and error. To address this issue, we propose the first Neural Architecture Search (NAS) framework specifically designed for pedestrian detection named NAS-PED, which automatically designs an appropriate CNNs and ViTs hybrid backbone for the crowded pedestrian detection task. Specifically, we formulate transformers and convolutions with various kernel sizes in the same format, which provides an unconstrained space for diverse hybrid network search. Furthermore, to search for a suitable backbone, we propose an information bottleneck based NAS objective function, which treats the process of NAS as an information extraction process, preserving relevant information and suppressing redundant information from the dense pedestrians in crowd scenes Extensive experiments on CrowdHuman, CityPersons and EuroCity Persons datasets demonstrate the effectiveness of the proposed method. Our NAS-PED obtains absolute gains of 4.0% MR<inline-formula><tex-math>$^{-2}$</tex-math></inline-formula> and 1.9% AP over the state-of-the-art (SOTA) pedestrian detection framework on CrowdHuman datasets. For the CityPersons and EuroCity Persons datasets, the searched backbone achieves stable improvement across all three subsets, outperforming some large language-image pre-trained models.","PeriodicalId":94034,"journal":{"name":"IEEE transactions on pattern analysis and machine intelligence","volume":"47 3","pages":"1800-1817"},"PeriodicalIF":18.6000,"publicationDate":"2024-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on pattern analysis and machine intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10770837/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Pedestrian detection currently suffers from two issues in crowded scenes, occlusion and dense boundary prediction, which keep it challenging in complex real-world scenarios. In recent years, Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs) have shown their strengths in addressing these issues: ViTs capture global feature dependencies to infer occluded parts, while CNNs make accurate dense predictions from local detailed features. Nevertheless, limited by their narrow receptive fields, CNNs fail to infer occluded parts, whereas ViTs tend to ignore the local features that are vital for distinguishing different pedestrians in a crowd. It is therefore essential to combine the advantages of CNNs and ViTs for pedestrian detection. However, manually designing a specific CNN-ViT hybrid network requires enormous time and resources for trial and error. To address this issue, we propose NAS-PED, the first Neural Architecture Search (NAS) framework specifically designed for pedestrian detection, which automatically designs an appropriate CNN-ViT hybrid backbone for crowded pedestrian detection. Specifically, we formulate transformers and convolutions with various kernel sizes in the same format, which provides an unconstrained space for searching diverse hybrid networks. Furthermore, to search for a suitable backbone, we propose an information bottleneck based NAS objective function, which treats NAS as an information extraction process, preserving relevant information and suppressing redundant information from the dense pedestrians in crowd scenes. Extensive experiments on the CrowdHuman, CityPersons and EuroCity Persons datasets demonstrate the effectiveness of the proposed method. NAS-PED obtains absolute gains of 4.0% MR$^{-2}$ and 1.9% AP over the state-of-the-art (SOTA) pedestrian detection framework on the CrowdHuman dataset. For the CityPersons and EuroCity Persons datasets, the searched backbone achieves stable improvements across all three subsets, outperforming some large language-image pre-trained models.
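To make the two key ideas in the abstract concrete, below is a minimal PyTorch-style sketch, not the authors' released code. It assumes a DARTS-like mixed operation in which convolutions of several kernel sizes and a self-attention operator are expressed in the same feature-map-in, feature-map-out format, plus a stand-in regulariser showing where an information-bottleneck-flavoured term would enter the search objective. All class names, the `beta` weight, and the compression penalty are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch only: a softmax-weighted mixture of conv/attention
# candidates (unified search space) and an IB-flavoured objective term.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvToken(nn.Module):
    """Depthwise convolution viewed as a local token mixer."""
    def __init__(self, dim, kernel_size):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, kernel_size,
                              padding=kernel_size // 2, groups=dim)

    def forward(self, x):                      # x: (B, C, H, W)
        return self.conv(x)


class AttnToken(nn.Module):
    """Global self-attention over flattened spatial tokens."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, C, H, W)
        b, c, h, w = x.shape
        t = x.flatten(2).transpose(1, 2)       # (B, HW, C)
        t, _ = self.attn(t, t, t)
        return t.transpose(1, 2).reshape(b, c, h, w)


class MixedOp(nn.Module):
    """Candidates (conv 3/5/7, attention) share one format, so the search
    space can mix them freely; alpha are the architecture weights."""
    def __init__(self, dim):
        super().__init__()
        self.ops = nn.ModuleList([ConvToken(dim, 3), ConvToken(dim, 5),
                                  ConvToken(dim, 7), AttnToken(dim)])
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x):
        w = F.softmax(self.alpha, dim=0)
        return sum(wi * op(x) for wi, op in zip(w, self.ops))


def ib_style_objective(det_loss, feat, beta=1e-3):
    """Detection loss plus a crude compression penalty on backbone features.

    A true information bottleneck objective would bound I(X; Z); here the
    feature energy is only a placeholder showing where such a term sits.
    """
    compression = feat.pow(2).mean()
    return det_loss + beta * compression
```

In a DARTS-style search, the `alpha` parameters of each `MixedOp` would be optimised on validation data while the operator weights are optimised on training data, and the final backbone keeps the highest-weighted candidate per layer; the paper's actual search procedure and objective should be taken from the full text.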