Xinlong Dong, Peicheng Shi, Taonian Liang, Aixi Yang
CTAFFNet: CNN–Transformer Adaptive Feature Fusion Object Detection Algorithm for Complex Traffic Scenarios
Transportation Research Record: Journal of the Transportation Research Board, published 2024-09-17
DOI: 10.1177/03611981241258753 (https://doi.org/10.1177/03611981241258753)
As the core technology of environmental perception systems, object detection has received increasing attention and has become an active research direction for intelligent driving vehicles. Existing CNN–Transformer hybrid models generalize poorly, which makes it difficult for them to meet the detection requirements for small objects in complex scenes. We propose a novel convolutional neural network (CNN)–Transformer Adaptive Feature Fusion Network (CTAFFNet) for object detection. First, we design a local–global feature fusion unit, the Convolutional Transformation Adaptive Fusion Kernel (CTAFFK), which is integrated into CTAFFNet. CTAFFK uses two branches, a CNN branch and a Transformer branch, to extract local and global features from the image, and adaptively fuses the features from both branches. Next, we develop an adaptive feature fusion strategy that combines local high-frequency and global low-frequency features to obtain comprehensive feature information. Finally, CTAFFNet employs an encoder–decoder structure to facilitate the flow of fused local–global information between stages, which supports the model's generalization capability. Experiments on the large and challenging KITTI dataset demonstrate the effectiveness and efficiency of the proposed network. Compared with other mainstream networks, it achieves an average precision of 91.17% and performs particularly well on small objects at longer distances, with 70.18% accuracy, while also providing real-time detection capability.
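
The two-branch idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' CTAFFK implementation: the abstract does not give the fusion formula, so the sigmoid gate below, the depthwise 1-D convolution standing in for the CNN branch, the projection-free self-attention standing in for the Transformer branch, and all function and parameter names (`local_branch`, `global_branch`, `adaptive_fuse`, `w_gate`, `b_gate`) are assumptions chosen for illustration only.

```python
import numpy as np


def local_branch(x, kernel):
    """Stand-in for the CNN branch: depthwise 1-D convolution over tokens.

    x has shape (n_tokens, channels); kernel has shape (k, channels), k odd.
    """
    n = x.shape[0]
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))  # zero-pad the token axis only
    return np.stack([(xp[i:i + k] * kernel).sum(axis=0) for i in range(n)])


def global_branch(x):
    """Stand-in for the Transformer branch: single-head self-attention
    without learned projections (an assumption made to keep the sketch short)."""
    scores = x @ x.T / np.sqrt(x.shape[1])
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=1, keepdims=True)  # row-wise softmax
    return weights @ x


def adaptive_fuse(local, global_, w_gate, b_gate):
    """Hypothetical adaptive fusion: a per-element sigmoid gate computed from
    both branches mixes local and global features as a convex combination."""
    z = np.concatenate([local, global_], axis=1) @ w_gate + b_gate
    gate = 1.0 / (1.0 + np.exp(-z))
    return gate * local + (1.0 - gate) * global_, gate


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.normal(size=(16, 8))              # 16 tokens, 8 channels
    kernel = rng.normal(size=(3, 8)) * 0.1    # untrained placeholder weights
    w_gate = rng.normal(size=(16, 8)) * 0.1   # maps 2 * channels -> channels
    b_gate = np.zeros(8)
    fused, gate = adaptive_fuse(local_branch(x, kernel), global_branch(x), w_gate, b_gate)
    print(fused.shape, float(gate.min()), float(gate.max()))
```

Because the gate lies strictly between 0 and 1, each fused value falls between the corresponding local and global values, so the weights decide position by position how much each branch contributes. In a trained network the gate parameters would be learned rather than drawn at random as they are here.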