Yolo-tla: An Efficient and Lightweight Small Object Detection Model based on YOLOv5

IF 3 4区计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Journal of Real-Time Image Processing Pub Date : 2024-07-29 DOI:10.1007/s11554-024-01519-4

Chun-Lin Ji, Tao Yu, Peng Gao, Fei Wang, Ru-Yue Yuan

{"title":"Yolo-tla: An Efficient and Lightweight Small Object Detection Model based on YOLOv5","authors":"Chun-Lin Ji, Tao Yu, Peng Gao, Fei Wang, Ru-Yue Yuan","doi":"10.1007/s11554-024-01519-4","DOIUrl":null,"url":null,"abstract":"<p>Object detection, a crucial aspect of computer vision, has seen significant advancements in accuracy and robustness. Despite these advancements, practical applications still face notable challenges, primarily the inaccurate detection or missed detection of small objects. Moreover, the extensive parameter count and computational demands of the detection models impede their deployment on equipment with limited resources. In this paper, we propose YOLO-TLA, an advanced object detection model building on YOLOv5. We first introduce an additional detection layer for small objects in the neck network pyramid architecture, thereby producing a feature map of a larger scale to discern finer features of small objects. Further, we integrate the C3CrossCovn module into the backbone network. This module uses sliding window feature extraction, which effectively minimizes both computational demand and the number of parameters, rendering the model more compact. Additionally, we have incorporated a global attention mechanism into the backbone network. This mechanism combines the channel information with global information to create a weighted feature map. This feature map is tailored to highlight the attributes of the object of interest, while effectively ignoring irrelevant details. In comparison to the baseline YOLOv5s model, our newly developed YOLO-TLA model has shown considerable improvements on the MS COCO validation dataset, with increases of 4.6% in mAP@0.5 and 4% in mAP@0.5:0.95, all while keeping the model size compact at 9.49M parameters. Further extending these improvements to the YOLOv5m model, the enhanced version exhibited a 1.7% and 1.9% increase in mAP@0.5 and mAP@0.5:0.95, respectively, with a total of 27.53M parameters. These results validate the YOLO-TLA model’s efficient and effective performance in small object detection, achieving high accuracy with fewer parameters and computational demands.</p>","PeriodicalId":51224,"journal":{"name":"Journal of Real-Time Image Processing","volume":"7 1","pages":""},"PeriodicalIF":3.0000,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Real-Time Image Processing","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11554-024-01519-4","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Object detection, a crucial aspect of computer vision, has seen significant advancements in accuracy and robustness. Despite these advancements, practical applications still face notable challenges, primarily the inaccurate detection or missed detection of small objects. Moreover, the extensive parameter count and computational demands of the detection models impede their deployment on equipment with limited resources. In this paper, we propose YOLO-TLA, an advanced object detection model building on YOLOv5. We first introduce an additional detection layer for small objects in the neck network pyramid architecture, thereby producing a feature map of a larger scale to discern finer features of small objects. Further, we integrate the C3CrossCovn module into the backbone network. This module uses sliding window feature extraction, which effectively minimizes both computational demand and the number of parameters, rendering the model more compact. Additionally, we have incorporated a global attention mechanism into the backbone network. This mechanism combines the channel information with global information to create a weighted feature map. This feature map is tailored to highlight the attributes of the object of interest, while effectively ignoring irrelevant details. In comparison to the baseline YOLOv5s model, our newly developed YOLO-TLA model has shown considerable improvements on the MS COCO validation dataset, with increases of 4.6% in mAP@0.5 and 4% in mAP@0.5:0.95, all while keeping the model size compact at 9.49M parameters. Further extending these improvements to the YOLOv5m model, the enhanced version exhibited a 1.7% and 1.9% increase in mAP@0.5 and mAP@0.5:0.95, respectively, with a total of 27.53M parameters. These results validate the YOLO-TLA model’s efficient and effective performance in small object detection, achieving high accuracy with fewer parameters and computational demands.

Abstract Image

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Yolo-tla：基于 YOLOv5 的高效轻量级小目标检测模型

物体检测是计算机视觉的一个重要方面，在准确性和鲁棒性方面取得了显著进步。尽管取得了这些进步，但实际应用仍然面临着显著的挑战，主要是对小物体的检测不准确或漏检。此外，检测模型的大量参数和计算需求也阻碍了它们在资源有限的设备上的部署。在本文中，我们提出了基于 YOLOv5 的高级物体检测模型 YOLO-TLA。我们首先在颈部网络金字塔结构中为小物体引入了一个额外的检测层，从而生成一个更大尺度的特征图，以辨别小物体更精细的特征。此外，我们还将 C3CrossCovn 模块集成到骨干网络中。该模块采用滑动窗口特征提取，有效地减少了计算需求和参数数量，使模型更加紧凑。此外，我们还在骨干网络中加入了全局关注机制。该机制将信道信息与全局信息相结合，创建加权特征图。该特征图是为突出感兴趣对象的属性而量身定制的，同时有效地忽略了无关细节。与基线 YOLOv5s 模型相比，我们新开发的 YOLO-TLA 模型在 MS COCO 验证数据集上显示出相当大的改进，在 mAP@0.5 和 mAP@0.5:0.95 上分别提高了 4.6% 和 4%，同时模型大小保持在 949 万个参数。将这些改进进一步扩展到 YOLOv5m 模型，增强版在 mAP@0.5 和 mAP@0.5:0.95 中分别提高了 1.7% 和 1.9%，参数总数达到 2753 万。这些结果验证了 YOLO-TLA 模型在小物体检测方面的高效性能，它以更少的参数和计算需求实现了高精度。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Journal of Real-Time Image Processing COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-ENGINEERING, ELECTRICAL & ELECTRONIC

CiteScore

6.80

自引率

6.70%

发文量

审稿时长

6 months

期刊介绍： Due to rapid advancements in integrated circuit technology, the rich theoretical results that have been developed by the image and video processing research community are now being increasingly applied in practical systems to solve real-world image and video processing problems. Such systems involve constraints placed not only on their size, cost, and power consumption, but also on the timeliness of the image data processed. Examples of such systems are mobile phones, digital still/video/cell-phone cameras, portable media players, personal digital assistants, high-definition television, video surveillance systems, industrial visual inspection systems, medical imaging devices, vision-guided autonomous robots, spectral imaging systems, and many other real-time embedded systems. In these real-time systems, strict timing requirements demand that results are available within a certain interval of time as imposed by the application. It is often the case that an image processing algorithm is developed and proven theoretically sound, presumably with a specific application in mind, but its practical applications and the detailed steps, methodology, and trade-off analysis required to achieve its real-time performance are not fully explored, leaving these critical and usually non-trivial issues for those wishing to employ the algorithm in a real-time system. The Journal of Real-Time Image Processing is intended to bridge the gap between the theory and practice of image processing, serving the greater community of researchers, practicing engineers, and industrial professionals who deal with designing, implementing or utilizing image processing systems which must satisfy real-time design constraints.