Lightweight CNN-ViT with cross-module representational constraint for express parcel detection

The Visual Computer. Pub Date: 2024-08-28. DOI: 10.1007/s00371-024-03602-0

Guowei Zhang, Wuzhi Li, Yutong Tang, Shuixuan Chen, Li Wang


Citations: 0

Abstract



Express parcel (EP) detection models must run on edge devices with limited computing capability, so a lightweight and efficient object detector is essential. In this work, we introduce CMViT, a novel lightweight CNN-ViT with a cross-module representational constraint, designed specifically for EP detection. In CMViT, we draw on the concept of cross-attention from multimodal models and propose a new cross-module attention (CMA) encoder. Local features are provided by the proposed lightweight shuffle block (LSBlock), and the CMA encoder flexibly connects local and global features from the hybrid CNN-ViT model through self-attention, constructing a robust dependency between local and global features and thereby effectively enlarging the model’s receptive field. Furthermore, LSBlock provides effective guidance and constraints for the CMA encoder, avoiding unnecessary attention to redundant information and reducing computational cost. In EP detection, compared to YOLOv8s, CMViT achieves 99% mean accuracy with 25% of the input resolution, 54.5% of the parameters, and 14.7% of the FLOPs, showing superior performance and promising applications. In more challenging object detection tasks, CMViT achieves 28.8 mAP with 2.2G MAdds on the COCO dataset, outperforming MobileViT by 4% in accuracy while consuming less computational power. Code is available at: https://github.com/Acc2386/CMViT.
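As a rough illustration of the cross-attention idea the abstract borrows from multimodal models, the sketch below computes generic scaled dot-product cross-attention in plain Python. It is not the authors' CMA encoder: the function names, the toy inputs, and the choice to use global tokens as queries and local features as keys and values are all assumptions made here for illustration. The actual design is in the linked repository.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Generic scaled dot-product cross-attention.

    queries: n_q vectors of dimension d (one feature stream)
    keys:    n_k vectors of dimension d (the other feature stream)
    values:  n_k vectors of dimension d_v
    Returns n_q vectors of dimension d_v.
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    d_v = len(values[0])
    out = []
    for q in queries:
        # Similarity of this query to every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(d_v)])
    return out


# Hypothetical toy data: 2 "global" tokens attend over 3 "local" features.
global_tokens = [[1.0, 0.0], [0.0, 1.0]]
local_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(global_tokens, local_features, local_features)
```

Each row of `fused` is a convex combination of the local feature vectors, weighted toward the local features most similar to the corresponding global token.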

