灵活扩展卷积和混合精度操作数的轻量级指令集

2023 24th International Symposium on Quality Electronic Design (ISQED) Pub Date : 2023-04-05 DOI:10.1109/ISQED57927.2023.10129341

Simon Friedrich, Shambhavi Balamuthu Sampath, R. Wittig, M. Vemparala, Nael Fasfous, E. Matús, W. Stechele, G. Fettweis

{"title":"灵活扩展卷积和混合精度操作数的轻量级指令集","authors":"Simon Friedrich, Shambhavi Balamuthu Sampath, R. Wittig, M. Vemparala, Nael Fasfous, E. Matús, W. Stechele, G. Fettweis","doi":"10.1109/ISQED57927.2023.10129341","DOIUrl":null,"url":null,"abstract":"Modern deep neural networks specialized for object detection and semantic segmentation require specific operations to increase or preserve the resolution of their feature maps. Hence, more generic convolution layers called transposed and dilated convolutions are employed, adding a large number of zeros between the elements of the input features or weights. Usually, standard neural network hardware accelerators process these convolutions in a straightforward manner, without paying attention to the added zeros, resulting in an increased computation time. To cope with this problem, recent works propose to skip the redundant elements with additional hardware or solve the problem efficiently only for a limited range of dilation rates. We present a general approach for accelerating transposed and dilated convolutions that does not introduce any hardware overhead while supporting all dilation rates. To achieve this, we introduce a novel precision-scalable lightweight instruction set and memory scheme that can be applied to the different convolution variants. This results in a speed-up of 5 times in DeepLabV3+ outperforming the recently proposed design methods. The support of precision-scalable execution of all workloads further increases the speedup in computation time shown for the PointPillars, DeepLabV3+, and ENet networks. Compared to the state-of-the-art commercial EdgeTPU, the instruction footprint of ResNet-50 of our designed accelerator is reduced by 60 percent.","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":" 33","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Lightweight Instruction Set for Flexible Dilated Convolutions and Mixed-Precision Operands\",\"authors\":\"Simon Friedrich, Shambhavi Balamuthu Sampath, R. Wittig, M. Vemparala, Nael Fasfous, E. Matús, W. Stechele, G. Fettweis\",\"doi\":\"10.1109/ISQED57927.2023.10129341\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Modern deep neural networks specialized for object detection and semantic segmentation require specific operations to increase or preserve the resolution of their feature maps. Hence, more generic convolution layers called transposed and dilated convolutions are employed, adding a large number of zeros between the elements of the input features or weights. Usually, standard neural network hardware accelerators process these convolutions in a straightforward manner, without paying attention to the added zeros, resulting in an increased computation time. To cope with this problem, recent works propose to skip the redundant elements with additional hardware or solve the problem efficiently only for a limited range of dilation rates. We present a general approach for accelerating transposed and dilated convolutions that does not introduce any hardware overhead while supporting all dilation rates. To achieve this, we introduce a novel precision-scalable lightweight instruction set and memory scheme that can be applied to the different convolution variants. This results in a speed-up of 5 times in DeepLabV3+ outperforming the recently proposed design methods. The support of precision-scalable execution of all workloads further increases the speedup in computation time shown for the PointPillars, DeepLabV3+, and ENet networks. Compared to the state-of-the-art commercial EdgeTPU, the instruction footprint of ResNet-50 of our designed accelerator is reduced by 60 percent.\",\"PeriodicalId\":315053,\"journal\":{\"name\":\"2023 24th International Symposium on Quality Electronic Design (ISQED)\",\"volume\":\" 33\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-04-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 24th International Symposium on Quality Electronic Design (ISQED)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISQED57927.2023.10129341\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 24th International Symposium on Quality Electronic Design (ISQED)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISQED57927.2023.10129341","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 2

摘要

专门用于对象检测和语义分割的现代深度神经网络需要特定的操作来增加或保持其特征映射的分辨率。因此，使用了更通用的卷积层，称为转置卷积和扩展卷积，在输入特征或权重的元素之间添加大量的零。通常，标准的神经网络硬件加速器以直接的方式处理这些卷积，而不注意添加的零，从而导致计算时间增加。为了解决这个问题，最近的工作建议用额外的硬件跳过冗余的元素，或者只在有限的膨胀率范围内有效地解决问题。我们提出了一种加速转置和扩展卷积的通用方法，该方法在支持所有扩展速率的同时不会引入任何硬件开销。为了实现这一目标，我们引入了一种新的精确可扩展的轻量级指令集和存储方案，可以应用于不同的卷积变体。这使得DeepLabV3+的速度提高了5倍，优于最近提出的设计方法。对所有工作负载的精确可扩展执行的支持进一步提高了PointPillars、DeepLabV3+和ENet网络的计算时间加速。与最先进的商用EdgeTPU相比，我们设计的加速器的ResNet-50的指令足迹减少了60%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Lightweight Instruction Set for Flexible Dilated Convolutions and Mixed-Precision Operands

Modern deep neural networks specialized for object detection and semantic segmentation require specific operations to increase or preserve the resolution of their feature maps. Hence, more generic convolution layers called transposed and dilated convolutions are employed, adding a large number of zeros between the elements of the input features or weights. Usually, standard neural network hardware accelerators process these convolutions in a straightforward manner, without paying attention to the added zeros, resulting in an increased computation time. To cope with this problem, recent works propose to skip the redundant elements with additional hardware or solve the problem efficiently only for a limited range of dilation rates. We present a general approach for accelerating transposed and dilated convolutions that does not introduce any hardware overhead while supporting all dilation rates. To achieve this, we introduce a novel precision-scalable lightweight instruction set and memory scheme that can be applied to the different convolution variants. This results in a speed-up of 5 times in DeepLabV3+ outperforming the recently proposed design methods. The support of precision-scalable execution of all workloads further increases the speedup in computation time shown for the PointPillars, DeepLabV3+, and ENet networks. Compared to the state-of-the-art commercial EdgeTPU, the instruction footprint of ResNet-50 of our designed accelerator is reduced by 60 percent.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2023 24th International Symposium on Quality Electronic Design (ISQED)

自引率

0.00%

发文量