Hardware design and analysis of efficient loop coarsening and border handling for image processing

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP) Pub Date : 2017-07-01 DOI:10.1109/ASAP.2017.7995273

M. A. Ozkan, Oliver Reiche, Frank Hannig, J. Teich


Citations: 7

Abstract

Field Programmable Gate Arrays (FPGAs) excel at the implementation of local operators in terms of throughput per energy since the off-chip communication can be reduced with an application-specific on-chip memory configuration. Furthermore, data-level parallelism can efficiently be exploited through so-called loop coarsening, which processes multiple horizontal pixels simultaneously. Moreover, existing solutions for proper border handling in hardware show considerable resource overheads. In this paper, we first propose novel architectures for image border handling and loop coarsening, which can significantly reduce area. Second, we present a systematic analysis of these architectures, including the formulation of analytical models for their area usage. Based on these models, we provide an algorithm for suggesting the most efficient hardware architecture for a given specification. Finally, we evaluate several implementations of our proposed architectures obtained through Vivado High-Level Synthesis (HLS). The synthesis results show that the proposed coarsening architecture uses 32% fewer registers for a 5-by-5 convolution with a coarsening factor of 64 compared to previous works, whereas the proposed border handling architectures reduce Look-up Table (LUT) usage by 36%.
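The two ideas named in the abstract can be illustrated with a small software model. The sketch below is not the architecture proposed in the paper; it only shows, under simple assumptions, what a local operator with border handling computes and how a coarsening factor v groups v horizontally adjacent output pixels into one loop iteration. All function names (border_index, window_sum, local_op, local_op_coarsened) and the two border modes shown are hypothetical choices made for illustration.

```python
def border_index(i, n, mode="clamp"):
    """Map a possibly out-of-range index i onto [0, n).

    Assumes the window radius is not larger than n, so a single
    reflection is enough in "mirror" mode.
    """
    if mode == "clamp":
        # Repeat the edge pixel: -1 -> 0, n -> n - 1.
        return min(max(i, 0), n - 1)
    if mode == "mirror":
        # Reflect across the image edge, duplicating the edge pixel:
        # -1 -> 0, -2 -> 1, n -> n - 1, n + 1 -> n - 2.
        if i < 0:
            return -i - 1
        if i >= n:
            return 2 * n - i - 1
        return i
    raise ValueError("unknown border mode: " + mode)


def window_sum(img, kernel, y, x, mode):
    """Weighted sum over the kernel window centred on pixel (y, x)."""
    h, w = len(img), len(img[0])
    r = len(kernel) // 2  # radius of a square, odd-sized kernel
    acc = 0
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            yy = border_index(y + dy, h, mode)
            xx = border_index(x + dx, w, mode)
            acc += kernel[dy + r][dx + r] * img[yy][xx]
    return acc


def local_op(img, kernel, mode="clamp"):
    """Reference version: one output pixel per loop iteration."""
    h, w = len(img), len(img[0])
    return [[window_sum(img, kernel, y, x, mode) for x in range(w)]
            for y in range(h)]


def local_op_coarsened(img, kernel, v, mode="clamp"):
    """Same result, but the x loop advances in steps of v pixels.

    In a hardware design the inner lane loop would be fully unrolled,
    so the v outputs of one iteration are produced in parallel.
    """
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x_base in range(0, w, v):
            for lane in range(v):  # conceptually parallel lanes
                x = x_base + lane
                if x >= w:  # image width need not be a multiple of v
                    break
                out[y][x] = window_sum(img, kernel, y, x, mode)
    return out
```

In this model the coarsened loop performs the same arithmetic as the reference loop; the potential benefit in hardware comes from the fact that the v windows of one iteration overlap heavily, so most input pixels can be shared between lanes rather than buffered once per lane.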

