
Latest publications from the 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

Model checking cloud rendering system for the QoS evaluation
Haoyu Liu, Huahu Xu, Honghao Gao, Danqi Chu
This paper briefly introduces a method for evaluating the reliability of a cloud rendering system using probabilistic models. An extended discrete-time Markov chain (DTMC) is proposed that takes QoS (Quality of Service) into account. Then, properties defined from three aspects give full consideration to the processes of rendering tasks, and can be verified quantitatively with the PRISM model checker. Finally, the experimental results demonstrate that our method can ensure and improve the QoS reliability of the cloud rendering system.
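As a toy illustration of the kind of quantitative check PRISM performs on such a DTMC (the model, states, and probabilities below are invented for illustration and are not taken from the paper), the probability of a rendering request eventually reaching a "rendered" state can be computed by value iteration:

```python
def reach_prob(P, targets, iters=2000):
    """Value-iteration sketch of a PRISM-style query such as P=? [ F "rendered" ]:
    probability of eventually reaching a target state in a DTMC whose
    transition matrix P has rows summing to 1."""
    n = len(P)
    x = [1.0 if s in targets else 0.0 for s in range(n)]
    for _ in range(iters):
        x = [1.0 if s in targets else sum(P[s][t] * x[t] for t in range(n))
             for s in range(n)]
    return x

# Hypothetical 4-state model: 0 = request, 1 = rendered (absorbing),
# 2 = failed (retries with prob. 0.5), 3 = aborted (absorbing).
P = [[0.0, 0.9, 0.1, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.5, 0.0, 0.0, 0.5],
     [0.0, 0.0, 0.0, 1.0]]
probs = reach_prob(P, targets={1})
```

Here the closed-form answer for state 0 is 0.9 / 0.95 ≈ 0.947, which the iteration converges to geometrically.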
DOI: 10.1109/ASAP.2017.7995284
Citations: 0
Massive spatial query on the Kepler architecture
Yili Gong, Jia Tang, Wenhai Li, Zihui Ye
In this paper, we present an optimized framework that can efficiently perform massive spatial queries on current GPUs. To make the widely adopted filter-and-verify paradigm benefit from GPUs, the skewed workloads are first associated with cells in a scaled spatial grid, so that the cost of the subsequent range verification against the massive spatial objects can be significantly reduced. On the Kepler architecture in particular, we highlight a two-level scheduling method that exploits data locality through a novel dynamic scheduling scheme. Based on this virtual warp-based scheduling method, groups of threads can compete for the unbalanced tasks to ensure good load balance. We run a variety of skewed workloads with different object positions and query distributions to evaluate our optimized methods. Experimental results show that, compared to existing fixed-size allocation methods, the proposed adaptive scheduling strategies improve query throughput by an order of magnitude.
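A minimal CPU-side sketch of the filter-and-verify idea described above (the uniform cell size and helper names are my own; the paper's GPU scheduling is not modelled here): points are bucketed into grid cells (filter), and only points in cells overlapping the query window are tested exactly (verify).

```python
def build_grid(points, cell):
    """Bucket 2-D points into a uniform grid keyed by cell coordinates."""
    grid = {}
    for p in points:
        key = (int(p[0] // cell), int(p[1] // cell))
        grid.setdefault(key, []).append(p)
    return grid

def range_query(grid, cell, xlo, ylo, xhi, yhi):
    """Filter: visit only cells overlapping the query rectangle.
    Verify: exact containment test on the candidates in those cells."""
    hits = []
    for cx in range(int(xlo // cell), int(xhi // cell) + 1):
        for cy in range(int(ylo // cell), int(yhi // cell) + 1):
            for (x, y) in grid.get((cx, cy), ()):
                if xlo <= x <= xhi and ylo <= y <= yhi:
                    hits.append((x, y))
    return hits
```

On a GPU, the per-cell candidate lists are exactly the skewed work units that the paper's virtual-warp scheduling balances across threads.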
DOI: 10.1109/ASAP.2017.7995267
Citations: 0
CGRA-ME: A unified framework for CGRA modelling and exploration
S. Chin, N. Sakamoto, A. Rui, Jim Zhao, Jin Hee Kim, Yuko Hara-Azumi, J. Anderson
Coarse-grained reconfigurable arrays (CGRAs) are a style of programmable logic device situated between FPGAs and custom ASICs on the spectrum of programmability, performance, power and cost. CGRAs have been proposed by both academia and industry; however, prior works have been mainly self-contained, without broad architectural exploration or comparisons with competing CGRAs. We present CGRA-ME, a unified CGRA framework that encompasses generic architecture description, architecture modelling, application mapping, and physical implementation. Within this framework, we discuss our architecture description language CGRA-ADL, a generic LLVM-based simulated annealing mapper, and a standard cell flow for physical implementation. An architecture exploration case study is presented, highlighting the capabilities of CGRA-ME by exploring a variety of architectures with varying functionality, interconnect, array size, and execution contexts through the mapping of application benchmarks and the production of standard cell designs.
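The abstract does not detail the simulated annealing mapper, so the following is a heavily simplified, hypothetical sketch of such a mapper (swap/move neighbourhood, Manhattan-distance wirelength cost, cooling schedule and all parameters invented), placing a small dataflow graph onto a PE grid:

```python
import math
import random

def sa_map(edges, n_ops, grid_w, grid_h, steps=20000, seed=0):
    """Toy annealing mapper: place n_ops operations onto a grid_w x grid_h
    PE array so that connected operations (edges) end up close together."""
    rng = random.Random(seed)
    slots = [(x, y) for x in range(grid_w) for y in range(grid_h)]
    place = rng.sample(slots, n_ops)            # one distinct PE per op

    def cost(pl):
        return sum(abs(pl[a][0] - pl[b][0]) + abs(pl[a][1] - pl[b][1])
                   for a, b in edges)

    cur = cost(place)
    temp = 2.0
    for _ in range(steps):
        a = rng.randrange(n_ops)
        old, new = place[a], rng.choice(slots)
        if new in place:                        # swap with the occupant
            b = place.index(new)
            place[a], place[b] = place[b], place[a]
        else:                                   # move to a free PE
            b = None
            place[a] = new
        c = cost(place)
        if c <= cur or rng.random() < math.exp((cur - c) / temp):
            cur = c                             # accept the move
        elif b is None:                         # undo the move
            place[a] = old
        else:                                   # undo the swap
            place[a], place[b] = place[b], place[a]
        temp *= 0.9997                          # geometric cooling
    return place, cur
```

For a 4-operation chain on a 2x2 array the optimal wirelength is 3 (a Hamiltonian path over the PEs); a real CGRA mapper additionally models routing resources and timing, which this sketch omits.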
DOI: 10.1109/ASAP.2017.7995277
Citations: 80
A fast and accurate logarithm accelerator for scientific applications
Jing Chen, Xue Liu
Many scientific applications rely on the evaluation of elementary functions. Nowadays, high-level programming languages provide their own elementary function libraries in software, using lookup tables and/or polynomial approximation. One downside, however, is speed: lookup tables can cause cache thrashing, and polynomial approximations require a number of iterations to converge. Thus, elementary function evaluation becomes a bottleneck for many scientific applications. With this motivation, we propose a generalized pipelined hardware architecture for elementary functions to accelerate scientific applications. This paper presents a pipelined, single-precision logarithm hardware accelerator (SP-LHA). The throughput of SP-LHA is at least 2.5 GFLOPS in 65 nm ASICs, while the circuit consists of approximately 60,000 logic gates. The average accuracy of SP-LHA is 22.5 out of 23 bits, achieved by using a 7.8 KB lookup table and parabolic interpolation.
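A bit-accurate model of SP-LHA is not given in the abstract; the sketch below only illustrates the general table-plus-parabolic-interpolation idea (the table size and layout here are invented and much smaller than the paper's 7.8 KB table): reduce the argument to a mantissa in [1, 2), then fit a parabola through three neighbouring table samples.

```python
import math

TABLE_BITS = 7                      # 128 intervals over [1, 2); an assumption
STEP = 2.0 ** -TABLE_BITS
# Precomputed log2 samples, with guard entries past the end for interpolation.
table = [math.log2(1.0 + i * STEP) for i in range(2 ** TABLE_BITS + 2)]

def log2_approx(x):
    """Approximate log2(x) for x > 0 via range reduction plus
    parabolic (three-point) interpolation in a small table."""
    m, e = math.frexp(x)            # x = m * 2**e with m in [0.5, 1)
    m *= 2.0                        # normalise the mantissa into [1, 2)
    e -= 1
    t = (m - 1.0) / STEP
    i = int(t)
    f = t - i                       # fractional position between samples
    y0, y1, y2 = table[i], table[i + 1], table[i + 2]
    # Parabola through (0, y0), (1, y1), (2, y2), evaluated at f.
    interp = y0 + f * (y1 - y0) + 0.5 * f * (f - 1.0) * (y2 - 2.0 * y1 + y0)
    return e + interp
```

Even with only 128 intervals, quadratic interpolation of the smooth log2 curve keeps the absolute error well below 1e-6, which hints at why a modest table suffices for near-full single-precision accuracy in hardware.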
DOI: 10.1109/ASAP.2017.7995283
Citations: 1
Design and comparative evaluation of GPGPU- and FPGA-based MPSoC ECU architectures for secure, dependable, and real-time automotive CPS
B. Poudel, N. Giri, Arslan Munir
In this paper, we propose and implement two electronic control unit (ECU) architectures for real-time automotive cyber-physical systems that incorporate security and dependability primitives with low resource and energy overhead. These ECU architectures follow the multiprocessor system-on-chip (MPSoC) design paradigm, wherein the ECUs have multiple heterogeneous processing engines with specific functionalities. The first architecture, GED, leverages an ARM-based application processor and a GPGPU-based co-processor. The second architecture, RED, integrates an ARM-based application processor with an FPGA-based co-processor. We quantify and compare the temporal performance, energy, and error resilience of our proposed architectures for a steer-by-wire case study over CAN, CAN FD, and FlexRay in-vehicle networks. Hardware implementation results reveal that RED and GED can attain speedups of 31.7× and 1.8×, respectively, while consuming 1.75× and 2× less energy, respectively, than contemporary ECU architectures.
DOI: 10.1109/ASAP.2017.7995256
Citations: 16
High performance hardware architectures for Intra Block Copy and Palette Coding for HEVC screen content coding extension
Rishan Senanayake, Namitha Liyanage, Sasindu Wijeratne, Sachille Atapattu, Kasun Athukorala, P. Tharaka, G. Karunaratne, R. Senarath, Ishantha Perera, Ashen Ekanayake, A. Pasqual
The Screen content coding (SCC) extension to High Efficiency Video Coding (HEVC) offers substantial compression efficiency over the existing HEVC standard for computer-generated content. However, this gain in compression efficiency comes at the expense of further computational complexity from several resource-hungry coding tools. Hence, extending HEVC hardware encoders with SCC can be challenging. This paper presents resource-efficient hardware designs for two key SCC tools, Intra Block Copy and Palette Coding. Moreover, a new hash search approach is proposed for Intra Block Copy, while a hardware-friendly palette indices coding scheme is suggested for Palette Coding. These designs are targeted to achieve the throughput necessary for a 1080p 30 frames/s encoder, and incur coding losses of 11.4% and 5.1%, respectively, in all-intra configurations. The designs are synthesized for a Virtex-7 VC707 evaluation platform.
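The paper's hash search itself is not specified in the abstract; this toy sketch (block size, hash function, and helper names are all assumptions) shows the general idea behind hash-based Intra Block Copy search: index already-coded blocks by a content hash so an identical earlier block can be found without exhaustive comparison.

```python
def block_hash(pixels):
    # Toy hash: a tuple of pixels is hashable; a hardware design would use
    # a cheap fixed-width hash instead.
    return hash(tuple(pixels))

def find_ibc_matches(frame, bw, bh):
    """Scan a 2-D frame (list of pixel rows) block by block in raster order,
    indexing each block by its content hash; report, for each block, the
    position of an earlier identical block if one exists."""
    H, W = len(frame), len(frame[0])
    seen = {}
    matches = {}
    for by in range(0, H - bh + 1, bh):
        for bx in range(0, W - bw + 1, bw):
            pixels = [frame[by + y][bx + x] for y in range(bh) for x in range(bw)]
            key = block_hash(pixels)
            if key in seen:
                matches[(bx, by)] = seen[key]   # candidate block vector source
            else:
                seen[key] = (bx, by)
    return matches
```

Screen content is full of exactly repeated patches (text, UI elements), which is why a hash table turns the block search into near-constant-time lookups.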
DOI: 10.1109/ASAP.2017.7995274
Citations: 0
PFSI.sw: A programming framework for sea ice model algorithms based on Sunway many-core processor
Binyang Li, Bo Li, D. Qian
The sea ice model is a typical high-performance computing problem. CPU- and GPU-based parallel methods have been proposed to accelerate the simulation process, but it is still hard to meet large-scale calculation demands due to the compute-intensive nature of the model. The Sunway TaihuLight supercomputer uses the SW26010 processor as its computing unit and achieves high performance for large-scale scientific computing. In this paper we present a programming framework (PFSI.sw) for sea ice model algorithms based on the Sunway many-core processor. Based on this framework, programmers can exploit the parallelism of existing sea ice model algorithms and achieve good performance. Several strategies are introduced in this framework; data partitioning, data transfer, and load balancing are the main aspects currently addressed. The framework has been implemented and tested with two sea ice model algorithms using real-world datasets on Sunway many-core processors. The experiments demonstrate performance comparable to a traditional parallel implementation on the Sunway many-core processor, and our framework improves performance by up to 40%.
DOI: 10.1109/ASAP.2017.7995268
Citations: 6
Acceleration of Frequent Itemset Mining on FPGA using SDAccel and Vivado HLS
V. Dang, K. Skadron
Frequent itemset mining (FIM) is a widely used data-mining technique for discovering sets of frequently occurring items in large databases. However, FIM is highly time-consuming as datasets grow in size. FPGAs have shown great promise for accelerating computationally intensive algorithms, but they are hard to use with traditional HDL-based design methods. The recent introduction of the Xilinx SDAccel development environment for the C/C++/OpenCL languages allows developers to exploit an FPGA's potential without long development periods and extensive hardware knowledge. This paper presents an optimized implementation of an FIM algorithm on an FPGA using SDAccel and Vivado HLS. Performance and power consumption are measured with various datasets. Compared to state-of-the-art solutions, this implementation offers up to 3.2× speedup over a 6-core CPU and better energy efficiency than a GPU. Our preliminary results on the new XCKU115 FPGA are even more promising: they demonstrate performance comparable to a state-of-the-art HDL FPGA implementation and better performance than the GPU.
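For reference, the software baseline being accelerated is classic level-wise frequent itemset mining; a minimal Apriori-style sketch (not the paper's FPGA kernel, and without the usual candidate-pruning optimizations) looks like this:

```python
def apriori(transactions, min_support):
    """Minimal level-wise (Apriori-style) miner: count candidate itemsets
    of size k, keep those occurring in at least min_support transactions,
    then join the survivors to form size-(k+1) candidates."""
    transactions = [frozenset(t) for t in transactions]
    level = [frozenset([i]) for i in {i for t in transactions for i in t}]
    frequent = {}
    k = 1
    while level:
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        # Join step: size-(k+1) candidates from pairs of surviving k-itemsets.
        level = list({a | b for a in survivors for b in survivors
                      if len(a | b) == k + 1})
        k += 1
    return frequent                 # itemset -> support count
```

The inner support-counting loop is the data-parallel, memory-bound hot spot that an FPGA pipeline can stream over many transactions at once.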
DOI: 10.1109/ASAP.2017.7995279
Citations: 8
Efficiency in ILP processing by using orthogonality
Marcel Brand, Frank Hannig, Alexandru Tanase, J. Teich
For the next generations of Processor-Arrays-on-Chip (e.g., coarse-grained reconfigurable or programmable arrays), including hundreds to thousands of processing elements, it is very important to keep the on-chip configuration/instruction memories as small as possible. Hence, compilers must take into account the scarceness of available instruction memory and create code as compact as possible [1]. However, Very Long Instruction Word (VLIW) processors have the well-known problem that compilers typically produce lengthy code. A lot of unnecessary code is produced due to unused Functional Units (FUs) or repeated operations for single FUs in instruction sequences. Techniques like software pipelining can be used to improve the utilization of the FUs, yet with the risk of code explosion [2] due to the overlapped scheduling of multiple loop iterations or other control flow statements. This is where our proposed Orthogonal Instruction Processing (OIP) architecture (see Fig. 1) shows benefits in reducing the code size of compute-intensive loop programs. The idea is, contrary to the lightweight VLIW processors used in arrays like Tightly Coupled Processor Arrays (TCPAs) [4], to equip each FU with its own instruction memory, branch unit, and program counter, but still let the FUs share the register files as well as input and output signals. This enables a processor to execute a loop program orthogonally. Each FU can execute its own sub-program while exchanging data over the register files. The branch unit and its instruction format have to be slightly changed by introducing a counter in each instruction that determines how often the instruction is repeated until the specified branch is executed. This enables repeating instructions without repeating them in the code. Such processors have to be carefully programmed, e.g., to not run into data dependency problems while optimizing throughput. For solving this resource-constrained modulo scheduling problem, we use techniques based on mixed integer linear programming [5], [3].
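As a software analogy (the encoding below is invented; the real OIP instruction format is only sketched in the text), the per-instruction repeat counter can be modelled by an interpreter whose program counter advances only after an instruction has been issued `repeat` times, so a loop body occupies one instruction word instead of many:

```python
def run(program, state):
    """Toy model of a repeat-count instruction field: each program entry is
    (operation, repeat); the PC advances only after `repeat` issues."""
    pc = 0
    while pc < len(program):
        op, repeat = program[pc]
        for _ in range(repeat):
            op(state)
        pc += 1
    return state

# Two instruction words encode six dynamic operations.
def inc(s): s['acc'] += 1
def dbl(s): s['acc'] *= 2
result = run([(inc, 5), (dbl, 1)], {'acc': 0})
```

In hardware this counter lives in the branch unit of each FU, which is what lets the compiler emit a compact loop instead of an unrolled instruction stream.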
DOI: 10.1109/ASAP.2017.7995282
Citations: 0
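The abstract above describes each FU carrying its own instruction memory, program counter, and a per-instruction repeat counter, so a loop body need not be unrolled in the code. As a hedged illustration (a toy software model, not the authors' OIP hardware; all names are invented for this sketch), the idea can be simulated like this:

```python
# Toy model of "orthogonal instruction processing": each functional unit
# (FU) runs its own sub-program with a private program counter, and each
# instruction carries a repeat counter, so an instruction executes many
# times without being duplicated in the instruction stream. The FUs
# share a single register file, mirroring the paper's description.

class FU:
    def __init__(self, program):
        # program: list of (op, repeat) pairs; each op mutates shared regs
        self.program = program
        self.pc = 0  # private program counter
        self.remaining = program[0][1] if program else 0

    def step(self, regs):
        if self.pc >= len(self.program):
            return False  # this FU's sub-program has finished
        op, _ = self.program[self.pc]
        op(regs)
        self.remaining -= 1
        if self.remaining == 0:
            # advance only after `repeat` executions of the instruction
            self.pc += 1
            if self.pc < len(self.program):
                self.remaining = self.program[self.pc][1]
        return True

def run(fus, regs):
    # FUs execute in lockstep, each following its own instruction stream
    while True:
        active = [fu.step(regs) for fu in fus]
        if not any(active):
            break
    return regs

# Two FUs sharing a register file: FU0 accumulates, FU1 counts iterations.
# Each sub-program is a single instruction with repeat count 4.
regs = {"acc": 0, "cnt": 0}
fu0 = FU([(lambda r: r.__setitem__("acc", r["acc"] + 2), 4)])
fu1 = FU([(lambda r: r.__setitem__("cnt", r["cnt"] + 1), 4)])
run([fu0, fu1], regs)
print(regs)  # {'acc': 8, 'cnt': 4}
```

The repeat counter is what keeps the code compact: four iterations cost one stored instruction per FU, whereas an unrolled VLIW encoding would store four.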
Hardware design and analysis of efficient loop coarsening and border handling for image processing
M. A. Ozkan, Oliver Reiche, Frank Hannig, J. Teich
Field Programmable Gate Arrays (FPGAs) excel at the implementation of local operators in terms of throughput per energy since the off-chip communication can be reduced with an application-specific on-chip memory configuration. Furthermore, data-level parallelism can efficiently be exploited through so-called loop coarsening, which processes multiple horizontal pixels simultaneously. Moreover, existing solutions for proper border handling in hardware show considerable resource overheads. In this paper, we first propose novel architectures for image border handling and loop coarsening, which can significantly reduce area. Second, we present a systematic analysis of these architectures including the formulation of analytical models for their area usage. Based on these models, we provide an algorithm for suggesting the most efficient hardware architecture for a given specification. Finally, we evaluate several implementations of our proposed architectures obtained through Vivado High-Level Synthesis (HLS). The synthesis results show that the proposed coarsening architecture uses 32% fewer registers for a 5-by-5 convolution with a 64 coarsening factor compared to previous works, whereas the proposed border handling architectures facilitate a decrease in the Look-up Table (LUT) usage by 36%.
{"title":"Hardware design and analysis of efficient loop coarsening and border handling for image processing","authors":"M. A. Ozkan, Oliver Reiche, Frank Hannig, J. Teich","doi":"10.1109/ASAP.2017.7995273","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995273","url":null,"abstract":"Field Programmable Gate Arrays (FPGAs) excel at the implementation of local operators in terms of throughput per energy since the off-chip communication can be reduced with an application-specific on-chip memory configuration. Furthermore, data-level parallelism can efficiently be exploited through socalled loop coarsening, which processes multiple horizontal pixels simultaneously. Moreover, existing solutions for proper border handling in hardware show considerable resource overheads. In this paper, we first propose novel architectures for image border handling and loop coarsening, which can significantly reduce area. Second, we present a systematic analysis of these architectures including the formulation of analytical models for their area usage. Based on these models, we provide an algorithm for suggesting the most efficient hardware architecture for a given specification. Finally, we evaluate several implementations of our proposed architectures obtained through Vivado High-Level Synthesis (HLS). 
The synthesis results show that the proposed coarsening architecture uses 32% less registers for a 5-by-5 convolution with a 64 coarsening factor compared to previous works, whereas the proposed border handling architectures facilitate a decrease in the Look-up Table (LUT) usage by 36 %.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114182345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
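The two techniques named in the abstract above — loop coarsening (producing several horizontal output pixels per step) and border handling (supplying valid neighbor values at the image edge) — can be illustrated with a hedged software sketch. This is a toy model of the concepts, not the paper's hardware architectures; the 3-by-3 box filter, clamp-to-edge policy, and all names are illustrative assumptions.

```python
# Software model of a local operator with (a) clamp-to-edge border
# handling, which replicates the nearest valid pixel outside the image,
# and (b) loop coarsening, which processes `coarsen` horizontal pixels
# per outer-loop step (the hardware analogue emits them in one cycle).

def clamp(i, lo, hi):
    # Border handling policy: indices outside [lo, hi] snap to the edge.
    return max(lo, min(i, hi))

def box3x3(img, coarsen=4):
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        # One iteration of this loop is one "coarsened" step: it yields
        # up to `coarsen` adjacent output pixels of the current row.
        for x0 in range(0, w, coarsen):
            for x in range(x0, min(x0 + coarsen, w)):
                s = 0
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        s += img[clamp(y + dy, 0, h - 1)][clamp(x + dx, 0, w - 1)]
                out[y][x] = s
    return out

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
# Corner (0,0): edge replication contributes 1+1+2 + 1+1+2 + 4+4+5 = 21
print(box3x3(img)[0][0])  # 21
```

In hardware, the point of the paper is that the replicated border pixels and the coarsened lanes can share logic and line buffers; the naive per-lane duplication this sketch performs is exactly the overhead their architectures reduce.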