When dealing with multiple data consumers and producers in a highly parallel accelerator architecture, the challenge arises of how to coordinate requests to memory. An example of such an accelerator is a coarse-grained reconfigurable array (CGRA). CGRAs consist of multiple processing elements (PEs) that can consume and produce data. On the one hand, the resulting load and store requests to memory need to be orchestrated such that the CGRA does not deadlock when connected to a cache hierarchy that responds to memory requests out of request order. On the other hand, multiple consumers and producers open up the possibility of making better use of the available memory bandwidth, keeping the cache constantly busy. We call the unit that addresses these challenges and opportunities the frontend (FE).
We propose a synthesizable FE for the HiPReP CGRA that enables integration with a RISC-V-based host system. Based on an example application, we showcase a methodology to match the number of consumers and producers (i.e., PEs) with the memory hierarchy such that the CGRA can efficiently harness the available L1 data cache bandwidth, reaching 99.6% of the theoretical peak bandwidth in a synthetic benchmark and enabling a speedup of up to 21.9x over an out-of-order processor for dense matrix-matrix multiplications. Moreover, we explore the FE design, the impact of different numbers of PEs and memory access patterns, and synthesis results, and we compare the accelerator runtime with the runtime on the host itself as a baseline.