
2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines: Latest Publications

The Effect of Compiler Optimizations on High-Level Synthesis for FPGAs
Qijing Huang, Ruolong Lian, Andrew Canis, Jongsok Choi, R. Xi, S. Brown, J. Anderson
We consider the impact of compiler optimizations on the quality of high-level synthesis (HLS)-generated FPGA hardware. Using an HLS tool implemented within the state-of-the-art LLVM [1] compiler, we study the effect of compiler optimizations on the hardware metrics of circuit area, execution cycles, Fmax, and wall-clock time. We evaluate 56 different compiler optimizations implemented within LLVM and show that some optimizations significantly affect hardware quality. Moreover, we show that hardware quality is also affected by the order in which optimizations are applied. We then present a new HLS-directed approach to compiler optimizations, wherein we execute partial HLS and profiling at intermittent points in the optimization process and use the results to judiciously undo the impact of optimization passes predicted to be damaging to the generated hardware quality. Results show that our approach produces circuits with 16% better speed performance, on average, versus using the standard -O3 optimization level.
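The selection loop described above can be sketched as follows. This is only an illustration of the idea of running partial HLS and profiling at intermittent points and rolling back pass groups predicted to hurt the generated hardware; it is not the authors' tool, and `apply_pass` and `partial_hls_estimate` are hypothetical stand-ins for invoking LLVM passes and a partial-HLS quality estimate.

```python
import copy
import random

# Hypothetical stand-ins: a real flow would invoke LLVM optimization passes and a
# partial HLS + profiling run. Here a "pass" just perturbs a toy cycle estimate.
def apply_pass(ir, pass_name):
    new_ir = copy.deepcopy(ir)
    new_ir["est_cycles"] *= random.uniform(0.85, 1.15)   # a pass may help or hurt
    return new_ir

def partial_hls_estimate(ir):
    return ir["est_cycles"]            # stands in for estimated cycles x clock period

def optimize_for_hls(ir, passes, estimate_every=4):
    """Apply `passes` in order, rolling back any group that worsens the estimate."""
    best_ir, best_metric = ir, partial_hls_estimate(ir)
    for i, p in enumerate(passes, start=1):
        ir = apply_pass(ir, p)
        if i % estimate_every == 0:            # intermittent partial-HLS/profiling point
            metric = partial_hls_estimate(ir)
            if metric <= best_metric:          # group helped (or was neutral): keep it
                best_ir, best_metric = ir, metric
            else:                              # group predicted damaging: undo it
                ir = best_ir
    return best_ir

if __name__ == "__main__":
    seed_ir = {"est_cycles": 10000.0}
    result = optimize_for_hls(seed_ir, [f"pass{i}" for i in range(16)])
    print(partial_hls_estimate(result))
```

The point of the loop is that quality is judged on a hardware-oriented estimate (cycles times clock period) rather than on software heuristics, which is what lets the resulting schedule differ from a fixed -O3 ordering.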
Citations: 51
Hardware-Software Codesign for Embedded Numerical Acceleration
Ranko Sredojevic, A. Wright, V. Stojanović
In this work we aim to strike a balance between performance, power consumption, and design effort for complex digital signal processing within the power and size constraints of embedded systems. Looking across the design stack, from algorithm formulation down to accelerator microarchitecture, we show that a high degree of flexibility and design reuse can be achieved without much performance sacrifice. The foundation of our design is a numerical accelerator template. Extensively parameterized, it allows us to develop the design while postponing microarchitectural decisions until the program is known. A statically scheduling compiler provides the link between the algorithm and the template instantiation parameters. Results show that the derived design can significantly outperform embedded processors at a similar power cost and approach high-performance processor performance at a fraction of the power cost.
Citations: 0
Memory Access Scheduling on the Convey HC-1
Zheming Jin, J. Bakos
In this paper we describe a technique for scheduling memory accesses to improve effective memory bandwidth on the Convey HC-1 platform.
Citations: 2
Global Control and Storage Synthesis for a System Level Synthesis Approach
Shuo Li, Nasim Farahini, A. Hemani
SYLVA is a System Level Architectural Synthesis Framework that translates Synchronous Data Flow (SDF) models of DSP sub-systems like modems and codecs into hardware implementation in ASIC/Standard Cells, FPGAs or CGRAs (Coarse Grain Reconfigurable Fabric).
Citations: 4
Exploiting Input Parameter Uncertainty for Reducing Datapath Precision of SPICE Device Models
Nachiket Kapre
Double-precision computations operating on inputs with uncertainty margins can be compiled to lower precision fixed-point datapaths with no loss in output accuracy. We observe that ideal SPICE model equations based on device physics include process parameters which must be matched with real-world measurements on specific silicon manufacturing processes through a noisy data-fitting process. We expose this uncertainty information to the open-source FX-SCORE compiler to enable automated error analysis using the Gappa++ backend and hardware circuit generation using Vivado HLS. We construct an error model based on interval analysis to statically identify sufficient fixed-point precision in the presence of uncertainty, compared to a reference double-precision design. We demonstrate 1-16× LUT count improvements, 0.5-2.4× DSP count reductions, and 0.9-4× FPGA power reductions for SPICE devices such as the Diode, Level-1 MOSFET, and an Approximate MOSFET design. We generate confidence in our approach using Monte-Carlo simulations with auto-generated Matlab models of the SPICE device equations.
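As a rough, self-contained illustration of the idea (not the FX-SCORE or Gappa++ flow, and with made-up uncertainty margins and a toy device-equation term), the sketch below propagates input intervals widened by their uncertainty and then sizes a fixed-point format: the interval magnitude fixes the integer bits, and an error budget tied to the input uncertainty, rather than to machine epsilon, fixes the fraction bits.

```python
import math

def interval_add(a, b):
    return (a[0] + b[0], a[1] + b[1])

def interval_mul(a, b):
    # Standard interval product; treating the factors as independent is the
    # usual over-approximation of interval arithmetic.
    p = [a[0] * b[0], a[0] * b[1], a[1] * b[0], a[1] * b[1]]
    return (min(p), max(p))

def fixed_point_bits(interval, abs_error_budget):
    """Signed fixed-point format covering `interval` with quantization error
    below `abs_error_budget`: returns (integer_bits, fraction_bits)."""
    max_mag = max(abs(interval[0]), abs(interval[1]))
    int_bits = max(1, math.ceil(math.log2(max_mag + 1))) + 1        # +1 sign bit
    frac_bits = max(0, math.ceil(-math.log2(abs_error_budget)))     # 2^-frac <= budget
    return int_bits, frac_bits

# Toy device-equation term: i ~ k * (v - vt)^2, with the fitted process
# parameter k and threshold vt carrying illustrative uncertainty margins.
k  = (1.8e-5, 2.2e-5)      # +/-10% around a nominal 2.0e-5 (illustrative)
vt = (0.65, 0.75)          # volts, +/-50 mV (illustrative)
v  = (0.0, 3.3)            # operating voltage range

ov  = interval_add(v, (-vt[1], -vt[0]))          # v - vt
cur = interval_mul(k, interval_mul(ov, ov))      # k * (v - vt)^2

# Output accuracy only needs to match the uncertainty already in the inputs,
# so the error budget scales with the interval width, not with double precision.
budget = 0.01 * (cur[1] - cur[0])
print(fixed_point_bits(cur, budget))
```

A real flow would propagate such bounds through the complete device model and back them with automated error analysis, as the abstract does with the Gappa++ backend.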
Citations: 2
A Reconfigurable Architecture for 1-D and 2-D Discrete Wavelet Transform
Qing Sun, Jiang Jiang, Yongxin Zhu, Yuzhuo Fu
In this paper, we propose a novel architecture for the DWT that can be reconfigured to adapt to different kinds of filter banks and different sizes of inputs. High flexibility and generality are achieved by using the MAC-loop-based filter (MLBF). Classic methods, such as the polyphase structure and fragment-based sample consumption, are used to enhance the parallelism of the system. The architecture can be reconfigured into three modes to handle 1-D or 2-D DWT with different bandwidth and throughput requirements.
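For readers less familiar with the filter-bank formulation the architecture builds on, here is a minimal software reference for one level of a 1-D DWT (Haar filters, NumPy); it does not model the MLBF datapath or its polyphase scheduling.

```python
import numpy as np

def dwt1d_level(x, lo, hi):
    """One 1-D DWT level: filter with the low-/high-pass bank, then keep every
    other output sample (a polyphase structure avoids computing the discarded ones)."""
    approx = np.convolve(x, lo, mode="full")[1::2]
    detail = np.convolve(x, hi, mode="full")[1::2]
    return approx, detail

# Haar analysis filters; other banks (longer filters) drop in the same way,
# which is the kind of flexibility a reconfigurable filter core targets.
lo = np.array([1.0, 1.0]) / np.sqrt(2.0)
hi = np.array([1.0, -1.0]) / np.sqrt(2.0)

x = np.arange(8, dtype=float)
a, d = dwt1d_level(x, lo, hi)
print(a)   # 4 approximation coefficients
print(d)   # 4 detail coefficients
```

A polyphase realization computes only the samples that survive the downsampling, which is why the abstract lists it among the classic methods used to raise parallelism.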
Citations: 7
Latency-Optimized Networks for Clustering FPGAs
Trevor Bunker, S. Swanson
The data-intensive applications that will shape computing in the coming decades require scalable architectures that incorporate scalable data and compute resources and can support random requests to unstructured (e.g., logs) and semi-structured (e.g., large graph, XML) data sets. To explore the suitability of FPGAs for these computations, we are constructing an FPGA-based system with a memory capacity of 512 GB from a collection of 32 Virtex-5 FPGAs spread across 8 enclosures. This paper describes our work in exploring alternative interconnect technologies and network topologies for FPGA-based clusters. The diverse interconnects combine inter-enclosure high-speed serial links and wide, single-ended intra-enclosure on-board traces with network topologies that balance network diameter, network throughput, and FPGA resource usage. We discuss the architecture of high-radix routers in FPGAs that optimize for the asymmetry between the inter- and intra-enclosure links. We analyze the various interconnects that aim to efficiently utilize the prototype's total switching capacity of 2.43 Tb/s. The networks we present have aggregate throughputs up to 51.4 GB/s for random traffic, diameters as low as 845 nanoseconds, and consume less than 12% of the FPGAs' logic resources.
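As a back-of-the-envelope view of the asymmetry those routers optimize for, the toy model below uses purely hypothetical per-hop costs (assumptions for illustration, not measurements from the paper) to show how the time-domain diameter is dominated by the number of slow inter-enclosure serial hops on a worst-case route.

```python
# All constants are assumed, illustrative values, not figures from the paper.
T_SERIAL_NS = 250.0   # one inter-enclosure high-speed serial hop
T_BOARD_NS  = 60.0    # one intra-enclosure on-board (wide, single-ended) hop
T_ROUTER_NS = 40.0    # one router traversal

def path_latency_ns(serial_hops, board_hops):
    """Latency of a route as the sum of link and router traversal costs."""
    routers = serial_hops + board_hops + 1
    return serial_hops * T_SERIAL_NS + board_hops * T_BOARD_NS + routers * T_ROUTER_NS

# Two candidate worst-case routes between FPGAs in different enclosures:
print(path_latency_ns(serial_hops=3, board_hops=2))   # low-radix routers, more serial hops
print(path_latency_ns(serial_hops=1, board_hops=2))   # high-radix routers, one serial hop
```

Raising router radix cuts the serial hop count at the price of FPGA logic, which is the diameter/throughput/resource balance the paper's topologies explore.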
Citations: 17
Minerva: Accelerating Data Analysis in Next-Generation SSDs
Arup De, M. Gokhale, Rajesh K. Gupta, S. Swanson
Emerging non-volatile memory (NVM) technologies have DRAM-like latency with storage-like density, offering a unique capability to analyze large data sets significantly faster than flash or disk storage. However, the hybrid nature of these NVM technologies, such as phase-change memory (PCM), makes it difficult to use them to best advantage in the memory-storage hierarchy. These NVMs lack the fast write latency required of DRAM and are thus not suitable as a DRAM equivalent on the memory bus, yet their low latency, even in random access patterns, is not easily exploited over an I/O bus. In this work, we describe an FPGA-based system to execute application-specific operations in the NVM controller and evaluate its performance on two microbenchmarks and a key-value store. Our system, Minerva, extends the conventional solid-state drive (SSD) architecture to offload data- or I/O-intensive application code to the SSD to exploit the low latency and high internal bandwidth of NVMs. Performing computation in the FPGA-based NVM storage controller significantly reduces data traffic between the host and storage and serves as an offload engine for data analysis workloads. A runtime library enables the programmer to offload computations to the SSD without dealing with the complications of the underlying architecture and inter-controller communication management. We have implemented a prototype of Minerva on the BEE3 FPGA system. We compare the performance of Minerva to a state-of-the-art PCIe-attached PCM-based SSD. Minerva improves performance by an order of magnitude on two microbenchmarks. The Minerva-based key-value store performs up to 5.2 M get operations/s and 4.0 M set operations/s, which is 7.45× and 9.85× higher than the PCM-based SSD that uses the conventional I/O architecture. This improvement comes from the reduction of data transfer between storage and host and from the FPGA-based data processing in the SSD.
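To make the offload model concrete, here is a hypothetical host-side facade in the spirit of the runtime library the abstract mentions. The class, method names, and device path are invented for illustration and are not Minerva's actual API; a plain dict stands in for the on-device index so the example runs.

```python
class OffloadKVStore:
    """Hypothetical host-side facade for an on-SSD key-value accelerator."""

    def __init__(self, device_path: str):
        self.device = device_path   # e.g. a hypothetical "/dev/minerva0"
        self._model = {}            # stands in for the hash table kept inside the SSD

    def set(self, key: bytes, value: bytes) -> None:
        # Real system: enqueue a 'set' command to the storage controller, so only
        # the key and value cross the I/O bus, never the index pages.
        self._model[key] = value

    def get(self, key: bytes) -> bytes:
        # Real system: the FPGA-based controller walks its on-device index and
        # returns just the value, keeping lookup traffic inside the SSD.
        return self._model[key]

kv = OffloadKVStore("/dev/minerva0")   # hypothetical device node
kv.set(b"alpha", b"1")
print(kv.get(b"alpha"))
```

The host never sees the raw data pages, which is where the reported reduction in host-storage traffic comes from.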
Citations: 68
An Evaluation of High-Performance Embedded Processing on MPPAs
Zain-ul-Abdin, B. Svensson
Embedded signal processing faces the challenge of delivering increased performance while also achieving energy efficiency. Massively parallel processor arrays (MPPAs) consisting of hundreds of processing cores offer the possibility of meeting the growing performance demand in an energy-efficient way by exploiting parallelism instead of scaling the clock frequency of a single processor. In this paper we evaluate two selected commercial MPPA architectures. Our evaluation approach is to implement a real industrial application in the form of the compute-intensive parts of Synthetic Aperture Radar (SAR) systems.
Citations: 0
The Impact of Hardware Communication on a Heterogeneous Computing System
Shanyuan Gao, Bin Huang, R. Sass
This paper presents an MPI-like Message Passing Engine (MPE) as part of the on-chip network, providing point-to-point and collective communication primitives in hardware. On one hand, the MPE offloads the communication workload from the general processing elements. On the other hand, the MPE provides a direct interface to the heterogeneous processing elements, eliminating the data path through the OS and software libraries. Experimental results show that the MPE can significantly reduce communication time and improve overall performance, especially for heterogeneous computing systems.
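The MPE itself is a hardware block, so no software API from the paper is shown here. Purely as a reference point for the primitives it implements, the equivalent point-to-point and collective patterns in standard MPI (via mpi4py) look like this.

```python
# Software reference only: the MPE provides these semantics in the on-chip
# network; mpi4py over standard MPI is used here just to show the patterns.
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

# Point-to-point: rank 0 sends a small buffer to rank 1.
if rank == 0 and comm.Get_size() > 1:
    comm.send([1.0, 2.0, 3.0], dest=1, tag=0)
elif rank == 1:
    data = comm.recv(source=0, tag=0)
    print("rank 1 received", data)

# Collective: every rank contributes a value and all ranks receive the sum.
total = comm.allreduce(rank, op=MPI.SUM)
print(f"rank {rank}: allreduce sum = {total}")
```

Run with, for example, `mpiexec -n 2 python mpe_reference.py`; the paper's point is to provide these primitives directly in hardware so heterogeneous processing elements can use them without going through the OS and software libraries.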
Citations: 3