
2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM) — Latest Publications

Sorting Large Data Sets with FPGA-Accelerated Samplesort
Han Chen, S. Madaminov, M. Ferdman, Peter Milder
Sorting is a fundamental operation in many applications such as databases, search, and social networks. Although FPGAs have been shown effective at sorting data sizes that fit on chip, systems that sort larger data sets by shuffling data on and off chip are typically bottlenecked by costly merge operations or data transfer time. We propose a new approach to sorting large data sets by accelerating the samplesort algorithm using a server with a PCIe-connected FPGA. Samplesort works by randomly sampling to determine how to partition data into approximately equal-sized non-overlapping "buckets," sorting each bucket, and concatenating the results. Although samplesort can partition a large problem into smaller ones that fit in the FPGA's on-chip memory, partitioning in software is slow. Our system uses a novel parallel hardware partitioner that is only limited in data set size by available FPGA hardware resources. After partitioning, each bucket is sorted using parallel sorting hardware. The CPU is responsible for sampling data, cleaning up any potential problems caused by variation in bucket size, and providing scalability by performing an initial coarse-grained partitioning when the input set is larger than the FPGA can sort. We prototype our design using Amazon Web Services FPGA instances, which pair a Xilinx Virtex UltraScale+ FPGA with a high-performance server. Our experiments demonstrate a 17.1x speedup over GNU parallel sort when sorting 2^23 key-value records and a speedup of 4.2x when sorting 2^30 records.
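To make the samplesort structure concrete, the following is a minimal, software-only C++ sketch of the sample/partition/sort/concatenate steps described above; the record layout, sampling rate, and splitter selection are illustrative assumptions, not the authors' hardware design.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Software sketch of samplesort: sample splitters, partition records into
// non-overlapping buckets, sort each bucket, concatenate. In the paper the
// partitioning and per-bucket sorting run on the FPGA; this only shows the
// algorithm's structure.
struct Record { uint64_t key; uint64_t value; };

std::vector<Record> samplesort(const std::vector<Record>& input, std::size_t numBuckets) {
    // 1. Randomly sample keys and pick numBuckets-1 splitters.
    std::mt19937_64 rng(42);
    std::vector<uint64_t> sample;
    for (std::size_t i = 0; i < 16 * numBuckets; ++i)
        sample.push_back(input[rng() % input.size()].key);
    std::sort(sample.begin(), sample.end());
    std::vector<uint64_t> splitters;
    for (std::size_t b = 1; b < numBuckets; ++b)
        splitters.push_back(sample[b * sample.size() / numBuckets]);

    // 2. Partition into approximately equal-sized buckets
    //    (the parallel hardware partitioner in the paper).
    std::vector<std::vector<Record>> buckets(numBuckets);
    for (const Record& r : input) {
        std::size_t b = std::upper_bound(splitters.begin(), splitters.end(), r.key)
                        - splitters.begin();
        buckets[b].push_back(r);
    }

    // 3. Sort each bucket independently (parallel sorting hardware in the paper)
    //    and concatenate the results.
    std::vector<Record> output;
    for (auto& bucket : buckets) {
        std::sort(bucket.begin(), bucket.end(),
                  [](const Record& a, const Record& b) { return a.key < b.key; });
        output.insert(output.end(), bucket.begin(), bucket.end());
    }
    return output;
}
```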
DOI: 10.1109/FCCM.2019.00067 (published 2019-04-01)
Citations: 7
Improved Techniques for Sensing Intra-Device Side Channel Leakage
William Hunter, Christopher McCarty, L. Lerner
Side channels that introduce intra-device circuit-module information leakage or functional influence are a security and trust concern for many applications, such as multi-tenant and multi-level-security single-FPGA designs. Previous works used a sensor co-located with a target module on the same FPGA to detect side-channel voltage variations. We build on this by creating a sensor with greater programmability and sensitivity, resulting in improved recovery of bit patterns from an isolated target. We demonstrate for the first time the recovery of an unknown target frequency and data pattern length in a multi-user FPGA side-channel attack. We also show increased sensitivity over previously developed voltage sensors, enabling data recovery with fewer samples.
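As a rough illustration of what recovering a bit pattern from such a sensor involves, the hypothetical C++ snippet below thresholds a captured voltage trace into bits; the sensor itself, the sampling setup, and the polarity convention are assumptions, not the paper's design.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Illustrative only: average the sensor readings within each bit period and
// threshold against the trace mean to recover a bit pattern. The paper's sensor
// and recovery pipeline are more sophisticated.
std::vector<int> recoverBits(const std::vector<double>& trace, std::size_t samplesPerBit) {
    double mean = std::accumulate(trace.begin(), trace.end(), 0.0) / trace.size();
    std::vector<int> bits;
    for (std::size_t i = 0; i + samplesPerBit <= trace.size(); i += samplesPerBit) {
        double window = std::accumulate(trace.begin() + i,
                                        trace.begin() + i + samplesPerBit, 0.0) / samplesPerBit;
        // Higher switching activity in the target drags the supply voltage down,
        // so a below-average window is read as a '1' here (polarity is a guess).
        bits.push_back(window < mean ? 1 : 0);
    }
    return bits;
}
```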
DOI: 10.1109/FCCM.2019.00069 (published 2019-04-01)
Citations: 0
OpenCL Kernel Vectorization on the CPU, GPU, and FPGA: A Case Study with Frequent Pattern Compression
Zheming Jin, H. Finkel
OpenCL promotes code portability, and natively supports vectorized data types, which allows developers to potentially take advantage of the single-instruction-multiple-data instructions on CPUs, GPUs, and FPGAs. FPGAs are becoming a promising heterogeneous computing component. In our study, we choose a kernel used in frequent pattern compression as a case study of OpenCL kernel vectorizations on the three computing platforms. We describe different pattern matching approaches for the kernel, and manually vectorize the OpenCL kernel by a factor ranging from 2 to 16. We evaluate the kernel on an Intel Xeon 16-core CPU, an NVIDIA P100 GPU, and a Nallatech 385A FPGA card featuring an Intel Arria 10 GX1150 FPGA. Compared to the optimized kernel that is not vectorized, our vectorization can improve the kernel performance by a factor of 16 on the FPGA. The performance improvement ranges from 1 to 11.4 on the CPU, and from 1.02 to 9.3 on the GPU. The effectiveness of kernel vectorization depends on the work-group size.
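The C++ sketch below illustrates the two ideas the study combines: classifying words against frequent-pattern-compression (FPC) patterns and processing a fixed vectorization factor of words per iteration. The pattern table is a simplified guess, and the paper expresses the vectorization with OpenCL vector types (e.g. uint4) rather than plain C++.

```cpp
#include <cstddef>
#include <cstdint>

// Simplified FPC-style pattern set; the paper's exact table may differ.
enum Pattern : uint8_t { ZERO, SIGN_EXT_8, SIGN_EXT_16, REPEATED_BYTE, UNCOMPRESSED };

static Pattern classify(uint32_t w) {
    if (w == 0) return ZERO;
    if ((int32_t)(int8_t)(w & 0xFF) == (int32_t)w) return SIGN_EXT_8;      // fits in a sign-extended byte
    if ((int32_t)(int16_t)(w & 0xFFFF) == (int32_t)w) return SIGN_EXT_16;  // fits in a sign-extended halfword
    uint32_t b = w & 0xFF;
    if (w == (b | b << 8 | b << 16 | b << 24)) return REPEATED_BYTE;       // same byte four times
    return UNCOMPRESSED;
}

// Vectorization factor 4: each "work-item" handles four consecutive words,
// which is what the manual OpenCL vectorization expresses with uint4.
void classifyBlock(const uint32_t* words, Pattern* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        out[i + 0] = classify(words[i + 0]);
        out[i + 1] = classify(words[i + 1]);
        out[i + 2] = classify(words[i + 2]);
        out[i + 3] = classify(words[i + 3]);
    }
    for (; i < n; ++i) out[i] = classify(words[i]);  // scalar tail
}
```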
DOI: 10.1109/FCCM.2019.00071 (published 2019-04-01)
Citations: 0
Design Patterns for Code Reuse in HLS Packet Processing Pipelines
Haggai Eran, Lior Zeno, Z. István, M. Silberstein
High-level synthesis (HLS) allows developers to be more productive in designing FPGA circuits thanks to familiar programming languages and high-level abstractions. In order to create high-performance circuits, HLS tools, such as Xilinx Vivado HLS, require following specific design patterns and techniques. Unfortunately, when applied to network packet processing tasks, these techniques limit code reuse and modularity, requiring developers to use deprecated programming conventions. We propose a methodology for developing high-speed networking applications using Vivado HLS for C++, focusing on reusability, code simplicity, and overall performance. Following this methodology, we implement a class library (ntl) with several building blocks that can be used in a wide spectrum of networking applications. We evaluate the methodology by implementing two applications: a UDP stateless firewall and a key-value store cache designed for FPGA-based SmartNICs, both processing packets at 40Gbps line-rate.
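Below is a minimal sketch of the kind of stream-in/stream-out building block such a methodology composes, assuming the Xilinx Vivado HLS headers (hls_stream.h, ap_int.h); the bus width, the filtering rule, and the struct layout are illustrative and do not reproduce the ntl API.

```cpp
#include <ap_int.h>
#include <hls_stream.h>

struct BusWord {
    ap_uint<512> data;   // one 64-byte data-bus word per cycle
    ap_uint<1> last;     // 1 on the final word of a packet (AXI-Stream TLAST)
};

// Pass packets through, dropping any packet whose first word fails a predicate
// (a stand-in for one stateless firewall rule). Processes one bus word per call.
void filter_stage(hls::stream<BusWord>& in, hls::stream<BusWord>& out) {
#pragma HLS PIPELINE II=1
    static bool first_word = true;
    static bool drop = false;

    if (in.empty()) return;
    BusWord w = in.read();
    if (first_word)
        drop = (w.data(15, 0) == 0);   // illustrative rule: drop if the first 16 bits are zero
    if (!drop)
        out.write(w);
    first_word = (w.last == 1);        // the next word starts a new packet after TLAST
}
```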
DOI: 10.1109/FCCM.2019.00036 (published 2019-04-01)
Citations: 9
[Publisher's information]
DOI: 10.1109/fccm.2019.00081 (published 2019-04-01)
Citations: 0
Active Stereo Vision with High Resolution on an FPGA
Marc Pfeifer, P. Scholl, R. Voigt, B. Becker
We present a novel FPGA-based active stereo vision system, tailored for use in a mobile 3D stereo camera. For the generation of a single 3D map, the matching algorithm is based on a correlation approach in which multiple stereo image pairs, instead of a single one, are processed to guarantee an improved depth resolution. To efficiently handle the large amounts of incoming image data, we adapt the algorithm to the underlying FPGA structures, e.g. by making use of pipelining and parallelization. Experiments demonstrate that our approach provides high-quality 3D maps while being at least three times more energy-efficient (5.5 fps/W) than comparable approaches executed on CPU and GPU platforms. Implemented on a Xilinx Zynq-7030 SoC, our system provides a computation speed of 12.2 fps at a resolution of 1.3 megapixels and a 128-pixel disparity search space. As such, it outperforms the current best passive stereo systems of the Middlebury Stereo Evaluation in terms of speed and accuracy. The presented approach is therefore well suited for mobile applications that require a highly accurate and energy-efficient active stereo vision system.
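For intuition, here is a simplified software sketch of correlation-based disparity search that accumulates matching cost over several stereo image pairs before choosing a disparity; the SAD cost, window size, and data layout are assumptions rather than the paper's pipeline.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <limits>
#include <vector>

// For one pixel (x, y), sum the SAD matching cost over all stereo pairs for each
// candidate disparity and return the disparity with the lowest accumulated cost.
// The FPGA design pipelines and parallelizes this search.
int bestDisparity(const std::vector<const uint8_t*>& left,   // one image pointer per pair
                  const std::vector<const uint8_t*>& right,
                  int width, int x, int y, int maxDisparity, int window = 3) {
    int best = 0;
    uint64_t bestCost = std::numeric_limits<uint64_t>::max();
    for (int d = 0; d <= maxDisparity; ++d) {
        uint64_t cost = 0;
        for (std::size_t p = 0; p < left.size(); ++p)          // accumulate over all pairs
            for (int dy = -window; dy <= window; ++dy)
                for (int dx = -window; dx <= window; ++dx) {
                    int l = left[p][(y + dy) * width + (x + dx)];
                    int r = right[p][(y + dy) * width + (x + dx - d)];
                    cost += static_cast<uint64_t>(std::abs(l - r));
                }
        if (cost < bestCost) { bestCost = cost; best = d; }
    }
    return best;  // caller must keep the window and disparity inside the image bounds
}
```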
DOI: 10.1109/FCCM.2019.00026 (published 2019-04-01)
Citations: 1
Hardware Acceleration of Long Read Pairwise Overlapping in Genome Sequencing: A Race Between FPGA and GPU
Licheng Guo, Jason Lau, Zhenyuan Ruan, Peng Wei, J. Cong
In genome sequencing, it is a crucial but time-consuming task to detect potential overlaps between any pair of the input reads, especially those that are ultra-long. The state-of-the-art overlapping tool Minimap2 outperforms other popular tools in speed and accuracy. It has a single computing hot-spot, chaining, that takes 70% of the time and needs to be accelerated. There are several crucial issues for hardware acceleration because of the nature of chaining. First, the original computation pattern is poorly parallelizable and a direct implementation will result in low utilization of parallel processing units. We propose a method to reorder the operation sequence that transforms the algorithm into a hardware-friendly form. Second, the large but variable sizes of input data make it hard to leverage task-level parallelism. Therefore, we customize a fine-grained task dispatching scheme which could keep parallel PEs busy while satisfying the on-chip memory restriction. Based on these optimizations, we map the algorithm to a fully pipelined streaming architecture on FPGA using HLS, which achieves significant performance improvement. The principles of our acceleration design apply to both FPGA and GPU. Compared to the multi-threading CPU baseline, our GPU accelerator achieves 7x acceleration, while our FPGA accelerator achieves 28x acceleration. We further conduct an architecture study to quantitatively analyze the architectural reason for the performance difference. The summarized insights could serve as a guide on choosing the proper hardware acceleration platform.
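A stripped-down C++ version of the chaining recurrence helps show why the original computation pattern is hard to parallelize: every anchor depends on a window of its predecessors. The score function below is a toy approximation, not Minimap2's exact formula.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Anchor { int64_t x, y; int32_t w; };  // reference position, query position, anchor weight

// f[i] = best chain score ending at anchor i, built by scanning a window of
// previous anchors (the inner loop is the ~70% hot spot the paper accelerates).
std::vector<int32_t> chain(const std::vector<Anchor>& a, int maxLookback = 64) {
    std::vector<int32_t> f(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        f[i] = a[i].w;                                   // start a new chain at anchor i
        std::size_t lo = i > (std::size_t)maxLookback ? i - maxLookback : 0;
        for (std::size_t j = lo; j < i; ++j) {           // depends on earlier results f[j]
            if (a[j].x >= a[i].x || a[j].y >= a[i].y) continue;
            int64_t gap = std::max(a[i].x - a[j].x, a[i].y - a[j].y);
            int32_t score = f[j] + a[i].w - (int32_t)(gap / 100);  // toy gap penalty
            f[i] = std::max(f[i], score);
        }
    }
    return f;
}
```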
DOI: 10.1109/FCCM.2019.00027 (published 2019-04-01)
Citations: 55
Yosys+nextpnr: An Open Source Framework from Verilog to Bitstream for Commercial FPGAs
David Shah, Eddie Hung, C. Wolf, Serge Bazanski, D. Gisselquist, Miodrag Milanovic
This paper introduces a fully free and open source software (FOSS), architecture-neutral FPGA framework comprising Yosys for Verilog synthesis and nextpnr for placement, routing, and bitstream generation. Currently, this flow supports two commercially available FPGA families, Lattice iCE40 (up to 8K logic elements) and Lattice ECP5 (up to 85K elements), and has been hardware-proven for custom-computing machines including a low-power neural-network accelerator and an OpenRISC system-on-chip capable of booting Linux. Both Yosys and nextpnr have been engineered in a highly flexible manner to support many of the features present in modern FPGAs by separating architecture-specific details from the common mapping algorithms. This framework is demonstrated on a longest-path case study that finds an atypical single source-sink path occupying up to 45% of all on-chip wiring.
DOI: 10.1109/FCCM.2019.00010 (published 2019-03-25)
Citations: 61
Cost-Effective Energy Monitoring of a Zynq-Based Real-Time System Including Dual Gigabit Ethernet
M. Geier, Dominik Faller, Marian Brändle, S. Chakraborty
Recent FPGA architectures integrate various power management features already established in CPU-driven SoCs to reach more energy-sensitive application domains such as automotive and robotics. This also qualifies hybrid Programmable SoCs (pSoCs), which combine fixed-function SoCs with configurable FPGA fabric, for heterogeneous Real-time Systems (RTSs) that operate under predefined latency and power constraints in safety-critical environments. Their complex application-specific computation and communication (incl. I/O) architectures result in highly varying power consumption, which requires precise voltage and current sensing on all relevant supply rails to enable dependable evaluation of available and novel power management techniques. In this paper, we propose a low-cost 18-channel 16-bit-resolution measurement system capable of over 200 kSPS (kilo-samples per second) for instrumentation of current pSoC development boards. In addition, we propose to include crucial I/O components such as Ethernet PHYs in the power monitoring to gain a holistic view of the RTS's temporal behavior, covering not only computation on the FPGA and CPUs but also communication, e.g., the reception of sensor values and the transmission of actuation signals. We present an FMC-sized implementation of our measurement system combined with two Gigabit Ethernet PHYs and one HDMI input. Paired with Xilinx's ZC702 development board, we are able to synchronously acquire power traces of a Zynq pSoC and the two PHYs precise enough to identify individual Ethernet frames.
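As a small illustration of the post-processing side of such a monitor, the hypothetical C++ snippet below turns paired voltage/current ADC codes into an instantaneous power trace; the channel mapping, shunt values, and scaling factors are assumptions, not the board design presented in the paper.

```cpp
#include <cstdint>
#include <vector>

// One 16-bit voltage code and one 16-bit shunt-current code per sample of a rail.
struct RailSample { uint16_t voltageCode; uint16_t currentCode; };

// Convert ADC codes to an instantaneous power trace in watts, given the
// per-LSB scaling of the voltage and current channels (hypothetical values).
std::vector<double> powerTrace(const std::vector<RailSample>& samples,
                               double voltsPerLsb, double ampsPerLsb) {
    std::vector<double> watts;
    watts.reserve(samples.size());
    for (const RailSample& s : samples)
        watts.push_back((s.voltageCode * voltsPerLsb) * (s.currentCode * ampsPerLsb));
    return watts;  // one power value per sample pair
}
```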
DOI: 10.1109/FCCM.2019.00068 (published 2019-03-22)
Citations: 4
Module-per-Object: A Human-Driven Methodology for C++-Based High-Level Synthesis Design
Jeferson Santiago da Silva, F. Boyer, J. Langlois
High-Level Synthesis (HLS) brings FPGAs to audiences previously unfamiliar with hardware design. However, achieving the highest Quality-of-Results (QoR) with HLS is still unattainable for most programmers, as it requires detailed knowledge of FPGA architecture and hardware design in order to produce FPGA-friendly code. Moreover, such code normally conflicts with best coding practices, which favor code reuse, modularity, and conciseness. To overcome these limitations, we propose Module-per-Object (MpO), a human-driven HLS design methodology intended for both hardware designers and software developers with limited FPGA expertise. MpO exploits modern C++ to raise the abstraction level while improving QoR, code readability, and modularity. To guide HLS designers, we present the five characteristics of MpO classes. Each characteristic exploits the power of HLS-supported modern C++ features to build C++-based hardware modules. These characteristics lead to high-quality software descriptions and efficient hardware generation. We also present a use case of MpO, where we use C++ as the intermediate language for FPGA-targeted code generation from P4, a packet-processing domain-specific language. The MpO methodology is evaluated using three design experiments: a packet parser, a flow-based traffic manager, and a digital up-converter. Based on the experiments, we show that MpO can be comparable to handwritten VHDL code while keeping a high abstraction level, human-readable coding style, and modularity. Compared to traditional C-based HLS design, MpO leads to more efficient circuit generation, both in terms of performance and resource utilization. Also, the MpO approach notably improves software quality, augmenting parameterization while eliminating the incidence of code duplication.
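To give a flavor of the module-per-object style, here is a hedged C++ sketch (assuming the Vivado HLS headers hls_stream.h and ap_int.h) in which a hardware module is a class that owns its state and is invoked through operator(); the MovingSum module is invented for illustration and is not one of the paper's MpO classes.

```cpp
#include <ap_int.h>
#include <hls_stream.h>

// Each hardware module is a C++ class: state lives in the object, the streaming
// interface is the call operator, and templates make the module reusable.
template <int W>
class MovingSum {
    ap_uint<W + 4> acc = 0;          // running sum of the last 16 samples
    ap_uint<W> window[16] = {};      // sample history
    ap_uint<4> head = 0;             // 4-bit index wraps at 16 automatically
public:
    void operator()(hls::stream<ap_uint<W>>& in, hls::stream<ap_uint<W + 4>>& out) {
#pragma HLS PIPELINE II=1
        if (in.empty()) return;
        ap_uint<W> x = in.read();
        acc += x;
        acc -= window[head];         // subtract the sample leaving the window
        window[head] = x;
        head++;
        out.write(acc);
    }
};

// Usage: instantiate one object per hardware module and call it from the top level.
void top(hls::stream<ap_uint<16>>& in, hls::stream<ap_uint<20>>& out) {
    static MovingSum<16> sum;        // static keeps the module's state across calls
    sum(in, out);
}
```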
DOI: 10.1109/FCCM.2019.00037 (published 2019-03-05)
Citations: 8