Sorting is a fundamental operation in many applications such as databases, search, and social networks. Although FPGAs have been shown effective at sorting data sizes that fit on chip, systems that sort larger data sets by shuffling data on and off chip are typically bottlenecked by costly merge operations or data transfer time. We propose a new approach to sorting large data sets by accelerating the samplesort algorithm using a server with a PCIe-connected FPGA. Samplesort works by randomly sampling to determine how to partition data into approximately equal-sized non-overlapping "buckets," sorting each bucket, and concatenating the results. Although samplesort can partition a large problem into smaller ones that fit in the FPGA's on-chip memory, partitioning in software is slow. Our system uses a novel parallel hardware partitioner that is only limited in data set size by available FPGA hardware resources. After partitioning, each bucket is sorted using parallel sorting hardware. The CPU is responsible for sampling data, cleaning up any potential problems caused by variation in bucket size, and providing scalability by performing an initial coarse-grained partitioning when the input set is larger than the FPGA can sort. We prototype our design using Amazon Web Services FPGA instances, which pair a Xilinx Virtex UltraScale+ FPGA with a high-performance server. Our experiments demonstrate a 17.1x speedup over GNU parallel sort when sorting 2^23 key-value records and a speedup of 4.2x when sorting 2^30 records.
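The sample-partition-sort-concatenate flow described above can be sketched in plain C++. This is an illustrative sketch only, not the authors' hardware design; the oversampling factor, bucket count, and seed are arbitrary choices made here.

```cpp
#include <algorithm>
#include <cassert>
#include <random>
#include <vector>

// Illustrative samplesort on a vector of integer keys.
std::vector<int> samplesort(std::vector<int> data, int num_buckets = 4) {
    if (data.size() < 2) return data;
    // 1. Randomly sample keys, sort the sample, and pick evenly spaced
    //    splitters so buckets come out approximately equal-sized.
    std::mt19937 rng(42);
    std::vector<int> sample;
    for (int i = 0; i < 8 * num_buckets; ++i)
        sample.push_back(data[rng() % data.size()]);
    std::sort(sample.begin(), sample.end());
    std::vector<int> splitters;
    for (int b = 1; b < num_buckets; ++b)
        splitters.push_back(sample[b * sample.size() / num_buckets]);
    // 2. Partition into non-overlapping buckets (the paper's parallel
    //    hardware partitioner performs this step on the FPGA).
    std::vector<std::vector<int>> buckets(num_buckets);
    for (int key : data) {
        int b = std::upper_bound(splitters.begin(), splitters.end(), key)
                - splitters.begin();
        buckets[b].push_back(key);
    }
    // 3. Sort each bucket independently, then concatenate in splitter order.
    std::vector<int> out;
    for (auto& bucket : buckets) {
        std::sort(bucket.begin(), bucket.end());
        out.insert(out.end(), bucket.begin(), bucket.end());
    }
    return out;
}
```

Because the buckets are non-overlapping, no merge pass is needed after the per-bucket sorts, which is what lets the FPGA sort each bucket entirely on chip.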
"Sorting Large Data Sets with FPGA-Accelerated Samplesort." Han Chen, S. Madaminov, M. Ferdman, Peter Milder. 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), April 1, 2019. DOI: 10.1109/FCCM.2019.00067.
Side channels that leak information between circuit modules on the same device, or allow one module to influence another's behavior, are a security and trust concern for many applications, such as multi-tenant and multi-level-security designs on a single FPGA. Previous work placed a sensor on the same FPGA as a target module and showed that it could detect side-channel voltage variations. We build on this by creating a sensor with more programmability and sensitivity, improving the recovery of bit patterns from an isolated target. We demonstrate, for the first time, the recovery of an unknown target frequency and data-pattern length in a multi-user FPGA side-channel attack. We also show increased sensitivity over previously developed voltage sensors, enabling data recovery with fewer samples.
"Improved Techniques for Sensing Intra-Device Side Channel Leakage." William Hunter, Christopher McCarty, L. Lerner. 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), April 1, 2019. DOI: 10.1109/FCCM.2019.00069.
OpenCL promotes code portability and natively supports vectorized data types, allowing developers to take advantage of the single-instruction-multiple-data (SIMD) instructions on CPUs, GPUs, and FPGAs. FPGAs are becoming a promising heterogeneous computing component. In our study, we choose a kernel used in frequent pattern compression as a case study of OpenCL kernel vectorization on the three computing platforms. We describe different pattern matching approaches for the kernel and manually vectorize the OpenCL kernel by factors ranging from 2 to 16. We evaluate the kernel on a 16-core Intel Xeon CPU, an NVIDIA P100 GPU, and a Nallatech 385A FPGA card featuring an Intel Arria 10 GX1150 FPGA. Compared to the optimized but unvectorized kernel, our vectorization can improve kernel performance by a factor of 16 on the FPGA; the improvement ranges from 1 to 11.4 on the CPU and from 1.02 to 9.3 on the GPU. The effectiveness of kernel vectorization depends on the work-group size.
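The manual vectorization described above can be illustrated in plain C++. This is a hedged sketch, not the paper's kernel: a simple equality test stands in for the actual frequent-pattern-compression matching, and the factor-of-4 inner loop mirrors what rewriting an OpenCL kernel with `int4` instead of scalar `int` does.

```cpp
#include <cstddef>
#include <vector>

// Scalar baseline: one element per iteration.
std::vector<int> match_scalar(const std::vector<int>& in, int pattern) {
    std::vector<int> out(in.size());
    for (size_t i = 0; i < in.size(); ++i)
        out[i] = (in[i] == pattern) ? 1 : 0;
    return out;
}

// Manually vectorized by a factor of 4: four lanes per iteration, the way
// an OpenCL int4 rewrite processes data, plus a scalar tail loop.
std::vector<int> match_vec4(const std::vector<int>& in, int pattern) {
    std::vector<int> out(in.size());
    size_t i = 0;
    for (; i + 4 <= in.size(); i += 4)
        for (size_t lane = 0; lane < 4; ++lane)  // the four SIMD lanes
            out[i + lane] = (in[i + lane] == pattern) ? 1 : 0;
    for (; i < in.size(); ++i)  // remainder elements
        out[i] = (in[i] == pattern) ? 1 : 0;
    return out;
}
```

On an FPGA, widening the datapath this way lets the HLS compiler replicate the comparison logic four times per pipeline stage, which is where the factor-of-16 gain at width 16 comes from.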
"OpenCL Kernel Vectorization on the CPU, GPU, and FPGA: A Case Study with Frequent Pattern Compression." Zheming Jin, H. Finkel. 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), April 1, 2019. DOI: 10.1109/FCCM.2019.00071.
High-level synthesis (HLS) allows developers to be more productive in designing FPGA circuits thanks to familiar programming languages and high-level abstractions. In order to create high-performance circuits, HLS tools, such as Xilinx Vivado HLS, require following specific design patterns and techniques. Unfortunately, when applied to network packet processing tasks, these techniques limit code reuse and modularity, requiring developers to use deprecated programming conventions. We propose a methodology for developing high-speed networking applications using Vivado HLS for C++, focusing on reusability, code simplicity, and overall performance. Following this methodology, we implement a class library (ntl) with several building blocks that can be used in a wide spectrum of networking applications. We evaluate the methodology by implementing two applications: a UDP stateless firewall and a key-value store cache designed for FPGA-based SmartNICs, both processing packets at 40Gbps line-rate.
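The reuse-oriented style the methodology advocates can be hinted at in standard C++. The `stream` alias and `map_stage` template below are invented stand-ins for the paper's ntl building blocks, written so they run without an HLS toolchain; real HLS code would use `hls::stream` and pipelining pragmas.

```cpp
#include <queue>

// Software model of a hardware FIFO channel (stand-in for hls::stream<T>).
template <typename T>
using stream = std::queue<T>;

// A generic, reusable pipeline stage: applies f to each element flowing
// through. Instantiating this template per module is the kind of code
// reuse the methodology enables, instead of copy-pasting per-protocol
// loops for every pipeline stage.
template <typename In, typename Out, typename F>
void map_stage(stream<In>& in, stream<Out>& out, F f) {
    while (!in.empty()) {
        out.push(f(in.front()));
        in.pop();
    }
}
```

A firewall or cache pipeline is then composed by chaining such stages over streams, e.g. `map_stage(pkts, checked, verify_checksum)` followed by a filter stage, each stage remaining individually testable in software.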
"Design Patterns for Code Reuse in HLS Packet Processing Pipelines." Haggai Eran, Lior Zeno, Z. István, M. Silberstein. 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), April 1, 2019. DOI: 10.1109/FCCM.2019.00036.
We present a novel FPGA-based active stereo vision system tailored for use in a mobile 3D stereo camera. To generate a single 3D map, the matching algorithm uses a correlation approach in which multiple stereo image pairs, rather than a single one, are processed to guarantee improved depth resolution. To efficiently handle the large amount of incoming image data, we adapt the algorithm to the underlying FPGA structures, e.g., by making use of pipelining and parallelization. Experiments demonstrate that our approach provides high-quality 3D maps at least three times more energy-efficiently (5.5 fps/W) than comparable approaches executed on CPU and GPU platforms. Implemented on a Xilinx Zynq-7030 SoC, our system provides a computation speed of 12.2 fps at a resolution of 1.3 megapixels and a 128-pixel disparity search space. As such, it outperforms the currently best passive stereo systems of the Middlebury Stereo Evaluation in terms of speed and accuracy. The presented approach is therefore well suited for mobile applications that require a highly accurate and energy-efficient active stereo vision system.
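A correlation-based matcher of the kind described can be sketched as follows. This is illustrative only: the authors pipeline multiple image pairs in hardware, while this shows the core single-scanline search, using sum of absolute differences (SAD) as the correlation cost.

```cpp
#include <cstdlib>
#include <limits>
#include <vector>

// For one pixel x of the left scanline, find the disparity d in
// [0, max_disp) minimizing the SAD over a window of +/- `window` pixels.
// In hardware, all candidate disparities are evaluated in parallel.
int best_disparity(const std::vector<int>& left, const std::vector<int>& right,
                   int x, int window, int max_disp) {
    int best_d = 0;
    long best_cost = std::numeric_limits<long>::max();
    for (int d = 0; d < max_disp && x - d - window >= 0; ++d) {
        long cost = 0;
        for (int w = -window; w <= window; ++w)
            cost += std::abs(left[x + w] - right[x - d + w]);
        if (cost < best_cost) { best_cost = cost; best_d = d; }
    }
    return best_d;
}
```

The 128-pixel disparity search space in the paper corresponds to `max_disp = 128`, i.e. 128 such SAD costs per pixel, which is why pipelining and parallelization are essential at 1.3-megapixel resolution.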
"Active Stereo Vision with High Resolution on an FPGA." Marc Pfeifer, P. Scholl, R. Voigt, B. Becker. 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), April 1, 2019. DOI: 10.1109/FCCM.2019.00026.
In genome sequencing, detecting potential overlaps between any pair of input reads, especially ultra-long reads, is a crucial but time-consuming task. The state-of-the-art overlapping tool Minimap2 outperforms other popular tools in speed and accuracy. It has a single computational hot spot, chaining, which takes 70% of the runtime and needs to be accelerated. The nature of chaining raises several crucial issues for hardware acceleration. First, the original computation pattern is poorly parallelizable, and a direct implementation results in low utilization of parallel processing units. We propose a method to reorder the operation sequence, transforming the algorithm into a hardware-friendly form. Second, the large but variable input sizes make it hard to leverage task-level parallelism. We therefore customize a fine-grained task-dispatching scheme that keeps parallel PEs busy while satisfying the on-chip memory restriction. Based on these optimizations, we map the algorithm to a fully pipelined streaming architecture on an FPGA using HLS, achieving significant performance improvement. The principles of our acceleration design apply to both FPGAs and GPUs: compared to the multithreaded CPU baseline, our GPU accelerator achieves a 7x speedup, while our FPGA accelerator achieves 28x. We further conduct an architecture study to quantitatively analyze the architectural reasons for the performance difference. The summarized insights can serve as a guide for choosing the proper hardware acceleration platform.
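The chaining computation being accelerated follows a dynamic-programming recurrence over anchors. The sketch below is a deliberately simplified form with gap costs omitted and a unit match score, not Minimap2's exact formula; the nested loop over predecessors is the dependence pattern whose operation order the paper reorders for hardware.

```cpp
#include <algorithm>
#include <vector>

struct Anchor { long x, y; };  // reference and query positions of a match

// Simplified chaining scores: f[i] = max(w, max over compatible j < i of
// f[j] + w), where compatibility means both coordinates strictly increase.
// Minimap2 bounds the predecessor search to a fixed-size window, which is
// what makes a streaming hardware implementation feasible.
std::vector<long> chain_scores(const std::vector<Anchor>& a,
                               int window = 64, long w = 1) {
    std::vector<long> f(a.size());
    for (size_t i = 0; i < a.size(); ++i) {
        f[i] = w;  // option: start a new chain at anchor i
        size_t j0 = (i > (size_t)window) ? i - window : 0;
        for (size_t j = j0; j < i; ++j)
            if (a[j].x < a[i].x && a[j].y < a[i].y)
                f[i] = std::max(f[i], f[j] + w);  // extend chain ending at j
    }
    return f;
}
```

The hardware-unfriendly part is that each f[i] depends on up to `window` earlier f[j] values; the paper's reordering restructures this sequence so parallel PEs are not stalled by the chain of dependences.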
"Hardware Acceleration of Long Read Pairwise Overlapping in Genome Sequencing: A Race Between FPGA and GPU." Licheng Guo, Jason Lau, Zhenyuan Ruan, Peng Wei, J. Cong. 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), April 1, 2019. DOI: 10.1109/FCCM.2019.00027.
This paper introduces a fully free and open-source software (FOSS), architecture-neutral FPGA framework comprising Yosys for Verilog synthesis and nextpnr for placement, routing, and bitstream generation. Currently, this flow supports two commercially available FPGA families, Lattice iCE40 (up to 8K logic elements) and Lattice ECP5 (up to 85K logic elements), and has been hardware-proven for custom computing machines including a low-power neural-network accelerator and an OpenRISC system-on-chip capable of booting Linux. Both Yosys and nextpnr have been engineered in a highly flexible manner to support many of the features present in modern FPGAs by separating architecture-specific details from the common mapping algorithms. The framework is demonstrated on a longest-path case study that finds an atypical single source-sink path occupying up to 45% of all on-chip wiring.
"Yosys+nextpnr: An Open Source Framework from Verilog to Bitstream for Commercial FPGAs." David Shah, Eddie Hung, C. Wolf, Serge Bazanski, D. Gisselquist, Miodrag Milanovic. 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), March 25, 2019. DOI: 10.1109/FCCM.2019.00010.
Recent FPGA architectures integrate various power management features already established in CPU-driven SoCs to reach more energy-sensitive application domains such as automotive and robotics. This also qualifies hybrid Programmable SoCs (pSoCs), which combine fixed-function SoCs with configurable FPGA fabric, for heterogeneous Real-time Systems (RTSs) that operate under predefined latency and power constraints in safety-critical environments. Their complex application-specific computation and communication (incl. I/O) architectures result in highly varying power consumption, which requires precise voltage and current sensing on all relevant supply rails to enable dependable evaluation of available and novel power management techniques. In this paper, we propose a low-cost 18-channel measurement system with 16-bit resolution, capable of over 200 kSPS (kilo-samples per second), for instrumenting current pSoC development boards. In addition, we propose to include crucial I/O components such as Ethernet PHYs in the power monitoring to gain a holistic view of the RTS's temporal behavior, covering not only computation on the FPGA and CPUs but also communication, such as the reception of sensor values and the transmission of actuation signals. We present an FMC-sized implementation of our measurement system combined with two Gigabit Ethernet PHYs and one HDMI input. Paired with Xilinx's ZC702 development board, we can synchronously acquire power traces of a Zynq pSoC and the two PHYs precisely enough to identify individual Ethernet frames.
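The per-rail conversion such a monitor performs, from raw ADC codes to power, is simple arithmetic. The full-scale ranges and shunt resistance below are invented example values for illustration, not the parameters of the board or ADCs actually used.

```cpp
#include <cstdint>

// One synchronized sample of a supply rail: bus voltage code plus the
// voltage drop across a series shunt resistor.
struct RailSample { uint16_t bus_code, shunt_code; };

double power_watts(RailSample s,
                   double bus_fullscale_v = 32.0,     // assumed ADC range
                   double shunt_fullscale_v = 0.064,  // assumed 64 mV range
                   double r_shunt_ohm = 0.01) {       // assumed 10 mOhm shunt
    // 16-bit codes map linearly onto the full-scale voltage ranges.
    double v_bus = s.bus_code * bus_fullscale_v / 65536.0;
    double v_shunt = s.shunt_code * shunt_fullscale_v / 65536.0;
    double i = v_shunt / r_shunt_ohm;  // Ohm's law across the shunt
    return v_bus * i;                  // P = V * I
}
```

At 200 kSPS the monitor produces one such sample every 5 microseconds per channel, short enough to resolve the power signature of a single Ethernet frame on the wire.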
"Cost-Effective Energy Monitoring of a Zynq-Based Real-Time System Including Dual Gigabit Ethernet." M. Geier, Dominik Faller, Marian Brändle, S. Chakraborty. 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), March 22, 2019. DOI: 10.1109/FCCM.2019.00068.
High-Level Synthesis (HLS) brings FPGAs to audiences previously unfamiliar with hardware design. However, achieving the highest Quality of Results (QoR) with HLS is still unattainable for most programmers, as it requires detailed knowledge of FPGA architecture and hardware design to produce FPGA-friendly code. Moreover, such code normally conflicts with best coding practices, which favor code reuse, modularity, and conciseness. To overcome these limitations, we propose Module-per-Object (MpO), a human-driven HLS design methodology intended for both hardware designers and software developers with limited FPGA expertise. MpO exploits modern C++ to raise the abstraction level while improving QoR, code readability, and modularity. To guide HLS designers, we present the five characteristics of MpO classes. Each characteristic exploits HLS-supported modern C++ features to build C++-based hardware modules, leading to high-quality software descriptions and efficient hardware generation. We also present a use case of MpO in which C++ serves as the intermediate language for FPGA-targeted code generation from P4, a packet-processing domain-specific language. The MpO methodology is evaluated with three design experiments: a packet parser, a flow-based traffic manager, and a digital up-converter. Based on these experiments, we show that MpO can be comparable to handwritten VHDL while keeping a high abstraction level, a human-readable coding style, and modularity. Compared to traditional C-based HLS design, MpO leads to more efficient circuit generation in terms of both performance and resource utilization, and it notably improves software quality, augmenting parameterization while eliminating code duplication.
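The one-class-per-module idea can be sketched in plain C++. The class and its members below are invented for illustration and do not reproduce the paper's five MpO characteristics; they only show the mapping the methodology relies on, in which member state becomes registers and a method invocation models one clock cycle of the module.

```cpp
#include <cstdint>

// A reusable, parameterized "module": the template parameter plays the
// role of a VHDL generic, and the member variable maps to a register.
template <int Width>
class Counter {
    uint64_t count_ = 0;  // persistent state, i.e. a hardware register
public:
    // One invocation models one clock cycle; the return value is the
    // module's output port for that cycle.
    uint64_t step(bool enable) {
        if (enable) count_ = (count_ + 1) & ((1ULL << Width) - 1);
        return count_;
    }
};
```

Instantiating `Counter<4>` and `Counter<16>` yields two independent modules from one description, which is the parameterization and reuse benefit the abstract claims over copy-pasted C-style HLS functions with global state.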
"Module-per-Object: A Human-Driven Methodology for C++-Based High-Level Synthesis Design." Jeferson Santiago da Silva, F. Boyer, J. Langlois. 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), March 5, 2019. DOI: 10.1109/FCCM.2019.00037.