Hiroki Nakahara, H. Yonekawa, H. Iwamoto, M. Motomura
A pre-trained convolutional deep neural network (CNN) performs feed-forward computation and is widely used in embedded systems, which require high power and area efficiency. This paper realizes a binarized CNN that restricts both the inputs and the weights to the two values +1/-1. In this case, each multiplier is replaced by an XNOR circuit instead of a dedicated DSP block, so binarized inputs and weights are well suited to hardware implementation. However, a binarized CNN requires batch normalization to retain classification accuracy. The additional multiplications and additions then require extra hardware, and the memory accesses for the normalization parameters reduce system performance. In this paper, we propose a batch-normalization-free CNN that is mathematically equivalent to a CNN using batch normalization. The proposed CNN uses binarized inputs and weights together with an integer bias. We implemented the VGG-16 benchmark CNN on the NetFPGA-SUME board, which carries a Xilinx Virtex-7 FPGA and three off-chip QDR II+ synchronous SRAMs. Compared with conventional FPGA realizations, the classification error rate is 6.5% worse, but the design is 2.82 times faster, consumes 1.76 times less power, and uses 11.03 times less area. Thus, our method is suitable for embedded computer systems.
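The claimed equivalence can be seen with a short derivation. The sketch below assumes the usual sign activation and a positive learned scale; it may differ in detail from the authors' exact formulation.

```latex
% Pre-activation of one binarized neuron: y = \sum_i w_i x_i with w_i, x_i \in \{-1,+1\},
% so y is always an integer.  Batch normalization followed by the sign activation gives
% (for learned scale \gamma > 0; a negative \gamma only flips the comparison):
\[
  \operatorname{sign}\!\left(\gamma\,\frac{y-\mu}{\sqrt{\sigma^{2}+\varepsilon}}+\beta\right)
  = \operatorname{sign}\!\left(y-\Bigl(\mu-\frac{\beta\sqrt{\sigma^{2}+\varepsilon}}{\gamma}\Bigr)\right)
  = \operatorname{sign}(y+b),
  \qquad
  b = -\left\lceil \mu-\frac{\beta\sqrt{\sigma^{2}+\varepsilon}}{\gamma} \right\rceil .
\]
% Because y is an integer, the real-valued threshold can be rounded to an integer bias b
% (up to tie-breaking at zero), so the run-time scaling, shifting and parameter fetches of
% batch normalization disappear and only one integer addition per neuron remains.
```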
{"title":"A Batch Normalization Free Binarized Convolutional Deep Neural Network on an FPGA (Abstract Only)","authors":"Hiroki Nakahara, H. Yonekawa, H. Iwamoto, M. Motomura","doi":"10.1145/3020078.3021782","DOIUrl":"https://doi.org/10.1145/3020078.3021782","url":null,"abstract":"A pre-trained convolutional deep neural network (CNN) is a feed-forward computation perspective, which is widely used for the embedded systems, requires high power-and-area efficiency. This paper realizes a binarized CNN which treats only binary 2-values (+1/-1) for the inputs and the weights. In this case, the multiplier is replaced into an XNOR circuit instead of a dedicated DSP block. For hardware implementation, using binarized inputs and weights is more suitable. However, the binarized CNN requires the batch normalization techniques to retain the classification accuracy. In that case, the additional multiplication and addition require extra hardware, also, the memory access for its parameters reduces system performance. In this paper, we propose the batch normalization free CNN which is mathematically equivalent to the CNN using batch normalization. The proposed CNN treats the binarized inputs and weights with the integer bias. We implemented the VGG-16 benchmark CNN on the NetFPGA-SUME FPGA board, which has the Xilinx Inc. Virtex7 FPGA and three off-chip QDR II+ Synchronous SRAMs. Compared with the conventional FPGA realizations, although the classification error rate is 6.5% decayed, the performance is 2.82 times faster, the power efficiency is 1.76 times lower, and the area efficiency is 11.03 times smaller. Thus, our method is suitable for the embedded computer system.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124504414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Virtualization and Applications","authors":"J. Lockwood","doi":"10.1145/3257191","DOIUrl":"https://doi.org/10.1145/3257191","url":null,"abstract":"","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"388 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123264474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Special Session: The Role of FPGAs in Deep Learning","authors":"A. Ling","doi":"10.1145/3257183","DOIUrl":"https://doi.org/10.1145/3257183","url":null,"abstract":"","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"255 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115783420","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Session details: Architecture","authors":"S. Wilton","doi":"10.1145/3257186","DOIUrl":"https://doi.org/10.1145/3257186","url":null,"abstract":"","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124889942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In high-performance applications, such as quantum physics and positron emission tomography, precise coincidence detection is of central importance: the quality of the reconstructed images depends on the accuracy with which the underlying system detects the coincidence of two events. This paper explores the utility of three different hardware modules for this task. In contrast to most state-of-the-art systems, these modules are edge-triggered rather than voltage-level based. This change in the modus operandi improves the accuracy of the resulting coincidence window by about one order of magnitude. In addition, this paper considers entire detector arrays, which host a large number of selected detectors. Due to the additional signal propagation delays, these arrays yield a coincidence window width as short as 70 ps within an effective range of up to 10 ns.
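As a behavioral reference for what a coincidence detector computes, the sketch below counts coincidences between two sorted timestamp streams for a fixed window. The function names, the picosecond units and the 70 ps window are illustrative assumptions; the edge-triggered FPGA modules in the abstract are implemented very differently.

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Count pairs (a[i], b[j]) whose timestamps differ by at most `window`.
 * Both arrays are assumed to be sorted ascending (timestamps in picoseconds).
 * Software reference model of the coincidence criterion only. */
static size_t count_coincidences(const int64_t *a, size_t na,
                                 const int64_t *b, size_t nb,
                                 int64_t window)
{
    size_t i = 0, j = 0, hits = 0;
    while (i < na && j < nb) {
        int64_t dt = a[i] - b[j];
        if (dt > window)        j++;        /* b[j] is too old       */
        else if (dt < -window)  i++;        /* a[i] is too old       */
        else { hits++; i++; j++; }          /* |dt| <= window: hit   */
    }
    return hits;
}

int main(void)
{
    /* 70 ps coincidence window, timestamps in picoseconds (made-up data). */
    const int64_t det_a[] = { 1000, 2050, 9000 };
    const int64_t det_b[] = { 1040, 2300, 9010 };
    printf("coincidences: %zu\n",
           count_coincidences(det_a, 3, det_b, 3, 70));
    return 0;
}
```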
{"title":"Precise Coincidence Detection on FPGAs: Three Case Studies (Abstract Only)","authors":"R. Salomon, R. Joost","doi":"10.1145/3020078.3021766","DOIUrl":"https://doi.org/10.1145/3020078.3021766","url":null,"abstract":"In high-performance applications, such as quantum physics and positron emission tomography, precise coincidence detection is of central importance: The quality of the reconstructed images depends on the accuracy with which the underlying system detects the coincidence of two events. This paper explores the utility of three different hardware modules for this very task. In contrast to most of the state-of-the-art systems, these modules are edge triggered rather than being voltage-level based. This change in the modus operandi increases the accuracy of the resulting coincidence window by about one order of magnitude. In addition, this paper considers the entire detector arrays, which host a large number of selected detectors. Due to additional signal propagation delays, these arrays yield a coincidence window width as short as 70 ps within an effective range of up to 10 ns.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"55 10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122848340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large graph processing has gained great attention in recent years due to its broad applicability, from machine learning to social science. Large real-world graphs, however, are inherently difficult to process efficiently, not only because of their large memory footprint, but also because most graph algorithms entail memory access patterns with poor locality and a low compute-to-memory-access ratio. In this work, we leverage the exceptional random-access performance of the emerging Hybrid Memory Cube (HMC) technology, which stacks multiple DRAM dies on top of a logic layer, combined with the flexibility and efficiency of FPGAs, to address these challenges. To the best of our knowledge, this is the first work that implements a graph processing system on an FPGA-HMC platform based on software/hardware co-design and co-optimization. We first present algorithm modifications and a platform-aware graph processing architecture to perform level-synchronized breadth-first search (BFS) on the FPGA-HMC platform. To gain better insight into the potential bottlenecks of the proposed implementation, we develop an analytical performance model to quantitatively evaluate the HMC access latency and the corresponding BFS performance. Based on this analysis, we propose a two-level bitmap scheme to further reduce memory accesses and optimize key design parameters (e.g., memory access granularity). Finally, we evaluate the performance of our BFS implementation using the AC-510 development kit from Micron. We achieve 166 million traversed edges per second (MTEPS) on the Graph500 benchmark with a random graph of scale 25 and edge factor 16, which significantly outperforms CPU and other FPGA-based large-graph processors.
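To make the algorithmic side concrete, here is a minimal software sketch of level-synchronized BFS in which the frontier is a bitmap and a coarser summary bitmap lets the scan skip empty 64-word blocks. It illustrates the general idea of a two-level bitmap under the assumption that it resembles, but is not necessarily identical to, the scheme the authors map to the FPGA-HMC platform.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Compressed sparse row graph. */
typedef struct { uint32_t n; const uint32_t *row; const uint32_t *col; } Graph;

#define WORD(v)  ((v) >> 6)
#define BIT(v)   (1ULL << ((v) & 63))

/* Level-synchronized BFS.  `cur`/`next` are frontier bitmaps with one bit per
 * vertex; `scur`/`snext` are second-level bitmaps with one bit per 64-bit
 * frontier word, so whole empty blocks (4096 vertices) are skipped without
 * being read.  Writes BFS levels into `level` (UINT32_MAX = unreached) and
 * returns the number of levels. */
static uint32_t bfs(const Graph *g, uint32_t src, uint32_t *level)
{
    size_t words = (g->n + 63) / 64, swords = (words + 63) / 64;
    uint64_t *cur  = calloc(words, 8),  *next  = calloc(words, 8);
    uint64_t *scur = calloc(swords, 8), *snext = calloc(swords, 8);
    memset(level, 0xff, g->n * sizeof(uint32_t));

    level[src] = 0;
    cur[WORD(src)] |= BIT(src);
    scur[WORD(WORD(src))] |= BIT(WORD(src));

    uint32_t depth = 0;
    for (int active = 1; active; depth++) {
        active = 0;
        for (size_t sw = 0; sw < swords; sw++) {
            if (!scur[sw]) continue;                    /* skip empty block    */
            for (size_t w = sw * 64; w < (sw + 1) * 64 && w < words; w++) {
                uint64_t bits = cur[w];
                while (bits) {                          /* visit set vertices  */
                    uint32_t v = (uint32_t)(w * 64) + (uint32_t)__builtin_ctzll(bits);
                    bits &= bits - 1;
                    for (uint32_t e = g->row[v]; e < g->row[v + 1]; e++) {
                        uint32_t u = g->col[e];
                        if (level[u] == UINT32_MAX) {   /* first visit         */
                            level[u] = depth + 1;
                            next[WORD(u)]        |= BIT(u);
                            snext[WORD(WORD(u))] |= BIT(WORD(u));
                            active = 1;
                        }
                    }
                }
            }
        }
        uint64_t *t;
        t = cur;  cur  = next;  next  = t;  memset(next,  0, words * 8);
        t = scur; scur = snext; snext = t;  memset(snext, 0, swords * 8);
    }
    free(cur); free(next); free(scur); free(snext);
    return depth;
}
```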
{"title":"Boosting the Performance of FPGA-based Graph Processor using Hybrid Memory Cube: A Case for Breadth First Search","authors":"Jialiang Zhang, Soroosh Khoram, J. Li","doi":"10.1145/3020078.3021737","DOIUrl":"https://doi.org/10.1145/3020078.3021737","url":null,"abstract":"Large graph processing has gained great attention in recent years due to its broad applicability from machine learning to social science. Large real-world graphs, however, are inherently difficult to process efficiently, not only due to their large memory footprint, but also that most graph algorithms entail memory access patterns with poor locality and a low compute-to-memory access ratio. In this work, we leverage the exceptional random access performance of emerging Hybrid Memory Cube (HMC) technology that stacks multiple DRAM dies on top of a logic layer, combined with the flexibility and efficiency of FPGA to address these challenges. To our best knowledge, this is the first work that implements a graph processing system on a FPGA-HMC platform based on software/hardware co-design and co-optimization. We first present the modifications of algorithm and a platform-aware graph processing architecture to perform level-synchronized breadth first search (BFS) on FPGA-HMC platform. To gain better insights into the potential bottlenecks of proposed implementation, we develop an analytical performance model to quantitatively evaluate the HMC access latency and corresponding BFS performance. Based on the analysis, we propose a two-level bitmap scheme to further reduce memory access and perform optimization on key design parameters (e.g. memory access granularity). Finally, we evaluate the performance of our BFS implementation using the AC-510 development kit from Micron. We achieved 166 million edges traversed per second (MTEPS) using GRAPH500 benchmark on a random graph with a scale of 25 and an edge factor of 16, which significantly outperforms CPU and other FPGA-based large graph processors.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128027376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haoyang Wu, Tao Wang, Zhiwei Li, Boyan Ding, Xiaoguang Li, Tianfu Jiang, Jun Liu, Songwu Lu
Although theoretical research on cognitive radio is growing explosively, real-time platforms for cognitive radio are progressing at a slow pace. Researchers expect to prototype their designs quickly on appropriate wireless platforms so as to precisely evaluate and validate them. Platforms for cognitive radio should therefore provide both high performance and programmability. Owing to its parallel and reconfigurable nature, the FPGA is well suited to building real-time software-defined radio (SDR) platforms. However, without a carefully designed "middleware architecture layer", a real-time programmable wireless system is still difficult to build. In this paper, we present GRT 2.0, a novel high-performance and programmable SDR platform for cognitive radio. The paper focuses on the architecture of the media access control (MAC) layer and the radio frequency (RF) front-end interface. We allocate different MAC functions to different computing units, including a dedicated, lightweight embedded processor and several peripherals, to meet both programmability and microsecond-level timing requirements. A serial-to-parallel converter is adopted to solve the issues of frame-type matching and precise timing between the PHY and the RF front end. To support mobile host computers, we use the more portable USB 3.0 interface instead of PCIe. Finally, with an efficient "gain lock" state machine, automatic gain control (AGC) processing time has been reduced to less than 1 us. The evaluation shows that with the 802.11a/g protocol, GRT 2.0 achieves a maximum MAC throughput of 23 Mbps, comparable to commodity fixed-logic wireless network adapters. The latency of the RF front end is less than 2 us, more than a 10X improvement over the Ethernet cable interface. Moreover, through the carefully designed "middleware architecture layer" in the FPGA, we provide good programmability in both the MAC and the PHY.
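The "gain lock" idea can be illustrated with a small state machine: measure the received power, step the gain toward a target, and freeze the gain once the measurement stays inside a tolerance band. The states, thresholds and step size below are assumptions made for the sketch, not the actual GRT 2.0 design.

```c
#include <stdio.h>

/* Illustrative AGC "gain lock" state machine.  One call per measurement
 * interval: `power_dbm` is the measured input power before the programmable
 * gain, and the return value is the gain to program, in dB. */
typedef enum { AGC_MEASURE, AGC_ADJUST, AGC_LOCKED } agc_state_t;

typedef struct {
    agc_state_t state;
    int gain_db;            /* currently programmed gain                  */
    int stable_count;       /* consecutive in-band measurement intervals  */
} agc_t;

#define TARGET_DBM  (-30)   /* desired post-gain power                    */
#define TOL_DB        3     /* +/- tolerance band around the target       */
#define STEP_DB       2     /* gain change per adjustment                 */
#define LOCK_COUNT    4     /* in-band intervals needed to declare "lock" */

int agc_step(agc_t *a, int power_dbm)
{
    int err     = TARGET_DBM - (power_dbm + a->gain_db);  /* remaining error */
    int in_band = (err <= TOL_DB && err >= -TOL_DB);

    switch (a->state) {
    case AGC_MEASURE:
        if (!in_band) {
            a->stable_count = 0;
            a->state = AGC_ADJUST;
        } else if (++a->stable_count >= LOCK_COUNT) {
            a->state = AGC_LOCKED;                     /* gain is now frozen */
        }
        break;
    case AGC_ADJUST:
        a->gain_db += (err > 0) ? STEP_DB : -STEP_DB;  /* move toward target */
        a->state = AGC_MEASURE;
        break;
    case AGC_LOCKED:
        if (!in_band) {                                /* signal level drifted */
            a->stable_count = 0;
            a->state = AGC_ADJUST;
        }
        break;
    }
    return a->gain_db;
}

int main(void)
{
    agc_t a = { AGC_MEASURE, 0, 0 };
    const int samples[] = { -36, -36, -35, -35, -34, -34, -34, -34 };
    for (unsigned i = 0; i < sizeof samples / sizeof *samples; i++)
        printf("power %d dBm -> gain %d dB\n", samples[i], agc_step(&a, samples[i]));
    return 0;
}
```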
{"title":"GRT 2.0: An FPGA-based SDR Platform for Cognitive Radio Networks (Abstract Only)","authors":"Haoyang Wu, Tao Wang, Zhiwei Li, Boyan Ding, Xiaoguang Li, Tianfu Jiang, Jun Liu, Songwu Lu","doi":"10.1145/3020078.3021798","DOIUrl":"https://doi.org/10.1145/3020078.3021798","url":null,"abstract":"Although there is explosive growth of theoretical research on cognitive radio, the real-time platform for cognitive radio is progressing at a low pace. Researchers expect fast prototyping their designs with appropriate wireless platforms to precisely evaluate and validate their new designs. Platforms for cognitive radio should provide both high-performance and programmability. We observed that for the parallel and reconfigurable nature, FPGA is suitable for developing real-time software-defined radio (SDR) platforms. However, without a carefully designed \"middleware architecture layer\", Real-time programmable wireless system is still difficult to build. In this paper, we present GRT 2.0, a novel high-performance and programmable SDR platform for cognitive radio. This paper focuses on the architecture design of media access control (MAC) layer and radio frequency (RF) front-end interface. We allocate different MAC functions into different computing units, including a dedicated, light-weight embedded processor and several peripherals, to ensure both programmability and microsecond-level timing requirements. A serial-to-parallel converter is adopted to solve the issues of frame type matching and precise timing between PHY and RF. To support mobile host computers, we use the more portable USB 3.0 interface instead of PCIe. Finally, with the design of an efficient \"gain lock\" state machine, automatic gain control (AGC) processing time has been reduced to less than 1us. The evaluation result shows that with 802.11a/g protocol, GRT 2.0 achieves maximum throughput of 23Mbps in MAC, which is compatible to commodity fixed-logic wireless network adaptors. The latency of RF front-end is less than 2us, over 10X performance improvement to the Ethernet cable interface. Moreover, by carefully designed \"middleware architecture layer\" in FPGA, we provide good programmability both in MAC and PHY.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132497663","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Emanuele Pezzotti, A. Iacobucci, G. Nash, Umer I. Cheema, Paolo Vinella, R. Ansari
Magnetic Resonance Imaging (MRI) is widely used in medical diagnostics. Sampling of MRI data on Cartesian grids allows efficient computation of the Inverse Discrete Fourier Transform for image reconstruction using the Inverse Fast Fourier Transform (IFFT) algorithm. Though the use of Cartesian trajectories simplifies the IFFT computation, non-Cartesian trajectories have been shown to provide better image resolution with lower scan times. To improve the processing time of MRI image reconstruction for these optimized non-Cartesian trajectories using a Non-uniform Fast Fourier Transform (NuFFT) algorithm, dedicated accelerators are required. We present an FPGA-based MRI solution that implements the NuFFT for image reconstruction. The solution is based on an efficient custom FPGA accelerator designed using OpenCL, and covers all the phases necessary to reconstruct an image with high accuracy, starting from raw scan data. The architecture can easily be extended to tackle 3D imaging, and k-space properties have been analyzed to reduce the number of samples processed, achieving satisfactory reconstruction accuracy while also reducing processing time. Our solution achieves a marked improvement over previously published FPGA- and CPU-based implementations and, owing to its scalability, is suitable for the image sizes common in MRI acquisitions.
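For orientation, a gridding-based NuFFT reconstruction typically (i) density-compensates the non-Cartesian samples, (ii) convolves them onto a Cartesian grid with a small kernel, (iii) applies an inverse FFT, and (iv) divides out the kernel's apodization. The 1D sketch below shows only the gridding step, with a Gaussian kernel chosen for simplicity; the kernel, oversampling factor and the OpenCL pipeline of the actual accelerator are not modeled.

```c
#include <math.h>
#include <complex.h>
#include <stddef.h>

/* Gridding step of a gridding-based NuFFT (1D, illustrative only).
 * k[i] are non-Cartesian k-space coordinates in [-0.5, 0.5), d[i] the measured
 * samples already multiplied by their density-compensation weights.  Each
 * sample is spread onto the Cartesian grid `grid` (size n) with a truncated
 * Gaussian kernel of half-width `w` cells and standard deviation `sigma`.
 * An inverse FFT of `grid` followed by deapodization (dividing the image by
 * the kernel's Fourier transform) would complete the reconstruction. */
static void grid_samples(const double *k, const double complex *d, size_t m,
                         double complex *grid, size_t n, int w, double sigma)
{
    for (size_t i = 0; i < m; i++) {
        double pos = (k[i] + 0.5) * (double)n;       /* continuous grid index */
        long   c   = lround(pos);
        for (long g = c - w; g <= c + w; g++) {
            double dist = pos - (double)g;
            double wgt  = exp(-(dist * dist) / (2.0 * sigma * sigma));
            size_t idx  = (size_t)((g % (long)n + (long)n) % (long)n);  /* wrap */
            grid[idx] += wgt * d[i];
        }
    }
}

int main(void)
{
    enum { N = 16 };
    double complex grid[N] = { 0 };
    const double k[] = { -0.13, 0.07, 0.31 };                  /* sample positions  */
    const double complex d[] = { 1.0, 0.5 + 0.5*I, -0.25*I };  /* weighted samples  */
    grid_samples(k, d, 3, grid, N, 2, 0.8);
    return 0;
}
```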
{"title":"FPGA-based Hardware Accelerator for Image Reconstruction in Magnetic Resonance Imaging (Abstract Only)","authors":"Emanuele Pezzotti, A. Iacobucci, G. Nash, Umer I. Cheema, Paolo Vinella, R. Ansari","doi":"10.1145/3020078.3021793","DOIUrl":"https://doi.org/10.1145/3020078.3021793","url":null,"abstract":"Magnetic Resonance Imaging (MRI) is widely used in medical diagnostics. Sampling of MRI data on Cartesian grids allows efficient computation of the Inverse Discrete Fourier Transform for image reconstruction using the Inverse Fast Fourier Transform (IFFT) algorithm. Though the use of Cartesian trajectories simplifies the IFFT computation, non-Cartesian trajectories have been shown to provide better image resolution with lower scan times. To improve the processing time of MRI image reconstruction for these optimized non-Cartesian trajectories using a Non-uniform Fast Fourier Transform (NuFFT) algorithm, dedicated accelerators are required. We present an FPGA-based MRI solution to implement NuFFT for image reconstruction. The solution is based on the design of an efficient custom accelerator on FPGA using OpenCL, and covers all the phases necessary to reconstruct an image with high accuracy, starting from raw scan data. The architecture can be easily extendable to tackle 3D imaging, and k-space properties have been analyzed to reduce the number of samples processed, achieving satisfactory reconstruction accuracy while positively impacting processing time. Our solution achieves a marked improvement over previously published FPGA- and CPU-based implementations and, due to its scalability, it is suitable for the image sizes common in MRI acquisitions.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132024696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nadesh Ramanathan, Shane T. Fleming, John Wickerson, G. Constantinides
Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations ('atomics'), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This paper explores how these algorithms can be compiled from C to reconfigurable hardware via high-level synthesis (HLS). We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional intra-thread constraints among the memory operations. We implement our approach in the open-source LegUp HLS framework, and provide both sequentially consistent (SC) and weakly consistent ('weak') atomics. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many concurrent algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics defined by the 2011 revision of the C standard. A case study on a circular buffer suggests that circuits synthesised from programs that use atomics can be 2.5x faster than those that use locks, and that weak atomics can yield a further 1.5x speedup.
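To give a flavor of the kind of code involved, here is a minimal single-producer/single-consumer circular buffer written with C11 atomics, in which release/acquire pairs (rather than sequentially consistent accesses) are enough to keep the data transfer race-free. It is a generic textbook sketch under that assumption, not the case-study code from the paper.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define CAP 16                      /* capacity, kept a power of two */

typedef struct {
    int data[CAP];
    atomic_size_t head;             /* advanced only by the consumer */
    atomic_size_t tail;             /* advanced only by the producer */
} ring_t;

/* Producer: write the element, then publish the new tail with a release store
 * so the consumer's acquire load of `tail` also sees the write to data[]. */
bool ring_push(ring_t *r, int v)
{
    size_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&r->head, memory_order_acquire);
    if (tail - head == CAP)
        return false;               /* full */
    r->data[tail % CAP] = v;
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer: the acquire load of `tail` pairs with the producer's release, so
 * the element read below is the one the producer finished writing; the release
 * store of `head` then tells the producer the slot may be reused. */
bool ring_pop(ring_t *r, int *out)
{
    size_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
    if (head == tail)
        return false;               /* empty */
    *out = r->data[head % CAP];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}
```

Marking every access `memory_order_seq_cst` would also be correct, but it imposes ordering constraints that an HLS scheduler must preserve, which is exactly the cost the weak atomics in the paper avoid.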
{"title":"Hardware Synthesis of Weakly Consistent C Concurrency","authors":"Nadesh Ramanathan, Shane T. Fleming, John Wickerson, G. Constantinides","doi":"10.1145/3020078.3021733","DOIUrl":"https://doi.org/10.1145/3020078.3021733","url":null,"abstract":"Lock-free algorithms, in which threads synchronise not via coarse-grained mutual exclusion but via fine-grained atomic operations ('atomics'), have been shown empirically to be the fastest class of multi-threaded algorithms in the realm of conventional processors. This paper explores how these algorithms can be compiled from C to reconfigurable hardware via high-level synthesis (HLS). We focus on the scheduling problem, in which software instructions are assigned to hardware clock cycles. We first show that typical HLS scheduling constraints are insufficient to implement atomics, because they permit some instruction reorderings that, though sound in a single-threaded context, demonstrably cause erroneous results when synthesising multi-threaded programs. We then show that correct behaviour can be restored by imposing additional intra-thread constraints among the memory operations. We implement our approach in the open-source LegUp HLS framework, and provide both sequentially consistent (SC) and weakly consistent ('weak') atomics. Weak atomics necessitate fewer constraints than SC atomics, but suffice for many concurrent algorithms. We confirm, via automatic model-checking, that we correctly implement the semantics defined by the 2011 revision of the C standard. A case study on a circular buffer suggests that circuits synthesised from programs that use atomics can be 2.5x faster than those that use locks, and that weak atomics can yield a further 1.5x speedup.","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125193547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
With the rapid growth of data scale, data analysis applications are starting to hit performance bottlenecks and thus require the aid of hardware acceleration. At the same time, Field Programmable Gate Arrays (FPGAs), known for their high customizability and parallel nature, have gained momentum in the past decade. However, the efficiency of developing FPGA-based acceleration systems is severely constrained by traditional languages and tools, due to their limited expressiveness and extensibility, limited libraries, and the semantic gap between software and hardware design. This paper proposes a new open-source DSL-based hardware design framework called VeriScala (https://github.com/VeriScala/VeriScala) that supports highly abstracted object-oriented hardware definition, programmatic testing, and interactive on-chip debugging. By adopting a DSL embedded in Scala, we introduce modern software development concepts into hardware design, including object-oriented programming, parameterized types, type safety, and test automation. VeriScala enables designers to describe their hardware designs in Scala, generate Verilog code automatically, and interactively debug and test hardware designs in a real FPGA environment. Through evaluation on real-world applications and a usability test, we show that VeriScala provides a practical approach for rapid prototyping of hardware acceleration systems. (This work is supported by the National Key Research & Development Program of China, 2016YFB1000500.)
{"title":"Scala Based FPGA Design Flow (Abstract Only)","authors":"Yanqiang Liu, Yao Li, Weilun Xiong, Meng Lai, Cheng Chen, Zhengwei Qi, Haibing Guan","doi":"10.1145/3020078.3021762","DOIUrl":"https://doi.org/10.1145/3020078.3021762","url":null,"abstract":"With the rapid growth of data scale, data analysis applications start to meet the performance bottleneck, and thus requiring the aid of hardware acceleration. At the same time, Field Programmable Gate Arrays (FPGAs), known for their high customizability and parallel nature, have gained momentum in the past decade. However, the efficiency of development for acceleration system based on FPGAs is severely constrained by the traditional languages and tools, due to their deficiency in expressibility, extendability, limited libraries and semantic gap between software and hardware design. This paper proposes a new open-source DSL based hardware design framework called VeriScala (https://github.com/VeriScala/VeriScala) that supports highly abstracted object-oriented hardware defining, programmatical testing, and interactive on-chip debugging. By adopting DSL embedded in Scala, we introduce modern software developing concepts into hardware designing including object-oriented programming, parameterized types, type safety, test automation, etc. VeriScala enables designers to describe their hardware designs in Scala, generate Verilog code automatically and interactively debug and test hardware design in real FPGA environment. Through the evaluation on real world applications and usability test, we show that VeriScala provides a practical approach for rapid prototyping of hardware acceleration systems. (This work is supported by the National Key Research & Development Program of China 2016YFB1000500)","PeriodicalId":252039,"journal":{"name":"Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-02-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114711359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}