
Latest publications: 2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)

LAMDA: Learning-Assisted Multi-stage Autotuning for FPGA Design Closure
Ecenur Ustun, Shaojie Xiang, J. Gui, Cunxi Yu, Zhiru Zhang
A primary barrier to rapid hardware specialization with FPGAs stems from weak guarantees of existing CAD tools on achieving design closure. Current methodologies require extensive manual efforts to configure a large set of options across multiple stages of the toolflow, intended to achieve high quality-of-results. Due to the size and complexity of the design space spanned by these options, coupled with the time-consuming evaluation of each design point, exploration for reconfigurable computing has become remarkably challenging. To tackle this challenge, we present a learning-assisted autotuning framework called LAMDA, which accelerates FPGA design closure by utilizing design-specific features extracted from early stages of the design flow to guide the tuning process with significant runtime savings. LAMDA automatically configures logic synthesis, technology mapping, placement, and routing to achieve design closure efficiently. Compared with a state-of-the-art FPGA-targeted autotuning system, LAMDA realizes faster timing closure on various realistic benchmarks using Intel Quartus Pro.
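The loop the abstract describes — train a surrogate on features extracted from early flow stages using a few full runs, then spend the remaining budget only on configurations the surrogate ranks highly — can be sketched as follows. The option names, toy cost model, and 1-nearest-neighbour surrogate are all hypothetical stand-ins for illustration; a real tuner would invoke the actual synthesis/place-and-route flow and a stronger learner.

```python
import random

# Hypothetical stand-ins: a real flow would run synthesis/place/route and
# extract real early-stage features rather than these toy formulas.

def early_feature(cfg):
    """Cheap proxy available after early flow stages (toy model)."""
    return 2 * cfg["effort"] + cfg["seed"] % 3

def full_flow_slack(cfg):
    """Expensive full-flow result (toy model; best slack is 0 at feature == 5)."""
    return -abs(early_feature(cfg) - 5)

def predict(history, feat):
    """1-nearest-neighbour surrogate over (feature, slack) pairs seen so far."""
    return min(history, key=lambda h: abs(h[0] - feat))[1]

def autotune(candidates, seed_runs=4, extra_runs=3):
    # Phase 1: a few full runs to seed the surrogate.
    history = [(early_feature(c), full_flow_slack(c)) for c in candidates[:seed_runs]]
    rest = list(candidates[seed_runs:])
    # Phase 2: rank unexplored configs by predicted slack; fully run only the best.
    rest.sort(key=lambda c: predict(history, early_feature(c)), reverse=True)
    for c in rest[:extra_runs]:
        history.append((early_feature(c), full_flow_slack(c)))
    return max(s for _, s in history)

random.seed(0)
cands = [{"effort": e, "seed": s} for e in range(4) for s in range(4)]
random.shuffle(cands)
print(autotune(cands))
```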
Citations: 22
T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations
Nitish Srivastava, Hongbo Rong, Prithayan Barua, Guanyu Feng, Huanqi Cao, Zhiru Zhang, D. Albonesi, Vivek Sarkar, Wenguang Chen, Paul Petersen, Geoff N. Lowney, A. Herr, C. Hughes, T. Mattson, P. Dubey
We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. It decouples a functional specification from a spatial mapping, allowing programmers to quickly explore various spatial optimizations for the same function. The actual implementation of these optimizations is left to a compiler. Thus, productivity and performance are achieved at the same time. We used this framework to implement several important dense tensor kernels. We implemented dense matrix multiply for an Arria-10 FPGA and a research CGRA, achieving 88% and 92% of the performance of manually written, highly optimized expert ("ninja") implementations in just 3% of their engineering time. Three other tensor kernels, including MTTKRP, TTM, and TTMc, were also implemented with high performance and low design effort, and for the first time on spatial architectures.
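The decoupling described above — a functional specification fixes *what* is computed, while the spatial mapping only changes *how* the loop nest executes — can be mimicked in plain Python. The tile parameters below are a hypothetical stand-in for T2S's mapping directives; the point is that any legal mapping must reproduce the specification's result exactly.

```python
# Functional specification: defines the result, says nothing about execution order.
def matmul_spec(A, B):
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

# "Mapped" version: tile sizes model a spatial mapping choice (hypothetical
# analogue of a systolic-array shape); changing them must not change the result.
def matmul_mapped(A, B, tile_i=2, tile_j=2):
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    for i0 in range(0, n, tile_i):
        for j0 in range(0, m, tile_j):
            for i in range(i0, min(i0 + tile_i, n)):
                for j in range(j0, min(j0 + tile_j, m)):
                    for p in range(k):
                        C[i][j] += A[i][p] * B[p][j]
    return C

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
# Every mapping agrees with the spec, so programmers can explore mappings freely.
assert matmul_mapped(A, B, 1, 2) == matmul_spec(A, B)
assert matmul_mapped(A, B, 2, 1) == matmul_spec(A, B)
```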
Citations: 39
A Scalable OpenCL-Based FPGA Accelerator for YOLOv2
Ke Xu, Xiaoyun Wang, Dong Wang
This paper implements an OpenCL-based FPGA accelerator for YOLOv2 on an Arria-10 GX1150 FPGA board. The hardware architecture adopts a scalable pipeline design to support multi-resolution input images, and improves resource utilization through full 8-bit fixed-point computation and CONV+BN+Leaky-ReLU layer fusion. The proposed design achieves a peak throughput of 566 GOPs at a 190 MHz working frequency. The accelerator runs YOLOv2 inference at 288×288 input resolution and tiny YOLOv2 at 416×416 input resolution at 35 and 71 FPS, respectively.
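The CONV+BN fusion mentioned above folds the batch-norm parameters into the convolution weights at compile time (w' = w·γ/√(var+ε), b' = (b−mean)·γ/√(var+ε) + β), so inference needs a single multiply-accumulate per weight. A minimal single-channel sketch — all constants made up; real YOLOv2 layers are multi-channel and additionally 8-bit quantized:

```python
import math

def fuse(w, b, gamma, beta, mean, var, eps=1e-5):
    """Fold batch-norm (gamma, beta, mean, var) into conv weight w and bias b."""
    scale = gamma / math.sqrt(var + eps)
    return w * scale, (b - mean) * scale + beta

def leaky_relu(x, alpha=0.1):
    return x if x > 0 else alpha * x

# Made-up single-channel parameters for illustration.
w, b = 0.5, 0.1
gamma, beta, mean, var = 1.2, -0.3, 0.05, 0.8
wf, bf = fuse(w, b, gamma, beta, mean, var)

# The fused layer matches conv -> batch-norm -> Leaky-ReLU exactly.
for x in (-2.0, 0.0, 2.0):
    unfused = leaky_relu((w * x + b - mean) / math.sqrt(var + 1e-5) * gamma + beta)
    fused = leaky_relu(wf * x + bf)
    assert abs(unfused - fused) < 1e-9
```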
Citations: 9
FlexGibbs: Reconfigurable Parallel Gibbs Sampling Accelerator for Structured Graphs
Glenn G. Ko, Yuji Chai, Rob A. Rutenbar, D. Brooks, Gu-Yeon Wei
Many consider compatibility with existing accelerators, mainly GPUs, to be a key component of deep learning's success. While GPUs are great at handling the linear algebra kernels common in deep learning, they are not the optimal architecture for unsupervised learning methods such as Bayesian models and inference. As a step towards a better understanding of architectures for probabilistic models, we study Gibbs sampling, one of the most commonly used algorithms for Bayesian inference, with a focus on parallelism that converges to the target distribution and on parameterized components. We propose FlexGibbs, a reconfigurable parallel Gibbs sampling inference accelerator for structured graphs. We designed an architecture optimized for solving Markov Random Field tasks using an array of parallel Gibbs samplers, enabled by chromatic scheduling. We show that for a sound source separation application, FlexGibbs configured on the FPGA fabric of a Xilinx Zynq CPU-FPGA SoC achieved a Gibbs sampling inference speedup of 1048x and a 99.85% reduction in energy compared to running it on an ARM Cortex-A53.
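Chromatic scheduling works because nodes of the same colour share no edges, so their conditional distributions are mutually independent and can all be sampled in parallel. A checkerboard (two-colour) Gibbs sweep on a toy Ising-style MRF — coupling strength and grid size are illustrative, not from the paper:

```python
import math
import random

def neighbours(i, j, n):
    """4-connected grid neighbours; the grid's 2-colouring is (i + j) % 2."""
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < n and 0 <= j + dj < n:
            yield i + di, j + dj

def sweep(grid, beta, rng):
    n = len(grid)
    for colour in (0, 1):  # all sites of one colour are conditionally independent,
        for i in range(n):  # so a parallel sampler array could update them at once
            for j in range(n):
                if (i + j) % 2 == colour:
                    field = sum(grid[a][b] for a, b in neighbours(i, j, n))
                    p_up = 1.0 / (1.0 + math.exp(-2.0 * beta * field))
                    grid[i][j] = 1 if rng.random() < p_up else -1
    return grid

rng = random.Random(0)
grid = [[rng.choice([-1, 1]) for _ in range(6)] for _ in range(6)]
for _ in range(20):
    sweep(grid, beta=1.0, rng=rng)
print(sum(sum(row) for row in grid))  # magnetization after 20 sweeps
```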
Citations: 9
SparseHD: Algorithm-Hardware Co-optimization for Efficient High-Dimensional Computing
M. Imani, Sahand Salamat, Behnam Khaleghi, Mohammad Samragh, F. Koushanfar, T. Simunic
Hyperdimensional (HD) computing is gaining traction as an alternative lightweight machine learning approach for cognition tasks. Inspired by the neural activity patterns of the brain, HD computing performs cognition tasks by exploiting long vectors, namely hypervectors, rather than working with scalar numbers as in conventional computing. Since a hypervector is represented by thousands of dimensions (elements), the majority of prior work assumes binary elements to simplify the computation and alleviate the processing cost. In this paper, we first demonstrate that the dimensions need more than one bit to provide acceptable accuracy, making HD computing applicable to real-world cognitive tasks. Increasing the bit-width, however, sacrifices energy efficiency and performance, even when using low-bit integers as the hypervector elements. To address this issue, we propose a framework for HD acceleration, dubbed SparseHD, that leverages sparsity to improve the efficiency of HD computing. Essentially, SparseHD takes into account the statistical properties of a trained HD model and drops the least effective elements of the model, augmented by iterative retraining to compensate for the possible quality loss raised by sparsity. Thanks to the bit-level manipulability and abundant parallelism offered by FPGAs, we also propose a novel FPGA-based accelerator to effectively exploit sparsity in HD computation. We evaluate the efficiency of our framework on practical classification problems. We observe that SparseHD makes the HD model up to 90% sparse while affording minimal quality loss (less than 1%) compared to the non-sparse baseline model. Our evaluation shows that, on average, SparseHD provides 48.5× and 15.0× lower energy consumption and faster execution as compared to the AMD R390 GPU implementation.
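A common proxy for "dropping the least effective elements" is magnitude pruning of each trained class hypervector. The sketch below uses that proxy with tiny made-up vectors; real hypervectors have thousands of dimensions, and SparseHD additionally retrains after pruning to recover lost quality.

```python
def sparsify(vec, keep_ratio):
    """Zero out all but the largest-magnitude elements of a class hypervector."""
    k = int(len(vec) * keep_ratio)
    keep = set(sorted(range(len(vec)), key=lambda i: abs(vec[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(vec)]

def classify(model, query):
    """Nearest class hypervector by dot-product similarity."""
    return max(model, key=lambda c: sum(a * b for a, b in zip(model[c], query)))

# Made-up 4-dimensional "trained" class hypervectors for illustration.
model = {"a": [5.0, 0.1, -4.0, 0.2], "b": [-4.5, 0.3, 4.2, -0.1]}
sparse = {c: sparsify(v, 0.5) for c, v in model.items()}  # model is now 50% zeros

query = [4.0, 0.0, -3.0, 0.0]
# The pruned model agrees with the dense one on this query.
assert classify(model, query) == classify(sparse, query) == "a"
```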
Citations: 35
Hybrid XML Parser Based on Software and Hardware Co-design
Zhe Pan, Xiaohong Jiang, Jian Wu, Xiang Li
Extensible Markup Language (XML) is widely used in web services. However, XML parsing is often a bottleneck, consuming considerable time and resources. In this work, we present a hybrid XML parser based on software and hardware co-design, placing hardware acceleration in a software-driven context. Our parser is based on the document object model (DOM). It is capable of well-formedness checking and tree construction at a throughput of 1 cycle per byte (CPB). We implement the design on a Xilinx Kintex-7 FPGA with 0.8 Gbps parsing throughput.
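The two figures above are consistent with each other: at 1 cycle per byte, bit throughput is simply clock rate × 8, so 0.8 Gbps implies a clock of roughly 100 MHz. The clock frequency itself is our back-of-envelope inference, not stated in the excerpt.

```python
def throughput_gbps(clock_mhz, cycles_per_byte=1.0):
    """Parse throughput in Gbps for a byte-per-cycle style parser."""
    bytes_per_sec = clock_mhz * 1e6 / cycles_per_byte
    return bytes_per_sec * 8 / 1e9

# 1 CPB at an assumed 100 MHz clock reproduces the reported 0.8 Gbps.
assert abs(throughput_gbps(100) - 0.8) < 1e-12
```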
Citations: 0
Efficient FPGA Floorplanning for Partial Reconfiguration-Based Applications
N. Deak, O. Creţ, H. Hedesiu
This paper introduces an efficient automatic floorplanning algorithm that takes into account the heterogeneous architectures of modern FPGA families as well as partial reconfiguration (PR) constraints, adding the aspect ratio (AR) constraint to optimize routing. The algorithm generates possible placements of the partial modules and then applies a recursive pseudo-bipartitioning heuristic search to find the best floorplan. Experiments show that its performance is significantly better than that of other algorithms in this field.
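The aspect-ratio constraint can be pictured as a filter over candidate module shapes: among all width × height factorizations of the required area, only those whose ratio lies in an AR band are kept, which tends to keep routes short. The bounds below are illustrative, not the paper's.

```python
def candidate_shapes(area, ar_min=0.5, ar_max=2.0):
    """Rectangular shapes for a module needing `area` tiles, filtered by AR."""
    shapes = []
    for w in range(1, area + 1):
        if area % w == 0:
            h = area // w
            if ar_min <= w / h <= ar_max:  # the AR constraint
                shapes.append((w, h))
    return shapes

print(candidate_shapes(12))  # near-square shapes survive; 1x12 strips are rejected
```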
Citations: 4
Fast Voltage Transients on FPGAs: Impact and Mitigation Strategies
Linda L. Shen, Ibrahim Ahmed, Vaughn Betz
As FPGAs grow in size and speed, so too does their power consumption. Power consumption on recent FPGAs has increased to the point that it is comparable to that of high-end CPUs. To mitigate this problem, power reduction techniques such as dynamic voltage scaling (DVS) and clock gating can potentially be applied to FPGAs. However, it is unclear whether they are safe in the presence of fast voltage transients. These fast voltage transients are caused by large changes in activity which we believe are common in most designs. Previous work has shown that it is these fast voltage transients that produce the largest variations in delay. In our work, we measure the impact transients have on applications and present a mitigation strategy to prevent them from causing timing failures. We create transient generators that are able to significantly reduce an application's measured Fmax, by up to 25. We also show that transients are very fast and produce immediate timing impact and hence transient mitigation must occur within the same clock cycle as the transient. We create a clock edge suppressor that is able to detect when a transient event is happening and delay the clock edge, thus preventing any timing failures. Using our clock edge suppressor, we show that we can run an application at full frequency in the presence of fast voltage transients, thereby enabling more aggressive DVS approaches and larger power savings.
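The suppressor's effect can be modelled crudely at cycle level: in a cycle where the droop detector fires, the delayed edge gives the logic extra time, so a momentarily slower path no longer violates setup. Period, delays, and the doubled-period model below are all illustrative assumptions, not the paper's circuit.

```python
PERIOD_NS = 10.0  # assumed nominal clock period

def timing_failures(logic_delays_ns, droop_flags, suppress):
    """Count setup violations, optionally delaying the edge on detected droops."""
    fails = 0
    for delay, droop in zip(logic_delays_ns, droop_flags):
        # Suppressed edge modelled as a doubled period for that cycle.
        budget = PERIOD_NS * (2.0 if (suppress and droop) else 1.0)
        if delay > budget:
            fails += 1
    return fails

delays = [9.0, 9.2, 12.5, 9.1]   # third cycle slowed by a voltage transient
droops = [False, False, True, False]
assert timing_failures(delays, droops, suppress=False) == 1  # one violation
assert timing_failures(delays, droops, suppress=True) == 0   # suppressor saves it
```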
Citations: 14
CRoute: A Fast High-Quality Timing-Driven Connection-Based FPGA Router
Dries Vercruyce, Elias Vansteenkiste, D. Stroobandt
FPGA routing is an important part of physical design, as the programmable interconnection network requires the majority of the total silicon area and the connections contribute heavily to delay and power. Routing should also take minimal runtime to enable efficient design exploration. In this work we elaborate on the connection-based routing principle, improve the algorithm, and introduce a timing-driven version. The router, called CRoute, is implemented in an easy-to-adapt FPGA CAD framework written in Java, publicly available on GitHub. Quality and runtime are compared to the state-of-the-art router in VPR 7.0.7. Benchmarking is done with the Titan23 design suite, which consists of large heterogeneous designs targeting a detailed representation of the Stratix IV FPGA. CRoute gains in both total wire-length and maximum clock frequency while reducing routing runtime: total wire-length decreases by 11% and maximum clock frequency increases by 6%, obtained in 3.4x less routing runtime.
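Connection-based routing sits in the PathFinder family of negotiated-congestion algorithms: each connection is routed individually with a shortest-path search, and nodes that end up over capacity become more expensive each iteration until connections spread apart. A minimal sketch on a hypothetical switch graph — this illustrates the general technique, not CRoute's actual cost function.

```python
import heapq

# Hypothetical routing-resource graph: two sources reach two sinks through
# shared switch nodes "a" and "b", each with capacity 1.
GRAPH = {"s1": ["a", "b"], "s2": ["a", "b"],
         "a": ["t1", "t2"], "b": ["t1", "t2"], "t1": [], "t2": []}
CAPACITY = {n: 1 for n in GRAPH}

def shortest(src, dst, hist, use):
    """Dijkstra where a node's cost grows with past (hist) and present (use) congestion."""
    pq = [(0, src, [src])]
    seen = set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt in GRAPH[node]:
            step = 1 + hist.get(nxt, 0) + 2 * use.get(nxt, 0)
            heapq.heappush(pq, (cost + step, nxt, path + [nxt]))

def route(connections, iterations=10):
    hist = {}
    for _ in range(iterations):
        use, paths = {}, {}
        for conn in connections:            # rip up and reroute one connection at a time
            paths[conn] = shortest(conn[0], conn[1], hist, use)
            for n in paths[conn]:
                use[n] = use.get(n, 0) + 1
        over = [n for n, u in use.items() if u > CAPACITY[n]]
        if not over:
            return paths                    # legal routing: no node over capacity
        for n in over:                      # raise the price of congested nodes
            hist[n] = hist.get(n, 0) + 1
    return paths

print(route([("s1", "t1"), ("s2", "t2")]))  # the two connections avoid sharing a node
```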
Citations: 13