LAMDA: Learning-Assisted Multi-stage Autotuning for FPGA Design Closure
Ecenur Ustun, Shaojie Xiang, J. Gui, Cunxi Yu, Zhiru Zhang
DOI: 10.1109/FCCM.2019.00020
A primary barrier to rapid hardware specialization with FPGAs stems from the weak guarantees of existing CAD tools on achieving design closure. Current methodologies require extensive manual effort to configure a large set of options across multiple stages of the toolflow in order to achieve high quality-of-results. Due to the size and complexity of the design space spanned by these options, coupled with the time-consuming evaluation of each design point, design space exploration for reconfigurable computing has become remarkably challenging. To tackle this challenge, we present a learning-assisted autotuning framework called LAMDA, which accelerates FPGA design closure by utilizing design-specific features extracted from early stages of the design flow to guide the tuning process with significant runtime savings. LAMDA automatically configures logic synthesis, technology mapping, placement, and routing to achieve design closure efficiently. Compared with a state-of-the-art FPGA-targeted autotuning system, LAMDA realizes faster timing closure on various realistic benchmarks using Intel Quartus Pro.
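The abstract does not detail LAMDA's model or feature set; the following is a minimal sketch of the general idea behind learning-assisted autotuning, assuming a hypothetical flow in which a regressor trained on early-stage features (post-synthesis metrics, say) predicts final timing so that only the most promising tool configurations pay for a full compile. All names and the scikit-learn model choice are illustrative, not LAMDA's actual implementation.

```python
# Sketch: rank candidate tool configurations by predicted quality-of-results
# using cheap early-stage features, reserving full place-and-route for the top few.
import random
from sklearn.ensemble import RandomForestRegressor

def early_features(config):
    # Placeholder: in practice these would be post-synthesis/post-placement
    # metrics (LUT count, net fanout, estimated congestion, ...).
    random.seed(hash(tuple(sorted(config.items()))) % (2**32))
    return [random.random() for _ in range(8)]

def full_flow_slack(config):
    # Placeholder for an expensive full place-and-route evaluation.
    return sum(early_features(config)) + random.gauss(0, 0.1)

# 1. Seed the model with a handful of fully evaluated configurations.
seed_configs = [{"opt_level": i % 3, "seed": i} for i in range(10)]
X = [early_features(c) for c in seed_configs]
y = [full_flow_slack(c) for c in seed_configs]
model = RandomForestRegressor(n_estimators=50).fit(X, y)

# 2. Rank a large candidate pool by predicted slack; only the top candidates
#    incur the cost of a full compile.
candidates = [{"opt_level": i % 3, "seed": 100 + i} for i in range(200)]
ranked = sorted(candidates, key=lambda c: -model.predict([early_features(c)])[0])
for cfg in ranked[:3]:
    print(cfg, full_flow_slack(cfg))
```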
{"title":"LAMDA: Learning-Assisted Multi-stage Autotuning for FPGA Design Closure","authors":"Ecenur Ustun, Shaojie Xiang, J. Gui, Cunxi Yu, Zhiru Zhang","doi":"10.1109/FCCM.2019.00020","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00020","url":null,"abstract":"A primary barrier to rapid hardware specialization with FPGAs stems from weak guarantees of existing CAD tools on achieving design closure. Current methodologies require extensive manual efforts to configure a large set of options across multiple stages of the toolflow, intended to achieve high quality-of-results. Due to the size and complexity of the design space spanned by these options, coupled with the time-consuming evaluation of each design point, exploration for reconfigurable computing has become remarkably challenging. To tackle this challenge, we present a learning-assisted autotuning framework called LAMDA, which accelerates FPGA design closure by utilizing design-specific features extracted from early stages of the design flow to guide the tuning process with significant runtime savings. LAMDA automatically configures logic synthesis, technology mapping, placement, and routing to achieve design closure efficiently. Compared with a state-of-the-art FPGA-targeted autotuning system, LAMDA realizes faster timing closure on various realistic benchmarks using Intel Quartus Pro.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131678448","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations
Nitish Srivastava, Hongbo Rong, Prithayan Barua, Guanyu Feng, Huanqi Cao, Zhiru Zhang, D. Albonesi, Vivek Sarkar, Wenguang Chen, Paul Petersen, Geoff N. Lowney, A. Herr, C. Hughes, T. Mattson, P. Dubey
DOI: 10.1109/FCCM.2019.00033
We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. It decouples a functional specification from a spatial mapping, allowing programmers to quickly explore various spatial optimizations for the same function. The actual implementation of these optimizations is left to a compiler. Thus, productivity and performance are achieved at the same time. We used this framework to implement several important dense tensor kernels. We implemented dense matrix multiply for an Arria-10 FPGA and a research CGRA, achieving 88% and 92% of the performance of manually written, highly optimized expert ("ninja") implementations in just 3% of their engineering time. Three other tensor kernels, including MTTKRP, TTM and TTMc, were also implemented with high performance and low design effort, and for the first time on spatial architectures.
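T2S expresses specifications and mappings in a Halide-style language; as a loose illustration of the decoupling idea only, here is a plain-Python sketch in which the functional specification of matrix multiply stays fixed while a separate mapping (tile sizes standing in for PE-array dimensions) can be swapped freely. The API is invented for illustration and is not T2S syntax.

```python
# Sketch: the same functional specification executed under different
# spatial mappings, which a compiler would lower to a systolic array.
import numpy as np

def spec(A, B):
    """Functional specification: WHAT to compute."""
    return A @ B

def mapped_matmul(A, B, tile_i=4, tile_j=4):
    """One possible spatial mapping: HOW to compute it.
    Each (tile_i x tile_j) output tile models the work of one PE region."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))
    for i0 in range(0, M, tile_i):      # tiles stream through the array
        for j0 in range(0, N, tile_j):
            C[i0:i0+tile_i, j0:j0+tile_j] = A[i0:i0+tile_i, :] @ B[:, j0:j0+tile_j]
    return C

A, B = np.random.rand(8, 8), np.random.rand(8, 8)
assert np.allclose(spec(A, B), mapped_matmul(A, B))        # same function,
assert np.allclose(spec(A, B), mapped_matmul(A, B, 2, 8))  # different mappings
```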
{"title":"T2S-Tensor: Productively Generating High-Performance Spatial Hardware for Dense Tensor Computations","authors":"Nitish Srivastava, Hongbo Rong, Prithayan Barua, Guanyu Feng, Huanqi Cao, Zhiru Zhang, D. Albonesi, Vivek Sarkar, Wenguang Chen, Paul Petersen, Geoff N. Lowney, A. Herr, C. Hughes, T. Mattson, P. Dubey","doi":"10.1109/FCCM.2019.00033","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00033","url":null,"abstract":"We present a language and compilation framework for productively generating high-performance systolic arrays for dense tensor kernels on spatial architectures, including FPGAs and CGRAs. It decouples a functional specification from a spatial mapping, allowing programmers to quickly explore various spatial optimizations for the same function. The actual implementation of these optimizations is left to a compiler. Thus, productivity and performance are achieved at the same time. We used this framework to implement several important dense tensor kernels. We implemented dense matrix multiply for an Arria-10 FPGA and a research CGRA, achieving 88% and 92% of the performance of manually written, and highly optimized expert (ninja\") implementations in just 3% of their engineering time. Three other tensor kernels, including MTTKRP, TTM and TTMc, were also implemented with high performance and low design effort, and for the first time on spatial architectures.\"","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130921398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Scalable OpenCL-Based FPGA Accelerator for YOLOv2
Ke Xu, Xiaoyun Wang, Dong Wang
DOI: 10.1109/FCCM.2019.00058
This paper implements an OpenCL-based FPGA accelerator for YOLOv2 on an Arria-10 GX1150 FPGA board. The hardware architecture adopts a scalable pipeline design to support multi-resolution input images, and improves resource utilization through full 8-bit fixed-point computation and CONV+BN+Leaky-ReLU layer fusion. The proposed design achieves a peak throughput of 566 GOPS at a 190 MHz working frequency. The accelerator runs YOLOv2 inference with 288×288 input resolution at 35 FPS and tiny YOLOv2 with 416×416 input resolution at 71 FPS.
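The CONV+BN fusion mentioned above relies on standard batch-norm folding arithmetic, sketched below with an assumed 6-fractional-bit signed 8-bit quantizer; this is the generic math, not the paper's exact implementation.

```python
# Sketch: BN(conv(x)) = gamma * (W*x + b - mean) / sqrt(var + eps) + beta
# folds into one convolution with rescaled weights and bias, which can then
# be quantized to 8-bit fixed point for integer-only MACs.
import numpy as np

def fuse_conv_bn(W, b, gamma, beta, mean, var, eps=1e-5):
    scale = gamma / np.sqrt(var + eps)   # one scale per output channel
    W_fused = W * scale[:, None]         # W: (out_ch, in_elems)
    b_fused = (b - mean) * scale + beta
    return W_fused, b_fused

def quantize_q8(x, frac_bits=6):
    """Round to signed 8-bit fixed point with `frac_bits` fractional bits."""
    q = np.clip(np.round(x * (1 << frac_bits)), -128, 127)
    return q.astype(np.int8)

W = np.random.randn(16, 9)               # 16 output channels, 3x3 kernels
b = np.zeros(16)
gamma, beta = np.ones(16), np.zeros(16)
mean, var = np.random.randn(16), np.abs(np.random.randn(16)) + 0.5

W_f, b_f = fuse_conv_bn(W, b, gamma, beta, mean, var)
print(quantize_q8(W_f)[0])               # fused weights, ready for int8 MACs
```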
{"title":"A Scalable OpenCL-Based FPGA Accelerator for YOLOv2","authors":"Ke Xu, Xiaoyun Wang, Dong Wang","doi":"10.1109/FCCM.2019.00058","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00058","url":null,"abstract":"This paper implements an OpenCL-based FPGA accelerator for YOLOv2 on Arria-10 GX1150 FPGA board. The hardware architecture adopts a scalable pipeline design to support multi-resolution input image, and improves resource utilization by full 8-bit fixed-point computation and CONV+BN+Leaky-ReLU layer fusion technology. The proposed design achieves a peak throughput of 566 GOPs under 190 MHz working frequency. The accelerator could run YOLOv2 inference with 288×288 input resolution and tiny YOLOv2 with 416×416 input resolution at the speed of 35 and 71 FPS, respectively.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123585981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FlexGibbs: Reconfigurable Parallel Gibbs Sampling Accelerator for Structured Graphs
Glenn G. Ko, Yuji Chai, Rob A. Rutenbar, D. Brooks, Gu-Yeon Wei
DOI: 10.1109/FCCM.2019.00075
Many consider one of the key components of deep learning's success to be its compatibility with existing accelerators, mainly GPUs. While GPUs are great at handling the linear algebra kernels commonly found in deep learning, they are not the optimal architecture for unsupervised learning methods such as Bayesian models and inference. As a step towards a better understanding of architectures for probabilistic models, we study Gibbs sampling, one of the most commonly used algorithms for Bayesian inference, with a focus on parallelism that converges to the target distribution and on parameterized components. We propose FlexGibbs, a reconfigurable parallel Gibbs sampling inference accelerator for structured graphs. We designed an architecture optimized for solving Markov Random Field tasks using an array of parallel Gibbs samplers, enabled by chromatic scheduling. We show that for a sound source separation application, FlexGibbs configured on the FPGA fabric of a Xilinx Zynq CPU-FPGA SoC achieved a Gibbs sampling inference speedup of 1048x and a 99.85% reduction in energy compared to running it on an ARM Cortex-A53.
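Chromatic scheduling is the key enabler here: under a graph coloring, all same-color sites are conditionally independent and can be sampled in parallel. Below is a minimal software sketch on a checkerboard-colored Ising-model MRF (chosen for concreteness; the paper's application is sound source separation), illustrating the parallelism a hardware sampler array exploits.

```python
# Sketch: chromatic Gibbs sampling on a 2-colorable grid MRF. Each color
# class updates all of its sites at once, since every site's Markov blanket
# contains only sites of the other color.
import numpy as np

rng = np.random.default_rng(0)
N, beta = 32, 0.4
spins = rng.choice([-1, 1], size=(N, N))

def sweep(spins):
    for color in (0, 1):                 # checkerboard coloring
        # Sum of the four neighbors (periodic boundary) for every site at once.
        nbr = (np.roll(spins, 1, 0) + np.roll(spins, -1, 0) +
               np.roll(spins, 1, 1) + np.roll(spins, -1, 1))
        p_up = 1.0 / (1.0 + np.exp(-2.0 * beta * nbr))  # P(spin = +1 | nbrs)
        mask = (np.add.outer(np.arange(N), np.arange(N)) % 2) == color
        new = np.where(rng.random((N, N)) < p_up, 1, -1)
        spins[mask] = new[mask]          # whole color class updated in parallel
    return spins

for _ in range(100):
    spins = sweep(spins)
print("magnetization:", spins.mean())
```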
{"title":"FlexGibbs: Reconfigurable Parallel Gibbs Sampling Accelerator for Structured Graphs","authors":"Glenn G. Ko, Yuji Chai, Rob A. Rutenbar, D. Brooks, Gu-Yeon Wei","doi":"10.1109/FCCM.2019.00075","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00075","url":null,"abstract":"Many consider one of the key components to the success of deep learning as its compatibility with existing accelerators, mainly GPU. While GPUs are great at handling linear algebra kernels commonly found in deep learning, they are not the optimal architecture for handling unsupervised learning methods such as Bayesian models and inference. As a step towards, achieving better understanding of architectures for probabilistic models, Gibbs sampling, one of the most commonly used algorithms for Bayesian inference, is studied with a focus on parallelism that converges to the target distribution and parameterized components. We propose FlexGibbs, a reconfigurable parallel Gibbs sampling inference accelerator for structured graphs. We designed an architecture optimal for solving Markov Random Field tasks using an array of parallel Gibbs samplers, enabled by chromatic scheduling. We show that for sound source separation application, FlexGibbs configured on the FPGA fabric of Xilinx Zync CPU-FPGA SoC achieved Gibbs sampling inference speedup of 1048x and 99.85% reduction in energy over running it on ARM Cortex-A53.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126597257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SparseHD: Algorithm-Hardware Co-optimization for Efficient High-Dimensional Computing
M. Imani, Sahand Salamat, Behnam Khaleghi, Mohammad Samragh, F. Koushanfar, T. Simunic
DOI: 10.1109/FCCM.2019.00034
Hyperdimensional (HD) computing is gaining traction as a lightweight alternative machine learning approach for cognitive tasks. Inspired by the neural activity patterns of the brain, HD computing performs cognitive tasks by exploiting long vectors, namely hypervectors, rather than working with the scalar values used in conventional computing. Since a hypervector is represented by thousands of dimensions (elements), the majority of prior work assumes binary elements to simplify the computation and alleviate the processing cost. In this paper, we first demonstrate that the dimensions need more than one bit to provide the accuracy required to make HD computing applicable to real-world cognitive tasks. Increasing the bit-width, however, sacrifices energy efficiency and performance, even when using low-bit integers as the hypervector elements. To address this issue, we propose a framework for HD acceleration, dubbed SparseHD, that leverages sparsity to improve the efficiency of HD computing. Essentially, SparseHD takes into account the statistical properties of a trained HD model and drops the least effective elements of the model, augmented by iterative retraining to compensate for the quality loss introduced by sparsity. Thanks to the bit-level manipulability and abundant parallelism granted by FPGAs, we also propose a novel FPGA-based accelerator to effectively exploit sparsity in HD computation. We evaluate the efficiency of our framework on practical classification problems. We observe that SparseHD makes the HD model up to 90% sparse while incurring minimal quality loss (less than 1%) compared to the non-sparse baseline model. Our evaluation shows that, on average, SparseHD provides 48.5× lower energy consumption and 15.0× faster execution compared to an AMD R390 GPU implementation.
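As a rough illustration of the drop-and-retrain idea, here is a sketch using generic magnitude pruning of class hypervectors plus a simple retraining pass on synthetic data; this is not SparseHD's exact procedure, and all dimensions and constants are illustrative.

```python
# Sketch: zero out the least-effective (smallest-magnitude) dimensions of each
# trained class hypervector, then nudge the sparse model on misclassified
# queries to recover accuracy.
import numpy as np

rng = np.random.default_rng(1)
D, n_classes = 2048, 4
classes = rng.standard_normal((n_classes, D))   # trained class hypervectors

def sparsify(model, sparsity=0.9):
    out = model.copy()
    k = int(D * sparsity)
    for hv in out:
        drop = np.argsort(np.abs(hv))[:k]        # least-magnitude dimensions
        hv[drop] = 0.0
    return out

def predict(model, x):
    return int(np.argmax(model @ x))             # dot-product similarity

sparse = sparsify(classes, 0.9)
# One retraining pass: update the sparse model on samples it now gets wrong.
for _ in range(200):
    label = rng.integers(n_classes)
    x = classes[label] + 0.5 * rng.standard_normal(D)  # noisy query
    if predict(sparse, x) != label:
        nz = sparse[label] != 0                  # only touch retained dims
        sparse[label, nz] += 0.1 * x[nz]
print("nonzero fraction:", (sparse != 0).mean())
```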
{"title":"SparseHD: Algorithm-Hardware Co-optimization for Efficient High-Dimensional Computing","authors":"M. Imani, Sahand Salamat, Behnam Khaleghi, Mohammad Samragh, F. Koushanfar, T. Simunic","doi":"10.1109/FCCM.2019.00034","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00034","url":null,"abstract":"Hyperdimensional (HD) computing is gaining traction as an alternative light-way machine learning approach for cognition tasks. Inspired by the neural activity patterns of the brain, HD computing performs cognition tasks by exploiting longsize vectors, namely hypervectors, rather than working with scalar numbers as used in conventional computing. Since a hypervector is represented by thousands of dimensions (elements), the majority of prior work assume binary elements to simplify the computation and alleviate the processing cost. In this paper, we first demonstrate that the dimensions need to have more than one bit to provide an acceptable accuracy to make HD computing applicable to real-world cognitive tasks. Increasing the bit-width, however, sacrifices energy efficiency and performance, even when using low-bit integers as the hypervector elements. To address this issue, we propose a framework for HD acceleration, dubbed SparseHD, that leverages the advantages of sparsity to improve the efficiency of HD computing. Essentially, SparseHD takes account of statistical properties of a trained HD model and drops the least effective elements of the model, augmented by iterative retraining to compensate the possible quality loss raised by sparsity. Thanks to the bit-level manipulability and abounding parallelism granted by FPGAs, we also propose a novel FPGAbased accelerator to effectively utilize the advantage of sparsity in HD computation. We evaluate the efficiency of our framework for practical classification problems. We observe that SparseHD makes the HD model up to 90% sparse while affording a minimal quality loss (less than 1%) compared to the non-sparse baseline model. Our evaluation shows that, on average, SparseHD provides 48.5× and 15.0× lower energy consumption and faster execution as compared to the AMD R390 GPU implementation.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132499790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hybrid XML Parser Based on Software and Hardware Co-design
Zhe Pan, Xiaohong Jiang, Jian Wu, Xiang Li
DOI: 10.1109/FCCM.2019.00066
Extensible Markup Language (XML) is widely used in web services. However, XML parsing is often a bottleneck that consumes substantial time and resources. In this work, we present a hybrid XML parser based on software-hardware co-design, placing hardware acceleration within a software-driven context. Our parser is based on the Document Object Model (DOM). It is capable of well-formedness checking and tree construction at a throughput of one cycle per byte (CPB). We implement the design on a Xilinx Kintex-7 FPGA, achieving 0.8 Gbps parsing throughput.
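The well-formedness check at the heart of such a parser is a streaming tag-stack match; a simplified software model is sketched below. Real XML handling of attributes, comments, and CDATA is elided, and the paper's hardware performs the equivalent check at one byte per cycle.

```python
# Sketch: open tags push onto a stack, close tags must match the top, and an
# empty stack at the end means the document is well-formed.
import re

def well_formed(xml: str) -> bool:
    stack = []
    # Simplified tokenizer: matches open, close, and self-closing tags only.
    for m in re.finditer(r"<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>", xml):
        closing, name, self_closing = m.group(1), m.group(2), m.group(3)
        if self_closing:
            continue                 # <tag/> opens and closes itself
        if closing:
            if not stack or stack.pop() != name:
                return False         # mismatched or unopened close tag
        else:
            stack.append(name)
    return not stack                 # every open tag was closed

assert well_formed("<a><b/><c>text</c></a>")
assert not well_formed("<a><b></a></b>")
```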
{"title":"Hybrid XML Parser Based on Software and Hardware Co-design","authors":"Zhe Pan, Xiaohong Jiang, Jian Wu, Xiang Li","doi":"10.1109/FCCM.2019.00066","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00066","url":null,"abstract":"Extensible Markup Language (XML) is widely used in web services. However, the task of XML parsing is always the bottleneck which consumes a lot of time and resources. In this work, we present a hybrid XML parser based on software and hardware co-design. We place hardware acceleration into a software-driven context. Our parser is based on document object model (DOM). It is capable of well-formed checking and tree construction at throughput of 1 cycle per byte (CPB). We implement the design on a Xilinx Kintex-7 FPGA with 0.8Gbps parsing throughput.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"209 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133604373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Efficient FPGA Floorplanning for Partial Reconfiguration-Based Applications
N. Deak, O. Creţ, H. Hedesiu
DOI: 10.1109/FCCM.2019.00050
This paper introduces an efficient automatic floorplanning algorithm that takes into account the heterogeneous architectures of modern FPGA families as well as partial reconfiguration (PR) constraints, and introduces an aspect ratio (AR) constraint to optimize routing. The algorithm generates possible placements of the partial modules and then applies a recursive pseudo-bipartitioning heuristic search to find the best floorplan. Experiments show that its performance is significantly better than that of other algorithms in this field.
{"title":"Efficient FPGA Floorplanning for Partial Reconfiguration-Based Applications","authors":"N. Deak, O. Creţ, H. Hedesiu","doi":"10.1109/FCCM.2019.00050","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00050","url":null,"abstract":"This paper introduces an efficient automatic floorplanning algorithm, which takes into account the heterogeneous architectures of modern FPGA families, as well as partial reconfiguration (PR) constraints, introducing the aspect ratio (AR) constraint to optimize routing. The algorithm generates possible placements of the partial modules, and then applies a recursive pseudo-bipartitioning heuristic search to find the best floorplan. The experiments show that its performance is significantly better than the one of other algorithms in this field.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133904555","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast Voltage Transients on FPGAs: Impact and Mitigation Strategies
Linda L. Shen, Ibrahim Ahmed, Vaughn Betz
DOI: 10.1109/FCCM.2019.00044
As FPGAs grow in size and speed, so too does their power consumption. Power consumption on recent FPGAs has increased to the point that it is comparable to that of high-end CPUs. To mitigate this problem, power reduction techniques such as dynamic voltage scaling (DVS) and clock gating can potentially be applied to FPGAs. However, it is unclear whether they are safe in the presence of fast voltage transients. These fast voltage transients are caused by large changes in activity, which we believe are common in most designs. Previous work has shown that it is these fast voltage transients that produce the largest variations in delay. In our work, we measure the impact transients have on applications and present a mitigation strategy to prevent them from causing timing failures. We create transient generators that are able to significantly reduce an application's measured Fmax, by up to 25%. We also show that transients are very fast and produce an immediate timing impact, so transient mitigation must occur within the same clock cycle as the transient. We create a clock edge suppressor that detects when a transient event is happening and delays the clock edge, thus preventing timing failures. Using our clock edge suppressor, we show that we can run an application at full frequency in the presence of fast voltage transients, thereby enabling more aggressive DVS approaches and larger power savings.
{"title":"Fast Voltage Transients on FPGAs: Impact and Mitigation Strategies","authors":"Linda L. Shen, Ibrahim Ahmed, Vaughn Betz","doi":"10.1109/FCCM.2019.00044","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00044","url":null,"abstract":"As FPGAs grow in size and speed, so too does their power consumption. Power consumption on recent FPGAs has increased to the point that it is comparable to that of high-end CPUs. To mitigate this problem, power reduction techniques such as dynamic voltage scaling (DVS) and clock gating can potentially be applied to FPGAs. However, it is unclear whether they are safe in the presence of fast voltage transients. These fast voltage transients are caused by large changes in activity which we believe are common in most designs. Previous work has shown that it is these fast voltage transients that produce the largest variations in delay. In our work, we measure the impact transients have on applications and present a mitigation strategy to prevent them from causing timing failures. We create transient generators that are able to significantly reduce an application's measured Fmax, by up to 25. We also show that transients are very fast and produce immediate timing impact and hence transient mitigation must occur within the same clock cycle as the transient. We create a clock edge suppressor that is able to detect when a transient event is happening and delay the clock edge, thus preventing any timing failures. Using our clock edge suppressor, we show that we can run an application at full frequency in the presence of fast voltage transients, thereby enabling more aggressive DVS approaches and larger power savings.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130962208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CRoute: A Fast High-Quality Timing-Driven Connection-Based FPGA Router
Dries Vercruyce, Elias Vansteenkiste, D. Stroobandt
DOI: 10.1109/FCCM.2019.00017
FPGA routing is an important part of physical design, as the programmable interconnection network requires the majority of the total silicon area and the connections contribute heavily to delay and power. Routing should also complete with minimal runtime to enable efficient design exploration. In this work we elaborate on the connection-based routing principle. The algorithm is improved and a timing-driven version is introduced. The router, called CRoute, is implemented in an easy-to-adapt FPGA CAD framework written in Java, which is publicly available on GitHub. Quality and runtime are compared to the state-of-the-art router in VPR 7.0.7. Benchmarking is done with the Titan23 design suite, which consists of large heterogeneous designs targeting a detailed representation of the Stratix IV FPGA. CRoute gains in both total wire-length and maximum clock frequency while reducing the routing runtime: total wire-length is reduced by 11% and maximum clock frequency increases by 6%. These high-quality results are obtained in 3.4x less routing runtime.
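Connection-based routing builds on negotiated congestion: each connection is routed by a shortest-path search whose node costs grow with present and historical overuse, and congested connections are ripped up and rerouted until no routing resource is overused. The sketch below shows that mechanism on a toy graph with PathFinder-style node costs; CRoute's routing-resource graph and timing-driven cost terms are far richer, so treat this as an illustration of the principle only.

```python
# Sketch: two nets negotiate for two shared middle nodes (m1 is cheaper).
# Present congestion is penalized heavily, history mildly; capacity is 1.
import heapq, collections

edges = {"s1": ["m1", "m2"], "s2": ["m1", "m2"],
         "m1": ["t1", "t2"], "m2": ["t1", "t2"], "t1": [], "t2": []}
base = {"s1": 0, "s2": 0, "m1": 1.0, "m2": 2.0, "t1": 0, "t2": 0}
connections = [("s1", "t1"), ("s2", "t2")]
hist = collections.Counter()                 # historical congestion

def route(src, dst, usage):
    """Dijkstra where a node's cost grows with present and past overuse."""
    dist, prev, pq = {src: 0.0}, {}, [(0.0, src)]
    while pq:
        d, n = heapq.heappop(pq)
        if n == dst:
            break
        for m in edges[n]:
            cost = base[m] + 5.0 * usage[m] + 1.0 * hist[m]
            if d + cost < dist.get(m, float("inf")):
                dist[m], prev[m] = d + cost, n
                heapq.heappush(pq, (d + cost, m))
    path, n = [dst], dst
    while n != src:
        n = prev[n]
        path.append(n)
    return path[::-1]

for it in range(10):                         # rip up and reroute every net
    usage = collections.Counter()
    paths = []
    for s, t in connections:
        p = route(s, t, usage)
        usage.update(p)
        paths.append(p)
    over = [n for n, u in usage.items() if u > 1]   # node capacity is 1
    hist.update(over)                        # remember congested nodes
    if not over:
        break
print(paths)  # e.g. [['s1', 'm1', 't1'], ['s2', 'm2', 't2']] after negotiation
```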
{"title":"CRoute: A Fast High-Quality Timing-Driven Connection-Based FPGA Router","authors":"Dries Vercruyce, Elias Vansteenkiste, D. Stroobandt","doi":"10.1109/FCCM.2019.00017","DOIUrl":"https://doi.org/10.1109/FCCM.2019.00017","url":null,"abstract":"FPGA routing is an important part of physical design as the programmable interconnection network requires the majority of the total silicon area and the connections largely contribute to delay and power. It should also occur with minimum runtime to enable efficient design exploration. In this work we elaborate on the concept of the connection-based routing principle. The algorithm is improved and a timing-driven version is introduced. The router, called CRoute, is implemented in an easy to adapt FPGA CAD framework written in Java, which is publicly available on GitHub. Quality and runtime are compared to the state-of-the-art router in VPR 7.0.7. Benchmarking is done with the Titan23 design suite, which consists of large heterogeneous designs targeted to a detailed representation of the Stratix IV FPGA. CRoute gains in both the total wire-length and maximum clock frequency while reducing the routing runtime. The total wire-length reduces by 11% and the maximum clock frequency increases by 6%. These high-quality results are obtained in 3.4x less routing runtime.","PeriodicalId":116955,"journal":{"name":"2019 IEEE 27th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127922077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}