Multi-ported RAMs are essential for high-performance parallel computation systems. VLIW and vector processors, CGRAs, DSPs, CMPs and other processing systems often rely upon multi-ported memories for parallel access, and hence higher performance. Although memories with a large number of read and write ports are important, their high implementation cost means they are used sparingly in designs. As a result, FPGA vendors only provide dual-ported block RAMs to handle the majority of usage patterns. In this paper, a novel and modular approach is proposed to construct multi-ported memories out of basic dual-ported RAM blocks. Like other multi-ported RAM designs, each write port uses a different RAM bank and each read port uses bank replication. The main contribution of this work is an optimization that merges the previous live-value-table (LVT) and XOR approaches into a common design that uses a generalized, simpler structure we call an invalidation-based live-value-table (I-LVT). Like a regular LVT, the I-LVT determines the correct bank to read from, but it differs in how updates to the table are made; the LVT approach requires multiple write ports, often leading to an area-intensive register-based implementation, while the XOR approach uses wider memories to accommodate the XOR-ed data and suffers from lower clock speeds. Two specific I-LVT implementations are proposed and evaluated: binary and one-hot coding. The I-LVT approach is especially suitable for larger multi-ported RAMs because the table is implemented only in SRAM cells. The I-LVT method gives higher performance while occupying fewer block RAMs than earlier approaches: for several configurations, the suggested method reduces the block RAM usage by over 44% and improves clock speed by over 76%. To assist others, we are releasing our fully parameterized Verilog implementation as an open-source hardware library. The library has been extensively tested using ModelSim and Altera's Quartus tools.
{"title":"Modular multi-ported SRAM-based memories","authors":"Ameer Abdelhadi, G. Lemieux","doi":"10.1145/2554688.2554773","DOIUrl":"https://doi.org/10.1145/2554688.2554773","journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays"},"publicationDate":"2014-02-26"}
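The bank-selection idea behind a regular live-value table, as described in the abstract above, can be illustrated with a small behavioral model. This is only a sketch of the baseline LVT concept, not the authors' I-LVT or their Verilog library; the class name `LvtMultiPortRam` and its methods are hypothetical, and read-port replication is only noted in a comment rather than modeled.

```python
class LvtMultiPortRam:
    """Behavioral model: one bank per write port plus a live-value table (LVT)."""

    def __init__(self, depth, num_write_ports):
        # One bank per write port. In hardware each bank would also be
        # replicated once per read port; that replication is not modeled here.
        self.banks = [[0] * depth for _ in range(num_write_ports)]
        # lvt[addr] holds the index of the write port that wrote addr last.
        self.lvt = [0] * depth

    def write(self, port, addr, value):
        # Each write port only ever touches its own bank.
        self.banks[port][addr] = value
        self.lvt[addr] = port

    def read(self, addr):
        # The LVT selects the bank that holds the live (most recent) value.
        return self.banks[self.lvt[addr]][addr]


ram = LvtMultiPortRam(depth=8, num_write_ports=2)
ram.write(port=0, addr=3, value=11)
ram.write(port=1, addr=3, value=22)  # a later write through another port wins
ram.write(port=0, addr=5, value=33)
print(ram.read(3), ram.read(5))
```

The table `lvt` is itself written by every write port, which is why, per the abstract, a plain LVT tends to need a register-based implementation; the I-LVT changes how those table updates are made.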
A run-time fault diagnosis and evasion scheme for reconfigurable devices is developed based on explicit Non-adaptive Group Testing (NGT). NGT involves grouping disjunct subsets of reconfigurable resources into test pools, or samples. Each test pool realizes a Diagnostic Configuration (DC) that performs functional testing during the diagnosis procedure. The collective test outcomes after testing each diagnostic pool can be efficiently decoded to identify up to d defective logic resources. An algorithm for constructing the NGT sampling procedure and resource placement at design time with an optimal (minimal) number of test groups is derived through the d-disjunctness property, which is well known in the statistical literature. The combinatorial properties of the resultant DCs also guarantee that any possible set of at most d defective resources is left unused by at least one DC, allowing low-overhead fault resolution. The scheme also provides the ability to assess the failure state of the resources. The proposed testing scheme thus avoids the time-intensive run-time diagnosis imposed by previously proposed adaptive group testing for reconfigurable hardware, without compromising diagnostic coverage. In addition, the proposed NGT scheme can be combined with other fault tolerance approaches to improve their fault recovery strategies. Experimental results for a set of MCNC benchmarks using Xilinx ISE Design Suite on a Virtex-5 FPGA have demonstrated d-diagnosability at slice level with average accuracy of 99.15% and 97.76% for d=1 and d=2, respectively.
{"title":"Non-adaptive sparse recovery and fault evasion using disjunct design configurations (abstract only)","authors":"Ahmad Alzahrani, R. Demara","doi":"10.1145/2554688.2554758","DOIUrl":"https://doi.org/10.1145/2554688.2554758","journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays"},"publicationDate":"2014-02-26"}
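To make the decoding step of non-adaptive group testing concrete, here is a minimal sketch for d=1. The pool design is an assumption made for illustration (six resources, four pools, each resource placed in a distinct pair of pools, which gives a 1-disjunct design); it is not the construction derived in the paper, and the names `POOLS_OF`, `run_tests` and `decode` are hypothetical.

```python
from itertools import combinations

NUM_POOLS = 4
# Hypothetical design: each of 6 resources is used by a distinct pair of pools.
# Distinct equal-size pool sets mean no resource's set contains another's.
POOLS_OF = list(combinations(range(NUM_POOLS), 2))  # resource -> pools using it


def run_tests(defective):
    """Return the set of pools that fail, given the set of defective resources."""
    return {p for r in defective for p in POOLS_OF[r]}


def decode(failed_pools):
    """A resource stays suspect only if every pool that uses it has failed."""
    return [r for r, pools in enumerate(POOLS_OF) if set(pools) <= failed_pools]


failed = run_tests({4})
suspects = decode(failed)
# Pools that passed do not use the defective resource, so they remain usable.
usable = [p for p in range(NUM_POOLS) if p not in failed]
print(sorted(failed), suspects, usable)
```

All pools are fixed in advance and their outcomes are decoded together, which is what makes the scheme non-adaptive. The pools that pass correspond to configurations that avoid the faulty resource, mirroring the fault-evasion property stated in the abstract.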
A clock network is a dedicated network for distributing multiple clock signals to every logic module in a system. Unlike an ASIC, where the clock tree is custom built by users, the clock network in an FPGA is usually fixed after chip fabrication and cannot be changed for different user circuits. This paper designs and implements an FPGA clock network with low latency and skew. We first propose a novel clock network for FPGAs, which has a backbone-branches topology and can be easily integrated into a tiled FPGA with reasonable area. There is one clock backbone and several primary clock branches in the network. When the chip scales up, this clock network can be extended easily. Afterwards, a series of strategies such as hybrid multiplexers, bypassing, looping back and a Programmable Delay Adjustment Unit (DAU) are employed to optimize latency and skew. Moreover, the prominent coupling capacitance and crosstalk effects of clock routing in nanometer technologies are also considered in the physical implementation. This clock network is applied to our own FPGA design in 65nm technology. Post-layout simulation results indicate that our clock network with normal loads can sustain a 600MHz clock with a maximum clock latency and skew of typically 2.22ns and 40ps respectively, and 1.79ns and 39ps in the fast case, achieving up to 78.2% improvement in skew and 47.5% in latency compared to a commercial 65nm FPGA device.
{"title":"Novel FPGA clock network with low latency and skew (abstract only)","authors":"Lei Li, Jian Wang, Jinmei Lai","doi":"10.1145/2554688.2554722","DOIUrl":"https://doi.org/10.1145/2554688.2554722","journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays"},"publicationDate":"2014-02-26"}
In order to investigate new FPGA logic blocks, FPGA architects have traditionally needed to customize CAD tools to make use of the new features and characteristics of those blocks. The software development effort necessary to create such CAD tools can be a time-consuming process that can significantly limit the number and variety of architectures explored. Thus, architects want flexible CAD tools that can, with few or no software modifications, explore a diverse space. Existing flexible CAD tools suffer from impractically long runtimes and/or fail to efficiently make use of the important new features of the logic blocks being investigated. This work is a step towards addressing these concerns by enhancing the packing stage of the open-source VTR CAD flow [17] to efficiently deal with common interconnect structures that are used to create many kinds of useful novel blocks. These structures include crossbars, carry chains, dedicated signals, and others. To accomplish this, we employ three techniques in this work: speculative packing, pre-packing, and interconnect-aware pin counting. We show that these techniques, along with three minor modifications, result in improvements to runtime and quality of results across a spectrum of architectures, while simultaneously expanding the scope of architectures that can be explored. Compared with VTR 1.0 [17], we show an average 12-fold speedup in packing for fracturable LUT architectures with 20% lower minimum channel width and 6% lower critical path delay. We obtain a 6 to 7-fold speedup for architectures with non-fracturable LUTs and architectures with depopulated crossbars. In addition, we demonstrate packing support for logic blocks with carry chains.
{"title":"Towards interconnect-adaptive packing for FPGAs","authors":"J. Luu, Jonathan Rose, J. Anderson","doi":"10.1145/2554688.2554783","DOIUrl":"https://doi.org/10.1145/2554688.2554783","journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays"},"publicationDate":"2014-02-26"}
We propose a technique to reduce the effective parasitic capacitance of interconnect routing conductors in a bid to simultaneously reduce power consumption and improve delay. The parasitic capacitance reduction is achieved by ensuring that routing conductors adjacent to those used by timing-critical or high-activity nets are left floating, that is, disconnected from both VDD and GND. In doing so, the effective coupling capacitance between the conductors is reduced, because the original coupling capacitance between the conductors is placed in series with other capacitances in the circuit (series combinations of capacitors correspond to lower effective capacitance). Allowing unused conductors to float requires tri-state routing buffers, and to that end, we also propose low-cost tri-state buffer circuitry. We also introduce CAD techniques to maximize the likelihood that unused routing conductors are adjacent to those used by nets with high activity or low slack, improving both power and speed. Results show that interconnect dynamic power reductions of up to ~15.5% are expected to be achieved with a critical path degradation of ~1%, and a total area overhead of ~2.1%.
{"title":"Optimizing effective interconnect capacitance for FPGA power reduction","authors":"Safeen Huda, J. Anderson, H. Tamura","doi":"10.1145/2554688.2554788","DOIUrl":"https://doi.org/10.1145/2554688.2554788","journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays"},"publicationDate":"2014-02-26"}
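The series-capacitance argument in the abstract above can be checked with a few lines of arithmetic. The capacitance values below are made-up illustrative numbers, not figures from the paper, and the two-capacitor model (coupling capacitance in series with the floating neighbor's remaining capacitance to ground) is a simplifying assumption.

```python
def series(c1, c2):
    """Equivalent capacitance of two capacitors connected in series."""
    return c1 * c2 / (c1 + c2)


# Hypothetical values in femtofarads, chosen only for illustration.
C_COUPLING = 10.0        # coupling capacitance between adjacent conductors
C_NEIGHBOR_TO_GND = 10.0  # remaining capacitance of the floating neighbor

# Neighbor driven to VDD or GND: the active net sees the full coupling capacitance.
seen_when_driven = C_COUPLING
# Neighbor left floating: the coupling capacitance is in series with the rest.
seen_when_floating = series(C_COUPLING, C_NEIGHBOR_TO_GND)

print(seen_when_driven, seen_when_floating)
```

Because a series combination is always smaller than either capacitor alone, the floating case can only lower the capacitance that the active net has to charge.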
Due to the explosion of gene sequencing data with over one billion reads per run, the data-intensive computations of Next Generation Sequencing (NGS) applications pose great challenges to current computing capability. In this paper we investigate both algorithmic and architectural acceleration strategies for a typical NGS analysis algorithm -- short read mapping -- on a commodity multicore and a customizable FPGA coprocessor architecture, respectively. First, we propose a hash bucket reordering algorithm that increases shared-cache parallelism while searching the hash index. The algorithmic strategy achieves 122Gbp/day throughput by exploiting shared-cache parallelism, which yields a 2-fold performance improvement on an 8-core Intel Xeon processor. Second, we develop an FPGA coprocessor that leverages both bit-level and word-level parallelism with a scatter-gather memory mechanism to speed up inherently irregular memory access operations by increasing effective memory bandwidth. Our customized FPGA coprocessor achieves 947Gbp/day throughput, which is 189 times higher than current mapping tools on a single CPU core, and more than 2 times higher than a 64-core multi-processor system. The coprocessor's power efficiency is 29 times higher than that of a conventional 64-core multi-processor. The results indicate that the customized FPGA coprocessor architecture, configured with the scatter-gather memory's word-level access, is well suited to data-intensive applications.
{"title":"Accelerating massive short reads mapping for next generation sequencing (abstract only)","authors":"Chunming Zhang, Wen Tang, Guangming Tan","doi":"10.1145/2554688.2554707","DOIUrl":"https://doi.org/10.1145/2554688.2554707","journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays"},"publicationDate":"2014-02-26"}
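For readers unfamiliar with hash-index read mapping, the sketch below shows the generic seed-and-verify lookup that such mappers build on. It is not the bucket reordering algorithm or the coprocessor design from the abstract above; the function names, the seed length `k`, the mismatch limit and the toy reference string are all assumptions made for illustration.

```python
from collections import defaultdict


def build_index(reference, k):
    """Hash index: every k-mer of the reference maps to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index


def map_read(read, reference, index, k, max_mismatches=1):
    """Look up the read's first k-mer, then verify each candidate position."""
    hits = []
    for pos in index.get(read[:k], []):
        window = reference[pos:pos + len(read)]
        if len(window) == len(read):
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches:
                hits.append(pos)
    return hits


reference = "ACGTACGTTAGCCGATAGGCTA"  # toy reference, not real genome data
index = build_index(reference, k=4)
print(map_read("TAGCCGAT", reference, index, k=4))
```

Each lookup lands in an essentially random bucket of the index, which is the irregular memory access pattern that the abstract's cache-oriented and scatter-gather techniques target.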
Chao Wang, Xi Li, Xuehai Zhou, Yunji Chen, R. Cheung
Next-generation sequencing (NGS) problems have attracted much attention from researchers in the biological and medical computing domains. The current state-of-the-art NGS computing machines are dramatically lowering the cost and increasing the throughput of DNA sequencing. In this paper, we propose a practical study that uses Xilinx Zynq boards to summarize acceleration engines using FPGA accelerators and ARM processors for state-of-the-art short read mapping approaches. The heterogeneous processors and accelerators are coupled with each other using a general Hadoop distributed processing framework. First the reads are collected by the central server, and then distributed to multiple accelerators on the Zynq for hardware acceleration. The combination of hardware acceleration and a Map-Reduce execution flow can therefore greatly accelerate the task of aligning short reads to a known reference genome. Our approach is based on preprocessing the reference genomes and on iterative jobs for aligning the continuously incoming reads. The hardware acceleration is based on the creditable RMAP read-mapping software approach. Furthermore, the speedup on a Hadoop cluster that includes 8 development boards is evaluated. Experimental results demonstrate that our proposed architecture and methods achieve a speedup of more than 112X and scale with the number of accelerators. Finally, the Zynq-based cluster has the potential to efficiently accelerate even general large-scale big data applications. This work was supported by the NSFC grants No. 61379040, No. 61272131 and No. 61202053.
{"title":"Big data genome sequencing on Zynq based clusters (abstract only)","authors":"Chao Wang, Xi Li, Xuehai Zhou, Yunji Chen, R. Cheung","doi":"10.1145/2554688.2554694","DOIUrl":"https://doi.org/10.1145/2554688.2554694","journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays"},"publicationDate":"2014-02-26"}
Antonio Filgueras, E. Gil, Daniel Jiménez-González, C. Álvarez, X. Martorell, Jan Langer, Juanjo Noguera, K. Vissers
OmpSs is an OpenMP-like directive-based programming model that includes heterogeneous execution (MIC, GPU, SMP, etc.) and runtime management of task dependencies. Indeed, OmpSs has largely influenced the recently released OpenMP 4.0 specification. The Zynq All-Programmable SoC combines the features of an SMP and an FPGA and benefits from DLP, ILP and TLP parallelism in order to efficiently exploit new technology improvements and chip resource capacities. In this paper, we focus on programmability and heterogeneous execution support, presenting a successful combination of the OmpSs programming model and the Zynq All-Programmable SoC platforms.
{"title":"OmpSs@Zynq all-programmable SoC ecosystem","authors":"Antonio Filgueras, E. Gil, Daniel Jiménez-González, C. Álvarez, X. Martorell, Jan Langer, Juanjo Noguera, K. Vissers","doi":"10.1145/2554688.2554777","DOIUrl":"https://doi.org/10.1145/2554688.2554777","journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays"},"publicationDate":"2014-02-26"}
The significant development of high-level synthesis tools has greatly facilitated the use of FPGAs as general computing platforms. During parallelism optimization of the data path, memory becomes a crucial bottleneck that impedes performance enhancement. Simultaneous data access is highly restricted by the data mapping strategy and memory port constraints. Memory partitioning can efficiently map data elements in the same logical array onto multiple physical banks so that the accesses to the array are parallelized. Previous methods for memory partitioning mainly focused on cyclic partitioning for single-port memory. In this work we propose a generalized memory-partitioning framework to provide high data throughput for on-chip memories. We generalize cyclic partitioning into block-cyclic partitioning for a larger design space exploration. We build the conflict detection algorithm on polytope emptiness testing, and use integer point counting in polytopes for intra-bank offset generation. Memory partitioning for multi-port memory is supported in this framework. Experimental results demonstrate that, compared to the state-of-the-art partitioning algorithm, our proposed algorithm can reduce the number of block RAMs by 19.58%, slices by 20.26% and DSPs by 50%.
{"title":"Theory and algorithm for generalized memory partitioning in high-level synthesis","authors":"Yuxin Wang, Peng Li, J. Cong","doi":"10.1145/2554688.2554780","DOIUrl":"https://doi.org/10.1145/2554688.2554780","url":null,"abstract":"The significant development of high-level synthesis tools has greatly facilitated the use of FPGAs as general computing platforms. During parallelism optimization of the data path, memory becomes a crucial bottleneck that impedes performance enhancement. Simultaneous data access is highly restricted by the data mapping strategy and memory port constraints. Memory partitioning can efficiently map data elements in the same logical array onto multiple physical banks so that the accesses to the array are parallelized. Previous methods for memory partitioning mainly focused on cyclic partitioning for single-port memory. In this work we propose a generalized memory-partitioning framework to provide high data throughput of on-chip memories. We generalize cyclic partitioning into block-cyclic partitioning for a larger design space exploration. We build the conflict detection algorithm on polytope emptiness testing, and use integer-point counting in polytopes for intra-bank offset generation. Memory partitioning for multi-port memory is supported in this framework. Experimental results demonstrate that, compared to the state-of-the-art partitioning algorithm, our proposed algorithm can reduce the number of block RAMs by 19.58%, slices by 20.26% and DSPs by 50%.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125697431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
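The memory-partitioning abstract above distinguishes cyclic from block-cyclic partitioning. The following is a minimal Python sketch of that idea for a one-dimensional array, not the authors' algorithm: the function names `block_cyclic_map` and `has_bank_conflict` are hypothetical, the offset formula is one common convention, and the conflict check enumerates iterations by brute force instead of using the polytope emptiness test the paper describes.

```python
def block_cyclic_map(index, num_banks, block_size=1):
    """Map a 1-D array index to (bank, intra-bank offset).

    Consecutive groups of `block_size` elements are dealt to banks in
    round-robin order; block_size == 1 gives plain cyclic partitioning.
    (Hypothetical helper, assumed convention for the offset.)
    """
    block = index // block_size          # which block the element falls in
    bank = block % num_banks             # blocks are assigned round-robin
    offset = (block // num_banks) * block_size + index % block_size
    return bank, offset


def has_bank_conflict(access_offsets, num_banks, block_size, iterations):
    """Return True if two accesses a[i + d] hit the same bank in some iteration.

    Brute-force enumeration over `iterations`; this stands in for the
    polytope-based conflict detection and only works for small loops.
    """
    for i in iterations:
        banks = [block_cyclic_map(i + d, num_banks, block_size)[0]
                 for d in access_offsets]
        if len(set(banks)) < len(banks):
            return True
    return False


if __name__ == "__main__":
    # Loop body reads a[i] and a[i + 2] in the same cycle, with two banks.
    reads = [0, 2]
    print("cyclic conflict:      ", has_bank_conflict(reads, 2, 1, range(16)))
    print("block-cyclic conflict:", has_bank_conflict(reads, 2, 2, range(16)))
```

In this toy case the access pair `a[i]`, `a[i + 2]` collides under cyclic partitioning with two banks but not under block-cyclic partitioning with a block size of 2, which illustrates why the extra block-size parameter enlarges the design space.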
Latency insensitive communication offers many potential benefits for FPGA designs, including easier timing closure by enabling automatic pipelining, and easier interfacing with embedded NoCs. However, it is important to understand the costs and trade-offs associated with any new design style. This paper presents optimized implementations of latency insensitive communication building blocks, quantifies their overheads in terms of area and frequency, and provides guidance to designers on how to generate high-speed and area-efficient latency insensitive systems.
{"title":"Quantifying the cost and benefit of latency insensitive communication on FPGAs","authors":"Kevin E. Murray, Vaughn Betz","doi":"10.1145/2554688.2554786","DOIUrl":"https://doi.org/10.1145/2554688.2554786","url":null,"abstract":"Latency insensitive communication offers many potential benefits for FPGA designs, including easier timing closure by enabling automatic pipelining, and easier interfacing with embedded NoCs. However, it is important to understand the costs and trade-offs associated with any new design style. This paper presents optimized implementations of latency insensitive communication building blocks, quantifies their overheads in terms of area and frequency, and provides guidance to designers on how to generate high-speed and area-efficient latency insensitive systems.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128213601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}