
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays: latest publications

CAD and routing architecture for interposer-based multi-FPGA systems
A. H. Pereira, Vaughn Betz
Interposer-based multi-FPGA systems are composed of multiple FPGA dice connected through a silicon interposer. Such devices allow larger FPGA systems to be built than one monolithic die can accommodate and are now commercially available. An open question, however, is how efficient such systems are compared to a monolithic FPGA, as an interposer-based system has fewer signals passing between dice and a higher inter-die signal delay than a monolithic FPGA. We create a new version of VPR to investigate the architecture of such systems, and show that by modifying the placement cost function to minimize the number of signals that must cross between dice we can reduce routing demand by 18% and delay by 2%. We also show that the signal count between dice and the signal delay between dice are key architecture parameters for interposer-based FPGA systems. We find that if an interposer supplies (between dice) 60% of the routing capacity that the normal (within-die) FPGA routing channels supply, there is little impact on the routability of circuits. Smaller routing capacities in the interposer do impact routability, however: minimum channel width increases by 20% and 50% when an interposer supplies only 40% and 30% of the within-die routing, respectively. The interposer also impacts delay, increasing circuit delay by 34% on average for a 1 ns interposer signal delay and a four-die system. Reducing the interposer delay yields a greater improvement in circuit speed than reducing the number of dice in the system.
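The modified placement cost function described above can be illustrated with a small sketch (not the authors' actual VPR code): a standard half-perimeter wirelength term plus a penalty for each net that must cross the die boundary. The boundary position and penalty weight are invented for illustration.

```python
# Illustrative placement cost in the spirit of the paper's modified VPR cost
# function. DIE_BOUNDARY_Y and CROSSING_WEIGHT are assumed values.

DIE_BOUNDARY_Y = 10      # blocks with y below/above this sit on different dice
CROSSING_WEIGHT = 2.0    # extra cost per net that must cross the interposer

def net_cost(block_positions):
    """Half-perimeter wirelength plus a penalty if the net crosses a die."""
    xs = [x for x, y in block_positions]
    ys = [y for x, y in block_positions]
    hpwl = (max(xs) - min(xs)) + (max(ys) - min(ys))
    crosses_die = min(ys) < DIE_BOUNDARY_Y <= max(ys)
    return hpwl + (CROSSING_WEIGHT if crosses_die else 0.0)
```

Minimizing the sum of `net_cost` over all nets during simulated-annealing moves biases the placer toward keeping each net's terminals on one die.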
DOI: 10.1145/2554688.2554776 (published 2014-02-26)
Citations: 22
BMP: a fast B*-tree based modular placer for FPGAs (abstract only)
Fubing Mao, Yi-Chung Chen, Wei Zhang, Hai Helen Li
With the wide application of FPGAs in adaptive computing systems, there is an increasing need to support design automation for partially reconfigurable (PR) FPGAs. However, there is a missing link between CAD tools for PR FPGAs and existing widely used CAD tools such as VPR. Hence, in this work we propose a modular placer for FPGAs, because each PR region must be identified during partial reconfiguration and treated as a single entity during placement and routing, which current CAD tools do not support well. Our proposed tool is built on top of VPR. It takes pre-synthesized module information (area, delay, etc.) from a library and performs modular placement to minimize the total area and delay of the application. Module information is represented in a B*-tree structure to allow fast placement. We amend the operations of the B*-tree to fit the hardware characteristics of FPGAs. Different width-height ratios of the modules are exploited to optimize the area-delay product. Experimental results compare area, delay and execution time with the original VPR. Although our placer may have an area disadvantage because of blank space among modules, it improves the delay of most benchmarks compared to the VPR results. Finally, we show PR-aware routing based on the modular placement.
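A minimal sketch of the B*-tree packing idea the abstract refers to (not the authors' BMP implementation): the left child is the module abutting on the right, the right child is the module stacked at the same x, and a horizontal contour gives each module's y coordinate.

```python
# Minimal B*-tree floorplan packing sketch. Node fields and the contour
# representation are illustrative, not BMP's actual data structures.

class Node:
    def __init__(self, name, w, h):
        self.name, self.w, self.h = name, w, h
        self.left = None   # module placed immediately to the right
        self.right = None  # module placed at the same x, above in the contour

def pack(root):
    """Place modules by DFS over the B*-tree, using a horizontal contour."""
    contour = {}       # x coordinate -> current top-of-contour height
    placements = {}    # module name -> (x, y)
    def place(node, x):
        # y is the highest contour point under the module's footprint
        y = max((contour.get(xi, 0) for xi in range(x, x + node.w)), default=0)
        placements[node.name] = (x, y)
        for xi in range(x, x + node.w):
            contour[xi] = y + node.h
        if node.left:
            place(node.left, x + node.w)
        if node.right:
            place(node.right, x)
    place(root, 0)
    return placements
```

Perturbing the tree (swapping nodes, moving subtrees, rotating a module by swapping w and h) and re-packing is what makes B*-tree placement fast to iterate.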
DOI: 10.1145/2554688.2554755 (published 2014-02-26)
Citations: 2
A FPGA prototype design emphasis on low power technique
Xu Hanyang, Wang Jian, Jin Meilai
In this paper, we propose a fully functional nanometer FPGA prototype chip. Compared to traditional single-supply-voltage, single-threshold-voltage designs, we explore low-power nanometer FPGA design challenges with multi-Vt, static voltage scaling and sleep-mode techniques. In contrast to Dynamic Voltage Scaling (DVS), we build a table of voltage-delay parameter pairs under different voltage conditions so that timing information can be calculated by a Static Timing Analysis (STA) tool; the lowest supply voltage is then chosen among all results that meet the timing requirements. This approach simplifies the hardware design, since it needs no complex workload-detection circuit, unlike a DVS system. By separating supply voltages, we can directly shut down the power supply of unused circuits. Compared to inserting a sleep transistor in pull-up or pull-down networks, we eliminate the speed penalty caused by the additional sleep transistor. We implement a tile-based heterogeneous architecture with island-style routing and embedded special-purpose blocks such as DSP and memory. The array size is 64×31 (rows×columns), including 64×24 CLBs. The final design is fabricated in a 1P10M 65-nm bulk CMOS process. Test results show a 53% reduction in static power compared to a commercial FPGA device that is also fabricated in a 65-nm process and has a similar array size.
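The static-voltage-scaling selection described above, tabulated voltage/delay pairs from STA followed by picking the lowest voltage that meets timing, can be sketched as follows; the voltage and delay numbers are illustrative, not from the paper.

```python
# Sketch of static voltage scaling: choose the lowest supply voltage whose
# STA-derived critical-path delay meets the clock period. Table is invented.

VOLTAGE_DELAY_NS = [  # (supply voltage in V, critical-path delay in ns)
    (1.2, 8.0),
    (1.1, 9.5),
    (1.0, 11.2),
    (0.9, 14.0),
]

def choose_supply(clock_period_ns):
    """Lowest tabulated voltage that still meets the timing requirement."""
    feasible = [v for v, d in VOLTAGE_DELAY_NS if d <= clock_period_ns]
    if not feasible:
        raise ValueError("timing cannot be met at any tabulated voltage")
    return min(feasible)
```

Because the choice is made once, at design time, from STA results, no run-time workload-detection circuit is needed, which is the simplification over DVS the abstract claims.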
DOI: 10.1145/2554688.2554762 (published 2014-02-26)
Citations: 1
Control signal aware slice-level window based legalization method for FPGA placement (abstract only)
Yu Wang, Donghoon Yeo, Muhammad Sohail, Hyunchul Shin
Control-signal sharing when packing flip-flops and other instances into slices is a necessary constraint on instance placement in FPGAs. Global placement usually does not consider signal sharing. In this paper, we propose a control-signal-aware slice-level packing algorithm within the framework of a window-based legalization method to obtain an optimized legal layout, satisfying all constraints, after global placement. We select the target window with the highest number of overlaps. We then check the capacity of the target window and adjust its size to secure enough space for legalization. Lastly, window-based legalization takes three constraints into account: 1) control-signal sharing: two flip-flops in a slice must share a single control signal in the FPGA architecture; 2) CLB architecture matching: instances should be placed within a half slice to minimize the routing requirement; 3) slice-level packing: instances are packed into slices for effective utilization of the available empty space within a window. The experimental results show that our algorithm performs better, with 45% less block displacement and 10% less runtime at the same wirelength, when compared to a previous well-known mixed-size block greedy legalization method [1].
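Constraint 1) above can be sketched as a simple legality check applied while packing; the flip-flop control-signal fields used here are assumptions for illustration, not the paper's data model.

```python
# Sketch of the control-signal-sharing constraint: two flip-flops may share
# a slice only if their control signals (clock, enable, reset) are identical.

from collections import namedtuple

FlipFlop = namedtuple("FlipFlop", ["name", "clk", "ce", "reset"])

def can_share_slice(ff_a, ff_b):
    """Legal packing requires an identical control-signal set."""
    return (ff_a.clk, ff_a.ce, ff_a.reset) == (ff_b.clk, ff_b.ce, ff_b.reset)
```

A legalizer would run this check for every candidate pairing inside the target window and fall back to a neighboring slice when it fails.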
DOI: 10.1145/2554688.2554727 (published 2014-02-26)
Citations: 0
Soft vector processors with streaming pipelines
Aaron Severance, Joe Edwards, Hossein Omidian, G. Lemieux
Soft vector processors (SVPs) achieve significant performance gains through the use of parallel ALUs. However, since the ALUs are used in a time-multiplexed fashion, this does not exploit a key strength of FPGA performance: pipeline parallelism. This paper shows how streaming pipelines can be integrated into the datapath of an SVP to achieve dramatic speedups. The SVP plays an important role in supplying the pipeline with high-bandwidth input data and storing its results using on-chip memory. However, the SVP must also perform the housekeeping tasks necessary to keep the pipeline busy. In particular, it orchestrates data movement between on-chip memory and external DRAM, it pre- or post-processes the data using its own ALUs, and it controls the overall sequence of execution. Since the SVP is programmed in C, these tasks are easier to develop and debug than with a traditional HDL approach. Using the N-body problem as a case study, this paper illustrates how custom streaming pipelines are integrated into the SVP datapath, along with multiple techniques for generating them. Using a custom pipeline, we demonstrate speedups of over 7,000 times and performance-per-ALM over 100 times better than Nios II/f. The custom pipeline is also 50 times faster than a naive implementation on an Intel Core i7 processor.
DOI: 10.1145/2554688.2554774 (published 2014-02-26)
Citations: 37
MPack: global memory optimization for stream applications in high-level synthesis
Jasmina Vasiljevic, P. Chow
One of the challenges in designing high-performance FPGA applications is fine-tuning the use of limited on-chip memory storage among the many buffers in an application. To achieve the desired performance, the designer faces the burden of packing such buffers into on-chip memories and manually optimizing the utilization of each memory and the throughput of each buffer. In addition, the application's memories may not match the word width or depth of the physical on-chip memories available on the FPGA. This process is time-consuming and non-trivial, particularly with a large number of buffers of various depths and bit widths. We propose a tool, MPack, which globally optimizes on-chip memory use across all buffers for stream applications. The goal is to speed up development by providing rapid design-space exploration and relieving the designer of lengthy low-level iterations. We introduce new high-level pragmas that allow the user to specify global memory requirements, such as an application's on-chip memory budget and data throughput. The user can quickly generate a large number of memory solutions and explore the trade-off between memory usage and achievable throughput. To demonstrate the effectiveness of our tool, we apply the new high-level pragmas to an image-processing benchmark. MPack effectively explores the design space and is able to produce a large number of memory solutions ranging from 10 to 100% in throughput, and from 12 to 100% in on-chip memory usage.
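The buffer-to-memory packing problem MPack automates can be illustrated with a naive first-fit-decreasing sketch (not MPack's actual algorithm): each physical on-chip memory block has a fixed capacity, and logical buffers are assigned without exceeding it. The 18 Kbit block size is an assumed BRAM capacity, and splitting of oversized buffers across blocks is omitted.

```python
# First-fit-decreasing sketch of packing logical buffers into fixed-size
# on-chip memory blocks. Capacity and buffer sizes are illustrative.

BRAM_BITS = 18 * 1024  # assumed physical block size (18 Kbit)

def pack_buffers(buffer_bits):
    """Assign each buffer (name -> size in bits) to a memory block index."""
    brams = []       # remaining capacity per allocated block
    assignment = {}  # buffer name -> block index
    for name, bits in sorted(buffer_bits.items(), key=lambda kv: -kv[1]):
        assert bits <= BRAM_BITS, "splitting oversized buffers not modeled"
        for i, free in enumerate(brams):
            if bits <= free:        # first block with room wins
                brams[i] -= bits
                assignment[name] = i
                break
        else:                       # no block fits: allocate a new one
            brams.append(BRAM_BITS - bits)
            assignment[name] = len(brams) - 1
    return assignment, len(brams)
```

Sharing a block between buffers saves memory but serializes their port accesses, which is exactly the memory-usage versus throughput trade-off the abstract describes.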
DOI: 10.1145/2554688.2554761 (published 2014-02-26)
Citations: 5
A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs
R. Dorrance, Fengbo Ren, D. Markovic
Sparse Matrix-Vector Multiplication (SpMxV) is a widely used mathematical operation in many high-performance scientific and engineering applications. In recent years, tuned software libraries for multi-core microprocessors (CPUs) and graphics processing units (GPUs) have become the status quo for computing SpMxV. However, the computational throughput of these libraries for sparse matrices tends to be significantly lower than for dense matrices, mostly because the compression formats required to store sparse matrices efficiently are a poor match for traditional computing architectures. This paper describes an FPGA-based SpMxV kernel that scales to utilize the available memory bandwidth and computing resources efficiently. Benchmarking on a Virtex-5 SX95T FPGA demonstrates an average computational efficiency of 91.85%. The kernel achieves a peak computational efficiency of 99.8%, a >50x improvement over two Intel Core i7 processors (i7-2600 and i7-4770) and a >300x improvement over two NVIDIA GPUs (GTX 660 and GTX Titan), running the MKL and cuSPARSE sparse-BLAS libraries, respectively. In addition, the SpMxV FPGA kernel achieves higher performance than its CPU and GPU counterparts while using only 64 single-precision processing elements, with an overall 38-50x improvement in energy efficiency.
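The compression-format mismatch the abstract mentions is easiest to see in code. A minimal SpMxV kernel over CSR (compressed sparse row), the common storage scheme in sparse-BLAS libraries, makes the irregular, indirect vector loads explicit; it is a reference sketch, not the paper's FPGA kernel.

```python
# Reference CSR sparse matrix-vector multiply. The indirect x[col_idx[k]]
# load is the irregular access pattern that hurts CPU/GPU cache behavior.

def spmxv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values:  nonzero entries, row by row
    col_idx: column index of each nonzero
    row_ptr: start offset of each row in values (len = n_rows + 1)
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect, data-dependent load
        y.append(acc)
    return y
```

On an FPGA the inner multiply-accumulate stream can be deeply pipelined per processing element, which is how the kernel sustains the high computational efficiency reported above.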
DOI: 10.1145/2554688.2554785 (published 2014-02-26)
Citations: 97
xDEFENSE: an extended DEFENSE for mitigating next generation intrusions (abstract only)
J. Lamberti, D. Shila, V. Venugopal
In this work, we propose a modified DEFENSE architecture, termed xDEFENSE, that can detect and react to hardware attacks in real time. In the past, several Root of Trust architectures such as DEFENSE and RETC have been proposed to foil attempts by hardware Trojans to leak sensitive information. In a typical Root of Trust architecture, hardware is allowed to access memory only by responding properly to a challenge issued by the memory guard. However, in a recent effort, we observed that these architectures can in fact be susceptible to a variety of threats, ranging from denial-of-service attacks and privilege escalation to information leakage, by injecting a Trojan into Root of Trust modules such as the memory guard and authorized hardware. In our work, we propose a security monitor that monitors all transactions between the authorized hardware, memory guard and memory. It also authenticates these components through Hashed Message Authentication Codes (HMACs) to detect any invalid memory access or denial-of-service attack that disrupts the challenge-response pairs. The proposed xDEFENSE architecture was implemented on a Xilinx Spartan-3 FPGA evaluation board, and our results indicate that xDEFENSE requires 143 additional slices compared to DEFENSE and incurs a monitoring latency of 22 ns.
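The HMAC challenge-response authentication the security monitor builds on can be sketched with standard primitives; the shared key and message layout here are assumptions for illustration, not the paper's protocol details.

```python
# Sketch of HMAC-based challenge-response authentication between a memory
# guard (verifier) and authorized hardware (prover). Key is illustrative.

import hmac
import hashlib

KEY = b"shared-secret-provisioned-at-boot"  # assumed pre-shared key

def respond(challenge: bytes) -> bytes:
    """Prover side: demonstrate knowledge of KEY for this challenge."""
    return hmac.new(KEY, challenge, hashlib.sha256).digest()

def verify(challenge: bytes, response: bytes) -> bool:
    """Verifier side: constant-time comparison against the expected tag."""
    expected = hmac.new(KEY, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

A Trojan that tampers with or replays transactions cannot forge a valid tag without the key, so a failed `verify` flags the invalid memory access to the monitor.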
{"title":"xDEFENSE: an extended DEFENSE for mitigating next generation intrusions (abstract only)","authors":"J. Lamberti, D. Shila, V. Venugopal","doi":"10.1145/2554688.2554714","DOIUrl":"https://doi.org/10.1145/2554688.2554714","url":null,"abstract":"In this work, we propose a modified DEFENSE architecture termed as xDEFENSE that can detect and react to hardware attacks in real-time. In the past, several Root of Trust architectures such as DEFENSE and RETC have been proposed to foil attempts by hardware Trojans to leak sensitive information. In a typical Root of Trust architecture scenario, hardware is allowed to access the memory only by responding properly to a challenge requested by the memory guard. However in a recent effort, we observed that these architectures can in fact be susceptible to a variety of threats ranging from denial of service attacks, privilege escalation to information leakage, by injecting a Trojan into the Root of Trust modules such as memory guards and authorized hardware. In our work, we propose a security monitor that monitors all transactions between the authorized hardware, memory guard and memory. It also authenticates these components through the use of Hashed Message Authentication Codes (HMAC) to detect any invalid memory access or denial of service attack by disrupting the challenge-response pairs. 
The proposed xDEFENSE architecture was implemented on a Xilinx SPARTAN 3 FPGA evaluation board and our results indicate that xDEFENSE requires 143 additional slices as compared to DEFENSE and incurs a monitoring latency of 22ns.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129894814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Session details: Applications 2 会话详情:应用程序2
Lesley Shannon
{"title":"Session details: Applications 2","authors":"Lesley Shannon","doi":"10.1145/3260941","DOIUrl":"https://doi.org/10.1145/3260941","url":null,"abstract":"","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132884536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
APMC: advanced pattern based memory controller (abstract only) APMC:基于高级模式的内存控制器(仅摘要)
Tassadaq Hussain, Oscar Palomar, O. Unsal, A. Cristal, E. Ayguadé, M. Valero, S. Rethinagiri
In this paper, we present APMC, the Advanced Pattern based Memory Controller, which uses descriptors to support both regular and irregular memory access patterns without requiring a master core. It keeps pattern descriptors in memory and prefetches complex 1D/2D/3D data structures into its special scratchpad memory. Memory accesses are arranged in the pattern descriptors at program time, and APMC manages multiple patterns at run time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. It gathers multiple memory read/write requests and maximizes the reuse of open SDRAM banks to decrease the overhead of opening and closing rows. APMC manages data movement between main memory and the specialized scratchpad memory; data present in the scratchpad is reused and/or updated when accessed by several patterns. The system is implemented and tested on a Xilinx ML505 FPGA board, and its performance is compared with a processor using a high-performance memory controller. The results show that APMC transfers regular and irregular datasets up to 20.4x and 3.4x faster, respectively, than the baseline system. Compared to the baseline, APMC consumes 17% less hardware resources and 32% less on-chip power, and achieves speedups of 3.5x to 52x for regular applications and 1.4x to 2.9x for irregular ones. The APMC core consumes 50% less hardware resources than the baseline system's memory controller.
在本文中,我们提出了APMC,一种基于高级模式的内存控制器,它使用描述符来支持规则和不规则的内存访问模式,而不使用主核。它将模式描述符保存在内存中,并将复杂的1D/2D/3D数据结构预取到其特殊的刮擦板存储器中。内存访问在编程时安排在模式描述符中,APMC在运行时管理多个模式以减少访问延迟。所提出的APMC系统减少了处理器/加速器由于不规则存储器访问模式和低存储器带宽而面临的限制。它收集多个内存读/写请求,并最大限度地重用打开的SDRAM库,以减少打开和关闭行的开销。APMC管理主存和专用刮擦板存储器之间的数据移动;存在于专用刮擦板中的数据在被多个模式访问时被重用和/或更新。该系统在Xilinx ML505 FPGA板上进行了实现和测试。将该系统的性能与带有高性能存储器控制器的处理器进行了比较。结果表明,APMC系统对规则和不规则数据集的传输速度分别比基线系统快20.4倍和3.4倍。与基准系统相比,APMC消耗的硬件资源减少了17%,片上功耗减少了32%,在常规应用和非常规应用中分别实现了3.5倍至52倍和1.4倍至2.9倍的加速。APMC核心消耗的硬件资源比基准系统的内存控制器少50%。
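The pattern-descriptor mechanism can be illustrated with a minimal software model. The descriptor fields (base, stride, count) and function names below are assumptions for illustration; the actual APMC encodes 1D/2D/3D patterns in hardware:

```python
from dataclasses import dataclass

# Minimal software model of a pattern descriptor: a (base, stride, count)
# triple naming a strided access pattern, which a controller can use to
# gather data from main memory into a scratchpad ahead of the computation.
# Field names are illustrative, not APMC's actual descriptor format.

@dataclass
class Descriptor:
    base: int    # starting index in main memory
    stride: int  # distance between consecutive elements
    count: int   # number of elements to prefetch

def prefetch(memory: list, desc: Descriptor) -> list:
    """Gather the elements named by the descriptor into a scratchpad buffer."""
    return [memory[desc.base + i * desc.stride] for i in range(desc.count)]

main_memory = list(range(100))                 # stand-in for DRAM contents
desc = Descriptor(base=4, stride=8, count=5)   # a regular strided pattern
scratchpad = prefetch(main_memory, desc)       # -> [4, 12, 20, 28, 36]
```

Because the pattern is declared up front rather than discovered one miss at a time, the controller can batch the strided reads; a 2D or 3D pattern can be modeled the same way by nesting a row descriptor inside a column descriptor.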
{"title":"APMC: advanced pattern based memory controller (abstract only)","authors":"Tassadaq Hussain, Oscar Palomar, O. Unsal, A. Cristal, E. Ayguadé, M. Valero, S. Rethinagiri","doi":"10.1145/2554688.2554732","DOIUrl":"https://doi.org/10.1145/2554688.2554732","url":null,"abstract":"In this paper, we present APMC, the Advanced Pattern based Memory Controller, that uses descriptors to support both regular and irregular memory access patterns without using a master core. It keeps pattern descriptors in memory and prefetches the complex 1D/2D/3D data structure into its special scratchpad memory. Support for irregular Memory accesses are arranged in the pattern descriptors at program-time and APMC manages multiple patterns at run-time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. It gathers multiple memory read/write requests and maximizes the reuse of opened SDRAM banks to decrease the overhead of opening and closing rows. APMC manages data movement between main memory and the specialized scratchpad memory; data present in the specialized scratchpad is reused and/or updated when accessed by several patterns. The system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the system is compared with a processor with a high performance memory controller. The results show that the APMC system transfers regular and irregular datasets up to 20.4x and 3.4x faster respectively than the baseline system. When compared to the baseline system, APMC consumes 17% less hardware resources, 32% less on-chip power and achieves between 3.5x to 52x and 1.4x to 2.9x of speedup for regular and irregular applications respectively. The APMC core consumes 50% less hardware resources than the baseline system's memory controller. 
In this paper, we present APMC, the Advanced Pattern based Memory Controller, an intelligent memory controller that uses descriptors to supports both regular and irregular memory access patterns. support of the master core. It keeps pattern descriptors in memory and prefetches the complex data structure into its special scratchpad memory. Memory accesses are arranged in the pattern descriptors at program-time and APMC manages multiple patterns at run-time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. The system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the system is compared with a processor with a high performance memory controller. The results show that the APMC system transfers regular and irregular datasets up to 20.4x and 3.4x faster respectively than the baseline system. When compared to the baseline system, APMC consumes 17% less hardware resources, 32% less on-chip power and achieves between 3.5x to 52x and 1.4x to 2.9x of speedup for regular and irregular applications respectively. The APMC core consumes 50% less hardware resources than the baseline system's memory controller.memory accesses. 
In this paper, we present APMC, the Advanced Pattern based Memory Controller, an intelligent memory controller that support","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134049087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Journal
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays