
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays: latest publications

CAD and routing architecture for interposer-based multi-FPGA systems
A. H. Pereira, Vaughn Betz
Interposer-based multi-FPGA systems are composed of multiple FPGA dice connected through a silicon interposer. Such devices allow larger FPGA systems to be built than one monolithic die can accommodate and are now commercially available. An open question, however, is how efficient such systems are compared to a monolithic FPGA, as an interposer-based system has fewer signals passing between dice and a higher inter-die signal delay than a monolithic FPGA. We create a new version of VPR to investigate the architecture of such systems, and show that by modifying the placement cost function to minimize the number of signals that must cross between dice we can reduce routing demand by 18% and delay by 2%. We also show that the signal count between dice and the signal delay between dice are key architecture parameters for interposer-based FPGA systems. We find that if an interposer supplies (between dice) 60% of the routing capacity that the normal (within-die) FPGA routing channels supply, there is little impact on the routability of circuits. Smaller routing capacities in the interposer do impact routability, however: minimum channel width increases by 20% and 50% when an interposer supplies only 40% and 30% of the within-die routing, respectively. The interposer also impacts delay, increasing circuit delay by 34% on average for a 1 ns interposer signal delay and a four-die system. Reducing the interposer delay yields a greater improvement in circuit speed than reducing the number of dice in the system.
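The modified placement cost function described above can be illustrated with a small sketch (not the authors' actual VPR code): a standard half-perimeter wirelength term plus a penalty for each net that must cross the die boundary. The boundary position and penalty weight are invented for illustration.

```python
# Illustrative placement cost in the spirit of the paper's modified VPR cost
# function. DIE_BOUNDARY_Y and CROSSING_WEIGHT are assumed values.

DIE_BOUNDARY_Y = 10      # blocks with y below/above this sit on different dice
CROSSING_WEIGHT = 2.0    # extra cost per net that must cross the interposer

def net_cost(block_positions):
    """Half-perimeter wirelength plus a penalty if the net crosses a die."""
    xs = [x for x, y in block_positions]
    ys = [y for x, y in block_positions]
    hpwl = (max(xs) - min(xs)) + (max(ys) - min(ys))
    crosses_die = min(ys) < DIE_BOUNDARY_Y <= max(ys)
    return hpwl + (CROSSING_WEIGHT if crosses_die else 0.0)
```

Minimizing the sum of `net_cost` over all nets during simulated-annealing moves biases the placer toward keeping each net's terminals on one die.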
DOI: 10.1145/2554688.2554776 (published 2014-02-26)
Citations: 22
BMP: a fast B*-tree based modular placer for FPGAs (abstract only)
Fubing Mao, Yi-Chung Chen, Wei Zhang, Hai Helen Li
With the wide application of FPGAs in adaptive computing systems, there is an increasing need to support design automation for partially reconfigurable (PR) FPGAs. However, there is a missing link between CAD tools for PR FPGAs and existing widely used CAD tools such as VPR. Hence, in this work we propose a modular placer for FPGAs, because each PR region must be identified during partial reconfiguration and treated as a single entity during placement and routing, which current CAD tools do not support well. Our proposed tool is built on top of VPR. It takes pre-synthesized module information (area, delay, etc.) from a library and performs modular placement to minimize the total area and delay of the application. Module information is represented in a B*-tree structure to allow fast placement. We amend the operations of the B*-tree to fit the hardware characteristics of FPGAs. Different width-height ratios of the modules are exploited to optimize the area-delay product. Experimental results compare area, delay and execution time with the original VPR. Although our placer may have an area disadvantage because of blank space among modules, it improves the delay of most benchmarks compared to the VPR results. Finally, we show PR-aware routing based on the modular placement.
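A minimal sketch of the B*-tree packing idea the abstract refers to (not the authors' BMP implementation): the left child is the module abutting on the right, the right child is the module stacked at the same x, and a horizontal contour gives each module's y coordinate.

```python
# Minimal B*-tree floorplan packing sketch. Node fields and the contour
# representation are illustrative, not BMP's actual data structures.

class Node:
    def __init__(self, name, w, h):
        self.name, self.w, self.h = name, w, h
        self.left = None   # module placed immediately to the right
        self.right = None  # module placed at the same x, above in the contour

def pack(root):
    """Place modules by DFS over the B*-tree, using a horizontal contour."""
    contour = {}       # x coordinate -> current top-of-contour height
    placements = {}    # module name -> (x, y)
    def place(node, x):
        # y is the highest contour point under the module's footprint
        y = max((contour.get(xi, 0) for xi in range(x, x + node.w)), default=0)
        placements[node.name] = (x, y)
        for xi in range(x, x + node.w):
            contour[xi] = y + node.h
        if node.left:
            place(node.left, x + node.w)
        if node.right:
            place(node.right, x)
    place(root, 0)
    return placements
```

Perturbing the tree (swapping nodes, moving subtrees, rotating a module by swapping w and h) and re-packing is what makes B*-tree placement fast to iterate.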
DOI: 10.1145/2554688.2554755 (published 2014-02-26)
Citations: 2
A FPGA prototype design emphasis on low power technique
Xu Hanyang, Wang Jian, Jin Meilai
In this paper, we propose a fully functional nanometer FPGA prototype chip. Compared to traditional single-supply-voltage, single-threshold-voltage designs, we explore low-power nanometer FPGA design challenges with multi-Vt, static voltage scaling and sleep-mode techniques. In contrast to Dynamic Voltage Scaling (DVS), we build a table of voltage-delay parameter pairs under different voltage conditions so that timing information can be calculated by a Static Timing Analysis (STA) tool; the lowest supply voltage is then chosen among all results that meet the timing requirements. This approach simplifies the hardware design, since it needs no complex workload-detection circuit, unlike a DVS system. By separating supply voltages, we can directly shut down the power supply of unused circuits. Compared to inserting a sleep transistor in pull-up or pull-down networks, we eliminate the speed penalty caused by the additional sleep transistor. We implement a tile-based heterogeneous architecture with island-style routing and embedded special-purpose blocks such as DSP and memory. The array size is 64×31 (rows×columns), including 64×24 CLBs. The final design is fabricated in a 1P10M 65-nm bulk CMOS process. Test results show a 53% reduction in static power compared to a commercial FPGA device that is also fabricated in a 65-nm process and has a similar array size.
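The static-voltage-scaling selection described above, tabulated voltage/delay pairs from STA followed by picking the lowest voltage that meets timing, can be sketched as follows; the voltage and delay numbers are illustrative, not from the paper.

```python
# Sketch of static voltage scaling: choose the lowest supply voltage whose
# STA-derived critical-path delay meets the clock period. Table is invented.

VOLTAGE_DELAY_NS = [  # (supply voltage in V, critical-path delay in ns)
    (1.2, 8.0),
    (1.1, 9.5),
    (1.0, 11.2),
    (0.9, 14.0),
]

def choose_supply(clock_period_ns):
    """Lowest tabulated voltage that still meets the timing requirement."""
    feasible = [v for v, d in VOLTAGE_DELAY_NS if d <= clock_period_ns]
    if not feasible:
        raise ValueError("timing cannot be met at any tabulated voltage")
    return min(feasible)
```

Because the choice is made once, at design time, from STA results, no run-time workload-detection circuit is needed, which is the simplification over DVS the abstract claims.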
DOI: 10.1145/2554688.2554762 (published 2014-02-26)
Citations: 1
Control signal aware slice-level window based legalization method for FPGA placement (abstract only)
Yu Wang, Donghoon Yeo, Muhammad Sohail, Hyunchul Shin
Control-signal sharing when packing flip-flops and other instances into slices is a necessary constraint on instance placement in FPGAs. Global placement usually does not consider signal sharing. In this paper, we propose a control-signal-aware slice-level packing algorithm within the framework of a window-based legalization method to obtain an optimized legal layout, satisfying all constraints, after global placement. We select the target window with the highest number of overlaps. We then check the capacity of the target window and adjust its size to secure enough space for legalization. Lastly, window-based legalization takes three constraints into account: 1) control-signal sharing: two flip-flops in a slice must share a single control signal in the FPGA architecture; 2) CLB architecture matching: instances should be placed within a half slice to minimize the routing requirement; 3) slice-level packing: instances are packed into slices for effective utilization of the available empty space within a window. The experimental results show that our algorithm performs better, with 45% less block displacement and 10% less runtime at the same wirelength, when compared to a previous well-known mixed-size block greedy legalization method [1].
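Constraint 1) above can be sketched as a simple legality check applied while packing; the flip-flop control-signal fields used here are assumptions for illustration, not the paper's data model.

```python
# Sketch of the control-signal-sharing constraint: two flip-flops may share
# a slice only if their control signals (clock, enable, reset) are identical.

from collections import namedtuple

FlipFlop = namedtuple("FlipFlop", ["name", "clk", "ce", "reset"])

def can_share_slice(ff_a, ff_b):
    """Legal packing requires an identical control-signal set."""
    return (ff_a.clk, ff_a.ce, ff_a.reset) == (ff_b.clk, ff_b.ce, ff_b.reset)
```

A legalizer would run this check for every candidate pairing inside the target window and fall back to a neighboring slice when it fails.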
DOI: 10.1145/2554688.2554727 (published 2014-02-26)
Citations: 0
Soft vector processors with streaming pipelines
Aaron Severance, Joe Edwards, Hossein Omidian, G. Lemieux
Soft vector processors (SVPs) achieve significant performance gains through the use of parallel ALUs. However, since the ALUs are used in a time-multiplexed fashion, this does not exploit a key strength of FPGA performance: pipeline parallelism. This paper shows how streaming pipelines can be integrated into the datapath of an SVP to achieve dramatic speedups. The SVP plays an important role in supplying the pipeline with high-bandwidth input data and storing its results using on-chip memory. However, the SVP must also perform the housekeeping tasks necessary to keep the pipeline busy. In particular, it orchestrates data movement between on-chip memory and external DRAM, it pre- or post-processes the data using its own ALUs, and it controls the overall sequence of execution. Since the SVP is programmed in C, these tasks are easier to develop and debug than with a traditional HDL approach. Using the N-body problem as a case study, this paper illustrates how custom streaming pipelines are integrated into the SVP datapath, along with multiple techniques for generating them. Using a custom pipeline, we demonstrate speedups of over 7,000 times and performance-per-ALM over 100 times better than Nios II/f. The custom pipeline is also 50 times faster than a naive implementation on an Intel Core i7 processor.
DOI: 10.1145/2554688.2554774 (published 2014-02-26)
Citations: 37
MPack: global memory optimization for stream applications in high-level synthesis
Jasmina Vasiljevic, P. Chow
One of the challenges in designing high-performance FPGA applications is fine-tuning the use of limited on-chip memory storage among the many buffers in an application. To achieve the desired performance, the designer faces the burden of packing such buffers into on-chip memories and manually optimizing the utilization of each memory and the throughput of each buffer. In addition, the application's memories may not match the word width or depth of the physical on-chip memories available on the FPGA. This process is time-consuming and non-trivial, particularly with a large number of buffers of various depths and bit widths. We propose a tool, MPack, which globally optimizes on-chip memory use across all buffers for stream applications. The goal is to speed up development by providing rapid design-space exploration and relieving the designer of lengthy low-level iterations. We introduce new high-level pragmas that allow the user to specify global memory requirements, such as an application's on-chip memory budget and data throughput. The user can quickly generate a large number of memory solutions and explore the trade-off between memory usage and achievable throughput. To demonstrate the effectiveness of our tool, we apply the new high-level pragmas to an image-processing benchmark. MPack effectively explores the design space and is able to produce a large number of memory solutions ranging from 10 to 100% in throughput, and from 12 to 100% in on-chip memory usage.
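The buffer-to-memory packing problem MPack automates can be illustrated with a naive first-fit-decreasing sketch (not MPack's actual algorithm): each physical on-chip memory block has a fixed capacity, and logical buffers are assigned without exceeding it. The 18 Kbit block size is an assumed BRAM capacity, and splitting of oversized buffers across blocks is omitted.

```python
# First-fit-decreasing sketch of packing logical buffers into fixed-size
# on-chip memory blocks. Capacity and buffer sizes are illustrative.

BRAM_BITS = 18 * 1024  # assumed physical block size (18 Kbit)

def pack_buffers(buffer_bits):
    """Assign each buffer (name -> size in bits) to a memory block index."""
    brams = []       # remaining capacity per allocated block
    assignment = {}  # buffer name -> block index
    for name, bits in sorted(buffer_bits.items(), key=lambda kv: -kv[1]):
        assert bits <= BRAM_BITS, "splitting oversized buffers not modeled"
        for i, free in enumerate(brams):
            if bits <= free:        # first block with room wins
                brams[i] -= bits
                assignment[name] = i
                break
        else:                       # no block fits: allocate a new one
            brams.append(BRAM_BITS - bits)
            assignment[name] = len(brams) - 1
    return assignment, len(brams)
```

Sharing a block between buffers saves memory but serializes their port accesses, which is exactly the memory-usage versus throughput trade-off the abstract describes.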
DOI: 10.1145/2554688.2554761 (published 2014-02-26)
Citations: 5
A scalable sparse matrix-vector multiplication kernel for energy-efficient sparse-blas on FPGAs
R. Dorrance, Fengbo Ren, D. Markovic
Sparse Matrix-Vector Multiplication (SpMxV) is a widely used mathematical operation in many high-performance scientific and engineering applications. In recent years, tuned software libraries for multi-core microprocessors (CPUs) and graphics processing units (GPUs) have become the status quo for computing SpMxV. However, the computational throughput of these libraries for sparse matrices tends to be significantly lower than for dense matrices, mostly because the compression formats required to store sparse matrices efficiently are a poor match for traditional computing architectures. This paper describes an FPGA-based SpMxV kernel that scales to utilize the available memory bandwidth and computing resources efficiently. Benchmarking on a Virtex-5 SX95T FPGA demonstrates an average computational efficiency of 91.85%. The kernel achieves a peak computational efficiency of 99.8%, a >50x improvement over two Intel Core i7 processors (i7-2600 and i7-4770) and a >300x improvement over two NVIDIA GPUs (GTX 660 and GTX Titan), running the MKL and cuSPARSE sparse-BLAS libraries, respectively. In addition, the SpMxV FPGA kernel achieves higher performance than its CPU and GPU counterparts while using only 64 single-precision processing elements, with an overall 38-50x improvement in energy efficiency.
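The compression-format mismatch the abstract mentions is easiest to see in code. A minimal SpMxV kernel over CSR (compressed sparse row), the common storage scheme in sparse-BLAS libraries, makes the irregular, indirect vector loads explicit; it is a reference sketch, not the paper's FPGA kernel.

```python
# Reference CSR sparse matrix-vector multiply. The indirect x[col_idx[k]]
# load is the irregular access pattern that hurts CPU/GPU cache behavior.

def spmxv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values:  nonzero entries, row by row
    col_idx: column index of each nonzero
    row_ptr: start offset of each row in values (len = n_rows + 1)
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]  # indirect, data-dependent load
        y.append(acc)
    return y
```

On an FPGA the inner multiply-accumulate stream can be deeply pipelined per processing element, which is how the kernel sustains the high computational efficiency reported above.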
DOI: 10.1145/2554688.2554785 (published 2014-02-26)
Citations: 97
xDEFENSE: an extended DEFENSE for mitigating next generation intrusions (abstract only)
J. Lamberti, D. Shila, V. Venugopal
In this work, we propose a modified DEFENSE architecture, termed xDEFENSE, that can detect and react to hardware attacks in real time. In the past, several Root of Trust architectures such as DEFENSE and RETC have been proposed to foil attempts by hardware Trojans to leak sensitive information. In a typical Root of Trust architecture, hardware is allowed to access memory only by responding properly to a challenge issued by the memory guard. However, in a recent effort, we observed that these architectures can in fact be susceptible to a variety of threats, ranging from denial-of-service attacks and privilege escalation to information leakage, by injecting a Trojan into Root of Trust modules such as the memory guard and authorized hardware. In our work, we propose a security monitor that monitors all transactions between the authorized hardware, memory guard and memory. It also authenticates these components through Hashed Message Authentication Codes (HMACs) to detect any invalid memory access or denial-of-service attack that disrupts the challenge-response pairs. The proposed xDEFENSE architecture was implemented on a Xilinx Spartan-3 FPGA evaluation board, and our results indicate that xDEFENSE requires 143 additional slices compared to DEFENSE and incurs a monitoring latency of 22 ns.
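The HMAC challenge-response authentication the security monitor builds on can be sketched with standard primitives; the shared key and message layout here are assumptions for illustration, not the paper's protocol details.

```python
# Sketch of HMAC-based challenge-response authentication between a memory
# guard (verifier) and authorized hardware (prover). Key is illustrative.

import hmac
import hashlib

KEY = b"shared-secret-provisioned-at-boot"  # assumed pre-shared key

def respond(challenge: bytes) -> bytes:
    """Prover side: demonstrate knowledge of KEY for this challenge."""
    return hmac.new(KEY, challenge, hashlib.sha256).digest()

def verify(challenge: bytes, response: bytes) -> bool:
    """Verifier side: constant-time comparison against the expected tag."""
    expected = hmac.new(KEY, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)
```

A Trojan that tampers with or replays transactions cannot forge a valid tag without the key, so a failed `verify` flags the invalid memory access to the monitor.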
{"title":"xDEFENSE: an extended DEFENSE for mitigating next generation intrusions (abstract only)","authors":"J. Lamberti, D. Shila, V. Venugopal","doi":"10.1145/2554688.2554714","DOIUrl":"https://doi.org/10.1145/2554688.2554714","url":null,"abstract":"In this work, we propose a modified DEFENSE architecture termed as xDEFENSE that can detect and react to hardware attacks in real-time. In the past, several Root of Trust architectures such as DEFENSE and RETC have been proposed to foil attempts by hardware Trojans to leak sensitive information. In a typical Root of Trust architecture scenario, hardware is allowed to access the memory only by responding properly to a challenge requested by the memory guard. However in a recent effort, we observed that these architectures can in fact be susceptible to a variety of threats ranging from denial of service attacks, privilege escalation to information leakage, by injecting a Trojan into the Root of Trust modules such as memory guards and authorized hardware. In our work, we propose a security monitor that monitors all transactions between the authorized hardware, memory guard and memory. It also authenticates these components through the use of Hashed Message Authentication Codes (HMAC) to detect any invalid memory access or denial of service attack by disrupting the challenge-response pairs. 
The proposed xDEFENSE architecture was implemented on a Xilinx SPARTAN 3 FPGA evaluation board and our results indicate that xDEFENSE requires 143 additional slices as compared to DEFENSE and incurs a monitoring latency of 22ns.","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129894814","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Session details: Applications 2 会话详情:应用程序2
Lesley Shannon
{"title":"Session details: Applications 2","authors":"Lesley Shannon","doi":"10.1145/3260941","DOIUrl":"https://doi.org/10.1145/3260941","url":null,"abstract":"","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132884536","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
APMC: advanced pattern based memory controller (abstract only) APMC:基于高级模式的内存控制器(仅摘要)
Tassadaq Hussain, Oscar Palomar, O. Unsal, A. Cristal, E. Ayguadé, M. Valero, S. Rethinagiri
In this paper, we present APMC, the Advanced Pattern based Memory Controller, which uses descriptors to support both regular and irregular memory access patterns without requiring a master core. It keeps pattern descriptors in memory and prefetches complex 1D/2D/3D data structures into its special scratchpad memory. Memory accesses are arranged in the pattern descriptors at program time, and APMC manages multiple patterns at run time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. It gathers multiple memory read/write requests and maximizes the reuse of open SDRAM banks to decrease the overhead of opening and closing rows. APMC manages data movement between main memory and the specialized scratchpad memory; data present in the scratchpad is reused and/or updated when accessed by several patterns. The system is implemented and tested on a Xilinx ML505 FPGA board, and its performance is compared with a processor using a high-performance memory controller. The results show that APMC transfers regular and irregular datasets up to 20.4x and 3.4x faster, respectively, than the baseline system. Compared to the baseline, APMC consumes 17% less hardware resources and 32% less on-chip power, and achieves speedups of 3.5x to 52x for regular applications and 1.4x to 2.9x for irregular ones. The APMC core consumes 50% less hardware resources than the baseline system's memory controller.
在本文中,我们提出了APMC,一种基于高级模式的内存控制器,它使用描述符来支持规则和不规则的内存访问模式,而不使用主核。它将模式描述符保存在内存中,并将复杂的1D/2D/3D数据结构预取到其特殊的刮擦板存储器中。内存访问在编程时安排在模式描述符中,APMC在运行时管理多个模式以减少访问延迟。所提出的APMC系统减少了处理器/加速器由于不规则存储器访问模式和低存储器带宽而面临的限制。它收集多个内存读/写请求,并最大限度地重用打开的SDRAM库,以减少打开和关闭行的开销。APMC管理主存和专用刮擦板存储器之间的数据移动;存在于专用刮擦板中的数据在被多个模式访问时被重用和/或更新。该系统在Xilinx ML505 FPGA板上进行了实现和测试。将该系统的性能与带有高性能存储器控制器的处理器进行了比较。结果表明,APMC系统对规则和不规则数据集的传输速度分别比基线系统快20.4倍和3.4倍。与基准系统相比,APMC消耗的硬件资源减少了17%,片上功耗减少了32%,在常规应用和非常规应用中分别实现了3.5倍至52倍和1.4倍至2.9倍的加速。APMC核心消耗的硬件资源比基准系统的内存控制器少50%。
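The pattern-descriptor mechanism can be illustrated with a minimal software model. The descriptor fields (base, stride, count) and function names below are assumptions for illustration; the actual APMC encodes 1D/2D/3D patterns in hardware:

```python
from dataclasses import dataclass

# Minimal software model of a pattern descriptor: a (base, stride, count)
# triple naming a strided access pattern, which a controller can use to
# gather data from main memory into a scratchpad ahead of the computation.
# Field names are illustrative, not APMC's actual descriptor format.

@dataclass
class Descriptor:
    base: int    # starting index in main memory
    stride: int  # distance between consecutive elements
    count: int   # number of elements to prefetch

def prefetch(memory: list, desc: Descriptor) -> list:
    """Gather the elements named by the descriptor into a scratchpad buffer."""
    return [memory[desc.base + i * desc.stride] for i in range(desc.count)]

main_memory = list(range(100))                 # stand-in for DRAM contents
desc = Descriptor(base=4, stride=8, count=5)   # a regular strided pattern
scratchpad = prefetch(main_memory, desc)       # -> [4, 12, 20, 28, 36]
```

Because the pattern is declared up front rather than discovered one miss at a time, the controller can batch the strided reads; a 2D or 3D pattern can be modeled the same way by nesting a row descriptor inside a column descriptor.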
{"title":"APMC: advanced pattern based memory controller (abstract only)","authors":"Tassadaq Hussain, Oscar Palomar, O. Unsal, A. Cristal, E. Ayguadé, M. Valero, S. Rethinagiri","doi":"10.1145/2554688.2554732","DOIUrl":"https://doi.org/10.1145/2554688.2554732","url":null,"abstract":"In this paper, we present APMC, the Advanced Pattern based Memory Controller, that uses descriptors to support both regular and irregular memory access patterns without using a master core. It keeps pattern descriptors in memory and prefetches the complex 1D/2D/3D data structure into its special scratchpad memory. Support for irregular Memory accesses are arranged in the pattern descriptors at program-time and APMC manages multiple patterns at run-time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. It gathers multiple memory read/write requests and maximizes the reuse of opened SDRAM banks to decrease the overhead of opening and closing rows. APMC manages data movement between main memory and the specialized scratchpad memory; data present in the specialized scratchpad is reused and/or updated when accessed by several patterns. The system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the system is compared with a processor with a high performance memory controller. The results show that the APMC system transfers regular and irregular datasets up to 20.4x and 3.4x faster respectively than the baseline system. When compared to the baseline system, APMC consumes 17% less hardware resources, 32% less on-chip power and achieves between 3.5x to 52x and 1.4x to 2.9x of speedup for regular and irregular applications respectively. The APMC core consumes 50% less hardware resources than the baseline system's memory controller. 
In this paper, we present APMC, the Advanced Pattern based Memory Controller, an intelligent memory controller that uses descriptors to supports both regular and irregular memory access patterns. support of the master core. It keeps pattern descriptors in memory and prefetches the complex data structure into its special scratchpad memory. Memory accesses are arranged in the pattern descriptors at program-time and APMC manages multiple patterns at run-time to reduce access latency. The proposed APMC system reduces the limitations faced by processors/accelerators due to irregular memory access patterns and low memory bandwidth. The system is implemented and tested on a Xilinx ML505 FPGA board. The performance of the system is compared with a processor with a high performance memory controller. The results show that the APMC system transfers regular and irregular datasets up to 20.4x and 3.4x faster respectively than the baseline system. When compared to the baseline system, APMC consumes 17% less hardware resources, 32% less on-chip power and achieves between 3.5x to 52x and 1.4x to 2.9x of speedup for regular and irregular applications respectively. The APMC core consumes 50% less hardware resources than the baseline system's memory controller.memory accesses. 
In this paper, we present APMC, the Advanced Pattern based Memory Controller, an intelligent memory controller that support","PeriodicalId":390562,"journal":{"name":"Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134049087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Journal
Proceedings of the 2014 ACM/SIGDA international symposium on Field-programmable gate arrays