首页 > 最新文献

2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines最新文献

英文 中文
Binding Hardware IPs to Specific FPGA Device via Inter-twining the PUF Response with the FSM of Sequential Circuits 通过将PUF响应与顺序电路的FSM交织,将硬件ip绑定到特定的FPGA设备
Jiliang Zhang, Yaping Lin, Yongqiang Lyu, R. Cheung, Wenjie Che, Qiang Zhou, Jinian Bian
The continuous growth in both capability and capacity for FPGA now requires significant resources invested in the hardware design, which results in two classes of main security issues: 1) the unauthorized use and piracy attacks including cloning, reverse engineering, tampering etc. 2) the licensing issue. Binding hardware IPs (HW-IPs) to specific FPGA devices can efficiently resolve these problems. However, previous binding techniques are all based on encryption and hence have three main drawbacks: 1) encryption-based proposals in commercial are limited to protect the single large FPGA configuration, 2) many encryption-based proposals depend on a trusted third party to involve the licensing protocol, and 3) the encryption-based binding methods use costly mechanisms such as secure ROM or flash memory to store FPGA specific cryptographic keys, which is not only expensive but also vulnerable to side-channel attacks, and the management and transport of secret keys became a practical issue. In this work, we propose a PUF-FSM binding technique completely different from the traditional encryption-based methods to address these shortcomings.
FPGA的性能和容量的持续增长现在需要在硬件设计上投入大量资源,这导致了两类主要的安全问题:1)未经授权的使用和盗版攻击,包括克隆,逆向工程,篡改等;2)许可问题。将硬件ip (hw - ip)绑定到特定的FPGA设备可以有效地解决这些问题。然而,以前的绑定技术都是基于加密的,因此有三个主要缺点:1)加密方案在商业仅限于保护单一大型FPGA配置,2)许多加密方案依赖可信第三方涉及的许可协议,和3)加密绑定方法使用昂贵的机制如安全ROM或闪存存储FPGA具体的密钥,这是不仅昂贵而且容易边信道攻击,和密钥的管理和运输成为一个实际问题。在这项工作中,我们提出了一种完全不同于传统的基于加密的方法的PUF-FSM绑定技术来解决这些缺点。
{"title":"Binding Hardware IPs to Specific FPGA Device via Inter-twining the PUF Response with the FSM of Sequential Circuits","authors":"Jiliang Zhang, Yaping Lin, Yongqiang Lyu, R. Cheung, Wenjie Che, Qiang Zhou, Jinian Bian","doi":"10.1109/FCCM.2013.12","DOIUrl":"https://doi.org/10.1109/FCCM.2013.12","url":null,"abstract":"The continuous growth in both capability and capacity for FPGA now requires significant resources invested in the hardware design, which results in two classes of main security issues: 1) the unauthorized use and piracy attacks including cloning, reverse engineering, tampering etc. 2) the licensing issue. Binding hardware IPs (HW-IPs) to specific FPGA devices can efficiently resolve these problems. However, previous binding techniques are all based on encryption and hence have three main drawbacks: 1) encryption-based proposals in commercial are limited to protect the single large FPGA configuration, 2) many encryption-based proposals depend on a trusted third party to involve the licensing protocol, and 3) the encryption-based binding methods use costly mechanisms such as secure ROM or flash memory to store FPGA specific cryptographic keys, which is not only expensive but also vulnerable to side-channel attacks, and the management and transport of secret keys became a practical issue. In this work, we propose a PUF-FSM binding technique completely different from the traditional encryption-based methods to address these shortcomings.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129202515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
Parallel Generation of Gaussian Random Numbers Using the Table-Hadamard Transform 利用Table-Hadamard变换并行生成高斯随机数
David B. Thomas
Gaussian Random Number Generators (GRNGs) are an important component in parallel Monte-Carlo simulations using FPGAs, where tens or hundreds of high-quality Gaussian samples must be generated per cycle using very few logic resources. This paper describes the Table-Hadamard generator, which is a GRNG designed to generate multiple streams of random numbers in parallel. It uses discrete table distributions to generate pseudo-Gaussian base samples, then a parallel Hadamard transform to efficiently apply the central limit theorem. When generating 64 output samples the TableHadamard requires just 100 slices per generated sample, a quarter the resources of the next best technique, while providing higher statistical quality.
高斯随机数生成器(grng)是使用fpga进行并行蒙特卡罗模拟的重要组成部分,其中每个周期必须使用很少的逻辑资源生成数十或数百个高质量的高斯样本。本文介绍了Table-Hadamard生成器,它是一种用于并行生成多个随机数流的GRNG。首先利用离散表分布生成伪高斯基样本,然后利用并行Hadamard变换有效地应用中心极限定理。当生成64个输出样本时,TableHadamard每个生成的样本只需要100个切片,是次优技术资源的四分之一,同时提供更高的统计质量。
{"title":"Parallel Generation of Gaussian Random Numbers Using the Table-Hadamard Transform","authors":"David B. Thomas","doi":"10.1109/FCCM.2013.53","DOIUrl":"https://doi.org/10.1109/FCCM.2013.53","url":null,"abstract":"Gaussian Random Number Generators (GRNGs) are an important component in parallel Monte-Carlo simulations using FPGAs, where tens or hundreds of high-quality Gaussian samples must be generated per cycle using very few logic resources. This paper describes the Table-Hadamard generator, which is a GRNG designed to generate multiple streams of random numbers in parallel. It uses discrete table distributions to generate pseudo-Gaussian base samples, then a parallel Hadamard transform to efficiently apply the central limit theorem. When generating 64 output samples the TableHadamard requires just 100 slices per generated sample, a quarter the resources of the next best technique, while providing higher statistical quality.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125311885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
A Case for Heterogeneous Technology-Mapping: Soft Versus Hard Multiplexers 异构技术映射的案例:软与硬多路复用器
M. Purnaprajna, P. Ienne
Lookup table-based FPGAs offer flexibility but compromise on performance, as compared to custom CMOS implementations. This paper explores the idea of minimising this performance gap by using fixed, fine-grained, nonprogrammable logic structures in place of lookup tables (LUTs). Functions previously mapped onto LUTs can now be diverted to these structures, resulting in reduced LUT usage and higher operating speed. This paper presents a generic heterogeneous technology-mapping scheme for segregating LUTs and hard logic blocks. For the proof-of-concept, we choose to isolate multiplexers present in most general-purpose circuits. These multiplexers are mapped onto hard blocks of multiplexers that are present in existing commercial FPGA fabrics, but often unused. Since the hard multiplexers are already present, there is no additional performance or area penalty. Using this approach, an average reduction in LUT usage of 16% and an average speedup of 8% has been observed for the VTR benchmarks as compared to the LUTs-only implementation.
与定制CMOS实现相比,基于查找表的fpga提供了灵活性,但在性能上有所妥协。本文探讨了通过使用固定的、细粒度的、不可编程的逻辑结构来代替查找表(lut)来最小化这种性能差距的想法。以前映射到LUT上的函数现在可以转移到这些结构中,从而减少了LUT的使用并提高了操作速度。本文提出了一种通用的异构技术映射方案,用于分离lut和硬逻辑块。对于概念验证,我们选择隔离大多数通用电路中存在的多路复用器。这些多路复用器被映射到多路复用器的硬块上,这些多路复用器存在于现有的商用FPGA结构中,但通常未使用。由于硬复用器已经存在,所以没有额外的性能或面积损失。使用这种方法,与仅使用LUT的实现相比,在VTR基准测试中可以观察到LUT使用量平均减少16%,平均加速提高8%。
{"title":"A Case for Heterogeneous Technology-Mapping: Soft Versus Hard Multiplexers","authors":"M. Purnaprajna, P. Ienne","doi":"10.1109/FCCM.2013.19","DOIUrl":"https://doi.org/10.1109/FCCM.2013.19","url":null,"abstract":"Lookup table-based FPGAs offer flexibility but compromise on performance, as compared to custom CMOS implementations. This paper explores the idea of minimising this performance gap by using fixed, fine-grained, nonprogrammable logic structures in place of lookup tables (LUTs). Functions previously mapped onto LUTs can now be diverted to these structures, resulting in reduced LUT usage and higher operating speed. This paper presents a generic heterogeneous technology-mapping scheme for segregating LUTs and hard logic blocks. For the proof-of-concept, we choose to isolate multiplexers present in most general-purpose circuits. These multiplexers are mapped onto hard blocks of multiplexers that are present in existing commercial FPGA fabrics, but often unused. Since the hard multiplexers are already present, there is no additional performance or area penalty. Using this approach, an average reduction in LUT usage of 16% and an average speedup of 8% has been observed for the VTR benchmarks as compared to the LUTs-only implementation.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123480586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
ShrinkWrap: Compiler-Enabled Optimization and Customization of Soft Memory Interconnects ShrinkWrap:软内存互连的编译器启用优化和定制
Eric S. Chung, Michael Papamichael
Today's FPGAs lack dedicated on-chip memory interconnects, requiring users to (1) rely on inefficient, general-purpose solutions, or (2) tediously create an application-specific memory interconnect for each target platform. The CoRAM architecture, which offers a general-purpose abstraction for FPGA memory management, encodes high-level application information that can be exploited to generate customized soft memory interconnects. This paper describes the ShrinkWrap Compiler, which analyzes a CoRAM application for its connectivity and bandwidth requirements, enabling synthesis of highly-tuned area-efficient soft memory interconnects.
今天的fpga缺乏专用的片上存储器互连,要求用户(1)依赖低效的通用解决方案,或(2)为每个目标平台繁琐地创建特定于应用程序的存储器互连。CoRAM架构为FPGA内存管理提供了一种通用的抽象,对高级应用程序信息进行编码,这些信息可用于生成定制的软内存互连。本文描述了ShrinkWrap编译器,它分析了CoRAM应用程序的连接性和带宽需求,从而实现了高度调优的区域高效软内存互连的合成。
{"title":"ShrinkWrap: Compiler-Enabled Optimization and Customization of Soft Memory Interconnects","authors":"Eric S. Chung, Michael Papamichael","doi":"10.1109/FCCM.2013.56","DOIUrl":"https://doi.org/10.1109/FCCM.2013.56","url":null,"abstract":"Today's FPGAs lack dedicated on-chip memory interconnects, requiring users to (1) rely on inefficient, general-purpose solutions, or (2) tediously create an application-specific memory interconnect for each target platform. The CoRAM architecture, which offers a general-purpose abstraction for FPGA memory management, encodes high-level application information that can be exploited to generate customized soft memory interconnects. This paper describes the ShrinkWrap Compiler, which analyzes a CoRAM application for its connectivity and bandwidth requirements, enabling synthesis of highly-tuned area-efficient soft memory interconnects.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129978820","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
Elementary Function Implementation with Optimized Sub Range Polynomial Evaluation 优化子范围多项式求值的初等函数实现
M. Langhammer, B. Pasca
Efficient elementary function implementations require primitives optimized for modern FPGAs. Fixed-point function generators are one such type of primitives. When built around piecewise polynomial approximations they make use of memory blocks and embedded multipliers, mapping well to contemporary FPGAs. Another type of primitive which can exploit the power series expansions of some elementary functions is floating-point polynomial evaluation. The high costs traditionally associated with floating-point arithmetic made this primitive unattractive for elementary function implementation on FPGAs. In this work we present a novel and efficient way of implementing floating-point polynomial evaluators on a restricted input range. We show on the atan(x) function in double precision that this very different technique reduces memory block count by up to 50% while only slightly increasing DSP count compared to the best implementation built around polynomial approximation fixed-point primitives.
高效的初等函数实现需要针对现代fpga优化的原语。定点函数生成器就是这样一种原语。当围绕分段多项式近似构建时,它们利用内存块和嵌入式乘数器,很好地映射到当代fpga。另一类可以利用某些初等函数幂级数展开的原语是浮点多项式求值。传统上与浮点运算相关的高成本使得这种原语对于fpga上的基本函数实现没有吸引力。在这项工作中,我们提出了一种在有限输入范围内实现浮点多项式求值器的新颖而有效的方法。我们在双精度的atan(x)函数中显示,与围绕多项式近似定点原语构建的最佳实现相比,这种非常不同的技术可减少高达50%的内存块计数,同时仅略微增加DSP计数。
{"title":"Elementary Function Implementation with Optimized Sub Range Polynomial Evaluation","authors":"M. Langhammer, B. Pasca","doi":"10.1109/FCCM.2013.30","DOIUrl":"https://doi.org/10.1109/FCCM.2013.30","url":null,"abstract":"Efficient elementary function implementations require primitives optimized for modern FPGAs. Fixed-point function generators are one such type of primitives. When built around piecewise polynomial approximations they make use of memory blocks and embedded multipliers, mapping well to contemporary FPGAs. Another type of primitive which can exploit the power series expansions of some elementary functions is floating-point polynomial evaluation. The high costs traditionally associated with floating-point arithmetic made this primitive unattractive for elementary function implementation on FPGAs. In this work we present a novel and efficient way of implementing floating-point polynomial evaluators on a restricted input range. We show on the atan(x) function in double precision that this very different technique reduces memory block count by up to 50% while only slightly increasing DSP count compared to the best implementation built around polynomial approximation fixed-point primitives.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131241193","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
A Soft Coarse-Grained Reconfigurable Array Based High-level Synthesis Methodology: Promoting Design Productivity and Exploring Extreme FPGA Frequency 基于软粗粒度可重构阵列的高级综合方法:提高设计效率和探索极限FPGA频率
Cheng Liu, C. Y. Lin, Hayden Kwok-Hay So
Compared to the use of a typical software development flow, the productivity of developing FPGA-based compute applications remains much lower. Although the use of high-level synthesis (HLS) tools may partly alleviate this shortcoming, the lengthy low-level FPGA implementation process remains a major obstacle to high productivity computing, limiting the number of compile-debug-edit cycles per day. Furthermore, high-level application developers often lack the intimate hardware engineering experience that is needed to achieve high performance on FPGAs, therefore undermining their usefulness as accelerators. To address the productivity and performance problems, a HLS methodology that utilizes soft coarse-grained reconfigurable arrays (SCGRAs) as an intermediate compilation step is presented. Instead of compiling high-level applications directly to circuits, the compilation process is reduced to an operation scheduling task targeting the SCGRA.
与使用典型的软件开发流程相比,开发基于fpga的计算应用程序的生产率仍然要低得多。尽管使用高级合成(HLS)工具可以部分缓解这一缺点,但冗长的低级FPGA实现过程仍然是高生产率计算的主要障碍,限制了每天编译-调试-编辑周期的数量。此外,高级应用程序开发人员通常缺乏在fpga上实现高性能所需的硬件工程经验,因此削弱了它们作为加速器的有用性。为了解决生产力和性能问题,提出了一种利用软粗粒度可重构数组(SCGRAs)作为中间编译步骤的HLS方法。编译过程不是直接将高级应用程序编译到电路中,而是简化为针对SCGRA的操作调度任务。
{"title":"A Soft Coarse-Grained Reconfigurable Array Based High-level Synthesis Methodology: Promoting Design Productivity and Exploring Extreme FPGA Frequency","authors":"Cheng Liu, C. Y. Lin, Hayden Kwok-Hay So","doi":"10.1109/FCCM.2013.21","DOIUrl":"https://doi.org/10.1109/FCCM.2013.21","url":null,"abstract":"Compared to the use of a typical software development flow, the productivity of developing FPGA-based compute applications remains much lower. Although the use of high-level synthesis (HLS) tools may partly alleviate this shortcoming, the lengthy low-level FPGA implementation process remains a major obstacle to high productivity computing, limiting the number of compile-debug-edit cycles per day. Furthermore, high-level application developers often lack the intimate hardware engineering experience that is needed to achieve high performance on FPGAs, therefore undermining their usefulness as accelerators. To address the productivity and performance problems, a HLS methodology that utilizes soft coarse-grained reconfigurable arrays (SCGRAs) as an intermediate compilation step is presented. Instead of compiling high-level applications directly to circuits, the compilation process is reduced to an operation scheduling task targeting the SCGRA.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134300630","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 14
Boosting Memory Performance of Many-Core FPGA Device through Dynamic Precedence Graph 利用动态优先图提高多核FPGA器件的内存性能
Yunru Bai, Abigail Fuentes-Rivera, Mike Riera, Mohammed Alawad, Mingjie Lin
Emerging FPGA device, integrated with abundant RAM blocks and high-performance processor cores, offers an unprecedented opportunity to effectively implement single-chip distributed logic-memory (DLM) architectures [1]. Being “memory-centric”, the DLM architecture can significantly improve the overall performance and energy efficiency of many memory-intensive embedded applications, especially those that exhibit irregular array data access patterns at algorithmic level. However, implementing DLM architecture poses unique challenges to an FPGA designer in terms of 1) organizing and partitioning diverse on-chip memory resources, and 2) orchestrating effective data transmission between on-chip and off-chip memory. In this paper, we offer our solutions to both of these challenges. Specifically, 1) we propose a stochastic memory partitioning scheme based on the well-known simulated annealing algorithm. It obtains memory partitioning solutions that promote parallelized memory accesses by exploring large solution space; 2) we augment the proposed DLM architecture with a reconfigure hardware graph that can dynamically compute precedence relationship between memory partitions, thus effectively exploiting algorithmic level memory parallelism on a per-application basis. We evaluate the effectiveness of our approach (A3) against two other DLM architecture synthesizing methods: an algorithmic-centric reconfigurable computing architectures with a single monolithic memory (A1) and the heterogeneous distributed architectures synthesized according to [1] (A2). To make our comparison fair, in all three architectures, the data path remains the same while local memory architecture differs. For each of ten benchmark applications from SPEC2006 and MiBench [2], we break down the performance benefit of using A3 into two parts: the portion due to stochastic local memory partitioning and the portion due to the dynamic graph-based memory arbitration. All experiments have been conducted with a Virtex-5 (XCV5LX155T-2) FPGA. On average, our experimental results show that our proposed A3 architecture outperforms A2 and A1 by 34% and 250%, respectively. Within the performance improvement of A3 over A2, more than 70% improvement comes from the hardware graph-based memory scheduling.
新兴的FPGA器件集成了丰富的RAM块和高性能处理器内核,为有效实现单芯片分布式逻辑存储器(DLM)架构提供了前所未有的机会[1]。DLM架构“以内存为中心”,可以显著提高许多内存密集型嵌入式应用程序的整体性能和能效,特别是那些在算法级别上表现出不规则数组数据访问模式的应用程序。然而,实现DLM架构对FPGA设计者提出了独特的挑战,包括:1)组织和划分不同的片上存储器资源;2)在片上和片外存储器之间编排有效的数据传输。在本文中,我们为这两个挑战提供了我们的解决方案。具体来说,1)我们提出了一种基于模拟退火算法的随机内存分配方案。通过探索大的解空间,得到促进并行化内存访问的内存分区方案;2)我们用一个可以动态计算内存分区之间优先关系的重新配置硬件图来增强所提出的DLM架构,从而有效地利用每个应用程序的算法级内存并行性。我们针对另外两种DLM架构合成方法评估了我们的方法(A3)的有效性:一种以算法为中心的具有单个单片内存的可重构计算架构(A1)和根据[1](A2)合成的异构分布式架构。为了使我们的比较公平,在所有三种体系结构中,数据路径保持相同,而本地内存体系结构不同。对于来自SPEC2006和MiBench[2]的10个基准测试应用程序中的每一个,我们将使用A3的性能优势分为两部分:由于随机本地内存分区的部分和由于基于动态图的内存仲裁的部分。所有实验都是在Virtex-5 (XCV5LX155T-2) FPGA上进行的。平均而言,我们的实验结果表明,我们提出的A3架构比A2和A1分别高出34%和250%。在A3相对于A2的性能改进中,超过70%的改进来自基于硬件图的内存调度。
{"title":"Boosting Memory Performance of Many-Core FPGA Device through Dynamic Precedence Graph","authors":"Yunru Bai, Abigail Fuentes-Rivera, Mike Riera, Mohammed Alawad, Mingjie Lin","doi":"10.1109/FCCM.2013.39","DOIUrl":"https://doi.org/10.1109/FCCM.2013.39","url":null,"abstract":"Emerging FPGA device, integrated with abundant RAM blocks and high-performance processor cores, offers an unprecedented opportunity to effectively implement single-chip distributed logic-memory (DLM) architectures [1]. Being “memory-centric”, the DLM architecture can significantly improve the overall performance and energy efficiency of many memory-intensive embedded applications, especially those that exhibit irregular array data access patterns at algorithmic level. However, implementing DLM architecture poses unique challenges to an FPGA designer in terms of 1) organizing and partitioning diverse on-chip memory resources, and 2) orchestrating effective data transmission between on-chip and off-chip memory. In this paper, we offer our solutions to both of these challenges. Specifically, 1) we propose a stochastic memory partitioning scheme based on the well-known simulated annealing algorithm. It obtains memory partitioning solutions that promote parallelized memory accesses by exploring large solution space; 2) we augment the proposed DLM architecture with a reconfigure hardware graph that can dynamically compute precedence relationship between memory partitions, thus effectively exploiting algorithmic level memory parallelism on a per-application basis. We evaluate the effectiveness of our approach (A3) against two other DLM architecture synthesizing methods: an algorithmic-centric reconfigurable computing architectures with a single monolithic memory (A1) and the heterogeneous distributed architectures synthesized according to [1] (A2). To make our comparison fair, in all three architectures, the data path remains the same while local memory architecture differs. For each of ten benchmark applications from SPEC2006 and MiBench [2], we break down the performance benefit of using A3 into two parts: the portion due to stochastic local memory partitioning and the portion due to the dynamic graph-based memory arbitration. All experiments have been conducted with a Virtex-5 (XCV5LX155T-2) FPGA. On average, our experimental results show that our proposed A3 architecture outperforms A2 and A1 by 34% and 250%, respectively. Within the performance improvement of A3 over A2, more than 70% improvement comes from the hardware graph-based memory scheduling.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115014902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Accelerating the Computation of Induced Dipoles for Molecular Mechanics with Dataflow Engines 用数据流引擎加速分子力学中感应偶极子的计算
F. Pratas, D. Oriato, O. Pell, R. Mata, L. Sousa
In Molecular Mechanics simulations, the treatment of electrostatics is the most computational intensive task. Modern force fields, such as the AMOEBA, which include explicit polarization effects, are particularly computationally demanding. We propose a static dataflow architecture for accelerating polarizable force fields. Results, obtained with Maxeler's MaxCompiler, show a speed-up factor of about 14x on a Maxeler 1U MaxNode, when compared to a 12-core CPU node while using half of the dataflow engine capacity. Projections for a full chip implementation indicate that speed-up results of up to 29x per node can be reached. Moreover, our implementation on the Maxeler system shows improvements between 2.5x and 4x compared to NVIDIA Fermibased GPUs. The current work shows the potential of dataflow engines in accelerating this field of applications.
在分子力学模拟中,静电的处理是计算量最大的任务。现代力场,如阿米巴,包含明显的极化效应,对计算的要求特别高。我们提出了一种用于加速极化力场的静态数据流架构。使用Maxeler的MaxCompiler获得的结果显示,与使用一半的数据流引擎容量的12核CPU节点相比,Maxeler 1U MaxNode的加速因子约为14倍。对全芯片实现的预测表明,每个节点的加速结果可达到29倍。此外,我们在Maxeler系统上的实现显示,与基于NVIDIA费米的gpu相比,改进了2.5到4倍。目前的工作显示了数据流引擎在加速这一领域应用方面的潜力。
{"title":"Accelerating the Computation of Induced Dipoles for Molecular Mechanics with Dataflow Engines","authors":"F. Pratas, D. Oriato, O. Pell, R. Mata, L. Sousa","doi":"10.1109/FCCM.2013.34","DOIUrl":"https://doi.org/10.1109/FCCM.2013.34","url":null,"abstract":"In Molecular Mechanics simulations, the treatment of electrostatics is the most computational intensive task. Modern force fields, such as the AMOEBA, which include explicit polarization effects, are particularly computationally demanding. We propose a static dataflow architecture for accelerating polarizable force fields. Results, obtained with Maxeler's MaxCompiler, show a speed-up factor of about 14x on a Maxeler 1U MaxNode, when compared to a 12-core CPU node while using half of the dataflow engine capacity. Projections for a full chip implementation indicate that speed-up results of up to 29x per node can be reached. Moreover, our implementation on the Maxeler system shows improvements between 2.5x and 4x compared to NVIDIA Fermibased GPUs. The current work shows the potential of dataflow engines in accelerating this field of applications.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127803530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
Atacama: An Open FPGA-Based Platform for Mixed-Criticality Communication in Multi-segmented Ethernet Networks Atacama:一个基于fpga的多分段以太网混合临界通信开放平台
G. Carvajal, M. Figueroa, R. Trausmuth, S. Fischmeister
Ethernet is widely recognized as an attractive networking technology for modern distributed real-time systems. However, standard Ethernet components require specific modifications and hardware support to provide strict latency guarantees necessary for safety-critical applications. Although this is a well-stated fact, the design of hardware components for real-time communication remains mostly unexplored. This becomes evident from the few solutions reporting prototypes and experimental validation, which hinders the consolidation of Ethernet in real-world distributed applications. This paper presents Atacama, the first open-source framework based on reconfigurable hardware for mixed-criticality communication in multi-segmented Ethernet networks. Atacama uses specialized modules for time-triggered communication of real-time data, which seamlessly integrate with a standard infrastructure using regular best-effort traffic. Atacama enables low and highly predictable communication latency on multi-segmented 1Gbps networks, easy optimization of devices for specific application scenarios, and rapid prototyping of new protocol characteristics. Researchers can use the open-source design to verify our results and build upon the framework, which aims to accelerate the development, validation, and adoption of Ethernet-based solutions in real-time applications.
以太网被广泛认为是现代分布式实时系统的一种有吸引力的网络技术。但是,标准以太网组件需要特定的修改和硬件支持,才能为安全关键型应用程序提供必要的严格延迟保证。尽管这是一个明确的事实,但用于实时通信的硬件组件的设计仍未得到充分的探索。从报告原型和实验验证的少数解决方案中可以明显看出这一点,这阻碍了以太网在实际分布式应用程序中的整合。本文提出了Atacama,这是第一个基于可重构硬件的开源框架,用于多分段以太网中的混合临界通信。Atacama使用专门的模块来进行时间触发的实时数据通信,这些模块与使用常规最佳流量的标准基础设施无缝集成。Atacama在多分段1Gbps网络上实现低且高度可预测的通信延迟,针对特定应用场景轻松优化设备,并快速构建新协议特性的原型。研究人员可以使用开源设计来验证我们的结果并在框架上构建,该框架旨在加速基于以太网的解决方案在实时应用中的开发、验证和采用。
{"title":"Atacama: An Open FPGA-Based Platform for Mixed-Criticality Communication in Multi-segmented Ethernet Networks","authors":"G. Carvajal, M. Figueroa, R. Trausmuth, S. Fischmeister","doi":"10.1109/FCCM.2013.54","DOIUrl":"https://doi.org/10.1109/FCCM.2013.54","url":null,"abstract":"Ethernet is widely recognized as an attractive networking technology for modern distributed real-time systems. However, standard Ethernet components require specific modifications and hardware support to provide strict latency guarantees necessary for safety-critical applications. Although this is a well-stated fact, the design of hardware components for real-time communication remains mostly unexplored. This becomes evident from the few solutions reporting prototypes and experimental validation, which hinders the consolidation of Ethernet in real-world distributed applications. This paper presents Atacama, the first open-source framework based on reconfigurable hardware for mixed-criticality communication in multi-segmented Ethernet networks. Atacama uses specialized modules for time-triggered communication of real-time data, which seamlessly integrate with a standard infrastructure using regular best-effort traffic. Atacama enables low and highly predictable communication latency on multi-segmented 1Gbps networks, easy optimization of devices for specific application scenarios, and rapid prototyping of new protocol characteristics. Researchers can use the open-source design to verify our results and build upon the framework, which aims to accelerate the development, validation, and adoption of Ethernet-based solutions in real-time applications.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121560578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 16
PRML: A Modeling Language for Rapid Design Exploration of Partially Reconfigurable FPGAs 部分可重构fpga快速设计探索的建模语言
Rohit Kumar, A. Gordon-Ross
Leveraging partial reconfiguration (PR) can improve system flexibility, cost, and performance/power/area tradeoffs over non-PR functionally-equivalent systems, however, realizing these benefits is challenging, time-consuming, and PR must be considered early during application design to reduce design exploration time and improve system quality. To facilitate realizing these benefits, we present an application design framework and an abstract modeling language for PR (PRML). By applying extensive PRML modeling guidelines to a complex arithmetic core, we show PRML's potential for efficient PR capability analysis, enabling designers to determine Pareto optimal systems during application formulation based on designer-specified area and performance metrics.
与非PR功能等效的系统相比,利用部分重新配置(PR)可以提高系统灵活性、成本和性能/功率/面积的权衡,然而,实现这些好处具有挑战性和耗时,并且必须在应用程序设计期间尽早考虑PR,以减少设计探索时间并提高系统质量。为了方便实现这些好处,我们提出了一个应用程序设计框架和一个用于PR的抽象建模语言(PRML)。通过将广泛的PRML建模指南应用于复杂的算法核心,我们展示了PRML在有效PR能力分析方面的潜力,使设计人员能够在基于设计人员指定的区域和性能指标的应用程序制定过程中确定帕累托最优系统。
{"title":"PRML: A Modeling Language for Rapid Design Exploration of Partially Reconfigurable FPGAs","authors":"Rohit Kumar, A. Gordon-Ross","doi":"10.1109/FCCM.2013.24","DOIUrl":"https://doi.org/10.1109/FCCM.2013.24","url":null,"abstract":"Leveraging partial reconfiguration (PR) can improve system flexibility, cost, and performance/power/area tradeoffs over non-PR functionally-equivalent systems, however, realizing these benefits is challenging, time-consuming, and PR must be considered early during application design to reduce design exploration time and improve system quality. To facilitate realizing these benefits, we present an application design framework and an abstract modeling language for PR (PRML). By applying extensive PRML modeling guidelines to a complex arithmetic core, we show PRML's potential for efficient PR capability analysis, enabling designers to determine Pareto optimal systems during application formulation based on designer-specified area and performance metrics.","PeriodicalId":269887,"journal":{"name":"2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117081157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 4
期刊
2013 IEEE 21st Annual International Symposium on Field-Programmable Custom Computing Machines
全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1